TAR Applicability to Asian and Multi-Language Datasets

As the number of cross-border legal matters with data originating in Asia rises, legal teams are increasingly looking to technology-assisted review (TAR), also known as predictive coding, to automate parts of their early investigations, costly and often error-prone document reviews for production, and other critical case activities.

When ESI content involves Chinese, Japanese and Korean (CJK) languages, the complexity of using predictive coding compounds. The challenge is not understanding the language itself; most technologies do not attempt to process language like humans. The core challenges are technological.  Many encoding and file formats are still poorly processed in traditional U.S. toolsets, proprietary software still abounds, and many TAR solutions still “translate” before they index and categorize.  Beyond the core technical challenges are linguistic and cultural complexities, but those are topics for another day.

In previous posts, my colleagues and I discussed collecting, processing, and searching ESI in Asian and multi-language datasets. But what about TAR? Do TAR tools work in multi-language cases, and in particular with CJK data?

Here are a few things legal teams should know:

  1. CJK data needs to be processed with software designed to extract content accurately and completely, by people with expertise and experience processing CJK data.

In TAR discussions, people often use the expression “garbage in, garbage out.” That expression almost universally refers to the consistency and correctness of the human coding used to train a TAR system. It applies even more fundamentally to the data processing that happens before TAR starts. TAR effectiveness is unavoidably constrained by the accuracy and completeness of the processed data. Phrased differently, even the best subject-matter expert available cannot effectively train a TAR system that relies on incomplete or inaccurate data.

Processing problems take different forms. Ineffective CJK data processing may generate garbled text, drop metadata, or fail to recognize a file at all and report misleading errors. Case teams can reduce the risk of damaging their TAR results by taking a couple of easy steps. First, confirm that the processing tool supports the file formats and encodings being collected. The technology team on a case should be able to confirm this easily, because software providers typically publish their supported formats. Second, work with a technology team that knows what to expect. A team new to a specific file format won’t know what metadata should be extractable, how that data extracts, or whether the system is missing key information. Experience and expertise are difficult to replace when a case team tackles a new challenge like multi-language data handling.
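As a rough illustration of what "garbled text" looks like in practice, the sketch below decodes the same bytes under a wrong and a right encoding assumption and measures how much of the result is the Unicode replacement character. The helper names `decode_strict` and `replacement_ratio` are hypothetical and written for this example only; they are not part of any processing tool, and a real workflow would rely on the tool's own encoding detection and exception reports.

```python
# Hypothetical helpers for a pre-TAR sanity check; not taken from any processing tool.
REPLACEMENT = "\ufffd"  # U+FFFD, inserted by lenient decoders for undecodable bytes


def decode_strict(raw, candidates):
    """Return (encoding, text) for the first candidate that decodes without error."""
    for enc in candidates:
        try:
            return enc, raw.decode(enc)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
    return None, None


def replacement_ratio(text):
    """Fraction of characters that are U+FFFD; 0.0 for empty text."""
    return text.count(REPLACEMENT) / len(text) if text else 0.0


# Simulate a Japanese file saved in Shift_JIS, a legacy Japanese encoding.
raw = "日本語の契約書".encode("shift_jis")

# Wrong assumption: treat the bytes as UTF-8 and paper over the errors.
garbled = raw.decode("utf-8", errors="replace")

# Better: try candidate encodings strictly and keep the first clean decode.
encoding, text = decode_strict(raw, ["utf-8", "shift_jis"])

print("garbled ratio (UTF-8 assumed):", replacement_ratio(garbled))
print("decoded as", encoding, "with garbled ratio", replacement_ratio(text))
```

A clean strict decode does not prove the encoding guess is right, because the same bytes can be valid in more than one legacy encoding. A spot check by someone who reads the language is still needed.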

Note: A previous post addresses topics leading up to the point of processing and related to indexing for search.

  2. Available TAR solutions handle multi-language datasets differently.

Not all information retrieval and categorization models are created equal.  How the underlying algorithm in any given system gathers information about documents and categorizes them can significantly impact the ultimate efficacy of your TAR efforts.  Some systems, for example, assign “weights” to concepts within each document and across overall document populations.  Non-English words may be under-weighted (less influential in categorization) in a model if the overall prevalence of documents containing the language is low.  It is important to understand—at least conceptually—how the system you choose identifies concepts and categorizes your data.

Early and direct discussions with your technology provider will help you avoid poor results (and the increased costs that accompany them) in the long run.
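One way to prepare for that conversation is to know how prevalent each language is in the collection, since the concern above arises when a language makes up a small share of the documents. The following is a minimal sketch that assumes documents are already extracted as Unicode strings. The function names and the script-based heuristic are assumptions made for illustration; it is deliberately crude (for example, Japanese text written only in kanji would be labeled Chinese), and a real matter would use proper language identification.

```python
from collections import Counter

# Unicode ranges (inclusive) used for a crude script check.
HANGUL = (0xAC00, 0xD7A3)  # Hangul syllables
KANA = (0x3040, 0x30FF)    # Hiragana and Katakana
HAN = (0x4E00, 0x9FFF)     # CJK Unified Ideographs


def _has(text, block):
    """True if any character of text falls inside the given code point range."""
    lo, hi = block
    return any(lo <= ord(ch) <= hi for ch in text)


def guess_language(text):
    """Label a document by the scripts it contains (heuristic, not real language ID)."""
    if _has(text, HANGUL):
        return "korean"
    if _has(text, KANA):
        return "japanese"
    if _has(text, HAN):
        return "chinese"
    return "other"


def language_shares(docs):
    """Return {label: fraction of documents} for a list of document strings."""
    counts = Counter(guess_language(d) for d in docs)
    return {label: n / len(docs) for label, n in counts.items()}


docs = [
    "Please review the attached contract.",
    "Meeting moved to Friday.",
    "契約書を送ります",
    "请审阅合同",
    "계약서를 검토하세요",
]
shares = language_shares(docs)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

A profile like this does not say how a given TAR system will weight each language. It only tells the case team which languages are rare enough in the collection that the question is worth raising with the provider.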

  3. TAR offers the same benefits in cases with single- and multi-language datasets, and CAL can be particularly advantageous.

Most Western TAR tools “learn” from sample training data provided by knowledgeable lawyers or subject-matter experts. The tool then uses morphological analysis and statistical algorithms to find similar documents in the remaining document collection. Those general truths apply to multi-language datasets as well.

Continuous Active Learning (CAL, also known as TAR 2.0) models support the way most legal teams work today.  Teams can use known information—acquired from their client or otherwise—to find key documents early and advance the system’s training organically.  These systems are typically malleable enough to support parallel (or offset) training on issues or in specific languages.

The flexibility to work in parallel streams while continually improving TAR results is particularly valuable in cross-border cases. The data relating to different legal issues may be geographically discrete, and relevant experts may need to work in parallel. That’s not a challenge in most CAL systems.  A case team may have subject-matter experts with different language fluencies spanning time zones. Also not generally a problem. The flexible nature of the training model is a boon to teams with differences in geographic location, language fluency, or subject-matter knowledge.
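To make the CAL workflow concrete, here is a toy sketch of the review loop: train on what has been coded so far, rank the unreviewed documents, review the top of the ranking, and repeat. Everything in it is an assumption made for illustration, including the names `cal_review` and `reviewer`, the character-bigram features, the log-odds scoring, and the stopping rule. It is not how any particular TAR product works.

```python
import math
from collections import Counter


def bigrams(text):
    """Character bigrams: a simple feature that needs no word segmentation."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def train(labeled):
    """Per-bigram log-odds weights from (text, is_relevant) pairs, add-one smoothed."""
    pos, neg = Counter(), Counter()
    for text, is_relevant in labeled:
        (pos if is_relevant else neg).update(bigrams(text))
    return {g: math.log((pos[g] + 1) / (neg[g] + 1)) for g in set(pos) | set(neg)}


def score(weights, text):
    """Sum of the weights of the bigrams present in the text."""
    return sum(weights.get(g, 0.0) for g in bigrams(text))


def cal_review(pool, seed, reviewer, batch_size=2):
    """Run the loop until the pool is empty or a batch contains nothing relevant.

    seed: list of (text, bool) pairs already coded.
    reviewer: callable text -> bool, standing in for a human reviewer.
    Returns (labeled, remaining).
    """
    labeled, remaining = list(seed), list(pool)
    while remaining:
        weights = train(labeled)  # retrain on everything coded so far
        remaining.sort(key=lambda t: score(weights, t), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        decisions = [(t, reviewer(t)) for t in batch]
        labeled.extend(decisions)
        if not any(rel for _, rel in decisions):
            break  # simplistic stopping rule, chosen for the example
    return labeled, remaining


def reviewer(text):
    """Hypothetical reviewer: relevant if the document is about a contract."""
    return "契約" in text or "contract" in text.lower()


seed = [("契約書の草案", True), ("昼食の予定", False)]
pool = [
    "契約の更新について",
    "Contract renewal terms",
    "週末の天気",
    "Lunch menu for Friday",
    "契約違反の通知",
    "会議室の予約",
]
labeled, remaining = cal_review(pool, seed, reviewer)
for text, rel in labeled[len(seed):]:
    print("relevant" if rel else "not relevant", "|", text)
```

In this toy run the seed contains only Japanese examples, so the English contract document gets no boost from the initial model. That is a small-scale picture of why seeding and training in each language, in parallel where needed, can matter.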

In sum:

TAR holds great promise. It is successfully applied today in many contexts, and its applicability knows no geographic boundary. TAR systems mitigate core challenges like human inconsistency, throughput bottlenecks, and the astronomical costs commonly associated with manual review. Implemented properly, TAR allows legal teams to focus on litigation strategy, provides early access to key documents, and helps teams acquire information that might otherwise have been unattainable or obscured. These benefits apply equally in cases involving CJK or other languages, provided your technology team understands the challenges and has the expertise to address them.