CJK Indexing and Processing and Search, Oh My!

A previous post on Chinese, Japanese and Korean (CJK) related eDiscovery outlined the language and technical challenges presented by Asian language content. Once the ESI is collected, several processing steps prepare the document set for review. For many people, including those directly involved in eDiscovery review projects, even the high-level details of processing and indexing are a mystery. In this post we will look at a few of the key aspects of ESI processing, some of which come up often with CJK data. By no means is this summary a comprehensive description; ESI tech experts spend years working with various tools on a wide variety of data in order to build proficiency. If you work with a full-service eDiscovery provider on cross-border cases, you can rely on that partner to handle CJK processing. But with the basics outlined here, you’ll be grounded in the essentials of ESI processing, which will help you communicate more clearly, ask informed questions and have a better overall experience.

Decryption

Encryption levels vary widely throughout the world, and encryption is common in the Asia-Pacific region. In this step, tools identify encrypted files and then apply the appropriate key or decryption program to decode the content. Password-protected documents pose a similar challenge. Often the client can provide the encryption keys or passwords for protected files, but there are alternative ‘ethical hacking’ methods for situations where keys or passwords are not available.
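The identification half of this step can be illustrated for one container type. The sketch below is our own illustration, not a description of any vendor tool: it uses Python's standard zipfile module and checks bit 0 of each entry's general-purpose flag, which the ZIP format uses to mark an encrypted entry. The helper names is_encrypted and encrypted_members are hypothetical, and other file types (office documents, PDFs, mail stores) would need their own checks.

```python
import io
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general-purpose flag marks an encrypted entry


def is_encrypted(info: zipfile.ZipInfo) -> bool:
    """True if the entry's header says its data is encrypted (hypothetical helper)."""
    return bool(info.flag_bits & ENCRYPTED_FLAG)


def encrypted_members(archive: zipfile.ZipFile) -> list[str]:
    """Names of all entries in the archive that are flagged as encrypted."""
    return [info.filename for info in archive.infolist() if is_encrypted(info)]


# Demo archive built in memory. It holds one unencrypted entry,
# so the list of encrypted members comes back empty.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("memo.txt", "plain content")

with zipfile.ZipFile(buffer) as zf:
    print(encrypted_members(zf))
```

Entries that are flagged this way would then be routed to whatever decryption or password workflow the project uses.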

Extraction of Container Files

Mailbox folders, zipped folders, archives or other file types that contain multiple individual files or documents are opened, and the individual files are made available for indexing and search.
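As a rough illustration of container extraction, here is a minimal sketch that flattens a ZIP archive and descends into any ZIP nested inside it. It assumes ZIP containers only and works in memory; the helper names extract_all and make_zip and the sample file names are hypothetical. Mailbox formats and other archive types need their own readers.

```python
import io
import zipfile


def extract_all(data: bytes, prefix: str = "") -> dict[str, bytes]:
    """Flatten a ZIP archive into {path: content}, recursing into nested ZIPs."""
    files: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # The member is itself a container: open it and keep going.
                files.update(extract_all(payload, path + "/"))
            else:
                files[path] = payload
    return files


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build a small ZIP in memory so the demo needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


inner = make_zip({"attachment.txt": "契約書".encode("utf-8")})
outer = make_zip({
    "mail/message.txt": b"Please see the attached contract.",
    "mail/archive.zip": inner,
})

documents = extract_all(outer)
for path in sorted(documents):
    print(path)
```

Each individual file ends up with its own path, which is what later steps such as hashing and indexing operate on.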

Identify Encoding & Convert

Much legacy content from China, Japan and Korea was created in national character encodings rather than Unicode. Each nation has its own distinct code sets, and some nations use several: Japanese content alone may appear in Shift_JIS, EUC-JP or ISO-2022-JP. Content that is not Unicode-compliant is typically converted to Unicode format so that every document can be indexed and searched in the same way. This is common with legacy e-mail formats and other legacy file types.
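To make the conversion step concrete, here is a minimal sketch using the codecs that ship with Python. It assumes the source encoding of each file is already known (for example from an e-mail header); detecting an undeclared encoding is a separate and harder problem that this sketch does not attempt. The helper name to_utf8 and the sample strings are our own.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes with a known encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding, errors="strict").encode("utf-8")


# Sample text per legacy encoding. The same Japanese sentence is stored
# three different ways to show why the encoding must be identified first.
SAMPLES = {
    "shift_jis": "日本語のメール",
    "euc_jp": "日本語のメール",
    "iso2022_jp": "日本語のメール",
    "euc_kr": "한국어",
    "gb2312": "简体中文",
    "big5": "繁體中文",
}

for encoding, text in SAMPLES.items():
    legacy_bytes = text.encode(encoding)          # simulate a legacy file
    converted = to_utf8(legacy_bytes, encoding)   # normalize to Unicode (UTF-8)
    print(encoding, len(legacy_bytes), "bytes ->", len(converted), "bytes as UTF-8")
```

Decoding with errors="strict" is a deliberate choice here: a wrong encoding guess is more likely to raise an error than to pass garbled text silently into the index.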

Deduplication

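The content and metadata of each document are run through a hashing algorithm, producing a digital fingerprint that is, for practical purposes, unique to that document. Documents with identical fingerprints are exact duplicates and can be deduplicated. Near-duplicates, which differ slightly and therefore produce different fingerprints, are identified with separate similarity-comparison techniques.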
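A minimal sketch of hash-based deduplication follows. It makes two assumptions the post does not specify: SHA-256 as the hashing algorithm, and hashing the content bytes only (real tools also factor in selected metadata). The document IDs and helper names are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Hex digest used as the document fingerprint (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def group_by_fingerprint(documents: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each fingerprint to the IDs of the documents that share it."""
    groups: dict[str, list[str]] = {}
    for doc_id, content in documents.items():
        groups.setdefault(fingerprint(content), []).append(doc_id)
    return groups


DOCS = {
    "mail_001": "契約書を送付します。".encode("utf-8"),
    "mail_002": "契約書を送付します。".encode("utf-8"),  # exact copy of mail_001
    "mail_003": "契約書を送付します".encode("utf-8"),    # same text minus the final 。
}

groups = group_by_fingerprint(DOCS)
unique_ids = [ids[0] for ids in groups.values()]            # keep the first copy
duplicate_ids = [i for ids in groups.values() for i in ids[1:]]

print("kept:", unique_ids)
print("suppressed as exact duplicates:", duplicate_ids)
```

Note that mail_003 differs by a single character and so receives a different fingerprint; catching that kind of near-duplicate would need a similarity measure rather than a hash comparison.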

Indexing to Make Content Searchable

Indexing is a process that inventories the total content of a file and builds a searchable index: a digital table that works, conceptually, much like the index in a book. Instead of scanning every document for each query, the search engine looks the term up in the index and goes straight to the documents that contain it.

Before searches can occur, document content must be tokenized into searchable elements. As discussed in a previous post, the spaces between words in English allow quick tokenization and indexing of words. In CJK languages, however, words are not separated by spaces.  Various approaches are used to index CJK content in order to make it searchable.
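The difference is easy to see in a small sketch. The code below builds a toy index by splitting on whitespace, which is an assumption made for illustration rather than how any particular product works. The English sentence breaks into words; the Japanese sentence, having no spaces, is stored as one long token, so a search for a word inside it finds nothing.

```python
from collections import defaultdict


def whitespace_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Toy inverted index: token -> IDs of documents containing that token."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():      # split on whitespace only
            index[token].add(doc_id)
    return index


DOCS = {
    "en": "the contract was signed",
    "ja": "契約書に署名した",            # roughly: "signed the contract"
}

index = whitespace_index(DOCS)

print(sorted(index))                   # the tokens that ended up in the index
print(index.get("contract", set()))    # found: 'contract' is its own token
print(index.get("署名", set()))         # not found: the sentence is a single token
```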

Early efforts at CJK tokenization often indexed every character in the dataset. As you might imagine, a search conducted against a single-character index will result in an overabundance of search hits that are less responsive to the search. This puts a burden on reviewers to examine far more content in order to find the truly responsive documents.
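Here is a small sketch of why a single-character index over-matches. It assumes the simplest possible design, an index of individual characters with no position information, which is our simplification for illustration. A search for 日本 (Japan) also returns a document that merely contains the same two characters in a different word, 本日 (today).

```python
from collections import defaultdict


def unigram_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Index every individual character: character -> document IDs."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return documents that contain every character of the query, in any order."""
    postings = [index.get(ch, set()) for ch in query if not ch.isspace()]
    return set.intersection(*postings) if postings else set()


DOCS = {
    "a": "日本の会社",    # "a Japanese company"
    "b": "本日は休業",    # "closed today"
}

index = unigram_index(DOCS)
print(sorted(search(index, "日本")))   # both documents match, but only "a" is about Japan
```

Document "b" is the kind of non-responsive hit that reviewers then have to read and discard.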

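Another indexing approach is dictionary-based segmentation, in which the text is split into words by matching it against a dictionary. An expansive dictionary, however, is maintenance-intensive and susceptible to errors, for example when a document contains terms that the dictionary does not include.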

The most effective tools for searching and processing Asian language data now use n-grams, a concept drawn from statistics and communication theory, where “n-gram models” predict the next item in a data sequence. Applied to indexing, the text is broken into sequences of n consecutive characters instead of relying on word boundaries, which yields scalable and cost-effective solutions. FRONTEO technology, for example, tokenizes CJK language content into two-character pairs called bi-grams.

Benefits of Bi-Gram Indexing:

Bi-gram indexing can support search strings of varying lengths: the search engine breaks the search string into bi-gram pairs and locates those pairs in the index to produce the results. Because matching does not depend on where a dictionary would place word boundaries, potentially responsive content is less likely to be missed simply because of the length of the search string.
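The idea can be sketched in a few lines. This is our own minimal illustration, not FRONTEO's implementation: it assumes overlapping two-character pairs, stores the position of each pair, and treats a document as a hit when the pairs of the search string appear at consecutive positions. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def bigrams(text: str) -> list[str]:
    """Overlapping two-character pairs, e.g. 契約書 -> 契約, 約書."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(documents: dict[str, str]):
    """bi-gram -> {document ID -> set of positions where the pair starts}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in documents.items():
        for position, gram in enumerate(bigrams(text)):
            index[gram][doc_id].add(position)
    return index


def search(index, query: str) -> set[str]:
    """Documents in which the query's bi-grams occur at consecutive positions."""
    grams = bigrams(query)
    if not grams:
        return set()  # one-character queries would need separate handling
    hits = set()
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(gram, {}).get(doc_id, set())
                   for offset, gram in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits


DOCS = {
    "a": "日本の会社と契約書を締結",   # "concluded a contract with a Japanese company"
    "b": "本日は休業",                 # "closed today"
}

index = build_index(DOCS)
for query in ["日本", "契約書", "契約書を締結", "本日"]:
    print(query, sorted(search(index, query)))
```

The same index answers two-, three- and six-character queries, and 日本 no longer pulls in the document that only contains 本日.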

Applying CJK Specialization

The CJK processing rudiments described here have hopefully taken some of the mystery out of these activities. They may even inspire an appreciation for the fact that all of this complexity can be addressed within a day, or even within hours, making searchable documents available to review attorneys very quickly.

Your role may not require a deep level of expertise in this area, but your project will require it. When you know that a cross-border case will likely involve CJK data, select a team with native language speakers and ESI experts in-country. Ask service providers for a clear explanation of their tools and approaches in processing ESI, handling non-Unicode content, and indexing the dataset for search. CJK language collection, processing, search and review require specialized tools and expertise.

The same issues that make indexing and search a challenge also have an impact on TAR/predictive coding downstream. A future post will look at CJK review, and specifically TAR for CJK language data.

(Please send your comments and requests for future blog posts to blog@fronteo.com)