New to Asian Language eDiscovery? Challenges posed by CJK

Economic integration in the ASEAN region, public investment, increased M&A activity, and the expansion of foreign and domestic multinational corporations are driving regional legal market growth. Most organizations either have branches in Asian countries or conduct business with entities that do.

Growth trends suggest that more frequent litigation, enforcement of the U.S. Foreign Corrupt Practices Act and the U.K. Bribery Act, and M&A activity are driving an increased need for eDiscovery involving data in the Chinese, Japanese, or Korean languages (CJK).

CJK eDiscovery may seem daunting if you haven’t dealt with it before, and it does pose unique challenges, but don’t be deterred from participating in a case that involves collecting and reviewing CJK email or documents. An understanding of the most common challenges will help you select and work effectively with experts in the technology, language, culture and laws surrounding the collection, processing, review and production of CJK data.

Rules Governing Data Privacy and Collection

Rules vary by country. Regulations in China restrict gathering data for an investigation or litigation. Japan has no standard processes for eDiscovery, and Japan’s Privacy Act limits the transfer of data from a corporate entity to a third party. In South Korea, eDiscovery practices are relatively undeveloped, though several laws govern the processing of personal information. eDiscovery experts with current experience in the local legal environment provide an added level of quality and accuracy while reducing the risk of penalties.

Complex Languages & Characters

Chinese, Japanese and Korean use different writing systems and often have different grammatical constructs. In some cases, characters represent whole words or ideas rather than sounds, which causes challenges downstream in search. A single character can often capture an entire word or phrase.

Japanese writing relies on two syllabaries (kana) called hiragana and katakana, which are essentially two versions of the same set of sounds in the language. Chinese characters, called kanji in Japanese, are also heavily used in Japanese writing.
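To make the mix of scripts concrete, here is a minimal Python sketch that labels each character of a short Japanese sentence as hiragana, katakana or kanji. It relies only on the standard unicodedata module; the helper name script_of and the sample sentence are illustrative choices, and classifying by Unicode name prefix is a simplification rather than a complete script detector.

```python
import unicodedata
from collections import Counter

def script_of(ch: str) -> str:
    """Classify one character by the prefix of its Unicode name (a simplification)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"

# Illustrative sample sentence: "I buy a camera"
sample = "私はカメラを買う"
counts = Counter(script_of(ch) for ch in sample)

for ch in sample:
    print(ch, script_of(ch))
print(dict(counts))
```

A single short sentence already mixes all three scripts, which is one reason search and review tools need to handle them together.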

In Asian language writing, differences between casual and formal tone, regional language variations, and even the keyboard style used can affect the meaning of a phrase or sentence. These challenges can be mitigated by employing local native speakers who understand regional and cultural variations and can provide the most accurate insight into the data being collected for a legal matter.

Tokenization

In English, spaces mark word boundaries, but CJK content is not segmented into words, which can create downstream challenges in search and analytics. The process of splitting a run of characters into words is called segmentation, or tokenization, and different segmentations of the same characters can produce different meanings. Accurate tokenization is therefore critical for preserving meaning and for effective search and analytics.

In the example below, the string 我喜欢新西兰花 is tokenized in two different ways. One segmentation means “I like New Zealand flowers”, while a second segmentation of the same string means “I like fresh broccoli”.

我 喜欢 新西兰 花

我 喜欢 新 西兰花

(Teahan, W. J., McNab, R., Wen, Y., & Witten, I. H. (2000), “A compression-based algorithm for Chinese word segmentation” Comput.Linguist. 26(3), 375-393.)
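The ambiguity in the example above can be reproduced with a small Python sketch. It enumerates every way to split the string into words from a lexicon. The function name segmentations and the six-entry toy lexicon are assumptions made for illustration; production tokenizers use large dictionaries and statistical models rather than exhaustive enumeration.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical toy lexicon: I, like, new/fresh, New Zealand, broccoli, flower
LEXICON = {"我", "喜欢", "新", "新西兰", "西兰花", "花"}

splits = segmentations("我喜欢新西兰花", LEXICON)
for split in splits:
    print(" / ".join(split))
```

Both splits are valid against the lexicon, so the tokenizer needs context or statistics to choose between them. A search index built on one segmentation may not match a query tokenized the other way.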

Character Encoding – ASCII and Unicode

Content is composed of a sequence of characters. While characters represent letters of the alphabet, punctuation, and such, content is stored electronically by computers as a sequence of numeric values called bytes. Character encoding is the term for the “key” used to convert a sequence of bytes into characters. Without this key, it becomes nearly impossible to read any content other than the most basic English text. The result may instead look like Wingdings.
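A short Python sketch shows what happens when the wrong “key” is applied. The sample sentence is illustrative, and Latin-1 stands in here for whatever Western default a tool might assume; the same bytes are decoded once with the correct encoding and once with the wrong one.

```python
# Illustrative sample: "Please review the contract"
original = "契約書を確認してください"

# The bytes as a Japanese mail client using Shift JIS might have stored them.
raw = original.encode("shift_jis")

right = raw.decode("shift_jis")  # correct key: text is recovered
wrong = raw.decode("latin-1")    # wrong key: every byte becomes an unrelated character

print(right)
print(wrong)
```

The wrong decode raises no error at all, which is why this kind of corruption can pass through processing unnoticed.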

Not only does lack of character encoding information harm the quality of displayed text, but it may also prevent data from being found in search.
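The effect on search can be sketched the same way. In this minimal example, the helper contains and the sample text are hypothetical; it searches for the keyword 契約 (“contract”) in text decoded with the right and the wrong encoding.

```python
def contains(raw: bytes, keyword: str, assumed_encoding: str) -> bool:
    """Decode `raw` with the assumed encoding, then search for `keyword`."""
    text = raw.decode(assumed_encoding, errors="replace")
    return keyword in text

# A document stored as Shift JIS bytes ("Please review the contract").
document = "契約書を確認してください".encode("shift_jis")

found_with_correct_encoding = contains(document, "契約", "shift_jis")
found_with_wrong_encoding = contains(document, "契約", "latin-1")

print(found_with_correct_encoding, found_with_wrong_encoding)
```

The document is responsive in both cases, but only the correctly decoded copy is findable.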

Traditionally, Western languages were encoded in ASCII, which does not accommodate Japanese, Chinese, or other Asian languages. CJK content instead uses Shift JIS, EUC-JP, Big5 and other standards that were developed to encode text, including email, in Asian languages.
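As a rough illustration, the Python sketch below tries to encode a three-character sample in ASCII, which fails, and then encodes it in several of the standards named above. The sample text and variable names are illustrative; the codec names are those shipped with Python’s standard library.

```python
text = "日本語"  # illustrative sample: "Japanese language"

# ASCII has no code for any of these characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same three characters produce different bytes under each standard.
encoded = {name: text.encode(name)
           for name in ("shift_jis", "euc_jp", "big5", "utf-8")}

print("ASCII can encode it:", ascii_ok)
for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
```

Because each standard maps the same characters to different bytes, a processing tool has to know or detect which one produced a given file.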

Western software support for non-ASCII encodings has only matured in the past 15 years or so. Unicode has become the de facto standard for character encoding; it extends the traditional Western ASCII character set to cover additional scripts, including those used in Asian languages.

Numerous email programs have been used in Asia for years, including “Becky” and “Thunderbird”, and they may produce file types that are common in legacy data but unrecognizable to Western tools. In some cases, .MSG email files may have Unicode-compliant main text while metadata (such as email headers) is not Unicode-compliant and converts into nonsense characters.

Experts in CJK discovery are critically important here, as they can recognize multiple encoding standards and convert ESI content to Unicode to better facilitate the discovery process. Technicians with experience in Asian eDiscovery are aware of these and many other nuances in CJK, and can anticipate and solve problems that crop up.
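The conversion step can be sketched in Python as a trial decode against a list of candidate encodings, followed by re-encoding as UTF-8. The function name to_unicode and the candidate order are assumptions for illustration. Trial decoding is only a heuristic, since some byte sequences decode without error under more than one legacy encoding, which is part of why expert review of the output matters.

```python
# Hypothetical candidate list; the order matters because the first clean decode wins.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "big5")

def to_unicode(raw: bytes, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes `raw` without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding decoded the data cleanly")

legacy_bytes = "日本語".encode("shift_jis")      # stand-in for a legacy file
text, detected = to_unicode(legacy_bytes)
normalized = text.encode("utf-8")                # store everything downstream as UTF-8

print(detected, text)
```

A clean decode does not prove the guess was right, so results for short or ambiguous inputs should be spot-checked by someone who reads the language.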

Technology helps, but humans are still indispensable

When you know that a cross-border case will likely involve CJK data, plan early and assemble a team in-country, including a technology and service provider with extensive and current expertise in Asian language eDiscovery. Ask service providers for a clear explanation of their tools and approaches for handling non-Unicode content and indexing content for search, and ask for detail on the native language capabilities of the team that will manage hosting and review. CJK language collection, processing, search and review require specialized tools and expertise, and there is no “easy button” – and no substitute for experience.

(Please send your comments and requests for future blog posts to blog@fronteo.com)