It's all Belgian Fries to Me: The Art of Multilingual eDisclosure (Part I)
"Un type spécial de beauté existe qui est né dans la langue, de la langue et pour la langue.” – Gaston Bachelard (1884-1962) French philosopher and poet
As a Belgian national born to Spanish parents, my love affair with languages started at a young age. I was brought up bilingual in a country with three national languages: Dutch, French, and German. This meant that I was constantly exposed to different languages and cultures. When growing up in Brussels, reading traffic signs, street names, and publicity boards displayed both in Dutch and French became second nature.
Love of languages leads to eDisclosure prowess
Family summer holidays spent in Galicia or Valencia, my mother’s and father’s respective home regions, added yet another layer to my life’s linguistic fabric. Galician, an official language closely related to Portuguese, is spoken in my mother’s region, and Valencian, the official name given to the variant of Catalan, is spoken in my father’s region. These additional languages punctuated our extended family conversations as cousins, uncles, and aunts rapidly turned back, out of habit, to their own regional vernacular, meaning my siblings and I had to quickly pick up the semantics, syntax and lexicology of these languages in order to keep up with the family chatter.
Unsurprisingly, I went on to study English and Italian in the Home Counties and Dutch in the Brabançon town of Tilburg in the very south of the Netherlands, just 90 minutes from Brussels. My first job in the UK, some 20 years ago, was at a major translation corporation in Berkshire, programme managing translations of automotive and telecom manuals into Norwegian and Swedish. No, before you ask, I do not speak either of them.
This exposure to the translation world cemented my belief that “a special kind of beauty exists which is born in language, of language, and for language” (G. Bachelard). Now, 15 years on, the beauty of languages and translation is even more prevalent in my world of eDisclosure and computer forensics. The rise in global commerce has seen an increase in cross-border litigation, arbitration, and compliance investigations, which in turn has meant that the industry has had to develop new technical and human solutions, or tweak existing ones, to deal with the processing, keyword searching, review, and potential translation of documents for these types of legal matters.
It's all Caesar salad to me: eDisclosure, extended Latin, and Unicode
Firstly, the industry was faced with the technical limitations of processing data that was not in simple Latin script. Processing documents in languages that use the Roman alphabet but with extended Latin script (i.e. accents and special characters, as most languages have) was a real challenge. My French name is a perfect example. Even to this day, when I complete an online form, its rendering can turn from Jérôme into a rather disturbing Jirtme on some language-unfriendly sites.
The Japanese kindly refer to this garbling of characters as Mojibake (文字化け), literally "character transformation", where text becomes unintelligible. Russians coined the term krakozyabry (кракозя́бры), and Germans may call it Buchstabensalat (a salad of letters, or think "alphabetti spaghetti"), but all languages not using simple Latin characters suffered from some level of text scrambling. Electronic disclosure providers had to painstakingly apply the correct language encoding to each document to eradicate the issue, or at the very least minimize it.
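As a rough illustration of how such garbling arises, the Python sketch below writes a name as UTF-8 bytes and then reads those bytes back as Windows-1252, which is one common mismatch. The function names `garble` and `repair` are hypothetical, and the pairing of these two encodings is an assumption made for the example; real documents can involve many other combinations.

```python
def garble(text: str, written_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text one way and decode it another, as a misconfigured system would."""
    return text.encode(written_as).decode(read_as, errors="replace")


def repair(garbled: str, written_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Undo the wrong decoding, assuming we know which two encodings were mixed up."""
    return garbled.encode(read_as).decode(written_as)


name = "Jérôme"
broken = garble(name)
print(broken)          # JÃ©rÃ´me
print(repair(broken))  # Jérôme
```

The repair only works here because the wrong decoding lost no bytes and both encodings are known. In practice, identifying the original encoding is the hard part.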
And then came Unicode, the so-called panacea for all encoding problems, or so it seemed. Its UTF-8 encoding has been widely adopted as the de facto worldwide standard, and while it eliminated many of the Mojibake issues encountered in the past, such issues still arise when dealing with legacy systems and with proprietary or Asian email systems that still use their own encodings or only partly use Unicode.
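One way to picture the handling of legacy data is a decoder that tries UTF-8 first and then falls back to a short list of older encodings. The sketch below is a minimal illustration, not how any particular eDisclosure platform works; the function name `decode_with_fallback` and the candidate list are assumptions chosen for the example.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "shift_jis", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing fitted: keep what we can and mark the result as lossy.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


utf8_bytes = "Jérôme".encode("utf-8")
legacy_bytes = "文字化け".encode("shift_jis")

print(decode_with_fallback(utf8_bytes))    # ('Jérôme', 'utf-8')
print(decode_with_fallback(legacy_bytes))  # ('文字化け', 'shift_jis')
```

A caveat worth keeping in mind: the first encoding that decodes without an error is not necessarily the right one, since many byte sequences are valid in several legacy encodings. The order of the candidates therefore matters, and metadata such as email headers is usually a better guide when it is available.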
Once most eDiscovery specialists were able to process foreign language documents, the next challenge was how to identify specific languages in the universe of data, to assist the legal teams in planning their review, estimating the number of documents that may need translation, and/or allocating foreign language review resources.
It's all Greek to me: Exceptions prove the translation rules
Earlier options for language detection were crude and rudimentary, mostly based on alphabets or writing systems, so they could detect and differentiate Arabic, Cyrillic, Chinese, and Latin scripts, for example, but could not provide any more granularity. Is it Farsi or Arabic, Russian or Ukrainian, Cantonese or Mandarin, Spanish or Italian... or is it all Greek?
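To make the limitation of script-based detection concrete, here is a small Python sketch that guesses the dominant writing system of a text. It takes the first word of each letter's Unicode character name (for example "CYRILLIC" or "LATIN") as a stand-in for its script, which is a shortcut assumed for this example rather than the official Unicode Script property. The function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str):
    """Return the most frequent script label among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER PE" -> "CYRILLIC"
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Привет, мир"))   # CYRILLIC
print(dominant_script("مرحبا"))          # ARABIC
print(dominant_script("pescado fresco"))  # LATIN
print(dominant_script("pesce fresco"))    # LATIN
```

The Spanish and Italian phrases both come back as LATIN, and Russian and Ukrainian would both come back as CYRILLIC, which is exactly the lack of granularity described above.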
Then it all got a bit clever, using dictionaries, but as always with languages, nothing is that straightforward. Little use is a dictionary when you need to differentiate the Spanish word fresco from the Portuguese, Italian, Dutch, or English fresco (meaning fresh, insolent, wall painting). Languages within the same family tree share many common roots and attributes, which made automated differentiation and detection a challenge.
As a small linguistic digression, did you know that the word butterfly is one of the very few words that disproves my earlier comment about the sharing of roots? Butterfly turns into a French papillon, an Italian farfalla, a Portuguese borboleta, a Spanish mariposa, a Romanian fluture, a German Schmetterling, a Dutch vlinder, a Danish sommerfugl, a Swedish fjäril… you get the gist! Exceptions are what languages are made of.
Now, language identification uses intelligent algorithms that combine dictionaries with the analysis of character sets, accentuation, spelling, single letters, and letter groupings. Still not perfect, but a little less Greek to us.
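The idea of combining dictionaries with character-level clues can be sketched in a few lines of Python. This is a toy illustration only: the word lists and characteristic characters in `PROFILES` are tiny, hand-picked assumptions, the weighting is arbitrary, and the function names are hypothetical. Real language identification tools rely on far larger statistical models.

```python
import re
import unicodedata

# Hand-picked, illustrative profiles: a few common words plus telltale characters.
PROFILES = {
    "spanish": {"words": {"el", "la", "los", "que", "y", "es", "muy", "está"},
                "chars": set("ñ¿¡")},
    "italian": {"words": {"il", "lo", "gli", "che", "e", "è", "molto", "di"},
                "chars": set("èò")},
    "portuguese": {"words": {"o", "os", "que", "e", "é", "muito", "está", "não"},
                   "chars": set("ãõç")},
}


def score_languages(text: str) -> dict:
    """Score each language by dictionary hits plus weighted characteristic characters."""
    lowered = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"[^\W\d_]+", lowered)  # runs of letters only
    scores = {}
    for language, profile in PROFILES.items():
        word_hits = sum(1 for token in tokens if token in profile["words"])
        char_hits = sum(1 for ch in lowered if ch in profile["chars"])
        scores[language] = word_hits + 2 * char_hits
    return scores


def guess_language(text: str):
    """Return the best-scoring language, or None when there is no evidence at all."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("El pescado está muy fresco"))     # spanish
print(guess_language("Il pesce è molto fresco"))        # italian
print(guess_language("O peixe não está muito fresco"))  # portuguese
print(guess_language("fresco"))                         # None
```

The last call shows why a dictionary alone is not enough: the single word fresco carries no distinguishing evidence, so the sketch declines to guess.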
(To be continued… look for Part II in a couple of weeks!)