This is work by Amazon addressing the difficulty of matching names across language barriers.
As Alexa-enabled devices continue to expand into new countries, finding information across languages that use different scripts becomes a more pressing challenge. For example, a Japanese music catalogue may contain names written in English or the various scripts used in Japanese — Kanji, Katakana, or Hiragana. When an Alexa customer, from anywhere in the world, asks for a certain song, album, or artist, we could have a mismatch between Alexa’s transcription of the request and the script used in the corresponding catalogue.
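The mismatch problem is easy to see with a tiny, hand-written sketch. The catalogue entry and artist name below are illustrative stand-ins, and the hard-coded mapping takes the place of a trained transliteration model:

```python
# Illustration: the same artist name written in different scripts will
# never match under a plain string comparison.
catalogue_entry = "マドンナ"   # "Madonna" written in Japanese Katakana
transcription = "Madonna"     # an English transcription of the request

# A naive exact match fails even though both refer to the same artist.
print(catalogue_entry == transcription)  # False

# Transliteration maps one script onto the other so the two can be
# compared. A trained model would produce this mapping; here it is a
# hard-coded stand-in.
transliterate = {"マドンナ": "Madonna"}
print(transliterate.get(catalogue_entry) == transcription)  # True
```

This is why a lookup table alone cannot solve the problem: the catalogue and the transcriptions contain an open-ended set of names, so the mapping has to be learned.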
Considering all the world's languages and the different scripts used to write them, this is obviously a difficult, data-intensive problem.
To address this problem, we developed a machine-learned multilingual named-entity transliteration system. Named-entity transliteration is the process of converting a name from one language script to another. We describe the design challenges of building such a system in a paper we are presenting this month at the 27th International Conference on Computational Linguistics (COLING 2018).
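Systems like this typically frame transliteration as character-level sequence-to-sequence "translation". As a rough sketch (the names here are illustrative, not from the paper's data), a name is split into a sequence of characters, and the model learns to emit the corresponding characters in the target script:

```python
# Sketch: transliteration framed as character-level seq2seq. The input
# name is tokenized into characters; the model's job is to emit the
# character sequence in the target script.
def to_char_tokens(name):
    """Tokenize a name into the character sequence a seq2seq model consumes."""
    return list(name)

source = to_char_tokens("マドンナ")   # ['マ', 'ド', 'ン', 'ナ']
target = to_char_tokens("Madonna")   # ['M', 'a', 'd', 'o', 'n', 'n', 'a']

# Training pairs are (source_sequence, target_sequence); the model learns
# a mapping between scripts rather than a fixed lookup table.
print(source, "->", target)
```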
The first challenge is obtaining a large dataset that contains name pairs in different languages. Since we could not find a publicly available dataset that satisfied our needs, we created a new dataset based on Wikidata, a central knowledge base for Wikipedia and other Wikimedia projects. We have released our dataset online, together with our code, under a Creative Commons license.
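Wikidata is a natural source for such pairs because each entity carries labels in many languages and scripts. A minimal sketch of the idea, with a hand-written stand-in for one entity record (the entity ID and label structure mimic Wikidata's JSON format, but the record itself is illustrative):

```python
# Hypothetical stand-in for one Wikidata entity record: the same entity
# labelled in several languages/scripts yields name pairs "for free".
entity = {
    "id": "Q1744",
    "labels": {
        "en": {"language": "en", "value": "Madonna"},
        "ja": {"language": "ja", "value": "マドンナ"},
        "ru": {"language": "ru", "value": "Мадонна"},
    },
}

def name_pairs(entity, source_lang="en"):
    """Pair the source-language label with every other-language label."""
    labels = entity["labels"]
    source = labels[source_lang]["value"]
    return [
        (source, lbl["value"])
        for lang, lbl in labels.items()
        if lang != source_lang
    ]

for pair in name_pairs(entity):
    print(pair)  # ('Madonna', 'マドンナ'), then ('Madonna', 'Мадонна')
```

Run over a full Wikidata dump, this kind of extraction produces name pairs at a scale no hand-built dataset could match.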
[Image: Transformer model diagram]
This is all available on Steve Ash’s NETransliteration-COLING2018 GitHub page.
As for the licensing:
- The code in xlit_s2s_nmt and xlit_t2t is adapted from other TensorFlow repositories and is licensed under the original Apache 2 licenses.
- The data is adapted from Wikidata and retains its license, Creative Commons CC0 1.0 Universal (see data/LICENSE).
- The scripts folder contains data preparation and train/test scripts licensed under the MIT License.
You can see that the project and its data are built out from existing Creative Commons and open-source projects. Yay for virality! The code and large datasets are available to AI researchers and application builders alike.