July 25th, 2013 | Published in Uncategorized
Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions - Croatian, Filipino, Ukrainian, and Vietnamese - required creative solutions to reflect how each language is used across devices and in everyday conversations.
For example, since Vietnamese is a tonal language, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (tonemes) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory. As a result we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. For this purpose, we created a special diacritic restoration algorithm which enables us to present properly formatted text to our users in the majority of cases.
Filipino also presented interesting challenges. Much like in other multilingual societies such as Hong Kong, India, South Africa, etc., Filipinos often mix several languages in their daily life. This is called code switching. Code switching complicates the design of pronunciation, language, and acoustic models. Speech scientists are effectively faced with a dilemma: should we build one system per language, or should we combine all languages into one?
In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best in modeling this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.
The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale neural network learning infrastructure (yes, the same one that spontaneously discovered cats on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.
The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.