Wednesday, June 30, 2010

Google launches Korean Voice Search



On June 16th, we launched our Korean voice search system. Google Search by Voice has been available in various flavors of English since 2008, in Mandarin and Japanese since 2009, and in French, Italian, German and Spanish just a few weeks ago (some more details in a recent blog post).

Korean speech recognition has received less attention than English, which has been studied extensively around the world by teams in both English- and non-English-speaking countries. Fundamentally, our methodology for developing a Korean speech recognition system is similar to the process we have used for other languages. We created a set of statistical models: an acoustic model for the basic sounds of the language, a language model for the words and phrases of the language, and a dictionary mapping the words to their pronunciations. We trained our acoustic model using a large quantity of recorded and transcribed Korean speech. The language model was trained using anonymized Korean web search queries. Once these models are trained, we can take an audio input, compute the most likely spoken phrase, and display it along with its search results.
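
To make this concrete, here is a toy sketch of how the recognizer combines the two models to pick the most likely phrase. It is only an illustration, not our production decoder: the acoustic_score and language_model_score functions are hypothetical stand-ins, and the real search runs over a huge hypothesis lattice rather than a short candidate list.

```python
# Toy illustration of combining an acoustic model and a language model.
# acoustic_score() and language_model_score() are hypothetical stand-ins
# for the real models; this is not the production decoder.

def recognize(audio, candidates, acoustic_score, language_model_score, lm_weight=1.0):
    """Return the candidate transcription with the best combined log-score."""
    def score(phrase):
        # log P(audio | phrase) from the acoustic model, plus a weighted
        # log P(phrase) from the language model.
        return acoustic_score(audio, phrase) + lm_weight * language_model_score(phrase)
    return max(candidates, key=score)
```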

There were several challenges in developing a Korean speech recognition system, some unique to Korean, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:

  • Developing a Korean dictionary: Unlike English, where there are many publicly available dictionaries for mapping words to their pronunciations, there are very few available for Korean. Since our Korean recognizer knows several hundred thousand words, we needed to create these mappings ourselves. Luckily, Korean has one of the most elegant and simple writing systems in the world (created in the 15th century!) and this makes mapping Korean words to pronunciations relatively straightforward (see the short sketch after this list). However, we found that Koreans also use quite a few English words in their queries, which complicates the mapping process. To predict these pronunciations, we built a statistical model using data from an existing (smaller) Korean dictionary.
  • Korean word boundaries: Although Korean orthography uses spaces to indicate word boundaries (unlike Japanese or Mandarin), we found that people use word boundaries inconsistently for search queries. To limit the size of the vocabulary generated from the search queries, we used statistical techniques to cut rare long words into smaller sub-words (similarly to the system we developed for Japanese).
  • Pronunciation exceptions: Korean (like all other languages) has many pronunciation exceptions that are not immediately obvious. For example, numbers are often written as digit sequences but not necessarily spoken that way (2010 = 이천십). The same is true for many common alphanumeric sequences like “mp3” or “kbs2” and mixed queries like “삼성 tv”, which are often spoken letter by letter, with the digits sometimes pronounced in English rather than Korean.
  • Encoding issues: Korean script (Hangul) is written in syllabic blocks, with each block containing at least two of the 24 modern Hangul letters (Jamo), at least one consonant and one vowel. Including the normal ASCII characters this brings the total number of possible basic characters to over 10000, not including Hanja (used mostly in the formal spelling of names). So, despite its simple writing system, Korean still presents the same challenge of handling a large alphabet that is typical of Asian languages.
  • Script ambiguity: We found that some users like to type English words in their original script while others use the Korean transliteration (for example, “ncis season 6” vs. “ncis 시즌6”). This makes it challenging to train and evaluate the system. We use a metric that estimates whether our transcription will give the correct web page result on the user’s smartphone screen, and such script variations make this tricky.
  • Recognizing rare words: The recognizer is good at recognizing things users often type into the search engine, such as cities, shops, addresses, common abbreviations, common product model numbers and well-known names like “김연아”. However, rare words (like many personal names) are often harder for us to recognize. We continue to work on improving those.
  • Every speaker sounds different: People speak in different styles, slow or fast, with or without an accent, with lower or higher pitched voices, and so on. To make our system work in all these different conditions, we trained it using data from many different sources to capture as many conditions as possible.
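
To make the “elegant and simple” point about Hangul concrete, here is a minimal sketch (not our dictionary code) that decomposes a precomposed Hangul syllable into its Jamo using standard Unicode arithmetic: the 11,172 syllable blocks starting at U+AC00 are laid out as 19 initials × 21 medials × 28 finals, so the decomposition is purely mechanical.

```python
# Minimal sketch: decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its
# Jamo components using the standard Unicode layout (19 initials x 21 medials x 28 finals).
# This illustrates why mapping Hangul spelling to pronunciation is largely mechanical;
# it is not our production dictionary code.

INITIALS = [chr(0x1100 + i) for i in range(19)]          # leading consonants
MEDIALS  = [chr(0x1161 + i) for i in range(21)]          # vowels
FINALS   = [''] + [chr(0x11A8 + i) for i in range(27)]   # optional trailing consonants

def decompose(syllable):
    """Split one Hangul syllable block into (initial, medial, final) Jamo."""
    offset = ord(syllable) - 0xAC00
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]

# Example: the first syllable of a query such as "김연아"
print(decompose("김"))  # ('ᄀ', 'ᅵ', 'ᆷ')
```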

When speech recognizers make errors, the reason is usually that the models are not good enough, and that often means they haven’t been trained on enough data. For Korean (and all other languages), our cloud computing infrastructure allows us to retrain our models frequently on an ever-growing amount of data to continually improve performance. We are committed to improving the system regularly over time to make speech a user-friendly input method on mobile devices.

Monday, June 14, 2010

Google Search by Voice now available in France, Italy, Germany and Spain



Google’s speech team is composed of people from many different cultural backgrounds. Indeed, if we count the languages spoken by our teammates, the number comes to well over a dozen. Given our own backgrounds and interests, we are naturally excited to extend our software to work with many different languages and dialects. After testing the waters with English, Mandarin Chinese, and Japanese, we decided to tackle four main European languages which are often referred to as FIGS - French, Italian, German and Spanish.

Developing Voice Search systems in each of these languages presented its own challenges. French and Spanish required special work to deal with diacritic and accent marks (e.g. ç in French, ñ in Spanish). When we develop a new language, we tweak our dictionaries based on user-generated content. To our surprise, we found that a lot of this content in French and Spanish often uses non-standard orthography. For example, a French speaker might type “francoise” into a search engine and still expect it to return results for “Françoise”. Likewise, in Spanish a user might type “espana” and expect results for the term “España”. Of course, a lot of this has to do with the fact that, until recently, domain names (like www.elpais.es) did not allow diacritics, and that entering special characters is often painful while omitting diacritics is usually not an obstacle to communication. However, non-standard spellings distort the intended pronunciations: if “francoise” were a real French word, one would expect it to be pronounced “franquoise”. To capture the intended pronunciation of these non-standard spellings, we automatically fixed the orthography in our Spanish and French dictionaries. While this is not perfect, it deals with many of the offending cases.
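
Here is a minimal sketch of this kind of automatic fix-up, assuming a list of correctly accented canonical forms is already available: strip the diacritics from each entry and use the stripped spelling to map non-standard forms back to their canonical orthography. The word list and function names below are illustrative, not our production pipeline.

```python
# Minimal sketch of restoring diacritics, assuming a list of correctly accented
# canonical forms is available. The word list is illustrative only.
import unicodedata

def strip_diacritics(word):
    """Remove accents/cedillas: 'françoise' -> 'francoise', 'españa' -> 'espana'."""
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

canonical_forms = ['françoise', 'españa', 'méxico']   # illustrative entries
restore = {strip_diacritics(w): w for w in canonical_forms}

def fix_orthography(term):
    """Map a diacritic-free spelling back to its canonical form when we know one."""
    return restore.get(term.lower(), term)

print(fix_orthography('francoise'))  # françoise
print(fix_orthography('espana'))     # españa
```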

Since our Voice Search systems typically understand more than a million different words in each language, developing pronunciation dictionaries is one of the most critical tasks. We need the dictionary to match what the user said with the written form. Not surprisingly, we found dictionary development for some languages, like Spanish and Italian, to be extremely easy, as they have very regular orthographies. In fact, the core of our Spanish pronunciation module consists of less than 100 lines of source code. Other languages, like German and French, have more complex orthographies. For example, in French “au”, “eaux” and “hauts” are all pronounced “o”.
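
To give a flavor of why Spanish is so regular, here is a deliberately simplified letter-to-sound sketch. The rule set and phone symbols are illustrative only, not our actual module, which also handles stress, dialect variation, and many more cases.

```python
# A deliberately simplified letter-to-sound sketch for Spanish, to show why a
# rule-based pronunciation module can be tiny. Phone symbols are illustrative.

DIGRAPHS = {'ch': 'tS', 'll': 'J', 'rr': 'rr'}
SIMPLE = {'a': 'a', 'b': 'b', 'd': 'd', 'e': 'e', 'f': 'f', 'i': 'i', 'k': 'k',
          'l': 'l', 'm': 'm', 'n': 'n', 'o': 'o', 'p': 'p', 'r': 'r', 's': 's',
          't': 't', 'u': 'u', 'v': 'b', 'w': 'w', 'x': 'ks', 'y': 'J', 'z': 'T',
          'ñ': 'N', 'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u', 'ü': 'u'}

def spanish_g2p(word):
    """A tiny, context-dependent grapheme-to-phoneme sketch for Spanish."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        nxt = word[i + 1] if i + 1 < len(word) else ''
        if pair in ('qu', 'gu') and word[i + 2:i + 3] in ('e', 'i'):
            phones.append('k' if pair == 'qu' else 'g')        # 'que', 'gui': u is silent
            i += 2
        elif pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] == 'c':
            phones.append('T' if nxt in ('e', 'i') else 'k')   # 'T' ~ Castilian 'z' sound
            i += 1
        elif word[i] == 'g':
            phones.append('x' if nxt in ('e', 'i') else 'g')   # 'x' ~ the 'jota' sound
            i += 1
        elif word[i] == 'j':
            phones.append('x')
            i += 1
        elif word[i] == 'h':
            i += 1                                             # 'h' is silent
        else:
            phones.append(SIMPLE.get(word[i], word[i]))
            i += 1
    return phones

print(spanish_g2p('chiripitiflautico'))
# ['tS', 'i', 'r', 'i', 'p', 'i', 't', 'i', 'f', 'l', 'a', 'u', 't', 'i', 'k', 'o']
```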

A notable aspect of German (especially “Internet German”) is that a lot of English words are in common usage. We do our best to recognize thousands of English words, even though English contains some sounds that don’t exist in German, like the “th” in “the”. One of the trickiest examples we came across was when one of our volunteers read “nba playoffs 2009”, saying “nba playoffs” in English followed by “zwei tausend neun” in German. So go ahead and search for “Germany’s Next Topmodel” or “Postbank Online” and see if it works for you.

German is also notorious for having long, complex words. Our favorite examples include:

Just for fun, compare how long it takes you to say these to Voice Search vs. typing them.

Even though a vocabulary size of one million words sounds like a large number, each of these languages has even more words, so we need a procedure to select which ones to model. We obviously do not do this manually; instead, we use statistical procedures to identify the list of words we will allow, by looking at many sources of data and at how frequently each word occurs. It is therefore sometimes surprising to find really unusual terms selected by our algorithms. For example, in Spanish we found these unusual words:

So, in the unlikely event that you ever try a Spanish voice search query like “imágenes del músculo supercalifragilisticoespialidoso chiripitiflautico esternocleidomastoideo”, you may be surprised to see that it works.
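
A toy version of that vocabulary selection step might look like the sketch below, assuming the input is just a stream of query terms. The real procedure combines many data sources and more signals than raw counts, and the thresholds here are arbitrary.

```python
# Toy sketch of frequency-based vocabulary selection. The real procedure combines
# many data sources and more signals than raw counts; thresholds are arbitrary.
from collections import Counter

def select_vocabulary(query_terms, max_size=1_000_000, min_count=2):
    """Keep the most frequent terms (up to max_size), dropping one-off rarities."""
    counts = Counter(query_terms)
    return {term for term, count in counts.most_common(max_size) if count >= min_count}
```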

French, Italian, German, and Spanish are spoken in many parts of the world. In this first release of Google Search by Voice in these languages, we initially only support the varieties spoken in France, Italy, Germany, and Spain, respectively. The reason is that almost all aspects of a Voice Search system are affected by regional variation: French speakers from different regions have slightly different accents, use a number of different words, and will want to search for different things. Eventually, we plan to support other regions as well, and we will work hard to make sure our systems work well for all of you.

So, we hope you find these new voice search systems useful and fun to use. We definitely had a “supercalifragilisticoespialidoso chiripitiflautico” time developing them.

Wednesday, June 9, 2010

Google Fusion Tables celebrates one year of data management



A year ago we launched Google Fusion Tables, an easy way to integrate, visualize and collaborate on data tables in the Google cloud. You used it, saw the potential, and told us what else you wanted. Since then, we’ve responded by offering programmatic access through the Fusion Tables API, math across data columns owned by multiple people, and search over the collection of tables that have been made public. We published papers about Fusion Tables at SIGMOD 2010 and at the First Symposium on Cloud Computing. And since the map visualizations were such a hit, we made them even better by supporting large numbers of points, lines and polygons, custom HTML in map pop-up balloons (complete with tutorials), and integration with the Google Maps API. We’ve made all this capability available in Google’s cloud and are excited to see examples every day of how our cloud approach to data tables is changing the game and making structured data management, collaboration, and publishing fast, easy, and open.
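
To give a sense of what that programmatic access looks like, here is a rough sketch of a SQL-style SELECT issued over HTTP against a public table. The endpoint URL, table id, and CSV response format shown here are assumptions based on the original 2010-era API and are meant purely as an illustration, not as API reference.

```python
# Rough sketch of issuing a SQL-style SELECT over HTTP against a public table.
# The endpoint URL, table id, and CSV response format are assumptions for
# illustration only, not API documentation.
import urllib.parse
import urllib.request

TABLE_ID = 123456  # hypothetical id of a public, exportable table
sql = f"SELECT * FROM {TABLE_ID} LIMIT 10"

url = "https://www.google.com/fusiontables/api/query?" + urllib.parse.urlencode({"sql": sql})
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # rows come back as CSV
```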

But more exciting than all the features we’ve been releasing are the things that people have been *doing* with Fusion Tables. News agencies have been taking advantage of Fusion Tables to map data that governments make public and to tell a more complete story (see the L.A. Times, Knoxville News, and Chicago Tribune). Just this month the State of California kicked off an application development contest, hosting data sets like this one in Fusion Tables for easy API access by developers. And the US Department of Health and Human Services held the Community Health Data Forum, where attendees presented data applications such as the heart-friendly and people-friendly hospital finder, built with Google Fusion Tables.

It continues to astound us how quickly our users are able to pull together these kinds of compelling data applications with Fusion Tables, again showing the power of a cloud approach to data. Fusion Tables provided the multimedia extension to Joseph Rossano’s art exhibit on Butterflies and DNA barcodes, an easy way to map real estate in Monterey County or potholes in Spain, the geo-catalog for wind power data and ethanol-selling stations, and even the data backend for a geo portal that organizes water data for Africa, among many, many other uses.

As we head into our second year, we’re looking forward to delivering more tools that make data management easier and more powerful on the web. What’s next for Fusion Tables? Request your favorite features on our Feature Request page (a special implementation of Google Moderator), and follow the latest progress of Fusion Tables on our User Group, Facebook, and Twitter. We’d love to hear from you!