Wednesday, October 31, 2012

Large Scale Language Modeling in Automatic Speech Recognition



At Google, we’re able to use the large amounts of data made available by the Web’s fast growth. Two such data sources are the anonymized queries on google.com and the web itself. They help improve automatic speech recognition through large language models: Voice Search makes use of the former, whereas YouTube speech transcription benefits significantly from the latter.

The language model is the component of a speech recognizer that assigns a probability to the next word in a sentence given the previous ones. As an example, if the previous words are “new york”, the model would assign a higher probability to “pizza” than to, say, “granola”. The n-gram approach to language modeling (predicting the next word based on the previous n-1 words) is particularly well-suited to such large amounts of data: it scales gracefully, and the non-parametric nature of the model allows it to grow with more data. For example, on Voice Search we were able to train and evaluate 5-gram language models consisting of 12 billion n-grams, built using large vocabularies (1 million words), and trained on as many as 230 billion words.
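To illustrate the counting and estimation at the heart of an n-gram model, here is a toy bigram (n=2) model in Python, using a tiny made-up corpus. This is purely a sketch of the idea, nothing like the production system's scale or smoothing:

```python
from collections import defaultdict

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for i, w in enumerate(tokens):
            unigrams[w] += 1
            if i > 0:
                bigrams[(tokens[i - 1], w)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

corpus = [
    "new york pizza is great".split(),
    "new york pizza tonight".split(),
    "granola for breakfast".split(),
]
unigrams, bigrams = train_bigram_counts(corpus)
# After "york", "pizza" is far more likely than "granola":
print(bigram_prob(unigrams, bigrams, "york", "pizza"))    # 1.0
print(bigram_prob(unigrams, bigrams, "york", "granola"))  # 0.0
```

A real large-scale model would add smoothing and backoff so unseen n-grams don't get probability zero, but the counting backbone is the same.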


The computational effort pays off, as highlighted by the plot above: both word error rate (a measure of speech recognition accuracy) and search error rate (a metric we use to evaluate the output of the speech recognition system when used in a search engine) decrease significantly with larger language models.

A more detailed summary of results on Voice Search and a few YouTube speech transcription tasks (authors: Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, Shankar Kumar) presents our results when increasing both the amount of training data, and the size of the language model estimated from such data. Depending on the task, availability and amount of training data used, as well as language model size and the performance of the underlying speech recognizer, we observe reductions in word error rate between 6% and 10% relative, for systems on a wide range of operating points.
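Note that these reductions are relative, not absolute. With made-up numbers for illustration: a 10% relative reduction of a 20% baseline word error rate yields 18%, not 10%:

```python
def apply_relative_reduction(baseline_wer, relative_reduction):
    """A relative reduction scales the baseline error rate down,
    rather than subtracting percentage points from it."""
    return baseline_wer * (1.0 - relative_reduction)

# A 10% relative improvement on a 20% WER baseline:
print(round(apply_relative_reduction(0.20, 0.10), 4))  # 0.18
```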

Sunday, October 28, 2012

Ubuntu Server distribution upgrade

Doing a distribution upgrade is easy in the graphical user interface.  On a "headless" Ubuntu Server, however, no graphical user interface is available, so you need to perform a few easy steps on the command line.

First you need to check 2 things:
  • The package "update-manager-core" needs to be installed.  Just run the following command: (it will install it, if it is not already)
    sudo apt-get install update-manager-core
  • On Ubuntu Server, the update channel is usually "lts" (Long Term Support), which only upgrades from one LTS release to the next.  If you want to be offered every new release, set it to "normal": open the file "/etc/update-manager/release-upgrades", and change the line "Prompt=lts" to "Prompt=normal".
    For example, with pico:
    sudo pico /etc/update-manager/release-upgrades

    Change the Prompt setting to "normal".

    Press Ctrl-X, press "Y" to confirm and press enter to confirm the filename.

Then the distribution upgrade can be started by running this command:
sudo do-release-upgrade

Note: adding the -d flag makes do-release-upgrade offer the current development release as well; leave it out unless that is what you want.

Doing a distribution upgrade remotely over SSH is not recommended.  The "do-release-upgrade" command checks for this and gives a warning.  However, I didn't have any problems doing the upgrade remotely over SSH.

Thursday, October 18, 2012

Converting a date/time to time_t using FxGqlC (or SQL)

The time_t type is used in C and C++ to represent a date/time.  It is expressed in seconds since January 1st, 1970 (UTC).

To get the current date/time as a time_t value, you can run this query in FxGqlC (or SQL):
select datediff(second, '1970-01-01', getutcdate())

You need to use getutcdate() because time_t is defined in UTC.

Or for any arbitrary date/time (in UTC):
select datediff(second, '1970-01-01', '2012-10-18 22:33') 
-- Returns 1350599580

The other way around is also easy: run this query to convert a time_t to a date/time:
select dateadd(second, 1234567890, '1970-01-01')
-- Returns 13/02/2009 23:31:30
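If you want to double-check these values outside FxGqlC, the same conversions take a few lines of Python (standard library only, assuming Python 3):

```python
from datetime import datetime, timezone

# date/time -> time_t (seconds since the epoch, in UTC)
dt = datetime(2012, 10, 18, 22, 33, tzinfo=timezone.utc)
time_t = int(dt.timestamp())
print(time_t)  # 1350599580

# time_t -> date/time (in UTC)
print(datetime.fromtimestamp(1234567890, tz=timezone.utc))
# 2009-02-13 23:31:30+00:00
```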

Ngram Viewer 2.0



Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:

Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)

At Google, we’re also trying to understand the meaning behind what people write, and to do that it helps to understand grammar. Last summer Slav Petrov of Google’s Natural Language Processing group and his intern Yuri Lin (who’s since joined Google full-time) built a system that identified parts of speech—nouns, adverbs, conjunctions and so forth—for all of the words in the millions of Ngram Viewer books. Now, for instance, you can compare the verb and noun forms of “cheer” to see how the frequencies have converged over time:
Some users requested the ability to combine Ngrams, and Googler Matthew Gray generalized that notion into what we’re calling Ngram compositions: the ability to add, subtract, multiply, and divide Ngram counts. For instance, you can see how “record player” rose at the expense of “Victrola”:
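Under the hood, a composition is just per-year arithmetic on the two frequency series. A minimal sketch in Python, using made-up frequencies rather than real Ngram Viewer data:

```python
def compose(series_a, series_b, op):
    """Combine two phrase-frequency time series year by year, the way
    an Ngram composition like 'record player + Victrola' does."""
    years = sorted(set(series_a) | set(series_b))
    return {y: op(series_a.get(y, 0.0), series_b.get(y, 0.0)) for y in years}

# Hypothetical per-year frequencies (illustrative numbers only):
victrola = {1920: 4e-7, 1950: 1e-7}
record_player = {1920: 1e-7, 1950: 6e-7}

total = compose(record_player, victrola, lambda a, b: a + b)
print(total[1920])  # 5e-07
```

Subtraction, multiplication, and division work the same way, with a different `op`.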
Our info page explains all the details about this curious notion of treating phrases like components of a mathematical expression. We’re guessing they’ll only be of interest to lexicographers, but then again that’s what we thought about Ngram Viewer 1.0.

Oh, and we added Italian too, supplementing our current languages: English, Chinese, Spanish, French, German, Hebrew, and Russian. Buon divertimento!

Monday, October 15, 2012

LibreOffice doesn't appear in alt-tab view

I have an annoying problem with Unity on Ubuntu: LibreOffice/OpenOffice Calc or Writer windows often don't appear in the alt-tab application switcher.
Googling the problem taught me that it is a known bug: https://bugs.launchpad.net/bamf/+bug/1026426
The good news is that it is solved in Ubuntu 12.10, and a patch will be available for 12.04 LTS. A workaround is to restart Unity:
  • Start a command window (Ctrl-Alt-T)
  • Run:
    unity --replace & disown

Thursday, October 4, 2012

ReFr: A New Open-Source Framework for Building Reranking Models



We are pleased to announce the release of an open source, general-purpose framework designed for reranking problems, ReFr (Reranker Framework), now available at: http://code.google.com/p/refr/.

Many types of systems capable of processing speech and human language text produce multiple hypothesized outputs for a given input, each with a score. In the case of machine translation systems, these hypotheses correspond to possible translations from some sentence in a source language to a target language. In the case of speech recognition, the hypotheses are possible word sequences of what was said derived from the input audio. The goal of such systems is usually to produce a single output for a given input, and so they almost always just pick the highest-scoring hypothesis.

A reranker is a system that uses a trained model to rerank these scored hypotheses, possibly inducing a different ranked order. The goal is that by employing a second model after the fact, one can make use of additional information not available to the original model, and produce better overall results. This approach has been shown to be useful for a wide variety of speech and natural language processing problems, and was the subject of one of the groups at the 2011 summer workshop at Johns Hopkins’ Center for Language and Speech Processing. At that workshop, led by Professor Brian Roark of Oregon Health & Science University, we began building a general-purpose framework for training and using reranking models. The result of all this work is ReFr.
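The core idea can be sketched in a few lines of Python (a toy linear reranker with invented feature names, not ReFr's actual C++ API):

```python
def rerank(hypotheses, weights):
    """Rescore hypotheses with a linear model over extra features.

    hypotheses: list of (text, base_score, feature_dict) tuples
    weights: feature name -> weight, learned by the second model
    Returns the hypotheses sorted best-first by combined score.
    """
    def combined_score(hyp):
        text, base_score, features = hyp
        return base_score + sum(weights.get(name, 0.0) * value
                                for name, value in features.items())
    return sorted(hypotheses, key=combined_score, reverse=True)

# Toy example: the recognizer's 1-best is overturned by a feature
# ("lm" here) that is only available at reranking time.
hyps = [
    ("recognize speech", 2.0, {"lm": 0.9}),
    ("wreck a nice beach", 2.1, {"lm": 0.1}),
]
best = rerank(hyps, {"lm": 1.0})[0]
print(best[0])  # recognize speech
```

The interesting work in a real reranker is in learning those feature weights from data, which is what frameworks like ReFr provide.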

From the outset, we designed ReFr with both speed and flexibility in mind. The core implementation is entirely in C++, with a flexible architecture allowing rich experimentation with both features and learning methods. The framework also employs a powerful runtime configuration mechanism to make experimentation even easier. Finally, ReFr leverages the parallel processing power of Hadoop to train and use large-scale reranking models in a distributed computing environment.

Tuesday, October 2, 2012

EMEA Faculty Summit 2012



Last week we held our fifth Europe, Middle East and Africa (EMEA) Faculty Summit in London, bringing together 94 of EMEA’s foremost computer science academics from 65 universities representing 25 countries, together with more than 60 Googlers.

This year’s jam-packed agenda included a welcome reception at the Science Museum (plus a tour of the special exhibition: “Codebreaker - Alan Turing’s life and legacy”), a keynote on “Research at Google” by Alfred Spector, Vice President of Research and Special Initiatives, and a welcome address by Nelson Mattos, Vice President of Engineering and Products in EMEA, covering Google’s engineering activity and recent innovations in the region.

The Faculty Summit is a chance for us to meet with academics in Computer Science and other areas to discuss the latest exciting developments in research and education, and to explore ways in which we can collaborate via our University Relations programs.

The two and a half day program consisted of tech talks, breakout sessions, a panel on online education, and demos. The program covered a variety of computer science topics including Infrastructure, Cloud Computing Applications, Information Retrieval, Machine Translation, Audio/Video, Machine Learning, User Interface, e-Commerce, Digital Humanities, Social Media, and Privacy. For example, Ed H. Chi summarized how researchers use data analysis to understand the ways users share content with their audiences using the Circle feature in Google+. Jens Riegelsberger summarized how UI design and user experience research is essential to creating a seamless experience on Google Maps. John Wilkes discussed some of the research challenges - and opportunities - associated with building, managing, and using computer systems at massive scale. Breakout sessions ranged from technical follow-ups on the talk topics to discussing ways to increase the presence of women in computer science.

We also held one-on-one sessions where academics and Googlers could meet privately and discuss topics of personal interest, such as how to develop a compelling research award proposal, how to apply for a sabbatical at Google or how to gain Google support for a conference in a particular research area.

The Summit provides a great opportunity to build and strengthen research and academic collaborations. Our hope is to drive research and education forward by fostering mutually beneficial relationships with our academic colleagues and their universities.