Report from Google EMEA Faculty Summit 8-10 February 2010

The Faculty Summit is a meeting organized by Google to bring academic scientists and Google engineers together. ("Engineer" is the title of honour used by most of their scientists as well.) EMEA stands for Europe, the Middle East, and Africa. This year's summit in Zurich gathered 99 academics and a similar number of Googlers. The academics were invited by Google; many of those I talked with could only guess why they in particular had been invited. But everyone had at least something to do with this year's four themes, one of which was Natural Language Technologies. Within this theme the MOLTO project was excellently represented, since Lluís Màrquez from UPC was also invited.

Most of the Faculty Summit programme consisted of Google engineers presenting highlights of Google's ongoing research and future visions. This was highly interesting: of course we roughly know what Google is doing, and we get impressed or puzzled by their innovations almost every day. But here we could hear something more, and even ask questions.

The top highlight for me was a talk by Franz Och about translation research at Google. Everyone knows Google Translate: some people see it as a joke (or an inexhaustible joke generator), and almost everyone is impressed by it. The theoretical foundations are well known from the work of people like Brown, Koehn, Ney, and Och himself - but all this leaves a lot of room for guessing how it really works.

One guess I had myself was that Google Translate uses English as an interlingua. This was first suggested by the translation of Swedish "jag är svensk" ("I am Swedish") to German "ich bin Amerikaner" ("I am an American"), a joke that circulated in Sweden last summer. I tested this in various ways, and the hypothesis was confirmed by e.g. the neutralization of the plural "you": both "jag älskar dig" (singular "I love you") and "jag älskar er" (plural "I love you") produce "ich liebe dich" in German, even though German, like Swedish, distinguishes singular "dich" from plural "euch". Moreover, this makes sense, since there is probably much more Swedish-English and English-German data than Swedish-German data to train the statistical models on. So I posed this question to Och, and the answer was: yes, in 99% of the cases, the only exceptions being language pairs like Croatian-Serbian and Japanese-Korean. The high percentage came as a surprise to me, but no-one else looked so surprised, so perhaps they knew it already.
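
To make the pivoting effect concrete, here is a toy sketch of my own (emphatically not Google's implementation; the word tables are invented for the example). Composing a Swedish-English and an English-German dictionary word by word shows exactly where the plural is lost:

    # Toy illustration of English as a pivot (invented data).
    sv_to_en = {"jag": "I", "älskar": "love", "dig": "you", "er": "you"}
    en_to_de = {"I": "ich", "love": "liebe", "you": "dich"}

    def pivot_translate(sentence):
        """Translate Swedish to German word by word via English."""
        english = [sv_to_en[w] for w in sentence.split()]
        return " ".join(en_to_de[w] for w in english)

    # Both inputs come out as "ich liebe dich": the singular/plural
    # distinction is already gone in the English pivot.
    print(pivot_translate("jag älskar dig"))  # ich liebe dich
    print(pivot_translate("jag älskar er"))   # ich liebe dich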

In any case, this shows that an interlingua is a good idea when maintaining a growing number of languages (now 50-odd): the number of models needed then grows linearly rather than quadratically with the number of languages. At the same time, one may wonder how good English is as an interlingua - but it is probably the best one in the context of Google's methods, since there is so much English data.
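
The back-of-the-envelope arithmetic behind this claim (my own reasoning, not a figure from the talk): direct translation between all pairs needs a model for every ordered pair of languages, whereas pivoting only needs models into and out of the pivot.

    # Number of directed translation models needed for n languages.
    n = 50
    direct = n * (n - 1)   # one model per ordered pair: 2450
    pivoted = 2 * (n - 1)  # into and out of the pivot only: 98
    print(direct, pivoted)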

Another interesting part of Och's talk was his list of "10 main challenges" for translation. My memory is that MOLTO addresses 5 of them:

  • translation from English to other languages (Google is at its best when English is the target)
  • morphologically rich languages (Google treats word forms as primitives, without further morphological analysis; see the sketch after this list)
  • word order variations (where syntax rules seem to be the key)
  • sharing language models for different purposes: translation, information extraction, etc.
  • and sorry, I don't remember now what the fifth one was. But maybe the talk will appear on YouTube, like an earlier talk of his.
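
On the morphology challenge, here is a toy sketch of my own showing what is lost when word forms are opaque primitives (the analyzer and its data are invented for the example, and this is of course not how either Google or GF actually works):

    # With word forms as primitives, "går", "gick" and "gått" share no
    # statistics, although all three are forms of the verb "gå" ("go").
    # A morphological analyzer lets a model pool evidence per lemma.
    FORMS = {
        "går":  ("gå", "present"),  # "goes"
        "gick": ("gå", "past"),     # "went"
        "gått": ("gå", "supine"),   # "gone"
    }

    def analyze(word_form):
        """Map a surface form to (lemma, features), or None if unknown."""
        return FORMS.get(word_form)

    lemmas = {analyze(w)[0] for w in ["går", "gick", "gått"]}
    print(lemmas)  # {'gå'}: one verb, not three unrelated tokens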

On the other hand, it became even clearer than before that Google's definition of translation is different from MOLTO's. For Google, a translator is a program such that you can throw any document at it, and it returns a document in another language; there is no condition of semantic equivalence other than that the result is in some way useful and informative. This is actually quite like their search philosophy: there's no guarantee that the best results are returned, but the hope is high that they will be useful.

A related Google principle is that something is always better than nothing: a translation with some correct words is better than no translation at all. Google seems to apply this to software in general, very much in the spirit of the open source movement: release early and often, instead of keeping the code to yourself until it's perfect. I really like this principle and follow it in many things I do; my former PhD student Björn Bringert, who is now a Google engineer, was always a good example of how to apply it.

Anyway, since it belongs to Google's definition of translation that you can throw anything at it, MOLTO is actually not doing translation at all! Realizing this was a slight disappointment, but after all it's just a matter of terminology. OK, we are doing something other than translation: we want to render the original message unaltered and in grammatically correct language, and if we can't do this, we just tell the user "sorry, we can't". In the future we want to be able to do more than just say this, probably by using statistical techniques. I also believe that the linguistic resources created in GF - morphology, syntactic combination rules - could be a useful addition to Google's fundamentally statistical models, but it remains to be worked out exactly how to use them. Both of these grammar+statistics questions will be central in MOLTO.
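
As a caricature of this "grammar first, admit failure otherwise" policy, here is a minimal sketch of my own; the one-entry grammar is a stand-in for a real GF grammar, and nothing here is MOLTO's actual architecture:

    # Translate only what the grammar covers; otherwise admit failure.
    GRAMMAR = {
        "the patient has a fever": "der Patient hat Fieber",
    }

    def translate(sentence):
        """Return a grammatically backed translation, or say we can't."""
        if sentence in GRAMMAR:
            return GRAMMAR[sentence]  # meaning preserved, grammar correct
        # A future extension could fall back on a statistical model here
        # instead of giving up.
        return "sorry, we can't translate this"

    print(translate("the patient has a fever"))  # der Patient hat Fieber
    print(translate("arbitrary input text"))     # sorry, we can't ...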

Another interesting talk was Fred Jelinek's (Johns Hopkins) chronicle of the emergence of statistical speech recognition and machine translation. He confirmed explicitly what Per Martin-Löf was once the first to point out to me: that the technique is based on Shannon's work on decryption, just like the first attempts at machine translation in the 1940s. What Jelinek and his team at IBM had in the 1980s was, first and foremost, more computational power, so that they could apply the models effectively. Another interesting fact was that DARPA was always explicitly interested in translation from other languages into English, because this was what intelligence wanted. Hence it is no coincidence that almost all the literature on statistical machine translation focuses on this direction. Another reason, I think, is that it is easier to reach good quality in this direction, which was also one of the points Och made.
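
The Shannon connection can be stated concretely in the standard noisy-channel formulation of statistical MT (a textbook formula, not something from Jelinek's slides): the foreign sentence f is viewed as an English sentence e that has passed through a noisy channel, and translation is decryption by Bayes' rule:

    \[ \hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(e)\,P(f \mid e) \]

where P(e) is an English language model and P(f|e) is the channel (translation) model. Incidentally, this also hints at why the into-English direction yields better quality: the language model P(e), which polishes the output, can be trained on the enormous amounts of English data available.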

In the statistical translation business, it seems to be a rule of the game that the engineers should know as little as possible about the languages; knowing a language would be cheating. One good effect of this rule might be that the methods are not biased toward particular languages, and can hence always be applied to new ones. On the other hand, many of the methods are in fact biased toward English (with its poor morphology and rigid word order), and I find it implausible that one can ever reach high quality without actually knowing how the languages work.

Outside the Natural Language programme, there were many interesting talks, and of course discussions with colleagues, both academics and Googlers. The event was extremely well organized. Taxis picked us up at Zurich airport on Monday morning and took us back there on Wednesday afternoon; buses took us from Google to the hotel late in the evening and back early in the morning (with an intermediate dinner stop at a nice restaurant on Tuesday). The Google premises were a pleasure to stay in, and the food was really good. The only thing I personally regret is that we weren't really given any chance to walk the recommended 10,000 steps a day, the figure many pedometers use as their default goal.