June 14th, 2012 | Published in Google Research
This past week, researchers from across the world descended on Montreal for the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). NAACL, like other Association for Computational Linguistics (ACL) meetings, is a premier meeting for researchers who study natural language processing (NLP). This includes applications such as machine translation and sentiment analysis, but also low-level language technologies such as the automatic analysis of morphology, syntax, semantics and discourse.
Like many applied fields in computer science, NLP underwent a transformation in the mid ‘90s from a primarily rule- and knowledge-based discipline to one whose methods are predominantly statistical and leverage advances in large data and machine learning. This trend continued at NAACL. Two common themes addressed a historical deficiency of machine-learned NLP systems: that they require expensive and difficult-to-obtain annotated data in order to achieve high accuracy. To this end, there were a number of studies on unsupervised and weakly-supervised learning for NLP systems, which aim to learn from large corpora containing little to no linguistic annotation, relying instead on observed regularities in the data or on easily obtainable annotations. This prompted much discussion during the question periods about how reliable services such as Mechanical Turk might be for gathering the detailed annotations needed for difficult language prediction tasks. Multilinguality in statistical systems was another common theme, as researchers have continued to shift their focus from building systems for resource-rich languages (e.g., English) to building systems for the rest of the world’s languages, many of which have no annotated resources at all. Work here ranged from focused studies of single languages to efforts to develop techniques for a wide variety of languages by leveraging morphology, parallel data and regularities across closely related languages.
There was also an abundance of papers on text analysis for non-traditional domains. These included the now-standard tracks on sentiment analysis, combined with a new focus on social media, and in particular NLP for microblogs. There was even a paper on predicting whether a given bill will pass committee in the U.S. Congress based on the text of the bill; the presentation of this paper included the entire video on how a bill becomes a law.
There were two keynote talks. The first, by Ed Hovy of the Information Sciences Institute at the University of Southern California, was on “A New Semantics: Merging Propositional and Distributional Information.” Prof. Hovy gave his insights into the challenge of bringing together distributional (statistical) lexical semantics and compositional semantics, a need espoused recently by many leaders in the field. The second, by James W. Pennebaker, was called “A, is, I, and, the: How our smallest words reveal the most about who we are.” A psychologist, Prof. Pennebaker represented the “outsider” keynote that typically draws a lot of interest from the audience, and he did not disappoint. He spoke about how the use of function words can yield interesting social observations. One example involved personal pronouns like “we”: rather than engaging the audience and making the speaker appear accessible, increased usage now tends to make listeners feel the speaker is colder and more distant. This is partly due to a second and increasingly common meaning of “we” that functions much like “you,” e.g., when a boss says: “We must increase sales.”
Finally, this year the organizers of NAACL decided to try something new called “NLP Idol.” The idea was to have four senior researchers in the community each select a paper from the past that they think will have (or should have) more impact on future directions of NLP research -- to pluck a paper from obscurity and bring it into the limelight. Each researcher presented their case, and three judges gave feedback American Idol-style, with Brian Roark hosting à la Ryan Seacrest. The winner was "PAM - A Program That Infers Intentions," published in Inside Computer Understanding in 1981 by Robert Wilensky, selected and presented by Ray Mooney. PAM (“Plan Applier Mechanism”) was a system for understanding agents and their plans and, more generally, what is happening in a discourse and why. Some of the questions that PAM could answer were astonishing, which reminded the audience (or me at least) that while statistical methods have brought NLP broader coverage, this often comes at the loss of the specificity and deep knowledge representation that earlier closed-world language understanding systems could achieve. This echoed sentiments in Prof. Hovy’s invited talk.
Ever since the early days of Google, Googlers have had a presence at NAACL and other ACL-affiliated events, and NAACL this year was no different. Googlers authored three papers at the conference, one of which merited the conference’s Best Full Paper Award and another the Best Student Paper Award:
Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure - Best Student Paper Award
Oscar Täckström (Google intern), Ryan McDonald (Googler), Jakob Uszkoreit (Googler)
Vine Pruning for Efficient Multi-Pass Dependency Parsing - Best Full Paper Award
Alexander Rush (Google intern) and Slav Petrov (Googler)
Unsupervised Translation Sense Clustering
Mohit Bansal (Google intern), John DeNero (Googler), Dekang Lin (Googler)
Many Googlers were also active participants in the NAACL workshops, June 7 - 8:
Workshop on Computational Linguistics for Literature
Organizers - David Elson (Googler), Anna Kazantseva, Rada Mihalcea, Stan Szpakowicz
Automatic Knowledge Base Construction/Workshop on Web-scale Knowledge Extraction
Invited Speaker - Fernando Pereira, Research Director (Googler)
Workshop on Inducing Linguistic Structure
Accepted Paper - Capitalization Cues Improve Dependency Grammar Induction
Valentin I. Spitkovsky (Googler), Hiyan Alshawi (Googler) and Daniel Jurafsky
Workshop on Statistical Machine Translation
Program Committee members - Keith Hall, Shankar Kumar, Zhifei Li, Klaus Macherey, Wolfgang Macherey, Bob Moore, Roy Tromble, Jakob Uszkoreit, Peng Xu, Richard Zens, Hao Zhang (Googlers)
Workshop on the Future of Language Modeling for HLT
Invited Speaker - Language Modeling at Google, Shankar Kumar (Googler)
Accepted Paper - Large-scale discriminative language model reranking for voice-search
Preethi Jyothi, Leif Johnson (Googler), Ciprian Chelba (Googler) and Brian Strope (Googler)
First Workshop on Syntactic Analysis of Non-Canonical Language
Invited Speaker - Keith Hall (Googler)
Shared Task Organizers - Slav Petrov, Ryan McDonald (Googlers)
Evaluation Metrics and System Comparison for Automatic Summarization
Program Committee member - Katja Filippova (Googler)