<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Google Data &#187; Research Blog</title>
	<atom:link href="https://googledata.org/author/research-blog/feed/" rel="self" type="application/rss+xml" />
	<link>https://googledata.org</link>
	<description>Everything Google: News, Products, Services, Content, Culture</description>
	<lastBuildDate>Wed, 28 Dec 2016 21:09:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.1.13</generator>
	<item>
		<title>Get moving with the new Motion Stills</title>
		<link>https://googledata.org/google-research/get-moving-with-the-new-motion-stills/</link>
		<comments>https://googledata.org/google-research/get-moving-with-the-new-motion-stills/#comments</comments>
		<pubDate>Thu, 15 Dec 2016 17:48:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=05befdd157c4afe90f05882b237f719a</guid>
		<description><![CDATA[<span>Posted by Matthias Grundmann and Ken Conley, Machine Perception</span><br /><br />Last June, we <a href="https://research.googleblog.com/2016/06/motion-stills-create-beautiful-gifs.html">released Motion Stills</a>, an <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&#38;mt=8">iOS app</a> that uses our video stabilization technology to create easily shareable GIFs from Apple Live Photos. Since then, we <a href="http://goo.gl/rvbUXP">integrated Motion Stills into Google Photos for iOS</a> and thought of ways to improve it, taking into account your ideas for new features.<br /><br />Today, we are happy to announce a major new update to the <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&#38;mt=8">Motion Stills app</a> that will help you create even more beautiful videos and fun GIFs using motion-tracked text overlays, super-resolution videos, and automatic <a href="https://en.wikipedia.org/wiki/Cinemagraph">cinemagraphs</a>.<br /><br /><b>Motion Text</b><br /><br />We&#8217;ve added motion text so you can create moving text effects, similar to what you might see in movies and TV shows, directly on your phone. With Motion Text, you can easily position text anywhere over your video to get the exact result you want. 
It only takes a second to initialize while you type, and it tracks at 1000 FPS throughout the whole Live Photo, so the process feels instantaneous.<br /><div><a href="https://1.bp.blogspot.com/-LJ8-y6cVzoI/WFLWZCY_qYI/AAAAAAAABdk/dmNsFzdlXqUmfpaex8BZN55XMfO_V0kdwCLcB/s1600/image00.gif"><img border="0" height="480" src="https://1.bp.blogspot.com/-LJ8-y6cVzoI/WFLWZCY_qYI/AAAAAAAABdk/dmNsFzdlXqUmfpaex8BZN55XMfO_V0kdwCLcB/s640/image00.gif" width="640"></a></div>To make this possible, we took the motion tracking technology that we run on YouTube servers for <a href="https://youtube-creators.googleblog.com/2016/02/blur-moving-objects-in-your-video-with.html">&#8220;Privacy Blur&#8221;</a>, and made it run even faster on your device. How? We first create motion metadata for your video by leveraging machine learning to classify foreground/background features as well as to model temporally coherent camera motion. We then take this metadata, and use it as input to an algorithm that can track individual objects while discriminating them from others. The algorithm models each object&#8217;s state, which includes its motion in space, an implicit appearance model (described as a set of its moving parts), and its centroid and extent, as shown in the figure below.<br /><div><a href="https://3.bp.blogspot.com/-gl64-qycaMo/WFLWibp3XRI/AAAAAAAABdo/goLh3izJ2PY4LmuGJlbHC869bh7qYQbpgCLcB/s1600/image01.gif"><img border="0" height="640" src="https://3.bp.blogspot.com/-gl64-qycaMo/WFLWibp3XRI/AAAAAAAABdo/goLh3izJ2PY4LmuGJlbHC869bh7qYQbpgCLcB/s640/image01.gif" width="480"></a></div><b>Enhance! your videos with better detail and loops</b><br /><br />Last month, <a href="https://research.googleblog.com/2016/11/enhance-raisr-sharp-images-with-machine.html">we published the details of our state-of-the-art RAISR technology</a>, which employs machine learning to create super-resolution detail in images. This technology is now available in Motion Stills, automatically sharpening every video you export. 
<br /><br />We are also going beyond stabilization to bring you fully automatic cinemagraphs. After freezing the background into a still photo, we analyze our result to optimize for the perfect loop transition. By considering a range of start and end frames, we build a matrix of transition scores between frame pairs. A significant minimum in this matrix reflects the perfect transition, resulting in an endless loop of motion stillness.<br /><div><a href="https://4.bp.blogspot.com/-Ijg0TobMGCA/WFLWodTqqgI/AAAAAAAABds/MVb0ZUTg_3EmQvTditRJCYsDU_NfkG6iwCLcB/s1600/image02.gif"><img border="0" height="480" src="https://4.bp.blogspot.com/-Ijg0TobMGCA/WFLWodTqqgI/AAAAAAAABds/MVb0ZUTg_3EmQvTditRJCYsDU_NfkG6iwCLcB/s640/image02.gif" width="640"></a></div><b>Continuing to improve the experience</b><br /><br />Thanks to your feedback, we&#8217;ve additionally rebuilt our navigation and added more tutorials. We&#8217;ve also added Apple&#8217;s 3D Touch to let you &#8220;peek and pop&#8221; clips in your stream and movie tray. Lots more is coming to address your top requests, so please <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&#38;mt=8">download the new release of Motion Stills</a> and keep sending us feedback with #motionstills on your favorite social media.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Matthias Grundmann and Ken Conley, Machine Perception</span><br /><br />Last June, we <a href="https://research.googleblog.com/2016/06/motion-stills-create-beautiful-gifs.html">released Motion Stills</a>, an <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&amp;mt=8">iOS app</a> that uses our video stabilization technology to create easily shareable GIFs from Apple Live Photos. Since then, we <a href="http://goo.gl/rvbUXP">integrated Motion Stills into Google Photos for iOS</a> and thought of ways to improve it, taking into account your ideas for new features.<br /><br />Today, we are happy to announce a major new update to the <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&amp;mt=8">Motion Stills app</a> that will help you create even more beautiful videos and fun GIFs using motion-tracked text overlays, super-resolution videos, and automatic <a href="https://en.wikipedia.org/wiki/Cinemagraph">cinemagraphs</a>.<br /><br /><b>Motion Text</b><br /><br />We’ve added motion text so you can create moving text effects, similar to what you might see in movies and TV shows, directly on your phone. With Motion Text, you can easily position text anywhere over your video to get the exact result you want. 
It only takes a second to initialize while you type, and it tracks at 1000 FPS throughout the whole Live Photo, so the process feels instantaneous.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-LJ8-y6cVzoI/WFLWZCY_qYI/AAAAAAAABdk/dmNsFzdlXqUmfpaex8BZN55XMfO_V0kdwCLcB/s1600/image00.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://1.bp.blogspot.com/-LJ8-y6cVzoI/WFLWZCY_qYI/AAAAAAAABdk/dmNsFzdlXqUmfpaex8BZN55XMfO_V0kdwCLcB/s640/image00.gif" width="640" /></a></div>To make this possible, we took the motion tracking technology that we run on YouTube servers for <a href="https://youtube-creators.googleblog.com/2016/02/blur-moving-objects-in-your-video-with.html">“Privacy Blur”</a>, and made it run even faster on your device. How? We first create motion metadata for your video by leveraging machine learning to classify foreground/background features as well as to model temporally coherent camera motion. We then take this metadata, and use it as input to an algorithm that can track individual objects while discriminating them from others. The algorithm models each object’s state, which includes its motion in space, an implicit appearance model (described as a set of its moving parts), and its centroid and extent, as shown in the figure below.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-gl64-qycaMo/WFLWibp3XRI/AAAAAAAABdo/goLh3izJ2PY4LmuGJlbHC869bh7qYQbpgCLcB/s1600/image01.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://3.bp.blogspot.com/-gl64-qycaMo/WFLWibp3XRI/AAAAAAAABdo/goLh3izJ2PY4LmuGJlbHC869bh7qYQbpgCLcB/s640/image01.gif" width="480" /></a></div><b>Enhance! 
your videos with better detail and loops</b><br /><br />Last month, <a href="https://research.googleblog.com/2016/11/enhance-raisr-sharp-images-with-machine.html">we published the details of our state-of-the-art RAISR technology</a>, which employs machine learning to create super-resolution detail in images. This technology is now available in Motion Stills, automatically sharpening every video you export. <br /><br />We are also going beyond stabilization to bring you fully automatic cinemagraphs. After freezing the background into a still photo, we analyze our result to optimize for the perfect loop transition. By considering a range of start and end frames, we build a matrix of transition scores between frame pairs. A significant minimum in this matrix reflects the perfect transition, resulting in an endless loop of motion stillness.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-Ijg0TobMGCA/WFLWodTqqgI/AAAAAAAABds/MVb0ZUTg_3EmQvTditRJCYsDU_NfkG6iwCLcB/s1600/image02.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://4.bp.blogspot.com/-Ijg0TobMGCA/WFLWodTqqgI/AAAAAAAABds/MVb0ZUTg_3EmQvTditRJCYsDU_NfkG6iwCLcB/s640/image02.gif" width="640" /></a></div><b>Continuing to improve the experience</b><br /><br />Thanks to your feedback, we’ve additionally rebuilt our navigation and added more tutorials. We’ve also added Apple’s 3D Touch to let you “peek and pop” clips in your stream and movie tray. Lots more is coming to address your top requests, so please <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&amp;mt=8">download the new release of Motion Stills</a> and keep sending us feedback with #motionstills on your favorite social media.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/get-moving-with-the-new-motion-stills/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>App Discovery With Google Play, Part 2: Personalized Recommendations with Related Apps</title>
		<link>https://googledata.org/google-research/app-discovery-with-google-play-part-2-personalized-recommendations-with-related-apps/</link>
		<comments>https://googledata.org/google-research/app-discovery-with-google-play-part-2-personalized-recommendations-with-related-apps/#comments</comments>
		<pubDate>Wed, 14 Dec 2016 19:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=0ac823a8a9eab3e5ecdf53179c178e8b</guid>
		<description><![CDATA[<span>Posted by Ananth Balashankar &#38; Levent Koc, Software Engineers, and Norberto Guimaraes, Product Manager</span><br /><br /><i>In <a href="https://research.googleblog.com/2016/11/app-discovery-with-google-play-part-1.html">Part 1 of this series</a> on app discovery, we discussed using machine learning to gain a deeper understanding of the topics associated with an app, in order to provide a better search and discovery experience on the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps Store</a>. In this post, we discuss a deep learning framework to provide personalized recommendations to users based on their previous app downloads and the context in which they are used. </i><br /><br />Providing useful and relevant app recommendations to visitors of the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps Store</a> is a key goal of our apps discovery team. An <a href="https://research.googleblog.com/2016/11/app-discovery-with-google-play-part-1.html">understanding of the topics associated with an app</a>, however, is only one part of creating a system that best serves the user. In order to create a better overall experience, one must also take into account the tastes of the user and provide personalized recommendations. If one didn&#8217;t, the &#8220;You might also like&#8221; recommendation would look the same for everyone! <br /><br />Discovering these nuances requires both an understanding of what an app does and the context of the app with respect to the user. For example, to an avid sci-fi gamer, similar game recommendations may be of interest, but if a user installs a fitness app, recommending a health recipe app may be more relevant than five more fitness apps. 
As users may be more interested in downloading an app or game that complements one they already have installed, we provide recommendations based on app relatedness with each other (&#8220;You might also like&#8221;), in addition to providing recommendations based on the topic associated with an app (&#8220;Similar apps&#8221;). <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-YezY_DnbD04/WFGFJKsVUVI/AAAAAAAABdA/SxGmMuwun6s1wcFujOtTvsFQ11y7iBzCACLcB/s1600/Fig.1.png"><img border="0" height="530" src="https://3.bp.blogspot.com/-YezY_DnbD04/WFGFJKsVUVI/AAAAAAAABdA/SxGmMuwun6s1wcFujOtTvsFQ11y7iBzCACLcB/s640/Fig.1.png" width="640"></a></td></tr><tr><td>Suggestions of similar apps and apps that you also might like shown both before making an install decision (left) and while the current install is in progress (right).</td></tr></tbody></table><br />One particularly strong contextual signal is app relatedness, based on previous installs and search query clicks. As an example, a user who has searched for and plays a lot of graphics-heavy games likely has a preference for apps which are also graphically intense rather than apps with simpler graphics. So, when this user installs a car racing game, the &#8220;You might also like&#8221; suggestions include apps which relate to the &#8220;seed&#8221; app (because they are graphically intense racing games) ranked higher than racing apps with simpler graphics. This allows for a finer level of personalization where the characteristics of the apps are matched with the preferences of the user.<br /><br />To incorporate this app relatedness in our recommendations, we take a two-pronged approach: (a) offline candidate generation, i.e. 
the generation of the potential related apps that other users have downloaded, in addition to the app in question, and (b) online personalized re-ranking, where we re-rank these candidates using a personalized ML model.<br /><br /><b>Offline Candidate Generation</b><br />The problem of finding related apps can be formulated as a <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search">nearest neighbor search</a> problem. Given an app X, we want to find the k nearest apps. In the case of &#8220;You might also like&#8221;, a naive approach would be one based on counting, where if many people installed apps X and Y, then the app Y would be used as a candidate for seed app X. However, this approach is intractable as it is difficult to learn and generalize effectively in the huge problem space. Given that there are over a million apps on Google Play, the total number of possible app pairs is over 10<sup>12</sup>. <br /><br />To solve this, we trained a deep neural network to predict the next app installed by the user given their previous installs. Output <a href="https://en.wikipedia.org/wiki/Embedding">embeddings</a> at the final layer of this deep neural network generally represent the types of apps a given user has installed. We then apply the nearest neighbor algorithm to find related apps for a given seed app in the trained embedding space. Thus, we perform dimensionality reduction by representing apps using embeddings to help prune the space of potential candidates.<br /><br /><b>Online Personalized Re-ranking</b><br />The candidates generated in the previous step represent relatedness along multiple dimensions. The objective is to assign scores to the candidates so they can be re-ranked in a personalized way, in order to provide an experience that is crafted to the user&#8217;s overall interests and yet maintain relevance for the user installing a given app. 
In order to do this, we take the characteristics of the app candidates as input to a separate deep neural network, which is then trained in real time with user-specific context features (region, language, app store search queries, etc.) to predict the likelihood of a related app being specifically relevant to the user.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-MxUhl0cs0F0/WFGG_XIEdVI/AAAAAAAABdI/U7oEdWud80kdp258pAI8kN85niKE1SX4ACLcB/s1600/image00.png"><img border="0" height="282" src="https://4.bp.blogspot.com/-MxUhl0cs0F0/WFGG_XIEdVI/AAAAAAAABdI/U7oEdWud80kdp258pAI8kN85niKE1SX4ACLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Architecture for personalized related apps</td></tr></tbody></table><br />One of the takeaways from this work is that re-ranking content, like related apps, is one of the critical avenues for app discovery in the store, and can bring great value to the user without impacting perceived relevance. Compared to the control (where no re-ranking was done), we saw a 20% increase in the app install rate from the &#8220;You might also like&#8221; suggestions. This had no user-perceivable change in latency.<br /><br />In Part 3 of this series, we will discuss how we employ machine learning to keep bad actors who try to manipulate the signals we use for search and personalization at bay.<br /><br /><b>Acknowledgements</b><br />This work was done within the Google Play team in collaboration with Halit Erdogan, Mark Taylor, Michael Watson, Huazhong Ning, Stan Bileschi, John Kraemer, and Chuan Yu Foo.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Ananth Balashankar &amp; Levent Koc, Software Engineers, and Norberto Guimaraes, Product Manager</span><br /><br /><i>In <a href="https://research.googleblog.com/2016/11/app-discovery-with-google-play-part-1.html">Part 1 of this series</a> on app discovery, we discussed using machine learning to gain a deeper understanding of the topics associated with an app, in order to provide a better search and discovery experience on the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps Store</a>. In this post, we discuss a deep learning framework to provide personalized recommendations to users based on their previous app downloads and the context in which they are used. </i><br /><br />Providing useful and relevant app recommendations to visitors of the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps Store</a> is a key goal of our apps discovery team. An <a href="https://research.googleblog.com/2016/11/app-discovery-with-google-play-part-1.html">understanding of the topics associated with an app</a>, however, is only one part of creating a system that best serves the user. In order to create a better overall experience, one must also take into account the tastes of the user and provide personalized recommendations. If one didn’t, the “You might also like” recommendation would look the same for everyone! <br /><br />Discovering these nuances requires both an understanding of what an app does and the context of the app with respect to the user. For example, to an avid sci-fi gamer, similar game recommendations may be of interest, but if a user installs a fitness app, recommending a health recipe app may be more relevant than five more fitness apps. 
As users may be more interested in downloading an app or game that complements one they already have installed, we provide recommendations based on app relatedness with each other (“You might also like”), in addition to providing recommendations based on the topic associated with an app (“Similar apps”). <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-YezY_DnbD04/WFGFJKsVUVI/AAAAAAAABdA/SxGmMuwun6s1wcFujOtTvsFQ11y7iBzCACLcB/s1600/Fig.1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="530" src="https://3.bp.blogspot.com/-YezY_DnbD04/WFGFJKsVUVI/AAAAAAAABdA/SxGmMuwun6s1wcFujOtTvsFQ11y7iBzCACLcB/s640/Fig.1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Suggestions of similar apps and apps that you also might like shown both before making an install decision (left) and while the current install is in progress (right).</td></tr></tbody></table><br />One particularly strong contextual signal is app relatedness, based on previous installs and search query clicks. As an example, a user who has searched for and plays a lot of graphics-heavy games likely has a preference for apps which are also graphically intense rather than apps with simpler graphics. So, when this user installs a car racing game, the “You might also like” suggestions include apps which relate to the “seed” app (because they are graphically intense racing games) ranked higher than racing apps with simpler graphics. This allows for a finer level of personalization where the characteristics of the apps are matched with the preferences of the user.<br /><br />To incorporate this app relatedness in our recommendations, we take a two-pronged approach: (a) offline candidate generation, i.e. 
the generation of the potential related apps that other users have downloaded, in addition to the app in question, and (b) online personalized re-ranking, where we re-rank these candidates using a personalized ML model.<br /><br /><b>Offline Candidate Generation</b><br />The problem of finding related apps can be formulated as a <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search">nearest neighbor search</a> problem. Given an app X, we want to find the k nearest apps. In the case of “You might also like”, a naive approach would be one based on counting, where if many people installed apps X and Y, then the app Y would be used as a candidate for seed app X. However, this approach is intractable as it is difficult to learn and generalize effectively in the huge problem space. Given that there are over a million apps on Google Play, the total number of possible app pairs is over 10<sup>12</sup>. <br /><br />To solve this, we trained a deep neural network to predict the next app installed by the user given their previous installs. Output <a href="https://en.wikipedia.org/wiki/Embedding">embeddings</a> at the final layer of this deep neural network generally represent the types of apps a given user has installed. We then apply the nearest neighbor algorithm to find related apps for a given seed app in the trained embedding space. Thus, we perform dimensionality reduction by representing apps using embeddings to help prune the space of potential candidates.<br /><br /><b>Online Personalized Re-ranking</b><br />The candidates generated in the previous step represent relatedness along multiple dimensions. The objective is to assign scores to the candidates so they can be re-ranked in a personalized way, in order to provide an experience that is crafted to the user’s overall interests and yet maintain relevance for the user installing a given app. 
In order to do this, we take the characteristics of the app candidates as input to a separate deep neural network, which is then trained in real time with user-specific context features (region, language, app store search queries, etc.) to predict the likelihood of a related app being specifically relevant to the user.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-MxUhl0cs0F0/WFGG_XIEdVI/AAAAAAAABdI/U7oEdWud80kdp258pAI8kN85niKE1SX4ACLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="282" src="https://4.bp.blogspot.com/-MxUhl0cs0F0/WFGG_XIEdVI/AAAAAAAABdI/U7oEdWud80kdp258pAI8kN85niKE1SX4ACLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Architecture for personalized related apps</td></tr></tbody></table><br />One of the takeaways from this work is that re-ranking content, like related apps, is one of the critical avenues for app discovery in the store, and can bring great value to the user without impacting perceived relevance. Compared to the control (where no re-ranking was done), we saw a 20% increase in the app install rate from the “You might also like” suggestions. This had no user-perceivable change in latency.<br /><br />In Part 3 of this series, we will discuss how we employ machine learning to keep bad actors who try to manipulate the signals we use for search and personalization at bay.<br /><br /><b>Acknowledgements</b><br />This work was done within the Google Play team in collaboration with Halit Erdogan, Mark Taylor, Michael Watson, Huazhong Ning, Stan Bileschi, John Kraemer, and Chuan Yu Foo.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/app-discovery-with-google-play-part-2-personalized-recommendations-with-related-apps/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open sourcing the Embedding Projector: a tool for visualizing high dimensional data</title>
		<link>https://googledata.org/google-research/open-sourcing-the-embedding-projector-a-tool-for-visualizing-high-dimensional-data/</link>
		<comments>https://googledata.org/google-research/open-sourcing-the-embedding-projector-a-tool-for-visualizing-high-dimensional-data/#comments</comments>
		<pubDate>Wed, 07 Dec 2016 09:15:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=021b4e29eb805e90b44dfc4724562ecc</guid>
		<description><![CDATA[<span>Posted by Daniel Smilkov and the Big Picture group</span> <br /><br />Recent advances in Machine Learning (ML) have shown impressive results, with applications including <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">image recognition</a>, <a href="https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html">language translation</a>, <a href="https://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html">medical diagnosis</a>, and more. With the widespread adoption of ML systems, it is increasingly important for research scientists to be able to explore how the data is being interpreted by the models. However, one of the main challenges in exploring this data is that it often has hundreds or even thousands of dimensions, requiring special tools to investigate the space. <br /><br />To enable a more intuitive exploration process, we are <a href="https://www.tensorflow.org/versions/master/how_tos/embedding_viz/index.html">open-sourcing the Embedding Projector</a>, a web application for interactive visualization and analysis of high-dimensional data recently shown as an <a href="https://aiexperiments.withgoogle.com/visualizing-high-dimensional-space">A.I. Experiment</a>, as part of <a href="https://www.tensorflow.org/">TensorFlow</a>.  
We are also releasing a standalone version at <a href="http://projector.tensorflow.org/">projector.tensorflow.org</a>, where users can visualize their high-dimensional data without the need to install and run TensorFlow.<br /><br /><div><a href="https://2.bp.blogspot.com/-yL_425HS2ck/WEDZLk5cq0I/AAAAAAAABcI/kwy4F4Cmfi4jyG_InIiYu6F7y2-BKTXWQCLcB/s1600/embedding-mnist.gif"><img border="0" height="324" src="https://2.bp.blogspot.com/-yL_425HS2ck/WEDZLk5cq0I/AAAAAAAABcI/kwy4F4Cmfi4jyG_InIiYu6F7y2-BKTXWQCLcB/s640/embedding-mnist.gif" width="640"></a></div><br /><b>Exploring Embeddings</b><br /><br />The data needed to train machine learning systems comes in a form that computers don't immediately understand. To translate the things we understand naturally (e.g. words, sounds, or videos) to a form that the algorithms can process, we use <i><a href="https://en.wikipedia.org/wiki/Embedding">embeddings</a></i>, a mathematical vector representation that captures different facets (dimensions) of the data. For example, in <a href="https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html">this language embedding</a>, similar words are mapped to points that are close to each other.<br /><br />With the Embedding Projector, you can navigate through views of data in either a 2D or a 3D mode, zooming, rotating, and panning using natural click-and-drag gestures. Below is a figure showing the nearest points to the embedding for the word &#8220;important&#8221; after training a TensorFlow model using the <a href="https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html">word2vec tutorial</a>. Clicking on any point (which represents the learned embedding for a given word) in this visualization brings up a list of nearest points and distances, which shows which words the algorithm has learned to be semantically related. 
This type of interaction represents an important way in which one can explore how an algorithm is performing.<br /><br /><div><a href="https://2.bp.blogspot.com/-Uql7bl2KEYM/WEfQ4Kl_0YI/AAAAAAAABck/GkktuPM8KoMcMl2Tot6GzH3-NgwPNETMgCLcB/s1600/image03.png"><img border="0" height="424" src="https://2.bp.blogspot.com/-Uql7bl2KEYM/WEfQ4Kl_0YI/AAAAAAAABck/GkktuPM8KoMcMl2Tot6GzH3-NgwPNETMgCLcB/s640/image03.png" width="640"></a></div><b>Methods of Dimensionality Reduction</b><br /><br />The Embedding Projector offers three commonly used methods of data dimensionality reduction, which allow easier visualization of complex data: <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>, <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> and custom linear projections. <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> is often effective at exploring the internal structure of the embeddings, revealing the most influential dimensions in the data. <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a>, on the other hand, is useful for exploring local neighborhoods and finding clusters, allowing developers to make sure that an embedding preserves the meaning in the data (e.g. in the <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST dataset</a>, seeing that the same digits are clustered together). 
Finally, custom linear projections can help discover meaningful "directions" in data sets - such as the distinction between a formal and casual tone in a language generation model - which would allow the design of more adaptable ML systems.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-5vEgY1mh1cA/WEfRAmER3iI/AAAAAAAABco/beMK-6LNq2M37QOUGQVwXMT1B6FIMLAxgCLcB/s1600/image00.png"><img border="0" height="468" src="https://1.bp.blogspot.com/-5vEgY1mh1cA/WEfRAmER3iI/AAAAAAAABco/beMK-6LNq2M37QOUGQVwXMT1B6FIMLAxgCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>A custom linear projection of the 100 nearest points of "See attachments." onto the "yes" - "yeah" vector (&#8220;yes&#8221; is right, &#8220;yeah&#8221; is left) of a corpus of <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">35k frequently used phrases in emails</a></td></tr></tbody></table>The Embedding Projector <a href="http://projector.tensorflow.org/">website</a> includes a few datasets to play with. We&#8217;ve also made it easy for users to publish and share their embeddings with others (just click on the &#8220;Publish&#8221; button on the left pane). It is our hope that the <a href="https://www.tensorflow.org/versions/master/how_tos/embedding_viz/index.html">Embedding Projector</a> will be a useful tool to help the research community explore and refine their ML applications, as well as enable anyone to better understand how ML algorithms interpret data. If you'd like to get the full details on the Embedding Projector, you can read the paper <a href="https://arxiv.org/pdf/1611.05469v1.pdf">here</a>. Have fun exploring the world of embeddings!<br /><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Daniel Smilkov and the Big Picture group</span> <br /><br />Recent advances in Machine Learning (ML) have shown impressive results, with applications including <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">image recognition</a>, <a href="https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html">language translation</a>, <a href="https://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html">medical diagnosis</a>, and more. With the widespread adoption of ML systems, it is increasingly important for research scientists to be able to explore how the data is being interpreted by the models. However, one of the main challenges in exploring this data is that it often has hundreds or even thousands of dimensions, requiring special tools to investigate the space. <br /><br />To enable a more intuitive exploration process, we are <a href="https://www.tensorflow.org/versions/master/how_tos/embedding_viz/index.html">open-sourcing the Embedding Projector</a>, a web application for interactive visualization and analysis of high-dimensional data recently shown as an <a href="https://aiexperiments.withgoogle.com/visualizing-high-dimensional-space">A.I. Experiment</a>, as part of <a href="https://www.tensorflow.org/">TensorFlow</a>.  
We are also releasing a standalone version at <a href="http://projector.tensorflow.org/">projector.tensorflow.org</a>, where users can visualize their high-dimensional data without the need to install and run TensorFlow.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-yL_425HS2ck/WEDZLk5cq0I/AAAAAAAABcI/kwy4F4Cmfi4jyG_InIiYu6F7y2-BKTXWQCLcB/s1600/embedding-mnist.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="324" src="https://2.bp.blogspot.com/-yL_425HS2ck/WEDZLk5cq0I/AAAAAAAABcI/kwy4F4Cmfi4jyG_InIiYu6F7y2-BKTXWQCLcB/s640/embedding-mnist.gif" width="640" /></a></div><br /><b>Exploring Embeddings</b><br /><br />The data needed to train machine learning systems comes in a form that computers don't immediately understand. To translate the things we understand naturally (e.g. words, sounds, or videos) to a form that the algorithms can process, we use <i><a href="https://en.wikipedia.org/wiki/Embedding">embeddings</a></i>, mathematical vector representations that capture different facets (dimensions) of the data. For example, in <a href="https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html">this language embedding</a>, similar words are mapped to points that are close to each other.<br /><br />With the Embedding Projector, you can navigate through views of data in either a 2D or a 3D mode, zooming, rotating, and panning using natural click-and-drag gestures. Below is a figure showing the nearest points to the embedding for the word “important” after training a TensorFlow model using the <a href="https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html">word2vec tutorial</a>. Clicking on any point (which represents the learned embedding for a given word) in this visualization brings up a list of nearest points and distances, which shows which words the algorithm has learned to be semantically related. 
This type of interaction represents an important way in which one can explore how an algorithm is performing.<br /><b><br /></b> <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-Uql7bl2KEYM/WEfQ4Kl_0YI/AAAAAAAABck/GkktuPM8KoMcMl2Tot6GzH3-NgwPNETMgCLcB/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="424" src="https://2.bp.blogspot.com/-Uql7bl2KEYM/WEfQ4Kl_0YI/AAAAAAAABck/GkktuPM8KoMcMl2Tot6GzH3-NgwPNETMgCLcB/s640/image03.png" width="640" /></a></div><b>Methods of Dimensionality Reduction</b><br /><br />The Embedding Projector offers three commonly used methods of data dimensionality reduction, which allow easier visualization of complex data: <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>, <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> and custom linear projections. <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> is often effective at exploring the internal structure of the embeddings, revealing the most influential dimensions in the data. <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a>, on the other hand, is useful for exploring local neighborhoods and finding clusters, allowing developers to make sure that an embedding preserves the meaning in the data (e.g. in the <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST dataset</a>, seeing that the same digits are clustered together). 
Finally, custom linear projections can help discover meaningful "directions" in data sets - such as the distinction between a formal and casual tone in a language generation model - which would allow the design of more adaptable ML systems.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-5vEgY1mh1cA/WEfRAmER3iI/AAAAAAAABco/beMK-6LNq2M37QOUGQVwXMT1B6FIMLAxgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="468" src="https://1.bp.blogspot.com/-5vEgY1mh1cA/WEfRAmER3iI/AAAAAAAABco/beMK-6LNq2M37QOUGQVwXMT1B6FIMLAxgCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A custom linear projection of the 100 nearest points of "See attachments." onto the "yes" - "yeah" vector (“yes” is right, “yeah” is left) of a corpus of <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">35k frequently used phrases in emails</a></td></tr></tbody></table>The Embedding Projector <a href="http://projector.tensorflow.org/">website</a> includes a few datasets to play with. We’ve also made it easy for users to publish and share their embeddings with others (just click on the “Publish” button on the left pane). It is our hope that the <a href="https://www.tensorflow.org/versions/master/how_tos/embedding_viz/index.html">Embedding Projector</a> will be a useful tool to help the research community explore and refine their ML applications, as well as enable anyone to better understand how ML algorithms interpret data. If you'd like to get the full details on the Embedding Projector, you can read the paper <a href="https://arxiv.org/pdf/1611.05469v1.pdf">here</a>. Have fun exploring the world of embeddings!<br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/open-sourcing-the-embedding-projector-a-tool-for-visualizing-high-dimensional-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NIPS 2016 &amp; Research at Google</title>
		<link>https://googledata.org/google-research/nips-2016-research-at-google/</link>
		<comments>https://googledata.org/google-research/nips-2016-research-at-google/#comments</comments>
		<pubDate>Mon, 05 Dec 2016 07:34:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=75dd52252a6cf125a25c0e3b1545e647</guid>
		<description><![CDATA[<span>Posted by Doug Eck, Research Scientist, Google Brain Team</span><br /><br />This week, Barcelona hosts the <a href="https://nips.cc/Conferences/2016">30<sup>th</sup> Annual Conference on Neural Information Processing Systems</a> (NIPS 2016), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2016, with over 280 Googlers attending in order to contribute to and learn from the broader academic research community by presenting technical talks and posters, in addition to hosting workshops and tutorials.<br /><br />Research at Google is at the forefront of innovation in <a href="http://research.google.com/pubs/MachineIntelligence.html">Machine Intelligence</a>, actively exploring virtually all aspects of machine learning including classical algorithms as well as cutting-edge techniques such as <a href="http://g.co/brain">deep learning</a>. Focusing on both theory as well as application, much of our work on language understanding, speech, translation, visual processing, ranking, and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and develop learning approaches to understand and generalize. <br /><br />If you are attending NIPS 2016, we hope you&#8217;ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people, and to see demonstrations of some of the exciting research we pursue. 
You can also learn more about our work being presented at NIPS 2016 in the list below (Googlers highlighted in <span><span>blue</span></span>).<br /><br />Google is a Platinum Sponsor of NIPS 2016.<br /><br /><b><u>Organizing Committee</u></b><br />Executive Board includes: <i><span>Corinna Cortes, Fernando Pereira</span></i><br />Advisory Board includes: <i><span>John C. Platt</span></i><br />Area Chairs include: <i><span>John Shlens</span></i><i>,</i><i><span>&#160;Moritz Hardt</span></i><i>,</i><i><span>&#160;Navdeep Jaitly</span></i><i>,&#160;</i><i><span>Hugo Larochelle</span></i><i>,</i><i><span>&#160;Honglak Lee</span></i><i>,</i><i><span>&#160;Sanjiv Kumar</span></i><i>,</i><i><span>&#160;Gal Chechik</span></i><br /><br /><b><u>Invited Talk</u></b><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6194">Dynamic Legged Robots</a><br /><i><span>Marc Raibert</span></i><br /><br /><b><u>Accepted Papers:</u></b><br /><a href="http://papers.nips.cc/paper/6336-boosting-with-abstention">Boosting with Abstention</a><br /><i><span>Corinna Cortes</span>, Giulia DeSalvo, <span>Mehryar Mohri</span></i><br /><br /><a href="http://papers.nips.cc/paper/6173-community-detection-on-evolving-graphs">Community Detection on Evolving Graphs</a><br /><i>Stefano Leonardi, Aris Anagnostopoulos, Jakub &#321;&#261;cki, <span>Silvio Lattanzi</span>, <span>Mohammad Mahdian</span></i><br /><br /><a href="http://papers.nips.cc/paper/6500-linear-relaxations-for-finding-diverse-elements-in-metric-spaces">Linear Relaxations for Finding Diverse Elements in Metric Spaces</a><br /><i>Aditya Bhaskara, Mehrdad Ghadiri, <span>Vahab Mirrokni</span>, Ola Svensson</i><br /><br /><a href="http://papers.nips.cc/paper/6535-nearly-isometric-embedding-by-relaxation">Nearly Isometric Embedding by Relaxation</a><br /><i>James McQueen, Marina Meila, <span>Dominique Joncas</span></i><br /><a href="http://papers.nips.cc/paper/6429-optimistic-bandit-convex-optimization"><br /></a> <a 
href="http://papers.nips.cc/paper/6429-optimistic-bandit-convex-optimization">Optimistic Bandit Convex Optimization</a><br /><i><span>Mehryar Mohri</span>, Scott Yang</i><br /><br /><a href="http://papers.nips.cc/paper/6547-reward-augmented-maximum-likelihood-for-neural-structured-prediction">Reward Augmented Maximum Likelihood for Neural Structured Prediction</a><br /><i><span>Mohammad Norouzi</span></i><i>,</i><i><span>&#160;Samy Bengio</span></i><i>,</i><i><span>&#160;Zhifeng Chen</span></i><i>,</i><i><span>&#160;Navdeep Jaitly</span></i><i>,</i><i><span>&#160;Mike Schuster</span></i><i>,</i><i><span>&#160;Yonghui Wu</span></i><i>,</i><i><span>&#160;Dale Schuurmans</span></i><br /><a href="http://papers.nips.cc/paper/6359-stochastic-gradient-mcmc-with-stale-gradients"><br /></a> <a href="http://papers.nips.cc/paper/6359-stochastic-gradient-mcmc-with-stale-gradients">Stochastic Gradient MCMC with Stale Gradients</a><br /><i>Changyou Chen, <span>Nan Ding</span>, Chunyuan Li, Yizhe Zhang, Lawrence Carin</i><br /><br /><a href="http://papers.nips.cc/paper/6161-unsupervised-learning-for-physical-interaction-through-video-prediction">Unsupervised Learning for Physical Interaction through Video Prediction</a><br /><i><span>Chelsea Finn</span><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>, Ian Goodfellow, <span>Sergey Levine</span></i><br /><br /><a href="http://papers.nips.cc/paper/6057-using-fast-weights-to-attend-to-the-recent-past">Using Fast Weights to Attend to the Recent Past</a><br /><i>Jimmy Ba, <span>Geoffrey Hinton</span>, Volodymyr Mnih, Joel Leibo, Catalin Ionescu</i><br /><br /><a href="http://papers.nips.cc/paper/6256-a-credit-assignment-compiler-for-joint-prediction">A Credit Assignment Compiler for Joint Prediction</a><br /><i>Kai-Wei Chang, He He, <span>Stephane Ross</span>, Hal Daum&#233; III</i><br /><br /><a href="http://papers.nips.cc/paper/6594-a-neural-transducer">A Neural Transducer</a><br /><i><span>Navdeep Jaitly</span>, 
<span>Quoc Le</span>, Oriol Vinyals, Ilya Sutskever, <span>David Sussillo</span>, <span>Samy Bengio</span></i><br /><br /><a href="http://papers.nips.cc/paper/6230-attend-infer-repeat-fast-scene-understanding-with-generative-models">Attend, Infer, Repeat: Fast Scene Understanding with Generative Models</a><br /><i>S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, <span>Geoffrey Hinton</span></i><br /><br /><a href="http://papers.nips.cc/paper/6085-bi-objective-online-matching-and-submodular-allocations">Bi-Objective Online Matching and Submodular Allocations</a><br /><i>Hossein Esfandiari, <span>Nitish Korula</span>, <span>Vahab Mirrokni</span></i><br /><br /><a href="http://papers.nips.cc/paper/6595-combinatorial-energy-learning-for-image-segmentation">Combinatorial Energy Learning for Image Segmentation</a><br /><i><span>Jeremy Maitin-Shepard</span></i><i>,</i><i><span>&#160;Viren Jain</span></i><i>,</i><i><span>&#160;Michal Januszewski</span></i><i>,</i><i><span>&#160;Peter Li</span>, Pieter Abbeel</i><br /><br /><a href="http://papers.nips.cc/paper/6315-deep-learning-games">Deep Learning Games</a><br /><i><span>Dale Schuurmans</span>, <span>Martin Zinkevich</span></i><br /><br /><a href="http://papers.nips.cc/paper/6280-deepmath-deep-sequence-models-for-premise-selection">DeepMath - Deep Sequence Models for Premise Selection</a><br /><i><span>Geoffrey Irving</span></i><i>,</i><i><span>&#160;Christian Szegedy</span></i><i>,</i><i><span>&#160;Niklas Een</span></i><i>,</i><i><span>&#160;Alexander Alemi</span></i><i>,</i><i><span>&#160;Fran&#231;ois Chollet</span>, Josef Urban</i><br /><br /><a href="http://papers.nips.cc/paper/6217-density-estimation-via-discrepancy-based-adaptive-sequential-partition">Density Estimation via Discrepancy Based Adaptive Sequential Partition</a><br /><i>Dangna Li, <span>Kun Yang</span>, Wing Wong</i><br /><br /><a 
href="http://papers.nips.cc/paper/6254-domain-separation-networks">Domain Separation Networks</a><br /><i><span>Konstantinos Bousmalis</span>, George Trigeorgis, <span>Nathan Silberman</span></i><i>,&#160;</i><i><span>&#160;Dilip Krishnan</span></i><i>,&#160;</i><i><span>Dumitru Erhan</span></i><br /><a href="http://papers.nips.cc/paper/6540-fast-distributed-submodular-cover-public-private-data-summarization"><br /></a> <a href="http://papers.nips.cc/paper/6540-fast-distributed-submodular-cover-public-private-data-summarization">Fast Distributed Submodular Cover: Public-Private Data Summarization </a><br /><i>Baharan Mirzasoleiman, <span>Morteza Zadimoghaddam</span>, Amin Karbasi</i><br /><br /><a href="http://papers.nips.cc/paper/6316-satisfying-real-world-goals-with-dataset-constraints">Satisfying Real-world Goals with Dataset Constraints</a><br /><i>Gabriel Goh, <span>Andrew Cotter</span>, <span>Maya Gupta</span>, Michael P Friedlander</i><br /><br /><a href="http://papers.nips.cc/paper/6295-can-active-memory-replace-attention">Can Active Memory Replace Attention?</a><br /><i><span>&#321;ukasz Kaiser</span>, <span>Samy Bengio</span></i><br /><br /><a href="http://papers.nips.cc/paper/6377-fast-and-flexible-monotonic-functions-with-ensembles-of-lattices">Fast and Flexible Monotonic Functions with Ensembles of Lattices</a><br /><i><span>Kevin Canini</span></i><i>,&#160;</i><i><span>&#160;Andy Cotter</span></i><i>,&#160;</i><i><span>&#160;Maya Gupta</span></i><i>,&#160;</i><i><span>&#160;Mahdi Fard</span></i><i>,&#160;</i><i><span>&#160;Jan Pfeifer</span></i> <br /><br /><a href="http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn">Launch and Iterate: Reducing Prediction Churn</a><br /><i>Quentin Cormier, <span>Mahdi Fard, Kevin Canini, Maya Gupta</span></i><br /><a href="http://papers.nips.cc/paper/6078-on-mixtures-of-markov-chains"><br /></a> <a href="http://papers.nips.cc/paper/6078-on-mixtures-of-markov-chains">On Mixtures of Markov 
Chains</a><br /><i>Rishi Gupta, <span>Ravi Kumar</span>, <span>Sergei Vassilvitskii</span></i><br /><br /><a href="http://papers.nips.cc/paper/6246-orthogonal-random-features">Orthogonal Random Features</a><br /><i><span>Felix Xinnan Yu</span></i><i>,&#160;</i><i><span>&#160;Ananda Theertha Suresh</span></i><i>,&#160;</i><i><span>&#160;Krzysztof Choromanski</span></i><i>,&#160;</i><i><span>&#160;Dan Holtmann-Rice</span></i><i>,&#160;</i><i><span><br />Sanjiv Kumar</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=7241">Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision</a><br /><i>Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, <span>Honglak Lee</span></i><br /><br /><a href="http://papers.nips.cc/paper/6485-structured-prediction-theory-based-on-factor-graph-complexity">Structured Prediction Theory Based on Factor Graph Complexity</a><br /><i><span>Corinna Cortes</span>, <span>Vitaly Kuznetsov</span>, <span>Mehryar Mohri</span>, Scott Yang</i><br /><br /><a href="http://papers.nips.cc/paper/6427-toward-deeper-understanding-of-neural-networks-the-power-of-initialization-and-a-dual-view-on-expressivity">Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity</a><br /><i><span>Amit Daniely</span></i><i>,</i><i><span>&#160;Roy Frostig</span></i><i>,</i><i><span>&#160;Yoram Singer</span></i><br /><br /><b><u>Demonstrations</u></b><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6307">Interactive musical improvisation with Magenta </a><br /><i><span>Adam Roberts</span></i><i>,</i><i><span>&#160;Sageev Oore</span></i><i>,</i><i><span>&#160;Curtis Hawthorne</span></i><i>,</i><i><span>&#160;Douglas 
Eck</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6312">Content-based Related Video Recommendation </a><br /><i><span>Joonseok Lee</span></i><br /><b><u><br /></u></b> <b><u>Workshops, Tutorials and Symposia</u></b><br /><a href="http://approximateinference.org/">Advances in Approximate Bayesian Inference </a><br />Advisory Committee includes: <i><span>Kevin P. Murphy</span></i> <br />Invited Speakers include: <i><span>Matt Johnson</span></i> <br />Panelists include: <i><span>Ryan Sepassi</span></i><br /><br /><a href="https://sites.google.com/site/nips2016adversarial/">Adversarial Training</a><br />Accepted Authors: <i><span>Luke Metz</span></i><i>,</i><i><span>&#160;Ben Poole</span></i><i>,</i><i><span>&#160;David Pfau</span></i><i>,</i><i><span>&#160;Jascha Sohl-Dickstein</span></i><i>,</i><i><span>&#160;Augustus Odena</span></i><i>,</i><i><span>&#160;Christopher Olah</span></i><i>,</i><i><span>&#160;Jonathon Shlens</span></i><br /><br /><a href="http://bayesiandeeplearning.org/">Bayesian Deep Learning </a><br />Organizers include: <i><span>Kevin P. Murphy</span></i><br />Accepted Authors include: <i><span>Rif A. 
Saurous</span></i><i>,</i><i><span>&#160;Eugene Brevdo</span></i><i>,</i><i><span>&#160;Kevin Murphy</span></i><i>,</i><i><span>&#160;Eric Jang</span></i><i>,</i><i><span>&#160;</span></i><i><span>Shixiang Gu</span></i><i>,</i><i><span>&#160;</span></i><i><span>Ben Poole</span></i><br /><br /><a href="http://www.stat.ucla.edu/~akfletcher/brainsbits.html">Brains &#38; Bits: Neuroscience Meets Machine Learning </a><br />Organizers include: <i><span>Jascha Sohl-Dickstein</span></i> <br /><a href="http://virenjain.org/nips2016connectomics/"><br /></a> <a href="http://virenjain.org/nips2016connectomics/">Connectomics II: Opportunities &#38; Challenges for Machine Learning </a><br />Organizers include: <i><span>Viren Jain</span></i> <br /><br /><a href="http://www.cs.nott.ac.uk/~psztg/cml/2016/">Constructive Machine Learning</a><br />Invited Speakers include: <i><span>Douglas Eck</span></i><br /><br /><a href="https://sites.google.com/site/cldlnips2016/home">Continual Learning &#38; Deep Networks </a><br />Invited Speakers include: <i><span>Honglak Lee</span></i><br /><br /><a href="https://sites.google.com/site/nips16interaction/">Deep Learning for Action &#38; Interaction</a><br />Organizers include: <i><span>Sergey Levine</span></i><br />Invited Speakers include: <i><span>Honglak Lee</span></i><br />Accepted Authors include: <i><span>Pararth Shah</span></i><i>,</i><i><span>&#160;Dilek Hakkani-Tur</span></i><i>,</i><i><span>&#160;Larry Heck</span></i><br /><a href="https://sites.google.com/site/nips2016endtoendspeechaudio/"><br /></a> <a href="https://sites.google.com/site/nips2016endtoendspeechaudio/">End-to-end Learning for Speech and Audio Processing</a><br />Invited Speakers include: <i><span>Tara Sainath</span></i><br />Accepted Authors include: <i><span>Brian Patton</span></i><i>,</i><i><span>&#160;Yannis Agiomyrgiannakis</span></i><i>,</i><i><span>&#160;Michael Terry</span></i><i>,</i><i><span>&#160;Kevin Wilson</span></i><i>,</i><i><span>&#160;Rif A. 
Saurous</span></i><i>,</i><i><span>&#160;D. Sculley</span></i><br /><br /><a href="http://www.manikvarma.org/events/XC16/schedule.html">Extreme Classification: Multi-class &#38; Multi-label Learning in Extremely Large Label Spaces </a><br />Organizers include: <i><span>Samy Bengio</span></i><br /><br /><a href="https://sites.google.com/site/nips2016interpretml/">Interpretable Machine Learning for Complex Systems </a><br />Invited Speaker: <i><span>Honglak Lee</span></i> <br />Accepted Authors include: <i><span>Daniel Smilkov</span></i><i>,</i><i><span>&#160;Nikhil Thorat</span></i><i>,</i><i><span>&#160;Charles Nicholson</span></i><i>,</i><i><span>&#160;Emily Reif</span></i><i>,</i><i><span>&#160;Fernanda Viegas</span></i><i>,</i><i><span>&#160;Martin Wattenberg</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6252">Large Scale Computer Vision Systems </a><br />Organizers include: <i><span>Gal Chechik</span></i> <br /><br /><a href="https://sites.google.com/site/mlsysnips2016/home">Machine Learning Systems </a><br />Invited Speakers include: <i><span>Jeff Dean</span></i> <br /><br /><a href="https://sites.google.com/site/nonconvexnips2016/">Nonconvex Optimization for Machine Learning: Theory &#38; Practice </a><br />Organizers include: <i><span>Hossein Mobahi</span></i> <br /><br /><a href="http://www.probabilistic-numerics.org/meetings/NIPS2016/">Optimizing the Optimizers </a><br />Organizers include: <i><span>Alex Davies</span></i> <br /><br /><a href="https://sites.google.com/site/wildml2016nips/?pli=1">Reliable Machine Learning in the Wild</a><br />Accepted Authors: <i><span>Andres Medina</span></i><i>,</i><i><span>&#160;Sergei Vassilvitskii</span></i><br /><a href="https://autodiff-workshop.github.io/"><br /></a> <a href="https://autodiff-workshop.github.io/">The Future of Gradient-Based Machine Learning Software </a><br />Invited Speakers: <i><span>Jeff Dean</span></i><i>,</i><i><span>&#160;Matt Johnson</span></i><br /><a 
href="http://wimlworkshop.org/2016/program/"><br /></a> <a href="https://sites.google.com/site/nipsts2016/">Time Series Workshop</a><br />Organizers include: <i><span>Vitaly Kuznetsov </span></i><br />Invited Speakers include: <i><span>Mehryar Mohri</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6206">Theory and Algorithms for Forecasting Non-Stationary Time Series </a><br />Tutorial Organizers: <i>Vitaly Kuznetsov,<span> Mehryar Mohri</span></i><br /><br /><a href="http://wimlworkshop.org/2016/program/">Women in Machine Learning</a><br />Invited Speakers include: <i><span>Maya Gupta</span></i><br /><br /><hr width="100%"><span><br /><a name="1"><b>* </b></a>Work done as part of the Google Brain team <a href="http://research.googleblog.com/#top1"><sup>&#8617;</sup></a><br /></span>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Doug Eck, Research Scientist, Google Brain Team</span><br /><br />This week, Barcelona hosts the <a href="https://nips.cc/Conferences/2016">30<sup>th</sup> Annual Conference on Neural Information Processing Systems</a> (NIPS 2016), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2016, with over 280 Googlers attending in order to contribute to and learn from the broader academic research community by presenting technical talks and posters, in addition to hosting workshops and tutorials.<br /><br />Research at Google is at the forefront of innovation in <a href="http://research.google.com/pubs/MachineIntelligence.html">Machine Intelligence</a>, actively exploring virtually all aspects of machine learning including classical algorithms as well as cutting-edge techniques such as <a href="http://g.co/brain">deep learning</a>. Focusing on both theory as well as application, much of our work on language understanding, speech, translation, visual processing, ranking, and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and develop learning approaches to understand and generalize. <br /><br />If you are attending NIPS 2016, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people, and to see demonstrations of some of the exciting research we pursue. 
You can also learn more about our work being presented at NIPS 2016 in the list below (Googlers highlighted in <span style="background-color: white;"><span style="color: #3d85c6;">blue</span></span>).<br /><br />Google is a Platinum Sponsor of NIPS 2016.<br /><br /><b><u>Organizing Committee</u></b><br />Executive Board includes: <i><span style="color: #3d85c6;">Corinna Cortes, Fernando Pereira</span></i><br />Advisory Board includes: <i><span style="color: #3d85c6;">John C. Platt</span></i><br />Area Chairs include: <i><span style="color: #3d85c6;">John Shlens</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Moritz Hardt</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Navdeep Jaitly</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Hugo Larochelle</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Honglak Lee</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sanjiv Kumar</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Gal Chechik</span></i><br /><br /><b><u>Invited Talk</u></b><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6194">Dynamic Legged Robots</a><br /><i><span style="color: #3d85c6;">Marc Raibert</span></i><br /><br /><b><u>Accepted Papers:</u></b><br /><a href="http://papers.nips.cc/paper/6336-boosting-with-abstention">Boosting with Abstention</a><br /><i><span style="color: #3d85c6;">Corinna Cortes</span>, Giulia DeSalvo, <span style="color: #3d85c6;">Mehryar Mohri</span></i><br /><br /><a href="http://papers.nips.cc/paper/6173-community-detection-on-evolving-graphs">Community Detection on Evolving Graphs</a><br /><i>Stefano Leonardi, Aris Anagnostopoulos, Jakub Łącki, <span style="color: #3d85c6;">Silvio Lattanzi</span>, <span style="color: #3d85c6;">Mohammad Mahdian</span></i><br /><br /><a href="http://papers.nips.cc/paper/6500-linear-relaxations-for-finding-diverse-elements-in-metric-spaces">Linear Relaxations for Finding Diverse Elements in Metric Spaces</a><br /><i>Aditya Bhaskara, 
Mehrdad Ghadiri, <span style="color: #3d85c6;">Vahab Mirrokni</span>, Ola Svensson</i><br /><br /><a href="http://papers.nips.cc/paper/6535-nearly-isometric-embedding-by-relaxation">Nearly Isometric Embedding by Relaxation</a><br /><i>James McQueen, Marina Meila, <span style="color: #3d85c6;">Dominique Joncas</span></i><br /><a href="http://papers.nips.cc/paper/6429-optimistic-bandit-convex-optimization"><br /></a> <a href="http://papers.nips.cc/paper/6429-optimistic-bandit-convex-optimization">Optimistic Bandit Convex Optimization</a><br /><i><span style="color: #3d85c6;">Mehryar Mohri</span>, Scott Yang</i><br /><br /><a href="http://papers.nips.cc/paper/6547-reward-augmented-maximum-likelihood-for-neural-structured-prediction">Reward Augmented Maximum Likelihood for Neural Structured Prediction</a><br /><i><span style="color: #3d85c6;">Mohammad Norouzi</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Samy Bengio</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Zhifeng Chen</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Navdeep Jaitly</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Mike Schuster</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Yonghui Wu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Dale Schuurmans</span></i><br /><a href="http://papers.nips.cc/paper/6359-stochastic-gradient-mcmc-with-stale-gradients"><br /></a> <a href="http://papers.nips.cc/paper/6359-stochastic-gradient-mcmc-with-stale-gradients">Stochastic Gradient MCMC with Stale Gradients</a><br /><i>Changyou Chen, <span style="color: #3d85c6;">Nan Ding</span>, Chunyuan Li, Yizhe Zhang, Lawrence Carin</i><br /><br /><a href="http://papers.nips.cc/paper/6161-unsupervised-learning-for-physical-interaction-through-video-prediction">Unsupervised Learning for Physical Interaction through Video Prediction</a><br /><i><span style="color: #3d85c6;">Chelsea Finn</span><a 
href="http://research.googleblog.com/2016/12/nips-2016-research-at-google.html#1" name="top1"><sup>*</sup></a>, Ian Goodfellow, <span style="color: #3d85c6;">Sergey Levine</span></i><br /><br /><a href="http://papers.nips.cc/paper/6057-using-fast-weights-to-attend-to-the-recent-past">Using Fast Weights to Attend to the Recent Past</a><br /><i>Jimmy Ba, <span style="color: #3d85c6;">Geoffrey Hinton</span>, Volodymyr Mnih, Joel Leibo, Catalin Ionescu</i><br /><br /><a href="http://papers.nips.cc/paper/6256-a-credit-assignment-compiler-for-joint-prediction">A Credit Assignment Compiler for Joint Prediction</a><br /><i>Kai-Wei Chang, He He, <span style="color: #3d85c6;">Stephane Ross</span>, Hal Daumé III</i><br /><br /><a href="http://papers.nips.cc/paper/6594-a-neural-transducer">A Neural Transducer</a><br /><i><span style="color: #3d85c6;">Navdeep Jaitly</span>, <span style="color: #3d85c6;">Quoc Le</span>, Oriol Vinyals, Ilya Sutskever, <span style="color: #3d85c6;">David Sussillo</span>, <span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><a href="http://papers.nips.cc/paper/6230-attend-infer-repeat-fast-scene-understanding-with-generative-models">Attend, Infer, Repeat: Fast Scene Understanding with Generative Models</a><br /><i>S. M.
Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, <span style="color: #3d85c6;">Geoffrey Hinton</span></i><br /><br /><a href="http://papers.nips.cc/paper/6085-bi-objective-online-matching-and-submodular-allocations">Bi-Objective Online Matching and Submodular Allocations</a><br /><i>Hossein Esfandiari, <span style="color: #3d85c6;">Nitish Korula</span>, <span style="color: #3d85c6;">Vahab Mirrokni</span></i><br /><br /><a href="http://papers.nips.cc/paper/6595-combinatorial-energy-learning-for-image-segmentation">Combinatorial Energy Learning for Image Segmentation</a><br /><i><span style="color: #3d85c6;">Jeremy Maitin-Shepard</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Viren Jain</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Michal Januszewski</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Peter Li</span>, Pieter Abbeel</i><br /><br /><a href="http://papers.nips.cc/paper/6315-deep-learning-games">Deep Learning Games</a><br /><i><span style="color: #3d85c6;">Dale Schuurmans</span>, <span style="color: #3d85c6;">Martin Zinkevich</span></i><br /><br /><a href="http://papers.nips.cc/paper/6280-deepmath-deep-sequence-models-for-premise-selection">DeepMath - Deep Sequence Models for Premise Selection</a><br /><i><span style="color: #3d85c6;">Geoffrey Irving</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Christian Szegedy</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Niklas Een</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Alexander Alemi</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;François Chollet</span>, Josef Urban</i><br /><br /><a href="http://papers.nips.cc/paper/6217-density-estimation-via-discrepancy-based-adaptive-sequential-partition">Density Estimation via Discrepancy Based Adaptive Sequential Partition</a><br /><i>Dangna Li, <span style="color: #3d85c6;">Kun Yang</span>, Wing Wong</i><br /><br /><a 
href="http://papers.nips.cc/paper/6254-domain-separation-networks">Domain Separation Networks</a><br /><i><span style="color: #3d85c6;">Konstantinos Bousmalis</span>, George Trigeorgis, <span style="color: #3d85c6;">Nathan Silberman</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Dilip Krishnan</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Dumitru Erhan</span></i><br /><a href="http://papers.nips.cc/paper/6540-fast-distributed-submodular-cover-public-private-data-summarization"><br /></a> <a href="http://papers.nips.cc/paper/6540-fast-distributed-submodular-cover-public-private-data-summarization">Fast Distributed Submodular Cover: Public-Private Data Summarization </a><br /><i>Baharan Mirzasoleiman, <span style="color: #3d85c6;">Morteza Zadimoghaddam</span>, Amin Karbasi</i><br /><br /><a href="http://papers.nips.cc/paper/6316-satisfying-real-world-goals-with-dataset-constraints">Satisfying Real-world Goals with Dataset Constraints</a><br /><i>Gabriel Goh, <span style="color: #3d85c6;">Andrew Cotter</span>, <span style="color: #3d85c6;">Maya Gupta</span>, Michael P Friedlander</i><br /><br /><a href="http://papers.nips.cc/paper/6295-can-active-memory-replace-attention">Can Active Memory Replace Attention?</a><br /><i><span style="color: #3d85c6;">Łukasz Kaiser</span>, <span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><a href="http://papers.nips.cc/paper/6377-fast-and-flexible-monotonic-functions-with-ensembles-of-lattices">Fast and Flexible Monotonic Functions with Ensembles of Lattices</a><br /><i><span style="color: #3d85c6;">Kevin Canini</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Andy Cotter</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Maya Gupta</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Mahdi Fard</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Jan Pfeifer</span></i> <br /><br /><a 
href="http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn">Launch and Iterate: Reducing Prediction Churn</a><br /><i>Quentin Cormier, <span style="color: #3d85c6;">Mahdi Fard, Kevin Canini, Maya Gupta</span></i><br /><a href="http://papers.nips.cc/paper/6078-on-mixtures-of-markov-chains"><br /></a> <a href="http://papers.nips.cc/paper/6078-on-mixtures-of-markov-chains">On Mixtures of Markov Chains</a><br /><i>Rishi Gupta, <span style="color: #3d85c6;">Ravi Kumar</span>, <span style="color: #3d85c6;">Sergei Vassilvitskii</span></i><br /><br /><a href="http://papers.nips.cc/paper/6246-orthogonal-random-features">Orthogonal Random Features</a><br /><i><span style="color: #3d85c6;">Felix Xinnan Yu</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Ananda Theertha Suresh</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Krzysztof Choromanski</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;Dan Holtmann-Rice</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Sanjiv Kumar</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=7241">Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision</a><br /><i>Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, <span style="color: #3d85c6;">Honglak Lee</span></i><br /><br /><a href="http://papers.nips.cc/paper/6485-structured-prediction-theory-based-on-factor-graph-complexity">Structured Prediction Theory Based on Factor Graph Complexity</a><br /><i><span style="color: #3d85c6;">Corinna Cortes</span>, <span style="color: #3d85c6;">Vitaly Kuznetsov</span>, <span style="color: #3d85c6;">Mehryar Mohri</span>, Scott Yang</i><br /><br /><a
href="http://papers.nips.cc/paper/6427-toward-deeper-understanding-of-neural-networks-the-power-of-initialization-and-a-dual-view-on-expressivity">Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity</a><br /><i><span style="color: #3d85c6;">Amit Daniely</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Roy Frostig</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Yoram Singer</span></i><br /><br /><b><u>Demonstrations</u></b><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6307">Interactive musical improvisation with Magenta </a><br /><i><span style="color: #3d85c6;">Adam Roberts</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sageev Oore</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Curtis Hawthorne</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Douglas Eck</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6312">Content-based Related Video Recommendation </a><br /><i><span style="color: #3d85c6;">Joonseok Lee</span></i><br /><b><u><br /></u></b> <b><u>Workshops, Tutorials and Symposia</u></b><br /><a href="http://approximateinference.org/">Advances in Approximate Bayesian Inference </a><br />Advisory Committee includes: <i><span style="color: #3d85c6;">Kevin P. 
Murphy</span></i> <br />Invited Speakers include: <i><span style="color: #3d85c6;">Matt Johnson</span></i> <br />Panelists include: <i><span style="color: #3d85c6;">Ryan Sepassi</span></i><br /><br /><a href="https://sites.google.com/site/nips2016adversarial/">Adversarial Training</a><br />Accepted Authors: <i><span style="color: #3d85c6;">Luke Metz</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ben Poole</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;David Pfau</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Jascha Sohl-Dickstein</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Augustus Odena</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Christopher Olah</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Jonathon Shlens</span></i><br /><br /><a href="http://bayesiandeeplearning.org/">Bayesian Deep Learning </a><br />Organizers include: <i><span style="color: #3d85c6;">Kevin P. Murphy</span></i><br />Accepted Authors include: <i><span style="color: #3d85c6;">Rif A. 
Saurous</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Eugene Brevdo</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Kevin Murphy</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Eric Jang</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">Shixiang Gu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">Ben Poole</span></i><br /><br /><a href="http://www.stat.ucla.edu/~akfletcher/brainsbits.html">Brains &amp; Bits: Neuroscience Meets Machine Learning </a><br />Organizers include: <i><span style="color: #3d85c6;">Jascha Sohl-Dickstein</span></i> <br /><a href="http://virenjain.org/nips2016connectomics/"><br /></a> <a href="http://virenjain.org/nips2016connectomics/">Connectomics II: Opportunities &amp; Challenges for Machine Learning </a><br />Organizers include: <i><span style="color: #3d85c6;">Viren Jain</span></i> <br /><br /><a href="http://www.cs.nott.ac.uk/~psztg/cml/2016/">Constructive Machine Learning</a><br />Invited Speakers include: <i><span style="color: #3d85c6;">Douglas Eck</span></i><br /><br /><a href="https://sites.google.com/site/cldlnips2016/home">Continual Learning &amp; Deep Networks </a><br />Invited Speakers include: <i><span style="color: #3d85c6;">Honglak Lee</span></i><br /><br /><a href="https://sites.google.com/site/nips16interaction/">Deep Learning for Action &amp; Interaction</a><br />Organizers include: <i><span style="color: #3d85c6;">Sergey Levine</span></i><br />Invited Speakers include: <i><span style="color: #3d85c6;">Honglak Lee</span></i><br />Accepted Authors include: <i><span style="color: #3d85c6;">Pararth Shah</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Dilek Hakkani-Tur</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Larry Heck</span></i><br /><a href="https://sites.google.com/site/nips2016endtoendspeechaudio/"><br /></a> <a
href="https://sites.google.com/site/nips2016endtoendspeechaudio/">End-to-end Learning for Speech and Audio Processing</a><br />Invited Speakers include: <i><span style="color: #3d85c6;">Tara Sainath</span></i><br />Accepted Authors include: <i><span style="color: #3d85c6;">Brian Patton</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Yannis Agiomyrgiannakis</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Michael Terry</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Kevin Wilson</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Rif A. Saurous</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;D. Sculley</span></i><br /><br /><a href="http://www.manikvarma.org/events/XC16/schedule.html">Extreme Classification: Multi-class &amp; Multi-label Learning in Extremely Large Label Spaces </a><br />Organizers include: <i><span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><a href="https://sites.google.com/site/nips2016interpretml/">Interpretable Machine Learning for Complex Systems </a><br />Invited Speaker: <i><span style="color: #3d85c6;">Honglak Lee</span></i> <br />Accepted Authors include: <i><span style="color: #3d85c6;">Daniel Smilkov</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Nikhil Thorat</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Charles Nicholson</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Emily Reif</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Fernanda Viegas</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Martin Wattenberg</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6252">Large Scale Computer Vision Systems </a><br />Organizers include: <i><span style="color: #3d85c6;">Gal Chechik</span></i> <br /><br /><a href="https://sites.google.com/site/mlsysnips2016/home">Machine Learning Systems </a><br />Invited Speakers include: <i><span style="color: #3d85c6;">Jeff Dean</span></i> <br /><br /><a 
href="https://sites.google.com/site/nonconvexnips2016/">Nonconvex Optimization for Machine Learning: Theory &amp; Practice </a><br />Organizers include: <i><span style="color: #3d85c6;">Hossein Mobahi</span></i> <br /><br /><a href="http://www.probabilistic-numerics.org/meetings/NIPS2016/">Optimizing the Optimizers </a><br />Organizers include: <i><span style="color: #3d85c6;">Alex Davies</span></i> <br /><br /><a href="https://sites.google.com/site/wildml2016nips/?pli=1">Reliable Machine Learning in the Wild</a><br />Accepted Authors: <i><span style="color: #3d85c6;">Andres Medina</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergei Vassilvitskii</span></i><br /><a href="https://autodiff-workshop.github.io/"><br /></a> <a href="https://autodiff-workshop.github.io/">The Future of Gradient-Based Machine Learning Software </a><br />Invited Speakers: <i><span style="color: #3d85c6;">Jeff Dean</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Matt Johnson</span></i><br /><a href="http://wimlworkshop.org/2016/program/"><br /></a> <a href="https://sites.google.com/site/nipsts2016/">Time Series Workshop</a><br />Organizers include: <i><span style="color: #3d85c6;">Vitaly Kuznetsov </span></i><br />Invited Speakers include: <i><span style="color: #3d85c6;">Mehryar Mohri</span></i><br /><br /><a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6206">Theory and Algorithms for Forecasting Non-Stationary Time Series </a><br />Tutorial Organizers: <i>Vitaly Kuznetsov,<span style="color: #3d85c6;"> Mehryar Mohri</span></i><br /><br /><a href="http://wimlworkshop.org/2016/program/">Women in Machine Learning</a><br />Invited Speakers include: <i><span style="color: #3d85c6;">Maya Gupta</span></i><br /><br /><hr width="100%" /><span class="Apple-style-span" style="font-size: x-small;"><br /><a name="1"><b>* </b></a>Work done as part of the Google Brain team <a 
href="http://research.googleblog.com/2016/12/nips-2016-research-at-google.html#top1"><sup>↩</sup></a><br /></span>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/nips-2016-research-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Deep Learning for Detection of Diabetic Eye Disease</title>
		<link>https://googledata.org/google-research/deep-learning-for-detection-of-diabetic-eye-disease/</link>
		<comments>https://googledata.org/google-research/deep-learning-for-detection-of-diabetic-eye-disease/#comments</comments>
		<pubDate>Tue, 29 Nov 2016 16:04:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=27e7a1283f8d8d29a37f5b1e448ae812</guid>
		<description><![CDATA[<span>Posted by Lily Peng MD PhD, Product Manager and Varun Gulshan PhD, Research Engineer</span><br /><br /><a href="https://en.wikipedia.org/wiki/Diabetic_retinopathy">Diabetic retinopathy</a> (DR) is the fastest growing cause of blindness, with nearly <a href="http://www.idf.org/about-diabetes/facts-figures">415 million diabetic patients</a> at risk worldwide. If caught early, the disease can be treated; if not, it can lead to irreversible blindness. Unfortunately, medical specialists capable of detecting the disease are not available in many parts of the world where diabetes is prevalent. We believe that Machine Learning can help doctors identify patients in need, particularly among underserved populations.<br /><br />A few years ago, several of us began wondering if there was a way Google technologies could  improve the DR screening process, specifically by taking advantage of recent advances in Machine Learning and Computer Vision. In "<a href="http://jamanetwork.com/journals/jama/fullarticle/2588763">Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs</a>", published today in <a href="http://jamanetwork.com/journals/jama">JAMA</a>, we present a deep learning algorithm capable of interpreting signs of DR in retinal photographs, potentially helping doctors screen more patients in settings with limited resources. <br /><br />One of the most common ways to detect diabetic eye disease is to have a specialist examine pictures of the back of the eye (Figure 1) and rate them for disease presence and severity. Severity is determined by the type of lesions present (e.g. <a href="http://www.ucdenver.edu/academics/colleges/medicalschool/centers/BarbaraDavis/Clinical/Pages/Ophthalmology.aspx">microaneurysms, hemorrhages, hard exudates, etc</a>), which are indicative of bleeding and fluid leakage in the eye. 
Interpreting these photographs requires specialized training, and in many regions of the world there aren&#8217;t enough qualified graders to screen everyone who is at risk. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-3E-DEWAF_fM/WD2ffT8I2RI/AAAAAAAABbo/rvuDPgSGEMs_7pZVhV-wL3svrOeqF05owCLcB/s1600/image01.png"><img border="0" height="258" src="https://4.bp.blogspot.com/-3E-DEWAF_fM/WD2ffT8I2RI/AAAAAAAABbo/rvuDPgSGEMs_7pZVhV-wL3svrOeqF05owCLcB/s640/image01.png" width="640"></a></td></tr><tr><td>Figure 1. Examples of retinal fundus photographs that are taken to screen for DR. The image on the left is of a healthy retina (A), whereas the image on the right is a retina with referable diabetic retinopathy (B) due to a number of hemorrhages (red spots) present.</td></tr></tbody></table>Working closely with doctors both in India and the US, we created a development dataset of 128,000 images which were each evaluated by 3-7 ophthalmologists from a panel of 54 ophthalmologists. This dataset was used to train a deep neural network to detect referable diabetic retinopathy. We then tested the algorithm&#8217;s performance on two separate clinical validation sets totalling ~12,000 images, with the majority decision of a panel of 7 or 8 U.S. board-certified ophthalmologists serving as the reference standard. The ophthalmologists selected for the validation sets were the ones who showed high consistency from the original group of 54 doctors.<br /><br />Performance of both the algorithm and the ophthalmologists on a 9,963-image validation set is shown in Figure 2.
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-Whx5jQDUtJg/WD2fxqQMQDI/AAAAAAAABbs/snP00Vot-kYUSCvXs-FyaagnWzMdqg4gQCLcB/s1600/image00.png"><img border="0" height="636" src="https://2.bp.blogspot.com/-Whx5jQDUtJg/WD2fxqQMQDI/AAAAAAAABbs/snP00Vot-kYUSCvXs-FyaagnWzMdqg4gQCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Figure 2. Performance of the algorithm (black curve) and eight ophthalmologists (colored dots) for the presence of referable diabetic retinopathy (moderate or worse diabetic retinopathy or referable diabetic macular edema) on a validation set consisting of 9963 images. The black diamonds on the graph correspond to the sensitivity and specificity of the algorithm at the high sensitivity and high specificity operating points. </td></tr></tbody></table>The results show that our algorithm&#8217;s performance is on par with that of ophthalmologists. For example, on the validation set described in Figure 2, the algorithm has an <a href="https://en.wikipedia.org/wiki/F1_score">F-score</a> (combined <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">sensitivity and specificity</a> metric, with max=1) of 0.95, which is slightly better than the median F-score of the 8 ophthalmologists we consulted (measured at 0.91).<br /><br />These are exciting results, but there is still a lot of work to do. First, while the conventional quality measures we used to assess our algorithm are encouraging, we are working with retinal specialists to define even more robust reference standards that can be used to quantify performance. Furthermore, interpretation of a 2D fundus photograph, which we demonstrate in this paper, is only one part in a multi-step process that leads to a diagnosis for diabetic eye disease. In some cases, doctors use a 3D imaging technology, Optical Coherence Tomography (OCT), to examine various layers of a retina in detail.
Applying machine learning to this 3D imaging modality is already underway, <a href="https://deepmind.com/applied/deepmind-health/research/">led by our colleagues at DeepMind</a>. In the future, these two complementary methods might be used together to assist doctors in the diagnosis of a wide spectrum of eye diseases.<br /><br />Automated DR screening methods with high accuracy have the strong potential to assist doctors in evaluating more patients and quickly routing those who need help to a specialist. We are working with doctors and researchers to study the entire process of screening in settings around the world, in the hopes that we can integrate our methods into clinical workflow in a manner that is maximally beneficial. Finally, we are working with the FDA and other regulatory agencies to further evaluate these technologies in clinical studies.<br /><br />Given the many recent advances in deep learning, we hope our study will be just one of many compelling examples to come demonstrating the ability of machine learning to help solve important problems in medical imaging in healthcare more broadly. <br /><br />Learn more about the <a href="http://g.co/brain/healthcare">Health Research efforts of the Brain team</a> at Google]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Lily Peng MD PhD, Product Manager and Varun Gulshan PhD, Research Engineer</span><br /><br /><a href="https://en.wikipedia.org/wiki/Diabetic_retinopathy">Diabetic retinopathy</a> (DR) is the fastest growing cause of blindness, with nearly <a href="http://www.idf.org/about-diabetes/facts-figures">415 million diabetic patients</a> at risk worldwide. If caught early, the disease can be treated; if not, it can lead to irreversible blindness. Unfortunately, medical specialists capable of detecting the disease are not available in many parts of the world where diabetes is prevalent. We believe that Machine Learning can help doctors identify patients in need, particularly among underserved populations.<br /><br />A few years ago, several of us began wondering if there was a way Google technologies could  improve the DR screening process, specifically by taking advantage of recent advances in Machine Learning and Computer Vision. In "<a href="http://jamanetwork.com/journals/jama/fullarticle/2588763">Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs</a>", published today in <a href="http://jamanetwork.com/journals/jama">JAMA</a>, we present a deep learning algorithm capable of interpreting signs of DR in retinal photographs, potentially helping doctors screen more patients in settings with limited resources. <br /><br />One of the most common ways to detect diabetic eye disease is to have a specialist examine pictures of the back of the eye (Figure 1) and rate them for disease presence and severity. Severity is determined by the type of lesions present (e.g. <a href="http://www.ucdenver.edu/academics/colleges/medicalschool/centers/BarbaraDavis/Clinical/Pages/Ophthalmology.aspx">microaneurysms, hemorrhages, hard exudates, etc</a>), which are indicative of bleeding and fluid leakage in the eye. 
Interpreting these photographs requires specialized training, and in many regions of the world there aren’t enough qualified graders to screen everyone who is at risk. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-3E-DEWAF_fM/WD2ffT8I2RI/AAAAAAAABbo/rvuDPgSGEMs_7pZVhV-wL3svrOeqF05owCLcB/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="258" src="https://4.bp.blogspot.com/-3E-DEWAF_fM/WD2ffT8I2RI/AAAAAAAABbo/rvuDPgSGEMs_7pZVhV-wL3svrOeqF05owCLcB/s640/image01.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 1. Examples of retinal fundus photographs that are taken to screen for DR. The image on the left is of a healthy retina (A), whereas the image on the right is a retina with referable diabetic retinopathy (B) due to a number of hemorrhages (red spots) present.</td></tr></tbody></table>Working closely with doctors both in India and the US, we created a development dataset of 128,000 images which were each evaluated by 3-7 ophthalmologists from a panel of 54 ophthalmologists. This dataset was used to train a deep neural network to detect referable diabetic retinopathy. We then tested the algorithm’s performance on two separate clinical validation sets totalling ~12,000 images, with the majority decision of a panel of 7 or 8 U.S. board-certified ophthalmologists serving as the reference standard. The ophthalmologists selected for the validation sets were the ones who showed high consistency from the original group of 54 doctors.<br /><br />Performance of both the algorithm and the ophthalmologists on a 9,963-image validation set is shown in Figure 2.
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-Whx5jQDUtJg/WD2fxqQMQDI/AAAAAAAABbs/snP00Vot-kYUSCvXs-FyaagnWzMdqg4gQCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="636" src="https://2.bp.blogspot.com/-Whx5jQDUtJg/WD2fxqQMQDI/AAAAAAAABbs/snP00Vot-kYUSCvXs-FyaagnWzMdqg4gQCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 2. Performance of the algorithm (black curve) and eight ophthalmologists (colored dots) for the presence of referable diabetic retinopathy (moderate or worse diabetic retinopathy or referable diabetic macular edema) on a validation set consisting of 9963 images. The black diamonds on the graph correspond to the sensitivity and specificity of the algorithm at the high sensitivity and high specificity operating points. </td></tr></tbody></table>The results show that our algorithm’s performance is on par with that of ophthalmologists. For example, on the validation set described in Figure 2, the algorithm has an <a href="https://en.wikipedia.org/wiki/F1_score">F-score</a> (combined <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">sensitivity and specificity</a> metric, with max=1) of 0.95, which is slightly better than the median F-score of the 8 ophthalmologists we consulted (measured at 0.91).<br /><br />These are exciting results, but there is still a lot of work to do. First, while the conventional quality measures we used to assess our algorithm are encouraging, we are working with retinal specialists to define even more robust reference standards that can be used to quantify performance.
Furthermore, interpretation of a 2D fundus photograph, which we demonstrate in this paper, is only one part in a multi-step process that leads to a diagnosis for diabetic eye disease. In some cases, doctors use a 3D imaging technology, Optical Coherence Tomography (OCT), to examine various layers of a retina in detail. Applying machine learning to this 3D imaging modality is already underway, <a href="https://deepmind.com/applied/deepmind-health/research/">led by our colleagues at DeepMind</a>. In the future, these two complementary methods might be used together to assist doctors in the diagnosis of a wide spectrum of eye diseases.<br /><br />Automated DR screening methods with high accuracy have the strong potential to assist doctors in evaluating more patients and quickly routing those who need help to a specialist. We are working with doctors and researchers to study the entire process of screening in settings around the world, in the hopes that we can integrate our methods into clinical workflow in a manner that is maximally beneficial. Finally, we are working with the FDA and other regulatory agencies to further evaluate these technologies in clinical studies.<br /><br />Given the many recent advances in deep learning, we hope our study will be just one of many compelling examples to come demonstrating the ability of machine learning to help solve important problems in medical imaging in healthcare more broadly. <br /><br />Learn more about the <a href="http://g.co/brain/healthcare">Health Research efforts of the Brain team</a> at Google]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/deep-learning-for-detection-of-diabetic-eye-disease/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System</title>
		<link>https://googledata.org/google-translate/zero-shot-translation-with-googles-multilingual-neural-machine-translation-system/</link>
		<comments>https://googledata.org/google-translate/zero-shot-translation-with-googles-multilingual-neural-machine-translation-system/#comments</comments>
		<pubDate>Tue, 22 Nov 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[Google Translate]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=d076584268737342532562b70edc4491</guid>
		<description><![CDATA[<span>Posted by Mike Schuster (Google Brain Team), Melvin Johnson (Google Translate) and Nikhil Thorat (Google Brain Team)</span><br /><br />In the last 10 years, <a href="https://translate.google.com/">Google Translate</a> has grown from supporting just a few languages to 103, translating over 140 billion words every day. To make this possible, we needed to build and maintain many different systems in order to translate between any two languages, incurring significant computational cost. With neural networks transforming many fields, we were convinced we could raise the translation quality further, but doing so would mean rethinking the technology behind Google Translate.<br /><br />In September, <a href="https://research.googleblog.com/2016/09/a-neural-network-for-machine.html">we announced</a> that Google Translate is switching to a new system called <a href="https://arxiv.org/abs/1609.08144">Google Neural Machine Translation (GNMT)</a>, an end-to-end learning framework that learns from millions of examples, and provides significant improvements in translation quality. However, while switching to GNMT improved the quality for the languages we tested it on, scaling up to all 103 supported languages presented a significant challenge.<br /><br />In &#8220;<a href="https://arxiv.org/abs/1611.04558">Google&#8217;s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation</a>&#8221;, we address this challenge by extending our previous GNMT system, allowing for a single system to translate between multiple languages. Our proposed architecture requires no change in the base GNMT system, but instead uses an additional &#8220;token&#8221; at the beginning of the input sentence to specify the required target language to translate to.
In addition to improving translation quality, our method also enables &#8220;Zero-Shot Translation&#8221; &#8212; translation between language pairs never seen explicitly by the system.<br /><div><a href="https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s1600/image01.gif"><img border="0" height="338" src="https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s640/image01.gif" width="640"></a></div>Here&#8217;s how it works. Let&#8217;s say we train a multilingual system with Japanese&#8644;English and Korean&#8644;English examples, shown by the solid blue lines in the animation. Our multilingual system, which is the same size as a single GNMT system, shares its parameters to translate between these four different language pairs. This sharing enables the system to transfer the &#8220;translation knowledge&#8221; from one language pair to the others. This transfer learning and the need to translate between multiple languages force the system to better use its modeling power.<br /><br />This inspired us to ask the following question: Can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean&#8644;Japanese examples were not shown to the system. Impressively, the answer is yes &#8212; it can generate reasonable Korean&#8644;Japanese translations, even though it has never been taught to do so. We call this &#8220;zero-shot&#8221; translation, shown by the yellow dotted lines in the animation. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation. <br /><br />The success of zero-shot translation raises another important question: Is the system learning a common representation in which sentences with the same meaning are represented in similar ways regardless of language &#8212; i.e. 
an &#8220;interlingua&#8221;? Using a 3-dimensional representation of internal network data, we were able to take a peek into the system as it translates a set of sentences between all possible pairs of the Japanese, Korean, and English languages.<br /><br /><div><a href="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s1600/image00.png"><img border="0" height="342" src="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s640/image00.png" width="640"></a></div>Part (a) of the figure above shows an overall geometry of these translations. The points in this view are colored by meaning: a sentence translated from English to Korean and a sentence with the same meaning translated from Japanese to English share the same color. From this view we can see distinct groupings of points, each with its own color. Part (b) zooms in to one of the groups, and part (c) colors the points by source language. Within a single group, we see a sentence with the same meaning, but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of the existence of an interlingua in the network. <br /><br />We show many more results and analyses in our paper, and hope that its findings are interesting not only to machine learning and machine translation researchers but also to linguists and others who are interested in how multiple languages can be processed by machines using a single system.<br /><br />Finally, the described Multilingual Google Neural Machine Translation system is running in production today for all <a href="https://translate.google.com/">Google Translate</a> users. 
Multilingual systems are currently used to serve 10 of the recently launched 16 language pairs, resulting in improved quality and a simplified production architecture.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Mike Schuster (Google Brain Team), Melvin Johnson (Google Translate) and Nikhil Thorat (Google Brain Team)</span><br /><br />In the last 10 years, <a href="https://translate.google.com/">Google Translate</a> has grown from supporting just a few languages to 103, translating over 140 billion words every day. To make this possible, we needed to build and maintain many different systems in order to translate between any two languages, incurring significant computational cost. With neural networks transforming many fields, we were convinced we could raise the translation quality further, but doing so would mean rethinking the technology behind Google Translate.<br /><br />In September, <a href="https://research.googleblog.com/2016/09/a-neural-network-for-machine.html">we announced</a> that Google Translate is switching to a new system called <a href="https://arxiv.org/abs/1609.08144">Google Neural Machine Translation (GNMT)</a>, an end-to-end learning framework that learns from millions of examples and provides significant improvements in translation quality. However, while switching to GNMT improved the quality for the languages we tested it on, scaling up to all 103 supported languages presented a significant challenge.<br /><br />In “<a href="https://arxiv.org/abs/1611.04558">Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation</a>”, we address this challenge by extending our previous GNMT system, allowing a single system to translate between multiple languages. Our proposed architecture requires no change to the base GNMT system, but instead uses an additional “token” at the beginning of the input sentence to specify the required target language to translate to. 
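The token trick above can be sketched in a few lines. This is an illustrative stand-in, not the production pipeline; the <code>&lt;2xx&gt;</code> spelling here is shorthand for the artificial target-language token prepended to the source sentence.

```python
# Illustrative sketch of the target-language token described above
# (not production code; the "<2xx>" token format is an assumption
# standing in for the paper's artificial token).

def add_target_token(source_tokens, target_lang):
    """Prepend an artificial token naming the desired target language."""
    return ["<2{}>".format(target_lang)] + list(source_tokens)

# The model input is otherwise unchanged, which is why the base GNMT
# architecture itself needs no modification.
print(add_target_token(["Hello", ",", "how", "are", "you", "?"], "ja"))
# prints ['<2ja>', 'Hello', ',', 'how', 'are', 'you', '?']
```

Because every language pair flows through the same parameters, only this one-token convention distinguishes, say, English-to-Japanese from English-to-Korean requests.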
In addition to improving translation quality, our method also enables “Zero-Shot Translation” — translation between language pairs never seen explicitly by the system.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s1600/image01.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="338" src="https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s640/image01.gif" width="640" /></a></div>Here’s how it works. Let’s say we train a multilingual system with Japanese⇄English and Korean⇄English examples, shown by the solid blue lines in the animation. Our multilingual system, which is the same size as a single GNMT system, shares its parameters to translate between these four different language pairs. This sharing enables the system to transfer the “translation knowledge” from one language pair to the others. This transfer learning and the need to translate between multiple languages force the system to better use its modeling power.<br /><br />This inspired us to ask the following question: Can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean⇄Japanese examples were not shown to the system. Impressively, the answer is yes — it can generate reasonable Korean⇄Japanese translations, even though it has never been taught to do so. We call this “zero-shot” translation, shown by the yellow dotted lines in the animation. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation. 
<br /><br />The success of zero-shot translation raises another important question: Is the system learning a common representation in which sentences with the same meaning are represented in similar ways regardless of language — i.e. an “interlingua”? Using a 3-dimensional representation of internal network data, we were able to take a peek into the system as it translates a set of sentences between all possible pairs of the Japanese, Korean, and English languages.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="342" src="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s640/image00.png" width="640" /></a></div>Part (a) of the figure above shows an overall geometry of these translations. The points in this view are colored by meaning: a sentence translated from English to Korean and a sentence with the same meaning translated from Japanese to English share the same color. From this view we can see distinct groupings of points, each with its own color. Part (b) zooms in to one of the groups, and part (c) colors the points by source language. Within a single group, we see a sentence with the same meaning, but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of the existence of an interlingua in the network. 
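The kind of "peek" described above can be approximated with a standard dimensionality reduction: collect per-sentence internal vectors and project them to 3-D, for example with PCA. The helper and toy vectors below are hypothetical stand-ins; the actual visualization was built from real network activations.

```python
import numpy as np

def pca_3d(vectors):
    """Project row vectors onto their top three principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                     # center the data
    # SVD of the centered data gives the principal directions
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:3].T                        # 3-D coordinates per sentence

# Toy stand-ins for internal sentence representations; same-meaning
# sentences across languages would be expected to land close together.
points = pca_3d(np.random.default_rng(0).normal(size=(12, 64)))
```

Plotting such points, colored by meaning and by source language, is how one can inspect whether same-meaning sentences cluster together regardless of language.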
<br /><br />We show many more results and analyses in our paper, and hope that its findings are interesting not only to machine learning and machine translation researchers but also to linguists and others who are interested in how multiple languages can be processed by machines using a single system.<br /><br />Finally, the described Multilingual Google Neural Machine Translation system is running in production today for all <a href="https://translate.google.com/">Google Translate</a> users. Multilingual systems are currently used to serve 10 of the recently launched 16 language pairs, resulting in improved quality and a simplified production architecture.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-translate/zero-shot-translation-with-googles-multilingual-neural-machine-translation-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhance! RAISR Sharp Images with Machine Learning</title>
		<link>https://googledata.org/google-research/enhance-raisr-sharp-images-with-machine-learning/</link>
		<comments>https://googledata.org/google-research/enhance-raisr-sharp-images-with-machine-learning/#comments</comments>
		<pubDate>Mon, 14 Nov 2016 19:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=a40bfe0d663043adb0362ccc8ce2d662</guid>
		<description><![CDATA[<span>Posted by Peyman Milanfar, Research Scientist</span><br /><br />Every day, the web is used to share and store millions of pictures, enabling one to explore the world, research new topics of interest, or even share a vacation with friends and family. However, many of these images are either limited by the resolution of the device used to take the picture, or purposely degraded in order to accommodate the constraints of cell phones, tablets, or the networks to which they are connected. With the ubiquity of high-resolution displays for home and mobile devices, the demand for high-quality versions of low-resolution images, quickly viewable and shareable from a wide variety of devices, has never been greater. <br /><br />With &#8220;<a href="http://arxiv.org/abs/1606.01299">RAISR: Rapid and Accurate Image Super-Resolution</a>&#8221;, we introduce a technique that incorporates machine learning in order to produce high-quality versions of low-resolution images. RAISR produces results that are comparable to or better than the currently available super-resolution methods, and does so roughly 10 to 100 times faster, allowing it to be run on a typical mobile device in real-time. Furthermore, our technique is able to avoid recreating the aliasing artifacts that may exist in the lower resolution image. <br /><br /><a href="https://en.wikipedia.org/wiki/Upsampling">Upsampling</a>, the process of producing an image of larger size with significantly more pixels and higher image quality from a low quality image, has been around for quite a while. Well-known approaches to upsampling are linear methods, which fill in new pixel values using simple, and fixed, combinations of the nearby existing pixel values. These methods are fast because they are fixed linear filters (a constant convolution kernel applied uniformly across the image). 
But what makes these upsampling methods fast also makes them ineffective in bringing out vivid details in the higher resolution results. As you can see in the example below, the upsampled image looks blurry &#8211; one would hesitate to call it enhanced. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-CTe4aLwW8IM/WCoHM6AlIrI/AAAAAAAABaA/Plw5vdAva9UMrfXmtronpMVZeAx7u1snQCLcB/s1600/1.png"><img border="0" height="304" src="https://3.bp.blogspot.com/-CTe4aLwW8IM/WCoHM6AlIrI/AAAAAAAABaA/Plw5vdAva9UMrfXmtronpMVZeAx7u1snQCLcB/s640/1.png" width="640"></a></td></tr><tr><td>Left: Low-res original, Right: simple (bicubic) upsampled version (2x). Image Credit: <a href="http://www.telegraph.co.uk/news/2016/05/23/worldturtleday-how-much-do--you-know-about-turtles/">Masa Ushioda/Seapics/Solent News</a></td></tr></tbody></table><br />With RAISR, we instead use machine learning and train on pairs of images, one low quality, one high, to find filters that, when applied selectively to each pixel of the low-res image, will recreate details that are of comparable quality to the original. RAISR can be trained in two ways. The first is the "direct" method, where filters are learned directly from low and high-resolution image pairs. The other method involves first applying a computationally cheap upsampler to the low resolution image (as in the figure above) and then learning the filters from the upsampled and high resolution image pairs. While the direct method is computationally faster, the second method allows for non-integer scale factors and better leveraging of hardware-based upsampling.<br /><br />For either method, RAISR filters are trained according to <a href="https://en.wikipedia.org/wiki/Edge_detection">edge features</a> found in small patches of images (brightness/color gradients, flat/textured regions, etc.), characterized by <i>direction</i> (the angle of an edge), <i>strength</i> (sharp edges have a greater strength) and <i>coherence</i> (a measure of how directional the edge is). Below is a set of RAISR filters, learned from a database of 10,000 high and low resolution image pairs (where the low-res images were first upsampled). The training process takes about an hour. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-3CJ0o8NSvKg/WCoHaSG9EkI/AAAAAAAABaE/vasPxAVX3xE1O2q-Qqfn38sQ_YB7ZrZewCLcB/s1600/image00.png"><img border="0" height="238" src="https://1.bp.blogspot.com/-3CJ0o8NSvKg/WCoHaSG9EkI/AAAAAAAABaE/vasPxAVX3xE1O2q-Qqfn38sQ_YB7ZrZewCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Collection of learned 11x11 filters for 3x super-resolution. Filters can be learned for a range of super-resolution factors, including fractional ones. Note that as the angle of the edge changes, we see the angle of the filter rotate as well. Similarly, as the strength increases, the sharpness of the filters increases, and the anisotropy of the filter increases with rising coherence.</td></tr></tbody></table><br />From left to right, we see that the learned filters correspond selectively to the direction of the underlying edge that is being reconstructed. For example, the filter in the middle of the bottom row is most appropriate for a strong horizontal edge (gradient angle of 90 degrees) with a high degree of coherence (a straight, rather than a curved, edge). If this same horizontal edge is low-contrast, then a different filter is selected, such as the one in the top row. <br /><br />In practice, at run-time RAISR selects and applies the most relevant filter from the list of learned filters to each pixel neighborhood in the low-resolution image. 
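This per-pixel selection can be sketched with a standard structure-tensor computation, which yields exactly the three features named above. The recipe below is a common estimator assumed for illustration; the exact formulas used by RAISR may differ.

```python
import numpy as np

def edge_features(patch):
    """Direction, strength, and coherence of a grayscale patch, from the
    eigen-analysis of its summed 2x2 gradient structure tensor."""
    gy, gx = np.gradient(patch.astype(float))
    gxx, gyy, gxy = (gx * gx).sum(), (gy * gy).sum(), (gx * gy).sum()
    lam, vec = np.linalg.eigh(np.array([[gxx, gxy], [gxy, gyy]]))
    # lam is ascending, so lam[1]/vec[:, 1] describe the dominant direction
    angle = np.degrees(np.arctan2(vec[1, 1], vec[0, 1])) % 180.0
    s1, s2 = np.sqrt(max(lam[1], 0.0)), np.sqrt(max(lam[0], 0.0))
    coherence = (s1 - s2) / (s1 + s2) if (s1 + s2) > 0 else 0.0
    return angle, s1, coherence  # direction (deg), strength, coherence
```

Quantizing these three numbers gives a bucket index, and each bucket holds one learned filter, so filter lookup at run-time is a cheap hash rather than a neural-network evaluation.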
When these filters are applied to the lower quality image, they recreate details that are of comparable quality to the original high resolution, and offer a significant improvement over linear, bicubic, or <a href="https://en.wikipedia.org/wiki/Lanczos_resampling">Lanczos interpolation</a> methods.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-kxbZn2aR73A/WCoHqvuYkcI/AAAAAAAABaI/VIimdpV8-NMIxgOjOQKZjs16R4P-XQT4QCLcB/s1600/3.png"><img border="0" height="256" src="https://2.bp.blogspot.com/-kxbZn2aR73A/WCoHqvuYkcI/AAAAAAAABaI/VIimdpV8-NMIxgOjOQKZjs16R4P-XQT4QCLcB/s640/3.png" width="640"></a></td></tr><tr><td><b>Top:</b> RAISR algorithm at run-time, applied to a cheap upscaler&#8217;s output. <b>Bottom:</b> Low-res original (left), bicubic upsampler 2x (middle), RAISR output (right)</td></tr></tbody></table><br />Some examples of RAISR in action can be seen below:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-mOlb1GnMAK0/WCoH7V-XB8I/AAAAAAAABaM/1luwC-0T4fgM9V_OZtNFUwKSEsR5qcJzwCLcB/s1600/4.png"><img border="0" height="358" src="https://4.bp.blogspot.com/-mOlb1GnMAK0/WCoH7V-XB8I/AAAAAAAABaM/1luwC-0T4fgM9V_OZtNFUwKSEsR5qcJzwCLcB/s640/4.png" width="640"></a></td></tr><tr><td>Top: Original, Bottom: RAISR super-resolved 2x. <a href="http://andrzejdragan.com/wp-content/uploads/2015/05/4.jpg">Original image</a> from <a href="http://www.andrzejdragan.com/">Andrzej Dragan</a></td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-g2o2_fwNxHo/WCoIIZne1nI/AAAAAAAABaU/Nc2O5Y73iS4D83EVGG8cRR0sI48dgRBYwCLcB/s1600/5.png"><img border="0" height="626" src="https://1.bp.blogspot.com/-g2o2_fwNxHo/WCoIIZne1nI/AAAAAAAABaU/Nc2O5Y73iS4D83EVGG8cRR0sI48dgRBYwCLcB/s640/5.png" width="640"></a></td></tr><tr><td>Left: Original, Right: RAISR super-resolved 3x. 
Image courtesy of <a href="http://research.google.com/pubs/MarcLevoy.html">Marc Levoy</a></td></tr></tbody></table><br />One of the more complex aspects of super-resolution is getting rid of <a href="https://en.wikipedia.org/wiki/Aliasing">aliasing</a> artifacts such as <a href="https://en.wikipedia.org/wiki/Moir%C3%A9_pattern">Moir&#233; patterns</a> and <a href="https://en.wikipedia.org/wiki/Jaggies">jaggies</a> that arise when high frequency content is rendered in lower resolution (as is the case when images are purposefully degraded). Depending on the shape of the underlying features, these artifacts can be varied and hard to undo. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-XDoxWkmOVWg/WCoIS7jt2OI/AAAAAAAABaY/F0OYem8LGU8W_GlKcZxPwWIxKemkwvOeACLcB/s1600/image01.jpg"><img border="0" height="640" src="https://4.bp.blogspot.com/-XDoxWkmOVWg/WCoIS7jt2OI/AAAAAAAABaY/F0OYem8LGU8W_GlKcZxPwWIxKemkwvOeACLcB/s640/image01.jpg" width="524"></a></td></tr><tr><td>Example of aliasing artifacts seen on the lower right (<a href="https://en.wikipedia.org/wiki/Aliasing#/media/File:Moire_pattern_of_bricks_small.jpg">Image source</a>)</td></tr></tbody></table><br />Linear methods simply cannot recover the underlying structure, but RAISR can. Below is an example where the aliased spatial frequencies are apparent under the numbers 3 and 5 in the low-resolution original on the left, while the RAISR image on the right recovers the original structure. Another important advantage of the filter learning approach used by RAISR is that we can specialize it to remove noise, or compression artifacts unique to individual compression algorithms (such as JPEG), as part of the training process. By providing it with examples of such artifacts, RAISR can learn to undo other effects besides resolution enhancement, having them &#8220;baked&#8221; inside the resulting filters. 
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-FS2c3xXoSvo/WCoIjjawR0I/AAAAAAAABac/zZDqKt0ezKYPaLA7bMFWiHdfVWcH7IusgCLcB/s1600/6.png"><img border="0" height="318" src="https://4.bp.blogspot.com/-FS2c3xXoSvo/WCoIjjawR0I/AAAAAAAABac/zZDqKt0ezKYPaLA7bMFWiHdfVWcH7IusgCLcB/s640/6.png" width="640"></a></td></tr><tr><td>Left: Low res original, with strong aliasing. Right: RAISR output, removing aliasing.</td></tr></tbody></table><br />Super-resolution technology, using one or many frames, has come a long way. Today, the use of machine learning, in tandem with decades of advances in imaging technology, has enabled progress in image processing that yields many potential benefits. For example, in addition to improving digital &#8220;pinch to zoom&#8221; on your phone, one could capture, save, or transmit images at lower resolution and super-resolve on demand without any visible degradation in quality, all while using less of your mobile data and storage plan. <br /><br />To learn more about the details of our research and a comparison to other current architectures, check out <a href="https://arxiv.org/abs/1606.01299">our paper</a>, which will appear soon in the <a href="http://signalprocessingsociety.org/publications-resources/ieee-transactions-computational-imaging">IEEE Transactions on Computational Imaging</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Peyman Milanfar, Research Scientist</span><br /><br />Every day, the web is used to share and store millions of pictures, enabling one to explore the world, research new topics of interest, or even share a vacation with friends and family. However, many of these images are either limited by the resolution of the device used to take the picture, or purposely degraded in order to accommodate the constraints of cell phones, tablets, or the networks to which they are connected. With the ubiquity of high-resolution displays for home and mobile devices, the demand for high-quality versions of low-resolution images, quickly viewable and shareable from a wide variety of devices, has never been greater. <br /><br />With “<a href="http://arxiv.org/abs/1606.01299">RAISR: Rapid and Accurate Image Super-Resolution</a>”, we introduce a technique that incorporates machine learning in order to produce high-quality versions of low-resolution images. RAISR produces results that are comparable to or better than the currently available super-resolution methods, and does so roughly 10 to 100 times faster, allowing it to be run on a typical mobile device in real-time. Furthermore, our technique is able to avoid recreating the aliasing artifacts that may exist in the lower resolution image. <br /><br /><a href="https://en.wikipedia.org/wiki/Upsampling">Upsampling</a>, the process of producing an image of larger size with significantly more pixels and higher image quality from a low quality image, has been around for quite a while. Well-known approaches to upsampling are linear methods, which fill in new pixel values using simple, and fixed, combinations of the nearby existing pixel values. These methods are fast because they are fixed linear filters (a constant convolution kernel applied uniformly across the image). 
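A minimal sketch of such a fixed linear upsampler, using the textbook zero-stuff-and-filter scheme (assumed here for illustration; this is not code from RAISR):

```python
import numpy as np

def upsample2x_linear(img):
    """2x linear upsampling with one fixed kernel: insert zeros between
    pixels, then convolve rows and columns with [0.5, 1.0, 0.5]."""
    h, w = img.shape
    up = np.zeros((2 * h, 2 * w), dtype=float)
    up[::2, ::2] = img                         # zero-stuffing
    k = np.array([0.5, 1.0, 0.5])              # fixed bilinear kernel
    # the same kernel is applied uniformly, first along rows, then columns
    up = np.apply_along_axis(np.convolve, 1, up, k, mode="same")
    up = np.apply_along_axis(np.convolve, 0, up, k, mode="same")
    return up
```

Because the one kernel is applied identically at every pixel, this is fast but cannot invent detail, which is exactly the blur discussed next.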
But what makes these upsampling methods fast also makes them ineffective in bringing out vivid details in the higher resolution results. As you can see in the example below, the upsampled image looks blurry – one would hesitate to call it enhanced. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-CTe4aLwW8IM/WCoHM6AlIrI/AAAAAAAABaA/Plw5vdAva9UMrfXmtronpMVZeAx7u1snQCLcB/s1600/1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="304" src="https://3.bp.blogspot.com/-CTe4aLwW8IM/WCoHM6AlIrI/AAAAAAAABaA/Plw5vdAva9UMrfXmtronpMVZeAx7u1snQCLcB/s640/1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Left: Low-res original, Right: simple (bicubic) upsampled version (2x). Image Credit: <a href="http://www.telegraph.co.uk/news/2016/05/23/worldturtleday-how-much-do--you-know-about-turtles/">Masa Ushioda/Seapics/Solent News</a></td></tr></tbody></table><br />With RAISR, we instead use machine learning and train on pairs of images, one low quality, one high, to find filters that, when applied selectively to each pixel of the low-res image, will recreate details that are of comparable quality to the original. RAISR can be trained in two ways. The first is the "direct" method, where filters are learned directly from low and high-resolution image pairs. The other method involves first applying a computationally cheap upsampler to the low resolution image (as in the figure above) and then learning the filters from the upsampled and high resolution image pairs. 
While the direct method is computationally faster, the second method allows for non-integer scale factors and better leveraging of hardware-based upsampling.<br /><br />For either method, RAISR filters are trained according to <a href="https://en.wikipedia.org/wiki/Edge_detection">edge features</a> found in small patches of images (brightness/color gradients, flat/textured regions, etc.), characterized by <i>direction</i> (the angle of an edge), <i>strength</i> (sharp edges have a greater strength) and <i>coherence</i> (a measure of how directional the edge is). Below is a set of RAISR filters, learned from a database of 10,000 high and low resolution image pairs (where the low-res images were first upsampled). The training process takes about an hour. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-3CJ0o8NSvKg/WCoHaSG9EkI/AAAAAAAABaE/vasPxAVX3xE1O2q-Qqfn38sQ_YB7ZrZewCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://1.bp.blogspot.com/-3CJ0o8NSvKg/WCoHaSG9EkI/AAAAAAAABaE/vasPxAVX3xE1O2q-Qqfn38sQ_YB7ZrZewCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Collection of learned 11x11 filters for 3x super-resolution. Filters can be learned for a range of super-resolution factors, including fractional ones. Note that as the angle of the edge changes, we see the angle of the filter rotate as well. Similarly, as the strength increases, the sharpness of the filters increases, and the anisotropy of the filter increases with rising coherence.</td></tr></tbody></table><br />From left to right, we see that the learned filters correspond selectively to the direction of the underlying edge that is being reconstructed. 
For example, the filter in the middle of the bottom row is most appropriate for a strong horizontal edge (gradient angle of 90 degrees) with a high degree of coherence (a straight, rather than a curved, edge). If this same horizontal edge is low-contrast, then a different filter is selected, such as the one in the top row. <br /><br />In practice, at run-time RAISR selects and applies the most relevant filter from the list of learned filters to each pixel neighborhood in the low-resolution image. When these filters are applied to the lower quality image, they recreate details that are of comparable quality to the original high resolution, and offer a significant improvement over linear, bicubic, or <a href="https://en.wikipedia.org/wiki/Lanczos_resampling">Lanczos interpolation</a> methods.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-kxbZn2aR73A/WCoHqvuYkcI/AAAAAAAABaI/VIimdpV8-NMIxgOjOQKZjs16R4P-XQT4QCLcB/s1600/3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="https://2.bp.blogspot.com/-kxbZn2aR73A/WCoHqvuYkcI/AAAAAAAABaI/VIimdpV8-NMIxgOjOQKZjs16R4P-XQT4QCLcB/s640/3.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>Top:</b> RAISR algorithm at run-time, applied to a cheap upscaler’s output. 
<b>Bottom:</b> Low-res original (left), bicubic upsampler 2x (middle),  RAISR output (right)</td></tr></tbody></table><br />Some examples of RAISR in action can be seen below:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-mOlb1GnMAK0/WCoH7V-XB8I/AAAAAAAABaM/1luwC-0T4fgM9V_OZtNFUwKSEsR5qcJzwCLcB/s1600/4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="358" src="https://4.bp.blogspot.com/-mOlb1GnMAK0/WCoH7V-XB8I/AAAAAAAABaM/1luwC-0T4fgM9V_OZtNFUwKSEsR5qcJzwCLcB/s640/4.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Top: Original, Bottom: RAISR super-resolved 2x. <a href="http://andrzejdragan.com/wp-content/uploads/2015/05/4.jpg">Original image</a> from <a href="http://www.andrzejdragan.com/">Andrzej Dragan</a></td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-g2o2_fwNxHo/WCoIIZne1nI/AAAAAAAABaU/Nc2O5Y73iS4D83EVGG8cRR0sI48dgRBYwCLcB/s1600/5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="626" src="https://1.bp.blogspot.com/-g2o2_fwNxHo/WCoIIZne1nI/AAAAAAAABaU/Nc2O5Y73iS4D83EVGG8cRR0sI48dgRBYwCLcB/s640/5.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Left: Original, Right: RAISR super-resolved 3x. 
Image courtesy of <a href="http://research.google.com/pubs/MarcLevoy.html">Marc Levoy</a></td></tr></tbody></table><br />One of the more complex aspects of super-resolution is getting rid of <a href="https://en.wikipedia.org/wiki/Aliasing">aliasing</a> artifacts such as <a href="https://en.wikipedia.org/wiki/Moir%C3%A9_pattern">Moiré patterns</a> and <a href="https://en.wikipedia.org/wiki/Jaggies">jaggies</a> that arise when high frequency content is rendered in lower resolution (as is the case when images are purposefully degraded). Depending on the shape of the underlying features, these artifacts can be varied and hard to undo. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-XDoxWkmOVWg/WCoIS7jt2OI/AAAAAAAABaY/F0OYem8LGU8W_GlKcZxPwWIxKemkwvOeACLcB/s1600/image01.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://4.bp.blogspot.com/-XDoxWkmOVWg/WCoIS7jt2OI/AAAAAAAABaY/F0OYem8LGU8W_GlKcZxPwWIxKemkwvOeACLcB/s640/image01.jpg" width="524" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Example of aliasing artifacts seen on the lower right (<a href="https://en.wikipedia.org/wiki/Aliasing#/media/File:Moire_pattern_of_bricks_small.jpg">Image source</a>)</td></tr></tbody></table><br />Linear methods simply cannot recover the underlying structure, but RAISR can. Below is an example where the aliased spatial frequencies are apparent under the numbers 3 and 5 in the low-resolution original on the left, while the RAISR image on the right recovers the original structure. Another important advantage of the filter learning approach used by RAISR is that we can specialize it to remove noise, or compression artifacts unique to individual compression algorithms (such as JPEG), as part of the training process. 
By providing it with examples of such artifacts, RAISR can learn to undo other effects besides resolution enhancement, having them “baked” into the resulting filters. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-FS2c3xXoSvo/WCoIjjawR0I/AAAAAAAABac/zZDqKt0ezKYPaLA7bMFWiHdfVWcH7IusgCLcB/s1600/6.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="318" src="https://4.bp.blogspot.com/-FS2c3xXoSvo/WCoIjjawR0I/AAAAAAAABac/zZDqKt0ezKYPaLA7bMFWiHdfVWcH7IusgCLcB/s640/6.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Left: Low-res original, with strong aliasing. Right: RAISR output, removing aliasing.</td></tr></tbody></table><br />Super-resolution technology, using one or many frames, has come a long way. Today, the use of machine learning, in tandem with decades of advances in imaging technology, has enabled progress in image processing that yields many potential benefits. For example, in addition to improving digital “pinch to zoom” on your phone, one could capture, save, or transmit images at lower resolution and super-resolve on demand without any visible degradation in quality, all while using less mobile data and storage. <br /><br />To learn more about the details of our research and a comparison to other current architectures, check out <a href="https://arxiv.org/abs/1606.01299">our paper</a>, which will appear soon in the <a href="http://signalprocessingsociety.org/publications-resources/ieee-transactions-computational-imaging">IEEE Transactions on Computational Imaging</a>. <br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/enhance-raisr-sharp-images-with-machine-learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Source Visualization of GPS Displacements for Earthquake Cycle Physics</title>
		<link>https://googledata.org/google-research/open-source-visualization-of-gps-displacements-for-earthquake-cycle-physics-2/</link>
		<comments>https://googledata.org/google-research/open-source-visualization-of-gps-displacements-for-earthquake-cycle-physics-2/#comments</comments>
		<pubDate>Thu, 10 Nov 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=44f095089cd122a0e2924bee37979949</guid>
		<description><![CDATA[<span>Posted by Jimbo Wilson, Software Engineer, Google Big Picture Team and Brendan Meade, Professor, Harvard Department of Earth and Planetary Sciences</span><br /><br />The <a href="https://en.wikipedia.org/wiki/Plate_tectonics">Earth&#8217;s surface is moving</a>, ever so slightly, all the time. This slow, small, but persistent movement of the Earth's crust is responsible for the formation of mountain ranges, sudden earthquakes, and even the positions of the continents. Scientists around the world measure these almost imperceptible movements using arrays of <a href="https://en.wikipedia.org/wiki/Satellite_navigation">Global Navigation Satellite System</a> (GNSS) receivers to better understand all phases of an earthquake cycle&#8212;both how the surface responds after an earthquake, and the storage of <a href="https://en.wikipedia.org/wiki/Elastic-rebound_theory">strain energy</a> between earthquakes.<br /><br />To help researchers explore this data and better understand the earthquake cycle, we are releasing a new, interactive data visualization which draws geodetic velocity lines on top of a relief map by amplifying position estimates relative to their true positions. Unlike existing approaches, which focus on small time slices or individual stations, our visualization can show all the data for a whole array of stations at once. 
Open sourced under an <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2 license</a>, and <a href="https://github.com/google/geovelo">available on GitHub</a>, this visualization technique is a collaboration between Harvard&#8217;s <a href="http://eps.harvard.edu/">Department of Earth and Planetary Sciences</a> and Google's <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a> and <a href="https://research.google.com/bigpicture/">Big Picture</a> teams.<br /><br />Our approach helps scientists quickly assess deformations across all phases of the earthquake cycle&#8212;both during earthquakes (coseismic) and the time between (interseismic). For example, we can see azimuth (direction) reversals of stations as they relate to topographic structures and active faults. Digging into these movements will help scientists vet their models and their data, both of which are crucial for developing accurate computer representations that may help predict future earthquakes.<br /><br />Classical approaches to visualizing these data have fallen into two general categories: 1) a map view of velocity/displacement vectors over a fixed time interval and 2) time versus position plots of each GNSS component (longitude, latitude and altitude).<br /><br /><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td><img border="0" src="https://1.bp.blogspot.com/-OHj0BM-qZa0/WCOXSgq6xkI/AAAAAAAABY8/8iSLahPg3VcGpO-w3bjeG4bW_J9rgPmiACEw/s320/image02.png" width="100%"></td> <td><img border="0" src="https://3.bp.blogspot.com/-o60yf_0kA_0/WCOXWgg5xJI/AAAAAAAABY8/jXSbIXw37ucqWicIbc9uMnf7XAHGjCCTwCEw/s320/image01.png" width="100%"></td> </tr></tbody></table><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td>Examples of classical approaches. On the left is a map view showing average velocity vectors over the period from 1997 to 2001[1]. 
On the right you can see a time versus eastward (longitudinal) position plot for a single station.</td></tr></tbody></table><br />Each of these approaches has proved to be an informative way to understand the spatial distribution of crustal movements and the time evolution of solid earth deformation. However, because geodetic shifts happen in almost imperceptible distances (mm) and over long timescales, both approaches can only show a small subset of the data at any time&#8212;a condensed average velocity per station, or a detailed view of a single station, respectively. Our visualization enables a scientist to see all the data at once, then interactively drill down to a specific subset of interest.<br /><br />Our visualization approach is straightforward; by magnifying the daily longitude and latitude position changes, we show tracks of the evolution of the position of each station. These magnified position tracks are shown as trails on top of a shaded relief topography to provide a sense of position evolution in geographic context.<br /><br />To see how it works in practice, let&#8217;s step through an example. 
Consider this tiny set of longitude/latitude pairs for a single GNSS station, with the differing digits shown in bold:<br /><br /><table border="1" cellpadding="1" cellspacing="0"><tbody><tr><td><div><b>Day Index</b></div></td> <td><div><b>Longitude</b></div></td> <td><div><b>Latitude</b></div></td> </tr><tr><td><div><b>0</b></div></td> <td><div>139.069904<b>07</b></div></td> <td><div>34.949757<b>897</b></div></td> </tr><tr><td><div><b>1</b></div></td> <td><div>139.069904<b>00</b></div></td> <td><div>34.949757<b>882</b></div></td> </tr><tr><td><div><b>2</b></div></td> <td><div>139.069904<b>13</b></div></td> <td><div>34.949757<b>941</b></div></td> </tr><tr><td><div><b>3</b></div></td> <td><div>139.069904<b>09</b></div></td> <td><div>34.949757<b>921</b></div></td> </tr><tr><td><div><b>4</b></div></td> <td><div>139.069904<b>13</b></div></td> <td><div>34.949757<b>904</b></div></td> </tr></tbody></table><br />If we were to draw line segments between these points directly on a map, they&#8217;d be much too small to see at any reasonable scale. So we take these minute differences and multiply them by a user-controlled scaling factor. By default this factor is 10<sup>5.5</sup> (about 316,000x).<br /><div><a href="https://3.bp.blogspot.com/-EXyid2JWpCk/WCOadpaIyfI/AAAAAAAABZM/ej47D-05YdUmfadsLfBAbmNhcArkzGqtACLcB/s1600/image05.png"><img border="0" height="454" src="https://3.bp.blogspot.com/-EXyid2JWpCk/WCOadpaIyfI/AAAAAAAABZM/ej47D-05YdUmfadsLfBAbmNhcArkzGqtACLcB/s640/image05.png" width="640"></a></div><br />To help the user identify which end is the start of the line, we give the start and end points different colors and interpolate between them. Blue and red are the default colors, but they&#8217;re user-configurable. 
Although day-to-day movement of stations may seem erratic, by using this method, one can make out a general trend in the relative motion of a station.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-StX-6aPSUQg/WCOaj1VK1RI/AAAAAAAABZQ/qCYNzJQZEXU6JJZ57LqiEiYYNvwJw3klQCLcB/s1600/image00.png"><img border="0" height="448" src="https://3.bp.blogspot.com/-StX-6aPSUQg/WCOaj1VK1RI/AAAAAAAABZQ/qCYNzJQZEXU6JJZ57LqiEiYYNvwJw3klQCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Close-up of a single station&#8217;s movement during the three year period from 2003 to 2006.</td></tr></tbody></table><br />However, static renderings of this sort suffer from the same problem that velocity vector images do; in regions with a high density of GNSS stations, tracks overlap significantly with one another, obscuring details. To solve this problem, our visualization lets the user interactively control the time range of interest, the amount of amplification and other settings. In addition, by animating the lines from start to finish, the user gets a real sense of motion that&#8217;s difficult to achieve in a static image.<br /><br />We&#8217;ve applied our new visualization to the ~20 years of data from the <a href="https://www.geospatialworld.net/article/geonet-nationwide-gps-array-of-japan/">GEONET array in Japan</a>. 
Through it, we can see small but coherent changes in direction before and after the great 2011 Tohoku earthquake.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-3kjrFwhpWLk/WCOav1jxyDI/AAAAAAAABZU/8gFrgCl6e7M1nBwwzppAFr6YbtXPy1XkwCLcB/s1600/image03.gif"><img border="0" height="348" src="https://4.bp.blogspot.com/-3kjrFwhpWLk/WCOav1jxyDI/AAAAAAAABZU/8gFrgCl6e7M1nBwwzppAFr6YbtXPy1XkwCLcB/s640/image03.gif" width="640"></a></td></tr><tr><td>GPS data sets (in .json format) for both the GEONET data in Japan and the Plate Boundary Observatory (PBO) data in the western US are available at <a href="http://earthquake.rc.fas.harvard.edu/">earthquake.rc.fas.harvard.edu</a>.</td></tr></tbody></table><br />This short animation shows many of the visualization&#8217;s interactive features. In order:<br /><ol><li>Modifying the multiplier adjusts how significantly the movements are magnified.</li><li>We can adjust the time slider nubs to select a particular time range of interest.</li><li>Using the map controls provided by the <a href="https://developers.google.com/maps/documentation/javascript/">Google Maps JavaScript API</a>, we can zoom into a tiny region of the map.</li><li>By enabling map markers, we can see information about individual GNSS stations.</li></ol>By focusing on a station of interest, we can even see curvature changes in the time periods before and after the event.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-Hv4xcTXwBf0/WCOa_FyI4YI/AAAAAAAABZY/eLPJoBAL87IIr03GVhU3KqKYBfl4bEQUACLcB/s1600/image04.png"><img border="0" height="346" src="https://4.bp.blogspot.com/-Hv4xcTXwBf0/WCOa_FyI4YI/AAAAAAAABZY/eLPJoBAL87IIr03GVhU3KqKYBfl4bEQUACLcB/s640/image04.png" width="640"></a></td></tr><tr><td>Station designated 960601 of Japan&#8217;s GEONET array is located on the island of Mikura-jima. 
Here we see the period from 2006 to 2012, with movement magnified 10<sup>5.1</sup> times (126,000x).</td></tr></tbody></table><br />To achieve fast rendering of the line segments, we created a custom overlay using <a href="https://threejs.org/">THREE.js</a> to render the lines in WebGL. Data for the GNSS stations is passed to the GPU in a data texture, which allows our vertex shader to position each point on-screen dynamically based on user settings and animation.<br /><br />We&#8217;re excited to continue this productive collaboration between Harvard and Google as we explore opportunities for groundbreaking, new earthquake visualizations. If you&#8217;d like to try out the visualization yourself, follow the instructions at <a href="http://earthquake.rc.fas.harvard.edu/">earthquake.rc.fas.harvard.edu</a>. It will walk you through the setup steps, including how to download the available data sets. If you&#8217;d like to report issues, great! Please submit them through the GitHub project page.<br /><br /><b>Acknowledgments</b><br /><br />We wish to thank Bill Freeman, a researcher on <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a>, who hatched the idea and developed the initial prototypes, and Fernanda Vi&#233;gas and Martin Wattenberg of the <a href="https://research.google.com/bigpicture/">Big Picture Team</a> for their visualization design guidance.<br /><br /><b>References</b><br /><br />[1] Loveless, J. P., and Meade, B. J. (2010). <a href="http://onlinelibrary.wiley.com/doi/10.1029/2008JB006248/abstract">Geodetic imaging of plate motions, slip rates, and partitioning of deformation in Japan</a>, <i>Journal of Geophysical Research.</i><br /><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Jimbo Wilson, Software Engineer, Google Big Picture Team and Brendan Meade, Professor, Harvard Department of Earth and Planetary Sciences</span><br /><br />The <a href="https://en.wikipedia.org/wiki/Plate_tectonics">Earth’s surface is moving</a>, ever so slightly, all the time. This slow, small, but persistent movement of the Earth's crust is responsible for the formation of mountain ranges, sudden earthquakes, and even the positions of the continents. Scientists around the world measure these almost imperceptible movements using arrays of <a href="https://en.wikipedia.org/wiki/Satellite_navigation">Global Navigation Satellite System</a> (GNSS) receivers to better understand all phases of an earthquake cycle—both how the surface responds after an earthquake, and the storage of <a href="https://en.wikipedia.org/wiki/Elastic-rebound_theory">strain energy</a> between earthquakes.<br /><br />To help researchers explore this data and better understand the earthquake cycle, we are releasing a new, interactive data visualization which draws geodetic velocity lines on top of a relief map by amplifying position estimates relative to their true positions. Unlike existing approaches, which focus on small time slices or individual stations, our visualization can show all the data for a whole array of stations at once. 
Open sourced under an <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2 license</a>, and <a href="https://github.com/google/geovelo">available on GitHub</a>, this visualization technique is a collaboration between Harvard’s <a href="http://eps.harvard.edu/">Department of Earth and Planetary Sciences</a> and Google's <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a> and <a href="https://research.google.com/bigpicture/">Big Picture</a> teams.<br /><br />Our approach helps scientists quickly assess deformations across all phases of the earthquake cycle—both during earthquakes (coseismic) and the time between (interseismic). For example, we can see azimuth (direction) reversals of stations as they relate to topographic structures and active faults. Digging into these movements will help scientists vet their models and their data, both of which are crucial for developing accurate computer representations that may help predict future earthquakes.<br /><br />Classical approaches to visualizing these data have fallen into two general categories: 1) a map view of velocity/displacement vectors over a fixed time interval and 2) time versus position plots of each GNSS component (longitude, latitude and altitude).<br /><br /><table border="0" cellpadding="0" cellspacing="0" style="width: 100%;"><tbody><tr> <td><img border="0" src="https://1.bp.blogspot.com/-OHj0BM-qZa0/WCOXSgq6xkI/AAAAAAAABY8/8iSLahPg3VcGpO-w3bjeG4bW_J9rgPmiACEw/s320/image02.png" width="100%" /></td> <td><img border="0" src="https://3.bp.blogspot.com/-o60yf_0kA_0/WCOXWgg5xJI/AAAAAAAABY8/jXSbIXw37ucqWicIbc9uMnf7XAHGjCCTwCEw/s320/image01.png" width="100%" /></td> </tr></tbody></table><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td class="tr-caption" style="text-align: center;">Examples of classical approaches. 
On the left is a map view showing average velocity vectors over the period from 1997 to 2001[1]. On the right you can see a time versus eastward (longitudinal) position plot for a single station.</td></tr></tbody></table><br />Each of these approaches has proved to be an informative way to understand the spatial distribution of crustal movements and the time evolution of solid earth deformation. However, because geodetic shifts happen in almost imperceptible distances (mm) and over long timescales, both approaches can only show a small subset of the data at any time—a condensed average velocity per station, or a detailed view of a single station, respectively. Our visualization enables a scientist to see all the data at once, then interactively drill down to a specific subset of interest.<br /><br />Our visualization approach is straightforward; by magnifying the daily longitude and latitude position changes, we show tracks of the evolution of the position of each station. These magnified position tracks are shown as trails on top of a shaded relief topography to provide a sense of position evolution in geographic context.<br /><br />To see how it works in practice, let’s step through an example. 
Consider this tiny set of longitude/latitude pairs for a single GNSS station, with the differing digits shown in bold:<br /><br /><table border="1" cellpadding="1" cellspacing="0" style="width: 100%;"><tbody><tr> <td><div style="text-align: center;"><b>Day Index</b></div></td> <td><div style="text-align: center;"><b>Longitude</b></div></td> <td><div style="text-align: center;"><b>Latitude</b></div></td> </tr><tr> <td><div style="text-align: center;"><b>0</b></div></td> <td><div style="text-align: center;">139.069904<b>07</b></div></td> <td><div style="text-align: center;">34.949757<b>897</b></div></td> </tr><tr> <td><div style="text-align: center;"><b>1</b></div></td> <td><div style="text-align: center;">139.069904<b>00</b></div></td> <td><div style="text-align: center;">34.949757<b>882</b></div></td> </tr><tr> <td><div style="text-align: center;"><b>2</b></div></td> <td><div style="text-align: center;">139.069904<b>13</b></div></td> <td><div style="text-align: center;">34.949757<b>941</b></div></td> </tr><tr> <td><div style="text-align: center;"><b>3</b></div></td> <td><div style="text-align: center;">139.069904<b>09</b></div></td> <td><div style="text-align: center;">34.949757<b>921</b></div></td> </tr><tr> <td><div style="text-align: center;"><b>4</b></div></td> <td><div style="text-align: center;">139.069904<b>13</b></div></td> <td><div style="text-align: center;">34.949757<b>904</b></div></td> </tr></tbody></table><br />If we were to draw line segments between these points directly on a map, they’d be much too small to see at any reasonable scale. So we take these minute differences and multiply them by a user-controlled scaling factor. 
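As a rough sketch of that amplification step (an illustration using the sample coordinates from the table above, not the project's actual code):

```python
# Daily (longitude, latitude) positions for one station, from the table above.
track = [
    (139.06990407, 34.949757897),
    (139.06990400, 34.949757882),
    (139.06990413, 34.949757941),
    (139.06990409, 34.949757921),
    (139.06990413, 34.949757904),
]

def magnify(track, factor=10**5.5):
    """Amplify each day's offset from the first day's position by `factor`,
    keeping the magnified track anchored at the station's true location."""
    lon0, lat0 = track[0]
    return [(lon0 + (lon - lon0) * factor,
             lat0 + (lat - lat0) * factor)
            for lon, lat in track]

magnified = magnify(track)
# Day-to-day offsets of roughly 1e-7 degrees become offsets of roughly 0.01
# degrees: large enough to draw as visible line segments on the map.
```

Anchoring at the first day keeps each trail drawn at the station's true map location while making its day-to-day drift visible.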
By default this factor is 10<sup>5.5</sup> (about 316,000x).<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-EXyid2JWpCk/WCOadpaIyfI/AAAAAAAABZM/ej47D-05YdUmfadsLfBAbmNhcArkzGqtACLcB/s1600/image05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="454" src="https://3.bp.blogspot.com/-EXyid2JWpCk/WCOadpaIyfI/AAAAAAAABZM/ej47D-05YdUmfadsLfBAbmNhcArkzGqtACLcB/s640/image05.png" width="640" /></a></div><br />To help the user identify which end is the start of the line, we give the start and end points different colors and interpolate between them. Blue and red are the default colors, but they’re user-configurable. Although day-to-day movement of stations may seem erratic, by using this method, one can make out a general trend in the relative motion of a station.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-StX-6aPSUQg/WCOaj1VK1RI/AAAAAAAABZQ/qCYNzJQZEXU6JJZ57LqiEiYYNvwJw3klQCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="448" src="https://3.bp.blogspot.com/-StX-6aPSUQg/WCOaj1VK1RI/AAAAAAAABZQ/qCYNzJQZEXU6JJZ57LqiEiYYNvwJw3klQCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Close-up of a single station’s movement during the three year period from 2003 to 2006.</td></tr></tbody></table><br />However, static renderings of this sort suffer from the same problem that velocity vector images do; in regions with a high density of GNSS stations, tracks overlap significantly with one another, obscuring details. To solve this problem, our visualization lets the user interactively control the time range of interest, the amount of amplification and other settings. 
In addition, by animating the lines from start to finish, the user gets a real sense of motion that’s difficult to achieve in a static image.<br /><br />We’ve applied our new visualization to the ~20 years of data from the <a href="https://www.geospatialworld.net/article/geonet-nationwide-gps-array-of-japan/">GEONET array in Japan</a>. Through it, we can see small but coherent changes in direction before and after the great 2011 Tohoku earthquake.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-3kjrFwhpWLk/WCOav1jxyDI/AAAAAAAABZU/8gFrgCl6e7M1nBwwzppAFr6YbtXPy1XkwCLcB/s1600/image03.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="348" src="https://4.bp.blogspot.com/-3kjrFwhpWLk/WCOav1jxyDI/AAAAAAAABZU/8gFrgCl6e7M1nBwwzppAFr6YbtXPy1XkwCLcB/s640/image03.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">GPS data sets (in .json format) for both the GEONET data in Japan and the Plate Boundary Observatory (PBO) data in the western US are available at <a href="http://earthquake.rc.fas.harvard.edu/">earthquake.rc.fas.harvard.edu</a>.</td></tr></tbody></table><br />This short animation shows many of the visualization’s interactive features. 
In order:<br /><ol><li>Modifying the multiplier adjusts how significantly the movements are magnified.</li><li>We can adjust the time slider nubs to select a particular time range of interest.</li><li>Using the map controls provided by the <a href="https://developers.google.com/maps/documentation/javascript/">Google Maps JavaScript API</a>, we can zoom into a tiny region of the map.</li><li>By enabling map markers, we can see information about individual GNSS stations.</li></ol>By focusing on a station of interest, we can even see curvature changes in the time periods before and after the event.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-Hv4xcTXwBf0/WCOa_FyI4YI/AAAAAAAABZY/eLPJoBAL87IIr03GVhU3KqKYBfl4bEQUACLcB/s1600/image04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="346" src="https://4.bp.blogspot.com/-Hv4xcTXwBf0/WCOa_FyI4YI/AAAAAAAABZY/eLPJoBAL87IIr03GVhU3KqKYBfl4bEQUACLcB/s640/image04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Station designated 960601 of Japan’s GEONET array is located on the island of Mikura-jima. Here we see the period from 2006 to 2012, with movement magnified 10<sup>5.1</sup> times (126,000x).</td></tr></tbody></table><br />To achieve fast rendering of the line segments, we created a custom overlay using <a href="https://threejs.org/">THREE.js</a> to render the lines in WebGL. Data for the GNSS stations is passed to the GPU in a data texture, which allows our vertex shader to position each point on-screen dynamically based on user settings and animation.<br /><br />We’re excited to continue this productive collaboration between Harvard and Google as we explore opportunities for groundbreaking, new earthquake visualizations. 
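One way to picture the data-texture packing mentioned above is one row of texels per station and one texel per day, so a vertex shader can fetch any (station, day) position with a single texture lookup. The sketch below uses a hypothetical layout and random offsets; the project's actual shader code may organize the texture differently:

```python
import numpy as np

# Hypothetical example: 3 stations x 4 days of (d_lon, d_lat) offsets.
n_stations, n_days = 3, 4
offsets = np.random.default_rng(0).normal(scale=1e-7, size=(n_stations, n_days, 2))

# Pack into an RGBA float texture: one row per station, one texel per day.
# R = longitude offset, G = latitude offset; B and A are left unused here.
texture = np.zeros((n_stations, n_days, 4), dtype=np.float32)
texture[..., 0] = offsets[..., 0]
texture[..., 1] = offsets[..., 1]

# A vertex shader would sample this with normalized (day, station) coordinates;
# here we emulate that lookup on the CPU:
def lookup(texture, station, day):
    return texture[station, day, :2]

assert np.allclose(lookup(texture, 1, 2), offsets[1, 2])
```

Storing positions in a texture rather than per-vertex attributes lets the animation and amplification settings change on the GPU without re-uploading geometry each frame.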
If you’d like to try out the visualization yourself, follow the instructions at <a href="http://earthquake.rc.fas.harvard.edu/">earthquake.rc.fas.harvard.edu</a>. It will walk you through the setup steps, including how to download the available data sets. If you’d like to report issues, great! Please submit them through the GitHub project page.<br /><br /><b>Acknowledgments</b><br /><br />We wish to thank Bill Freeman, a researcher on <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a>, who hatched the idea and developed the initial prototypes, and Fernanda Viégas and Martin Wattenberg of the <a href="https://research.google.com/bigpicture/">Big Picture Team</a> for their visualization design guidance.<br /><br /><b>References</b><br /><br />[1] Loveless, J. P., and Meade, B. J. (2010). <a href="http://onlinelibrary.wiley.com/doi/10.1029/2008JB006248/abstract">Geodetic imaging of plate motions, slip rates, and partitioning of deformation in Japan</a>, <i>Journal of Geophysical Research.</i><br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/open-source-visualization-of-gps-displacements-for-earthquake-cycle-physics-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Celebrating TensorFlow’s First Year</title>
		<link>https://googledata.org/google-research/celebrating-tensorflows-first-year/</link>
		<comments>https://googledata.org/google-research/celebrating-tensorflows-first-year/#comments</comments>
		<pubDate>Wed, 09 Nov 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=040ee977d4b134b04cf4a56d0fa60a5b</guid>
		<description><![CDATA[<span>Posted by Zak Stone, Product Manager for TensorFlow, on behalf of the TensorFlow team</span><br /><i><br /></i> <i>(Cross-posted on the <a href="https://opensource.googleblog.com/2016/11/celebrating-tensorflows-first-year.html">Google Open Source Blog</a> &#38; <a href="https://developers.googleblog.com/2016/11/celebrating-tensorflows-first-year.html">Google Developers Blog</a>)</i><br /><br />It has been an eventful year since the <a href="http://g.co/brain">Google Brain Team</a> <a href="https://research.googleblog.com/2015/11/tensorflow-googles-latest-machine_9.html">open-sourced TensorFlow</a> to accelerate machine learning research and <a href="https://blog.google/topics/machine-learning/tensorflow-smarter-machine-learning-for/">make technology work better for everyone</a>. There has been an amazing amount of activity around the project: more than 480 people have contributed directly to <a href="https://www.tensorflow.org/">TensorFlow</a>, including Googlers, external researchers, independent programmers, students, and senior developers at other large companies. 
TensorFlow is now <a href="https://github.com/tensorflow/tensorflow">the most popular</a> machine learning project on GitHub.<br /><div><a href="https://2.bp.blogspot.com/-KPtDpnhqRX4/WCNMJAxbg6I/AAAAAAAABYU/2ILORqwfaoQteN-zDsS_9zH-RxJxdu9egCLcB/s1600/TF_logo_no_shadow_1.png"><img border="0" height="255" src="https://2.bp.blogspot.com/-KPtDpnhqRX4/WCNMJAxbg6I/AAAAAAAABYU/2ILORqwfaoQteN-zDsS_9zH-RxJxdu9egCLcB/s400/TF_logo_no_shadow_1.png" width="400"></a></div><br />With more than 10,000 commits in just twelve months, we&#8217;ve made <a href="https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md">numerous performance improvements</a>, <a href="https://research.googleblog.com/2016/04/announcing-tensorflow-08-now-with.html">added support for distributed training</a>, <a href="https://petewarden.com/2016/09/27/tensorflow-for-mobile-poets/">brought TensorFlow to iOS</a> and <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/">Raspberry Pi</a>, and integrated TensorFlow with widely-used <a href="https://github.com/tensorflow/ecosystem">big data infrastructure</a>. 
We&#8217;ve also made TensorFlow accessible from <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/go/README.md">Go</a>, <a href="https://github.com/tensorflow/rust/blob/master/README.md">Rust</a> and <a href="https://github.com/tensorflow/haskell/blob/master/README.md">Haskell</a>, <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">released state-of-the-art image classification models</a>, and answered thousands of questions on <a href="https://github.com/tensorflow/tensorflow/issues">GitHub</a>, <a href="http://stackoverflow.com/questions/tagged/tensorflow">StackOverflow</a>&#160;and the <a href="https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss">TensorFlow mailing list</a> along the way.<br /><br />At Google, TensorFlow supports everything from large-scale product features to exploratory research. We recently launched <a href="https://research.googleblog.com/2016/09/a-neural-network-for-machine.html">major improvements to Google Translate</a> using TensorFlow (and <a href="https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html">Tensor Processing Units</a>, which are special hardware accelerators for TensorFlow). <a href="https://magenta.tensorflow.org/welcome-to-magenta">Project Magenta</a> is working on new reinforcement learning-based models that can <a href="https://magenta.tensorflow.org/2016/11/09/tuning-recurrent-networks-with-reinforcement-learning/">produce melodies</a>, and a visiting PhD student recently worked with the Google Brain team to build a TensorFlow model that can <a href="https://research.googleblog.com/2016/10/supercharging-style-transfer.html">automatically interpolate between artistic styles</a>. 
<a href="https://deepmind.com/">DeepMind</a> has also <a href="https://research.googleblog.com/2016/04/deepmind-moves-to-tensorflow.html">decided to use TensorFlow</a> to power all of their research &#8211; for example, they recently produced <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">fascinating generative models</a> of speech and music based on raw audio.<br /><br />We&#8217;re especially excited to see how people all over the world are using TensorFlow. For example:<br /><br /><ul><li>Australian marine biologists are using TensorFlow to <a href="https://blog.google/topics/machine-learning/could-machine-learning-save-sea-cow">find sea cows</a> in tens of thousands of hi-res photos to better understand their populations, which are under threat of extinction.</li><li>An enterprising Japanese cucumber farmer trained a model with TensorFlow to <a href="https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow">sort cucumbers</a> by size, shape, and other characteristics.</li><li>Radiologists have adapted TensorFlow to identify <a href="https://www.ncbi.nlm.nih.gov/pubmed/27730415">signs of Parkinson&#8217;s disease</a> in medical scans.</li><li>Data scientists in the Bay Area have rigged up TensorFlow and the Raspberry Pi to <a href="https://svds.com/introduction-to-trainspotting/">keep track of the Caltrain</a>.</li></ul><br />We&#8217;re committed to making sure TensorFlow scales all the way from research to production and from the tiniest Raspberry Pi all the way up to server farms filled with GPUs or TPUs. 
But TensorFlow is more than a single open-source project &#8211; we&#8217;re doing our best to foster an open-source ecosystem of related software and machine learning models around it:<br /><br /><ul><li>The <a href="https://research.googleblog.com/2016/02/running-your-models-in-production-with.html">TensorFlow Serving</a> project simplifies the process of serving TensorFlow models in production.</li><li>TensorFlow &#8220;<a href="https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html">Wide and Deep</a>&#8221; models combine the strengths of traditional linear models and modern deep neural networks.</li><li>For those who are interested in working with TensorFlow in the cloud, <a href="https://cloud.google.com/">Google Cloud Platform</a> recently launched <a href="https://cloud.google.com/ml/">Cloud Machine Learning</a>, which offers TensorFlow as a managed service.</li></ul><br />Furthermore, <a href="https://github.com/tensorflow/models">TensorFlow&#8217;s repository of models</a> continues to grow with contributions from the community, with <a href="https://github.com/search?q=tensorflow">more than 3000 TensorFlow-related repositories</a> listed on GitHub alone! To participate in the TensorFlow community, you can follow our new Twitter account (<a href="https://twitter.com/tensorflow">@tensorflow</a>), <a href="https://github.com/tensorflow/tensorflow">find us on GitHub</a>, <a href="http://stackoverflow.com/questions/tagged/tensorflow">ask and answer questions on StackOverflow</a>, and join the <a href="https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss">community discussion list</a>.<br /><br />Thanks very much to all of you who have already adopted TensorFlow in your cutting-edge products, your ambitious research, your fast-growing startups, and your school projects; special thanks to everyone who has <a href="https://github.com/tensorflow/tensorflow">contributed directly</a> to the codebase. 
In collaboration with the global machine learning community, we look forward to making TensorFlow even better in the years to come!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Zak Stone, Product Manager for TensorFlow, on behalf of the TensorFlow team</span><br /><i><br /></i> <i>(Cross-posted on the <a href="https://opensource.googleblog.com/2016/11/celebrating-tensorflows-first-year.html">Google Open Source Blog</a> &amp; <a href="https://developers.googleblog.com/2016/11/celebrating-tensorflows-first-year.html">Google Developers Blog</a>)</i><br /><br />It has been an eventful year since the <a href="http://g.co/brain">Google Brain Team</a> <a href="https://research.googleblog.com/2015/11/tensorflow-googles-latest-machine_9.html">open-sourced TensorFlow</a> to accelerate machine learning research and <a href="https://blog.google/topics/machine-learning/tensorflow-smarter-machine-learning-for/">make technology work better for everyone</a>. There has been an amazing amount of activity around the project: more than 480 people have contributed directly to <a href="https://www.tensorflow.org/">TensorFlow</a>, including Googlers, external researchers, independent programmers, students, and senior developers at other large companies. 
TensorFlow is now <a href="https://github.com/tensorflow/tensorflow">the most popular</a> machine learning project on GitHub.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-KPtDpnhqRX4/WCNMJAxbg6I/AAAAAAAABYU/2ILORqwfaoQteN-zDsS_9zH-RxJxdu9egCLcB/s1600/TF_logo_no_shadow_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="255" src="https://2.bp.blogspot.com/-KPtDpnhqRX4/WCNMJAxbg6I/AAAAAAAABYU/2ILORqwfaoQteN-zDsS_9zH-RxJxdu9egCLcB/s400/TF_logo_no_shadow_1.png" width="400" /></a></div><br />With more than 10,000 commits in just twelve months, we’ve made <a href="https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md">numerous performance improvements</a>, <a href="https://research.googleblog.com/2016/04/announcing-tensorflow-08-now-with.html">added support for distributed training</a>, <a href="https://petewarden.com/2016/09/27/tensorflow-for-mobile-poets/">brought TensorFlow to iOS</a> and <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples/">Raspberry Pi</a>, and integrated TensorFlow with widely-used <a href="https://github.com/tensorflow/ecosystem">big data infrastructure</a>. 
We’ve also made TensorFlow accessible from <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/go/README.md">Go</a>, <a href="https://github.com/tensorflow/rust/blob/master/README.md">Rust</a> and <a href="https://github.com/tensorflow/haskell/blob/master/README.md">Haskell</a>, <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">released state-of-the-art image classification models</a>, and answered thousands of questions on <a href="https://github.com/tensorflow/tensorflow/issues">GitHub</a>, <a href="http://stackoverflow.com/questions/tagged/tensorflow">StackOverflow</a>&nbsp;and the <a href="https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss">TensorFlow mailing list</a> along the way.<br /><br />At Google, TensorFlow supports everything from large-scale product features to exploratory research. We recently launched <a href="https://research.googleblog.com/2016/09/a-neural-network-for-machine.html">major improvements to Google Translate</a> using TensorFlow (and <a href="https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html">Tensor Processing Units</a>, which are special hardware accelerators for TensorFlow). <a href="https://magenta.tensorflow.org/welcome-to-magenta">Project Magenta</a> is working on new reinforcement learning-based models that can <a href="https://magenta.tensorflow.org/2016/11/09/tuning-recurrent-networks-with-reinforcement-learning/">produce melodies</a>, and a visiting PhD student recently worked with the Google Brain team to build a TensorFlow model that can <a href="https://research.googleblog.com/2016/10/supercharging-style-transfer.html">automatically interpolate between artistic styles</a>. 
<a href="https://deepmind.com/">DeepMind</a> has also <a href="https://research.googleblog.com/2016/04/deepmind-moves-to-tensorflow.html">decided to use TensorFlow</a> to power all of their research – for example, they recently produced <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">fascinating generative models</a> of speech and music based on raw audio.<br /><br />We’re especially excited to see how people all over the world are using TensorFlow. For example:<br /><br /><ul><li>Australian marine biologists are using TensorFlow to <a href="https://blog.google/topics/machine-learning/could-machine-learning-save-sea-cow">find sea cows</a> in tens of thousands of hi-res photos to better understand their populations, which are under threat of extinction.</li><li>An enterprising Japanese cucumber farmer trained a model with TensorFlow to <a href="https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow">sort cucumbers</a> by size, shape, and other characteristics.</li><li>Radiologists have adapted TensorFlow to identify <a href="https://www.ncbi.nlm.nih.gov/pubmed/27730415">signs of Parkinson’s disease</a> in medical scans.</li><li>Data scientists in the Bay Area have rigged up TensorFlow and the Raspberry Pi to <a href="https://svds.com/introduction-to-trainspotting/">keep track of the Caltrain</a>.</li></ul><br />We’re committed to making sure TensorFlow scales all the way from research to production and from the tiniest Raspberry Pi all the way up to server farms filled with GPUs or TPUs. 
But TensorFlow is more than a single open-source project – we’re doing our best to foster an open-source ecosystem of related software and machine learning models around it:<br /><br /><ul><li>The <a href="https://research.googleblog.com/2016/02/running-your-models-in-production-with.html">TensorFlow Serving</a> project simplifies the process of serving TensorFlow models in production.</li><li>TensorFlow “<a href="https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html">Wide and Deep</a>” models combine the strengths of traditional linear models and modern deep neural networks.</li><li>For those who are interested in working with TensorFlow in the cloud, <a href="https://cloud.google.com/">Google Cloud Platform</a> recently launched <a href="https://cloud.google.com/ml/">Cloud Machine Learning</a>, which offers TensorFlow as a managed service.</li></ul><br />Furthermore, <a href="https://github.com/tensorflow/models">TensorFlow’s repository of models</a> continues to grow with contributions from the community, with <a href="https://github.com/search?q=tensorflow">more than 3000 TensorFlow-related repositories</a> listed on GitHub alone! To participate in the TensorFlow community, you can follow our new Twitter account (<a href="https://twitter.com/tensorflow">@tensorflow</a>), <a href="https://github.com/tensorflow/tensorflow">find us on GitHub</a>, <a href="http://stackoverflow.com/questions/tagged/tensorflow">ask and answer questions on StackOverflow</a>, and join the <a href="https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss">community discussion list</a>.<br /><br />Thanks very much to all of you who have already adopted TensorFlow in your cutting-edge products, your ambitious research, your fast-growing startups, and your school projects; special thanks to everyone who has <a href="https://github.com/tensorflow/tensorflow">contributed directly</a> to the codebase. 
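As a rough illustration of the "Wide and Deep" idea listed above, combining a linear model over sparse features with a small deep network over dense features, here is a minimal sketch in plain Python. All feature names, weights, and layer sizes below are hypothetical; the real implementation is the TensorFlow one described in the linked post.

```python
import math

def wide_logit(sparse_features, wide_weights):
    """'Wide' part: a linear model over sparse (e.g. crossed) features."""
    return sum(wide_weights.get(f, 0.0) for f in sparse_features)

def deep_logit(dense_features, layers):
    """'Deep' part: a tiny feed-forward network with ReLU activations."""
    h = dense_features
    for weights, biases in layers:
        h = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return sum(h)  # collapse the final hidden layer into a single logit

def wide_and_deep(sparse_features, dense_features, wide_weights, layers):
    """Add both logits and squash with a sigmoid, as in joint training."""
    z = wide_logit(sparse_features, wide_weights) + deep_logit(dense_features, layers)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example: one crossed sparse feature plus two dense inputs.
wide_weights = {"app=games AND query=horror": 1.2}
layers = [([[0.5, -0.3], [0.1, 0.8]], [0.0, 0.1])]  # one hidden layer of size 2
p = wide_and_deep(["app=games AND query=horror"], [1.0, 2.0], wide_weights, layers)
print(round(p, 3))
```

The point of the joint formulation is that the wide part can memorize specific feature crosses while the deep part generalizes from dense representations; both contribute to a single logit.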
In collaboration with the global machine learning community, we look forward to making TensorFlow even better in the years to come!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/celebrating-tensorflows-first-year/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>App Discovery With Google Play, Part 1: Understanding Topics</title>
		<link>https://googledata.org/google-research/app-discovery-with-google-play-part-1-understanding-topics/</link>
		<comments>https://googledata.org/google-research/app-discovery-with-google-play-part-1-understanding-topics/#comments</comments>
		<pubDate>Tue, 08 Nov 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f234b99c8ed7965e5a00d5057b075423</guid>
		<description><![CDATA[<span>Posted by Malay Haldar, Matt MacMahon, Neha Jha and Raj Arasu, Software Engineers</span><br /><br />Every month, more than a billion users come to Google Play to download apps for their mobile devices. While some are looking for specific apps, like Snapchat, others come with only a broad notion of what they are interested in, like &#8220;<i>horror games</i>&#8221; or &#8220;<i>selfie apps</i>&#8221;. These broad searches by topic represent nearly half of the queries in Play Store, so it&#8217;s critical to find the most relevant apps. <br /><br />Searches by topic require more than simply indexing apps by query terms; they require an <i>understanding</i> of the topics associated with an app. Machine learning approaches have been applied to similar problems, but success heavily depends on the number of training examples to learn about a topic. While for some popular topics such as &#8220;<i>social networking</i>&#8221; we had many labeled apps to learn from, the majority of topics had only a handful of examples. Our challenge was to learn from a very limited number of training examples and scale to millions of apps across thousands of topics, forcing us to adapt our machine learning techniques. <br /><br />Our initial attempt was to build a <a href="https://en.wikipedia.org/wiki/Deep_learning">deep neural network</a> (DNN) trained to predict topics for an app based on words and phrases from the app title and description. For example, if the app description mentioned &#8220;<i>frightening</i>&#8221;, &#8220;<i>very scary</i>&#8221;, and &#8220;<i>fear</i>&#8221; then associate the &#8220;<i>horror game</i>&#8221; topic with it. 
However, given the learning capacity of DNNs, it completely &#8220;memorized&#8221; the topics for the apps in our small training data and failed to generalize to new apps it hadn&#8217;t seen before.<br /><br />To generalize effectively, we needed a much larger dataset to train on, so we turned to how people learn as inspiration. In contrast to DNNs, human beings need much less training data.  For example, you would likely need to see very few &#8220;<i>horror game</i>&#8221; app descriptions before learning how to generalize and associate new apps to that genre. Just by knowing the language describing the apps, people can correctly infer topics from even a few examples.<br /><br />To emulate this, we tried a very rough approximation of this language-centric learning. We trained a neural network to learn how language was used to describe apps. We built a <a href="https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html#the-skip-gram-model">Skip-gram model</a>, where the neural network attempts to predict the words around a given word, for example &#8220;<i>share</i>&#8221; given &#8220;<i>photo</i>&#8221;. The neural network encodes its knowledge as vectors of floating point numbers, referred to as <i>embeddings</i>. These embeddings were used to train another model called a <i>classifier</i>, capable of distinguishing which topics applied to an app. We now needed much less training data to learn about app topics, due to the large amount of learning already done with Skip-gram.<br /><br />While this architecture generalized well for popular topics like &#8220;<i>social networking</i>&#8221;, we ran into a new problem for more niche topics like &#8220;<i>selfie</i>&#8221;. The single classifier built to predict all the topics together focused most of its learning on the popular topics, ignoring the errors it made on the less common ones. 
To solve this problem we built a separate classifier for each topic and tuned them in isolation.<br /><br />This architecture produced reasonable results, but would still sometimes overgeneralize. For instance, it might associate <i>Facebook</i> with &#8220;<i>dating</i>&#8221; or <i>Plants vs Zombies</i> with &#8220;<i>educational games</i>&#8221;. To produce more precise classifiers, we needed higher volume and quality of training data. We treated the system described above as a coarse classifier that pruned down every possible {app, topic} pair, numbering in billions, to a more manageable list of {app, topic} pairs of interest. We built a pipeline to have human raters evaluate the classifier output and fed consensus results back as training data. This process allowed us to bootstrap from our existing system, giving us a path to steadily improve classifier performance.<br /><br /><div><a href="https://4.bp.blogspot.com/-ysKil48TeE8/WCIPiiLzrzI/AAAAAAAABX8/uYqpeRJ-4fMV40A29eQ-lKdpGlOU1EQgQCLcB/s1600/image00.png"><img border="0" height="450" src="https://4.bp.blogspot.com/-ysKil48TeE8/WCIPiiLzrzI/AAAAAAAABX8/uYqpeRJ-4fMV40A29eQ-lKdpGlOU1EQgQCLcB/s640/image00.png" width="640"></a></div><br />To evaluate {app, topic} pairs by human raters, we asked them questions of the form, &#8220;<i>To what extent is topic X related to app Y?</i>&#8221; Multiple raters received the same question and independently selected answers on a rating scale to indicate if the topic was &#8220;important&#8221; for the app, &#8220;somewhat related&#8221;, or completely &#8220;off-topic&#8221;. Our initial evaluations showed a high level of disagreement amongst the raters. Diving deeper, we identified several causes of disagreement: vague guidelines for answer selection, insufficient rater training, evaluating broad topics like &#8220;<i>computer files</i>&#8221; and &#8220;<i>game physics</i>&#8221; that applied to most apps or games. 
Tackling these issues led to significant gains in rater agreement. Asking raters to choose an explicit reason for their answer from a curated list further improved reliability. Despite the improvements, we sometimes still have to &#8220;agree to disagree&#8221; and currently discard answers where raters fail to reach consensus.<br /><br />These app topic classifiers enable search and discovery features in the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps store</a>. The current system helps provide relevant results to our users, but we are constantly exploring new ways to improve the system, through additional signals, architectural improvements and new algorithms. In Part 2 of this series, we will discuss how to personalize the app discovery experience for users.<br /><br /><b>Acknowledgments</b><br />This work was done within the Google Play team in close collaboration with Liadan O'Callaghan, Yuhua Zhu, Mark Taylor and Michael Watson.<br /><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Malay Haldar, Matt MacMahon, Neha Jha and Raj Arasu, Software Engineers</span><br /><br />Every month, more than a billion users come to Google Play to download apps for their mobile devices. While some are looking for specific apps, like Snapchat, others come with only a broad notion of what they are interested in, like “<i>horror games</i>” or “<i>selfie apps</i>”. These broad searches by topic represent nearly half of the queries in Play Store, so it’s critical to find the most relevant apps. <br /><br />Searches by topic require more than simply indexing apps by query terms; they require an <i>understanding</i> of the topics associated with an app. Machine learning approaches have been applied to similar problems, but success heavily depends on the number of training examples to learn about a topic. While for some popular topics such as “<i>social networking</i>” we had many labeled apps to learn from, the majority of topics had only a handful of examples. Our challenge was to learn from a very limited number of training examples and scale to millions of apps across thousands of topics, forcing us to adapt our machine learning techniques. <br /><br />Our initial attempt was to build a <a href="https://en.wikipedia.org/wiki/Deep_learning">deep neural network</a> (DNN) trained to predict topics for an app based on words and phrases from the app title and description. For example, if the app description mentioned “<i>frightening</i>”, “<i>very scary</i>”, and “<i>fear</i>” then associate the “<i>horror game</i>” topic with it. However, given the learning capacity of DNNs, it completely “memorized” the topics for the apps in our small training data and failed to generalize to new apps it hadn’t seen before.<br /><br />To generalize effectively, we needed a much larger dataset to train on, so we turned to how people learn as inspiration. 
In contrast to DNNs, human beings need much less training data.  For example, you would likely need to see very few “<i>horror game</i>” app descriptions before learning how to generalize and associate new apps to that genre. Just by knowing the language describing the apps, people can correctly infer topics from even a few examples.<br /><br />To emulate this, we tried a very rough approximation of this language-centric learning. We trained a neural network to learn how language was used to describe apps. We built a <a href="https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html#the-skip-gram-model">Skip-gram model</a>, where the neural network attempts to predict the words around a given word, for example “<i>share</i>” given “<i>photo</i>”. The neural network encodes its knowledge as vectors of floating point numbers, referred to as <i>embeddings</i>. These embeddings were used to train another model called a <i>classifier</i>, capable of distinguishing which topics applied to an app. We now needed much less training data to learn about app topics, due to the large amount of learning already done with Skip-gram.<br /><br />While this architecture generalized well for popular topics like “<i>social networking</i>”, we ran into a new problem for more niche topics like “<i>selfie</i>”. The single classifier built to predict all the topics together focused most of its learning on the popular topics, ignoring the errors it made on the less common ones. To solve this problem we built a separate classifier for each topic and tuned them in isolation.<br /><br />This architecture produced reasonable results, but would still sometimes overgeneralize. For instance, it might associate <i>Facebook</i> with “<i>dating</i>” or <i>Plants vs Zombies</i> with “<i>educational games</i>”. To produce more precise classifiers, we needed higher volume and quality of training data. 
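The Skip-gram setup described above, predicting the words around a given word, can be illustrated with a short sketch. This is pure Python over a toy sentence; the production model was of course trained on vastly more text, and only the pair-generation step is shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a Skip-gram model:
    for each word, every other word within `window` positions is a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Toy example in the spirit of the post: predicting "share" given "photo".
pairs = skipgram_pairs("share your photo with friends".split(), window=2)
print(("photo", "share") in pairs)
```

A neural network trained on such pairs ends up encoding each word as an embedding vector, which is what the downstream topic classifiers consume.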
We treated the system described above as a coarse classifier that pruned down every possible {app, topic} pair, numbering in billions, to a more manageable list of {app, topic} pairs of interest. We built a pipeline to have human raters evaluate the classifier output and fed consensus results back as training data. This process allowed us to bootstrap from our existing system, giving us a path to steadily improve classifier performance.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-ysKil48TeE8/WCIPiiLzrzI/AAAAAAAABX8/uYqpeRJ-4fMV40A29eQ-lKdpGlOU1EQgQCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="450" src="https://4.bp.blogspot.com/-ysKil48TeE8/WCIPiiLzrzI/AAAAAAAABX8/uYqpeRJ-4fMV40A29eQ-lKdpGlOU1EQgQCLcB/s640/image00.png" width="640" /></a></div><br />To evaluate {app, topic} pairs by human raters, we asked them questions of the form, “<i>To what extent is topic X related to app Y?</i>” Multiple raters received the same question and independently selected answers on a rating scale to indicate if the topic was “important” for the app, “somewhat related”, or completely “off-topic”. Our initial evaluations showed a high level of disagreement amongst the raters. Diving deeper, we identified several causes of disagreement: vague guidelines for answer selection, insufficient rater training, evaluating broad topics like “<i>computer files</i>” and “<i>game physics</i>” that applied to most apps or games. Tackling these issues led to significant gains in rater agreement. Asking raters to choose an explicit reason for their answer from a curated list further improved reliability. 
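Feeding consensus results back as training data, as in the rater pipeline described above, amounts to simple majority voting over independent answers. The sketch below uses the answer labels quoted in the post; the agreement threshold is a hypothetical choice, not the one used in production.

```python
from collections import Counter

def consensus(answers, min_agreement=0.75):
    """Return the majority answer if enough raters agree, else None,
    mirroring the behavior of discarding answers without consensus."""
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) >= min_agreement else None

print(consensus(["important", "important", "important", "somewhat related"]))
print(consensus(["important", "somewhat related", "off-topic"]))
```

Only {app, topic} pairs whose answers clear the threshold would be added to the training set; the rest are discarded.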
Despite the improvements, we sometimes still have to “agree to disagree” and currently discard answers where raters fail to reach consensus.<br /><br />These app topic classifiers enable search and discovery features in the <a href="https://play.google.com/store/apps?hl=en">Google Play Apps store</a>. The current system helps provide relevant results to our users, but we are constantly exploring new ways to improve the system, through additional signals, architectural improvements and new algorithms. In Part 2 of this series, we will discuss how to personalize the app discovery experience for users.<br /><br /><b>Acknowledgments</b><br />This work was done within the Google Play team in close collaboration with Liadan O'Callaghan, Yuhua Zhu, Mark Taylor and Michael Watson.<br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/app-discovery-with-google-play-part-1-understanding-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Research suggestions at your fingertips with Explore in Docs</title>
		<link>https://googledata.org/google-docs/research-suggestions-at-your-fingertips-with-explore-in-docs/</link>
		<comments>https://googledata.org/google-docs/research-suggestions-at-your-fingertips-with-explore-in-docs/#comments</comments>
		<pubDate>Tue, 01 Nov 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Docs]]></category>
		<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=d1c80080568d0da0190341b473c04b71</guid>
		<description><![CDATA[<span>Posted by Kishore Papineni, Research Scientist, Google Research NY</span><br /><br />Enabling easy access to vast amounts of information across multiple languages and modalities (from text to images to video), computers have become highly influential tools for learning, allowing you to use the world&#8217;s information to aid you with your research. However, when researching a topic or writing a term paper, gathering all the information you need from a variety of sources on the Internet can be time-consuming, and at times, a distraction from the writing process. <br /><br />That&#8217;s why we developed algorithms for <a href="https://blog.google/products/docs/explore-docs-sheets-and-slides/">Explore in Docs</a>, a collaboration between the Coauthor and Apps teams that uses powerful Google infrastructure, best-in-class information retrieval, machine learning, and machine translation technologies to assemble the relevant information and sources for a research paper, all within the document. Explore in Docs suggests relevant content&#8212;in the form of topics, images, and snippets &#8212;based on the content of the document, allowing the user to focus on critical thinking and idea development.<br /><br /><b>More than just a Search</b><br /><br />Suggesting material that is relevant to the content in a Google Doc is a difficult problem. A naive approach would be to consider the content of a document as a Search query.  However, search engines are not designed to accept large blocks of text as queries, so they might truncate the query or focus on the wrong words. So the challenge becomes not only identifying relevant search terms based on the <i>overall</i> content of the document, but additionally providing <i>related</i> topics that may be useful.  
<br /><br />To tackle this, the Coauthor team built algorithms that are able to associate external content with topics - entities, abstract concepts - in a document and assign relative importance to each of them. This is accomplished by creating a &#8220;target&#8221; in a topic vector space that incorporates not only the topics you are writing about but also related topics, creating a variety of search terms that include both. Then, each returned search result (piece of text, image, etc) is embedded in the same vector space and the closest items in that vector space are suggested to the user. <br /><br />For example, if you&#8217;re writing about <a href="https://en.wikipedia.org/wiki/Monarch_butterfly">monarch butterflies</a>, our algorithms find that <i>monarch butterfly</i> and <i>milkweed plant</i> are related to each other. This is done by analyzing the statistics of discourse on the web, collected from hundreds of billions of sentences from billions of webpages across dozens of languages. Note that these two are not semantically close (an insect versus a plant). An example of a set of learned relations is below:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-7s-aI39cyME/WBi0dSTspoI/AAAAAAAABXk/xFosN3T9Ovoa62Iq2mEWDDDhjXCCydI2QCLcB/s1600/image01.png"><img border="0" height="529" src="https://2.bp.blogspot.com/-7s-aI39cyME/WBi0dSTspoI/AAAAAAAABXk/xFosN3T9Ovoa62Iq2mEWDDDhjXCCydI2QCLcB/s640/image01.png" width="640"></a></td></tr><tr><td>The connection between concepts related to "monarch butterfly", with the thickness of the lines representing the strength of connection, as determined by analysis of discourse on the web. 
Because this is a discourse graph and not a concept/classification hierarchy, this analysis indicates that "Butterflies &#38; moth" and "Monarch butterfly" are not discussed together as often as "monarch butterfly" and "milkweed".</td></tr></tbody></table>And because we take the entire document into account while constructing the search request and scoring each candidate piece of text, the resulting suggestions are typically different and more varied than the search snippets users would see if they search the web for each topic individually. By eliminating the need to switch tabs to search, and additionally suggesting new, related topics based on discourse on the web, Explore provides opportunities for learning that users might not discover otherwise - all from the Doc that they&#8217;re currently working in! <br /><div><a href="https://1.bp.blogspot.com/-bUE_Cg1OJKI/WBizW-lwhqI/AAAAAAAABXc/OnLKPNe_8I0ogL4hE0eFopoUULRMfmP1gCLcB/s1600/image02.gif"><img border="0" height="308" src="https://1.bp.blogspot.com/-bUE_Cg1OJKI/WBizW-lwhqI/AAAAAAAABXc/OnLKPNe_8I0ogL4hE0eFopoUULRMfmP1gCLcB/s640/image02.gif" width="640"></a></div><br /><b>The information you need, in multiple languages</b><br /><br />Cross-lingual predictive search is another key aspect of what we have designed and built. If the relevant material is likely to be in foreign languages, Google searches the web in those languages and translates the selected nuggets into the language of the document. <br /><br />In the example pictured below, the user begins to type an essay in Docs about Claudia Neto and clicks on the &#8220;Explore&#8221; button to learn more about her. Explore returns relevant &#8220;Topics&#8221; and &#8220;Images&#8221; as well as &#8220;Related Research&#8221; sourced from multiple websites. 
Also, Explore suggests Dolores Silva as a related topic since she and Claudia have high mutual information in multilingual web text (statistics collected from more than 10 billion webpages).<br /><div><a href="https://2.bp.blogspot.com/-JtRF-_RDge4/WBizpktj1jI/AAAAAAAABXg/JsH_7iFxohI83geWtljb9dAkGCOMjpd9gCLcB/s1600/image00.gif"><img border="0" height="416" src="https://2.bp.blogspot.com/-JtRF-_RDge4/WBizpktj1jI/AAAAAAAABXg/JsH_7iFxohI83geWtljb9dAkGCOMjpd9gCLcB/s640/image00.gif" width="640"></a></div>Because Swedish ranks high among languages that have significant discourse on Claudia Neto, our algorithms search Swedish content on the Internet for any additional information about her that might not be available on English websites. Before returning information obtained from the Swedish websites, we use <a href="https://translate.google.com/">Google Translate</a> to render the nugget in the user&#8217;s preferred language (in this case, English). Related Research is currently available in 10 languages with more to come in the future.<br /><br /><a href="https://blog.google/products/docs/explore-docs-sheets-and-slides/">Explore in Docs</a> is a useful tool that can be used worldwide, in all forms of industry and at all levels of education. Try out the Explore feature the next time you create a document, and check back for more exciting progress from the Coauthor team!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Kishore Papineni, Research Scientist, Google Research NY</span><br /><br />Enabling easy access to vast amounts of information across multiple languages and modalities (from text to images to video), computers have become highly influential tools for learning, allowing you to use the world’s information to aid you with your research. However, when researching a topic or writing a term paper, gathering all the information you need from a variety of sources on the Internet can be time-consuming, and at times, a distraction from the writing process. <br /><br />That’s why we developed algorithms for <a href="https://blog.google/products/docs/explore-docs-sheets-and-slides/">Explore in Docs</a>, a collaboration between the Coauthor and Apps teams that uses powerful Google infrastructure, best-in-class information retrieval, machine learning, and machine translation technologies to assemble the relevant information and sources for a research paper, all within the document. Explore in Docs suggests relevant content—in the form of topics, images, and snippets —based on the content of the document, allowing the user to focus on critical thinking and idea development.<br /><br /><b>More than just a Search</b><br /><br />Suggesting material that is relevant to the content in a Google Doc is a difficult problem. A naive approach would be to consider the content of a document as a Search query.  However, search engines are not designed to accept large blocks of text as queries, so they might truncate the query or focus on the wrong words. So the challenge becomes not only identifying relevant search terms based on the <i>overall</i> content of the document, but additionally providing <i>related</i> topics that may be useful.  
<br /><br />To tackle this, the Coauthor team built algorithms that are able to associate external content with topics - entities, abstract concepts - in a document and assign relative importance to each of them. This is accomplished by creating a “target” in a topic vector space that incorporates not only the topics you are writing about but also related topics, creating a variety of search terms that include both. Then, each returned search result (piece of text, image, etc) is embedded in the same vector space and the closest items in that vector space are suggested to the user. <br /><br />For example, if you’re writing about <a href="https://en.wikipedia.org/wiki/Monarch_butterfly">monarch butterflies</a>, our algorithms find that <i>monarch butterfly</i> and <i>milkweed plant</i> are related to each other. This is done by analyzing the statistics of discourse on the web, collected from hundreds of billions of sentences from billions of webpages across dozens of languages. Note that these two are not semantically close (an insect versus a plant). An example of a set of learned relations is below:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-7s-aI39cyME/WBi0dSTspoI/AAAAAAAABXk/xFosN3T9Ovoa62Iq2mEWDDDhjXCCydI2QCLcB/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="529" src="https://2.bp.blogspot.com/-7s-aI39cyME/WBi0dSTspoI/AAAAAAAABXk/xFosN3T9Ovoa62Iq2mEWDDDhjXCCydI2QCLcB/s640/image01.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The connection between concepts related to "monarch butterfly", with the thickness of the lines representing the strength of connection, as determined by analysis of discourse on the web. 
Because this is a discourse graph and not a concept/classification hierarchy, this analysis indicates that "Butterflies &amp; moth" and "Monarch butterfly" are not discussed together as often as "monarch butterfly" and "milkweed".</td></tr></tbody></table>And because we take the entire document into account while constructing the search request and scoring each candidate piece of text, the resulting suggestions are typically different and more varied than the search snippets users would see if they search the web for each topic individually. By eliminating the need to switch tabs to search, and additionally suggesting new, related topics based on discourse on the web, Explore provides opportunities for learning that users might not discover otherwise - all from the Doc that they’re currently working in! <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-bUE_Cg1OJKI/WBizW-lwhqI/AAAAAAAABXc/OnLKPNe_8I0ogL4hE0eFopoUULRMfmP1gCLcB/s1600/image02.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://1.bp.blogspot.com/-bUE_Cg1OJKI/WBizW-lwhqI/AAAAAAAABXc/OnLKPNe_8I0ogL4hE0eFopoUULRMfmP1gCLcB/s640/image02.gif" width="640" /></a></div><br /><b>The information you need, in multiple languages</b><br /><br />Cross-lingual predictive search is another key aspect of what we have designed and built. If the relevant material is likely to be in foreign languages, Google searches the web in those languages and translates the selected nuggets into the language of the document. <br /><br />In the example pictured below, the user begins to type an essay in Docs about Claudia Neto and clicks on the “Explore” button to learn more about her. Explore returns relevant “Topics” and “Images” as well as “Related Research” sourced from multiple websites. 
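The vector-space matching described earlier (embed the document's topic "target" and every candidate result in the same space, then surface the closest items) can be sketched in a few lines. The vectors and snippet names below are hand-made stand-ins for the learned embeddings:

```python
import numpy as np

# Hand-made embeddings standing in for the learned topic vector space.
# The document's "target" vector incorporates its main and related topics.
doc_target = np.array([0.9, 0.4, 0.1])
candidates = {
    "snippet on monarch migration": np.array([0.8, 0.5, 0.0]),
    "snippet on garden furniture":  np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each returned search result is embedded in the same space; the
# candidates closest to the target are the ones suggested to the user.
ranked = sorted(candidates,
                key=lambda k: cosine(doc_target, candidates[k]),
                reverse=True)
print(ranked[0])
```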
Also, Explore suggests Dolores Silva as a related topic since she and Claudia have high mutual information in multilingual web text (statistics collected from more than 10 billion webpages).<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-JtRF-_RDge4/WBizpktj1jI/AAAAAAAABXg/JsH_7iFxohI83geWtljb9dAkGCOMjpd9gCLcB/s1600/image00.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="416" src="https://2.bp.blogspot.com/-JtRF-_RDge4/WBizpktj1jI/AAAAAAAABXg/JsH_7iFxohI83geWtljb9dAkGCOMjpd9gCLcB/s640/image00.gif" width="640" /></a></div>Because Swedish ranks high among languages that have significant discourse on Claudia Neto, our algorithms search Swedish content on the Internet for any additional information about her that might not be available on English websites. Before returning information obtained from the Swedish websites, we use <a href="https://translate.google.com/">Google Translate</a> to render the nugget in the user’s preferred language (in this case, English). Related Research is currently available in 10 languages with more to come in the future.<br /><br /><a href="https://blog.google/products/docs/explore-docs-sheets-and-slides/">Explore in Docs</a> is a useful tool that can be used worldwide, in all forms of industry and at all levels of education. Try out the Explore feature the next time you create a document, and check back for more exciting progress from the Coauthor team!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-docs/research-suggestions-at-your-fingertips-with-explore-in-docs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Supercharging Style Transfer</title>
		<link>https://googledata.org/google-research/supercharging-style-transfer/</link>
		<comments>https://googledata.org/google-research/supercharging-style-transfer/#comments</comments>
		<pubDate>Wed, 26 Oct 2016 16:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=bb5bb29476adc8bdcf42ecc9d70c68b0</guid>
		<description><![CDATA[<span>Posted by Vincent Dumoulin<a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>, Jonathon Shlens and Manjunath Kudlur, Google Brain Team</span><br /><br /><i><a href="https://en.wikipedia.org/wiki/Pastiche">Pastiche</a></i>. A French word, it designates a work of art that imitates the style of another one (not to be confused with its more humorous Greek cousin, <i>parody</i>). Although it has been used for a long time in visual art, music and literature, pastiche has been getting mass attention lately with <a href="https://www.reddit.com/r/deepstyle">online forums</a> dedicated to images that have been modified to be in the style of famous paintings. Using a technique known as <i>style transfer</i>, these images are generated by phone or web apps that allow a user to render their favorite picture in the style of a well known work of art.<br /><br />Although users have already produced gorgeous pastiches using the current technology, we feel that it could be made even more engaging. Right now, each painting is its own island, so to speak: the user provides a content image, selects an artistic style and gets a pastiche back.  But what if one could combine many different styles, exploring unique mixtures of well known artists to create an entirely unique pastiche?<br /><br /><b>Learning a representation for artistic style</b><br /><br />In our recent paper titled &#8220;<i><a href="https://arxiv.org/abs/1610.07629">A Learned Representation for Artistic Style</a></i>&#8221;, we introduce a simple method to allow a single deep convolutional style transfer network to learn multiple styles at the same time. The network, having learned multiple styles, is able to do <i>style interpolation</i>, where the pastiche varies smoothly from one style to another. 
Our method enables style interpolation in real-time as well, allowing this to be applied not only to static images, but also videos.<br /><div></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td>Credit: awesome dog role played by Google Brain team office dog Picabo.</td></tr></tbody></table>In the video above, multiple styles are combined in real-time and the resulting style is applied <i>using a single style transfer network</i>. The user is provided with a set of 13 different painting styles and adjusts their relative strengths in the final style via sliders. In this demonstration, the user is an active participant in producing the pastiche.<br /><br /><b>A Quick History of Style Transfer</b><br /><br />While transferring the style of one image to another has existed for nearly 15 years [1] [2], leveraging neural networks to accomplish it is both very recent and very fascinating. In &#8220;<i><a href="https://arxiv.org/abs/1508.06576">A Neural Algorithm of Artistic Style</a></i>&#8221; [3], researchers Gatys, Ecker &#38; Bethge introduced a method that uses deep convolutional neural network (CNN) classifiers. The pastiche image is found via optimization: the algorithm looks for an image which elicits the same kind of activations in the CNN&#8217;s lower layers - which capture the overall rough aesthetic of the style input (broad brushstrokes, cubist patterns, etc.) - yet produces activations in the higher layers - which capture the things that make the subject recognizable - that are close to those produced by the content image. From some starting point (e.g. 
random noise, or the content image itself), the pastiche image is progressively refined until these requirements are met.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-kV4SKTFlWQk/WA6n82yFFJI/AAAAAAAABWY/9GcePSQZ7qcY95b7zVnCBR4ABWR7K2o4gCLcB/s1600/image04.png"><img border="0" height="190" src="https://2.bp.blogspot.com/-kV4SKTFlWQk/WA6n82yFFJI/AAAAAAAABWY/9GcePSQZ7qcY95b7zVnCBR4ABWR7K2o4gCLcB/s640/image04.png" width="640"></a></td></tr><tr><td>Content image: The <a href="https://commons.wikimedia.org/wiki/File:Tuebingen_Neckarfront.jpg">T&#252;bingen Neckarfront by Andreas Praefcke</a>, Style painting: &#8220;<a href="https://www.google.com/culturalinstitute/beta/u/0/asset/head-of-a-clown/CQHMqKf7DRo76w">Head of a Clown</a>&#8221;, by <a href="https://en.wikipedia.org/wiki/Georges_Rouault">Georges Rouault</a>.</td></tr></tbody></table>The pastiches produced via this algorithm look <i>spectacular</i>:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-jYGbp0Ow1Cc/WA6oWw63F7I/AAAAAAAABWc/8_E5A1dbPP4xeo1GuIGTsYvG6TuXIfmoQCLcB/s1600/image06.png"><img border="0" height="478" src="https://3.bp.blogspot.com/-jYGbp0Ow1Cc/WA6oWw63F7I/AAAAAAAABWc/8_E5A1dbPP4xeo1GuIGTsYvG6TuXIfmoQCLcB/s640/image06.png" width="640"></a></td></tr><tr><td>Figure adapted from L. Gatys et al. "<a href="https://arxiv.org/abs/1508.06576">A Neural Algorithm of Artistic Style</a>" (2015).&#160;</td></tr></tbody></table>This work is considered a breakthrough in the field of deep learning research because it provided the first proof of concept for neural network-based style transfer. Unfortunately this method for stylizing an individual image is computationally demanding. 
For instance, in the first demos available on the web, one would upload a photo to a server, and then still have plenty of time to go grab a cup of coffee before a result was available.<br /><br />This process was sped up significantly by subsequent research [4, 5] that recognized that this optimization problem may be recast as an image transformation problem, where one wishes to apply a single, fixed painting style to an arbitrary content image (e.g. a photograph). The problem can then be solved by teaching a feed-forward, deep convolutional neural network to alter a corpus of content images to match the style of a painting. The goal of the trained network is two-fold: maintain the content of the original image while matching the visual style of the painting.<br /><br />The end result of this was that what once took a few minutes for a single static image could now be run in real time (e.g. applying style transfer to a live video). However, the increase in speed that allowed real-time style transfer came with a cost - a given style transfer network is tied to the style of a <i>single</i> painting, losing some flexibility of the original algorithm, which was not tied to any one style. This means that to build a style transfer system capable of modeling 100 paintings, one has to train and store <i>100 separate style transfer networks</i>.<br /><br /><b>Our Contribution: Learning and Combining Multiple Styles</b><br /><br />We started from the observation that many artists from the impressionist period employ similar brush stroke techniques and color palettes. 
Furthermore, paintings by, say, Monet are even more visually similar.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-Y0f_lRc2kx4/WA6pgGGr32I/AAAAAAAABWo/EQZ35CY_mUkC2YSrlumcD5rb3xFKRTqPACLcB/s1600/img3.png"><img border="0" height="254" src="https://1.bp.blogspot.com/-Y0f_lRc2kx4/WA6pgGGr32I/AAAAAAAABWo/EQZ35CY_mUkC2YSrlumcD5rb3xFKRTqPACLcB/s640/img3.png" width="640"></a></td></tr><tr><td><a href="https://commons.wikimedia.org/wiki/File:Claude_Monet_037.jpg">Poppy Field</a> (left) and <a href="https://en.wikipedia.org/wiki/Impression,_Sunrise">Impression, Sunrise</a> (right) by <a href="https://en.wikipedia.org/wiki/Claude_Monet">Claude Monet</a>. Images from Wikipedia</td></tr></tbody></table>We leveraged this observation in our training of a machine learning system. That is, we trained a single system that is able to capture and generalize across many Monet paintings or even a diverse array of artists across genres. The pastiches produced are qualitatively comparable to those produced in previous work, while originating from the <i>same style transfer network</i>.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-73F44Q3CaOE/WA6uu5YKZSI/AAAAAAAABXE/XDcKcn6Wa98cWCJI3-LuniItgG7oUNxIwCLcB/s1600/image03.png"><img border="0" height="424" src="https://1.bp.blogspot.com/-73F44Q3CaOE/WA6uu5YKZSI/AAAAAAAABXE/XDcKcn6Wa98cWCJI3-LuniItgG7oUNxIwCLcB/s640/image03.png" width="640"></a></td></tr><tr><td>Pastiches produced by our single network, trained on 32 varied styles. 
These pastiches are qualitatively equivalent to those created by single-style networks: Image Credit: (from top to bottom) content photographs by <a href="https://commons.wikimedia.org/wiki/File:Tuebingen_Neckarfront.jpg">Andreas Praefcke</a>, <a href="https://en.wikipedia.org/wiki/File:GoldenGateBridge-001.jpg#file">Rich Niewiroski Jr.</a> and <a href="https://commons.wikimedia.org/wiki/File:Schultenhof_Mettingen_Bauerngarten_8.jpg">J.-H. Jan&#223;en</a>, (from left to right) style paintings by <a href="https://en.wikipedia.org/wiki/William_Glackens">William Glackens</a>, <a href="https://en.wikipedia.org/wiki/Paul_Signac">Paul Signac</a>, <a href="https://en.wikipedia.org/wiki/Georges_Rouault">Georges Rouault</a>, <a href="https://en.wikipedia.org/wiki/Edvard_Munch">Edvard Munch</a> and <a href="https://en.wikipedia.org/wiki/Vincent_van_Gogh">Vincent van Gogh</a>.</td></tr></tbody></table><div></div>The technique we developed is simple to implement and is not memory intensive. Furthermore, our network, trained on several artistic styles, permits arbitrarily combining multiple painting styles <b>in real-time</b>, as shown in the video above. Here are four styles being combined in different proportions on a photograph of <a href="https://en.wikipedia.org/wiki/T%C3%BCbingen">T&#252;bingen</a>:<br /><div><a href="https://2.bp.blogspot.com/-8xKOzpnsDCs/WA6slX7skUI/AAAAAAAABW4/iZ52N19hNSIpUgAQDXWnGaKt5FD4yGGtQCLcB/s1600/image05.jpg"><img border="0" height="456" src="https://2.bp.blogspot.com/-8xKOzpnsDCs/WA6slX7skUI/AAAAAAAABW4/iZ52N19hNSIpUgAQDXWnGaKt5FD4yGGtQCLcB/s640/image05.jpg" width="640"></a></div>Unlike previous approaches to fast style transfer, we feel that this method of modeling multiple styles at the same time opens the door to exciting new ways for users to interact with style transfer algorithms, not only allowing the freedom to create new styles based on the mixture of several others, but to do it in real-time. 
Stay tuned for a future post on the <a href="https://magenta.tensorflow.org/2016/11/01/multistyle-pastiche-generator/">Magenta blog</a>, in which we will describe the algorithm in more detail and release the <a href="https://www.tensorflow.org/">TensorFlow</a> source code to run this model and demo yourself. We also recommend that you check out <a href="https://www.youtube.com/watch?v=WHmp26bh0tI">Nat &#38; Lo&#8217;s fantastic video explanation</a> on the subject of style transfer.<br /><br /><b>References</b><br /><b><br /></b> [1] Efros, Alexei A., and William T. Freeman. <i><a href="https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/papers/efros-siggraph01.pdf">Image quilting for texture synthesis and transfer</a> </i>(2001).<br /><br />[2] Hertzmann, Aaron, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. <a href="http://mrl.nyu.edu/publications/image-analogies/"><i>Image analogies </i></a>(2001).<br /><br />[3] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. <a href="https://arxiv.org/abs/1508.06576"><i>A Neural Algorithm of Artistic Style</i></a> (2015).<br /><br />[4] Ulyanov, Dmitry, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. <a href="https://arxiv.org/abs/1603.03417">Texture Networks: Feed-forward Synthesis of Textures and Stylized Images</a> (2016).<br /><br />[5] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. <a href="https://arxiv.org/abs/1603.08155">Perceptual Losses for Real-Time Style Transfer and Super-Resolution</a> (2016).<br /><br /><span><br /><a name="1"><b>* </b></a>This work was done during an internship with the <a href="http://g.co/brain">Google Brain Team</a>. <a href="http://vdumoulin.github.io/about">Vincent</a> is currently a Ph.D. candidate at MILA, Universit&#233; de Montr&#233;al.<a href="http://research.googleblog.com/#top1"><sup>&#8617;</sup></a><br /></span><br /><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Vincent Dumoulin<a href="http://research.googleblog.com/2016/10/supercharging-style-transfer.html#1" name="top1"><sup>*</sup></a>, Jonathon Shlens and Manjunath Kudlur, Google Brain Team</span><br /><br /><i><a href="https://en.wikipedia.org/wiki/Pastiche">Pastiche</a></i>. A French word, it designates a work of art that imitates the style of another one (not to be confused with its more humorous Greek cousin, <i>parody</i>). Although it has been used for a long time in visual art, music and literature, pastiche has been getting mass attention lately with <a href="https://www.reddit.com/r/deepstyle">online forums</a> dedicated to images that have been modified to be in the style of famous paintings. Using a technique known as <i>style transfer</i>, these images are generated by phone or web apps that allow a user to render their favorite picture in the style of a well known work of art.<br /><br />Although users have already produced gorgeous pastiches using the current technology, we feel that it could be made even more engaging. Right now, each painting is its own island, so to speak: the user provides a content image, selects an artistic style and gets a pastiche back.  But what if one could combine many different styles, exploring unique mixtures of well known artists to create an entirely unique pastiche?<br /><br /><b>Learning a representation for artistic style</b><br /><br />In our recent paper titled “<i><a href="https://arxiv.org/abs/1610.07629">A Learned Representation for Artistic Style</a></i>”, we introduce a simple method to allow a single deep convolutional style transfer network to learn multiple styles at the same time. The network, having learned multiple styles, is able to do <i>style interpolation</i>, where the pastiche varies smoothly from one style to another. 
Our method enables style interpolation in real-time as well, allowing this to be applied not only to static images, but also videos.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/6ZHiARZmiUI/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/6ZHiARZmiUI?rel=0&amp;feature=player_embedded" width="640"></iframe></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td class="tr-caption" style="text-align: center;">Credit: awesome dog role played by Google Brain team office dog Picabo.</td></tr></tbody></table>In the video above, multiple styles are combined in real-time and the resulting style is applied <i>using a single style transfer network</i>. The user is provided with a set of 13 different painting styles and adjusts their relative strengths in the final style via sliders. In this demonstration, the user is an active participant in producing the pastiche.<br /><br /><b>A Quick History of Style Transfer</b><br /><br />While transferring the style of one image to another has existed for nearly 15 years [1] [2], leveraging neural networks to accomplish it is both very recent and very fascinating. In “<i><a href="https://arxiv.org/abs/1508.06576">A Neural Algorithm of Artistic Style</a></i>” [3], researchers Gatys, Ecker &amp; Bethge introduced a method that uses deep convolutional neural network (CNN) classifiers. The pastiche image is found via optimization: the algorithm looks for an image which elicits the same kind of activations in the CNN’s lower layers - which capture the overall rough aesthetic of the style input (broad brushstrokes, cubist patterns, etc.) 
- yet produces activations in the higher layers - which capture the things that make the subject recognizable - that are close to those produced by the content image. From some starting point (e.g. random noise, or the content image itself), the pastiche image is progressively refined until these requirements are met.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-kV4SKTFlWQk/WA6n82yFFJI/AAAAAAAABWY/9GcePSQZ7qcY95b7zVnCBR4ABWR7K2o4gCLcB/s1600/image04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="190" src="https://2.bp.blogspot.com/-kV4SKTFlWQk/WA6n82yFFJI/AAAAAAAABWY/9GcePSQZ7qcY95b7zVnCBR4ABWR7K2o4gCLcB/s640/image04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Content image: The <a href="https://commons.wikimedia.org/wiki/File:Tuebingen_Neckarfront.jpg">Tübingen Neckarfront by Andreas Praefcke</a>, Style painting: “<a href="https://www.google.com/culturalinstitute/beta/u/0/asset/head-of-a-clown/CQHMqKf7DRo76w">Head of a Clown</a>”, by <a href="https://en.wikipedia.org/wiki/Georges_Rouault">Georges Rouault</a>.</td></tr></tbody></table>The pastiches produced via this algorithm look <i>spectacular</i>:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-jYGbp0Ow1Cc/WA6oWw63F7I/AAAAAAAABWc/8_E5A1dbPP4xeo1GuIGTsYvG6TuXIfmoQCLcB/s1600/image06.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="478" src="https://3.bp.blogspot.com/-jYGbp0Ow1Cc/WA6oWw63F7I/AAAAAAAABWc/8_E5A1dbPP4xeo1GuIGTsYvG6TuXIfmoQCLcB/s640/image06.png" width="640" /></a></td></tr><tr><td class="tr-caption" 
style="text-align: center;">Figure adapted from L. Gatys et al. "<a href="https://arxiv.org/abs/1508.06576">A Neural Algorithm of Artistic Style</a>" (2015).&nbsp;</td></tr></tbody></table>This work is considered a breakthrough in the field of deep learning research because it provided the first proof of concept for neural network-based style transfer. Unfortunately this method for stylizing an individual image is computationally demanding. For instance, in the first demos available on the web, one would upload a photo to a server, and then still have plenty of time to go grab a cup of coffee before a result was available.<br /><br />This process was sped up significantly by subsequent research [4, 5] that recognized that this optimization problem may be recast as an image transformation problem, where one wishes to apply a single, fixed painting style to an arbitrary content image (e.g. a photograph). The problem can then be solved by teaching a feed-forward, deep convolutional neural network to alter a corpus of content images to match the style of a painting. The goal of the trained network is two-fold: maintain the content of the original image while matching the visual style of the painting.<br /><br />The end result of this was that what once took a few minutes for a single static image could now be run in real time (e.g. applying style transfer to a live video). However, the increase in speed that allowed real-time style transfer came with a cost - a given style transfer network is tied to the style of a <i>single</i> painting, losing some flexibility of the original algorithm, which was not tied to any one style. 
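The Gatys et al. optimization described above minimizes a weighted sum of a content loss (matching the content image's higher-layer activations) and a style loss (matching the Gram matrices of the style image's lower layers). A toy NumPy sketch, using random arrays as stand-ins for real CNN feature maps:

```python
import numpy as np

def gram(f):
    """Gram matrix of a (channels, positions) feature map: the
    channel-to-channel correlations that summarize a layer's 'style'."""
    c, n = f.shape
    return f @ f.T / n

def pastiche_loss(pastiche, content, style, alpha=1.0, beta=1e-3):
    """Combined loss in the spirit of Gatys et al.: match the content
    image's higher-layer activations and the style image's Gram
    matrices in the lower layers. Weights alpha/beta are illustrative."""
    content_loss = np.mean((pastiche["high"] - content["high"]) ** 2)
    style_loss = sum(np.mean((gram(pastiche[l]) - gram(style[l])) ** 2)
                     for l in ("low", "mid"))
    return alpha * content_loss + beta * style_loss

# Random stand-ins for CNN activations; in the real algorithm the
# pastiche image itself is iteratively refined to drive this loss down.
rng = np.random.default_rng(0)
make = lambda: {l: rng.standard_normal((8, 64)) for l in ("low", "mid", "high")}
pastiche, content, style = make(), make(), make()
loss = pastiche_loss(pastiche, content, style)
```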
This means that to build a style transfer system capable of modeling 100 paintings, one has to train and store <i>100 separate style transfer networks</i>.<br /><br /><b>Our Contribution: Learning and Combining Multiple Styles</b><br /><br />We started from the observation that many artists from the impressionist period employ similar brush stroke techniques and color palettes. Furthermore, paintings by, say, Monet are even more visually similar.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Y0f_lRc2kx4/WA6pgGGr32I/AAAAAAAABWo/EQZ35CY_mUkC2YSrlumcD5rb3xFKRTqPACLcB/s1600/img3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="254" src="https://1.bp.blogspot.com/-Y0f_lRc2kx4/WA6pgGGr32I/AAAAAAAABWo/EQZ35CY_mUkC2YSrlumcD5rb3xFKRTqPACLcB/s640/img3.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://commons.wikimedia.org/wiki/File:Claude_Monet_037.jpg">Poppy Field</a> (left) and <a href="https://en.wikipedia.org/wiki/Impression,_Sunrise">Impression, Sunrise</a> (right) by <a href="https://en.wikipedia.org/wiki/Claude_Monet">Claude Monet</a>. Images from Wikipedia</td></tr></tbody></table>We leveraged this observation in our training of a machine learning system. That is, we trained a single system that is able to capture and generalize across many Monet paintings or even a diverse array of artists across genres. 
The pastiches produced are qualitatively comparable to those produced in previous work, while originating from the <i>same style transfer network</i>.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-73F44Q3CaOE/WA6uu5YKZSI/AAAAAAAABXE/XDcKcn6Wa98cWCJI3-LuniItgG7oUNxIwCLcB/s1600/image03.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="424" src="https://1.bp.blogspot.com/-73F44Q3CaOE/WA6uu5YKZSI/AAAAAAAABXE/XDcKcn6Wa98cWCJI3-LuniItgG7oUNxIwCLcB/s640/image03.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pastiches produced by our single network, trained on 32 varied styles. These pastiches are qualitatively equivalent to those created by single-style networks: Image Credit: (from top to bottom) content photographs by <a href="https://commons.wikimedia.org/wiki/File:Tuebingen_Neckarfront.jpg">Andreas Praefcke</a>, <a href="https://en.wikipedia.org/wiki/File:GoldenGateBridge-001.jpg#file">Rich Niewiroski Jr.</a> and <a href="https://commons.wikimedia.org/wiki/File:Schultenhof_Mettingen_Bauerngarten_8.jpg">J.-H. Janßen</a>, (from left to right) style paintings by <a href="https://en.wikipedia.org/wiki/William_Glackens">William Glackens</a>, <a href="https://en.wikipedia.org/wiki/Paul_Signac">Paul Signac</a>, <a href="https://en.wikipedia.org/wiki/Georges_Rouault">Georges Rouault</a>, <a href="https://en.wikipedia.org/wiki/Edvard_Munch">Edvard Munch</a> and <a href="https://en.wikipedia.org/wiki/Vincent_van_Gogh">Vincent van Gogh</a>.</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div>The technique we developed is simple to implement and is not memory intensive. 
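One way to see why a multi-style network can stay memory-light: each style only needs its own small set of scale-and-shift parameters applied after normalization, while all convolutional weights are shared, and blending styles amounts to blending those parameters. A toy NumPy sketch of this idea (shapes, slider weights, and style names are illustrative, borrowed from the artists pictured above):

```python
import numpy as np

def conditional_instance_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel of a (C, H, W) feature map, then apply a
    style-specific scale (gamma) and shift (beta)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (x - mean) / np.sqrt(var + eps) + beta[:, None, None]

# Each learned style owns only its own gamma/beta vectors, so adding
# styles costs little extra memory. User-facing sliders pick the blend.
rng = np.random.default_rng(1)
styles = {s: (rng.standard_normal(16), rng.standard_normal(16))
          for s in ("glackens", "signac", "rouault", "munch")}
sliders = {"glackens": 0.4, "signac": 0.3, "rouault": 0.2, "munch": 0.1}

gamma = sum(w * styles[s][0] for s, w in sliders.items())
beta = sum(w * styles[s][1] for s, w in sliders.items())

features = rng.standard_normal((16, 32, 32))  # stand-in activations
stylized = conditional_instance_norm(features, gamma, beta)
print(stylized.shape)
```

Because only the normalization parameters change per style, interpolating them smoothly moves the pastiche from one style to another in real time.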
Furthermore, our network, trained on several artistic styles, permits arbitrarily combining multiple painting styles <b>in real-time</b>, as shown in the video above. Here are four styles being combined in different proportions on a photograph of <a href="https://en.wikipedia.org/wiki/T%C3%BCbingen">Tübingen</a>:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-8xKOzpnsDCs/WA6slX7skUI/AAAAAAAABW4/iZ52N19hNSIpUgAQDXWnGaKt5FD4yGGtQCLcB/s1600/image05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="456" src="https://2.bp.blogspot.com/-8xKOzpnsDCs/WA6slX7skUI/AAAAAAAABW4/iZ52N19hNSIpUgAQDXWnGaKt5FD4yGGtQCLcB/s640/image05.jpg" width="640" /></a></div>Unlike previous approaches to fast style transfer, we feel that this method of modeling multiple styles at the same time opens the door to exciting new ways for users to interact with style transfer algorithms, not only allowing the freedom to create new styles based on the mixture of several others, but to do it in real-time. Stay tuned for a future post on the <a href="https://magenta.tensorflow.org/2016/11/01/multistyle-pastiche-generator/">Magenta blog</a>, in which we will describe the algorithm in more detail and release the <a href="https://www.tensorflow.org/">TensorFlow</a> source code to run this model and demo yourself. We also recommend that you check out <a href="https://www.youtube.com/watch?v=WHmp26bh0tI">Nat &amp; Lo’s fantastic video explanation</a> on the subject of style transfer.<br /><br /><b>References</b><br /><b><br /></b> [1] Efros, Alexei A., and William T. Freeman. <i><a href="https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/papers/efros-siggraph01.pdf">Image quilting for texture synthesis and transfer</a> </i>(2001).<br /><br />[2] Hertzmann, Aaron, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. 
<a href="http://mrl.nyu.edu/publications/image-analogies/"><i>Image analogies </i></a>(2001).<br /><br />[3] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. <a href="https://arxiv.org/abs/1508.06576"><i>A Neural Algorithm of Artistic Style</i></a> (2015).<br /><br />[4] Ulyanov, Dmitry, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. <a href="https://arxiv.org/abs/1603.03417">Texture Networks: Feed-forward Synthesis of Textures and Stylized Images</a> (2016).<br /><br />[5] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. <a href="https://arxiv.org/abs/1603.08155">Perceptual Losses for Real-Time Style Transfer and Super-Resolution</a> (2016).<br /><br /><span class="Apple-style-span" style="font-size: small;"><br /><a name="1"><b>* </b></a>This work was done during an internship with the <a href="http://g.co/brain">Google Brain Team</a>. <a href="http://vdumoulin.github.io/about">Vincent</a> is currently a Ph.D. candidate at MILA, Université de Montréal.<a href="http://research.googleblog.com/2016/10/supercharging-style-transfer.html#top1"><sup>↩</sup></a><br /></span><br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/supercharging-style-transfer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Course Builder now supports scheduling, easier customization and more</title>
		<link>https://googledata.org/google-research/course-builder-now-supports-scheduling-easier-customization-and-more/</link>
		<comments>https://googledata.org/google-research/course-builder-now-supports-scheduling-easier-customization-and-more/#comments</comments>
		<pubDate>Mon, 17 Oct 2016 17:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=d1750c60a82838d4794650ae598fa1e4</guid>
		<description><![CDATA[<span>Posted by Adam Feldman, Product Manager and Pavel Simakov, Technical Lead, Course Builder Team</span><br /><br />Over the years, we've learned that there are as many ways to run an online course as there are instructors to run them. Today's release of <a href="https://www.google.com/edu/openonline/course-builder/downloads/index.html">Course Builder v1.11</a> has a focus on improved student access controls, easier visual customization and a new course explorer. Additionally, we've added better support for <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/set-up-course-builder/deploy-your-app.html">deploying from Windows</a>!<br /><br /><b><u>Improved student access controls</u></b><br />A course's availability is often dynamic - sometimes you want to make a course available to everyone all at once, while other times may call for the course to be available to some students before others. Perhaps registration will be available for a while and then the course later becomes read-only. To support these use cases, we've added Student Groups and Calendar Triggers.<br /><br /><ul><li><b>Student Groups</b> allow you to define which students can see which parts of a course.  Want your morning class to see unit 5 and your afternoon class to see unit 6 -- while letting random Internet visitors only see unit 1?  <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/prepare-for-students/student-groups.html">Student groups</a> have you covered.</li><li><b>Calendar Triggers</b> can be used to update course or content availability automatically at a specific time.  For instance, if your course goes live at midnight on Sunday night, you don't need to be at a computer to make it happen.  Or, if you want to unlock a new unit every week, you can set up a trigger to automate the process.  
Read more about <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/publish-a-course/availability.html#calendar-triggers">calendar triggers and availability</a>.</li></ul><br />You can even use these features together.  Say you want to start a new group of students through the course every month, giving each access to one new unit per week.  Using Student Groups and Calendar Triggers together, you can achieve this cohort-like functionality.<br /><br /><b><u>Easier visual customization</u></b><br />In the past, if you wanted to customize Course Builder's student experience beyond a certain point, you needed to be a Python developer.  We heard from many web developers that they would like to be able to create their own student-facing pages, too.  With this release, Course Builder includes a <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/for-course-builder-developers/specific-sub-tasks/render-with-graphql.html">GraphQL server</a> that allows you to create your own frontend experience, while still letting Course Builder take care of things like user sessions and statefulness.<br /><br /><b><u>New course explorer</u></b><br />Large Course Builder partners such as Google's <a href="https://digitalworkshop-eu.withgoogle.com/">Digital Workshop</a> and <a href="https://onlinecourses.nptel.ac.in/explorer">NPTEL</a> have many courses and students with diverse needs.  To help them, we've completely revamped the Course Explorer page, giving it richer information and interactivity, so your students can find which of your courses they're looking for.  
You can provide categories and start/end dates, in addition to the course title, abstract and instructor information.<br /><div><a href="https://3.bp.blogspot.com/-5VmQAmo2uJc/WAT3LuYI5TI/AAAAAAAABV4/VtGeJTzVYMUCoTbPVrwJWnH0okksZHzUgCLcB/s1600/image00.png"><img border="0" height="402" src="https://3.bp.blogspot.com/-5VmQAmo2uJc/WAT3LuYI5TI/AAAAAAAABV4/VtGeJTzVYMUCoTbPVrwJWnH0okksZHzUgCLcB/s640/image00.png" width="640"></a></div>In v1.11, we've added several new highly requested features. Together, they help make Course Builder easier to use and customize, giving you the flexibility to schedule things in advance.<br /><br />We've come a long way since <a href="https://opensource.googleblog.com/2012/09/helping-world-to-teach.html">releasing our first experimental code</a> over 4 years ago, turning Course Builder into a large open-source Google App Engine application with over 5 million student registrations across all Course Builder users. With these latest additions, we consider Course Builder feature complete and fully capable of delivering online learning at any scale. We will continue to provide support and bug fixes for those using the platform.<br /><br />We hope you&#8217;ll enjoy these new features and share how you&#8217;re using them in the <a href="https://groups.google.com/forum/?fromgroups#!forum/course-builder-forum">forum</a>.  Keep on learning!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Adam Feldman, Product Manager and Pavel Simakov, Technical Lead, Course Builder Team</span><br /><br />Over the years, we've learned that there are as many ways to run an online course as there are instructors to run them. Today's release of <a href="https://www.google.com/edu/openonline/course-builder/downloads/index.html">Course Builder v1.11</a> has a focus on improved student access controls, easier visual customization and a new course explorer. Additionally, we've added better support for <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/set-up-course-builder/deploy-your-app.html">deploying from Windows</a>!<br /><br /><b><u>Improved student access controls</u></b><br />A course's availability is often dynamic - sometimes you want to make a course available to everyone all at once, while other times may call for the course to be available to some students before others. Perhaps registration will be available for a while and then the course later becomes read-only. To support these use cases, we've added Student Groups and Calendar Triggers.<br /><br /><ul><li><b>Student Groups</b> allow you to define which students can see which parts of a course.  Want your morning class to see unit 5 and your afternoon class to see unit 6 -- while letting random Internet visitors only see unit 1?  <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/prepare-for-students/student-groups.html">Student groups</a> have you covered.</li><li><b>Calendar Triggers</b> can be used to update course or content availability automatically at a specific time.  For instance, if your course goes live at midnight on Sunday night, you don't need to be at a computer to make it happen.  Or, if you want to unlock a new unit every week, you can set up a trigger to automate the process.  
Read more about <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/publish-a-course/availability.html#calendar-triggers">calendar triggers and availability</a>.</li></ul><br />You can even use these features together.  Say you want to start a new group of students through the course every month, giving each access to one new unit per week.  Using Student Groups and Calendar Triggers together, you can achieve this cohort-like functionality.<br /><br /><b><u>Easier visual customization</u></b><br />In the past, if you wanted to customize Course Builder's student experience beyond a certain point, you needed to be a Python developer.  We heard from many web developers that they would like to be able to create their own student-facing pages, too.  With this release, Course Builder includes a <a href="https://www.google.com/edu/openonline/course-builder/docs/1.11/for-course-builder-developers/specific-sub-tasks/render-with-graphql.html">GraphQL server</a> that allows you to create your own frontend experience, while still letting Course Builder take care of things like user sessions and statefulness.<br /><br /><b><u>New course explorer</u></b><br />Large Course Builder partners such as Google's <a href="https://digitalworkshop-eu.withgoogle.com/">Digital Workshop</a> and <a href="https://onlinecourses.nptel.ac.in/explorer">NPTEL</a> have many courses and students with diverse needs.  To help them, we've completely revamped the Course Explorer page, giving it richer information and interactivity, so your students can find which of your courses they're looking for.  
You can provide categories and start/end dates, in addition to the course title, abstract and instructor information.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-5VmQAmo2uJc/WAT3LuYI5TI/AAAAAAAABV4/VtGeJTzVYMUCoTbPVrwJWnH0okksZHzUgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="402" src="https://3.bp.blogspot.com/-5VmQAmo2uJc/WAT3LuYI5TI/AAAAAAAABV4/VtGeJTzVYMUCoTbPVrwJWnH0okksZHzUgCLcB/s640/image00.png" width="640" /></a></div>In v1.11, we've added several new highly requested features. Together, they help make Course Builder easier to use and customize, giving you the flexibility to schedule things in advance.<br /><br />We've come a long way since <a href="https://opensource.googleblog.com/2012/09/helping-world-to-teach.html">releasing our first experimental code</a> over 4 years ago, turning Course Builder into a large open-source Google App Engine application with over 5 million student registrations across all Course Builder users. With these latest additions, we consider Course Builder feature complete and fully capable of delivering online learning at any scale. We will continue to provide support and bug fixes for those using the platform.<br /><br />We hope you’ll enjoy these new features and share how you’re using them in the <a href="https://groups.google.com/forum/?fromgroups#!forum/course-builder-forum">forum</a>.  Keep on learning!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/course-builder-now-supports-scheduling-easier-customization-and-more/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Equality of Opportunity in Machine Learning</title>
		<link>https://googledata.org/google-research/equality-of-opportunity-in-machine-learning/</link>
		<comments>https://googledata.org/google-research/equality-of-opportunity-in-machine-learning/#comments</comments>
		<pubDate>Fri, 07 Oct 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=135fbfdb8d67935c7917186a2a88a1fd</guid>
<description><![CDATA[<span>Posted by Moritz Hardt, Research Scientist, Google Brain Team</span><br /><br />As machine learning technology progresses rapidly, there is much interest in understanding its societal impact. A particularly successful branch of machine learning is <i><a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a></i>. With enough past data and computational resources, learning algorithms often produce surprisingly effective predictors of future events. To take one hypothetical example: an algorithm could be used to predict with high accuracy who will pay back their loan. Lenders might then use such a predictor as an aid in deciding who should receive a loan in the first place. Decisions based on machine learning can be both incredibly useful and have a profound impact on our lives.<br /><br />Even the best predictors make mistakes. Although machine learning aims to minimize the chance of a mistake, how do we prevent certain groups from experiencing a disproportionate share of these mistakes? Consider the case of a group that we have relatively little data on and whose characteristics differ from those of the general population in ways that are relevant to the prediction task. As prediction accuracy is generally correlated with the amount of data available for training, it is likely that incorrect predictions will be more common in this group. A predictor might, for example, end up flagging too many individuals in this group as &#8216;high risk of default&#8217; even though they pay back their loan. When group membership coincides with a sensitive attribute, such as race, gender, disability, or religion, this situation can lead to unjust or prejudicial outcomes.<br /><br />Despite the need, a vetted methodology in machine learning for preventing this kind of discrimination based on sensitive attributes has been lacking. 
A naive approach might require a set of sensitive attributes to be removed from the data before doing anything else with it. This idea of &#8220;fairness through unawareness,&#8221; however, fails due to the existence of &#8220;redundant encodings.&#8221; Even if a particular attribute is not present in the data, combinations of other attributes can act as a proxy.<br /><br />Another common approach, called <i>demographic parity</i>, asks that the prediction be uncorrelated with the sensitive attribute. This might sound intuitively desirable, but the outcome itself is often correlated with the sensitive attribute. For example, the incidence of heart failure is substantially more common in men than in women. When predicting such a medical condition, it is therefore neither realistic nor desirable to prevent all correlation between the predicted outcome and group membership.<br /><br /><b>Equal Opportunity</b><br /><b><br /></b> Taking these conceptual difficulties into account, we&#8217;ve proposed a methodology for measuring and preventing discrimination based on a set of sensitive attributes. Our framework not only helps to scrutinize predictors for possible concerns, but also shows how to adjust a given predictor so as to strike a better tradeoff between classification accuracy and non-discrimination if need be.<br /><br />At the heart of our approach is the idea that individuals who qualify for a desirable outcome should have an equal chance of being correctly classified for this outcome.  In our fictional loan example, it means the rate of &#8216;low risk&#8217; predictions among people who actually pay back their loan should not depend on a sensitive attribute like race or gender. 
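As a rough sketch, this group-conditional rate (the fraction of people who actually pay back their loan and are predicted &#8216;low risk&#8217;, computed per group) can be measured directly; the toy data and variable names below are invented for illustration and are not from the paper:

```python
def true_positive_rates(y_true, y_pred, group):
    """P(predicted 'low risk' | actually pays back, group = g) for each group g.

    Equality of opportunity asks these per-group rates to be (roughly) equal.
    """
    rates = {}
    for g in set(group):
        # qualified individuals in group g: those who actually pay back
        qualified = [p for t, p, gg in zip(y_true, y_pred, group)
                     if gg == g and t == 1]
        rates[g] = sum(qualified) / len(qualified)
    return rates

# toy data: 1 = pays back (y_true) / predicted 'low risk' (y_pred)
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(true_positive_rates(y_true, y_pred, group))  # group "a": 0.75, group "b": 0.5
```

A gap between the two rates, as in this toy data, is exactly the kind of disparity the framework flags and then corrects by adjusting the predictor.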
We call this principle <i>equality of opportunity</i> in supervised learning.<br /><br />When implemented, our framework also improves incentives by shifting the cost of poor predictions from the individual to the decision maker, who can respond by investing in improved prediction accuracy. Perfect predictors always satisfy our notion, showing that the central goal of building more accurate predictors is well aligned with the goal of avoiding discrimination.<br /><br /><b>Learn more</b><br /><br />To explore the ideas in this blog post on your own, our <a href="https://research.google.com/bigpicture/">Big Picture team</a> created a beautiful <a href="http://research.google.com/bigpicture/attacking-discrimination-in-ml">interactive visualization</a> of the different concepts and tradeoffs. So, head on over to their page to learn more. <br /><br />Once you&#8217;ve walked through the demo, please check out the <a href="https://arxiv.org/abs/1610.02413">full version of our paper</a>, a joint work with Eric Price (UT Austin) and Nati Srebro (TTI Chicago). We&#8217;ll present the paper at this year&#8217;s Conference on Neural Information Processing Systems (<a href="https://nips.cc/">NIPS</a>) in Barcelona. So, if you&#8217;re around, be sure to stop by and chat with one of us.<br /><br />Our paper is by no means the final word on this important and complex topic. It joins an ongoing, multidisciplinary research conversation. We hope to inspire future research that will sharpen the discussion of the different achievable tradeoffs surrounding discrimination and machine learning, as well as the development of tools that will help practitioners address these challenges.]]></description>
<content:encoded><![CDATA[<span class="byline-author">Posted by Moritz Hardt, Research Scientist, Google Brain Team</span><br /><br />As machine learning technology progresses rapidly, there is much interest in understanding its societal impact. A particularly successful branch of machine learning is <i><a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a></i>. With enough past data and computational resources, learning algorithms often produce surprisingly effective predictors of future events. To take one hypothetical example: an algorithm could be used to predict with high accuracy who will pay back their loan. Lenders might then use such a predictor as an aid in deciding who should receive a loan in the first place. Decisions based on machine learning can be both incredibly useful and have a profound impact on our lives.<br /><br />Even the best predictors make mistakes. Although machine learning aims to minimize the chance of a mistake, how do we prevent certain groups from experiencing a disproportionate share of these mistakes? Consider the case of a group that we have relatively little data on and whose characteristics differ from those of the general population in ways that are relevant to the prediction task. As prediction accuracy is generally correlated with the amount of data available for training, it is likely that incorrect predictions will be more common in this group. A predictor might, for example, end up flagging too many individuals in this group as ‘high risk of default’ even though they pay back their loan. When group membership coincides with a sensitive attribute, such as race, gender, disability, or religion, this situation can lead to unjust or prejudicial outcomes.<br /><br />Despite the need, a vetted methodology in machine learning for preventing this kind of discrimination based on sensitive attributes has been lacking. 
A naive approach might require a set of sensitive attributes to be removed from the data before doing anything else with it. This idea of “fairness through unawareness,” however, fails due to the existence of “redundant encodings.” Even if a particular attribute is not present in the data, combinations of other attributes can act as a proxy.<br /><br />Another common approach, called <i>demographic parity</i>, asks that the prediction be uncorrelated with the sensitive attribute. This might sound intuitively desirable, but the outcome itself is often correlated with the sensitive attribute. For example, the incidence of heart failure is substantially more common in men than in women. When predicting such a medical condition, it is therefore neither realistic nor desirable to prevent all correlation between the predicted outcome and group membership.<br /><br /><b>Equal Opportunity</b><br /><b><br /></b> Taking these conceptual difficulties into account, we’ve proposed a methodology for measuring and preventing discrimination based on a set of sensitive attributes. Our framework not only helps to scrutinize predictors for possible concerns, but also shows how to adjust a given predictor so as to strike a better tradeoff between classification accuracy and non-discrimination if need be.<br /><br />At the heart of our approach is the idea that individuals who qualify for a desirable outcome should have an equal chance of being correctly classified for this outcome.  In our fictional loan example, it means the rate of ‘low risk’ predictions among people who actually pay back their loan should not depend on a sensitive attribute like race or gender. We call this principle <i>equality of opportunity</i> in supervised learning.<br /><br />When implemented, our framework also improves incentives by shifting the cost of poor predictions from the individual to the decision maker, who can respond by investing in improved prediction accuracy. 
Perfect predictors always satisfy our notion, showing that the central goal of building more accurate predictors is well aligned with the goal of avoiding discrimination.<br /><br /><b>Learn more</b><br /><br />To explore the ideas in this blog post on your own, our <a href="https://research.google.com/bigpicture/">Big Picture team</a> created a beautiful <a href="http://research.google.com/bigpicture/attacking-discrimination-in-ml">interactive visualization</a> of the different concepts and tradeoffs. So, head on over to their page to learn more. <br /><br />Once you’ve walked through the demo, please check out the <a href="https://arxiv.org/abs/1610.02413">full version of our paper</a>, a joint work with Eric Price (UT Austin) and Nati Srebro (TTI Chicago). We’ll present the paper at this year’s Conference on Neural Information Processing Systems (<a href="https://nips.cc/">NIPS</a>) in Barcelona. So, if you’re around, be sure to stop by and chat with one of us.<br /><br />Our paper is by no means the final word on this important and complex topic. It joins an ongoing, multidisciplinary research conversation. We hope to inspire future research that will sharpen the discussion of the different achievable tradeoffs surrounding discrimination and machine learning, as well as the development of tools that will help practitioners address these challenges. ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/equality-of-opportunity-in-machine-learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Graph-powered Machine Learning at Google</title>
		<link>https://googledata.org/google-research/graph-powered-machine-learning-at-google/</link>
		<comments>https://googledata.org/google-research/graph-powered-machine-learning-at-google/#comments</comments>
		<pubDate>Thu, 06 Oct 2016 17:03:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=340247eef62a63fc3a7c10989ad6d71a</guid>
<description><![CDATA[<span>Posted by Sujith Ravi, Staff Research Scientist, Google Research</span><br /><br />Recently, there have been significant advances in <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> that enable computer systems to solve complex real-world problems. One of those advances is Google&#8217;s large-scale, <a href="https://en.wikipedia.org/wiki/Graph_theory">graph-based</a> machine learning platform, built by the Expander team in Google Research.  A technology that is behind many of the Google products and features you may use every day, graph-based machine learning is a powerful tool that can be used to power useful features such as <a href="https://gmail.googleblog.com/2014/12/reminders-in-inbox-your-to-dos-on-your.html">reminders in Inbox</a> and <a href="https://googleblog.blogspot.com/2016/09/google-allo-smarter-messaging-app.html">smart messaging in Allo</a>, or used in conjunction with deep neural networks to power the latest image recognition system in <a href="https://googleblog.blogspot.com/2016/05/google-photos-one-year-200-million.html">Google Photos</a>. <br /><div><a href="https://3.bp.blogspot.com/-c8w7GPULXBU/V_aDexkRXII/AAAAAAAABVg/33cVb3rPZ_UgETMTBgi3WP8ZRfdksJ61QCLcB/s1600/image01.png"><img border="0" height="378" src="https://3.bp.blogspot.com/-c8w7GPULXBU/V_aDexkRXII/AAAAAAAABVg/33cVb3rPZ_UgETMTBgi3WP8ZRfdksJ61QCLcB/s640/image01.png" width="640" /></a></div><b>Learning with Minimal Supervision</b><br /><br />Much of the recent success in <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a>, and machine learning in general, can be attributed to models that demonstrate high predictive capacity when trained on large amounts of labeled data -- often millions of training examples. 
This is commonly referred to as &#8220;<a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a>&#8221; since it requires supervision, in the form of labeled data, to train the machine learning systems. (Conversely, some machine learning methods operate directly on raw data without any supervision, a paradigm referred to as <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised learning</a>.)<br /><br />However, the more difficult the task, the harder it is to get sufficient high-quality labeled data. It is often prohibitively labor intensive and time-consuming to collect labeled data for every new problem. This motivated the Expander research team to build new technology for powering machine learning applications at scale and with minimal supervision.  <br /><br />Expander&#8217;s technology draws inspiration from how humans learn to generalize and bridge the gap between what they already know (labeled information) and novel, unfamiliar observations (unlabeled information). Known as &#8220;<a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised</a>&#8221; learning, this powerful technique enables us to build systems that can work in situations where training data may be sparse. One of the key advantages to a graph-based semi-supervised machine learning approach is the fact that (a) one models labeled and unlabeled data <i>jointly</i> during learning, leveraging the underlying structure in the data, (b) one can easily combine multiple types of signals (for example, relational information from <a href="https://en.wikipedia.org/wiki/Knowledge_Graph">Knowledge Graph</a> along with raw features) into a single graph representation and learn over them. 
This is in contrast to other machine learning approaches, such as neural network methods, in which it is typical to <i>first</i> train a system using labeled data with features and <i>then</i> apply the trained system to unlabeled data.<br /><br /><b>Graph Learning: How It Works</b><br /><br />At its core, Expander&#8217;s platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data with nodes corresponding to objects or concepts and edges connecting concepts that share similarities. The graph typically contains both labeled data (nodes associated with a known output category or label) and unlabeled data (nodes for which no labels were provided). Expander&#8217;s framework then performs semi-supervised learning to label all nodes jointly by propagating label information across the graph. <br /><br />However, this is easier said than done!  We have to (1) learn efficiently at scale with minimal supervision (i.e., tiny amount of labeled data), (2) operate over multi-modal data (i.e., heterogeneous representations and various sources of data), and (3) solve challenging prediction tasks (i.e., large, complex output spaces) involving high dimensional data that might be noisy.<br /><br />One of the primary ingredients in the entire learning process is the graph and choice of connections. Graphs come in all sizes, shapes and can be combined from multiple sources.  We have observed that it is often beneficial to learn over multi-graphs that combine information from multiple types of data representations (e.g., image pixels, object categories and chat response messages for <a href="https://research.googleblog.com/2016/05/aw-so-cute-allo-helps-you-respond-to.html">PhotoReply in Allo</a>). The Expander team&#8217;s graph learning platform automatically generates graphs directly from data based on the inferred or known relationships between data elements.  
The data can be structured (for example, <a href="https://en.wikipedia.org/wiki/Knowledge_Graph">relational data</a>) or unstructured (for example, <a href="https://en.wikipedia.org/wiki/Sparse_array">sparse</a> or dense feature representations extracted from raw data). <br /><br />To understand how Expander&#8217;s system learns, let us consider an example graph shown below. <br /><div><a href="https://3.bp.blogspot.com/-ZYBHfGDkDP8/V_aDXU1ewxI/AAAAAAAABVc/QPNCCcXi3w0MXOsVVGa9J_07mj0BFBVpQCLcB/s1600/image03.png"><img border="0" height="446" src="https://3.bp.blogspot.com/-ZYBHfGDkDP8/V_aDXU1ewxI/AAAAAAAABVc/QPNCCcXi3w0MXOsVVGa9J_07mj0BFBVpQCLcB/s640/image03.png" width="640" /></a></div><br />There are two types of nodes in the graph: &#8220;grey&#8221; represents unlabeled data whereas the colored nodes represent labeled data. Relationships between node data are represented via edges, and the thickness of each edge indicates the strength of the connection. We can formulate the semi-supervised learning problem on this toy graph as follows: <i>predict a color (&#8220;red&#8221; or &#8220;blue&#8221;) for every node in the graph</i>. Note that the specific choice of graph structure and colors depends on the task. For example, as shown in <a href="https://arxiv.org/abs/1606.04870">this research paper</a> we recently published, a graph that we built for the <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply feature in Inbox</a> represents email messages as nodes and colors indicate semantic categories of user responses (e.g., &#8220;yes&#8221;, &#8220;awesome&#8221;, &#8220;funny&#8221;).<br /><br />The Expander graph learning framework solves this labeling task by treating it as an optimization problem. At the simplest level, it learns a color label assignment for every node in the graph such that neighboring nodes are assigned similar colors depending on the strength of their connection. 
A naive way to solve this would be to try to learn a label assignment for all nodes at once -- this method does not scale to large graphs. Instead, we can optimize the problem formulation by propagating colors from labeled nodes to their neighbors, and then repeating the process. In each step, an unlabeled node is assigned a label by inspecting color assignments of its neighbors. We can update every node&#8217;s label in this manner and iterate until the whole graph is colored. This process is a far more efficient way to optimize the same problem and the sequence of iterations converges to a unique solution in this case. The solution at the end of the graph propagation looks something like this:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-aJ-VcuyhELs/V_aDJLa9TPI/AAAAAAAABVY/kdEy3rcl5oo_T1Zp47Lpki0lFPCg59MsQCLcB/s1600/image00.gif"><img border="0" height="446" src="https://2.bp.blogspot.com/-aJ-VcuyhELs/V_aDJLa9TPI/AAAAAAAABVY/kdEy3rcl5oo_T1Zp47Lpki0lFPCg59MsQCLcB/s640/image00.gif" width="640"></a></td></tr><tr><td>Semi-supervised learning on a graph</td></tr></tbody></table>In practice, we use complex optimization functions defined over the graph structure, which incorporate additional information and constraints for semi-supervised graph learning that can lead to hard, <a href="https://en.wikipedia.org/wiki/Convex_optimization">non-convex</a> problems. The <i>real</i> challenge, however, is to scale this efficiently to graphs containing billions of nodes, trillions of edges and for complex tasks involving billions of different label types. <br /><br />To tackle this challenge, we created an approach outlined in <a href="https://arxiv.org/abs/1512.01752"><i>Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation</i></a>, published last year. 
It introduces a <a href="https://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithm</a> to process information propagated from neighboring nodes in a distributed manner that makes it work on very large graphs. In addition, it addresses other practical concerns; notably, it guarantees that the space complexity (i.e., the memory requirements) of the system stays constant regardless of the difficulty of the task: the overall system uses the same amount of memory whether the number of prediction labels is two (as in the above toy example), a million, or even a billion. This enables wide-ranging applications for natural language understanding, machine perception, user modeling and even joint <a href="https://en.wikipedia.org/wiki/Multimodal_learning">multimodal</a> learning for tasks involving multiple modalities such as text, image and video inputs.   <br /><br /><b>Language Graphs for Learning Humor </b><br /><br />As an example use of graph-based machine learning, consider <i>emotion labeling</i>, a language understanding task in <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>, where the goal is to label words occurring in natural language text with their fine-grained emotion categories. A neural network model is first applied to a text corpus to learn word embeddings, i.e., a mathematical vector representation of the meaning of each word. The dense embedding vectors are then used to build a sparse graph where nodes correspond to words and edges represent semantic relationships between them. Edge strength is computed using similarity between embedding vectors &#8212; low similarity edges are ignored. 
We seed the graph with emotion labels known <i>a priori</i> for a few nodes (e.g., laugh is labeled as &#8220;funny&#8221;) and then apply semi-supervised learning over the graph to discover emotion categories for remaining words (e.g., ROTFL gets labeled as &#8220;funny&#8221; owing to its multi-hop semantic connection to the word &#8220;laugh&#8221;). <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-BHCy_9hWMYw/V_aC9m8P-_I/AAAAAAAABVU/OtEiUW-vPowSTNcrrfzD5QoYHOo6ymFWACLcB/s1600/image02.png"><img border="0" height="234" src="https://3.bp.blogspot.com/-BHCy_9hWMYw/V_aC9m8P-_I/AAAAAAAABVU/OtEiUW-vPowSTNcrrfzD5QoYHOo6ymFWACLcB/s640/image02.png" width="640"></a></td></tr><tr><td>Learning emotion associations using graph constructed from word embedding vectors</td></tr></tbody></table>For applications involving large datasets or dense representations that are observed (e.g., pixels from images) or learned using neural networks (e.g., embedding vectors), it is infeasible to compute pairwise similarity between all objects to construct edges in the graph. The Expander team <a href="https://arxiv.org/abs/1512.01752">solves</a> this problem by leveraging approximate, linear-time graph construction algorithms. 
<br /><br /><b>Graph-based Machine Intelligence in Action</b><br /><br />The Expander team&#8217;s machine learning system is now being used on massive graphs (containing billions of nodes and trillions of edges) to recognize and understand concepts in natural language, images, videos, and queries, powering Google products for applications like <a href="https://gmail.googleblog.com/2014/12/reminders-in-inbox-your-to-dos-on-your.html">reminders</a>, <a href="https://en.wikipedia.org/wiki/Question_answering">question answering</a>, <a href="http://googletranslate.blogspot.com/2016/02/from-amharic-to-xhosa-introducing.html">language translation</a>, <a href="https://en.wikipedia.org/wiki/Computer_vision#Recognition">visual object recognition</a>, <a href="https://en.wikipedia.org/wiki/Dialog_system">dialogue understanding</a>, and more. <br /><br />We are excited that with the <a href="https://googleblog.blogspot.com/2016/09/google-allo-smarter-messaging-app.html">recent release of Allo</a>, millions of chat users are now experiencing smart messaging technology powered by the Expander team&#8217;s system for understanding and assisting with chat conversations in multiple languages.  Also, this technology isn&#8217;t used only for large-scale models in the cloud - as <a href="http://android-developers.blogspot.com/2016/09/android-wear-2-0-developer-preview-3-play-store-and-more.html">announced this past week</a>, Android Wear has opened up an <a href="https://developer.android.com/wear/preview/features/notifications.html#messaging"> on-device Smart Reply capability</a> for developers that will provide smart replies for any messaging application.  We&#8217;re excited to tackle even more challenging Internet-scale problems with Expander in the years to come.  <br /><br /><b>Acknowledgements</b><br /><br />We wish to acknowledge the hard work of all the researchers, engineers, product managers, and leaders across Google who helped make this technology a success.  
In particular, we would like to highlight the efforts of Allan Heydon, Andrei Broder, Andrew Tomkins, Ariel Fuxman, Bo Pang, Dana Movshovitz-Attias, Fritz Obermeyer, Krishnamurthy Viswanathan, Patrick McGregor, Peter Young, Robin Dua, Sujith Ravi and Vivek Ramavajjala.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Sujith Ravi, Staff Research Scientist, Google Research</span><br /><br />Recently, there have been significant advances in <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> that enable computer systems to solve complex real-world problems. One of those advances is Google’s large scale, <a href="https://en.wikipedia.org/wiki/Graph_theory">graph-based</a> machine learning platform, built by the Expander team in Google Research.  A technology that is behind many of the Google products and features you may use every day, graph-based machine learning is a powerful tool that can be used to power useful features such as <a href="https://gmail.googleblog.com/2014/12/reminders-in-inbox-your-to-dos-on-your.html">reminders in Inbox</a> and <a href="https://googleblog.blogspot.com/2016/09/google-allo-smarter-messaging-app.html">smart messaging in Allo</a>, or used in conjunction with deep neural networks to power the latest image recognition system in <a href="https://googleblog.blogspot.com/2016/05/google-photos-one-year-200-million.html">Google Photos</a>. <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-c8w7GPULXBU/V_aDexkRXII/AAAAAAAABVg/33cVb3rPZ_UgETMTBgi3WP8ZRfdksJ61QCLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="378" src="https://3.bp.blogspot.com/-c8w7GPULXBU/V_aDexkRXII/AAAAAAAABVg/33cVb3rPZ_UgETMTBgi3WP8ZRfdksJ61QCLcB/s640/image01.png" width="640" /></a></div><b>Learning with Minimal Supervision</b><br /><br />Much of the recent success in <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a>, and machine learning in general, can be attributed to models that demonstrate high predictive capacity when trained on large amounts of labeled data -- often millions of training examples. 
This is commonly referred to as “<a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a>” since it requires supervision, in the form of labeled data, to train the machine learning systems. (Conversely, some machine learning methods operate directly on raw data without any supervision, a paradigm referred to as <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised learning</a>.)<br /><br />However, the more difficult the task, the harder it is to get sufficient high-quality labeled data. It is often prohibitively labor intensive and time-consuming to collect labeled data for every new problem. This motivated the Expander research team to build new technology for powering machine learning applications at scale and with minimal supervision.  <br /><br />Expander’s technology draws inspiration from how humans learn to generalize and bridge the gap between what they already know (labeled information) and novel, unfamiliar observations (unlabeled information). Known as “<a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised</a>” learning, this powerful technique enables us to build systems that can work in situations where training data may be sparse. Two key advantages of a graph-based semi-supervised machine learning approach are that (a) one models labeled and unlabeled data <i>jointly</i> during learning, leveraging the underlying structure in the data, and (b) one can easily combine multiple types of signals (for example, relational information from <a href="https://en.wikipedia.org/wiki/Knowledge_Graph">Knowledge Graph</a> along with raw features) into a single graph representation and learn over them. 
This is in contrast to other machine learning approaches, such as neural network methods, in which it is typical to <i>first</i> train a system using labeled data with features and <i>then</i> apply the trained system to unlabeled data.<br /><br /><b>Graph Learning: How It Works</b><br /><br />At its core, Expander’s platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data with nodes corresponding to objects or concepts and edges connecting concepts that share similarities. The graph typically contains both labeled data (nodes associated with a known output category or label) and unlabeled data (nodes for which no labels were provided). Expander’s framework then performs semi-supervised learning to label all nodes jointly by propagating label information across the graph. <br /><br />However, this is easier said than done!  We have to (1) learn efficiently at scale with minimal supervision (i.e., tiny amount of labeled data), (2) operate over multi-modal data (i.e., heterogeneous representations and various sources of data), and (3) solve challenging prediction tasks (i.e., large, complex output spaces) involving high dimensional data that might be noisy.<br /><br />One of the primary ingredients in the entire learning process is the graph and choice of connections. Graphs come in all sizes, shapes and can be combined from multiple sources.  We have observed that it is often beneficial to learn over multi-graphs that combine information from multiple types of data representations (e.g., image pixels, object categories and chat response messages for <a href="https://research.googleblog.com/2016/05/aw-so-cute-allo-helps-you-respond-to.html">PhotoReply in Allo</a>). The Expander team’s graph learning platform automatically generates graphs directly from data based on the inferred or known relationships between data elements.  
The data can be structured (for example, <a href="https://en.wikipedia.org/wiki/Knowledge_Graph">relational data</a>) or unstructured (for example, <a href="https://en.wikipedia.org/wiki/Sparse_array">sparse</a> or dense feature representations extracted from raw data). <br /><br />To understand how Expander’s system learns, let us consider an example graph shown below. <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-ZYBHfGDkDP8/V_aDXU1ewxI/AAAAAAAABVc/QPNCCcXi3w0MXOsVVGa9J_07mj0BFBVpQCLcB/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="446" src="https://3.bp.blogspot.com/-ZYBHfGDkDP8/V_aDXU1ewxI/AAAAAAAABVc/QPNCCcXi3w0MXOsVVGa9J_07mj0BFBVpQCLcB/s640/image03.png" width="640" /></a></div><br />There are two types of nodes in the graph: “grey” represents unlabeled data whereas the colored nodes represent labeled data. Relationships between node data are represented via edges, and the thickness of each edge indicates the strength of the connection. We can formulate the semi-supervised learning problem on this toy graph as follows: <i>predict a color (“red” or “blue”) for every node in the graph</i>. Note that the specific choice of graph structure and colors depends on the task. For example, as shown in <a href="https://arxiv.org/abs/1606.04870">this research paper</a> we recently published, a graph that we built for the <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply feature in Inbox</a> represents email messages as nodes and colors indicate semantic categories of user responses (e.g., “yes”, “awesome”, “funny”).<br /><br />The Expander graph learning framework solves this labeling task by treating it as an optimization problem. At the simplest level, it learns a color label assignment for every node in the graph such that neighboring nodes are assigned similar colors depending on the strength of their connection. 
A naive way to solve this would be to try to learn a label assignment for all nodes at once -- this method does not scale to large graphs. Instead, we can optimize the problem formulation by propagating colors from labeled nodes to their neighbors, and then repeating the process. In each step, an unlabeled node is assigned a label by inspecting color assignments of its neighbors. We can update every node’s label in this manner and iterate until the whole graph is colored. This process is a far more efficient way to optimize the same problem and the sequence of iterations converges to a unique solution in this case. The solution at the end of the graph propagation looks something like this:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-aJ-VcuyhELs/V_aDJLa9TPI/AAAAAAAABVY/kdEy3rcl5oo_T1Zp47Lpki0lFPCg59MsQCLcB/s1600/image00.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="446" src="https://2.bp.blogspot.com/-aJ-VcuyhELs/V_aDJLa9TPI/AAAAAAAABVY/kdEy3rcl5oo_T1Zp47Lpki0lFPCg59MsQCLcB/s640/image00.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Semi-supervised learning on a graph</td></tr></tbody></table>In practice, we use complex optimization functions defined over the graph structure, which incorporate additional information and constraints for semi-supervised graph learning that can lead to hard, <a href="https://en.wikipedia.org/wiki/Convex_optimization">non-convex</a> problems. The <i>real</i> challenge, however, is to scale this efficiently to graphs containing billions of nodes, trillions of edges and for complex tasks involving billions of different label types. 
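To make the iterative coloring process above concrete, here is a minimal toy sketch of weighted label propagation. This is an illustrative simplification only, not the Expander implementation; the chain graph, edge weights, and function names are invented for this example:

```python
# Minimal sketch of semi-supervised label propagation on a toy graph.
# Seed nodes keep their color; every other node repeatedly averages the
# label scores of its neighbors, weighted by edge strength.

def propagate_labels(edges, seeds, n_iters=20):
    """edges: {node: [(neighbor, weight), ...]}; seeds: {node: label}."""
    labels = sorted(set(seeds.values()))
    scores = {n: {l: 0.0 for l in labels} for n in edges}
    for n, l in seeds.items():
        scores[n][l] = 1.0
    for _ in range(n_iters):
        new_scores = {}
        for n, nbrs in edges.items():
            if n in seeds:  # labeled nodes stay clamped to their color
                new_scores[n] = scores[n]
                continue
            total = sum(w for _, w in nbrs)
            agg = {l: sum(w * scores[nbr][l] for nbr, w in nbrs) for l in labels}
            new_scores[n] = {l: v / total for l, v in agg.items()} if total else agg
        scores = new_scores
    # assign each node its highest-scoring color
    return {n: max(s, key=s.get) for n, s in scores.items()}

# Five-node chain: "a" is labeled red, "e" is labeled blue.
edges = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 0.5)],
    "c": [("b", 0.5), ("d", 0.5)],
    "d": [("c", 0.5), ("e", 1.0)],
    "e": [("d", 1.0)],
}
result = propagate_labels(edges, seeds={"a": "red", "e": "blue"})
```

Colors diffuse outward along the weighted edges, so nodes nearer the red seed end up red and nodes nearer the blue seed end up blue, mirroring the animated toy example above.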
<br /><br />To tackle this challenge, we created an approach outlined in <a href="https://arxiv.org/abs/1512.01752"><i>Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation</i></a>, published last year. It introduces a <a href="https://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithm</a> to process information propagated from neighboring nodes in a distributed manner that makes it work on very large graphs. In addition, it addresses other practical concerns; notably, it guarantees that the space complexity (i.e., the memory requirements) of the system stays constant regardless of the difficulty of the task: the overall system uses the same amount of memory whether the number of prediction labels is two (as in the above toy example), a million, or even a billion. This enables wide-ranging applications for natural language understanding, machine perception, user modeling and even joint <a href="https://en.wikipedia.org/wiki/Multimodal_learning">multimodal</a> learning for tasks involving multiple modalities such as text, image and video inputs.   <br /><br /><b>Language Graphs for Learning Humor </b><br /><br />As an example use of graph-based machine learning, consider <i>emotion labeling</i>, a language understanding task in <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>, where the goal is to label words occurring in natural language text with their fine-grained emotion categories. A neural network model is first applied to a text corpus to learn word embeddings, i.e., a mathematical vector representation of the meaning of each word. The dense embedding vectors are then used to build a sparse graph where nodes correspond to words and edges represent semantic relationships between them. Edge strength is computed using similarity between embedding vectors — low similarity edges are ignored. 
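For a toy vocabulary, the embedding-to-graph step just described can be sketched with brute-force pairwise cosine similarity. The 3-d vectors and the 0.8 threshold below are made up for illustration; at real scale, this brute-force pairwise step is precisely what approximate linear-time construction algorithms replace:

```python
# Sketch: build a sparse word graph from embedding vectors,
# keeping only high-similarity edges (toy vectors, not real embeddings).
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_graph(embeddings, threshold=0.8):
    """Return {word: [(neighbor, similarity), ...]}, dropping weak edges."""
    words = list(embeddings)
    graph = {w: [] for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            sim = cosine(embeddings[w1], embeddings[w2])
            if sim >= threshold:  # low-similarity edges are ignored
                graph[w1].append((w2, sim))
                graph[w2].append((w1, sim))
    return graph

# Toy 3-d "embeddings": laugh and lol point the same way, tax does not.
vectors = {
    "laugh": [0.9, 0.1, 0.0],
    "lol":   [0.8, 0.2, 0.1],
    "tax":   [0.0, 0.1, 0.9],
}
graph = build_graph(vectors)
```

In the resulting graph, "laugh" and "lol" share a strong edge while "tax" is left isolated, which is exactly what lets a "funny" seed label reach "lol" during propagation.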
We seed the graph with emotion labels known <i>a priori</i> for a few nodes (e.g., laugh is labeled as “funny”) and then apply semi-supervised learning over the graph to discover emotion categories for remaining words (e.g., ROTFL gets labeled as “funny” owing to its multi-hop semantic connection to the word “laugh”). <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-BHCy_9hWMYw/V_aC9m8P-_I/AAAAAAAABVU/OtEiUW-vPowSTNcrrfzD5QoYHOo6ymFWACLcB/s1600/image02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="234" src="https://3.bp.blogspot.com/-BHCy_9hWMYw/V_aC9m8P-_I/AAAAAAAABVU/OtEiUW-vPowSTNcrrfzD5QoYHOo6ymFWACLcB/s640/image02.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Learning emotion associations using graph constructed from word embedding vectors</td></tr></tbody></table>For applications involving large datasets or dense representations that are observed (e.g., pixels from images) or learned using neural networks (e.g., embedding vectors), it is infeasible to compute pairwise similarity between all objects to construct edges in the graph. The Expander team <a href="https://arxiv.org/abs/1512.01752">solves</a> this problem by leveraging approximate, linear-time graph construction algorithms. 
<br /><br /><b>Graph-based Machine Intelligence in Action</b><br /><br />The Expander team’s machine learning system is now being used on massive graphs (containing billions of nodes and trillions of edges) to recognize and understand concepts in natural language, images, videos, and queries, powering Google products for applications like <a href="https://gmail.googleblog.com/2014/12/reminders-in-inbox-your-to-dos-on-your.html">reminders</a>, <a href="https://en.wikipedia.org/wiki/Question_answering">question answering</a>, <a href="http://googletranslate.blogspot.com/2016/02/from-amharic-to-xhosa-introducing.html">language translation</a>, <a href="https://en.wikipedia.org/wiki/Computer_vision#Recognition">visual object recognition</a>, <a href="https://en.wikipedia.org/wiki/Dialog_system">dialogue understanding</a>, and more. <br /><br />We are excited that with the <a href="https://googleblog.blogspot.com/2016/09/google-allo-smarter-messaging-app.html">recent release of Allo</a>, millions of chat users are now experiencing smart messaging technology powered by the Expander team’s system for understanding and assisting with chat conversations in multiple languages.  Also, this technology isn’t used only for large-scale models in the cloud - as <a href="http://android-developers.blogspot.com/2016/09/android-wear-2-0-developer-preview-3-play-store-and-more.html">announced this past week</a>, Android Wear has opened up an <a href="https://developer.android.com/wear/preview/features/notifications.html#messaging"> on-device Smart Reply capability</a> for developers that will provide smart replies for any messaging application.  We’re excited to tackle even more challenging Internet-scale problems with Expander in the years to come.  <br /><br /><b>Acknowledgements</b><br /><br />We wish to acknowledge the hard work of all the researchers, engineers, product managers, and leaders across Google who helped make this technology a success.  
In particular, we would like to highlight the efforts of Allan Heydon, Andrei Broder, Andrew Tomkins, Ariel Fuxman, Bo Pang, Dana Movshovitz-Attias, Fritz Obermeyer, Krishnamurthy Viswanathan, Patrick McGregor, Peter Young, Robin Dua, Sujith Ravi and Vivek Ramavajjala.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/graph-powered-machine-learning-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Robots Can Acquire New Skills from Their Shared Experience</title>
		<link>https://googledata.org/google-research/how-robots-can-acquire-new-skills-from-their-shared-experience/</link>
		<comments>https://googledata.org/google-research/how-robots-can-acquire-new-skills-from-their-shared-experience/#comments</comments>
		<pubDate>Tue, 04 Oct 2016 00:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=64e4f393e1527b6ed14d0c5e71751a9c</guid>
		<description><![CDATA[<span>Posted by Sergey Levine (Google Brain Team), Timothy Lillicrap (DeepMind), Mrinal Kalakrishnan (X)</span><br /><br />The ability to learn from experience will likely be a key in enabling robots to help with complex real-world tasks, from assisting the elderly with chores and daily activities, to helping us in offices and hospitals, to performing jobs that are too dangerous or unpleasant for people. However, if each robot must learn its full repertoire of skills for these tasks only from its own experience, it could take far too long to acquire a rich enough range of behaviors to be useful. Could we bridge this gap by making it possible for robots to collectively learn from each other&#8217;s experiences?<br /><br />While machine learning algorithms have made great strides in <a href="https://research.googleblog.com/search/label/Natural%20Language%20Understanding">natural language understanding</a> and <a href="https://research.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html">speech recognition</a>, the kind of symbolic high-level reasoning that allows people to communicate complex concepts in words remains out of reach for machines. However, robots can instantaneously transmit their experience to other robots over the network - sometimes known as "<a href="http://goldberg.berkeley.edu/cloud-robotics/">cloud robotics</a>" - and it is this ability that can let them learn from each other.<br /><br />This is true even for seemingly simple low-level skills. Humans and animals excel at adaptive motor control that integrates their senses, reflexes, and muscles in a closely coordinated feedback loop. Robots still struggle with these basic skills in the real world, where the variability and complexity of the environment demands well-honed behaviors that are not easily fooled by distractors. 
If we enable robots to transmit their experiences to each other, could they learn to perform motion skills in close coordination with sensing in realistic environments?<br /><br />We <a href="https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html">previously wrote</a> about how multiple robots could pool their experiences to learn a grasping task. Here, we will discuss new experiments that we conducted to investigate three possible approaches for general-purpose skill learning across multiple robots: learning motion skills directly from experience, learning internal models of physics, and learning skills with human assistance. In all three cases, multiple robots shared their experiences to build a common model of the skill. The skills learned by the robots are still relatively simple -- pushing objects and opening doors -- but by learning such skills more quickly and efficiently through collective learning, robots might in the future acquire richer behavioral repertoires that could eventually make it possible for them to assist us in our daily lives.<br /><br /><b>Learning from raw experience with model-free reinforcement learning.</b><br />Perhaps one of the simplest ways for robots to teach each other is to pool information about their successes and failures in the world.  Humans and animals acquire many skills by direct trial-and-error learning.  During this kind of &#8216;model-free&#8217; learning -- so called because there is no explicit model of the environment formed -- they explore variations on their existing behavior and then reinforce and exploit the variations that give bigger rewards. 
In combination with deep neural networks, model-free algorithms have recently proved to be surprisingly effective and have been key to <a href="https://research.googleblog.com/2015/02/from-pixels-to-actions-human-level.html">successes with the Atari video game system</a> and <a href="https://research.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html">playing Go</a>. Having multiple robots allows us to experiment with sharing experiences to speed up this kind of direct learning in the real world.<br /><br />In these experiments we tasked robots with trying to move their arms to goal locations, or reaching to and opening a door.  Each robot has a copy of a neural network that allows it to estimate the value of taking a given action in a given state.  By querying this network, the robot can quickly decide what actions might be worth taking in the world.  When a robot acts, we add noise to the actions it selects, so the resulting behavior is sometimes a bit better than previously observed, and sometimes a bit worse.  This allows each robot to explore different ways of approaching a task.  Records of the actions taken by the robots, their behaviors, and the final outcomes are sent back to a central server.  The server collects the experiences from all of the robots and uses them to iteratively improve the neural network that estimates value for different states and actions.  The model-free algorithms we employed look across both good and bad experiences and distill these into a new network that is better at understanding how action and success are related.  Then, at regular intervals, each robot takes a copy of the updated network from the server and begins to act using the information in its new network.  Given that this updated network is a bit better at estimating the true value of actions in the world, the robots will produce better behavior. This cycle can then be repeated to continue improving on the task. 
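The collect-and-distill cycle described above can be sketched in miniature. Here a tabular value estimate stands in for the real neural network, and three simulated "robots" on a one-dimensional reaching task pool their transitions through a central update step; the task, constants, and names are all invented for illustration:

```python
# Toy sketch of shared-experience reinforcement learning: several simulated
# "robots" act with exploration noise, send transitions to a central store,
# which distills them into one shared value estimate that all robots copy.
import random

GOAL, ACTIONS = 5, (-1, +1)

def act(state, q, eps=0.3):
    # epsilon-greedy: mostly follow the shared policy, sometimes explore
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def run_episode(q, start):
    """One robot's episode; returns its transition records."""
    s, transitions = start, []
    for _ in range(20):
        a = act(s, q)
        s2 = max(0, min(GOAL, s + a))
        r = 1.0 if s2 == GOAL else 0.0
        transitions.append((s, a, r, s2))
        s = s2
        if r:
            break
    return transitions

def server_update(q, batch, alpha=0.5, gamma=0.9):
    # the central server distills pooled experience into the shared estimate
    for s, a, r, s2 in batch:
        target = r + gamma * max(q.get((s2, b), 0.0) for b in ACTIONS)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))

random.seed(0)
q = {}  # the shared "network" (a table in this toy version)
for _ in range(200):                  # repeated collect/update cycles
    pooled = []
    for robot_start in (0, 2, 4):     # three robots, different start states
        pooled.extend(run_episode(q, robot_start))
    server_update(q, pooled)

greedy = max(ACTIONS, key=lambda a: q.get((4, a), 0.0))
```

After a few hundred cycles, the shared estimate favors stepping toward the goal, and every robot that copies it benefits from transitions the others gathered.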
In the video below, a robot explores the door opening task.<br /><div></div>With a few hours of practice, robots sharing their raw experience learn to make reaches to targets, and to open a door by making contact with the handle and pulling.  In the case of door opening, the robots learn to deal with the complex physics of the contacts between the hook and the door handle without building an explicit model of the world, as can be seen in the example below:<br /><div></div><b>Learning how the world works by interacting with objects. </b><br />Direct trial-and-error reinforcement learning is a great way to learn individual skills. However, humans and animals don&#8217;t learn exclusively by trial and error. We also build mental models about our environment and imagine how the world might change in response to our actions.<br /><br />We can start with the simplest of physical interactions, and have our robots learn the basics of cause and effect from reflecting on their own experiences. In this experiment, we had the robots play with a wide variety of common household objects by randomly prodding and pushing them inside a tabletop bin. The robots again shared their experiences with each other and together built a single predictive model that attempted to forecast what the world might look like in response to their actions. 
This predictive model can make simple, if slightly blurry, forecasts about future camera images when provided with the current image and a possible sequence of actions that the robot might execute:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-ywvHCpUF8Bs/V_LS9pIX09I/AAAAAAAABUc/ELKnKLvPS6AOoyXxzrXEcZPdXnuJAi0_wCLcB/s1600/image02.gif"><img border="0" height="632" src="https://3.bp.blogspot.com/-ywvHCpUF8Bs/V_LS9pIX09I/AAAAAAAABUc/ELKnKLvPS6AOoyXxzrXEcZPdXnuJAi0_wCLcB/s640/image02.gif" width="640"></a></td></tr><tr><td>Top row: robotic arms interacting with common household items.<br />Bottom row: Predicted future camera images given an initial image and a sequence of actions.</td></tr></tbody></table>Once this model is trained, the robots can use it to perform purposeful manipulations, for example based on user commands. In our prototype, a user can command the robot to move a particular object simply by clicking on that object, and then clicking on the point where the object should go:<br /><div><a href="https://4.bp.blogspot.com/-4h96hfyfLDg/V_LTn4dTZiI/AAAAAAAABUk/EyYHBQdaM3c_OEdw0vWHB9BSqjw27ERXQCLcB/s1600/image03.gif"><img border="0" height="360" src="https://4.bp.blogspot.com/-4h96hfyfLDg/V_LTn4dTZiI/AAAAAAAABUk/EyYHBQdaM3c_OEdw0vWHB9BSqjw27ERXQCLcB/s640/image03.gif" width="640"></a></div>The robots in this experiment were not told anything about objects or physics: they only see that the command requires a particular pixel to be moved to a particular place. However, because they have seen so many object interactions in their shared past experiences, they can forecast how particular actions will affect particular pixels. In order for such an implicit understanding of physics to emerge, the robots must be provided with a sufficient breadth of experience. This requires either a lot of time, or sharing the combined experiences of many robots. 
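In miniature, that click-to-move behavior amounts to searching over candidate action sequences and scoring each by where the forward model predicts the tracked pixel will land. In the sketch below, a hand-written toy model stands in for the learned video-prediction network, and all names are illustrative:

```python
# Sketch of planning with a learned forward model: score candidate action
# sequences by the predicted final position of the tracked pixel, and pick
# the sequence that lands closest to the user's target.
import itertools

def predict(pixel, actions):
    """Toy forward model: each action nudges the tracked pixel."""
    x, y = pixel
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)

def plan(pixel, target, horizon=3):
    """Exhaustively search short action sequences; return the one whose
    predicted outcome lands closest to the target."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
    best, best_dist = None, float("inf")
    for seq in itertools.product(moves, repeat=horizon):
        px, py = predict(pixel, seq)
        dist = (px - target[0]) ** 2 + (py - target[1]) ** 2
        if dist < best_dist:
            best, best_dist = seq, dist
    return best

# User clicks the object at (0, 0) and the destination (2, 1).
sequence = plan((0, 0), (2, 1))
end = predict((0, 0), sequence)
```

The real system replaces both the toy model and the exhaustive search with a learned video-prediction network and a more scalable optimizer, but the planning loop has the same shape: propose actions, predict outcomes, keep the best.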
An extended video on this project may be found <a href="https://www.youtube.com/watch?v=CKRWJEVSXMI">here</a>.<br /><table border="0" cellpadding="1" cellspacing="0"><tbody><tr><td><img border="0" src="https://1.bp.blogspot.com/-LbLR0XQkgUY/V_LUQDuWIeI/AAAAAAAABUs/c3BTaSQazYo6jEhnvipx14v2oB3-_PXJwCLcB/s320/image01.gif" width="100%"></td> <td><img border="0" src="https://1.bp.blogspot.com/-ZDhqDD0QLWU/V_LUS4UuQPI/AAAAAAAABUw/KRHkQutGtCMk0DlrSdW4oQbyc-LWulgFwCLcB/s320/image00.gif" width="100%"></td></tr></tbody></table><b>Learning with the help of humans.</b><br />So far, we discussed how robots can learn entirely on their own. However, human guidance is important, not just for telling the robot what to do, but also for helping the robots along. We have a lot of intuition about how various manipulation skills can be performed, and it only seems natural that transferring this intuition to robots can help them learn these skills a lot faster. In the next experiment, we provided each robot with a different door, and guided each of them by hand to show how these doors can be opened. These demonstrations are encoded into a single combined strategy for all robots, called a policy. The policy is a deep neural network which converts camera images to robot actions, and is maintained on a central server. The following video shows the instructor demonstrating the door-opening skill to a robot:<br /><div></div>Next, the robots collectively improve this policy through a trial-and-error learning process. Each robot attempts to open its own door using the latest available policy, with some added noise for exploration. These attempts allow each robot to plan a better strategy for opening the door the next time around, and improve the policy accordingly:<br /><div></div>Not surprisingly, we find that robots learn more effectively if they are trained on a curriculum of tasks that are gradually increasing in difficulty. 
In our experiment, each robot starts off by practicing the door-opening skill on a specific position and orientation of the door that the instructor had previously shown it. As it gets better at performing the task, the instructor starts to alter the position and orientation of the door to be just a bit beyond the current capabilities of the policy, but not so difficult that it fails entirely. This allows the robots to gradually increase their skill level over time, and expands the range of situations they can handle. The combination of human-guidance with trial-and-error learning allowed the robots to collectively learn the skill of door-opening in just a couple of hours. Since the robots were trained on doors that look different from each other, the final policy succeeds on a door with a handle that none of the robots had seen before:<br /><div></div>In all three of the experiments described above, the ability to communicate and exchange their experiences allows the robots to learn more quickly and effectively. This becomes particularly important when we combine robotic learning with deep learning, as is the case in all of the experiments discussed above. We&#8217;ve seen before that deep learning works best when provided with ample training data. For example, the popular ImageNet benchmark uses over 1.5 million labeled examples. While such a quantity of data is not impossible for a single robot to gather over a few years, it is much more efficient to gather the same volume of experience from multiple robots over the course of a few weeks. Besides faster learning times, this approach might benefit from the greater diversity of experience: a real-world deployment might involve multiple robots in different places and different settings, sharing heterogeneous, varied experiences to build a single highly generalizable representation.<br /><br />Of course, the kinds of behaviors that robots today can learn are still quite limited. 
Even basic motion skills, such as picking up objects and opening doors, remain in the realm of cutting edge research. In all of these experiments, a human engineer is still needed to tell the robots what they should learn to do by specifying a detailed objective function. However, as algorithms improve and robots are deployed more widely, their ability to share and pool their experiences could be instrumental for enabling them to assist us in our daily lives.<br /><br /><i>The experiments on learning by trial-and-error were conducted by Shixiang (Shane) Gu and Ethan Holly from the Google Brain team, and Timothy Lillicrap from DeepMind. Work on learning predictive models was conducted by Chelsea Finn from the Google Brain team, and the research on learning from demonstration was conducted by Yevgen Chebotar, Ali Yahya, Adrian Li, and Mrinal Kalakrishnan from X. We would also like to acknowledge contributions by Peter Pastor, Gabriel Dulac-Arnold, and Jon Scholz. Articles about each of the experiments discussed in this blog post can be found below:</i><br /><br /><a href="https://arxiv.org/abs/1610.00633">Deep Reinforcement Learning for Robotic Manipulation</a>. <i>Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine.</i>&#160;[<a href="https://sites.google.com/site/deeproboticmanipulation/">video</a>]<br /><a href="https://arxiv.org/abs/1610.00696"><br /></a> <a href="https://arxiv.org/abs/1610.00696">Deep Visual Foresight for Planning Robot Motion</a>. <i>Chelsea Finn, Sergey Levine.</i>&#160;[<a href="https://www.youtube.com/watch?v=CKRWJEVSXMI">video</a>] [<a href="https://sites.google.com/site/brainrobotdata/home">data</a>]<br /><br /><a href="https://arxiv.org/abs/1610.00673">Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search</a>. 
<br /><i>Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, Sergey Levine.</i>&#160; [<a href="https://youtu.be/ZBFwe1gF0FU">video</a>]<br /><a href="https://arxiv.org/abs/1610.00529"><br /></a> <a href="https://arxiv.org/abs/1610.00529">Path Integral Guided Policy Search</a>. <i>Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, Sergey Levine. </i>[<a href="https://www.youtube.com/watch?v=ncp1kY5JV90">video</a>]]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Sergey Levine (Google Brain Team), Timothy Lillicrap (DeepMind), Mrinal Kalakrishnan (X)</span><br /><br />The ability to learn from experience will likely be a key in enabling robots to help with complex real-world tasks, from assisting the elderly with chores and daily activities, to helping us in offices and hospitals, to performing jobs that are too dangerous or unpleasant for people. However, if each robot must learn its full repertoire of skills for these tasks only from its own experience, it could take far too long to acquire a rich enough range of behaviors to be useful. Could we bridge this gap by making it possible for robots to collectively learn from each other’s experiences?<br /><br />While machine learning algorithms have made great strides in <a href="https://research.googleblog.com/search/label/Natural%20Language%20Understanding">natural language understanding</a> and <a href="https://research.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html">speech recognition</a>, the kind of symbolic high-level reasoning that allows people to communicate complex concepts in words remains out of reach for machines. However, robots can instantaneously transmit their experience to other robots over the network - sometimes known as "<a href="http://goldberg.berkeley.edu/cloud-robotics/">cloud robotics</a>" - and it is this ability that can let them learn from each other.<br /><br />This is true even for seemingly simple low-level skills. Humans and animals excel at adaptive motor control that integrates their senses, reflexes, and muscles in a closely coordinated feedback loop. Robots still struggle with these basic skills in the real world, where the variability and complexity of the environment demands well-honed behaviors that are not easily fooled by distractors. 
If we enable robots to transmit their experiences to each other, could they learn to perform motion skills in close coordination with sensing in realistic environments?<br /><br />We <a href="https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html">previously wrote</a> about how multiple robots could pool their experiences to learn a grasping task. Here, we will discuss new experiments that we conducted to investigate three possible approaches for general-purpose skill learning across multiple robots: learning motion skills directly from experience, learning internal models of physics, and learning skills with human assistance. In all three cases, multiple robots shared their experiences to build a common model of the skill. The skills learned by the robots are still relatively simple -- pushing objects and opening doors -- but by learning such skills more quickly and efficiently through collective learning, robots might in the future acquire richer behavioral repertoires that could eventually make it possible for them to assist us in our daily lives.<br /><br /><b>Learning from raw experience with model-free reinforcement learning.</b><br />Perhaps one of the simplest ways for robots to teach each other is to pool information about their successes and failures in the world.  Humans and animals acquire many skills by direct trial-and-error learning.  During this kind of ‘model-free’ learning -- so called because there is no explicit model of the environment formed -- they explore variations on their existing behavior and then reinforce and exploit the variations that give bigger rewards. 
In combination with deep neural networks, model-free algorithms have recently proved to be surprisingly effective and have been key to <a href="https://research.googleblog.com/2015/02/from-pixels-to-actions-human-level.html">successes with the Atari video game system</a> and <a href="https://research.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html">playing Go</a>. Having multiple robots allows us to experiment with sharing experiences to speed up this kind of direct learning in the real world.<br /><br />In these experiments we tasked robots with trying to move their arms to goal locations, or reaching to and opening a door.  Each robot has a copy of a neural network that allows it to estimate the value of taking a given action in a given state.  By querying this network, the robot can quickly decide what actions might be worth taking in the world.  When a robot acts, we add noise to the actions it selects, so the resulting behavior is sometimes a bit better than previously observed, and sometimes a bit worse.  This allows each robot to explore different ways of approaching a task.  Records of the actions taken by the robots, their behaviors, and the final outcomes are sent back to a central server.  The server collects the experiences from all of the robots and uses them to iteratively improve the neural network that estimates value for different states and actions.  The model-free algorithms we employed look across both good and bad experiences and distill these into a new network that is better at understanding how action and success are related.  Then, at regular intervals, each robot takes a copy of the updated network from the server and begins to act using the information in its new network.  Given that this updated network is a bit better at estimating the true value of actions in the world, the robots will produce better behavior. This cycle can then be repeated to continue improving on the task. 
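As a rough sketch of this collect-aggregate-sync cycle (not the actual system, which trains deep value networks on camera images), the toy below has several simulated robots send noisy trials back to a central "server" that distills them into an improved shared policy; the task, the scalar policy, and all numbers here are invented purely for illustration:

```python
import random

# Toy illustration of pooled trial-and-error learning. The hidden optimum
# stands in for, e.g., the arm motion that opens the door; the "policy" is a
# single scalar action instead of a neural network, so the loop is easy to see.

TARGET = 0.7  # hidden optimum, unknown to the robots

def reward(action):
    return -(action - TARGET) ** 2

def train(num_robots=4, rounds=20, noise=0.2, seed=0):
    rng = random.Random(seed)
    policy = 0.0  # the central server's current shared policy
    for _ in range(rounds):
        experiences = []  # (action, reward) records sent back to the server
        for _ in range(num_robots):
            # each robot copies the latest policy and explores with added noise
            action = policy + rng.gauss(0.0, noise)
            experiences.append((action, reward(action)))
        # the server distills the pooled experience into an improved policy,
        # which every robot then downloads before the next round
        best_action, _ = max(experiences, key=lambda e: e[1])
        policy = 0.5 * policy + 0.5 * best_action  # smoothed update
    return policy

print(round(train(), 3))
```

With more robots per round, the server sees more exploration per unit of wall-clock time, which is the core reason pooling speeds up this kind of direct learning.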
In the video below, a robot explores the door opening task.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/iiD3Klvm96s/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/iiD3Klvm96s?rel=0&amp;feature=player_embedded" width="640"></iframe></div>With a few hours of practice, robots sharing their raw experience learn to make reaches to targets, and to open a door by making contact with the handle and pulling.  In the case of door opening, the robots learn to deal with the complex physics of the contacts between the hook and the door handle without building an explicit model of the world, as can be seen in the example below:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/R2kEi_KKSsA/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/R2kEi_KKSsA?rel=0&amp;feature=player_embedded" width="640"></iframe></div><b>Learning how the world works by interacting with objects. </b><br />Direct trial-and-error reinforcement learning is a great way to learn individual skills. However, humans and animals don’t learn exclusively by trial and error. We also build mental models about our environment and imagine how the world might change in response to our actions.<br /><br />We can start with the simplest of physical interactions, and have our robots learn the basics of cause and effect from reflecting on their own experiences. In this experiment, we had the robots play with a wide variety of common household objects by randomly prodding and pushing them inside a tabletop bin. The robots again shared their experiences with each other and together built a single predictive model that attempted to forecast what the world might look like in response to their actions. 
This predictive model can make simple, if slightly blurry, forecasts about future camera images when provided with the current image and a possible sequence of actions that the robot might execute:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-ywvHCpUF8Bs/V_LS9pIX09I/AAAAAAAABUc/ELKnKLvPS6AOoyXxzrXEcZPdXnuJAi0_wCLcB/s1600/image02.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="632" src="https://3.bp.blogspot.com/-ywvHCpUF8Bs/V_LS9pIX09I/AAAAAAAABUc/ELKnKLvPS6AOoyXxzrXEcZPdXnuJAi0_wCLcB/s640/image02.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Top row: robotic arms interacting with common household items.<br />Bottom row: Predicted future camera images given an initial image and a sequence of actions.</td></tr></tbody></table>Once this model is trained, the robots can use it to perform purposeful manipulations, for example based on user commands. In our prototype, a user can command the robot to move a particular object simply by clicking on that object, and then clicking on the point where the object should go:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-4h96hfyfLDg/V_LTn4dTZiI/AAAAAAAABUk/EyYHBQdaM3c_OEdw0vWHB9BSqjw27ERXQCLcB/s1600/image03.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://4.bp.blogspot.com/-4h96hfyfLDg/V_LTn4dTZiI/AAAAAAAABUk/EyYHBQdaM3c_OEdw0vWHB9BSqjw27ERXQCLcB/s640/image03.gif" width="640" /></a></div>The robots in this experiment were not told anything about objects or physics: they only see that the command requires a particular pixel to be moved to a particular place. 
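To make the planning idea concrete, here is a heavily simplified sketch: in place of the learned image-prediction network, a stand-in model predicts a single designated pixel's 1-D position under an action, and the planner samples candidate action sequences, imagines their outcomes with the model, and keeps the sequence whose predicted endpoint lands closest to the commanded goal. The model, parameter values, and names are all hypothetical stand-ins, not the authors' implementation:

```python
import random

def predict(pixel, action):
    # stand-in for the learned predictive model: one step of imagined motion
    return pixel + action

def plan(pixel, goal, horizon=5, samples=200, seed=0):
    """Sample action sequences, roll each through the model, keep the best."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        p = pixel
        for a in seq:            # imagine the sequence with the forward model
            p = predict(p, a)
        cost = abs(p - goal)     # how far the pixel ends from the clicked goal
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

seq, cost = plan(pixel=0.0, goal=3.0)
print(round(cost, 3))
```

The real planner re-plans after every executed step using fresh camera input, but the sample-imagine-select structure is the same.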
However, because they have seen so many object interactions in their shared past experiences, they can forecast how particular actions will affect particular pixels. In order for such an implicit understanding of physics to emerge, the robots must be provided with a sufficient breadth of experience. This requires either a lot of time, or sharing the combined experiences of many robots. An extended video on this project may be found <a href="https://www.youtube.com/watch?v=CKRWJEVSXMI">here</a>.<br /><table border="0" cellpadding="1" cellspacing="0" style="width: 100%;"><tbody><tr> <td><img border="0" src="https://1.bp.blogspot.com/-LbLR0XQkgUY/V_LUQDuWIeI/AAAAAAAABUs/c3BTaSQazYo6jEhnvipx14v2oB3-_PXJwCLcB/s320/image01.gif" width="100%" /></td> <td><img border="0" src="https://1.bp.blogspot.com/-ZDhqDD0QLWU/V_LUS4UuQPI/AAAAAAAABUw/KRHkQutGtCMk0DlrSdW4oQbyc-LWulgFwCLcB/s320/image00.gif" width="100%" /></td></tr></tbody></table><b>Learning with the help of humans.</b><br />So far, we discussed how robots can learn entirely on their own. However, human guidance is important, not just for telling the robot what to do, but also for helping the robots along. We have a lot of intuition about how various manipulation skills can be performed, and it only seems natural that transferring this intuition to robots can help them learn these skills a lot faster. In the next experiment, we provided each robot with a different door, and guided each of them by hand to show how these doors can be opened. These demonstrations are encoded into a single combined strategy for all robots, called a policy. The policy is a deep neural network which converts camera images to robot actions, and is maintained on a central server. 
The following video shows the instructor demonstrating the door-opening skill to a robot:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/9-9udQtO1PY/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/9-9udQtO1PY?rel=0&amp;feature=player_embedded" width="640"></iframe></div>Next, the robots collectively improve this policy through a trial-and-error learning process. Each robot attempts to open its own door using the latest available policy, with some added noise for exploration. These attempts allow each robot to plan a better strategy for opening the door the next time around, and improve the policy accordingly:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/QZvu8M02BeE/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/QZvu8M02BeE?rel=0&amp;feature=player_embedded" width="640"></iframe></div>Not surprisingly, we find that robots learn more effectively if they are trained on a curriculum of tasks that are gradually increasing in difficulty. In our experiment, each robot starts off by practicing the door-opening skill on a specific position and orientation of the door that the instructor had previously shown it. As it gets better at performing the task, the instructor starts to alter the position and orientation of the door to be just a bit beyond the current capabilities of the policy, but not so difficult that it fails entirely. This allows the robots to gradually increase their skill level over time, and expands the range of situations they can handle. The combination of human guidance with trial-and-error learning allowed the robots to collectively learn the skill of door-opening in just a couple of hours. 
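The curriculum schedule can be sketched in a few lines. In this toy version, "skill" and "difficulty" are scalars (in the experiment they are a neural-network policy and the door's physical position and orientation), and the update rules are invented for illustration only:

```python
import random

def run_curriculum(steps=200, seed=0):
    rng = random.Random(seed)
    skill, difficulty = 0.0, 0.1
    for _ in range(steps):
        # success is likely when the task sits at or below the current skill
        p_success = max(0.05, min(0.95, 1.0 - (difficulty - skill)))
        if rng.random() < p_success:
            skill += 0.02        # practice on feasible tasks builds skill
            difficulty += 0.01   # nudge the task just beyond current ability
        else:
            difficulty = max(0.1, difficulty - 0.01)  # back off if too hard
    return skill, difficulty

skill, difficulty = run_curriculum()
print(round(skill, 2), round(difficulty, 2))
```

The key property, mirrored from the experiment, is that difficulty only ratchets up while the policy keeps succeeding, so the robots rarely waste trials on tasks they cannot yet attempt.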
Since the robots were trained on doors that look different from each other, the final policy succeeds on a door with a handle that none of the robots had seen before:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/YBVR-TRXEc4/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/YBVR-TRXEc4?rel=0&amp;feature=player_embedded" width="640"></iframe></div>In all three of the experiments described above, the ability to communicate and exchange their experiences allows the robots to learn more quickly and effectively. This becomes particularly important when we combine robotic learning with deep learning, as is the case in all of the experiments discussed above. We’ve seen before that deep learning works best when provided with ample training data. For example, the popular ImageNet benchmark uses over 1.5 million labeled examples. While such a quantity of data is not impossible for a single robot to gather over a few years, it is much more efficient to gather the same volume of experience from multiple robots over the course of a few weeks. Besides faster learning times, this approach might benefit from the greater diversity of experience: a real-world deployment might involve multiple robots in different places and different settings, sharing heterogeneous, varied experiences to build a single highly generalizable representation.<br /><br />Of course, the kinds of behaviors that robots today can learn are still quite limited. Even basic motion skills, such as picking up objects and opening doors, remain in the realm of cutting edge research. In all of these experiments, a human engineer is still needed to tell the robots what they should learn to do by specifying a detailed objective function. 
However, as algorithms improve and robots are deployed more widely, their ability to share and pool their experiences could be instrumental for enabling them to assist us in our daily lives.<br /><br /><i>The experiments on learning by trial-and-error were conducted by Shixiang (Shane) Gu and Ethan Holly from the Google Brain team, and Timothy Lillicrap from DeepMind. Work on learning predictive models was conducted by Chelsea Finn from the Google Brain team, and the research on learning from demonstration was conducted by Yevgen Chebotar, Ali Yahya, Adrian Li, and Mrinal Kalakrishnan from X. We would also like to acknowledge contributions by Peter Pastor, Gabriel Dulac-Arnold, and Jon Scholz. Articles about each of the experiments discussed in this blog post can be found below:</i><br /><br /><a href="https://arxiv.org/abs/1610.00633">Deep Reinforcement Learning for Robotic Manipulation</a>. <i>Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine.</i>&nbsp;[<a href="https://sites.google.com/site/deeproboticmanipulation/">video</a>]<br /><a href="https://arxiv.org/abs/1610.00696"><br /></a> <a href="https://arxiv.org/abs/1610.00696">Deep Visual Foresight for Planning Robot Motion</a>. <i>Chelsea Finn, Sergey Levine.</i>&nbsp;[<a href="https://www.youtube.com/watch?v=CKRWJEVSXMI">video</a>] [<a href="https://sites.google.com/site/brainrobotdata/home">data</a>]<br /><br /><a href="https://arxiv.org/abs/1610.00673">Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search</a>. <br /><i>Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, Sergey Levine.</i>&nbsp; [<a href="https://youtu.be/ZBFwe1gF0FU">video</a>]<br /><a href="https://arxiv.org/abs/1610.00529"><br /></a> <a href="https://arxiv.org/abs/1610.00529">Path Integral Guided Policy Search</a>. <i>Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, Sergey Levine. 
</i>[<a href="https://www.youtube.com/watch?v=ncp1kY5JV90">video</a>]]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/how-robots-can-acquire-new-skills-from-their-shared-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Introducing the Open Images Dataset</title>
		<link>https://googledata.org/google-research/introducing-the-open-images-dataset/</link>
		<comments>https://googledata.org/google-research/introducing-the-open-images-dataset/#comments</comments>
		<pubDate>Fri, 30 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=6fc034636a594021eeccd06948c99b99</guid>
		<description><![CDATA[<span>Posted by Ivan Krasin and Tom Duerig, Software Engineers</span><br /><br />In the last few years, advances in machine learning have enabled <a href="https://en.wikipedia.org/wiki/Computer_vision">Computer Vision</a> to progress rapidly, allowing for everything from systems that can <a href="https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html">automatically caption images</a> to apps that can create <a href="https://research.googleblog.com/2016/05/aw-so-cute-allo-helps-you-respond-to.html">natural language replies in response to shared photos</a>. Much of this progress can be attributed to publicly available image datasets, such as <a href="http://image-net.org/">ImageNet</a> and <a href="http://mscoco.org/">COCO</a> for supervised learning, and <a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&#38;did=67">YFCC100M</a> for unsupervised learning.<br /><br />Today, we introduce <a href="https://github.com/openimages/dataset">Open Images</a>, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch, and the images are listed as having a <a href="https://creativecommons.org/licenses/by/2.0/">Creative Commons Attribution</a> license<a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>. <br /><br />The image-level annotations have been populated automatically with a vision model similar to <a href="https://cloud.google.com/vision/">Google Cloud Vision API</a>. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. 
Here are some examples:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-K7oXk2v5Buk/V-2c0Cd_M-I/AAAAAAAABUE/Zl4wdW_T5rAJiYTfgG2HcYsGOjc3hOGrgCLcB/s1600/Open%2BImages.png"><img border="0" height="220" src="https://3.bp.blogspot.com/-K7oXk2v5Buk/V-2c0Cd_M-I/AAAAAAAABUE/Zl4wdW_T5rAJiYTfgG2HcYsGOjc3hOGrgCLcB/s640/Open%2BImages.png" width="640"></a></td></tr><tr><td>Annotated images from the Open Images dataset. <b>Left:</b> <a href="https://www.flickr.com/photos/kevinkrejci/2957748348">Ghost Arches</a> by <a href="https://www.flickr.com/photos/kevinkrejci/">Kevin Krejci</a>. <b>Right:</b>  <a href="https://www.flickr.com/photos/lobsterstew/3197736453">Some Silverware</a> by <a href="https://www.flickr.com/photos/lobsterstew/">J B</a>. Both images used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> license.</td></tr></tbody></table>We have trained an Inception v3 model based on Open Images annotations alone, and the model is good enough to be used for fine-tuning applications as well as for other things, like <a href="https://research.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html">DeepDream</a> or <a href="https://arxiv.org/abs/1508.06576">artistic style transfer</a>, which require a well-developed hierarchy of filters. We hope to improve the quality of the annotations in Open Images in the coming months, and therefore the quality of models which can be trained.<br /><br />The dataset is a product of a collaboration between Google, CMU and Cornell universities, and there are a number of research papers built on top of the Open Images dataset in the works. 
It is our hope that datasets like <a href="https://github.com/openimages/dataset">Open Images</a> and the <a href="https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html">recently released YouTube-8M</a> will be useful tools for the machine learning community.<br /><br /><span><br /><a name="1"><b>* </b></a>While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.<a href="http://research.googleblog.com/#top1"><sup>&#8617;</sup></a><br /></span>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Ivan Krasin and Tom Duerig, Software Engineers</span><br /><br />In the last few years, advances in machine learning have enabled <a href="https://en.wikipedia.org/wiki/Computer_vision">Computer Vision</a> to progress rapidly, allowing for everything from systems that can <a href="https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html">automatically caption images</a> to apps that can create <a href="https://research.googleblog.com/2016/05/aw-so-cute-allo-helps-you-respond-to.html">natural language replies in response to shared photos</a>. Much of this progress can be attributed to publicly available image datasets, such as <a href="http://image-net.org/">ImageNet</a> and <a href="http://mscoco.org/">COCO</a> for supervised learning, and <a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&amp;did=67">YFCC100M</a> for unsupervised learning.<br /><br />Today, we introduce <a href="https://github.com/openimages/dataset">Open Images</a>, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch, and the images are listed as having a <a href="https://creativecommons.org/licenses/by/2.0/">Creative Commons Attribution</a> license<a href="http://research.googleblog.com/2016/09/introducing-open-images-dataset.html#1" name="top1"><sup>*</sup></a>. <br /><br />The image-level annotations have been populated automatically with a vision model similar to <a href="https://cloud.google.com/vision/">Google Cloud Vision API</a>. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. 
Here are some examples:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-K7oXk2v5Buk/V-2c0Cd_M-I/AAAAAAAABUE/Zl4wdW_T5rAJiYTfgG2HcYsGOjc3hOGrgCLcB/s1600/Open%2BImages.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="220" src="https://3.bp.blogspot.com/-K7oXk2v5Buk/V-2c0Cd_M-I/AAAAAAAABUE/Zl4wdW_T5rAJiYTfgG2HcYsGOjc3hOGrgCLcB/s640/Open%2BImages.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Annotated images from the Open Images dataset. <b>Left:</b> <a href="https://www.flickr.com/photos/kevinkrejci/2957748348">Ghost Arches</a> by <a href="https://www.flickr.com/photos/kevinkrejci/">Kevin Krejci</a>. <b>Right:</b>  <a href="https://www.flickr.com/photos/lobsterstew/3197736453">Some Silverware</a> by <a href="https://www.flickr.com/photos/lobsterstew/">J B</a>. Both images used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> license.</td></tr></tbody></table>We have trained an Inception v3 model based on Open Images annotations alone, and the model is good enough to be used for fine-tuning applications as well as for other things, like <a href="https://research.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html">DeepDream</a> or <a href="https://arxiv.org/abs/1508.06576">artistic style transfer</a>, which require a well-developed hierarchy of filters. We hope to improve the quality of the annotations in Open Images in the coming months, and therefore the quality of models which can be trained.<br /><br />The dataset is a product of a collaboration between Google, CMU and Cornell universities, and there are a number of research papers built on top of the Open Images dataset in the works. 
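As a sketch of how one might work with image-level annotations like these, the snippet below aggregates confident machine-generated labels per image from a tiny inline sample. The column names and label identifiers here are assumptions for illustration only, not the dataset's actual schema (the GitHub repository documents the real file formats):

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of a machine-generated annotations CSV:
# one row per (image, label, confidence) triple. Real files are much larger
# and are keyed by MID-style label identifiers.
sample = """ImageID,Label,Confidence
img_001,/m/0k4j,0.97
img_001,/m/07yv9,0.91
img_002,/m/01g317,0.88
img_002,/m/04yx4,0.83
img_002,/m/0k4j,0.74
"""

labels_per_image = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    if float(row["Confidence"]) >= 0.8:   # keep only confident labels
        labels_per_image[row["ImageID"]].append(row["Label"])

# e.g., check the average number of labels per image
average = sum(len(v) for v in labels_per_image.values()) / len(labels_per_image)
print(average)
```

The same pattern (filter by confidence, group by image) is how you would verify statistics such as the roughly 8 labels assigned per image mentioned above, once pointed at the real annotation files.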
It is our hope that datasets like <a href="https://github.com/openimages/dataset">Open Images</a> and the <a href="https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html">recently released YouTube-8M</a> will be useful tools for the machine learning community.<br /><br /><span class="Apple-style-span" style="font-size: small;"><br /><a name="1"><b>* </b></a>While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.<a href="http://research.googleblog.com/2016/09/introducing-open-images-dataset.html#top1"><sup>↩</sup></a><br /></span>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/introducing-the-open-images-dataset/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Image Compression with Neural Networks</title>
		<link>https://googledata.org/google-research/image-compression-with-neural-networks/</link>
		<comments>https://googledata.org/google-research/image-compression-with-neural-networks/#comments</comments>
		<pubDate>Thu, 29 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=1d8b970222b611114a044e2a3f311aec</guid>
		<description><![CDATA[<span>Posted by Nick Johnston and David Minnen, Software Engineers</span><br /><br />Data compression is used nearly everywhere on the internet - the videos you watch online, the images you share, the music you listen to, even the blog you're reading right now. Compression techniques make sharing the content you want quick and efficient. Without data compression, the time and bandwidth costs for getting the information you need, when you need it, would be exorbitant!<br /><br />In "<a href="https://arxiv.org/abs/1608.05148"><i>Full Resolution Image Compression with Recurrent Neural Networks</i></a>", we expand on our <a href="https://arxiv.org/abs/1511.06085">previous research</a> on data compression using neural networks, exploring whether machine learning can provide better results for image compression like it has for <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">image recognition</a> and <a href="https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html">text summarization</a>. Furthermore, we are <a href="https://github.com/tensorflow/models/tree/master/compression">releasing our compression model</a>  via <a href="https://www.tensorflow.org/">TensorFlow</a> so you can experiment with compressing your own images with our network. <br /><br />We introduce an architecture that uses a new variant of the <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">Gated Recurrent Unit</a> (a type of  <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">RNN</a> that allows units to save activations and process sequences) called Residual Gated Recurrent Unit (Residual GRU). Our Residual GRU combines existing GRUs with the residual connections introduced in "<a href="https://arxiv.org/abs/1512.03385"><i>Deep Residual Learning for Image Recognition</i></a>" to achieve significant image quality gains for a given compression rate. 
Instead of using a DCT to generate a new bit representation like many compression schemes in use today, we train two sets of neural networks - one to create the codes from the image (encoder) and another to create the image from the codes (decoder). <br /><br />Our system works by iteratively refining a reconstruction of the original image, with both the encoder and decoder using Residual GRU layers so that additional information can pass from one iteration to the next. Each iteration adds more bits to the encoding, which allows for a higher quality reconstruction. Conceptually, the network operates as follows:<br /><ol><li>The initial residual, R[0], corresponds to the original image I: R[0] = I.</li><li>Set i=1 for the first iteration.</li><li>Iteration[i] takes R[i-1] as input and runs the encoder and binarizer to compress the image into B[i].</li><li>Iteration[i] runs the decoder on B[i] to generate a reconstructed image P[i].</li><li>The residual for Iteration[i] is calculated: R[i] = I - P[i].</li><li>Set i=i+1 and go to Step 3 (up to the desired number of iterations).</li></ol>The residual image represents how different the current version of the compressed image is from the original. This image is then given as input to the network with the goal of removing the compression errors from the next version of the compressed image. The compressed image is now represented by the concatenation of B[1] through B[N]. For larger values of N, the decoder gets more information on how to reduce the errors and generate a higher quality reconstruction of the original image.<br /><br />To understand how this works, consider the following example of the first two iterations of the image compression network, shown in the figures below.  We start with an image of a lighthouse. On the first pass through the network, the original image is given as an input (R[0] = I). P[1] is the reconstructed image. 
The difference between the original image and the reconstructed image is the residual, R[1], which represents the error in the compression. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-zrOb9rbWFSk/V-1FFXToERI/AAAAAAAABTc/qntzldIYTQoNBLYoqLl7_fLOkz0lg6gigCLcB/s1600/1.png"><img border="0" height="316" src="https://3.bp.blogspot.com/-zrOb9rbWFSk/V-1FFXToERI/AAAAAAAABTc/qntzldIYTQoNBLYoqLl7_fLOkz0lg6gigCLcB/s640/1.png" width="640"></a></td></tr><tr><td><b>Left:</b> Original image, I = R[0]. <b>Center:</b> Reconstructed image, P[1]. <b>Right:</b> the residual, R[1], which represents the error introduced by compression.</td></tr></tbody></table>On the second pass through the network, R[1] is given as the network&#8217;s input (see figure below). A higher quality image P[2] is then created. So how does the system recreate such a good image (P[2], center panel below) from the residual R[1]? Because the model uses recurrent nodes with memory, the network saves information from each iteration that it can use in the next one. It learned something about the original image in Iteration[1] that is used along with R[1] to generate a better P[2] from B[2].  Lastly, a new residual, R[2] (right), is generated by subtracting P[2] from the original image. This time the residual is smaller since there are fewer differences between the reconstructed image and what we started with.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-Y5UNSkEIHss/V-1FXPrSYWI/AAAAAAAABTg/hQjWHhorFsAj6mchR3wYdfCt7hYePLcOwCLcB/s1600/2.png"><img border="0" height="314" src="https://2.bp.blogspot.com/-Y5UNSkEIHss/V-1FXPrSYWI/AAAAAAAABTg/hQjWHhorFsAj6mchR3wYdfCt7hYePLcOwCLcB/s640/2.png" width="640"></a></td></tr><tr><td>The second pass through the network. <b>Left:</b> R[1] is given as input. <b>Center:</b> A higher quality reconstruction, P[2]. 
<b>Right:</b> A smaller residual R[2] is generated by subtracting P[2] from the original image.</td></tr></tbody></table>At each further iteration, the network gains more information about the errors introduced by compression (which is captured by the residual image). If it can use that information to predict the residuals even a little bit, the result is a better reconstruction. Our models are able to make use of the extra bits up to a point. We see diminishing returns, and at some point the representational power of the network is exhausted.<br /><br />To demonstrate file size and quality differences, we can take a photo of Vash, a <a href="https://en.wikipedia.org/wiki/Japanese_Chin">Japanese Chin</a>, and generate two compressed images, one JPEG and one Residual GRU. Both images target a perceptual similarity of 0.9 <a href="https://ece.uwaterloo.ca/~z70wang/publications/msssim.html">MS-SSIM</a>, a perceptual quality metric that reaches 1.0 for identical images. The image generated by our learned model results in a file 25% smaller than JPEG.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-mJj8zJF2Fuo/V-1FuZ0SfRI/AAAAAAAABTk/TeUe0Y6Njfkxun7rbt_GiDS9d7Zmw3BCwCLcB/s1600/3.png"><img border="0" height="274" src="https://1.bp.blogspot.com/-mJj8zJF2Fuo/V-1FuZ0SfRI/AAAAAAAABTk/TeUe0Y6Njfkxun7rbt_GiDS9d7Zmw3BCwCLcB/s640/3.png" width="640"></a></td></tr><tr><td><b>Left:</b> Original image (1419 KB PNG) at ~1.0 MS-SSIM. <b>Center:</b> JPEG (33 KB) at ~0.9 MS-SSIM. <b>Right:</b>  Residual GRU (24 KB) at ~0.9 MS-SSIM. This is 25% smaller for comparable image quality.</td></tr></tbody></table>Taking a look around his nose and mouth, we see that our method doesn&#8217;t have the magenta blocks and noise in the middle of the image as seen in JPEG. 
This is due to the <a href="https://en.wikipedia.org/wiki/Compression_artifact#Block_boundary_artifacts">blocking artifacts</a> produced by JPEG, whereas our compression network works on the entire image at once. However, there's a tradeoff -- in our model the details of the whiskers and texture are lost, but the system shows great promise in reducing artifacts.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-VDwRAX6wro0/V-1GAMqvFOI/AAAAAAAABTs/CPWE6nhomx0VSxpsYBdzYQh5BmA9xy7HQCLcB/s1600/4.png"><img border="0" height="208" src="https://3.bp.blogspot.com/-VDwRAX6wro0/V-1GAMqvFOI/AAAAAAAABTs/CPWE6nhomx0VSxpsYBdzYQh5BmA9xy7HQCLcB/s640/4.png" width="640"></a></td></tr><tr><td><b>Left:</b> Original. <b>Center:</b> JPEG. <b>Right:</b> Residual GRU.</td></tr></tbody></table>While today&#8217;s commonly used codecs perform well, our work shows that using neural networks to compress images results in a compression scheme with higher quality and smaller file sizes. To learn more about the details of our research and a comparison of other recurrent architectures, check out <a href="https://arxiv.org/abs/1608.05148">our paper</a>. Our future work will focus on even better compression quality and faster models, so stay tuned!<br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Nick Johnston and David Minnen, Software Engineers</span><br /><br />Data compression is used nearly everywhere on the internet - the videos you watch online, the images you share, the music you listen to, even the blog you're reading right now. Compression techniques make sharing the content you want quick and efficient. Without data compression, the time and bandwidth costs for getting the information you need, when you need it, would be exorbitant!<br /><br />In "<a href="https://arxiv.org/abs/1608.05148"><i>Full Resolution Image Compression with Recurrent Neural Networks</i></a>", we expand on our <a href="https://arxiv.org/abs/1511.06085">previous research</a> on data compression using neural networks, exploring whether machine learning can provide better results for image compression like it has for <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">image recognition</a> and <a href="https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html">text summarization</a>. Furthermore, we are <a href="https://github.com/tensorflow/models/tree/master/compression">releasing our compression model</a>  via <a href="https://www.tensorflow.org/">TensorFlow</a> so you can experiment with compressing your own images with our network. <br /><br />We introduce an architecture that uses a new variant of the <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">Gated Recurrent Unit</a> (a type of  <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">RNN</a> that allows units to save activations and process sequences) called Residual Gated Recurrent Unit (Residual GRU). Our Residual GRU combines existing GRUs with the residual connections introduced in "<a href="https://arxiv.org/abs/1512.03385"><i>Deep Residual Learning for Image Recognition</i></a>" to achieve significant image quality gains for a given compression rate. 
Instead of using a DCT to generate a new bit representation like many compression schemes in use today, we train two sets of neural networks - one to create the codes from the image (encoder) and another to create the image from the codes (decoder). <br /><br />Our system works by iteratively refining a reconstruction of the original image, with both the encoder and decoder using Residual GRU layers so that additional information can pass from one iteration to the next. Each iteration adds more bits to the encoding, which allows for a higher quality reconstruction. Conceptually, the network operates as follows:<br /><ol><li>The initial residual, R[0], corresponds to the original image I: R[0] = I.</li><li>Set i=1 for the first iteration.</li><li>Iteration[i] takes R[i-1] as input and runs the encoder and binarizer to compress the image into B[i].</li><li>Iteration[i] runs the decoder on B[i] to generate a reconstructed image P[i].</li><li>The residual for Iteration[i] is calculated: R[i] = I - P[i].</li><li>Set i=i+1 and go to Step 3 (up to the desired number of iterations).</li></ol>The residual image represents how different the current version of the compressed image is from the original. This image is then given as input to the network with the goal of removing the compression errors from the next version of the compressed image. The compressed image is now represented by the concatenation of B[1] through B[N]. For larger values of N, the decoder gets more information on how to reduce the errors and generate a higher quality reconstruction of the original image.<br /><br />To understand how this works, consider the following example of the first two iterations of the image compression network, shown in the figures below.  We start with an image of a lighthouse. On the first pass through the network, the original image is given as an input (R[0] = I). P[1] is the reconstructed image. 
The difference between the original image and the reconstructed image is the residual, R[1], which represents the error in the compression. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-zrOb9rbWFSk/V-1FFXToERI/AAAAAAAABTc/qntzldIYTQoNBLYoqLl7_fLOkz0lg6gigCLcB/s1600/1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="316" src="https://3.bp.blogspot.com/-zrOb9rbWFSk/V-1FFXToERI/AAAAAAAABTc/qntzldIYTQoNBLYoqLl7_fLOkz0lg6gigCLcB/s640/1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>Left:</b> Original image, I = R[0]. <b>Center:</b> Reconstructed image, P[1]. <b>Right:</b> the residual, R[1], which represents the error introduced by compression.</td></tr></tbody></table>On the second pass through the network, R[1] is given as the network’s input (see figure below). A higher quality image P[2] is then created. So how does the system recreate such a good image (P[2], center panel below) from the residual R[1]? Because the model uses recurrent nodes with memory, the network saves information from each iteration that it can use in the next one. It learned something about the original image in Iteration[1] that is used along with R[1] to generate a better P[2] from B[2].  Lastly, a new residual, R[2] (right), is generated by subtracting P[2] from the original image. 
This time the residual is smaller since there are fewer differences between the reconstructed image and what we started with.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-Y5UNSkEIHss/V-1FXPrSYWI/AAAAAAAABTg/hQjWHhorFsAj6mchR3wYdfCt7hYePLcOwCLcB/s1600/2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://2.bp.blogspot.com/-Y5UNSkEIHss/V-1FXPrSYWI/AAAAAAAABTg/hQjWHhorFsAj6mchR3wYdfCt7hYePLcOwCLcB/s640/2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The second pass through the network. <b>Left:</b> R[1] is given as input. <b>Center:</b> A higher quality reconstruction, P[2]. <b>Right:</b> A smaller residual R[2] is generated by subtracting P[2] from the original image.</td></tr></tbody></table>At each further iteration, the network gains more information about the errors introduced by compression (which is captured by the residual image). If it can use that information to predict the residuals even a little bit, the result is a better reconstruction. Our models are able to make use of the extra bits up to a point. We see diminishing returns, and at some point the representational power of the network is exhausted.<br /><br />To demonstrate file size and quality differences, we can take a photo of Vash, a <a href="https://en.wikipedia.org/wiki/Japanese_Chin">Japanese Chin</a>, and generate two compressed images, one JPEG and one Residual GRU. Both images target a perceptual similarity of 0.9 <a href="https://ece.uwaterloo.ca/~z70wang/publications/msssim.html">MS-SSIM</a>, a perceptual quality metric that reaches 1.0 for identical images. 
The image generated by our learned model results in a file 25% smaller than JPEG.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-mJj8zJF2Fuo/V-1FuZ0SfRI/AAAAAAAABTk/TeUe0Y6Njfkxun7rbt_GiDS9d7Zmw3BCwCLcB/s1600/3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="274" src="https://1.bp.blogspot.com/-mJj8zJF2Fuo/V-1FuZ0SfRI/AAAAAAAABTk/TeUe0Y6Njfkxun7rbt_GiDS9d7Zmw3BCwCLcB/s640/3.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>Left:</b> Original image (1419 KB PNG) at ~1.0 MS-SSIM. <b>Center:</b> JPEG (33 KB) at ~0.9 MS-SSIM. <b>Right:</b>  Residual GRU (24 KB) at ~0.9 MS-SSIM. This is 25% smaller for comparable image quality.</td></tr></tbody></table>Taking a look around his nose and mouth, we see that our method doesn’t have the magenta blocks and noise in the middle of the image as seen in JPEG. This is due to the <a href="https://en.wikipedia.org/wiki/Compression_artifact#Block_boundary_artifacts">blocking artifacts</a> produced by JPEG, whereas our compression network works on the entire image at once. 
However, there's a tradeoff -- in our model the details of the whiskers and texture are lost, but the system shows great promise in reducing artifacts.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-VDwRAX6wro0/V-1GAMqvFOI/AAAAAAAABTs/CPWE6nhomx0VSxpsYBdzYQh5BmA9xy7HQCLcB/s1600/4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://3.bp.blogspot.com/-VDwRAX6wro0/V-1GAMqvFOI/AAAAAAAABTs/CPWE6nhomx0VSxpsYBdzYQh5BmA9xy7HQCLcB/s640/4.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>Left:</b> Original. <b>Center:</b> JPEG. <b>Right:</b> Residual GRU.</td></tr></tbody></table>While today’s commonly used codecs perform well, our work shows that using neural networks to compress images results in a compression scheme with higher quality and smaller file sizes. To learn more about the details of our research and a comparison of other recurrent architectures, check out <a href="https://arxiv.org/abs/1608.05148">our paper</a>. Our future work will focus on even better compression quality and faster models, so stay tuned!<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/image-compression-with-neural-networks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
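The iterative refinement loop in the post above can be sketched in a few lines of Python. This is a minimal stand-in, not the released TensorFlow model: the real encoder, binarizer, and decoder are learned Residual GRU networks that carry memory across iterations, whereas here a hypothetical coarse quantizer plays the encoder and the reconstruction simply accumulates the decoded residuals.

```python
import numpy as np

def compress(image, iterations=4, step=64.0):
    """Iterative refinement: each pass encodes the remaining residual
    R[i-1] into codes B[i], and the reconstruction P[i] improves as
    more bits are added."""
    codes = []
    recon = np.zeros_like(image, dtype=np.float64)   # P[0]
    residual = image.astype(np.float64)              # R[0] = I
    for _ in range(iterations):
        b = np.round(residual / step)   # encoder + binarizer stand-in -> B[i]
        codes.append(b)
        recon = recon + b * step        # decoder stand-in -> P[i]
        residual = image - recon        # R[i] = I - P[i]
        step /= 4.0                     # later passes spend bits on finer detail
    return codes, recon

img = np.random.default_rng(1).uniform(0.0, 255.0, (8, 8))
codes, recon = compress(img)
err = float(np.abs(img - recon).max())  # shrinks as iterations increase
```

As in the post, the compressed representation is the concatenation of the per-iteration codes, and stopping after fewer iterations trades reconstruction quality for a smaller encoding.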
		<item>
		<title>Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research</title>
		<link>https://googledata.org/youtube/announcing-youtube-8m-a-large-and-diverse-labeled-video-dataset-for-video-understanding-research/</link>
		<comments>https://googledata.org/youtube/announcing-youtube-8m-a-large-and-diverse-labeled-video-dataset-for-video-understanding-research/#comments</comments>
		<pubDate>Wed, 28 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[Youtube]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=8564253f15fe1f87fac089f133532dce</guid>
		<description><![CDATA[<span>Posted by Sudheendra Vijayanarasimhan and Paul Natsev, Software Engineers</span><br /><br />Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as <a href="http://www.image-net.org/">ImageNet</a>, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on <a href="http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html">detecting and classifying objects in static images</a>.<br /><br /><a href="https://research.googleblog.com/2015/04/beyond-short-snippets-deep-networks-for.html">Video analysis</a> provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world.  Improving video understanding can lead to better video search and discovery, similarly to how image understanding <a href="https://googleblog.blogspot.com/2015/05/picture-this-fresh-approach-to-photos.html">helped re-imagine the photos experience</a>. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets. <br /><br />Today, we are excited to announce the release of <a href="https://research.google.com/youtube8m">YouTube-8M</a>, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 <a href="https://developers.google.com/knowledge-graph/">Knowledge Graph</a> entities.  This represents a significant increase in scale and diversity compared to existing video datasets. 
For example, <a href="https://github.com/gtoderici/sports-1m-dataset">Sports-1M</a>, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes--YouTube-8M represents nearly an <i>order of magnitude increase</i> in both number of videos <i>and</i> classes.<br /><div><a href="https://2.bp.blogspot.com/-7TeQwPP34iE/V-vnyZPp0dI/AAAAAAAABRA/uH19ST1iJdITzX73A8Uu5HMRjTrLvWc-QCEw/s1600/image08.png"><img border="0" height="74" src="https://2.bp.blogspot.com/-7TeQwPP34iE/V-vnyZPp0dI/AAAAAAAABRA/uH19ST1iJdITzX73A8Uu5HMRjTrLvWc-QCEw/s640/image08.png" width="640"></a></div>In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos.  While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes. <br /><div> <br /></div>To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities, which are visually observable and sufficiently frequent. The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our <a href="http://arxiv.org/abs/1609.08675">technical report</a>). 
The figures below depict the <a href="https://research.google.com/youtube8m/explore.html">dataset browser</a> and the distribution of videos along the top-level verticals, and illustrate the dataset&#8217;s scale and diversity.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-nyrn6uJpjtk/V-vwHkrY5JI/AAAAAAAABRU/uqjVCmFGq_gSfVRlEeyHRkcirHwE3BNfQCLcB/s1600/image07.png"><img border="0" height="428" src="https://4.bp.blogspot.com/-nyrn6uJpjtk/V-vwHkrY5JI/AAAAAAAABRU/uqjVCmFGq_gSfVRlEeyHRkcirHwE3BNfQCLcB/s640/image07.png" width="640"></a></td></tr><tr><td>A <a href="https://research.google.com/youtube8m/explore.html">dataset explorer </a>allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. This screenshot depicts a subset of dataset videos annotated with the entity &#8220;Guitar&#8221;.</td></tr></tbody></table><div></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-a9S_rpU77wk/V-vwjxMX96I/AAAAAAAABRc/x8OI8UrD0tgyjSfNc49cwq2PLI1ooMntQCLcB/s1600/image09.png"><img border="0" height="480" src="https://1.bp.blogspot.com/-a9S_rpU77wk/V-vwjxMX96I/AAAAAAAABRc/x8OI8UrD0tgyjSfNc49cwq2PLI1ooMntQCLcB/s640/image09.png" width="640"></a></td></tr><tr><td>The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.</td></tr></tbody></table>To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M&#8217;s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing.  
To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level <a href="https://en.wikipedia.org/wiki/Feature_(machine_learning)">features</a> using a state-of-the-art deep learning model--the publicly available <a href="https://www.tensorflow.org/versions/r0.9/tutorials/image_recognition/index.html">Inception-V3 image annotation model</a> trained on ImageNet. These features are extracted at 1 frame-per-second temporal resolution, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB).  This makes it possible to download this dataset and train a baseline <a href="https://www.tensorflow.org/">TensorFlow</a> model at full scale on a single GPU in less than a day!<br /><br />We believe this dataset can significantly accelerate research on video understanding as it enables researchers and students without access to big data or big machines to do their research at previously unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning and domain adaptation.  In fact, we show that pre-training models on this dataset and applying / fine-tuning on other external datasets leads to state of the art performance on them (e.g.&#160;<a href="http://activity-net.org/">ActivityNet,</a> <a href="https://github.com/gtoderici/sports-1m-dataset">Sports-1M</a>). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our <a href="http://arxiv.org/abs/1609.08675">technical report</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Sudheendra Vijayanarasimhan and Paul Natsev, Software Engineers</span><br /><br />Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as <a href="http://www.image-net.org/">ImageNet</a>, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on <a href="http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html">detecting and classifying objects in static images</a>.<br /><br /><a href="https://research.googleblog.com/2015/04/beyond-short-snippets-deep-networks-for.html">Video analysis</a> provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world.  Improving video understanding can lead to better video search and discovery, similarly to how image understanding <a href="https://googleblog.blogspot.com/2015/05/picture-this-fresh-approach-to-photos.html">helped re-imagine the photos experience</a>. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets. <br /><br />Today, we are excited to announce the release of <a href="https://research.google.com/youtube8m">YouTube-8M</a>, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 <a href="https://developers.google.com/knowledge-graph/">Knowledge Graph</a> entities.  This represents a significant increase in scale and diversity compared to existing video datasets. 
For example, <a href="https://github.com/gtoderici/sports-1m-dataset">Sports-1M</a>, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes--YouTube-8M represents nearly an <i>order of magnitude increase</i> in both number of videos <i>and</i> classes.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-7TeQwPP34iE/V-vnyZPp0dI/AAAAAAAABRA/uH19ST1iJdITzX73A8Uu5HMRjTrLvWc-QCEw/s1600/image08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="74" src="https://2.bp.blogspot.com/-7TeQwPP34iE/V-vnyZPp0dI/AAAAAAAABRA/uH19ST1iJdITzX73A8Uu5HMRjTrLvWc-QCEw/s640/image08.png" width="640" /></a></div>In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos.  While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes. <br /><div class="separator" style="clear: both; text-align: center;"><iframe height="400px" src="https://research.google.com/youtube8m/tagcloud.html" width="100%"> </iframe><br /></div>To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities, which are visually observable and sufficiently frequent. 
The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our <a href="http://arxiv.org/abs/1609.08675">technical report</a>). The figures below depict the <a href="https://research.google.com/youtube8m/explore.html">dataset browser</a> and the distribution of videos along the top-level verticals, and illustrate the dataset’s scale and diversity.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-nyrn6uJpjtk/V-vwHkrY5JI/AAAAAAAABRU/uqjVCmFGq_gSfVRlEeyHRkcirHwE3BNfQCLcB/s1600/image07.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="428" src="https://4.bp.blogspot.com/-nyrn6uJpjtk/V-vwHkrY5JI/AAAAAAAABRU/uqjVCmFGq_gSfVRlEeyHRkcirHwE3BNfQCLcB/s640/image07.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A <a href="https://research.google.com/youtube8m/explore.html">dataset explorer </a>allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. 
This screenshot depicts a subset of dataset videos annotated with the entity “Guitar”.</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-a9S_rpU77wk/V-vwjxMX96I/AAAAAAAABRc/x8OI8UrD0tgyjSfNc49cwq2PLI1ooMntQCLcB/s1600/image09.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="480" src="https://1.bp.blogspot.com/-a9S_rpU77wk/V-vwjxMX96I/AAAAAAAABRc/x8OI8UrD0tgyjSfNc49cwq2PLI1ooMntQCLcB/s640/image09.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.</td></tr></tbody></table>To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M’s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing.  To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level <a href="https://en.wikipedia.org/wiki/Feature_(machine_learning)">features</a> using a state-of-the-art deep learning model--the publicly available <a href="https://www.tensorflow.org/versions/r0.9/tutorials/image_recognition/index.html">Inception-V3 image annotation model</a> trained on ImageNet. These features are extracted at 1 frame-per-second temporal resolution, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB).  
This makes it possible to download this dataset and train a baseline <a href="https://www.tensorflow.org/">TensorFlow</a> model at full scale on a single GPU in less than a day!<br /><br />We believe this dataset can significantly accelerate research on video understanding as it enables researchers and students without access to big data or big machines to do their research at previously unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning and domain adaptation.  In fact, we show that pre-training models on this dataset and applying / fine-tuning on other external datasets leads to state of the art performance on them (e.g.&nbsp;<a href="http://activity-net.org/">ActivityNet,</a> <a href="https://github.com/gtoderici/sports-1m-dataset">Sports-1M</a>). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our <a href="http://arxiv.org/abs/1609.08675">technical report</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/youtube/announcing-youtube-8m-a-large-and-diverse-labeled-video-dataset-for-video-understanding-research/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
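The pre-processing baseline described above (frame-level features aggregated to a video-level representation) can be sketched as follows. The arrays here are random stand-ins for the released features (assumed to be 1024-dimensional, one vector per second of video), and the 5-class sigmoid head is a toy substitute for a learned classifier over the dataset's 4800 entities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the released features: one [num_frames, 1024]
# array per video, one feature vector per second of video.
videos = [rng.normal(size=(int(rng.integers(30, 120)), 1024)) for _ in range(8)]

def video_level_features(frames):
    """Average-pool frame-level features into a single video-level
    vector -- a simple baseline aggregation."""
    return frames.mean(axis=0)

X = np.stack([video_level_features(v) for v in videos])

# Toy multi-label head: one sigmoid per entity (5 classes here instead
# of the dataset's 4800, with untrained random weights).
W = rng.normal(scale=0.01, size=(1024, 5))
probs = 1.0 / (1.0 + np.exp(-(X @ W)))
```

Because each video collapses to a single fixed-size vector, even a large corpus fits in modest memory, which is what makes single-GPU baseline training feasible at this scale.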
		<item>
		<title>A Neural Network for Machine Translation, at Production Scale</title>
		<link>https://googledata.org/google-translate/a-neural-network-for-machine-translation-at-production-scale/</link>
		<comments>https://googledata.org/google-translate/a-neural-network-for-machine-translation-at-production-scale/#comments</comments>
		<pubDate>Tue, 27 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[Google Translate]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=29cafe99ba6a5d7c27b9cb9b57a8c2a4</guid>
		<description><![CDATA[<span>Posted by Quoc V. Le &#38; Mike Schuster, Research Scientists, Google Brain Team</span><br /><br />Ten years ago, we announced the <a href="https://research.googleblog.com/2006/04/statistical-machine-translation-live.html">launch of Google Translate</a>, together with the use of <a href="https://en.wikipedia.org/wiki/Statistical_machine_translation#Phrase-based_translation">Phrase-Based Machine Translation</a> as the key algorithm behind this service. Since then, rapid advances in machine intelligence have improved our <a href="https://research.googleblog.com/2012/08/speech-recognition-and-deep-learning.html">speech recognition</a> and <a href="https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html">image recognition</a> capabilities, but improving machine translation remains a challenging goal.<br /><br />Today we announce the Google Neural Machine Translation system (GNMT), which utilizes state-of-the-art training techniques to achieve the largest improvements to date for machine translation quality. Our full research results are described in a new technical report we are releasing today: &#8220;<i><a href="http://arxiv.org/abs/1609.08144">Google&#8217;s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation</a></i>&#8221; [1]. <br /><br />A few years ago we started using <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a> (RNNs) to directly learn the mapping between an input sequence (e.g. a sentence in one language) to an output sequence (that same sentence in another language) [2]. 
Whereas Phrase-Based Machine Translation (PBMT) breaks an input sentence into words and phrases to be translated largely independently, Neural Machine Translation (NMT) considers the entire input sentence as a unit for translation. The advantage of this approach is that it requires fewer engineering design choices than previous Phrase-Based translation systems. When it first came out, NMT showed accuracy comparable to existing Phrase-Based translation systems on modest-sized public benchmark data sets.<br /><br />Since then, researchers have proposed many techniques to improve NMT, including work on handling rare words by mimicking an external alignment model [3], using attention to align input words and output words [4] and breaking words into smaller units to cope with rare words [5,6]. Despite these improvements, NMT wasn't fast or accurate enough to be used in a production system, such as Google Translate. Our new paper [1] describes how we overcame the many challenges to make NMT work on very large data sets and built a system that is fast and accurate enough to provide better translations for Google&#8217;s users and services.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-jOLa-LdidQU/V-qV2oJn1aI/AAAAAAAABPg/-6OhKKPhxT89Vs9HhyKMEnyG_0ncWGjJQCLcB/s1600/image00.png"><img border="0" height="370" src="https://1.bp.blogspot.com/-jOLa-LdidQU/V-qV2oJn1aI/AAAAAAAABPg/-6OhKKPhxT89Vs9HhyKMEnyG_0ncWGjJQCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Data from side-by-side evaluations, where human raters compare the quality of translations for a given source sentence. Scores range from 0 to 6, with 0 meaning &#8220;completely nonsense translation&#8221;, and 6 meaning &#8220;perfect translation."</td></tr></tbody></table>The following visualization shows the progression of GNMT as it translates a Chinese sentence to English. 
First, the network encodes the Chinese words as a list of vectors, where each vector represents the meaning of all words read so far (&#8220;Encoder&#8221;). Once the entire sentence is read, the decoder begins, generating the English sentence one word at a time (&#8220;Decoder&#8221;). To generate the translated word at each step, the decoder pays attention to a weighted distribution over the encoded Chinese vectors most relevant to generate the English word (&#8220;Attention&#8221;; the blue link transparency represents how much the decoder pays attention to an encoded word).<br /><div><a href="https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s1600/nmt-model-fast.gif"><img border="0" height="324" src="https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s640/nmt-model-fast.gif" width="640"></a></div>Using human-rated side-by-side comparison as a metric, the GNMT system produces translations that are vastly improved compared to the previous phrase-based production system. GNMT reduces translation errors by more than 55%-85% on several major language pairs measured on sampled sentences from Wikipedia and news websites with the help of bilingual human raters.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-TAEq5oc14jQ/V-qWTeqaA7I/AAAAAAAABPo/IEmOBO6x7nIkzLqomgk_DwVtzvpEtJF1QCLcB/s1600/img3.png"><img border="0" height="168" src="https://1.bp.blogspot.com/-TAEq5oc14jQ/V-qWTeqaA7I/AAAAAAAABPo/IEmOBO6x7nIkzLqomgk_DwVtzvpEtJF1QCLcB/s640/img3.png" width="640"></a></td></tr><tr><td>An example of a translation produced by our system for an input sentence sampled from a news site. 
Go <a href="https://drive.google.com/file/d/0B4-Ig7UAZe3BSUYweVo3eVhNY3c/view?usp=sharing">here</a> for more examples of translations for input sentences sampled randomly from news sites and books.</td></tr></tbody></table>In addition to releasing this research paper today, we are announcing the launch of GNMT in production on a notoriously difficult language pair: Chinese to English. The Google Translate mobile and web apps are now using GNMT for 100% of machine translations from Chinese to English&#8212;about 18 million translations per day. The production deployment of GNMT was made possible by use of our publicly available machine learning toolkit <a href="https://www.tensorflow.org/">TensorFlow</a> and our <a href="https://en.wikipedia.org/wiki/Tensor_processing_unit">Tensor Processing Units</a> (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting the stringent latency requirements of the Google Translate product. Translating from Chinese to English is one of the more than 10,000 language pairs supported by Google Translate, and we will be working to roll out GNMT to many more of these over the coming months. <br /><br />Machine translation is by no means solved. GNMT can still make significant errors that a human translator would never make, like dropping words and mistranslating proper names or rare terms, and translating sentences in isolation rather than considering the context of the paragraph or page. There is still a lot of work we can do to serve our users better. However, GNMT represents a significant milestone. We would like to celebrate it with the many researchers and engineers&#8212;both within Google and the wider community&#8212;who have contributed to this direction of research in the past few years. 
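The attention step described above can be illustrated with a toy dot-product attention computation. This is a hand-rolled sketch with made-up dimensions, not GNMT's learned attention network, which is trained end-to-end and considerably more elaborate:

```python
import numpy as np

def toy_attention(decoder_state, encoder_states):
    # Score each encoded source vector against the current decoder state,
    # turn the scores into a probability distribution (softmax), and
    # blend the source vectors into a single context vector that the
    # decoder can use to generate the next target word.
    scores = encoder_states @ decoder_state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ encoder_states
    return weights, context

rng = np.random.default_rng(42)
encoded = rng.normal(size=(5, 8))  # 5 source words, 8-d encodings (illustrative)
state = rng.normal(size=8)
weights, context = toy_attention(state, encoded)
# weights sums to 1; larger entries mark the source words being "attended" to,
# which is what the blue link transparency visualizes above.
```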
<br /><br /><b>Acknowledgements:</b><br />We thank members of the <a href="http://g.co/brain">Google Brain team</a> and the <a href="https://translate.google.com/">Google Translate team</a> for their help with the project. We thank Nikhil Thorat and the <a href="https://research.google.com/bigpicture/">Big Picture team</a> for the visualization.<br /><br /><b>References:</b><br />[1] <a href="http://arxiv.org/abs/1609.08144">Google&#8217;s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation</a>, <i>Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, &#321;ukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016.</i><br />[2] <a href="https://arxiv.org/abs/1409.3215">Sequence to Sequence Learning with Neural Networks</a>,<i> Ilya Sutskever, Oriol Vinyals, Quoc V. Le.  Advances in Neural Information Processing Systems, 2014.</i><br />[3] <a href="https://arxiv.org/abs/1410.8206">Addressing the rare word problem in neural machine translation</a>, <i>Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.</i><br />[4] <a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate</a>, <i>Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. International Conference on Learning Representations, 2015.</i><br />[5] <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf">Japanese and Korean voice search</a>, <i>Mike Schuster, and Kaisuke Nakajima. 
IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.</i><br />[6] <a href="http://arxiv.org/abs/1508.07909">Neural Machine Translation of Rare Words with Subword Units</a>, <i>Rico Sennrich, Barry Haddow, Alexandra Birch. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.</i><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Quoc V. Le &amp; Mike Schuster, Research Scientists, Google Brain Team</span><br /><br />Ten years ago, we announced the <a href="https://research.googleblog.com/2006/04/statistical-machine-translation-live.html">launch of Google Translate</a>, together with the use of <a href="https://en.wikipedia.org/wiki/Statistical_machine_translation#Phrase-based_translation">Phrase-Based Machine Translation</a> as the key algorithm behind this service. Since then, rapid advances in machine intelligence have improved our <a href="https://research.googleblog.com/2012/08/speech-recognition-and-deep-learning.html">speech recognition</a> and <a href="https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html">image recognition</a> capabilities, but improving machine translation remains a challenging goal.<br /><br />Today we announce the Google Neural Machine Translation system (GNMT), which utilizes state-of-the-art training techniques to achieve the largest improvements to date for machine translation quality. Our full research results are described in a new technical report we are releasing today: “<i><a href="http://arxiv.org/abs/1609.08144">Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation</a></i>” [1]. <br /><br />A few years ago we started using <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks</a> (RNNs) to directly learn the mapping between an input sequence (e.g. a sentence in one language) to an output sequence (that same sentence in another language) [2]. 
Whereas Phrase-Based Machine Translation (PBMT) breaks an input sentence into words and phrases to be translated largely independently, Neural Machine Translation (NMT) considers the entire input sentence as a unit for translation. The advantage of this approach is that it requires fewer engineering design choices than previous Phrase-Based translation systems. When it first came out, NMT showed accuracy comparable to existing Phrase-Based translation systems on modest-sized public benchmark data sets.<br /><br />Since then, researchers have proposed many techniques to improve NMT, including work on handling rare words by mimicking an external alignment model [3], using attention to align input words and output words [4] and breaking words into smaller units to cope with rare words [5,6]. Despite these improvements, NMT wasn't fast or accurate enough to be used in a production system, such as Google Translate. Our new paper [1] describes how we overcame the many challenges to make NMT work on very large data sets and built a system that is fast and accurate enough to provide better translations for Google’s users and services.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-jOLa-LdidQU/V-qV2oJn1aI/AAAAAAAABPg/-6OhKKPhxT89Vs9HhyKMEnyG_0ncWGjJQCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="370" src="https://1.bp.blogspot.com/-jOLa-LdidQU/V-qV2oJn1aI/AAAAAAAABPg/-6OhKKPhxT89Vs9HhyKMEnyG_0ncWGjJQCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Data from side-by-side evaluations, where human raters compare the quality of translations for a given source sentence. 
Scores range from 0 to 6, with 0 meaning “completely nonsense translation”, and 6 meaning “perfect translation."</td></tr></tbody></table>The following visualization shows the progression of GNMT as it translates a Chinese sentence to English. First, the network encodes the Chinese words as a list of vectors, where each vector represents the meaning of all words read so far (“Encoder”). Once the entire sentence is read, the decoder begins, generating the English sentence one word at a time (“Decoder”). To generate the translated word at each step, the decoder pays attention to a weighted distribution over the encoded Chinese vectors most relevant to generate the English word (“Attention”; the blue link transparency represents how much the decoder pays attention to an encoded word).<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s1600/nmt-model-fast.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="324" src="https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s640/nmt-model-fast.gif" width="640" /></a></div>Using human-rated side-by-side comparison as a metric, the GNMT system produces translations that are vastly improved compared to the previous phrase-based production system. 
GNMT reduces translation errors by more than 55%-85% on several major language pairs measured on sampled sentences from Wikipedia and news websites with the help of bilingual human raters.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-TAEq5oc14jQ/V-qWTeqaA7I/AAAAAAAABPo/IEmOBO6x7nIkzLqomgk_DwVtzvpEtJF1QCLcB/s1600/img3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="168" src="https://1.bp.blogspot.com/-TAEq5oc14jQ/V-qWTeqaA7I/AAAAAAAABPo/IEmOBO6x7nIkzLqomgk_DwVtzvpEtJF1QCLcB/s640/img3.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">An example of a translation produced by our system for an input sentence sampled from a news site. Go <a href="https://drive.google.com/file/d/0B4-Ig7UAZe3BSUYweVo3eVhNY3c/view?usp=sharing">here</a> for more examples of translations for input sentences sampled randomly from news sites and books.</td></tr></tbody></table>In addition to releasing this research paper today, we are announcing the launch of GNMT in production on a notoriously difficult language pair: Chinese to English. The Google Translate mobile and web apps are now using GNMT for 100% of machine translations from Chinese to English—about 18 million translations per day. The production deployment of GNMT was made possible by use of our publicly available machine learning toolkit <a href="https://www.tensorflow.org/">TensorFlow</a> and our <a href="https://en.wikipedia.org/wiki/Tensor_processing_unit">Tensor Processing Units</a> (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting the stringent latency requirements of the Google Translate product. 
Translating from Chinese to English is one of the more than 10,000 language pairs supported by Google Translate, and we will be working to roll out GNMT to many more of these over the coming months. <br /><br />Machine translation is by no means solved. GNMT can still make significant errors that a human translator would never make, like dropping words and mistranslating proper names or rare terms, and translating sentences in isolation rather than considering the context of the paragraph or page. There is still a lot of work we can do to serve our users better. However, GNMT represents a significant milestone. We would like to celebrate it with the many researchers and engineers—both within Google and the wider community—who have contributed to this direction of research in the past few years. <br /><br /><b>Acknowledgements:</b><br />We thank members of the <a href="http://g.co/brain">Google Brain team</a> and the <a href="https://translate.google.com/">Google Translate team</a> for their help with the project. We thank Nikhil Thorat and the <a href="https://research.google.com/bigpicture/">Big Picture team</a> for the visualization.<br /><br /><b>References:</b><br />[1] <a href="http://arxiv.org/abs/1609.08144">Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation</a>, <i>Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016.</i><br />[2] <a href="https://arxiv.org/abs/1409.3215">Sequence to Sequence Learning with Neural Networks</a>,<i> Ilya Sutskever, Oriol Vinyals, Quoc V. Le.  
Advances in Neural Information Processing Systems, 2014.</i><br />[3] <a href="https://arxiv.org/abs/1410.8206">Addressing the rare word problem in neural machine translation</a>, <i>Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.</i><br />[4] <a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate</a>, <i>Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. International Conference on Learning Representations, 2015.</i><br />[5] <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf">Japanese and Korean voice search</a>, <i>Mike Schuster, and Kaisuke Nakajima. IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.</i><br />[6] <a href="http://arxiv.org/abs/1508.07909">Neural Machine Translation of Rare Words with Subword Units</a>, <i>Rico Sennrich, Barry Haddow, Alexandra Birch. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.</i><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-translate/a-neural-network-for-machine-translation-at-production-scale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Show and Tell: image captioning open sourced in TensorFlow</title>
		<link>https://googledata.org/google-research/show-and-tell-image-captioning-open-sourced-in-tensorflow/</link>
		<comments>https://googledata.org/google-research/show-and-tell-image-captioning-open-sourced-in-tensorflow/#comments</comments>
		<pubDate>Thu, 22 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=0204f8dea2e46da552cc7d2f1611b7a8</guid>
		<description><![CDATA[<span>Posted by Chris Shallue, Software Engineer, Google Brain Team</span><br /><br />In 2014, research scientists on the <a href="http://g.co/brain">Google Brain team</a> trained a <a href="https://research.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html">machine learning system to automatically produce captions that accurately describe images</a>. Further development of that system led to its success in the <a href="http://mscoco.org/dataset/#captions-leaderboard">Microsoft COCO 2015 image captioning challenge</a>, a competition to compare the best algorithms for computing accurate image captions, where it tied for first place.<br /><br />Today, we&#8217;re making the latest version of our image captioning system <a href="https://github.com/tensorflow/models/tree/master/im2txt">available as an open source model</a> in <a href="https://www.tensorflow.org/">TensorFlow</a>. This release contains significant improvements to the computer vision component of the captioning system, is much faster to train, and produces more detailed and accurate descriptions compared to the original system. 
These improvements are outlined and analyzed in the paper <i><a href="http://arxiv.org/abs/1609.06647">Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge</a></i>, published in <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=34">IEEE Transactions on Pattern Analysis and Machine Intelligence</a>.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-0QMvTn0tWAY/V-L8xK8LxsI/AAAAAAAABO0/pLVC06wzDE4rNFhL1iTTVaU4iFR7VUZHwCLcB/s1600/Caption4.png"><img border="0" height="400" src="https://2.bp.blogspot.com/-0QMvTn0tWAY/V-L8xK8LxsI/AAAAAAAABO0/pLVC06wzDE4rNFhL1iTTVaU4iFR7VUZHwCLcB/s400/Caption4.png" width="373"></a></td></tr><tr><td>Automatically captioned by our system.</td></tr></tbody></table><b>So what&#8217;s new?</b><br /><br />Our 2014 system used the <a href="https://arxiv.org/abs/1409.4842">Inception V1</a> image classification model to initialize the image encoder, which produces the encodings that are useful for recognizing different objects in the images. This was the best image model available at the time, achieving 89.6% top-5 accuracy on the benchmark ImageNet 2012 image classification task. We replaced this in 2015 with the newer <a href="http://arxiv.org/abs/1502.03167">Inception V2</a> image classification model, which achieves 91.8% accuracy on the same task. The improved vision component gave our captioning system an accuracy boost of 2 points in the BLEU-4 metric (which is commonly used in machine translation to evaluate the quality of generated sentences) and was an important factor in its success in the captioning challenge.<br /><br />Today&#8217;s code release initializes the image encoder using the&#160;<a href="https://arxiv.org/abs/1512.00567">Inception V3</a> model, which achieves 93.9% accuracy on the ImageNet classification task. 
Initializing the image encoder with a better vision model gives the image captioning system a better ability to recognize different objects in the images, allowing it to generate more detailed and accurate descriptions. This gives an additional 2 points of improvement in the BLEU-4 metric over the system used in the captioning challenge.<br /><br />Another key improvement to the vision component comes from <i>fine-tuning</i> the image model. This step addresses the problem that the image encoder is initialized by a model trained to <i>classify</i> objects in images, whereas the goal of the captioning system is to <i>describe</i> the objects in images using the encodings produced by the image model. For example, an image classification model will tell you that a dog, grass and a frisbee are in the image, but a natural description should also tell you the color of the grass and how the dog relates to the frisbee.<br /><br />In the fine-tuning phase, the captioning system is improved by jointly training its vision and language components on human-generated captions. This allows the captioning system to transfer information from the image that is specifically useful for generating descriptive captions, but which was not necessary for classifying objects. In particular, after fine-tuning it becomes better at correctly describing the colors of objects. Importantly, the fine-tuning phase must occur after the language component has already learned to generate captions; otherwise, the noisiness of the randomly initialized language component causes irreversible corruption to the vision component. 
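The ordering constraint above matters enough to spell out. The sketch below uses hypothetical names and step counts, and models only which component is trainable at each step, not the actual TensorFlow training loop:

```python
def two_phase_schedule(caption_only_steps, finetune_steps):
    # Phase 1: the pretrained image encoder stays frozen while the randomly
    # initialized language component learns to generate captions.
    # Phase 2: both components train jointly, so the encoder can pick up
    # caption-specific detail (e.g. object colors) without being corrupted
    # by early, noisy gradients from the untrained language component.
    schedule = []
    for step in range(caption_only_steps + finetune_steps):
        encoder_trainable = step >= caption_only_steps
        schedule.append("joint-finetune" if encoder_trainable else "language-only")
    return schedule

schedule = two_phase_schedule(caption_only_steps=3, finetune_steps=2)
```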
For more details, read the full paper <a href="http://arxiv.org/abs/1609.06647">here</a>.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-NZnzMnEzW28/V-L899rK04I/AAAAAAAABO4/sPFHtUnKbi0Oxt3DwMjJXaji6-5VST9NgCLcB/s1600/Caption1.png"><img border="0" height="302" src="https://4.bp.blogspot.com/-NZnzMnEzW28/V-L899rK04I/AAAAAAAABO4/sPFHtUnKbi0Oxt3DwMjJXaji6-5VST9NgCLcB/s640/Caption1.png" width="640"></a></td></tr><tr><td><b>Left:</b> the better image model allows the captioning model to generate more detailed and accurate descriptions.<b> Right:</b> after fine-tuning the image model, the image captioning system is more likely to describe the colors of objects correctly.</td></tr></tbody></table>Until recently our image captioning system was implemented in the <a href="http://research.google.com/pubs/pub40565.html">DistBelief software framework</a>. The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief on an Nvidia K20 GPU, meaning that total training time is just 25% of the time previously required.<br /><br />A natural question is whether our captioning system can generate novel descriptions of previously unseen contexts and interactions. The system is trained by showing it hundreds of thousands of images that were captioned manually by humans, and it often re-uses human captions when presented with scenes similar to what it&#8217;s seen before. 
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-EzSERqpzCtw/V-L9a_j2kjI/AAAAAAAABPA/BUrcQ6N7R2knUQBEXOy1nM79uxfx3PmlgCLcB/s1600/Caption2b.png"><img border="0" height="416" src="https://2.bp.blogspot.com/-EzSERqpzCtw/V-L9a_j2kjI/AAAAAAAABPA/BUrcQ6N7R2knUQBEXOy1nM79uxfx3PmlgCLcB/s640/Caption2b.png" width="640"></a></td></tr><tr><td>When the model is presented with scenes similar to what it&#8217;s seen before, it will often re-use human generated captions.</td></tr></tbody></table>So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data? Excitingly, our model <i>does indeed</i> develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images. Moreover, it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-gmCkHgRKbp0/V-L9sTZuevI/AAAAAAAABPE/0Im_E8pWqGoS3eC90DpYHS17vuFgLBh5gCLcB/s1600/Caption3c.png"><img border="0" height="416" src="https://1.bp.blogspot.com/-gmCkHgRKbp0/V-L9sTZuevI/AAAAAAAABPE/0Im_E8pWqGoS3eC90DpYHS17vuFgLBh5gCLcB/s640/Caption3c.png" width="640"></a></td></tr><tr><td>Our model generates a completely new caption using concepts learned from similar scenes in the training set.</td></tr></tbody></table>We hope that sharing this model in TensorFlow will help push forward image captioning research and applications, and will also allow interested people to learn and have fun. 
To get started training your own image captioning system, and for more details on the neural network architecture, navigate to the model&#8217;s home page <a href="https://github.com/tensorflow/models/tree/master/im2txt">here</a>. While our system uses the Inception V3 image classification model, you could even try training our system with the <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">recently released Inception-ResNet-v2 model</a> to see if it can do even better!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Shallue, Software Engineer, Google Brain Team</span><br /><br />In 2014, research scientists on the <a href="http://g.co/brain">Google Brain team</a> trained a <a href="https://research.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html">machine learning system to automatically produce captions that accurately describe images</a>. Further development of that system led to its success in the <a href="http://mscoco.org/dataset/#captions-leaderboard">Microsoft COCO 2015 image captioning challenge</a>, a competition to compare the best algorithms for computing accurate image captions, where it tied for first place.<br /><br />Today, we’re making the latest version of our image captioning system <a href="https://github.com/tensorflow/models/tree/master/im2txt">available as an open source model</a> in <a href="https://www.tensorflow.org/">TensorFlow</a>. This release contains significant improvements to the computer vision component of the captioning system, is much faster to train, and produces more detailed and accurate descriptions compared to the original system. 
These improvements are outlined and analyzed in the paper <i><a href="http://arxiv.org/abs/1609.06647">Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge</a></i>, published in <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=34">IEEE Transactions on Pattern Analysis and Machine Intelligence</a>.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-0QMvTn0tWAY/V-L8xK8LxsI/AAAAAAAABO0/pLVC06wzDE4rNFhL1iTTVaU4iFR7VUZHwCLcB/s1600/Caption4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://2.bp.blogspot.com/-0QMvTn0tWAY/V-L8xK8LxsI/AAAAAAAABO0/pLVC06wzDE4rNFhL1iTTVaU4iFR7VUZHwCLcB/s400/Caption4.png" width="373" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Automatically captioned by our system.</td></tr></tbody></table><b>So what’s new?</b><br /><br />Our 2014 system used the <a href="https://arxiv.org/abs/1409.4842">Inception V1</a> image classification model to initialize the image encoder, which produces the encodings that are useful for recognizing different objects in the images. This was the best image model available at the time, achieving 89.6% top-5 accuracy on the benchmark ImageNet 2012 image classification task. We replaced this in 2015 with the newer <a href="http://arxiv.org/abs/1502.03167">Inception V2</a> image classification model, which achieves 91.8% accuracy on the same task. 
The improved vision component gave our captioning system an accuracy boost of 2 points in the BLEU-4 metric (which is commonly used in machine translation to evaluate the quality of generated sentences) and was an important factor in its success in the captioning challenge.<br /><br />Today’s code release initializes the image encoder using the&nbsp;<a href="https://arxiv.org/abs/1512.00567">Inception V3</a> model, which achieves 93.9% accuracy on the ImageNet classification task. Initializing the image encoder with a better vision model gives the image captioning system a better ability to recognize different objects in the images, allowing it to generate more detailed and accurate descriptions. This gives an additional 2 points of improvement in the BLEU-4 metric over the system used in the captioning challenge.<br /><br />Another key improvement to the vision component comes from <i>fine-tuning</i> the image model. This step addresses the problem that the image encoder is initialized by a model trained to <i>classify</i> objects in images, whereas the goal of the captioning system is to <i>describe</i> the objects in images using the encodings produced by the image model. For example, an image classification model will tell you that a dog, grass and a frisbee are in the image, but a natural description should also tell you the color of the grass and how the dog relates to the frisbee.<br /><br />In the fine-tuning phase, the captioning system is improved by jointly training its vision and language components on human-generated captions. This allows the captioning system to transfer information from the image that is specifically useful for generating descriptive captions, but which was not necessary for classifying objects. In particular, after fine-tuning it becomes better at correctly describing the colors of objects. 
Importantly, the fine-tuning phase must occur after the language component has already learned to generate captions; otherwise, the noisiness of the randomly initialized language component causes irreversible corruption to the vision component. For more details, read the full paper <a href="http://arxiv.org/abs/1609.06647">here</a>.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-NZnzMnEzW28/V-L899rK04I/AAAAAAAABO4/sPFHtUnKbi0Oxt3DwMjJXaji6-5VST9NgCLcB/s1600/Caption1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="302" src="https://4.bp.blogspot.com/-NZnzMnEzW28/V-L899rK04I/AAAAAAAABO4/sPFHtUnKbi0Oxt3DwMjJXaji6-5VST9NgCLcB/s640/Caption1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>Left:</b> the better image model allows the captioning model to generate more detailed and accurate descriptions.<b> Right:</b> after fine-tuning the image model, the image captioning system is more likely to describe the colors of objects correctly.</td></tr></tbody></table>Until recently, our image captioning system was implemented in the <a href="http://research.google.com/pubs/pub40565.html">DistBelief software framework</a>. The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief on an Nvidia K20 GPU, meaning that total training time is just 25% of the time previously required.<br /><br />A natural question is whether our captioning system can generate novel descriptions of previously unseen contexts and interactions. 
The system is trained by showing it hundreds of thousands of images that were captioned manually by humans, and it often re-uses human captions when presented with scenes similar to what it’s seen before. <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-EzSERqpzCtw/V-L9a_j2kjI/AAAAAAAABPA/BUrcQ6N7R2knUQBEXOy1nM79uxfx3PmlgCLcB/s1600/Caption2b.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="416" src="https://2.bp.blogspot.com/-EzSERqpzCtw/V-L9a_j2kjI/AAAAAAAABPA/BUrcQ6N7R2knUQBEXOy1nM79uxfx3PmlgCLcB/s640/Caption2b.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">When the model is presented with scenes similar to what it’s seen before, it will often re-use human generated captions.</td></tr></tbody></table>So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data? Excitingly, our model <i>does indeed</i> develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images. 
Moreover, it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-gmCkHgRKbp0/V-L9sTZuevI/AAAAAAAABPE/0Im_E8pWqGoS3eC90DpYHS17vuFgLBh5gCLcB/s1600/Caption3c.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="416" src="https://1.bp.blogspot.com/-gmCkHgRKbp0/V-L9sTZuevI/AAAAAAAABPE/0Im_E8pWqGoS3eC90DpYHS17vuFgLBh5gCLcB/s640/Caption3c.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Our model generates a completely new caption using concepts learned from similar scenes in the training set.</td></tr></tbody></table>We hope that sharing this model in TensorFlow will help push forward image captioning research and applications, and will also allow interested people to learn and have fun. To get started training your own image captioning system, and for more details on the neural network architecture, navigate to the model’s home-page <a href="https://github.com/tensorflow/models/tree/master/im2txt">here</a>. While our system uses the Inception V3 image classification model, you could even try training our system with the <a href="https://research.googleblog.com/2016/08/improving-inception-and-image.html">recently released Inception-ResNet-v2 model</a> to see if it can do even better!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/show-and-tell-image-captioning-open-sourced-in-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 280-Year-Old Algorithm Inside Google Trips</title>
		<link>https://googledata.org/google-research/the-280-year-old-algorithm-inside-google-trips/</link>
		<comments>https://googledata.org/google-research/the-280-year-old-algorithm-inside-google-trips/#comments</comments>
		<pubDate>Tue, 20 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9eccfaba32ed8b467072047b82083021</guid>
		<description><![CDATA[<span>Posted by Bogdan Arsintescu, Software Engineer &#38; Sreenivas Gollapudi, Kostas Kollias, Tamas Sarlos and Andrew Tomkins, Research Scientists<br /></span><br /><br /><a href="https://en.wikipedia.org/wiki/Algorithm_engineering">Algorithms Engineering</a> is a lot of fun because algorithms do not go out of fashion: one never knows when an oldie-but-goodie might come in handy.  Case in point: Yesterday, Google <a href="https://googleblog.blogspot.com/2016/09/see-more-plan-less-try-google-trips.html">announced Google Trips</a>, a new app to assist you in your travels  by helping you create your own &#8220;perfect day&#8221; in a city.  Surprisingly, deep inside Google Trips, there is an algorithm that was invented 280 years ago. <br /><br />In 1736, <a href="https://en.wikipedia.org/wiki/Leonhard_Euler">Leonhard Euler</a> authored a brief but <a href="http://eulerarchive.maa.org//docs/originals/E053.pdf">beautiful mathematical paper</a> regarding the town of K&#246;nigsberg and its 7 bridges, shown here:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-HRkkgmmGB3Y/V-Fk43DGYkI/AAAAAAAABNw/j5c6gQMUsjAWtMWwMBnQ-D35sA8l0-McQCLcB/s1600/image05.png"><img border="0" height="504" src="https://4.bp.blogspot.com/-HRkkgmmGB3Y/V-Fk43DGYkI/AAAAAAAABNw/j5c6gQMUsjAWtMWwMBnQ-D35sA8l0-McQCLcB/s640/image05.png" width="640"></a></td></tr><tr><td>Image from <a href="https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg">Wikipedia</a></td></tr></tbody></table>In the paper, Euler studied the following question: is it possible to walk through the city crossing each bridge exactly once? As it turns out, for the city of K&#246;nigsberg, the answer is no. 
To reach this answer, Euler developed a general approach to represent any layout of landmasses and bridges in terms of what he dubbed the <i>Geometriam Situs</i> (the &#8220;Geometry of Place&#8221;), which we now call <a href="https://en.wikipedia.org/wiki/Graph_theory">Graph Theory</a>.  He represented each landmass as a &#8220;node&#8221; in the graph, and each bridge as an &#8220;edge,&#8221; like this:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-wRl8WjuCA-c/V-FkPmiZK-I/AAAAAAAABNs/Z9htxAzYeyg_C44uNKdjCYVLoQqaaQHuwCLcB/s1600/Screen%2BShot%2B2016-09-20%2Bat%2B9.26.46%2BAM.png"><img border="0" height="146" src="https://1.bp.blogspot.com/-wRl8WjuCA-c/V-FkPmiZK-I/AAAAAAAABNs/Z9htxAzYeyg_C44uNKdjCYVLoQqaaQHuwCLcB/s640/Screen%2BShot%2B2016-09-20%2Bat%2B9.26.46%2BAM.png" width="640"></a></td></tr><tr><td>Image from <a href="https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg">Wikipedia</a></td></tr></tbody></table>Euler noticed that if all the nodes in the graph have an even number of edges (such graphs are called &#8220;Eulerian&#8221; in his honor) then, and only then, a cycle can be found that visits every edge exactly once.  Keep this in mind, as we&#8217;ll rely on this fact later in the post.<br /><br />Our team in Google Research has been fascinated by the &#8220;Geometry of Place&#8221; for some time, and we started investigating a question related to Euler&#8217;s:  rather than visiting just the bridges, how can we visit as many interesting places as possible during a particular trip?  We call this the &#8220;itineraries&#8221; problem.  
Euler didn&#8217;t study it, but it is a well known topic in Optimization, where it is often called the &#8220;<a href="http://chekuri.cs.illinois.edu/papers/orienteering-journal.pdf">Orienteering</a>&#8221; problem.<br /><br />While Euler&#8217;s problem has an efficient and exact solution, the itineraries problem is not just hard to solve, it is hard to even <i>approximately</i> solve!  The difficulty lies in the interplay between two conflicting goals:  first, we should pick great places to visit, but second, we should pick them to allow a good itinerary:  not too much travel time; don&#8217;t visit places when they&#8217;re closed; don&#8217;t visit too many museums, etc.  Embedded in such problems is the challenge of finding efficient routes, often referred to as the <a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">Travelling Salesman Problem</a> (TSP).<br /><br /><b>Algorithms for Travel Itineraries</b><br /><br />Fortunately, the real world has a property called the &#8220;<a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>&#8221; that says adding an extra stop to a route never makes it shorter.  When the underlying geometry satisfies the triangle inequality, the TSP can be approximately solved using another <a href="https://en.wikipedia.org/wiki/Christofides_algorithm">algorithm discovered by Christofides</a> in 1976.  This is an important part of our solution, and builds on Euler&#8217;s paper, so we&#8217;ll give a quick four-step rundown of how it works here:<br /><ol><li>We start with all our destinations separate, and repeatedly connect together the closest two that aren&#8217;t yet connected.  
This doesn&#8217;t yet give us an itinerary, but it does connect all the destinations via a <a href="https://en.wikipedia.org/wiki/Minimum_spanning_tree">minimum spanning tree</a> of the graph.</li><li>We take all the destinations that have an odd number of connections in this tree (Euler proved there must be an even number of these), and carefully pair them up.</li><li>Because all the destinations now have an even number of edges, we&#8217;ve created an Eulerian graph, so we create a route that crosses each edge exactly once.</li><li>We now have a great route, but it might visit some places more than once.  No problem, we find any double visits and simply bypass them, going directly from the predecessor to the successor.</li></ol>Christofides gave an elegant proof that the resulting route is always close to the shortest possible.  Here&#8217;s an example of the Christofides&#8217; algorithm in action on a location graph with the nodes representing places and the edges with costs representing the travel time between the places.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-DYlLoNxg-S8/V-FlOHLF8TI/AAAAAAAABN0/kNISdLQvX6cfWAmjT8k-LKPEMJA63nX-ACLcB/s1600/image04.png"><img border="0" height="172" src="https://3.bp.blogspot.com/-DYlLoNxg-S8/V-FlOHLF8TI/AAAAAAAABN0/kNISdLQvX6cfWAmjT8k-LKPEMJA63nX-ACLcB/s640/image04.png" width="640"></a></td></tr><tr><td>Construction of an Eulerian Tour in a location graph</td></tr></tbody></table>Armed with this efficient route-finding subroutine, we can now start building itineraries one step at a time.  At each step, we estimate the benefit to the user of each possible new place to visit, and likewise estimate the cost using the Christofides algorithm.  A user&#8217;s benefit can be derived from a host of natural factors such as the popularity of the place and how different the place is relative to places already visited on the tour. 
We then pick whichever new place has the best benefit per unit of extra cost (e.g., time needed to include the new place in the tour).  Here&#8217;s an example of our algorithm actually building a route in London using the location graph shown above:<br /><div><a href="https://3.bp.blogspot.com/-y_t6IG5RMuE/V-Flb6RQktI/AAAAAAAABN4/91je5cQumFcOI5bgJIliDkk3tX3MoPoAgCLcB/s1600/image01.png"><img border="0" height="274" src="https://3.bp.blogspot.com/-y_t6IG5RMuE/V-Flb6RQktI/AAAAAAAABN4/91je5cQumFcOI5bgJIliDkk3tX3MoPoAgCLcB/s640/image01.png" width="640"></a></div><b>Itineraries in Google Trips</b><br /><br />With our first good approximate solution to the itineraries problem in hand, we started working with our colleagues from the Google Trips team, and we realized we&#8217;d barely scratched the surface. For instance, even if we produce the absolute perfect itinerary, any particular user of the system will very reasonably say, &#8220;That&#8217;s great, but all my friends say I also need to visit this other place.  Plus, I&#8217;m only around for the morning, and I don&#8217;t want to miss this place you listed in the afternoon. And I&#8217;ve already seen Big Ben twice.&#8221;  So rather than just producing an itinerary once and calling it a perfect day, we needed a fast dynamic algorithm for itineraries that users can modify on the fly to suit their individual taste.  And because many people have bad data connections while traveling, the solution had to be efficient enough to run disconnected on a phone.<br /><br /><b>Better Itineraries Through the Wisdom of Crowds</b><br /><br />While the algorithmic aspects of the problem were highly challenging, we realized that producing high-quality itineraries was just as dependent on our understanding of the many possible stopping points on the itinerary.  
We had Google&#8217;s extensive travel database to identify the interesting places to visit, and we also had great data from Google&#8217;s existing systems about how to travel from any place to any other.  But we didn&#8217;t have a good sense for how people typically move through this geometry of places.  <br /><br />For this, we turned to the wisdom of crowds.  This type of wisdom is used by Google to <a href="https://googleblog.blogspot.com/2007/02/stuck-in-traffic.html">estimate delays on highways</a>, and to discover <a href="https://support.google.com/business/answer/6263531?hl=en">when restaurants are most busy</a>.  Here, we use the same techniques to learn about common visit sequences that we can stitch together into itineraries that feel good to our users. We combine Google's knowledge of <a href="https://techcrunch.com/2015/07/28/google-search-now-shows-you-when-local-businesses-are-busiest/">when places are popular</a>, with the directions between those places to gather an idea of what tourists like to do when travelling.<br /><br />And the crowd has a lot more wisdom to offer in the future.  For example, we noticed that visits to Buckingham Palace spike around 11:30 and stay a bit longer than at other times of the day.  This seemed a little strange to us, but when we looked more closely, it turns out to be the time of the <a href="https://www.royalcollection.org.uk/visit/buckinghampalace/what-to-see-and-do/changing-the-guard">Changing of the Guard</a>.  We&#8217;re looking now at ways to incorporate this type of timing information into the itinerary selection algorithms.<br /><br />So give it a try:  Google Trips, available now on <a href="https://play.google.com/store/apps/details?id=com.google.android.apps.travel.onthego">Android</a> and <a href="https://itunes.apple.com/app/id1081561570?mt=8">iOS</a>, has you covered from departure to return.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Bogdan Arsintescu, Software Engineer &amp; Sreenivas Gollapudi, Kostas Kollias, Tamas Sarlos and Andrew Tomkins, Research Scientists<br /></span><br /><br /><a href="https://en.wikipedia.org/wiki/Algorithm_engineering">Algorithms Engineering</a> is a lot of fun because algorithms do not go out of fashion: one never knows when an oldie-but-goodie might come in handy.  Case in point: Yesterday, Google <a href="https://googleblog.blogspot.com/2016/09/see-more-plan-less-try-google-trips.html">announced Google Trips</a>, a new app to assist you in your travels  by helping you create your own “perfect day” in a city.  Surprisingly, deep inside Google Trips, there is an algorithm that was invented 280 years ago. <br /><br />In 1736, <a href="https://en.wikipedia.org/wiki/Leonhard_Euler">Leonhard Euler</a> authored a brief but <a href="http://eulerarchive.maa.org//docs/originals/E053.pdf">beautiful mathematical paper</a> regarding the town of Königsberg and its 7 bridges, shown here:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-HRkkgmmGB3Y/V-Fk43DGYkI/AAAAAAAABNw/j5c6gQMUsjAWtMWwMBnQ-D35sA8l0-McQCLcB/s1600/image05.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="504" src="https://4.bp.blogspot.com/-HRkkgmmGB3Y/V-Fk43DGYkI/AAAAAAAABNw/j5c6gQMUsjAWtMWwMBnQ-D35sA8l0-McQCLcB/s640/image05.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Image from <a href="https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg">Wikipedia</a></td></tr></tbody></table>In the paper, Euler studied the following question: is it possible to walk through the city crossing each bridge exactly once? As it turns out, for the city of Königsberg, the answer is no. 
To reach this answer, Euler developed a general approach to represent any layout of landmasses and bridges in terms of what he dubbed the <i>Geometriam Situs</i> (the “Geometry of Place”), which we now call <a href="https://en.wikipedia.org/wiki/Graph_theory">Graph Theory</a>.  He represented each landmass as a “node” in the graph, and each bridge as an “edge,” like this:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-wRl8WjuCA-c/V-FkPmiZK-I/AAAAAAAABNs/Z9htxAzYeyg_C44uNKdjCYVLoQqaaQHuwCLcB/s1600/Screen%2BShot%2B2016-09-20%2Bat%2B9.26.46%2BAM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="146" src="https://1.bp.blogspot.com/-wRl8WjuCA-c/V-FkPmiZK-I/AAAAAAAABNs/Z9htxAzYeyg_C44uNKdjCYVLoQqaaQHuwCLcB/s640/Screen%2BShot%2B2016-09-20%2Bat%2B9.26.46%2BAM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Image from <a href="https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg">Wikipedia</a></td></tr></tbody></table>Euler noticed that if all the nodes in the graph have an even number of edges (such graphs are called “Eulerian” in his honor) then, and only then, a cycle can be found that visits every edge exactly once.  Keep this in mind, as we’ll rely on this fact later in the post.<br /><br />Our team in Google Research has been fascinated by the “Geometry of Place” for some time, and we started investigating a question related to Euler’s:  rather than visiting just the bridges, how can we visit as many interesting places as possible during a particular trip?  We call this the “itineraries” problem.  
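Euler's even-degree criterion is easy to check in code. Here is the Königsberg graph itself, with the four landmasses labeled A through D and the seven bridges as edges:

```python
from collections import Counter

# The seven bridges of Königsberg: two pairs of parallel bridges to the
# central island A, plus three more connecting the remaining landmasses.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# Eulerian circuit exists (in a connected graph) iff every degree is even.
is_eulerian = all(d % 2 == 0 for d in degree.values())
print(dict(degree))   # A has degree 5; B, C, D each have degree 3
print(is_eulerian)    # False: no walk crosses each bridge exactly once
```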
Euler didn’t study it, but it is a well known topic in Optimization, where it is often called the “<a href="http://chekuri.cs.illinois.edu/papers/orienteering-journal.pdf">Orienteering</a>” problem.<br /><br />While Euler’s problem has an efficient and exact solution, the itineraries problem is not just hard to solve, it is hard to even <i>approximately</i> solve!  The difficulty lies in the interplay between two conflicting goals:  first, we should pick great places to visit, but second, we should pick them to allow a good itinerary:  not too much travel time; don’t visit places when they’re closed; don’t visit too many museums, etc.  Embedded in such problems is the challenge of finding efficient routes, often referred to as the <a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">Travelling Salesman Problem</a> (TSP).<br /><br /><b>Algorithms for Travel Itineraries</b><br /><br />Fortunately, the real world has a property called the “<a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>” that says adding an extra stop to a route never makes it shorter.  When the underlying geometry satisfies the triangle inequality, the TSP can be approximately solved using another <a href="https://en.wikipedia.org/wiki/Christofides_algorithm">algorithm discovered by Christofides</a> in 1976.  This is an important part of our solution, and builds on Euler’s paper, so we’ll give a quick four-step rundown of how it works here:<br /><ol><li>We start with all our destinations separate, and repeatedly connect together the closest two that aren’t yet connected.  
This doesn’t yet give us an itinerary, but it does connect all the destinations via a <a href="https://en.wikipedia.org/wiki/Minimum_spanning_tree">minimum spanning tree</a> of the graph.</li><li>We take all the destinations that have an odd number of connections in this tree (Euler proved there must be an even number of these), and carefully pair them up.</li><li>Because all the destinations now have an even number of edges, we’ve created an Eulerian graph, so we create a route that crosses each edge exactly once.</li><li>We now have a great route, but it might visit some places more than once.  No problem, we find any double visits and simply bypass them, going directly from the predecessor to the successor.</li></ol>Christofides gave an elegant proof that the resulting route is always close to the shortest possible.  Here’s an example of the Christofides’ algorithm in action on a location graph with the nodes representing places and the edges with costs representing the travel time between the places.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-DYlLoNxg-S8/V-FlOHLF8TI/AAAAAAAABN0/kNISdLQvX6cfWAmjT8k-LKPEMJA63nX-ACLcB/s1600/image04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="172" src="https://3.bp.blogspot.com/-DYlLoNxg-S8/V-FlOHLF8TI/AAAAAAAABN0/kNISdLQvX6cfWAmjT8k-LKPEMJA63nX-ACLcB/s640/image04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Construction of an Eulerian Tour in a location graph</td></tr></tbody></table>Armed with this efficient route-finding subroutine, we can now start building itineraries one step at a time.  At each step, we estimate the benefit to the user of each possible new place to visit, and likewise estimate the cost using the Christofides algorithm.  
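The four-step rundown above can be sketched compactly on a toy distance matrix. One simplification to note: step 2 below pairs the odd-degree nodes greedily rather than with the minimum-weight perfect matching the true Christofides algorithm uses, so this illustrates the pipeline, not the full approximation guarantee:

```python
import math
from collections import defaultdict

def tour(dist):
    """dist: {(u, v): d} symmetric distances over a set of destinations."""
    nodes = sorted({u for u, v in dist} | {v for u, v in dist})

    def d(u, v):
        return dist[(u, v)] if (u, v) in dist else dist[(v, u)]

    # Step 1: connect everything via a minimum spanning tree (Prim's algorithm).
    in_tree, edges = {nodes[0]}, []
    while len(in_tree) < len(nodes):
        u, v = min(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                   key=lambda e: d(*e))
        edges.append((u, v))
        in_tree.add(v)

    # Step 2: pair up the odd-degree nodes (greedy stand-in for matching).
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    odd = [n for n in nodes if deg[n] % 2 == 1]  # Euler: always an even count
    while odd:
        u = odd.pop()
        v = min(odd, key=lambda w: d(u, w))
        odd.remove(v)
        edges.append((u, v))

    # Step 3: every degree is now even, so find an Eulerian circuit
    # (Hierholzer's algorithm on the resulting multigraph).
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    stack, circuit = [nodes[0]], []
    while stack:
        u = stack[-1]
        if adj[u]:
            v = adj[u].pop()
            adj[v].remove(u)
            stack.append(v)
        else:
            circuit.append(stack.pop())

    # Step 4: shortcut repeated visits, keeping only the first occurrence.
    seen, route = set(), []
    for u in circuit:
        if u not in seen:
            seen.add(u)
            route.append(u)
    return route + [route[0]]

# Four destinations at the corners of a unit square.
coords = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}
dist = {(u, v): math.dist(coords[u], coords[v])
        for u in coords for v in coords if u < v}
print(tour(dist))  # a closed route visiting each corner exactly once
```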
A user’s benefit can be derived from a host of natural factors such as the popularity of the place and how different the place is relative to places already visited on the tour. We then pick whichever new place has the best benefit per unit of extra cost (e.g., time needed to include the new place in the tour).  Here’s an example of our algorithm actually building a route in London using the location graph shown above:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-y_t6IG5RMuE/V-Flb6RQktI/AAAAAAAABN4/91je5cQumFcOI5bgJIliDkk3tX3MoPoAgCLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="274" src="https://3.bp.blogspot.com/-y_t6IG5RMuE/V-Flb6RQktI/AAAAAAAABN4/91je5cQumFcOI5bgJIliDkk3tX3MoPoAgCLcB/s640/image01.png" width="640" /></a></div><b>Itineraries in Google Trips</b><br /><br />With our first good approximate solution to the itineraries problem in hand, we started working with our colleagues from the Google Trips team, and we realized we’d barely scratched the surface. For instance, even if we produce the absolute perfect itinerary, any particular user of the system will very reasonably say, “That’s great, but all my friends say I also need to visit this other place.  Plus, I’m only around for the morning, and I don’t want to miss this place you listed in the afternoon. And I’ve already seen Big Ben twice.”  So rather than just producing an itinerary once and calling it a perfect day, we needed a fast dynamic algorithm for itineraries that users can modify on the fly to suit their individual taste.  
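The selection rule just described, taking the place with the best benefit per unit of extra cost, can be sketched as a simple greedy loop. The benefit scores and the flat marginal-cost function below are made-up stand-ins for the real signals (popularity, diversity, and the Christofides-based tour cost):

```python
def build_itinerary(places, benefit, extra_cost, time_budget):
    """Greedily grow an itinerary under a total time budget (minutes)."""
    itinerary, spent = [], 0.0
    remaining = list(places)
    while remaining:
        # Score each remaining place by benefit per unit of extra tour cost.
        scored = []
        for place in remaining:
            cost = extra_cost(itinerary, place)
            if spent + cost <= time_budget:
                scored.append((benefit[place] / cost, cost, place))
        if not scored:
            break  # nothing else fits in the budget
        _, cost, best = max(scored)
        itinerary.append(best)
        spent += cost
        remaining.remove(best)
    return itinerary

# Toy example: three sights, a flat 30-minute marginal cost, 60-minute budget.
benefit = {"museum": 9.0, "palace": 8.0, "market": 3.0}
flat_cost = lambda itin, place: 30.0
print(build_itinerary(list(benefit), benefit, flat_cost, time_budget=60))
# → ['museum', 'palace']
```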
And because many people have bad data connections while traveling, the solution had to be efficient enough to run disconnected on a phone.<br /><br /><b>Better Itineraries Through the Wisdom of Crowds</b><br /><br />While the algorithmic aspects of the problem were highly challenging, we realized that producing high-quality itineraries was just as dependent on our understanding of the many possible stopping points on the itinerary.  We had Google’s extensive travel database to identify the interesting places to visit, and we also had great data from Google’s existing systems about how to travel from any place to any other.  But we didn’t have a good sense for how people typically move through this geometry of places.  <br /><br />For this, we turned to the wisdom of crowds.  This type of wisdom is used by Google to <a href="https://googleblog.blogspot.com/2007/02/stuck-in-traffic.html">estimate delays on highways</a>, and to discover <a href="https://support.google.com/business/answer/6263531?hl=en">when restaurants are most busy</a>.  Here, we use the same techniques to learn about common visit sequences that we can stitch together into itineraries that feel good to our users. We combine Google's knowledge of <a href="https://techcrunch.com/2015/07/28/google-search-now-shows-you-when-local-businesses-are-busiest/">when places are popular</a>, with the directions between those places to gather an idea of what tourists like to do when travelling.<br /><br />And the crowd has a lot more wisdom to offer in the future.  For example, we noticed that visits to Buckingham Palace spike around 11:30 and stay a bit longer than at other times of the day.  This seemed a little strange to us, but when we looked more closely, it turns out to be the time of the <a href="https://www.royalcollection.org.uk/visit/buckinghampalace/what-to-see-and-do/changing-the-guard">Changing of the Guard</a>.  
We’re looking now at ways to incorporate this type of timing information into the itinerary selection algorithms.<br /><br />So give it a try:  Google Trips, available now on <a href="https://play.google.com/store/apps/details?id=com.google.android.apps.travel.onthego">Android</a> and <a href="https://itunes.apple.com/app/id1081561570?mt=8">iOS</a>, has you covered from departure to return. ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/the-280-year-old-algorithm-inside-google-trips/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 2016 Google Earth Engine User Summit: Turning pixels into insights</title>
		<link>https://googledata.org/google-research/the-2016-google-earth-engine-user-summit-turning-pixels-into-insights/</link>
		<comments>https://googledata.org/google-research/the-2016-google-earth-engine-user-summit-turning-pixels-into-insights/#comments</comments>
		<pubDate>Mon, 19 Sep 2016 16:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=73470801a8654dc38a04449f2e0ac8b1</guid>
		<description><![CDATA[Posted by Chris Herwig, Program Manager, Google Earth Engine"We are trying new methods [of flood modeling] in Earth Engine based on machine learning techniques which we think are cheaper, more scalable, and could exponentially drive down the cost of fl...]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Herwig, Program Manager, Google Earth Engine</span><br /><br />"<i>We are trying new methods [of flood modeling] in Earth Engine based on machine learning techniques which we think are cheaper, more scalable, and could exponentially drive down the cost of flood mapping and make it accessible to everyone."</i><br />-Beth Tellman, Arizona State University and <a href="http://www.cloudtostreet.info/">Cloud to Street</a><br /><i><br /></i> Recently, Google headquarters hosted the <a href="http://earthenginesummit2016.earthoutreach.org/">Google Earth Engine User Summit 2016</a>, a three-day hands-on technical workshop for scientists and students interested in using Google Earth Engine for planetary-scale cloud-based geospatial analysis. Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with a simple, yet powerful API backed by Google's cloud, which scientists and researchers use to detect, measure, and predict changes to the Earth's surface. 
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-VP3fyu1IeXM/V9xeL-MeJTI/AAAAAAAABM8/Q2ocLh1eQyEr68LRDi2g0SGDeuxa0CyfACLcB/s1600/image03.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="404" src="https://3.bp.blogspot.com/-VP3fyu1IeXM/V9xeL-MeJTI/AAAAAAAABM8/Q2ocLh1eQyEr68LRDi2g0SGDeuxa0CyfACLcB/s640/image03.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Earth Engine founder Rebecca Moore kicking off the first day of the summit</td></tr></tbody></table>Summit attendees could choose among <a href="http://earthenginesummit2016.earthoutreach.org/breakout-sessions-descriptions">twenty-five hands-on workshops</a> over the course of the three day summit, most generated for the summit specifically, giving attendees an exclusive introduction to the latest features in our platform. The sessions covered a wide range of topics and Earth Engine experience levels, from image classifiers and classifications, time series analysis, building custom web applications, all the way to arrays, matrices, and linear algebra in Earth Engine. 
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-nRhQuB-rmC4/V9xeVFKCtgI/AAAAAAAABNA/rT3zRK_bbQUem6A_sjl7ef0Gh9QIJL50ACLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://4.bp.blogspot.com/-nRhQuB-rmC4/V9xeVFKCtgI/AAAAAAAABNA/rT3zRK_bbQUem6A_sjl7ef0Gh9QIJL50ACLcB/s640/image00.png" width="640" /></a></div><a href="https://terrabella.google.com/">Terra Bella</a> Product Manager, Kristi Bohl, taught a <a href="https://docs.google.com/presentation/d/1SDxitSE8cLcWop2F-Q_KoF3GhdodqUucu8mv5vHwOPA/edit?usp=sharing">session on using SkySat imagery</a>, like the image above over Sydney, Australia, for change detection. Workshop attendees also learned how to take advantage of the deep temporal stack the SkySat archive offers for change-over-time analyses.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-gSGYN2_cWNY/V9xefCInbRI/AAAAAAAABNE/G9C7RuXsg_8lhUoXnWUiGRdzRKi4O0PDACLcB/s1600/image01.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="344" src="https://1.bp.blogspot.com/-gSGYN2_cWNY/V9xefCInbRI/AAAAAAAABNE/G9C7RuXsg_8lhUoXnWUiGRdzRKi4O0PDACLcB/s640/image01.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Cross-correlation between Landsat 8 NDVI and the sum of CHIRPS precipitation. Red is high cross-correlation and blue is low. The gap in data is because CHIRPS is masked over water.</td></tr></tbody></table>Nick Clinton, a developer advocate for Earth Engine, taught <a href="https://www.youtube.com/watch?v=_lCV3rZm6sg&amp;index=18&amp;list=PLWw80tqUZ5J9_3E_9C_bK8zt0mGHfvOrj">a time series session</a> that covered statistical techniques as applied to satellite imagery data. 
Students learned how to make graphics like the above, which shows the cross-correlation between <a href="http://landsat.usgs.gov/landsat8.php">Landsat 8 NDVI</a> and the sum of <a href="http://chg.geog.ucsb.edu/data/chirps/">CHIRPS precipitation</a> from the previous month over San Francisco, CA. The correlation should be high for relatively <a href="https://en.wikipedia.org/wiki/R/K_selection_theory">r-selected</a> plants like grasses and weeds and relatively low for perennials, shrubs, or forest.<br /><br />My <a href="https://www.youtube.com/watch?v=C_Yvg2XGZdI&amp;index=5&amp;list=PLWw80tqUZ5J9_3E_9C_bK8zt0mGHfvOrj">workshop session</a> covered how users can upload their own data into Earth Engine and the many different ways to take the results of their analyses with them, including rendering static map tiles hosted on Google Cloud Storage, exporting images, creating new assets, and even making movies, like this timelapse video of all the <a href="http://www.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Sentinel-2">Sentinel 2A</a> images captured over Sydney, Australia.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-XaMdTBJ5Tuw/V9xemb4vQDI/AAAAAAAABNI/3AWhfJF7x4sLsUJJjSeXf3v5G7c2GeduQCLcB/s1600/image04.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="https://4.bp.blogspot.com/-XaMdTBJ5Tuw/V9xemb4vQDI/AAAAAAAABNI/3AWhfJF7x4sLsUJJjSeXf3v5G7c2GeduQCLcB/s640/image04.gif" width="640" /></a></div>Along with the workshop sessions, we hosted five plenary speakers and 18 lightning talk presenters. These presenters shared how Earth Engine fits into their research, spanning drought monitoring, agriculture, conservation, flood risk mapping, and hydrological analysis. 
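The lagged NDVI-precipitation correlation from the time series session above can be sketched outside Earth Engine in plain Python. The `rain` and `grass` series below are hypothetical monthly values for a single pixel, not real CHIRPS or Landsat data:

```python
from math import sqrt

def lagged_corr(ndvi, precip, lag=1):
    # Pearson correlation between NDVI and precipitation `lag` months earlier.
    x = ndvi[lag:] if lag else list(ndvi)
    y = precip[:-lag] if lag else list(precip)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# A grass-like (r-selected) pixel whose NDVI tracks the previous month's rain:
rain = [10, 80, 60, 5, 0, 40, 90, 20]
grass = [0.2] + [0.2 + 0.005 * r for r in rain[:-1]]
print(round(lagged_corr(grass, rain), 3))  # 1.0
```

In Earth Engine itself this per-pixel statistic is computed over image collections rather than Python lists, but the underlying calculation is the same.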
<br /><br /><b>Plenary Speakers</b><br /><ul><li><a href="https://www.youtube.com/watch?v=E6mF64QSUYY&amp;feature=youtu.be">Agriculture in the Sentinel era: scaling up with Earth Engine</a>, Guido Lemoine, European Commission's Joint Research Centre</li><li><a href="https://www.youtube.com/watch?v=W7Ogja4ukSI&amp;feature=youtu.be">Flood Vulnerability from the Cloud to the Street (and back!) powered by Google Earth Engine</a>, Beth Tellman, Arizona State University and Cloud to Street</li><li><a href="https://www.youtube.com/watch?v=bMh63zR_5sY&amp;feature=youtu.be">Accelerating Rangeland Conservation</a>, Brady Allred, University of Montana</li><li><a href="https://www.youtube.com/watch?v=t4JoQyzWAE8&amp;feature=youtu.be">Monitoring Drought with Google Earth Engine: From Archives to Answers</a>, Justin Huntington, Desert Research Institute / Western Regional Climate Center</li><li><a href="https://www.youtube.com/watch?v=NiOv3MjD4Sw&amp;feature=youtu.be">Automated methods for surface water detection</a>, Gennadii Donchytes, Deltares</li></ul><b>Lightning Presentations</b><br /><ul><li><a href="https://youtu.be/TFUdGgderz0">Mapping the Behavior of Rivers</a>, Alex Bryk, University of California, Berkeley</li><li><a href="https://youtu.be/Bv78nra-Lx0">Climate Data for Crisis and Health Applications</a>, Pietro Ceccato, Columbia University</li><li><a href="https://www.youtube.com/watch?v=AXFfKVDUkAc&amp;feature=youtu.be">Appalachian Communities at Risk</a>, Matt Wasson, Jeff Deal, Appalachian Voices</li><li><a href="https://www.youtube.com/watch?v=MZz2XrAEB7I&amp;feature=youtu.be">Water, Wildlife and Working Lands</a>, Patrick Donnelly, U.S. 
Fish and Wildlife Service</li><li><a href="https://www.youtube.com/watch?v=AsNdMvuMYeU&amp;feature=youtu.be">Stream-side NDVI and The Salmonid Population Viability Project</a>, Kurt Fesenmyer, Trout Unlimited</li><li><a href="https://www.youtube.com/watch?v=tJ73FTEfFz0&amp;feature=youtu.be">Mapping Evapotranspiration for Water Use and Availability</a>, Mac Friedrichs, USGS</li><li><a href="https://www.youtube.com/watch?v=zbHsi9HQ7F0&amp;feature=youtu.be">Dynamic Wildfire Modeling in Earth Engine</a>, Miranda Gray, Conservation Science Partners</li><li><a href="https://www.youtube.com/watch?v=ckA6ilW_gqA&amp;feature=youtu.be">Fishing at Scale, now also in Earth Engine</a>, David Kroodsma, Skytruth</li><li><a href="https://www.youtube.com/watch?v=7IfJuMAVe6g&amp;feature=youtu.be">Mapping crop yields from field to national scales in Earth Engine</a>, David Lobell, Stanford University</li><li><a href="https://www.youtube.com/watch?v=QX6XdBoKtdM&amp;feature=youtu.be">Mapping Pacific Wildfires Impacts with Earth Engine</a>, Matthew Lucas, University of Hawaii</li><li><a href="https://www.youtube.com/watch?v=W7WJRN-8aJw&amp;feature=youtu.be">EarthEnv.org - Environmental layers for accessing status and trends in biodiversity, ecosystems and climate</a>, Jeremy Malczyk, Map of Life</li><li><a href="https://www.youtube.com/watch?v=8taCA9NXJlg&amp;feature=youtu.be">Building a Landsat 8 Mosaic of Antarctica</a>, Allen Pope, University of Colorado Boulder</li><li><a href="https://www.youtube.com/watch?v=prd8UQjF0cc&amp;feature=youtu.be">Monitoring Primary Production at Broad Spatial and Temporal Scales</a>, Nathaniel Robinson, University of Montana</li><li><a href="https://www.youtube.com/watch?v=Gjs-hSJZskI&amp;feature=youtu.be">Assessing Urbanization Trends for Public Health: Modelling Nighttime Lights Imagery in Africa with Earth Engine</a>, David Savory, University of California, San Francisco</li><li><a 
href="https://www.youtube.com/watch?v=7vW3ntkRs_k&amp;feature=youtu.be">National-scale mapping of forest carbon</a>, Ty Wilson, US Forest Service</li><li><a href="https://www.youtube.com/watch?v=tUGT1o4Qm08&amp;feature=youtu.be">Utilizing Google Earth Engine to Enhance Decision-Making Capabilities</a>, Brittany Zajic, NASA DEVELOP National Program</li></ul><b>Keeping our users first</b><br /><b><br /></b> It is always inspiring to see such a diverse group of people come together to celebrate, learn, and share all the amazing and wondrous things people are doing with Earth Engine. It is not only an opportunity for our users to learn the latest techniques; it is also a way for the Earth Engine team to experience the new and exciting ways people are harnessing Earth Engine to solve some of the most pressing environmental issues facing humanity.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-PI6cPvrctgQ/V9xet3gAU4I/AAAAAAAABNM/L55U_70EiCYySRk9FLdji2ENsKp_Z81kgCLcB/s1600/image02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="386" src="https://2.bp.blogspot.com/-PI6cPvrctgQ/V9xet3gAU4I/AAAAAAAABNM/L55U_70EiCYySRk9FLdji2ENsKp_Z81kgCLcB/s640/image02.jpg" width="640" /></a></div>We've already begun planning for next year's user summit, and based on the success of this year's, we're hoping to hold an even larger one. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/the-2016-google-earth-engine-user-summit-turning-pixels-into-insights/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Research from VLDB 2016: Improved Friend Suggestion using Ego-Net Analysis</title>
		<link>https://googledata.org/google-research/research-from-vldb-2016-improved-friend-suggestion-using-ego-net-analysis/</link>
		<comments>https://googledata.org/google-research/research-from-vldb-2016-improved-friend-suggestion-using-ego-net-analysis/#comments</comments>
		<pubDate>Thu, 15 Sep 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=66cfe71b47b6cc7833d27e70a56e4b31</guid>
		<description><![CDATA[<span>Posted by Alessandro Epasto, Research Scientist, Google Research NY</span><br /><br />On September 5 - 9, New Delhi, India hosted the <a href="http://vldb2016.persistent.com/">42nd International Conference on Very Large Data Bases</a> (VLDB), a premier annual forum for academic and industry research on databases, data management, data mining and data analytics. Over the past several years, Google has actively participated in VLDB, both as official sponsor and with numerous contributions to the research and industrial tracks. In this post, we would like to share the research presented in one of the Google papers from VLDB 2016. <br /><br />In <a href="http://www.vldb.org/pvldb/vol9/p324-epasto.pdf"><i>Ego-net Community Mining Applied to Friend Suggestion</i></a>, co-authored by Googlers <a href="http://research.google.com/pubs/SilvioLattanzi.html">Silvio Lattanzi</a>, <a href="http://research.google.com/pubs/mirrokni.html">Vahab Mirrokni</a>, Ismail Oner Sebe, <a href="http://research.google.com/pubs/AhmedTaei.html">Ahmed Taei</a>, Sunita Verma and <a href="http://research.google.com/pubs/AlessandroEpasto.html">myself</a>, we explore how social networks can provide better friend suggestions to users, a challenging practical problem faced by all social network platforms.<br /><br />Friend suggestion &#8211; the task of suggesting to a user the contacts she might already know in the network but that she hasn&#8217;t added yet &#8211; is a major driver of user engagement and social connection in all online social networks. Designing a high-quality system that can provide relevant and useful friend recommendations is very challenging, and requires state-of-the-art machine learning algorithms based on a multitude of parameters.  
<br /><br />An effective family of features for friend suggestion consists of <a href="https://en.wikipedia.org/wiki/Graph_(mathematics)">graph</a> features such as the <i>number of common friends </i>between two users. While widely used, the number of common friends has some major drawbacks, including the following, which is shown in Figure 1.<br /><div></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-88-zF6lIbX0/V9rekNAup4I/AAAAAAAABMY/EdJuYRKPC3sE6xfgOeV5xJogkaewvG3JACLcB/s1600/image01.png"><img border="0" height="492" src="https://4.bp.blogspot.com/-88-zF6lIbX0/V9rekNAup4I/AAAAAAAABMY/EdJuYRKPC3sE6xfgOeV5xJogkaewvG3JACLcB/s640/image01.png" width="640"></a></td></tr><tr><td>Figure 1: Ego-net of Sally.</td></tr></tbody></table>In this figure we represent the social connections of Sally and her friends &#8211; the <i>ego-net</i> of Sally. An ego-net of a node (in this case, Sally) is defined as the graph that contains the node itself, all of the node&#8217;s neighbors and the connections among <i>those</i> nodes. Sally has 6 friends in her ego-net: <b>A</b>lbert (her husband), <b>B</b>rian (her son), <b>C</b>harlotte (her mother) as well as <b>U</b>ma (her boss), <b>V</b>incent and <b>W</b>ally (two of her team members). Notice how <b>A</b>, <b>B</b> and <b>C</b> are all connected with each other while they do not know <b>U</b>, <b>V</b> or <b>W</b>. On the other hand <b>U</b>, <b>V</b> and <b>W</b> have all added each other as their friend (except <b>U</b> and <b>W</b>, who are good friends but somehow forgot to add each other).<br /><br />Notice how each of <b>A</b>, <b>B</b>, <b>C</b> has a common friend with each of <b>U</b>, <b>V</b> and <b>W</b>: Sally herself. A friend recommendation system based on common neighbors might suggest to Sally&#8217;s son (for instance) to add Sally&#8217;s boss as his friend! 
In reality the situation is even more complicated because users&#8217; online and offline friends span several different social circles or communities (family, work, school, sports, etc). <br /><br />In our paper we introduce a novel technique for friend suggestions based on independently analyzing the ego-net structure. The main contribution of the paper is to show that it is possible to provide friend suggestions efficiently by constructing all ego-nets of the nodes in the graph and then independently applying community detection algorithms on them in large-scale <a href="https://en.wikipedia.org/wiki/MapReduce">distributed systems</a>. <br /><br />Specifically, the algorithm proceeds by constructing the ego-nets of all nodes and applying, independently on each of them, a community detection algorithm. More precisely, the algorithm operates on so-called &#8220;ego-net-minus-ego&#8221; graphs, where the ego-net-minus-ego of a node is defined as the graph including only the neighbors of that node, as shown in the figure below.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-4taXubdlCw0/V9re8hjeM3I/AAAAAAAABMg/p8HMW4Ztsy4NL973u2fT-xHz0HHyfaM1ACLcB/s1600/image00.png"><img border="0" height="530" src="https://2.bp.blogspot.com/-4taXubdlCw0/V9re8hjeM3I/AAAAAAAABMg/p8HMW4Ztsy4NL973u2fT-xHz0HHyfaM1ACLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Figure 2: Clustering of the ego-net of Sally.</td></tr></tbody></table>Notice how in this example the ego-net-minus-ego of Sally has two very clear communities: her family (<b>A</b>, <b>B</b>, <b>C</b>) and her co-workers (<b>U</b>, <b>V</b>, <b>W</b>), which are easily separated. Intuitively, this is because one might expect that while nodes (e.g. Sally) participate in many communities, there is usually a single (or a limited number of) contexts in which two specific neighbors interact. While Sally is part of both her family and work communities, Sally and Uma interact <i>only</i> at work. 
Through extensive experimental evaluation on large-scale public social networks and formally through a simple mathematical model, our paper confirms this intuition. It seems that while communities are hard to separate in a global graph, they are easier to identify at the local level of ego-nets. <br /><br />This allows for a novel graph-based method for friend suggestion which intuitively only allows suggestion of pairs of users that are clustered together in the same community from the point of view of their common friends. With this method, <b>U</b> and <b>W</b> will be suggested to add each other (as they are in the same community and they are not yet connected) while <b>B</b> and <b>U</b> will <i>not</i> be suggested as friends as they span two different communities. <br /><br />From an algorithmic point of view, the paper introduces efficient parallel and distributed techniques for computing and clustering all ego-nets of very large graphs at the same time &#8211; a fundamental aspect enabling use of the system on the entire worldwide Google+ graph. We have applied this feature in the &#8220;You May Know&#8221; system of Google+, resulting in a clear positive impact on the prediction task, improving the acceptance rate by more than 1.5% and decreasing the rejection rate by more than 3.3% (a significant impact at Google scale).<br /><br />We believe that many future directions of work might stem from our preliminary results. For instance, ego-net analysis could potentially be used to automatically classify a user&#8217;s contacts into circles and to detect spam. Another interesting direction is the study of ego-network evolution in dynamic graphs.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Alessandro Epasto, Research Scientist, Google Research NY</span><br /><br />On September 5 - 9, New Delhi, India hosted the <a href="http://vldb2016.persistent.com/">42nd International Conference on Very Large Data Bases</a> (VLDB), a premier annual forum for academic and industry research on databases, data management, data mining and data analytics. Over the past several years, Google has actively participated in VLDB, both as official sponsor and with numerous contributions to the research and industrial tracks. In this post, we would like to share the research presented in one of the Google papers from VLDB 2016. <br /><br />In <a href="http://www.vldb.org/pvldb/vol9/p324-epasto.pdf"><i>Ego-net Community Mining Applied to Friend Suggestion</i></a>, co-authored by Googlers <a href="http://research.google.com/pubs/SilvioLattanzi.html">Silvio Lattanzi</a>, <a href="http://research.google.com/pubs/mirrokni.html">Vahab Mirrokni</a>, Ismail Oner Sebe, <a href="http://research.google.com/pubs/AhmedTaei.html">Ahmed Taei</a>, Sunita Verma and <a href="http://research.google.com/pubs/AlessandroEpasto.html">myself</a>, we explore how social networks can provide better friend suggestions to users, a challenging practical problem faced by all social network platforms.<br /><br />Friend suggestion – the task of suggesting to a user the contacts she might already know in the network but that she hasn’t added yet – is a major driver of user engagement and social connection in all online social networks. Designing a high-quality system that can provide relevant and useful friend recommendations is very challenging, and requires state-of-the-art machine learning algorithms based on a multitude of parameters.  
<br /><br />An effective family of features for friend suggestion consists of <a href="https://en.wikipedia.org/wiki/Graph_(mathematics)">graph</a> features such as the <i>number of common friends </i>between two users. While widely used, the number of common friends has some major drawbacks, including the following, which is shown in Figure 1.<br /><div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-88-zF6lIbX0/V9rekNAup4I/AAAAAAAABMY/EdJuYRKPC3sE6xfgOeV5xJogkaewvG3JACLcB/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="492" src="https://4.bp.blogspot.com/-88-zF6lIbX0/V9rekNAup4I/AAAAAAAABMY/EdJuYRKPC3sE6xfgOeV5xJogkaewvG3JACLcB/s640/image01.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 1: Ego-net of Sally.</td></tr></tbody></table>In this figure we represent the social connections of Sally and her friends – the <i>ego-net</i> of Sally. An ego-net of a node (in this case, Sally) is defined as the graph that contains the node itself, all of the node’s neighbors and the connections among <i>those</i> nodes. Sally has 6 friends in her ego-net: <b>A</b>lbert (her husband), <b>B</b>rian (her son), <b>C</b>harlotte (her mother) as well as <b>U</b>ma (her boss), <b>V</b>incent and <b>W</b>ally (two of her team members). Notice how <b>A</b>, <b>B</b> and <b>C</b> are all connected with each other while they do not know <b>U</b>, <b>V</b> or <b>W</b>. 
On the other hand <b>U</b>, <b>V</b> and <b>W</b> have all added each other as their friend (except <b>U</b> and <b>W</b>, who are good friends but somehow forgot to add each other).<br /><br />Notice how each of <b>A</b>, <b>B</b>, <b>C</b> has a common friend with each of <b>U</b>, <b>V</b> and <b>W</b>: Sally herself. A friend recommendation system based on common neighbors might suggest to Sally’s son (for instance) to add Sally’s boss as his friend! In reality the situation is even more complicated because users’ online and offline friends span several different social circles or communities (family, work, school, sports, etc). <br /><br />In our paper we introduce a novel technique for friend suggestions based on independently analyzing the ego-net structure. The main contribution of the paper is to show that it is possible to provide friend suggestions efficiently by constructing all ego-nets of the nodes in the graph and then independently applying community detection algorithms on them in large-scale <a href="https://en.wikipedia.org/wiki/MapReduce">distributed systems</a>. <br /><br />Specifically, the algorithm proceeds by constructing the ego-nets of all nodes and applying, independently on each of them, a community detection algorithm. 
More precisely, the algorithm operates on so-called “ego-net-minus-ego” graphs, where the ego-net-minus-ego of a node is defined as the graph including only the neighbors of that node, as shown in the figure below.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-4taXubdlCw0/V9re8hjeM3I/AAAAAAAABMg/p8HMW4Ztsy4NL973u2fT-xHz0HHyfaM1ACLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="530" src="https://2.bp.blogspot.com/-4taXubdlCw0/V9re8hjeM3I/AAAAAAAABMg/p8HMW4Ztsy4NL973u2fT-xHz0HHyfaM1ACLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 2: Clustering of the ego-net of Sally.</td></tr></tbody></table>Notice how in this example the ego-net-minus-ego of Sally has two very clear communities: her family (<b>A</b>, <b>B</b>, <b>C</b>) and her co-workers (<b>U</b>, <b>V</b>, <b>W</b>), which are easily separated. Intuitively, this is because one might expect that while nodes (e.g. Sally) participate in many communities, there is usually a single (or a limited number of) contexts in which two specific neighbors interact. While Sally is part of both her family and work communities, Sally and Uma interact <i>only</i> at work. Through extensive experimental evaluation on large-scale public social networks and formally through a simple mathematical model, our paper confirms this intuition. It seems that while communities are hard to separate in a global graph, they are easier to identify at the local level of ego-nets. <br /><br />This allows for a novel graph-based method for friend suggestion which intuitively only allows suggestion of pairs of users that are clustered together in the same community from the point of view of their common friends. 
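A minimal sketch of this per-node pipeline, using a hypothetical edge list reconstructed from Figure 2 and plain connected components as a simple stand-in for the community detection algorithms the paper actually applies:

```python
from itertools import combinations

# Sally's ego-net-minus-ego (hypothetical, from Figure 2):
# family A-B, A-C, B-C; co-workers U-V, V-W (U-W not yet connected).
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("U", "V"), ("V", "W")}
nodes = {n for e in edges for n in e}

def components(nodes, edges):
    """Connected components, standing in for a community detection step."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def suggestions(nodes, edges):
    """Suggest only unconnected pairs that share a local community."""
    out = []
    for comp in components(nodes, edges):
        for a, b in combinations(sorted(comp), 2):
            if (a, b) not in edges and (b, a) not in edges:
                out.append((a, b))
    return out

print(suggestions(nodes, edges))  # [('U', 'W')] — B and U are never paired
```

In the paper this clustering runs independently on every node's ego-net across the whole graph, which is what makes the computation embarrassingly parallel.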
With this method, <b>U</b> and <b>W</b> will be suggested to add each other (as they are in the same community and they are not yet connected) while <b>B</b> and <b>U</b> will <i>not</i> be suggested as friends as they span two different communities. <br /><br />From an algorithmic point of view, the paper introduces efficient parallel and distributed techniques for computing and clustering all ego-nets of very large graphs at the same time – a fundamental aspect enabling use of the system on the entire worldwide Google+ graph. We have applied this feature in the “You May Know” system of Google+, resulting in a clear positive impact on the prediction task, improving the acceptance rate by more than 1.5% and decreasing the rejection rate by more than 3.3% (a significant impact at Google scale).<br /><br />We believe that many future directions of work might stem from our preliminary results. For instance, ego-net analysis could potentially be used to automatically classify a user’s contacts into circles and to detect spam. Another interesting direction is the study of ego-network evolution in dynamic graphs. ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/research-from-vldb-2016-improved-friend-suggestion-using-ego-net-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computational Thinking from a Dispositions Perspective</title>
		<link>https://googledata.org/google-research/computational-thinking-from-a-dispositions-perspective/</link>
		<comments>https://googledata.org/google-research/computational-thinking-from-a-dispositions-perspective/#comments</comments>
		<pubDate>Tue, 13 Sep 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=e401d7d4ae199a92c7567c152c2fd552</guid>
		<description><![CDATA[<span>Posted by Chris Stephenson, Head of Computer Science Education Programs at Google, and Joyce Malyn-Smith, Managing Project Director at Education Development Center (EDC)<br /></span> <br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/09/Computational-Thinking.html">Google for Education Blog</a>)</i><br /><br />In K&#8211;12 computer science (CS) education, much of the discussion about what students need to learn and do has centered around computational thinking (CT). While much of the current work in CT education is focused on core concepts and their application, the one area of CT that has not been well explored is the relationship between CT as a problem-solving model and the dispositions or habits of mind that it can build in students of all ages. <br /><br />Exploring the mindset that CT education can engender depends, in part, on the definition of CT itself. While there are a number of definitions of CT in circulation, <a href="http://www.amanyadav.org/CEP991A/wp-content/uploads/2014/08/Barr_Stephenson_2011.pdf">Valerie Barr and I</a> defined it in the following way:<br /><blockquote><i>CT is an approach to solving problems in a way that can be implemented with a computer. Students become not merely tool users but tool builders. They use a set of concepts, such as abstraction, recursion, and iteration, to process and analyze data, and to create real and virtual artifacts. 
CT is a problem solving methodology that can be automated and transferred and applied across subjects.</i></blockquote>Like many others, our view of CT also included the core CT concepts: abstraction, algorithms and procedures, automation, data collection and analysis, data representation, modeling and simulation, parallelization and problem decomposition.<br /><div><a href="https://3.bp.blogspot.com/-tBUwm1ickkM/V9YNM68H74I/AAAAAAAABLs/NfOpCTV8jG4sLH4O2VrHciXgPXCel0-LgCLcB/s1600/GoEDU_comp_thinking_0906_r2_option3.png"><img border="0" height="320" src="https://3.bp.blogspot.com/-tBUwm1ickkM/V9YNM68H74I/AAAAAAAABLs/NfOpCTV8jG4sLH4O2VrHciXgPXCel0-LgCLcB/s640/GoEDU_comp_thinking_0906_r2_option3.png" width="640"></a></div>The idea of dispositions, however, comes from the field of vocational education and research on career development which focuses on the <a href="http://googleforeducation.blogspot.com/2015/04/the-skills-agenda-preparing-students.html">personal qualities or soft skills needed for employment</a> (<i>see full report from Economist Intelligence Unit <a href="https://www.google.com/edu/resources/global-education/economist-intelligence-report/">here</a></i>). These skills traditionally include being responsible, adaptable, flexible, self-directed, and self-motivated; being able to solve simple and complex problems, having integrity, self-confidence, and self-control. They can also include the ability to work with people of different ages and cultures, collaboration, complex communication and expert thinking. <br /><a href="http://jwilson.coe.uga.edu/EMAT7050/Cuoco.HabitsOfMind.pdf"><br /></a> <a href="http://jwilson.coe.uga.edu/EMAT7050/Cuoco.HabitsOfMind.pdf">Cuoco, Goldenberg, and Mark&#8217;s</a> research also provided examples of what students should learn to develop the habits of mind used by scientists across numerous disciplines. These are: recognizing patterns, experimenting, describing, tinkering, inventing, visualizing, and conjecturing. 
<a href="http://dl.acm.org/citation.cfm?id=2751967&#38;CFID=659482068&#38;CFTOKEN=50296269">Potter and Vickers</a> also found that in the burgeoning field of cyber security &#8220;there is significant overlap between the roles for many soft skills, including analysis, consulting and process skills, leadership, and relationship management. Both communication and presentation skills were valued.&#8221;<br /><div><a href="https://4.bp.blogspot.com/-jqgJJ0dr-W8/V9YN4qt0BMI/AAAAAAAABLw/jkf7SSG416gGdQzVUCi8_0Xa6zhfWNCzACLcB/s1600/GoEDU_comp_thinking_0906_r2.png"><img border="0" height="320" src="https://4.bp.blogspot.com/-jqgJJ0dr-W8/V9YN4qt0BMI/AAAAAAAABLw/jkf7SSG416gGdQzVUCi8_0Xa6zhfWNCzACLcB/s640/GoEDU_comp_thinking_0906_r2.png" width="640"></a></div>CT, because of its emphasis on problem solving, provides a natural environment for embedding the idea of dispositions into K-12. According to the <a href="https://www.iste.org/explore/articleDetail?articleid=152">International Society for Technology in Education</a> and the <a href="https://csta.acm.org/Curriculum/sub/CompThinking.html">Computer Science Teachers Association</a>, the set of dispositions that students practice and internalize while learning about CT can include:<br /><ul><li>confidence in dealing with complexity,</li><li>persistence in working with difficult problems,</li><li>the ability to handle ambiguity,</li><li>the ability to deal with open-ended problems,</li><li>setting aside differences to work with others to achieve a common goal or solution, and</li><li>knowing one's strengths and weaknesses when working with others.</li></ul>Any teacher in any discipline is likely to tell you that persistence, problem solving, collaboration and awareness of one&#8217;s strengths and limitations are critical to successful learning for all students. So how do we make these dispositions a more explicit part of the CT curriculum? 
One of the ways to do so is to call them out directly to students and explain why they are important in all areas of their study, career, and lives. In addition, educators can:<br /><ul><li>Post a list of the Dispositions Leading to Success in the classroom,</li><li>Help familiarize students with these dispositions by using the terms when talking with students and referring to the work they are doing. &#8220;Today we are going to be solving an open-ended problem. What do you think that means?&#8221;</li><li>Help students understand that they are developing these dispositions by congratulating them when these dispositions lead to success: &#8220;Great problem-solving skills!&#8221;; &#8220;Great job! Your persistence helped solve the problem&#8221;; &#8220;You dealt with ambiguity really well!&#8221;.</li><li>Engage students in discussions about the dispositions: &#8220;Today we are going to work in teams. What does it mean to be on a team? What types of people would you want on your team and why?&#8221;</li><li>Help students articulate their dispositions when developing their resumes or preparing for job interviews.</li></ul>Guest speakers from industry might also:<br /><ul><li>Integrate the importance of dispositions into their talks with students: examples of the problems they have solved, how the different skills of team members led to different solutions, the role persistence played in solving a problem/developing a product or service&#8230;</li><li>Talk about the importance of dispositions to employers and how they contribute to their own organizational culture, the ways employers ask interviewees about their dispositions or how interviewees might respond (e.g. 
use the terms and give examples).</li></ul>As Google&#8217;s Director of Education and University Relations, Maggie Johnson noted in a <a href="http://googleforeducation.blogspot.com/2016/08/computational-thinking-for-all-students_3.html">recent blog post</a>, CT represents a core set of skills that are necessary for all students:<br /><blockquote><i>If we can make these explicit connections for students, they will see how the devices and apps that they use everyday are powered by algorithms and programs. They will learn the importance of data in making decisions. They will learn skills that will prepare them for a workforce that will be doing vastly different tasks than the workforce of today.</i></blockquote>In addition to these concepts, we can now add developing critical dispositions for success in computing and in life to the list of benefits for teaching CT to all students.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Stephenson, Head of Computer Science Education Programs at Google, and Joyce Malyn-Smith, Managing Project Director at Education Development Center (EDC)<br /></span> <br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/09/Computational-Thinking.html">Google for Education Blog</a>)</i><br /><br />In K–12 computer science (CS) education, much of the discussion about what students need to learn and do has centered around computational thinking (CT). While much of the current work in CT education is focused on core concepts and their application, the one area of CT that has not been well explored is the relationship between CT as a problem-solving model and the dispositions or habits of mind that it can build in students of all ages. <br /><br />Exploring the mindset that CT education can engender depends, in part, on the definition of CT itself. While there are a number of definitions of CT in circulation, <a href="http://www.amanyadav.org/CEP991A/wp-content/uploads/2014/08/Barr_Stephenson_2011.pdf">Valerie Barr and I</a> defined it in the following way:<br /><blockquote class="tr_bq"><i>CT is an approach to solving problems in a way that can be implemented with a computer. Students become not merely tool users but tool builders. They use a set of concepts, such as abstraction, recursion, and iteration, to process and analyze data, and to create real and virtual artifacts. 
CT is a problem solving methodology that can be automated and transferred and applied across subjects.</i></blockquote>Like many others, our view of CT also included the core CT concepts: abstraction, algorithms and procedures, automation, data collection and analysis, data representation, modeling and simulation, parallelization and problem decomposition.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-tBUwm1ickkM/V9YNM68H74I/AAAAAAAABLs/NfOpCTV8jG4sLH4O2VrHciXgPXCel0-LgCLcB/s1600/GoEDU_comp_thinking_0906_r2_option3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://3.bp.blogspot.com/-tBUwm1ickkM/V9YNM68H74I/AAAAAAAABLs/NfOpCTV8jG4sLH4O2VrHciXgPXCel0-LgCLcB/s640/GoEDU_comp_thinking_0906_r2_option3.png" width="640" /></a></div>The idea of dispositions, however, comes from the field of vocational education and research on career development, which focuses on the <a href="http://googleforeducation.blogspot.com/2015/04/the-skills-agenda-preparing-students.html">personal qualities or soft skills needed for employment</a> (<i>see the full report from the Economist Intelligence Unit <a href="https://www.google.com/edu/resources/global-education/economist-intelligence-report/">here</a></i>). These skills traditionally include being responsible, adaptable, flexible, self-directed, and self-motivated; being able to solve simple and complex problems; and having integrity, self-confidence, and self-control. They can also include the ability to work with people of different ages and cultures, collaboration, complex communication, and expert thinking. <br /><br /><a href="http://jwilson.coe.uga.edu/EMAT7050/Cuoco.HabitsOfMind.pdf">Cuoco, Goldenberg, and Mark’s</a> research also provided examples of what students should learn to develop the habits of mind used by scientists across numerous disciplines. 
These are: recognizing patterns, experimenting, describing, tinkering, inventing, visualizing, and conjecturing. <a href="http://dl.acm.org/citation.cfm?id=2751967&amp;CFID=659482068&amp;CFTOKEN=50296269">Potter and Vickers</a> also found that in the burgeoning field of cyber security “there is significant overlap between the roles for many soft skills, including analysis, consulting and process skills, leadership, and relationship management. Both communication and presentation skills were valued.”<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-jqgJJ0dr-W8/V9YN4qt0BMI/AAAAAAAABLw/jkf7SSG416gGdQzVUCi8_0Xa6zhfWNCzACLcB/s1600/GoEDU_comp_thinking_0906_r2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://4.bp.blogspot.com/-jqgJJ0dr-W8/V9YN4qt0BMI/AAAAAAAABLw/jkf7SSG416gGdQzVUCi8_0Xa6zhfWNCzACLcB/s640/GoEDU_comp_thinking_0906_r2.png" width="640" /></a></div>CT, because of its emphasis on problem solving, provides a natural environment for embedding the idea of dispositions into K-12. 
According to the <a href="https://www.iste.org/explore/articleDetail?articleid=152">International Society for Technology in Education</a> and the <a href="https://csta.acm.org/Curriculum/sub/CompThinking.html">Computer Science Teachers Association</a>, the set of dispositions that students practice and internalize while learning about CT can include:<br /><ul><li>confidence in dealing with complexity,</li><li>persistence in working with difficult problems,</li><li>the ability to handle ambiguity,</li><li>the ability to deal with open-ended problems,</li><li>setting aside differences to work with others to achieve a common goal or solution, and</li><li>knowing one's strengths and weaknesses when working with others.</li></ul>Any teacher in any discipline is likely to tell you that persistence, problem solving, collaboration, and awareness of one’s strengths and limitations are critical to successful learning for all students. So how do we make these dispositions a more explicit part of the CT curriculum? One of the ways to do so is to call them out directly to students and explain why they are important in all areas of their study, career, and lives. In addition, educators can:<br /><ul><li>Post in the classroom a list of the Dispositions Leading to Success,</li><li>Help familiarize students with these dispositions by using the terms when talking with students and referring to the work they are doing. “Today we are going to be solving an open-ended problem. What do you think that means?”</li><li>Help students understand that they are developing these dispositions by congratulating them when these dispositions lead to success: “Great problem-solving skills!”; “Great job! Your persistence helped solve the problem”; “You dealt with ambiguity really well!”.</li><li>Engage students in discussions about the dispositions: “Today we are going to work in teams. What does it mean to be on a team? 
What types of people would you want on your team and why?”</li><li>Help students articulate their dispositions when developing their resumes or preparing for job interviews.</li></ul>Guest speakers from industry might also:<br /><ul><li>Integrate the importance of dispositions into their talks with students: examples of the problems they have solved, how the different skills of team members led to different solutions, the role persistence played in solving a problem/developing a product or service…</li><li>Talk about the importance of dispositions to employers and how they contribute to their own organizational culture, the ways employers ask interviewees about their dispositions or how interviewees might respond (e.g. use the terms and give examples).</li></ul>As Google’s Director of Education and University Relations, Maggie Johnson noted in a <a href="http://googleforeducation.blogspot.com/2016/08/computational-thinking-for-all-students_3.html">recent blog post</a>, CT represents a core set of skills that are necessary for all students:<br /><blockquote class="tr_bq"><i>If we can make these explicit connections for students, they will see how the devices and apps that they use everyday are powered by algorithms and programs. They will learn the importance of data in making decisions. They will learn skills that will prepare them for a workforce that will be doing vastly different tasks than the workforce of today.</i></blockquote>In addition to these concepts, we can now add developing critical dispositions for success in computing and in life to the list of benefits for teaching CT to all students.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/computational-thinking-from-a-dispositions-perspective/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing the First Annual Global PhD Fellowship Summit and the 2016 Google PhD Fellows</title>
		<link>https://googledata.org/google-research/announcing-the-first-annual-global-phd-fellowship-summit-and-the-2016-google-phd-fellows/</link>
		<comments>https://googledata.org/google-research/announcing-the-first-annual-global-phd-fellowship-summit-and-the-2016-google-phd-fellows/#comments</comments>
		<pubDate>Wed, 07 Sep 2016 19:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=beb506b885a46ed1b15a329c0d425af5</guid>
		<description><![CDATA[<span>Posted by Michael Rennaker, Program Manager, University Relations</span><br /><br />In 2009, Google created the <a href="http://research.google.com/research-outreach.html#/research-outreach/graduate-fellowships">PhD Fellowship Program</a> to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our Fellowships have helped support over 250 graduate students in <a href="http://google-au.blogspot.com.au/2015/08/phd-fellowships-to-support-cutting-edge.html">Australia</a>, <a href="http://www.google.cn/intl/en/university/research/phdfellowship.html">China and East Asia</a>, <a href="https://research.google.com/university/relations/phd_fellowship_india.html">India</a>, <a href="http://googleresearch.blogspot.com/2015/02/announcing-2015-north-american-google.html">North America</a>, <a href="http://googleresearch.blogspot.com/2015/06/announcing-2015-google-european.html">Europe and the Middle East</a> who seek to shape and influence the future of technology.<br /><br />Recently, Google PhD Fellows from around the globe converged on our Mountain View campus for the first annual Global PhD Fellowship Summit.  The students heard talks from researchers like <a href="http://research.google.com/pubs/jeff.html">Jeff Dean</a>, <a href="http://research.google.com/pubs/author21120.html">Fran&#231;oise Beaufays</a>, <a href="http://research.google.com/pubs/author205.html">Peter Norvig</a>, <a href="http://research.google.com/pubs/MayaGupta.html">Maya Gupta</a> and <a href="http://research.google.com/pubs/AminVahdat.html">Amin Vahdat</a>, and got a glimpse into some of the state-of-the-art research pursued across Google.  
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-FKBH8S5My3A/V9BVvjGxl-I/AAAAAAAABLM/oa1bwsjCrYMFv8GOePgQsPgeR9PAPACwQCLcB/s1600/image02.png"><img border="0" height="288" src="https://4.bp.blogspot.com/-FKBH8S5My3A/V9BVvjGxl-I/AAAAAAAABLM/oa1bwsjCrYMFv8GOePgQsPgeR9PAPACwQCLcB/s640/image02.png" width="640"></a></td></tr><tr><td>Senior Google Fellow Jeff Dean shares how TensorFlow is used at Google</td></tr></tbody></table>Fellows also had the chance to connect one-on-one with Googlers to discuss their research, as well as receive feedback from leaders in their fields. The event wrapped up with a panel discussion with <a href="http://research.google.com/pubs/author173.html">Dan Russell</a>, <a href="http://research.google.com/pubs/KristenLeFevre.html">Kristen LeFevre</a>, <a href="http://research.google.com/pubs/author39086.html">Douglas Eck</a> and <a href="http://research.google.com/pubs/author21120.html">Fran&#231;oise Beaufays</a> about their unique career paths. 
<a href="https://research.google.com/pubs/104788.html">Maggie Johnson</a> concluded the Summit by describing the different types of research environments across academia and industry.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-8qdHYEnUzoM/V9BV3Q_ycQI/AAAAAAAABLQ/H0sGdTPDX7AasHoWHZ_0nwb25dRnKjUiwCLcB/s1600/img2.png"><img border="0" height="202" src="https://2.bp.blogspot.com/-8qdHYEnUzoM/V9BV3Q_ycQI/AAAAAAAABLQ/H0sGdTPDX7AasHoWHZ_0nwb25dRnKjUiwCLcB/s640/img2.png" width="640"></a></td></tr><tr><td>(Left) PhD Fellows share their work with Google researchers during the poster session<br />(Right) Research panelists share their journeys through academia and industry</td></tr></tbody></table>Our PhD Fellows represent some of the best and brightest young researchers around the globe in Computer Science, and it is our ongoing goal to support them as they make their mark on the world.<br /><br />We&#8217;d also like to welcome the newest class of Google PhD Fellows recently awarded in China and East Asia, India, and Australia. 
We look forward to seeing each of them at next year&#8217;s summit!<br /><br /><b>2016 Global PhD Fellows</b><br /><br /><b>Computational Neuroscience</b><br />Cameron (Po-Hsuan) Chen, <i>Princeton University</i><br />Grace Lindsay, <i>Columbia University</i><br />Martino Sorbaro Sindaci, <i>The University of Edinburgh</i><br /><br /><b>Human-Computer Interaction</b><br />Dana McKay, <i>University of Melbourne</i><br />Koki Nagano, <i>University of Southern California</i><br />Arvind Satyanarayan, <i>Stanford University</i><br />Amy Xian Zhang, <i>Massachusetts Institute of Technology</i><br /><br /><b>Machine Learning</b><br />Olivier Bachem, <i>Swiss Federal Institute of Technology Zurich</i><br />Tianqi Chen, <i>University of Washington</i><br />Emily Denton, <i>New York University</i><br />Kwan Hui Lim, <i>University of Melbourne</i><br />Yves-Laurent Kom Samo, <i>University of Oxford</i><br />Woosang Lim, <i>Korea Advanced Institute of Science and Technology</i><br />Anirban Santara, <i>Indian Institute of Technology Kharagpur</i><br />Daniel Jaymin Mankowitz, <i>Technion - Israel Institute of Technology</i><br />Lucas Maystre, <i>&#201;cole Polytechnique F&#233;d&#233;rale de Lausanne</i><br />Arvind Neelakantan, <i>University of Massachusetts, Amherst</i><br />Ludwig Schmidt, <i>Massachusetts Institute of Technology</i><br />Quanming Yao, <i>The Hong Kong University of Science and Technology</i><br />Shandian Zhe, <i>Purdue University, West Lafayette</i><br /><br /><b>Machine Perception, Speech Technology and Computer Vision</b><br />Eugen Beck, <i>RWTH Aachen University</i><br />Yu-Wei Chao, <i>University of Michigan, Ann Arbor</i><br />Wei Liu, <i>University of North Carolina at Chapel Hill</i><br />Aron Monszpart, <i>University College London</i><br />Thomas Schoeps, <i>Swiss Federal Institute of Technology Zurich</i><br />Tian Tan, <i>Shanghai Jiao Tong University</i><br />Chia-Yin Tsai, <i>Carnegie Mellon University</i><br />Weitao Xu, <i>University of 
Queensland</i><br /><br /><b>Market Algorithms</b><br />Hossein Esfandiari, <i>University of Maryland, College Park</i><br />Sandy Heydrich, <i>Saarland University - Saarbrucken GSCS</i><br />Rad Niazadeh, <i>Cornell University</i><br />Sadra Yazdanbod, <i>Georgia Institute of Technology</i><br /><br /><b>Mobile Computing</b><br />Lei Kang, <i>University of Wisconsin</i><br />Tauhidur Rahman, <i>Cornell University</i><br />Chungkuk Yoo, <i>Korea Advanced Institute of Science and Technology</i><br />Yuhao Zhu, <i>University of Texas, Austin</i><br /><br /><b>Natural Language Processing</b><br />Tamer Alkhouli, <i>RWTH Aachen University</i><br />Jose Camacho Collados, <i>Sapienza - Universit&#224; di Roma</i><br /><br /><b>Privacy and Security</b><br />Chitra Javali, <i>University of New South Wales</i><br />Kartik Nayak, <i>University of Maryland, College Park</i><br />Nicolas Papernot, <i>Pennsylvania State University</i><br />Damian Vizar, <i>&#201;cole Polytechnique F&#233;d&#233;rale de Lausanne</i><br />Xi Wu, <i>University of Wisconsin</i><br /><br /><b>Programming Languages, Algorithms and Software Engineering</b><br />Marcelo Sousa, <i>University of Oxford</i><br />Arpita Biswas, <i>Indian Institute of Science</i><br /><br /><b>Structured Data and Database Management</b><br />Xiang Ren, <i>University of Illinois, Urbana-Champaign</i><br /><br /><b>Systems and Networking</b><br />Ying Chen, <i>Tsinghua University</i><br />Andrew Crotty, <i>Brown University</i><br />Aniruddha Singh Kushwaha, <i>Indian Institute of Technology Bombay</i><br />Ilias Marinos, <i>University of Cambridge</i><br />Kay Ousterhout, <i>University of California, Berkeley</i><br />Hong Zhang, <i>The Hong Kong University of Science and Technology</i>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Michael Rennaker, Program Manager, University Relations</span><br /><br />In 2009, Google created the <a href="http://research.google.com/research-outreach.html#/research-outreach/graduate-fellowships">PhD Fellowship Program</a> to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our Fellowships have helped support over 250 graduate students in <a href="http://google-au.blogspot.com.au/2015/08/phd-fellowships-to-support-cutting-edge.html">Australia</a>, <a href="http://www.google.cn/intl/en/university/research/phdfellowship.html">China and East Asia</a>, <a href="https://research.google.com/university/relations/phd_fellowship_india.html">India</a>, <a href="http://googleresearch.blogspot.com/2015/02/announcing-2015-north-american-google.html">North America</a>, <a href="http://googleresearch.blogspot.com/2015/06/announcing-2015-google-european.html">Europe and the Middle East</a> who seek to shape and influence the future of technology.<br /><br />Recently, Google PhD Fellows from around the globe converged on our Mountain View campus for the first annual Global PhD Fellowship Summit.  The students heard talks from researchers like <a href="http://research.google.com/pubs/jeff.html">Jeff Dean</a>, <a href="http://research.google.com/pubs/author21120.html">Françoise Beaufays</a>, <a href="http://research.google.com/pubs/author205.html">Peter Norvig</a>, <a href="http://research.google.com/pubs/MayaGupta.html">Maya Gupta</a> and <a href="http://research.google.com/pubs/AminVahdat.html">Amin Vahdat</a>, and got a glimpse into some of the state-of-the-art research pursued across Google.  
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-FKBH8S5My3A/V9BVvjGxl-I/AAAAAAAABLM/oa1bwsjCrYMFv8GOePgQsPgeR9PAPACwQCLcB/s1600/image02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="288" src="https://4.bp.blogspot.com/-FKBH8S5My3A/V9BVvjGxl-I/AAAAAAAABLM/oa1bwsjCrYMFv8GOePgQsPgeR9PAPACwQCLcB/s640/image02.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Senior Google Fellow Jeff Dean shares how TensorFlow is used at Google</td></tr></tbody></table>Fellows also had the chance to connect one-on-one with Googlers to discuss their research, as well as receive feedback from leaders in their fields. The event wrapped up with a panel discussion with <a href="http://research.google.com/pubs/author173.html">Dan Russell</a>, <a href="http://research.google.com/pubs/KristenLeFevre.html">Kristen LeFevre</a>, <a href="http://research.google.com/pubs/author39086.html">Douglas Eck</a> and <a href="http://research.google.com/pubs/author21120.html">Françoise Beaufays</a> about their unique career paths. 
<a href="https://research.google.com/pubs/104788.html">Maggie Johnson</a> concluded the Summit by describing the different types of research environments across academia and industry.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-8qdHYEnUzoM/V9BV3Q_ycQI/AAAAAAAABLQ/H0sGdTPDX7AasHoWHZ_0nwb25dRnKjUiwCLcB/s1600/img2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="202" src="https://2.bp.blogspot.com/-8qdHYEnUzoM/V9BV3Q_ycQI/AAAAAAAABLQ/H0sGdTPDX7AasHoWHZ_0nwb25dRnKjUiwCLcB/s640/img2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">(Left) PhD Fellows share their work with Google researchers during the poster session<br />(Right) Research panelists share their journeys through academia and industry</td></tr></tbody></table>Our PhD Fellows represent some of the best and brightest young researchers around the globe in Computer Science, and it is our ongoing goal to support them as they make their mark on the world.<br /><br />We’d also like to welcome the newest class of Google PhD Fellows recently awarded in China and East Asia, India, and Australia. 
We look forward to seeing each of them at next year’s summit!<br /><br /><b>2016 Global PhD Fellows</b><br /><br /><b>Computational Neuroscience</b><br />Cameron (Po-Hsuan) Chen, <i>Princeton University</i><br />Grace Lindsay, <i>Columbia University</i><br />Martino Sorbaro Sindaci, <i>The University of Edinburgh</i><br /><br /><b>Human-Computer Interaction</b><br />Dana McKay, <i>University of Melbourne</i><br />Koki Nagano, <i>University of Southern California</i><br />Arvind Satyanarayan, <i>Stanford University</i><br />Amy Xian Zhang, <i>Massachusetts Institute of Technology</i><br /><br /><b>Machine Learning</b><br />Olivier Bachem, <i>Swiss Federal Institute of Technology Zurich</i><br />Tianqi Chen, <i>University of Washington</i><br />Emily Denton, <i>New York University</i><br />Kwan Hui Lim, <i>University of Melbourne</i><br />Yves-Laurent Kom Samo, <i>University of Oxford</i><br />Woosang Lim, <i>Korea Advanced Institute of Science and Technology</i><br />Anirban Santara, <i>Indian Institute of Technology Kharagpur</i><br />Daniel Jaymin Mankowitz, <i>Technion - Israel Institute of Technology</i><br />Lucas Maystre, <i>École Polytechnique Fédérale de Lausanne</i><br />Arvind Neelakantan, <i>University of Massachusetts, Amherst</i><br />Ludwig Schmidt, <i>Massachusetts Institute of Technology</i><br />Quanming Yao, <i>The Hong Kong University of Science and Technology</i><br />Shandian Zhe, <i>Purdue University, West Lafayette</i><br /><br /><b>Machine Perception, Speech Technology and Computer Vision</b><br />Eugen Beck, <i>RWTH Aachen University</i><br />Yu-Wei Chao, <i>University of Michigan, Ann Arbor</i><br />Wei Liu, <i>University of North Carolina at Chapel Hill</i><br />Aron Monszpart, <i>University College London</i><br />Thomas Schoeps, <i>Swiss Federal Institute of Technology Zurich</i><br />Tian Tan, <i>Shanghai Jiao Tong University</i><br />Chia-Yin Tsai, <i>Carnegie Mellon University</i><br />Weitao Xu, <i>University of Queensland</i><br 
/><br /><b>Market Algorithms</b><br />Hossein Esfandiari, <i>University of Maryland, College Park</i><br />Sandy Heydrich, <i>Saarland University - Saarbrucken GSCS</i><br />Rad Niazadeh, <i>Cornell University</i><br />Sadra Yazdanbod, <i>Georgia Institute of Technology</i><br /><br /><b>Mobile Computing</b><br />Lei Kang, <i>University of Wisconsin</i><br />Tauhidur Rahman, <i>Cornell University</i><br />Chungkuk Yoo, <i>Korea Advanced Institute of Science and Technology</i><br />Yuhao Zhu, <i>University of Texas, Austin</i><br /><br /><b>Natural Language Processing</b><br />Tamer Alkhouli, <i>RWTH Aachen University</i><br />Jose Camacho Collados, <i>Sapienza - Università di Roma</i><br /><br /><b>Privacy and Security</b><br />Chitra Javali, <i>University of New South Wales</i><br />Kartik Nayak, <i>University of Maryland, College Park</i><br />Nicolas Papernot, <i>Pennsylvania State University</i><br />Damian Vizar, <i>École Polytechnique Fédérale de Lausanne</i><br />Xi Wu, <i>University of Wisconsin</i><br /><br /><b>Programming Languages, Algorithms and Software Engineering</b><br />Marcelo Sousa, <i>University of Oxford</i><br />Arpita Biswas, <i>Indian Institute of Science</i><br /><br /><b>Structured Data and Database Management</b><br />Xiang Ren, <i>University of Illinois, Urbana-Champaign</i><br /><br /><b>Systems and Networking</b><br />Ying Chen, <i>Tsinghua University</i><br />Andrew Crotty, <i>Brown University</i><br />Aniruddha Singh Kushwaha, <i>Indian Institute of Technology Bombay</i><br />Ilias Marinos, <i>University of Cambridge</i><br />Kay Ousterhout, <i>University of California, Berkeley</i><br />Hong Zhang, <i>The Hong Kong University of Science and Technology</i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-the-first-annual-global-phd-fellowship-summit-and-the-2016-google-phd-fellows/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reproducible Science: Cancer Researchers Embrace Containers in the Cloud</title>
		<link>https://googledata.org/google-research/reproducible-science-cancer-researchers-embrace-containers-in-the-cloud/</link>
		<comments>https://googledata.org/google-research/reproducible-science-cancer-researchers-embrace-containers-in-the-cloud/#comments</comments>
		<pubDate>Tue, 06 Sep 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=e48e8d165ac0bdc98bb5b95683101ad4</guid>
		<description><![CDATA[<span>Posted by Dr. Kyle Ellrott, Oregon Health and Sciences University, Dr. Josh Stuart, University of California Santa Cruz, and Dr. Paul Boutros, Ontario Institute for Cancer Research</span><br /><br /><i>Today we hear from the principal investigators of the ICGC-TCGA DREAM Somatic Mutation Calling Challenges about how they are encouraging cancer researchers to make use of Docker and Google Cloud Platform to gain a deeper understanding of the complex genetic mutations that occur in cancer, while doing so in a reproducible way.</i><br /><i>&#8211; Nicole Deflaux and Jonathan Bingham, Google Genomics</i><br /><br />Today&#8217;s genomic analysis software tools often give different answers when run in different computing environments - that&#8217;s like getting a different diagnosis from your doctor depending on which examination room you&#8217;re sitting in. <a href="https://en.wikipedia.org/wiki/Reproducibility">Reproducible</a> science matters, especially in cancer research where so many lives are at stake.  The <a href="http://www.cancermoonshot2020.org/">Cancer Moonshot</a> has called for the research world to '<a href="https://www.whitehouse.gov/the-press-office/2016/01/12/inspiring-new-generation-defy-bounds-innovation-moonshot-cure-cancer">Break down silos and bring all the cancer fighters together</a>'.  Portable software &#8220;<a href="https://en.wikipedia.org/wiki/Software_container">containers</a>&#8221; and cloud computing hold the potential to help achieve these goals by making scientific data analysis more reproducible, reusable and scalable. 
<br /><br />Our team of researchers from the <a href="http://oicr.on.ca/">Ontario Institute for Cancer Research</a>, <a href="https://genomics.soe.ucsc.edu/">University of California Santa Cruz</a>, <a href="http://sagebase.org/">Sage Bionetworks</a> and <a href="https://www.ohsu.edu/xd/education/schools/school-of-medicine/departments/computational-biology/">Oregon Health and Sciences University</a> is pushing the frontiers by encouraging scientists to package up their software in reusable <a href="https://www.docker.com/">Docker</a> containers and make use of cloud-resident data from the <a href="https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots">Cancer Cloud Pilots funded by the National Cancer Institute</a>.<br /><br />In 2014 we initiated the <a href="https://research.googleblog.com/2014/07/facilitating-genomics-research-with.html">ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenges</a> where Google provided credits on <a href="https://cloud.google.com/">Google Cloud Platform</a>. The first result of this collaboration was the DREAM-SMC DNA challenge, a public challenge that engaged cancer researchers from around the world to find the best methods for discovering <a href="https://en.wikipedia.org/wiki/Mutation#Somatic_mutations">DNA somatic mutations</a>. By the end of the challenge, over 400 registered participants competed by submitting 3,500 open-source entries for 14 test genomes, <a href="http://www.nature.com/nmeth/journal/v12/n7/full/nmeth.3407.html">providing key insights</a> on the strengths and limitations of the current mutation detection methods.<br /><br />The SMC-DNA challenge enabled comparison of results, but it did little to facilitate the exchange of cross-platform software tools. 
Accessing extremely large genome sequence input files and shepherding complex software pipelines created a &#8220;double whammy&#8221; that discouraged data sharing and software reuse.<br /><br />How can we overcome these barriers?<br /><br />Exciting developments have taken place in the past couple of years that may annihilate these last barriers. The availability of cloud technologies and <a href="https://en.wikipedia.org/wiki/Docker_(software)">containerization</a> can serve as the vanguards of reproducibility and interoperability.<br /><br />Thus, a new way of creating open DREAM challenges has emerged: rather than encouraging the status quo where participants run their own methods themselves on their own systems, and the results cannot be verified, the new challenge design requires participants to submit open-source code packaged in Docker containers so that anyone can run their methods and verify the results. Real-time leaderboards show which entries are winning, and top performers have a chance to claim a prize. <br /><br />Working with Google Genomics and Google Cloud Platform, the DREAM-SMC organizers are now using cloud and containerization technologies to enable portability and reproducibility as a core part of the DREAM challenges. The latest SMC installments, the <a href="https://www.synapse.org/SMCHet">SMC-Het Challenge</a> and the <a href="https://www.synapse.org/SMC_RNA">SMC-RNA Challenge</a>, have implemented this new plan:<br /><br /><ul><li><a href="https://www.synapse.org/SMCHet">SMC-Het Challenge</a>: Tumour biopsies are composed of many different cell types in addition to tumour cells, including normal tissue and infiltrating immune cells. Furthermore, the tumours themselves are made of a mixture of different subpopulations, all related to one another through cell division and mutation. Critically, each sub-population can have distinct clinical outcomes, with some more resistant to treatment or more likely to metastasize than others. 
The goal of the SMC-Het Challenge is to identify the best methods for predicting <a href="https://en.wikipedia.org/wiki/Tumour_heterogeneity">tumor subpopulations</a> and their &#8220;family tree&#8221; of relatedness from genome sequencing data.</li><li><a href="https://www.synapse.org/SMC_RNA">SMC-RNA Challenge</a>: The alteration of RNA production is a fundamental mechanism by which cancer cells rewire cellular circuitry. Genomic rearrangements in cancer cells can produce fused protein products that can bestow Frankenstein-like properties. Both RNA abundances and novel fusions can serve as the basis for clinically important prognostic biomarkers. The SMC-RNA Challenge will identify the best methods to detect such rogue expressed RNAs in cancer cells.</li></ul><br />Ultimately, success will be gauged by the amount of serious participation in these latest competitions. So far, the signs are encouraging. SMC-Het, which focuses on a very new research area, launched in November 2015 and has already enlisted 18 teams contributing over 70 submissions. SMC-RNA just recently launched and will run until early 2017, with several of the world leaders in the field starting to prepare entries. What&#8217;s great about the submissions being packaged in containers is that even after the challenges end, the tested methods can be applied and further adapted by anyone around the world.<br /><br />Thus, the moon shot need not be a lucky solo attempt made by one hero in one moment of inspiration. Instead, the new informatics of clouds and containers will enable us to combine intelligence so we can build a series of bridges from here to there.<br /><br />To participate in the DREAM challenges, visit the <a href="https://www.synapse.org/SMCHet">SMC-Het</a> and <a href="https://www.synapse.org/SMC_RNA">SMC-RNA</a> Challenge sites.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Dr. Kyle Ellrott, Oregon Health and Sciences University, Dr. Josh Stuart, University of California Santa Cruz, and Dr. Paul Boutros, Ontario Institute for Cancer Research</span><br /><br /><i>Today we hear from the principal investigators of the ICGC-TCGA DREAM Somatic Mutation Calling Challenges about how they are encouraging cancer researchers to make use of Docker and Google Cloud Platform to gain a deeper understanding of the complex genetic mutations that occur in cancer, while doing so in a reproducible way.</i><br /><i>– Nicole Deflaux and Jonathan Bingham, Google Genomics</i><br /><br />Today’s genomic analysis software tools often give different answers when run in different computing environments - that’s like getting a different diagnosis from your doctor depending on which examination room you’re sitting in. <a href="https://en.wikipedia.org/wiki/Reproducibility">Reproducible</a> science matters, especially in cancer research where so many lives are at stake.  The <a href="http://www.cancermoonshot2020.org/">Cancer Moonshot</a> has called for the research world to '<a href="https://www.whitehouse.gov/the-press-office/2016/01/12/inspiring-new-generation-defy-bounds-innovation-moonshot-cure-cancer">Break down silos and bring all the cancer fighters together</a>'.  Portable software “<a href="https://en.wikipedia.org/wiki/Software_container">containers</a>” and cloud computing hold the potential to help achieve these goals by making scientific data analysis more reproducible, reusable and scalable. 
<br /><br />Our team of researchers from the <a href="http://oicr.on.ca/">Ontario Institute for Cancer Research</a>, <a href="https://genomics.soe.ucsc.edu/">University of California Santa Cruz</a>, <a href="http://sagebase.org/">Sage Bionetworks</a> and <a href="https://www.ohsu.edu/xd/education/schools/school-of-medicine/departments/computational-biology/">Oregon Health and Sciences University</a> is pushing the frontiers by encouraging scientists to package up their software in reusable <a href="https://www.docker.com/">Docker</a> containers and make use of cloud-resident data from the <a href="https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots">Cancer Cloud Pilots funded by the National Cancer Institute</a>.<br /><br />In 2014 we initiated the <a href="https://research.googleblog.com/2014/07/facilitating-genomics-research-with.html">ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenges</a> where Google provided credits on <a href="https://cloud.google.com/">Google Cloud Platform</a>. The first result of this collaboration was the DREAM-SMC DNA challenge, a public challenge that engaged cancer researchers from around the world to find the best methods for discovering <a href="https://en.wikipedia.org/wiki/Mutation#Somatic_mutations">DNA somatic mutations</a>. By the end of the challenge, over 400 registered participants competed by submitting 3,500 open-source entries for 14 test genomes, <a href="http://www.nature.com/nmeth/journal/v12/n7/full/nmeth.3407.html">providing key insights</a> on the strengths and limitations of the current mutation detection methods.<br /><br />The SMC-DNA challenge enabled comparison of results, but it did little to facilitate the exchange of cross-platform software tools. 
Accessing extremely large genome sequence input files and shepherding complex software pipelines created a “double whammy” to discourage data sharing and software reuse.<br /><br />How can we overcome these barriers?<br /><br />Exciting developments have taken place in the past couple of years that may annihilate these last barriers. The availability of cloud technologies and <a href="https://en.wikipedia.org/wiki/Docker_(software)">containerization</a> can serve as the vanguards of reproducibility and interoperability.<br /><br />Thus, a new way of creating open DREAM challenges has emerged: rather than encouraging the status quo where participants run their own methods themselves on their own systems, and the results cannot be verified, the new challenge design requires participants to submit open-source code packaged in Docker containers so that anyone can run their methods and verify the results. Real-time leaderboards show which entries are winning and top performers have a chance to claim a prize. <br /><br />Working with Google Genomics and Google Cloud Platform, the DREAM-SMC organizers are now using cloud and containerization technologies to enable portability and reproducibility as a core part of the DREAM challenges. The latest SMC installments, the <a href="https://www.synapse.org/SMCHet">SMC-Het Challenge</a> and the <a href="https://www.synapse.org/SMC_RNA">SMC-RNA Challenge</a> have implemented this new plan:<br /><br /><ul><li><a href="https://www.synapse.org/SMCHet">SMC-Het Challenge</a>: Tumour biopsies are composed of many different cell types in addition to tumour cells, including normal tissue and infiltrating immune cells. Furthermore, the tumours themselves are made of a mixture of different subpopulations, all related to one another through cell division and mutation. Critically, each sub-population can have distinct clinical outcomes, with some more resistant to treatment or more likely to metastasize than others. 
The goal of the SMC-Het Challenge is to identify the best methods for predicting <a href="https://en.wikipedia.org/wiki/Tumour_heterogeneity">tumor subpopulations</a> and their “family tree” of relatedness from genome sequencing data.</li><li><a href="https://www.synapse.org/SMC_RNA">SMC-RNA Challenge</a>: The alteration of RNA production is a fundamental mechanism by which cancer cells rewire cellular circuitry. Genomic rearrangements in cancer cells can produce fused protein products that can bestow Frankenstein-like properties. Both RNA abundances and novel fusions can serve as the basis for clinically-important prognostic biomarkers. The SMC-RNA Challenge will identify the best methods to detect such rogue expressed RNAs in cancer cells.</li></ul><br />Ultimately, success will be gauged by the level of serious participation in these latest competitions. So far, the signs are encouraging. SMC-Het, which focuses on a very new research area, launched in November 2015 and has already enlisted 18 teams contributing over 70 submissions. SMC-RNA just recently launched and will run until early 2017, with several of the world leaders in the field starting to prepare entries. Because the submissions are packaged in containers, even after the challenges end the tested methods can be applied and further adapted by anyone around the world.<br /><br />Thus, the moon shot need not be a lucky solo attempt made by one hero in one moment of inspiration. Instead, the new informatics of clouds and containers will enable us to combine intelligence so we can build a series of bridges from here to there.<br /><br />To participate in the DREAM challenges, visit the <a href="https://www.synapse.org/SMCHet">SMC-Het</a> and <a href="https://www.synapse.org/SMC_RNA">SMC-RNA</a> Challenge sites.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/reproducible-science-cancer-researchers-embrace-containers-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Improving Inception and Image Classification in TensorFlow</title>
		<link>https://googledata.org/google-research/improving-inception-and-image-classification-in-tensorflow/</link>
		<comments>https://googledata.org/google-research/improving-inception-and-image-classification-in-tensorflow/#comments</comments>
		<pubDate>Wed, 31 Aug 2016 19:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=148fb035b492b738fa7c8267ca9d78d7</guid>
		<description><![CDATA[<span>Posted by Alex Alemi, Software Engineer</span><br /><a href="https://research.googleblog.com/2016/08/tf-slim-high-level-library-to-define.html"><br /></a> Earlier this week, <a href="https://research.googleblog.com/2016/08/tf-slim-high-level-library-to-define.html">we announced the latest release of the TF-Slim library</a> for TensorFlow, a lightweight package for defining, training and evaluating models, as well as checkpoints and model definitions for several competitive networks in the field of image classification. <br /><br />In order to spur even further progress in the field, today we are happy to announce the release of <a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py">Inception-ResNet-v2</a>, a convolutional neural network (CNN) that achieves a new state of the art in terms of accuracy on the <a href="http://image-net.org/challenges/LSVRC/2012/">ILSVRC image classification benchmark</a>. Inception-ResNet-v2 is a variation of our earlier <a href="http://arxiv.org/abs/1512.00567">Inception V3</a> model, which borrows some ideas from Microsoft's ResNet papers <a href="https://arxiv.org/abs/1512.03385">[1]</a><a href="https://arxiv.org/abs/1603.05027">[2]</a>. The full details of the model are in our arXiv preprint <a href="http://arxiv.org/abs/1602.07261"><i>Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning</i></a>.<br /><br />Residual connections allow shortcuts in the model and have allowed researchers to successfully train even deeper neural networks, which have led to even better performance. This has also enabled significant simplification of the Inception blocks. 
Just compare the model architectures in the figures below:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-9KD48z54MBs/V8cVz11fM0I/AAAAAAAABKM/sCC0vVEz_dMOsyb0D8AFwqkrrCavdlkSACLcB/s1600/image02.png"><img border="0" height="238" src="https://2.bp.blogspot.com/-9KD48z54MBs/V8cVz11fM0I/AAAAAAAABKM/sCC0vVEz_dMOsyb0D8AFwqkrrCavdlkSACLcB/s640/image02.png" width="640"></a></td></tr><tr><td>Schematic diagram of Inception V3</td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-O7AznVGY9js/V8cV_wKKsMI/AAAAAAAABKQ/maO7n2w3dT4Pkcmk7wgGqiSX5FUW2sfZgCLcB/s1600/image00.png"><img border="0" height="402" src="https://1.bp.blogspot.com/-O7AznVGY9js/V8cV_wKKsMI/AAAAAAAABKQ/maO7n2w3dT4Pkcmk7wgGqiSX5FUW2sfZgCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Schematic diagram of Inception-ResNet-v2</td></tr></tbody></table>At the top of the second Inception-ResNet-v2 figure, you'll see the full network expanded.  Notice that this network is considerably deeper than the previous Inception V3.  Below in the main figure is an easier to read version of the same network where the repeated residual blocks have been compressed.  Here, notice that the inception blocks have been simplified, containing fewer parallel towers than the previous Inception V3.<br /><br />The Inception-ResNet-v2 architecture is more accurate than previous state of the art models, as shown in the table below, which reports the Top-1 and Top-5 validation accuracies on the <a href="http://image-net.org/challenges/LSVRC/2012/">ILSVRC 2012 image classification benchmark</a> based on a single crop of the image.  
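Top-k accuracy simply counts a prediction as correct if the true class appears among the k highest-scoring classes. The following is a generic pure-Python sketch of the metric (not the evaluation code used for these benchmarks):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring
    classes. `scores` is a list of per-class score lists; `labels` holds the
    ground-truth class indices."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores, highest first.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]  # the second example's true class is only third-ranked
print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 3))  # 1.0
```

Top-1 is the usual notion of accuracy; Top-5, the other column reported below, is more forgiving of confusions among the 1000 fine-grained ILSVRC classes.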
Furthermore, this new model only requires roughly twice the memory and computation compared to Inception V3.<br /><br /><table border="1" cellspacing="0"><tbody><tr><td><b></b><br /><b>Model</b></td> <td><div><b>Architecture</b></div></td> <td><b></b><br /><b>Checkpoint</b></td> <td><b></b><br /><b>Top-1 Accuracy</b></td> <td><b></b><br /><b>Top-5 Accuracy</b></td> </tr><tr><td><div><a href="http://arxiv.org/abs/1602.07261" target="_blank">Inception-ResNet-v2</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py" target="_blank"></a><br /><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py" target="_blank">Code</a></td> <td><div><a href="http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz" target="_blank">inception_resnet_v2_2016_08_30.tar.gz</a></div></td> <td>80.4</td> <td>95.3</td> </tr><tr><td><div><a href="http://arxiv.org/abs/1512.00567" target="_blank">Inception V3</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_v3.py" target="_blank"></a><br /><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_v3.py" target="_blank">Code</a></td> <td><div><a href="http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz" target="_blank">inception_v3_2016_08_28.tar.gz</a></div></td> <td>78.0</td> <td>93.9</td> </tr><tr><td><div><a href="https://arxiv.org/abs/1512.03385" target="_blank">ResNet 152</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v1.py" target="_blank"></a><br /><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v1.py" target="_blank">Code</a></td> <td><div><a href="http://download.tensorflow.org/models/resnet_v1_152_2016_08_28.tar.gz" target="_blank">resnet_v1_152_2016_08_28.tar.gz</a></div></td> <td>76.8</td> <td>93.2</td> </tr><tr><td><div><a href="http://arxiv.org/abs/1603.05027" 
target="_blank">ResNet V2 200</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v2.py" target="_blank"></a><br /><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v2.py" target="_blank">Code</a></td> <td><div>TBA</div></td> <td>79.9*</td> <td>95.2*</td> </tr></tbody></table><div><span>(*): Results quoted in ResNet paper.</span></div><br />As an example, while both Inception V3 and Inception-ResNet-v2 models excel at identifying individual dog breeds, the new model does noticeably better. For instance, whereas the old model mistakenly reported Alaskan Malamute for the picture on the right, the new Inception-ResNet-v2 model correctly identifies the dog breeds in both images.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-icBtPPVrGF4/V8cXIbbXxaI/AAAAAAAABLA/PTu1-rdmS5A7zn0nFcjA4XmyUpFE3epnwCEw/s1600/performance.png"><img border="0" height="242" src="https://1.bp.blogspot.com/-icBtPPVrGF4/V8cXIbbXxaI/AAAAAAAABLA/PTu1-rdmS5A7zn0nFcjA4XmyUpFE3epnwCEw/s640/performance.png" width="640"></a></td></tr><tr><td>An <a href="https://en.wikipedia.org/wiki/Alaskan_Malamute">Alaskan Malamute</a> (<a href="https://en.wikipedia.org/wiki/Alaskan_Malamute#/media/File:Alaskan_Malamute.jpg">left</a>) and a <a href="https://en.wikipedia.org/wiki/Siberian_Husky">Siberian Husky</a> (<a href="https://commons.wikimedia.org/wiki/Siberian_Husky#/media/File:Siberian-husky.jpg">right</a>). 
Images from Wikipedia</td></tr></tbody></table>In order to allow people to  immediately begin experimenting, we are also releasing a <a href="http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz">pre-trained instance</a> of the new Inception-ResNet-v2, as part of the <a href="https://github.com/tensorflow/models/blob/master/slim/">TF-Slim Image Model Library</a>.<br /><br />We are excited to see what the community does with this improved model, following along as people adapt it and compare its performance on various tasks. Want to get started? See the accompanying <a href="https://github.com/tensorflow/models/blob/master/slim/README.md">instructions</a> on how to train, evaluate or fine-tune a network.<br /><br />As always, releasing the code was a team effort. Specific thanks are due to:<br /><ul><li><b>Model Architecture</b> - Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi</li><li><b>Systems Infrastructure</b> - Jon Shlens, Benoit Steiner, Mark Sandler, and David Andersen</li><li><b>TensorFlow-Slim</b> - Sergio Guadarrama and Nathan Silberman</li><li><b>Model Visualization</b> - Fernanda Vi&#233;gas and James Wexler</li></ul>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Alex Alemi, Software Engineer</span><br /><a href="https://research.googleblog.com/2016/08/tf-slim-high-level-library-to-define.html"><br /></a> Earlier this week, <a href="https://research.googleblog.com/2016/08/tf-slim-high-level-library-to-define.html">we announced the latest release of the TF-Slim library</a> for TensorFlow, a lightweight package for defining, training and evaluating models, as well as checkpoints and model definitions for several competitive networks in the field of image classification. <br /><br />In order to spur even further progress in the field, today we are happy to announce the release of <a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py">Inception-ResNet-v2</a>, a convolutional neural network (CNN) that achieves a new state of the art in terms of accuracy on the <a href="http://image-net.org/challenges/LSVRC/2012/">ILSVRC image classification benchmark</a>. Inception-ResNet-v2 is a variation of our earlier <a href="http://arxiv.org/abs/1512.00567">Inception V3</a> model, which borrows some ideas from Microsoft's ResNet papers <a href="https://arxiv.org/abs/1512.03385">[1]</a><a href="https://arxiv.org/abs/1603.05027">[2]</a>. The full details of the model are in our arXiv preprint <a href="http://arxiv.org/abs/1602.07261"><i>Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning</i></a>.<br /><br />Residual connections allow shortcuts in the model and have allowed researchers to successfully train even deeper neural networks, which have led to even better performance. This has also enabled significant simplification of the Inception blocks. 
Just compare the model architectures in the figures below:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-9KD48z54MBs/V8cVz11fM0I/AAAAAAAABKM/sCC0vVEz_dMOsyb0D8AFwqkrrCavdlkSACLcB/s1600/image02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://2.bp.blogspot.com/-9KD48z54MBs/V8cVz11fM0I/AAAAAAAABKM/sCC0vVEz_dMOsyb0D8AFwqkrrCavdlkSACLcB/s640/image02.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Schematic diagram of Inception V3</td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-O7AznVGY9js/V8cV_wKKsMI/AAAAAAAABKQ/maO7n2w3dT4Pkcmk7wgGqiSX5FUW2sfZgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="402" src="https://1.bp.blogspot.com/-O7AznVGY9js/V8cV_wKKsMI/AAAAAAAABKQ/maO7n2w3dT4Pkcmk7wgGqiSX5FUW2sfZgCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Schematic diagram of Inception-ResNet-v2</td></tr></tbody></table>At the top of the second Inception-ResNet-v2 figure, you'll see the full network expanded.  Notice that this network is considerably deeper than the previous Inception V3.  Below in the main figure is an easier to read version of the same network where the repeated residual blocks have been compressed.  
Here, notice that the inception blocks have been simplified, containing fewer parallel towers than the previous Inception V3.<br /><br />The Inception-ResNet-v2 architecture is more accurate than previous state of the art models, as shown in the table below, which reports the Top-1 and Top-5 validation accuracies on the <a href="http://image-net.org/challenges/LSVRC/2012/">ILSVRC 2012 image classification benchmark</a> based on a single crop of the image.  Furthermore, this new model only requires roughly twice the memory and computation compared to Inception V3.<br /><br /><table border="1" bordercolor="#888" cellspacing="0" style="border-collapse: collapse; border-color: rgb(136 , 136 , 136); border-width: 1px; width: 100%px;"><tbody><tr> <td><b></b><br /><center><b>Model</b></center></td> <td><div style="text-align: center;"><b>Architecture</b></div></td> <td><b></b><br /><center><b>Checkpoint</b></center></td> <td><b></b><br /><center><b>Top-1 Accuracy</b></center></td> <td><b></b><br /><center><b>Top-5 Accuracy</b></center></td> </tr><tr> <td><div style="text-align: center;"><a href="http://arxiv.org/abs/1602.07261" >Inception-ResNet-v2</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py" ></a><br /><center><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py" >Code</a></center></td> <td><div style="text-align: center;"><a href="http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz" >inception_resnet_v2_2016_08_30.tar.gz</a></div></td> <td><center>80.4</center></td> <td><center>95.3</center></td> </tr><tr> <td><div style="text-align: center;"><a href="http://arxiv.org/abs/1512.00567" >Inception V3</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_v3.py" ></a><br /><center><a href="https://github.com/tensorflow/models/blob/master/slim/nets/inception_v3.py" >Code</a></center></td> <td><div 
style="text-align: center;"><a href="http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz" >inception_v3_2016_08_28.tar.gz</a></div></td> <td><center>78.0</center></td> <td><center>93.9</center></td> </tr><tr> <td><div style="text-align: center;"><a href="https://arxiv.org/abs/1512.03385" >ResNet 152</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v1.py" ></a><br /><center><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v1.py" >Code</a></center></td> <td><div style="text-align: center;"><a href="http://download.tensorflow.org/models/resnet_v1_152_2016_08_28.tar.gz" >resnet_v1_152_2016_08_28.tar.gz</a></div></td> <td><center>76.8</center></td> <td><center>93.2</center></td> </tr><tr> <td><div style="text-align: center;"><a href="http://arxiv.org/abs/1603.05027" >ResNet V2 200</a></div></td> <td><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v2.py" ></a><br /><center><a href="https://github.com/tensorflow/models/blob/master/slim/nets/resnet_v2.py" >Code</a></center></td> <td><div style="text-align: center;">TBA</div></td> <td><center>79.9*</center></td> <td><center>95.2*</center></td> </tr></tbody> </table><div style="text-align: center;"><span style="font-size: x-small;">(*): Results quoted in ResNet paper.</span></div><br />As an example, while both Inception V3 and Inception-ResNet-v2 models excel at identifying individual dog breeds, the new model does noticeably better. 
For instance, whereas the old model mistakenly reported Alaskan Malamute for the picture on the right, the new Inception-ResNet-v2 model correctly identifies the dog breeds in both images.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-icBtPPVrGF4/V8cXIbbXxaI/AAAAAAAABLA/PTu1-rdmS5A7zn0nFcjA4XmyUpFE3epnwCEw/s1600/performance.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="242" src="https://1.bp.blogspot.com/-icBtPPVrGF4/V8cXIbbXxaI/AAAAAAAABLA/PTu1-rdmS5A7zn0nFcjA4XmyUpFE3epnwCEw/s640/performance.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">An <a href="https://en.wikipedia.org/wiki/Alaskan_Malamute">Alaskan Malamute</a> (<a href="https://en.wikipedia.org/wiki/Alaskan_Malamute#/media/File:Alaskan_Malamute.jpg">left</a>) and a <a href="https://en.wikipedia.org/wiki/Siberian_Husky">Siberian Husky</a> (<a href="https://commons.wikimedia.org/wiki/Siberian_Husky#/media/File:Siberian-husky.jpg">right</a>). Images from Wikipedia</td></tr></tbody></table>In order to allow people to  immediately begin experimenting, we are also releasing a <a href="http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz">pre-trained instance</a> of the new Inception-ResNet-v2, as part of the <a href="https://github.com/tensorflow/models/blob/master/slim/">TF-Slim Image Model Library</a>.<br /><br />We are excited to see what the community does with this improved model, following along as people adapt it and compare its performance on various tasks. Want to get started? See the accompanying <a href="https://github.com/tensorflow/models/blob/master/slim/README.md">instructions</a> on how to train, evaluate or fine-tune a network.<br /><br />As always, releasing the code was a team effort. 
Specific thanks are due to:<br /><ul><li><b>Model Architecture</b> - Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi</li><li><b>Systems Infrastructure</b> - Jon Shlens, Benoit Steiner, Mark Sandler, and David Andersen</li><li><b>TensorFlow-Slim</b> - Sergio Guadarrama and Nathan Silberman</li><li><b>Model Visualization</b> - Fernanda Viégas and James Wexler</li></ul>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/improving-inception-and-image-classification-in-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TF-Slim: A high level library to define complex models in TensorFlow</title>
		<link>https://googledata.org/google-research/tf-slim-a-high-level-library-to-define-complex-models-in-tensorflow/</link>
		<comments>https://googledata.org/google-research/tf-slim-a-high-level-library-to-define-complex-models-in-tensorflow/#comments</comments>
		<pubDate>Tue, 30 Aug 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=fafc75db26a416e91d2020a332955d5e</guid>
		<description><![CDATA[Posted by Nathan Silberman and Sergio Guadarrama, Google Research  Earlier this year, we released a TensorFlow implementation of a state-of-the-art image classification model known as Inception-V3. This code allowed users to train the model on the  Ima...]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Nathan Silberman and Sergio Guadarrama, Google Research</span>  <br /><br />Earlier this year, <a href="https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html">we released</a> a TensorFlow implementation of a state-of-the-art image classification model known as <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44903.pdf">Inception-V3</a>. This code allowed users to train the model on the  <a href="http://image-net.org/challenges/LSVRC/2012/">ImageNet classification dataset</a> via synchronized gradient descent, using either a single local machine or a cluster of machines. The Inception-V3 model was built on an experimental <a href="https://www.tensorflow.org/">TensorFlow</a> library called <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim">TF-Slim</a>, a lightweight package for defining, training and evaluating models in TensorFlow. The TF-Slim library provides common abstractions which enable users to define models quickly and concisely, while keeping the model architecture transparent and its hyperparameters explicit.<br /><br />Since that release, TF-Slim has grown substantially, with many types of <a href="https://www.tensorflow.org/api_docs/python/contrib.layers.html#layers-contrib">layers</a>, <a href="https://www.tensorflow.org/api_docs/python/contrib.losses.html#losses-contrib">loss functions</a>, and <a href="https://www.tensorflow.org/api_docs/python/contrib.metrics.html#metrics-contrib">evaluation metrics</a> added, along with handy routines for <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py">training</a> and <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/evaluation.py">evaluating</a> models. 
These routines take care of all the details you need to worry about when working at scale, such as reading data in parallel, deploying models on multiple machines, and more. Additionally, we have created the <a href="https://github.com/tensorflow/models/tree/master/slim">TF-Slim Image Models library</a>, which provides definitions and training scripts for many widely used image classification models, using standard datasets. TF-Slim and its components are already widely used within Google, and many of these improvements have already been integrated into <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim">tf.contrib.slim</a>.<br /><br />Today, we are proud to share the latest release of TF-Slim with the TF community. Some highlights of this release include:<br /><ul><li>Many new kinds of <a href="https://www.tensorflow.org/api_docs/python/contrib.layers.html#layers-contrib">layers</a> (such as <a href="http://arxiv.org/abs/1606.00915">Atrous Convolution</a> and <a href="http://www.matthewzeiler.com/pubs/cvpr2010/cvpr2010.pdf">Deconvolution</a>) enabling a much richer family of neural network architectures.</li><li>Support for more loss functions and <a href="https://www.tensorflow.org/api_docs/python/contrib.metrics.html#metrics-contrib">evaluation metrics</a> (e.g., mAP, IoU).</li><li>A <a href="https://github.com/tensorflow/models/blob/master/slim/deployment/model_deploy.py">deployment library</a> to make it easier to perform synchronous or asynchronous training using multiple GPUs/CPUs, on the same machine or on multiple machines.</li><li><a href="https://github.com/tensorflow/models/tree/master/slim/nets">Code</a> to define and train many widely used image classification models (e.g., <a href="http://arxiv.org/abs/1512.00567">Inception</a><sup>[1][2][3]</sup>, <a href="http://www.robots.ox.ac.uk/~vgg/research/very_deep/">VGG</a><sup>[4]</sup>, <a 
href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a><sup>[5]</sup>, <a href="https://arxiv.org/abs/1512.03385">ResNet</a><sup>[6]</sup>).</li><li><a href="https://github.com/tensorflow/models/tree/master/slim#pre-trained-models">Pre-trained</a> model weights for the above image classification models. These models have been trained on the <a href="http://image-net.org/challenges/LSVRC/2012/">ImageNet classification dataset</a>, but can be used for many other computer vision tasks. As a simple example, we provide code to <a href="https://github.com/tensorflow/models/tree/master/slim#fine-tuning-a-model-from-an-existing-checkpoint">fine-tune</a> these classifiers to a new set of output labels.</li><li><a href="https://github.com/tensorflow/models/tree/master/slim/datasets">Tools</a> to easily process standard image datasets, such as <a href="http://www.image-net.org/challenges/LSVRC/">ImageNet</a>, <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR10</a> and <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>.</li></ul>Want to get started using TF-Slim? See the <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim">README</a> for details. Interested in working with image classification models? See these <a href="https://github.com/tensorflow/models/blob/master/slim/README.md">instructions</a> or this <a href="https://github.com/tensorflow/models/blob/master/slim/slim_walkthough.ipynb">Jupyter notebook</a>.<br /><br />The release of the TF-Slim library and the pre-trained model zoo has been the result of widespread collaboration within Google Research. 
In particular we want to highlight the vital contributions of the following researchers:<br /><ul><li><b>TF-Slim:</b> Sergio Guadarrama, Nathan Silberman.</li><li><b>Model Definitions and Checkpoints:</b> Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jon Shlens, Zbigniew Wojna, Vivek Rathod, George Papandreou, Alex Alemi</li><li><b>Systems Infrastructure:</b> Jon Shlens, Matthieu Devin, Martin Wicke</li><li><b>Jupyter notebook:</b> Nathan Silberman, Kevin Murphy</li></ul><b><i>References:</i></b><br />[1] <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf">Going deeper with convolutions</a>, <i>Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, CVPR 2015</i><br />[2] <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43442.pdf">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a> <i>Sergey Ioffe, Christian Szegedy, ICML 2015</i><br />[3] <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44903.pdf">Rethinking the Inception Architecture for Computer Vision</a>, <i>Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, arXiv technical report 2015</i><br />[4] <a href="http://arxiv.org/pdf/1409.1556">Very Deep Convolutional Networks for Large-Scale Image Recognition</a>, <i>Karen Simonyan, Andrew Zisserman, ICLR 2015</i><br />[5] <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">ImageNet Classification with Deep Convolutional Neural Networks</a>, <i>Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012</i><br />[6] <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a>, <i>Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016</i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/tf-slim-a-high-level-library-to-define-complex-models-in-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Text summarization with TensorFlow</title>
		<link>https://googledata.org/google-research/text-summarization-with-tensorflow/</link>
		<comments>https://googledata.org/google-research/text-summarization-with-tensorflow/#comments</comments>
		<pubDate>Wed, 24 Aug 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=da8915c0bce466d04b40d2971d75e41c</guid>
		<description><![CDATA[<span>Posted by Peter Liu and Xin Pan, Software Engineers, Google Brain Team</span><br /><br />Every day, people rely on a wide variety of sources to stay informed -- from news stories to social media posts to search results. Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the <a href="https://research.google.com/teams/brain/">Google Brain team</a>. <br /><br />Summarization can also serve as an interesting reading comprehension test for machines. To summarize well, machine learning models need to be able to comprehend documents and distill the important information, tasks which are highly challenging for computers, especially as the length of a document increases.<br /><br />In an effort to push this research forward, we&#8217;re open-sourcing <a href="https://github.com/tensorflow/models/tree/master/textsum">TensorFlow model code</a> for the task of generating news headlines on <a href="https://catalog.ldc.upenn.edu/LDC2012T21">Annotated English Gigaword</a>, a dataset often used in summarization research. We also specify the hyper-parameters in the documentation that achieve better than published state-of-the-art on the most commonly used <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">metric</a> as of the time of writing. Below we also provide samples generated by the model.<br /><b><br /></b> <b>Extractive and Abstractive summarization</b><br /><br />One approach to summarization is to extract parts of the document that are deemed interesting by some metric (for example, <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">inverse-document frequency</a>) and join them to form a summary. Algorithms of this flavor are called extractive summarization.<br /><blockquote>Original Text: <i><b>Alice and Bob</b> took the train to <b>visit the zoo</b>. 
They <b>saw</b> a baby giraffe, a lion, and <b>a flock of</b> colorful tropical <b>birds</b>.</i>&#160;</blockquote><blockquote>Extractive Summary: <i>Alice and Bob visit the zoo. saw a flock of birds.</i></blockquote>Above we extract the words bolded in the original text and concatenate them to form a summary. As we can see, sometimes the extractive constraint can make the summary awkward or grammatically strange. <br /><br />Another approach is to simply summarize as humans do, which is to not impose the extractive constraint and allow for rephrasings. This is called abstractive summarization.<br /><blockquote>Abstractive summary: <i>Alice and Bob visited the zoo and saw animals and birds.</i></blockquote>In this example, we used words not in the original text, maintaining more of the information in a similar number of words. It&#8217;s clear we would prefer good abstractive summarizations, but how could an algorithm begin to do this?<br /><br /><b>About the TensorFlow model</b><br /><br />It turns out that for shorter texts, summarization can be learned end-to-end with a deep learning technique called <a href="http://arxiv.org/abs/1409.3215">sequence-to-sequence learning</a>, similar to what makes <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a> possible. In particular, we&#8217;re able to train such models to produce very good headlines for news articles. In this case, the model reads the article text and writes a suitable headline.<br /><br />To get an idea of what the model produces, you can take a look at some examples below. 
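To make the extractive constraint concrete, here is a toy sketch in plain Python (illustrative only, not the released model): choose a set of word positions -- hand-picked here to mirror the bolded words above -- and concatenate them verbatim.

```python
def extractive_summary(text, keep):
    """Concatenate the chosen word positions of the original text, verbatim."""
    words = text.split()
    return " ".join(words[i] for i in keep)

original = ("Alice and Bob took the train to visit the zoo. "
            "They saw a baby giraffe, a lion, and a flock of colorful tropical birds.")
# Positions of the bolded words; a real extractive system would score words
# with a metric such as tf-idf rather than hand-picking them.
print(extractive_summary(original, [0, 1, 2, 7, 8, 9, 11, 18, 19, 20, 23]))
# -> Alice and Bob visit the zoo. saw a flock of birds.
```

The slightly awkward output is exactly the extractive artifact discussed above: the summary can only reuse the original words in place.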
The first column shows the first sentence of a news article which is the model input, and the second column shows what headline the model has written.<br /><br /><div dir="ltr"><table><colgroup><col width="314"><col width="202"></colgroup><tbody><tr><td><div dir="ltr"><span>Input: Article 1st sentence</span></div></td><td><div dir="ltr"><span>Model-written headline</span></div></td></tr><tr><td><div dir="ltr"><span>metro-goldwyn-mayer reported a third-quarter net loss of dlrs 16 million due mainly to the effect of accounting rules adopted this year </span></div></td><td><div dir="ltr"><span>mgm reports 16 million net loss on higher revenue </span></div></td></tr><tr><td><div dir="ltr"><span>starting from july 1, the island province of hainan in southern china will implement strict market access control on all incoming livestock and animal products to prevent the possible spread of epidemic diseases </span></div></td><td><div dir="ltr"><span>hainan to curb spread of diseases</span></div></td></tr><tr><td><div dir="ltr"><span>australian wine exports hit a record 52.1 million liters worth 260 million dollars (143 million us) in september, the government statistics office reported on monday </span></div></td><td><div dir="ltr"><span>australian wine exports hit record high in september</span></div></td></tr></tbody></table></div><br /><b>Future Research</b><br /><br />We&#8217;ve observed that due to the nature of news headlines, the model can generate good headlines from reading just a few sentences from the beginning of the article. Although this task serves as a nice proof-of-concept, we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries. In those tasks training from scratch with this model architecture does not do as well as some other techniques we&#8217;re researching, but it serves as a baseline. 
We hope <a href="https://github.com/tensorflow/models/tree/master/textsum">this release</a> can also serve as a baseline for others in their summarization research.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Peter Liu and Xin Pan, Software Engineers, Google Brain Team</span><br /><br />Every day, people rely on a wide variety of sources to stay informed -- from news stories to social media posts to search results. Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the <a href="https://research.google.com/teams/brain/">Google Brain team</a>. <br /><br />Summarization can also serve as an interesting reading comprehension test for machines. To summarize well, machine learning models need to be able to comprehend documents and distill the important information, tasks which are highly challenging for computers, especially as the length of a document increases.<br /><br />In an effort to push this research forward, we’re open-sourcing <a href="https://github.com/tensorflow/models/tree/master/textsum">TensorFlow model code</a> for the task of generating news headlines on <a href="https://catalog.ldc.upenn.edu/LDC2012T21">Annotated English Gigaword</a>, a dataset often used in summarization research. We also specify the hyper-parameters in the documentation that achieve better than published state-of-the-art on the most commonly used <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">metric</a> as of the time of writing. Below we also provide samples generated by the model.<br /><b><br /></b> <b>Extractive and Abstractive summarization</b><br /><br />One approach to summarization is to extract parts of the document that are deemed interesting by some metric (for example, <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">inverse-document frequency</a>) and join them to form a summary. 
Algorithms of this flavor are called extractive summarization.<br /><blockquote class="tr_bq">Original Text: <i><b>Alice and Bob</b> took the train to <b>visit the zoo</b>. They <b>saw</b> a baby giraffe, a lion, and <b>a flock of</b> colorful tropical <b>birds</b>.</i>&nbsp;</blockquote><blockquote class="tr_bq">Extractive Summary: <i>Alice and Bob visit the zoo. saw a flock of birds.</i></blockquote>Above we extract the words bolded in the original text and concatenate them to form a summary. As we can see, sometimes the extractive constraint can make the summary awkward or grammatically strange. <br /><br />Another approach is to simply summarize as humans do, which is to not impose the extractive constraint and allow for rephrasings. This is called abstractive summarization.<br /><blockquote class="tr_bq">Abstractive summary: <i>Alice and Bob visited the zoo and saw animals and birds.</i></blockquote>In this example, we used words not in the original text, maintaining more of the information in a similar number of words. It’s clear we would prefer good abstractive summarizations, but how could an algorithm begin to do this?<br /><br /><b>About the TensorFlow model</b><br /><br />It turns out that for shorter texts, summarization can be learned end-to-end with a deep learning technique called <a href="http://arxiv.org/abs/1409.3215">sequence-to-sequence learning</a>, similar to what makes <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a> possible. In particular, we’re able to train such models to produce very good headlines for news articles. In this case, the model reads the article text and writes a suitable headline.<br /><br />To get an idea of what the model produces, you can take a look at some examples below. 
The first column shows the first sentence of a news article which is the model input, and the second column shows what headline the model has written.<br /><br /><div dir="ltr" style="margin-left: 33pt;"><table style="border-collapse: collapse; border: none;"><colgroup><col width="314"></col><col width="202"></col></colgroup><tbody><tr style="height: 0px;"><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Input: Article 1st sentence</span></div></td><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Model-written headline</span></div></td></tr><tr style="height: 0px;"><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-variant: 
normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">metro-goldwyn-mayer reported a third-quarter net loss of dlrs 16 million due mainly to the effect of accounting rules adopted this year </span></div></td><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">mgm reports 16 million net loss on higher revenue </span></div></td></tr><tr style="height: 0px;"><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">starting from july 1, the island province of hainan in southern china will implement strict market access control on all incoming livestock and animal products to prevent the possible spread of epidemic diseases </span></div></td><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-weight: 
normal; vertical-align: baseline; white-space: pre-wrap;">hainan to curb spread of diseases</span></div></td></tr><tr style="height: 0px;"><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">australian wine exports hit a record 52.1 million liters worth 260 million dollars (143 million us) in september, the government statistics office reported on monday </span></div></td><td style="border-bottom: solid #000000 1px; border-left: solid #000000 1px; border-right: solid #000000 1px; border-top: solid #000000 1px; padding: 7px 7px 7px 7px; vertical-align: top;"><div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; font-family: &quot;times&quot; , &quot;times new roman&quot; , serif; font-style: normal; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">australian wine exports hit record high in september</span></div></td></tr></tbody></table></div><br /><b>Future Research</b><br /><br />We’ve observed that due to the nature of news headlines, the model can generate good headlines from reading just a few sentences from the beginning of the article. Although this task serves as a nice proof-of-concept, we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries. In those tasks training from scratch with this model architecture does not do as well as some other techniques we’re researching, but it serves as a baseline. 
We hope <a href="https://github.com/tensorflow/models/tree/master/textsum">this release</a> can also serve as a baseline for others in their summarization research.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/text-summarization-with-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities</title>
		<link>https://googledata.org/google-research/meet-parseys-cousins-syntax-for-40-languages-plus-new-syntaxnet-capabilities/</link>
		<comments>https://googledata.org/google-research/meet-parseys-cousins-syntax-for-40-languages-plus-new-syntaxnet-capabilities/#comments</comments>
		<pubDate>Mon, 08 Aug 2016 16:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dc99d77d3c44b5acd487023f768a41a0</guid>
		<description><![CDATA[<span>Posted by Chris Alberti, Dave Orr &#38; Slav Petrov, Google Natural Language Understanding Team</span><br /><br />Just in time for <a href="http://acl2016.org/">ACL 2016</a>, we are pleased to announce that Parsey McParseface, <a href="https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html">released in May as part of SyntaxNet</a>&#160;and the basis for the <a href="https://cloud.google.com/natural-language/">Cloud Natural Language API</a>, now has 40 cousins! Parsey&#8217;s Cousins is a collection of pretrained syntactic models for 40 languages, capable of analyzing the native language of more than half of the world&#8217;s population at often unprecedented <a href="https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md">accuracy</a>. To better address the linguistic phenomena occurring in these languages we have endowed SyntaxNet with new abilities for <a href="https://en.wikipedia.org/wiki/Text_segmentation">Text Segmentation</a> and <a href="https://en.wikipedia.org/wiki/Morphology_(linguistics)">Morphological Analysis</a>.<br /><br />When we released Parsey, we were already planning to expand to more languages, and it soon became clear that this was both urgent and important, because researchers were having trouble creating top notch SyntaxNet models for other languages.<br /><br />The reason for that is a little bit subtle. SyntaxNet, like other <a href="https://www.tensorflow.org/">TensorFlow</a> models, has a lot of knobs to turn, which affect accuracy and speed. These knobs are called hyperparameters, and control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right hyperparameter setting is very important. 
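In practice, picking them usually means sampling many settings and keeping the best. The toy Python sketch below illustrates that recipe; the training-and-evaluation function is a synthetic stand-in, since each real call is a training run that can take days.

```python
import random

# Synthetic stand-in for "train a model and measure held-out accuracy":
# a smooth function with a sweet spot near lr=0.05, momentum=0.9.
def train_and_eval(learning_rate, momentum):
    return 1.0 - (learning_rate - 0.05) ** 2 - (momentum - 0.9) ** 2

random.seed(0)  # deterministic for the sake of the example
trials = [
    {"learning_rate": random.uniform(0.001, 0.5),
     "momentum": random.uniform(0.0, 0.99)}
    for _ in range(70)  # the post reports ~70 models trained per language
]
best = max(trials, key=lambda hp: train_and_eval(**hp))
# A follow-up round would sample new settings near `best` to fine-tune further.
```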
Unfortunately there is no tested and proven way of doing this and picking good hyperparameters is mostly an empirical science -- we try a bunch of settings and see what works best.<br /><br />An additional challenge is that training these models can take a long time, several days on very fast hardware. Our solution is to train many models in parallel via <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a>, and when one looks promising, train a bunch more models with similar settings to fine-tune the results. This can really add up -- on average, we train more than 70 models per language. The plot below shows how the accuracy varies depending on the hyperparameters as training progresses. The best models are up to 4% absolute more accurate than ones trained without hyperparameter tuning.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-6kU_Cu49gq0/V6hY_5jpw3I/AAAAAAAABJM/D5_odV19sVgvSWai1VRXQl8wtRnJZpAcwCLcB/s1600/image05.png"><img border="0" height="436" src="https://1.bp.blogspot.com/-6kU_Cu49gq0/V6hY_5jpw3I/AAAAAAAABJM/D5_odV19sVgvSWai1VRXQl8wtRnJZpAcwCLcB/s640/image05.png" width="640"></a></td></tr><tr><td>Held-out set accuracy for various English parsing models with different hyperparameters (each line corresponds to one training run with specific hyperparameters). In some cases training is a lot slower and in many cases a suboptimal choice of hyperparameters leads to significantly lower accuracy. We are releasing the best model that we were able to train for each language.</td></tr></tbody></table>In order to do a good job at analyzing the grammar of other languages, it was not sufficient to just fine-tune our English setup. We also had to expand the capabilities of SyntaxNet. The first extension is a model for text segmentation, which is the task of identifying word boundaries. In languages like English, this isn&#8217;t very hard -- you can mostly look for spaces and punctuation. 
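For a space-delimited language, that heuristic really is a one-liner -- a toy segmenter, not SyntaxNet's segmentation model:

```python
import re

def naive_segment(text):
    """Split into word and punctuation tokens -- adequate for English."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_segment("The model reads the text."))
# -> ['The', 'model', 'reads', 'the', 'text', '.']

# With no spaces between words, the same rule lumps everything together:
print(naive_segment("我喜欢猫"))
# -> ['我喜欢猫']  (correct segmentation: 我 / 喜欢 / 猫, "I / like / cats")
```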
In Chinese, however, this can be very challenging, because words are not separated by spaces. To correctly analyze dependencies between Chinese words, SyntaxNet needs to understand text segmentation -- and now it does.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-SH6wmh38Nq0/V6hZqbHD4LI/AAAAAAAABJQ/zJwbHpKvCqAJy5E0ng0yPiskJupFlrzbwCEw/s1600/image03.png"><img border="0" height="176" src="https://3.bp.blogspot.com/-SH6wmh38Nq0/V6hZqbHD4LI/AAAAAAAABJQ/zJwbHpKvCqAJy5E0ng0yPiskJupFlrzbwCEw/s640/image03.png" width="640"></a></td></tr><tr><td>Analysis of a Chinese string into a parse tree showing dependency labels, word tokens, and parts of speech (read top to bottom for each word token).</td></tr></tbody></table>The second extension is a model for morphological analysis. Morphology is a language feature that is poorly represented in English. It describes inflection: i.e., how the grammatical function and meaning of the word changes as its spelling changes. In English, we add an -s to a word to indicate plurality. In Russian, a <a href="https://en.wikipedia.org/wiki/Russian_grammar">heavily inflected language</a>, morphology can indicate number, gender, whether the word is the subject or object of a sentence, possessives, prepositional phrases, and more. 
To understand the syntax of a sentence in Russian, SyntaxNet needs to understand morphology -- and now it does.<br /><div><a href="https://4.bp.blogspot.com/-0WFKVxMUcyw/V6hbG3jKHDI/AAAAAAAABJc/X-BoqS_91CY8JQhSkVq4GGlLY3Cw3NkBwCLcB/s1600/image00.png"><img border="0" height="148" src="https://4.bp.blogspot.com/-0WFKVxMUcyw/V6hbG3jKHDI/AAAAAAAABJc/X-BoqS_91CY8JQhSkVq4GGlLY3Cw3NkBwCLcB/s640/image00.png" width="640"></a></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-Vj4qA9EdxXY/V6hbz7rdO0I/AAAAAAAABJo/Jj7XLGqgHhEQ8qb1ou_DwgUFqSCZmvOzQCLcB/s1600/image01.png"><img border="0" height="234" src="https://2.bp.blogspot.com/-Vj4qA9EdxXY/V6hbz7rdO0I/AAAAAAAABJo/Jj7XLGqgHhEQ8qb1ou_DwgUFqSCZmvOzQCLcB/s640/image01.png" width="640"></a></td></tr><tr><td>Parse trees showing dependency labels, parts of speech, and morphology.</td></tr></tbody></table>As you might have noticed, the parse trees for all of the sentences above look very similar. This is because we follow the content-head principle, under which dependencies are drawn between content words, with function words becoming leaves in the parse tree. This idea was developed by the <a href="http://universaldependencies.org/">Universal Dependencies</a> project in order to increase parallelism between languages. Parsey&#8217;s Cousins are trained on <a href="https://en.wikipedia.org/wiki/Treebank">treebanks</a> provided by this project and are designed to be cross-linguistically consistent and thus easier to use in multi-lingual language understanding applications.<br /><br />Using the same set of labels across languages can help us understand how sentences in different languages, or variations in the same language, convey the same meaning. In all of the above examples, the root indicates the main verb of the sentence and there is a passive nominal subject (indicated by the arc labeled with &#8216;nsubjpass&#8217;) and a passive auxiliary (&#8216;auxpass&#8217;). 
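One way to picture such a content-head parse is as a list of (token, head, label) arcs, with function words as leaves. The sentence and helper below are invented for illustration (this is not SyntaxNet's output format), using Universal Dependencies v1 labels:

```python
# A content-head parse as (index, word, head index, label) arcs; head 0 is
# the root. Sentence and labels (Universal Dependencies v1) are illustrative.
parse = [
    (1, "The",     2, "det"),
    (2, "book",    4, "nsubjpass"),  # passive nominal subject
    (3, "was",     4, "auxpass"),    # passive auxiliary
    (4, "written", 0, "root"),       # main verb of the sentence
    (5, "by",      7, "case"),       # function word: a leaf under "author"
    (6, "the",     7, "det"),
    (7, "author",  4, "nmod"),       # content word attaches to the verb
]

def dependents(tree, head):
    """Words whose head is the given token index."""
    return [word for _, word, h, _ in tree if h == head]

print(dependents(parse, 4))  # -> ['book', 'was', 'author']
print(dependents(parse, 5))  # -> [] ("by" heads nothing: it is a leaf)
```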
If you look closely, you will also notice some differences because the grammar of each language differs. For example, English uses the preposition &#8216;by,&#8217; where Russian uses morphology to mark that the phrase &#8216;the publisher (&#1080;&#1079;&#1076;&#1072;&#1090;&#1077;&#1083;&#1077;&#1084;)&#8217; is in <a href="https://en.wikipedia.org/wiki/Instrumental_case">instrumental case</a> -- the meaning is the same, it is just expressed differently. <br /><br />Google has been involved in the Universal Dependencies project since its <a href="http://universaldependencies.org/introduction.html#history">inception</a> and we are very excited to be able to bring together our efforts on datasets and modeling. We hope that this release will facilitate research progress in building computer systems that can understand all of the world&#8217;s languages.<br /><br />Parsey's Cousins can be found on <a href="https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md">GitHub</a>, along with <a href="https://github.com/tensorflow/models/tree/master/syntaxnet/syntaxnet/models/parsey_mcparseface">Parsey McParseface</a> and <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Alberti, Dave Orr &amp; Slav Petrov, Google Natural Language Understanding Team</span><br /><br />Just in time for <a href="http://acl2016.org/">ACL 2016</a>, we are pleased to announce that Parsey McParseface, <a href="https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html">released in May as part of SyntaxNet</a>&nbsp;and the basis for the <a href="https://cloud.google.com/natural-language/">Cloud Natural Language API</a>, now has 40 cousins! Parsey’s Cousins is a collection of pretrained syntactic models for 40 languages, capable of analyzing the native language of more than half of the world’s population at often unprecedented <a href="https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md">accuracy</a>. To better address the linguistic phenomena occurring in these languages we have endowed SyntaxNet with new abilities for <a href="https://en.wikipedia.org/wiki/Text_segmentation">Text Segmentation</a> and <a href="https://en.wikipedia.org/wiki/Morphology_(linguistics)">Morphological Analysis</a>.<br /><br />When we released Parsey, we were already planning to expand to more languages, and it soon became clear that this was both urgent and important, because researchers were having trouble creating top notch SyntaxNet models for other languages.<br /><br />The reason for that is a little bit subtle. SyntaxNet, like other <a href="https://www.tensorflow.org/">TensorFlow</a> models, has a lot of knobs to turn, which affect accuracy and speed. These knobs are called hyperparameters, and control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right hyperparameter setting is very important. 
Unfortunately there is no tested and proven way of doing this and picking good hyperparameters is mostly an empirical science -- we try a bunch of settings and see what works best.<br /><br />An additional challenge is that training these models can take a long time, several days on very fast hardware. Our solution is to train many models in parallel via <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a>, and when one looks promising, train a bunch more models with similar settings to fine-tune the results. This can really add up -- on average, we train more than 70 models per language. The plot below shows how the accuracy varies depending on the hyperparameters as training progresses. The best models are up to 4% absolute more accurate than ones trained without hyperparameter tuning.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-6kU_Cu49gq0/V6hY_5jpw3I/AAAAAAAABJM/D5_odV19sVgvSWai1VRXQl8wtRnJZpAcwCLcB/s1600/image05.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="436" src="https://1.bp.blogspot.com/-6kU_Cu49gq0/V6hY_5jpw3I/AAAAAAAABJM/D5_odV19sVgvSWai1VRXQl8wtRnJZpAcwCLcB/s640/image05.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Held-out set accuracy for various English parsing models with different hyperparameters (each line corresponds to one training run with specific hyperparameters). In some cases training is a lot slower and in many cases a suboptimal choice of hyperparameters leads to significantly lower accuracy. We are releasing the best model that we were able to train for each language.</td></tr></tbody></table>In order to do a good job at analyzing the grammar of other languages, it was not sufficient to just fine-tune our English setup. 
We also had to expand the capabilities of SyntaxNet. The first extension is a model for text segmentation, which is the task of identifying word boundaries. In languages like English, this isn’t very hard -- you can mostly look for spaces and punctuation. In Chinese, however, this can be very challenging, because words are not separated by spaces. To correctly analyze dependencies between Chinese words, SyntaxNet needs to understand text segmentation -- and now it does.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-SH6wmh38Nq0/V6hZqbHD4LI/AAAAAAAABJQ/zJwbHpKvCqAJy5E0ng0yPiskJupFlrzbwCEw/s1600/image03.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="176" src="https://3.bp.blogspot.com/-SH6wmh38Nq0/V6hZqbHD4LI/AAAAAAAABJQ/zJwbHpKvCqAJy5E0ng0yPiskJupFlrzbwCEw/s640/image03.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Analysis of a Chinese string into a parse tree showing dependency labels, word tokens, and parts of speech (read top to bottom for each word token).</td></tr></tbody></table>The second extension is a model for morphological analysis. Morphology is a language feature that is poorly represented in English. It describes inflection: i.e., how the grammatical function and meaning of the word changes as its spelling changes. In English, we add an -s to a word to indicate plurality. In Russian, a <a href="https://en.wikipedia.org/wiki/Russian_grammar">heavily inflected language</a>, morphology can indicate number, gender, whether the word is the subject or object of a sentence, possessives, prepositional phrases, and more. 
To understand the syntax of a sentence in Russian, SyntaxNet needs to understand morphology -- and now it does.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-0WFKVxMUcyw/V6hbG3jKHDI/AAAAAAAABJc/X-BoqS_91CY8JQhSkVq4GGlLY3Cw3NkBwCLcB/s1600/image00.png" imageanchor="1"><img border="0" height="148" src="https://4.bp.blogspot.com/-0WFKVxMUcyw/V6hbG3jKHDI/AAAAAAAABJc/X-BoqS_91CY8JQhSkVq4GGlLY3Cw3NkBwCLcB/s640/image00.png" width="640" /></a></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-Vj4qA9EdxXY/V6hbz7rdO0I/AAAAAAAABJo/Jj7XLGqgHhEQ8qb1ou_DwgUFqSCZmvOzQCLcB/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="234" src="https://2.bp.blogspot.com/-Vj4qA9EdxXY/V6hbz7rdO0I/AAAAAAAABJo/Jj7XLGqgHhEQ8qb1ou_DwgUFqSCZmvOzQCLcB/s640/image01.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Parse trees showing dependency labels, parts of speech, and morphology.</td></tr></tbody></table>As you might have noticed, the parse trees for all of the sentences above look very similar. This is because we follow the content-head principle, under which dependencies are drawn between content words, with function words becoming leaves in the parse tree. This idea was developed by the <a href="http://universaldependencies.org/">Universal Dependencies</a> project in order to increase parallelism between languages. 
Parsey’s Cousins are trained on <a href="https://en.wikipedia.org/wiki/Treebank">treebanks</a> provided by this project and are designed to be cross-linguistically consistent and thus easier to use in multi-lingual language understanding applications.<br /><br />Using the same set of labels across languages can help us understand how sentences in different languages, or variations in the same language, convey the same meaning. In all of the above examples, the root indicates the main verb of the sentence and there is a passive nominal subject (indicated by the arc labeled with ‘nsubjpass’) and a passive auxiliary (‘auxpass’). If you look closely, you will also notice some differences because the grammar of each language differs. For example, English uses the preposition ‘by,’ where Russian uses morphology to mark that the phrase ‘the publisher (издателем)’ is in <a href="https://en.wikipedia.org/wiki/Instrumental_case">instrumental case</a> -- the meaning is the same, it is just expressed differently. <br /><br />Google has been involved in the Universal Dependencies project since its <a href="http://universaldependencies.org/introduction.html#history">inception</a> and we are very excited to be able to bring together our efforts on datasets and modeling. We hope that this release will facilitate research progress in building computer systems that can understand all of the world’s languages.<br /><br />Parsey's Cousins can be found on <a href="https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md">GitHub</a>, along with <a href="https://github.com/tensorflow/models/tree/master/syntaxnet/syntaxnet/models/parsey_mcparseface">Parsey McParseface</a> and <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/meet-parseys-cousins-syntax-for-40-languages-plus-new-syntaxnet-capabilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ACL 2016 &amp; Research at Google</title>
		<link>https://googledata.org/google-research/acl-2016-research-at-google/</link>
		<comments>https://googledata.org/google-research/acl-2016-research-at-google/#comments</comments>
		<pubDate>Sun, 07 Aug 2016 12:17:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=45e392ed1740436d8a135f0f7aa34bfd</guid>
		<description><![CDATA[<span>Posted by Slav Petrov, Research Scientist</span><br /><br />This week, Berlin hosts the <a href="http://acl2016.org/">2016 Annual Meeting of the Association for Computational Linguistics</a> (ACL 2016), the premier conference of the field of computational linguistics, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language. As a leader in <a href="https://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a> (NLP) and a Platinum Sponsor of the conference, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better learners using labeled and unlabeled data, state-of-the-art modeling, and learning from indirect supervision. <br /><br />Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more. Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems.<br />Our researchers are experts in natural language processing and machine learning, and combine methodological research with applied science, and our engineers are equally involved in long-term research efforts and driving immediate applications of our technology. <br /><br />If you&#8217;re attending ACL 2016, we hope that you&#8217;ll stop by the booth to check out some demos,  meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. 
Learn more about Google research being presented at ACL 2016 below (Googlers highlighted in <span><span>blue</span></span>), and visit the Natural Language Understanding Team page at <a href="http://g.co/NLUTeam">g.co/NLUTeam</a>.<br /><br /><b><u>Papers</u></b><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1015.pdf">Generalized Transition-based Dependency Parsing via Control Parameters</a><br /><i><span>Bernd Bohnet</span>,<span>&#160;Ryan McDonald</span>,<span>&#160;Emily Pitler</span>,<span>&#160;Ji Ma</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1013.pdf">Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning</a><br /><i>Yulia Tsvetkov, Manaal Faruqui, <span>Wang Ling (Google DeepMind)</span>,<span>&#160;Chris Dyer (Google DeepMind)</span></i><br /><br /><a href="https://transacl.org/ojs/index.php/tacl/article/view/730/166">Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i>Manaal Faruqui, <span>Ryan McDonald</span>,<span>&#160;Radu Soricut</span></i><br /><br /><a href="https://transacl.org/ojs/index.php/tacl/article/view/892/207">Many Languages, One Parser</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i>Waleed Ammar, George Mulcaire, Miguel Ballesteros, <span>Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>, Noah A. 
Smith</i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1057.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1057.pdf">Latent Predictor Networks for Code Generation</a><br /><i><span>Wang Ling (Google DeepMind)</span>,<span>&#160;Phil Blunsom (Google DeepMind)</span>,<span>&#160;Edward Grefenstette (Google DeepMind)</span>,<span>&#160;Karl Moritz Hermann (Google DeepMind)</span>,<span>&#160;Tom&#225;&#353; Ko&#269;isk&#253; (Google DeepMind)</span>,<span>&#160;Fumin Wang (Google DeepMind)</span>,<span>&#160;Andrew Senior (Google DeepMind)</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1059.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1059.pdf">Collective Entity Resolution with Multi-Focal Attention</a><br /><i><span>Amir Globerson</span>,<span>&#160;Nevena Lazic</span>,<span>&#160;</span>Soumen Chakrabarti, <span>Amarnag Subramanya</span>,<span>&#160;Michael Ringgaard</span>,<span>&#160;Fernando Pereira</span></i><br /><br /><a href="http://www.aclweb.org/anthology/Q/Q15/Q15-1036.pdf">Plato: A Selective Context Model for Entity Resolution</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i><span>Nevena Lazic</span>,<span>&#160;Amarnag Subramanya</span>,<span>&#160;Michael Ringgaard</span>,<span>&#160;Fernando Pereira</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1145.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1145.pdf">WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia</a><br /><i><span>Daniel Hewlett</span>,<span>&#160;Alexandre Lacoste</span>,<span>&#160;Llion Jones</span>,<span>&#160;Illia Polosukhin</span>,<span>&#160;Andrew Fandrianto</span>,<span>&#160;Jay Han</span>,<span>&#160;Matthew Kelcey</span>,<span>&#160;David Berthelot</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1147.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1147.pdf">Stack-propagation: 
Improved Representation Learning for Syntax</a><br /><i>Yuan Zhang, <span>David Weiss</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1157.pdf">Cross-lingual Models of Word Embeddings: An Empirical Comparison</a><br /><i>Shyam Upadhyay, Manaal Faruqui, <span>Chris Dyer (Google DeepMind)</span>,&#160;<span>Dan Roth</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1231.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1231.pdf">Globally Normalized Transition-Based Neural Networks</a><b><i> (Outstanding Papers Session)</i></b><br /><i><span>Daniel Andor</span>,<span>&#160;Chris Alberti</span>,<span>&#160;David Weiss</span>,<span>&#160;Aliaksei Severyn</span>,<span>&#160;Alessandro Presta</span>,<span>&#160;Kuzman Ganchev</span>,&#160;<span>Slav Petrov</span>,<span>&#160;Michael Collins</span></i><br /><br /><b><u>Posters</u></b><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-2014.pdf">Cross-lingual projection for class-based language models</a><br /><i><span>Beat Gfeller</span>,<span>&#160;Vlad Schogol</span>,<span>&#160;Keith Hall</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1103.pdf">Synthesizing Compound Words for Machine Translation</a><br /><i>Austin Matthews, <span>Eva Schlinger</span><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>, Alon Lavie, <span>Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1184.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1184.pdf">Cross-Lingual Morphological Tagging for Low-Resource Languages</a><br /><i>Jan Buys, <span>Jan A. 
Botha</span></i><br /><br /><b><u>Workshops</u></b><br /><a href="https://sites.google.com/site/repl4nlp2016/">1st Workshop on Representation Learning for NLP</a><br />Keynote Speakers include: <i><span>Raia Hadsell (Google DeepMind)</span></i><br />Workshop Organizers include: <i><span>Edward Grefenstette (Google DeepMind)</span></i><i>,</i><i><span>&#160;Phil Blunsom (Google DeepMind)</span></i><i>,</i><i><span>&#160;Karl Moritz Hermann (Google DeepMind)</span></i><br />Program Committee members include: <i><span>Tom&#225;&#353; Ko&#269;isk&#253; (Google DeepMind)</span></i><i>,</i><i><span>&#160;Wang Ling (Google DeepMind)</span></i><i>,</i><i><span>&#160;Ankur Parikh (Google)</span></i><i>,</i><i><span>&#160;John Platt (Google)</span></i><i>,</i><i><span>&#160;Oriol Vinyals (Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/repevalacl16/">1st Workshop on Evaluating Vector-Space Representations for NLP</a><br />Contributed Papers:<br /><a href="http://aclweb.org/anthology/W/W16/W16-2506.pdf">Problems With Evaluation of Word Embeddings Using Word Similarity Tasks</a><br /><i>Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, <span>Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a></i><br /><br /><a href="http://aclweb.org/anthology/W/W16/W16-2520.pdf">Correlation-based Intrinsic Evaluation of Word Vector Representations</a><br /><i>Yulia Tsvetkov, Manaal Faruqui, <span>Chris Dyer (Google DeepMind)</span></i><br /><br /><a href="http://zwei.dwds.de/statfsm/">SIGFSM Workshop on Statistical NLP and Weighted Automata</a><br />Contributed Papers:<br /><a href="http://aclweb.org/anthology/W/W16/W16-2404.pdf">Distributed representation and estimation of WFST-based n-gram models</a><br /><i><span>Cyril Allauzen</span></i><i>,</i><i><span>&#160;Michael Riley</span></i><i>,</i><i><span>&#160;Brian Roark</span></i><br /><br /><a href="http://aclweb.org/anthology/W/W16/W16-2409.pdf">Pynini: 
A Python library for weighted finite-state grammar compilation</a><br /><i><span>Kyle Gorman</span></i><br /><br /><span><br /><a name="1"><b>* </b></a>Work completed at CMU<a href="http://research.googleblog.com/#top1"><sup>&#8617;</sup></a><br /></span>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Slav Petrov, Research Scientist</span><br /><br />This week, Berlin hosts the <a href="http://acl2016.org/">2016 Annual Meeting of the Association for Computational Linguistics</a> (ACL 2016), the premier conference of the field of computational linguistics, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language. As a leader in <a href="https://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a> (NLP) and a Platinum Sponsor of the conference, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better learners using labeled and unlabeled data, state-of-the-art modeling, and learning from indirect supervision. <br /><br />Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more. Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems.<br />Our researchers are experts in natural language processing and machine learning, and combine methodological research with applied science, and our engineers are equally involved in long-term research efforts and driving immediate applications of our technology. <br /><br />If you’re attending ACL 2016, we hope that you’ll stop by the booth to check out some demos,  meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. 
Learn more about Google research being presented at ACL 2016 below (Googlers highlighted in <span style="background-color: white;"><span style="color: #3d85c6;">blue</span></span>), and visit the Natural Language Understanding Team page at <a href="http://g.co/NLUTeam">g.co/NLUTeam</a>.<br /><br /><b><u>Papers</u></b><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1015.pdf">Generalized Transition-based Dependency Parsing via Control Parameters</a><br /><i><span style="color: #3d85c6;">Bernd Bohnet</span>,<span style="color: #3d85c6;">&nbsp;Ryan McDonald</span>,<span style="color: #3d85c6;">&nbsp;Emily Pitler</span>,<span style="color: #3d85c6;">&nbsp;Ji Ma</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1013.pdf">Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning</a><br /><i>Yulia Tsvetkov, Manaal Faruqui, <span style="color: #3d85c6;">Wang Ling (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Chris Dyer (Google DeepMind)</span></i><br /><br /><a href="https://transacl.org/ojs/index.php/tacl/article/view/730/166">Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i>Manaal Faruqui, <span style="color: #3d85c6;">Ryan McDonald</span>,<span style="color: #3d85c6;">&nbsp;Radu Soricut</span></i><br /><br /><a href="https://transacl.org/ojs/index.php/tacl/article/view/892/207">Many Languages, One Parser</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i>Waleed Ammar, George Mulcaire, Miguel Ballesteros, <span style="color: #3d85c6;">Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/2016/08/acl-2016-research-at-google.html#1" name="top1"><sup>*</sup></a>, Noah A. 
Smith</i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1057.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1057.pdf">Latent Predictor Networks for Code Generation</a><br /><i><span style="color: #3d85c6;">Wang Ling (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Phil Blunsom (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Edward Grefenstette (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Karl Moritz Hermann (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Tomáš Kočiský (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Fumin Wang (Google DeepMind)</span>,<span style="color: #3d85c6;">&nbsp;Andrew Senior (Google DeepMind)</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1059.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1059.pdf">Collective Entity Resolution with Multi-Focal Attention</a><br /><i><span style="color: #3d85c6;">Amir Globerson</span>,<span style="color: #3d85c6;">&nbsp;Nevena Lazic</span>,<span style="color: #3d85c6;">&nbsp;</span>Soumen Chakrabarti, <span style="color: #3d85c6;">Amarnag Subramanya</span>,<span style="color: #3d85c6;">&nbsp;Michael Ringgaard</span>,<span style="color: #3d85c6;">&nbsp;Fernando Pereira</span></i><br /><br /><a href="http://www.aclweb.org/anthology/Q/Q15/Q15-1036.pdf">Plato: A Selective Context Model for Entity Resolution</a> (<a href="https://www.transacl.org/ojs/index.php/tacl">TACL</a>)<br /><i><span style="color: #3d85c6;">Nevena Lazic</span>,<span style="color: #3d85c6;">&nbsp;Amarnag Subramanya</span>,<span style="color: #3d85c6;">&nbsp;Michael Ringgaard</span>,<span style="color: #3d85c6;">&nbsp;Fernando Pereira</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1145.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1145.pdf">WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia</a><br /><i><span style="color: #3d85c6;">Daniel 
Hewlett</span>,<span style="color: #3d85c6;">&nbsp;Alexandre Lacoste</span>,<span style="color: #3d85c6;">&nbsp;Llion Jones</span>,<span style="color: #3d85c6;">&nbsp;Illia Polosukhin</span>,<span style="color: #3d85c6;">&nbsp;Andrew Fandrianto</span>,<span style="color: #3d85c6;">&nbsp;Jay Han</span>,<span style="color: #3d85c6;">&nbsp;Matthew Kelcey</span>,<span style="color: #3d85c6;">&nbsp;David Berthelot</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1147.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1147.pdf">Stack-propagation: Improved Representation Learning for Syntax</a><br /><i>Yuan Zhang, <span style="color: #3d85c6;">David Weiss</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1157.pdf">Cross-lingual Models of Word Embeddings: An Empirical Comparison</a><br /><i>Shyam Upadhyay, Manaal Faruqui, <span style="color: #3d85c6;">Chris Dyer (Google DeepMind)</span>,&nbsp;<span style="color: #3d85c6;">Dan Roth</span></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1231.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1231.pdf">Globally Normalized Transition-Based Neural Networks</a><b><i> (Outstanding Papers Session)</i></b><br /><i><span style="color: #3d85c6;">Daniel Andor</span>,<span style="color: #3d85c6;">&nbsp;Chris Alberti</span>,<span style="color: #3d85c6;">&nbsp;David Weiss</span>,<span style="color: #3d85c6;">&nbsp;Aliaksei Severyn</span>,<span style="color: #3d85c6;">&nbsp;Alessandro Presta</span>,<span style="color: #3d85c6;">&nbsp;Kuzman Ganchev</span>,&nbsp;<span style="color: #3d85c6;">Slav Petrov</span>,<span style="color: #3d85c6;">&nbsp;Michael Collins</span></i><br /><br /><b><u>Posters</u></b><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-2014.pdf">Cross-lingual projection for class-based language models</a><br /><i><span style="color: #3d85c6;">Beat Gfeller</span>,<span style="color: #3d85c6;">&nbsp;Vlad Schogol</span>,<span 
style="color: #3d85c6;">&nbsp;Keith Hall</span></i><br /><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1103.pdf">Synthesizing Compound Words for Machine Translation</a><br /><i>Austin Matthews, <span style="color: #3d85c6;">Eva Schlinger</span><a href="http://research.googleblog.com/2016/08/acl-2016-research-at-google.html#1" name="top1"><sup>*</sup></a>, Alon Lavie, <span style="color: #3d85c6;">Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/2016/08/acl-2016-research-at-google.html#1" name="top1"><sup>*</sup></a></i><br /><a href="http://www.aclweb.org/anthology/P/P16/P16-1184.pdf"><br /></a> <a href="http://www.aclweb.org/anthology/P/P16/P16-1184.pdf">Cross-Lingual Morphological Tagging for Low-Resource Languages</a><br /><i>Jan Buys, <span style="color: #3d85c6;">Jan A. Botha</span></i><br /><br /><b><u>Workshops</u></b><br /><a href="https://sites.google.com/site/repl4nlp2016/">1st Workshop on Representation Learning for NLP</a><br />Keynote Speakers include: <i><span style="color: #3d85c6;">Raia Hadsell (Google DeepMind)</span></i><br />Workshop Organizers include: <i><span style="color: #3d85c6;">Edward Grefenstette (Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Phil Blunsom (Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Karl Moritz Hermann (Google DeepMind)</span></i><br />Program Committee members include: <i><span style="color: #3d85c6;">Tomáš Kočiský (Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Wang Ling (Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ankur Parikh (Google)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;John Platt (Google)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Oriol Vinyals (Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/repevalacl16/">1st Workshop on Evaluating Vector-Space Representations for NLP</a><br />Contributed Papers:<br 
/><a href="http://aclweb.org/anthology/W/W16/W16-2506.pdf">Problems With Evaluation of Word Embeddings Using Word Similarity Tasks</a><br /><i>Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, <span style="color: #3d85c6;">Chris Dyer (Google DeepMind)</span><a href="http://research.googleblog.com/2016/08/acl-2016-research-at-google.html#1" name="top1"><sup>*</sup></a></i><br /><br /><a href="http://aclweb.org/anthology/W/W16/W16-2520.pdf">Correlation-based Intrinsic Evaluation of Word Vector Representations</a><br /><i>Yulia Tsvetkov, Manaal Faruqui, <span style="color: #3d85c6;">Chris Dyer (Google DeepMind)</span></i><br /><br /><a href="http://zwei.dwds.de/statfsm/">SIGFSM Workshop on Statistical NLP and Weighted Automata</a><br />Contributed Papers:<br /><a href="http://aclweb.org/anthology/W/W16/W16-2404.pdf">Distributed representation and estimation of WFST-based n-gram models</a><br /><i><span style="color: #3d85c6;">Cyril Allauzen</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Michael Riley</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Brian Roark</span></i><br /><br /><a href="http://aclweb.org/anthology/W/W16/W16-2409.pdf">Pynini: A Python library for weighted finite-state grammar compilation</a><br /><i><span style="color: #3d85c6;">Kyle Gorman</span></i><br /><br /><span class="Apple-style-span" style="font-size: small;"><br /><a name="1"><b>* </b></a>Work completed at CMU<a href="http://research.googleblog.com/2016/08/acl-2016-research-at-google.html#top1"><sup>↩</sup></a><br /></span>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/acl-2016-research-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computational Thinking for All Students</title>
		<link>https://googledata.org/google-research/computational-thinking-for-all-students/</link>
		<comments>https://googledata.org/google-research/computational-thinking-for-all-students/#comments</comments>
		<pubDate>Wed, 03 Aug 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=ee175c39da2ac4a8a10a01fdf0c7eec1</guid>
		<description><![CDATA[<span>Posted by Maggie Johnson, Director of Education and University Relations, Google</span><br /><br /><i>(Crossposted on the <a href="http://googleforeducation.blogspot.com/2016/08/computational-thinking-for-all-students_3.html">Google for Education Blog</a>, and the the <a href="http://www.huffingtonpost.com/entry/to-all-students-learning-computational-thinking-will_us_57a15f7ee4b004301c522b7c">Huffington Post</a>)</i><br /><br />Last year, I wrote about <a href="https://research.googleblog.com/2015/07/should-my-kid-learn-to-code_14.html" target="_blank">the importance of teaching computational thinking to all K-12 students</a>. Given the growing use of computing, algorithms and data in all fields from the humanities to medicine to business, it&#8217;s becoming increasingly important for students to understand the basics of computer science (CS). One lesson we have learned through Google&#8217;s CS education outreach efforts is that these skills can be accessible to all students, if we introduce them early in K-5. These are truly 21st century skills which can, over time, produce a workforce ready for a technology-enabled and driven economy. <br /><br />How can teachers start introducing computational thinking in early school curriculum? It is already present in many topic areas - algorithms for solving math problems, for example. However, what is often missing in current examples of computational thinking is the explicit connection between what students are learning and its application in computing. For example, once a student has mastered adding multi-digit numbers, the following algorithm could be presented:<br /><ol><li>Add together the digits in the ones place. If the result is &#60; 10, it becomes the ones digit of the answer. 
If it's &#62;= 10, the ones digit of the result becomes the ones digit of the answer, and you add 1 to the next column.</li><li>Add together the digits in the tens place, plus the 1 carried over from the ones place, if necessary. If the answer is &#60; 10, it becomes the tens digit of the answer; if it's &#62;= 10, the ones digit becomes the tens digit of the answer and 1 is added to the next column.</li><li>Repeat this process for any additional columns until they are all added.</li></ol><div><a href="https://4.bp.blogspot.com/-W_QBteVg0nY/V6DKkWyQ95I/AAAAAAAABI0/MXZ1MIfpBDoeCRPNK4zXjAah1aFIQv_IACLcB/s1600/image00.png"><img border="0" height="320" src="https://4.bp.blogspot.com/-W_QBteVg0nY/V6DKkWyQ95I/AAAAAAAABI0/MXZ1MIfpBDoeCRPNK4zXjAah1aFIQv_IACLcB/s640/image00.png" width="640"></a></div>This allows a teacher to present the concept of an algorithm and its use in computing, as well as the most important elements of any computer program: conditional branching (&#8220;if the result is less than 10&#8230;&#8221;) and iteration (&#8220;repeat this process&#8230;&#8221;). Going a step further, a teacher translating the algorithm into a running program can have a compelling effect. When something that students have used to solve an instance of a problem can automatically solve all instances of that problem, it&#8217;s quite a powerful moment for them even if they don&#8217;t do the coding themselves. <br /><br />Google has created <a href="https://computationalthinkingcourse.withgoogle.com/course?use_last_location=true" target="_blank">an online course for K-12 teachers to learn about computational thinking</a> and how to make these explicit connections for their students. We also have a <a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/index.html#!ct-materials" target="_blank">large repository of lessons</a>, explorations and programs to support teachers and students. 
Our videos illustrate real-world examples of the application of computational thinking in Google&#8217;s products and services, and we have compiled a set of <a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/index.html#!resources" target="_blank">great resources showing how to integrate computational thinking into existing curriculum</a>. We also recently announced <a href="https://projectbloks.withgoogle.com/" target="_blank">Project Bloks</a> to engage younger children in computational thinking. Finally, <a href="http://code.org/">code.org</a>, for whom Google is a primary sponsor, has <a href="https://code.org/educate/curriculum/elementary-school" target="_blank">curriculum and materials</a> for K-5 teachers and students. <br /><br />We feel that computational thinking is a core skill for all students. If we can make these explicit connections for students, they will see how the devices and apps that they use every day are powered by algorithms and programs. They will learn the importance of data in making decisions. They will learn skills that will prepare them for a workforce that will be doing vastly different tasks than the workforce of today. We owe it to all students to give them every possible opportunity to be productive and successful members of society.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Maggie Johnson, Director of Education and University Relations, Google</span><br /><br /><i>(Crossposted on the <a href="http://googleforeducation.blogspot.com/2016/08/computational-thinking-for-all-students_3.html">Google for Education Blog</a>, and the <a href="http://www.huffingtonpost.com/entry/to-all-students-learning-computational-thinking-will_us_57a15f7ee4b004301c522b7c">Huffington Post</a>)</i><br /><br />Last year, I wrote about <a href="https://research.googleblog.com/2015/07/should-my-kid-learn-to-code_14.html" >the importance of teaching computational thinking to all K-12 students</a>. Given the growing use of computing, algorithms and data in all fields from the humanities to medicine to business, it’s becoming increasingly important for students to understand the basics of computer science (CS). One lesson we have learned through Google’s CS education outreach efforts is that these skills can be accessible to all students, if we introduce them early in K-5. These are truly 21st century skills which can, over time, produce a workforce ready for a technology-enabled and driven economy. <br /><br />How can teachers start introducing computational thinking in early school curriculum? It is already present in many topic areas - algorithms for solving math problems, for example. However, what is often missing in current examples of computational thinking is the explicit connection between what students are learning and its application in computing. For example, once a student has mastered adding multi-digit numbers, the following algorithm could be presented:<br /><ol><li>Add together the digits in the ones place. If the result is &lt; 10, it becomes the ones digit of the answer. 
If it's &gt;= 10, the ones digit of the result becomes the ones digit of the answer, and you add 1 to the next column.</li><li>Add together the digits in the tens place, plus the 1 carried over from the ones place, if necessary. If the answer is &lt; 10, it becomes the tens digit of the answer; if it's &gt;= 10, the ones digit becomes the tens digit of the answer and 1 is added to the next column.</li><li>Repeat this process for any additional columns until they are all added.</li></ol><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-W_QBteVg0nY/V6DKkWyQ95I/AAAAAAAABI0/MXZ1MIfpBDoeCRPNK4zXjAah1aFIQv_IACLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://4.bp.blogspot.com/-W_QBteVg0nY/V6DKkWyQ95I/AAAAAAAABI0/MXZ1MIfpBDoeCRPNK4zXjAah1aFIQv_IACLcB/s640/image00.png" width="640" /></a></div>This allows a teacher to present the concept of an algorithm and its use in computing, as well as the most important elements of any computer program: conditional branching (“if the result is less than 10…”) and iteration (“repeat this process…”). Going a step further, a teacher translating the algorithm into a running program can have a compelling effect. When something that students have used to solve an instance of a problem can automatically solve all instances of that problem, it’s quite a powerful moment for them even if they don’t do the coding themselves. <br /><br />Google has created <a href="https://computationalthinkingcourse.withgoogle.com/course?use_last_location=true" >an online course for K-12 teachers to learn about computational thinking</a> and how to make these explicit connections for their students. We also have a <a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/index.html#!ct-materials" >large repository of lessons</a>, explorations and programs to support teachers and students. 
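The multi-digit addition algorithm described above can be sketched as a short program (a hypothetical classroom example, not taken from Google's course materials), with the conditional branching and iteration called out in comments:

```python
# A runnable sketch of the column-by-column addition algorithm above,
# illustrating iteration ("repeat for each column") and conditional
# branching ("if the column sum is >= 10, carry 1").

def add_multi_digit(a, b):
    digits_a = [int(d) for d in str(a)][::-1]  # ones place first
    digits_b = [int(d) for d in str(b)][::-1]
    result, carry = [], 0
    for i in range(max(len(digits_a), len(digits_b))):  # iteration over columns
        column = carry
        column += digits_a[i] if i < len(digits_a) else 0
        column += digits_b[i] if i < len(digits_b) else 0
        if column >= 10:                                # conditional branching
            result.append(column - 10)                  # ones digit of the sum
            carry = 1                                   # carry to next column
        else:
            result.append(column)
            carry = 0
    if carry:
        result.append(carry)                            # final leftover carry
    return int("".join(str(d) for d in reversed(result)))

print(add_multi_digit(478, 365))  # 843
```

Running such a sketch on inputs the students choose themselves is one way to show that the procedure they learned by hand solves every instance of the problem.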
Our videos illustrate real-world examples of the application of computational thinking in Google’s products and services, and we have compiled a set of <a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/index.html#!resources" >great resources showing how to integrate computational thinking into existing curriculum</a>. We also recently announced <a href="https://projectbloks.withgoogle.com/" >Project Bloks</a> to engage younger children in computational thinking. Finally, <a href="http://code.org/">code.org</a>, of which Google is a primary sponsor, has <a href="https://code.org/educate/curriculum/elementary-school" >curriculum and materials</a> for K-5 teachers and students. <br /><br />We feel that computational thinking is a core skill for all students. If we can make these explicit connections for students, they will see how the devices and apps that they use every day are powered by algorithms and programs. They will learn the importance of data in making decisions. They will learn skills that will prepare them for a workforce that will be doing vastly different tasks than the workforce of today. We owe it to all students to give them every possible opportunity to be productive and successful members of society.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/computational-thinking-for-all-students/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing an Open Source ADC board for BeagleBone</title>
		<link>https://googledata.org/google-research/announcing-an-open-source-adc-board-for-beaglebone-2/</link>
		<comments>https://googledata.org/google-research/announcing-an-open-source-adc-board-for-beaglebone-2/#comments</comments>
		<pubDate>Wed, 20 Jul 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=ad7d6463323ea6dca10c91ad01d6a1d9</guid>
		<description><![CDATA[<span>Posted by Jason Holt, Software Engineer</span><br /><br /><i>(Cross-posted on the <a href="http://google-opensource.blogspot.com/2016/07/announcing-open-source-adc-board-for.html">Google Open Source Blog</a>)</i><br /><br />Working with electronics, we often find ourselves soldering up a half baked electronic circuit to detect some sort of signal. For example, last year we wanted to measure the strength of a <a href="https://en.wikipedia.org/wiki/Carrier_wave">carrier</a>. We started with traditional analog circuits &#8212; <a href="https://en.wikipedia.org/wiki/Operational_amplifier">amplifier</a>, <a href="https://en.wikipedia.org/wiki/LC_circuit">filter</a>, <a href="https://en.wikipedia.org/wiki/Envelope_detector">envelope detector</a>, <a href="https://en.wikipedia.org/wiki/Comparator">threshold</a>. You can see some of our prototypes in the image below; they get pretty messy.<br /><div><a href="https://4.bp.blogspot.com/-pWzu2OS_v0o/V46NilT-u3I/AAAAAAAABIU/oB_yBhuJVZYT0o5Ag8zQVjkyl_10SxD_ACEw/s1600/image01.png"><img border="0" height="584" src="https://4.bp.blogspot.com/-pWzu2OS_v0o/V46NilT-u3I/AAAAAAAABIU/oB_yBhuJVZYT0o5Ag8zQVjkyl_10SxD_ACEw/s640/image01.png" width="640"></a></div>While there's a certain satisfaction in taming a signal using the physical properties of capacitors, coils of wire and transistors, it's usually easier to digitize the signal with an <a href="https://en.wikipedia.org/wiki/Analog-to-digital_converter">Analog to Digital Converter</a> (ADC) and manage it with <a href="https://en.wikipedia.org/wiki/Digital_signal_processing">Digital Signal Processing</a> (DSP) instead of electronic parts. 
Tweaking software doesn't require a soldering iron, and lets us modify signals in ways that would be impractical or impossible with analog circuits.<br /><br />There are several standard solutions for digitizing a signal: connect a laptop to an oscilloscope or <a href="https://en.wikipedia.org/wiki/Data_acquisition">Data Acquisition System</a> (DAQ) via USB or Ethernet, or use the onboard ADCs of a maker board like an <a href="https://www.arduino.cc/">Arduino</a>. The former are sensitive and accurate, but also big and power hungry. The latter are cheap and tiny, but slower and have enough RAM for only milliseconds' worth of high speed sample data.  <br /><br />That led us to investigate single board computers like the <a href="http://beagleboard.org/bone">BeagleBone</a> and <a href="https://www.raspberrypi.org/">Raspberry Pi</a>, which are small and cheap like an Arduino, but have specs like a smartphone.  And crucially, the BeagleBone's <a href="https://en.wikipedia.org/wiki/System_on_a_chip">system-on-a-chip</a> (SoC) combines a beefy ARMv7 CPU with two smaller Programmable Realtime Units (PRUs) that have access to all 512MB of system RAM.  This lets us dedicate the PRUs to the time-sensitive and repetitive task of reading each sample out of an external ADC, while the main CPU lets us use the data with the GNU/Linux tools we're used to.<br /><br />The result is an open source <a href="http://elinux.org/Beagleboard:BeagleBone_Capes">BeagleBone cape</a> we've named <a href="https://github.com/google/prudaq/wiki">PRUDAQ</a>.  It's built around the Analog Devices AD9201 ADC, which samples two inputs simultaneously at up to 20 megasamples per second, per channel.  Simultaneous sampling and high sample rates make it useful for <a href="https://en.wikipedia.org/wiki/Software-defined_radio">software-defined radio</a> (SDR) and scientific applications where a built-in ADC isn't quite up to the task.  
<br /><div><a href="https://2.bp.blogspot.com/-Y_H0QecZ6kg/V46OZ2vEiGI/AAAAAAAABIc/i4wW5aW5v84x63t9OM4ormEA6rgY6UPigCLcB/s1600/image00.png"><img border="0" height="530" src="https://2.bp.blogspot.com/-Y_H0QecZ6kg/V46OZ2vEiGI/AAAAAAAABIc/i4wW5aW5v84x63t9OM4ormEA6rgY6UPigCLcB/s640/image00.png" width="640"></a></div><br />Our open source electrical design and sample code are available on <a href="https://github.com/google/prudaq/wiki">GitHub</a>, and <a href="https://groupgets.com/manufacturers/getlab/products/prudaq">GroupGets</a> has boards ready to ship for $79.  We were also fortunate to have help from Google intern Kumar Abhishek.  He added support for PRUDAQ to his <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> project <a href="https://github.com/abhishek-kakkar/BeagleLogic">BeagleLogic</a>, which performs much better than our sample code.<br /><br />We started <a href="https://github.com/google/prudaq/wiki">PRUDAQ</a> for our own needs, but quickly realized that others might also find it useful.  We're excited to get your feedback through the <a href="https://groups.google.com/d/forum/prudaq-users">email list</a>.  Tell us what can be done with inexpensive fast ADCs paired with inexpensive fast CPUs!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Jason Holt, Software Engineer</span><br /><br /><i>(Cross-posted on the <a href="http://google-opensource.blogspot.com/2016/07/announcing-open-source-adc-board-for.html">Google Open Source Blog</a>)</i><br /><br />Working with electronics, we often find ourselves soldering up a half baked electronic circuit to detect some sort of signal. For example, last year we wanted to measure the strength of a <a href="https://en.wikipedia.org/wiki/Carrier_wave">carrier</a>. We started with traditional analog circuits — <a href="https://en.wikipedia.org/wiki/Operational_amplifier">amplifier</a>, <a href="https://en.wikipedia.org/wiki/LC_circuit">filter</a>, <a href="https://en.wikipedia.org/wiki/Envelope_detector">envelope detector</a>, <a href="https://en.wikipedia.org/wiki/Comparator">threshold</a>. You can see some of our prototypes in the image below; they get pretty messy.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-pWzu2OS_v0o/V46NilT-u3I/AAAAAAAABIU/oB_yBhuJVZYT0o5Ag8zQVjkyl_10SxD_ACEw/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="584" src="https://4.bp.blogspot.com/-pWzu2OS_v0o/V46NilT-u3I/AAAAAAAABIU/oB_yBhuJVZYT0o5Ag8zQVjkyl_10SxD_ACEw/s640/image01.png" width="640" /></a></div>While there's a certain satisfaction in taming a signal using the physical properties of capacitors, coils of wire and transistors, it's usually easier to digitize the signal with an <a href="https://en.wikipedia.org/wiki/Analog-to-digital_converter">Analog to Digital Converter</a> (ADC) and manage it with <a href="https://en.wikipedia.org/wiki/Digital_signal_processing">Digital Signal Processing</a> (DSP) instead of electronic parts. 
Tweaking software doesn't require a soldering iron, and lets us modify signals in ways that would be impractical or impossible with analog circuits.<br /><br />There are several standard solutions for digitizing a signal: connect a laptop to an oscilloscope or <a href="https://en.wikipedia.org/wiki/Data_acquisition">Data Acquisition System</a> (DAQ) via USB or Ethernet, or use the onboard ADCs of a maker board like an <a href="https://www.arduino.cc/">Arduino</a>. The former are sensitive and accurate, but also big and power hungry. The latter are cheap and tiny, but slower and have enough RAM for only milliseconds' worth of high speed sample data.  <br /><br />That led us to investigate single board computers like the <a href="http://beagleboard.org/bone">BeagleBone</a> and <a href="https://www.raspberrypi.org/">Raspberry Pi</a>, which are small and cheap like an Arduino, but have specs like a smartphone.  And crucially, the BeagleBone's <a href="https://en.wikipedia.org/wiki/System_on_a_chip">system-on-a-chip</a> (SoC) combines a beefy ARMv7 CPU with two smaller Programmable Realtime Units (PRUs) that have access to all 512MB of system RAM.  This lets us dedicate the PRUs to the time-sensitive and repetitive task of reading each sample out of an external ADC, while the main CPU lets us use the data with the GNU/Linux tools we're used to.<br /><br />The result is an open source <a href="http://elinux.org/Beagleboard:BeagleBone_Capes">BeagleBone cape</a> we've named <a href="https://github.com/google/prudaq/wiki">PRUDAQ</a>.  It's built around the Analog Devices AD9201 ADC, which samples two inputs simultaneously at up to 20 megasamples per second, per channel.  Simultaneous sampling and high sample rates make it useful for <a href="https://en.wikipedia.org/wiki/Software-defined_radio">software-defined radio</a> (SDR) and scientific applications where a built-in ADC isn't quite up to the task.  
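To make the "DSP instead of electronic parts" point concrete, here is a small, hedged numpy sketch (an illustration only, not PRUDAQ's actual sample code): a software envelope detector that rectifies digitized samples and low-pass filters them with a moving average, the digital counterpart of the analog diode-and-RC chain described above.

```python
import numpy as np

# Simulated ADC capture: a 100 kHz carrier, amplitude-modulated at 1 kHz,
# sampled at 1 MS/s (well under PRUDAQ's 20 MS/s per channel).
fs = 1_000_000
t = np.arange(0, 0.01, 1 / fs)
modulation = 1.0 + 0.5 * np.sin(2 * np.pi * 1_000 * t)
samples = modulation * np.sin(2 * np.pi * 100_000 * t)

# Digital envelope detector: full-wave rectify, then moving-average
# low-pass over two carrier periods. The pi/2 factor rescales the mean
# of |sin| back to the peak amplitude.
window = 20  # samples: two periods of the 100 kHz carrier at 1 MS/s
kernel = np.ones(window) / window
envelope = (np.pi / 2) * np.convolve(np.abs(samples), kernel, mode="same")
```

Away from the edge-effect region at either end, `envelope` tracks the 1 kHz modulation; changing the filter or threshold is a one-line edit rather than a soldering job.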
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-Y_H0QecZ6kg/V46OZ2vEiGI/AAAAAAAABIc/i4wW5aW5v84x63t9OM4ormEA6rgY6UPigCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="530" src="https://2.bp.blogspot.com/-Y_H0QecZ6kg/V46OZ2vEiGI/AAAAAAAABIc/i4wW5aW5v84x63t9OM4ormEA6rgY6UPigCLcB/s640/image00.png" width="640" /></a></div><br />Our open source electrical design and sample code are available on <a href="https://github.com/google/prudaq/wiki">GitHub</a>, and <a href="https://groupgets.com/manufacturers/getlab/products/prudaq">GroupGets</a> has boards ready to ship for $79.  We were also fortunate to have help from Google intern Kumar Abhishek.  He added support for PRUDAQ to his <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> project <a href="https://github.com/abhishek-kakkar/BeagleLogic">BeagleLogic</a>, which performs much better than our sample code.<br /><br />We started <a href="https://github.com/google/prudaq/wiki">PRUDAQ</a> for our own needs, but quickly realized that others might also find it useful.  We're excited to get your feedback through the <a href="https://groups.google.com/d/forum/prudaq-users">email list</a>.  Tell us what can be done with inexpensive fast ADCs paired with inexpensive fast CPUs!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-an-open-source-adc-board-for-beaglebone-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Towards an exact (quantum) description of chemistry</title>
		<link>https://googledata.org/google-research/towards-an-exact-quantum-description-of-chemistry/</link>
		<comments>https://googledata.org/google-research/towards-an-exact-quantum-description-of-chemistry/#comments</comments>
		<pubDate>Mon, 18 Jul 2016 20:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5c7d56e7d3a504ef56959ce73db4864b</guid>
		<description><![CDATA[<span>Posted by Ryan Babbush, Quantum Software Engineer</span><br /><br />&#8220;<i>...nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical...</i>&#8221; - <a href="https://en.wikipedia.org/wiki/Richard_Feynman">Richard Feynman</a>, <a href="http://people.eecs.berkeley.edu/~christos/classics/Feynman.pdf">Simulating Physics with Computers</a><br /><br />One of the most promising applications of quantum computing is the ability to efficiently model quantum systems in nature that are considered intractable for classical computers. Now, in collaboration with the <a href="http://aspuru.chem.harvard.edu/">Aspuru-Guzik group at Harvard</a> and researchers from Lawrence Berkeley National Labs, UC Santa Barbara, Tufts University and University College London, we have performed the first completely scalable quantum simulation of a molecule. Our experimental results are detailed in the paper <a href="https://journals.aps.org/prx/pdf/10.1103/PhysRevX.6.031007"><i>Scalable Quantum Simulation of Molecular Energies</i></a>, which recently appeared in <a href="https://journals.aps.org/prx/"><i>Physical Review X</i></a>.<br /><br />The goal of our experiment was to use quantum hardware to efficiently solve the <a href="https://en.wikipedia.org/wiki/Computational_chemistry#Electronic_structure">molecular electronic structure problem</a>, which seeks the solution for the lowest energy configuration of electrons in the presence of a given nuclear configuration. In order to predict chemical reaction rates (which govern the mechanism of chemical reactions), one must make these calculations to extremely high precision. The ability to predict such rates could revolutionize the design of solar cells, industrial catalysts, batteries, flexible electronics, medicines, materials and more. 
The primary difficulty is that molecular systems form <a href="https://en.wikipedia.org/wiki/Quantum_entanglement">highly entangled quantum superposition states</a> which require exponentially many classical computing resources in order to represent to sufficiently high precision. For example, exactly computing the energies of methane (CH<sub>4</sub>) takes about one second, but the same calculation takes about ten minutes for ethane (C<sub>2</sub>H<sub>6</sub>) and about ten days for propane (C<sub>3</sub>H<sub>8</sub>).<br /><br />In our experiment, we focus on an approach known as the <a href="http://iopscience.iop.org/article/10.1088/1367-2630/18/2/023023/meta">variational quantum eigensolver</a> (VQE), which can be understood as a quantum analog of a neural network. Whereas a classical neural network is a parameterized mapping that one trains in order to model classical data, VQE is a parameterized mapping (e.g. a quantum circuit) that one trains in order to model quantum data (e.g. a molecular wavefunction). The training objective for VQE is the molecular energy function, which is always minimized by the true ground state. The quantum advantage of VQE is that quantum bits can efficiently represent the molecular wavefunction whereas exponentially many classical bits would be required.<br /><br />Using VQE, we quantum computed the energy landscape of molecular hydrogen, H<sub>2</sub>. We compared the performance of VQE to another quantum algorithm for chemistry, the <a href="http://science.sciencemag.org/content/309/5741/1704">phase estimation algorithm</a> (PEA). Experimentally computed energies, as a function of the H - H bond length, are shown below alongside the exact curve. We were able to obtain such high performance with VQE because the neural-network-like training loop helped to establish experimentally optimal circuit parameters for representing the wavefunction in the presence of systematic control errors. 
One can understand this by considering a hardware implementation of a neural network with a faulty weight, e.g. the weight is only represented half as strong as it should be. Because the weights of the neural network are established via a closed-loop training procedure which can compensate for such systematic errors, the hardware neural network is robust against such imperfections. Likewise, despite systematic errors in our implementation of the VQE circuit, we are still able to learn an accurate model for the wavefunction. This robustness inspires hope that VQE may be able to solve classically intractable problems without <a href="https://research.googleblog.com/2015/03/a-step-closer-to-quantum-computation.html">quantum error correction</a>.<br /><div><a href="https://2.bp.blogspot.com/-nDPJWmMCQ5Q/V405VGumkNI/AAAAAAAABH8/vvWfnVAHes89GRdaONr2Y0-DeY-fmoSNACLcB/s1600/image00.png"><img border="0" height="480" src="https://2.bp.blogspot.com/-nDPJWmMCQ5Q/V405VGumkNI/AAAAAAAABH8/vvWfnVAHes89GRdaONr2Y0-DeY-fmoSNACLcB/s640/image00.png" width="640"></a></div>While the energies of molecular hydrogen can be computed classically (albeit inefficiently), as one scales up quantum hardware it becomes possible to simulate even larger chemical systems, including classically intractable ones. For instance, with only about a hundred reliable quantum bits one could model the process by which <a href="https://en.wikipedia.org/wiki/Nitrogen_fixation#Biological_nitrogen_fixation">bacteria produce fertilizer</a> at room temperature. Elucidating this mechanism is a famous open problem in chemistry because the way <a href="https://en.wikipedia.org/wiki/Haber_process">humans produce fertilizer</a> is extremely inefficient and consumes 1-2% of the world's energy annually. 
Such calculations could also assist with breakthroughs in fundamental science, for instance, in the understanding of <a href="https://en.wikipedia.org/wiki/High-temperature_superconductivity">high temperature superconductivity</a>.<br /><br />Though many theoretical and experimental challenges lie ahead, a quantum-enabled paradigm shift from qualitative / descriptive chemistry simulations to quantitative / predictive chemistry simulations could modernize the field so dramatically that the examples imaginable today are just the tip of the iceberg.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Ryan Babbush, Quantum Software Engineer</span><br /><br />“<i>...nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical...</i>” - <a href="https://en.wikipedia.org/wiki/Richard_Feynman">Richard Feynman</a>, <a href="http://people.eecs.berkeley.edu/~christos/classics/Feynman.pdf">Simulating Physics with Computers</a><br /><br />One of the most promising applications of quantum computing is the ability to efficiently model quantum systems in nature that are considered intractable for classical computers. Now, in collaboration with the <a href="http://aspuru.chem.harvard.edu/">Aspuru-Guzik group at Harvard</a> and researchers from Lawrence Berkeley National Labs, UC Santa Barbara, Tufts University and University College London, we have performed the first completely scalable quantum simulation of a molecule. Our experimental results are detailed in the paper <a href="https://journals.aps.org/prx/pdf/10.1103/PhysRevX.6.031007"><i>Scalable Quantum Simulation of Molecular Energies</i></a>, which recently appeared in <a href="https://journals.aps.org/prx/"><i>Physical Review X</i></a>.<br /><br />The goal of our experiment was to use quantum hardware to efficiently solve the <a href="https://en.wikipedia.org/wiki/Computational_chemistry#Electronic_structure">molecular electronic structure problem</a>, which seeks the solution for the lowest energy configuration of electrons in the presence of a given nuclear configuration. In order to predict chemical reaction rates (which govern the mechanism of chemical reactions), one must make these calculations to extremely high precision. The ability to predict such rates could revolutionize the design of solar cells, industrial catalysts, batteries, flexible electronics, medicines, materials and more. 
The primary difficulty is that molecular systems form <a href="https://en.wikipedia.org/wiki/Quantum_entanglement">highly entangled quantum superposition states</a> which require exponentially many classical computing resources in order to represent to sufficiently high precision. For example, exactly computing the energies of methane (CH<sub>4</sub>) takes about one second, but the same calculation takes about ten minutes for ethane (C<sub>2</sub>H<sub>6</sub>) and about ten days for propane (C<sub>3</sub>H<sub>8</sub>).<br /><br />In our experiment, we focus on an approach known as the <a href="http://iopscience.iop.org/article/10.1088/1367-2630/18/2/023023/meta">variational quantum eigensolver</a> (VQE), which can be understood as a quantum analog of a neural network. Whereas a classical neural network is a parameterized mapping that one trains in order to model classical data, VQE is a parameterized mapping (e.g. a quantum circuit) that one trains in order to model quantum data (e.g. a molecular wavefunction). The training objective for VQE is the molecular energy function, which is always minimized by the true ground state. The quantum advantage of VQE is that quantum bits can efficiently represent the molecular wavefunction whereas exponentially many classical bits would be required.<br /><br />Using VQE, we quantum computed the energy landscape of molecular hydrogen, H<sub>2</sub>. We compared the performance of VQE to another quantum algorithm for chemistry, the <a href="http://science.sciencemag.org/content/309/5741/1704">phase estimation algorithm</a> (PEA). Experimentally computed energies, as a function of the H - H bond length, are shown below alongside the exact curve. We were able to obtain such high performance with VQE because the neural-network-like training loop helped to establish experimentally optimal circuit parameters for representing the wavefunction in the presence of systematic control errors. 
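To make the neural-network analogy concrete, here is a hedged classical toy (not the paper's actual experiment): a one-parameter "ansatz" state ψ(θ) = (cos θ, sin θ) is trained to minimize the energy ⟨ψ(θ)|H|ψ(θ)⟩ of a small stand-in Hamiltonian, and the training loop converges to the true ground-state energy, just as VQE's outer loop does for a molecular wavefunction.

```python
import numpy as np

# A fixed 2x2 symmetric matrix standing in for the molecular Hamiltonian.
H = np.array([[1.0, 0.5],
              [0.5, -1.0]])

def energy(theta):
    """Training objective: the expectation value <psi(theta)|H|psi(theta)>."""
    psi = np.array([np.cos(theta), np.sin(theta)])  # one-parameter ansatz state
    return psi @ H @ psi

# Closed-loop "training": scan the single variational parameter and keep the
# minimum, mimicking VQE's outer optimization over circuit parameters.
thetas = np.linspace(0.0, np.pi, 2001)
vqe_energy = min(energy(th) for th in thetas)

# Exact ground-state energy for comparison (smallest eigenvalue of H).
exact_ground = np.linalg.eigvalsh(H).min()
```

Because the objective is minimized by the true ground state, even a crude optimizer lands on `exact_ground`; in the real experiment the same closed loop also absorbs systematic control errors in the hardware.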
One can understand this by considering a hardware implementation of a neural network with a faulty weight, e.g. the weight is only represented half as strong as it should be. Because the weights of the neural network are established via a closed-loop training procedure which can compensate for such systematic errors, the hardware neural network is robust against such imperfections. Likewise, despite systematic errors in our implementation of the VQE circuit, we are still able to learn an accurate model for the wavefunction. This robustness inspires hope that VQE may be able to solve classically intractable problems without <a href="https://research.googleblog.com/2015/03/a-step-closer-to-quantum-computation.html">quantum error correction</a>.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-nDPJWmMCQ5Q/V405VGumkNI/AAAAAAAABH8/vvWfnVAHes89GRdaONr2Y0-DeY-fmoSNACLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://2.bp.blogspot.com/-nDPJWmMCQ5Q/V405VGumkNI/AAAAAAAABH8/vvWfnVAHes89GRdaONr2Y0-DeY-fmoSNACLcB/s640/image00.png" width="640" /></a></div>While the energies of molecular hydrogen can be computed classically (albeit inefficiently), as one scales up quantum hardware it becomes possible to simulate even larger chemical systems, including classically intractable ones. For instance, with only about a hundred reliable quantum bits one could model the process by which <a href="https://en.wikipedia.org/wiki/Nitrogen_fixation#Biological_nitrogen_fixation">bacteria produce fertilizer</a> at room temperature. Elucidating this mechanism is a famous open problem in chemistry because the way <a href="https://en.wikipedia.org/wiki/Haber_process">humans produce fertilizer</a> is extremely inefficient and consumes 1-2% of the world's energy annually. 
Such calculations could also assist with breakthroughs in fundamental science, for instance, in the understanding of <a href="https://en.wikipedia.org/wiki/High-temperature_superconductivity">high temperature superconductivity</a>.<br /><br />Though many theoretical and experimental challenges lie ahead, a quantum-enabled paradigm shift from qualitative / descriptive chemistry simulations to quantitative / predictive chemistry simulations could modernize the field so dramatically that the examples imaginable today are just the tip of the iceberg.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/towards-an-exact-quantum-description-of-chemistry/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wide &amp; Deep Learning: Better Together with TensorFlow</title>
		<link>https://googledata.org/google-research/wide-deep-learning-better-together-with-tensorflow/</link>
		<comments>https://googledata.org/google-research/wide-deep-learning-better-together-with-tensorflow/#comments</comments>
		<pubDate>Wed, 29 Jun 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5ab3d63636ea67de6a655787d3f794d9</guid>
		<description><![CDATA[<span>Posted by Heng-Tze Cheng, Senior Software Engineer, Google Research</span><br /><br />The human brain is a sophisticated learning machine, forming rules by memorizing everyday events (&#8220;sparrows can fly&#8221; and &#8220;pigeons can fly&#8221;) and generalizing those learnings to apply to things we haven't seen before (&#8220;animals with wings can fly&#8221;). Perhaps more powerfully, memorization also allows us to further refine our generalized rules with exceptions (&#8220;penguins can't fly&#8221;). As we were exploring how to advance machine intelligence, we asked ourselves the question&#8212;can we teach computers to learn like humans do, by combining the power of memorization and generalization?<br /><br />It's not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it <b>Wide &#38; Deep Learning</b>. It's useful for generic large-scale regression and classification problems with sparse inputs (<a href="https://en.wikipedia.org/wiki/Categorical_variable">categorical features</a> with a large number of possible feature values), such as recommender systems, search, and ranking problems.<br /><div><a href="https://1.bp.blogspot.com/-Dw1mB9am1l8/V3MgtOzp3uI/AAAAAAAABGs/mP-3nZQCjWwdk6qCa5WraSpK8A7rSPj3ACLcB/s1600/image04.png"><img border="0" height="142" src="https://1.bp.blogspot.com/-Dw1mB9am1l8/V3MgtOzp3uI/AAAAAAAABGs/mP-3nZQCjWwdk6qCa5WraSpK8A7rSPj3ACLcB/s640/image04.png" width="640"></a></div>Today we&#8217;re open-sourcing our implementation of Wide &#38; Deep Learning as part of the <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn">TF.Learn API</a> so that you can easily train a model yourself. 
Please check out the TensorFlow tutorials on <a href="https://www.tensorflow.org/tutorials/wide/">Linear Models</a> and <a href="https://www.tensorflow.org/tutorials/wide_and_deep/">Wide &#38; Deep Learning</a>, as well as our <a href="http://arxiv.org/abs/1606.07792">research paper</a> to learn more.<br /><div></div><br /><b>How Wide &#38; Deep Learning works.</b><br />Let's say one day you wake up with an idea for a new app called <i>FoodIO</i><a href="http://research.googleblog.com/#1" name="top1"><sup>*</sup></a>. A user of the app just needs to say out loud what kind of food he/she is craving (the <i>query</i>). The app magically predicts the dish that the user will like best, and the dish gets delivered to the user's front door (the <i>item</i>). Your key metric is consumption rate&#8212;if a dish was eaten by the user, the score is 1; otherwise it's 0 (the <i>label</i>).<br /><br />You come up with some simple rules to start, like returning the items that match the most characters in the query, and you release the first version of FoodIO. Unfortunately, you find that the consumption rate is pretty low because the matches are too crude to be really useful (people shouting &#8220;fried chicken&#8221; end up getting &#8220;chicken fried rice&#8221;), so you decide to add machine learning to learn from the data.<br /><br /><div></div><b>The Wide model.</b><br />In the 2nd version, you want to memorize what items work the best for each query. So, you train a linear model in TensorFlow with a <b>wide</b> set of cross-product feature transformations to capture how the co-occurrence of a query-item feature pair correlates with the target label (whether or not an item is consumed). The model predicts the probability of consumption P(consumption &#124; query, item) for each item, and FoodIO delivers the top item with the highest predicted consumption rate. 
For example, the model learns that feature <span>AND(query="fried chicken", item="chicken and waffles")</span> is a huge win, while <span>AND(query="fried chicken", item="chicken fried rice")</span> doesn't get as much love even though the character match is higher. In other words, FoodIO 2.0 does a pretty good job <b>memorizing</b> what users like, and it starts to get more traction.<br /><div><a href="https://2.bp.blogspot.com/-I_YshHCoxNs/V3Mg5QG4s-I/AAAAAAAABG8/6hHCKiUhcF03kJrLTVJd6Al-MX4sR_bUACKgB/s1600/image02.png"><img border="0" height="178" src="https://2.bp.blogspot.com/-I_YshHCoxNs/V3Mg5QG4s-I/AAAAAAAABG8/6hHCKiUhcF03kJrLTVJd6Al-MX4sR_bUACKgB/s640/image02.png" width="640"></a></div><b>The Deep model.</b><br />Later on you discover that many users are saying that they're tired of the recommendations. They're eager to discover similar but different cuisines with a &#8220;surprise me&#8221; state of mind. So you brush up on your TensorFlow toolkit again and train a <b>deep</b> feed-forward neural network for FoodIO 3.0. With your deep model, you're learning lower-dimensional dense representations (usually called embedding vectors) for every query and item. With that, FoodIO is able to <b>generalize</b> by matching items to queries that are close to each other in the embedding space. For example, you find that people who asked for &#8220;fried chicken&#8221; often don't mind having &#8220;burgers&#8221; as well.<br /><div><a href="https://3.bp.blogspot.com/-O6Ssu0m0_O8/V3MhQWN10AI/AAAAAAAABHE/V1PtDHKp2MQQ9jfuyHxs2HHR7Ovg5M6LQCLcB/s1600/image01.png"><img border="0" height="274" src="https://3.bp.blogspot.com/-O6Ssu0m0_O8/V3MhQWN10AI/AAAAAAAABHE/V1PtDHKp2MQQ9jfuyHxs2HHR7Ovg5M6LQCLcB/s640/image01.png" width="640"></a></div><b>Combining Wide and Deep models.</b><br />However, you discover that the deep neural network sometimes generalizes too much and recommends irrelevant dishes. 
You dig into the historical traffic, and find that there are actually two distinct types of query-item relationships in the data.<br /><br />The first type of query is very targeted. People shouting very specific items like &#8220;iced decaf latte with nonfat milk&#8221; really mean it. Just because it's pretty close to &#8220;hot latte with whole milk&#8221; in the embedding space doesn't mean it's an acceptable alternative. And there are millions of these rules where the transitivity of embeddings may actually do more harm than good. On the other hand, queries that are more exploratory like &#8220;seafood&#8221; or &#8220;italian food&#8221; may be open to more generalization and discovering a diverse set of related items. Having realized this, you have an epiphany: Why do I have to choose either wide or deep models? Why not both?<br /><div><a href="https://2.bp.blogspot.com/-wkrmRibw_GM/V3Mg3O3Q0-I/AAAAAAAABG0/Jm3Nl4-VcYIJ44dA5nSz6vpTyCKF2KWQgCKgB/s1600/image03.png"><img border="0" height="274" src="https://2.bp.blogspot.com/-wkrmRibw_GM/V3Mg3O3Q0-I/AAAAAAAABG0/Jm3Nl4-VcYIJ44dA5nSz6vpTyCKF2KWQgCKgB/s640/image03.png" width="640"></a></div>Finally, you build FoodIO 4.0 with Wide &#38; Deep Learning in TensorFlow. As shown in the graph above, the sparse features like&#160;<span><span><span>query="fried chicken"</span></span></span>&#160;and <span>item="chicken fried rice"</span> are used in both the wide part (left) and&#160;the deep part (right) of the model. During training, the prediction errors are backpropagated to both sides to train the model parameters. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.<br /><br /><b>Wider. Deeper. 
Together.</b><br />We're excited to share the TensorFlow API and implementation of Wide &#38; Deep Learning with you, so you can try out your ideas with it and share your findings with everyone else. To get started, check out the code on <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn">GitHub</a> and our TensorFlow tutorials on <a href="https://www.tensorflow.org/tutorials/wide/">Linear Models</a> and <a href="https://www.tensorflow.org/tutorials/wide_and_deep/">Wide &#38; Deep Learning</a>.<br /><br /><b>Acknowledgement</b><br />Bringing Wide &#38; Deep from idea and research to implementation has been a huge team effort. We'd like to thank all the people who have contributed to the project or have given us advice, including: Heng-Tze Cheng, Mustafa Ispir, Zakaria Haque, Lichan Hong, Rohan Anil, Denis Baylor, Vihan Jain, Salem Haykal, Robson Araujo, Xiaobing Liu, Yonghui Wu, Thomas Strohmann, Tal Shaked, Jeremiah Harmsen, Greg Corrado, Glen Anderson, D. Sculley, Tushar Chandra, Ed Chi, Rajat Monga, Rob von Behren, Jarek Wilkiewicz, Christine Robson, Illia Polosukhin, Martin Wicke, Gus Katsiapis, Alexandre Passos, Olivier Chapelle, Levent Koc, Akshay Naresh Modi, Wei Chai, Hrishi Aradhye, Othar Hansson, Xinran He, Martin Zinkevich, Joe Toth, Anton Rusanov, Hemal Shah, Petros Mol, Frank Li, Yutaka Suematsu, Sameer Ahuja, Eugene Brevdo, Philip Tucker, Shanqing Cai, Kester Tong, and more.<br /><span><br /><a name="1"><b>* </b></a>For illustration only. FoodIO is not a real app.<a href="http://research.googleblog.com/#top1"><sup>&#8617;</sup></a><br /></span>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Heng-Tze Cheng, Senior Software Engineer, Google Research</span><br /><br />The human brain is a sophisticated learning machine, forming rules by memorizing everyday events (“sparrows can fly” and “pigeons can fly”) and generalizing those learnings to apply to things we haven't seen before (“animals with wings can fly”). Perhaps more powerfully, memorization also allows us to further refine our generalized rules with exceptions (“penguins can't fly”). As we were exploring how to advance machine intelligence, we asked ourselves the question—can we teach computers to learn like humans do, by combining the power of memorization and generalization?<br /><br />It's not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it <b>Wide &amp; Deep Learning</b>. It's useful for generic large-scale regression and classification problems with sparse inputs (<a href="https://en.wikipedia.org/wiki/Categorical_variable">categorical features</a> with a large number of possible feature values), such as recommender systems, search, and ranking problems.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Dw1mB9am1l8/V3MgtOzp3uI/AAAAAAAABGs/mP-3nZQCjWwdk6qCa5WraSpK8A7rSPj3ACLcB/s1600/image04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="142" src="https://1.bp.blogspot.com/-Dw1mB9am1l8/V3MgtOzp3uI/AAAAAAAABGs/mP-3nZQCjWwdk6qCa5WraSpK8A7rSPj3ACLcB/s640/image04.png" width="640" /></a></div>Today we’re open-sourcing our implementation of Wide &amp; Deep Learning as part of the <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn">TF.Learn API</a> so that you can easily train a model yourself. 
Please check out the TensorFlow tutorials on <a href="https://www.tensorflow.org/tutorials/wide/">Linear Models</a> and <a href="https://www.tensorflow.org/tutorials/wide_and_deep/">Wide &amp; Deep Learning</a>, as well as our <a href="http://arxiv.org/abs/1606.07792">research paper</a> to learn more.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/Xmw9SWJ0L50/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/Xmw9SWJ0L50?rel=0&amp;feature=player_embedded" width="640"></iframe></div><br /><b>How Wide &amp; Deep Learning works.</b><br />Let's say one day you wake up with an idea for a new app called <i>FoodIO</i><a href="http://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html#1" name="top1"><sup>*</sup></a>. A user of the app just needs to say out loud what kind of food he/she is craving (the <i>query</i>). The app magically predicts the dish that the user will like best, and the dish gets delivered to the user's front door (the <i>item</i>). Your key metric is consumption rate—if a dish was eaten by the user, the score is 1; otherwise it's 0 (the <i>label</i>).<br /><br />You come up with some simple rules to start, like returning the items that match the most characters in the query, and you release the first version of FoodIO. Unfortunately, you find that the consumption rate is pretty low because the matches are too crude to be really useful (people shouting “fried chicken” end up getting “chicken fried rice”), so you decide to add machine learning to learn from the data.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><b>The Wide model.</b><br />In the 2nd version, you want to memorize what items work best for each query. 
So, you train a linear model in TensorFlow with a <b>wide</b> set of cross-product feature transformations to capture how the co-occurrence of a query-item feature pair correlates with the target label (whether or not an item is consumed). The model predicts the probability of consumption P(consumption | query, item) for each item, and FoodIO delivers the top item with the highest predicted consumption rate. For example, the model learns that feature <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">AND(query="fried chicken", item="chicken and waffles")</span> is a huge win, while <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">AND(query="fried chicken", item="chicken fried rice")</span> doesn't get as much love even though the character match is higher. In other words, FoodIO 2.0 does a pretty good job <b>memorizing</b> what users like, and it starts to get more traction.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-I_YshHCoxNs/V3Mg5QG4s-I/AAAAAAAABG8/6hHCKiUhcF03kJrLTVJd6Al-MX4sR_bUACKgB/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="178" src="https://2.bp.blogspot.com/-I_YshHCoxNs/V3Mg5QG4s-I/AAAAAAAABG8/6hHCKiUhcF03kJrLTVJd6Al-MX4sR_bUACKgB/s640/image02.png" width="640" /></a></div><b>The Deep model.</b><br />Later on you discover that many users are saying that they're tired of the recommendations. They're eager to discover similar but different cuisines with a “surprise me” state of mind. So you brush up on your TensorFlow toolkit again and train a <b>deep</b> feed-forward neural network for FoodIO 3.0. With your deep model, you're learning lower-dimensional dense representations (usually called embedding vectors) for every query and item. With that, FoodIO is able to <b>generalize</b> by matching items to queries that are close to each other in the embedding space. 
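That notion of "close in the embedding space" can be made concrete with a toy similarity check. This is an illustrative sketch only, not the TF.Learn API: the vectors below are invented by hand, whereas a real deep model learns them from data.

```python
import math

# Hand-made toy embeddings (invented for illustration; a deep model
# would learn these jointly with the rest of the network).
embeddings = {
    "fried chicken":    [0.9, 0.1, 0.3],
    "burgers":          [0.8, 0.2, 0.4],
    "iced decaf latte": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity: queries/items with nearby embeddings score high."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "fried chicken" sits much closer to "burgers" than to "iced decaf latte",
# which is how the deep model can generalize between related cuisines.
assert cosine(embeddings["fried chicken"], embeddings["burgers"]) > \
       cosine(embeddings["fried chicken"], embeddings["iced decaf latte"])
```

The same nearness that enables this generalization is what causes the over-generalization problem described next.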
For example, you find that people who asked for “fried chicken” often don't mind having “burgers” as well.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-O6Ssu0m0_O8/V3MhQWN10AI/AAAAAAAABHE/V1PtDHKp2MQQ9jfuyHxs2HHR7Ovg5M6LQCLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="274" src="https://3.bp.blogspot.com/-O6Ssu0m0_O8/V3MhQWN10AI/AAAAAAAABHE/V1PtDHKp2MQQ9jfuyHxs2HHR7Ovg5M6LQCLcB/s640/image01.png" width="640" /></a></div><b>Combining Wide and Deep models.</b><br />However, you discover that the deep neural network sometimes generalizes too much and recommends irrelevant dishes. You dig into the historical traffic, and find that there are actually two distinct types of query-item relationships in the data.<br /><br />The first type of query is very targeted. People shouting very specific items like “iced decaf latte with nonfat milk” really mean it. Just because it's pretty close to “hot latte with whole milk” in the embedding space doesn't mean it's an acceptable alternative. And there are millions of these rules where the transitivity of embeddings may actually do more harm than good. On the other hand, queries that are more exploratory like “seafood” or “italian food” may be open to more generalization and discovering a diverse set of related items. Having realized this, you have an epiphany: Why do I have to choose either wide or deep models? 
Why not both?<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-wkrmRibw_GM/V3Mg3O3Q0-I/AAAAAAAABG0/Jm3Nl4-VcYIJ44dA5nSz6vpTyCKF2KWQgCKgB/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="274" src="https://2.bp.blogspot.com/-wkrmRibw_GM/V3Mg3O3Q0-I/AAAAAAAABG0/Jm3Nl4-VcYIJ44dA5nSz6vpTyCKF2KWQgCKgB/s640/image03.png" width="640" /></a></div>Finally, you build FoodIO 4.0 with Wide &amp; Deep Learning in TensorFlow. As shown in the graph above, the sparse features like&nbsp;<span id="docs-internal-guid-0298828a-99c1-5961-d83f-fdfa6eb23129"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">query="fried chicken"</span></span></span>&nbsp;and <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">item="chicken fried rice"</span> are used in both the wide part (left) and&nbsp;the deep part (right) of the model. During training, the prediction errors are backpropagated to both sides to train the model parameters. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.<br /><br /><b>Wider. Deeper. Together.</b><br />We're excited to share the TensorFlow API and implementation of Wide &amp; Deep Learning with you, so you can try out your ideas with it and share your findings with everyone else. 
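The joint scoring just described can be sketched in miniature: a memorized cross-feature weight (the wide side) and an embedding dot product (the deep side) are summed into a single logit before the sigmoid. All weights and vectors below are invented for illustration; in the real model both sides are learned jointly by backpropagation, as the post explains.

```python
import math

# Wide part: weights memorized for specific AND(query, item) cross-features.
# (Values are made up for this sketch.)
cross_weights = {
    ("fried chicken", "chicken and waffles"): 2.0,   # memorized win
    ("fried chicken", "chicken fried rice"): -1.5,   # memorized miss
}

# Deep part: toy low-dimensional embeddings for queries and items.
embeddings = {
    "fried chicken":       [0.9, 0.1],
    "burgers":             [0.8, 0.2],
    "chicken and waffles": [0.85, 0.15],
    "chicken fried rice":  [0.1, 0.9],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(query, item):
    """P(consumption | query, item) from the sum of wide and deep logits."""
    wide_logit = cross_weights.get((query, item), 0.0)  # 0 if never memorized
    deep_logit = dot(embeddings[query], embeddings[item])
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit)))

# The memorized rule suppresses the misleading character match...
assert predict("fried chicken", "chicken and waffles") > \
       predict("fried chicken", "chicken fried rice")
# ...while the embeddings still let the model generalize to nearby items.
assert predict("fried chicken", "burgers") > \
       predict("fried chicken", "chicken fried rice")
```

The point of the sketch is only the additive structure: memorized exceptions and learned generalization contribute to the same prediction.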
To get started, check out the code on <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn">GitHub</a> and our TensorFlow tutorials on <a href="https://www.tensorflow.org/tutorials/wide/">Linear Models</a> and <a href="https://www.tensorflow.org/tutorials/wide_and_deep/">Wide &amp; Deep Learning</a>.<br /><br /><b>Acknowledgement</b><br />Bringing Wide &amp; Deep from idea and research to implementation has been a huge team effort. We'd like to thank all the people who have contributed to the project or have given us advice, including: Heng-Tze Cheng, Mustafa Ispir, Zakaria Haque, Lichan Hong, Rohan Anil, Denis Baylor, Vihan Jain, Salem Haykal, Robson Araujo, Xiaobing Liu, Yonghui Wu, Thomas Strohmann, Tal Shaked, Jeremiah Harmsen, Greg Corrado, Glen Anderson, D. Sculley, Tushar Chandra, Ed Chi, Rajat Monga, Rob von Behren, Jarek Wilkiewicz, Christine Robson, Illia Polosukhin, Martin Wicke, Gus Katsiapis, Alexandre Passos, Olivier Chapelle, Levent Koc, Akshay Naresh Modi, Wei Chai, Hrishi Aradhye, Othar Hansson, Xinran He, Martin Zinkevich, Joe Toth, Anton Rusanov, Hemal Shah, Petros Mol, Frank Li, Yutaka Suematsu, Sameer Ahuja, Eugene Brevdo, Philip Tucker, Shanqing Cai, Kester Tong, and more.<br /><span class="Apple-style-span" style="font-size: small;"><br /><a name="1"><b>* </b></a>For illustration only. FoodIO is not a real app.<a href="http://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html#top1"><sup>↩</sup></a><br /></span>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/wide-deep-learning-better-together-with-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>CVPR 2016 &amp; Research at Google</title>
		<link>https://googledata.org/google-research/cvpr-2016-research-at-google/</link>
		<comments>https://googledata.org/google-research/cvpr-2016-research-at-google/#comments</comments>
		<pubDate>Tue, 28 Jun 2016 16:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=89f132acd381e89f76d646d19d14356e</guid>
		<description><![CDATA[<span>Posted by Rahul Sukthankar, Research Scientist</span><br /><br />This week, Las Vegas hosts the <a href="http://cvpr2016.thecvf.com/">2016 Conference on Computer Vision and Pattern Recognition</a> (CVPR 2016), the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. As a leader in computer vision research, Google has a strong presence at CVPR 2016, with many Googlers presenting papers and invited talks at the conference, tutorials and workshops.<br /><br />We congratulate Google Research Scientist Ce Liu and Google Faculty Advisor <a href="http://www.cs.cmu.edu/~abhinavg/">Abhinav Gupta</a>, who were selected as this year&#8217;s recipients of the <a href="https://www.computer.org/web/tcpami/young-researcher-award">PAMI Young Researcher Award</a> for outstanding research contributions within computer vision. We also congratulate Googler Henrik Stewenius for receiving the <a href="https://en.wikipedia.org/wiki/Longuet-Higgins_Prize">Longuet-Higgins Prize</a>, a retrospective award that recognizes up to two CVPR papers from ten years ago that have made a significant impact on computer vision research, for his 2006 CVPR paper &#8220;<a href="http://dl.acm.org/citation.cfm?id=1153548">Scalable Recognition with a Vocabulary Tree</a>&#8221;, co-authored with David Nister, during their time at University of Kentucky.<br /><br />If you are attending CVPR this year, please stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for hundreds of millions of people. 
The Google booth will also showcase several recent efforts, including the technology behind <a href="https://research.googleblog.com/2016/06/motion-stills-create-beautiful-gifs.html">Motion Stills</a>, a live demo of neural network-based image compression and <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim">TensorFlow-Slim</a>, the lightweight library for defining, training and evaluating models in TensorFlow. Learn more about our research being presented at CVPR 2016 in the list below (Googlers highlighted in <span>blue</span>).<br /><br /><b><u>Oral Presentations</u></b><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Mao_Generation_and_Comprehension_CVPR_2016_paper.pdf">Generation and Comprehension of Unambiguous Object Descriptions</a><br /><i>Junhua Mao, <span>Jonathan Huang</span>, <span>Alexander Toshev</span>, Oana Camburu, Alan L. Yuille, <span>Kevin Murphy</span></i><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Ramanathan_Detecting_Events_and_CVPR_2016_paper.pdf"><br /></a> <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Ramanathan_Detecting_Events_and_CVPR_2016_paper.pdf">Detecting Events and Key Actors in Multi-Person Videos</a><br /><i>Vignesh Ramanathan, <span>Jonathan Huang</span></i><i>,</i><i><span>&#160;Sami Abu-El-Haija</span></i><i>,</i><i><span>&#160;Alexander Gorban</span></i><i>,</i><i><span>&#160;Kevin Murphy</span>, Li Fei-Fei</i><br /><br /><b><u>Spotlight Session: 3D Reconstruction</u></b><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Flynn_DeepStereo_Learning_to_CVPR_2016_paper.pdf">DeepStereo: Learning to Predict New Views From the World&#8217;s Imagery</a><br /><i>John Flynn, <span>Ivan Neulander</span>, James Philbin, <span>Noah Snavely</span></i><br /><br /><b><u>Posters</u></b><br /><a 
href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Del_Pero_Discovering_the_Physical_CVPR_2016_paper.pdf">Discovering the Physical Parts of an Articulated Object Class From Multiple Videos</a><br /><i>Luca Del Pero, <span>Susanna Ricco</span>, <span>Rahul Sukthankar</span>, Vittorio Ferrari</i><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Murdock_Blockout_Dynamic_Model_CVPR_2016_paper.pdf"><br /></a> <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Murdock_Blockout_Dynamic_Model_CVPR_2016_paper.pdf">Blockout: Dynamic Model Selection for Hierarchical Deep Networks</a><br /><i>Calvin Murdock, <span>Zhen Li</span>, <span>Howard Zhou</span>, <span>Tom Duerig</span></i><br /><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf">Rethinking the Inception Architecture for Computer Vision</a><br /><i><span>Christian Szegedy</span></i><i>,</i><i><span>&#160;Vincent Vanhoucke</span></i><i>,</i><i><span>&#160;Sergey Ioffe</span></i><i>,</i><i><span>&#160;Jon Shlens</span>, Zbigniew Wojna</i><br /><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zheng_Improving_the_Robustness_CVPR_2016_paper.pdf">Improving the Robustness of Deep Neural Networks via Stability Training</a><br /><i>Stephan Zheng, <span>Yang Song</span></i><i>,</i><i><span>&#160;Thomas Leung</span></i><i>,</i><i><span>&#160;Ian Goodfellow</span></i><br /><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Chen_Semantic_Image_Segmentation_CVPR_2016_paper.pdf">Semantic Image Segmentation With Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform</a><br /><i>Liang-Chieh Chen, <span>Jonathan T. Barron</span></i><i>,</i><i><span>&#160;George Papandreou</span></i><i>,</i><i><span>&#160;Kevin Murphy</span>, Alan L. 
Yuille</i><br /><br /><b><u>Tutorial</u></b><br /><a href="http://www.ccs.neu.edu/home/eelhami/tutorial_cvpr2016.htm">Optimization Algorithms for Subset Selection and Summarization in Large Data Sets</a><br /><i>Ehsan Elhamifar, Jeff Bilmes, <span>Alex Kulesza</span>, Michael Gygli</i><br /><br /><b><u>Workshops</u></b><br /><a href="http://pocv16.eecs.berkeley.edu/">Perceptual Organization in Computer Vision: The Role of Feedback in Recognition and Reorganization</a><br />Organizers: <i><span>Katerina Fragkiadaki</span>, Phillip Isola, <span>Joao Carreira</span></i><br />Invited talks: <i><span>Viren Jain</span>, <span>Jitendra Malik</span></i><br /><br /><a href="http://visualqa.org/workshop.html">VQA Challenge Workshop</a><br />Invited talks: <i><span>Jitendra Malik</span>, <span>Kevin Murphy</span></i><br /><br /><a href="https://sites.google.com/site/wicv2016/home">Women in Computer Vision</a><br />Invited talk: <i><span>Caroline Pantofaru</span></i><br /><br /><a href="http://www.cs.ucf.edu/~smkhan/CMLA2016/">Computational Models for Learning Systems and Educational Assessment</a><br />Invited talk: <i><span>Jonathan Huang</span></i><br /><br /><a href="http://lsun.cs.princeton.edu/2016/">Large-Scale Scene Understanding (LSUN) Challenge</a><br />Invited talk: <i><span>Jitendra Malik</span></i><br /><br /><a href="http://www.cs.virginia.edu/~vicente/bigvision2016/">Large Scale Visual Recognition and Retrieval: BigVision 2016</a><br />General Chairs: <i>Jason Corso, Fei-Fei Li, <span>Samy Bengio</span></i><br /><br /><a href="http://gesture.chalearn.org/2016-looking-at-people-cvpr-challenge/workshop">ChaLearn Looking at People</a><br />Invited talk: <i><span>Florian Schroff</span></i><br /><br /><a href="https://sites.google.com/site/cvprmcv16/">Medical Computer Vision</a><br />Invited talk: <span><span><i>Ramin Zabih</i></span></span>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Rahul Sukthankar, Research Scientist</span><br /><br />This week, Las Vegas hosts the <a href="http://cvpr2016.thecvf.com/">2016 Conference on Computer Vision and Pattern Recognition</a> (CVPR 2016), the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. As a leader in computer vision research, Google has a strong presence at CVPR 2016, with many Googlers presenting papers and invited talks at the conference, tutorials and workshops.<br /><br />We congratulate Google Research Scientist Ce Liu and Google Faculty Advisor <a href="http://www.cs.cmu.edu/~abhinavg/">Abhinav Gupta</a>, who were selected as this year’s recipients of the <a href="https://www.computer.org/web/tcpami/young-researcher-award">PAMI Young Researcher Award</a> for outstanding research contributions within computer vision. We also congratulate Googler Henrik Stewenius for receiving the <a href="https://en.wikipedia.org/wiki/Longuet-Higgins_Prize">Longuet-Higgins Prize</a>, a retrospective award that recognizes up to two CVPR papers from ten years ago that have made a significant impact on computer vision research, for his 2006 CVPR paper “<a href="http://dl.acm.org/citation.cfm?id=1153548">Scalable Recognition with a Vocabulary Tree</a>”, co-authored with David Nister, during their time at University of Kentucky.<br /><br />If you are attending CVPR this year, please stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for hundreds of millions of people. 
The Google booth will also showcase several recent efforts, including the technology behind <a href="https://research.googleblog.com/2016/06/motion-stills-create-beautiful-gifs.html">Motion Stills</a>, a live demo of neural network-based image compression and <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim">TensorFlow-Slim</a>, the lightweight library for defining, training and evaluating models in TensorFlow. Learn more about our research being presented at CVPR 2016 in the list below (Googlers highlighted in <span style="color: #3d85c6;">blue</span>).<br /><br /><b><u>Oral Presentations</u></b><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Mao_Generation_and_Comprehension_CVPR_2016_paper.pdf">Generation and Comprehension of Unambiguous Object Descriptions</a><br /><i>Junhua Mao, <span style="color: #3d85c6;">Jonathan Huang</span>, <span style="color: #3d85c6;">Alexander Toshev</span>, Oana Camburu, Alan L. Yuille, <span style="color: #3d85c6;">Kevin Murphy</span></i><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Ramanathan_Detecting_Events_and_CVPR_2016_paper.pdf"><br /></a> <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Ramanathan_Detecting_Events_and_CVPR_2016_paper.pdf">Detecting Events and Key Actors in Multi-Person Videos</a><br /><i>Vignesh Ramanathan, <span style="color: #3d85c6;">Jonathan Huang</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sami Abu-El-Haija</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Alexander Gorban</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Kevin Murphy</span>, Li Fei-Fei</i><br /><br /><b><u>Spotlight Session: 3D Reconstruction</u></b><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Flynn_DeepStereo_Learning_to_CVPR_2016_paper.pdf">DeepStereo: Learning to Predict New Views From the World’s Imagery</a><br /><i>John Flynn, <span 
style="color: #3d85c6;">Ivan Neulander</span>, James Philbin, <span style="color: #3d85c6;">Noah Snavely</span></i><br /><br /><b><u>Posters</u></b><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Del_Pero_Discovering_the_Physical_CVPR_2016_paper.pdf">Discovering the Physical Parts of an Articulated Object Class From Multiple Videos</a><br /><i>Luca Del Pero, <span style="color: #3d85c6;">Susanna Ricco</span>, <span style="color: #3d85c6;">Rahul Sukthankar</span>, Vittorio Ferrari</i><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Murdock_Blockout_Dynamic_Model_CVPR_2016_paper.pdf"><br /></a> <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Murdock_Blockout_Dynamic_Model_CVPR_2016_paper.pdf">Blockout: Dynamic Model Selection for Hierarchical Deep Networks</a><br /><i>Calvin Murdock, <span style="color: #3d85c6;">Zhen Li</span>, <span style="color: #3d85c6;">Howard Zhou</span>, <span style="color: #3d85c6;">Tom Duerig</span></i><br /><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf">Rethinking the Inception Architecture for Computer Vision</a><br /><i><span style="color: #3d85c6;">Christian Szegedy</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Vincent Vanhoucke</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergey Ioffe</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Jon Shlens</span>, Zbigniew Wojna</i><br /><br /><a href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zheng_Improving_the_Robustness_CVPR_2016_paper.pdf">Improving the Robustness of Deep Neural Networks via Stability Training</a><br /><i>Stephan Zheng, <span style="color: #3d85c6;">Yang Song</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Thomas Leung</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ian Goodfellow</span></i><br /><br /><a 
href="http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Chen_Semantic_Image_Segmentation_CVPR_2016_paper.pdf">Semantic Image Segmentation With Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform</a><br /><i>Liang-Chieh Chen, <span style="color: #3d85c6;">Jonathan T. Barron</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;George Papandreou</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Kevin Murphy</span>, Alan L. Yuille</i><br /><br /><b><u>Tutorial</u></b><br /><a href="http://www.ccs.neu.edu/home/eelhami/tutorial_cvpr2016.htm">Optimization Algorithms for Subset Selection and Summarization in Large Data Sets</a><br /><i>Ehsan Elhamifar, Jeff Bilmes, <span style="color: #3d85c6;">Alex Kulesza</span>, Michael Gygli</i><br /><br /><b><u>Workshops</u></b><br /><a href="http://pocv16.eecs.berkeley.edu/">Perceptual Organization in Computer Vision: The Role of Feedback in Recognition and Reorganization</a><br />Organizers: <i><span style="color: #3d85c6;">Katerina Fragkiadaki</span>, Phillip Isola, <span style="color: #3d85c6;">Joao Carreira</span></i><br />Invited talks: <i><span style="color: #3d85c6;">Viren Jain</span>, <span style="color: #3d85c6;">Jitendra Malik</span></i><br /><br /><a href="http://visualqa.org/workshop.html">VQA Challenge Workshop</a><br />Invited talks: <i><span style="color: #3d85c6;">Jitendra Malik</span>, <span style="color: #3d85c6;">Kevin Murphy</span></i><br /><br /><a href="https://sites.google.com/site/wicv2016/home">Women in Computer Vision</a><br />Invited talk: <i><span style="color: #3d85c6;">Caroline Pantofaru</span></i><br /><br /><a href="http://www.cs.ucf.edu/~smkhan/CMLA2016/">Computational Models for Learning Systems and Educational Assessment</a><br />Invited talk: <i><span style="color: #3d85c6;">Jonathan Huang</span></i><br /><br /><a href="http://lsun.cs.princeton.edu/2016/">Large-Scale Scene Understanding (LSUN) Challenge</a><br />Invited talk: <i><span 
style="color: #3d85c6;">Jitendra Malik</span></i><br /><br /><a href="http://www.cs.virginia.edu/~vicente/bigvision2016/">Large Scale Visual Recognition and Retrieval: BigVision 2016</a><br />General Chairs: <i>Jason Corso, Fei-Fei Li, <span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><a href="http://gesture.chalearn.org/2016-looking-at-people-cvpr-challenge/workshop">ChaLearn Looking at People</a><br />Invited talk: <i><span style="color: #3d85c6;">Florian Schroff</span></i><br /><br /><a href="https://sites.google.com/site/cvprmcv16/">Medical Computer Vision</a><br />Invited talk: <span style="background-color: white;"><span style="color: #3d85c6;"><i>Ramin Zabih</i></span></span>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/cvpr-2016-research-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Project Bloks: Making code physical for kids</title>
		<link>https://googledata.org/google-research/project-bloks-making-code-physical-for-kids/</link>
		<comments>https://googledata.org/google-research/project-bloks-making-code-physical-for-kids/#comments</comments>
		<pubDate>Mon, 27 Jun 2016 15:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=06ed262edc0d0d086edc30f0d564b09b</guid>
		<description><![CDATA[<span>Posted by Steve Vranakis and Jayme Goldstein, Executive Creative Director and Project Lead, Google Creative Lab</span><br /><br />At Google, we&#8217;re passionate about empowering children to create and explore with technology. We believe that when children learn to code, they&#8217;re not just learning how to program a computer&#8212;they&#8217;re learning a new language for creative expression and are developing computational thinking: a skillset for solving problems of all kinds. <br /><br />In fact, it&#8217;s a skillset whose importance is being recognised around the world&#8212;from President Obama&#8217;s <a href="https://chooseyourfuture.cps.edu/computer-science-for-all/what-is-cs4all/">CS4All program</a> to the inclusion of <a href="https://www.gov.uk/government/publications/national-curriculum-in-england-computing-programmes-of-study/national-curriculum-in-england-computing-programmes-of-study">Computer Science in the UK National Curriculum</a>. We&#8217;ve long supported and advocated the furthering of CS education through programs and platforms such as <a href="https://developers.google.com/blockly/">Blockly</a>, <a href="https://developers.googleblog.com/2016/05/scratch-and-google-introduce-scratch-blocks.html">Scratch Blocks</a>, <a href="https://www.cs-first.com/">CS First</a> and <a href="https://www.madewithcode.com/">Made w/ Code</a>.<br /><br />Today, we&#8217;re happy to announce <a href="http://g.co/projectbloks">Project Bloks</a>, a research collaboration between Google, <a href="https://tltl.stanford.edu/people/paulo-blikstein">Paulo Blikstein</a> (Stanford University) and <a href="http://www.ideo.com/">IDEO</a> with the goal of creating an open hardware platform that researchers, developers and designers can use to build physical coding experiences. As a first step, we&#8217;ve created a system for tangible programming and built a working prototype with it. 
We&#8217;re sharing our progress before conducting more research over the summer to inform what comes next.<br /><br /><b>Physical coding</b><br />Kids are inherently playful and social. They naturally play and learn by using their hands, building stuff and doing things together. Making code physical - known as tangible programming - offers a unique way to combine the way children innately play and learn with computational thinking.<br /><br />Project Bloks is preceded and shaped by a long history of educational theory and research in the area of hands-on learning. From <a href="https://en.wikipedia.org/wiki/Friedrich_Fr%C3%B6bel">Friedrich Froebel</a>, <a href="https://en.wikipedia.org/wiki/Maria_Montessori">Maria Montessori</a>&#160;and <a href="https://en.wikipedia.org/wiki/Jean_Piaget">Jean Piaget&#8217;s</a> pioneering work in the area of learning by experience, exploration and manipulation, to the research started in the 1970s by Seymour Papert and Radia Perlman with <a href="http://cyberneticzoo.com/cyberneticanimals/1969-the-logo-turtle-seymour-papert-marvin-minsky-et-al-american/">LOGO and TORTIS</a>. This exploration has continued to grow and includes a <a href="http://cosmo.nyu.edu/hogg/lego/braitenberg_vehicles.pdf">wide</a> <a href="https://www.researchgate.net/publication/242383829_Algoblock_a_tangible_programming_language_a_tool_for_collaborative_learning">range</a> <a href="http://link.springer.com/article/10.1007%2Fs00779-004-0295-6">of</a> <a href="http://hci.cs.tufts.edu/tern/">research</a> <a href="http://tmg-trackr.media.mit.edu/publishedmedia/Papers/187-Topobo%20A%20Constructive%20Assembly/Published/PDF">and</a> <a href="https://www.media.mit.edu/sponsorship/getting-value/collaborations/mindstorms">platforms</a>.<br /><br />However, designing kits for tangible programming is challenging&#8212;requiring the resources and time to develop both the software and the hardware. Our goal is to remove those barriers. 
By creating an open platform, Project Bloks will allow designers, developers and researchers to focus on innovating, experimenting and creating new ways to help kids develop computational thinking. Our vision is that, one day, the Project Bloks platform becomes for tangible programming what <a href="https://developers.google.com/blockly/">Blockly</a> is for on-screen programming.<br /><div></div><b>The Project Bloks system</b><br />We&#8217;ve designed a system that developers can customise, reconfigure and rearrange to create all kinds of different tangible programming experiences.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-q0wtBwn_WI0/V23HNIu56FI/AAAAAAAABFE/BPZ8GeEceNM8pivB7M0Bx4-ybqEmXX8DgCLcB/s1600/image02.jpg"><img border="0" height="426" src="https://3.bp.blogspot.com/-q0wtBwn_WI0/V23HNIu56FI/AAAAAAAABFE/BPZ8GeEceNM8pivB7M0Bx4-ybqEmXX8DgCLcB/s640/image02.jpg" width="640"></a></td></tr><tr><td>A bird&#8217;s-eye view of the customisable and reconfigurable Project Bloks system</td></tr></tbody></table>The Project Bloks system is made up of three core components: the &#8220;Brain Board&#8221;, &#8220;Base Boards&#8221; and &#8220;Pucks&#8221;. When connected together, they create a set of instructions that can be sent to connected devices, such as toys or tablets, over WiFi or Bluetooth. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-SWpg16IKOeI/V23HjiIyjcI/AAAAAAAABFM/3XIywyXghfMiXVfYYhH0Axes8dt4BdCkQCLcB/s1600/image11.jpg"><img border="0" height="426" src="https://4.bp.blogspot.com/-SWpg16IKOeI/V23HjiIyjcI/AAAAAAAABFM/3XIywyXghfMiXVfYYhH0Axes8dt4BdCkQCLcB/s640/image11.jpg" width="640"></a></td></tr><tr><td>The three core components of the Project Bloks system</td></tr></tbody></table><b>Pucks: abundant, inexpensive, customisable physical instructions</b><br />Pucks are what make the Project Bloks system so versatile. 
They help bring the infinite flexibility of software programming commands to tangible programming experiences. Pucks can be programmed with different instructions, such as &#8216;turn on or off&#8217;, &#8216;move left&#8217; or &#8216;jump&#8217;. They can also take the shape of many different interactive forms&#8212;like switches, dials or buttons. With no active electronic components, they&#8217;re also incredibly cheap and easy to make. At a minimum, all you'd need to make a puck is a piece of paper and some <a href="https://en.wikipedia.org/wiki/Conductive_ink">conductive ink</a>. <br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-bY1H0QnZ-go/V23H2KV7vlI/AAAAAAAABFY/j46AMqQmxjs1ZeatEQ0oh4VPLZGGnVOIACLcB/s1600/image12.jpg"><img border="0" height="426" src="https://1.bp.blogspot.com/-bY1H0QnZ-go/V23H2KV7vlI/AAAAAAAABFY/j46AMqQmxjs1ZeatEQ0oh4VPLZGGnVOIACLcB/s640/image12.jpg" width="640"></a></td></tr><tr><td>Pucks allow for the creation and customisation of an endless number of domain-specific physical instructions cheaply and easily.</td></tr></tbody></table><b>Base Boards: a modular design for diverse tangible programming experiences</b><br />Base Boards read a Puck&#8217;s instruction through a capacitive sensor. They act as a conduit for a Puck&#8217;s command to the Brain Board. 
Base Boards are modular and can be connected in sequence and in different orientations to create different programming flows and experiences.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-g-g4hfL3OjE/V3E58dnMPSI/AAAAAAAABGU/RVojHxOteV8WTdBy0_-D6LKsMf_NFO28gCLcB/s1600/image08.gif"><img border="0" height="360" src="https://4.bp.blogspot.com/-g-g4hfL3OjE/V3E58dnMPSI/AAAAAAAABGU/RVojHxOteV8WTdBy0_-D6LKsMf_NFO28gCLcB/s640/image08.gif" width="640"></a></td></tr><tr><td>The modularity of the Base Boards means they can be arranged in different configurations and flows</td></tr></tbody></table>Each Base Board is fitted with a haptic motor and LEDs that can be used to give end-users real time feedback on their programming experience. The Base Boards can also trigger audio feedback from the Brain Board&#8217;s built-in speaker.<br /><br /><b>Brain Board: control any device that has an API over WiFi or Bluetooth</b><br />The Brain Board is the processing unit of the system, built on a <a href="https://www.raspberrypi.org/products/pi-zero/">Raspberry Pi Zero</a>. It also provides the other boards with power, and contains an API to receive and send data to the Base Boards. It sends the Base Boards&#8217; instructions to any device with WiFi or Bluetooth connectivity and an API.<br /><br />As a whole, the Project Bloks system can take on different form factors and be made out of different materials. 
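The instruction flow just described (Pucks read by Base Boards, compiled in connection order by the Brain Board, and relayed to a connected device) can be sketched in a few lines of Python. This is purely illustrative: the class and method names below are invented for this sketch, not a published Project Bloks API, and the WiFi/Bluetooth link is stood in for by a plain method call.

```python
# Hypothetical model of the Project Bloks instruction flow.
# All names here are invented for illustration.

class Puck:
    """A physical instruction, e.g. 'turn on', 'move left' or 'jump'."""
    def __init__(self, instruction):
        self.instruction = instruction

class BaseBoard:
    """Reads the Puck placed on it (via a capacitive sensor in hardware)."""
    def __init__(self, puck):
        self.puck = puck
    def read(self):
        return self.puck.instruction

class BrainBoard:
    """Processing unit: compiles Base Board readings, in connection order,
    into a program and sends it to a connected device."""
    def __init__(self, base_boards):
        self.base_boards = base_boards
    def compile_program(self):
        return [board.read() for board in self.base_boards]
    def send_to(self, device):
        # In hardware this would travel over WiFi or Bluetooth to any
        # device exposing an API; here it is a direct call.
        for instruction in self.compile_program():
            device.execute(instruction)

class ToyRobot:
    """Stand-in for a connected toy or tablet."""
    def __init__(self):
        self.received = []
    def execute(self, instruction):
        self.received.append(instruction)

brain = BrainBoard([BaseBoard(Puck("turn on")),
                    BaseBoard(Puck("move left")),
                    BaseBoard(Puck("jump"))])
robot = ToyRobot()
brain.send_to(robot)
print(robot.received)  # ['turn on', 'move left', 'jump']
```

Rearranging the list of Base Boards changes the compiled program, mirroring how reordering or reorienting the physical boards changes the programming flow.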
This means developers have the flexibility to create diverse experiences that can help kids develop computational thinking: from composing music using functions to playing around with sensors or anything else they care to invent.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-s0y2qud5fZk/V23IbBVwKJI/AAAAAAAABFo/vWtklTyGROY3WxKqbImrWeOgniqTc3ozwCLcB/s1600/image04.gif"><img border="0" height="452" src="https://1.bp.blogspot.com/-s0y2qud5fZk/V23IbBVwKJI/AAAAAAAABFo/vWtklTyGROY3WxKqbImrWeOgniqTc3ozwCLcB/s640/image04.gif" width="640"></a></td></tr><tr><td>The Project Bloks system can be used to create all sorts of different physical programming experiences for kids</td></tr></tbody></table><b>The Coding Kit</b><br />To show how designers, developers, and researchers might make use of the system, the Project Bloks team worked with IDEO to create a reference device, called the Coding Kit. It lets kids learn basic concepts of programming by allowing them to put code bricks together to create a set of instructions that can be sent to control connected toys and devices&#8212;anything from a tablet to a <a href="http://mirobot.io/">drawing robot</a> or educational tools for exploring science like <a href="https://education.lego.com/en-us/elementary/explore/science">LEGO&#174; Education WeDo 2.0</a>.<br /><div></div><b>What&#8217;s next?</b><br />We are looking for participants (educators, developers, parents and researchers) from around the world who would like to help shape the future of Computer Science education by remotely taking part in our research studies later in the year. 
If you would like to be part of our research study or simply receive updates on the project, please <a href="http://projectbloks.withgoogle.com/register-interest/">sign up</a>.<br /><br />If you want more context and detail on Project Bloks, you can read our <a href="http://projectbloks.withgoogle.com/static/Project_Bloks_position_paper_June_2016.pdf">position paper</a>.<br /><br />Finally, a big thank you to the team beyond Google who&#8217;ve helped us get this far&#8212;including the pioneers of tangible learning and programming who&#8217;ve inspired us and informed so much of our thinking. <br /><br /><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Steve Vranakis and Jayme Goldstein, Executive Creative Director and Project Lead, Google Creative Lab</span><br /><br />At Google, we’re passionate about empowering children to create and explore with technology. We believe that when children learn to code, they’re not just learning how to program a computer—they’re learning a new language for creative expression and are developing computational thinking: a skillset for solving problems of all kinds. <br /><br />In fact, it’s a skillset whose importance is being recognised around the world—from President Obama’s <a href="https://chooseyourfuture.cps.edu/computer-science-for-all/what-is-cs4all/">CS4All program</a> to the inclusion of <a href="https://www.gov.uk/government/publications/national-curriculum-in-england-computing-programmes-of-study/national-curriculum-in-england-computing-programmes-of-study">Computer Science in the UK National Curriculum</a>. We’ve long supported and advocated the furthering of CS education through programs and platforms such as <a href="https://developers.google.com/blockly/">Blockly</a>, <a href="https://developers.googleblog.com/2016/05/scratch-and-google-introduce-scratch-blocks.html">Scratch Blocks</a>, <a href="https://www.cs-first.com/">CS First</a> and <a href="https://www.madewithcode.com/">Made w/ Code</a>.<br /><br />Today, we’re happy to announce <a href="http://g.co/projectbloks">Project Bloks</a>, a research collaboration between Google, <a href="https://tltl.stanford.edu/people/paulo-blikstein">Paulo Blikstein</a> (Stanford University) and <a href="http://www.ideo.com/">IDEO</a> with the goal of creating an open hardware platform that researchers, developers and designers can use to build physical coding experiences. As a first step, we’ve created a system for tangible programming and built a working prototype with it. 
We’re sharing our progress before conducting more research over the summer to inform what comes next.<br /><br /><b>Physical coding</b><br />Kids are inherently playful and social. They naturally play and learn by using their hands, building stuff and doing things together. Making code physical—known as tangible programming—offers a unique way to combine the way children innately play and learn with computational thinking.<br /><br />Project Bloks is preceded and shaped by a long history of educational theory and research in the area of hands-on learning, from <a href="https://en.wikipedia.org/wiki/Friedrich_Fr%C3%B6bel">Friedrich Froebel</a>, <a href="https://en.wikipedia.org/wiki/Maria_Montessori">Maria Montessori</a>&nbsp;and <a href="https://en.wikipedia.org/wiki/Jean_Piaget">Jean Piaget’s</a> pioneering work in the area of learning by experience, exploration and manipulation, to the research started in the 1970s by Seymour Papert and Radia Perlman with <a href="http://cyberneticzoo.com/cyberneticanimals/1969-the-logo-turtle-seymour-papert-marvin-minsky-et-al-american/">LOGO and TORTIS</a>. This exploration has continued to grow and includes a <a href="http://cosmo.nyu.edu/hogg/lego/braitenberg_vehicles.pdf">wide</a> <a href="https://www.researchgate.net/publication/242383829_Algoblock_a_tangible_programming_language_a_tool_for_collaborative_learning">range</a> <a href="http://link.springer.com/article/10.1007%2Fs00779-004-0295-6">of</a> <a href="http://hci.cs.tufts.edu/tern/">research</a> <a href="http://tmg-trackr.media.mit.edu/publishedmedia/Papers/187-Topobo%20A%20Constructive%20Assembly/Published/PDF">and</a> <a href="https://www.media.mit.edu/sponsorship/getting-value/collaborations/mindstorms">platforms</a>.<br /><br />However, designing kits for tangible programming is challenging—requiring the resources and time to develop both the software and the hardware. Our goal is to remove those barriers. 
By creating an open platform, Project Bloks will allow designers, developers and researchers to focus on innovating, experimenting and creating new ways to help kids develop computational thinking. Our vision is that, one day, the Project Bloks platform becomes for tangible programming what <a href="https://developers.google.com/blockly/">Blockly</a> is for on-screen programming.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/AuRTS35ouTs/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/AuRTS35ouTs?rel=0&amp;feature=player_embedded" width="640"></iframe></div><b>The Project Bloks system</b><br />We’ve designed a system that developers can customise, reconfigure and rearrange to create all kinds of different tangible programming experiences.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-q0wtBwn_WI0/V23HNIu56FI/AAAAAAAABFE/BPZ8GeEceNM8pivB7M0Bx4-ybqEmXX8DgCLcB/s1600/image02.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="426" src="https://3.bp.blogspot.com/-q0wtBwn_WI0/V23HNIu56FI/AAAAAAAABFE/BPZ8GeEceNM8pivB7M0Bx4-ybqEmXX8DgCLcB/s640/image02.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A bird’s-eye view of the customisable and reconfigurable Project Bloks system</td></tr></tbody></table>The Project Bloks system is made up of three core components: the “Brain Board”, “Base Boards” and “Pucks”. When connected together, they create a set of instructions that can be sent to connected devices, such as toys or tablets, over WiFi or Bluetooth. 
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-SWpg16IKOeI/V23HjiIyjcI/AAAAAAAABFM/3XIywyXghfMiXVfYYhH0Axes8dt4BdCkQCLcB/s1600/image11.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="426" src="https://4.bp.blogspot.com/-SWpg16IKOeI/V23HjiIyjcI/AAAAAAAABFM/3XIywyXghfMiXVfYYhH0Axes8dt4BdCkQCLcB/s640/image11.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The three core components of the Project Bloks system</td></tr></tbody></table><b>Pucks: abundant, inexpensive, customisable physical instructions</b><br />Pucks are what make the Project Bloks system so versatile. They help bring the infinite flexibility of software programming commands to tangible programming experiences. Pucks can be programmed with different instructions, such as ‘turn on or off’, ‘move left’ or ‘jump’. They can also take the shape of many different interactive forms—like switches, dials or buttons. With no active electronic components, they’re also incredibly cheap and easy to make. At a minimum, all you'd need to make a puck is a piece of paper and some <a href="https://en.wikipedia.org/wiki/Conductive_ink">conductive ink</a>. 
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-bY1H0QnZ-go/V23H2KV7vlI/AAAAAAAABFY/j46AMqQmxjs1ZeatEQ0oh4VPLZGGnVOIACLcB/s1600/image12.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="426" src="https://1.bp.blogspot.com/-bY1H0QnZ-go/V23H2KV7vlI/AAAAAAAABFY/j46AMqQmxjs1ZeatEQ0oh4VPLZGGnVOIACLcB/s640/image12.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pucks allow for the creation and customisation of an endless number of domain-specific physical instructions cheaply and easily.</td></tr></tbody></table><b>Base Boards: a modular design for diverse tangible programming experiences</b><br />Base Boards read a Puck’s instruction through a capacitive sensor. They act as a conduit for a Puck’s command to the Brain Board. 
Base Boards are modular and can be connected in sequence and in different orientations to create different programming flows and experiences.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-g-g4hfL3OjE/V3E58dnMPSI/AAAAAAAABGU/RVojHxOteV8WTdBy0_-D6LKsMf_NFO28gCLcB/s1600/image08.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="360" src="https://4.bp.blogspot.com/-g-g4hfL3OjE/V3E58dnMPSI/AAAAAAAABGU/RVojHxOteV8WTdBy0_-D6LKsMf_NFO28gCLcB/s640/image08.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The modularity of the Base Boards means they can be arranged in different configurations and flows</td></tr></tbody></table>Each Base Board is fitted with a haptic motor and LEDs that can be used to give end-users real time feedback on their programming experience. The Base Boards can also trigger audio feedback from the Brain Board’s built-in speaker.<br /><br /><b>Brain Board: control any device that has an API over WiFi or Bluetooth</b><br />The Brain Board is the processing unit of the system, built on a <a href="https://www.raspberrypi.org/products/pi-zero/">Raspberry Pi Zero</a>. It also provides the other boards with power, and contains an API to receive and send data to the Base Boards. It sends the Base Boards’ instructions to any device with WiFi or Bluetooth connectivity and an API.<br /><br />As a whole, the Project Bloks system can take on different form factors and be made out of different materials. 
This means developers have the flexibility to create diverse experiences that can help kids develop computational thinking: from composing music using functions to playing around with sensors or anything else they care to invent.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-s0y2qud5fZk/V23IbBVwKJI/AAAAAAAABFo/vWtklTyGROY3WxKqbImrWeOgniqTc3ozwCLcB/s1600/image04.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="452" src="https://1.bp.blogspot.com/-s0y2qud5fZk/V23IbBVwKJI/AAAAAAAABFo/vWtklTyGROY3WxKqbImrWeOgniqTc3ozwCLcB/s640/image04.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The Project Bloks system can be used to create all sorts of different physical programming experiences for kids</td></tr></tbody></table><b>The Coding Kit</b><br />To show how designers, developers, and researchers might make use of the system, the Project Bloks team worked with IDEO to create a reference device, called the Coding Kit. 
It lets kids learn basic concepts of programming by allowing them to put code bricks together to create a set of instructions that can be sent to control connected toys and devices—anything from a tablet to a <a href="http://mirobot.io/">drawing robot</a> or educational tools for exploring science like <a href="https://education.lego.com/en-us/elementary/explore/science">LEGO® Education WeDo 2.0</a>.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/bi9lpZUGvZw/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/bi9lpZUGvZw?rel=0&amp;feature=player_embedded" width="640"></iframe></div><b>What’s next?</b><br />We are looking for participants (educators, developers, parents and researchers) from around the world who would like to help shape the future of Computer Science education by remotely taking part in our research studies later in the year. If you would like to be part of our research study or simply receive updates on the project, please <a href="http://projectbloks.withgoogle.com/register-interest/">sign up</a>.<br /><br />If you want more context and detail on Project Bloks, you can read our <a href="http://projectbloks.withgoogle.com/static/Project_Bloks_position_paper_June_2016.pdf">position paper</a>.<br /><br />Finally, a big thank you to the team beyond Google who’ve helped us get this far—including the pioneers of tangible learning and programming who’ve inspired us and informed so much of our thinking. <br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/project-bloks-making-code-physical-for-kids/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bringing Precision to the AI Safety Discussion</title>
		<link>https://googledata.org/google-research/bringing-precision-to-the-ai-safety-discussion/</link>
		<comments>https://googledata.org/google-research/bringing-precision-to-the-ai-safety-discussion/#comments</comments>
		<pubDate>Wed, 22 Jun 2016 00:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=8d9874602bef11fc07cade07227abc89</guid>
		<description><![CDATA[<span>Posted by Chris Olah, Google Research</span><br /><br />We believe that AI technologies are likely to be overwhelmingly useful and beneficial for humanity. But part of being a responsible steward of any new technology is thinking through potential challenges and how best to address any associated risks. So today we&#8217;re publishing a technical paper, <i><a href="https://arxiv.org/abs/1606.06565">Concrete Problems in AI Safety</a></i>, a collaboration among scientists at Google, OpenAI, Stanford and Berkeley.<br /><br />While possible AI safety risks have received a lot of public attention, most previous discussion has been very hypothetical and speculative. We believe it&#8217;s essential to ground concerns in real machine learning research, and to start developing practical approaches for engineering AI systems that operate safely and reliably.<br /><br />We&#8217;ve outlined five problems we think will be very important as we apply AI in more general circumstances. These are all forward thinking, long-term research questions -- minor issues today, but important to address for future systems:<br /><br /><ul><li><b>Avoiding Negative Side Effects:</b> How can we ensure that an AI system will not disturb its environment in negative ways while pursuing its goals, e.g. a cleaning robot knocking over a vase because it can clean faster by doing so?</li><li><b>Avoiding Reward Hacking:</b> How can we avoid gaming of the reward function? For example, we don&#8217;t want this cleaning robot simply covering over messes with materials it can&#8217;t see through.</li><li><b>Scalable Oversight:</b> How can we efficiently ensure that a given AI system respects aspects of the objective that are too expensive to be frequently evaluated during training? 
For example, if an AI system gets human feedback as it performs a task, it needs to use that feedback efficiently because asking too often would be annoying.</li><li><b>Safe Exploration:</b> How do we ensure that an AI system doesn&#8217;t make exploratory moves with very negative repercussions? For example, maybe a cleaning robot should experiment with mopping strategies, but clearly it shouldn&#8217;t try putting a wet mop in an electrical outlet.</li><li><b>Robustness to Distributional Shift:</b> How do we ensure that an AI system recognizes, and behaves robustly, when it&#8217;s in an environment very different from its training environment? For example, heuristics learned for a factory floor may not be safe enough for an office.</li></ul><br />We go into more technical detail <a href="https://arxiv.org/abs/1606.06565">in the paper</a>. The machine learning research community has already thought quite a bit about most of these problems and many related issues, but we think there&#8217;s a lot more work to be done.<br /><br />We believe in rigorous, open, cross-institution work on how to build machine learning systems that work as intended. We&#8217;re eager to continue our collaborations with other research groups to make positive progress on AI.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Olah, Google Research</span><br /><br />We believe that AI technologies are likely to be overwhelmingly useful and beneficial for humanity. But part of being a responsible steward of any new technology is thinking through potential challenges and how best to address any associated risks. So today we’re publishing a technical paper, <i><a href="https://arxiv.org/abs/1606.06565">Concrete Problems in AI Safety</a></i>, a collaboration among scientists at Google, OpenAI, Stanford and Berkeley.<br /><br />While possible AI safety risks have received a lot of public attention, most previous discussion has been very hypothetical and speculative. We believe it’s essential to ground concerns in real machine learning research, and to start developing practical approaches for engineering AI systems that operate safely and reliably.<br /><br />We’ve outlined five problems we think will be very important as we apply AI in more general circumstances. These are all forward thinking, long-term research questions -- minor issues today, but important to address for future systems:<br /><br /><ul><li><b>Avoiding Negative Side Effects:</b> How can we ensure that an AI system will not disturb its environment in negative ways while pursuing its goals, e.g. a cleaning robot knocking over a vase because it can clean faster by doing so?</li><li><b>Avoiding Reward Hacking:</b> How can we avoid gaming of the reward function? For example, we don’t want this cleaning robot simply covering over messes with materials it can’t see through.</li><li><b>Scalable Oversight:</b> How can we efficiently ensure that a given AI system respects aspects of the objective that are too expensive to be frequently evaluated during training? 
For example, if an AI system gets human feedback as it performs a task, it needs to use that feedback efficiently because asking too often would be annoying.</li><li><b>Safe Exploration:</b> How do we ensure that an AI system doesn’t make exploratory moves with very negative repercussions? For example, maybe a cleaning robot should experiment with mopping strategies, but clearly it shouldn’t try putting a wet mop in an electrical outlet.</li><li><b>Robustness to Distributional Shift:</b> How do we ensure that an AI system recognizes, and behaves robustly, when it’s in an environment very different from its training environment? For example, heuristics learned for a factory floor may not be safe enough for an office.</li></ul><br />We go into more technical detail <a href="https://arxiv.org/abs/1606.06565">in the paper</a>. The machine learning research community has already thought quite a bit about most of these problems and many related issues, but we think there’s a lot more work to be done.<br /><br />We believe in rigorous, open, cross-institution work on how to build machine learning systems that work as intended. We’re eager to continue our collaborations with other research groups to make positive progress on AI.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/bringing-precision-to-the-ai-safety-discussion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ICML 2016 &amp; Research at Google</title>
		<link>https://googledata.org/google-research/icml-2016-research-at-google/</link>
		<comments>https://googledata.org/google-research/icml-2016-research-at-google/#comments</comments>
		<pubDate>Mon, 20 Jun 2016 12:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=a415bb83dfe2d526946759fd44613550</guid>
		<description><![CDATA[<span>Posted by Afshin Rostamizadeh, Research Scientist</span><br /><br />This week, New York hosts the <a href="http://icml.cc/2016/">2016 International Conference on Machine Learning</a> (ICML 2016), a premier annual Machine Learning event supported by the <a href="http://www.machinelearning.org/">International Machine Learning Society</a> (IMLS). Machine Learning is a key focus area at Google, with highly active research groups exploring virtually all aspects of the field, including deep learning and more classical algorithms. <br /><br />We work on an extremely wide variety of machine learning problems that arise from a broad range of applications at Google. One particularly important setting is that of large-scale learning, where we utilize scalable tools and architectures to build machine learning systems that work with large volumes of data that often preclude the use of standard single-machine training algorithms. In doing so, we are able to solve deep scientific problems and engineering challenges, exploring theory as well as application, in areas of language, speech, translation, music, visual processing and more.<br /><br />As Gold Sponsor, Google has a strong presence at ICML 2016 with many Googlers publishing their research and hosting workshops. If you&#8217;re attending, we hope you&#8217;ll visit the Google booth and talk with our researchers to learn more about the exciting work, creativity and fun that goes into solving interesting ML problems that impact millions of people. 
You can also learn more about our research being presented at ICML 2016 in the list below (Googlers highlighted in <span>blue</span>).<br /><br /><b><u>ICML 2016 Organizing Committee</u></b><br />Area Chairs include: <i><span>Corinna Cortes</span>, <span>John Blitzer</span>, <span>Maya Gupta</span>, <span>Moritz Hardt</span>, <span>Samy Bengio</span></i><br /><br /><b><u>IMLS</u></b><br />Board Members include: <i><span>Corinna Cortes</span></i><br /><br /><b><u>Accepted Papers</u></b><br /><a href="http://jmlr.org/proceedings/papers/v48/cisse16.html">ADIOS: Architectures Deep In Output Space</a><br /><i>Moustapha Cisse, Maruan Al-Shedivat, <span>Samy Bengio</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/danihelka16.html">Associative Long Short-Term Memory</a><br /><i><span>Ivo Danihelka (Google DeepMind)</span>, <span>Greg Wayne&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Benigno Uria&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Nal Kalchbrenner&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Alex Graves&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/mniha16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/mniha16.html">Asynchronous Methods for Deep Reinforcement Learning</a><br /><i><span>Volodymyr Mnih</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Adria Puigdomenech Badia</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, Mehdi Mirza, <span>Alex Graves</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Timothy Lillicrap</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Tim Harley</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>David Silver</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Koray 
Kavukcuoglu</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/choromanska16.html">Binary embeddings with structured hashed projections</a><br /><i>Anna Choromanska, <span>Krzysztof Choromanski</span>, Mariusz Bojarski, Tony Jebara, <span>Sanjiv Kumar</span>, Yann LeCun</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/kairouz16.html">Discrete Distribution Estimation Under Local Privacy</a><br /><i>Peter Kairouz, <span>Keith Bonawitz</span>, <span>Daniel Ramage</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/wangf16.html">Dueling Network Architectures for Deep Reinforcement Learning</a> <b><i>(Best Paper Award recipient)</i></b><br /><i><span>Ziyu Wang</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Nando de Freitas</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Tom Schaul</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Matteo Hessel</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Hado van Hasselt</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Marc Lanctot</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/dieleman16.html">Exploiting Cyclic Symmetry in Convolutional Neural Networks</a><br /><i><span>Sander Dieleman</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Jeffrey De Fauw</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Koray Kavukcuoglu</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/mirzasoleiman16.html">Fast Constrained Submodular Maximization: 
Personalized Data Summarization</a><br /><i>Baharan Mirzasoleiman, <span>Ashwinkumar Badanidiyuru</span>, Amin Karbasi</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/altschuler16.html">Greedy Column Subset Selection: New Bounds and Distributed Algorithms</a><br /><i>Jason Altschuler, Aditya Bhaskara, <span>Gang Fu</span></i><i>,</i><i><span>&#160;Vahab Mirrokni</span></i><i>,</i><i><span>&#160;Afshin Rostamizadeh</span></i><i>,</i><i><span>&#160;Morteza Zadimoghaddam</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/lucic16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/lucic16.html">Horizontally Scalable Submodular Maximization</a><br /><i>Mario Lucic, Olivier Bachem, <span>Morteza Zadimoghaddam</span>, Andreas Krause</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/gu16.html">Continuous Deep Q-Learning with Model-based Acceleration</a><br /><i>Shixiang Gu, <span>Timothy Lillicrap</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><i>,</i><i><span>&#160;Sergey Levine</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/santoro16.html">Meta-Learning with Memory-Augmented Neural Networks</a><br /><i><span>Adam Santoro</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, Sergey Bartunov,<span> Matthew Botvinick</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Daan Wierstra</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Timothy Lillicrap</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/rezende16.html">One-Shot Generalization in Deep Generative Models</a><br /><i><span>Danilo Rezende</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Shakir 
Mohamed</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Daan Wierstra</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/oord16.html">Pixel Recurrent Neural Networks</a> <b><i>(Best Paper Award recipient)</i></b><br /><i><span>Aaron Van den Oord</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Nal Kalchbrenner</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>,</i><i><span>&#160;Koray Kavukcuoglu</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/heidari16.html">Pricing a low-regret seller</a><br /><i>Hoda Heidari, <span>Mohammad Mahdian</span></i><i>,</i><i><span>&#160;Umar Syed</span></i><i>,</i><i><span>&#160;Sergei Vassilvitskii</span>, Sadra Yazdanbod</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/dunner16.html">Primal-Dual Rates and Certificates</a><br /><i>Celestine D&#252;nner, <span>Simone Forte</span>, Martin Takac, Martin Jaggi</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/schnabel16.html">Recommendations as Treatments: Debiasing Learning and Evaluation</a><br /><i>Tobias Schnabel, Thorsten Joachims, Adith Swaminathan, Ashudeep Singh, <span>Navin Chandak</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/choromanski16.html">Recycling Randomness with Structure for Sublinear Time Kernel Expansions</a><br /><i><span>Krzysztof Choromanski</span>, <span>Vikas Sindhwani</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/hardt16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/hardt16.html">Train faster, generalize better: Stability of stochastic gradient descent</a><br /><i><span>Moritz Hardt</span>, Ben Recht, <span>Yoram Singer</span></i><br /><br /><a 
href="http://jmlr.org/proceedings/papers/v48/mnihb16.html">Variational Inference for Monte Carlo Objectives</a><br /><i><span>Andriy Mnih</span>&#160;</i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, <span>Danilo Rezende</span>&#160;</i><i><span>(Google DeepMind)</span></i><br /><br /><b><u>Workshops</u></b><br /><a href="http://rlabstraction2016.wix.com/icml">Abstraction in Reinforcement Learning</a><br />Organizing Committee: <i>Daniel Mankowitz, <span>Timothy Mann</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, Shie Mannor</i><br />Invited Speaker: <i><span>David Silver</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/dlworkshop16/">Deep Learning Workshop</a><br />Organizers: <i>Antoine Bordes, Kyunghyun Cho, Emily Denton, <span>Nando de Freitas</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, Rob Fergus</i><br />Invited Speaker: <i><span>Raia Hadsell</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/nnb2tf/">Neural Networks Back To The Future</a><br />Organizers: <i>L&#233;on Bottou, David Grangier, Tomas Mikolov, <span>John Platt</span></i><br /><br /><a href="https://sites.google.com/site/dataefficientml/">Data-Efficient Machine Learning</a><br />Organizers: <i>Marc Deisenroth, <span>Shakir Mohamed</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><i>, Finale Doshi-Velez, Andreas Krause, Max Welling</i><br /><br /><a href="https://sites.google.com/site/ondeviceintelligence/icml2016">On-Device Intelligence</a><br />Organizers: <i><span>Vikas Sindhwani</span></i><i>,</i><i><span>&#160;Daniel Ramage</span></i><i>,</i><i><span>&#160;Keith Bonawitz</span>, Suyog Gupta, Sachin Talathi</i><br />Invited Speakers: <i><span>Hartwig Adam</span>, <span>H. 
Brendan McMahan</span></i><br /><br /><a href="https://sites.google.com/site/admlsystemsworkshop/">Online Advertising Systems</a><br />Organizing Committee: <i>Sharat Chikkerur, <span>Hossein Azari</span>, Edoardo Airoldi</i><br />Opening Remarks: <i><span>Hossein Azari</span></i><br />Invited Speakers: <i><span>Martin P&#225;l</span>, <span>Todd Phillips</span></i><br /><br /><a href="https://sites.google.com/site/icmlworkshoponanomalydetection/">Anomaly Detection 2016</a><br />Organizing Committee: <i>Nico Goernitz, Marius Kloft, <span>Vitaly Kuznetsov</span></i><br /><br /><b><u>Tutorials</u></b><br /><a href="http://icml.cc/2016/?page_id=97">Deep Reinforcement Learning</a><br /><i><span>David Silver</span></i><i><span>&#160;</span></i><i><span>(Google DeepMind)</span></i><br /><a href="http://icml.cc/2016/?page_id=97"><br /></a> <a href="http://icml.cc/2016/?page_id=97">Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis</a><br /><i><span>Moritz Hardt</span>, Aaron Roth</i>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Afshin Rostamizadeh, Research Scientist</span><br /><br />This week, New York hosts the <a href="http://icml.cc/2016/">2016 International Conference on Machine Learning</a> (ICML 2016), a premier annual Machine Learning event supported by the <a href="http://www.machinelearning.org/">International Machine Learning Society</a> (IMLS). Machine Learning is a key focus area at Google, with highly active research groups exploring virtually all aspects of the field, including deep learning and more classical algorithms. <br /><br />We work on an extremely wide variety of machine learning problems that arise from a broad range of applications at Google. One particularly important setting is that of large-scale learning, where we utilize scalable tools and architectures to build machine learning systems that work with large volumes of data that often preclude the use of standard single-machine training algorithms. In doing so, we are able to solve deep scientific problems and engineering challenges, exploring theory as well as application, in areas of language, speech, translation, music, visual processing and more.<br /><br />As Gold Sponsor, Google has a strong presence at ICML 2016 with many Googlers publishing their research and hosting workshops. If you’re attending, we hope you’ll visit the Google booth and talk with our researchers to learn more about the exciting work, creativity and fun that goes into solving interesting ML problems that impact millions of people. 
You can also learn more about our research being presented at ICML 2016 in the list below (Googlers highlighted in <span style="color: #3d85c6;">blue</span>).<br /><br /><b><u>ICML 2016 Organizing Committee</u></b><br />Area Chairs include: <i><span style="color: #3d85c6;">Corinna Cortes</span>, <span style="color: #3d85c6;">John Blitzer</span>, <span style="color: #3d85c6;">Maya Gupta</span>, <span style="color: #3d85c6;">Moritz Hardt</span>, <span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><b><u>IMLS</u></b><br />Board Members include: <i><span style="color: #3d85c6;">Corinna Cortes</span></i><br /><br /><b><u>Accepted Papers</u></b><br /><a href="http://jmlr.org/proceedings/papers/v48/cisse16.html">ADIOS: Architectures Deep In Output Space</a><br /><i>Moustapha Cisse, Maruan Al-Shedivat, <span style="color: #3d85c6;">Samy Bengio</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/danihelka16.html">Associative Long Short-Term Memory</a><br /><i><span style="color: #3d85c6;">Ivo Danihelka (Google DeepMind)</span>, <span style="color: #3d85c6;">Greg Wayne&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Benigno Uria&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Nal Kalchbrenner&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Alex Graves&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/mniha16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/mniha16.html">Asynchronous Methods for Deep Reinforcement Learning</a><br /><i><span style="color: #3d85c6;">Volodymyr Mnih</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Adria Puigdomenech 
Badia</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, Mehdi Mirza, <span style="color: #3d85c6;">Alex Graves</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Timothy Lillicrap</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Tim Harley</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">David Silver</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Koray Kavukcuoglu</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/choromanska16.html">Binary embeddings with structured hashed projections</a><br /><i>Anna Choromanska, <span style="color: #3d85c6;">Krzysztof Choromanski</span>, Mariusz Bojarski, Tony Jebara, <span style="color: #3d85c6;">Sanjiv Kumar</span>, Yann LeCun</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/kairouz16.html">Discrete Distribution Estimation Under Local Privacy</a><br /><i>Peter Kairouz, <span style="color: #3d85c6;">Keith Bonawitz</span>, <span style="color: #3d85c6;">Daniel Ramage</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/wangf16.html">Dueling Network Architectures for Deep Reinforcement Learning</a> <b><i>(Best Paper Award recipient)</i></b><br /><i><span style="color: #3d85c6;">Ziyu Wang</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Nando de 
Freitas</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Tom Schaul</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Matteo Hessel</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Hado van Hasselt</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Marc Lanctot</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/dieleman16.html">Exploiting Cyclic Symmetry in Convolutional Neural Networks</a><br /><i><span style="color: #3d85c6;">Sander Dieleman</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Jeffrey De Fauw</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Koray Kavukcuoglu</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/mirzasoleiman16.html">Fast Constrained Submodular Maximization: Personalized Data Summarization</a><br /><i>Baharan Mirzasoleiman, <span style="color: #3d85c6;">Ashwinkumar Badanidiyuru</span>, Amin Karbasi</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/altschuler16.html">Greedy Column Subset Selection: New Bounds and Distributed Algorithms</a><br /><i>Jason Altschuler, Aditya 
Bhaskara, <span style="color: #3d85c6;">Gang Fu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Vahab Mirrokni</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Afshin Rostamizadeh</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Morteza Zadimoghaddam</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/lucic16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/lucic16.html">Horizontally Scalable Submodular Maximization</a><br /><i>Mario Lucic, Olivier Bachem, <span style="color: #3d85c6;">Morteza Zadimoghaddam</span>, Andreas Krause</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/gu16.html">Continuous Deep Q-Learning with Model-based Acceleration</a><br /><i>Shixiang Gu, <span style="color: #3d85c6;">Timothy Lillicrap</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergey Levine</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/santoro16.html">Meta-Learning with Memory-Augmented Neural Networks</a><br /><i><span style="color: #3d85c6;">Adam Santoro</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, Sergey Bartunov,<span style="color: #3d85c6;"> Matthew Botvinick</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Daan Wierstra</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Timothy Lillicrap</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a 
href="http://jmlr.org/proceedings/papers/v48/rezende16.html">One-Shot Generalization in Deep Generative Models</a><br /><i><span style="color: #3d85c6;">Danilo Rezende</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Shakir Mohamed</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Daan Wierstra</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/oord16.html">Pixel Recurrent Neural Networks</a> <b><i>(Best Paper Award recipient)</i></b><br /><i><span style="color: #3d85c6;">Aaron Van den Oord</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Nal Kalchbrenner</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Koray Kavukcuoglu</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/heidari16.html">Pricing a low-regret seller</a><br /><i>Hoda Heidari, <span style="color: #3d85c6;">Mohammad Mahdian</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Umar Syed</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergei Vassilvitskii</span>, Sadra Yazdanbod</i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/dunner16.html">Primal-Dual Rates and Certificates</a><br /><i>Celestine Dünner, <span style="color: #3d85c6;">Simone Forte</span>, Martin Takac, Martin Jaggi</i><br /><br /><a 
href="http://jmlr.org/proceedings/papers/v48/schnabel16.html">Recommendations as Treatments: Debiasing Learning and Evaluation</a><br /><i>Tobias Schnabel, Thorsten Joachims, Adith Swaminathan, Ashudeep Singh, <span style="color: #3d85c6;">Navin Chandak</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/choromanski16.html">Recycling Randomness with Structure for Sublinear Time Kernel Expansions</a><br /><i><span style="color: #3d85c6;">Krzysztof Choromanski</span>, <span style="color: #3d85c6;">Vikas Sindhwani</span></i><br /><a href="http://jmlr.org/proceedings/papers/v48/hardt16.html"><br /></a> <a href="http://jmlr.org/proceedings/papers/v48/hardt16.html">Train faster, generalize better: Stability of stochastic gradient descent</a><br /><i><span style="color: #3d85c6;">Moritz Hardt</span>, Ben Recht, <span style="color: #3d85c6;">Yoram Singer</span></i><br /><br /><a href="http://jmlr.org/proceedings/papers/v48/mnihb16.html">Variational Inference for Monte Carlo Objectives</a><br /><i><span style="color: #3d85c6;">Andriy Mnih</span>&nbsp;</i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, <span style="color: #3d85c6;">Danilo Rezende</span>&nbsp;</i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><b><u>Workshops</u></b><br /><a href="http://rlabstraction2016.wix.com/icml">Abstraction in Reinforcement Learning</a><br />Organizing Committee: <i>Daniel Mankowitz, <span style="color: #3d85c6;">Timothy Mann</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, Shie Mannor</i><br />Invited Speaker: <i><span style="color: #3d85c6;">David Silver</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/dlworkshop16/">Deep Learning Workshop</a><br />Organizers: <i>Antoine Bordes, 
Kyunghyun Cho, Emily Denton, <span style="color: #3d85c6;">Nando de Freitas</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, Rob Fergus</i><br />Invited Speaker: <i><span style="color: #3d85c6;">Raia Hadsell</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><br /><a href="https://sites.google.com/site/nnb2tf/">Neural Networks Back To The Future</a><br />Organizers: <i>Léon Bottou, David Grangier, Tomas Mikolov, <span style="color: #3d85c6;">John Platt</span></i><br /><br /><a href="https://sites.google.com/site/dataefficientml/">Data-Efficient Machine Learning</a><br />Organizers: <i>Marc Deisenroth, <span style="color: #3d85c6;">Shakir Mohamed</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><i>, Finale Doshi-Velez, Andreas Krause, Max Welling</i><br /><br /><a href="https://sites.google.com/site/ondeviceintelligence/icml2016">On-Device Intelligence</a><br />Organizers: <i><span style="color: #3d85c6;">Vikas Sindhwani</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Daniel Ramage</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Keith Bonawitz</span>, Suyog Gupta, Sachin Talathi</i><br />Invited Speakers: <i><span style="color: #3d85c6;">Hartwig Adam</span>, <span style="color: #3d85c6;">H. 
Brendan McMahan</span></i><br /><br /><a href="https://sites.google.com/site/admlsystemsworkshop/">Online Advertising Systems</a><br />Organizing Committee: <i>Sharat Chikkerur, <span style="color: #3d85c6;">Hossein Azari</span>, Edoardo Airoldi</i><br />Opening Remarks: <i><span style="color: #3d85c6;">Hossein Azari</span></i><br />Invited Speakers: <i><span style="color: #3d85c6;">Martin Pál</span>, <span style="color: #3d85c6;">Todd Phillips</span></i><br /><br /><a href="https://sites.google.com/site/icmlworkshoponanomalydetection/">Anomaly Detection 2016</a><br />Organizing Committee: <i>Nico Goernitz, Marius Kloft, <span style="color: #3d85c6;">Vitaly Kuznetsov</span></i><br /><br /><b><u>Tutorials</u></b><br /><a href="http://icml.cc/2016/?page_id=97">Deep Reinforcement Learning</a><br /><i><span style="color: #3d85c6;">David Silver</span></i><i><span style="color: #3d85c6;">&nbsp;</span></i><i><span style="color: #3d85c6;">(Google DeepMind)</span></i><br /><a href="http://icml.cc/2016/?page_id=97"><br /></a> <a href="http://icml.cc/2016/?page_id=97">Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis</a><br /><i><span style="color: #3d85c6;">Moritz Hardt</span>, Aaron Roth</i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/icml-2016-research-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Google Research, Europe</title>
		<link>https://googledata.org/google-research/announcing-google-research-europe/</link>
		<comments>https://googledata.org/google-research/announcing-google-research-europe/#comments</comments>
		<pubDate>Thu, 16 Jun 2016 09:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dd343867a31ef414b59133c5c30cdefb</guid>
		<description><![CDATA[<span>Posted by Emmanuel Mogenet, Head of Google Research, Europe</span><br /><br />Google&#8217;s ongoing research in Machine Intelligence is what powers many of the products being used by hundreds of millions of people a day - from <a href="https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html">Translate</a> to <a href="https://research.googleblog.com/2013/06/improving-photo-search-step-across.html">Photo Search</a> to <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>. One of the things that enables these advances is the extensive collaboration between the Google researchers in our offices across the world, all contributing their unique knowledge and disseminating ideas in state-of-the-art Machine Learning (ML) technologies and techniques in order to develop useful tools and products.<br /><br />Today, we&#8217;re excited to announce a dedicated Machine Learning research group in Europe, based in our <a href="https://picasaweb.google.com/115256683569729121957/ZurichOfficePhotos">Zurich office</a>. 
Google Research, Europe, will foster an environment where software engineers and researchers specialising in ML will have the opportunity to develop products and conduct research right here in Europe, as part of the wider efforts at Google.<br /><div><a href="https://2.bp.blogspot.com/-VwZx6LYgkCw/V2GCpt6Q2OI/AAAAAAAABEc/HH9OBN25gDINnNdmLE2R1vAOSxFo9YKSgCLcB/s1600/image00.png"><img border="0" height="360" src="https://2.bp.blogspot.com/-VwZx6LYgkCw/V2GCpt6Q2OI/AAAAAAAABEc/HH9OBN25gDINnNdmLE2R1vAOSxFo9YKSgCLcB/s640/image00.png" width="640"></a></div>Zurich is already the home of Google&#8217;s largest engineering office outside the US, and is responsible for developing the engine that powers <a href="https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html">Knowledge Graph</a>, as well as the conversation engine that powers the <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">Google Assistant in Allo</a>. In addition to continued collaboration with Google&#8217;s various research teams, Google Research, Europe will be focused on three key areas:<br /><ul><li><a href="http://research.google.com/pubs/MachineIntelligence.html">Machine Intelligence</a></li><li><a href="http://research.google.com/pubs/NaturalLanguageProcessing.html">Natural Language Processing &#38; Understanding</a></li><li><a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a></li></ul>In pursuit of these areas, the team will actively research ways in which to improve ML infrastructure, broadly facilitating research for the community, and enabling it to be put to practical use. Furthermore, researchers in the Zurich office will be uniquely able to work closely with team linguists, advancing Natural Language Understanding in collaboration with Google Research groups across the world, all while enjoying Mountain Views of a different kind. 
<br /><div><a href="https://4.bp.blogspot.com/-9o1md_jJjxw/V2GCurvn8yI/AAAAAAAABEk/3PMBKsbjIRMIjktqxLBmw-GCjQmyFiDaACLcB/s1600/image01.png"><img border="0" height="425" src="https://4.bp.blogspot.com/-9o1md_jJjxw/V2GCurvn8yI/AAAAAAAABEk/3PMBKsbjIRMIjktqxLBmw-GCjQmyFiDaACLcB/s640/image01.png" width="640"></a></div>Europe is home to some of the world&#8217;s premier technical universities, making it an ideal place to build a top-notch research team. We look forward to collaborating with all the excellent Computer Science research that is coming from the region, and hope to contribute towards the wider academic community through <a href="http://research.google.com/pubs/papers.html">our publications</a> and <a href="http://research.google.com/research-outreach.html#/research-outreach">academic support</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Emmanuel Mogenet, Head of Google Research, Europe</span><br /><br />Google’s ongoing research in Machine Intelligence is what powers many of the products being used by hundreds of millions of people a day - from <a href="https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html">Translate</a> to <a href="https://research.googleblog.com/2013/06/improving-photo-search-step-across.html">Photo Search</a> to <a href="https://research.googleblog.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>. One of the things that enables these advances is the extensive collaboration between the Google researchers in our offices across the world, all contributing their unique knowledge and disseminating ideas in state-of-the-art Machine Learning (ML) technologies and techniques in order to develop useful tools and products.<br /><br />Today, we’re excited to announce a dedicated Machine Learning research group in Europe, based in our <a href="https://picasaweb.google.com/115256683569729121957/ZurichOfficePhotos">Zurich office</a>. 
Google Research, Europe, will foster an environment where software engineers and researchers specialising in ML will have the opportunity to develop products and conduct research right here in Europe, as part of the wider efforts at Google.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-VwZx6LYgkCw/V2GCpt6Q2OI/AAAAAAAABEc/HH9OBN25gDINnNdmLE2R1vAOSxFo9YKSgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://2.bp.blogspot.com/-VwZx6LYgkCw/V2GCpt6Q2OI/AAAAAAAABEc/HH9OBN25gDINnNdmLE2R1vAOSxFo9YKSgCLcB/s640/image00.png" width="640" /></a></div>Zurich is already the home of Google’s largest engineering office outside the US, and is responsible for developing the engine that powers <a href="https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html">Knowledge Graph</a>, as well as the conversation engine that powers the <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">Google Assistant in Allo</a>. In addition to continued collaboration with Google’s various research teams, Google Research, Europe will be focused on three key areas:<br /><ul><li><a href="http://research.google.com/pubs/MachineIntelligence.html">Machine Intelligence</a></li><li><a href="http://research.google.com/pubs/NaturalLanguageProcessing.html">Natural Language Processing &amp; Understanding</a></li><li><a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a></li></ul>In pursuit of these areas, the team will actively research ways in which to improve ML infrastructure, broadly facilitating research for the community, and enabling it to be put to practical use. 
Furthermore, researchers in the Zurich office will be uniquely able to work closely with team linguists, advancing Natural Language Understanding in collaboration with Google Research groups across the world, all while enjoying Mountain Views of a different kind. <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-9o1md_jJjxw/V2GCurvn8yI/AAAAAAAABEk/3PMBKsbjIRMIjktqxLBmw-GCjQmyFiDaACLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="425" src="https://4.bp.blogspot.com/-9o1md_jJjxw/V2GCurvn8yI/AAAAAAAABEk/3PMBKsbjIRMIjktqxLBmw-GCjQmyFiDaACLcB/s640/image01.png" width="640" /></a></div>Europe is home to some of the world’s premier technical universities, making it an ideal place to build a top-notch research team. We look forward to collaborating with all the excellent Computer Science research that is coming from the region, and hope to contribute towards the wider academic community through <a href="http://research.google.com/pubs/papers.html">our publications</a> and <a href="http://research.google.com/research-outreach.html#/research-outreach">academic support</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-google-research-europe/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Quantum annealing with a digital twist</title>
		<link>https://googledata.org/google-research/quantum-annealing-with-a-digital-twist/</link>
		<comments>https://googledata.org/google-research/quantum-annealing-with-a-digital-twist/#comments</comments>
		<pubDate>Wed, 08 Jun 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5702367a12adc5edc82f6ec92cab80bd</guid>
		<description><![CDATA[<span>Posted by Rami Barends and Alireza Shabani, Quantum Electronics Engineers</span><br /><br />One of the key benefits of quantum computing is that it has the potential to solve some of the most complex problems in nature, from physics to chemistry to biology. For example, when attempting to <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2443096/">calculate protein folding</a>, or when exploring <a href="https://www.youtube.com/watch?v=98wILB5sZ5w">reaction catalysts and &#8220;designer&#8221; molecules</a>, one can look at computational challenges as optimization problems, and represent the different configurations of a molecule as an energy landscape in a quantum computer.  By letting the system cool, or &#8220;<a href="https://en.wikipedia.org/wiki/Quantum_annealing">anneal</a>&#8221;,  one finds the lowest energy state in the landscape - the most stable form of the molecule. Thanks to the peculiarities of quantum mechanics, the correct answer simply drops out at the end of the quantum computation. In fact, many tough problems can be dealt with this way; this combination of simplicity and generality makes it appealing.<br /><br />But finding the lowest energy state in a system is like being put in the Alps, and being told to find the lowest elevation - it&#8217;s easy to get stuck in a &#8220;local&#8221; valley, and not know that there is an even lower point elsewhere. Therefore, we use a different approach: We start with a very simple energy landscape - a flat meadow - and initialize the system of quantum bits (qubits) to represent the known lowest energy point, or &#8220;ground state&#8221;, in that landscape. We then begin to adjust the simple landscape towards one that represents the problem we are trying to solve - from the smooth meadow to the highly uneven terrain of the Alps. 
Here&#8217;s the fun part: if one evolves the landscape very slowly, the ground state of the qubits also evolves, so that they stay in the ground state of the changing system. This is called &#8220;<a href="https://en.wikipedia.org/wiki/Adiabatic_quantum_computation">adiabatic quantum computing</a>&#8221;, and qubits exploit quantum tunneling to ensure they always find the lowest energy "valley" in the changing system. <br /><br />While this is great in theory, getting this to work in practice is challenging, as you have to set up the energy landscape using the available qubit interactions. Ideally you&#8217;d have multiple interactions going on between all of the qubits, but for a large-scale solver the requirements to accurately keep track of these interactions become enormous. Realistically, the connectivity has to be reduced, but this presents a major limitation for the computational possibilities.<br /><br />In "<a href="http://www.nature.com/nature/journal/v534/n7606/full/nature17658.html">Digitized adiabatic quantum computing with a superconducting circuit</a>", published in <a href="http://www.nature.com/index.html">Nature</a>, we&#8217;ve overcome this obstacle by giving quantum annealing a digital twist. With a limited connectivity between qubits you can still construct any of the desired interactions: Whether the interaction is ferromagnetic (the quantum bits prefer an aligned orientation) or antiferromagnetic (an anti-aligned orientation), or even defined along an arbitrary different direction, you can make it happen using easy-to-combine discrete building blocks. In this case, the blocks we use are the logic gates that we've been developing with <a href="http://googleresearch.blogspot.com/2015/03/a-step-closer-to-quantum-computation.html">our superconducting architecture</a>. 
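The digitized evolution described above can be illustrated with a minimal single-qubit toy in Python (this is only a sketch of the idea, not the paper's nine-qubit circuit; the schedule, step count, and total time are invented for illustration). The landscape interpolates from a simple driver Hamiltonian -X, whose ground state we can prepare, to a target -Z, and each digital step applies the exact short-time propagator of the instantaneous Hamiltonian:

```python
import math

def step(c0, c1, ax, az, dt):
    # One digital step: apply exp(-i*(ax*X + az*Z)*dt) to the state (c0, c1),
    # using exp(-i*t*(n.sigma)) = cos(|n|t)*I - i*sin(|n|t)*(n.sigma)/|n|.
    a = math.sqrt(ax * ax + az * az)
    cos_t, sin_t = math.cos(a * dt), math.sin(a * dt)
    h0 = (az * c0 + ax * c1) / a   # component 0 of the normalized H|psi>
    h1 = (ax * c0 - az * c1) / a   # component 1 of the normalized H|psi>
    return cos_t * c0 - 1j * sin_t * h0, cos_t * c1 - 1j * sin_t * h1

def digitized_anneal(steps=5000, total_time=50.0):
    # Interpolate H(s) = -(1-s)*X - s*Z from the "flat meadow" (-X, ground
    # state |+>) to the target "landscape" (-Z, ground state |0>).
    dt = total_time / steps
    c0 = c1 = 1 / math.sqrt(2)          # start in |+>, the ground state of -X
    for k in range(steps):
        s = (k + 0.5) / steps           # annealing schedule, s goes 0 -> 1
        c0, c1 = step(c0, c1, -(1 - s), -s, dt)
    return abs(c0) ** 2                 # overlap with the target ground state |0>
```

Evolving slowly (large `total_time`) keeps the qubit in the instantaneous ground state, so the final overlap is close to 1; shrinking `total_time` toward zero breaks adiabaticity and the overlap drops toward the initial value of one half.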
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-FT9Wm1KLUQs/V1c0dBvWYFI/AAAAAAAABEA/TL82c7WzUUYe0j6tSqL44mJM-uFXUr3ugCLcB/s1600/image03.jpg"><img border="0" height="406" src="https://3.bp.blogspot.com/-FT9Wm1KLUQs/V1c0dBvWYFI/AAAAAAAABEA/TL82c7WzUUYe0j6tSqL44mJM-uFXUr3ugCLcB/s640/image03.jpg" width="640"></a></td></tr><tr><td>Superconducting quantum chip with nine qubits. Each qubit (cross-shaped structures in the center) is connected to its neighbors and individually controlled. Photo credit: Julian Kelly.</td></tr></tbody></table>The key is controllability. Qubits, like other physical objects in nature, have a <a href="https://en.wikipedia.org/wiki/Resonance">resonance frequency</a>, and can be addressed individually with short voltage and current pulses. In our architecture we can steer this frequency, much like you would tune a radio to a broadcast.  We can even tune one qubit to the frequency of another one. By moving qubit frequencies to or away from each other, interactions can be turned on or off. The exchange of quantum information resembles a relay race, where the baton can be handed down when the runners meet.<br /><br />You can see the algorithm in action below. Any problem is encoded as local &#8220;directions&#8221; we want qubits to point to - like a weathervane pointing into the wind - and interactions, depicted here as links between the balls. We start by aligning all qubits into the same direction, and the interactions between the qubits turned off - this is the simplest ground state of the system. Next, we turn on interactions and change qubit directions to start evolving towards the energy landscape we wish to solve. 
The algorithmic steps are implemented with many control pulses, illustrating how the problem gets solved in a giant dance of <a href="https://en.wikipedia.org/wiki/Quantum_entanglement">quantum entanglement</a>.<br /><div></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td>Top: Depiction of the problem, with the gold arrows in the blue balls representing the directions we&#8217;d like each qubit to align to, like a weathervane pointing to the wind. The thickness of the link between the balls indicates the strength of the interaction - red denotes a ferromagnetic link, and blue an antiferromagnetic link. Middle: Implementation with qubits (yellow crosses) with control pulses (red) and steering the frequency (vertical direction). Qubits turn blue when there is interaction. The qubits turn green when they are being measured. Bottom: Zoom in of the physical device, showing the corresponding nine qubits (cross-shaped).</td></tr></tbody></table>To run the adiabatic quantum computation efficiently and design a set of test experiments we teamed up with the <a href="http://www.qutisgroup.com/">QUTIS group</a> at the University of the Basque Country in Bilbao, Spain, led by <a href="http://www.qutisgroup.com/prof-enrique-solano/">Prof. E. Solano</a> and <a href="http://www.qutisgroup.com/dr-lucas-lamata">Dr. L. Lamata</a>, who are experts in synthesizing digital algorithms. It&#8217;s the largest digital algorithm to date, with up to nine qubits and using over one thousand logic gates.<br /><br />The crucial advantage for the future is that this digital implementation is fully compatible with known <a href="http://googleresearch.blogspot.com/2015/03/a-step-closer-to-quantum-computation.html">quantum error correction techniques</a>, and can therefore be protected from the effects of noise. Otherwise, the noise will set a hard limit, as even the slightest amount can derail the state from following the fragile path to the solution. 
Since each quantum bit and interaction element can add noise to the system, some of the most important problems are well beyond reach, as they have many degrees of freedom and need a high connectivity. But with error correction, this approach becomes a general-purpose algorithm which can be scaled to an arbitrarily large quantum computer.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Rami Barends and Alireza Shabani, Quantum Electronics Engineers</span><br /><br />One of the key benefits of quantum computing is that it has the potential to solve some of the most complex problems in nature, from physics to chemistry to biology. For example, when attempting to <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2443096/">calculate protein folding</a>, or when exploring <a href="https://www.youtube.com/watch?v=98wILB5sZ5w">reaction catalysts and “designer” molecules</a>, one can look at computational challenges as optimization problems, and represent the different configurations of a molecule as an energy landscape in a quantum computer.  By letting the system cool, or “<a href="https://en.wikipedia.org/wiki/Quantum_annealing">anneal</a>”,  one finds the lowest energy state in the landscape - the most stable form of the molecule. Thanks to the peculiarities of quantum mechanics, the correct answer simply drops out at the end of the quantum computation. In fact, many tough problems can be dealt with this way; this combination of simplicity and generality makes it appealing.<br /><br />But finding the lowest energy state in a system is like being put in the Alps, and being told to find the lowest elevation - it’s easy to get stuck in a “local” valley, and not know that there is an even lower point elsewhere. Therefore, we use a different approach: We start with a very simple energy landscape - a flat meadow - and initialize the system of quantum bits (qubits) to represent the known lowest energy point, or “ground state”, in that landscape. We then begin to adjust the simple landscape towards one that represents the problem we are trying to solve - from the smooth meadow to the highly uneven terrain of the Alps. Here’s the fun part: if one evolves the landscape very slowly, the ground state of the qubits also evolves, so that they stay in the ground state of the changing system. 
This is called “<a href="https://en.wikipedia.org/wiki/Adiabatic_quantum_computation">adiabatic quantum computing</a>”, and qubits exploit quantum tunneling to ensure they always find the lowest energy "valley" in the changing system. <br /><br />While this is great in theory, getting this to work in practice is challenging, as you have to set up the energy landscape using the available qubit interactions. Ideally you’d have multiple interactions going on between all of the qubits, but for a large-scale solver the requirements to accurately keep track of these interactions become enormous. Realistically, the connectivity has to be reduced, but this presents a major limitation for the computational possibilities.<br /><br />In "<a href="http://www.nature.com/nature/journal/v534/n7606/full/nature17658.html">Digitized adiabatic quantum computing with a superconducting circuit</a>", published in <a href="http://www.nature.com/index.html">Nature</a>, we’ve overcome this obstacle by giving quantum annealing a digital twist. With a limited connectivity between qubits you can still construct any of the desired interactions: Whether the interaction is ferromagnetic (the quantum bits prefer an aligned orientation) or antiferromagnetic (an anti-aligned orientation), or even defined along an arbitrary different direction, you can make it happen using easy-to-combine discrete building blocks. In this case, the blocks we use are the logic gates that we've been developing with <a href="http://googleresearch.blogspot.com/2015/03/a-step-closer-to-quantum-computation.html">our superconducting architecture</a>. 
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-FT9Wm1KLUQs/V1c0dBvWYFI/AAAAAAAABEA/TL82c7WzUUYe0j6tSqL44mJM-uFXUr3ugCLcB/s1600/image03.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="406" src="https://3.bp.blogspot.com/-FT9Wm1KLUQs/V1c0dBvWYFI/AAAAAAAABEA/TL82c7WzUUYe0j6tSqL44mJM-uFXUr3ugCLcB/s640/image03.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Superconducting quantum chip with nine qubits. Each qubit (cross-shaped structures in the center) is connected to its neighbors and individually controlled. Photo credit: Julian Kelly.</td></tr></tbody></table>The key is controllability. Qubits, like other physical objects in nature, have a <a href="https://en.wikipedia.org/wiki/Resonance">resonance frequency</a>, and can be addressed individually with short voltage and current pulses. In our architecture we can steer this frequency, much like you would tune a radio to a broadcast.  We can even tune one qubit to the frequency of another one. By moving qubit frequencies to or away from each other, interactions can be turned on or off. The exchange of quantum information resembles a relay race, where the baton can be handed down when the runners meet.<br /><br />You can see the algorithm in action below. Any problem is encoded as local “directions” we want qubits to point to - like a weathervane pointing into the wind - and interactions, depicted here as links between the balls. We start by aligning all qubits into the same direction, and the interactions between the qubits turned off - this is the simplest ground state of the system. Next, we turn on interactions and change qubit directions to start evolving towards the energy landscape we wish to solve. 
The algorithmic steps are implemented with many control pulses, illustrating how the problem gets solved in a giant dance of <a href="https://en.wikipedia.org/wiki/Quantum_entanglement">quantum entanglement</a>.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/qpKy6jIhRCo/0.jpg" frameborder="0" height="320" src="https://www.youtube.com/embed/qpKy6jIhRCo?rel=0&amp;feature=player_embedded" width="640"></iframe></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td class="tr-caption" style="text-align: center;">Top: Depiction of the problem, with the gold arrows in the blue balls representing the directions we’d like each qubit to align to, like a weathervane pointing to the wind. The thickness of the link between the balls indicates the strength of the interaction - red denotes a ferromagnetic link, and blue an antiferromagnetic link. Middle: Implementation with qubits (yellow crosses) with control pulses (red) and steering the frequency (vertical direction). Qubits turn blue when there is interaction. The qubits turn green when they are being measured. Bottom: Zoom in of the physical device, showing the corresponding nine qubits (cross-shaped).</td></tr></tbody></table>To run the adiabatic quantum computation efficiently and design a set of test experiments we teamed up with the <a href="http://www.qutisgroup.com/">QUTIS group</a> at the University of the Basque Country in Bilbao, Spain, led by <a href="http://www.qutisgroup.com/prof-enrique-solano/">Prof. E. Solano</a> and <a href="http://www.qutisgroup.com/dr-lucas-lamata">Dr. L. Lamata</a>, who are experts in synthesizing digital algorithms. 
It’s the largest digital algorithm to date, with up to nine qubits and using over one thousand logic gates.<br /><br />The crucial advantage for the future is that this digital implementation is fully compatible with known <a href="http://googleresearch.blogspot.com/2015/03/a-step-closer-to-quantum-computation.html">quantum error correction techniques</a>, and can therefore be protected from the effects of noise. Otherwise, the noise will set a hard limit, as even the slightest amount can derail the state from following the fragile path to the solution. Since each quantum bit and interaction element can add noise to the system, some of the most important problems are well beyond reach, as they have many degrees of freedom and need a high connectivity. But with error correction, this approach becomes a general-purpose algorithm which can be scaled to an arbitrarily large quantum computer.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/quantum-annealing-with-a-digital-twist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Motion Stills – Create beautiful GIFs from Live Photos</title>
		<link>https://googledata.org/google-research/motion-stills-create-beautiful-gifs-from-live-photos/</link>
		<comments>https://googledata.org/google-research/motion-stills-create-beautiful-gifs-from-live-photos/#comments</comments>
		<pubDate>Tue, 07 Jun 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9277fd67e8960677f3aee8e7fe74252f</guid>
		<description><![CDATA[<span>Posted by Ken Conley and Matthias Grundmann, Machine Perception</span><br /><br />Today we are releasing <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&#38;mt=8">Motion Stills</a>, an iOS app from Google Research that acts as a virtual camera operator for your <a href="http://www.apple.com/ios/photos/">Apple Live Photos</a>. We use our <a href="https://research.googleblog.com/2012/05/video-stabilization-on-youtube.html">video stabilization</a> technology to freeze the background into a still photo or create sweeping cinematic pans. The resulting looping GIFs and movies come alive, and can easily be shared via messaging or on social media.<br /><div><a href="https://4.bp.blogspot.com/-kcbkeaAqbfQ/V1bwLbBXX2I/AAAAAAAABCc/OnA4JVjUVRkr1wQVNo9pIVXPnDp9E2fLwCLcB/s1600/image03.gif"><img border="0" height="640" src="https://4.bp.blogspot.com/-kcbkeaAqbfQ/V1bwLbBXX2I/AAAAAAAABCc/OnA4JVjUVRkr1wQVNo9pIVXPnDp9E2fLwCLcB/s640/image03.gif" width="312"></a></div>With Motion Stills, we provide an immersive stream experience that makes your clips fun to watch and share. You can also tell stories of your adventures by combining multiple clips into a movie montage. 
All of this works right on your phone, no Internet connection needed.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-2lfr9mlSznk/V1bwR3DmA-I/AAAAAAAABCk/fLLmm_lUWPsDoMJrcPbkVHEqxvfwEZqVgCLcB/s1600/image00.gif"><img border="0" height="480" src="https://2.bp.blogspot.com/-2lfr9mlSznk/V1bwR3DmA-I/AAAAAAAABCk/fLLmm_lUWPsDoMJrcPbkVHEqxvfwEZqVgCLcB/s640/image00.gif" width="640"></a></td></tr><tr><td>A Live Photo before and after stabilization with Motion Stills</td></tr></tbody></table><b>How does it work?</b><br />We pioneered this technology by stabilizing <a href="http://googleresearch.blogspot.com/2012/05/video-stabilization-on-youtube.html">hundreds of millions of videos</a> and creating <a href="https://googleblog.blogspot.com/2015/10/11-things-to-know-about-google-photos.html">GIF animations from photo bursts</a>. Our algorithm uses <a href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a> to compute a virtual camera path that is optimized to recast videos and bursts as if they were filmed using stabilization equipment, yielding a still background or creating cinematic pans to remove shakiness.<br /><br />Our challenge was to take technology designed to run distributed in a data center and shrink it down to run even faster on your mobile phone. We achieved a 40x speedup by using techniques such as temporal subsampling, decoupling of motion parameters, and using Google Research&#8217;s <a href="https://developers.google.com/optimization/lp/glop">custom linear solver, GLOP</a>. 
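To give a feel for the stabilization idea only: the real pipeline computes an optimal virtual camera path by solving a linear program with GLOP, but the effect of trading a shaky per-frame path for a smooth one can be sketched with a simple moving-average stand-in (not our LP method; the path data, window size, and shakiness measure below are invented for illustration):

```python
def smooth_path(raw_path, window=4):
    """Toy stabilization: replace each per-frame camera position with a
    centered moving average over nearby frames. (Motion Stills instead
    solves a linear program for an optimal path; this is only a sketch.)"""
    n = len(raw_path)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        smoothed.append(sum(raw_path[lo:hi]) / (hi - lo))
    return smoothed

def roughness(path):
    # Sum of absolute second differences: a crude shakiness measure.
    return sum(abs(path[i + 1] - 2 * path[i] + path[i - 1])
               for i in range(1, len(path) - 1))

# A jittery pan: steady rightward motion plus frame-to-frame shake.
raw = [0.5 * i + (0.4 if i % 2 else -0.4) for i in range(30)]
stab = smooth_path(raw)
```

After smoothing, the second-difference roughness of `stab` is far below that of `raw`, which is the qualitative effect stabilization is after: the steady pan survives while the shake is suppressed.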
We obtain further speedup and conserve storage by computing low-resolution warp textures to perform real-time GPU rendering, just like in a videogame.<br /><div><a href="https://3.bp.blogspot.com/-flKgnxgTuvo/V1bwhEY1KTI/AAAAAAAABCs/wNqY3EfpYygawlBe31_a1Wb8Dem4_RtxQCLcB/s1600/image01.gif"><img border="0" height="480" src="https://3.bp.blogspot.com/-flKgnxgTuvo/V1bwhEY1KTI/AAAAAAAABCs/wNqY3EfpYygawlBe31_a1Wb8Dem4_RtxQCLcB/s640/image01.gif" width="640"></a></div><b>Making it loop</b><br />Short videos are perfect for creating loops, so we added <i>loop optimization</i> to bring out the best in your captures. Our approach identifies optimal start and end points, and also discards blurry frames. As an added benefit, this fixes &#8220;pocket shots&#8221; (footage of the phone being put back into the pocket).<br /><br />To keep the background steady while looping, Motion Stills has to separate the background from the rest of the scene. This is a difficult task when foreground elements occlude significant portions of the video, as in the example below. Our novel method classifies motion vectors into foreground (red) and background (green) in a temporally consistent manner. We use a cascade of motion models, moving our motion estimation from simple to more complex models and biasing our results along the way.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-azSVp0WlrHo/V1bxQWmCMRI/AAAAAAAABC4/LK4jZoYdWFYN8ijr_UVB8vQq00e_YoCxgCLcB/s1600/image04.gif"><img border="0" height="426" src="https://1.bp.blogspot.com/-azSVp0WlrHo/V1bxQWmCMRI/AAAAAAAABC4/LK4jZoYdWFYN8ijr_UVB8vQq00e_YoCxgCLcB/s640/image04.gif" width="640"></a></td></tr><tr><td>Left: Original with virtual camera path (red rectangle) and motion classification; foreground(red) vs. background(green) Right: Motion Stills result</td></tr></tbody></table><b>Try it out</b><br />We&#8217;re excited to see what you can create with this app. 
From fun family moments to exciting adventures with friends, try it out and let us know what you think. Motion Stills is an on-device experience with no sign-in: even if you&#8217;re on top of a glacier without signal, you can see your results immediately. You can show us your favorite clips by using <a href="https://www.instagram.com/explore/tags/motionstills/">#motionstills</a>&#160;on social media. <br /><br />This app is a way for us to experiment and iterate quickly on the technology needed for short video creation. Based on the feedback we receive, we hope to integrate this feature into existing products like Google Photos. <br /><br />Motion Stills is available on the <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&#38;mt=8">App Store</a>. <br /><div><a href="https://4.bp.blogspot.com/-Elm2TIB6DLA/V1bxdFXrHoI/AAAAAAAABDI/F-nbOpDBW3s8SP4y_GCJAAsURzrnUp4IQCLcB/s1600/image02.gif"><img border="0" height="300" src="https://4.bp.blogspot.com/-Elm2TIB6DLA/V1bxdFXrHoI/AAAAAAAABDI/F-nbOpDBW3s8SP4y_GCJAAsURzrnUp4IQCLcB/s400/image02.gif" width="400"></a></div><br />]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Ken Conley and Matthias Grundmann, Machine Perception</span><br /><br />Today we are releasing <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&amp;mt=8">Motion Stills</a>, an iOS app from Google Research that acts as a virtual camera operator for your <a href="http://www.apple.com/ios/photos/">Apple Live Photos</a>. We use our <a href="https://research.googleblog.com/2012/05/video-stabilization-on-youtube.html">video stabilization</a> technology to freeze the background into a still photo or create sweeping cinematic pans. The resulting looping GIFs and movies come alive, and can easily be shared via messaging or on social media.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-kcbkeaAqbfQ/V1bwLbBXX2I/AAAAAAAABCc/OnA4JVjUVRkr1wQVNo9pIVXPnDp9E2fLwCLcB/s1600/image03.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://4.bp.blogspot.com/-kcbkeaAqbfQ/V1bwLbBXX2I/AAAAAAAABCc/OnA4JVjUVRkr1wQVNo9pIVXPnDp9E2fLwCLcB/s640/image03.gif" width="312" /></a></div>With Motion Stills, we provide an immersive stream experience that makes your clips fun to watch and share. You can also tell stories of your adventures by combining multiple clips into a movie montage. 
All of this works right on your phone, no Internet connection needed.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-2lfr9mlSznk/V1bwR3DmA-I/AAAAAAAABCk/fLLmm_lUWPsDoMJrcPbkVHEqxvfwEZqVgCLcB/s1600/image00.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="480" src="https://2.bp.blogspot.com/-2lfr9mlSznk/V1bwR3DmA-I/AAAAAAAABCk/fLLmm_lUWPsDoMJrcPbkVHEqxvfwEZqVgCLcB/s640/image00.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A Live Photo before and after stabilization with Motion Stills</td></tr></tbody></table><b>How does it work?</b><br />We pioneered this technology by stabilizing <a href="http://googleresearch.blogspot.com/2012/05/video-stabilization-on-youtube.html">hundreds of millions of videos</a> and creating <a href="https://googleblog.blogspot.com/2015/10/11-things-to-know-about-google-photos.html">GIF animations from photo bursts</a>. Our algorithm uses <a href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a> to compute a virtual camera path that is optimized to recast videos and bursts as if they were filmed using stabilization equipment, yielding a still background or creating cinematic pans to remove shakiness.<br /><br />Our challenge was to take technology designed to run distributed in a data center and shrink it down to run even faster on your mobile phone. We achieved a 40x speedup by using techniques such as temporal subsampling, decoupling of motion parameters, and using Google Research’s <a href="https://developers.google.com/optimization/lp/glop">custom linear solver, GLOP</a>. 
We obtain further speedup and conserve storage by computing low-resolution warp textures to perform real-time GPU rendering, just like in a videogame.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-flKgnxgTuvo/V1bwhEY1KTI/AAAAAAAABCs/wNqY3EfpYygawlBe31_a1Wb8Dem4_RtxQCLcB/s1600/image01.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://3.bp.blogspot.com/-flKgnxgTuvo/V1bwhEY1KTI/AAAAAAAABCs/wNqY3EfpYygawlBe31_a1Wb8Dem4_RtxQCLcB/s640/image01.gif" width="640" /></a></div><b>Making it loop</b><br />Short videos are perfect for creating loops, so we added <i>loop optimization</i> to bring out the best in your captures. Our approach identifies optimal start and end points, and also discards blurry frames. As an added benefit, this fixes “pocket shots” (footage of the phone being put back into the pocket).<br /><br />To keep the background steady while looping, Motion Stills has to separate the background from the rest of the scene. This is a difficult task when foreground elements occlude significant portions of the video, as in the example below. Our novel method classifies motion vectors into foreground (red) and background (green) in a temporally consistent manner. 
We use a cascade of motion models, moving our motion estimation from simple to more complex models and biasing our results along the way.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-azSVp0WlrHo/V1bxQWmCMRI/AAAAAAAABC4/LK4jZoYdWFYN8ijr_UVB8vQq00e_YoCxgCLcB/s1600/image04.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="426" src="https://1.bp.blogspot.com/-azSVp0WlrHo/V1bxQWmCMRI/AAAAAAAABC4/LK4jZoYdWFYN8ijr_UVB8vQq00e_YoCxgCLcB/s640/image04.gif" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Left: Original with virtual camera path (red rectangle) and motion classification; foreground(red) vs. background(green) Right: Motion Stills result</td></tr></tbody></table><b>Try it out</b><br />We’re excited to see what you can create with this app. From fun family moments to exciting adventures with friends, try it out and let us know what you think. Motion Stills is an on-device experience with no sign-in: even if you’re on top of a glacier without signal, you can see your results immediately. You can show us your favorite clips by using <a href="https://www.instagram.com/explore/tags/motionstills/">#motionstills</a>&nbsp;on social media. <br /><br />This app is a way for us to experiment and iterate quickly on the technology needed for short video creation. Based on the feedback we receive, we hope to integrate this feature into existing products like Google Photos. <br /><br />Motion Stills is available on the <a href="https://itunes.apple.com/us/app/motion-stills-create-live/id1086172168?ls=1&amp;mt=8">App Store</a>. 
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-Elm2TIB6DLA/V1bxdFXrHoI/AAAAAAAABDI/F-nbOpDBW3s8SP4y_GCJAAsURzrnUp4IQCLcB/s1600/image02.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://4.bp.blogspot.com/-Elm2TIB6DLA/V1bxdFXrHoI/AAAAAAAABDI/F-nbOpDBW3s8SP4y_GCJAAsURzrnUp4IQCLcB/s400/image02.gif" width="400" /></a></div><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/motion-stills-create-beautiful-gifs-from-live-photos/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&quot;Aw, so cute!&quot;: Allo helps you respond to shared photos</title>
		<link>https://googledata.org/google-research/aw-so-cute-allo-helps-you-respond-to-shared-photos/</link>
		<comments>https://googledata.org/google-research/aw-so-cute-allo-helps-you-respond-to-shared-photos/#comments</comments>
		<pubDate>Wed, 18 May 2016 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5d5d08b2d7ce4465da5b6bdc4d851da8</guid>
		<description><![CDATA[<span>by Ariel Fuxman, Research Scientist</span><br /><br />Today, Google <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">announced Allo</a> &#8212; our new mobile messaging app.  From day one of the Allo development effort, we set out to build a truly special product that is powered by Google&#8217;s strengths in machine intelligence to make messaging easier, more efficient, and more expressive.  Photo Reply is a unique feature of Allo that does just that!  We use machine learning to understand what a shared photo depicts and to suggest rich natural language replies that the user can tap to send.  This makes it easier for users to sustain meaningful conversations while using small mobile keyboards.     <br /><br />Here is an example of the responses that Allo suggests when a friend shares a photo of his child.<br /><div></div><div><a href="https://4.bp.blogspot.com/-cAKNObZ8x_U/VzzkFEnGbSI/AAAAAAAABCE/juPRebiyXgUqFbEChenzOOVcFaQehFLqgCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B2.50.07%2BPM.png"><img border="0" height="540" src="https://4.bp.blogspot.com/-cAKNObZ8x_U/VzzkFEnGbSI/AAAAAAAABCE/juPRebiyXgUqFbEChenzOOVcFaQehFLqgCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B2.50.07%2BPM.png" width="640"></a></div><b>Photo Reply &#8212; Under the Hood</b><br /><br />During the winter, our product managers, Patrick McGregor and Ryan Cassidy, challenged us to develop new approaches to simplify media sharing in messaging while simultaneously delighting users with Google insights.  
With my colleagues Vivek Ramavajjala, Sergey Nazarov, and Sujith Ravi, we set out to build Photo Reply.<br /><br />We utilize Google's <a href="http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html">image recognition technology</a>, developed by our <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a> team, to associate images with <i>semantic entities</i> &#8212; people, animals, cars, etc.  We then apply a machine learned model that maps those recognized entities to actual natural language responses. Our system produces replies for thousands of entity types that  are drawn from a taxonomy that is a subset of Google's <a href="https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html">Knowledge Graph</a> and may be at different granularity levels. For example, when you receive a photo of a dog, the system may detect that the dog is actually a labrador and suggest "Love that lab!". Or given a photo of a pasta dish, it may detect the type of pasta ("Yum linguine!") and even the cuisine ("I love Italian food!").<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-OTFHk4Z_bZs/VzzLk8J7CsI/AAAAAAAABBA/tLRFP9qyrZo4zoPSNi8k6NrCoy_W-1WvQCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.01%2BPM.png"><img border="0" height="260" src="https://2.bp.blogspot.com/-OTFHk4Z_bZs/VzzLk8J7CsI/AAAAAAAABBA/tLRFP9qyrZo4zoPSNi8k6NrCoy_W-1WvQCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.01%2BPM.png" width="640"></a></td></tr><tr><td>Examples of response suggestions reflecting fine-grained object classes</td></tr></tbody></table>One aspect of the system that we find very useful is that it can suggest responses not just for physical objects but also for abstract concepts. 
It can produce suggestions for events (birthday parties, weddings, etc.), nature (sunrises, mountains, etc.), recreational activities (hiking, camping, etc.), and many more categories.  Also, the system can generate responses that reflect the emotions that might be associated with an image, such as &#8220;happiness&#8221;.  Here are some examples of responses for abstract concepts:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-rNsycTKcWLc/VzzL0_3zwyI/AAAAAAAABBE/V5crK_ktmbMAI9IupZ-crPyjj4NzvTz9gCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.25%2BPM.png"><img border="0" height="260" src="https://1.bp.blogspot.com/-rNsycTKcWLc/VzzL0_3zwyI/AAAAAAAABBE/V5crK_ktmbMAI9IupZ-crPyjj4NzvTz9gCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.25%2BPM.png" width="640"></a></td></tr><tr><td>Response suggestions reflecting abstract concepts</td></tr></tbody></table><b>Learning entity-response associations</b><br /><br />At runtime, Photo Reply recognizes entities in the shared photo and triggers responses for the entities. The model that maps entities to natural language responses is learned offline using <a href="http://arxiv.org/abs/1512.01752">Expander</a>, which is a large-scale graph-based <a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning</a> platform at Google. We built a massive graph where nodes correspond to photos, semantic entities, and textual responses.  Edges in the graph indicate when an entity was recognized for a photo, when a specific response was given for a photo, and visual similarities between photos. Some of the nodes are "labeled" and we learn associations for the unlabeled nodes by propagating label information across the graph. <br /><br />To illustrate this, consider the graph below. There are two labels: the red label corresponds to the response "yummy" and the blue label corresponds to "delicious". 
The nodes for "spaghetti" and "linguine" are unlabeled, but from the fact that they are close to the red and blue nodes, the algorithm can learn that they should be associated with the "yummy" and "delicious" responses. Notice that in this way, we are associating the entity "linguine" with the response "yummy" even though none of the linguine photos in the graph are directly connected to this answer. Expander can perform this kind of learning at very large scale, for graphs containing billions of nodes and hundreds of billions of edges.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-d9yM95LtOq8/VzzM5kN3dWI/AAAAAAAABBY/GIQPQGMZtKsyPBN7UhPLS2PD9uyckoCYQCLcB/s1600/blog%2Bdiagrams-04.png"><img border="0" height="424" src="https://4.bp.blogspot.com/-d9yM95LtOq8/VzzM5kN3dWI/AAAAAAAABBY/GIQPQGMZtKsyPBN7UhPLS2PD9uyckoCYQCLcB/s640/blog%2Bdiagrams-04.png" width="640"></a></td></tr><tr><td>Graph of entities, photos, and responses</td></tr></tbody></table><div></div>Photo Reply is an exciting example of <a href="https://en.wikipedia.org/wiki/Multimodal_learning">multimodal learning</a>, where computer vision and natural language processing come together in order to create a compelling user experience.  Allo will be available on Android and iOS later this summer.  Be sure to check out what Allo sees in your beautiful photos!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">by Ariel Fuxman, Research Scientist</span><br /><br />Today, Google <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">announced Allo</a> — our new mobile messaging app.  From day one of the Allo development effort, we set out to build a truly special product that is powered by Google’s strengths in machine intelligence to make messaging easier, more efficient, and more expressive.  Photo Reply is a unique feature of Allo that does just that!  We use machine learning to understand what a shared photo depicts and to suggest rich natural language replies that the user can tap to send.  This makes it easier for users to sustain meaningful conversations while using small mobile keyboards.<br /><br />Here is an example of the responses that Allo suggests when a friend shares a photo of his child.<br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-cAKNObZ8x_U/VzzkFEnGbSI/AAAAAAAABCE/juPRebiyXgUqFbEChenzOOVcFaQehFLqgCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B2.50.07%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="540" src="https://4.bp.blogspot.com/-cAKNObZ8x_U/VzzkFEnGbSI/AAAAAAAABCE/juPRebiyXgUqFbEChenzOOVcFaQehFLqgCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B2.50.07%2BPM.png" width="640" /></a></div><b>Photo Reply — Under the Hood</b><br /><br />During the winter, our product managers, Patrick McGregor and Ryan Cassidy, challenged us to develop new approaches to simplify media sharing in messaging while simultaneously delighting users with Google insights.  
With my colleagues Vivek Ramavajjala, Sergey Nazarov, and Sujith Ravi, we set out to build Photo Reply.<br /><br />We utilize Google's <a href="http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html">image recognition technology</a>, developed by our <a href="http://research.google.com/pubs/MachinePerception.html">Machine Perception</a> team, to associate images with <i>semantic entities</i> — people, animals, cars, etc.  We then apply a machine learned model that maps those recognized entities to actual natural language responses. Our system produces replies for thousands of entity types that  are drawn from a taxonomy that is a subset of Google's <a href="https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html">Knowledge Graph</a> and may be at different granularity levels. For example, when you receive a photo of a dog, the system may detect that the dog is actually a labrador and suggest "Love that lab!". Or given a photo of a pasta dish, it may detect the type of pasta ("Yum linguine!") and even the cuisine ("I love Italian food!").<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-OTFHk4Z_bZs/VzzLk8J7CsI/AAAAAAAABBA/tLRFP9qyrZo4zoPSNi8k6NrCoy_W-1WvQCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.01%2BPM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="260" src="https://2.bp.blogspot.com/-OTFHk4Z_bZs/VzzLk8J7CsI/AAAAAAAABBA/tLRFP9qyrZo4zoPSNi8k6NrCoy_W-1WvQCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.01%2BPM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Examples of response suggestions reflecting fine-grained object classes</td></tr></tbody></table>One aspect of the system that we find very useful is that it can suggest responses not 
just for physical objects but also for abstract concepts. It can produce suggestions for events (birthday parties, weddings, etc.), nature (sunrises, mountains, etc.), recreational activities (hiking, camping, etc.), and many more categories.  Also, the system can generate responses that reflect the emotions that might be associated with an image, such as “happiness”.  Here are some examples of responses for abstract concepts:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-rNsycTKcWLc/VzzL0_3zwyI/AAAAAAAABBE/V5crK_ktmbMAI9IupZ-crPyjj4NzvTz9gCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.25%2BPM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="260" src="https://1.bp.blogspot.com/-rNsycTKcWLc/VzzL0_3zwyI/AAAAAAAABBE/V5crK_ktmbMAI9IupZ-crPyjj4NzvTz9gCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B12.57.25%2BPM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Response suggestions reflecting abstract concepts</td></tr></tbody></table><b>Learning entity-response associations</b><br /><br />At runtime, Photo Reply recognizes entities in the shared photo and triggers responses for the entities. The model that maps entities to natural language responses is learned offline using <a href="http://arxiv.org/abs/1512.01752">Expander</a>, which is a large-scale graph-based <a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning</a> platform at Google. We built a massive graph where nodes correspond to photos, semantic entities, and textual responses.  Edges in the graph indicate when an entity was recognized for a photo, when a specific response was given for a photo, and visual similarities between photos. 
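A toy version of such a graph, together with a simple iterative label-propagation pass, can be sketched in a few lines of plain Python. The nodes, edges, and responses below are invented for illustration, and this is not Expander's actual algorithm, only the general idea of spreading labels from seed nodes to their neighbors:

```python
# Toy label propagation over an entity/photo/response graph.
# Each node's label distribution is repeatedly averaged with its
# neighbors' distributions; seed nodes keep their labels fixed.

def propagate(edges, seeds, iterations=20):
    # Build an undirected adjacency list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    labels = {n: dict(d) for n, d in seeds.items()}
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:  # seed labels are clamped
                updated[node] = dict(seeds[node])
                continue
            scores = {}
            for nb in neighbors[node]:
                for label, w in labels.get(nb, {}).items():
                    scores[label] = scores.get(label, 0.0) + w
            total = sum(scores.values())
            if total:
                updated[node] = {l: w / total for l, w in scores.items()}
        labels = updated
    return labels

# Photos connect entities to responses; "linguine" is never directly
# linked to a labeled response, yet still picks up both labels.
edges = [
    ("photo1", "spaghetti"), ("photo1", "yummy"),
    ("photo2", "spaghetti"), ("photo2", "delicious"),
    ("photo3", "linguine"),  ("photo2", "photo3"),  # visually similar photos
]
seeds = {"yummy": {"yummy": 1.0}, "delicious": {"delicious": 1.0}}
result = propagate(edges, seeds)
print(sorted(result["linguine"]))  # -> ['delicious', 'yummy']
```

Note how the "linguine" node acquires the "yummy" label only through multi-hop paths in the graph, which is exactly the behavior described in the text.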
Some of the nodes are "labeled" and we learn associations for the unlabeled nodes by propagating label information across the graph. <br /><br />To illustrate this, consider the graph below. There are two labels: the red label corresponds to the response "yummy" and the blue label corresponds to "delicious". The nodes for "spaghetti" and "linguine" are unlabeled, but from the fact that they are close to the red and blue nodes, the algorithm can learn that they should be associated with the "yummy" and "delicious" responses. Notice that in this way, we are associating the entity "linguine" with the response "yummy" even though none of the linguine photos in the graph are directly connected to this answer. Expander can perform this kind of learning at very large scale, for graphs containing billions of nodes and hundreds of billions of edges.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-d9yM95LtOq8/VzzM5kN3dWI/AAAAAAAABBY/GIQPQGMZtKsyPBN7UhPLS2PD9uyckoCYQCLcB/s1600/blog%2Bdiagrams-04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="424" src="https://4.bp.blogspot.com/-d9yM95LtOq8/VzzM5kN3dWI/AAAAAAAABBY/GIQPQGMZtKsyPBN7UhPLS2PD9uyckoCYQCLcB/s640/blog%2Bdiagrams-04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Graph of entities, photos, and responses</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div>Photo Reply is an exciting example of <a href="https://en.wikipedia.org/wiki/Multimodal_learning">multimodal learning</a>, where computer vision and natural language processing come together in order to create a compelling user experience.  Allo will be available on Android and iOS later this summer.  
Be sure to check out what Allo sees in your beautiful photos!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/aw-so-cute-allo-helps-you-respond-to-shared-photos/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Chat Smarter with Allo</title>
		<link>https://googledata.org/google-research/chat-smarter-with-allo/</link>
		<comments>https://googledata.org/google-research/chat-smarter-with-allo/#comments</comments>
		<pubDate>Wed, 18 May 2016 17:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=901359cf26d06c1874f2e185f6ac412b</guid>
		<description><![CDATA[<span>Posted by Pranav Khaitan, Google Research</span><br /><br />At Google, we are continuously building products powered by <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> to delight our users and simplify their lives. Today, we are excited to talk about the technology behind <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">Allo</a>, a new smart messaging app that uses the power of <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a> and Google Search to make your text conversations easier and more productive.<br /><br />Just like <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>, Allo understands the conversation history to generate a set of suggestions that the user will likely want to respond with. In addition to understanding the context of your conversation, Allo learns your individual style, so the responses are personalized for you.<br /><div><a href="https://1.bp.blogspot.com/-alT-I567nTc/Vzyev853krI/AAAAAAAABAQ/KMxYIR15-2kLTrqioP5-C_gVUYhnwYHxgCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B9.38.19%2BAM.png"><img border="0" height="540" src="https://1.bp.blogspot.com/-alT-I567nTc/Vzyev853krI/AAAAAAAABAQ/KMxYIR15-2kLTrqioP5-C_gVUYhnwYHxgCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B9.38.19%2BAM.png" width="640"></a></div><b>How does it work?</b><br /><br />About a year ago, we started exploring how we can make communication easier and more fun. The idea of Smart Reply for Allo came up in a brainstorming session with my teammates Sushant Prakash and Ori Gershony who then helped me lead our team to build this technology. We began by experimenting with neural network based model architectures which had proven to be successful for sequence prediction, including the encoder-decoder model used in Smart Reply for Inbox. 
<br /><br />One challenge we faced was that response generation in online conversations has very strict latency requirements.  To address this, Pavel Sountsov and Sushant came up with an innovative two-stage model that works as follows.  First, a <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a> looks at the conversation context one word at a time and encodes it in the hidden state of a <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">long short-term memory</a> (LSTM).  Below, we show an example with a context &#8216;Where are you?&#8217;.  The context has three tokens, each of which is embedded into a continuous space and input to the LSTM.   The LSTM state now encodes the context as a continuous vector.   This vector is used to generate the response as a discretized semantic class.  <br /><div><a href="https://2.bp.blogspot.com/-Y5VjTvnaez4/Vzye4y7uxrI/AAAAAAAABAU/MzHJolDQnVU5siTWDX2jH7VGMZWI4B8RgCLcB/s1600/image00.png"><img border="0" height="400" src="https://2.bp.blogspot.com/-Y5VjTvnaez4/Vzye4y7uxrI/AAAAAAAABAU/MzHJolDQnVU5siTWDX2jH7VGMZWI4B8RgCLcB/s400/image00.png" width="362"></a></div>Each semantic class is associated with a set of possible messages that belong to it. We use a second recurrent network to generate a specific message from that set.  This network also converts the context into a hidden LSTM state but this time the hidden state is used to generate the full message of the reply one token at a time.  For example, now the LSTM after seeing the context &#8220;Where are you?&#8221; generates the tokens in the response: &#8220;I&#8217;m at work&#8221;.  
<br /><div><a href="https://3.bp.blogspot.com/-Yq1TRVEqKgw/VzyfB4lsqVI/AAAAAAAABAY/9STO-dzVWDUCL-9jzv3nJAKJ1BjmxzZYACLcB/s1600/image01.png"><img border="0" height="412" src="https://3.bp.blogspot.com/-Yq1TRVEqKgw/VzyfB4lsqVI/AAAAAAAABAY/9STO-dzVWDUCL-9jzv3nJAKJ1BjmxzZYACLcB/s640/image01.png" width="640"></a></div>A <a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a> is used to efficiently select the top-N highest scoring responses from among the very large set of possible messages that an LSTM can generate. A snippet of the search space explored by such a beam-search technique is shown below.<br /><div><a href="https://3.bp.blogspot.com/-8P950NVeWIM/VzyfKG91BpI/AAAAAAAABAc/koVPi5wZvpQsdvAqHTPEQe2B-29PqZDOwCLcB/s1600/image04.png"><img border="0" height="360" src="https://3.bp.blogspot.com/-8P950NVeWIM/VzyfKG91BpI/AAAAAAAABAc/koVPi5wZvpQsdvAqHTPEQe2B-29PqZDOwCLcB/s400/image04.png" width="400"></a></div>As with any large-scale product, there were several engineering challenges we had to solve in generating a set of high-quality responses efficiently. For example, in spite of the two-stage architecture, our first few networks were very slow and required about half a second to generate a response. This was obviously a deal breaker when we are talking about real-time communication apps! So we had to evolve our neural network architecture further to reduce the latency to less than 200ms. We moved from using a softmax layer to a hierarchical softmax layer which traverses a tree of words instead of traversing a list of words thus making it more efficient.   <br /><br />Another interesting challenge we had to solve when generating predictions is controlling for message length. Sometimes none of the most probable responses are appropriate: if the model predicts too short a message, it might not be useful to the user, and if we predict something too long, it might not fit on the phone screen. 
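The beam search described above can be sketched in a few lines of Python. The toy next-token model and all of its probabilities here are invented purely for illustration; a real system would score tokens with the LSTM instead:

```python
import heapq
import math

# Toy next-token model: maps a token prefix to a distribution over the
# next token. "</s>" ends a message. Values are made up for this sketch.
MODEL = {
    (): {"I'm": 0.7, "At": 0.3},
    ("I'm",): {"at": 0.9, "here": 0.1},
    ("I'm", "at"): {"work": 0.9, "home": 0.1},
    ("I'm", "at", "work"): {"</s>": 1.0},
    ("I'm", "at", "home"): {"</s>": 1.0},
    ("I'm", "here"): {"</s>": 1.0},
    ("At",): {"work": 1.0},
    ("At", "work"): {"</s>": 1.0},
}

def beam_search(model, beam_width=2, max_len=5):
    beams = [(0.0, ())]  # (log-probability, partial message)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for token, p in model.get(prefix, {}).items():
                if token == "</s>":
                    finished.append((logp + math.log(p), prefix))
                else:
                    candidates.append((logp + math.log(p), prefix + (token,)))
        if not candidates:
            break
        # Keep only the best `beam_width` partial hypotheses at each step.
        beams = heapq.nlargest(beam_width, candidates)
    return [" ".join(tokens) for _, tokens in sorted(finished, reverse=True)]

print(beam_search(MODEL))  # completed responses, highest-scoring first
```

Keeping only a fixed number of hypotheses per step is what makes the search over an exponentially large space of messages tractable; the text that follows describes how the scores can additionally be biased toward responses of useful length.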
We solved this by biasing the beam search to follow paths that lead to higher utility responses instead of favoring just the responses that are most probable. That way, we can efficiently generate response predictions of appropriate length that are useful to our users.<br /><br /><b>Personalized for you</b><br /><br />The best part about these suggestions is that over time they are personalized to you so that your individual style is reflected in your conversations. For example, if you often reply to &#8220;How are you?&#8221; with &#8220;Fine.&#8221; instead of &#8220;I am good.&#8221;, it will learn your preference and your future suggestions will take that into account. This was accomplished by incorporating a user's "style" as one of the features in a neural network that is used to predict the next word in a response, resulting in suggestions that are customized for your personality and individual preferences. The user's style is captured in a sequence of numbers that we call the user embedding. These embeddings can be generated as part of the regular model training, but this approach requires waiting many days for training to complete and it cannot handle more than a few million users. To solve this issue, Alon Shafrir implemented an <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a>-based technique to generate user embeddings quickly and at scale. Now, you'll be able to enjoy personalized suggestions after only a short time of using Allo. <br /><br /><b>More than just English</b><br /><br />The neural network model described above is language agnostic so building separate prediction models for each language works quite well. To make sure that responses for each language benefit from our semantic understanding of other languages, Sujith Ravi came up with a graph-based machine learning technique that can connect possible responses across languages. 
Dana Movshovitz-Attias and Peter Young applied this technique to build a graph that connects responses to incoming messages and to other responses that have similar word embeddings and syntactic relationships. It also connects responses with similar meaning across languages based on the <a href="https://en.wikipedia.org/wiki/Machine_translation">machine translation</a> models developed by our <a href="http://research.google.com/pubs/MachineTranslation.html">Translate team</a>. <br /><br />With this graph, we use <a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning</a>, as described in this <a href="http://arxiv.org/abs/1512.01752">paper</a>, to learn the semantic meaning of responses and determine which are the most useful clusters of possible responses. As a result, we can allow the LSTM to score many possible variants of each possible response meaning, allowing the personalization routines to select the best response for the user in the context of the conversation. This also helps enforce diversity as we can now pick the final set of responses from different semantic clusters.<br /><br />Here&#8217;s an example of how the graph might look for a set of messages related to greetings:<br /><div><a href="https://4.bp.blogspot.com/-S9T56cHYgMg/VzyfTP6XCPI/AAAAAAAABAw/268jVv02G6cglepQ2dVVHEmjuhsd2xXDACKgB/s1600/image02.png"><img border="0" height="230" src="https://4.bp.blogspot.com/-S9T56cHYgMg/VzyfTP6XCPI/AAAAAAAABAw/268jVv02G6cglepQ2dVVHEmjuhsd2xXDACKgB/s400/image02.png" width="400"></a></div><b>Beyond Smart Reply</b><br /><br />I am also very excited about the Google assistant in Allo with which you can converse and get information about anything that Google Search knows about. It understands your sentences and helps you accomplish tasks directly from the conversation. For example, the Google assistant can help you discover a restaurant and reserve a table from within the Allo app when chatting with your friends. 
This has been made possible because of the cutting-edge research in natural language understanding that we have been doing at Google. More details to follow soon!<br /><br />These smart features will be part of the Android and iOS apps for Allo that will be available later this summer. We can&#8217;t wait for you to try and enjoy it!<br /><br />We wish to acknowledge the hard work of the following  in building Smart Reply:<br /><i>Pranav Khaitan, Sushant Prakash, Pavel Sountsov,  Alon Shafrir, Max Gubin, Shu Zhang, Sunita Sarawagi, Ori Gershony, Sergey Nazarov, Hung Pham, Harini Krishnamurthy, Ryan Cassidy, Dave Citron, Patrick McGregor, Sujith Ravi, Dana Movshovitz-Attias, Peter Young, Vivek Ramavajjala</i>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Pranav Khaitan, Google Research</span><br /><br />At Google, we are continuously building products powered by <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> to delight our users and simplify their lives. Today, we are excited to talk about the technology behind <a href="https://googleblog.blogspot.com/2016/05/allo-duo-apps-messaging-video.html">Allo</a>, a new smart messaging app that uses the power of <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a> and Google Search to make your text conversations easier and more productive.<br /><br />Just like <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">Smart Reply for Inbox</a>, Allo understands the conversation history to generate a set of suggestions that the user will likely want to respond with. In addition to understanding the context of your conversation, Allo learns your individual style, so the responses are personalized for you.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-alT-I567nTc/Vzyev853krI/AAAAAAAABAQ/KMxYIR15-2kLTrqioP5-C_gVUYhnwYHxgCLcB/s1600/Screen%2BShot%2B2016-05-18%2Bat%2B9.38.19%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="540" src="https://1.bp.blogspot.com/-alT-I567nTc/Vzyev853krI/AAAAAAAABAQ/KMxYIR15-2kLTrqioP5-C_gVUYhnwYHxgCLcB/s640/Screen%2BShot%2B2016-05-18%2Bat%2B9.38.19%2BAM.png" width="640" /></a></div><b>How does it work?</b><br /><br />About a year ago, we started exploring how we can make communication easier and more fun. The idea of Smart Reply for Allo came up in a brainstorming session with my teammates Sushant Prakash and Ori Gershony who then helped me lead our team to build this technology. 
We began by experimenting with neural network based model architectures which had proven to be successful for sequence prediction, including the encoder-decoder model used in Smart Reply for Inbox. <br /><br />One challenge we faced was that response generation in online conversations has very strict latency requirements.  To address this, Pavel Sountsov and Sushant came up with an innovative two-stage model that works as follows.  First, a <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a> looks at the conversation context one word at a time and encodes it in the hidden state of a <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">long short-term memory</a> (LSTM).  Below, we show an example with a context ‘Where are you?’.  The context has three tokens, each of which is embedded into a continuous space and input to the LSTM.   The LSTM state now encodes the context as a continuous vector.   This vector is used to generate the response as a discretized semantic class.  <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-Y5VjTvnaez4/Vzye4y7uxrI/AAAAAAAABAU/MzHJolDQnVU5siTWDX2jH7VGMZWI4B8RgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://2.bp.blogspot.com/-Y5VjTvnaez4/Vzye4y7uxrI/AAAAAAAABAU/MzHJolDQnVU5siTWDX2jH7VGMZWI4B8RgCLcB/s400/image00.png" width="362" /></a></div>Each semantic class is associated with a set of possible messages that belong to it. We use a second recurrent network to generate a specific message from that set.  This network also converts the context into a hidden LSTM state but this time the hidden state is used to generate the full message of the reply one token at a time.  For example, now the LSTM after seeing the context “Where are you?” generates the tokens in the response: “I’m at work”.  
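The two-stage split (context → semantic class, then class → concrete message) can be made concrete with a deliberately trivial sketch of the data flow. The LSTMs are replaced here by stand-ins (a keyword rule for stage one and canned per-class token lists for stage two); every class name and message below is invented for illustration:

```python
# Stage 1 stand-in for the context LSTM: map a conversation context
# to a discrete semantic class.
def classify_context(context):
    ctx = context.lower()
    if "where" in ctx:
        return "location_query"
    if "how are you" in ctx:
        return "wellbeing_query"
    return "other"

# Stage 2 stand-in for the generation LSTM: emit a concrete message,
# token by token, from the predicted semantic class.
CLASS_TO_MESSAGES = {
    "location_query": [["I'm", "at", "work"], ["On", "my", "way"]],
    "wellbeing_query": [["I'm", "good"], ["Fine,", "thanks!"]],
    "other": [["OK"]],
}

def suggest_replies(context, n=2):
    semantic_class = classify_context(context)
    return [" ".join(tokens) for tokens in CLASS_TO_MESSAGES[semantic_class][:n]]

print(suggest_replies("Where are you?"))  # first suggestion: "I'm at work"
```

The point of the split is that the expensive, open-ended generation problem is reduced to a cheap classification followed by generation within a much smaller set, which is what makes the latency budget achievable.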
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Yq1TRVEqKgw/VzyfB4lsqVI/AAAAAAAABAY/9STO-dzVWDUCL-9jzv3nJAKJ1BjmxzZYACLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="https://3.bp.blogspot.com/-Yq1TRVEqKgw/VzyfB4lsqVI/AAAAAAAABAY/9STO-dzVWDUCL-9jzv3nJAKJ1BjmxzZYACLcB/s640/image01.png" width="640" /></a></div>A <a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a> is used to efficiently select the top-N highest scoring responses from among the very large set of possible messages that an LSTM can generate. A snippet of the search space explored by such a beam-search technique is shown below.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-8P950NVeWIM/VzyfKG91BpI/AAAAAAAABAc/koVPi5wZvpQsdvAqHTPEQe2B-29PqZDOwCLcB/s1600/image04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://3.bp.blogspot.com/-8P950NVeWIM/VzyfKG91BpI/AAAAAAAABAc/koVPi5wZvpQsdvAqHTPEQe2B-29PqZDOwCLcB/s400/image04.png" width="400" /></a></div>As with any large-scale product, there were several engineering challenges we had to solve in generating a set of high-quality responses efficiently. For example, in spite of the two-stage architecture, our first few networks were very slow and required about half a second to generate a response. This was obviously a deal breaker when we are talking about real-time communication apps! So we had to evolve our neural network architecture further to reduce the latency to less than 200ms. We moved from using a softmax layer to a hierarchical softmax layer which traverses a tree of words instead of traversing a list of words thus making it more efficient.   <br /><br />Another interesting challenge we had to solve when generating predictions is controlling for message length. 
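The hierarchical softmax mentioned above replaces one normalization over the whole vocabulary with a walk down a binary tree of words, so scoring a word costs O(log V) branch decisions rather than O(V). Here is a minimal sketch over an invented four-word vocabulary; in the real model each inner-node score would come from the LSTM hidden state, whereas fixed constants are used here purely for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each word is reached by a path of left/right decisions from the root.
# A path entry is (inner-node score, go_left); the scores and tree shape
# below are invented for this sketch.
PATHS = {
    "work":  [(0.8, True),  (1.2, True)],
    "home":  [(0.8, True),  (1.2, False)],
    "here":  [(0.8, False), (-0.3, True)],
    "there": [(0.8, False), (-0.3, False)],
}

def word_probability(word):
    # P(word) is the product of branch probabilities along its path:
    # sigmoid(score) for a left turn, 1 - sigmoid(score) for a right turn.
    p = 1.0
    for score, go_left in PATHS[word]:
        branch = sigmoid(score)
        p *= branch if go_left else 1.0 - branch
    return p

# The tree yields a valid distribution (leaf probabilities sum to 1),
# yet scoring one word touches only two inner nodes, not all four words.
total = sum(word_probability(w) for w in PATHS)
print(round(total, 6))  # 1.0
```

With a realistic vocabulary of tens of thousands of words, the tree depth stays in the teens, which is where the latency win over a flat softmax comes from.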
Sometimes none of the most probable responses are appropriate: if the model predicts too short a message, it might not be useful to the user, and if we predict something too long, it might not fit on the phone screen. We solved this by biasing the beam search to follow paths that lead to higher utility responses instead of favoring just the responses that are most probable. That way, we can efficiently generate response predictions of appropriate length that are useful to our users.<br /><br /><b>Personalized for you</b><br /><br />The best part about these suggestions is that over time they are personalized to you so that your individual style is reflected in your conversations. For example, if you often reply to “How are you?” with “Fine.” instead of “I am good.”, it will learn your preference and your future suggestions will take that into account. This was accomplished by incorporating a user's "style" as one of the features in a neural network that is used to predict the next word in a response, resulting in suggestions that are customized for your personality and individual preferences. The user's style is captured in a sequence of numbers that we call the user embedding. These embeddings can be generated as part of the regular model training, but this approach requires waiting many days for training to complete and it cannot handle more than a few million users. To solve this issue, Alon Shafrir implemented an <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a>-based technique to generate user embeddings quickly and at scale. Now, you'll be able to enjoy personalized suggestions after only a short time of using Allo. <br /><br /><b>More than just English</b><br /><br />The neural network model described above is language agnostic so building separate prediction models for each language works quite well. 
To make sure that responses for each language benefit from our semantic understanding of other languages, Sujith Ravi came up with a graph-based machine learning technique that can connect possible responses across languages. Dana Movshovitz-Attias and Peter Young applied this technique to build a graph that connects responses to incoming messages and to other responses that have similar word embeddings and syntactic relationships. It also connects responses with similar meaning across languages based on the <a href="https://en.wikipedia.org/wiki/Machine_translation">machine translation</a> models developed by our <a href="http://research.google.com/pubs/MachineTranslation.html">Translate team</a>. <br /><br />With this graph, we use <a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning</a>, as described in this <a href="http://arxiv.org/abs/1512.01752">paper</a>, to learn the semantic meaning of responses and determine which are the most useful clusters of possible responses. As a result, we can allow the LSTM to score many possible variants of each possible response meaning, allowing the personalization routines to select the best response for the user in the context of the conversation. 
This also helps enforce diversity as we can now pick the final set of responses from different semantic clusters.<br /><br />Here’s an example of how the graph might look for a set of messages related to greetings:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-S9T56cHYgMg/VzyfTP6XCPI/AAAAAAAABAw/268jVv02G6cglepQ2dVVHEmjuhsd2xXDACKgB/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="230" src="https://4.bp.blogspot.com/-S9T56cHYgMg/VzyfTP6XCPI/AAAAAAAABAw/268jVv02G6cglepQ2dVVHEmjuhsd2xXDACKgB/s400/image02.png" width="400" /></a></div><b>Beyond Smart Reply</b><br /><br />I am also very excited about the Google assistant in Allo, which you can converse with to get information about anything that Google Search knows. It understands your sentences and helps you accomplish tasks directly from the conversation. For example, the Google assistant can help you discover a restaurant and reserve a table from within the Allo app when chatting with your friends. This has been made possible by the cutting-edge research in natural language understanding that we have been doing at Google. More details to follow soon!<br /><br />These smart features will be part of the Android and iOS apps for Allo that will be available later this summer. We can’t wait for you to try them!<br /><br />We wish to acknowledge the hard work of the following people in building Smart Reply:<br /><i>Pranav Khaitan, Sushant Prakash, Pavel Sountsov, Alon Shafrir, Max Gubin, Shu Zhang, Sunita Sarawagi, Ori Gershony, Sergey Nazarov, Hung Pham, Harini Krishnamurthy, Ryan Cassidy, Dave Citron, Patrick McGregor, Sujith Ravi, Dana Movshovitz-Attias, Peter Young, Vivek Ramavajjala</i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/chat-smarter-with-allo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source</title>
		<link>https://googledata.org/google-research/announcing-syntaxnet-the-worlds-most-accurate-parser-goes-open-source/</link>
		<comments>https://googledata.org/google-research/announcing-syntaxnet-the-worlds-most-accurate-parser-goes-open-source/#comments</comments>
		<pubDate>Thu, 12 May 2016 19:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=037cca4f4103252832fc0db1d8575394</guid>
		<description><![CDATA[<span>Posted by Slav Petrov, Senior Staff Research Scientist</span><br /><br />At Google, we spend a lot of time thinking about how <a href="http://googleresearch.blogspot.com/2013/12/free-language-lessons-for-computers.html">computer systems</a> can <a href="http://googleresearch.blogspot.com/2015/06/a-multilingual-corpus-of-automatically.html">read</a> and <a href="http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html">understand</a> <a href="http://googleresearch.blogspot.com/2014/08/teaching-machines-to-read-between-lines.html">human language</a> in order <a href="http://googleresearch.blogspot.com/2014/04/a-billion-words-because-todays-language.html">to process it</a> in <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">intelligent ways</a>. Today, we are excited to share the fruits of our research with the broader community by releasing <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a>, an open-source neural network framework implemented in <a href="https://www.tensorflow.org/">TensorFlow</a> that provides a foundation for <a href="https://en.wikipedia.org/wiki/Natural_language_understanding">Natural Language Understanding</a> (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as <i>Parsey McParseface</i>, an English parser that we have trained for you and that you can use to analyze English text.<br /><br />Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. 
Because Parsey McParseface is the <a href="http://arxiv.org/abs/1603.06042">most accurate such model in the world</a>, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU. <br /><br /><b>How does SyntaxNet work?</b><br /><br />SyntaxNet is a framework for what&#8217;s known in academic circles as a <a href="https://en.wikipedia.org/wiki/Parsing"><i>syntactic parser</i></a>, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for <i>Alice saw Bob</i>:<br /><br /><div></div><div><a href="https://3.bp.blogspot.com/-M-7PIST2hq8/VzTFhESMeuI/AAAAAAAAA_s/k4wOQe0UlnwmoVnZtuU6CNHw6xLQRN7egCLcB/s1600/asawb.png"><img border="0" height="262" src="https://3.bp.blogspot.com/-M-7PIST2hq8/VzTFhESMeuI/AAAAAAAAA_s/k4wOQe0UlnwmoVnZtuU6CNHw6xLQRN7egCLcB/s320/asawb.png" width="320"></a></div><br />This structure encodes that <i>Alice</i> and <i>Bob</i> are nouns and <i>saw</i> is a verb. The main verb <i>saw</i> is the root of the sentence and <i>Alice</i> is the subject (nsubj) of <i>saw</i>, while <i>Bob</i> is its direct object (dobj). 
As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:<br /><br /><div></div><div><a href="https://4.bp.blogspot.com/-1Ntx47T1WvU/VzTF2HgbqrI/AAAAAAAAA_w/UWofRQPhqU0ITD5HPQmEVCrwsEroCN8PQCLcB/s1600/long.png"><img border="0" height="202" src="https://4.bp.blogspot.com/-1Ntx47T1WvU/VzTF2HgbqrI/AAAAAAAAA_w/UWofRQPhqU0ITD5HPQmEVCrwsEroCN8PQCLcB/s640/long.png" width="640"></a></div><br />This structure again encodes the fact that <i>Alice</i> and <i>Bob</i> are the subject and object respectively of <i>saw</i>, as well as that <i>Alice</i> is modified by a relative clause with the verb <i>reading</i>, that <i>saw</i> is modified by the temporal modifier <i>yesterday</i>, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example <i>whom did Alice see?</i>, <i>who saw Bob?</i>, <i>what had Alice been reading about?</i> or <i>when did Alice see Bob?</i>. <br /><br /><b>Why is Parsing So Hard For Computers to Get Right?</b><br /><br />One of the main problems that make parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences, say 20 or 30 words, to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. 
As a very simple example, the sentence <i>Alice drove down the street in her car</i> has at least two possible dependency parses:<br /><br /><div><a href="https://2.bp.blogspot.com/-cXYL6RGkV_g/VzTDzbh6yEI/AAAAAAAAA_Q/1c-76sGQ124oE9njB2E6QzU6KcxDCn0KgCLcB/s1600/drovedown.png"><img border="0" height="136" src="https://2.bp.blogspot.com/-cXYL6RGkV_g/VzTDzbh6yEI/AAAAAAAAA_Q/1c-76sGQ124oE9njB2E6QzU6KcxDCn0KgCLcB/s640/drovedown.png" width="640"></a></div><br />The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition <i>in</i> can either modify <i>drove</i> or <i>street</i>; this example is an instance of what is called <i>prepositional phrase attachment ambiguity</i>. <br /><br />Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser. <br /><br />SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible&#8212;due to ambiguity&#8212;and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use <i><a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a></i> in the model. 
Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence <i>I booked a ticket to Google</i>.<br /><div><a href="https://2.bp.blogspot.com/-fqtmVS97tOs/VzTEAI9BQ8I/AAAAAAAAA_U/xPj0Av64sGseS0rF4Z1BbhmS77J-HuEvwCLcB/s1600/image04.gif"><img border="0" height="336" src="https://2.bp.blogspot.com/-fqtmVS97tOs/VzTEAI9BQ8I/AAAAAAAAA_U/xPj0Av64sGseS0rF4Z1BbhmS77J-HuEvwCLcB/s640/image04.gif" width="640"></a></div>Furthermore, as described in our <a href="http://arxiv.org/abs/1603.06042">paper</a>, it is critical to tightly <i>integrate learning and search</i> in order to achieve the highest prediction accuracy. Parsey McParseface and other <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a> models are some of the most complex networks that we have trained with the <a href="https://www.tensorflow.org/">TensorFlow</a> framework at Google. Given some data from the Google-supported <a href="http://universaldependencies.org/">Universal Dependencies</a> project, you can train a parsing model on your own machine.<br /><br /><b>So How Accurate is Parsey McParseface?</b><br /><br />On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old <a href="https://www.cis.upenn.edu/~treebank/">Penn Treebank</a>), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already <a href="http://arxiv.org/abs/1603.06042">better than any previous approach</a>. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. 
This suggests that we are approaching human performance&#8212;but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the <a href="http://googleresearch.blogspot.com/2011/03/building-resources-to-syntactically.html">Google WebTreebank</a> (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset. <br /><br />While the accuracy is not perfect, it&#8217;s certainly high enough to be useful in many applications. The major sources of error at this point are examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But we still have our work cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across <i>all</i> languages and contexts.<br /><br />To get started, see the <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a> code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins &#38; Slav Petrov.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Slav Petrov, Senior Staff Research Scientist</span><br /><br />At Google, we spend a lot of time thinking about how <a href="http://googleresearch.blogspot.com/2013/12/free-language-lessons-for-computers.html">computer systems</a> can <a href="http://googleresearch.blogspot.com/2015/06/a-multilingual-corpus-of-automatically.html">read</a> and <a href="http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html">understand</a> <a href="http://googleresearch.blogspot.com/2014/08/teaching-machines-to-read-between-lines.html">human language</a> in order <a href="http://googleresearch.blogspot.com/2014/04/a-billion-words-because-todays-language.html">to process it</a> in <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">intelligent ways</a>. Today, we are excited to share the fruits of our research with the broader community by releasing <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a>, an open-source neural network framework implemented in <a href="https://www.tensorflow.org/">TensorFlow</a> that provides a foundation for <a href="https://en.wikipedia.org/wiki/Natural_language_understanding">Natural Language Understanding</a> (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as <i>Parsey McParseface</i>, an English parser that we have trained for you and that you can use to analyze English text.<br /><br />Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. 
Because Parsey McParseface is the <a href="http://arxiv.org/abs/1603.06042">most accurate such model in the world</a>, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU. <br /><br /><b>How does SyntaxNet work?</b><br /><br />SyntaxNet is a framework for what’s known in academic circles as a <a href="https://en.wikipedia.org/wiki/Parsing"><i>syntactic parser</i></a>, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for <i>Alice saw Bob</i>:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-M-7PIST2hq8/VzTFhESMeuI/AAAAAAAAA_s/k4wOQe0UlnwmoVnZtuU6CNHw6xLQRN7egCLcB/s1600/asawb.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="262" src="https://3.bp.blogspot.com/-M-7PIST2hq8/VzTFhESMeuI/AAAAAAAAA_s/k4wOQe0UlnwmoVnZtuU6CNHw6xLQRN7egCLcB/s320/asawb.png" width="320" /></a></div><br />This structure encodes that <i>Alice</i> and <i>Bob</i> are nouns and <i>saw</i> is a verb. The main verb <i>saw</i> is the root of the sentence and <i>Alice</i> is the subject (nsubj) of <i>saw</i>, while <i>Bob</i> is its direct object (dobj). 
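To make this concrete, a tree like the one above can be written down as a small table of (word, POS tag, head index, dependency relation) rows and queried directly to recover who did what to whom. This compact tuple encoding is our own illustration, not SyntaxNet's actual output format.

```python
# One row per word: (word, part-of-speech, index of head word, relation).
# A head index of -1 marks the root. This encodes the tree for "Alice saw Bob".
PARSE = [
    ("Alice", "NOUN", 1, "nsubj"),   # subject of "saw"
    ("saw",   "VERB", -1, "root"),
    ("Bob",   "NOUN", 1, "dobj"),    # direct object of "saw"
]

def dependents(parse, relation):
    """Return the words attached to their head by the given relation."""
    return [word for word, pos, head, rel in parse if rel == relation]

subject = dependents(PARSE, "nsubj")        # answers "who saw Bob?"
direct_object = dependents(PARSE, "dobj")   # answers "whom did Alice see?"
```

Reading the nsubj and dobj rows back out is exactly the kind of question answering over grammatical relationships discussed below.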
As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-1Ntx47T1WvU/VzTF2HgbqrI/AAAAAAAAA_w/UWofRQPhqU0ITD5HPQmEVCrwsEroCN8PQCLcB/s1600/long.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="202" src="https://4.bp.blogspot.com/-1Ntx47T1WvU/VzTF2HgbqrI/AAAAAAAAA_w/UWofRQPhqU0ITD5HPQmEVCrwsEroCN8PQCLcB/s640/long.png" width="640" /></a></div><br />This structure again encodes the fact that <i>Alice</i> and <i>Bob</i> are the subject and object respectively of <i>saw</i>, as well as that <i>Alice</i> is modified by a relative clause with the verb <i>reading</i>, that <i>saw</i> is modified by the temporal modifier <i>yesterday</i>, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example <i>whom did Alice see?</i>, <i>who saw Bob?</i>, <i>what had Alice been reading about?</i> or <i>when did Alice see Bob?</i>. <br /><br /><b>Why is Parsing So Hard For Computers to Get Right?</b><br /><br />One of the main problems that make parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences, say 20 or 30 words, to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. 
As a very simple example, the sentence <i>Alice drove down the street in her car</i> has at least two possible dependency parses:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-cXYL6RGkV_g/VzTDzbh6yEI/AAAAAAAAA_Q/1c-76sGQ124oE9njB2E6QzU6KcxDCn0KgCLcB/s1600/drovedown.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="136" src="https://2.bp.blogspot.com/-cXYL6RGkV_g/VzTDzbh6yEI/AAAAAAAAA_Q/1c-76sGQ124oE9njB2E6QzU6KcxDCn0KgCLcB/s640/drovedown.png" width="640" /></a></div><br />The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition <i>in</i> can either modify <i>drove</i> or <i>street</i>; this example is an instance of what is called <i>prepositional phrase attachment ambiguity</i>. <br /><br />Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser. <br /><br />SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use <i><a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a></i> in the model. 
Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence <i>I booked a ticket to Google</i>.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-fqtmVS97tOs/VzTEAI9BQ8I/AAAAAAAAA_U/xPj0Av64sGseS0rF4Z1BbhmS77J-HuEvwCLcB/s1600/image04.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="https://2.bp.blogspot.com/-fqtmVS97tOs/VzTEAI9BQ8I/AAAAAAAAA_U/xPj0Av64sGseS0rF4Z1BbhmS77J-HuEvwCLcB/s640/image04.gif" width="640" /></a></div>Furthermore, as described in our <a href="http://arxiv.org/abs/1603.06042">paper</a>, it is critical to tightly <i>integrate learning and search</i> in order to achieve the highest prediction accuracy. Parsey McParseface and other <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a> models are some of the most complex networks that we have trained with the <a href="https://www.tensorflow.org/">TensorFlow</a> framework at Google. Given some data from the Google-supported <a href="http://universaldependencies.org/">Universal Dependencies</a> project, you can train a parsing model on your own machine.<br /><br /><b>So How Accurate is Parsey McParseface?</b><br /><br />On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old <a href="https://www.cis.upenn.edu/~treebank/">Penn Treebank</a>), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already <a href="http://arxiv.org/abs/1603.06042">better than any previous approach</a>. 
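To recap the left-to-right decision process described above, the toy shift-reduce sketch below replays one hand-picked action sequence for <i>Alice saw Bob</i>. The action sequence here is chosen by hand purely for illustration; in SyntaxNet a neural network scores the candidate actions at every step, and beam search keeps several competing hypotheses alive.

```python
def run_transitions(words, actions):
    """Apply SHIFT / LEFT:rel / RIGHT:rel actions and collect
    (head_index, relation, dependent_index) arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))      # move next word onto the stack
        elif action.startswith("LEFT:"):     # second-from-top depends on top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], action.split(":")[1], dependent))
        elif action.startswith("RIGHT:"):    # top depends on second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], action.split(":")[1], dependent))
    return arcs

words = ["Alice", "saw", "Bob"]
# One decision per step; a real parser would score alternatives at each
# step and keep multiple partial hypotheses in the beam.
arcs = run_transitions(words, ["SHIFT", "SHIFT", "LEFT:nsubj",
                               "SHIFT", "RIGHT:dobj"])
# arcs now records saw -> Alice (nsubj) and saw -> Bob (dobj)
```

Each ambiguous step, e.g. whether a prepositional phrase attaches LEFT or RIGHT, is exactly where the network's plausibility scores and the beam matter.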
While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance—but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the <a href="http://googleresearch.blogspot.com/2011/03/building-resources-to-syntactically.html">Google WebTreebank</a> (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset. <br /><br />While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major sources of error at this point are examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But we still have our work cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across <i>all</i> languages and contexts.<br /><br />To get started, see the <a href="https://github.com/tensorflow/models/tree/master/syntaxnet">SyntaxNet</a> code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins &amp; Slav Petrov.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-syntaxnet-the-worlds-most-accurate-parser-goes-open-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Research at Google and ICLR 2016</title>
		<link>https://googledata.org/google-research/research-at-google-and-iclr-2016/</link>
		<comments>https://googledata.org/google-research/research-at-google-and-iclr-2016/#comments</comments>
		<pubDate>Sun, 01 May 2016 18:49:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=608993a5737f006c216c7109379ff4e1</guid>
		<description><![CDATA[<span>Posted by Dumitru Erhan, Gentleman Scientist</span><br /><br />This week, San Juan, Puerto Rico hosts the <a href="http://www.iclr.cc/doku.php?id=iclr2016:main">4th International Conference on Learning Representations</a> (ICLR 2016), a conference focused on how one can learn meaningful and useful representations of data for <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a>. ICLR includes conference and workshop tracks, with invited talks along with oral and poster presentations of some of the latest research on deep learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization. <br /><br />At the forefront of innovation in cutting-edge technology in <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">Neural Networks</a> and <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep Learning</a>, Google focuses on both theory and application, developing learning approaches to understand and generalize. As Platinum Sponsor of ICLR 2016, Google will have a strong presence with over 40 researchers attending (many from the <a href="https://research.google.com/teams/brain/">Google Brain team</a> and <a href="https://deepmind.com/">Google DeepMind</a>), contributing to and learning from the broader academic research community by presenting papers and posters, in addition to participating on organizing committees and in workshops. <br /><br />If you are attending ICLR 2016, we hope you&#8217;ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. 
You can also learn more about our research being presented at ICLR 2016 in the list below (Googlers highlighted in <span>blue</span>).<br /><br /><b><u>Organizing Committee</u></b><br /><br /><b>Program Chairs</b><br /><i><span>Samy Bengio</span>, Brian Kingsbury</i><br /><br /><b>Area Chairs include:</b><br /><i><span>John Platt</span>, <span>Tara Sainath</span></i><br /><br /><b><u>Oral Sessions</u></b><br /><br /><a href="http://arxiv.org/abs/1511.06279">Neural Programmer-Interpreters</a>&#160;<i>(Best Paper Award Recipient)</i><br /><i>Scott Reed, <span>Nando de Freitas</span></i><br /><br /><a href="http://arxiv.org/abs/1511.05641">Net2Net: Accelerating Learning via Knowledge Transfer </a><br /><i>Tianqi Chen, <span>Ian Goodfellow</span>, <span>Jon Shlens</span></i><br /><br /><b><u>Conference Track Posters</u></b><br /><a href="http://arxiv.org/abs/1511.05952"><br /></a> <a href="http://arxiv.org/abs/1511.05952">Prioritized Experience Replay </a><br /><i><span>Tom Schaul</span></i><i>,</i><i><span>&#160;John Quan</span></i><i>,</i><i><span>&#160;Ioannis Antonoglou</span></i><i>,</i><i><span>&#160;David Silver</span></i><br /><a href="http://arxiv.org/abs/1509.06664"><br /></a> <a href="http://arxiv.org/abs/1509.06664">Reasoning about Entailment with Neural Attention </a><br /><i>Tim Rockt&#228;schel, <span>Edward Grefenstette</span></i><i>,&#160;</i><i><span>Karl Moritz Hermann</span></i><i>,</i><i><span>&#160;Tom&#225;&#353; Ko&#269;isk&#253;</span></i><i>,</i><i><span>&#160;Phil Blunsom</span></i><br /><br /><a href="http://arxiv.org/abs/1511.04834">Neural Programmer: Inducing Latent Programs With Gradient Descent </a><br /><i>Arvind Neelakantan, <span>Quoc Le</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><br /><a href="http://arxiv.org/abs/1511.05176"><br /></a> <a href="http://arxiv.org/abs/1511.05176">MuProp: Unbiased Backpropagation For Stochastic Neural Networks</a><br /><i>Shixiang Gu, <span>Sergey 
Levine</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><i>,</i><i><span>&#160;Andriy Mnih</span></i><br /><a href="http://arxiv.org/abs/1511.06114"><br /></a> <a href="http://arxiv.org/abs/1511.06114">Multi-Task Sequence to Sequence Learning </a><br /><i>Minh-Thang Luong, <span>Quoc Le</span></i><i>,&#160;</i><i><span>Ilya Sutskever</span></i><i>,</i><i><span>&#160;Oriol Vinyals</span></i><i>,</i><i><span>&#160;Lukasz Kaiser</span></i><br /><br /><a href="http://arxiv.org/abs/1511.04581">A Test of Relative Similarity for Model Selection in Generative Models </a><br /><i>Eugene Belilovsky, Wacha Bounliphone, Matthew Blaschko, <span>Ioannis Antonoglou</span>, Arthur Gretton</i><br /><br /><a href="http://arxiv.org/abs/1509.02971">Continuous control with deep reinforcement learning</a><br /><i><span>Timothy Lillicrap</span></i><i>,</i><i><span>&#160;Jonathan Hunt</span></i><i>,&#160;</i><i><span>Alexander Pritzel</span></i><i>,</i><i><span>&#160;Nicolas Heess</span></i><i>,</i><i><span>&#160;Tom Erez</span></i><i>,</i><i><span>&#160;Yuval Tassa</span></i><i>,</i><i><span>&#160;David Silver</span></i><i>,</i><i><span>&#160;Daan Wierstra</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06295">Policy Distillation</a><br /><i><span>Andrei Rusu</span></i><i>,</i><i><span>&#160;Sergio Gomez</span></i><i>,</i><i><span>&#160;</span>Caglar Gulcehre,<span> Guillaume Desjardins</span></i><i>,</i><i><span>&#160;James Kirkpatrick</span></i><i>,</i><i><span>&#160;Razvan Pascanu</span></i><i>,</i><i><span>&#160;Volodymyr Mnih</span></i><i>,</i><i><span>&#160;Koray Kavukcuoglu</span></i><i>,</i><i><span>&#160;Raia Hadsell</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06392">Neural Random-Access Machines</a><br /><i><span>Karol Kurach</span></i><i>,</i><i><span>&#160;Marcin Andrychowicz</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06085">Variable Rate Image Compression with Recurrent Neural 
Networks </a><br /><i><span>George Toderici</span></i><i>,</i><i><span>&#160;Sean O'Malley</span></i><i>,</i><i><span>&#160;Damien Vincent</span></i><i>,</i><i><span>&#160;Sung Jin Hwang</span></i><i>,</i><i><span>&#160;Michele Covell</span></i><i>,</i><i><span>&#160;Shumeet Baluja</span></i><i>,</i><i><span>&#160;Rahul Sukthankar</span></i><i>,</i><i><span>&#160;David Minnen</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06391">Order Matters: Sequence to Sequence for Sets</a><br /><i><span>Oriol Vinyals</span></i><i>,</i><i><span>&#160;Samy Bengio</span></i><i>,</i><i><span>&#160;Manjunath Kudlur</span></i><br /><a href="http://arxiv.org/abs/1507.01526"><br /></a> <a href="http://arxiv.org/abs/1507.01526">Grid Long Short-Term Memory</a><br /><i><span>Nal Kalchbrenner</span></i><i>,</i><i><span>&#160;Alex Graves</span></i><i>,</i><i><span>&#160;Ivo Danihelka</span></i><br /><a href="http://arxiv.org/abs/1511.08228"><br /></a> <a href="http://arxiv.org/abs/1511.08228">Neural GPUs Learn Algorithms</a><br /><i><span>Lukasz Kaiser</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><br /><br /><a href="http://arxiv.org/abs/1511.05946">ACDC: A Structured Efficient Linear Layer</a><br /><i>Marcin Moczulski, <span>Misha Denil</span>, Jeremy Appleyard, <span>Nando de Freitas</span></i><br /><br /><b><u>Workshop Track Posters</u></b><br /><a href="http://beta.openreview.net/forum?id=D1VDZ5kMAu5jEJ1zfEWL"><br /></a> <a href="http://beta.openreview.net/forum?id=D1VDZ5kMAu5jEJ1zfEWL">Revisiting Distributed Synchronous SGD </a><br /><i><span>Jianmin Chen</span></i><i>,</i><i><span>&#160;Rajat Monga</span></i><i>,</i><i><span>&#160;Samy Bengio</span></i><i>,</i><i><span>&#160;Rafal Jozefowicz</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=P7q1lVQQvSKvjNORtJZL">Black Box Variational Inference for State Space Models </a><br /><i>Evan Archer, Il Memming Park, <span>Lars Buesing</span>, John Cunningham, Liam Paninski</i><br /><a 
href="http://beta.openreview.net/forum?id=BNYYGWVA1F7PwR1riED4"><br /></a> <a href="http://beta.openreview.net/forum?id=BNYYGWVA1F7PwR1riED4">A Minimalistic Approach to Sum-Product Network Learning for Real Applications </a><br /><i>Viktoriya Krakovna, <span>Moshe Looks</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=E8Vg037q7f31v0m2iqn3">Efficient Inference in Occlusion-Aware Generative Models of Images </a><br /><i><span>Jonathan Huang</span>, <span>Kevin Murphy</span></i><br /><a href="http://beta.openreview.net/forum?id=q7kqBkL33f8LEkD3t7X9"><br /></a> <a href="http://beta.openreview.net/forum?id=q7kqBkL33f8LEkD3t7X9">Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning </a><br /><i><span>Christian Szegedy</span></i><i>,</i><i><span>&#160;Sergey Ioffe</span></i><i>,</i><i><span>&#160;Vincent Vanhoucke</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=E8VEozRYyi31v0m2iDwy">Deep Autoresolution Networks </a><br /><i><span>Gabriel Pereyra</span>, <span>Christian Szegedy</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=oVgo4M4RRIrlgPMRsBz5">Learning visual groups from co-occurrences in space and time </a><br /><i>Phillip Isola, Daniel Zoran, <span>Dilip Krishnan</span>, Edward H. Adelson</i><br /><br /><a href="http://beta.openreview.net/forum?id=ZY9xxQDMMu5Pk8ELfEz4">Adding Gradient Noise Improves Learning For Very Deep Networks </a><br /><i>Arvind Neelakantan, Luke Vilnis, <span>Quoc V. 
Le</span></i><i>,</i><i><span>&#160;Ilya Sutskever</span></i><i>,</i><i><span>&#160;Lukasz Kaiser</span></i><i>,</i><i><span>&#160;Karol Kurach</span>, James Martens</i><br /><br /><a href="http://beta.openreview.net/forum?id=2xwp4Zwr3TpKBZvXtWoj">Adversarial Autoencoders </a><br /><i>Alireza Makhzani, <span>Jonathon Shlens</span></i><i>,</i><i><span>&#160;Navdeep Jaitly</span></i><i>,</i><i><span>&#160;Ian Goodfellow</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=D1VVBv7BKS5jEJ1zfxJg">Generating Sentences from a Continuous Space </a><br /><i>Samuel R. Bowman, Luke Vilnis, <span>Oriol Vinyals</span></i><i>,</i><i><span>&#160;Andrew M. Dai</span></i><i>,</i><i><span>&#160;Rafal Jozefowicz</span></i><i>,</i><i><span>&#160;Samy Bengio</span></i>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Dumitru Erhan, Gentleman Scientist</span><br /><br />This week, San Juan, Puerto Rico hosts the <a href="http://www.iclr.cc/doku.php?id=iclr2016:main">4th International Conference on Learning Representations</a> (ICLR 2016), a conference focused on how one can learn meaningful and useful representations of data for <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a>. ICLR includes conference and workshop tracks, with invited talks along with oral and poster presentations of some of the latest research on deep learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization. <br /><br />At the forefront of innovation in cutting-edge technology in <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">Neural Networks</a> and <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep Learning</a>, Google focuses on both theory and application, developing learning approaches to understand and generalize. As Platinum Sponsor of ICLR 2016, Google will have a strong presence with over 40 researchers attending (many from the <a href="https://research.google.com/teams/brain/">Google Brain team</a> and <a href="https://deepmind.com/">Google DeepMind</a>), contributing to and learning from the broader academic research community by presenting papers and posters, in addition to participating on organizing committees and in workshops. <br /><br />If you are attending ICLR 2016, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. 
You can also learn more about our research being presented at ICLR 2016 in the list below (Googlers highlighted in <span style="color: #3d85c6;">blue</span>).<br /><br /><b><u>Organizing Committee</u></b><br /><br /><b>Program Chairs</b><br /><i><span style="color: #3d85c6;">Samy Bengio</span>, Brian Kingsbury</i><br /><br /><b>Area Chairs include:</b><br /><i><span style="color: #3d85c6;">John Platt</span>, <span style="color: #3d85c6;">Tara Sainath</span></i><br /><br /><b><u>Oral Sessions</u></b><br /><br /><a href="http://arxiv.org/abs/1511.06279">Neural Programmer-Interpreters</a>&nbsp;<i>(Best Paper Award Recipient)</i><br /><i>Scott Reed, <span style="color: #3d85c6;">Nando de Freitas</span></i><br /><br /><a href="http://arxiv.org/abs/1511.05641">Net2Net: Accelerating Learning via Knowledge Transfer </a><br /><i>Tianqi Chen, <span style="color: #3d85c6;">Ian Goodfellow</span>, <span style="color: #3d85c6;">Jon Shlens</span></i><br /><br /><b><u>Conference Track Posters</u></b><br /><a href="http://arxiv.org/abs/1511.05952"><br /></a> <a href="http://arxiv.org/abs/1511.05952">Prioritized Experience Replay </a><br /><i><span style="color: #3d85c6;">Tom Schaul</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;John Quan</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ioannis Antonoglou</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;David Silver</span></i><br /><a href="http://arxiv.org/abs/1509.06664"><br /></a> <a href="http://arxiv.org/abs/1509.06664">Reasoning about Entailment with Neural Attention </a><br /><i>Tim Rocktäschel, <span style="color: #3d85c6;">Edward Grefenstette</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Karl Moritz Hermann</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Tomáš Kočiský</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Phil Blunsom</span></i><br /><br /><a href="http://arxiv.org/abs/1511.04834">Neural Programmer: Inducing Latent Programs With Gradient Descent </a><br 
/><i>Arvind Neelakantan, <span style="color: #3d85c6;">Quoc Le</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><br /><a href="http://arxiv.org/abs/1511.05176"><br /></a> <a href="http://arxiv.org/abs/1511.05176">MuProp: Unbiased Backpropagation For Stochastic Neural Networks</a><br /><i>Shixiang Gu, <span style="background-color: white; color: #3d85c6;">Sergey Levine</span></i><i>,</i><i><span style="background-color: white; color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><i>,</i><i><span style="background-color: white; color: #3d85c6;">&nbsp;Andriy Mnih</span></i><br /><a href="http://arxiv.org/abs/1511.06114"><br /></a> <a href="http://arxiv.org/abs/1511.06114">Multi-Task Sequence to Sequence Learning </a><br /><i>Minh-Thang Luong, <span style="color: #3d85c6;">Quoc Le</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Ilya Sutskever</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Oriol Vinyals</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Lukasz Kaiser</span></i><br /><br /><a href="http://arxiv.org/abs/1511.04581">A Test of Relative Similarity for Model Selection in Generative Models </a><br /><i>Eugene Belilovsky, Wacha Bounliphone, Matthew Blaschko, <span style="color: #3d85c6;">Ioannis Antonoglou</span>, Arthur Gretton</i><br /><br /><a href="http://arxiv.org/abs/1509.02971">Continuous control with deep reinforcement learning</a><br /><i><span style="color: #3d85c6;">Timothy Lillicrap</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Jonathan Hunt</span></i><i>,&nbsp;</i><i><span style="color: #3d85c6;">Alexander Pritzel</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Nicolas Heess</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Tom Erez</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Yuval Tassa</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;David Silver</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Daan Wierstra</span></i><br /><br /><a 
href="http://arxiv.org/abs/1511.06295">Policy Distillation</a><br /><i><span style="color: #3d85c6;">Andrei Rusu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergio Gomez</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;</span>Caglar Gulcehre,<span style="color: #3d85c6;"> Guillaume Desjardins</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;James Kirkpatrick</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Razvan Pascanu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Volodymyr Mnih</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Koray Kavukcuoglu</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Raia Hadsell</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06392">Neural Random-Access Machines</a><br /><i><span style="color: #3d85c6;">Karol Kurach</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Marcin Andrychowicz</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06085">Variable Rate Image Compression with Recurrent Neural Networks </a><br /><i><span style="color: #3d85c6;">George Toderici</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sean O'Malley</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Damien Vincent</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sung Jin Hwang</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Michele Covell</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Shumeet Baluja</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Rahul Sukthankar</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;David Minnen</span></i><br /><br /><a href="http://arxiv.org/abs/1511.06391">Order Matters: Sequence to Sequence for Sets</a><br /><i><span style="color: #3d85c6;">Oriol Vinyals</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Samy Bengio</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Manjunath Kudlur</span></i><br /><a 
href="http://arxiv.org/abs/1507.01526"><br /></a> <a href="http://arxiv.org/abs/1507.01526">Grid Long Short-Term Memory</a><br /><i><span style="color: #3d85c6;">Nal Kalchbrenner</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Alex Graves</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ivo Danihelka</span></i><br /><a href="http://arxiv.org/abs/1511.08228"><br /></a> <a href="http://arxiv.org/abs/1511.08228">Neural GPUs Learn Algorithms</a><br /><i><span style="color: #3d85c6;">Lukasz Kaiser</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><br /><br /><a href="http://arxiv.org/abs/1511.05946">ACDC: A Structured Efficient Linear Layer</a><br /><i>Marcin Moczulski, <span style="color: #3d85c6;">Misha Denil</span>, Jeremy Appleyard, <span style="color: #3d85c6;">Nando de Freitas</span></i><br /><br /><b><u>Workshop Track Posters</u></b><br /><a href="http://beta.openreview.net/forum?id=D1VDZ5kMAu5jEJ1zfEWL"><br /></a> <a href="http://beta.openreview.net/forum?id=D1VDZ5kMAu5jEJ1zfEWL">Revisiting Distributed Synchronous SGD </a><br /><i><span style="color: #3d85c6;">Jianmin Chen</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Rajat Monga</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Samy Bengio</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Rafal Jozefowicz</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=P7q1lVQQvSKvjNORtJZL">Black Box Variational Inference for State Space Models </a><br /><i>Evan Archer, Il Memming Park, <span style="color: #3d85c6;">Lars Buesing</span>, John Cunningham, Liam Paninski</i><br /><a href="http://beta.openreview.net/forum?id=BNYYGWVA1F7PwR1riED4"><br /></a> <a href="http://beta.openreview.net/forum?id=BNYYGWVA1F7PwR1riED4">A Minimalistic Approach to Sum-Product Network Learning for Real Applications </a><br /><i>Viktoriya Krakovna, <span style="color: #3d85c6;">Moshe Looks</span></i><br /><br /><a 
href="http://beta.openreview.net/forum?id=E8Vg037q7f31v0m2iqn3">Efficient Inference in Occlusion-Aware Generative Models of Images </a><br /><i><span style="color: #3d85c6;">Jonathan Huang</span>, <span style="color: #3d85c6;">Kevin Murphy</span></i><br /><a href="http://beta.openreview.net/forum?id=q7kqBkL33f8LEkD3t7X9"><br /></a> <a href="http://beta.openreview.net/forum?id=q7kqBkL33f8LEkD3t7X9">Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning </a><br /><i><span style="color: #3d85c6;">Christian Szegedy</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Sergey Ioffe</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Vincent Vanhoucke</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=E8VEozRYyi31v0m2iDwy">Deep Autoresolution Networks </a><br /><i><span style="color: #3d85c6;">Gabriel Pereyra</span>, <span style="color: #3d85c6;">Christian Szegedy</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=oVgo4M4RRIrlgPMRsBz5">Learning visual groups from co-occurrences in space and time </a><br /><i>Phillip Isola, Daniel Zoran, <span style="color: #3d85c6;">Dilip Krishnan</span>, Edward H. Adelson</i><br /><br /><a href="http://beta.openreview.net/forum?id=ZY9xxQDMMu5Pk8ELfEz4">Adding Gradient Noise Improves Learning For Very Deep Networks </a><br /><i>Arvind Neelakantan, Luke Vilnis, <span style="color: #3d85c6;">Quoc V. 
Le</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ilya Sutskever</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Lukasz Kaiser</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Karol Kurach</span>, James Martens</i><br /><br /><a href="http://beta.openreview.net/forum?id=2xwp4Zwr3TpKBZvXtWoj">Adversarial Autoencoders </a><br /><i>Alireza Makhzani, <span style="color: #3d85c6;">Jonathon Shlens</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Navdeep Jaitly</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Ian Goodfellow</span></i><br /><br /><a href="http://beta.openreview.net/forum?id=D1VVBv7BKS5jEJ1zfxJg">Generating Sentences from a Continuous Space </a><br /><i>Samuel R. Bowman, Luke Vilnis, <span style="color: #3d85c6;">Oriol Vinyals</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Andrew M. Dai</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Rafal Jozefowicz</span></i><i>,</i><i><span style="color: #3d85c6;">&nbsp;Samy Bengio</span></i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/research-at-google-and-iclr-2016/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DeepMind moves to TensorFlow</title>
		<link>https://googledata.org/google-research/deepmind-moves-to-tensorflow/</link>
		<comments>https://googledata.org/google-research/deepmind-moves-to-tensorflow/#comments</comments>
		<pubDate>Fri, 29 Apr 2016 16:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=4c5a3433f7e9cc00a08cf8a0ef7dff09</guid>
		<description><![CDATA[<span>Posted by Koray Kavukcuoglu, Research Scientist, Google DeepMind</span><br /><br />At <a href="http://deepmind.com/">DeepMind</a>, we conduct state-of-the-art <a href="https://deepmind.com/publications.html">research</a> on a wide range of algorithms, from deep learning and reinforcement learning to systems neuroscience, towards the goal of building <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial General Intelligence</a>. A key factor in facilitating rapid progress is the software environment used for research. For nearly four years, the open source <a href="http://torch.ch/">Torch7</a> machine learning library has served as our primary research platform, combining excellent flexibility with very fast runtime execution, enabling rapid prototyping. Our team has been proud to contribute to the open source project in capacities ranging from occasional bug fixes to being core maintainers of several crucial components.<br /><br />With Google&#8217;s recent open source release of <a href="http://tensorflow.org/">TensorFlow</a>, we initiated a project to test its suitability for our research environment. Over the last six months, we have re-implemented more than a dozen different projects in TensorFlow to develop a deeper understanding of its potential use cases and the tradeoffs for research. Today we are excited to announce that DeepMind will start using TensorFlow for all our future research. We believe that TensorFlow will enable us to execute our ambitious research goals at much larger scale and an even faster pace, providing us with a unique opportunity to further accelerate our research programme.<br /><br />As one of the core contributors of Torch7, I have had the pleasure of working closely with an excellent community of developers and researchers, and it has been amazing to see all the great work that has been built on top of the platform and the impact this has had on the field. 
Torch7 is currently being used by Facebook, Twitter, and many start-ups and academic labs as well as DeepMind, and I&#8217;m proud of the significant contribution it has made to a large community in both research and industry. Our transition to TensorFlow represents a new chapter, and I feel very excited about the prospect of DeepMind contributing heavily to another great open source machine learning platform that everyone can use to advance the state-of-the-art.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Koray Kavukcuoglu, Research Scientist, Google DeepMind</span><br /><br />At <a href="http://deepmind.com/">DeepMind</a>, we conduct state-of-the-art <a href="https://deepmind.com/publications.html">research</a> on a wide range of algorithms, from deep learning and reinforcement learning to systems neuroscience, towards the goal of building <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial General Intelligence</a>. A key factor in facilitating rapid progress is the software environment used for research. For nearly four years, the open source <a href="http://torch.ch/">Torch7</a> machine learning library has served as our primary research platform, combining excellent flexibility with very fast runtime execution, enabling rapid prototyping. Our team has been proud to contribute to the open source project in capacities ranging from occasional bug fixes to being core maintainers of several crucial components.<br /><br />With Google’s recent open source release of <a href="http://tensorflow.org/">TensorFlow</a>, we initiated a project to test its suitability for our research environment. Over the last six months, we have re-implemented more than a dozen different projects in TensorFlow to develop a deeper understanding of its potential use cases and the tradeoffs for research. Today we are excited to announce that DeepMind will start using TensorFlow for all our future research. 
We believe that TensorFlow will enable us to execute our ambitious research goals at much larger scale and an even faster pace, providing us with a unique opportunity to further accelerate our research programme.<br /><br />As one of the core contributors of Torch7, I have had the pleasure of working closely with an excellent community of developers and researchers, and it has been amazing to see all the great work that has been built on top of the platform and the impact this has had on the field. Torch7 is currently being used by Facebook, Twitter, and many start-ups and academic labs as well as DeepMind, and I’m proud of the significant contribution it has made to a large community in both research and industry. Our transition to TensorFlow represents a new chapter, and I feel very excited about the prospect of DeepMind contributing heavily to another great open source machine learning platform that everyone can use to advance the state-of-the-art.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/deepmind-moves-to-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computer Science Education for All Students</title>
		<link>https://googledata.org/google-research/computer-science-education-for-all-students/</link>
		<comments>https://googledata.org/google-research/computer-science-education-for-all-students/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 13:15:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=8d612ec342e421069472750fe00313b0</guid>
		<description><![CDATA[<span>Posted by Maggie Johnson, Director of Education and University Relations</span><br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/04/computer-science-education-for-all.html">Google for Education Blog</a>)</i><br /><br />Computer science education is a pathway to innovation, to creativity, and to exciting career prospects. No longer considered an optional skill, CS is quickly becoming a &#8220;new basic&#8221;, foundational for learning. In order for our students to be equipped for the world of tomorrow, we need to provide them with access to computer science education today. <br /><br />At Google, we believe that all students deserve these opportunities. Today we <a href="http://www.csecoalition.org/">join</a> some of America&#8217;s leading companies, governors, and educators to support an <a href="https://www.washingtonpost.com/local/education/top-business-leaders-27-governors-urge-congress-to-boost-computer-science-education/2016/04/25/f161cbde-0ae7-11e6-bfa1-4efa856caf2a_story.html">open letter to Congress</a>, asking for funding to provide every student in every school the opportunity to learn computer science. Google has long been committed to <a href="https://www.google.com/edu/cs/">developing programs, resources, tools and community partnerships</a> that make computer science engaging and accessible for all students. <br /><br />We are strengthening that commitment today by announcing an additional investment of $10 million towards computer science education for 2017, along with the <a href="http://googleforeducation.blogspot.com/2016/01/CS4All.html">$23.5 million</a> that we have allocated for 2016. This funding will allow us to build more resources, scale our programs, and provide additional support to our partners, with a goal of reaching an additional 5 million students.<br /><br />With Congress&#8217; help, we can ensure that every child has access to computer science education. 
Please join us by signing our online petition at <a href="http://www.change.org/computerscience">www.change.org/computerscience</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Maggie Johnson, Director of Education and University Relations</span><br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/04/computer-science-education-for-all.html">Google for Education Blog</a>)</i><br /><br />Computer science education is a pathway to innovation, to creativity, and to exciting career prospects. No longer considered an optional skill, CS is quickly becoming a “new basic”, foundational for learning. In order for our students to be equipped for the world of tomorrow, we need to provide them with access to computer science education today. <br /><br />At Google, we believe that all students deserve these opportunities. Today we <a href="http://www.csecoalition.org/">join</a> some of America’s leading companies, governors, and educators to support an <a href="https://www.washingtonpost.com/local/education/top-business-leaders-27-governors-urge-congress-to-boost-computer-science-education/2016/04/25/f161cbde-0ae7-11e6-bfa1-4efa856caf2a_story.html">open letter to Congress</a>, asking for funding to provide every student in every school the opportunity to learn computer science. Google has long been committed to <a href="https://www.google.com/edu/cs/">developing programs, resources, tools and community partnerships</a> that make computer science engaging and accessible for all students. <br /><br />We are strengthening that commitment today by announcing an additional investment of $10 million towards computer science education for 2017, along with the <a href="http://googleforeducation.blogspot.com/2016/01/CS4All.html">$23.5 million</a> that we have allocated for 2016. This funding will allow us to build more resources, scale our programs, and provide additional support to our partners, with a goal of reaching an additional 5 million students.<br /><br />With Congress’ help, we can ensure that every child has access to computer science education. 
Please join us by signing our online petition at <a href="http://www.change.org/computerscience">www.change.org/computerscience</a>. ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/computer-science-education-for-all-students/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Helping webmasters re-secure their sites</title>
		<link>https://googledata.org/google-research/helping-webmasters-re-secure-their-sites/</link>
		<comments>https://googledata.org/google-research/helping-webmasters-re-secure-their-sites/#comments</comments>
		<pubDate>Mon, 18 Apr 2016 16:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=308d8b3be4b5ebe17ffddc3167673b0f</guid>
		<description><![CDATA[<span>Posted by Kurt Thomas and Yuan Niu, Spam &#38; Abuse Research</span><br /><br />Every week, over <a href="https://www.google.com/transparencyreport/safebrowsing/">10 million users encounter harmful websites</a> that deliver malware and scams. Many of these sites are compromised personal blogs or small business pages that have fallen victim due to a weak password or outdated software. Safe Browsing and Google Search protect visitors from dangerous content by displaying browser warnings and labeling search results with <a href="https://support.google.com/websearch/answer/45449?hl=en">&#8216;this site may harm your computer&#8217;</a>. While this helps keep users safe in the moment, the compromised site remains a problem that needs to be fixed.<br /><br />Unfortunately, many webmasters for compromised sites are unaware anything is amiss. Worse yet, even when they learn of an incident, they may lack the security expertise to take action and address the root cause of compromise. Quoting one webmaster from a survey we conducted, &#8220;our daily and weekly backups were both infected&#8221; and even after seeking the help of a specialist, after &#8220;lots of wasted hours/days&#8221; the webmaster abandoned all attempts to restore the site and instead refocused his efforts on &#8220;rebuilding the site from scratch&#8221;.<br /><br />In order to find the best way to help webmasters clean-up from compromise, we recently teamed up with the University of California, Berkeley to explore how to quickly contact webmasters and expedite recovery while minimizing the distress involved. We&#8217;ve summarized our key lessons below. 
The full study, which you can read <a href="http://research.google.com/pubs/pub44924.html">here</a>, was recently presented at the <a href="http://www2016.ca/">International World Wide Web Conference</a>.<br /><br />When Google works directly with webmasters during critical moments like security breaches, we can help 75% of webmasters re-secure their content. The whole process takes a median of 3 days. This is a better experience for webmasters and their audience.<br /><br /><b>How many sites get compromised?</b><br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-_IvVxz441jY/VxT-AjqgQSI/AAAAAAAAA-4/H0RFXx70EEwNmEvL85YibMKKYG3jYHW8wCLcB/s1600/image00.png"><img border="0" height="312" src="https://1.bp.blogspot.com/-_IvVxz441jY/VxT-AjqgQSI/AAAAAAAAA-4/H0RFXx70EEwNmEvL85YibMKKYG3jYHW8wCLcB/s640/image00.png" width="640"></a></td></tr><tr><td>Number of freshly compromised sites Google detects every week.</td></tr></tbody></table>Over the last year Google detected nearly 800,000 compromised websites&#8212;roughly 16,500 new sites every week from around the globe. Visitors to these sites are exposed to low-quality scam content and malware via <a href="https://security.googleblog.com/2008/02/all-your-iframe-are-point-to-us.html">drive-by downloads</a>. While browser and search warnings help protect visitors from harm, these warnings can at times feel punitive to webmasters who learn only after-the-fact that their site was compromised. To balance the safety of our users with the experience of webmasters, we set out to find the best approach to help webmasters recover from security breaches and ultimately reconnect websites with their audience.<br /><b><br /></b> <b>Finding the most effective ways to aid webmasters</b><br /><ol><li><b>Getting in touch with webmasters:</b> One of the hardest steps on the road to recovery is first getting in contact with webmasters. 
We tried three notification channels: email, browser warnings, and search warnings. For webmasters who proactively registered their site with <a href="https://www.google.com/webmaster">Search Console</a>, we found that email communication led to 75% of webmasters re-securing their pages. When we didn&#8217;t know a webmaster&#8217;s email address, browser warnings and search warnings helped 54% and 43% of sites clean up respectively.</li><li><b>Providing tips on cleaning up harmful content:</b> Attackers rely on hidden files, easy-to-miss redirects, and remote inclusions to serve scams and malware. This makes clean-up increasingly tricky. When we emailed webmasters, we included tips and samples of exactly which pages contained harmful content. This, combined with expedited notification, helped webmasters clean up 62% faster compared to no tips&#8212;usually within 3 days.</li><li><b>Making sure sites stay clean:</b> Once a site is no longer serving harmful content, it&#8217;s important to make sure attackers don&#8217;t reassert control. We monitored recently cleaned websites and found 12% were compromised again in 30 days. This illustrates the challenge involved in identifying the root cause of a breach versus dealing with the side-effects.</li></ol><b>Making security issues less painful for webmasters&#8212;and everyone</b><br /><br />We hope that webmasters never have to deal with a security incident. If you are a webmaster, there are some quick steps you can take to reduce your risk. We&#8217;ve made it <a href="https://security.googleblog.com/2015/02/safe-browsing-and-google-analytics.html">easier to receive security notifications through Google Analytics</a> as well as through <a href="https://www.google.com/webmaster">Search Console</a>. Make sure to register for both services. 
Also, we have laid out helpful tips for <a href="https://webmasters.googleblog.com/2015/07/nohacked-how-to-avoid-being-target-of.html">updating your site&#8217;s software</a> and <a href="https://webmasters.googleblog.com/2015/08/nohacked-using-two-factor.html">adding additional authentication</a> that will make your site safer.<br /><br />If you&#8217;re a hosting provider or building a service that needs to notify victims of compromise, understand that the entire process is distressing for users. Establish a reliable communication channel before a security incident occurs, make sure to provide victims with clear recovery steps, and promptly reply to inquiries so the process feels helpful, not punitive.<br /><br />As we work to make the web a safer place, we think it&#8217;s critical to empower webmasters and users to make good security decisions. It&#8217;s easy for the security community to be pessimistic about incident response being &#8216;too complex&#8217; for victims, but as our findings demonstrate, even just starting a dialogue can significantly expedite recovery.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Kurt Thomas and Yuan Niu, Spam &amp; Abuse Research</span><br /><br />Every week, over <a href="https://www.google.com/transparencyreport/safebrowsing/">10 million users encounter harmful websites</a> that deliver malware and scams. Many of these sites are compromised personal blogs or small business pages that have fallen victim due to a weak password or outdated software. Safe Browsing and Google Search protect visitors from dangerous content by displaying browser warnings and labeling search results with <a href="https://support.google.com/websearch/answer/45449?hl=en">‘this site may harm your computer’</a>. While this helps keep users safe in the moment, the compromised site remains a problem that needs to be fixed.<br /><br />Unfortunately, many webmasters for compromised sites are unaware anything is amiss. Worse yet, even when they learn of an incident, they may lack the security expertise to take action and address the root cause of compromise. Quoting one webmaster from a survey we conducted, “our daily and weekly backups were both infected” and even after seeking the help of a specialist, after “lots of wasted hours/days” the webmaster abandoned all attempts to restore the site and instead refocused his efforts on “rebuilding the site from scratch”.<br /><br />In order to find the best way to help webmasters clean-up from compromise, we recently teamed up with the University of California, Berkeley to explore how to quickly contact webmasters and expedite recovery while minimizing the distress involved. We’ve summarized our key lessons below. 
The full study, which you can read <a href="http://research.google.com/pubs/pub44924.html">here</a>, was recently presented at the <a href="http://www2016.ca/">International World Wide Web Conference</a>.<br /><br />When Google works directly with webmasters during critical moments like security breaches, we can help 75% of webmasters re-secure their content. The whole process takes a median of 3 days. This is a better experience for webmasters and their audience.<br /><br /><b>How many sites get compromised?</b><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-_IvVxz441jY/VxT-AjqgQSI/AAAAAAAAA-4/H0RFXx70EEwNmEvL85YibMKKYG3jYHW8wCLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="312" src="https://1.bp.blogspot.com/-_IvVxz441jY/VxT-AjqgQSI/AAAAAAAAA-4/H0RFXx70EEwNmEvL85YibMKKYG3jYHW8wCLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Number of freshly compromised sites Google detects every week.</td></tr></tbody></table>Over the last year Google detected nearly 800,000 compromised websites—roughly 16,500 new sites every week from around the globe. Visitors to these sites are exposed to low-quality scam content and malware via <a href="https://security.googleblog.com/2008/02/all-your-iframe-are-point-to-us.html">drive-by downloads</a>. While browser and search warnings help protect visitors from harm, these warnings can at times feel punitive to webmasters who learn only after-the-fact that their site was compromised. 
To balance the safety of our users with the experience of webmasters, we set out to find the best approach to help webmasters recover from security breaches and ultimately reconnect websites with their audience.<br /><b><br /></b> <b>Finding the most effective ways to aid webmasters</b><br /><ol><li><b>Getting in touch with webmasters:</b> One of the hardest steps on the road to recovery is first getting in contact with webmasters. We tried three notification channels: email, browser warnings, and search warnings. For webmasters who proactively registered their site with <a href="https://www.google.com/webmaster">Search Console</a>, we found that email communication led to 75% of webmasters re-securing their pages. When we didn’t know a webmaster’s email address, browser warnings and search warnings helped 54% and 43% of sites clean up respectively.</li><li><b>Providing tips on cleaning up harmful content:</b> Attackers rely on hidden files, easy-to-miss redirects, and remote inclusions to serve scams and malware. This makes clean-up increasingly tricky. When we emailed webmasters, we included tips and samples of exactly which pages contained harmful content. This, combined with expedited notification, helped webmasters clean up 62% faster compared to no tips—usually within 3 days.</li><li><b>Making sure sites stay clean:</b> Once a site is no longer serving harmful content, it’s important to make sure attackers don’t reassert control. We monitored recently cleaned websites and found 12% were compromised again in 30 days. This illustrates the challenge involved in identifying the root cause of a breach versus dealing with the side-effects.</li></ol><b>Making security issues less painful for webmasters—and everyone</b><br /><br />We hope that webmasters never have to deal with a security incident. If you are a webmaster, there are some quick steps you can take to reduce your risk. 
We’ve made it <a href="https://security.googleblog.com/2015/02/safe-browsing-and-google-analytics.html">easier to receive security notifications through Google Analytics</a> as well as through <a href="https://www.google.com/webmaster">Search Console</a>. Make sure to register for both services. Also, we have laid out helpful tips for <a href="https://webmasters.googleblog.com/2015/07/nohacked-how-to-avoid-being-target-of.html">updating your site’s software</a> and <a href="https://webmasters.googleblog.com/2015/08/nohacked-using-two-factor.html">adding additional authentication</a> that will make your site safer.<br /><br />If you’re a hosting provider or building a service that needs to notify victims of compromise, understand that the entire process is distressing for users. Establish a reliable communication channel before a security incident occurs, make sure to provide victims with clear recovery steps, and promptly reply to inquiries so the process feels helpful, not punitive.<br /><br />As we work to make the web a safer place, we think it’s critical to empower webmasters and users to make good security decisions. It’s easy for the security community to be pessimistic about incident response being ‘too complex’ for victims, but as our findings demonstrate, even just starting a dialogue can significantly expedite recovery.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/helping-webmasters-re-secure-their-sites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Announcing TensorFlow 0.8 – now with distributed computing support!</title>
		<link>https://googledata.org/google-research/announcing-tensorflow-0-8-now-with-distributed-computing-support/</link>
		<comments>https://googledata.org/google-research/announcing-tensorflow-0-8-now-with-distributed-computing-support/#comments</comments>
		<pubDate>Wed, 13 Apr 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=deace4aecc44b8a561fbd49e261e5d20</guid>
		<description><![CDATA[<span>Posted by Derek Murray, Software Engineer</span><br /><br />Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible. One way to do this is to run <a href="https://www.tensorflow.org/">TensorFlow</a> across hundreds of machines, which shortens the training process for some models from weeks to hours, and allows us to experiment with models of increasing size and sophistication. Ever since we released TensorFlow as an open-source project, distributed training support has been one of the most requested features. Now the wait is over. <br /><br />Today, we're excited to release TensorFlow 0.8 with distributed computing support, including everything you need to train distributed models on your own infrastructure. Distributed TensorFlow is powered by the high-performance <a href="http://www.grpc.io/">gRPC</a> library, which supports training on hundreds of machines in parallel. It complements our recent announcement of <a href="http://googleresearch.blogspot.com/2016/03/machine-learning-in-cloud-with.html">Google Cloud Machine Learning</a>, which enables you to train and serve your TensorFlow models using the power of the Google Cloud Platform.<br /><br />To coincide with the TensorFlow 0.8 release, we have published a <a href="https://github.com/tensorflow/models/tree/master/inception">distributed trainer</a> for the <a href="http://googleresearch.blogspot.com/2016/03/train-your-own-image-classifier-with.html">Inception image classification</a> neural network in the TensorFlow models repository. Using the distributed trainer, we trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs. 
Even small clusters&#8212;or a couple of machines under your desk&#8212;can benefit from distributed TensorFlow, since adding more GPUs improves the overall throughput, and produces accurate results sooner.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-3e8ukpeMD5g/Vw5r8lIuUQI/AAAAAAAAA-g/UEHdwqD1YKwVrUKDzKpy356yxo_0xJW6ACLcB/s1600/image00.png"><img border="0" height="376" src="https://1.bp.blogspot.com/-3e8ukpeMD5g/Vw5r8lIuUQI/AAAAAAAAA-g/UEHdwqD1YKwVrUKDzKpy356yxo_0xJW6ACLcB/s640/image00.png" width="640"></a></td></tr><tr><td>TensorFlow can speed up Inception training by a factor of 56, using 100 GPUs.</td></tr></tbody></table>The distributed trainer also enables you to scale out training using a cluster management system like <a href="http://kubernetes.io/">Kubernetes</a>. Furthermore, once you have trained your model, you can deploy to production and <a href="http://blog.kubernetes.io/2016/03/scaling-neural-network-image-classification-using-Kubernetes-with-TensorFlow-Serving.html">speed up inference using TensorFlow Serving on Kubernetes</a>.<br /><br />Beyond distributed Inception, the 0.8 release includes <a href="https://www.tensorflow.org/api_docs/python/train.html#distributed-execution">new libraries</a> for defining your own distributed models. TensorFlow's distributed architecture permits a great deal of flexibility in defining your model, because every process in the cluster can perform general-purpose computation. Our previous system <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">DistBelief </a>(like many systems that have followed it) used special "parameter servers" to manage the shared model parameters, where the parameter servers had a simple read/write interface for fetching and updating shared parameters. 
In TensorFlow, all computation&#8212;including parameter management&#8212;is represented in the dataflow graph, and the system maps the graph onto heterogeneous devices (like multi-core CPUs, general-purpose GPUs, and mobile processors) in the available processes. To make TensorFlow easier to use, we have included Python libraries that make it easy to write a model that runs on a single process and scales to use multiple replicas for training. <br /><br />This architecture makes it easier to scale a single-process job up to use a cluster, and also to experiment with novel architectures for distributed training. As an example, my colleagues have recently shown that <a href="http://arxiv.org/abs/1604.00981">synchronous SGD with backup workers</a>, implemented in the TensorFlow graph, achieves improved time-to-accuracy for image model training.<br /><br />The current version of distributed computing support in TensorFlow is just the start. We are continuing to research ways of improving the performance of distributed training&#8212;both through engineering and algorithmic improvements&#8212;and will share these improvements with the community <a href="https://github.com/tensorflow/tensorflow">on GitHub</a>. However, getting to this point would not have been possible without help from the following people:<br /><ul><li><b>TensorFlow training libraries</b> - Jianmin Chen, Matthieu Devin, Sherry Moore and Sergio Guadarrama</li><li><b>TensorFlow core</b> - Zhifeng Chen, Manjunath Kudlur and Vijay Vasudevan</li><li><b>Testing</b> - Shanqing Cai</li><li><b>Inception model architecture </b>- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jonathon Shlens and Zbigniew Wojna</li><li><b>Project management</b> - Amy McDonald Sandjideh</li><li><b>Engineering leadership</b> - Jeff Dean and Rajat Monga</li></ul>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Derek Murray, Software Engineer</span><br /><br />Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible. One way to do this is to run <a href="https://www.tensorflow.org/">TensorFlow</a> across hundreds of machines, which shortens the training process for some models from weeks to hours, and allows us to experiment with models of increasing size and sophistication. Ever since we released TensorFlow as an open-source project, distributed training support has been one of the most requested features. Now the wait is over. <br /><br />Today, we're excited to release TensorFlow 0.8 with distributed computing support, including everything you need to train distributed models on your own infrastructure. Distributed TensorFlow is powered by the high-performance <a href="http://www.grpc.io/">gRPC</a> library, which supports training on hundreds of machines in parallel. It complements our recent announcement of <a href="http://googleresearch.blogspot.com/2016/03/machine-learning-in-cloud-with.html">Google Cloud Machine Learning</a>, which enables you to train and serve your TensorFlow models using the power of the Google Cloud Platform.<br /><br />To coincide with the TensorFlow 0.8 release, we have published a <a href="https://github.com/tensorflow/models/tree/master/inception">distributed trainer</a> for the <a href="http://googleresearch.blogspot.com/2016/03/train-your-own-image-classifier-with.html">Inception image classification</a> neural network in the TensorFlow models repository. Using the distributed trainer, we trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs. 
Even small clusters—or a couple of machines under your desk—can benefit from distributed TensorFlow, since adding more GPUs improves the overall throughput, and produces accurate results sooner.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-3e8ukpeMD5g/Vw5r8lIuUQI/AAAAAAAAA-g/UEHdwqD1YKwVrUKDzKpy356yxo_0xJW6ACLcB/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="376" src="https://1.bp.blogspot.com/-3e8ukpeMD5g/Vw5r8lIuUQI/AAAAAAAAA-g/UEHdwqD1YKwVrUKDzKpy356yxo_0xJW6ACLcB/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">TensorFlow can speed up Inception training by a factor of 56, using 100 GPUs.</td></tr></tbody></table>The distributed trainer also enables you to scale out training using a cluster management system like <a href="http://kubernetes.io/">Kubernetes</a>. Furthermore, once you have trained your model, you can deploy to production and <a href="http://blog.kubernetes.io/2016/03/scaling-neural-network-image-classification-using-Kubernetes-with-TensorFlow-Serving.html">speed up inference using TensorFlow Serving on Kubernetes</a>.<br /><br />Beyond distributed Inception, the 0.8 release includes <a href="https://www.tensorflow.org/api_docs/python/train.html#distributed-execution">new libraries</a> for defining your own distributed models. TensorFlow's distributed architecture permits a great deal of flexibility in defining your model, because every process in the cluster can perform general-purpose computation. 
Our previous system <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">DistBelief </a>(like many systems that have followed it) used special "parameter servers" to manage the shared model parameters, where the parameter servers had a simple read/write interface for fetching and updating shared parameters. In TensorFlow, all computation—including parameter management—is represented in the dataflow graph, and the system maps the graph onto heterogeneous devices (like multi-core CPUs, general-purpose GPUs, and mobile processors) in the available processes. To make TensorFlow easier to use, we have included Python libraries that make it easy to write a model that runs on a single process and scales to use multiple replicas for training. <br /><br />This architecture makes it easier to scale a single-process job up to use a cluster, and also to experiment with novel architectures for distributed training. As an example, my colleagues have recently shown that <a href="http://arxiv.org/abs/1604.00981">synchronous SGD with backup workers</a>, implemented in the TensorFlow graph, achieves improved time-to-accuracy for image model training.<br /><br />The current version of distributed computing support in TensorFlow is just the start. We are continuing to research ways of improving the performance of distributed training—both through engineering and algorithmic improvements—and will share these improvements with the community <a href="https://github.com/tensorflow/tensorflow">on GitHub</a>. 
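<br /><br />As a rough illustration of the parameter-server pattern contrasted above, here is a hypothetical sketch in plain Python. This is not the TensorFlow API; the names <code>ParameterServer</code>, <code>worker_grad</code>, and <code>train</code> are invented for this example. Workers read the shared parameter, compute gradients on their own data shards, and the server applies the averaged update:

```python
# Hypothetical sketch of a DistBelief-style parameter server (not the
# TensorFlow 0.8 API): a shared parameter behind a simple read/write
# interface, plus workers that compute gradients on their own shards.

class ParameterServer:
    def __init__(self, w):
        self.w = w                      # shared model parameter

    def read(self):
        return self.w

    def apply(self, grad, lr=0.1):
        self.w -= lr * grad             # simple SGD write-back

def worker_grad(w, shard):
    # Gradient of mean squared error for the 1-D model y = w * x,
    # computed locally on one worker's shard of the data.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, steps=200):
    ps = ParameterServer(0.0)
    for _ in range(steps):
        w = ps.read()                        # workers fetch parameters
        grads = [worker_grad(w, s) for s in shards]
        ps.apply(sum(grads) / len(grads))    # synchronous averaged update
    return ps.w

# Two workers, each holding a shard of data generated from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
```

Here <code>train(shards)</code> converges toward the true slope of 3. In distributed TensorFlow, by contrast, this read/apply cycle is expressed as ordinary operations in the dataflow graph, so parameter management can be placed on any device rather than requiring a special-purpose server.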
However, getting to this point would not have been possible without help from the following people:<br /><ul><li><b>TensorFlow training libraries</b> - Jianmin Chen, Matthieu Devin, Sherry Moore and Sergio Guadarrama</li><li><b>TensorFlow core</b> - Zhifeng Chen, Manjunath Kudlur and Vijay Vasudevan</li><li><b>Testing</b> - Shanqing Cai</li><li><b>Inception model architecture </b>- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jonathon Shlens and Zbigniew Wojna</li><li><b>Project management</b> - Amy McDonald Sandjideh</li><li><b>Engineering leadership</b> - Jeff Dean and Rajat Monga</li></ul>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-tensorflow-0-8-now-with-distributed-computing-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>All of Google’s CS Education Programs and Tools in One Place</title>
		<link>https://googledata.org/google-research/all-of-googles-cs-education-programs-and-tools-in-one-place/</link>
		<comments>https://googledata.org/google-research/all-of-googles-cs-education-programs-and-tools-in-one-place/#comments</comments>
		<pubDate>Tue, 12 Apr 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[education]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=bab49dbeffc5de64a1e78f798743076d</guid>
		<description><![CDATA[<span>Posted by Chris Stephenson, Head of Computer Science Education Programs</span><br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/04/all-of-googles-cs-education-programs.html">Google for Education Blog</a>) </i><br /><br />Interest in computer science education is growing rapidly; even the President of the United States has spoken of the importance of <a href="https://www.youtube.com/watch?v=8sthaV8ddJ4">giving every student an opportunity to learn computer science</a>. Google has been a supportive partner in these efforts by developing high-quality learning programs, educational tools and resources to advance new approaches in computer science education. To make it easier for all students and educators to access this information, today we&#8217;re launching a <a href="http://www.google.com/edu/cs">CS EDU website</a> that specifically outlines our initiatives in CS education. <br /><div><a href="https://1.bp.blogspot.com/-EGwX2Fh5JaM/Vwwp04OM4RI/AAAAAAAAA-M/znj7rp_yxgYh1te5_Cbf8JiRtw16b7cOQ/s1600/image00.png"><img border="0" height="496" src="https://1.bp.blogspot.com/-EGwX2Fh5JaM/Vwwp04OM4RI/AAAAAAAAA-M/znj7rp_yxgYh1te5_Cbf8JiRtw16b7cOQ/s640/image00.png" width="640"></a></div>The President&#8217;s call to action is grounded in economic realities coupled with a lack of access and ongoing system inequities. There is an increasing need for computer science skills in the workforce, with the <a href="http://www.bls.gov/opub/mlr/2013/article/occupational-employment-projections-to-2022.htm">Bureau of Labor Statistics</a> estimating that there will be more than 1.3 million job openings in computer and mathematical occupations by 2022. The majority of these jobs will require at least a Bachelor&#8217;s degree in Computer Science or in Information Technology, yet the U.S. 
is only producing 16,000 CS undergraduates per year.<br /><br />One of the reasons there are so few computer science graduates is that too few students have the opportunity to study computer science in high school. <a href="https://services.google.com/fh/files/misc/searching-for-computer-science_report.pdf">Google&#8217;s research</a> shows that only 25% of U.S. schools currently offer CS with programming or coding, despite the fact that 91% of parents want their children to learn computer science. In addition, schools with higher percentages of students living in households below the poverty line are even less likely to offer rigorous computer science courses.<br /><br />Increasing access to computer science for all learners requires tremendous commitment from a wide range of stakeholders, and we strive to be a strong supportive partner of these efforts. Our new <a href="http://www.google.com/edu/cs">CS EDU</a> website shows all the ways Google is working to address the need for improved access to high quality computer science learning in formal and informal education. 
Some current programs you&#8217;ll find there include:<br /><br /><ul><li><a href="https://www.cs-first.com/?utm_source=blog&#38;utm_medium=blog&#38;utm_campaign=csedublog0416">CS First</a>: providing more than 360,000 middle school students with an opportunity to create technology through free computer science clubs</li><li><a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/">Exploring Computational Thinking</a>: sharing more than 130 lesson plans aligned to international standards for students aged 8 to 18</li><li><a href="https://ignitecs.withgoogle.com/">igniteCS</a>: offering support and mentoring to address the retention problem in diverse student populations at the undergraduate level in more than 40 universities and counting</li><li><a href="https://developers.google.com/blockly/">Blockly</a> and other programming tools powering Code.org&#8217;s <a href="https://hourofcode.com/us">Hour of Code</a> (2 million users)</li><li><a href="https://www.madewithcode.com/">Google&#8217;s Made with Code</a>: movement that inspires millions of girls to learn to code and to see it as a means to pursue their dream careers (more than 10 million unique visitors)</li><li>...and many more!</li></ul><br />Computer science education is a pathway to innovation, to creativity and to exciting career opportunities, and Google believes that all students deserve these opportunities. That is why we are committed to developing programs, resources, tools and community partnerships that make computer science engaging and accessible for all students. With the launch of our <a href="http://www.google.com/edu/cs">CS EDU website</a>, all of these programs are at your fingertips.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Chris Stephenson, Head of Computer Science Education Programs</span><br /><br /><i>(Cross-posted on the <a href="http://googleforeducation.blogspot.com/2016/04/all-of-googles-cs-education-programs.html">Google for Education Blog</a>) </i><br /><br />Interest in computer science education is growing rapidly; even the President of the United States has spoken of the importance of <a href="https://www.youtube.com/watch?v=8sthaV8ddJ4">giving every student an opportunity to learn computer science</a>. Google has been a supportive partner in these efforts by developing high-quality learning programs, educational tools and resources to advance new approaches in computer science education. To make it easier for all students and educators to access this information, today we’re launching a <a href="http://www.google.com/edu/cs">CS EDU website</a> that specifically outlines our initiatives in CS education. <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-EGwX2Fh5JaM/Vwwp04OM4RI/AAAAAAAAA-M/znj7rp_yxgYh1te5_Cbf8JiRtw16b7cOQ/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="496" src="https://1.bp.blogspot.com/-EGwX2Fh5JaM/Vwwp04OM4RI/AAAAAAAAA-M/znj7rp_yxgYh1te5_Cbf8JiRtw16b7cOQ/s640/image00.png" width="640" /></a></div>The President’s call to action is grounded in economic realities coupled with a lack of access and ongoing system inequities. There is an increasing need for computer science skills in the workforce, with the <a href="http://www.bls.gov/opub/mlr/2013/article/occupational-employment-projections-to-2022.htm">Bureau of Labor Statistics</a> estimating that there will be more than 1.3 million job openings in computer and mathematical occupations by 2022. The majority of these jobs will require at least a Bachelor’s degree in Computer Science or in Information Technology, yet the U.S. 
is only producing 16,000 CS undergraduates per year.<br /><br />One of the reasons there are so few computer science graduates is that too few students have the opportunity to study computer science in high school. <a href="https://services.google.com/fh/files/misc/searching-for-computer-science_report.pdf">Google’s research</a> shows that only 25% of U.S. schools currently offer CS with programming or coding, despite the fact that 91% of parents want their children to learn computer science. In addition, schools with higher percentages of students living in households below the poverty line are even less likely to offer rigorous computer science courses.<br /><br />Increasing access to computer science for all learners requires tremendous commitment from a wide range of stakeholders, and we strive to be a strong supportive partner of these efforts. Our new <a href="http://www.google.com/edu/cs">CS EDU</a> website shows all the ways Google is working to address the need for improved access to high quality computer science learning in formal and informal education. 
Some current programs you’ll find there include:<br /><br /><ul><li><a href="https://www.cs-first.com/?utm_source=blog&amp;utm_medium=blog&amp;utm_campaign=csedublog0416">CS First</a>: providing more than 360,000 middle school students with an opportunity to create technology through free computer science clubs</li><li><a href="https://www.google.com/edu/resources/programs/exploring-computational-thinking/">Exploring Computational Thinking</a>: sharing more than 130 lesson plans aligned to international standards for students aged 8 to 18</li><li><a href="https://ignitecs.withgoogle.com/">igniteCS</a>: offering support and mentoring to address the retention problem in diverse student populations at the undergraduate level in more than 40 universities and counting</li><li><a href="https://developers.google.com/blockly/">Blockly</a> and other programming tools powering Code.org’s <a href="https://hourofcode.com/us">Hour of Code</a> (2 million users)</li><li><a href="https://www.madewithcode.com/">Google’s Made with Code</a>: movement that inspires millions of girls to learn to code and to see it as a means to pursue their dream careers (more than 10 million unique visitors)</li><li>...and many more!</li></ul><br />Computer science education is a pathway to innovation, to creativity and to exciting career opportunities, and Google believes that all students deserve these opportunities. That is why we are committed to developing programs, resources, tools and community partnerships that make computer science engaging and accessible for all students. With the launch of our <a href="http://www.google.com/edu/cs">CS EDU website</a>, all of these programs are at your fingertips.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/all-of-googles-cs-education-programs-and-tools-in-one-place/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Genomic Data Processing on Google Cloud Platform</title>
		<link>https://googledata.org/google-research/genomic-data-processing-on-google-cloud-platform/</link>
		<comments>https://googledata.org/google-research/genomic-data-processing-on-google-cloud-platform/#comments</comments>
		<pubDate>Wed, 06 Apr 2016 06:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=cccf52dc4b41c3a2493bc6a539a7249f</guid>
		<description><![CDATA[<span>Posted by Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute of MIT and Harvard</span><br /><br /><i>Today we hear from Broad Institute of MIT and Harvard about how their researchers and software engineers are collaborating closely with the Google Genomics team on large-scale genomic data analysis. They&#8217;ve already reduced the time and cost for whole genome processing by several fold, helping researchers think even bigger. Broad&#8217;s open source tools, developed in close <a href="https://cloudplatform.googleblog.com/2015/06/Google-Genomics-and-Broad-Institute-Team-Up-to-Tackle-Genomic-Data.html">collaboration with Google Genomics</a>, will also be made available to the wider research community. <br />&#8211; Jonathan Bingham, Product Manager, Google Genomics</i><br /><br /><table cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-xIlCwubRzLQ/VwRBiP-L1xI/AAAAAAAAA9o/GBuokDiS1aQ1tlX4cYoVeh1V0zfoV-7jA/s1600/image02.jpg"><img border="0" height="265" src="https://4.bp.blogspot.com/-xIlCwubRzLQ/VwRBiP-L1xI/AAAAAAAAA9o/GBuokDiS1aQ1tlX4cYoVeh1V0zfoV-7jA/s400/image02.jpg" width="400"></a></td></tr><tr><td>Dr. Stacey Gabriel, Director of the <br />Genomics Platform at the Broad Institute</td></tr></tbody></table>As one of the largest genome sequencing centers in the world, the <a href="http://www.broadinstitute.org/">Broad Institute</a> of MIT and Harvard generates a lot of data. Our DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year. Moreover, our rate of data generation is not only growing, but accelerating &#8211; our output increased more than two-fold last year, and nearly two-fold the previous year. 
We are not alone in facing this embarrassment of riches; across the whole genomics community, the rate of data production is doubling about every eight months with no end in sight.<br /><br />Here at  Broad, our team of software engineers and methods developers have spent the last year working to re-architect our production sequencing environment for the cloud. This has been no small feat, especially as we had to build the plane while we flew it! It required an entirely new system for developing and deploying pipelines (which we call <a href="https://github.com/broadinstitute/cromwell">Cromwell</a>), as well as a new framework for wet lab quality control that uncouples data generation from data processing.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://1.bp.blogspot.com/-0jAq6VN1y50/VwRCSEIaKyI/AAAAAAAAA9w/VukpNxJPgDcHuqUmcb8U85GAhbvgfKFRA/s1600/image00.png"><img border="0" height="168" src="https://1.bp.blogspot.com/-0jAq6VN1y50/VwRCSEIaKyI/AAAAAAAAA9w/VukpNxJPgDcHuqUmcb8U85GAhbvgfKFRA/s640/image00.png" width="640"></a></td></tr><tr><td>Courtesy: Broad Institute of MIT and Harvard</td></tr></tbody></table>Last summer Broad and Google <a href="http://www.broadinstitute.org/google">announced a collaboration</a> to develop a safe, secure and  scalable cloud computing infrastructure capable of storing and processing enormous datasets. We also set out to build cloud-supported tools to analyze such data and unravel long-standing mysteries about human health. Our engineers collaborate closely; we teach them about genomic data science and genomic data engineering, and they teach us about cloud computing and distributed systems. To us, this is a wonderful model for how a basic research institute can productively collaborate with industry to advance science and medicine. 
Both groups move faster and go further by working together.<br /><br />As of today, the largest and most important of our production pipelines, the <a href="https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1">Whole Genome Sequencing Pipeline</a>, has been completely ported to the <a href="https://cloud.google.com/">Google Cloud Platform </a>(GCP).  We are now beginning to run production jobs on GCP and will be switching over entirely this month. This switch has proved to be a very cost-effective decision. While the conventional wisdom is that public clouds can be more expensive, our experience is that cloud is dramatically cheaper. Consider the curve below that my colleague Kristian Cibulskis recently showed at <a href="https://youtu.be/M_G_1SWVHgw?t=500">GCP NEXT</a>:<br /><div><a href="https://3.bp.blogspot.com/-4JxZeKaF5-0/VwRCixg6baI/AAAAAAAAA90/H7-gTVFWneAX6irAuGHqqKhCJaO2lgNIw/s1600/image01.png"><img border="0" height="380" src="https://3.bp.blogspot.com/-4JxZeKaF5-0/VwRCixg6baI/AAAAAAAAA90/H7-gTVFWneAX6irAuGHqqKhCJaO2lgNIw/s640/image01.png" width="640"></a></div>Out of the box, the cost of running the <a href="https://www.broadinstitute.org/gatk/">Genome Analysis Toolkit</a> (GATK) best practices pipeline on a 30X-coverage whole genome was roughly the same as the cost of our on-premise infrastructure. Over a period of a few months, however, we developed techniques that allowed us to <i>really</i> reduce costs: We learned how to parallelize the computationally intensive steps like aligning DNA sequences against a reference genome. We also optimized for GCP&#8217;s infrastructure to lower costs by using features such as <a href="https://cloud.google.com/preemptible-vms/">Preemptible VMs</a>. 
After doing these optimizations, our production whole genome pipeline cost about 20% of what it did when we started, saving our researchers millions of dollars, all while reducing processing turnaround time eight-fold.<br /><br />There is a similar story to be told on storage of the input and output data. <a href="https://cloud.google.com/storage/docs/nearline">Google Cloud Storage Nearline</a> is a medium for storing DNA sequence alignments and raw data. Like most people in genomics, we access genetic variant data every day, but raw DNA sequences only a few times per year, such as when there is a new algorithm that requires raw data or a new assembly of the human genome. Nearline&#8217;s price/performance tradeoff is well-suited to data that&#8217;s infrequently accessed. By using Nearline, along with some compression tricks, we were able to reduce our storage costs by more than 50%. <br /><br />Altogether, we estimate that, by using GCP services for both compute and storage, we will be able to lower the total cost of ownership for storing and processing genomic data significantly relative to our on-premises costs. Looking forward, we also see advantages for data sharing, particularly for large multi-group genome projects. An environment where the data can be securely stored and analyzed will spare multiple groups from separately copying, transmitting, and storing the same data. <br /><br />Porting the GATK whole genome pipeline to the cloud is just the starting point. During the coming year, we plan to migrate the bulk of our production pipelines to the cloud, including tools for arrays, exomes, cancer genomes, and RNA-seq. Moreover, our non-exclusive relationship with Google is founded on the principle that our groups can leverage complementary skills to make products that can not only serve the needs of Broad, but also help serve the needs of researchers around the world. 
Therefore, as we migrate each of our pipelines to the cloud to meet our own needs, we also plan to make them available to the greater genomics community through a Software-as-a-Service model. <br /><br />This is an exciting time for us at Broad. For more than a decade we have served the genomics community by acting as a hub for data generation; now, we are extending this mission to encompass not only sequencing services, but also data services. We believe that by expanding access to our tools and optimizing our pipelines for the cloud, we will enable the community to benefit from the enormous effort we have invested. We look forward to expanding the scope of this mission in the years to come.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute of MIT and Harvard</span><br /><br /><i>Today we hear from Broad Institute of MIT and Harvard about how their researchers and software engineers are collaborating closely with the Google Genomics team on large-scale genomic data analysis. They’ve already reduced the time and cost for whole genome processing by several fold, helping researchers think even bigger. Broad’s open source tools, developed in close <a href="https://cloudplatform.googleblog.com/2015/06/Google-Genomics-and-Broad-Institute-Team-Up-to-Tackle-Genomic-Data.html">collaboration with Google Genomics</a>, will also be made available to the wider research community. <br />– Jonathan Bingham, Product Manager, Google Genomics</i><br /><br /><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-xIlCwubRzLQ/VwRBiP-L1xI/AAAAAAAAA9o/GBuokDiS1aQ1tlX4cYoVeh1V0zfoV-7jA/s1600/image02.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="265" src="https://4.bp.blogspot.com/-xIlCwubRzLQ/VwRBiP-L1xI/AAAAAAAAA9o/GBuokDiS1aQ1tlX4cYoVeh1V0zfoV-7jA/s400/image02.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Dr. Stacey Gabriel, Director of the <br />Genomics Platform at the Broad Institute</td></tr></tbody></table>As one of the largest genome sequencing centers in the world, the <a href="http://www.broadinstitute.org/">Broad Institute</a> of MIT and Harvard generates a lot of data. Our DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year. 
Moreover, our rate of data generation is not only growing, but accelerating – our output increased more than two-fold last year, and nearly two-fold the previous year. We are not alone in facing this embarrassment of riches; across the whole genomics community, the rate of data production is doubling about every eight months with no end in sight.<br /><br />Here at Broad, our team of software engineers and methods developers has spent the last year working to re-architect our production sequencing environment for the cloud. This has been no small feat, especially as we had to build the plane while we flew it! It required an entirely new system for developing and deploying pipelines (which we call <a href="https://github.com/broadinstitute/cromwell">Cromwell</a>), as well as a new framework for wet lab quality control that uncouples data generation from data processing.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-0jAq6VN1y50/VwRCSEIaKyI/AAAAAAAAA9w/VukpNxJPgDcHuqUmcb8U85GAhbvgfKFRA/s1600/image00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="168" src="https://1.bp.blogspot.com/-0jAq6VN1y50/VwRCSEIaKyI/AAAAAAAAA9w/VukpNxJPgDcHuqUmcb8U85GAhbvgfKFRA/s640/image00.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Courtesy: Broad Institute of MIT and Harvard</td></tr></tbody></table>Last summer Broad and Google <a href="http://www.broadinstitute.org/google">announced a collaboration</a> to develop a safe, secure and scalable cloud computing infrastructure capable of storing and processing enormous datasets. We also set out to build cloud-supported tools to analyze such data and unravel long-standing mysteries about human health. 
Our engineers collaborate closely; we teach them about genomic data science and genomic data engineering, and they teach us about cloud computing and distributed systems. To us, this is a wonderful model for how a basic research institute can productively collaborate with industry to advance science and medicine. Both groups move faster and go further by working together.<br /><br />As of today, the largest and most important of our production pipelines, the <a href="https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1">Whole Genome Sequencing Pipeline</a>, has been completely ported to the <a href="https://cloud.google.com/">Google Cloud Platform </a>(GCP).  We are now beginning to run production jobs on GCP and will be switching over entirely this month. This switch has proved to be a very cost-effective decision. While the conventional wisdom is that public clouds can be more expensive, our experience is that cloud is dramatically cheaper. Consider the curve below that my colleague Kristian Cibulskis recently showed at <a href="https://youtu.be/M_G_1SWVHgw?t=500">GCP NEXT</a>:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-4JxZeKaF5-0/VwRCixg6baI/AAAAAAAAA90/H7-gTVFWneAX6irAuGHqqKhCJaO2lgNIw/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="380" src="https://3.bp.blogspot.com/-4JxZeKaF5-0/VwRCixg6baI/AAAAAAAAA90/H7-gTVFWneAX6irAuGHqqKhCJaO2lgNIw/s640/image01.png" width="640" /></a></div>Out of the box, the cost of running the <a href="https://www.broadinstitute.org/gatk/">Genome Analysis Toolkit</a> (GATK) best practices pipeline on a 30X-coverage whole genome was roughly the same as the cost of our on-premise infrastructure. 
Over a period of a few months, however, we developed techniques that allowed us to <i>really</i> reduce costs: We learned how to parallelize the computationally intensive steps like aligning DNA sequences against a reference genome. We also optimized for GCP’s infrastructure to lower costs by using features such as <a href="https://cloud.google.com/preemptible-vms/">Preemptible VMs</a>. After doing these optimizations, our production whole genome pipeline cost about 20% of what it did when we started, saving our researchers millions of dollars, all while reducing processing turnaround time eight-fold.<br /><br />There is a similar story to be told on storage of the input and output data. <a href="https://cloud.google.com/storage/docs/nearline">Google Cloud Storage Nearline</a> is a medium for storing DNA sequence alignments and raw data. Like most people in genomics, we access genetic variant data every day, but raw DNA sequences only a few times per year, such as when there is a new algorithm that requires raw data or a new assembly of the human genome. Nearline’s price/performance tradeoff is well-suited to data that’s infrequently accessed. By using Nearline, along with some compression tricks, we were able to reduce our storage costs by more than 50%. <br /><br />Altogether, we estimate that, by using GCP services for both compute and storage, we will be able to lower the total cost of ownership for storing and processing genomic data significantly relative to our on-premises costs. Looking forward, we also see advantages for data sharing, particularly for large multi-group genome projects. An environment where the data can be securely stored and analyzed will spare multiple groups from separately copying, transmitting, and storing the same data. <br /><br />Porting the GATK whole genome pipeline to the cloud is just the starting point. 
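To make the storage arithmetic above concrete, here is a back-of-envelope sketch in Python. The prices, data volume, and compression ratio below are invented placeholders for illustration, not actual GCP rates:

```python
# Hypothetical model of the storage savings described above.
# All prices and volumes are made-up placeholders, not quoted GCP rates.

def monthly_storage_cost(tb, price_per_gb_month, compression_ratio=1.0):
    """Cost of storing `tb` terabytes for one month, after compression."""
    gb = tb * 1024
    return gb / compression_ratio * price_per_gb_month

STANDARD_PRICE = 0.026  # $/GB-month, illustrative only
NEARLINE_PRICE = 0.010  # $/GB-month, illustrative only

# Roughly a year of output at ~20 TB/day.
baseline = monthly_storage_cost(7300, STANDARD_PRICE)
optimized = monthly_storage_cost(7300, NEARLINE_PRICE, compression_ratio=1.3)

savings = 1 - optimized / baseline
print(f"storage savings: {savings:.0%}")
```

Even with made-up numbers, the shape of the calculation shows why a cheaper infrequent-access tier plus compression compounds: each factor multiplies the savings.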
During the coming year, we plan to migrate the bulk of our production pipelines to the cloud, including tools for arrays, exomes, cancer genomes, and RNA-seq. Moreover, our non-exclusive relationship with Google is founded on the principle that our groups can leverage complementary skills to make products that can not only serve the needs of Broad, but also help serve the needs of researchers around the world. Therefore, as we migrate each of our pipelines to the cloud to meet our own needs, we also plan to make them available to the greater genomics community through a Software-as-a-Service model. <br /><br />This is an exciting time for us at Broad. For more than a decade we have served the genomics community by acting as a hub for data generation; now, we are extending this mission to encompass not only sequencing services, but also data services. We believe that by expanding access to our tools and optimizing our pipelines for the cloud, we will enable the community to benefit from the enormous effort we have invested. We look forward to expanding the scope of this mission in the years to come.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/genomic-data-processing-on-google-cloud-platform/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lessons learned while protecting Gmail</title>
		<link>https://googledata.org/google-research/lessons-learned-while-protecting-gmail/</link>
		<comments>https://googledata.org/google-research/lessons-learned-while-protecting-gmail/#comments</comments>
		<pubDate>Tue, 29 Mar 2016 20:08:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[Gmail]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=4582d249b0114a78bd14dd32dc82a9ab</guid>
		<description><![CDATA[<span>Posted by Elie Bursztein - anti-abuse &#38; security research,  Nicolas Lidzborski - Gmail security engineering, and Vijay Eranti - Gmail anti-abuse engineering</span><br /><br />Earlier this year in San Francisco, <a href="https://www.usenix.org/">USENIX</a> hosted their inaugural <a href="https://www.usenix.org/conference/enigma2016">Enigma Conference</a>, which focused on security, privacy and electronic crime through the lens of emerging threats and novel attacks. We were <a href="http://googleresearch.blogspot.com/2015/08/say-hello-to-enigma-conference.html">excited to help make this conference happen</a> and to participate in it. <br /><br />At the conference, we heard from a variety of terrific speakers including:<br /><ul><li><a href="https://en.wikipedia.org/wiki/Ron_Rivest">Ron Rivest</a>, Professor at MIT and co-inventor of RSA, who spoke about <a href="https://youtu.be/hqacHM6Wm0Q">the consequences of backdooring encryption</a></li><li><a href="http://iccs.fordham.edu/program/iccs2015/robert-joyce/">Rob Joyce</a>, Chief of the NSA Tailored Access Operations organization, who spoke about <a href="https://youtu.be/bDJb8WOJYdA">defending against state attackers</a></li><li><a href="https://en.wikipedia.org/wiki/George_Hotz">George &#8220;Geohot&#8221; Hotz</a>, hacker extraordinaire, who discussed <a href="https://youtu.be/eGl6kpSajag">state of the art software debugging</a></li></ul>In addition, we were able to <a href="https://goo.gl/nqGcpg">share the lessons we&#8217;ve learned</a> about protecting Gmail users since it was launched over a decade ago. 
Those lessons are summarized in the infographic below (the talk slides are <a href="https://goo.gl/59Sfqp">also available</a>).<br /><div><a href="https://4.bp.blogspot.com/-3nynTyfHkf4/Vvsiwhs6wRI/AAAAAAAAA9I/Onxz7rgVoZANQ7pHMhnSZ0K44Oz5S_Yyg/s1600/Lesson-learned-while-protecting-gmail.png"><img border="0" src="https://4.bp.blogspot.com/-3nynTyfHkf4/Vvsiwhs6wRI/AAAAAAAAA9I/Onxz7rgVoZANQ7pHMhnSZ0K44Oz5S_Yyg/s1600/Lesson-learned-while-protecting-gmail.png"></a></div>We were proud to sponsor this year's inaugural Enigma conference, and it is our hope that the core lessons that we have learned over the years can benefit other online products and services. We're looking forward to participating again next year when <a href="https://www.usenix.org/conference/enigma2017">Enigma returns in 2017</a>. We hope to see you there!]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Elie Bursztein - anti-abuse &amp; security research,  Nicolas Lidzborski - Gmail security engineering, and Vijay Eranti - Gmail anti-abuse engineering</span><br /><br />Earlier this year in San Francisco, <a href="https://www.usenix.org/">USENIX</a> hosted their inaugural <a href="https://www.usenix.org/conference/enigma2016">Enigma Conference</a>, which focused on security, privacy and electronic crime through the lens of emerging threats and novel attacks. We were <a href="http://googleresearch.blogspot.com/2015/08/say-hello-to-enigma-conference.html">excited to help make this conference happen</a> and to participate in it. <br /><br />At the conference, we heard from a variety of terrific speakers including:<br /><ul><li><a href="https://en.wikipedia.org/wiki/Ron_Rivest">Ron Rivest</a>, Professor at MIT and co-inventor of RSA, who spoke about <a href="https://youtu.be/hqacHM6Wm0Q">the consequences of backdooring encryption</a></li><li><a href="http://iccs.fordham.edu/program/iccs2015/robert-joyce/">Rob Joyce</a>, Chief of the NSA Tailored Access Operations organization, who spoke about <a href="https://youtu.be/bDJb8WOJYdA">defending against state attackers</a></li><li><a href="https://en.wikipedia.org/wiki/George_Hotz">George “Geohot” Hotz</a>, hacker extraordinaire, who discussed <a href="https://youtu.be/eGl6kpSajag">state of the art software debugging</a></li></ul>In addition, we were able to <a href="https://goo.gl/nqGcpg">share the lessons we’ve learned</a> about protecting Gmail users since it was launched over a decade ago. 
Those lessons are summarized in the infographic below (the talk slides are <a href="https://goo.gl/59Sfqp">also available</a>).<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-3nynTyfHkf4/Vvsiwhs6wRI/AAAAAAAAA9I/Onxz7rgVoZANQ7pHMhnSZ0K44Oz5S_Yyg/s1600/Lesson-learned-while-protecting-gmail.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-3nynTyfHkf4/Vvsiwhs6wRI/AAAAAAAAA9I/Onxz7rgVoZANQ7pHMhnSZ0K44Oz5S_Yyg/s1600/Lesson-learned-while-protecting-gmail.png" /></a></div>We were proud to sponsor this year's inaugural Enigma conference, and it is our hope that the core lessons that we have learned over the years can benefit other online products and services. We're looking forward to participating again next year when <a href="https://www.usenix.org/conference/enigma2017">Enigma returns in 2017</a>. We hope to see you there!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/lessons-learned-while-protecting-gmail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Machine Learning in the Cloud, with TensorFlow</title>
		<link>https://googledata.org/google-research/machine-learning-in-the-cloud-with-tensorflow/</link>
		<comments>https://googledata.org/google-research/machine-learning-in-the-cloud-with-tensorflow/#comments</comments>
		<pubDate>Wed, 23 Mar 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=54b03e7c07b0fb26886322c1e4131ea3</guid>
		<description><![CDATA[<span>Posted by Slaven Bilac, Software Engineer, Google Research</span><br /><br />At Google, researchers collaborate closely with product teams, applying the latest advances in Machine Learning to existing products and services - such as <a href="http://googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html">speech recognition in the Google app</a>, <a href="http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html">search in Google Photos</a> and the <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">Smart Reply feature in Inbox by Gmail</a> - in order to make them more useful.  A growing number of Google products are using <a href="https://www.tensorflow.org/">TensorFlow</a>, our open source Machine Learning system, to tackle ML challenges and we would like to enable others to do the same. <br /><br />Today, at <a href="https://cloudplatformonline.com/NEXT2016.html">GCP NEXT 2016</a>, we <a href="https://cloudplatform.googleblog.com/2016/03/Google-takes-Cloud-Machine-Learning-service-mainstream.html">announced the alpha release</a> of <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a>, a framework for building and training custom models to be used in intelligent applications.  <br /><div><a href="https://cloud.google.com/ml/"><img border="0" height="320" src="https://3.bp.blogspot.com/-ySmp6NANwB4/VvLGkyVAQJI/AAAAAAAAA8c/iMeTQoMEM70YwLuUpqpYN100L85506N7Q/s320/image00.png" width="320"></a></div>Machine Learning projects can come in many sizes, and as we&#8217;ve seen with our open source offering <a href="https://www.tensorflow.org/">TensorFlow</a>, projects often need to scale up. Some small tasks are best handled with a local solution running on one&#8217;s desktop, while large scale applications require both the scale and dependability of a hosted solution. 
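A "small task on one's desktop" can be as little as a dozen lines of code. As a framework-free illustration (the data and hyperparameters below are invented, and the actual Cloud Machine Learning samples are TensorFlow-based), here is linear regression fit by batch gradient descent:

```python
# Tiny linear regression fit by gradient descent -- the kind of "small task"
# that runs locally before a model is scaled up to a hosted service.
# Data and hyperparameters are made up for illustration.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # roughly y = 2x + 1 with noise

w, b = 0.0, 0.0
lr = 0.02
for _ in range(2000):
    # Mean-squared-error gradients over the full batch.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"fitted slope w={w:.2f}, intercept b={b:.2f}")
```

The same objective, expressed as a TensorFlow graph, can be trained unchanged on a laptop or handed to a hosted service; only the execution environment scales.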
Google <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a> aims to support the full range and provide a seamless transition from a local to a cloud environment.<br /><br />The <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a> offering allows users to run custom distributed learning algorithms based on <a href="https://www.tensorflow.org/">TensorFlow</a>. In addition to the <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> capabilities that power <a href="http://cloud.google.com/translate">Cloud Translate API</a>, <a href="https://cloud.google.com/vision/">Cloud Vision API</a>, and <a href="https://cloud.google.com/speech/">Cloud Speech API</a>, we provide easy-to-adopt samples for common tasks like linear regression/classification with very fast convergence properties (based on the <a href="http://arxiv.org/abs/1211.2717">SDCA</a> algorithm) and building a custom image classification model with a few hundred training examples (based on the <a href="http://arxiv.org/pdf/1310.1531v1.pdf">DeCAF</a> algorithm).<br /><br />We are excited to bring the best of <a href="https://research.google.com/">Google Research</a> to <a href="https://cloud.google.com/">Google Cloud Platform</a>. Learn more about this release and more from GCP Next 2016 on the <a href="https://cloudplatform.googleblog.com/2016/03/Google-takes-Cloud-Machine-Learning-service-mainstream.html">Google Cloud Platform blog</a>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Slaven Bilac, Software Engineer, Google Research</span><br /><br />At Google, researchers collaborate closely with product teams, applying the latest advances in Machine Learning to existing products and services - such as <a href="http://googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html">speech recognition in the Google app</a>, <a href="http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html">search in Google Photos</a> and the <a href="http://googleresearch.blogspot.com/2015/11/computer-respond-to-this-email.html">Smart Reply feature in Inbox by Gmail</a> - in order to make them more useful.  A growing number of Google products are using <a href="https://www.tensorflow.org/">TensorFlow</a>, our open source Machine Learning system, to tackle ML challenges and we would like to enable others to do the same. <br /><br />Today, at <a href="https://cloudplatformonline.com/NEXT2016.html">GCP NEXT 2016</a>, we <a href="https://cloudplatform.googleblog.com/2016/03/Google-takes-Cloud-Machine-Learning-service-mainstream.html">announced the alpha release</a> of <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a>, a framework for building and training custom models to be used in intelligent applications.  <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://cloud.google.com/ml/" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://3.bp.blogspot.com/-ySmp6NANwB4/VvLGkyVAQJI/AAAAAAAAA8c/iMeTQoMEM70YwLuUpqpYN100L85506N7Q/s320/image00.png" width="320" /></a></div>Machine Learning projects can come in many sizes, and as we’ve seen with our open source offering <a href="https://www.tensorflow.org/">TensorFlow</a>, projects often need to scale up. 
Some small tasks are best handled with a local solution running on one’s desktop, while large scale applications require both the scale and dependability of a hosted solution. Google <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a> aims to support the full range and provide a seamless transition from a local to a cloud environment.<br /><br />The <a href="https://cloud.google.com/products/machine-learning">Cloud Machine Learning</a> offering allows users to run custom distributed learning algorithms based on <a href="https://www.tensorflow.org/">TensorFlow</a>. In addition to the <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> capabilities that power <a href="http://cloud.google.com/translate">Cloud Translate API</a>, <a href="https://cloud.google.com/vision/">Cloud Vision API</a>, and <a href="https://cloud.google.com/speech/">Cloud Speech API</a>, we provide easy-to-adopt samples for common tasks like linear regression/classification with very fast convergence properties (based on the <a href="http://arxiv.org/abs/1211.2717">SDCA</a> algorithm) and building a custom image classification model with a few hundred training examples (based on the <a href="http://arxiv.org/pdf/1310.1531v1.pdf">DeCAF</a> algorithm).<br /><br />We are excited to bring the best of <a href="https://research.google.com/">Google Research</a> to <a href="https://cloud.google.com/">Google Cloud Platform</a>. Learn more about this release and more from GCP Next 2016 on the <a href="https://cloudplatform.googleblog.com/2016/03/Google-takes-Cloud-Machine-Learning-service-mainstream.html">Google Cloud Platform blog</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/machine-learning-in-the-cloud-with-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing the 2016 Google PhD Fellows for North America, Europe and the Middle East</title>
		<link>https://googledata.org/google-research/announcing-the-2016-google-phd-fellows-for-north-america-europe-and-the-middle-east/</link>
		<comments>https://googledata.org/google-research/announcing-the-2016-google-phd-fellows-for-north-america-europe-and-the-middle-east/#comments</comments>
		<pubDate>Thu, 10 Mar 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=2a896b139730596c86fc17ff1d08b75d</guid>
		<description><![CDATA[<span>Posted by Michael Rennaker, Google PhD Fellowships Lead</span><br /><br />Google created the <a href="http://research.google.com/research-outreach.html#/research-outreach/graduate-fellowships">PhD Fellowship program</a> in 2009 to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our fellowship program has supported hundreds of future faculty, industry researchers, innovators and entrepreneurs.<br /><br />Reflecting our continuing commitment to supporting and building relationships with the academic community, we are excited to announce the 39 recipients from North America, Europe and the Middle East. We offer our sincere congratulations to Google&#8217;s 2016 Class of PhD Fellows.<br /><br /><b><u>Computational Neuroscience</u></b><br />Cameron (Po-Hsuan) Chen, <i>Princeton University</i><br />Grace Lindsay, <i>Columbia University</i><br />Martino Sorbaro Sindaci, <i>The University of Edinburgh</i><br /><br /><b><u>Human-Computer Interaction</u></b><br />Koki Nagano, <i>University of Southern California</i><br />Arvind Satyanarayan, <i>Stanford University</i><br />Amy Xian Zhang, <i>Massachusetts Institute of Technology</i><br /><br /><b><u>Machine Learning</u></b><br />Olivier Bachem, <i>Swiss Federal Institute of Technology Zurich</i><br />Tianqi Chen, <i>University of Washington</i><br />Emily Denton, <i>New York University</i><br />Yves-Laurent Kom Samo, <i>University of Oxford</i><br />Daniel Jaymin Mankowitz, <i>Technion - Israel Institute of Technology</i><br />Lucas Maystre, <i>&#201;cole Polytechnique F&#233;d&#233;rale de Lausanne</i><br />Arvind Neelakantan, <i>University of Massachusetts, Amherst</i><br />Ludwig Schmidt, <i>Massachusetts Institute of Technology</i><br />Shandian Zhe, <i>Purdue University, West Lafayette</i><br /><br /><b><u>Machine Perception, Speech Technology and Computer Vision</u></b><br />Eugen Beck, 
<i>RWTH Aachen University</i><br />Yu-Wei Chao, <i>University of Michigan, Ann Arbor</i><br />Wei Liu, <i>University of North Carolina at Chapel Hill</i><br />Aron Monszpart, <i>University College London</i><br />Thomas Schoeps, <i>Swiss Federal Institute of Technology Zurich</i><br />Chia-Yin Tsai, <i>Carnegie Mellon University</i><br /><br /><b><u>Market Algorithms</u></b><br />Hossein Esfandiari, <i>University of Maryland, College Park</i><br />Sandy Heydrich, <i>Saarland University - Saarbrucken GSCS</i><br />Rad Niazadeh, <i>Cornell University</i><br />Sadra Yazdanbod, <i>Georgia Institute of Technology</i><br /><br /><b><u>Mobile Computing</u></b><br />Lei Kang, <i>University of Wisconsin</i><br />Tauhidur Rahman, <i>Cornell University</i><br />Yuhao Zhu, <i>University of Texas, Austin</i><br /><br /><b><u>Natural Language Processing</u></b><br />Tamer Alkhouli, <i>RWTH Aachen University</i><br />Jose Camacho Collados, <i>Sapienza - Universit&#224; di Roma </i><br /><br /><b><u>Privacy and Security</u></b><br />Kartik Nayak, <i>University of Maryland, College Park</i><br />Nicolas Papernot, <i>Pennsylvania State University</i><br />Damian Vizar, <i>&#201;cole Polytechnique F&#233;d&#233;rale de Lausanne</i><br />Xi Wu, <i>University of Wisconsin</i><br /><u><br /></u> <b><u>Programming Languages and Software Engineering</u></b><br />Marcelo Sousa, <i>University of Oxford</i><br /><br /><b><u>Structured Data and Database Management</u></b><br />Xiang Ren, <i>University of Illinois, Urbana-Champaign</i><br /><br /><b><u>Systems and Networking</u></b><br />Andrew Crotty, <i>Brown University</i><br />Ilias Marinos, <i>University of Cambridge</i><br />Kay Ousterhout, <i>University of California, Berkeley</i>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Michael Rennaker, Google PhD Fellowships Lead</span><br /><br />Google created the <a href="http://research.google.com/research-outreach.html#/research-outreach/graduate-fellowships">PhD Fellowship program</a> in 2009 to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our fellowship program has supported hundreds of future faculty, industry researchers, innovators and entrepreneurs.<br /><br />Reflecting our continuing commitment to supporting and building relationships with the academic community, we are excited to announce the 39 recipients from North America, Europe and the Middle East. We offer our sincere congratulations to Google’s 2016 Class of PhD Fellows.<br /><br /><b><u>Computational Neuroscience</u></b><br />Cameron (Po-Hsuan) Chen, <i>Princeton University</i><br />Grace Lindsay, <i>Columbia University</i><br />Martino Sorbaro Sindaci, <i>The University of Edinburgh</i><br /><br /><b><u>Human-Computer Interaction</u></b><br />Koki Nagano, <i>University of Southern California</i><br />Arvind Satyanarayan, <i>Stanford University</i><br />Amy Xian Zhang, <i>Massachusetts Institute of Technology</i><br /><br /><b><u>Machine Learning</u></b><br />Olivier Bachem, <i>Swiss Federal Institute of Technology Zurich</i><br />Tianqi Chen, <i>University of Washington</i><br />Emily Denton, <i>New York University</i><br />Yves-Laurent Kom Samo, <i>University of Oxford</i><br />Daniel Jaymin Mankowitz, <i>Technion - Israel Institute of Technology</i><br />Lucas Maystre, <i>École Polytechnique Fédérale de Lausanne</i><br />Arvind Neelakantan, <i>University of Massachusetts, Amherst</i><br />Ludwig Schmidt, <i>Massachusetts Institute of Technology</i><br />Shandian Zhe, <i>Purdue University, West Lafayette</i><br /><br /><b><u>Machine Perception, Speech Technology and Computer Vision</u></b><br />Eugen 
Beck, <i>RWTH Aachen University</i><br />Yu-Wei Chao, <i>University of Michigan, Ann Arbor</i><br />Wei Liu, <i>University of North Carolina at Chapel Hill</i><br />Aron Monszpart, <i>University College London</i><br />Thomas Schoeps, <i>Swiss Federal Institute of Technology Zurich</i><br />Chia-Yin Tsai, <i>Carnegie Mellon University</i><br /><br /><b><u>Market Algorithms</u></b><br />Hossein Esfandiari, <i>University of Maryland, College Park</i><br />Sandy Heydrich, <i>Saarland University - Saarbrucken GSCS</i><br />Rad Niazadeh, <i>Cornell University</i><br />Sadra Yazdanbod, <i>Georgia Institute of Technology</i><br /><br /><b><u>Mobile Computing</u></b><br />Lei Kang, <i>University of Wisconsin</i><br />Tauhidur Rahman, <i>Cornell University</i><br />Yuhao Zhu, <i>University of Texas, Austin</i><br /><br /><b><u>Natural Language Processing</u></b><br />Tamer Alkhouli, <i>RWTH Aachen University</i><br />Jose Camacho Collados, <i>Sapienza - Università di Roma </i><br /><br /><b><u>Privacy and Security</u></b><br />Kartik Nayak, <i>University of Maryland, College Park</i><br />Nicolas Papernot, <i>Pennsylvania State University</i><br />Damian Vizar, <i>École Polytechnique Fédérale de Lausanne</i><br />Xi Wu, <i>University of Wisconsin</i><br /><u><br /></u> <b><u>Programming Languages and Software Engineering</u></b><br />Marcelo Sousa, <i>University of Oxford</i><br /><br /><b><u>Structured Data and Database Management</u></b><br />Xiang Ren, <i>University of Illinois, Urbana-Champaign</i><br /><br /><b><u>Systems and Networking</u></b><br />Andrew Crotty, <i>Brown University</i><br />Ilias Marinos, <i>University of Cambridge</i><br />Kay Ousterhout, <i>University of California, Berkeley</i>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/announcing-the-2016-google-phd-fellows-for-north-america-europe-and-the-middle-east/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Train your own image classifier with Inception in TensorFlow</title>
		<link>https://googledata.org/google-research/train-your-own-image-classifier-with-inception-in-tensorflow/</link>
		<comments>https://googledata.org/google-research/train-your-own-image-classifier-with-inception-in-tensorflow/#comments</comments>
		<pubDate>Wed, 09 Mar 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=57950f9d8ab62117811e509fdd1bc90a</guid>
		<description><![CDATA[<span>Posted by Jon Shlens, Senior Research Scientist</span><br /><br />At the end of last year we released code that allows a user <a href="http://googleresearch.blogspot.com/2015/12/how-to-classify-images-with-tensorflow.html">to classify images with TensorFlow</a> models. This code demonstrated how to build an image classification system by employing a deep learning model that we had previously trained. This model was known to classify an image across 1000 categories supplied by the <a href="http://www.image-net.org/">ImageNet</a> academic competition with an error rate that approached <a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">human performance</a>. After all, what self-respecting computer vision system would fail to recognize a cute puppy?<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-W__wiaHUjwI/Vt3Grd8df0I/AAAAAAAAA78/7xqUNj8ujtY/s1600/image02.png"><img border="0" height="400" src="https://3.bp.blogspot.com/-W__wiaHUjwI/Vt3Grd8df0I/AAAAAAAAA78/7xqUNj8ujtY/s400/image02.png" width="331"></a></td></tr><tr><td>Image via <a href="https://commons.wikimedia.org/wiki/File:Golde33443.jpg">Wikipedia</a></td></tr></tbody></table>Well, thankfully the image classification model would recognize this image as a <i>retriever</i> with 79.3% confidence. But, more spectacularly, it would also be able to distinguish between a <a href="https://en.wikipedia.org/wiki/Spotted_salamander">spotted salamander</a> and a <a href="https://en.wikipedia.org/wiki/Fire_salamander">fire salamander</a> with high confidence &#8211; a task that might be quite difficult for those who are not experts in <a href="https://en.wikipedia.org/wiki/Herpetology">herpetology</a>. Can <i>you</i> tell the difference? 
<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://3.bp.blogspot.com/-FRQzMrAIr_g/Vt3G7-9oSkI/AAAAAAAAA8A/mdnBJBm3oM0/s1600/salamander.png"><img border="0" height="198" src="https://3.bp.blogspot.com/-FRQzMrAIr_g/Vt3G7-9oSkI/AAAAAAAAA8A/mdnBJBm3oM0/s640/salamander.png" width="640"></a></td></tr><tr><td>Images via <a href="https://en.wikipedia.org/wiki/Fire_salamander#/media/File:Salamandra_salamandra_MHNT_1.jpg">Wikipedia</a></td></tr></tbody></table>The deep learning model we released, <b>Inception-v3</b>, is described in our arXiv preprint &#8220;<a href="http://arxiv.org/abs/1512.00567"><i>Rethinking the Inception Architecture for Computer Vision</i></a>&#8221; and can be visualized with this schematic diagram:<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-TMOLlkJBxms/Vt3HQXpE2cI/AAAAAAAAA8E/7X7XRFOY6Xo/s1600/image03.png"><img border="0" src="https://4.bp.blogspot.com/-TMOLlkJBxms/Vt3HQXpE2cI/AAAAAAAAA8E/7X7XRFOY6Xo/s1600/image03.png"></a></td></tr><tr><td>Schematic diagram of Inception-v3</td></tr></tbody></table>As described in the preprint, this model achieves 5.64% top-5 error while an ensemble of four of these models achieves 3.58% top-5 error on the validation set of the ImageNet whole image <a href="http://www.image-net.org/challenges/LSVRC/2012/">ILSVRC 2012</a> classification task. Furthermore, in the <a href="http://image-net.org/challenges/LSVRC/2015/results">2015 ImageNet Challenge</a>, an ensemble of 4 of these models came in 2nd in the image classification task.<br /><br />After the release of this model, many people in the TensorFlow community voiced their preference for having an Inception-v3 model that they can train themselves, rather than using our pre-trained model. 
We could not agree more, since a system for training an Inception-v3 model provides many opportunities, including:<br /><ul><li>Exploration of different variants of this model architecture in order to improve the image classification system.</li><li>Comparison of optimization algorithms and hardware setups for training this model faster or to a higher degree of predictive performance.</li><li>Retraining/fine-tuning the Inception-v3 model on a distinct image classification task or as a component of a larger network tasked with object detection or multi-modal learning.</li></ul>The last topic is often referred to as <a href="https://en.wikipedia.org/wiki/Inductive_transfer"><i>transfer learning</i></a>, and has been an area of particular excitement in the field of deep networks in the context of vision. A common prescription for a computer vision problem is to first train an image classification model with the ImageNet Challenge data set, and then transfer this model&#8217;s knowledge to a distinct task. This has been done for <a href="http://arxiv.org/abs/1311.2524">object detection</a>, <a href="http://arxiv.org/abs/1312.5650">zero-shot learning</a>, <a href="http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html">image captioning</a>, <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&#38;arnumber=6751448">video analysis</a> and a multitude of other applications.<br /><br />Today we are happy to announce that <a href="https://github.com/tensorflow/models/tree/master/inception">we are releasing libraries and code for training <b>Inception-v3</b></a> on one or multiple GPUs. 
Some features of this code include:<br /><ul><li>Training an Inception-v3 model with synchronous updates across multiple GPUs.</li><li>Employing batch normalization to speed up training of the model.</li><li>Leveraging many distortions of the image to augment model training.</li><li>Releasing a new (still experimental) high-level language for specifying complex model architectures, which we call <a href="https://github.com/tensorflow/models/blob/master/inception/inception/slim/README.md"><b>TensorFlow-Slim</b></a>.</li><li>Demonstrating how to perform transfer learning by taking a pre-trained Inception-v3 model and fine-tuning it for another task.</li></ul>We can train a model from scratch to its best performance on a desktop with 8 NVIDIA Tesla K40s in about 2 weeks. In order to make research progress faster, we are additionally supplying a new version of a pre-trained Inception-v3 model that is ready to be fine-tuned or adapted to a new task. We demonstrate how to use this model for transfer learning on a simple flower classification task. Hopefully, this provides a useful didactic example for employing this Inception model on a wide range of vision tasks.<br /><br />Want to get started? See the accompanying <b><a href="https://github.com/tensorflow/models/tree/master/inception/README.md">instructions</a></b> on how to <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-train-from-scratch">train</a>, <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-evaluate">evaluate</a> or <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-fine-tune-a-pre-trained-model-on-a-new-task">fine-tune</a> a network.<br /><br />Releasing this code has been a huge team effort. These efforts have taken several months with contributions from many individuals spanning research at Google. 
We wish to especially acknowledge the following people who contributed to this project:<br /><ul><li><b>Model Architecture</b> &#8211; Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jon Shlens and Zbigniew Wojna</li><li><b>Systems Infrastructure</b> &#8211; Sherry Moore, Martin Wicke, David Andersen, Matthieu Devin, Manjunath Kudlur and Nishant Patil</li><li><b>TensorFlow-Slim</b> &#8211; Sergio Guadarrama and Nathan Silberman</li><li><b>Model Visualization</b> &#8211; Fernanda Vi&#233;gas, Martin Wattenberg and James Wexler</li></ul>]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Jon Shlens, Senior Research Scientist</span><br /><br />At the end of last year we released code that allows a user <a href="http://googleresearch.blogspot.com/2015/12/how-to-classify-images-with-tensorflow.html">to classify images with TensorFlow</a> models. This code demonstrated how to build an image classification system by employing a deep learning model that we had previously trained. This model was known to classify an image across 1000 categories supplied by the <a href="http://www.image-net.org/">ImageNet</a> academic competition with an error rate that approached <a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">human performance</a>. After all, what self-respecting computer vision system would fail to recognize a cute puppy?<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-W__wiaHUjwI/Vt3Grd8df0I/AAAAAAAAA78/7xqUNj8ujtY/s1600/image02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://3.bp.blogspot.com/-W__wiaHUjwI/Vt3Grd8df0I/AAAAAAAAA78/7xqUNj8ujtY/s400/image02.png" width="331" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Image via <a href="https://commons.wikimedia.org/wiki/File:Golde33443.jpg">Wikipedia</a></td></tr></tbody></table>Well, thankfully the image classification model would recognize this image as a <i>retriever</i> with 79.3% confidence. 
But, more spectacularly, it would also be able to distinguish between a <a href="https://en.wikipedia.org/wiki/Spotted_salamander">spotted salamander</a> and a <a href="https://en.wikipedia.org/wiki/Fire_salamander">fire salamander</a> with high confidence – a task that might be quite difficult for those who are not experts in <a href="https://en.wikipedia.org/wiki/Herpetology">herpetology</a>. Can <i>you</i> tell the difference? <br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-FRQzMrAIr_g/Vt3G7-9oSkI/AAAAAAAAA8A/mdnBJBm3oM0/s1600/salamander.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="198" src="https://3.bp.blogspot.com/-FRQzMrAIr_g/Vt3G7-9oSkI/AAAAAAAAA8A/mdnBJBm3oM0/s640/salamander.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Images via <a href="https://en.wikipedia.org/wiki/Fire_salamander#/media/File:Salamandra_salamandra_MHNT_1.jpg">Wikipedia</a></td></tr></tbody></table>The deep learning model we released, <b>Inception-v3</b>, is described in our arXiv preprint “<a href="http://arxiv.org/abs/1512.00567"><i>Rethinking the Inception Architecture for Computer Vision</i></a>” and can be visualized with this schematic diagram:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-TMOLlkJBxms/Vt3HQXpE2cI/AAAAAAAAA8E/7X7XRFOY6Xo/s1600/image03.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://4.bp.blogspot.com/-TMOLlkJBxms/Vt3HQXpE2cI/AAAAAAAAA8E/7X7XRFOY6Xo/s1600/image03.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Schematic diagram of 
Inception-v3</td></tr></tbody></table>As described in the preprint, this model achieves 5.64% top-5 error while an ensemble of four of these models achieves 3.58% top-5 error on the validation set of the ImageNet whole image <a href="http://www.image-net.org/challenges/LSVRC/2012/">ILSVRC 2012</a> classification task. Furthermore, in the <a href="http://image-net.org/challenges/LSVRC/2015/results">2015 ImageNet Challenge</a>, an ensemble of 4 of these models came in 2nd in the image classification task.<br /><br />After the release of this model, many people in the TensorFlow community voiced their preference for having an Inception-v3 model that they can train themselves, rather than using our pre-trained model. We could not agree more, since a system for training an Inception-v3 model provides many opportunities, including:<br /><ul><li>Exploration of different variants of this model architecture in order to improve the image classification system.</li><li>Comparison of optimization algorithms and hardware setups for training this model faster or to a higher degree of predictive performance.</li><li>Retraining/fine-tuning the Inception-v3 model on a distinct image classification task or as a component of a larger network tasked with object detection or multi-modal learning.</li></ul>The last topic is often referred to as <a href="https://en.wikipedia.org/wiki/Inductive_transfer"><i>transfer learning</i></a>, and has been an area of particular excitement in the field of deep networks in the context of vision. A common prescription for a computer vision problem is to first train an image classification model with the ImageNet Challenge data set, and then transfer this model’s knowledge to a distinct task. 
This has been done for <a href="http://arxiv.org/abs/1311.2524">object detection</a>, <a href="http://arxiv.org/abs/1312.5650">zero-shot learning</a>, <a href="http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html">image captioning</a>, <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=6751448">video analysis</a> and a multitude of other applications.<br /><br />Today we are happy to announce that <a href="https://github.com/tensorflow/models/tree/master/inception">we are releasing libraries and code for training <b>Inception-v3</b></a> on one or multiple GPUs. Some features of this code include:<br /><ul><li>Training an Inception-v3 model with synchronous updates across multiple GPUs.</li><li>Employing batch normalization to speed up training of the model.</li><li>Leveraging many distortions of the image to augment model training.</li><li>Releasing a new (still experimental) high-level language for specifying complex model architectures, which we call <a href="https://github.com/tensorflow/models/blob/master/inception/inception/slim/README.md"><b>TensorFlow-Slim</b></a>.</li><li>Demonstrating how to perform transfer learning by taking a pre-trained Inception-v3 model and fine-tuning it for another task.</li></ul>We can train a model from scratch to its best performance on a desktop with 8 NVIDIA Tesla K40s in about 2 weeks. In order to make research progress faster, we are additionally supplying a new version of a pre-trained Inception-v3 model that is ready to be fine-tuned or adapted to a new task. We demonstrate how to use this model for transfer learning on a simple flower classification task. Hopefully, this provides a useful didactic example for employing this Inception model on a wide range of vision tasks.<br /><br />Want to get started? 
See the accompanying <b><a href="https://github.com/tensorflow/models/tree/master/inception/README.md">instructions</a></b> on how to <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-train-from-scratch">train</a>, <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-evaluate">evaluate</a> or <a href="https://github.com/tensorflow/models/tree/master/inception/README.md#how-to-fine-tune-a-pre-trained-model-on-a-new-task">fine-tune</a> a network.<br /><br />Releasing this code has been a huge team effort. These efforts have taken several months with contributions from many individuals spanning research at Google. We wish to especially acknowledge the following people who contributed to this project:<br /><ul><li><b>Model Architecture</b> – Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jon Shlens and Zbigniew Wojna</li><li><b>Systems Infrastructure</b> – Sherry Moore, Martin Wicke, David Andersen, Matthieu Devin, Manjunath Kudlur and Nishant Patil</li><li><b>TensorFlow-Slim</b> – Sergio Guadarrama and Nathan Silberman</li><li><b>Model Visualization</b> – Fernanda Viégas, Martin Wattenberg and James Wexler</li></ul>]]></content:encoded>
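The transfer-learning recipe the post describes (reuse a pre-trained network as a fixed feature extractor, then train a new classification head for the target task, such as flower classification) can be sketched framework-agnostically. In the minimal NumPy sketch below, a fixed random projection stands in for the frozen Inception-v3 trunk; all names and shapes are illustrative assumptions, not taken from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor (in the post: Inception-v3 up
# to its final pooling layer). Its weights are frozen and never updated.
W_frozen = rng.normal(size=(64, 16))

def extract_features(x):
    """Frozen projection + ReLU; a toy substitute for a pre-trained CNN."""
    return np.maximum(x @ W_frozen, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy "new task": 300 samples, 3 classes whose labels are a linear function
# of the frozen features, so a linear head can in principle fit them.
n, n_classes = 300, 3
x = rng.normal(size=(n, 64))
feats = extract_features(x)
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)  # normalize
y = np.argmax(feats @ rng.normal(size=(16, n_classes)), axis=1)

# Transfer learning reduced to its simplest form: gradient descent on the
# cross-entropy loss of ONLY the new softmax head; the extractor stays fixed.
W_head = np.zeros((16, n_classes))
one_hot = np.eye(n_classes)[y]
for _ in range(1500):
    probs = softmax(feats @ W_head)
    W_head -= 0.2 * (feats.T @ (probs - one_hot) / n)

accuracy = float(np.mean(np.argmax(feats @ W_head, axis=1) == y))
```

In the released code the same idea is driven by the fine-tuning instructions linked above: the new head is trained on the target data set while the Inception-v3 weights are restored from the supplied checkpoint (and optionally updated at a lower learning rate).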
			<wfw:commentRss>https://googledata.org/google-research/train-your-own-image-classifier-with-inception-in-tensorflow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Deep Learning for Robots: Learning from Large-Scale Interaction</title>
		<link>https://googledata.org/google-research/deep-learning-for-robots-learning-from-large-scale-interaction/</link>
		<comments>https://googledata.org/google-research/deep-learning-for-robots-learning-from-large-scale-interaction/#comments</comments>
		<pubDate>Tue, 08 Mar 2016 18:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=33ed22a2fcaf86a6d2746154bcfdf750</guid>
		<description><![CDATA[<span>Posted by Sergey Levine, Research Scientist</span><br /><br />While we&#8217;ve recently seen great strides in robotic capability, the gap between human and robot motor skills remains vast. Machines still have a very long way to go to match human proficiency even at basic sensorimotor skills like grasping. However, by linking learning with continuous feedback and control, we might begin to bridge that gap, and in so doing make it possible for robots to intelligently and reliably handle the complexities of the real world.<br /><br />Consider for example <a href="https://www.youtube.com/watch?v=PomkJ4l9CMU">this robot</a> from <a href="http://www.kaist.edu/html/en/index.html">KAIST</a>, which won last year&#8217;s <a href="http://www.theroboticschallenge.org/">DARPA robotics challenge</a>. The remarkably precise and deliberate motions are deeply impressive. But they are also quite&#8230; robotic. Why is that? What makes robot behavior so distinctly robotic compared to human behavior? At a high level, current robots typically follow a sense-plan-act paradigm, where the robot observes the world around it, formulates an internal model, constructs a plan of action, and then executes this plan. This approach is modular and often effective, but tends to break down in the kinds of cluttered natural environments that are typical of the real world. Here, perception is imprecise, all models are wrong in some way, and no plan survives first contact with reality.<br /><br />In contrast, humans and animals move quickly, reflexively, and often with remarkably little advance planning, by relying on highly developed and intelligent feedback mechanisms that use sensory cues to correct mistakes and compensate for perturbations. For example, when serving a tennis ball, the player continually observes the ball and the racket, adjusting the motion of his hand so that they meet in the air. 
This kind of feedback is fast, efficient, and, crucially, can correct for mistakes or unexpected perturbations. Can we train robots to reliably handle complex real-world situations by using similar feedback mechanisms to handle perturbations and correct mistakes?<br /><br />While servoing and feedback control have been studied extensively in robotics, the question of how to define the right sensory cue remains exceptionally challenging, especially for rich modalities such as vision. So instead of choosing the cues by hand, we can program a robot to acquire them on its own from scratch, by learning from extensive experience in the real world. In our first experiments with real physical robots, we decided to tackle robotic grasping in clutter.<br /><br />A human child is able to reliably grasp objects after one year, and takes around four years to acquire more sophisticated precision grasps. However, networked robots can instantaneously share their experience with one another, so if we dedicate 14 separate robots to the job of learning grasping in parallel, we can acquire the necessary experience much faster. Below is a video of our robots practicing grasping a range of common office and household objects:<br /><div></div>While initially the grasps are executed at random and succeed only rarely, each day the latest experiences are used to train a deep <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a> (CNN) to learn to predict the outcome of a grasp, given a camera image and a potential motor command. This CNN is then deployed on the robots the following day, in the inner loop of a servoing mechanism that continually adjusts the robot&#8217;s motion to maximize the predicted chance of a successful grasp. In essence, the robot is constantly predicting, by observing the motion of its own hand, which kind of subsequent motion will maximize its chances of success. 
The result is continuous feedback: what we might call hand-eye coordination. Observing the behavior of the robot after over 800,000 grasp attempts, which is equivalent to about 3000 robot-hours of practice, we can see the beginnings of intelligent reactive behaviors. The robot observes its own gripper and corrects its motions in real time. It also exhibits interesting pre-grasp behaviors, like isolating a single object from a group. All of these behaviors emerged naturally from learning, rather than being programmed into the system.<br /><div></div>To evaluate whether the system achieves measurable benefit from continuous feedback, we can compare its performance to an open-loop baseline that more closely resembles the perception-planning-action loop described previously, albeit with both the open-loop grasps and the closed-loop servoing determined by a learned CNN trained on the same data. This approach is most similar to <a href="http://arxiv.org/abs/1509.06825">recent work by Pinto and Gupta</a>. With open-loop grasp selection, the robot chooses a single grasp pose from a single image, and then blindly executes this grasp. This method has a 34% average failure rate on the first 30 picking attempts for this set of office objects:<br /><div></div>Incorporating continuous feedback into the system reduces the failures by nearly half, down to 18% from 34%, and produces interesting corrections and adjustments:<br /><div></div>Neural networks have made great strides in allowing us to build computer programs that can process images, speech, text, and even draw pictures. However, introducing actions and control adds considerable new challenges, since every decision the network makes will affect what it sees next. Overcoming these challenges will bring us closer to building systems that understand the effects of their actions in the world. 
If we can bring the power of large-scale machine learning to robotic control, perhaps we will come one step closer to solving fundamental problems in robotics and automation.<br /><br /><i>The research on robotic hand-eye coordination and grasping was conducted by Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, with special thanks to colleagues at Google Research and X who've contributed their expertise and time to this research. An early preprint is <a href="http://arxiv.org/abs/1603.02199">available on arXiv</a></i>.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Sergey Levine, Research Scientist</span><br /><br />While we’ve recently seen great strides in robotic capability, the gap between human and robot motor skills remains vast. Machines still have a very long way to go to match human proficiency even at basic sensorimotor skills like grasping. However, by linking learning with continuous feedback and control, we might begin to bridge that gap, and in so doing make it possible for robots to intelligently and reliably handle the complexities of the real world.<br /><br />Consider for example <a href="https://www.youtube.com/watch?v=PomkJ4l9CMU">this robot</a> from <a href="http://www.kaist.edu/html/en/index.html">KAIST</a>, which won last year’s <a href="http://www.theroboticschallenge.org/">DARPA robotics challenge</a>. The remarkably precise and deliberate motions are deeply impressive. But they are also quite… robotic. Why is that? What makes robot behavior so distinctly robotic compared to human behavior? At a high level, current robots typically follow a sense-plan-act paradigm, where the robot observes the world around it, formulates an internal model, constructs a plan of action, and then executes this plan. This approach is modular and often effective, but tends to break down in the kinds of cluttered natural environments that are typical of the real world. Here, perception is imprecise, all models are wrong in some way, and no plan survives first contact with reality.<br /><br />In contrast, humans and animals move quickly, reflexively, and often with remarkably little advance planning, by relying on highly developed and intelligent feedback mechanisms that use sensory cues to correct mistakes and compensate for perturbations. For example, when serving a tennis ball, the player continually observes the ball and the racket, adjusting the motion of his hand so that they meet in the air. 
This kind of feedback is fast, efficient, and, crucially, can correct for mistakes or unexpected perturbations. Can we train robots to reliably handle complex real-world situations by using similar feedback mechanisms to handle perturbations and correct mistakes?<br /><br />While servoing and feedback control have been studied extensively in robotics, the question of how to define the right sensory cue remains exceptionally challenging, especially for rich modalities such as vision. So instead of choosing the cues by hand, we can program a robot to acquire them on its own from scratch, by learning from extensive experience in the real world. In our first experiments with real physical robots, we decided to tackle robotic grasping in clutter.<br /><br />A human child is able to reliably grasp objects after one year, and takes around four years to acquire more sophisticated precision grasps. However, networked robots can instantaneously share their experience with one another, so if we dedicate 14 separate robots to the job of learning grasping in parallel, we can acquire the necessary experience much faster. Below is a video of our robots practicing grasping a range of common office and household objects:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/iaF43Ze1oeI/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/iaF43Ze1oeI?rel=0&amp;feature=player_embedded" width="640"></iframe></div>While initially the grasps are executed at random and succeed only rarely, each day the latest experiences are used to train a deep <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a> (CNN) to learn to predict the outcome of a grasp, given a camera image and a potential motor command. 
This CNN is then deployed on the robots the following day, in the inner loop of a servoing mechanism that continually adjusts the robot’s motion to maximize the predicted chance of a successful grasp. In essence, the robot is constantly predicting, by observing the motion of its own hand, which kind of subsequent motion will maximize its chances of success. The result is continuous feedback: what we might call hand-eye coordination. Observing the behavior of the robot after over 800,000 grasp attempts, which is equivalent to about 3000 robot-hours of practice, we can see the beginnings of intelligent reactive behaviors. The robot observes its own gripper and corrects its motions in real time. It also exhibits interesting pre-grasp behaviors, like isolating a single object from a group. All of these behaviors emerged naturally from learning, rather than being programmed into the system.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/l8zKZLqkfII/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/l8zKZLqkfII?rel=0&amp;feature=player_embedded" width="640"></iframe></div>To evaluate whether the system achieves measurable benefit from continuous feedback, we can compare its performance to an open-loop baseline that more closely resembles the perception-planning-action loop described previously, albeit with both the open-loop grasps and the closed-loop servoing determined by a learned CNN trained on the same data. This approach is most similar to <a href="http://arxiv.org/abs/1509.06825">recent work by Pinto and Gupta</a>. With open-loop grasp selection, the robot chooses a single grasp pose from a single image, and then blindly executes this grasp. 
This method has a 34% average failure rate on the first 30 picking attempts for this set of office objects:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/Q9tDHuidzak/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/Q9tDHuidzak?rel=0&amp;feature=player_embedded" width="640"></iframe></div>Incorporating continuous feedback into the system reduces the failures by nearly half, down to 18% from 34%, and produces interesting corrections and adjustments:<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/H4V6NZLNu-c/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/H4V6NZLNu-c?rel=0&amp;feature=player_embedded" width="640"></iframe></div>Neural networks have made great strides in allowing us to build computer programs that can process images, speech, text, and even draw pictures. However, introducing actions and control adds considerable new challenges, since every decision the network makes will affect what it sees next. Overcoming these challenges will bring us closer to building systems that understand the effects of their actions in the world. If we can bring the power of large-scale machine learning to robotic control, perhaps we will come one step closer to solving fundamental problems in robotics and automation.<br /><br /><i>The research on robotic hand-eye coordination and grasping was conducted by Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, with special thanks to colleagues at Google Research and X who've contributed their expertise and time to this research. An early preprint is <a href="http://arxiv.org/abs/1603.02199">available on arXiv</a></i>.]]></content:encoded>
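The closed-loop servoing the post describes (a learned model scores candidate motor commands by predicted grasp success, and the best-scoring small command is executed, over and over) can be illustrated with a toy simulation. The predictor below is a hypothetical analytic stand-in for the grasp-prediction CNN, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground truth the controller never sees directly: the object's 2-D position.
object_pos = np.array([0.7, -0.2])

def predict_success(gripper_pos, command):
    # Stand-in for the grasp-prediction CNN g(image, motor command): here the
    # predicted success simply decays with the resulting distance to the object.
    return np.exp(-np.linalg.norm(gripper_pos + command - object_pos))

def servo_step(gripper_pos, n_candidates=64, max_step=0.1):
    # Inner loop of the servoing mechanism: sample small candidate motor
    # commands, score each with the model, and execute the best-scoring one.
    candidates = rng.uniform(-max_step, max_step, size=(n_candidates, 2))
    scores = [predict_success(gripper_pos, c) for c in candidates]
    return gripper_pos + candidates[int(np.argmax(scores))]

gripper = np.array([0.0, 0.0])
start_error = float(np.linalg.norm(gripper - object_pos))
for _ in range(30):
    gripper = servo_step(gripper)  # continuous feedback, one step at a time
final_error = float(np.linalg.norm(gripper - object_pos))
```

Because each small command is re-scored from the current state, the loop corrects its own errors along the way; this mirrors the hand-eye-coordination behavior described above, minus the learned perception.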
			<wfw:commentRss>https://googledata.org/google-research/deep-learning-for-robots-learning-from-large-scale-interaction/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Update on Fast Transit Routing with Transfer Patterns</title>
		<link>https://googledata.org/google-research/an-update-on-fast-transit-routing-with-transfer-patterns/</link>
		<comments>https://googledata.org/google-research/an-update-on-fast-transit-routing-with-transfer-patterns/#comments</comments>
		<pubDate>Wed, 02 Mar 2016 12:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=c2cd4c080483466b575a8f256d6f0c5a</guid>
		<description><![CDATA[<span>Arno Eigenwillig, Software Engineer on Google Maps Directions</span><br /><br />What is the best way to get from A to B by public transit? Google Maps is answering such queries for over 20,000 cities and towns in over 70 countries around the world, including large metro areas like New York, S&#227;o Paulo or Moscow, and some complete countries, such as Japan or Great Britain.<br /><div><a href="https://1.bp.blogspot.com/-GqQpmehJy0E/VtTDk2077HI/AAAAAAAAA7Y/ThiBUFbufMk/s1600/image00.png"><img border="0" height="368" src="https://1.bp.blogspot.com/-GqQpmehJy0E/VtTDk2077HI/AAAAAAAAA7Y/ThiBUFbufMk/s640/image00.png" width="640"></a></div>Since its <a href="https://googleblog.blogspot.ch/2005/12/public-transit-via-google.html">beginnings in 2005</a> with the single city of Portland, Oregon, the number of cities and countries served by Google&#8217;s public transit directions has been growing rapidly. With more and larger regions, the amount of data we need to search in order to provide optimal directions has grown as well. In 2010, the search speed of transit directions made a leap ahead of that growth and became fast enough to update the result <a href="http://google-latlong.blogspot.ch/2010/03/planning-your-public-transit-ride.html">while you drag the endpoints</a>. The technique behind that speed-up is the Transfer Patterns algorithm [1], which was created at Google&#8217;s engineering office in Zurich, Switzerland, by visiting researcher Hannah Bast and a number of Google engineers.<br /><br />I am happy to report that this research collaboration has continued and expanded with the <a href="http://research.google.com/research-outreach.html#/research-outreach/faculty-engagement/focused-research-awards">Google Focused Research Award</a> on <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning">Next-Generation Route Planning</a>. 
Over the past three years, this grant has supported <a href="http://ad.informatik.uni-freiburg.de/staff/bast">Hannah Bast</a>&#8217;s research group at the <a href="http://www.uni-freiburg.de/">University of Freiburg</a>, as well as the research groups of <a href="http://algo2.iti.kit.edu/sanders.php">Peter Sanders</a> and <a href="http://i11www.iti.uni-karlsruhe.de/en/members/dorothea_wagner/index">Dorothea Wagner</a> at the <a href="http://www.kit.edu/index.php">Karlsruhe Institute of Technology</a> (KIT).<br /><br />From the project&#8217;s numerous outcomes, I&#8217;d like to highlight two recent ones that re-examine the Transfer Patterns approach and massively improve it for continent-sized networks: <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.pdf"><b>Scalable Transfer Patterns</b></a> [2] and <a href="http://ad-publications.informatik.uni-freiburg.de/SIGSPATIAL_frequency_BS_2014.pdf"><b>Frequency-Based Search for Public Transit</b></a> [3] by Hannah Bast, <a href="http://ad.informatik.uni-freiburg.de/staff/storandt">Sabine Storandt</a> and Matthias Hertel. This blog post presents the results from these publications.<br /><br />The notion of a <i>transfer pattern</i> is easy to understand. Suppose you are at a transit stop downtown, call it A, and want to go to some stop B as quickly as possible. Suppose further you brought a printed schedule book but no smartphone. (This sounded plausible only a few years ago!) As a local, you might know that there are only two reasonable options:<br /><ol><li>Take a tram from A to C, then transfer at C to a bus to B.</li><li>Take the direct bus from A to B, which only runs infrequently.</li></ol>We say the first option has transfer pattern A-C-B, and the second option has transfer pattern A-B. Notice that no in-between stops are mentioned. 
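To make the lookup concrete, here is a minimal Python sketch (with made-up stops and departure times, not real schedule data or Google's implementation) of how the two patterns above restrict the schedule search to a handful of direct connections:

```python
from typing import Dict, List, Tuple

# Hypothetical timetable: (departure, arrival) pairs per direct connection,
# in minutes after midnight. Illustrative data only.
timetable: Dict[Tuple[str, str], List[Tuple[int, int]]] = {
    ("A", "C"): [(600, 610), (615, 625)],   # tram A -> C
    ("C", "B"): [(612, 630), (640, 658)],   # bus  C -> B
    ("A", "B"): [(700, 730)],               # infrequent direct bus
}

# Precomputed transfer patterns for the pair (A, B): every optimal trip
# follows one of these stop sequences, so all other lines can be ignored.
patterns = [["A", "C", "B"], ["A", "B"]]

def earliest_arrival(dep: int) -> float:
    """Best arrival time at B when leaving A no earlier than `dep`,
    scanning only the direct connections along each transfer pattern."""
    best = float("inf")
    for pattern in patterns:
        t, feasible = dep, True
        for leg in zip(pattern, pattern[1:]):
            # earliest arrival on this leg, departing at or after time t
            arrivals = [a for d, a in timetable.get(leg, []) if d >= t]
            if not arrivals:
                feasible = False
                break
            t = min(arrivals)
        if feasible:
            best = min(best, t)
    return best

print(earliest_arrival(600))  # via A-C-B: arrives at 630
```

Leaving at 10:00 (600), the A-C-B pattern wins (arrival 630); once the last tram to C is gone, the lookup falls back to the infrequent direct bus.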
This is very compact information, much less than the actual schedules, but it makes looking up the schedules significantly faster: Knowing that all optimal trips follow one of these patterns, you only need to look at those lines in the schedule book that provide direct connections from A to C, C to B and A to B. All other lines can safely be ignored: you know you will not miss a better option.<br /><br />While the basic idea of transfer patterns is indeed that simple, it takes more to make it work in practice. The transfer patterns of all optimal trips have to be computed ahead of time and stored, so that they are available to answer queries. Conceptually, we need transfer patterns for every pair of stops, because any pair could come up in a query. It is perfectly reasonable to compute them for all pairs within one city, or even one metro area that is densely criss-crossed by a transit network comprising, say, a thousand stops, yielding a million pairs to consider.<br /><br />As the scale of the problem increases from one metro area to an entire country or continent, this &#8220;all pairs&#8221; approach rapidly becomes expensive: ten thousand stops (10x more than above) already yield a hundred million pairs (100x more than above), and so on. Also, the individual transfer patterns become quite repetitive: For example, from any stop in Paris, France to any stop in Cologne, Germany, all optimal connections end up using the same few long-distance train lines in the middle; only the local connections to the railway stations depend on the specific pair of stops considered.<br /><br />However, designated long-distance connections are not the only way to travel between different local networks &#8211; they also overlap and connect to each other. 
For mid-range trips, there is no universally correct rule when to choose a long-distance train or intercity bus, short of actually comparing options with local or regional transit, too.<br /><br />The Scalable Transfer Patterns algorithm [2] does just that, but in a smart way. For starters, it uses what is known as <i>graph clustering</i> to cut the network into pieces, called <i>clusters</i>, that have a lot of connections inside but relatively few to the outside. As an example, the figure below (kindly provided by the authors) shows a partitioning of Germany into clusters. The stops highlighted in red are <i>border stops</i>: They connect directly to stops outside the cluster. Notice how they are a small fraction of the total network.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://2.bp.blogspot.com/-DAUmD3ohHlQ/VtTD6bC6DXI/AAAAAAAAA7s/KEa6gBXkEho/s1600/image02.png"><img border="0" src="https://2.bp.blogspot.com/-DAUmD3ohHlQ/VtTD6bC6DXI/AAAAAAAAA7s/KEa6gBXkEho/s1600/image02.png"></a></td></tr><tr><td>The public transit network of Germany (dots and lines), split into clusters (shown in various colors). Of all 251,763 stops, only 10,886 (4.32%) are boundary stops, highlighted as red boxes. <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.materials/germany.png">Click here</a> to view the full resolution image. [source: S. Storandt, 2016]</td></tr></tbody></table><div></div><div></div>Based on the clustering, the transfer patterns of all optimal connections are computed in two steps.<br /><br /><b>In step 1</b>, transfer patterns are computed for optimal connections inside each cluster. 
They are stored for query processing later on, but they also accelerate the search <i>through</i> a cluster in the following step: between the stops on its border, we only need to consider the connections captured in the transfer patterns.<br /><br />The next figure sketches how the transit network in the cluster around Berlin gets reduced to much fewer connections between border stations. (The central station stands out as a hub, as expected. It is a border station itself, because it has direct connections out of the cluster.)<br /><div></div><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/--Z-F6ZPlhFs/VtTDuXtv0bI/AAAAAAAAA7s/XKK1Etk98Zo/s1600/image01.png"><img border="0" src="https://4.bp.blogspot.com/--Z-F6ZPlhFs/VtTDuXtv0bI/AAAAAAAAA7s/XKK1Etk98Zo/s1600/image01.png"></a></td></tr><tr><td>The cluster of public transit connections around Berlin (shown as dots and lines in light blue), its border stops (highlighted as red boxes), and the transfer patterns of optimal connections between border stops (thick black lines; only the most important 111 of 592 are shown to keep the image legible). This cuts out 96.15% of the network (especially a lot of the high-frequency inner city trips, which makes the time savings even bigger). <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.materials/berlin.png">Click here</a> to view the full resolution image. [source: S. Storandt, 2016]</td></tr></tbody></table><b>In step 2</b>, transfer patterns can be computed for the entire network, that is, between any pair of clusters. 
This is done with the following twists:<br /><br /><ul><li>It suffices to consider trips from and to boundary stops of any cluster; the local transfer patterns from step 1 will supply the missing pieces later on.</li><li>The per-cluster transfer patterns from step 1 greatly accelerate the search across other clusters.</li><li>The search stops exploring any possible connection between two boundary stops as soon as it gets worse than a connection that sticks to long-distance transit between clusters (which may not always be optimal, but is always quick to compute).</li></ul><br />The results of steps 1 and 2 are stored and used to answer queries. For any given query from some A to some B, one can now easily stitch together a network of transfer patterns that covers all optimal connections from A to B. Looking up the direct connections on that small network (like in the introductory example) and finding the best one for the queried time is very fast, even if A and B are far away.<br /><br />The total storage space needed for this is much smaller than the space that would be needed for all pairs of stops, and the savings grow as the network gets larger. Extrapolating from their experiments, the researchers estimate [2] that Scalable Transfer Patterns for the whole world could be stored in 30 GB, cutting their estimate for the basic Transfer Patterns by a factor of a thousand(!). This is considerably more powerful than the &#8220;hub station&#8221; idea from the original Transfer Patterns paper [1].<br /><br />The time needed to compute Scalable Transfer Patterns is also estimated to shrink by three orders of magnitude: At a high level, the earlier phases of the algorithm accelerate the later ones, as described above. At a low level, a second optimization technique kicks in: exploiting the repetitiveness of schedules in time. 
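The query-time stitching described above boils down to taking the union of pattern edges from the precomputed pieces; here is an illustrative Python sketch (stop names and patterns are invented for illustration, not the production system):

```python
# Hypothetical precomputed patterns (step 1 and step 2), for a query A -> B:
#   - A to the border stops of A's cluster (step 1),
#   - border-to-border patterns between clusters (step 2),
#   - border stops of B's cluster to B (step 1).
local_src = [["A", "x1"], ["A", "x2"]]        # inside A's cluster
inter     = [["x1", "y1"], ["x2", "y2"]]      # across clusters
local_dst = [["y1", "B"], ["y2", "z", "B"]]   # inside B's cluster

def stitch(patterns_lists):
    """Union the direct-connection edges of all patterns into one small
    query graph that covers every optimal A-to-B connection."""
    edges = set()
    for patterns in patterns_lists:
        for p in patterns:
            edges.update(zip(p, p[1:]))       # consecutive stop pairs
    return edges

query_graph = stitch([local_src, inter, local_dst])
# Only these few edges need timetable lookups at query time,
# no matter how large the full network is.
print(sorted(query_graph))
```

The best connection for the queried departure time is then found on this tiny stitched graph, exactly as in the schedule-book example at the start.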
Recall that finding transfer patterns is all about finding the optimal connections between pairs of stops <i>at any possible departure time</i>.<br /><br />Frequency-based schedules (e.g., one bus every 10 minutes) cause a lot of similarity during the day, although it often doesn&#8217;t match up between lines (e.g., said bus runs every 10 minutes before 6pm and every 20 minutes after, and we seek connections to a train that runs every 12 minutes before 8pm and every 30 minutes after). Moreover, this similarity also exists from one day to the next, and we need to consider all permissible departure dates. <br /><br />The Frequency-Based Search for Public Transit [3] is carefully designed to find and take advantage of repetitive schedules while representing all one-off cases exactly. Compared to the set-up from the original Transfer Patterns paper [1], the authors estimate a whopping 60x acceleration of finding transfer patterns from this part alone.<br /><br />I am excited to see that the scalability questions originally left open by [1] have been answered so convincingly as part of this Focused Research Award. Please see the <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning#publications">list of publications</a> on the <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning">project&#8217;s website</a> for more outcomes of this award. Besides more on transfer patterns, they contain a wealth of other results about routing on road networks, transit networks, and with combinations of travel modes.<br /><br />References:<br /><br />[1] <a href="http://ad-publications.informatik.uni-freiburg.de/ESA_transferpatterns_BCEGHRV_2010.pdf">Fast Routing in Very Large Public Transportation Networks Using Transfer Patterns</a><br />by H. Bast, E. Carlsson, A. Eigenwillig, R. Geisberger, C. Harrelson, V. Raychev and F. Viger<br />(ESA 2010). 
[<a href="http://dx.doi.org/10.1007/978-3-642-15775-2_25">doi</a>]<br /><br />[2] <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.pdf">Scalable Transfer Patterns</a><br />by H. Bast, M. Hertel and S. Storandt (ALENEX 2016). [<a href="http://dx.doi.org/10.1137/1.9781611974317.2">doi</a>]<br /><br />[3] <a href="http://ad-publications.informatik.uni-freiburg.de/SIGSPATIAL_frequency_BS_2014.pdf">Frequency-based Search for Public Transit</a><br />by H. Bast and S. Storandt (SIGSPATIAL 2014). [<a href="http://doi.acm.org/10.1145/2666310.2666405">doi</a>]]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Arno Eigenwillig, Software Engineer on Google Maps Directions</span><br /><br />What is the best way to get from A to B by public transit? Google Maps is answering such queries for over 20,000 cities and towns in over 70 countries around the world, including large metro areas like New York, São Paulo or Moscow, and some complete countries, such as Japan or Great Britain.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-GqQpmehJy0E/VtTDk2077HI/AAAAAAAAA7Y/ThiBUFbufMk/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="368" src="https://1.bp.blogspot.com/-GqQpmehJy0E/VtTDk2077HI/AAAAAAAAA7Y/ThiBUFbufMk/s640/image00.png" width="640" /></a></div>Since its <a href="https://googleblog.blogspot.ch/2005/12/public-transit-via-google.html">beginnings in 2005</a> with the single city of Portland, Oregon, the number of cities and countries served by Google’s public transit directions has been growing rapidly. With more and larger regions, the amount of data we need to search in order to provide optimal directions has grown as well. In 2010, the search speed of transit directions made a leap ahead of that growth and became fast enough to update the result <a href="http://google-latlong.blogspot.ch/2010/03/planning-your-public-transit-ride.html">while you drag the endpoints</a>. 
The technique behind that speed-up is the Transfer Patterns algorithm [1], which was created at Google’s engineering office in Zurich, Switzerland, by visiting researcher Hannah Bast and a number of Google engineers.<br /><br />I am happy to report that this research collaboration has continued and expanded with the <a href="http://research.google.com/research-outreach.html#/research-outreach/faculty-engagement/focused-research-awards">Google Focused Research Award</a> on <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning">Next-Generation Route Planning</a>. Over the past three years, this grant has supported <a href="http://ad.informatik.uni-freiburg.de/staff/bast">Hannah Bast</a>’s research group at the <a href="http://www.uni-freiburg.de/">University of Freiburg</a>, as well as the research groups of <a href="http://algo2.iti.kit.edu/sanders.php">Peter Sanders</a> and <a href="http://i11www.iti.uni-karlsruhe.de/en/members/dorothea_wagner/index">Dorothea Wagner</a> at the <a href="http://www.kit.edu/index.php">Karlsruhe Institute of Technology</a> (KIT).<br /><br />From the project’s numerous outcomes, I’d like to highlight two recent ones that re-examine the Transfer Patterns approach and massively improve it for continent-sized networks: <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.pdf"><b>Scalable Transfer Patterns</b></a> [2] and <a href="http://ad-publications.informatik.uni-freiburg.de/SIGSPATIAL_frequency_BS_2014.pdf"><b>Frequency-Based Search for Public Transit</b></a> [3] by Hannah Bast, <a href="http://ad.informatik.uni-freiburg.de/staff/storandt">Sabine Storandt</a> and Matthias Hertel. This blog post presents the results from these publications.<br /><br />The notion of a <i>transfer pattern</i> is easy to understand. Suppose you are at a transit stop downtown, call it A, and want to go to some stop B as quickly as possible. 
Suppose further you brought a printed schedule book but no smartphone. (This sounded plausible only a few years ago!) As a local, you might know that there are only two reasonable options:<br /><ol><li>Take a tram from A to C, then transfer at C to a bus to B.</li><li>Take the direct bus from A to B, which only runs infrequently.</li></ol>We say the first option has transfer pattern A-C-B, and the second option has transfer pattern A-B. Notice that no in-between stops are mentioned. This is very compact information, much less than the actual schedules, but it makes looking up the schedules significantly faster: Knowing that all optimal trips follow one of these patterns, you only need to look at those lines in the schedule book that provide direct connections from A to C, C to B and A to B. All other lines can safely be ignored: you know you will not miss a better option.<br /><br />While the basic idea of transfer patterns is indeed that simple, it takes more to make it work in practice. The transfer patterns of all optimal trips have to be computed ahead of time and stored, so that they are available to answer queries. Conceptually, we need transfer patterns for every pair of stops, because any pair could come up in a query. It is perfectly reasonable to compute them for all pairs within one city, or even one metro area that is densely criss-crossed by a transit network comprising, say, a thousand stops, yielding a million pairs to consider.<br /><br />As the scale of the problem increases from one metro area to an entire country or continent, this “all pairs” approach rapidly becomes expensive: ten thousand stops (10x more than above) already yield a hundred million pairs (100x more than above), and so on. 
Also, the individual transfer patterns become quite repetitive: For example, from any stop in Paris, France to any stop in Cologne, Germany, all optimal connections end up using the same few long-distance train lines in the middle; only the local connections to the railway stations depend on the specific pair of stops considered.<br /><br />However, designated long-distance connections are not the only way to travel between different local networks – they also overlap and connect to each other. For mid-range trips, there is no universally correct rule when to choose a long-distance train or intercity bus, short of actually comparing options with local or regional transit, too.<br /><br />The Scalable Transfer Patterns algorithm [2] does just that, but in a smart way. For starters, it uses what is known as <i>graph clustering</i> to cut the network into pieces, called <i>clusters</i>, that have a lot of connections inside but relatively few to the outside. As an example, the figure below (kindly provided by the authors) shows a partitioning of Germany into clusters. The stops highlighted in red are <i>border stops</i>: They connect directly to stops outside the cluster. Notice how they are a small fraction of the total network.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-DAUmD3ohHlQ/VtTD6bC6DXI/AAAAAAAAA7s/KEa6gBXkEho/s1600/image02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://2.bp.blogspot.com/-DAUmD3ohHlQ/VtTD6bC6DXI/AAAAAAAAA7s/KEa6gBXkEho/s1600/image02.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The public transit network of Germany (dots and lines), split into clusters (shown in various colors). Of all 251,763 stops, only 10,886 (4.32%) are boundary stops, highlighted as red boxes. 
<a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.materials/germany.png">Click here</a> to view the full resolution image. [source: S. Storandt, 2016]</td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div>Based on the clustering, the transfer patterns of all optimal connections are computed in two steps.<br /><br /><b>In step 1</b>, transfer patterns are computed for optimal connections inside each cluster. They are stored for query processing later on, but they also accelerate the search <i>through</i> a cluster in the following step: between the stops on its border, we only need to consider the connections captured in the transfer patterns.<br /><br />The next figure sketches how the transit network in the cluster around Berlin gets reduced to much fewer connections between border stations. (The central station stands out as a hub, as expected. 
It is a border station itself, because it has direct connections out of the cluster.)<br /><div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/--Z-F6ZPlhFs/VtTDuXtv0bI/AAAAAAAAA7s/XKK1Etk98Zo/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://4.bp.blogspot.com/--Z-F6ZPlhFs/VtTDuXtv0bI/AAAAAAAAA7s/XKK1Etk98Zo/s1600/image01.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The cluster of public transit connections around Berlin (shown as dots and lines in light blue), its border stops (highlighted as red boxes), and the transfer patterns of optimal connections between border stops (thick black lines; only the most important 111 of 592 are shown to keep the image legible). This cuts out 96.15% of the network (especially a lot of the high-frequency inner city trips, which makes the time savings even bigger). <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.materials/berlin.png">Click here</a> to view the full resolution image. [source: S. Storandt, 2016]</td></tr></tbody></table><b>In step 2</b>, transfer patterns can be computed for the entire network, that is, between any pair of clusters. 
This is done with the following twists:<br /><br /><ul><li>It suffices to consider trips from and to boundary stops of any cluster; the local transfer patterns from step 1 will supply the missing pieces later on.</li><li>The per-cluster transfer patterns from step 1 greatly accelerate the search across other clusters.</li><li>The search stops exploring any possible connection between two boundary stops as soon as it gets worse than a connection that sticks to long-distance transit between clusters (which may not always be optimal, but is always quick to compute).</li></ul><br />The results of steps 1 and 2 are stored and used to answer queries. For any given query from some A to some B, one can now easily stitch together a network of transfer patterns that covers all optimal connections from A to B. Looking up the direct connections on that small network (like in the introductory example) and finding the best one for the queried time is very fast, even if A and B are far away.<br /><br />The total storage space needed for this is much smaller than the space that would be needed for all pairs of stops, and the savings grow as the network gets larger. Extrapolating from their experiments, the researchers estimate [2] that Scalable Transfer Patterns for the whole world could be stored in 30 GB, cutting their estimate for the basic Transfer Patterns by a factor of a thousand(!). This is considerably more powerful than the “hub station” idea from the original Transfer Patterns paper [1].<br /><br />The time needed to compute Scalable Transfer Patterns is also estimated to shrink by three orders of magnitude: At a high level, the earlier phases of the algorithm accelerate the later ones, as described above. At a low level, a second optimization technique kicks in: exploiting the repetitiveness of schedules in time. 
Recall that finding transfer patterns is all about finding the optimal connections between pairs of stops <i>at any possible departure time</i>.<br /><br />Frequency-based schedules (e.g., one bus every 10 minutes) cause a lot of similarity during the day, although it often doesn’t match up between lines (e.g., said bus runs every 10 minutes before 6pm and every 20 minutes after, and we seek connections to a train that runs every 12 minutes before 8pm and every 30 minutes after). Moreover, this similarity also exists from one day to the next, and we need to consider all permissible departure dates. <br /><br />The Frequency-Based Search for Public Transit [3] is carefully designed to find and take advantage of repetitive schedules while representing all one-off cases exactly. Compared to the set-up from the original Transfer Patterns paper [1], the authors estimate a whopping 60x acceleration of finding transfer patterns from this part alone.<br /><br />I am excited to see that the scalability questions originally left open by [1] have been answered so convincingly as part of this Focused Research Award. Please see the <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning#publications">list of publications</a> on the <a href="http://ad.informatik.uni-freiburg.de/projects/google-focused-research-award-next-generation-route-planning">project’s website</a> for more outcomes of this award. Besides more on transfer patterns, they contain a wealth of other results about routing on road networks, transit networks, and with combinations of travel modes.<br /><br />References:<br /><br />[1] <a href="http://ad-publications.informatik.uni-freiburg.de/ESA_transferpatterns_BCEGHRV_2010.pdf">Fast Routing in Very Large Public Transportation Networks Using Transfer Patterns</a><br />by H. Bast, E. Carlsson, A. Eigenwillig, R. Geisberger, C. Harrelson, V. Raychev and F. Viger<br />(ESA 2010). 
[<a href="http://dx.doi.org/10.1007/978-3-642-15775-2_25">doi</a>]<br /><br />[2] <a href="http://ad-publications.informatik.uni-freiburg.de/ALENEX_scalable_tp_BHS_2016.pdf">Scalable Transfer Patterns</a><br />by H. Bast, M. Hertel and S. Storandt (ALENEX 2016). [<a href="http://dx.doi.org/10.1137/1.9781611974317.2">doi</a>]<br /><br />[3] <a href="http://ad-publications.informatik.uni-freiburg.de/SIGSPATIAL_frequency_BS_2014.pdf">Frequency-based Search for Public Transit</a><br />by H. Bast and S. Storandt (SIGSPATIAL 2014). [<a href="http://doi.acm.org/10.1145/2666310.2666405">doi</a>]]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/an-update-on-fast-transit-routing-with-transfer-patterns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>And the winner of the $1 Million Little Box Challenge is…CE+T Power’s Red Electrical Devils</title>
		<link>https://googledata.org/google-research/and-the-winner-of-the-1-million-little-box-challenge-iscet-powers-red-electrical-devils/</link>
		<comments>https://googledata.org/google-research/and-the-winner-of-the-1-million-little-box-challenge-iscet-powers-red-electrical-devils/#comments</comments>
		<pubDate>Mon, 29 Feb 2016 21:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Research Blog]]></dc:creator>
				<category><![CDATA[Google Research]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9f62af3c6f982fe2de58883b2c24ce31</guid>
		<description><![CDATA[<span>Posted by Ross Koningstein, Engineering Director Emeritus, Google Research</span><br /><br />In July 2014, Google and the <a href="https://www.ieee.org/index.html">IEEE</a> launched the $1 Million <a href="https://www.littleboxchallenge.com/">Little Box Challenge</a>, an open competition to design and build a small kW-scale inverter with a power density greater than 50 Watts per cubic inch while meeting a number of other specifications related to efficiency, electrical noise and thermal performance. Over 2,000 teams from across the world registered for the competition and more than 80 proposals qualified for review by <a href="http://www.ieee-pels.org/">IEEE Power Electronics Society</a> and Google. In October 2015, <a href="http://googlegreenblog.blogspot.com/2015/10/finalists-announced-for-little-box.html">18 finalists were selected</a> to bring their inverters to the <a href="http://www.nrel.gov/">National Renewable Energy Laboratory</a> (NREL) for testing.<br /><div><a href="https://3.bp.blogspot.com/-yjgE6hFTse0/VtSJM6tTBNI/AAAAAAAAA6c/WSjvxHCMZd4/s1600/Screen%2BShot%2B2016-02-26%2Bat%2B10.43.25%2BAM.png"><img border="0" height="400" src="https://3.bp.blogspot.com/-yjgE6hFTse0/VtSJM6tTBNI/AAAAAAAAA6c/WSjvxHCMZd4/s640/Screen%2BShot%2B2016-02-26%2Bat%2B10.43.25%2BAM.png" width="640"></a></div>Today, Google and the IEEE are proud to announce that the grand prize winner of the $1 Million Little Box Challenge is <a href="http://www.cet-power.com/">CE+T Power</a>&#8217;s Red Electrical Devils. The Red Electrical Devils (named after <a href="http://www.belgianfootball.be/en/red-devils">Belgium&#8217;s national soccer team</a>) were declared the winner by a consensus of judges from Google, IEEE Power Electronics Society and NREL. 
Honorable mentions go to teams from <a href="http://www.schneider-electric.com/ww/en/">Schneider Electric</a> and <a href="http://www.feec.ece.vt.edu/">Virginia Tech&#8217;s Future Energy Electronics Center</a>.<br /><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="https://4.bp.blogspot.com/-SdLdMQxrU2o/VtSw0luEJYI/AAAAAAAAA60/r5fAz6C4Pac/s1600/image1.JPG"><img border="0" height="480" src="https://4.bp.blogspot.com/-SdLdMQxrU2o/VtSw0luEJYI/AAAAAAAAA60/r5fAz6C4Pac/s640/image1.JPG" width="640"></a></td></tr><tr><td>CE+T Power&#8217;s Red Electrical Devils receive $1 Million Little Box Challenge Prize</td></tr></tbody></table>Schneider, Virginia Tech and The Red Electrical Devils all built 2kW inverters that passed <a href="http://www.nrel.gov/news/press/2016/23654">100 hours of testing at NREL</a>, adhered to the technical specifications of the competition, and were recognized today in a ceremony at the <a href="http://www.arpae-summit.com/">ARPA-E Energy Innovation Summit</a> in Washington, DC. Among the three finalists, the Red Electrical Devils&#8217; inverter had the highest power density and smallest volume.<br /><div></div><div><a href="https://1.bp.blogspot.com/-_LrbTM5mjmE/VtS0PduuYDI/AAAAAAAAA7A/ovGYrmf8TOU/s1600/correct_image.png"><img border="0" height="168" src="https://1.bp.blogspot.com/-_LrbTM5mjmE/VtS0PduuYDI/AAAAAAAAA7A/ovGYrmf8TOU/s640/correct_image.png" width="640"></a></div><br />Impressively, the winning team exceeded the power density goal for the competition by a factor of 3, <i>which is more than 10 times more compact than commercially available inverters</i>! When we initially brainstormed technical targets for the Little Box Challenge, some of us at Google didn&#8217;t think such audacious goals could be achieved. 
Three teams from around the world proved decisively that it could be done.<br /><br /><b>Our takeaway: Establish a worthy goal and smart people will exceed it!</b><br /><br />Congratulations again to CE+T Power&#8217;s Red Electrical Devils, Schneider Electric and Virginia Tech&#8217;s Future Energy Electronics Center, and sincere thanks to our collaborators at IEEE and NREL. The finalists&#8217; technical approach documents will be posted on the <a href="https://www.littleboxchallenge.com/">Little Box Challenge</a> website until December 31, 2017. We hope this helps advance the state of the art and innovation in kW-scale inverters.]]></description>
				<content:encoded><![CDATA[<span class="byline-author">Posted by Ross Koningstein, Engineering Director Emeritus, Google Research</span><br /><br />In July 2014, Google and the <a href="https://www.ieee.org/index.html">IEEE</a> launched the $1 Million <a href="https://www.littleboxchallenge.com/">Little Box Challenge</a>, an open competition to design and build a small kW-scale inverter with a power density greater than 50 Watts per cubic inch while meeting a number of other specifications related to efficiency, electrical noise and thermal performance. Over 2,000 teams from across the world registered for the competition and more than 80 proposals qualified for review by <a href="http://www.ieee-pels.org/">IEEE Power Electronics Society</a> and Google. In October 2015, <a href="http://googlegreenblog.blogspot.com/2015/10/finalists-announced-for-little-box.html">18 finalists were selected</a> to bring their inverters to the <a href="http://www.nrel.gov/">National Renewable Energy Laboratory</a> (NREL) for testing.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-yjgE6hFTse0/VtSJM6tTBNI/AAAAAAAAA6c/WSjvxHCMZd4/s1600/Screen%2BShot%2B2016-02-26%2Bat%2B10.43.25%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-yjgE6hFTse0/VtSJM6tTBNI/AAAAAAAAA6c/WSjvxHCMZd4/s640/Screen%2BShot%2B2016-02-26%2Bat%2B10.43.25%2BAM.png" width="640" /></a></div>Today, Google and the IEEE are proud to announce that the grand prize winner of the $1 Million Little Box Challenge is <a href="http://www.cet-power.com/">CE+T Power</a>’s Red Electrical Devils. The Red Electrical Devils (named after <a href="http://www.belgianfootball.be/en/red-devils">Belgium’s national soccer team</a>) were declared the winner by a consensus of judges from Google, IEEE Power Electronics Society and NREL. 
Honorable mentions go to teams from <a href="http://www.schneider-electric.com/ww/en/">Schneider Electric</a> and <a href="http://www.feec.ece.vt.edu/">Virginia Tech’s Future Energy Electronics Center</a>.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-SdLdMQxrU2o/VtSw0luEJYI/AAAAAAAAA60/r5fAz6C4Pac/s1600/image1.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="480" src="https://4.bp.blogspot.com/-SdLdMQxrU2o/VtSw0luEJYI/AAAAAAAAA60/r5fAz6C4Pac/s640/image1.JPG" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">CE+T Power’s Red Electrical Devils receive $1 Million Little Box Challenge Prize</td></tr></tbody></table>Schneider, Virginia Tech and The Red Electrical Devils all built 2kW inverters that passed <a href="http://www.nrel.gov/news/press/2016/23654">100 hours of testing at NREL</a>, adhered to the technical specifications of the competition, and were recognized today in a ceremony at the <a href="http://www.arpae-summit.com/">ARPA-E Energy Innovation Summit</a> in Washington, DC. 
Among the three finalists, the Red Electrical Devils’ inverter had the highest power density and smallest volume.<br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-_LrbTM5mjmE/VtS0PduuYDI/AAAAAAAAA7A/ovGYrmf8TOU/s1600/correct_image.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="168" src="https://1.bp.blogspot.com/-_LrbTM5mjmE/VtS0PduuYDI/AAAAAAAAA7A/ovGYrmf8TOU/s640/correct_image.png" width="640" /></a></div><br />Impressively, the winning team exceeded the power density goal for the competition by a factor of 3, <i>which is more than 10 times more compact than commercially available inverters</i>! When we initially brainstormed technical targets for the Little Box Challenge, some of us at Google didn’t think such audacious goals could be achieved. Three teams from around the world proved decisively that it could be done.<br /><br /><b>Our takeaway: Establish a worthy goal and smart people will exceed it!</b><br /><br />Congratulations again to CE+T Power’s Red Electrical Devils, Schneider Electric and Virginia Tech’s Future Energy Electronics Center, and sincere thanks to our collaborators at IEEE and NREL. The finalists’ technical approach documents will be posted on the <a href="https://www.littleboxchallenge.com/">Little Box Challenge</a> website until December 31, 2017. We hope this helps advance the state of the art and innovation in kW-scale inverters.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-research/and-the-winner-of-the-1-million-little-box-challenge-iscet-powers-red-electrical-devils/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
