Google Research

How Robots Can Acquire New Skills from Their Shared Experience

October 3rd, 2016 | by Research Blog | published in Google Research

Posted by Sergey Levine (Google Brain Team), Timothy Lillicrap (DeepMind), Mrinal Kalakrishnan (X)

The ability to learn from experience will likely be a key in enabling robots to help with complex real-world tasks, from assisting the elderly with chores and daily activities, to helping us in offices and hospitals, to performing jobs that are too dangerous or unpleasant for people. However, if each robot must learn its full repertoire of skills for these tasks only from its own experience, it could take far too long to acquire a rich enough range of behaviors to be useful. Could we bridge this gap by making it possible for robots to collectively learn from each other’s experiences?

While machine learning algorithms have made great strides in natural language understanding and speech recognition, the kind of symbolic high-level reasoning that allows people to communicate complex concepts in words remains out of reach for machines. However, robots can instantaneously transmit their experience to other robots over the network – sometimes known as “cloud robotics” – and it is this ability that can let them learn from each other.

This is true even for seemingly simple low-level skills. Humans and animals excel at adaptive motor control that integrates their senses, reflexes, and muscles in a closely coordinated feedback loop. Robots still struggle with these basic skills in the real world, where the variability and complexity of the environment demands well-honed behaviors that are not easily fooled by distractors. If we enable robots to transmit their experiences to each other, could they learn to perform motion skills in close coordination with sensing in realistic environments?

We previously wrote about how multiple robots could pool their experiences to learn a grasping task. Here, we will discuss new experiments that we conducted to investigate three possible approaches for general-purpose skill learning across multiple robots: learning motion skills directly from experience, learning internal models of physics, and learning skills with human assistance. In all three cases, multiple robots shared their experiences to build a common model of the skill. The skills learned by the robots are still relatively simple — pushing objects and opening doors — but by learning such skills more quickly and efficiently through collective learning, robots might in the future acquire richer behavioral repertoires that could eventually make it possible for them to assist us in our daily lives.

Learning from raw experience with model-free reinforcement learning.
Perhaps one of the simplest ways for robots to teach each other is to pool information about their successes and failures in the world. Humans and animals acquire many skills by direct trial-and-error learning. During this kind of ‘model-free’ learning — so called because there is no explicit model of the environment formed — they explore variations on their existing behavior and then reinforce and exploit the variations that give bigger rewards. In combination with deep neural networks, model-free algorithms have recently proved to be surprisingly effective and have been key to successes with the Atari video game system and playing Go. Having multiple robots allows us to experiment with sharing experiences to speed up this kind of direct learning in the real world.

In these experiments we tasked robots with trying to move their arms to goal locations, or reaching to and opening a door. Each robot has a copy of a neural network that allows it to estimate the value of taking a given action in a given state. By querying this network, the robot can quickly decide what actions might be worth taking in the world. When a robot acts, we add noise to the actions it selects, so the resulting behavior is sometimes a bit better than previously observed, and sometimes a bit worse. This allows each robot to explore different ways of approaching a task. Records of the actions taken by the robots, their behaviors, and the final outcomes are sent back to a central server. The server collects the experiences from all of the robots and uses them to iteratively improve the neural network that estimates value for different states and actions. The model-free algorithms we employed look across both good and bad experiences and distill these into a new network that is better at understanding how action and success are related. Then, at regular intervals, each robot takes a copy of the updated network from the server and begins to act using the information in its new network. Given that this updated network is a bit better at estimating the true value of actions in the world, the robots will produce better behavior. This cycle can then be repeated to continue improving on the task. In the video below, a robot explores the door opening task.

With a few hours of practice, robots sharing their raw experience learn to make reaches to targets, and to open a door by making contact with the handle and pulling. In the case of door opening, the robots learn to deal with the complex physics of the contacts between the hook and the door handle without building an explicit model of the world, as can be seen in the example below:

Learning how the world works by interacting with objects.
Direct trial-and-error reinforcement learning is a great way to learn individual skills. However, humans and animals don’t learn exclusively by trial and error. We also build mental models about our environment and imagine how the world might change in response to our actions.

We can start with the simplest of physical interactions, and have our robots learn the basics of cause and effect from reflecting on their own experiences. In this experiment, we had the robots play with a wide variety of common household objects by randomly prodding and pushing them inside a tabletop bin. The robots again shared their experiences with each other and together built a single predictive model that attempted to forecast what the world might look like in response to their actions. This predictive model can make simple, if slightly blurry, forecasts about future camera images when provided with the current image and a possible sequence of actions that the robot might execute:

Top row: robotic arms interacting with common household items.
Bottom row: Predicted future camera images given an initial image and a sequence of actions.

Once this model is trained, the robots can use it to perform purposeful manipulations, for example based on user commands. In our prototype, a user can command the robot to move a particular object simply by clicking on that object, and then clicking on the point where the object should go:

The robots in this experiment were not told anything about objects or physics: they only see that the command requires a particular pixel to be moved to a particular place. However, because they have seen so many object interactions in their shared past experiences, they can forecast how particular actions will affect particular pixels. In order for such an implicit understanding of physics to emerge, the robots must be provided with a sufficient breadth of experience. This requires either a lot of time, or sharing the combined experiences of many robots. An extended video on this project may be found here.

Learning with the help of humans.
So far, we discussed how robots can learn entirely on their own. However, human guidance is important, not just for telling the robot what to do, but also for helping the robots along. We have a lot of intuition about how various manipulation skills can be performed, and it only seems natural that transferring this intuition to robots can help them learn these skills a lot faster. In the next experiment, we provided each robot with a different door, and guided each of them by hand to show how these doors can be opened. These demonstrations are encoded into a single combined strategy for all robots, called a policy. The policy is a deep neural network which converts camera images to robot actions, and is maintained on a central server. The following video shows the instructor demonstrating the door-opening skill to a robot:

Next, the robots collectively improve this policy through a trial-and-error learning process. Each robot attempts to open its own door using the latest available policy, with some added noise for exploration. These attempts allow each robot to plan a better strategy for opening the door the next time around, and improve the policy accordingly:

Not surprisingly, we find that robots learn more effectively if they are trained on a curriculum of tasks that are gradually increasing in difficulty. In our experiment, each robot starts off by practicing the door-opening skill on a specific position and orientation of the door that the instructor had previously shown it. As it gets better at performing the task, the instructor starts to alter the position and orientation of the door to be just a bit beyond the current capabilities of the policy, but not so difficult that it fails entirely. This allows the robots to gradually increase their skill level over time, and expands the range of situations they can handle. The combination of human-guidance with trial-and-error learning allowed the robots to collectively learn the skill of door-opening in just a couple of hours. Since the robots were trained on doors that look different from each other, the final policy succeeds on a door with a handle that none of the robots had seen before:

In all three of the experiments described above, the ability to communicate and exchange their experiences allows the robots to learn more quickly and effectively. This becomes particularly important when we combine robotic learning with deep learning, as is the case in all of the experiments discussed above. We’ve seen before that deep learning works best when provided with ample training data. For example, the popular ImageNet benchmark uses over 1.5 million labeled examples. While such a quantity of data is not impossible for a single robot to gather over a few years, it is much more efficient to gather the same volume of experience from multiple robots over the course of a few weeks. Besides faster learning times, this approach might benefit from the greater diversity of experience: a real-world deployment might involve multiple robots in different places and different settings, sharing heterogeneous, varied experiences to build a single highly generalizable representation.

Of course, the kinds of behaviors that robots today can learn are still quite limited. Even basic motion skills, such as picking up objects and opening doors, remain in the realm of cutting edge research. In all of these experiments, a human engineer is still needed to tell the robots what they should learn to do by specifying a detailed objective function. However, as algorithms improve and robots are deployed more widely, their ability to share and pool their experiences could be instrumental for enabling them to assist us in our daily lives.

The experiments on learning by trial-and-error were conducted by Shixiang (Shane) Gu and Ethan Holly from the Google Brain team, and Timothy Lillicrap from DeepMind. Work on learning predictive models was conducted by Chelsea Finn from the Google Brain team, and the research on learning from demonstration was conducted by Yevgen Chebotar, Ali Yahya, Adrian Li, and Mrinal Kalakrishnan from X. We would also like to acknowledge contributions by Peter Pastor, Gabriel Dulac-Arnold, and Jon Scholz. Articles about each of the experiments discussed in this blog post can be found below:

Deep Reinforcement Learning for Robotic Manipulation. Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine. [video]

Deep Visual Foresight for Planning Robot Motion. Chelsea Finn, Sergey Levine. [video] [data]

Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search.
Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, Sergey Levine. [video]

Path Integral Guided Policy Search. Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, Sergey Levine. [video]

ddd

Introducing the Open Images Dataset

September 30th, 2016 | by Research Blog | published in Google Research

Posted by Ivan Krasin and Tom Duerig, Software Engineers

In the last few years, advances in machine learning have enabled Computer Vision to progress rapidly, allowing for systems that can automatically caption images to apps that can create natural language replies in response to shared photos. Much of this progress can be attributed to publicly available image datasets, such as ImageNet and COCO for supervised learning, and YFCC100M for unsupervised learning.

Today, we introduce Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch and the images are listed as having a Creative Commons Attribution license^*.

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. Here are some examples:

Annotated images form the Open Images dataset. Left: Ghost Arches by Kevin Krejci. Right: Some Silverware by J B. Both images used under CC BY 2.0 license

We have trained an Inception v3 model based on Open Images annotations alone, and the model is good enough to be used for fine-tuning applications as well as for other things, like DeepDream or artistic style transfer which require a well developed hierarchy of filters. We hope to improve the quality of the annotations in Open Images the coming months, and therefore the quality of models which can be trained.

The dataset is a product of a collaboration between Google, CMU and Cornell universities, and there are a number of research papers built on top of the Open Images dataset in the works. It is our hope that datasets like Open Images and the recently released YouTube-8M will be useful tools for the machine learning community.

* While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.^↩

ddd

Image Compression with Neural Networks

September 29th, 2016 | by Research Blog | published in Google Research

Posted by Nick Johnston and David Minnen, Software Engineers

Data compression is used nearly everywhere on the internet – the videos you watch online, the images you share, the music you listen to, even the blog you’re reading right now. Compression techniques make sharing the content you want quick and efficient. Without data compression, the time and bandwidth costs for getting the information you need, when you need it, would be exorbitant!

In “Full Resolution Image Compression with Recurrent Neural Networks“, we expand on our previous research on data compression using neural networks, exploring whether machine learning can provide better results for image compression like it has for image recognition and text summarization. Furthermore, we are releasing our compression model via TensorFlow so you can experiment with compressing your own images with our network.

We introduce an architecture that uses a new variant of the Gated Recurrent Unit (a type of RNN that allows units to save activations and process sequences) called Residual Gated Recurrent Unit (Residual GRU). Our Residual GRU combines existing GRUs with the residual connections introduced in “Deep Residual Learning for Image Recognition” to achieve significant image quality gains for a given compression rate. Instead of using a DCT to generate a new bit representation like many compression schemes in use today, we train two sets of neural networks – one to create the codes from the image (encoder) and another to create the image from the codes (decoder).

Our system works by iteratively refining a reconstruction of the original image, with both the encoder and decoder using Residual GRU layers so that additional information can pass from one iteration to the next. Each iteration adds more bits to the encoding, which allows for a higher quality reconstruction. Conceptually, the network operates as follows:

The initial residual, R[0], corresponds to the original image I: R[0] = I.
Set i=1 for to the first iteration.
Iteration[i] takes R[i-1] as input and runs the encoder and binarizer to compress the image into B[i].
Iteration[i] runs the decoder on B[i] to generate a reconstructed image P[i].
The residual for Iteration[i] is calculated: R[i] = I – P[i].
Set i=i+1 and go to Step 3 (up to the desired number of iterations).

The residual image represents how different the current version of the compressed image is from the original. This image is then given as input to the network with the goal of removing the compression errors from the next version of the compressed image. The compressed image is now represented by the concatenation of B[1] through B[N]. For larger values of N, the decoder gets more information on how to reduce the errors and generate a higher quality reconstruction of the original image.

To understand how this works, consider the following example of the first two iterations of the image compression network, shown in the figures below. We start with an image of a lighthouse. On the first pass through the network, the original image is given as an input (R[0] = I). P[1] is the reconstructed image. The difference between the original image and encoded image is the residual, R[1], which represents the error in the compression.

Left: Original image, I = R[0]. Center: Reconstructed image, P[1]. Right: the residual, R[1], which represents the error introduced by compression.

On the second pass through the network, R[1] is given as the network’s input (see figure below). A higher quality image P[2] is then created. So how does the system recreate such a good image (P[2], center panel below) from the residual R[1]? Because the model uses recurrent nodes with memory, the network saves information from each iteration that it can use in the next one. It learned something about the original image in Iteration[1] that is used along with R[1] to generate a better P[2] from B[2]. Lastly, a new residual, R[2] (right), is generated by subtracting P[2] from the original image. This time the residual is smaller since there are fewer differences between the reconstructed image, and what we started with.

The second pass through the network. Left: R[1] is given as input. Center: A higher quality reconstruction, P[2]. Right: A smaller residual R[2] is generated by subtracting P[2] from the original image.

At each further iteration, the network gains more information about the errors introduced by compression (which is captured by the residual image). If it can use that information to predict the residuals even a little bit, the result is a better reconstruction. Our models are able to make use of the extra bits up to a point. We see diminishing returns, and at some point the representational power of the network is exhausted.

To demonstrate file size and quality differences, we can take a photo of Vash, a Japanese Chin, and generate two compressed images, one JPEG and one Residual GRU. Both images target a perceptual similarity of 0.9 MS-SSIM, a perceptual quality metric that reaches 1.0 for identical images. The image generated by our learned model results in an file 25% smaller than JPEG.

Left: Original image (1419 KB PNG) at ~1.0 MS-SSIM. Center: JPEG (33 KB) at ~0.9 MS-SSIM. Right: Residual GRU (24 KB) at ~0.9 MS-SSIM. This is 25% smaller for a comparable image quality

Taking a look around his nose and mouth, we see that our method doesn’t have the magenta blocks and noise in the middle of the image as seen in JPEG. This is due to the blocking artifacts produced by JPEG, whereas our compression network works on the entire image at once. However, there’s a tradeoff — in our model the details of the whiskers and texture are lost, but the system shows great promise in reducing artifacts.

Left: Original. Center: JPEG. Right: Residual GRU.

While today’s commonly used codecs perform well, our work shows that using neural networks to compress images results in a compression scheme with higher quality and smaller file sizes. To learn more about the details of our research and a comparison of other recurrent architectures, check out our paper. Our future work will focus on even better compression quality and faster models, so stay tuned!

ddd

Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research

September 28th, 2016 | by Research Blog | published in Google Research, Youtube

Posted by Sudheendra Vijayanarasimhan and Paul Natsev, Software Engineers

Many recent breakthroughs in machine learning and machine perception have come from the availability of large labeled datasets, such as ImageNet, which has millions of images labeled with thousands of classes. Their availability has significantly accelerated research in image understanding, for example on detecting and classifying objects in static images.

Video analysis provides even more information for detecting and recognizing objects, and understanding human actions and interactions with the world. Improving video understanding can lead to better video search and discovery, similarly to how image understanding helped re-imagine the photos experience. However, one of the key bottlenecks for further advancements in this area has been the lack of real-world video datasets with the same scale and diversity as image datasets.

Today, we are excited to announce the release of YouTube-8M, a dataset of 8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 Knowledge Graph entities. This represents a significant increase in scale and diversity compared to existing video datasets. For example, Sports-1M, the largest existing labeled video dataset we are aware of, has around 1 million YouTube videos and 500 sports-specific classes–YouTube-8M represents nearly an order of magnitude increase in both number of videos and classes.

In order to construct a labeled video dataset of this scale, we needed to address two key challenges: (1) video is much more time-consuming to annotate manually than images, and (2) video is very computationally expensive to process and store. To overcome (1), we turned to YouTube and its video annotation system, which identifies relevant Knowledge Graph topics for all public YouTube videos. While these annotations are machine-generated, they incorporate powerful user engagement signals from millions of users as well as video metadata and content analysis. As a result, the quality of these annotations is sufficiently high to be useful for video understanding research and benchmarking purposes.

To ensure the stability and quality of the labeled video dataset, we used only public videos with more than 1000 views, and we constructed a diverse vocabulary of entities, which are visually observable and sufficiently frequent. The vocabulary construction was a combination of frequency analysis, automated filtering, verification by human raters that the entities are visually observable, and grouping into 24 top-level verticals (more details in our technical report). The figures below depict the dataset browser and the distribution of videos along the top-level verticals, and illustrate the dataset’s scale and diversity.

A dataset explorer allows browsing and searching the full vocabulary of Knowledge Graph entities, grouped in 24 top-level verticals, along with corresponding videos. This screenshot depicts a subset of dataset videos annotated with the entity “Guitar”.

The distribution of videos in the top-level verticals illustrates the scope and diversity of the dataset and reflects the natural distribution of popular YouTube videos.

To address (2), we had to overcome the storage and computational resource bottlenecks that researchers face when working with videos. Pursuing video understanding at YouTube-8M’s scale would normally require a petabyte of video storage and dozens of CPU-years worth of processing. To make the dataset useful to researchers and students with limited computational resources, we pre-processed the videos and extracted frame-level features using a state-of-the-art deep learning model–the publicly available Inception-V3 image annotation model trained on ImageNet. These features are extracted at 1 frame-per-second temporal resolution, from 1.9 billion video frames, and are further compressed to fit on a single commodity hard disk (less than 1.5 TB). This makes it possible to download this dataset and train a baseline TensorFlow model at full scale on a single GPU in less than a day!

We believe this dataset can significantly accelerate research on video understanding as it enables researchers and students without access to big data or big machines to do their research at previously unprecedented scale. We hope this dataset will spur exciting new research on video modeling architectures and representation learning, especially approaches that deal effectively with noisy or incomplete labels, transfer learning and domain adaptation. In fact, we show that pre-training models on this dataset and applying / fine-tuning on other external datasets leads to state of the art performance on them (e.g. ActivityNet, Sports-1M). You can read all about our experiments using this dataset, along with more details on how we constructed it, in our technical report.

ddd

A Neural Network for Machine Translation, at Production Scale

September 27th, 2016 | by Research Blog | published in Google Research, Google Translate

Posted by Quoc V. Le & Mike Schuster, Research Scientists, Google Brain Team

Ten years ago, we announced the launch of Google Translate, together with the use of Phrase-Based Machine Translation as the key algorithm behind this service. Since then, rapid advances in machine intelligence have improved our speech recognition and image recognition capabilities, but improving machine translation remains a challenging goal.

Today we announce the Google Neural Machine Translation system (GNMT), which utilizes state-of-the-art training techniques to achieve the largest improvements to date for machine translation quality. Our full research results are described in a new technical report we are releasing today: “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” [1].

A few years ago we started using Recurrent Neural Networks (RNNs) to directly learn the mapping between an input sequence (e.g. a sentence in one language) to an output sequence (that same sentence in another language) [2]. Whereas Phrase-Based Machine Translation (PBMT) breaks an input sentence into words and phrases to be translated largely independently, Neural Machine Translation (NMT) considers the entire input sentence as a unit for translation.The advantage of this approach is that it requires fewer engineering design choices than previous Phrase-Based translation systems. When it first came out, NMT showed equivalent accuracy with existing Phrase-Based translation systems on modest-sized public benchmark data sets.

Since then, researchers have proposed many techniques to improve NMT, including work on handling rare words by mimicking an external alignment model [3], using attention to align input words and output words [4] and breaking words into smaller units to cope with rare words [5,6]. Despite these improvements, NMT wasn’t fast or accurate enough to be used in a production system, such as Google Translate. Our new paper [1] describes how we overcame the many challenges to make NMT work on very large data sets and built a system that is sufficiently fast and accurate enough to provide better translations for Google’s users and services.

Data from side-by-side evaluations, where human raters compare the quality of translations for a given source sentence. Scores range from 0 to 6, with 0 meaning “completely nonsense translation”, and 6 meaning “perfect translation.”

The following visualization shows the progression of GNMT as it translates a Chinese sentence to English. First, the network encodes the Chinese words as a list of vectors, where each vector represents the meaning of all words read so far (“Encoder”). Once the entire sentence is read, the decoder begins, generating the English sentence one word at a time (“Decoder”). To generate the translated word at each step, the decoder pays attention to a weighted distribution over the encoded Chinese vectors most relevant to generate the English word (“Attention”; the blue link transparency represents how much the decoder pays attention to an encoded word).

Using human-rated side-by-side comparison as a metric, the GNMT system produces translations that are vastly improved compared to the previous phrase-based production system. GNMT reduces translation errors by more than 55%-85% on several major language pairs measured on sampled sentences from Wikipedia and news websites with the help of bilingual human raters.

An example of a translation produced by our system for an input sentence sampled from a news site. Go here for more examples of translations for input sentences sampled randomly from news sites and books.

In addition to releasing this research paper today, we are announcing the launch of GNMT in production on a notoriously difficult language pair: Chinese to English. The Google Translate mobile and web apps are now using GNMT for 100% of machine translations from Chinese to English—about 18 million translations per day. The production deployment of GNMT was made possible by use of our publicly available machine learning toolkit TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting the stringent latency requirements of the Google Translate product. Translating from Chinese to English is one of the more than 10,000 language pairs supported by Google Translate, and we will be working to roll out GNMT to many more of these over the coming months.

Machine translation is by no means solved. GNMT can still make significant errors that a human translator would never make, like dropping words and mistranslating proper names or rare terms, and translating sentences in isolation rather than considering the context of the paragraph or page. There is still a lot of work we can do to serve our users better. However, GNMT represents a significant milestone. We would like to celebrate it with the many researchers and engineers—both within Google and the wider community—who have contributed to this direction of research in the past few years.

Acknowledgements:
We thank members of the Google Brain team and the Google Translate team for the help with the project. We thank Nikhil Thorat and the Big Picture team for the visualization.

References:
[1] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016.
[2] Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Advances in Neural Information Processing Systems, 2014.
[3] Addressing the rare word problem in neural machine translation, Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Proceedings of the 53th Annual Meeting of the Association for Computational Linguistics, 2015.
[4] Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. International Conference on Learning Representations, 2015.
[5] Japanese and Korean voice search, Mike Schuster, and Kaisuke Nakajima. IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
[6] Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.

ddd

Show and Tell: image captioning open sourced in TensorFlow

September 22nd, 2016 | by Research Blog | published in Google Research

Posted by Chris Shallue, Software Engineer, Google Brain Team

In 2014, research scientists on the Google Brain team trained a machine learning system to automatically produce captions that accurately describe images. Further development of that system led to its success in the Microsoft COCO 2015 image captioning challenge, a competition to compare the best algorithms for computing accurate image captions, where it tied for first place.

Today, we’re making the latest version of our image captioning system available as an open source model in TensorFlow. This release contains significant improvements to the computer vision component of the captioning system, is much faster to train, and produces more detailed and accurate descriptions compared to the original system. These improvements are outlined and analyzed in the paper Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, published in IEEE Transactions on Pattern Analysis and Machine Intelligence.

Automatically captioned by our system.

So what’s new?

Our 2014 system used the Inception V1 image classification model to initialize the image encoder, which produces the encodings that are useful for recognizing different objects in the images. This was the best image model available at the time, achieving 89.6% top-5 accuracy on the benchmark ImageNet 2012 image classification task. We replaced this in 2015 with the newer Inception V2 image classification model, which achieves 91.8% accuracy on the same task. The improved vision component gave our captioning system an accuracy boost of 2 points in the BLEU-4 metric (which is commonly used in machine translation to evaluate the quality of generated sentences) and was an important factor of its success in the captioning challenge.

Today’s code release initializes the image encoder using the Inception V3 model, which achieves 93.9% accuracy on the ImageNet classification task. Initializing the image encoder with a better vision model gives the image captioning system a better ability to recognize different objects in the images, allowing it to generate more detailed and accurate descriptions. This gives an additional 2 points of improvement in the BLEU-4 metric over the system used in the captioning challenge.

Another key improvement to the vision component comes from fine-tuning the image model. This step addresses the problem that the image encoder is initialized by a model trained to classify objects in images, whereas the goal of the captioning system is to describe the objects in images using the encodings produced by the image model. For example, an image classification model will tell you that a dog, grass and a frisbee are in the image, but a natural description should also tell you the color of the grass and how the dog relates to the frisbee.

In the fine-tuning phase, the captioning system is improved by jointly training its vision and language components on human generated captions. This allows the captioning system to transfer information from the image that is specifically useful for generating descriptive captions, but which was not necessary for classifying objects. In particular, after fine-tuning it becomes better at correctly describing the colors of objects. Importantly, the fine-tuning phase must occur after the language component has already learned to generate captions – otherwise, the noisiness of the randomly initialized language component causes irreversible corruption to the vision component. For more details, read the full paper here.

Left: the better image model allows the captioning model to generate more detailed and accurate descriptions. Right: after fine-tuning the image model, the image captioning system is more likely to describe the colors of objects correctly.

Until recently our image captioning system was implemented in the DistBelief software framework. The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief on an Nvidia K20 GPU, meaning that total training time is just 25% of the time previously required.

A natural question is whether our captioning system can generate novel descriptions of previously unseen contexts and interactions. The system is trained by showing it hundreds of thousands of images that were captioned manually by humans, and it often re-uses human captions when presented with scenes similar to what it’s seen before.

When the model is presented with scenes similar to what it’s seen before, it will often re-use human generated captions.

So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data? Excitingly, our model does indeed develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images. Moreover, it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.

Our model generates a completely new caption using concepts learned from similar scenes in the training set.

We hope that sharing this model in TensorFlow will help push forward image captioning research and applications, and will also allow interested people to learn and have fun. To get started training your own image captioning system, and for more details on the neural network architecture, navigate to the model’s home-page here. While our system uses the Inception V3 image classification model, you could even try training our system with the recently released Inception-ResNet-v2 model to see if it can do even better!

ddd

The 280-Year-Old Algorithm Inside Google Trips

September 20th, 2016 | by Research Blog | published in Google Research

Posted by Bogdan Arsintescu, Software Engineer & Sreenivas Gollapudi, Kostas Kollias, Tamas Sarlos and Andrew Tomkins, Research Scientists

Algorithms Engineering is a lot of fun because algorithms do not go out of fashion: one never knows when an oldie-but-goodie might come in handy. Case in point: Yesterday, Google announced Google Trips, a new app to assist you in your travels by helping you create your own “perfect day” in a city. Surprisingly, deep inside Google Trips, there is an algorithm that was invented 280 years ago.

In 1736, Leonhard Euler authored a brief but beautiful mathematical paper regarding the town of Königsberg and its 7 bridges, shown here:

Image from Wikipedia

In the paper, Euler studied the following question: is it possible to walk through the city crossing each bridge exactly once? As it turns out, for the city of Königsberg, the answer is no. To reach this answer, Euler developed a general approach to represent any layout of landmasses and bridges in terms of what he dubbed the Geometriam Situs (the “Geometry of Place”), which we now call Graph Theory. He represented each landmass as a “node” in the graph, and each bridge as an “edge,” like this:

Image from Wikipedia

Euler noticed that if all the nodes in the graph have an even number of edges (such graphs are called “Eulerian” in his honor) then, and only then, a cycle can be found that visits every edge exactly once. Keep this in mind, as we’ll rely on this fact later in the post.

Our team in Google Research has been fascinated by the “Geometry of Place” for some time, and we started investigating a question related to Euler’s: rather than visiting just the bridges, how can we visit as many interesting places as possible during a particular trip? We call this the “itineraries” problem. Euler didn’t study it, but it is a well known topic in Optimization, where it is often called the “Orienteering” problem.

While Euler’s problem has an efficient and exact solution, the itineraries problem is not just hard to solve, it is hard to even approximately solve! The difficulty lies in the interplay between two conflicting goals: first, we should pick great places to visit, but second, we should pick them to allow a good itinerary: not too much travel time; don’t visit places when they’re closed; don’t visit too many museums, etc. Embedded in such problems is the challenge of finding efficient routes, often referred to as the Travelling Salesman Problem (TSP).

Algorithms for Travel Itineraries

Fortunately, the real world has a property called the “triangle inequality” that says adding an extra stop to a route never makes it shorter. When the underlying geometry satisfies the triangle inequality, the TSP can be approximately solved using another algorithm discovered by Christofides in 1976. This is an important part of our solution, and builds on Euler’s paper, so we’ll give a quick four-step rundown of how it works here:

We start with all our destinations separate, and repeatedly connect together the closest two that aren’t yet connected. This doesn’t yet give us an itinerary, but it does connect all the destinations via a minimum spanning tree of the graph.
We take all the destinations that have an odd number of connections in this tree (Euler proved there must be an even number of these), and carefully pair them up.
Because all the destinations now have an even number of edges, we’ve created an Eulerian graph, so we create a route that crosses each edge exactly once.
We now have a great route, but it might visit some places more than once. No problem, we find any double visits and simply bypass them, going directly from the predecessor to the successor.

Christofides gave an elegant proof that the resulting route is always close to the shortest possible. Here’s an example of the Christofides’ algorithm in action on a location graph with the nodes representing places and the edges with costs representing the travel time between the places.

Construction of an Eulerian Tour in a location graph

Armed with this efficient route-finding subroutine, we can now start building itineraries one step at a time. At each step, we estimate the benefit to the user of each possible new place to visit, and likewise estimate the cost using the Christofides algorithm. A user’s benefit can be derived from a host of natural factors such as the popularity of the place and how different the place is relative to places already visited on the tour. We then pick whichever new place has the best benefit per unit of extra cost (e.g., time needed to include the new place in the tour). Here’s an example of our algorithm actually building a route in London using the location graph shown above:

Itineraries in Google Trips

With our first good approximate solution to the itineraries problem in hand, we started working with our colleagues from the Google Trips team, and we realized we’d barely scratched the surface. For instance, even if we produce the absolute perfect itinerary, any particular user of the system will very reasonably say, “That’s great, but all my friends say I also need to visit this other place. Plus, I’m only around for the morning, and I don’t want to miss this place you listed in the afternoon. And I’ve already seen Big Ben twice.” So rather than just producing an itinerary once and calling it a perfect day, we needed a fast dynamic algorithm for itineraries that users can modify on the fly to suit their individual taste. And because many people have bad data connections while traveling, the solution had to be efficient enough to run disconnected on a phone.

Better Itineraries Through the Wisdom of Crowds

While the algorithmic aspects of the problem were highly challenging, we realized that producing high-quality itineraries was just as dependent on our understanding of the many possible stopping points on the itinerary. We had Google’s extensive travel database to identify the interesting places to visit, and we also had great data from Google’s existing systems about how to travel from any place to any other. But we didn’t have a good sense for how people typically move through this geometry of places.

For this, we turned to the wisdom of crowds. This type of wisdom is used by Google to estimate delays on highways, and to discover when restaurants are most busy. Here, we use the same techniques to learn about common visit sequences that we can stitch together into itineraries that feel good to our users. We combine Google’s knowledge of when places are popular, with the directions between those places to gather an idea of what tourists like to do when travelling.

And the crowd has a lot more wisdom to offer in the future. For example, we noticed that visits to Buckingham Palace spike around 11:30 and stay a bit longer than at other times of the day. This seemed a little strange to us, but when we looked more closely, it turns out to be the time of the Changing of the Guard. We’re looking now at ways to incorporate this type of timing information into the itinerary selection algorithms.

So give it a try: Google Trips, available now on Android and iOS, has you covered from departure to return.

ddd

The 2016 Google Earth Engine User Summit: Turning pixels into insights

September 19th, 2016 | by Research Blog | published in Google Research

Posted by Chris Herwig, Program Manager, Google Earth Engine”We are trying new methods [of flood modeling] in Earth Engine based on machine learning techniques which we think are cheaper, more scalable, and could exponentially drive down the cost of fl…

ddd

Research from VLDB 2016: Improved Friend Suggestion using Ego-Net Analysis

September 15th, 2016 | by Research Blog | published in Google Research

Posted by Alessandro Epasto, Research Scientist, Google Research NY

On September 5 – 9, New Delhi, India hosted the 42nd International Conference on Very Large Data Bases (VLDB), a premier annual forum for academic and industry research on databases, data management, data mining and data analytics. Over the past several years, Google has actively participated in VLDB, both as official sponsor and with numerous contributions to the research and industrial tracks. In this post, we would like to share the research presented in one of the Google papers from VLDB 2016.

In Ego-net Community Mining Applied to Friend Suggestion, co-authored by Googlers Silvio Lattanzi, Vahab Mirrokni, Ismail Oner Sebe, Ahmed Taei, Sunita Verma and myself, we explore how social networks can provide better friend suggestions to users, a challenging practical problem faced by all social network platforms

Friend suggestion – the task of suggesting to a user the contacts she might already know in the network but that she hasn’t added yet – is major driver of user engagement and social connection in all online social networks. Designing a high quality system that can provide relevant and useful friend recommendations is very challenging, and requires state-of-the-art machine learning algorithms based on a multitude of parameters.

An effective family of features for friend suggestion consist of graph features such as the number of common friends between two users. While widely used, the number of common friends has some major drawbacks, including the following which is shown in Figure 1.

Figure 1: Ego-net of Sally.

In this figure we represent the social connections of Sally and her friends – the ego-net of Sally. An ego-net of a node (in this case, Sally) is defined as the graph that contains the node itself, all of the node’s neighbors and the connection among those nodes. Sally has 6 friends in her ego-net: Albert (her husband), Brian (her son), Charlotte (her mother) as well as Uma (her boss), Vincent and Wally (two of her team members). Notice how A, B and C are all connected with each other while they do not know U, V or W. On the other hand U, V and W have all added each other as their friend (except U and W who are good friend but somehow forgot to add each other).

Notice how each of A, B, C have a common friend with each of U, V and W: Sally herself. A friend recommendation system based on common neighbors might suggest to Sally’s son (for instance) to add Sally’s boss as his friend! In reality the situation is even more complicated because users’ online and offline friends span several different social circles or communities (family, work, school, sports, etc).

In our paper we introduce a novel technique for friend suggestions based on independently analyzing the ego-net structure. The main contribution of the paper is to show that it is possible to provide friend suggestions efficiently by constructing all ego-nets of the nodes in the graph and then independently applying community detection algorithms on them in large-scale distributed systems.

Specifically, the algorithm proceeds by constructing the ego-nets of all nodes and applying, independently on each of them, a community detection algorithm. More precisely the algorithm operates on so-called “ego-net-minus-ego” graphs, which is defined as the graph including only the neighbors of a given node, as shown in the figure below.

Figure 2: Clustering of the ego-net of Sally.

Notice how in this example the ego-net-minus-ego of Sally has two very clear communities: her family (A, B, C) and her co-workers (U, V, W) which are easily separated. Intuitively, this is because one might expect that while nodes (e.g. Sally) participate in many communities, there is usually a single (or a limited number of) contexts in which two specific neighbors interact. While Sally is both part of her family and work community, Sally and Uma interact only at work. Through extensive experimental evaluation on large-scale public social networks and formally through a simple mathematical model, our paper confirms this intuition. It seems that while communities are hard to separate in a global graph, they are easier to identify at the local level of ego-nets.

This allows for a novel graph-based method for friend suggestion which intuitively only allows suggestion of pairs of users that are clustered together in the same community from the point of view of their common friends. With this method, U and W will be suggested to add each other (as they are in the same community and they are not yet connected) while B and U will not be suggested as friends as they span two different communities.

From an algorithmic point of view, the paper introduces efficient parallel and distributed techniques for computing and clustering all ego-nets of very large graphs at the same time – a fundamental aspect enabling use of the system on the entire world Google+ graph. We have applied this feature in the “You May Know” system of Google+, resulting in a clear positive impact on the prediction task, improving the acceptance rate by more than 1.5% and decreasing the rejection rate by more than 3.3% (a significative impact at Google scales).

We believe that many future directions of work might stem from our preliminary results. For instance ego-net analysis could be potentially to automatically classify a user contacts in circles and to detect spam. Another interesting direction is the study of ego-network evolution in dynamic graphs.

ddd

Computational Thinking from a Dispositions Perspective

September 13th, 2016 | by Research Blog | published in Google Research

Posted by Chris Stephenson, Head of Computer Science Education Programs at Google, and Joyce Malyn-Smith, Managing Project Director at Education Development Center (EDC)

(Cross-posted on the Google for Education Blog)

In K–12 computer science (CS) education, much of the discussion about what students need to learn and do to has centered around computational thinking (CT). While much of the current work in CT education is focused on core concepts and their application, the one area of CT that has not been well explored is the relationship between CT as a problem solving model, and the dispositions or habits of mind that it can build in students of all ages.

Exploring the mindset that CT education can engender depends, in part, on the definition of CT itself. While there are a number of definitions of CT in circulation, Valerie Barr and I defined it in the following way:

CT is an approach to solving problems in a way that can be implemented with a computer. Students become not merely tool users but tool builders. They use a set of concepts, such as abstraction, recursion, and iteration, to process and analyze data, and to create real and virtual artifacts. CT is a problem solving methodology that can be automated and transferred and applied across subjects.

Like many others, our view of CT also included the core CT concepts: abstraction, algorithms and procedures, automation, data collection and analysis, data representation, modeling and simulation, parallelization and problem decomposition.

The idea of dispositions, however, comes from the field of vocational education and research on career development which focuses on the personal qualities or soft skills needed for employment (see full report from Economist Intelligence Unit here). These skills traditionally include being responsible, adaptable, flexible, self-directed, and self-motivated; being able to solve simple and complex problems, having integrity, self-confidence, and self-control. They can also include the ability to work with people of different ages and cultures, collaboration, complex communication and expert thinking.

Cuoco, Goldenberg, and Mark’s research also provided examples of what students should learn to develop the habits of mind used by scientists across numerous disciplines. These are: recognizing patterns, experimenting, describing, tinkering, inventing, visualizing, and conjecturing. Potter and Vickers also found that in the burgeoning field of cyber security “there is significant overlap between the roles for many soft skills, including analysis, consulting and process skills, leadership, and relationship management. Both communication and presentation skills were valued.”

CT, because of its emphasis on problem solving, provides a natural environment for embedding the idea of dispositions into K-12. According to the International Society for Technology in Education and the Computer Science Teachers Association, the set of dispositions that student practice and internalize while learning about CT can include:

confidence in dealing with complexity,
persistence in working with difficult problems,
the ability to handle ambiguity,
the ability to deal with open-ended problems,
setting aside differences to work with others to achieve a common goal or solution, and
knowing one’s strengths and weaknesses when working with others.

Any teacher in any discipline is likely to tell you that persistence, problem solving, collaboration and awareness of one’s strengths and limitations are critical to successful learning for all students. So how do we make these dispositions a more explicit part of the CT curriculum? One of the ways to do so is to to call them out directly to students and explain why they are important in all areas of their study, career, and lives. In addition educators can:

Post in the classroom a list of the Dispositions Leading to Success,
Help familiarize students with these dispositions by using the terms when talking with students and referring to the work they are doing. “Today we are going to be solving an open-ended problem. What do you think that means?”
Help students understand that they are developing these dispositions by congratulating them when these dispositions lead to success: “Great problem-solving skills!”; “Great job! Your persistence helped solve the problem”; “You dealt with ambiguity really well!”.
Engage students in discussions about the dispositions: “Today we are going to work in teams. What does it mean to be on a team? What types of people would you want on your team and why?”
Help students articulate their dispositions when developing their resumes or preparing for job interviews.

Guest speakers from industry might also:

Integrate the importance of dispositions into their talks with students: examples of the problems they have solved, how the different skills of team members led to different solutions, the role persistence played in solving a problem/developing a product or service…
Talk about the importance of dispositions to employers and how they contribute to their own organizational culture, the ways employers ask interviewees about their dispositions or how interviewees might respond (e.g. use the terms and give examples).

As Google’s Director of Education and University Relations, Maggie Johnson noted in a recent blog post, CT represents a core set of skills that are necessary for all students:

If we can make these explicit connections for students, they will see how the devices and apps that they use everyday are powered by algorithms and programs. They will learn the importance of data in making decisions. They will learn skills that will prepare them for a workforce that will be doing vastly different tasks than the workforce of today.

In addition to these concepts, we can now add developing critical dispositions for success in computing and in life to the list of benefits for teaching CT to all students.

ddd

Announcing the First Annual Global PhD Fellowship Summit and the 2016 Google PhD Fellows

September 7th, 2016 | by Research Blog | published in Google Research

Posted by Michael Rennaker, Program Manager, University Relations

In 2009, Google created the PhD Fellowship Program to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our Fellowships have helped support over 250 graduate students in Australia, China and East Asia, India, North America, Europe and the Middle East who seek to shape and influence the future of technology.

Recently, Google PhD Fellows from around the globe converged on our Mountain View campus for the first annual Global PhD Fellowship Summit. The students heard talks from researchers like Jeff Dean, Françoise Beaufays, Peter Norvig, Maya Gupta and Amin Vahdat, and got a glimpse into some of the state-of-the-art research pursued across Google.

Senior Google Fellow Jeff Dean shares how TensorFlow is used at Google

Fellows also had the chance to connect one-on-one with Googlers to discuss their research, as well as receive feedback from leaders in their fields. The event wrapped up with a panel discussion with Dan Russell, Kristen LeFevre, Douglas Eck and Françoise Beaufays about their unique career paths. Maggie Johnson concluded the Summit by sharing about the different types of research environments across academia and industry.

(Left) PhD Fellows share their work with Google researchers during the poster session
(Right) Research panelists share their journeys through academia and industry

Our PhD Fellows represent some the best and brightest young researchers around the globe in Computer Science and it is our ongoing goal to support them as they make their mark on the world.

We’d also like to welcome the newest class of Google PhD Fellows recently awarded in China and East Asia, India, and Australia. We look forward to seeing each of them at next year’s summit!

2016 Global PhD Fellows

Computational Neuroscience
Cameron (Po-Hsuan) Chen, Princeton University
Grace Lindsay, Columbia University
Martino Sorbaro Sindaci, The University of Edinburgh

Human-Computer Interaction
Dana McKay, University of Melbourne
Koki Nagano, University of Southern California
Arvind Satyanarayan, Stanford University
Amy Xian Zhang, Massachusetts Institute of Technology

Machine Learning
Olivier Bachem, Swiss Federal Institute of Technology Zurich
Tianqi Chen, University of Washington
Emily Denton, New York University
Kwan Hui Lim, University of Melbourne
Yves-Laurent Kom Samo, University of Oxford
Woosang Lim, Korea Advanced Institute of Science and Technology
Anirban Santara, Indian Institute of Technology Kharagpur
Daniel Jaymin Mankowitz, Technion – Israel Institute of Technology
Lucas Maystre, École Polytechnique Fédérale de Lausanne
Arvind Neelakantan, University of Massachusetts, Amherst
Ludwig Schmidt, Massachusetts Institute of Technology
Quanming Yao, The Hong Kong University of Science and Technology
Shandian Zhe, Purdue University, West Lafayette

Machine Perception, Speech Technology and Computer Vision
Eugen Beck, RWTH Aachen University
Yu-Wei Chao, University of Michigan, Ann Arbor
Wei Liu, University of North Carolina at Chapel Hill
Aron Monszpart, University College London
Thomas Schoeps, Swiss Federal Institute of Technology Zurich
Tian Tan, Shanghai Jiao Tong University
Chia-Yin Tsai, Carnegie Mellon University
Weitao Xu, University of Queensland

Market Algorithms
Hossein Esfandiari, University of Maryland, College Park
Sandy Heydrich, Saarland University – Saarbrucken GSCS
Rad Niazadeh, Cornell University
Sadra Yazdanbod, Georgia Institute of Technology

Mobile Computing
Lei Kang, University of Wisconsin
Tauhidur Rahman, Cornell University
Chungkuk Yoo, Korea Advanced Institute of Science and Technology
Yuhao Zhu, University of Texas, Austin

Natural Language Processing
Tamer Alkhouli, RWTH Aachen University
Jose Camacho Collados, Sapienza – Università di Roma

Privacy and Security
Chitra Javali, University of New South Wales
Kartik Nayak, University of Maryland, College Park
Nicolas Papernot, Pennsylvania State University
Damian Vizar, École Polytechnique Fédérale de Lausanne
Xi Wu, University of Wisconsin

Programming Languages, Algorithms and Software Engineering
Marcelo Sousa, University of Oxford
Arpita Biswas, Indian Institute of Science

Structured Data and Database Management
Xiang Ren, University of Illinois, Urbana-Champaign

Systems and Networking
Ying Chen, Tsinghua University
Andrew Crotty, Brown University
Aniruddha Singh Kushwaha, Indian Institute of Technology Bombay
Ilias Marinos, University of Cambridge
Kay Ousterhout, University of California, Berkeley
Hong Zhang, The Hong Kong University of Science and Technology

ddd

Reproducible Science: Cancer Researchers Embrace Containers in the Cloud

September 6th, 2016 | by Research Blog | published in Google Research

Posted by Dr. Kyle Ellrott, Oregon Health and Sciences University, Dr. Josh Stuart, University of California Santa Cruz, and Dr. Paul Boutros, Ontario Institute for Cancer Research

Today we hear from the principal investigators of the ICGC-TCGA DREAM Somatic Mutation Calling Challenges about how they are encouraging cancer researchers to make use of Docker and Google Cloud Platform to gain a deeper understanding of the complex genetic mutations that occur in cancer, while doing so in a reproducible way.
– Nicole Deflaux and Jonathan Bingham, Google Genomics

Today’s genomic analysis software tools often give different answers when run in different computing environments – that’s like getting a different diagnosis from your doctor depending on which examination room you’re sitting in. Reproducible science matters, especially in cancer research where so many lives are at stake. The Cancer Moonshot has called for the research world to ‘Break down silos and bring all the cancer fighters together‘. Portable software “containers” and cloud computing hold the potential to help achieve these goals by making scientific data analysis more reproducible, reusable and scalable.

Our team of researchers from the Ontario Institute for Cancer Research, University of California Santa Cruz, Sage Bionetworks and Oregon Health and Sciences University is pushing the frontiers by encouraging scientists to package up their software in reusable Docker containers and make use of cloud-resident data from the Cancer Cloud Pilots funded by the National Cancer Institute.

In 2014 we initiated the ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenges where Google provided credits on Google Cloud Platform. The first result of this collaboration was the DREAM-SMC DNA challenge, a public challenge that engaged cancer researchers from around the world to find the best methods for discovering DNA somatic mutations. By the end of the challenge, over 400 registered participants competed by submitting 3,500 open-source entries for 14 test genomes, providing key insights on the strengths and limitations of the current mutation detection methods.

The SMC-DNA challenge enabled comparison of results, but it did little to facilitate the exchange of cross-platform software tools. Accessing extremely large genome sequence input files and shepherding complex software pipelines created a “double whammy” to discourage data sharing and software reuse.

How can we overcome these barriers?

Exciting developments have taken place in the past couple of years that may annihilate these last barriers. The availability of cloud technologies and containerization can serve as the vanguards of reproducibility and interoperability.

Thus, a new way of creating open DREAM challenges has emerged: rather than encouraging the status quo where participants run their own methods themselves on their own systems, and the results cannot be verified, the new challenge design requires participants to submit open-source code packaged in Docker containers so that anyone can run their methods and verify the results. Real-time leaderboards show which entries are winning and top performers have a chance to claim a prize.

Working with Google Genomics and Google Cloud Platform, the DREAM-SMC organizers are now using cloud and containerization technologies to enable portability and reproducibility as a core part of the DREAM challenges. The latest SMC installments, the SMC-Het Challenge and the SMC-RNA Challenge have implemented this new plan:

SMC-Het Challenge: Tumour biopsies are composed of many different cell types in addition to tumour cells, including normal tissue and infiltrating immune cells. Furthermore, the tumours themselves are made of a mixture of different subpopulations, all related to one another through cell division and mutation. Critically, each sub-population can have distinct clinical outcomes, with some more resistant to treatment or more likely to metastasize than others. The goal of the SMC-Het Challenge is to identify the best methods for predicting tumor subpopulations and their “family tree” of relatedness from genome sequencing data.
SMC-RNA Challenge: The alteration of RNA production is a fundamental mechanism by which cancer cells rewire cellular circuitry. Genomic rearrangements in cancer cells can produce fused protein products that can bestow Frankenstein-like properties. Both RNA abundances and novel fusions can serve as the basis for clinically-important prognostic biomarkers. The SMC-RNA Challenge will identify the best methods to detect such rogue expressed RNAs in cancer cells.

Ultimately, the success will be gauged by the amount of serious participation in these latest competitions. So far, the signs are encouraging. SMC-Het, which focuses on a very new research area, launched in November 2015 and has already enlisted 18 teams contributing over 70 submissions. SMC-RNA just recently launched and will run until early 2017, with several of the world leaders in the field starting to prepare entries. What’s great about the submissions being packaged in containers is that even after the challenges end, the tested methods can be applied and further adapted by anyone around the world.

Thus, the moon shot need not be a lucky solo attempt made by one hero in one moment of inspiration. Instead, the new informatics of clouds and containers will enable us to combine intelligence so we can build a series of bridges from here to there.

To participate in the DREAM challenges, visit the SMC-Het and SMC-RNA Challenge sites.

ddd

Improving Inception and Image Classification in TensorFlow

August 31st, 2016 | by Research Blog | published in Google Research

Posted by Alex Alemi, Software Engineer

Earlier this week, we announced the latest release of the TF-Slim library for TensorFlow, a lightweight package for defining, training and evaluating models, as well as checkpoints and model definitions for several competitive networks in the field of image classification.

In order to spur even further progress in the field, today we are happy to announce the release of Inception-ResNet-v2, a convolutional neural network (CNN) that achieves a new state of the art in terms of accuracy on the ILSVRC image classification benchmark. Inception-ResNet-v2 is a variation of our earlier Inception V3 model which borrows some ideas from Microsoft’s ResNet papers [1][2]. The full details of the model are in our arXiv preprint Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.

Residual connections allow shortcuts in the model and have allowed researchers to successfully train even deeper neural networks, which have lead to even better performance. This has also enabled significant simplification of the Inception blocks. Just compare the model architectures in the figures below:

Schematic diagram of Inception V3

Schematic diagram of Inception-ResNet-v2

At the top of the second Inception-ResNet-v2 figure, you’ll see the full network expanded. Notice that this network is considerably deeper than the previous Inception V3. Below in the main figure is an easier to read version of the same network where the repeated residual blocks have been compressed. Here, notice that the inception blocks have been simplified, containing fewer parallel towers than the previous Inception V3.

The Inception-ResNet-v2 architecture is more accurate than previous state of the art models, as shown in the table below, which reports the Top-1 and Top-5 validation accuracies on the ILSVRC 2012 image classification benchmark based on a single crop of the image. Furthermore, this new model only requires roughly twice the memory and computation compared to Inception V3.

Model	Architecture	Checkpoint	Top-1 Accuracy	Top-5 Accuracy
Inception-ResNet-v2	Code	inception_resnet_v2_2016_08_30.tar.gz	80.4	95.3
Inception V3	Code	inception_v3_2016_08_28.tar.gz	78.0	93.9
ResNet 152	Code	resnet_v1_152_2016_08_28.tar.gz	76.8	93.2
ResNet V2 200	Code	TBA	79.9*	95.2*

(*): Results quoted in ResNet paper.

As an example, while both Inception V3 and Inception-ResNet-v2 models excel at identifying individual dog breeds, the new model does noticeably better. For instance, whereas the old model mistakenly reported Alaskan Malamute for the picture on the right, the new Inception-ResNet-v2 model correctly identifies the dog breeds in both images.

An Alaskan Malamute (left) and a Siberian Husky (right). Images from Wikipedia

In order to allow people to immediately begin experimenting, we are also releasing a pre-trained instance of the new Inception-ResNet-v2, as part of the TF-Slim Image Model Library.

We are excited to see what the community does with this improved model, following along as people adapt it and compare its performance on various tasks. Want to get started? See the accompanying instructions on how to train, evaluate or fine-tune a network.

As always, releasing the code was a team effort. Specific thanks are due to:

Model Architecture – Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi
Systems Infrastructure – Jon Shlens, Benoit Steiner, Mark Sandler, and David Andersen
TensorFlow-Slim – Sergio Guadarrama and Nathan Silberman
Model Visualization – Fernanda Viégas and James Wexler

ddd

TF-Slim: A high level library to define complex models in TensorFlow

August 30th, 2016 | by Research Blog | published in Google Research

Posted by Nathan Silberman and Sergio Guadarrama, Google Research Earlier this year, we released a TensorFlow implementation of a state-of-the-art image classification model known as Inception-V3. This code allowed users to train the model on the Ima…

ddd

Text summarization with TensorFlow

August 24th, 2016 | by Research Blog | published in Google Research

Posted by Peter Liu and Xin Pan, Software Engineers, Google Brain Team

Every day, people rely on a wide variety of sources to stay informed — from news stories to social media posts to search results. Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the Google Brain team.

Summarization can also serve as an interesting reading comprehension test for machines. To summarize well, machine learning models need to be able to comprehend documents and distill the important information, tasks which are highly challenging for computers, especially as the length of a document increases.

In an effort to push this research forward, we’re open-sourcing TensorFlow model code for the task of generating news headlines on Annotated English Gigaword, a dataset often used in summarization research. We also specify the hyper-parameters in the documentation that achieve better than published state-of-the-art on the most commonly used metric as of the time of writing. Below we also provide samples generated by the model.

Extractive and Abstractive summarization

One approach to summarization is to extract parts of the document that are deemed interesting by some metric (for example, inverse-document frequency) and join them to form a summary. Algorithms of this flavor are called extractive summarization.

Original Text: Alice and Bob took the train to visit the zoo. They saw a baby giraffe, a lion, and a flock of colorful tropical birds.

Extractive Summary: Alice and Bob visit the zoo. saw a flock of birds.

Above we extract the words bolded in the original text and concatenate them to form a summary. As we can see, sometimes the extractive constraint can make the summary awkward or grammatically strange.

Another approach is to simply summarize as humans do, which is to not impose the extractive constraint and allow for rephrasings. This is called abstractive summarization.

Abstractive summary: Alice and Bob visited the zoo and saw animals and birds.

In this example, we used words not in the original text, maintaining more of the information in a similar amount of words. It’s clear we would prefer good abstractive summarizations, but how could an algorithm begin to do this?

About the TensorFlow model

It turns out for shorter texts, summarization can be learned end-to-end with a deep learning technique called sequence-to-sequence learning, similar to what makes Smart Reply for Inbox possible. In particular, we’re able to train such models to produce very good headlines for news articles. In this case, the model reads the article text and writes a suitable headline.

To get an idea of what the model produces, you can take a look at some examples below. The first column shows the first sentence of a news article which is the model input, and the second column shows what headline the model has written.

Input: Article 1st sentence	Model-written headline
metro-goldwyn-mayer reported a third-quarter net loss of dlrs 16 million due mainly to the effect of accounting rules adopted this year	mgm reports 16 million net loss on higher revenue
starting from july 1, the island province of hainan in southern china will implement strict market access control on all incoming livestock and animal products to prevent the possible spread of epidemic diseases	hainan to curb spread of diseases
australian wine exports hit a record 52.1 million liters worth 260 million dollars (143 million us) in september, the government statistics office reported on monday	australian wine exports hit record high in september

Future Research

We’ve observed that due to the nature of news headlines, the model can generate good headlines from reading just a few sentences from the beginning of the article. Although this task serves as a nice proof-of-concept, we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries. In those tasks training from scratch with this model architecture does not do as well as some other techniques we’re researching, but it serves as a baseline. We hope this release can also serve as a baseline for others in their summarization research.

ddd

Google Data