December 10th, 2009 | Published in Google Research
Recently, Google's research groups reviewed over 140 grant proposals across sixteen different research areas. During this process, we identified a number of strategic research topics. These topics represent critical areas of research for Google in collaboration with our university partners.
We'll be examining several of these topics in future posts, but we'd like to begin by raising some of the research challenges we face in our multimedia efforts:
- Large scale annotation: How can we learn from large, noisy sets of image and video data to automatically build label-annotation models with human-level accuracy?
The images and videos available on the web provide massive data sets, but the labels attached to them are very noisy indicators of the actual content. We have labels based on an item's popularity for a particular search, on anchor text and other surrounding context, and on labels given to other content frequently associated with the item. The challenge is to exploit the sheer volume of available data to improve our recognition models and to carry appearance models from one media type to another, while handling variability both in appearance and in the labels themselves.
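The multiple noisy label sources mentioned above (search popularity, anchor text, associated content) can be combined before any model training. Below is a minimal sketch of weighted vote aggregation; the source names and weights are purely illustrative, not an actual Google pipeline:

```python
def aggregate_labels(votes, weights):
    """Combine noisy label votes from several sources (e.g. anchor
    text, query clicks, co-occurring content) into a single
    weighted score per label."""
    scores = {}
    for source, labels in votes.items():
        w = weights.get(source, 1.0)
        for label in labels:
            scores[label] = scores.get(label, 0.0) + w
    return scores

# Hypothetical noisy signals for one image; names are illustrative.
votes = {
    "anchor_text": ["dog", "puppy"],
    "query_clicks": ["dog"],
    "co_occurring": ["cat", "dog"],
}
weights = {"anchor_text": 0.5, "query_clicks": 1.0, "co_occurring": 0.3}

scores = aggregate_labels(votes, weights)
best = max(scores, key=scores.get)
print(best)  # → dog (agreed on by all three sources)
```

In practice the weights themselves would be learned, and the aggregated scores would serve as soft training targets rather than hard labels.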
- Image/Audio/Video Representation: How can we improve our understanding of low-level representations of images that go beyond bag-of-words modeling?
Much of the current work in labeling and retrieval is based on fairly simple local descriptions of the content, placing the emphasis on learning classifiers from combinations of simple models. While this classifier approach has been useful, we should also examine the basic features we are using, to see whether we can characterize the content better. Providing better inputs to our learning algorithms should reduce the size of the space over which we need to search. Possible examples include shape modeling in images, better texture and color models, and descriptions of soft segmentations of regions.
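For reference, the bag-of-words baseline discussed above quantizes local descriptors against a codebook and keeps only the word-count histogram, discarding spatial layout. A toy sketch (the codebook is random here rather than learned by k-means over real descriptors, and all sizes are small illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "codebook" of visual words; in practice this would be learned
# by k-means over millions of local descriptors such as SIFT.
codebook = rng.normal(size=(8, 16))   # 8 words, 16-dim descriptors

def bag_of_words(descriptors, codebook):
    """Quantize each local descriptor to its nearest codebook word
    and return the normalized word-count histogram."""
    # Pairwise distances, shape (n_descriptors, n_words).
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

descriptors = rng.normal(size=(100, 16))  # local features from one image
h = bag_of_words(descriptors, codebook)
print(h.shape)  # → (8,): one bin per visual word
```

The histogram is what a classifier typically sees; richer representations such as shape or soft-segmentation descriptors would replace or augment this input.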
- Localization of image-/video-level labels to spatial/temporal portions of the content: Can we automatically associate image and video labels with specific portions of the content?
The most obvious examples in this area are labels like "dog" and "explosion". However, can we also localize more complex concepts like "waves" or "suspense"? Alternatively, can we automatically distinguish between labels based on how well we are able to localize them to a particular place or time within the content?
- Large scale matching / Hashing: Can we identify matching techniques to deal with large datasets?
We need image, video, and audio matching techniques that can effectively deal with large datasets, embedded in high-dimensional descriptor spaces, in sub-linear time. Of special interest are methods that can efficiently handle a wide range of recall/precision needs without massive increases in the size of the data structures used.
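A standard sub-linear technique of this kind is locality-sensitive hashing. A minimal sketch using random-hyperplane signatures for cosine similarity; the sizes and the single hash table are illustrative, and real systems use many tables to trade recall against index size:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(x, planes):
    """The sign of the projection onto each random hyperplane gives
    a short binary code; vectors close in cosine distance tend to
    share codes, so a hash lookup replaces a linear scan."""
    return tuple((x @ planes.T > 0).astype(int))

dim, n_bits = 64, 12
planes = rng.normal(size=(n_bits, dim))

# Index a toy database of descriptors into hash buckets.
database = rng.normal(size=(1000, dim))
buckets = {}
for idx, v in enumerate(database):
    buckets.setdefault(lsh_signature(v, planes), []).append(idx)

# Look up item 42: only its bucket is scanned, not the whole
# database, which is where the sub-linear query time comes from.
candidates = buckets.get(lsh_signature(database[42], planes), [])
print(42 in candidates)  # → True
```

More signature bits raise precision but lower recall per table; adding hash tables recovers recall at the cost of index size, which is exactly the trade-off the paragraph above highlights.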
We expect these questions to keep us busy for some time.