Google Data » Kevin McCurley

On the predictability of Search Trends

Kevin McCurley — Mon, 17 Aug 2009 21:19:00 +0000

Posted by Yossi Matias, Niv Efron, and Yair Shimshoni, Google Labs, Israel.

Since launching Google Trends and Google Insights for Search, we've been providing daily insight into what the world is searching for. An understanding of search trends can be useful for advertisers, marketers, economists, scholars, and anyone else interested in knowing more about their world and what's currently top-of-mind.

As many have observed, the trends of some search queries are quite seasonal and have repeated patterns. For instance, the search trends for the query "ski" peak during the winter seasons in the US and Australia. The search trends for basketball correlate with annual league events, and are consistent year-over-year. When looking at trends of the aggregated volume of search queries related to particular categories, one can also observe regular patterns in some categories like Food & Drink or Automotive. Such trends sequences appear quite predictable, and one would naturally expect the patterns of previous years to repeat looking forward.

On the other hand, for many other search queries and categories, the trends are quite irregular and hard to predict. Examples include the search trends for obama, twitter, android, or global warming, and the trend of aggregate searches in the News & Current Events category.

Having predictable trends for a search query or for a group of queries could have interesting ramifications. One could forecast the trends into the future, and use the forecast as a "best guess" for various business decisions such as budget planning, marketing campaigns and resource allocations. One could also detect deviations from such a forecast and identify new factors that are influencing the search volume, as demonstrated in Flu Trends.

We were therefore interested in the following questions:

How many search queries have trends that are predictable?
Are some categories more predictable than others? How are predictable trends distributed among the various categories?
How predictable are the trends of aggregated search queries for different categories? Which categories are more predictable and which are less so?

To learn about the predictability of search trends, and so as to overcome our basic limitation of not knowing what the future will entail, we characterize the predictability of a Trends series based on its historical performance. In other words, we estimate the a posteriori predictability of a sequence determined by the error of forecasted trends vs the actual performance.

Specifically, we have used a simple forecasting model that learns basic seasonality and general trend. For each trends sequence of interest, we take a point in time, t, about a year back, compute a one-year forecast from t based on the historical data available at time t, and compare it to the actual trends sequence observed since time t. The error between the forecast and the actual trends characterizes the predictability level of a sequence; when the error is smaller than a pre-defined threshold, we denote the trends query as predictable.
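The evaluation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model from the paper: the forecaster here is a deliberately simple stand-in (last year's values scaled by year-over-year growth), the error measure is a mean absolute percentage error, and the function names, the monthly period of 12, and the threshold of 0.25 are all hypothetical choices made for the example.

```python
def seasonal_trend_forecast(history, period=12):
    """Forecast one period ahead: repeat the last period, scaled by recent growth.

    Assumption: a stand-in for "basic seasonality and general trend",
    not the forecasting model used in the paper.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    last = history[-period:]
    previous = history[-2 * period:-period]
    growth = sum(last) / sum(previous)  # general trend, year over year
    return [value * growth for value in last]  # seasonality from last period


def mean_absolute_percentage_error(forecast, actual):
    """Average of |forecast - actual| / |actual|; actual values must be nonzero."""
    return sum(abs(f - a) / abs(a) for f, a in zip(forecast, actual)) / len(actual)


def predictability_error(series, period=12):
    """Hold out the final period, forecast it from the rest, and score the forecast."""
    history, actual = series[:-period], series[-period:]
    forecast = seasonal_trend_forecast(history, period)
    return mean_absolute_percentage_error(forecast, actual)


def is_predictable(series, period=12, threshold=0.25):
    """Label a series predictable when its held-out error is below the threshold.

    The threshold value is hypothetical; the post only says "pre-defined".
    """
    return predictability_error(series, period) < threshold


if __name__ == "__main__":
    # Three years of a made-up seasonal series growing 10% per year.
    base = [100, 90, 80, 60, 40, 30, 30, 40, 60, 80, 95, 110]
    seasonal = [v * 1.1 ** year for year in range(3) for v in base]
    print(predictability_error(seasonal), is_predictable(seasonal))
```

The point of the design is that the forecaster only ever sees data up to time t, so the held-out year gives an honest after-the-fact measure of how well the series could have been predicted.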

Our work to date is summarized in a paper called On the Predictability of Search Trends which includes the following observations:

Over half of the most popular Google search queries are predictable in a 12 month ahead forecast, with a mean absolute prediction error of about 12%.
Nearly half of the most popular queries are not predictable (with respect to the model we have used).
Some categories have a particularly high fraction of predictable queries; for instance, Health (74%), Food & Drink (67%) and Travel (65%).
Some categories have a particularly low fraction of predictable queries; for instance, Entertainment (35%) and Social Networks & Online Communities (27%).
The trends of aggregated queries per category are much more predictable: 88% of the aggregated category search trends of over 600 categories in Insights for Search are predictable, with a mean absolute prediction error of less than 6%.
There is a clear association between the existence of seasonality patterns and higher predictability, as well as an association between high levels of outliers and lower predictability. For the Entertainment category, which typically has less seasonal search behavior as well as a relatively higher number of singular spikes of interest, we have seen a predictability of 35%, whereas the Travel category, with very seasonal behavior and a lower tendency for short spikes of interest, had a predictability of 65%.
One should expect the actual search trends to deviate from forecast for many predictable queries, due to possible events and dynamic circumstances.
We show the forecasting vs actual for trends of a few categories, including some that were used recently for predicting the present of various economic indicators. This demonstrates how forecasting can serve as a good baseline for identifying interesting deviations in actual search traffic.

As we see that many of the search trends are predictable, we are introducing today a new forecasting feature in Insights for Search, along with a new version of the product. The forecasting feature is applied to queries which are identified as predictable (see, for instance, basketball or the trends in the Automotive category) and then shown as an extrapolation of the historical trends and search patterns.

There are many more questions that can be looked at regarding search trends in general, and their predictability in particular, including designing and testing more advanced forecasting models, getting other insights into the distributions of sequences, and demonstrating interesting deviations of actual-vs-forecast for predictable trends series. We'd love to hear from you - share with us your findings, published results or insights - email us at insightsforsearch@google.com.

Remembering Randy Pausch

Kevin McCurley — Sat, 26 Jul 2008 07:51:00 +0000

Posted by Kevin McCurley, Research Team

It is with great sadness that we note the passing of Randy Pausch, who taught computer science at Carnegie Mellon University. Randy was well-known by many within the research community, including quite a number of us here at Google. Alfred Spector, our Vice President of Research, was his Ph.D. advisor. Rich Gossweiler, a Senior Research Scientist, was his first Ph.D. student. Several other former colleagues and coauthors (Joshua Bloch, Adam Fass, and Ning Hu) now work here.

All of us strive to make an impact with our research, and Randy was no exception. He will be remembered for his work, but also for his contributions to humanity at large. Millions have watched the video on YouTube from his lecture titled Achieving your Childhood Dreams. The strength of his character was already known to his family, his colleagues, and the broader computer science research community. The courage and optimism that he displayed at the end of his life became inspirational to millions more.

I've seen Randy repeatedly go to bat for what is right. As a leader, he consistently evoked incredible enthusiasm and optimism for the subjects he embraces. Randy had a very human passion about people and not just who they are, but their potential, despite any flaws or obstacles in their way. His contributions will be remembered for generations to come. - Rich Gossweiler

Randy was one of the most vibrant, passionate people I've ever known. His passion was inspirational not only to his family and colleagues, but also, because of his courageous presentations beginning with his well-known Last Lecture, he has influenced millions more. - Alfred Spector

We will miss Randy very much, and remember him fondly.

VisualRank

Kevin McCurley — Thu, 01 May 2008 20:30:00 +0000

Posted by Shumeet Baluja and Yushi Jing

At WWW-2008, in Beijing, China, we presented our paper "PageRank for Product Image Search". In this paper, we presented a system that used visual cues, instead of solely text information, to determine the rank of images. The idea was simple: find common visual themes in a set of images, and then find a small set of images that best represented those themes. The resulting algorithm wound up being PageRank, but on an entirely inferred graph of image similarities. Since the release of the paper, we've noticed lots of coverage in the press and have received quite a few questions. We thought we could answer a few of them here.
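The core idea, PageRank run on an inferred graph of image similarities, can be sketched as a power iteration over a similarity matrix. What follows is a minimal illustration under assumptions, not the system from the paper: the function name `visual_rank`, the damping value of 0.85, and the hand-made similarity scores are all hypothetical, and computing real image similarities is outside the scope of the sketch.

```python
def visual_rank(similarity, damping=0.85, iterations=100, tolerance=1e-12):
    """Rank items by PageRank-style power iteration on a similarity matrix.

    similarity[i][j] is a non-negative similarity score between images i and j.
    Each image passes its score to the others in proportion to similarity.
    """
    n = len(similarity)
    column_sums = [sum(row[j] for row in similarity) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if column_sums[j] > 0:
                    incoming += similarity[i][j] / column_sums[j] * rank[j]
                else:
                    # An image similar to nothing spreads its score uniformly.
                    incoming += rank[j] / n
            updated.append((1.0 - damping) / n + damping * incoming)
        change = sum(abs(u - r) for u, r in zip(updated, rank))
        rank = updated
        if change < tolerance:
            break
    return rank


if __name__ == "__main__":
    # Made-up scores: images 0-2 share a visual theme, image 3 is an outlier.
    scores = [
        [0.0, 0.9, 0.9, 0.1],
        [0.9, 0.0, 0.9, 0.0],
        [0.9, 0.9, 0.0, 0.0],
        [0.1, 0.0, 0.0, 0.0],
    ]
    print(visual_rank(scores))
```

Because only the matrix of similarity scores enters the iteration, a better similarity measure can be swapped in without changing the ranking computation itself.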

"Why did we choose to use products for our test case?" First and foremost, product queries are popular in actual usage; addressing them is important. Second, users have strong expectations of what results we should return for these queries; therefore, this category provides an important set of examples that we need to address especially carefully. Third, on a pragmatic note, they lend themselves well to the type of "image features" that we selected in this study. Since the publication of the paper, we've also extended our results to other query types, including travel-related queries. One of the nice features of the approach is that (we hope) it will be easy to extend to new domains; as research in measuring image or object similarity continues, the advances can easily be incorporated into the similarity calculation to compute the underlying graph; the computations on the graph do not change.

"Where are we going from here?" Besides broadening the sets of queries (and sets of features) for which we can use this approach, there are three directions we're exploring. First, estimating similarity measures for all of the images on the web is computationally expensive; approximations or alternative computations are needed. Second, we hope to evaluate our approach with respect to the large number of recently proposed alternative clustering methods. Third, many variations of PageRank can be used in quite interesting ways for image search. For example, we can use some of these previously published methods to reintroduce, in a meaningful manner, the textual information that the VisualRank algorithm removed. In the end, we have an approach that has an easy integration with both text and visual clues. Stay tuned for more on that in the coming months.

And now to answer the most commonly asked question, "Is it live?" Not yet. Currently, it is research in progress (click here to help speed up the process). In the meantime, though, if you'd like another sneak peek of our research on large graphs, this time in the context of YouTube datamining, just follow the link.

Finally, we want to extend our deepest thanks to the people who helped on this project, especially the image-search team; without their help, this research would not have been possible.

Research in the Cloud: Providing Cutting Edge Computational Resources to Scientists

Kevin McCurley — Wed, 23 Apr 2008 21:13:00 +0000

Posted by Christophe Bisciglia, Senior Software Engineer, and Alfred Spector, Vice President of Research

The emergence of extremely large datasets, well beyond the capacity of almost any single computer, has challenged traditional and contemporary methods of analysis in the research world. While a simple spreadsheet or modest database remains sufficient for some research, problems in the domain of "computational science," which explores mathematical models via computational simulation, require systems that provide huge amounts of data storage and computer processing (current research areas in computational science include climate modeling, gene sequencing, protein mapping, materials science and many more). As an added hurdle, this level of computational infrastructure is often not affordable to research teams, who usually work with significant budgetary restrictions.

Fortunately, as the Internet technology industry expands its global infrastructure, accessing world class distributed computational and storage resources can be as simple as visiting a website. Building on its Academic Cloud Computing Initiative (ACCI) announced last October, Google and IBM, with the National Science Foundation, announced in February the CluE initiative to address this particular need. After coordinating the technical details with Google and IBM, the NSF posted the official solicitation of proposals last week.

Our primary goal in participating in the CluE initiative is to encourage the understanding, further refinement and --importantly-- targeted application of the latest distributed computing technology and methods across many academic disciplines. Engaging educators and researchers with the new potential of distributed computing for processing and analyzing extremely large datasets is an invaluable investment for any technology company to make, and Google in particular is pleased to make a contribution to the academic community that has enabled so many recent advances in the industry.

We're looking forward to an eclectic collection of proposals from the NSF's solicitation. We believe many will leverage the power of distributed computing to produce a diverse range of knowledge that will provide long term benefit to both the research community and the public at large. We also hope that Google's contribution to this low cost, open source approach to distributed computing will allow many more in the academic community to take advantage of this pervasive technological shift.

More details, including information on how to apply for access to these resources, are available on the NSF site.

Deploying Goog411

Kevin McCurley — Fri, 28 Mar 2008 22:34:00 +0000

Posted by Francoise Beaufays

A couple of years ago, a few of us got together and decided to build Goog411. It would be a free phone service that users could call to connect to any business in the US, or simply to browse through a list of businesses such as "bookstores" in a given city. Everything would be fully automated, with no operator in the background, just a speech recognition system to converse with the user, and Google Maps to execute the business search.

We knew that speech recognition is not a solved problem; there would be users for whom the system wouldn't work well, and queries that would be harder to recognize than others. But we got big assets through hosting the service: we could iterate as often as we wanted on any component of the system, we'd have access to all the data, and we could measure whatever seemed relevant to callers. So we built Goog411, started taking some traffic, defining metrics, and iterated many, many times.

We learned a few interesting things in the process (see our ICASSP paper). For example, we discovered that databases with lists of business names are almost useless to train a language model for how users answer the question "What business name or category?"; aggregated web query logs from Google Maps yield far better performance. And we found the speech data we collect through our own service is almost as useful to model new queries as the web data, even though we have orders of magnitude less of it. After all, you may type "real estate" in Google Maps to glance at a few properties, but would you ask for it over the phone while driving your car?

Today Goog411 has grown from an experiment into a product, and we're working on expanding the service to Canada. As calls flow through the system, our focus is still on making the best use of the increasing data, defining metrics that best correlate to the user's experience, and taking advantage of the computer resources and data sources available within Google.

Maybe our most rewarding experience so far has been to see our traffic grow, and to see repeat callers succeed more and more often with the system. Have you tried it already? Just call 1-800-GOOG-411, and don't hesitate to send us feedback!

This year’s scalability conference

Kevin McCurley — Mon, 11 Feb 2008 19:53:00 +0000

Posted by Andrew Schwerin, Software Engineer

Managing huge repositories of data and large clusters of machines is no easy task -- and building systems that use those clusters to usefully process that data is even harder. Last year, we held a conference on scalable systems so a bunch of people who work on these challenges could get together and share ideas. Well, it was so much fun that we've decided to do it again.

This year, the conference is taking place in Seattle on Saturday, June 14. (Registration is free.) If you'd like to talk about a topic on scalable or large-scale systems that is near and dear to your heart, we'd love to hear from you. Potential topics include:

Development, deployment and production:

Systems, environments and languages for building, deploying and debugging complex datacenter-scale apps, or for allowing teams of collaborating engineers to work together on such apps more effectively

Mobile applications:

Unique challenges of scaling services for mobile devices
Location-aware scaling techniques
Experiences designing scalable apps involving mobile devices

Of course, you've probably got more ideas. Send a 500-word abstract of your 30-minute presentation to scalabilityconf@google.com no later than Friday, April 11, and we'll post registration details in the next couple of months.

Google Education Summit

Kevin McCurley — Fri, 19 Oct 2007 00:31:00 +0000

Posted by Jeff Walz and Kevin McCurley

The world's research and educational infrastructures are tightly intertwined. Research universities enable students to participate in research activities, and research contributes to the vitality of the educational experience. At Google, we also recognize the importance of education to our research and engineering activities. In addition to our own in-house activities, we maintain strong ties to academic institutions through visiting faculty programs and summer internships. In recognition of the importance of education to Google's mission, we also recently organized a Google Education Summit. Mehran Sahami has more to say about this in a recent blog post.

OpenHTMM Released

Kevin McCurley — Sun, 23 Sep 2007 21:01:00 +0000

Posted by Ashok C. Popat, Research Scientist

Statistical methods of text analysis have become increasingly sophisticated over the years. A good example is automated topic analysis using latent models, two variants of which are Probabilistic latent semantic analysis and Latent Dirichlet Allocation.

Earlier this year, Amit Gruber, a Ph.D. student at the Hebrew University of Jerusalem, presented a technique for analyzing the topical content of text at the Eleventh International Conference on Artificial Intelligence and Statistics in Puerto Rico.

Gruber's approach, dubbed Hidden Topic Markov Models (HTMM), was developed in collaboration with Michal Rosen-Zvi and Yair Weiss. It differs notably from others in that, rather than treat each document as a single "bag of words," it imposes a temporal Markov structure on the document. In this way, it is able to account for shifting topics within a document, and in so doing, provides a topic segmentation within the document, and also seems to effectively distinguish among multiple senses that the same word may have in different contexts within the same document.
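To make the contrast with a bag-of-words model concrete, here is a toy generative sketch of a document whose topic follows a Markov chain. It is an illustration only and does not use the OpenHTMM package or its API. It assumes, for simplicity, that the topic may change only at sentence boundaries, that a fresh topic is drawn with probability `epsilon`, and that otherwise the previous topic carries over; the function name, parameters, and vocabularies are hypothetical.

```python
import random


def sample_markov_topic_document(sentence_lengths, topic_mixture,
                                 topic_word_probs, epsilon, rng):
    """Sample per-sentence topics and words from a toy topic Markov chain.

    topic_mixture:    probability of each topic within this document
    topic_word_probs: one dict per topic, mapping word -> probability
    epsilon:          chance of drawing a fresh topic at a sentence boundary
    """
    topic_ids = list(range(len(topic_mixture)))
    topics, sentences = [], []
    for index, length in enumerate(sentence_lengths):
        if index == 0 or rng.random() < epsilon:
            # Start of document, or a topic shift: draw from the mixture.
            topic = rng.choices(topic_ids, weights=topic_mixture)[0]
        else:
            # No shift: the topic persists from the previous sentence.
            topic = topics[-1]
        topics.append(topic)
        words = list(topic_word_probs[topic])
        weights = [topic_word_probs[topic][word] for word in words]
        sentences.append(rng.choices(words, weights=weights, k=length))
    return topics, sentences


if __name__ == "__main__":
    vocabularies = [
        {"galaxy": 0.5, "telescope": 0.5},
        {"query": 0.5, "index": 0.5},
    ]
    topics, sentences = sample_markov_topic_document(
        [4, 3, 5], [0.5, 0.5], vocabularies, epsilon=0.3, rng=random.Random(0))
    for topic, sentence in zip(topics, sentences):
        print(topic, " ".join(sentence))
```

Reading the sampled topic sequence from left to right gives a segmentation of the document; inference in a model of this kind runs the process in reverse, recovering the likely topic of each sentence from the observed words.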

Amit is currently doing a graduate internship at Google. As part of his project, he has developed a fresh implementation of his method in C++. We are pleased to release it as the OpenHTMM package to the research community under the Apache 2 license, in the hopes that it will be of general interest and facilitate further research in this area.

The Sky is Open

Kevin McCurley — Wed, 19 Sep 2007 23:16:00 +0000

Posted by Jeremy Brewer

We've gotten an incredible amount of positive feedback about Sky in Google Earth, which lets Google Earth users explore the sky above them with hundreds of millions of stars and galaxies taken from astronomy imagery.

From the start though, we have wanted to open the sky up to everyone. As a first step, we've been hard at work developing tools to let astronomers add their own imagery, and we think we've come up with something that does the job nicely. We're pleased to announce the availability of wcs2kml, an open source project for importing astronomical imagery into Sky.

Modern telescopes output imagery in the FITS binary format that contains a set of headers known as a World Coordinate System (that's the "wcs" part) specifying the location of the image on the sky. Wcs2kml handles the task of transforming this imagery into the projection system used by Google Earth (the "kml" part) so that it can be viewed directly in Sky. Wcs2kml also includes tools to simplify uploading this data to a web server and sharing it with friends.

We were astounded at the imagery and novel applications people created when we opened the Google Earth API to our users. Now, by opening Sky in Google Earth to the astronomy community, we hope to open a floodgate of new imagery for Sky!

Introducing Sky in Google Earth

Kevin McCurley — Wed, 22 Aug 2007 02:46:00 +0000

Posted by Andy Connolly and Ryan Scranton

At Google we are always interested in creating new ways to share ideas and information and applying these techniques to different research fields. Astronomy provides a great opportunity with an abundance of images and information that are accessible to researchers and indeed, anyone with an interest in the stars. With the release of the Google Earth 4.2 client the new Sky feature acts as a virtual telescope that provides a view of some of the most detailed images ever taken of the night sky. By clicking on the Sky button, you can explore the universe, seamlessly zooming from the familiar views of the constellations and stars, to the deepest images ever taken of galaxies and more. From planets moving across the sky to supernovae exploding in distant galaxies, Sky provides a view of a dynamic universe that we hope you will enjoy.

In addition to allowing educators, amateurs or anyone with an interest in space to visually explore the sky, one of the most exciting aspects of Sky is its capability for research and discovery in astronomy. With the latest features in KML you can connect astronomical image and catalog databases directly to the visualization capabilities of Sky (e.g., searching the Sloan Digital Sky Survey database for the highest redshift quasars or correlating the infrared and optical sky to detect the presence of dust within our Galaxy). From releasing new data about the latest discovery of planets around nearby stars to identifying the host galaxy of a gamma ray burst, the possibilities are endless. Examples of how to build research applications such as a view of the microwave background emission from the remnant of the Big Bang can be found in the Google Earth Gallery.

It has been a lot of fun creating Google's first astronomical observatory. Go check it out and explore the sky from the comfort of your home. If you find something interesting, let us know on the Sky section of the Google Earth Community, or author your own KML applications to share your discoveries and data with everyone else. You can also find more Sky resources on our website.

Drink from the firehose with University Research Programs

Kevin McCurley — Thu, 26 Jul 2007 18:27:00 +0000

Posted by Michael Lancaster and Josh Estelle, Software Engineers

Whenever we talk to university researchers, we hear a consistent message: they wish they had Google infrastructure. In pursuit of our company mission, we have built an elaborate set of systems for collecting, organizing, and analyzing information about the web. Operating and maintaining such an infrastructure is a high barrier to entry for many researchers. We recognize this and want to share some of the fruits of our labor with the research community. Today, in conjunction with the Google Faculty Summit we're making two services available under the new University Research Programs, namely access to web search, and machine translation.

University Research Program for Google Search

Google is focused on the success of the web, which is essentially an organism in and of itself with extremely complex contents and an ever-evolving structure. The primary goal of the University Research Program for Google Search is to promote research that creates a greater understanding of the web. We want to make it easy for researchers to analyze millions of queries in a reasonably short amount of time. We feel that such research can benefit everyone. As such, we've added a proviso that all research produced through this program must be published in a freely accessible manner.

University Research Program for Google Translate

The web is a global information medium with content from many cultures and languages. In order to break the language barrier, many researchers are hard at work building high quality, automatic, machine translation systems. We've been successful with our own statistical machine translation system, and are now happy to provide researchers greater access to it. The University Research Program for Google Translate provides researchers access to translations, including detailed word alignment information and lists of the n-best translations with detailed scoring information. We hope this program will be a terrific resource to help further the state of the art in automatic machine translation.

The web holds a wealth of untapped research potential and we look forward to seeing great new publications enabled by these new programs. Go ahead - surprise us!

By the way, since many researchers lead a double life as educators, we want to let you know about a site that recently launched: Google Code for Educators, designed to make it easy for CS faculty to integrate cutting-edge computer science topics into their courses. Check it out.

New Conference on Web Search and Data Mining

Kevin McCurley — Mon, 18 Jun 2007 22:10:00 +0000

Posted by Ziv Bar-Yossef and Kevin McCurley, Research Team

The pace of innovation on the World Wide Web continues unabated more than fifteen years after the first servers went live. The web was initially used by only a small community of scientists, but there are now over a billion people on the planet who use the web in their lives. The World Wide Web grows and changes as a young organism might, reflecting the social forces of the users and information producers. Each year seems to bring a radical new change, including the movement of commerce to the web, the availability of realtime news on the web, mobile users being able to access the web from anywhere, new forms of media such as video, and the emergence of blogs changing politics and publishing.

This rapid pace of innovation and scale presents many interesting research questions. At Google our goal is to organize information in ways that are useful to users, and we regularly find ourselves solving problems that seemed like ridiculous thought experiments just a few years ago. We therefore welcome the arrival of a new conference on Web Search and Data Mining, prosaically named with the acronym WSDM (pronounced as wisdom). WSDM is intended to be complementary to the World Wide Web Conference tracks in search and data mining. The soaring volume of submissions to these two tracks over the past few years justifies the foundation of a new top-tier conference on web search and mining. WSDM is a joint effort of researchers from the three large search engines (Google, Yahoo, MSN) as well as top-notch scientists from academia (such as Jon Kleinberg from Cornell, Rajeev Motwani from Stanford, and Monika Henzinger from Google and EPFL). The first WSDM conference will take place at Stanford University (the place where both Google and Yahoo! were conceived by their founders). The conference will be held in February of 2008, and the deadline for submissions is July 30, 2007. For further information see the WSDM web site. If you have good papers on search or data mining in the pipeline, please consider sending them to WSDM.

We look forward to seeing you there!

Videos of talks

Kevin McCurley — Mon, 18 Jun 2007 21:38:00 +0000

Posted by Kevin McCurley, Research Team

We've recently launched a Google Research web site that we'll be updating to provide information about research activities at Google. Among other things, you'll find there the ability to search and view videos of talks at Google.

One of the best features of working at Google is the rich variety of talks that we can attend, both technical and general interest. Most of these are videotaped for later viewing. This has multiple benefits:

In case of a scheduling conflict, Google employees may view talks at a later time (yes, some of us do have other things to do in the day).

Talks are available for viewing by Google employees at other sites. This provides us with a much more cohesive intellectual culture than most global companies.

When appropriate, speakers may opt to have their talks available on the World Wide Web. This provides a benefit to both viewers and speakers, since it allows speakers to reach a much broader audience, and it allows viewers to hear interesting talks without the need to be physically present.

The World Wide Web started out as a means for scientists to communicate among themselves. In the early days it provided a less formal and more timely means of distributing information than archival refereed publications, and it's now routine for a scientist to have a home page from which they distribute their writings and thoughts. Moreover, it's also now commonplace to find a large fraction of current scientific literature through the web, both refereed and unrefereed. In fact, the situation has evolved to the point where scientists often consult the web for publications before going to a library.

Archival publications are but one means of communication that has typically been used by scientists. Another mode of communication that has a long history of use is the presentation of talks at meetings and during visits to other institutions. Oral presentations have historically been less formal, and allow the speaker to be more speculative and interactive.

In the last few years, several technological developments have made it possible to distribute high quality video of talks on the web in addition to written publications. This distribution of videos from talks holds the promise of changing the way that scientists think about communication. Imagine what lessons would be available to us if we had the ability to view lectures by Kepler, Einstein, Turing, Shannon, or von Neumann! Imagine also what it would be like to be able to watch and listen to selected talks from conferences that are across the world, without having to suffer the burden of traveling to the remote location. Such media are unlikely to ever completely supplant the richness of communication that arises from personal interaction in physical proximity, but they will probably still change scientific communication as much as email and the web have already.