Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.
In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.
Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:
- the original resource requests, to permit scheduling experiments
- request constraints and machine attriibutes
- machine availability and failure events
- some of the reasons for task exits
- (obfuscated) job and job-submitter names, to help identify repeated or related jobs
- more types of usage information
- CPI (cycles per instruction) and memory traffic for some of the machines
Note that this trace primarily provides data about resource requests and usage. It contains no information about end users, their data, or access patterns to storage systems and other services.
More information can be found via this link, which will (after a short questionnaire) take you to a site that provides access instructions, a description of the data schema, and information about how the data was derived and its meaning.
We hope this data will facilitate a range of research in cluster management. Let us know if you find it useful, are willing to share tools that analyze it, or have suggestions for how to improve it.