January 27th, 2010 | Published in Google Research
As part of our series on research areas of interest to Google, we discuss some important areas relating to cluster applications in distributed systems. In the last two decades distributed systems have undergone a metamorphosis from academic curiosities to the foundation of an entire industry. Despite these successes, at Google we see distributed systems as a technology in its infancy, with huge gaps in the supporting research (some examples here and here) that represent some of the most important problems in the space. Here are some examples:
- Resource sharing: Stranded resources like idle memory, CPU, and disk bandwidth represent huge capital and operating expenses that deliver no business value. A cluster system based upon the best published research would be likely to leave 50% or more of hardware resources idle. We encourage researchers to explore hardware/software architectures that facilitate more supple sharing to avoid stranded and underutilized computational resources.
- Balancing cost, performance, and reliability: Current cluster applications tend to be excessively rigid and brittle, offering only coarse controls to tune the balance between reliability, performance, and cost. We envision systems that allow cost to be optimized based on an input specification of performance and reliability requirements. An effective solution might allow service level settings to propagate downward through the layered structure of the system.
- Self-maintaining systems: The level of expertise required to troubleshoot today's large systems is one of the biggest barriers to more and larger deployments. The published research in this area has at best marginally reduced the need for such rare expertise. We envision systems that can adapt automatically to changing conditions, in which redundancy and multiple geographically distributed data centers simplify rather than complicate manageability. This will require breakthroughs in monitoring and data analysis to address the diversity of failure modes and simplify the task of keeping systems healthy.
Previous posts in the series: Multimedia