Big Cycles, Big Data: The Next Generation of Computing
Hadoop, data-intensive supercomputing, computing clouds, M45, Internet-scale computing …
These are just a few of the terms and concepts that are becoming prominent in more and more of the research in computer science, and particularly in the Carnegie Mellon Computer Science Department. They are part of an emerging shift towards research that involves large amounts of computing power, but additionally depends on the analysis of massive amounts of data to enable scientific discovery. Fields such as astrophysics, high-energy particle physics, biology, oceanography, geoscience, and environmental science are already building instruments that are capable of creating petabytes of data per day. And in computer science, we are beginning to see practical approaches to machine learning, language translation, and image processing that improve almost linearly with the amount of computing power and data available.
In a nutshell, science is becoming increasingly data-intensive, and this is creating a demand for research computing infrastructure that is big, in some cases really big.
It would be reasonable to ask at this point why today’s supercomputing centers don’t satisfy these needs. Of course, to some extent they do. But the shift is not just in raw computing power. It is in the need to interact with and maintain large amounts of data, and in the potential need to interact dynamically with both users (researchers) and instruments. An analogy can be made to web search engines, which need to provide interactive access to web data by millions of users, while simultaneously maintaining and dynamically updating the database. This is in stark contrast to today’s supercomputers, which are essentially (very powerful) batch processors.
What is both exciting and daunting is that there are so many basic computer science questions on how to compute with very large numbers of processors on highly data-intensive problems. The fact that industry is also becoming data-intensive makes this all the more interesting, because there is now the possibility that industrial computing infrastructure might also be useful for research computing.
So, what have we been doing?
For more than a year we have been in discussions with Google on ways to provide our faculty and students with access to Google-scale computing facilities for research in a wide range of areas, including machine learning, language translation, parallel and distributed algorithms, image processing, and proteomics. Other top departments have been doing the same, most notably the University of Washington, which has also staked out a leadership position in this area. In October, Google and IBM announced publicly that they would team up “to provide hardware, software and services to augment university curricula and expand research horizons.” Besides CMU and UW, the other departments involved in this announcement included MIT, Berkeley, Stanford, and Maryland. The New York Times reported on this “Cloud Computing Initiative”, explaining that “the nation’s elite universities do not provide the technical training needed for the kind of powerful and highly complex computing Google is famous for,” in part due to the lack of large-scale computing infrastructure.
We are, of course, thrilled and impressed that Google and IBM have been so proactive in pushing the cloud computing concept, and happy to play a part in its creation. Google and IBM have also begun coordinating with the NSF, an important step toward working out how the community will get access to this resource. I know that there are a lot of us ready to make big use of the “cloud” as soon as it becomes available. We will almost certainly learn a lot, and I suspect the CS curriculum will be affected in some pretty fundamental ways.
But this isn’t the end of the story for us.
Yesterday, Yahoo! and CMU announced publicly a new program to make advances in the software needed to exploit internet-scale computing resources. Quoting from the press release:
“Yahoo!’s program is intended to leverage its leadership in Hadoop, an open source distributed computing sub-project of the Apache Software Foundation, to enable researchers to modify and evaluate the systems software running on a 4,000 processor supercomputer… Called the M45,… it [is] among the top 50 fastest supercomputers in the world.
Carnegie Mellon University will be the first institution to take advantage of Yahoo!’s M45. Leading systems software researchers Garth Gibson and Greg Ganger … will instrument the system and evaluate its performance.”
The press release goes on to explain that Jaime Carbonell, Christos Faloutsos, Jamie Callan, Alexei Efros, Noah Smith, and Stephan Vogel will also be among the first to use M45, to enable their cycle- and data-hungry research applications to make much faster progress than was previously possible. The fact that this can take place using open-source Hadoop software, including newer Yahoo! language developments such as Pig (developed by our own Chris Olston), is a further boon for us and the field.
Coinciding with these developments is the appointment of Dave O’Hallaron as the new director of the Pittsburgh Intel Lab. Dave has a long history of research in high-performance computing, with notable contributions in large-scale modeling of earthquake ground motion, and is a winner of the Supercomputing’06 HPC Analytics Challenge. As the new lablet director, Dave is promoting a “big data” initiative to explore the core problems in data-intensive computing. This will undoubtedly have a big impact on what we do, given the history of very close collaborations between our department and the lablet, and our long-standing research interests in high-performance storage systems and information-retrieval problems.
Leadership from the top.
A great deal of credit must be given to Randy Bryant, who was the key person in negotiating the agreement with Yahoo!. Also impressive is the impact of Randy’s leadership on developing the DISC (Data-Intensive Super Computing) concept. While it would be incorrect to say that DISC sparked the Google/IBM and Yahoo! initiatives, it is certainly the case that DISC has been extremely important in raising awareness of the need for data-intensive computing and of the possibilities that alternative large-scale computing platforms offer for research. I’ve played a role, too, by spearheading the university’s Next-Generation Computing Initiative, to ensure that the basic research needed for data-intensive research is made available here.
Well, enough bragging about CMU Computer Science. :-) These developments are certainly not exclusive to CMU. All of the top departments are moving rapidly in this direction and making important contributions. In fact, I see this as a major trend for the whole field.
Peter Lee @ November 14, 2007