Researchers tackle problem of data storage for next-generation supercomputers

The U.S. Department of Energy (DOE) has awarded a five-year, $11 million grant to researchers at three universities and five national laboratories to find new ways of managing the torrent of data that will be produced by the coming generation of supercomputers. The Petascale Data Storage Institute includes researchers at the University of California, Santa Cruz, Carnegie Mellon University, University of Michigan, and the DOE's Los Alamos, Sandia, Oak Ridge, Lawrence Berkeley, and Pacific Northwest National Laboratories.

The innovations developed by this new institute will enable U.S. scientists to fully exploit the power of computing systems that will be capable of performing millions of billions of calculations each second. Such computational power is necessary because scientists depend on computer modeling to simulate extremely complicated phenomena, such as the global climate system, earthquake motions, the design of fuel-efficient engines, nuclear fusion, and the global spread of disease.

Computer simulation of these processes yields scientific insights that conventional observation or experimentation often cannot provide. This capability is critical to U.S. economic competitiveness, scientific leadership, and national security, the President's Information Technology Advisory Committee concluded last year.

But simply building computers with faster processing speeds--the new target threshold is a quadrillion (a million billion) calculations per second, or a "petaflop"--will not be sufficient to achieve those goals. Darrell Long, the Malavalli Professor of Storage Systems Research at UCSC, said the main challenges involve building reliable systems on a vast scale and handling huge amounts of data at high speeds.

"In these giant supercomputers, you've got many thousands of hard disks all working in parallel to provide as much data as possible to as many processors as possible. The file system feeds data to the processors, and if you want to increase speed by adding more processors, the file system has to be able to scale up too and feed the processors at a higher rate," said Long, who directs UCSC's Storage Systems Research Center.

Garth Gibson, a Carnegie Mellon University computer scientist who will lead the data-storage institute, said new methods will be needed to handle the huge amounts of data that computer simulations both use and produce. Petaflop computers will achieve their high speeds by adding hundreds of thousands to millions of processors, and they will also require many more hard disks to handle the data required for simulations, provide fault tolerance, and store the output of the experiments, Gibson said.

"With such a large number of components, it is a given that some component will be failing at all times," he said.

Today's supercomputers, which perform trillions of calculations each second, suffer failures once or twice a day, said Gary Grider of Los Alamos National Laboratory, a co-principal investigator. Once supercomputers are built out to the scale of multiple petaflops, he said, the failure rate could jump to once every few minutes. That means petascale data storage systems will require robust designs that can tolerate many failures, mask the effects of those failures, and continue to operate reliably.
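Grider's estimate follows from simple reliability arithmetic: if individual components fail independently at a roughly constant rate, their failure rates add, so the mean time between system-level failures shrinks in proportion to the number of parts. The short calculation below is a sketch of that reasoning with assumed part counts and a nominal per-component MTBF, not measured figures from the laboratories.

    # Back-of-the-envelope sketch of why failure rates climb with scale.
    # The part counts and the 1,000,000-hour component MTBF are assumptions.

    def system_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
        """With independent, constant-rate failures, the rates add, so the
        system MTBF is roughly the component MTBF divided by the part count."""
        return component_mtbf_hours / num_components

    COMPONENT_MTBF_HOURS = 1_000_000  # nominal vendor-style figure (assumed)

    # A terascale machine with tens of thousands of parts: failures roughly daily.
    print(system_mtbf_hours(COMPONENT_MTBF_HOURS, 50_000))          # ~20 hours

    # A petascale machine with millions of parts: failures every few minutes.
    print(system_mtbf_hours(COMPONENT_MTBF_HOURS, 5_000_000) * 60)  # ~12 minutes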

"It's beyond daunting," Grider said of the challenge facing the new institute. Imagine failures every minute or two in your PC and you'll have an idea of how a high-performance computer might be crippled. For simulations of phenomena such as global weather or nuclear stockpile safety, he said, "We're talking about running for months and months and months to get meaningful results."

Collaborating members in the Petascale Data Storage Institute represent a breadth of experience and expertise in data storage. "We felt we needed to bring the best and brightest together to address these problems that we don't yet know how to solve," said Grider, leader of Los Alamos's High Performance Computing Systems Integration Group.

Carnegie Mellon and UC Santa Cruz are the two leading academic centers for storage systems research, while the University of Michigan is a leader in network file systems, and all three have sizable government and industrial collaborations. UCSC's Storage Systems Research Center is focused on improving the performance, security, and reliability of large-scale data-storage systems and software. It includes faculty from the Departments of Computer Science, Computer Engineering, and Electrical Engineering in UCSC's Baskin School of Engineering.

Los Alamos and Oak Ridge National Laboratories are both in the process of building petaflop supercomputers, while a third member, Sandia National Laboratories, has recently built a leadership-class supercomputer. The remaining members, the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory and Pacific Northwest National Laboratory, both provide supercomputing resources for a diverse array of scientists.

The data storage institute will focus its efforts in three areas: collecting field data about computer failure rates and application behaviors, disseminating knowledge through best practices and standards, and developing innovative system solutions for managing petascale data storage. The last of these could include so-called "self-star" systems that use computers to manage computers.

"The institute is bringing together the best people in the world to collaborate on these problems, and it provides a place where DOE can go to talk to the experts in this area," Long said.

The Petascale Data Storage Institute is part of DOE's Scientific Discovery through Advanced Computing program, which develops new tools and techniques for computational modeling and simulation. It is funded by a grant from the DOE Office of Science. More information about the institute is available on the web at http://www.pdl.cmu.edu/PDSI/.

______

Note to reporters: You may contact Long at (831) 459-2616 or darrell@cs.ucsc.edu.