"Big data" may have become an over-hyped buzzword, but there is still a growing demand for new ways to store, manage, analyze, and exploit the massive amounts of electronic data being generated by everything from consumer transactions and social media to scientific computing and medical records. At the Baskin School of Engineering at UC Santa Cruz, researchers in several departments are tackling different aspects of this multifaceted problem.
A series of speakers at the engineering school's Research Review Day on October 18 provided an overview of the opportunities and challenges presented by big data. They discussed advanced technology for data storage, efficient processing of structured data, proactive information retrieval, and computational challenges for the use of genomic data in medicine.
The day began with Eric Brill, vice president of eBay Research Labs, who highlighted the opportunities presented by large amounts of data in a talk titled "The Magic is in the Data." Because of eBay's huge customer base and the data generated by every online transaction, "we've become a data company," Brill said. "Our success depends on our ability to use the data to more deeply and subtly understand our customers."
With enough data, even weak effects can yield clear signals, and the implications can be huge for a large company like eBay, he said, noting that even small changes like altering the dimensions of the search box can have a significant impact on how customers interact with the site. "This stuff really matters," Brill said.
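Brill's point about weak effects can be illustrated with a rough statistical sketch. The numbers below are hypothetical, not eBay data: they assume a design change that lifts a click rate from 5.0 percent to 5.1 percent, and they use a standard two-proportion z statistic to show how the same small difference goes from invisible to unmistakable as the number of users per group grows.

```python
import math

def z_score(rate_a, rate_b, n_per_group):
    """Two-proportion z statistic, assuming equal-sized groups."""
    pooled = (rate_a + rate_b) / 2
    # Standard error of the difference between two proportions.
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    return (rate_b - rate_a) / std_err

# Hypothetical click rates: 5.0% before the change, 5.1% after.
for n in (10_000, 1_000_000, 100_000_000):
    z = z_score(0.050, 0.051, n)
    verdict = "clear signal" if abs(z) > 1.96 else "lost in the noise"
    print(f"{n:>11,} users per group: z = {z:.2f} ({verdict})")
```

Because the standard error shrinks with the square root of the sample size, every hundredfold increase in users multiplies the z statistic by ten.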
How to manage and store all that electronic data, however, is a growing challenge. Andy Hospodor, executive director of UCSC's Storage Systems Research Center (SSRC), gave an overview of research on advanced data storage systems. SSRC researchers are developing new technologies for archival storage, searchable file systems, and storage security and forensics. New projects include energy-aware storage systems designed to minimize power consumption by large-scale data storage centers.
Scott Brandt, associate dean of research and graduate studies, professor of computer science, and director of the UCSC/Los Alamos Institute for Scalable Scientific Data Management (ISSDM), presented his group's work on efficient processing of large-scale scientific data and other kinds of highly structured data, such as financial data. Structured data is systematically organized, like data in a spreadsheet. Brandt and his colleagues have discovered how to take advantage of that organizational structure to significantly improve the efficiency of large-scale data analysis. SciHadoop is the system they developed to enhance the performance of the popular Hadoop open-source software platform for distributed processing.
"SciHadoop provides much faster processing of scientific data and much better query performance on large structured data sets," Brandt said. "A lot of commercial data is also highly structured, so it should be applicable to much more than just scientific data sets."
Another big data challenge is how to extract useful information from vast amounts of unstructured or heterogeneous data. Yi Zhang, an associate professor of Technology and Information Management, discussed her work developing a proactive information retrieval agent. Her goal, she said, is to go beyond Apple's Siri app, which responds to a user's questions, and create an electronic personal assistant that can anticipate the information needs of the user and recommend information without being explicitly asked.
"Our approach is to develop a system with the desirable characteristics you would want in a personal assistant," Zhang said. "We are building software based on machine learning techniques and Bayesian graphical models."
This project brings together research in areas such as artificial intelligence, machine learning, data mining, natural language processing, and human-computer interaction. Several demonstration systems have emerged from Zhang's work, including a product recommendation system linked to Facebook (goodbuylist.com) and a tracking system for identifying and filtering online bullying activity (kideroo.net).
In the medical world, the increasing availability of personal genomic information is an important source of big data challenges and opportunities. According to David Haussler, professor of biomolecular engineering and a Howard Hughes Medical Institute investigator, cancer genomics is leading the way to the use of personal genomics in medicine. As DNA sequencing becomes steadily faster and less expensive, cancer genome sequencing will soon become a widespread clinical practice, he said.
Haussler's group at the Center for Biomolecular Science and Engineering has built the Cancer Genomics Hub, a 5-petabyte database for genomic data from cancer patients who have had their genomes sequenced through National Cancer Institute research projects. In an effort to unravel the complex genetic roots of cancer, researchers are sequencing the normal and tumor genomes of thousands of cancer patients. With each genome taking up about 100 gigabytes, the computational challenges of storing, serving, and interpreting all this data are significant.
"Big data is needed for statistical power," Haussler said. "Understanding cancer will require aggregating DNA data from many thousands of cancer genomes."
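The storage figures cited for the Cancer Genomics Hub give a sense of scale. The back-of-the-envelope sketch below rests on several simplifying assumptions that are not from the presentation: decimal units (1 petabyte = 1,000,000 gigabytes), no space reserved for replication or file-system overhead, and exactly two genomes (normal and tumor) per patient.

```python
GB = 10**9   # bytes in a gigabyte (decimal units, an assumption)
PB = 10**15  # bytes in a petabyte (decimal units, an assumption)

def genomes_that_fit(capacity_bytes, bytes_per_genome):
    """Whole genomes that fit in the given capacity, ignoring overhead."""
    return capacity_bytes // bytes_per_genome

capacity = 5 * PB        # database size cited in the article
per_genome = 100 * GB    # approximate size per genome cited in the article

n_genomes = genomes_that_fit(capacity, per_genome)
n_patients = n_genomes // 2  # assumes one normal and one tumor genome each

print(f"{n_genomes:,} genomes, or about {n_patients:,} patients")
```

Under those assumptions, 5 petabytes holds roughly 50,000 genomes, or about 25,000 patients, which is in line with the "many thousands" of cancer genomes Haussler described.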
More information about the Research Review Day presentations is available online.