Data science researchers to tackle privacy challenges in genomics

Computer scientist Abhradeep Guha Thakurta has won NSF funding to investigate ways to protect the privacy of individuals while allowing access to large genomic data sets

Rapidly growing databases of human genome sequences represent a potential goldmine of information for health researchers, but access to these databases is tightly controlled and extremely limited due to privacy concerns.

“Unfortunately, so far, the dramatic drop in sequencing costs has not translated into a significant increase in publicly accessible large-scale genomic data sets. Hundreds of thousands of whole genome sequences are hidden away on encrypted servers,” said Abhradeep Guha Thakurta, assistant professor of computer science and engineering at UC Santa Cruz.

Guha Thakurta hopes to unlock the full potential of these data by developing reliable methods for preserving the privacy of individuals whose genomes have been sequenced, while allowing broad access to genomic data sets. He has received a $600,000 grant from the National Science Foundation (NSF) to fund the project, which is part of a larger data science effort in the Baskin School of Engineering at UC Santa Cruz.

In 2017, UC Santa Cruz was one of 12 universities funded by NSF's Transdisciplinary Research in Principles of Data Science (TRIPODS) program to create small collaborative institutes working on the theoretical foundations of data science. The new award is one of 19 TRIPODS+X grants intended to expand the scope of these cross-disciplinary TRIPODS institutes into broader areas of science, engineering, and mathematics.

“The multidisciplinary approach for addressing the increasing volume and complexity of data enabled through the TRIPODS+X projects will have a profound impact on the field of data science and its use,” said Jim Kurose, NSF assistant director for computer and information science and engineering. “This impact will be sure to grow as data continues to drive scientific discovery and innovation.”

Guha Thakurta's project will investigate approaches for sanitizing sensitive genomic data that provably protect the privacy of individuals in the data set while preserving the statistical validity of the data. If successful, the project will provide algorithmic tools that allow geneticists to run statistical analyses on data sets that were previously inaccessible due to privacy concerns.

Guha Thakurta's team includes genomics expert Russ Corbett-Detig, an assistant professor of biomolecular engineering; theoretical computer scientist Dimitris Achlioptas, a professor of computer science and engineering; and statistician Vishesh Karwa at Temple University.

The UC Santa Cruz TRIPODS effort brings together researchers from mathematics, statistics, and computer science to develop a unified theory of data science applied to uncertain and heterogeneous graph and network data. Led by Lise Getoor, professor of computer science and engineering, the researchers collaborate closely with the D3 Data Science Research Center and Data Science Santa Cruz.

The TRIPODS institutes share expertise and work together to advance NSF priorities. The program aligns with Harnessing the Data Revolution (one of the 10 Big Ideas for Future NSF Investments), which aims to engage NSF's grantee community in the pursuit of fundamental research in data science and engineering, the development of a cohesive, federated, national-scale approach to research data infrastructure, and the development of a 21st-century data-capable workforce.

“TRIPODS+X is exciting not only for its near-term impact addressing some of society's most important scientific challenges, but because of its potential for developing tools for future applications,” said Anne Kinney, NSF assistant director for mathematical and physical sciences.