The National Science Foundation (NSF) has awarded a $1.5 million grant to a team of computer scientists, statisticians, and mathematicians at UC Santa Cruz to develop the tools and techniques needed to understand large, complex datasets in fields as diverse as social sciences, biology, cybersecurity, and computer networking.
Led by Lise Getoor, professor of computer science in the Baskin School of Engineering, the project will address the challenges of incompleteness, uncertainty, and bias in large, heterogeneous sets of interconnected data.
“When we talk about big data, the challenges are not just the scale of it, but also that all of the data are connected in some way, whether it’s social media data, or the Internet of Things, or protein interactions in biology. Data now are interconnected in all kinds of crazy, complex ways, and we need new computational and mathematical tools to help us make inferences and find patterns in the data,” Getoor said.
This type of interconnected data, where relationships or links between nodes define a network of interactions, is often referred to as heterogeneous graph data. Statistician Abel Rodriguez, professor of applied mathematics and statistics and a co-principal investigator on the grant, explained that incomplete data are particularly problematic in this context.
“We only see pieces of the whole network, and what makes this challenging is that it’s not random, we tend to miss certain things preferentially,” he said. “That type of bias is apparent in, for example, social media networks like Facebook or LinkedIn, where you don’t necessarily see all of my friends or connections. If I only respond to requests to connect but don’t initiate any, you’ll get a very biased version of my network. So how to identify and adjust for those biases is one component of this project.”
Digital trails
The digital trails people now leave behind in so many activities and interactions fit naturally into the network structure of graph data. The biases, uncertainty, and incompleteness in so much of this data raise issues of fairness when the data are used as the basis for decisions that affect people’s lives.
“We want to understand the bias in a more nuanced way so that we can talk about accuracy and fairness in the context of these rich, multimodal, heterogeneous datasets,” Getoor said. “We need to train people to understand and question the outputs of data science algorithms. I don’t think we even understand yet all the ways the algorithms can go wrong.”
As part of the project, the researchers aim to develop a theoretical framework for “responsible data science” to address issues such as bias, fairness, privacy, and robustness. Rodriguez noted that uncertainty in the data can actually be exploited as a powerful tool for protecting privacy and preserving anonymity. The researchers will investigate techniques that add random noise during data processing to preserve the privacy of individual nodes in a network.
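As a rough illustration of the noise-adding idea, and not the team's actual method, the sketch below applies the standard Laplace mechanism to a single network statistic, the number of edges. The function name `noisy_edge_count` and the privacy parameter `epsilon` are assumptions made for this example. It also protects individual links, not whole nodes; hiding a node and all of its connections generally requires more noise.

```python
import random


def noisy_edge_count(edges, epsilon, rng=None):
    """Return the edge count of an undirected graph plus Laplace noise.

    Illustrative sketch only: adding or removing one edge changes the count
    by at most 1, so Laplace noise with scale 1/epsilon is used here.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()

    # Treat (a, b) and (b, a) as the same undirected edge.
    true_count = len({frozenset(edge) for edge in edges})

    # The difference of two independent exponential draws with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    friendships = [("ana", "ben"), ("ben", "cho"), ("cho", "ana")]
    print(noisy_edge_count(friendships, epsilon=0.5))
```

Each call returns a slightly different answer, so no single released number reveals whether a particular link is present, while averages over many such statistics remain useful.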
In addition to Getoor and Rodriguez, the faculty involved in this collaborative project include computer scientists Seshadhri Comandur (a co-PI), Dimitris Achlioptas, and Abhradeep Guha Thakurta; statistician Rajarshi Guhaniyogi; and applied mathematician Daniele Venturi (also a co-PI). The team brings to the table a wide range of expertise and a diverse set of tools, including randomized algorithms, Bayesian statistics, and uncertainty quantification. A major goal of the collaboration is to bring together researchers from computer science, statistics, and mathematics, considered the fundamental disciplines of data science, to develop the field’s theoretical foundations.
Getoor and Rodriguez also lead D3, a data science research center that provides a platform for collaboration with industry partners. Synergistic interactions between D3 and the NSF project will help advance both programs, Getoor said. They are key components of a growing data science initiative across the UC Santa Cruz campus.
The new grant is part of a major NSF program to develop the theoretical foundations of data science, called Transdisciplinary Research in Principles of Data Science (TRIPODS). The UCSC project is called “TRIPODS: Towards a Unified Theory of Structure, Incompleteness, and Uncertainty in Heterogeneous Graphs.” It is one of 12 awards NSF announced August 24 to support the development of small collaborative institutes in Phase I of the TRIPODS program. A future TRIPODS Phase II is planned to support a smaller number of larger institutes.