Modern machine learning methods have enabled major advances in analyzing big data, but the current state-of-the-art technology is not suited for the intricacies of surveys that use complex sampling methods. With the support of a three-year, $337,000 grant from the National Science Foundation, Assistant Professor of Statistics Paul Parker will develop statistical and machine learning methods to best suit the analysis of complex surveys produced by federal statistics agencies.
“We're currently in this data science and machine learning revolution, where there's all these new methods that can analyze these massive datasets and do so very well, but they're not necessarily able to be used off the shelf for these types of survey datasets,” Parker said. “That's because they typically assume a simple random sample from the population, which is not the case with these types of surveys.”
This project will focus on a group of surveys produced by the National Center for Science and Engineering Statistics (NCSES), such as the National Survey of College Graduates and the Survey of Earned Doctorates, which help inform important official population estimates. Instead of sampling a population with equal probability, these and other federal surveys typically over- or undersample particular groups.
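To see why unequal sampling probabilities matter, consider a small simulated illustration. It is not drawn from Parker's project or from any NCSES survey: the population, group sizes, and values below are invented, and the weighted mean shown (often called the Hájek estimator) is a standard textbook correction rather than the method the project will develop.

```python
import random


def hajek_mean(values, inclusion_probs):
    """Weighted mean in which each unit is weighted by 1 / (its inclusion probability)."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical population: a small group with high values and a large group with low values.
small_group = [rng.gauss(100, 10) for _ in range(1000)]
large_group = [rng.gauss(50, 10) for _ in range(9000)]
true_mean = sum(small_group + large_group) / 10000

# Oversample the small group: draw 500 units from each group.
sample_small = rng.sample(small_group, 500)  # inclusion probability 500/1000
sample_large = rng.sample(large_group, 500)  # inclusion probability 500/9000
values = sample_small + sample_large
probs = [500 / 1000] * 500 + [500 / 9000] * 500

# Treating the sample as a simple random sample ignores the design.
unweighted = sum(values) / len(values)
# Weighting by inverse inclusion probability accounts for the design.
weighted = hajek_mean(values, probs)

print(f"true mean:       {true_mean:.1f}")
print(f"unweighted mean: {unweighted:.1f}")
print(f"weighted mean:   {weighted:.1f}")
```

In this toy setup the unweighted mean is pulled toward the oversampled group, while the weighted mean lands close to the true population mean. A method that assumes a simple random sample makes the same kind of error as the unweighted calculation.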
Parker will create statistical methods for machine learning models that are specifically designed to account for survey design, the unique way in which data is collected. He aims to take advantage of machine learning technology’s ability to create flexible data models that can often improve the precision of population estimates.
However, machine learning models are often not equipped to quantify the uncertainty in their estimates, a shortcoming that Parker will address through the frameworks he develops.
“[The project addresses] two things: accounting for the survey design, but also incorporating it into a statistical framework to generate those uncertainty estimates,” Parker said. “I think those are the two areas where our expertise will help to improve these models.”
These new methods will benefit agencies tasked with producing population estimates from NCSES surveys, which Parker says are increasingly faced with a combination of limited resources and higher expectations for their work. The improved estimates will also be useful for people who interpret the resulting data and base policy or funding decisions on it.
Ultimately, Parker hopes that these methods will have broader applicability to other federal statistical agencies as well as fields such as economics and sociology that deal with dependent survey datasets.
This project is funded through the NSF’s National Center for Science and Engineering Statistics and will be a collaborative effort with principal investigator Scott Holan at the University of Missouri.