AnVIL makes groundbreaking genomic datasets available on AWS for free
With the support of the AWS Registry of Open Data, AnVIL has made major genomic datasets available on AWS, free of the data transfer fees that previously cost over $15,000 for some datasets.
Key takeaways
- With the support of the AWS Registry of Open Data, AnVIL has made major genomic datasets available on AWS, free of the data transfer fees that previously cost over $15,000 for some datasets.
- Datasets include some of the most important resources in human genomics, from the 1000 Genomes Project to the first complete map of the human genome.
Some of the most important datasets in human genomics are now available for researchers everywhere to use to make impactful health discoveries, without being limited by costly transfer fees. Researchers at the UC Santa Cruz Genomics Institute Computational Genomics Lab and the Broad Institute have deployed a mirror of the National Human Genome Research Institute (NHGRI) AnVIL Data Explorer’s open-access genomic datasets in the Amazon Web Services (AWS) Registry of Open Data.
AnVIL, the Genomic Data Science Analysis, Visualization, and Informatics Lab-space, is NHGRI’s flagship platform for making large-scale genomic data available to the research community. By centralizing storage, computing, and data sharing in one cloud-based environment, AnVIL is designed to make it easier to access and work with publicly funded genomic data.
The new mirror makes some of the most important data collections in human genomics freely accessible to researchers working outside of Google Cloud for the first time, and quietly removes one of the most frustrating barriers in modern genomics research.
“We are very excited to expand the capabilities of the AnVIL to reach researchers working on AWS as well as those looking to analyze AnVIL data on their local compute infrastructure,” said Benedict Paten, Professor of Biomolecular Engineering at UC Santa Cruz and Director for Computational Genomics at the UC Santa Cruz Genomics Institute. “With these features we are well placed to support a whole new group of potential AnVIL users.”
The hidden cost of “open” data

Openly available datasets accelerate the pace of research and discovery. Genomic datasets in particular have led to countless breakthroughs for human health. Large public research consortiums funded by the NHGRI are required to make their data openly available so that other researchers can get the most value out of data that is incredibly costly to collect. Just because this data is publicly available, however, does not mean that accessing it is free, and the costs associated with downloading it can be prohibitive.
Moving large amounts of data out of a cloud provider’s storage, a process the industry calls “data egress,” triggers fees charged by the cloud provider. For small files, this is barely noticeable, but for the massive datasets that genomics research depends on, it can be staggering.
For example, the Telomere-to-Telomere (T2T) genome assembly, co-led by UC Santa Cruz and NHGRI, is one of the most complete and scientifically significant maps of the human genome ever produced, and the dataset consists of over 200 TB of data. Downloading that dataset from Google Cloud, where AnVIL’s data has historically lived, could cost a research group well over $15,000 in egress fees alone. For many labs, particularly those at smaller institutions or in regions with limited cloud computing resources, that price tag effectively puts the data out of reach.
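To see how a figure of that size arises, here is a rough back-of-the-envelope sketch. The flat rate of $0.08 per GB is an assumption chosen purely for illustration, not a quoted price; real cloud egress pricing is tiered, varies by destination, and changes over time. The function name `estimate_egress_cost` is likewise hypothetical.

```python
def estimate_egress_cost(size_tb: float, rate_per_gb: float = 0.08) -> float:
    """Estimate the egress cost in USD for moving `size_tb` terabytes.

    `rate_per_gb` is an assumed flat rate for illustration only;
    actual provider pricing is tiered and subject to change.
    """
    size_gb = size_tb * 1000  # decimal terabytes to gigabytes
    return size_gb * rate_per_gb


# A dataset of roughly 200 TB, as described for the T2T data above.
cost = estimate_egress_cost(200)
print(f"Estimated egress cost: ${cost:,.0f}")  # prints "Estimated egress cost: $16,000"
```

Under that assumed rate, 200 TB works out to about $16,000 for a single full download, which is in line with the "well over $15,000" cited above.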
What’s changed
By mirroring AnVIL’s open-access datasets in the AWS Registry of Open Data and the AWS Marketplace, the AnVIL team has created a second, cost-free pathway to this data. Researchers can now download these datasets directly from AWS storage without incurring any egress charges, whether they plan to analyze the data within AWS, on a university computing cluster, on a regional supercomputer, or in any cloud environment other than Google Cloud.
This means that the data is no longer functionally locked inside a single cloud ecosystem, allowing researchers whose infrastructure and workflows already live in AWS or private high-performance computing clusters to access the data. This matters enormously for the scientific community’s ability to act on the principle that publicly funded research should be genuinely and practically accessible to everyone.
The datasets now available through this program include some of the most widely used resources in human genomics, including the 1000 Genomes Project, the Human Pangenome Reference Consortium, the open portion of the Genotype-Tissue Expression (GTEx) project, and T2T consortium data. Researchers have several download options, which make the data equally accessible to bioinformaticians running automated pipelines and to scientists who just need to grab a handful of files.
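For the automated-pipeline case, one common pattern with public cloud storage is to address each file by a plain HTTPS URL. The sketch below builds such a URL using Amazon S3's virtual-hosted-style addressing. Everything specific in it is a placeholder: the bucket name, object key, and region are hypothetical, and it assumes the bucket permits anonymous reads, which the article does not state. The actual bucket names and access instructions are listed in each dataset's entry in the AWS Registry of Open Data.

```python
from urllib.parse import quote


def s3_public_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in an S3 bucket.

    The key is percent-encoded, with "/" left intact as the path separator.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# Hypothetical bucket and key, for illustration only.
url = s3_public_url("example-anvil-open-data", "T2T/assembly v2.fa.gz")
print(url)
```

A pipeline could pass a URL like this to any HTTP client, while someone fetching only a few files might prefer a browser or a command-line tool.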
UC Santa Cruz and its partners like NHGRI have long been committed to making genomic data accessible and impactful, and the AnVIL project is central to achieving that goal. By removing the financial penalty for downloading open data, AnVIL is ensuring that wherever a researcher works, and whatever computing infrastructure they have access to, they are able to participate in leading-edge genomics research.