U-SWAP – Utah State-Wide Archive Project


Goals


This proposal requested support to pilot state-wide data archiving and disaster management services as part of the broader data cyberinfrastructure (CI) being developed to serve researchers across the state of Utah. Specifically, funds were requested to extend current capabilities with a modest Ceph-based S3 object storage system at the University of Utah’s Downtown Data Center (DDC). The system would then be integrated with existing infrastructure at the University of Utah’s Center for High Performance Computing (CHPC), as well as with off-site disaster recovery and archiving cyberinfrastructure.

Logistics


With the aid of an NSF Campus Cyberinfrastructure (CC*) award, the CHPC has built a prototype state-wide archive system. This system provides infrastructure that allows researchers to satisfy the data sharing, resiliency, and retention requirements placed on published and completed datasets. It also gives researchers an opportunity to share datasets with national data-sharing platforms, such as the National Data Platform (NDP) and the Open Science Data Federation (OSDF), which promote caching datasets close to computational resources. This system can only house open data; that is, it may only be used for data without security regulations or restrictions.

This system is composed of two pieces:

  • A disk-based object store built on the Ceph software stack and located at the Downtown Data Center (DDC)
  • A tape-based library located at the Tonaquint Data Center (TDC) in St. George

Users interact with the object store at the DDC, called ARC-A (Archive-A), copying their datasets to it via an S3 interface with tools such as Rclone and Globus. Datasets are then automatically replicated to the Spectra Logic BlackPearl tape library system, called ARC-B, where two copies are written. ARC-A has a capacity of 2.8 PB and ARC-B of 7.2 PB.
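
For example, uploads can be scripted directly against the S3 interface. The short Python sketch below uses the boto3 library; the endpoint URL, credential profile, bucket, and object names are hypothetical placeholders rather than the system’s actual values, which CHPC provides with an allocation.

    # Minimal sketch: copy one file of a dataset to the ARC-A object store over S3.
    # The endpoint, profile, bucket, and key are hypothetical placeholders.
    import boto3

    # Credentials are read from a locally configured profile (hypothetical name).
    session = boto3.session.Session(profile_name="arc-a")
    s3 = session.client("s3", endpoint_url="https://arc-a.example.edu")  # placeholder endpoint

    # upload_file transparently switches to multipart transfers for large objects.
    s3.upload_file(
        Filename="assembly_v2.fasta.gz",
        Bucket="my-allocation",  # bucket assigned with your allocation (hypothetical)
        Key="datasets/assembly_v2.fasta.gz",
    )

With Rclone, the equivalent is a single "rclone copy" from a local directory to an S3 remote configured with the same endpoint and credentials.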

If you have large datasets that (a) have accessibility requirements, (b) are associated with a publication, or (c) are, or will be, broadly used by other institutions, we encourage you to apply for an allocation of space. Because this system was created with grant funds, it will be provided free of charge for the life of the grant; beyond that window of time, there will be a per-terabyte charge, which we are working to determine. We are limiting allocations to 50 TB per group. To apply for an allocation of space on this system, please contact us. As part of the application process, we ask that you provide a coarse manifest of the datasets you plan to store on the archive, along with a description of the broader significance of your data, the capacity you require, and the duration of any applicable data retention requirements.

Use Cases


Zach Gompert, Co-PI and Professor of Biology at Utah State University, is using the system to back up important genome sequence data, genome assemblies, and annotations. Although these data will eventually be deposited in public databases such as the National Center for Biotechnology Information (NCBI) after publication, the tape archives serve a critical role in Gompert’s research pipeline: before publication, they back up both the data and intermediate files that would be time- and resource-intensive to regenerate.

John Lin, Professor of Atmospheric Sciences at the University of Utah, is using the system as an invaluable backup for irreplaceable air quality and atmospheric observations collected at sites across Utah. These datasets include some of the longest and most distinctive records of their kind in the world, including greenhouse gas observations carried out on the light rail system in Salt Lake City and in the oil- and gas-producing area of the Uinta Basin.

Nina de Lacy, Assistant Professor of Psychiatry at the University of Utah, is leveraging the new durable, long-term, disaster-recovery storage for large numerical datasets, enabling multiple secure copies to be maintained automatically. Going forward, completed datasets will be automatically deposited as part of standardized workflows, ensuring that these important data products remain accessible for future analyses and reproducibility while freeing the lab’s active systems for day-to-day computation.

Tanmoy Laskar, Assistant Professor of Physics & Astronomy at the University of Utah, generates large volumes of raw data, ranging from tens to hundreds of gigabytes per dataset, in collaboration with international radio telescope facilities. These data must be processed, both manually and through data analysis pipelines, before they can be used for research. The key products of this pre-processing are calibration tables, sky images, and large volumes of calibrated data products. Laskar is using the archive for short- and long-term storage of these products to support reproducible analysis; it allows the research team to perform cross-dataset archival studies without re-running the compute- and I/O-intensive calibration pipelines.

Fan-Chi Lin, Associate Professor of Geology & Geophysics at the University of Utah, has a large dataset (approximately 10 TB) containing distributed acoustic sensing (DAS) velocity measurements collected along an 8.4 km fiber optic cable between the University of Utah campus and the Downtown Data Center. This dataset provides new insights into shallow seismic velocities and fault-related structures based on vibration signals along the East Bench segment of the Wasatch Fault zone. Transitioning this dataset into the archive system will make access and preservation more seamless by consolidating application, storage, archival, and disaster recovery functions in one integrated platform, reducing reliance on custom scripts and simplifying reuse.

Acknowledgements


Funding for this project is provided by NSF award #2430361.