Evaluating single-cell ATAC-seq atlasing technologies using sequence-to-function modeling

Mar 30

By: Hannah Dickmänken, Marta Wojno, Lukas Mahieu, Koen Theunis, Eren Can Ekşi, Valerie Christiaens, Niklas Kempynck, Florian V. De Rop, Natalie Roels, Katina I. Spanier, Roel Vandepoel, Gert Hulselmans, Suresh Poovathingal & Stein Aerts

High-quality training data are essential for reliable machine learning (ML) models in biology, yet generating such data remains costly. Single-cell chromatin accessibility (scATAC-seq) atlases enable the training of sequence-to-function (S2F) models that decode enhancer logic; however, the impact of data quality and scale on model performance remains unclear.

We benchmarked custom versus commercial scATAC-seq platforms for S2F model training and introduced an improved droplet-based protocol, HyDrop v2, which offers enhanced sensitivity and scalability. Across 220 models trained on mouse and Drosophila data at varying read depths and cell numbers, we performed systematic cross-validations to assess robustness and model interpretability. The results show that lower per-cell coverage can be offset by larger datasets, enabling cost-effective S2F training without compromising predictive performance.

This work provides three main advances: a new framework to benchmark technologies in the context of ML model training; large-scale training resources, including a 600,000-cell Drosophila embryo atlas and comprehensive mouse motor cortex datasets profiled with both HyDrop v2 and 10x Genomics; and an optimized HyDrop v2 protocol for generating high-quality single-cell atlases.

Together, these results establish practical guidelines for building training data for deep learning in regulatory genomics and demonstrate that custom and commercial scATAC-seq data can be combined into robust, large-scale atlases to advance enhancer logic decoding.

Figure 1. a, b, Study design in mouse and Drosophila embryo. c, Estimated generation costs for 67k cells in euros, excluding sequencing costs. HyDrop v1: 851.53 euro, HyDrop v2: 668.18 euro, 10x v2: 9,337.79 euro.

Figure 2. d, e, Computational design: the 10x v1 and v2 datasets were combined into the 10x Genomics training dataset and compared with the HyDrop v2-based dataset; both served as training data for S2F deep learning models in k-fold cross-validation (k=10). Model performance was validated on standard DL metrics, accessibility predictions, and mouse cortex enhancers previously validated in vivo by Ben-Simon et al. (2024) (shown in e).
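To make the k-fold design concrete, here is a minimal sketch of how a set of training examples could be divided into k=10 folds, with each fold held out once for testing. It is an illustration only: the function names are hypothetical, the items are plain integer indices standing in for genomic regions, and the contiguous split is an assumption. The study's actual splitting scheme is not described in this post.

```python
def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    if not 1 <= k <= n_items:
        raise ValueError("k must be between 1 and n_items")
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional item.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_items, k=10):
    """Yield (train, test) index lists, holding out each fold exactly once."""
    folds = k_fold_indices(n_items, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical example: 25 regions, 10 folds.
    for fold_id, (train, test) in enumerate(train_test_splits(25, k=10)):
        print(fold_id, len(train), test)
```

Every item appears in exactly one test fold, so each of the ten models is evaluated on data it never saw during training.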

Figure 3. f, Computational design evaluating scATAC techniques as training data for S2F models in k-fold cross-validation (k=10). g-h, Scanning of ~2 kb enhancers (g) with a 500 bp sliding window (10 bp shift). Predicted accessibility of 500 bp windows of the VT3067 enhancer (VDRC library). Region coordinates are based on Kvon et al. (2014), with accessibility as ground truth. Line plots show mean predicted accessibility across n=10 cross-validation folds. Bands represent 95% confidence intervals (1.96 × SD). Both 10x and HyDrop v2-based models identify the same region as a core enhancer in the 200kb VDRC region.
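The scanning procedure in this caption can be sketched in a few lines: tile a region with 500 bp windows shifted by 10 bp, then summarise per-window predictions across folds as a mean with a 1.96 × SD band. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the predictions are placeholders rather than model outputs, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev


def sliding_windows(region_length, window=500, step=10):
    """Return 0-based, half-open (start, end) windows fully inside the region."""
    return [(s, s + window) for s in range(0, region_length - window + 1, step)]


def mean_and_band(fold_predictions, z=1.96):
    """Summarise per-window predictions across folds.

    fold_predictions: one list per fold, each holding one value per window.
    Returns a (mean, lower, upper) tuple per window, with mean +/- z * SD.
    """
    summary = []
    for values in zip(*fold_predictions):
        m = mean(values)
        half_width = z * stdev(values)  # sample SD across folds (assumption)
        summary.append((m, m - half_width, m + half_width))
    return summary


if __name__ == "__main__":
    windows = sliding_windows(2000)  # a ~2 kb enhancer
    print(len(windows), windows[0], windows[-1])

    # Placeholder predictions for two folds and three windows.
    print(mean_and_band([[0.1, 0.5, 0.2], [0.3, 0.7, 0.2]]))
```

For a 2,000 bp region these settings give 151 overlapping windows, so neighbouring predictions share 490 bp of sequence and the resulting curve is smooth.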

Key findings

The optimized scATAC platform HyDrop v2 is able to provide high-quality training data for deep learning models. The updated protocol is freely available.
Models trained on HyDrop v2 data correctly identify in vivo validated enhancers in mouse cortex and Drosophila embryo. The models are available to download and ready to use.
Slightly lower quality of training data can be compensated for by increasing the size of the training dataset. Even when supplying 60% more cells from HyDrop v2 compared to 10x, the cost of generating such a HyDrop v2 dataset is still eight times lower.
Beyond a saturation depth of 12k reads per cell in the Drosophila embryo data and 36k reads per cell in the mouse cortex data, deeper sequencing adds no further value for model training.
The Tn5 bias is consistent between the droplet-based custom scATAC HyDrop v2 and commercial scATAC platforms across species.
Evaluating -omics platforms on their ability to generate high-quality training data adds an important pillar to technology development, given the increased availability of deep learning tools to unravel the genomic grammar of cell type identity.
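The cost comparison in the third finding can be checked against the Figure 1c estimates. The sketch below assumes that generation cost scales linearly with the number of cells, which is a simplification introduced here for illustration. The function name is hypothetical and sequencing costs are excluded, as in the figure.

```python
# Estimated generation costs for 67k cells in euros, taken from Figure 1c.
HYDROP_V2_EUR = 668.18
TENX_V2_EUR = 9337.79


def cost_ratio(extra_cell_fraction=0.60):
    """Ratio of the 10x v2 cost to the HyDrop v2 cost.

    The HyDrop v2 cost is scaled up for `extra_cell_fraction` more cells,
    assuming cost grows linearly with cell number (an assumption).
    """
    hydrop_scaled = HYDROP_V2_EUR * (1 + extra_cell_fraction)
    return TENX_V2_EUR / hydrop_scaled


if __name__ == "__main__":
    print(f"equal cell numbers: {cost_ratio(0.0):.1f}x")
    print(f"60% more HyDrop v2 cells: {cost_ratio(0.60):.1f}x")
```

Under this linear assumption the ratio is roughly 14 at equal cell numbers and roughly 8.7 with 60% more HyDrop v2 cells, which is in line with the "eight times lower" figure above.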

How did VSC contribute to your work?

The high-performance resources from the VSC enabled us to analyze our new large datasets with more than 600k cells very efficiently. Training more than 220 deep convolutional neural network models, each requiring GPU-accelerated computation on NVIDIA A100/H100 nodes with 80 GiB memory, would not have been possible without the Tier-2 infrastructure from the VSC.

Read the full scientific publication in Springer Nature here

🔍 Your Research Matters — Let’s Share It!

Have you used VSC’s computing power in your research? Did our infrastructure support your simulations, data analysis, or workflow?

We’d love to hear about it!

Take part in our #ShareYourSuccess campaign and show how VSC helped move your research forward. Whether it’s a publication, a project highlight, or a visual from your work, your story can inspire others.

🖥️ Be featured on our website and social media. Show the impact of your work. Help grow our research community.

📬 Submit your story: https://www.vscentrum.be/sys