Evaluating single-cell ATAC-seq atlasing technologies using sequence-to-function modeling
By: Hannah Dickmänken, Marta Wojno, Lukas Mahieu, Koen Theunis, Eren Can Ekşi, Valerie Christiaens, Niklas Kempynck, Florian V. De Rop, Natalie Roels, Katina I. Spanier, Roel Vandepoel, Gert Hulselmans, Suresh Poovathingal & Stein Aerts

High-quality training data are essential for reliable machine learning (ML) models in biology, yet generating such data remains costly. Single-cell chromatin accessibility (scATAC-seq) atlases enable the training of sequence-to-function (S2F) models that decode enhancer logic; however, the impact of data quality and scale on model performance remains unclear.
We benchmarked custom versus commercial scATAC-seq platforms for S2F model training and introduced an improved droplet-based HyDrop v2 protocol, which offers enhanced sensitivity and scalability. Across 220 models trained on mouse and Drosophila data at varying read depths and cell numbers, we performed systematic cross-validations to assess robustness and model interpretability. The results show that lower per-cell coverage can be offset by larger datasets, enabling cost-effective S2F training without compromising predictive performance.
This work provides three main advances: a new framework to benchmark technologies in the context of ML model training; large-scale training resources, including a 600,000-cell Drosophila embryo atlas and comprehensive mouse motor cortex datasets profiled with both HyDrop v2 and 10x Genomics; and an optimized HyDrop v2 protocol for generating high-quality single-cell atlases.
Together, these results establish practical guidelines for building training data for deep learning in regulatory genomics and demonstrate that custom and commercial scATAC-seq data can be combined into robust, large-scale atlases to advance enhancer logic decoding.



Key findings
The optimized scATAC-seq platform HyDrop v2 provides high-quality training data for deep learning models. The updated protocol is freely available.
Models trained on HyDrop v2 data correctly identify in vivo validated enhancers in mouse cortex and Drosophila embryo. The models are available to download and ready to use.
Slightly lower quality of training data can be compensated for by increasing the size of the training dataset. Even when supplying 60% more cells from HyDrop v2 compared to 10x, the cost of generating such a HyDrop v2 dataset is still eight times lower.
Beyond a saturation depth of 12k reads per cell in the Drosophila embryo data and 36k reads per cell in the mouse cortex data, deeper sequencing adds no further value for model training.
The Tn5 bias is consistent between the droplet-based custom scATAC HyDrop v2 and commercial scATAC platforms across species.
Evaluating omics platforms by their ability to generate high-quality training data adds an important pillar to technology development, given the growing availability of deep learning tools for unraveling the genomic grammar of cell type identity.
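The cost finding above can be read per cell with a little arithmetic. The sketch below is only an illustration: it assumes that "eight times lower" means the HyDrop v2 dataset costs one eighth of the 10x dataset, and that "60% more cells" means 1.6 times as many cells. The function and parameter names are hypothetical and do not come from the publication.

```python
def per_cell_cost_fold_reduction(cell_ratio: float,
                                 dataset_cost_fold_reduction: float) -> float:
    """Return how many times cheaper one cell is on the cheaper platform.

    cell_ratio: cells in the cheaper dataset divided by cells in the
        reference dataset (assumed 1.6 for "60% more cells").
    dataset_cost_fold_reduction: reference dataset cost divided by the
        cheaper dataset cost (assumed 8 for "eight times lower").
    """
    if cell_ratio <= 0 or dataset_cost_fold_reduction <= 0:
        raise ValueError("both ratios must be positive")
    # Per-cell cost = dataset cost / number of cells, so the per-cell
    # fold reduction is the dataset-level fold reduction times the cell ratio.
    return dataset_cost_fold_reduction * cell_ratio


# Figures taken from the key finding, under the assumptions stated above.
fold = per_cell_cost_fold_reduction(cell_ratio=1.6,
                                    dataset_cost_fold_reduction=8.0)
print(f"Per-cell cost is roughly {fold:.1f} times lower")
```

Under these assumptions the per-cell cost difference is larger than the dataset-level difference, because the cheaper dataset also contains more cells.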
How did VSC contribute to your work?
The high-performance resources from the VSC enabled us to analyze our new large datasets with more than 600k cells very efficiently. Training more than 220 deep convolutional neural network models, each requiring GPU-accelerated computation on NVIDIA A100/H100 nodes with 80 GiB memory, would not have been possible without the Tier-2 infrastructure from the VSC.
Read the full scientific publication in Springer Nature here




