Biology generates massive amounts of data with the potential to transform medicine, yet the vast majority of this knowledge remains effectively inaccessible. Public biological datasets are scattered across hundreds of repositories, described with inconsistent metadata, plagued by errors, and poorly documented. As a result, the majority of global biomedical data is invisible to researchers and unusable for machine learning. This is not a minor inconvenience; it is a structural failure at the foundation of modern science, slowing discovery and wasting enormous scientific effort. While machine learning is revolutionizing countless fields, it cannot do the same for biomedicine until its data are properly prepared.
Covalent.bio addresses this problem by building an AI-powered platform that automatically ingests, standardizes, and quality-controls entire public biological repositories at scale. Our system transforms fragmented, unstructured datasets into trustworthy, machine-readable resources that are immediately discoverable and reusable. From genomics and single-cell data to clinical and imaging studies, Covalent turns the world’s public biomedical data into a unified, continuously growing foundation for research.
By making global biological knowledge searchable, comparable, and interoperable, Covalent enables scientists to find relevant data in seconds, perform large-scale analyses that were previously impossible, and avoid redundant experiments. At the same time, we unlock the high-quality, integrated training data that machine learning models urgently require. Our long-term vision is to shift biology from a manual, siloed discipline to a predictive, data-driven science—dramatically accelerating discovery and bringing better treatments to patients faster.
Our technical vision enables a disruptive go-to-market strategy that is designed to create differentiated, high-value intellectual property at every stage.
1. The metadata layer
Covalent’s immediate focus is on standardizing and enriching the metadata of NCBI's entire SRA and GEO repositories. This is a technically tractable, capital-efficient, and high-leverage starting point.
Delivery: 3 months
Target customers: third-party platforms
Revenue potential: small ($50k-$100k)
2. The provider of ML training data
Covalent’s enriched, standardized metadata allows instant dataset discovery. Our primary commercial focus is the "virtual cell" opportunity.
Delivery: 6 months
Target customers: research institutes and large companies selling decision support systems
Revenue potential: medium ($500k-$5M)
3. The model builder
The unparalleled datasets assembled through Covalent’s public pipeline will become a unique asset for training its own proprietary machine learning models. By moving up the value chain from data provider to insight generator, Covalent will address critical challenges in drug discovery.
Delivery: 24 months
Target customers: big pharma and large biotechs running drug development pipelines
Revenue potential: medium ($1M-$20M)
4. CovalentTx
Covalent will use its repository of standardized, interoperable biomedical data, the world’s largest, as a platform to fuel therapeutic development.
Delivery: 3-5 years
Revenue potential: large ($100M-$1B)
Nick Schaum | https://www.linkedin.com/in/nicholas-schaum/
My work is driven by a single goal: to keep as many people as possible healthy for as long as possible. Fifteen years ago, I began by dissecting individual molecular pathways in ageing biology, hoping to prevent disease onset. But I soon realized that generating community‑wide resources would amplify impact far more. That led me to convene over 100 researchers to build Tabula Muris Senis, the first organism‑wide, single‑cell atlas of ageing, in partnership with the CZ Biohub.
I then co‑founded Rejuvenome, a $70 million proto-FRO mapping how rejuvenation strategies affect the ageing process across the body. From scratch, we assembled a dedicated team of twelve, designed a seven‑year roadmap, and processed tens of thousands of samples in our first year alone.
When Rejuvenome was abruptly halted, I stepped back and recognized a deeper bottleneck: data generation now outstrips our ability to organize and interpret it. At a FutureHouse workshop, I realized that machine learning could bridge that gap—so long as we first make every dataset machine‑ready. Today, my mission is to build that foundation: an AI‑driven platform that transforms fragmented biological data into a unified resource, empowering researchers everywhere to accelerate discovery and improve human health.
Jorge Bastos | https://www.linkedin.com/in/jmigbastos/
I bring a unique blend of technical expertise and strategic product leadership. For over two decades, I’ve led the development of software and data products that empower people—from children learning maths to families managing debt. At TotallyMoney, I joined pre-launch and helped grow the platform to over 5 million users. I built the first machine learning model for credit card eligibility, transforming how people managed their finances. More importantly, it made a real difference—helping people under financial stress make better decisions. I'm now applying that same mission-driven approach to the biosciences: building infrastructure that empowers both human and AI scientists.
Covalent.bio, co-founded with Nick Schaum, is my response to a core bottleneck in biology: AI can't fulfill its promise until biological data is clean, discoverable, and machine-readable. The ultimate aim is to make biology predictable—enabling simulated interventions, faster drug discovery, and the precise design of biological systems.
Covalent is designed to be an open, enduring public-good infrastructure with a sustainable path to scale. We are at an inflection point in biology. With the right infrastructure, we can unlock a Cambrian explosion of discovery. Covalent is designed to be that foundation. And I am committed—intellectually, emotionally, and professionally—to seeing it through.
Learn more at Covalent.Bio or via our blogs.
Our execution plan is flexible enough to accommodate each investor's appetite for risk and ambition. For instance, the first milestone can be simplified to target only a fraction of the SRA/GEO and still deliver an outsized, high-impact result.
We're looking for $70k to fund the first big milestone in our roadmap: the metadata layer. If we raise $140k, we can also deliver the second milestone: the provider of ML training data.