Our Datasets

Synthetic datasets designed for training state-of-the-art protein structure prediction models. Supercharge your models and go beyond the PDB.

Who's already using synthetic datasets?

Here are the world-leading models that already use synthetic data in their training curricula.

AlphaFold/OpenFold

Foundational co-folding models

Demonstrated significant performance gains from training on large-scale self-distillation datasets, especially for biologics and antibodies.

PEARL

Genesis Molecular AI

The PEARL technical report highlights the data-scaling benefits observed when training on PDB-based synthetic data.

Read PEARL Technical Report

Boltz2

Boltz.bio

The Boltz2 pre-print describes three separate molecular dynamics datasets used in model training.

Read Boltz2 pre-print

Self-Distillation Datasets

Large-scale model inference with confidence filtering for high-quality synthetic structures

Generation Pipeline

  • Base Models: any open-source model (OpenFold3p, Boltz1/2), or your own custom models
  • Confidence Threshold: calibrated pLDDT
  • Sequences: clustered to reduce redundancy and maximize diversity
  • Scale: Millions of structures
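The pipeline above can be sketched in a few lines. This is a minimal illustration, not the production system: the pLDDT cutoff, the crude pairwise identity measure, and the greedy clustering are all illustrative assumptions (real pipelines typically use calibrated per-model thresholds and tools like MMseqs2 for clustering).

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One model inference: a sequence plus its mean pLDDT confidence (0-100)."""
    sequence: str
    mean_plddt: float

def seq_identity(a: str, b: str) -> float:
    """Crude position-wise identity over the shorter sequence (illustrative only)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def filter_and_cluster(preds, plddt_cutoff=70.0, identity_cutoff=0.9):
    """Keep confident predictions, then greedily drop near-duplicate sequences,
    retaining the highest-confidence representative of each cluster."""
    confident = [p for p in preds if p.mean_plddt >= plddt_cutoff]
    representatives = []
    for p in sorted(confident, key=lambda p: -p.mean_plddt):
        if all(seq_identity(p.sequence, r.sequence) < identity_cutoff
               for r in representatives):
            representatives.append(p)
    return representatives
```

Because candidates are visited in descending confidence order, each cluster is represented by its most confident member, which is the usual goal of confidence-filtered self-distillation.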

Use Cases

  • Pre-training & fine-tuning foundation models
  • Data augmentation
  • Biologics discovery

MD Simulation Datasets

Extended molecular dynamics simulations capturing protein dynamics and conformational states

Dataset Specifications

  • Trajectory Length: 10 ns to 100 ns per simulation, both vanilla and enhanced-sampling runs
  • Force Fields: AMBER, CHARMM, OpenFF
  • Solvation: Explicit water (TIP3P, TIP4P) or mixed-solvent systems
  • Ensembles: NPT and NVT conditions
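The specifications above can be captured as a per-run metadata record. The field names below are hypothetical (this is a sketch of how one trajectory might be described, not our actual schema); the values mirror the specification list.

```python
# Hypothetical metadata for one MD run; field names are illustrative,
# values follow the dataset specifications above.
run_spec = {
    "force_field": "AMBER",      # AMBER, CHARMM, or OpenFF
    "water_model": "TIP3P",      # explicit solvent (TIP3P or TIP4P)
    "ensemble": "NPT",           # NPT or NVT
    "length_ns": 100,            # 10-100 ns per trajectory
    "timestep_fs": 2,            # typical timestep with constrained H bonds
    "save_interval_ps": 10,      # spacing between saved frames
}

def n_frames(spec: dict) -> int:
    """Saved frames per trajectory: total length divided by the save interval."""
    return int(spec["length_ns"] * 1000 / spec["save_interval_ps"])
```

For example, a 100 ns trajectory saved every 10 ps yields 10,000 frames, which is the kind of arithmetic that determines the on-disk size of an MD dataset.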

Protein Preparation

  • Careful preparation is key to avoiding simulation artifacts
  • Termini capped to prevent spurious unfolding
  • Co-factors, lipids, ions, and small-molecule ligands included
  • Support for protein-ligand complexes and multi-chain assemblies

Ready to access our datasets?

Join our waitlist for early access to pre-made packages and custom dataset generation