Unlock co-folding performance with synthetic datasets

Power up your protein structure prediction models with synthetic training data, including self-distillation predictions and physics-based MD simulations.

The Challenge

Current co-folding models are bottlenecked by data

Protein structure prediction models such as AlphaFold and OpenFold are data-hungry, yet public datasets contain only ~200K static structures, restricting model performance and generalization.

Our Solution

Massive expansion of training data through synthetic datasets

Synthetic datasets extend coverage into regions of sequence and structure space that lie outside the available training data, filling those gaps. Multiple models trained on synthetic datasets have shown significant performance improvements over public data alone, suggesting that data scaling laws apply to protein structure prediction as well.

Self-Distillation Datasets

Train your models on high-quality structure predictions that go beyond the public datasets.

  • Confidence-filtered predictions
  • Diverse sequence space exploration
  • Scalable to millions of structures
Example: Crambin structure with confidence scores
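As a rough illustration of the confidence-filtering step above: predictions can be kept or discarded based on their mean per-residue confidence (pLDDT, as reported by AlphaFold-style models) before entering the training set. The `Prediction` class and the 70-pLDDT cutoff below are assumptions for the sketch, not a description of our actual pipeline.

```python
# Illustrative sketch of confidence filtering for a self-distillation set.
# `Prediction` and the cutoff value are hypothetical, chosen for the example.
from dataclasses import dataclass
from typing import List

@dataclass
class Prediction:
    sequence: str
    plddt: List[float]  # per-residue confidence scores, 0-100


def mean_plddt(pred: Prediction) -> float:
    """Average the per-residue confidence into one score per structure."""
    return sum(pred.plddt) / len(pred.plddt)


def confidence_filter(preds: List[Prediction], cutoff: float = 70.0) -> List[Prediction]:
    """Keep only predictions whose mean pLDDT clears the cutoff."""
    return [p for p in preds if mean_plddt(p) >= cutoff]
```

Because the filter is a single pass over independent structures, it scales trivially to millions of predictions.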

MD Simulation Datasets

Teach your models physics and protein dynamics by training on structures drawn from molecular dynamics trajectories.

  • Longer trajectories (100 ns and beyond), enhanced sampling
  • Include small-molecule ligands, ions, key waters
  • Extended motions (loops, allosteric sites)
Example: HIV Protease with bound inhibitor
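To illustrate how a long trajectory becomes training examples, frames can be subsampled at a fixed time stride so that consecutive snapshots are less correlated. The function and the specific frame spacing below are assumptions for the sketch, not our production settings.

```python
# Illustrative sketch: turn a long MD trajectory into decorrelated training
# snapshots by striding over saved frames. Parameter values are assumptions.

def snapshot_indices(n_frames: int, dt_ps: float, stride_ps: float) -> list:
    """Pick frame indices spaced at least `stride_ps` apart.

    n_frames:  total number of saved frames in the trajectory
    dt_ps:     time between saved frames, in picoseconds
    stride_ps: desired spacing between training snapshots, in picoseconds
    """
    step = max(1, round(stride_ps / dt_ps))
    return list(range(0, n_frames, step))


# A 100 ns trajectory saved every 10 ps holds 10,000 frames; sampling one
# snapshot per nanosecond yields 100 training structures.
idx = snapshot_indices(n_frames=10_000, dt_ps=10.0, stride_ps=1_000.0)
```

The same striding logic applies whether the frames come from plain trajectories or enhanced-sampling runs; only the effective decorrelation time (and thus the stride) changes.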

Ready to scale up your model's training data?

Join our waitlist for early access to pre-fabricated datasets or custom dataset generation tailored to your needs.