La-Proteina

Original link: https://github.com/NVIDIA-Digital-Bio/la-proteina

## La-Proteina: A New Approach to Atomistic Protein Design

La-Proteina is a novel generative model for *de novo* protein design that directly generates the amino acid sequence together with the full atomistic 3D structure. It tackles the difficulty of generating complex protein structures with a **partially latent representation**: the backbone is modeled explicitly, while sequence and atomistic detail are captured by fixed-size per-residue latent variables, side-stepping the problem of representing variable-length side chains.

The model uses flow matching to learn the joint distribution over sequences and structures, and achieves state-of-the-art results in co-designability, diversity, and structural validity. Notably, La-Proteina excels at **atomistic motif scaffolding**, a key structure-conditioned protein design task, and reliably generates proteins of up to **800 residues**, a scale at which many other models fail.

Code and models are publicly available, with detailed instructions for environment setup (with mamba or conda), checkpoint downloads, training, and sampling. Different configurations cover unconditional generation and conditional motif scaffolding, with specific code adjustments required for best performance. The evaluation pipeline relies on ProteinMPNN for designability assessment.

La-Proteina's weights are released under the NVIDIA Open Model License Agreement, and the source code under the Apache 2.0 license.

## Novel Protein Design: Promise and Concerns

A recent Hacker News discussion centered on La-Proteina, a technique for designing novel proteins with potential benefits in medicine, biotechnology, and materials science. While acknowledging the technology's promise, commenters raised concerns about the ecological and immunological effects of releasing newly created proteins into the environment.

One user questioned whether current or future regulation could adequately anticipate the consequences, given the enormous space of possible designs. Another user countered that "printing proteins", including novel ones, has been routine since the 1970s, citing insulin production as an example.

They argued that the main concern is the deliberate design of highly infectious pathogens, but even that would require a delivery mechanism (such as a virus) and would exploit existing viral machinery rather than rely on novel protein design alone. Ultimately, the discussion highlighted the balance between innovation and responsible development in synthetic biology.

Abstract. Recently, many generative models for de novo protein structure design have emerged. Yet, only few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

Find the Model Card++ for La-Proteina here.

For environment setup, mamba or micromamba is recommended; alternatively, conda can be used as a drop-in replacement (substitute conda for mamba).

mamba env create -f environment.yaml
mamba activate laproteina_env
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu118
pip install graphein==1.7.7 --no-deps
pip install torch_geometric torch_scatter torch_sparse torch_cluster -f https://data.pyg.org/whl/torch-2.7.0+cu118.html
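
After installation, a quick sanity check (not part of the official instructions) is to confirm that the pinned PyTorch build and its CUDA runtime are visible:

# Verify the environment: should report torch 2.7.0 with a cu118 CUDA build.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"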

Please download all model checkpoints into the ./checkpoints_laproteina directory, as described in the Checkpoints section; this is necessary to sample our models.

If you are only interested in sampling our models, please go to Sampling our models.

To train models and use our minimal dataloaders, create a file .env in the root directory of the repo with the single line

DATA_PATH=/directory/where/you/want/dataset
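
The dataloaders resolve this path at runtime. A minimal way to check that the variable is picked up, assuming the common python-dotenv package is installed in the environment, is:

# Assumes python-dotenv; prints the DATA_PATH configured in .env
python -c "from dotenv import load_dotenv; import os; load_dotenv(); print(os.environ.get('DATA_PATH'))"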

In the paper we train our models on subsets of the AFDB. We provide lists of IDs for each dataset in laproteina_afdb_ids.zip.

We provide minimal dataloader implementations that allow training on different subsets of the PDB. Here we describe how to use these minimal dataloaders. Note that all our models are trained on subsets of the AFDB, not on the PDB, as described in the paper. We provide the indices of the AFDB used for the datasets described in the paper here.

To use the PDB dataloader, you can for example use the pdb_train_ucond.yaml file, which we provide as part of our configs/dataset/pdb directory, in the following way:

import os
import hydra
import lightning as L

# Fix the random seed for reproducibility
L.seed_everything(43)

version_base = hydra.__version__
config_path = "</path/to/datasets_configs>"  # e.g. the repo's configs/dataset directory
hydra.initialize_config_dir(config_dir=f"{config_path}/pdb", version_base=version_base)

# Compose the dataset config and instantiate the Lightning datamodule
cfg = hydra.compose(
    config_name="pdb_train",
    return_hydra_config=True,
)
pdb_datamodule = hydra.utils.instantiate(cfg.datamodule)
pdb_datamodule.prepare_data()  # downloads and preprocesses the selected chains
pdb_datamodule.setup("fit")
pdb_train_dataloader = pdb_datamodule.train_dataloader()

With this, the dataloader selects all PDB chains according to the selection criteria specified in the yaml file, then downloads, processes, and splits the data, and produces ready-to-use dataloaders and datamodules. For a simple demonstration we subsample the dataset in pdb_train_ucond.yaml via the fraction attribute; for the full dataset, change this value to 1.
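
To sanity-check the resulting dataloader, one can iterate over a couple of batches; the exact batch format (dictionary keys, tensor shapes) is defined by the repository's datamodule and is not specified here:

# Inspect a couple of batches from the dataloader built above.
for i, batch in enumerate(pdb_train_dataloader):
    print(f"batch {i}: {type(batch)}")
    if isinstance(batch, dict):
        print("  keys:", list(batch.keys()))
    if i >= 1:
        break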

To train the autoencoder, run the command python proteinfoundation/partial_autoencoder/train.py. Certain parameters, such as the KL weight used in the loss, how often to save checkpoints, and wandb logging, can be controlled via configs/training_ae.yaml.
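
For orientation, a hypothetical excerpt of such a training config is sketched below; the key names are purely illustrative, and the authoritative parameter names are those in configs/training_ae.yaml:

# Hypothetical excerpt of configs/training_ae.yaml -- key names are illustrative.
loss:
  kl_weight: 1.0e-4        # weight of the KL term in the autoencoder loss
checkpointing:
  every_n_steps: 10000     # how often to save checkpoints
logging:
  use_wandb: true          # enable/disable wandb logging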

To train the partially latent flow matching model, run the command python proteinfoundation/train.py. This starts training according to the configuration in configs/training_local_latents.yaml, which specifies the autoencoder checkpoint, the neural network config (for unconditional models use - nn: local_latents_score_nn_160M), the learning rate, and other parameters.

Motif-scaffolding partially latent flow matching training

Training indexed all-atom vs. indexed tip-atom scaffolding models. For indexed all-atom scaffolding, set the dataset, line 11 of configs/training_local_latents.yaml, to - dataset: pdb/pdb_train_motif_aa, and the neural network, line 12 of configs/training_local_latents.yaml, to - nn: local_latents_score_nn_160M_motif_idx_aa. For indexed tip-atom scaffolding, set them to - dataset: pdb/pdb_train_motif_tip and - nn: local_latents_score_nn_160M_motif_idx_tip.

Training unindexed all-atom vs. unindexed tip-atom scaffolding models. Set the architecture, line 12 of configs/training_local_latents.yaml, to - nn: local_latents_score_nn_160M_motif_uidx (all-atom and tip-atom use the same neural network in the unindexed case). For unindexed all-atom scaffolding, set the dataset, line 11 of configs/training_local_latents.yaml, to - dataset: pdb/pdb_train_motif_aa. For unindexed tip-atom scaffolding, set it to - dataset: pdb/pdb_train_motif_tip.
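
For concreteness, the relevant part of configs/training_local_latents.yaml for the indexed all-atom case would look roughly as follows; only the two entries mentioned above change between setups, and the surrounding structure is sketched from the description rather than copied from the file:

# Sketch of the dataset/network selection in configs/training_local_latents.yaml
# (lines 11-12 of the actual file). Swap in the values listed above for the
# other scaffolding setups.
defaults:
  - dataset: pdb/pdb_train_motif_aa                 # indexed all-atom scaffolding dataset
  - nn: local_latents_score_nn_160M_motif_idx_aa    # indexed all-atom network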

Once the config is modified accordingly, run python proteinfoundation/train.py.

Training/Sampling with compiled models

Since our transformer-based architecture is amenable to hardware optimizations, we leverage the torch compilation framework to speed up training and inference. This feature is by default disabled in this repository, but can be enabled easily:

  • For sampling, just uncomment the torch.compile line in the forward method in proteinfoundation/nn/local_latents_transformer.py and proteinfoundation/nn/local_latents_transformer_unindexed.py.
  • For training, you additionally need to enable the PaddingTransform with an appropriate max_size argument so that all batches have the same length along the sequence dimension. By default, we only pad batches to the longest sequence in the batch for efficiency, but to leverage compilation this size should be constant, as sketched below.
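
As a rough illustration of the two changes above, the sketch below wraps a transformer forward pass with torch.compile and pads each batch to a fixed length; the module and tensor shapes are hypothetical stand-ins, not the repository's actual classes:

import torch
import torch.nn.functional as F

# Hypothetical stand-in for the latent transformer; in the repo the relevant
# modules live in proteinfoundation/nn/local_latents_transformer*.py.
model = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# Compile once so repeated calls reuse the optimized graph. Compilation is
# re-triggered whenever input shapes change, hence the fixed-size padding below.
compiled_model = torch.compile(model)

def pad_to_fixed_length(x: torch.Tensor, max_size: int) -> torch.Tensor:
    """Pad a (batch, seq_len, dim) tensor with zeros up to (batch, max_size, dim)."""
    return F.pad(x, (0, 0, 0, max_size - x.shape[1]))

batch = torch.randn(8, 350, 64)          # batch padded only to its longest sequence
batch = pad_to_fixed_length(batch, 512)  # constant length avoids per-batch recompilation
out = compiled_model(batch)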

La-Proteina consists of two models, the autoencoder and the latent diffusion. All checkpoints should be downloaded and placed in the directory ./checkpoints_laproteina in the codebase's root directory. The code loads checkpoints from this directory automatically. All checkpoints can be found here. We also provide individual links below.

We provide weights for the following latent diffusion models (for autoencoders see next paragraph):

  • (LD1, LD2) Latent diffusion models for unconditional generation of proteins up to 500 residues (sampled via the inference_ucond_notri and inference_ucond_tri configs, respectively).
  • (LD3) Latent diffusion model for unconditional generation of longer proteins, between 300 and 800 residues.
  • (LD4, LD5) Latent diffusion models for indexed all-atom and indexed tip-atom motif scaffolding.
  • (LD6, LD7) Latent diffusion models for unindexed all-atom and unindexed tip-atom motif scaffolding.

Each La-Proteina model requires an autoencoder in addition to the latent diffusion model. We provide three different autoencoder checkpoints since each one is trained for proteins of different lengths: up to 256 residues for atomistic motif scaffolding, up to 512 residues for unconditional generation up to 500 residues, and up to 896 residues for unconditional generation of longer chains:

  • (AE1) Autoencoder for unconditional generation up to 500 residues, which should be used with models (LD1, LD2): AE1_ucond_512.ckpt.
  • (AE2) Autoencoder for unconditional generation between 300 and 800 residues, which should be used with model (LD3): AE2_ucond_800.ckpt.
  • (AE3) Autoencoder for atomistic motif scaffolding, which should be used with models (LD4, LD5, LD6, LD7): AE3_motif.ckpt.

Sampling different configurations

We provide config files and commands to sample our different models. All models can be sampled by running

python proteinfoundation/generate.py --config_name <config_name>

for the corresponding config file.

Unconditional generation:

  • Unconditional samples from the LD1 model: Run python proteinfoundation/generate.py --config_name inference_ucond_notri. This will use noise scales of 0.1 for the alpha carbon atoms and 0.1 for the latent variables, and will produce 100 samples for each of the lengths in [100, 200, 300, 400, 500]. These values can be changed in the config file configs/generation/uncod_codes.yaml.

  • Unconditional samples from the LD2 model: Run python proteinfoundation/generate.py --config_name inference_ucond_tri. This will use noise scales of 0.1 for the alpha carbon atoms and 0.1 for the latent variables, and will produce 100 samples for each of the lengths in [100, 200, 300, 400, 500]. These values can be changed in the config file configs/generation/uncod_codes.yaml.

  • Unconditional samples from the LD3 model: Run python proteinfoundation/generate.py --config_name inference_ucond_notri_long. This will use noise scales of 0.15 for the alpha carbon atoms and 0.05 for the latent variables, and will produce 100 samples for each of the lengths in [300, 400, 500, 600, 700, 800]. These values can be changed in the config file configs/generation/uncod_codes_800.yaml.

Conditional generation: We provide models for atomistic motif scaffolding in four different setups: indexed all-atom, indexed tip-atom, unindexed all-atom, and unindexed tip-atom. Please check the paper for precise descriptions of each setup. The list of atomistic motif scaffolding tasks can be found in configs/generation/motif_dict.yaml, which also contains the specification (contig string) for each task (an illustrative task entry is sketched after the list below). Tasks ending with the "_TIP" suffix should be used with the tip-atom scaffolding models (indexed and unindexed), and tasks without the "_TIP" suffix should be used with the all-atom scaffolding models. Note that the four setups require some small changes in the codebase, as indicated below for each specific case. These changes concern how motif residues are mapped to sequence indices (different in the indexed vs. unindexed case) and how the exact features used as input to the model are built.

  • Conditional samples from the LD4 model, indexed and all-atom atomistic motif scaffolding: Run python proteinfoundation/generate.py --config_name inference_motif_idx_aa. This will use a temperature of 0.1 for the alpha carbon atoms and 0.1 for the latent variables. The motif task can be specified in the config file configs/inference_motif_idx_aa.yaml, and the number of samples produced in the configs/generation/motif.yaml file. This model should be used for all-atom tasks, that is, the ones that do not contain the "_TIP" suffix in their name. For this model you need to use lines 1851 and 1852 in proteinfoundation/nn/feature_factory.py, and lines 228 to 230 in ./proteinfoundation/evaluate.py.

  • Conditional samples from the LD5 model, indexed and tip-atom atomistic motif scaffolding: Run python proteinfoundation/generate.py --config_name inference_motif_idx_tip. This will use a temperature of 0.1 for the alpha carbon atoms and 0.1 for the latent variables. The motif task can be specified in the config file configs/inference_motif_idx_tip.yaml, and the number of samples produced in the configs/generation/motif.yaml file. This model should be used for tip-atom tasks, that is, the ones that contain the "_TIP" suffix in their name. For this model you need to use lines 1847 and 1848 in proteinfoundation/nn/feature_factory.py, and lines 228 to 230 in ./proteinfoundation/evaluate.py.

  • Conditional samples from the LD6 model, unindexed and all-atom atomistic motif scaffolding: Run python proteinfoundation/generate.py --config_name inference_motif_uidx_aa. This will use a temperature of 0.1 for the alpha carbon atoms and 0.1 for the latent variables. The motif task can be specified in the config file configs/inference_motif_uidx_aa.yaml, and the number of samples produced in the configs/generation/motif.yaml file. This model should be used for all-atom tasks, that is, the ones that do not contain the "_TIP" suffix in their name. For this model you need to use lines 1851 and 1852 in proteinfoundation/nn/feature_factory.py, and lines 233 to 240 in ./proteinfoundation/evaluate.py.

  • Conditional samples from the LD7 model, unindexed and tip-atom atomistic motif scaffolding: Run python proteinfoundation/generate.py --config_name inference_motif_uidx_tip. This will use a temperature of 0.1 for the alpha carbon atoms and 0.1 for the latent variables. The motif task can be specified in the config file configs/inference_motif_uidx_tip.yaml, and the number of samples produced in the configs/generation/motif.yaml file. This model should be used for tip-atom tasks, that is, the ones that contain the "_TIP" suffix in their name. For this model you need to use lines 1847 and 1848 in proteinfoundation/nn/feature_factory.py, and lines 233 to 240 in ./proteinfoundation/evaluate.py.
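
As an illustrative example of such a task entry, the sketch below loosely follows the contig conventions common in the motif-scaffolding literature (chain-prefixed ranges denote motif residues taken from a reference structure, bare ranges denote scaffold segment lengths to be sampled); the actual keys and syntax used by La-Proteina are defined in configs/generation/motif_dict.yaml:

# Hypothetical motif task entries, for illustration only -- the real entries and
# their exact syntax live in configs/generation/motif_dict.yaml.
MY_MOTIF_TASK:                        # all-atom variant (no "_TIP" suffix)
  contig: "10-25/A16-35/10-25"        # scaffold segment / motif residues A16-35 / scaffold segment
MY_MOTIF_TASK_TIP:                    # tip-atom variant of the same task
  contig: "10-25/A16-35/10-25"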

Our evaluation pipeline requires ProteinMPNN to compute (co-)designability. Running bash script_utils/download_pmpnn_weights.sh in the codebase root directory will download the required weights to the correct location. Note that these weights are only needed for the evaluation pipeline, not for sampling our models.

We provide code to compute (co-)designability of the samples produced by our models in proteinfoundation/evaluate.py. For the motif scaffolding tasks it also computes motif RMSD values and motif sequence recovery. Run python proteinfoundation/evaluate.py --config_name <config_name>. Please see this bash script example, which can be used to sample one of our models and evaluate the resulting samples.
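
A minimal sketch of such a sample-then-evaluate run, assuming the same config name is accepted by both scripts (the bash script shipped with the repository is the authoritative version):

#!/bin/bash
# Illustrative sketch: generate unconditional samples with the LD1 config,
# then run the evaluation pipeline on the resulting samples.
CONFIG_NAME=inference_ucond_notri

python proteinfoundation/generate.py --config_name ${CONFIG_NAME}
python proteinfoundation/evaluate.py --config_name ${CONFIG_NAME}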

Explanation of config file parameters

This section briefly explains the main parameters in the inference config files. All inference config files are based on configs/experiment_config/inference_base_release.yaml; the config files for the individual experiments (e.g. inference_ucond_notri.yaml) simply override some parameters in the base config. Some of the parameters in these config files are:

  • ckpt_name determines the checkpoint for the latent diffusion model. Only requires the checkpoint name, not full path.
  • autoencoder_ckpt_path determines the checkpoint for the autoencoder. Requires full path to checkpoint.
  • self_cond specifies whether to use self-conditioning during sampling. All our evaluations are done with self-conditioning, as we observe that it yields better performance.
  • sc_scale_noise controls the noise scale (the values for the alpha carbon atoms and the latent variables can be set separately).
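
Putting these parameters together, a hedged sketch of an experiment config that overrides the base config might look as follows; the file layout, checkpoint name, and nesting of the noise-scale values are illustrative only, and the files under configs/ are the authoritative reference:

# Illustrative override config, in the spirit of e.g. inference_ucond_notri.yaml.
# Key names follow the list above; exact nesting in the real configs may differ.
defaults:
  - experiment_config/inference_base_release

ckpt_name: LD1_ucond.ckpt                                             # latent diffusion checkpoint (name only, hypothetical)
autoencoder_ckpt_path: ./checkpoints_laproteina/AE1_ucond_512.ckpt    # full path to autoencoder checkpoint
self_cond: true                                                       # self-conditioning during sampling
sc_scale_noise:
  ca: 0.1          # noise scale for alpha carbon atoms
  latents: 0.1     # noise scale for per-residue latent variables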

Source code is released under the Apache 2.0 license. Please see the LICENSE. Model Weights are released under NVIDIA Open Model License Agreement. All other materials are released under the Creative Commons Attribution 4.0 International License, CC-BY 4.0.

Cite our paper using the following bibtex item:

@article{geffner2025laproteina,
  title={La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching},
  author={Geffner, Tomas and Didi, Kieran and Cao, Zhonglin and Reidenbach, Danny and Zhang, Zuobai and Dallago, Christian and Kucukbenli, Emine and Kreis, Karsten and Vahdat, Arash},
  journal={arXiv preprint arXiv:2507.09466},
  year={2025}
}