arXiv Papers of Protein Design
Authors: Rome Thorstenson
Abstract: The Categorical Jacobian (CJ) of Zhang et al. (2024) reads protein contacts from a language model by perturbing every residue with every alternative amino acid, about 19L forward passes. We show the signal it reconstructs is already concentrated in a small subset of attention heads: averaging the top‑K contact‑relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage‑clean data for every bidirectional model where CJ is defined, and matches or beats it in‑distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian). Ablations localize the gain to labeled head selection, not averaging: at a matched label budget the unweighted mean ties a supervised L1 logistic regression on the same heads, so the parameter‑free mean is selection's minimal form, not the source of the advantage. Our primary test is leakage‑clean: on a CAMEO split where neither selection nor evaluation touches data the models have plausibly memorized, the head readout beats CJ on ESM‑2‑650M by +9 pp (N=29, p<0.001), with the within‑model margin reproducing across architectures on a wider pretraining‑aware set. Both methods fall 30‑36 percentage points from their in‑distribution Zhang numbers to the leakage‑clean numbers, consistent with substantial pretraining overlap inflating prior numbers (a CAMEO‑vs‑Zhang difficulty shift contributes too, so we read it as an upper bound on the leakage component). We additionally introduce representation‑CJ, a hidden‑state generalization of the Jacobian for architectures without a masked‑LM head; show that the optimal K tracks how diffusely a model spreads its contact heads; and find that both methods lose the contact signal on both causal LMs we test (ProGen2), suggesting attention‑encoded pair structure may depend on bidirectional pretraining.
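The head readout described in this abstract can be illustrated with a minimal sketch. It assumes that each attention head has already been given a contact-relevance score on a few labeled proteins (the `head_scores` dictionary is hypothetical, as are the function names), and that attention maps are symmetrized before averaging, which is an assumption rather than something the abstract states. It is not the authors' implementation.

```python
def select_top_k_heads(head_scores, k):
    """Return the k head ids with the highest contact-relevance score.

    head_scores: dict mapping a head id, e.g. (layer, head), to a float
    computed on labeled proteins (hypothetical input).
    """
    return sorted(head_scores, key=head_scores.get, reverse=True)[:k]


def average_heads(attn_maps, heads):
    """Unweighted mean of the symmetrized L x L attention maps of `heads`.

    attn_maps: dict mapping head id to an L x L list of lists taken from
    one forward pass. Symmetrization is an assumption of this sketch.
    """
    length = len(attn_maps[heads[0]])
    avg = [[0.0] * length for _ in range(length)]
    for h in heads:
        a = attn_maps[h]
        for i in range(length):
            for j in range(length):
                avg[i][j] += 0.5 * (a[i][j] + a[j][i]) / len(heads)
    return avg


# Toy example: three heads on a length-2 protein.
attn_maps = {
    (0, 0): [[0.0, 1.0], [0.0, 0.0]],
    (0, 1): [[0.0, 0.2], [0.4, 0.0]],
    (1, 0): [[1.0, 0.0], [0.0, 1.0]],
}
head_scores = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.1}
top_heads = select_top_k_heads(head_scores, k=2)
contact_map = average_heads(attn_maps, top_heads)
```

The mean has no trainable parameters, so the only labeled step is choosing which heads enter it, which is where the abstract locates the gain.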
Authors: Nathaniel L. Diamant, Brian L. Trippe
Abstract: Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug‑like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score‑function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature‑distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous‑space diffusion, and discrete diffusion models using only feature‑level supervision. Code is available at https://github.com/smithhenryd/cgm.
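To make the quantity being minimized concrete, the sketch below computes the standard unbiased U-statistic estimate of the squared MMD between two sets of feature vectors. The RBF kernel, the `bandwidth` parameter, and the function names are assumptions of this illustration. It shows only the discrepancy itself, not the score-function gradient estimator or the KL regularization that the abstract describes.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two feature vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys.

    Within-sample terms skip i == j, so the estimate can be slightly
    negative when the two samples come from the same distribution.
    Requires at least two points in each sample.
    """
    m, n = len(xs), len(ys)
    k_xx = sum(rbf_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(rbf_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Generated vs. target features (toy one-dimensional features).
same = mmd2_unbiased([[0.0], [1.0]], [[0.0], [1.0]])
far = mmd2_unbiased([[0.0], [0.1]], [[5.0], [5.1]])
```

Well-separated feature sets give a value near 2 for this kernel, while matching sets give a value near zero.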
Authors: Mohamed Mouhajir, Limei Wang, El Houcine Bergou, Hajar El Hammouti, Lamiae Azizi, Dongqi Fu
Abstract: Graph‑based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three‑dimensional conformations organized around secondary structure elements, such as α‑helices and β‑sheets, which encode recurring local motifs and stabilizing hydrogen‑bond interactions. In this work, we introduce a secondary‑structure‑aware graph neural network for protein representation learning. Residue‑level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen‑bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long‑range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph‑based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy‑filtered hydrogen‑bond topology provides an effective inductive bias for protein representation learning. The code is released at https://github.com/mohamedmohamed2021/SSProNet
Authors: Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang
Abstract: Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well‑modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral‑specific signal: viral proteins remain linearly separable beyond zero‑shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
Authors: Gianluca Scarpellini, Ron Shprints, Peter Holderrieth, Juno Nam, Pranav Murugan, Rafael Gómez-Bombarelli, Tommi Jaakola, Maruan Al-Shedivat, Nicholas Matthew Boffi, Avishek Joey Bose
Abstract: All‑atom generative modeling of 3D biomolecular complexes has emerged as the dominant paradigm for predicting the structure of proteins and protein‑ligand systems. Generating structures at the atomic level of fidelity, however, typically requires expensive iterative diffusion rollouts, making both conventional deployment and inference‑time search techniques computationally costly. In this paper, we introduce the Denoiser Cofolding All‑Atom Flowmap (DeCAF) framework for distilling state‑of‑the‑art all‑atom cofolding models into all‑atom flow maps that produce high‑quality samples in only a few inference steps. We build DeCAF on a denoiser‑based formulation of flow maps with endpoint losses that naturally support SE(3) rigid alignment, which we show is critical for training accurate models. We further derive a simple change of variables that lets DeCAF operate in the σ‑space noise schedule of EDM‑style architectures, enabling direct distillation from pretrained cofolding diffusion models. Equipped with DeCAF's flowmap lookahead, we introduce a purpose‑built inference‑time framework that improves sampling through reward‑guided search. Empirically, DeCAF‑Boltz statistically significantly improves over Boltz‑1x in both accuracy (RMSD) and physical validity scores of protein‑ligand poses at strict NFE budgets on the challenging Runs N' Poses benchmark, while also showing a better Pareto frontier across all inference compute budgets on PoseBusters. Distilling the state‑of‑the‑art Pearl cofolding model, DeCAF‑Pearl outperforms diffusion‑based cofolding models and matches its teacher on success rate while using 5x fewer NFEs. We release our code at https://github.com/genesistherapeutics/decaf.
Authors: Fang Wu, Shuting Jin, Xiangru Tang, Mark Gerstein, Xiangxiang Zeng, Yejin Choi, Jure Leskovec, Jinbo Xu
Abstract: Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods condition only on backbone structure. We introduce SurfDesign, a surface‑conditioned protein design framework that models molecular surfaces as continuous geometric manifolds and integrates them with pretrained protein language models. SurfDesign employs surface‑based equivariant message passing to capture surface normals, curvature, and directional geometry, together with a parameter‑efficient fine‑tuning strategy. Focusing on functional protein design, we show that SurfDesign consistently outperforms prior surface‑conditioned and backbone‑only methods on de novo binder and enzyme design benchmarks. We also report strong performance on inverse‑folding benchmarks as a diagnostic of structural compatibility. Our results highlight manifold‑aware surface representations as a principled foundation for functional protein and enzyme design. Code is available at https://github.com/smiles724/SurfDesign.
Authors: Chen Wei, Fanding Xu, Minghao Sun, Zhiyuan Liu, Lin Wang, Tianrui Jia, Yihang Zhou, Yang Zhang
Abstract: Proteins perform their biological functions through three‑dimensional structures encoded by amino acid sequences, and ligand‑binding protein co‑design requires models that generate sequence‑structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow‑based models support ligand‑aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small‑molecule conditioning. We introduce ProtLiD^2, a Protein Ligand‑conditioned Discrete Diffusion model for protein sequence‑structure co‑design. ProtLiD^2 jointly generates amino‑acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry‑aware cross‑attention. Trained on over one million ligand‑protein complexes, ProtLiD^2 extends masked discrete diffusion to ligand‑aware functional protein design. We further propose maximum confidence‑margin guided ReMask decoding, an inference‑time self‑correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD^2 improves global fold confidence over Complexa in whole‑protein design, increasing TM‑score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co‑design, ProtLiD^2 reduces active‑site BB‑RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand‑aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand‑conditioned discrete diffusion as an effective token‑space framework for functional protein co‑design. Code will be available at https://github.com/auroua/ProtLiD.
Authors: Ashima Khanna, Dominik Grimm
Abstract: Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off‑policy generative approaches often degrade under surrogate noise, and position‑agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory‑level self‑improvement imitation framework for oracle‑budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active‑learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB‑based proxy ensemble, combined with an alanine‑scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next‑action cross‑entropy imitation on the round's best oracle‑labeled trajectories, avoiding value‑function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top‑100 mean fitness on all eight landscapes within our evaluations, often exhibiting faster early‑stage improvement. In low‑data and noisy‑proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS combined with AFS accounts for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
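As a rough illustration of UCB-style selection over a proxy ensemble, the sketch below scores each candidate by the ensemble mean plus a multiple of the ensemble standard deviation and keeps the top candidates within a budget. The `beta` weight, the use of the population standard deviation, and the function names are assumptions. The alanine-scan fitness score (AFS) and the beam search are omitted, so this is not SILO's actual selection rule.

```python
import statistics


def ucb_scores(ensemble_preds, beta=1.0):
    """Upper-confidence-bound score per candidate.

    ensemble_preds: list with one prediction list per proxy model, all in
    the same candidate order. Score = mean + beta * std across models.
    """
    n_candidates = len(ensemble_preds[0])
    scores = []
    for c in range(n_candidates):
        values = [model_preds[c] for model_preds in ensemble_preds]
        scores.append(statistics.fmean(values) + beta * statistics.pstdev(values))
    return scores


def select_for_oracle(scores, budget):
    """Indices of the `budget` highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


# Two proxy models, two candidates: equal means, different disagreement.
preds = [[1.0, 2.0], [3.0, 2.0]]
scores = ucb_scores(preds, beta=1.0)
chosen = select_for_oracle(scores, budget=1)
```

With equal ensemble means, the candidate on which the proxies disagree more is preferred, which is the exploration behaviour a UCB rule is meant to provide.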
Authors: Muhammad Muneeb, David B. Ascher
Abstract: Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein‑language‑model‑derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome‑wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM‑derived features, conservation metrics, population‑frequency variables, established pathogenicity predictors and engineered amino acid/codon‑context features. Using 132,714 ClinVar‑labelled missense variants, we benchmarked machine‑learning and deep‑learning models under controlled feature configurations. The full 303‑feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC‑AUC = 0.9950 across stratified five‑fold cross‑validation. Restricted naive and location‑oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity‑controlled ablations showed that removing prior‑predictor, population‑frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM‑derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1‑score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
Authors: Mingqing Wang, Zhiwei Nie, Athanasios V. Vasilakos, Yonghong He, Zhixiang Ren
Abstract: Proteins encode diverse functions within complex three‑dimensional structures, yet most deep learning representations remain highly entangled, obscuring the biophysical signals that underlie function. Here we introduce ProtDiS, a knowledge‑guided framework that decomposes pretrained protein micro‑environment embeddings into biologically grounded and task‑relevant dimensions. Inspired by the information bottleneck principle, ProtDiS learns representations that balance informativeness and compression, yielding structural features that are more specific, independent, and information‑efficient, and achieving consistent improvements across twelve downstream tasks, with the largest gains under structure‑based splits. Protein‑ and residue‑level analyses further show that ProtDiS differentiates proteins with similar folds but divergent functions and captures critical fine‑grained biophysical signals. These findings suggest that knowledge‑guided decomposition provides a general and interpretable approach for structuring latent spaces in protein structural modeling. The source code and implementation details are publicly available at https://github.com/AI‑HPC‑Research‑Team/ProtDiS.
Authors: Langzhang Liang, Ming Yang, Yi Feng, Junfan Li, Shirui Pan, Yinghui Xu, Tianlei Ying, Yizhen Zheng, Zenglin Xu
Abstract: Protein sequence generation for engineering requires samples that are biophysically plausible and, when targeting a family/domain, remain recognizable members while exploring within‑family diversity. Current discrete generative models typically start from uniform or masked‑token noise, which discards strong position‑specific constraints induced by evolution and forces the model to reconstruct conserved residues from scratch, leading to weak family control and low plausibility. We propose LineageFlow, a Dirichlet flow‑matching model that initializes generation from lineage priors derived from ancestral sequence reconstruction, turning generation into structured mutation from an evolved scaffold. Across diverse protein families, LineageFlow achieves family validity close to held‑out natural sequences and improves predicted structural confidence over uniform‑/mask‑initialized baselines while maintaining substantial novelty and diversity. Finally, we introduce rerouting, a single intermediate‑time mutate‑select‑amplify intervention that enables objective‑guided sampling without per‑step predictor guidance and yields further gains in plausibility, including a zero‑shot enzyme generation case study. Code is available at https://github.com/Jinx‑byebye/LineageFlow.
Authors: Taewon Kim, Hyosoon Jang, Hyunjin Seo, Seonghwan Seo, Hyeongwoo Kim, Wonho Zhung, Mingyeong Shin, Wooyoun Kim, Sungsoo Ahn
Abstract: Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure‑aware pretraining method that jointly models three aligned residue‑level views: amino‑acid identity, backbone geometry, and local full‑atom geometry, discretely encoded via VQ‑VAE tokenizers. By pretraining to recover original tokens from generator‑corrupted views, TriProRep learns to distinguish plausible but incorrect cross‑view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure‑predictive settings. RepSP tests three uses of representations: homodimer co‑folding from apo‑chain representations, residue‑level prediction of homodimer‑derived interaction properties, and representation‑aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence‑only and prior structure‑aware representation models, while maintaining competitive performance on conventional benchmarks.
Authors: Shuo Zhang, Rongqi Hong, Huifeng Zhang, Jian K. Liu
Abstract: Predicting protein‑ligand binding affinity remains intractable for multi‑domain proteins, where inter‑domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid‑body assumptions and aleatoric noise in flexible regions. To address this, we introduce HCLBind, a self‑supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general‑to‑specific pre‑training paradigm on the Q‑BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single‑domain proteins and global conformational geometry through inter‑domain rotation in multi‑domain complexes. Our hybrid architecture integrates a domain‑gated graph attention network and cross‑modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.
Authors: Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt
Abstract: Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of large‑scale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large‑scale, non‑redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek‑clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self‑supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to 90% with an SE(3)‑invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state‑of‑the‑art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH v4.4. TEDBench is available at https://github.com/BorgwardtLab/TEDBench.
Authors: Minseo Kim, Huanghao Mai, Jay Shenoy, Alec Follmer, Gordon Wetzstein, Frederic Poitevin
Abstract: Generative models trained on public databases of protein structures, most of which have been determined by X‑ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X‑ray structure determination. In crystallography, the measured structure‑factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure‑factor amplitudes. CrystalBoltz moves from unguided generation with a pre‑trained prior over protein structures to experiment‑guided posterior sampling, followed by atomic coordinate and B‑factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R‑factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.
Authors: Saqib Nazir, Ardhendu Behera
Abstract: Label‑free single‑cell imaging offers a scalable, non‑invasive alternative to fluorescence‑based cytometry, yet inferring molecular phenotypes directly from bright‑field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein‑expression regression from label‑free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine‑grained texture features with transformer‑based global representations through a learnable cross‑branch gating module, enabling robust morpho‑molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label‑free single‑cell imaging for cost‑effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single‑Cell‑Phenotyping.
Authors: Thor Klamt, Wolfgang Nejdl, Ming Tang
Abstract: Machine‑learning predictors of biochemical activity often exhibit large random‑split‑to‑leave‑one‑target‑out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation‑science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis‑targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random‑split cross‑validation, while the leave‑one‑target‑out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within‑target interpolation, whereas LOTO measures the novel‑target prediction that de‑novo design depends on. We decompose this gap and identify inter‑laboratory measurement variance as the dominant component, anchored by a within‑target cross‑laboratory cascade bounding the inter‑laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation‑threshold choice. Across eight published architectures and ESM‑2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES‑level deduplication; a 21‑dimensional 2000‑trial hyperparameter optimisation cannot break this ceiling, and the rank‑1 single‑seed configuration regresses by 0.161 AUROC under multi‑seed validation, matching a closed‑form selection‑bias prediction (Bailey and Lopez de Prado, 2014). Few‑shot k=5 stratified per‑target retraining combined with ADMET features lifts 65‑target LOTO AUROC from 0.668 to 0.7050, and post‑hoc Platt scaling recovers raw output to within the 0.05 well‑calibrated threshold. We release PROTAC‑Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance‑decomposition framework, the per‑target calibration protocol, and the evaluation code.
Authors: Xinwu Ye, He Cao, Hao Li, Bin Feng, Zijing Liu, Xiangru Tang, Yu Li, Shenghua Gao
Abstract: Biomolecular generators are often adapted with reward feedback to improve task‑specific utility, but pushing utility alone can concentrate generation on a narrow family of candidates. Maintaining diversity is difficult because sample diversity is a set‑level property. We introduce Supergroup Relative Policy Optimization (SGRPO), a flexible GRPO‑style framework that directly constructs rewards from set‑level diversity. For each condition, SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave‑one‑out diversity contributions before combining it with rollout‑level utility. This design decouples SGRPO from a particular generator, utility reward, or diversity metric, and allows instantiation with different GRPO‑style approaches. We evaluate SGRPO on de novo small‑molecule design, pocket‑based small‑molecule design, and de novo protein design, instantiating it with both GRPO and Coupled‑GRPO across autoregressive and discrete diffusion generators. Across decoding sweeps, SGRPO expands the utility‑diversity Pareto frontier and achieves the best frontier‑level metrics relative to pretrained generators, GRPO, and memory‑assisted GRPO when applicable. Our analyses further show that direct set‑level diversity rewards remain effective with small groups and help preserve broader generation‑distribution coverage during post‑training. The code is available at https://github.com/IDEA‑XL/SGRPO.
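The leave-one-out step mentioned in this abstract can be sketched as follows. The diversity metric (mean pairwise Hamming fraction over equal-length sequences), the additive combination with utility, and the `diversity_weight` parameter are all assumptions chosen for illustration. The abstract states that SGRPO is decoupled from any particular diversity metric, and the supergroup comparison across candidate sets is not shown here.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def set_diversity(seqs):
    """Mean pairwise Hamming fraction; 0.0 for sets with fewer than 2 items."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def leave_one_out_contributions(seqs):
    """Diversity of the full set minus diversity with each rollout removed."""
    full = set_diversity(seqs)
    return [full - set_diversity(seqs[:i] + seqs[i + 1:])
            for i in range(len(seqs))]


def combined_rewards(utilities, seqs, diversity_weight=1.0):
    """Rollout-level utility plus weighted leave-one-out diversity credit."""
    contribs = leave_one_out_contributions(seqs)
    return [u + diversity_weight * c for u, c in zip(utilities, contribs)]


# Two identical rollouts and one distinct rollout.
rollouts = ["AAAA", "AAAA", "BBBB"]
contribs = leave_one_out_contributions(rollouts)
rewards = combined_rewards([1.0, 1.0, 1.0], rollouts)
```

In this toy set the distinct rollout receives positive diversity credit, while each duplicate receives negative credit because removing it would raise the set's diversity.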
Authors: Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner
Abstract: Manually curated biomedical repositories (spanning bioactivity, genomics, and chemistry) are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost‑effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM‑based entity‑tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M‑paper, 2.5T‑token PubMed corpus; (2) hybrid sparse‑dense retrieval supporting entity‑filtered semantic queries over the tagged corpus; and (3) Starling, a multi‑agent deep research system that, given only a natural‑language task description, designs precision‑ and recall‑targeted retrieval filters, induces an extraction schema, and emits structured records with nuance‑rich fields and supporting passages. Across six tasks (blood‑brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene‑disease associations, protein subcellular localization, and chemical reactions), Starling produces ~6.3M records (91K‑3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier‑model rejection of our extractions is 0.6‑7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard; e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI‑driven therapeutic design. Code and datasets: https://github.com/starling‑labs/starling.
Authors: Fang Wu, Weihao Xuan, Heli Qi, Hanqun Cao, Heng-Jui Chang, Zeqi Zhou, Haokai Zhao, Ma Jian, Carl Ma, Yu-Chi Cheng, Kuan Pang, Xiangru Tang, Zehong Wang, Guanlue Li, Hanchen Wang, Kejun Ying, Pan Lu, Chiho Im, Seungju Han, Peng Xia, Tinson Xu, Yinxi Li, Deyao Zhu, Pheng-Ann Heng, Naoto Yokoya, Masashi Sugiyama, Li Erran Li, Jure Leskovec, Yejin Choi
Abstract: Deep learning in de novo protein design has achieved atomic‑level fidelity. However, existing models remain largely non‑deliberative: they directly synthesize molecular geometries without explicitly reasoning about which residues or interactions are functionally essential. As a result, design decisions are entangled with continuous sampling dynamics, limiting interpretability, controllability, and systematic reuse of biochemical knowledge. We introduce Proteo‑R1, a reasoning‑guided protein design framework that explicitly decouples molecular understanding from geometric generation. Proteo‑R1 adopts a dual‑expert architecture in which a multimodal large language model (MLLM) serves as an understanding expert, analyzing protein sequences, structures, and textual context to identify key functional residues that govern binding and specificity. These residue‑level decisions are then passed as hard constraints to a separate diffusion‑based generation expert, which performs conditional co‑design while respecting the fixed interaction anchors. This factorization mirrors how human experts approach molecular engineering: first, reasoning about critical interactions, then optimizing geometry subject to those constraints. By operationalizing reasoning as explicit residue‑level commitments rather than latent textual guidance, Proteo‑R1 achieves stable, interpretable, and modular integration of LLM reasoning with state‑of‑the‑art geometric generative models. Code, data, and demos are available at https://smiles724.github.io/r1/.
Authors: Viet Thanh Duy Nguyen, John K. Johnstone, Truong-Son Hy
Abstract: Proteins are inherently multiscale physical systems whose functional properties emerge from coordinated structural organization across multiple spatial resolutions, ranging from atomic interactions to global fold topology. However, existing protein representation learning methods typically operate at a single structural level or treat different sources of structural information as parallel modalities, without explicitly modeling their hierarchical relationships. We introduce PRIME (Protein Representation via Physics‑Informed Multiscale Equivariant Hierarchies), a unified framework that models proteins as a nested family of five physically grounded structural graphs spanning surface, atomic, residue, secondary‑structure, and protein levels. Adjacent levels are connected through deterministic, physics‑informed assignment operators, enabling bidirectional information exchange via bottom‑up aggregation and top‑down contextual refinement. Experiments on standard protein representation learning benchmarks demonstrate strong and competitive performance across diverse tasks, with particularly notable gains on the Fold Classification benchmark, where PRIME outperforms the strongest geometric GNN baseline by margins of 13.80 and 18.30 points on the harder Superfamily and Fold splits, and achieves a state‑of‑the‑art accuracy of 84.10% on Reaction Class prediction, surpassing all baseline methods, including ESM. Ablation studies confirm that each structural level contributes complementary and non‑redundant information, and adaptive cross‑attention analysis reveals that PRIME autonomously identifies the most task‑relevant structural resolutions at prediction time. Our source code is publicly available at https://github.com/HySonLab/PRIME
Authors: Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda
Abstract: Diffusion language models generate without a fixed left‑to‑right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence‑driven ordering. Random masking creates train‑test mismatch, while confidence‑only rules are efficient but can be myopic and suppress useful exploration.
We introduce DPRM (Doob h‑transform Process Reward Model), a plug‑in token‑ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence‑driven progressive ordering and gradually shifts to Doob h‑transform process‑reward‑guided ordering through online estimates.
We characterize the exact DPRM policy as a reward‑tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft‑BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical‑Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample‑complexity advantage over random and confidence‑only ordering.
DPRM improves over confidence‑based baselines in pretraining, post‑training, test‑time scaling, and single‑cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi‑objective: ordering‑aware variants significantly improve selected structural or fragment‑constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general‑purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM‑DLLM.
Authors: Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng
Abstract: Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph‑topology information, and protein sequence features to model perturbation‑target dependencies, and is trained with a two‑stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero‑shot evaluation on an unseen cell line, as well as in knowledge‑sparse, long‑tail scenarios. Overall, AROMA demonstrates that combining knowledge‑driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.
Authors: Sanzo Miyazawa
Abstract: The inverse Potts problem, which estimates evolutionary single‑site fields and pairwise couplings in homologous protein sequences from the single‑site and pairwise amino acid frequencies observed in their multiple sequence alignment, remains one of the useful methods in studies of protein structure and evolution. Since the reproducibility of fields and couplings is most important, the Boltzmann machine method is employed here, although it is computationally intensive. To reduce the computational time required for the Boltzmann machine, a parallel, persistent Markov chain Monte Carlo method is employed to estimate the single‑site and pairwise marginal distributions in each learning step. Stochastic gradient descent methods are also used to reduce the computational time of each learning step. Another problem is how to adjust the values of the hyperparameters; there are two regularization parameters for evolutionary fields and couplings. The precision of contact residue pair prediction is often used to adjust the hyperparameters; however, it is not sensitive to these regularization parameters. Here, they are adjusted so that the fields and couplings satisfy a specific condition that is appropriate for protein conformations. This method has been applied to eight protein families.
Authors: Gökçe Uludoğan, Buse Giledereli, Elif Ozkirimli, Arzucan Özgür
Abstract: Proteins carry out biological functions through the coordinated action of groups of residues organized into structural arrangements. These arrangements, which we refer to as protein units, exist at an intermediate scale, being larger than individual residues yet smaller than entire proteins. A deeper understanding of protein function can be achieved by identifying these units and their associations with function. However, existing approaches either focus on residue‑level signals, rely on curated annotations, or segment protein structures without incorporating functional information, thereby limiting interpretable analysis of structure‑function relationships. We introduce PUFFIN, a data‑driven framework for discovering protein units by jointly learning structural partitioning and functional supervision. PUFFIN represents proteins as residue‑level structure graphs and applies a graph neural network with a structure‑aware pooling mechanism that partitions each protein into multi‑residue units, with functional supervision that shapes the partition. We show that the learned units are structurally coherent, exhibit organized associations with molecular function, and show meaningful correspondence with curated InterPro annotations. Together, these results demonstrate that PUFFIN provides an interpretable framework for analyzing structure‑function relationships using learned protein units and their statistical function associations. We made our source code available at https://github.com/boun‑tabi‑lifelu/puffin.
Authors: Michael Cuccarese
Abstract: Activity cliff prediction ‑ identifying positions where small structural changes cause large potency shifts ‑ has been a persistent challenge in computational medicinal chemistry. This work focuses on a parsimonious definition: which small modifications, at which positions, confer the highest probability of an outcome change. Position‑level sensitivity is calculated using 25 million matched molecular pairs from 50 ChEMBL targets across six protein families, revealing that two questions have fundamentally different answers. "Which positions vary most?" is answered by scaffold size alone (NDCG@3 = 0.966), requiring no machine learning. "Which are true activity cliffs?" ‑ where small modifications cause disproportionately large effects, as captured by SALI normalization ‑ requires an 11‑feature model with 3D pharmacophore context (NDCG@3 = 0.910 vs. 0.839 random), generalizing across all six protein families, novel scaffolds (0.913), and temporal splits (0.878). The model identifies the cliff‑prone position first 53% of the time (vs. 27% random ‑ 2x lift), reducing positions a chemist must explore from 3.1 to 2.1 ‑ a 31% reduction in first‑round experiments. Predicting which modification to make is not tractable from structure alone (Spearman 0.268, collapsing to ‑0.31 on novel scaffolds). The system is released as open‑source code and an interactive webapp.
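The NDCG@3 figures quoted in the abstract above score how well a ranking of positions matches the true sensitivity ordering. A minimal sketch of that metric follows, assuming a linear gain and a log2 rank discount; the abstract does not state which gain variant was used, and the function names and toy numbers here are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_relevance, predicted_scores, k=3):
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal ranking."""
    # Rank positions by predicted score, highest first.
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal = sorted(true_relevance, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    # By convention here, a query with no relevant items scores 0.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: four substitution positions with hypothetical sensitivities.
true_sensitivity = [3, 1, 0, 2]
model_scores = [0.9, 0.2, 0.1, 0.5]
print(ndcg_at_k(true_sensitivity, model_scores, k=3))
```

A score of 1 means the top three predicted positions appear in the ideal order, so a baseline such as scaffold size can be compared with a learned model on the same scale.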
Authors: Jarrid Rector-Brooks, Théophile Lambert, Marta Skreta, Daniel Roth, Yueming Long, Zi-Qi Li, Xi Zhang, Miruna Cretu, Francesca-Zhoufan Li, Tanvi Ganapathy, Emily Jin, Avishek Joey Bose, Jason Yang, Kirill Neklyudov, Yoshua Bengio, Alexander Tong, Frances H. Arnold, Cheng-Hao Liu
Abstract: Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre‑specifying catalytic residues. We introduce DISCO (DIffusion for Sequence‑structure CO‑design), a multimodal model that co‑designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference‑time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active‑site geometries. These enzymes catalyze new‑to‑nature carbene‑transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B‑H, and C(sp^3)‑H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO‑design/DISCO.
Authors: Yao Qin, Yangyang Yan, Jinhua Pang, Xiaoming Zhang
Abstract: The integration of Large Language Models (LLMs) into life sciences has catalyzed the development of "AI Scientists." However, translating these theoretical capabilities into deployment‑ready research environments exposes profound infrastructural vulnerabilities. Current frameworks are bottlenecked by fragile JSON‑based tool‑calling protocols, easily disrupted execution sandboxes that lose graphical outputs, and rigid conversational interfaces inherently ill‑suited for high‑dimensional scientific data. We introduce BloClaw, a unified, multi‑modal operating system designed for Artificial Intelligence for Science (AI4S). BloClaw reconstructs the Agent‑Computer Interaction (ACI) paradigm through three architectural innovations: (1) An XML‑Regex Dual‑Track Routing Protocol that statistically eliminates serialization failures (0.2% error rate vs. 17.6% in JSON); (2) A Runtime State Interception Sandbox that utilizes Python monkey‑patching to autonomously capture and compile dynamic data visualizations (Plotly/Matplotlib), circumventing browser CORS policies; and (3) A State‑Driven Dynamic Viewport UI that morphs seamlessly between a minimalist command deck and an interactive spatial rendering engine. We comprehensively benchmark BloClaw across cheminformatics (RDKit), de novo 3D protein folding via ESMFold, molecular docking, and autonomous Retrieval‑Augmented Generation (RAG), establishing a highly robust, self‑evolving paradigm for computational research assistants. The open‑source repository is available at https://github.com/qinheming/BloClaw.
Authors: Edward Wijaya
Abstract: Deep learning models for drug‑like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search‑path dependence rather than fundamental biological requirements. We release a decision framework and open‑source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.
Authors: Truong-Son Hy
Abstract: Protein fitness optimization is inherently a discrete combinatorial problem, yet most learning‑based approaches rely on continuous representations and are primarily evaluated through predictive accuracy. We introduce Q‑BIOLAT, a framework for modeling and optimizing protein fitness landscapes in compact binary latent spaces. Starting from pretrained protein language model embeddings, we construct binary latent representations and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions.
Beyond its formulation, Q‑BIOLAT provides a representation‑centric perspective on protein fitness modeling. We show that representations with similar predictive performance can induce fundamentally different optimization landscapes. In particular, learned autoencoder‑based representations collapse after binarization, producing degenerate latent spaces that fail to support combinatorial search, whereas simple structured representations such as PCA yield high‑entropy, decodable, and optimization‑friendly latent spaces.
Across multiple datasets and data regimes, we demonstrate that classical combinatorial optimization methods, including simulated annealing, genetic algorithms, and greedy hill climbing, are highly effective in structured binary latent spaces. By expressing the objective in QUBO form, our approach connects modern machine learning with discrete and quantum‑inspired optimization.
Our implementation and dataset are publicly available at: https://github.com/HySonLab/Q‑BIOLAT‑Extended
Authors: Jason Dury
Abstract: The Predictive Associative Memory (PAM) framework posits that useful relationships often connect items that co‑occur in shared contexts rather than items that appear similar in embedding space. A contrastive MLP trained on co‑occurrence annotations, Contrastive Association Learning (CAL), has improved multi‑hop passage retrieval and discovered narrative function at corpus scale in text. We test whether this principle transfers to molecular biology, where protein‑protein interactions provide functional associations distinct from gene expression similarity. Four experiments across two biological domains map the operating envelope. On gene perturbation data (Replogle K562 CRISPRi, 2,285 genes), CAL trained on STRING protein interactions achieves cross‑boundary AUC of 0.908 where expression similarity scores 0.518. A second gene dataset (DepMap, 17,725 genes) confirms the result after negative sampling correction, reaching cross‑boundary AUC of 0.947. Two drug sensitivity experiments produce informative negatives that sharpen boundary conditions. Three cross‑domain findings emerge: (1) inductive transfer succeeds in biology (a node‑disjoint split with unseen genes yields AUC 0.826, Delta +0.127) where it fails in text (+/‑0.10), suggesting physically grounded associations are more transferable than contingent co‑occurrences; (2) CAL scores anti‑correlate with interaction degree (Spearman r = ‑0.590), with gains concentrating on understudied genes with focused interaction profiles; (3) tighter association quality outperforms larger but noisier training sets, reversing the text pattern. Results are stable across training seeds (SD < 0.001) and cross‑boundary threshold choices.
Authors: Truong-Son Hy
Abstract: We propose Q‑BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms.
On the ProteinGym benchmark, we demonstrate that Q‑BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high‑fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher‑dimensional latent spaces and local search remaining competitive in preserving realistic sequences.
Beyond its empirical performance, Q‑BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum‑assisted protein engineering.
Our implementation is publicly available at: https://github.com/HySonLab/Q‑BIOLAT
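The QUBO formulation described in the abstract above can be illustrated with a small simulated-annealing loop over binary latent codes. This is a minimal sketch and not the Q‑BIOLAT implementation: the matrix `Q`, the sign convention (lower energy standing in for higher predicted fitness), the geometric cooling schedule, and all function names are assumptions made for illustration.

```python
import math
import random

def qubo_energy(x, Q):
    """Energy x^T Q x of a binary vector x; diagonal entries act as unary terms."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def flip_delta(x, Q, k):
    """Energy change from flipping bit k, computed without a full re-evaluation."""
    direction = 1 - 2 * x[k]  # +1 for a 0 -> 1 flip, -1 for a 1 -> 0 flip
    coupling = sum((Q[k][j] + Q[j][k]) * x[j] for j in range(len(x)) if j != k)
    return direction * (Q[k][k] + coupling)

def simulated_annealing(Q, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize the QUBO energy with single-bit flips and geometric cooling."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = qubo_energy(x, Q)
    best_x, best_energy = x[:], energy
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        k = rng.randrange(n)
        delta = flip_delta(x, Q, k)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x[k] = 1 - x[k]
            energy += delta
            if energy < best_energy:
                best_x, best_energy = x[:], energy
    return best_x, best_energy

# Hypothetical 3-bit surrogate: bits 0 and 1 are individually favorable but clash.
Q = [[-1, 2, 0],
     [0, -1, 0],
     [0, 0, 1]]
best_x, best_energy = simulated_annealing(Q)
print(best_x, best_energy)
```

In a pipeline like the one the abstract describes, the best binary code would then be decoded or matched to nearby sequences; that step is outside this sketch.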
Authors: Jing Dai, Chen Wu, Ming Wu, Qibin Zhang, Zexi Wu, Jingdong Zhang, Hongming Xu
Abstract: Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP‑Mamba, a Mamba‑based multimodal framework that efficiently integrates histological features with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high‑throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data‑efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction‑aware Mamba (LiAM) for fine‑grained feature interaction and the Global Interaction‑enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capturing complex cross‑modal dependencies. Experiments on four public cancer datasets demonstrate that HGP‑Mamba achieves state‑of‑the‑art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at https://github.com/Daijing‑ai/HGP‑Mamba.git.
Authors: Yining Qian, Lijie Su, Meiling Xu, Xianpeng Wang
Abstract: Predicting protein secondary structure is essential for understanding protein function and advancing drug discovery. However, the intricate sequence‑structure relationship poses significant challenges for accurate modeling. To address these challenges, we propose MOGP‑MMF, a multi‑objective genetic programming framework that reformulates protein secondary structure prediction (PSSP) as an automated optimization task focused on feature selection and fusion. Specifically, MOGP‑MMF introduces a multi‑view multi‑level representation strategy that integrates evolutionary, semantic, and newly introduced structural views to capture the comprehensive protein folding logic. Leveraging an enriched operator set, the framework evolves both linear and nonlinear fusion functions, effectively capturing high‑order feature interactions while reducing fusion complexity. To resolve the accuracy‑complexity trade‑off, an improved multi‑objective GP algorithm is developed, incorporating a knowledge transfer mechanism that utilizes prior evolutionary experience to guide the population toward global optima. Extensive experiments across seven benchmark datasets demonstrate that MOGP‑MMF surpasses state‑of‑the‑art methods, particularly in Q8 accuracy and structural integrity. Furthermore, MOGP‑MMF generates a diverse set of non‑dominated solutions, offering flexible model selection schemes for various practical application scenarios. The source code is available on GitHub: https://github.com/qian‑ann/MOGP‑MMF/tree/main.
Authors: Darius Catrina, Christian Bepler, Samuel Sledzieski, Rohit Singh
Abstract: Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid‑sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka‑style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse‑distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly‑shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse‑distilled ESM‑2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse‑distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
Authors: Faisal Bin Ashraf, Animesh Ray, Stefano Lonardi
Abstract: Machine learning‑based antibody design is emerging as one of the most promising approaches to combat infectious diseases, due to significant advancements in the field of artificial intelligence and an exponential surge in experimental antibody data (in particular related to COVID‑19). The ability of an antibody to bind to an antigen (called binding affinity) is one of the most critical properties in designing neutralizing antibodies. In this study we introduce Ab‑Affinity, a new large language model that can accurately predict the binding affinity of antibodies against a target peptide, e.g., the SARS‑CoV‑2 spike protein. Code and model are available at https://github.com/ucrbioinfo/AbAffinity.
Authors: Daiheng Zhang, Shiyang Zhang, Sizhuang He, Yangtian Zhang, Syed Asad Rizvi, David van Dijk
Abstract: Discrete biological sequence optimization requires iterative refinement under strict syntactic constraints. Diffusion models offer progressive refinement but do not naturally expose controllable discrete edit operations, while autoregressive LLMs often lack explicit long‑horizon planning for constrained edits. We propose STRIDE (Sequence Trajectory Refinement via Internalized Denoising Emulation), a post‑training framework that trains an LLM to emit executable trajectories of atomic edits (INSERT/DELETE/REPLACE) as a verifiable reasoning trace for variable‑length refinement. STRIDE combines supervised fine‑tuning on Levenshtein‑aligned shortest edit demonstrations with group‑based policy optimization to align edit trajectories with task rewards while preserving coherent editing behavior. Across protein fluorescence and instruction‑conditioned molecular optimization, STRIDE improves variable‑length protein editing success from 42% to 89% while increasing novelty from 47% to 97%, and yields stronger validity and controllability compared to diverse baselines. The code is published at https://github.com/daiheng‑zhang/STRIDE.
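The "Levenshtein‑aligned shortest edit demonstrations" mentioned in the abstract above can be pictured as a minimum-length list of INSERT/DELETE/REPLACE operations recovered from the standard edit-distance table. The sketch below is an illustration under assumptions: the `(op, position, residue)` tuple format, the right-to-left emission order, and the function names are hypothetical and not taken from the STRIDE code.

```python
def edit_script(src, tgt):
    """Return a shortest list of atomic edits turning src into tgt.

    Edits are emitted right to left, so each position indexes the current
    sequence correctly when the edits are applied in the returned order.
    """
    m, n = len(src), len(tgt)
    # D[i][j] is the Levenshtein distance between src[:i] and tgt[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)

    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and D[i][j] == D[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no edit needed
        elif i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + 1:
            ops.append(("REPLACE", i - 1, tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("DELETE", i - 1, None))
            i -= 1
        else:
            ops.append(("INSERT", i, tgt[j - 1]))
            j -= 1
    return ops

def apply_edits(src, ops):
    """Execute the edit trajectory on src, one atomic edit at a time."""
    seq = list(src)
    for op, pos, residue in ops:
        if op == "REPLACE":
            seq[pos] = residue
        elif op == "DELETE":
            del seq[pos]
        elif op == "INSERT":
            seq.insert(pos, residue)
    return "".join(seq)

# Toy protein fragments (hypothetical sequences).
ops = edit_script("MKTAYIAK", "MKTWYAKQ")
print(ops)
print(apply_edits("MKTAYIAK", ops))
```

Because every operation is executable, a trajectory can be replayed and checked against the target, which is what makes such a trace verifiable.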
Authors: Erik Hartman, Di Tang, Johan Malmström
Abstract: Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence‑function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data‑efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of Streptococcus pneumoniae. BoGA accelerates the discovery of high‑confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license on GitHub at https://github.com/ErikHartman/bopep.
Authors: Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Abstract: Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non‑rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity‑Aware Self‑Supervised Learning), a geometric pretraining framework that front‑loads geometry learning prior to generative finetuning. Phase I (RigidSSL‑Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL‑MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi‑directional, rigidity‑aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL‑Perturb improves the success rate by 5.8% in zero‑shot motif scaffolding and RigidSSL‑MD captures more biophysically realistic conformational ensembles in G protein‑coupled receptor modeling.
Authors: David Jackson, Michael Gertz, Jürgen Hesser
Abstract: Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph‑based framework that unifies diverse sources, including drug‑target data (ChEMBL), clinical trial literature (PubMed), trial metadata (ClinicalTrials.gov), and post‑marketing safety reports (FAERS), into a single evidence‑weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target‑to‑adverse‑event correlations. A non‑small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ErbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at https://github.com/davidjackson99/PKI_KG.
Authors: Fuyao Huang, Xiaozhu Yu, Kui Xu, Qiangfeng Cliff Zhang
Abstract: High‑resolution structure determination by cryo‑electron microscopy (cryo‑EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present CryoNet.Refine, an end‑to‑end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one‑step diffusion model that integrates a density‑aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. CryoNet.Refine provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA‑protein complexes. In benchmarks against Phenix.real_space_refine, CryoNet.Refine consistently achieves substantial improvements in both model‑map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, CryoNet.Refine aims to serve as an essential tool for next‑generation cryo‑EM structure refinement. Web server: https://cryonet.ai/refine; Source code: https://github.com/kuixu/cryonet.refine.
Authors: Soumik Deb Niloy, Md. Fahmid-Ul-Alam Juboraj, Swakkhar Shatabda
Abstract: Pro‑inflammatory peptides (PIPs) play critical roles in immune signaling and inflammation but are difficult to identify experimentally due to costly and time‑consuming assays. To address this challenge, we present KEMP‑PIP, a hybrid machine learning framework that integrates deep protein embeddings with handcrafted descriptors for robust PIP prediction. Our approach combines contextual embeddings from pretrained ESM protein language models with multi‑scale k‑mer frequencies, physicochemical descriptors, and modlAMP sequence features. Feature pruning and class‑weighted logistic regression manage high dimensionality and class imbalance, while ensemble averaging with an optimized decision threshold enhances the sensitivity‑specificity balance. Through systematic ablation studies, we demonstrate that integrating complementary feature sets consistently improves predictive performance. On the standard benchmark dataset, KEMP‑PIP achieves an MCC of 0.505, accuracy of 0.752, and AUC of 0.762, outperforming ProIn‑fuse, MultiFeatVotPIP, and StackPIP. Relative to StackPIP, these results represent improvements of 9.5% in MCC and 4.8% in both accuracy and AUC. The KEMP‑PIP web server is freely available at https://nilsparrow1920‑kemp‑pip.hf.space/ and the full implementation at https://github.com/S18‑Niloy/KEMP‑PIP.
Authors: Osman Onur Kuzucu, Tunca Doğan
Abstract: Understanding disease‑gene associations is essential for unravelling disease mechanisms and advancing diagnostics and therapeutics. Traditional approaches based on manual curation and literature review are labour‑intensive and not scalable, prompting the use of machine learning on large biomedical data. In particular, graph neural networks (GNNs) have shown promise for modelling complex biological relationships. To address limitations in existing models, we propose GLaDiGAtor (Graph Learning‑bAsed DIsease‑Gene AssociaTiOn pRediction), a novel GNN framework with an encoder‑decoder architecture for disease‑gene association prediction. GLaDiGAtor constructs a heterogeneous biological graph integrating gene‑gene, disease‑disease, and gene‑disease interactions from curated databases, and enriches each node with contextual features from well‑known language models (ProtT5 for protein sequences and BioBERT for disease text). In evaluations, our model achieves superior predictive accuracy and generalisation, outperforming 14 existing methods. Literature‑supported case studies confirm the biological relevance of high‑confidence novel predictions, highlighting GLaDiGAtor's potential to discover candidate disease genes. These results underscore the power of graph convolutional networks in biomedical informatics and may ultimately facilitate drug discovery by revealing new gene‑disease links. The source code and processed datasets are publicly available at https://github.com/HUBioDataLab/GLaDiGAtor.
Authors: Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen
Abstract: Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning many‑body interactions, but remain slower than classical force fields due to fragmented kernels and memory‑bound pipelines that underutilize GPUs. We show that a missing principle is making GNN‑MD IO‑aware, carefully accounting for reads and writes between GPU high‑bandwidth memory (HBM) and on‑chip SRAM. We present FlashSchNet, an efficient and accurate IO‑aware SchNet‑style GNN‑MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter‑add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention‑free accumulation in both forward and backward passes; (4) channel‑wise 16‑bit quantization that exploits the low per‑channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on coarse‑grained (CG) protein containing 269 beads (6.5x faster than CGSchNet baseline with 80% reduction of peak memory), surpassing classical force fields (e.g. MARTINI) while retaining SchNet‑level accuracy and transferability.
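The idea behind reformulating scatter‑add as a CSR segment reduce can be sketched on the CPU in plain Python. This is only an illustration of the data layout, not FlashSchNet's GPU kernel: the function names are hypothetical, messages are scalars for brevity, and edges are assumed to be pre-sorted by destination node.

```python
def scatter_add(dst, messages, num_nodes):
    """Baseline: one accumulating write per edge (an atomic add on a GPU)."""
    out = [0.0] * num_nodes
    for d, m in zip(dst, messages):
        out[d] += m
    return out

def build_row_ptr(sorted_dst, num_nodes):
    """CSR row pointers for edges already sorted by destination node."""
    counts = [0] * num_nodes
    for d in sorted_dst:
        counts[d] += 1
    row_ptr = [0]
    for c in counts:
        row_ptr.append(row_ptr[-1] + c)
    return row_ptr

def segment_reduce(row_ptr, sorted_messages):
    """Each node sums its own contiguous slice and writes its result once."""
    return [sum(sorted_messages[row_ptr[i]:row_ptr[i + 1]])
            for i in range(len(row_ptr) - 1)]

# Usage: both formulations give the same per-node totals.
dst = [2, 0, 2, 1, 0]
messages = [1.0, 2.0, 3.0, 4.0, 5.0]
order = sorted(range(len(dst)), key=dst.__getitem__)  # stable sort by destination
sorted_dst = [dst[i] for i in order]
sorted_messages = [messages[i] for i in order]
row_ptr = build_row_ptr(sorted_dst, 3)
totals = segment_reduce(row_ptr, sorted_messages)
```

Because every output element is owned by exactly one segment, no two writers contend for the same location, which is the property the abstract refers to as contention‑free accumulation.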
Authors: Zhaorui Jiang, Yingfang Yuan, Lei Hu, Wei Pang
Abstract: The integration of spatial multi‑omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge, we propose STProtein, a novel framework leveraging graph neural networks with a multi‑task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi‑omics data, such as spatial transcriptomics. We believe that STProtein can effectively address the scarcity of spatial proteomics, accelerating the integration of spatial multi‑omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological "Dark Matter".
Authors: Minhuan Li, Jiequn Han, Pilar Cossio, Luhuan Wu
Abstract: A core challenge in structural biophysics is generating biomolecular conformations that are both physically plausible and consistent with experimental measurements. While sequence‑to‑structure diffusion models provide powerful priors, posterior sampling methods steer generation by perturbing atomic coordinates with gradients from experimental likelihoods. However, when the target lies in a low‑density region of the prior, these methods require aggressive upweighting of the likelihood that can destabilize sampling and be sensitive to hyperparameters. We propose EmbedOpt, an inference‑time steering framework that introduces an orthogonal optimization axis: rather than performing posterior sampling under a fixed prior, EmbedOpt directly optimizes the prior by updating the model's conditional embedding. This embedding space encodes rich coevolutionary signals, so optimizing it shifts the structural prior to align with experimental constraints. Empirically, EmbedOpt matches coordinate‑based posterior sampling baselines on sparse distance constraints and outperforms them on cryo‑electron microscopy map fitting, including real, noisy experimental ones. Furthermore, EmbedOpt's smooth optimization behavior yields robustness to hyperparameters spanning two orders of magnitude and enables comparable performance with fewer diffusion steps. Code is available at https://github.com/rs‑station/embedopt.
Authors: Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu
Abstract: Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long‑horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio‑temporal dynamics. We present STAR‑MD (Spatio‑Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)‑equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio‑temporal attention that efficiently captures complex space‑time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR‑MD achieves state‑of‑the‑art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR‑MD successfully extrapolates to generate stable microsecond‑scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long‑horizon generation, while demonstrating that STAR‑MD's joint spatio‑temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.
Authors: Furkan Eris
Abstract: Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M‑parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped‑query attention with shared K/V projections, cross‑layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU‑hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50‑200× the compute. On indels, Proust sets a new state‑of‑the‑art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure‑aware methods using sequence alone. These powerful representations position Proust in a sweet spot as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per‑position entropy variance predicts, to an extent, when retrieval augmentation helps and hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test‑time scaling. Code and weights are available at https://github.com/Furkan9015/proust‑inference
Authors: Yang Tan, Yuanxi Yu, Can Wu, Bozitao Zhong, Mingchen Li, Guisheng Fan, Jiankang Zhu, Yafeng Liang, Nanqing Dong, Liang Hong
Abstract: Zero‑shot mutation prediction is vital for low‑resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet‑lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank‑and‑Reason (VenusRAR), a two‑stage agentic framework to automate this workflow and maximize expected wet‑lab fitness. In the Rank‑Stage, a Computational Expert and Virtual Biologist aggregate a context‑aware multi‑modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason‑Stage, an agentic Expert Panel employs chain‑of‑thought reasoning to audit candidates against geometric and structural constraints, improving the Top‑5 Hit Rate by up to 367% on ProteinGym‑DMS99. The wet‑lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23‑fold and 5.05‑fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).
Authors: Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
Abstract: Currently, the field of structure‑based drug design is dominated by three main types of algorithms: search‑based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross‑algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand‑centric drug design methods can be used in SBDD by treating the docking function as a black‑box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure‑based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code used for benchmarking is available at https://github.com/zkysfls/2025‑sbdd‑benchmark
Authors: Po-Yu Liang, Tibo Duran, Jun Bai
Abstract: We present PepEDiff, a novel peptide binder generator that designs binding sequences given a target receptor protein sequence and its pocket residues. Peptide binder generation is critical in therapeutic and biochemical applications, yet many existing methods rely heavily on intermediate structure prediction, adding complexity and limiting sequence diversity. Our approach departs from this paradigm by generating binder sequences directly in a continuous latent space derived from a pretrained protein embedding model, without relying on predicted structures, thereby improving structural and sequence diversity. To encourage the model to capture binding‑relevant features rather than memorizing known sequences, we perform latent‑space exploration and diffusion‑based sampling, enabling the generation of peptides beyond the limited distribution of known binders. This zero‑shot generative strategy leverages the global protein embedding manifold as a semantic prior, allowing the model to propose novel peptide sequences in previously unseen regions of the protein space. We evaluate PepEDiff on TIGIT, a challenging target with a large, flat protein‑protein interaction interface that lacks a druggable pocket. Despite its simplicity, our method outperforms state‑of‑the‑art approaches across benchmark tests and in the TIGIT case study, demonstrating its potential as a general, structure‑free framework for zero‑shot peptide binder design. The code for this research is available at GitHub: https://github.com/LabJunBMI/PepEDiff‑An‑Peptide‑binder‑Embedding‑Diffusion‑Model
Authors: Jiahao Wang, Shuangjia Zheng
Abstract: The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence‑based optimization methods struggle with the high‑dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure‑aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two‑stage encoder‑decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed landscape to sample from. Extensive experiments demonstrate that our method outperforms state‑of‑the‑art baselines in in‑silico evaluations across most metrics. Remarkably, our approach offers a unique advantage by leveraging the mutual constraints between protein structure and sequence, facilitating the design of protein sequences with similar structures and optimized properties. The code and data are publicly available at https://github.com/GENTEL‑lab/HADES.
Authors: Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
Abstract: Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non‑sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman‑Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine‑tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward‑tilted protein sequence generation.
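Temperature control of a sampled distribution can be illustrated on a single categorical distribution. The sketch below is not the Discrete Feynman‑Kac Correctors algorithm, which operates on diffusion trajectories; it only shows the annealed target p^β / Z and the standard importance weight p^(β−1) for reweighting samples drawn from p. The function names and the parameter name `beta` are hypothetical.

```python
def anneal(probs, beta):
    """Annealed categorical distribution: p_i**beta, renormalized."""
    weights = [p ** beta for p in probs]
    z = sum(weights)
    return [w / z for w in weights]

def importance_weights(probs, samples, beta):
    """Normalized weights that retarget samples from p to the annealed p."""
    raw = [probs[s] ** (beta - 1.0) for s in samples]
    z = sum(raw)
    return [w / z for w in raw]

# Usage: beta > 1 sharpens the distribution toward its mode.
sharpened = anneal([0.8, 0.2], 2.0)
weights = importance_weights([0.8, 0.2], [0, 0, 1], 2.0)
```

With beta = 1 the distribution is unchanged, beta > 1 concentrates mass on likely outcomes, and beta < 1 flattens it; an SMC scheme would resample particles according to weights of this kind.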
Authors: Sang T. Truong, Duc Q. Nguyen, Willie Neiswanger, Ryan-Rhys Griffiths, Stefano Ermon, Nick Haber, Sanmi Koyejo
Abstract: Bayesian optimization (BO) is a common framework for optimizing black‑box functions, yet most existing methods assume static query costs and rely on myopic acquisition strategies. We introduce LookaHES, a nonmyopic BO framework designed for dynamic, history‑dependent cost environments, where evaluation costs vary with prior actions, such as travel distance in spatial tasks or edit distance in sequence design. LookaHES combines a multi‑step variant of H‑Entropy Search with pathwise sampling and neural policy optimization, enabling long‑horizon planning beyond twenty steps without the exponential complexity of existing nonmyopic methods. The key innovation is the integration of neural policies, including large language models, to effectively navigate structured, combinatorial action spaces such as protein sequences. These policies amortize lookahead planning and can be integrated with domain‑specific constraints during rollout. Empirically, LookaHES outperforms strong myopic and nonmyopic baselines across nine synthetic benchmarks from two to eight dimensions and two real‑world tasks: geospatial optimization using NASA night‑light imagery and protein sequence design with constrained token‑level edits. In short, LookaHES provides a general, scalable, and cost‑aware solution for robust long‑horizon optimization in complex decision spaces, which makes it a useful tool for researchers in machine learning, statistics, and applied domains. Our implementation is available at https://github.com/sangttruong/nonmyopia.
Authors: Mustapha Hamdi, Mourad Jabou
Abstract: Energy efficiency is a first‑order concern in AI deployment, as long‑running inference can exceed training in cumulative carbon impact. We propose a bio‑inspired framework that maps protein‑folding energy basins to inference cost landscapes and controls execution via a decaying, closed‑loop threshold. A request is admitted only when the expected utility‑to‑energy trade‑off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet‑18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio‑controller reduces processing time by 42% compared to standard open‑loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed‑loop energy‑aware inference in production.
Authors: Wajid Arshad Abbasi, Syed Ali Abbas, Maryum Bibi, Saiqa Andleeb, Muhammad Naveed Akhtar
Abstract: The trade‑off between predictive accuracy and data availability makes it difficult to predict protein‑protein binding affinity accurately. The lack of experimentally resolved protein structures limits the performance of structure‑based machine learning models, which generally outperform sequence‑based methods. In order to overcome this constraint, we suggest a regression framework based on knowledge distillation that uses protein structural data during training and only needs sequence data during inference. The suggested method uses binding affinity labels and intermediate feature representations to jointly supervise the training of a sequence‑based student network under the guidance of a structure‑informed teacher network. Leave‑One‑Complex‑Out (LOCO) cross‑validation was used to assess the framework on a non‑redundant protein‑protein binding affinity benchmark dataset. A maximum Pearson correlation coefficient (P_r) of 0.375 and an RMSE of 2.712 kcal/mol were obtained by sequence‑only baseline models, whereas a P_r of 0.512 and an RMSE of 2.445 kcal/mol were obtained by structure‑based models. With a P_r of 0.481 and an RMSE of 2.488 kcal/mol, the distillation‑based student model greatly enhanced sequence‑only performance. Improved agreement and decreased bias were further confirmed by thorough error analyses. With the potential to close the performance gap between sequence‑based and structure‑based models as larger datasets become available, these findings show that knowledge distillation is an efficient method for transferring structural knowledge to sequence‑based predictors. The source code for running inference with the proposed distillation‑based binding affinity predictor can be accessed at https://github.com/wajidarshad/ProteinAffinityKD.
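The joint supervision described above, with affinity labels plus intermediate teacher features, can be written as a two-term loss. The following is a minimal sketch under assumptions: squared error for both terms and a mixing weight `alpha` are illustrative choices, and the function names are hypothetical rather than taken from the ProteinAffinityKD code.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_pred, student_feat, teacher_feat, label, alpha=0.5):
    """Label regression term plus alpha times a feature-matching term.

    alpha is a hypothetical hyperparameter; the abstract does not state how
    the two supervision signals are weighted.
    """
    label_term = (student_pred - label) ** 2
    feature_term = mse(student_feat, teacher_feat)
    return label_term + alpha * feature_term

# Usage: the teacher's structure-informed features are needed only in training.
loss = distillation_loss(1.0, [0.0, 2.0], [1.0, 2.0], label=3.0, alpha=0.5)
```

At inference time only the student runs, so the structure-derived `teacher_feat` input is not required, which matches the sequence-only inference setting in the abstract.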
Authors: R Yadunandan, Nimisha Ghosh
Abstract: De novo drug design is a crucial component of modern drug development, yet navigating the vast chemical space to find synthetically accessible, high‑affinity candidates remains a significant challenge. Reinforcement Learning (RL) enhances this process by enabling multi‑objective optimization and exploration of novel chemical space, capabilities that traditional supervised learning methods lack. In this work, we introduce ReACT‑Drug, a fully integrated, target‑agnostic molecular design framework based on Reinforcement Learning. Unlike models requiring target‑specific fine‑tuning, ReACT‑Drug utilizes a generalist approach by leveraging ESM‑2 protein embeddings to identify similar proteins for a given target from a knowledge base such as the Protein Data Bank (PDB). Thereafter, the known drug ligands corresponding to such proteins are decomposed to initialize a fragment‑based search space, biasing the agent towards biologically relevant subspaces. For each such fragment, the pipeline employs a Proximal Policy Optimization (PPO) agent guiding a ChemBERTa‑encoded molecule through a dynamic action space of chemically valid, reaction‑template‑based transformations. This results in the generation of de novo drug candidates with competitive binding affinities and high synthetic accessibility, while ensuring 100% chemical validity and novelty as per MOSES benchmarking. This architecture highlights the potential of integrating structural biology, deep representation learning, and chemical synthesis rules to automate and accelerate rational drug design. The dataset and code are available at https://github.com/YadunandanRaman/ReACT‑Drug/.
Authors: Simon Gutwein, Arthur Longuefosse, Jun Seita, Sabine Taschner-Mandl, Roxane Licandro
Abstract: Multiplexed tissue imaging measures dozens of protein markers per cell, yet most deep learning models still apply early channel fusion, assuming shared structure across markers. We investigate whether preserving marker independence, combined with deliberately shallow architectures, provides a more suitable inductive bias for self‑supervised representation learning in multiplex data than increasing model scale. Using a Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers, we compare standard early‑fusion CNNs with channel‑separated architectures, including a marker‑aware baseline and our novel shallow Channel‑Independent Model (CIM‑S) with 5.5K parameters. After contrastive pretraining and linear evaluation, early‑fusion models show limited ability to retain marker‑specific information and struggle particularly with rare‑cell discrimination. Channel‑independent architectures, and CIM‑S in particular, achieve substantially stronger representations despite their compact size. These findings are consistent across multiple self‑supervised frameworks, remain stable across augmentation settings, and are reproducible across both the 49‑marker and reduced 18‑marker settings. These results show that lightweight, channel‑independent architectures can match or surpass deep early‑fusion CNNs and foundation models for multiplex representation learning. Code is available at https://github.com/SimonBon/CIM‑S.
Authors: Chang Liu, Vivian Li, Linus Leong, Vladimir Radenkovic, Pietro Liò, Chaitanya K. Joshi
Abstract: Geometric Graph Neural Networks (GNNs) and Transformers have become state‑of‑the‑art for learning from 3D protein structures. However, their reliance on message passing prevents them from capturing the hierarchical interactions that govern protein function, such as global domains and long‑range allosteric regulation. In this work, we argue that the network architecture itself should mirror this biological hierarchy. We introduce Geometric Graph U‑Nets, a new class of models that learn multi‑scale representations by recursively coarsening and refining the protein graph. We prove that this hierarchical design can be theoretically more expressive than standard Geometric GNNs. Empirically, on the task of protein fold classification, Geometric U‑Nets substantially outperform invariant and equivariant baselines, demonstrating their ability to learn the global structural patterns that define protein folds. Our work provides a principled foundation for designing geometric deep learning architectures that can learn the multi‑scale structure of biomolecules.
Authors: Mehmet Efe Akça, Gökçe Uludoğan, Arzucan Özgür, İnci M. Baytaş
Abstract: Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and making the previously trained models outdated. We present STAR‑GO, a Transformer‑based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero‑shot protein function prediction. STAR‑GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence‑function relationships. STAR‑GO achieves state‑of‑the‑art performance and superior zero‑shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction. Code is available at https://github.com/boun‑tabi‑lifelu/stargo.
Authors: Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah, Ramith Hettiarachchi, David R. Koes
Abstract: Structure‑based drug design (SBDD) focuses on designing small‑molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi‑modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein‑ligand data and expanding the chemical diversity available for training. OMTRA obtains state‑of‑the‑art performance on pocket‑conditioned de novo design and docking; however, the effects of large‑scale pretraining and multi‑task training are modest. All code, trained models, and datasets for reproducing this work are available at https://github.com/gnina/OMTRA
Authors: Ming-Hsiu Wu, Ziqian Xie, Shuiwang Ji, Degui Zhi
Abstract: Advancements in AI for science unlock capabilities for critical drug discovery tasks such as protein‑ligand binding affinity prediction. However, current models overfit to existing oversimplified datasets that do not represent naturally occurring and biologically relevant proteins with modifications. In this work, we curate a complete and modification‑aware version of the widely used DAVIS dataset by incorporating 4,032 kinase‑ligand pairs involving substitutions, insertions, deletions, and phosphorylation events. This enriched dataset enables benchmarking of predictive models under biologically realistic conditions. Based on this new dataset, we propose three benchmark settings (Augmented Dataset Prediction, Wild‑Type to Modification Generalization, and Few‑Shot Modification Generalization) designed to assess model robustness in the presence of protein modifications. Through extensive evaluation of both docking‑free and docking‑based methods, we find that docking‑based models generalize better in zero‑shot settings. In contrast, docking‑free models tend to overfit to wild‑type proteins and struggle with unseen modifications but show notable improvement when fine‑tuned on a small set of modified examples. We anticipate that the curated dataset and benchmarks offer a valuable foundation for developing models that better generalize to protein modifications, ultimately advancing precision medicine in drug discovery. The benchmark is available at: https://github.com/ZhiGroup/DAVIS‑complete
Authors: Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik
Abstract: Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi‑scale control, and transfer across architectures. We introduce GeoBPE, a geometry‑grounded PST that transforms continuous, noisy, multi‑scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte‑pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo‑Pair occurrences with k‑medoids to yield a resolution‑controllable vocabulary; (ii) quantizing each Geo‑Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end‑frame loss. GeoBPE offers compression (>10x reduction in bits‑per‑residue at similar distortion rate), data efficiency (>10x less training data), and generalization (maintains test/train distortion ratio of 1.0‑1.1). It is architecture‑agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue‑level embeddings from large PLMs into motif‑ and protein‑level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert‑interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT‑BPE/.
Authors: Zijing Liu, Bin Feng, He Cao, Yu Li
Abstract: Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre‑trained sequence embeddings to bridge the semantic gap between the sequence and structural "languages". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state‑of‑the‑art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near‑instantaneous method for modeling protein dynamics. Source code is available at https://github.com/IDEA‑XL/TokenMD.
Authors: Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson
Abstract: Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time‑consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de‑noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state‑of‑the‑art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de‑noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
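The forward noising idea in the abstract above (adding random insertions to a natural sequence, so that a de‑noiser can learn to delete them) can be illustrated with a toy sketch. Everything specific here is an assumption for illustration only: the uniform choice of insertion position and letter, the function names, and the example sequence are hypothetical and are not taken from SCISOR.

```python
import random

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_random_insertions(seq, n_insertions, rng):
    """Toy forward noising: insert n random letters at random positions.

    Assumption: positions and letters are drawn uniformly; the real
    noising process may use different rates or schedules.
    """
    out = list(seq)
    for _ in range(n_insertions):
        pos = rng.randrange(len(out) + 1)  # any gap, including both ends
        out.insert(pos, rng.choice(AMINO_ACIDS))
    return "".join(out)


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deletions only."""
    it = iter(long)
    return all(ch in it for ch in short)


rng = random.Random(0)
natural = "MKTAYIAKQR"  # hypothetical example sequence
noised = add_random_insertions(natural, 5, rng)
```

Because the noise only inserts letters, the original sequence is always recoverable from the noised one by deletions alone, which is the property a deletion‑based de‑noiser relies on.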
Authors: Chao Song, Zhiyuan Liu, Han Huang, Liang Wang, Qiong Wang, Jianyu Shi, Hui Yu, Yihang Zhou, Yang Zhang
Abstract: Designing enzyme backbones with substrate‑specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate‑specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme‑substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate‑specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA‑annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme‑substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif‑scaffolding model, allowing it to become substrate‑aware. A two‑stage training paradigm further refines the model's ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13% in designability and 13% in catalytic efficiency compared to the baseline models. The code is released at https://github.com/Vecteur‑libre/EnzyControl.
Authors: Mingyu Huang, Shasha Zhou, Ke Li
Abstract: Machine learning models increasingly map biological sequence‑fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagenesis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS‑BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All code and datasets are available at https://github.com/COLA‑Laboratory/GraphFLA.
Authors: Jacob B. Roberts, Catherine R. Ji, Isaac Donnell, Thomas D. Young, Allison N. Pearson, Graham A. Hudson, Leah S. Keiser, Mia Wesselkamper, Peter H. Winegar, Janik Ludwig, Sarah H. Klass, Isha V. Sheth, Ezechinyere C. Ukabiala, Maria C. T. Astolfi, Benjamin Eysenbach, Jay D. Keasling
Abstract: Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning‑assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest‑predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end‑of‑campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness‑based warm‑starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant‑liar batch selector, which improves batch diversity; this is important in multi‑mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open‑source software, making efficient protein optimization accessible to any laboratory.
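The constant‑liar heuristic named in the abstract above is a general batch‑selection trick from Bayesian optimization: after each pick, the selector pretends the chosen point was measured at a fixed value (the "lie"), so later picks in the same batch are pushed away from it. A minimal sketch on a toy 1‑D problem follows. The nearest‑neighbour surrogate, the distance‑based exploration bonus, `beta`, and all names are assumptions made for illustration; the abstract does not describe FolDE's actual models or selector.

```python
def constant_liar_batch(candidates, observed, batch_size, lie, beta=1.0):
    """Pick a batch greedily, 'lying' about each pick's outcome.

    candidates: iterable of 1-D points (numbers).
    observed:   dict mapping already-measured points to their values.
    lie:        constant value assigned to each pick before the next one.
    Assumed toy surrogate: prediction = value of the nearest known point,
    uncertainty bonus = beta * distance to that point.
    """
    data = dict(observed)  # copy so the caller's measurements are untouched
    batch = []
    for _ in range(batch_size):

        def score(x):
            nearest = min(data, key=lambda z: abs(z - x))
            return data[nearest] + beta * abs(nearest - x)

        pool = [x for x in candidates if x not in data]
        best = max(pool, key=score)
        batch.append(best)
        data[best] = lie  # the "lie": treat the pick as already measured
    return batch


observed = {5: 1.0}  # hypothetical single measurement
batch = constant_liar_batch(range(11), observed, batch_size=3, lie=1.0)
```

In this toy setting each lie removes the exploration bonus around the point just picked, so the batch spreads out instead of clustering around the single highest‑scoring region.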
Authors: Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team
Abstract: Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g‑DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group‑based approximations. Across three protein engineering tasks, g‑DPO maintains in silico and in vitro performance that is statistically indistinguishable from standard DPO, while converging 1.7x to 5.4x faster, with speedups that scale with dataset size and the structure of the underlying mutational landscape.
Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat
Abstract: AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome‑wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive. A major bottleneck lies in the Pairformer backbone of AlphaFold3‑style models, which relies on computationally expensive triangular primitives (especially triangle attention) for pairwise reasoning. We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher‑order geometric reasoning capabilities that are critical for structure prediction. Pairmixer substantially improves computational efficiency, matching state‑of‑the‑art structure predictors across folding and docking benchmarks, delivering up to 4x faster inference on long sequences while reducing training cost by 34%. Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high‑throughput ligand and binder screening, and hallucination‑based design. Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences ~30% longer than the memory limits of Pairformer. Code is available at https://github.com/genesistherapeutics/pairmixer.
Authors: Daria Frolova, Talgat Daulbaev, Egor Sevriugov, Sergei A. Nikolenko, Dmitry N. Ivankov, Ivan Oseledets, Marina A. Pak
Abstract: Accurate prediction of protein‑ligand binding poses is crucial for structure‑based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi‑stage flow matching with physically‑aware post‑processing. Our approach consists of three stages applied consecutively to progressively refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces (R^3, SO(3), and SO(2)). We enhance the prediction quality through GNINA energy minimization and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior physical plausibility across all considered benchmarks. Moreover, our method works approximately 31 times faster than modern large‑scale co‑folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.
Authors: Congying Liu, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor because they cannot access authoritative biomedical databases, and they frequently fabricate protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi‑source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database access, and web search to support accurate and efficient handling of complex biomedical queries. Through sub‑query decomposition, keyword extraction, task graph construction, and multi‑source information filtering, BioMedSearch generates high‑quality question‑answering results. To evaluate the accuracy of question answering, we constructed a multi‑level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non‑adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL‑ucas/BioMed_Search
Authors: Zhiyu Wang, Bingxin Zhou, Jing Wang, Yang Tan, Weishu Zhao, Pietro Liò, Liang Hong
Abstract: Proteins are essential biological macromolecules that execute life functions. Local structural motifs, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, a deep‑learning‑based framework for efficient and interpretable residue‑level local structural alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue‑level alignment. Additionally, we introduce PLASMA‑PF, a training‑free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure‑based drug design. Reproducibility is ensured via our official implementation at https://github.com/ZW471/PLASMA‑Protein‑Local‑Alignment.git.
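The abstract above casts alignment as regularized optimal transport solved with Sinkhorn iterations. For readers unfamiliar with the technique, here is a minimal pure‑Python sketch of the standard entropy‑regularized Sinkhorn algorithm with uniform marginals. It is a generic textbook version, not PLASMA's implementation: the cost matrix, the regularization strength `reg`, the iteration count, and the uniform marginals are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport between uniform marginals.

    cost: n x m list of lists, cost[i][j] = cost of matching row item i
          to column item j (e.g. a residue-residue dissimilarity).
    Returns an n x m transport plan whose rows sum to ~1/n and whose
    columns sum to 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # row marginals (assumed uniform)
    b = [1.0 / m] * m  # column marginals (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 3 x 3 cost: matching item i to item i is free, otherwise costs 1.
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
plan = sinkhorn(cost)
```

The resulting plan is a soft alignment matrix: mass concentrates on low‑cost pairs (here the diagonal), and lowering `reg` makes the assignment sharper at the price of slower, less stable convergence.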
Authors: Shaoning Li, Le Zhuo, Yusong Wang, Mingyu Li, Xinheng He, Fandi Wu, Hongsheng Li, Pheng-Ann Heng
Abstract: Developing effective representations of protein structures is essential for advancing protein science, particularly for protein generative modeling. Current approaches often grapple with the complexities of the SE(3) manifold, rely on discrete tokenization, or need multiple training objectives, all of which can hinder model optimization and generalization. We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder designed to overcome these challenges by directly mapping protein backbone coordinates from E(3) into a continuous, compact latent space. ProteinAE employs a non‑equivariant Diffusion Transformer with a bottleneck design for efficient compression and is trained end‑to‑end with a single flow matching objective, substantially simplifying the optimization pipeline. We demonstrate that ProteinAE achieves state‑of‑the‑art reconstruction quality, outperforming existing autoencoders. The resulting latent space serves as a powerful foundation for a latent diffusion model that bypasses the need for explicit equivariance. This enables efficient, high‑quality structure generation that is competitive with leading structure‑based approaches and significantly outperforms prior latent‑based methods. Code is available at https://github.com/OnlyLoveKFC/ProteinAE_v1.
Authors: Henry D. Smith, Nathaniel L. Diamant, Brian L. Trippe
Abstract: Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution, such as the fraction of generations in a given class, deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback‑Leibler divergence satisfying a calibration constraint. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine‑tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine‑tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to nine billion parameters, spanning applications in protein design, image generation, and language modeling.
Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun
Abstract: Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non‑Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non‑autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non‑autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross‑decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross‑decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine‑species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM‑Labs/denovo.
Authors: Kai Yang, Yuqi Huang, Junheng Tao, Wanyu Wang, Qitian Wu
Abstract: Modeling 3D dynamics is a fundamental problem in multi‑body systems across scientific and engineering domains and has important practical implications in object trajectory prediction and simulation. While recent GNN‑based approaches have achieved strong performance by enforcing geometric symmetries, encoding high‑order features, or incorporating neural‑ODE mechanics, they typically depend on explicitly observed structures and inherently fail to capture the unobserved interactions that are crucial to complex physical behaviors and dynamical mechanisms. In this paper, we propose PAINET, a principled SE(3)‑equivariant transformer for learning all‑pair interactions in multi‑body systems. The model comprises: (1) a novel physics‑inspired attention network derived from the minimization trajectory of an energy function, and (2) a parallel decoder that preserves equivariance while enabling efficient inference. Empirical results on diverse real‑world benchmarks, including human motion capture, molecular dynamics, and large‑scale protein simulations, show that PAINET consistently outperforms recently proposed models, yielding 4.7% to 41.5% error reductions in 3D dynamics prediction with comparable computation costs in terms of time and memory. Our code, baseline models, and datasets are available at https://github.com/Icarus1411/PAINET.
Authors: Liyang Xie, Haoran Zhang, Zhendong Wang, Wesley Tansey, Mingyuan Zhou
Abstract: Diffusion‑ and flow‑based generative models have recently demonstrated strong performance in protein backbone generation tasks, offering unprecedented capabilities for de novo protein design. However, while achieving notable performance in generation quality, these models are limited by their generating speed, often requiring hundreds of iterative steps in the reverse‑diffusion process. This computational bottleneck limits their practical utility in large‑scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore the techniques of score distillation, which has shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we have identified how to appropriately adapt Score identity Distillation (SiD), a state‑of‑the‑art score distillation strategy, to train few‑step protein backbone generators which significantly reduce sampling time, while maintaining comparable performance to their pretrained teacher model. In particular, multistep generation combined with inference time noise modulation is key to the success. We demonstrate that our distilled few‑step generators achieve more than a 20‑fold improvement in sampling speed, while achieving similar levels of designability, diversity, and novelty as the Proteina teacher model. This reduction in inference cost enables large‑scale in silico protein design, thereby bringing diffusion‑based models closer to real‑world protein engineering applications. The PyTorch implementation is available at https://github.com/LY‑Xie/SiD_Protein
Authors: Hanqun Cao, Hongrui Zhang, Junde Xu, Zhou Zhang, Lingdong Shen, Minghao Sun, Ge Liu, Jinbo Xu, Wu-Jun Li, Jinren Ni, Cesar de la Fuente-Nunez, Tianfan Fu, Yejin Choi, Pheng-Ann Heng, Fang Wu
Abstract: Protein language models (PLMs) have advanced computational protein science through large‑scale pretraining and scalable architectures. In parallel, reinforcement learning (RL) has broadened exploration and enabled precise multi‑objective optimization in protein design. Yet whether RL can push PLMs beyond their pretraining priors to uncover latent sequence‑structure‑function rules remains unclear. We address this by pairing RL with PLMs across four domains: antimicrobial peptide design, kinase variant optimization, antibody engineering, and inverse folding. Using diverse RL algorithms and model classes, we ask if RL improves sampling efficiency and, more importantly, if it reveals capabilities not captured by supervised learning. Across benchmarks, RL consistently boosts success rates and sample efficiency. Performance follows a three‑factor interaction: task headroom, reward fidelity, and policy capacity jointly determine gains. When rewards are accurate and informative, policies have sufficient capacity, and tasks leave room beyond supervised baselines, improvements scale; when rewards are noisy or capacity is constrained, gains saturate despite exploration. This view yields practical guidance for RL in protein design: prioritize reward modeling and calibration before scaling policy size, match algorithm and regularization strength to task difficulty, and allocate capacity where marginal gains are largest. Implementation is available at https://github.com/chq1155/RL‑PLM.
Authors: Rohit Dilip, Evan Zhang, Ayush Varshney, David Van Valen
Abstract: Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow‑based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We show that this approach simplifies several aspects of protein structure tokenizers: frame‑based representations can be replaced with global coordinates, complex losses are replaced with a single flow matching loss, and SE(3)‑invariant attention operations can be replaced with standard attention. We find that these changes stabilize the training of parameter‑efficient models that outperform existing tokenizers on reconstruction metrics at a fraction of the model size and training cost. An autoregressive model trained with Kanzi outperforms similar generative models that operate over tokens, although it does not yet match the performance of state‑of‑the‑art continuous diffusion models. Code is available here: https://github.com/rdilip/kanzi/.
Authors: Long Xu, Yongcai Chen, Fengshuo Liu, Yuzhong Peng
Abstract: Structure‑Based Drug Design (SBDD) is a powerful strategy in computational drug discovery, utilizing three‑dimensional protein structures to guide the design of molecules with improved binding affinity. However, capturing complex protein‑ligand interactions across multiple scales remains challenging, as current methods often overlook the hierarchical organization and intrinsic asymmetry of these interactions. To address these limitations, we propose MSCoD, a novel Bayesian updating‑based generative framework for structure‑based drug design. MSCoD introduces a Multi‑Scale Information Bottleneck (MSIB), which enables semantic compression at multiple abstraction levels for efficient hierarchical feature extraction. Furthermore, a multi‑head cooperative attention (MHCA) mechanism employs asymmetric protein‑to‑ligand attention to capture diverse interaction types while addressing the dimensionality disparity between proteins and ligands. Empirical studies show that MSCoD outperforms state‑of‑the‑art methods on the benchmark dataset. Its real‑world applicability is confirmed by case studies on difficult targets like KRAS G12D (7XKJ). Additionally, the MSIB and MHCA modules prove transferable, boosting the performance of GraphDTA on standard drug target affinity prediction benchmarks (Davis and Kiba). The code and data underlying this article are freely available at https://github.com/xulong0826/MSCoD.
Authors: Nimisha Ghosh, Dheeran Sankaran, Rahul Balakrishnan Adhi, Sharath S, Amrut Anand
Abstract: Identifying DNA‑ (DBPs) and RNA‑binding proteins (RBPs) is crucial for understanding cell function, molecular interactions, and regulatory functions. Owing to their high similarity, most existing approaches struggle to differentiate between DBPs and RBPs, leading to high cross‑prediction errors. Moreover, identifying proteins that bind to both DNA and RNA (DRBPs) is also a challenging task. In this regard, we propose a novel framework, LAMP‑PRo, which is based on a pre‑trained protein language model (PLM), attention mechanisms, and multi‑label learning to mitigate these issues. First, a pre‑trained PLM such as ESM‑2 is used to embed the protein sequences, followed by a convolutional neural network (CNN). Subsequently, a multi‑head self‑attention mechanism is applied to capture contextual information, while label‑aware attention is used to compute class‑specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non‑NABP) in a multi‑label setup. We have also included a novel cross‑label attention mechanism to explicitly capture dependencies between DNA‑ and RNA‑binding proteins, enabling more accurate prediction of DRBPs. Finally, a linear layer followed by a sigmoid function is used for the final prediction. Extensive experiments are carried out to compare LAMP‑PRo with the existing methods, wherein the proposed model shows consistent competent performance. Furthermore, we also provide visualization to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP_MMC and the code is available at https://github.com/NimishaGhosh/LAMP‑PRo.
Authors: Yinuo Ren, Wenhao Gao, Lexing Ying, Grant M. Rotskoff, Jiequn Han
Abstract: We study inference‑time scaling for diffusion models, where the goal is to adapt a pre‑trained model to new target distributions without retraining. Existing guidance‑based methods are simple but introduce bias, while particle‑based corrections suffer from weight degeneracy and high computational cost. We introduce DriftLite, a lightweight, training‑free particle‑based approach that steers the inference dynamics on the fly with provably optimal stability control. DriftLite exploits a previously unexplored degree of freedom in the Fokker‑Planck equation between the drift and particle potential, and yields two practical instantiations: Variance‑ and Energy‑Controlling Guidance (VCG/ECG) for approximating the optimal drift with minimal overhead. Across Gaussian mixture models, particle systems, and large‑scale protein‑ligand co‑folding problems, DriftLite consistently reduces variance and improves sample quality over pure guidance and sequential Monte Carlo baselines. These results highlight a principled, efficient route toward scalable inference‑time adaptation of diffusion models. Our source code is publicly available at https://github.com/yinuoren/DriftLite.
Authors: Sarang Patil, Zeyong Zhang, Yiran Huang, Tengfei Ma, Mengjia Xu
Abstract: Large language models (LLMs) have achieved remarkable success and demonstrated superior performance across various tasks, including natural language processing (NLP), weather forecasting, biological protein folding, text generation, and solving mathematical problems. However, many real‑world data exhibit highly non‑Euclidean latent hierarchical anatomy, such as protein networks, transportation networks, financial networks, brain networks, and linguistic structures or syntactic trees in natural languages. Effectively learning intrinsic semantic entailment and hierarchical relationships from these raw, unstructured input data using LLMs remains an underexplored area. Due to its effectiveness in modeling tree‑like hierarchical structures, hyperbolic geometry (a non‑Euclidean space) has rapidly gained popularity as an expressive latent representation space for complex data modeling across domains such as graphs, images, languages, and multi‑modal data. Here, we provide a comprehensive and contextual exposition of recent advancements in LLMs that leverage hyperbolic geometry as a representation space to enhance semantic representation learning and multi‑scale reasoning. Specifically, the paper presents a taxonomy of the principal techniques of Hyperbolic LLMs (HypLLMs) in terms of four main categories: (1) hyperbolic LLMs through exp/log maps; (2) hyperbolic fine‑tuned models; (3) fully hyperbolic LLMs; and (4) hyperbolic state‑space models. We also explore crucial potential applications and outline future research directions. A repository of key papers, models, datasets, and code implementations is available at https://github.com/sarangp2402/Hyperbolic‑LLM‑Models.
Authors: Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
Abstract: Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red‑teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red‑teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red‑teaming methods and conduct tests on protein foundation models. We also curated SafeProtein‑Bench, which includes a manually constructed red‑teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state‑of‑the‑art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang‑fan/SafeProtein.
Authors: Vsevolod Viliuga, Leif Seute, Nicolas Wolf, Simon Wagner, Arne Elofsson, Jan Stühmer, Frauke Gräter
Abstract: Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state‑of‑the‑art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per‑residue flexibility from an input backbone structure. Relying on BackFlip, we propose FliPS, an SE(3)‑equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that FliPS is able to generate novel and diverse protein backbones with the desired flexibility, verified by Molecular Dynamics (MD) simulations. FliPS and BackFlip are available at https://github.com/graeter‑group/flips.
Authors: Bokai Zhao, Weiyang Shi, Hanqing Chao, Zijiang Yang, Yiyang Zhang, Ming Song, Tianzi Jiang
Abstract: Spatial proteomics maps protein distributions in tissues, providing transformative insights for life sciences. However, current sequencing‑based technologies suffer from low spatial resolution, and substantial inter‑tissue variability in protein expression further compromises the performance of existing molecular data prediction methods. In this work, we introduce the novel task of spatial super‑resolution for sequencing‑based spatial proteomics (seq‑SP) and, to the best of our knowledge, propose the first deep learning model for this task, Neural Proteomics Fields (NPF). NPF formulates seq‑SP as a protein reconstruction problem in continuous space by training a dedicated network for each tissue. The model comprises a Spatial Modeling Module, which learns tissue‑specific protein spatial distributions, and a Morphology Modeling Module, which extracts tissue‑specific morphological features. Furthermore, to facilitate rigorous evaluation, we establish an open‑source benchmark dataset, Pseudo‑Visium SP, for this task. Experimental results demonstrate that NPF achieves state‑of‑the‑art performance with fewer learnable parameters, underscoring its potential for advancing spatial proteomics research. Our code and dataset are publicly available at https://github.com/Bokai‑Zhao/NPF.
Authors: Yuxuan Song, Zhe Zhang, Yu Pei, Jingjing Gong, Qiying Yu, Zheng Zhang, Mingxuan Wang, Hao Zhou, Jingjing Liu, Wei-Ying Ma
Abstract: Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex‑based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier‑free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character‑level and large‑vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at https://github.com/GenSI‑THUAIR/SLM
Authors: Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve
Abstract: Sub‑visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi‑class classifiers to such problems, often forcing researchers to rely on less effective methods. This issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state‑of‑the‑art diffusion model to address data imbalance by generating high‑fidelity images that can augment training datasets, enabling the effective training of multi‑class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion‑generated images in training datasets, we conduct large‑scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi‑class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp‑generative‑ai.
Authors: Amartya Banerjee, Xingyu Xu, Caroline Moosmüller, Harlin Lee
Abstract: In an inverse problem, the goal is to recover an unknown parameter (e.g., an image) that has typically undergone some lossy or noisy transformation during measurement. Recently, deep generative models, particularly diffusion models, have emerged as powerful priors for protein structure generation. However, integrating noisy experimental data from multiple sources to guide these models remains a significant challenge. Existing methods often require precise knowledge of experimental noise levels and manually tuned weights for each data modality. In this work, we introduce Adam‑PnP, a Plug‑and‑Play framework that guides a pre‑trained protein diffusion model using gradients from multiple, heterogeneous experimental sources. Our framework features an adaptive noise estimation scheme and a dynamic modality weighting mechanism integrated into the diffusion process, which reduce the need for manual hyperparameter tuning. Experiments on complex reconstruction tasks demonstrate significantly improved accuracy using Adam‑PnP.
Authors: Hongzhi Zhang, Zhonglie Liu, Kun Meng, Jiameng Chen, Jia Wu, Bo Du, Di Lin, Yan Che, Wenbin Hu
Abstract: Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero‑shot compound‑protein interaction (CPI) prediction better reflects the practical challenges and requirements of real‑world drug development. Although existing methods perform adequately on certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large‑scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length‑variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model's effectiveness and zero‑shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model's performance on the CPI task, especially in the challenging zero‑shot scenario. Compared to existing pre‑training models, our model demonstrates superior performance, particularly in data‑scarce scenarios where training samples are limited. Our implementation is available at https://github.com/Hoch‑Zhang/PSRP‑CPI.
Authors: Lang Yu, Zhangyang Gao, Cheng Tan, Qin Chen, Jie Zhou, Liang He
Abstract: SE(3)‑based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein‑SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high‑level mathematical abstraction, and diverse evaluation metrics. Recent generative models designed for protein scaffolding, spanning DDPM (Genie1 and Genie2), score matching (FrameDiff and RFdiffusion), and flow matching (FoldFlow and FrameFlow), are integrated into our framework. All integrated methods are compared fairly using the same training dataset and evaluation metrics. Furthermore, we provide a high‑level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon a unified training framework for SE(3)‑based protein structure design, which is publicly accessible at https://github.com/BruthYU/protein‑se3.
Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC‑Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross‑modal encoding to integrate diverse biological signals. ATGC‑Gen is instantiated with both decoder‑only and encoder‑only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC‑Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP‑Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC‑Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen).
Authors: Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He
Abstract: Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, catalysis discovery, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion‑based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in‑depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at https://github.com/Violet24K/Awesome‑Flow‑Matching‑Meets‑Biology.
Authors: Kai Yi, Kiarash Jamali, Sjors H. W. Scheres
Abstract: The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so‑called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non‑protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All‑atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow‑matching for designing protein sequences conditioned on all‑atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training‑free classifier guidance sampling, which allows the incorporation of arbitrary pre‑trained models to optimise the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small‑molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state‑of‑the‑art performance in single‑structure and multi‑structure inverse folding tasks, demonstrating excellent potential for all‑atom protein design. The code is available at https://github.com/ykiiiiii/ADFLIP.
Authors: Yaowei Jin, Junjie Wang, Yufan Tang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi
Abstract: Motivation: Structure‑based drug design (SBDD) has advanced with deep generative models, but bridging the gap between continuous atomic coordinates and discrete atom types remains a challenge. Current approaches, such as diffusion and flow matching models, often fail to unify these heterogeneous modalities, relying on separate strategies or ill‑fitting Euclidean metrics for discrete variables. This lack of a consistent framework limits generative models' ability to capture the geometric and chemical structure of protein‑ligand complexes. Results: We present MolPIF, a parameter interpolation flow mechanism designed to unify the generation of continuous and discrete molecular variables. Unlike traditional flow models that operate in sample space, MolPIF interpolates between distributions in the parameter space, theoretically recovering Wasserstein‑2 optimal transport for continuous coordinates and establishing Fisher‑Rao geodesics for discrete atom types. We further incorporate a geometry‑enhanced learning strategy to improve the capture of atomic contexts. Extensive evaluations on the CrossDocked2020 dataset demonstrate that MolPIF outperforms baselines in binding affinity, chemical validity, geometric fidelity and chemical space coverage. Additionally, MolPIF exhibits versatility in lead optimization and offers flexible prior distribution selection (such as Laplace), establishing a robust paradigm for SBDD. Availability: Source code is freely available at https://github.com/BLEACH366/MolPIF. Supplementary information: Supplementary data are available at Bioinformatics.
Authors: Zhonglin Liu
Abstract: Innate resistance to anti‑PD‑1 immunotherapy remains a major clinical challenge in metastatic melanoma, with the underlying molecular networks being poorly understood. To address this, we constructed a dynamic Probabilistic Boolean Network model using transcriptomic data from patient tumor biopsies to elucidate the regulatory logic governing therapy response. We then employed a reinforcement learning agent to systematically discover optimal, multi‑step therapeutic interventions and used explainable artificial intelligence to mechanistically interpret the agent's control policy. The analysis revealed that a precisely timed, 4‑step temporary inhibition of the lysyl oxidase like 2 protein (LOXL2) was the most effective strategy. Our explainable analysis showed that this "hit‑and‑run" intervention is sufficient to erase the molecular signature driving resistance, allowing the network to self‑correct without requiring sustained intervention. This study presents a novel, time‑dependent therapeutic hypothesis for overcoming immunotherapy resistance and provides a powerful computational framework for identifying non‑obvious intervention protocols in complex biological systems.
Authors: Pei-Kun Yang
Abstract: Structure‑based virtual screening (SBVS) is a key computational strategy for identifying potential drug candidates by estimating the binding free energies (ΔG_bind) of protein‑ligand complexes. The immense size of chemical libraries, combined with the need to account for protein and ligand conformations as well as ligand translations and rotations, makes these tasks computationally intensive on classical hardware. This study proposes a quantum convolutional neural network (QCNN) framework to estimate ΔG_bind efficiently. Using the PDBbind v2020 dataset, we trained QCNN models with 9 and 12 qubits, with the core set designated as the test set. The best‑performing model achieved a Pearson correlation coefficient of 0.694 on the test set. To assess robustness, we introduced quantum noise under two configurations. While noise increased the root mean square deviation, the Pearson correlation coefficient remained largely stable. These results demonstrate the feasibility and noise tolerance of QCNNs for high‑throughput virtual screening and highlight the potential of quantum computing to accelerate drug discovery.
Authors: Frédéric A. Dreyer, Jan Ludwiczak, Karolis Martinkus, Brennan Abanades, Robert G. Alberstein, Pan Kessel, Pranav Rao, Jae Hyeon Lee, Richard Bonneau, Andrew M. Watkins, Franziska Seeger
Abstract: We introduce Ibex, a pan‑immunoglobulin structure prediction model that achieves state‑of‑the‑art accuracy in modeling the variable domains of antibodies, nanobodies, and T‑cell receptors. Unlike previous approaches, Ibex explicitly distinguishes between bound and unbound protein conformations by training on labeled apo and holo structural pairs, enabling accurate prediction of both states at inference time. Using a comprehensive private dataset of high‑resolution antibody structures, we demonstrate superior out‑of‑distribution performance compared to existing specialized and general protein structure prediction tools. Ibex combines the accuracy of cutting‑edge models with significantly reduced computational requirements, providing a robust foundation for accelerating large molecule design and therapeutic development.
Authors: Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola
Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre‑trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arising from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class‑conditional ImageNet 256×256 benchmark, our guidance results in 23.3 times faster training than the original SiT‑XL as well as a four times speedup over the state‑of‑the‑art method REPA. The code is available at https://github.com/ChenyuWang‑Monica/REED.
Authors: Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe
Abstract: Accurate estimation of mutational effects on protein‑protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer‑learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre‑trained inverse‑folding model as a proxy for folding energy provides strong zero‑shot performance, and can be fine‑tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB‑ddG, is the first deep learning predictor to match the accuracy of the state‑of‑the‑art empirical force‑field method FoldX, while offering an over 1,000x speed‑up.
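The decomposition described in the abstract above can be written out as a short worked equation. This is a sketch only: the labels A and B for the two binding partners, AB for the complex, and the mutant-minus-wild-type sign convention for the mutational effect are assumptions introduced here for illustration, not notation taken from the paper.

```latex
% Binding energy as complex folding energy minus the partners' folding energies
\Delta G_{\mathrm{bind}}
  = \Delta G_{\mathrm{fold}}(AB)
  - \bigl[\Delta G_{\mathrm{fold}}(A) + \Delta G_{\mathrm{fold}}(B)\bigr]

% Mutational effect on binding (assumed convention: mutant minus wild type)
\Delta\Delta G_{\mathrm{bind}}
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}}
  - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
```

Under this reading, any model that scores folding energy, such as the inverse‑folding proxy the abstract mentions, yields a binding estimate by evaluating it three times: once on the complex and once on each partner.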
Authors: Xinzhe Zheng, Hao Du, Fanding Xu, Jinzhe Li, Zhiyuan Liu, Wenkang Wang, Tao Chen, Wanli Ouyang, Stan Z. Li, Yan Lu, Nanqing Dong, Yang Zhang
Abstract: Deep learning‑based computational methods have achieved promising results in predicting protein‑protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein‑protein interaction prediction from a graph‑level perspective. PRING curates a high‑quality, multi‑species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well‑designed strategies to address both data redundancy and leakage. Building on this gold‑standard dataset, we establish two complementary evaluation paradigms: (1) topology‑oriented tasks, which assess intra‑ and cross‑species PPI network construction, and (2) function‑oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity‑based, naive sequence‑based, protein language model‑based, and structure‑based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real‑world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.
Authors: Xinyue Zeng, Tuo Wang, Adithya Kulkarni, Alexander Lu, Alexandra Ni, Phoebe Xing, Junhan Zhao, Siwei Chen, Dawei Zhou
Abstract: Intrinsically disordered regions (IDRs) play central roles in cellular function, yet remain poorly evaluated by existing protein structure prediction benchmarks. Current evaluations largely focus on well‑folded domains, overlooking three fundamental challenges in realistic biological settings: the structural complexity of proteins, the resulting low availability of reliable ground truth, and prediction uncertainty that can propagate into high‑risk downstream failures, such as in drug discovery, protein‑protein interaction modeling, and functional annotation. We present DisProtBench, an IDR‑centric benchmark that explicitly incorporates prediction uncertainty into the evaluation of protein structure prediction models (PSPMs). To address structural complexity and ground‑truth scarcity, we curate and unify a large‑scale, multi‑modal dataset spanning disease‑relevant IDRs, GPCR‑ligand interactions, and multimeric protein complexes. To assess predictive uncertainty, we introduce Functional Uncertainty Sensitivity (FUS), a novel prediction uncertainty‑stratified metric that quantifies downstream task performance under prediction uncertainty. Using this benchmark, we conduct a systematic evaluation of state‑of‑the‑art PSPMs and reveal clear, task‑dependent failure modes. Protein‑protein interaction prediction degrades sharply in IDRs, while structure‑based drug discovery remains comparatively robust. These effects are largely invisible to standard global accuracy metrics, which overestimate functional reliability under prediction uncertainty. We have open‑sourced our benchmark and the codebase at https://github.com/Susan571/DisProtBench.
Authors: Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji
Abstract: We address the problem of fine‑tuning diffusion models for reward‑guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high‑dimensional data distributions, real‑world applications often demand more than high‑fidelity generation, requiring optimization with respect to potentially non‑differentiable reward functions such as physics‑based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine‑tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on‑policy nature. In this work, we propose an iterative distillation‑based fine‑tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off‑policy data during the roll‑in phase, simulates reward‑based soft‑optimal policies during roll‑out, and updates the model by minimizing the KL divergence between the simulated soft‑optimal policy and the current model policy. Our off‑policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL‑based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design. The source code is released at (https://divelab.github.io/VIDD/).
Authors: Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang
Abstract: Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science‑informed architectural changes to the transformer architecture. However, these advances come at a steep system cost, introducing compute‑ and memory‑intensive operators, 2D attention mechanisms, and retrieval‑augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross‑platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead‑of‑time caching to eliminate GPU idle time from the retrieval‑augmented data pipeline, Triton‑based kernels for memory‑efficient EvoAttention on heterogeneous devices, and deep fusion for common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by up to 1.23× and improves per‑iteration training time by up to 1.73× and 1.62×, respectively. More importantly, MegaFold enables training on 1.35× longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models. We open source our code at https://github.com/Supercomputing‑System‑AI‑Lab/MegaFold/.
Authors: Yuelin Zhang, Jiacheng Cen, Jiaqi Han, Wenbing Huang
Abstract: Equivariant Graph Neural Networks (GNNs) have achieved remarkable success across diverse scientific applications. However, existing approaches face critical efficiency challenges when scaling to large geometric graphs and suffer significant performance degradation when the input graphs are sparsified for computational tractability. To address these limitations, we introduce FastEGNN and DistEGNN, two novel enhancements to equivariant GNNs for large‑scale geometric graphs. FastEGNN employs a key innovation: a small ordered set of virtual nodes that effectively approximates the large unordered graph of real nodes. Specifically, we implement distinct message passing and aggregation mechanisms for different virtual nodes to ensure mutual distinctiveness, and minimize Maximum Mean Discrepancy (MMD) between virtual and real coordinates to achieve global distributedness. This design enables FastEGNN to maintain high accuracy while efficiently processing large‑scale sparse graphs. For extremely large‑scale geometric graphs, we present DistEGNN, a distributed extension where virtual nodes act as global bridges between subgraphs in different devices, maintaining consistency while dramatically reducing memory and computational overhead. We comprehensively evaluate our models across four challenging domains: N‑body systems (100 nodes), protein dynamics (800 nodes), Water‑3D (8,000 nodes), and our new Fluid113K benchmark (113,000 nodes). Results demonstrate superior efficiency and performance, establishing new capabilities in large‑scale equivariant graph learning. Code is available at https://github.com/GLAD‑RUC/DistEGNN.
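The abstract above mentions minimizing a Maximum Mean Discrepancy (MMD) between virtual and real coordinates. As a rough illustration of what such a quantity measures, here is a minimal NumPy sketch of a biased squared‑MMD estimate. The RBF kernel, the bandwidth, and the names `rbf_kernel`, `mmd_squared`, `real_coords`, `virtual_near`, and `virtual_far` are all assumptions made for this example; the abstract does not state which kernel or estimator the authors use.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances, shape (len(x), len(y)).
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between point sets x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical data: 200 "real" node coordinates in 3D and two candidate
# sets of 4 "virtual" nodes, one inside the point cloud and one far away.
rng = np.random.default_rng(0)
real_coords = rng.normal(size=(200, 3))
virtual_near = real_coords[:4].copy()
virtual_far = real_coords[:4] + 10.0

print("MMD^2, virtual nodes inside the cloud:", mmd_squared(real_coords, virtual_near))
print("MMD^2, virtual nodes far from the cloud:", mmd_squared(real_coords, virtual_far))
```

A lower value means the small virtual set covers the real point cloud more faithfully, which is the intuition behind using MMD as a regularizer for a handful of virtual nodes standing in for a large graph.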
Authors: Zhangyang Gao, Hao Wang, Cheng Tan, Chenrui Xu, Mengdi Liu, Bozhen Hu, Linlin Chao, Xiaoming Zhang, Stan Z. Li
Abstract: This study investigates the current landscape and future directions of protein foundation model research. While recent advancements have transformed protein science and engineering, the field lacks a comprehensive benchmark for fair evaluation and in‑depth understanding. Since ESM‑1B, numerous protein foundation models have emerged, each with unique datasets and methodologies. However, evaluations often focus on limited tasks tailored to specific models, hindering insights into broader generalization and limitations. Specifically, researchers struggle to understand the relationships between tasks, assess how well current models perform across them, and determine the criteria in developing new foundation models. To fill this gap, we present PFMBench, a comprehensive benchmark evaluating protein foundation models across 38 tasks spanning 8 key areas of protein science. Through hundreds of experiments on 17 state‑of‑the‑art models across 38 tasks, PFMBench reveals the inherent correlations between tasks, identifies top‑performing models, and provides a streamlined evaluation protocol. Code is available at https://github.com/biomap‑research/PFMBench.
Authors: Zhenqiao Song, Tiaoxiao Li, Lei Li, Martin Renqiang Min
Abstract: Designing protein‑binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high‑affinity binders for arbitrary protein targets on demand, without extensive rounds of wet‑lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non‑autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self‑attention layers to capture global amino acid correlations, k‑nearest neighbor (kNN) equivariant graph layers to model local interactions in three‑dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein‑protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real‑world applications: target‑protein mini‑binder complex design and antigen‑antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data and models are available at https://github.com/JocelynSong/PPDiff.
Authors: Alan N. Amin, Nate Gruver, Andrew Gordon Wilson
Abstract: Discrete diffusion models, like continuous diffusion models, generate high‑quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and improved sampling algorithms become available. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models, schedule‑conditioned discrete diffusion (SCUD), generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.
Authors: Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
Abstract: Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next‑token prediction fundamentally limits their ability to form coherent, high‑level concepts, making it a critical barrier to human‑like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept‑Aware Fine‑Tuning (CAFT), a novel multi‑token training method that redefines how LLMs are fine‑tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept‑aware learning. Our experiments demonstrate significant improvements compared to conventional next‑token finetuning methods across diverse tasks, including traditional applications like text summarization and domain‑specific ones like de novo protein design. Multi‑token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi‑token setting to the post‑training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen‑lab/caft‑llm
Authors: Fudong Lin, Wanrou Du, Jinchan Liu, Tarikul Milon, Shelby Meche, Wu Xu, Xiaoqi Qin, Xu Yuan
Abstract: Deep neural networks, particularly Transformers, have been widely adopted for predicting the functional properties of proteins. In this work, we focus on exploring whether Protein Transformers can capture biological intelligence among protein sequences. To achieve our goal, we first introduce a protein function dataset, namely Protein‑FN, providing over 9000 protein samples with meaningful labels. Second, we devise a new Transformer architecture, namely Sequence Protein Transformers (SPT), for computationally efficient protein function predictions. Third, we develop a novel Explainable Artificial Intelligence (XAI) technique called Sequence Score, which can efficiently interpret the decision‑making processes of protein models, thereby overcoming the difficulty of deciphering biological intelligence hidden in Protein Transformers. Remarkably, even our smallest SPT‑Tiny model, which contains only 5.4M parameters, demonstrates impressive predictive accuracy, achieving 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on the Protein‑FN dataset, all accomplished by training from scratch. Besides, our Sequence Score technique helps reveal that our SPT models can discover several meaningful patterns underlying the sequence structures of protein data, with these patterns aligning closely with the domain knowledge in the biology community. We have officially released our Protein‑FN dataset on Hugging Face Datasets https://huggingface.co/datasets/Protein‑FN/Protein‑FN. Our code is available at https://github.com/fudong03/BioIntelligence.
Authors: Wenyu Zhu, Jianhui Wang, Bowen Gao, Yinjun Jia, Haichuan Tan, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Abstract: Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods, whether physics‑based or deep learning‑based, are developed around holo protein structures with known ligand‑bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real‑world early‑stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment‑and‑aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri‑modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross‑attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state‑of‑the‑art methods in the blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first‑in‑class drug discovery, particularly in scenarios lacking experimentally resolved protein‑ligand complexes. Our implementation is publicly available at https://github.com/Wiley‑Z/AANet.
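The EF1% metric quoted in the abstract above is the early enrichment factor: the hit rate among the top 1% of ranked compounds divided by the hit rate in the whole library. The following is a minimal sketch of that standard definition, not code from the AANet repository; the function name `enrichment_factor` and its arguments are hypothetical, and it assumes higher scores mean "more likely active" and that the library contains at least one active.

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at the given top fraction of a ranked library.

    Assumes higher score = predicted more active, and at least one active.
    """
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n = scores.size
    # Number of compounds in the top fraction (at least one).
    n_top = max(1, int(round(fraction * n)))
    # Indices of the highest-scoring compounds.
    top = np.argsort(-scores, kind="stable")[:n_top]
    hit_rate_top = is_active[top].mean()   # actives among the top fraction
    hit_rate_all = is_active.mean()        # actives in the whole library
    return float(hit_rate_top / hit_rate_all)

# Toy library: 1000 compounds, 10 actives, all ranked first.
scores = np.arange(1000, 0, -1)
labels = np.zeros(1000, dtype=bool)
labels[:10] = True
ef_perfect = enrichment_factor(scores, labels, fraction=0.01)
```

With a perfect ranking the top 1% contains only actives, so the factor reaches its maximum of 1 / (fraction of actives in the library); a random ranking gives a value near 1.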
Authors: Zishan Shu, Yufan Deng, Hongyu Zhang, Zhiwei Nie, Jie Chen
Abstract: Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi‑Grained Target Perception network (MTPNet) to incorporate the prior knowledge of interactions between the molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro‑level Target Semantic (MTS) guidance and Micro‑level Pocket Semantic (MPS) guidance. By this way, MTPNet dynamically optimizes molecular representations through multi‑grained protein semantic conditions. To our knowledge, it is the first time to employ the receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: https://github.com/ZishanShu/MTPNet.
Authors: Rishwanth Raghu, Axel Levy, Gordon Wetzstein, Ellen D. Zhong
Abstract: Protein structure prediction models are now capable of generating accurate 3D structural hypotheses from sequence alone. However, they routinely fail to capture the conformational diversity of dynamic biomolecular complexes, often requiring heuristic MSA subsampling approaches for generating alternative states. In parallel, cryo‑electron microscopy (cryo‑EM) has emerged as a powerful tool for imaging near‑native structural heterogeneity, but is challenged by arduous pipelines to transform raw experimental data into atomic models. Here, we bridge the gap between these modalities, combining cryo‑EM density maps with the rich sequence and biophysical priors learned by protein structure prediction models. Our method, CryoBoltz, guides the sampling trajectory of a pretrained biomolecular structure prediction model using both global and local structural constraints derived from density maps, driving predictions towards conformational states consistent with the experimental data. We demonstrate that this flexible yet powerful inference‑time approach allows us to build atomic models into heterogeneous cryo‑EM maps across a variety of dynamic biomolecular systems including transporters and antibodies. Code is available at https://github.com/ml‑struct‑bio/cryoboltz.
Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T. C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood
Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post‑training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single‑cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self‑supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence‑based imaging platforms. We introduce key architectural adaptations to address the high‑dimensional, multi‑channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state‑of‑the‑art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data‑efficient. KRONOS also introduces the paradigm of segmentation‑free patch‑level processing for efficient and scalable spatial proteomics analysis, allowing cross‑institutional comparisons and serving as a reverse image search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
Authors: Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang
Abstract: The detection of ligand binding sites for proteins is a fundamental step in Structure‑Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein‑ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite‑DS, the first UniProt (Unique Protein)‑centric ligand binding site dataset, which contains 4.81 times more multi‑site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end‑to‑end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite‑DS and several representative benchmark datasets demonstrate that IoU‑based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state‑of‑the‑art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin‑wu/unisite.
Authors: Navid NaderiAlizadeh, Darian Salehi, Xinran Liu, Soheil Kolouri
Abstract: Sliced Wasserstein (SW) distances offer an efficient method for comparing high‑dimensional probability measures by projecting them onto multiple 1‑dimensional probability distributions. However, identifying informative slicing directions has proven challenging, often necessitating a large number of slices to achieve desirable performance and thereby increasing computational complexity. We introduce a constrained learning approach to optimize the slicing directions for SW distances. Specifically, we constrain the 1D transport plans to approximate the optimal plan in the original space, ensuring meaningful slicing directions. By leveraging continuous relaxations of these transport plans, we enable a gradient‑based primal‑dual approach to train the slicer parameters, alongside the remaining model parameters. We demonstrate how this constrained slicing approach can be applied to pool high‑dimensional embeddings into fixed‑length permutation‑invariant representations. Numerical results on foundation models trained on images, point clouds, and protein sequences showcase the efficacy of the proposed constrained learning approach in learning more informative slicing directions. Our implementation code can be found at https://github.com/Stranja572/constrainedswe.
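For readers unfamiliar with the baseline that the abstract above improves on, here is a minimal sketch of the ordinary Monte Carlo sliced Wasserstein distance with random slicing directions. It is not the constrained, learned slicer the authors propose; the function name `sliced_wasserstein`, the default of 64 slices, and the restriction to two point clouds of equal size are assumptions made for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, n_slices=64, p=2, seed=0):
    """Monte Carlo sliced p-Wasserstein distance between two point clouds.

    x, y: arrays of shape (n, d) with the same n (equal-weight samples).
    Slicing directions are drawn uniformly at random, not learned.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit vectors: one slicing direction per row.
    theta = rng.normal(size=(n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both clouds onto every direction, then sort each 1D projection.
    # For equal-size samples, the optimal 1D transport plan matches sorted order.
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    # Average the 1D transport cost over samples and slices.
    return float(np.mean(np.abs(xp - yp) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
cloud = rng.normal(size=(200, 5))
shift = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
sw_same = sliced_wasserstein(cloud, cloud)
sw_shifted = sliced_wasserstein(cloud, cloud + shift)
```

Because directions are sampled at random, many slices may carry little information about how the two distributions differ, which is the inefficiency the abstract's constrained learning of slicing directions is aimed at.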
Authors: Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Enyan Dai
Abstract: Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain‑specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce Protap, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain‑specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme‑catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain‑specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Though large‑scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine‑tuning can match or even outperform protein language models pretrained on large‑scale sequence corpora. (iii) Domain‑specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust‑App‑AI‑Lab/protap.
Authors: Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, Tianlong Chen
Abstract: Image mass cytometry (IMC) enables high‑dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single‑cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell‑cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single‑cell expression and spatial information into natural language using a multi‑sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi‑sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi‑task learning, Spatial2Sentence outperforms existing single‑cell LLMs on preprocessed IMC datasets, improving cell‑type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: https://github.com/UNITES‑Lab/Spatial2Sentence.
Authors: Zheng Gong, Ziyi Jiang, Weihao Gao, Deng Zhuo, Lan Ma
Abstract: The mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi‑objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce RNop, a novel deep learning‑based method for mRNA optimization. We collect a large‑scale dataset containing over 3 million sequences and design four specialized loss functions, the GPLoss, CAILoss, tAILoss, and MFELoss, which simultaneously enable explicit control over sequence fidelity while optimizing species‑specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. Then, we demonstrate RNop's effectiveness through extensive in silico and in vivo experiments. RNop ensures high sequence fidelity, achieves significant computational throughput up to 47.32 sequences/s, and yields optimized mRNA sequences resulting in a significant increase in protein expression for functional proteins compared to controls. RNop surpasses current methodologies in both quantitative metrics and experimental validation, enlightening a new dawn for efficient and effective mRNA design. Code and models will be available at https://github.com/HudenJear/RPLoss.
Authors: Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang
Abstract: DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, leading to potential concern of generating harmful sequences such as pathogens or toxin‑producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high‑homology, non‑pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log‑probability heuristics to steer generation toward pathogen‑like sequences, and (3) a BLAST‑based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2‑40B). Further case studies on SARS‑CoV‑2 spike protein and HIV‑1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS‑CoV‑2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual‑use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.
Authors: Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Abstract: Existing PLMs generate protein sequences based on a single‑condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP‑Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP‑Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation‑Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue‑Controlled Functional Encoding (RCFE) module captures residue‑wise interaction to ensure more precise control. Additionally, off‑the‑shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP‑Gen enables high‑throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.
Authors: Pawan Neupane, Jian Liu, Jianlin Cheng
Abstract: Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning‑based EMA methods is the lack of large, diverse, and well‑annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising four large‑scale, labeled datasets generated during the 15th and 16th community‑wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16). PSBench includes over one million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench's utility, we trained and evaluated GATE, a graph transformer‑based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top‑performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: https://github.com/BioinfoMachineLearning/PSBench.
Authors: Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
Abstract: We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user‑provided images or reasoning solely through text‑based chain‑of‑thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self‑critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi‑object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open‑source suite at https://github.com/GAIR‑NLP/thinking‑with‑generated‑images.
Authors: Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
Abstract: The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein‑related predictions, from sequence‑level properties and residue‑specific attributes to complex inter‑protein interactions, into a standardized next‑token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre‑trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi‑task learning, enabling general‑purpose decoders to generalize across five distinct categories. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's predictive power in different types of protein‑prediction tasks. In 3D structure prediction, Prot2Token delivers substantial speedups (up to 1000x faster than AlphaFold2 with MSA on the same hardware) while, across numerous other tasks, matching or surpassing specialized methods. Beyond that, we introduce an auxiliary self‑supervised decoder pre‑training approach to improve spatially sensitive task performance. Prot2Token thus offers a step towards standardizing biological prediction into a generative interface, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
Authors: Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
Abstract: In recent years, protein‑text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein‑related knowledge into large language models through continued pretraining and multi‑modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text‑based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval‑enhanced method, which significantly outperforms fine‑tuned LLMs for protein‑to‑text generation and shows accuracy and efficiency in training‑free scenarios. Our code and data can be seen at https://github.com/IDEA‑XL/RAPM.
Authors: Hongshu Guo, Zeyuan Ma, Yining Ma, Xinglin Zhang, Wei-Neng Chen, Yue-Jiao Gong
Abstract: Designing effective black‑box optimizers is hampered by limited problem‑specific knowledge and manual control that spans months for almost every detail. In this paper, we present DesignX, the first automated algorithm design framework that generates an effective optimizer specific to a given black‑box optimization problem within seconds. Rooted in first principles, we identify two key sub‑tasks: 1) algorithm structure generation and 2) hyperparameter control. To enable systematic construction, a comprehensive modular algorithmic space is first built, embracing hundreds of algorithm components collected from decades of research. We then introduce a dual‑agent reinforcement learning system that collaborates on structural and parametric design through a novel cooperative training objective, enabling large‑scale meta‑training across 10k diverse instances. Remarkably, through days of autonomous learning, the DesignX‑generated optimizers continuously surpass human‑crafted optimizers by orders of magnitude, both on synthetic testbeds and in realistic optimization scenarios such as Protein‑docking, AutoML and UAV path planning. Further in‑depth analysis reveals DesignX's capability to discover non‑trivial algorithm patterns beyond expert intuition, which, conversely, provides valuable design insights for the optimization community. We provide DesignX's Python project at https://github.com/MetaEvo/DesignX.
Authors: Yuning Shen, Lihao Wang, Huizhuo Yuan, Yan Wang, Bangji Yang, Quanquan Gu
Abstract: Understanding protein dynamics is critical for elucidating their biological functions. The increasing availability of molecular dynamics (MD) data enables the training of deep generative models to efficiently explore the conformational space of proteins. However, existing approaches either fail to explicitly capture the temporal dependencies between conformations or do not support direct generation of time‑independent samples. To address these limitations, we introduce ConfRover, an autoregressive model that simultaneously learns protein conformation and dynamics from MD trajectories, supporting both time‑dependent and time‑independent sampling. At the core of our model is a modular architecture comprising: (i) an encoding layer, adapted from protein folding models, that embeds protein‑specific information and conformation at each time frame into a latent space; (ii) a temporal module, a sequence model that captures conformational dynamics across frames; and (iii) an SE(3) diffusion model as the structure decoder, generating conformations in continuous space. Experiments on ATLAS, a large‑scale protein MD dataset of diverse structures, demonstrate the effectiveness of our model in learning conformational dynamics and supporting a wide range of downstream tasks. ConfRover is the first model to sample both protein conformations and trajectories within a single framework, offering a novel and flexible approach for learning from protein MD data. Project website: https://bytedance‑seed.github.io/ConfRover.
Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long
Abstract: Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query‑key scores, has been widely deployed in vision, language, protein‑folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory‑compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low‑rank compressed sensing theory, which can provide fast‑exact computation for many widely used attention biases and a fast‑accurate approximation for biases in general formalizations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5× speedup for Pairformer in AlphaFold 3, and over 2× speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
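The low‑rank idea in the abstract above can be illustrated with a small NumPy sketch. It is not the FlashBias implementation; it only checks the algebraic identity that, assuming the bias factors as B = U Vᵀ, the biased scores QKᵀ/√d + B equal one product of augmented queries and keys, so a bias‑free attention kernel could in principle consume them. All names here (`biased_attention`, `folded_attention`, `U`, `V`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def biased_attention(Q, K, Vals, B):
    # Reference: materializes the full (n x n) bias matrix B.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ Vals

def folded_attention(Q, K, Vals, U, V):
    # Assumption: B = U @ V.T with small rank r. Concatenating the factors
    # onto scaled queries and keys reproduces Q K^T / sqrt(d) + U V^T
    # in a single matrix product, with no explicit bias term.
    d = Q.shape[-1]
    Q_aug = np.concatenate([Q / np.sqrt(d), U], axis=-1)
    K_aug = np.concatenate([K, V], axis=-1)
    return softmax(Q_aug @ K_aug.T) @ Vals

rng = np.random.default_rng(0)
n, d, r = 6, 4, 2                      # toy sizes, chosen for illustration
Q, K, Vals = rng.normal(size=(3, n, d))
U, V = rng.normal(size=(2, n, r))

out_ref = biased_attention(Q, K, Vals, U @ V.T)
out_fold = folded_attention(Q, K, Vals, U, V)
print(np.allclose(out_ref, out_fold))
```

The folded form stores n·r extra numbers per side instead of an n×n bias, which is where a saving could come from when r is much smaller than n.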
Authors: Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, Bingxin Zhou
Abstract: Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large‑scale benchmark for fine‑grained functional annotation and function‑based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue‑level binary classification, fragment‑level multi‑class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open‑source databases such as InterPro, BioLiP, and SAbDab. By providing mixed‑family and cross‑family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in‑distribution and out‑of‑distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open‑source models, including pre‑trained protein language models, sequence‑structure hybrids, structure‑based methods, and alignment‑based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.
Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
Abstract: Invariant Point Attention (IPA) is a key algorithm for geometry‑aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware‑efficient FlashAttention to achieve linear scaling in GPU memory and wall‑clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re‑training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.
Authors: Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Minjie Shen, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu
Abstract: Existing protein‑ligand docking studies typically focus on the self‑docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open‑source benchmark to evaluate both self‑docking and cross‑docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, we curated a novel dataset comprising 718 entries for self‑docking and 1,312 entries for cross‑docking; second, we incorporated 23 docking methods in three methodological categories, including physics‑based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co‑folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post‑processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real‑time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics‑based methods in overall docking success rate. (2) Most intra‑ and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics‑based post‑processing could achieve excellent performance. (3) AI co‑folding methods exhibit ligand chirality issues, except for Boltz‑1x, which introduced physics‑inspired potentials to fix hallucinations, suggesting modeling on stereochemistry improves the structural plausibility markedly. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co‑folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.
Authors: Cameron C. W. McAllister, Lucas S. P. Rudden, Elizabeth H. C. Bromley, Matteo T. Degiacomi
Abstract: The density of a protein molecule is a key property within a variety of experimental techniques. We present a computational method for determining protein mass density that explicitly incorporates hydration effects. Our approach uses molecular dynamics simulations to quantify the volume of solvent excluded by a protein. Applied to a dataset of 260 soluble proteins, this yields an average density of 1.296 g cm⁻³, notably lower than the widely cited value of 1.35 g cm⁻³. Contrary to previous suggestions, we find no correlation between protein density and molecular weight. We instead find correlations with residue composition, particularly with hydrophobic amino acid content. Using these correlations, we train a regressor capable of accurately predicting protein density from sequence‑derived features alone. Examining the effect of incorporating water molecules on the measured density, we find that water molecules buried in internal cavities have a negligible effect, whereas those at the surface have a profound impact. Furthermore, by calculating the density of a titin domain and of the Bovine Pancreatic Trypsin over molecular dynamics trajectories, we show that individual proteins can occupy states with close but distinguishable densities. Finally, we analyse the density of water in the vicinity of proteins, showing that the first two hydration shells exhibit higher density than bulk water. When included in cumulative density calculations, these hydration layers contribute to a net increase in local solvent density. Overall, we find that proteins are less dense than previously reported, which is offset by their ability to induce a higher density of water in their vicinity.
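As a unit check on the densities quoted above, the following sketch converts between a molecular mass in daltons, an excluded volume in Å³, and a density in g cm⁻³. It is only the unit arithmetic, not the paper's simulation‑based volume calculation; the function names are hypothetical.

```python
DALTON_IN_GRAMS = 1.66053906660e-24   # 1 Da expressed in grams
ANGSTROM3_IN_CM3 = 1e-24              # 1 cubic angstrom expressed in cm^3

def density_g_per_cm3(mass_da, volume_a3):
    # Mass density from molecular mass (Da) and excluded volume (A^3).
    return mass_da * DALTON_IN_GRAMS / (volume_a3 * ANGSTROM3_IN_CM3)

def volume_per_dalton(density):
    # Volume (A^3) occupied per dalton at a given density (g cm^-3).
    return DALTON_IN_GRAMS / (density * ANGSTROM3_IN_CM3)

# Specific volumes implied by the two densities mentioned in the abstract.
v_old = volume_per_dalton(1.35)    # roughly 1.23 A^3 per Da
v_new = volume_per_dalton(1.296)   # roughly 1.28 A^3 per Da
print(round(v_old, 3), round(v_new, 3))
```

The lower density corresponds to about 4% more volume per unit mass.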
Authors: Ruizhe Chen, Dongyu Xue, Xiangxin Zhou, Zaixiang Zheng, Xiangxiang Zeng, Quanquan Gu
Abstract: Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Research on single‑chain protein modeling has been extensively and deeply explored, with advancements seen in models like the series of ESM and AlphaFold2. Despite these developments, the study and modeling of multi‑chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All‑Atom Protein Generative Model), a model specifically designed for modeling multi‑chain proteins. By integrating atom‑level information and leveraging data on multi‑chain proteins, APM is capable of precisely modeling inter‑chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse‑folding tasks for multi‑chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine‑tuning (SFT) while also supporting zero‑shot sampling in certain tasks, achieving state‑of‑the‑art results. We released our code at https://github.com/bytedance/apm.
Authors: Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
Abstract: Multimodal protein language models (PLMs) integrate sequence and token‑based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine‑grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure‑aware architectures and representation learning, and data exploration. Our advancements approach finer‑grained supervision, demonstrating that token‑based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity, and notably, folding abilities of our 650M model by reducing the RMSD from 5.52 to 2.36 on PDB testset, even outperforming 3B baselines and on par with the specialized folding models. Project page and code: https://bytedance.github.io/dplm/dplm‑2.1/.
Authors: Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales
Abstract: Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision‑Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state‑of‑the‑art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert‑labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert‑Weighted Recall (EWR), that accounts for the inter‑annotator variability. Results show that closed‑source models outperform open‑source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine‑grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.
Authors: Guido Barducci, Ivan Rossi, Francesco Codicè, Cesare Rollo, Valeria Repetto, Corrado Pancotti, Virginia Iannibelli, Tiziana Sanavia, Piero Fariselli
Abstract: Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease‑related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM‑derived embeddings and a bidirectional cross‑attention transformer architecture to predict ΔΔG of single and multiple‑residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self‑attention, JanusDDG computes queries (Q) and values (V) as the difference between wild‑type and mutant embeddings, while keys (K) alternate between the two. This cross‑interleaved attention mechanism enables the model to capture mutation‑induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state‑of‑the‑art performance in predicting ΔΔG from sequence alone, matching or exceeding the accuracy of structure‑based methods for both single and multiple mutations. Code availability: https://github.com/compbiomed‑unito/JanusDDG.
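To make the two thermodynamic constraints concrete, here is a toy sketch. It is not the JanusDDG architecture: it assumes a hypothetical scalar function `h` over embeddings and defines the prediction as a difference of `h` values, which is one simple construction that is antisymmetric (swapping wild type and mutant flips the sign) and transitive (A→B plus B→C equals A→C) by design.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)          # hypothetical weights of a toy scorer

def h(embedding):
    # Hypothetical per-sequence score; any scalar function would do.
    return float(np.tanh(embedding) @ w)

def ddg(e_from, e_to):
    # Toy prediction for the change from one sequence to another,
    # built as a difference so both constraints hold by construction.
    return h(e_to) - h(e_from)

# Three random stand-ins for sequence embeddings.
A, B, C = rng.normal(size=(3, 16))

antisymmetry_gap = ddg(A, B) + ddg(B, A)               # should be 0
transitivity_gap = ddg(A, B) + ddg(B, C) - ddg(A, C)   # should be 0
print(abs(antisymmetry_gap) < 1e-9, abs(transitivity_gap) < 1e-9)
```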
Authors: Yunsoo Kim, Michal W. S. Ong, Daniel W. Rogalsky, Manuel Rodriguez-Justo, Honghan Wu, Adam P. Levine
Abstract: Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC‑LLMiner, for extracting IHC‑tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC‑tumour profile extraction on relevant included abstracts. The best‑performing model, "Gemma‑2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4‑O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma‑2 finetuned model. For IHC‑tumour profile extraction, the Gemma‑2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC‑tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC‑tumour profile landscape analysis. The extracted IHC‑tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC‑tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large‑scale IHC‑tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer‑specific knowledge base development. Models and training datasets are available at https://github.com/knowlab/IHC‑LLMiner.
Authors: Beibei Wang, Boyue Cui, Shiqu Chen, Xuan Wang, Yadong Wang, Junyi Li
Abstract: Motivation: In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high‑precision protein structures predicted by AlphaFold2. While single‑species protein function prediction methods have achieved remarkable success, multi‑species protein function prediction methods are still in the stage of using PPI networks and sequence features. Providing effective cross‑species label propagation for species with sparse protein annotations remains a challenging issue. To address this problem, we propose the MSNGO model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi‑species protein function prediction. Results: We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein‑level structural features. After incorporating the sequence features from ESM‑2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi‑species protein function prediction methods that rely on sequence features and PPI networks. Availability: https://github.com/blingbell/MSNGO.
Authors: Yizhen Luo, Jiashuo Wang, Siqi Fan, Zaiqing Nie
Abstract: Structural biology relies on accurate three‑dimensional biomolecular structures to advance our understanding of biological functions, disease mechanisms, and therapeutics. While recent advances in deep learning have enabled the development of all‑atom foundation models for molecular modeling and generation, existing approaches face challenges in generalization due to the multi‑modal nature of atomic data and the lack of comprehensive analysis of training and sampling strategies. To address these limitations, we propose PharMolixFM, a unified framework for constructing all‑atom foundation models based on multi‑modal generative techniques. Our framework includes three variants using state‑of‑the‑art multi‑modal generative models. By formulating molecular tasks as a generalized denoising process with task‑specific priors, PharMolixFM achieves robust performance across various structural biology applications. Experimental results demonstrate that PharMolixFM‑Diff achieves competitive prediction accuracy in protein‑small‑molecule docking (83.9% vs. 90.2% RMSD < 2Å, given pocket) with significantly improved inference speed. Moreover, we explore the empirical inference scaling law by introducing more sampling repeats or steps. Our code and model are available at https://github.com/PharMolix/OpenBioMed.
Authors: Changjian Zhou, Yuexi Qiu, Jia Song
Abstract: AI‑assisted protein design has emerged as a critical tool for advancing biotechnology, as deep generative models have demonstrated their reliability in this domain. However, most existing models primarily utilize protein sequence or structural data for training, neglecting the physicochemical properties of proteins. Moreover, they offer limited control over protein generation under intuitive conditions. To address these limitations, we propose CMADiff, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text‑based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) to integrate physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process, which is guided by BioAligner, a contrastive learning‑based module that aligns text descriptions with protein features, enabling text‑driven control over protein sequence generation. Validated by a series of evaluations including AlphaFold3, the experimental results indicate that CMADiff outperforms protein sequence generation benchmarks and holds strong potential for future applications. The implementation and code are available at https://github.com/HPC‑NEAU/PhysChemDiff.
Authors: Chenwei Zhang, Khanh Dao Duc
Abstract: Enhancing cryogenic electron microscopy (cryo‑EM) 3D density maps at intermediate resolution (4‑8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo‑EM density maps. Yet, these methods are not optimized for intermediate‑resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo‑EM density maps of protein structures using structure‑aware multimodal U‑Nets and trained on curated intermediate‑resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state‑of‑the‑art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at https://github.com/chenwei‑zhang/CryoSAMU.
Authors: Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu, Guisheng Fan, Liang Hong, Bingxin Zhou
Abstract: Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre‑trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine‑tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command‑line execution and a Gradio‑based no‑code interface, integrating 40+ protein‑related datasets and 40+ popular PLMs. All implementations are open‑sourced on https://github.com/tyang816/VenusFactory.
Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
Abstract: Scientific problem‑solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long‑Context Understanding, Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem‑solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines ‑ materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins ‑ covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE, which require domain expertise, comprehension of long in‑context information, and multi‑step reasoning. While Gemini Flash 2.0 and Claude‑3 show consistent high comprehension across domains, the popular GPT‑4o and command‑R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are available at https://github.com/google/curie.
Authors: Nithin Parsan, David J. Yang, John J. Yang
Abstract: Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high‑dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2‑3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2‑3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open‑source our code, dataset, and pretrained models (https://github.com/johnyang101/reticular‑sae) and visualizer (https://sae.reticular.ai).
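The phrase "forcing nested groups of latents to reconstruct inputs independently" can be sketched as a loss that sums reconstruction errors over growing prefixes of the latent vector. This is a minimal illustration, not the authors' training code: the ReLU encoder, mean squared error, prefix sizes, and the omission of any sparsity penalty are all assumptions, and every name is hypothetical.

```python
import numpy as np

def matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, group_sizes):
    # Encode once with a ReLU encoder: latents have shape (batch, n_latents).
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    losses = []
    for m in group_sizes:
        # Each nested group is the first m latents; it must reconstruct
        # the input on its own, using only its rows of the decoder.
        x_hat = z[:, :m] @ W_dec[:m] + b_dec
        losses.append(float(np.mean((x - x_hat) ** 2)))
    return sum(losses), losses

rng = np.random.default_rng(0)
d_model, n_latents = 8, 16            # toy sizes for illustration only
x = rng.normal(size=(32, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
b_dec = np.zeros(d_model)

total, per_prefix = matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, [4, 8, 16])
print(len(per_prefix), total >= 0.0)
```

Because early latents appear in every prefix, they are pushed to carry the most broadly useful features, which is one way to read the "hierarchically organized" claim.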
Authors: Nicolas Wolf, Leif Seute, Vsevolod Viliuga, Simon Wagner, Jan Stühmer, Frauke Gräter
Abstract: Deep generative models have recently been proposed for sampling protein conformations from the Boltzmann distribution, as an alternative to often prohibitively expensive Molecular Dynamics simulations. However, current state‑of‑the‑art approaches rely on fine‑tuning pre‑trained folding models and evolutionary sequence information, limiting their applicability and efficiency, and introducing potential biases. In this work, we propose a flow matching model for sampling protein conformations based solely on backbone geometry ‑ BBFlow. We introduce a geometric encoding of the backbone equilibrium structure as input and propose to condition not only the flow but also the prior distribution on the respective equilibrium structure, eliminating the need for evolutionary information. The resulting model is orders of magnitude faster than current state‑of‑the‑art approaches at comparable accuracy, is transferable to multi‑chain proteins, and can be trained from scratch in a few GPU days. In our experiments, we demonstrate that the proposed model achieves competitive performance with reduced inference time, across not only an established benchmark of naturally occurring proteins but also de novo proteins, for which evolutionary information is scarce or absent. BBFlow is available at https://github.com/graeter‑group/bbflow.
Authors: Jiang Li, Xiaoping Wang
Abstract: Protein‑protein interaction (PPI) prediction is an instrumental means in elucidating the mechanisms underlying cellular operations, holding significant practical implications for the realms of pharmaceutical development and clinical treatment. Presently, the majority of research methods primarily concentrate on the analysis of amino acid sequences, while investigations predicated on protein structures remain in the nascent stages of exploration. Despite the emergence of several structure‑based algorithms in recent years, these are still confronted with inherent challenges: (1) the extraction of intrinsic structural information of proteins typically necessitates the expenditure of substantial computational resources; (2) these models are overly reliant on seen protein data, struggling to effectively unearth interaction cues between unknown proteins. To further propel advancements in this domain, this paper introduces a novel PPI prediction method jointing masked reconstruction and contrastive learning, termed JmcPPI. This methodology dissects the PPI prediction task into two distinct phases: during the residue structure encoding phase, JmcPPI devises two feature reconstruction tasks and employs graph attention mechanism to capture structural information between residues; during the protein interaction inference phase, JmcPPI perturbs the original PPI graph and employs a multi‑graph contrastive learning strategy to thoroughly mine extrinsic interaction information of novel proteins. Extensive experiments conducted on three widely utilized PPI datasets demonstrate that JmcPPI surpasses existing optimal baseline models across various data partition schemes. The associated code can be accessed via https://github.com/lijfrank‑open/JmcPPI.
Authors: Xinyu Yuan, Zichen Wang, Marcus Collins, Huzefa Rangwala
Abstract: Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine‑grained local substructures rather than global structures, as typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under‑utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively. Source code and model weights are available at https://github.com/KatarinaYuan/StructTokenBench
Authors: Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Abstract: Protein‑specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state‑of‑the‑art Protein LLMs, analyze how they leverage large‑scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia‑Xiao/Protein‑LLM‑Survey.
Authors: Sayedmohammadreza Rastegari, Sina Tabakhi, Xianyuan Liu, Tianyi Jiang, Wei Sang, Haiping Lu
Abstract: Understanding protein‑metal interactions is central to structural biology, with metal ions being vital for catalysis, stability, and signal transduction. Predicting metal‑binding residues and metal types remains challenging due to the structural and evolutionary complexity of proteins. Conventional sequence‑ and structure‑based methods often fail to capture co‑evolutionary constraints that reflect how residues evolve together to maintain metal‑binding functionality. Recent co‑evolution‑based methods capture part of this information, but still underutilize the complete co‑evolved residue network. To address this limitation, we introduce the Metal‑Binding Graph Neural Network (MBGNN), which leverages the complete co‑evolved residue network to better capture complex dependencies within protein structures. Experimental results show that MBGNN substantially outperforms the state‑of‑the‑art co‑evolution‑based method MetalNet2, achieving F1 score improvements of 2.5% for binding residue identification and 3.3% for metal type classification on the MetalNet2 dataset. Its superiority is further demonstrated on both the MetalNet2 and MIonSite datasets, where it outperforms two co‑evolution‑based and two sequence‑based methods, achieving the highest mean F1 scores across both prediction tasks. These findings highlight how integrating co‑evolutionary residue networks with graph‑based learning advances our ability to decode protein‑metal interactions, thereby facilitating functional annotation and rational metalloprotein design. The code and data are released at https://github.com/SRastegari/MBGNN.
Authors: Jixiu Zhai, Zikun Wang, Chupei Tang, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang, Tianchi Lu
Abstract: Accurate identification of bioactive peptides (BPs) and protein post‑translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer‑CNN architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses including dimensionality reduction and comparison studies, PDeepPP demonstrates strong, interpretable peptide representations and achieves state‑of‑the‑art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large‑scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP)
Authors: Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani
Abstract: To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward‑guided generation have been recently proposed due to their significance, current approaches predominantly focus on single‑shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference‑time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward‑guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. We also provide a theoretical guarantee for our framework. Finally, we demonstrate its superior empirical performance in protein and cell‑type‑specific regulatory DNA design. The code is available at https://github.com/masa‑ue/ProDifEvo‑Refinement.
Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han
Abstract: Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression‑based multi‑task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208×) compared to existing state‑of‑the‑art methods. Our code is released at https://github.com/tmlr‑group/FABFlex.
Authors: Angxiao Yue, Zichong Wang, Hongteng Xu
Abstract: Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion and flow‑based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high‑quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, which represents each 3D rotation as a unit quaternion and constructs its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves on‑par performance in protein backbone generation while requiring much fewer sampling steps and significantly less inference time (e.g., being 37x faster than RFDiffusion and 63x faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency. The code is available at https://github.com/AngxiaoYue/ReQFlow.
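The spherical linear interpolation (SLERP) between unit quaternions mentioned in this abstract can be illustrated with a minimal textbook sketch. It is not the ReQFlow implementation and does not reproduce the "exponential format" the authors describe; the function name `slerp` and the (w, x, y, z) component ordering are assumptions made for illustration.

```python
import math


def slerp(q0, q1, t):
    """Textbook SLERP between unit quaternions q0 and q1, given as (w, x, y, z).

    Illustrative only: the name and component order are assumptions,
    not taken from the ReQFlow code base.
    """
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip q1 to follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(dot, 1.0)  # guard acos against rounding slightly above 1
    theta = math.acos(dot)
    if theta < 1e-8:
        # Nearly identical rotations: fall back to linear interpolation.
        q = tuple((1.0 - t) * a + t * b for a, b in zip(q0, q1))
    else:
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
        q = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    # Renormalise so the result stays a unit quaternion.
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)
```

For example, interpolating halfway between the identity rotation and a 90° rotation about the z axis should give a 45° rotation about z, i.e. a quaternion whose w and z components are cos(22.5°) and sin(22.5°).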
Authors: Olga Zaghen, Floor Eijkelboom, Alison Pouplin, Cong Liu, Max Welling, Jan-Willem van de Meent, Erik J. Bekkers
Abstract: We present Riemannian Gaussian Variational Flow Matching (RG‑VFM), a geometric extension of Variational Flow Matching (VFM) for generative modeling on manifolds. Motivated by the benefits of VFM, we derive a variational flow matching objective for manifolds with closed‑form geodesics based on Riemannian Gaussian distributions. Crucially, in Euclidean space, predicting endpoints (VFM), velocities (FM), or noise (diffusion) is largely equivalent due to affine interpolations. However, on curved manifolds this equivalence breaks down. We formally analyze the relationship between our model and Riemannian Flow Matching (RFM), revealing that the RFM objective lacks a curvature‑dependent penalty ‑‑ encoded via Jacobi fields ‑‑ that is naturally present in RG‑VFM. Based on this relationship, we hypothesize that endpoint prediction provides a stronger learning signal by directly minimizing geodesic distances. Experiments on synthetic spherical and hyperbolic benchmarks, as well as real‑world tasks in material and protein generation, demonstrate that RG‑VFM more effectively captures manifold structure and improves downstream performance over Euclidean and velocity‑based baselines. Code available at https://github.com/olgatticus/rg‑vfm.
Authors: Hikaru Asano, Tadashi Kozuno, Yukino Baba
Abstract: Recent advances in large language models (LLMs) have yielded impressive performance on various tasks, yet they often depend on high‑quality feedback that can be costly. Self‑refinement methods attempt to leverage LLMs' internal evaluation mechanisms with minimal human supervision; however, these approaches frequently suffer from inherent biases and overconfidence, especially in domains where the models lack sufficient internal knowledge, resulting in performance degradation. As an initial step toward enhancing self‑refinement for broader applications, we introduce an iterative refinement pipeline that employs the Unlabeled‑Unlabeled learning framework to improve LLM‑generated pseudo‑labels for classification tasks. By exploiting two unlabeled datasets with differing positive class ratios, our approach iteratively denoises and refines the initial pseudo‑labels, thereby mitigating the adverse effects of internal biases with minimal human supervision. Evaluations on diverse datasets, including low‑resource language corpora, patent classifications, and protein structure categorizations, demonstrate that our method consistently outperforms both the initial LLM's classification performance and the self‑refinement approaches by cutting‑edge models (e.g., GPT‑4o and DeepSeek‑R1). Moreover, we experimentally confirm that our refined classifier facilitates effective post‑training alignment for safety in LLMs and demonstrate successful self‑refinement in generative tasks as well. Our code is available at https://github.com/HikaruAsano/self‑iterative‑label‑refinement.
Authors: Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu
Abstract: Virus‑like particles (VLPs) are valuable for vaccine development due to their immune‑triggering properties. Understanding their stoichiometry, the number of protein subunits to form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time‑consuming and require highly purified proteins. To efficiently classify stoichiometry classes in proteins, we curate a new dataset and propose an interpretable, data‑driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef‑AIRE/StoicIML.
Authors: Vinh Tong, Hoang Trung-Dung, Anji Liu, Guy Van den Broeck, Mathias Niepert
Abstract: In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. Two main strategies have emerged for learning invariant distributions: designing equivariant network architectures and using data augmentation to approximate equivariance. While equivariant architectures preserve symmetry by design, they often involve greater complexity and pose optimization challenges. Data augmentation, on the other hand, offers flexibility but may fall short in fully capturing symmetries. Our framework enhances both approaches by reducing training variance and providing a provably lower‑variance gradient estimator. We achieve this by interpreting data augmentation as a Monte Carlo estimator of the training gradient and applying Rao‑Blackwellization. This leads to more stable optimization, faster convergence, and reduced variance, all while requiring only a single forward and backward pass per sample. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion. Theoretically, we guarantee that our loss admits equivariant minimizers. Empirically, Orbit Diffusion achieves state‑of‑the‑art results on GEOM‑QM9 for molecular conformation generation, improves crystal structure prediction, and advances text‑guided crystal generation on the Perov‑5 and MP‑20 benchmarks. Additionally, it enhances protein designability in protein structure generation. Code is available at: https://github.com/vinhsuhi/Orbit‑Diffusion.git.
Authors: Jingjie Zhang, Hanqun Cao, Zijun Gao, Xiaorui Wang, Chunbin Gu
Abstract: Phosphorylation site prediction based on kinase‑substrate interaction plays a vital role in understanding cellular signaling pathways and disease mechanisms. Computational methods for this task can be categorized into kinase‑family‑focused and individual kinase‑targeted approaches. Individual kinase‑targeted methods have gained prominence for their ability to explore a broader protein space and provide more precise target information for kinase inhibitors. However, most existing individual kinase‑based approaches focus solely on sequence inputs, neglecting crucial structural information. To address this limitation, we introduce SAGEPhos (Structure‑aware kinAse‑substrate bio‑coupled and bio‑auGmented nEtwork for Phosphorylation site prediction), a novel framework that modifies the semantic space of main protein inputs using auxiliary inputs at two distinct modality levels. At the inter‑modality level, SAGEPhos introduces a Bio‑Coupled Modal Fusion method, distilling essential kinase sequence information to refine task‑oriented local substrate feature space, creating a shared semantic space that captures crucial kinase‑substrate interaction patterns. Within the substrate's intra‑modality domain, it focuses on Bio‑Augmented Fusion, emphasizing 2D local sequence information while selectively incorporating 3D spatial information from predicted structures to complement the sequence space. Moreover, to address the lack of structural information in current datasets, we contribute a new, refined phosphorylation site prediction dataset, which incorporates crucial structural elements and will serve as a new benchmark for the field. Experimental results demonstrate that SAGEPhos significantly outperforms baseline methods. We release the SAGEPhos models and code at https://github.com/ZhangJJ26/SAGEPhos.
Authors: Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang
Abstract: The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long‑context DNA modeling, with a context length of 98k nucleotides, pre‑trained on 386 billion nucleotides of eukaryotic DNA. Without task‑specific fine‑tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state‑of‑the‑art models with substantially improved computational efficiency. In a zero‑shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment‑based methods, while remaining fully alignment‑free and broadly applicable across species. With task‑specific fine‑tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein‑coding DNA sequences that translate into structurally plausible proteins and, through a prompt‑guided design framework, design cis‑regulatory elements with targeted activity profiles, including synthetic super‑enhancers validated by high‑throughput UMI‑STARR‑seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.
Authors: Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Abstract: Any well‑behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous‑time flow‑based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})\,r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints, the posterior in noise space is smoother than in data space, making it more suitable for amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow‑based priors, comparing favorably with other inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
Authors: Chenao Li, Shuo Yan, Enyan Dai
Abstract: Enzyme‑catalyzed protein cleavage is essential for many biological functions. Accurate prediction of cleavage sites can facilitate various applications such as drug development, enzyme design, and a deeper understanding of biological mechanisms. However, most existing models are restricted to an individual enzyme, which neglects knowledge shared across enzymes and fails to generalize to novel enzymes. Thus, we introduce a unified protein cleavage site predictor named UniZyme, which can generalize across diverse enzymes. To enhance the enzyme encoding for protein cleavage site prediction, UniZyme employs a novel biochemically‑informed model architecture along with active‑site knowledge of proteolytic enzymes. Extensive experiments demonstrate that UniZyme achieves high accuracy in predicting cleavage sites across a range of proteolytic enzymes, including unseen enzymes. The code is available at https://github.com/Ao‑LiChen/UniZyme.
Authors: Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
Abstract: A recent line of research has exploited pre‑trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear‑Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
Authors: Xiuyuan Hu, Guoqing Liu, Can Chen, Yang Zhao, Hao Zhang, Xue Liu
Abstract: Structure‑based drug discovery, encompassing the tasks of protein‑ligand docking and pocket‑aware 3D drug design, represents a core challenge in drug discovery. However, no existing work can deal with both tasks to effectively leverage the duality between them, and current methods for each task are hindered by challenges in modeling 3D information and the limitations of available data. To address these issues, we propose 3DMolFormer, a unified dual‑channel transformer‑based framework applicable to both docking and 3D drug design tasks, which exploits their duality by utilizing docking functionalities within the drug design process. Specifically, we represent 3D pocket‑ligand complexes using parallel sequences of discrete tokens and continuous numbers, and we design a corresponding dual‑channel transformer model to handle this format, thereby overcoming the challenges of 3D information modeling. Additionally, we alleviate data limitations through large‑scale pre‑training on a mixed dataset, followed by supervised and reinforcement learning fine‑tuning techniques respectively tailored for the two tasks. Experimental results demonstrate that 3DMolFormer outperforms previous approaches in both protein‑ligand docking and pocket‑aware 3D drug design, highlighting its promising application in structure‑based drug discovery. The code is available at: https://github.com/HXYfighter/3DMolFormer .
Authors: Amitay Sicherman, Kira Radinsky
Abstract: State‑of‑the‑art models represent proteins and molecules in separate embedding manifolds, limiting the modeling of systemic biological processes. We introduce ReactEmbed, a lightweight, plug‑and‑play module that bridges this gap. ReactEmbed leverages biochemical reaction networks as a source of functional context, based on the principle that co‑participation in reactions defines a shared functional scope. The module aligns frozen embeddings from models like ESM‑3 and MolFormer into a unified space using a weighted reaction graph and a specialized sampling strategy. This process enriches unimodal embeddings and enables strong performance on cross‑domain benchmarks. ReactEmbed offers a practical method to unify biological representations without costly retraining. The code and database are available for open use at https://github.com/amitaysicherman/ReactEmbeded.
Authors: Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay
Abstract: Structure‑based drug design (SBDD) leverages the 3D structure of biomolecular targets to guide the creation of new therapeutic agents. Recent advances in generative models, including diffusion models and geometric deep learning, have demonstrated promise in optimizing ligand generation. However, the scarcity of high‑quality protein‑ligand complex data and the inherent challenges in aligning generated ligands with target proteins limit the effectiveness of these methods. We propose BoKDiff, a novel framework that enhances ligand generation by combining multi‑objective optimization and Best‑of‑K alignment methodologies. Built upon the DecompDiff model, BoKDiff generates diverse candidates and ranks them using a weighted evaluation of molecular properties such as QED, SA, and docking scores. To address alignment challenges, we introduce a method that relocates the center of mass of generated ligands to their docking poses, enabling accurate sub‑component extraction. Additionally, we integrate a Best‑of‑N (BoN) sampling approach, which selects the optimal ligand from multiple generated candidates without requiring fine‑tuning. BoN achieves exceptional results, with QED values exceeding 0.6, SA scores above 0.75, and a success rate surpassing 35%, demonstrating its efficiency and practicality. BoKDiff achieves state‑of‑the‑art results on the CrossDocked2020 dataset, including a ‑8.58 average Vina docking score and a 26% success rate in molecule generation. This study is the first to apply Best‑of‑K alignment and Best‑of‑N sampling to SBDD, highlighting their potential to bridge generative modeling with practical drug discovery requirements. The code is provided at https://github.com/khodabandeh‑ali/BoKDiff.git.
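The Best‑of‑N selection described in this abstract (generate several candidates, score each with a weighted combination of molecular properties, keep the best) can be sketched as follows. This is a minimal illustration, not the BoKDiff code: the function name `best_of_n`, the dictionary keys, the weights, and the candidate values are all hypothetical. The Vina docking score gets a negative weight on the assumption that more negative scores indicate better binding.

```python
def best_of_n(candidates, weights):
    """Return the candidate with the highest weighted property score.

    `candidates` is a list of dicts mapping property name -> value;
    `weights` maps the same names -> weight. All names are hypothetical.
    """
    def score(candidate):
        return sum(weights[name] * candidate[name] for name in weights)

    return max(candidates, key=score)


# Hypothetical property values for three generated ligands.
ligands = [
    {"id": "lig_a", "qed": 0.50, "sa": 0.70, "vina": -7.0},
    {"id": "lig_b", "qed": 0.65, "sa": 0.80, "vina": -8.5},
    {"id": "lig_c", "qed": 0.90, "sa": 0.30, "vina": -5.0},
]
# Assumed weights: a negative weight on Vina rewards lower (better) docking scores.
property_weights = {"qed": 1.0, "sa": 1.0, "vina": -0.1}
best = best_of_n(ligands, property_weights)
```

Because selection happens after generation, a scheme of this kind needs no fine‑tuning of the generative model; its cost scales with the number of candidates that are sampled and scored.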
Authors: Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, Tommaso Biancalani
Abstract: This tutorial provides an in‑depth guide on inference‑time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine‑tuning. This tutorial explores the foundational aspects of such inference‑time algorithms. We review these methods from a unified perspective, demonstrating that current techniques ‑‑ such as Sequential Monte Carlo (SMC)‑based guidance, value‑based sampling, and classifier guidance ‑‑ aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre‑trained denoising processes with value functions serving as look‑ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine‑tuning methods combined with inference‑time techniques, (2) inference‑time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference‑time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at https://github.com/masa‑ue/AlignInversePro
Authors: Kanta Masuki, Yuto Ashida
Abstract: Diffusion models represent a class of generative models that produce data by denoising a sample corrupted by white noise. Despite the success of diffusion models in computer vision, audio synthesis, and point cloud generation, they have so far overlooked the inherent multiscale structure of data and generate slowly because of the many iteration steps required. In physics, the renormalization group offers a fundamental framework for linking different scales and giving an accurate coarse‑grained model. Here we introduce a renormalization group‑based diffusion model that leverages the multiscale nature of data distributions for high‑quality data generation. In the spirit of renormalization group procedures, we define a flow equation that progressively erases data information from fine‑scale details to coarse‑grained structures. Through reversing the renormalization group flows, our model is able to generate high‑quality samples in a coarse‑to‑fine manner. We validate the versatility of the model through applications to protein structure prediction and image generation. Our model consistently outperforms conventional diffusion models across standard evaluation metrics, enhancing sample quality and/or accelerating sampling speed by an order of magnitude. The proposed method alleviates the need for data‑dependent tuning of hyperparameters in generative diffusion models, showing promise for systematically increasing sample efficiency based on the concept of the renormalization group.
Authors: Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Abstract: Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user‑specified properties remains a challenge. Recent research proposes fine‑tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman‑Kac (FK) steering, an inference‑time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high‑reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text‑to‑image and text diffusion models. For steering text‑to‑image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine‑tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient‑free control of attributes like toxicity. Our results demonstrate that inference‑time scaling and steering of diffusion models ‑ even with off‑the‑shelf rewards ‑ can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk‑Diffusion‑Steering .
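The particle resampling step described in this abstract can be illustrated with a short sketch. It assumes one simple choice of potential, the exponential of a scaled intermediate reward, together with multinomial resampling; the abstract states that several potentials and samplers are explored, so this is not necessarily the authors' configuration. The names `resample_particles` and `lam` are hypothetical.

```python
import math
import random


def resample_particles(particles, rewards, lam=1.0, rng=None):
    """Multinomial resampling with weights proportional to exp(lam * reward).

    Illustrative sketch only: the exponential potential and the parameter
    name `lam` are assumptions, not taken from the FK steering code.
    """
    rng = rng or random.Random()
    # Subtract the maximum reward before exponentiating for numerical stability.
    max_reward = max(rewards)
    weights = [math.exp(lam * (r - max_reward)) for r in rewards]
    # Draw a new population of the same size, with replacement.
    return rng.choices(particles, weights=weights, k=len(particles))
```

Particles with high intermediate rewards tend to be duplicated and those with low rewards tend to be dropped, so computation concentrates on trajectories more likely to end in a high‑reward sample without updating the model's parameters.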
Authors: Zhao Yang, Bing Su, Jiahao Chen, Ji-Rong Wen
Abstract: Predicting multiple functions labeled with Enzyme Commission (EC) numbers from the enzyme sequence is of great significance but remains a challenge due to its sparse multi‑label classification nature, i.e., each enzyme is typically associated with only a few labels out of more than 6000 possible EC numbers. However, existing machine learning algorithms generally learn a fixed global representation for each enzyme to classify all functions, so they lack interpretability and the fine‑grained information of some function‑specific local residue fragments may be overwhelmed. Here we present an attention‑based framework, namely ProtDETR (Protein Detection Transformer), by casting enzyme function prediction as a detection problem. It uses a set of learnable functional queries to adaptively extract different local representations from the sequence of residue‑level features for predicting different EC numbers. ProtDETR not only significantly outperforms existing deep learning‑based enzyme function prediction methods, but also provides a new interpretable perspective on automatically detecting different local regions for identifying different functions through cross‑attentions between queries and residue‑level features. Code is available at https://github.com/yangzhao1230/ProtDETR.
Authors: George Yuanji Wang, Srisharan Murugesan, Aditya Prince Rohatgi
Abstract: Identifying druggable genes is essential for developing effective pharmaceuticals. With the availability of extensive, high‑quality data, computational methods have become a significant asset. The Protein Interaction Network (PIN) is a valuable resource but is challenging to use because of its high dimensionality and sparsity. Previous methods relied on indirect integration, leading to resolution loss. This study proposes GAN‑TAT, a framework utilizing an advanced graph embedding technology, ImGAGN, to directly integrate the PIN for druggable gene inference. Tested on three Pharos datasets, GAN‑TAT achieved the highest AUC‑ROC score of 0.951 on Tclin. Further evaluation shows that GAN‑TAT's predictions are supported by clinical evidence, highlighting its potential practical applications in pharmacogenomics. This research is a methodological attempt at using the PIN directly, opening new avenues for developing drug targets. The source code of GAN‑TAT is available at https://github.com/george‑yuanji‑wang/GAN‑TAT.
Authors: Alex Morehead, Jianlin Cheng
Abstract: Powerful generative AI models of protein‑ligand structure have recently been proposed, but few of these methods support both flexible protein‑ligand docking and affinity estimation. Of those that do, none can directly model multiple binding ligands concurrently or have been rigorously benchmarked on pharmacologically relevant drug targets, hindering their widespread adoption in drug discovery efforts. In this work, we propose FlowDock, the first deep geometric generative model based on conditional flow matching that learns to directly map unbound (apo) structures to their bound (holo) counterparts for an arbitrary number of binding ligands. Furthermore, FlowDock provides predicted structural confidence scores and binding affinity values with each of its generated protein‑ligand complex structures, enabling fast virtual screening of new (multi‑ligand) drug targets. For the well‑known PoseBusters Benchmark dataset, FlowDock outperforms single‑sequence AlphaFold 3 with a 51% blind docking success rate using unbound (apo) protein input structures and without any information derived from multiple sequence alignments, and for the challenging new DockGen‑E dataset, FlowDock outperforms single‑sequence AlphaFold 3 and matches single‑sequence Chai‑1 for binding pocket generalization. Additionally, in the ligand category of the 16th community‑wide Critical Assessment of Techniques for Structure Prediction (CASP16), FlowDock ranked among the top‑5 methods for pharmacological binding affinity estimation across 140 protein‑ligand complexes, demonstrating the efficacy of its learned representations in virtual screening. Source code, data, and pre‑trained models are available at https://github.com/BioinfoMachineLearning/FlowDock.
Authors: Serbülent Ünsal, Sinem Özdemir, Bünyamin Kasap, M. Erşan Kalaycı, Kemal Turhan, Tunca Doğan, Aybar C. Acar
Abstract: In this study, we propose HOPER (HOlistic ProtEin Representation), a novel multimodal learning framework designed to enhance protein function prediction (PFP) in low‑data settings. The challenge of predicting protein functions is compounded by the limited availability of labeled data. Traditional machine learning models already struggle in such cases, and while deep learning models excel with abundant data, they also face difficulties when data is scarce. HOPER addresses this issue by integrating three distinct modalities ‑ protein sequences, biomedical text, and protein‑protein interaction (PPI) networks ‑ to create a comprehensive protein representation. The model utilizes autoencoders to generate holistic embeddings, which are then employed for PFP tasks using transfer learning. HOPER outperforms existing methods on a benchmark dataset across all Gene Ontology categories, i.e., molecular function, biological process, and cellular component. Additionally, we demonstrate its practical utility by identifying new immune‑escape proteins in lung adenocarcinoma, offering insights into potential therapeutic targets. Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research, potentially enabling more accurate and scalable protein function prediction. HOPER source code and datasets are available at https://github.com/kansil/HOPER
Authors: James Matthew Young, O. Deniz Akyildiz
Abstract: With the advent of diffusion models, new proteins can be generated at an unprecedented rate. The motif scaffolding problem requires steering this generative process to yield proteins with a desirable functional substructure called a motif. While models have been trained to take the motif as conditional input, recent techniques in diffusion posterior sampling can be leveraged as zero‑shot alternatives whose approximations can be corrected with sequential Monte Carlo (SMC) algorithms. In this work, we introduce a new set of guidance potentials for describing scaffolding tasks and solve them by adapting SMC‑aided diffusion posterior samplers with an unconditional model, Genie, as a prior. In single motif problems, we find that (i) the proposed potentials perform comparably to, if not better than, the conventional masking approach, (ii) samplers based on reconstruction guidance outperform their replacement method counterparts, and (iii) measurement tilted proposals and twisted targets improve performance substantially. Furthermore, as a demonstration, we provide solutions to two multi‑motif problems by pairing reconstruction guidance with an SE(3)‑invariant potential. We also produce designable internally symmetric monomers with a guidance potential for point symmetry constraints. Our code is available at: https://github.com/matsagad/mres‑project.
Authors: Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje
Abstract: Recent advances in self‑supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non‑coding DNA elements critical for regulating gene activity. In this study, we introduce DART‑Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero‑shot, probed, and fine‑tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell‑type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at https://github.com/kundajelab/DART‑Eval.
Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurélie Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang
Abstract: Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet‑lab experiments, previous methods have primarily relied on self‑supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies solely focused on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence‑structure models have so far achieved only incremental improvements when compared to the leading sequence‑only approaches, highlighting unresolved challenges effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence‑Structure‑Surface Fitness (S3F) model ‑ a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state‑of‑the‑art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at https://github.com/DeepGraphLearning/S3F.
Authors: Jiahan Li, Tong Chen, Shitong Luo, Chaoran Cheng, Jiaqi Guan, Ruihan Guo, Sheng Wang, Ge Liu, Jian Peng, Jianzhu Ma
Abstract: Peptides, short chains of amino acids, interact with target proteins, making them a unique class of protein‑based therapeutics for treating human diseases. Recently, deep generative models have shown great promise in peptide generation. However, several challenges remain in designing effective peptide binders. First, not all residues contribute equally to peptide‑target interactions. Second, the generated peptides must adopt valid geometries due to the constraints of peptide bonds. Third, realistic tasks for peptide drug development are still lacking. To address these challenges, we introduce PepHAR, a hot‑spot‑driven autoregressive generative model for designing peptides targeting specific proteins. Building on the observation that certain hot spot residues have higher interaction potentials, we first use an energy‑based density model to fit and sample these key residues. Next, to ensure proper peptide geometry, we autoregressively extend peptide fragments by estimating dihedral angles between residue frames. Finally, we apply an optimization process to iteratively refine fragment assembly, ensuring correct peptide structures. By combining hot spot sampling with fragment‑based extension, our approach enables de novo peptide design tailored to a target protein and allows the incorporation of key hot spot residues into peptide scaffolds. Extensive experiments, including peptide design and peptide scaffold generation, demonstrate the strong potential of PepHAR in computational peptide binder design. Source code will be available at https://github.com/Ced3‑han/PepHAR.
Authors: Yaowei Jin, Qi Huang, Ziyang Song, Mingyue Zheng, Dan Teng, Qian Shi
Abstract: Biological processes, functions, and properties are intricately linked to the ensemble of protein conformations, rather than being solely determined by a single stable conformation. In this study, we have developed P2DFlow, a generative model based on SE(3) flow matching, to predict the structural ensembles of proteins. We specifically designed a valuable prior for the flow process and enhanced the model's ability to distinguish each intermediate state by incorporating an additional dimension to describe the ensemble data, which can reflect the physical laws governing the distribution of ensembles, so that the prior knowledge can effectively guide the generation process. When trained and evaluated on the MD datasets of ATLAS, P2DFlow outperforms other baseline models on extensive experiments, successfully capturing the observable dynamic fluctuations as evidenced in crystal structure and MD simulations. As a potential proxy agent for protein molecular simulation, the high‑quality ensembles generated by P2DFlow could significantly aid in understanding protein functions across various scenarios. Code is available at https://github.com/BLEACH366/P2DFlow
Authors: Chenqing Hua, Jiarui Lu, Yong Liu, Odin Zhang, Jian Tang, Rex Ying, Wengong Jin, Guy Wolf, Doina Precup, Shuangjia Zheng
Abstract: The introduction of models like RFDiffusionAA, AlphaFold3, AlphaProteo, and Chai1 has revolutionized protein structure modeling and interaction prediction, primarily from a binding perspective, focusing on creating ideal lock‑and‑key models. However, these methods can fall short for enzyme‑substrate interactions, where perfect binding models are rare, and induced fit states are more common. To address this, we shift to a functional perspective for enzyme design, where the enzyme function is defined by the reaction it catalyzes. Here, we introduce GENzyme, a de novo enzyme design model that takes a catalytic reaction as input and generates the catalytic pocket, full enzyme structure, and enzyme‑substrate binding complex. GENzyme is an end‑to‑end, three‑staged model that integrates (1) a catalytic pocket generation and sequence co‑design module, (2) a pocket inpainting and enzyme inverse folding module, and (3) a binding and screening module to optimize and predict enzyme‑substrate complexes. The entire design process is driven by the catalytic reaction being targeted. This reaction‑first approach allows for more accurate and biologically relevant enzyme design, potentially surpassing structure‑based and binding‑focused models in creating enzymes capable of catalyzing specific reactions. We provide GENzyme code at https://github.com/WillHua127/GENzyme.
Authors: Boxin Zhao, Cong Ma, Mladen Kolar
Abstract: Precision matrix estimation is essential in various fields; yet it is challenging when samples for the target study are limited. Transfer learning can enhance estimation accuracy by leveraging data from related source studies. We propose Trans‑Glasso, a two‑step transfer learning method for precision matrix estimation. First, we obtain initial estimators using a multi‑task learning objective that captures shared and unique features across studies. Then, we refine these estimators through differential network estimation to adjust for structural differences between the target and source precision matrices. Under the assumption that most entries of the target precision matrix are shared with source matrices, we derive non‑asymptotic error bounds and show that Trans‑Glasso achieves minimax optimality under certain conditions. Extensive simulations demonstrate Trans‑Glasso's superior performance compared to baseline methods, particularly in small‑sample settings. We further validate Trans‑Glasso in applications to gene networks across brain tissues and protein networks for various cancer subtypes, showcasing its effectiveness in biological contexts. Additionally, we derive the minimax optimal rate for differential network estimation, representing the first such guarantee in this area. The Python implementation of Trans‑Glasso, along with code to reproduce all experiments in this paper, is publicly available at https://github.com/boxinz17/transglasso‑experiments.
Authors: Keyue Qiu, Yuxuan Song, Jie Yu, Hongbo Ma, Ziyao Cao, Zhilong Zhang, Yushuai Wu, Mingyue Zheng, Hao Zhou, Wei-Ying Ma
Abstract: Structure‑based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), a gradient‑based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)‑equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of the past histories, allowing for a seamless trade‑off between exploration and exploitation during optimization. MolJO achieves state‑of‑the‑art performance on the CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock ‑9.05 and SA 0.78), a more than 4x improvement in Success Rate compared to the gradient‑based counterpart, and twice the "Me‑Better" Ratio of 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi‑objective optimization and challenging tasks in drug design such as R‑group optimization and scaffold hopping, further underscoring its versatility. Code is available at https://github.com/AlgoMole/MolCRAFT.
Authors: Shuo Zhang, Jian K. Liu
Abstract: Protein language models (PLMs) have demonstrated remarkable capabilities in learning relationships between protein sequences and functions. However, finetuning these large models requires substantial computational resources, often with suboptimal task‑specific results. This study investigates how parameter‑efficient finetuning via LoRA can enhance protein property prediction while significantly reducing computational demands. By applying LoRA to ESM‑2 and ESM‑C models of varying sizes and evaluating 10 diverse protein property prediction tasks, we demonstrate that smaller models with LoRA adaptation can match or exceed the performance of larger models without adaptation. Additionally, we integrate contact map information through a multi‑head attention mechanism, improving model comprehension of structural features. Our systematic analysis reveals that LoRA finetuning enables faster convergence, better performance, and more efficient resource utilization, providing practical guidance for protein research applications in resource‑constrained environments. The code is available at https://github.com/jiankliu/SeqProFT.
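The reduction in computational demands that this abstract attributes to LoRA comes from training a low-rank update instead of the full weight matrix. The following is a minimal pure-Python sketch of the standard LoRA formulation (effective weight W + (alpha / r) · B A); the helper names, the toy matrices, and the 1280-wide example layer are illustrative assumptions, not code from SeqProFT.

```python
def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with A of shape r x k and B of shape d x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d, k, r):
    """Return (full fine-tuning count, LoRA count) for one d x k weight matrix."""
    return d * k, r * (d + k)

# Toy frozen weight (2 x 2) with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # r x k
b = [[0.0], [0.0]]      # d x r, zero-initialised so training starts from W unchanged
w_eff = lora_weight(w, a, b, alpha=8.0)

# Illustrative layer size: one 1280 x 1280 projection with rank 8.
full, lora = trainable_params(1280, 1280, 8)
```

With B initialised to zero the adapted layer reproduces the frozen one exactly, and for the illustrative 1280 x 1280 matrix the adapter trains 20,480 parameters instead of 1,638,400.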
Authors: Fang Wu, Shuting Jin, Xiangru Tang, Junlin Xu, Mark Gerstein, James Zou
Abstract: D‑peptides are resistant to proteolysis, exhibit greater in vivo stability, and are easier to synthesize. Despite advances in deep learning for peptide discovery, the scarcity of natural D‑protein data limits the transfer of existing generative models to the D‑peptide chemical space. We propose D‑Flow, a full‑atom flow‑based framework for de novo D‑peptide design. Conditioned on receptor binding, D‑Flow uses structural representations incorporating backbone frames, side‑chain angles, and discrete amino acid types. A mirror‑image algorithm is implemented to address the lack of training data for D‑proteins by converting the chirality of L‑receptors. Furthermore, we enhance D‑Flow's capacity by integrating protein language models (PLMs) with structural awareness through a lightweight structural adapter that injects structural representations into PLM embeddings. This enables D‑Flow to learn conformational priors in the D‑peptide chemical space and to accommodate the chiral selectivity of binding sites, thereby mitigating the scarcity of D‑peptide data. A two‑stage training pipeline and a control toolkit enable D‑Flow to transition from general protein design to targeted binder design while preserving pre‑training knowledge. Results on the PepMerge benchmark show D‑Flow's effectiveness. D‑peptides generated by D‑Flow align more closely with native sequences and structures, with sequence identity improving by 10.2% over the best baseline, and the top affinity score reaching 24.31%. Overall, D‑Flow shows potential for D‑peptide design, facilitating the development of bioorthogonal and stable molecular tools and diagnostics. Code is available at https://github.com/smiles724/PeptideDesign.
Authors: Yingheng Wang, Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa Rangwala
Abstract: Self‑supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics, i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC‑PLM based on an alternative protein LM architecture, BiMamba‑S, built upon selective structured state‑space models, to learn high‑quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph‑contextual variant, LC‑PLM‑G, which contextualizes protein‑protein interaction (PPI) graphs for a second stage of training. LC‑PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and up to 30% and 16% improvements on protein downstream tasks compared to Transformer‑based ESM‑2 when trained with 100B and 1T tokens, respectively. LC‑PLM‑G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g., structured state space models) in learning universal protein representations and incorporating molecular interaction contexts contained in biological graphs.
Authors: Dong Shu, Bingbing Duan, Kai Guo, Kaixiong Zhou, Jiliang Tang, Mengnan Du
Abstract: Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLM) by mapping embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) to enable effective cross‑modal understanding. While preliminary protein‑focused MLLMs have emerged, they have predominantly relied on heuristic approaches, lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. We comprehensively evaluate three state‑of‑the‑art LLMs (Gemma2‑2B, LLaMa3.1‑8B, and LLaMa3.1‑70B) with four protein‑specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, larger LLMs demonstrate improved alignment capabilities, and protein rarity significantly impacts alignment performance. We also find that increasing GDM embedding dimensions, using two‑layer projection heads, and fine‑tuning LLMs on protein‑specific data substantially enhance alignment quality. These strategies offer potential enhancements to the performance of protein‑related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM‑GDM‑alignment.
Authors: Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song
Abstract: We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model (MLM) when repeating the commonly used UniRef database. To address this, we included metagenomic protein sequences in the training set to increase the diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observe a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large‑scale versions of ESM‑2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure‑ and function‑related tasks, all with lower or equivalent pre‑training compute budgets.
Authors: Yiheng Zhu, Jialu Wu, Qiuyi Li, Jiahuan Yan, Mingze Yin, Wei Wu, Mingyang Li, Jieping Ye, Zheng Wang, Jian Wu
Abstract: Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge‑IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge‑IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter‑efficient training. Extensive experiments on well‑established benchmarks demonstrate that Bridge‑IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https://github.com/violet‑sto/Bridge‑IF.
Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Lirong Wu, Siyuan Li, Yufei Huang, Jun Xia, Bozhen Hu, Stan Z. Li
Abstract: Post‑translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome, regulating protein attributes and interactions that are crucial for biological processes. Accurately predicting PTM sites and their specific types is therefore essential for elucidating protein function and understanding disease mechanisms. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence‑dependent motifs. However, these approaches often overlook protein structural contexts. In this work, we first compile a large‑scale sequence‑structure PTM dataset, which serves as the foundation for fair comparison. We introduce the MeToken model, which tokenizes the micro‑environment of each amino acid, integrating both sequence and structural information into unified discrete tokens. This model not only captures the typical sequence motifs associated with PTMs but also leverages the spatial arrangements dictated by protein tertiary structures, thus providing a holistic view of the factors influencing PTM sites. Designed to address the long‑tail distribution of PTM types, MeToken employs uniform sub‑codebooks that ensure even the rarest PTMs are adequately represented and distinguished. We validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research. The code and datasets are available at https://github.com/A4Bio/MeToken.
Authors: Yizhen Luo, Zikun Nie, Massimo Hong, Suyuan Zhao, Hao Zhou, Zaiqing Nie
Abstract: Studying protein mutations within amino acid sequences holds tremendous significance in life sciences. Protein language models (PLMs) have demonstrated strong capabilities in broad biological applications. However, due to architectural design and lack of supervision, PLMs model mutations implicitly with evolutionary plausibility, which is not satisfactory to serve as explainable and engineerable tools in real‑world studies. To address these issues, we present MutaPLM, a unified framework for interpreting and navigating protein mutations with protein language models. MutaPLM introduces a protein delta network that captures explicit protein mutation representations within a unified feature space, and a transfer learning pipeline with a chain‑of‑thought (CoT) strategy to harvest protein mutation knowledge from biomedical texts. We also construct MutaDescribe, the first large‑scale protein mutation dataset with rich textual annotations, which provides cross‑modal supervision signals. Through comprehensive experiments, we demonstrate that MutaPLM excels at providing human‑understandable explanations for mutational effects and prioritizing novel mutations with desirable properties. Our code, model, and data are open‑sourced at https://github.com/PharMolix/MutaPLM.
Authors: Joongwon Chae, Zhenyu Wang, Ijaz Gul, Jiansong Ji, Zhenglin Chen, Peiwu Qin
Abstract: Recent advancements in protein structure prediction, particularly AlphaFold2, have revolutionized structural biology by achieving near‑experimental accuracy (average RMSD < 1.5 Å). However, the computational demands of these models (approximately 30 minutes per protein on an RTX 4090) significantly limit their application in high‑throughput protein screening. While large language models like ESM (Evolutionary Scale Modeling) have shown promise in extracting structural information directly from protein sequences, rapid assessment of protein structure quality for large‑scale analyses remains a major challenge.
We introduce pLDDT‑Predictor, a high‑speed protein screening tool that achieves a 250,000× speedup compared to AlphaFold2 by leveraging pre‑trained ESM2 protein embeddings and a Transformer architecture. Our model predicts AlphaFold2's pLDDT (predicted Local Distance Difference Test) scores with a Pearson correlation of 0.7891 and processes proteins in just 0.007 seconds on average. Using a comprehensive dataset of 1.5 million diverse protein sequences (ranging from 50 to 2048 amino acids), we demonstrate that pLDDT‑Predictor accurately classifies high‑confidence structures (pLDDT > 70) with 91.2% accuracy and achieves an MSE of 84.8142 compared to AlphaFold2's predictions.
The source code and pre‑trained models are freely available at https://github.com/jw‑chae/pLDDT_Predictor, enabling the research community to perform rapid, large‑scale protein structure quality assessments.
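The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted above (about 30 minutes per protein for AlphaFold2, 0.007 seconds for pLDDT‑Predictor, MSE of 84.8142); reading the MSE as being on the 0 to 100 pLDDT scale is an assumption.

```python
import math

# Per-protein runtimes quoted in the abstract.
alphafold2_seconds = 30 * 60      # approximately 30 minutes
predictor_seconds = 0.007         # average reported for pLDDT-Predictor

# Ratio of runtimes, to compare against the stated 250,000x speedup.
speedup = alphafold2_seconds / predictor_seconds

# Root-mean-square error implied by the reported MSE
# (assuming pLDDT is on its usual 0-100 scale).
rmse = math.sqrt(84.8142)

print(f"speedup ~ {speedup:,.0f}x, RMSE ~ {rmse:.2f} pLDDT points")
```

The runtime ratio comes out slightly above 250,000, consistent with the rounded figure in the abstract, and the MSE corresponds to a typical error of roughly 9 pLDDT points.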
Authors: Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou
Abstract: Enzyme engineering enables the modification of wild‑type proteins to meet industrial and research demands by enhancing catalytic activity, stability, binding affinities, and other properties. The emergence of deep learning methods for protein modeling has demonstrated superior results at lower costs compared to traditional approaches such as directed evolution and rational design. In mutation effect prediction, the key to pre‑training deep learning models lies in accurately interpreting the complex relationships among protein sequence, structure, and function. This study introduces a retrieval‑enhanced protein language model for comprehensive analysis of native properties from sequence and local structural interactions, as well as evolutionary properties from retrieved homologous sequences. The state‑of‑the‑art performance of the proposed ProtREM is validated on over 2 million mutants across 217 assays from an open benchmark (ProteinGym). We also conducted post‑hoc analyses of the model's ability to improve the stability and binding affinity of a VHH antibody. Additionally, we designed 10 new mutants on a DNA polymerase and conducted wet‑lab experiments to evaluate their enhanced activity at higher temperatures. Both in silico and experimental evaluations confirmed that our method provides reliable predictions of mutation effects, offering an auxiliary tool for biologists aiming to evolve existing enzymes. The implementation is publicly available at https://github.com/tyang816/ProtREM.
Authors: Aayush Shah, Chakradhar Guntuboina, Amir Barati Farimani
Abstract: In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non‑fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task‑specific classifiers ‑ PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non‑hemolytic, 78.84% in non‑fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP‑based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: https://github.com/aayush‑shah14/PeptideGPT.
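The three-stage evaluation pipeline described in this abstract (rank by perplexity, discard sequences outside the convex hull, keep structures with pLDDT above 70) can be sketched as a chain of filters. This is a hedged illustration only: the `top_n` cut after ranking, the precomputed `in_hull` flag, and the toy records are assumptions, and in practice the perplexity, hull membership, and pLDDT values would come from the language model, a hull test on sequence features, and ESMFold respectively.

```python
def select_candidates(candidates, top_n, plddt_cutoff=70.0):
    """Apply the three filters in order and return the surviving records."""
    # 1. Rank by perplexity (lower is better) and keep the best top_n (assumed cut).
    ranked = sorted(candidates, key=lambda c: c["perplexity"])[:top_n]
    # 2. Drop sequences flagged as lying outside the permissible convex hull.
    inside = [c for c in ranked if c["in_hull"]]
    # 3. Keep only structures predicted with pLDDT above the cutoff.
    return [c for c in inside if c["plddt"] > plddt_cutoff]

# Hypothetical records; every value here is made up for illustration.
generated = [
    {"name": "A", "perplexity": 5.0, "in_hull": True,  "plddt": 82.0},
    {"name": "B", "perplexity": 3.0, "in_hull": False, "plddt": 90.0},
    {"name": "C", "perplexity": 4.0, "in_hull": True,  "plddt": 65.0},
    {"name": "D", "perplexity": 9.0, "in_hull": True,  "plddt": 88.0},
]

kept = [c["name"] for c in select_candidates(generated, top_n=3)]
```

In this toy run, D is dropped by the perplexity ranking, B by the hull check, and C by the pLDDT cutoff, leaving only A.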
Authors: Wojtek Treyde, Seohyun Chris Kim, Nazim Bouatta, Mohammed AlQuraishi
Abstract: Predicting a ligand's bound pose to a target protein is a key component of early‑stage computational drug discovery. Recent developments in machine learning methods have focused on improving pose quality at the cost of model runtime. For high‑throughput virtual screening applications, this exposes a capability gap that can be filled by moderately accurate but fast pose prediction. To this end, we developed QuickBind, a light‑weight pose prediction algorithm. We assess QuickBind on widely used benchmarks and find that it provides an attractive trade‑off between model accuracy and runtime. To facilitate virtual screening applications, we augment QuickBind with a binding affinity module and demonstrate its capabilities for multiple clinically‑relevant drug targets. Finally, we investigate the mechanistic basis by which QuickBind makes predictions and find that it has learned key physicochemical properties of molecular docking, providing new insights into how machine learning models generate protein‑ligand poses. By virtue of its simplicity, QuickBind can serve as both an effective virtual screening tool and a minimal test bed for exploring new model architectures and innovations. Model code and weights are available at https://github.com/aqlaboratory/QuickBind.
Authors: Wenrui Gou, Wenhui Ge, Yang Tan, Mingchen Li, Guisheng Fan, Huiqun Yu
Abstract: Protein structures are important for understanding their functions and interactions. Currently, many protein structure prediction methods are enriching the structure database. Discriminating the origin of structures is crucial for distinguishing between experimentally resolved and computationally predicted structures, evaluating the reliability of prediction methods, and guiding downstream biological studies. Building on work in structure prediction, we developed a structure‑sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE‑Pro), to represent and discriminate the origin of protein structures. CPE‑Pro learns the structural information of proteins and captures inter‑structural differences to achieve accurate traceability on four data classes, and is expected to extend to more. In addition, we used Foldseek to encode protein structures into "structure‑sequences" and trained a protein Structural Sequence Language Model, SSLM. Preliminary experiments demonstrated that, compared to large‑scale protein language models pre‑trained on vast amounts of amino acid sequences, the "structure‑sequence" enables the language model to learn more informative protein features, enhancing and optimizing structural representations. We have provided the code, model weights, and all related materials at https://github.com/GouWenrui/CPE‑Pro‑main.git.
Authors: Ameya Daigavane, Bodhi P. Vani, Darcy Davidson, Saeed Saremi, Joshua Rackers, Joseph Kleinhenz
Abstract: Conformational ensembles of protein structures are immensely important both for understanding protein function and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles, such as molecular dynamics (MD), are computationally inefficient, while many recent machine learning methods do not transfer to systems outside their training data. We propose JAMUN, which performs MD in a smoothed, noised space of all‑atom 3D conformations of molecules by utilizing the framework of walk‑jump sampling. JAMUN enables ensemble generation for small peptides an order of magnitude faster than traditional molecular dynamics. The physical priors in JAMUN enable transferability to systems outside of its training data, even to peptides that are longer than those it was originally trained on. Our model, code and weights are available at https://github.com/prescient‑design/jamun.
Authors: Xu Han, Yuancheng Sun, Kai Chen, Yuxuan Ren, Kang Liu, Qiwei Ye
Abstract: Coarse‑grained (CG) molecular dynamics simulations enable efficient exploration of protein conformational ensembles. However, reconstructing atomic details from CG structures (backmapping) remains a challenging problem. Current approaches face an inherent trade‑off between maintaining atomistic accuracy and exploring diverse conformations, often necessitating complex constraint handling or extensive refinement steps. To address these challenges, we introduce a novel two‑stage framework, named CODLAD (COnstraint Decoupled LAtent Diffusion). This framework first compresses atomic structures into discrete latent representations, explicitly embedding structural constraints, thereby decoupling constraint handling from generation. Subsequently, it performs efficient denoising diffusion in this latent space to produce structurally valid and diverse all‑atom conformations. Comprehensive evaluations on diverse protein datasets demonstrate that CODLAD achieves state‑of‑the‑art performance in atomistic accuracy, conformational diversity, and computational efficiency while exhibiting strong generalization across different protein systems. Code is available at https://github.com/xiaoxiaokuye/CODLAD.
Authors: Yifan Feng, Chengwu Yang, Xingliang Hou, Shaoyi Du, Shihui Ying, Zongze Wu, Yue Gao
Abstract: Existing benchmarks like NLGraph and GraphQA evaluate LLMs on graphs by focusing mainly on pairwise relationships, overlooking the high‑order correlations found in real‑world data. Hypergraphs, which can model complex beyond‑pairwise relationships, offer a more robust framework but are still underexplored in the context of LLMs. To address this gap, we introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low‑order, five high‑order, and two isomorphism tasks, utilizing both synthetic and real‑world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT‑4o, demonstrating our benchmark's effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper‑BAG and Hyper‑COT, which enhance high‑order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. This work establishes a foundational testbed for integrating hypergraph computational capabilities into LLMs, advancing their comprehension. The source codes are at https://github.com/iMoonLab/LLM4Hypergraph.
Authors: Jacob Beck, Shikha Surana, Manus McAuliffe, Oliver Bent, Thomas D. Barrett, Juan Jose Garau Luis, Paul Duckworth
Abstract: Predicting the biophysical and functional properties of proteins is essential for in silico protein design. Machine learning has emerged as a promising technique for such prediction tasks. However, the relative scarcity of in vitro annotations means that these models often have little, or no, specific data on the desired fitness prediction task. As a result of limited data, protein language models (PLMs) are typically trained on general protein sequence modeling tasks, and then fine‑tuned, or applied zero‑shot, to protein fitness prediction. When no task data is available, the models make strong assumptions about the correlation between the protein sequence likelihood and fitness scores. In contrast, we propose meta‑learning over a distribution of standard fitness prediction tasks, and demonstrate positive transfer to unseen fitness prediction tasks. Our method, called Metalic (Meta‑Learning In‑Context), uses in‑context learning and fine‑tuning, when data is available, to adapt to new tasks. Crucially, fine‑tuning enables considerable generalization, even though it is not accounted for during meta‑training. Our fine‑tuned models achieve strong results with 18 times fewer parameters than state‑of‑the‑art models. Moreover, our method sets a new state‑of‑the‑art in low‑data settings on ProteinGym, an established fitness‑prediction benchmark. Due to data scarcity, we believe meta‑learning will play a pivotal role in advancing protein engineering.
Authors: Xihan Qin, Li Liao
Abstract: Comorbidity carries significant implications for disease understanding and management. The genetic causes of comorbidity often trace back to mutations occurring either in the same gene associated with two diseases or in different genes, each associated with a different disease, that are connected via protein‑protein interactions. Therefore, the human interactome has been used in more sophisticated studies of disease comorbidity. The human interactome, as a large incomplete graph, presents its own challenges to extracting useful features for comorbidity prediction. In this work, we introduce a novel approach named Biologically Supervised Graph Embedding (BSE) that selects the most relevant features to enhance the prediction accuracy of comorbid disease pairs. Our investigation into BSE's impact on both centered and uncentered embedding methods showcases its consistent superiority over the state‑of‑the‑art techniques and its adeptness in selecting dimensions enriched with vital biological insights, thereby improving prediction performance significantly, up to 50% when measured by ROC for some variations. Further analysis indicates that BSE consistently and substantially improves the ratio of disease associations to gene connectivity, affirming its potential in uncovering latent biological factors affecting comorbidity. The statistically significant enhancements across diverse metrics underscore BSE's potential to introduce novel avenues for precise disease comorbidity predictions and other potential applications. The GitHub repository containing the source code can be accessed at the following link: https://github.com/xihan‑qin/Biologically‑Supervised‑Graph‑Embedding.
Authors: Song Li, Yang Tan, Song Ke, Liang Hong, Bingxin Zhou
Abstract: Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce VenusVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre‑trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 7000 antigen sequences, structures, and immunogenicity labels from bacteria, virus, and tumor. Extensive experiments demonstrate that VenusVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post‑hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research. The implementation is at https://github.com/songleee/VenusVaccine.
Authors: Jiaqing Xie, Tianfan Fu
Abstract: Deep learning has deeply influenced protein science, enabling breakthroughs in predicting protein properties, higher‑order structures, and molecular interactions. This paper introduces DeepProtein, a comprehensive and user‑friendly deep learning library tailored for protein‑related tasks. It enables researchers to seamlessly address protein data with cutting‑edge deep learning models. To assess model performance, we establish a benchmark evaluating different deep learning architectures across multiple protein‑related tasks, including protein function prediction, subcellular localization prediction, protein‑protein interaction prediction, and protein structure prediction. Furthermore, we introduce DeepProt‑T5, a series of fine‑tuned Prot‑T5‑based models that achieve state‑of‑the‑art performance on four benchmark tasks, while demonstrating competitive results on six others. Comprehensive documentation and tutorials are available to ensure accessibility and support reproducibility. Built upon the widely used drug discovery library DeepPurpose, DeepProtein is publicly available at https://github.com/jiaqingxie/DeepProtein.
Authors: Chenqing Hua, Yong Liu, Dinghuai Zhang, Odin Zhang, Sitao Luan, Kevin K. Yang, Guy Wolf, Doina Precup, Shuangjia Zheng
Abstract: Enzyme design is a critical area in biotechnology, with applications ranging from drug development to synthetic biology. Traditional methods for enzyme function prediction or protein binding pocket design often fall short in capturing the dynamic and complex nature of enzyme‑substrate interactions, particularly in catalytic processes. To address these challenges, we introduce EnzymeFlow, a generative model that employs flow matching with hierarchical pre‑training and enzyme‑reaction co‑evolution to generate catalytic pockets for specific substrates and catalytic reactions. Additionally, we introduce a large‑scale, curated, and validated dataset of enzyme‑reaction pairs, specifically designed for the catalytic pocket generation task, comprising a total of 328,192 pairs. By incorporating evolutionary dynamics and reaction‑specific adaptations, EnzymeFlow becomes a powerful model for designing enzyme pockets capable of catalyzing a wide range of biochemical reactions. Experiments on the new dataset demonstrate the model's effectiveness in designing high‑quality, functional enzyme catalytic pockets, paving the way for advancements in enzyme engineering and synthetic biology. We provide EnzymeFlow code at https://github.com/WillHua127/EnzymeFlow with notebook demonstration at https://github.com/WillHua127/EnzymeFlow/blob/main/enzymeflow_demo.ipynb.
Authors: David Errington, Constantin Schneider, Cédric Bouysset, Frédéric A. Dreyer
Abstract: The field of protein‑ligand pose prediction has seen significant advances in recent years, with machine learning‑based methods now being commonly used in lieu of classical docking methods or even to predict all‑atom protein‑ligand complex structures. Most contemporary studies focus on the accuracy and physical plausibility of ligand placement to determine pose quality, often neglecting a direct assessment of the interactions observed with the protein. In this work, we demonstrate that ignoring protein‑ligand interaction fingerprints can lead to overestimation of model performance, most notably in recent protein‑ligand cofolding models which often fail to recapitulate key interactions.
Authors: Bowen Jing, Hannes Stärk, Tommi Jaakkola, Bonnie Berger
Abstract: Molecular dynamics (MD) is a powerful technique for studying microscopic phenomena, but its computational cost has driven significant interest in the development of deep learning‑based surrogate models. We introduce generative modeling of molecular trajectories as a paradigm for learning flexible multi‑task surrogate models of MD from data. By conditioning on appropriately chosen frames of the trajectory, we show such generative models can be adapted to diverse tasks such as forward simulation, transition path sampling, and trajectory upsampling. By alternatively conditioning on part of the molecular system and inpainting the rest, we also demonstrate the first steps towards dynamics‑conditioned molecular design. We validate the full set of these capabilities on tetrapeptide simulations and show that our model can produce reasonable ensembles of protein monomers. Altogether, our work illustrates how generative modeling can unlock value from MD data towards diverse downstream tasks that are not straightforward to address with existing methods or even MD itself. Code is available at https://github.com/bjing2016/mdgen.
Authors: Hannes Stark, Umesh Padia, Julia Balla, Cameron Diao, George Church
Abstract: Generating protein sequences conditioned on protein structures is an impactful technique for protein engineering. When synthesizing engineered proteins, they are commonly translated into DNA and expressed in an organism such as yeast. One difficulty in this process is that the expression rates can be low due to suboptimal codon sequences for expressing a protein in a host organism. We propose CodonMPNN, which generates a codon sequence conditioned on a protein backbone structure and an organism label. If naturally occurring DNA sequences are close to codon optimality, CodonMPNN could learn to generate codon sequences with higher expression yields than heuristic codon choices for generated amino acid sequences. Experiments show that CodonMPNN retains the performance of previous inverse folding approaches and recovers wild‑type codons more frequently than baselines. Furthermore, CodonMPNN has a higher likelihood of generating high‑fitness codon sequences than low‑fitness codon sequences for the same protein sequence. Code is available at https://github.com/HannesStark/CodonMPNN.
Authors: Jieyi Wang, Bingxuan Li, Nanyi Jiang, Desong Meng, Zirui Fan, Yuxin Guo, Jiayu Liu, Kunlun Zhu, Eddie Yang, Xiusi Chen, Pan Lu, Bingxin Zhao
Abstract: Biomedical researchers increasingly use AI‑generated analyses and reports to interpret protein‑level signals, but static outputs are often insufficient for research decision‑making, where users need to inspect evidence, assess uncertainty, compare mechanisms, and refine hypotheses. We present BioInsight, a multi‑agent system that moves from static biomedical report generation to evidence‑centered interactive interface generation. Given a disease name, a protein association table, and optional cohort metadata, BioInsight organizes disease‑specific evidence through typed intermediate artifacts, including ranked pathways, literature evidence packets, protein‑level reasoning notes, citation‑grounded reports, dashboard schemas, and rendered interactive interfaces. The system decouples evidence retrieval from mechanistic reasoning, normalizes citations through deterministic components, and converts the same structured evidence used in the report into an interactive interface. We evaluate BioInsight on standardized biomedical QA, challenging protein‑function reasoning, and end‑to‑end biomedical evidence synthesis. Results show that BioInsight achieves the best results, and suggest that biomedical AI systems should move beyond text‑only and static reports toward provenance‑preserving, interactive evidence artifacts.
Authors: Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon
Abstract: Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate‑based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample‑reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent‑based protocol translation framework that converts natural‑language microplate‑based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural‑language protocol into a structured representation, and a rule‑based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device‑level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self‑correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross‑model verification. The accuracy‑latency trade‑off is further verified by comparing the rule‑based mapping of the proposed framework with LLM end‑to‑end direct mapping. Finally, Bradford assay‑based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end‑to‑end autonomous execution from natural‑language protocols to real‑world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural‑language protocols and microplate‑based self‑driving laboratories.
Authors: Nicholas J. Williams, Ward Haddadin, Matteo P. Ferla, Constantin Schneider, Nicholas B. Woodall, Ruby Sedgwick, Christian D. Madsen, Andrew L. Hopkins, Edward O. Pyzer-Knapp
Abstract: Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all‑atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co‑evolutionary signals. Emyx is a 140M‑parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise‑level framework, bridging flow matching training efficiency with state‑of‑the‑art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Proteína‑Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just 682 GPU‑hours, roughly 4× less than RFdiffusion3.
Authors: Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng
Abstract: Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post‑training adaptation typically relies on costly wet‑lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground‑truth labels. Our key insight is that task‑agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out‑of‑distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine‑tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self‑improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.
Authors: Yanjun Shao, Yundi Chen, Yashvi Patel, Aurelien Pelissier, María Rodríguez Martínez
Abstract: Pretrained biological language models expose per‑token probability distributions through masked‑token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these distributions are learned from broad unlabeled corpora and are not naturally conditioned on task‑specific biological contexts such as interaction partners, cellular environments, or therapeutic interventions. Existing contextual matching methods often distort this interface through pooled embeddings, contrastive latent spaces, or task‑specific prediction heads. We introduce LOGICA (Logit‑space Contrastive Alignment), a framework for context‑conditioned prediction that performs contrastive learning directly in output‑logit space. Using gated cross‑modal adapters compatible with each model's native token head, LOGICA preserves the pretrained likelihood interface and converts contextualized token log‑likelihoods into matching scores. Alignment is defined through context‑sensitive token probabilities rather than proximity in a shared embedding space, enabling learning from sparse paired data across models with distinct vocabularies, without a shared tokenizer or decoder. LOGICA is particularly effective for mutation‑local variant ranking, where comparisons reduce to context‑conditioned likelihoods of mutant tokens at perturbed sites. Across protein‑ligand binding, TCR‑peptide activity, and drug‑conditioned resistance prediction, LOGICA improves over prior state‑of‑the‑art methods, including matched latent‑contrastive and conditional MLM baselines, while retaining a token‑level interface for interpretation and generation. On held‑out‑gene single‑mutation drug‑resistance prediction, LOGICA improves AUC from near‑random latent‑space baselines of ~0.55 to ~0.65.
Authors: Md Nasiat Hasan Fahim, Md. Abid Ullah Muhib, Mohammad Shahidur Rahman
Abstract: Correct identification of fish species is highly significant for food security, economic development, and climate resilience in Bangladesh. Protein sequences directly reflect functional and evolutionary constraints, which are important for species authentication and biodiversity monitoring. Yet no benchmark exists for identifying native Bangladeshi fish species from protein sequences. In this study, we address this gap by introducing the first curated dataset for nine native Bangladeshi fish species, comprising 2845 high‑quality protein sequences. We also establish the first protein sequence classification baseline for this domain through a systematic benchmarking of seven architectural paradigms. Moreover, we propose a novel, realistically deployable hybrid architecture of MotifCNN and Transformer with Terminal‑Aware Positional‑Encoding (MotifCNN‑Transformer+TA‑PE). Our architecture achieves 79.80% accuracy with a macro‑F1 of 0.80. The highest accuracy, 83.04%, is achieved by the fine‑tuned protein language model ProtBERT, which has 420M parameters and requires dual 16GB GPUs for inference. According to McNemar's test, ProtBERT's 3.24 percentage‑point accuracy gain over our MotifCNN‑Transformer+TA‑PE is not statistically significant (p = 0.1120). Our architecture outperforms ProtBERT on six of the nine classes in per‑class identification. In addition, MotifCNN‑Transformer+TA‑PE is approximately 5x faster and 42x smaller than ProtBERT, supports a 16x larger batch size, and runs inference without a GPU, making it more practical for deployment in resource‑constrained areas such as rural Bangladesh. Beyond this, our foundational work shows the effects of phylogenetic relationships on sequence similarity and establishes pathways for fisheries management, food authentication, and biodiversity conservation in South Asia's protein‑dependent economy.
Authors: Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin
Abstract: Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture‑of‑experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image only (UNIv2), image text (CONCHv1.5), and image transcriptomic (STPath) representations for pixel‑level and slide‑level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole‑slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution‑ and tendency‑aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state‑of‑the‑art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI‑assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
Authors: Andraz Jelincic, Ross C. Walker
Abstract: The growing energy demand for computation is becoming increasingly unsustainable. Thermodynamic computing, which harnesses physical thermal fluctuations as a computational resource rather than suppressing them, offers orders‑of‑magnitude energy savings for probabilistic and combinatorial tasks. Pharmaceutical R&D, heavily reliant on computational optimization and sampling, is a natural application domain. Here we present what is, to our knowledge, the first concrete pharmaceutical application mapped to thermodynamic hardware with energy estimates grounded in prototype measurements. We reduce mRNA codon optimization, a combinatorial problem routinely solved in drug development, to sampling from an Ising model, making it directly executable on a thermodynamic sampling unit (TSU). Benchmarking three approaches (Potts sampling, Ising sampling, and a genetic algorithm baseline) on the SARS‑CoV‑2 spike protein, we find that all achieve comparable optimization quality (scores ~234‑240), but energy estimates based on validated hardware models indicate that a TSU could solve this problem using approximately 10e6 times less energy than a conventional GPU. All code is released under an open‑source license.
Authors: Darin Tsui, William Deinzer, Daniel Saeedi, Amirali Aghazadeh
Abstract: Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross‑layer transcoders (CLTs) to ProGen3, a sparse Mixture‑of‑Experts model trained for both causal generation and span infilling. Unlike per‑layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter‑layer generative computation. We further develop a zero‑shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero‑shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.
Authors: Dominik Geng, Florian Graf, Martin Uray, Roland Kwitt
Abstract: Molecular dynamics (MD) simulations generate trajectories in a high‑dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general‑purpose representation for MD and introduce the masked Flood complex, a protein‑tailored modification of a recently introduced simplicial complex construction that emphasizes inter‑residue structure at low computational cost. Vectorized persistence diagrams then provide information‑rich, geometry‑aware summaries of protein conformations, which we evaluate on protein class prediction, frame‑level observable regression, and Markov state model (MSM) estimation from learned low‑dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH‑based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically‑informed MSMs as a drop‑in replacement within the recent MarS‑FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than with MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast‑folding proteins.
Authors: Stefano Maestri
Abstract: Agent‑based modelling is gaining recognition as a powerful approach for simulating complex cellular pathways, owing to its ability to reproduce emergent biological behaviours without requiring extensive kinetic parameterisation. In this article, we present a GPU‑accelerated agent‑based simulator specifically designed to model and analyse signalling pathways involved in cancer progression, and to evaluate therapeutic interventions. Our approach leverages the computing capabilities of FLAME GPU 2, a GPU‑accelerated agent‑based modelling framework, to efficiently manage simulations involving millions of molecules interacting within a three‑dimensional environment. Each molecule is represented as an autonomous agent with defined physical properties, capable of binding, releasing reaction products, migrating between compartments, and interacting based on spatial proximity. An intuitive graphical interface supports model construction, parameter setup, and real‑time modification of treatment strategies. As the primary focus of this paper, we validate the simulator on the MAPK/ERK cascade affected by the BRAFV600E mutation, demonstrating that it accurately reproduces dose‑response trends observed in clinical data and outperforms both deterministic models and our prior agent‑based implementations. A second case study extends the approach to nuclear signalling by reproducing the dynamics of cFos expression and phosphorylation. This demonstrates the simulator's ability to capture compartmentalised regulation, reproducing transient mRNA responses and protein accumulation, including the effect of an unresolved negative transcriptional regulator. Together, these results show that GPU‑accelerated ABM can faithfully replicate both drug response and emergent gene expression dynamics, providing a scalable and biologically grounded computational tool for supporting precision oncology.
Authors: Peng-Fei Sun, Chuan-Xian Ren, Hong Yan
Abstract: Accurate prediction of protein‑ligand binding affinity is essential for structure‑based drug discovery. Recent geometric deep learning methods have achieved promising performance by representing protein‑ligand complexes as three‑dimensional graphs. However, most existing approaches mainly rely on static interaction geometry from a single bound conformation, while neglecting molecular flexibility and binding‑induced conformational changes. To address this limitation, we propose a curvature‑informed potential energy surface (CPES) graph neural network for protein‑ligand binding affinity prediction, which incorporates physics‑informed curvature representations to model conformational flexibility. CPES first derives curvature spectral descriptors from the Hessian of the potential energy surface evaluated at equilibrium configurations, whose eigenvalues define the local principal curvatures of the potential energy surface. It then uses spectral cross‑attention to compare the unbound ligand and protein with the bound complex, thereby capturing binding‑induced changes in conformational dynamics. In parallel, hierarchical protein‑ligand interaction representations are learned from static structural features through geometry‑aware message passing, soft clustering, and bidirectional cross‑attention. Finally, CPES fuses the curvature‑informed dynamic representations with static interaction representations for affinity regression. Extensive evaluations on multiple benchmark datasets demonstrate that CPES achieves improved predictive performance and offers physical interpretability.
Authors: Shuai Li, Chuan-Xian Ren, Yuhao Li, Ziqi Huang, Yue Pan, Mingzhe Tang, Hong Yan
Abstract: Protein‑ligand binding affinity (PLA) prediction is critical in drug discovery. Despite the notable advancements in machine learning‑based approaches, existing methods struggle to jointly characterize local geometric organization and globally coordinated cross‑molecular interactions, limiting their ability to model complex binding mechanisms. Here, we propose RicciBind, a geometric representation framework that integrates curvature‑guided hierarchical structure learning with optimal transport (OT)‑based cross‑domain alignment to model molecular interactions. Specifically, RicciBind leverages Ricci curvature to capture local interaction tightness within molecular structures, enhancing structural awareness and organizing atomic interactions into curvature‑aware hierarchical representations. An OT‑based cluster matching mechanism then aligns protein and ligand clusters across heterogeneous domains under geometric constraints, enabling globally consistent correspondences and revealing higher‑order interaction patterns beyond local neighborhoods. By coupling curvature‑guided structure encoding with OT‑driven cross‑domain alignment, RicciBind effectively models complex interaction semantics and substantially improves both the accuracy and interpretability of binding affinity prediction. Extensive experiments demonstrate that RicciBind achieved superior predictive performance and generalization across PLA benchmarks and virtual screening tasks. Ablation studies further confirmed the essential role of Ricci curvature in enhancing molecular interaction representations.
Authors: Lezhi Tan, Tijana Zrnic
Abstract: There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM‑generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM‑as‑a‑judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
Authors: Shuni Li, Zhiyuan Ruan, Andy Shen, Ivan Jayapurna, Ting Xu, Haiyan Huang
Abstract: Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein‑like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi‑supervised framework. By equipping a classical VAE with an additional feature‑based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non‑native environments and cross‑validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.
Authors: Chuanzhen Wang, Meade Cleti, Pete Jano
Abstract: De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion‑based and flow matching approaches have achieved progress, they typically operate at single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse‑to‑fine generation that models backbone geometry before refining to all‑atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)‑equivariant architecture for efficient multi‑scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state‑of‑the‑art performance while requiring 4 fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate compared to 41.2% for RFDiffusion.
Authors: Martin Jankowiak, Yerdos Ordabayev, Rudraksh Tuwani, Henry N. Ward, Hunter Nisonoff, James M. McFarland, Gevorg Grigoryan
Abstract: Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data‑efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore, by learning what are in effect structure‑aware substitution matrices, we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure‑conditioned kernels are well suited to multi‑task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
Authors: Michael Yu, Matthew L. Olson
Abstract: Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion‑transformer activations to audit protein models for hazard‑aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open‑weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84 (q < 10^‑13). To our knowledge this is the first SAE trained on an all‑atom diffusion model and the first feature‑level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.
Authors: Riccardo De Santi, Bruce Lee, Cristian Perez Jensen, Kimon Protopapas, Sophia Tang, Cheng-Hao Liu, Pranam Chatterjee, Yisong Yue, Andreas Krause
Abstract: Standard flow and diffusion pre‑training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new‑to‑nature designs, which are assigned negligible probability under, and are thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non‑negligible probability. This allows us to introduce a new learning principle for out‑of‑distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre‑training method that employs verifier feedback to expand a pre‑trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish, to our knowledge, first‑of‑their‑kind statistical learning guarantees for out‑of‑distribution flow modeling, analyzing generable set expansion as a local‑to‑global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out‑of‑distribution generative modeling metrics across small organic molecules, mid‑sized drug‑like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre‑trained model, significantly outperforming widely adopted synthetic flow pre‑training methods.
Authors: A Shivram, Aneesh S. Chivukula, Manik Gupta, Sourav Chowdhury
Abstract: Multimodal ΔΔG predictors integrating protein language models with inverse‑folding representations achieve strong in‑distribution accuracy on the Megascale dataset but exhibit limited robustness on out‑of‑distribution (OOD) proteins, persistent forward‑reverse bias on paired‑mutation benchmarks, and under‑representation of rare stabilizing mutations. Existing approaches address these limitations primarily through additional architectural components, leaving optimization‑level intervention comparatively underexplored. We introduce a constraint‑aware optimization framework combining Balanced Mean Squared Error, a Siamese anti‑symmetric regularizer, and a novel OOD‑margin consistency loss on the per‑position feature representation, requiring no architectural changes to the SPURS backbone. Across eleven benchmarks and three random seeds, the framework improves Spearman correlation on S669 from 0.486 to 0.540 (σ=0.002 across seeds), matching the published SPURS baseline (0.50) without architectural modification, and on S461 from 0.653 to 0.711, with consistent smaller gains on five additional OOD datasets. A controlled diagnostic on Ssym reveals that anti‑symmetric training does not eliminate systematic forward‑reverse bias, indicating that gains arise through implicit regularization rather than exact thermodynamic constraint enforcement.
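The anti‑symmetry idea in the abstract above (a mutation and its reverse should have ΔΔG values of opposite sign) can be illustrated with a short Python sketch. The function name and the mean‑squared form of the penalty are assumptions chosen for illustration; the abstract does not specify how the authors' Siamese regularizer is actually defined.

```python
from typing import Sequence


def antisymmetry_penalty(forward_preds: Sequence[float],
                         reverse_preds: Sequence[float]) -> float:
    """Mean squared violation of ddG(forward) + ddG(reverse) = 0.

    Hypothetical helper: each index pairs a predicted ddG for a mutation
    (forward) with the prediction for the reverse mutation.
    """
    if len(forward_preds) != len(reverse_preds):
        raise ValueError("forward and reverse predictions must be paired")
    if len(forward_preds) == 0:
        raise ValueError("at least one prediction pair is required")
    # A perfectly anti-symmetric predictor gives f + r == 0 for every pair.
    total = sum((f + r) ** 2 for f, r in zip(forward_preds, reverse_preds))
    return total / len(forward_preds)
```

A penalty of zero means the paired predictions cancel exactly. As the abstract notes, training with such a term does not by itself guarantee that forward‑reverse bias disappears.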
Authors: Chahat Baranwal, Aadtya Baranwal, Lakshya Nitin Tandon
Abstract: High‑grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the dark regulome, is the natural substrate to probe, and sequence foundation models offer a zero‑shot route through in‑silico mutagenesis (ISM); yet likelihood‑based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus‑Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma‑relevant loci, we introduce a residualization‑and‑permutation diagnostic that separates predictability‑driven from regulation‑driven RIS variance. A sharp 10kb proximal‑regulatory horizon survives every control we apply, but the LM‑derived element‑class hierarchy does not: a six‑feature linear baseline matches Caduceus top‑decile membership at AUC = 0.985. Cross‑architecture decomposition cleanly separates a sequence‑predictability layer (the two language models co‑rank long well‑predicted transposable elements) from a regulatory‑output layer (Enformer alone retains residual cCRE‑discriminative signal), with literally zero overlap between the two top‑100 lists. Conservation, brain cis‑eQTL, and STRING‑PPI cross‑checks then anchor what biology survives: top‑100 elements across all three models are 3.3× enriched per model for matching brain eQTLs (p_emp < 5×10^‑3), while a tempting transposable‑element regulatory layer and a striking NRXN1+NLGN1 protein‑pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM‑based regulatory study.
Authors: Saket Reddy, Shiwei Liu
Abstract: While generative AI models have demonstrated remarkable success in structure‑based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low‑pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow‑pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state‑of‑the‑art generative models reveals weaker predicted binding affinity on these low‑concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.
Authors: Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng
Abstract: Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient‑Informed Logit Correction (GILC), a plug‑and‑play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high‑dimensional discrete spaces, we introduce a Jacobian‑free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non‑differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state‑of‑the‑art performance without additional training, frequently outperforming fine‑tuning approaches.
Authors: Sota Asanuma
Abstract: Neural network (NN)‑based nonlinear causal discovery methods recover DAG structure but leave each causal mechanism as a black box. Waxman et al. argued that extracting causal mechanisms from NN weights is ill‑posed. We propose EML‑CD, a framework that integrates the EML operator (capable of composing elementary functions from a single binary operator) into causal structure learning, with interpretable mechanism recovery as the primary objective. EML‑CD represents each edge mechanism as a gated EML binary tree and automatically discovers closed‑form causal equations. Analytical Jacobians can be directly computed from the output equations, enabling quantitative understanding of causal effects. On real data (Sachs protein signaling, d=11), EML‑CD achieves SHD=11.2 +/‑ 0.4 (5‑seed mean; baselines are single deterministic runs), on par with PC/GES within seed variance and below CAM, while attaching closed‑form equations to each detected edge (precision 0.756, recall 0.365). In a controlled bivariate test with known mechanisms, EML‑CD recovers 10 of 11 elementary function families faithfully (held‑out shape correlation >= 0.96; only high‑frequency sine is partial). On a symbolic synthetic benchmark, EML‑CD attains a substantially lower and more stable held‑out mechanism f‑MSE than a fixed SINDy dictionary (mean 3.67 vs. 7644, the latter inflated by catastrophic extrapolation on one seed), although its structure recovery (SHD 14.0) only matches the dictionary and stays below specialized optimizers; on the Causal Chambers light‑tunnel subset, a depth‑2 model improves F1 over linear OLS‑BIC (0.444 vs. 0.273).
Authors: Hanqun Cao, Zachary Quinn, Aastha Pal, Sumi Kimura, Jingjie Zhang, Pheng Ann Heng, Pranam Chatterjee
Abstract: Protein binder design has largely optimized for affinity alone, leaving conformational selectivity unaddressed: for allosteric targets such as kinases, nuclear receptors, and GPCRs, a binder that engages both active and inactive states provides no functional specificity regardless of how tightly it binds. We introduce AlloGen, a modular framework that decouples backbone generation from a learned state‑selectivity scorer Q_θ, an SE(3)‑invariant interface graph transformer trained via a two‑phase curriculum that first learns interface geometry before imposing conformational discrimination. Because Q_θ is fully differentiable and generator‑agnostic, it integrates with any backbone generator as a passive reranker or an active gradient‑based guide without retraining. Across a diverse benchmark of proteins spanning multiple families and conformational mechanisms, AlloGen consistently identifies binders that preferentially recognize desired structural states while rejecting alternative conformations. Experimental validation on calmodulin further demonstrates that these computational selectivity signals translate to physical molecules, yielding de novo peptides that bind the desired holo conformation while exhibiting no detectable binding to the apo state. Together, these results establish conformational selectivity as a learnable property and provide a general framework for state‑selective protein binder design.
Authors: Shi Li, Xujun Zhang, Mingquan Liu, Hui Zhang, Shuoying Jia, Yu Kang, Tingjun Hou, Peichen Pan
Abstract: Nucleic acids are increasingly recognized as therapeutic targets beyond conventional protein‑centered drug discovery, yet accurate and efficient docking of small molecules to nucleic acid structures remains challenging. Physics‑based docking methods often show limited accuracy and efficiency, whereas deep learning approaches are constrained by the scarcity of experimentally resolved nucleic acid‑ligand complexes. Here, we present NucleoDock, a deep learning framework for nucleic acid‑small molecule docking. To address data scarcity, NucleoDock combines physics‑guided large‑scale pretraining on millions of docking‑generated synthetic complexes with fine‑tuning on curated experimental co‑crystal structures. It further integrates sequence‑ and structure‑informed nucleotide representations with atomistic three‑dimensional features to capture both biological context and binding‑site geometry. A mixture density network‑based geometric scoring head is used to model conditional interaction‑distance distributions for pose ranking. On an external benchmark of 125 nucleic acid‑ligand complexes, NucleoDock achieved a top‑1 success rate of 56 percent at an RMSD cutoff of 2.0 Angstrom, outperforming rDock with 29 percent, while generating 100 poses in approximately 5 seconds per complex. Retrospective virtual screening on the ROBIN benchmark further showed improved early enrichment. NucleoDock represents a step toward bridging the methodological gap between protein‑ and nucleic acid‑directed computational drug discovery.
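The top‑1 success rate quoted above is the fraction of complexes whose top‑ranked pose lies within an RMSD cutoff of the reference ligand pose. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, the example RMSD values are made up, and treating a pose exactly at the cutoff as a success is an assumption, since the abstract does not say whether the comparison is strict.

```python
from typing import Sequence


def top1_success_rate(top1_rmsds: Sequence[float], cutoff: float = 2.0) -> float:
    """Fraction of complexes whose top-ranked pose has RMSD <= cutoff (Angstrom).

    Assumption: a pose exactly at the cutoff counts as a success.
    """
    if len(top1_rmsds) == 0:
        raise ValueError("at least one complex is required")
    hits = sum(1 for rmsd in top1_rmsds if rmsd <= cutoff)
    return hits / len(top1_rmsds)


if __name__ == "__main__":
    # Made-up RMSD values for four complexes, for illustration only.
    example = [0.8, 1.9, 2.0, 3.5]
    print(f"top-1 success rate: {top1_success_rate(example):.2f}")
```

On a 125‑complex benchmark such as the one described, each complex contributes 0.8 percentage points to this rate.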
Authors: Wangbo Luo, Zhonghua Qiao, Yanxiang Zhao
Abstract: In this paper, we develop and analyze exponential time differencing (ETD) schemes for a phase‑field model of multicomponent membranes proposed in our previous work \cite{luo2025ohta}, in which membrane deformation is governed by a force‑balance phase‑field equation and protein segregation is described by a membrane‑associated Ohta‑Kawasaki (OK) dynamics. For a fixed phase‑field membrane, we introduce a geometry‑adapted operator splitting method based on the localization function, which reformulates the surface OK dynamics into a form suitable for ETD integration. The resulting first‑ and second‑order ETD schemes, combined with finite‑difference spatial discretization, are rigorously proved to satisfy a discrete maximum‑bound principle and unconditional energy stability. For the coupled system, we construct stabilized ETD schemes in an FFT‑based spectral framework, treating stiff linear terms exactly and nonlinear mechanochemical couplings explicitly. A narrow‑band implementation further reduces the computational cost by restricting surface calculations to the diffuse membrane region. Numerical experiments confirm the predicted temporal accuracy, maximum‑bound preservation, and energy decay for the fixed‑membrane OK problem, and demonstrate stable and efficient three‑dimensional simulations of protein‑driven pattern formation and membrane deformation.
Authors: Kargi Chauhan
Abstract: AI scientist systems increasingly choose biological foundation models before they choose experiments. In protein pipelines, this creates a concrete engineering and scientific question: when is the cost of structural inference worth paying over a cheaper sequence‑only model? We introduce the information bonus (IB), a task‑level metric that measures the linearly accessible advantage of frozen single‑sequence AlphaFold2 Evoformer representations over frozen ESM‑2 embeddings under protein‑level cross‑validation. Across binding affinity regression (PDBbind, n=5,680), conformational flexibility (ATLAS molecular dynamics, 268 proteins), and allosteric‑site classification (AlloSigDB, n=9,925 residues), IB is sharply mechanism‑dependent. ESM‑2 dominates binding affinity (IB=‑0.141; Pearson r=0.449 vs. 0.307) and binary flexibility (IB=‑0.060; AUROC 0.824 vs. 0.764; p=0.0017). AF2 single representations give the only above‑chance allostery predictions (IB=+0.064; AUROC 0.548 vs. 0.485), revealing long‑range geometric signal not recovered from sequence alone. We also identify a residue‑level leakage artifact: naive residue splits inflate RMSF performance by 27‑39% depending on the representation, enough to reverse representation rankings. These results turn representation selection into a measurable decision for AI‑for‑science systems.
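The reported IB values are consistent in sign and magnitude with a simple difference between the AF2 and ESM‑2 scores on each task. The Python sketch below recomputes that difference from the rounded metrics quoted in the abstract. Reading IB as a plain difference is an assumption (the abstract does not give the formula), and the function and dictionary names are hypothetical.

```python
def information_bonus(af2_score: float, esm_score: float) -> float:
    """Assumed form of IB: AF2 metric minus ESM-2 metric on the same task."""
    return af2_score - esm_score


# (AF2 score, ESM-2 score) pairs as quoted, already rounded, in the abstract.
REPORTED_METRICS = {
    "binding affinity (Pearson r)": (0.307, 0.449),
    "binary flexibility (AUROC)": (0.764, 0.824),
    "allosteric site (AUROC)": (0.548, 0.485),
}


def summarize(metrics: dict) -> dict:
    """Return the recomputed IB per task, rounded to three decimals."""
    return {
        task: round(information_bonus(af2, esm), 3)
        for task, (af2, esm) in metrics.items()
    }


if __name__ == "__main__":
    for task, ib in summarize(REPORTED_METRICS).items():
        print(f"{task}: IB = {ib:+.3f}")
```

Because the inputs are rounded, the recomputed values can differ from the reported IB in the last digit; a negative value favors ESM‑2 and a positive value favors the AF2 single representation.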
Authors: Bryan Cheng, Austin Jin
Abstract: Proteolysis‑targeting chimeras (PROTACs) can selectively degrade disease‑causing proteins, yet predicting which targets are amenable to degradation remains a critical bottleneck: existing computational methods require the complete PROTAC molecular structure, information unavailable before synthesis. We present DegradoMap, a graph neural network that predicts PROTAC‑mediated degradability from protein structure and E3 ligase identity alone, the minimal information available at the target selection stage. The model encodes biophysical priors through lysine‑weighted graph pooling with per‑protein normalization, models protein‑E3 compatibility via cross‑attention, and integrates cellular context from the Cancer Dependency Map. On the PROTAC‑8K benchmark (3,101 samples, 155 targets, 10 E3 ligases), DegradoMap achieves 0.646 ± 0.124 AUROC on target‑unseen evaluation (best seed: 0.7449) and 0.811 AUROC on CRBN→VHL E3‑unseen transfer, outperforming GNN and machine learning baselines. The model additionally recommends optimal E3 ligases with 74% Hit@3 accuracy. Two findings carry broader implications: E(3)‑equivariant architectures underperform the simpler invariant design for this scalar prediction task, and ESM‑2 embeddings improve peak performance only with careful regularization; naive integration fails. DegradoMap provides pre‑synthesis computational guidance for degradability assessment; its well‑calibrated confidence scores (ECE = 0.029, target‑unseen) enable practitioners to prioritize high‑confidence predictions for experimental follow‑up. However, the high seed variance (std = 0.124) and limited E3 coverage require ensembling for reliable deployment.
Authors: Junhao Wei, Baili Lu, Zhenhong Peng, Wanyan Li, Zhirong Huang, Yanxiao Li, Yifu Zhao, Dexing Yao, Haochen Li, Xudong Ye, Sio-Kei Im, Yapeng Wang, Xu Yang
Abstract: Sequence‑based deep learning offers a scalable alternative to structure‑based scoring for protein‑ligand binding affinity prediction. However, progress is hard to interpret when architectural priors are evaluated on canonical PDBbind‑style splits that leak similarity classes across folds. We present HonestAffinity, a compact 1D‑input predictor to isolate two priors under a leak‑aware protocol: frozen ESM‑2 (650M) protein embeddings and a learned binary pocket‑position marker. We evaluate a multi‑scale convolutional/Transformer template in three variants: HonestAffinity‑Pocket, HonestAffinity‑NoPocket, and HonestAffinity‑Pocket‑NoESM. All three train on 11,513 LP‑PDBBind complexes in ~3 GPU‑hours. We benchmark against five baselines on the LP‑PDBBind 3‑tier no‑leak hold‑out, CASF‑2016, and a CASF‑2016 non‑train subset. Our central finding is a split‑conditioned reversal rather than a uniformly best prior: HonestAffinity‑Pocket achieves the best mean Pearson R on validation and CASF‑2016 splits, whereas HonestAffinity‑Pocket‑NoESM achieves the best mean Pearson R on every strict LP no‑leak tier (test_cl1‑cl3). Both the pocket marker and ESM‑2 input improve performance on familiar splits but reduce Pearson R on strict no‑leak tiers. We argue models should report paired canonical and leak‑proof ablations, and that deployment‑regime‑matched variants better describe these reversals than a single default. Code and scripts are linked in the footnote; checkpoints will be released upon acceptance.
Authors: Sima Soltani, Mehrdad Jalali, Yahya Forghani, Reza Sheybani
Abstract: Protein complexes are central units of cellular organization, yet their identification from protein‑protein interaction (PPI) networks remains difficult because interactome maps are noisy, incomplete, context dependent, and unevenly annotated. This focused methodological review examines evidence‑aware approaches that combine PPI topology with Gene Ontology (GO) annotations, expression profiles, subcellular localization, sequence or domain evidence, temporal information, and representation learning, with emphasis on post‑2018 methods and selected historical baselines. The central synthesis is that transparent evidence‑aware graph methods currently offer the strongest tradeoff between biological plausibility and reproducibility, while deep, hypergraph, and dynamic heterogeneous models expand biological realism but require stronger benchmark control. The central bottleneck is no longer only the lack of algorithms, but the lack of harmonized, overlap‑aware, and reproducible evaluation protocols. We therefore recommend unified benchmark versions, explicit GO‑circularity controls, overlap‑aware metrics, uncertainty estimates, and executable software packages over isolated source‑specific F‑measure gains.
Authors: Zaifei Yang, Samuel Ping-Man Choi, James Kwok
Abstract: Protein‑protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso‑scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM‑PPI, a Hierarchical Motif‑based Multi‑Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom‑up multi‑modal manner across three scales. At the micro‑scale, we encode three modal residue features; at the meso‑scale, a novel multimodal motif encoder aggregates residues into spatially‑informed motif embeddings; at the macro‑scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter‑modal correlations. The pre‑trained encoder can be used off‑the‑shelf for large‑scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM‑PPI outperforms state‑of‑the‑art multi‑label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Code is available at https://github.com/yzf-code/MMM-PPI.
Authors: Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang
Abstract: AI for scientific discovery is entering an agentic era, where protein‑engineering systems are expected to prioritize future wet‑lab experiments rather than merely fit static measurements. We introduce TadA‑Bench, a million‑variant wet‑lab replay benchmark from 31 TadA directed‑evolution rounds for future‑round discovery toward agentic protein engineering. TadA‑Bench preserves the campaign chronology and defines a fixed‑data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph‑based label‑unification pipeline, to reconcile noisy enrichment measurements into consistent cross‑round activity labels. Random‑split controls show strong interpolation, but future‑round ranking and finite‑budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA‑Bench as a reproducible wet‑lab replay substrate for future‑round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
Authors: Sahil Rahman, Maxx Richard Rahman
Abstract: Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre‑trained PLM with i) Reasoning‑Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory‑level extension of direct preference optimisation that trains the policy end‑to‑end to learn when oracle feedback is informative rather than merely imitating high‑fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero‑shot fitness prediction with standardised oracle APIs and controlled sequence‑identity splits. AgentPLM achieves state‑of‑the‑art results with a gain in antibody top‑10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.
Authors: Guang Lin, Shikui Tu, Lei Xu
Abstract: Generating molecules that simultaneously satisfy drug‑like properties and conform to the 3D structure of a target protein is a core challenge in structure‑based drug design (SBDD). Existing generative approaches, however, often rely on costly post‑hoc processing during sampling or require carefully curated datasets during training, yet still achieve only modest gains. These limitations are especially pronounced in multi‑objective settings, where balancing conflicting criteria remains difficult. To address these challenges, we propose FTDiff, a reinforcement learning fine‑tuning framework tailored for diffusion‑based molecular generation under structural constraints. To ensure stable and sample‑efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time‑free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold‑aware reward, FTDiff effectively guides the model to produce valid, diverse, and high‑quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post‑hoc optimization or intricate data engineering.
Authors: Dan Luo, Xuan Lin, Peng Zhou, Junwen Zhu, Tengfei Ma, Xiangxiang Zeng, Yiping Liu
Abstract: Despite the growing availability of cryo‑electron microscopy (cryo‑EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general‑purpose protein pretraining framework tailored for cryo‑EM density maps and designed for protein‑related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes, which are essential for capturing global structural context in cryo‑EM density maps. To address these challenges, we propose CryoProt, a protein pretraining framework designed for cryo‑EM density maps. CryoProt introduces a Map Encoder based on multi‑head latent attention (MLA), where box‑level representations interact through a shared latent space, enabling explicit modeling of cross‑box dependencies within the density map. Furthermore, we adopt a multi‑task pretraining strategy to learn generalizable representations that can be effectively transferred to diverse downstream tasks, such as protein flexibility prediction, where cryo‑EM density maps are not required and can be inferred implicitly by the pretrained model. Experimental results demonstrate that CryoProt consistently outperforms existing state‑of‑the‑art methods across multiple benchmarks, achieving up to 12% improvement over the best‑performing baselines, highlighting the importance of modeling cross‑box interactions in cryo‑EM data. The source code is publicly available at https://anonymous.4open.science/r/CryoProt.
Authors: Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar, Khaled Rasheed, Samuel E. Aggrey
Abstract: Protein‑language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural‑language question is generated from a hidden typed domain‑specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold‑predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held‑out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard‑negative robustness pool. Without fine‑tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain‑of‑thought, grammar‑constrained executable voting, executable voting with chain‑of‑thought, and multi‑turn ReAct‑style tool use, and replicate the headline finding on Gemma‑3‑1B and Gemma‑3‑12B. We find a capability‑dependent denotation threshold between Qwen3‑1.7B and Qwen3‑4B: below it, tool‑mediated ReAct dominates because models often fail to produce executable denotations; above it, chain‑of‑thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse‑failure and family‑level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary‑structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.
Authors: Abhiram Badrinarayanan, Davor Davidovic, Edoardo Di Napoli, Jurica Novak, Luigi Genovese, Gustavo Ramirez-Hidalgo, Xinzhe Wu
Abstract: Simulating large molecular systems comprising thousands of atoms requires highly scalable methodologies. While modern Density Functional Theory (DFT) codes exhibit linear scaling, solving the associated large, sparse generalized eigenproblems remains a critical computational bottleneck on exascale architectures. In the context of the LimitX project, we propose a data‑driven framework to accelerate these calculations. By shifting the machine learning target from discrete eigenvalues to the coefficients of an interpolating Chebyshev polynomial, and by comparing both all‑atom and fragment‑based structural representations, we successfully overcome the dimensionality constraints of large‑scale spectral prediction. We investigate three machine learning models (Kernel Ridge Regression, Graph Neural Networks, and Random Forests) trained on a novel 2 TB dataset of protein dimers. The predicted spectra provide initial guesses that effectively bypass early Self‑Consistent Field (SCF) iterations in BigDFT. Ultimately, these spectral predictors will be deployed to dynamically optimize upcoming rational filter‑based eigensolvers, such as FrASE, which is currently in initial development.
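To illustrate why Chebyshev coefficients make a convenient learning target, the sketch below compresses spectra of two different sizes into coefficient vectors of the same fixed length. It is a toy under stated assumptions, not the authors' pipeline: the spectrum is the analytic eigenvalue set of a 1-D chain Laplacian standing in for DFT eigenvalues, the eigenvalue index is mapped linearly onto [-1, 1], the degree (`DEGREE = 10`) is an arbitrary choice, and a least-squares fit with NumPy's `chebfit` stands in for whatever interpolation scheme the paper actually uses.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

DEGREE = 10  # hypothetical choice; the abstract does not state a degree


def toy_spectrum(n: int) -> np.ndarray:
    """Sorted eigenvalues of an n-site 1-D chain Laplacian (stand-in for DFT eigenvalues)."""
    k = np.arange(1, n + 1)
    return 2.0 - 2.0 * np.cos(k * np.pi / (n + 1))


def spectrum_to_coeffs(eigvals: np.ndarray, degree: int = DEGREE) -> np.ndarray:
    """Fit a Chebyshev series to the sorted spectrum as a function of a normalized index."""
    x = np.linspace(-1.0, 1.0, len(eigvals))
    return C.chebfit(x, np.sort(eigvals), degree)


def coeffs_to_spectrum(coeffs: np.ndarray, n: int) -> np.ndarray:
    """Evaluate the Chebyshev series back on n normalized index points."""
    x = np.linspace(-1.0, 1.0, n)
    return C.chebval(x, coeffs)


small, large = toy_spectrum(200), toy_spectrum(1000)
c_small, c_large = spectrum_to_coeffs(small), spectrum_to_coeffs(large)

# Both systems map to DEGREE + 1 numbers, whatever the number of eigenvalues.
err_small = float(np.max(np.abs(coeffs_to_spectrum(c_small, 200) - small)))
err_large = float(np.max(np.abs(coeffs_to_spectrum(c_large, 1000) - large)))
print(c_small.shape, c_large.shape, err_small, err_large)
```

In this toy setting a regressor would predict `DEGREE + 1` outputs for any system size, and the reconstructed spectrum could then seed an eigensolver. Real spectra are less smooth than this example, so the degree needed in practice is an open parameter here.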
Authors: Nipon Sarmah, Tim Finnigan, Mark Taylor, Tom Vinestock, Ethan Errington, Miao Guo
Abstract: Fermentation‑derived side streams represent an underutilised resource for sustainable protein production. This study investigates the potential of centrate from industrial Fusarium venenatum fermentation as a nutrient source for fungal biomass generation. Following compositional characterisation, a synthetic centrate medium was formulated and evaluated using a Box‑Behnken design combined with response surface methodology. Across 46 experimental runs, cell dry weight (CDW) ranged from 0.22 to 3.87 g per liter, demonstrating a strong dependence on nutrient composition. Ammonia and glucose were identified as the dominant factors influencing biomass production, with significant nonlinear effects. The model predicted a maximum CDW of 4.17 g per liter under optimised conditions, which was experimentally validated at 3.99 g per liter. Carbon conversion efficiency reached up to 29.02%, indicating effective substrate utilisation. These findings demonstrate that fermentation‑derived centrate can support substantial fungal growth, while highlighting its potential to enhance nutrient recovery and influence the biochemical composition of sustainable mycoprotein.
Authors: Elana Simon, Etowah Adams, James Zou
Abstract: Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near‑zero on GPT‑2, over 70% on AlphaFold3 with identical configurations. We find that dimension‑level activation outliers (dimensions whose mean magnitude is large relative to per‑token variation) cause this by shifting pre‑activations at initialization based on each feature's alignment with the activation mean. Features anti‑aligned with the mean receive permanently negative pre‑activations and never fire. We formalize outlier severity as γ = ‖μ‖/‖σ‖; it predicts initial death rates (Spearman ρ = 0.89 for dead‑by‑TopK, 0.82 for dead‑by‑ReLU) across 454 model‑layer combinations spanning language, vision, protein, and genomic models. Dead features can revive during training, but recovery requires the SAE bias to learn the activation mean, a process that is prohibitively slow at high γ. Mean‑centering (subtracting the activation mean) sidesteps this and eliminates outlier‑induced death across all tested models, confirming the mechanism and providing a principled basis for when and why this preprocessing step is necessary.
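The mechanism described above can be reproduced in a few lines on synthetic data. The sketch below computes the outlier severity as the ratio of the norm of the per-dimension mean to the norm of the per-dimension standard deviation, then counts features that are dead under a ReLU at initialization, before and after mean-centering. Everything concrete here is an assumption made for illustration: the Gaussian activations with a single shifted dimension, the random unit-norm encoder with zero bias, and the helper names `outlier_severity` and `dead_fraction`.

```python
import numpy as np


def outlier_severity(acts: np.ndarray) -> float:
    """Norm of the per-dimension mean divided by norm of the per-dimension std."""
    mu = acts.mean(axis=0)
    sigma = acts.std(axis=0)
    return float(np.linalg.norm(mu) / np.linalg.norm(sigma))


def dead_fraction(acts: np.ndarray, encoder: np.ndarray) -> float:
    """Fraction of features whose pre-activation is non-positive on every token."""
    pre = acts @ encoder.T  # shape: (tokens, features)
    return float((pre <= 0).all(axis=0).mean())


rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16))  # synthetic activations: 1000 tokens, 16 dims
acts[:, 3] += 50.0                  # one outlier dimension with a large mean

# Untrained encoder: random unit-norm rows, zero bias (an assumption of this toy).
encoder = rng.normal(size=(256, 16))
encoder /= np.linalg.norm(encoder, axis=1, keepdims=True)

centered = acts - acts.mean(axis=0, keepdims=True)

gamma_raw = outlier_severity(acts)
gamma_centered = outlier_severity(centered)
dead_raw = dead_fraction(acts, encoder)
dead_centered = dead_fraction(centered, encoder)
print(gamma_raw, gamma_centered, dead_raw, dead_centered)
```

In this toy, features whose encoder row points against the shifted dimension stay negative on every token, which is the anti-alignment effect the abstract describes. Subtracting the mean removes the shift, so no feature is dead at initialization.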
Authors: Sawan Patel, Sophia Tang, Yesol Kim, Yinuo Zhang, Divya Srijay, Ping-Jung Lin, Shambhavi Shubham, Fengmei Pi, Cedric Wu, Sherwood Yao, Pranam Chatterjee
Abstract: Therapeutic mRNA design requires coordinating multiple interacting sequence features across the full transcript, where codon usage, untranslated regions (UTRs), and their coupling jointly determine stability, translation efficiency, and protein expression. Here, we present mRNA generation via unrolled trajectories and informed latent updates (mRNAutilus), a framework for simultaneous codon optimization and de novo UTR design directly from sequence. mRNAutilus combines a masked discrete diffusion model trained on millions of full‑length mRNAs with Monte Carlo Tree Guidance to generate Pareto‑efficient sequences under multiple functional objectives, using lightweight regressors over model embeddings to predict half‑life, translation efficiency, and protein abundance. Unlike recent methods that design coding sequences and UTRs separately or rely on post hoc assembly and screening, mRNAutilus generates complete transcripts in a single process optimized across properties. Across diverse targets, zero‑shot mRNAs encoding P. pyralis luciferase achieve over 400‑fold higher expression than wild‑type and outperform commercial and machine learning‑designed baselines, including zero‑shot generative approaches. Zero‑shot SARS‑CoV‑2 Spike mRNAs exceed clinically used and commercial constructs and match or surpass lab‑optimized designs with improved durability. We further demonstrate generality in therapeutic settings, including prime editing (PEMax) and programmable proteome modulation, where mRNAutilus‑designed constructs enhance expression of peptide‑guided E3 ligases (uAbs) for beta‑catenin degradation. These results establish a sequence‑based, multi‑objective framework for generating functional mRNAs tailored to diverse biological applications.
Authors: Keyue Qiu, Yixin Wu, Lihao Wang, Yawen Ouyang, Jixiang Yu, Zihan Zhou, Changze Lv, Dongyu Xue, Yuxuan Song, Xinbo Zhang, Hao Wang, Jiangtao Feng, Zhiqiang Gao, Lijun Wu, Xiaoqing Zheng, Ka-Chun Wong, Lei Bai, Ya-Qin Zhang, Wei-Ying Ma, Dahua Lin, Bowen Zhou, Hao Zhou
Abstract: We present AMix‑2, a protein‑text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix‑2 is built upon two key ideas: (1) a unified protein‑text formulation that embeds natural language and protein sequence in a shared token space, enabling one model to perform biological reasoning and conditional design instead of separate downstream task‑specialized models; and (2) a block‑wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left‑to‑right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time‑aware and homology‑aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein‑specialized models and LLMs. On ProteinArena, AMix‑2 outperforms frontier LLMs and demonstrates competitive performance to task‑specific protein models. Controlled experiments further show that the diffusion‑based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix‑2 and ProteinArena to facilitate open research in protein foundation models.
Authors: Sven Gutjahr, Riccardo De Santi, Luca Schaufelberger, Kjell Jorner, Andreas Krause
Abstract: Adapting generative foundation models, in particular diffusion and flow models, to optimize given reward functions (e.g., binding affinity) while satisfying constraints (e.g., molecular synthesizability) is fundamental for their adoption in real‑world scientific discovery applications such as molecular design or protein engineering. While recent works have introduced scalable methods for reward‑guided fine‑tuning of such models via reinforcement learning and control schemes, it remains an open problem how to algorithmically trade‑off reward maximization and constraint satisfaction in a reliable and predictable manner. Motivated by this challenge, we first present a rigorous framework for Constrained Generative Optimization, which brings an optimization viewpoint to the introduced adaptation problem and retrieves the relevant task of constrained generation as a sub‑case. Then, we introduce Constrained Flow Optimization (CFO), an algorithm that automatically and provably balances reward maximization and constraint satisfaction by reducing the original problem to sequential fine‑tuning via established, scalable methods. We provide convergence guarantees for constrained generative optimization and constrained generation via CFO. Ultimately, we present an experimental evaluation of CFO on both synthetic, yet illustrative, settings, and a molecular design task. Across these evaluations, CFO achieves consistent increases in reward while ensuring high constraint satisfaction, showcasing its practical utility for constrained generative optimization.
Authors: Le Xu, Xi Zhang, Dan Luo, Ting Wang, Xuan Lin
Abstract: Accurate prediction of drug‑target interactions (DTI) is critical for drug discovery. Existing methods often rely on single‑modal representations (e.g., sequences or graphs) or combine only two modalities, overlooking 3D structural features. To address this challenge, we propose TriMod‑DTI, a triple‑modal contrastive learning framework that incorporates 1D sequences, 2D graphs, and 3D structures of drugs and proteins, obtaining the universal and complementary feature representations for DTI prediction. We design a Feature Extractor to capture drug and target features across the three modalities, thereby enriching their representations. We further propose a triple‑modal contrastive learning strategy to align different modal representations of the same drug or protein in the latent space. By constructing cross‑modal positive and negative sample pairs, this approach enhances the model's discriminative ability. Experiments on three benchmark datasets demonstrate that TriMod‑DTI outperforms state‑of‑the‑art methods. The ablation studies validate the contributions of each modality. Moreover, case studies highlight its practical potential for DTI prediction and drug discovery.
Authors: Safwen Ghediri, Guillaume Brysbaert, Fabrizio Cleri, Ralf Blossey
Abstract: Electrostatic interactions are key to the recognition processes of proteins and DNA and have been previously documented for the action of repair enzymes. Uracil‑DNA glycosylase (UDG) is the first in a sequence of enzymes that act in the base‑excision repair process (BER) and whose task is the extraction of uracil bases from nuclear DNA. The question of how the molecule targets uracil bases in chromatin, in particular in the condensed protein‑DNA complexes of nucleosomes, has only recently become a subject of detailed studies. Here we show that the presence of an arginine anchor motif on the N‑terminal tail of UDG can favor its localization on nucleosomes by binding to their acidic patches on their top and bottom surfaces via electrostatic interactions. We argue that this mechanism can play a key role in the detection of uracil defects in nucleosomal DNA.
Authors: Yamini Jangir, Samrat Ghosh, Vinay Nayaka, Mubashir Ali, Dharshan Hegde, Kunal Mooley, Arunima Saha, Hariharan VC, Sujata Malik, Amey Bagare, Saurav Mishra, Mukuljeet Singh Mehrolia, Saravanan Matheswaran, Ashwani Kumar Thakur
Abstract: The spaceflight environment presents unique physicochemical conditions, including microgravity, ionizing radiation, altered fluid transport, and confined engineered habitats, which influence biological systems and biomolecular assembly processes. These conditions also provide opportunities for orbital biomanufacturing and autonomous biofabrication that are difficult to reproduce under terrestrial gravity, motivating the development of compact autonomous experimental platforms for spaceflight research. Here, we present the Modular Astrobiology Experiment (MAEx) platform, a compact 3U spaceflight‑compatible payload designed for autonomous multimodal biological characterization under space‑relevant conditions. MAEx was engineered to operate within the constraints of orbital deployment, including limited volume, low power consumption, thermal regulation, and autonomous data acquisition. To demonstrate platform versatility, representative biological systems spanning cellular and molecular scales were incorporated, including the electroactive bacterium Shewanella oneidensis MR‑1, the radiation‑resistant fungus Ustilago maydis FB1, and the human eye lens protein γD‑crystallin. The MAEx platform integrates imaging, absorption and fluorescence spectroscopy, and electrochemical sensing within a modular architecture, enabling simultaneous monitoring of microbial growth, extracellular electron transfer (EET), and protein aggregation dynamics.
Authors: Aydin Wells, Francis A. Gatsi, Aaron Striegel, Tijana Milenković
Abstract: Protein structure classification (PSC) uses supervised learning to predict a protein's CATH/SCOP(e) class from the protein's sequence or 3D structural feature(s). We previously modeled 3D structures as (static) protein structure networks (PSNs), demonstrating the competitiveness of PSN‑based features with sequence or direct (i.e. non‑network) 3D structural features in the PSC task. More recently, we demonstrated the power of features extracted from dynamic PSNs over features extracted from static PSNs (and thus by transitivity over sequence and direct 3D structural features) in the same task. That dynamic PSN approach used traditional machine learning (ML), combining manual (pre‑engineered) features with an off‑the‑shelf classifier. Here, we evaluate whether automatic deep learning (DL) from the dynamic PSNs yields improvements. Our evaluation on 72 datasets spanning ~44,000 CATH‑ or SCOPe‑labeled dynamic PSNs reveals that in terms of PSC accuracy, traditional ML and DL are (close to) tied for a large majority of the datasets, while DL is on average 10+ times slower. We are the first to evaluate traditional ML vs. DL in the dynamic PSN‑based PSC task.
Authors: Gabrielle Cohn, Rohan Gumaste, Minh Hoang, Vihan Lakshman
Abstract: Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models provide context‑aware representations that could improve alignment sensitivity in this regime. However, prior protein embedding‑based retrieval pipelines often pool these representations into a single vector, potentially obscuring local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model which represents proteins as sets of residue embeddings and uses ColBERT‑style late interaction to test whether residue‑level comparison improves homolog retrieval. ProtoCol encodes proteins independently, keeps candidate representations pre‑computable, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence‑composition, alignment‑based, pooled PLM, and trained single‑vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.
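ColBERT-style late interaction is compact enough to sketch. In the version below, each query residue embedding is matched to its most similar candidate residue embedding by cosine similarity, and the per-residue maxima are summed. This follows the original ColBERT formulation. Whether ProtoCol sums, averages, or otherwise normalizes is not stated in the abstract, and the random embeddings, the 32-dimensional size, and the function name `maxsim` are illustrative assumptions.

```python
import numpy as np


def maxsim(query: np.ndarray, cand: np.ndarray) -> float:
    """Sum over query residues of the max cosine similarity to any candidate residue."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    sim = q @ c.T  # shape: (query_len, cand_len)
    return float(sim.max(axis=1).sum())


rng = np.random.default_rng(0)
query = rng.normal(size=(6, 32))  # 6 residues, 32-dim embeddings (synthetic)

# "Homolog": a longer candidate that contains a noisy copy of 4 query residues.
homolog = np.vstack([
    rng.normal(size=(20, 32)),
    query[:4] + 0.1 * rng.normal(size=(4, 32)),
    rng.normal(size=(10, 32)),
])
decoy = rng.normal(size=(34, 32))  # unrelated candidate of the same length

score_homolog = maxsim(query, homolog)
score_decoy = maxsim(query, decoy)
score_self = maxsim(query, query)
print(score_homolog, score_decoy, score_self)
```

Because candidates are encoded independently of the query, their residue embeddings can be precomputed and stored, with only the similarity matrix and the max-then-sum reduction computed at search time. A shared local motif raises the score even when the remaining residues are unrelated, which is the signal that pooling into a single vector can dilute.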
Authors: Xiao Luo
Abstract: Antibodies play a central role in the immune response by specifically recognizing and neutralizing antigens, and therapeutic antibodies have become major drugs for cancer and autoimmune diseases. However, their discovery still relies on extensive in vitro screening, and accurate computational modeling of antibody structures and antibody‑antigen interactions can prioritize candidates, reduce experimental burden, and accelerate rational design. Despite recent advances in high‑accuracy protein and complex prediction, a persistent performance gap remains for antibody‑related tasks compared with general protein‑protein interactions, limiting downstream design.
This thesis investigates why antibody‑related tasks are harder and proposes improvements along two complementary directions. First, we investigate protein language model (PLM)‑based methods for antibody and antibody‑antigen structure prediction. Using embeddings from multiple PLMs, our approach achieves the best CDR‑H3 accuracy among compared PLM‑based methods on antibody monomer prediction. Extending it to complex prediction does not generalize: without co‑evolutionary signals between antibody and antigen, single‑sequence PLM representations do not reliably identify binding interfaces.
Second, we develop two MSA‑based interventions for antibody‑antigen complex prediction: MSA refinement, which combines CDR‑focused filtering with depth recovery from a larger sequence database, and convergence‑aware recycling, which selects a stable intermediate recycle state for final diffusion sampling. Together, these interventions provide consistent gains over the AlphaFold3 baseline on a held‑out antibody‑antigen test set. Because the methods modify MSA construction and recycling behavior rather than model parameters, they apply without retraining or weight access.
Authors: Vasudha Sharma, Chakresh Kumar Singh, Jayesh Choudhari, Dharmit Nakrani
Abstract: AI is transforming life sciences research at unprecedented speed, accelerating discovery across protein structure prediction, genome modeling, and drug development (Jumper et al., 2021; Mak et al., 2024). Yet this rapid advancement, coupled with the open science movement, introduces significant dual‑use research concerns that have received limited empirical scrutiny. Here we present the first systematic analysis of dual‑use research of concern (DURC) content on open preprint servers. We screened ~52,000 bioRxiv preprints (2024‑2025) using a hybrid pipeline of lexical filtering and large language model (LLM) evaluation, scoring metadata across nine DURC, three PEPP, and five governance categories aligned with U.S. and Australia Group oversight frameworks. Our analysis reveals that dual‑use‑adjacent knowledge is routinely present in openly accessible titles and abstracts, often exceeding established risk thresholds even in studies with legitimate public health objectives. While this mapping captures surface‑level information diffusion, it does not measure operational capability, downstream misuse potential, or the substantial technical and biosafety barriers that constrain harmful application. We argue that institutional review processes, funding requirements, and preprint platform policies must evolve to incorporate proactive, metadata‑level monitoring without compromising scientific transparency. Ultimately, harmonizing controlled‑access mechanisms for high‑risk methodologies with open summaries of scientific contributions offers a pragmatic framework for governing AI‑accelerated biology at scale.
Authors: Shanghua Gao, Ada Fang, Marinka Zitnik
Abstract: Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long‑running experiments. We introduce AutoScientists, a decentralized team of AI agents for long‑running computational scientific experimentation. Agents interpret a shared experimental state, self‑organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language‑model training optimization, and protein fitness prediction. On BioML‑Bench, spanning biomedical imaging, protein engineering, single‑cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits‑per‑byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single‑agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2‑Spike binding that improves over the current state‑of‑the‑art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
Authors: Alper Topuz, Gerhard Gompper, Dmitry A. Fedosov
Abstract: Primary hemostasis is initiated by platelet adhesion and aggregation at a site of vascular injury and is strongly regulated by local hydrodynamic conditions. At elevated shear rates, platelet capture is mediated by von Willebrand factor (vWF), a multimeric protein that undergoes shear‑induced unfolding and becomes adhesive. We investigate early‑stage clot formation under physiological high‑shear‑flow conditions by employing particle‑based mesoscale hydrodynamics simulations with explicitly resolved red blood cells, platelets, and mechano‑sensitive vWF in a microchannel geometry. The model incorporates vWF‑mediated adhesion of platelets to a hemostatic surface, together with non‑periodic inflow‑outflow boundary conditions that allow continuous material supply and transport. We analyze the dynamics of platelet‑vWF aggregation, clot growth dynamics, clot geometry and internal stresses, and thrombo‑embolization across a range of elevated flow rates. Our results demonstrate that clot formation proceeds through the establishment of platelet‑vWF aggregates at the hemostatic site, and that the clot reaches a finite size determined solely by hydrodynamic forces, without invoking biochemical stabilization mechanisms. Beyond a critical size, increased drag from fluid flow leads to recurrent embolization events that limit further growth. These findings highlight the central role of hydrodynamic stresses in regulating primary hemostasis and provide a mechanistic framework for understanding clot stability under physiological flow conditions.
Authors: Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma, Wenxuan Zhong
Abstract: Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real‑world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi‑Teacher Bayesian Knowledge Distillation (MT‑BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher‑informed prior, integrating external knowledge from teacher models and task‑specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy‑based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT‑BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT‑BKD on both synthetic and real‑world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT‑BKD framework.
Authors: Thao Nguyen, Heng Ji
Abstract: We present MolLingo, a multi‑agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM‑based approaches either operate as standalone generative models without access to external tools or lack the multi‑agent coordination and shared memory needed for iterative, evidence‑driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain‑specific tools. To enable effective molecular reasoning, we introduce BRICS‑based Fragment Enumeration (BFE), a synthesis‑aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block‑based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block‑level reasoning and editing that is difficult with raw SMILES alone. As a case study in early‑stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue‑level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT‑5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state‑of‑the‑art results on TOMG‑Bench, surpassing both frontier LLMs and the RL‑based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo‑7450.
Authors: Alberto Calderone
Abstract: Simulation of post‑prandial pharmacokinetics, such as muscle protein synthesis (MPS) through mTORC1 and insulin‑induced glucose uptake, is often challenging due to the computational intensity of the multi‑compartmental approach. In this study, I introduce an in silico metabolic simulator that uses bi‑compartmental Bateman kinetic processes, gamma‑variate distributions, and finite state machine reasoning to solve temporal differential equations instantaneously, generating metabolic curves and predictions depending on input meals. The novel underlying algorithm was custom‑built entirely independent of third‑party libraries or external services. This original computational engine, bridging the gap between academia and the digital health sector, is integrated within a web dashboard and provided as a service via REST APIs. The average response time is approximately 135 ms with a maximum below 750 ms. The multi‑dimensional model was calibrated using a Landmark Validation approach across diverse dietary conditions (Whey Protein, mixed meal, OGTT) and optimized via Grid Search. Ultimately, the system achieved a global physiologically optimal Mean Absolute Percentage Error (MAPE) of ~18% while maintaining an algorithmic complexity of O(n log n).
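For readers unfamiliar with the terms in the abstract above, here is a minimal sketch of the classic Bateman function (first‑order absorption followed by first‑order elimination) and of MAPE. It is not the author's engine. The parameter names (`dose`, `ka`, `ke`, `volume`) and the example values are assumptions for illustration, and the simulator's bi‑compartmental variant may take a different form.

```python
import math

def bateman(t, dose, ka, ke, volume=1.0):
    """Concentration at time t for absorption rate ka and elimination rate ke."""
    if ka == ke:
        raise ValueError("the closed form below requires ka != ke")
    scale = dose * ka / (volume * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

def t_max(ka, ke):
    """Time of peak concentration, where the derivative of the curve is zero."""
    return math.log(ka / ke) / (ka - ke)

def mape(observed, predicted):
    """Mean Absolute Percentage Error, in percent."""
    errors = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Illustrative values only (not taken from the paper).
ka, ke = 1.5, 0.3
peak_time = t_max(ka, ke)
peak_value = bateman(peak_time, dose=100.0, ka=ka, ke=ke)
example_error = mape([100.0, 200.0], [110.0, 180.0])
```

Because the curve has a closed form, evaluating it at any time point needs no numerical ODE solver, which is one plausible reading of "instantaneously" in the abstract.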
Authors: Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner
Abstract: Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross‑Linking Mass Spectrometry (XL‑MS) and Hydrogen‑Deuterium Exchange (HDX‑MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS‑Fold, an inference‑time guided‑diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL‑MS spatial restraints and HDX‑MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS‑Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state‑of‑the‑art models like Boltz‑2. This establishes our framework as a powerful, integrative computational approach for the structure‑based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
Authors: Siddhartha Roy, Rakesh S. Singh
Abstract: Biomolecular condensates play essential roles in cellular processes, and recent efforts have focused on understanding their assembly and rational design principles. In this study, we have employed an explicit‑solvent minimal statistical mechanical model based on the lattice‑gas Hamiltonian with quenched disorder ‑‑ which mimics crowders ‑‑ to investigate how protein‑solvent and protein‑crowder interactions influence condensate phase behavior and morphology. The computed phase diagrams reveal rich behavior, including upper critical solution temperature (UCST), closed‑loop, and reentrant type transitions under varying protein‑solvent interactions at both equilibrium and out‑of‑equilibrium conditions. We elucidated the origin of these phase behavior changes and examined the role of protein‑crowder interactions in modulating condensed phase morphology and stability. We further extended this model to binary protein mixtures where we studied the phase behavior in the presence and absence of quenched disorder. Without disorder, the system exhibits diverse phase‑separated morphologies ‑‑ partially wetted, fully wetted, segregative, and associative ‑‑ with phase boundaries delicately sensitive to protein‑solvent interactions. The introduction of quenched disorder (or crowder) leads to a broader spectrum of complex morphologies, dictated by the interplay among protein‑protein, protein‑solvent, and protein‑crowder interaction parameters. In general, this work underscores that protein‑solvent and protein‑crowder interactions, together with protein‑protein interactions, can act as key regulatory parameters for modulating condensate morphology. These insights may guide future computational and experimental studies of liquid‑liquid phase separation in biomolecular systems aimed at designing stimuli‑responsive condensates.
Authors: Rakesh Das, Tarun Mascarenhas, Nagaraja Chappidi, Simon Alberti, Frank Jülicher
Abstract: Cells deploy robust mechanisms to repair DNA damage, safeguarding genomic stability and cellular health, but the physical principles underlying these processes remain incompletely understood. In vitro experiments show that upon a DNA double‑strand break, a DNA‑protein condensate can tether the broken DNA ends before they disperse away, a critical step for subsequent repair biochemistry. However, it remains puzzling how such condensation reliably achieves spatiotemporal localization at the break site and captures both broken ends despite intrinsic stochasticity. Here, we propose that broken DNA ends can trigger a conversion of proteins from a soluble state to a condensate‑competent state. Combining this idea with Brownian dynamics simulations and theory, we propose a physical mechanism for reliable DNA‑end tethering. Simulations show that such break‑induced conversion can drive local DNA‑protein condensation with two possible outcomes: successful or failed tethering. To rationalize this, we construct an effective free energy landscape, identify the corresponding stationary states, and demonstrate that tethering is governed by a kinetic competition between polymer relaxation and condensation dynamics. Together, our study shows that DNA end‑dependent conversion, coupled with DNA‑protein condensation, can reliably tether broken DNA ends.
Authors: Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song
Abstract: Enzyme‑reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme‑to‑reaction and reaction‑to‑enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text‑Informed Generalized Enzyme‑Reaction Retrieval framework that leverages protein‑to‑text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text‑derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure‑Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state‑of‑the‑art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.
Authors: Zhaohan Meng, Zhen Bai, Ke Yuan, Iadh Ounis, Zaiqiao Meng, Hao Xu, Joseph Loscalzo
Abstract: Protein‑ligand modeling underpins computational drug discovery and molecular design. Existing protein‑ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non‑covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large‑scale protein‑ligand dataset comprising approximately 100k protein‑ligand pairs, together with a benchmark for fine‑grained evaluation. The core fine‑grained task is that of binding‑site localization, which uses protein‑residue and ligand‑atom interaction maps spanning six major types of non‑covalent interactions to assess whether model‑derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity‑controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence‑based and interaction‑aware models, assessing binary binding prediction and binding‑site localization. Results reveal limited binding‑site localization despite strong binary binding prediction, with marked variation across non‑covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein‑ligand models.
Authors: Roman Klypa, Alberto Bietti, Sergei Grudinin
Abstract: The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep learning‑based protein design, there remains significant room to improve the frequency of successful interactions and the authenticity of generated sequences for functional applications. In this work, we frame conditional RNA sequence generation as a multi‑stage alignment problem, introducing Moirain: a suite of models optimized via multimodal supervised fine‑tuning (SFT) and Direct Preference Optimization (DPO). Our approach begins with large‑scale pretraining on diverse RNA corpora to capture the fundamental grammars of sequence plausibility. To achieve target‑specific generation, we employ a multimodal SFT architecture that conditions RNA synthesis on protein structural and sequential features. Finally, we leverage DPO to refine the model using synthetic interaction data: taking advantage of DPO's unique ability to navigate non‑aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain‑Base, Multi, and DPO) demonstrates that our framework consistently produces novel, diverse, and biologically plausible RNA sequences with superior binding affinities compared to existing baselines.
Authors: Jian Xu, Delu Zeng, John Paisley, Qibin Zhao
Abstract: Approximate inference over inducing variables is the central computational bottleneck of Deep Gaussian Processes (DGPs). Existing methods either fit an explicit density q_ϕ(U) by an ELBO (DSVI, IPVI, DDVI, DBVI) or sample by MCMC (SGHMC). We instead frame DGP inference as posterior transport: learn a deterministic sampler that maps a tractable reference measure to posterior‑relevant inducing variables, regularised by a path prior derived from the Doob‑bridged reference diffusion. Our realisation, OM‑Path (formally FBVI‑bridge‑Path), uses Song's probability‑flow ODE applied to DBVI's Doob‑bridged forward SDE; the reference drift is closed‑form from the bridge marginal coefficients (no score matching) and the path regulariser is the Onsager‑Machlup action. At the finite‑ε value used at training, the objective is the negative log unnormalised density of a tempered Doob‑bridge path posterior, and Theorem 1 identifies it with the same posterior's small‑noise MAP path via the Freidlin‑Wentzell LDP. Two strict path‑space ELBO variants on the same bridge backbone (FFJORD log‑det; OM‑regularised CNF) are derived as ablations. Under a matched‑seed paired Wilcoxon test against DBVI on seven UCI regression benchmarks, OM‑Path delivers statistically significant wins on the two largest datasets (power: p = 0.014, NLL 0.012 matching the DSVI baseline of 0.017; protein: p = 0.002, RMSE 0.716 vs. 0.764, NLL 1.086 vs. 1.149), statistical ties on yacht / qsar, and concedes boston / energy / concrete to DBVI on small‑N noisy data. The strict‑ELBO variants do not clear DBVI on any UCI metric: in this regime, reducing the variance of the path objective dominates exact‑density tracking.
Authors: Jaihoon Kim, Taehoon Yoon, Prin Phunyaphibarn, Seungjun Kim, Morteza Mardani, Minhyuk Sung
Abstract: Discrete diffusion models have emerged as powerful frameworks for generating structured categorical data. However, efficiently sampling from reward‑tilted distributions remains a fundamental challenge. While Twisted Sequential Monte Carlo (SMC) offers asymptotic exactness for this task, estimating the optimal twist function in discrete state spaces necessitates costly Monte Carlo approximations, resulting in a severe computational bottleneck at inference. To overcome this limitation, we introduce Contrastive Distribution Matching (CDM), a novel framework that amortizes the cost of SMC inference by learning a parameterized twist function via positive and negative samples. For efficient training, we reformulate the gradient estimator to leverage the closed‑form forward kernels of discrete diffusion models. In practice, evaluating our learned twist function incurs less than 5% additional computational overhead compared to a single forward pass of the base model. Through extensive empirical evaluations, we demonstrate that CDM consistently outperforms existing baselines under matched wall‑clock time. We validate the effectiveness and versatility of our approach across a diverse range of applications, including toxic text generation, regulatory DNA sequence design, protein designability, and diffusion large language model alignment.
Authors: Mohammad R. Rezaei, Rahul G. Krishnan
Abstract: A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi‑Agent Residual In‑Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base‑model fails, hypothesize missing structure from high‑residual examples provided in context, and produce explicit correction terms refined through multi‑turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch‑specific noise, we freeze formulas learned on one experimental batch of the Cell‑Free Protein dataset and apply them (with no retraining and no further LLM calls) to held‑out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.
Authors: Hai-Ling Lu, Yu-Yang Li, Yin-Bi Li, Cun-Shi Wang, A-Li Luo, Jun-Chao Liang, Shuo Li
Abstract: Stellar spectra encode key information on the physical properties and chemical compositions of stars. Accurate stellar parameter determination is essential for addressing major questions such as galaxy and stellar evolution. Large‑scale spectroscopic surveys have accumulated unprecedented spectral data. Traditional feature extraction or model‑fitting approaches struggle with high‑dimensional, massive datasets, limited generalization, and computational inefficiency. Recent advances in large language models demonstrate strong generalization and feature‑learning in tasks like natural language processing, DNA/RNA sequence analysis, and protein/chemical parsing. Stellar spectra are continuous sequential signals, enabling the transfer of language models to stellar spectroscopy. Here, we propose a two‑stage large language model framework for stellar parameter inference, achieving accurate estimation of effective temperature, surface gravity, metallicity, and abundances of ~20 chemical elements. Scaling‑law analyses show systematic performance improvements with increasing data, providing a scalable framework for forthcoming large‑scale surveys.
Authors: Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani
Abstract: LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real‑world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real‑world discovery, limited in scale, or restricted to single‑turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD‑Bench, a challenging, multi‑turn, long‑horizon agentic benchmark consisting of 502 guaranteed‑solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD‑Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD‑Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.
Authors: Kingsley Yeon, Xuefeng Liu, Promit Ghosal
Abstract: Protein‑protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue‑level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis‑guided entropy‑regularized Tree‑of‑Thoughts search. A fine‑tuned language model generates search directives from embedding‑derived features, classifying candidates as high‑priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy‑driven exploration, while hypothesis‑aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis‑conditioned embedding‑space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves mean best‑binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves 91.08 ± 0.19 Micro‑F1, outperforming existing PPI methods on the same dataset.
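The Boltzmann policy mentioned in the abstract above can be sketched generically as a temperature‑scaled softmax over candidate values: a low temperature exploits the best candidate, while a high temperature spreads probability and raises entropy. This is a textbook illustration, not the paper's implementation. The candidate values, the temperatures, and the function names are assumptions, and the directive conditioning and pruning steps are omitted.

```python
import math

def boltzmann_policy(values, temperature=1.0):
    """Softmax over candidate values; temperature controls exploration."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of the policy."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative candidate scores (hypothetical, not from the paper).
values = [2.0, 1.0, 0.5]
cold = boltzmann_policy(values, temperature=0.1)   # near-greedy
hot = boltzmann_policy(values, temperature=10.0)   # near-uniform

# Relative improvement of the reported mean best-binder rank, 47.7 -> 11.2.
rank_improvement = (47.7 - 11.2) / 47.7
```

The last line checks the arithmetic in the abstract: the drop from rank 47.7 to 11.2 is a relative reduction of about 76%.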
Authors: Kyle Higgins, Ivan Laponogov, Dennis Veselkov, Kirill Veselkov
Abstract: Graph neural networks (GNNs) are increasingly used to model biological systems, yet the reliability of post‑hoc explanation methods for recovering meaningful molecular mechanisms remains unclear. Here, we systematically evaluate four widely used approaches: Saliency Attribution (SA), Integrated Gradients (IG), GNNExplainer, and Layer‑wise Relevance Propagation (LRP) for identifying disease‑relevant structure in breast cancer RNA‑seq data projected onto a protein‑protein interaction network. Using synthetic benchmarks with known ground‑truth motifs, we show that explanation methods recover distinct signal organizations: SA performs best for sparse single‑node drivers, whereas IG and LRP preferentially recover distributed pathway‑like and cascade‑like signals. In TCGA BRCA data, we identify a consistent topological signature of disease‑associated hubs in which attribution peaks in the immediate 1‑hop neighborhood and decays across successive network shells, a pattern most pronounced for IG and LRP and associated with strong enrichment of known cancer hubs. We further observe a trade‑off between local hub enrichment and global gene ranking performance, with IG optimizing local enrichment and SA achieving superior global discrimination. Motivated by these complementary behaviors, we introduce a framework combining a shell‑based hub score with consensus ranking across explainers. Consensus scores improve prioritization of canonical cancer genes (TP53, BRCA1, ESR1, MYC), reduce dependence on node degree, and, especially when tuned, outperform individual methods. Pathway enrichment further reveals improved recovery of biologically coherent cancer programs, including ERBB2, RTK, MAPK, immune, and cytokine signaling. Together, these results demonstrate that topology‑aware integration of graph explanations can improve biological interpretability and biologically relevant molecular recovery.
Authors: Mansoor Ahmed, Sujin Lee, Umar Khayaz, Murray Patterson
Abstract: Equivariant graph neural network (GNN) methods for antibody complementarity‑determining region (CDR) design achieve the highest sequence recovery but suffer from severe vocabulary collapse. The current best GNN methods over‑predict very few amino acids, such as tyrosine and glycine, while ignoring functionally important residues. We trace this failure to GNN encoders learning amino acid distributions de novo from limited structural data, discarding substitution patterns encoded in evolutionary databases. To resolve this, we propose EvoStruct, which bridges a frozen protein language model (PLM) with 3D structural context from an E(3)‑equivariant GNN via a cross‑attention adapter. Unlike prior PLM‑structure adapters for general protein design, EvoStruct targets the vocabulary collapse problem specific to CDR design through progressive PLM unfreezing and R‑Drop consistency regularization. On the CHIMERA‑Bench dataset, EvoStruct achieves the highest amino acid recovery and lowest perplexity among several antibody design methods, improving sequence recovery by 16% and reducing perplexity by 43% relative to the best GNN baselines, while recovering 2.3x greater amino acid diversity and the highest binding‑pair correlation with ground truth.
Authors: Sima Soltani, Mehrdad Jalali, Yahya Forghani
Abstract: Protein‑protein interaction networks provide a graph‑level view of cellular organization, yet their functional modules are overlapping, noisy, and difficult to interpret from cluster assignments alone. Existing community‑detection methods can recover candidate protein complexes, but they rarely explain why an individual protein is assigned to a specific module or whether that assignment should be treated as core, peripheral, or uncertain. Here we introduce ECHO‑PPI, an evidence‑bundled framework for interpretable overlapping protein‑module detection in protein‑protein interaction networks. ECHO‑PPI integrates weighted network topology, semantic protein profiles, and Gene Ontology evidence to identify evidence‑potential nuclei, construct candidate modules, perform overlap‑aware assignment, and export hierarchical confidence labels. The framework supports trustworthy computational decision support through assignment‑level interpretability: each protein‑module assignment is accompanied by topology, semantic, and Gene Ontology evidence scores and a hierarchical confidence label, enabling curators to inspect, rank, and triage overlapping module predictions. Evaluation on yeast protein‑interaction data shows that ECHO‑PPI preserves the behaviour of strong overlap‑aware baselines while adding evidence‑bundled auditability. Rather than claiming universal predictive superiority, ECHO‑PPI addresses a complementary need: making overlapping protein‑module predictions inspectable, confidence‑aware, and reproducible for downstream biological interpretation.
Authors: Yingqi Zhao, Kuo Zhan, Pei-Lin Xin, Yuge Liang, Enock Adjei Agyekum, Matti Putkonen, Shuai Li, Francesco De Angelis, Jianan Huang
Abstract: Post‑translational modifications (PTMs) play essential roles in regulating protein structure, function, and cellular signalling. However, peptide‑level discrimination of hydroxylation at the single‑molecule level remains difficult. Here, we report a particle‑in‑pore single‑molecule surface‑enhanced Raman spectroscopy (SERS) platform combined with peak occurrence frequency (POF) analysis and a one‑dimensional convolutional neural network (1D‑CNN) for discriminating hydroxylated and non‑hydroxylated HIF peptide fragments. Three peptide pairs containing the Pro‑564 hydroxylation site, with lengths of 7, 9, and 15 amino acids (AAs), were investigated. POF analysis revealed reproducible hydroxylation‑dependent spectral changes in the 7AA and 9AA peptide pairs, which were attributed to changes in adsorption conformation and surface interactions. CNN‑based classification achieved post‑evaluation accuracies of 72.98%, 78.55%, and 89.74% for the 7AA, 9AA, and 15AA peptide pairs, respectively, with AUC values above 0.80 for all the pairs, indicating reliable discrimination. Gradient‑weighted feature visualization further showed that CNN‑sensitive regions overlapped with recurrent POF features, supporting the chemical relevance of the learned classification patterns. Notably, for the 15AA peptide pair, the enhanced citrate‑associated band suggests that hydroxylation can substantially alter peptide‑gold nanoparticle adsorption behaviour. This adsorption‑mediated effect may amplify hydroxylation‑induced spectral differences and contribute to the improved discrimination accuracy despite the increased structural complexity. These results demonstrate that the particle‑in‑pore sensor, assisted by deep learning, can capture hydroxylation‑induced spectral and adsorption changes in peptide fragments, providing a promising strategy for ultrasensitive analysis of weak PTM signatures in peptides.
Authors: Emma Leonhart
Abstract: Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta‑reduces the whole program ‑‑ primitives, control flow, string I/O ‑‑ to one fused tensor‑op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three‑valued logic, and tail‑recursive loops all lower to tensor operations; the Kleene connectives are Lagrange‑interpolated polynomials exact on the ‑1, 0, +1 truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities ‑‑ three text encoders (nomic‑embed‑text, all‑minilm, mxbai‑embed‑large) and one protein language model (ESM‑2) ‑‑ and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai‑embed‑large, 7.5% on all‑minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy‑rule classifier written in .su trains from random init (18.7 +/‑ 9.5%; chance = 20%, five classes) to 100.0 +/‑ 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e‑7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
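The claim that Kleene connectives can be Lagrange‑interpolated polynomials exact on the ‑1, 0, +1 grid can be checked with a small sketch. It assumes the usual encoding (false = ‑1, unknown = 0, true = +1), under which Kleene AND is `min`, OR is `max`, and NOT is negation. The helper names `basis` and `interpolate` are hypothetical, and the polynomials the Sutra compiler actually emits may be written differently.

```python
from itertools import product

NODES = (-1, 0, 1)  # false, unknown, true

def basis(node, x):
    """Lagrange basis polynomial on {-1, 0, +1}: 1 at `node`, 0 at the other two."""
    if node == -1:
        return x * (x - 1) / 2
    if node == 0:
        return 1 - x * x
    return x * (x + 1) / 2

def interpolate(truth_table):
    """Bivariate polynomial that agrees with truth_table(i, j) on the 3x3 grid."""
    def poly(a, b):
        return sum(
            truth_table(i, j) * basis(i, a) * basis(j, b)
            for i, j in product(NODES, NODES)
        )
    return poly

k_and = interpolate(min)  # Kleene conjunction
k_or = interpolate(max)   # Kleene disjunction

def k_not(a):
    """Kleene negation is already a polynomial: a sign flip."""
    return -a

# The polynomials are exact on the truth grid and smooth in between,
# which is what lets gradients flow through them.
grid_ok = all(
    k_and(a, b) == min(a, b) and k_or(a, b) == max(a, b)
    for a, b in product(NODES, NODES)
)
```

Off the grid the same expressions return intermediate values, for example `k_and(0.5, 0.5)`, so a fuzzy rule written with them stays differentiable.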
Authors: Jai Sharma, Yifan Wang, Bryan Li
Abstract: Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter‑variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground‑truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI‑guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM‑C, where the MI maps recover known structural constraints and enable a 3‑5x reduction in inference‑time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy‑based parallelization methods.
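The quantity the estimator above is trained to predict is pairwise mutual information. As a minimal sketch, MI for one pair of variables can be computed from a joint probability table, here taken to be already conditioned on the unmasked context. The 2x2 tables are toy inputs chosen for illustration, and how the paper obtains joint distributions from the MDM is not specified in the abstract.

```python
import math

def mutual_information(joint):
    """MI (in nats) between two discrete variables given their joint table."""
    p_row = [sum(row) for row in joint]          # marginal of the first variable
    p_col = [sum(col) for col in zip(*joint)]    # marginal of the second variable
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_row[i] * p_col[j]))
    return mi

# Toy tables: an independent pair and a perfectly coupled pair.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_coupled = mutual_information(coupled)
```

Pairs with MI near zero are the conditionally independent ones, which is the property MI‑guided parallel decoding relies on when it unmasks several positions in one step.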
Authors: Roderik Krebbers, Marleen Huisman, Kees van Kempen, Joris Meurs, Amir Khodabakhsh, Simona M. Cristescu
Abstract: Online, comprehensive molecular profiling of exhaled breath provides a non‑invasive window into human metabolism, yet current optical platforms are restricted by narrow instantaneous spectral coverage. Here, we present a novel ultra‑broadband mid‑infrared spectroscopic platform that enables simultaneous, high‑sensitivity detection of a comprehensive profile of breath biomarkers. By integrating an intrapulse difference‑frequency generation (IDFG) supercontinuum source spanning 2.9‑11.5 μm (2580 cm^‑1) with a custom‑built Fourier transform spectrometer, we achieve a spectral resolution of 0.1 cm^‑1 ‑ surpassing current laser‑based approaches. Combined with a standardized online sampling system, the platform achieves sensitivities in the tens of parts per billion over three minutes, resolving dynamic metabolic changes of ammonia, methane, isoprene, acetone, carbon monoxide, and nitrous oxide. We demonstrate the system's utility through proof‑of‑concept case studies tracking responses to fasting, protein intake, and smoking. This calibration‑free platform establishes a powerful and versatile tool for online breath analysis, with broad potential in clinical diagnostics and exposure monitoring.
Authors: Elynn Chen, Jiayu Li, Zheshi Zheng, Jian Pei
Abstract: Tensor‑valued data arise naturally in neuroimaging, genomics, climate science, and spatiotemporal networks, where multilinear dependencies across modes carry information that is destroyed under vectorization. Existing approaches either impose a single low‑rank structure, which can miss localized signal, or treat the tensor as a long vector, which discards its multiway geometry. We propose a Dual‑Channel Tensor Neural Network (DC‑TNN) that decomposes each tensor input into a low‑rank core and a sparse refinement, and processes the two components through coupled neural channels. The framework is structure‑agnostic and accommodates CP, Tucker, and tensor‑train cores within a single architecture. For estimation, we establish non‑asymptotic risk bounds for the DC‑TNN estimator that decompose into network approximation, core estimation, and refinement‑selection terms, and show that the effective dimension is determined jointly by the core rank and refinement sparsity rather than by the ambient tensor size. For inference, we develop a structure‑aware conformal ROC procedure that calibrates within the core‑refinement latent space and produces ROC and AUC confidence bands with finite‑sample, distribution‑free coverage. Building on this, we propose a conformal structure selector that, to our knowledge, is the first distribution‑free procedure for choosing among candidate tensor decompositions with finite‑sample validity. Simulations and an analysis of a protein dataset demonstrate competitive predictive accuracy, reliable uncertainty quantification, and consistent recovery of the tensor structure.
Authors: Doruk Efe Gökmen, Rosalind Wenshan Pan, Tom Röschinger, Stephen Quake, Hernan Garcia, Rob Phillips, Vincenzo Vitelli
Abstract: While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non‑coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our information blueprint algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation‑group techniques, we identify TF binding sites as coarse‑grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for E. coli and discover novel regulatory elements illustrating its deployment at scale across growth conditions.
Authors: Martins Otun
Abstract: Cold‑chain storage limits access to insulin for hundreds of millions of people; a thermally protective patch polymer could help, but the design space is too large for exhaustive experiment. Starting from that problem, we narrow to an agentic workflow: a large language model (LLM) calls physics‑based tools through the Model Context Protocol (MCP), searching the discrete PSMILES space under a budget of OpenMM Packmol‑matrix evaluations. The LLM acts as an implicit acquisition function conditioned on a persistent "discovery world": hypotheses, literature claims, and simulation outcomes updated each iteration. Under matched oracle budgets, the best autonomous campaign reaches an insulin‑polymer interaction energy of ‑2263 kJ/mol, outperforming reinforcement‑learning baselines by 68% and Bayesian optimization by 19%. Three independent campaigns converge on one structural motif (dense hydrogen‑bond donors and acceptors per repeat unit) while physics checks reject infeasible packings and name‑structure mismatches before they steer the next step. The science stage is CPU‑bound and runs on commodity hardware. More broadly, the same architecture and workflow designed here applies to other protein‑stabilization tasks whenever a tractable screening oracle is available.
Authors: Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu, Hao Zhou, Wei-Ying Ma
Abstract: AlphaFold3 introduces a diffusion‑based architecture that elevates protein structure prediction to all‑atom resolution with improved accuracy. This state‑of‑the‑art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single‑step generative model that attains AlphaFold3‑level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.
Authors: Binyamin Perets, Shie Mannor
Abstract: Large‑scale hypothesis testing is central to modern science, where controlling the False Discovery Rate (FDR) has become the standard approach to managing false positives across many simultaneous tests. Hypotheses rarely exist in isolation; they often exhibit structure through proximity, connectivity, or hierarchy. This structure represents both a challenge and an opportunity: while classical methods treat these dependencies as obstacles requiring conservative correction, leveraging them can substantially increase discovery power. Here, we reframe structured FDR control as a regularized learning problem. By optimizing within a suitable Reproducing Kernel Hilbert Space (RKHS), we introduce a framework that unifies continuous domains, graphs, and hierarchies under a single algorithm through kernel choice alone. This formulation enables smooth solutions in place of the piecewise‑constant fits of prior methods, principled likelihood‑based hyperparameter selection rather than heuristic tuning, and inference at unobserved locations which in turn supports sample‑efficient experimental design. Building on this estimator, we provide two decision rules which we prove to control the FDR. We validate our method on two sources: spatial locations derived from high‑dimensional real‑world datasets, and a differential gene expression task utilizing protein‑protein interaction graphs.
Authors: Li Ding, Duanyu Feng, Chen Huang, Yangshuai Wang, Yang Li, Wenqiang Lei, See-Kiong Ng
Abstract: Protein‑Text Question Answering (QA) is crucial for interpreting biological sequences through natural language. Integrating Large Language Models (LLMs) with Retrieval‑Augmented Generation (RAG), which efficiently leverages biological databases and facilitates reasoning, offers a potent approach to this task. However, constrained by the standard RAG pipeline, these models often rely on curated, static datasets instead of expert‑proven biological workflows, lack fine‑grained information processing, and struggle to generalize to novel out‑of‑distribution (OOD) proteins. To bridge this gap, we propose 2D‑ProteinRAG, a novel framework that empowers LLMs to operate within the gold‑standard biological research workflow (BLAST). To further extract high‑quality information from noisy retrieval contexts, we introduce a dual‑dimensional (2D) filtering strategy following expert analytical paradigms. Horizontal Fine‑grained Attribute Alignment utilizes a lightweight, intent‑aware discriminative filter to prune irrelevant metadata and align database entries with specific user queries. Vertical Homology‑based Semantic Denoising resolves functional contradictions and redundancy across multiple homologs via hierarchical clustering. Extensive evaluations on both In‑Distribution and diverse biological OOD benchmarks demonstrate that 2D‑ProteinRAG consistently achieves state‑of‑the‑art performance, outperforming fine‑tuned baselines and other RAG methods. Our results validate the framework's robustness and scalability, providing a practical solution for interpreting protein functions in real‑world scientific scenarios.
Authors: Takayuki Kimura
Abstract: Molecular representation learning has become a central approach in AI‑driven drug discovery, yet existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this work, we introduce VQ‑Atom, a semantic discretization framework that converts continuous atom‑level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries representing chemically meaningful atomic contexts. These discrete tokens define a molecular language suitable for Transformer‑based pretraining.
We evaluate VQ‑Atom in protein‑ligand interaction prediction under a protein‑cold split setting without relying on 3D structural information. Experimental results show that VQ‑Atom consistently improves predictive performance compared to conventional tokenization approaches, suggesting that semantically grounded discretization can substantially enhance molecular representation learning. Our findings indicate that token design itself plays a critical role in enabling effective language modeling for chemistry.
Authors: Thomas Walton, Ayan Goel, Amirali Aghazadeh
Abstract: Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three‑dimensional structural dependencies and long‑range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure‑aware masking strategy that selects groups of residues based on their proximity in three‑dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long‑range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher‑order mutational interactions. Through controlled ablations, we show that these improvements arise from mask placement rather than span size, establishing masking as a positional inductive bias.
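As a rough illustration of proximity-based mask selection, the sketch below picks one seed residue at random and masks its nearest neighbours in 3D space. The function name `proximity_mask`, the single-seed rule, the 15% rate, and the toy coordinates are all assumptions made for illustration; the abstract does not specify how Bucket Masking forms its groups, so this is not the authors' procedure.

```python
import numpy as np

def proximity_mask(coords, mask_rate=0.15, seed=0):
    """Return a boolean mask selecting residues that are close in 3D space.

    Hypothetical rule: choose one seed residue at random, then mask it
    together with its nearest spatial neighbours until `mask_rate` is reached.
    """
    rng = np.random.default_rng(seed)
    n = len(coords)
    n_mask = max(1, int(round(mask_rate * n)))
    center = rng.integers(n)
    # Euclidean distance from every residue to the seed residue
    dists = np.linalg.norm(coords - coords[center], axis=1)
    chosen = np.argsort(dists)[:n_mask]
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask

# Toy coordinates standing in for per-residue C-alpha positions (40 residues)
coords = np.random.default_rng(1).normal(size=(40, 3)) * 10.0
mask = proximity_mask(coords)
```

Unlike random masking at the same rate, every masked position here lies within one spatial neighbourhood, so the masked residues need not be adjacent in sequence.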
Authors: Piotr Jedryszek, Oliver M. Crook
Abstract: Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident prediction reflects recognition of biological evidence or retrieval of a statistical default. We examine this distinction for a near‑universal biological rule, that proteins begin with methionine, by tracing the computational pathway through which ESM2‑8M produces this prediction. The model does not detect methionine at the masked position. Instead, it retrieves a methionine‑favouring signal from a reference representation at the beginning‑of‑sequence token via a position‑specific query assembled across layers, with the final output emerging through competition with context‑dependent circuits. To understand how positional information reaches the readout, we introduce a norm‑direction decomposition of attention scores within rotary frequency bands. Positional encoding operates through coupled changes in query norm and angular alignment distributed across these bands. On sequences whose true N‑terminus is not methionine, where the biological question matters, the model predicts methionine anyway. This is not a correct prediction produced by an unexpected mechanism, but the output of a positional‑prior retrieval circuit that matches the statistical average and fails where biology diverges from it. Distinguishing the two requires resolution at the level of individual circuits, frequency bands, and query composition, suggesting that mechanistic verification will be necessary, and challenging, for predictions where the biological stakes are higher. Even for the simplest biological rule, the model's prediction is mediated by a distributed computational circuit rather than direct recognition, suggesting that increasing task complexity will further obscure the relationship between model confidence and underlying biological evidence.
Authors: Bruno Trentini, Dejan Stancevic, Michael M. Bronstein, Alexander Tong, Luca Ambrogioni
Abstract: For a fixed flow‑based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one‑endpoint diffusion. We derive a conditional‑marginal entropy‑rate objective for bridge‑aware discretization, separating endpoint‑conditioned bridge geometry from marginal flow evolution, and use it to build a training‑free entropic inference‑time scheduler from first principles. For Gaussian Brownian bridges this rate is closed‑form and U‑shaped, motivating boundary‑heavy nonuniform grids. On trained two‑dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10‑step ODE‑Heun MMD over linear by 18.1%, with a paired 22.7% SDE‑Heun improvement in the same low‑NFE sweep. On EDM/CIFAR‑10, the entropic time‑discretization gives the best tested five‑step FID (186.3 ± 4.0 versus 200.5 ± 2.9 for linear and 238.0 ± 5.3 for cosine). On AlphaFlow protein generation, entropic conditional‑marginal (cond‑marg) scheduling shows advantage in low‑NFE regimes on both CAMEO22 and ATLAS benchmarks. These results support entropy‑rate scheduling as a practical low‑budget allocation signal for high‑dimensional bridge and flow samplers.
Authors: Raghavan Thiagarajan, Younes Farhangi Barooji, Poul-Martin Bendix, Mandar M. Inamdar, Jakub Sedzinski
Abstract: Subcellular protein complexes and organelles exhibit diverse dynamic behaviors that reflect the mechanical constraints and organization of the intracellular environment. Although some structures follow classical Brownian motion, many display anomalous dynamics. The transitions between these regimes are increasingly recognized as critical for subcellular organization, yet how they influence pattern formation remains unclear. Here, we investigate the spatial arrangement of cilia on the apical surface of multiciliated cells (MCCs) in developing Xenopus laevis embryos, where coordinated ciliary beating depends on the precise organization of hundreds of centriole‑derived basal bodies (BBs). Using quantitative confocal, high‑resolution and high‑speed TIRF imaging together with theoretical modeling, we show that BB trajectories undergo time‑resolved transitions between diffusive and anomalous motion, with distinct regimes that correlate with apical surface expansion. During the early stages, actin remodeling facilitates the dispersal of BBs by providing a permissive, low‑confinement environment. As development progresses, the actin network becomes increasingly cross‑linked, which constrains BB movement and promotes uniform spacing across the apical domain. Disruption of α‑actinin‑1, a major actin cross‑linking protein, impairs the integrity of the apical actin meshwork, weakens BB confinement, and disrupts regular spatial patterning, ultimately compromising the arrangement of BBs required for proper cilia alignment. Together, we show that progressive apical actin cross‑linking coordinates BB positioning and regulates their dynamic state, guiding the shift from diffusive to confined motion. This transition in dynamics enables the emergence of a uniform BB pattern, which in turn ensures the aligned deployment of motile cilia necessary for effective directional fluid flow.
Authors: Loka Li, Duzhen Zhang, Xingbo Du, Leonard Song, Zixiao Wang, Assanali Aukenov, Noel Thomas, Shakhnazar Sailaukan, Yonghan Yang, Feilong Chen, Jiahua Dong, Kun Zhang, Bin Zhang, Le Song
Abstract: Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task‑specific model training pipelines for heterogeneous and multi‑modal biomedical datasets. BioXArena contains 76 end‑to‑end tasks across 9 domains, including sequence modeling, single‑cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype‑disease modeling, biomedical imaging, and text‑integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held‑out graders, and biology‑aware metrics normalized to a 0 to 1 scale. Agents are required to write executable code, train predictive models, and generate submissions for private test samples. Most tasks involve multiple input modalities, including tabular data, images, natural language, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations in a standardized 2‑hour single‑GPU environment. MLEvolve with Gemini‑3.1‑Pro achieves the highest average score of 0.666, followed by GPT‑5.4 with 0.636, while no single agent consistently dominates across all domains. We additionally perform extensive ablation studies, robustness evaluations, scaling analyses, cost analyses, and failure‑mode investigations to better understand how model backbones, agent scaffolds, inference budgets, and biomedical domains influence BioML coding performance. We will publicly release all benchmark tasks, graders, execution runners, leaderboard results, and agent trajectories.
Authors: Ramon Viñas Torné, Sílvia Fàbregas Salazar, Soyon Park, Ivo Alexander Ban, Artyom Gadetsky, Nikita Doikov, Maria Brbić
Abstract: Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high‑dimensional settings where large‑scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation‑driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood‑based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear‑Gaussian mechanisms, we derive closed‑form expressions for the expected interventional log‑likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state‑of‑the‑art methods on protein signaling and large‑scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty‑based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high‑dimensional perturbation data is achievable through principled search space design.
Authors: Rishabh Dey, Salvina Sharipova, Konstantin Popov
Abstract: Implicit solvent models are widely used to decrease the number of solvent degrees of freedom and enable the calculation of solvation energetics without water molecules. However, their accuracy often falls short compared to explicit models. Recent advancements in neural potentials have shown promise in drug discovery, but transferability remains a persistent challenge. Here, we introduce the Protein Hydration Neural Network (PHNN), an implicit solvent model that extends analytical continuum solvation by learning transferable corrections to model parameters instead of applying post hoc adjustments to final energies. The model is explicitly designed to maximize data efficiency by leveraging physical priors embedded in the data. We demonstrate that PHNN improves accuracy relative to traditional analytical methods and maintains predictive accuracy on out‑of‑domain protein systems.
Authors: Léa Beaulès, Judith Miné-Hattab, Pierre Illien, Vincent Dahirel
Abstract: In living cells, proteins involved in specialized biochemical functions are often spatially organized within biomolecular condensates. Increasing evidence suggests that some of these condensates, including DNA repair condensates, emerge through liquid‑liquid phase separation (LLPS). In the nucleus, however, condensates form within a highly heterogeneous environment composed of chromatin fibers, RNA, and additional protein scaffolds such as PAR chains, all of which may interact with phase‑separating proteins. Moreover, condensate formation is frequently associated with specific chromatin conformations; for instance, loop extrusion has been proposed as a mechanism promoting DNA repair condensates. Here, we investigate how the surrounding fibrous environment controls the morphology and spatial organization of phase‑separated condensates. Using Brownian dynamics simulations of minimal models combining Lennard‑Jones particles with fixed fibrous substrates, we examine the respective roles of local fiber geometry and large‑scale network organization, reflecting the multiscale architecture of chromatin. We show that protein‑fiber interactions strongly influence droplet positioning relative to the substrate, in a manner analogous to wetting transitions in soft condensed matter systems. Both local geometric constraints and global network organization markedly affect droplet size, morphology, and multiplicity. In addition, large‑scale asymmetries in fiber organization can induce robust spatial localization of the dense phase. Our results thus highlight how multiscale structural heterogeneity of the nuclear environment can regulate the emergence and organization of biomolecular condensates.
Authors: Kaiwen Shi, Carlos Oliver
Abstract: Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation‑invariant encoding of variable‑size ensembles, and overcoming sparsity in dynamics data. Trained with a Residual VQ‑VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token‑conditioned ANOVA test on per‑residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero‑shot mutation‑effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.
Authors: Sridhar Mahadevan
Abstract: Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf‑like families of local causal predictive‑state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature‑atlas case studies ‑‑ ocean‑temperature impacts on marine populations, GLP‑1 weight‑loss evidence, and resveratrol/red‑wine health‑benefit claims ‑‑ illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded‑counterfactual case studies ‑‑ a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC‑derived figure data and model code, the canonical Sachs protein‑signaling study with single‑cell perturbation data, and a Nature singing‑mouse study with MAPseq projection matrices ‑‑ show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the
Authors: Sanya Murdeshwar, Sanjit Shashi, Kevin Bachelor, William Noid, Ashwin Lokapally, Razvan Marinescu
Abstract: Coarse‑grained (CG) molecular dynamics enables simulations of atomic systems such as biomolecules at timescales inaccessible to all‑atom (AA) methods, but existing CG neural potentials trained via force matching capture only the gradient of the free‑energy surface, leaving its curvature unconstrained. We introduce a framework that augments force matching with stochastic Hessian‑vector product (HVP) matching, instilling second‑order curvature information into CG potentials without constructing the full Hessian. We derive a decomposition of the target CG Hessian into a model‑independent projected AA Hessian, precomputed once before training, and a model‑dependent covariance correction computed online at negligible cost. We construct an unbiased stochastic estimator of the Hessian‑matching objective by using random probe vectors. We evaluate our method by comparing against force matching on a benchmark of nine fast‑folding proteins unseen during training. HVP matching outperforms plain force matching on 8 of 9 proteins on slow‑mode metrics, with reductions of up to 85% in the Kullback‑Leibler divergence between the CG and reference distributions along the slowest collective mode of the largest protein. Our results demonstrate that higher‑order physical supervision is a practical path to more accurate and transferable CG potentials for biomolecular simulation.
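The random-probe idea can be illustrated with a generic identity: for probe vectors v with zero mean and identity covariance, the expectation of ||(A - B) v||^2 equals the squared Frobenius norm ||A - B||_F^2, so a Hessian mismatch can be estimated from Hessian-vector products alone. The sketch below checks this numerically on small explicit matrices. The function name `probe_hessian_loss`, the Gaussian probes, and the toy matrices are assumptions for illustration; the abstract does not give the paper's exact objective or its covariance correction, and neither is reproduced here.

```python
import numpy as np

def probe_hessian_loss(hvp_model, hvp_target, dim, n_probes, rng):
    """Monte Carlo estimate of ||H_model - H_target||_F^2 using only
    Hessian-vector products with Gaussian probe vectors."""
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(dim)          # zero mean, identity covariance
        diff = hvp_model(v) - hvp_target(v)   # (H_model - H_target) @ v
        total += float(diff @ diff)
    return total / n_probes

rng = np.random.default_rng(0)
# Toy symmetric matrices standing in for the model and target Hessians
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 3.0]])
B = np.eye(3)

estimate = probe_hessian_loss(lambda v: A @ v, lambda v: B @ v, 3, 20000, rng)
exact = float(np.sum((A - B) ** 2))  # squared Frobenius norm, for comparison
```

In a real setting the two callables would be replaced by automatic-differentiation HVPs, so the full Hessian is never formed; here explicit matrices are used only so the estimate can be compared with the exact value.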
Authors: Andrew Y. Zhou, Sharvaree Vadgama, Sumanth Varambally, Peter Eckmann, Michael K. Gilson, Rose Yu
Abstract: Advances in large language models (LLMs) have recently opened new and promising avenues for small‑molecule drug discovery. Yet existing LLM‑based approaches for molecular generation often suffer from high rates of invalid and low‑quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings. In this paper, we introduce ToolMol, an evolutionary agentic framework for de novo drug design. ToolMol combines a multi‑objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population. We build a comprehensive toolbox of RDKit‑backed functions that allows our agentic operator to consistently make precise ligand modifications. ToolMol achieves state‑of‑the‑art performance on multi‑objective property optimization tasks, discovering drug‑like and synthesizable ligands that have >10% stronger predicted binding affinity compared to existing methods, evaluated on three protein targets. ToolMol ligands additionally achieve state‑of‑the‑art results in gold‑standard Absolute Binding Free Energy scores, improving on existing methods by over 35%. By studying chain‑of‑thought reasoning traces, we observe that tool‑calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs.
Authors: Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye
Abstract: Existing reward alignment methods for diffusion and flow models rely on multi‑step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise‑space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth‑order Noise Optimization), a gradient‑free framework that formulates noise optimization as a path‑integral control problem, estimable from zeroth‑order reward evaluations alone. When instantiated with an Ornstein‑Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward‑tilted distribution. ZeNO enables effective inference‑time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
Authors: Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, Rocío Mercado
Abstract: Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain‑specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain‑specific curation task and present an expert‑in‑the‑loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert‑annotated ground truth. A lightweight cross‑validated prompt‑refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record‑level F_1 = 0.98 and transferred to PROTACs by terminology substitution alone, maintaining record‑level F_1 > 0.93. Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay‑context information essential for cross‑study potency comparison and condition‑aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI‑assisted scientific curation more broadly.
Authors: Ziwei Xie
Abstract: Accurately modeling and designing protein complex structures is a central problem in computational structural biology, with broad implications for understanding cellular function and developing therapeutics. This thesis investigates two fundamental aspects of this problem using deep learning: domain‑specific architectures that capture the hierarchical nature of protein structures, and search algorithms that efficiently navigate the vast sequence spaces of protein complexes to identify interacting homologs for improving complex structure prediction and to design protein sequences.
Authors: Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee
Abstract: Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT‑PCR and RNA‑seq provide precise evidence of operon structure, but are laborious and largely limited to well‑studied model organisms, making scalable computational methods essential for genome‑wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre‑trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC‑AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC‑AUC of 0.71, competitive with state‑of‑the‑art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
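The unsupervised baseline described above, scoring a gene pair by the cosine similarity of two independently computed embeddings, can be sketched in a few lines. The vectors below are toy stand-ins for protein language model embeddings and the function name is hypothetical; no DGEB data or model is used.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy embeddings: gene_b points in the same direction as gene_a, gene_c is orthogonal
emb_gene_a = np.array([1.0, 0.0, 1.0])
emb_gene_b = np.array([2.0, 0.0, 2.0])
emb_gene_c = np.array([0.0, 1.0, 0.0])

score_ab = cosine_similarity(emb_gene_a, emb_gene_b)
score_ac = cosine_similarity(emb_gene_a, emb_gene_c)
```

Because the score depends only on direction and not on vector length, it has no trainable parameters; a learned head such as the Siamese MLP would instead fit a decision function over the paired embeddings.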
Authors: Siddhant Dutta, Edward Tan Beng Wai, Soumick Sarker, Pasan Gunawardane, Jagath C. Rajapakse
Abstract: Protein language models such as ESM‑2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural and evolutionary signals are encoded in dense latent spaces. We propose a plug‑and‑play framework that projects ESM‑2 representations onto protein contact graphs and applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel‑softmax substructure pooling, to perform structure‑aware message passing and learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy and 0.898 macro‑F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active‑site residues, spatially localized functional clusters, and catalytic contact patterns. On binding‑site detection, SoftBlobGIN improves residue AUROC from 0.885 using an ESM‑2 linear probe to 0.983, indicating that these structural explanations are not recoverable from language‑model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active‑site residues showing 1.85× higher importance than other blobs (ρ=0.339, p=0.009), without any active‑site supervision. Our framework requires no retraining of the language model, adds only ~1.1M parameters, and generalises across ProteinShake tasks, achieving F_max of 0.733 on Gene Ontology prediction and AUROC of 0.969 on binding‑site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent and auditable.
Authors: Y. Ricardo Espinosa, C. Manuel Carlevaro, C. Gaston Ferrara
Abstract: Understanding the molecular mechanism by which denaturants modulate protein structure remains a central challenge in protein biophysics. In this work, molecular dynamics simulations were employed to investigate the effects of urea on the structural stability of bovine serum albumin (BSA) in its F isoform at pH 3.7, over a broad range of urea concentrations (0 M to a fully urea‑solvated system). The results reveal that urea induces a concentration‑dependent dehydration/rehydration mechanism within the protein hydration shell. At low urea concentrations, a marked reduction in protein/water hydrogen bonds is observed, accompanied by a corresponding increase in protein/urea interactions, consistent with a competitive solvation process. At higher concentrations, urea/urea self‑association becomes significant, limiting direct protein/urea interactions and promoting partial rehydration of the protein surface. Despite these solvent rearrangements, the secondary structure of BSA remains largely preserved, whereas local and tertiary structural features, particularly in Domain III, exhibit increased solvent exposure and conformational flexibility. These findings support a dynamic compensation mechanism in which urea partially replaces water in the solvation shell without fully disrupting the hydrogen‑bonding network. Overall, this study provides molecular‑level insight into the interplay between preferential interactions, solvation dynamics, and protein stability under denaturing conditions.
Authors: Yulin Zhang, He Cao, Zihao Jiang, Chenyi Zi, Zhipeng Zhou, Zijing Liu, Yu Li, Jia Li, Ziqi Gao
Abstract: Designing proteins with desired functions or properties represents a core goal in synthetic biology and drug discovery. Recent advances in protein language models (PLMs) have enabled the generation of highly designable protein sequences, while preference alignment provides a promising way to steer designs toward desired functions and properties. Nevertheless, alignment methods often trigger catastrophic forgetting of pretrained knowledge, degrading basic designability and failing to balance multiple competing objectives. To address these issues, we draw inspiration from On‑Policy Distillation (OPD), an advanced post‑training method renowned for mitigating catastrophic forgetting through its mode‑seeking nature. In this work, we propose ProteinOPD, a multi‑objective preference alignment framework that can effectively balance multiple preference objectives while maintaining the inherent designability of PLMs. ProteinOPD adapts a pretrained PLM into preference‑specific teachers and distills their knowledge into a shared student via token‑level OPD on the student's own trajectories. During this process, the student is aligned to a unique normalized geometric consensus of weighted teachers while ensuring bounded optimization under conflicts. This bridges the gap for OPD in multi‑objective/teacher alignment. Extensive experiments show that ProteinOPD achieves substantial gains on target preference objectives without compromising designability, with an 8x training speedup over RL‑based alignment competitors.
Authors: Nabin Giri, Steven Farrell, Kristofer E. Bouchard
Abstract: Multimodal models that jointly reason over protein sequences, structures, and function annotations within a unified representation hold immense potential for integrating multimodal data and generating new proteins with designed functional properties. To utilize transformer architectures, such models require a tokenizer that converts protein structure from continuous atomic coordinates into discrete representations suitable for scalable multimodal training. The quality of such models is fundamentally upper bounded by the fidelity and expressiveness of the underlying tokenized structure. However, existing tokenizers prioritize reconstruction over generative abilities. To address these gaps, we introduce Yeti, a simple and compact protein structure tokenizer based on lookup‑free quantization and trained end‑to‑end with a flow matching objective for multimodal learning. Compared to existing models, Yeti generally achieves the best codebook utilization and token diversity, and second‑best reconstruction accuracy (with 10x fewer parameters than ESM3) on diverse datasets. To validate Yeti's generative capability, we trained a compact multimodal model jointly over its structure tokens and amino acid sequence entirely from scratch, with no pretrained initialization. The resulting multimodal model generates plausible structures under unconditional cogeneration of protein sequence and structures, achieving comparable results to 10x larger models. Together, these results demonstrate that Yeti is a compact and expressive protein structure tokenizer suitable for training multimodal models that cogenerate highly plausible sequences and structures.
Authors: Sophie L. Wang, Phillip Isola, Brian Cheung
Abstract: How should hidden states generated autoregressively be collapsed into a representation that reflects a language model's internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
Authors: Ziqi Gao, Chenyi Zi, Zijing Liu, Ziqiao Meng, Yu Li, Jia Li
Abstract: Protein‑protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning‑based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length‑3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3‑path‑regularized graph prompt learning method called L3‑PPI, which generates a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3‑PPI reformulates the classification of protein embedding pairs into a graph‑level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug‑and‑play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3‑PPI improves performance over advanced competitors.
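The "L3 rule" mentioned above scores a protein pair by how many length‑3 paths connect it. The sketch below only illustrates that raw count on a toy undirected graph; it is not the L3‑PPI method, which builds a learned prompt graph, and it omits the degree normalization used in some L3 scores. The function name and the edge‑list input format are assumptions made for this example.

```python
def l3_path_count(edges, u, v):
    """Count simple paths u - a - b - v of length 3 in an undirected graph.

    `edges` is an iterable of (node, node) pairs; a, b must be distinct
    from each other and from both endpoints.
    """
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)

    count = 0
    for a in neighbors.get(u, ()):
        if a == v:
            continue  # a direct u-v edge is not part of a length-3 path
        for b in neighbors.get(a, ()):
            if b == u or b == v:
                continue  # keep the path simple
            if v in neighbors.get(b, ()):
                count += 1
    return count

# Toy network: two disjoint routes of length 3 between proteins 0 and 3.
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 3)]
```

Here `l3_path_count(toy_edges, 0, 3)` finds the two routes 0‑1‑2‑3 and 0‑4‑5‑3; under the L3 rule, a higher count suggests a more likely interaction.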
Authors: Jay Shenoy, Miro Astore, Axel Levy, Frédéric Poitevin, Sonya M. Hanson, Gordon Wetzstein
Abstract: Knowledge of a protein's atomic conformational ensemble is critical to determining its function, yet state‑of‑the‑art ensemble prediction models are limited by lack of high‑quality conformational data from simulation or experiment. Recent advances in heterogeneous reconstruction for cryo‑electron microscopy (cryo‑EM) have enabled scientists to visualize ensembles of density maps for larger proteins and complexes not typically accessible through simulation, but building atomic models into these maps remains a challenge. Traditionally, ensemble prediction models are trained via a two‑stage process: experimental density maps are converted into atomic structural ensembles through model building, after which these structures are used to train sequence‑to‑atomic ensemble predictors. In this work, we propose a new principle for fine‑tuning pre‑trained static structure prediction models such as Boltz‑2 directly on raw cryo‑EM maps, bypassing the two‑stage process. We apply this technique to the problem of atomic model building by fine‑tuning Boltz‑2 to generate atomic conformations from an input ensemble of cryo‑EM maps, achieving superior model building accuracy compared to prior work. Beyond overfitting to individual map ensembles, our method, CryoSampler, also shows preliminary evidence of in‑domain generalization after fine‑tuning, sampling diverse atomic conformations for unseen sequences within the same protein family without requiring cryo‑EM data. These capabilities indicate that CryoSampler holds the potential to train next‑generation atomic ensemble prediction models directly on raw cryo‑EM measurements.
Authors: Hanqun Cao, Aastha Pal, Sophia Tang, Yinuo Zhang, Jingjie Zhang, Pheng Ann Heng, Pranam Chatterjee
Abstract: Protein function is often controlled by ligands that bias the direction of state transitions, such as agonists and antagonists, rather than stabilizing a single conformation. This is especially important for clinically relevant G protein‑coupled receptors (GPCRs), where therapeutic efficacy depends on functional directionality. Structure‑based design methods optimize binding to static conformations and cannot represent non‑reversible, directional effects or systematically distinguish agonist from antagonist behavior. To address this gap, we introduce Transition‑Directed Discrete Diffusion for Allosteric Binder Design (TD3B), a sequence‑based generative framework that designs binders with specified agonist or antagonist behavior via a directional transition control objective. TD3B combines a target‑aware Direction Oracle, a soft binding‑affinity gate, and amortized fine‑tuning of a pre‑trained discrete diffusion model, enabling targeted agonist and antagonist generation decoupled from binding affinity and unattainable by equilibrium‑based or inference‑only guidance baselines. The code and checkpoints are available at https://huggingface.co/ChatterjeeLab/TD3B.
Authors: Michal Valko, Richard Pelikan, Miloš Hauskrecht
Abstract: Multiple technologies that measure expression levels of protein mixtures in the human body offer potential for detecting and understanding disease. The recent increase in these technologies prompts researchers to evaluate the individual and combined utility of the data they generate. In this work, we study two data sources that measure the expression of protein mixtures in the human body: whole‑sample MS profiling and multiplexed protein arrays. We investigate the individual and combined utility of these technologies by learning and testing a variety of classification models on the data from a pancreatic cancer study. We show that classification models that work well on one of these two (heterogeneous) datasets individually fail on their combination. We study and propose a class of model fusion methods that acknowledge the differences and try to reap most of the benefits from their combination.
Authors: Congzhou M Sha
Abstract: Background: BP180, also known as collagen XVII and BPAG2 (bullous pemphigoid antigen 2), is a 180‑kDa transmembrane protein within the hemidesmosomal plaque complex, and is known to be a major antigen in bullous pemphigoid, gestational pemphigoid, cicatricial (mucous membrane) pemphigoid, and linear IgA bullous disease.
Objective: At present, the 3D structure of BP180 is not known. The goal is to predict a reasonable structure for BP180 through machine learning and molecular dynamics.
Methods: In this work, we use the recent Boltz‑2 model to predict a putative structure for the intracellular, transmembrane, and proximal extracellular domains, including the NC16A antigenic region and a portion of its first extracellular collagenous domain, Col‑15. We computationally embed BP180 in a simple phospholipid bilayer, demonstrate that the putative structure is stable using molecular dynamics, and analyze its allosteric properties.
Results: The structures presented satisfy symmetry and secondary structure properties which are expected from homology modelling. Over three 500 ns trajectories, there is minor instability of the predicted globular head domain, but the homotrimer otherwise stays mostly folded. The putative NC16A domain is stiff, whereas the truncated Col‑15 domain is highly flexible. There does not appear to be a nearby stable conformation distinct from the initial state.
Conclusion: The structure presented is a useful starting point for targeting BP180 pharmacologically, for further experimental characterization of BP180, and for generating hypotheses regarding the relevant epitopes contributing to bullous disease. Diffusion models such as Boltz‑2 and AlphaFold3 are useful, but their results must be evaluated carefully.
Authors: Xiao Fei, Sarah Almeida Carneiro, Yang Zhang, Lawrence P. Petalidis, Achilleas Tsortos, Costas Bouyioukos, Michalis Vazirgiannis
Abstract: Protein‑protein interaction (PPI) modeling has been widely studied as a binary or multi‑label classification task. While emerging multimodal large language models (LLMs) can now describe single proteins, they remain unable to generate free‑form descriptions of interactions between protein pairs. Moving beyond controlled vocabulary annotations, we propose to model PPI using free‑text descriptions, enabling richer expressiveness, improved interpretability, and better integration with literature knowledge bases. We present PPI2Text, a multimodal LLM for free‑form PPI captioning from amino acid sequences that encodes each protein using the ESM3 encoder, constructs a pair map from the two representations to capture interactions across all residue pairs, and autoregressively generates descriptions using a Qwen3 language decoder. We further introduce PaCo‑RoPE, a coordinate‑aligned positional encoding that aligns each axis of the pair grid with the residue positions of the corresponding protein. In addition, we release PPI2Text‑Dataset, a 351k‑pair corpus of free‑form PPI descriptions aggregated from ten curated biological databases and further synthesized with Gemini under evidence‑tiered prompting. PPI2Text consistently outperforms strong baselines across multiple ablation settings and evaluation protocols. It not only achieves higher scores on linguistic metrics against synthesized references, but also excels on factuality metrics, where an LLM‑based judge evaluates outputs against raw biological evidence.
Authors: Seungik Cho
Abstract: Predicting microbial operon co‑membership requires integrating two complementary biological signals: protein‑scale molecular identity and genome‑context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property: protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein‑to‑genome expert fusion framework that integrates structure‑aware protein representations from ProstT5 with genome‑context representations from Bacformer through a four‑expert Mixture‑of‑Experts module (protein, genome‑context, agreement, and conflict experts) with a learned soft router. Training combines binary cross‑entropy with symmetric cross‑modal InfoNCE alignment and disagreement‑weighted supervised contrastive shaping. We further construct OG‑Operon100K, a 100,000‑pair scaffold‑level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG‑Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5‑only, Bacformer‑only, and Concat MLP baselines. Ablations identify cross‑modal contrastive alignment as the dominant component, and a hard sequence‑conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.
Authors: Vahidullah Tac, Aeneas O. Koosis, Ellen Kuhl
Abstract: Texture shapes how we perceive and like food, yet clear links between mechanical measurements and sensory perception of texture remain elusive. Here we combine sensory data from a blind tasting with 101 participants with mechanical texture profile analysis across six burgers to identify the textural features that drive consumer perception and liking. We compare five burgers, generated with artificial intelligence, with animal‑based, plant‑based, mushroom‑based, and hybrid animal‑mushroom patties, and the classical Big Mac. Three main findings emerge: First, animal‑based burgers occupy a distinctive and coherent sensory‑mechanical region associated with attributes such as firm, fatty, and holds together. Second, mushroom‑ and plant‑based burgers deviate from this region in protein‑dependent ways: mushroom‑based burgers associate with springy and gummy textures, while plant‑based burgers associate with dry, brittle, and crumbly textures. Hybrid animal‑mushroom burgers, however, maintain sensory profiles comparable to fully animal‑based burgers. Third, resilience emerges as the strongest mechanical correlate of perceived meatiness and sensory texture, while stiffness and hardness show no statistically significant association with consumer perception. Texture independently predicts overall liking alongside flavor: increasing texture liking by one point increases overall liking by 0.28. Among all sensory attributes, meatiness is the dominant predictor of texture liking. These findings identify resilience as a promising target for texture engineering and establish texture as a critical design objective for sustainable alternative proteins.
Authors: Kyle Higgins, Guadalupe Gonzalez, Dennis Veselkov, Ivan Laponogov, Kirill Veselkov
Abstract: Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high‑throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI‑Net, a hierarchical graph neural network that integrates protein‑protein interaction (PPI) networks with pathway‑level representations to model disease from molecular interactions to functional processes. Patient‑specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi‑layer Reactome hierarchy using graph attention, enabling aggregation of gene‑level signals into higher‑order biological programs. Across RNA‑seq data from ten cancer types from The Cancer Genome Atlas, PPI‑Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA‑seq data from breast cancer demonstrated that PPI‑Net's integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI‑only model, while hierarchical multi‑level supervision improved balanced accuracy by 12.3% relative to using only a single top‑level prediction head. Applying a multi‑omics approach using RNA‑seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53‑AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.
Authors: Nicolas Menet, Andreas Krause, Abbas Rahimi
Abstract: Balancing exploration and exploitation is a core challenge in sequential decision‑making and black‑box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback‑Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty‑aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre‑trained backbone while maintaining diversity through independent Low‑Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL‑regularized Thompson sampling and thus inherits strong cumulative regret bounds of \mathcal O(\sqrtT γ_T). Empirically, we demonstrate that POETS achieves state‑of‑the‑art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off‑policy settings with experience replay or in small dataset regimes.
Authors: D. Andrini, D. Riccobelli, L. Gazzera, S. Molteni, P. Metrangolo, P. Ciarletta
Abstract: Self‑encapsulated droplets floating at an oil‑air interface undergo striking shape changes during evaporation, including flattening and localized loss of membrane tension leading to crumpling and wrinkling. Here we combine experiments, modeling and simulations to obtain predictive morphological maps. We perform contact‑angle and evaporation experiments on water droplets coated by a hydrophobin protein film and floating in a fluorinated oil, providing reference profiles and volume‑loss sequences for quantitative validation. We develop an axisymmetric mechanics framework in which equilibria follow from minimization of a total free energy combining surface energies, membrane strain energy and gravitational potential, subject to volume and contact‑line constraints. A quasi‑convex tension‑relaxation rule accounts for compression‑free states and enables coexistence of taut, wrinkled (one principal tension vanishes) and crumpled (both vanish) membrane domains. A finite element algorithm computes quasi‑static morphing under volume reduction; key parameters are identified by fitting the reference contact‑angle profile and then used without further tuning. The model reproduces the experimentally observed shape evolution and resolves the associated stress redistribution. Systematic parameter scans yield morphological phase diagrams governed by the Bond number, the oil‑droplet surface‑tension ratio and the density ratio. For buoyant droplets, crumpling relocates between exposed and submerged caps as parameters vary; for heavy droplets, a crossover to circumferential wrinkling along the immersed sidewall emerges. Wall‑meniscus variations shift phase boundaries and can suppress bottom crumpling, consistent with wall‑affected experiments.
Authors: Kapil Goswami, Peter Schmelcher
Abstract: Combinatorial optimization problems play a central role in computer science with many real‑world applications. A number of relevant problems remain computationally difficult to solve as they lie in the NP‑hard complexity class. We present a unified framework for solving such optimization problems represented in the quadratic unconstrained binary optimization (QUBO) formalism, namely two‑SAT, XOR‑SAT, mixed‑two‑XOR‑SAT, set packing, quadratic assignment, binary clustering, and protein folding, by expanding the domain of applications of Phys. Rev. Research 6(2), 023031. As a first step, we demonstrate a direct mapping from the QUBO form of these problems onto the Rydberg quantum platform. This mapping to the Rydberg system depends on distance‑dependent long‑range interactions and configurable local detuning, thus reducing resource overhead and improving scalability. Following the encoding, the solution is reached by steering the system toward the ground state of the target Hamiltonian using an optimized quantum annealing protocol that controls the time‑dependent detuning and Rabi frequency profiles. The framework can handle a variety of problems, each with different complexity. To quantify the complexity of any problem, a generalized hardness parameter is introduced that compares different problems based on the structure of their optimization landscapes. This is a proceedings contribution to the Athens Workshop in Theoretical Physics: 10th Anniversary, held at the National and Kapodistrian University of Athens on December 17‑19 2025.
Authors: Dan Ofer, Dafna Shahaf, Michal Linial
Abstract: Protein language models are trained primarily with masked language modeling (MLM), which predicts amino‑acid identities at masked positions. We ask whether latent‑space prediction can complement these token‑level objectives under matched wall‑clock budget. Across pretrained and random‑init protein sequence encoders at 35‑150M parameters, we find that the best protein‑JEPA design is not all‑position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross‑entropy. We call this recipe masked‑position MLM+JEPA. On a 16‑task downstream suite (15 frozen linear probes plus SCOPe‑40 zero‑shot fold retrieval), under matched wall‑clock budgets, this recipe wins more tasks than it loses against MLM‑only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2‑35M and 11/2/3 on ESM2‑150M, while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β‑lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe‑40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide‑HLA Binding. All‑position MLM+JEPA matches MLM‑only overall but does not reproduce the masked‑position gains. JEPA‑only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall‑clock budgets.
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Weili Wang, Ed H. Chi, Shivaram Venkataraman, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
Abstract: Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt‑elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task‑specific search dynamics. We introduce PACEvolve++, an advisor‑model reinforcement learning framework for test‑time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non‑stationary feedback, we propose a phase‑adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group‑relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best‑of‑k frontier contribution to support stable refinement. Across expert‑parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state‑of‑the‑art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test‑time training during evolutionary search.
Authors: Mattia Corigliano, Kuheli Biswas, Matteo Bocchiola, Daniele Montagnani, Ariel Amir, Marco Cosentino Lagomarsino
Abstract: Our understanding of cell division control in bacteria still relies largely on interpreting correlations between phenomenological variables, with limited connection to the underlying molecular mechanisms.
Here, we analytically solve a stochastic threshold‑accumulation model in which a size‑dependent divisor protein triggers division upon reaching a noisy, autocorrelated threshold, quantifying within a unified framework the combined effects of intrinsic and extrinsic noise and key mechanistic parameters such as protein reset and threshold memory. We show that incorporating these elements yields behavior far richer than the commonly assumed adder, spanning a continuum of division strategies from timer to sizer while modulating size fluctuations in a nontrivial fashion. Comparison with single‑cell E. coli data shows that extrinsic noise and additional mechanistic ingredients are required to account for the observed size fluctuations. The adder emerges when threshold correlations balance protein reset, generalizing the hypothesis that full reset is necessary to maintain adder control.
Our results establish a unified analytical framework linking stochastic molecular processes to emergent division laws, to be used in more complex bacterial cell‑cycle models.
Authors: Zhongmou Chao, Poompol Buathong, Ekaterina Selivanovitch, Susan Daniel, Peter I. Frazier
Abstract: Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non‑functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive‑unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly‑observed protein variant in a well‑surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non‑functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo‑PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well‑surveilled single‑organism sequence data. On three prediction tasks using single‑organism uniform‑coverage surveillance data (predicting results from held‑out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS‑CoV‑2 variants), Evo‑PU outperforms standard PU learning, one‑class classification (OCC), and protein language models (PLMs). On prediction tasks from multi‑organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
Authors: Dan Ofer, Oriel Perets, Michal Linial, Nadav Rappoport
Abstract: Protein language models (pLMs) produce per‑residue representations that capture evolutionary and structural information, yet their mean‑pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine‑tuning framework for adapting pLMs into general‑purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein‑pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein‑protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k‑nearest‑neighbor probe to measure embedding neighborhood quality. On ESM‑2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe‑40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe‑40. Contrastive fine‑tuning restructures the embedding space to better capture protein function and structure, without any task‑specific supervision. We release the models, public data, training recipe, and code.
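The in‑batch contrastive objective named in this abstract can be sketched roughly as follows. This is a minimal pure‑Python illustration of a multiple‑negatives ranking loss, not ProtSent's released code: the function names, the scale of 20, and the toy three‑dimensional embeddings are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    positives[i] is the true partner of anchors[i]; every other entry of
    `positives` serves as an in-batch negative. The scale value is an
    illustrative assumption.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # negative log-softmax of the true pair
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): each anchor is close to its own positive.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
shifted = aligned[1:] + aligned[:1]  # every anchor is now paired with a wrong partner

aligned_loss = multiple_negatives_ranking_loss(anchors, aligned)
shifted_loss = multiple_negatives_ranking_loss(anchors, shifted)
```

With correctly paired embeddings the loss is close to zero, while pairing each anchor with the wrong partner makes it large, which is the pressure that pulls related proteins together in embedding space.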
Authors: Justin Sanders, Luca Giancardo, Lan Guo, Yue Zhao, Kemal Sonmez, Nina Cheng, Melih Yilmaz
Abstract: Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for antibody sequence design, existing approaches largely suffer from two key limitations: they predominantly memorize germline sequences rather than modeling biologically meaningful somatic variation, and they offer limited support for flexible classifier‑guided conditional generation. We address these challenges through two primary contributions. First, we demonstrate that discrete diffusion fine‑tuning achieves strong language modeling performance on antibody sequences while allowing for generation conditioned on any off‑the‑shelf classifier. Second, we introduce germline absorbing diffusion, a novel modification of the discrete diffusion noise process in which the germline sequence ‑ rather than a masked sequence ‑ serves as the absorbing state. This biologically motivated inductive bias restricts the model to learning the trajectory from germline to observed sequence, effectively excluding genetic variation and V(D)J recombination statistics from the learned distribution and dramatically mitigating germline bias. We show that germline diffusion improves non‑germline residue prediction accuracy from 26 percent to 46 percent, approaching the theoretical upper bound set by true biological variability. We then demonstrate the utility of our germline diffusion model on the conditional generation tasks of sampling antibodies with improved hydrophobicity and predicted binding affinity. On both tasks our model shows an improved tradeoff between class adherence and sample quality, significantly outperforming EvoProtGrad, a popular strategy to sample from pLMs with gradient‑based discrete Markov Chain Monte Carlo.
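One way to picture the germline absorbing state is as a forward corruption step in which residues fall back to their germline identity instead of to a mask token. The sketch below is only an illustration under stated assumptions: the function name, the linear absorption probability, the requirement that the two sequences are pre‑aligned to equal length, and the toy sequences are all hypothetical and are not taken from the paper.

```python
import random

def germline_absorbing_noise(observed, germline, t, rng):
    """Corrupt `observed` toward `germline` at noise level t in [0, 1].

    Each position independently jumps to its germline residue with
    probability t. Assumes pre-aligned, equal-length sequences and an
    absorption probability linear in t (both illustrative assumptions).
    """
    if len(observed) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(g if rng.random() < t else o for o, g in zip(observed, germline))

# Toy strings (hypothetical, not real antibody sequences).
observed = "QVQLVESGAEVKKPGA"
germline = "QVQLVQSGAEVKKPGS"
rng = random.Random(0)

clean = germline_absorbing_noise(observed, germline, 0.0, rng)      # no corruption
absorbed = germline_absorbing_noise(observed, germline, 1.0, rng)   # fully germline
partial = germline_absorbing_noise(observed, germline, 0.5, rng)    # mixture of both
```

Positions where the observed and germline residues already agree are unaffected by this corruption, so a denoiser trained against it only has to account for the somatic differences.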
Authors: Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit
Abstract: Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three‑dimensional microenvironment rather than sequence identity alone. Protein language models and emission‑band averages capture global trends, but do not model how local physical signals act on specific chromophore regions.
We present a chromophore‑centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature‑CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel‑signal‑region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non‑identity features are used for band‑specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531‑protein benchmark, the method achieved the best random‑CV performance among model‑based baselines (R = 0.772 +/‑ 0.008, MAE = 0.131 +/‑ 0.002), exceeding Band mean (R = 0.632), ESM‑C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top‑K screening. Stable selected features recovered band‑specific mechanisms: aromatic packing and clamp asymmetry in GFP‑like proteins, charge/clamp balance in Red proteins, and flexibility‑risk/bulky‑contact features in Far‑red proteins.
Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
Authors: Andreia F. Silva, James A. Richards, Fiona Jeffrey, Rory E. O'Neill, Daniel J. M. Hodgson, Christopher Ness, Wilson C. K. Poon
Abstract: The existence and origin of the ductile to brittle transition in non‑Brownian suspensions and pastes is underexplored despite the ubiquity of such materials in practical applications. We demonstrate the phenomenon in candies of sugar crystals in a water‑protein‑fat matrix prepared by boiling a sugar‑cream‑butter mixture (known as 'fudge' in some countries). As cooking time or final cooking temperature increases, we observe a transition from a fluid to a ductile solid, then to a brittle solid that abruptly fractures in compression. We propose that this is driven by rising solid sugar crystal volume fraction, and indeed find the same sequence of behaviour in a suspension of non‑Brownian calcite particles as the solid fraction moves from frictional jamming to random close packing. Particle‑based simulations reveal the sensitivity of the observed phenomenon to boundary conditions.
Authors: Rahul Nandakumar, Ben Fauber, Deepayan Chakrabarti
Abstract: Drug discovery seeks molecules (ligands) that bind strongly and selectively to a target protein. However, fewer than 5% of candidate ligands pass the bar for even the early stages of drug discovery. Furthermore, we want methods that work for novel proteins for which we have no prior data. Starting from scratch, we have to iteratively select and test candidate ligands such that we find enough ligands of the desired quality in as few tests as possible. Our proposed algorithm, named SPADE, introduces a novel approach to ligand selection that requires only 40 tests on average to find 10 high‑quality ligands. In one‑vs‑one comparisons, SPADE outperforms deep learning and Bayesian optimization methods on more proteins, achieving median improvements of 7%‑32% in sample efficiency. SPADE is also 10x faster than its closest competitor at scoring candidate drugs. The dataset and code are available at https://anonymous.4open.science/r/SPADE_Fast_Drug_Discovery_by_Learning_from_Sparse_Data‑F028/README.md
Authors: Atreya Dey, Guang Shi, Ryota Takaki, D. Thirumalai
Abstract: Structural Maintenance of Chromosomes (SMC) complexes are energy‑consuming motors that are important in folding the genome by loop extrusion (LE) in all stages of the cell cycle. Single‑molecule magnetic tweezer pulling experiments have revealed that condensin, a member of the SMC family involved in mitosis, takes occasional backward steps, thus coughing up the gains in the length of the extruded loop. To reveal the mechanism of the forward and backward steps simultaneously, we developed a theory using the stochastic kinetic model and the scrunching mechanism for LE. The calculations quantitatively account for the measured force‑dependent step size and dwell time distributions in both directions. By postulating the existence of an intermediate state in the ATP‑driven cycle that is poised to take a forward or a backward step, we predict that its lifetime increases as the external mechanical force increases up to a critical value and subsequently decreases at higher forces. The surprising finding of a lifetime increase in an active motor, at sub‑piconewton forces, is characteristic of catch bonds, known in force‑induced rupture of several passive protein complexes. The identification of catch bond‑like states in condensin not only expands our understanding of LE but also highlights the significance of mechanical forces in regulating genome organization.
Authors: Zheng Ma, Jiazhen Chen, Lei Xin, Ali Ghodsi
Abstract: The integration of deep learning approaches in biomedical research has been transformative, enabling breakthroughs in various applications. Despite these strides, its application in protein inference is impeded by the scarcity of extensively labeled datasets, a challenge compounded by the high costs and complexities of accurate protein annotation. In this study, we introduce GraphPI, a novel framework that treats protein inference as a node classification problem. We treat proteins as interconnected nodes within a protein‑peptide‑PSM graph, utilizing a Graph Neural Network‑based architecture to elucidate their interrelations. To address label scarcity, we train the model on a set of unlabeled public protein datasets with pseudo‑labels derived from an existing protein inference algorithm, enhanced by self‑training to iteratively refine labels based on confidence scores. Contrary to prevalent methodologies necessitating dataset‑specific training, our research illustrates that GraphPI, due to the well normalized nature of Percolator features, exhibits universal applicability without dataset‑specific fine‑tuning, a feature that not only mitigates the risk of overfitting but also enhances computational efficiency. Our empirical experiments reveal notable performance on various test datasets and deliver significantly reduced computation times compared to common protein inference algorithms.
Authors: Alessandro Micheli, Silvia Sapora, Anthea Monod, Samir Bhatt
Abstract: Many machine learning problems involve data supported on curved spaces such as spheres, rotation groups, hyperbolic spaces, and general Riemannian manifolds, where Euclidean geometry can distort distances, averages, and the resulting optimal transport (OT) problem. Existing manifold OT methods have pursued amortized out‑of‑sample maps, while entropic regularization has made discrete OT more scalable, but these advantages have remained largely disjoint. We propose Entropic Riemannian Neural Optimal Transport (Entropic RNOT), a unified framework that combines intrinsic entropic OT with amortized out‑of‑sample evaluation on Riemannian manifolds. Our method learns a single target‑side Schrödinger potential through a neural pullback parameterization, recovers the induced Gibbs coupling, and uses the resulting conditional laws to construct intrinsic transport surrogates. These include barycentric projections on Cartan‑Hadamard manifolds and heat‑smoothed conditional surrogates on stochastically complete manifolds, the latter turning possibly atomic target laws into absolutely continuous ones. For fixed regularization \varepsilon > 0, we prove that the proposed hypothesis class recovers the entropic optimal coupling in strong probabilistic metrics. As consequences, barycentric surrogates converge in L^2, while heat‑smoothed surrogates are stable at fixed heat time and asymptotically unbiased as the heat time vanishes. The guarantees hold for compactly supported data on possibly noncompact manifolds. Empirically, our method matches or improves over Euclidean, tangent‑space, and log‑Euclidean baselines on benchmarks over \mathbb{S}^2, \mathrm{SO}(3), \mathrm{SPD}(3), \mathrm{SE}(3), and \mathbb{H}^2, scales favorably relative to discrete manifold Sinkhorn, and in a protein‑ligand docking application, refines poses on \mathrm{SE}(3) without retraining or per‑instance optimization.
Authors: Emil Sharafutdinov, Ingemar André
Abstract: Ancestral sequence reconstruction (ASR) aims to infer extinct protein sequences at internal nodes of a phylogenetic tree. Classical ASR methods are typically based on continuous‑time Markov substitution models, but they treat sites largely independently and handle insertions and deletions only weakly or not at all. We introduce a tree‑conditioned edit‑flow model for variable‑length ASR. Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor through paired bidirectional edit trajectories constrained to agree on a common ancestral state. On a benchmark of experimentally evolved sequences with only context‑independent substitutions, the model does not match the accuracy of the best classical method, yet still achieves reasonable performance despite being trained on natural sequences that include insertions, deletions, and substitutions. On a benchmark of natural homologous sequences with abundant insertions and deletions, the model most accurately localizes inferred evolutionary change.
Authors: Cong Liu, Milong Ren, Jiaqi Guan, Chengyue Gong, Jinyuan Sun, Xinshi Chen, Wenzhi Xiao
Abstract: Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non‑standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput‑aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet‑lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier‑dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open‑source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per‑sequence success rates, ProtDBench incorporates throughput‑aware metrics based on a fixed 24‑hour budget, as well as cluster‑level success criteria to account for structural diversity. Together, these results expose systematic differences, induced by filtering rules, success definitions, and throughput‑aware evaluation, in how methods balance computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
Authors: Chris Sainsbury, Feng Dong, Andreas Karwath
Abstract: Sparse autoencoders (SAEs) have been applied to large language models and protein language models, but not systematically to electronic health record (EHR) foundation models. We train TopK SAEs on FlatASCEND, a 14.5‑million‑parameter autoregressive clinical sequence model, at all 10 residual stream extraction points on INSPECT (outpatient) and MIMIC‑IV (ICU). SAE decomposition reveals progressive abstraction across transformer depth: layer‑0 features are near‑perfect token detectors (45.7% singleton), while layer‑6 features span approximately 30 token types across multiple clinical categories (0.5% singleton). Under full‑sequence simple linear probes, SAE features outperform dense representations for discrete event prediction (mortality) while dense representations outperform for continuous magnitude prediction (length of stay) ‑ a probe‑level representational phenomenon that does not extend to clinically relevant leakage‑safe windows, where dense representations match or exceed SAE features across all tested settings (eICU‑CRD 48‑hour AUC: SAE 0.871 versus dense 0.880; base model zero‑shot, SAE dictionaries trained on eICU activations; MIMIC‑IV: 0.836 versus 0.914; INSPECT 1‑year/3‑year: 0.697 versus 0.800). A delta‑mode intervention method reduces SAE perturbation noise by 86x, enabling cleaner feature‑level experiments; the resulting perturbation effects are larger than random controls in 3 of 4 conditions, though not formally significant. Feature reproducibility across random seeds is 21%, and individual features should be interpreted as illustrative rather than stable.
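For readers unfamiliar with the TopK variant, the sketch below shows one forward pass of a TopK sparse autoencoder on a single activation vector. It is a generic illustration, not the authors' implementation: the function name, the separate encoder and decoder biases, the ReLU applied to the kept units, and the toy weights are all assumptions.

```python
def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode.

    w_enc: one row of input weights per latent unit.
    w_dec: one row of output weights per latent unit.
    Returns (latents, reconstruction).
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    # Indices of the k largest pre-activations; all other latents are zeroed.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    latents = [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]
    recon = [
        b + sum(w_dec[j][d] * latents[j] for j in range(len(latents)))
        for d, b in enumerate(b_dec)
    ]
    return latents, recon

# Toy dictionary (hypothetical): 2-dimensional input, 3 latent units, k = 1.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

latents, recon = topk_sae_forward([2.0, 1.0], w_enc, b_enc, w_dec, b_dec, k=1)
```

Because sparsity is enforced by the top‑k selection itself, the number of active features per input is fixed by k rather than tuned through a penalty weight.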
Authors: Roberto Netti, Emily Hinds, Francesco Calvanese, Rama Ranganathan, Martin Weigt, Francesco Zamponi
Abstract: Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data‑driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum‑entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low‑entropy counterparts. Furthermore, comparative analyses reveal that high‑entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high‑entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
Authors: Chiara Vercellino, Giacomo Vitali, Paolo Viviani, Alberto Scionti, Olivier Terzo, Bartolomeo Montrucchio
Abstract: We present our work on effectively representing unit‑disk graphs on the registers of neutral atom quantum machines. Specifically, we aimed to embed graphs corresponding to proteins and cellular antenna networks into unit‑disk graphs, ensuring compatibility with the registers of two real QPUs: Orion Alpha by PASQAL and Aquila by QuEra. To address machine‑specific constraints, we made adjustments and integrated Distance Encoder Networks (DEN) from our previous work. Despite these challenges, we successfully embedded up to 76% of protein‑representing graphs for a quantum machine learning classification task on the Aquila QPU, and all subgraphs derived from 90 antenna geographical positions in Turin, Italy, on the Orion Alpha QPU. In the latter case, the graphs represented instances of the graph coloring problem, which we tackled using the hybrid quantum‑classical algorithm BBQ‑mIS. These promising results underscore the effectiveness and versatility of our embedding approach for representing unit‑disk graphs on neutral atom quantum computers across diverse applications.
Authors: Souvik Mondal, Michael A. Sauer, Matthias Heyden
Abstract: Understanding protein conformational dynamics is essential for elucidating biological function but remains challenging due to the wide range of timescales and the complexity of collective motions. Enhanced sampling methods overcome timescale limitations of conventional molecular dynamics, yet their effectiveness depends on the choice of collective variables (CVs), which are often difficult to define and may lack physical interpretability. In particular, collective variables derived from machine learning or collective vibrational modes can efficiently capture slow dynamics but are not easily mapped onto intuitive structural descriptors. Here, we present a fully automated framework that transforms enhanced sampling trajectories into human‑readable representations of protein dynamics. Our approach combines enhanced sampling along CVs derived from frequency‑selective anharmonic mode analysis with a post hoc analysis of biased trajectories using weighted dynamic cross‑correlation matrices. From these, we identify residue pairs and domains exhibiting correlated and anti‑correlated motions, yielding simple domain‑domain distances that serve as physically interpretable CVs. We apply this method to five proteins, including KRAS and HIV‑1 protease, and show that it consistently identifies biologically relevant domains and motions without prior system‑specific knowledge. Projection onto these distances produces free energy surfaces that reproduce known conformational states with low statistical uncertainty while maximizing independent dynamical information. This workflow enables systematic recasting of complex CVs into simple geometric descriptors without loss of essential dynamics. Its generality and automation make it broadly applicable for interpreting enhanced sampling simulations and generating interpretable conformational ensembles for integration with emerging machine learning approaches.
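The weighted dynamic cross‑correlation matrix at the centre of this workflow can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it assumes frames are already superimposed on a common reference, that each frame carries one non‑negative reweighting factor from the biased simulation, and that the function name and toy trajectory are hypothetical.

```python
import math

def weighted_dccm(frames, weights):
    """Weighted dynamic cross-correlation matrix.

    frames: list of frames, each a list of (x, y, z) positions per residue.
    weights: one non-negative weight per frame.
    C[i][j] = <dr_i . dr_j>_w / sqrt(<|dr_i|^2>_w * <|dr_j|^2>_w)
    """
    total = sum(weights)
    w = [wt / total for wt in weights]
    n = len(frames[0])
    # Weighted mean position of every residue.
    mean = [[sum(wt * f[i][a] for wt, f in zip(w, frames)) for a in range(3)]
            for i in range(n)]
    # Displacement of every residue from its mean, per frame.
    disp = [[[f[i][a] - mean[i][a] for a in range(3)] for i in range(n)]
            for f in frames]

    def cov(i, j):
        return sum(wt * sum(d[i][a] * d[j][a] for a in range(3))
                   for wt, d in zip(w, disp))

    return [[cov(i, j) / math.sqrt(cov(i, i) * cov(j, j)) for j in range(n)]
            for i in range(n)]

# Toy trajectory (hypothetical): residues 0 and 1 move together along x,
# residue 2 moves in the opposite direction.
frames = [[(t, 0.0, 0.0), (5.0 + t, 0.0, 0.0), (10.0 - t, 0.0, 0.0)]
          for t in (-1.0, 0.0, 1.0, 2.0)]
weights = [1.0, 1.0, 2.0, 1.0]
C = weighted_dccm(frames, weights)
```

Entries near +1 mark residues that move together and entries near -1 mark anti‑correlated motion, which is the information used to group residues into domains and pick domain‑domain distances.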
Authors: Chaoran Cheng, Jiaqi Guan, Milong Ren, Chengyue Gong, Cong Liu, Xinshi Chen, Ge Liu, Wenzhi Xiao
Abstract: We present A‑CODE, a fully atomic unified one‑stage protein co‑design model that simultaneously refines discrete atom types and continuous atom coordinates. Unlike predominant two‑stage methods that cascade structure design with amino acid‑level sequence design, our approach is fully atomic within a unified multimodal diffusion framework, in which residue identities are inferred solely from atom‑level predictions. Built upon the powerful all‑atom architecture, A‑CODE achieves superior designability for unconditional protein generation, outperforming all existing one‑stage and two‑stage design models. For binder design, A‑CODE rivals and even outperforms existing state‑of‑the‑art two‑stage design models and, compared with the existing one‑stage co‑design model, achieves a drastic tenfold improvement in success rate on hard tasks. The inherent flexibility of our atomic formulation enables, for the first time, seamless adaptation to non‑canonical amino acid (ncAA) modeling. Our fully atomic framework establishes a new, versatile foundation for all‑atom generative modeling that can be naturally extended to complex biomolecular systems.
Authors: Sushovan Majhi, Atish Mitra, Žiga Virk, Pramita Bagchi
Abstract: We introduce PLACE (Persistence‑Landmark Analytic Classification Engine), a closed‑form pipeline for classifying point clouds and graphs through their persistent‑homology signatures. Three quantitative guarantees ‑‑ a margin‑based excess‑risk rate, a closed‑form descriptor‑selection rule, and a per‑prediction certificate ‑‑ are derived from training labels alone, with no learned weights or held‑out calibration. The embedding sums Mitra‑Virk single‑point coordinate functions over a sparse landmark grid; the closed‑form weight rule w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2 maximizes the distortion slope in Mitra‑Virk's affine certificate under ν‑coherence. (i) An O(kR/(Δ\sqrt{m_{\min}})) margin bound, driven by class‑mean separation Δ and embedding radius R, matched in the sample‑starved regime m \lesssim R/Δ by a Le Cam minimax lower bound. (ii) The Mahalanobis margin under Ledoit‑Wolf‑shrunk covariance is the strongest closed‑form ranker on a 64‑descriptor chemical‑graph pool (mean Spearman ρ = +0.56 across 11 benchmarks, positive on 10 of 11); the isotropic surrogate Δ/\sqrt{\ell} admits a closed‑form selection‑consistency rate on the homogeneous protein/social pools. (iii) A training‑time‑decided certificate, with no per‑prediction overhead, in three concrete radii (Pinelis, Gaussian plug‑in, and variance‑aware Pinelis‑Bernstein). Empirically, PLACE is the strongest diagram‑based method on Orbit5k and matches the strongest topology‑based baseline within statistical noise on MUTAG and COX2; remaining gaps fall into two diagnosable regimes (descriptor blindness on NCI1/NCI109; pool‑coverage limits elsewhere). The Pinelis‑Bernstein radius fires on 8 of the 12 benchmarks; on MUTAG the empirical and population nearest‑centroid rules agree on every one of 940 held‑out test predictions, validating the certificate's mechanism.
Authors: Zixi Shao, Tao Wang, Yibei Xiao, Tianyi Huang
Abstract: Designing therapeutic messenger RNA (mRNA) requires creating full‑length transcripts that carefully balance stability, translation efficiency, and immune safety. To address this challenge, we propose ProMORNA, a multi‑objective generation framework that produces complete mRNA transcripts de novo directly from a target protein sequence. Our approach begins by training a BART‑style encoder‑decoder model on over 6 million natural protein‑mRNA pairs. We then introduce Multi‑Objective Group Relative Policy Optimization (MO‑GRPO) to simultaneously optimize for various biological objectives in a unified way. As a case study, we evaluated ProMORNA on the widely used firefly luciferase target, excluding it from both our supervised training data and the prompt pool. The results indicate that ProMORNA improves the in silico Pareto frontier for predicted half‑life and translation efficiency relative to standard supervised baselines. Additionally, it achieves higher predicted functional scores than a state‑of‑the‑art baseline under the same evaluation pipeline. These computational findings demonstrate the feasibility of using multi‑objective reinforcement learning for full‑length mRNA design on unseen targets.
Authors: Kiyoharu Kawana, Kyosuke Adachi
Abstract: Liquid‑liquid phase separation underlies phenomena ranging from protein condensate formation to the phase coexistence of synthetic polymers. Although the random phase approximation (RPA) is widely used to predict such phase behavior, its quantitative accuracy for binodals of polymer solutions, particularly outside the high‑density regime, remains incompletely characterized. Here, we develop a field theoretic loop expansion in homopolymer systems by identifying the inverse polymer density ρ^‑1 as the Planck constant \hbar in quantum field theory. We calculate the leading‑order and next‑to‑leading‑order corrections to the RPA free energy, denoted as RPA+ and RPA++, respectively. Testing the binodal predicted by the RPA+ against molecular dynamics simulations of bead‑spring chains with Gaussian pair interactions, we find that the RPA+ qualitatively improves the dilute‑phase coexistence density over the RPA, while the critical point error remains comparable to that of the RPA. Our results establish the loop expansion as a systematic route for refining the RPA‑based binodal predictions for polymer phase separation.
Authors: Kenneth M. Merz, Akhil Shajan, Danil Kaliakin, Fangchun Liang, Yuichi Otsuka, Tomonori Shirakawa, Lukas Broers, Han Xu, Miwako Tsuji, Mitsuhisa Sato, Seiji Yunoki, Ryo Wakizaka, Yukio Kawashima, Jun Doi, Toshinari Itoko, Hiroshi Horii, Thaddeus Pellegrini, Javier Robledo Moreno, Kevin J. Sung, Ella Fejer, Robert Walkup, Seetharami Seelam, Mario Motta
Abstract: Ab initio wavefunction methods provide accurate molecular simulations but their computational scaling restricts applications to small systems. We develop a workflow combining quantum embedding to decompose a molecule into fragments with a heterogeneous quantum‑classical (HQC) method to simulate fragments. We sample fragment electronic configurations on two 156‑qubit quantum processors (ibm_cleveland, ibm_kobe), using up to 94 qubits, running 9,200 circuits for over 100 hours, collecting 1.3 × 10^9 measurement outcomes ‑ the most resource‑intensive HQC computation for quantum chemistry to date. We compute fragment wavefunctions via optimized subspace diagonalization on two supercomputers (Fugaku, Miyabi‑G), achieving 72.5% parallel efficiency with scalable distributed linear algebra kernels. We simulate two protein‑ligand complexes spanning dispersion‑ and electrostatics‑dominated regimes (11,608 and 12,635 atoms), demonstrate >40× increase in system size and up to 210× improvement in accuracy over the previous state‑of‑the‑art, with HQC matching coupled‑cluster (CCSD) accuracy in fragment energies, and establish a scalable pathway for systematically improvable biomolecular simulations.
Authors: Xinrui Chen, Yizhen Luo, Siqi Fan, Zaiqing Nie
Abstract: De novo functional protein design aims to generate protein sequences that realize specified biochemical functions without relying on evolutionary templates, enabling broad applications in biotechnology and medicine. Existing approaches adopt either direct function‑to‑sequence mapping or decoupled structure‑sequence generation strategies but often fail to achieve functionality and foldability simultaneously. To address this, we propose CodeFP, a Co‑generative protein language model for de novo Functional Protein design that simultaneously decodes sequence and structure tokens, thereby enabling superior simultaneous realization of functionality and foldability. CodeFP utilizes functional local structures to enrich functional semantic encodings, overcoming the suboptimal translation of flat encodings into structure tokens, while introducing auxiliary functional supervision to alleviate training ambiguity stemming from the one‑to‑many structure‑to‑token mapping. Extensive experiments show that CodeFP consistently achieves average improvements of 6.1% in functional consistency and 3.2% in foldability over the strongest baseline.
Authors: Xinyou Wang, Liang Hong, Jiasheng Ye, Zaixiang Zheng, Yu Li, Shujian Huang, Quanquan Gu
Abstract: Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large‑scale sequences, and discrete diffusion‑based protein language models (e.g., DPLMs) are promising for both understanding and generation. However, existing DPLMs typically rely on masking‑based absorbing diffusion that contradicts a simple biological intuition: proteins evolve through accumulated edits, not by emerging from masks. Consequently, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, limiting both optimization‑style post‑editing and flexible guided generation. To address these limitations, we present DPLM‑Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM‑Evo decouples an upsampled‑length latent alignment space from the variable‑length observed sequence space, which makes indel‑aware generation tractable and enables adaptive scaffold growth throughout the process with negligible computational overhead. To better align substitutions with real evolution, we further introduce a contextualized evolutionary noising kernel that produces biologically informed, context‑dependent mutation patterns. Across tasks, DPLM‑Evo improves sequence understanding and achieves state‑of‑the‑art mutation effect prediction performance on ProteinGym in the single‑sequence setting. It also enables variable‑length simulated evolution, and post‑editing/optimization of existing proteins via explicit edit trajectories.
Authors: Robson Christie, Cerys Murray, Youngchan Kim, Jaewoo Joo
Abstract: We quantify the excitonic coupling in the homodimer of dimeric Venus fluorescent protein using a quantum‑classical hybrid workflow. Employing a transition‑density coupling formalism, we calculate J = 74.38 cm^{-1}, which is 5.6 times stronger than the far‑field point‑dipole estimate of 13.31 cm^{-1}. This disparity highlights the critical role of near‑field multipolar effects at the 27.6 Å chromophore centroid separation. Furthermore, we argue that a separation of timescales resolves the apparent theoretical tension between robust experimental excitonic couplings and the highly decoherent biological environment. While it has been hypothesised that the fluorescent protein β‑barrel scaffold sustains coupling by attenuating thermal fluctuations, we emphasise that the separation of timescales fundamentally applies irrespective of the exact degree of environmental noise suppression. Collective photoexcitation imprints the Davydov splitting under optical‑limit dielectric screening upon absorption, preceding bulk solvent relaxation and sub‑picosecond environmental dephasing. To characterise the subsequent post‑absorption evolution, we employ stochastic simulations for quantum parts to model the transition from a delocalised exciton superposition to incoherent hopping between localised chromophore states.
Authors: Ang Liu, Jingsong Shang, Jiangang J. Du, Shyamsunder Erramilli, Pritiraj Mohanty
Abstract: Sensitive biomarker detection in physiological fluids is often limited by Debye screening, which suppresses electrostatic signals at sensor surfaces. Here we report a sensing approach based on flexoelectric resonance in silicon nanowire field‑effect transistors. An applied radiofrequency field induces strain gradients in the nanowires, generating flexoelectric polarization that is amplified at resonant frequencies. This effect enhances the sensitivity of conductance measurements to small surface charge variations associated with biomolecular binding. Using C‑reactive protein as a model biomarker, we observe an order‑of‑magnitude improvement in detection sensitivity compared to conventional operation, with a 62% conductance increase versus 30% without radiofrequency modulation. The high‑frequency field also perturbs the electrical double layer, reducing Debye screening in high‑ionic‑strength environments. These combined effects enable direct biomarker detection without sample dilution. This work establishes flexoelectric resonance as a general strategy for improving nanoscale biosensing performance in physiologically relevant conditions.
Authors: Anshika Dhiman, Sanbo Qin, Huan-Xiang Zhou
Abstract: Salts are an integral part of the environment for living systems and, therefore, understanding their effects on proteins and other biomolecules is of fundamental interest. Small‑angle X‑ray scattering (SAXS) of protein solutions can provide valuable information on salt effects, but extracting this information has been a significant challenge. For example, SAXS data of bovine serum albumin (BSA) at various salt concentrations were fit to three different spherical models. Here we combined the newly developed FMAPIq approach with explicit‑solvent all‑atom molecular dynamics simulations to show that the complex effects of salt on the SAXS of BSA originate from the interplay of ions and hydration water, leading to a general picture of protein‑ion‑water interactions.
Authors: Raviteja Anantha, Nick Levato, Layne C. Price
Abstract: Parameter‑efficient fine‑tuning (PEFT) methods face a tradeoff between adapter size and expressivity: ultra‑low‑parameter adapters are confined to fixed low‑rank subspaces, capping performance even with extended training. We propose BoostLoRA, a gradient‑boosting framework that overcomes this limit by iteratively training and merging minimal adapters on the examples the current model gets wrong. A ROTATE SVD basis strategy assigns each round to an orthogonal subspace, so cumulative effective rank grows linearly with the number of rounds while each adapter remains ultra‑low‑rank. After merging, adapters are discarded, leaving zero inference overhead. On Qwen2.5‑3B, BoostLoRA reaches 89.1% on GSM8K and 68.8% on MATH‑500, surpassing both the best single‑shot ultra‑low parameter adapter (TinyLoRA) and full fine‑tuning; on code generation it reaches 57.2% on MBPP and 80.4% on HumanEval while full fine‑tuning drops below the zero‑shot baseline. We also demonstrate cross‑architecture transfer on protein binding classification with ESM2‑650M and cross‑entropy training. BoostLoRA is, to our knowledge, the first PEFT method whose effective rank grows with training, separating per‑round parameter cost from total representational capacity.
Authors: Cherif F. Matta
Abstract: The chemical bond is a central organizing concept in chemistry, yet it is absent from the molecular Hamiltonian and no "bond operator" exists. Bonding is therefore not a primitive physical entity but a derived descriptor emerging from the quantum state. The logical consequences of this observation are revisited. Statements such as "bonding stabilizes structure" when taken literally risk circular reasoning (petitio principii), whereby bonding is inferred from a stationary structure and then invoked as its cause. The same caution applies to concepts such as steric repulsion, which is also a derived descriptor. Bonding accompanies stable or metastable states and correlates with their properties without constituting their cause. Illustrative examples are drawn from QTAIM, non‑covalent interaction (NCI) approach, protein structure, and hydrogen‑hydrogen bonding. Causation, language, and the autonomy of chemistry are also briefly discussed. The aim is not at all to diminish the role of bonding, but to place it at the correct logical level, that is, as a powerful, state‑dependent descriptor that organizes, classifies, and predicts chemical behavior without serving as its fundamental cause.
Authors: Nicodemo Mazzaferro, Willmor J Pena Ccoa, Pilar Cossio, Glen M. Hocky
Abstract: Several recent methods have shown that it is possible to compute rate constants of very slow biomolecular processes using simulations where a time‑dependent bias is added along one or several collective variables (CVs). We previously reported the exponential average time‑dependent rate (EATR) method, which can improve upon these approaches by accounting for how efficiently the external biasing potential modifies the observed rate using a learned CV‑quality factor γ. This results in more accurate rate estimates using the same data when biasing a suboptimal coordinate. However, as formulated, EATR depended on the biasing potential varying over time to properly determine the biasing efficiency, which limits the method's applicability to quasi‑static biasing schemes such as "flooding" or on‑the‑fly probability enhanced sampling (OPES). Here, we present the EATR‑flooding approach, which generalizes our method by replacing the time‑dependent bias with a biasing potential whose strength is varied (stepped up) across multiple sets of simulations. We implement this approach as an open‑source Python library, and demonstrate that it is accurate without substantial loss of efficiency compared to standard EATR for a coarse‑grained protein system, and also show good performance on a fully atomistic cavity‑ligand model. Two additional appealing features of EATR‑flooding are an internal check for over‑biasing and the fact that only a single γ parameter is predicted for a given choice of CVs, as compared to our earlier results where γ empirically depended on biasing rate. Finally, we believe EATR‑flooding applies not only to OPES simulations but more generally to CV biasing enhanced sampling approaches, making it broadly useful.
Authors: Truman Yu Ng, Yuzhu Wang, Wei Jie Chan, Ruizhe Shen, Tianqi Chen, Ching Hua Lee
Abstract: Knots and links represent a fundamental motif of non‑local connectivity that permeates the physical sciences from string theory to protein folds. While spectral braiding has been explored in two‑band non‑Hermitian models across various platforms, its direct simulation and characterization on programmable quantum hardware, particularly beyond two strands, remains a formidable challenge due to the limitations of variational optimization in these systems. Here, we introduce a family of non‑Hermitian multi‑band twister models and implement a non‑variational protocol to characterize their complex braided band structures on a programmable superconducting quantum processor. By mapping the winding of eigenstates to the spectral topology, we devise an efficient measurement strategy that extracts braid information, including braid words and knot invariants like the Alexander and Jones polynomials, without requiring full spectral tomography or repeated optimization. We experimentally demonstrate the reconstruction of complicated knots and links such as the Hopf chain and Solomon's knot. Our approach provides a general framework for investigating exotic non‑Hermitian topology on near‑term quantum devices, opening a route to simulate more sophisticated topological structures in knot theory.
Authors: Alejandro Gomez Cadavid, Pavle Nikačević, Pranav Chandarana, Sebastián V. Romero, Enrique Solano, Narendra N. Hegade, Miguel Angel Lopez-Ruiz, Claudio Girotto, Hanna Linn, Hakan Doga, Evgeny Epifanovsky, Panagiotis Kl. Barkoutsos, Ananth Kaushik, Martin Roetteler
Abstract: We report the largest trapped‑ion hardware demonstration of lattice protein‑folding optimization to date, using bias‑field digitized counterdiabatic quantum optimization (BF‑DCQO) on a fully connected 64‑qubit Barium development system similar to the forthcoming IonQ Tempo line. Six peptide sequences with 14‑16 amino‑acid residues are encoded using a coarse‑grained tetrahedral lattice model, yielding higher‑order spin‑glass Hamiltonians with long‑range interactions involving up to five‑body terms and mapped to 46‑61 qubits. The resulting instances are demanding for near‑term quantum hardware because low‑energy configurations must satisfy backbone‑geometry constraints while optimizing dense residue‑contact interactions. BF‑DCQO uses a non‑variational bias‑feedback mechanism, where low‑energy samples from each round define longitudinal fields that guide subsequent quantum evolutions. Across the studied instances, BF‑DCQO shifts raw sampled energy distributions toward lower energies than uniform random sampling, with the strongest improvements appearing in residue‑contact variables. To preserve this signal, we introduce a consensus‑based post‑processing pipeline that combines quantum‑learned contact information with feasible backbone geometries. The resulting hybrid workflow reaches the classical reference energy in multiple instances and improves over the corresponding random‑seeded pipeline. These results show that BF‑DCQO can generate structured samples for dense protein‑folding Hamiltonians at previously unexplored trapped‑ion scales.
Authors: Felipe Silva Carvalho, Steven Ramsey, Tom Kurtzman, Tyler Luchko
Abstract: Molecular dynamics (MD) simulations are widely used to study biological systems, where water molecules often play a critical role in protein‑ligand interactions. In conventional MD preparation protocols, water molecules are typically added from a pre‑equilibrated solvent box and removed using conservative steric cutoffs, an approach that can eliminate important interfacial waters that are often not recovered during equilibration due to kinetic barriers limiting exchange with bulk solvent. In this work, we present an automated and computationally efficient method for placing water molecules around biomolecular solutes using three‑dimensional reference interaction site model (3D‑RISM) solvent density distributions. By identifying regions of high solvent probability, the method generates physically meaningful initial hydration structures without requiring extended sampling or specialized techniques such as grand canonical Monte Carlo (MC) or hybrid MC/MD approaches, and will be released as an update to AmberTools 26, enabling seamless integration into standard MD preparation pipelines. We validate the approach on a diverse set of protein‑ligand complexes with crystallographically resolved bridging waters, showing that 3D‑RISM‑based placement reproduces a large fraction of these experimentally observed waters, while subsequent minimization further improves agreement as crystallographic waters relax toward positions consistent with those predicted by our approach. Overall, this method enables more accurate and practical initialization of interfacial hydration, improving the reliability of MD simulations with modest computational cost relative to routine system preparation.
Authors: Vigneshwari Karunakaran Annapoorani, Ian Rouse, Vladimir Lobaskin, Nicolae-Viorel Buchete
Abstract: Accurate quantification of protein‑nanoparticle interactions is essential for applications in nanobiotechnology, nanomedicine, and drug delivery. Motivated by recent computational and experimental work, we combine coarse‑grained united‑atom (UA) models with molecular docking to characterize protein adsorption on SiO2 nanoparticles. We construct orientation‑resolved heatmaps in which polar and azimuthal angles uniquely specify the relative protein‑nanoparticle pose, and the map amplitude reports binding propensity via the minimum UA adsorption energy or the docking score. Each angular bin corresponds to a distinct docked complex, enabling systematic comparison of binding geometries across models. To relate docking score landscapes to Boltzmann‑averaged UA adsorption energetics, we analyze eight birch pollen allergen proteins previously studied experimentally. Similarity between the two orientational distributions is quantified using the Jensen‑Shannon divergence (JSD). We find encouraging agreement between the two approaches in several cases, while also identifying limitations and routes for improvement, including optimized angular resolution and iterative refinement of interaction parameters. Overall, this framework provides a quantitative bridge between coarse‑grained energetics and docking outputs at protein‑nanoparticle interfaces, supporting improved predictive modeling and mechanistic insight into protein‑nanoparticle binding landscapes.
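As an illustration of the similarity measure named in the abstract above, the following is a minimal sketch of the Jensen‑Shannon divergence between two orientation heatmaps. It assumes each heatmap has already been converted to non‑negative weights per angular bin (the abstract does not state how energies or docking scores are mapped to probabilities), and the function names `normalize`, `kl_divergence`, and `jsd` are hypothetical, not taken from the paper.

```python
import math


def normalize(heatmap):
    """Flatten a 2D grid of non-negative weights into a probability vector."""
    flat = [value for row in heatmap for value in row]
    total = sum(flat)
    if total <= 0 or any(value < 0 for value in flat):
        raise ValueError("heatmap must hold non-negative weights with a positive sum")
    return [value / total for value in flat]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; bins with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Assumed toy data: two 2x2 (polar x azimuthal) maps of binding weights.
ua_map = normalize([[4.0, 1.0], [1.0, 2.0]])
docking_map = normalize([[3.0, 2.0], [1.0, 2.0]])
divergence = jsd(ua_map, docking_map)
```

With the base‑2 logarithm the value is bounded between 0 and 1, which makes it convenient for comparing maps across proteins with different energy scales.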
Authors: Sayan Maity, Tristan A. Mauck, Ulrich Kleinekathöfer
Abstract: In the theory of open quantum systems, spectral densities are key quantities for modeling the dynamics and spectroscopic properties of the system under investigation. In the case of light‑harvesting complexes, they encode the frequency‑dependent coupling of electronic excitations in pigment molecules to their environment, reflecting contributions from both intrinsic vibrational modes and the protein surrounding. In particular, the low‑frequency components of the spectral densities are crucial for exciton transfer between pigment molecules. Apparently, slow internal modes of bacteriochlorophyll molecules in the gas phase are less well represented by common force fields based on classical molecular dynamics (MD) simulations. Here, we demonstrate that Born‑Oppenheimer molecular dynamics (BOMD) based on the numerically efficient density functional‑based tight‑binding approach can accurately recover these low‑frequency features, whereas normal mode analysis captures them only partially. In contrasting approaches for determining spectral densities, the low‑frequency region of the spectral densities obtained is only associated with protein fluctuations; the usage of BOMD, however, also captures the low‑frequency contributions arising from slow intramolecular vibrations of the pigment molecules themselves. Notably, this behavior is consistently observed for both the flexible B800 and the more rigid B850 rings in light‑harvesting 2 (LH2) complexes of purple bacteria, as well as in the Fenna‑Matthews‑Olson (FMO) complex of green sulfur bacteria. Interestingly, we also find that the spectral densities of the pigments in the B850 ring of LH2 are not influenced by the environment, i.e., the gaps between ground and first excited state are not changed significantly by the fluctuations of the protein environment.
Authors: Haocheng Tang, Liang Shi, Ya-Shi Zhang, Xixian Liu, Jian Tang, Jiarui Lu
Abstract: Protein dynamics underlie many biological functions, yet remain difficult to characterize due to the high computational cost of molecular dynamics simulations and the scarcity of dynamic structural data. This survey reviews recent advances in artificial intelligence for protein dynamics from three perspectives: learning from structural ensembles and trajectories, learning from physical energy signals, and learning to accelerate molecular simulations. We summarize representative methods for conformation ensemble generation, trajectory generation, Boltzmann generators, physics‑aware adaptation, machine learning potentials, coarse‑grained modeling, and collective variable discovery. We further discuss available datasets and key open challenges, such as scalability, thermodynamic consistency, kinetic fidelity, and integration with experimental constraints.
Authors: Sayan Ghosh, Amitav Sahu, Stephanie Gonzalez-Migoni, Thomas L. C. Jansen, Vivek Tiwari
Abstract: Action‑detected two‑dimensional electronic spectroscopy (A‑2DES) could potentially be a versatile chemical tool with applicability across a range of photophysical observables such as photocurrent, photoionization, or fluorescence. However, a prominent absence of excited state energy/charge transfer dynamics signals in archetypal photosynthetic proteins has suggested severe limitations of A‑2DES in probing large aggregates where sensitivity to excited state dynamics is proposed to go down as 1/N, where N is the aggregate size. We report measurements of energy transfer dynamics in a cyanobacterial protein through both conventional and fluorescence 2DES (F‑2DES), where the dynamics reported by F‑2DES are quite prominent and comparable to those measured by conventional 2DES. Analysis of our experiments combined with coarse‑grained simulations of the spectra suggests that the 1/N limit argument, which assumes infinitely fast intra‑exciton manifold equilibration, is modified in the case of cyanobacterial proteins because of slow annihilation. Our results suggest that action detection may in fact be well‑suited to probe exciton diffusion across weakly coupled systems.
Authors: Simon Axelrod, Miroslav Kašpar, Kristýna Jelínková, Markéta Šmídková, Erika Bartůňková, Sille Štěpánová, Eugene Shakhnovich, Václav Kašička, Martin Dračínský, Zlatko Janeba, Rafael Gómez-Bombarelli
Abstract: Light‑activated drugs are a promising way to treat localized diseases for which existing treatments have severe side effects. However, their development is complicated by the set of photophysical and biological properties that must be simultaneously optimized. Here we used computational techniques to find a set of promising candidates for the photoactive inhibition of the poly(ADP‑ribose) polymerase 1 (PARP1) cancer target. Using our recently developed methods based on atomistic simulation and machine learning (ML), we screened a set of 5 million hypothetical photoactive ligands. Our workflow used protein‑ligand docking to identify candidates with differential PARP1 binding under light and dark conditions; ML force fields and quantum chemistry calculations to predict pKa, absorption spectra, and thermal half‑lives; graph‑based surrogate models to screen additional compounds; excited‑state nonadiabatic dynamics with ML force fields to estimate quantum yields; and free energy perturbation (FEP) to refine binding predictions. From these predictions, we prioritized a small set of synthetically feasible candidates expected to have red‑shifted absorption spectra, thermal half‑lives on the order of seconds to minutes, and isomer‑dependent PARP1 binding under visible‑light control. We synthesized 10 candidates and experimentally characterized their photobehavior and PARP1 inhibition constants. Among the validated compounds, one showed a 15‑fold increase in inhibition of PARP1 upon green‑light irradiation at 519 nm (208.8 ± 28.3 μM vs 14.4 ± 1.9 μM). These results validate the computation‑guided screening strategy for identifying red‑shifted PARP1 photoinhibitors, while also underscoring current limitations such as rapid thermal relaxation in aqueous media.
Authors: Siavash Golkar, Jake Kovalic, Irina Espejo Morales, Samuel Sledzieski, Minhuan Li, Ksenia Sokolova, Geraud Krawezik, Alberto Bietti, Claudia Skok Gibbs, Roman Klypa, Shengwei Xiong, Francois Lanusse, Liam Parker, Kyunghyun Cho, Miles Cranmer, Tom Hehir, Michael McCabe, Lucas Meyer, Rudy Morel, Payel Mukhopadhyay, Mariel Pettee, Helen Qu, Jeff Shen, David Fouhey, Hadi Sotoudeh, Vikram Mulligan, Pilar Cossio, Sonya M. Hanson, Alisha N. Jones, Olga G. Troyanskaya, Shirley Ho
Abstract: Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split‑track encoder‑decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC's sequence reconstruction relative to sequence‑only inputs, while its learned representations enable state‑of‑the‑art performance on RNA and protein downstream tasks. MIMIC achieves state‑of‑the‑art splicing prediction, and its joint generative formulation enables isoform‑aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice‑disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD‑L1 and hACE2 binding sites produces diverse, high‑confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay‑dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC's aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.
Authors: Hung N. Do, Jessica Z. Kubicek-Sutherland, Oscar A. Negrete, S. Gnanakaran
Abstract: We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human‑human and virus‑human PPI, and the other for inducing explicit general rules governing human‑human and virus‑human PPI. The first agentic AI platform for autonomous training of predictive ML models for PPI is designed to consist of five AI agents that handle autonomous data collection, data verification, feature embedding, model design, and training and validation on three‑way protein‑disjoint cross‑fold datasets. For human‑human and human‑virus PPIs, the final three‑way protein‑disjoint ensemble achieves an accuracy of 87.3% and 86.5%, respectively. For cross‑checking and interpretability purposes, the second agentic AI platform is designed to replace ML predictions with human‑readable rules derived from protein embeddings, physicochemical autocovariance descriptors, compartment annotations, pathway‑domain overlap, and graph contexts. For human‑human PPI, it is defined by a two‑rule induction, whereas human‑virus is induced by a more complex set of weighted rules. The rules induced by the second agentic platform align with the SHAP‑identified features from the predictive ML models built by the first agentic platform. Taken together, our work demonstrates the agentic AI's ability to orchestrate from data planning to execution, and from rule induction to explanation in ML, opening the door to various applications.
Authors: Dan Liu, Fida K. Dankar, Jennifer C. deBruyn, Amanda Ricciuto, Anne M. Griffiths, Thomas D. Walters, Khaled EI Emam
Abstract: Single‑arm trials accelerate study timelines by reducing the number of patients that must be recruited for a concurrent control group. However, these designs require an alternative comparator to estimate treatment effects. One approach is to construct a virtual control arm using a machine learning (ML) model trained on external control data to predict the counterfactual outcomes of the treatment arm. Our aim in this study was to leverage virtual controls by developing and evaluating ML‑based counterfactual outcome models trained on IFX‑treated patients to predict 1‑year steroid‑free clinical remission (SFCR) and a composite of C‑reactive protein remission plus steroid‑free clinical remission (CRP‑SFCR) for ADA‑treated pediatric Crohn's disease patients, and to compare the resulting IFX‑versus‑ADA treatment effect estimates with those obtained using propensity score matching to external controls. Five ML models were used to train counterfactual models on the observed IFX cohort data. The resulting models were used to predict the counterfactual outcomes for the ADA arm patients. LGBM yields the OR closest to the propensity score matched reference, and all 95% CI results align with the conclusion from the reference study that no statistically significant difference in the primary and secondary outcomes was observed between patients treated with ADA and those treated with IFX. Our study supports virtual controls as a viable and effective substitute for expensive, lengthy or unethical patient recruitment in an inflammatory bowel disease (IBD) trial. The developed gradient boosted prediction model can be used as a pretrained model to generate IFX counterfactual predictions in future studies, pending external validation and assessment of transportability.
Authors: A. Yermekov, D. A. Herrera-Martí
Abstract: Feature selection in high‑dimensional genomic data (d ≫ n) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat‑RL, a meta‑learning framework that optimises the hyperparameters of an iterative dual‑criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single‑criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers.
On COVID‑19 miRNA data (GSE240888, 332 features) and three Alzheimer's disease classification tasks (GSE84422, 13237 genes; Normal vs. Possible, Probable, and Definite AD), StackFeat‑RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3‑4× fewer features.
Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer's disease, dual‑criterion selection, protein interaction networks
Authors: Vivek Reddy Chithari, Jasmine Y. Young, Irina Persikova, Yuhe Liang, Gregg V. Crichlow, Justin W. Flatt, Sutapa Ghosh, Brian P. Hudson, Ezra Peisach, Monica Sekharan, Chenghua Shao, Stephen K. Burley
Abstract: Motivation: Structural biologists have contributed more than 245,000 experimentally determined three‑dimensional structures of biological macromolecules to the Protein Data Bank (PDB). Incoming data are validated and biocurated by ~20 expert biocurators across the wwPDB. RCSB PDB biocurators, who process more than 40% of global depositions, face increasing challenges in maintaining efficient Help Desk operations, with approximately 19,000 messages in approximately 8,000 entries received from depositors in 2025.
Results: We developed an AI‑powered Help Desk using Retrieval‑Augmented Generation (RAG) built on LangChain with a pgvector store (PostgreSQL) and GPT‑4.1‑mini. The system employs pymupdf4llm for Markdown‑preserving PDF extraction, two‑stage document chunking, Maximal Marginal Relevance retrieval, a topical guardrail that filters off‑topic queries, and a specialized system prompt that prevents exposure of internal terminology. A dual‑LLM architecture uses separate model configurations for question condensing and response generation. Deployed in production on Kubernetes with PostgreSQL (pgvector), it provides around‑the‑clock depositor assistance with citation‑backed, streaming responses.
Availability and implementation: Freely available at https://rcsb-deposit-help.rcsb.org.
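The Maximal Marginal Relevance retrieval step mentioned in the Results above can be sketched as follows. This is a generic, dependency‑free illustration of the MMR selection rule and not the authors' LangChain/pgvector implementation; the function names, the toy two‑dimensional embeddings, and the `lam` trade‑off parameter are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, docs, k, lam=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    Each step picks the candidate maximizing
        lam * sim(query, doc) - (1 - lam) * max(sim(doc, already_selected)).
    lam = 1.0 reduces to plain relevance ranking; smaller lam favors diversity.
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Assumed toy embeddings: chunks 0 and 1 are near-duplicates, chunk 2 differs.
query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [1.0, 0.01], [0.6, 0.8]]
diverse = mmr_select(query_vec, chunk_vecs, k=2, lam=0.3)
relevance_only = mmr_select(query_vec, chunk_vecs, k=2, lam=1.0)
```

In this toy case the diversity‑weighted call skips the near‑duplicate chunk in favor of the dissimilar one, which is the behavior that makes MMR useful when retrieved document chunks overlap heavily.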
Authors: Agostino Occhicone, Alberto Sinibaldi, Paola Di Matteo, Daniele Chiappetta, Riccardo Guadagnoli, Peter Munzert, Francesco Michelotti
Abstract: Surface functionalization plays a decisive role in the performance of biosensors, as it governs the efficiency and stability of biomolecule immobilization at the sensor interface and, consequently, the overall performance of the biosensing platforms. In this work, we present a comparative study of three organosilane chemistries (APTES, APDMS, and CPTES) applied to a SiO2‑terminated 1D photonic crystal able to sustain Bloch surface waves (BSW) and designed to operate as an optical biosensor in both label‑free and fluorescence‑enhanced modes. Each chemistry was evaluated through a standardized label‑free protocol based on the interaction between immobilized SARS‑CoV‑2 spike protein and its corresponding antibodies, enabling quantitative assessment of binding efficiency, nonspecific adsorption, and signal repeatability. CPTES exhibited the most favorable balance between specific signals, reduced variability, and low nonspecific adsorption. The three chemistries were subsequently tested in fluorescence mode for the detection of anti‑SARS‑CoV‑2 IgG antibodies in human serum, demonstrating the suitability of BSW‑enhanced fluorescence for rapid serological analysis. Overall, the study identifies CPTES as the most robust and reproducible functionalization strategy among the three investigated for BSW biosensing and highlights the potential of the platform for fast, sensitive detection of clinically relevant antibodies.
Authors: Jiaxian Yan, Jintao Zhu, Yuhang Yang, Qi Liu, Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao, Enhong Chen
Abstract: Protein‑ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi‑modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical‑structure‑grounded visual semantic reasoning paradigm, in which multi‑modal large language models operate on chemically grounded visual representations to infer inter‑structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre‑training database that improves downstream model performance by 3.9%; (2) enabling a human‑in‑the‑loop workflow that doubles the number of high‑quality NLRP3 bioactivity data points, contributing to a 38.6% improvement over 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein‑ligand complex bioactivity annotation, achieving a 5.59‑fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
Authors: Lukas Müllender, Berk Hess, Erik Lindahl
Abstract: Neural network potentials (NNPs) are rapidly changing the landscape of state‑of‑the‑art molecular dynamics (MD) simulations. To make full use of this development, the community needs flexible, easy‑to‑use interfaces firmly integrated with existing methodologies. To address this, we here present an interface for hybrid machine learning/molecular mechanics (ML/MM) simulations implemented in the widely used MD code GROMACS. The interface enables NNPs trained in the PyTorch framework to contribute energies and forces during MD simulations, either for selected subsets or entire molecular systems. By defining a flexible set of model inputs and outputs, the interface is agnostic to specific NNP architectures and can accommodate a wide range of descriptor‑based and message‑passing models. In particular, the design integrates NNP inference seamlessly into the extensive GROMACS molecular simulation ecosystem, providing users with the capability to straightforwardly combine NNPs with existing advanced sampling and free energy workflows. We demonstrate the capabilities of the interface using several representative applications, including enhanced sampling of peptide torsional free energy landscapes, absolute solvation free energy calculations, and protein‑ligand simulations. We also run performance benchmarks on water boxes for several different NNP architectures. Our interface is available in recent GROMACS releases, and we believe it will provide a practical foundation for incorporating machine learning potentials into production MD simulations of biomolecular systems.
Authors: Pagkratis Tagkopoulos, Dimitris Sfondilis, Ilias Tagkopoulos, Tarek Zohdi
Abstract: The prediction of sensory attributes from ingredient‑level formulations is an emerging challenge at the intersection of food science and artificial intelligence. We address the fundamental question of whether the taste of a food can be predicted from its ingredients by treating recipes as composite materials. We apply Hashin‑Shtrikman (HS) and Reuss‑Voigt (RV) bounds, techniques originally developed for elastic moduli, to predict five taste dimensions (sweetness, sourness, bitterness, umami, saltiness) on a curated dataset of 70 recipes decomposed into 209 ingredient‑level taste references with trained‑panel ground truth. The bounds provide an additive baseline but systematically under‑predict perceived taste: 77% of actual taste values exceed the HS upper bound, with the exceedance rate ranging from 26% (bitterness) to 97% (saltiness). We trace this gap to specific processing chemistry (Maillard reactions, caramelization, evaporative concentration, protein hydrolysis, and nucleotide synergy) and introduce a hybrid model that augments the HS baseline with eight chemistry‑proxy features encoding these mechanisms. Our results show that this interpretable hybrid model eliminates the systematic bias and reduces mean absolute error by 27‑62% for sweetness, sourness, umami, and saltiness while using only 10 interpretable features, achieving performance comparable to a black‑box Lasso regression on 115 per‑ingredient features. We further demonstrate constrained inverse design via Differential Evolution, recovering ingredient formulations that match target taste profiles subject to compositional bounds.
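As a rough illustration of the bounding idea, the sketch below computes Voigt (arithmetic-mean) and Reuss (harmonic-mean) bounds and a scalar Hashin-Shtrikman pair for a mixture of ingredients. The scalar, conductivity-type HS form is swapped in here as an assumption, since the abstract does not say how the elastic-moduli bounds were adapted to taste intensities. The ingredient fractions, intensities, and the panel score are likewise hypothetical.

```python
def voigt(fractions, values):
    """Voigt (arithmetic-mean) upper bound; fractions are assumed to sum to 1."""
    return sum(f * v for f, v in zip(fractions, values))


def reuss(fractions, values):
    """Reuss (harmonic-mean) lower bound; requires strictly positive values."""
    return 1.0 / sum(f / v for f, v in zip(fractions, values))


def hashin_shtrikman(fractions, values, dim=3):
    """Scalar (conductivity-type) Hashin-Shtrikman bounds for an isotropic mixture.

    Returns (lower, upper). The lower bound uses the smallest phase value as the
    reference medium and the upper bound uses the largest.
    """
    def bound(reference):
        shift = (dim - 1) * reference
        return 1.0 / sum(f / (v + shift) for f, v in zip(fractions, values)) - shift

    return bound(min(values)), bound(max(values))


# Hypothetical two-ingredient recipe: equal fractions, sweetness intensities 2 and 6.
fractions = [0.5, 0.5]
sweetness = [2.0, 6.0]
rv_lower, rv_upper = reuss(fractions, sweetness), voigt(fractions, sweetness)
hs_lower, hs_upper = hashin_shtrikman(fractions, sweetness)

# A hypothetical panel score above the HS upper bound would count as an exceedance.
measured = 4.2
exceeds_hs_upper = measured > hs_upper
```

The HS pair always lies inside the Reuss-Voigt interval, so a measurement above the HS upper bound points to something beyond simple mixing, which is the gap the abstract attributes to processing chemistry.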
Authors: Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang
Abstract: The integration of single‑cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer‑based foundation model designed for single‑cell proteomics. Pre‑trained on over 390 million cells, scpFormer replaces standard index‑based tokenization with a continuous, sequence‑anchored approach. By combining Evolutionary Scale Modeling (ESM) with value‑aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large‑scale batch integration and unsupervised clustering. Moreover, its open‑vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co‑expression logic is transferable to bulk‑omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel‑agnostic framework to facilitate scalable biomarker discovery and precision oncology.
Authors: Monika Kish, Suchitra Pradhan, Jessica L. Ramsay, Paloma Munguía Salazar, Jonathan Phillips, Daniel R. Kattnig
Abstract: The light‑dependent magnetic compass of night‑migratory songbirds is widely hypothesized to rely on the radical pair mechanism within retinal cryptochrome. However, bridging the mechanistic gap between microsecond quantum spin dynamics and the long‑lived, global protein conformational changes required for cellular signalling remains a formidable challenge. Here, we apply redox state‑resolved hydrogen/deuterium‑exchange mass spectrometry (HDX‑MS) to map the conformational landscape of European robin cryptochrome 4a (ErCry4a) across its photocycle. We reveal that photochemical reduction drives robust, allosteric structural transitions across key functional nodes, including the phosphate‑binding loop (PBL), protrusion loop (PL), FAD‑proximal helix α17, and the C‑terminal α22/α23 network. Crucially, we isolate the structural fingerprint of the transient semiquinone, the presumed signalling species. Rather than acting as a linear structural stepping‑stone, the semiquinone exhibits a distinct, non‑monotonic conformational signature characterized by a transient destabilization of the PBL and PL, contrasting sharply with the global rigidification observed in the fully reduced state. These findings establish the semiquinone as a structurally unique and functionally competent biological entity. Our results provide direct biophysical evidence for a dedicated, high‑fidelity structural signalling cascade, detailing how localized quantum‑level photochemistry is translated into the precise conformational dynamics required for animal navigation.
Authors: Carles Navarro, Philipp Tholke, Gianni de Fabritiis
Abstract: Structure‑based drug discovery faces the dual challenge of accurately capturing 3D protein‑ligand interactions while navigating ultra‑large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)‑equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero‑shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target‑specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
Authors: Amos. S. Kiyumbi, Jordan. H. Hossea
Abstract: We present a numerical study of a divergent‑beam Kretschmann surface plasmon resonance (SPR) platform for multiplexed malaria biosensing. A Powell‑lens‑generated angular fan enables camera‑based angular interrogation of spatially separated regions of interest on a single Au film, thereby removing the need for mechanical scanning. The framework combines transfer‑matrix modelling of the prism/Au multilayer with an effective‑adlayer description of biomolecular binding at the biofunctional interface. As a representative dual‑biomarker case, we consider plasmodium lactate dehydrogenase (pLDH) and histidine‑rich protein 2 (HRP‑2). Benchmarking of the N‑SF11/Au (45 nm) baseline against published water/glycerol data reproduces the characteristic resonance positions and yields a bulk angular sensitivity of $73.2181^\circ\,\mathrm{RIU}^{-1}$. With representative aptamer‑like and antibody‑like recognition layers, the relevant sensing states remain within $54^\circ$ to $57^\circ$ and produce distinct, detector‑resolvable responses. Combining the optical model with effective‑medium and Langmuir binding descriptions gives model‑based detection limits of approximately $5.5\,\mathrm{ng\,mL^{-1}}$ for HRP‑2 and $5.8\times 10^{-2}\,\mathrm{ng\,mL^{-1}}$ for pLDH. These results support divergent‑beam SPR as a viable architecture for quantitative multiplexed malaria biosensing.
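The angular interrogation described above can be sketched with the single-film Fresnel (Airy) formula, which is what a transfer-matrix treatment reduces to for a bare prism/Au/medium stack. This is a minimal sketch, not the authors' model: the 633 nm wavelength, the Au permittivity, and the refractive indices of N-SF11 and water are assumed textbook-style values that the abstract does not give, and no recognition layer or adlayer is included.

```python
import cmath
import math


def kretschmann_reflectivity(theta_deg, n_prism, eps_metal, n_medium,
                             d_metal_nm, wavelength_nm):
    """p-polarized reflectivity of a prism / metal film / medium stack."""
    k0 = 2 * math.pi / wavelength_nm
    eps = [n_prism ** 2, eps_metal, n_medium ** 2]
    kx = k0 * n_prism * math.sin(math.radians(theta_deg))
    # Normal wavevector components; the principal square root keeps Im(kz) >= 0.
    kz = [cmath.sqrt(e * k0 ** 2 - kx ** 2) for e in eps]

    def r_p(i, j):
        # Fresnel coefficient for p-polarization at the i -> j interface.
        return (eps[j] * kz[i] - eps[i] * kz[j]) / (eps[j] * kz[i] + eps[i] * kz[j])

    phase = cmath.exp(2j * kz[1] * d_metal_nm)
    r = (r_p(0, 1) + r_p(1, 2) * phase) / (1 + r_p(0, 1) * r_p(1, 2) * phase)
    return abs(r) ** 2


# Assumed parameters (not from the abstract except the 45 nm Au thickness).
ASSUMED = dict(
    n_prism=1.7786,                  # N-SF11 near 633 nm (assumption)
    eps_metal=complex(-11.7, 1.26),  # Au near 633 nm (assumption)
    n_medium=1.33,                   # water (assumption)
    d_metal_nm=45.0,
    wavelength_nm=633.0,
)

angles = [45 + 0.01 * k for k in range(2501)]  # scan 45 to 70 degrees
reflectivity = [kretschmann_reflectivity(a, **ASSUMED) for a in angles]
dip_angle = angles[reflectivity.index(min(reflectivity))]
```

Repeating the scan with a slightly larger `n_medium` and dividing the dip shift by the index change gives a bulk angular sensitivity in degrees per RIU, which is the kind of figure the abstract reports.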
Authors: Mariia Kryvoruchko, Brian A. Camley
Abstract: When cells collide, they often exhibit "contact inhibition of locomotion" (CIL), a behavior in which cells repolarize and migrate away from the site of contact. Experimental CIL outcomes are highly variable. Why? Here, we develop a minimal stochastic model to quantify how intrinsic noise in cell polarity, arising from the finite number of signaling molecules, influences CIL decision‑making. We simulate polarization dynamics by tracking individual Rho GTPase proteins that diffuse and switch stochastically between the cell membrane and cytosol. In the absence of cell‑cell contact, the polarity axis diffuses rotationally (the cell's orientation wanders), with a diffusion coefficient that decreases as Rho GTPase copy number increases. Assuming that cell‑cell contact inhibits Rho GTPase activation, we investigate how contact geometry, duration, and strength affect CIL sensitivity. At low protein copy number, weak, brief, or spatially narrow contacts are masked by molecular noise. In contrast, at high protein copy number, intrinsic polarity noise is negligible, and randomness in CIL response is more likely to reflect the variability from collision to collision in the cell‑cell contact properties.
Authors: Logan Hallee, Jason P. Gleghorn
Abstract: Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation‑invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query‑key subspace of each attention head into two complementary triangular masks: one that attends to past‑and‑self positions and one that attends to future‑and‑self positions. This design provides bidirectional context while maintaining the causal mask's implicit positional inductive bias in both directions. Using PyTorch's flex_attention, Dual Triangle Attention is implemented as a single compiled kernel call with no additional parameters beyond standard multi‑head attention. We evaluated Dual Triangle Attention across three settings: (1) a synthetic argmax position probe, (2) masked language modeling (MLM) on natural language, and (3) MLM on protein sequences. In the argmax task, both Dual Triangle Attention and causal attention learn positional information without explicit positional embeddings, whereas standard bidirectional attention cannot. In the MLM experiments, Dual Triangle Attention with Rotary Positional Embeddings (RoPE) achieved the best context extension performance and strong performance across the board. These findings suggest that Dual Triangle Attention is a viable attention mechanism for bidirectional transformers, with or without positional embeddings.
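A small sketch can show why triangular masks carry positional signal while a full bidirectional mask does not. The code below only builds the two complementary masks and counts how many positions each query can see. How the two halves of the query-key subspace are combined inside a head is not described in the abstract, so that step is deliberately left out, and the helper names are hypothetical.

```python
def triangular_masks(n):
    """Return (past_and_self, future_and_self) boolean masks of shape n x n."""
    past = [[j <= i for j in range(n)] for i in range(n)]
    future = [[j >= i for j in range(n)] for i in range(n)]
    return past, future


def visible_counts(mask):
    """Number of key positions each query position may attend to."""
    return [sum(row) for row in mask]


n = 4
past, future = triangular_masks(n)
full = [[True] * n for _ in range(n)]

past_counts = visible_counts(past)      # grows with position
future_counts = visible_counts(future)  # shrinks with position
full_counts = visible_counts(full)      # identical for every position

# Together the two triangles cover every (query, key) pair, so context stays bidirectional.
covers_all = all(past[i][j] or future[i][j] for i in range(n) for j in range(n))
```

Because each row of a triangular mask admits a different number of keys, even uniform attention produces position-dependent outputs, whereas the full mask gives every row the same view.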
Authors: Minji Lee, Colin Kalicki, Minkyu Jeon, Aymen Qabel, Alisia Fadini, Mohammed AlQuraishi
Abstract: Models from the AlphaFold (AF) family reliably predict one dominant conformation for most well‑ordered proteins but struggle to capture biologically relevant alternate states. Several efforts have focused on eliciting greater conformational variability through ad hoc inference‑time perturbations of AF models or their inputs. Despite their progress, these approaches remain inefficient and fail to consistently recover major conformational modes. Here, we investigate both the optimal location and manner‑of‑operation for perturbing latent representations in the AF3 architecture. We distill our findings in ConforNets: channel‑wise affine transforms of the pre‑Pairformer pair latents. Unlike previous methods, ConforNets globally modulate AF3 representations, making them reusable across proteins. On unsupervised generation of alternate states, ConforNets achieve state‑of‑the‑art success rates on all existing multi‑state benchmarks. On the novel supervised task of conformational transfer, ConforNets trained on one source protein can induce a conserved conformational change across a protein family. Collectively, these results introduce a mechanism for conformational control in AF3‑based models.
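To illustrate what a channel-wise affine transform of pair latents looks like, here is a minimal sketch using nested lists in place of tensors. It assumes a pair representation indexed as `pair[i][j][c]` with one scale and one shift per channel. The shapes, values, and function name are hypothetical, and this is not the ConforNets implementation.

```python
def channelwise_affine(pair, scale, shift):
    """Apply pair[i][j][c] -> scale[c] * pair[i][j][c] + shift[c] to every residue pair."""
    return [
        [[s * z + b for z, s, b in zip(channels, scale, shift)] for channels in row]
        for row in pair
    ]


# Two channels, so the transform has exactly two scales and two shifts.
scale = [2.0, 1.0]
shift = [0.0, -1.0]

# The same parameters apply to proteins of different lengths (here L=2 and L=3).
pair_small = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]
pair_large = [[[0.5, 3.0] for _ in range(3)] for _ in range(3)]

out_small = channelwise_affine(pair_small, scale, shift)
out_large = channelwise_affine(pair_large, scale, shift)
```

Since the parameters depend only on the channel index and not on residue positions or sequence length, one learned transform can in principle be reused across proteins, which matches the reusability the abstract emphasizes.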
Authors: Sarwan Ali, Taslim Murad
Abstract: Biological classification with interpretability remains a challenging task. For this, we introduce a novel encoding framework, Multi‑Scale Reversible Chaos Game Representation (MS‑RCGR), that transforms biological sequences into multi‑resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS‑RCGR employs rational arithmetic and hierarchical k‑mer decomposition to generate scale‑invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR‑generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS‑RCGR features consistently enhance classification performance across all paradigms. Notably, our hybrid approach combining pre‑trained language model embeddings (ESM2, ProtT5) with MS‑RCGR features achieves superior performance compared to either method alone. The reversibility property of our encoding ensures no information loss during transformation, while multi‑scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS‑RCGR provides a flexible, interpretable, and high‑performing foundation for biological sequence analysis.
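The reversibility claim can be illustrated with the classical single-scale Chaos Game Representation computed in exact rational arithmetic. This is a sketch of the general idea only: the corner assignment, the starting point, and the function names are assumptions, and the multi-scale k-mer decomposition of MS-RCGR is not reproduced here.

```python
from fractions import Fraction

# Assumed corner layout for the four nucleotides on the unit square.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
HALF = Fraction(1, 2)


def cgr_encode(seq):
    """Map a DNA string to an exact rational point by repeated midpoint steps."""
    x, y = HALF, HALF
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def cgr_decode(point, length):
    """Recover the sequence from its final point; the length must be known."""
    by_corner = {corner: base for base, corner in CORNERS.items()}
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant of the current point identifies the most recent base.
        cx, cy = int(x > HALF), int(y > HALF)
        bases.append(by_corner[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy  # undo one midpoint step exactly
    return "".join(reversed(bases))


point = cgr_encode("GATTACA")
recovered = cgr_decode(point, 7)
```

With floating-point coordinates the low-order bits that encode early bases are eventually rounded away, so long sequences stop being recoverable. `Fraction` keeps every bit, at the cost of denominators that grow as a power of two with sequence length.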
Authors: Beatrice Caon, Mattia Corti, Francesca Bonizzoni, Paola F. Antonietti
Abstract: Alzheimer's disease is the most common neurodegenerative disorder. Its pathological development is connected with the misfolding and accumulation of two toxic proteins: amyloid‑beta and tau. Mathematical models provide a valuable quantitative tool for monitoring disease progression. In this work, we propose a novel framework in which the spatio‑temporal dynamics of amyloid‑beta and tau proteins are modeled either on three‑dimensional patient‑specific geometries or through reduced network‑based models defined on the brain connectome, and we compare the two approaches. More specifically, a high‑fidelity biophysical model is proposed on three‑dimensional brain geometries reconstructed from magnetic resonance imaging, whereas a network‑based reduced formulation is defined on the brain connectome. For both approaches, a suitable numerical discretisation is proposed. A sensitivity analysis is presented to quantify the influence of model parameters on protein concentration patterns and to compare the quality of the predictions. For both approaches, the results are validated against PET‑SUVR clinical data using [18F]AZD4694 for amyloid‑beta and [18F]MK6240 for tau protein. The results indicate that the three‑dimensional model provides the most accurate and biologically consistent description of the disease progression, but remains computationally demanding. On the other hand, the reduced graph‑based model is cheaper, but it is not always able to achieve reliable results.
Authors: Chupei Tang, Junxiao Kong, Moyu Tang, Di Wang, Jixiu Zhai, Ronghao Xie, Shangkun Sima, Tianchi Lu
Abstract: Motivation: Peptide‑protein interactions (PepPIs) are central to cellular regulation and peptide therapeutics, but experimental characterization remains too slow for large‑scale screening. Existing methods usually emphasize either interaction prediction or peptide generation, leaving candidate prioritization, residue‑level interpretation, and target‑conditioned expansion insufficiently integrated. Results: We present an integrated framework for early‑stage peptide screening that combines a partner‑aware prediction and localization model (ConGA‑PepPI) with a target‑conditioned generative model (TC‑PepGen). ConGA‑PepPI uses asymmetric encoding, bidirectional cross‑attention, and progressive transfer from pair prediction to binding‑site localization, while TC‑PepGen preserves target information throughout autoregressive decoding via layerwise conditioning. In five‑fold cross‑validation, ConGA‑PepPI achieved 0.839 accuracy and 0.921 AUROC, with binding‑site AUPR values of 0.601 on the protein side and 0.950 on the peptide side, and remained competitive on external benchmarks. Under a controlled length‑conditioned benchmark, 40.39% of TC‑PepGen peptides exceeded native templates in AlphaFold 3 ipTM, and unconstrained generation retained evidence of target‑conditioned signal.
Authors: Mahya Mohammadi, Meryem-Nur Duman, Isa Ahmadalidokht, Mohammad Sadraeian, Christopher G. Poulton, Alexander S. Solntsev, Irina V. Kabakova
Abstract: We investigate quantum spectroscopy with undetected photons for protein detection in the mid‑infrared spectral region. Classical Fourier‑transform infrared spectroscopy of protein samples (bovine serum albumin and N‑terminal pro‑brain natriuretic peptide) is used as a reference to define the sample's mid‑infrared absorption, which is then embedded in a numerical model of a double‑pass quantum interferometer. We analyse parameters that influence the visibility of the interference pattern formed by the signal beams, including the length of the nonlinear crystal, the sample length, and the mirror‑sample distance. This leads us to a practical quantum spectrometer design with optimal image contrast at the specific amide I‑II spectral bands. The simulated visibility spectra reproduce nearly identically the protein absorption features in the mid‑IR and reveal temperature‑induced changes to the protein secondary structure. Overall, this provides practical design rules for future quantum bio‑spectroscopy applications that use only visible wavelength sources and detectors.
Authors: Shah Ishmam Mohtashim, Manas Sajjan, Sabre Kais
Abstract: We present a quantum‑dynamical framework for identifying structurally important residues in proteins based on continuous‑time quantum walks (CTQWs) on weighted residue interaction networks constructed from experimentally resolved structures. By mapping the weighted adjacency matrix to a Hamiltonian, residue importance emerges from the long‑time averaged occupation probability, confirmed analytically through its spectral decomposition. Across a dataset of approximately 150 proteins spanning diverse structural and functional classes, CTQW centrality exhibits consistently strong agreement with classical eigenvector centrality in identifying central residues, while extending beyond it by incorporating signatures of quantum interference. Analyzing the time‑averaged quantum transition matrix reveals consistently larger spectral gaps than the classical random‑walk operator. Furthermore, biological relevance is confirmed through recovery of experimentally established functional residues in protein kinase A and oxytocin. CTQW‑derived centrality rankings are accessible on near‑term intermediate‑scale quantum hardware, as we demonstrate through a proof‑of‑principle implementation on IBM superconducting quantum hardware. These results establish continuous‑time quantum walks as a computationally tractable framework for protein network analysis that connects network‑theoretical treatments of protein structural biology to continuous‑time quantum walk dynamics.
Authors: Meghana Kshirsagar, Allen Nie, Ching-An Cheng, Fanglei Xue, Rahul Dodhia, Juan Lavista Ferres, Kevin K. Yang, Frank DiMaio
Abstract: We introduce RosettaSearch, an inference‑time multi‑objective optimization approach for backbone‑conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large‑scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state‑of‑the‑art model trained for protein sequence design), recovering high‑fidelity designs that LigandMPNN's single‑pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging from 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch‑designed sequences are evaluated with an independent structure prediction oracle (Chai‑1) and generalize across two distinct LLM families (o4‑mini and Gemini‑3), with performance scaling consistently with reasoning capability.
We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi‑modal extension of RosettaSearch with vision‑language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large‑scale demonstration that LLMs can serve as effective generative optimizers for backbone‑conditioned protein sequence design, yielding systematic gains without any model retraining.
Authors: Greta Grassmann, Giancarlo Ruocco, Mattia Miotto
Abstract: Biomolecular phase separation is typically attributed to the polymer physics of long, disordered chains. However, the underlying chemical grammar, i.e., the specific interactions between protein and RNA building blocks, remains poorly understood. We decouple these effects by screening the phase behavior of the complete dipeptide library in the presence and absence of nucleic acids using full‑atomistic molecular dynamics simulations. We demonstrate that (i) even these ultrashort units encode the instructions for spontaneous condensation, showing that phase separation is fundamentally rooted at a sub‑polymeric level; (ii) nucleic acids do not act as generic anionic glue but instead exert a base‑specific regulatory logic; and (iii) individual nucleobases function as chemical tuners that dissolve, stabilize, or fluidize condensates based on their molecular identity. Overall, our minimal framework reveals that while polymer length enhances assembly, the core properties and regulatory control of condensates may also be governed by a fine‑tuned chemical alphabet of peptides and nucleobases.
Authors: Yutang Ge, Guojiang Zhao, Sihang Li, Zheng Cheng, Zifeng Zhao, Hanchen Xia, Guolin Ke, Linfeng Zhang, Zhifeng Gao, Yuguang Wang
Abstract: Designing proteins that satisfy natural language functional requirements is a central goal in protein engineering. A straightforward baseline is to fine‑tune generic instruction‑tuned LLMs as direct text‑to‑sequence generators, but this is data‑ and compute‑hungry. With limited supervision, LLMs can produce coherent plans in text yet fail to reliably realize them as sequences. This plan‑execute gap motivates ProtoCycle, an agentic framework for protein design that uses LLMs primarily to drive a multi‑round, feedback‑driven decision cycle. ProtoCycle couples an LLM planner with a lightweight tool environment designed to emulate the iterative workflow of human protein engineering and uses LLM‑driven reflection on tool feedback to revise plans. Trained with supervised trajectories and online reinforcement learning, ProtoCycle achieves strong language alignment while maintaining competitive foldability, and ablations show that reflection substantially improves sequence quality.
Authors: Chenwei Zhang
Abstract: This dissertation explores how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning. It focuses on two areas: DNA reaction kinetics and cryogenic electron microscopy (cryo‑EM). In the first part, we present ViDa, a biophysics‑informed framework leveraging variational autoencoders (VAEs) and geometric scattering transforms to generate biophysically‑plausible embeddings of DNA reaction kinetics simulations. These embeddings are reduced to a two‑dimensional space to visualize DNA hybridization and toehold‑mediated strand displacement reactions. ViDa preserves structure and clusters trajectory ensembles into reaction pathways, making simulation results more interpretable and revealing new mechanistic insights. In the second part, we address key challenges in cryo‑EM density map interpretation and protein structure modeling. We provide a comprehensive review and benchmarking of deep learning methods for atomic model building, with improved evaluation metrics and practical guidance. We then present Struc2mapGAN, a generative adversarial network that synthesizes high‑fidelity experimental‑like cryo‑EM density maps from protein structures. Finally, we present CryoSAMU, a structure‑aware multimodal U‑Net that enhances intermediate‑resolution cryo‑EM maps by integrating density features with structural embeddings from protein language models via cross‑attention. Overall, these contributions demonstrate the potential of deep generative models to interpret DNA reaction mechanisms and advance cryo‑EM density map analysis and protein structure modeling.
Authors: Zhijiang Tang, Jiaxin Qi, Yan Cui, Jinli Ou, Yuhua Zheng, Jianqiang Huang
Abstract: DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large‑scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor‑masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in‑depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
Authors: Jingke Chen, Jingrui Zhong, Tazneen Hossain Tani, Zidong Su, Xiaochun Zhang, Boxue Tian
Abstract: Despite the high accuracy of 'black box' deep learning models, drug discovery still relies on protein‑ligand interaction principles and heuristics. To improve interpretability of protein‑small molecule binding predictions, we developed the PWRules framework, which applies binding affinity data to identify privileged small molecule fragments and subsequently defines complementary pairing rules between these fragments and protein words (semantic sequence units) through an interpretability module. The resulting word‑fragment rules are then ranked by the PWScore function to prioritize active compounds. Evaluations on benchmark datasets show that PWScore achieves competitive performance comparable to the physics‑based model (Glide) and the deep learning model (PSICHIC) and shows broad applicability for protein targets outside the training dataset, e.g., SARS‑CoV‑2 main protease. Notably, PWScore captures complementary interaction information, yielding superior enrichment performance when integrated with these established methods. Structural analysis of protein‑ligand complexes indicates that learned word‑fragment rules are significantly enriched near ligand‑binding pockets, despite training without explicit structural guidance. By extracting and applying complementary pairing rules, PWRules provides an interpretable framework for drug discovery.
Authors: Skyler R. St. Pierre, Thibault Vervenne, Ethan C. Darwin, Ellen Kuhl
Abstract: Fungal protein materials exhibit inherently anisotropic microstructures formed by networks of hyphae, which suggest a natural pathway to replicate the fibrous texture of animal meat. We probe whether this structural anisotropy translates into macroscopic mechanical and sensory anisotropy. Using orthogonal tension, compression, and shear experiments on three fungi‑based materials, we identify distinct symmetry classes that range from strongly anisotropic to effectively isotropic behavior. Automated model discovery reveals that fiber‑dependent invariants emerge only when mechanically relevant, and enables direct identification of material symmetry from data. These results demonstrate that microstructural anisotropy does not universally imply anisotropic mechanics or perception and establish a data‑driven framework to infer symmetry in complex soft materials.
Authors: Yanbin Wei, Chun Kang, Siwei Li, Haoxuan Che, Yang Chen, Hua Liu, Jian Liu, Zhuang Liu, Can Ouyang, Fei Xing, Lei Sha, Rui Liu, Yu Zhang, James Kwok
Abstract: Large Vision‑Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there is still no benchmark that delineates the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, we introduce HyperGVL, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. HyperGVL provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision‑language question‑answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP‑hard problem reasoning. The hypergraphs involved include multiscale synthetic structures and real‑world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router, WiseHyGR, that improves LVLM performance on hypergraphs by learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.
Authors: Mariia Ivonina, Jakub Rydzewski
Abstract: The SARS‑CoV‑2 RNA pseudoknot is a promising target for antiviral intervention, as it regulates the efficiency of ‑1 programmed ribosomal frameshifting (‑1 PRF), a mechanism that is essential for viral protein synthesis. The pseudoknot represents a viral RNA sequence composed of helical stems that adopts two long‑lived topologies, threaded and unthreaded. Ligand‑induced distortion of this fold is thought to underlie the susceptibility of ‑1 PRF to small‑molecule inhibitors. Resolving these distortions from unbiased molecular dynamics (MD) requires collective variables (CVs) that isolate the slowest dynamic modes of the RNA‑ligand system from the high‑frequency fluctuations. Here, we use spectral map (SM), a thermodynamics‑driven machine‑learning method, to learn such CVs directly from MD trajectories of the SARS‑CoV‑2 RNA pseudoknot in complex with the ‑1 PRF inhibitor merafloxacin and two related analogs. We examine both threaded and unthreaded pseudoknot topologies and consider the neutral and ionized ligand forms relevant at physiological pH. Free‑energy landscapes show that ligand‑induced destabilization is topology‑selective: merafloxacin and its analogs destabilize the S2 stem in the threaded pseudoknot, whereas in the unthreaded pseudoknot, destabilization shifts to the S1 and S3 stems. We find that the zwitterionic form of merafloxacin uniquely imposes slow dynamics on the otherwise featureless unthreaded pseudoknot. Furthermore, the neutral and zwitterionic forms of merafloxacin differ qualitatively in their mechanisms within the same RNA topology. Overall, these results clarify how pseudoknot topology, ligand type, and protonation state shape the slow conformational dynamics of viral RNA and establish physiological protonation as an essential factor for modeling RNA‑targeted drug action.
Authors: Alessio Valentini, David Pekker, Chungwen Liang, Todd Martinez, Swagatam Mukhopadhyay
Abstract: The classic paradigm of structural biology is that the sequence of a biomolecule (protein, nucleic acid, lipid, etc.) determines its conformation (shape), which determines its biological function. Protein folding programs like AlphaFold address this paradigm by predicting the single best conformation given a sequence that defines the molecule. However, biomolecules are not static structures, and their conformational ensemble determines their function. We present the Polyformer, a generative framework for thermodynamic modeling of polymeric molecules. Given the sequence and temperature (or another thermodynamic variable), the Polyformer generates conformations faithful to the molecule's thermodynamic conformational ensemble. It is the first generative model that solves three problems simultaneously: how a molecule folds, what its conformational ensemble is, and how that ensemble changes as the physical temperature changes. As a concrete test case, we apply Polyformer to protein domains with 50‑111 residues and report good agreement of model predictions with Molecular Dynamics (MD) trajectories.
Authors: Jackie Rao, Ferran Gonzalez Hernandez, Leon Gerard, Alexandra Gessner
Abstract: Antibody lead optimization is inherently a multi‑objective challenge in drug discovery. Achieving a balance between different drug‑like properties is crucial for the development of viable candidates, and this search becomes exponentially more challenging as the number of desired properties grows. The ever‑growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource‑intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi‑property antibody engineering. Our "plug‑and‑play" framework couples uncertainty‑aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state‑of‑the‑art methods for multi‑objective protein optimization. We identify clear regimes where surrogate‑driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.
Authors: Arman Bekov, Timur Bekzhanov, Bekzat Sadykov
Abstract: Predicting T‑cell receptor (TCR)‑peptide‑MHC (pMHC) binding is central to vaccine design and T‑cell therapy, yet deployed models frequently encounter epitopes unseen during training, causing silent overconfidence and unreliable prioritization. We address this by framing TCR‑pMHC prediction as a selective prediction problem: a calibrated model should either output a trustworthy confidence score or explicitly abstain. Concretely, we (1) introduce a dual‑encoder architecture encoding both CDR3α/CDR3β and peptide sequences via a pre‑trained protein language model; (2) apply temperature scaling to correct systematic probability miscalibration; and (3) impose a conformal abstention rule that provides finite‑sample coverage guarantees at a user‑specified target error rate. Evaluated under three split strategies (random, epitope‑held‑out, and distance‑aware), our method achieves AUROC 0.813 and ECE 0.043 under the challenging epitope‑held‑out protocol, reducing ECE by 69.7% relative to an uncalibrated baseline. At 80% coverage, the selective model further reduces error rate from 18.7% to 10.9%, demonstrating that calibrated abstention enables principled coverage‑risk trade‑offs aligned with practical screening budgets.
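As a rough illustration of the two ingredients named in this abstract, the sketch below applies temperature scaling to binary logits and then measures the error rate on the most confident fraction of predictions (a coverage-risk point). It is a minimal sketch under stated assumptions: the function names, logits, labels, and temperature are hypothetical, and the simple confidence ranking used here stands in for, and is not, the paper's conformal abstention rule.

```python
import math

def temperature_scale(logit: float, temperature: float) -> float:
    """Return sigmoid(logit / T); T > 1 pulls probabilities toward 0.5."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def selective_error(probs, labels, coverage):
    """Error rate on the most confident `coverage` fraction of predictions."""
    # Confidence of a binary prediction is max(p, 1 - p); a prediction is
    # wrong when the thresholded probability disagrees with the label.
    scored = sorted(
        ((max(p, 1.0 - p), int(p >= 0.5) != y) for p, y in zip(probs, labels)),
        key=lambda t: t[0],
        reverse=True,
    )
    kept = scored[: max(1, math.ceil(coverage * len(scored)))]
    return sum(wrong for _, wrong in kept) / len(kept)

# Hypothetical logits and binding labels, for illustration only.
logits = [4.0, 3.0, -3.5, 0.4, -0.2, 2.5, -2.8, 0.1, -4.0, 0.3]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
probs = [temperature_scale(z, temperature=2.0) for z in logits]

full = selective_error(probs, labels, coverage=1.0)     # no abstention
partial = selective_error(probs, labels, coverage=0.5)  # abstain on half
print(full, partial)
```

In this toy data the three mistakes all sit among the low-confidence predictions, so abstaining on the least confident half removes them. Real data will rarely separate this cleanly.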
Authors: Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani
Abstract: Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single‑objective alignment is well‑studied, many real‑world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non‑convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi‑objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi‑Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi‑objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state‑of‑the‑art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off‑policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi‑objective alignment algorithm that can meaningfully improve post‑trained models for multi‑attribute protein optimization and beyond.
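To make the contrast with linear scalarization concrete, the sketch below compares the classic weighted Tchebysheff scalarization (a hard max over weighted objective gaps) with a log-sum-exp smoothing of that max. This is the commonly used smooth form, written under assumptions: the smoothing parameter `mu`, the weights, the ideal point, and the objective values are hypothetical, objectives are treated as losses to minimize, and the code is not STOMP's training objective, which the abstract does not spell out.

```python
import math

def tchebysheff(values, weights, ideal):
    """Classic weighted Tchebysheff scalarization (to be minimized)."""
    return max(w * (v - z) for v, w, z in zip(values, weights, ideal))

def smooth_tchebysheff(values, weights, ideal, mu):
    """Log-sum-exp smoothing of the max; approaches it as mu -> 0."""
    terms = [w * (v - z) / mu for v, w, z in zip(values, weights, ideal)]
    shift = max(terms)  # subtract the max for numerical stability
    return mu * (shift + math.log(sum(math.exp(t - shift) for t in terms)))

# Hypothetical two-objective example (losses, lower is better).
losses = [0.8, 0.3]
weights = [0.5, 0.5]
ideal = [0.0, 0.0]

hard = tchebysheff(losses, weights, ideal)
soft = smooth_tchebysheff(losses, weights, ideal, mu=0.1)
print(hard, soft)
```

The smoothed value always lies between the hard max and the hard max plus `mu * log(m)` for `m` objectives, so a small `mu` tracks the hard max closely while keeping the scalarization differentiable in every objective.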
Authors: Rhyan Barrett, Sophia Wesely, Julia Westermayr
Abstract: Transferable excited‑state dynamics offer a route to efficient screening of photophysical behavior across molecular systems, but conventional nonadiabatic simulations remain prohibitively expensive. Here we introduce X‑MACE, a transferable machine‑learning potential for excited‑state dynamics that predicts multiple potential energy surfaces, forces and oscillator strengths, and combine it with curvature‑driven surface hopping to enable data‑efficient screening of photochemical pathways. We apply this framework to fluorescent chromophores as an example application, using green fluorescent protein chromophore variants to demonstrate how subtle structural modifications reshape excited‑state relaxation, lifetimes and photoisomerization yields. Fine‑tuning a single pretrained model with fewer than 100 reference geometries per derivative yields accurate dynamics across a chemically diverse set of analogues. The screening reveals two governing design principles: steric crowding on the phenolate ring lowers the torsional barrier and accelerates access to twisted conical intersections, whereas conjugation extension stabilizes planar excited‑state configurations, suppresses non‑radiative decay and prolongs fluorescence. More broadly, this workflow provides a general framework for scalable excited‑state screening and interpretable design of photophysical properties.
Authors: Yankang Liu, Ke Zhang, Maziar Raissi, Roya Zandi
Abstract: We learn parameterized nonlinear elasticity on curved surfaces using a physics‑informed neural network that enforces governing equations and boundary conditions directly through the loss function, enabling a single trained model to represent a continuous family of elastic equilibria across geometric and material parameters. Nonlinear elasticity on curved manifolds underlies the mechanics of crystalline shells, elastic membranes, and viral capsids, where curvature and topological defects determine equilibrium structure and stability. Traditional exact and finite element solvers rely on symmetry reduction and must be reinitialized for each parameter choice, limiting scalability when symmetry is broken or parameters vary. We validate the proposed learning‑based solver on a benchmark problem from curved elasticity, namely the one‑dimensional single disclination on a spheroidal surface with known exact and numerical solutions. The network accurately reproduces these solutions, including parameter combinations excluded from training, demonstrating generalization across geometry and material regimes. This study establishes a scalable framework for learning nonlinear elastic systems on curved manifolds and lays the groundwork for extensions to fully two‑dimensional and multi‑defect configurations relevant to protein shells and other curved elastic networks.
Authors: Seungik Cho
Abstract: Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure‑based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well‑established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four‑expert Mixture‑of‑Experts (MoE) fusion module with trimodal cross‑modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM‑2 (650M); structural embeddings from AlphaFold2‑predicted C‑alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B‑factors, mode shapes, and residue‑residue cross‑correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein‑specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/‑ 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero‑shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well‑calibrated probabilistic outputs (ECE = 0.044) without post‑hoc correction.
Authors: César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández
Abstract: The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage‑free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k‑mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross‑validation framework to ensure unbiased performance estimation. The best‑performing configuration (ProtBERT + MLP) achieves an F1‑score of 0.704 +/‑ 0.028 and ROC‑AUC of 0.748 +/‑ 0.047, indicating only moderate discriminative performance. Classical representations such as k‑mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction‑based descriptors, are required for robust disease modeling.
Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu
Abstract: Goal‑directed molecular generation requires satisfying heterogeneous constraints such as protein‑ligand compatibility and multi‑objective drug‑like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non‑differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition‑aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non‑differentiable objectives while preserving chemical validity and diversity. The non‑autoregressive nature of diffusion language models further enables iterative refinement of molecular fragments at inference time. Experiments on structure‑conditioned, property‑conditioned, and dual‑conditioned benchmarks demonstrate consistent improvements over state‑of‑the‑art methods in binding affinity, drug‑likeness, and success rate, highlighting the effectiveness of our framework.
Authors: Ben Isselmann, Dilara Göksu, Heinz Neumann, Andreas Weinmann
Abstract: Background: Task‑specific microscopy datasets are often small, making it difficult to train deep learning models that learn robust features. While self‑supervised learning (SSL) has shown promise through pretraining on large, domain‑specific datasets, generalizability across datasets with differing staining protocols and channel configurations remains underexplored. We investigated the generalizability of SSL models pretrained on ImageNet‑1k and HPA FOV, evaluating their embeddings on OpenCell with and without fine‑tuning, two channel‑mismatch strategies, and varying fine‑tuning data fractions. We additionally analyzed single‑cell embeddings on a labeled OpenCell subset.
Result: DINO‑based ViT backbones pretrained on HPA FOV or ImageNet‑1k transfer well to OpenCell even without fine‑tuning. The HPA FOV‑pretrained model achieved the highest zero‑shot performance (macro F1 0.822 +/‑ 0.007). Fine‑tuning further improved performance to 0.860 +/‑ 0.013. At the single‑cell level, the HPA single‑cell‑pretrained model achieved the highest k‑nearest neighbor performance across all neighborhood sizes (macro F1 ≥ 0.796).
Conclusion: SSL methods like DINO, pretrained on large domain‑relevant datasets, enable effective use of deep learning features for fine‑tuning on small, task‑specific microscopy datasets.
Authors: Peter Schurtenberger, Marco Polimeni, Sophia Marzouk, Robin Curtis, Emanuela Zaccarelli, Anna Stradner
Abstract: Colloid models have frequently been used to successfully describe the influence of protein‑protein interactions on antibody solution properties, but they suffer from inherent problems due to the anisotropic shape of the particles. The net charge required to describe electrostatic interactions is an effective quantity that cannot directly be obtained from the known molecular structure of an antibody, and the solution structure caused by excluded volume interactions is strongly overestimated at high concentrations due to the assumption of hard sphere interactions. As a result, these models have descriptive rather than predictive power. Here we present an improved, soft penetrable sphere model based on analogies to soft colloids and star polyelectrolytes that take into account the Y‑shaped antibody form and the corresponding charge and ion distribution. The model not only correctly describes the concentration and ionic strength dependence of thermodynamic and collective dynamics quantities such as the osmotic compressibility and the apparent hydrodynamic radius, but also reproduces the center‑of‑mass static structure factor obtained in computer simulations using a weakly coarse‑grained model, in which the antibody is described at an amino acid level. We demonstrate that this soft penetrable sphere model quantitatively reproduces experimental data from static and dynamic light scattering at low and high ionic strength for two well‑characterized monoclonal antibodies (mAbs) using the net charges and the overall mAb dimensions directly obtained from their molecular structure.
Authors: Karie A. Nicholas, Vikram Khipple Mulligan
Abstract: Although optimization is one of the most promising applications of quantum computers, the development of effective optimization strategies requires real‑world test cases. When planning our recent wedding reception, we realized that the problem of optimally seating our guests, given constraints related to guests' relatedness, shared interests, and physical needs, could be mapped to a cost function network (CFN) form solvable with classical or quantum optimization algorithms. We compared the seating optimization performance of classical Monte Carlo CFN solvers in the Masala software suite to that of quantum annealing‑based CFN optimization algorithms using one‑hot, domain‑wall, and approximate binary mappings, which we had developed for protein design problems. Surprisingly, the D‑Wave Advantage 2 system, which performs well on similarly‑structured CFN problems for protein design, struggled to return optimal seating arrangements that were easily found by classical Monte Carlo methods. We provide our seating optimization benchmark set, and code to convert seating optimization problems to CFN problems, as a plugin library for Masala, permitting this class of real‑world problems to be used to benchmark performance of current and future classical CFN solvers, quantum optimization algorithms, and quantum computing hardware.
Authors: Linn Evenseth, Kamil Galewski, Witold Jarnicki, Piero Lafiosca, Vyom N. Patel, Grzegorz Rajchel-Mieldzioć, Martin Šimka, Michał Szczepanik, Emil Żak
Abstract: We present a computational platform for modeling chemical reactions in complex molecular environments, focused on ligand‑protein binding in drug discovery. The platform implements our new quantum‑in‑quantum‑in‑classical (QM/QM/MM) multiscale embedding model that integrates molecular dynamics with a quantum‑information‑enhanced density matrix embedding theory and quantum chemistry solvers, including explicit solvent. Quantum‑information metrics are utilized to generate entanglement‑consistent orbitals, enabling a high‑accuracy description of strongly correlated regions. The framework supports multiple computational backends, including multi‑CPU, NVIDIA multi‑GPU architectures, and quantum hardware (IQM, IonQ, IBM) integrated under CUDA‑Q, and is designed for compatibility with future fault‑tolerant quantum systems. The new platform's capabilities are demonstrated by modeling covalent docking of zanubrutinib to Bruton's tyrosine kinase via a Michael addition mechanism, computing the full reaction energy profiles and energy barriers at a reduced computational cost relative to existing methods. As a 2nd‑generation anticancer agent, zanubrutinib serves as a proof of concept for covalent inhibitor discovery. Accurate first‑principles reaction barrier estimations provided by our method can contribute to reducing false positive and negative rates in drug discovery pipelines. Scalability is validated through benchmarks on GPU clusters and cloud‑based CPU infrastructures. We demonstrate integration with quantum devices (up to 20 qubits), alongside resource estimates for fault‑tolerant quantum computing, indicating potential speedups of up to 20x. Beyond single reactions, the platform supports the construction of reaction networks in chemical metric space, facilitating ligand screening and systematic exploration of reactive pathways.
Authors: Zhen Li, Milana Bazayeva, Thaddeus Pellegrini, Subhamoy Bhowmik, Susanta Das, Danil Kaliakin, Fangchun Liang, Akhil Shajan, Kenneth M. Merz
Abstract: The use of free energy perturbation (FEP) methods to study protein‑ligand complexes is one of the most important tools in structure‑based drug design. Because FEP methods typically rely on force fields, they may suffer from force field parameter‑related issues. Herein, we present a quantum mechanics/molecular mechanics (QM/MM) hybrid method to overcome deficiencies in force‑field models by using QM bookending approaches on both classical and quantum hardware. In the MM part of this QM/MM FEP method, AMBER is used to simulate the protein receptor and the unperturbed moiety of the ligand, with the ff19SB and GAFF2 force fields. In the QM part, QUICK is used to conduct Hartree‑Fock (HF) calculations, followed by heat‑bath configuration interaction (HCI) as a benchmark on classical devices. To enable the HCI function in QUICK, we developed a Python‑based interface to execute HCI from IBM's qiskit‑addon‑dice‑solver. Moreover, the same interface also enabled this work to execute QM/MM FEP calculations on quantum hardware using the Local Unitary Cluster Jastrow (LUCJ) ansatz, followed by sample‑based diagonalization (SQD) and extended‑SQD (extSQD) post‑processing. Using a series of thermolysin inhibitors as an example, we find reasonable agreement with experiment for both the classical HCI method and the LUCJ‑SQD/extSQD method, with the latter yielding a result closer to the experimental value. The execution times of the HCI‑based FEP method and the LUCJ‑SQD/extSQD‑based FEP method are also comparable, indicating a high potential for utility in the noisy intermediate‑scale quantum (NISQ) era.
Authors: Francesco Micucci, Matteo Barbieri, Gabriella Bettonte, Domenico Bonanni, Anita Camillini, Anna Fava, Daniele Gregori, Andrea R. Beccari, Gianluca Palermo
Abstract: Molecular docking is a crucial step in the development of new drugs as it guides the positioning of a small molecule (ligand) within the pocket of a target protein. In the literature, a feasibility study explored the potential of D‑Wave quantum annealers for purely geometric molecular docking, neglecting physicochemical interactions between the protein and the ligand and focusing solely on their simplified geometries. To achieve this, the ligands were represented as graphs incorporating their geometric properties and then mapped onto a grid that discretized the three‑dimensional space of the protein pocket. The quality of the ligand pose on the protein pocket was evaluated through the isomorphism between the ligand graph and the spatial grid. This paper builds on the previous study by introducing physicochemical interactions between the protein‑ligand pair into the QUBO problem to improve the accuracy of the docking results. This paper presents a novel QUBO formulation that includes Coulomb and van der Waals forces, together with components representing H‑bond and hydrophobic interactions. We integrate these physical interactions as corrective terms to the previous purely geometric QUBO formulation, and provide experimental results using the D‑Wave quantum annealers to demonstrate their impact on the accuracy of the docking results.
Authors: Simon J. Crouzet
Abstract: Generative models can now propose thousands of de novo antibody sequences, yet translating these designs into viable therapeutics remains constrained by the cost of biophysical characterization. Here we present CrossAbSense, a framework of property‑specific neural oracles that combine frozen protein language model encoders with configurable attention decoders, identified through a systematic hyperparameter campaign totaling over 200 runs per property. On the GDPa1 benchmark of 242 therapeutic IgGs, our oracles achieve notable improvements of 12‑20% over established baselines on three of five developability assays and competitive performance on the remaining two. The central finding is that optimal decoder architectures invert our initial biological hypotheses: self‑attention alone suffices for aggregation‑related properties (hydrophobic interaction chromatography, polyreactivity), where the relevant sequence signatures, such as CDR‑H3 hydrophobic patches, are already fully resolved within single‑chain embeddings by the high‑capacity 6B encoder. Bidirectional cross‑attention, by contrast, is required for expression yield and thermal stability, properties that inherently depend on the compatibility between heavy and light chains. Learned chain fusion weights independently confirm heavy‑chain dominance in aggregation (w_H = 0.62) versus balanced contributions for stability (w_H = 0.51). We demonstrate practical utility by deploying CrossAbSense on 100 IgLM‑generated antibody designs, illustrating a path toward substantial reduction in experimental screening costs.
Authors: Jingxuan He, Karol Długołecki, Hubertus Bromberger, Amit K. Samanta, Jochen Küpper
Abstract: We report a cryogenic buffer‑gas‑cell‑aerodynamic‑lens‑stack setup that enables the generation of shock‑frozen, dense, and controllable beams of various nanoparticles in the gas phase, including small and low‑density species such as isolated proteins. We demonstrate characterization of the setup using strong‑field ionization combined with velocity‑map imaging, allowing the unambiguous detection of nanoparticles in the protein‑size range and full reconstruction of the particle beams including determination of particle flux and number density. The generation and characterization workflow presented here provides a valuable approach for protein‑like sample preparation and delivery in single‑particle diffractive imaging, microscopy, and low‑temperature nanoscience.
Authors: Hikaru Wakaura
Abstract: Quantum brain proposals require coherence on behaviorally relevant timescales, yet the gap between spin coherence times and neural decision windows has remained a quantitative obstacle. We evaluate approximate covariant quantum error correction (CQEC), a purification protocol constrained by the Eastin‑Knill theorem, across two radical‑pair proteins parameterized by ab initio spin Hamiltonians: monoamine oxidase A (MAO‑A) and cryptochrome (CRY, PDB 4I6G). Both share a three‑layer architecture (³¹P nuclear spin memory, electron spin interface, classical electrochemistry) and identical hyperfine coupling (A = 200 MHz), but differ 16‑fold in nuclear T_2: 3.2 ms (MAO‑A) versus 52 ms (CRY). We test whether CQEC preserves coherence over the 200 ms Schultze‑Kraft veto window by mapping each protein's T_2 gap onto a simulation decoherence rate (γ_veto = T_2 gap/2T_sim): 3.08 for MAO‑A, 0.19 for CRY. At γ_veto = 0.19, CQEC maintains tunneling coherence of 0.83 (95% CI [0.76, 0.79]; versus 0.12 without correction, a 6.9× improvement). At γ_veto = 3.08, coherence collapses to 0.012 even with CQEC. A T_2 sensitivity analysis confirms robustness: at T_2 = 26 ms (half the CRY estimate), CQEC‑protected coherence remains 0.69. A classical Markov baseline produces only monotonic relaxation, confirming that CQEC‑maintained oscillatory dynamics are genuinely quantum. However, no single protein optimizes both layers: CRY's shorter T_2^e (0.53 ns versus 1.1 ns) worsens Layer 2 fidelity. This layer‑protein tradeoff, together with unresolved challenges in state preparation and entanglement distribution, defines the next targets for quantum brain research.
Authors: Wenjun Yu, Moshe Schwartz
Abstract: A single coloring channel is defined by a subset of letters it allows to pass through, while deleting all others. A sequence of coloring channels provides multiple views of the same transmitted letter sequence, forming a type of sequence‑reconstruction problem useful for protein identification and information storage at the molecular level. We provide exact capacities of several sequences of coloring channels: uniform sunflowers, two arbitrary intersecting sets, and paths. We also show how this capacity depends solely on a related graph we define, called the pairs graph. Using this equivalence, we prove lower and upper bounds on the capacity, and a tailored bound for a coloring‑channel sequence forming a cycle. In particular, for an alphabet of size 4, these results give the exact capacity of all coloring‑channel sequences except for a cycle of length 4, for which we only provide bounds.
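The channel definition in this abstract (keep the letters in an allowed subset, delete all others) can be written in a few lines. The sketch below is illustrative only: the function names, the example sequence, and the three letter subsets (chosen so that consecutive subsets intersect, forming a path) are assumptions, not taken from the paper.

```python
def coloring_channel(sequence: str, allowed: set) -> str:
    """Pass letters in `allowed` through unchanged and delete all others."""
    return "".join(ch for ch in sequence if ch in allowed)

def channel_views(sequence: str, channels: list) -> list:
    """Apply each coloring channel to the same transmitted sequence."""
    return [coloring_channel(sequence, allowed) for allowed in channels]

# Hypothetical alphabet of size 4 and a path of three two-letter channels.
channels = [{"A", "C"}, {"C", "G"}, {"G", "T"}]
views = channel_views("ACGTTGCA", channels)
print(views)  # ['ACCA', 'CGGC', 'GTTG']
```

Each view keeps the relative order of its own letters but loses how they interleave with the deleted ones, which is the information the reconstruction problem has to recover from overlapping views.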
Authors: Agostino Occhicone, Alberto Sinibaldi, Peter Munzert, Jordan N. Butt, Ethan P. Luta, Diego M. Arévalo, Francesco Michelotti, Benjamin L. Miller
Abstract: This study presents a rigorous comparative analysis of two label‑free optical biosensing platforms, Bloch surface wave (BSW) and microring resonator (MRR), for the detection of SARS‑CoV‑2 antibodies in human serum. To ensure direct comparability, a new BSW readout system was established alongside an existing MRR platform, allowing assays to be conducted under nearly identical experimental conditions. Both sensors were functionalized with various SARS‑CoV‑2 Spike and Nucleocapsid protein variants to capture specific host antibodies. The results demonstrate that both platforms provide rapid, quantitative, and sensitive detection of anti‑Spike and anti‑Nucleocapsid antibodies without the need for secondary labels. Furthermore, the platforms show excellent agreement with longitudinal serology benchmarks and high repeatability across different biochip batches. This work establishes both BSW and MRR technologies as powerful, low‑cost candidates for next‑generation clinical diagnostics and serological surveillance.
Authors: Yilong Dai, Shengyu Chen, Xiaowei Jia, Runlong Yu
Abstract: Partial differential equations (PDEs) govern nearly every physical process in science and engineering, yet solving them at scale remains prohibitively expensive. Generative AI has transformed language, vision, and protein science, but learned PDE solvers have not undergone a comparable shift. Existing paradigms each capture part of the problem. Physics‑informed neural networks embed residual structure, yet they are often difficult to optimize in stiff, multiscale, or large‑domain regimes. Neural operators amortize across instances, yet they commonly inherit a snapshot‑prediction view of solving and can degrade over long rollouts. Diffusion‑based solvers model uncertainty, yet they are often built on a solver template that still centers on state regression. We argue that the core issue is the abstraction used to train learned solvers. Many models are asked to predict states, while many scientific settings require modeling how uncertainty moves through constrained dynamics. The relevant object is transport over physically admissible futures. This motivates flow learners: models that parameterize transport vector fields and generate trajectories through integration, echoing the continuous dynamics that define PDE evolution. This physics‑to‑physics alignment supports continuous‑time prediction, native uncertainty quantification, and new opportunities for physics‑aware solver design. We explain why transport‑based learning offers a stronger organizing principle for learned PDE solving and outline the research agenda that follows from this shift.
Authors: Luca Pennati, Andong Hu, Ivy Peng, Lukas Müllender, Stefano Markidis
Abstract: GROMACS is a de‑facto standard for classical Molecular Dynamics (MD). The rise of AI‑driven interatomic potentials that pursue near‑quantum accuracy at MD throughput now poses a significant challenge: embedding neural‑network inference into multi‑GPU simulations while retaining high performance. In this work, we integrate the MLIP framework DeePMD‑kit into GROMACS, enabling domain‑decomposed, GPU‑accelerated inference across multi‑node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. The inference is executed concurrently on all processes, with two MPI collectives used each step to broadcast coordinates and to aggregate and redistribute forces. We train an in‑house DPA‑1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then benchmark the GROMACS‑DeePMD integration with a 15,668‑atom protein on NVIDIA A100 and AMD MI250x GPUs on up to 32 devices. Strong‑scaling efficiency reaches 66% at 16 devices and 40% at 32; weak‑scaling efficiency is 80% up to 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that >90% of the wall time is spent in DeePMD inference, while MPI collectives contribute <10%, primarily because they act as a global synchronization point. The principal bottlenecks are the irreducible ghost‑atom cost set by the cutoff radius, confirmed by a simple throughput model, and load imbalance across ranks. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS.
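For readers unfamiliar with the scaling figures quoted in this abstract, the sketch below shows the standard textbook definitions of strong- and weak-scaling efficiency. It assumes these are the definitions the authors use, which the abstract does not state. The wall-clock timings are hypothetical and are not the paper's measurements.

```python
def strong_scaling_efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Fixed total problem size: ideal wall time shrinks as n_ref / n."""
    return (t_ref * n_ref) / (t_n * n)

def weak_scaling_efficiency(t_ref: float, t_n: float) -> float:
    """Fixed work per device: ideal wall time stays constant."""
    return t_ref / t_n

# Hypothetical wall-clock times in seconds per fixed number of MD steps.
strong = strong_scaling_efficiency(t_ref=160.0, n_ref=1, t_n=16.0, n=16)
weak = weak_scaling_efficiency(t_ref=10.0, t_n=12.5)
print(f"strong: {strong:.1%}, weak: {weak:.1%}")
```

With these made-up timings, a 10x speedup on 16 devices corresponds to 62.5% strong-scaling efficiency, and a 25% slowdown at constant per-device work corresponds to 80% weak-scaling efficiency.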
Authors: Maodong Li, Dechin Chen, Zhijun Pan, Zhe Wang, Yi Isaac Yang
Abstract: Understanding the kinetics of drug‑protein interactions is paramount for drug design, yet the field lacks large‑scale, dynamic data to move beyond static structural analysis. Here, we present DD‑03B, a massively scalable database providing dynamic, all‑atom dissociation trajectories for a broad set of ligand‑protein complexes. Utilising and extending a validated computational pipeline, we generated dissociation trajectories for 19,037 ligand‑protein complexes sourced from PDBbind+v2020R1, resulting in a repository of approximately 0.3 billion simulation frames totalling 40 TB in size. For these systems, which possess experimental binding affinities (K_d) but typically lack measured k_off rates, we computed and assigned dissociation rate constants through trajectory reweighting. Our analysis reveals that protein‑ligand complexes can be categorised into three mechanistic types (pathway‑dominant, open‑pocket, and entropy‑pocket systems), each requiring distinct strategies for accurate kinetic characterisation. Together with our previously released DD‑13M, DD‑03B forms the core of the expandable Dissociation Dynamic Database (DDD) project, which will be continuously augmented with new trajectories. This large‑scale, publicly available resource establishes a critical foundation for training and benchmarking next‑generation generative AI models to predict and optimise drug‑protein dissociation kinetics.
Authors: Mulusew W. Yaltaye, Yingqi Zhao, Kuo Zhan, Vahid Farrahi, Jian-An Huang
Abstract: Protein phosphorylation provides a dynamic readout of cellular signaling yet remains difficult to detect at low abundance and stoichiometry. Single‑molecule surface‑enhanced Raman spectroscopy (SM‑SERS) using particle‑in‑pore plasmonic nanopores offers label‑free molecular detection with submolecular sensitivity. However, reliable identification of subtle post‑translational modifications (PTMs) is hindered by the stochastic nature of SM‑SERS signals, partial excitation of peptide residues within the plasmonic hotspot, and background interference. Here, we introduce a physics‑informed deep learning framework to decode complex SM‑SERS dynamics and identify single‑peptide PTMs. The model integrates multiple‑instance learning with a temporal encoder combining temporal convolutional networks and bidirectional gated recurrent units to capture both local spectral variability and long‑range blinking dynamics. To address diffusion‑driven spectral heterogeneity, long spectral trajectories are segmented using Pearson‑correlation, enabling weakly supervised training under label ambiguity. This framework robustly distinguishes single peptide phosphorylation despite strong background interference and stochastic signal fluctuations. By coupling nanoplasmonic confinement with spatiotemporal deep learning, our approach enables high‑fidelity detection of single‑molecule phosphorylation events and advances ultrasensitive phosphoproteomic analysis.
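The Pearson‑correlation segmentation step mentioned in this abstract can be pictured with the minimal sketch below. The abstract does not specify the rule, so the choice made here (start a new segment whenever two consecutive spectra correlate below a threshold) and the threshold value are assumptions, and `pearson` and `segment_by_correlation` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat spectrum carries no shape information
    return cov / math.sqrt(var_x * var_y)


def segment_by_correlation(spectra, threshold=0.8):
    """Group time-ordered spectra into runs of mutually similar frames.

    Assumed rule: a new segment starts when consecutive frames decorrelate.
    Returns a list of segments, each a list of frame indices.
    """
    if not spectra:
        return []
    segments = [[0]]
    for t in range(1, len(spectra)):
        if pearson(spectra[t - 1], spectra[t]) < threshold:
            segments.append([t])
        else:
            segments[-1].append(t)
    return segments
```

Each resulting segment could then serve as one instance in a multiple‑instance learning bag, which is one plausible reading of how segmentation supports weakly supervised training.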
Authors: Bryan Cheng, Jasper Zhang
Abstract: We present the first systematic study of when target context helps molecular property prediction, evaluating context conditioning across 10 diverse protein families, 4 fusion architectures, data regimes spanning 67‑9,409 training compounds, and both temporal and random evaluation splits. Using NestDrug, a FiLM‑based architecture that conditions molecular representations on target identity, we characterize both success and failure modes with three principal findings. First, fusion architecture dominates: FiLM outperforms concatenation by 24.2 percentage points and additive conditioning by 8.6 pp; how you incorporate context matters more than whether you include it. Second, context enables otherwise impossible predictions: on data‑scarce CYP3A4 (67 training compounds), multi‑task transfer achieves 0.686 AUC where per‑target Random Forest collapses to 0.238. Third, context can systematically hurt: distribution mismatch causes 10.2 pp degradation on BACE1; few‑shot adaptation consistently underperforms zero‑shot. Beyond methodology, we expose fundamental flaws in standard benchmarking: 1‑nearest‑neighbor Tanimoto achieves 0.991 AUC on DUD‑E without any learning, and 50% of actives leak from training data, rendering absolute performance metrics meaningless. Our temporal split evaluation (train up to 2020, test 2021‑2024) achieves stable 0.843 AUC with no degradation, providing the first rigorous evidence that context‑conditional molecular representations generalize to future chemical space.
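The 1‑nearest‑neighbor Tanimoto baseline cited in this abstract needs no training at all, which the sketch below tries to make concrete. Fingerprints are represented as sets of on‑bit indices; scoring a query by its maximum similarity to known actives is an assumption about how such a baseline is typically set up, not a description of the authors' exact protocol.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def one_nn_score(query_fp, reference_active_fps):
    """Score a query compound by its similarity to the closest reference active."""
    return max((tanimoto(query_fp, fp) for fp in reference_active_fps), default=0.0)


if __name__ == "__main__":
    # Toy fingerprints, purely illustrative.
    actives = [{1, 2, 3, 4}, {10, 11, 12}]
    print(one_nn_score({1, 2, 3, 5}, actives))
    print(one_nn_score({20, 21}, actives))
```

If test actives are near‑duplicates of reference actives, this score alone separates actives from decoys, which is the kind of leakage the abstract warns about.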
Authors: Susan Khor
Abstract: A method that reconstructs protein residue networks using suitable node selection and edge recovery policies produced numerical observations that correlate strongly (Pearson's correlation coefficient < ‑0.83) with published folding rates for 52 two‑state folders and 21 multi‑state folders; correlations are also strong at the fold‑family level. These results were obtained serendipitously with the ND model, which was introduced previously, but is here extended with policies that dictate actions according to feature states. This result points to the importance of both the starting search point and the prevailing condition (random seed) for the quick success of policy search by a simple hill‑climber. The two conditions, suitable policies and random seed, which (as evidenced by the strong correlation statistic) set up a conducive environment for modelling protein folding within ND, could be compared to the appropriate physiological conditions required by proteins to fold naturally. Of interest is an examination of the sequence of restored edges for its potential as a plausible protein folding pathway. Towards this end, trajectory data is collected for analysis and further model evaluation and development.
Authors: Aniketh Iyengar, Jiaqi Han, Pengwei Sun, Mingjian Jiang, Jianwen Xie, Stefano Ermon
Abstract: Generating molecular dynamics (MD) trajectories using deep generative models has attracted increasing attention, yet remains inherently challenging due to the limited availability of MD data and the complexities involved in modeling high‑dimensional MD distributions. To overcome these challenges, we propose a novel framework that leverages structure pretraining for MD trajectory generation. Specifically, we first train a diffusion‑based structure generation model on a large‑scale conformer dataset, on top of which we introduce an interpolator module trained on MD trajectory data, designed to enforce temporal consistency among generated structures. Our approach effectively harnesses abundant structural data to mitigate the scarcity of MD trajectory data and effectively decomposes the intricate MD modeling task into two manageable subproblems: structural generation and temporal alignment. We comprehensively evaluate our method on the QM9 and DRUGS small‑molecule datasets across unconditional generation, forward simulation, and interpolation tasks, and further extend our framework and analysis to tetrapeptide and protein monomer systems. Experimental results confirm that our approach excels in generating chemically realistic MD trajectories, as evidenced by remarkable improvements of accuracy in geometric, dynamical, and energetic measurements.
Authors: Tianyu Liu, Sihan Jiang, Fan Zhang, Kunyang Sun, Teresa Head-Gordon, Hongyu Zhao
Abstract: Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost‑effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text‑based descriptions of physicochemical drug characteristics, drug synergism, drug‑protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
Authors: Yu Akagi, Tomohisa Seki, Toru Takiguchi, Hiromasa Ito, Yoshimasa Kawazoe, Kazuhiko Ohe
Abstract: Counterfactual simulation ‑ exploring hypothetical consequences under alternative clinical scenarios ‑ holds promise for transformative applications such as personalized medicine and in silico trials. However, it remains challenging due to methodological limitations. Here, we show that an autoregressive generative model trained on real‑world data from over 300,000 patients and 400 million patient timeline entries can generate clinically plausible counterfactual trajectories. As a validation task, we applied the model to patients hospitalized with COVID‑19 in 2023, modifying age, serum C‑reactive protein (CRP), and serum creatinine to simulate 7‑day outcomes. Increased in‑hospital mortality was observed in counterfactual simulations with older age, elevated CRP, and elevated serum creatinine. Remdesivir prescriptions increased in simulations with higher CRP values and decreased in those with impaired kidney function. These counterfactual trajectories reproduced known clinical patterns. These findings suggest that autoregressive generative models trained on real‑world data in a self‑supervised manner can establish a foundation for counterfactual clinical simulation.
Authors: Matteo Cartiglia, Natan Biesmans, Wannes Peeters, Wouter Botermans, Koen Ongena, Liam Vandekerckhove, Wouter Renckens, Eric Beamish, Elizabeth Skelly, Kirill A. Afonin, Pol van Dorpe, Sanjin Marion
Abstract: High‑throughput solid‑state nanopore experiments generate continuous MHz‑rate data streams in which only a small fraction of data contains informative molecular information. This creates storage and processing bottlenecks that limit experimental scalability. We introduce Data Sieving, a GPU‑accelerated acquisition framework that integrates real‑time event detection directly into the measurement pipeline and selectively stores and allows real‑time analysis of snapshots around molecular translocations. The system employs a lightweight rolling‑average and min‑max trigger to identify event candidates in parallel across channels. This architecture reduces stored data volume by up to 98% while preserving complete molecular signatures across a wide temporal range, from microsecond‑scale protein dynamics to second‑scale nucleic acid nanoparticle events. Continuous baseline monitoring enables autonomous closed‑loop actuation; in high‑concentration DNA experiments, automatic declogging restored pore conductance, reducing the time spent in a non‑productive clogged state to near‑zero and without interrupting parallel measurements. Validated across DNA, protein, and nucleic acid nanoparticle measurements, Data Sieving links data storage directly to molecular information content rather than experiment duration, enabling scalable, real‑time operation of parallel nanopore sensors. The approach provides a hardware‑agnostic foundation for long‑duration, high‑bandwidth single‑molecule experiments and other event‑driven sensing platforms. By using algorithms intrinsically compatible with low‑latency digital architectures, this framework provides a clear path toward high‑bandwidth, highly multiplexed recording across hundreds of individual nanopore channels in both solid‑state and biological pores.
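The rolling‑average and min‑max trigger described in this abstract can be illustrated with a small CPU‑only sketch. Everything specific here is an assumption: the block‑wise processing, the parameter names (`chunk`, `history`, `threshold`), and the rule of excluding triggered blocks from the baseline are illustrative choices, not the authors' GPU implementation.

```python
from collections import deque


def sieve(samples, chunk=4, history=3, threshold=1.0):
    """Return (start, end) index ranges of blocks that look like events.

    A rolling average of recent quiet blocks estimates the open-pore baseline;
    a block is flagged when its min or max departs from that baseline by more
    than `threshold`. Only flagged ranges would need to be stored.
    """
    baseline_means = deque(maxlen=history)
    events = []
    for start in range(0, len(samples), chunk):
        block = samples[start:start + chunk]
        if baseline_means:
            baseline = sum(baseline_means) / len(baseline_means)
            if max(block) - baseline > threshold or baseline - min(block) > threshold:
                events.append((start, start + len(block)))
                continue  # keep event samples out of the baseline estimate
        baseline_means.append(sum(block) / len(block))
    return events


if __name__ == "__main__":
    # Synthetic trace: flat baseline with one short current blockade.
    trace = [10.0] * 8 + [10.0, 7.0, 7.0, 10.0] + [10.0] * 4
    print(sieve(trace))
```

Because each block only needs a minimum, a maximum, and a mean, the same logic maps naturally onto per‑channel parallel reductions, which is presumably why a trigger of this kind suits low‑latency hardware.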
Authors: Francesco Michelotti, Elisabetta Sepe, Agostino Occhicone, Norbert Danz, Alberto Sinibaldi
Abstract: Protein rotational kinetics are essential for understanding macromolecular behavior in crowded environments, yet measuring these dynamics at solid‑liquid interfaces remains a significant challenge due to low signal strengths. Here, we experimentally demonstrate a label‑based optical technique for measuring rotational diffusion kinetics using an all‑dielectric multilayer stack that sustains both transverse electric and transverse magnetic polarized surface electromagnetic waves. We introduce the concept of Fluorescence Recovery after Orientational Photobleaching, a rotational analogue to the standard translatory fluorescence recovery after photobleaching technique, which utilizes anisotropic photobleaching via resonant transverse electric excitation followed by real‑time monitoring of the orientational relaxation towards isotropy. Our ratiometric analysis of the transverse electric and magnetic polarized fluorescence components allows for a distance‑independent estimation of the rotational friction coefficient. Applying this method to covalently bound neutravidin, we observe a rotational friction coefficient (about 5.8E‑18 J s) significantly higher than in bulk solutions, highlighting the impact of surface anchoring and molecular crowding. The proposed approach provides a robust, high‑sensitivity platform for resolving biomolecular dynamics in complex interfacial environments.
Authors: Daniil Riabov, Abtin Saateh, Wenhong Yang, Ivan Sinev, Yuri Kivshar, Hatice Altug
Abstract: Optical biosensors are indispensable in medical and environmental diagnostics, yet existing approaches are fundamentally limited in their sensitivity due to ensemble‑averaged measurements. Digital biosensing has emerged as a promising solution for resolving individual binding events, thereby providing signals at very low analyte concentrations down to the single‑molecule level. Here, we present a novel concept for digital optical biosensing empowered by dielectric Mie voids, combining nanoparticle‑based contrast enhancement and deep learning for ultrasensitive biomarker detection. The resonantly trapped light in the air cavities of the periodic Mie void arrays ensures strong overlap between the near‑fields and the single gold nanoparticles that are captured on the surface in the presence of the protein biomarker. Remarkably, this strong interaction creates high‑contrast digital signals for the precise counting of single nanoparticles located both within and outside the voids, yielding efficient use of the entire sensor area for high sensitivity. We employ deep‑ultraviolet (DUV) lithography for the scalable and low‑cost production of Mie voids in silicon wafers and automated image analysis with a convolutional neural network for robust nanoparticle counting. As a proof of our concept, we demonstrate the detection of an important disease biomarker, interleukin‑6 (IL‑6), from small sample volumes at concentrations as low as 1.84 pg/ml, within the physiological range of healthy individuals. Owing to its scalability, precision, and adaptability, our digital nanophotonic biosensing approach based on silicon Mie voids establishes a versatile route for applications ranging from bioanalytics to health and environmental monitoring.
Authors: Kai Nelson, Tobias Kreiman, Sergey Levine, Aditi S. Krishnapriyan
Abstract: A fundamental challenge in science and engineering is the simulation‑to‑experiment gap. While we often possess prior knowledge of physical laws, these physical laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system's full underlying state. We propose a data‑driven distribution alignment framework that bridges this simulation‑to‑experiment gap by pre‑training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain‑agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions, initially trained on a simulated Boltzmann distribution, with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.
Authors: Elena N. Govorun, Martin Lenz
Abstract: Proteins can combine into functional elements in living cells or self‑assemble into unwanted structures in a number of diseases. The resulting aggregates often display filamentous morphologies across a large range of protein shapes and molecular interactions. This has led to the suggestion that filament formation could be a generic outcome of the aggregation of geometrically complex, ill‑fitting objects, although such a mechanism has not been demonstrated in three dimensions. To address this problem, we theoretically study the self‑assembly of three‑dimensional identical, ill‑fitting deformable subunits mimicking globular proteins in solution. In our model, self‑assembling subunits incur deformations that accumulate as the aggregate size increases and can eventually hamper further assembly. We analytically predict the ground state morphologies of the resulting aggregates as a function of the subunit adhesivity and elasticity by mapping their mechanics onto those of two incompatible, interconnected networks. We find that zero‑dimensional clusters, three‑dimensional bulks as well as symmetry‑broken one‑dimensional filaments and two‑dimensional layers can all form depending on assembly parameters. Poorly compressible, moderately adhesive subunits favor filaments. These findings hint at a generic pathway to control self‑assembly in three dimensions and suggest that such mechanisms could be investigated in more realistic protein models.
Authors: Antonin Sulc
Abstract: In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein‑protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64‑dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2× vs 2.9× above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm‑degree anticorrelation are shared with or exceeded by the non‑compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.
Authors: Francesco Pesce, Stephen Farr, Gianni de Fabritiis
Abstract: Accurate prediction of acid dissociation constants (pKa) and the determination of dominant protonation states is critical in drug discovery, influencing molecular properties such as solubility, permeability, and protein‑ligand binding. We present AcepKa, an advanced application integrated into the PlayMolecule AI platform. AcepKa is built upon the theoretically rigorous Uni‑pKa framework, which unifies statistical mechanics with representation learning. By modeling the complete protonation ensemble rather than treating pKa as a scalar regression target, AcepKa ensures thermodynamic consistency across coupled ionization sites. We describe the application's enhanced architecture, which features a retrained Uni‑Mol backbone achieving state‑of‑the‑art performance on standard benchmarks. Furthermore, we detail critical engineering advancements. These include AceConfgen, a proprietary GPU‑accelerated conformer generator that achieves a ~40x speed‑up compared to NVIDIA's nvmolkit, a streamlined inference engine to directly protonate molecules, and a 3D‑aware modality for applying protonation states to bound ligand poses. Finally, we discuss the integration of AcepKa into the PlayMolecule AI ecosystem, a modern AI‑assisted environment for molecular modelling and drug discovery.
Authors: Orson Kirsch, Nicole Wood, Steven A Redford, Kabir Husain
Abstract: Natural genomes sometimes encode two different proteins in staggered reading frames of the same DNA sequence. Despite the prevalence of these 'overlapping genes' across the tree of life, it remains unknown whether arbitrary protein pairs can overlap, to what extent such overlaps are feasible, or what design principles govern them. Here, we study compatibility, frustration, and connectivity in the fitness landscape of overlapping genes. We computationally design sequences de novo that satisfy the dual functional constraints of two distinct protein families. The joint fitness landscape, inferred via Potts models from multiple sequence alignments, reveals a fundamental trade‑off between the two proteins and provides a simple criterion for when overlap is feasible. We find widespread compatibility between protein families, with one class of reading frames markedly more permissible than others. By exploring alternative genetic codes, we find that the natural genetic code is uniquely well‑suited to support overlapping genes. Constructing mutational paths between sequences, we find that sequence‑diverse overlapped genes can be connected via a network of near‑neutral mutations. Overall, our results suggest that protein fitness landscapes are sufficiently flexible so as to accommodate the stringent, orthogonal requirements of overlapping genes.
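To make "staggered reading frames of the same DNA sequence" concrete, the sketch below translates one DNA string in two frames with the standard genetic code. It covers same‑strand frames only, the example sequence is arbitrary, and it is not the authors' design pipeline, which relies on Potts models inferred from alignments.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1, or 2), ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    dna = "ATGGCCA"  # arbitrary toy sequence
    print(translate(dna, frame=0))  # MA
    print(translate(dna, frame=1))  # WP
```

Every nucleotide change affects both translations at once, which is the source of the trade‑off between the two proteins that the abstract describes.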
Authors: Axel Giottonini, Thomas Lemmin
Abstract: Molecular dynamics simulations provide detailed trajectories at the atomic level, but extracting interpretable and robust insights from these high‑dimensional data remains challenging. In practice, analyses typically rely on a single representation. Here, we show that representation choice is not neutral: it fundamentally shapes the conformational organization, similarity relationships, and apparent transitions inferred from identical simulation data.
To complement existing representations, we introduce Orientation features, a geometrically grounded, rotation‑aware encoding of protein backbone. We compare it against common descriptions across three dynamical regimes: fast‑folding proteins, large‑scale domain motions, and protein‑protein association. Across these systems, we find that different representations emphasize complementary aspects of conformational space, and that no single representation provides a complete picture of the underlying dynamics.
To facilitate systematic comparison, we developed ManiProt, a library for efficient computation and analysis of multiple protein representations. Our results motivate a comparative, representation‑aware framework for the interpretation of molecular dynamics simulations.
Authors: S. Kojima, S. Rawat, M. Sanchez Miranda, J. G. Gluschke, H. Noji, L. K. Lee, A. P. Micolich
Abstract: We report a method for producing an array of fifty‑two ion‑sensitive PEDOT:PSS organic electrochemical transistors on a glass coverslip, each featuring an integrated fluoropolymer microwell sealed with lipid bilayer into which membrane proteins can be inserted for simultaneous electrical and fluorescence microscopy studies. To demonstrate capability, we fill the microwells with an 'inner' phosphate assay buffer solution containing 20 μM Alexa‑488 dye and 50 mM KCl, seal the microwells with lipid bilayer using an aqueous‑organic‑aqueous liquid exchange technique, and then fill the common flow‑cell volume above the sealed microwells with a dye‑free 'outer' phosphate assay buffer containing 100 mM KCl. We insert α‑hemolysin, which embeds into the lipid bilayer forming a heptameric pore with diameter ~2.6 nm. The pore allows K+ ions to diffuse into the microwell and Alexa‑488 dye molecules to diffuse out of the microwell, producing a corresponding drop in transistor conductance and microwell fluorescence intensity, respectively. These two signals occur at different timescales, consistent with the known size difference between K+ ions and Alexa‑488 molecules. Our approach to fabricating microwell arrays with PEDOT:PSS OECTs incorporated into the bottom of selected microwells distributed in the array is both scalable and versatile, opening a path to studies using larger arrays and with other membrane proteins embedded in the lipid bilayer sealing the microwells.
Authors: Giada Forte, Enzo Orlandini, Davide Marenduzzo
Abstract: Bacterial chromosome replication occurs in the absence of a canonical spindle apparatus; yet it reliably produces organised and segregated genomes. While both passive and active mechanisms have been investigated, DNA replication itself is a non‑equilibrium process that continuously generates new genetic material and reorganizes the nucleoid. Here, we investigate how replication‑driven dynamics, combined with nucleoid‑associated protein (NAP) interactions, shape spatiotemporal chromosome organisation using a three‑dimensional polymer model that explicitly simulates DNA synthesis. We show that NAP‑mediated interactions induce dynamic clustering of DNA, generating density fluctuations in the nucleoid. When coupled to replication, these clusters undergo cycles of stress buildup and release that produce stepwise expansion dynamics consistent with experimental observations. Chromosome segregation occurs naturally in this regime, but only within a finite range of interaction strengths: weak interactions fail to structure the nucleoid, whereas strong interactions hinder replication progression. Within this optimal balance, replication also promotes the spontaneous formation of replication factories. Our results demonstrate that bacterial chromosome organisation can be understood as a non‑equilibrium system in which the interplay between replication forces and protein‑mediated interactions generates nucleoid mechanics, dynamics, and segregation.
Authors: L. Ghiringhelli, A. Zambon, G. Tiana
Abstract: We investigate the parameter space of transformer models trained on protein sequence data using a statistical mechanics framework, sampling the loss landscape at varying temperatures by Langevin dynamics to characterize the low‑loss manifold and understand the mechanisms underlying the superior performance of transformers in protein structure prediction. We find that, at variance with feedforward networks, the lack of a first‑order‑like transition in the loss of the transformer produces a range of intermediate temperatures with good learning properties. We show that the parameters of most layers are highly conserved at these temperatures if the dimension of the embedding is optimal, and we provide an operative way to find this dimension. Finally, we show that the attention matrix is more predictive of the contact maps of the protein at higher temperatures and for higher dimensions of the embedding than those optimal for learning.
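Sampling a loss landscape at a given temperature by Langevin dynamics can be sketched as gradient descent plus temperature‑scaled Gaussian noise. The version below is a minimal overdamped (Euler‑Maruyama) discretization on a toy quadratic loss; the discretization, step size, and loss are assumptions for illustration, and the abstract does not say which scheme the authors used.

```python
import math
import random


def langevin_sample(grad, theta, temperature, step=0.01, n_steps=1000, seed=0):
    """Run overdamped Langevin dynamics on parameters `theta`.

    Update rule (assumed): theta <- theta - step * grad(theta) + sqrt(2 * step * T) * xi,
    with xi drawn from a standard normal. At T = 0 this reduces to gradient descent.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step * temperature)
    theta = list(theta)
    for _ in range(n_steps):
        g = grad(theta)
        theta = [t - step * gi + noise_scale * rng.gauss(0.0, 1.0)
                 for t, gi in zip(theta, g)]
    return theta


if __name__ == "__main__":
    # Toy loss L(theta) = 0.5 * |theta|^2, whose gradient is theta itself.
    quadratic_grad = lambda th: list(th)
    print(langevin_sample(quadratic_grad, [1.0, -2.0], temperature=0.0))
    print(langevin_sample(quadratic_grad, [1.0, -2.0], temperature=0.5))
```

In the long‑time limit this dynamics samples parameters with probability proportional to exp(‑L/T), so raising the temperature widens the explored region around the low‑loss manifold.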
Authors: Alexander Kaltashov, Safa Jamali
Abstract: Multicomponent gel systems have garnered much interest in the past decade due to their compelling mechanical properties. Yet, some mechanisms associated with multicomponent gels, such as sequential gelation, have been explored primarily in the context of chemical, nonreversible polymeric and protein gels rather than physical, reversible colloidal ones. In this study, we use mesoscale simulation techniques to model the sequential gelation of two‑component colloidal systems whose components' interspecies and intraspecies electrostatic interactions can be modified independently. We show that by simply leveraging temporal control and interspecies interactions, we can construct markedly different networks, from double networks to mixed and core‑shell composite structures of varying coarseness and heterogeneity. These findings present a compelling case for further exploration of multicomponent colloidal systems.
Authors: Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Abstract: Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure‑based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina‑Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow‑based latent protein generation architectures and leverage the domain‑domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large‑scale dataset of synthetic binder‑target pairs for pretraining. Combined with high‑quality experimental multimers, this enables training a strong base model. We then perform inference‑time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina‑Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in‑silico success rates than existing generative approaches, and our novel test‑time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class‑guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
Authors: Yang Tan, Lingrong Zhang, Mingchen Li, Yuanxi Yu, Bozitao Zhong, Bingxin Zhou, Nanqing Dong, Liang Hong
Abstract: Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects. VenusFactory2 provides an autonomous framework that shifts from static tool usage to dynamic workflow synthesis via a self‑evolving multi‑agent infrastructure to address protein‑related demands. It outperforms a set of well‑known agents on the VenusAgentEval benchmark, and autonomously organizes the discovery and optimization of proteins from a single natural language prompt.
Authors: Marco Garcia Noceda, Matthew T Noakes, Andrew FigPope, Daniel E Mattox, Bryan Howie, Harlan Robins
Abstract: T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein‑protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence‑based architecture designed to model interactions among sets of variable‑length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models' generalization to pMHC targets. We describe a failure mode in prior sequence‑based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre‑trained protein language model ESM2 fine‑tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3‑based pipelines on TCR‑pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi‑sequence interaction problems, demonstrated in the TCR‑pMHC setting but generalizable to other biological domains where high‑throughput sequence‑driven reasoning complements structure prediction and experimental mapping.
Authors: Tianyu Wu, Lin Zhu
Abstract: Motivation: Generative models for protein backbone design have to simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing approaches rely on iterative refinement, quadratic attention mechanisms, or post‑hoc geometry correction, leading to a persistent trade‑off between computational efficiency and structural fidelity.
Results: We present Physics‑Informed Mamba (PI‑Mamba), a generative model that enforces exact local covalent geometry by construction while enabling linear‑time inference. PI‑Mamba integrates a differentiable constraint‑enforcement operator into a flow‑matching framework and couples it with a Mamba‑based state‑space architecture. To improve optimisation stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis‑proline awareness head. Across benchmark tasks, PI‑Mamba achieves 0.0% local geometry violations and high designability (scTM = 0.91 ± 0.03, n = 100), while scaling to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).
Authors: William Dawson, Louis Beal, Marco Zaccaria, Luigi Genovese
Abstract: Predicting how protein mutations affect drug binding remains a major challenge, particularly when the mutations are distal from the binding site. In this study, we introduce a coupled simulation workflow that combines long‑time‑scale molecular dynamics (MD) with high‑throughput quantum mechanical (QM) analysis to reveal the electronic structure signatures of mutation‑induced drug resistance in the HIV‑1 protease. Our workflow leverages GPU‑accelerated MD to generate conformational ensembles, and performs in‑operando linear‑scaling density functional theory (DFT) calculations on selected frames, parallelized on a coupled partition of CPU nodes. This design enables efficient, massively parallel quantum analysis of protein‑ligand complexes at atomic resolution. Using this approach, we investigate resistance to the antiviral Darunavir in a multi‑mutant HIV‑1 protease variant. By mapping the network of electronic interactions across the binding interface, our results highlight the critical role of conformational sampling and quantum insight in understanding distal mutation effects, and demonstrate a scalable computational strategy for studying complex biophysical mechanisms of drug resistance. We argue that this kind of analysis may pave the way for designing inhibitors that maintain binding stability against systemic, mutation‑induced destabilization.
Authors: Senura Hansaja Wanasekara, Minh-Duong Nguyen, Xiaochen Liu, Nguyen H. Tran, Ken-Tye Yong
Abstract: Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including SE(3)‑equivariant diffusion, flow matching, and hybrid predictor‑generator systems; and (iii) task settings from structure prediction and de novo design to protein‑ligand and protein‑protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage‑aware splits, physical validity checks, and function‑oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual‑use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function‑driven protein engineering.
Authors: Yuda Bi, Huaiwen Zhang, Jingnan Sun, Vince D Calhoun
Abstract: Protein structural ensembles from NMR spectroscopy capture biologically important conformational heterogeneity, but it remains difficult to determine whether observed variation reflects coordinated motion or noise‑like artifacts. We evaluate the Spectral Coherence Index (SCI), a model‑free, rotation‑invariant summary derived from the participation‑ratio effective rank of the inter‑model pairwise distance‑variance matrix. Under grouped primary analysis of a Main110 cohort of 110 NMR ensembles (30‑403 residues; 10‑30 models per entry), SCI separated experimental ensembles from matched synthetic incoherent controls with AUC‑ROC = 0.973 and Cliff's δ = ‑0.945. Relative to an internal 27‑protein pilot, discrimination softened modestly, showing that pilot‑era thresholds do not transfer perfectly to a larger, more heterogeneous cohort: the primary operating point τ = 0.811 yielded 95.5% sensitivity and 89.1% specificity. PDB‑level sensitivity remained nearly unchanged (AUC = 0.972), and an independent 11‑protein holdout reached AUC = 0.983. Across 5‑fold grouped stratified cross‑validation and leave‑one‑function‑class‑out testing, SCI remained strong (AUC = 0.968 and 0.971), although σ_{R_g} was the stronger single‑feature discriminator and a QC‑augmented multifeature model generalized best (AUC = 0.989 and 0.990). Residue‑level validation linked SCI‑derived contributions to experimental RMSF across 110 proteins and showed broad concordance with GNM‑based flexibility patterns. Rescue analyses showed that Main110 softening arose mainly from size and ensemble normalization rather than from loss of spectral signal. Together, these results establish SCI as an interpretable, bounded coherence summary that is most useful when embedded in a multimetric QC workflow for heterogeneous protein ensembles.
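For readers unfamiliar with the participation‑ratio effective rank named in this abstract, the following is a minimal sketch of that one ingredient. It is not the authors' SCI definition: the abstract gives neither the formula nor its normalization, so the function names, the use of singular values, and the construction of the distance‑variance matrix below are all assumptions made for illustration.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the (n_residues, n_residues) distance matrix for one model.

    coords: array of shape (n_residues, 3).
    """
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def distance_variance_matrix(ensemble):
    """Element-wise variance of pairwise distances across models.

    ensemble: array of shape (n_models, n_residues, 3).
    Distances do not change under rigid rotation, so neither does this matrix.
    """
    dists = np.stack([pairwise_distances(model) for model in ensemble])
    return dists.var(axis=0)


def participation_ratio(matrix):
    """Effective rank (sum s)^2 / sum(s^2) over the singular values s.

    Assumption: singular values are used because the variance matrix has a
    zero diagonal, so its signed eigenvalues sum to zero.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0
    return float(total ** 2 / (s ** 2).sum())
```

Under these assumptions the ratio lies between 1 (variance concentrated in one mode) and the matrix dimension (variance spread evenly), which is one way a bounded summary of this kind could be obtained.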
Authors: Neha K. Nair, Aaron D'Souza
Abstract: Saccharomyces cerevisiae is increasingly recognised as a key source for single‑cell protein (SCP) production, a rising solution to global protein‑supply challenges. This study presents a computational framework combining the Yeast9 genome‑scale metabolic model (GEM) with machine learning and optimisation to predict and enhance biomass flux for SCP yield. The Yeast9 GEM, comprising 4,131 reactions, 2,806 metabolites, and 1,161 genes, was simulated using flux balance analysis (FBA) across 2,000 Latin Hypercube‑sampled flux profiles. Random Forest and XGBoost regressors achieved R² values of 0.9999760 and 0.9997702, respectively. A variational autoencoder (VAE) identified four metabolic clusters with mean biomass fluxes of 0.472, 0.493, 0.527, and 0.505 gDW/hr. SHAP‑based feature attribution identified twenty key reactions in glycolysis, the TCA cycle, and amino‑acid biosynthesis; 18/20 (90%) were confirmed essential by in silico knockout. Bayesian optimisation produced a 12.13‑fold improvement in biomass flux (0.0858 to 1.041 gDW/hr) at glucose = ‑20.0, oxygen = ‑20.0, and ammonium = ‑8.9 mmol/gDW/hr. A generative adversarial network (GAN) generated novel flux configurations (variance = 0.124); stoichiometric feasibility verification returned 0/100 feasible profiles due to incomplete generator convergence, reported as a limitation. Pareto front analysis identified an optimal SCP operating point at 0.0858 gDW/hr biomass flux with amino‑acid biosynthesis score of 1000.029 mmol/gDW/hr.
Authors: Pin-Tian Lyu, Yifan Zhu, Qing Xia, Guangrui Ding, Arvind Pillai, Xinru Wang, Jianpeng Ao, Haonan Lin, Lulu Jiang, David Baker, Ji-Xin Cheng
Abstract: Current single molecule methods either rely on fluorescence or lack chemical information. Here we report stimulated Raman photothermal encoded scattering (SRPSCAT) microscopy for quantitative bond‑selective imaging of single‑biomolecule structures and interactions in native environments. In this approach, scattering of the target molecule is modulated by the deposited energy from stimulated Raman gain and loss processes, thereby encoding vibrational spectroscopic information. Leveraging single‑molecule sensitivity of interferometric scattering, SRPSCAT can map single proteins with chemical specificity, determine their mass, and distinguish protein secondary structures based on their Raman fingerprints. Furthermore, single protein binding kinetics are quantified and the conformational dynamics of single de novo designed allosteric proteins are observed. Together, these results highlight the potential of SRPSCAT for label‑free structural, functional and dynamic analysis at the single‑molecule level.
Authors: Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman
Abstract: We solve the orientation recovery of a tumbling protein in the gas phase from single‑event measurements of the spatial positions of its ions after an X‑ray laser induced explosion. We simulate diffracted X‑ray signal and ion dynamics under experimental conditions and compare our method to conventional orientation recovery in single‑particle imaging with X‑ray free‑electron lasers using only diffraction data. We reconstruct 3D diffraction intensities using orientations recovered from the ion signatures and retrieve the electron density with established phase‑retrieval algorithms. We test our orientation recovery procedure on 56 proteins ranging from 14 to 52 kDa (1800 to 6500 atoms), achieving an angular error of roughly 5°. The resulting 3D electron‑density reconstructions are compared to ground‑truth volumes simulated at the same nominal resolution, and achieve the resolution at the edge of the detector in conditions similar to current single‑particle imaging setups. We investigate the reconstruction quality and demonstrate that ion data can be used for reliable orientation recovery of particles in single‑particle imaging, achieving orientation accuracy on par with or better than currently used recovery techniques. This work shows the potential of ion detection for retrieving additional information from the sample fragmentation and for boosting single‑particle imaging with X‑ray lasers in cases where the diffraction signal is a limiting factor.
Authors: Josef Hanke, Sebastian Pujalte Ojeda, Shengyu Zhang, Werngard Czechtizky, Leonardo De Maria, Michele Vendruscolo
Abstract: The accurate prediction of protein‑RNA binding affinity remains an unsolved problem in structural biology, limiting opportunities in understanding gene regulation and designing RNA‑targeting therapeutics. A central obstacle is the structural flexibility of RNA, as, unlike proteins, RNA molecules exist as dynamic conformational ensembles. Thus, committing to a single predicted structure discards information relevant to binding. Here, we show that this obstacle can be addressed by extracting pre‑structural embeddings, which are intermediate representations from a biomolecular foundation model captured before the structure decoding step. Pre‑structural embeddings implicitly encode conformational ensemble information without requiring predicted structures. We build ZeroFold, a transformer‑based model that combines pre‑structural embeddings from Boltz‑2 for both protein and RNA molecules through a cross‑modal attention mechanism to predict binding affinity directly from sequence. To support training and evaluation, we construct PRADB, a curated dataset of 2,621 unique protein‑RNA pairs with experimentally measured affinities drawn from four complementary databases. On a held‑out test set constructed with 40% sequence identity thresholds, ZeroFold achieves a Spearman correlation of 0.65, a value approaching the ceiling imposed by experimental measurement noise. Under progressively fairer evaluation conditions that control for training‑set overlap, ZeroFold compares favourably with respect to leading structure‑based and leading sequence‑based predictors, with the performance gap widening as sequence similarity to competitor training data is reduced. These results illustrate how pre‑structural embeddings offer a representation strategy for flexible biomolecules, opening a route to affinity prediction for protein‑RNA pairs for which no structural data exist.
Authors: Nobuyuki Ota
Abstract: Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT‑III, which extends mechanism‑oriented AI across the full central dogma: DNA, RNA, and protein. Its two‑stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE‑N models transcription in the nucleus and VCE‑C models translation in the cytosol. On five held‑out genes, CDT‑III achieves per‑gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA‑level interpretability, increasing CTCF enrichment by 30%. Analysis of experimentally measured mRNA and protein responses reveals that the majority of genes with observable mRNA changes show opposite protein‑level changes (66.7% at |log2FC|>0.01, rising to 87.5% at |log2FC|>0.02), exposing a fundamental limitation of RNA‑only perturbation models. Despite this pervasive direction discordance, CDT‑III correctly predicts both mRNA and protein responses. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient‑based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.
Authors: Wenhao Zhao, Qiran Zou, Zhouhan Lin, Dianbo Liu
Abstract: Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve this, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ effectively guides the codebook toward well‑expanded manifolds. Extensive experimental results demonstrate the broad effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet‑1K and ImageNet‑100 benchmarks, highlighting ProVQ's boost for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StructTokenBench leaderboard.
Authors: Yu Liu, Ailun Wang, Yu Xia, Zhi Wang, Wen Yan
Abstract: Absolute binding free energy (ABFE) calculations offer a theoretically rigorous approach for predicting protein‑ligand binding affinities without the scaffold constraints of relative binding free energy (RBFE) perturbations. However, broad adoption of ABFE in high‑throughput hit discovery campaigns has been hindered by high computational costs and a lack of large‑scale validation. Here, we present Felis, an open‑source, automated, and scalable toolkit designed for high‑throughput ABFE calculations. Paired with ByteFF, a previously developed data‑driven molecular mechanics force field for drug‑like molecules, Felis achieves ranking performance comparable to state‑of‑the‑art RBFE methods on a diverse dataset comprising 43 protein targets and 859 ligands. Furthermore, we demonstrate robust convergence and ranking performance of Felis on a more challenging KRAS(G12D) dataset, where some ligands and the cofactor are highly charged. Crucially, all Felis predictions in this study were generated in a strict zero‑shot manner, eschewing custom force‑field modifications and alchemical schedule fine‑tuning. This demonstrates the viability of Felis as an effective, ready‑to‑use tool for computational structure‑based drug design.
Authors: Hanlin Xiao, Rainer Breitling, Eriko Takano, Mauricio A. Álvarez
Abstract: Recent advances in general‑purpose foundation models have stimulated the development of large biological sequence models. While natural language shows symbolic granularity (characters, words, sentences), biological sequences exhibit hierarchical granularity whose levels (nucleotides, amino acids, protein domains, genes) further encode biologically functional information. In this paper, we investigate the integration of cross‑granularity knowledge from models through a case study of BiGCARP, a Pfam domain‑level model for biosynthetic gene clusters, and ESM, an amino acid‑level protein language model. Using representation analysis tools and a set of probe tasks, we first explain why a straightforward cross‑model embedding initialization fails to improve downstream performance in BiGCARP, and show that deeper‑layer embeddings capture a more contextual and faithful representation of the model's learned knowledge. Furthermore, we demonstrate that representations at different granularities encode complementary biological knowledge, and that combining them yields measurable performance gains in intermediate‑level prediction tasks. Our findings highlight cross‑granularity integration as a promising strategy for improving both the performance and interpretability of biological foundation models.
Authors: Muralikrishnna G. Sethuraman, Faramarz Fekri
Abstract: Uncovering causal relationships is a fundamental problem across science and engineering. However, most existing causal discovery methods assume acyclicity and direct access to the system variables, assumptions that fail to hold in many real‑world settings. For instance, in genomics, cyclic regulatory networks are common, and measurements are often corrupted by instrumental noise. To address these challenges, we propose RECLAIM, a causal discovery framework that natively handles both cycles and measurement noise. RECLAIM learns the causal graph structure by maximizing the likelihood of the observed measurements via expectation‑maximization (EM), using residual normalizing flows for tractable likelihood computation. We consider two measurement models: (i) Gaussian additive noise, and (ii) a linear measurement system with additive Gaussian noise. We provide theoretical consistency guarantees for both settings. Experiments on synthetic data and real‑world protein signaling datasets demonstrate the efficacy of the proposed method.
Authors: Jeffrey D. Varner
Abstract: Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler's attention logits, continuously shifts generation from the full family toward a user‑specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler's internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue‑level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega‑conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.
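The logit‑bias idea in this abstract can be illustrated with a small sketch. The assumption here (not stated in the abstract) is that the scalar bias equals the logarithm of the multiplicity ratio; under that assumption, biasing the subset's logits is mathematically the same as storing each subset sequence that many times. The names `biased_attention`, `subset_mask`, and `rho` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def biased_attention(logits, subset_mask, rho):
    """Attention weights after adding log(rho) to the subset's logits.

    logits:      1-D array, one attention logit per stored sequence.
    subset_mask: boolean array marking the user-specified subset.
    rho:         multiplicity ratio (> 0); rho = 1 leaves weights unchanged.
    Assumption: bias = log(rho), chosen so that an integer rho matches
    duplicating every subset member rho times.
    """
    bias = np.where(subset_mask, np.log(rho), 0.0)
    return softmax(logits + bias)


# Illustrative values only, not taken from the paper.
example_logits = np.array([0.2, -1.0, 0.5, 0.0])
example_mask = np.array([True, False, True, False])
example_weights = biased_attention(example_logits, example_mask, rho=3.0)
```

Because only the logits change, a scheme of this form needs no retraining and no architectural change, which is consistent with what the abstract describes.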
Authors: Z. Štefanič, B. Hribar-Lee
Abstract: Protein conformational stability and function depend on non‑covalent interactions that are strongly influenced by the surrounding environment. To explore protein properties, amino acids are often utilized as model systems. In this study, we determined the densities of seven α‑amino acids in aqueous solutions between 278.15 K and 308.15 K and calculated the apparent molar volumes. Linear extrapolation yielded standard molar volumes, which were analyzed to characterize amino‑acid hydration. The contributions of side chains to the standard molar volume were determined relative to glycine. The standard molar volume increased with temperature, indicating reduced electrostriction of water around the amino acids, consistent with lower hydration numbers at higher temperatures. We employed the Ornstein‑Zernike integral equation with hypernetted‑chain closure and a coarse‑grained Lennard‑Jones bead model to calculate pair correlation functions and Kirkwood‑Buff integrals, from which standard molar volumes were obtained. The model reproduced the experimental standard molar volumes very well.
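The density‑to‑volume step described in this abstract can be sketched with the standard textbook relation for the apparent molar volume, V_φ = M/ρ − 1000(ρ − ρ₀)/(m ρ ρ₀), followed by a linear fit extrapolated to zero molality. The abstract does not state which concentration scale or fitting procedure the authors used, so the molality scale, the units, and all function names below are assumptions.

```python
import numpy as np


def apparent_molar_volume(m, rho, rho0, molar_mass):
    """Apparent molar volume in cm^3/mol.

    m:          molality of the solute, mol/kg of solvent
    rho:        solution density, g/cm^3
    rho0:       pure-solvent density at the same temperature, g/cm^3
    molar_mass: solute molar mass, g/mol
    """
    return molar_mass / rho - 1000.0 * (rho - rho0) / (m * rho * rho0)


def standard_molar_volume(m, v_phi):
    """Linear extrapolation of V_phi to m -> 0.

    Returns (intercept, slope); the intercept estimates the standard
    (infinite-dilution) molar volume.
    """
    slope, intercept = np.polyfit(m, v_phi, 1)
    return intercept, slope


def solution_density(m, v_phi, rho0, molar_mass):
    """Inverse relation, used here only to build synthetic test data.

    Mass of solution per kg solvent is 1000 + m*M grams; its volume is
    1000/rho0 + m*V_phi cm^3.
    """
    return (1000.0 + m * molar_mass) / (1000.0 / rho0 + m * v_phi)
```

A side‑chain contribution relative to glycine, as mentioned in the abstract, would then be the difference between two such intercepts measured at the same temperature.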
Authors: A. Vazquez-Palomo, C. Betegón, J. Weickenmeier, E. Martínez-Pañeda
Abstract: Alzheimer's disease is characterised by the spreading of misfolded proteins and progressive structural changes in the brain. Despite significant clinical research, understanding how microscopic protein dynamics translate into macroscopic tissue degeneration remains a major challenge. In this work, we present a three‑dimensional, finite element‑based computational framework to model disease progression by combining multi‑protein transport and brain tissue deformation within anatomically realistic geometries. The propagation of toxic tau and amyloid‑beta proteins is described using reaction‑diffusion equations of the Fisher‑Kolmogorov type, incorporating prion‑like growth mechanisms and anisotropic transport along white matter fibre tracts. Brain atrophy is represented through a hyperelastic constitutive model driven by protein‑dependent volume loss. Subject‑specific simulations are achieved through an automated preprocessing pipeline that generates finite element meshes and reconstructs axonal orientation fields from medical imaging data. The model reproduces key morphological patterns observed in Alzheimer's disease and shows good quantitative agreement with longitudinal imaging measurements. Overall, the proposed framework offers an extensible computational platform for studying Alzheimer's disease progression across subject‑specific brain geometries. The models developed, including the image processing framework (BrainImage2Mesh) and the coupled bio‑chemo‑mechanical COMSOL finite element implementation, are made freely available to download at https://mechmat.web.ox.ac.uk/codes.
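For orientation, a reaction‑diffusion equation of the Fisher‑Kolmogorov type with anisotropic transport is commonly written in the generic form below. This is a sketch of the general model class, not the paper's exact formulation: the symbols (normalized concentration c, growth rate α, diffusion tensor D with isotropic and fibre‑aligned coefficients, and fibre direction n) are assumptions, and the abstract does not say how the two proteins are coupled.

```latex
\frac{\partial c}{\partial t}
  = \nabla \cdot \left( \mathbf{D}\, \nabla c \right) + \alpha\, c\,(1 - c),
\qquad
\mathbf{D} = d_{\mathrm{iso}}\, \mathbf{I}
           + d_{\mathrm{ani}}\, \mathbf{n} \otimes \mathbf{n}
```

In this form the logistic term α c (1 − c) represents prion‑like growth toward a saturated state, and the n ⊗ n term speeds transport along the local fibre direction.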
Authors: Animesh, Plaban Kumar Bhowmick, Pralay Mitra
Abstract: Accurate prediction of binding sites of a given protein, to which ligands can bind, is a critical step in structure‑based computational drug discovery. Recently, Equivariant Graph Neural Networks (GNNs) have emerged as a powerful paradigm for binding site identification methods due to the large‑scale availability of 3D structures of proteins via protein databases and AlphaFold predictions. The state‑of‑the‑art equivariant GNN methods implement dot‑product attention, disregarding the variation in the chemical and geometric properties of the neighboring residues. To capture this variation, we propose GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network), which replaces dot‑product attention with adaptive kernels that recognize binding sites. The proposed attention mechanism captures variation in neighboring residues using statistics of their characteristic local feature distributions. Our mechanism dynamically computes neighborhood statistics at each layer, using local variance as an adaptive bandwidth parameter with learnable per‑head temperatures, enabling each protein region to determine its own context‑specific importance. GDEGAN outperforms existing methods with relative improvements of 37‑66% in DCC and 7‑19% in DCA success rates across the COACH420, HOLO4k, and PDBBind2020 datasets. These advances have direct application in accelerating protein‑ligand docking by identifying potential binding sites for therapeutic target identification.
Authors: Lucas Ferraz, Ana F. Rodrigues, Pedro Giesteira Cotovio, Mafalda Ventura, Gabriela Silva, Ana Sofia Coroadinha, Miguel Machuqueiro, Catia Pesquita
Abstract: Adeno‑associated viral (AAV) vectors are widely used delivery platforms in gene therapy, and the design of improved capsids is key to expanding their therapeutic potential. A central challenge in AAV bioengineering, as in protein design more broadly, is the vast sequence design space relative to the scale of feasible experimental screening. Machine‑guided generative approaches provide a powerful means of navigating this landscape and proposing novel protein sequences that satisfy functional constraints. Here, we develop a generative design framework based on protein language models and reinforcement learning to generate highly novel yet functionally plausible AAV capsids. A pretrained model was fine‑tuned on experimentally validated capsid sequences to learn patterns associated with viability. Reinforcement learning was then used to guide sequence generation, with a reward function that jointly promoted predicted viability and sequence novelty, thereby enabling exploration beyond regions represented in the training data. Comparative analyses showed that fine‑tuning alone produces sequences with high predicted viability but remains biased toward the training distribution, whereas reinforcement learning‑guided generation reaches more distant regions of sequence space while maintaining high predicted viability. Finally, we propose a candidate selection strategy that integrates predicted viability, sequence novelty, and biophysical properties to prioritize variants for downstream evaluation. This work establishes a framework for the generative exploration of protein sequence space and advances the application of generative protein language models to AAV bioengineering.
Authors: Fabrizio Camerin, Susana Marin-Aguilar, Anna Stradner, Peter Schurtenberger, Emanuela Zaccarelli
Abstract: Electrostatic interactions fundamentally govern the structure, stability, and dynamics of charged (bio)matter, yet the impact of heterogeneous and anisotropic charge distributions on the behavior of protein solutions remains elusive. Here, we introduce a versatile multiscale framework that directly connects molecular‑level electrostatics to collective properties via colloid‑inspired coarse‑grained modeling combined with neural network‑assisted optimization. Using monoclonal antibodies as a model system, our inverse design approach identifies charge patterns capable of reliably reproducing experimental structure factors, osmotic compressibility and collective diffusion coefficients over a wide range of protein concentrations. Close inspection of our data further uncovers how specific physical features and spatial arrangements of localized charge patches significantly influence the solution structure. This transferable strategy provides a predictive pathway to decode and control charge‑driven interactions in complex biomolecules and, more generally, in heterogeneously‑charged soft matter systems, with immediate relevance to protein formulation and biomaterials engineering.
Authors: Yicheng Hu, Xinyu Lin, Shulin Li, Wenjie Wang, Fengbin Zhu, Fuli Feng
Abstract: Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it is well established that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, which severely hinders the application of promising structure‑based models to this task. To address this gap, we introduce a new benchmark called CAPSUL, a Comprehensive humAn Protein benchmark for SUbcellular Localization. It features a dataset that integrates diverse 3D structural representations with fine‑grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state‑of‑the‑art sequence‑based and structure‑based models, showcasing the importance of involving structural features in this task. Furthermore, we explore reweighting and single‑label classification strategies to facilitate future investigation of structure‑based methods for this task. Lastly, we showcase the powerful interpretability of structure‑based methods through a case study on the Golgi apparatus, where we discover a decisive localization pattern, the α‑helix, from attention mechanisms, demonstrating the potential for bridging the gap with intuitive biological interpretability and paving the way for data‑driven discoveries in cell biology.
Authors: Jingzhi Chen, Lijian Xu
Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI‑driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA‑free architectures and all‑atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein‑ligand, protein‑nucleic acid, and protein‑protein complexes; and functional inference of fitness landscapes, mutational effects, and text‑guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed‑loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.
Authors: Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida
Abstract: Recent reasoning‑based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein‑pocket targets, resulting in 50k multi‑objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption‑based generation or molecule‑editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general‑purpose and chemistry‑specialized open‑source LLMs and introduce a diversity‑aware top‑k metric to measure whether models can generate a diverse set of high‑scoring molecules. Finally, we use the verifier to fine‑tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity‑exploitation trade‑off. MolRGen provides a scalable testbed for studying verifier‑based reasoning and reinforcement learning in molecular design.
Authors: Ben S. Southworth, Stephen Thomas
Abstract: Orthogonalized‑momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix‑valued momentum updates via a short polar‑decomposition iteration. However, polar‑factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware‑dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky‑like) whitening surrogate inspired by classical Gram‑Schmidt and Gauss‑Seidel ideas. We show that row‑orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss‑Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. In terms of time‑to‑perplexity, MUD yields consistent 10‑50% wall‑clock improvements over tuned AdamW and Muon, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead; relative to Muon, MUD improves peak tokens/s by roughly 1.3‑2.6× across most settings and up to nearly 3× on GPT‑2 large on an A100. We also demonstrate training an ESM‑2 150M protein language model, where MUD matches Muon‑level validation perplexity in significantly less wall‑clock time.
Authors: Liang Shi, Jiarui Lu, Junqi Liu, Chence Shi, Zhi Yang, Jian Tang
Abstract: Understanding the dynamic behavior of biomolecules is fundamental to elucidating biological function and facilitating drug discovery. While Molecular Dynamics (MD) simulations provide a rigorous physical basis for studying these dynamics, they remain computationally expensive for long timescales. Conversely, recent deep generative models accelerate conformation generation but typically either fail to model temporal relationships or are built only for monomeric proteins. To bridge this gap, we introduce ATMOS, a novel generative framework based on State Space Models (SSM) designed to generate atom‑level MD trajectories for biomolecular systems. ATMOS integrates a Pairformer‑based state transition mechanism to capture long‑range temporal dependencies, with a diffusion‑based module to decode trajectory frames in an autoregressive manner. ATMOS is trained on crystal structures from the PDB and conformation trajectories from large‑scale MD simulation datasets including mdCATH and MISATO. We demonstrate that ATMOS achieves state‑of‑the‑art performance in generating conformation trajectories for both protein monomers and complex protein‑ligand systems. By enabling efficient inference of atomic motion trajectories, this work establishes a promising foundation for modeling biomolecular dynamics.
Authors: Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar
Abstract: Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving an unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics‑based heteropolymer design software, capable of modeling non‑canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user‑defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non‑canonical residues ‑‑ where ML approaches fail ‑‑ achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.
Authors: Arnold Mathijssen, Hamed Almohammadi, Lauren Altman, Talia Calazans, M. J. Ferencz, Michelle Fung, Ian J. Lee, Maciej Lisicki, Ivy Liu, Maggie Liu, Tianyi Liu, Ernest Park, Ran Tao, Albane Thery, Zeyuan Wang, Margot Young
Abstract: Living systems are made of active materials with microscopic components that work together to perform macroscopic biological tasks. The breakdown of these collective functionalities leads to diseases, which, conversely, could be treated by exploiting self‑organization in healthcare technologies. Here, we review recent advances in this rapidly growing field of biomedical active matter. The main themes are (1) collective self‑assembly and spatiotemporal coordination; (2) collective motion, transport, and navigation; (3) collective sensing, signaling, and communication; and (4) collective adaptation, evolution, and learning. We discuss these emerging processes in a wide range of systems, including protein folding, biomolecular condensates, cytoskeleton dynamics, intracellular flows, bacterial biofilms, quorum sensing, cilia synchronization, wound healing, biolocomotion, neurons, endocrine signalling, and cardiovascular flow networks. For each, we highlight medical conditions associated with reduced collective functionality and how they may be treated using microrobotic swarms, bioinspired metamaterials, diagnostics, lab‑on‑chip devices, organoids, and other active and adaptive matter innovations.
Authors: Madhulatha Mandarapu, Sandeep Kunkunuru
Abstract: Biomedical knowledge is fragmented across siloed databases ‑‑ Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug‑gene interactions, SIDER for side effects. We present three open‑source biomedical knowledge graphs ‑‑ Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources) ‑‑ built on Samyama, a high‑performance graph database written in Rust.
Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large‑scale KGs from heterogeneous public data sources, with cross‑source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross‑KG federation: loading all three snapshots into a single graph tenant enables property‑based joins across datasets. Third, we introduce schema‑driven MCP server generation for LLM agent access, evaluated on a new BiomedQA benchmark (40 pharmacology questions): domain‑specific MCP tools achieve 98% accuracy vs. 85% for schema‑aware text‑to‑Cypher and 75% for standalone GPT‑4o, with zero schema errors.
All data sources are open‑license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, with single‑KG queries completing in 80‑100 ms and cross‑KG federation joins in 1‑4 s.
Authors: Y. Cheung
Abstract: Combination pharmacotherapy offers substantial therapeutic advantages but also poses substantial risks of adverse drug reactions (ADRs). The accurate prediction of ADRs with interpretable computational methods is crucial for clinical safety management, drug development, and precision medicine. However, managing ADRs remains a challenge due to the vast search space of drug combinations and the complexity of physiological responses. Current graph‑based architectures often struggle to effectively integrate multi‑scale biological information and frequently rely on fixed association matrices, which limits their ability to capture dynamic organ‑level dependencies and generalize across diverse datasets. Here we propose CrossADR, a hierarchical framework for organ‑level ADR prediction through cross‑layer feature integration and cross‑level associative learning. It incorporates a gated‑residual‑flow graph neural network to fuse multi‑scale molecular features and utilizes a learnable ADR embedding space to dynamically capture latent biological correlations across 15 organ systems. Systematic evaluation on the newly constructed CrossADR‑Dataset, covering 1,376 drugs and 946,000 unique combinations, demonstrates that CrossADR consistently achieves state‑of‑the‑art performance across 80 distinct experimental scenarios and provides high‑resolution insights into drug‑related protein‑protein interactions and pathways. Overall, CrossADR represents a robust tool for cross‑scale biomedical information integration, cross‑layer feature integration, and cross‑level associative learning, and can be effectively utilized to prevent ADRs in clinical decision‑making.
Authors: Dejun Lin, Simon Chu, Vishanth Iyer, Youhan Lee, John St John, Kevin Boyd, Brian Roland, Xiaowei Ren, Guoqing Zhou, Zhonglin Cao, Polina Binder, Yuliya Zhautouskaya, Jakub Zakrzewski, Maximilian Stadler, Kyle Gion, Yuxing Peng, Xi Chen, Tianjing Zhang, Philipp Junk, Michelle Dimon, Paweł Gniewek, Fabian Ortega, McKinley Polen, Ivan Grubisic, Ali Bashir, Graham Holt, Danny Kovtun, Matthias Grass, Luca Naef, Rui Wang, Jian Peng, Anthony Costa, Saee Paliwal, Eddie Calleja, Timur Rvachov, Neha Tadimeti, Roy Tal, Emine Kucukbenli
Abstract: Understanding cellular machinery requires atomic‑scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by the hardware memory requirements of models like AlphaFold 3, imposing a practical ceiling of a few thousand residues that can be processed on a single GPU. Here we present NVIDIA BioNeMo Fold‑CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co‑folding models across multiple GPUs. We use the Boltz models as open source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data‑dependent pattern of window‑batched local attention. Our approach achieves efficient memory scaling; for an N‑token input distributed across P GPUs, per‑device memory scales as O(N^2/P), enabling the structure prediction of assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold‑CP enabled the scoring of over 90% of the Comprehensive Resource of Mammalian protein complexes (CORUM) database, as well as folding of the disease‑relevant PI4KA lipid kinase complex bound to an intrinsically disordered region without cropping. By providing a scalable pathway for modeling massive systems with full global context, Fold‑CP represents a significant step toward the realization of a virtual cell.
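To make the O(N^2/P) scaling in the abstract above concrete, here is a back‑of‑the‑envelope sketch for a single pair‑representation tensor. The channel width (128) and element size (2 bytes, as for bf16) are illustrative assumptions, not figures from the paper, and `pair_bytes_per_device` is a hypothetical helper.

```python
def pair_bytes_per_device(n_tokens, n_gpus, channels=128, bytes_per_elem=2):
    """Bytes per GPU for one N x N x channels tensor sharded evenly over n_gpus.

    channels and bytes_per_elem are assumed values for illustration only.
    """
    total = n_tokens * n_tokens * channels * bytes_per_elem
    return total / n_gpus

# 30,000 tokens on 1 GPU versus 64 GPUs, under the assumptions above.
single = pair_bytes_per_device(30_000, 1)
sharded = pair_bytes_per_device(30_000, 64)
print(f"{single / 1e9:.1f} GB unsharded, {sharded / 1e9:.1f} GB per device on 64 GPUs")
```

Under these assumptions one such tensor needs about 230 GB unsharded but about 3.6 GB per device across 64 GPUs. A real model holds many such activations at once, so this is only the scaling of a single term.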
Authors: Yiming Gao, Liuyi Xu, Pengshan Cui, Yining Qian, An-Yang Lu, Xianpeng Wang
Abstract: Accurate identification of protein‑nucleotide binding sites is fundamental to deciphering molecular mechanisms and accelerating drug discovery. However, current computational methods often struggle with suboptimal performance due to inadequate feature representation and rigid fusion mechanisms, which hinder the effective exploitation of cross‑task information synergy. To bridge this gap, we propose MTGA‑MGE, a framework that integrates a Multi‑Task Genetic Algorithm with Multi‑Granularity Encoding to enhance binding site prediction. Specifically, we develop a Multi‑Granularity Encoding (MGE) network that synergizes multi‑scale convolutions and self‑attention mechanisms to distill discriminative signals from high‑dimensional, redundant biological data. To overcome the constraints of static fusion, a genetic algorithm is employed to adaptively evolve task‑specific fusion strategies, thereby effectively improving model generalization. Furthermore, to catalyze collaborative learning, we introduce an External‑Neighborhood Mechanism (ENM) that leverages biological similarities to facilitate targeted information exchange across tasks. Extensive evaluations on fifteen nucleotide datasets demonstrate that MTGA‑MGE not only establishes a new state‑of‑the‑art in data‑abundant, high‑resource scenarios but also maintains a robust competitive edge in rare, low‑resource regimes, presenting a highly adaptive scheme for decoding complex protein‑ligand interactions in the post‑genomic era.
Authors: Zihan Dun, Liuyi Xu, An-Yang Lu, Shuang Li, Yining Qian
Abstract: Drug‑target affinity prediction is pivotal for accelerating drug discovery, yet existing methods suffer from significant performance degradation in realistic cold‑start scenarios (unseen drugs/targets/pairs), primarily driven by overfitting to training instances and information loss from irrelevant target sequences. In this paper, we propose LaPro‑DTA, a framework designed to achieve robust and generalizable DTA prediction. To tackle overfitting, we devise a latent dual‑view drug representation mechanism. It synergizes an instance‑level view to capture fine‑grained substructures with stochastic perturbation and a distribution‑level view to distill generalized chemical scaffolds via semantic remapping, thereby enforcing the model to learn transferable structural rules rather than memorizing specific samples. To mitigate information loss, we introduce a salient protein feature extraction strategy using pattern‑aware top‑k pooling, which effectively filters background noise and isolates high‑response bioactive regions. Furthermore, a cross‑view multi‑head attention mechanism fuses these purified features to model comprehensive interactions. Extensive experiments on benchmark datasets demonstrate that LaPro‑DTA significantly outperforms state‑of‑the‑art methods, achieving an 8% MSE reduction on the Davis dataset in the challenging unseen‑drug setting, while offering interpretable insights into binding mechanisms.
Authors: Jeffrey D. Varner
Abstract: Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training‑free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed‑form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per‑position amino acid frequencies.
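The sampler described in the abstract above can be sketched in a few lines, assuming the standard modern Hopfield energy E(x) = -(1/beta) * logsumexp(beta * <p_i, x>) + 0.5 * |x|^2 over stored patterns p_i, whose gradient is x minus a softmax‑attention readout of the patterns. How the paper encodes alignments as patterns, and its exact temperature convention, are not given in the abstract; the toy patterns, step size, and the names `energy_grad` and `langevin_sample` are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def energy_grad(x, patterns, beta):
    """Gradient of the Hopfield energy: x minus the attention-weighted pattern mean."""
    w = softmax([beta * sum(pi * xi for pi, xi in zip(p, x)) for p in patterns])
    retrieved = [sum(wi * p[d] for wi, p in zip(w, patterns)) for d in range(len(x))]
    return [xi - ri for xi, ri in zip(x, retrieved)]

def langevin_sample(patterns, beta=4.0, temperature=0.05, step=0.1,
                    n_steps=200, seed=0, x0=None):
    """Unadjusted Langevin dynamics targeting exp(-E(x) / temperature)."""
    rng = random.Random(seed)
    x = list(x0) if x0 is not None else [rng.gauss(0.0, 1.0) for _ in patterns[0]]
    noise_scale = math.sqrt(2.0 * step * temperature)
    for _ in range(n_steps):
        g = energy_grad(x, patterns, beta)
        x = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

# Toy "alignment": three one-hot patterns standing in for encoded sequences.
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sample = langevin_sample(patterns)
```

Each update costs one pass over the stored patterns, which matches the abstract's claim of cost linear in alignment size. With `temperature=0` the loop reduces to plain gradient descent toward a retrieval fixed point, while larger temperatures trade fidelity to stored patterns for novelty.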
Authors: Tomoki Ohkubo
Abstract: Solid‑state nanopore DNA sequencers present mechanical and chemical stability, reusability, and large‑scale integrability. However, their development is hindered by the absence of a protein‑free mechanism for controlling DNA translocation, which is accomplished by motor proteins in their biological counterparts. Here, we propose and theoretically analyze a protein‑independent ratchet mechanism based on the unzipping of double‑stranded DNA at the nanopore rim. When the transmembrane bias exceeds a certain threshold, the base pairs mechanically dissociate, allowing one strand to translocate while the other remains upstream. This unzipping process is known to slow DNA motion, suggesting that voltage pulses can trigger individual unzipping events at externally defined times, a concept referred to as digital unzipping. However, the intrinsic unzipping barrier is insufficient to provide the dwell times required for a reliable ionic‑current readout; therefore, an additional mechanism is needed to hold the DNA in place between voltage pulses. To overcome this limitation, we introduce a reversible hold mechanism implemented via electrostatic attraction between DNA and a charged nanopore wall, which temporarily immobilizes the strand once the unzipping fork catches on the nanopore rim. Using a statistical‑mechanical model, we track the evolution of the mean and variance of DNA position through each ratchet cycle. Analytical expressions for the corresponding error probabilities show that submicrosecond switching of the hold mechanism enables base‑by‑base stepping with an error rate <5%. These results theoretically demonstrate that digital unzipping combined with a reversible hold mechanism can yield deterministic single‑base motion, thus opening a viable route toward all‑solid‑state nanopore sequencing.
Authors: Niklas Schweiger, Daniel Cremers, Karnik Ram
Abstract: Optimizing the noise samples of diffusion and flow models is an increasingly popular approach to align these models to target rewards at inference time. However, we observe that these approaches are usually restricted to differentiable or cheap reward models, the formulation of the underlying pretrained generative model, or are memory/compute inefficient. We instead propose a simple trust‑region based search algorithm (TRS) which treats the pre‑trained generative and reward models as a black‑box and only optimizes the source noise. Our approach achieves a good balance between global exploration and local exploitation, and is versatile and easily adaptable to various generative settings and reward models with minimal hyperparameter tuning. We evaluate TRS across text‑to‑image, molecule and protein design tasks, and obtain significantly improved output samples over the base generative models and other inference‑time alignment approaches which optimize the source noise sample, or even the entire reverse‑time sampling noise trajectories in the case of diffusion models. Our source code is publicly available.
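The abstract above does not spell out the TRS algorithm, so the following is only a generic trust‑region random search over a noise vector, shown to illustrate the black‑box setting: the reward is queried but never differentiated. The grow/shrink factors, the Gaussian proposals, and the toy reward are all assumptions, and `trust_region_search` is a hypothetical name.

```python
import random

def trust_region_search(reward, z0, radius=1.0, n_iters=50, n_candidates=8,
                        grow=1.5, shrink=0.5, seed=0):
    """Maximize a black-box reward by sampling around a moving center."""
    rng = random.Random(seed)
    center, best = list(z0), reward(z0)
    for _ in range(n_iters):
        # Propose candidates near the current center, scaled by the region radius.
        candidates = [[c + radius * rng.gauss(0.0, 1.0) for c in center]
                      for _ in range(n_candidates)]
        scored = [(reward(z), z) for z in candidates]
        top_reward, top_z = max(scored, key=lambda t: t[0])
        if top_reward > best:
            center, best = top_z, top_reward  # accept and widen the region
            radius *= grow
        else:
            radius *= shrink                  # reject and search more locally
    return center, best

# Toy stand-in for "generate from noise, then score": reward peaks at a fixed target.
target = [0.5, -1.0, 2.0]
def toy_reward(z):
    return -sum((a - b) ** 2 for a, b in zip(z, target))

z_best, r_best = trust_region_search(toy_reward, [0.0, 0.0, 0.0])
```

In a real pipeline `reward` would wrap both the pretrained generator and the reward model, so each call is one full generation. The best reward never decreases across iterations because a candidate is accepted only when it improves on the incumbent.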
Authors: Sosuke Asano, Ikki Yasuda, Katsuhiro Endo, Yoshinori Hirano, Kenji Yasuoka
Abstract: Comparing multiple protein systems with variation such as different binding ligands or mutations, and understanding their effects is one of the objectives in molecular dynamics simulations. Representation of these systems by a few features enables quantitative comparison. However, because molecular dynamics simulation trajectories are high‑dimensional spatiotemporal data, selection of key features relies on domain expertise, sometimes introducing arbitrary assumptions. Here, we present an approach that uses the optimal transport distance to compare high‑dimensional trajectory data, and employs simulated annealing to identify the residues that best distinguish multiple systems. We term this algorithm auto‑WHATMD (automated Wasserstein‑based High‑dimensional feature extraction Analysis for Trajectories of Molecular Dynamics). We applied auto‑WHATMD to multiple protein‑ligand systems of bromodomain 4 with different ligands, identifying the most discriminative residues in the loop region. Moreover, even a few selected residues were sufficient to capture the correlation with ligand‑binding affinities, indicating that auto‑WHATMD effectively prioritizes the most informative residues. Our approach can be used to efficiently determine key residues and design features for multiple analogous systems.
Authors: Yiran Zhu, Changxi Chi, Hongxin Xiang, Wenjie Du, Xiaoqi Wang, Jun Xia
Abstract: Protein inverse folding aims to design an amino acid sequence that will fold into a given backbone structure, serving as a central task in protein design. Two main paradigms have been widely explored. Template‑based methods exploit database‑derived structural priors and can achieve high local precision when close structural neighbors are available, but their dependence on database coverage and match quality often degrades performance on out‑of‑distribution (OOD) targets. Deep learning approaches, in contrast, learn general structure‑to‑sequence regularities and usually generalize better to new backbones. However, they struggle to capture fine‑grained local structure, which can cause uncertain residue predictions and missed local motifs in ambiguous regions. We introduce Refold, a novel framework that synergistically integrates the strengths of database‑derived structural priors and deep learning prediction to enhance inverse folding. Refold obtains structural priors from matched neighbors and fuses them with model predictions to refine residue probabilities. In practice, low‑quality neighbors can introduce noise, potentially degrading model performance. We address this issue with a Dynamic Utility Gate that controls prior injection and falls back to the base prediction when the priors are untrustworthy. Comprehensive evaluations on standard benchmarks demonstrate that Refold achieves state‑of‑the‑art native sequence recovery of 0.63 on both CATH 4.2 and CATH 4.3. Also, analysis indicates that Refold delivers larger gains on high‑uncertainty regions, reflecting the complementarity between structural priors and deep learning predictions.
Authors: Junjie Zhou, Bao Xue, Meiling Wang, Wei Shao, Daoqiang Zhang
Abstract: To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene‑Regulated Protein Fusion (GRPF), which employs graph‑aware cross‑attention with structure‑preserving alignment to explicitly model gene‑protein regulatory relationships and generate gene‑regulated protein representations. Additionally, we propose Protein‑Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher‑order protein‑morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state‑of‑the‑art methods.
Authors: A. Kh. Bikulov, A. P. Zubarev
Abstract: A model for studying the ultrametricity of the energy landscape in a disordered heteropolymer is presented. It is treated as a simplified model of a protein molecule in which amino acid residues are modeled as point masses. Pairwise interactions include universal repulsion, the Lennard‑Jones potential, the Coulomb potential with screening, and the elastic potential for bonds between adjacent residues. An analogy with spin glass models is used, allowing the application of replica theory methods. Unlike the standard approach to disordered systems, averaging over disorder is not performed. The overlap between replicas is defined as the Pearson correlation coefficient between the vectors of average pairwise energies, which corresponds to a comparison of thermodynamic averages in the spirit of spin glass theory. The results of a computational experiment conducted using the developed algorithm on a graphics processing unit (GPU) are presented. The simulations were performed using a 128‑residue‑long sequence, with 50 independent disorder realizations and 50 replicas for each sequence at a temperature of T = 1.0. It was found that for 90.0% of the sequences, the distance matrix between replicas contains more than half of the ultrametric triangles, and nontrivial ultrametricity predominates in 97.8% of them, indicating a hierarchical organization of the energy landscape. A repeated computational experiment for selected sequences confirms the reliability of the observations: 95.5% of them again demonstrated ultrametricity, of which 97.7% showed a predominance of the nontrivial type of ultrametricity. The obtained results confirm Frauenfelder's hypothesis of protein ultrametricity and pave the way for a systematic study of ultrametric properties in more realistic protein models.
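The two quantities named in the abstract above, a Pearson‑correlation overlap and the share of ultrametric triangles in a replica distance matrix, can be sketched as follows. A triangle is counted as ultrametric when its two largest sides agree within a relative tolerance; treating isosceles‑but‑not‑equilateral triangles as the "nontrivial" case, and the tolerance value itself, are assumptions here, since the abstract does not define them.

```python
from itertools import combinations
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def triangle_stats(D, tol=0.05):
    """Return (total, ultrametric, nontrivial) triangle counts for distance matrix D."""
    total = ultra = nontrivial = 0
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted((D[i][j], D[i][k], D[j][k]))
        total += 1
        if c - b <= tol * c:        # two largest sides (nearly) equal
            ultra += 1
            if b - a > tol * c:     # and the third side clearly shorter
                nontrivial += 1
    return total, ultra, nontrivial

# Toy hierarchy: replicas {0, 1} and {2, 3} form two tight clusters.
D = [[0, 1, 2, 2],
     [1, 0, 2, 2],
     [2, 2, 0, 1],
     [2, 2, 1, 0]]
stats = triangle_stats(D)
```

In this toy matrix every triple has sides (1, 2, 2), so all four triangles are ultrametric and nontrivial. One plausible way to obtain distances from overlaps is d = 1 - q with q from `pearson`, though that mapping is also an assumption.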
Authors: Fei Wang, Xinye Zheng, Kun Li, Yanyan Wei, Yuxin Liu, Ganpeng Hu, Tong Bao, Jingwen Yang
Abstract: Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number (k_cat), Michaelis constant (K_m), and inhibition constant (K_i) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme‑Reaction Bridging Adapter (ERBA), which injects cross‑modal information via fine‑tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross‑Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry‑aware Mixture‑of‑Experts (G‑MoE) then integrates active‑site structure and routes samples to pocket‑specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme‑Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out‑of‑distribution performance compared with sequence‑only and shallow‑fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time‑resolved structural cues.
Authors: Zhenkun Shi, Jun Zhu, Dehang Wang, BoYu Chen, Qianqian Yuan, Zhitao Mao, Fan Wei, Weining Wu, Xiaoping Liao, Hongwu Ma
Abstract: A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many‑to‑many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer‑based ensemble framework that directly predicts enzyme‑catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high‑level sequence semantics and fine‑grained transformation patterns. Evaluations on curated cross‑validation and temporal test sets demonstrate consistent improvements over six EC‑based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome‑wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions. These capabilities make RXNRECer a robust and versatile solution for EC‑free, fine‑grained enzyme function prediction, with potential applications across multiple areas of enzyme research and industrial applications.
Authors: Ivan Yu. Golushko, Olga V. Konevtsova, Daria S. Roshal, Sergei B. Rochal
Abstract: Studying the physical mechanisms and common geometric principles underlying known spherical packings is crucial for the rational design of synthetic nanocontainers. Here we model the growth of small spherical shells containing n<72 identical particles that have their own curvature and interact with each other via the Lennard‑Jones potential. The shell assembly is assumed to be nonequilibrium and sequential: at each step, a new particle is attached to the most energetically favorable position, after which the system relaxes. Along with well‑known structures of the smallest icosahedral viral protein shells, the proposed mechanism generates a wide range of shells exhibiting square‑triangular surface order. Most such shells model synthetic or natural protein complexes that have octahedral or tetrahedral symmetries and perform various functions. We compare the obtained structures with those resulting from the equilibrium assembly and corresponding to global energy minima. We also consider temperature‑dependent stochastic assembly and use the double‑minimum Lennard‑Jones‑Gauss potential to mimic anisotropic particle interactions.
Authors: Davide Rattacaso, Daniel Jaschke, Antonio Trovato, Ilaria Siloi, Simone Montangero
Abstract: Efficient sampling from ensembles of Hamiltonian cycles is critical for predicting the thermodynamic properties of compact polymers, with applications including modeling protein and RNA folding and designing soft materials. Although classical Monte Carlo methods are widely regarded as the standard approach, their efficiency is strongly limited when applied to compact polymers. In this work, we enable a quadratic speedup in the estimation of thermodynamic properties of maximally compact polymers and heteropolymers by quantum computation. To this end, we encode the target thermodynamic ensemble into the amplitudes of a quantum state, i.e., a quantum sample, which can be processed via amplitude amplification. Using quantum equational reasoning, we construct a local parent Hamiltonian whose unique ground state realizes a quantum sample of all Hamiltonian cycles. This state can be prepared on quantum hardware using ground‑state preparation methods, such as quantum annealing, and subsequently manipulated to generate quantum samples of polymers and heteropolymers at a target temperature. Finally, we approximate the quantum sample as a tensor network, revealing an entanglement area law. For fixed‑width rectangular lattices, we obtain a time‑efficient and compact encoding of the full ensemble of Hamiltonian cycles, enabling the efficient evaluation of expectation values, partition functions, and configuration probabilities via tensor contractions, without resorting to sampling.
Authors: Moritz Sallermann, Amrita Goswami, Rosana Collepardo-Guevara, Alberto Ocana, Hannes Jónsson, Elvar Ö. Jónsson, Jorge R. Espinosa
Abstract: The parameterization of simulation‑based models is a central yet laborious task in computational chemistry and physics, often driven by human intuition and manual iteration. Automating this task necessitates the definition of suitable objective functions, which tend to be expensive to evaluate, noisy, non‑differentiable, or composed of heterogeneous contributions originating from separate sets of simulations. Gradient‑free and black‑box optimization algorithms are powerful tools which are particularly well‑suited to minimizing such objective functions. Here, we introduce ChemFit, a flexible Python framework for the definition, composition, and massively concurrent evaluation of simulation‑based objective functions, which is designed to operate in conjunction with these algorithms. We demonstrate the broad applicability of this approach by using ChemFit for three representative examples of increasing complexity and real‑world relevance. First, we obtain the parameters of the Lennard‑Jones potential for liquid argon from experimental measurements of the density. Second, we parameterize a polarizable and flexible potential energy function to reproduce the structure of small H_2O clusters obtained from density functional theory calculations. Finally, we tune a small subset of the parameters of a residue‑level coarse‑grained protein force‑field, with the goal to reproduce the experimental critical solution temperature of the low complexity domain of the wild‑type hnRNPA1 sequence and an arginine‑enriched mutant of this protein. hnRNPA1 is an RNA‑binding protein linked to amyotrophic lateral sclerosis. Together, these examples illustrate how ChemFit enables scalable, reproducible, and optimizer‑agnostic parameter fitting for broadly applicable multiscale models.
Authors: Nicolas Deutschmann, Constance Ferragu, Jonathan D. Ziegler, Shayan Aziznejad, Eli Bixby
Abstract: We introduce EvoFlows, a variable‑length protein sequence‑to‑sequence modeling approach designed for protein engineering. Existing protein language models are poorly suited for optimization tasks: autoregressive models require full sequence generation, masked language and discrete diffusion models rely on pre‑specified mutation locations, and no existing methods naturally support insertions and deletions relative to a template sequence. EvoFlows learns mutational trajectories between evolutionarily related protein sequences via edit flows, allowing it to perform a controllable number of mutations (insertions, deletions, and substitutions) on a template sequence, predicting not only which mutation to perform, but also where it should occur. Through extensive in silico evaluation on diverse protein families from UniRef and OAS, we show that EvoFlows generates variants that remain consistent with natural protein families while exploring farther from template sequences than leading baselines.
Authors: Shirin Amiraslani, Xin Gao
Abstract: Transformer self‑attention computes pairwise token interactions, yet protein sequence to phenotype relationships often involve cooperative dependencies among three or more residues that dot product attention does not capture explicitly. We introduce Higher‑Order Modular Attention, HOMA, a unified attention operator that fuses pairwise attention with an explicit triadic interaction pathway. To make triadic attention practical on long sequences, HOMA employs block‑structured, windowed triadic attention. We evaluate on three TAPE benchmarks for Secondary Structure, Fluorescence, and Stability. Our attention mechanism yields consistent improvements across all tasks compared with standard self‑attention and efficient variants including block‑wise attention and Linformer. These results suggest that explicit triadic terms provide complementary representational capacity for protein sequence prediction at controllable additional computational cost.
Authors: Yining Qian, Pengjie Wang, Yixiao Li, An-Yang Lu, Cheng Tan, Shuang Li, Lijun Liu
Abstract: Predicting drug‑target affinity is fundamental to virtual screening and lead optimization. However, existing deep models often suffer from representation collapse in stringent cold‑start regimes, where the scarcity of labels and domain shifts prevent the learning of transferable pharmacophores and binding motifs. In this paper, we propose Co‑Diffusion, a novel affinity‑aware framework that redefines DTA prediction as a constrained latent denoising process to enhance generalization. Co‑Diffusion employs a two‑stage paradigm: Stage I establishes an affinity‑steered latent manifold by aligning drug and target embeddings under an explicit supervised objective, ensuring that the latent space reflects the intrinsic binding landscape. Stage II introduces modality‑specific latent diffusion as a stochastic perturb‑and‑denoise regularizer, forcing the model to recover consistent affinity semantics from noisy structural representations. This approach effectively mitigates the reconstruction‑regression conflict common in generative DTA models. Theoretically, we show that Co‑Diffusion maximizes a variational lower bound on the joint likelihood of drug structures, protein sequences, and binding strength. Extensive experiments across multiple benchmarks demonstrate that Co‑Diffusion significantly outperforms state‑of‑the‑art baselines, particularly yielding superior zero‑shot generalization on unseen molecular scaffolds and novel protein families, paving a robust path for in silico drug prioritization in unexplored chemical spaces.
Authors: Van Le, Tan Le
Abstract: Accurate prediction of residue‑level pKa values is essential for understanding protein function, stability, and reactivity. While existing resources such as DeepKaDB and CpHMD‑derived datasets provide valuable training data, their descriptors remain primarily classical and often struggle to generalize across diverse biochemical environments. We introduce a reproducible hybrid quantum‑classical framework that enriches residue‑level representations with a Gaussian kernel‑based quantum‑inspired feature mapping. These quantum‑enhanced descriptors are combined with normalized structural features to form a unified hybrid encoding processed by a Deep Quantum Neural Network (DQNN). This architecture captures nonlinear relationships in residue microenvironments that are not accessible to classical models. Benchmarking across multiple curated descriptor sets demonstrates that the DQNN achieves improved cross‑context generalization relative to classical baselines. External evaluation on the PKAD‑R experimental benchmark and an Aβ40 case study further highlights the robustness and transferability of the quantum‑inspired representation. By integrating quantum‑inspired feature transformations with classical biochemical descriptors, this work establishes a scalable and experimentally transferable approach for residue‑level pKa prediction and broader applications in protein electrostatics.
Authors: Weronika Kłos, Sidney Bender, Lukas Kades
Abstract: Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold‑Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state. MCCOP operates in a continuous joint sequence‑structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks ‑ GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery ‑ and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis‑driven protein design. Our code is publicly available at github.com/weroks/mccop.
Authors: Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott
Abstract: A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo‑perplexity of the entire 1‑edit neighborhood of a sequence. Reframing generation in terms of entire‑sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head‑to‑head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under‑explored area.
Authors: Diane Jung, Caleb Escobedo, Noah Liska, Maitrey Gramopadhye, Daniel Szafir, Alessandro Roncone, Carson Bruns
Abstract: Scientists perform diverse manual procedures that are tedious and laborious. Such procedures are considered a bottleneck for modern experimental science, as they consume time and increase burdens in fields including material science and medicine. We employ a user‑centered approach to designing a robot‑assisted system for dialysis, a common multi‑day purification method used in polymer and protein synthesis. Through two usability studies, we obtain participant feedback and revise design requirements to develop the final system that satisfies scientists' needs and has the potential for applications in other experimental workflows. We anticipate that integration of this system into real synthesis procedures in a chemical wet lab will decrease workload on scientists during long experimental procedures and provide an effective approach to designing more systems that have the potential to accelerate scientific discovery and liberate scientists from tedious labor.
Authors: Stephen Afrifa, Biswash Khatiwada, Kapalik Khanal, Sanjay Shah, Lingjuan Wang-Li, Ramesh Bahadur Bist
Abstract: The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply‑chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain‑specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general‑purpose language models. This study presents PoultryLeX‑Net, a lexicon‑enhanced, domain‑adaptive dual‑stream transformer framework for fine‑grained sentiment analysis in poultry‑related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain‑specific embeddings and gated cross‑attention mechanisms. A lexicon‑guided stream captures poultry‑specific terminology and sentiment cues, while a contextual stream models long‑range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare‑related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX‑Net was evaluated against multiple baseline models, including a convolutional neural network and pre‑trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX‑Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC‑ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual‑stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.
Authors: Himanshu Swami, John M. McBride, Jean-Pierre Eckmann, Tsvi Tlusty
Abstract: Protein function is executed at the molecular surface, where shape and chemistry act together to govern interaction. Yet most comparison methods treat these aspects separately, privileging either global fold or local descriptors and missing their coupled organization. Here we introduce IFACE (Intrinsic Field‑Aligned Coupled Embedding), a correspondence‑based framework that aligns protein surfaces through probabilistic coupling of intrinsic geometry with spatially distributed chemical fields. From this alignment, we derive a joint geometric‑chemical distance that integrates structural and physicochemical discrepancies within a single formulation. Across diverse proteins, this distance separates conformational variability from true structural divergence more effectively than fold‑based similarity measures. Applied to the cytochrome P450 family, it reveals coherent family‑level organization and identifies conserved buried catalytic pockets despite the complex topology. By linking interpretable surface correspondences with a unified distance, IFACE establishes a principled basis for comparing protein interfaces and detecting functionally related interaction patches across proteins.
Authors: Benjamin J. A. Héry, Lucas Tepper, Andrea Guljas, Artem Pavlov, Beate Koksch, Cecilia Clementi, Roland R. Netz
Abstract: The Mori‑Zwanzig formalism is a powerful theoretical framework for deriving equations of motion for coarse‑grained observables in the form of generalized Langevin equations (GLEs) involving evolution and projection operators. Using a time‑dependent many‑body Hamiltonian and a multi‑dimensional Mori projection operator, we derive a non‑equilibrium Mori GLE for a multi‑dimensional observable of interest \vec{A} that consists of a Markovian force, a running integral over time of a non‑Markovian friction force, and an orthogonal force that is often interpreted as a random force. We study the structure of the derived GLE in three limiting cases: when the components of \vec{A} are uncorrelated, when the Hamiltonian is time‑independent and thus the system is at equilibrium, and when both conditions are simultaneously satisfied. We highlight the presence of a contribution to the Markovian force that takes the form of an instantaneous friction force which only vanishes when the components of \vec{A} are uncorrelated. Our non‑Markovian framework is an important step towards the systematic modeling of the coupled kinetics of coarse‑grained reaction coordinates in complex biological systems, exemplified for the coupled intra‑ and inter‑protein folding during fibril formation of the human islet amyloid polypeptide (IAPP).
Authors: Silvia Mura, Elisabetta Marini, Maurizio Magarini, Matti Hamalainen, Marco Hernandez
Abstract: Sub‑terahertz (sub‑THz) and terahertz (THz) radiation offer unique opportunities for non‑invasive diagnostics and imaging due to their sensitivity to water content and molecular dynamics in biological tissues. In this work, a comprehensive dielectric model of human skin and its cellular constituents is developed across these frequency ranges. The model combines multi‑Debye relaxation theory with effective medium formulations to account for intracellular water dynamics and macromolecular relaxation processes. Key cellular parameters, including water content, protein and lipid fractions, and ionic conductivity, are integrated from experimentally validated sources. The proposed framework enables realistic predictions of frequency‑dependent permittivity for different skin layers and cell types, providing a physically interpretable description of sub‑THz and THz tissue interactions. This approach establishes a foundation for the design and optimization of next‑generation diagnostic and imaging techniques operating in these frequency bands.
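The multi‑Debye relaxation form mentioned in the abstract can be sketched as a sum of first‑order relaxation terms, eps(omega) = eps_inf + sum_k delta_k / (1 + i*omega*tau_k). The code below is a minimal illustration only: the sign convention (negative imaginary part for loss), the function name, and the parameter values in the usage lines are assumptions chosen for the example. They are not taken from the paper, and the effective‑medium and ionic‑conductivity parts of the model are omitted.

```python
import math

def multi_debye(freq_hz, eps_inf, deltas, taus):
    """Complex relative permittivity from a sum of Debye relaxation terms.

    eps_inf : high-frequency limit of the permittivity
    deltas  : relaxation strengths, one per process
    taus    : relaxation times in seconds, one per process
    """
    omega = 2.0 * math.pi * freq_hz
    eps = complex(eps_inf, 0.0)
    for delta, tau in zip(deltas, taus):
        eps += delta / (1.0 + 1j * omega * tau)
    return eps

# Placeholder two-process parameters (illustrative values only).
EPS_INF, DELTAS, TAUS = 3.0, [70.0, 2.0], [8e-12, 2e-13]
eps_1thz = multi_debye(1e12, EPS_INF, DELTAS, TAUS)
```

With this convention the static limit is eps_inf plus the sum of the relaxation strengths, and the permittivity approaches eps_inf at frequencies far above 1/tau for every process.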
Authors: Zhuang Liu, Beijia Yuan, Mihir Rao, Gautam Reddy, William M. Jacobs
Abstract: Intrinsically disordered regions (IDRs) of proteins mediate sequence‑specific interactions underlying diverse cellular processes, including the formation of biomolecular condensates. Although IDRs strongly influence condensate compositions, quantitative frameworks that predict and explain their phase behavior in complex mixtures remain lacking. Here we introduce a thermodynamic model that quantitatively predicts the behavior of arbitrary combinations of IDRs across a wide range of concentrations, with accuracy comparable to state‑of‑the‑art simulations. The model learns low‑dimensional, context‑independent representations of IDR sequences that combine to form mixture representations, producing context‑dependent interactions. These representations define a thermodynamic metric space in which distances between IDRs correspond directly to differences in their thermodynamic properties. We show that the model predicts multicomponent phase diagrams in quantitative agreement with molecular simulations without being trained on free‑energy or phase‑coexistence data. The metric space provides geometrically intuitive predictions of IDR partitioning, multicomponent condensation, and context‑dependent mutational effects, addressing several central problems in IDR biophysics within a single model. Systematic interrogation of the learned representations reveals how amino‑acid composition and sequence patterning jointly determine mixture thermodynamics. Together, our results establish a unified and interpretable framework for predicting and understanding the behavior of complex mixtures of IDRs and other sequence‑dependent biomolecules.
Authors: Christine Keller, Andreas Münch, Barbara Wagner
Abstract: Ion transport through narrow channels is described by the coupled Poisson‑Nernst‑Planck‑Stokes equations (PNPS) on a continuum scale. However, direct numerical simulations in two or three dimensions of boundary value problems for small aspect ratio geometries, a crucial characteristic of nanopores, can quickly become computationally intensive and thus limit the insights into the underlying mechanisms that control electrokinetic phenomena. Taking advantage of the small aspect ratio, we derive a systematic asymptotic reduction of the PNPS system. In contrast to existing one‑dimensional reductions, which assume a Debye length much smaller than the channel radius, our analysis identifies a distinguished asymptotic regime in which the Debye length is allowed to be comparable to the channel width. Our approach has a significantly larger range of validity and contains existing approximations such as the Helmholtz‑Smoluchowski approximation as limiting cases. The derived asymptotic model extends also to a generalized PNPS system, where finite‑size constraints and solvation effects are taken into account and thus applies to other well‑known models such as the Bikerman‑Freise model. Using our asymptotic model we demonstrate that the ion current can undergo a number of different flow transitions and in particular predict that positively charged ions can be pushed against their electrostatic gradient. Furthermore, we show how finite‑size effects can influence the ion current and enhance ion selectivity. Finally, we revisit case studies of protein‑based channels from the literature to illustrate the predictive potential of our asymptotic model.
Authors: Nuutti Barron, Heng Rao, Urmi Saha, Yu Gu, Zhenghao Liu, Ge Yu, Defu Yang, Ashish Raj, Minghan Chen
Abstract: Mechanistic modeling provides a biophysically grounded framework for studying the spread of pathological tau protein in tauopathies like Alzheimer's disease. Existing approaches typically model tau propagation as a diffusive process on the brain's structural connectome, reproducing macroscopic patterns but neglecting microscale cellular transport and reaction mechanisms. The Network Transport Model (NTM) was introduced to fill this gap, explaining how region‑level progression of tau emerges from microscale biophysical processes. However, the NTM faces a common challenge for complex models defined by large systems of partial differential equations: the inability to perform parameter inference and mechanistic discovery due to high computational burden and slow model simulations. To overcome this barrier, we propose Tau‑BNO, a Brain Neural Operator surrogate framework for rapidly approximating NTM dynamics that captures both intra‑regional reaction kinetics and inter‑regional network transport. Tau‑BNO combines a function operator that encodes kinetic parameters with a query operator that preserves initial state information, while approximating anisotropic transport through a spectral kernel that retains directionality. Empirical evaluations demonstrate high predictive accuracy (R^2 ≈ 0.98) across diverse biophysical regimes and an 89% performance improvement over state‑of‑the‑art sequence models like Transformers and Mamba, which lack inherent structural priors. By reducing simulation time from hours to seconds, we show that the surrogate model is capable of producing new insights and generating new hypotheses. This framework is readily extensible to a broader class of connectome‑based biophysical models, showcasing the transformative value of deep learning surrogates to accelerate analysis of large‑scale, computationally intensive dynamical systems.
Authors: Abdulrahman Alswaidan, Jeffrey D. Varner
Abstract: Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training‑free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open‑ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low‑data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval‑to‑generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero‑shot class‑conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis‑corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training‑free score function retained family‑level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
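The link between an attention read and a gradient step on the modern Hopfield energy can be sketched in a few lines. The code assumes the standard energy E(q) = -(1/beta) * logsumexp(beta * X q) + 0.5 * |q|^2, whose gradient is q minus the softmax‑weighted average of the stored patterns, and an unadjusted Langevin update with noise scale sqrt(2 * step * temperature). The function names and this temperature parameterisation are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_retrieve(query, memories, beta):
    """Softmax-weighted average of stored patterns (one attention read)."""
    scores = [beta * sum(q * x for q, x in zip(query, mem)) for mem in memories]
    weights = softmax(scores)
    return [sum(w * mem[d] for w, mem in zip(weights, memories))
            for d in range(len(query))]

def energy_gradient(query, memories, beta):
    """Gradient of the assumed Hopfield energy: query minus retrieved pattern."""
    retrieved = attention_retrieve(query, memories, beta)
    return [q - r for q, r in zip(query, retrieved)]

def langevin_step(query, memories, beta, step, temperature, rng):
    """One unadjusted Langevin update: gradient step plus Gaussian noise."""
    grad = energy_gradient(query, memories, beta)
    scale = math.sqrt(2.0 * step * temperature)
    return [q - step * g + scale * rng.gauss(0.0, 1.0)
            for q, g in zip(query, grad)]
```

With `step=1.0` and `temperature=0.0` the update reduces to a plain attention read, which is the retrieval limit described in the abstract. Raising the temperature adds noise around the stored patterns.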
Authors: Xiaoyang Hou, Junqi Liu, Chence Shi, Xin Liu, Zhi Yang, Jian Tang
Abstract: Protein sequence design must balance designability, defined as the ability to recover a target backbone, with multiple, often competing, developability properties such as solubility, thermostability, and expression. Existing approaches address these properties through post hoc mutation, inference‑time biasing, or retraining on property‑specific subsets, yet they are target dependent and demand substantial domain expertise or careful hyperparameter tuning. In this paper, we introduce ProtAlign, a multi‑objective preference alignment framework that fine‑tunes pretrained inverse folding models to satisfy diverse developability objectives while preserving structural fidelity. ProtAlign employs a semi‑online Direct Preference Optimization strategy with a flexible preference margin to mitigate conflicts among competing objectives and constructs preference pairs using in silico property predictors. Applied to the widely used ProteinMPNN backbone, the resulting model MoMPNN enhances developability without compromising designability across tasks including sequence design for CATH 4.3 crystal structures, de novo generated backbones, and real‑world binder design scenarios, making it an appealing framework for practical protein sequence design.
Authors: Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab
Abstract: Protein language models (pLMs) have shown strong potential for zero‑shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real‑world pandemic prediction task for SARS‑CoV‑2. We benchmark well‑established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real‑world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real‑world mutation forecasting.
Authors: Aditya Ranganath, Hasin Us Sami, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla
Abstract: Protein language models often take into consideration the alignment between a protein sequence and its textual description. However, they do not take structural information into consideration. Traditional methods treat sequence and structure separately, limiting the ability to exploit the alignment between the structure and protein sequence embeddings. In this paper, we introduce a sequence structure contrastive alignment framework, which learns a shared embedding space where proteins are represented consistently across modalities. By training on large‑scale pairs of sequences and experimentally resolved or predicted structures, the model maximizes agreement between matched sequence structure pairs while pushing apart unrelated pairs. This alignment enables cross‑modal retrieval (e.g., finding structural neighbors given a sequence), improves downstream prediction tasks such as function annotation and stability estimation, and provides interpretable links between sequence variation and structural organization. Our results demonstrate that contrastive learning can serve as a powerful bridge between protein sequences and structures, offering a unified representation for understanding and engineering proteins.
Authors: Shaokuan Wang, Pengshan Cui, Yining Qian, An-Yang Lu, Xianpeng Wang
Abstract: Intrinsically disordered regions of proteins play a crucial role in cell signaling and drug discovery. However, their high structural flexibility makes accurate residue‑level prediction challenging. Existing methods often rely on single‑view representations or rigid manual fusion strategies, which fail to effectively balance the complex interplay between local amino acid preferences and long‑range sequence patterns. To address these limitations, we propose D2MOE, a Dual‑View Multiscale Features and Multi‑objective Evolutionary Algorithm, which consists of two stages. First, a dual‑view multiscale feature extraction method is introduced. This method integrates evolutionary views with deep semantic views and employs multiscale extractors to capture structural information across diverse receptive fields. Second, a multi‑objective evolutionary algorithm is designed to adaptively discover optimal fusion architectures. By co‑evolving discrete feature selection and continuous fusion weights, the algorithm adaptively explores optimal cross‑feature architectures to enhance predictive accuracy while maintaining model compactness. Experimental results across three benchmark datasets demonstrate that D2MOE consistently outperforms state‑of‑the‑art methods. D2MOE combines the feature extraction capabilities of deep learning with the global search advantages of evolutionary algorithms, enabling efficient feature integration without manual design, and providing a robust computational tool for protein disorder prediction.
Authors: Shunzhou Wan, Xibei Zhang, Xiao Xue, Peter V. Coveney
Abstract: Despite continuing hype about the role of AI in drug discovery, no "AI‑discovered drugs" have so far received regulatory approval. Here we assess one of the latest AI based tools in this domain. The ability to rapidly predict protein‑ligand structures and binding affinities is pivotal for accelerating drug discovery. Boltz‑2, a recently developed biomolecular foundation model, aims to bridge the gap between AI efficiency and physics‑based precision through a joint "co‑folding" approach. In this study, we provide an extensive evaluation of Boltz‑2 using two large‑scale datasets: 16,780 compounds for 3CLPro and 21,702 compounds for TNKS2. We compare Boltz‑2 predicted structures with traditional docking and binding affinities with binding free energies derived from the physics‑based ESMACS protocol. Structural analysis reveals significant global RMSD variations, indicating that Boltz‑2 predicts multiple protein conformations and ligand binding positions rather than a single converged pose. Energetic evaluations exhibit only weak to moderate correlations across the global datasets. Furthermore, a focused analysis of the top 100 compounds yields no significant correlation between the Boltz‑2 predictions and the binding free energies from fine‑grained ESMACS, alongside observed saturation difference in ligand structures. Our results show that while Boltz‑2 offers substantial speed for initial screening, it lacks the energetic resolution required for lead identification. These findings highlight the necessity of employing physics‑based methods for the reliability and refinement of AI‑derived models.
Authors: Bastian Pfeifer, Michael G. Schimek
Abstract: Estimating node similarity is a fundamental task in network analysis and graph‑based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start‑node‑anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node‑to‑node affinity matrices. TopKGraphs provides a non‑parametric, interpretable, and general‑purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti‑Fortunato‑Radicchi benchmark graphs), k‑nearest‑neighbor graphs from tabular datasets, and a curated high‑confidence protein‑protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion‑based method (personalized PageRank), and an embedding‑based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding‑based approaches, facilitating both data mining and network analysis applications.
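A Jaccard‑biased, start‑anchored random walk of the kind the abstract describes might look roughly like the sketch below. It is not the TopKGraphs implementation. It assumes that transition weights compare each candidate's closed neighbourhood with the start node's, adds a small `eps` so no neighbour has zero probability, and replaces robust rank aggregation with simple visit counts. All function names and defaults are hypothetical.

```python
import random
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def biased_walk(adj, start, length, rng, eps=1e-3):
    """Random walk whose transitions favour neighbours resembling the start node."""
    anchor = adj[start] | {start}
    node, visited = start, []
    for _ in range(length):
        neighbours = sorted(adj[node])
        if not neighbours:
            break
        weights = [jaccard(adj[n] | {n}, anchor) + eps for n in neighbours]
        node = rng.choices(neighbours, weights=weights, k=1)[0]
        visited.append(node)
    return visited

def top_k_neighbours(adj, start, k, walks=200, length=5, seed=0):
    """Rank nodes by how often start-anchored walks visit them."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(walks):
        counts.update(n for n in biased_walk(adj, start, length, rng) if n != start)
    return [node for node, _ in counts.most_common(k)]

# Two triangles joined by a single bridge edge (2-3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
nearest = top_k_neighbours(adj, start=0, k=2)
```

On this toy graph the walks started at node 0 mostly stay inside its own triangle, so the two most‑visited nodes are its clique mates.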
Authors: Manuel Fernández Burda, Santiago Aranguri, Iván Arcuschin Moreno, Enzo Ferrante
Abstract: Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual‑use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference‑time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity‑finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon‑finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation‑based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
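One plausible reading of the logit‑difference step is sketched below: the next‑token logits of the baseline model are pushed away from those of the toxicity‑finetuned model by a factor `alpha`. The exact formula, the sign convention, and the name `alpha` are assumptions made here for illustration. The abstract only states that the logit difference between the two models is amplified at inference time.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def amplified_logits(base_logits, toxic_logits, alpha):
    """Assumed form: base + alpha * (base - toxic), applied per token."""
    return [b + alpha * (b - t) for b, t in zip(base_logits, toxic_logits)]

# Toy three-token vocabulary; token 1 is the one the toxic model prefers.
base = [1.0, 1.0, 0.0]
toxic = [1.0, 3.0, 0.0]
p_base = softmax(base)
p_steered = softmax(amplified_logits(base, toxic, alpha=1.0))
```

Setting `alpha=0` recovers the baseline distribution, and larger values move more probability mass away from tokens the toxicity‑finetuned model favours. No retraining is involved, since only the two models' logits at each decoding step are needed.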
Authors: Dinesh Parthasarathy, Saminathan Ramakrishnan, Georgia Tsang, Auro Varat Patnaik, Sabrina M. C. Hardy, Willem Vanderlinden, Jamieson Howard, Braden Bylett, James R. Law, Mark C. Leake, Agnes Noy, Davide Michieletto
Abstract: The Integration Host Factor (IHF) is a nucleoid‑associated protein critical for both DNA compaction and biofilm stability. While its role in DNA packaging within the cell is well understood, its structural role in scaffolding biofilms is more puzzling and difficult to reconcile with its known DNA bending activity. Here, we investigated how IHF‑DNA interactions are modulated across a pH spectrum mimicking the acidic microenvironments of bacterial biofilms. By performing all‑atom calculations we discovered that low pH leads to a change in protonation of IHF residues, which in turn exposes positively charged patches. We then conjectured that these positively charged residues could lead to intermolecular DNA bridging and tested this hypothesis through single‑molecule and bulk assays. We discovered that while at physiological pH IHF mostly bends DNA, at pH < 5 there is clear evidence of IHF‑mediated intermolecular crosslinking. Our results demonstrate that pH significantly modulates IHF‑DNA interactions and explain the structural role played by IHF in supporting biofilm mechanics through intermolecular crosslinking.
Authors: Patrick Sorrel Mvoto Kongo, Steve Cabrel Teguia Kouam, Jean-Pierre Tchapet Njafa, Serge Guy Nana Engo
Abstract: The discovery of synthetically accessible organic semiconductors with exceptional performance remains a critical bottleneck in materials science. While these materials offer compelling advantages ‑ structural modularity, mechanical flexibility, and cost‑effective solution processing ‑ for applications in photovoltaics and biosensors, identifying candidates that balance high efficiency with practical synthesis presents significant challenges. To address this challenge, we developed a high‑throughput screening approach using 17 458 molecules from the PubChemQC B3LYP/6‑31G//PM6 dataset. Our strategy employs a composite metric, PCESAScore = PCE ‑ SAScore, which systematically balances power conversion efficiency (PCE) predictions from the Scharber model against synthetic accessibility scores. This approach successfully identified seven multi‑functional candidates that demonstrate both exceptional photovoltaic performance (PCE up to 36.1 %) and strong protein‑binding affinity for biosensing applications. Notably, molecule 4550 emerged as the optimal candidate, exhibiting a ligand efficiency of 0.340 kcal/mol/heavy atom with 100 % target promiscuity. Our computational framework integrates machine learning, density functional theory, and molecular docking to bridge the gap between theoretical performance and experimental feasibility. These findings establish a systematic pathway for discovering synthetically compatible organic semiconductors that can simultaneously address energy conversion and molecular recognition challenges.
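The composite metric above is simple enough to sketch directly. The example assumes PCE is given in percent and SAScore on its usual scale, as the formula PCESAScore = PCE - SAScore suggests. The candidate names and values are hypothetical and not taken from the dataset.

```python
def pce_sa_score(pce_percent, sa_score):
    """Composite metric from the abstract: PCE minus synthetic accessibility score."""
    return pce_percent - sa_score

def rank_candidates(candidates):
    """Sort (name, pce, sa) records by descending composite score, ties by name."""
    return sorted(candidates, key=lambda c: (-pce_sa_score(c[1], c[2]), c[0]))

# Hypothetical values for illustration only.
demo = [("mol_a", 20.0, 6.5), ("mol_b", 18.0, 2.0), ("mol_c", 25.0, 9.0)]
ranked_names = [name for name, _, _ in rank_candidates(demo)]
```

Note that an unweighted difference lets a hard-to-synthesize molecule with high PCE tie with an easy one with lower PCE, as `mol_b` and `mol_c` do here.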
Authors: Sybren D van den Bedem, Ellen Kuhl, Caroline Cotto
Abstract: Global food production must reduce environmental impact while meeting rising demand for dietary protein. Plant‑based meats aim to preserve the sensory and cultural role of animal meat, while lowering greenhouse gas emissions, land use, and health risks. Advances in protein structure and flavor chemistry have improved product quality; yet, consumers continue to prioritize taste and texture over sustainability and systematic large‑scale consumer surveys are scarce. It remains unclear how plant‑based products rank against animal benchmarks and which product attributes most strongly influence overall liking. Here we show, in a large‑scale, blinded, in‑person sensory evaluation across 14 product categories, 2,684 consumers, more than 11,000 product evaluations and 800,000 data points, that plant‑based products still trail animal benchmarks at the category average level, but approach parity in selected formats: Plant‑based unbreaded chicken filets, chicken nuggets, and burgers achieved mean overall liking scores of 5.1, 4.9, and 5.2, differing from the animal benchmark by only 0.1, 0.2, and 0.3 points on a seven‑point scale. For unbreaded chicken filets and burgers, 48% and 47% of participants rated the plant‑based product the same as or better than the animal benchmark. Categories with higher sensory parity captured 5‑14% market share compared with less than 1% for low‑parity categories. Penalty analysis identified savoriness, aftertaste, juiciness, and tenderness as the strongest determinants of liking. These findings show that sensory parity is technically achievable, but not yet consistent across product types. By publicly sharing all data, we establish an open benchmark for alternative protein performance to democratize research and accelerate principled, data‑driven innovation. All data are freely available at https://www.nectar.org/sensory‑research/2025‑taste‑of‑the‑industry.
Authors: Ihor Kendiukhov
Abstract: Background: Single‑cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co‑expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models.
Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2‑316M (18 layers, d=1152) and scGPT whole‑human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U‑shaped layer profiles reflecting hierarchical abstraction. Features organize into co‑activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross‑layer information highways (63 to 99.8 percent). When tested against genome‑scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory‑target‑specific feature responses. A multi‑tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck.
Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single‑cell foundation models.
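For readers unfamiliar with TopK sparse autoencoders, the forward pass can be sketched as below. This is a generic illustration under assumed conventions (linear encoder, keep the k largest pre-activations, clamp at zero, linear decoder). It is not the configuration used for Geneformer or scGPT, and all weights are toy values.

```python
def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode."""
    # One pre-activation per dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    # Indices of the k largest pre-activations.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Sparse code: kept features are clamped at zero, all others are zeroed.
    codes = [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]
    # Linear reconstruction from the sparse code.
    recon = [b_dec[j] + sum(codes[i] * w_dec[i][j] for i in range(len(codes)))
             for j in range(len(x))]
    return codes, recon

# Toy dictionary with 3 features over a 2-dimensional activation.
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]
codes, recon = topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k=1)
```

With k = 1 only the third feature fires, so the reconstruction is a scaled copy of that feature's decoder row.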
Authors: Sara Seager, William Bains, Iaroslav Iakubivskyi, Rachana Agrawal, John Jenkins, Pranav Shinde, Janusz J. Petkowski
Abstract: Liquid is a fundamental requirement for life as we understand it, but whether that liquid has to be water is not known. We propose the hypothesis that ionic liquids (ILs) and deep eutectic solvents (DES) constitute a class of non‑aqueous planetary liquids capable of persisting on a wide range of bodies where stable liquid water cannot exist. This hypothesis is motivated by key physical properties of ILs and DES. Many exhibit vapor pressures orders of magnitude lower than that of water and remain liquid across exceptionally wide temperature ranges, from cryogenic to well above terrestrial temperatures. These properties permit stable liquids to exist where liquid water would rapidly evaporate or freeze and outside of bulk phases as persistent microscale reservoirs, such as thin films and pore‑filling droplets. In other words, ILs and DES can persist in environments without requiring oceans, thick atmospheres, or narrowly regulated climate conditions. We further hypothesize that ILs and DES could act as solvents for non‑Earth‑like life. Our hypothesis extends to the idea that ILs and DES could enable prebiotic chemistry by providing long‑lived, protective liquid environments for complex organic molecules on bodies such as comets and asteroids, where liquid water is absent. Based on the occurrence of DES‑like mixtures as protective intracellular liquids in desiccation‑tolerant plants, we propose that ILs and DES might be solvents that life elsewhere purposefully evolves. We review protein and other biomolecule studies in ILs and DES and outline planetary environments in which ILs and DES might occur by discussing available anions and cations. We present strategies to advance the IL/DES solvent hypothesis using laboratory studies, computational chemistry, planetary missions, analysis of existing spectroscopic datasets, and modeling of liquid microniches and chemical survival on small bodies.
Authors: Jiahua Zhang, Yong Hou, Xinhao Hu, Yicheng Wang, Madoka Suzuki, Bo Gao, Zhiqin Chu
Abstract: The local thermal conductivity (\kappa) is a pivotal biophysical parameter, governing intracellular heat flux and underlying functional processes like metabolic regulation and stress response. However, label‑free mapping with sub‑micron resolution in living cells remains a challenge. Here, we present frequency‑domain fluorescence thermometry (FD‑FTM), an all‑optical method based on a hybrid nanodiamond‑on‑gold‑membrane platform, which enables quantitative mapping of \kappa in biological systems. Fluorescent nanodiamonds (FNDs) are deposited on substrates coated with a 50 nm gold membrane, where FNDs function as nanoscale thermometers, and the gold membrane serves as a photothermal heat source. We validate FD‑FTM across reference materials and biological media, with fitting uncertainties of ~10%. By varying the modulation frequency, we tune the thermal penetration depths, enabling controlled heat propagation from the substrate to the cell nucleus. The method delivers sensitivity sufficient to resolve changes in biofluid thermal conductivity on the order of 16% relative to water. Using these capabilities, we demonstrate non‑invasive thermal profiling across scales: at the cellular level, nuclear chromatin packing yields \kappa higher by ~10% relative to the cytoplasm; at the organelle level, we resolve \kappa variations associated with protein aggregates formed during liquid‑liquid phase separation in an amyotrophic lateral sclerosis disease model. Temporal measurements in living cells over 30 minutes further reveal spatially resolved intracellular responses to osmotic stress, linking nanoscale thermal dynamics to biomolecular condensates. These results establish FD‑FTM as a label‑free, robust, and quantitative platform for thermally decoding intracellular processes, opening avenues for studying metabolic heterogeneity, disease mechanisms, and therapeutic responses.
Authors: Jai Geddes-Nelson, Xiaochen Liu, Ken-Tye Yong
Abstract: Huntington's disease (HD) is caused by CAG‑repeat expansion in HTT, which lengthens the polyglutamine (polyQ) tract in huntingtin (HTT) and promotes misfolding and aggregation. While polyQ‑length‑dependent aggregation is well established, the atomistic conformational dynamics preceding aggregation remain less defined. Here we perform all‑atom molecular dynamics simulations of HTT exon‑1 constructs containing the N17 domain, polyQ tracts of clinically relevant lengths (Q21, wildtype; Q40, adult onset threshold; Q70, juvenile onset), and the polyproline (polyP) region. Multi‑copy simulations (four chains) were run for 100 ns in explicit SPC/E water using the OPLS‑AA force field. We quantified radius of gyration (Rg), solvent‑accessible surface area (SASA), root‑mean‑square deviation (RMSD), and intra‑protein hydrogen bonds as proxies for conformational expansion and aggregation propensity. PolyQ expansion drove progressive increases in Rg and SASA, consistent with more extended, solvent‑exposed ensembles. We further tested organic co‑solvents (methanol, hexane, trichloroethylene; 0.5 to 1.0 M), which modulated these landscapes in a solvent‑dependent manner. Trichloroethylene induced marked expansion in Q21 and Q40, whereas methanol produced mild compaction in Q21. To our knowledge, this is the first MD study to systematically examine co‑solvent effects on HTT exon‑1 conformational dynamics. Although limited sampling precludes definitive mechanistic conclusions, the observed trends suggest that hydrophobic co‑solvents can bias HTT exon‑1 toward more expanded ensembles, motivating computational studies of gene‑environment modulation in HD.
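Of the observables listed above, the radius of gyration has a compact closed form that can be sketched directly. The example uses the standard mass-weighted definition, Rg = sqrt(sum_i m_i |r_i - r_com|^2 / sum_i m_i). The abstract does not say which toolchain computed it, so the function name and inputs are hypothetical.

```python
import math

def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of a set of 3D points."""
    total = sum(masses)
    # Centre of mass, one component per axis.
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total
           for k in range(3)]
    # Mass-weighted mean squared distance from the centre of mass.
    sq = sum(m * sum((c[k] - com[k]) ** 2 for k in range(3))
             for m, c in zip(masses, coords))
    return math.sqrt(sq / total)

# Two unit masses placed 2 units apart.
rg = radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

A larger value for the same chain length indicates a more extended conformation, which is how the abstract reads the polyQ trend.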
Authors: Binon Teji, Subhajit Bandyopadhyay, Swarup Roy
Abstract: Prioritizing disease‑associated genes is central to understanding the molecular mechanisms of complex disorders such as Alzheimer's disease (AD). Traditional network‑based approaches rely on static centrality measures and often fail to capture cross‑modal biological heterogeneity. We propose NETRA (Node Evaluation through Transformer‑based Representation and Attention), a multimodal graph transformer framework that replaces heuristic centrality metrics with attention‑driven relevance scoring. Using AD as a case study, gene regulatory networks are independently constructed from microarray, single‑cell RNA‑seq, and single‑nucleus RNA‑seq data. Random‑walk sequences derived from these networks are used to train a BERT‑based model for learning global gene embeddings, while modality‑specific gene expression profiles are compressed using variational autoencoders. These representations are integrated with auxiliary biological networks, including protein‑protein interactions, Gene Ontology semantic similarity, and diffusion‑based gene similarity, into a unified multimodal graph. A graph transformer assigns NETRA scores that quantify gene relevance in a disease‑specific and context‑aware manner. Gene set enrichment analysis shows that NETRA achieves a normalized enrichment score of about 3.9 for the Alzheimer's disease pathway, substantially outperforming classical centrality measures and diffusion models. Top‑ranked genes enrich multiple neurodegenerative pathways, recover a known late‑onset AD susceptibility locus at chr12q13, and reveal conserved cross‑disease gene modules. The framework preserves biologically realistic heavy‑tailed network topology and is readily extensible to other complex disorders.
Authors: Youssef Abo-Dahab, Ruby Hernandez, Ismael Caleb Arechiga Duran
Abstract: The contributions of model complexity, data volume, and feature modalities to knowledge graph‑based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM‑2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph attention based drug structure encoder and retaining only topological embeddings combined with ESM‑2 protein features improved drug protein PR‑AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target‑centric information and drug network topology alone, without requiring explicit chemical structure representations.
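The temporal split described above can be sketched as follows. The cutoff years come from the abstract. The record layout (`drug`, `target`, `year`) and the extra step that drops test pairs already present in training are assumptions made for illustration.

```python
def temporal_split(edges, train_until=2022, test_from=2023, test_until=2025):
    """Split dated edges into train (<= train_until) and test (test_from..test_until)."""
    train = [e for e in edges if e["year"] <= train_until]
    seen = {(e["drug"], e["target"]) for e in train}
    # Assumed safeguard: drop test pairs that already occur in training.
    test = [e for e in edges
            if test_from <= e["year"] <= test_until
            and (e["drug"], e["target"]) not in seen]
    return train, test

# Hypothetical records for illustration only.
edges = [
    {"drug": "d1", "target": "p1", "year": 2019},
    {"drug": "d2", "target": "p1", "year": 2022},
    {"drug": "d1", "target": "p1", "year": 2023},  # duplicate pair, dropped
    {"drug": "d3", "target": "p2", "year": 2024},
]
train, test = temporal_split(edges)
```

Splitting on time rather than at random keeps the model from being scored on interactions it could only know from the future.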
Authors: Jiayang Wu, Jiale Zhou, Rubo Wang, Xingyi Zhang, Xun Lin, Tianxu Lv, Leong Hou U, Yefeng Zheng
Abstract: Accurate identification of protein active sites at the residue level is crucial for understanding protein function and advancing drug discovery. However, current methods face two critical challenges: vulnerability in single‑instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture‑of‑Experts with Retrieval Augmentation (MERA), the first retrieval‑augmented framework for protein active site identification. MERA employs hierarchical multi‑expert retrieval that dynamically aggregates contextual information from chain, sequence, and active‑site perspectives through residue‑level mixture‑of‑experts gating. To prevent modality degradation, we propose a reliability‑aware fusion strategy based on Dempster‑Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Extensive experiments on ProTAD‑Gen and TS125 datasets demonstrate that MERA achieves state‑of‑the‑art performance, with 90% AUPRC on active site prediction and significant gains on peptide‑binding site identification, validating the effectiveness of retrieval‑augmented multi‑expert modeling and reliability‑guided fusion.
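The reliability-aware fusion idea rests on two textbook operations from Dempster-Shafer theory, sketched here for a two-hypothesis frame {active, inactive}. This is standard Shafer discounting and Dempster's rule, not the MERA implementation. The reliability values are fixed numbers here, whereas the abstract describes learnable coefficients.

```python
def discount(mass, reliability):
    """Shafer discounting: move (1 - reliability) of the belief to 'unknown'.
    mass = (m_active, m_inactive, m_unknown), summing to 1."""
    a, i, theta = mass
    return (reliability * a, reliability * i,
            1.0 - reliability + reliability * theta)

def combine(m1, m2):
    """Dempster's rule of combination on the frame {active, inactive}."""
    a1, i1, t1 = m1
    a2, i2, t2 = m2
    conflict = a1 * i2 + i1 * a2  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    return ((a1 * a2 + a1 * t2 + t1 * a2) / norm,
            (i1 * i2 + i1 * t2 + t1 * i2) / norm,
            (t1 * t2) / norm)

# Two hypothetical modalities voting on one residue.
seq_mass = (0.8, 0.1, 0.1)
struct_mass = (0.6, 0.3, 0.1)
fused = combine(seq_mass, struct_mass)
# A fully unreliable modality (reliability 0) leaves the other unchanged.
ignored = combine(seq_mass, discount(struct_mass, 0.0))
```

Discounting is what prevents an unreliable modality from dominating: its mass drifts toward "unknown", which is neutral under combination.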
Authors: Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Abstract: Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease‑related variants, protein‑level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein‑related information, providing support for disease‑related variant analysis and protein function reasoning in protein‑centric inference. However, such search agents are mostly limited to single‑round, text‑only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision‑making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi‑round protein search agent trained with multi‑dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real‑time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text‑based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi‑dimensional sequence features with signal pathways and regulatory networks.
Authors: Yashvir S. Grewal, Daniel M. Steinberg, Thang D. Bui, Cheng Soon Ong, Edwin V. Bonilla
Abstract: Discrete diffusion and flow matching models capture complex, non‑additive and non‑autoregressive structure in high‑dimensional objective landscapes through parallel, iterative refinement. However, their implicit generative nature precludes direct integration with principled variational frameworks for online black‑box optimisation, such as variational search distributions (VSD) and conditioning by adaptive sampling (CbAS). We introduce Active Flow Matching (AFM), which reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient‑based steering of flow models toward high‑fitness regions while preserving the rigour of VSD and CbAS. We derive forward and reverse Kullback‑Leibler (KL) variants using self‑normalised importance sampling. Across a suite of online protein and small molecule design tasks, forward‑KL AFM consistently performs competitively compared to state‑of‑the‑art baselines, demonstrating effective exploration‑exploitation under tight experimental budgets.
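Self-normalised importance sampling, which the abstract names as the tool behind both KL variants, can be sketched in its generic form. The example is not the AFM objective: it only shows the estimator, where each sample carries a log weight (log unnormalised target minus log proposal) and the weights are normalised to sum to one. The function name is hypothetical.

```python
import math

def snis_estimate(values, log_weights):
    """Self-normalised importance sampling estimate of E_target[f].
    values[i] = f(x_i); log_weights[i] = log p_tilde(x_i) - log q(x_i)."""
    m = max(log_weights)                       # subtract max for stability
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return sum(wi * v for wi, v in zip(w, values)) / total

# Two samples; the second is three times more likely under the target.
estimate = snis_estimate([0.0, 1.0], [0.0, math.log(3.0)])
```

Because the weights are normalised, any constant added to all log weights cancels, so the target density only needs to be known up to a constant.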
Authors: Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
Abstract: Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever‑growing data, but also as an opportunity to take advantage of the temporal meta‑information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity‑based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta‑information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real‑world application.
Authors: Viacheslav Dubovitskii, Filippo Utro, Aritra Bose, Laxmi Parida, Sabrina Maniscalco, Sergey N. Filippov
Abstract: Quantum walks provide a versatile framework for probing the structural and dynamical properties of complex systems ranging from biological networks to synthetic materials. However, their realization on current noisy pre‑fault‑tolerant quantum computers is fundamentally limited by decoherence. Conventional dense encodings of graph structures require prohibitively deep circuits, making them incompatible with existing hardware. Here we introduce an algorithm that leverages symmetry‑sector encoding and trades circuit depth for qubits, while integrating symmetry‑respecting postselection as an effective noise‑mitigation strategy. This combination enables us to execute practical quantum‑walk circuits for biological networks on actual quantum hardware. We benchmark the proposed methodology against known state‑of‑the‑art circuit architectures, highlighting a significant reduction of circuit depth in our approach at the cost of moderate qubit overhead. Utilizing 40 qubits, we implement quantum walks on complex graphs containing up to 17 nodes and 20 edges, the largest experiment on superconducting hardware to date, with the Hellinger fidelity exceeding 87% throughout 7 steps. We present a case study that illustrates how experimentally obtained quantum‑walk dynamics on a protein‑protein‑interaction network can be applied to prioritizing disease‑associated genes. We discuss the framework scalability in the pre‑fault‑tolerant era and its potential for studying larger biological networks.
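The Hellinger fidelity quoted above compares a measured outcome distribution with the ideal one. The sketch below uses the common definition, the squared Bhattacharyya coefficient (sum_i sqrt(p_i q_i))^2. The abstract does not state which convention the authors used, so treat this as an assumption, and the counts are hypothetical.

```python
import math

def hellinger_fidelity(counts_p, counts_q):
    """Squared Bhattacharyya coefficient between two count dictionaries."""
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    keys = set(counts_p) | set(counts_q)
    # Counts are normalised to probabilities; missing outcomes count as zero.
    bc = sum(math.sqrt((counts_p.get(k, 0) / total_p) *
                       (counts_q.get(k, 0) / total_q))
             for k in keys)
    return bc ** 2

# Hypothetical ideal vs. noisy measurement counts over bitstrings.
ideal = {"00": 500, "11": 500}
noisy = {"00": 450, "11": 430, "01": 60, "10": 60}
fidelity = hellinger_fidelity(ideal, noisy)
```

The value is 1 for identical distributions and 0 when the two share no outcomes.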
Authors: Advaith Maddipatla, Anar Rzayev, Marco Pegoraro, Martin Pacesa, Paul Schanda, Ailie Marx, Sanketh Vedula, Alex M. Bronstein
Abstract: Protein function relies on dynamic conformational ensembles, yet current generative models like AlphaFold3 often fail to produce ensembles that match experimental data. Recent experiment‑guided generators attempt to address this by steering the reverse diffusion process. However, these methods are limited by fixed sampling horizons and sensitivity to initialization, often yielding thermodynamically implausible results. We introduce a general inference‑time optimization framework to solve these challenges. First, we optimize over latent representations to maximize ensemble log‑likelihood, rather than perturbing structures post hoc. This approach eliminates dependence on diffusion length, removes initialization bias, and easily incorporates external constraints. Second, we present novel sampling schemes for drawing Boltzmann‑weighted ensembles. By combining structural priors from AlphaFold3 with force‑field‑based priors, we sample from their product distribution while balancing experimental likelihoods. Our results show that this framework consistently outperforms state‑of‑the‑art guidance, improving diversity, physical energy, and agreement with data in X‑ray crystallography and NMR, often fitting the experimental data better than deposited PDB structures. Finally, inference‑time optimization experiments maximizing ipTM scores reveal that perturbing AlphaFold3 embeddings can artificially inflate model confidence. This exposes a vulnerability in current design metrics, whose mitigation could offer a pathway to reduce false discovery rates in binder engineering.
Authors: Felipe Bivort Haiek
Abstract: In this Master's thesis, the graph properties of a multi‑level drug‑protein network are studied, as well as how the network's shape has informed discoveries over the years, identifying primarily crawling discoveries and a smaller number of hopping discoveries. Finally, the network structure is used to inform a network diffusion recommendation system and to prioritize existing drugs for repurposing against proteins in organisms that cause Neglected Tropical Diseases.
Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin
Abstract: Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody‑based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi‑staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt‑guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein‑aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype‑consistent learning strategy (PCLS) establishes cross‑image semantic interaction to correct spatial misalignments (Challenge 3).
Authors: Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov
Abstract: Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in masked‑token prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino‑acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language‑based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
Authors: L. Martino, M. M. Garcia, P. S. Paradas, E. Curbelo
Abstract: Counting immunopositive cells in biological tissues generally requires either manual annotation or (when available) rough automatic systems for scanning signal surface and intensity in whole slide imaging. In this work, we tackle the problem of counting microglial cells in lumbar spinal cord cross‑sections of rats by omitting cell detection and focusing only on the counting task. Manual cell counting is a time‑consuming task and additionally entails extensive personnel training. The classic automatic color‑based methods roughly inform about the total labeled area and intensity (protein quantification) but do not specifically provide information on cell number. Since the images to be analyzed have a high resolution but a large number of pixels contain just noise or artifacts, we first perform a pre‑processing step that generates several filtered images (providing a tailored, efficient feature extraction). Then, we design an automatic kernel counter that is a non‑parametric and non‑linear method. The proposed scheme can be easily trained on small datasets since, in its basic version, it relies only on one hyper‑parameter. However, being non‑parametric and non‑linear, the proposed algorithm is flexible enough to express all the information contained in rich and heterogeneous datasets as well (providing the maximum overfit if required). Furthermore, the proposed kernel counter also provides uncertainty estimation of the given prediction, and can directly tackle the case of receiving several expert opinions over the same image. Different numerical experiments with artificial and real datasets show very promising results. Related Matlab code is also provided.
Authors: Yixuan Li, Archer Y. Yang, Yue Li
Abstract: Biological signals of interest in high‑dimensional data are often masked by dominant variation shared across conditions. This variation, arising from baseline biological structure or technical effects, can prevent standard dimensionality reduction methods from resolving condition‑specific structure. The challenge is that these confounding topics are often unknown and mixed with biological signals. Existing background correction methods are either unscalable to high dimensions or not interpretable. We introduce background contrastive Non‑negative Matrix Factorization, which extracts target‑enriched latent topics by jointly factorizing a target dataset and a matched background using shared non‑negative bases under a contrastive objective that suppresses background‑expressed structure. This approach yields non‑negative components that are directly interpretable at the feature level, and explicitly isolates target‑specific variation. The model is learned by an efficient multiplicative update algorithm via matrix multiplication such that it is highly efficient on GPU hardware and scalable to big data via minibatch training akin to deep learning approaches. Across simulations and diverse biological datasets, the method reveals signals obscured by conventional methods, including disease‑associated programs in postmortem depressive brain single‑cell RNA‑seq, genotype‑linked protein expression patterns in mice, treatment‑specific transcriptional changes in leukemia, and TP53‑dependent drug responses in cancer cell lines.
Authors: Ihor Kendiukhov
Abstract: Single‑cell foundation models such as scGPT learn high‑dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space.
The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein‑protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017).
In a compact six‑dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B‑cell master regulators BATF and BACH2 show convergence toward the B‑cell identity anchor PAX5 across transformer depth. Cell‑type marker genes cluster with high fidelity (AUROC = 0.851). Residual‑stream geometry encodes biological structure complementary to attention patterns. These results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing.
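The reported statistic (Spearman rho = 1.000 across n = 5 quintiles, p = 0.017) is consistent with an exact two‑sided permutation test: with five ranks, only the identity and the fully reversed orderings reach |rho| = 1, giving 2/120. The sketch below checks this arithmetic; it is an assumption about how the p‑value could be obtained, since the abstract does not name the test.

```python
from itertools import permutations

def spearman_rho(rank_x, rank_y):
    """Spearman correlation for untied ranks."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

def exact_two_sided_p(n, observed_rho):
    """Fraction of all rank permutations at least as extreme as the observed rho."""
    base = tuple(range(1, n + 1))
    perms = list(permutations(base))
    extreme = sum(
        1 for p in perms
        if abs(spearman_rho(base, p)) >= abs(observed_rho) - 1e-12
    )
    return extreme / len(perms)

p_value = exact_two_sided_p(5, 1.0)  # 2 of 120 permutations
```

With only five points this is the smallest two‑sided p‑value an exact rank test can produce, so the significance level is bounded by the number of quintiles rather than by the strength of the trend.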
Authors: Rabeya Tus Sadia, Qiang Ye, Qiang Cheng
Abstract: Accurate prediction of RNA‑associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM‑2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context‑dependent nature of molecular binding. We introduce CrossLLM‑Mamba, a novel framework that reformulates interaction prediction as a state‑space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep "crosstalk" between modality‑specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high‑dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard‑negative samples. Comprehensive experiments across three interaction categories (RNA‑protein, RNA‑small molecule, and RNA‑RNA) demonstrate that CrossLLM‑Mamba achieves state‑of‑the‑art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state‑space modeling as a powerful paradigm for multi‑modal biological interaction prediction.
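To make the role of Focal Loss concrete, here is a minimal sketch of the standard binary focal loss, which down‑weights well‑classified examples so that hard ones dominate the gradient. The `gamma` and `alpha` defaults are commonly used illustrative values, not settings taken from CrossLLM‑Mamba.

```python
import math

def focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: keeps most of its loss
# With gamma = 0 the modulating factor vanishes and the loss is weighted cross-entropy.
plain = focal_loss(0.9, 1, gamma=0.0, alpha=0.5)
```

Raising `gamma` increases how strongly easy examples are suppressed relative to hard negatives.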
Authors: Elio Moreau, Florentin Coeurdoux, Grégoire Ferre, Eric Vanden-Eijnden
Abstract: Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring their landscape remain limited. Standard latent‑space interpolations fail to respect the structure of the learned distribution, often traversing low‑density regions. We introduce a framework based on the string method that computes continuous paths between samples by evolving curves under the learned score function. Operating on pretrained models without retraining, our approach interpolates between three regimes: pure generative transport, which yields continuous sample paths; gradient‑dominated dynamics, which recover minimum energy paths (MEPs); and finite‑temperature string dynamics, which compute principal curves, i.e. self‑consistent paths that balance energy and entropy. We demonstrate that the choice of regime matters in practice. For image diffusion models, MEPs contain high‑likelihood but unrealistic "cartoon" images, confirming prior observations that likelihood maxima appear unrealistic; principal curves instead yield realistic morphing sequences despite lower likelihood. For protein structure prediction, our method computes transition pathways between metastable conformers directly from models trained on static structures, yielding paths with physically plausible intermediates. Together, these results establish the string method as a principled tool for probing the modal structure of diffusion models: identifying modes, characterizing barriers, and mapping connectivity in complex learned distributions.
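The gradient‑dominated regime can be illustrated with the classical zero‑temperature string method on a toy two‑mode density. The sketch below is an assumption‑laden stand‑in: it uses an analytic score for p proportional to exp(-V) with V = (x^2 - 1)^2 + 2y^2 instead of a learned diffusion score, and all function names are hypothetical.

```python
import math

def score(x, y):
    """Gradient of log p for p proportional to exp(-V), V = (x^2 - 1)^2 + 2 y^2."""
    return -4.0 * x * (x * x - 1.0), -4.0 * y

def reparametrize(points):
    """Redistribute the images at equal arc length along the current curve."""
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    n = len(points)
    new_points = [points[0]]
    seg = 0
    for k in range(1, n - 1):
        target = cum[-1] * k / (n - 1)
        while cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = (target - cum[seg]) / span if span > 0 else 0.0
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        new_points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    new_points.append(points[-1])
    return new_points

def string_method(n_images=21, step=0.01, iters=500):
    # Initial guess: a curved path joining the two modes at (-1, 0) and (1, 0)
    path = []
    for k in range(n_images):
        s = k / (n_images - 1)
        path.append((-1.0 + 2.0 * s, 0.5 * math.sin(math.pi * s)))
    for _ in range(iters):
        # Move every image along the score, then restore equal spacing
        moved = []
        for x, y in path:
            gx, gy = score(x, y)
            moved.append((x + step * gx, y + step * gy))
        path = reparametrize(moved)
    return path

path = string_method()
```

For this potential the minimum energy path is the segment y = 0 through the saddle at the origin, so the initial bump relaxes onto the x‑axis while reparametrization stops the images from collapsing into the two modes.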
Authors: Jonathan Krook, Axel Janson, Joakim Andén, Melanie Weber, Ozan Öktem
Abstract: We present a geometry‑aware method for heterogeneous single‑particle cryogenic electron microscopy (cryo‑EM) reconstruction that predicts atomic backbone conformations. To incorporate protein‑structure priors, we represent the backbone as a graph and use a graph neural network (GNN) autodecoder that maps per‑image latent variables to 3D displacements of a template conformation. The objective combines a data‑discrepancy term based on a differentiable cryo‑EM forward model with geometric regularization, and it supports unknown orientations via ellipsoidal support lifting (ESL) pose estimation. On synthetic datasets derived from molecular dynamics trajectories, the proposed GNN achieves higher accuracy compared to a multilayer perceptron (MLP) of comparable size, highlighting the benefits of a geometry‑informed inductive bias.
Authors: Yiquan Wang
Abstract: The boundaries of cooperative helix‑coil transitions directly affect protein allostery and conformational dynamics, yet the physical origin of the persistent one‑to‑two‑residue assignment ambiguity at these structural interfaces remains unresolved. We apply the discrete Hasimoto map to translate three‑dimensional protein backbone geometry into a one‑dimensional discrete nonlinear Schrödinger effective potential and analyze its spatial‑frequency fluctuations. Helical segments display near‑integrable, low‑entropy soliton‑like states, while coil regions exhibit broadband conformational noise. Statistical analysis of over 19,000 boundaries across 1,986 proteins reveals a median geometric transition width of only 0.145 residues, providing an independent kinematic counterpart to the high thermodynamic cooperativity of the Zimm‑Bragg model. This sub‑residue spatial narrowness indicates an intrinsic observational constraint governed by the Gabor uncertainty principle, whereby any macroscopic spectral probe tends to blur the microscopic phase boundary, suggesting that the boundary ambiguity in structural biology is not merely algorithmic but reflects a physical resolution limit inherent to the biopolymer lattice.
Authors: Aleena Siji, Amir Mohammad Karimi Mamaghan, Ferdinand Kapl, Tobias Höppe, Emmanouil Angelis, Andrea Dittadi, Maurice Brenner, Michael Heinzinger, Karl Henrik Johansson, Kaitlin Maile, Johannes von Oswald, Stefan Bauer
Abstract: Protein language models (PLMs) have become widely adopted as general‑purpose models, demonstrating strong performance in protein engineering and de novo design. Like large language models (LLMs), they are typically trained as deep transformers with next‑token or masked‑token prediction objectives on massive sequence corpora and are scaled by increasing model depth. Recent work on autoregressive LLMs has identified the Curse of Depth: many later layers contribute little to the final output predictions. These findings naturally raise the question of whether a similar depth inefficiency also appears in PLMs, where many widely used models are not autoregressive, and some are multimodal, accepting both protein sequence and structure as input. In this work, we present a depth analysis of seven popular PLM families across model scales, spanning autoregressive, masked, and diffusion objectives, and quantify how layer contributions evolve with depth using a unified set of probing‑, perturbation‑, and downstream‑evaluation measurements. Across models, we observe consistent depth‑dependent patterns that extend prior findings on LLMs: a large fraction of task‑relevant computation is concentrated in a subset of layers, while the remaining layers mainly provide incremental refinement of the final prediction. These trends persist beyond sequence‑only settings and also appear in multimodal PLMs. Taken together, our results suggest that depth inefficiency is a common feature of modern PLMs, motivating future work on more depth‑efficient architectures and training methods.
Authors: Di Zhang, Zhangpeng Gong, Xiaobo Pang, Jiashuai Liu, Junbo Lu, Hao Cui, Jiusong Ge, Zhi Zeng, Kai Yi, Yinghua Li, Si Liu, Tingsong Yu, Haoran Wang, Mireia Crispin-Ortuzar, Weimiao Yu, Chen Li, Zeyu Gao
Abstract: Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non‑uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross‑modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two‑stage pretraining strategy: (1) a self‑supervised unimodal pretraining stage that learns morphological representations from 34,277 whole‑slide images (WSIs) without segmentation annotations, and (2) a cross‑modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology‑related tasks, using either the ROI feature or the slide‑level feature obtained by aggregating adaptive regions. Based on only one‑tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.
Authors: Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji
Abstract: Modern Protein Language Models (PLMs) apply transformer‑based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer‑based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein‑related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domain. Furthermore, we adapt a simple early‑exit technique (originally used in the natural language domain to improve efficiency at the cost of performance) to achieve both increased accuracy and substantial efficiency gains in protein non‑structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non‑structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.
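A simple way to picture early exit is a per‑layer classifier with a confidence threshold: inference stops at the first intermediate layer whose prediction is confident enough. The sketch below assumes such a max‑probability criterion, which the abstract does not spell out; `early_exit`, `threshold`, and the toy probabilities are hypothetical.

```python
def early_exit(layer_probabilities, threshold=0.9):
    """Return (layer_index, predicted_class) for the first sufficiently confident layer."""
    for layer_index, probs in enumerate(layer_probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            return layer_index, probs.index(confidence)
    # No layer was confident enough: fall back to the final layer.
    last = layer_probabilities[-1]
    return len(layer_probabilities) - 1, last.index(max(last))

# Hypothetical class probabilities from classifiers attached to three layers
per_layer = [[0.55, 0.45], [0.95, 0.05], [0.70, 0.30]]
exit_layer, predicted = early_exit(per_layer, threshold=0.9)
fallback = early_exit(per_layer, threshold=0.99)
```

In a real model the loop would run layer by layer so that later layers are never computed once an exit fires, which is where the efficiency gain comes from.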
Authors: Tucker Allen, Barry Y. Li, Nadine C. Bradbury, Daniel Neuhauser
Abstract: Protein electrostatics tune excitation energies in the Photosystem II reaction center (PSII‑RC), yet a fully quantum‑mechanical many‑body description of how the surrounding protein environment renormalizes excitons has remained computationally inaccessible. The Bethe‑Salpeter equation (BSE) within many‑body perturbation theory accurately describes excitonic physics through an explicit electron‑hole interaction, but is prohibitively expensive for systems containing thousands of valence electrons. Here, we show that for sufficiently large systems the BSE becomes simpler to solve when treated with modern stochastic sampling techniques, as atomistic interactions self‑average. In this regime, the effective electron‑hole interaction mediated by the environment is governed by collective k‑dependent polarization. These insights enable an ab initio study of the PSII‑RC in which all six chlorins forming the hexameric dye core are treated explicitly together with a roughly seven Angstrom local protein environment. We directly compare the low‑lying optical excitations of the isolated chromophore hexamer (1276 valence electrons) and the protein‑dye cluster (3238 valence electrons). For Q_y excitations near 680 nm, inclusion of the protein environment induces polarization‑dependent energy shifts, redistributes spectral weight, and alters exciton delocalization and pigment character. Lateral and transverse asymmetries in the low‑lying excited states are captured at the BSE level of theory. These results establish that we now have the tools for many‑body calculations of biological nanostructures.
Authors: Shaorong Chen, Jingbo Zhou, Jun Xia
Abstract: The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint: a predicted peptide's mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it in post‑processing, leading to numerous implausible predictions that do not adhere to this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor‑guided diffusion model for de novo peptide sequencing that provides explicit peptide‑level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide‑level mass loss guides model optimization, while at inference, regressor‑based guidance from gradient‑based updates in the latent space steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state‑of‑the‑art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the powerful controllability of diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.
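The mass consistency constraint itself is easy to state in code: the sum of residue masses plus one water must match the neutral mass implied by the precursor m/z and charge. The sketch below uses standard monoisotopic masses for unmodified residues; the 10 ppm tolerance and the function names are illustrative assumptions and are not taken from DiffuNovo.

```python
# Monoisotopic residue masses in Da (unmodified amino acids)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def peptide_mass(sequence):
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def precursor_neutral_mass(mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge state."""
    return (mz - PROTON) * charge

def is_mass_consistent(sequence, precursor_mz, charge, tol_ppm=10.0):
    """True if the predicted peptide matches the precursor mass within tol_ppm."""
    observed = precursor_neutral_mass(precursor_mz, charge)
    error_ppm = (peptide_mass(sequence) - observed) / observed * 1e6
    return abs(error_ppm) <= tol_ppm

mass = peptide_mass("PEPTIDE")
ok = is_mass_consistent("PEPTIDE", 400.6872, 2)
bad = is_mass_consistent("PEPTIDA", 400.6872, 2)
```

A post‑processing filter of this kind can only reject inconsistent predictions; the abstract's point is that enforcing the constraint during training and generation avoids producing them in the first place.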
Authors: Ziyi Yang, Zitong Tian, Yinjun Jia, Tianyi Zhang, Jiqing Zheng, Hao Wang, Yubu Su, Juncai He, Lei Liu, Yanyan Lan
Abstract: D‑peptide binders targeting L‑proteins have promising therapeutic potential. Despite rapid advances in machine learning‑based target‑conditioned peptide design, generating D‑peptide binders remains largely unexplored. In this work, we show that by injecting axial features to E(3)‑equivariant (polar) vector features, it is feasible to achieve cross‑chirality generalization from homo‑chiral (L‑L) training data to hetero‑chiral (D‑L) design tasks. By implementing this method within a latent diffusion model, we achieved D‑peptide binder design that not only outperforms existing tools in in silico benchmarks, but also demonstrates efficacy in wet‑lab validation. To our knowledge, our approach represents the first wet‑lab validated generative AI for the de novo design of D‑peptide binders, offering new perspectives on handling chirality in protein design.
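The distinction between polar and axial vectors that the abstract relies on can be checked numerically: under a mirror reflection M, polar vectors map to Mv, whereas the cross product of two polar vectors picks up an extra sign. The sketch below only demonstrates this general fact and is not the paper's feature construction.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def mirror_x(v):
    """Reflection through the yz-plane (determinant -1), which swaps chirality."""
    return (-v[0], v[1], v[2])

a, b = (1, 2, 3), (4, 5, 6)
axial = cross(a, b)                                 # built from two polar vectors
axial_after = cross(mirror_x(a), mirror_x(b))       # recomputed in the mirrored frame
polar_rule = mirror_x(axial)                        # what a polar vector would become
```

Here `axial_after` equals the negative of `polar_rule`, so features derived from cross products distinguish a structure from its mirror image, while features that use only polar vectors and distances do not.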
Authors: Anna Benediktová, Lucie Nedvědová, Michal Procházka, Zdeněk Jansa, Štěpánka Jansová, Christopher D. Woodgate, David Redka, Julie B. Staunton, Ján Minár
Abstract: Improving the performance of metallic implants increasingly relies on the development of multifunctional surface modifications that combine structural stability, bioactivity, and prevention of bacterial colonization. Medium‑entropy alloys (MEAs) represent a promising approach for such coatings, as their chemical complexity allows the formation of structurally stable matrices with tunable properties. In this study, Ti‑Nb‑Zr and Ti‑Nb‑Zr‑Ag thin films were deposited by magnetron sputtering and subjected to annealing at temperatures of up to 1100 °C to evaluate the influence of Ag, added for its antibacterial potential, on structural evolution. The as‑deposited Ag‑free film was fully amorphous, whereas the Ag‑containing film exhibited a predominantly amorphous matrix with finely dispersed crystalline nanoparticles, indicating that Ag promoted early‑stage crystallization. Both films displayed a fine columnar morphology (column diameter ~15 nm) with dome‑like protrusions, a hierarchical surface structure favorable for protein adhesion. Upon annealing, the Ag‑free film recrystallized into a granular, loosely packed morphology, while the Ag‑containing film retained a compact structure, demonstrating the stabilizing role of Ag. These findings underscore the potential of Ag‑containing amorphous MEAs for forming multifunctional coatings with enhanced thermal stability, antibacterial functionality, and biointerface‑relevant surface features for advanced biomedical applications.
Authors: Alexander Zhilkin, Muralika Medaparambath, Dan Mendels
Abstract: While recent advances in AI have transformed protein structure prediction, protein function is also strongly influenced by the thermodynamic and kinetic features encoded in its underlying free‑energy surface. Here, we propose a data‑efficient framework for engineering protein conformational kinetics by rationally reshaping free‑energy landscapes to control transition rates. Built on the Collective Variables for Free Energy Surface Tailoring (CV‑FEST) framework, the approach is validated on point mutations of the miniprotein Chignolin. The framework relies on Harmonic Linear Discriminant Analysis (HLDA)‑based collective variables (CVs) constructed from short molecular dynamics trajectories confined to metastable folded and unfolded basins, requiring only limited local sampling rather than exhaustive rare‑event simulations. Notably, the HLDA CV derived solely from the wild‑type system provides residue‑level scores that predict whether mutations at specific positions are likely to accelerate or slow unfolding transitions. Furthermore, the leading HLDA eigenvalue associated with the derived CV, a quantitative measure of the one‑dimensional statistical separation between folded and unfolded ensembles, is significantly correlated with transition rates across mutations. Together, these results suggest that mutation‑dependent kinetic effects can be inferred from minimal in‑basin sampling, providing a practical route for guiding peptide and protein engineering through collective‑variable design, free‑energy surface engineering, and data‑efficient molecular simulation.
Authors: Qianfeng Yu, Ningkang Peng, Yanhui Gu
Abstract: Understanding the conformational evolution of β‑amyloid (Aβ), particularly the Aβ_42 isoform, is fundamental to elucidating the pathogenic mechanisms underlying Alzheimer's disease. However, existing end‑to‑end deep learning models often struggle to capture subtle state transitions in protein trajectories due to a lack of explicit physical constraints. In this work, we introduce PIS, a Physics‑Informed System designed for robust metastable state partitioning. By integrating pre‑computed physical priors, such as the radius of gyration and solvent‑accessible surface area, into the extraction of topological features, our model achieves superior performance on the Aβ_42 dataset. Furthermore, PIS provides an interactive platform that features dynamic monitoring of physical characteristics and multi‑dimensional result validation. This system offers biological researchers a powerful set of analytical tools with physically grounded interpretability. A demonstration video of PIS is available at https://youtu.be/AJHGzUtRCg0.
Authors: Ilyes Batatia, William J. Baldwin, Domantas Kuryla, Joseph Hart, Elliott Kasoar, Alin M. Elena, Harry Moore, Mikołaj J. Gawkowski, Benjamin X. Shi, Venkat Kapil, Panagiotis Kourtis, Ioan-Bogdan Magdău, Gábor Csányi
Abstract: Accurate modelling of electrostatic interactions and charge transfer is fundamental to computational chemistry, yet most machine learning interatomic potentials (MLIPs) rely on local atomic descriptors that cannot capture long‑range electrostatic effects. We present a new electrostatic foundation model for molecular chemistry that extends the MACE architecture with explicit treatment of long‑range interactions and electrostatic induction. Our approach combines local many‑body geometric features with a non‑self‑consistent field formalism that updates learnable charge and spin densities through polarisable iterations to model induction, followed by global charge equilibration via learnable Fukui functions to control total charge and total spin. This design enables an accurate and physical description of systems with varying charge and spin states while maintaining computational efficiency. Trained on the OMol25 dataset of 100 million hybrid DFT calculations, our models achieve chemical accuracy across diverse benchmarks, with accuracy competitive with hybrid DFT on thermochemistry, reaction barriers, conformational energies, and transition metal complexes. Notably, we demonstrate that the inclusion of long‑range electrostatics leads to a large improvement in the description of non‑covalent interactions and supramolecular complexes over non‑electrostatic models, including sub‑kcal/mol prediction of molecular crystal formation energy in the X23‑DMC dataset and a fourfold improvement over short‑ranged models on protein‑ligand interactions. The model's ability to handle variable charge and spin states, respond to external fields, provide interpretable spin‑resolved charge densities, and maintain accuracy from small molecules to protein‑ligand complexes positions it as a versatile tool for computational molecular chemistry and drug discovery.
Authors: Bhaskar DasGupta, Katie Kruzan
Abstract: In recent years, extensions of manifold Ricci curvature to discrete combinatorial objects such as graphs and hypergraphs (popularly called "network shapes") have found a plethora of applications in a wide spectrum of research areas, ranging from metabolic systems, transcriptional regulatory networks, protein‑protein interaction networks, social networks, and brain networks to deep learning models. In contrast, they have received comparatively little attention from researchers in the algorithms and computational complexity community. As an attempt to bring these network Ricci‑curvature related problems under the lens of computational complexity and foster further inter‑disciplinary interactions, we provide a formal framework for studying algorithmic and computational complexity issues for detecting critical edges in an undirected graph using Ollivier‑Ricci curvatures and provide several algorithmic and inapproximability results for problems in this framework. Our results show some interesting connections between our problems, the exact perfect matching and perfect matching blocker problems for bipartite graphs, and two well‑known combinatorial packing/covering problems.
Authors: Qi Wen, Xiang Lian, Nan Zhang, Yutong Ye, Mingsong Chen
Abstract: Subgraph similarity search over large‑scale graphs is a fundamental task that retrieves subgraphs similar to a given query graph from a data graph, and it plays a crucial role in real applications such as protein discovery, social network analysis, and recommendation systems. While prior works on subgraph similarity search studied various graph similarity metrics, in this paper, we propose a novel graph similarity semantics, generalized neighbor difference (GND), that accounts for both the keyword‑set relationships between vertices and edge‑weight differences. We formulate the problem of subgraph similarity search under the generalized neighbor difference semantics (S^3GND), which retrieves those subgraphs similar to a query graph q under GND semantics. To efficiently tackle the S^3GND problem, we propose an effective learning‑based approach, which constructs a keyword hypergraph from the data graph, and trains a hypergraph neural network (HGNN) model to obtain high‑quality keyword embedding representations. We design effective pruning strategies (keyword embedding MBR, vertex‑level ND lower bound, and graph‑level GND lower bound pruning) to rule out false alarms of candidate vertices/subgraphs, and devise a tree‑based indexing mechanism to facilitate efficient S^3GND query answering. We develop an efficient S^3GND query‑processing algorithm that traverses the index, applies pruning strategies, and returns actual S^3GND answers. Finally, we conduct extensive experiments to verify the effectiveness and efficiency of our proposed S^3GND approach over both real and synthetic graphs.
Authors: Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari
Abstract: Adeno‑associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. The complexity increases further for the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi‑trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine‑tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM‑2‑based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validations revealed that the majority of the generated variants have superior performance across all three employed indices, indicating successful multi‑objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid folding despite sequence diversification. AAVGen establishes a foundation for data‑driven viral vector engineering, accelerating the development of next‑generation AAV vectors with tailored functional characteristics.
Authors: Victor Bezchastnov, Tatiana Domratcheva
Abstract: Sensing of the geomagnetic field direction by many living organisms is commonly thought to involve radical pairs, such as those formed photochemically between the flavin and tryptophan radicals in the cryptochrome proteins. Previous theoretical studies have shown that strongly axial hyperfine couplings in the cryptochrome radicals greatly enhance the formation of a signaling state of the protein when the magnetic field is directed perpendicular to the hyperfine axis of either of the radicals. However, further analysis led to the conclusion that sharpness of detecting those magnetic directions is strongly suppressed by the inter‑radical electron spin coupling. Here, we perform theoretical simulations of the compass function for a set of arrangements of the intra‑ and inter‑radical spin couplings in the idealized cryptochrome radical pair, and find certain arrangements that preserve the sharpness in detecting the direction of the geomagnetic field. One particular arrangement, with the hyperfine axes of the radicals orthogonal to the symmetry axis of inter‑radical coupling, provides even sharper field‑direction sensitivity than that contributed solely by the anisotropy of the hyperfine coupling.
Authors: Zhangfan Yang, Baoyun Chen, Dong Xu, Jia Wang, Ruibin Bai, Junkai Ji, Zexuan Zhu
Abstract: Protein‑ligand scoring is a central component of structure‑based drug design, underpinning molecular docking, virtual screening, and pose optimization. Conventional physics‑based energy functions are often computationally expensive, limiting their utility in large‑scale screening. In contrast, deep learning‑based scoring models offer improved computational efficiency but frequently suffer from limited cross‑target generalization and poor interpretability, which restrict their practical applicability. Here we present BioLM‑Score, a simple yet generalizable protein‑ligand scoring model that couples geometric modeling with representation learning. Specifically, it employs modality‑specific and structure‑aware encoders for proteins and ligands, each augmented with biomolecular language models to enrich structural and chemical representations. Subsequently, these representations are integrated through a mixture density network to predict multimodal interatomic distance distributions, from which statistically grounded likelihood‑based scores are derived. Evaluations on the CASF‑2016 benchmark demonstrate that BioLM‑Score achieves significant improvements across docking, scoring, ranking, and screening tasks. Moreover, the proposed scoring function serves as an effective optimization objective for guiding docking protocols and conformational search. In summary, BioLM‑Score provides a principled and practical alternative to existing scoring functions, combining efficiency, generalization, and interpretability for structure‑based drug discovery.
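A likelihood‑based score derived from a mixture density network can be sketched as follows: each protein‑ligand atom pair gets a Gaussian mixture over distances, and a pose is scored by the summed log‑likelihood of its observed distances. The mixture parameters below are made‑up placeholders standing in for network outputs, and the function names are hypothetical rather than BioLM‑Score's API.

```python
import math

def mixture_log_likelihood(distance, weights, means, sigmas):
    """Log-density of a distance under a 1-D Gaussian mixture (log-sum-exp for stability)."""
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((distance - m) / s) ** 2
        for w, m, s in zip(weights, means, sigmas)
    ]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def pose_score(pair_distances, pair_mixtures):
    """Sum of per-pair log-likelihoods; higher means a more plausible pose."""
    return sum(
        mixture_log_likelihood(d, *mix) for d, mix in zip(pair_distances, pair_mixtures)
    )

# Placeholder mixtures (weights, means, sigmas) for two atom pairs, in angstroms
mixtures = [
    ([0.7, 0.3], [3.0, 5.0], [0.3, 0.8]),
    ([0.5, 0.5], [4.0, 6.5], [0.4, 0.6]),
]
native_like = pose_score([3.0, 4.1], mixtures)  # distances near mixture modes
decoy = pose_score([8.0, 9.5], mixtures)        # distances far from any mode
unit_check = mixture_log_likelihood(0.0, [1.0], [0.0], [1.0])
```

Because the score is a log‑likelihood, it is differentiable in the distances, which is what allows such a function to serve as an optimization objective during docking.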
Authors: Abhishek Tiwari
Abstract: Hydrogen bonds and other non‑covalent interactions play a crucial role in maintaining the structural integrity and functionality of biological macromolecules such as proteins and nucleic acids. Accurate identification and analysis of these interactions are essential for understanding molecular recognition, protein folding, and drug design. HBAT (Hydrogen Bond Analysis Tool) is software for analysing hydrogen bonds and other weak interactions in macromolecular structures. This paper presents HBAT 2, an updated Python reimplementation of the original HBAT tool published in 2007. HBAT 2 is a Python package for automated analysis of hydrogen bonds and other non‑covalent interactions in macromolecular structures available in Protein Data Bank (PDB) file format. The software identifies and analyses traditional hydrogen bonds, weak hydrogen bonds, halogen bonds, X‑H···π, π‑π stacking, and n→π interactions using geometric criteria. It also detects cooperativity and anticooperativity chains and renders them as 2D visualisations. The latest version offers an improved cross‑platform tkinter‑based graphical user interface (GUI), a web‑based interface, a simple command‑line interface (CLI), and a developer‑friendly API, making it accessible to users with different computational backgrounds.
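Geometric hydrogen‑bond detection typically combines distance cutoffs with a donor‑hydrogen‑acceptor angle cutoff. The sketch below illustrates the idea with commonly quoted thresholds (H···A at most 2.5 Å, D···A at most 3.5 Å, angle at least 120°); these values and the function names are assumptions for illustration and are not HBAT 2's defaults or API.

```python
import math

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def angle_deg(a, vertex, c):
    """Angle a-vertex-c in degrees."""
    v1 = [p - q for p, q in zip(a, vertex)]
    v2 = [p - q for p, q in zip(c, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos = dot / (distance(a, vertex) * distance(c, vertex))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def is_hydrogen_bond(donor, hydrogen, acceptor,
                     max_h_a=2.5, max_d_a=3.5, min_angle=120.0):
    """Apply distance and angle criteria to one donor-H...acceptor triple."""
    return (
        distance(hydrogen, acceptor) <= max_h_a
        and distance(donor, acceptor) <= max_d_a
        and angle_deg(donor, hydrogen, acceptor) >= min_angle
    )

donor, hydrogen = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
linear = is_hydrogen_bond(donor, hydrogen, (2.9, 0.0, 0.0))   # 180 degrees, 1.9 A
bent = is_hydrogen_bond(donor, hydrogen, (1.0, 1.9, 0.0))     # 90 degrees
far = is_hydrogen_bond(donor, hydrogen, (5.0, 0.0, 0.0))      # 4.0 A from H
```

Weak hydrogen bonds, halogen bonds, and π interactions are usually handled with the same pattern but different atom selections and looser or tighter cutoffs.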
Authors: Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang
Abstract: All‑atom molecular simulation serves as a quintessential "computational microscope" for understanding the machinery of life, yet it remains fundamentally limited by the trade‑off between quantum‑mechanical (QM) accuracy and biological scale. We present UBio‑MolFM, a universal foundation model framework specifically engineered to bridge this gap. UBio‑MolFM introduces three synergistic innovations: (1) UBio‑Mol26, a large bio‑specific dataset constructed via a multi‑fidelity "Two‑Pronged Strategy" that combines systematic bottom‑up enumeration with top‑down sampling of native protein environments (up to 1,200 atoms); (2) E2Former‑V2, a linear‑scaling equivariant transformer that integrates Equivariant Axis‑Aligned Sparsification (EAAS) and Long‑Short Range (LSR) modeling to capture non‑local physics with up to ~4x higher inference throughput in our large‑system benchmarks; and (3) a Three‑Stage Curriculum Learning protocol that transitions from energy initialization to energy‑force consistency, with force‑focused supervision to mitigate energy offsets. Rigorous benchmarking across microscopic forces and macroscopic observables (including liquid water structure, ionic solvation, and peptide folding) demonstrates that UBio‑MolFM achieves ab initio‑level fidelity on large, out‑of‑distribution biomolecular systems (up to ~1,500 atoms) and realistic MD observables. By reconciling scalability with quantum precision, UBio‑MolFM provides a robust, ready‑to‑use tool for the next generation of computational biology.
Authors: Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, Xuhong Wang
Abstract: Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general‑purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain‑specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain‑adaptive continual pretraining framework for protein understanding. This framework employs Domain‑Incremental Continual Pre‑training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross‑modal alignment is achieved via a PLM‑Projector‑LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Ultimately, an end‑to‑end optimization is adopted to uniformly support various tasks, including protein property prediction and knowledge question‑answering. Our proposed BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB. It also achieves results on par with LLMs on general understanding tasks like MMLU and RACE. This showcases its innovative advantage of combining domain‑specific adaptability with general‑purpose language competency.
Authors: Ihor Kendiukhov
Abstract: We present a systematic evaluation framework ‑ thirty‑seven analyses, 153 statistical tests, four cell types, two perturbation modalities ‑ for assessing mechanistic interpretability in single‑cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer‑specific organisation ‑ protein‑protein interactions in early layers, transcriptional regulation in late layers ‑ but this structure provides no incremental value for perturbation prediction: trivial gene‑level baselines outperform both attention and correlation edges (AUROC 0.81‑0.88 versus 0.70), pairwise edge scores add zero predictive contribution, and causal ablation of regulatory heads produces no degradation. These findings generalise from K562 to RPE1 cells; the attention‑correlation relationship is context‑dependent, but gene‑level dominance is universal. Cell‑State Stratified Interpretability (CSSI) addresses an attention‑specific scaling failure, improving GRN recovery up to 1.85x. The framework establishes reusable quality‑control standards for the field.
Authors: Djordje Mihajlovic, Davide Michieletto
Abstract: Classifying the topology of closed curves is a central problem in low‑dimensional topology, with applications beyond mathematics spanning protein folding, polymer physics and even magnetohydrodynamics. The central problem is how to determine whether two embeddings of a closed arc are equivalent under ambient isotopy. Given the striking ability of neural networks to solve complex classification tasks, it is therefore natural to ask if the knot classification problem can be tackled using Machine Learning (ML). In this paper, we investigate generic shortcut methods employed by ML to solve the knot classification challenge and specifically discover hidden non‑topological features in training data generated through Molecular Dynamics simulations of polygonal knots that are used by ML to arrive at positive classification results. We then provide a rigorous foundation for future attempts to tackle the knot classification challenge using ML by developing a publicly‑available (i) dataset that aims to remove the potential of non‑topological feature classification and (ii) code that can generate knot embeddings that faithfully explore a chosen geometric state space with fixed knot topology. We expect that our work will accelerate the development of ML models that can solve complex geometric knot classification challenges.
Authors: Yu Xie, Ludwig Winkler, Lixin Sun, Sarah Lewis, Adam E. Foster, José Jiménez Luna, Tim Hempel, Michael Gastegger, Yaoyi Chen, Iryna Zaporozhets, Cecilia Clementi, Christopher M. Bishop, Frank Noé
Abstract: The rare‑event sampling problem has long been the central limiting factor in molecular dynamics (MD), especially in biomolecular simulation. Recently, diffusion models such as BioEmu have emerged as powerful equilibrium samplers that generate independent samples from complex molecular distributions, eliminating the cost of sampling rare transition events. However, a sampling problem remains when computing observables that rely on states which are rare in equilibrium, for example folding free energies. Here, we introduce enhanced diffusion sampling, enabling efficient exploration of rare‑event regions while preserving unbiased thermodynamic estimators. The key idea is to perform quantitatively accurate steering protocols to generate biased ensembles and subsequently recover equilibrium statistics via exact reweighting. We instantiate our framework in three algorithms: UmbrellaDiff (umbrella sampling with diffusion models), ΔG‑Diff (free‑energy differences via tilted ensembles), and MetaDiff (a batchwise analogue for metadynamics). Across toy systems, protein folding landscapes and folding free energies, our methods achieve fast, accurate, and scalable estimation of equilibrium properties within GPU‑minutes to hours per system ‑‑ closing the rare‑event sampling gap that remained after the advent of diffusion‑model equilibrium samplers.
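The "bias, then reweight" idea can be sketched with standard importance reweighting. The sketch assumes samples were drawn from an ensemble proportional to p(x)·exp(−β·U_bias(x)), so each sample is reweighted by exp(+β·U_bias(x)). The function name is hypothetical, and the abstract does not specify the estimators actually used in UmbrellaDiff, ΔG‑Diff, or MetaDiff.

```python
import math

def reweighted_mean(observables, bias_energies, beta=1.0):
    """Self-normalised estimate of an unbiased ensemble average.

    observables[i]   : value of the observable on biased sample i
    bias_energies[i] : bias potential U_bias evaluated on sample i
    beta             : inverse temperature (same units as 1 / U_bias)
    """
    log_w = [beta * u for u in bias_energies]
    shift = max(log_w)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, observables)) / total
```

With zero bias the estimate reduces to the plain sample mean; samples from strongly penalised regions receive proportionally larger weights.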
Authors: Yiquan Wang
Abstract: The representation of protein backbone geometry through the discrete nonlinear Schrödinger equation provides a theoretical connection between biological structure and integrable systems. Although the global application of this framework is constrained by chiral degeneracies and non‑local interactions, helical peptides can be modeled as piecewise integrable systems where the discrete Hasimoto map remains applicable within specific geometric boundaries. We delineate these boundaries through an analytic mapping (ϕ,ψ) → (κ,τ) between biochemical dihedral angles and Frenet frame parameters for 50 helical peptide chains. This transformation is globally information‑preserving but ill‑conditioned within the helical basin (median Jacobian condition number 31), suggesting chiral information loss arises primarily from local coordinate compression rather than topological singularities. Using a local integrability error E[n] derived from the discrete dispersion relation, we show deviations from integrability are driven predominantly by torsion non‑uniformity, while curvature remains rigid. This metric identifies integrable islands where the analytic dispersion relation predicts backbone coordinates with sub‑angstrom accuracy (median RMSD 0.77 Å), enabling a segmentation strategy that isolates structural defects and trims non‑integrable terminal fraying. Evaluating only these integrable islands, the dispersion relation extracts high‑accuracy structural cores for 88% of the dataset. Inverse backbone design is feasible within a defined integrability zone where the design constraint reduces essentially to controlling torsion uniformity. These findings advance the Hasimoto formalism from a qualitative descriptor toward a precise quantitative framework for analyzing and designing local protein geometry within the limits of piecewise integrability.
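For readers unfamiliar with the Frenet parameters (κ, τ), here is a minimal sketch that computes discrete bond angles and signed torsion angles directly from a chain of 3D points such as a Cα trace. This is a coordinate-based route, not the analytic (ϕ,ψ) → (κ,τ) mapping described in the abstract, and the function names and index conventions are assumptions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    n = math.sqrt(_dot(v, v))  # zero for collinear triplets: not handled here
    return (v[0] / n, v[1] / n, v[2] / n)

def discrete_frenet(points):
    """Return (kappa, tau) in radians for a 3D polyline.

    kappa[i] is the angle between consecutive unit tangents;
    tau[i] is the signed angle between consecutive unit binormals.
    """
    t = [_unit(_sub(points[i + 1], points[i])) for i in range(len(points) - 1)]
    kappa = [math.acos(max(-1.0, min(1.0, _dot(t[i], t[i + 1]))))
             for i in range(len(t) - 1)]
    b = [_unit(_cross(t[i], t[i + 1])) for i in range(len(t) - 1)]
    tau = []
    for i in range(len(b) - 1):
        cos_tau = _dot(b[i], b[i + 1])
        sin_tau = _dot(_cross(b[i], b[i + 1]), t[i + 1])
        tau.append(math.atan2(sin_tau, cos_tau))
    return kappa, tau
```

On a uniformly sampled ideal helix both lists are constant, which matches the abstract's picture of helices as regions where torsion uniformity holds.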
Authors: Zakaria Shams Siam, Xuefeng Liu, Chong Liu
Abstract: In this paper, we formulate the new multi‑objective coverage (MOC) problem where our goal is to identify a small set of representative samples whose predicted outcomes broadly cover the feasible multi‑objective space. This problem is of great importance in many critical real‑world applications, e.g., drug discovery and materials design, as this representative set can be evaluated much faster than the whole feasible set, thus significantly accelerating the scientific discovery process. Existing works cannot be directly applied as they either focus on sample space coverage or multi‑objective optimization that targets the Pareto front. However, chemically diverse samples often yield identical objective profiles, and safety constraints are usually defined on the objectives. To solve this MOC problem, we propose a novel search algorithm, MOC‑CAS, which employs an upper confidence bound‑based acquisition function to select optimistic samples guided by Gaussian process posterior predictions. To enable efficient optimization, we develop a smoothed relaxation of the hard feasibility test and derive an approximate optimizer. Compared to competitive baselines, we show that our MOC‑CAS empirically achieves superior performance across large‑scale protein‑target datasets for SARS‑CoV‑2 and cancer, each assessed on five objectives derived from SMILES‑based features.
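The upper-confidence-bound step can be sketched generically: each candidate is scored by its posterior mean plus a multiple of its posterior standard deviation, and the highest-scoring candidates are selected. This is the textbook UCB rule for a single objective, assuming the Gaussian process posterior has already been computed; the function name and `beta` parameter are hypothetical, and the coverage-aware acquisition and smoothed feasibility relaxation of MOC‑CAS are not reproduced here.

```python
def ucb_select(means, stds, beta=2.0, k=1):
    """Return indices of the k candidates with the largest mean + beta * std.

    means, stds : posterior mean and standard deviation per candidate
    beta        : exploration weight (beta = 0 gives pure exploitation)
    """
    scores = [m + beta * s for m, s in zip(means, stds)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Raising `beta` shifts selection toward uncertain candidates, which is what makes the chosen samples "optimistic".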
Authors: Alicja Maksymiuk, Alexandre Duplessis, Michael Bronstein, Alexander Tong, Fernanda Duarte, İsmail İlkan Ceylan
Abstract: Macrocycles are ring‑shaped molecules that offer a promising alternative to small‑molecule drugs due to their enhanced selectivity and binding affinity against difficult targets. Despite their chemical value, they remain underexplored in generative modeling, likely owing to their scarcity in public datasets and the challenges of enforcing topological constraints in standard deep generative models. We introduce MacroGuide: Topological Guidance for Macrocycle Generation, a diffusion guidance mechanism that uses Persistent Homology to steer the sampling of pretrained molecular generative models toward the generation of macrocycles, in both unconditional and conditional (protein pocket) settings. At each denoising step, MacroGuide constructs a Vietoris‑Rips complex from atomic positions and promotes ring formation by optimizing persistent homology features. Empirically, applying MacroGuide to pretrained diffusion models increases macrocycle generation rates from 1% to 99%, while matching or exceeding state‑of‑the‑art performance on key quality metrics such as chemical validity, diversity, and PoseBusters checks.
Authors: Ana F. Rodrigues, Lucas Ferraz, Laura Balbi, Pedro Giesteira Cotovio, Catia Pesquita
Abstract: Effective representations of protein sequences are widely recognized as a cornerstone of machine learning‑based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence‑level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno‑associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine‑tuning, amino acid‑level embeddings outperform sequence‑level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine‑tuned with task‑specific labels, with sequence‑level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine‑tuning in datasets characterized by sparse or highly localized mutations.
Authors: Chufeng Li, Margarita Zakharova, Mauro Prasciolu, Jia Chyi Wong, Holger Fleckenstein, Nikolay Ivanov, Wenhui Zhang, Mansi Butola, J. Lukas Dresselhaus, Ivan De Gennaro Aquino, Dmitry Egorov, Philipp Middendorf, Alessa Henkel, Bjarne Klopprogge, Lars Klemeyer, Tobias Beck, Oleksandr Yefanov, Miriam Barthelmess, Janina Sprenger, Dominik Oberthuer, Saša Bajt, Henry N. Chapman
Abstract: Molecular and polymeric crystals show a wide range of functional properties that arise from the interplay between the atomic‑scale structure of their constituent molecules and the organization of these molecules within the crystal lattice at macroscopic length scales. X‑ray diffraction can provide structural information at these disparate length scales, but usually only through experiments that address one or the other of molecular (or unit‑cell) structure versus crystal structure. Consequently, the accuracy of determined molecular or polymer structures may be limited by unaccounted inhomogeneities of the crystal lattice, and the characterization of crystalline materials might not reveal the underlying causes of crystal morphology. Here we introduce X‑ray convergent‑beam diffraction to obtain spatially‑resolved structural information from crystals by projection topographic imaging. Using highly focusing X‑ray multilayer Laue lenses, we show that Bragg reflections can be mapped into tomographic images of the crystal, for the characterization of strain and defects at high resolution. We demonstrate how the crystal morphology obtained this way can be accounted for when determining structure factors as a function of position in the crystal. The approach may assist in studies such as diffusion and binding in MOFs, protein‑drug binding, crystal growth, and the mechanical responses of photo‑reactive or thermally driven dynamic crystals.
Authors: Hyosoon Jang, Hyunjin Seo, Honghui Kim, Seonghyun Park, Taewon Kim, Yunhui Jang, Sungsoo Ahn
Abstract: Small‑molecule foundation models are typically pretrained on standalone molecular data, unlike vision and language models that often benefit from cross‑modal or relational supervision. Protein‑ligand co‑folding provides a molecular analogue of such supervision by exposing models to atom‑level ligand‑protein interactions, raising the question of whether co‑folding models can yield strong small‑molecule representations. We study this question using Boltz2, a modern co‑folding model, by transferring its atom‑level ligand representations to standalone small‑molecule tasks. Through systematic probing and distillation, we show that Boltz2 representations match or outperform existing models on the ADMET benchmark, accelerate molecular generative modeling, and improve sample efficiency in structure‑guided ligand optimization. We further find that Boltz2 representations are complementary to those learned from conventional standalone molecular supervision, including 3D conformers, bioassay labels, and quantum‑chemical properties. Finally, we extend representation alignment to reinforcement learning, showing that dense representation‑level supervision can complement scalar rewards in molecular discovery. These results identify protein‑ligand co‑folding as a promising pretraining paradigm for small‑molecule representation learning and position Boltz2 as a strong, off‑the‑shelf molecular foundation model.
Authors: Yiquan Wang
Abstract: Determining the three‑dimensional structure of a protein from its amino‑acid sequence remains a fundamental problem in biophysics. The discrete Frenet geometry of the $C_\alpha$ backbone can be mapped, via a Hasimoto‑type transform, onto a complex scalar field $\psi = \kappa\, e^{i\sum\tau}$ satisfying a discrete nonlinear Schrödinger equation (DNLS), whose soliton solutions reproduce observed secondary‑structure motifs. Whether this mapping, which provides an elegant geometric description of folded states, can be extended to a predictive framework for protein folding remains an open question. We derive an exact closed‑form decomposition of the DNLS effective potential $V_{\text{eff}} = V_{\text{re}} + iV_{\text{im}}$ in terms of curvature ratios and torsion angles, validating the result to machine precision across 856 non‑redundant proteins. Our analysis identifies three structural barriers to forward prediction: (i) $V_{\text{im}}$ encodes chirality via the odd symmetry of $\sin\tau$, accounting for ~31% of the total information and implying a $2^N$ degeneracy if neglected; (ii) $V_{\text{re}}$ is determined primarily (~95%) by local geometry, rendering it effectively sequence‑agnostic; and (iii) self‑consistent field iterations fail to recover native structures (mean RMSD = 13.1 Å) even with hydrogen‑bond terms, yielding torsion correlations indistinguishable from zero. Constructively, we demonstrate that the residual of the DNLS dispersion relation serves as a geometric order parameter for α‑helices (ROC AUC = 0.72), defining them as regions of maximal integrability. These findings establish that the Hasimoto map functions as a kinematic identity rather than a dynamical governing equation, presenting fundamental obstacles to its use as a predictive framework for protein folding.
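The Hasimoto-type field in this abstract, a curvature modulus multiplied by a phase that accumulates torsion, can be written as a short sketch. It assumes per-site curvature and torsion lists are already available and that the phase at site n is the running sum of torsion up to and including n; that index convention and the function name are assumptions, since the abstract does not fix them.

```python
import cmath

def hasimoto_field(kappa, tau):
    """Map per-site curvature and torsion to the complex field psi.

    psi[n] = kappa[n] * exp(i * (tau[0] + ... + tau[n]))
    The inclusive running sum is an assumed convention.
    """
    psi = []
    phase = 0.0
    for k, t in zip(kappa, tau):
        phase += t
        psi.append(k * cmath.exp(1j * phase))
    return psi
```

By construction `abs(psi[n])` equals `kappa[n]`, so curvature is carried by the modulus and torsion only by the phase; flipping the sign of every torsion angle conjugates the field, which is one way to see the chirality dependence mentioned in point (i).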
Authors: Darin Tsui, Kunal Talreja, Daniel Saeedi, Amirali Aghazadeh
Abstract: Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross‑layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross‑layer transcoders that learn sparse latent representations jointly across layers to capture the model's full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82‑89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high‑fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.
Authors: Panagiotis Antoniadis, Beatrice Pavesi, Simon Olsson, Ole Winther
Abstract: Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over high‑dimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse‑grained TITO models are substantially more data‑efficient than Boltzmann Emulators, and that incorporating protein language model (pLM) embeddings further improves out‑of‑distribution generalization. Our approach, PLaTITO, achieves state‑of‑the‑art performance on equilibrium sampling benchmarks for out‑of‑distribution protein systems, including fast‑folding proteins. We further study the impact of additional conditioning signals ‑‑ such as structural embeddings, temperature, and large‑language‑model‑derived embeddings ‑‑ on model performance.
Authors: Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu
Abstract: Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed‑vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide‑level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state‑of‑the‑art tokenizer‑free autoregressive model that segments and models genomic sequences end‑to‑end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling >3× inference speedup over Transformers. On zero‑shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next‑generation genomic modeling.
Authors: Srinivas Anumasa, Barath Chandran, Tingting Chen, Nuwaisir Mohammad Rahman, Yingtao Zhu, Rushi Shah, Hongyu He, Peisong Zhang, Yizhen Liao, Yiming Tang, Yong Shen, Tianfan Fu, Rui Qing, Xiao Li, Sebastian Maurer-Stroh, Xinyi Su, Zhizhuo Zhang, Dianbo Liu
Abstract: The evolutionary fitness landscape of biological molecules is extremely sparse and heterogeneous, with functional sequences forming isolated dense "islands" within a vast combinatorial space of largely non‑functional variants. Protein sequences, in particular, exemplify this structure, yet most generative artificial intelligence models implicitly assume a homogeneous data distribution. We show that this assumption fundamentally breaks down in heterogeneous biological sequence spaces: fixed global noise levels impose a destructive trade‑off, either oversmoothing dense functional clusters or fragmenting sparse regions and producing non‑functional hallucinations. To address this limitation, we introduce Density‑Dependent Smoothing (DDS), a geometry‑aware generative framework that adapts stochastic smoothing to the local density of the underlying sequence landscape. By inversely coupling diffusion noise to estimated sequence density, DDS enables gentle refinement in high‑density functional regions while promoting controlled exploration across sparse regions. Implemented as a plug‑in mechanism for discrete molecular sampling, DDS consistently outperforms state‑of‑the‑art diffusion and autoregressive models across antibody repertoires, therapeutic antibody design, antimicrobial peptide generation and coronavirus antibody design. Together, these results show that fixed global smoothing assumptions fundamentally limit generative modeling in sparse biological sequence spaces, and that geometry‑aware smoothing removes this constraint, enabling reliable exploration and design previously unattainable with fixed‑noise generative models.
Authors: Edward Wijaya
Abstract: Agentic systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. We ask how well they generalize. Evaluating six frameworks against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource‑constrained settings, we find five capability gaps: no support for protein language models or peptide‑specific prediction, no bridges between in vivo and in silico data, reliance on LLM inference with no pathway to ML training or reinforcement learning, assumptions tied to large‑pharma resources, and single‑objective optimization that ignores safety‑efficacy‑stability trade‑offs. A paired knowledge‑probing experiment suggests the bottleneck is architectural rather than epistemic: four frontier LLMs reason about peptides at levels comparable to small molecules, yet no framework exposes this capability. We propose design requirements and a capability matrix for next‑generation frameworks that function as computational partners under realistic constraints.
Authors: Leonardo Di Bari, Thierry Mora, Andrea Pagnani, Aleksandra M. Walczak, Francesco Zamponi, Saverio Rossi
Abstract: Generative models derived from large protein sequence alignments define complex fitness landscapes, but their utility for accurately modeling non‑equilibrium evolutionary dynamics remains unclear. In this work, we perform a rigorous comparative analysis of three simulation schemes, designed to mimic evolution in silico by local sampling of the probability distribution defined by a generative model. We compare standard independent Markov Chain Monte Carlo, Monte Carlo on a phylogenetic tree, and a population genetics dynamics, benchmarking their outputs against deep sequencing data from four distinct in vitro evolution experiments. We find that standard Monte Carlo fails to reproduce the correct phylogenetic structure and generates unrealistic, gradual mutational sweeps. Performing Monte Carlo on a tree inferred from data improves phylogenetic fidelity and historical accuracy. The population genetics scheme successfully captures phylogenetic correlations, mutational abundances, and selective sweeps as emergent properties, without the need to infer additional information from data. However, the latter choice comes at the price of not sampling the proper generative model distribution at long times. Our findings highlight the crucial role of phylogenetic correlations and finite‑population effects in shaping evolutionary trajectories on fitness landscapes. These models therefore provide powerful tools for predicting complex adaptive paths and for reliably extrapolating evolutionary dynamics beyond current experimental limitations.
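"Local sampling" of a sequence model can be illustrated with a single-site Metropolis move, which corresponds most closely to the standard Monte Carlo baseline in this abstract. The sketch assumes the generative model is exposed as an energy function with probability proportional to exp(−β·E); the function names and the toy energy in the usage note are hypothetical, and the tree-based and population-genetics schemes are not shown.

```python
import math
import random

def metropolis_step(seq, energy, alphabet, beta, rng):
    """Propose one single-site mutation and accept it with the Metropolis rule.

    seq      : current sequence (str)
    energy   : callable mapping a sequence to a float (lower = fitter)
    alphabet : string of allowed symbols
    beta     : inverse temperature / selection strength
    rng      : random.Random instance, passed in for reproducibility
    """
    i = rng.randrange(len(seq))
    proposal = seq[:i] + rng.choice(alphabet) + seq[i + 1:]
    delta = energy(proposal) - energy(seq)
    # Downhill moves are always accepted; uphill moves with prob. exp(-beta * delta).
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return proposal
    return seq
```

As a toy usage, with `energy = lambda s: sum(c != "A" for c in s)` and a very large `beta`, repeated calls can only lower the energy, so the chain drifts toward the all-"A" sequence one mutation at a time.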
Authors: TrungKhang Tran, TrungTin Nguyen, Md Abul Bashar, Nhat Ho, Richi Nayak, Christopher Drovandi
Abstract: Mixture‑of‑Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial‑logistic gating, rigorous guarantees for stable maximum‑likelihood training and principled model selection remain limited. We address both issues in the full‑data (batch) regime. First, we derive a batch minorization‑maximization (MM) algorithm for softmax‑gated multinomial‑logistic MoE using an explicit quadratic minorizer, yielding coordinate‑wise closed‑form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M‑steps common in EM‑type implementations. Second, we prove finite‑sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep‑free selector of the number of experts that achieves near‑parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein‑protein interaction prediction validate the full pipeline, delivering improved accuracy and better‑calibrated probabilities than strong statistical and machine‑learning baselines.
Authors: Dongyeop Woo, Marta Skreta, Seonghyun Park, Kirill Neklyudov, Sungsoo Ahn
Abstract: Diffusion and flow models have become the dominant paradigm for generative modeling on Riemannian manifolds, with successful applications in protein backbone generation and DNA sequence design. However, these methods require tens to hundreds of neural network evaluations at inference time, which can become a computational bottleneck in large‑scale scientific sampling workflows. We introduce Riemannian MeanFlow (RMF), a framework for learning flow maps directly on manifolds, enabling high‑quality generations with as few as one forward pass. We derive three equivalent characterizations of the manifold average velocity (Eulerian, Lagrangian, and semigroup identities), and analyze parameterizations and stabilization techniques to improve training on high‑dimensional manifolds. In promoter DNA design and protein backbone generation settings, RMF achieves comparable sample quality to prior methods while requiring up to 10× fewer function evaluations. Finally, we show that few‑step flow maps enable efficient reward‑guided design through reward look‑ahead, where terminal states can be predicted from intermediate steps at minimal additional cost.
Authors: Matteo Rossi, Ryan Pederson, Miles Wang-Henderson, Ben Kaufman, Edward C. Williams, Carl Underkoffler, Owen Lewis Howell, Adrian Layer, Stephan Thaler, Narbe Mardirossian, John Anthony Parkhill
Abstract: We present TerraBind, a foundation model for protein‑ligand structure and binding affinity prediction that achieves 26‑fold faster inference than state‑of‑the‑art methods while improving affinity prediction accuracy by ~20%. Current deep learning approaches to structure‑based drug design rely on expensive all‑atom diffusion to generate 3D coordinates, creating inference bottlenecks that render large‑scale compound screening computationally intractable. We challenge this paradigm with a critical hypothesis: full all‑atom resolution is unnecessary for accurate small molecule pose and binding affinity prediction. TerraBind tests this hypothesis through a coarse pocket‑level representation (protein $C_\beta$ atoms and ligand heavy atoms only) within a multimodal architecture combining COATI‑3 molecular encodings and ESM‑2 protein embeddings that learns rich structural representations, which are used in a diffusion‑free optimization module for pose generation and a binding affinity likelihood prediction module. On structure prediction benchmarks (FoldBench, PoseBusters, Runs N' Poses), TerraBind matches diffusion‑based baselines in ligand pose accuracy. Crucially, TerraBind outperforms Boltz‑2 by ~20% in Pearson correlation for binding affinity prediction on both a public benchmark (CASP16) and a diverse proprietary dataset (18 biochemical/cell assays). We show that the affinity prediction module also provides well‑calibrated affinity uncertainty estimates, addressing a critical gap in reliable compound prioritization for drug discovery. Furthermore, this module enables a continual learning framework and a hedged batch selection strategy that, in simulated drug discovery cycles, achieves 6× greater affinity improvement of selected molecules over greedy‑based approaches.
Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Abstract: Molecular Dynamics (MD) simulations provide a fundamental tool for characterizing molecular behavior at full atomic resolution, but their applicability is severely constrained by the computational cost. To address this, a surge of deep generative models has recently emerged to learn dynamics at coarsened timesteps for efficient trajectory generation, yet they either generalize poorly across systems or, due to limited molecular diversity of trajectory data, fail to fully exploit structural information to improve generative fidelity. Here, we present the Pretrained Variational Bridge (PVB) in an encoder‑decoder fashion, which maps the initial structure into a noised latent space and transports it toward stage‑specific targets through augmented bridge matching. This unifies training on both single‑structure and paired trajectory data, enabling consistent use of cross‑domain structural knowledge across training stages. Moreover, for protein‑ligand complexes, we further introduce a reinforcement learning‑based optimization via adjoint matching that speeds progression toward the holo state, which supports efficient post‑optimization of docking poses. Experiments on proteins and protein‑ligand complexes demonstrate that PVB faithfully reproduces thermodynamic and kinetic observables from MD while delivering stable and efficient generative dynamics.
Authors: Shentong Mo, Lanqing Li
Abstract: Generative models for de novo protein backbone design have achieved remarkable success in creating novel protein structures. However, these diffusion‑based approaches remain computationally intensive and slower than desired for large‑scale structural exploration. While recent efforts like Proteina have introduced flow‑matching to improve sampling efficiency, the potential of tokenization for structural compression and acceleration remains largely unexplored in the protein domain. In this work, we present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. SaDiT leverages a discrete latent space to represent protein geometry, significantly reducing the complexity of the generation process while maintaining theoretical SE(3) equivariance. To further enhance efficiency, we introduce an IPA Token Cache mechanism that optimizes the Invariant Point Attention (IPA) layers by reusing computed token states during iterative sampling. Experimental results demonstrate that SaDiT outperforms state‑of‑the‑art models, including RFDiffusion and Proteina, in both computational speed and structural viability. We evaluate our model across unconditional backbone generation and fold‑class conditional generation tasks, where SaDiT shows superior ability to capture complex topological features with high designability.
Authors: Muhammad Waqas Haseeb, Mohamad Toutounji
Abstract: We investigate the quantum dynamics of ligand‑receptor electron transfer and conformational response in a prototypical viral binding complex, using the SARS‑CoV‑2 Spike protein bound to the human ACE2 receptor as a model system. Treating the ACE2‑Spike interface as an open quantum system embedded in a biological environment, we simulate how vibrational interactions and environmental memory reshape the coupled receptor‑ligand dynamics and modulate vibrationally assisted electron transfer (VA‑ET). Using a Non‑Markovian Stochastic Schrödinger Equation (NMSSE) approach, we simulate electron transfer between donor and acceptor states in ACE2 modulated by a specific vibrational mode of the Spike protein. The influence of environmental memory (non‑Markovian dynamics) and non‑Condon effects (vibrational modulation of electronic coupling) are analyzed in detail. In the Markovian limit with an Ohmic bath, population dynamics reduce to exponential kinetics, and extracted transfer rates agree with semiclassical Marcus‑Jortner predictions in the appropriate regime. Beyond the Markovian, high‑temperature limit, we observe clear deviations: non‑exponential decay, coherent oscillatory features, and enhanced sensitivity to the vibrational frequency. Incorporating off‑diagonal system‑bath coupling alongside diagonal coupling shows that nuclear motion can dynamically gate electron tunneling, sharpening the frequency selectivity of the VA‑ET mechanism. Finally, a structured (sub‑Ohmic) environmental spectral density with long‑lived correlations ("memory") preserves electronic‑vibrational coherence over longer times, amplifying vibrational selectivity under non‑Condon coupling. Our results support the proposition that ACE2‑Spike binding may exploit vibrational assistance and quantum coherence as a molecular recognition mechanism.
Authors: Rohit Dilip, Ayush Varshney, David Van Valen
Abstract: Tokenization is a promising path to multi‑modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence‑reduction operations, and allows task‑specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non‑linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero‑shot protein shrinking and affinity maturation.
Authors: Jonathan Feldman, Tal Feldman, Annie I Anton
Abstract: Biological AI tools for protein design and structure prediction are advancing rapidly, creating dual‑use risks that existing safeguards cannot adequately address. Current model‑level restrictions, including keyword filtering, output screening, and content‑based access denials, are fundamentally ill‑suited to biology, where reliable function prediction remains beyond reach and novel threats evade detection by design. We propose a three‑tier Know Your Customer (KYC) framework, inspired by anti‑money laundering (AML) practices in the financial sector, that shifts governance from content inspection to user verification and monitoring. Tier I leverages research institutions as trust anchors to vouch for affiliated researchers and assume responsibility for vetting. Tier II applies output screening through sequence homology searches and functional annotation. Tier III monitors behavioral patterns to detect anomalies inconsistent with declared research purposes. This layered approach preserves access for legitimate researchers while raising the cost of misuse through institutional accountability and traceability. The framework can be implemented immediately using existing institutional infrastructure, requiring no new legislation or regulatory mandates.
Authors: Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler
Abstract: How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features such as charge flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.
Authors: Francesco Alesiani, Jonathan Warrell, Tanja Bien, Henrik Christiansen, Matheus Ferraz, Mathias Niepert
Abstract: We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact score‑based guidance for complex logical formulas can be obtained from guidance signals associated with atomic properties. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifier guidance and classifier‑free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.
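The composition rules this abstract describes can be illustrated with a small sketch. It is not the authors' algorithm: it only applies standard probability identities under the assumptions the abstract names (conjunctions over conditionally independent subformulas, disjunctions over subformulas that are either conditionally independent or mutually exclusive). Each node carries a posterior probability p and a score, taken here to be the gradient of log p with respect to the noisy sample. The names `Node`, `atom`, `and_indep`, `or_exclusive`, `or_indep`, and `negate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    p: float            # posterior probability that the (sub)formula holds
    score: List[float]  # gradient of log p with respect to the noisy sample


def atom(p: float, score: List[float]) -> Node:
    """Wrap an atomic property's posterior and score (assumed given)."""
    return Node(p, list(score))


def and_indep(a: Node, b: Node) -> Node:
    """Conjunction of conditionally independent subformulas.

    p = pa * pb, so log p is a sum and the scores simply add.
    """
    return Node(a.p * b.p, [x + y for x, y in zip(a.score, b.score)])


def or_exclusive(a: Node, b: Node) -> Node:
    """Disjunction of mutually exclusive subformulas.

    p = pa + pb, and grad p = pa * sa + pb * sb, so the score is a
    posterior-weighted average of the two scores.
    """
    p = a.p + b.p
    return Node(p, [(a.p * x + b.p * y) / p for x, y in zip(a.score, b.score)])


def or_indep(a: Node, b: Node) -> Node:
    """Disjunction of conditionally independent subformulas.

    p = 1 - (1 - pa) * (1 - pb); differentiate and divide by p.
    """
    p = 1.0 - (1.0 - a.p) * (1.0 - b.p)
    grad = [a.p * x * (1.0 - b.p) + b.p * y * (1.0 - a.p)
            for x, y in zip(a.score, b.score)]
    return Node(p, [g / p for g in grad])


def negate(a: Node) -> Node:
    """Negation: p = 1 - pa, and grad log(1 - pa) = -pa * sa / (1 - pa)."""
    p = 1.0 - a.p
    return Node(p, [-a.p * x / p for x in a.score])


if __name__ == "__main__":
    A = atom(0.5, [1.0, 0.0])
    B = atom(0.4, [0.0, 2.0])
    print(and_indep(A, B))
    print(or_indep(A, negate(B)))
```

Because every rule returns another `Node`, a formula written as a nested circuit is evaluated bottom-up in one pass over its gates, which is the sense in which such a recursion is efficient.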
Authors: Ben Isselmann, Dilara Göksu, Andreas Weinmann
Abstract: Task‑specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self‑supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross‑domain transferability of DINO‑pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet‑1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy‑specific HPA‑pretrained model achieving the best performance (mean macro F1‑score = 0.8221 ± 0.0062), slightly outperforming a DINO model trained directly on OpenCell (0.8057 ± 0.0090). These results highlight the value of large‑scale pretraining and indicate that domain‑relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task‑specific labeled data are limited.
Authors: Zhe Wang, Zijing Liu, Chencheng Xu, Yuan Yao
Abstract: Drug discovery remains time‑consuming, labor‑intensive, and expensive, often requiring years and substantial investment per drug candidate. Predicting compound‑protein interactions (CPIs) is a critical component in this process, enabling the identification of molecular interactions between drug candidates and target proteins. Recent deep learning methods have successfully modeled CPIs at the atomic level, achieving improved efficiency and accuracy over traditional energy‑based approaches. However, these models do not always align with chemical realities, as molecular fragments (motifs or functional groups) typically serve as the primary units of biological recognition and binding. In this paper, we propose Phi‑former, a pairwise hierarchical interaction representation learning method that addresses this gap by incorporating the biological role of motifs in CPIs. Phi‑former represents compounds and proteins hierarchically and employs a pairwise pre‑training framework to model interactions systematically across atom‑atom, motif‑motif, and atom‑motif levels, reflecting how biological systems recognize molecular partners. We design intra‑level and inter‑level learning pipelines that make different interaction levels mutually beneficial. Experimental results demonstrate that Phi‑former achieves superior performance on CPI‑related tasks. A case study shows that our method accurately identifies specific atoms or motifs activated in CPIs, providing interpretable model explanations. These insights may guide rational drug design and support precision medicine applications.
Authors: Ling Luo, Wenbin Jiang, Hongyuan Chang, Xinkang Wang, Xushi Zhang, Yueting Xiong, Mengsha Tong, Rongshan Yu
Abstract: Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD‑Instruction, the first large‑scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence‑function alignment and support antibody design guided by natural language instructions. Extensive instruction‑tuning experiments on general‑purpose LLMs demonstrate that AFD‑Instruction consistently improves performance across diverse antibody‑related tasks. By linking antibody sequences with textual descriptions of function, AFD‑Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
Abstract: We present protein autoregressive modeling (PAR), the first multi‑scale autoregressive framework for protein backbone generation via coarse‑to‑fine next‑scale prediction. Using the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi‑scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi‑scale information and produces conditional embeddings to guide structure generation; (iii) a flow‑based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, which is caused by the mismatch between the training and generation procedures and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero‑shot generalization, supporting flexible human‑prompted conditional generation and motif scaffolding without requiring fine‑tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
Authors: Sisi Yuan, Jiehuang Chen, Junchuang Cai, Dong Xu, Xueliang Li, Zexuan Zhu, Junkai Ji
Abstract: Protein inverse folding, the task of predicting amino acid sequences for desired structures, is pivotal for de novo protein design. However, existing GNN‑based methods typically suffer from restricted receptive fields that miss long‑range dependencies and a "single‑pass" inference paradigm that leads to error accumulation. To address these bottlenecks, we propose RIGA‑Fold, a framework that synergizes Recurrent Interaction with Geometric Awareness. At the micro‑level, we introduce a Geometric Attention Update (GAU) module where edge features explicitly serve as attention keys, ensuring strictly SE(3)‑invariant local encoding. At the macro‑level, we design an attention‑based Global Context Bridge that acts as a soft gating mechanism to dynamically inject global topological information. Furthermore, to bridge the gap between structural and sequence modalities, we introduce an enhanced variant, RIGA‑Fold, which integrates trainable geometric features with frozen evolutionary priors from ESM‑2 and ESM‑IF via a dual‑stream architecture. Finally, a biologically inspired "predict‑recycle‑refine" strategy is implemented to iteratively denoise sequence distributions. Extensive experiments on CATH 4.2, TS50, and TS500 benchmarks demonstrate that our geometric framework is highly competitive, while RIGA‑Fold significantly outperforms state‑of‑the‑art baselines in both sequence recovery and structural consistency.
Authors: Jacob S. Leiby, Jialu Yao, Pan Lu, George Hu, Anna Davidian, Shunsuke Koga, Olivia Leung, Pravin Patel, Isabella Tondi Resta, Rebecca Rojansky, Derek Sung, Eric Yang, Paul J. Zhang, Emma Lundberg, Dokyoon Kim, Serena Yeung-Levy, James Zou, Thomas Montine, Jeffrey Nirschl, Zhi Huang
Abstract: Immunohistochemistry (IHC) provides information on protein expression in tissue sections and is commonly used to support pathology diagnosis and disease triage. While AI models for H&E‑stained slides show promise, their applicability to IHC is limited due to domain‑specific variations. Here we introduce HPA10M, a dataset that contains 10,495,672 IHC images from the Human Protein Atlas with comprehensive metadata included, and encompasses 45 normal tissue types and 20 major cancer types. Based on HPA10M, we trained iSight, a multi‑task learning framework for automated IHC staining assessment. iSight combines visual features from whole‑slide images with tissue metadata through a token‑level attention mechanism, simultaneously predicting staining intensity, location, quantity, tissue type, and malignancy status. On held‑out data, iSight achieved 85.5% accuracy for location, 76.6% for intensity, and 75.7% for quantity, outperforming fine‑tuned foundation models (PLIP, CONCH) by 2.5‑10.2%. In addition, iSight demonstrates well‑calibrated predictions with expected calibration errors of 0.0150‑0.0408. Furthermore, in a user study with eight pathologists evaluating 200 images from two datasets, iSight outperformed initial pathologist assessments on the held‑out HPA dataset (79% vs 68% for location, 70% vs 57% for intensity, 68% vs 52% for quantity). Inter‑pathologist agreement also improved after AI assistance in both held‑out HPA (Cohen's κ increased from 0.63 to 0.70) and Stanford TMAD datasets (from 0.74 to 0.76), suggesting expert‑AI co‑assessment can improve IHC interpretation. This work establishes a foundation for AI systems that can improve IHC diagnostic accuracy and highlights the potential for integrating iSight into clinical workflows to enhance the consistency and reliability of IHC assessment.
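For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's κ for two raters, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The abstract does not say how agreement was aggregated over the eight pathologists, so the two-rater form and the function name `cohens_kappa` are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(1 for x, y in zip(rater_a, rater_b) if x == y) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the study.
    print(cohens_kappa(["weak", "weak", "strong", "strong"],
                       ["weak", "weak", "strong", "weak"]))
```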
Authors: Jiying Zhang, Shuhao Zhang, Pierre Vandergheynst, Patrick Barth
Abstract: G‑protein‑coupled receptors (GPCRs), primary targets for over one‑third of approved therapeutics, rely on intricate conformational transitions to transduce signals. While Molecular Dynamics (MD) is essential for elucidating this transduction process, particularly within ligand‑bound complexes, conventional all‑atom MD simulation is computationally prohibitive. In this paper, we introduce GPCRLMD, a deep generative framework for efficient all‑atom GPCR‑ligand simulation. GPCRLMD employs a Harmonic‑Prior Variational Autoencoder (HP‑VAE) to first map the complex into a regularized isometric latent space, preserving geometric topology via physics‑informed constraints. Within this latent space, a Residual Latent Flow samples evolution trajectories, which are subsequently decoded back to atomic coordinates. By capturing temporal dynamics via relative displacements anchored to the initial structure, this residual mechanism effectively decouples static topology from dynamic fluctuations. Experimental results demonstrate that GPCRLMD achieves state‑of‑the‑art performance in GPCR‑ligand dynamics simulation, faithfully reproducing thermodynamic observables and critical ligand‑receptor interactions.
Authors: Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer
Abstract: Protein fitness optimization is challenged by a vast combinatorial landscape where high‑fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient‑based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow‑matching model with classifier‑free guidance, we enable the direct generation of high‑fitness variants without predictor‑based guidance during the ODE sampling steps. CHASE achieves state‑of‑the‑art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data‑constrained settings.
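As a rough illustration of how classifier‑free guidance enters ODE sampling in a flow‑matching model, the sketch below blends a conditional and an unconditional velocity and integrates with forward Euler steps. It is not CHASE's implementation: the blending convention v = v_uncond + w (v_cond - v_uncond), the toy velocity fields, and all function names are assumptions made for this example.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def guided_velocity(v_cond: Vector, v_uncond: Vector, w: float) -> Vector:
    """Blend velocities; w = 0 is unconditional, w = 1 conditional, w > 1 extrapolates."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample_with_guidance(x0: Vector, v_cond_fn: VelocityFn, v_uncond_fn: VelocityFn,
                         w: float, n_steps: int = 10) -> Vector:
    """Forward-Euler integration of dx/dt = guided velocity from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = guided_velocity(v_cond_fn(x, t), v_uncond_fn(x, t), w)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy constant velocity fields standing in for a trained network.
    def cond(x, t):
        return [1.0, 0.0]

    def uncond(x, t):
        return [0.0, 0.0]

    print(sample_with_guidance([0.0, 0.0], cond, uncond, w=2.0))
```

No fitness predictor appears anywhere in the sampling loop; the condition only enters through the conditional velocity, which is the point the abstract makes about avoiding predictor‑based guidance.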
Authors: Andong Hu, Luca Pennati, Stefano Markidis, Ivy Peng
Abstract: State‑of‑the‑art AI deep potentials provide ab initio‑quality results at a fraction of the computational cost of first‑principles quantum mechanical calculations, such as density functional theory. In this work, we bring AI deep potentials into GROMACS, a production‑level Molecular Dynamics (MD) code, by integrating it with DeePMD‑kit, which provides domain‑specific deep learning (DL) models of interatomic potential energy and force fields. In particular, we enable AI deep potential inference across multiple DP model families and DL backends by coupling GROMACS Neural Network Potentials with the C++/CUDA backend in DeePMD‑kit. We evaluate two recent large‑atom‑model architectures, DPA2, which is based on the attention mechanism, and DPA3, which is based on a GNN, in GROMACS using four ab initio‑quality protein‑in‑water benchmarks (1YRF, 1UBQ, 3LZM, 2PTC) on NVIDIA A100 and GH200 GPUs. Our results show that DPA2 delivers up to 4.23x and 3.18x higher throughput than DPA3 on A100 and GH200 GPUs, respectively. We also provide a characterization study to further contrast DPA2 and DPA3 in throughput, memory usage, and kernel‑level execution on GPUs. Our findings identify kernel‑launch overhead and domain‑decomposed inference as the main optimization priorities for AI deep potentials in production MD simulations.
Authors: Yucheng Liao, Han Wen, Weinan E, Weijie Zhang
Abstract: Data‑independent acquisition mass spectrometry (DIA‑MS) has established itself as a cornerstone of proteomic profiling and large‑scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi‑supervised training within each run for peptide‑spectrum match (PSM) re‑scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA‑CLIP, a pre‑trained model shifting the DIA analysis paradigm from semi‑supervised training to universal cross‑modal representation learning. By integrating a dual‑encoder contrastive learning framework with an encoder‑decoder architecture, DIA‑CLIP establishes a unified cross‑modal representation for peptides and corresponding spectral features, achieving high‑precision, zero‑shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA‑CLIP consistently outperforms state‑of‑the‑art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA‑CLIP holds immense potential for diverse practical applications, such as single‑cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
Authors: Edwin V. Bonilla, He Zhao, Daniel M. Steinberg
Abstract: We propose causal preference elicitation, a Bayesian framework for expert‑in‑the‑loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black‑box observational posterior, we model noisy expert judgments with a three‑way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert's categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.
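The query‑selection criterion described above can be sketched for a single candidate edge query. With a weighted particle approximation of the DAG posterior, the expected information gain of asking about a pair (i, j) is the mutual information between the expert's categorical answer and the particle identity: the entropy of the marginal answer distribution minus the weighted average of per‑particle answer entropies. The symmetric noise model below (the relation implied by a particle is reported with probability 1 - noise, each alternative with noise / 2) and all names are hypothetical stand‑ins, not the paper's likelihood.

```python
import math
from typing import Dict, Iterable, List, Sequence

RESPONSES = ("i->j", "j->i", "none")  # three-way answer about one node pair


def response_probs(true_relation: str, noise: float) -> Dict[str, float]:
    """Hypothetical expert model: correct answer w.p. 1 - noise, else uniform error."""
    return {r: (1.0 - noise) if r == true_relation else noise / 2.0
            for r in RESPONSES}


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_information_gain(particle_relations: Sequence[str],
                              weights: Sequence[float],
                              noise: float) -> float:
    """Mutual information between the expert's answer and the particle identity."""
    total = sum(weights)
    w: List[float] = [x / total for x in weights]
    per_particle = [response_probs(rel, noise) for rel in particle_relations]

    # Marginal predictive distribution over the three answers.
    marginal = [sum(wk * pp[r] for wk, pp in zip(w, per_particle))
                for r in RESPONSES]
    # Expected entropy of the answer once the particle is known.
    conditional = sum(wk * entropy(pp.values()) for wk, pp in zip(w, per_particle))
    return entropy(marginal) - conditional


if __name__ == "__main__":
    # Particles disagree about the edge, so the query is informative.
    print(expected_information_gain(["i->j", "none", "i->j", "j->i"],
                                    [0.25, 0.25, 0.25, 0.25], noise=0.1))
```

Scoring every candidate pair this way and asking about the highest‑scoring one is a natural greedy policy; whether the paper uses exactly this policy is not stated in the abstract.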
Authors: Shih-Hsin Wang, Yuhao Huang, Taos Transue, Justin Baker, Jonathan Forstater, Thomas Strohmer, Bao Wang
Abstract: Graph neural networks (GNNs) have emerged as powerful tools for learning protein structures by capturing spatial relationships at the residue level. However, existing GNN‑based methods often face challenges in learning multiscale representations and modeling long‑range dependencies efficiently. In this work, we propose an efficient multiscale graph‑based learning framework tailored to proteins. Our proposed framework contains two crucial components: (1) It constructs a hierarchical graph representation comprising a collection of fine‑grained subgraphs, each corresponding to a secondary structure motif (e.g., α‑helices, β‑strands, loops), and a single coarse‑grained graph that connects these motifs based on their spatial arrangement and relative orientation. (2) It employs two GNNs for feature learning: the first operates within individual secondary motifs to capture local interactions, and the second models higher‑level structural relationships across motifs. Our modular framework allows a flexible choice of GNN in each stage. Theoretically, we show that our hierarchical framework preserves the desired maximal expressiveness, ensuring no loss of critical structural information. Empirically, we demonstrate that integrating baseline GNNs into our multiscale framework remarkably improves prediction accuracy and reduces computational cost across various benchmarks.
Authors: Jiahao Zhang, Zeqing Zhang, Di Wang, Lijie Hu
Abstract: Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To unify this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif‑level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility‑Controlled Contrastive Steering), which steers protein generation with a constrained dataset. Instead of naively contrasting high‑ vs. low‑repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability. Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM‑3 and ProtGPT2 in CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset‑guided steering as a principled approach for reliable protein generation.
Authors: Jonathan Fiorentino, Michele Monti, Dimitrios Miltiadis-Vrachnos, Vittorio Del Tatto, Alessandro Laio, Gian Gaetano Tartaglia
Abstract: Identifying minimal and informative feature sets is a central challenge in data analysis, particularly when few data points are available. Here we present a theoretical analysis of an unsupervised feature selection pipeline based on the Differentiable Information Imbalance (DII). We consider the specific case of structural and physico‑chemical features describing a set of proteins. We show that if one considers the features as coordinates of a (hypothetical) statistical physics model, this model undergoes a phase transition as a function of the number of retained features. For physico‑chemical descriptors, this transition is between a glass‑like phase when the features are few and a liquid‑like phase. The glass‑like phase exhibits bimodal order‑parameter distributions and Binder cumulant minima. In contrast, for structural descriptors the transition is less sharp. Remarkably, for physico‑chemical descriptors the critical number of features identified from the DII coincides with the saturation of downstream binary classification performance. These results provide a principled, unsupervised criterion for minimal feature sets in protein classification and reveal distinct mechanisms of criticality across different feature types.
Authors: Tyler Grear, Donald J. Jacobs
Abstract: Predicting favorable protein‑peptide binding events remains a central challenge in biophysics, with continued uncertainty surrounding how nonlocal effects shape the global energy landscape. Here, we introduce peripheral surface information (PSI) entropy, a quantitative measure of the statistical variability in apolar and charged non‑interacting surface (NIS) proportions across conformational ensembles. Using energy‑directed molecular docking via HADDOCK3 and explicit‑solvent molecular dynamics simulations, it is demonstrated that favorable binding partners exhibit emergent, low‑entropy N‑states (discrete macrostates in NIS state space) indicative of preferential apolar/charged surface configurations. Across dozens of peptides and multiple receptor systems (WW, PDZ, and MDM2 domains), dominant N‑states persisted under varied docking parameters and initial conditions. An experimental meta‑ensemble of WW domains from 36 high‑resolution structures confirmed the presence of dominant NIS modes independent of in silico methodology, suggesting an evolutionary selection pressure toward specific NIS fingerprints. These findings establish PSI entropy as a thermoinformatic descriptor that encodes favorable binding constraints into unique statistical signatures of the NIS.
Authors: Fang Sheng, Mohammad Noaeen, Zahra Shakeri
Abstract: Antimicrobial resistance threatens healthcare sustainability and motivates low‑cost computational discovery of antimicrobial peptides (AMPs). De novo peptide generation must optimize antimicrobial activity and safety through low predicted toxicity, but likelihood‑trained generators do not enforce these goals explicitly. We introduce ProDCARL, a reinforcement‑learning alignment framework that couples a diffusion‑based protein generator (EvoDiff OA‑DM 38M) with sequence property predictors for AMP activity and peptide toxicity. We fine‑tune the diffusion prior on AMP sequences to obtain a domain‑aware generator. Top‑k policy‑gradient updates use classifier‑derived rewards plus entropy regularization and early stopping to preserve diversity and reduce reward hacking. In silico experiments show ProDCARL increases the mean predicted AMP score from 0.081 after fine‑tuning to 0.178. The joint high‑quality hit rate reaches 6.3% with pAMP >0.7 and pTox <0.3. ProDCARL maintains high diversity, with 1‑mean pairwise identity equal to 0.929. Qualitative analyses with AlphaFold3 and ProtBERT embeddings suggest candidates show plausible AMP‑like structural and semantic characteristics. ProDCARL serves as a candidate generator that narrows experimental search space, and experimental validation remains future work.
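The two summary statistics reported above can be made concrete with a short sketch. The thresholds pAMP > 0.7 and pTox < 0.3 come from the abstract; the identity measure is an assumption, since the abstract does not say how pairwise identity is computed. Here it is position‑wise matching without alignment, normalised by the longer sequence, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def joint_hit_rate(p_amp: Sequence[float], p_tox: Sequence[float],
                   amp_min: float = 0.7, tox_max: float = 0.3) -> float:
    """Fraction of candidates with predicted AMP score above amp_min
    and predicted toxicity below tox_max."""
    if len(p_amp) != len(p_tox) or len(p_amp) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(1 for a, t in zip(p_amp, p_tox) if a > amp_min and t < tox_max)
    return hits / len(p_amp)


def pairwise_identity(s1: str, s2: str) -> float:
    """Assumed identity measure: matching positions over the longer length."""
    matches = sum(1 for c1, c2 in zip(s1, s2) if c1 == c2)
    return matches / max(len(s1), len(s2))


def diversity(seqs: Sequence[str]) -> float:
    """One minus the mean pairwise identity over all unordered pairs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    mean_identity = sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_identity


if __name__ == "__main__":
    # Toy scores and sequences for illustration only.
    print(joint_hit_rate([0.9, 0.8, 0.5, 0.75], [0.1, 0.4, 0.2, 0.29]))
    print(diversity(["AAAA", "AAAT", "TTTT"]))
```

An alignment‑based identity would generally give different values for sequences of unequal length, so the diversity figure depends on this choice.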
Authors: Fukang Ge, Jiarui Zhu, Linjie Zhang, Haowen Xiao, Xiangcheng Bao, Fangnan Xie, Danyang Chen, Yanrui Lu, Yuting Wang, Ziqian Guan, Lin Gu, Jinhao Bi, Yingying Zhu
Abstract: Modern AI technologies for drug discovery are distributed across heterogeneous platforms (including web applications, desktop environments, and code libraries), leading to fragmented workflows, inconsistent interfaces, and high integration overhead. We present an agentic end‑to‑end drug design framework that leverages a Large Language Model (LLM) in conjunction with the Model Context Protocol (MCP) to dynamically coordinate access to biochemical databases, modular toolchains, and task‑specific AI models. The system integrates four state‑of‑the‑art components: MaSIF (MaSIF‑site and MaSIF‑seed‑search) for geometric deep learning‑based identification of protein‑protein interaction (PPI) sites, Rosetta for grafting protein fragments onto protein backbones to form mini proteins, ProteinMPNN for amino acid sequence redesign, and AlphaFold3 for near‑experimental accuracy in complex structure prediction. Starting from a target structure, the framework supports de novo binder generation via surface analysis, scaffold grafting and pose construction, sequence optimization, and structure prediction. Additionally, by replacing rigid, script‑based workflows with a protocol‑driven, LLM‑coordinated architecture, the framework improves reproducibility, reduces manual overhead, and ensures extensibility, portability, and auditability across the entire drug design process.
Authors: Shyam Venkatasubramanian, Sean Moushegian, Michael Lin, Mir Park, Ankit Singhal, Connor Lee
Abstract: Standard attention‑based transformers are known to exhibit instability under learning rate overspecification during training, particularly at high learning rates. While various methods have been proposed to improve resilience to such overspecification by modifying the optimization procedure, fundamental architectural innovations to this end remain underexplored. In this work, we illustrate that the consensus mechanism, a drop‑in replacement for attention, stabilizes transformer training across a wider effective range of learning rates. We formulate consensus as a graphical model and provide extensive empirical analysis demonstrating improved stability across learning rate sweeps on text, DNA, and protein modalities. We further propose a hybrid consensus‑attention framework that preserves performance while improving stability. We provide theoretical analysis characterizing the properties of consensus.
Authors: Shrey Goel, Pranam Chatterjee
Abstract: Generative modeling of peptide sequences requires navigating a discrete and highly constrained space in which many intermediate states are chemically implausible or unstable. Existing discrete diffusion and flow‑based methods rely on reversing fixed corruption processes or following prescribed probability paths, which can force generation through low‑likelihood regions and require countless sampling steps. We introduce Minimal‑action discrete Schrödinger Bridge Matching (MadSBM), a rate‑based generative framework for peptide design that formulates generation as a controlled continuous‑time Markov process on the amino‑acid edit graph. To yield probability trajectories that remain near high‑likelihood sequence neighborhoods throughout generation, MadSBM 1) defines generation relative to a biologically informed reference process derived from pre‑trained protein language model logits and 2) learns a time‑dependent control field that biases transition rates to produce low‑action transport paths from a masked prior to the data distribution. We finally introduce guidance to the MadSBM sampling procedure towards a specific functional objective, expanding the design space of therapeutic peptides; to our knowledge, this represents the first‑ever application of discrete classifier guidance to Schrödinger bridge‑based generative models.
Authors: Natalie Maus, Yimeng Zeng, Haydn Thomas Jones, Yining Huang, Gaurav Ng Goel, Alden Rose, Kyurae Kim, Hyun-Su Lee, Marcelo Der Torossian Torres, Fangping Wan, Cesar de la Fuente-Nunez, Mark Yatskar, Osbert Bastani, Jacob R. Gardner
Abstract: Many key challenges in biological design ‑‑ such as small‑molecule drug discovery, antimicrobial peptide development, and protein engineering ‑‑ can be framed as black‑box optimization over vast, complex structured spaces. Existing methods rely mainly on raw structural data and struggle to exploit the rich scientific literature. While large language models (LLMs) have been added to these pipelines, they have been confined to narrow roles within structure‑centered optimizers. We instead cast biological black‑box optimization as an agent‑driven, language‑based reasoning process. We introduce Purely Agent‑driven BLack‑box Optimization (PABLO), a hierarchical agentic system that uses scientific LLMs pretrained on chemistry and biology literature to generate and iteratively refine biological candidates. On both the standard GuacaMol molecular design and antimicrobial peptide optimization tasks, PABLO achieves state‑of‑the‑art performance, substantially improving sample efficiency and final objective values over established baselines. Compared to prior optimization methods that incorporate LLMs, PABLO achieves competitive token usage per run despite relying on LLMs throughout the optimization loop. Beyond raw performance, the agentic formulation offers key advantages for realistic design: it naturally incorporates semantic task descriptions, retrieval‑augmented domain knowledge, and complex constraints. In follow‑up in vitro validation, PABLO‑optimized peptides showed strong activity against drug‑resistant pathogens, underscoring the practical potential of PABLO for therapeutic discovery.
Authors: Jiafei Chen, Yuanyuan Feng, Jingzhi Feng, Xinyun Zhang, Jinzhen Zhu, Qingmeng Xu
Abstract: The biological effects of electromagnetic fields on proteins remain controversial beyond well‑established thermal mechanisms, particularly with respect to frequency‑dependent responses. Here, we propose that electromagnetic waves can modulate protein conformation through resonant coupling with intrinsic protein dynamics. Molecular dynamics simulations were employed to characterize spontaneous conformational fluctuations in the absence of external fields, and a tiered screening strategy combined with fast Fourier transform analysis was used to identify dominant intrinsic frequencies associated with periodically fluctuating non‑covalent atom or residue pairs. Oscillating external electric fields were subsequently applied at resonant and off‑resonant frequencies to evaluate conformational responses across diverse protein systems. The results demonstrate that resonant excitation induces significantly enhanced backbone conformational deviations compared to off‑resonant conditions, with the effect becoming more pronounced in structurally flexible and multichain proteins. These findings provide atomistic evidence for frequency‑specific resonance between electromagnetic fields and protein conformational dynamics, offering mechanistic insight into frequency‑dependent electromagnetic effects and a computational framework for electromagnetic wave‑based modulation of protein function.
Authors: Xin Peng, Ang Gao
Abstract: The scalability of continuous normalizing flows (CNFs) for unbiased Boltzmann sampling remains limited in high‑dimensional systems due to the cost of Jacobian‑determinant evaluation, which requires D backpropagation passes through the flow layers. Existing stochastic Jacobian estimators such as the Hutchinson trace estimator reduce computation but introduce bias, while the recently proposed Flow Perturbation method is unbiased yet suffers from high variance. We present Flow Perturbation++, a variance‑reduced extension of Flow Perturbation that discretizes the probability‑flow ODE and performs unbiased stepwise Jacobian estimation at each integration step. This multi‑step construction retains the unbiasedness of Flow Perturbation while achieving substantially lower estimator variance. Integrated into a Sequential Monte Carlo framework, Flow Perturbation++ achieves significantly improved equilibrium sampling on a 1000D Gaussian Mixture Model and the all‑atom Chignolin protein compared with Hutchinson‑based and single‑step Flow Perturbation baselines.
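For readers unfamiliar with the Hutchinson baseline named in the abstract above, the following is a minimal sketch of a stochastic trace estimator. It is not the Flow Perturbation++ method. The names `matvec` and `hutchinson_trace` are hypothetical, and a small explicit matrix stands in for the Jacobian‑vector product that a flow model would supply.

```python
import random


def matvec(matrix, vec):
    """Dense matrix-vector product; a stand-in for a Jacobian-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def hutchinson_trace(matvec_fn, dim, num_probes=64, seed=0):
    """Estimate tr(A) by averaging z^T A z over random Rademacher probes z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        # Each probe entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        az = matvec_fn(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / num_probes


# Usage: only matrix-vector products are needed, never the full matrix trace.
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
estimate = hutchinson_trace(lambda z: matvec(A, z), dim=3)
```

Each probe costs one matrix‑vector product, which is why estimators of this kind are cheaper than the D backpropagation passes the abstract mentions for an exact evaluation.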
Authors: Chayan Banerjee
Abstract: Traditional e‑commerce recommender systems primarily optimize for user engagement and purchase likelihood, often neglecting the rigid physiological constraints required for human health. Standard collaborative filtering algorithms are structurally blind to these hard limits, frequently suggesting bundles that fail to meet specific total daily energy expenditure and macronutrient balance requirements. To address this disconnect, this paper introduces a Physics‑Informed Neuro‑Symbolic Recommender System that integrates nutritional science directly into the recommendation pipeline via a dual‑layer architecture. The framework begins by constructing a semantic knowledge graph using sentence‑level encoders to strictly align commercial products with authoritative nutritional data. During the training phase, an implicit physics regularizer applies a differentiable thermodynamic loss function, ensuring that learned latent embeddings reflect nutritional plausibility rather than simple popularity. Subsequently, during the inference phase, an explicit physics optimizer employs simulated annealing and elastic quantity optimization to generate discrete grocery bundles that strictly adhere to the user's protein and caloric targets.
Authors: Zefeng Lin, Zhihang Zhang, Weirong Zhu, Tongchang Han, Xianyong Fang, Tianfan Fu, Xiaohua Xu
Abstract: Designing enzymes with substrate‑binding pockets is a critical challenge in protein engineering, as catalytic activity depends on the precise interaction between pockets and substrates. Currently, generative models dominate functional protein design but cannot model pocket‑substrate interactions, which limits the generation of enzymes with precise catalytic environments. To address this issue, we propose EnzyPGM, a unified framework that jointly generates enzymes and substrate‑binding pockets conditioned on functional priors and substrates, with a particular focus on learning accurate pocket‑substrate interactions. At its core, EnzyPGM includes two main modules: a Residue‑atom Bi‑scale Attention (RBA) that jointly models intra‑residue dependencies and fine‑grained interactions between pocket residues and substrate atoms, and a Residue Function Fusion (RFF) that incorporates enzyme function priors into residue representations. We also curate EnzyPock, an enzyme‑pocket dataset comprising 83,062 enzyme‑substrate pairs across 1,036 four‑level enzyme families. Extensive experiments demonstrate that EnzyPGM achieves state‑of‑the‑art performance on EnzyPock. Notably, EnzyPGM reduces the average binding energy by 0.47 kcal/mol relative to EnzyGen, demonstrating superior performance on substrate‑specific enzyme design. The code and dataset will be released later.
Authors: Jingjie Ning, Xiangzhen Shen, Li Hou, Shiyi Shen, Jiahao Yang, Junrui Li, Hong Shan, Sanan Wu, Sihan Gao, H. Eric Xu, Xinheng He
Abstract: G protein‑coupled receptors (GPCRs) govern diverse physiological processes and are central to modern pharmacology. Yet discovering GPCR modulators remains challenging because receptor activation often arises from complex allosteric effects rather than direct binding affinity, and conventional assays are slow, costly, and not optimized for capturing these dynamics. Here we present GPCR‑Filter, a deep learning framework specifically developed for GPCR modulator discovery. We assembled a high‑quality dataset of over 90,000 experimentally validated GPCR‑ligand pairs, providing a robust foundation for training and evaluation. GPCR‑Filter integrates the ESM‑3 protein language model for high‑fidelity GPCR sequence representations with graph neural networks that encode ligand structures, coupled through an attention‑based fusion mechanism that learns receptor‑ligand functional relationships. Across multiple evaluation settings, GPCR‑Filter consistently outperforms state‑of‑the‑art compound‑protein interaction models and exhibits strong generalization to unseen receptors and ligands. Notably, the model successfully identified micromolar‑level agonists of the 5‑HT1A receptor with distinct chemical frameworks. These results establish GPCR‑Filter as a scalable and effective computational approach for GPCR modulator discovery, advancing AI‑assisted drug development for complex signaling systems.
Authors: Naeyma N. Islam, Thomas R. Caulfield
Abstract: Alzheimer's disease (AD) is marked by the pathological accumulation of amyloid beta‑42 (Abeta‑42), contributing to synaptic dysfunction and neurodegeneration. While extracellular amyloid plaques are well‑studied, increasing evidence highlights intracellular Abeta‑42 as an early and toxic driver of disease progression. In this study, we present a novel, AI‑assisted drug design approach to promote targeted degradation of Abeta‑42 via the ubiquitin‑proteasome system (UPS), using E3 ligase‑directed molecular glues. We systematically evaluated the ternary complex formation potential of Abeta‑42 with three E3 ligases: CRBN, VHL, and MDM2, through structure‑based modeling, ADMET screening, and docking. We then developed a Ligase‑Conditioned Junction Tree Variational Autoencoder (LC‑JT‑VAE) to generate ligase‑specific small molecules, incorporating protein sequence embeddings and torsional angle‑aware molecular graphs. Our results demonstrate that this generative model can produce chemically valid, novel, and target‑specific molecular glues capable of facilitating Abeta‑42 degradation. This integrated approach offers a promising framework for designing UPS‑targeted therapies for neurodegenerative diseases.
Authors: Muyuan Chen, Muchen Li, Renjie Liao
Abstract: The structural dynamics of macromolecules are critical to their structure‑function relationship. Cryogenic electron microscopy (CryoEM) provides snapshots of vitrified protein at different compositional and conformational states, and the structural heterogeneity of proteins can be characterized through computational analysis of the images. For protein systems with multiple degrees of freedom, it is still challenging to disentangle and interpret the different modes of dynamics. Here, by implementing Point Transformer, a self‑attention network designed for point cloud analysis, we are able to improve the performance of heterogeneity analysis on CryoEM data, and characterize the dynamics of highly complex protein systems in a more human‑interpretable way.
Authors: Zhiying Chen, Zihao Luo, Changsen Sun, Dmitry Kiesewetter, Sergey Krivosheev, Sergey Magazinov, Victor Malyugin, Xue Han
Abstract: To address the difficulty of characterizing the surface layer rigorously, especially the thickness and refractive index (RI) in surface plasmon resonance (SPR) technology, we propose a field‑weighted analysis method. This approach enables simultaneous quantitative determination of RI for the bulk solution and the surface layer. This study utilizes the aluminum‑based Kretschmann structure with the intensity interrogation technique. We construct the field‑weighted model governed by the evanescent field penetration depth to decompose the SPR reflected intensity into the bulk and surface responses. Experiments are conducted using bovine serum albumin (BSA) solution to form a surface adsorbed protein layer, and different concentrations of BSA are tested. Results show that the separated surface response fits well with the Langmuir formula, representing a significant improvement over the untreated SPR signal. The bulk and surface responses are then incorporated into the field‑weighted model to determine the RI values of the bulk BSA solution and the surface adsorbed BSA layer at various concentrations. The experimental results for the BSA solution match the Abbe refractometer measurements with a maximum error of 0.0004 in RI, while the results for the adsorbed BSA layer, both the RI and thickness, align well with reported parameters for a single BSA layer. This method eliminates the stage rotation required by the common angular interrogation SPR technique, as well as the complicated optical design and nano‑fabrication of nano‑optics sensing schemes, making it suitable for compact, low‑cost SPR platforms for practical applications needing surface layer characterization.
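The "Langmuir formula" mentioned in the abstract above is, in its textbook form, a saturating response in concentration. The sketch below assumes that standard form; the parameter names `r_max` (saturation response) and `k_eq` (equilibrium constant in inverse concentration units) are illustrative and do not come from the paper.

```python
def langmuir_response(concentration, r_max, k_eq):
    """Textbook Langmuir isotherm: r_max * K * c / (1 + K * c)."""
    if concentration < 0 or k_eq <= 0:
        raise ValueError("concentration must be >= 0 and k_eq must be > 0")
    return r_max * k_eq * concentration / (1.0 + k_eq * concentration)


# Usage: at c = 1 / k_eq the surface response is half of its saturation value.
half_saturation = langmuir_response(0.5, r_max=10.0, k_eq=2.0)
```

In this form the response rises linearly at low concentration and levels off toward `r_max`, which is the behavior expected for a single adsorbed layer.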
Authors: Roman Poletukhin, Marcel Kollovieh, Eike Eberhard, Stephan Günnemann
Abstract: Three‑dimensional molecular structure generation is typically performed at the level of individual atoms, yet molecular graph generation techniques often consider fragments as their structural units. Building on the advances in frame‑based protein structure generation, we extend these fragmentation ideas to 3D, treating general molecules as sets of rigid‑body motifs. Utilising this representation, we employ SE(3)‑equivariant generative modelling for de novo 3D molecule generation from rigid motifs. In our evaluations, we observe comparable or superior results to state‑of‑the‑art across benchmarks, surpassing it in atom stability on GEOM‑Drugs, while yielding a 2x to 10x reduction in generation steps and offering 3.5x compression in molecular representations compared to the standard atom‑based methods.
Authors: Rongze Ma, Mengkang Lu, Zhenyu Xiang, Yongsheng Pan, Yicheng Wu, Qingjie Zeng, Yong Xia
Abstract: Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost‑effective and tissue‑efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross‑modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology‑Aware Integrated Next‑Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure‑first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S‑Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state‑of‑the‑art methods in structural fidelity and clinical downstream tasks, validating the potential of structure‑guided autoregressive modeling.
Authors: Kentaro Yamamoto, Riku Masui, Takahito Nakajima, Miwako Tsuji, Mitsuhisa Sato, Peter Schow, Lukas Heidemann, Matthew Burke, Philipp Seitz, Oliver J. Backhouse, Juan W. Pedersen, John Children, Craig Holliman, Nathan Lysne, Daichi Okuno, Seyon Sivarajah, David Muñoz Ramo, Alex Chernoguzov, Ross Duncan
Abstract: We develop a workflow within the ONIOM framework and demonstrate it on the hybrid computing system consisting of the supercomputer Fugaku and the Quantinuum Reimei trapped‑ion quantum computer. This hybrid platform extends the layered approach for biomolecular chemical reactions to accurately treat the active site, such as a protein, and the large and often weakly correlated molecular environment. Our result marks a significant milestone in enabling scalable and accurate simulation of complex biomolecular reactions.
Authors: Megan A. Witherow, Michael L. Evans, Ahmed Temtam, Hamid R. Okhravi, Khan M. Iftekharuddin
Abstract: Alzheimer's disease (AD), defined as an abnormal buildup of amyloid plaques and tau tangles in the brain, can be diagnosed with high accuracy based on protein biomarkers via PET or CSF analysis. However, due to the invasive nature of biomarker collection, most AD diagnoses are made in memory clinics using cognitive tests and evaluation of hippocampal atrophy based on MRI. While clinical assessment and hippocampal volume show high diagnostic accuracy for amnestic or typical AD (tAD), a substantial subgroup of AD patients with atypical presentation (atAD) are routinely misdiagnosed. To improve diagnosis of atAD patients, we propose a machine learning approach to distinguish between atAD and non‑AD cognitive impairment using a clinical testing battery and MRI data collected as standard‑of‑care. We develop and evaluate our approach using 1410 subjects across four groups (273 tAD, 184 atAD, 235 non‑AD, and 685 cognitively normal) collected from one private data set and two public data sets from the National Alzheimer's Coordinating Center (NACC) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). We perform multiple atAD vs. non‑AD classification experiments using clinical features and hippocampal volume as well as a comprehensive set of MRI features from across the brain. The best performance is achieved by incorporating additional important MRI features, which outperforms using hippocampal volume alone. Furthermore, we use the Boruta statistical approach to identify and visualize significant brain regions distinguishing between diagnostic groups. Our ML approach improves the percentage of correctly diagnosed atAD cases (the recall) from 52% to 69% for NACC and from 34% to 77% for ADNI, while achieving high precision. The proposed approach has important implications for improving diagnostic accuracy for non‑amnestic atAD in clinical settings using only a clinical testing battery and MRI.
Authors: Gaetano Ferraro, Michele Castellana
Abstract: Biological membranes are dynamic surfaces whose shape and function are critically influenced by protein inclusions (PIs). While membrane deformations induced by PIs have been extensively studied in the small‑deformation regime, a variety of processes involves strong membrane deformations. We investigate the interaction between lipid membranes and PIs in the large deformation (LD) regime, with the finite‑element method. We develop an approximate analytical solution that captures key features of the LD regime. We show that the force exerted by the membrane on a PI displays a non‑monotonic behavior with respect to the PI vertical displacement. The qualitative features of this force appear to be independent of the protein geometry. For two interacting PIs, the membrane‑mediated potential exhibits sub‑power‑law decay with inter‑protein distance, reflecting the complex nature of the elastic medium. The interaction potential shows that conical PIs with identical and opposite orientations repel and attract, respectively, confirming the analogy between PI orientation and electric charge, in the LD regime. In the presence of membrane flows, we identify a characteristic velocity that separates two regimes in which bending rigidity and viscous effects dominate, respectively, implying the onset of flow‑induced deformations above such velocity threshold. Overall, our results provide quantitative predictions for membrane‑protein systems in biologically relevant scenarios involving LDs, with implications for protein sorting, clustering, and membrane trafficking.
Authors: Adrián Rodríguez-Muñoz, William Daspit, Adam Klivans, Antonio Torralba, Constantinos Daskalakis, Giannis Daras
Abstract: We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset‑model co‑evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self‑consuming loops, at each generation, we treat the synthetically improved samples as noisy, but at a slightly lower noise level than the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state‑of‑the‑art performance in unconditional and text‑conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.
Authors: Xuanning Hu, Anchen Li, Qianli Xing, Jinglong Ji, Hao Tuo, Bo Yang
Abstract: Large Language Models (LLMs) possess strong representation and reasoning capabilities, but their application to structure‑based drug design (SBDD) is limited by insufficient understanding of protein structures and unpredictable molecular generation. To address these challenges, we propose Exploration‑Augmented Latent Inference for LLMs (ELILLM), a framework that reinterprets the LLM generation process as an encoding, latent space exploration, and decoding workflow. ELILLM explicitly explores portions of the design problem beyond the model's current knowledge while using a decoding module to handle familiar regions, generating chemically valid and synthetically reasonable molecules. In our implementation, Bayesian optimization guides the systematic exploration of latent embeddings, and a position‑aware surrogate model efficiently predicts binding affinity distributions to inform the search. Knowledge‑guided decoding further reduces randomness and effectively imposes chemical validity constraints. We demonstrate ELILLM on the CrossDocked2020 benchmark, showing strong controlled exploration and high binding affinity scores compared with seven baseline methods. These results demonstrate that ELILLM can effectively enhance LLMs' capabilities for SBDD.
Authors: Yulia Pushkar
Abstract: Light‑driven oxygen formation in the Photosystem II protein is a fundamental process that sustains our biosphere and serves as a blueprint for future clean energy solutions due to its high energy conversion efficiency. The last decade of intense research using advanced physical techniques has delivered new insights into the structure and function of the Mn4CaO5 cluster, the center of the oxygen evolving complex (OEC). However, discrepancies between experimental observations and computational models persist, impeding the understanding of O‑O bond formation and the role of the protein environment in the process. Here we show that i) assignment of the unique OEC oxygen O3, ligated by histidine (His337) via a dynamic H‑bond, as a slow exchanging substrate and ii) its coupling with the O6 oxygen generated at Mn1 in the S2 to S3 transition give the O‑O bond formation mechanism most consistent with all currently available experimental data. The proposal shows how the protein environment can steer O‑O bond formation by charge control via the H‑bond and open coordination of Mn1. The resulting O3‑O6 peroxide is at lower energy than the peroxides in the most studied O5‑O6 bond formation pathway. His337 appears to be similar to the distal His in globins used for management of the O2 and H2O2 intermediates. The new mechanism breaks the prior impasse and will undoubtedly invigorate future detailed studies uncovering its further details.
Authors: Shengjie Xu, Xianbin Ye, Mengran Zhu, Xiaonan Zhang, Shanzhuo Zhang, Xiaomin Fang
Abstract: Identifying protein targets for small molecules, or reverse screening, is essential for understanding drug action, guiding compound repurposing, predicting off‑target effects, and elucidating the molecular mechanisms of bioactive compounds. Despite its critical role, reverse screening remains challenging because accurately capturing interactions between a small molecule and structurally diverse proteins is inherently complex, and conventional step‑wise workflows often propagate errors across decoupled steps such as target structure modeling, pocket identification, docking, and scoring. Here, we present an end‑to‑end reverse screening strategy leveraging HelixFold3, a high‑accuracy biomolecular structure prediction model akin to AlphaFold3, which simultaneously models the folding of proteins from a protein library and the docking of small‑molecule ligands within a unified framework. We validate this approach on a diverse and representative set of approximately one hundred small molecules. Compared with conventional reverse docking, our method improves screening accuracy and demonstrates enhanced structural fidelity, binding‑site precision, and target prioritization. By systematically linking small molecules to their protein targets, this framework establishes a scalable and straightforward platform for dissecting molecular mechanisms, exploring off‑target interactions, and supporting rational drug discovery.
Authors: Yanqiu Zou, Nicco Corduri, Francesco D'Amico, Karol Kolataj, Huaizhou Jin, Zhenrong Zheng, Yifan Yu, Jie Liu, Shukun Weng, Ali Douaki, Jerome Wenger, Shangzhong Jin, Guillermo Acuna, Denis Garoli
Abstract: Surface‑enhanced Raman scattering (SERS) provides critical insights into analyte structure, dynamic processes, and intermolecular interactions at the single‑molecule level. By exploiting the hotspot formation in the vicinity of plasmonic structures, SERS constitutes an established tool for fundamental biological research, particularly for early‑stage disease diagnostics. In this context, the DNA Origami technique, with its high addressability, enables both the assembly of plasmonic nanostructures with nanometric accuracy, and the deterministic placement of a single analyte molecule precisely at the generated hotspot within them. To date, most DNA Origami based nanoantennas rely on gold or silver nanoparticles (NPs), whose plasmonic resonances are confined to the visible spectrum, severely limiting their use in other spectral ranges. To extend the operating range, we have recently established a robust strategy for self‑assembling programmable ultraviolet (UV)‑plasmonic dimer antennas using rhodium nanocubes. Herein, we leverage this tailored architecture to systematically investigate its performance for single‑molecule UV‑SERS. We demonstrate how biofabricated Rh‑dimers can be used to detect the characteristic SERS signal of a single streptavidin molecule linked at the dimer's gap. Our results are validated through polarization‑dependent measurements that yield the expected signal modulation depending on the dimer orientation only for the DNA origami with a protein at the hotspot. This work establishes a highly sensitive and polarization‑tunable UV‑SERS platform, laying a solid foundation for label‑free optical investigation and bio‑spectroscopy of individual biomolecules in the UV spectral range.
Authors: Yiming Ren, Junjie Wee, Xi Chen, Grace Qian, Guo-Wei Wei
Abstract: Genetic mutations frequently disrupt protein structure, stability, and solubility, acting as primary drivers for a wide spectrum of diseases. Despite the critical importance of these molecular alterations, existing computational models often lack interpretability and fail to integrate essential physicochemical interactions. To overcome these limitations, we propose SheafLapNet, a unified predictive framework grounded in the mathematical theory of Topological Deep Learning (TDL) and Persistent Sheaf Laplacian (PSL). Unlike standard Topological Data Analysis (TDA) tools such as persistent homology, which are often insensitive to heterogeneous information, PSL explicitly encodes specific physical and chemical information such as partial charges directly into the topological analysis. SheafLapNet synergizes these sheaf‑theoretic invariants with advanced protein transformer features and auxiliary physical descriptors to capture intrinsic molecular interactions in a multiscale and mechanistic manner. To validate our framework, we employ rigorous benchmarks for both regression and classification tasks. For stability prediction, we utilize the comprehensive S2648 and S350 datasets. For solubility prediction, we employ the PON‑Sol2 dataset, which provides annotations for increased, decreased, or neutral solubility changes. By integrating these multi‑perspective features, SheafLapNet achieves state‑of‑the‑art performance across these diverse benchmarks, demonstrating that sheaf‑theoretic modeling significantly enhances both interpretability and generalizability in predicting mutation‑induced structural and functional changes.
Authors: Lejia Zeng, Xintong Zhang, Yuchan Pei, Lifeng Zhao, Lan Hua, Jincai Yang, Niu Huang
Abstract: Machine learning interatomic potentials (MLIPs) enable efficient modeling of molecular interactions with quantum mechanical (QM) accuracy. However, constructing robust and representative training datasets that capture subtle, system‑specific interaction motifs remains challenging. We introduce PANIP (PAirwise Non‑covalent Interaction Potential), an ensemble MLIP model built upon the NequIP framework and trained on non‑covalent interactions (NCIs) between protein‑derived fragments. PANIP is trained using an automated multi‑fidelity active learning (MFAL) workflow, in which a representative training subset, termed PDB‑FRAGID (PDB Fragment Interaction Dataset), was distilled from an otherwise prohibitively large pool of fragment dimers extracted from the Protein Data Bank (PDB). PANIP retains ωB97X‑D3BJ/def2‑TZVPP‑level accuracy and achieves mean absolute errors below 0.2 kcal/mol on out‑of‑distribution systems, demonstrating excellent transferability across diverse NCI motifs. Compared to the widely used ANI‑2x potential, PANIP delivers substantially lower errors, particularly for charged and strongly interacting dimers. Coupled with a fragmentation‑based energy decomposition scheme, PANIP estimates protein‑ligand binding energies at near force‑field computational cost yet QM‑level accuracy, enabling its use as a fragment‑based scoring function that rivals specialized docking scoring functions.
Authors: Aishani Ghosal, Nicholas E. Lea, Lindsay B. Case, Trevor GrandPre
Abstract: Biomolecular condensates are commonly organized by a small number of scaffold molecules that drive phase separation together with client molecules that do not condense on their own but become selectively recruited into the dense phase. A central open question is how client recruitment feeds back on scaffold interactions to determine condensate composition. Here we address this problem in a reconstituted focal adhesion system composed of focal adhesion kinase (FAK) and phosphorylated p130Cas (Cas) as scaffolds and the adaptor protein paxillin (PXN) as a client. We show that both FAK phosphorylation and PXN recruitment produce a common compositional response in which FAK becomes enriched while Cas is depleted within the condensate. To interpret these observations, we develop two complementary theoretical descriptions. First, within a two‑component Flory‑Huggins framework, we show that phosphorylation can be captured by either strengthening heterotypic FAK‑Cas interactions or increasing the effective number of interaction‑relevant segments on FAK, both of which bias partitioning toward FAK‑rich condensates. Second, we introduce a minimal three‑component Flory‑Huggins theory without an explicit solvent and map it onto an effective two‑component description, demonstrating that client recruitment renormalizes homotypic and heterotypic scaffold interactions. Analytical predictions for the location of the critical point are tested in reconstituted multicomponent systems through PXN addition, showing that client recruitment alone tunes proximity to criticality and reshapes condensate composition. Together, our results reveal distinct yet convergent physical routes by which post‑translational modification and client recruitment control scaffold composition in multicomponent condensates.
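For orientation on the two‑component Flory‑Huggins framework named in this abstract, the sketch below uses the textbook solvent‑free binary form f(phi) = (phi/N_A) ln(phi) + ((1 - phi)/N_B) ln(1 - phi) + chi phi (1 - phi), in units of kT per lattice site. The abstract does not give the authors' parametrization, so the names `n_a`, `n_b`, and `chi`, and the sign convention (positive `chi` favours demixing), are assumptions made here for illustration only.

```python
import math


def fh_free_energy(phi, n_a, n_b, chi):
    """Textbook binary Flory-Huggins free energy per site, in units of kT.

    phi is the volume fraction of component A; n_a and n_b are effective
    segment numbers (hypothetical names, not taken from the paper).
    """
    mixing = (phi / n_a) * math.log(phi) + ((1 - phi) / n_b) * math.log(1 - phi)
    return mixing + chi * phi * (1 - phi)


def fh_curvature(phi, n_a, n_b, chi):
    """Second derivative of the free energy; it vanishes on the spinodal."""
    return 1 / (n_a * phi) + 1 / (n_b * (1 - phi)) - 2 * chi


def fh_critical_point(n_a, n_b):
    """Critical composition and interaction parameter of the binary model."""
    sa, sb = math.sqrt(n_a), math.sqrt(n_b)
    phi_c = sb / (sa + sb)
    chi_c = 0.5 * (1 / sa + 1 / sb) ** 2
    return phi_c, chi_c
```

In this textbook form, raising the segment number of one component lowers the critical `chi`, so a smaller interaction strength is enough to reach phase separation. That is at least qualitatively in line with the abstract's statement that increasing the number of interaction‑relevant segments on FAK shifts the phase behaviour, although the paper's own model may differ in detail.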
Authors: Van Thuy Hoang, O-Joun Lee
Abstract: Molecular property prediction is becoming one of the major applications of graph learning in Web‑based services, e.g., online protein structure prediction and drug discovery. A key challenge arises in few‑shot scenarios, where only a few labeled molecules are available for predicting unseen properties. Recently, several studies have used in‑context learning to capture relationships among molecules and properties, but they face two limitations in: (1) exploiting prior knowledge of functional groups that are causally linked to properties and (2) identifying key substructures directly correlated with properties. We propose CaMol, a context‑aware graph causality inference framework, to address these challenges by using a causal inference perspective, assuming that each molecule consists of a latent causal structure that determines a specific property. First, we introduce a context graph that encodes chemical knowledge by linking functional groups, molecules, and properties to guide the discovery of causal substructures. Second, we propose a learnable atom masking strategy to disentangle causal substructures from confounding ones. Third, we introduce a distribution intervener that applies backdoor adjustment by combining causal substructures with chemically grounded confounders, disentangling causal effects from real‑world chemical variations. Experiments on diverse molecular datasets showed that CaMol achieved superior accuracy and sample efficiency in few‑shot tasks, showing its generalizability to unseen properties. Also, the discovered causal substructures were strongly aligned with chemical knowledge about functional groups, supporting the model interpretability.
Authors: Yusong Wang, Jialun Shen, Zhihao Wu, Yicheng Xu, Shiyin Tan, Mingkun Xu, Changshuo Wang, Zixing Song, Prayag Tiwari
Abstract: Graph Neural Networks (GNNs) have been widely adopted for Protein Representation Learning (PRL), as residue interaction networks can be naturally represented as graphs. Current GNN‑based PRL methods typically rely on single‑perspective graph construction strategies, which capture partial properties of residue interactions, resulting in incomplete protein representations. To address this limitation, we propose MMPG, a framework that constructs protein graphs from multiple perspectives and adaptively fuses them via Mixture of Experts (MoE) for PRL. MMPG constructs graphs from physical, chemical, and geometric perspectives to characterize different properties of residue interactions. To capture both perspective‑specific features and their synergies, we develop an MoE module, which dynamically routes perspectives to specialized experts, where experts learn intrinsic features and cross‑perspective interactions. We quantitatively verify that MoE automatically specializes experts in modeling distinct levels of interaction from individual representations, to pairwise inter‑perspective synergies, and ultimately to a global consensus across all perspectives. Through integrating this multi‑level information, MMPG produces superior protein representations and achieves advanced performance on four different downstream protein tasks.
Authors: Yanyan Zhu, Haim Diamant, David Andelman
Abstract: Chain‑like macromolecules in solution, whether biological or synthetic, transform from an extended conformation to a compact one when temperature or other system parameters change. This collapse transition is relevant in various phenomena, including DNA condensation, protein folding, and the behavior of polymers in solution. We investigate the interplay of chain stiffness and range of attraction between monomers in the collapse of a single polymer chain. We use Monte Carlo simulations based on the pruned‑enriched Rosenbluth method. We demonstrate that the competition between the persistence length, l_p, and the range of attraction, r_c, determines whether the chain's collapse behavior resembles that of flexible chains or stiff ones. When l_p is larger than r_c, the chain collapses sharply with decreasing temperature, whereas if l_p is smaller than r_c, it contracts gradually. Notably, in the regime of small l_p and large r_c, this rounding into a gradual compaction persists upon increasing the chain length and may remain in place in the limit of infinite chain length. Furthermore, for small r_c, the transition temperature (theta‑temperature) increases with l_p, whereas for large r_c the theta‑temperature decreases with l_p. Thus, stiffness promotes collapse for small r_c but suppresses it for large r_c. Our findings are in agreement with recent experiments on the contraction of single‑stranded RNA as compared to double‑stranded DNA, and provide valuable insights for understanding polymer collapse and the essential polymer parameters affecting it.
Authors: Sergio Suárez-Dou, Miguel Gallegos, Kyunghoon Han, Florian N. Brünig, Joshua T. Berryman, Alexandre Tkatchenko
Abstract: Biomolecular thermodynamics and spectroscopy depend on relative conformer energies, local curvatures, and collective dipole fluctuations on the potential‑energy surface. Conventional molecular mechanics force fields enable large‑scale simulations, but their fixed functional forms can misrepresent infrared intensities, mode character, and environment‑dependent vibrational response. Here we assess general‑purpose machine‑learned force fields across small molecules, finite‑temperature infrared spectra, gas‑phase peptides, and monomeric, oligomeric, and solvated protein assemblies. To enable this analysis, we introduce QVib, a dataset of 293 molecules and 1365 conformers, together with peptide amide‑band benchmarks and p53 oligomerization‑domain models, to evaluate vibrational transferability from DFT references to experimental spectra. Across these systems, machine‑learned force fields substantially improve over molecular mechanics in reproducing DFT‑level forces, vibrational frequencies, densities of states, mode eigenvectors, conformational energetics, and experimental infrared spectra. Among models with explicit long‑range electrostatics, SO3LR provides the most favourable accuracy‑cost balance for the biomolecular systems considered. These results show that machine‑learned force‑field dynamics can recover collective, environment‑dependent vibrational landscapes at near‑DFT fidelity, enabling spectroscopically validated biomolecular simulations at force‑field‑like cost.
Authors: Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer
Abstract: Structure‑based and ligand‑based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure‑ and ligand‑based training. ConGLUDe couples a geometric protein encoder that produces whole‑protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre‑defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand‑conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein‑ligand complexes and large‑scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero‑shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state‑of‑the‑art ligand‑conditioned pocket selection. These results highlight the advantages of unified structure‑ligand training and position ConGLUDe as a step toward general‑purpose foundation models for drug discovery.
Authors: Prashant C. Raju
Abstract: Representational similarity analysis and related methods have become standard tools for comparing the internal geometries of neural networks and biological systems. These methods measure what is represented, the alignment between two representational spaces, but not whether that structure is robust. We introduce geometric stability, a distinct dimension of representational quality that quantifies how reliably a representation's pairwise distance structure holds under perturbation. Our metric, Shesha, measures self‑consistency through split‑half correlation of representational dissimilarity matrices constructed from complementary feature subsets. A key formal property distinguishes stability from similarity: Shesha is not invariant to orthogonal transformations of the feature space, unlike CKA and Procrustes, enabling it to detect compression‑induced damage to manifold structure that similarity metrics cannot see. Spectral analysis reveals the mechanism: similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum. Across 2463 encoder configurations in seven domains (language, vision, audio, video, protein sequences, molecular profiles, and neural population recordings), stability and similarity are empirically uncorrelated (ρ=‑0.01). A regime analysis shows this independence arises from opposing effects: geometry‑preserving transformations make the metrics redundant, while compression makes them anti‑correlated, canceling in aggregate. Applied to 94 pretrained models across 6 datasets, stability exposes a "geometric tax": DINOv2, the top‑performing model for transfer learning, ranks last in geometric stability on 5/6 datasets. Contrastive alignment and hierarchical architecture predict stability, providing actionable guidance for model selection in deployment contexts where representational reliability matters.
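The split‑half idea described in this abstract can be illustrated with a minimal sketch: split the feature dimensions into two complementary halves, build a representational dissimilarity matrix (RDM) from each half, and correlate the two. The abstract does not specify the distance, the correlation statistic, or the number of splits, so the Euclidean distance, Pearson correlation, and the `n_splits` and `seed` parameters below are assumptions rather than the Shesha definition.

```python
import math
import random


def rdm_upper_triangle(rows):
    """Flattened upper triangle of the Euclidean RDM over the given items."""
    n = len(rows)
    return [math.dist(rows[i], rows[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def split_half_stability(features, n_splits=20, seed=0):
    """Mean correlation between RDMs built from complementary feature halves.

    features is a list of items, each a list of feature values. The split
    count and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    dims = list(range(len(features[0])))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(dims)
        half = len(dims) // 2
        first, second = dims[:half], dims[half:]
        rdm_a = rdm_upper_triangle([[row[d] for d in first] for row in features])
        rdm_b = rdm_upper_triangle([[row[d] for d in second] for row in features])
        scores.append(pearson(rdm_a, rdm_b))
    return sum(scores) / len(scores)
```

A representation whose distance structure is carried redundantly across features should score near 1 under this sketch, while one whose features are unrelated noise should score near 0.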
Authors: Enso O. Torres Alegre
Abstract: Efficient resolution of neuroinflammation and debris clearance is a key determinant of successful central nervous system regeneration. Regenerative vertebrates such as Danio rerio often exhibit faster immune resolution and debris clearance than mammals, yet the molecular determinants underlying these differences remain incompletely understood. TAM receptor tyrosine kinases (Tyro3, Axl, and Mertk) and their ligands Gas6 and Protein S are central regulators of phagocytosis and immune resolution in the nervous system, but whether intrinsic structural properties of these receptor‑ligand complexes contribute to regenerative efficiency has not been systematically explored.
Here, we present a comparative in silico analysis of TAM receptors and ligands from zebrafish, human, and mouse, integrating sequence evolution, high‑confidence structural modeling, interface characterization, and electrostatic analysis. Despite substantial sequence divergence, ligand‑binding domains display strong structural conservation, supporting a conserved global mode of TAM‑ligand engagement. At the interface level, zebrafish complexes show enhanced electrostatic contributions and increased salt‑bridge density, particularly in the Tyro3‑Protein S interaction. Residue‑level electrostatic analysis reveals clustered interface hotspots that are spatially conserved across species despite evolutionary rewiring of individual contacts.
Together, these results suggest that TAM receptor‑ligand interfaces are evolutionarily tuned through subtle electrostatic and geometric optimization rather than large‑scale structural changes, providing a conserved yet adaptable framework for species‑specific modulation of phagocytic signaling.
Authors: Fabrizio Camerin, Marco Polimeni, Letizia Tavagnacco, Jeffrey C. Everts, Szilard Saringer, Alessandro Gulotta, Nicholas Skar-Gislinge, Anna Stradner, Emanuela Zaccarelli, Peter Schurtenberger
Abstract: The complexity of biomolecular interactions necessitates advanced methodologies to accurately capture their behavior in solution. In this work, we focus on monoclonal antibodies and adopt a multi‑scale coarse‑graining strategy for their modeling, with particular emphasis on the role of electrostatic interactions. Using scattering experiments, theoretical analysis, and large‑scale computer simulations, we explicitly compare two selected case studies that differ markedly in their charge distributions. Through mutually corroborating lines of evidence, we demonstrate that conventional approaches relying on electrostatic screening and implicit charge representations fail to capture the structural and thermodynamic properties of antibody solutions when strong charge heterogeneity is present, even at a moderate (amino acid) level of coarse‑graining. These findings highlight the importance of a correct treatment of electrostatic interactions and ion screening for heterogeneously‑ and oppositely‑charged colloidal and protein systems. Such considerations are essential to move beyond descriptive models towards a truly predictive framework, with direct implications for the formulation of therapeutics and the treatment of other complex soft‑matter systems.
Authors: Mattia Corti, Andrew Ahern, Alain Goriely, Ellen Kuhl, Paola F. Antonietti
Abstract: Accumulation of amyloid beta proteins is a defining feature of Alzheimer's disease, and is usually accompanied by cerebrovascular pathology. Evidence suggests that amyloid beta and cerebrovascular pathology are mutually reinforcing; in particular, amyloid beta suppresses perfusion by constricting capillaries, and hypoperfusion promotes the production of amyloid beta. Here, we propose a whole‑brain model coupling amyloid beta and blood vessels through a hybrid model consisting of a reaction‑diffusion system for the protein dynamics and a porous‑medium model of blood flow within and between the arterial, capillary, and venous vascular networks. We discretize the resulting parabolic‑elliptic system of PDEs by means of a high‑order discontinuous Galerkin method in space and an implicit Euler scheme in time. Simulations in realistic brain geometries demonstrate the emergence of multistability, implying that a sufficiently large pathogenic protein seed is necessary to trigger disease outbreak. Motivated by the "two‑hit vascular hypothesis" of Alzheimer's disease that hypoperfusive vascular damage triggers amyloid beta pathology, we also demonstrate that localized hypoperfusion, in response to injury, can destabilize the healthy steady state and trigger brain‑wide disease outbreak.
Authors: Konstantin N. Moser, Christos N. Likos, Vittoria Sposini
Abstract: We investigate the structure and dynamics of a hard colloid‑star polymer mixture in the range of its arrested phase separation, where an incipient demixing transition is interfering with a nearby vitrification line, focusing on the protein limit (smaller hard component). Soft‑hard mixtures present a rich dynamics, influenced by different parameters such as the concentration of the soft and hard components, the softness of the potential, and the size ratio between the two components. Using coarse‑grained molecular dynamics simulations, we characterize the single‑particle and collective dynamics of the hard colloidal tracers in the soft glassy matrix. The hard tracers show diffusive behavior of the mean squared displacement accompanied by non‑exponential relaxation of the intermediate scattering functions at intermediate length scales and non‑Gaussian displacement distributions. Moreover, we show that the system exhibits arrested phase separation, leading to population splitting and decoupling between self‑ and collective dynamics of the hard colloids. Overall, we demonstrate that the interplay between arrested phase separation and glassiness leads to complex, multiscale phenomena that strongly influence the dynamics of the hard additives in the arrested matrix formed by the soft colloids.
Authors: Holly Masson, Massimiliano Paesani, Ioana M. Ilie
Abstract: Protein phase transitions govern numerous diseases, including neurodegenerative disorders such as Parkinson's and Alzheimer's. In Parkinson's disease, distinct species of the protein alpha‑synuclein undergo phase transitions from highly disordered to ordered beta‑rich states. The emerging species and transitions between them can be reshaped by chaperones, small molecules, peptides or antibodies. Here, we use coarse‑grained simulations to understand the effect of modulators on the thermodynamics and kinetics of alpha‑synuclein transformations and phase transitions. Each protein is represented as a single morphing particle that transforms from a soft sphere (disordered state) to a hard spherocylinder (beta‑rich state), while modulators are modeled as soft isotropic particles mimicking small peptides. The results show that purely repulsive modulators do not alter the final outcome, i.e. fibrils form following the same mechanisms independently of the modulator concentration. Attractive interactions towards the disordered protein slow down fibril formation in a dose‑dependent manner by stabilizing intermediate species, and strong attraction yields persistent disordered heteroclusters. In contrast, specific attraction to the beta‑rich state results in shorter fibrils through direct modulator surface "capping", which introduces kinetic barriers to monomer templating at the fibril ends and inhibits lateral attachment. Together, these results link modulator properties and environmental conditions to the effects on nucleation, fibril elongation and off‑pathway trapping, providing a quantitative roadmap for selecting modulator properties and strategies that redirect phase transitions toward desirable endpoints. Additionally, they provide guiding principles for the development of intervention strategies and the engineering of novel materials with tunable and responsive properties.
Authors: Haotian Gao, Xiangying Zhang, Jingyuan Li, Xinchong Chen, Haojie Wang, Yifei Qi, Renxiao Wang
Abstract: Drug discovery is a time‑consuming and financially intensive process, and virtual screening can accelerate it. Scoring functions are among the tools that guide virtual screening, and their precision is closely tied to screening efficiency. In our previous study, we developed a graph neural network model called PLANET (Protein‑Ligand Affinity prediction NETwork), but it suffers from a defect in representing protein‑ligand contact maps. Incorrect binding modes inevitably lead to poor affinity predictions, so accurate prediction of the protein‑ligand contact map is desired to improve PLANET. In this study, we propose PLANET v2.0 as an upgraded version. The model is trained via a multi‑objective training strategy and incorporates a Mixture Density Network to predict binding modes. In addition to the probability density distributions of non‑covalent interactions, we employ another Gaussian mixture model to describe the relationship between distance and energy of each interaction pair and predict protein‑ligand affinity as a mathematical expectation. On the CASF‑2016 benchmark, PLANET v2.0 demonstrates excellent scoring power, ranking power, and docking power. The screening power of PLANET v2.0 is notably improved compared to PLANET and Glide SP, and it demonstrates robust validation on a commercial ultra‑large‑scale dataset. Given its efficiency and accuracy, PLANET v2.0 can hopefully become one of the practical tools for virtual screening workflows. PLANET v2.0 is freely available at https://www.pdbbind‑plus.org.cn/planetv2.
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract: Deep protein structure predictors such as AlphaFold provide confidence estimates (e.g., pLDDT) that are often miscalibrated and degrade under distribution shifts across experimental modalities, temporal changes, and intrinsically disordered regions. We introduce CalPro, a prior‑aware evidential‑conformal framework for shift‑robust uncertainty quantification. CalPro combines (i) a geometric evidential head that outputs Normal‑Inverse‑Gamma predictive distributions via a graph‑based architecture; (ii) a differentiable conformal layer that enables end‑to‑end training with finite‑sample coverage guarantees; and (iii) domain priors (disorder, flexibility) encoded as soft constraints. We derive structure‑aware coverage guarantees under distribution shift using PAC‑Bayesian bounds over ambiguity sets, and show that CalPro maintains near‑nominal coverage while producing tighter intervals than standard conformal methods in regions where priors are informative. Empirically, CalPro exhibits at most 5% coverage degradation across modalities (vs. 15‑25% for baselines), reduces calibration error by 30‑50%, and improves downstream ligand‑docking success by 25%. Beyond proteins, CalPro applies to structured regression tasks in which priors encode local reliability, validated on non‑biological benchmarks.
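For readers unfamiliar with the finite‑sample coverage guarantees mentioned in this abstract, the sketch below shows standard split conformal prediction for regression. It is not CalPro's differentiable conformal layer, which the abstract does not specify; the function names and the use of absolute residuals as nonconformity scores are assumptions chosen for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals.

    residuals are absolute errors |y - y_hat| on a held-out calibration set.
    Returns infinity when the calibration set is too small for the level.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def conformal_interval(prediction, q):
    """Symmetric prediction interval around a point prediction."""
    return prediction - q, prediction + q
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least 1 - alpha. Distribution shift breaks that assumption, which is the setting the abstract says CalPro is designed to handle.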
Authors: Jinwoo Hwang, Yeongmin Hwang, Tadiwos Meaza, Hyeonbin Bae, Jongse Park
Abstract: Recent computational advances enable protein design pipelines to run end‑to‑end on GPUs, yet their heterogeneous computational behaviors remain undercharacterized at the system level. We implement and profile a representative pipeline at both component and full‑pipeline granularities across varying inputs and hyperparameters. Our characterization identifies generally low GPU utilization and high sensitivity to sequence length and sampling strategies. We outline future research directions based on these insights and release an open‑source pipeline and profiling scripts to facilitate further studies.
Authors: Thomas Vaitses Fontanari, Mariana Recamonde-Mendoza
Abstract: This study explores the use of graph neural networks (GNNs) with hierarchical pooling and multiple convolution layers for cancer classification based on RNA‑seq data. We combine gene expression data from The Cancer Genome Atlas (TCGA) with a precomputed STRING protein‑protein interaction network to classify tissue origin and distinguish between normal and tumor samples. The model employs Chebyshev graph convolutions (K=2) and weighted pooling layers, aggregating gene clusters into 'supernodes' across multiple coarsening levels. This approach enables dimensionality reduction while preserving meaningful interactions. Saliency methods were applied to interpret the model by identifying key genes and biological processes relevant to cancer. Our findings reveal that increasing the number of convolution and pooling layers did not enhance classification performance. The highest F1‑macro score (0.978) was achieved with a single pooling layer, whereas adding more layers resulted in over‑smoothing and performance degradation. Nevertheless, the model proved highly interpretable through gradient methods, identifying known cancer‑related genes and highlighting enriched biological processes, and its hierarchical structure can be used to develop new explainable architectures. Overall, while deeper GNN architectures did not improve performance, the hierarchical pooling structure provided valuable insights into tumor biology, making GNNs a promising tool for cancer biomarker discovery and interpretation.
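The Chebyshev graph convolution mentioned in this abstract can be sketched for a single feature channel as y = sum_k theta_k T_k(L~) x, where T_k are Chebyshev polynomials of the scaled graph Laplacian L~. The sketch below assumes the symmetric normalized Laplacian with its largest eigenvalue approximated by 2, so that L~ = -D^(-1/2) A D^(-1/2), and assumes that K counts the number of polynomial terms (K=2 uses T_0 and T_1). Neither convention is stated in the abstract, and the scalar weights `theta` stand in for the learned weight matrices of a real layer.

```python
import math


def scaled_laplacian(adj):
    """L~ = 2L/lambda_max - I with lambda_max assumed 2, i.e. -D^(-1/2) A D^(-1/2)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[-inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cheb_conv(adj, x, theta):
    """Single-channel Chebyshev filter with len(theta) polynomial terms."""
    lap = scaled_laplacian(adj)
    t_prev, t_curr = x, matvec(lap, x)  # T_0 x and T_1 x
    out = [theta[0] * v for v in x]
    for k in range(1, len(theta)):
        out = [o + theta[k] * t for o, t in zip(out, t_curr)]
        # Recurrence: T_{k+1} x = 2 L~ T_k x - T_{k-1} x
        t_next = [2 * a - b for a, b in zip(matvec(lap, t_curr), t_prev)]
        t_prev, t_curr = t_curr, t_next
    return out
```

With two terms, each output value mixes a node's own feature with a degree‑normalized sum over its direct neighbours, so information travels one hop per layer in this sketch.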
Authors: Fang Wu, Stan Z. Li
Abstract: Protein‑protein interaction (PPI) represents a central challenge within the biology field, and accurately predicting the consequences of mutations in this context is crucial for drug design and protein engineering. Deep learning (DL) has shown promise in forecasting the effects of such mutations, but is hindered by two primary constraints. First, the structures of mutant proteins are often elusive to acquire. Secondly, PPI takes place dynamically, which is rarely integrated into the DL architecture design. To address these obstacles, we present a novel framework named Refine‑PPI with two key enhancements. First, we introduce a structure refinement module trained by a mask mutation modeling (MMM) task on available wild‑type structures, which is then transferred to produce the inaccessible mutant structures. Second, we employ a new kind of geometric network, called the probability density cloud network (PDC‑Net), to capture 3D dynamic variations and encode the atomic uncertainty associated with PPI. Comprehensive experiments on SKEMPI.v2 substantiate the superiority of Refine‑PPI over all existing tools for predicting free energy change. These findings underscore the effectiveness of our hallucination strategy and the PDC module in addressing the absence of mutant protein structure and modeling geometric uncertainty.
Authors: K. S. Kuznetsova, V. A. Pashynska, Z. E. Eremenko
Abstract: This study focuses on developing a metal‑dielectric sensor structure with optimized unit cell geometry for determining the concentration of the protein Immunoglobulin G (IgG) in aqueous solutions. The research combines experimental and theoretical investigations, utilizing the differential microwave dielectrometry method and numerical modeling with COMSOL software. The dependence of the complex permittivity (CP) of IgG water solutions on the protein concentration was obtained experimentally at microwave frequencies using an original microwave dielectrometer setup. It was shown that increasing the IgG concentration decreased the CP values of the solutions studied. The experimentally obtained CP data for the IgG water solutions were used as a basis for numerical modeling of the unit cell of a microwave metal‑dielectric metasurface. The metal‑dielectric metasurface, consisting of a Teflon substrate and planar copper microresonators, was combined with a standard 96‑well microplate used in clinical laboratories. Optimization of the metasurface unit cell revealed that the size and position of the copper microresonators within the unit cell significantly impact the sensor sensitivity for determining the IgG concentration in aqueous solutions. The metasurface with the unit cell containing four copper microresonators provided the most sensitive platform for detecting variations in the IgG concentration in the sample. The frequency shift of the reflection coefficient was directly related to changes in the protein concentration. A calibration graph was developed for effective determination of IgG concentrations in aqueous solutions.
Authors: Manel Gil-Sorribes, Júlia Vilalta-Mor, Isaac Filella-Mercè, Robert Soliva, Álvaro Ciudad, Víctor Guallar, Alexis Molina
Abstract: Accurate drug‑target interaction (DTI) prediction is essential for computational drug discovery, yet existing models often rely on single‑modality predefined molecular descriptors or sequence‑based embeddings with limited representativeness. We propose Tensor‑DTI, a contrastive learning framework that integrates multimodal embeddings from molecular graphs, protein language models, and binding‑site predictions to improve interaction modeling. Tensor‑DTI employs a siamese dual‑encoder architecture, enabling it to capture both chemical and structural interaction features while distinguishing interacting from non‑interacting pairs. Evaluations on multiple DTI benchmarks demonstrate that Tensor‑DTI outperforms existing sequence‑based and graph‑based models. We also conduct large‑scale inference experiments on CDK2 across billion‑scale chemical libraries, where Tensor‑DTI produces chemically plausible hit distributions even when CDK2 is withheld from training. In enrichment studies against Glide docking and Boltz‑2 co‑folder, Tensor‑DTI remains competitive on CDK2 and improves the screening budget required to recover moderate fractions of high‑affinity ligands on out‑of‑family targets under strict family‑holdout splits. Additionally, we explore its applicability to protein‑RNA and peptide‑protein interactions. Our findings highlight the benefits of integrating multimodal information with contrastive objectives to enhance interaction‑prediction accuracy and to provide more interpretable and reliability‑aware models for virtual screening.
Authors: Kevin Yang, Juana Martin Gonzalez, Alireza Ramezani, Paul van der Schoot, Roya Zandi
Abstract: Polymorphism has been observed in viral capsid assembly, demonstrating the ability of identical protein dimers to adopt multiple geometries under the same solution conditions. A well‑studied example is the hepatitis B virus (HBV), which forms two stable capsid morphologies both in vivo and in vitro. These capsids differ in diameter, containing either 90 or 120 protein dimers. Experiments have shown that their relative prevalence depends on the ionic conditions of the solution during assembly. We developed a model that incorporates salt effects by altering the intermolecular binding free energy between capsid proteins, thereby shifting the relative thermodynamic stability of the two morphologies. This model reproduces experimental results on the prevalence ratios of the large and small HBV capsids. We also constructed a kinetic model that captures the time‑dependent ratio of the two morphologies under subcritical capsid concentrations, consistent with experimental data.
Authors: Justin Airas, Bin Zhang
Abstract: Implicit solvent models (ISMs) promise to deliver the accuracy of explicit solvent simulations at a fraction of the computational cost. However, despite decades of development, their accuracy has remained insufficient for many critical applications, particularly for simulating protein folding and the behavior of intrinsically disordered proteins. Developing a transferable, data‑driven ISM that overcomes the limitations of traditional analytical formulas remains a central challenge in computational chemistry. Here we address this challenge by introducing a novel strategy that distills the evolutionary information learned by a protein language model, ESM3, into a computationally efficient graph neural network (GNN). We show that this GNN potential, trained on effective energies from ESM3, is robust enough to drive stable, long‑timescale molecular dynamics simulations. When combined with a standard electrostatics term, our hybrid model accurately reproduces protein folding free‑energy landscapes and predicts the structural ensembles of intrinsically disordered proteins. This approach yields a single, unified model that is transferable across both folded and disordered protein states, resolving a long‑standing limitation of conventional ISMs. By successfully distilling evolutionary knowledge into a physical potential, our work delivers a foundational implicit solvent model poised to accelerate the development of predictive, large‑scale simulation tools.
Authors: Fang Wu, Zhengyuan Zhou, Shuting Jin, Xiangxiang Zeng, Jure Leskovec, Jinbo Xu
Abstract: Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full‑atom peptide co‑design for specific protein receptors. However, the critical role of molecular surfaces in protein‑protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni‑design peptides generation paradigm, called SurfFlow, a novel surface‑based generative algorithm that enables comprehensive co‑design of sequence, structure, and surface for peptides. SurfFlow employs a multi‑modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full‑atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.
Authors: Ilann Amiaud-Plachy, Michael Blank, Oliver Bent, Sebastien Boyer
Abstract: Phage display is a powerful laboratory technique used to study the interactions between proteins and other molecules, whether other proteins, peptides, DNA or RNA. The under‑utilisation of this data in conjunction with deep learning models for protein design may be attributed to high experimental noise levels, the complex nature of data pre‑processing, and the difficulty of interpreting these experimental results. In this work, we propose a novel approach utilising a Bayesian Neural Network within a training loop, in order to simulate the phage display experiment and its associated noise. Our goal is to investigate how understanding the experimental noise and model uncertainty can enable such models to be applied reliably to the interpretation of phage display experiments. We validate our approach using actual binding affinity measurements instead of relying solely on proxy values derived from 'held‑out' phage display rounds.
Authors: Chu Wang, Lin Huang, Xinran Wei, Tao Qin, Arthur Jiang, Lixue Cheng, Jia Zhang
Abstract: Machine learning force fields (MLFFs) have revolutionized molecular simulations by providing quantum mechanical accuracy at the speed of molecular mechanical computations. However, a fundamental reliance of these models on fixed‑cutoff architectures limits their applicability to macromolecular systems where long‑range interactions dominate. We demonstrate that this locality constraint causes force prediction errors to scale monotonically with system size, revealing a critical architectural bottleneck. To overcome this, we establish the systematically designed MolLR25 (Molecules with Long‑Range effect) benchmark up to 1200 atoms, generated using high‑fidelity DFT, and introduce E2Former‑LSR, an equivariant transformer that explicitly integrates long‑range attention blocks. E2Former‑LSR exhibits stable error scaling, achieves superior fidelity in capturing non‑covalent decay, and maintains precision on complex protein conformations. Crucially, its efficient design provides up to 30% speedup compared to purely local models. This work validates the necessity of non‑local architectures for generalizable MLFFs, enabling high‑fidelity molecular dynamics for large‑scale chemical and biological systems.
Authors: Nicco Corduri, Malavika Kayyil Veedu, Yifan Yu, Yanqiu Zou, Jie Liu, Denis Garoli, Guillermo P. Acuna, Jérôme Wenger, Karol Kołątaj
Abstract: Nanoparticles of plasmonic metals have contributed significantly to the development of spectroscopic techniques, enabling strong confinement of electromagnetic fields at the nanoscale and corresponding signal amplification. However, to date, plasmonic applications have been limited mainly to the visible and near‑infrared range, as materials supporting ultraviolet resonances typically exhibit poor chemical stability and lack robust surface functionalisation methods. In this work, we address these limitations by introducing a fully programmable approach to UV plasmonics based on rhodium nanocube dimers assembled using DNA origami templates. We have developed a reliable ligand exchange strategy that allows the functionalisation of rhodium nanocubes with DNA while maintaining their colloidal stability. These DNA‑modified nanocubes act as modular building blocks that can be assembled into dimers with 69% efficiency and an average gap size of 10 nm. The DNA origami design also allows for the deterministic placement of a single streptavidin protein in the plasmonic gap, unlike previous methods based on stochastic diffusion. Experiments with single‑molecule autofluorescence in the UV, supported by numerical simulations, show an increase in brightness of up to 22‑fold, a reduction in fluorescence lifetime, and a more than tenfold increase in the total number of detected photons. By creating a robust and versatile platform for the production of UV‑resonant plasmonic nanoantennas, this work extends the functionality of plasmonics to the deep UV spectrum and opens up new possibilities for label‑free single‑protein spectroscopy.
Authors: Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, Guohong Fu
Abstract: Recent advances in large language models (LLMs) have highlighted the effectiveness of chain‑of‑thought reasoning in symbolic domains such as mathematics and programming. However, our study shows that directly transferring such text‑based reasoning paradigms to protein function understanding is ineffective: reinforcement learning mainly amplifies superficial keyword patterns while failing to introduce new biological knowledge, resulting in limited generalization. We argue that protein function prediction is a knowledge‑intensive scientific task that fundamentally relies on external biological priors and computational tools rather than purely internal reasoning. To address this gap, we propose PFUA, a tool‑augmented protein reasoning agent that unifies problem decomposition, tool invocation, and grounded answer generation. Instead of relying on long unconstrained reasoning traces, PFUA integrates domain‑specific tools to produce verifiable intermediate evidence. Experiments on four benchmarks demonstrate that PFUA consistently outperforms text‑only reasoning models with an average performance improvement of 103%.
Authors: Mohammad Ali Javidian
Abstract: We study the problem of imputing a designated target variable that is systematically missing in a shifted deployment domain, when a Gaussian causal DAG is available from a fully observed source domain. We propose a unified EM‑based framework that combines source and target data through the DAG structure to transfer information from observed variables to the missing target. On the methodological side, we formulate a population EM operator in the DAG parameter space and introduce a first‑order (gradient) EM update that replaces the costly generalized least‑squares M‑step with a single projected gradient step. Under standard local strong‑concavity and smoothness assumptions and a BWY‑style (Balakrishnan, Wainwright, and Yu, 2017) gradient‑stability (bounded missing‑information) condition, we show that this first‑order EM operator is locally contractive around the true target parameters, yielding geometric convergence and finite‑sample guarantees on parameter error and the induced target‑imputation error in Gaussian SEMs under covariate shift and local mechanism shifts. Algorithmically, we exploit the known causal DAG to freeze source‑invariant mechanisms and re‑estimate only those conditional distributions directly affected by the shift, making the procedure scalable to higher‑dimensional models. In experiments on a synthetic seven‑node SEM, the 64‑node MAGIC‑IRRI genetic network, and the Sachs protein‑signaling data, the proposed DAG‑aware first‑order EM algorithm improves target imputation accuracy over a fit‑on‑source Bayesian network and a Kiiveri‑style EM baseline, with the largest gains under pronounced domain shift.
Authors: Brandon Neff, Matthias Heyden
Abstract: Heat dissipation is ubiquitous in living systems, which constantly convert distinct forms of energy into each other. The transport of thermal energy in liquids and even within proteins is well understood but kinetic energy transfer across a heterogeneous molecular boundary provides additional challenges. Here, we use atomistic molecular dynamics simulations under steady‑state conditions to analyze how a protein dissipates surplus thermal energy into the surrounding solvent. We specifically focus on collective degrees of freedom that govern the dynamics of the system from the diffusive regime to mid‑infrared frequencies. Using a fully anharmonic analysis of molecular vibrations, we analyzed their vibrational spectra, temperatures, and heat transport efficiencies. We find that the most efficient energy transfer mechanisms are associated with solvent‑mediated friction. However, this mechanism only applies to a small number of degrees of freedom of a protein. Instead, less efficient vibrational energy transfer in the far‑infrared dominates heat transfer overall due to a large number of vibrations in this frequency range. A notable by‑product of this work is a highly sensitive measure of deviations from energy equi‑partition in equilibrium systems, which can be used to analyze non‑ergodic properties.
Authors: Weisen Yang, Hanqing Zhang, Wangren Qiu, Xuan Xiao, Weizhong Lin
Abstract: Accurate identification of protein binding sites is crucial for understanding biomolecular interaction mechanisms and for the rational design of drug targets. Traditional predictive methods often struggle to balance prediction accuracy with computational efficiency when capturing complex spatial conformations. To address this challenge, we propose an Edge‑aware Graph Attention Network (Edge‑aware GAT) model for the fine‑grained prediction of binding sites across various biomolecules, including proteins, DNA/RNA, ions, ligands, and lipids. Our method constructs atom‑level graphs and integrates multidimensional structural features, including geometric descriptors, DSSP‑derived secondary structure, and relative solvent accessibility (RSA), to generate spatially aware embedding vectors. By incorporating interatomic distances and directional vectors as edge features within the attention mechanism, the model significantly enhances its representation capacity. On benchmark datasets, our model achieves an ROC‑AUC of 0.93 for protein‑protein binding site prediction, outperforming several state‑of‑the‑art methods. The use of directional tensor propagation and residue‑level attention pooling further improves both binding site localization and the capture of local structural details. Visualizations using PyMOL confirm the model's practical utility and interpretability. To facilitate community access and application, we have deployed a publicly accessible web server at http://119.45.201.89:5000/. In summary, our approach offers a novel and efficient solution that balances prediction accuracy, generalization, and interpretability for identifying functional sites in proteins.
Authors: Yi Zhou, Haoyu Jiang, Chenghao Zhu, André Rossi
Abstract: The Edge Interdiction Clique Problem (EICP) aims to remove at most k edges from a graph so as to minimize the size of the largest clique in the remaining graph. This problem captures a fundamental question in graph manipulation: which edges are structurally critical for preserving large cliques? Such a problem is also motivated by practical applications including protein function maintenance and image matching. The EICP is computationally challenging and belongs to a complexity class beyond NP. Existing approaches rely on general mixed‑integer bilevel programming solvers or reformulate the problem into a single‑level mixed integer linear program. However, they are still not scalable when the graph size and interdiction budget k grow. To overcome this, we investigate new mixed integer linear formulations, which recast the problem into a sequence of parameterized Edge Blocker Clique Problems (EBCP). This perspective decomposes the original problem into simpler subproblems and enables tighter modeling of clique‑related inequalities. Furthermore, we propose a two‑stage exact algorithm, RLCM, which first applies problem‑specific reduction techniques to shrink the graph and then solves the reduced problem using a tailored branch‑and‑cut framework. Extensive computational experiments on maximum clique benchmark graphs, large real‑world sparse networks, and random graphs demonstrate that RLCM consistently outperforms existing approaches.
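To make the EICP definition above concrete, the following is a minimal brute‑force sketch on a toy graph. It is not the RLCM algorithm or any of the formulations described in the abstract; it simply enumerates every removal of at most k edges and recomputes the largest clique exhaustively, so it is only usable on very small graphs. The function names and the example graph are hypothetical.

```python
from itertools import combinations


def max_clique_size(nodes, edges):
    """Size of the largest clique, found by exhaustive search (tiny graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    # Try candidate vertex sets from largest to smallest; the first clique wins.
    for size in range(len(nodes), 0, -1):
        for group in combinations(nodes, size):
            if all(frozenset(pair) in edge_set for pair in combinations(group, 2)):
                return size
    return 0


def solve_eicp_brute_force(nodes, edges, k):
    """Return (smallest achievable max-clique size, removed edges) for budget k."""
    best = (max_clique_size(nodes, edges), ())
    for r in range(1, k + 1):
        for removed in combinations(edges, r):
            remaining = [e for e in edges if e not in removed]
            size = max_clique_size(nodes, remaining)
            if size < best[0]:
                best = (size, removed)
    return best


# Hypothetical toy graph: a 4-clique on {0, 1, 2, 3} plus a pendant vertex 4.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]

for budget in range(3):
    size, removed = solve_eicp_brute_force(nodes, edges, budget)
    print(budget, size, removed)
```

On this toy graph the largest clique shrinks by one for each additional edge of budget up to k = 2, which illustrates why the choice of edges, not just their number, matters.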
Authors: Pedro Pessoa, Steve Pressé, S. Banu Ozkan
Abstract: Allostery is a fundamental mechanism of protein regulation and is commonly interpreted as modulating enzymatic activity or product abundance. Here we show that this view is incomplete. Using a stochastic model of allosteric regulation combined with an information‑theoretic analysis, we quantify the mutual information between an enzyme's regulatory state and the states of downstream signaling components. Beyond controlling steady‑state production levels, allostery also regulates the timing and duration over which information is transmitted. By tuning the temporal operating regime of signaling pathways, allosteric regulation enables distinct dynamical outcomes from identical molecular components, providing a physical mechanism for temporal information flow, signaling specificity, and coordination without changes in metabolic pathways.
Authors: Biao-Feng Zeng, Zian Wang, Yuxin Yang, Xufei Ma, Liang Xu, Yi Shen, Long Yi, Yizheng Fang, Ye Tian, Zhenrong Zheng, Yudong Cui, Ji Cao, Ge Bai, Weixiang Ye, Pan Wang, Cuifang Kuang, Joshua B. Edel, Aleksandar P. Ivanov, Xu Liu, Longhua Tang
Abstract: Biological electron transfer (ET) relies on quantum mechanical tunnelling through a dynamically folded protein. Yet, the spatiotemporal coupling between structural fluctuations and electron flux remains poorly understood, largely due to limitations in existing experimental techniques, such as ensemble averaging and non‑physiological operating conditions. Here, we introduce a quantum tunnelling‑integrated optoplasmonic nanotrap (QTOP‑trap), an optoelectronic platform that combines plasmonic optical trapping with real‑time quantum tunnelling measurements. This label‑free approach enables single‑molecule resolution of protein conductance in physiological electrolytes, achieving sub‑3 nm spatial precision and 10‑μs temporal resolution. By synchronising optoelectronic measurements, QTOP‑trap resolves protein‑specific conductance signatures and directly correlates tertiary structure dynamics with conductance using a "protein switch" strategy. This methodology establishes a universal framework for dissecting non‑equilibrium ET mechanisms in individual conformational‑active proteins, with broad implications for bioenergetics research and biomimetic quantum device design.
Authors: Nobuyuki Ota
Abstract: Understanding cellular mechanisms requires integrating information across DNA, RNA, and protein, the three molecular systems linked by the Central Dogma of molecular biology. While domain‑specific foundation models have achieved success for each modality individually, they remain isolated, limiting our ability to model integrated cellular processes. Here we present the Central Dogma Transformer (CDT), an architecture that integrates pre‑trained language models for DNA, RNA, and protein following the directional logic of the Central Dogma. CDT employs directional cross‑attention mechanisms (DNA‑to‑RNA attention models transcriptional regulation, while RNA‑to‑Protein attention models translational relationships), producing a unified Virtual Cell Embedding that integrates all three modalities. We validate CDT v1, a proof‑of‑concept implementation using fixed (non‑cell‑specific) RNA and protein embeddings, on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503, representing 63% of the theoretical ceiling set by cross‑experiment variability (r = 0.797). Attention and gradient analyses provide complementary interpretive windows: in detailed case studies, these approaches highlight largely distinct genomic regions, with gradient analysis identifying a CTCF binding site that Hi‑C data showed as physically contacting both enhancer and target gene. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.
Authors: Biraja Ghoshal
Abstract: Background: Understanding electronic interactions in protein active sites is fundamental to drug discovery and enzyme engineering, but remains computationally challenging due to exponential scaling of quantum mechanical calculations.
Results: We present a quantum‑classical hybrid framework for simulating protein fragment electronic structure using variational quantum algorithms. We construct fermionic Hamiltonians from experimentally determined protein structures, map them to qubits via Jordan‑Wigner transformation, and optimize ground state energies using the Variational Quantum Eigensolver implemented in pure Python. For a 4‑orbital serine protease fragment, we achieve chemical accuracy (< 1.6 mHartree) with 95.3% correlation energy recovery. Systematic analysis reveals three‑phase convergence behaviour with exponential decay (α = 0.95), power law optimization (γ = 1.21), and asymptotic approach. Application to SARS‑CoV‑2 protease inhibition demonstrates predictive accuracy (MAE=0.25 kcal/mol), while cytochrome P450 metabolism predictions achieve 85% site accuracy.
Conclusions: This work establishes a pathway for quantum‑enhanced biomolecular simulations on near‑term quantum hardware, bridging quantum algorithm development with practical biological applications.
Authors: QiWei Meng
Abstract: Large protein language models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization (DPO) are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio‑DPO, a physics‑informed alignment framework that grounds protein language models in thermodynamic stability. Physio‑DPO introduces a magnitude‑aware objective that scales optimization updates according to the energy gap between native structures and physics‑perturbed hard negatives. Experiments show that Physio‑DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self‑consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio‑DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
Authors: Ali Anaissi, Seid Miad Zandavi, Weidong Huang, Junaid Akram, Basem Suleiman, Ali Braytee, Jie Hua
Abstract: Single‑cell data analysis has the potential to revolutionize personalized medicine by characterizing disease‑associated molecular changes at the single‑cell level. Advanced single‑cell multimodal assays can now simultaneously measure various molecules (e.g., DNA, RNA, protein) across hundreds of thousands of individual cells, providing a comprehensive molecular readout. A significant analytical challenge is integrating single‑cell measurements across different modalities. Various methods have been developed to address this challenge, but there has been no systematic evaluation of these techniques with different preprocessing strategies. This study examines a general pipeline for single‑cell data analysis, which includes normalization, data integration, and dimensionality reduction. The performance of different algorithm combinations often depends on the dataset sizes and characteristics. We evaluate six datasets across diverse modalities, tissues, and organisms using three metrics: Silhouette Coefficient Score, Adjusted Rand Index, and Calinski‑Harabasz Index. Our experiments involve combinations of seven normalization methods, four dimensionality reduction methods, and five integration methods. The results show that Seurat and Harmony excel in data integration, with Harmony being more time‑efficient, especially for large datasets. UMAP is the most compatible dimensionality reduction method with the integration techniques, and the choice of normalization method varies depending on the integration method used.
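Of the three metrics named above, the Adjusted Rand Index is the one that compares a clustering against reference labels. As a minimal sketch using only the Python standard library, it can be computed from the standard pair‑counting definition (Hubert and Arabie); this is not the evaluation code of the study, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs co-clustered in both labelings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs co-clustered within each labeling separately (row and column totals).
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Same partition under different label names, then a fully crossed partition.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The score is invariant to how clusters are named, reaches 1 for identical partitions, and can be negative when agreement is worse than expected by chance.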
Authors: Bence Bolgár, András Millinghoffer, Péter Antal
Abstract: Precise probabilistic information about drug‑target interaction (DTI) predictions is vital for understanding limitations and boosting predictive performance. Gaussian processes (GP) offer a scalable framework to integrate state‑of‑the‑art DTI representations and Bayesian inference, enabling novel operations, such as Bayesian classification with rejection, top‑K selection, and ranking. We propose a deep kernel learning‑based GP architecture (DTI‑GP), which incorporates a combined neural embedding module for chemical compounds and protein targets, and a GP module. The workflow continues with sampling from the predictive distribution to estimate a Bayesian precedence matrix, which is used in fast and accurate selection and ranking operations. DTI‑GP outperforms state‑of‑the‑art solutions, and it allows (1) the construction of a Bayesian accuracy‑confidence enrichment score, (2) rejection schemes for improved enrichment, and (3) estimation and search for top‑K selections and ranking with high expected utility.
Authors: Danchen Jia, Dashan Dong, Tongyu Li, Haonan Zong, Jiabei Zhu, Xinyan Teng, Lei Tian, Ji-Xin Cheng
Abstract: Three‑dimensional molecular imaging of living cells is essential for unraveling cellular metabolism and response to therapies. However, existing volumetric methods, including fluorescence microscopy and quantitative phase imaging, either require fluorescent labels or lack chemical specificity. Mid‑infrared (mid‑IR) photothermal microscopy provides label‑free spectroscopic contrast with sub‑micrometer resolution but is limited by slow acquisition rates, precluding 3D live‑cell studies. Here, we present a photothermal relaxation intensity diffraction tomography (PRIDT) system that encodes mid‑IR absorption‑induced refractive index change via a photothermal relaxation scheme and recovers it through intensity diffraction tomography. PRIDT achieves video‑rate volumetric chemical imaging with up to 15 Hz per wavelength and offers lateral and axial resolutions of 264 nm and 1.12 μm over a volumetric field of view of 50 × 50 × 10 μm³. We showcase high‑speed PRIDT imaging of protein and lipid metabolism in ovarian cancer cells and lipid‑droplet dynamics in live cells. PRIDT opens new avenues for rapid, quantitative, three‑dimensional molecular imaging in living systems.
Authors: Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, Yasushi Okuno
Abstract: Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom‑level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino‑acid‑level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM‑BERT, the first encoder‑based peptide language model trained on HELM notation. Based on DeBERTa, HELM‑BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre‑trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM‑BERT significantly outperforms state‑of‑the‑art SMILES‑based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide‑protein interaction prediction. These results demonstrate that HELM's explicit monomer‑ and topology‑aware representations offer substantial data‑efficiency advantages for modeling therapeutic peptides, bridging a long‑standing gap between small‑molecule and protein language models.
Authors: Olaide N. Oyelade, Oliver Hoxey, Yulia Humrye
Abstract: Histopathology images, such as those stained with hematoxylin and eosin (H&E), have proven useful in detecting tumors. However, moving such cancer cases forward for treatment requires accurate information on the amount of human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing H&E and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they fall short of providing pixel‑level localization of HER2 status. In this study, we propose a single end‑to‑end pipeline using a system of vision transformers (ViTs) for HER2 status scoring on whole slide images (WSIs). The method includes patch‑wise processing of H&E WSIs for tumor localization. A novel mapping function is proposed to identify the IHC WSI regions that correspond to malignant regions on H&E. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel‑level annotation of 4‑way HER2 scoring (0, 1+, 2+, and 3+). The proposed method also accurately returns HER2‑negative and HER2‑positive classifications. Privately curated datasets were collaboratively extracted from 13 different cases of H&E and IHC WSIs. Thorough experiments were conducted on the proposed method. The results showed good classification accuracy during tumor localization. In addition, a classification accuracy of 0.94 and a specificity of 0.933 were obtained for the prediction of HER2 status under 4‑way scoring. The applicability of the proposed pipeline was investigated on WSI patches, with performance comparable to that of human pathologists. The findings show the usability of jointly evaluated H&E and IHC images with end‑to‑end ViT‑based models for HER2 scoring.
Authors: Aicha Boutorh, Soumia Bouyahiaoui, Sara Belhadj, Nour El Yakine Guendouz, Manel Kara Laouar
Abstract: Predicting the binding affinity between antigens and antibodies is fundamental to drug discovery and vaccine development. Traditional computational approaches often rely on experimentally determined 3D structures, which are scarce and computationally expensive to obtain. This paper introduces DuaDeep‑SeqAffinity, a novel sequence‑only deep learning framework that predicts affinity scores solely from their amino acid sequences using a dual‑stream hybrid architecture. Our approach leverages pre‑trained ESM‑2 protein language model embeddings, combining 1D Convolutional Neural Networks (CNNs) for local motif detection with Transformer encoders for global contextual representation. A subsequent fusion module integrates these multi‑faceted features, which are then passed to a fully connected network for final score regression. Experimental results demonstrate that DuaDeep‑SeqAffinity significantly outperforms individual architectural components and existing state‑of‑the‑art (SOTA) methods. DuaDeep achieved a superior Pearson correlation of 0.688, an R^2 of 0.460, and a Root Mean Square Error (RMSE) of 0.737, surpassing single‑branch variants ESM‑CNN and ESM‑Transformer. Notably, the model achieved an Area Under the Curve (AUC) of 0.890, outperforming sequence‑only benchmarks and even surpassing structure‑sequence hybrid models. These findings prove that high‑fidelity sequence embeddings can capture essential binding patterns typically reserved for structural modeling. By eliminating the reliance on 3D structures, DuaDeep‑SeqAffinity provides a highly scalable and efficient solution for high‑throughput screening of vast sequence libraries, significantly accelerating the therapeutic discovery pipeline.
Authors: Bintao He, Yiran Cheng, Hongjia Li, Xiang Gao, Xin Gao, Fa Zhang, Renmin Han
Abstract: Understanding protein flexibility and its dynamic interactions with other molecules is essential for studying protein function. Although cryogenic electron microscopy (cryo‑EM) provides an opportunity to observe macromolecular dynamics directly, computational analysis of datasets mixing continuous and discrete structural states remains a formidable challenge. Here we introduce GaussianEM, a Gaussian‑based pseudo‑atomic framework that simultaneously resolves compositional and conformational heterogeneity from cryo‑EM images. GaussianEM employs a dual‑encoder‑single‑decoder architecture to decompose images into learnable Gaussian components, with variability encoded through modulated parameters. This explicit parameterization yields a continuous, intuitive representation of conformational dynamics that inherently preserves local structural integrity. By modeling displacements in Gaussian space, we capture atomic‑scale conformational landscapes, bridging density maps and all‑atom models. In comprehensive experiments, GaussianEM successfully reconstructs complex compositional and conformational variability, and resolves previously unobserved details in public datasets. Quantitative evaluations further confirm its ability to capture broader conformational diversity without sacrificing structural fidelity.
Authors: Xiang Zhang, Jiaqi Wei, Yuejin Yang, Zijie Qiu, Yuhan Chen, Zhiqiang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Wanli Ouyang, Chenyu You, Siqi Sun
Abstract: Chain‑of‑Thought (CoT) prompting has significantly advanced task‑solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non‑answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self‑reflection. However, applying CoT to non‑natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT‑style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self‑correct and leads to substantial performance gains compared to standard pretraining.
Authors: Andrew D. Blevins, Ian K. Quigley
Abstract: Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure‑activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815‑class classifier to predict authors from molecular fingerprints, achieving 60% top‑5 accuracy under scaffold‑based splitting. We then train an activity model that receives only a protein identifier and an author‑probability vector derived from structure, with no direct access to molecular descriptors. This author‑only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab‑independent understanding of chemistry. We analyze the sources of this leakage, propose author‑disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.
Authors: Wyatt A. Curtis, Constantin R. Krüger, Axel P. Tracol Gavard, Jakub Wenz, Marcel Drabbels, Ulrich J. Lorenz
Abstract: Laser flash melting and revitrification experiments have recently improved the time resolution of cryo‑electron microscopy (cryo‑EM) to the microsecond timescale, making it fast enough to observe many of the protein motions that are associated with function. The technique has also opened up a new dimension for cryo‑EM sample preparation, making it possible to deposit compounds onto a cryo‑EM sample while it is frozen, so that upon flash melting, the embedded particles experience an altered environment. For example, we have recently shown that depositing ultrathin silicon dioxide membranes onto a cryo‑EM sample causes particles to detach from the interface upon flash melting, removing preferred particle orientation. These experiments also point towards a new strategy for initiating protein dynamics in time resolved experiments by depositing reagents, which will then mix with the sample upon flash melting. Here, we describe an apparatus for physical vapor deposition of compounds onto cryo‑EM samples, detailing its design and operation. As a demonstration, we determine that the minimum thickness of silicon dioxide sealing membranes in a laser flash melting experiment is just over two monolayers. We propose that our design can form the basis for an integrated platform for microsecond time‑resolved cryo‑EM experiments.
Authors: Diogo Ramos, Bruno Coutinho, Duarte Magano
Abstract: The systematic discovery of effective drug combinations is a challenging problem in modern pharmacology, driven by the combinatorial growth of potential pairings and dosage configurations. Network medicine, modeling diseases and drugs as interconnected modules of the human protein‑protein interactome, has emerged as a new paradigm for understanding disease mechanisms and drug action. In this work, we propose a quantum annealing‑based algorithm for identifying effective drug combinations. Underlying our approach is the biologically motivated principle of "Complementary Exposure", which posits that therapeutic drug combinations target distinct yet complementary regions of a disease module. We translate this into a quadratic unconstrained binary optimisation problem. We test our method for Diabetes Mellitus, Rheumatoid Arthritis, Asthma, and Brain Neoplasms, relying on experimentally validated drug combinations for these diseases. Our simulated quantum annealing experiments reveal that low‑energy configurations align with biologically plausible combinations, demonstrating the algorithm's ability to generate novel predictions for drug combinations.
Authors: Benjamin Tang
Abstract: Recent work has shown an increasing interest in understanding the structure of the endoplasmic reticulum (ER) and how ribosomes are displayed on it. Here we present a model that gives a physical reason why the cell creates different structures of the ER. Due to the diffusion of biomolecules, we find that flat sheets and a matrix of tubules have different regimes of optimized capture efficiency. We extend the model to explain the observed difference in density of ribosomes on the structures of the ER. Due to the capture efficiency of tubules, fewer ribosomes are needed on those structures. For flat sheets, more ribosome coverage at biological separation distance is needed to match the same fraction of relative flux. We then push the model to predict that, depending on the future life of the translated protein and overall demand for protein expression, the cell will utilize one structure of the ER over another. Predictions are compared with experimental data.
Authors: David Graber, Victor Armegioiu, Rebecca Buller, Siddhartha Mishra
Abstract: Predictive machine learning models generally excel on in‑distribution data, but their performance degrades on out‑of‑distribution (OOD) inputs. Reliable deployment therefore requires robust OOD detection, yet this is particularly challenging for irregular 3D graphs that combine continuous geometry with categorical identities and are unordered by construction. Here, we present a probabilistic OOD detection framework for complex 3D graph data built on a diffusion model that learns a density of the training distribution in a fully unsupervised manner. A key ingredient we introduce is a unified continuous diffusion over both 3D coordinates and discrete features: categorical identities are embedded in a continuous space and trained with cross‑entropy, while the corresponding diffusion score is obtained analytically via posterior‑mean interpolation from predicted class probabilities. This yields a single self‑consistent probability‑flow ODE (PF‑ODE) that produces per‑sample log‑likelihoods, providing a principled typicality score for distribution shift. We validate the approach on protein‑ligand complexes and construct strict OOD datasets by withholding entire protein families from training. PF‑ODE likelihoods identify held‑out families as OOD and correlate strongly with prediction errors of an independent binding‑affinity model (GEMS), enabling a priori reliability estimates on new complexes. Beyond scalar likelihoods, we show that multi‑scale PF‑ODE trajectory statistics ‑ including path tortuosity, flow stiffness, and vector‑field instability ‑ provide complementary OOD information. Modeling the joint distribution of these trajectory features yields a practical, high‑sensitivity detector that improves separation over likelihood‑only baselines, offering a label‑free OOD quantification workflow for geometric deep learning.
Authors: Xinyan Zhao, Yi-Ching Tang, Rivaaj Monsia, Victor J. Cantu, Ashwin Kumar Ramesh, Xiaozhong Liu, Zhiqiang An, Xiaoqian Jiang, Yejin Kim
Abstract: Motivation: The clinical efficacy of antibody therapeutics critically depends on high‑affinity target engagement, yet laboratory affinity‑maturation campaigns are slow and costly. In computational settings, most protein language models (PLMs) are not trained to favor high‑affinity antibodies, and existing preference optimization approaches introduce substantial computational overhead without clear affinity gains. Therefore, this work proposes SimBinder‑IF, which converts the inverse folding model ESM‑IF into an antibody sequence generator by freezing its structure encoder and training only its decoder to prefer experimentally stronger binders through preference optimization.
Results: On the 11‑assay AbBiBench benchmark, SimBinder‑IF achieves a 55 percent relative improvement in mean Spearman correlation between log‑likelihood scores and experimentally measured binding affinity compared to vanilla ESM‑IF (from 0.264 to 0.410). In zero‑shot generalization across four unseen antigen‑antibody complexes, the correlation improves by 156 percent (from 0.115 to 0.294). SimBinder‑IF also outperforms baselines in top‑10 precision for ten‑fold or greater affinity improvements. A case study redesigning antibody F045‑092 for A/California/04/2009 (pdmH1N1) shows that SimBinder‑IF proposes variants with substantially lower predicted binding free energy changes than ESM‑IF (mean Delta Delta G ‑75.16 vs ‑46.57). Notably, SimBinder‑IF trains only about 18 percent of the parameters of the full ESM‑IF model, highlighting its parameter efficiency for high‑affinity antibody generation.
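The relative improvements quoted in the Results can be checked directly from the reported Spearman correlations. The snippet below is a minimal sketch of that arithmetic only; the function name `relative_improvement` is a hypothetical helper introduced here for illustration, not part of SimBinder‑IF.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percent change of `improved` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (improved - baseline) / baseline


# Mean Spearman correlations as reported in the abstract.
benchmark = relative_improvement(0.264, 0.410)  # 11-assay AbBiBench benchmark
zero_shot = relative_improvement(0.115, 0.294)  # four unseen complexes

print(f"benchmark: {benchmark:.1f}%")
print(f"zero-shot: {zero_shot:.1f}%")
```

Rounded to whole percentages, these values agree with the 55 percent and 156 percent figures stated above.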
Authors: Adam R. Lamson, Mohammadhossein Firouznia, Michael J. Shelley
Abstract: Cells regulate gene expression in part by forming DNA‑protein condensates in the nucleus. While existing theories describe the equilibrium size and stability of such condensates, their dynamics remain less understood. Here, we use coarse‑grained 3D Brownian‑dynamics simulations to study how long, end‑anchored biopolymers condense over time due to transient crosslinking. By tracking how clusters nucleate, merge, and disappear, we identify two dominant dynamical pathways, ripening and merging, that govern the progression from an uncompacted chain to a single condensate. We show how microscopic kinetic parameters, protein density, and mechanical constraints shape these pathways. Using insights from the simulations, we construct a minimal mechanistic free‑energy model that captures the observed scaling behavior. Together, these results clarify the dynamical determinants of DNA and chromatin reorganization on timescales relevant to gene regulation.
Authors: Christian Lagemann, Sajeda Mokbel, Miro Gondrum, Mario Rüttgers, Jared Callaham, Ludger Paehler, Samuel Ahnert, Nicholas Zolman, Kai Lagemann, Nikolaus Adams, Matthias Meinke, Wolfgang Schröder, Jean-Christophe Loiseau, Esther Lagemann, Steven L. Brunton
Abstract: Modeling and controlling fluid flows is critical for several fields of science and engineering, including transportation, energy, and medicine. Effective flow control can lead to, e.g., lift increase, drag reduction, mixing enhancement, and noise reduction. However, controlling a fluid faces several significant challenges, including high‑dimensional, nonlinear, and multiscale interactions in space and time. Reinforcement learning (RL) has recently shown great success in complex domains, such as robotics and protein folding, but its application to flow control is hindered by a lack of standardized benchmark platforms and the computational demands of fluid simulations. To address these challenges, we introduce HydroGym, a solver‑independent RL platform for flow control research. HydroGym integrates sophisticated flow control benchmarks, scalable runtime infrastructure, and state‑of‑the‑art RL algorithms. Our platform includes 42 validated environments spanning from canonical laminar flows to complex three‑dimensional turbulent scenarios, validated over a wide range of Reynolds numbers. We provide non‑differentiable solvers for traditional RL and differentiable solvers that dramatically improve sample efficiency through gradient‑enhanced optimization. Comprehensive evaluation reveals that RL agents consistently discover robust control principles across configurations, such as boundary layer manipulation, acoustic feedback disruption, and wake reorganization. Transfer learning studies demonstrate that controllers learned at one Reynolds number or geometry adapt efficiently to new conditions, requiring approximately 50% fewer training episodes. The HydroGym platform is highly extensible and scalable, providing a framework for researchers in fluid dynamics, machine learning, and control to add environments, surrogate models, and control algorithms to advance science and technology.
Authors: Muhammad Haris Khan
Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence‑level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench‑Seq, a metadata‑only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino‑acid composition). To approximate "never‑before‑seen" threats, we homology‑cluster the combined dataset at <=40% identity and perform cluster‑level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening‑operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition‑preserving residue shuffles and length‑/composition‑only ablations. Empirically, random splits substantially overestimate robustness relative to homology‑clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench‑Seq is CPU‑only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.
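For readers unfamiliar with the probability-quality metrics named above, the following is a minimal sketch of the Brier score and a binned Expected Calibration Error for binary labels. It assumes equal-width bins over the predicted positive-class probability (15 by default, matching the bin count in the abstract); the abstract does not say which ECE variant the authors use, so this binning scheme and the function names are assumptions made for illustration.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted mean of |observed positive rate - mean probability|.

    Assumption: equal-width bins on the positive-class probability.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(positive_rate - mean_prob)
    return ece
```

A perfectly calibrated bin (for example, ten predictions of 0.1 with exactly one positive) contributes nothing to the ECE, while confident wrong predictions raise both metrics.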
Authors: Akhil Shajan, Danil Kaliakin, Fangchun Liang, Thaddeus Pellegrini, Hakan Doga, Subhamoy Bhowmik, Susanta Das, Antonio Mezzacapo, Mario Motta, Kenneth M. Merz
Abstract: This work presents the implementation of a fragment‑based, quantum‑centric supercomputing workflow for computing molecular electronic structure using quantum hardware. The workflow is applied to predict the relative energies of two conformers of the 300‑atom Trp‑cage miniprotein. The methodology employs wave function‑based embedding (EWF) as the underlying fragmentation framework, in which all atoms in the system are explicitly included in the CI treatment. CI calculations for individual fragments are performed using either sample‑based quantum diagonalization (SQD) for challenging fragments or full configuration interaction (FCI) for trivial fragments. To assess the accuracy of SQD for fragment CI calculations, EWF‑(FCI,SQD) results are compared against EWF‑MP2 and EWF‑CCSD benchmarks. Overall, the results demonstrate that large‑scale electronic configuration interaction (CI) simulations of protein systems containing hundreds or even thousands of atoms can be realized through the combined use of quantum and classical computing resources.
Authors: Matthew Sinclair, Moeen Meigooni, Archit Vasan, Ozan Gokdemir, Xinran Lian, Heng Ma, Yadu Babuji, Alexander Brace, Khalid Hossain, Carlo Siebenschuh, Thomas Brettin, Kyle Chard, Christopher Henry, Venkatram Vishwanath, Rick L. Stevens, Ian T. Foster, Arvind Ramanathan
Abstract: Intrinsically disordered proteins (IDPs) represent crucial therapeutic targets due to their significant role in disease (approximately 80% of cancer‑related proteins contain long disordered regions), but their lack of stable secondary/tertiary structures makes them "undruggable". While recent computational advances, such as diffusion models, can design high‑affinity IDP binders, translating these to practical drug discovery requires autonomous systems capable of reasoning across complex conformational ensembles and orchestrating diverse computational tools at scale. To address this challenge, we designed and implemented StructBioReasoner, a scalable multi‑agent system for designing biologics that can be used to target IDPs. StructBioReasoner employs a novel tournament‑based reasoning framework where specialized agents compete to generate and refine therapeutic hypotheses, naturally distributing computational load for efficient exploration of the vast design space. Agents integrate domain knowledge with access to literature synthesis, AI‑structure prediction, molecular simulations, and stability analysis, coordinating their execution on HPC infrastructure via an extensible federated agentic middleware, Academy. We benchmark StructBioReasoner across Der f 21 and NMNAT‑2 and demonstrate that over 50% of 787 designed and validated candidates for Der f 21 outperformed the human‑designed reference binders from literature, in terms of improved binding free energy. For the more challenging NMNAT‑2 protein, we identified three binding modes from 97,066 binders, including the well‑studied NMNAT2:p53 interface. Thus, StructBioReasoner lays the groundwork for agentic reasoning systems for IDP therapeutic discovery on Exascale platforms.
Authors: Apurba Biswas, Thomas Guérin
Abstract: Rare events refer to qualitatively unlikely events whose realization can nevertheless have important consequences. Typically, the prediction of the kinetics of these events relies on Arrhenius laws, with exponentially distributed waiting times, and no correlations between successive occurrences. However, this description breaks down in the presence of long‑term memory, as has been observed in the contexts of geophysical time series or protein dynamics. So far, existing analytical approaches do not quantify the correlations between rare events due to long‑term memory. Here, for non‑Markovian Gaussian processes, we determine analytically the impact of long‑term memory on the distribution of first and second passage times to a rarely reached threshold. This distribution is non‑exponential, thus going beyond the Arrhenius paradigm. We obtain an explicit expression for the covariance between the first and second passage times, and we predict how the mean time to the next extreme event depends on the previous passage time, illustrating the phenomenon of clustering of extreme events. These analytical results, validated through extensive stochastic simulations, shed light on the strong correlation between successive occurrences of extreme events due to long‑term memory.
Authors: Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan
Abstract: Proteins inherently possess a consistent sequence‑structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine‑grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence‑based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high‑fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD‑Prot, which embeds a continuous‑valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence‑structure modeling. It captures inter‑token dependencies across modalities through a unified absorbing diffusion process, and estimates per‑token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD‑Prot achieves competitive performance in unconditional sequence‑structure co‑generation, motif‑scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state‑of‑the‑art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one‑tenth the budget for modality extension fine‑tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
Authors: Kyril Kavetsky, Sabine Hong, Chih-Yuan Lin, Roger Yang, Marija Drndic
Abstract: Advanced nanopore measurements allow structural probing of molecules with high spatial and temporal resolution. We report high signal‑to‑noise, 1‑10 MHz bandwidth, translocation measurements of the multi‑state folding of heme protein cytochrome c in KCl solution through optimally designed silicon nitride pores of 2.3‑3.3 nm diameter and 3.6‑3.8 nm effective thickness, and an optimal concentration of a denaturant (Gdm‑Cl). The pore diameter is slightly smaller than the protein size, forcing the protein to squeeze through the pore. The sufficiently large pore thickness allows enough time for protein probing at an applied field of approximately 250 kV/cm. Through Bayesian Information Criterion score analysis, current blockades reveal six distinct levels, attributed to specific protein states. We calculate the transition probabilities between the states and the conditional probabilities of the protein leaving the pore from each state. We validate the model by simulating events and comparing them to experimental data.
Authors: Sophia Tang
Abstract: Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three‑dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message‑passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling ‑ from group representations and spherical harmonics, to tensor products, Clebsch‑Gordan decomposition, and the construction of SO(3)‑equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)‑Transformer architectures and explain how they perform equivariant message‑passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self‑contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.
Authors: Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li
Abstract: Biological foundation models (BioFMs), pretrained on large‑scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post‑hoc influence‑guided data pruning framework tailored to biological domains. Our approach introduces a subset‑based self‑influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top‑k Influence (Top I) and Coverage‑Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA‑FM and ESM‑C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein‑related tasks using ESM‑C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence‑guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
Authors: Erwin Frey, Henrik Weyer
Abstract: Intracellular protein patterns govern essential cellular functions by dynamically redistributing proteins between membrane‑bound and cytosolic states, conserving their total numbers. This review presents a theoretical framework for understanding such patterns based on mass‑conserving reaction‑diffusion systems. The emergence, selection, and evolution of patterns are analyzed in terms of mass redistribution and interface motion, resulting in mesoscale laws of coarsening and wavelength selection. A geometric phase‑space perspective provides a conceptual tool to link local reactive equilibria with global pattern dynamics through conserved mass fluxes. The Min protein system of Escherichia coli provides a paradigmatic example, enabling direct comparison between theory and experiment. Successive model refinements capture both the robustness of pattern formation and the diversity of dynamic regimes observed in vivo and in vitro. The Min system thus illustrates how to extract predictive, multiscale theory from biochemical detail, providing a foundation for understanding pattern formation in more complex and synthetic systems.
Authors: Yuhan Chen, Shang Qu, Zhiqiang Gao, Yuejin Yang, Xiang Zhang, Sheng Xu, Xinjie Mao, Liujia Qian, Jiaqi Wei, Zijie Qiu, Chenyu You, Lei Bai, Ning Ding, Tiannan Guo, Bowen Zhou, Siqi Sun
Abstract: Post‑translational modifications (PTMs) serve as a dynamic chemical language regulating protein function, yet current proteomic methods remain blind to a vast portion of the modified proteome. Standard database search algorithms suffer from a combinatorial explosion of search spaces, limiting the identification of uncharacterized or complex modifications. Here we introduce OmniNovo, a unified deep learning framework for reference‑free sequencing of unmodified and modified peptides directly from tandem mass spectra. Unlike existing tools restricted to specific modification types, OmniNovo learns universal fragmentation rules to decipher diverse PTMs within a single coherent model. By integrating a mass‑constrained decoding algorithm with rigorous false discovery rate estimation, OmniNovo achieves state‑of‑the‑art accuracy, identifying 51% more peptides than standard approaches at a 1% false discovery rate. Crucially, the model generalizes to biological sites unseen during training, illuminating the dark matter of the proteome and enabling unbiased comprehensive analysis of cellular regulation.
Authors: La Ode Aman, A Mu'thi Andy Suryadi, Dizky Ramadani Putri Papeo, Hamsidar Hasan, Ariani H Hutuba, Netty Ino Ischak, Yuszda K. Salimi
Abstract: Cancer cell response to targeted therapy arises from complex molecular interactions, making single omics insufficient for accurate prediction. This study develops a model to predict Dabrafenib sensitivity by integrating multiple omics layers (genomics, transcriptomics, proteomics, epigenomics, and metabolomics) with protein network embeddings generated using Graph Convolutional Networks (GCN). Each modality is encoded into low dimensional representations through neural network preprocessing. Protein interaction information from STRING is incorporated using GCN to capture biological topology. An attention based fusion mechanism assigns adaptive weights to each modality according to its relevance. Using GDSC cancer cell line data, the model shows that selective integration of two modalities, especially proteomics and transcriptomics, achieves the best test performance (R2 around 0.96), outperforming all single omics and full multimodal settings. Genomic and epigenomic data were less informative, while proteomic and transcriptomic layers provided stronger phenotypic signals related to MAPK inhibitor activity. These results show that attention guided multi omics fusion combined with GCN improves drug response prediction and reveals complementary molecular determinants of Dabrafenib sensitivity. The approach offers a promising computational framework for precision oncology and predictive modeling of targeted therapies.
Authors: Nadine Candoni, Romain Grossier, Stéphane Veesler
Abstract: This chapter presents an overview of microfluidic devices reported in the literature, used to develop methodologies for nucleation of biomolecules, with crystal size control, and for collecting thermodynamic and kinetic data. Part I is dedicated to the properties of microfluidic devices through materials used for their fabrication and for crystals analysis. Part II describes the variety of microfluidic devices available and how to handle them to produce flows, droplets and/or wells of micrometer size. These devices use crystallization methods inspired by batch processes and they are mainly used for protein crystallization. Part III focuses on fundamental properties of biomolecule crystallization determined using droplet‑based microfluidics: nucleation kinetics, nucleation rate and effective interfacial energy crystal/solution. Part IV explains how the kinetic effect of confinement due to micrometer size, and so nanovolumes, leads to isolation of different phases. These latter are characterized by X‑Ray Diffraction (XRD) and methods to minimize manual handling of crystals for XRD are also presented, with appropriate equipment to store the crystals.
Authors: V. M. Rivilla, E. R. Alonso, W. Song, A. Insausti, A. Maris, F. J. Basterretxea, S. Melandri, I. Jiménez-Serra, E. J. Cocinero
Abstract: Understanding the presence and distribution of prebiotic precursors in the interstellar medium (ISM) is key to tracing the chemical origins of life. Among them, 4‑oxobutanenitrile (HCOCH2CH2CN) has been identified in laboratory simulations as a plausible intermediate in the formation of glutamic acid, a proteinogenic amino acid. Here, we report its gas‑phase rotational spectrum, measured using two complementary techniques: chirped‑pulse Fourier transform microwave (CP‑FTMW) spectroscopy (2‑18 GHz) and free‑jet millimeter‑wave (FJ‑AMMW) absorption spectroscopy (59.6‑80 GHz). Quantum chemical calculations revealed nine low‑energy conformers, of which the TC conformer was assigned based on the measured spectra. The resulting spectroscopic parameters were used to search for the molecule in the ultradeep spectral survey of the G+0.693‑0.027 molecular cloud, located in the Galactic Center. No signal attributable to 4‑oxobutanenitrile was detected. A stringent upper limit to its column density was derived (N < 4 × 10^12 cm^‑2), corresponding to a molecular abundance of < 2.9 × 10^‑11 relative to H2. This upper limit lies well below the observed abundances of simpler structurally related species containing HCO and CN groups, underscoring the challenge of detecting increasingly complex prebiotic molecules in the ISM and the need for future, more sensitive astronomical facilities.
Authors: Youngseung Jeon, Christopher Hwang, Ziwen Li, Taylor Le Lievre, Jesus J. Campagna, Cohn Whitaker, Varghese John, Eunice Jun, Xiang Anthony Chen
Abstract: While drug discovery is vital for human health, the process remains inefficient. Medicinal chemists must navigate a vast protein space to identify target proteins that meet three criteria: physical and functional interactions, therapeutic impact, and docking potential. Prior approaches have provided fragmented support for each criterion, limiting the generation of promising hypotheses for wet‑lab experiments. We present HAPPIER, an AI‑powered tool that supports hypothesis generation with integrated multi‑criteria support for target identification. HAPPIER enables medicinal chemists to 1) efficiently explore and verify proteins in a single integrated graph component showing multi‑criteria satisfaction and 2) validate AI suggestions with domain knowledge. These capabilities facilitate iterative cycles of divergent and convergent thinking, essential for hypothesis generation. We evaluated HAPPIER with ten medicinal chemists, finding that it increased the number of high‑confidence hypotheses and support for the iterative cycle, and further demonstrated the relationship between engaging in such cycles and confidence in outputs.
Authors: Sheng-Ting Hung, Cheng Yan Lee, Chen-Yu Lien, Cheng-Hsuan Chan, Ya-Han Yang, Quark Yungsung Chen, Kuang-Hung Cheng, Kung-Kai Kuo, Li-Wei Tu, Ching-Wen Chang
Abstract: Clinical trials screening KRAS G12D protein for 30 pancreatic ductal adenocarcinoma (PDAC) patients and 30 healthy donors were conducted utilizing an AlGaN/GaN high electron mobility transistor (HEMT) biosensor. All resistance change ratios of PDAC patients are higher than one standard deviation above the mean resistance change ratio obtained from all healthy donors. The results demonstrate the effectiveness of the HEMT biosensor and reveal its potential for early detection of pancreatic cancer with KRAS G12D protein screening.
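One reading of the screening criterion is a cutoff placed at the healthy-donor mean plus one standard deviation. The sketch below assumes that reading; the numbers are made up for illustration and are not data from the study, and the function names are hypothetical.

```python
from statistics import mean, stdev

def screening_threshold(healthy_ratios):
    """Cutoff = mean + one sample standard deviation of healthy-donor ratios."""
    return mean(healthy_ratios) + stdev(healthy_ratios)

def flag_samples(ratios, threshold):
    """Return True for every resistance change ratio above the cutoff."""
    return [r > threshold for r in ratios]

# Hypothetical resistance change ratios (arbitrary units).
healthy = [0.8, 1.0, 1.2, 0.9, 1.1]
patients = [1.9, 2.4, 1.6]

threshold = screening_threshold(healthy)
flags = flag_samples(patients, threshold)
print(f"threshold = {threshold:.3f}, flagged = {flags}")
```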
Authors: Jiayu Weng, Xinyi Zhu, Jing Liu, Linyuan Lü, Pan Zhang, Ying Tang
Abstract: Chemical reaction networks are widely used to model stochastic dynamics in chemical kinetics, systems biology and epidemiology. Solving the chemical master equation that governs these systems poses a significant challenge because the state space grows exponentially with system size. The development of autoregressive neural networks offers a flexible framework for this problem; however, its efficiency is limited, especially for high‑dimensional systems and in scenarios with rare events. Here, we push the frontier of the neural‑network approach by exploiting faster optimizations such as natural gradient descent and the time‑dependent variational principle, achieving a 5‑ to 22‑fold speedup, and by leveraging enhanced‑sampling strategies to capture rare events. We demonstrate reduced computational cost and higher accuracy over the previous neural‑network method in challenging reaction networks, including the mitogen‑activated protein kinase (MAPK) cascade network, the hitherto largest biological network handled by the previous approaches of solving the chemical master equation. We further apply the approach to spatially extended reaction‑diffusion systems, the Schlögl model with rare events, on two‑dimensional lattices, beyond the recent tensor‑network approach that handles one‑dimensional lattices. The present approach thus enables efficient modeling of chemical reaction networks in general.
Authors: Yu Chen, Qi Zhang, Yuanhong Teng, Chihang Luo, Zhijie Li, Jinpeng Liu, Ya Wang, Fazhan Shi, Jiangfeng Du
Abstract: Nuclear magnetic resonance (NMR) at the single‑molecule level with atomic resolution holds transformative potential for structural biology and surface chemistry. Near‑surface solid‑state spin sensors with optical readout ability offer a promising pathway toward this goal. However, their extreme proximity to target molecules demands exceptional robustness against surface‑induced perturbations. Furthermore, life science applications require these sensors to operate in biocompatible spectral ranges that minimize photodamage. In this work, we demonstrate that the PL6 quantum defect in 4H silicon carbide (4H‑SiC) can serve as a robust near‑infrared spin sensor. This sensor operates at tissue‑transparent wavelengths and exhibits exceptional near‑surface stability even at a depth of 2 nm. Using shallow PL6 centers, we achieve nanoscale NMR detection of proton (^1H) spins in immersion oil and fluorine (^19F) spins in Fomblin, attaining a detection volume of (3 nm)^3 and a sensitivity reaching the requirement for single‑proton spin detection. This work establishes 4H‑SiC quantum sensors as a compelling platform for nanoscale magnetic resonance, with promising applications in probing low‑dimensional water phases, protein folding dynamics, and molecular interactions.
Authors: Min Li, Qi Zhang, Xi Kong, Sheng Zhao, Bin-Bin Pan, Ziting Sun, Pei Yu, Zhecheng Wang, Mengqi Wang, Wentao Ji, Fei Kong, Guanglei Cheng, Si Wu, Ya Wang, Sanyou Chen, Xun-Cheng Su, Fazhan Shi
Abstract: The investigation of biomolecular interactions at the single‑molecule level has emerged as a pivotal research area in life science, particularly through optical, mechanical, and electrochemical approaches. Spins, which exist widely in biological systems, offer a unique degree of freedom for detecting such interactions. However, most previous studies of the spin degree of freedom have been confined to ensemble‑level detection. Here, we developed a molecular interaction analysis method approaching the single‑molecule level based on relaxometry using a quantum sensor, the nitrogen‑vacancy (NV) center in diamond. Experiments utilized an optimized diamond surface functionalized with a polyethylenimine nanogel layer, achieving ~10 nm average protein distance and mitigating interfacial steric hindrance. We then measured the strong interaction between streptavidin and spin‑labeled biotin complexes, as well as the weak interaction between bovine serum albumin and biotin complexes, at both the micrometer scale and nanoscale. For the micrometer‑scale measurements using ensemble NV centers, we re‑examined the often‑neglected fast relaxation component and proposed a relaxation rate evaluation method, substantially enhancing the measurement sensitivity. Furthermore, we achieved nanoscale detection approaching the single‑molecule level using single NV centers. This methodology holds promise for applications in molecular screening, identification and kinetic studies at the single‑molecule level, offering critical insights into molecular function and activity mechanisms.
Authors: Sarwan Ali, Taslim Murad
Abstract: Early detection and characterization of coronavirus disease (COVID‑19), caused by SARS‑CoV‑2, remain critical for effective clinical response and public‑health planning. The global availability of large‑scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree‑based methods are computationally intensive and do not scale efficiently to today's multi‑million‑sequence datasets. Similarly, current embedding‑based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large‑scale analysis. In this study, we focus on the most prevalent SARS‑CoV‑2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low‑dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state‑of‑the‑art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large‑scale viral sequence analysis.
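To illustrate how hashing can yield a fixed-length, alignment-free representation, the sketch below counts overlapping k-mers into a small number of hash buckets (feature hashing). This is a generic technique chosen for illustration, not necessarily the embedding proposed in the paper; the function name, `k`, `dim`, and the example string are all assumptions.

```python
import hashlib

def hashed_kmer_embedding(sequence, k=3, dim=64):
    """Map a sequence to a dim-length vector of normalized hashed k-mer counts."""
    vec = [0.0] * dim
    n_kmers = len(sequence) - k + 1
    if n_kmers <= 0:
        return vec  # sequence shorter than k: nothing to count
    for i in range(n_kmers):
        kmer = sequence[i:i + k]
        # md5 gives a hash that is stable across runs (built-in hash() is salted).
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    # Normalize so sequences of different lengths are comparable.
    return [v / n_kmers for v in vec]

# Short illustrative amino-acid string; no alignment is needed.
emb = hashed_kmer_embedding("MFVFLVLLPLVSSQCVNLT", k=3, dim=16)
print(len(emb), round(sum(emb), 6))
```

The resulting vectors can be passed to any standard classifier; a larger `dim` reduces hash collisions at the cost of a longer vector.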
Authors: Yongkai Chen, Samuel WK Wong, SC Kou
Abstract: The remarkable success of AlphaFold2 in providing accurate atomic‑level prediction of protein structures from their amino acid sequence has transformed approaches to the protein folding problem. However, its core paradigm of mapping one sequence to one structure may only be appropriate for single‑fold proteins with one stable conformation. Metamorphic proteins, which can adopt multiple distinct conformations, have conformational diversity that cannot be adequately modeled by AlphaFold2. Hence, classifying whether a given protein is metamorphic or single‑fold remains a critical challenge for both laboratory experiments and computational methods. To address this challenge, we developed a novel classification framework by re‑purposing AlphaFold2 to generate conformational ensembles via a multiple sequence alignment sampling method. From these ensembles, we extract a comprehensive set of features characterizing the conformational ensemble's modality and structural dispersion. A random forest classifier trained on a carefully curated benchmark dataset of known metamorphic and single‑fold proteins achieves a mean AUC of 0.869 with cross‑validation, demonstrating the effectiveness of our integrated approach. Furthermore, by applying our classifier to 600 randomly sampled proteins from the Protein Data Bank, we identified several potential metamorphic protein candidates, including the 40S ribosomal protein S30, whose conformational change is crucial for its secondary function in antimicrobial defense. By combining AI‑driven protein structure prediction with statistical learning, our work provides a powerful new approach for discovering metamorphic proteins and deepens our understanding of their role in molecular function.
Authors: Salomé Guilbert, Cassandra Masschelein, Jeremy Goumaz, Bohdan Naida, Philippe Schwaller
Abstract: Force field‑based molecular dynamics (MD) simulations are indispensable for probing the structure, dynamics, and functions of biomolecular systems, including proteins and protein‑ligand complexes. Despite their broad utility in drug discovery and protein engineering, the technical complexity of MD setup, encompassing parameterization, input preparation, and software configuration, remains a major barrier for widespread and efficient usage. Agentic LLMs have demonstrated their capacity to autonomously execute multi‑step scientific processes, but to date they have not been successfully used to automate protein‑ligand MD workflows. Here, we present DynaMate, a modular multi‑agent framework that autonomously designs and executes complete MD workflows for both protein and protein‑ligand systems, and offers free energy binding affinity calculations with the MM/PB(GB)SA method. The framework integrates dynamic tool use, web search, PaperQA, and a self‑correcting behavior. DynaMate comprises three specialized modules, interacting to plan the experiment, perform the simulation, and analyze the results. We evaluated its performance across twelve benchmark systems of varying complexity, assessing success rate, efficiency, and adaptability. DynaMate reliably performed full MD simulations, corrected runtime errors through iterative reasoning, and produced meaningful analyses of protein‑ligand interactions. This automated framework paves the way toward standardized, scalable, and time‑efficient molecular modeling pipelines for future biomolecular and drug design applications.
Authors: Mengren Liu, Yixiang Zhang, Yiming Zhang
Abstract: Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody‑specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state‑of‑the‑art PLMs (AntiBERTa, BioBERT, and ESM2) against a general‑purpose language model (GPT‑2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody‑specific models like AntiBERTa naturally learn to focus on complementarity‑determining regions (CDRs), while general protein models benefit significantly from explicit CDR‑focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
Authors: Elisabeth Gruber, Lars H. Andersen, Laurence H. Stanley, Jan R. R. Verlet, Ivan S. Avdonin, Anastasia V. Bochenkova
Abstract: The functional properties of photoactive proteins are governed by the interplay between bright and dark excited states. While the bright states are well‑studied, the dark states, which are fundamental to photostability and light harvesting, are notoriously difficult to characterize. Here, we report the direct observation and full characterization of an optically dark, low‑lying singlet excited state in the isolated anion of the meta green fluorescent protein (GFP) chromophore. Using a combination of ultrafast time‑resolved action‑absorption and photoelectron spectroscopy, we have captured the formation of this state in 100 fs and measured its remarkably long lifetime of 94 ps. We unambiguously assign its charge‑transfer character and reveal the precise trapping mechanism through high‑level ab initio calculations. Our findings uncover a photoprotective mechanism in biomolecular anions where ultrafast internal conversion quenches electron emission, stabilizing long‑lived electronic excitation even when the energy exceeds the electron detachment threshold.
Authors: Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu
Abstract: Drug discovery is a time‑consuming and expensive process, with traditional high‑throughput and docking‑based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow‑based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment‑based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed‑loop targeted molecular design. Through the three key components, Trio enables context‑aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state‑of‑the‑art approaches with improved binding affinity (+7.85%), drug‑likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold. By combining generalization, plausibility, and interpretability, Trio establishes a closed‑loop generative paradigm that redefines how chemical space can be navigated, offering a transformative foundation for the next era of AI‑driven drug discovery.
Authors: Jiayu Qin, Zhengquan Luo, Guy Tadmor, Changyou Chen, David Zeevi, Zhiqiang Xu
Abstract: Predicting molecule‑protein interactions (MPIs) is a fundamental task in computational biology, with crucial applications in drug discovery and molecular function annotation. However, existing MPI models face two major challenges. First, the scarcity of labeled molecule‑protein pairs significantly limits model performance, as available datasets capture only a small fraction of biologically relevant interactions. Second, most methods rely solely on molecular and protein features, ignoring broader biological context such as genes, metabolic pathways, and functional annotations that could provide essential complementary information. To address these limitations, our framework first aggregates diverse biological datasets, including molecular, protein, gene, and pathway‑level interactions, and then develops an optimal transport‑based approach to generate high‑quality pseudo‑labels for unlabeled molecule‑protein pairs, leveraging the underlying distribution of known interactions to guide label assignment. By treating pseudo‑labeling as a mechanism for bridging disparate biological modalities, our approach enables the effective use of heterogeneous data to enhance MPI prediction. We evaluate our framework on multiple MPI datasets, including virtual screening tasks and protein retrieval tasks, demonstrating substantial improvements over state‑of‑the‑art methods in prediction accuracy and zero‑shot ability across unseen interactions. Beyond MPI prediction, our approach provides a new paradigm for leveraging diverse biological data sources to tackle problems traditionally constrained by single‑ or bi‑modal learning, paving the way for future advances in computational biology and drug discovery.
Authors: Amin Tavakoli, Raswanth Murugan, Ozan Gokdemir, Arvind Ramanathan, Frances Arnold, Anima Anandkumar
Abstract: Supervised fine‑tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high‑quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain‑specific filters to construct high‑quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of protein language model (PLM) and the protein system, we demonstrate its effectiveness with a genome‑scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine‑tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.
Authors: Jairo Rondón, Ginger Urrutia, Angel Gonzalez-Lizardo
Abstract: Non‑thermal plasma (NTP) surface activation has become a powerful and versatile strategy to engineer the interfacial properties of biomedical polymers whose intrinsic hydrophobicity limits their biological performance. In polymers such as polylactic acid (PLA) and polycarbonate (PC), NTP promotes the controlled incorporation of polar functional groups, increases surface energy, modifies dielectric behavior, and generates micro‑roughness that collectively enhance protein adsorption and early cell adhesion. This review synthesizes and critically evaluates evidence across four complementary analytical pillars (contact‑angle theory, dielectric impedance spectroscopy, FT‑IR chemical mapping, and optical microscopy) to construct an integrated framework for interpreting plasma‑induced chemical and morphological transformations.
The convergence of multimodal results demonstrates that NTP consistently produces chemically active, polar, and moderately textured surfaces that support robust initial cell‑material interactions. Furthermore, combining wettability, dielectric, and spectroscopic analysis enables the identification of activation pathways, the assessment of hydrophobic recovery dynamics, and the development of quantitative correlations between dielectric parameters and biological response. However, the literature also reveals key methodological gaps, including the limited use of unified multimodal protocols, insufficient evaluation of temporal stability, and a lack of predictive dielectric‑biological models.
By articulating these advances and limitations within a unified conceptual scheme, this review provides a roadmap for future research aimed at standardizing characterization workflows and enabling the rational design of next‑generation plasma‑functionalized biomaterials for tissue‑engineering scaffolds, implantable devices, and advanced drug‑delivery systems.
Authors: Peter W Fields, Vudtiwat Ngampruetikorn, David J Schwab, Stephanie E Palmer
Abstract: Generative models of complex systems often require post‑hoc parameter adjustments to produce useful outputs. For example, energy‑based models for protein design are sampled at an artificially low "temperature" to generate novel, functional sequences. This temperature tuning is a common yet poorly understood heuristic used across machine learning contexts to control the trade‑off between generative fidelity and diversity. Here, we develop an interpretable, physically motivated framework to explain this phenomenon. We demonstrate that in systems with a large "energy gap" (separating a small fraction of meaningful states from a vast space of unrealistic states), learning from sparse data causes models to systematically overestimate high‑energy state probabilities, a bias that lowering the sampling temperature corrects. More generally, we characterize how the optimal sampling temperature depends on the interplay between data size and the system's underlying energy landscape. Crucially, our results show that lowering the sampling temperature is not always desirable; we identify the conditions where raising it results in better generative performance. Our framework thus casts post‑hoc temperature tuning as a diagnostic tool that reveals properties of the true data distribution and the limits of the learned model.
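The effect of sampling temperature can be seen in a toy Boltzmann distribution, p_i ∝ exp(−E_i/T). The sketch below uses a made-up landscape with two low-energy states separated by an energy gap from fifty high-energy states; the landscape and all names are illustrative assumptions, not the model analyzed in the paper.

```python
import math

def boltzmann(energies, temperature):
    """Return probabilities p_i proportional to exp(-E_i / T), computed stably."""
    scaled = [-e / temperature for e in energies]
    shift = max(scaled)  # subtract the max to avoid overflow in exp
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def low_energy_mass(energies, temperature, n_low):
    """Total probability assigned to the first n_low (low-energy) states."""
    return sum(boltzmann(energies, temperature)[:n_low])

# Toy landscape: 2 "meaningful" states at E=0, 50 "unrealistic" states at E=3.
energies = [0.0, 0.0] + [3.0] * 50

for T in (1.0, 0.5):
    print(f"T={T}: mass on low-energy states = {low_energy_mass(energies, T, 2):.3f}")
```

Lowering T concentrates probability on the low-energy states, which is the correction described above when a learned model places too much weight on high-energy states.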
Authors: Hamsini Ramanathan, Roman Bushuiev, Matouš Soldát, Jirí Kohout, Téo Hebra, Joshua David Smith, Josef Sivic, Tomáš Pluskal
Abstract: Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front‑line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine‑tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine‑tuning of a protein language model on a carefully curated, enzyme‑class‑specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
Authors: Manzi Kevin Maxime
Abstract: Predicting protein secondary structures such as alpha helices, beta sheets, and coils from amino acid sequences is essential for understanding protein function. This work presents a transformer‑based model that applies attention mechanisms to protein sequence data to predict structural motifs. A sliding‑window data augmentation technique is used on the CB513 dataset to expand the training samples. The transformer shows strong ability to generalize across variable‑length sequences while effectively capturing both local and long‑range residue interactions.
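Sliding-window augmentation can be sketched as cutting each labeled sequence into overlapping fixed-length fragments, each paired with the matching slice of its per-residue labels. The window size, stride, and example strings below are assumptions for illustration; the abstract does not state the settings used on CB513.

```python
def sliding_windows(sequence, labels, window=5, stride=1):
    """Split a residue sequence and its per-residue labels into overlapping windows."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if len(sequence) < window:
        return [(sequence, labels)]  # too short to split: keep as a single sample
    return [
        (sequence[i:i + window], labels[i:i + window])
        for i in range(0, len(sequence) - window + 1, stride)
    ]

# Hypothetical 10-residue sequence with helix (H) / coil (C) labels.
pairs = sliding_windows("MKTAYIAKQR", "CCHHHHHHCC", window=4, stride=2)
for fragment, fragment_labels in pairs:
    print(fragment, fragment_labels)
```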
Authors: Daniele Loco, Kisa Barkemeyer, Andre R. R. Carvalho, Jean-Philip Piquemal
Abstract: Demonstrating the practical utility of Noisy Intermediate‑Scale Quantum (NISQ) hardware for recurrent tasks in Computer‑Aided Drug Discovery is of paramount importance. We tackle this challenge by performing three‑dimensional protein pockets hydration‑site prediction on a quantum computer. Formulating the water placement problem as a Quadratic Unconstrained Binary Optimization (QUBO), we use a hybrid approach coupling a classical three‑dimensional reference‑interaction site model (3D‑RISM) to an efficient quantum optimization solver, to run various hardware experiments up to 123 qubits. Matching the precision of classical approaches, our results reproduced experimental predictions on real‑life protein‑ligand complexes. Furthermore, through a detailed resource estimation analysis, we show that accuracy can be systematically improved with increasing number of qubits, indicating that full quantum utility is in reach. Finally, we provide evidence that advantageous situations could be found for systems where classical optimization struggles to provide optimal solutions. The method has potential for assisting simulations of protein‑ligand complexes for drug lead optimization and setup of docking calculations.
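A QUBO asks for the binary vector x that minimizes the sum of Q[i][j]·x[i]·x[j]. The toy instance below is a hypothetical three-site water-placement problem, with negative diagonal terms rewarding an occupied site and positive off-diagonal terms penalizing two occupied neighboring sites. It is solved by classical brute force purely to show the formulation; the coefficients are invented, and the paper's 3D‑RISM‑derived terms and quantum solver are not reproduced here.

```python
from itertools import product

def qubo_energy(x, Q):
    """Energy of binary vector x under upper-triangular QUBO matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

def brute_force_qubo(Q):
    """Enumerate all 2^n assignments; only feasible for very small n."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: qubo_energy(x, Q))
    return best, qubo_energy(best, Q)

# Hypothetical coefficients: diagonal = site reward, off-diagonal = clash penalty.
Q = [
    [-1.0, 2.0, 0.0],
    [0.0, -1.0, 2.0],
    [0.0, 0.0, -1.0],
]

best, energy = brute_force_qubo(Q)
print(best, energy)
```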
Authors: Ethan Decker, Christopher Watson, Junyu Zhou, Yuhao Liu, Chenxu Liu, Ang Li, Gushu Li, Samuel Stein
Abstract: Compiling shallow and accurate quantum circuits for Hamiltonian simulation remains challenging due to hardware constraints and the combinatorial complexity of minimizing gate count and circuit depth. Existing optimization method pipelines rely on hand‑engineered classical heuristics, which cannot learn input‑dependent structure and therefore miss substantial opportunities for circuit reduction.
We introduce F2, an offline reinforcement learning framework that exploits free‑fermionic structure to efficiently compile Trotter‑based Hamiltonian simulation circuits. F2 provides (i) a reinforcement‑learning environment over classically simulatable free‑fermionic subroutines, (ii) architectural and objective‑level inductive biases that stabilize long‑horizon value learning, and (iii) a reversible synthetic‑trajectory generation mechanism that consistently yields abundant, guaranteed‑successful offline data.
Across benchmarks spanning lattice models, protein fragments, and crystalline materials (12‑222 qubits), F2 reduces gate count by 47% and depth by 38% on average relative to strong baselines (Qiskit, Cirq/OpenFermion) while maintaining average errors of 10^(‑7). These results show that aligning deep reinforcement learning with the algebraic structure of quantum dynamics enables substantial improvements in circuit synthesis, suggesting a promising direction for scalable, learning‑based quantum compilation.
Authors: Felix Hartmann, Vivek Unikandanunni, Matias Bargheer, Eric E. Fullerton, Stefano Bonetti, Janet Anders
Abstract: Memory effects arise in many complex systems, from protein folding, to the spreading of epidemics and financial decisions. While so‑called non‑Markovian dynamics is common in larger systems with interacting components, observations in fundamental physical systems have been confined to specifically engineered cases. Here, we report the experimental observation of non‑Markovian dynamics in an elemental material, crystalline cobalt. By driving this material with an intense terahertz electromagnetic field, we bring its magnetisation into a non‑equilibrium state and follow its evolution. We measure the sample's low temperature magnetic response in the time domain which leads to an unexpectedly rich multi‑peaked spectrum in the Fourier domain, that cannot be explained by established models. We use open quantum system theory, which predicts a non‑Markovian memory kernel in the dynamical equations to capture the fundamental interaction between the spin system and the phonon bath. Simulations based on this theory produce a multi‑peaked spectrum, which matches the measured one. Our non‑Markovian approach is also able to reproduce the modification of the spectrum at higher temperatures. Our findings demonstrate that non‑Markovian effects are observable at a much more fundamental level than previously thought, opening the door to their exploration and control in a broad range of condensed matter systems.
Authors: James King, Lewis Cornwall, Andrei Cristian Nica, James Day, Aaron Sim, Neil Dalchau, Lilly Wollman, Joshua Meyers
Abstract: Accurate prediction of protein‑protein binding affinity is vital for understanding molecular interactions and designing therapeutics. We adapt Boltz‑2, a state‑of‑the‑art structure‑based protein‑ligand affinity predictor, for protein‑protein affinity regression and evaluate it on two datasets, TCR3d and PPB‑affinity. Despite high structural accuracy, Boltz‑2‑PPI underperforms relative to sequence‑based alternatives in both small‑ and larger‑scale data regimes. Combining embeddings from Boltz‑2‑PPI with sequence‑based embeddings yields complementary improvements, particularly for weaker sequence models, suggesting different signals are learned by sequence‑ and structure‑based models. Our results echo known biases associated with training with structural data and suggest that current structure‑based representations are not primed for performant affinity prediction.
Authors: Zihan Pengmei, Spencer C. Guo, Chatipat Lorpaiboon, Aaron R. Dinner
Abstract: Molecular dynamics simulations can generate atomically detailed trajectories of complex systems, but analyzing these dynamics can be challenging when systems lack well‑established quantitative descriptors (features). Graph neural networks (GNNs) in which messages are passed between nodes that represent atoms that are spatial neighbors promise to obviate manual feature engineering, but the use of GNNs with biomolecular systems of more than a few hundred residues has been limited in the context of analyzing dynamics by both difficulties in capturing the details of long‑range interactions with message passing and the memory and runtime requirements associated with large graphs. Here, we show how local information can be aggregated to reduce memory and runtime requirements without sacrificing atomic detail. We demonstrate that this approach opens the door to analyzing simulations of protein‑nucleic acid complexes with thousands of residues on single GPUs within minutes. For systems with hundreds of residues, for which there are sufficient data to make quantitative comparisons, we show that the approach improves performance and interpretability.
Authors: Stella Brown, Nicolas Preisig, Autumn Davis, Brian Hutchinson, Filip Jagodzinski
Abstract: Understanding how protein mutations affect protein structure is essential for advancements in computational biology and bioinformatics. We introduce PRIMRose, a novel approach that predicts energy values for each residue given a mutated protein sequence. Unlike previous models that assess global energy shifts, our method analyzes the localized energetic impact of double amino acid insertions or deletions (InDels) at the individual residue level, enabling residue‑specific insights into structural and functional disruption. We implement a Convolutional Neural Network architecture to predict the energy changes of each residue in a protein mutation. We train our model on datasets constructed from nine proteins, grouped into three categories: one set with exhaustive double InDel mutations, another with approximately 145k randomly sampled double InDel mutations, and a third with approximately 80k randomly sampled double InDel mutations. Our model achieves high predictive accuracy across a range of energy metrics as calculated by the Rosetta molecular modeling suite and reveals localized patterns that influence model performance, such as solvent accessibility and secondary structure context. This per‑residue analysis offers new insights into the mutational tolerance of specific regions within proteins and provides more interpretable and biologically meaningful predictions of InDels' effects.
Authors: Farzad Molani, Art E. Cho
Abstract: Accurately predicting protein‑ligand binding free energies (BFEs) remains a central challenge in drug discovery, particularly because the most reliable methods, such as free energy perturbation (FEP), are computationally intensive and difficult to scale. Here, we introduce a hybrid quantum‑classical framework that combines Mining Minima sampling with quantum mechanically refined ligand partial charges, QM/MM interaction evaluation, and variational quantum eigensolver (VQE)‑based electronic energy correction. This design enables explicit treatment of polarization, charge redistribution, and electronic correlation effects that are often underestimated in purely classical scoring schemes, while retaining computational efficiency. Across 23 protein targets and 543 ligands, the method achieves a mean absolute error of about 1.10 kcal/mol with strong rank‑order fidelity (Pearson R = 0.75, Spearman rho = 0.76, Kendall tau = 0.57), consistent with the performance of contemporary FEP protocols. Notably, the workflow requires only about 25 minutes per ligand on standard compute resources, resulting in an approximate 20‑fold reduction in computational cost relative to alchemical free energy approaches. This level of accuracy and efficiency makes the method well‑suited for high‑throughput lead optimization and iterative design cycles in pharmaceutical discovery. The framework also provides a natural foundation for future integration with machine learning models to enable predictive, large‑scale, and adaptive screening strategies.
Authors: Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane
Abstract: Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature‑concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain‑specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
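The TopK mechanism mentioned in this abstract can be illustrated with a toy forward pass: encode, keep only the k largest latent pre-activations, zero the rest, then decode. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name, the weight layout, and the ReLU applied to the kept latents are all hypothetical choices, and Ordered SAEs are not shown.

```python
def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Toy TopK sparse autoencoder forward pass (hypothetical layout).

    w_enc has shape (n_latent, d_model); w_dec has shape (d_model, n_latent).
    """
    n_latent = len(w_enc)
    # Encoder pre-activations: one dot product per latent feature.
    pre = [sum(w * xi for w, xi in zip(w_enc[j], x)) + b_enc[j]
           for j in range(n_latent)]
    # Indices of the k largest pre-activations.
    keep = set(sorted(range(n_latent), key=lambda j: pre[j], reverse=True)[:k])
    # Keep those latents (with a ReLU, an assumed variant); zero all others.
    z = [max(pre[j], 0.0) if j in keep else 0.0 for j in range(n_latent)]
    # Linear decoder reconstructs the input from the sparse code.
    x_hat = [sum(w_dec[i][j] * z[j] for j in range(n_latent)) + b_dec[i]
             for i in range(len(x))]
    return z, x_hat


# Tiny example: 3-dimensional input, 4 latents, k = 2.
x = [1.0, -2.0, 3.0]
w_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
w_dec = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
z, x_hat = topk_sae_forward(x, w_enc, [0.0] * 4, w_dec, [0.0] * 3, k=2)
```

Here at most k latents are nonzero for any input. That sparsity is what makes individual features easier to inspect, though, as the abstract notes, a feature that correlates with a concept need not be steerable.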
Authors: Jing Shen, Ming-Zheng Du, Dong H. Zhang, Venkat Kapil, Wei Fang
Abstract: Nuclear quantum effects (NQEs) arising from the light mass of hydrogen can influence the structure and stability of hydrogen‑bonded biomolecules, yet their role in determining peptide and protein folding remains unclear. Experiments show that substituting H_2O with D_2O often stabilizes folded states, but the microscopic mechanism associated with this phenomenon remains unresolved. Through ab initio‑level path‑integral molecular dynamics simulations enabled by machine‑learning interatomic potentials, we address the fundamental question of the role of NQEs in peptides by investigating both their overall impact and isotope substitution effects. Overall, NQEs systematically destabilize compact three‑dimensional structures across peptide systems, independent of secondary structure type or side‑chain interactions. Contrary to the conventional picture that places central importance on hydrogen bonds, we find that the dominant destabilization instead arises from the quantum C‑H vibrations. In addition, we reveal microscopic insights into the stabilization of folded peptides upon H_2O to D_2O substitution, showing that the H/D isotope substitution of active peptide hydrogens, previously considered unimportant, produces free‑energy changes within the range of experimentally observed shifts. These findings provide a new interpretation of isotope effects in biological systems, indicating that seemingly small H→D substitutions within peptides can be as important as, or even outweigh, solvent contributions.
Authors: M. Prados, M. D. Hernández de la Torre, F. de Soto
Abstract: This paper extends the analysis of protein secondary structure using the Frenet frame to describe the curvature and torsion of the discrete curve formed by the protein α‑carbons. We show how a simple criterion based on the evaluation of the curvature and torsion of the discrete curve can be useful to pinpoint the presence of some secondary and supersecondary structures in proteins. Moreover, the description of proteins as fixed points of an effective action inspired by a U(1) gauge model is strongly supported by the curvature and torsion observed over a large dataset of proteins in the Protein Data Bank.
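To make the discrete curvature and torsion concrete, the sketch below computes them for a chain of α‑carbon positions. It uses one common discretization, assumed here and possibly different from the paper's convention: curvature is the angle between consecutive virtual bonds, and torsion is the signed dihedral angle defined by three consecutive bonds. The function names and the idealized helix parameters are illustrative only.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def discrete_curvature_torsion(points):
    """Return (bond angles, dihedral angles) in radians for a 3D chain.

    For N points this gives N-2 curvature values and N-3 torsion values.
    """
    bonds = [_sub(points[i + 1], points[i]) for i in range(len(points) - 1)]
    kappa = []
    for b0, b1 in zip(bonds, bonds[1:]):
        c = _dot(b0, b1) / math.sqrt(_dot(b0, b0) * _dot(b1, b1))
        kappa.append(math.acos(max(-1.0, min(1.0, c))))  # clamp for safety
    tau = []
    for b0, b1, b2 in zip(bonds, bonds[1:], bonds[2:]):
        n1, n2 = _cross(b0, b1), _cross(b1, b2)
        # Signed dihedral via atan2; positive for a right-handed twist.
        y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
        tau.append(math.atan2(y, _dot(n1, n2)))
    return kappa, tau


def ideal_helix(n, turn_deg=100.0, rise=1.5, radius=2.3):
    """Points on a regular helix; defaults are roughly alpha-helical (assumed)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
            for i in range(n)]


kappa, tau = discrete_curvature_torsion(ideal_helix(10))
```

On a regular helix both quantities are constant along the chain. This is why a simple threshold on the (curvature, torsion) pair can flag helical stretches.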
Authors: Masahiro Shirataki, Takuma Akimoto
Abstract: A sharp change in apparent mobility at a characteristic temperature that depends on the observation time has been reported in experiments and simulations of hydrated proteins. Such behavior is often discussed in the context of the protein dynamical transition, yet its general physical origin remains unclear. Here we show that fluctuating diffusivity within a Langevin framework naturally gives rise to an observation‑time‑induced crossover in translational diffusion: the effective diffusion coefficient exhibits a temperature‑dependent change whose crossover point systematically shifts with the observation time. Through analytical and numerical analyses, we elucidate the mechanism of this crossover and identify the minimal conditions required for its emergence. Our results establish observation‑time‑induced crossover as a generic non‑equilibrium phenomenon in systems with slowly relaxing mobility fluctuations. While distinct from internal dynamical transitions probed in neutron scattering, this framework provides a unified perspective that encompasses related finite‑time crossover phenomena observed in hydrated proteins and other complex soft‑matter systems.
Authors: Jakub Kopko, David Graber, Saltuk Mustafa Eyrilmez, Stanislav Mazurenko, David Bednar, Jiri Sedlar, Josef Sivic
Abstract: As machine learning becomes increasingly central to molecular design, it is vital to ensure the reliability of learnable protein‑ligand scoring functions on novel protein targets. While many scoring functions perform well on standard benchmarks, their ability to generalize beyond training data remains a significant challenge. In this work, we evaluate the generalization capability of state‑of‑the‑art scoring functions on dataset splits that simulate evaluation on targets with a limited number of known structures and experimental affinity measurements. Our analysis reveals that the commonly used benchmarks do not reflect the true challenge of generalizing to novel targets. We also investigate whether large‑scale self‑supervised pretraining can bridge this generalization gap and we provide preliminary evidence of its potential. Furthermore, we probe the efficacy of simple methods that leverage limited test‑target data to improve scoring function performance. Our findings underscore the need for more rigorous evaluation protocols and offer practical guidance for designing scoring functions with predictive power extending to novel protein targets.
Authors: Yanhua Xu
Abstract: Influenza A viruses (IAVs) evolve antigenically at a pace that requires frequent vaccine updates, yet the haemagglutination inhibition (HI) assays used to quantify antigenicity are labor‑intensive and unscalable. As a result, genomic data vastly outpace available phenotypic labels, limiting the effectiveness of traditional supervised models. We hypothesize that combining pre‑trained Protein Language Models (PLMs) with Semi‑Supervised Learning (SSL) can retain high predictive accuracy even when labeled data are scarce. We evaluated two SSL strategies, Self‑training and Label Spreading, against fully supervised baselines using four PLM‑derived embeddings (ESM‑2, ProtVec, ProtT5, ProtBert) applied to haemagglutinin (HA) sequences. A nested cross‑validation framework simulated low‑label regimes (25%, 50%, 75%, and 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2). SSL consistently improved performance under label scarcity. Self‑training with ProtVec produced the largest relative gains, showing that SSL can compensate for lower‑resolution representations. ESM‑2 remained highly robust, achieving F1 scores above 0.82 with only 25% labeled data, indicating that its embeddings capture key antigenic determinants. While H1N1 and H9N2 were predicted with high accuracy, the hypervariable H3N2 subtype remained challenging, although SSL mitigated the performance decline. These findings demonstrate that integrating PLMs with SSL can address the antigenicity labeling bottleneck and enable more effective use of unlabeled surveillance sequences, supporting rapid variant prioritization and timely vaccine strain selection.
Authors: Juan Manuel Cantarero Angulo, Matthew Smith
Abstract: The global demand for sustainable protein sources is driving increasing interest in edible insects, with Acheta domesticus (house cricket) identified as one of the most suitable species for industrial production. Current farming practices typically rear crickets in mixed‑sex populations without automated sex sorting, despite potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation. This work presents a low‑cost, real‑time system for automated sex‑based sorting of Acheta domesticus, combining computer vision and physical actuation. The device integrates a Raspberry Pi 5 with the official Raspberry AI Camera and a custom YOLOv8 nano object detection model, together with a servo‑actuated sorting arm. The model reached a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.977 during testing, and real‑world experiments with groups of crickets achieved an overall sorting accuracy of 86.8%. These results demonstrate the feasibility of deploying lightweight deep learning models on resource‑constrained devices for insect farming applications, offering a practical solution to improve efficiency and sustainability in cricket production.
Authors: Aingeru Ramos, Jose A Pascual, Javier Navaridas, Ivan Coluzza
Abstract: Markov Chain Monte Carlo methods are algorithms used to sample probability distributions, commonly used to sample the Boltzmann distribution of physical/chemical models (e.g., protein folding, Ising model, etc.). This allows us to study their properties by sampling the most probable states of those systems. However, the sampling capabilities of these methods are not sufficiently accurate when handling complex configuration spaces. This has resulted in the development of new techniques that improve sampling accuracy, usually at the expense of increasing the computational cost. One such technique is Parallel Tempering, which improves accuracy by running several replicas that periodically exchange their states. Computationally, this imposes a significant slow‑down, which can be counteracted by means of parallelization. These schemes enable MCMC/PT techniques to be run more effectively and allow larger models to be studied. In this work, we present a parallel implementation of Metropolis‑Hastings with Parallel Tempering, using OpenMP and CUDA for the parallelization on modern CPUs and GPUs, respectively. The results show a maximum speed‑up of 52x using OpenMP with 48 cores, and of 986x with the CUDA version. Furthermore, the results serve as a basic benchmark to compare a future quantum implementation of the same algorithm.
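The algorithm described in this abstract can be sketched serially on a toy problem. The example below runs Metropolis‑Hastings with Parallel Tempering on a one‑dimensional double‑well energy, which is an assumed stand‑in for a real model. Replicas are updated in a plain loop, so it does not reproduce the OpenMP or CUDA parallelization, and all names and parameter values are hypothetical.

```python
import math
import random


def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def parallel_tempering(betas, n_sweeps, step=0.5, swap_every=10, seed=0):
    """Return the trajectory of the first replica (inverse temperature betas[0])."""
    rng = random.Random(seed)
    xs = [1.0 for _ in betas]  # every replica starts in the right-hand well
    samples = []
    for sweep in range(n_sweeps):
        # Metropolis-Hastings update for each replica (independent, hence
        # the natural target for parallelization).
        for r, beta in enumerate(betas):
            proposal = xs[r] + rng.uniform(-step, step)
            d_e = energy(proposal) - energy(xs[r])
            if d_e <= 0.0 or rng.random() < math.exp(-beta * d_e):
                xs[r] = proposal
        # Periodically attempt swaps between neighbouring temperatures.
        if sweep % swap_every == 0:
            for r in range(len(betas) - 1):
                delta = (betas[r] - betas[r + 1]) * (energy(xs[r]) - energy(xs[r + 1]))
                # Accept with probability min(1, exp(delta)).
                if delta >= 0.0 or rng.random() < math.exp(delta):
                    xs[r], xs[r + 1] = xs[r + 1], xs[r]
        samples.append(xs[0])
    return samples


cold_chain = parallel_tempering(betas=[8.0, 4.0, 2.0, 0.5], n_sweeps=2000)
```

The per‑replica updates inside each sweep do not depend on one another. Only the swap step couples the replicas.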
Authors: Sathya Edamadaka, Soojung Yang, Ju Li, Rafael Gómez-Bombarelli
Abstract: Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string‑, graph‑, 3D atomistic, and protein‑based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then show two distinct regimes of scientific models: on inputs similar to those seen during training, high‑performing models align closely and weak models diverge into local sub‑optima in representation space; on vastly different structures from those seen during training, nearly all models collapse onto a low‑information representation, indicating that today's models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation‑level generality in scientific models. More broadly, our work can be used to track the emergence of universal representations of matter as models scale, and to select and distill models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.
Authors: Guang Yang, Lei Fan
Abstract: The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder‑decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of the RNA backbone based on a 3‑bead coarse‑grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph‑based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA‑protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.
Authors: Michael Souza, Júlio Araújo, John Kesley Costa, Carlile Lavor
Abstract: The Ordered Covering Problem (OCP) arises in the context of the Discretizable Molecular Distance Geometry Problem (DMDGP), where the ordering of pruning edges significantly impacts the performance of the SBBU algorithm for protein structure determination. In recent work, Souza et al. (2023) formalized OCP as a hypergraph covering problem with ordered, exponential costs, and proposed a greedy heuristic that outperforms the original SBBU ordering by orders of magnitude. However, the computational complexity of finding optimal solutions remained open. In this paper, we prove that OCP is NP‑complete through a polynomial‑time reduction from the strongly NP‑complete 3‑Partition problem. Our reduction constructs a tight budget that forces optimal solutions to correspond exactly to valid 3‑partitions. This result establishes a computational barrier for optimal edge ordering and provides theoretical justification for the heuristic approaches currently used in practice.
Authors: Amandine Hong-Minh, Yair Augusto Gutiérrez Fosado, Abbie Guild, Nicholas Mullin, Laura Spagnolo, Ian Chambers, Davide Michieletto
Abstract: Proteins and nucleic acids form non‑Newtonian liquids with complex rheological properties that contribute to their function in vivo. Here we investigate the rheology of the transcription factor NANOG, a key protein in sustaining embryonic stem cell self‑renewal. We discover that at high concentrations NANOG forms macroscopic aging gels through its intrinsically disordered tryptophan‑rich domain. By combining molecular dynamics simulations, mass photometry and Cryo‑EM, we also discover that NANOG forms self‑limiting micelle‑like clusters which expose their DNA‑binding domains. In dense solutions of DNA, NANOG micelle‑like structures stabilize intermolecular entanglements and crosslinks, forming microgel‑like structures. Our findings suggest that NANOG may contribute to regulating gene expression in an unconventional way: by restricting and stabilizing genome dynamics at key transcriptional sites through the formation of an aging microgel‑like structure, potentially enabling mechanical memory in the gene network.
Authors: Milla Åhlfeldt, Maddalena Bin, Anita Girelli, Iason Andronis, Aigerim Karina, Nimmi Das Anthuparambil, Fiona Berner, Tobias Eklund, Louisa E. Kraft, Aliaksandr Leonau, Fabian Westermeier, Michael Sprung, Christian Gutt, Katrin Amann-Winkel, Fivos Perakis
Abstract: Pressure provides a powerful parameter to control the protein conformation state, which at sufficiently high values can lead to unfolding. Here, we investigate the effects of increasing pressure up to 0.4 GPa on hydrated lysozyme proteins, by measuring the nanoscale stress relaxation induced and probed by X‑rays. Structural and dynamical information at elevated pressures was obtained using X‑ray photon correlation spectroscopy (XPCS) in combination with a diamond anvil cell (DAC). The dynamical analysis revealed a slowing down of the system up to 0.2 GPa, followed by a re‑acceleration at 0.4 GPa. A similar non‑monotonic behavior was observed both in the Porod and Kohlrausch‑Williams‑Watts (KWW) exponents, consistently indicating a crossover between 0.2 and 0.4 GPa. These findings suggest the presence of pressure‑induced structural changes that impact protein collective stress‑relaxation as the system transitions from a jammed state to an elastically driven regime. These results may be relevant for a deeper understanding of protein stability under compression as well as for practical high‑pressure technologies, including food processing and pharmaceutical applications.
Authors: Maddalena Bin, Anita Girelli, Mariia Filianina, Mario Reiser, Sharon Berkowicz, Milla Åhlfeldt, Michelle Dargasz, Sonja Timmermann, Jaqueline Savelkouls, Takeshi Kawasaki, Shinji Saito, Federico Zontone, Yuriy Chushkin, Fajun Zhang, Frank Schreiber, Michael Paulus, Christian Gutt, Fivos Perakis
Abstract: Vitrification during cryopreservation requires a detailed understanding of the dynamic behavior of biological solutions. We investigate ferritin diffusion in glycerol‑water mixtures at supercooled temperatures using X‑ray Photon Correlation Spectroscopy (XPCS). Diffusion coefficients were measured from ambient conditions to T = 210 K and analyzed using the Vogel‑Fulcher‑Tammann (VFT) relation, yielding an arrest temperature of T_0 = 85 ± 11 K for ferritin (R_h = 7.3 nm), markedly lower than T_0 = 122 ± 4 K for larger nanoparticles (R_h = 50 nm). Below T ≈ 230 K, ferritin diffusion exceeds the Stokes‑Einstein prediction by up to a factor of 2.7, revealing nanoscale deviations from bulk viscosity. A fluctuating‑friction model quantitatively links this enhancement to local friction heterogeneity, with fluctuations increasing upon cooling and reaching about 80% of the mean friction at T = 210 K. These results establish a molecular‑scale connection between protein diffusion and solvent dynamical heterogeneity in cryoprotected solutions.
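The two relations named in this abstract can be written down directly. The sketch below assumes the standard Stokes‑Einstein formula D = k_B T / (6 π η R_h) and one common form of the VFT law for diffusion, D(T) = D_inf · exp(−B / (T − T_0)). The prefactor D_inf and the parameter B are hypothetical placeholders, not fitted values from the paper; only T_0 = 85 K and R_h = 7.3 nm come from the abstract.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def stokes_einstein(temp_k, viscosity_pa_s, radius_m):
    """Stokes-Einstein diffusion coefficient (m^2/s) for a sphere."""
    return K_B * temp_k / (6.0 * math.pi * viscosity_pa_s * radius_m)


def vft_diffusion(temp_k, d_inf, b_k, t0_k):
    """VFT-type diffusion coefficient; diverging slowdown as T approaches T_0."""
    if temp_k <= t0_k:
        raise ValueError("VFT form is only defined for T > T_0")
    return d_inf * math.exp(-b_k / (temp_k - t0_k))


def enhancement(d_measured, d_stokes_einstein):
    """Ratio of measured diffusion to the Stokes-Einstein prediction."""
    return d_measured / d_stokes_einstein


# Illustrative call: R_h = 7.3 nm in a water-like solvent at room temperature.
d_room = stokes_einstein(298.15, 0.89e-3, 7.3e-9)
# Illustrative VFT curve with T_0 = 85 K and assumed D_inf, B.
d_cold = vft_diffusion(210.0, d_inf=1e-9, b_k=1000.0, t0_k=85.0)
```

With these definitions, the reported factor of 2.7 corresponds to an `enhancement` value of 2.7 at the lowest temperatures.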
Authors: Jiabao Brad Wang, Siyuan Cao, Hongxuan Wu, Yiliang Yuan, Mustafa Misir
Abstract: Selecting an effective docking algorithm is highly context‑dependent, and no single method performs reliably across structural, chemical, or protocol regimes. We introduce MolAS, a lightweight algorithm selection system that predicts per‑algorithm performance from pretrained protein‑ligand embeddings using attentional pooling and a shallow residual decoder. With only hundreds to a few thousand labelled complexes, MolAS achieves up to 15% absolute improvement over the single‑best solver (SBS) and closes 17‑66% of the Virtual Best Solver (VBS)‑SBS gap across five diverse docking benchmarks. Analyses of reliability, embedding geometry, and solver‑selection patterns show that MolAS succeeds when the oracle landscape exhibits low entropy and separable solver behaviour, but collapses under protocol‑induced hierarchy shifts. These findings indicate that the main barrier to robust docking AS is not representational capacity but instability in solver rankings across pose‑generation regimes, positioning MolAS as both a practical in‑domain selector and a diagnostic tool for assessing when AS is feasible.
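The SBS and VBS baselines and the "gap closed" figure quoted in this abstract follow the usual algorithm‑selection definitions, which the sketch below assumes. SBS is the single algorithm with the best mean score, VBS is an oracle that picks the best algorithm per instance, and the gap closed is (selector − SBS) / (VBS − SBS). The toy performance matrix and the function names are hypothetical.

```python
def sbs_vbs(perf):
    """perf[i][a] is the score of algorithm a on instance i (higher is better).

    Returns (single-best-solver mean score, virtual-best-solver mean score).
    """
    n_inst, n_alg = len(perf), len(perf[0])
    means = [sum(row[a] for row in perf) / n_inst for a in range(n_alg)]
    sbs = max(means)                             # best fixed algorithm
    vbs = sum(max(row) for row in perf) / n_inst  # per-instance oracle
    return sbs, vbs


def gap_closed(selector_score, sbs, vbs):
    """Fraction of the VBS-SBS gap recovered by a selector."""
    if vbs == sbs:
        raise ValueError("no gap to close: SBS already matches VBS")
    return (selector_score - sbs) / (vbs - sbs)


# Toy example: algorithm 0 wins on three instances, algorithm 1 on one.
perf = [[1, 0], [1, 0], [0, 1], [1, 0]]
sbs, vbs = sbs_vbs(perf)
fraction = gap_closed(0.875, sbs, vbs)
```

A selector that matches SBS scores 0 on this measure and an oracle scores 1. Values in between, such as the 17‑66% reported, mean the selector recovers part of what per‑instance choice could offer.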
Authors: Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang, Pascal Notin, Debora S. Marks
Abstract: Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In‑context Mutation Oracle), a transformer‑based framework that leverages in‑context learning and test‑time training to adapt rapidly to new proteins and assays without large task‑specific datasets. By encoding sequence information, auxiliary zero‑shot predictions, and sparse experimental labels from many assays as a unified token set in a pre‑training masked‑language modeling paradigm, PRIMO learns to prioritize promising variants through a preference‑based loss function. Across diverse protein families and properties (including both substitution and indel mutations), PRIMO outperforms zero‑shot and fully supervised baselines. This work underscores the power of combining large‑scale pre‑training with efficient test‑time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
Authors: Zijun Gao, Mutian He, Shijia Sun, Hanqun Cao, Jingjie Zhang, Zihao Luo, Xiaorui Wang, Xiaojun Yao, Chang-Yu Hsieh, Chunbin Gu, Pheng Ann Heng
Abstract: Reliable evaluation of protein structure predictions remains challenging, as metrics like pLDDT capture energetic stability but often miss subtle errors such as atomic clashes or conformational traps reflecting topological frustration within the protein folding energy landscape. We present CODE (Chain of Diffusion Embeddings), a self‑evaluating metric empirically found to quantify topological frustration directly from the latent diffusion embeddings of the AlphaFold3 series of structure predictors in a fully unsupervised manner. Integrating this with pLDDT, we propose CONFIDE, a unified evaluation framework that combines energetic and topological perspectives to improve the reliability of AlphaFold3 and related models. CODE strongly correlates with protein folding rates driven by topological frustration, achieving a correlation of 0.82 compared to pLDDT's 0.33 (a relative improvement of 148%). CONFIDE significantly enhances the reliability of quality evaluation in molecular glue structure prediction benchmarks, achieving a Spearman correlation of 0.73 with RMSD, compared to pLDDT's correlation of 0.42, a relative improvement of 73.8%. Beyond quality assessment, our approach applies to diverse drug design tasks, including all‑atom binder design, enzymatic active site mapping, mutation‑induced binding affinity prediction, nucleic acid aptamer screening, and flexible protein modeling. By combining data‑driven embeddings with theoretical insight, CODE and CONFIDE outperform existing metrics across a wide range of biomolecular systems, offering robust and versatile tools to refine structure predictions, advance structural biology, and accelerate drug discovery.
Authors: Omar Mahmood, Pedro O. Pinheiro, Richard Bonneau, Saeed Saremi, Vishnu Sresht
Abstract: Ligand‑based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves a brute force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore‑shape profile of the known binder to 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library sizes and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel‑based generative modelling, we propose a pharmacophore‑based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore‑based virtual screening. We introduce VoxCap, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore‑based generation: de‑novo design, in which we aim to generate new molecules with high pharmacophore‑shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse de‑novo hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.
Authors: Hao Qian, Pu You, Lin Zeng, Jingyuan Zhou, Dengdeng Huang, Kaicheng Li, Shikui Tu, Lei Xu
Abstract: Glioblastoma (GBM) remains the most aggressive tumor, urgently requiring novel therapeutic strategies. Here, we present a dry‑to‑wet framework combining generative modeling and experimental validation to optimize peptides targeting ATP5A, a potential peptide‑binding protein for GBM. Our framework introduces the first lead‑conditioned generative model, which focuses exploration on geometrically relevant regions around lead peptides and mitigates the combinatorial complexity of de novo methods. Specifically, we propose POTFlow, a Prior and Optimal Transport‑based Flow‑matching model for peptide optimization. POTFlow employs secondary structure information (e.g., helix, sheet, loop) as geometric constraints, which are further refined by optimal transport to produce shorter flow paths. With this design, our method achieves state‑of‑the‑art performance compared with five popular approaches. When applied to GBM, our method generates peptides that selectively inhibit cell viability and significantly prolong survival in a patient‑derived xenograft (PDX) model. As the first lead peptide‑conditioned flow matching model, POTFlow holds strong potential as a generalizable framework for therapeutic peptide design.
Authors: Patrice Koehl, Marc Delarue, Henri Orland
Abstract: We introduce a computational framework for generating realistic transition paths between distinct conformations of large bio‑molecular systems. The method is built on a stochastic integro‑differential formulation derived from the Langevin bridge formalism, which constrains molecular trajectories to reach a prescribed final state within a finite time and yields an efficient low‑temperature approximation of the exact bridge equation. To obtain physically meaningful protein transitions, we couple this formulation to a new coarse‑grained potential combining a Go‑like term that preserves native backbone geometry with a Rouse‑type elastic energy term from polymer physics; we refer to the resulting approach as SIDE. We evaluate SIDE on several proteins undergoing large‑scale conformational changes and compare its performance with established methods such as MinActionPath and EBDIMS. SIDE generates smooth, low‑energy trajectories that maintain molecular geometry and frequently recover experimentally supported intermediate states. Although challenges remain for highly complex motions, largely due to the simplified coarse‑grained potential, our results demonstrate that SIDE offers a powerful and computationally efficient strategy for modeling bio‑molecular conformational transitions.
Authors: Dmitry Zankov, Pavlo Polishchuk, Michal Sobieraj, Mario Barbatti
Abstract: We introduce milearn, a Python package for multi‑instance learning (MIL) that follows the familiar scikit‑learn fit/predict interface while providing a unified framework for both classical and neural‑network‑based MIL algorithms for regression and classification. The package also includes built‑in hyperparameter optimization designed specifically for small MIL datasets, enabling robust model selection in data‑scarce scenarios. We demonstrate the versatility of milearn across a broad range of synthetic MIL benchmark datasets, including digit classification and regression, molecular property prediction, and protein‑protein interaction (PPI) prediction. Special emphasis is placed on the key instance detection (KID) problem, for which the package provides dedicated support.
Authors: Yannick A. D. Omar
Abstract: There is increasing evidence that numerous membrane proteins can assemble into aggregates that modulate their function and affect many cellular processes such as signal transduction and endocytosis. Here, we present a theoretical description of the instantaneous translational diffusion coefficients of transmembrane protein aggregates on free and supported lipid membranes using Kirkwood‑Riseman theory. We find that hydrodynamic interactions within protein aggregates must be accounted for, as neglecting them yields several times lower diffusion coefficients. By deriving hydrodynamic radii for free and supported lipid membranes, we identify effective length scales that accurately characterize aggregate diffusivities in the presence of hydrodynamic interactions. These findings motivate the approximation of an aggregate by its outline and a random particle distribution inside it. We show that this approach provides a practical method to accurately determine aggregate diffusion coefficients when the particle locations cannot be resolved. The results presented in this article have immediate implications for the formation and function of membrane protein aggregates.
Authors: Hasi Hays, Yue Yu, William J. Richardson
Abstract: Artificial intelligence (AI) is reshaping computational and network biology by enabling new approaches to decode cellular communication networks. We introduce Hierarchical Molecular Language Models (HMLMs), a novel framework that models cellular signaling as a specialized molecular language, where signaling molecules function as tokens, protein interactions define syntax, and functional consequences constitute semantics. HMLMs employ a transformer‑based architecture adapted to accommodate graph‑structured signaling networks through information transducers, mathematical entities that capture how molecules receive, process, and transmit signals. The architecture integrates multi‑modal data sources across molecular, pathway, and cellular scales through hierarchical attention mechanisms and scale‑bridging operators that enable information flow across biological hierarchies. Applied to a complex network of cardiac fibroblast signaling, HMLMs outperformed traditional approaches in temporal dynamics prediction, particularly under sparse sampling conditions. Attention‑based analysis revealed biologically meaningful crosstalk patterns, including previously uncharacterized interactions between signaling pathways. By bridging molecular mechanisms with cellular phenotypes through AI‑driven molecular language representation, HMLMs establish a foundation for biology‑oriented large language models (LLMs) that could be pre‑trained on comprehensive pathway datasets and applied across diverse signaling systems and tissues, advancing precision medicine and therapeutic discovery.
Authors: Anas Aziz Khan, Md Shah Fahad, Priyanka, Ramesh Chandra, Guransh Singh
Abstract: Accurate prediction of enzyme kinetic parameters is crucial for drug discovery, metabolic engineering, and synthetic biology applications. Current computational approaches face limitations in capturing complex enzyme‑substrate interactions and often focus on single parameters while neglecting the joint prediction of catalytic turnover numbers (Kcat) and Michaelis‑Menten constants (Km). We present EnzyCLIP, a novel dual‑encoder framework that leverages contrastive learning and cross‑attention mechanisms to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. Our approach integrates ESM‑2 protein language model embeddings with ChemBERTa chemical representations through a CLIP‑inspired architecture enhanced with bidirectional cross‑attention for dynamic enzyme‑substrate interaction modeling. EnzyCLIP combines InfoNCE contrastive loss with Huber regression loss to learn aligned multimodal representations while predicting log10‑transformed kinetic parameters. The model is trained on the CatPred‑DB database containing 23,151 Kcat and 41,174 Km experimentally validated measurements, and achieved competitive performance with R2 scores of 0.593 for Kcat and 0.607 for Km prediction. XGBoost ensemble methods applied to the learned embeddings further improved Km prediction (R2 = 0.61) while maintaining robust Kcat performance.
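The training objective described in this abstract pairs a contrastive term with a robust regression term. The sketch below combines a symmetric, CLIP‑style InfoNCE loss over an enzyme‑substrate similarity matrix with a Huber loss on predicted log10 kinetic values. It is an illustration under assumptions rather than the EnzyCLIP code: the symmetric form, the temperature, the weighting factor, and all names are hypothetical.

```python
import math


def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def _neg_log_softmax_diag(sim, temperature):
    """Mean of -log softmax(sim[i] / T)[i] over rows i (matched pairs on the diagonal)."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total -= logits[i] - log_z
    return total / len(sim)


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE: average of enzyme-to-substrate and substrate-to-enzyme terms."""
    n = len(sim)
    sim_t = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (_neg_log_softmax_diag(sim, temperature)
                  + _neg_log_softmax_diag(sim_t, temperature))


def combined_loss(sim, preds, targets, weight=1.0, temperature=0.07):
    """Contrastive alignment term plus weighted Huber regression term."""
    regression = sum(huber(p - t) for p, t in zip(preds, targets)) / len(preds)
    return info_nce(sim, temperature) + weight * regression


# Uninformative similarities give log(batch size); aligned pairs give less.
uniform = info_nce([[0.0] * 4 for _ in range(4)])
aligned = info_nce([[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)])
```

The Huber term limits the influence of outlying measurements, which is a plausible reason to prefer it over squared error for noisy kinetic data.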
Authors: Ajit Kumar, IndraPrakash Jha
Abstract: Protein language models (PLMs) have transformed sequence‑based protein analysis, yet most applications rely only on final‑layer embeddings, which may overlook biologically meaningful information encoded in earlier layers. We systematically evaluate all 33 layers of ESM‑2 for kinase functional prediction using both unsupervised clustering and supervised classification. We show that mid‑to‑late transformer layers (layers 20‑33) outperform the final layer by 32 percent in unsupervised Adjusted Rand Index and improve homology‑aware supervised accuracy to 75.7 percent. Domain‑level extraction, calibrated probability estimates, and a reproducible benchmarking pipeline further strengthen reliability. Our results demonstrate that transformer depth contains functionally distinct biological signals and that principled layer selection significantly improves kinase function prediction.
Authors: Jin Han, Tianfan Fu, Wu-Jun Li
Abstract: Protein inverse folding, the design of an amino acid sequence based on a target protein structure, is a fundamental problem of computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or rely on protein language models (PLMs). The former omits the knowledge stored in natural protein data, while the latter is parameter‑inefficient and inflexible in adapting to ever‑growing protein data. To overcome the above drawbacks, in this paper we propose a novel method, called retrieval‑augmented denoising diffusion (RadDiff), for protein inverse folding. In RadDiff, a novel retrieval‑augmentation mechanism is designed to capture the up‑to‑date protein knowledge. We further design a knowledge‑aware diffusion model that integrates this protein knowledge into the diffusion process via a lightweight module. Experimental results on the CATH, TS50, and PDB2022 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.
Authors: Jakob Mihatsch, Andreas M. Menzel
Abstract: The transport of individual entities through interconnected structures is a process of practical relevance both in biology and technology. Examples are given by diffusive dynamics of molecules in porous structures. In soft environments, this transport can be strongly influenced by fluctuations of the porous structure itself. Here, we focus on triply periodic membrane structures found both in cell organelles and in synthetic amphiphilic systems. We theoretically study the effect of a complex three‑dimensional fluctuating environment on the diffusive motion of a test object, using a phase field approach. The rigid spherical test object is energetically prevented from penetrating the membrane. Generally, the pores of the membrane structure can be smaller than the diffusing object. Yet, fluctuations of the membrane can intermittently widen its pores, still allowing for the motion of the larger particles through them. Thus, the object stays trapped for a while inside one cavity formed by the membrane, before an appropriate fluctuation event widens a membrane pore at the right moment so that the object can jump into the next cavity. The process is reflected by a pronounced plateau in the time evolution of the mean squared displacement. We think that the described scenario should be directly observable, for instance, in protein diffusion through biological environments.
Authors: Sebastián Espinel-Ríos
Abstract: Biotechnology can benefit from dynamic control to improve production efficiency. In this context, optogenetics enables modulation of gene expression using light as an external input, allowing fine‑tuning of protein levels to unlock dynamic metabolic control and regulation of cell growth. Optogenetic systems can be actuated by light intensity. However, relying solely on intensity‑driven control (i.e., signal amplitude) may fail to properly tune optogenetic bioprocesses when the dose‑response relationship (i.e., light intensity versus gene‑expression strength) is steep. In these cases, tunability is effectively constrained to either fully active or fully repressed gene expression, with little intermediate regulation. Pulse‑width modulation can alleviate this issue by alternating between fully ON and OFF light intensity within forcing periods, thereby smoothing the average response and enhancing process controllability. Optimizing pulse‑width‑modulated optogenetics entails a switching‑time optimal control problem with a binary input over multiple forcing periods. While this can be formulated as a mixed‑integer optimization problem on a refined control grid with monotonic input constraints, the number of decision variables can grow rapidly with increasing control‑grid resolution within forcing periods and with the total number of forcing periods, complicating the task. Here, we propose an alternative solution based on reinforcement learning. We parametrize control actions via the duty cycle, a continuous proxy variable that encodes the ON‑to‑OFF switching time within each forcing period, thereby respecting the intrinsic binary nature of the light intensity while avoiding fine‑grid binary decision variables.
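The duty-cycle parametrization described above can be illustrated with a short sketch: each forcing period gets one continuous value in [0, 1], which is expanded into a binary ON-then-OFF light signal whose per-period average equals the duty cycle. The function names, the discretization step `dt`, and the unit ON level are assumptions for illustration; the reinforcement-learning agent that would choose the duty cycles is not shown.

```python
def pwm_signal(duty_cycles, period, dt, on_level=1.0):
    """Expand duty cycles into a binary light signal: ON first, then OFF, within each forcing period."""
    steps = round(period / dt)  # samples per forcing period
    signal = []
    for duty in duty_cycles:
        duty = min(max(duty, 0.0), 1.0)   # clip to the valid range [0, 1]
        n_on = round(duty * steps)        # ON-to-OFF switching index
        signal.extend([on_level] * n_on + [0.0] * (steps - n_on))
    return signal

def period_averages(signal, steps):
    """Average light intensity over each forcing period of `steps` samples."""
    return [sum(signal[i:i + steps]) / steps for i in range(0, len(signal), steps)]

# Three forcing periods of 10 time units, sampled at dt = 1.
light = pwm_signal([0.2, 0.5, 0.8], period=10.0, dt=1.0)
averages = period_averages(light, steps=10)
```

The signal only ever takes the values 0 and `on_level`, yet its per-period average tracks the continuous duty cycle, which is the smoothing effect the abstract attributes to pulse-width modulation.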
Authors: Marc M Nasser, Frédéric Poitevin, Kevin M Dalton
Abstract: Serial crystallography experiments routinely produce thousands of diffraction patterns from crystals in random orientations. To turn this stream of images into a usable dataset, each pattern must be indexed before integration and merging can proceed. In practice, diffraction patterns may contain only a small number of reliable peaks, be contaminated by background or spuriously detected reflections, or arise from crystals with highly skewed unit cells. These factors make indexing unstable in the small‑N regime. We introduce a robust indexing algorithm tailored to this setting. We formulate indexing as a symmetry‑aware lattice decoding problem and design a loss that explicitly incorporates lattice symmetries while trimming outlier peaks that are inconsistent with any plausible orientation. We combine this objective with a reciprocal‑space basis reparameterization that stabilizes decoding for skewed or poorly conditioned lattices, and we develop a dedicated small‑N objective mode that couples refined peak scoring with a method to recover orientations from very few reflections. The resulting method is memory‑efficient and suitable for robust indexing. We evaluate our approach on three protein datasets from the Coherent X‑ray Imaging Data Bank collected at XFEL facilities, using identical preprocessing and unit‑cell information across methods. Across all datasets, our algorithm matches or outperforms established indexers such as XGANDALF and TORO, with particularly large gains for patterns with few indexed peaks and for crystals with skewed unit cells. While slower, our method is extremely memory‑efficient, and its structure allows high‑parallelism on CPUs or larger batch sizes on GPUs. These results show that exploiting lattice structure, symmetry, and small‑N‑aware search yields substantial improvements in indexing robustness.
Authors: Yajun Yu, Guoping Xu, Steve Jiang, Robert Timmerman, John Minna, Yuanyuan Zhang, Hao Peng
Abstract: To develop an integrated transcriptome‑proteome framework for identifying concurrent biomarkers predictive of radiation response, as measured by survival fraction at 2 Gy (SF2), in non‑small cell lung cancer (NSCLC) cell lines. RNA sequencing (RNA‑seq) and data‑independent acquisition mass spectrometry (DIA‑MS) proteomic data were collected from 73 and 46 NSCLC cell lines, respectively. Following preprocessing, 1,605 shared genes were retained for analysis. Feature selection was performed using least absolute shrinkage and selection operator (Lasso) regression with a frequency‑based ranking criterion under five‑fold cross‑validation repeated ten times. Support vector regression (SVR) models were constructed using transcriptome‑only, proteome‑only, and combined transcriptome‑proteome feature sets. Model performance was assessed by the coefficient of determination (R2) and root mean square error (RMSE). Correlation analyses evaluated concordance between RNA and protein expression and the relationships of selected biomarkers with SF2. RNA‑protein expression exhibited significant positive correlations (median Pearson's r = 0.363). Independent pipelines identified 20 prioritized gene signatures from transcriptomic, proteomic, and combined datasets. Models trained on single‑omic features achieved limited cross‑omic generalizability, while the combined model demonstrated balanced predictive accuracy in both datasets (R2=0.461, RMSE=0.120 for transcriptome; R2=0.604, RMSE=0.111 for proteome). This study presents the first proteotranscriptomic framework for SF2 prediction in NSCLC, highlighting the complementary value of integrating transcriptomic and proteomic data. The identified concurrent biomarkers capture both transcriptional regulation and functional protein activity, offering mechanistic insights and translational potential.
Authors: Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh, Zebang Shen, Niao He, Andreas Krause
Abstract: Adapting large‑scale foundation flow and diffusion generative models to optimize task‑specific objectives while preserving prior information is crucial for real‑world applications such as molecular design, protein docking, and creative image generation. Existing principled fine‑tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre‑trained model via KL‑divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk‑averse and novelty‑seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL‑divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine‑tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text‑to‑image, and molecular design tasks, showing that it can steer pre‑trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine‑tuning schemes.
Authors: Fiona Y. Wang, Di Sheng Lee, David L. Kaplan, Markus J. Buehler
Abstract: Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state‑of‑the‑art generative methods, such as protein language models (PLMs) and diffusion‑based architectures, often require extensive fine‑tuning, task‑specific data, or model reconfiguration to support objective‑directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent‑based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context‑aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position‑wise, decentralized coordination enables emergent design of diverse, well‑defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure‑based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective‑directed designs within a few GPU‑hours and operates entirely without fine‑tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM‑driven design across biomolecular systems and other scientific discovery tasks.
Authors: Somnath Mondal, Tinkal Mondal, Soumajit Pramanik, Rukmankesh Mehra
Abstract: The interaction between proteins and nucleic acids is crucial for processes that sustain cellular function, including DNA maintenance and the regulation of gene expression and translation. Amino acid mutations in protein‑nucleic acid complexes often lead to serious diseases. Experimental techniques for assessing mutational effects in protein‑nucleic acid complexes have their own limitations. In this study, we compiled a large dataset of 1951 mutations covering both protein‑DNA and protein‑RNA complexes and integrated structural and sequence features to build a deep learning‑based regression model named DeepPNI. This model estimates mutation‑induced binding free energy changes in protein‑nucleic acid complexes. The structural features are encoded via an edge‑aware RGCN, and the sequence features are extracted using the protein language model ESM‑2. The model achieved a high average Pearson correlation coefficient (PCC) of 0.76 on the full dataset under five‑fold cross‑validation. Consistent performance across the individual protein‑DNA and protein‑RNA datasets, and across datasets split by experimental temperature, indicates that the model generalizes. The model also performed well in complex‑based five‑fold cross‑validation, supporting its robustness. In addition, DeepPNI outperformed in external dataset validation, and comparison with existing tools
Authors: Patricia Suriana, Joshua A. Rackers, Ewa M. Nowara, Pedro O. Pinheiro, John M. Nicoloudis, Vishnu Sresht
Abstract: Machine learning models for 3D molecular property prediction typically rely on atom‑based representations, which may overlook subtle physical information. Electron density maps ‑‑ the direct output of X‑ray crystallography and cryo‑electron microscopy ‑‑ offer a continuous, physically grounded alternative. We compare three voxel‑based input types for 3D convolutional neural networks (CNNs): atom types, raw electron density, and density gradient magnitude, across two molecular tasks ‑‑ protein‑ligand binding affinity prediction (PDBbind) and quantum property prediction (QM9). We focus on voxel‑based CNNs because electron density is inherently volumetric, and voxel grids provide the most natural representation for both experimental and computed densities. On PDBbind, all representations perform similarly with full data, but in low‑data regimes, density‑based inputs outperform atom types, while a shape‑based baseline performs comparably ‑‑ suggesting that spatial occupancy dominates this task. On QM9, where labels are derived from Density Functional Theory (DFT) but input densities from a lower‑level method (XTB), density‑based inputs still outperform atom‑based ones at scale, reflecting the rich structural and electronic information encoded in density. Overall, these results highlight the task‑ and regime‑dependent strengths of density‑derived inputs, improving data efficiency in affinity prediction and accuracy in quantum property modeling.
Authors: Anderson E. Schwertner, Francisco N. C. Sobral
Abstract: The Low Order‑Value Optimization (LOVO) problem involves minimizing the minimum among a finite number of function values within a feasible set. LOVO has several practical applications such as robust parameter estimation, protein alignment, portfolio optimization, among others. In this work, we are interested in the constrained nonlinear optimization LOVO problem of minimizing the minimum between a finite number of function values subject to a nonempty closed convex set where each function is a black‑box and continuously differentiable, but the derivatives are not available. We develop the first derivative‑free trust‑region algorithm for constrained LOVO problems with convergence to weakly critical points. Under suitable conditions, we establish the global convergence of the algorithm and also its worst‑case iteration complexity analysis. An initial open‑source implementation using only linear interpolation models is developed. Extensive numerical experiments and comparison with existing alternatives show the properties and the efficiency of the proposed approach when solving LOVO problems.
Authors: Bruno Jacob, Khushbu Agarwal, Marcel Baer, Peter Rice, Simone Raugei
Abstract: We present Genie‑CAT, a tool‑augmented large‑language‑model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie‑CAT integrates four capabilities ‑‑ literature‑grounded reasoning through retrieval‑augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine‑learning prediction of redox properties ‑‑ into a unified agentic workflow. By coupling natural‑language reasoning with data‑driven and physics‑based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof‑of‑concept demonstrations, Genie‑CAT autonomously identifies residue‑level modifications near [Fe‑S] clusters that affect redox tuning, reproducing expert‑derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain‑specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
Authors: Lakshaditya Singh, Adwait Shelke, Divyansh Agrawal
Abstract: Designing new protein structures is fundamental to computational biology, enabling advances in therapeutic molecule discovery and enzyme engineering. Existing diffusion‑based generative models typically operate in Cartesian coordinate space, where adding noise disrupts strict geometric constraints such as fixed bond lengths and angles, often producing physically invalid structures. To address this limitation, we propose a Torsion‑Space Diffusion Model that generates protein backbones by denoising torsion angles, ensuring perfect local geometry by construction. A differentiable forward‑kinematics module reconstructs 3D coordinates with fixed 3.8 Angstrom backbone bond lengths while a constrained post‑processing refinement optimizes global compactness via Radius of Gyration (Rg) correction, without violating bond constraints. Experiments on standard PDB proteins demonstrate 100% bond‑length accuracy and significantly improved structural compactness, reducing Rg error from 70% to 18.6% compared to Cartesian diffusion baselines. Overall, this hybrid torsion‑diffusion plus geometric‑refinement framework generates physically valid and compact protein backbones, providing a promising path toward full‑atom protein generation.
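The two geometric quantities the abstract above reports, consecutive backbone distances and the radius of gyration (Rg), can be computed from a coordinate array as in the sketch below. This is only an illustration of the standard definitions, not the paper's code; the helper names are hypothetical, and the relative-error definition is an assumption since the abstract does not say how "Rg error" is measured.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal masses assumed)."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered**2).sum(axis=1).mean()))

def consecutive_distances(coords):
    """Distances between successive backbone points along the chain."""
    coords = np.asarray(coords, dtype=float)
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)

def rg_relative_error(coords, target_rg):
    """Assumed error definition: |Rg - target| / target."""
    return abs(radius_of_gyration(coords) - target_rg) / target_rg

# A fully extended three-point chain with 3.8 Angstrom spacing along x.
chain = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]
spacing = consecutive_distances(chain)
rg = radius_of_gyration(chain)
```

A chain built in torsion space keeps `consecutive_distances` fixed by construction, so only Rg changes as the torsion angles vary; that is why compactness can be refined without touching the distance constraint.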
Authors: Riccardo Tedoldi, Ola Engkvist, Patrick Bryant, Hossein Azizpour, Jon Paul Janet, Alessandro Tibo
Abstract: Sampling useful three‑dimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state‑of‑the‑art 3D de‑novo design flow matching or diffusion‑based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low‑energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow‑matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state‑of‑the‑art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state‑of‑the‑art physics‑based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein‑conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.
Authors: Guanlue Li, Xufeng Zhao, Fang Wu, Sören Laue
Abstract: Protein‑protein interactions (PPIs) are governed by surface complementarity and hydrophobic interactions at protein interfaces. However, designing diverse and physically realistic protein structure and surfaces that precisely complement target receptors remains a significant challenge in computational protein design. In this work, we introduce PepBridge, a novel framework for the joint design of protein surface and structure that seamlessly integrates receptor surface geometry and biochemical properties. Starting with a receptor surface represented as a 3D point cloud, PepBridge generates complete protein structures through a multi‑step process. First, it employs denoising diffusion bridge models (DDBMs) to map receptor surfaces to ligand surfaces. Next, a multi‑model diffusion model predicts the corresponding structure, while Shape‑Frame Matching Networks ensure alignment between surface geometry and backbone architecture. This integrated approach facilitates surface complementarity, conformational stability, and chemical feasibility. Extensive validation across diverse protein design scenarios demonstrates PepBridge's efficacy in generating structurally viable proteins, representing a significant advancement in the joint design of top‑down protein structure.
Authors: Félix Benoist, Pablo Sartori
Abstract: The cytoplasm is a heterogeneous mixture containing many types of proteins that self‑assemble into a wide variety of complexes. The accuracy and speed of cytoplasmic self‑assembly are astonishing because it involves the correct identification of components shared among different structures, despite pervasive thermal fluctuations. Typical toy models of self‑assembly are based on the specificity of binding energies among components and neglect kinetic effects. However, kinetics plays a key role in biological self‑assembly, often catalyzed by a plethora of assembly factors. Building on this observation, we extend a previous heteropolymer growth model to describe the retrieval of two‑dimensional structures. We find that the self‑assembly of structures in this model is subject to strong speed and encoding bottlenecks. Moreover, we show that these bottlenecks can be suppressed by increasing the connectivity of a small fraction of components. This mechanism of kinetically controlling a small number of critical binding events provides a simple explanation for the timely assembly of large protein complexes, and suggests a unifying principle for the role of assembly factors.
Authors: Emanuel Dorbath, Fabian Rudolf, Adnan Gulzar, Gerhard Stock
Abstract: Allostery, the intriguing phenomenon of long‑range communication between distant sites in proteins, plays a central role in biomolecular regulation and signal transduction. While it is commonly attributed to conformational rearrangements, the underlying dynamical mechanisms remain poorly understood. The contact cluster model of allostery [J. Chem. Theory Comput. 2024, 20, 10731‑10739] identifies localized groups of highly correlated contacts that mediate interactions between secondary structure elements. This framework proposes that allostery proceeds through a multistep process involving cooperative contact changes within clusters and communication between distant clusters, transmitted through rigid secondary structures. To demonstrate the validity and generality of the model, this Perspective employs extensive molecular dynamics simulations (~1 ms total simulation time) of four different photoswitchable PDZ domains and studies how different domains, ligands, and perturbations influence both the contact clusters and their dynamical evolution. These analyses reveal several recurring clusters that represent shared flexible structural modules, such as loops connecting β‑sheets, and show that the characteristic time scales of the nonequilibrium protein response can be directly associated with the motions of individual contact clusters. Thus, the dynamic decomposition of PDZ domains into contact clusters uncovers a modular, dynamics‑based architecture that underlies and facilitates long‑range allosteric communication.
Authors: Cangtao Yin, Meenu Upadhyay, Markus Meuwly
Abstract: The dynamics and spectroscopy of the small (H₂COO) and large (CH₃CHOO) Criegee intermediates (CIs) in the gas phase, inside/on water droplets, on amorphous solid water (ASW) and in bulk water are investigated using validated energy functions. For both species, facile diffusion between surface and inside positions of water droplets is found, whereas on amorphous solid water at low temperatures (50 K) no surface diffusion is observed on the multiple‑nanosecond time scale. This is at variance with other species, such as CO or NO on ASW. The infrared spectroscopy of both CIs in contact with an aqueous environment leads to shifts of the spectral features on the order of a few to a few tens of cm⁻¹, depending on the vibrational mode considered. This is consistent with Stark‑induced spectral shifts for small molecules in protein environments. However, the spectroscopy of both CIs in contact with water droplets does not depend on the positioning relative to the droplet (inside vs. surface).
Authors: Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen, Yihang Xu, Yunhe Yuan, Lihuan Luo, Xitong Ling, Xiaoping Liu
Abstract: Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model, TDAM‑CRC, using histopathological whole‑slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further integrated multi‑omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that TDAM‑CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state‑of‑the‑art models. The TDAM‑CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi‑omics analysis revealed that the high‑risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM‑CRC risk score and clinical factors to provide a precise and interpretable clinical decision‑making tool for CRC patients. Our AI‑driven pathological model TDAM‑CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision‑making.
Authors: Yuntao Lu, Yunxin Zhang
Abstract: The burst approximation is a widely used technique to simplify stochastic gene expression models. However, the dynamics and analytical properties of the protein number distribution in gene expression models under the burst approximation are barely studied. In this study, we propose and systematically analyze surrogate models with multiple gene states and arbitrary burst size distributions. An analytical time‑dependent solution to the chemical master equation is derived and then exploited in two directions. Theoretically, several fine properties of the protein number distribution are established using functional analysis. For geometrically distributed burst sizes, the distribution is dominated by a scaled negative binomial distribution, and is light‑tailed in certain parameter regimes. Computationally, we develop efficient algorithms in three settings, enabling fast calculation of the protein number distribution. Furthermore, the approximation error relative to full gene expression models is estimated in terms of low‑order moments of the distribution, thereby clarifying the validity of the burst approximation.
Authors: Tyler L. Hayes, Giri P. Krishnan
Abstract: Models such as AlphaFold2 and OpenFold have transformed protein structure prediction, yet their inner workings remain poorly understood. We present a methodology to systematically evaluate the contribution of individual OpenFold components to structure prediction accuracy. We identify several components that are critical for most proteins, while others vary in importance across proteins. We further show that the contribution of several components is correlated with protein length. These findings provide insight into how OpenFold achieves accurate predictions and highlight directions for interpreting protein prediction networks more broadly.
Authors: Xinzhe Zheng, Shiyu Jiang, Gustavo Seabra, Chenglong Li, Yanjun Li
Abstract: Deep generative models are rapidly advancing structure‑based drug design, offering substantial promise for generating small molecule ligands that bind to specific protein targets. However, most current approaches assume a rigid protein binding pocket, neglecting the intrinsic flexibility of proteins and the conformational rearrangements induced by ligand binding, limiting their applicability in practical drug discovery. Here, we propose Apo2Mol, a diffusion‑based generative framework for 3D molecule design that explicitly accounts for conformational flexibility in protein binding pockets. To support this, we curate a dataset of over 24,000 experimentally resolved apo‑holo structure pairs from the Protein Data Bank, enabling the characterization of protein structure changes associated with ligand binding. Apo2Mol employs a full‑atom hierarchical graph‑based diffusion model that simultaneously generates 3D ligand molecules and their corresponding holo pocket conformations from input apo states. Empirical studies demonstrate that Apo2Mol can achieve state‑of‑the‑art performance in generating high‑affinity ligands and accurately capture realistic protein pocket conformational changes.
Authors: Gaia Forghieri, Viacheslav Dubovitskii, Matteo A. C. Rossi, Matteo G. A. Paris
Abstract: Reconstructing protein‑protein interaction networks is a central challenge in network medicine, often addressed using link prediction algorithms. Recent studies suggest that quantum walk‑based approaches hold promise for this task. In this paper, we build on these algorithms by introducing chirality through the addition of random phases in the Hamiltonian generators. The resulting additional degrees of freedom enable a more diverse exploration of the network, which we exploit by employing a swarm of chiral quantum walks. Thus, we enhance the predictive power of quantum walks on complex networks. Indeed, compared to a non‑chiral algorithm, the chiral version exhibits greater robustness, making its performance less dependent on the optimal evolution time, a critical hyperparameter of the non‑chiral model. This improvement arises from complementary dynamics introduced by chirality within the swarm. By analyzing multiple phase‑sampling strategies, we identify configurations that achieve a practical trade‑off: retaining the high predictive accuracy of the non‑chiral algorithm at its optimal time while gaining the robustness typical of chirality. Our findings highlight the versatility of chiral quantum walks and their potential to outperform both classical and non‑chiral quantum methods in realistic scenarios, including comparisons between successive versions of evolving databases.
Authors: Konstantinos Steiakakis, Alan Pichard, Maxime Vassaux
Abstract: Collagen fibrils are the building blocks of many biological tissues, whose viability depends on the fibrils' properties. Altered properties of collagen fibrils are central to the appearance of many diseases, while physiological or native properties must be reproduced for tissue engineering. Yet, the self‑assembly, the structure, and therefore the properties of collagen fibrils remain elusive. One main reason is the extreme sensitivity of the fibrils to their environmental conditions, in particular hydration, which is only loosely bounded by experimental measurements. Furthermore, mechanics are an integral part of the self‑assembly process; forces exerted by cells or osmotic pressure may result in internal stresses in collagen fibrils in native conditions. Here, we propose to investigate internal stresses in collagen fibrils by means of molecular dynamics simulations of the collagen microfibril model. Our simulations reveal the quantitative evolution of internal stresses in collagen fibrils with hydration. We establish a value of native hydration of collagen fibrils at 0.78 g/g based on an absence of cross‑sectional stresses. In turn, we determine a quantitative estimate of internal longitudinal stresses in collagen fibrils in native conditions of 210 MPa. We find that internal longitudinal stresses are caused by an over‑extended protein backbone rather than partial hydration, which appears to be a remnant of the local forces driving collagen self‑assembly. We also demonstrate the consequences of internal longitudinal stresses on the mechanical properties of collagen fibrils: their absence induces a decrease of more than 20% in the Young's modulus. Overall, our findings provide insights into the native structure and properties of collagen fibrils. More than ever, collagen fibrils appear to be assembled via an out‑of‑equilibrium process key to the synthesis of viable tissues.
Authors: Pratik Chakraborty, Aryan Bhargava
Abstract: Proteins perform essential biological functions, and accurate classification of their sequences is critical for understanding structure‑function relationships, enzyme mechanisms, and molecular interactions. This study presents a deep learning‑based framework for functional group classification of protein sequences derived from the Protein Data Bank (PDB). Four architectures were implemented: Convolutional Neural Network (CNN), Bidirectional Long Short‑Term Memory (BiLSTM), CNN‑BiLSTM hybrid, and CNN with Attention. Each model was trained using k‑mer integer encoding to capture both local and long‑range dependencies. Among these, the CNN achieved the highest validation accuracy of 91.8%, demonstrating the effectiveness of localized motif detection. Explainable AI techniques, including Grad‑CAM and Integrated Gradients, were applied to interpret model predictions and identify biologically meaningful sequence motifs. The discovered motifs, enriched in histidine, aspartate, glutamate, and lysine, represent amino acid residues commonly found in catalytic and metal‑binding regions of transferase enzymes. These findings highlight that deep learning models can uncover functionally relevant biochemical signatures, bridging the gap between predictive accuracy and biological interpretability in protein sequence analysis.
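The k‑mer integer encoding mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming a vocabulary over the 20 standard amino acids, overlapping k‑mers with stride 1, and index 0 reserved for unknown k‑mers; the value of k, the vocabulary order, and the function names are hypothetical and not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_kmer_vocab(k):
    """Map every possible k-mer to a positive integer; 0 is reserved for unknown k-mers."""
    return {"".join(p): i + 1 for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}


def encode_sequence(seq, k, vocab):
    """Slide a window of width k along seq (stride 1) and look up each k-mer."""
    return [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


vocab = build_kmer_vocab(2)
codes = encode_sequence("ACDA", 2, vocab)  # k-mers: AC, CD, DA
```

The resulting integer lists would typically be padded to a common length and fed to an embedding layer before a CNN or BiLSTM; that downstream step is outside this sketch.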
Authors: Jun-Hyoung Park, Ho-Jun Song, Seong-Whan Lee
Abstract: Deep learning‑based molecular generation models have shown great potential in efficiently exploring vast chemical spaces by generating potential drug candidates with desired properties. However, these models often produce chemically invalid molecules, which limits the usable scope of the learned chemical space and poses significant challenges for practical applications. To address this issue, we propose ChemFixer, a framework designed to correct invalid molecules into valid ones. ChemFixer is built on a transformer architecture, pre‑trained using masking techniques, and fine‑tuned on a large‑scale dataset of valid/invalid molecular pairs that we constructed. Through comprehensive evaluations across diverse generative models, ChemFixer improved molecular validity while effectively preserving the chemical and biological distributional properties of the original outputs. This indicates that ChemFixer can recover molecules that could not be previously generated, thereby expanding the diversity of potential drug candidates. Furthermore, ChemFixer was effectively applied to a drug‑target interaction (DTI) prediction task using limited data, improving the validity of generated ligands and discovering promising ligand‑protein pairs. These results suggest that ChemFixer is not only effective in data‑limited scenarios, but also extensible to a wide range of downstream tasks. Taken together, ChemFixer shows promise as a practical tool for various stages of deep learning‑based drug discovery, enhancing molecular validity and expanding accessible chemical space.
Authors: Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep, Asif Ekbal
Abstract: In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre‑trained transformer‑based protein language model to encode amino acid sequences and employing message‑passing mechanisms like GCN and R‑GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we employed the training dataset provided by NetSurfP‑2.0, which outlines secondary structure in 3‑ and 8‑states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1 scores.
Authors: Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein
Abstract: Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non‑differentiable, making them incompatible with gradient‑based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non‑differentiable guidance of protein generative models, coupling a continuous diffusion‑based generator with any black‑box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data‑conditioned protein modeling beyond the limits of differentiability.
Authors: Zuqi Huang, Mengxin Tian, Huan Liu, Wentao Li, Baobao Liang, Jie Wu, Fang Yan, Zhaoqing Tang, Zhongyu Li
Abstract: Accurate cell counting in immunohistochemistry (IHC) images is critical for quantifying protein expression and aiding cancer diagnosis. However, the task remains challenging due to the chromogen overlap, variable biomarker staining, and diverse cellular morphologies. Regression‑based counting methods offer advantages over detection‑based ones in handling overlapped cells, yet rarely support end‑to‑end multi‑class counting. Moreover, the potential of foundation models remains largely underexplored in this paradigm. To address these limitations, we propose a rank‑aware agglomeration framework that selectively distills knowledge from multiple strong foundation models, leveraging their complementary representations to handle IHC heterogeneity and obtain a compact yet effective student model, CountIHC. Unlike prior task‑agnostic agglomeration strategies that either treat all teachers equally or rely on feature similarity, we design a Rank‑Aware Teacher Selecting (RATS) strategy that models global‑to‑local patch rankings to assess each teacher's inherent counting capacity and enable sample‑wise teacher selection. For multi‑class cell counting, we introduce a fine‑tuning stage that reformulates the task as vision‑language alignment. Discrete semantic anchors derived from structured text prompts encode both category and quantity information, guiding the regression of class‑specific density maps and improving counting for overlapping cells. Extensive experiments demonstrate that CountIHC surpasses state‑of‑the‑art methods across 12 IHC biomarkers and 5 tissue types, while exhibiting high agreement with pathologists' assessments. Its effectiveness on H&E‑stained data further confirms the scalability of the proposed method.
Authors: Qingsong Zhong, Haomin Yu, Yan Lin, Wangmeng Shen, Long Zeng, Jilin Hu
Abstract: Structure‑Based drug design (SBDD) has emerged as a popular approach in drug discovery, leveraging three‑dimensional protein structures to generate drug ligands. However, existing generative models encounter several key challenges: (1) incorporating boundary condition constraints, (2) integrating hierarchical structural conditions, and (3) ensuring spatial modeling fidelity. To address these limitations, we propose SculptDrug, a spatial condition‑aware generative model based on Bayesian flow networks (BFNs). First, SculptDrug follows a BFN‑based framework and employs a progressive denoising strategy to ensure spatial modeling fidelity, iteratively refining atom positions while enhancing local interactions for precise spatial alignment. Second, we introduce a Boundary Awareness Block that incorporates protein surface constraints into the generative process to ensure that generated ligands are geometrically compatible with the target protein. Third, we design a Hierarchical Encoder that captures global structural context while preserving fine‑grained molecular interactions, ensuring overall consistency and accurate ligand‑protein conformations. We evaluate SculptDrug on the CrossDocked dataset, and experimental results demonstrate that SculptDrug outperforms state‑of‑the‑art baselines, highlighting the effectiveness of spatial condition‑aware modeling.
Authors: Zhijun Zeng, Junqing Chen, Zuoqiang Shi
Abstract: We study an inverse problem for stochastic and quantum dynamical systems in a time‑label‑free setting, where only unordered density snapshots sampled at unknown times drawn from an observation‑time distribution are available. These observations induce a distribution over state densities, from which we seek to recover the parameters of the underlying evolution operator. We formulate this as learning a distribution‑to‑function neural operator and propose BlinDNO, a permutation‑invariant architecture that integrates a multiscale U‑Net encoder with an attention‑based mixer. Numerical experiments on a wide range of stochastic and quantum systems, including a 3D protein‑folding mechanism reconstruction problem in a cryo‑EM setting, demonstrate that BlinDNO reliably recovers governing parameters and consistently outperforms existing neural inverse operator baselines.
Authors: Benedetta Marmiroli, Sumea Klokic, Barbara Sartori, Marie Reissenbuechel, Alessio Turchet, Heinz Amenitsch
Abstract: Microfluidic devices are increasingly used in synchrotron‑based experiments to deliver and probe liquid samples, offering advantages such as minimal sample consumption and reduced radiation damage. Despite their growing use, few devices have been specifically designed for monitoring liquids under photoexcitation, a promising approach for fast structural transitions. Here, a microfluidic device that is transparent to X‑rays in one direction and simultaneously transmits UV and visible light in the perpendicular direction is presented. The device is fabricated using lamination and UV lithography on a dry‑film resist, eliminating the need for cleanroom facilities and simplifying production. Its multi‑wavelength transparency was validated through UV‑visible spectroscopy, where photoexcitation at different wavelengths induced reversible trans‑to‑cis isomerization of azobenzene and fluoro‑azobenzene. X‑ray transparency was validated through Small Angle X‑ray Scattering (SAXS) measurements on hemoglobin and CO‑ligated hemoglobin sensitive to quaternary structural changes. These results confirm the suitability of the device for resolving protein structures and photoinduced conformational dynamics. As proof‑of‑concept results show, the design further supports temperature‑jump and time‑resolved pump‑probe experiments, providing a versatile platform for studying structural evolution in liquid samples using synchrotron SAXS.
Authors: Tom Pan, Evan Dramko, Mitchell D. Miller, Anastasios Kyrillidis, George N. Phillips
Abstract: Protein structure determination has long been one of the primary challenges of structural biology, to which deep machine learning (ML)‑based approaches have increasingly been applied. However, these ML models generally do not incorporate the experimental measurements directly, such as X‑ray crystallographic diffraction data. To this end, we explore an approach that more tightly couples these traditional crystallographic and recent ML‑based methods, by training a hybrid 3‑d vision transformer and convolutional network on inputs from both domains. We make use of two distinct input constructs: Patterson maps, which are directly obtainable from crystallographic data, and "partial structure" template maps derived from predicted structures deposited in the AlphaFold Protein Structure Database with subsequently omitted residues. With these, we predict electron density maps that are then post‑processed into atomic models through standard crystallographic refinement processes. Introducing an initial dataset of small protein fragments taken from Protein Data Bank entries and placing them in hypothetical crystal settings, we demonstrate that our method is effective at both improving the phases of the crystallographic structure factors and completing the regions missing from partial structure templates, as well as improving the agreement of the electron density maps with the ground truth atomic structures.
Authors: Vincent Schilling, Akshat Dubey, Georges Hattab
Abstract: Peptide classification tasks, such as predicting toxicity and HIV inhibition, are fundamental to bioinformatics and drug discovery. Traditional approaches rely heavily on handcrafted encodings of one‑dimensional (1D) peptide sequences, which can limit generalizability across tasks and datasets. Recently, protein language models (PLMs), such as ESM‑2 and ESMFold, have demonstrated strong predictive performance. However, they face two critical challenges. First, fine‑tuning is computationally costly. Second, their complex latent representations hinder interpretability for domain experts. Additionally, many frameworks have been developed for specific types of peptide classification, lacking generalization. These limitations restrict the ability to connect model predictions to biologically relevant motifs and structural properties. To address these limitations, we present PepTriX, a novel framework that integrates one dimensional (1D) sequence embeddings and three‑dimensional (3D) structural features via a graph attention network enhanced with contrastive training and cross‑modal co‑attention. PepTriX automatically adapts to diverse datasets, producing task‑specific peptide vectors while retaining biological plausibility. After evaluation by domain experts, we found that PepTriX performs remarkably well across multiple peptide classification tasks and provides interpretable insights into the structural and biophysical motifs that drive predictions. Thus, PepTriX offers both predictive robustness and interpretable validation, bridging the gap between performance‑driven peptide‑level models (PLMs) and domain‑level understanding in peptide research.
Authors: Yuancheng Sun, Yuxuan Ren, Zhaoming Chen, Xu Han, Kang Liu, Qiwei Ye
Abstract: Accurate exploration of protein conformational ensembles is essential for uncovering function but remains hard because molecular‑dynamics (MD) simulations suffer from high computational costs and energy‑barrier trapping. This paper presents Energy Preference Optimization (EPO), an online refinement algorithm that turns a pretrained protein ensemble generator into an energy‑aware sampler without extra MD trajectories. Specifically, EPO leverages stochastic differential equation sampling to explore the conformational landscape and incorporates a novel energy‑ranking mechanism based on list‑wise preference optimization. Crucially, EPO introduces a practical upper bound to efficiently approximate the intractable probability of long sampling trajectories in continuous‑time generative models, making it easily adaptable to existing pretrained generators. On Tetrapeptides, ATLAS, and Fast‑Folding benchmarks, EPO successfully generates diverse and physically realistic ensembles, establishing a new state‑of‑the‑art in nine evaluation metrics. These results demonstrate that energy‑only preference signals can efficiently steer generative models toward thermodynamically consistent conformational ensembles, providing an alternative to long MD simulations and widening the applicability of learned potentials in structural biology and drug discovery.
Authors: Yaodong Yang, Yang Wang, Jinpeng Li, Pei Guo, Da Han, Guangyong Chen, Pheng-Ann Heng
Abstract: Protein evolution through amino acid mutations is a cornerstone of life sciences. Recent advances in protein language models have shown rich evolutionary patterns, offering unprecedented potential for in silico directed evolution. However, existing directed evolution methods largely rely on heuristic evolution strategies and have yet to efficiently integrate the transformative protein language models with advanced optimization techniques, such as reinforcement learning, to adaptively learn superior evolution policies. To bridge this gap, we propose AlphaDE, a novel framework that evolves protein sequences by harnessing the innovative paradigms of large language models, such as fine‑tuning and test‑time inference. First, AlphaDE fine‑tunes pretrained protein language models using masked language modeling on homologous protein sequences to activate the evolutionary plausibility of the protein family of interest. Second, AlphaDE introduces test‑time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from the fine‑tuned protein language model. Extensive benchmark experiments show that AlphaDE remarkably outperforms previous state‑of‑the‑art methods even with few‑shot fine‑tuning. A case study further demonstrates that AlphaDE supports condensing the protein sequence space of avGFP through computational evolution.
Authors: Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar
Abstract: Structure‑based drug design (SBDD) faces a fundamental scaling‑fidelity dilemma: rich pocket‑aware conditioning captures interaction geometry but can be costly, often scaling quadratically (O(L^2)) or worse with protein length (L), while efficient sequence‑only conditioning can miss key interaction structure. We propose SiDGen, a structure‑informed discrete diffusion framework that resolves this trade‑off through a Topological Information Bottleneck (TIB). SiDGen leverages a learned, soft assignment mechanism to compress residue‑level protein representations into a compact bottleneck, enabling downstream pairwise computations on the coarse grid (O(L^2/s^2)). This design reduces memory and computational cost without compromising generative accuracy. Our approach achieves state‑of‑the‑art performance on CrossDocked2020 and DUD‑E benchmarks while significantly reducing pairwise‑tensor memory. SiDGen bridges the gap between sequence‑based efficiency and pocket‑aware conditioning, offering a scalable path for high‑throughput structure‑based discovery.
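To make the O(L^2) versus O(L^2/s^2) comparison in the abstract above concrete, the short calculation below counts pairwise entries before and after coarsening. It assumes that s is a per‑axis compression factor, so that L residues become about L/s bottleneck tokens; the abstract does not define s explicitly, and the function name and example sizes are illustrative only.

```python
def pairwise_entries(length, stride=1):
    """Number of entries in a pairwise grid after compressing `length` tokens by `stride`."""
    coarse = -(-length // stride)  # ceiling division: tokens left after coarsening
    return coarse * coarse


L, s = 512, 4                   # hypothetical protein length and compression factor
full = pairwise_entries(L)      # 512 * 512 entries on the residue-level grid
coarse = pairwise_entries(L, s) # 128 * 128 entries on the coarse grid
reduction = full / coarse       # equals s**2 when s divides L
```

Under this reading, a compression factor of 4 cuts the pairwise tensor by a factor of 16, independent of L.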
Authors: Lukas Billera, Hedwig Nora Nordlinder, Jack Collier Ryder, Anton Oresten, Aron Stålmarck, Theodor Mosetti Björk, Ben Murrell
Abstract: Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori.
Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and "multimodal" product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.
Authors: Peter Lunkenheimer, Sebastian Emmert, Martin Wolf, Alois Loidl
Abstract: In the present work, we examine the relevance and proper interpretation of broadband‑dielectric and THz‑spectroscopy data for the investigation of various types of biological matter. We provide an overview of the rich variety of different dynamic processes that can be detected by these experimental methods. Several experimental examples are discussed in detail, helping to understand the information that can be drawn from such studies. This includes dielectric spectra, extending well into the GHz region, for pure water, which can be considered as a simple but highly important biological molecule. We also discuss results for a prototypical aqueous solution of a protein, belonging to one of the most important classes of biological macromolecules. Moreover, we examine broadband dielectric spectra on blood as an example of functional biological matter in organisms. To demonstrate the relevance of THz spectroscopy for the investigation of biological molecules, we finally treat such experiments applied to different amino acids.
Authors: Olivier Destaing, Bertrand Fourcade
Abstract: Nanoclustering is a characteristic feature of the activated state of proteins and is essential for forming numerous subcellular structures. The formation of these nanoclusters is highly dependent on a series of post‑translational modifications, such as mono‑ and multi‑phosphorylation and dephosphorylation of residues. We theoretically simulate how a protein can be either mono‑ or multi‑phosphorylated on several residues in functional nanoclusters, depending on effective biophysical parameters (diffusion, dwell time, etc.). Moving beyond a binary view of phosphorylation, this approach highlights the interplay between mono‑ and multi‑phosphorylation and the cooperative effects generally associated with multi‑phosphorylation networks, and stresses the role of phosphatases in transforming graded phosphorylation signals into almost switch‑like responses. The results are discussed in light of experiments that probe the distribution of phospho‑residues.
Authors: Erik Hartman, Jonas Wallin, Johan Malmström, Jimmy Olsson
Abstract: Proteins underpin most biological function, and the ability to design them with tailored structures and properties is central to advances in biotechnology. Diffusion‑based generative models have emerged as powerful tools for protein design, but steering them toward proteins with specified properties remains challenging. The Feynman‑Kac (FK) framework provides a principled way to guide diffusion models using user‑defined rewards. In this paper, we enable FK‑based steering of RFdiffusion through the development of guiding potentials that leverage ProteinMPNN and structural relaxation to guide the diffusion process towards desired properties. We show that steering can be used to consistently improve predicted interface energetics and increase binder designability by 89.5%. Together, these results establish that diffusion‑based protein design can be effectively steered toward arbitrary, non‑differentiable objectives, providing a model‑independent framework for controllable protein generation.
Authors: Alvaro Lanza, Inés Martínez-Martín, Rafael Tapia-Rojo, Stefano Bo
Abstract: Quantifying the irreversibility and dissipation of non‑equilibrium processes is crucial to understanding their behavior, assessing their possible capabilities, and characterizing their efficiency. We introduce a physical quantity that quantifies the irreversibility of stochastic Langevin systems from the observation of individual molecules' displacements. Categorizing these displacements into a few groups based on their initial and final position allows us to measure irreversibility precisely without the need to know the forces and magnitude of the fluctuations acting on the system. Our model‑free estimate of irreversibility is related to entropy production by a conditional fluctuation theorem and provides a lower bound to the average entropy production. We validate the method on single‑molecule force spectroscopy experiments of proteins subject to force ramps. We show that irreversibility is sensitive to detailed features of the energy landscape underlying the protein folding dynamics and suggest how our methods can be employed to unveil key properties of protein folding processes.
Authors: Stanislav Selitskiy
Abstract: Large Artificial Neural Network (ANN) models have demonstrated success in various domains, including general text and image generation, drug discovery, and protein‑RNA (ribonucleic acid) binding tasks. However, these models typically demand substantial computational resources, time, and data for effective training. Given that such extensive resources are often inaccessible to many researchers and that life sciences data sets are frequently limited, we investigated whether small ANN models could achieve acceptable accuracy in protein‑RNA prediction. We experimented with shallow feed‑forward ANNs comprising two hidden layers and various non‑linearities. These models did not utilize explicit structural information; instead, a sliding window approach was employed to implicitly consider the context of neighboring residues and bases. We explored different training techniques to address the issue of highly unbalanced data. Among the seven most popular non‑linearities for feed‑forward ANNs, only three yielded converging models: Rectified Linear Unit (ReLU), Gated Linear Unit (GLU), and Hyperbolic Tangent (Tanh). Common re‑balancing techniques, such as under‑ and over‑sampling of training sets, proved ineffective, whereas increasing the volume of training data and using model ensembles significantly improved performance. The optimal context window size, balancing both false negative and false positive errors, was found to be approximately 30 residues and bases. Our findings indicate that high‑accuracy protein‑RNA binding prediction is achievable using computing hardware accessible to most educational and research institutions.
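The sliding window approach described in the abstract above can be illustrated with a small helper that extracts one fixed‑width context window per position. This is a sketch under stated assumptions: windows are centered on each position, the width is odd so the window is symmetric (for example 31 rather than the roughly 30 reported), and sequence ends are padded with a placeholder symbol. None of these choices, nor the function name, come from the paper.

```python
def sliding_windows(seq, window, pad="-"):
    """Return one context window of width `window` per position, centered on that position."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = pad * half + seq + pad * half  # pad both ends so edge positions get full windows
    return [padded[i:i + window] for i in range(len(seq))]


windows = sliding_windows("MKVL", 3)  # one window per residue of the toy sequence
```

Each window would then be converted to a numeric feature vector (for instance by one‑hot encoding) and passed to the feed‑forward network as the input for its central residue or base.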
Authors: David Regan, Ozan Aksakal, Athena Zitti, John McLarnon, Magdalena Lipka-Lloyd, Pierre J. Rizkallah, Anna J. Warren, Peter D. Watson, Wolfgang Langbein, D. Dafydd Jones, Paola Borri
Abstract: Stimulated Raman scattering (SRS) microscopy offers great potential to surpass fluorescent‑based approaches, owing to the sharp linewidth of Raman vibrations amenable to super‑multiplex cell imaging, but currently lacks one crucial component: genetically encodable tags equivalent to fluorescent proteins. Here, we show that infrared fluorescent proteins (IRFPs) can be used as genetically encoded SRS probes and benefit from the electronic pre‑resonant SRS enhancement effect with near‑infrared exciting pulses, comparable to synthetic dyes reported in the literature. SRS imaging of the nucleus in mammalian cells is demonstrated where a histone protein is fused to an IRFP. This work opens the route towards Raman‑based cell imaging using genetically encoded probes, motivating efforts in solving the challenges of photostability and creating a vibrational palette.
Authors: Yoonho Lee, Joseph Boen, Chelsea Finn
Abstract: We introduce Feedback Descent, a framework that optimizes text artifacts (prompts, code, and molecules) through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in‑context learning can transform structured feedback into gradient‑like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high‑bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task‑agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state‑of‑the‑art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph‑based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug‑like molecules surpassing the 99.9th percentile of a database with more than 260,000 compounds across six protein targets.
Authors: Bowei He, Bowen Gao, Yankai Chen, Yanyan Lan, Chen Ma, Philip S. Yu, Ya-Qin Zhang, Wei-Ying Ma
Abstract: Virtual screening (VS) is an essential task in drug discovery, focusing on the identification of small‑molecule ligands that bind to specific protein pockets. Existing deep learning methods, from early regression models to recent contrastive learning approaches, primarily rely on structural data while overlooking protein sequences, which are more accessible and can enhance generalizability. However, directly integrating protein sequences poses challenges due to the redundancy and noise in large‑scale protein‑ligand datasets. To address these limitations, we propose S^2Drug, a two‑stage framework that explicitly incorporates protein Sequence information and 3D Structure context in protein‑ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2‑based backbone, combined with a tailored data sampling strategy to reduce redundancy and noise on both protein and ligand sides. In the second stage, we fine‑tune on PDBbind by fusing sequence and structure information through a residue‑level gating module, while introducing an auxiliary binding site prediction task. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement, thereby refining protein‑ligand matching. Across multiple benchmarks, S^2Drug consistently improves virtual screening performance and achieves strong results on binding site prediction, demonstrating the value of bridging sequence and structure in contrastive learning.
Authors: Subhadip Basu, Oded Farago
Abstract: Membrane proteins often form dimers and higher‑order oligomers whose stability and spatial organization depend sensitively on their lipid environment. To investigate the physical principles underlying this coupling, we employ a lattice Monte Carlo model of ternary lipid mixtures that exhibit liquid‑disordered (L_d) and liquid‑ordered (L_o) phase coexistence. In this framework, proteins are represented as small membrane inclusions with tunable nearest neighbor interactions with both lipids and other proteins, allowing us to examine how protein‑lipid affinity competes with protein‑protein interactions and lipid‑lipid demixing. We find that the balance of these interactions controls whether proteins remain dispersed, assemble into small oligomers, or form large stable clusters within L_o domains, and that increasing the protein concentration further promotes coarsening of the ordered phase. To incorporate ligand‑regulated activation, we extend the model to a kinetic Monte Carlo scheme in which proteins stochastically switch between inactive and active states with distinct affinities. The inverse switching rate, relative to the time required for a protein to diffuse across the characteristic size of the L_o domains, governs the aggregation behavior. Rapid switching yields only transient small oligomers, slow switching reproduces the static limit with persistent large clusters, and intermediate rates produce broad cluster‑size distributions. These results highlight the interplay between lipid phase organization, protein‑lipid affinity, and activation dynamics in regulating membrane protein oligomerization, a coupling that is central to signal transduction and membrane organization in living cells.
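The lattice Monte Carlo idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes a periodic square lattice, collapses the ternary lipid mixture to two lipid labels plus a protein label, uses placeholder pair energies in `EPS`, and picks Kawasaki (neighbor-swap) moves with Metropolis acceptance as one common way to conserve composition. All names (`EPS`, `kawasaki_step`, `run_lattice`) are hypothetical.

```python
import math
import random

# Species labels (illustrative): 0 = L_d lipid, 1 = L_o lipid, 2 = protein.
# EPS[a][b] is an assumed nearest-neighbor pair energy in units of kT;
# negative values favor contact. These numbers are placeholders, not fitted.
EPS = [
    [0.0, 0.0, 0.0],
    [0.0, -1.0, -0.8],
    [0.0, -0.8, -1.5],
]


def neighbors(x, y, size):
    """Four nearest neighbors on a periodic square lattice."""
    return [((x + 1) % size, y), ((x - 1) % size, y),
            (x, (y + 1) % size), (x, (y - 1) % size)]


def local_energy(grid, x, y):
    """Sum of pair energies between site (x, y) and its four neighbors."""
    size = len(grid)
    s = grid[x][y]
    return sum(EPS[s][grid[nx][ny]] for nx, ny in neighbors(x, y, size))


def kawasaki_step(grid, rng):
    """Attempt one neighbor swap; accept with the Metropolis probability."""
    size = len(grid)
    x, y = rng.randrange(size), rng.randrange(size)
    nx, ny = rng.choice(neighbors(x, y, size))
    if grid[x][y] == grid[nx][ny]:
        return False  # swapping identical species changes nothing
    # The shared bond is counted twice before and after, so it cancels.
    before = local_energy(grid, x, y) + local_energy(grid, nx, ny)
    grid[x][y], grid[nx][ny] = grid[nx][ny], grid[x][y]
    after = local_energy(grid, x, y) + local_energy(grid, nx, ny)
    delta = after - before
    if delta <= 0 or rng.random() < math.exp(-delta):
        return True
    grid[x][y], grid[nx][ny] = grid[nx][ny], grid[x][y]  # reject: undo swap
    return False


def run_lattice(size=16, steps=20000, seed=0):
    """Random initial mixture, then `steps` attempted swaps."""
    rng = random.Random(seed)
    n = size * size
    n_protein, n_lo = n // 16, n // 3  # assumed composition
    species = [2] * n_protein + [1] * n_lo + [0] * (n - n_protein - n_lo)
    rng.shuffle(species)
    grid = [species[i * size:(i + 1) * size] for i in range(size)]
    for _ in range(steps):
        kawasaki_step(grid, rng)
    return grid
```

Because every move is a swap, the number of proteins and of each lipid type stays fixed, which is what lets protein-lipid affinity compete with lipid demixing at constant composition. The ligand-regulated switching described in the abstract would add a further move type and is not sketched here.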
Authors: Janani G, Deepak Bhat
Abstract: In biological cells, DNA replication is carried out by the replisome, a protein complex encompassing multiple DNA polymerases. DNA replication is semi‑discontinuous: a DNA polymerase synthesizes one (leading) strand of the DNA continuously, and another polymerase synthesizes the other (lagging) strand discontinuously. Complex dynamics of the lagging‑strand polymerase within the replisome result in the formation of short interim fragments, known as Okazaki fragments, and gaps between them. Although the semi‑discontinuous replication is ubiquitous, a detailed characterization of it remains elusive. In this work, we develop a framework to investigate the semi‑discontinuous replication by incorporating stochastic dynamics of the lagging‑strand polymerase. Computing the size distribution of Okazaki fragments and gaps, we uncover the significance of the polymerase dissociation in shaping them. We apply the method to the previous experiment on the T4 bacteriophage replication system and identify the key parameters governing the polymerase dynamics. These results reveal that the collisions of lagging‑strand polymerase with pre‑synthesised Okazaki fragments primarily trigger its dissociation from the lagging strand.
Authors: Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou
Abstract: Autonomous agents driven by Large Language Models (LLMs) have revolutionized reasoning and problem‑solving but remain static after training, unable to grow with experience as intelligent beings do during deployment. We introduce Forward Learning with EXperience (FLEX), a gradient‑free learning paradigm that enables LLM agents to continuously evolve through accumulated experience. Specifically, FLEX cultivates scalable and inheritable evolution by constructing a structured experience library through continual reflection on successes and failures during interaction with the environment. FLEX delivers substantial improvements on mathematical reasoning, chemical retrosynthesis, and protein fitness prediction (up to 23% on AIME25, 10% on USPTO50k, and 14% on ProteinGym). We further identify a clear scaling law of experiential growth and the phenomenon of experience inheritance across agents, marking a step toward scalable and inheritable continuous agent evolution. Project Page: https://flex-gensi-thuair.github.io.
Authors: Ruihai Wang, Qianhao Zhao, Julia Quinn, Liming Yang, Yuhui Zhu, Feifei Huang, Chengfei Guo, Tianbo Wang, Pengming Song, Michael Murphy, Thanh D. Nguyen, Andrew Maiden, Francisco E. Robles, Guoan Zheng
Abstract: The mesoscale characterization of biological specimens has traditionally required compromises between resolution, field‑of‑view, depth‑of‑field, and molecular specificity, with most approaches relying on external labels. Here we present the Deep‑ultrAviolet ptychogRaphic pockeT‑scope (DART), a handheld platform that transforms label‑free molecular imaging through intrinsic deep‑ultraviolet spectroscopic contrast. By leveraging biomolecules' natural absorption fingerprints and combining them with lensless ptychographic microscopy, DART resolves down to 308‑nm linewidths across centimeter‑scale areas while maintaining millimeter‑scale depth‑of‑field. The system's virtual error‑bin methodology effectively eliminates artifacts from limited temporal coherence and other optical imperfections, enabling high‑fidelity molecular imaging without lenses. Through differential spectroscopic imaging at deep‑ultraviolet wavelengths, DART quantitatively maps nucleic acid and protein distributions with femtogram sensitivity, providing an intrinsic basis for explainable virtual staining. We demonstrate DART's capabilities through molecular imaging of tissue sections, cytopathology specimens, blood cells, and neural populations, revealing detailed molecular contrast without external labels. The combination of high‑resolution molecular mapping and broad mesoscale imaging in a portable platform opens new possibilities from rapid clinical diagnostics, tissue analysis, to biological characterization in space exploration.
Authors: Ziyang Gao, Annie Cheung, Yihao Ou
Abstract: Accurate prediction of protein‑ligand binding affinity plays a pivotal role in accelerating the discovery of novel drugs and vaccines, particularly for gastrointestinal (GI) diseases such as gastric ulcers, Crohn's disease, and ulcerative colitis. Traditional computational models often rely on structural information alone and thus fail to capture the genetic determinants that influence disease mechanisms and therapeutic responses. To address this gap, we propose GastroDL‑Fusion, a dual‑modal deep learning framework that integrates protein‑ligand complex data with disease‑associated gene sequence information for drug and vaccine development. In our approach, protein‑ligand complexes are represented as molecular graphs and modeled using a Graph Isomorphism Network (GIN), while gene sequences are encoded into biologically meaningful embeddings via a pre‑trained Transformer (ProtBERT/ESM). These complementary modalities are fused through a multi‑layer perceptron to enable robust cross‑modal interaction learning. We evaluate the model on benchmark datasets of GI disease‑related targets, demonstrating that GastroDL‑Fusion significantly improves predictive performance over conventional methods. Specifically, the model achieves a mean absolute error (MAE) of 1.12 and a root mean square error (RMSE) of 1.75, outperforming CNN, BiLSTM, GIN, and Transformer‑only baselines. These results confirm that incorporating both structural and genetic features yields more accurate predictions of binding affinities, providing a reliable computational tool for accelerating the design of targeted therapies and vaccines in the context of gastrointestinal diseases.
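For readers comparing the two error figures quoted above, the standard definitions can be sketched as follows. The function names are illustrative, and the abstract does not state the affinity units, so none are assumed here.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)
```

RMSE is never smaller than MAE on the same residuals, and the gap widens when a few predictions are far off, so a reported RMSE noticeably above the MAE is consistent with some large individual errors.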
Authors: Yaoyao Xu, Di Wang, Zihan Zhou, Tianshu Yu, Mingchen Chen
Abstract: Understanding the dynamic behavior of proteins is critical to elucidating their functional mechanisms, yet generating realistic, temporally coherent trajectories of protein ensembles remains a significant challenge. In this work, we introduce a novel hierarchical autoregressive framework for modeling protein dynamics that leverages the intrinsic multi‑scale organization of molecular motions. Unlike existing methods that focus on generating static conformational ensembles or treat dynamic sampling as an independent process, our approach characterizes protein dynamics as a Markovian process. The framework employs a two‑scale architecture: a low‑resolution model captures slow, collective motions driving major conformational transitions, while a high‑resolution model generates detailed local fluctuations conditioned on these large‑scale movements. This hierarchical design ensures that the causal dependencies inherent in protein dynamics are preserved, enabling the generation of temporally coherent and physically realistic trajectories. By bridging high‑level biophysical principles with state‑of‑the‑art generative modeling, our approach provides an efficient framework for simulating protein dynamics that balances computational efficiency with physical accuracy.
Authors: Abigail Lin
Abstract: Predicting the effect of amino acid mutations on enzyme thermodynamic stability (DDG) is fundamental to protein engineering and drug design. While recent deep learning approaches have shown promise, they often process sequence and structure information independently, failing to capture the intricate coupling between local structural geometry and global sequential patterns. We present DGTN (Diffused Graph‑Transformer Network), a novel architecture that co‑learns graph neural network (GNN) weights for structural priors and transformer attention through a diffusion mechanism. Our key innovation is a bidirectional diffusion process where: (1) GNN‑derived structural embeddings guide transformer attention via learnable diffusion kernels, and (2) transformer representations refine GNN message passing through attention‑modulated graph updates. We provide rigorous mathematical analysis showing this co‑learning scheme achieves provably better approximation bounds than independent processing. On ProTherm and SKEMPI benchmarks, DGTN achieves state‑of‑the‑art performance (Pearson Rho = 0.87, RMSE = 1.21 kcal/mol), a 6.2% improvement over the best baselines. Ablation studies confirm the diffusion mechanism contributes 4.8 points to correlation. Our theoretical analysis proves the diffused attention converges to optimal structure‑sequence coupling, with convergence rate O(1/sqrt(T)), where T is the number of diffusion steps. This work establishes a principled framework for integrating heterogeneous protein representations through learnable diffusion.
Authors: Xinheng He, Yijia Zhang, Haowei Lin, Xingang Peng, Xiangzhe Kong, Mingyu Li, Jianzhu Ma
Abstract: Structure‑based drug design has seen significant advancements with the integration of artificial intelligence (AI), particularly in the generation of hit and lead compounds. However, most AI‑driven approaches neglect the importance of endogenous protein interactions with peptides, which may result in suboptimal molecule designs. In this work, we present Peptide2Mol, an E(3)‑equivariant graph neural network diffusion model that generates small molecules by referencing both the original peptide binders and their surrounding protein pocket environments. Trained on large datasets and leveraging sophisticated modeling techniques, Peptide2Mol not only achieves state‑of‑the‑art performance in non‑autoregressive generative tasks, but also produces molecules with similarity to the original peptide binder. Additionally, the model allows for molecule optimization and peptidomimetic design through a partial diffusion process. Our results highlight Peptide2Mol as an effective deep generative model for generating and optimizing bioactive small molecules from protein binding pockets.
Authors: Alvaro Prat, Leo Zhang, Charlotte M. Deane, Yee Whye Teh, Garrett M. Morris
Abstract: Determining the binding pose of a ligand to a protein, known as molecular docking, is a fundamental task in drug discovery. Generative approaches promise faster, improved, and more diverse pose sampling than physics‑based methods, but are often hindered by chemically implausible outputs, poor generalisability, and high computational cost. To address these challenges, we introduce a novel fragmentation scheme, leveraging inductive biases from structural chemistry, to decompose ligands into rigid‑body fragments. Building on this decomposition, we present SigmaDock, an SE(3) Riemannian diffusion model that generates poses by learning to reassemble these rigid bodies within the binding pocket. By operating at the level of fragments in SE(3), SigmaDock exploits well‑established geometric priors while avoiding overly complex diffusion processes and unstable training dynamics. Experimentally, we show SigmaDock achieves state‑of‑the‑art performance, reaching Top‑1 success rates (RMSD<2 & PB‑valid) above 79.9% on the PoseBusters set, compared to 12.7‑30.8% reported by recent deep learning approaches, whilst demonstrating consistent generalisation to unseen proteins. SigmaDock is the first deep learning approach to surpass classical physics‑based docking under the PB train‑test split, marking a significant leap forward in the reliability and feasibility of deep learning for molecular modelling.
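The success criterion quoted above (RMSD<2 & PB‑valid) can be made concrete with a small sketch. It assumes predicted and reference ligand atoms are already matched one-to-one with coordinates in ångströms, ignores symmetry correction, and takes the PoseBusters validity flag as a precomputed boolean rather than reimplementing those checks. The names `rmsd` and `top1_success_rate` are illustrative.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between matched atom coordinates."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    sq = sum((a - b) ** 2 for p, r in zip(pred, ref) for a, b in zip(p, r))
    return math.sqrt(sq / len(pred))


def top1_success_rate(cases, threshold=2.0):
    """Fraction of complexes whose top-ranked pose is valid and under threshold.

    Each case is (pred_coords, ref_coords, pb_valid), where pb_valid is an
    externally computed validity flag (assumed input).
    """
    hits = sum(
        1 for pred, ref, pb_valid in cases
        if pb_valid and rmsd(pred, ref) < threshold
    )
    return hits / len(cases)
```

Requiring both conditions matters: a pose can sit within 2 Å of the reference and still be chemically implausible, so rates reported with and without the validity filter are not directly comparable.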
Authors: Negar Karpourazar, Keyvan Khosh Abady, Peter M. Rentzepis
Abstract: This article describes the design and construction of a portable, compact, and cost‑effective microspectrophotometer (MSP) that operates in the range of 200‑800 nm. This microscope spectrophotometer records high‑resolution absorption and emission spectra in situ. The dual‑head design of this MSP enables simultaneous real‑time imaging and spectral recording of heterogeneous samples with high selectivity and micrometer spatial resolution. Our compact, portable MSP design reduces construction costs by more than 20 times compared to commercial benchtop alternatives, primarily due to its innovative illumination system and microscope objective design. The performance of the UV‑vis‑NIR MSP was confirmed by comparing the absorption and fluorescence spectra of an aqueous solution of Ru(bpy) obtained with our system to those measured by commercial spectroscopic systems. The high accuracy and reliability of our system in measuring absorbance and fluorescence were confirmed by R‑squared values of 0.998 and 0.990, respectively, from colorimetric and fluorometric tests. The MSP was further used to record absorption and fluorescence spectra from a variety of samples, including dyes and protein crystals, in both the solution and solid state, as well as individual living cells. This compact instrument is ideal for rapid, in situ spectroscopic measurements and is expected to find on‑site applications across various fields, such as environmental monitoring, biological research, forensic analysis, and materials characterization.
Authors: Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang
Abstract: Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre‑training mechanisms. To acquire complex protein information, we introduce reconstructive pre‑training to mine more fine‑grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi‑label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.
Authors: Riccardo Rossetto, Marcel Ernst, David Zwicker
Abstract: Biological membranes often exhibit heterogeneous protein patterns, which cells control. Strong patterns, like the polarity spot in budding yeast, can be described as surface condensates, formed by physical interactions between constituents. However, it is unclear how these interactions affect the material exchange with the bulk. To study this, we analyze a thermodynamically consistent model, which reveals that passive exchange generally accelerates the coarsening of surface condensates. Active exchange can further accelerate coarsening, although it can also fully arrest it and induce complex patterns involving various length scales. We reveal how these behaviors are related to non‑local transport via diffusion through the bulk, rationalizing the various scaling laws we observe and allowing us to interpret biologically relevant scenarios.
Authors: James C. Bowden, Sergey Levine, Jennifer Listgarten
Abstract: In the era of AI‑driven science and engineering, we often want to design discrete objects in silico according to user‑specified properties. For example, we may wish to design a protein to bind its target, arrange components within a circuit to minimize latency, or find materials with certain properties. Given a property predictive model, in silico design typically involves training a generative model over the design space (e.g., protein sequence space) to concentrate on designs with the desired properties. Distributional optimization (which can be formalized as an estimation of distribution algorithm or as reinforcement learning policy optimization) finds the generative model that maximizes an objective function in expectation. Optimizing a distribution over discrete‑valued designs is in general challenging because of the combinatorial nature of the design space. However, many property predictors in scientific applications are decomposable in the sense that they can be factorized over design variables in a way that could in principle enable more effective optimization. For example, amino acids at a catalytic site of a protein may only loosely interact with amino acids of the rest of the protein to achieve maximal catalytic activity. Current distributional optimization algorithms are unable to make use of such decomposability structure. Herein, we propose and demonstrate use of a new distributional optimization algorithm, Decomposition‑Aware Distributional Optimization (DADO), that can leverage any decomposability defined by a junction tree on the design variables, to make optimization more efficient. At its core, DADO employs a soft‑factorized "search distribution" (a learned generative model) for efficient navigation of the search space, invoking graph message‑passing to coordinate optimization across linked factors.
Authors: M. Z. Haider, M. U. Ghouri, Tayyaba Noreen, M. Salman
Abstract: Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy‑tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low‑probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum‑Enhanced Generative Model (QEGM), a hybrid classical‑quantum framework that integrates deep latent‑variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail‑aware likelihood, and (2) quantum randomness‑driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter‑shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real‑world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state‑of‑the‑art baselines (GAN, VAE, Diffusion), while improving rare‑event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare‑event prediction, offering robustness beyond what is achievable with purely classical methods.
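The parameter‑shift gradients mentioned above can be illustrated on the smallest possible circuit. This is a sketch under stated assumptions, not QEGM's training loop: it uses a single qubit, one RY(theta) rotation applied to |0>, and a Pauli‑Z measurement, simulated exactly in plain Python. The function names are hypothetical.

```python
import math


def expectation_z(theta):
    """<Z> after applying RY(theta) to |0>, from the exact two-amplitude state."""
    amp0 = math.cos(theta / 2.0)  # amplitude of |0>
    amp1 = math.sin(theta / 2.0)  # amplitude of |1>
    return amp0 ** 2 - amp1 ** 2  # equals cos(theta)


def parameter_shift_grad(expectation, theta, shift=math.pi / 2.0):
    """Gradient from two shifted circuit evaluations.

    Exact for gates generated by a Pauli operator (such as RY); it is not a
    finite-difference approximation, so the shift does not need to be small.
    """
    plus = expectation(theta + shift)
    minus = expectation(theta - shift)
    return (plus - minus) / (2.0 * math.sin(shift))
```

For this circuit the result equals the analytic derivative, -sin(theta). On hardware each evaluation is a sampled estimate, so the gradient costs two circuit runs per parameter, which is why such loops pair it with classical backpropagation for the non-quantum weights.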
Authors: Gabriel Nobis, Maximilian Springenberg, Arina Belova, Rembert Daems, Christoph Knochenhauer, Manfred Opper, Tolga Birdal, Wojciech Samek
Abstract: We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non‑Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long‑range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA‑fBM), we construct FDBM that enable tractable inference while preserving the non‑Markovian nature of fBM. We prove the existence of a coupling‑preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schrödinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C_α atomic positions in protein structure prediction and lower Fréchet Inception Distance (FID) in unpaired image translation.
Authors: Yuhang Kang, Ziyu Su, Tianyang Wang, Zaibo Li, Wei Chen, Muhammad Khalid Khan Niazi
Abstract: Compared to hematoxylin‑eosin (H&E) staining, immunohistochemistry (IHC) not only maintains the structural features of tissue samples, but also provides high‑resolution protein localization, which is essential for aiding in pathology diagnosis. Despite its diagnostic value, IHC remains a costly and labor‑intensive technique. Its limited scalability and constraints in multiplexing further hinder widespread adoption, especially in resource‑limited settings. Consequently, researchers are increasingly exploring computational stain translation techniques to synthesize IHC‑equivalent images from H&E‑stained slides, aiming to extract protein‑level information more efficiently and cost‑effectively. However, most existing stain translation techniques rely on a linearly weighted summation of multiple loss terms within a single objective function, a strategy that often overlooks the interdependence among these components, resulting in suboptimal image quality and an inability to simultaneously preserve structural authenticity and color fidelity. To address this limitation, we propose a novel network architecture that follows a progressive structure, incorporating color and cell border generation logic, which enables each visual aspect to be optimized in a stage‑wise and decoupled manner. To validate the effectiveness of our proposed network architecture, we build upon the Adaptive Supervised PatchNCE (ASP) framework as our baseline. We introduce additional loss functions based on 3,3'‑diaminobenzidine (DAB) chromogen concentration and image gradient, enhancing color fidelity and cell boundary clarity in the generated IHC images. By reconstructing the generation pipeline using our structure‑color‑cell boundary progressive mechanism, experiments on HER2 and ER datasets demonstrated that the model significantly improved visual quality and achieved finer structural details.
Authors: Annabelle Martin, Daphne Kontogiorgos-Heintz, Jeff Nivala
Abstract: Nanopore protein sequencing produces long, noisy ionic current traces in which key molecular phases, such as protein capture and translocation, are embedded. Capture phases mark the successful entry of a protein into the pore and serve as both a checkpoint and a signal that a channel merits further analysis. However, manual identification of capture phases is time‑intensive, often requiring several days for expert reviewers to annotate the data due to the need for domain‑specific interpretation of complex signal patterns. To address this, a lightweight one‑dimensional convolutional neural network (1D CNN) was developed and trained to detect capture phases in down‑sampled signal windows. Evaluated against CNN‑LSTM (Long Short‑Term Memory) hybrids, histogram‑based classifiers, and other CNN variants using run‑level data splits, our best model, CaptureNet‑Deep, achieved an F1 score of 0.94 and precision of 93.39% on held‑out test data. The model supports low‑latency inference and is integrated into a dashboard for Oxford Nanopore experiments, reducing the total analysis time from several days to under thirty minutes. These results show that efficient, real‑time capture detection is possible using simple, interpretable architectures and suggest a broader role for lightweight ML models in sequencing workflows.
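Since the abstract reports both an F1 score and a precision, a short sketch of how the two relate may help. It computes precision, recall, and F1 from confusion counts for a binary "capture phase" label; the counts in any call are the reader's own inputs, and the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Binary precision, recall, and F1 from confusion counts.

    tp: windows correctly flagged as capture phases
    fp: windows flagged as capture phases that are not
    fn: capture-phase windows the detector missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two; a detector cannot reach a high F1 by trading away most of one for the other.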
Authors: Wei Zhang, Zekun Guo, Yingce Xia, Peiran Jin, Shufang Xie, Tao Qin, Xiang-Yang Li
Abstract: Structure‑based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion‑based structure encoder; and (2) to guide molecules toward desired properties, we curate a property‑aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state‑of‑the‑art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
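The Direct Preference Optimization step named above can be sketched for a single preference pair. This shows the standard DPO objective only; how MolChord builds its preference pairs or sets `beta` is not stated in the abstract, so the inputs and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of generated molecules.

    Inputs are sequence log-probabilities under the policy being tuned
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # How much more the policy prefers the chosen sample than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference agree, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen sample's likelihood relative to the rejected one, while the reference terms keep it from drifting arbitrarily far from the pretrained generator.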
Authors: Minghui Li, Yuanhang Wang, Peijin Guo, Wei Wan, Shengshan Hu, Shengqing Hu
Abstract: Accurate prediction of Drug‑Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence‑based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic‑level, substructural‑level, and molecular‑level multi‑scale features. We propose HiF‑DTA, a hierarchical network that adopts a dual‑pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs multi‑scale to learn atomic, substructural, and molecular representations fused via a multi‑scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF‑DTA outperforms state‑of‑the‑art baselines, with ablations confirming the importance of global‑local extraction and multi‑scale fusion.
Authors: Dian Chen, Yunkai Chen, Tong Lin, Sijie Chen, Xiaolin Cheng
Abstract: Multimodal approaches that integrate protein structure and sequence have achieved remarkable success in protein‑protein interface prediction. However, extending these methods to protein‑peptide interactions remains challenging due to the inherent conformational flexibility of peptides and the limited availability of structural data that hinder direct training of structure‑aware models. To address these limitations, we introduce GeoPep, a novel framework for peptide binding site prediction that leverages transfer learning from ESM3, a multimodal protein foundation model. GeoPep fine‑tunes ESM3's rich pre‑learned representations from protein‑protein binding to address the limited availability of protein‑peptide binding data. The fine‑tuned model is further integrated with a parameter‑efficient neural network architecture capable of learning complex patterns from sparse data. Furthermore, the model is trained using distance‑based loss functions that exploit 3D structural information to enhance binding site prediction. Comprehensive evaluations demonstrate that GeoPep significantly outperforms existing methods in protein‑peptide binding site prediction by effectively capturing sparse and heterogeneous binding patterns.
Authors: Taylor Schaffner, Benjamin B. Machta
Abstract: Many protein‑protein interaction (PPI) networks take place in the fluid yet structured plasma membrane. Lipid domains, sometimes termed rafts, have been implicated in the functioning of various membrane‑bound signaling processes. Here, we present a model and a Monte Carlo simulation framework to investigate how changes in the domain size that arise from perturbations to membrane criticality can lead to changes in the rate of interactions among components, leading to altered outcomes. For simple PPI networks, we show that the activity can be highly sensitive to thermodynamic parameters near the critical point of the membrane phase transition. When protein‑protein interactions change the partitioning of some components, our system sometimes forms out of equilibrium domains we term pockets, driven by a mixture of thermodynamic interactions and kinetic sorting. More generally, we predict that near the critical point many different PPI networks will have their outcomes depend sensitively on perturbations that influence critical behavior.
Authors: Kevin Ching, Anthony Estrada, Nicholas M Rubayiza, Ligesh Theeyancheri, Jennifer M. Schwarz, Jennifer L Ross
Abstract: We investigate how an active bath of enzymes influences the liquid‑liquid phase separation (LLPS) of a non‑interacting condensing protein. The enzyme we choose to use as the active driver is urease, an enzyme that has been shown by several groups to exhibit enhanced diffusion in the presence of its substrate. The non‑interacting LLPS protein is ubiquilin‑2, a protein that condenses with increasing temperature and salt. Using a microfluidic device with semipermeable membranes, we create a chemostatic environment to maintain the substrate content to feed the enzymatic bath and remove the products of the chemical reaction. Thus, we isolate the physical enhanced fluctuations from the chemical changes of the enzyme activity. We also compare the results to controls without activity or in the presence of the products of the reaction. We find that the active bath is able to enhance droplet size, density, and concentration, implying that more ubiquilin‑2 is in condensed form. This result is consistent with an interpretation that the active bath acts as an effective temperature. Simulations provide an underlying interpretation for our experimental results. Together, these findings provide the first demonstration that physical enzymatic activity can act as an effective temperature to modify LLPS behavior, with implications for intracellular organization in the enzymatically active cellular environment.
Authors: Peter Benner, Boris N. Khoromskij, Venera Khoromskaia, Matthias Stein
Abstract: We propose and justify a new approach for fast calculation of the electrostatic interaction energy of clusters of charged particles in constrained energy minimization in the framework of rigid protein‑ligand docking. Our "blind search" docking technique is based on the low‑rank range‑separated (RS) tensor‑based representation of the free‑space electrostatic potential of the biomolecule represented on a large n × n × n 3D grid. We show that both the collective electrostatic potential of a complex protein‑ligand system and the respective electrostatic interaction energy can be calculated by tensor techniques in O(n)‑complexity, such that the numerical cost for energy calculation only mildly (logarithmically) depends on the number of particles in the system. Moreover, tensor representation of the electrostatic potential enables usage of large 3D Cartesian grids (of the order of n^3 ~ 10^12), which could allow the accurate modeling of complexes with several large proteins. In our approach, selection of the correct geometric pose predictions in the localized posing process is based on the control of van der Waals distance between the target molecular clusters. Here, we confine ourselves to constrained minimization of the energy functional by using only fast tensor‑based free‑space electrostatic energy recalculation for various rotations and translations of both clusters. Numerical tests of the electrostatic energy‑based "protein‑ligand docking" algorithm applied to synthetic and realistic input data present a proof of concept for rather complex particle configurations. The method may be used in the framework of the traditional stochastic or deterministic posing/docking techniques.
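For context on what the tensor method accelerates, the quantity being computed can be written as a direct pairwise sum. The sketch below is the naive baseline, not the range‑separated tensor approach from the abstract: it assumes point charges in free space, a unit prefactor (no dielectric constant or unit conversion), and a cost proportional to the product of the two cluster sizes. The function name is illustrative.

```python
import math


def interaction_energy(cluster_a, cluster_b):
    """Free-space Coulomb interaction energy between two rigid clusters.

    Each cluster is a list of (charge, (x, y, z)) tuples. Only cross-cluster
    pairs are summed, so intra-cluster energies (constant under rigid motion)
    are excluded.
    """
    energy = 0.0
    for q_a, r_a in cluster_a:
        for q_b, r_b in cluster_b:
            energy += q_a * q_b / math.dist(r_a, r_b)
    return energy
```

In a rigid docking search this sum is re-evaluated for every trial rotation and translation, which is the repeated cost that a grid-based potential representation is meant to reduce.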
Authors: Charlotte Claye, Pierre Marschall, Wassila Ouerdane, Céline Hudelot, Julien Duquesne
Abstract: Single‑cell RNA‑seq foundation models achieve strong performance on downstream tasks but remain black boxes, limiting their utility for biological discovery. Recent work has shown that sparse dictionary learning can extract concepts from deep learning models, with promising applications in biomedical imaging and protein models. However, interpreting biological concepts remains challenging, as biological sequences are not inherently human‑interpretable. We introduce a novel concept‑based interpretability framework for single‑cell RNA‑seq models with a focus on concept interpretation and evaluation. We propose an attribution method with counterfactual perturbations that identifies genes that influence concept activation, moving beyond correlational approaches like differential expression analysis. We then provide two complementary interpretation approaches: an expert‑driven analysis facilitated by an interactive interface and an ontology‑driven method with attribution‑based biological pathway enrichment. Applying our framework to two well‑known single‑cell RNA‑seq models from the literature, we interpret concepts extracted by Top‑K Sparse Auto‑Encoders trained on two immune cell datasets. With a domain expert in immunology, we show that concepts improve interpretability compared to individual neurons while preserving the richness and informativeness of the latent representations. This work provides a principled framework for interpreting what biological knowledge foundation models have encoded, paving the way for their use for hypothesis generation and discovery.
Authors: Chenyu Tang, Mayank Prakash Pandey, Cheng Giuseppe Chen, Alberto Megías, François Dehez, Christophe Chipot
Abstract: Molecular transitions ‑‑ such as protein folding, allostery, and membrane transport ‑‑ are central to biology yet remain notoriously difficult to simulate. Their intrinsic rarity pushes them beyond reach of standard molecular dynamics, while enhanced‑sampling methods are costly and often depend on arbitrary variables that bias outcomes. We introduce Gen‑COMPAS, a generative committor‑guided path sampling framework that reconstructs transition pathways without predefined variables and at a fraction of the cost. Gen‑COMPAS couples a generative diffusion model, which produces physically realistic intermediates, with committor‑based filtering to pinpoint transition states. Short unbiased simulations from these intermediates rapidly yield full transition‑path ensembles that converge within nanoseconds, where conventional methods require orders of magnitude more sampling. Applied to systems from a miniprotein to a ribose‑binding protein to a mitochondrial carrier, Gen‑COMPAS retrieves committors, transition states, and free‑energy landscapes efficiently, uniting machine learning and molecular dynamics for broad mechanistic and practical insight.
Authors: Antonio Grimaldi, Michele Stofella, Billy Hobbs, Theodoros K. Karamanos, Emanuele Paci
Abstract: Hydrogen‑deuterium exchange (HDX) of protein backbone amides provides a powerful probe of conformational dynamics. However, when experiments are performed in H2O/D2O mixtures, quantitative interpretation is hindered by back exchange and isotope effects not captured by the classical Linderstrom‑Lang (LL) model. We introduce a generalized Linderstrom‑Lang (GLL) framework that explicitly accounts for forward and reverse exchange and for changes in protection upon isotopic substitution. Analytical solutions describe equilibrium enrichment (fractionation) and protection factors in mixtures, reducing to the LL model in pure D2O. Application to HDX/NMR of the molecular chaperone DNAJB1 in 50% D2O demonstrates that the GLL model recovers protection factors at 100% D2O. Ignoring back exchange (i.e., using the LL model) causes protection factors to be systematically underestimated. A particularly powerful feature of our approach is that a single HDX experiment in a mixture (e.g., 50% D2O) simultaneously provides protection factors that report on conformational dynamics and local stability, and fractionation factors that are sensitive to the local hydrogen‑bonding environment.
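For orientation, the classical Linderstrom‑Lang limit that the abstract says the GLL model reduces to in pure D2O can be sketched numerically. This is a minimal sketch of the standard EX2 relation only, in which the observed exchange rate is the intrinsic rate divided by the protection factor; it does not reproduce the GLL expressions, which the abstract does not state. The function names and the numerical values are hypothetical.

```python
import math


def ll_uptake(t, k_int, protection):
    """Fraction of an amide site deuterated at time t.

    Classical Linderstrom-Lang, EX2 limit, pure D2O, no back exchange:
    k_obs = k_int / protection.
    """
    return 1.0 - math.exp(-k_int * t / protection)


def protection_from_uptake(t, k_int, fraction):
    """Invert ll_uptake for the protection factor (needs 0 < fraction < 1)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -k_int * t / math.log(1.0 - fraction)


# Hypothetical numbers: intrinsic rate 1 s^-1, protection factor 1000, 10 minutes.
fraction = ll_uptake(t=600.0, k_int=1.0, protection=1000.0)
recovered = protection_from_uptake(t=600.0, k_int=1.0, fraction=fraction)
```

The round trip only confirms that the two functions are inverses. It says nothing about mixtures, where the abstract reports that ignoring back exchange biases the protection factor.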
Authors: Danqi Liao, Chen Liu, Xingzhi Sun, Dié Tang, Haochen Wang, Scott Youlten, Srikar Krishna Gopinath, Haejeong Lee, Ethan C. Strayer, Antonio J. Giraldez, Smita Krishnaswamy
Abstract: Generating property‑optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence‑function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property‑guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property‑organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property‑guided Langevin dynamics procedure that performs optimization along the manifold. Across three real‑world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
Authors: Junhua Chen, Simon Mathis, Charles Harris, Kieran Didi, Pietro Lio
Abstract: Generative modeling techniques such as Diffusion and Flow Matching have achieved significant successes in generating designable and diverse protein backbones. However, many current models are computationally expensive, requiring hundreds or even thousands of function evaluations (NFEs) to yield samples of acceptable quality, which can become a bottleneck in practical design campaigns that often generate 10^4 to 10^6 designs per target. In image generation, Rectified Flows (ReFlow) can significantly reduce the required NFEs for a given target quality, but their application in protein backbone generation has been less studied. We apply ReFlow to improve the low NFE performance of pretrained SE(3) flow matching models for protein backbone generation and systematically study ReFlow design choices in the context of protein generation in data curation, training and inference time settings. In particular, we (1) show that ReFlow in the protein domain is particularly sensitive to the choice of coupling generation and annealing, (2) demonstrate how useful design choices for ReFlow in the image domain do not directly translate to better performance on proteins, and (3) make improvements to ReFlow methodology for proteins.
Authors: Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Roy Tal, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan
Abstract: Accurately predicting the three‑dimensional structures of protein‑ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein‑ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large‑scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)‑equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi‑chain templating system supporting both protein and non‑polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state‑of‑the‑art performance in protein‑ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket‑conditional cofolding regime, Pearl delivers 3.6× improvement on a proprietary set of challenging, real‑world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
Authors: Joohwan Ko, Aristofanis Rontogiannis, Yih-En Andrew Ban, Axel Elaldi, Nicholas Franklin
Abstract: Protein design using structure prediction models such as AlphaFold2 has shown remarkable success, but existing approaches like relaxed sequence optimization (RSO) rely on single‑path gradient descent and ignore sequence‑space constraints, limiting diversity and designability. We introduce Relaxed Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that integrates structural and evolutionary information for protein design. RSS operates in continuous logit space, combining gradient‑guided exploration with protein language model‑informed jumps. Its energy function couples AlphaFold2‑derived structural objectives with ESM2‑derived sequence priors, balancing accuracy and biological plausibility. In an in silico protein binder design task, RSS produces 5× more designable structures and 2‑3× greater structural diversity than RSO baselines, at equal computational cost. These results highlight RSS as a principled approach for efficiently exploring the protein design landscape.
Authors: Jingjie Zhang, Hanqun Cao, Zijun Gao, Yu Wang, Shaoning Li, Jun Xu, Cheng Tan, Jun Zhu, Chang-Yu Hsieh, Chunbin Gu, Pheng Ann Heng
Abstract: Post‑translational modifications (PTMs) form a combinatorial "code" that regulates protein function, yet deciphering this code ‑ linking modified sites to their catalytic enzymes ‑ remains a central unsolved problem in understanding cellular signaling and disease. We introduce COMPASS‑PTM, a mechanism‑aware, coarse‑to‑fine learning framework that unifies residue‑level PTM profiling with enzyme‑substrate assignment. COMPASS‑PTM integrates evolutionary representations from protein language models with physicochemical priors and a crosstalk‑aware prompting mechanism that explicitly models inter‑PTM dependencies. This design allows the model to learn biologically coherent patterns of cooperative and antagonistic modifications while addressing the dual long‑tail distribution of PTM data. Across multiple proteome‑scale benchmarks, COMPASS‑PTM establishes new state‑of‑the‑art performance, including a 122% relative F1 improvement in multi‑label site prediction and a 54% gain in zero‑shot enzyme assignment. Beyond accuracy, the model demonstrates interpretable generalization, recovering canonical kinase motifs and predicting disease‑associated PTM rewiring caused by missense variants. By bridging statistical learning with biochemical mechanism, COMPASS‑PTM unifies site‑level and enzyme‑level prediction into a single framework that learns the grammar underlying protein regulation and signaling.
Authors: Runjie Zheng, Zhen Wang, Anjie Qiao, Jiancong Xie, Jiahua Rao, Yuedong Yang
Abstract: Accurate protein function prediction requires integrating heterogeneous intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts (e.g., protein‑protein interactions and GO term annotations). However, two key challenges hinder effective fusion: (i) cross‑modal distributional mismatch among embeddings produced by pre‑trained intrinsic encoders, and (ii) noisy relational graphs of extrinsic data that degrade GNN‑based information aggregation. We propose Diffused and Aligned Multi‑modal Protein Embedding (DAMPE), a unified framework that addresses these through two core mechanisms. First, we propose Optimal Transport (OT)‑based representation alignment that establishes correspondence between intrinsic embedding spaces of different modalities, effectively mitigating cross‑modal heterogeneity. Second, we develop a Conditional Graph Generation (CGG)‑based information fusion method, where a condition encoder fuses the aligned intrinsic embeddings to provide informative cues for graph reconstruction. Meanwhile, our theoretical analysis implies that the CGG objective drives this condition encoder to absorb graph‑aware knowledge into its produced protein representations. Empirically, DAMPE outperforms or matches state‑of‑the‑art methods such as DPFunc on standard GO benchmarks, achieving AUPR gains of 0.002‑0.013 pp and Fmax gains 0.004‑0.007 pp. Ablation studies further show that OT‑based alignment contributes 0.043‑0.064 pp AUPR, while CGG‑based fusion adds 0.005‑0.111 pp Fmax. Overall, DAMPE offers a scalable and theoretically grounded approach for robust multi‑modal protein representation learning, substantially enhancing protein function prediction.
Authors: Michael Ito, Danai Koutra, Jenna Wiens
Abstract: Random walk neural networks (RWNNs) have emerged as a promising approach for graph representation learning, leveraging recent advances in sequence models to process random walks. However, under realistic sampling constraints, RWNNs often fail to capture global structure even in small graphs due to incomplete node and edge coverage, limiting their expressivity. To address this, we propose random search neural networks (RSNNs), which operate on random searches, each of which guarantees full node coverage. Theoretically, we demonstrate that in sparse graphs, only O(\log |V|) searches are needed to achieve full edge coverage, substantially reducing sampling complexity compared to the O(|V|) walks required by RWNNs (assuming walk lengths scale with graph size). Furthermore, when paired with universal sequence models, RSNNs are universal approximators. We lastly show RSNNs are probabilistically invariant to graph isomorphisms, ensuring their expectation is an isomorphism‑invariant graph function. Empirically, RSNNs consistently outperform RWNNs on molecular and protein benchmarks, achieving comparable or superior performance with up to 16× fewer sampled sequences. Our work bridges theoretical and practical advances in random walk based approaches, offering an efficient and expressive framework for learning on sparse graphs.
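The coverage argument above can be illustrated with a small sketch. It assumes, hypothetically, that a "random search" is a randomized depth‑first search on a connected graph; the abstract does not specify the search procedure. The helper names `random_search` and `edge_coverage` and the toy graph are illustrative. The point shown is that one search visits every node, while edge coverage accumulates over repeated searches.

```python
import random


def random_search(adj, start, rng):
    """Randomized depth-first search; returns visit order and tree edges."""
    visited = {start}
    order = [start]
    tree_edges = []
    stack = [start]
    while stack:
        node = stack[-1]
        unvisited = [n for n in adj[node] if n not in visited]
        if not unvisited:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(unvisited)  # random choice among unexplored neighbours
        visited.add(nxt)
        order.append(nxt)
        tree_edges.append(frozenset((node, nxt)))
        stack.append(nxt)
    return order, tree_edges


def edge_coverage(adj, num_searches, seed=0):
    """Fraction of undirected edges used as a tree edge in at least one search."""
    rng = random.Random(seed)
    all_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    nodes = sorted(adj)
    covered = set()
    for _ in range(num_searches):
        _, edges = random_search(adj, rng.choice(nodes), rng)
        covered.update(edges)
    return len(covered) / len(all_edges)


# Toy sparse graph: a 6-node ring with one chord (7 undirected edges).
ring = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4, 0]}
order, tree_edges = random_search(ring, 0, random.Random(1))
```

A single search on a connected graph with |V| nodes yields exactly |V| - 1 tree edges, so any graph with more edges than that needs several searches for full edge coverage.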
Authors: Hamza Patwa, Philip Kurian
Abstract: Collective emission of light from distributions of two‑level systems (TLSs) was first predicted in 1954 by Robert Dicke, who showed that when N quantum emitters absorb photons, their collective radiative decay rate can be enhanced (superradiance) or suppressed (subradiance) relative to a single emitter. In this work, we derive novel analytical expressions for the collective decay rates and Lamb shifts for the interaction of a single photon with a continuous distribution of TLSs on an infinite line and an infinite helix. We compare these solutions to collectives of TLSs on a cylinder, finding limits in which the eigenvalues of structures of different dimensions are equal. We also compare our solution with arrangements where the emitter distribution is discrete rather than continuous, and when short‑ (1/r^3), intermediate‑ (1/r^2), and long‑range (1/r) interaction terms are included. We find important differences between the discrete vector and continuous scalar emitter cases, which do not agree in the limit where discrete spacing goes to 0. The analytical solution for the helix is then used to make estimates of the maximally superradiant state, thermally averaged collective decay rate, and percentage of trapped states of quantum emitter architectures in protein fibers. Given the differences between our idealized infinite helix and the numerical model describing protein fibers, our analytical estimates show excellent agreement with the numerical results for sparse arrangements of emitters in protein fibers. Our work thus bridges the gap between different formalisms for superradiance, aids the engineering of devices which harness quantum optical effects for computing with superradiant error correction and subradiant memories, and motivates the discovery and creation of flexible platforms for quantum information processing using the intrinsic helical geometries of biomatter.
Authors: Linhan Wang, Jianwen Dou, Wang Li, Shengkun Wang, Zhiwu Xie, Chang-Tien Lu, Yinlin Chen
Abstract: Cryogenic Electron Tomography (CryoET) combined with sub‑volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Particle picking, the task of localizing and classifying target proteins in 3D CryoET volumes, remains the main bottleneck. Due to the reliance on time‑consuming manual labels, the vast reserve of unlabeled tomograms remains underutilized. In this work, we present a fast, label‑efficient semi‑supervised framework that exploits this untapped data. Our framework consists of two components: (i) an end‑to‑end heatmap‑supervised detection model inspired by keypoint detection, and (ii) a teacher‑student co‑training mechanism that enhances performance under sparse labeling conditions. Furthermore, we introduce multi‑view pseudo‑labeling and a CryoET‑specific DropBlock augmentation strategy to further boost performance. Extensive evaluations on the large‑scale CZII dataset show that our approach improves F1 by 10% over supervised baselines, underscoring the promise of semi‑supervised learning for leveraging unlabeled CryoET data.
Authors: Md Saiful Islam Sajol, Magesh Rajasekaran, Hayden Gemeinhardt, Adam Bess, Chris Alvin, Supratik Mukhopadhyay
Abstract: Computationally predicting protein‑protein interactions (PPIs) is challenging due to the lack of integrated, multimodal protein representations. DPEB is a curated collection of 22,043 human proteins that integrates four embedding types: structural (AlphaFold2), transformer‑based sequence (BioEmbeddings), contextual amino acid patterns (ESM‑2: Evolutionary Scale Modeling), and sequence‑based n‑gram statistics (ProtVec). AlphaFold2 protein structures are available through public databases (e.g., AlphaFold2 Protein Structure Database), but the internal neural network embeddings are not. DPEB addresses this gap by providing AlphaFold2‑derived embeddings for computational modeling. Our benchmark evaluations show GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification and 86.04% accuracy for protein family classification. DPEB supports multiple graph neural network methods for PPI prediction, enabling applications in systems biology, drug target identification, pathway analysis, and disease mechanism studies.
Authors: Oscar Davis, Michael S. Albergo, Nicholas M. Boffi, Michael M. Bronstein, Avishek Joey Bose
Abstract: Geometric data and purpose‑built generative models on them have become ubiquitous in high‑impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference ‑‑ requiring many steps of complex numerical simulation ‑‑ as they are derived from dynamical measure transport frameworks such as diffusion and flow‑matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few‑step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self‑distillation‑based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few‑step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state‑of‑the‑art sample quality for single‑ and few‑step evaluations, and superior or competitive log‑likelihoods using the implicit probability flow.
Authors: Chang Liu, Leona Licht, Jan Rothhardt
Abstract: We present a theoretical evaluation of radiation dose constraints for extreme ultraviolet (EUV) and soft X‑ray microscopy. Our work particularly addresses the long‑standing concern regarding strong absorption of EUV radiation in biological specimens. Using an established dose‑resolution model, we compare hydrated and dehydrated cellular states and quantify the fluence required for nanoscale imaging. Our analysis identifies a protein window spanning photon energies from 70 eV up to the carbon K‑edge (284 eV), where EUV microscopy could in principle achieve sub‑10 nm half‑pitch resolution in dehydrated samples at doses well below the Henderson limit, thereby eliminating the need for cryogenic conditions. In this situation, the radiation dose required for EUV imaging is also substantially lower than what is required for comparable resolution in water window soft X‑ray microscopy. Furthermore, EUV photons with sufficiently high energy exhibit micrometer‑scale penetration depths in dehydrated biomatter, enabling exceptional amplitude and phase contrast through thin cellular regions and small cells. These findings provide quantitative guidelines for photon energy selection and establish the EUV protein window as a dose‑efficient and physically viable modality for high‑resolution, label‑free, material‑specific imaging of dehydrated biological matter.
Authors: Ramón Nartallo-Kaluarachchi, Shashanka Ubaru, Małgorzata J Zimoń, Dongsung Huh, Robert Manson-Sawko, Lior Horesh, Yoshua Bengio
Abstract: Sequential generative models conditioned on uncertain rewards are central to AI‑driven scientific discovery, yet the epistemic uncertainty they inherit from imperfect reward estimates remains unquantified. We propagate this uncertainty through generative flow networks (GFlowNets) by fitting polynomial chaos expansions (PCEs) to small ensembles of trained models. The PCE coefficients yield analytical Sobol sensitivity indices, providing the first interpretable decomposition of which reward components drive which generative decisions, a capability unavailable from deep ensembles, Bayesian neural networks, or Monte Carlo dropout. Convergence guarantees are established theoretically, and four of five are formally verified in the Lean 4 proof assistant. Across three real‑world tasks, the framework reveals actionable structure invisible to ensembles alone. On the Doyle‑Dreher Buchwald‑Hartwig dataset, catalyst selection is robust (D_catalyst ≈ 71) while additive selection is fragile (D_additive ≈ 179, 2.5× higher). In fragment‑based molecular design, the linker position is the most sensitive (D_linker ≈ 28) while decoration positions are the most robust (D ≈ 14‑18), reversing the conventional scaffold‑robust / decoration‑fragile assumption. On the Sachs protein signalling network, MAPK‑cascade edges and PKA/PKC hub edges separate into distinct sensitivity regimes, providing a targeted map for perturbation experiments. Calibration coverage at the 95% level reaches 0.97‑1.00 across the dominant steps, and the surrogate evaluates 10,000 policy samples in milliseconds ‑ 10^3‑10^4× faster than exhaustive retraining.
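The step from PCE coefficients to Sobol indices can be made concrete with a short sketch. It assumes an orthonormal polynomial basis, for which the output variance is the sum of squared non‑constant coefficients and the first‑order index of input i collects the terms that depend on input i alone. The function name `sobol_first_order` and the coefficient values are hypothetical, and the D quantities reported in the abstract are not defined there, so they are not computed here.

```python
def sobol_first_order(coeffs):
    """First-order Sobol indices from PCE coefficients.

    coeffs maps a multi-index tuple (polynomial degree per input) to the
    coefficient of that term in an orthonormal basis (assumed).
    """
    dim = len(next(iter(coeffs)))
    # Total variance: every term except the constant (all-zero multi-index).
    variance = sum(c * c for alpha, c in coeffs.items() if any(alpha))
    if variance == 0.0:
        raise ValueError("expansion has zero variance")
    indices = []
    for i in range(dim):
        # Terms that involve input i and no other input.
        partial = sum(
            c * c
            for alpha, c in coeffs.items()
            if alpha[i] > 0 and all(a == 0 for j, a in enumerate(alpha) if j != i)
        )
        indices.append(partial / variance)
    return indices


# Hypothetical two-input expansion with one interaction term, (1, 1).
coeffs = {(0, 0): 1.0, (1, 0): 2.0, (2, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
first_order = sobol_first_order(coeffs)
```

In this toy expansion the first‑order indices sum to 6/7; the remaining 1/7 of the variance belongs to the interaction term.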
Authors: Cécile Marie Vincent, Sapna Ravindran, Alexis Michel Prevost, Léa-Laetitia Pontani, Olivier Bénichou, Elie Wandersman
Abstract: In tissues, cells in direct physical contact with each other can exchange ions or molecules via protein clusters called gap junctions that form channels across the membranes of adjacent cells. Here, we use a simplified biomimetic approach, coupled with theoretical modeling, to unravel the physical mechanisms controlling such transport. Tissues are mimicked with 2D hexagonal networks of monodisperse aqueous droplets connected by lipid membranes called Droplet Interface Bilayers (DIBs), decorated with α‑Hemolysin (αHL) transmembrane proteins forming nanopores through heptamerization in the membrane. The diffusion of calcein across 2D DIB networks is thoroughly studied using epifluorescence microscopy at various αHL concentrations. The results are successfully confronted with a Continuous Time Random Walk model in hexagonal networks, with an average waiting time increasing nonlinearly with the concentration of pore monomers.
Authors: Daniel M. Steinberg, Asiri Wijesinghe, Rafael Oliveira, Piotr Koniusz, Cheng Soon Ong, Edwin V. Bonilla
Abstract: We introduce active generation of Pareto sets (A‑GPS), a new framework for online discrete black‑box multi‑objective optimization (MOO). A‑GPS learns a generative model of the Pareto set that supports a‑posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non‑dominance relations and to condition the generative model toward high‑performing regions of the search space. We also show that this non‑dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade‑offs, A‑GPS introduces preference direction vectors that encode user‑specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high‑quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.
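The non‑dominance relation that the class probability estimator is trained to predict can be written down directly. The sketch below assumes maximization of every objective and uses brute‑force pairwise comparison; it only produces the labels, not the estimator or the generative model. The names `dominates` and `non_dominated_labels` and the example points are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (maximization assumed)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_labels(points):
    """Label each point 1 if no other point dominates it, else 0."""
    return [0 if any(dominates(q, p) for q in points) else 1 for p in points]


# Hypothetical two-objective evaluations of five candidates.
points = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
labels = non_dominated_labels(points)
```

The pairwise check costs O(n^2) comparisons, which is acceptable for the small batches typical of online black‑box optimization.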
Authors: Srivathsan Badrinarayanan, Yue Su, Janghoon Ock, Alan Pham, Sanya Ahuja, Amir Barati Farimani
Abstract: Protein mutations can have profound effects on biological function, making accurate prediction of property changes critical for drug discovery, protein engineering, and precision medicine. Current approaches rely on fine‑tuning protein‑specific transformers for individual datasets, but struggle with cross‑dataset generalization due to heterogeneous experimental conditions and limited target domain data. We introduce two key innovations: (1) the first application of Model‑Agnostic Meta‑Learning (MAML) to protein mutation property prediction, and (2) a novel mutation encoding strategy using separator tokens to directly incorporate mutations into sequence context. We build upon transformer architectures integrating them with MAML to enable rapid adaptation to new tasks through minimal gradient steps rather than learning dataset‑specific patterns. Our mutation encoding addresses the critical limitation where standard transformers treat mutation positions as unknown tokens, significantly degrading performance. Evaluation across three diverse protein mutation datasets (functional fitness, thermal stability, and solubility) demonstrates significant advantages over traditional fine‑tuning. In cross‑task evaluation, our meta‑learning approach achieves 29% better accuracy for functional fitness with 65% less training time, and 94% better accuracy for solubility with 55% faster training. The framework maintains consistent training efficiency regardless of dataset size, making it particularly valuable for industrial applications and early‑stage protein design where experimental data is limited. This work establishes a systematic application of meta‑learning to protein mutation analysis and introduces an effective mutation encoding strategy, offering transformative methodology for cross‑domain generalization in protein engineering.
Authors: Suswitha Pericharla, Daniel B. Hier, Tayo Obafemi-Ajayi
Abstract: Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training‑term performance) and generalization (validation‑term performance) across multiple biomedical ontologies. Fine‑tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term‑to‑identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein‑gene (GENE) mappings (13.9% gain), while fine‑tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT‑4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine‑tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine‑tuning enhances factual recall versus when it fails due to sparse or non‑lexicalized identifiers.
Authors: Kevin Michalewicz, Chen Jin, Philip Alexander Teare, Tom Diethe, Mauricio Barahona, Barbara Bravi, Asher Mullokandov
Abstract: A fundamental challenge in protein design is the trade‑off between generating structural diversity while preserving motif biological function. Current state‑of‑the‑art methods, such as partial diffusion in RFdiffusion, often fail to resolve this trade‑off: small perturbations yield motifs nearly identical to the native structure, whereas larger perturbations violate the geometric constraints necessary for biological function. We introduce Protein Generation with Embedding Learning (PGEL), a general framework that learns high‑dimensional embeddings encoding sequence and structural features of a target motif in the representation space of a diffusion model's frozen denoiser, and then enhances motif diversity by introducing controlled perturbations in the embedding space. PGEL is thus able to loosen geometric constraints while satisfying typical design metrics, leading to more diverse yet viable structures. We demonstrate PGEL on three representative cases: a monomer, a protein‑protein interface, and a cancer‑related transcription factor complex. In all cases, PGEL achieves greater structural diversity, better designability, and improved self‑consistency, as compared to partial diffusion. Our results establish PGEL as a general strategy for embedding‑driven protein generation allowing for systematic, viable diversification of functional motifs.
Authors: Carles Navarro, Mariona Torrens, Philipp Thölke, Stefan Doerr, Gianni De Fabritiis
Abstract: Building a working mental model of a protein typically requires weeks of reading, cross‑referencing crystal and predicted structures, and inspecting ligand complexes, an effort that is slow, unevenly accessible, and often requires specialized computational skills. We introduce Speak to a Protein, a new capability that turns protein analysis into an interactive, multimodal dialogue with an expert co‑scientist. The AI system retrieves and synthesizes relevant literature, structures, and ligand data; grounds answers in a live 3D scene; and can highlight, annotate, manipulate and see the visualization. It also generates and runs code when needed, explaining results in both text and graphics. We demonstrate these capabilities on relevant proteins, posing questions about binding pockets, conformational changes, or structure‑activity relationships to test ideas in real‑time. Speak to a Protein reduces the time from question to evidence, lowers the barrier to advanced structural analysis, and enables hypothesis generation by tightly coupling language, code, and 3D structures. Speak to a Protein is freely accessible at https://open.playmolecule.org.
Authors: Adam Stecklov, Noah El Rimawi-Fine, Mathieu Blanchette
Abstract: Allocating extra computation at inference time has recently improved sample quality in large language models and diffusion‑based image generation. In parallel, Flow Matching (FM) has gained traction in language, vision, and scientific domains, but inference‑time scaling methods for it remain under‑explored. Concurrent work by Kim et al. (2025) approaches this problem but replaces the linear interpolant with a non‑linear variance‑preserving (VP) interpolant at inference, sacrificing FM's efficient and straight sampling. Additionally, inference‑time compute scaling for flow matching has so far been applied only to visual tasks such as image generation. We introduce novel inference‑time scaling procedures for FM that preserve the linear interpolant during sampling. Evaluations of our method on image generation and, for the first time (to the best of our knowledge), on unconditional protein generation show that (I) sample quality consistently improves as inference compute increases, and (II) flow matching inference‑time scaling can be applied to scientific domains.
Authors: Suqiang Ma, Subhadeep Sengupta, Yao Lee, Beikang Gu, Xianyan Chen, Xianqiao Wang, Yang Liu, Mengjia Xu, Galit H. Frydman, He Li
Abstract: Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single‑cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi‑channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two‑step analysis strategy. First, it categorizes images into cell cluster and non‑cluster groups by fine‑tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs). Then, it identifies cell types by overlaying cluster contours with regions from multi‑channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright‑field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
Authors: Alexander Aghili, Andy Bruce, Daniel Sabo, Sanya Murdeshwar, Kevin Bachelor, Ionut Mistreanu, Ashwin Lokapally, Razvan Marinescu
Abstract: The rapid evolution of molecular dynamics (MD) methods, including machine‑learned dynamics, has outpaced the development of standardized tools for method validation. Objective comparison between simulation approaches is often hindered by inconsistent evaluation metrics, insufficient sampling of rare conformational states, and the absence of reproducible benchmarks. To address these challenges, we introduce a modular benchmarking framework that systematically evaluates protein MD methods using enhanced sampling analysis. Our approach uses weighted ensemble (WE) sampling via The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis (WESTPA), based on progress coordinates derived from Time‑lagged Independent Component Analysis (TICA), enabling fast and efficient exploration of protein conformational space. The framework includes a flexible, lightweight propagator interface that supports arbitrary simulation engines, allowing both classical force fields and machine learning‑based models. Additionally, the framework offers a comprehensive evaluation suite capable of computing more than 19 different metrics and visualizations across a variety of domains. We further contribute a dataset of nine diverse proteins, ranging from 10 to 224 residues, that span a variety of folding complexities and topologies. Each protein has been extensively simulated at 300K for one million MD steps per starting point (4 ns). To demonstrate the utility of our framework, we perform validation tests using classic MD simulations with implicit solvent and compare protein conformational sampling using a fully trained versus under‑trained CGSchNet model. By standardizing evaluation protocols and enabling direct, reproducible comparisons across MD approaches, our open‑source platform lays the groundwork for consistent, rigorous benchmarking across the molecular simulation community.
Authors: Hyunjin Choo, Fanchen Bu, Hyunjin Hwang, Young-Gyu Yoon, Kijung Shin
Abstract: Higher‑order interactions (HOIs) in complex systems, such as scientific collaborations, multi‑protein complexes, and multi‑user communications, are commonly modeled as hypergraphs, where each hyperedge (i.e., a subset of nodes) represents an HOI among the nodes. Given a hypergraph, hyperedge prediction aims to identify hyperedges that are either missing or likely to form in the future, and it has broad applications, including recommending interest‑based social groups, predicting collaborations, and uncovering functional complexes in biological systems. However, the vast search space of hyperedge candidates (i.e., all possible subsets of nodes) poses a significant computational challenge, making naive exhaustive search infeasible. As a result, existing approaches rely on either heuristic sampling to obtain constrained candidate sets or ungrounded assumptions on hypergraph structure to select promising hyperedges.
In this work, we propose HyperSearch, a search‑based algorithm for hyperedge prediction that efficiently evaluates unconstrained candidate sets by incorporating two key components: (1) an empirically grounded scoring function derived from observations in real‑world hypergraphs and (2) an efficient search mechanism, where we derive and use an anti‑monotonic upper bound of the original scoring function (which is not anti‑monotonic) to prune the search space. This pruning comes with theoretical guarantees, ensuring that discarded candidates are never better than the kept ones w.r.t. the original scoring function. In extensive experiments on 10 real‑world hypergraphs across five domains, HyperSearch consistently outperforms state‑of‑the‑art baselines, achieving higher accuracy in predicting new (i.e., not in the training set) hyperedges.
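The pruning idea in this abstract can be illustrated with a toy sketch. It assumes a hypothetical per-node `weight` table, a made-up `score` that is not anti-monotonic, and a made-up `bound` (the minimum weight in the set) that is anti-monotonic and never smaller than the score. None of these are HyperSearch's actual scoring function or bound; the sketch only shows why such a bound lets a search discard a candidate together with all of its supersets.

```python
from itertools import combinations

def bound(subset, weight):
    # Anti-monotonic: adding a node can only lower (or keep) the minimum.
    return min(weight[v] for v in subset)

def score(subset, weight):
    # Toy score, NOT anti-monotonic: the size factor can raise the score
    # when a node is added. It never exceeds bound() because factor <= 1.
    size_factor = {1: 0.5, 2: 0.8, 3: 1.0}.get(len(subset), 0.6)
    return bound(subset, weight) * size_factor

def pruned_search(nodes, weight, threshold, max_size):
    """Return all subsets with score >= threshold, pruning with bound()."""
    results, evaluated = [], 0

    def extend(subset, start):
        nonlocal evaluated
        for i in range(start, len(nodes)):
            candidate = subset + (nodes[i],)
            evaluated += 1
            if bound(candidate, weight) < threshold:
                continue  # every superset also has bound < threshold
            if score(candidate, weight) >= threshold:
                results.append(candidate)
            if len(candidate) < max_size:
                extend(candidate, i + 1)

    extend((), 0)
    return results, evaluated

def brute_force(nodes, weight, threshold, max_size):
    """Reference answer: score every subset up to max_size."""
    return [c for k in range(1, max_size + 1)
            for c in combinations(nodes, k)
            if score(c, weight) >= threshold]
```

Because the bound dominates the score and can only shrink as a set grows, any candidate whose bound falls below the threshold can be dropped along with its supersets without losing a valid result, which is the kind of guarantee the abstract describes.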
Authors: Jose Siguenza, Bharath Ramsundar
Abstract: Neural networks that incorporate geometric relationships respecting SE(3) group transformations (e.g. rotations and translations) are increasingly important in molecular applications, such as molecular property prediction, protein structure modeling, and materials design. These models, known as SE(3)‑equivariant neural networks, ensure outputs transform predictably with input coordinate changes by explicitly encoding spatial atomic positions. Although libraries such as E3NN [4] and SE(3)‑TRANSFORMER [3] offer powerful implementations, they often require substantial deep learning or mathematical prior knowledge and lack complete training pipelines. We extend DEEPCHEM [13] with support for ready‑to‑use equivariant models, enabling scientists with minimal deep learning background to build, train, and evaluate models, such as SE(3)‑Transformer and Tensor Field Networks. Our implementation includes equivariant models, complete training pipelines, and a toolkit of equivariant utilities, supported with comprehensive tests and documentation, to facilitate both application and further development of SE(3)‑equivariant models.
Authors: Azam Shirali, Giri Narasimhan
Abstract: Protein‑protein docking tools help in studying interactions between proteins, and are essential for drug, vaccine, and therapeutic development. However, the accuracy of a docking tool depends on a robust scoring function that can reliably differentiate between native and non‑native complexes. PIsToN is a state‑of‑the‑art deep learning‑based scoring function that uses Vision Transformers in its architecture. Recently, the Mamba architecture has demonstrated exceptional performance in both natural language processing and computer vision, often outperforming Transformer‑based models in their domains. In this study, we introduce PUMBA (Protein‑protein interface evaluation with Vision Mamba), which improves PIsToN by replacing its Vision Transformer backbone with Vision Mamba. This change allows us to leverage Mamba's efficient long‑range sequence modeling for sequences of image patches. As a result, the model's ability to capture both global and local patterns in protein‑protein interface features is significantly improved. Evaluation on several widely‑used, large‑scale public datasets demonstrates that PUMBA consistently outperforms its original Transformer‑based predecessor, PIsToN.
Authors: Eli N. Weinstein, Andrei Slabodkin, Mattia G. Gollub, Elizabeth B. Wood
Abstract: Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high throughput screens, which can experimentally test the activity of 10^6‑10^12 protein sequences in parallel. In this article, we introduce algorithms to optimize high throughput screens for data creation and model training. We focus on the large scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we only collect positive examples of active sequences, i.e. x with y>0. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true p(y | x). We demonstrate this approach in simulation and on a large scale screen of antibodies. Overall, co‑design of experiments and inference lets us accelerate learning dramatically.
Authors: Arielle Sanford, Shuo Sun, Christian B. Mendl
Abstract: Recent advances in protein structure prediction, such as AlphaFold, have demonstrated the power of deep neural architectures like the Evoformer for capturing complex spatial and evolutionary constraints on protein conformation. However, the depth of the Evoformer, comprising 48 stacked blocks, introduces high computational costs and rigid layerwise discretization. Inspired by Neural Ordinary Differential Equations (Neural ODEs), we propose a continuous‑depth formulation of the Evoformer, replacing its 48 discrete blocks with a Neural ODE parameterization that preserves its core attention‑based operations. This continuous‑time Evoformer achieves constant memory cost (in depth) via the adjoint method, while allowing a principled trade‑off between runtime and accuracy through adaptive ODE solvers. Benchmarking on protein structure prediction tasks, we find that the Neural ODE‑based Evoformer produces structurally plausible predictions and reliably captures certain secondary structure elements, such as alpha‑helices, though it does not fully replicate the accuracy of the original architecture. However, our model achieves this performance using dramatically fewer resources, just 17.5 hours of training on a single GPU, highlighting the promise of continuous‑depth models as a lightweight and interpretable alternative for biomolecular modeling. This work opens new directions for efficient and adaptive protein structure prediction frameworks.
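The link between stacked residual blocks and a continuous-depth model can be sketched as follows. This is a minimal illustration only: it uses a fixed-step Euler integrator and a hypothetical vector field `f(h, t)`, whereas the abstract refers to adaptive solvers and the adjoint method, neither of which is shown here.

```python
import numpy as np

def residual_stack(h, f, n_blocks):
    """Apply n_blocks residual updates h <- h + dt * f(h, t).

    With dt = 1 / n_blocks this is exactly a fixed-step Euler
    discretization of dh/dt = f(h, t) on the interval t in [0, 1].
    """
    dt = 1.0 / n_blocks
    for k in range(n_blocks):
        h = h + dt * f(h, k * dt)
    return h

def linear_decay(h, t):
    # Hypothetical stand-in for a learned block; exact solution is h0 * exp(-t).
    return -h
```

Increasing `n_blocks` moves the discrete stack toward the solution of the underlying ODE, which is the sense in which depth becomes a continuous quantity that a solver can trade off against accuracy.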
Authors: Pierre Glaser, Steffanie Paul, Alissa M. Hummer, Charlotte M. Deane, Debora S. Marks, Alan N. Amin
Abstract: We propose a set of kernel‑based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
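For orientation, the plain (unconditional) maximum mean discrepancy that the ACMMD builds on can be estimated as in the sketch below. This is not the ACMMD itself: it compares two unconditional samples, and the Gaussian kernel and the `bandwidth` value are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Drop the diagonal terms k(x_i, x_i) so the within-sample averages are unbiased.
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()
```

The estimate fluctuates around zero when both samples come from the same distribution and grows as the distributions move apart, which is the property that makes such discrepancies usable inside hypothesis tests.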
Authors: Ashwini Kannan, Jaya Vasavi Pamidimukkala, Avinash Dakshinamoorthy, Soham Bopardikar, Kalyan Dasgupta, Sanjib Senapati
Abstract: Protein folding is an age‑old biological problem: understanding and predicting how a protein's linear sequence of amino acids folds into its specific three‑dimensional structure. This structure is critical, as a protein's functionality is inherently linked to its final folded form. Misfolding can lead to severe diseases such as Alzheimer's and cystic fibrosis, highlighting the biological and clinical importance of understanding protein folding mechanisms. This work presents a novel turn‑based encoding optimization algorithm for predicting the folded structures of peptides and small proteins. Our approach builds upon our previous research, where our objective function focused on hydrophobic collapse, a fundamental phenomenon underlying the protein folding process. In this work, we extend that framework by incorporating not only hydrophobic interactions but also all non‑bonded interactions, modeled using the Miyazawa‑Jernigan potential. We constructed a Hamiltonian from the defined objective function that encodes the folding process on a three‑dimensional face‑centered cubic lattice, offering superior packing efficiency and a realistic representation of protein conformations. This Hamiltonian is then solved using classical and quantum solvers to explore the vast conformational space of proteins. To identify the lowest‑energy folded configurations, we utilize the Variational Quantum Eigensolver implemented on IBM 133‑qubit hardware. The predicted structures are validated against experimental data using root mean square deviation as a metric and compared against classical simulated annealing and molecular dynamics simulation results. Our findings highlight the promise of hybrid classical and quantum approaches in advancing protein folding predictions, particularly for sequences with low homology.
Authors: Amitesh Badkul, Lei Xie
Abstract: Reliable, informative, and individual uncertainty quantification (UQ) remains missing in the current ML community. This hinders the effective application of AI/ML to risk‑sensitive domains. Most methods either fail to provide coverage on new data, inflate intervals so broadly that they are not actionable, or assign uncertainties that do not track actual error, especially under a distribution shift. In high‑stakes drug discovery, protein‑ligand affinity (PLI) prediction is especially challenging as assay noise is heterogeneous, chemical space is imbalanced and large, and practical evaluations routinely involve distribution shift. In this work, we introduce a novel uncertainty quantification method, Trustworthy Expert Split‑conformal with Scaled Estimation for Efficient Reliable Adaptive intervals (TESSERA), that provides per‑sample uncertainty with a reliable coverage guarantee and informative, adaptive prediction interval widths that track the absolute error. We evaluate on protein‑ligand binding affinity prediction under both independent and identically distributed (i.i.d.) and scaffold‑based out‑of‑distribution (OOD) splits, comparing against strong UQ baselines. TESSERA attains near‑nominal coverage and the best coverage‑width trade‑off as measured by the Coverage‑Width Criterion (CWC), while maintaining competitive adaptivity (lowest Area Under the Sparsification Error (AUSE)). Size‑Stratified Coverage (SSC) further confirms that intervals are right‑sized: widths increase when data are scarce or noisy and remain tight when predictions are reliable. By unifying Mixture of Experts (MoE) diversity with conformal calibration, TESSERA delivers trustworthy, tight, and adaptive uncertainties that are well‑suited to selective prediction and downstream decision‑making in the drug‑discovery pipeline and other applications.
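As background for the split‑conformal component named in this abstract, the sketch below shows generic split conformal regression with scale‑normalized residuals. It is not TESSERA: the per‑sample `scale` inputs are assumed to come from some separate uncertainty estimate, and the function name and arguments are hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, cal_scale,
                             test_pred, test_scale, alpha=0.1):
    """Prediction intervals from a held-out calibration set.

    Nonconformity score: |y - prediction| / scale, so intervals widen
    where the (assumed) scale estimate is large.
    """
    scores = np.abs(cal_true - cal_pred) / cal_scale
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    rank = int(np.ceil((n + 1) * (1.0 - alpha)))
    q = np.inf if rank > n else np.sort(scores)[rank - 1]
    return test_pred - q * test_scale, test_pred + q * test_scale
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least `1 - alpha` on average; that guarantee can weaken under the distribution shifts the abstract evaluates.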
Authors: Jacob K. Christopher, Austin Seamann, Jingyi Cui, Sagar Khare, Ferdinando Fioretto
Abstract: Diffusion models offer a powerful means of capturing the manifold of realistic protein structures, enabling rapid design for protein engineering tasks. However, existing approaches observe critical failure modes when precise constraints are necessary for functional design. To this end, we present a constrained diffusion framework for structure‑guided protein design, ensuring strict adherence to functional requirements while maintaining precise stereochemical and geometric feasibility. The approach integrates proximal feasibility updates with ADMM decomposition into the generative process, scaling effectively to the complex constraint sets of this domain. We evaluate on challenging protein design tasks, including motif scaffolding and vacancy‑constrained pocket design, while introducing a novel curated benchmark dataset for motif scaffolding in the PDZ domain. Our approach achieves state‑of‑the‑art, providing perfect satisfaction of bonding and geometric constraints with no degradation in structural diversity.
Authors: Islam Akef Ebeid, Haoteng Tang, Pengfei Gu
Abstract: Introduction: Accurate prediction of protein‑protein interactions (PPIs) is crucial for understanding cellular functions and advancing drug development. Existing in‑silico methods use direct sequence embeddings from Protein Language Models (PLMs). Others use Graph Neural Networks (GNNs) for 3D protein structures. This study explores less computationally intensive alternatives. We introduce a novel framework for downstream PPI prediction through link prediction. Methods: We introduce a two‑stage graph representation learning framework, ProtGram‑DirectGCN. First, we developed ProtGram. This approach models a protein's primary structure as a hierarchy of globally inferred n‑gram graphs. In these graphs, residue transition probabilities define edge weights. Each edge connects a pair of residues in a directed graph. The probabilities are aggregated from a large corpus of sequences. Second, we propose DirectGCN, a custom directed graph convolutional neural network. This model features a unique convolutional layer. It processes information through separate path‑specific transformations: incoming, outgoing, and undirected. A shared transformation is also applied. These paths are combined via a learnable gating mechanism. We apply DirectGCN to ProtGram graphs to learn residue‑level embeddings. These embeddings are pooled via attention to generate protein‑level embeddings for prediction. Results: We first established the efficacy of DirectGCN on standard node classification benchmarks. Its performance matches established methods on general datasets. The model excels at complex, directed graphs with dense, heterophilic structures. When applied to PPI prediction, the full ProtGram‑DirectGCN framework delivers robust predictive power. This strong performance holds even with limited training data.
Authors: Mahboobe Sehati, Ali Soltanmanesh, Shabnam Abutalebi, Abolfazl Bahrampour, Naser Haeri, Sareh Rostami, Alireza Bahrampour
Abstract: Photoreduction of cryptochrome protein in the retina is a well‑known mechanism of navigation of birds through the geomagnetic field, yet the biosignal nature of the mechanism remains unclear. The absorption of blue light by the flavin adenine dinucleotide (FAD) chromophore can alter the distribution of electrons in cryptochrome and create radical pairs with separated charges. In this study, the spin dynamics of electrons in the radical pair and its coupling with spatial position were investigated by computational modeling from a quantum mechanical perspective. Several interactions were considered in the presence of an external magnetic field, and the resulting electric dipole moment in cryptochrome was computed as the quantity emerging from this coupling. The computations show that the induced electric dipole moment clearly depends on the characteristics of the applied magnetic field, even after considering dissipative effects. In fact, our findings indicate that the radical pair in cryptochrome protein is a magnetic biosensor, in the sense that in the presence of the geomagnetic field, variations in spin states can influence its electric dipole moment, which may be interpreted by the bird as an orientation signal. The results can be used in the advancement of bio‑inspired technologies which replicate animal magnetic sensitivity. On the other hand, with increasing concern about the detrimental effects of electromagnetic fields on wildlife and human health, studying the phenomenon of magnetoreception can contribute to a deeper understanding of how biological structures interact with these fields.
Authors: Rushna Quddus, Kent Kirshenbaum, David G. Grier
Abstract: This study introduces a Holographic Agglutination Assay for quantifying levels of the immunoglobulin protein IgA in biological samples. This is the first example of a label‑free and bead‑free assay that quantifies protein agglutinates by direct detection using Total Holographic Characterization. A proof‑of‑concept assay for human serum immunoglobulins is demonstrated using Jacalin, the galactose‑specific plant lectin, to induce selective agglutination.
By analyzing the size, refractive index, and number of particles in an assay sample, we obtain a reproducible and quantitative measurement of galactosylated immunoglobulins in a given sample. The assay is calibrated for a physiologically relevant reference interval of IgA concentrations in a 10x diluted emulated biological sample from low (80 mg/dL, 5 μM) to high (320 mg/dL, 20 μM) levels. The assay clearly distinguishes samples containing IgA from samples containing IgG.
More broadly, this study introduces a platform for creating lectin‑mediated Holographic Agglutination Assays to monitor levels of immunoglobulins in biological samples. The ability to quantify immunoglobulin levels efficiently in clinical samples is likely to be valuable for diagnostics and will provide a basis for assaying other proteins that can be induced to agglutinate.
Authors: Nicolas Menet, Aleksandar Terzić, Michael Hersche, Andreas Krause, Abbas Rahimi
Abstract: Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine‑Tuning (ToSFiT), leverages the prior knowledge embedded in prompt‑conditioned large language models and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality, a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. Within a collection of methods covering in‑context Bayesian optimization, reinforcement learning, and evolutionary search, ToSFiT exhibits both state‑of‑the‑art sample efficiency and computational efficiency.
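For readers unfamiliar with the baseline idea, standard Thompson sampling on a small Bernoulli bandit can be sketched as below. This is the textbook Beta‑Bernoulli version over a handful of arms, not ToSFiT: it keeps an explicit posterior per arm, which is exactly what becomes impractical in the large discrete spaces the abstract targets. The function name and the `true_rates` input are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    alpha = np.ones(k)  # 1 + observed successes per arm (uniform prior)
    beta = np.ones(k)   # 1 + observed failures per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior,
        # then act greedily with respect to that draw.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_rates[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls
```

Each arm is chosen with probability equal to its posterior probability of being the best one, so exploration fades as evidence accumulates. That probability of maximality is the quantity the abstract proposes to parameterize directly.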
Authors: Jun Ming Hou, Long Chen, Xuan Zheng, Jia Wei Wu, Jian Wei You, Zi Xuan Cai, Jiahan Huang, Chen Xu Wu, Jian Lin Su, Lianlin Li, Jia Nan Zhang, Tie Jun Cui
Abstract: Generative models such as AlphaFold and MatterGen can directly generate novel material structures with desired properties, accelerating the discovery of new materials and shifting the material design paradigm from a traditional trial‑and‑error approach to intelligent on‑demand generation. AlphaFold focuses on predicting proteins, which have specific aperiodic structures, while MatterGen focuses on predicting periodic and stable crystal structures. The universal design of metamaterials is much more complicated, since it involves designing meta‑atoms (similar to the periodic structures) and their arbitrarily inhomogeneous distributions in space. Here, we propose InfoMetaGen, a universal generative model for information metamaterial design, which combines a pre‑trained foundation model with lightweight functional adapters to intelligently generate artificial structures on demand, spanning from meta‑atoms to arbitrary space coding patterns. In contrast to conventional intelligent metamaterial design methods that require training dedicated models for specific functionalities, InfoMetaGen enables a single universal generative model capable of switching across diverse functionalities by fine‑tuning the lightweight adapters, significantly improving both efficiency and generalizability. Experimental results demonstrate that InfoMetaGen can not only accelerate the discovery of diverse new metamaterials, but also achieve breakthroughs in metamaterial performance. This work fills the gap of a universal generative framework for designing artificial materials, and opens up unprecedented opportunities to expand the capability of generative models from the passive discovery of microscopic natural materials to the active creation of macroscopic artificial materials.
Authors: Bo Qiang, Chengyue Gong, Xinshi Chen, Yuxuan Zhang, Wenzhi Xiao
Abstract: Lightweight inference is critical for biomolecular structure prediction and downstream tasks, enabling efficient real‑world deployment and inference‑time scaling for large‑scale applications. While AF3 and its variants (e.g., Protenix, Chai‑1) have advanced structure prediction results, they suffer from critical limitations: high inference latency and cubic time complexity with respect to token count, both of which restrict scalability for large biomolecular complexes. To address the core challenge of balancing model efficiency and prediction accuracy, we introduce three key innovations: (1) compressing non‑scalable operations to mitigate cubic time complexity, (2) removing redundant blocks across modules to reduce unnecessary overhead, and (3) adopting a few‑step sampler for the atom diffusion module to accelerate inference. Building on these design principles, we develop Protenix‑Mini+, a highly lightweight and scalable variant of the Protenix model. Within an acceptable range of performance degradation, it substantially improves computational efficiency. For example, in the case of low‑homology single‑chain proteins, Protenix‑Mini+ experiences an intra‑protein LDDT drop of approximately 3% relative to the full Protenix model, an acceptable trade‑off given its improvement of more than 90% in computational efficiency.
Authors: Patricia Marques, Andreas Wichert, Duarte Magano, Bruno Coutinho
Abstract: Identification of cancer driver genes is fundamental for the development of targeted therapeutic interventions. The integration of mutational profiles with protein‑protein interaction (PPI) networks offers a promising avenue for their detection [1, 2], but scaling to large network datasets is computationally demanding. Quantum computing offers compact representations and potential complexity reductions. Motivated by the classical method of Gumpinger et al. [3], in this work we introduce a supervised quantum framework that combines mutation scores with network topology via a novel state preparation scheme, Quantum Multi‑order Moment Embedding (QMME). QMME encodes low‑order statistical moments over the mutation scores of a node's immediate and second‑order neighbors, and encodes this information into quantum states. These are used as inputs to a kernel‑based quantum binary classifier that discriminates known driver genes from others. Simulations on an empirical PPI network demonstrate competitive performance, with a 12.6% recall gain over a classical baseline. The pipeline performs explicit quantum state preparation and requires no classical training, enabling an efficient, nearly end‑to‑end quantum workflow. A brief complexity analysis suggests the approach could achieve a quantum speedup in network‑based cancer gene prediction. This work underscores the potential of supervised quantum graph learning frameworks to advance biological discovery.
Authors: Sazan Mahbub, Souvik Kundu, Eric P. Xing
Abstract: Designing protein sequences that fold into a target 3‑D structure, known as the inverse folding problem, is central to protein engineering. However, it remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates but lack explicit mechanisms to reuse fine‑grained structure‑sequence patterns conserved across natural proteins. To mitigate this, we present PRISM, a multimodal retrieval‑augmented generation framework for inverse folding. PRISM retrieves fine‑grained representations of potential motifs from known proteins and integrates them with a hybrid self‑cross attention decoder. PRISM is formulated as a latent‑variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability. Experiments across multiple benchmarks, including CATH‑4.2, TS50, TS500, CAMEO 2022, and the PDB date split, demonstrate the fine‑grained multimodal retrieval efficacy of PRISM in yielding SoTA perplexity and amino acid recovery, while also improving the foldability metrics (RMSD, TM‑score, pLDDT).
Authors: Xinhui Chen, Zuchao Li, Mengqi Gao, Yufeng Zhang, Chak Tou Leong, Haoyang Li, Jiaqi Chen
Abstract: Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task‑specific adapters or large‑scale supervised fine‑tuning. We introduce the "Protein‑as‑Second‑Language" framework, which reformulates amino‑acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence‑question‑answer triples that reveal functional cues in a zero‑shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein‑QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open‑source LLMs and GPT‑4, achieving up to 17.2% ROUGE‑L improvement (average +7%) and even surpassing fine‑tuned protein‑specific language models. These results highlight that generic LLMs, when guided with protein‑as‑language cues, can outperform domain‑specialized models, offering a scalable pathway for protein understanding in foundation models.
Authors: Zishen Zhang, Xiangzhe Kong, Wenbing Huang, Yang Liu
Abstract: Designing protein binders targeting specific sites, which requires generating realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure‑based generative models are limited in generating interfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval‑Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross‑domain interface transfer. Extensive experiments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross‑domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval‑based knowledge and generative AI, opening new possibilities for drug discovery.
Authors: Guifeng Li, Chaoyang Gong
Abstract: Single‑molecule detection enables direct observation of individual biomolecular events, providing mechanistic insights into biological processes and offering a powerful tool for disease diagnostics. However, the fundamental scale mismatch between optical wavelengths and molecules restricts the application of label‑free techniques, leading to poor signal‑to‑noise (SNR) performance. Here, we propose a high‑contrast, label‑free approach based on interferometric imaging, utilizing the strong evanescent field supported on a microfiber surface to provide near‑field illumination. We observed unique interference patterns generated by in‑plane scattering from natural defects, which enabled high‑contrast detection of localized phase changes induced by single molecules. The results indicate an approximately 38 dB enhancement in SNR over the conventional fluorescence methods, without employing any plasmonic or microcavity‑based amplification techniques. This approach was further applied to track molecular dynamics, capturing both conformational transition and binding behaviors of individual protein molecules. Meanwhile, the stimulus‑response of single molecules to acoustic waves was investigated, demonstrating the ultimate miniaturization of an acoustic sensor at the single‑molecule scale. By enabling direct observation of molecular dynamics and mechanical responses at the single‑molecule level, this approach provides a versatile platform for probing fundamental biological processes and developing ultra‑sensitive biosensors. Moreover, this approach lays the foundation for coupling optical and acoustic waves at the molecular scale, opening new avenues for next‑generation single‑molecule diagnostics and precision biophysics studies.
Authors: Sara Merino-Aceituno, Carmela Moschella, Shotaro Otsuka, Christian Schmeiser, Julia Scholz
Abstract: The endoplasmic reticulum (ER) is the largest continuous membrane‑bound organelle in the cell and plays a central role in the synthesis and turnover of many lipids and proteins. It connects directly to the nucleus through specialized contact points known as ER‑nuclear envelope (NE) junctions. In our recent study, we found that these ER‑NE junctions are both narrow and infrequent, measuring less than 20 nanometers in diameter and occurring at a frequency of approximately 0.1 per square micrometer. However, it remains unclear whether such limited and narrow connections are sufficient to support efficient transport between the ER and NE. Here, we built a mathematical model of ER‑to‑NE protein diffusion, incorporating ultrastructural parameters, the frequency of ER‑NE junctions, and the diffusion coefficient of proteins within the ER lumen. To validate the model, we experimentally quantified the transport rate of ER luminal proteins to the NE using fluorescence recovery after photobleaching (FRAP). Our model and experimental data demonstrate that simple diffusion is sufficient to account for the rapid transport of proteins from the ER to the NE, despite the limited and narrow nature of the connecting junctions. Together, these findings offer mechanistic insight into how ER‑NE connectivity enables rapid protein transport and lay the groundwork for future studies on ER‑nucleus communication.
Authors: Vinayak Vinayak, Melike Lakadamyali, Vivek B Shenoy
Abstract: Nanoscale chromatin domains, variously termed nucleosome clutches, nanodomains, or packing domains, have emerged as fundamental architectural units of the mammalian genome during interphase and mitosis. Unlike cohesin‑dependent loops or TADs, these 50‑200 nm structures persist in the absence of loop extrusion, pointing to a distinct organizing principle shaped by histone post‑translational modifications and constrained by interactions with the nuclear lamina. Super‑resolution microscopy and electron tomography now enable their direct visualization, revealing conserved features such as fractal packing, enrichment for linker histone H1, and radial stratification of active and repressive histone marks. Accumulating evidence indicates that these domains act as transcriptional hubs, dynamically remodel in response to developmental and environmental cues, and undergo pathological disruption in disease. Integrated experimental, theoretical, and computational insights suggest that chromatin‑protein interactions, epigenetic read‑write processes, and diffusion‑driven dynamics together govern their formation, persistence, and nuclear positioning. Viewed in this light, nanoscale domains represent a privileged regulatory tier, complementary to compartments and loop‑based structures, that bridges local chromatin states with global nuclear architecture. By situating them alongside lamin‑associated (LADs) and nucleolus‑associated domains (NADs), we propose a unified biophysical framework for chromatin organization across scales and outline key open questions for future exploration. Because their structural disruption is a recurring feature of aging, cancer, and degenerative diseases, understanding these domains may open new avenues for diagnostics and therapeutic intervention.
Authors: Zekai Chen, Xunkai Li, Sirui Zhang, Henan Sun, Jia Li, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Abstract: De novo ligand design is a fundamental task that seeks to generate, entirely from scratch, protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by three limitations: Pseudo De Novo generation, Limited Docking Modeling, and Inflexible Ligand Type. To address these issues, we propose MagicDock, a forward‑looking framework grounded in a progressive pipeline and differentiable surface modeling. (1) We adopt a well‑designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point‑cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1% and 11.7% over SOTA baselines specialized for protein or molecule ligand design, respectively.
Authors: Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng
Abstract: Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropriate calorie, protein, and fluid targets under dynamic metabolic demands. We introduce DeepEN, a reinforcement learning (RL) framework for personalized EN optimization using electronic health record data.
Methods: DeepEN was trained on over 11,000 ICU patients from MIMIC‑IV to generate 4‑hourly, patient‑specific caloric, protein, and fluid targets. The state representation incorporated demographics, comorbidities, vital signs, laboratory values, and recent interventions. A physiologically aligned reward framework balanced biomarker stability with long‑term survival. Policy learning employed a dueling double deep Q‑network with Conservative Q‑Learning regularization to enable safe offline training.
Results: DeepEN achieved the highest estimated policy value (V^π = 9.48) and the lowest calibrated mortality (18.8 ± 1.0%), representing a 4.0 percentage‑point absolute reduction compared with clinician practice (22.8%). The policy also demonstrated superior metabolic stability, achieving the highest proportion of glucose, phosphate, and sodium values within target range. Furthermore, deviation from the DeepEN policy was independently associated with increased mortality and biomarker instability, whereas deviation from a random policy showed no such association. Interpretability analyses further indicated that recommendations were conditioned on physiologically relevant markers of organ function and metabolic status rather than static dosing heuristics.
Conclusion: DeepEN demonstrates the feasibility of conservative offline RL for safe, individualized EN optimization, highlighting the potential of data‑driven personalization to complement guideline‑based approaches in critical care.
Authors: Beatrice Cipriani, Hender Lopez
Abstract: Nanoparticles (NPs) demonstrate considerable potential in medical applications, including targeted drug delivery and diagnostic probes. However, their efficacy depends on their ability to navigate through the complex biological environments inside living organisms. In such environments, NPs interact with a dense mixture of biomolecules, which can reduce their mobility and hinder diffusion. Understanding the factors influencing NP diffusion in these environments is key to improving nanomedicine design and predicting toxicological effects. In this study, we propose a computational approach to model NP diffusion in crowded environments. We introduce a mesoscale model that accounts for the combined effects of the Protein Corona (PC) and the crowded medium on NP movement. By including volume‑exclusion interactions and modelling the PC both explicitly and implicitly, we identify key macromolecular descriptors that affect NP diffusion. Our results show that the morphology of the PC can significantly affect the diffusion of NPs, and we analyse the roles of the occupied volume fraction and of the size ratio between tracers and crowders. The results also show that approximating large macromolecular assemblies with a hydrodynamic single‑sphere model leads to inexact diffusion estimates. To overcome the limitations of single‑sphere representations, a strategy for an accurate parametrization of NP‑PC systems using a single‑sphere model is presented.
Authors: Sergei B. Rochal, Aleksey S. Roshal, Olga V. Konevtsova, Rudolf Podgornik
Abstract: Proteinaceous shells useful for various biomedical applications exhibit a wide range of anomalous structures that are fundamentally different from icosahedral viral capsids described by the Caspar‑Klug paradigmatic model. Exploring the Protein Data Bank, we have identified nine different types of anomalous shells structurally close to flat octagonal quasicrystals. As we show, these numerous shells have cubic nets cut from short‑period approximants of an octagonal tiling composed of square and rhombic tiles. The approximants and parent tiling are easily obtained within the Landau density wave approach, while the nonequilibrium assembly of them can be simulated using the pair potentials derived from critical density waves. Gluing a polyhedron net and mapping it onto a spherical surface induces tile distortions, and to reduce them, we introduce and minimize the effective elastic energy of the system. Thus, we return quasi‑equivalence to previously equivalent tiles. Possible cubic faceting of the octagonal spherical tilings is discussed in terms of the topological charge distribution over the tiling vertices. The proposed structural models describe numerous proteinaceous shells including about half of the known symmetrical enzymes. Our results constitute a fundamental basis for further applications of identified octagonal assemblies and can help to discover and study similar systems in the future.
Authors: Shawnak Shivakumar, Jefferson Hernandez
Abstract: Wuchereria bancrofti, the parasitic roundworm responsible for lymphatic filariasis, permanently disables over 36 million people and places 657 million at risk across 39 countries. A major bottleneck for drug discovery is the lack of functional annotation for more than 90 percent of the W. bancrofti dark proteome, leaving many potential targets unidentified. In this work, we present a novel computational pipeline that converts W. bancrofti's unannotated amino acid sequence data into precise four‑level Enzyme Commission (EC) numbers and drug candidates. We utilized a DEtection TRansformer to estimate the probability of enzymatic function, fine‑tuned a hierarchical nearest neighbor EC predictor on 4,476 labeled parasite proteins, and applied rejection sampling to retain only four‑level EC classifications at 100 percent confidence. This pipeline assigned precise EC numbers to 14,772 previously uncharacterized proteins and discovered 543 EC classes not previously known in W. bancrofti. A qualitative triage emphasizing parasite‑specific targets, chemical tractability, biochemical importance, and biological plausibility prioritized six enzymes across five separate strategies: anti‑Wolbachia cell‑wall inhibition, proteolysis blockade, transmission disruption, purinergic immune interference, and cGMP‑signaling destabilization. We curated a 43‑compound library from ChEMBL and BindingDB and co‑folded across multiple protein conformers with Boltz‑2. All six targets exhibited at least moderately strong predicted binding affinities below 1 micromolar, with moenomycin analogs against peptidoglycan glycosyltransferase and NTPase inhibitors showing promising nanomolar hits and well‑defined binding pockets. While experimental validation remains essential, our results provide the first large‑scale functional map of the W. bancrofti dark proteome and accelerate early‑stage drug development for the species.
Authors: Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen
Abstract: Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero‑shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log‑odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within‑family profiles from retrieved homologs and (ii) cross‑family structural‑evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence‑structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log‑odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA‑enabled variant achieve state‑of‑the‑art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within‑family and cross‑family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available.
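The log‑odds scoring mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that a masked language model has already produced, for each sequence position, a probability for every amino acid; the function name `log_odds_score` and the toy probabilities are hypothetical and are not taken from EvoIF.

```python
import math

def log_odds_score(position_probs, wild_type, mutations):
    """Sum of log p(mutant residue) - log p(wild-type residue) over mutated sites.

    position_probs: list of dicts, one per position, mapping residue -> probability
                    (assumed to come from a masked language model; hypothetical input).
    wild_type:      the reference sequence as a string.
    mutations:      list of (position, new_residue) pairs, 0-indexed.
    """
    score = 0.0
    for pos, new_aa in mutations:
        probs = position_probs[pos]
        score += math.log(probs[new_aa]) - math.log(probs[wild_type[pos]])
    return score

# Toy two-residue example with made-up probabilities.
wild_type = "MK"
position_probs = [
    {"M": 0.9, "L": 0.1},
    {"K": 0.5, "R": 0.25, "E": 0.25},
]
score_k2r = log_odds_score(position_probs, wild_type, [(1, "R")])
```

A negative score means the model considers the mutant residue less likely than the wild type at that site; in zero‑shot use this score is read as a proxy for the fitness effect.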
Authors: Jayanth R. Banavar, Achille Giacometti, Trinh X. Hoang, Amos Maritan, Tatjana Škrbić
Abstract: Proteins are linear chain molecules that play a central role in life and health. Protein native state folds are modular assemblies of space‑filling building blocks of α‑helices, β‑sheets and tight turns. Here we deduce the structures of a countable set of space‑filling helical forms of a uniform discrete thick string from first principles with no additional input or adjustable parameters. These forms occur in correspondence with the natural numbers, loosely analogous to the energy levels in a Bohr atom. We find the remarkable result that one of these helical forms is an excellent candidate for an α‑helix through seemingly improbable quantum chemistry coincidences that fit the geometrical requirements. Our work suggests that geometry and chemistry are complementary ways of looking at proteins and suggests a route for developing a unified framework for understanding proteins.
Authors: Yuqi Zhang, Yuxin Yang, Feixiong Chen, Cheng-Chang Lu, Nima Saeidi, Samuel L. Volchenboum, Junhan Zhao, Siwei Chen, Weiwen Jiang, Qiang Guan
Abstract: Variational quantum algorithms provide a direct, physics‑based approach to protein structure prediction, but their accuracy is limited by the coarse resolution of the energy landscapes generated on current noisy devices. We propose a hybrid framework that combines quantum computation with deep learning, formulating structure prediction as a problem of energy fusion. Candidate conformations are obtained through the Variational Quantum Eigensolver (VQE) executed on IBM's 127‑qubit superconducting processor, which defines a global yet low‑resolution quantum energy surface. To refine these basins, secondary structure probabilities and dihedral angle distributions predicted by the NSP3 neural network are incorporated as statistical potentials. These additional terms sharpen the valleys of the quantum landscape, resulting in a fused energy function that enhances effective resolution and better distinguishes native‑like structures. Evaluation on 375 conformations from 75 protein fragments shows consistent improvements over AlphaFold3, ColabFold, and quantum‑only predictions, achieving a mean RMSD of 4.9 Å with statistical significance (p < 0.001). The findings demonstrate that energy fusion offers a systematic method for combining data‑driven models with quantum algorithms, improving the practical applicability of near‑term quantum computing to molecular and structural biology.
Authors: Aymen Alsaadi, Jonathan Ash, Mikhail Titov, Matteo Turilli, Andre Merzky, Shantenu Jha, Sagar Khare
Abstract: Computational protein design is experiencing a transformation driven by AI/ML. However, the range of potential protein sequences and structures is astronomically vast, even for moderately sized proteins. Hence, achieving convergence between generated and predicted structures demands substantial computational resources for sampling. The Integrated Machine‑learning for Protein Structures at Scale (IMPRESS) offers methods and advanced computing systems for coupling AI to high‑performance computing tasks, enabling the ability to evaluate the effectiveness of protein designs as they are developed, as well as the models and simulations used to generate data and train models. This paper introduces IMPRESS and demonstrates the development and implementation of an adaptive protein design protocol and its supporting computing infrastructure. This leads to increased consistency in the quality of protein design and enhanced throughput of protein design due to dynamic resource allocation and asynchronous workload execution.
Authors: Bang Chen, Lijun Guo, Houli Fan, Wentao He, Rong Zhang
Abstract: Identifying cancer driver genes (CDGs) is essential for understanding cancer mechanisms and developing targeted therapies. Graph neural networks (GNNs) have recently been employed to identify CDGs by capturing patterns in biological interaction networks. However, most GNN‑based approaches rely on a single protein‑protein interaction (PPI) network, ignoring complementary information from other biological networks. Some studies integrate multiple networks by aligning features with consistency constraints to learn unified gene representations for CDG identification. However, such representation‑level fusion often assumes congruent gene relationships across networks, which may overlook network heterogeneity and introduce conflicting information. To address this, we propose Soft‑Evidence Fusion Graph Neural Network (SEFGNN), a novel framework for CDG identification across multiple networks at the decision level. Instead of enforcing feature‑level consistency, SEFGNN treats each biological network as an independent evidence source and performs uncertainty‑aware fusion at the decision level using Dempster‑Shafer Theory (DST). To alleviate the risk of overconfidence from DST, we further introduce a Soft Evidence Smoothing (SES) module that improves ranking stability while preserving discriminative performance. Experiments on three cancer datasets show that SEFGNN consistently outperforms state‑of‑the‑art baselines and exhibits strong potential in discovering novel CDGs.
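Decision‑level fusion with Dempster‑Shafer Theory can be sketched with the classical Dempster rule of combination. The example below is a generic illustration, not SEFGNN's exact formulation: the two mass functions and the names `m_ppi` and `m_other` are made up, and each mass function assigns belief to "driver", "non_driver", or the whole frame (uncertainty).

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence: the product mass goes to the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint hypotheses: the product mass counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

DRIVER = frozenset({"driver"})
NON_DRIVER = frozenset({"non_driver"})
UNSURE = DRIVER | NON_DRIVER  # mass on the whole frame expresses uncertainty

# Hypothetical per-network evidence for one gene.
m_ppi = {DRIVER: 0.6, NON_DRIVER: 0.1, UNSURE: 0.3}
m_other = {DRIVER: 0.5, NON_DRIVER: 0.2, UNSURE: 0.3}
fused = dempster_combine(m_ppi, m_other)
```

In this toy case the conflicting mass is 0.17, so the fused belief in "driver" is 0.63 / 0.83, higher than either source alone, while the mass left on the whole frame shrinks. That sharpening is what can make plain Dempster fusion overconfident, the risk the abstract's smoothing module is meant to reduce.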
Authors: Luis Enrique Coronas, Stepan Timr, Fabio Sterpone, Giancarlo Franzese
Abstract: Biological processes like the sequestration of Superoxide Dismutase 1 (SOD1) into biomolecular condensates such as FUS and stress granules are essential to understanding disease mechanisms, including amyotrophic lateral sclerosis (ALS). Our study demonstrates that the hydration environment is crucial in these processes. Using the advanced CVF water model, which captures hydrogen‑bond networks at the molecular level, we show how water greatly impacts SOD1's behavior, residency times, and transition rates between different associative states. Importantly, when water is included to hydrate an implicit solvent model (OPEP), we gain a new perspective on the free energy landscape of the system, leading to a conclusion that clarifies the one suggested by OPEP alone. While the OPEP model indicated that Bovine Serum Albumin (BSA) crowders reduce SOD1's partition coefficient (PC) mainly due to nonspecific interactions with BSA, our enhanced explicit‑water approach reveals that the hydration entropy behavior in BSA drives the observed decrease in PC. This highlights that explicitly modeling water is essential for accurately understanding protein‑crowder interactions and their biological relevance, emphasizing water's role in cellular phase separation and disease‑related processes.
Authors: Vivekananda Bal, Jackie M. Wolfrum, Paul W. Barone, Stacy L. Springs, Anthony J. Sinskey, Robert M. Kotin, Richard D. Braatz
Abstract: Physicochemical characterization of materials is central to science and engineering and is essential for designing new or engineered materials with specific properties. Assays available for small molecules, e.g., XRD, NMR, and LC‑MS, cannot easily be applied to macromolecules. Thus, it is extremely challenging to characterize complex materials such as adeno‑associated virus capsids (MW 5.8 MDa). Capsid crystals produced in hanging drops are characterized in situ using cross‑polarized light and ex situ using scanning electron microscopy (SEM), energy dispersive X‑ray (EDX), and transmission electron microscopy. Cross‑polarized light can be used to identify capsid crystals within a heterogeneous system of kosmotropic/chaotropic‑salt crystals, fibers, dense solid phase, and opaque crystals. Despite highly conserved structures, crystal birefringence suggests that capsids possess serotype‑specific structural differences. SEM demonstrated that crystal growth occurs by random 2D nucleation followed by kink‑site attachment and/or spread by more 2D nuclei, and that proteinaceous assemblies tend to form semi‑crystalline solids appearing as dense/opaque materials. EDX shows that C, O, and N are present in a ratio of 2.33 ± 0.222 : 1 : 0.583 ± 0.019 for serotypes 5, 8, and 9 and can be an alternative to protein sequencing‑based virus identification. Biological macromolecular assemblies are found to facilitate plural scattering responsible for Kikuchi diffraction patterns even for thin crystals (~300 nm). For optimal spot diffraction, crystals must possess at least one dimension consisting of at most 8 layers of capsids.
Authors: Adhithyan Kalaivanan, Zheng Zhao, Jens Sjölund, Fredrik Lindsten
Abstract: Guiding pretrained flow‑based generative models for conditional generation or to produce samples with desired target properties enables solving diverse tasks without retraining on paired data. We present ESS‑Flow, a gradient‑free method that leverages the typically Gaussian prior of the source distribution in flow‑based models to perform Bayesian inference directly in the source space using Elliptical Slice Sampling. ESS‑Flow only requires forward passes through the generative model and observation process, no gradient or Jacobian computations, and is applicable even when gradients are unreliable or unavailable, such as with simulation‑based observations or quantization in the generation or observation process. We demonstrate its effectiveness on designing materials with desired target properties and predicting protein structures from sparse inter‑residue distance measurements.
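The core update behind this kind of gradient‑free inference can be sketched with a textbook elliptical slice sampling step for a standard normal source distribution. This is a minimal illustration, not the ESS‑Flow implementation: `toy_flow` is a hypothetical one‑dimensional stand‑in for a pretrained flow, and the Gaussian observation model with `y_obs=3.0` and `sigma=1.0` is an assumption made for the example.

```python
import math
import random

def toy_flow(z):
    # Hypothetical stand-in for a pretrained flow: maps source z to a sample x.
    return 2.0 * z[0] + 1.0

def log_likelihood(z, y_obs=3.0, sigma=1.0):
    # Assumed Gaussian observation of the generated sample; forward pass only.
    x = toy_flow(z)
    return -0.5 * ((y_obs - x) / sigma) ** 2

def elliptical_slice_step(z, log_lik, rng):
    """One elliptical slice sampling update under a N(0, I) prior on z."""
    nu = [rng.gauss(0.0, 1.0) for _ in z]            # auxiliary prior draw
    # Slice height: log-likelihood plus log of a uniform draw in (0, 1].
    log_y = log_lik(z) + math.log(1.0 - rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    theta_min, theta_max = theta - 2.0 * math.pi, theta
    while True:
        proposal = [zi * math.cos(theta) + ni * math.sin(theta)
                    for zi, ni in zip(z, nu)]
        if log_lik(proposal) > log_y:
            return proposal                           # on the slice: accept
        # Otherwise shrink the angle bracket towards the current state.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

def run_chain(n_steps, seed=0):
    rng = random.Random(seed)
    z = [0.0]
    samples = []
    for _ in range(n_steps):
        z = elliptical_slice_step(z, log_likelihood, rng)
        samples.append(z[0])
    return samples
```

For this toy setup the posterior over z is Gaussian with mean 0.8 and variance 0.2, so the average of a long chain from `run_chain` should settle near 0.8. The update only evaluates the likelihood, never its gradient, which is the property the abstract highlights.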
Authors: Jiahao Ma, Hongzong Li, Ye-Fan Hu, Jian-Dong Huang
Abstract: Physicochemically informed biological sequence generation has the potential to accelerate computer‑aided cellular therapy, yet current models fail to jointly ensure novelty, diversity, and biophysical plausibility when designing variable regions of T‑cell receptors (TCRs). We present PhysicoGPTCR, a large generative protein Transformer that is dual‑conditioned on peptide and HLA context and trained to autoregressively synthesise TCR sequences while embedding residue‑level physicochemical descriptors. The model is optimised on curated TCR‑peptide‑HLA triples with a maximum‑likelihood objective and compared against ANN, GPTCR, LSTM, and VAE baselines. Across multiple neoantigen benchmarks, PhysicoGPTCR substantially improves edit‑distance, similarity, and longest‑common‑subsequence scores, while populating a broader region of sequence space. Blind in‑silico docking and structural modelling further reveal a higher proportion of binding‑competent clones than the strongest baseline, validating the benefit of explicit context conditioning and physicochemical awareness. Experimental results demonstrate that dual‑conditioned, physics‑grounded generative modelling enables end‑to‑end design of functional TCR candidates, reducing the discovery timeline from months to minutes without sacrificing wet‑lab verifiability.
Authors: Daniele Macuglia, Giovanni Ciccotti, Benoît Roux
Abstract: From the onset of fundamental statistical mechanical constructs formulated in the late 19th century, alchemical free‑energy methods slowly emerged and transitioned to become operational tools of biomolecular simulation applicable to a wide range of problems including protein‑ligand binding for drug discovery research. This article reconstructs how statistical mechanical approaches such as thermodynamic integration and free‑energy perturbation were reconfigured in the early 1980s to address the complexities of increasingly heterogeneous biomolecular systems. Drawing on oral history interviews and primary literature, the study examines the technical, institutional, theoretical, and infrastructural conditions under which these methods were implemented and became progressively operational. These conditions encompassed the consolidation of lab‑specific software infrastructures, the formulation of practical simulation protocols, as well as essential statistical mechanical clarifications. From this perspective, the progress of free‑energy methods proceeded less from a unified convergence than from an iterative troubleshooting process of alignment involving practical and theoretical considerations. The aim of the present article is to offer a historically grounded account of how free‑energy techniques acquired practical and functional reliability.
Authors: Ebenezer Awotoro, Chisom Ezekannagha, Florian Schwarz, Johannes Tauscher, Dominik Heider, Katharina Ladewig, Christel Le Bon, Karine Moncoq, Bruno Miroux, Georges Hattab
Abstract: Structural biology has made significant progress in determining membrane proteins, leading to a remarkable increase in the number of available structures in dedicated databases. The inherent complexity of membrane protein structures, coupled with challenges such as missing data, inconsistencies, and computational barriers from disparate sources, underscores the need for improved database integration. To address this gap, we present MetaMP, a framework that unifies membrane‑protein databases within a web application and uses machine learning for classification. MetaMP improves data quality by enriching metadata, offering a user‑friendly interface, and providing eight interactive views for streamlined exploration. MetaMP was effective across tasks of varying difficulty, demonstrating advantages across different levels without compromising speed or accuracy, according to user evaluations. Moreover, MetaMP supports essential functions such as structure classification and outlier detection.
We present three practical applications of Artificial Intelligence (AI) in membrane protein research: predicting transmembrane segments, reconciling legacy databases, and classifying structures with explainable AI support. In a validation focused on statistics, MetaMP resolved 77% of data discrepancies and accurately predicted the class of newly identified membrane proteins 98% of the time and overtook expert curation. Altogether, MetaMP is a much‑needed resource that harmonizes current knowledge and empowers AI‑driven exploration of membrane‑protein architecture.
Authors: Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
Abstract: T cell receptor (TCR) recognition of peptide‑MHC (pMHC) complexes is a central component of adaptive immunity, with implications for vaccine design, cancer immunotherapy, and autoimmune disease. While recent advances in machine learning have improved prediction of TCR‑pMHC binding, the most effective approaches are black‑box transformer models that cannot provide a rationale for predictions. Post‑hoc explanation methods can provide insight with respect to the input but do not explicitly model biochemical mechanisms (e.g., known binding regions), as in TCR‑pMHC binding. "Explain‑by‑design" models (i.e., with architectural components that can be examined directly after training) have been explored in other domains, but have not been used for TCR‑pMHC binding. We propose explainable model layers (TCR‑EML) that can be incorporated into protein‑language model backbones for TCR‑pMHC modeling. Our approach uses prototype layers for amino acid residue contacts drawn from known TCR‑pMHC binding mechanisms, enabling high‑quality explanations for predicted TCR‑pMHC binding. Experiments of our proposed method on large‑scale datasets demonstrate competitive predictive accuracy and generalization, and evaluation on the TCR‑XAI benchmark demonstrates improved explainability compared with existing approaches.
Authors: Ziying Zhang, Yaqing Wang, Yuxuan Sun, Min Ye, Quanming Yao
Abstract: Cold‑start drug‑target interaction (DTI) prediction focuses on interactions between novel drugs and proteins. Previous methods typically tackle it by learning transferable interaction patterns between drug and protein structures. However, insights from proteomics suggest that proteins have multi‑level structures, all of which influence DTI. Existing works usually represent proteins with only their primary structures, limiting their ability to capture interactions involving higher‑level structures. Inspired by this insight, we propose ColdDTI, a framework that attends to multi‑level protein structure for cold‑start DTI prediction. We employ a hierarchical attention mechanism to mine interactions between multi‑level protein structures (from primary to quaternary) and drug structures at both local and global granularities. Then, we leverage the mined interactions to fuse structure representations of different levels for the final prediction. Our design captures biologically transferable priors, avoiding the risk of overfitting caused by excessive reliance on representation learning. Experiments on benchmark datasets demonstrate that ColdDTI consistently outperforms previous methods in cold‑start settings.
Authors: Andrew Campbell, Valentin De Bortoli, Jiaxin Shi, Arnaud Doucet
Abstract: We present self‑speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict factorized logits over currently masked positions. A number of masked positions are then sampled; however, the factorization approximation means that sampling too many positions in one go leads to poor sample quality. As a result, many simulation steps, and therefore many neural network function evaluations, are required to generate high‑quality data. We reduce the computational burden by generating non‑factorized predictions over masked positions. This is achieved by modifying the final transformer attention mask from non‑causal to causal, enabling draft token generation and parallel validation via a novel, model‑integrated speculative sampling mechanism. This results in a non‑factorized predictive distribution over masked positions in a single forward pass. We apply our method to GPT2‑scale text modelling and protein sequence generation, finding that we can achieve a ~2x reduction in the required number of network forward passes relative to standard masked diffusion models.
Authors: Aya Laajil, Abduragim Shtanchaev, Sajan Muhammad, Eric Moulines, Salem Lahlou
Abstract: Designing mRNA sequences is a major challenge in developing next‑generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long‑horizon rewards and multi‑objective trade‑offs. We propose Curriculum‑Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi‑objective GFlowNets to generate de novo mRNA sequences. CAGFN uses a length‑based curriculum that progressively adapts the maximum sequence length, guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility, while maintaining diversity. Moreover, CAGFN reaches higher‑quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and enables generalization to out‑of‑distribution sequences.
Authors: Junde Xu, Yapin Shi, Lijun Lang, Taoyong Cui, Zhiming Zhang, Guangyong Chen, Jiezhong Qiu, Pheng-Ann Heng
Abstract: Multimodal protein language models deliver strong performance on mutation‑effect prediction, but training such models from scratch demands substantial computational resources. In this paper, we propose a fine‑tuning framework called InstructPLM‑mu and try to answer a question: Can multimodal fine‑tuning of a pretrained, sequence‑only protein language model match the performance of models trained end‑to‑end? Surprisingly, our experiments show that fine‑tuning ESM2 with structural inputs can reach performance comparable to ESM3. To understand how this is achieved, we systematically compare three different feature‑fusion designs and fine‑tuning recipes. Our results reveal that both the fusion method and the tuning strategy strongly affect final accuracy, indicating that the fine‑tuning process is not trivial. We hope this work offers practical guidance for injecting structure into pretrained protein language models and motivates further research on better fusion mechanisms and fine‑tuning protocols.
Authors: Abhi Chawla, David M. Bortz, Vanja Dukic
Abstract: The Weak form Estimation of Nonlinear Dynamics (WENDy) method is a recently proposed class of parameter estimation algorithms that exhibits notable noise robustness and computational efficiency. This work examines the coverage and bias properties of the original WENDy‑IRLS algorithm's parameter and state estimators in the context of the following differential equations: Logistic, Lotka‑Volterra, FitzHugh‑Nagumo, Hindmarsh‑Rose, and a Protein Transduction Benchmark. The estimators' performance was studied in simulated data examples, under four different noise distributions (normal, log‑normal, additive censored normal, and additive truncated normal), and a wide range of noise, reaching levels much higher than previously tested for this algorithm.
Authors: Daphne Tsolissou, Theofanis Ganitidis, Konstantinos Mitsis, Stergios Christodoulidis, Maria Vakalopoulou, Konstantina Nikita
Abstract: Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state‑of‑the‑art and recent large vision‑language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview‑style question sequences is proposed, comparing a range of open‑source LVLMs, including both general‑purpose and medically tuned models. Zero‑shot experiments reveal that even if they are very powerful, not all LVLMs can accurately identify imaging modality and anatomy, while all of them perform poorly in accurate risk classification. To address this limitation, LLaVa‑NeXT‑Vicuna is adapted to the ultrasound domain using low‑rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset. Our findings highlight both the promise and limitations of LVLMs in ultrasound‑based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.
Authors: Taehan Kim, Sangdae Nam
Abstract: Deep learning, particularly with the advancement of Large Language Models, has transformed biomolecular modeling, with protein language models such as ESM inspiring emerging RNA language models such as RiNALMo. Recent work has begun applying sparse autoencoders (SAEs) to protein language model representations, exploring representation‑level interpretability in biomolecular models. Here, we explore whether SAEs can provide interpretable feature decompositions of RNA language model representations, while also examining their limitations in this setting. We present SAE‑RNA, an interpretability model that analyzes RiNALMo representations and maps them to known human‑level biological features. Rather than claiming definitive biological concept discovery, our study frames SAE‑based analysis as a representation‑level probe for characterizing how RNA language models organize biological information internally. More broadly, SAE‑RNA provides a feature‑level framework for comparing RNA groups and identifying sparse representation components associated with RNA family identity or structural context.
Authors: Julian Cremer, Tuan Le, Mohammad M. Ghahremanpour, Emilia Sługocka, Filipe Menezes, Djork-Arné Clevert
Abstract: We present FLOWR:root, an equivariant flow‑matching model for pocket‑aware 3D ligand generation with joint binding affinity prediction and confidence estimation. The model supports de novo generation, pharmacophore‑conditional sampling, fragment elaboration, and multi‑endpoint affinity prediction (pIC50, pKi, pKd, pEC50). Training combines large‑scale ligand libraries with mixed‑fidelity protein‑ligand complexes, followed by refinement on curated co‑crystal datasets and parameter‑efficient finetuning for project‑specific adaptation. FLOWR:root achieves state‑of‑the‑art performance in unconditional 3D molecule generation and pocket‑conditional ligand design, producing geometrically realistic, low‑strain structures. The integrated affinity prediction module demonstrates superior accuracy on the SPINDR test set and outperforms recent models on the Schrodinger FEP+/OpenFE benchmark with substantial speed advantages. As a foundation model, FLOWR:root requires finetuning on project‑specific datasets to account for unseen structure‑activity landscapes, yielding strong correlation with experimental data. Joint generation and affinity prediction enable inference‑time scaling through importance sampling, steering molecular design toward higher‑affinity compounds. Case studies validate this: selective CK2α ligand generation against CLK3 shows significant correlation between predicted and quantum‑mechanical binding energies, while ERα and TYK2 scaffold elaboration demonstrates strong agreement with QM calculations. By integrating structure‑aware generation, affinity estimation, and property‑guided sampling, FLOWR:root provides a comprehensive foundation for structure‑based drug design spanning hit identification through lead optimization.
Authors: Nakul Sridhar, Meiou Song, Michael H. B. Stowell, Kathryn L. Hassell, Xiaoyun Ding
Abstract: Sickle cell disease (SCD) remains a critical global health issue, with high child mortality in low‑resource regions. Early screening and diagnosis are essential for improving health outcomes, but conventional screening methods are unsuitable for widespread use due to the high costs of laboratory equipment. There is an urgent need for portable, cost‑effective, and user‑friendly point‑of‑care tools that can quickly assess blood health. Here, we explore two new biomarkers enabled by acoustic probing for rapid SCD screening: cell membrane stability from measuring red blood cell (RBC) lysis temperature in whole blood, and plasma protein concentration from measuring relative protein precipitation in blood plasma. Both biomarkers effectively differentiate healthy HbAA samples from pre‑/no transfusion HbSS samples with high accuracy. Additionally, the RBC lysis biomarker can distinguish post‑transfusion exchange HbSS samples with a lower percentage of sickled cells, indicating the potential to initially screen for milder forms of SCD as well as sickle cell trait.
Authors: Anders Irbäck, Lucas Knuthson, Sandipan Mohanty
Abstract: Steric clashes pose a challenge when exploring dense protein systems using conventional explicit‑chain methods. A minimal example is a single lattice protein confined on a minimal grid, with no free sites. Finding its minimum energy is a hard optimization problem, with similarities to scheduling problems. It can be recast as a quadratic unconstrained binary optimization (QUBO) problem amenable to classical and quantum approaches. We show that this problem in its QUBO form can be swiftly and consistently solved for chain length 48, using either classical simulated annealing or hybrid quantum‑classical annealing on a D‑Wave system. In fact, the latter computations required about 10 seconds. We also test linear and quadratic programming methods, which work well for a lattice gas but struggle with chain constraints. All methods are benchmarked against exact results obtained from exhaustive structure enumeration, at a high computational cost.
Authors: Xin Wang, Kaiwen Shi, Carlos Oliver
Abstract: Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder‑agnostic, end‑to‑end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence‑ and structure‑based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein‑level labels, BioBlobs recovers experimentally annotated catalytic sites in the M‑CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large‑scale functional site discovery across the unannotated proteome.
Authors: Ching-Huei Tsou, Michal Ozery-Flato, Ella Barkan, Diwakar Mahajan, Ben Shapira
Abstract: Recent advances in large language models (LLMs) and biomedical foundation models (BioFMs) have achieved strong results in biological text reasoning, molecular modeling, and single‑cell analysis, yet they remain siloed in disjoint embedding spaces, limiting cross‑modal reasoning. We present BIOVERSE (Biomedical Vector Embedding Realignment for Semantic Engagement), a two‑stage approach that adapts pretrained BioFMs as modality encoders and aligns them with LLMs through lightweight, modality‑specific projection layers. The approach first aligns each modality to a shared LLM space through independently trained projections, allowing them to interoperate naturally, and then applies standard instruction tuning with multi‑modal data to bring them together for downstream reasoning. By unifying raw biomedical data with knowledge embedded in LLMs, the approach enables zero‑shot annotation, cross‑modal question answering, and interactive, explainable dialogue. Across tasks spanning cell‑type annotation, molecular description, and protein function reasoning, compact BIOVERSE configurations surpass larger LLM baselines while enabling richer, generative outputs than existing BioFMs, establishing a foundation for principled multi‑modal biomedical reasoning.
Authors: Yanbo Xu, Yu Wu, Sungjae Park, Zhizhuo Zhou, Shubham Tulsiani
Abstract: We present a mechanism to steer the sampling diversity of denoising diffusion and flow matching models, allowing users to sample from a sharper or broader distribution than the training distribution. We build on the observation that these models leverage (learned) score functions of noisy data distributions for sampling and show that rescaling these allows one to effectively control a 'local' sampling temperature. Notably, this approach does not require any finetuning or alterations to training strategy, can be applied to any off‑the‑shelf model, and is compatible with both deterministic and stochastic samplers. We first validate our framework on toy 2D data, and then demonstrate its application for diffusion models trained across five disparate tasks: image generation, pose estimation, depth prediction, robot manipulation, and protein design. We find that across these tasks, our approach allows sampling from sharper (or flatter) distributions, yielding performance gains; e.g., depth prediction models benefit from sampling more likely depth estimates, whereas image generation models perform better when sampling a slightly flatter distribution.
Authors: Eoin Quinn, Marco Carobene, Jean Quentin, Sebastien Boyer, Miguel Arbesú, Oliver Bent
Abstract: While deep learning has revolutionized the prediction of rigid protein structures, modelling the conformational ensembles of Intrinsically Disordered Proteins (IDPs) remains a key frontier. Current AI paradigms present a trade‑off: Protein Language Models (PLMs) capture evolutionary statistics but lack explicit physical grounding, while generative models trained to model full ensembles are computationally expensive. In this work we critically assess these limits and propose a path forward. We introduce GeoGraph, a simulation‑informed surrogate trained to predict ensemble‑averaged statistics of residue‑residue contact‑map topology directly from sequence. By featurizing coarse‑grained molecular dynamics simulations into residue‑ and sequence‑level graph descriptors, we create a robust and information‑rich learning target. Our evaluation demonstrates that this approach yields representations that are more predictive of key biophysical properties than existing methods.
Authors: Sichao Shan, Han Ye, Zhengmei Yang, Junpeng Hou, Zhitong Li
Abstract: Deep learning (DL) has revolutionized many fields such as materials design and protein folding. Recent studies have demonstrated the advantages of DL in the inverse design of structural colors, by effectively learning the complex nonlinear relations between structure parameters and optical responses, as dictated by the physical laws of light. While several models, such as tandem neural networks and generative adversarial networks, have been proposed, these methods can be biased and are difficult to scale up to complex structures. Moreover, the difficulty in incorporating physical constraints at the inference time hinders the controllability of the model‑predicted spectra. In this work, we propose Color2Struct, a universal framework for efficient and accurate inverse design of structural colors with controllable predictions. By utilizing sampling bias correction, adaptive loss weighting, and physics‑guided inference, Color2Struct improves the prediction of tandem networks by 65% (color difference) and 48% (short‑wave near‑infrared reflectivity) in designing RGB primary colors. These improvements make Color2Struct highly promising for applications in high‑end display technologies and solar thermal energy harvesting. In experiments, the nanostructure samples are fabricated using a standard thin‑film deposition method and their reflectance spectra are measured to validate the designs. Our work provides an efficient and highly optimized method for controllable inverse design, benefiting future explorations of more intricate structures. The proposed framework can be further generalized to a wide range of fields beyond nanophotonics.
Authors: Yikai Liu, Haoyang Zheng, Lining Mao, Yanbin Wang, Ming Chen, Guang Lin
Abstract: Molecular dynamics (MD) simulation has long been the principal computational tool for exploring protein conformational landscapes and dynamics, but its application is limited by high computational cost. We present ProTDyn, a foundation protein language model that unifies conformational ensemble generation and multi‑timescale dynamics modeling within a single framework. Unlike prior approaches that treat these tasks separately, ProTDyn allows flexible independent and identically distributed (i.i.d.) ensemble sampling and dynamic trajectory simulation. Across diverse protein systems, ProTDyn yields thermodynamically consistent ensembles, faithfully reproduces dynamical properties over multiple timescales, and generalizes to proteins beyond its training data. It offers a scalable and efficient alternative to conventional MD simulations.
Authors: Luke Bhan, Miroslav Krstic, Yuanyuan Shi
Abstract: This work establishes the first rigorous stability guarantees for approximate predictors in delay‑adaptive control of nonlinear systems, addressing a key challenge in practical implementations where exact predictors are unavailable. We analyze two scenarios: (i) when the actuated input is directly measurable, and (ii) when it is estimated online. For the measurable input case, we prove semi‑global practical asymptotic stability with an explicit bound proportional to the approximation error ε. For the unmeasured input case, we demonstrate local practical asymptotic stability, with the region of attraction explicitly dependent on both the initial delay estimate and the predictor approximation error. To bridge theory and practice, we show that neural operators, a flexible class of neural network‑based approximators, can achieve arbitrarily small approximation errors, thus satisfying the conditions of our stability theorems. Numerical experiments on two nonlinear benchmark systems (a biological protein activator/repressor model and a micro‑organism growth chemostat model) validate our theoretical results. In particular, our numerical simulations confirm stability under approximate predictors, highlight the strong generalization capabilities of neural operators, and demonstrate a substantial computational speedup of up to 15x compared to a baseline fixed‑point method.
Authors: Siyuan Cao, Hongxuan Wu, Jiabao Brad Wang, Yiliang Yuan, Mustafa Misir
Abstract: Molecular docking is a core tool in drug discovery for predicting ligand‑target interactions. Despite the availability of diverse search‑based and machine learning approaches, no single docking algorithm consistently dominates, as performance varies by context. To overcome this challenge, algorithm selection frameworks such as GNNAS‑Dock, built on graph neural networks, have been proposed. This study introduces an enhanced system, MC‑GNNAS‑Dock, with three key advances. First, a multi‑criteria evaluation integrates binding‑pose accuracy (RMSD) with validity checks from PoseBusters, offering a more rigorous assessment. Second, architectural refinements through the inclusion of residual connections strengthen predictive robustness. Third, rank‑aware loss functions are incorporated to sharpen rank learning. Extensive experiments are performed on a curated dataset containing approximately 3200 protein‑ligand complexes from PDBBind. MC‑GNNAS‑Dock demonstrates consistently superior performance, achieving up to 5.4% (3.4%) gains under composite criteria of RMSD below 1Å (2Å) with PoseBusters validity compared to the single best solver (SBS), Uni‑Mol Docking V2.
Authors: Mason Minot, Gisbert Schneider
Abstract: Simultaneously optimizing multiple, frequently conflicting, molecular properties is a key bottleneck in the development of novel therapeutics. Although a promising approach, the efficacy of multi‑task learning is often compromised by destructive gradient interference, especially in the data‑scarce regimes common to drug discovery. To address this, we propose AIM, an optimization framework that learns a dynamic policy to mediate gradient conflicts. The policy is trained jointly with the main network using a novel augmented objective composed of dense, differentiable regularizers. This objective guides the policy to produce updates that are geometrically stable and dynamically efficient, prioritizing progress on the most challenging tasks. We demonstrate that AIM achieves statistically significant improvements over multi‑task baselines on subsets of the QM9 and targeted protein degraders benchmarks, with its advantage being most pronounced in data‑scarce regimes. Beyond performance, AIM's key contribution is its interpretability; the learned policy matrix serves as a diagnostic tool for analyzing inter‑task relationships. This combination of data‑efficient performance and diagnostic insight highlights the potential of adaptive optimizers to accelerate scientific discovery by creating more robust and insightful models for multi‑property molecular design.
Authors: Ajit Seth, Sajal K. Ghosh, Veerendra K. Sharma
Abstract: Model biomembrane systems play a crucial role in advancing biomedical research by providing simplified yet effective platforms for exploring complex biological mechanisms. These systems span a wide range of scales, from single‑molecule‑thick lipid monolayers to micron‑sized giant unilamellar vesicles. Their efficacy and applicability largely depend on selecting an optimal model and an appropriate synthesis process. This chapter offers a comprehensive description of conventional synthesis techniques, highlighting their limitations across various model membrane systems. Additionally, it provides an overview of biophysical studies on biomimetic membranes and explores key biological applications, including drug delivery, membrane‑protein interactions, and biosensing.
Authors: Langzhou He, Junyou Zhu, Fangxin Wang, Junhua Liu, Haoyan Xu, Yue Zhao, Philip S. Yu, Qitian Wu
Abstract: Molecular foundation models are rapidly advancing scientific discovery, but their unreliability on out‑of‑distribution (OOD) samples severely limits their application in high‑stakes domains such as drug discovery and protein design. A critical failure mode is chemical hallucination, where models make high‑confidence yet entirely incorrect predictions for unknown molecules. To address this challenge, we introduce Molecular Preference‑Aligned Instance Ranking (Mole‑PAIR), a simple, plug‑and‑play module that can be flexibly integrated with existing foundation models to improve their reliability on OOD data through cost‑effective post‑training. Specifically, our method formulates the OOD detection problem as a preference optimization over the estimated OOD affinity between in‑distribution (ID) and OOD samples, achieving this goal through a pairwise learning objective. We show that this objective essentially optimizes AUROC, which measures how consistently ID and OOD samples are ranked by the model. Extensive experiments across five real‑world molecular datasets demonstrate that our approach significantly improves the OOD detection capabilities of existing molecular foundation models, achieving up to 45.8%, 43.9%, and 24.3% improvements in AUROC under distribution shifts of size, scaffold, and assay, respectively.
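The ranking interpretation of AUROC mentioned in the abstract above can be illustrated with a minimal sketch. It is not the Mole‑PAIR objective itself; it only shows the standard identity that AUROC equals the fraction of (ID, OOD) pairs in which the OOD sample receives the higher OOD score, with ties counted as one half. The function name and the example scores are hypothetical.

```python
from itertools import product


def pairwise_auroc(id_scores, ood_scores):
    """AUROC as the fraction of correctly ordered (ID, OOD) pairs.

    A pair counts 1.0 when the OOD sample has the higher OOD score,
    0.5 when the two scores tie, and 0.0 otherwise.
    """
    pairs = list(product(id_scores, ood_scores))
    wins = sum(
        1.0 if ood > ind else 0.5 if ood == ind else 0.0
        for ind, ood in pairs
    )
    return wins / len(pairs)


# Hypothetical OOD scores (higher means "more likely OOD").
id_scores = [0.1, 0.4, 0.35]
ood_scores = [0.8, 0.3, 0.9]
print(pairwise_auroc(id_scores, ood_scores))
```

Because the metric depends only on the relative ordering within each pair, a pairwise learning objective that pushes OOD scores above ID scores targets it directly.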
Authors: Zhenfeng Deng, Ruijie Hou, Ningrui Xie, Mike Tyers, Michał Koziarski
Abstract: Recent advances in structure‑based protein design have accelerated de novo binder generation, yet interfaces on large domains or spanning multiple domains remain challenging due to high computational cost and declining success with increasing target size. We hypothesized that protein folding neural networks (PFNNs) operate in a "local‑first" manner, prioritizing local interactions while displaying limited sensitivity to global foldability. Guided by this hypothesis, we propose an epitope‑only strategy that retains only the discontinuous surface residues surrounding the binding site. Compared to intact‑domain workflows, this approach improves in silico success rates by up to 80% and reduces the average time per successful design by up to forty‑fold, enabling binder design against previously intractable targets such as ClpP and ALS3. Building on this foundation, we further developed a tailored pipeline that incorporates a Monte Carlo‑based evolution step to overcome local minima and a position‑specific biased inverse folding step to refine sequence patterns. Together, these advances not only establish a generalizable framework for efficient binder design against structurally large and otherwise inaccessible targets, but also support the broader "local‑first" hypothesis as a guiding principle for PFNN‑based design.
Authors: Yogesh Verma, Markus Heinonen, Vikas Garg
Abstract: Protein structure prediction and folding are fundamental to understanding biology, with recent deep learning advances reshaping the field. Diffusion‑based generative models have revolutionized protein design, enabling the creation of novel proteins. However, these methods often neglect the intrinsic physical realism of proteins, driven by noising dynamics that lack grounding in physical principles. To address this, we first introduce a physically motivated non‑linear noising process, grounded in classical physics, that unfolds proteins into secondary structures (e.g., alpha helices, linear beta sheets) while preserving topological integrity, maintaining bonds and preventing collisions. We then integrate this process with the flow‑matching paradigm on SE(3) to model the invariant distribution of protein backbones with high fidelity, incorporating sequence information to enable sequence‑conditioned folding and expand the generative capabilities of our model. Experimental results demonstrate that the proposed method achieves state‑of‑the‑art performance in unconditional protein generation, producing more designable and novel protein structures while accurately folding monomer sequences into precise protein conformations.
Authors: Elbert Ho
Abstract: Recently, machine learning has made a significant impact on de novo drug design. However, current approaches to creating novel molecules conditioned on a target protein typically rely on generating molecules directly in the 3D conformational space, which is often slow and overly complex. In this work, we propose SOLD (SELFIES‑based Objective‑driven Latent Diffusion), a novel latent diffusion model that generates molecules in a latent space derived from 1D SELFIES strings and conditioned on a target protein. In the process, we also train an innovative SELFIES transformer and propose a new way to balance losses when training multi‑task machine learning models. Our model generates high‑affinity molecules for the target protein in a simple and efficient way, while also leaving room for future improvements through the addition of more data.
Authors: Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin
Abstract: Fast and high‑quality language generation is the holy grail that people pursue in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi‑Instruct), a training‑based method that initializes from a pre‑trained diffusion large language model (dLLM) and distills a few‑step student for fast generation. The model distilled with DiDi‑Instruct matches or surpasses its dLLM teacher and the GPT‑2 baseline while providing up to 64× acceleration. The theoretical foundation of DiDi‑Instruct is a novel framework based on integral KL‑divergence minimization, which leads to a practical training algorithm. We further introduce grouped reward normalization, intermediate‑state matching, and the reward‑guided ancestral sampler to improve training stability, model coverage, and inference quality. On the OpenWebText benchmark, DiDi‑Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT‑2 baseline. These gains incur a negligible entropy loss (around 1%) and reduce additional training wall‑clock time by more than 20× compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi‑Instruct through extensive ablation studies, model scaling, downstream task evaluations, and unconditional protein sequence generation. In conclusion, DiDi‑Instruct enables efficient and effective distillation for language generation in the blink of an eye.
Authors: Sebastian W. Ober, Calvin McCarter, Aniruddh Raghu, Yucen Lily Li, Alan N. Amin, Andrew Gordon Wilson, Hunter Elliott
Abstract: Bayesian optimization is a natural candidate for the engineering of antibody therapeutic properties, which is often iterative and expensive. However, finding the optimal choice of surrogate model for optimization over the highly structured antibody space is difficult, and may differ depending on the property being optimized. Moreover, to the best of our knowledge, no prior works have attempted to incorporate structural information into antibody Bayesian optimization. In this work, we explore different approaches to incorporating structural information into Bayesian optimization, and compare them to a variety of sequence‑only approaches on two different antibody properties, binding affinity and stability. In addition, we propose the use of a protein language model‑based "soft constraint," which helps guide the optimization to promising regions of the space. We find that certain types of structural information improve data efficiency in early optimization rounds for stability, but have equivalent peak performance. Moreover, when incorporating the protein language model soft constraint we find that the data efficiency gap is diminished for affinity and eliminated for stability, resulting in sequence‑only methods that match the performance of structure‑based methods, raising questions about the necessity of structure in Bayesian optimization for antibodies.
Authors: Kosio Beshkov, Anders Malthe-Sørenssen
Abstract: While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square‑root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other.
We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non‑linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.
Authors: Ya-Wei Eileen Lin, Ron Levie
Abstract: Canonicalization is a widely used strategy in equivariant machine learning, enforcing symmetry in neural networks by mapping each input to a standard form. Yet, it often introduces discontinuities that can affect stability during training, limit generalization, and complicate universal approximation theorems. In this paper, we address this by introducing adaptive canonicalization, a general framework in which the canonicalization depends both on the input and the network. Specifically, we present the adaptive canonicalization based on prior maximization, where the standard form of the input is chosen to maximize the predictive confidence of the network. We prove that this construction yields continuous and symmetry‑respecting models that admit universal approximation properties.
We propose two applications of our setting: (i) resolving eigenbasis ambiguities in spectral graph neural networks, and (ii) handling rotational symmetries in point clouds. We empirically validate our methods on molecular and protein classification, as well as point cloud classification tasks. Our adaptive canonicalization outperforms the three other common solutions to equivariant machine learning: data augmentation, standard canonicalization, and equivariant architectures.
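The prior-maximization idea described above can be illustrated on a toy finite group. This is only a sketch under stated assumptions: the group (cyclic shifts of a list), the classifier `toy_network`, and the function names are all hypothetical, and the paper's actual construction for eigenbases and point-cloud rotations is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cyclic_shifts(x):
    """Orbit of x under the cyclic group (all rotations of the list)."""
    return [x[s:] + x[:s] for s in range(len(x))]


def toy_network(x):
    """Hypothetical classifier that is NOT shift-invariant on its own."""
    return [1.0 * x[0], 2.0 * x[1]]


def adaptive_canonical_predict(x, network, orbit_fn):
    """Evaluate the network on every transformed copy of x and keep the
    prediction whose top class probability (confidence) is largest."""
    return max((softmax(network(gx)) for gx in orbit_fn(x)), key=max)


x = [3.0, 1.0, 0.0, 0.0]
shifted = x[2:] + x[:2]  # same orbit, different representative
p_x = adaptive_canonical_predict(x, toy_network, cyclic_shifts)
p_shifted = adaptive_canonical_predict(shifted, toy_network, cyclic_shifts)
```

Because both inputs share one orbit, the most confident prediction is the same for each, so the composite model is invariant even though `toy_network` is not.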
Authors: Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni
Abstract: Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine‑grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed‑lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS‑FM), whose sampling offers more than two orders of magnitude speedup compared to implicit‑ or explicit‑solvent MD simulations. We benchmark MarS‑FM's ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS‑FM outperforms existing methods, often by a substantial margin.
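For readers unfamiliar with the Markov State Model that defines the discrete states, the following minimal sketch shows the standard counting estimator for a transition matrix from a discretized trajectory. It is background only and not the MarS‑FM method; the function name and the toy trajectory are illustrative assumptions.

```python
def transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i in range(len(dtraj) - lag):
        counts[dtraj[i]][dtraj[i + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for states that were never left stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy discretized trajectory over three metastable states.
dtraj = [0, 0, 1, 1, 0, 2, 2, 0]
T = transition_matrix(dtraj, n_states=3, lag=1)
```

Each row of `T` estimates the probability of moving from one state to every other state after `lag` steps.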
Authors: Samuel Willis, Paul Duckworth, Jack Simons, Aleksandra Kalisz, Krisztina Sinkovics, Noam Ghenassia, Shikha Surana, Henry T. Oldroyd, Alexandru I. Stere, Dragos D Margineantu, Carl Henrik Ek, Henry Moss, Erik Bodin
Abstract: Modern generative AI models, such as diffusion and flow matching models, can sample from rich data distributions. However, many applications, especially in science and engineering, require more than drawing samples from the model distribution: they require searching within this distribution for samples that optimise task‑specific criteria. In this work, we propose O3 (Optimisation Over the Outputs of Generative Models), a method for sample‑efficient black‑box optimisation over continuous‑variable diffusion and flow‑matching models. O3 is built around surrogate latent spaces: low‑dimensional Euclidean embeddings that can be extracted from a generative model without additional training. The resulting representations have controllable dimensionality and support the direct application of standard optimisation algorithms. We show, on image and protein design tasks, that surrogate‑space optimisation finds substantially higher‑scoring samples than standard sampling or optimisation in the original latent space. Our method is model‑ and optimiser‑agnostic, incurs negligible additional cost over standard generation, and requires no retraining or fine‑tuning of the generative model.
Authors: Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Alexander Tong, Avishek Joey Bose
Abstract: Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through more flexible and parallel generation paths. This flexibility of sampling is unlocked by new engineered sampling strategies, or planners, that select more favorable generation paths by iteratively planning ‑ versus uniformly at random ‑ where to denoise along the sequence. However, by modifying the reverse paths via planning, planners create an irrevocable mismatch between the uniformly random denoising paths assumed during training and planning‑based inference.
In this paper, we systematically investigate the mismatch of discrete diffusion training and inference under planning and theoretically prove that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser that uses a non‑uniform planner. To address this gap, we derive a new planned evidence lower bound (P‑ELBO) that incorporates planner‑based reverse dynamics directly into the training objective. Using the P‑ELBO, we introduce Planner Aware Path Learning (PAPL), a novel training scheme that aligns training and inference under a planned denoiser.
PAPL is implemented as a simple yet effective modification to the standard masked discrete diffusion loss, making it widely applicable and easy to adopt. Empirically, we show PAPL delivers consistent gains across domains, including a 40% relative improvement in protein sequences, improved text generation with up to a 4x relative MAUVE gain, and 23% relative improvement in code generation HumanEval pass@10. Code is available at github.com/pengzhangzhi/PAPL.
Authors: Felipe Silva Carvalho, Alexander McMahon, David A. Case, Tyler Luchko
Abstract: Accurate modeling of aqueous monovalent ions is essential for understanding the function of biomolecules, such as nucleic acid stability and binding of charged drugs to protein targets. The 1D and 3D reference interaction site models (1D‑ and 3D‑RISM) of molecular solvation, as implemented in the AmberTools molecular modeling suite, are well suited for modeling mixtures of ionic species around biomolecules across a wide range of concentrations. However, the available ion model parameters were optimized for molecular dynamics simulations, not for the RISM framework, which includes a closure approximation. To address this, we optimized the Lennard‑Jones 12‑6 model for monovalent ions for 1D‑RISM with the partial series expansion of order 3 closure by fitting to experimental values of ion‑oxygen distance (IOD), hydration free energy (HFE), partial molar volume (PMV) and mean activity coefficient. The new parameter set demonstrated significant improvement in HFE, IOD, and mean activity coefficients, whereas no overall change was observed for the PMV. A second optimization step was necessary to account for the cation‑anion interactions that affect the mean activity coefficients. The new parameters were validated at finite salt concentrations against experimental data for 16 ion pairs and showed improved accuracy for 14 of them, while the results for CsI and CsF were the second best. 1D‑RISM results obtained with the new NaCl parameters were used to calculate the preferential interaction parameter of the ions around the 24L B‑DNA using 3D‑RISM. The new parameters demonstrated better agreement with experiment at physiological and higher concentrations. At lower concentrations, the results primarily depended on the closure with little effect from the ion parameters. Overall, the ion parameters specifically developed for RISM show improved accuracy at infinite dilution and finite concentrations.
Authors: Haoyu Feng, Xin Zhang
Abstract: Loopy Belief Propagation (LBP) is a widely used approximate inference algorithm in probabilistic graphical models, with applications in computer vision, error correction codes, protein folding, program analysis, etc. However, LBP faces significant computational challenges when applied to large‑scale program analysis. While GPU (Graphics Processing Unit) parallel computing provides a promising solution, existing approaches lack support for flexible update strategies and have yet to integrate logical constraints with GPU acceleration, leading to suboptimal practical performance.
This paper presents a GPU‑accelerated LBP algorithm for program analysis. To support the diverse update strategies required by users, we propose a unified representation for specifying arbitrary user‑defined update strategies, along with a dependency analysis algorithm. Furthermore, building on previous work that leverages the local structure of Horn clauses to simplify message passing, we group messages to minimize warp divergence and better utilize GPU resources. Experimental results on datarace analysis over eight real‑world Java programs show that our approach achieves an average speedup of 2.14× over the state‑of‑the‑art sequential approach and 5.56× over the state‑of‑the‑art GPU‑based approach, while maintaining high accuracy.
Authors: Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao
Abstract: Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume based contrastive learning to four modalities, capturing higher‑order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pre‑training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state of the art baselines. Our results highlight the benefits of higher order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
Authors: Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Abstract: Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high‑throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k‑mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k‑mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24‑32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
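The k‑mer guidance step can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the one used here (the fraction of a candidate's k‑mers that occur in the alignment) is an assumption, as are the names `build_kmer_set`, `kmer_score`, and `pick_best` and the toy sequences.

```python
def build_kmer_set(msa_rows, k):
    """Collect every k-mer seen in the alignment rows, with gap characters removed."""
    kmers = set()
    for row in msa_rows:
        seq = row.replace("-", "").replace(".", "")
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def kmer_score(candidate, kmers, k):
    """Fraction of the candidate's k-mers that appear in the alignment-derived set."""
    n = len(candidate) - k + 1
    if n <= 0:
        return 0.0
    hits = sum(1 for i in range(n) if candidate[i:i + k] in kmers)
    return hits / n


def pick_best(candidates, kmers, k):
    """Return the drafted candidate with the highest k-mer score."""
    return max(candidates, key=lambda c: kmer_score(c, kmers, k))


msa = ["MKT-AYIAK", "MKTLAYI-K", "MRTAAYIAK"]
kmers = build_kmer_set(msa, k=3)
drafts = ["MKTAYI", "MWWWWW", "MKTLWW"]
best = pick_best(drafts, kmers, k=3)
```

In a speculative decoding loop, a selection of this kind would sit between drafting and target-model verification, so that the candidates passed on for verification are those most consistent with the family's motifs.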
Authors: Mohammadsaleh Refahi, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen
Abstract: Accurate prediction of drug‑target binding affinity can accelerate drug discovery by prioritizing promising compounds before costly wet‑lab screening. While deep learning has advanced this task, most models fuse ligand and protein representations via simple concatenation and lack explicit geometric regularization, resulting in poor generalization across chemical space and time. We introduce FIRM‑DTI, a lightweight framework that conditions molecular embeddings on protein embeddings through a feature‑wise linear modulation (FiLM) layer and enforces metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM‑DTI achieves state‑of‑the‑art performance on the Therapeutics Data Commons DTI‑DG benchmark, as demonstrated by an extensive ablation study and out‑of‑domain evaluation. Our results underscore the value of conditioning and metric learning for robust drug‑target affinity prediction.
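The three components named in the abstract (FiLM conditioning, a triplet loss, and an RBF head over embedding distances) can be sketched in plain Python. This is an illustrative sketch rather than the FIRM‑DTI implementation: the layer shapes, the Euclidean distance, the margin, and all function names are assumptions.

```python
import math


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(mol_emb, prot_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: scale and shift each molecular feature
    with coefficients predicted from the protein embedding."""
    gamma = linear(w_gamma, b_gamma, prot_emb)
    beta = linear(w_beta, b_beta, prot_emb)
    return [g * m + b for g, m, b in zip(gamma, mol_emb, beta)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing the negative at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rbf_head(distance, centers, widths, coeffs):
    """Affinity as a weighted sum of Gaussian bumps over an embedding distance."""
    return sum(c * math.exp(-((distance - mu) ** 2) / (2 * s ** 2))
               for c, mu, s in zip(coeffs, centers, widths))


conditioned = film([1.0, 2.0], [1.0, 0.0],
                   w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
                   w_beta=[[1.0, 0.0], [1.0, 0.0]], b_beta=[0.0, 0.0])
loss_violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
loss_satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
affinity = rbf_head(0.0, centers=[0.0], widths=[1.0], coeffs=[2.0])
```

The triplet loss is zero once the negative is already at least one margin farther from the anchor than the positive, which is what gives the embedding space its metric structure.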
Authors: Ben Pisanty, Jovana Andrejevic, Andrea J. Liu, Sidney R. Nagel
Abstract: Elastic networks can be tuned to exhibit complex mechanical responses and have been extensively used to study protein allosteric functionality, where a localized strain regulates the conformation at a distant site. We show that cooperative binding, where two sites each enhance the other's ability to function, can be trained via a symmetric application of the training previously employed for creating network allostery. We identify a crossover temperature above which cooperative functionality breaks down due to thermal fluctuations. We develop a modified training protocol to increase this crossover temperature, enabling function to remain robust at biologically relevant temperatures.
Authors: Rujie Yin, Yang Shen
Abstract: Structural prediction of protein‑protein interactions is important to understand the molecular basis of cellular interactions, but it still faces major challenges when significant conformational changes are present. We propose a generative framework of hierarchical adaptive diffusion to improve accuracy and efficiency in such cases. It is hierarchical in separating global inter‑protein rigid‑body motions and local intra‑protein flexibility in diffusion processes, and the distinct local and global noise schedules are designed to mimic the induced‑fit effect. It is adaptive in conditioning the local flexibility schedule on predicted levels of conformational change, allowing faster flexing for larger anticipated conformational changes. Furthermore, it couples the local and global diffusion processes through a common score and confidence network with sequence, evolution, structure, and dynamics features as inputs, and maintains rotational or translational invariance or equivariance in outputs. It builds on our newly curated DIPS‑AF dataset of nearly 39,000 examples for pre‑training. Experiments on the independent docking benchmark dataset DB5.5 show that our model outperforms an AlphaFold2‑like iterative transformer (GeoDock) and a diffusion model (DiffDock‑PP) in both rigid and flexible cases, with larger improvements in more flexible cases. Ablation studies prove the importance of adaptive schedules, dynamics features, and pre‑training. Additional analyses and case studies reveal remaining gaps in sampling, scoring, and conformational resolution.
Authors: Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano
Abstract: The rapid proliferation of high‑quality synthetic data ‑‑ generated by advanced AI models or collected as auxiliary data from related tasks ‑‑ presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic‑Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high‑quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when synthetic data is of low quality. The error of our method remains below a user‑specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.
Authors: Mohammad Tabish, Benedict Leimkuhler, Stefan Klus
Abstract: We propose a randomized neural network approach called RaNNDy for learning transfer operators and their spectral decompositions from data. The weights of the hidden layers of the neural network are randomly selected and only the output layer is trained. The main advantage is that without a noticeable reduction in accuracy, this approach significantly reduces the training time and resources while avoiding common problems associated with deep learning such as sensitivity to hyperparameters and slow convergence. Additionally, the proposed framework allows us to compute a closed‑form solution for the output layer which directly represents the eigenfunctions of the operator. Moreover, it is possible to estimate uncertainties associated with the computed spectral properties via ensemble learning. We present results for different dynamical operators, including Koopman and Perron‑Frobenius operators, which have important applications in analyzing the behavior of complex dynamical systems, and the Schrödinger operator. The numerical examples, which highlight the strengths but also weaknesses of the proposed framework, include several stochastic dynamical systems, protein folding processes, and the quantum harmonic oscillator.
Authors: Jiayi Xin, Aniruddh Raghu, Nick Bhattacharya, Adam Carr, Melanie Montgomery, Hunter Elliott
Abstract: Modern therapeutic antibody design often involves composing multi‑part assemblages of individual functional domains, each of which may be derived from a different source or engineered independently. While these complex formats can expand disease applicability and improve safety, they present a significant engineering challenge: the function and stability of individual domains are not guaranteed in the novel format, and the entire molecule may no longer be synthesizable. To address these challenges, we develop a machine learning framework to predict "reformatting success" ‑‑ whether converting an antibody from one format to another will succeed or not. Our framework incorporates both antibody sequence and structural context, incorporating an evaluation protocol that reflects realistic deployment scenarios. In experiments on a real‑world antibody reformatting dataset, we find the surprising result that large pretrained protein language models (PLMs) fail to outperform simple, domain‑tailored, multimodal representations. This is particularly evident in the most difficult evaluation setting, where we test model generalization to a new starting antibody. In this challenging "new antibody, no data" scenario, our best multimodal model achieves high predictive accuracy, enabling prioritization of promising candidates and reducing wasted experimental effort.
Authors: Ellen M. Adams, Igor Ilyakov, Manthan Raj, Daniel Dornbusch, Thales V. A. G. de Oliveira, Atiqa Arshad, Gulloo Lal Prajapati, Alexey Ponomaryov, Jan-Christop Deinert
Abstract: Hydration water is vital for the stabilization of protein structure and function. The strong interaction of hydration water with the protein surface brings into question how dynamics and asymmetry of hydrogen bonds are perturbed for hydration water compared to bulk water. Here, z‑scan transmission measurements at 0.5 Terahertz (THz) were performed for dilute and concentrated lysozyme solutions. A giant nonlinear absorption coefficient was found for dilute lysozyme solutions that is ten times greater than reported in previous studies. This giant nonlinear response stems from the high average THz power generated by the TELBE free electron laser source, which drives the formation of a persistent thermal lens. In contrast, concentrated lysozyme solutions did not demonstrate a nonlinear response, revealing that crowding annihilates the thermal lensing effect. These results indicate that the THz nonlinear transmission of aqueous protein solutions depends on the amount of hydration water present, and open the door to understanding the nonlinear optical properties of biologically relevant systems.
Authors: Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Josh Susskind, Miguel Angel Bautista
Abstract: Protein folding models have achieved groundbreaking results typically via a combination of integrating domain knowledge into the architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow‑matching based protein folding model that solely uses general purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow‑matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold‑3B achieves competitive performance compared to state‑of‑the‑art baselines. In addition, SimpleFold demonstrates strong performance in ensemble prediction, which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general‑purpose architecture, SimpleFold shows efficiency in deployment and inference on consumer‑level hardware. SimpleFold challenges the reliance on complex domain‑specific architectural designs in protein folding, opening up an alternative design space for future progress.
Authors: Hanna Linn, Rui-Hao Li, Alexander Holden, Abdullah Ash Saki, Frank DiFilippo, Tomas Radivoyevitch, Daniel Blankenberg, Laura García-Álvarez, Göran Johansson
Abstract: Accurately predicting protein structures from amino acid sequences remains a fundamental challenge in computational biology, with profound implications for understanding biological functions and enabling structure‑based drug discovery. Quantum computing approaches based on coarse‑grained lattice models combined with variational algorithms have been proposed as an initial step towards predicting protein structures using quantum computers. In this work, we introduce a more efficient quantum protein structure prediction workflow that bypasses the need for explicit Hamiltonian construction by employing a problem‑agnostic ansatz. The ansatz is trained to minimize an energy‑based cost function that can be efficiently computed on classical computers, eliminating the need for ancillary qubits and reducing circuit depth compared to previous Hamiltonian‑based methods. This enables a more scalable approach for larger proteins and facilitates the inclusion of higher‑order interactions, previously hard to achieve in quantum approaches. We validate our method by benchmarking a hardware‑efficient ansatz on a large set of proteins with up to 26 amino acids, modeled on the tetrahedral, body‑centered cubic, and face‑centered cubic lattices, incorporating up to second‑nearest‑neighbor interactions. We assess the performance on both a noise‑free simulator and the ibm_kingston quantum computer using a set of distinct metrics to probe different aspects of the prediction quality. These experiments push the boundaries of quantum methods for protein structure prediction, targeting sequences that are longer than those typically addressed in prior studies. Overall, the results highlight the scalability and versatility of our approach, while also identifying key areas for improvement to inform future algorithm development and hardware advancements.
Authors: Hanqun Cao, Marcelo D. T. Torres, Jingjie Zhang, Zijun Gao, Fang Wu, Chunbin Gu, Jure Leskovec, Yejin Choi, Cesar de la Fuente-Nunez, Guangyong Chen, Pheng-Ann Heng
Abstract: Antimicrobial resistance (AMR) is projected to cause up to 10 million deaths annually by 2050, underscoring the urgent need for new antibiotics. Here we present ApexAmphion, a deep‑learning framework for de novo design of antibiotics that couples a 6.4‑billion‑parameter protein language model with reinforcement learning. The model is first fine‑tuned on curated peptide data to capture antimicrobial sequence regularities, then optimised with proximal policy optimization against a composite reward that combines predictions from a learned minimum inhibitory concentration (MIC) classifier with differentiable physicochemical objectives. In vitro evaluation of 100 designed peptides showed low MIC values (nanomolar range in some cases) for all candidates (100% hit rate). Moreover, 99 out of 100 compounds exhibited broad‑spectrum antimicrobial activity against at least two clinically relevant bacteria. The lead molecules killed bacteria primarily by potently targeting the cytoplasmic membrane. By unifying generation, scoring and multi‑objective optimization with deep reinforcement learning in a single pipeline, our approach rapidly produces diverse, potent candidates, offering a scalable route to peptide antibiotics and a platform for iterative steering toward potency and developability within hours.
Authors: Jayashrita Debnath, Gerhard Hummer
Abstract: Machine learning (ML) is rapidly transforming the way molecular dynamics simulations are performed and analyzed, from materials modeling to studies of protein folding and function. ML algorithms are often employed to learn low‑dimensional representations of conformational landscapes and to cluster trajectories into relevant metastable states. Most of these algorithms require selecting a small number of features that describe the problem of interest. Although deep neural networks can tackle large numbers of input features, the training costs increase with input size, which makes the selection of a subset of features mandatory for most problems of practical interest. Here, we show that random nonlinear projections can be used to compress large feature spaces and make computations faster without substantial loss of information. We describe an efficient way to produce random projections and then exemplify the general procedure for protein folding. For our test cases NTL9 and the double‑norleucine variant of the villin headpiece, we find that random compression retains the core static and dynamic information of the original high dimensional feature space and makes trajectory analysis more robust.
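A random nonlinear projection of the kind described can be sketched as a fixed Gaussian linear map followed by an elementwise nonlinearity. The abstract does not state which nonlinearity or weight distribution the authors use, so the `tanh` activation, the 1/sqrt(n) weight scale, the bias term, and the function names below are assumptions made for illustration.

```python
import math
import random


def make_random_projection(n_in, n_out, seed=0):
    """Draw a fixed (untrained) Gaussian weight matrix and uniform biases."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, biases


def project(features, weights, biases):
    """Compress one feature vector: random linear map, then tanh."""
    return [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
            for row, b in zip(weights, biases)]


# Toy input: 50 hypothetical features (e.g. pairwise distances) for one frame.
frame = [0.1 * i for i in range(50)]
weights, biases = make_random_projection(n_in=50, n_out=8, seed=42)
compressed = project(frame, weights, biases)
```

The same `weights` and `biases` must be reused for every frame of a trajectory so that the compressed coordinates remain comparable across frames.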
Authors: Arindam Panda, Sunil P Singh
Abstract: The role of active stress on the conformational dynamics of a polymer has drawn significant interest due to its potential applications in understanding the energy landscape of protein structures, buckling of biopolymers, genomic spatial organization, and their large scale coherent dynamics. We present a model of bidirectional active force that acts along the polymer's tangent, with its direction stochastically reversing between head to tail and tail to head orientations. The active polymer shows a structural transition from a random coil like state to a compressed state with variations in the active force, directional (polarity) reversal rate, and their fraction. Furthermore, the polymer re‑swells and stretches more than its passive limit for a large active force. The polymer's radius of gyration follows the ideal chain‑like scaling relation in both the compressed and swelled states. The bidirectional active force also drives dynamical transitions, where the effective diffusivity abruptly shifts from a linear to quadratic increase. Similarly, in the regime of large activity, the linear decrease of the longest relaxation time of the polymer changes to a power law behavior. We have shown that the active polymer's conformational, relaxation, and diffusive behaviors display a transition from an active polar linear polymer model (APLP) to an active Brownian particle (ABP) polymer model with the increase in the fraction of the opposite polarity and their reconfiguration time.
Authors: Bowen Jing, Bonnie Berger, Tommi Jaakkola
Abstract: Advances in deep learning have opened an era of abundant and accurate predicted protein structures; however, similar progress in protein ensembles has remained elusive. This review highlights several recent research directions towards AI‑based predictions of protein ensembles, including coarse‑grained force fields, generative models, multiple sequence alignment perturbation methods, and modeling of ensemble descriptors. An emphasis is placed on realistic assessments of the technological maturity of current methods, the strengths and weaknesses of broad families of techniques, and promising machine learning frameworks at an early stage of development. We advocate for "closing the loop" between model training, simulation, and inference to overcome challenges in training data availability and to enable the next generation of models.
Authors: Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu
Abstract: Machine‑learned coarse‑grained (CG) potentials are fast, but degrade over time when simulations reach under‑sampled bio‑molecular conformations, and generating widespread all‑atom (AA) data to combat this is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean squared deviation (RMSD)‑based frame selection from MD simulations in order to generate data on‑the‑fly by querying an oracle during the training of a neural network potential. This framework preserves CG‑level efficiency while correcting the model at precise, RMSD‑identified coverage gaps. By training CGSchNet, a coarse‑grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein‑1 (W1) metric in Time‑lagged Independent Component Analysis (TICA) space on an in‑house benchmark suite.
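The RMSD-based frame selection described above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the Kabsch superposition, the nearest-neighbor criterion, the threshold value, and the names `kabsch_rmsd` and `select_frames` are assumptions.

```python
import numpy as np

def kabsch_rmsd(a, b):
    """RMSD between two (n_atoms, 3) arrays after optimal superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Singular values of the covariance matrix give the best overlap.
    u, s, vt = np.linalg.svd(a.T @ b)
    # Flip the smallest singular value if the best fit is a reflection.
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    msd = (np.sum(a * a) + np.sum(b * b) - 2.0 * s.sum()) / len(a)
    return np.sqrt(max(msd, 0.0))

def select_frames(sim_frames, train_frames, threshold):
    """Indices of simulated frames farther than `threshold` from all training frames."""
    selected = []
    for i, frame in enumerate(sim_frames):
        nearest = min(kabsch_rmsd(frame, ref) for ref in train_frames)
        if nearest > threshold:
            selected.append(i)  # candidate to send to the all-atom oracle
    return selected

# Hypothetical data: one training frame, one rotated copy, one expanded copy.
rng = np.random.default_rng(0)
train = rng.normal(size=(10, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
rotated = train @ rot.T + 5.0   # same shape, different pose
expanded = 3.0 * train          # genuinely different conformation
picked = select_frames([rotated, expanded], [train], threshold=0.1)
```

Only the expanded frame is flagged, because the rotated copy superimposes onto the training frame.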
Authors: Aniruddh Raghu, Sebastian Ober, Maxwell Kazman, Hunter Elliott
Abstract: Therapeutic antibody candidates often require extensive engineering to improve key functional and developability properties before clinical development. This can be achieved through iterative design, where starting molecules are optimized over several rounds of in vitro experiments. While protein structure can provide a strong inductive bias, it is rarely used in iterative design due to the lack of structural data for continually evolving lead molecules over the course of optimization. In this work, we propose a strategy for iterative antibody optimization that leverages both sequence and structure as well as accumulating lab measurements of binding and developability. Building on prior work, we first train a sequence‑structure diffusion generative model that operates on antibody‑antigen complexes. We then outline an approach to use this model, together with carefully predicted antibody‑antigen complexes, to optimize lead candidates throughout the iterative design process. Further, we describe a guided sampling approach that biases generation toward desirable properties by integrating models trained on experimental data from iterative design. We evaluate our approach in multiple in silico and in vitro experiments, demonstrating that it produces high‑affinity binders at multiple stages of an active antibody optimization campaign.
Authors: Leonardo Martini, Ylea Vlamidis, Ileana Armando, Domenica Convertino, Vaidotas Mišeikis, Valerio Voliani, Camilla Coletti
Abstract: The rapid and global spread of coronavirus disease 2019 (COVID‑19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), underscored the urgent need for fast, reliable, and adaptable diagnostic tools capable of responding to current and future viral threats. Early diagnosis is key to limiting transmission, and biosensors based on nanomaterials offer promising solutions for accurate and rapid bioanalyte detection. In this work, we present a scalable matrix of graphene‑based field‑effect transistors (GFETs) for the direct and rapid detection of the SARS‑CoV‑2 spike protein. High‑quality graphene is functionalized in a single step with ACE2‑His, enabling detection of the spike protein with a limit of detection as low as 1 fg/mL in phosphate‑buffered saline (PBS). A robust statistical analysis, based on measurements from approximately 70 devices per analyte concentration, demonstrates the reproducibility and reliability of the platform. This label‑free, scalable, and reproducible COVID‑19 antigen sensor can be readily adapted to detect emerging SARS‑CoV‑2 variants or other viral pathogens, offering a flexible approach for future diagnostic applications.
Authors: Xuefeng Liu, Mingxuan Cao, Songhao Jiang, Xiao Luo, Xiaotian Duan, Mengdi Wang, Tobin R. Sosnick, Jinbo Xu, Rick Stevens
Abstract: The goal of protein design is to generate amino acid sequences that fold into functional structures with desired properties. Prior methods combining autoregressive language models with Monte Carlo Tree Search (MCTS) struggle with long‑range dependencies and suffer from an impractically large search space. We propose MCTD‑ME, Monte Carlo Tree Diffusion with Multiple Experts, which integrates masked diffusion models with tree search to enable multi‑token planning and efficient exploration under the guidance of multiple experts. Unlike autoregressive planners, MCTD‑ME uses biophysical‑fidelity‑enhanced diffusion denoising as the rollout engine, jointly revising multiple positions and scaling to large sequence spaces. It further leverages experts of varying capacities to enrich exploration, guided by a pLDDT‑based masking schedule that targets low‑confidence regions while preserving reliable residues. We propose a novel multi‑expert selection rule (PH‑UCT‑ME) that extends Shannon‑entropy‑based UCT to expert ensembles with mutual information. MCTD‑ME achieves superior performance on the CAMEO and PDB benchmarks, excelling in protein design tasks such as inverse folding, folding, and conditional design challenges like motif scaffolding on lead optimization tasks. Our framework is model‑agnostic, plug‑and‑play, and extensible to de novo protein engineering and beyond.
Authors: Jaydeep Rade, Md Hasibul Hasan Hasib, Meric Ozturk, Baboucarr Faal, Sheng Yang, Dipali G. Sashital, Vincenzo Venditti, Baoyu Chen, Soumik Sarkar, Adarsh Krishnamurthy, Anwesha Sarkar
Abstract: AI‑based in silico methods have improved protein structure prediction but often struggle with large protein complexes (PCs) involving multiple interacting proteins due to missing 3D spatial cues. Experimental techniques like Cryo‑EM are accurate but costly and time‑consuming. We present ProFusion, a hybrid framework that integrates a deep learning model with Atomic Force Microscopy (AFM), which provides high‑resolution height maps from random orientations, naturally yielding multi‑view data for 3D reconstruction. However, generating a large‑scale AFM imaging data set sufficient to train deep learning models is impractical. Therefore, we developed a virtual AFM framework that simulates the imaging process and generated a dataset of ~542,000 proteins with multi‑view synthetic AFM images. We train a conditional diffusion model to synthesize novel views from unposed inputs and an instance‑specific Neural Radiance Field (NeRF) model to reconstruct 3D structures. Our reconstructed 3D protein structures achieve an average Chamfer Distance within the AFM imaging resolution, reflecting high structural fidelity. Our method is extensively validated on experimental AFM images of various PCs, demonstrating strong potential for accurate, cost‑effective protein complex structure prediction and rapid iterative validation using AFM experiments.
Authors: Grischa Gerwert, Marvin Mann, Lennart Langenhoff, Nathalie Woitzik, Diana Hubert, Deniz Duman, Adrian Hoeveler, Sandy Budde, Jonas Simon, Leon Beyer, Martin Schuler, Sandrina Weber, Brit Mollenhauer, Carsten Kötting, Joern Gueldenhaupt, Klaus Gerwert
Abstract: The immuno‑infrared sensor detects target proteins in solution. As an example, the initial misfolding of amyloid beta (Abeta) peptides in blood is measured, enabling early risk prediction of Alzheimer's disease in the preclinical stage. Antibodies concentrate the target protein on the functionalized attenuated total reflection crystal surface. A quantum cascade laser is used to measure the amide I band, which indicates the secondary structure distribution of Abeta in blood as a biomarker.
Authors: Stephan Wiesneth, Paul Recknagel, Alastair T. Gardiner, Richard Cogdell, Richard Hildner, Jürgen Köhler
Abstract: Photosynthesis relies on efficient energy relaxation within the excited‑state manifold of pigment‑protein complexes. Since the protein scaffold is rather flexible, the resulting energetic and structural disorder gives rise to a complex excited‑state energy level structure that fluctuates on all time scales. Although the impact of such fluctuations on relaxation processes is known, the precise exciton states involved in relaxation as well as the nature of the vibrational modes driving relaxation are under debate. Here single pigment‑protein complexes from a photosynthetic purple bacterium are excited with two identical ultrashort phase‑locked pulses producing two exciton wave packets that can interfere. This leads to a modulation of the emission intensity as a function of the delay time between the pulses that fades out within about 100 fs due to fluctuating environments on those time scales. For several single complexes we find variations of the interference patterns on time scales of several tens of seconds that reveal fluctuations in the energy relaxation pathways towards the lowest‑energy exciton states. This relaxation is driven by temporal variations in the coupling between electronic excitations and low‑frequency vibrational modes.
Authors: Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gwing Kei Yip, Gerald W. Y. Cheng, Yunlin Mao, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Abstract: Accurate identification of drug‑target interactions (DTI) remains a central challenge in computational pharmacology, where sequence‑based methods offer scalability. This work introduces a sequence‑based drug‑target interaction framework that integrates structural priors into protein representations while maintaining high‑throughput screening capability. Evaluated across multiple benchmarks, the model achieves state‑of‑the‑art performance on Human and BioSNAP datasets and remains competitive on BindingDB. In virtual screening tasks, it surpasses prior methods on LIT‑PCBA, yielding substantial gains in AUROC and BEDROC. Ablation studies confirm the critical role of learned aggregation, bilinear attention, and contrastive alignment in enhancing predictive robustness. Embedding visualizations reveal improved spatial correspondence with known binding pockets and highlight interpretable attention patterns over ligand‑residue contacts. These results validate the framework's utility for scalable and structure‑aware DTI prediction.
Authors: Alexander Aghili, Andy Bruce, Daniel Sabo, Razvan Marinescu
Abstract: Molecular dynamics (MD) simulations provide atomistic insight into biomolecular systems but are often limited by high computational costs required to access long timescales. Coarse‑grained machine learning models offer a promising avenue for accelerating sampling, yet conventional force matching approaches often fail to capture the full thermodynamic landscape as fitting a model on the gradient may not fit the absolute differences between low‑energy conformational states. In this work, we incorporate a complementary energy matching term into the loss function. We evaluate our framework on the Chignolin protein using the CGSchNet model, systematically varying the weight of the energy loss term. While energy matching did not yield statistically significant improvements in accuracy, it revealed distinct tendencies in how models generalize the free energy surface. Our results suggest future opportunities to enhance coarse‑grained modeling through improved energy estimation techniques and multi‑modal loss formulations.
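A loss that adds an energy matching term to force matching, with a tunable weight as described above, might look like the sketch below. The mean-squared-error form of both terms and the names `force_energy_loss` and `energy_weight` are assumptions for illustration; the abstract does not give the exact formulation.

```python
import numpy as np

def force_energy_loss(pred_forces, ref_forces, pred_energy, ref_energy, energy_weight):
    """Force matching loss plus a weighted energy matching term (both MSE, by assumption)."""
    force_term = np.mean((pred_forces - ref_forces) ** 2)
    energy_term = np.mean((pred_energy - ref_energy) ** 2)
    return force_term + energy_weight * energy_term

# Hypothetical batch: 4 frames, 10 coarse-grained beads, 3 coordinates.
rng = np.random.default_rng(0)
ref_forces = rng.normal(size=(4, 10, 3))
ref_energy = rng.normal(size=4)
pred_forces = ref_forces + 0.1   # small force error
pred_energy = ref_energy + 2.0   # large energy error

loss_force_only = force_energy_loss(pred_forces, ref_forces, pred_energy, ref_energy, 0.0)
loss_combined = force_energy_loss(pred_forces, ref_forces, pred_energy, ref_energy, 0.5)
```

With the weight set to zero the energy error is invisible to the loss, which is the limitation of pure force matching that the abstract points to.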
Authors: Samuel Tovey, Julian Hoßbach, Sandro Kuppel, Tobias Ensslen, Jan C. Behrends, Christian Holm
Abstract: A device capable of performing real time classification of proteins in a clinical setting would allow for inexpensive and rapid disease diagnosis. Nanopore devices are one candidate for this technology. These devices work by measuring a current signal that arises when a protein or peptide enters a nanometer‑length‑scale pore. Should this current be uniquely related to the structure of the peptide and its interactions with the pore, the signals can be used to perform identification. While such a method would allow for real time identification of peptides and proteins in a clinical setting, to date, the complexities of these signals limit their accuracy. In this work, we tackle the issue of classification by converting the current signals into scaleogram images via wavelet transforms, capturing amplitude, frequency, and time information in a modality well‑suited to machine learning algorithms. When tested on 42 peptides, our method achieved a classification accuracy of ~81%, setting a new state‑of‑the‑art in the field and taking a step toward practical peptide/protein diagnostics at the point of care. In addition, we demonstrate model transfer techniques that will be critical when deploying these models into real hardware, paving the way to a new method for real‑time disease diagnosis.
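The conversion of a current trace into a scaleogram can be sketched with a plain NumPy continuous wavelet transform. The Morlet wavelet, the truncation at four scale widths, the scale values, and the name `morlet_scaleogram` are assumptions chosen for illustration; the abstract does not state which wavelet or parameters the authors use.

```python
import numpy as np

def morlet_scaleogram(signal, scales, w0=6.0):
    """Return |CWT| as an (n_scales, n_samples) array using a Morlet wavelet."""
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Sample the wavelet on +/- 4 scale widths, capped to the signal length.
        half = int(min(4 * s, (n - 1) // 2))
        t = np.arange(-half, half + 1) / s
        wavelet = np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2) / np.sqrt(s)
        # Correlation with the wavelet = convolution with its reversed conjugate.
        kernel = np.conj(wavelet)[::-1]
        out[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return out

# Hypothetical trace: a sine with a 20-sample period standing in for a current signal.
samples = np.arange(1000)
trace = np.sin(2 * np.pi * samples / 20.0)
# A Morlet wavelet at scale s responds most to a period of 2*pi*s/w0 samples.
scales = [5.0, 20.0 * 6.0 / (2 * np.pi), 60.0]
image = morlet_scaleogram(trace, scales)
```

Each row of `image` is one scale and each column one time sample, so the array can be rendered or fed to an image classifier directly.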
Authors: Md Masud Rana, Farjana Tasnim Mukta, Duc D. Nguyen
Abstract: In structure‑based drug design, accurately estimating the binding affinity between a candidate ligand and its protein receptor is a central challenge. Recent advances in artificial intelligence, particularly deep learning, have demonstrated superior performance over traditional empirical and physics‑based methods for this task, enabled by the growing availability of structural and experimental affinity data. In this work, we introduce DeepGGL, a deep convolutional neural network that integrates residual connections and an attention mechanism within a geometric graph learning framework. By leveraging multiscale weighted colored bipartite subgraphs, DeepGGL effectively captures fine‑grained atom‑level interactions in protein‑ligand complexes across multiple scales. We benchmarked DeepGGL against established models on CASF‑2013 and CASF‑2016, where it achieved state‑of‑the‑art performance with significant improvements across diverse evaluation metrics. To further assess robustness and generalization, we tested the model on the CSAR‑NRC‑HiQ dataset and the PDBbind v2019 holdout set. DeepGGL consistently maintained high predictive accuracy, highlighting its adaptability and reliability for binding affinity prediction in structure‑based drug discovery.
Authors: Allan dos Santos Costa, Manvitha Ponnapati, Dana Rubin, Tess Smidt, Joseph Jacobson
Abstract: Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high‑resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long‑timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, a Euclidean‑Equivariant Flow Matching‑based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model's performance in generalizing to long‑term dynamics of fast‑folding proteins and characterizing the trade‑off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves a significant ≈1000× computational acceleration while effectively recovering long‑timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.
Authors: Rebecca Manuela Neeser, Ilia Igashov, Arne Schneuing, Michael Bronstein, Philippe Schwaller, Bruno Correia
Abstract: Fragment‑based drug design is a promising strategy leveraging the binding of small chemical moieties that can efficiently guide drug discovery. The initial step of fragment identification remains challenging, as fragments often bind weakly and non‑specifically. We developed a protein‑fragment encoder that relies on a contrastive learning approach to map both molecular fragments and protein surfaces in a shared latent space. The encoder captures interaction‑relevant features and enables virtual screening as well as generative design with our new method LatentFrag. In LatentFrag, fragment embeddings and positions are generated conditioned on the protein surface while being chemically realistic by construction. Our expressive fragment and protein representations allow location of protein‑fragment interaction sites with high sensitivity and we observe state‑of‑the‑art fragment recovery rates when sampling from the learned distribution of latent fragment embeddings. Our generative method outperforms common methods such as virtual screening at a fraction of its computational cost, providing a valuable starting point for fragment hit discovery. We further show the practical utility of LatentFrag and extend the workflow to full ligand design tasks. Together, these approaches contribute to advancing fragment identification and provide valuable tools for fragment‑based drug discovery.
Authors: Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma, Tin Barisin, Eugen Rusakov, Udo Gobel, Yuxu Peng, Shiqiang Deng, Yuki Kagaya, Joon Hong Park, Daisuke Kihara, Marco Guerra, Giorgio Palmieri, Andrea Ranieri, Ulderico Fugacci, Silvia Biasotti, Ruiwen He, Halim Benhabiles, Adnane Cabani, Karim Hammoudi, Haotian Li, Hao Huang, Chunyan Li, Alireza Tehrani, Fanwang Meng, Farnaz Heidar-Zadeh, Tuan-Anh Yang, Matthieu Montes
Abstract: This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.
Authors: Bozhen Hu, Cheng Tan, Siyuan Li, Jiangbin Zheng, Sizhe Qiu, Jun Xia, Stan Z. Li
Abstract: The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre‑trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov‑Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state‑of‑the‑art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.
Authors: JingChun Wang, Meenu Upadhyay, Eric D. Boittier, Kham Lek Chaton, Valerii Andreichev, Mike Devereux, Shimoni Patel, Sena Aydin, Kai Töpfer, Markus Meuwly
Abstract: Energy functions for pure and heterogeneous systems are one of the backbones for molecular simulation of condensed phase systems. With the advent of machine learned potential energy surfaces (ML‑PESs) a new era has started. Statistical models allow the representation of reference data from electronic structure calculations for chemical systems of almost arbitrary complexity at unprecedented detail and accuracy. Here, kernel‑ and neural network‑based approaches for intramolecular degrees of freedom are combined with distributed charge models for long range electrostatics to describe the interaction energies of condensed phase systems. The main focus is on illustrative examples ranging from pure liquids (dichloromethane, water) to chemically and structurally heterogeneous systems (eutectic liquids, CO on amorphous solid water), reactions (Menshutkin), and spectroscopy (triatomic probes for protein dynamics). For all examples, small to medium‑sized clusters are used to represent and improve the total interaction energy compared with reference quantum chemical calculations. Although remarkable accuracy can be achieved for some systems (chemical accuracy for dichloromethane and water), it is clear that more realistic models are required for van der Waals contributions and improved water models need to be used for more quantitative simulations of heterogeneous chemical and biological systems.
Authors: Seon-Geun Jeong, Kyeong-Hwan Moon, Won-Joo Hwang
Abstract: Protein‑ligand binding affinity is critical in drug discovery, but experimentally determining it is time‑consuming and expensive. Artificial intelligence (AI) has been used to predict binding affinity, significantly accelerating this process. However, the high‑performance requirements and vast datasets involved in affinity prediction demand increasingly large AI models, requiring substantial computational resources and training time. Quantum machine learning has emerged as a promising solution to these challenges. In particular, hybrid quantum‑classical models can reduce the number of parameters while maintaining or improving performance compared to classical counterparts. Despite these advantages, challenges persist: why hybrid quantum models achieve these benefits, whether quantum neural networks (QNNs) can replace classical neural networks, and whether such models are feasible on noisy intermediate‑scale quantum (NISQ) devices. This study addresses these challenges by proposing a hybrid quantum neural network (HQNN) that empirically demonstrates the capability to approximate non‑linear functions in the latent feature space derived from classical embedding. The primary goal of this study is to achieve a parameter‑efficient model in binding affinity prediction while ensuring feasibility on NISQ devices. Numerical results indicate that HQNN achieves comparable or superior performance and parameter efficiency compared to classical neural networks, underscoring its potential as a viable replacement. This study highlights the potential of hybrid QML in computational drug discovery, offering insights into its applicability and advantages in addressing the computational challenges of protein‑ligand binding affinity prediction.
Authors: Nayem AL-Kayed, Charles St-Arnault, Hugh Morison, A. Aadhi, Chaoran Huang, Alexander N. Tait, David V. Plant, Bhavin J. Shastri
Abstract: Ising machines offer a compelling approach to addressing NP‑hard problems, but physical realizations that are simultaneously scalable, reconfigurable, fast, and stable remain elusive. Quantum annealers, like D‑Wave's cryogenic hardware, target combinatorial optimization tasks, but quadratic scaling of qubit requirements with problem size limits their scalability on dense graphs. Here, we introduce a programmable, stable, room‑temperature optoelectronic oscillator (OEO)‑based Ising machine with linear scaling in spin representation. Inspired by Hopfield networks, our architecture solves fully‑connected problems with up to 256 spins (65,536 couplings), and >41,000 spins (205,000+ couplings) if sparse. Our system leverages cascaded thin‑film lithium niobate modulators, a semiconductor optical amplifier, and a digital signal processing (DSP) engine in a recurrent time‑encoded loop, demonstrating potential >200 giga‑operations per second for spin coupling and nonlinearity. This platform achieves the largest spin configuration in an OEO‑based photonic Ising machine, enabled by high intrinsic speed. We experimentally demonstrate best‑in‑class solution quality for Max‑Cut problems of arbitrary graph topologies (2,000 and 20,000 spins) among photonic Ising machines and obtain ground‑state solutions for number partitioning and lattice protein folding ‑ benchmarks previously unaddressed by photonic systems. Our system leverages inherent noise from high baud rates to escape local minima and accelerate convergence. Finally, we show that embedding DSP ‑ traditionally used in optical communications ‑ within optical computation enhances convergence and solution quality, opening new frontiers in scalable, ultrafast computing for optimization, neuromorphic processing, and analog AI.
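For readers unfamiliar with how number partitioning, one of the benchmarks named above, is posed as an Ising problem, the sketch below uses the standard textbook mapping: each number gets a spin of +1 or -1, and the energy is the squared difference between the two subset sums. The brute-force search and the function names are illustrative assumptions and have nothing to do with how the optoelectronic hardware finds ground states.

```python
from itertools import product

def partition_energy(numbers, spins):
    """Ising energy for number partitioning: (sum_i a_i * s_i) ** 2."""
    return sum(a * s for a, s in zip(numbers, spins)) ** 2

def brute_force_ground_state(numbers):
    """Exhaustively search all spin assignments (only feasible for tiny inputs)."""
    best = min(product((-1, 1), repeat=len(numbers)),
               key=lambda spins: partition_energy(numbers, spins))
    return best, partition_energy(numbers, best)

# Hypothetical instance: these five numbers split into two subsets of equal sum.
numbers = [4, 5, 6, 7, 8]
spins, energy = brute_force_ground_state(numbers)
subset_sum = sum(a for a, s in zip(numbers, spins) if s == 1)
```

A ground-state energy of zero means the two subsets have equal sums; an Ising machine searches for the same minimum without enumerating all 2^N assignments.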
Authors: Javier Martínez-Puig, Gianluca D'Agostino, Ana Oña, Javier Rodríguez-Rodríguez
Abstract: The coffee‑ring effect is a universal feature of evaporating sessile droplets with pinned contact line, wherein solutes or particles are advected to the droplet's edge due to evaporation‑driven flows. While existing models have successfully described this phenomenon in particle‑laden droplets, they often assume that the evaporative flux, and thus hydrodynamics, are decoupled from solute transport. This assumption breaks down in complex fluids, such as protein or polymeric solutions, where the solute can influence evaporation through changes in water activity. Here, we investigate model respiratory droplets primarily composed of water, salt, and a type of the glycoprotein mucin. Using fluorescence microscopy, we observe the formation of a well‑defined protein ring at the droplet edge as water evaporates. The growth and morphology of this ring exhibit a strong dependence on ambient relative humidity (H_r), revealing dynamics that existing models cannot capture. Specifically, we find that protein accumulation at the edge is governed by the feedback between local solute concentration and evaporation rate. To account for this, we develop a minimal theoretical model based on the lubrication approximation, incorporating the coupling between hydrodynamics and solute transport through the evaporation rate. Our framework reproduces key features of the experimental observations and suggests a physical basis for the H_r‑dependent stability and infectivity of respiratory droplets containing viruses.
Authors: Yuntao Lu, Yunxin Zhang
Abstract: Stochastic modeling of gene expression is a classic problem in theoretical biophysics, and the burst approximation is widely used to simplify gene expression models formulated via the chemical master equation. However, the approximation error has been investigated only for the simplest case. This article proposes and analyzes a general stochastic gene expression model with an arbitrary number of gene states, and quantifies the error introduced by the burst approximation. Using the standard binomial moment method, we derive recurrence relations for binomial moments in steady state. We develop an algorithm to numerically compute binomial moments in a hierarchical manner. In particular, explicit expressions for low‑order moments are presented. Compared with surrogate models under the burst approximation, we conclude that the first‑order moment of protein counts is preserved, whereas discrepancies generally arise in higher‑order moments. By estimating the difference between two second‑order moments using functional analysis, we evaluate the validity of the burst approximation.
Authors: Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer
Abstract: The complementarity‑determining regions of antibodies are loop structures that are key to their interactions with antigens, and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence. Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody‑antigen targets. Additionally, it is on par with existing state‑of‑the‑art sequence‑based and multimodal protein language models, performing comparably to models with 7× more parameters. IglooALM samples antibody loops which are diverse in sequence and more consistent in structure than state‑of‑the‑art antibody inverse folding models. Igloo demonstrates the benefit of introducing multimodal tokens for antibody loops for encoding the diverse landscape of antibody loops, improving protein foundation models, and for antibody CDR design.
Authors: Åke Andersson, Vitali Zhaunerchyk
Abstract: Proteins are vital biological molecules found in every living organism, and their function is determined by what shape they fold into. Peptides are essentially subsets of proteins, and therefore ideal as model systems for protein folding. The structure of a molecule is closely related to its vibrational absorption spectrum, which lies in the infrared (IR) range. However, in vivo IR spectroscopy is hindered by interference from the surrounding water. Therefore, peptides are preferably studied isolated from solution, in the gas phase. This chapter summarizes the recent IR spectroscopy studies of gas‑phase peptides. The collected works show that IR spectroscopy combined with quantum chemical calculations is a powerful tool for deducing the molecular structure. Moreover, the wealth of experimental spectra makes it possible to evaluate different quantum chemical models, which can be applied to larger proteins.
Authors: Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao
Abstract: Protein Language Models (PLMs), pre‑trained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering, a technique originally developed for controlling text generation in Large Language Models (LLMs), to direct PLMs toward generating protein sequences with targeted properties. We propose a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on lysozyme‑like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto‑encoding and autoregressive PLMs without requiring additional training. These results highlight a promising direction for precise protein engineering using foundation models.
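The abstract does not spell out how the activation edit is computed. As a rough illustration only, the generic activation-addition recipe from the LLM literature builds a steering direction as the difference between mean hidden activations on target-property and baseline inputs, then adds a scaled copy of it to a hidden state. The sketch below uses plain lists as stand-ins for hidden vectors; the names `steering_vector`, `steer`, and `alpha` are hypothetical and are not taken from the paper.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def steering_vector(target_acts, baseline_acts):
    """Difference of mean activations: target minus baseline (assumed recipe)."""
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations standing in for one layer of a PLM.
target = [[1.0, 2.0], [3.0, 2.0]]
baseline = [[0.0, 0.0], [0.0, 2.0]]
direction = steering_vector(target, baseline)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

In a real model the edit would be applied inside the forward pass at a chosen layer; which layer and which strength work best are empirical choices that this sketch does not address.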
Authors: Adrien Couetoux, Thomas Devenyns, Lise Diagne, David Champagne, Pierre-Yves Mousset, Chris Anagnostopoulos
Abstract: In pharmaceutical R&D, predicting the efficacy of a pharmaceutical in treating a particular disease prior to clinical testing or any real‑world use has been challenging. In this paper, we propose a flexible and modular machine learning‑based approach for predicting the efficacy of an untested pharmaceutical for treating a disease. We train a machine learning model using sets of pharmaceutical‑pathway weight impact scores and patient data, which can include patient characteristics and observed clinical outcomes. The resulting model then analyses weighted impact scores of an untested pharmaceutical across human biological molecule‑protein pathways to generate a predicted efficacy value. We demonstrate how the method works on a real‑world dataset with patient treatments and outcomes, with two different weight impact score algorithms. We include methods for evaluating the generalisation performance on unseen treatments, and for characterising conditions under which the approach can be expected to be most predictive. We discuss specific ways in which our approach can be iterated on, making it an initial framework to support future work on predicting the effect of untested drugs, leveraging RWD clinical data and drug embeddings.
Authors: Jayanth Venkatarama Reddy, Nelson Ndahiro, Lateef Aliyu, Ashwin Dravid, Tianxin Xang, Jinke Wu, Michael Betenbaugh, Marc Donohue
Abstract: The majority of therapeutic monoclonal antibodies (mAbs) on the market are produced using Chinese Hamster Ovary (CHO) cells cultured at scale in chemically defined cell culture medium. Because of the high costs associated with mammalian cell cultures, obtaining high cell densities to produce high product titers is desired. These bioprocesses require high concentrations of nutrients in the basal media and periodically adding concentrated feed media to sustain cell growth and therapeutic protein productivity. Unfortunately, the desired or optimal nutrient concentrations of the feed media are often solubility limited due to precipitation of chemical complexes that form in the solution. Experimentally screening the various cell culture media configurations which contain 50 to 100 compounds can be expensive and laborious. This article lays the foundation for utilizing computational tools to understand precipitation of nutrients in cell culture media by studying the pairwise interactions between amino acids in thermodynamic models. Activity coefficient data for one amino acid in water and amino acid solubility data of two amino acids in water have been used to determine a single set of UNIFAC group interaction parameters to predict the thermodynamic behavior of the multi‑component systems found in mammalian cell culture media. The data collected in this study is, to our knowledge, the largest set of ternary system amino acid solubility data reported to date. These amino acid precipitation predictions have been verified with experimentally measured ternary and quaternary amino acid solutions. Thus, we demonstrate the utility of our model as a digital twin to identify optimal cell culture media compositions by replacing empirical approaches for nutrient precipitation with computational predictions based on thermodynamics of individual media components in complex mixtures.
Authors: Diego Sanchez Espinosa, Erik H Thiede, Yunan Yang
Abstract: Cryo‑electron microscopy (Cryo‑EM) enables high‑resolution imaging of biomolecules, but structural heterogeneity remains a major challenge in 3D reconstruction. Traditional methods assume a discrete set of conformations, limiting their ability to recover continuous structural variability. In this work, we formulate cryo‑EM reconstruction as a stochastic inverse problem (SIP) over probability measures, where the observed images are modeled as the push‑forward of an unknown distribution over molecular structures via a random forward operator. We pose the reconstruction problem as the minimization of a variational discrepancy between observed and simulated image distributions, using statistical distances such as the KL divergence and the Maximum Mean Discrepancy. The resulting optimization is performed over the space of probability measures via a Wasserstein gradient flow, which we numerically solve using particles to represent and evolve conformational ensembles. We validate our approach using synthetic examples, including a realistic protein model, which demonstrates its ability to recover continuous distributions over structural states. We analyze the connection between our formulation and Maximum A Posteriori (MAP) approaches, which can be interpreted as instances of the discretize‑then‑optimize (DTO) framework. We further provide a consistency analysis, establishing conditions under which DTO methods, such as MAP estimation, converge to the solution of the underlying infinite‑dimensional continuous problem. Beyond cryo‑EM, the framework provides a general methodology for solving SIPs involving random forward operators.
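One of the statistical distances named above, the Maximum Mean Discrepancy, can be illustrated on toy data. The sketch below is a minimal biased (V-statistic) estimate of the squared MMD with a Gaussian kernel, using scalars as stand-ins for images. The kernel choice, the bandwidth, and the function names are assumptions made for illustration; this is not the authors' implementation, and it omits the forward operator and the particle-based gradient flow entirely.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between two samples (includes diagonal terms)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "observed" sample and two candidate "simulated" samples.
observed = [0.0, 0.1, -0.1, 0.2]
simulated_close = [0.05, 0.0, -0.05, 0.15]
simulated_far = [3.0, 3.1, 2.9, 3.2]

close_gap = mmd_squared(observed, simulated_close)
far_gap = mmd_squared(observed, simulated_far)
```

A reconstruction method of the kind described would move the simulated sample so as to shrink this discrepancy; here the two candidates are only compared.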
Authors: Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu
Abstract: Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically‑guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
Authors: Raúl Miñán, Carles Perez-Lopez, Javier Iglesias, Álvaro Ciudad, Alexis Molina
Abstract: Molecular docking is a cornerstone of drug discovery, relying on high‑resolution ligand‑bound structures to achieve accurate predictions. However, obtaining these structures is often costly and time‑intensive, limiting their availability. In contrast, ligand‑free structures are more accessible but suffer from reduced docking performance due to pocket geometries being less suited for ligand accommodation in apo structures. Traditional methods for artificially inducing these conformations, such as molecular dynamics simulations, are computationally expensive. In this work, we introduce Sesame, a generative model designed to predict this conformational change efficiently. By generating geometries better suited for ligand accommodation at a fraction of the computational cost, Sesame aims to provide a scalable solution for improving virtual screening workflows.
Authors: João V. M. Pimentel, Vladimir A. Mandelshtam
Abstract: In a recent paper, J. Chem. Phys. 162, 214101 (2025), a novel approach for the rigidification of a molecular cluster was proposed, in which starting with an all‑atom (AA) potential, a coarse‑grained (CG) potential for the associated cluster of rigid monomers was constructed directly. The method is based on using the harmonic approximation for the fast intramolecular degrees of freedom. While conceptually primitive, the resulting CG model turned out to be surprisingly accurate for selected water and ammonia clusters. However, as originally formulated, a single evaluation of the CG potential turned out to be much more expensive than the evaluation of the AA potential, since the former required a subspace minimization followed by a subspace normal mode calculation. In this communication, we formulate the approach more broadly, making it applicable, e.g., to coarse‑graining a large protein. We also introduce key algorithmic improvements, reducing the cost of the subspace minimization and normal mode calculation. Combined with the fact that the CG simulation requires roughly an order of magnitude fewer Monte Carlo steps to reach similar statistical accuracy for selected observables compared to the AA model, the overall computational cost becomes comparable. These improvements are demonstrated on a water cluster.
Authors: Avinash Mandaiya, Veit Elser
Abstract: The advent of advanced crystallographic techniques has shifted structural biology from static, single‑conformer models toward probing protein dynamics. Extracting cooperative motions from temporally and spatially averaged electron density maps requires both high‑resolution data and refinement algorithms capable of handling conformational heterogeneity. However, current refinement protocols often fail due to the tangling phenomenon, in which conformational states become improperly intertwined during optimization. Here, we present an automated refinement methodology based on iterative projections within the divide‑and‑concur framework. This approach enables seamless integration of geometric constraints with experimental density constraints derived from observed scattering amplitudes. By allowing each atom to satisfy density constraints independently, we show that this framework effectively circumvents tangling artifacts and achieves robust refinement performance, even for models initialized with R‑factors as high as 12%. Just as iterative projections revolutionized phase retrieval in crystallography, we demonstrate that they can also address the optimization challenges in multi‑conformational refinement. This work establishes a computational foundation for advancing crystallographic methodologies to resolve conformational heterogeneity and ultimately capture protein dynamics at atomic resolution.
Authors: Matouš Soldát, Jiří Kléma
Abstract: Directed evolution is an iterative laboratory process of designing proteins with improved function by repeatedly synthesizing new protein variants and evaluating their desired property with expensive and time‑consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening to increase their quality and reduce the amount of necessary screening. In this paper, we present a novel method for machine‑learning‑assisted directed evolution of proteins which combines Bayesian optimization with an informative representation of protein variants extracted from a pre‑trained protein language model. We demonstrate that the new representation based on sequence embeddings significantly improves the performance of Bayesian optimization, yielding better results with the same total number of screenings. At the same time, our method outperforms the state‑of‑the‑art machine‑learning‑assisted directed evolution methods with a regression objective.
Authors: Praveen Muralikrishnan, Jonathan W. P. Zajac, Caryn L. Heldt, Sarah L. Perry, Sapna Sarupria
Abstract: The stabilization of macromolecules is fundamental to developing biological formulations, such as vaccines and protein therapeutics. In this study, we employ coarse grained polymer models to investigate the impact of four sugars: α‑glucose, β‑fructose, trehalose, and sucrose on macromolecule stability. Free energy decomposition and preferential interaction analysis indicate that polymer‑sugar interactions favor folding at low concentrations while driving unfolding at higher concentrations. In contrast, the polymer‑solvent soft interaction entropy consistently favors unfolding across all sugar concentrations under study. At low sugar concentrations, polymer‑solvent interactions predominantly govern stabilization, whereas at higher concentrations, entropic penalties dictate polymer stability. Local mixing entropy demonstrates that binary sugar mixtures introduce entropic contributions that preferentially stabilize the folded state. These findings contribute to a more nuanced understanding of sugar‑based excipient stabilization mechanisms, offering guidance for the rational design of stable biological formulations.
Authors: Michelle Dargasz, Nimmi Das Anthuparambil, Sebastian Retzbach, Anita Girelli, Sonja Timmermann, Johannes Möller, Wonhyuk Jo, Aliaksandr Lenonau, Agha Mohammad Raza, Maddalena Bin, Jaqueline Savelkouls, Iason Andronis, Frederik Unger, Felix Brausse, Jörg Hallmann, Ulrike Boesenberg, Jan-Etienne Pudell, Angel Rodriguez-Fernandez, James Wrigley, Roman Shayduk, Mohamed Youssef, Alexey Zozulya, Anders Madsen, Felix Lehmkühler, Fivos Perakis, Fajun Zhang, Frank Schreiber, Michael Paulus, Christian Gutt
Abstract: Macromolecular crowding plays a crucial role in modulating protein dynamics in cellular and in vitro environments. Polymeric crowders such as dextran and Ficoll are known to induce entropic forces, including depletion interactions, that promote structural organization, but the nanoscale consequences for protein dynamics remain less well understood. Here, we employ megahertz X‑ray photon correlation spectroscopy (MHz‑XPCS) at the European XFEL to probe the dynamics of the protein ferritin in solutions containing sucrose, dextran, and Ficoll. We find that depletion‑driven short‑range attractions combined with long‑range repulsions give rise to intermediate‑range order (IRO) once the polysaccharide overlap concentration c* is exceeded. These IRO features fluctuate on microsecond to millisecond timescales, strongly modulating the collective dynamics of ferritin. The magnitude of these effects depends sensitively on crowder type, concentration, and molecular weight. Normalizing the crowder concentration by c* reveals scaling behavior in ferritin self‑diffusion with a crossover near 2c*, marking a transition from depletion‑enhanced mobility to viscosity‑dominated slowing. Our results demonstrate that bulk properties alone cannot account for protein dynamics in crowded solutions, underscoring the need to include polymer‑specific interactions and depletion theory in models of crowded environments.
Authors: Zhiyu Wang, Arian Jamasb, Mustafa Hajij, Alex Morehead, Luke Braithwaite, Pietro Liò
Abstract: Protein representation learning (PRL) is crucial for understanding structure‑function relationships, yet current sequence‑ and graph‑based methods fail to capture the hierarchical organization inherent in protein structures. We introduce Topotein, a comprehensive framework that applies topological deep learning to PRL through the novel Protein Combinatorial Complex (PCC) and Topology‑Complete Perceptron Network (TCPNet). Our PCC represents proteins at multiple hierarchical levels ‑‑ from residues to secondary structures to complete proteins ‑‑ while preserving geometric information at each level. TCPNet employs SE(3)‑equivariant message passing across these hierarchical structures, enabling more effective capture of multi‑scale structural patterns. Through extensive experiments on four PRL tasks, TCPNet consistently outperforms state‑of‑the‑art geometric graph neural networks. Our approach demonstrates particular strength in tasks such as fold classification which require understanding of secondary structure arrangements, validating the importance of hierarchical topological features for protein analysis.
Authors: Nadine du Toit, Kristian K. Müller-Nedebock
Abstract: This paper builds on a recently introduced dynamical networking framework, applying it to model motor‑driven transport along cytoskeletal filament networks. Within this approach, the networking functional describes the periodic binding and unbinding of motors to available filament sites, whilst accounting for all possible pairings, enabling a field‑theoretic treatment of constrained motion in complex networks. In this application, the dynamical networking theory is introduced into a Martin‑Siggia‑Rose representation of the Langevin dynamics describing the motion of a motor protein and its cargo. Results are presented in a collective description of motors on a network, for two different scenarios, namely homogeneous and non‑homogeneous networks. A diffusion coefficient is presented for homogeneous networks, whilst it is shown that various possibilities remain for disordered averaging over network densities for non‑homogeneous networks.
Authors: Lorenzo Talamanca, Julian Trouillon
Abstract: Combinatorial group testing reduces screening costs and turnaround time but remains challenging to apply due to design complexity, varying applicability, and lack of implementation tools. Here we present PoolPy, a unified end‑to‑end framework and web platform to benchmark, automate and decode combinatorial group testing strategies tailored to application‑specific constraints across assay modalities. We demonstrate PoolPy's utility for protein‑ligand interaction screening and genome‑wide molecular profiling, enabling the scaling up of multi‑readout functional assays.
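To make the idea of combinatorial group testing concrete, the sketch below implements one classic design, row-and-column (matrix) pooling, together with a simple decoder that clears every item appearing in a negative pool. It is an illustration only and does not use or imitate the PoolPy API; the function names and the 4 x 5 layout are assumptions.

```python
def matrix_pools(n_rows, n_cols):
    """Row pools followed by column pools; item id = row * n_cols + col."""
    rows = [{r * n_cols + c for c in range(n_cols)} for r in range(n_rows)]
    cols = [{r * n_cols + c for r in range(n_rows)} for c in range(n_cols)]
    return rows + cols


def run_assay(pools, positives):
    """Idealised noiseless readout: a pool is positive if it holds any positive item."""
    return [bool(pool & positives) for pool in pools]


def decode(pools, results):
    """Candidate positives are the items that never appear in a negative pool."""
    candidates = set().union(*pools)
    for pool, is_positive in zip(pools, results):
        if not is_positive:
            candidates -= pool
    return candidates


pools = matrix_pools(4, 5)  # 20 items screened with 9 pooled tests
single = decode(pools, run_assay(pools, {7}))
double = decode(pools, run_assay(pools, {7, 13}))
```

With one positive item the nine pooled tests identify it uniquely. With two positives in different rows and columns the decoder returns four candidates, which shows why the choice of design depends on the expected number of positives.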
Authors: Phuc Pham, Viet Thanh Duy Nguyen, Truong-Son Hy
Abstract: Accurate identification of interactions between protein residues and ligand functional groups is essential to understand molecular recognition and guide rational drug design. Existing deep learning approaches for protein‑ligand interpretability often rely on 3D structural input or use distance‑based contact labels, limiting both their applicability and biological relevance. We introduce LINKER, the first sequence‑based model to predict residue‑functional group interactions in terms of biologically defined interaction types, using only protein sequences and the ligand SMILES as input. LINKER is trained with structure‑supervised attention, where interaction labels are derived from 3D protein‑ligand complexes via functional group‑based motif extraction. By abstracting ligand structures into functional groups, the model focuses on chemically meaningful substructures while predicting interaction types rather than mere spatial proximity. Crucially, LINKER requires only sequence‑level input at inference time, enabling large‑scale application in settings where structural data is unavailable. Experiments on the LP‑PDBBind benchmark demonstrate that structure‑informed supervision over functional group abstractions yields interaction predictions closely aligned with ground‑truth biochemical annotations.
Authors: Natalia Flechas Manrique, Alberto Martínez, Elena López-Martínez, Luc Andrea, Román Orus, Aitor Manteca, Aitziber L. Cortajarena, Llorenç Espinosa-Portalés
Abstract: Epitopes are short antigenic peptide sequences which are recognized by antibodies or immune cell receptors. These are central to the development of immunotherapies, vaccines, and diagnostics. However, the rational design of synthetic epitope libraries is challenging due to the large combinatorial sequence space, 20^n combinations for linear epitopes of n amino acids, making screening and testing unfeasible, even with high throughput experimental techniques. In this study, we present a large language model, epiGPTope, pre‑trained on protein data and specifically fine‑tuned on linear epitopes, which for the first time can directly generate novel epitope‑like sequences, which are found to possess statistical properties analogous to the ones of known epitopes. This generative approach can be used to prepare libraries of epitope candidate sequences. We further train statistical classifiers to predict whether an epitope sequence is of bacterial or viral origin, thus narrowing the candidate library and increasing the likelihood of identifying specific epitopes. We propose that such combination of generative and predictive models can be of assistance in epitope discovery. The approach uses only primary amino acid sequences of linear epitopes, bypassing the need for a geometric framework or hand‑crafted features of the sequences. By developing a method to create biologically feasible sequences, we anticipate faster and more cost‑effective generation and screening of synthetic epitopes, with relevant applications in the development of new biotechnologies.
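The 20^n figure quoted above grows quickly with peptide length, which a few lines of arithmetic make tangible. The helper name below is hypothetical, and the alphabet size of 20 assumes the standard amino acids only.

```python
def linear_epitope_space(n, alphabet_size=20):
    """Number of distinct linear peptides of length n over the given alphabet."""
    return alphabet_size ** n


# Print the library size for a few illustrative peptide lengths.
for length in (5, 10, 15):
    print(length, linear_epitope_space(length))
```

At length 10 the space already exceeds ten trillion sequences, which is the motivation the abstract gives for generating candidates rather than enumerating them.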
Authors: Derek Jones, Yue Yang, Felice C. Lightstone, Niema Moshiri, Jonathan E. Allen, Tajana S. Rosing
Abstract: Self‑supervised pretraining from static structures of drug‑like compounds and proteins enables powerful learned feature representations. Learned features demonstrate state of the art performance on a range of predictive tasks including molecular properties, structure generation, and protein‑ligand interactions. The majority of approaches are limited by their use of static structures, and it remains an open question how best to use atomistic molecular dynamics (MD) simulations to develop more generalized models that improve prediction accuracy for novel molecular structures. We present SURrogate mmGBSA (SurGBSA) as a new modeling approach for MD‑based representation learning, which learns a surrogate function of the Molecular Mechanics Generalized Born Surface Area (MMGBSA). We show for the first time the benefits of physics‑informed pre‑training to train a surrogate MMGBSA model on a collection of over 1.4 million 3D trajectories collected from MD simulations of the CASF‑2016 benchmark. SurGBSA demonstrates a dramatic 27,927x speedup versus a traditional physics‑based single‑point MMGBSA calculation while nearly matching single‑point MMGBSA accuracy on the challenging pose ranking problem for identification of the correct top pose (‑0.4% difference). Our work advances the development of molecular foundation models by showing model improvements when training on MD simulations. Models, code and training data are made publicly available.
Authors: Bin Feng, Jiying Zhang, Xinni Zhang, Zijing Liu, Yu Li
Abstract: Molecular dynamics (MD) simulations are essential tools in computational chemistry and drug discovery, offering crucial insights into dynamic molecular behavior. However, their utility is significantly limited by substantial computational costs, which severely restrict accessible timescales for many biologically relevant processes. Despite the encouraging performance of existing machine learning (ML) methods, they struggle to generate extended biomolecular system trajectories, primarily due to the lack of MD datasets and the large computational demands of modeling long historical trajectories. Here, we introduce BioMD, the first all‑atom generative model to simulate long‑timescale protein‑ligand dynamics using a hierarchical framework of forecasting and interpolation. We demonstrate the effectiveness and versatility of BioMD on the DD‑13M (ligand unbinding) and MISATO datasets. For both datasets, BioMD generates highly realistic conformations, showing high physical plausibility and low reconstruction errors. Besides, BioMD successfully generates ligand unbinding paths for 97.1% of the protein‑ligand systems within ten attempts, demonstrating its ability to explore critical unbinding pathways. Collectively, these results establish BioMD as a tool for simulating complex biomolecular processes, offering broad applicability for computational chemistry and drug discovery.
Authors: Jonathan Feldman, Tal Feldman
Abstract: Recent advances in generative biology have enabled the design of novel proteins, creating significant opportunities for drug discovery while also introducing new risks, including the potential development of synthetic bioweapons. Existing biosafety measures primarily rely on inference‑time filters such as sequence alignment and protein‑protein interaction (PPI) prediction to detect dangerous outputs. In this study, we evaluate the performance of three leading PPI prediction tools: AlphaFold 3, AF3Complex, and SpatialPPIv2. These models were tested on well‑characterized viral‑host interactions, such as those involving Hepatitis B and SARS‑CoV‑2. Despite being trained on many of the same viruses, the models fail to detect a substantial number of known interactions. Strikingly, none of the tools successfully identify any of the four experimentally validated SARS‑CoV‑2 mutants with confirmed binding. These findings suggest that current predictive filters are inadequate for reliably flagging even known biological threats and are even more unlikely to detect novel ones. We argue for a shift toward response‑oriented infrastructure, including rapid experimental validation, adaptable biomanufacturing, and regulatory frameworks capable of operating at the speed of AI‑driven developments.
Authors: Vira Raichenko, Alicja Bukat, Michal Bykowski, Lucja Kowalewska, Myfanwy E. Evans
Abstract: The link between bicontinuous architectures in biological membranes and triply‑periodic minimal surfaces (TPMS) is a well established example of stunning geometric form in nature. The prolamellar body (PLB) in early plant plastid development is a classic example, forming the Diamond TPMS in a lipid‑protein‑pigment membrane. However, the early development of such spectacular geometric structures is poorly understood. Inspired by the presence of tubules in the micrographs of early plastid membrane formation, we explore here geometric modelling of transformations of packings of cylinders that coalesce together to form bicontinuous structures. Using computational modelling, we find that specific cylinder packings with cubic symmetry transform into highly symmetric TPMS, which now stand as a candidate set of surfaces for further investigation into the PLB, as well as other occurrences of bicontinuous membranes.
Authors: Aditya Sengar, Jiying Zhang, Pierre Vandergheynst, Patrick Barth
Abstract: Simulating the long‑timescale dynamics of biomolecules is a central challenge in computational science. While enhanced sampling methods can accelerate these simulations, they rely on pre‑defined collective variables that are often difficult to identify, restricting their ability to model complex switching mechanisms between metastable states. A recent generative model, LD‑FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all‑atom deformations from a reference structure, establishing a powerful method for all‑atom ensemble generation. However, while this approach successfully captures a system's probable conformations, it does not model the temporal evolution between them. We introduce the Graph Latent Dynamics Propagator (GLDP), a modular component for simulating dynamics within the learned latent space of LD‑FPG. We then compare three classes of propagators: (i) score‑guided Langevin dynamics, (ii) Koopman‑based linear operators, and (iii) autoregressive neural networks. Within a unified encoder‑propagator‑decoder framework, we evaluate long‑horizon stability, backbone and side‑chain ensemble fidelity, and temporal kinetics via TICA. Benchmarks on systems ranging from small peptides to mixed‑topology proteins and large GPCRs reveal that autoregressive neural networks deliver the most robust long rollouts and coherent physical timescales; score‑guided Langevin best recovers side‑chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade‑offs among propagators and offer practical guidance for latent‑space simulators of all‑atom protein dynamics.
Authors: Srinivas Anumasa, Barath Chandran. C, Tingting Chen, Dianbo Liu
Abstract: Diffusion models have emerged as a powerful class of generative models by learning to iteratively reverse the noising process. Their ability to generate high‑quality samples has extended beyond high‑dimensional image data to other complex domains such as proteins, where data distributions are typically sparse and unevenly spread. Importantly, the sparsity itself is uneven. Empirically, we observed that while a small fraction of samples lie in dense clusters, the majority occupy regions of varying sparsity across the data space. Existing approaches largely ignore this data‑dependent variability. In this work, we introduce a Data‑Dependent Smoothing Walk‑Jump framework that employs kernel density estimation (KDE) as a preprocessing step to estimate the noise scale σ for each data point, followed by training a score model with these data‑dependent σ values. By incorporating local data geometry into the denoising process, our method accounts for the heterogeneous distribution of protein data. Empirical evaluations demonstrate that our approach yields consistent improvements across multiple metrics, highlighting the importance of data‑aware sigma prediction for generative modeling in sparse, high‑dimensional settings.
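The abstract states that KDE is used to set a noise scale σ per data point but does not give the mapping from density to σ. The sketch below therefore swaps in a well-known stand-in, Abramson's square-root rule from adaptive kernel density estimation, in which the local scale shrinks in dense regions and grows in sparse ones. The rule, the function names, and the one-dimensional toy data are all assumptions, not the authors' method.

```python
import math


def kde_density(points, bandwidth):
    """Gaussian KDE evaluated at each data point (1-D, self term included)."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in points)
        for x in points
    ]


def per_point_sigma(points, bandwidth, base_sigma):
    """Assumed mapping: sigma_i = base_sigma * sqrt(geometric_mean_density / density_i)."""
    density = kde_density(points, bandwidth)
    geo_mean = math.exp(sum(math.log(d) for d in density) / len(density))
    return [base_sigma * math.sqrt(geo_mean / d) for d in density]


# Three clustered points and one isolated point.
data = [0.0, 0.1, 0.2, 5.0]
sigmas = per_point_sigma(data, bandwidth=0.5, base_sigma=1.0)
```

Under this rule the isolated point receives a larger σ than the clustered ones; a score model would then be trained with these per-point values in place of a single global σ.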
Authors: Jingyuan Zhou, Hao Qian, Shikui Tu, Lei Xu
Abstract: Structure‑based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and a mismatch between generated molecule size and protein pocket geometry, resulting in inconsistent quality and off‑target effects. We propose PAFlow, a novel target‑aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein‑ligand interaction predictor is incorporated to guide the vector field toward higher‑affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state‑of‑the‑art in binding affinity (up to ‑8.31 Avg. Vina Score) while maintaining favorable molecular properties.
Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
Abstract: Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)‑invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per‑residue marginal anisotropy as 3 × 3 covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue‑level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
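The link between a per-residue 3 × 3 covariance matrix and RMSF is a standard identity: the root-mean-square fluctuation of a position equals the square root of the trace of its covariance. The sketch below checks this on a toy four-frame trajectory for a single atom. It illustrates the identity only; how DynaProt predicts the covariances is not described here, and the function names are hypothetical.

```python
import math


def covariance_3x3(positions):
    """Population covariance of a list of 3-D positions for one residue."""
    n = len(positions)
    mean = [sum(p[k] for p in positions) / n for k in range(3)]
    return [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in positions) / n
         for b in range(3)]
        for a in range(3)
    ]


def rmsf_from_covariance(cov):
    """RMSF = sqrt(trace(cov))."""
    return math.sqrt(cov[0][0] + cov[1][1] + cov[2][2])


def rmsf_direct(positions):
    """RMSF computed directly as the root mean squared deviation from the mean."""
    n = len(positions)
    mean = [sum(p[k] for p in positions) / n for k in range(3)]
    msd = sum(sum((p[k] - mean[k]) ** 2 for k in range(3)) for p in positions) / n
    return math.sqrt(msd)


# Toy trajectory: one atom observed in four frames.
frames = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
cov = covariance_3x3(frames)
```

The off-diagonal entries of `cov` carry the anisotropy that a scalar RMSF discards, which is why a full 3 × 3 matrix is the richer descriptor.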
Authors: Vishnu Srinivasan, Wei Wang, Brian A. Camley
Abstract: Eukaryotic cells generally sense chemical gradients using the binding of chemical ligands to membrane receptors. In order to perform chemotaxis effectively in different environments, cells need to adapt to different concentrations. We present a model of gradient sensing where the affinity of receptor‑ligand binding is increased when a protein binds to the receptor's cytosolic side. This interior protein (allosteric factor) alters the sensitivity of the cell, allowing the cell to adapt to different ligand concentrations. We propose a reaction scheme where the cell alters the allosteric factor's availability to adapt the average fraction of bound receptors to 1/2. We calculate bounds on the chemotactic accuracy of the cell, and find that the cell can reach near‑optimal chemotaxis over a broad range of concentrations. We find that the accuracy of chemotaxis depends strongly on the diffusion of the allosteric compound relative to other reaction rates. From this, we also find a trade‑off between adaptation time and gradient sensing accuracy.
Authors: Stephan Thaler, Zhiyi Wu, William G. Glass, Richard T. Bradshaw, Prudencio Tossou, Geoffrey P. F. Wood
Abstract: Free energy perturbation (FEP) is considered the gold‑standard simulation method for estimating small molecule binding affinity, a quantity of vital importance to drug discovery. The accuracy of FEP critically depends on an accurate model of the protein‑ligand complex as an initial condition for the underlying molecular dynamics simulation. This requirement has limited the impact of FEP in earlier stages of the discovery process, where appropriate experimental crystal structures are rarely available. The latest generation of structure prediction models, such as Boltz‑2, promises to overcome this limitation by predicting protein‑ligand complex structures. In this work, we combine Boltz‑2 with our own absolute FEP protocol to build Boltz‑ABFE, a robust pipeline for estimating the absolute binding free energies (ABFE) in the absence of experimental crystal structures. We investigate the quality of the structures predicted by Boltz‑2, propose automated approaches to improve structures for use in molecular dynamics simulations, and demonstrate the effectiveness of the Boltz‑ABFE pipeline for four protein targets from the FEP+ benchmark set. Demonstrating the feasibility of absolute FEP simulations without experimental crystal structures, Boltz‑ABFE significantly expands the domain of applicability of FEP, paving the way towards accelerated early‑stage drug discovery via accurate, structure‑based affinity estimation.
Authors: Andreas Erbs Hillers-Bendtsen, Todd J. Martínez
Abstract: With the widespread use of self‑consistent field methods, including Hartree‑Fock and Density Functional Theory, the implications of accelerating these methods are immense. To this end, we develop a tensor hypercontraction construction with O(N^3) formal scaling that can accelerate self‑consistent field calculations. Using tensor hypercontraction, we implement an empirically O(N^2) scaling Fock matrix construction that is 2‑4× faster than existing integral‑direct methods, as it avoids the repeated recalculation of two‑electron repulsion integrals. In combination with a density‑difference ansatz, our tensor hypercontraction self‑consistent field implementation tests show errors below 7.0 x 10^‑3 E_h for relative energies on protein systems containing up to 3000 basis functions.
Authors: Shukun Weng, Ali Douaki, Makusu Tsutsui, German Lanzavecchia, Anastasiia Sapunova, Lorenzo Iannetti, Alberto Giacomello, Roman Krahne, Denis Garoli
Abstract: Ionic transport in nanofluidic channels holds great promise for applications such as single‑molecule analysis, molecular manipulation, and energy harvesting. However, achieving precise control over ion transport remains a major challenge. In this work, we introduce a MoS2 SiN hybrid nanochannel architecture that enables electrical tuning of ionic transport via external gating, and we examine its potential for osmotic power generation and single‑molecule detection. To fabricate the channels, we employed a combined focused ion beam (FIB) milling and dry transfer method, producing sub‑10 nm thick structures while preserving the structural integrity and electronic properties of MoS2, essential for reliable surface charge modulation. We first investigated how the gate voltage influences ionic conductance, finding evidence of gate‑dependent modulation of ion selectivity under different bias polarities. Next, by applying a salt concentration gradient across the nanochannels, we demonstrated the feasibility of this platform for osmotic energy harvesting. Finally, we tested the system for single‑molecule sensing, showing that linearized bovine serum albumin (BSA) produced translocation signals with notably long dwell times. Together, these results highlight gated MoS2 SiN nanochannels as a promising platform for tunable nanofluidics, with potential applications in controlled molecular transport and energy harvesting from osmotic gradients.
Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour
Abstract: Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy‑based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to iteratively map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as the empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task‑associated flow matching and diffusion models, using a similar computational budget.
Authors: Zhijin Wang, Senzhen Wu, Yue Hu, Xiufeng Liu
Abstract: Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi‑source, or sparse data. We introduce pyFAST, a research‑oriented PyTorch framework that explicitly decouples data processing from model computation, fostering a cleaner separation of concerns and facilitating rapid experimentation. Its data engine is engineered for complex scenarios, supporting multi‑source loading, protein sequence handling, efficient sequence‑ and patch‑level padding, dynamic normalization, and mask‑based modeling for both imputation and forecasting. pyFAST integrates LLM‑inspired architectures for the alignment‑free fusion of sparse data sources and offers native sparse metrics, specialized loss functions, and flexible exogenous data fusion. Training utilities include batch‑based streaming aggregation for evaluation and device synergy to maximize computational efficiency. A comprehensive suite of classical and deep learning models (Linears, CNNs, RNNs, Transformers, and GNNs) is provided within a modular architecture that encourages extension. Released on GitHub under the MIT license, pyFAST provides a compact yet powerful platform for advancing time series research and applications.
Authors: Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma
Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine‑tuning feature extractors ‑ an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures.
Our analysis reveals that models pretrained on large‑scale data exhibit domain‑invariant geometric patterns in their feature space, characterized by intra‑class clustering and inter‑class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation.
Unlike fine‑tuning entire pre‑trained models ‑ which risks introducing unpredictable feature distortions ‑ we propose the Feature‑space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full‑dataset optimization in a single computation cycle.
Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state‑of‑the‑art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.
Authors: Rohan S. Adhikari, Winnie H. Shi, Amanda B. Marciel, Walter G. Chapman
Abstract: Intrinsically disordered proteins (IDPs) play a significant role in intracellular phenomena and are known to exist in an ensemble of inter‑converting conformations in solution. Accurately modeling the conformations of IDPs in solution poses a challenge to traditional force fields that are tuned to predict the properties of folded proteins. There is a need for generalized atomistic force fields that can accurately predict the properties of both folded proteins and IDPs. Improvements to protein force fields for increased accuracy in secondary structure prediction and new water models with increased water‑water dispersion interactions have been proposed in search of a generalized simulation method. Validating the proposed improvements against experiments poses challenges such as a lack of suitable systems to test the generalizability and choosing a property of interest to match the simulation results against experiments. In this work, we use small angle X‑ray scattering (SAXS) data from peptide‑based polyampholytes that mimic IDPs to test the generalizability of the AMBER protein force fields and the OPC water model. The specific improvements due to the AMBER ff19SB protein force field and the OPC water model are isolated and studied. Analysis of SAXS profiles and the conformational distribution of polyampholyte sequences shows the AMBER ff19SB‑OPC water combination to be a generalized model that predicts both ordered polyampholyte sequences and disordered polyampholyte sequences in good agreement with experiments. We have developed a new scattering model termed SWAXS‑AMDE that accounts for the hydration layer density changes in atomic detail and is particularly useful in making one‑to‑one comparisons of simulated scattering profiles to experiments. SWAXS‑AMDE allows for the thermal fluctuations of the solute, which is particularly consequential for IDPs.
Authors: Darin Tsui, Kunal Talreja, Amirali Aghazadeh
Abstract: Predicting protein function from amino acid sequence remains a central challenge in data‑scarce (low‑N) regimes, limiting machine learning‑guided protein design when only small amounts of assay‑labeled sequence‑function data are available. Protein language models (pLMs) have advanced the field by providing evolutionary‑informed embeddings and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low‑N function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine‑tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction, indicating that their sparse latent space encodes compact and biologically meaningful representations that generalize more effectively from limited data. Moreover, steering predictive latents exploits biological motifs in pLM representations, yielding top‑fitness variants in 83% of cases compared to designing with ESM2 alone.
Authors: Alireza Abbaszadeh, Armita Shahlaee
Abstract: AlphaFold 3 represents a transformative advancement in computational biology, enhancing protein structure prediction through novel multi‑scale transformer architectures, biologically informed cross‑attention mechanisms, and geometry‑aware optimization strategies. These innovations dramatically improve predictive accuracy and generalization across diverse protein families, surpassing previous methods. Crucially, AlphaFold 3 embodies a paradigm shift toward differentiable simulation, bridging traditional static structural modeling with dynamic molecular simulations. By reframing protein folding predictions as a differentiable process, AlphaFold 3 serves as a foundational framework for integrating deep learning with physics‑based molecular
Authors: Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Yingtao Zhu, Ye Zhang, Trang Nguyen, Yih-Chung Tham, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, Dianbo Liu
Abstract: The rapid development of generative AI has transformed content creation, communication, and human development. However, this technology raises profound concerns in high‑stakes domains, demanding rigorous methods to analyze and evaluate AI‑generated content. While existing analytic methods often treat images as indivisible wholes, real‑world AI failures generally manifest as specific visual patterns that can evade holistic detection and suit more granular and decomposed analysis. Here we introduce a content analysis tool, Language‑Grounded Sparse Encoders (LanSE), which decompose images into interpretable visual patterns with natural language descriptions. Utilizing interpretability modules and large multimodal models, LanSE can automatically identify visual patterns within data modalities. Our method discovers more than 5,000 visual patterns with 93% human agreement, provides decomposed evaluation outperforming existing methods, establishes the first systematic evaluation of physical plausibility, and extends to medical imaging settings. Our method's capability to extract language‑grounded patterns can be naturally adapted to numerous fields, including biology and geography, as well as other data modalities such as protein structures and time series, thereby advancing content analysis for generative AI.
Authors: Mathieu Garrigues, Victor Onofre, Wesley Coelho, S. Acheche
Abstract: Molecular docking is a critical computational method in drug discovery used to predict the binding conformation and orientation of a ligand within a protein's binding site. Mapping this challenge onto a graph‑based problem, specifically the Maximum Weighted Independent Set (MWIS) problem, allows it to be addressed by specialized hardware such as neutral‑atom quantum processors. However, a significant bottleneck has been the size mismatch between biologically relevant molecular systems and the limited capacity of near‑term quantum devices. In this work, we overcome this scaling limitation by the use of a divide‑and‑conquer heuristic introduced in Cazals 2025. This algorithm decomposes a single, intractable graph instance into smaller sub‑problems that can be solved sequentially on a neutral‑atom quantum emulator, incurring only a linear computational overhead. We benchmark this approach on 10 real‑world protein‑ligand complexes, including 9 from the Astex Diverse Set, with graphs ranging from 225 to 585 vertices. The quantum heuristic consistently outperforms a greedy baseline and achieves the provably optimal solution on a 540‑node instance (TACE‑AS). We further assess the biological relevance of the reconstructed poses via the fraction of native contacts, and benchmark the full workflow on a standard dataset of diverse protein‑ligand complexes. Our work establishes a scalable blueprint for applying quantum optimization to molecular docking, while identifying concrete directions for improving both the algorithmic strategy and the underlying graph model.
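To make the graph formulation concrete, a generic greedy heuristic for the Maximum Weighted Independent Set problem can be sketched as follows. The abstract does not specify which greedy baseline was used, so this heaviest‑vertex‑first rule, the function name, and the toy graph are all assumptions for illustration.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for Maximum Weighted Independent Set.

    weights: dict mapping vertex -> non-negative weight
    edges:   iterable of (u, v) pairs
    Repeatedly takes the heaviest remaining vertex and discards its
    neighbours. The result is independent but not necessarily optimal.
    """
    neighbours = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    chosen = []
    excluded = set()
    # Heaviest first; ties broken by vertex name for determinism.
    for v in sorted(weights, key=lambda x: (-weights[x], x)):
        if v in excluded:
            continue
        chosen.append(v)
        excluded.add(v)
        excluded |= neighbours[v]
    return chosen

# Hypothetical path graph a - b - c with a heavy middle vertex.
toy_weights = {"a": 2.0, "b": 3.0, "c": 2.0}
toy_edges = [("a", "b"), ("b", "c")]
toy_solution = greedy_mwis(toy_weights, toy_edges)
toy_total = sum(toy_weights[v] for v in toy_solution)
```

On this toy graph the greedy rule selects only the middle vertex (total weight 3.0), whereas the two end vertices together weigh 4.0, which shows why a greedy baseline can fall short of the optimum.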
Authors: Arne Schneuing, Ilia Igashov, Adrian W. Dobbelstein, Thomas Castiglione, Michael Bronstein, Bruno Correia
Abstract: We introduce DrugFlow, a generative model for structure‑based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state‑of‑the‑art performance in learning chemical, geometric, and physical aspects of three‑dimensional protein‑ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out‑of‑distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.
Authors: Rakesh Thakur, Riya Gupta
Abstract: Comprehending the long‑timescale dynamics of protein‑ligand complexes is very important for drug discovery and structural biology, but it continues to be computationally challenging for large biomolecular systems. We introduce HemePLM‑Diffuse, an innovative generative transformer model designed to accurately simulate protein‑ligand trajectories, inpaint missing ligand fragments, and sample transition paths in systems with more than 10,000 atoms. HemePLM‑Diffuse features an SE(3)‑invariant tokenization approach for proteins and ligands that uses time‑aware cross‑attentional diffusion to effectively capture atomic motion. We also demonstrate its capabilities using the 3CQV HEME system, showing enhanced accuracy and scalability compared to leading models such as TorchMD‑Net, MDGEN, and Uni‑Mol.
Authors: Tamizhmalar Sundararajan, Matteo Boccalini, Roméo Suss, Sandrine Mariot, Emerson R. Da Silva, Fernando C. Giacomelli, Austin Hubley, Theyencheri Narayanan, Alessandro Barducci, Guillaume Tresset
Abstract: Living cells exhibit a complex organization comprising numerous compartments, among which are RNA‑ and protein‑rich membraneless, liquid‑like organelles known as biomolecular condensates. Energy‑consuming processes regulate their formation and dissolution, with (de‑)phosphorylation by specific enzymes being among the most commonly involved reactions. By employing a model system consisting of a phosphorylatable peptide and homopolymeric RNA, we elucidate how enzymatic activity modulates the growth kinetics and alters the local structure of biomolecular condensates. Under passive conditions, time‑resolved ultra‑small‑angle X‑ray scattering with a synchrotron source reveals a nucleation‑driven coalescence mechanism maintained over four decades in time, similar to the coarsening of simple binary fluid mixtures. Coarse‑grained molecular dynamics simulations show that peptide‑decorated RNA chains assembled shortly after mixing constitute the relevant subunits. In contrast, actively‑formed condensates initially display a local mass fractal structure, which gradually matures upon enzymatic activity before condensates undergo coalescence. Both types of condensate eventually reach a steady state, but fluorescence recovery after photobleaching indicates a peptide diffusivity twice as high in actively‑formed condensates, consistent with their loosely‑packed local structure. We expect multiscale, integrative approaches implemented with model systems to link effectively the functional properties of membraneless organelles to their formation and dissolution kinetics as regulated by cellular active processes.
Authors: Jan Kocka, Kabir Husain, Jaime Agudo-Canalejo
Abstract: Biology stores information and computes at the molecular scale, yet the ways in which it does so are often distinct from human‑engineered computers. Mapping biological computation onto architectures familiar to computer science remains an outstanding challenge. Here, inspired by Crick's proposal for molecular memory, we analyse a thermodynamically‑consistent model of a protein complex subject to driven, nonequilibrium enzymatic reactions. In the strongly driven limit, we find that the system maps onto a stochastic, asynchronous variant of cellular automata, where each rule corresponds to a different set of enzymes being present. We find a broad class of phenomena in these 'molecular automata' that can be exploited for molecular computation, including error‑tolerant memory via multistable attractors, and long transients that can be used as molecular stopwatches. By systematically enumerating all possible dynamical rules, we identify those that allow molecular automata to implement simple computational architectures such as finite‑state machines. Overall, our results provide a framework for engineering synthetic molecular automata, and offer a route to building protein‑based computation in living cells.
Authors: Indranil Mal, Milan Kočí, Paolo Nicolini, Prokop Hapala
Abstract: We present GridFF, an efficient method for simulating molecules on rigid substrates, derived from techniques used in protein‑ligand docking in biochemistry. By projecting molecule‑substrate interactions onto precomputed spatial grids with tricubic B‑spline interpolation, GridFF reduces the computational cost by orders of magnitude compared to traditional pairwise atomistic models, without compromising the accuracy of forces or trajectories. The CPU implementation of GridFF in the open‑source FireCore package provides a 100‑1000x speedup over all‑atom simulations using LAMMPS, while the GPU implementation ‑ running thousands of system replicas in parallel ‑ samples millions of configurations per second, enabling an exhaustive exploration of the configuration space of small flexible molecules on surfaces within minutes. Furthermore, as demonstrated in our previous application of a similar technique to high‑resolution scanning probe microscopy, GridFF can be extended beyond empirical pairwise potentials to those derived from ab initio electron densities. Altogether, this unlocks accurate high‑throughput modeling of molecular self‑assembly, adsorption, and scanning probe manipulation in surface science.
Authors: Jianhui Wang, Wenyu Zhu, Bowen Gao, Xin Hong, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Abstract: Protein‑ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval‑based methods embed ligands and protein pockets into Euclidean space for similarity‑based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine‑grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz‑model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity‑sensitive embeddings that can effectively model both global activity and subtle functional differences, particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our model unifies virtual screening and affinity ranking in a single framework, introducing a protein‑guided three‑tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD‑E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein‑ligand modeling.
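For readers unfamiliar with the Lorentz model, the geodesic distance between two points on the unit hyperboloid is d(x, y) = arccosh(‑⟨x, y⟩_L), where ⟨x, y⟩_L = ‑x0·y0 + Σ xi·yi. The sketch below assumes curvature ‑1 and uses hypothetical helper names; it shows the textbook formula only and does not reproduce HypSeek's embedding or scoring functions.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the unit hyperboloid (curvature -1)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: negative time part plus spatial part."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # The clamp guards against arguments slightly below 1 from rounding.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Hypothetical points: the hyperboloid origin and a point one unit away.
origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
separation = lorentz_distance(origin, point)
```

Because the spatial coordinate grows like sinh of the geodesic distance, equal Euclidean steps far from the origin correspond to progressively smaller hyperbolic steps, which is the property usually credited with making hyperbolic space suited to hierarchical data.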
Authors: Ivan Spirandelli, Arnur Nigmetov, Dmitriy Morozov, Myfanwy E. Evans
Abstract: The simulated self‑assembly of molecular building blocks into functional complexes is a key area of study in computational biology and materials science. Self‑assembly simulations of proteins using physically‑motivated potentials for non‑polar interactions can identify the biologically correct assembly as the energy‑minimizing state. Short‑range potentials, however, produce rugged energy landscapes, which lead to simulations becoming trapped in non‑functional local minimizers.
Successful self‑assembly simulations depend on the physical realism of the driving potentials as well as their ability to efficiently explore the configuration space.
We introduce a long‑range topological potential, quantified via weighted total persistence, and combine it with the morphometric approach to solvation‑free energy.
This combination improves the assembly success rate in simulations of the tobacco mosaic virus dimer and other protein complexes by up to sixteen‑fold compared with the morphometric model alone. It further enables successful simulation in systems that don't otherwise assemble during the examined timescales.
Compared to previous topology‑based work, which has been primarily descriptive, our approach uses topological measures as an active energetic bias that is independent of electrostatics or chemical specificity and depends only on atomic coordinates. Therefore, the method can, in principle, be applied to arbitrary systems where such coordinates are optimized. Integrating topological descriptions into an energy function offers a general strategy for overcoming kinetic barriers in molecular simulations, with potential applications in drug design, materials development, and the study of complex self‑assembly processes.
Authors: Mehdi Yazdani-Jahromi, Ali Khodabandeh Yalabadi, Ozlem Ozmen Garibay
Abstract: The growing importance of mRNA therapeutics and synthetic biology highlights the need for models that capture the latent structure of synonymous codon (different triplets encoding the same amino acid) usage, which subtly modulates translation efficiency and gene expression. While recent efforts incorporate codon‑level inductive biases through auxiliary objectives, they often fall short of explicitly modeling the structured relationships that arise from the genetic code's inherent symmetries. We introduce Equi‑mRNA, the first codon‑level equivariant mRNA language model that explicitly encodes synonymous codon symmetries as cyclic subgroups of the 2D special orthogonal group SO(2). By combining group‑theoretic priors with an auxiliary equivariance loss and symmetry‑aware pooling, Equi‑mRNA learns biologically grounded representations that outperform vanilla baselines across multiple axes. On downstream property‑prediction tasks including expression, stability, and riboswitch switching, Equi‑mRNA delivers up to approximately 10% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to approximately 4x more realistic under Frechet BioDistance metrics and approximately 28% better at preserving functional properties compared to a vanilla baseline. Interpretability analyses further reveal that learned codon‑rotation distributions recapitulate known GC‑content biases and tRNA abundance patterns, offering novel insights into codon usage. Equi‑mRNA establishes a new biologically principled paradigm for mRNA modeling, with significant implications for the design of next‑generation therapeutics.
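A cyclic subgroup of SO(2) of order n consists of the planar rotations by multiples of 2π/n. The sketch below builds such a subgroup and checks that the generator returns to the identity after n applications. Assigning one rotation to each of an amino acid's synonymous codons (here n = 4) is an assumption made for illustration; the abstract does not state how Equi‑mRNA maps codons to group elements, and all names are hypothetical.

```python
import math

def rotation(theta):
    """2x2 rotation matrix for angle theta (an element of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def cyclic_subgroup(n):
    """The n rotations by multiples of 2*pi/n, forming C_n inside SO(2)."""
    return [rotation(2.0 * math.pi * k / n) for k in range(n)]

# Hypothetical example: an amino acid with 4 synonymous codons -> C_4.
group = cyclic_subgroup(4)
identity, generator = group[0], group[1]

# Applying the generator n times should return to the identity.
power = identity
for _ in range(4):
    power = matmul2(power, generator)
```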
Authors: Shih-Huan Huang, Matthew W. Cotton, Tuomas P. J. Knowles, David Klenerman, Georg Meisl
Abstract: A central challenge in modeling neurodegenerative diseases is connecting cellular‑level mechanisms to tissue‑level pathology, in particular to determine whether pathology is driven primarily by cell‑autonomous triggers or by propagation from cells that are already in a pathological, runaway aggregation state. To bridge this gap, we here develop a bottom‑up physical model that explicitly incorporates these two fundamental cell‑level drivers of protein aggregation dynamics. We show that our model naturally explains the characteristic long, slow development of pathology followed by a rapid acceleration, a hallmark of many neurodegenerative diseases. Furthermore, the model reveals the existence of a critical switch point at which the system's dynamics transition from being dominated by slow, spontaneous formation of diseased cells to being driven by fast propagation. This framework provides a robust physical foundation for interpreting pathological data and offers a method to predict which class of therapeutic strategies is best matched to the underlying drivers of a specific disease.
Authors: Murat Isik, Mandeep Kaur Saggi, Humaira Gowher, Sabre Kais
Abstract: Accurately predicting enzyme functionality remains one of the major challenges in computational biology, particularly for enzymes with limited structural annotations or sequence homology. We present a novel multimodal Quantum Machine Learning (QML) framework that enhances Enzyme Commission (EC) classification by integrating four complementary biochemical modalities: protein sequence embeddings, quantum‑derived electronic descriptors, molecular graph structures, and 2D molecular image representations. The framework uses a Quantum Vision Transformer (QVT) backbone equipped with modality‑specific encoders and a unified cross‑attention fusion module. By integrating graph features and spatial patterns, our method captures key stereoelectronic interactions behind enzyme function. Experimental results demonstrate that our multimodal QVT model achieves a top‑1 accuracy of 85.1%, outperforming sequence‑only baselines by a substantial margin and performing better than other QML models.
Authors: Junwei Su, Chuan Wu
Abstract: Score‑based graph generative models (SGGMs) have proven effective in critical applications such as drug discovery and protein synthesis. However, their theoretical behavior, particularly regarding convergence, remains underexplored. Unlike common score‑based generative models (SGMs), which are governed by a single stochastic differential equation (SDE), SGGMs involve a system of coupled SDEs. In SGGMs, the graph structure and node features are governed by separate but interdependent SDEs. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first non‑asymptotic convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using synthetic graph models, and the results align with our theoretical predictions. This work deepens the theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and provides practical guidance for designing effective models.
Authors: Ian T. Abrahams
Abstract: Strong excitonic coupling and photon antibunching (AB) have been observed together in Venus yellow fluorescent protein dimers and currently lack a cohesive theoretical explanation. In 2019, Kim et al. demonstrated Davydov splitting in circular dichroism spectra, revealing strong J‑like coupling, while antibunched fluorescence emission was confirmed by combined antibunching‑fluorescence correlation spectroscopy (AB/FCS fingerprinting). To investigate the implications of this coexistence, Venus yellow fluorescent protein (YFP) dimer population dynamics are modeled within a Lindblad master equation framework, testing its ability to cope with typical, data‑informed, Venus YFP dimer time and energy values. Simulations predict multiple‑femtosecond (fs) decoherence, yielding bright/dark state mixtures consistent with antibunched fluorescence emission at room temperature. Thus, excitonic coupling and photon AB in Venus YFP dimers are reconciled without invoking long‑lived quantum coherence. However, clear violations of several Lindblad approximation validity conditions appear imminent, calling for careful modifications to choices of standard system and bath definitions and parameter values.
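For orientation, the general Lindblad form and a textbook two-site exciton Hamiltonian can be written as below. This is a sketch only: the site energy \varepsilon, coupling J, rates \gamma_k, and jump operators L_k are illustrative symbols, and the abstract does not state the specific system and bath operators used in the paper.

```latex
% General Lindblad master equation for the density matrix \rho
\frac{d\rho}{dt} = -\frac{i}{\hbar}\,[H,\rho]
  + \sum_k \gamma_k \left( L_k \rho L_k^{\dagger}
  - \tfrac{1}{2}\left\{ L_k^{\dagger} L_k,\ \rho \right\} \right)

% Assumed symmetric dimer: two sites with equal energy \varepsilon and coupling J
H = \varepsilon \left( |1\rangle\langle 1| + |2\rangle\langle 2| \right)
  + J \left( |1\rangle\langle 2| + |2\rangle\langle 1| \right)

% Delocalized eigenstates and their energies (splitting 2|J|)
|\pm\rangle = \tfrac{1}{\sqrt{2}} \left( |1\rangle \pm |2\rangle \right),
\qquad E_{\pm} = \varepsilon \pm J
```

Which of the two delocalized states is bright depends on the sign of J and the relative orientation of the transition dipoles, so the bright/dark labels are left unassigned here.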
Authors: Gabrielle Wehr, Reuben Rideaux, Amaya J. Fox, David R. Lightfoot, Jason Tangen, Jason B. Mattingley, Shane E. Ehrhardt
Abstract: Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers' capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general‑purpose AI systems for science. Here we show that a domain‑agnostic, agentic AI Scientist system can independently navigate the scientific workflow ‑ from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, executed one new online data collection with 288 participants, developed analysis pipelines through 8‑hour+ continuous coding sessions, and produced completed manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non‑trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real‑world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.
Authors: Rahi Navelkar, Andrea Cosolo, Bogdan Bintu, Yubao Cheng, Vincent Gardeux, Silvia Gutnik, Taihei Fujimori, Antonina Hafner, Atishay Jay, Bojing Blair Jia, Adam Paul Jussila, Gerard Llimos, Antonios Lioutas, Nuno MC Martins, William J Moore, Yodai Takei, Frances Wong, Kaifu Yang, Huaiying Zhang, Quan Zhu, Magda Bienko, Lacramioara Bintu, Long Cai, Bart Deplancke, Marcelo Nollmann, Susan E Mango, Bing Ren, Peter J Park, Ahilya N Sawh, Andrew Schroeder, Jason R Swedlow, Golnaz Vahedi, Chao-Ting Wu, Sarah Aufmkolk, Alistair N Boettiger, Irene Farabella, Caterina Strambio-De-Castillia, Siyuan Wang
Abstract: In recent years, multiplexed Fluorescence In Situ Hybridization (FISH) or FISH‑omics methods have rapidly expanded, enabling the quantification of chromatin organization in single cells, often in conjunction with measurements of RNA and protein. These approaches have deepened our understanding of how 3D chromosome architecture relates to transcriptional activity and cell states in health and disease. Despite these advances, results from Chromatin Tracing FISH‑omics experiments remain challenging to share, reuse, and analyze due to the absence of standardized data exchange specifications. Building on the release of microscopy metadata standards, we introduce the FISH Omics Format‑Chromatin Tracing (FOF‑CT), a community‑developed standard for processed results from diverse imaging modalities. We describe the FOF‑CT file format and present a curated collection of datasets deposited in the 4DN Data Portal and the OME Image Data Resource (IDR). We also highlight their potential for reuse, integration, and modeling by outlining example analysis pipelines and illustrating biological insights enabled by standardized, FAIR‑compliant Chromatin Tracing datasets. While this manuscript focuses on the representation of ball‑and‑stick Chromatin Tracing, the format is designed to be extensible to volumetric Chromatin Tracing.
Authors: Himanshu Shekhar, Ashutosh Dheer, Santosh Kumar, N. Sukumar
Abstract: We investigate spectral fluctuations in multilayer networks within the random matrix theory (RMT) framework to characterize universal and non‑universal features. The adjacency matrix of a multilayer network exhibits a block structure, with diagonal blocks representing intra‑layer connections and off‑diagonal blocks encoding inter‑layer connections. Applying appropriate scaling factors for these blocks, we equalize variances across inter‑ and intra‑layers, enabling direct comparison of spectral statistics. We analyze eigenvalue spectra across multilayer network configurations with varying inter‑ and intra‑layer connectivities. Introducing a crossover model for bilayer networks, we capture the smooth transition of spectral properties from block‑diagonal (two independent GOEs) to single‑layer (one GOE) statistics as the relative strength of inter‑layer to intra‑layer connection varies. Furthermore, we analyze interatomic distance networks derived from protein crystal structures, including 1EWT, 1EWK, and 1UW6, to demonstrate applicability. Our findings reveal that the universality of spectral fluctuations persists across multilayer network architectures and highlight RMT as a robust tool for probing topological and dynamical complexities of real‑world networks.
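The block-diagonal-to-single-GOE transition described above can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the bilayer construction, the `coupling` parameter, and the use of the mean consecutive-spacing ratio as the statistic are not taken from the paper, whose scaling factors and crossover model are not specified in the abstract. The spacing ratio is chosen because it needs no spectral unfolding; for a single GOE its mean is approximately 0.53.

```python
import numpy as np

def goe(n, rng):
    """Sample an n x n GOE matrix (off-diagonal variance 1, diagonal variance 2)."""
    a = rng.normal(size=(n, n))
    return (a + a.T) / np.sqrt(2.0)

def bilayer(n, coupling, rng):
    """Two GOE layers joined by an inter-layer block scaled by `coupling` (hypothetical model)."""
    c = coupling * rng.normal(size=(n, n))
    return np.block([[goe(n, rng), c], [c.T, goe(n, rng)]])

def mean_gap_ratio(h):
    """Mean ratio of consecutive level spacings over the central half of the spectrum."""
    ev = np.linalg.eigvalsh(h)
    k = len(ev) // 4
    s = np.diff(ev[k:-k])
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()

rng = np.random.default_rng(0)
# coupling = 0: two independent GOE spectra superposed; coupling = 1: one GOE of size 2n
r_uncoupled = np.mean([mean_gap_ratio(bilayer(200, 0.0, rng)) for _ in range(5)])
r_coupled = np.mean([mean_gap_ratio(bilayer(200, 1.0, rng)) for _ in range(5)])
print(f"uncoupled: {r_uncoupled:.3f}  coupled: {r_coupled:.3f}")
```

With zero coupling the levels of the two layers do not repel each other, so the ratio should fall below the single-GOE value; intermediate values of `coupling` trace the crossover.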
Authors: Zohra Yagoub, Hafida Bouziane
Abstract: The prediction of amyloidogenicity in peptides and proteins remains a focal point of ongoing bioinformatics research. The crucial step in this field is to apply advanced computational methodologies. Many recent approaches to predicting amyloidogenicity within proteins rely heavily on evolutionary motifs and the individual properties of amino acids. It is becoming increasingly evident that sequence information‑based features show high predictive performance. Consequently, our study evaluated the contextual features of protein sequences obtained from a pretrained protein large language model, leveraging bidirectional LSTM and GRU to predict amyloidogenic regions in peptide and protein sequences. Our method achieved an accuracy of 84.5% on 10‑fold cross‑validation and an accuracy of 83% on the test dataset. Our results demonstrate competitive performance, highlighting the potential of LLMs in enhancing the accuracy of amyloid prediction.
Authors: Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu
Abstract: Recent advances in protein large language models, such as ProtTeX, represent both side‑chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue‑level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single‑protein inputs, rendering it incompatible with in‑context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX‑CC, a lightweight two‑stage compression framework designed to enhance ProtTeX under few‑shot settings. We first design a joint embedding compression mechanism that fuses sequence and structure representations at the residue level, effectively reducing the protein input length by half without sacrificing performance. Then we propose a self‑compression module that aggregates each full demonstration into the latent space of the last few linguistic tokens, reducing the average demonstration length from 751 tokens to less than 16 tokens. Compared to the original ProtTeX, our self‑compression approach achieves a compression ratio of approximately 93.68% in the total prompt length under the 16‑shot setting. Without modifying the backbone model, ProtTeX‑CC introduces only a small number of additional parameters through PEFT‑based tuning in the joint embedding compression stage and a single trainable projection layer in the self‑compression stage. Extensive experiments on protein function prediction show that ProtTeX‑CC improves performance on the in‑domain benchmark by 2%, and generalizes well to the out‑of‑domain dataset with a performance gain of 11%.
Authors: Akashnathan Aranganathan, Eric R. Beyerle
Abstract: The use of generative machine learning models, trained on the experimentally resolved structures deposited in the protein data bank, is an attractive approach to sampling conformational ensembles of proteins. However, the ensembles generated by these models lack timescale or causal information. We use the structural ensembles generated from AlphaFold2 at a range of MSA depths to parameterize the potential of mean force of an overdamped, memory‑free, coarse‑grained Langevin equation. This approach couples the AlphaFold2 ensembles to a causal model, allowing us to estimate the timescales spanned by the ensembles generated at each MSA depth. Performing this analysis on six variants of HIV‑1 protease, we confirm an inverse relationship between MSA depth and the timescale of an ensemble's conformational fluctuations. The MSA depth essentially serves as a conformational restraint, and AlphaFold2 is generally able to probe timescales at or below those seen in microsecond‑long, unbiased molecular dynamics simulations. We conclude by generalizing this approach to other generative structural ensemble‑prediction methods as well as co‑folding models, in this case the biologically functional HIV‑1 protease dimer.
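A minimal one-dimensional sketch of the idea of attaching a timescale to a static ensemble follows. It assumes a harmonic potential of mean force fitted from the ensemble variance, a synthetic Gaussian ensemble in place of AlphaFold2 output, and an arbitrary diffusion coefficient `D`; none of these choices come from the paper.

```python
import numpy as np

def fit_harmonic_pmf(samples, kT=1.0):
    """Fit F(x) = 0.5 * k * (x - center)^2 to an ensemble using k = kT / variance."""
    return kT / np.var(samples), np.mean(samples)

def simulate(k, center, D, kT, dt, n_steps, rng):
    """Euler-Maruyama integration of dx = -(D / kT) * F'(x) dt + sqrt(2 D dt) * noise."""
    x = np.empty(n_steps)
    x[0] = center
    noise = rng.normal(size=n_steps - 1) * np.sqrt(2.0 * D * dt)
    for i in range(n_steps - 1):
        drift = -(D / kT) * k * (x[i] - center)
        x[i + 1] = x[i] + drift * dt + noise[i]
    return x

rng = np.random.default_rng(0)
# Synthetic stand-in for one coarse-grained coordinate of a generated ensemble
ensemble = rng.normal(loc=0.0, scale=2.0, size=5000)
kT, D = 1.0, 1.0                      # assumed units and diffusion coefficient
k, center = fit_harmonic_pmf(ensemble, kT=kT)
tau = kT / (D * k)                    # relaxation time of the harmonic mode
traj = simulate(k, center, D=D, kT=kT, dt=0.01, n_steps=200_000, rng=rng)
print(f"tau = {tau:.2f}, ensemble var = {ensemble.var():.2f}, trajectory var = {traj.var():.2f}")
```

In this toy model a narrower ensemble gives a stiffer potential and therefore a shorter relaxation time, which is the kind of link between ensemble width and timescale the abstract describes.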
Authors: Viktor Zaverkin, Matheus Ferraz, Francesco Alesiani, Mathias Niepert
Abstract: Universal machine‑learned potentials promise transferable accuracy across compositional and vibrational degrees of freedom, yet their application to biomolecular simulations remains underexplored. This work systematically evaluates equivariant message‑passing architectures trained on the SPICE‑v2 dataset with and without explicit long‑range dispersion and electrostatics. We assess the impact of model size, training data composition, and electrostatic treatment across in‑ and out‑of‑distribution benchmark datasets, as well as molecular simulations of bulk liquid water, aqueous NaCl solutions, and biomolecules, including alanine tripeptide, the mini‑protein Trp‑cage, and Crambin. While larger models improve accuracy on benchmark datasets, this trend does not consistently extend to properties obtained from simulations. Predicted properties also depend on the composition of the training dataset. Long‑range electrostatics show no systematic impact across systems. However, for Trp‑cage, their inclusion yields increased conformational variability. Our results suggest that imbalanced datasets and immature evaluation practices currently challenge the applicability of universal machine‑learned potentials to biomolecular simulations.
Authors: Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji
Abstract: Three‑dimensional generative models increasingly drive structure‑based drug discovery, yet they remain constrained by the scarcity of publicly available protein‑ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training‑set biases. As such, we present IBEX, an Information‑Bottleneck‑EXplored coarse‑to‑fine pipeline to tackle the chronic shortage of protein‑ligand complex data in structure‑based drug design. Specifically, we use PAC‑Bayesian information‑bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters for training to generate molecules compatible with the binding pocket; it then applies an L‑BFGS optimization step to refine each conformation by optimizing five physics‑based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero‑shot docking success rate on the CrossDocked2020‑based CBGBench from 53% to 64%, improves the mean Vina score from ‑7.41 kcal mol^‑1 to ‑8.07 kcal mol^‑1, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state‑of‑the‑art validity and diversity, and markedly reduces extrapolation error.
Authors: Lei Jiang, Shuzhou Sun, Biqing Qi, Yuchen Fu, Xiaohua Xu, Yuqiang Li, Dongzhan Zhou, Tianfan Fu
Abstract: In the real world, a molecule is a 3D geometric structure. Compared to 1D SMILES sequences and 2D molecular graphs, 3D molecules represent the most informative molecular modality. Despite the rapid progress of autoregressive‑based language models, they cannot handle the generation of 3D molecular conformation due to several challenges: 1) 3D molecular structures are incompatible with LLMs' discrete token space, 2) integrating heterogeneous inputs like proteins, ligands, and text remains difficult within a unified model, and 3) LLMs lack essential scientific priors, hindering the enforcement of physical and chemical constraints during generation. To tackle these issues, we present Chem3DLLM, a unified protein‑conditioned multimodal large language model. Our approach designs a novel reversible text encoding for 3D molecular structures using run‑length compression, achieving 3x size reduction while preserving complete structural information. This enables seamless integration of molecular geometry with protein pocket features in a single LLM architecture. We employ reinforcement learning with stability‑based rewards to optimize chemical validity and incorporate a lightweight protein embedding projector for end‑to‑end training. Experimental results on structure‑based drug design demonstrate state‑of‑the‑art performance with a Vina score of ‑7.21, validating our unified multimodal approach for practical drug discovery applications.
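The property that matters in a reversible run-length encoding is that decoding restores the input exactly. The sketch below shows this on a generic token list; the token alphabet and the pair representation are hypothetical and are not Chem3DLLM's actual text encoding, which the abstract does not specify.

```python
from itertools import groupby

def rle_encode(tokens):
    """Collapse each run of identical tokens into a (token, run_length) pair."""
    return [(tok, len(list(run))) for tok, run in groupby(tokens)]

def rle_decode(pairs):
    """Expand (token, run_length) pairs back into the original token list."""
    return [tok for tok, n in pairs for _ in range(n)]

tokens = list("CCCCHHHOCC")          # toy token sequence, not a real molecule format
encoded = rle_encode(tokens)
decoded = rle_decode(encoded)
print(encoded)
```

The achievable size reduction depends entirely on how many repeated runs the underlying representation contains.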
Authors: Timon Scheiber, Matthias Heller, Andreas Giebel
Abstract: We explore the potential application of quantum annealing to address the protein structure problem. To this end, we compare several proposed ab initio protein folding models for quantum computers and analyze their scaling and performance for classical and quantum heuristics. Furthermore, we introduce a novel encoding of coordinate based models on the tetrahedral lattice, based on interleaved grids. Our findings reveal significant variations in model performance, with one model yielding unphysical configurations within the feasible solution space. Furthermore, we conclude that current quantum annealing hardware is not yet suited for tackling problems beyond a proof‑of‑concept size, primarily due to challenges in the embedding. Nonetheless, we observe a scaling advantage over our in‑house simulated annealing implementation, which, however, is only noticeable when comparing performance on the embedded problems.
Authors: Patrick Soga, Zhenyu Lei, Yinhan He, Camille Bilodeau, Jundong Li
Abstract: Predicting changes in binding free energy (ΔΔG) is a vital task in protein engineering and protein‑protein interaction (PPI) engineering for drug discovery. Previous works have observed a high correlation between ΔΔG and entropy, using probabilities of biologically important objects such as side chain angles and residue identities to estimate ΔΔG. However, estimating the full conformational distribution of a protein complex is generally considered intractable. In this work, we propose a new approach to ΔΔG prediction that avoids this issue by instead leveraging energy‑based models for estimating the probability of a complex's conformation. Specifically, we novelly decompose ΔΔG into a sequence‑based component estimated by an inverse folding model and a structure‑based component estimated by an energy model. This decomposition is made tractable by assuming equilibrium between the bound and unbound states, allowing us to simplify the estimation of degeneracies associated with each state. Unlike previous deep learning‑based methods, our method incorporates an energy‑based physical inductive bias by connecting the often‑used sequence log‑odds ratio‑based approach to ΔΔG prediction with a new ΔΔE term grounded in statistical mechanics. We demonstrate superiority over existing state‑of‑the‑art structure and sequence‑based deep learning methods in ΔΔG prediction and antibody optimization against SARS‑CoV‑2.
Authors: Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao
Abstract: Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100‑billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state‑of‑the‑art methods in a diverse set of tasks that closely resemble difficult real‑world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non‑allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm's performance by detecting important differences among protein sequences. In addition to providing Applm as open‑source software, we also provide our carefully curated benchmark datasets to facilitate future research.
Authors: Nguyen Manh Son, Pham Huu Vang, Nguyen Thi Dung, Nguyen Manh Ha, Ta Thi Thao, Tran Thi Thu Thuy, Phan Minh Giang
Abstract: Cancer is recognized as a complex group of diseases, contributing to the highest global mortality rates, with increasing prevalence and a trend toward affecting younger populations. It is characterized by uncontrolled proliferation of abnormal cells, invasion of adjacent tissues, and metastasis to distant organs. Garcinia cowa, a traditional medicinal plant widely used in Southeast Asia, including Vietnam, is employed to treat fever, cough, indigestion, as a laxative, and for parasitic diseases. Numerous xanthone compounds isolated from this species exhibit a broad spectrum of biological activities, with some showing promise as anti‑cancer and antimalarial agents. Network pharmacology analysis successfully identified key bioactive compounds Rubraxanthone, Garcinone D, Norcowanin, Cowanol, and Cowaxanthone alongside their primary protein targets (TNF, CTNNB1, SRC, NFKB1, and MTOR), providing critical insights into the molecular mechanisms underlying their anti‑cancer effects. The Graph Attention Network algorithm demonstrated superior predictive performance, achieving an R^2 of 0.98 and an RMSE of 0.02 after data augmentation, highlighting its accuracy in predicting pIC50 values for xanthone‑based compounds. Additionally, molecular docking revealed MTOR as a potential target for inducing cytotoxicity in HeLa cancer cells from Garcinia cowa.
Authors: Johannes F. Hevler, Shivam Verma, Mirat Soijtra, Carolyn R. Bertozzi
Abstract: Thermal Tracks is a Python‑based statistical framework for analyzing protein thermal stability data that overcomes key limitations of existing thermal proteome profiling (TPP) workflows. Unlike standard approaches that assume sigmoidal melting curves and are constrained by empirical null distributions (limiting significant hits to approximately 5% of data), Thermal Tracks uses Gaussian Process (GP) models with squared‑exponential kernels to flexibly model any melting curve shape while generating unbiased null distributions through kernel priors. This framework is particularly valuable for analyzing proteome‑wide perturbations that significantly alter protein thermal stability, such as pathway inhibitions, genetic modifications, or environmental stresses, where conventional TPP methods may miss biologically relevant changes due to their statistical constraints. Furthermore, Thermal Tracks excels at analyzing proteins with unconventional melting profiles, including phase‑separating proteins and membrane proteins, which often exhibit complex, non‑sigmoidal thermal stability behaviors. Thermal Tracks is freely available from GitHub and is implemented in Python, providing an accessible and flexible tool for proteome‑wide thermal profiling studies.
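To show how a squared-exponential GP can follow a melting curve without assuming a sigmoid, here is a plain NumPy sketch of GP regression on a synthetic two-step curve. It is not the Thermal Tracks implementation: the kernel hyperparameters, the noise level, the temperature grid, and the toy data are all assumed for illustration, and the null-distribution step is not reproduced.

```python
import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=3.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-(a - b)^2 / (2 * lengthscale^2))."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP with Gaussian noise variance `noise`."""
    K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = se_kernel(x_test, x_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(se_kernel(x_test, x_test)) - np.sum(v * v, axis=0)
    return Ks @ alpha, np.sqrt(np.clip(var, 0.0, None))

# Synthetic non-sigmoidal (two-step) soluble-fraction curve over 10 temperatures
temps = np.linspace(37.0, 67.0, 10)
frac = 0.5 / (1 + np.exp((temps - 45.0) / 1.5)) + 0.5 / (1 + np.exp((temps - 60.0) / 1.5))
offset = frac.mean()                      # center the data for the zero-mean GP

grid = np.append(temps, 100.0)            # training temperatures plus one far-away point
mean, std = gp_posterior(temps, frac - offset, grid)
fitted = mean[:-1] + offset
print(np.round(fitted, 3), round(float(std[-1]), 3))
```

Far from the measured temperatures the predictive standard deviation returns to the prior value, which is how the kernel prior expresses what the model does not know.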
Authors: Liyan Jia, Chuan-Xian Ren, Hong Yan
Abstract: Accurately predicting the binding conformation of small‑molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning‑based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model‑inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree‑aware weighting mechanisms into the message passing process, enhancing the model's ability to capture spatial structural distinctions and interaction strengths. To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand‑aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions and key residues. Comprehensive experimental evaluations demonstrate that CWFBind achieves competitive performance across multiple docking benchmarks, offering a balanced trade‑off between accuracy and efficiency.
Authors: Ross H. McKenzie
Abstract: Many systems involve numerous interacting parts and the whole system can have properties that the individual parts do not. I take this novelty as the defining characteristic of an emergent property. Other characteristics associated with emergence discussed include universality, order, complexity, unpredictability, irreducibility, diversity, self‑organisation, discontinuities, and singularities. Emergent phenomena are widespread across physics, biology, social sciences, and computing, and are central to major scientific and societal challenges. Understanding emergence involves considering the stratification of reality across different scales (energy, time, length, complexity), each with its distinct ontology and epistemology, leading to semi‑autonomous scientific disciplines. A central challenge is bridging the gap between macroscopic emergent properties and microscopic component interactions. Identifying an intermediate mesoscopic scale where new, weakly interacting entities or modular structures emerge is key. Theoretical approaches, such as effective theories (describing phenomena at a specific scale) and toy models (simplified systems for analysis), are vital. The Ising model exemplifies how toy models can elucidate emergence characteristics. Emergence is central to condensed matter physics, chaotic systems, fluid dynamics, nuclear physics, quantum gravity, neural networks, protein folding, and social segregation. An emergent perspective should influence scientific strategy by shaping research questions, methodologies, priorities, and resource allocation. An elusive goal is the design and control of emergent properties.
Authors: Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick
Abstract: Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real‑world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel‑prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator ‑ a system capable of producing human‑like behavior across cognitive tasks ‑ would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine‑tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior ‑ a critical criterion for a participant simulator ‑ systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.
Authors: Yibo Chen, Zirui Sheng, Weitang Li, Yong Zhang, Xun Xu, Jun-Han Huang, Yuxiang Li
Abstract: Accurate calculation of strongly correlated electronic systems requires proper treatment of both static and dynamic correlations, which remains challenging for conventional methods. To address this, we present VQE‑PDFT, a quantum‑classical hybrid framework that integrates variational quantum eigensolver with multiconfiguration pair‑density functional theory (MC‑PDFT). This framework strategically employs quantum circuits for multiconfigurational wavefunction representation while utilizing density functionals for correlation energy evaluation. The hybrid strategy maintains accurate treatment of static and dynamic correlations while reducing quantum resource requirements compared to highly expressive quantum algorithms. Benchmark validation, performed via a noiseless quantum circuit simulator, on the Charge‑Transfer dataset confirmed that VQE‑PDFT achieved results comparable to conventional MC‑PDFT. Building upon this, we developed shallow‑depth hardware‑efficient ansatz circuits and integrated them into a QM/MM multiscale architecture to enable applications in complex biological systems. This extended framework, when applied to electron transfer in the European robin cryptochrome protein ErCRY4 with noiseless simulations, yielded transfer rates that aligned well with experimental measurements. Finally, as a proof‑of‑concept hardware demonstration, we executed the reduced‑density‑matrix measurements for a single protein conformation on a 13‑qubit superconducting device and showed the impact of noise through a comprehensive error analysis.
Authors: Samiha Afaf Neha, Abir Ahammed Bhuiyan, Md. Ishrak Khan
Abstract: Introduction: Accurate prediction of Phage Virion Proteins (PVP) is essential for genomic studies due to their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged for annotating phage protein sequences from high‑throughput sequencing. However, effective annotation requires specialized sequence encodings. Our paper introduces ProteoKnight, a new image‑based encoding method that addresses spatial constraints in existing techniques, yielding competitive performance in PVP classification using pre‑trained convolutional neural networks. Additionally, our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout (MCD). Methods: ProteoKnight adapts the classical DNA‑Walk algorithm for protein sequences, incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre‑trained CNNs. Variance and entropy measures assessed prediction uncertainty across proteins of various classes and lengths. Results: Our experiments achieved 90.8% accuracy in binary classification, comparable to state‑of‑the‑art methods. Multi‑class classification accuracy remains suboptimal. Our uncertainty analysis unveils variability in prediction confidence influenced by protein class and sequence length. Conclusions: Our study surpasses frequency chaos game representation (FCGR) by introducing novel image encoding that mitigates spatial information loss limitations. Our classification technique yields accurate and robust PVP predictions while identifying low‑confidence predictions.
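The variance and entropy measures used with Monte Carlo Dropout can be sketched as follows for the binary case. The function name, the array layout, and the toy probabilities are hypothetical; in practice the probabilities would come from repeated forward passes of a classifier with dropout left active, which is not shown here.

```python
import numpy as np

def mc_dropout_uncertainty(probs):
    """Summarize T stochastic passes over N proteins.

    probs has shape (T, N) and holds P(class = PVP) from each pass.
    Returns the mean probability, its variance across passes, and the
    entropy (in nats) of the mean predictive distribution.
    """
    p_mean = probs.mean(axis=0)
    variance = probs.var(axis=0)
    eps = 1e-12  # guards against log(0)
    entropy = -(p_mean * np.log(p_mean + eps) + (1.0 - p_mean) * np.log(1.0 - p_mean + eps))
    return p_mean, variance, entropy

# Toy values: protein 0 is predicted consistently, protein 1 is not
probs = np.array([[0.99, 0.5],
                  [0.98, 0.6],
                  [0.99, 0.4]])
p_mean, variance, entropy = mc_dropout_uncertainty(probs)
print(p_mean, variance, entropy)
```

A prediction whose mean probability sits near 0.5 reaches the maximum binary entropy of ln 2, so thresholding on entropy is one simple way to flag low-confidence predictions.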
Authors: Polina V. Banushkina, Sergei V. Krivov
Abstract: Rare but critical events in complex systems, such as protein folding, chemical reactions, disease progression, and extreme weather or climate phenomena, are governed by complex, high‑dimensional, stochastic dynamics. Identifying an optimal reaction coordinate (RC) that accurately captures the progress of these dynamics is crucial for understanding and simulating such processes. However, determining an optimal RC for realistic systems is notoriously difficult, due to methodological challenges that limit the success of standard machine learning techniques. These challenges include the absence of ground truth, the lack of a loss function for general nonequilibrium dynamics, the difficulty of selecting expressive neural network architectures that avoid overfitting, the irregular and incomplete nature of many real world trajectories, limited sampling and the extreme data imbalance inherent in rare event problems. Here, we introduce a nonparametric RC optimization framework that incorporates trajectory histories and circumvents these challenges, enabling robust analysis of irregular or incomplete data without requiring extensive sampling. The power of the method is demonstrated through increasingly challenging analyses of protein folding dynamics, where it yields accurate committor estimates that pass stringent validation tests and produce high resolution free energy profiles. Its generality is further illustrated through applications to phase space dynamics, a conceptual ocean circulation model, and a longitudinal clinical dataset. These results demonstrate that rare event dynamics can be accurately characterized without extensive sampling of the configuration space, establishing a general, flexible, and robust framework for analyzing complex dynamical systems and longitudinal datasets.
Authors: Anika Friedman, Michael Shirts
Abstract: The primary limitation for the application of alchemical free energy methods to a wider variety of complex molecular systems is achieving reasonable sampling. Flexible binding complexes often have high free energy barriers, which require prohibitively long simulations to sample sufficiently to obtain reliable free energy estimates. An example of such a system is the complex formed between FabB, an elongating β‑ketoacyl‑acyl carrier protein (ACP) synthase (KS) from Escherichia coli and ACP, which carries acyl chains of varying lengths. Previous experimental evidence suggests that growing acyl chains can bind to at least two pockets. With the multiple topology replica exchange of expanded ensemble (MT‑REXEE) enhanced sampling approach, we can obtain highly efficient sampling of both pockets by adaptively growing and shrinking the chains in the simulation ensemble, allowing each simulation to visit chain lengths where transitions between the pockets do occur. This enables unbiased sampling of alternate configurational states for large complex systems. Using the new swapping approach gives significantly enhanced sampling even for this simpler problem, as demonstrated by faster convergence of free energy estimates. This case study demonstrates the utility of MT‑REXEE and its open‑source implementation for systems that feature high free energy barriers for a subset of ligands of interest, demonstrating a valuable addition to the existing stable of enhanced sampling methods.
Authors: Youzhi Zhang, Yufei Li, Gaofeng Meng, Hongbin Liu, Jiebo Luo
Abstract: Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small‑molecule ligands and protein pockets. However, current multi‑task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game‑theoretic framework that models the protein‑ligand interaction as a two‑player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self‑Play (LoopPlay) algorithm, which alternately trains these players through a two‑level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other's structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state‑of‑the‑art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery.
Authors: Debanjan Konar, Neerav Sreekumar, Richard Jiang, Vaneet Aggarwal
Abstract: Understanding the molecular‑level mechanisms underpinning Alzheimer's disease (AD) by studying crucial genes associated with the disease remains a challenge. Alzheimer's, being a multifactorial disease, requires understanding the gene‑gene interactions underlying it for theranostics and progress. In this article, a novel attempt has been made using a quantum regression to decode how some crucial genes in AD, such as Amyloid Beta Precursor Protein (APP), Fibroblast Growth Factor 14 (FGF14), Yin Yang 1 (YY1), and Phospholipase D Family Member 3 (PLD3), become influenced by other prominent switching genes during disease progression, which may help in gene expression‑based therapy for AD. Our proposed Quantum Regression Network (Alz‑QNet) introduces a pioneering approach with insights from the state‑of‑the‑art Quantum Gene Regulatory Networks (QGRN) to unravel the gene interactions involved in AD pathology, particularly within the Entorhinal Cortex (EC), where early pathological changes occur. Using the proposed Alz‑QNet framework, we explore the interactions between key genes (APP, FGF14, YY1, EGR1, GAS7, AKT3, SREBF2, and PLD3) within the EC microenvironment of AD patients, studying genetic samples from the database GSE138852, all of which are believed to play a crucial role in the progression of AD. Our investigation uncovers intricate gene‑gene interactions, shedding light on the potential regulatory mechanisms that underlie the pathogenesis of AD, which helps us find potential gene inhibitors or regulators for theranostics.
Authors: Timothy Fei Truong, Tristan Bepler
Abstract: Protein language models (PLMs) learn probability distributions over natural protein sequences. By learning from hundreds of millions of natural protein sequences, protein understanding and design capabilities emerge. Recent works have shown that scaling these models improves structure prediction, but does not seem to improve mutation understanding and representation quality for protein function prediction. We introduce PoET‑2, a multimodal, retrieval‑augmented protein foundation model that incorporates in‑context learning of family‑specific evolutionary constraints with optional structure conditioning to learn generative distributions over protein sequences. PoET‑2 uses a hierarchical transformer encoder that is equivariant to sequence context ordering and a dual decoder architecture with both causal and masked language modeling objectives, allowing PoET‑2 to operate in both fully generative and bidirectional representation learning modes. PoET‑2 achieves state‑of‑the‑art performance on zero‑shot variant effect prediction, excelling at scoring variants with multiple mutations and challenging indel mutations. In supervised settings, PoET‑2 embeddings outperform previous methods for learning sequence‑function relationships, especially with small datasets. This work highlights the benefits of combining retrieval augmentation with multimodal, family‑centric modeling for advancing protein foundation models.
Authors: Mikheil Kharbedia, Diego Herráez-Aguilar, Macarena Calero, Horacio López-Menéndez, Clara Luque-Rioja, Lara H. Moleiro, Cruz Santos, Pilar Lillo, Francisco Monroy
Abstract: Active materials capable of autonomously modulating their mechanical properties are foundational to the development of next‑generation soft technologies. Here, we introduce a novel class of extensible biohybrid hydrogels powered by living polymers of the bacterial cytokinetic protein FtsZ. When embedded within a polyacrylamide (PA) matrix, GTP‑fueled FtsZ filaments self‑organize into treadmilling structures that generate internal extensible stresses, driving reversible softening, swelling, and fluidization of the composite FtsZ‑PA hydrogel network. Unlike conventional contractile biopolymer systems, these hybrid gels exhibit stress‑induced softening, yield under minimal deformation, and suppress thermal flow barriers, all hallmarks of dissipative, extensile metamaterials. Microscopic particle tracking reveals active non‑Gaussian fluctuations, while bulk rheology confirms programmable, concentration‑dependent reductions in both stiffness and viscosity. Theoretical modeling shows that internal filament activity gives rise to a negative mechanical permittivity, establishing a new paradigm in materials science in which embedded FtsZ living polymers dynamically program active matter mechanics from within. These findings open new avenues for the design of modular, reconfigurable systems in adaptive biomaterials, soft robotics, and synthetic active matter.
Authors: Xuan Chen, Yu Huang, Miaowen Wen, Shahid Mumtaz, Fatih Gulec, Anwer Al-Dulaimi, Andrew W. Eckford
Abstract: The Internet of Bio‑Nano Things (IoBNT), envisioned as a revolutionary healthcare paradigm, shows promise for epidemic control. This paper explores the potential of using molecular communication (MC) to address the challenges in constructing IoBNT for epidemic prevention, specifically focusing on modeling viral transmission, detecting the virus/infected individuals, and identifying virus mutations. First, the MC channels in macroscale and microscale scenarios are discussed to match viral transmission in both scales separately. Besides, the detection methods for these two scales are also studied, along with the localization mechanism designed for the virus/infected individuals. Moreover, an identification strategy is proposed to determine potential virus mutations, which is validated through simulation using the ORF3a protein as a benchmark. Finally, open research issues are discussed. In summary, this paper aims to analyze viral transmission through MC and combat viral spread using signal processing techniques within MC.
Authors: Heidar J. Koning, Anuradha Pullakhandam, Andrew E. Whitten, Charles S. Bond, Michel Peyrard
Abstract: SAXS studies of four 60 base‑pair DNA duplexes with sequences closely related to part of the GAGE6 (G‑antigen 6) promoter have been performed to study the role of DNA conformations in solution and their potential relationship to DNA‑protein binding. We show that the SAXS data can be analysed using a simple polymer model which nevertheless quantitatively describes the average persistence length and torsional rigidity of the DNA double helix to determine the statistical distribution of local conformations of the DNA in solution to a high accuracy. Although the SAXS data is averaged over time and all spatial orientations of the molecules, for sequences which have some asymmetry in the data we show that the conformations can be oriented with respect to the sequence. This allows specific features detected by the analysis to be precisely related to the DNA sequence, opening up new opportunities for SAXS to investigate the properties of DNA in solution. The biological implications of these results are discussed.
Authors: Nobuto Takeuchi, Kunihiko Kaneko
Abstract: The Central Dogma of molecular biology, as originally proposed by Crick, asserts that information passed into protein cannot flow back out. This principle has been interpreted as underpinning modern understandings of heredity and evolution, implying the unidirectionality of information flow from nucleic acids to proteins. Here, we propose a generalisation of the Central Dogma as a division of labour between the transmission and expression of information: the transmitter (nucleic acids) perpetuates information across generations, whereas the expressor (protein) enacts this information to facilitate the transmitter's function without itself perpetuating information. We argue that this generalisation offers two benefits. First, it provides a unifying perspective for comparing the Central Dogma to analogous divisions of labour observed at vastly different biological scales, including multicellular organisms, eukaryotic cells, organelles, and bacteria. Second, it offers a theoretical framework to explain the Central Dogma as an outcome of evolution. Specifically, we review a mathematical model suggesting that the Central Dogma originates through spontaneous symmetry breaking driven by evolutionary conflicts between different levels of selection. By reframing the Central Dogma as an informational relationship between components of a system, this generalisation underscores its broader relevance across the biological hierarchy and sheds light on its evolutionary origin.
Authors: Mhd Hussein Murtada, Z. Faidon Brotzakis, Michele Vendruscolo
Abstract: Molecular dynamics (MD) is a powerful approach for modelling molecular systems, but it remains computationally intensive on spatial and time scales of many macromolecular systems of biological interest. To explore the opportunities offered by deep learning to address this problem, we introduce a Molecular Dynamics Large Language Model (MD‑LLM) framework to illustrate how LLMs can be leveraged to learn protein dynamics and discover states not seen in training. By applying MD‑LLM‑1, the first implementation of this approach, obtained by fine‑tuning Mistral 7B, to the T4 lysozyme and Mad2 protein systems, we show that training on one conformational state enables the prediction of other conformational states. These results indicate that MD‑LLM‑1 can learn the principles for the exploration of the conformational landscapes of proteins, although it is not yet modeling explicitly their thermodynamics and kinetics.
Authors: Alex Berlaga, Michael S. Jones, Andrew L. Ferguson
Abstract: Coarse‑grained (CG) molecular models of proteins can substantially increase the time and length scales accessible to molecular dynamics simulations of proteins, but recovery of accurate all‑atom (AA) ensembles from CG simulation trajectories can be essential for exposing molecular mechanisms of folding and docking and for calculation of physical properties requiring atomistic detail. The recently reported deep generative model FlowBack restores AA detail to protein C‑alpha traces using a flow‑matching architecture and demonstrates state‑of‑the‑art performance in generation of AA structural ensembles. Training, however, is performed exclusively on structural data and the absence of any awareness of interatomic energies or forces within training results in small fractions of incorrect bond lengths, atomic clashes, and otherwise high‑energy structures. In this work, we introduce FlowBack‑Adjoint as a lightweight enhancement that upgrades the pre‑trained FlowBack model through a one‑time, physics‑aware post‑training pass. Auxiliary contributions to the flow introduce physical awareness of bond lengths and Lennard‑Jones interactions and gradients of a molecular mechanics force field energy are incorporated via adjoint matching to steer the FlowBack‑Adjoint vector field to produce lower‑energy configurations. In benchmark tests against FlowBack, FlowBack‑Adjoint lowers single‑point energies by a median of ~78 kcal/mol.residue, reduces errors in bond lengths by >92%, eliminates >98% of molecular clashes, maintains excellent diversity of the AA configurational ensemble, and produces configurations capable of initializing stable all‑atom molecular dynamics simulations without requiring energy relaxation. We propose FlowBack‑Adjoint as an accurate and efficient physics‑aware deep generative model for AA backmapping from C‑alpha traces.
Authors: Erico Souza Teixeira, Lucas Barros Fernandes, Yara Rodrigues Inácio
Abstract: Binding energy is a fundamental thermodynamic property that governs molecular interactions, playing a crucial role in fields such as healthcare and the natural sciences. It is particularly relevant in drug development, vaccine design, and other biomedical applications. Over the years, various methods have been developed to estimate protein binding energy, ranging from experimental techniques to computational approaches, with machine learning making significant contributions to this field. Although classical computing has demonstrated strong results in constructing predictive models, the variation of quantum computing for machine learning has emerged as a promising alternative. Quantum neural networks (QNNs) have gained traction as a research focus, raising the question of their potential advantages in predicting binding energies. To investigate this potential, this study explored the feasibility of QNNs for this task by proposing thirty variations of multilayer perceptron‑based quantum neural networks. These variations span three distinct architectures, each incorporating ten different quantum circuits to configure their quantum layers. The performance of these quantum models was compared with that of a state‑of‑the‑art classical multilayer perceptron‑based artificial neural network, evaluating both accuracy and training time. A primary dataset was used for training, while two additional datasets containing entirely unseen samples were employed for testing. Results indicate that the quantum models achieved approximately 20% higher accuracy on one unseen dataset, although their accuracy was lower on the other datasets. Notably, quantum models exhibited training times several orders of magnitude shorter than their classical counterparts, highlighting their potential for efficient protein binding energy prediction.
Authors: Atabey Ünlü, Phil Rohr, Ahmet Celebi
Abstract: Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop we built a hierarchical, tool‑using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence‑to‑molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored, so the full reasoning path remains inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule‑centered reasoning trajectories and reuse successful transformations via in‑context learning. Three‑cycle research loops were run against the AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale‑ups on the two top performers. We then compared the leading LLMs' binding affinity results across three configurations: LLM‑only, single‑agent, and multi‑agent. Our results reveal an architectural trade‑off: the multi‑agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single‑agent runs generated molecules with superior drug‑like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test‑time scaling, focused feedback loops, and provenance convert general‑purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.
Authors: Yusuke Sakiyama, Emanuel Pfitzner, Santiago H. Andany, Georg E. Fantner, Joachim Heberle
Abstract: Pseudo‑heterodyne scattering‑type scanning near‑field optical microscopy (sSNOM) is applied in the mid‑infrared region to detect the chemical composition of biomolecules on the nanoscale. However, the application of sSNOM in molecular biology has been limited to static images in air. Recently, bottom illumination sSNOM (BI‑sSNOM) was developed for operation in water. Yet, the scan rate of sSNOM remains a bottleneck to record protein structural changes in aqueous solution on the seconds time scale. We designed an optical and mechanical system consisting of a separate scan high‑speed atomic force microscope (HS‑AFM) coupled to the BI‑sSNOM optics. The designed AFM scanner has a mechanical bandwidth of ca 70 kHz along the Z‑axis, and ca 6 kHz along the XY‑axis, equivalent to the sample scanning HS‑AFM. The AFM performance is demonstrated by imaging actin filaments. The optical design is validated by sSNOM experiments on purple membranes and microtubules.
Authors: Hanqi Feng, Peng Qiu, Mengchun Zhang, Yiran Tao, You Fan, Jingtao Xu, Barnabas Poczos
Abstract: Recent advances in diffusion models have shown remarkable potential for antibody design, yet existing approaches apply uniform generation strategies that cannot adapt to each antigen's unique requirements. Inspired by B cell affinity maturation, where antibodies evolve through multi‑objective optimization balancing affinity, stability, and self‑avoidance, we propose the first biologically‑motivated framework that leverages physics‑based domain knowledge within an online meta‑learning system. Our method employs multiple specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry) whose parameters evolve during generation based on iterative feedback, mimicking natural antibody refinement cycles. Instead of fixed protocols, this adaptive guidance discovers personalized optimization strategies for each target. Our experiments demonstrate that this approach: (1) discovers optimal SE(3)‑equivariant guidance strategies for different antigen classes without pre‑training, preserving molecular symmetries throughout optimization; (2) significantly enhances hotspot coverage and interface quality through target‑specific adaptation, achieving balanced multi‑objective optimization characteristic of therapeutic antibodies; (3) establishes a paradigm for iterative refinement where each antibody‑antigen system learns its unique optimization profile through online evaluation; (4) generalizes effectively across diverse design challenges, from small epitopes to large protein interfaces, enabling precision‑focused campaigns for individual targets.
Authors: Tamir Bendory, Dan Edidin, Josh Katz, Shay Kreymer
Abstract: Orbit recovery is a central problem in both mathematics and applied sciences, with important applications to structural biology. This paper focuses on recovering generic orbits of functions on $\mathbb{R}^n$ and the sphere $S^{n-1}$ under the rotation action of SO(n). Specifically, we demonstrate that invariants of degree three (called the bispectrum) suffice to recover generic orbits of functions in finite‑dimensional approximations of $L^2(\mathbb{R}^n)$ obtained by band‑limiting the spherical component and discretizing the radial direction. In particular, our main result explicitly bounds the number of samples in the radial direction required for recovery from the degree three invariants. From an application perspective, the most important case is SO(3), which arises in many scientific fields, and in particular, plays a central role in leading structural biology applications such as cryo‑electron tomography and cryo‑electron microscopy. Our result for SO(3) states that considering three spherical shells (i.e., samples in the radial direction) is sufficient to recover generic orbits, which verifies an implicit conjecture made in a paper of Bandeira et al. Our proof technique provides an explicit, computationally efficient algorithm to recover the signal by successively solving systems of linear equations. We implemented this algorithm and demonstrated its effectiveness on two protein structures.
Authors: Hanna Linn, Lucas Knuthson, Anders Irbäck, Sandipan Mohanty, Laura García-Álvarez, Göran Johansson
Abstract: Quantum heuristics have shown promise in solving various optimization problems, including lattice protein folding. Equally relevant is the inverse problem, protein design, where one seeks sequences that fold to a given target structure. The latter problem is often split into two steps: (i) searching for sequences that minimize the energy in the target structure, and (ii) testing whether the generated sequences fold to the desired structure. Here, we investigate the utility of variational quantum algorithms for the first of these two steps on today's noisy intermediate‑scale quantum devices. We focus on the sequence optimization task, which is less resource‑demanding than folding computations. We test the quantum approximate optimization algorithm and variants of it, with problem‑informed quantum circuits, as well as the hardware‑efficient ansatz, with problem‑agnostic quantum circuits. While the former algorithms yield acceptable results in noiseless simulations, their performance drops under noise. With the problem‑agnostic circuits, which are more compatible with hardware constraints, an improved performance is observed in both noisy and noiseless simulations. However, the results deteriorate when running on a real quantum device. We attribute this discrepancy to features not captured by the simulated noise model, such as the temporal aspect of the hardware noise.
Authors: Jessica Bariffi, Antonia Wachter-Zeh, Eitan Yaakobi
Abstract: This paper studies the sequence reconstruction problem for a channel inspired by protein identification. We introduce a coloring channel, where a sequence is transmitted through a channel that deletes all symbols not belonging to a fixed subset (the coloring) of the alphabet. By extending this to a coloring profile, a tuple of distinct colorings, we analyze the channel's information rate and capacity. We prove that optimal (i.e., achieving maximum information rate) coloring profiles correspond to 2‑covering designs and identify the minimal covering number required for maximum information rate, as well as the minimum number for which any coloring profile is optimal.
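The channel definition in the abstract above is simple enough to sketch directly. The following is only an illustration of the stated deletion rule: the function names, the three-letter alphabet, and the example profile are hypothetical choices, not taken from the paper.

```python
def coloring_channel(sequence, coloring):
    """Delete every symbol of `sequence` that is not in the subset `coloring`."""
    keep = set(coloring)
    return "".join(symbol for symbol in sequence if symbol in keep)


def coloring_profile_output(sequence, profile):
    """Pass the same sequence through each coloring of a profile.

    profile: a tuple of distinct colorings (subsets of the alphabet).
    Returns one read per coloring, in profile order.
    """
    return tuple(coloring_channel(sequence, coloring) for coloring in profile)


# Hypothetical example over the alphabet {A, B, C}.
sequence = "ACBAC"
profile = ("AB", "AC", "BC")
reads = coloring_profile_output(sequence, profile)
# reads == ("ABA", "ACAC", "CBC")
```

In this toy profile every pair of alphabet symbols appears together in at least one coloring, which is the defining property of a 2‑covering design; the abstract's optimality result concerns profiles of that kind.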
Authors: Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gerald W. Y. Cheng, Zongxi Li, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Abstract: Accurate prediction of protein‑ligand interactions is essential for computer‑aided drug discovery. However, existing methods often fail to capture solvent‑dependent conformational changes and lack the ability to jointly learn multiple related tasks. To address these limitations, we introduce a pre‑training method that incorporates ligand conformational ensembles generated under diverse solvent conditions as augmented input. This design enables the model to learn both structural flexibility and environmental context in a unified manner. The training process integrates molecular reconstruction to capture local geometry, interatomic distance prediction to model spatial relationships, and contrastive learning to build solvent‑invariant molecular representations. Together, these components lead to significant improvements, including a 3.7% gain in binding affinity prediction, an 82% success rate on the PoseBusters Astex docking benchmarks, and an area under the curve of 97.1% in virtual screening. The framework supports solvent‑aware, multi‑task modeling and produces consistent results across benchmarks. A case study further demonstrates sub‑angstrom docking accuracy with a root‑mean‑square deviation of 0.157 angstroms, offering atomic‑level insight into binding mechanisms and advancing structure‑based drug design.
Authors: Shah Ishmam Mohtashim
Abstract: We introduce RinQ, a hybrid quantum‑classical framework for identifying functionally critical residues in proteins by formulating centrality detection as a Quadratic Unconstrained Binary Optimization (QUBO) problem. Protein structures are modeled as residue interaction networks (RINs), and the QUBO formulations are solved using D‑Wave's simulated annealing. Applied to a diverse set of proteins, RinQ consistently identifies central residues that closely align with classical benchmarks, demonstrating both the accuracy and robustness of the approach.
Authors: Yuqi Zhang, Yuxin Yang, Cheng-Chang Lu, Weiwen Jiang, Feixiong Cheng, Bo Fang, Qiang Guan
Abstract: Protein structure prediction is a core challenge in computational biology, particularly for fragments within ligand‑binding regions, where accurate modeling is still difficult. Quantum computing offers a novel first‑principles modeling paradigm, but its application is currently limited by hardware constraints, high computational cost, and the lack of a standardized benchmarking dataset. In this work, we present QDockBank, the first large‑scale protein fragment structure dataset generated entirely using utility‑level quantum computers, specifically designed for protein‑ligand docking tasks. QDockBank comprises 55 protein fragments extracted from ligand‑binding pockets. The dataset was generated through tens of hours of execution on superconducting quantum processors, making it the first quantum‑based protein structure dataset with a total computational cost exceeding one million USD. Experimental evaluations demonstrate that structures predicted by QDockBank outperform those predicted by AlphaFold2 and AlphaFold3 in terms of both RMSD and docking affinity scores. QDockBank serves as a new benchmark for evaluating quantum‑based protein structure prediction.
Authors: Lixin Huang, Rogério Lopes dos Santos, Sid Labdi, Guillaume Lamour, Olek Maciejak, Michel Malo, John Manzi, Martin Lenz, Jacques Fattaccioli, Clément Campillo
Abstract: Cell shape changes, essential for processes such as motility or division, are controlled by the actomyosin cortex that actively remodels biological membranes. Their mechanisms can be deciphered in vitro using biomimetic reconstituted systems, such as giant unilamellar vesicles (GUVs) with controlled lipid composition coupled to reconstituted actin networks. These assays allow mimicking cell shape changes in controlled biochemical and biophysical environments. However, studying the dynamics of these shape changes on statistically significant populations of GUVs with the possibility to sequentially modify the protein composition of the assay is a major experimental challenge. To address these issues, a microfluidic approach is used to immobilize several dozens of isolated GUVs and monitor membrane and actin network evolution. The loading of the chamber with GUVs and actin is first characterized. Then, the actin‑induced remodeling of populations of homogeneous and phase‑separated GUVs is monitored and shows that actin networks prevent the coalescence of lipid microdomains and that, in return, the number of domains affects the actin network structure. This microfluidic‑based experimental strategy, thus, allows for studying actin‑induced membrane deformation in vitro and can be adapted to other studies on membrane remodeling processes.
Authors: Ambarish Singh, Romila Pradhan
Abstract: Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks, a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility, and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing, a core concept used in protein synthesis. We empirically evaluate our algorithms on three real‑world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.
Authors: Chen Adar, Yulia Baron, Baruch Rofman, Maya Bar-Dolev, Liat Bahari, Victor Yashunsky, Vera Sirotinskaya, Oded Shoseyov, Ido Braslavsky
Abstract: This work presents BioPykrete, a new sustainable bio‑composite material created from ice, nano‑crystalline cellulose (CNC), and a tailor‑made chimera protein designed to bind the two together. We developed and produced the chimera protein by linking AFPIII, an ice‑binding protein, with CBM3a, a CNC‑binding protein. As the suspension freezes, the CNC chains self‑organize into a reinforcing network between the ice crystals. This structural enhancement limits crack propagation to typical pore sizes, allowing BioPykrete to avoid the brittle and sudden failure commonly associated with ice. Instead, it exhibits an elastic‑like response to stress, making it suitable for construction and engineering applications. With compressive strength comparable with concrete, BioPykrete offers a sustainable and biodegradable alternative to construction materials suitable for the harsh arctic regions of the world where traditional methods are ineffective, and resources are scarce. Engineering chimera proteins with specific affinity to more than a single material type may help improve or tailor the properties of other composite materials.
Authors: Alex Abrudan, Sebastian Pujalte Ojeda, Chaitanya K. Joshi, Matthew Greenig, Felipe Engelberger, Alena Khmelinskaia, Jens Meiler, Michele Vendruscolo, Tuomas P. J. Knowles
Abstract: Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes ‑ from enzyme catalysis to membrane transport ‑ depend on proteins that adopt multiple conformational states. Existing multi‑state design approaches rely on post‑hoc aggregation of single‑state predictions, achieving poor experimental success rates compared to single‑state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold 3, DynamicMPNN outperforms ProteinMPNN by up to 25% on decoy‑normalized RMSD and by 12% on sequence recovery across our challenging multi‑state protein benchmark.
Authors: Pankaj Adhikary, Rajib Biswas
Abstract: Urea is widely used as a protein denaturant. However, the potential of urea to form self‑assembled structures at higher concentrations and the influence of its self‑interactions on water structure and dynamics remain elusive. This open question demands tracking of molecular‑level rearrangements. In this work, we explore the influence of urea on the local structure and dynamics of water and relate it to urea self‑association. We correlate the vibrational spectral response and orientational dynamics of water with the concentration‑dependent self‑association of urea by looking at the interface surface area, hydrogen bond strength, and population of relevant donor‑acceptor pairs. We compare the response of four urea force fields (KBFF, OPLS‑S, OPLS‑AA‑D, GAFF‑D3) with extended simple point charge (SPC/E) water. The KBFF model reproduces experimental IR spectra. Both variants of the Duffy model (OPLS‑S, OPLS‑AA‑D) show blue shifts with reasonable broadening and intense concentration‑dependent responses, while GAFF‑D3 shows random peak shifts with prominent broadening. Regarding urea self‑aggregation, KBFF is mildly repulsive, the Duffy models are attractive, and GAFF‑D3 is neutral with high variability. Only KBFF and GAFF‑D3 capture the expected deceleration in water‑orientational dynamics. We conclude that urea does not self‑aggregate significantly in water, even at higher concentrations. KBFF emerges as the most reliable classical non‑polarizable model of urea for capturing both structural and dynamic properties of water.
Authors: Hoshin Kim, Song Feng, Pavlo Bohutskyi, Xiaolu Li, Daniel Mejia-Rodriguez, Tong Zhang, Wei-Jun Qian, Margaret S. Cheung
Abstract: Cyanobacteria require ultra‑fast metabolic switching to maintain reducing power balance during environmental fluctuations. Glucose‑6‑phosphate dehydrogenase (G6PDH), catalyzing the rate‑limiting step of the oxidative pentose phosphate pathway (OPPP), provides essential NADPH and metabolic intermediates for biosynthetic processes and redox homeostasis. In cyanobacteria, the unique redox‑sensitive protein OpcA acts as a metabolic switch for G6PDH, enabling rapid adjustment of reducing power generation from glycogen catabolism and resulting in precise regulation of carbon flux between anabolic and catabolic pathways. While the redox‑sensitive cysteine structures of OpcA are known to regulate G6PDH, the detailed mechanisms of how redox post‑translational modifications (PTMs) influence OpcA's allosteric effects on G6PDH structures and function remain elusive. To investigate this mechanism, we utilized computational modeling combined with experimental redox proteomics using Synechococcus elongatus PCC 7942 as a model system. Redox proteomics captured modified cysteine residues under light/dark or circadian shifts. Computational simulation revealed that thiol PTMs near the OpcA‑G6PDH interface are crucial to allosteric regulation of regions affecting the G6PDH activity, including a potential gate region for substrate ingress and product egress, as well as critical hydrogen bond networks within the active site. These PTMs promote rapid metabolic switching by enhancing G6PDH catalytic activity when OpcA is oxidized. This study provides evidence for novel molecular mechanisms that elucidate the importance of thiol PTMs of OpcA in modulating G6PDH structure and function in an allosteric manner, demonstrating how PTM‑level regulation provides a critical control mechanism that enables cyanobacteria to rapidly adapt to environmental fluctuations through precise metabolic fine‑tuning.
Authors: Shuo Zhang, Leonardo Shoji Aota, Mahander P. Singh, Eric V. Woods, Fantine Périer Jouet, Tim M. Schwarz, Baptiste Gault
Abstract: The folding and structure of biomacromolecules depend on the 3D distributions of their constituents, which ultimately control their functionalities and interactions with other biomacromolecules. Atom probe tomography (APT), with its unparalleled compositional sensitivity at nanoscale spatial resolution, could provide complementary information to cryo‑electron microscopy, yet routine APT analysis of biomacromolecules in their native state remains challenging. Here, a ferritin solution was used as a model system. Following plunge freezing in liquid nitrogen, cryogenic lift‑out and cryo‑APT analysis were performed. Elements from the ferritin core and shell are detected, yet the particles appear to be destroyed. We hence demonstrate the feasibility of preparing and analyzing bulk hydrated biological samples using APT; however, the cooling was too slow to vitrify the solution. This caused irrecoverable damage to the protein shell surrounding the ferritin particles due to ice crystal formation. We report preliminary data from high‑pressure frozen (HPF) deionized (DI) water, providing a proof of principle that intact biomacromolecules could be analyzed through a similar workflow in the future. We also report on many trials (and errors) with different substrate materials and substrate geometries, and provide a perspective on the challenges we faced to facilitate future studies across the community.
Authors: YeongKyu Lee, Changbong Hyeon
Abstract: Circadian rhythms in living organisms are temporal orders emerging from biochemical circuits driven out of equilibrium. Here, considering the KaiABC system, a minimal model in synthetic biology, we study how the oscillation emerges from a circuit made of three Kai proteins and ATP alone. The phase diagram constructed in terms of KaiC and KaiA concentrations reveals a narrowly bounded oscillatory phase, which naturally explains arrhythmia upon protein over‑expression. As dictated by the cost‑precision trade‑offs of the thermodynamic uncertainty relations, the presence of intrinsic noise, amplified in small systems, demands a higher free energy cost to achieve greater rhythmic precision. The cost‑minimizing condition within the oscillatory phase is found to generate a ~21‑hr rhythm, which is entrained to 24‑hr environmental signals as long as the forcing amplitude is greater than ~10% of the metabolic rate. An optimal level of intrinsic noise can also induce oscillations even beyond the Hopf bifurcation, effectively expanding the oscillatory phase. Our study clarifies how physical factors, such as regulatory mechanism, energy cost, and stochastic noise, contribute to the operation of biological clocks.
Authors: Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
Abstract: Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain‑specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an agentic framework for generating and refining high‑quality synthetic data using a combination of expert knowledge, large‑scale language models, and automated evaluation techniques. Our work lays the foundation for LLM‑driven innovations in aquaculture research, advisory systems, and decision‑making tools.
Authors: Samiul Based Shuvo, Tasnia Binte Mamun, U Rajendra Acharya
Abstract: DNA‑binding proteins (DBPs) are integral to gene regulation and cellular processes, making their accurate identification essential for understanding biological functions and disease mechanisms. Experimental methods for DBP identification are time‑consuming and costly, driving the need for efficient computational prediction techniques. In this study, we propose a novel deep learning framework, ResCap‑DBP, that combines a residual learning‑based encoder with a one‑dimensional Capsule Network (1D‑CapsNet) to predict DBPs directly from raw protein sequences. Our architecture incorporates dilated convolutions within residual blocks to mitigate vanishing gradient issues and extract rich sequence features, while capsule layers with dynamic routing capture hierarchical and spatial relationships within the learned feature space. We conducted comprehensive ablation studies comparing global and local embeddings from ProteinBERT and conventional one‑hot encoding. Results show that ProteinBERT embeddings substantially outperform other representations on large datasets. Although one‑hot encoding showed marginal advantages on smaller datasets, such as PDB186, it struggled to scale effectively. Extensive evaluations on four pairs of publicly available benchmark datasets demonstrate that our model consistently outperforms current state‑of‑the‑art methods. It achieved AUC scores of 98.0% and 89.5% on PDB14189 and PDB1075, respectively. On independent test sets PDB2272 and PDB186, the model attained top AUCs of 83.2% and 83.3%, while maintaining competitive performance on larger datasets such as PDB20000. Notably, the model maintains well‑balanced sensitivity and specificity across datasets. These results demonstrate the efficacy and generalizability of integrating global protein representations with advanced deep learning architectures for reliable and scalable DBP prediction in diverse genomic contexts.
Authors: Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, Huajun Chen
Abstract: Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM‑powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph‑based retrieval‑augmented generation. The agent also incorporates a comprehensive safety‑checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal‑organic framework screening further demonstrate SciToolAgent's capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non‑experts.
Authors: Yi He, Ailun Wang, Zhi Wang, Yu Liu, Xingyuan Xu, Wen Yan
Abstract: Recent advances in generative models, particularly diffusion and auto‑regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure‑based drug design (SBDD) remains limited due to critical data constraints. To address the limited training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between billion‑scale small‑molecule datasets and the scarce protein‑ligand complex datasets, and effectively increases the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high‑fidelity VQ‑VAE for molecule representation in latent space, a diffusion model for pharmacophore‑guided molecule generation, and a pocket‑aware evolutionary strategy for molecule optimization with a physics‑based scoring function. This framework efficiently generates high‑affinity binders for various protein targets, validated with binding affinities predicted using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors of KRAS$^{\mathrm{G12D}}$, a challenging target in cancer therapeutics, with affinity similar to that of a known highly active inhibitor as evaluated by FEP calculations. With high versatility and generalizability, MEVO offers an effective and data‑efficient model for various tasks in structure‑based ligand design.
Authors: François Charih, James R. Green, Kyle K. Biggar
Abstract: Aberrant protein‑protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state of the art for sequence‑based PPI prediction methods and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and the techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity‑based approaches and deep learning‑based approaches, with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in systems‑level proteomics analyses, target identification, and design of therapeutic peptides and antibodies. We also take the opportunity to showcase the potential of PPI‑aware drug discovery models in accelerating therapeutic development.
Authors: Ziqi Zhang, Shiheng Chen, Runze Yang, Zhisheng Wei, Wei Zhang, Lei Wang, Zhanzhi Liu, Fengshan Zhang, Jing Wu, Xiaoyong Pan, Hongbin Shen, Longbing Cao, Zhaohong Deng
Abstract: Developing enzymes with desired thermal properties is crucial for a wide range of industrial and research applications, and determining temperature stability is an essential step in this process. Experimental determination of thermal parameters is labor‑intensive, time‑consuming, and costly. Moreover, existing computational approaches are often hindered by limited data availability and imbalanced distributions. To address these challenges, we introduce a curated temperature stability dataset designed for model development and benchmarking in enzyme thermal modeling. Leveraging this dataset, we present the Segment Transformer, a novel deep learning framework that enables efficient and accurate prediction of enzyme temperature stability. The model achieves state‑of‑the‑art performance with an RMSE of 24.03, MAE of 18.09, and Pearson and Spearman correlations of 0.33, respectively. These results highlight the effectiveness of incorporating segment‑level representations, grounded in the biological observation that different regions of a protein sequence contribute unequally to thermal behavior. As a proof of concept, we applied the Segment Transformer to guide the engineering of a cutinase enzyme. Experimental validation demonstrated a 1.64‑fold improvement in relative activity following heat treatment, achieved through only 17 mutations and without compromising catalytic function.
Authors: Johannes Keisers, Norbert Kern, Luca Ciandrini
Abstract: Ribosome‑targeting antibiotics, such as chloramphenicol, stall elongating ribosomes during protein synthesis, disrupting mRNA translation. These antibiotic‑induced pauses occur stochastically, alter collective ribosome dynamics, and transiently block protein production on the affected transcript. Existing models of ribosome traffic often rely on idealized assumptions, such as infinitely long mRNAs and simplified pausing dynamics, overlooking key biological constraints. Here, we develop a Totally Asymmetric Simple Exclusion Process (TASEP) that incorporates stochastic particle pausing, using experimentally determined pausing and unpausing rates to model the effects of ribosome‑targeting antibiotics. We introduce an analytically tractable Single‑Cluster approximation, tailored to capture the biologically relevant regime of rare and long antibiotic‑induced pauses. This biologically constrained model reveals three key insights: (i) the antibiotic‑induced inhibition of translation strongly depends on transcript length, with longer transcripts being disproportionately affected; (ii) reducing ribosome initiation rates significantly mitigates antibiotic vulnerability; and (iii) inhibition of translation is governed more by collective ribosome dynamics than by single‑ribosome properties. Our analytical predictions match Gillespie simulations, align quantitatively with experimental observations, and yield testable hypotheses for future experiments. These findings may have broader implications for the mechanistic modeling of other biological transport processes (e.g., RNAP dynamics), and more generally for the community studying traffic models.
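As a rough illustration of the kind of model this abstract describes, the following is a minimal Gillespie simulation of an open TASEP whose particles can pause and unpause. It is a sketch under stated assumptions, not the authors' model or their Single‑Cluster approximation: the function name, the parameter names (`alpha`, `beta`, `hop`, `pause`, `unpause`) and all default values are hypothetical choices made here for illustration.

```python
import random


def simulate_paused_tasep(L=50, alpha=0.3, beta=1.0, hop=1.0,
                          pause=0.01, unpause=0.05, t_max=2000.0, seed=0):
    """Gillespie simulation of an open TASEP with stochastic pausing.

    All rates are illustrative assumptions. Returns (mean_density, current),
    where current is the number of exits per unit time.
    """
    rng = random.Random(seed)
    # Site states: 0 = empty, 1 = active particle, 2 = paused particle.
    lattice = [0] * L
    t, exits, occupied_time = 0.0, 0, 0.0
    while True:
        # Enumerate every event that is currently possible, with its rate.
        events = []
        if lattice[0] == 0:
            events.append((alpha, "enter", 0))
        for i, state in enumerate(lattice):
            if state == 1:
                events.append((pause, "pause", i))
                if i == L - 1:
                    events.append((beta, "exit", i))
                elif lattice[i + 1] == 0:
                    events.append((hop, "hop", i))
            elif state == 2:
                events.append((unpause, "unpause", i))
        events = [e for e in events if e[0] > 0.0]
        total = sum(rate for rate, _, _ in events)
        if total == 0.0:
            break  # nothing can happen any more
        n_occupied = sum(1 for state in lattice if state)
        dt = rng.expovariate(total)  # exponential waiting time
        if t + dt >= t_max:
            occupied_time += (t_max - t) * n_occupied
            t = t_max
            break
        occupied_time += dt * n_occupied
        t += dt
        # Pick one event with probability proportional to its rate.
        x = rng.random() * total
        for rate, kind, i in events:
            x -= rate
            if x < 0.0:
                break
        if kind == "enter":
            lattice[0] = 1
        elif kind == "hop":
            lattice[i], lattice[i + 1] = 0, 1
        elif kind == "exit":
            lattice[i] = 0
            exits += 1
        elif kind == "pause":
            lattice[i] = 2
        else:  # "unpause"
            lattice[i] = 1
    if t == 0.0:
        return 0.0, 0.0
    return occupied_time / (L * t), exits / t
```

Comparing the returned current for `pause=0.0` against a run with rare, long pauses (small `unpause`) gives a qualitative feel for how a single stalled particle can block the traffic behind it.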
Authors: Anastasia Agathangelou, Dilhan Manawadu, Ivano Tavernelli
Abstract: Modelling and predicting protein configurations is crucial for advancing drug discovery, enabling the design of treatments for life‑threatening diseases. A critical aspect of this challenge is rotamer optimisation ‑ the determination of optimal side‑chain conformations given a fixed protein backbone. This problem, involving the internal degrees of freedom of amino acid side‑chains, significantly influences the protein's overall structure and function. In this work, we develop a resource‑efficient optimisation algorithm to compute the ground state energy of protein structures, with a focus on side‑chain configuration. We formulate the rotamer optimisation problem as a Quadratic Unconstrained Binary Optimisation problem and map it to an Ising model, enabling efficient quantum encoding. Building on this formulation, we propose a quantum algorithm based on the Quantum Approximate Optimisation Algorithm to explore the conformational space and identify low‑energy configurations. To benchmark our approach, we conduct a classical study using custom‑built libraries tailored for structural characterisation and energy optimisation. Our quantum method demonstrates a reduction in computational cost compared to classical simulated annealing techniques, offering a scalable and promising framework for protein structure optimisation in the quantum era.
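For readers unfamiliar with the mapping mentioned in this abstract, the textbook change of variables from a QUBO to an Ising model is sketched below. The symbols $Q_{ij}$, $x_i$, $s_i$, $J_{ij}$ and $h_i$ are generic notation assumed here (with $Q_{ij} = Q_{ji}$ for $i \neq j$); the abstract does not state the paper's own encoding of rotamers, which may differ.

```latex
% QUBO objective over binary variables x_i \in \{0, 1\}
E(x) = \sum_i Q_{ii}\, x_i + \sum_{i<j} Q_{ij}\, x_i x_j .

% Substitute x_i = (1 - s_i)/2 with spins s_i \in \{-1, +1\}
E(s) = \sum_{i<j} J_{ij}\, s_i s_j + \sum_i h_i\, s_i + C ,

% with couplings, fields, and constant offset
J_{ij} = \frac{Q_{ij}}{4}, \qquad
h_i = -\frac{Q_{ii}}{2} - \frac{1}{4} \sum_{j \neq i} Q_{ij}, \qquad
C = \frac{1}{2} \sum_i Q_{ii} + \frac{1}{4} \sum_{i<j} Q_{ij} .
```

The constant $C$ does not affect which configuration minimizes the energy, so only $J_{ij}$ and $h_i$ need to be encoded in the Ising Hamiltonian.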
Authors: Rossana Droghetti, Mattia Corigliano, Ludovico Calabrese, Philippe Fuchs, Abhishek Vaidyanathan, Johannes Keisers, Gabriele Micali, Marco Cosentino Lagomarsino, Luca Ciandrini
Abstract: This tutorial covers the emerging field of coarse‑grained cellular growth modeling, and aims to bridge the gap between theoretical foundations and practical application. By adopting an original "cookbook" approach, it is designed to offer a hands‑on guide for constructing and analyzing different key aspects of cellular growth, focusing on available results for bacteria and beyond. The tutorial is structured as a series of step‑by‑step "recipes", and covers essential concepts, recent literature, and key challenges. It aims to empower a broad audience, from students to seasoned researchers, to replicate, extend, and innovate in this scientific area. Specifically, each section provides detailed, bare‑bones models to start working in each area, from basic steady‑state growth to variable environments, and focuses on different key layers relevant to biosynthesis, transcription, translation, nutrient sensing and protein degradation, and links between cell cycle and growth, ending with ecological insights.
Authors: Xiangwen Wang, Gaojie Jin, Xiaowei Huang, Ronghui Mu
Abstract: Designing mutations to optimize protein thermostability remains challenging due to the complex relationship between sequence variations, structural dynamics, and thermostability, often assessed by ΔΔG (the change in free energy of unfolding). Existing methods rely on experimental random mutagenesis or prediction models tested with pre‑defined datasets, using sequence‑based heuristics and treating enzyme design as a one‑step process without iterative refinement, which limits design space exploration and restricts discoveries beyond known variations. We present ThermoRL, a framework based on reinforcement learning (RL) that leverages graph neural networks (GNN) to design mutations with enhanced thermostability. It combines a pre‑trained GNN‑based encoder with a hierarchical Q‑learning network and employs a surrogate model for reward feedback, guiding the RL agent on where (the position) and which (mutant amino acid) to apply for enhanced thermostability. Experimental results show that ThermoRL achieves higher or comparable rewards than baselines while maintaining computational efficiency. It filters out destabilizing mutations and identifies stabilizing mutations aligned with experimental data. Moreover, ThermoRL accurately detects key mutation sites in unseen proteins, highlighting its strong generalizability. This RL‑guided approach powered by GNN embeddings offers a robust alternative to traditional protein mutation design.
Authors: Zinan Ling, Yi Shi, Brett McKinney, Da Yan, Yang Zhou, Bo Hui
Abstract: Generating novel and functional protein sequences is critical to a wide range of applications in biology. Recent advancements in conditional diffusion models have shown impressive empirical performance in protein generation tasks. However, reliable protein generation remains an open research question in de novo protein design, especially when it comes to conditional diffusion models. Considering that the biological function of a protein is determined by multi‑level structures, we propose a novel multi‑level conditional diffusion model that integrates both sequence‑based and structure‑based information for efficient end‑to‑end protein design guided by specified functions. By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between different levels, resulting in an informative and discriminative representation of the generated protein. We also propose Protein‑MMD, a new reliable evaluation metric, to evaluate the quality of proteins generated with conditional diffusion models. Our new metric is able to capture both distributional and functional similarities between real and generated protein sequences while ensuring conditional consistency. We experiment with benchmark datasets, and the results on conditional protein generation tasks demonstrate the efficacy of the proposed generation framework and evaluation metric.
Authors: Yihong Feng, Chaitanya Pallerla, Xiaomin Lin, Pouya Sohrabipour, Philip Crandall, Wan Shou, Yu She, Dongyi Wang
Abstract: The poultry industry has been driven by broiler chicken production and has grown into the world's largest animal protein sector. Automated detection of chicken carcasses on processing lines is vital for quality control, food safety, and operational efficiency in slaughterhouses and poultry processing plants. However, developing robust deep learning models for tasks like instance segmentation in these fast‑paced industrial environments is often hampered by the need for laborious acquisition and annotation of large‑scale real‑world image datasets. We present the first pipeline generating photo‑realistic, automatically labeled synthetic images of chicken carcasses. We also introduce a new benchmark dataset containing 300 annotated real‑world images, curated specifically for poultry segmentation research. Using these datasets, this study investigates the efficacy of synthetic data and automatic data annotation to enhance the instance segmentation of chicken carcasses, particularly when real annotated data from the processing line is scarce. A small real dataset with varying proportions of synthetic images was evaluated in prominent instance segmentation models. Results show that synthetic data significantly boosts segmentation performance for chicken carcasses across all models. This research underscores the value of synthetic data augmentation as a viable and effective strategy to mitigate data scarcity, reduce manual annotation efforts, and advance the development of robust AI‑driven automated detection systems for chicken carcasses in the poultry processing industry.
Authors: Pei-Kun Yang
Abstract: In structure‑based virtual screening, it is often necessary to evaluate the binding free energy of protein‑ligand complexes by considering not only molecular conformations but also how these structures shift and rotate in space. The number of possible combinations grows rapidly and can become overwhelming. While classical computing has limitations in this context, quantum computing offers a promising alternative due to its inherent parallelism. In this study, we introduce a quantum machine learning approach that encodes molecular information into quantum states and processes them using parameterized quantum gates. The model is implemented and trained using PyTorch, and its performance is evaluated under three settings: ideal simulation, limited‑shot sampling, and simulations with quantum noise. With six quantum circuit units, the model achieves an RMSD of 2.37 kcal/mol and a Pearson correlation of 0.650. Even when using 100,000 shots, the predictions remain consistent, indicating that the model is compatible with near‑term quantum hardware. Although noise slightly reduces accuracy, the ranking of ligand affinities remains largely unchanged. These findings point to a practical and scalable strategy that balances robustness and predictive power, offering a viable path to accelerate virtual screening through moderately deep quantum circuits.
Authors: Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni
Abstract: The scarcity of molecules with desirable properties (i.e., 'positive' molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, here we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn 'property directions' ‑ without accessing any positively labeled data ‑ and moving models in the opposite property directions to generate positive molecules. When analyzed on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales, molecular task arithmetic in general generated more diverse and successful designs than models trained on positive molecules. Moreover, we employed molecular task arithmetic in dual‑objective and few‑shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable complex design properties, such as good docking scores to a protein. With its simplicity, data efficiency, and performance, molecular task arithmetic bears the potential to become the de facto transfer learning strategy for de novo molecule design.
Authors: April Herwig, Matthew J. Colbrook, Oliver Junge, Péter Koltai, Julia Slipantschuk
Abstract: Koopman operator theory enables linear analysis of nonlinear dynamical systems by lifting their evolution to infinite‑dimensional function spaces. However, finite‑dimensional approximations of Koopman and transfer (Frobenius‑Perron) operators are prone to spectral pollution, introducing spurious eigenvalues that can compromise spectral computations. While recent advances have yielded provably convergent methods for Koopman operators, analogous tools for general transfer operators remain limited. In this paper, we present algorithms for computing spectral properties of transfer operators without spectral pollution, including extensions to the Hardy‑Hilbert space. Case studies, ranging from families of Blaschke maps with known spectrum to a molecular dynamics model of protein folding, demonstrate the accuracy and flexibility of our approach. Notably, we demonstrate that spectral features can arise even when the corresponding eigenfunctions lie outside the chosen space, highlighting the functional‑analytic subtleties in defining the "true" Koopman spectrum. Our methods offer robust tools for spectral estimation across a broad range of applications.
Authors: Yuxi Lin, Yaxue Fang, Zehong Zhang, Zhouwu Liu, Siyun Zhong, Zhongfang Wang, Fulong Yu
Abstract: Understanding how 5' untranslated regions (5'UTRs) regulate mRNA translation is critical for controlling protein expression and designing effective therapeutic mRNAs. While recent deep learning models have shown promise in predicting translational efficiency from 5'UTR sequences, most are constrained by fixed input lengths and limited interpretability. We introduce UTR‑STCNet, a Transformer‑based architecture for flexible and biologically grounded modeling of variable‑length 5'UTRs. UTR‑STCNet integrates a Saliency‑Aware Token Clustering (SATC) module that iteratively aggregates nucleotide tokens into multi‑scale, semantically meaningful units based on saliency scores. A Saliency‑Guided Transformer (SGT) block then captures both local and distal regulatory dependencies using a lightweight attention mechanism. This combined architecture achieves efficient and interpretable modeling without input truncation or increased computational cost. Evaluated across three benchmark datasets, UTR‑STCNet consistently outperforms state‑of‑the‑art baselines in predicting mean ribosome load (MRL), a key proxy for translational efficiency. Moreover, the model recovers known functional elements such as upstream AUGs and Kozak motifs, highlighting its potential for mechanistic insight into translation regulation.
Authors: Gabrielle R. Abraham, Tianhao Li, Anna Nguyen, William M. Jacobs, Omar A. Saleh
Abstract: Phase separation in biomolecular mixtures can result from multiple physical interactions, which may act either complementarily or antagonistically. In the case of protein‑nucleic acid mixtures, charge plays a key role but can have contrasting effects on phase behavior. Attractive electrostatic interactions between oppositely charged macromolecules are screened by added salt, reducing the driving force for coacervation. By contrast, base pairing interactions between nucleic acids are diminished by charge repulsion and thus enhanced by added salt, promoting associative phase separation. To explore this interplay, we combine experiment and theory to map the complex phase behavior of a model solution of poly‑L‑lysine (PLL) and self‑complementary DNA nanostars (NS) as a function of temperature, ionic strength, and macromolecular composition. Despite having opposite salt dependences, we find that electrostatics and base pairing cooperate to stabilize NS‑PLL coacervation at high ionic strengths and temperatures, leading to two‑ or three‑phase coexistence under various conditions. We further observe a variety of kinetic pathways to phase separation at different salt concentrations, resulting in the formation of nonequilibrium aggregates or droplets whose compositions evolve on long timescales. Finally, we show that the cooperativity between electrostatics and base pairing can be used to create immiscible coacervates that partition various NS species at intermediate salt concentrations. Our results illustrate how the interplay between distinct interaction modes can greatly increase the complexity of the phase behavior relative to systems with a single type of interaction.
Authors: Lorenzo Rosset, Martin Weigt, Francesco Zamponi
Abstract: Accurately annotating and controlling protein function from sequence data remains a major challenge, particularly within homologous families where annotated sequences are scarce and structural variation is minimal. We present a two‑stage approach for semi‑supervised functional annotation and conditional sequence generation in protein families using representation learning. First, we demonstrate that protein language models, pretrained on large and diverse sequence datasets and possibly finetuned via contrastive learning, provide embeddings that robustly capture fine‑grained functional specificities, even with limited labeled data. Second, we use the inferred annotations to train a generative probabilistic model, an annotation‑aware Restricted Boltzmann Machine, capable of producing synthetic sequences with prescribed functional labels. Across several protein families, we show that this approach achieves highly accurate annotation quality and supports the generation of functionally coherent sequences. Our findings underscore the power of combining self‑supervised learning with light supervision to overcome data scarcity in protein function prediction and design.
Authors: Changguo Jia, Yi Zhan, Tianqi Zhao, Hengzhi Ye, Minghui Zhou
Abstract: Code clone detection plays a critical role in software maintenance and vulnerability analysis. Many methods have been proposed to detect code clones. However, they struggle to extract high‑level program semantics directly from a single linear token sequence, leading to unsatisfactory detection performance. A similar single‑sequence challenge has been successfully addressed in protein structure prediction by AlphaFold. Motivated by this success, as well as the sequential similarities between proteins and code, we leverage AlphaFold for code clone detection. In particular, we propose AlphaCC, which represents code fragments as token sequences and adapts AlphaFold's sequence‑to‑structure modeling capability to infer code semantics. The AlphaCC pipeline has three steps. First, AlphaCC transforms each input code fragment into a token sequence and, motivated by AlphaFold's use of multiple sequence alignment (MSA), uses a retrieval‑augmentation strategy to construct an MSA from lexically similar token sequences. Second, AlphaCC adopts a modified attention‑based encoder based on AlphaFold to model dependencies within and across token sequences. Finally, unlike AlphaFold's protein structure prediction task, AlphaCC computes similarity scores between token sequences through a late interaction strategy and performs binary classification to determine code clone pairs. Comprehensive evaluations on three datasets, particularly two semantic clone detection datasets, show that AlphaCC consistently outperforms all baselines, demonstrating strong semantic understanding. AlphaCC further achieves strong performance on instances where tool‑dependent methods fail, highlighting its tool‑independence. Moreover, AlphaCC maintains competitive efficiency, enabling practical usage in large‑scale clone detection tasks.
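To make the "late interaction" scoring step concrete, the following is a minimal sketch of one common form of late interaction: each token vector of one sequence is matched to its best‑scoring token vector in the other, and the per‑token maxima are summed. The abstract does not state which variant AlphaCC uses, so this MaxSim‑style rule, the function names, and the toy vectors are all assumptions for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def late_interaction_score(seq_a: Sequence[Vector], seq_b: Sequence[Vector]) -> float:
    """MaxSim-style score (assumed variant, not confirmed by the abstract).

    For every token embedding in seq_a, take its maximum dot product
    against all token embeddings in seq_b, then sum those maxima.
    """
    if not seq_a or not seq_b:
        raise ValueError("both sequences must contain at least one token")
    return sum(max(dot(x, y) for y in seq_b) for x in seq_a)


# Hypothetical 2-dimensional token embeddings for two code fragments.
fragment_a = [[1.0, 0.0], [0.0, 1.0]]
fragment_b = [[1.0, 0.0], [0.5, 0.5]]
score = late_interaction_score(fragment_a, fragment_b)
```

The score is asymmetric in general, so a clone detector built this way would have to decide whether to score one direction or combine both before thresholding for the binary decision.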
Authors: Daniel Ayomide Olanrewaju
Abstract: This research introduces the Theory of Partial Symmetry Enforced Attention Decomposition (PSEAD), a new and rigorous group‑theoretic framework designed to seamlessly integrate local symmetry awareness into the core architecture of self‑attention mechanisms within Transformer models. We formalize the concept of local permutation subgroup actions on windows of biological data, proving that under such actions, the attention mechanism naturally decomposes into a direct sum of orthogonal irreducible components. Critically, these components are intrinsically aligned with the irreducible representations of the acting permutation subgroup, thereby providing a powerful mathematical basis for disentangling symmetric and asymmetric features. We show that PSEAD offers substantial advantages. These include enhanced generalization capabilities to novel biological motifs exhibiting similar partial symmetries, unprecedented interpretability by allowing direct visualization and analysis of attention contributions from different symmetry channels, and significant computational efficiency gains by focusing representational capacity on relevant symmetric subspaces. Beyond static data analysis, we extend PSEAD's applicability to dynamic biological processes within reinforcement learning paradigms, showcasing its potential to accelerate the discovery and optimization of biologically meaningful policies in complex environments like protein folding and drug discovery. This work lays the groundwork for a new generation of biologically informed, symmetry‑aware artificial intelligence models.
Authors: Zibo Gao, Zhengzhi Jiang, Qiyu Liang, Ruihua He, Van Cuong Mai, Yingwei Tang, Qirong Xiong, Wenting Zhao, Hongwei Duan, Hongliang Sun, Mo Li, Yansong Miao, Weibo Gao
Abstract: Lanthanide binding tags (LBTs) stand out as a prominent group of fluorescent probes that are extensively utilized in biological detection. However, research on LBTs has predominantly emphasized their fluorescence properties, which are frequently compromised by background fluorescence noise. Investigating magnetic properties could optimize detection methodologies that offer enhanced sensitivity and specificity. In this study, we measured the response of a relaxometer based on ensemble nitrogen‑vacancy (NV) centers in diamond to various amounts of LBTs with gadolinium ions, determining the detection limit of LBTs to be 25 fmol. We then proposed and demonstrated a detection scheme employing the NV relaxometer to detect specific binding between LBTs and a target. Specifically, we assessed the relaxometer's response to various concentrations of the interaction between the modified LBTs and the Receptor‑Binding Domain (RBD) of the SARS‑CoV‑2 spike protein, with the detection threshold reaching ~1 pmol. Our research provides a potential application platform for biomarker detection at the picomole level by using NV centers to detect the magnetism of LBTs.
Authors: Oliver Lin, Zhiheng Lyu, Hsu-Chih Ni, Xiaokang Wang, Yetong Jia, Chu-Yun Hwang, Lehan Yao, Jian-Min Zuo, Qian Chen
Abstract: Geometric frustration is a widespread phenomenon in physics, materials science, and biology, occurring when the geometry of a system prevents local interactions from all being accommodated. The resulting manifold of nearly degenerate configurations can lead to complex collective behaviors and emergent pseudosymmetry in diverse systems such as frustrated magnets, mechanical metamaterials, and protein assemblies. In synthetic multi‑twinned nanomaterials, similar pseudosymmetric features have also been observed and manifest as intrinsic lattice strain. Despite extensive interest in the stability of these nanostructures, a fundamental understanding remains limited due to the lack of detailed structural characterization across varying sizes and geometries. In this work, we apply four‑dimensional scanning transmission electron microscopy strain mapping over a total of 23 decahedral nanoparticles with edge lengths, d, between 20 and 55 nm. From maps of the full 2D strain tensor at nanometer spatial resolution, we reveal the prevalence of heterogeneity in different modes of lattice distortions, which homogenizes and restores symmetry with increasing size. Knowing the particle crystallography, we reveal distinctive spatial patterns of local lattice phase transformation between face‑centered cubic and body‑centered tetragonal symmetries, with a contrast between particles below and above d of 35 nm. The results suggest that a cross‑over size of the internal structure occurs as the particle shape transitions from the modified‑Wulff shape favored at the nanoscale to a faceted, pentagonal bipyramidal shape. Ultimately, our 4D‑STEM mapping provides new insight into long‑standing mysteries of this historic system and can be widely applicable to study nanocrystalline solids and material phase transformation that are important in catalysis, metallurgy, electronic devices, and energy storage materials.
Authors: Saleh Alwer, Ronan Fleming
Abstract: Kinetic parameters such as the turnover number (k_\mathrm{cat}) and Michaelis constant (K_\mathrm{M}) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean‑pooled residue embeddings from a single protein language model to represent the protein. We present KinForm, a machine learning framework designed to improve predictive accuracy and generalisation for kinetic parameters by optimising protein feature representations. KinForm combines several residue‑level embeddings (Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and ProtT5‑XL‑UniRef50), taken from empirically selected intermediate transformer layers, and applies weighted pooling based on per‑residue binding‑site probability. To counter the resulting high dimensionality, we apply dimensionality reduction using principal component analysis (PCA) on concatenated protein features, and rebalance the training data via a similarity‑based oversampling strategy. KinForm outperforms baseline methods on two benchmark datasets. Improvements are most pronounced in low sequence similarity bins. We observe improvements from binding‑site probability pooling, intermediate‑layer selection, PCA, and oversampling of low‑identity proteins. We also find that removing sequence overlap between folds provides a more realistic evaluation of generalisation and should be the standard over random splitting when benchmarking kinetic prediction models.
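The contrast between mean pooling and pooling weighted by per‑residue binding‑site probability can be sketched as follows. The function name, the normalisation of weights to sum to one, the uniform fallback when all probabilities are zero, and the toy inputs are assumptions made for illustration; the abstract does not specify KinForm's exact weighting scheme.

```python
from typing import List, Sequence


def weighted_pool(
    embeddings: Sequence[Sequence[float]],
    site_probs: Sequence[float],
) -> List[float]:
    """Pool an L x D matrix of residue embeddings into a single D-vector.

    Each residue contributes in proportion to its binding-site probability.
    If every probability is zero, fall back to a uniform mean (assumption).
    """
    if len(embeddings) != len(site_probs):
        raise ValueError("need exactly one probability per residue")
    if not embeddings:
        raise ValueError("need at least one residue")

    total = sum(site_probs)
    if total == 0:
        weights = [1.0 / len(site_probs)] * len(site_probs)
    else:
        weights = [p / total for p in site_probs]

    pooled = [0.0] * len(embeddings[0])
    for weight, row in zip(weights, embeddings):
        for j, value in enumerate(row):
            pooled[j] += weight * value
    return pooled


# Three residues with 2-dimensional toy embeddings; the middle residue is
# given zero binding-site probability and therefore drops out of the pool.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_probs = [0.5, 0.0, 0.5]
pooled_vector = weighted_pool(toy_embeddings, toy_probs)
```

In a pipeline like the one described, vectors pooled this way from several language models would be concatenated before PCA; that step is omitted here.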
Authors: Hengjie Yu, Kenneth A. Dawson, Haiyun Yang, Shuya Liu, Yan Yan, Yaochu Jin
Abstract: Nanomaterial‑protein interactions (NPI) are pivotal to realizing the therapeutic and diagnostic potential of nanomaterials. Although AI promises to accelerate mechanistic understanding and enable rational nanomaterial design, robust generalization to unseen nanomaterials or proteins remains unresolved. Here, we present CuMMI (curriculum‑guided multimodal interaction model), a generalizable, explainable, and transferable model designed to infer NPI across complex biological settings. CuMMI leverages a self‑constructed million‑scale NPI dataset and adopts a multi‑stage curriculum centered on human plasma, with progressively broader biofluid exposure to enhance data coverage and generalizability. By integrating protein sequence, structure, and a text‑encoded experimental context of 37 features, CuMMI captures complementary material‑specific, biochemical, and environmental information. Sample‑level quality weights are assigned to ensure full utilization of available data while mitigating low‑confidence and sparsely recorded entries. Ablation studies highlight the most influential tabular features, clarifying their contribution to the prediction. Through rigorous external validation across independence‑preserving temporal, nanomaterial‑held‑out, and protein‑held‑out evaluations, our framework consistently achieves good performance (mean of five classification metrics exceeding 0.75), highlighting its robustness and generalizability to unseen data. Furthermore, fine‑tuning on independent gold‑nanoparticle data and a held‑out protein subset further delivers better performance than training from scratch with substantially fewer samples. Together, our approach enables generalizable and transferable NPI prediction and may accelerate in vitro research and applications of nanomaterials.
Authors: Jingbo Liang, Bruna Jacobson
Abstract: Extensively exploring protein conformational landscapes remains a major challenge in computational biology due to the high computational cost involved in dynamic physics‑based simulations. In this work, we propose a novel pipeline, MoDyGAN, that leverages molecular dynamics (MD) simulations and generative adversarial networks (GANs) to explore protein conformational spaces. MoDyGAN contains a generator that maps Gaussian distributions into MD‑derived protein trajectories, and a refinement module that combines ensemble learning with a dual‑discriminator to further improve the plausibility of generated conformations. Central to our approach is an innovative representation technique that reversibly transforms 3D protein structures into 2D matrices, enabling the use of advanced image‑based GAN architectures. We use three rigid proteins to demonstrate that MoDyGAN can generate plausible new conformations. We also use deca‑alanine as a case study to show that interpolations within the latent space closely align with trajectories obtained from steered molecular dynamics (SMD) simulations. Our results suggest that representing proteins as image‑like data unlocks new possibilities for applying advanced deep learning techniques to biomolecular simulation, leading to an efficient sampling of conformational states. Additionally, the proposed framework holds strong potential for extension to other complex 3D structures.
Authors: Nimisha Ghosh, Daniele Santoni, Debaleena Nawn, Eleonora Ottaviani, Giovanni Felici
Abstract: The impact of Transformer‑based language models has been unprecedented in Natural Language Processing (NLP). The success of such models has also led to their adoption in other fields including bioinformatics. Taking this into account, this paper discusses recent advances in Transformer‑based models for protein sequence analysis and design. In this review, we discuss and analyse a significant number of works pertaining to such applications. These applications encompass gene ontology, functional and structural protein identification, generation of de novo proteins and binding of proteins. We attempt to shed light on the strengths and weaknesses of the discussed works to provide a comprehensive insight to readers. Finally, we highlight shortcomings in existing research and explore potential avenues for future developments. We believe that this review will help researchers working in this field to gain an overview of the state of the art and to orient their future studies.
Authors: Hao Tuo, Yan Li, Xuanning Hu, Haishi Zhao, Xueyan Liu, Bo Yang
Abstract: Combinatorial optimization algorithms are essential in computer‑aided drug design, progressively exploring chemical space to design lead compounds with high affinity to a target protein. However, current methods face inherent challenges in integrating domain knowledge, limiting their performance in identifying lead compounds with novel and valid binding modes. Here, we propose AutoLeadDesign, a lead compound design framework that draws on the extensive domain knowledge encoded in large language models, together with chemical fragments, to progressively and efficiently explore vast chemical space. Comprehensive experiments indicate that AutoLeadDesign outperforms baseline methods. Significantly, empirical lead design campaigns targeting two clinically relevant targets (PRMT5 and SARS‑CoV‑2 PLpro) demonstrate AutoLeadDesign's competence in de novo generation of lead compounds achieving expert‑competitive design efficacy. Structural analysis further confirms their mechanism‑validated inhibitory patterns. By tracing the process of design, we find that AutoLeadDesign shares analogous mechanisms with fragment‑based drug design, which traditionally relies on expert decision‑making, further revealing why it works. Overall, AutoLeadDesign offers an efficient approach for lead compound design, suggesting its potential utility in drug design.
Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate candidate molecules' functional responses against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for drug discovery campaigns but has remained untapped because of its unstructured format. We present Assay2Mol, a large language model‑based workflow that can capitalize on the vast existing biochemical screening assays for early‑stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in‑context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
Authors: Jin Xu, XiaoLong Shi, Xin Chen, Fang Wang, Sirui Li, Pali Ye, Boliang Zhang, Di Deng, Zheng Kou, Xiaoli Qiang
Abstract: Efficiently solving NP‑complete problems, such as protein structure prediction, cryptographic decryption, and vulnerability detection, remains a central challenge in computer science. Traditional electronic computers, constrained by the Turing machine's one‑dimensional data processing and sequential operations, struggle to address these issues effectively. To overcome this bottleneck, computational models must adopt multidimensional data structures and parallel information processing mechanisms. Building on our team's proposed probe machine model (a non‑Turing computational framework), this study develops a blocking probe technique that leverages DNA computing's inherent parallelism to identify all valid solutions for NP‑complete problems in a single probe operation. Using the 27‑vertex 3‑coloring problem as a case study, we successfully retrieved all solutions through DNA molecular probe experiments. This breakthrough demonstrates the first implementation of a fully parallel computing system at the molecular level, offering a novel paradigm for tackling computational complexity. Our results indicate that the probe machine, with its parallel architecture and molecular implementation, transcends the limitations of classical models and holds promise for solving intricate real‑world problems.
Authors: Wesley W. Oliver, William M. Jacobs, Michael A. Webb
Abstract: Understanding and predicting the phase behavior of intrinsically disordered proteins (IDPs) is of significant interest due to their role in many biological processes. However, effectively characterizing phase behavior and its complex dependence on protein primary sequence remains challenging. In this study, we evaluate the efficacy of several simple computational metrics to quantify the propensity of single‑component IDP solutions to phase separate; specific metrics considered include the single‑chain radius of gyration, the second virial coefficient, and a newly proposed quantity termed the expenditure density. Each metric is computed using coarse‑grained molecular dynamics simulations for 2,034 IDP sequences. Using machine learning, we analyze this data to understand how sequence features correlate with the predictive performance of each metric and to develop insight into their respective strengths and limitations. The expenditure density is determined to be a broadly useful metric that combines simplicity, low computational cost, and accuracy; it also provides a continuous measure that remains informative across both phase‑separating and non‑phase‑separating sequences. Additionally, this metric shows promise in its ability to improve predictions of other properties for IDP systems. This work extends existing literature by advancing beyond binary classification, which can be useful for rapidly screening phase behavior or predicting other properties of IDP‑related systems.
Authors: Lauren Lui, Torben Nielsen
Abstract: Functional annotation of microbial genomes is often biased toward protein‑coding genes, leaving a vast, unexplored landscape of non‑coding RNAs (ncRNAs) that are critical for regulating bacterial and archaeal physiology, stress response and metabolism. Identifying ncRNAs directly from genomic sequence is a paramount challenge in bioinformatics and biology, essential for understanding the complete regulatory potential of an organism. This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone. It is also computationally viable for large sequence datasets such as long read metagenomic assemblies with contigs totaling multiple Gbp. RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples. We know of no other model that can detect ncRNAs based solely on genomic sequence at this scale. Since RNAMunin only requires genomic sequence as input, we do not need an ncRNA to be transcribed to find it, i.e., we do not need transcriptomics data. We wrote this manuscript in a narrative style in order to best convey how RNAMunin was developed and how it works in detail. Unlike almost all current ML models, at approximately 1M parameters, RNAMunin is very small and very fast.
Authors: Tao Wang, Nan Zhang, Hongjie Huang, Yunhe An, Yunyun Dai, Yongrui Li, Nan Yang, Chaojie Yang, Xinran Zhou, Yucheng Zhu, Yingshan Ma, Lingling Huang, Yongtian Wang, Yang Liu, Zhiyong Yan
Abstract: Photoelectrochemical (PEC) biosensors exhibit significant potential for biomolecule detection due to their high sensitivity and low background noise. However, their performance is severely constrained by the rapid recombination of photogenerated charge carriers. This study innovatively introduces a non‑contact magnetic modulation strategy to suppress electron‑hole recombination by manipulating carrier spin states, thereby significantly enhancing photoelectric conversion efficiency. Building on this mechanism, we developed a novel magnetically modulated PEC biosensing platform based on the MXenes/cobalt‑doped titanium dioxide (Co‑TiO2) heterostructure. This platform achieved ultrasensitive detection of protein kinase A (PKA) activity. Compared to an identical probe‑modified biosensor without magnetic field application, the developed platform demonstrated a 68.75% enhancement in detection sensitivity and achieved an ultralow detection limit for PKA of 0.00016 U/mL. It also exhibited a wide linear range from 0.005 to 80 U/mL. This research not only provides a novel methodology for kinase activity analysis but also pioneers the innovative strategy of magnetic modulation for enhanced PEC sensing. It opens new avenues for developing high‑performance biosensing platforms, holding significant promise for early disease diagnosis and drug screening applications.
Authors: Chengyue Gong, Xinshi Chen, Yuxuan Zhang, Yuxuan Song, Hao Zhou, Wenzhi Xiao
Abstract: Lightweight inference is critical for biomolecular structure prediction and other downstream tasks, enabling efficient real‑world deployment and inference‑time scaling for large‑scale applications. In this work, we address the challenge of balancing model efficiency and prediction accuracy through three key insights and modifications: 1) the multi‑step AF3 sampler is replaced by a few‑step ODE sampler, significantly reducing the computational overhead of the diffusion module during inference; 2) in the open‑source Protenix framework, a subset of pairformer or diffusion transformer blocks does not contribute to the final structure prediction, presenting opportunities for architectural pruning and lightweight redesign; 3) a model incorporating an ESM module is trained to substitute for the conventional MSA module, reducing MSA preprocessing time. Building on these key insights, we present Protenix‑Mini, a compact and optimized model designed for efficient protein structure prediction. This streamlined version incorporates a more efficient architectural design with a two‑step Ordinary Differential Equation (ODE) sampling strategy. By eliminating redundant Transformer components and refining the sampling process, Protenix‑Mini significantly reduces model complexity with only a slight drop in accuracy. Evaluations on benchmark datasets demonstrate that it achieves high‑fidelity predictions, with only a 1 to 5 percent decrease in performance compared to its full‑scale counterpart. This makes Protenix‑Mini an ideal choice for applications where computational resources are limited but accurate structure prediction remains crucial.
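As a rough illustration of what a few‑step ODE sampler does, the sketch below integrates dx/dt = v(x, t) from t = 0 to t = 1 with fixed‑step Euler updates. It is not the Protenix‑Mini sampler: the integrator choice, the function names, and the toy velocity field (which simply points at a fixed target and stands in for a trained network) are all assumptions for illustration.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(x0: Vector, velocity: VelocityFn, steps: int = 2) -> Vector:
    """Integrate dx/dt = velocity(x, t) over t in [0, 1] with `steps` Euler updates."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # velocity is evaluated at the start of each interval
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical straight-line field aimed at `target` (not a learned model)."""
    def velocity(x: Vector, t: float) -> Vector:
        # t never reaches 1 inside euler_sample, so the division is safe.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Two Euler steps carry the starting point onto the target for this toy field.
sample = euler_sample([0.0, 4.0], make_toy_velocity([1.0, 2.0]), steps=2)
```

For a straight‑line field like this one, even a single step lands on the target; a learned field is generally curved, which is why the number of steps trades accuracy against cost.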
Authors: Andrei Rekesh, Miruna Cretu, Dmytro Shevchuk, Vignesh Ram Somnath, Pietro Liò, Robert A. Batey, Mike Tyers, Michał Koziarski, Cheng-Hao Liu
Abstract: Synthesizability remains a critical bottleneck in generative molecular design. While recent advances have addressed synthesizability in 2D graphs, extending these constraints to 3D for geometry‑based conditional generation remains largely unexplored. In this work, we present SynCoGen (Synthesizable Co‑Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SynCoGen samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SynSpace, a dataset family containing over 1.2M synthesis‑aware building block graphs and 7.5M conformers. SynCoGen achieves state‑of‑the‑art performance in unconditional small molecule graph and conformer co‑generation. For protein ligand generation in drug discovery, the amortized model delivers superior performance in both molecular linker design and pharmacophore‑conditioned generation across diverse targets without relying on any scoring functions. Overall, this multimodal non‑autoregressive formulation represents a foundation for a range of molecular design applications, including analog expansion, lead optimization, and direct de novo design.
Authors: Yuehua Song, Yong Gao
Abstract: Accurately predicting drug‑target interactions (DTIs) is pivotal for advancing drug discovery and target validation techniques. While machine learning approaches including those that are based on Graph Neural Networks (GNN) have achieved notable success in DTI prediction, many of them have difficulties in effectively integrating the diverse features of drugs, targets and their interactions. To address this limitation, we introduce a novel framework to take advantage of the power of both transductive learning and inductive learning so that features at molecular level and drug‑target interaction network level can be exploited. Within this framework is a GNN‑based model called Graph‑in‑Graph (GiG) that represents graphs of drug and target molecular structures as meta‑nodes in a drug‑target interaction graph, enabling a detailed exploration of their intricate relationships. To evaluate the proposed model, we have compiled a special benchmark comprising drug SMILES, protein sequences, and their interaction data, which is interesting in its own right. Our experimental results demonstrate that the GiG model significantly outperforms existing approaches across all evaluation metrics, highlighting the benefits of integrating different learning paradigms and interaction data.
Authors: Haoran Li, Xingye Cheng, Ziyang Huang, Jingyuan Luo, Qianqian Xu, Qiguang Zhao, Tianchen Guo, Yumeng Zhang, Linda Lidan Zhong, Zhaoxiang Bian, Leihan Tang, Aiping Lyu, Liang Tian
Abstract: Traditional Chinese Medicine (TCM) diagnosis and treatment principles, established through centuries of trial‑and‑error clinical practice, directly map patient‑specific symptom patterns to personalised herbal therapies. These empirical holistic mapping principles offer valuable strategies to address remaining challenges of reductionist methodologies in modern biomedicine. However, the lack of a quantitative framework and molecular‑level evidence has limited their interpretability and reliability. Here, we present an AI framework trained on ancient and classical TCM formula records to quantify the symptom pattern‑herbal therapy mappings. Interestingly, we find that empirical TCM diagnosis and treatment are consistent with the encoding‑decoding processes in the AI model. This enables us to construct an interpretable TCM embedding space (TCM‑ES) using the model's quantitative representation of TCM principles. Validated through broad and extensive TCM patient data, the TCM‑ES offers universal quantification of the TCM practice and therapeutic efficacy. We further map biomedical entities into the TCM‑ES through correspondence alignment. We find that the principal directions of the TCM‑ES are significantly associated with key biological functions (such as metabolism, immunity, and homeostasis), and that the disease and herb embedding proximity aligns with their genetic relationships in the human protein interactome, which demonstrates the biological significance of TCM principles. Moreover, the TCM‑ES uncovers latent disease relationships, and provides an alternative metric to assess clinical efficacy for modern disease‑drug pairs. Finally, we construct a comprehensive and integrative TCM knowledge graph, which predicts potential associations between diseases and targets, drugs, herbal compounds, and herbal therapies, providing TCM‑informed opportunities for disease analysis and drug development.
Authors: Haruya Imamura, Yasuaki Kobayashi, Yota Otachi, Toshiki Saitoh, Keita Sato, Asahi Takaoka, Ryo Yoshinaka, Tom C. van der Zanden
Abstract: (Induced) Subgraph Isomorphism and Maximum Common (Induced) Subgraph are fundamental problems in graph pattern matching and similarity computation. In graphs derived from time‑series data or protein structures, a natural total ordering of vertices often arises from their underlying structure, such as temporal sequences or amino acid sequences. This motivates the study of problem variants that respect this inherent ordering. This paper addresses Ordered (Induced) Subgraph Isomorphism (O(I)SI) and its generalization, Maximum Common Ordered (Induced) Subgraph (MCO(I)S), which seek to find subgraph isomorphisms that preserve the vertex orderings of two given ordered graphs. Our main contributions are threefold: (1) We prove that these problems remain NP‑complete even when restricted to small graph classes, such as trees of depth 2 and threshold graphs. (2) We establish a gap in computational complexity between OSI and OISI on certain graph classes. For instance, OSI is polynomial‑time solvable for interval graphs with their interval orderings, whereas OISI remains NP‑complete under the same setting. (3) We demonstrate that the tractability of these problems can depend on the vertex ordering. For example, while OISI is NP‑complete on threshold graphs, its generalization, MCOIS, can be solved in polynomial time if the specific vertex orderings that characterize the threshold graphs are provided.
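The problem definition can be made concrete with a brute‑force checker. In this sketch, vertices are the integers 0..n-1 in their given order, and an embedding is an order‑preserving injection of pattern vertices into host vertices that maps every pattern edge to a host edge; in the induced variant, pattern non‑edges must also map to host non‑edges. The function name and graph encoding are assumptions for illustration, and the exhaustive search takes exponential time, which is in keeping with the hardness results stated above; it is not one of the paper's algorithms.

```python
from itertools import combinations
from typing import Iterable, Optional, Tuple

Edge = Tuple[int, int]


def ordered_subgraph_iso(
    pattern_n: int,
    pattern_edges: Iterable[Edge],
    host_n: int,
    host_edges: Iterable[Edge],
    induced: bool = False,
) -> Optional[Tuple[int, ...]]:
    """Return an order-preserving embedding as a tuple, or None if none exists.

    The result `image` maps pattern vertex i to host vertex image[i].
    """
    p_edges = {frozenset(e) for e in pattern_edges}
    h_edges = {frozenset(e) for e in host_edges}

    # combinations() yields increasing tuples, so i < j implies
    # image[i] < image[j]: every candidate respects the vertex order.
    for image in combinations(range(host_n), pattern_n):
        ok = True
        for u, v in combinations(range(pattern_n), 2):
            in_pattern = frozenset((u, v)) in p_edges
            in_host = frozenset((image[u], image[v])) in h_edges
            if in_pattern and not in_host:
                ok = False  # a pattern edge is missing in the host
                break
            if induced and in_host and not in_pattern:
                ok = False  # induced variant forbids extra host edges
                break
        if ok:
            return image
    return None


# A 3-vertex path embeds into a triangle, but not as an induced subgraph.
path = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
plain = ordered_subgraph_iso(3, path, 3, triangle)
strict = ordered_subgraph_iso(3, path, 3, triangle, induced=True)
```

The ordering constraint is what separates this from the classical problem: a pattern consisting of the single edge (0, 2) on three vertices does not embed into the ordered path 0-1-2, even though it would if vertices could be relabelled freely.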
Authors: Chi-en Amy Tai, Alexander Wong
Abstract: Peptide de novo sequencing is a method used to reconstruct amino acid sequences from tandem mass spectrometry data without relying on existing protein sequence databases. Traditional deep learning approaches, such as Casanovo, mainly utilize autoregressive decoders and predict amino acids sequentially. Consequently, they encounter cascading errors and fail to leverage high‑confidence regions effectively. To address these issues, this paper investigates using diffusion decoders adapted for the discrete data domain. These decoders provide a different approach, allowing sequence generation to start from any peptide segment, thereby enhancing prediction accuracy. We experiment with three different diffusion decoder designs, knapsack beam search, and various loss functions. We find that knapsack beam search did not improve performance metrics and that simply replacing the transformer decoder with a diffusion decoder lowered performance. Although peptide precision and recall were still 0, the best diffusion decoder design with the DINOISER loss function obtained a statistically significant improvement in amino acid recall by 0.373 compared to the baseline autoregressive decoder‑based Casanovo model. These findings highlight the potential of diffusion decoders to not only enhance model sensitivity but also drive significant advancements in peptide de novo sequencing.
Authors: Balu Bhasuran, Sabenabanu Abdulkadhar, Jeyakumar Natarajan
Abstract: High‑altitude diseases (HAD), encompassing acute mountain sickness (AMS), high‑altitude cerebral edema (HACE), and high‑altitude pulmonary edema (HAPE), are triggered by hypobaric hypoxia at elevations above 2,500 meters. These conditions pose significant health risks, yet the molecular mechanisms remain insufficiently understood. In this study, we developed a biomolecular event extraction pipeline integrating supervised machine learning with feature‑based and multiscale Laplacian graph kernels to analyze 7,847 curated HAD‑related abstracts from PubMed. We extracted over 150 unique biomolecular events (including gene expression, regulation, binding, and localization) and constructed a weighted, undirected biomolecular event network comprising 97 nodes and 153 edges. Using the PageRank algorithm, we prioritized key biomolecules based on their centrality within the event network. The top‑ranked proteins included Erythropoietin (EPO) (0.0163), Vascular endothelial growth factor (VEGF) (0.0148), Hypoxia‑inducible factor 1 (HIF‑1) alpha (0.0136), Endothelial PAS Domain Protein 1 (EPAS1) and Angiotensin‑Converting Enzyme (ACE) (0.0119), Egl nine homolog 1 (EGLN1), Endothelin 1 (ET‑1), and 70 kilodalton heat shock protein (Hsp70) (0.0118), all of which play crucial roles in oxygen sensing, vascular remodeling, erythropoiesis, and blood pressure regulation. Subnetwork analysis revealed three major functional clusters centered on hypoxia response, inflammation, and stress adaptation pathways. Our integrative approach demonstrates the utility of large‑scale text mining and graph‑based analysis to uncover mechanistic insights and prioritize potential biomarkers for high‑altitude disease.
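The PageRank prioritization step described in this abstract can be sketched as a power iteration over a weighted, undirected graph. This is a minimal illustration, not the authors' pipeline: the function name `pagerank`, the damping factor of 0.85, and the four toy edges with their weights are all assumptions made up for the example and are not taken from the paper's 97‑node network.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank on an undirected graph via power iteration.

    edges: iterable of (u, v, weight) with positive weights and u != v.
    Each edge is treated as bidirectional because the graph is undirected.
    """
    # Build a symmetric weighted adjacency map.
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})
        adj.setdefault(v, {})
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    nodes = sorted(adj)
    n = len(nodes)
    # Node strength = total weight of incident edges (never zero here,
    # since every node enters the map through at least one edge).
    strength = {u: sum(adj[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(max_iter):
        # Teleportation term, shared equally by all nodes.
        new = {u: (1.0 - damping) / n for u in nodes}
        # Each node passes its rank to neighbours in proportion to edge weight.
        for u in nodes:
            for v, w in adj[u].items():
                new[v] += damping * rank[u] * w / strength[u]
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy network with made-up weights (hypothetical, for illustration only).
toy_edges = [
    ("HIF1A", "EPO", 3.0),
    ("HIF1A", "VEGF", 2.0),
    ("HIF1A", "EGLN1", 1.0),
    ("EPO", "VEGF", 1.0),
]
scores = pagerank(toy_edges)
top_node = max(scores, key=scores.get)
```

In this toy graph the most strongly connected node receives the highest score, which mirrors how centrality is used to rank biomolecules in the event network.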
Authors: Yuhao Wang, Keyan Ding, Kehua Feng, Zeyuan Wang, Ming Qin, Xiaotong Li, Qiang Zhang, Huajun Chen
Abstract: Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and de novo design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge‑guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
Authors: Yuchen Zhu, Jihong Chen, Yitong Li, Xiaomin Fang, Xianbin Ye, Jingzhou He, Xujun Zhang, Jingxuan Ge, Chao Shen, Xiaonan Zhang, Tingjun Hou, Chang-Yu Hsieh
Abstract: Structural assessment of biomolecular complexes is vital for translating molecular models into functional insights, shaping our understanding of biology and aiding drug discovery. However, current structure‑based scoring functions often lack generalizability across diverse biomolecular systems. We present BioScore, a foundational scoring function that addresses key challenges ‑‑ data sparsity, cross‑system representation, and task compatibility ‑‑ through a dual‑scale geometric graph learning framework with tailored modules for structure assessment and affinity prediction. BioScore supports a wide range of tasks, including affinity prediction, conformation ranking, and structure‑based virtual screening. Evaluated on 16 benchmarks spanning proteins, nucleic acids, small molecules, and carbohydrates, BioScore consistently outperforms or matches 70 traditional and deep learning methods. Our newly proposed PPI Benchmark further enables comprehensive evaluation of protein‑protein complex scoring. BioScore demonstrates broad applicability: (1) pretraining on mixed‑structure data boosts protein‑protein affinity prediction by up to 40% and antigen‑antibody binding correlation by over 90%; (2) cross‑system generalizability enables zero‑ and few‑shot prediction with up to 71% correlation gain; and (3) its unified representation captures chemically challenging systems such as cyclic peptides, improving affinity prediction by over 60%. BioScore establishes a robust and generalizable framework for structural assessment across complex biomolecular landscapes.
Authors: Jiayuan Chen, Thai-Hoang Pham, Yuanlong Wang, Ping Zhang
Abstract: High‑throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for de novo cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation‑specific and cell line‑specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation‑specific features during pretraining. Additionally, we incorporate transcriptomic features from single‑cell foundation models to capture cell line‑specific representations. By learning these disentangled features, our method improves the generalization of imaging models to de novo cell lines. We evaluate our framework on the RxRx database through one‑shot fine‑tuning on an RxRx1 cell line and few‑shot fine‑tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for de novo cell lines, highlighting its effectiveness in real‑world phenotype‑based drug discovery applications.
Authors: Lu Zhu, Emmanuel Noutahi
Abstract: Generative chemical language models (CLMs) have demonstrated strong capabilities in molecular design, yet their impact in drug discovery remains limited by the absence of reliable reward signals and the lack of interpretability in their outputs. We present SAFE‑T, a generalist chemical modeling framework that conditions on biological context ‑‑ such as protein targets or mechanisms of action ‑‑ to prioritize and design molecules without relying on structural information or engineered scoring functions. SAFE‑T models the conditional likelihood of fragment‑based molecular sequences given a biological prompt, enabling principled scoring of molecules across tasks such as virtual screening, drug‑target interaction prediction, and activity cliff detection. Moreover, it supports goal‑directed generation by sampling from this learned distribution, aligning molecular design with biological objectives. In comprehensive zero‑shot evaluations across predictive (LIT‑PCBA, DAVIS, KIBA, ACNet) and generative (DRUG, PMO) benchmarks, SAFE‑T consistently achieves performance comparable to or better than existing approaches while being significantly faster. Fragment‑level attribution further reveals that SAFE‑T captures known structure‑activity relationships, supporting interpretable and biologically grounded design. Together with its computational efficiency, these results demonstrate that conditional generative CLMs can unify scoring and generation to accelerate early‑stage drug discovery.
Authors: Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, Arash Vahdat
Abstract: Recently, many generative models for de novo protein structure design have emerged. Yet, only few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La‑Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per‑residue latent variables of fixed dimensionality, thereby effectively side‑stepping challenges of explicit side‑chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full‑atom structures. La‑Proteina achieves state‑of‑the‑art performance on multiple generation benchmarks, including all‑atom co‑designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La‑Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure‑conditioned protein design tasks. Moreover, La‑Proteina is able to generate co‑designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La‑Proteina's scalability and robustness.
Authors: Suman Samantray, Margot Lockwood, Amity Andersen, Hoshin Kim, Paul Rigor, Margaret S. Cheung, Daniel Mejia-Rodriguez
Abstract: We developed an advanced computational framework to accelerate the study of the impact of post‑translational modifications on protein structures and interactions (PTM‑Psi) using asynchronous, loosely coupled workflows on the Azure Quantum Elements Cloud platform. We seamlessly integrate emerging cloud computing assets that further expand the scope and capability of the PTM‑Psi Python package by refactoring it into a cloud‑compatible library. We employed a "workflow of workflows" approach wherein a parent workflow spawns one or more child workflows, manages them, and acts on their results. This approach enabled us to optimize resource allocation according to each workflow's needs, and allowed us to use the cloud's heterogeneous architecture for the computational investigation of a combinatorial explosion of thiol protein PTMs on an exemplary protein megacomplex critical to the Calvin‑Benson cycle of light‑dependent sugar production in cyanobacteria. With PTM‑Psi on the cloud, we transformed the pipeline for the thiol PTM analysis to achieve high throughput by leveraging the strengths of the cloud service. PTM‑Psi on the cloud reduces operational complexity and lowers entry barriers to data interpretation with structural modeling for a redox proteomics mass spectrometry specialist.
Authors: Meng Liu, Karl Leswing, Simon K. S. Chu, Farhad Ramezanghorbani, Griffin Young, Gabriel Marques, Prerna Das, Anjali Panikar, Esther Jamir, Mohammed Sulaiman Shamsudeen, K. Shawn Watts, Ananya Sen, Hari Priya Devannagari, Edward B. Miller, Muyun Lihan, Howook Hwang, Janet Paulsen, Xin Yu, Kyle Gion, Timur Rvachov, Emine Kucukbenli, Saee Gopal Paliwal
Abstract: Protein‑ligand binding affinity prediction is essential for drug discovery and toxicity assessment. While machine learning (ML) promises fast and accurate predictions, its progress is constrained by the availability of reliable data. In contrast, physics‑based methods such as absolute binding free energy perturbation (AB‑FEP) deliver high accuracy but are computationally prohibitive for high‑throughput applications. To bridge this gap, we introduce ToxBench, the first large‑scale AB‑FEP dataset designed for ML development and focused on a single pharmaceutically critical target, Human Estrogen Receptor Alpha (ERα). ToxBench contains 8,770 ERα‑ligand complex structures with binding free energies computed via AB‑FEP with a subset validated against experimental affinities at 1.75 kcal/mol RMSE, along with non‑overlapping ligand splits to assess model generalizability. Using ToxBench, we further benchmark state‑of‑the‑art ML methods, and notably, our proposed DualBind model, which employs a dual‑loss framework to effectively learn the binding energy function. The benchmark results demonstrate the superior performance of DualBind and the potential of ML to approximate AB‑FEP at a fraction of the computational cost.
Authors: Rui-Hao Li, Hakan Doga, Bryan Raubenolt, Sarah Mostame, Nicholas DiSanto, Fabio Cumbo, Jayadev Joshi, Hanna Linn, Maeve Gaffney, Alexander Holden, Vinooth Kulkarni, Vipin Chaudhary, Kenneth M. Merz, Abdullah Ash Saki, Tomas Radivoyevitch, Frank DiFilippo, Jun Qin, Omar Shehab, Daniel Blankenberg
Abstract: In this work, we present the first implementation of the face‑centered cubic (FCC) lattice model for protein structure prediction with a quantum algorithm. Our motivation to encode the FCC lattice stems from our observation that the FCC lattice is more capable in terms of modeling realistic secondary structures in proteins compared to other lattices, as demonstrated using root mean square deviation (RMSD). We utilize two quantum methods to solve this problem: a polynomial fitting approach (PolyFit) and the Variational Quantum Eigensolver with constraints (VQEC) based on the Lagrangian duality principle. Both methods are successfully deployed on Eagle R3 (ibm_cleveland) and Heron R2 (ibm_kingston) quantum computers, where we are able to recover ground state configurations for the 6‑amino acid sequence KLVFFA under noise. A comparative analysis of the outcomes generated by the two QPUs reveals a significant enhancement (reaching nearly a two‑fold improvement for PolyFit and a three‑fold improvement for VQEC) in the prediction and sampling of the optimal solution (ground state conformations) on the newer Heron R2 architecture, highlighting the impact of quantum hardware advancements for this application.
Authors: Changze Lv, Jiang Zhou, Siyu Long, Lihao Wang, Jiangtao Feng, Dongyu Xue, Yu Pei, Hao Wang, Zherui Zhang, Yuchen Cai, Zhiqiang Gao, Ziyuan Ma, Jiakai Hu, Chaochen Gao, Jingjing Gong, Yuxuan Song, Shuyi Zhang, Xiaoqing Zheng, Deyi Xiong, Lei Bai, Wanli Ouyang, Ya-Qin Zhang, Wei-Ying Ma, Bowen Zhou, Hao Zhou
Abstract: We introduce AMix‑1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in‑context learning mechanism, and test‑time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding via loss perspective, culminating in a strong 1.7‑billion model. Building on this foundation, we devise a multiple sequence alignment (MSA)‑based in‑context learning strategy to unify protein design into a general framework, where AMix‑1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to 50× activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix‑1 with an evolutionary test‑time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next‑generation lab‑in‑the‑loop protein design.
Authors: Chan Lim, Jae-Hyung Jeon
Abstract: Anomalous diffusion often arises in complex environments where viscoelastic or crowded conditions influence particle motion. In many biological and soft‑matter systems, distinct components of the medium exhibit unique viscoelastic responses, resulting in time‑dependent changes in the observed diffusion exponents. Here, we develop a theoretical model of two particles, each embedded in a distinct viscoelastic medium, and coupled via a harmonic potential. By formulating and solving a system of coupled fractional Langevin equations (FLEs) with memory exponents 0<α<β≤1, we uncover rich transient anomalous diffusion phenomena arising from the interplay of memory kernels and bilinear coupling. Notably, we identify recovery dynamics, where a subdiffusive particle (α) transiently accelerates and eventually regains its intrinsic long‑time mobility. This recovery emerges only when memory exponents differ (α<β), whereas identical exponents (α=β) suppress recovery. Our theoretical predictions offer insight into experimentally observed transient anomalous diffusion, such as in polymer‑protein complexes and cross‑linked cytoskeletal networks, highlighting the critical role of memory heterogeneity and mechanical interactions in biological anomalous diffusion.
Authors: Achuth Chandrasekhar, Amir Barati Farimani
Abstract: Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high‑quality input files for MD simulations can be a time‑consuming and error‑prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with Python scripting and Selenium‑based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI's comprehensive web‑based interface for preparing simulation‑ready inputs for NAMD. By integrating Gemini's code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post‑processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands‑free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.
Authors: Zerui Yang, Yuwei Wan, Siyu Yan, Yudai Matsuda, Tong Xie, Bram Hoex, Linqi Song
Abstract: Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine‑tuning or retrieval‑augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi‑agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general‑purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent‑based collaboration, and feedback‑driven search mechanisms in advancing LLM applications for drug repositioning.
Authors: Seonghyun Park, Kiyoung Seong, Soojung Yang, Rafael Gómez-Bombarelli, Sungsoo Ahn
Abstract: Molecular dynamics is crucial for understanding molecular systems but its applicability is often limited by the vast timescales of rare events like protein folding. Enhanced sampling techniques overcome this by accelerating the simulation along key reaction pathways, which are defined by collective variables (CVs). However, identifying effective CVs that capture the slow, macroscopic dynamics of a system remains a major bottleneck. This work proposes a novel framework coined BioEmu‑CV that learns these essential CVs automatically from BioEmu, a recently proposed foundation model for generating protein equilibrium samples. In particular, we re‑purpose BioEmu to learn time‑lagged generation conditioned on the learned CV, i.e., predict the distribution of molecular states after a certain amount of time. This training process promotes the CV to encode only the slow, long‑term information while disregarding fast, random fluctuations. We validate our learned CV on fast‑folding proteins with two key applications: (1) estimating free energy differences using on‑the‑fly probability enhanced sampling and (2) sampling transition paths with steered molecular dynamics. Our empirical study also serves as a new systematic and comprehensive benchmark for MLCVs on fast‑folding proteins larger than Alanine Dipeptide.
Authors: Dong Xu, Zhangfan Yang, Sisi Yuan, Jenna Xinyi Yao, Jiangqiang Li, Junkai Ji
Abstract: Three‑dimensional molecular generators based on diffusion models can now reach near‑crystallographic accuracy, yet they remain fragmented across tasks. SMILES‑only inputs, two‑stage pretrain‑finetune pipelines, and one‑task‑one‑model practices hinder stereochemical fidelity, task alignment, and zero‑shot transfer. We introduce MODA, a diffusion framework that unifies fragment growing, linker design, scaffold hopping, and side‑chain decoration with a Bayesian mask scheduler. During training, a contiguous spatial fragment is masked and then denoised in one pass, enabling the model to learn shared geometric and chemical priors across tasks. Multi‑task training yields a universal backbone that surpasses six diffusion baselines and three training paradigms on substructure, chemical property, interaction, and geometry. Model‑C reduces ligand‑protein clashes and substructure divergences while maintaining Lipinski compliance, whereas Model‑B preserves similarity but trails in novelty and binding affinity. Zero‑shot de novo design and lead‑optimisation tests confirm stable negative Vina scores and high improvement rates without force‑field refinement. These results demonstrate that a single‑stage multi‑task diffusion routine can replace two‑stage workflows for structure‑based molecular design.
Authors: Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr, Fengqing Zhu
Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT‑4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food‑image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open‑weight and four closed‑weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain‑of‑Thought, Multimodal Chain‑of‑Thought, Scale Hint, Few‑Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context‑aware LMMs for improved nutrition analysis.
Authors: Hanqun Cao, Xinyi Zhou, Zijun Gao, Chenyu Wang, Xin Gao, Zhi Zhang, Cesar de la Fuente-Nunez, Chunbin Gu, Ge Liu, Pheng-Ann Heng
Abstract: Protein structure prediction often hinges on multiple sequence alignments (MSAs), which underperform on low‑homology and orphan proteins. We introduce PLAME, a lightweight MSA design framework that leverages evolutionary embeddings from pretrained protein language models to generate MSAs that better support downstream folding. PLAME couples these embeddings with a conservation‑diversity loss that balances agreement on conserved positions with coverage of plausible sequence variation. Beyond generation, we develop (i) an MSA selection strategy to filter high‑quality candidates and (ii) a sequence‑quality metric that is complementary to depth‑based measures and predictive of folding gains. On AlphaFold2 low‑homology/orphan benchmarks, PLAME delivers state‑of‑the‑art improvements in structure accuracy (e.g., lDDT/TM‑score), with consistent gains when paired with AlphaFold3. Ablations isolate the benefits of the selection strategy, and case studies elucidate how MSA characteristics shape AlphaFold confidence and error modes. Finally, we show PLAME functions as a lightweight adapter, enabling ESMFold to approach AlphaFold2‑level accuracy while retaining ESMFold‑like inference speed. PLAME thus provides a practical path to high‑quality folding for proteins lacking strong evolutionary neighbors.
Authors: Arjun Banerjee, David Martinez, Camille Dang, Ethan Tam
Abstract: Protein language models (PLMs) encode rich biological information, yet their internal neuron representations are poorly understood. We introduce the first automated framework for labeling every neuron in a PLM with biologically grounded natural language descriptions. Unlike prior approaches relying on sparse autoencoders or manual annotation, our method scales to hundreds of thousands of neurons, revealing individual neurons are selectively sensitive to diverse biochemical and structural properties. We then develop a novel neuron activation‑guided steering method to generate proteins with desired traits, enabling convergence to target biochemical properties like molecular weight and instability index as well as secondary and tertiary structural motifs, including alpha helices and canonical Zinc Fingers. We finally show that analysis of labeled neurons in different model sizes reveals PLM scaling laws and a structured neuron space distribution.
Authors: Yupu Zhang, Zelin Xu, Tingsong Xiao, Gustavo Seabra, Yanjun Li, Chenglong Li, Zhe Jiang
Abstract: Predicting the binding affinity of protein‑ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large‑scale and high‑quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self‑supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pre‑training graph neural network models based on vast unlabeled complexes and fine‑tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well‑defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large‑scale, structure‑aware dataset specifically designed for self‑supervised GCL on protein‑ligand complexes. DecoyDB consists of high‑resolution ground truth complexes (less than 2.5 Angstrom) and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal (negative pairs). Each decoy is annotated with a Root Mean Squared Deviation (RMSD) from the native pose. We further design a customized GCL framework to pre‑train graph neural networks based on DecoyDB and fine‑tune the models with labels from PDBbind. Extensive experiments confirm that models pre‑trained with DecoyDB achieve superior accuracy, label efficiency, and generalizability.
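The per‑decoy RMSD annotation mentioned in this abstract can be illustrated with a short sketch. It assumes both poses list the same atoms in the same order and share a coordinate frame, so no superposition or symmetry correction is applied; the paper's exact RMSD protocol is not stated in the abstract, and the name `pose_rmsd` is hypothetical.

```python
import math


def pose_rmsd(decoy, native):
    """Root mean squared deviation between two poses.

    decoy, native: equal-length lists of (x, y, z) atom coordinates,
    with atoms in matching order. No alignment is performed.
    """
    if len(decoy) != len(native) or not decoy:
        raise ValueError("poses must be non-empty and have the same atom count")
    # Sum of squared coordinate differences over all atoms and axes.
    squared = sum(
        (a - b) ** 2
        for p, q in zip(decoy, native)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(decoy))


# Toy two-atom pose shifted by one unit along z (made-up coordinates).
native_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd_value = pose_rmsd(decoy_pose, native_pose)
```

A rigid shift of every atom by one unit gives an RMSD of one unit, which is a convenient sanity check for any implementation.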
Authors: Chunhui Gu, Mohammad Sadegh Nasr, James P. Long, Kim-Anh Do, Ehsan Irajizad
Abstract: Graph Neural Networks (GNNs) often struggle with noisy edges. We propose Latent Space Constrained Graph Neural Networks (LSC‑GNN) to incorporate external "clean" links and guide embeddings of a noisy target graph. We train two encoders (one on the full graph, comprising target plus external edges, and another on a regularization graph excluding the target's potentially noisy links), then penalize discrepancies between their latent representations. This constraint steers the model away from overfitting spurious edges. Experiments on benchmark datasets show LSC‑GNN outperforms standard and noise‑resilient GNNs in graphs subjected to moderate noise. We extend LSC‑GNN to heterogeneous graphs and validate it on a small protein‑metabolite network, where metabolite‑protein interactions reduce noise in protein co‑occurrence data. Our results highlight LSC‑GNN's potential to boost predictive performance and interpretability in settings with noisy relational structures.
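The latent‑space constraint described in this abstract can be sketched as a task loss plus a weighted penalty on the disagreement between the two encoders' node embeddings. The mean squared distance used here, the weight `lam`, and the function names are assumptions chosen for illustration; the abstract does not specify the form of the discrepancy term.

```python
def latent_discrepancy(z_full, z_reg):
    """Mean squared Euclidean distance between paired node embeddings.

    z_full: embeddings from the encoder trained on the full graph.
    z_reg:  embeddings from the encoder trained on the regularization graph.
    Both are lists of equal-length float lists, one row per node.
    """
    if len(z_full) != len(z_reg) or not z_full:
        raise ValueError("embedding lists must be non-empty and aligned")
    total = 0.0
    for a, b in zip(z_full, z_reg):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(z_full)


def constrained_loss(task_loss, z_full, z_reg, lam=0.1):
    """Task loss plus lam times the latent discrepancy penalty."""
    return task_loss + lam * latent_discrepancy(z_full, z_reg)


# Toy embeddings for two nodes (made-up values).
z_a = [[1.0, 0.0], [0.0, 1.0]]
z_b = [[1.0, 0.0], [0.0, 0.0]]
penalty = latent_discrepancy(z_a, z_b)
total_loss = constrained_loss(2.0, z_a, z_b, lam=0.5)
```

Raising `lam` pulls the two latent spaces closer together, at the cost of giving the task loss less influence.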
Authors: Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis
Abstract: Training‑free diffusion guidance offers a flexible framework for leveraging off‑the‑shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low‑density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training‑free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback‑Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low‑density regions. Our experiments on diverse image‑guidance tasks and on challenging small‑ligand sampling for protein docking suggest that SDG consistently outperforms standard training‑free guidance methods and highlights its potential for broader posterior sampling problems beyond high‑density regimes.
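For readers unfamiliar with the posterior approximation this abstract refers to, Tweedie's formula can be written out under a standard Gaussian forward process. The notation below ($\alpha_t$, $\sigma_t$, $\hat{x}_0$, and the condition $y$) is an assumption for illustration and may differ from the paper's own.

```latex
% Assumed forward process: x_t = \alpha_t x_0 + \sigma_t \varepsilon,
% with \varepsilon \sim \mathcal{N}(0, I).
\begin{align}
  \hat{x}_0(x_t) \;=\; \mathbb{E}[x_0 \mid x_t]
    &= \frac{x_t + \sigma_t^{2}\,\nabla_{x_t} \log p_t(x_t)}{\alpha_t}, \\
  % Common training-free guidance approximation: evaluate the
  % off-the-shelf classifier at the posterior mean only.
  p(y \mid x_t) &\approx p\bigl(y \mid \hat{x}_0(x_t)\bigr).
\end{align}
```

The second line replaces the full posterior over $x_0$ with a point estimate, which is the kind of approximation that can become unreliable in low‑density regions.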
Authors: Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Haotian Ye, Huayu Chen, Yisong Yue, Stefano Ermon
Abstract: Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per‑step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12% improvement over the most competitive RL‑based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA‑8B‑Instruct for language modeling.
Authors: Rik S. Breebaart, Gianmarco Lazzeri, Roberto Covino, Peter G. Bolhuis
Abstract: Understanding mechanisms of rare but important events in complex molecular systems, such as protein folding or ligand (un)binding, requires accurately mapping transition paths from an initial to a final state. The committor is the ideal reaction coordinate for this purpose, but calculating it for high‑dimensional, nonlinear systems has long been considered intractable. Here, we introduce an iterative path sampling strategy for computing the committor function for systems with high free energy barriers. We start with an initial guess to define isocommittor interfaces for transition interface sampling. The resulting path ensemble is then reweighted and used to train a neural network, yielding a more accurate committor model. This process is repeated until convergence, effectively solving the long‑standing circular problem in enhanced sampling where a good reaction coordinate is needed to generate efficient sampling, and vice‑versa. The final, converged committor model can be interrogated to extract mechanistic insights. We demonstrate the power of our method on a benchmark 2D potential and a more complex host‑guest (un)binding process in explicit solvent.
Authors: Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang
Abstract: Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end‑to‑end explainable models for structure‑activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited number of compound‑protein interaction activity data for specific protein targets, and plenty of subtle changes in molecular configuration sites significantly affecting molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework by implementing graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound‑protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure‑aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto‑oncogene tyrosine‑protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph‑level global direction scores and improving atom‑level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
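To make the regularizers named above concrete, here is a minimal sketch of a sparse group lasso penalty in its common textbook form: an L1 term over all weights plus a size-weighted L2 norm per group. The abstract does not specify how groups are defined over molecular subgraphs or how the terms are weighted, so the index groups, the `sqrt(group size)` weighting, and the parameter names `lam_l1` and `lam_group` are assumptions.

```python
import numpy as np

def sparse_group_lasso_penalty(weights, groups, lam_l1, lam_group):
    """Return lam_l1 * ||w||_1 + lam_group * sum_g sqrt(|g|) * ||w_g||_2.

    groups is a list of index lists (hypothetical stand-ins for subgraphs).
    Setting lam_l1 = 0 recovers the plain group lasso penalty.
    """
    weights = np.asarray(weights, dtype=float)
    l1_term = np.sum(np.abs(weights))
    group_term = sum(
        np.sqrt(len(g)) * np.linalg.norm(weights[list(g)]) for g in groups
    )
    return float(lam_l1 * l1_term + lam_group * group_term)

# Two groups of two weights each; the second group is entirely zero.
penalty = sparse_group_lasso_penalty(
    [3.0, 4.0, 0.0, 0.0], groups=[[0, 1], [2, 3]], lam_l1=0.5, lam_group=1.0
)
```

The group term drives whole groups to zero (pruning a subgraph), while the L1 term additionally sparsifies weights inside the groups that survive.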
Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary
Abstract: Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low‑dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end‑to‑end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low‑dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system's slow, important degrees of freedom while the generative model accurately reproduces the system's equilibrium behavior. We demonstrate LaTF's effectiveness across diverse systems, including a model potential, the Chignolin protein, and a cluster of Lennard‑Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to an RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature‑dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.
Authors: Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani
Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early‑stage computational drug discovery pipeline. By combining LLM reasoning with domain‑specific tools, the framework performs biomedical data retrieval, literature‑grounded question answering via retrieval‑augmented generation, molecular generation, multi‑property prediction, property‑aware molecular refinement, and 3D protein‑ligand structure generation. The agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 75 properties, including ADMET‑related and general physicochemical descriptors, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug‑likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz‑2 to generate 3D protein‑ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI‑assisted therapeutic discovery.
Authors: Haoran Zhang, Mingyuan Zhou, Wesley Tansey
Abstract: Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high‑dimensional, multi‑channel data with strong spatial alignment and complex inter‑channel relationships. Generative modeling of such data requires jointly capturing intra‑ and inter‑channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion‑based models generally assume low‑dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter‑channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi‑resolution conditioning on spatially aligned channels, and (2) a combination of latent‑space and output‑space channel‑wise attention to capture inter‑channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state‑of‑the‑art performance across both spatial and non‑spatial prediction tasks, including protein imputation in IMC and gene‑to‑protein prediction in single‑cell datasets, and show strong generalization to unseen conditional configurations.
Authors: Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang, Renjing Xu
Abstract: Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein‑Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein‑protein interaction (PPI) prediction, where protein sequences and their hierarchical attributes are aligned through multi‑tiered biological representation matching. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. The framework adaptively incorporates domain and family knowledge through a data‑driven penalty mechanism, enforcing consistency between the learned embedding space and the intrinsic hierarchy of protein functions. Experiments on benchmark datasets demonstrate that HIPPO achieves state‑of‑the‑art performance, outperforming existing methods and showing robustness in low‑data regimes. Notably, the model demonstrates strong zero‑shot transferability to other species without retraining, enabling reliable PPI prediction and functional inference even in less characterized or rare organisms where experimental data are limited. Further analysis reveals that hierarchical feature fusion is critical for capturing conserved interaction determinants, such as binding motifs and functional annotations. This work advances cross‑species PPI prediction and provides a unified framework for interaction prediction in scenarios with sparse or imbalanced multi‑species data.
Authors: Antoine Honoré, Borja Rodríguez Gálvez, Yoomi Park, Yitian Zhou, Volker M. Lauschke, Ming Xiao
Abstract: Variant effect predictors (VEPs) aim to assess the functional impact of protein variants, traditionally relying on multiple sequence alignments (MSAs). This approach assumes that naturally occurring variants are fit, an assumption challenged by pharmacogenomics, where some pharmacogenes experience low evolutionary pressure. Deep mutational scanning (DMS) datasets provide an alternative by offering quantitative fitness scores for variants. In this work, we propose a transformer‑based matrix variational auto‑encoder (matVAE) with a structured prior and evaluate its performance on 33 DMS datasets corresponding to 26 drug target and ADME proteins from the ProteinGym benchmark. Our model trained on MSAs (matVAE‑MSA) outperforms the state‑of‑the‑art DeepSequence model in zero‑shot prediction on DMS datasets, despite using an order of magnitude fewer parameters and requiring less computation at inference time. We also compare matVAE‑MSA to matENC‑DMS, a model of similar capacity trained on DMS data, and find that the latter performs better on supervised prediction tasks. Additionally, incorporating AlphaFold‑generated structures into our transformer model further improves performance, achieving results comparable to DeepSequence trained on MSAs and finetuned on DMS. These findings highlight the potential of DMS datasets to replace MSAs without significant loss in predictive performance, motivating further development of DMS datasets and exploration of their relationships to enhance variant effect prediction.
Authors: Shakya Jayakody, Youpeng Zhao, Jun Wang
Abstract: Graph convolutional networks (GCNs) are fundamental in various scientific applications, ranging from biomedical protein‑protein interactions (PPI) to large‑scale recommendation systems. An essential component for modeling graph structures in GCNs is sparse general matrix‑matrix multiplication (SpGEMM). As the size of graph data continues to scale up, SpGEMMs are often conducted in an out‑of‑core fashion due to limited GPU memory space in resource‑constrained systems. Despite recent efforts to alleviate the memory constraints of out‑of‑core SpGEMM through GPU feature caching, hybrid CPU‑GPU memory layout, or computation in sparse format, current systems suffer from both high I/O latency and GPU under‑utilization.
In this paper, we first identify the problems of existing systems, where sparse format data alignment and memory allocation are the main performance bottlenecks, and propose AIRES, a novel algorithm‑system co‑design solution to accelerate out‑of‑core SpGEMM computation for GCNs. Specifically, from the algorithm angle, AIRES proposes to alleviate the data alignment issues on the block level for matrices in sparse formats and develops a tiling algorithm to facilitate row block‑wise alignment. On the system level, AIRES employs a three‑phase dynamic scheduling that features a dual‑way data transfer strategy utilizing a tiered memory system: integrating GPU memory, GPU Direct Storage (GDS), and host memory to reduce I/O latency and improve throughput. Evaluations show that AIRES significantly outperforms the state‑of‑the‑art methods, achieving up to 1.8x lower latency in real‑world graph processing benchmarks.
Authors: Miriam Jäger, Steffen Wolf
Abstract: Finding process pathways in molecular simulations such as the unbinding paths of small molecule ligands from their binding sites at protein targets in a set of trajectories via unsupervised learning approaches requires the definition of a suitable similarity measure between trajectories. We here evaluate the performance of four such measures with varying degrees of sophistication, i.e., Euclidean and Wasserstein distances, Procrustes analysis and dynamic time warping, when analyzing trajectory data from two different biased simulation driving protocols in the form of constant velocity constraint targeted MD and steered MD. In a streptavidin‑biotin benchmark system with known ground truth clusters, Wasserstein distances yielded the best clustering performance, closely followed by Euclidean distances, both being the most computationally efficient similarity measures. In a more complex A2a receptor‑inhibitor system, however, the simplest measure, i.e., Euclidean distances, was sufficient to reveal meaningful and interpretable clusters.
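As an illustration of one of the four similarity measures, the following is a minimal sketch of the classic dynamic-programming form of dynamic time warping between two trajectories. It assumes a Euclidean frame-to-frame cost and no warping-window constraint; the function name `dtw_distance` and the toy inputs are illustrative and not taken from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between trajectories a and b.

    Each trajectory is a sequence of frames; 1D inputs are treated as
    sequences of scalar frames. The two lengths may differ.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame cost
            # Extend the cheapest of: skip in a, skip in b, or advance both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A time-stretched copy of the same path has zero DTW distance.
same_path = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

Unlike a frame-by-frame Euclidean distance, this measure tolerates trajectories of different lengths and speeds, at the price of a quadratic cost in the number of frames.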
Authors: Guanhao Huang, Chang Jin, Sophie Weiyi Ding, Chaoshen Zhang, Aaron M. Day, Tobias Elbs, Neil Sinclair, Sukhad Dnyanesh Joshi, Rodrick Kuate Defo, Bertrand I. Halperin, Evelyn Hu, Marko Lončar
Abstract: From gravitational‑wave detection, protein force microscopy, to exploration of quantum‑classical boundaries, many anticipated discoveries in fundamental science require improving measurement sensitivity limits. Through the fluctuation‑dissipation theorem, mechanical dissipation sets the acoustic noise for this limit. Yet, even in high‑purity crystals, the microscopic mechanisms responsible for the acoustic loss remain poorly understood. Tension‑induced dissipation dilution offers a route to ultralow acoustic loss, but is challenging to implement in crystalline materials including single‑crystal diamond. Here we realize a strain‑engineered diamond nanomechanical platform using a liquid‑assisted van der Waals self‑assembly process that harnesses intrinsic surface forces to apply tensile stress exceeding 1 GPa. At cryogenic temperatures these resonators achieve quality factors beyond 10 billion (intrinsic material quality factors beyond 100 million). This exceptional coherence turns them into a sensitive probe for residual dissipation, elucidating three distinct two‑level‑system channels and one topological dissipation channel from a surface superfluid helium film. Our work shows how advancing mechanical coherence opens access to new regimes of physics in hybrid quantum systems, precision metrology, and condensed‑matter physics.
Authors: Don Roosan, Rubayat Khan, Saif Nirzhor, Tiffany Khou, Fahmida Hai
Abstract: The rapid expansion of biomolecular datasets presents significant challenges for computational biology. Quantum computing emerges as a promising solution to address these complexities. This study introduces a novel quantum framework for analyzing TART‑T and TART‑C gene data by integrating genomic and structural information. Leveraging a Quantum Neural Network (QNN), we classify hotspot mutations, utilizing quantum superposition to uncover intricate relationships within the data. Additionally, a Variational Quantum Eigensolver (VQE) is employed to estimate molecular ground‑state energies through a hybrid classical‑quantum approach, overcoming the limitations of traditional computational methods. Implemented using IBM Qiskit, our framework demonstrates high accuracy in both mutation classification and energy estimation on current Noisy Intermediate‑Scale Quantum (NISQ) devices. These results underscore the potential of quantum computing to advance the understanding of gene function and protein structure. Furthermore, this research serves as a foundational blueprint for extending quantum computational methods to other genes and biological systems, highlighting their synergy with classical approaches and paving the way for breakthroughs in drug discovery and personalized medicine.
Authors: Ignacio Gustin, Chang Woo Kim, Ignacio Franco
Abstract: Determining how energy flows within and between molecules is crucial for understanding chemical reactions, material properties, and even vital processes such as photosynthesis. While the general principles of energy transfer are well established, elucidating the specific molecular pathways by which energy is funneled remains challenging as it requires tracking energy flow in complex molecular environments. Here, we demonstrate how photon excitation energy is partially dissipated in the light‑harvesting Fenna‑Matthews‑Olson (FMO) complex, mediating the excitation energy transfer from light‑harvesting chlorosomes to the photosynthetic reaction center in green sulfur bacteria. Specifically, we isolate the contribution of the protein and specific vibrational modes of the pigment molecules to the energy dynamics. For this, we introduce an efficient computational implementation of a recently proposed theory of dissipation pathways for open quantum systems. Using it and a state‑of‑the‑art FMO model with highly structured and chromophore‑specific spectral densities, we demonstrate that energy dissipation is dominated by low‑frequency modes (< 800 cm^‑1) as their energy range is near‑resonance with the energy gaps between electronic states of the pigments. We identify the most important mode for dissipation to be in‑plane breathing modes (~200 cm^‑1) of the bacteriochlorophylls in the complex. Conversely, far‑detuned intramolecular vibrations with higher frequencies (> 800 cm^‑1) play no role in dissipation. Interestingly, the FMO complex first needs to borrow energy from the environment to release excess photonic energy, making the energy dissipation dynamics non‑monotonic. Beyond their fundamental value, these insights can guide the development of artificial light‑harvesting devices and, more broadly, engineer environments for chemical and quantum control tasks.
Authors: Xu Pin, Cui Jingyu, Cheng Zhi, Simon Chi-Chin Shiu, Cui Jingxian, Li Yujian, Liu Yifan, Wang Lin, Ryan Ho Ping Siu, Julian A. Tanner, Yu Changyuan
Abstract: Optical fiber sensing carries a number of potential advantages for diagnostics and biomarker detection and monitoring, yet particular challenges persist in linking molecular recognition events to a change in the refractive index. DNA aptamers carry particular advantages as functional surface molecules on optical fibers to tailor detection of specific biomolecules, yet challenges persist around sensitivity and specificity. Diagnosis of COVID‑19 through detection of nucleocapsid protein (N protein) of SARS‑CoV‑2 provides a classic diagnostic challenge where optical fiber‑based sensing could complement and improve on typical detection methods such as RT‑PCR and rapid antigen testing. In this study, a plasmonic gold‑coated tilted fiber Bragg grating (TFBG)‑based optical biosensing platform was developed for ultrasensitive detection of SARS‑CoV‑2 N protein. By functionalizing the optical fiber surface with aptamers for the molecular recognition of N protein, changes in refractive index measured biomolecular binding, thereby achieving real‑time, label‑free detection. Additionally, integrating DNA nanostructures such as the DNA tetrahedron with aptamers significantly enhanced detection sensitivity, increasing signal intensity ~2.5 times compared to aptamers alone. This study provides new insights into the development of high‑performance optical fiber sensing platforms which integrate DNA nanostructure interfaces to facilitate biomarker recognition and sensing.
Authors: Anran Liu, Xiaofei Wang, Jing Cai, Chao Li
Abstract: Hematoxylin and eosin (H&E) staining visualizes histology but lacks specificity for diagnostic markers. Immunohistochemistry (IHC) staining provides protein‑targeted staining but is restricted by tissue availability and antibody specificity. Virtual staining, i.e., computationally translating the H&E image to its IHC counterpart while preserving the tissue structure, is promising for efficient IHC generation. Existing virtual staining methods still face key challenges: 1) effective decomposition of staining style and tissue structure, 2) controllable staining process adaptable to diverse tissue and proteins, and 3) rigorous structural consistency modelling to handle the non‑pixel‑aligned nature of paired H&E and IHC images. This study proposes a mutual‑information (MI)‑guided score‑based diffusion model for unpaired virtual staining. Specifically, we design 1) a global MI‑guided energy function that disentangles the tissue structure and staining characteristics across modalities, 2) a novel timestep‑customized reverse diffusion process for precise control of the staining intensity and structural reconstruction, and 3) a local MI‑driven contrastive learning strategy to ensure the cellular‑level structural consistency between H&E‑IHC images. Extensive experiments demonstrate the superiority of our method over state‑of‑the‑art approaches, highlighting its biomedical potential. Codes will be open‑sourced upon acceptance.
Authors: D. Jasuja, P. J. Atzberger
Abstract: We investigate proteins within heterogeneous cell membranes where non‑equilibrium phenomena arise from spatial variations in concentration and temperature. We develop simulation methods building on non‑equilibrium statistical mechanics to obtain stochastic hybrid continuum‑discrete descriptions which track individual protein dynamics, spatially varying concentration fluctuations, and thermal exchanges. We investigate biological mechanisms for protein positioning and patterning within membranes and factors in thermal gradient sensing. We also study the kinetics of Brownian motion of particles with temperature variations within energy landscapes arising from heterogeneous microstructures within membranes. The introduced approaches provide self‑consistent models for studying biophysical mechanisms involving the drift‑diffusion dynamics of individual proteins and energy exchanges and fluctuations between the thermal and mechanical parts of the system. The methods also can be used for studying related non‑equilibrium effects in other biological systems and soft materials.
Authors: Yuqi Zhang, Yuxin Yang, William Martin, Kingsten Lin, Zixu Wang, Cheng-Chang Lu, Weiwen Jiang, Ruth Nussinov, Joseph Loscalzo, Qiang Guan, Feixiong Cheng
Abstract: Accurate prediction of protein active‑site structures remains a central challenge in structural biology, particularly for short and flexible peptide fragments where conventional and simulation‑based methods often fail. Here, we present a quantum computing framework specifically developed for utility‑level quantum processors to address this problem. Starting from an amino acid sequence, we formulate structure prediction as a ground‑state energy minimization problem using the Variational Quantum Eigensolver (VQE). Amino acid connectivity is encoded on a tetrahedral lattice model, and structural constraints (including steric, geometric, and chirality terms) are mapped into a problem‑specific Hamiltonian represented as sparse Pauli operators. Optimization is performed with a two‑stage architecture that separates energy estimation from measurement decoding, enabling noise mitigation under realistic device conditions. We evaluate the framework on 23 randomly selected protein fragments from the PDBbind dataset and 7 fragments from therapeutically relevant proteins, and execute experiments on the IBM‑Cleveland Clinic quantum processor. Predictions are benchmarked against AlphaFold 3 (AF3) and classical simulation‑based approaches using identical postprocessing and docking procedures. Our method outperforms both AF3 and classical baselines in RMSD (root‑mean‑square deviation) and docking efficacy. These results demonstrate an end‑to‑end, hardware‑executable pipeline for biologically relevant structure prediction on real quantum processors, highlighting its engineering feasibility and practical advantages over existing classical and deep learning approaches.
Authors: Gonen Golani, Manas Seal, Mrityunjoy Kar, Anthony A. Hyman, Daniella Goldfarb, Samuel Safran
Abstract: The observation of Liquid‑Liquid Phase Separation (LLPS) in biological cells has dramatically shifted the paradigm that soluble proteins are uniformly dispersed in the cytoplasm or nucleoplasm. The LLPS region is preceded by a one‑phase solution, where recent experiments have identified clusters in an aqueous solution with 10^2‑10^3 proteins. Here, we theoretically consider a core‑shell model with mesoscale core, surface, and bending properties of the cluster shell and contrast two experimental paradigms for the measured cluster size distributions of the Cytoplasmic Polyadenylation Element Binding‑4 (CPEB4) and Fused in Sarcoma (FUS) proteins. The fits to the theoretical model and earlier electron paramagnetic resonance (EPR) experiments suggest that the same protein may exhibit hydrophilic, hydrophobic, and amphiphilic conformations, which act to stabilize the clusters. We find that CPEB4 clusters are much more stable compared to FUS clusters, which are less energetically favorable. This suggests that in CPEB4, LLPS consists of large‑scale aggregates of clusters, while for FUS, clusters coalesce to form micron‑scale LLPS domains.
Authors: Seongyu Park, Inho Yang, Jinseob Lee, Sinwoo Kim, Juana Martín-López, Richard Fishel, Jong-Bong Lee, Jae-Hyung Jeon
Abstract: DNA mismatch repair (MMR) is the essential mechanism for preserving genomic integrity in various living organisms. In this process, MutS homologs (MSH) play crucial roles in identifying mismatched basepairs and recruiting downstream MMR proteins. The MSH protein exhibits distinct functions and diffusion dynamics before and after the recognition of mismatches while traversing along DNA. An ADP‑bound MSH, known as the MSH searching clamp, scans DNA sequences via rotational diffusion along the DNA backbone. Upon recognizing a mismatch, the MSH combines with ATP molecules, forming a stable sliding clamp. Recent experimental evidence challenges the conventional view that the sliding clamp performs a simple Brownian motion. In this study, we explore the diffusion dynamics of the ATP‑bound MSH sliding clamp through single‑particle tracking experiments and introduce a Bayesian single‑trajectory modeling framework to analyze its motion. Our quantitative analysis reveals that the diffusion characteristics defy explanation by a single‑state diffusion mechanism. Instead, our in‑depth model inference uncovers three distinct diffusion states, each characterized by specific diffusion coefficients. These states alternate over time, with cross‑state transitions predominantly involving one intermediate state, and direct transitions between the slowest and the fastest states being scarce. We propose that these multi‑state dynamics reflect underlying conformational changes in the MSH sliding clamp, highlighting a more intricate diffusion mechanism than previously appreciated.
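For context on what a single-state description would look like, here is a minimal sketch of the textbook estimator of a diffusion coefficient from a 1D Brownian trajectory, using the mean squared displacement per step, D = <Δx²> / (2Δt). This is the simple baseline that the abstract reports as insufficient, not the Bayesian multi-state framework of the paper; the function name, the simulated trajectory, and all numerical values are assumptions.

```python
import numpy as np

def estimate_diffusion_coefficient(positions, dt):
    """Single-state estimate D = mean(step**2) / (2 * dt) for 1D positions."""
    steps = np.diff(np.asarray(positions, dtype=float))
    return float(np.mean(steps ** 2) / (2.0 * dt))

# Simulated free 1D Brownian motion with a known coefficient (toy values).
rng = np.random.default_rng(1)
true_d, dt = 2.0, 0.1
increments = rng.normal(scale=np.sqrt(2.0 * true_d * dt), size=100_000)
trajectory = np.cumsum(increments)
d_hat = estimate_diffusion_coefficient(trajectory, dt)
```

If a trajectory switches between states with different diffusion coefficients, this estimator returns only a time-weighted average, which is one reason a state-resolved model is needed to separate the states.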
Authors: Alexander D. Kalian, Jaewook Lee, Stefan P. Johannesson, Lennart Otte, Christer Hogstrand, Miao Guo
Abstract: The global demand for sustainable protein sources has accelerated the need for intelligent tools that can rapidly process and synthesise domain‑specific scientific knowledge. In this study, we present a proof‑of‑concept multi‑agent Artificial Intelligence (AI) framework designed to support sustainable protein production research, with an initial focus on microbial protein sources. Our Retrieval‑Augmented Generation (RAG)‑oriented system consists of two GPT‑based LLM agents: (1) a literature search agent that retrieves relevant scientific literature on microbial protein production for a specified microbial strain, and (2) an information extraction agent that processes the retrieved content to extract relevant biological and chemical information. Two parallel methodologies, fine‑tuning and prompt engineering, were explored for agent optimisation. Both methods demonstrated effectiveness at improving the performance of the information extraction agent in terms of transformer‑based cosine similarity scores between obtained and ideal outputs. Mean cosine similarity scores were increased by up to 25%, while universally reaching mean scores of ≥ 0.89 against ideal output text. Fine‑tuning overall improved the mean scores to a greater extent (consistently ≥ 0.94) compared to prompt engineering, although lower statistical uncertainties were observed with the latter approach. A user interface was developed and published for enabling the use of the multi‑agent AI system, alongside preliminary exploration of additional chemical safety‑based search capabilities.
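The evaluation metric above can be sketched as follows, assuming that the obtained and ideal output texts have already been embedded into fixed-length vectors by some transformer encoder. The abstract does not say which encoder was used, so the vectors below are hypothetical placeholders rather than real embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings (assumption): stand-ins for encoder outputs.
obtained_embedding = [1.0, 1.0]
ideal_embedding = [1.0, 0.0]
score = cosine_similarity(obtained_embedding, ideal_embedding)
```

A score of 1 means the two embeddings point in the same direction regardless of their magnitudes, so the metric compares the content captured by the encoder rather than text length.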
Authors: Jakob Günther, Thomas Weymuth, Moritz Bensberg, Freek Witteveen, Matthew S. Teynor, F. Emil Thomasen, Valentina Sora, William Bro-Jørgensen, Raphael T. Husistein, Mihael Erakovic, Marek Miller, Leah Weisburn, Minsik Cho, Marco Eckhoff, Aram W. Harrow, Anders Krogh, Troy Van Voorhis, Kresten Lindorff-Larsen, Gemma Solomon, Markus Reiher, Matthias Christandl
Abstract: Free energy calculations are at the heart of physics‑based analyses of biochemical processes. They allow us to quantify molecular recognition mechanisms, which determine a wide range of biological phenomena from how cells send and receive signals to how pharmaceutical compounds can be used to treat diseases. Quantitative and predictive free energy calculations require computational models that accurately capture both the varied and intricate electronic interactions between molecules as well as the entropic contributions from motions of these molecules and their aqueous environment. However, accurate quantum‑mechanical energies and forces can only be obtained for small atomistic models, not for large biomacromolecules. Here, we demonstrate how to consistently link accurate quantum‑mechanical data obtained for substructures to the overall potential energy of biomolecular complexes by machine learning in an integrated algorithm. We do so using a two‑fold quantum embedding strategy where the innermost quantum cores are treated at a very high level of accuracy. We demonstrate the viability of this approach for the molecular recognition of a ruthenium‑based anticancer drug by its protein target, applying traditional quantum chemical methods. As such methods scale unfavorably with system size, we analyze requirements for quantum computers to provide highly accurate energies that impact the resulting free energies. Once the requirements are met, our computational pipeline FreeQuantum is able to make efficient use of the quantum computed energies, thereby enabling quantum computing enhanced modeling of biochemical processes. This approach combines the exponential speedups of quantum computers for simulating interacting electrons with modern classical simulation techniques that incorporate machine learning to model large molecules.
Authors: Hanna-Friederike Poggemann, Sabrina Klopsch, Simon Homann, Tim Dirks, Sina Schäkermann, Julia E. Bandow, Timo Jacob, Christoph Jung
Abstract: Biocatalysis is an emerging field that provides an environmentally friendly alternative to conventional catalysis, but it still faces some challenges. One of the major difficulties for biocatalysts that require reactive species like H2O2 as co‑substrates lies in the concentration of these reactive species. On the one hand, they are used as reactants, but on the other hand, they inactivate the enzymes at high concentrations. When utilizing non‑thermal plasma to deliver H2O2 for biocatalysis, it is essential to understand the potential interactions between plasma‑generated species (PGS) and enzymes. This is particularly important because, alongside H2O2, other reactive species such as hydroxyl radicals, atomic oxygen, superoxide, and nitric oxide are also produced. The investigation of the localized reactivity of the solvent accessible surface area (SASA) of an enzyme with certain species is an important tool for predicting these interactions. In combination with reactive molecular dynamics (MD) simulations, this enabled us to identify amino acid residues that are likely targets for modifications by the PGS. A subset of the theoretical predictions made in the present study were confirmed experimentally by mass spectrometry, underlining the utility of the SASA and MD based screening approach to direct time‑consuming experiments and assist their interpretation.
Authors: Ahmet Sarigun, Bora Uyar, Vedran Franke, Altuna Akalin
Abstract: Sampling physically valid ligand‑binding poses remains a major challenge in molecular docking, particularly for unseen or structurally diverse targets. We introduce PocketVina, a fast and memory‑efficient, search‑based docking framework that combines pocket prediction with systematic multi‑pocket exploration. We evaluate PocketVina across four established benchmarks, namely PDBbind2020 (timesplit and unseen), DockGen, Astex, and PoseBusters, and observe consistently strong performance in sampling physically valid docking poses. PocketVina achieves state‑of‑the‑art performance when jointly considering ligand RMSD and physical validity (PB‑valid), while remaining competitive with deep learning‑based approaches in terms of RMSD alone, particularly on structurally diverse and previously unseen targets. PocketVina also maintains state‑of‑the‑art physically valid docking accuracy across ligands with varying degrees of flexibility. We further introduce TargetDock‑AI, a benchmarking dataset we curated, consisting of over 500000 protein‑ligand pairs, and a partition of the dataset labeled with PubChem activity annotations. On this large‑scale dataset, PocketVina successfully discriminates active from inactive targets, outperforming a deep learning baseline while requiring significantly less GPU memory and runtime. PocketVina offers a robust and scalable docking strategy that requires no task‑specific training and runs efficiently on standard GPUs, making it well‑suited for high‑throughput virtual screening and structure‑based drug discovery.
Authors: Junjie Xu, Jiahao Zhang, Mangal Prakash, Xiang Zhang, Suhang Wang
Abstract: Geometric graph neural networks (GNNs) that respect E(3) symmetries have achieved strong performance on small molecule modeling, but they face scalability and expressiveness challenges when applied to large biomolecules such as RNA and proteins. These systems require models that can simultaneously capture fine‑grained atomic interactions, long‑range dependencies across spatially distant components, and biologically relevant hierarchical structure, such as atoms forming residues, which in turn form higher‑order domains. Existing geometric GNNs, which typically operate exclusively in either Euclidean or Spherical Harmonics space, are limited in their ability to capture both the fine‑scale atomic details and the long‑range, symmetry‑aware dependencies required for modeling the multi‑scale structure of large biomolecules. We introduce DualEquiNet, a Dual‑Space Hierarchical Equivariant Network that constructs complementary representations in both Euclidean and Spherical Harmonics spaces to capture local geometry and global symmetry‑aware features. DualEquiNet employs bidirectional cross‑space message passing and a novel Cross‑Space Interaction Pooling mechanism to hierarchically aggregate atomic features into biologically meaningful units, such as residues, enabling efficient and expressive multi‑scale modeling for large biomolecular systems. DualEquiNet achieves state‑of‑the‑art performance on multiple existing benchmarks for RNA property prediction and protein modeling, and outperforms prior methods on two newly introduced 3D structural benchmarks demonstrating its broad effectiveness across a range of large biomolecule modeling tasks.
Authors: Felix Faltings, Hannes Stark, Regina Barzilay, Tommi Jaakkola
Abstract: We develop ProxelGen, a protein structure generative model that operates on 3D densities as opposed to the prevailing 3D point cloud representations. Representing proteins as voxelized densities, or proxels, enables new tasks and conditioning capabilities. We generate proteins encoded as proxels via a 3D CNN‑based VAE in conjunction with a diffusion model operating on its latent space. Compared to state‑of‑the‑art models, ProxelGen's samples achieve higher novelty, better FID scores, and the same level of designability as the training set. ProxelGen's advantages are demonstrated in a standard motif scaffolding benchmark, and we show how 3D density‑based generation allows for more flexible shape conditioning.
Authors: Anton Klimek, Benjamin A. Dalton, Lucas Tepper, Roland R. Netz
Abstract: Proteins often exhibit subdiffusive configurational dynamics. The origins of this subdiffusion are still unresolved. We investigate the impact of non‑Markovian friction and the free energy landscape on the dynamics of fast‑folding proteins in terms of the mean squared displacement (MSD) and the mean first‑passage‑time (MFPT) of the folding reaction coordinate. We find the friction memory kernel from published molecular dynamics (MD) simulations to be well‑described by a hierarchical multi‑exponential function, which gives rise to subdiffusion in the MSD over a finite range of time. We show that friction memory effects in fast‑folding proteins dominate the scaling behavior of the MSD compared to effects due to the folding free energy landscape. As a consequence, Markovian models are insufficient for capturing the folding dynamics, as quantified by the MSD and the MFPT, even when including coordinate‑dependent friction. Our results demonstrate the importance of memory effects in protein folding and conformational dynamics and explicitly show that subdiffusion in fast‑folding protein dynamics originates from memory effects, not from the free energy landscape and not from coordinate‑dependent friction.
Authors: Zhiwei Nie, Hongyu Zhang, Hao Jiang, Yutian Liu, Xiansong Huang, Fan Xu, Jie Fu, Zhixiang Ren, Yonghong Tian, Wen-Bin Zhang, Jie Chen
Abstract: Understanding and modeling enzyme‑substrate interactions is crucial for catalytic mechanism research, enzyme engineering, and metabolic engineering. Although a large number of predictive methods have emerged, they do not incorporate prior knowledge of enzyme catalysis to rationally modulate general protein‑molecule features that are misaligned with catalytic patterns. To address this issue, we introduce a two‑stage progressive framework, OmniESI, for enzyme‑substrate interaction prediction through conditional deep learning. By decomposing the modeling of enzyme‑substrate interactions into a two‑stage progressive process, OmniESI incorporates two conditional networks that respectively emphasize enzymatic reaction specificity and crucial catalysis‑related interactions, facilitating a gradual feature modulation in the latent space from the general protein‑molecule domain to the catalysis‑aware domain. On top of this unified architecture, OmniESI can adapt to a variety of downstream tasks, including enzyme kinetic parameter prediction, enzyme‑substrate pairing prediction, enzyme mutational effect prediction, and enzymatic active site annotation. Under a multi‑perspective performance evaluation in in‑distribution and out‑of‑distribution settings, OmniESI consistently outperformed state‑of‑the‑art specialized methods across seven benchmarks. More importantly, the proposed conditional networks were shown to internalize the fundamental patterns of catalytic efficiency while significantly improving prediction performance, with only negligible parameter increases (0.16%), as demonstrated by ablation studies on key components. Overall, OmniESI represents a unified predictive approach for enzyme‑substrate interactions, providing an effective tool for elucidating catalytic mechanisms and for enzyme engineering, with strong generalization and broad applicability.
Authors: Chunan Liu, Aurelien Pelissier, Yanjun Shao, Lilian Denzler, Andrew C. R. Martin, Brooks Paige, María Rodríguez Martínez
Abstract: Accurate prediction of antibody‑antigen (Ab‑Ag) binding affinity is essential for therapeutic design and vaccine development, yet the performance of current models is limited by noisy experimental labels, heterogeneous assay conditions, and poor generalization across the vast antibody and antigen sequence space. We introduce AbRank, a large‑scale benchmark and evaluation framework that reframes affinity prediction as a pairwise ranking problem. AbRank aggregates over 380,000 binding assays from nine heterogeneous sources, spanning diverse antibodies, antigens, and experimental conditions, and introduces standardized data splits that systematically increase distribution shift, from local perturbations such as point mutations to broad generalization across novel antigens and antibodies. To ensure robust supervision, AbRank defines an m‑confident ranking framework by filtering out comparisons with marginal affinity differences, focusing training on pairs with at least an m‑fold difference in measured binding strength. As a baseline for the benchmark, we introduce WALLE‑Affinity, a graph‑based approach that integrates protein language model embeddings with structural information to predict pairwise binding preferences. Our benchmarks reveal significant limitations in current methods under realistic generalization settings and demonstrate that ranking‑based training improves robustness and transferability. In summary, AbRank offers a robust foundation for machine learning models to generalize across the antibody‑antigen space, with direct relevance for scalable, structure‑aware antibody therapeutic design.
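The m‑confident filtering described in the abstract above can be illustrated with a minimal sketch. It is not the AbRank implementation: the function name `m_confident_pairs`, the toy IDs, and the choice of dissociation constants (lower value = tighter binding) as the affinity measure are all assumptions made for illustration.

```python
from itertools import combinations


def m_confident_pairs(affinities, m=10.0):
    """Return (stronger, weaker) ID pairs that differ by at least m-fold.

    `affinities` maps a hypothetical complex ID to a positive dissociation
    constant (assumed convention: lower value means tighter binding).
    Pairs whose fold difference is below `m` are dropped as too marginal
    to serve as a reliable ranking label.
    """
    pairs = []
    for (id_a, kd_a), (id_b, kd_b) in combinations(affinities.items(), 2):
        low, high = sorted((kd_a, kd_b))
        if high / low >= m:
            # Put the tighter binder (smaller constant) first.
            pairs.append((id_a, id_b) if kd_a < kd_b else (id_b, id_a))
    return pairs


# Toy values in nM, invented for this example.
toy_kd_nM = {"ab1": 1.0, "ab2": 5.0, "ab3": 200.0}
print(m_confident_pairs(toy_kd_nM, m=10.0))
```

With m = 10 the pair (ab1, ab2) is discarded because its 5‑fold difference is below the threshold, while the two pairs involving ab3 are kept.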
Authors: Aditya Sengar, Ali Hariri, Daniel Probst, Patrick Barth, Pierre Vandergheynst
Abstract: Generating diverse, all‑atom conformational ensembles of dynamic proteins such as G‑protein‑coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD‑FPG), a framework that constructs complete all‑atom protein structures, including every side‑chain heavy atom, directly from molecular dynamics (MD) trajectories. LD‑FPG employs a Chebyshev graph neural network (ChebNet) to obtain low‑dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue‑based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral‑angle losses, maps back to Cartesian coordinates. Using D2R‑MD, a 2‑microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue‑based pooling strategies reproduce the reference ensemble with high structural fidelity (all‑atom lDDT of approximately 0.7; C‑alpha‑lDDT of approximately 0.8) and recover backbone and side‑chain dihedral‑angle distributions with a Jensen‑Shannon divergence of less than 0.03 compared to the MD data. LD‑FPG thereby offers a practical route to system‑specific, all‑atom ensemble generation for large proteins, providing a promising tool for structure‑based therapeutic design on complex, dynamic targets. The D2R‑MD dataset and our implementation are freely available to facilitate further research.
Authors: Sophie E. Ayscough, Maximilian W. A. Skoda, James Doutch, Andrew Caruana, Christy Kinane, Luke Clifton, Simon Titmuss
Abstract: Membrane proteins serve a wide range of vital roles in the functioning of living organisms. Compared to other classes of proteins, determining membrane protein structures remains a challenge, in large part due to the difficulty in establishing experimental conditions that can preserve the correct conformation and function of the protein in isolation from its native environment. We investigated the mechanosensitive ion channel MscL in lipid vesicles and in a planar lipid bilayer. By using a polymeric tether, our planar membrane mimetic was not constrained by the underlying solid substrate, making it sufficiently flexible to allow for increases in bilayer curvature and changes in membrane tension. We used quartz crystal microbalance with dissipation (QCM‑D) and polarised neutron reflectivity (PNR) to show the formation of MscL‑containing phospholipid bilayers, tethered with a high density PEG layer onto gold substrates from vesicle rupture. The MscL‑containing vesicles were separately characterised with small angle neutron scattering (SANS). MscL was expressed into vesicles using cell free protein expression. Analysing these vesicles with SANS, the radius of gyration of the protein was determined to be between 26 and 29 Å, consistent with the crystal structure of individual MscL channels. The MscL composition of the formed bilayer was 14% v/v, close to the initial composition of the vesicles, and a protein protrusion extending ca. 46 Å into the solvent was determined by PNR. Addition of 1.6 and 3.2 μM pexiganan resulted in a decrease in the protrusion of MscL (from ~46 to ~38 Å). To our knowledge, these findings represent the first direct experimental evidence of a structural change in the C‑terminus‑containing protrusion of MscL, triggered by an antimicrobial peptide.
Authors: Gergely Flamich
Abstract: Over the last few years, machine learning unlocked previously infeasible features for compression, such as providing guarantees for users' privacy or tailoring compression to specific data statistics (e.g., satellite images or audio recordings of animals) or users' audiovisual perception. This, in turn, has led to an explosion of theoretical investigations and insights that aim to develop new fundamental theories, methods and algorithms better suited for machine learning‑based compressors.
In this thesis, I contribute to this trend by investigating relative entropy coding, a mathematical framework that generalises classical source coding theory. Concretely, relative entropy coding deals with the efficient communication of uncertain or randomised information. One of its key advantages is that it extends compression methods to continuous spaces and can thus be integrated more seamlessly into modern machine learning pipelines than classical quantisation‑based approaches. Furthermore, it is a natural foundation for developing advanced compression methods that are privacy‑preserving or account for the perceptual quality of the reconstructed data.
The thesis considers relative entropy coding at three conceptual levels: After introducing the basics of the framework, (1) I prove results that provide new, maximally tight fundamental limits to the communication and computational efficiency of relative entropy coding; (2) I use the theory of Poisson point processes to develop and analyse new relative entropy coding algorithms, whose performance attains the theoretic optima and (3) I showcase the strong practical performance of relative entropy coding by applying it to image, audio, video and protein data compression using small, energy‑efficient, probabilistic neural networks called Bayesian implicit neural representations.
Authors: Nicolas Boullé, Matthew J. Colbrook, Gustav Conradie
Abstract: Data‑driven spectral analysis of Koopman operators is a powerful tool for understanding numerous real‑world dynamical systems, from neuronal activity to variations in sea surface temperature. The Koopman operator acts on a function space and is most commonly studied on the space of square‑integrable functions. However, defining it on a suitable reproducing kernel Hilbert space (RKHS) offers numerous practical advantages, including pointwise predictions with error bounds, improved spectral properties that facilitate computations, and more efficient algorithms, particularly in high dimensions. We introduce the first general, provably convergent, data‑driven algorithms for computing spectral properties of Koopman and Perron‑Frobenius operators on RKHSs. These methods efficiently compute spectra and pseudospectra with error control and spectral measures while exploiting the RKHS structure to avoid the large‑data limits required in the L^2 settings. The function space is determined by a user‑specified kernel, eliminating the need for quadrature‑based sampling as in L^2 and enabling greater flexibility with finite, externally provided datasets. Using the Solvability Complexity Index hierarchy, we construct adversarial dynamical systems for these problems to show that no algorithm can succeed in fewer limits, thereby proving the optimality of our algorithms. Notably, this impossibility extends to randomized algorithms and datasets. We demonstrate the effectiveness of our algorithms on challenging, high‑dimensional datasets arising from real‑world measurements and high‑fidelity numerical simulations, including turbulent channel flow, molecular dynamics of a binding protein, Antarctic sea ice concentration, and Northern Hemisphere sea surface height. The algorithms are publicly available in the software package SpecRKHS.
Authors: Michael A. Sauer, Souvik Mondal, Madeline Cano, Matthias Heyden
Abstract: At room temperature, low frequency vibrations at far‑infrared frequencies are thermally excited (k_B T > h ν) and not restricted to harmonic fluctuations around a single potential energy minimum. For folded proteins, these intrinsically anharmonic vibrations can contain information on slow conformational transitions. Recently, we have developed FREquency‑SElective ANharmonic (FRESEAN) mode analysis, a method based on time correlation functions that isolates low‑frequency vibrational motions from molecular dynamics simulation trajectories without relying on harmonic approximations. We recently showed that low‑frequency vibrations obtained from FRESEAN mode analysis are effective collective variables in enhanced sampling simulations of conformational ensembles. However, FRESEAN mode analysis is based on velocity time correlations between all degrees of freedom, which creates computational challenges for large biomolecules. To facilitate future applications, we demonstrate here how coarse‑graining of all‑atom simulation trajectories can be combined with FRESEAN mode analysis to extract information on low‑frequency vibrations at minimal computational cost.
Authors: Max Ku, Sun Sun, Hongyu Guo, Wenhu Chen
Abstract: We introduce DisProtEdit, a controllable protein editing framework that leverages dual‑channel natural language supervision to learn disentangled representations of structural and functional properties. Unlike prior approaches that rely on joint holistic embeddings, DisProtEdit explicitly separates semantic factors, enabling modular and interpretable control. To support this, we construct SwissProtDis, a large‑scale multimodal dataset where each protein sequence is paired with two textual descriptions, one for structure and one for function, automatically decomposed using a large language model. DisProtEdit aligns protein and text embeddings using alignment and uniformity objectives, while a disentanglement loss promotes independence between structural and functional semantics. At inference time, protein editing is performed by modifying one or both text inputs and decoding from the updated latent representation. Experiments on protein editing and representation learning benchmarks demonstrate that DisProtEdit performs competitively with existing methods while providing improved interpretability and controllability. On a newly constructed multi‑attribute editing benchmark, the model achieves a both‑hit success rate of up to 61.7%, highlighting its effectiveness in coordinating simultaneous structural and functional edits.
Authors: Aditya Ravuri, Neil D. Lawrence
Abstract: Protein Language Models (PLMs) such as ESM2 have been shown to be capable of zero‑shot prediction of critical scalar properties of proteins (fitness). In this work, we show that injecting a dropout layer at inference time between a PLM's featurizer/embedding layer and its transformer, and averaging its output akin to Monte‑Carlo dropout, increases zero‑shot performance on a subset of the ProteinGym dataset. This is the case even when the model was not trained with dropout to begin with, and it does not require retraining or finetuning of the PLM. A dropout rate of 0.1 seems performant across all models.
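The averaging step described in the abstract above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' code: `score_fn` is a hypothetical stand‑in for everything downstream of the embedding layer (transformer plus fitness readout), the embedding is a flat list of floats, and the inverted‑dropout rescaling by 1/(1 − p) is an assumed convention.

```python
import math
import random


def dropout(vec, p, rng):
    """Zero each entry with probability p; rescale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def mc_dropout_score(embedding, score_fn, p=0.1, n_samples=32, seed=0):
    """Average score_fn over n_samples independently dropped-out embeddings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += score_fn(dropout(embedding, p, rng))
    return total / n_samples


# Hypothetical nonlinear scorer standing in for the transformer stack.
def toy_score(vec):
    return math.tanh(sum(0.5 * v for v in vec))


toy_embedding = [0.2, -0.1, 0.4, 0.3]
print(mc_dropout_score(toy_embedding, toy_score, p=0.1))
```

Setting p = 0 recovers the ordinary single‑pass score, which makes the sketch easy to sanity‑check against a baseline.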
Authors: Dong Xu, Zhangfan Yang, Ka-chun Wong, Zexuan Zhu, Jiangqiang Li, Junkai Ji
Abstract: Breakthroughs in high‑accuracy protein structure prediction, such as AlphaFold, have established receptor‑based molecule design as a critical driver for rapid early‑phase drug discovery. However, most approaches still struggle to balance pocket‑specific geometric fit with strict valence and synthetic constraints. To resolve this trade‑off, a Retrieval‑Enhanced Aligned Diffusion termed READ is introduced, which is the first to merge molecular Retrieval‑Augmented Generation with an SE(3)‑equivariant diffusion model. Specifically, a contrastively pre‑trained encoder aligns atom‑level representations during training, then retrieves graph embeddings of pocket‑matched scaffolds to guide each reverse‑diffusion step at inference. This single mechanism can inject real‑world chemical priors exactly where needed, producing valid, diverse, and shape‑complementary ligands. Experimental results demonstrate that READ can achieve very competitive performance in CBGBench, surpassing state‑of‑the‑art generative models and even native ligands. That suggests retrieval and diffusion can be co‑optimized for faster, more reliable structure‑based drug design.
Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun
Abstract: Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non‑Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self‑attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to CTC's complexity and increases the risk of training failures. To address these issues, we propose an improved non‑autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the learning difficulty of each protein based on the model's estimated protein generation capability through a sampling process, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self‑refining inference‑time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine‑grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90% based on sampled training over various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species.
Authors: Han Liu, Keyan Ding, Peilin Chen, Yinwei Wei, Liqiang Nie, Dapeng Wu, Shiqi Wang
Abstract: Accurate prediction of protein‑ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain‑specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine‑grained joint embeddings for prediction. Experiments on two benchmark datasets across both in‑domain and cross‑domain scenarios demonstrate that KEPLA consistently outperforms state‑of‑the‑art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.
Authors: Eunna Huh, Hyeonsu Lee, Hyunjin Shin
Abstract: With the growing prominence of antibody‑based therapeutics, antibody engineering has gained increasing attention as a critical area of research and development. Recent progress in transformer‑based protein large language models (LLMs) has demonstrated promising applications in protein sequence design and structural prediction. Moreover, the availability of large‑scale antibody datasets such as the Observed Antibody Space (OAS) database has opened new avenues for the development of LLMs specialized for processing antibody sequences. Among these, RoBERTa has demonstrated improved performance relative to BERT, while maintaining a smaller parameter count (125M) compared to the BERT‑based protein model, ProtBERT (420M). This reduced model size enables more efficient deployment in antibody‑related applications. However, despite the numerous advantages of the RoBERTa architecture, antibody‑specific foundational models built upon it have remained inaccessible to the research community. In this study, we introduce Ab‑RoBERTa, a RoBERTa‑based antibody‑specific LLM, which is publicly available at https://huggingface.co/mogam‑ai/Ab‑RoBERTa. This resource is intended to support a wide range of antibody‑related research applications including paratope prediction or humanness assessment.
Authors: Changbong Hyeon, D. Thirumalai
Abstract: Molecular chaperones are machines that consume copious amounts of ATP to drive misfolded proteins or RNA to fold into functionally competent native states. Because the folding landscapes of biomolecules with complex native state topology are rugged, consisting of multiple minima that are separated by large free energy barriers, folding occurs by the kinetic partitioning mechanism, according to which only a small fraction of the molecules reach the folded state in biologically viable times. The rescue of such proteins and RNA requires chaperones. Although the protein and RNA chaperones are profoundly different in their structure and action, the principles underlying their activity to produce the folded structures can be understood using a unified theoretical framework based on the iterative annealing mechanism (IAM). Our theory shows that both these machines have evolved to maximize the steady state yield on biological time scales. Strikingly, theory predicts that only at a moderate level of RNA chaperone activity is the yield of the self‑splicing pre‑RNA maximized in vivo.
Authors: Yongqin Qiu, Xinyu Zhang
Abstract: Link prediction in multilayer networks is a key challenge in applications such as recommendation systems and protein‑protein interaction prediction. While many techniques have been developed, most rely on assumptions about shared structures and require access to raw auxiliary data, limiting their practicality. To address these issues, we propose a novel transfer learning framework for multilayer networks using a bi‑level model averaging method. A K‑fold cross‑validation criterion based on edges is used to automatically weight inter‑layer and intra‑layer candidate models. This enables the transfer of information from auxiliary layers while mitigating model uncertainty, even without prior knowledge of shared structures. Theoretically, we prove the optimality and weight convergence of our method under mild conditions. Computationally, our framework is efficient and privacy‑preserving, as it avoids raw data sharing and supports parallel processing across multiple servers. Simulations show our method outperforms others in predictive accuracy and robustness. We further demonstrate its practical value through two real‑world recommendation system applications.
Authors: Juanjuan Yang, Kevin Jahnke, Ling Xin, Xinxin Jing, Pengfei Zhan, Andreas Peil, Alessandra Griffo, Marko Škugor, Donglei Yang, Sisi Fan, Kerstin Göpfrich, Hao Yan, Pengfei Wang, Na Liu
Abstract: Membrane morphology and its dynamic adaptation regulate many cellular functions, which are often mediated by membrane proteins. Advances in DNA nanotechnology have enabled the realization of various protein‑inspired structures and functions with precise control at the nanometer level, suggesting a viable tool to artificially engineer the membrane morphology. In this work, we demonstrate a DNA origami cross (DOC) structure that can be anchored onto giant unilamellar vesicles (GUVs) and subsequently polymerized into micron‑scale reconfigurable one‑dimensional (1D) chains or two‑dimensional (2D) lattices. Such DNA origami‑based networks can be switched between left‑handed (LH) and right‑handed (RH) conformations by DNA fuels and exhibit potent efficacy in remodeling the membrane curvatures of GUVs. This work sheds light on designing hierarchically‑assembled dynamic DNA systems for the programmable modulation of synthetic cells for useful applications.
Authors: Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic
Abstract: Obtaining high‑quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre‑trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre‑trained AI models to curate cost‑effective and high‑quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre‑trained vision models, and protein folding analysis with AlphaFold.
Authors: Anand Dev Ranjan, Suyash Narayan Amzare, Subhrokoli Ghosh, Ayan Banerjee
Abstract: The fabrication of multilayered heterostructures is essential for advancing microelectronic and biosensing technologies. Conventional top‑down manufacturing techniques are often cost‑prohibitive and unsuitable for biomedical applications. Here, we present a bottom‑up fabrication method, termed microbubble lithography, which enables the in situ construction of multilayered microstructures through layer‑by‑layer self‑assembly. This technique allows diverse materials to be integrated into coherent heterostructures. We demonstrate the platform's utility by successfully patterning a biomarker and a reporter protein, highlighting its potential for cost‑effective and environmentally sustainable sensing applications.
Authors: Chuqiao Zhang, Sarath Chandra Dantu, Debarghya Mitra, Dalia Chakrabarty
Abstract: Identification of critical residues of a protein is actively pursued, since such residues are essential for protein function. We present three ways of recognising critical residues of an example protein, the evolution of which is tracked via molecular dynamics simulations. Our methods are based on learning a Random Geometric Graph (RGG) variable, where the state variable of each of 156 residues is attached to a node of this graph, with the RGG learnt using the matrix of correlations between state variables of each residue‑pair. Given the categorical nature of the state variable, correlation between a residue pair is computed using Cramér's V. First, we advance an organic thresholding to learn an RGG, and compare results against extant thresholding techniques, when parametrising criticality as the nodal degree in the learnt RGG. Secondly, we develop a criticality measure by ranking the computed differences between the posterior probability of the full graph variable defined on all 156 residues, and that of the graph with all but one residue omitted. A third parametrisation of criticality informs on the dynamical variation of nodal degrees as the protein evolves during the simulation. Finally, we compare results obtained with the three distinct criticality parameters, against experimentally‑ascertained critical residues.
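As a rough illustration of the residue‑pair correlation step described in this abstract, the following is a minimal sketch of Cramér's V for two categorical series. It is not the authors' code: the function name `cramers_v` and the convention of returning 0.0 when either variable takes a single value are assumptions made here for illustration, and no bias correction is applied.

```python
import numpy as np


def cramers_v(x, y):
    """Cramér's V between two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; uses the plain (uncorrected)
    chi-squared statistic of the contingency table.
    """
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")

    # Map labels to integer codes and build the contingency table.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    n_rows, n_cols = int(xi.max()) + 1, int(yi.max()) + 1
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (xi, yi), 1)

    # Assumed convention: V is reported as 0 when a variable is constant.
    if min(n_rows, n_cols) < 2:
        return 0.0

    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1))))
```

Applied to every pair of per‑residue state series from a trajectory, such a function would yield a symmetric correlation matrix of the kind that is then thresholded to learn a graph.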
Authors: Sayantan Mondal, Saumyak Mukherjee, Biman Bagchi
Abstract: Surface effects could play a dominant role in modifying the natural liquid order. In some cases, the effects of the surface interactions can propagate inwards, and even can interfere with a similar propagation from opposite surfaces. This can be particularly evident in liquid water under nano‑confinement. The large dipolar cross‑correlations among distinct molecules that give rise to the unusually large dielectric constant of water (and in turn owe their origin to the extended hydrogen bond (HB) network) can get perturbed by surfaces. The perturbation can propagate inwards and then interfere with the one from the opposite surface if confinement is only a few layers wide. This can give rise to short‑to‑intermediate range solvent‑mediated interaction between two surfaces. Here we study the effects of such interactions on the dielectric constant of nano‑confined liquids, not just water but also ordering at protein surfaces. The surfaces work at two levels: (i) induce orientational realignment, and (ii) alter the cross‑correlations between water molecules. Molecular dynamics simulations and statistical analyses are used to address these aspects in confinement of slit pores, nano tube/cylinder, and nano sphere. In addition, we consider the hydration layers of multiple proteins with vastly different structural features. These studies give us a measure of the extent or the length scale of cross‑correlations between dipole moments of water molecules. We find an interesting orientational arrangement in the protein hydration layers, giving rise to long‑range molecular cross‑correlations. To decouple the effect of HB from the effect of geometry, we additionally study acetonitrile under nanoconfinement. Importantly, while a protein's interior is characterized by a small dielectric constant, the dipole moment of a peptide bond is large, and thus susceptible to fluctuations in water.
Authors: Dingyi Rong, Haotian Lu, Wenzhuo Zheng, Fan Zhang, Shuangjia Zheng, Ning Liu
Abstract: Designing protein sequences with optimal energetic stability is a key challenge in protein inverse folding, as current deep learning methods are primarily trained by maximizing sequence recovery rates, often neglecting the energy of the generated sequences. To overcome this limitation, we propose EnerBridge‑DPO, a novel inverse folding framework that directly generates low‑energy, high‑stability protein sequences. Our core innovations are twofold. First, we integrate Markov Bridges with Direct Preference Optimization (DPO), where energy‑based preferences are used to fine‑tune the Markov Bridge model. The Markov Bridge initiates optimization from an information‑rich prior sequence, providing DPO with a pool of structurally plausible sequence candidates. Second, we introduce an explicit energy constraint loss, which enhances the energy‑driven nature of DPO based on prior sequences, enabling the model to effectively learn energy representations from a wealth of prior knowledge and directly predict sequence energy values, thereby capturing quantitative features of the energy landscape. Our evaluations demonstrate that EnerBridge‑DPO can design protein complex sequences with lower energy while maintaining sequence recovery rates comparable to state‑of‑the‑art models, and accurately predicts ΔΔG values between various sequences.
Authors: Pedro Pessoa, Juan Andres Martinez, Vincent Vandenbroucke, Frank Delvigne, Steve Pressé
Abstract: Inferring protein production kinetics for dividing cells is complicated due to protein inheritance from the mother cell. For instance, fluorescence measurements ‑‑ commonly used to assess gene activation ‑‑ may reflect not only newly produced proteins but also those inherited through successive cell divisions. In such cases, observed protein levels in any given cell are shaped by its division history. As a case study, we examine activation of the glc3 gene in yeast involved in glycogen synthesis and expressed under nutrient‑limiting conditions. We monitor this activity using snapshot fluorescence measurements via flow cytometry, where GFP expression reflects glc3 promoter activity. A naïve analysis of flow cytometry data ignoring cell division suggests many cells are active with low expression. Explicitly accounting for the (non‑Markovian) effects of cell division and protein inheritance makes it impossible to write down a tractable likelihood ‑‑ a key ingredient in physics‑inspired inference, defining the probability of observing data given a model. The dependence on a cell's division history breaks the assumptions of standard (Markovian) master equations, rendering traditional likelihood‑based approaches inapplicable. Instead, we adapt conditional normalizing flows (a class of neural network models designed to learn probability distributions) to approximate otherwise intractable likelihoods from simulated data. In doing so, we find that glc3 is mostly inactive under stress, showing that while cells occasionally activate the gene, expression is brief and transient.
Authors: Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li
Abstract: The de novo design of ligand‑binding proteins with tailored functions is essential for advancing biotechnology and molecular medicine, yet existing AI approaches are limited by scarce protein‑ligand complex data. To circumvent this data bottleneck, we leverage the abundant natural language descriptions characterizing protein‑ligand interactions. Here, we introduce InstructPro, a family of generative models that design proteins following the guidance of natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified function descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large‑scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants ‑‑ InstructPro‑1B and InstructPro‑3B ‑‑ that substantially outperform strong baselines. InstructPro‑1B achieves an AlphaFold3 ipTM of 0.918 and a binding affinity of ‑8.764 on seen ligands, while maintaining robust performance in a zero‑shot setting with scores of 0.869 and ‑6.713, respectively. These results are accompanied by novelty scores of 70.1% and 68.8%, underscoring the model's ability to generalize beyond the training set. Furthermore, the model yields a superior binding free energy of ‑20.9 kcal/mol and an average of 5.82 intermolecular hydrogen bonds, validating its proficiency in designing high‑affinity ligand‑binding proteins. Notably, scaling to InstructPro‑3B further improves the zero‑shot ipTM to 0.882, binding affinity to ‑6.797, and binding free energy to ‑25.8 kcal/mol, demonstrating clear performance gains associated with increased model capacity. These findings highlight the power of natural language‑guided generative models to mitigate the data bottlenecks in traditional structure‑based methods, significantly broadening the scope of de novo protein design.
Authors: Ruben Weitzman, Peter Mørch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Marks, Pascal Notin
Abstract: Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein‑protein interactions. Traditional workflows have relied on a two‑step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA‑based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertion and deletion patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end‑to‑end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state‑of‑the‑art performance compared to sequence‑based models that rely on MSA‑based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture‑ and task‑agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time ‑‑ offering a scalable alternative to alignment‑centric approaches.
Authors: Amina Mollaysa, Artem Moskale, Pushpak Pati, Tommaso Mansi, Mangal Prakash, Rui Liao
Abstract: We present BioLangFusion, a simple approach for integrating pre‑trained DNA, mRNA, and protein language models into unified molecular representations. Motivated by the central dogma of molecular biology (information flow from gene to transcript to protein), we align per‑modality embeddings at the biologically meaningful codon level (three nucleotides encoding one amino acid) to ensure direct cross‑modal correspondence. BioLangFusion studies three standard fusion techniques: (i) codon‑level embedding concatenation, (ii) entropy‑regularized attention pooling inspired by multiple‑instance learning, and (iii) cross‑modal multi‑head attention ‑‑ each technique providing a different inductive bias for combining modality‑specific signals. These methods require no additional pre‑training or modification of the base models, allowing straightforward integration with existing sequence‑based foundation models. Across five molecular property prediction tasks, BioLangFusion outperforms strong unimodal baselines, showing that even simple fusion of pre‑trained models can capture complementary multi‑omic information with minimal overhead.
Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Siyuan Li, Yufei Huang, Stan Z. Li
Abstract: The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near‑experimental accuracy, positioning it as a valuable resource for data‑driven protein design. However, its direct use in training deep models that are sensitive to fine‑grained atomic geometry, such as inverse folding, exposes a critical limitation. Comparative analysis of structural feature distributions reveals that AFDB structures exhibit distinct statistical regularities, reflecting a systematic geometric bias that deviates from the conformational diversity found in experimentally determined structures from the Protein Data Bank (PDB). While AFDB structures are cleaner and more idealized, PDB structures capture the intrinsic variability and physical realism essential for generalization in downstream tasks. To address this discrepancy, we introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native‑like conformations from intentionally corrupted backbone geometries. By training the model to recover plausible structural states, DeSAE implicitly captures a more robust and natural structural manifold. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance across multiple benchmarks. This work highlights the critical impact of subtle systematic biases in predicted structures and presents a principled framework for debiasing, significantly boosting the performance of structure‑based learning tasks like inverse folding.
Authors: Travis Leadbetter, Prashant K. Purohit, Celia Reina
Abstract: Given a particle system obeying overdamped Langevin dynamics, we demonstrate that it is always possible to construct a thermodynamically consistent macroscopic model which obeys a gradient flow with respect to its non‑equilibrium free energy. To do so, we significantly extend the recent Stochastic Thermodynamics with Internal Variables (STIV) framework, a method for producing macroscopic thermodynamic models far‑from‑equilibrium from the underlying mesoscopic dynamics and an approximate probability density of states parameterized with so‑called internal variables. Though originally explored for Gaussian probability distributions, we here allow for an arbitrary choice of the approximate probability density while retaining a gradient flow dynamics. This greatly extends its range of applicability and automatically ensures consistency with the second law of thermodynamics, without the need for secondary verification. We demonstrate numerical convergence, in the limit of increasing internal variables, to the true probability density of states for a multi‑modal relaxation problem, a protein diffusing on a strand of DNA, and an externally driven particle in a periodic landscape. Finally, we provide a reformulation of STIV with the quasi‑equilibrium approximations in terms of the averages of observables of the mesostate, and show that these, too, obey a gradient flow.
Authors: Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu
Abstract: This paper aims to retrieve proteins with similar structures and semantics from large‑scale protein datasets, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo‑Electron Microscopy (cryo‑EM). Motivated by the recent progress of vision‑language models (VLMs), we propose a CLIP‑style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we introduce a large‑scale dataset of approximately 200,000 protein‑caption pairs with rich functional descriptors. We evaluate our model on both in‑domain and more challenging cross‑database retrieval, using the Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In both cases, our approach demonstrates promising zero‑shot retrieval performance, highlighting the potential of multimodal foundation models for structure‑function understanding in protein biology.
Authors: Sebastián V. Romero, Alejandro Gomez Cadavid, Pavle Nikačević, Enrique Solano, Narendra N. Hegade, Miguel Angel Lopez-Ruiz, Claudio Girotto, Masako Yamada, Panagiotis Kl. Barkoutsos, Ananth Kaushik, Martin Roetteler
Abstract: We experimentally demonstrate that the bias‑field digitized counterdiabatic quantum optimization (BF‑DCQO) algorithm, implemented on IonQ's fully connected trapped‑ion quantum processors, offers an efficient approach to solving dense higher‑order unconstrained binary optimization (HUBO) problems. Specifically, we tackle protein folding on a tetrahedral lattice for up to 12 amino acids, representing the largest quantum hardware implementations of protein folding problems reported to date. Additionally, we address MAX 4‑SAT instances at the computational phase transition and fully connected spin‑glass problems using all 36 available qubits. Across all considered cases, our method consistently achieves optimal solutions, highlighting the powerful synergy between non‑variational quantum optimization approaches and the intrinsic all‑to‑all connectivity of trapped‑ion architectures. Given the expected scalability of trapped‑ion quantum systems, BF‑DCQO represents a promising pathway toward practical quantum advantage for dense HUBO problems with significant industrial and scientific relevance.
Authors: Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu
Abstract: Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence‑structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self‑improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self‑derived ddG predictor, providing stable multi‑objective signals while avoiding the prohibitive cost of physics‑based methods. To ensure robustness in online RL, we further introduce a novel embedding‑level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation balancing multi‑reward optimization, KL‑divergence from a reference model, and diversity regularization, ProteinZero achieves robust improvements across designability, stability, recovery, and diversity. On the CATH‑4.3 benchmark, it consistently outperforms state‑of‑the‑art baselines including ProteinMPNN, ESM‑IF, and InstructPLM, reducing design failure rates by 36‑48% and achieving success rates above 90% across diverse folds. Importantly, a complete RL run can be executed on a single 8‑GPU node within three days, including reward computation and data generation. These results indicate that efficient online RL fine‑tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space. Full source code and model checkpoints will be released upon publication.
Authors: Mohammad Tanver Hossain, Dakota Piorkowski, Andrew Lowe, Wonsik Eom, Abhishek Shetty, Sameh H. Tawfick, Douglas S. Fudge, Randy H. Ewoldt
Abstract: Hagfish slime is a unique biological material composed of mucus and protein threads that rapidly form a cohesive network when deployed in seawater. The forces involved in thread deployment and interactions among mucus and threads are key to understanding how hagfish slime rapidly assembles into a cohesive, functional network. Despite extensive interest in its biophysical properties, the mechanical forces governing thread deployment and interaction remain poorly quantified. Here, we present the first direct in situ measurements of the micromechanical forces involved in hagfish slime formation, including mucus mechanical properties, skein peeling force, thread‑mucus adhesion, and thread‑thread cohesion. Using a custom glass‑rod force sensing system, we show that thread deployment initiates when peeling forces exceed a threshold of approximately 6.8 nN. To understand the flow strength required for unraveling, we used a rheo‑optic setup to impose controlled shear flow, enabling us to directly observe unraveling dynamics and determine the critical shear rate for unraveling of the skeins, which we then interpreted using an updated peeling‑based force balance model. Our results reveal that thread‑mucus adhesion dominates over thread‑thread adhesion and that deployed threads contribute minimally to bulk shear rheology at constant flow rate. These findings clarify the physics underlying the rapid, flow‑triggered assembly of hagfish slime and inform future designs of synthetic deployable fiber‑gel systems.
Authors: Zixuan Jiang, Renjing Xu
Abstract: Deciphering protein function remains a fundamental challenge in protein representation learning. The task presents significant difficulties for protein language models (PLMs) due to the sheer volume of functional annotation categories and the highly imbalanced distribution of annotated instances across biological ontologies. Inspired by the remarkable success of reinforcement learning from human feedback (RLHF) in large language model (LLM) alignment, we propose AnnoDPO, a novel multi‑modal framework for protein function prediction that leverages Direct Preference Optimization (DPO) to enhance annotation learning. Our methodology addresses the dual challenges of annotation scarcity and category imbalance through preference‑aligned training objectives, establishing a new paradigm for biological knowledge integration in protein representation learning.
Authors: Yan Jun Lin, Zhengyang Liu, Sunghwan Jung
Abstract: Traditional surface cleaning methods often suffer from drawbacks such as chemical harshness, potential for surface damage, and high energy consumption. This study investigates an alternative approach: acoustic‑driven surface cleaning using millimeter‑sized bubbles excited at low, sub‑cavitation frequencies. We identify and characterize a distinct translational resonance of these bubbles, occurring at significantly lower frequencies (e.g., 50 Hz for 1.3 mm diameter bubbles) than the Minnaert resonance for a bubble of the same size. Experiments reveal that at this translational resonance, stationary bubbles exhibit amplified lateral swaying, while bubbles sliding on an inclined surface display pronounced "stop‑and‑go" dynamics. The theoretical model treats the bubble as a forced, damped harmonic oscillator, where surface tension provides the restoring force and the inertia is dominated by the hydrodynamic added mass of the surrounding fluid. It accurately predicts the observed resonant frequency scaling with bubble size ($\propto R_0^{-3/2}$). Cleaning efficacy, assessed using protein‑based artificial soil on glass slides, was improved by approximately 90% when bubbles were driven at their translational resonant frequency compared to off‑resonant frequencies or non‑acoustic conditions. These findings demonstrate that leveraging translational resonance enhances bubble‑induced shear and agitation, offering an effective and sustainable mechanism for surface cleaning.
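The size scaling quoted in this abstract can be motivated by a short dimensional argument. The sketch below is not the authors' derivation: the symbols $m_a$, $c$, $k$, $F_0$, $\sigma$ and $\rho$ are introduced here for illustration, order‑one prefactors are omitted, and damping is assumed weak enough not to shift the resonance appreciably.

```latex
% Forced, damped oscillator for the lateral bubble displacement x(t):
\[
  m_a \ddot{x} + c\,\dot{x} + k\,x = F_0 \cos(\omega t)
\]
% Assumed scalings: restoring stiffness set by surface tension \sigma,
% inertia set by the added mass of liquid (density \rho) moved by a
% bubble of radius R_0:
\[
  k \sim \sigma, \qquad m_a \sim \rho R_0^{3}
\]
% Resonant frequency in the weak-damping limit:
\[
  f_{\mathrm{res}} \simeq \frac{1}{2\pi}\sqrt{\frac{k}{m_a}}
  \;\sim\; \sqrt{\frac{\sigma}{\rho R_0^{3}}}
  \;\propto\; R_0^{-3/2}
\]
```

Under these assumptions, halving the bubble radius raises the resonant frequency by a factor of $2^{3/2} \approx 2.8$.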
Authors: Viacheslav Dubovitskii, Aritra Bose, Filippo Utro, Laxmi Parida
Abstract: Biomolecular networks, such as protein‑protein interactions, gene‑gene associations, and cell‑cell interactions, offer valuable insights into the complex organization of biological systems. These networks are key to understanding cellular functions, disease mechanisms, and identifying therapeutic targets. However, their analysis is challenged by the high dimensionality, heterogeneity, and sparsity of multi‑omics data. Random walk algorithms are widely used to propagate information through disease modules, helping to identify disease‑associated genes and uncover relevant biological pathways. In this work, we investigate the limitations of classical random walks and explore the potential of quantum random walks (QRWs) for biomolecular network analysis. We evaluate QRWs in two network‑based applications. First, in a gene‑gene interaction network associated with asthma, autism, and schizophrenia, QRWs more accurately rank disease‑associated genes compared to classical methods. Second, in a structured multi‑partite cell‑cell interaction network derived from mouse brown adipose tissue, QRWs identify key driver genes in malignant cells that are overlooked by classical random walks. Our findings suggest that quantum random walks offer a promising alternative to classical approaches, with improved sensitivity to network structure and better performance in identifying biologically relevant features. This highlights their potential in advancing network medicine and systems biology.
Authors: Antonio Jesús Banegas-Luna, Baldomero Imbernón Tudela, Carlos Martínez-Cortés, José María Cecilia, Horacio Pérez-Sánchez
Abstract: Virtual screening (VS) is a computationally intensive process crucial for drug discovery, often requiring significant resources to analyze large chemical libraries and predict ligand‑protein interactions. This study evaluates the performance impact of containerization on METADOCK 2, a high‑throughput docking software, when deployed on heterogeneous high‑performance computing (HPC) platforms. By testing three containerization technologies ‑ Docker, Singularity, and Apptainer ‑ across varying CPU and GPU configurations, the experiments reveal that containerization introduces negligible performance overhead, with deviations below 1%. Moreover, METADOCK 2 demonstrated the capability to efficiently process large molecular complexes, surpassing the limitations of commercial tools such as AutoDock Vina. The results underscore the advantages of container‑based deployment for ensuring portability, reproducibility, and scalability in scientific computing. This study concludes that containerized METADOCK 2 is a robust and efficient solution for VS tasks on heterogeneous HPC platforms.
Authors: Noémie Bergues, Arthur Carré, Paul Join-Lambert, Brice Hoffmann, Arnaud Blondel, Hamza Tajmouati
Abstract: Predicting the 3D conformation of small molecules within protein binding sites is a key challenge in drug design. When a crystallized reference ligand (template) is available, it provides geometric priors that can guide 3D pose prediction. We present a two‑stage method for ligand conformation generation guided by such templates. In the first stage, we introduce a molecular alignment approach based on flow‑matching to generate 3D coordinates for the ligand, using the template structure as a reference. In the second stage, a differentiable pose optimization procedure refines this conformation based on shape and pharmacophore similarities, internal energy, and, optionally, the protein binding pocket. We introduce a new benchmark of ligand pairs co‑crystallized with the same target to evaluate our approach and show that it outperforms standard docking tools and open‑access alignment methods, especially in cases involving low similarity to the template or high ligand flexibility.
Authors: Yunqing Liu, Wenqi Fan, Xiaoyong Wei, Qing Li
Abstract: Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not limited to their 3D information but also encompasses information ranging from amino acid molecules (local information) to protein‑protein structure similarity (global information). To address this, we propose GLProtein, the first framework in protein pre‑training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein combines protein‑masked modelling with triplet structure similarity scoring, protein 3D distance encoding and substructure‑based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including protein‑protein interaction prediction and contact prediction.
Authors: Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu
Abstract: Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end‑to‑end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo‑electron microscopy (cryo‑EM) for near‑atomic protein reconstruction. In cryo‑EM, pose estimation and 3D reconstruction from unordered particle images still depend on time‑consuming iterative optimization, primarily due to challenges such as low signal‑to‑noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from noisy cryo‑EM images for Fast ab initio Reconstruction. By integrating multi‑view features and training on large‑scale simulated cryo‑EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.
Authors: Jes Frellsen, Maher M. Kassem, Tone Bengtsen, Lars Olsen, Kresten Lindorff-Larsen, Jesper Ferkinghoff-Borg, Wouter Boomsma
Abstract: Inverse folding models have proven to be highly effective zero‑shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding model and the free‑energy considerations underlying thermodynamic stability remains incompletely understood. A better understanding would not only be of theoretical interest, but could also provide the basis for stronger zero‑shot stability prediction. In this paper, we take steps to clarify the free‑energy foundations of inverse folding models. Our derivation reveals the standard practice of likelihood ratios as a simplistic approximation and suggests several paths towards better estimates of the relative stability. We empirically assess these approaches and demonstrate that considerable gains in zero‑shot performance can be achieved with fairly simple means.
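For readers unfamiliar with the likelihood‑ratio baseline this abstract refers to, the following is a minimal sketch of how such a score is commonly computed for a single substitution. It is an illustration under stated assumptions, not the paper's method: `log_probs` is a hypothetical (L, 20) array of per‑position log‑probabilities produced by some inverse folding model conditioned on the structure, and the amino‑acid ordering in `AMINO_ACIDS` is an arbitrary choice made here.

```python
import numpy as np

# Assumed column ordering of the model output; real models may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def log_likelihood_ratio(log_probs, wild_type, position, mutant_aa):
    """Return log p(mutant | structure) - log p(wild type | structure).

    log_probs : hypothetical (L, 20) array of per-position log-probabilities.
    wild_type : wild-type sequence as a string of length L.
    position  : zero-based index of the substituted residue.
    mutant_aa : one-letter code of the substituted amino acid.
    """
    log_probs = np.asarray(log_probs)
    if log_probs.shape != (len(wild_type), len(AMINO_ACIDS)):
        raise ValueError("log_probs must have shape (len(wild_type), 20)")
    wt_aa = wild_type[position]
    return float(
        log_probs[position, AA_INDEX[mutant_aa]]
        - log_probs[position, AA_INDEX[wt_aa]]
    )
```

A positive score means the model prefers the mutant residue over the wild type at that position; the abstract argues that this quantity is only a simplistic approximation of relative stability.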
Authors: Greta Grassmann, Mattia Miotto, Francesca Alessandrini, Leonardo Bo', Giancarlo Ruocco, Edoardo Milanetti, Andrea Giansanti
Abstract: The functionality of protein‑protein complexes is closely tied to the strength of their interactions, making the evaluation of binding affinity a central focus in structural biology. However, the molecular determinants underlying binding affinity are still not fully understood. In particular, the entropic contributions, especially those arising from conformational dynamics, remain poorly characterized. In this study, we explore the relationship between protein motion and binding stability and its role in protein function. To gain deeper insight into how protein complexes modulate their stability, we investigated a model system with a well‑characterized and fast evolutionary history: a set of SARS‑CoV‑2 spike protein variants bound to the human ACE2 receptor, for which experimental binding affinity data are available. Through Molecular Dynamics simulations, we analyzed both structural and dynamical differences between the unbound (apo) and bound (holo) forms of the spike protein across several variants of concern. Our findings indicate that a more stable binding is associated with proteins that exhibit higher rigidity in their unbound state and display dynamical patterns similar to those observed after binding to ACE2. The increase of binding stability is not the sole driving force of SARS‑CoV‑2 evolution. More recent variants are characterized by a more dynamical behavior that results in less efficient viral entry but could optimize other traits, such as antibody escape. These results suggest that to fully understand the strength of the binding between two proteins, the stability of the two isolated partners should be investigated.
Authors: Yiyu Lin, Yan Wang, You Zhou, Xinye Ni, Jiahui Wu, Sen Yang
Abstract: As a core mechanism of epigenetic regulation in eukaryotes, protein post‑translational modifications (PTMs) require precise prediction to decipher dynamic life activity networks. To address the limitations of existing deep learning models in cross‑modal feature fusion, domain generalization, and architectural optimization, this study proposes UniPTMs: the first unified framework for multi‑type PTM prediction. The framework innovatively establishes a "Master‑Slave" dual‑path collaborative architecture: The master path dynamically integrates high‑dimensional representations of protein sequences, structures, and evolutionary information through a Bidirectional Gated Cross‑Attention (BGCA) module, while the slave path optimizes feature discrepancies and recalibration between structural and traditional features using a Low‑Dimensional Fusion Network (LDFN). Complemented by a Multi‑scale Adaptive convolutional Pyramid (MACP) for capturing local feature patterns and a Bidirectional Hierarchical Gated Fusion Network (BHGFN) enabling multi‑level feature integration across paths, the framework employs a Hierarchical Dynamic Weighting Fusion (HDWF) mechanism to intelligently aggregate multimodal features. Enhanced by a novel Hierarchical Contrastive loss function for feature consistency optimization, UniPTMs demonstrates significant performance improvements (3.2%‑11.4% MCC and 4.2%‑14.3% AP increases) over state‑of‑the‑art models across five modification types and transcends the Single‑Type Prediction Paradigm. To strike a balance between model complexity and performance, we have also developed a lightweight variant named UniPTMs‑mini.
Authors: Xinyan Zhao, Yi-Ching Tang, Akshita Singh, Victor J Cantu, KwanHo An, Junseok Lee, Adam E Stogsdill, Ibraheem M Hamdi, Ashwin Kumar Ramesh, Zhiqiang An, Xiaoqian Jiang, Yejin Kim
Abstract: We introduce AbBiBench (Antibody Binding Benchmarking), a benchmarking framework for antibody binding affinity maturation and design. Unlike previous strategies that evaluate antibodies in isolation, typically by comparing them to natural sequences with metrics such as amino acid recovery rate or structural RMSD, AbBiBench instead treats the antibody‑antigen (Ab‑Ag) complex as the fundamental unit. It evaluates an antibody design's binding potential by measuring how well a protein model scores the full Ab‑Ag complex. We first curate, standardize, and share more than 184,500 experimental measurements of antibody mutants across 14 antibodies and 9 antigens (including influenza, lysozyme, HER2, VEGF, integrin, Ang2, and SARS‑CoV‑2), covering both heavy‑chain and light‑chain mutations. Using these datasets, we systematically compare 15 protein models including masked language models, autoregressive language models, inverse folding models, diffusion‑based generative models, and geometric graph models by comparing the correlation between model likelihood and experimental affinity values. Additionally, to demonstrate AbBiBench's generative utility, we apply it to antibody F045‑092 in order to introduce binding to influenza H1N1. We sample new antibody variants with the top‑performing models, rank them by the structural integrity and biophysical properties of the Ab‑Ag complex, and assess them with in vitro ELISA binding assays. Our findings show that structure‑conditioned inverse folding models outperform others in both affinity correlation and generation tasks. Overall, AbBiBench provides a unified, biologically grounded evaluation framework to facilitate the development of more effective, function‑aware antibody design models.
Authors: Filippo Conforto, Antonio Valdes, Willem Vanderlinden, Davide Michieletto
Abstract: Structural‑Maintenance‑of‑Chromosome (SMC) complexes such as condensins organise the folding of chromosomes. However, their role in modulating the entanglement of DNA and chromatin is not fully understood. To address this question, we perform single molecule and bulk characterisation of yeast condensin in entangled DNA. First, we discover that yeast condensin can proficiently bind double‑stranded DNA through its hinge domain, in addition to its heads. Through bulk microrheology assays we then discover that physiological concentrations of yeast condensin increase both the viscosity and elasticity of dense solutions of lambda‑DNA, suggesting that condensin acts as a crosslinker in entangled DNA, stabilising entanglements rather than resolving them. This contrasts with the popular theoretical picture in which SMCs purely drive the formation of segregated, bottle‑brush‑like chromosome structures. We further discover that the presence of ATP fluidifies the solution, likely by activating loop extrusion, but does not recover the viscosity measured in the absence of protein. Finally, we show that the observed rheology can be understood by modelling SMCs as transient crosslinkers in bottle‑brush‑like entangled polymers. Our findings help us to understand how SMCs affect the dynamics and entanglement of genomes.
Authors: Anton Klimek, Benjamin A. Dalton, Roland R. Netz
Abstract: Subdiffusion is a hallmark of complex systems, ranging from protein folding to transport in viscoelastic media. However, despite its pervasiveness, the mechanistic origins of subdiffusion remain contested. Here, we analyze both Markovian and non‑Markovian dynamics, in the presence and absence of energy barriers, in order to disentangle the distinct contributions of memory‑dependent friction and energy barriers to the emergence of subdiffusive behavior. Focusing on the mean squared displacement (MSD), we develop an analytical framework that connects subdiffusion to multiscale memory effects in the generalized Langevin equation (GLE), and derive the subdiffusive scaling behavior of the MSD for systems governed by multi‑exponential memory kernels. We identify persistence and relaxation timescales that delineate dynamical regimes in which subdiffusion arises from either memory or energy barrier effects. By comparing analytical predictions with simulations, we confirm that memory dominates the overdamped dynamics for barrier heights up to approximately $2\,k_BT$, a regime recently shown to be relevant for protein folding. Overall, our results advance the theoretical understanding of anomalous diffusion and provide practical tools that are broadly applicable to fields as diverse as molecular biophysics, polymer physics, and active matter systems.
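The mean squared displacement that this abstract centers on can be estimated from a single trajectory by averaging squared displacements over all time origins. The sketch below is a generic textbook estimator, not the authors' analysis code; the function names and the toy trajectory are assumptions made for illustration. It also estimates a local scaling exponent between two lag times, where an exponent below 1 would indicate subdiffusion.

```python
import math

def mean_squared_displacement(traj, lag):
    """Time-averaged MSD of a 1D trajectory at an integer lag (in frames)."""
    n = len(traj) - lag
    if lag < 0 or n <= 0:
        raise ValueError("lag must be non-negative and shorter than the trajectory")
    return sum((traj[i + lag] - traj[i]) ** 2 for i in range(n)) / n

def local_exponent(traj, lag_a, lag_b):
    """Slope of log(MSD) versus log(lag) between two positive lags."""
    msd_a = mean_squared_displacement(traj, lag_a)
    msd_b = mean_squared_displacement(traj, lag_b)
    return math.log(msd_b / msd_a) / math.log(lag_b / lag_a)

# Toy ballistic trajectory (constant velocity), used only as a sanity check:
# its MSD grows as lag**2, so the exponent is 2 rather than subdiffusive.
toy_traj = [0.0, 1.0, 2.0, 3.0, 4.0]
alpha = local_exponent(toy_traj, 1, 2)
```

For real data one would average over many trajectories and fit the exponent over a range of lags rather than from two points.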
Authors: Junde Xu, Zijun Gao, Xinyi Zhou, Jie Hu, Xingyi Cheng, Le Song, Guangyong Chen, Pheng-Ann Heng, Jiezhong Qiu
Abstract: The inverse folding problem, aiming to design amino acid sequences that fold into desired three‑dimensional structures, is pivotal for various biotechnological applications. Here, we introduce a novel approach leveraging Direct Preference Optimization (DPO) to fine‑tune an inverse folding model using feedback from a protein folding model. Given a target protein structure, we begin by sampling candidate sequences from the inverse‑folding model, then predict the three‑dimensional structure of each sequence with the folding model to generate pairwise structural‑preference labels. These labels are used to fine‑tune the inverse‑folding model under the DPO objective. Our results on the CATH 4.2 test set demonstrate that DPO fine‑tuning not only improves sequence recovery of baseline models but also leads to a significant improvement in average TM‑Score from 0.77 to 0.81, indicating enhanced structure similarity. Furthermore, iterative application of our DPO‑based method on challenging protein structures yields substantial gains, with an average TM‑Score increase of 79.5% with regard to the baseline model. This work establishes a promising direction for enhancing protein sequence design ability from structure feedback by effectively utilizing preference optimization.
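The DPO objective mentioned in this abstract can be sketched for a single preference pair as follows. This is the standard published form of the DPO loss, not the authors' training code. The scalar log-probabilities stand in for sequence log-likelihoods under the fine-tuned and frozen reference inverse-folding models, and the `beta` value is an arbitrary illustrative choice.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the margin is zero and the loss equals log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy that raises the preferred sequence and lowers the rejected one: lower loss.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In the setting described above, the "win" and "lose" sequences would be the members of a pair ranked by the structural agreement of their predicted folds with the target.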
Authors: Bhanjan Debnath, Parag Katira
Abstract: The interaction lifetimes between condensate‑forming biomolecules can dictate both the specificity of the condensate‑forming species as well as the fluidity and exchange dynamics of these condensates. Using a heuristic modeling approach, we show that single‑step vs. sequential, multistep binding‑unbinding interactions between proteins can lead to similar average interaction lifetimes, but with either exponential or truncated power‑law‑like lifetime distributions, respectively. Combining this model with Brownian dynamics simulations, we find that the differences in these lifetime distributions influence the features of condensates, such as their fluidic nature, aging, and size distribution.
Authors: Ella Rannon, David Burstein
Abstract: Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large‑scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.
Authors: Rong Chen, Quan-Xin Mei, Wen-Ding Zhao, Lin Yao, Hao-Xiang Yang, Shun-Yao Zhang, Jiao Chen, Hong-Lin Li
Abstract: AlphaFold has achieved groundbreaking advancements in protein structure prediction, exerting profound influence across biology, medicine, and drug discovery. However, its reliance on multiple sequence alignment (MSA) is inherently time‑consuming due to the NP‑hard nature of constructing MSAs. Quantum computing emerges as a promising alternative to classical computing, offering the potential for exponential speedup and improved accuracy on such complex optimization challenges. This work bridges the gap between quantum computing and the MSA task: we compare classical and quantum computational scaling as the number of qubits increases and assess the role of quantum entanglement in model performance. Furthermore, we propose a hybrid query encoding approach, hyQUBO, that avoids redundancy and thereby reduces the required quantum resources to a scaling of $\mathcal{O}(NL)$. Additionally, VQE is coupled with the quenched CVaR scheme to enhance robustness and convergence. The integration of multiple strategies facilitates the robust deployment of the quantum algorithm from idealized simulators (on CPU and GPU) to real‑world, noisy quantum devices (HYQ‑A37). To the best of our knowledge, our work represents the largest‑scale implementation of digital simulation using up to 16 qubits on a trapped‑ion quantum computer for a life science problem, achieving state‑of‑the‑art performance in both simulation and experimental results. Our work paves the way towards large‑scale simulations of life science tasks on real quantum processors.
Authors: Mengdi Liu, Xiaoxue Cheng, Zhangyang Gao, Hong Chang, Cheng Tan, Shiguang Shan, Xilin Chen
Abstract: Designing protein sequences that fold into a target 3D structure, known as protein inverse folding, is a fundamental challenge in protein engineering. While recent deep learning methods have achieved impressive performance by recovering native sequences, they often overlook the one‑to‑many nature of the problem: multiple diverse sequences can fold into the same structure. This motivates the need for a generative model capable of designing diverse sequences while preserving structural consistency. To address this trade‑off, we introduce ProtInvTree, the first reward‑guided tree‑search framework for protein inverse folding. ProtInvTree reformulates sequence generation as a deliberate, step‑wise decision‑making process, enabling the exploration of multiple design paths and exploitation of promising candidates through self‑evaluation, lookahead, and backtracking. We propose a two‑stage focus‑and‑grounding action mechanism that decouples position selection and residue generation. To efficiently evaluate intermediate states, we introduce a jumpy denoising strategy that avoids full rollouts. Built upon pretrained protein language models, ProtInvTree supports flexible test‑time scaling by expanding the search depth and breadth without retraining. Empirically, ProtInvTree outperforms state‑of‑the‑art baselines across multiple benchmarks, generating structurally consistent yet diverse sequences, including those far from the native ground truth.
Authors: Fanglei Xue, Andrew Kubaney, Zhichun Guo, Joseph K. Min, Ge Liu, Yi Yang, David Baker
Abstract: Protein sequence design methods have demonstrated strong performance in sequence generation for de novo protein design. However, as the training objective was sequence recovery, it does not guarantee designability, that is, the likelihood that a designed sequence folds into the desired structure. To bridge this gap, we redefine the training objective by steering sequence generation toward high designability. To do this, we integrate Direct Preference Optimization (DPO), using AlphaFold pLDDT scores as the preference signal, which significantly improves the in silico design success rate. To further refine sequence generation at a finer, residue‑level granularity, we introduce Residue‑level Designability Preference Optimization (ResiDPO), which applies residue‑level structural rewards and decouples optimization across residues. This enables direct improvement in designability while preserving regions that already perform well. Using a curated dataset with residue‑level annotations, we fine‑tune LigandMPNN with ResiDPO to obtain EnhancedMPNN, which achieves a nearly 3‑fold increase in in silico design success rate (from 6.56% to 17.57%) on a challenging enzyme design benchmark.
Authors: Giulio Costantini, Lorenzo Caprini, Umberto Marini Bettolo Marconi, Fabio Cecconi
Abstract: Understanding the link between structure and function in proteins is fundamental in molecular biology and proteomics. A central question in this context is whether allostery ‑ where the binding of a molecule at one site affects the activity of a distant site ‑ emerges as a further manifestation of the intricate interplay between structure, function, and intrinsic dynamics. This study explores how allosteric regulation is modified when intrinsic protein dynamics operate under out‑of‑equilibrium conditions. To this purpose, we introduce a simple nonequilibrium model of protein dynamics, inspired by active matter systems, by generalizing the widely employed Gaussian Network Model (GNM) to incorporate non‑thermal effects. Our approach underscores the advantage of framing allostery as a causal process by using, as a benchmark system, the second PDZ domain of the human phosphatase hPT1E that mediates protein‑protein interactions. We employ causal indicators, such as response functions and transfer entropy, to identify the network of PDZ2 residues through which the allosteric signal propagates across the protein structure. These indicators reveal specific regions that align well with experimental observations. Furthermore, our results suggest that deviations from purely thermal fluctuations can significantly influence allosteric communication by introducing distinct timescales and memory effects. This influence is particularly relevant when the allosteric response unfolds on timescales incompatible with relaxation to equilibrium. Accordingly, non‑thermal fluctuations may become essential for accurately describing protein responses to ligand binding and developing a comprehensive understanding of allosteric regulation.
Authors: David Gamez
Abstract: Over the last thirty years, considerable progress has been made with the development of systems that can drive cars, play games, predict protein folding and generate natural language. These systems are described as intelligent and there has been a great deal of talk about the rapid increase in artificial intelligence and its potential dangers. However, our theoretical understanding of intelligence and ability to measure it lag far behind our capacity for building systems that mimic intelligent human behaviour. There is no commonly agreed definition of the intelligence that AI systems are said to possess. No‑one has developed a practical measure that would enable us to compare the intelligence of humans, animals and AIs on a single ratio scale.
This paper sets out a new universal measure of intelligence that is based on the hypothesis that prediction is the most important component of intelligence. As an agent interacts with its normal environment, the accuracy of its predictions is summed up and the complexity of its predictions and perceived environment is accounted for using Kolmogorov complexity. Two experiments were carried out to evaluate the practical feasibility of the algorithm. These demonstrated that it could measure the intelligence of an agent embodied in a virtual maze and an agent that makes predictions about time‑series data. This universal measure could be the starting point for a new comparative science of intelligence that ranks humans, animals and AIs on a single ratio scale.
Authors: Jiarui Lu, Xiaoyin Chen, Stephen Zhewen Lu, Aurélie Lozano, Vijil Chenthamarakshan, Payel Das, Jian Tang
Abstract: Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time‑consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data‑driven approaches remains challenging, as standard energy‑based objectives often lead to intractable optimization. In this paper, we introduce Energy‑based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state‑of‑the‑art performance in generating high‑quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
Authors: Caio Cheohen, Vinnícius M. S. Gomes, Manuela L. da Silva
Abstract: The COVID‑19 pandemic, caused by SARS‑CoV‑2, highlighted the critical need for accurate prediction of disease severity to optimize healthcare resource allocation and patient management. The spike protein, which facilitates viral entry into host cells, exhibits high mutation rates, particularly in the receptor‑binding domain, influencing viral pathogenicity. Artificial intelligence approaches, such as deep learning, offer promising solutions for leveraging genomic and clinical data to predict disease outcomes. Objective: This study aimed to develop a hybrid CNN‑LSTM deep learning model to predict COVID‑19 severity using spike protein sequences and associated clinical metadata from South American patients. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met inclusion criteria after standardization. The dataset included 2,313 severe and 1,154 mild cases. A feature engineering pipeline extracted features from sequences, while demographic and clinical variables were one‑hot encoded. A hybrid CNN‑LSTM architecture was trained, combining CNN layers for local pattern extraction and an LSTM layer for long‑term dependency modeling. Results: The model achieved an F1 score of 82.92%, ROC‑AUC of 0.9084, precision of 83.56%, and recall of 82.85%, demonstrating robust classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most prevalent lineages (P.1, AY.99.2) and clades (GR, GK) aligned with regional epidemiological trends, suggesting potential associations between viral genetics and clinical outcomes. Conclusion: The CNN‑LSTM hybrid model effectively predicted COVID‑19 severity using spike protein sequences and clinical data, highlighting the utility of AI in genomic surveillance and precision public health. Despite limitations, this approach provides a framework for early severity prediction in future outbreaks.
Authors: Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen
Abstract: Retrieving the biological impacts of protein‑protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time‑consuming and challenging. Large Language Models (LLMs) and Retrieval‑Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question‑answer benchmark of 4,420 question‑answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold‑standard dataset (500 QA pairs) through expert‑driven data annotation. We developed an ensemble auto‑evaluation LLM that reflected expert labeling characteristics, which facilitates the construction of a silver‑standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
Authors: Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Minhao Wu, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Muyuan Chen, Min Xu
Abstract: Single‑particle cryo‑electron microscopy (cryo‑EM) has become a cornerstone of structural biology, enabling near‑atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo‑EM processing tools is constrained by the scarcity of high‑quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a synthesis framework that unifies versatile biophysical modeling with the first conditional cycle‑consistent diffusion model tailored for cryo‑EM. The biophysical engine provides multi‑functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and mask‑guided contrastive learning to ensure realistic noise while preserving structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, and achieves superior performance over state‑of‑the‑art baselines, while also generalizing effectively to held‑out protein families.
Authors: Meital Bojan, Sanketh Vedula, Advaith Maddipatla, Nadav Bojan Sellam, Anar Rzayev, Federico Napoli, Paul Schanda, Alex M. Bronstein
Abstract: The local structure of a protein strongly impacts its function and interactions with other molecules. Therefore, a concise, informative representation of a local protein environment is essential for modeling and designing proteins and biomolecular interactions. However, these environments' extensive structural and chemical variability makes them challenging to model, and such representations remain under‑explored. In this work, we propose a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs). We demonstrate that this embedding effectively captures both local structure (e.g., secondary motifs), and chemical features (e.g., amino‑acid identity and protonation state). We further show that the AFM‑derived representation space exhibits meaningful structure, enabling the construction of data‑driven priors over the distribution of biomolecular environments. Finally, in the context of biomolecular NMR spectroscopy, we demonstrate that the proposed representations enable a first‑of‑its‑kind physics‑informed chemical shift predictor that achieves state‑of‑the‑art accuracy. Our results demonstrate the surprising effectiveness of atomistic foundation models and their emergent representations for protein modeling beyond traditional molecular simulations. We believe this will open new lines of work in constructing effective functional representations for protein environments.
Authors: Sylvey Lin, Zhi-Yi Cao
Abstract: We investigate whether synthetic images generated by diffusion models can enhance multi‑label classification of protein subcellular localization. Specifically, we implement a simplified class‑conditional denoising diffusion probabilistic model (DDPM) to produce label‑consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test‑time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.
Authors: Artem Moskalev, Mangal Prakash, Junjie Xu, Tianyu Cui, Rui Liao, Tommaso Mansi
Abstract: Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self‑attention suffer from quadratic complexity, while local methods such as distance‑based message passing sacrifice global information. Inspired by the recent success of state‑space and long‑convolutional models, we introduce Geometric Hyena, the first equivariant long‑convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub‑quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all‑atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute than equivariant self‑attention. Notably, our model processes the geometric context of 30k tokens 20x faster than the equivariant transformer and allows 72x longer context within the same budget.
Authors: Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek
Abstract: Designing protein sequences of both high fitness and novelty is a challenging task in data‑efficient protein engineering. Exploration beyond wild‑type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre‑trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness‑relevant residue selection with biologically‑constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild‑type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
Authors: Lianghui Zhu, Xitong Ling, Minxi Ouyang, Xiaoping Liu, Tian Guan, Mingxi Fu, Zhiqiang Cheng, Fanglei Fu, Maomao Zeng, Liming Liu, Song Duan, Qiang Huang, Ying Xiao, Jianming Li, Shanming Lu, Zhenghua Piao, Mingxi Zhu, Yibo Jin, Shan Xu, Qiming He, Yizhi Wang, Junru Cheng, Xuanyu Wang, Luxi Xie, Houqiang Li, Sufang Tian, Yonghong He
Abstract: Gastrointestinal (GI) diseases represent a clinically significant burden, necessitating precise diagnostic approaches to optimize patient outcomes. Conventional histopathological diagnosis suffers from limited reproducibility and diagnostic variability. To overcome these limitations, we develop Digepath, a specialized foundation model for GI pathology. Our framework introduces a dual‑phase iterative optimization strategy combining pretraining with fine‑screening, specifically designed to address the detection of sparsely distributed lesion areas in whole‑slide images. Digepath is pretrained on over 353 million multi‑scale images from 210,043 H&E‑stained slides of GI diseases. It attains state‑of‑the‑art performance on 33 out of 34 tasks related to GI pathology, including pathological diagnosis, protein expression status prediction, gene mutation prediction, and prognosis evaluation. We further translate the intelligent screening module for early GI cancer and achieve near‑perfect 99.70% sensitivity across nine independent medical institutions. This work not only advances AI‑driven precision pathology for GI diseases but also bridges critical gaps in histopathological practice.
Authors: Jie Gao, Jun Li, Jing Hu, Shanzhuo Zhang, Kunrui Zhu, Yueyang Huang, Xiaonan Zhang, Xiaomin Fang
Abstract: Protein binder design is central to therapeutics, diagnostics, and synthetic biology, yet practical deployment remains challenging due to fragmented workflows, high computational costs, and complex tool integration. We present HelixDesign‑Binder, a production‑grade, high‑throughput platform built on HelixFold3 that automates the full binder design pipeline, from backbone generation and sequence design to structural evaluation and multi‑dimensional scoring. By unifying these stages into a scalable and user‑friendly system, HelixDesign‑Binder enables efficient exploration of binder candidates with favorable structural, energetic, and physicochemical properties. The platform leverages Baidu Cloud's high‑performance infrastructure to support large‑scale design and incorporates advanced scoring metrics, including ipTM, predicted binding free energy, and interface hydrophobicity. Benchmarking across six protein targets demonstrates that HelixDesign‑Binder reliably produces diverse and high‑quality binders, some of which match or exceed validated designs in predicted binding affinity. HelixDesign‑Binder is accessible via an interactive web interface on the PaddleHelix platform, supporting both academic research and industrial applications in antibody and protein binder development.
Authors: Divya Nori, Anisha Parsan, Caroline Uhler, Wengong Jin
Abstract: Protein binder design has been transformed by hallucination‑based methods that optimize structure prediction confidence metrics, such as the interface predicted TM‑score (ipTM), via backpropagation. However, these metrics do not reflect the statistical likelihood of a binder‑target complex under the learned distribution and yield sparse gradients for optimization. In this work, we propose a method to extract such likelihoods from structure predictors by reinterpreting their confidence outputs as an energy‑based model (EBM). By leveraging the Joint Energy‑based Modeling (JEM) framework, we introduce pTMEnergy, a statistical energy function derived from predicted inter‑residue error distributions. We incorporate pTMEnergy into BindEnergyCraft (BECraft), a design pipeline that maintains the same optimization framework as BindCraft but replaces ipTM with our energy‑based objective. BECraft outperforms BindCraft, RFDiffusion, and ESM3 across multiple challenging targets, achieving higher in silico binder success rates while reducing structural clashes. Furthermore, pTMEnergy establishes a new state‑of‑the‑art in structure‑based virtual screening tasks for miniprotein and RNA aptamer binders.
Authors: Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Ruan de Kock, Claude Formanek, Sasha Abramowitz, Oumayma Mahjoub, Wiem Khlifi, Simon Du Toit, Louay Ben Nessir, Refiloe Shabe, Noah De Nicola, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Arnu Pretorius
Abstract: Reinforcement learning (RL) systems have countless applications, from energy‑grid management to protein design. However, such real‑world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state‑of‑the‑art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero‑shot inference. Meanwhile, many digital or simulation‑based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi‑agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state‑of‑the‑art across 17 tasks, using only a couple of seconds of extra wall‑clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference‑strategies‑rl.
Authors: Jiahao Kuang, Nuowei Liu, Jie Wang, Changzhi Sun, Tao Ji, Yuanbin Wu
Abstract: Function‑guided protein design is a crucial task with significant applications in drug discovery and enzyme engineering. However, the field lacks a unified and comprehensive evaluation framework. Current models are assessed using inconsistent and limited subsets of metrics, which prevents fair comparison and a clear understanding of the relationships between different evaluation criteria. To address this gap, we introduce PDFBench, the first comprehensive benchmark for function‑guided de novo protein design. Our benchmark systematically evaluates eight state‑of‑the‑art models on 16 metrics across two key settings: description‑guided design, for which we repurpose the Mol‑Instructions dataset, originally lacking quantitative benchmarking, and keyword‑guided design, for which we introduce a new test set, SwissTest, created with a strict datetime cutoff to ensure data integrity. By benchmarking across a wide array of metrics and analyzing their correlations, PDFBench enables more reliable model comparisons and provides key insights to guide future research.
Authors: Chen Liu, Mingchen Li, Yang Tan, Wenrui Gou, Guisheng Fan, Bingxin Zhou
Abstract: A pivotal area of research in antibody engineering is to find effective modifications that enhance antibody‑antigen binding affinity. Traditional wet‑lab experiments assess mutants in a costly and time‑consuming manner. Emerging deep learning solutions offer an alternative by modeling antibody structures to predict binding affinity changes. However, they heavily depend on high‑quality complex structures, which are frequently unavailable in practice. Therefore, we propose ProtAttBA, a deep learning model that predicts binding affinity changes based solely on the sequence information of antibody‑antigen complexes. ProtAttBA employs a pre‑training phase to learn protein sequence patterns, followed by a supervised training phase using labeled antibody‑antigen complex data to train a cross‑attention‑based regressor for predicting binding affinity changes. We evaluated ProtAttBA on three open benchmarks under different conditions. Compared to both sequence‑ and structure‑based prediction methods, our approach achieves competitive performance, demonstrating notable robustness, especially with uncertain complex structures. The attention mechanism also makes our method interpretable: we show that the learned attention scores can identify critical residues that affect binding affinity. This work introduces a rapid and cost‑effective computational tool for antibody engineering, with the potential to accelerate the development of novel therapeutic antibodies.
Authors: Sophia Hager, Aleem Khan, Andrew Wang, Nicholas Andrews
Abstract: Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that extrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well‑designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence‑level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency.
Authors: Xiaowen Ling, Zhiqiang Li, Yanbin Wang, Zhuhong You
Abstract: As protein informatics advances rapidly, the demand for enhanced predictive accuracy, structural analysis, and functional understanding has intensified. Transformer models, as powerful deep learning architectures, have demonstrated unprecedented potential in addressing diverse challenges across protein research. However, a comprehensive review of Transformer applications in this field remains lacking. This paper bridges this gap by surveying over 100 studies, offering an in‑depth analysis of practical implementations and research progress of Transformers in protein‑related tasks. Our review systematically covers critical domains, including protein structure prediction, function prediction, protein‑protein interaction analysis, functional annotation, and drug discovery/target identification. To contextualize these advancements across various protein domains, we adopt a domain‑oriented classification system. We first introduce foundational concepts: the Transformer architecture and attention mechanisms, categorize Transformer variants tailored for protein science, and summarize essential protein knowledge. For each research domain, we outline its objectives and background, critically evaluate prior methods and their limitations, and highlight transformative contributions enabled by Transformer models. We also curate and summarize pivotal datasets and open‑source code resources to facilitate reproducibility and benchmarking. Finally, we discuss persistent challenges in applying Transformers to protein informatics and propose future research directions. This review aims to provide a consolidated foundation for the synergistic integration of Transformer and protein informatics, fostering further innovation and expanded applications in the field.
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Elkerdawy, Ahmed Elnaggar
Abstract: Protein language models (PLMs) have emerged as powerful tools to detect complex patterns of protein sequences. However, the capability of PLMs to fully capture information on protein sequences might be limited by focusing on single pre‑training tasks. Although adding data modalities or supervised objectives can improve the performance of PLMs, pre‑training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, our research investigated a multi‑task pre‑training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities and protein sequence completion relying only on protein sequences as input. This multi‑task pre‑training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results demonstrated improved performance in downstream tasks, such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. The integration of multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar
Abstract: Protein‑protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence‑based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high‑quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB‑Affinity dataset, comprising 8,207 unique protein‑protein interaction entries, obtained by resolving annotation inconsistencies and duplicate entries for multi‑chain protein interactions. This dataset applies a stringent sequence identity threshold (≤ 30%) to ensure robust splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine‑tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to a 12% increase in Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
Authors: Thomas Hamelryck, Kanti V. Mardia
Abstract: The seminal breakthrough of AlphaFold in protein structure prediction relied on a learned potential energy function parameterized by deep models, in contrast to its successors AlphaFold2 and AlphaFold3, which lack an explicit probabilistic interpretation. While AlphaFold's potential was originally justified by heuristic analogy to physical potentials of mean force, we show that it can instead be understood as a principled instance of probability kinematics (PK), also known as Jeffrey conditioning, a generalization of Bayesian updating. This reinterpretation reveals that AlphaFold is a generalized Bayesian model that explicitly defines a posterior distribution over structures, providing a deeper explanation of its success and a foundation for future model design. To demonstrate this framework with precision, we introduce a tractable synthetic model in which an angular random walk prior is updated with distance‑based evidence via PK, directly mirroring AlphaFold's mechanism. This setting allows us to explore the probabilistic foundations of AlphaFold in a clear and interpretable way. Our work connects a landmark in protein structure prediction to a broader class of compositional deep generative models and points to new opportunities for principled probabilistic approaches.
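Illustrative note (not from the paper): the abstract above describes probability kinematics, or Jeffrey conditioning, as a generalization of Bayesian updating. A minimal sketch of the standard rule on a discrete toy example follows, assuming a two-state hypothesis and a two-cell evidence partition; the state names, probabilities, and the function `jeffrey_update` are hypothetical and chosen only for illustration. The rule replaces the prior marginal over the evidence partition with new weights q(e) while keeping each conditional P(h | e) fixed.

```python
def jeffrey_update(joint, new_evidence_probs):
    """Jeffrey conditioning on a discrete joint distribution.

    joint[h][e] is the prior probability P(H=h, E=e).
    new_evidence_probs[e] is the updated marginal q(E=e).
    Returns the updated marginal P'(H=h) = sum_e P(h | e) * q(e).
    """
    # Prior marginal over the evidence partition, P(E=e).
    evidence_marginal = {}
    for row in joint.values():
        for e, p in row.items():
            evidence_marginal[e] = evidence_marginal.get(e, 0.0) + p

    # Reweight each conditional P(h | e) by the new evidence weight q(e).
    updated = {}
    for h, row in joint.items():
        updated[h] = sum(
            (p / evidence_marginal[e]) * new_evidence_probs[e]
            for e, p in row.items()
            if evidence_marginal[e] > 0
        )
    return updated


# Hypothetical toy prior: a structural state and a coarse distance observation.
joint = {
    "folded": {"close": 0.4, "far": 0.1},
    "unfolded": {"close": 0.1, "far": 0.4},
}

# Soft evidence: "close" is now believed with probability 0.8, not certainty.
soft = jeffrey_update(joint, {"close": 0.8, "far": 0.2})

# Hard evidence: q puts all mass on "close", which reduces to ordinary Bayes.
hard = jeffrey_update(joint, {"close": 1.0, "far": 0.0})
```

When the new evidence weights put all mass on one cell, the update coincides with ordinary Bayesian conditioning, which is the sense in which the rule generalizes it.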
Authors: Haitao Lin, Odin Zhang, Jia Xu, Yunfan Liu, Zheng Cheng, Lirong Wu, Yufei Huang, Zhifeng Gao, Stan Z. Li
Abstract: The affinity and specificity of protein‑molecule binding directly impact functional outcomes, uncovering the mechanisms underlying biological regulation and signal transduction. Most deep‑learning‑based prediction approaches focus on structures of atoms or fragments. However, quantum chemical properties, such as electronic structures, are the key to unveiling interaction patterns but remain largely underexplored. To bridge this gap, we propose ECBind, a method for tokenizing electron cloud signals into quantized embeddings, enabling their integration into downstream tasks such as binding affinity prediction. By incorporating electron densities, ECBind helps uncover binding modes that cannot be fully represented by atom‑level models. Specifically, to remove the redundancy inherent in electron cloud signals, a structure‑aware transformer and hierarchical codebooks encode 3D binding sites enriched with electron structures into tokens. These tokenized codes are then used for specific tasks with labels. To extend its applicability to a wider range of scenarios, we utilize knowledge distillation to develop an electron‑cloud‑agnostic prediction model. Experimentally, ECBind demonstrates state‑of‑the‑art performance across multiple tasks, achieving improvements of 6.42% and 15.58% in per‑structure Pearson and Spearman correlation coefficients, respectively.
Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, Yuanbin Wu
Abstract: Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function‑based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state‑of‑the‑art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well‑folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
Authors: Morteza Rakhshaninejad, Mira Jurgens, Nicolas Dewolf, Willem Waegeman
Abstract: Accurate drug‑target interaction (DTI) prediction with machine learning models is essential for drug discovery. Such models should also provide a credible representation of their uncertainty, but applying classical marginal conformal prediction (CP) in DTI prediction often overlooks variability across drug and protein subgroups. In this work, we analyze three cluster‑conditioned CP methods for DTI prediction, and compare them with marginal and group‑conditioned CP. Clusterings are obtained via nonconformity scores, feature similarity, and nearest neighbors, respectively. Experiments on the KIBA dataset using four data‑splitting strategies show that nonconformity‑based clustering yields the tightest intervals and most reliable subgroup coverage, especially in random and fully unseen drug‑protein splits. Group‑conditioned CP works well when one entity is familiar, but residual‑driven clustering provides robust uncertainty estimates even in sparse or novel scenarios. These results highlight the potential of cluster‑based CP for improving DTI prediction under uncertainty.
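Illustrative note (not from the paper): the contrast between marginal and cluster-conditioned conformal prediction mentioned above can be sketched with split conformal calibration on toy residuals. This is a generic sketch, not the authors' method: the scores, the cluster labels, and the helper names `conformal_quantile` and `cluster_quantiles` are assumptions, and the clustering itself (by nonconformity scores, feature similarity, or nearest neighbors in the abstract) is taken as given.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def cluster_quantiles(scores, clusters, alpha):
    """Calibrate one quantile per cluster instead of one for all data."""
    grouped = {}
    for score, cluster in zip(scores, clusters):
        grouped.setdefault(cluster, []).append(score)
    return {c: conformal_quantile(s, alpha) for c, s in grouped.items()}


# Hypothetical calibration residuals |y - y_hat| and cluster labels.
scores = [0.1, 0.2, 0.3, 0.4, 1.0, 1.5, 2.0, 2.5, 3.0]
clusters = ["a"] * 4 + ["b"] * 5
alpha = 0.25

marginal = conformal_quantile(scores, alpha)
per_cluster = cluster_quantiles(scores, clusters, alpha)

# A prediction y_hat gets the interval [y_hat - q, y_hat + q], where q is
# either the marginal quantile or the quantile of the point's cluster.
print(marginal, per_cluster)
```

In this toy setting the single marginal half-width is much wider than cluster "a" needs and narrower than cluster "b" needs, which is the kind of subgroup variability that conditioning on clusters is meant to address.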
Authors: Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh
Abstract: Many real‑world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the W_2 distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning donor‑level representations from single‑nuclei RNA sequencing data (6M cells), capturing clonal dynamics in lineage‑traced RNA sequencing data (150K cells), predicting perturbation effects on transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single‑cell images), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).
Authors: Taskin Mehereen, Sourav Saha, Intesar Jawad Jaigirdar, Chanwook Park
Abstract: The ability to accurately model interatomic interactions in large‑scale systems is fundamental to understanding a wide range of physical and chemical phenomena, from drug‑protein binding to the behavior of next‑generation materials. While machine learning interatomic potentials (MLIPs) have made it possible to achieve ab initio‑level accuracy at significantly reduced computational cost, they still require very large training datasets and incur substantial training time and expense. In this work, we propose the Interpolating Neural Network Force Field (INN‑FF), a novel framework that merges interpolation theory and tensor decomposition with neural network architectures to efficiently construct molecular dynamics potentials from limited quantum mechanical data. Interpolating Neural Networks (INNs) achieve comparable or better accuracy than traditional multilayer perceptrons (MLPs) while requiring orders of magnitude fewer trainable parameters. On benchmark datasets such as liquid water and rMD17, INN‑FF not only matches but often surpasses state‑of‑the‑art accuracy by an order of magnitude, while achieving significantly lower error when trained on smaller datasets. These results suggest that INN‑FF offers a promising path toward building efficient and scalable machine‑learned force fields.
Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
Abstract: Protein language models (pLMs) pre‑trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre‑trained protein graph neural networks (pGNNs). First, a latent‑level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter‑protein structural information. Additionally, a physical‑level task integrates intra‑protein information by training pLMs to predict structure tokens. Together, the proposed dual‑task framework effectively incorporates both inter‑ and intra‑protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that uses a small model trained on high‑quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method as a simple, lightweight post‑training step to the state‑of‑the‑art ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.
Authors: Kathryn Linehan, Radu Balan
Abstract: The singular value decomposition (SVD) is commonly used in applications requiring a low rank matrix approximation. However, the singular vectors cannot be interpreted in terms of the original data. For applications requiring this type of interpretation, e.g., selection of important data matrix columns or rows, the approximate CUR matrix factorization can be used. Work on the CUR matrix approximation has generally focused on algorithm development, theoretical guarantees, and applications. In this work, we present a novel deterministic CUR formulation and algorithm with theoretical convergence guarantees. The algorithm utilizes convex optimization, finds important columns and rows separately, and allows the user to control the number of important columns and rows selected from the original data matrix. We present numerical results and demonstrate the effectiveness of our CUR algorithm as a feature selection method on gene expression data. These results are compared to those using the SVD and other CUR algorithms as the feature selection method. Lastly, we present a novel application of CUR as a feature selection method to determine discriminant proteins when clustering protein expression data in a self‑organizing map (SOM), and compare the performance of multiple CUR algorithms in this application.
Authors: Ashley Wang, Peter Chin
Abstract: The graph alignment problem explores the concept of node correspondence and its optimality. In this paper, we focus on purely geometric graph alignment methods, namely our newly proposed Ricci Matrix Comparison (RMC) and its original form, Degree Matrix Comparison (DMC). To formulate a Ricci‑curvature‑based graph alignment situation, we start with discussing different ideas of constructing one of the most typical and important topological objects, the torus, and then move on to introducing the RMC based on DMC with theoretical motivations. Lastly, we will present to the reader experimental results on a torus and a complex protein‑protein interaction network that indicate the potential of applying a differential‑geometric view to graph alignment. Results show that a direct variation of DMC using Ricci curvature can help with identifying holes in tori and aligning line graphs of a complex network at 80‑90+% accuracy. This paper contributes a new perspective to the field of graph alignment and partially shows the validity of the previous DMC method.
Authors: Kanan Kiguchi, Yunhao Tu, Katsuhiro Ajito, Fady Alnajjar, Kazuyuki Murase
Abstract: We propose a novel framework for integrating fragmented multi‑modal data in Alzheimer's disease (AD) research using large language models (LLMs) and knowledge graphs. While traditional multimodal analysis requires matched patient IDs across datasets, our approach demonstrates population‑level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts. Statistical analysis identified significant features in each modality, which were connected as nodes in a knowledge graph. LLMs then analyzed the graph to extract potential correlations and generate hypotheses in natural language. This approach revealed several novel relationships, including a potential pathway linking metabolic risk factors to tau protein abnormalities via neuroinflammation (r>0.6, p<0.001), and unexpected correlations between frontal EEG channels and specific gene expression profiles (r=0.42‑0.58, p<0.01). Cross‑validation with independent datasets confirmed the robustness of major findings, with consistent effect sizes across cohorts (variance <15%). The reproducibility of these findings was further supported by expert review (Cohen's κ=0.82) and computational validation. Our framework enables cross‑modal integration at a conceptual level without requiring patient ID matching, offering new possibilities for understanding AD pathology through fragmented data reuse and generating testable hypotheses for future research.
Authors: Hongbo Xia, Kaiqiang Yu, Shengxin Liu, Cheng Long, Xun Zhou
Abstract: Cohesive subgraph mining is a fundamental problem in graph theory with numerous real‑world applications, such as social network analysis and protein‑protein interaction modeling. Among various cohesive subgraphs, the γ‑quasi‑clique is widely studied for its flexibility in requiring each vertex to connect to at least a γ proportion of other vertices in the subgraph. However, solving the maximum γ‑quasi‑clique problem is NP‑hard and further complicated by the lack of the hereditary property, which makes designing efficient pruning strategies challenging. Existing algorithms, such as DDA and FastQC, either struggle with scalability or exhibit significant performance declines for small values of γ. In this paper, we propose a novel algorithm, IterQC, which reformulates the maximum γ‑quasi‑clique problem as a series of k‑plex problems that possess the hereditary property. IterQC introduces a non‑trivial iterative framework and incorporates two key optimization techniques: (1) the pseudo lower bound (pseudo LB) technique, which leverages information across iterations to improve the efficiency of branch‑and‑bound searches, and (2) the preprocessing technique that reduces problem size and unnecessary iterations. Extensive experiments demonstrate that IterQC achieves up to four orders of magnitude speedup and solves significantly more graph instances compared to state‑of‑the‑art algorithms DDA and FastQC.
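Illustrative note (not from the paper): the γ‑quasi‑clique condition described above, and the lack of the hereditary property, can be checked on a small graph. The sketch below assumes the degree-based definition stated in the abstract (each vertex is adjacent to at least a γ fraction of the other vertices in the set); the helper `is_quasi_clique` and the toy graph are hypothetical and are not part of IterQC.

```python
import math


def is_quasi_clique(adj, vertices, gamma):
    """Return True if every vertex in `vertices` is adjacent to at least
    a gamma fraction of the other vertices in the set."""
    members = set(vertices)
    required = math.ceil(gamma * (len(members) - 1))
    return all(len(adj[v] & (members - {v})) >= required for v in members)


# Toy graph: the 4-cycle a-b-c-d-a, stored as adjacency sets.
adj = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}

# Each vertex of the cycle sees 2 of the 3 other vertices, so the whole
# cycle satisfies the condition for gamma = 0.5.
whole_cycle = is_quasi_clique(adj, ["a", "b", "c", "d"], gamma=0.5)

# The subset {a, c} has no internal edge, so it fails for the same gamma.
opposite_pair = is_quasi_clique(adj, ["a", "c"], gamma=0.5)

print(whole_cycle, opposite_pair)
```

A vertex set can therefore satisfy the condition while one of its subsets does not, which is why pruning rules that rely on heredity (as for cliques or k‑plexes) do not carry over directly.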
Authors: Jason Yang, Wenda Chu, Daniel Khalil, Raul Astudillo, Bruce J. Wittmann, Frances H. Arnold, Yisong Yue
Abstract: Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real‑world optimization campaigns where fitness is measured through low‑throughput wet‑lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence‑fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug‑and‑play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next‑generation protein fitness optimization.
Authors: Anna Ottavia Schulte, Samar Alqatari, Saverio Rossi, Francesco Zamponi
Abstract: Protein fitness landscapes frequently exhibit epistasis, where the effect of a mutation depends on the genetic context in which it occurs, i.e., the rest of the protein sequence. Epistasis increases landscape complexity, often resulting in multiple fitness peaks. In its simplest form, known as global epistasis, fitness is modeled as a non‑linear function of an underlying additive trait. In contrast, more complex epistasis arises from a network of (pairwise or many‑body) interactions between residues, which cannot be removed by a single non‑linear transformation. Recent studies have explored how global and network epistasis contribute to the emergence of functional bottlenecks, that is, fitness landscape topologies where two broad high‑fitness basins, representing distinct phenotypes, are separated by a bottleneck that can only be crossed via one or a few mutational paths. Here, we introduce and analyze a stylized model of global epistasis with an additive underlying trait. We demonstrate that functional bottlenecks arise with high probability if the model is properly calibrated. Furthermore, our results underscore that a proper balance between neutral and non‑neutral mutations is needed for the emergence of functional bottlenecks.
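Illustrative note (not from the paper): global epistasis, as defined above, means fitness is a non‑linear function of an additive trait. The sketch below is a generic toy version, not the authors' calibrated model: the binary sequence encoding, the per-site effects, the sigmoid non-linearity, and its midpoint are all assumptions made for illustration. It shows that one mutation adds the same amount to the trait in every background yet changes fitness by different amounts.

```python
import math


def additive_trait(sequence, effects):
    """Sum the per-site effects of all mutated positions (1 = mutated)."""
    return sum(e for e, mutated in zip(effects, sequence) if mutated)


def fitness(sequence, effects, midpoint=3.0):
    """Sigmoid of the additive trait; the non-linearity is an assumption."""
    trait = additive_trait(sequence, effects)
    return 1.0 / (1.0 + math.exp(-(trait - midpoint)))


def mutation_effect(background, site, effects):
    """Fitness change from mutating `site` in a given background."""
    mutant = list(background)
    mutant[site] = 1
    return fitness(mutant, effects) - fitness(background, effects)


# Hypothetical per-site contributions to the additive trait.
effects = [2.0, 2.0, 2.0]

# The same mutation (site 0) adds 2.0 to the trait in both backgrounds,
# but the fitness change depends on where the background sits on the sigmoid.
effect_in_wild_type = mutation_effect([0, 0, 0], 0, effects)
effect_in_single_mutant = mutation_effect([0, 1, 0], 0, effects)

print(round(effect_in_wild_type, 4), round(effect_in_single_mutant, 4))
```

Because all context dependence here comes from a single non-linearity, inverting it would recover a purely additive landscape, which is what distinguishes global epistasis from the network epistasis the abstract contrasts it with.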
Authors: Christopher Kolloff, Tobias Höppe, Emmanouil Angelis, Mathias Jacob Schreiner, Stefan Bauer, Andrea Dittadi, Simon Olsson
Abstract: We propose a regularization framework inspired by thermodynamic work for guiding pre‑trained probability flow generative models (e.g., continuous normalizing flows or diffusion models) by minimizing excess work, a concept rooted in statistical mechanics and with strong conceptual connections to optimal transport. Our approach enables efficient guidance in sparse‑data regimes common to scientific applications, where only limited target samples or partial density constraints are available. We introduce two strategies: Path Guidance for sampling rare transition states by concentrating probability mass on user‑defined subsets, and Observable Guidance for aligning generated distributions with experimental observables while preserving entropy. We demonstrate the framework's versatility on a coarse‑grained protein model, guiding it to sample transition configurations between folded/unfolded states and correct systematic biases using experimental data. The method bridges thermodynamic principles with modern generative architectures, offering a principled, efficient, and physics‑inspired alternative to standard fine‑tuning in data‑scarce domains. Empirical results highlight improved sample efficiency and bias reduction, underscoring its applicability to molecular simulations and beyond.
Authors: Tom George Grigg, Mason Burlage, Oliver Brook Scott, Adam Taouil, Dominique Sydow, Liam Wilbraham
Abstract: Exhaustive virtual screening is highly informative but often intractable against the expensive objective functions involved in modern drug discovery. This problem is exacerbated in combinatorial contexts such as multi‑vector expansion, where molecular spaces can quickly become ultra‑large. Here, we introduce Scalable Active Learning via Synthon Acquisition (SALSA): a simple algorithm applicable to multi‑vector expansion which extends pool‑based active learning to non‑enumerable spaces by factoring modeling and acquisition over synthon or fragment choices. Through experiments on ligand‑ and structure‑based objectives, we highlight SALSA's sample efficiency, and its ability to scale to spaces of trillions of compounds. Further, we demonstrate application toward multi‑parameter objective design tasks on three protein targets, finding that SALSA‑generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores than an industry‑leading generative approach.
Authors: Yanting Li, Jiyue Jiang, Zikang Wang, Ziqian Lin, Dongchen He, Yuheng Shan, Yanruisheng Shao, Jiayi Li, Xiangyu Shi, Jiuming Wang, Yanyu Chen, Yimin Fan, Han Li, Yu Li
Abstract: Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three‑dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS‑ProGen, a dual‑structure deep language model for functional protein design, which integrates both backbone geometry and surface‑level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next‑amino‑acid prediction paradigm, DS‑ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS‑ProGen attains the current state‑of‑the‑art recovery rate of 61.47%, demonstrating the synergistic advantage of multi‑modal structural encoding in protein design. Furthermore, DS‑ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.
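The recovery rate quoted above is not defined in the abstract. A minimal sketch, assuming the usual inverse-folding definition (the fraction of positions where the designed residue matches the native one); the function name `sequence_recovery` is illustrative and not taken from DS‑ProGen:

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one.

    Assumption: both sequences are already aligned and of equal length.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Toy example: 3 of the 4 positions agree.
print(sequence_recovery("ACDE", "ACDF"))
```

A benchmark figure such as the one reported would typically be this quantity aggregated over a test set, though the aggregation used by the authors is not stated here.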
Authors: Asher Moldwin, Amarda Shehu
Abstract: This paper surveys foundation models for AI‑enabled biological design, focusing on recent developments in applying large‑scale, self‑supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi‑modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next‑steps to improve the quality of biological sequence generation.
Authors: Dan Luo, Jinyu Zhou, Le Xu, Sisi Yuan, Xuan Lin
Abstract: Predicting drug‑target binding affinity (DTA) is essential for identifying potential therapeutic candidates in drug discovery. However, most existing models rely heavily on static protein structures, often overlooking the dynamic nature of proteins, which is crucial for capturing the conformational flexibility that benefits protein binding interactions. We introduce DynamicDTA, an innovative deep learning framework that incorporates static and dynamic protein features to enhance DTA prediction. The proposed DynamicDTA takes three types of inputs: drug sequence, protein sequence, and dynamic descriptors. A molecular graph representation of the drug sequence is generated and subsequently processed through a graph convolutional network, while the protein sequence is encoded using dilated convolutions. Dynamic descriptors, such as root mean square fluctuation, are processed through a multi‑layer perceptron. These embedding features are fused with static protein features using cross‑attention, and a tensor fusion network integrates all three modalities for DTA prediction. Extensive experiments on three datasets demonstrate that DynamicDTA achieves at least a 3.4% improvement in RMSE compared with seven state‑of‑the‑art baseline methods. Additionally, predicting novel drugs for Human Immunodeficiency Virus Type 1 and visualizing the docking complexes further demonstrate the reliability and biological relevance of DynamicDTA.
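Root mean square fluctuation, named above as one of the dynamic descriptors, has a standard definition that can be sketched in a few lines. The example assumes the trajectory frames are already superimposed on a common reference; the array layout and the function name `rmsf` are illustrative and not part of DynamicDTA:

```python
import numpy as np


def rmsf(trajectory: np.ndarray) -> np.ndarray:
    """Per-atom RMSF from an array of shape (frames, atoms, 3).

    Assumption: frames are pre-aligned, so only internal motion remains.
    """
    mean_pos = trajectory.mean(axis=0)                    # (atoms, 3) average position
    sq_dev = ((trajectory - mean_pos) ** 2).sum(axis=2)   # (frames, atoms) squared deviation
    return np.sqrt(sq_dev.mean(axis=0))                   # (atoms,) root of time-averaged deviation


# Toy trajectory: atom 0 is static, atom 1 oscillates between x = +1 and x = -1.
traj = np.array(
    [
        [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
        [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
    ]
)
print(rmsf(traj))
```

The resulting per-atom (or per-residue) vector is the kind of fixed-length descriptor that could be fed to a multi-layer perceptron.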
Authors: Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence P. Petalidis, Michalis Vazirgiannis
Abstract: Predicting protein function from sequence is a central challenge in computational biology. While existing methods rely heavily on structured ontologies or similarity‑based techniques, they often lack the flexibility to express structure‑free functional descriptions and novel biological functions. In this work, we introduce Prot2Text‑V2, a novel multimodal sequence‑to‑text model that generates free‑form natural language descriptions of protein function directly from amino acid sequences. Our method combines a protein language model as a sequence encoder (ESM‑3B) and a decoder‑only language model (LLaMA‑3.1‑8B‑Instruct) through a lightweight nonlinear modality projector. A key innovation is our Hybrid Sequence‑level Contrastive Alignment Learning (H‑SCALE), which improves cross‑modal learning by matching mean‑ and std‑pooled protein embeddings with text representations via contrastive loss. After the alignment phase, we apply instruction‑based fine‑tuning using LoRA on the decoder to teach the model how to generate accurate protein function descriptions conditioned on the protein sequence. We train Prot2Text‑V2 on about 250K curated entries from SwissProt and evaluate it under low‑homology conditions, where test sequences have low similarity with training samples. Prot2Text‑V2 consistently outperforms traditional and LLM‑based baselines across various metrics.
Authors: Francesco Madeddu, Lucia Testa, Gianluca De Carlo, Michele Pieroni, Andrea Mastropietro, Aris Anagnostopoulos, Paolo Tieri, Sergio Barbarossa
Abstract: The intrinsic complexity of human biology presents ongoing challenges to scientific understanding. Researchers collaborate across disciplines to expand our knowledge of the biological interactions that define human life. AI methodologies have emerged as powerful tools across scientific domains, particularly in computational biology, where graph data structures effectively model biological entities such as protein‑protein interaction (PPI) networks and gene functional networks. Those networks are used as datasets for paramount network medicine tasks, such as gene‑disease association prediction, drug repurposing, and polypharmacy side effect studies. Reliable predictions from machine learning models require high‑quality foundational data. In this work, we present a comprehensive multi‑purpose biological knowledge graph constructed by integrating and refining multiple publicly available datasets. Building upon the Drug Repurposing Knowledge Graph (DRKG), we define a pipeline tasked with a) cleaning inconsistencies and redundancies present in DRKG, b) coalescing information from the main available public data sources, and c) enriching the graph nodes with expressive feature vectors such as molecular fingerprints and gene ontologies. Biologically and chemically relevant features improve the capacity of machine learning models to generate accurate and well‑structured embedding spaces. The resulting resource represents a coherent and reliable biological knowledge graph that serves as a state‑of‑the‑art platform to advance research in computational biology and precision medicine. Moreover, it offers the opportunity to benchmark graph‑based machine learning and network medicine models on relevant tasks. We demonstrate the effectiveness of the proposed dataset by benchmarking it against the task of drug repurposing, PPI prediction, and side‑effect prediction, modeled as link prediction problems.
Authors: Ruiqing Sun, Dawei Feng, Sen Yang, Ronghang Wang, Huaiyuan Song, Bo Ding, Yijie Wang, Huaimin Wang
Abstract: Optimizing conflicting molecular properties while strictly adhering to complex 3D structural constraints constitutes a challenging Constrained Multi‑Objective Optimization Problem (CMOP). Traditional Evolutionary Algorithms (EAs) destroy chemical valency in 3D space, whereas 3D diffusion models act as rigid generators requiring costly retraining for novel objectives. To bridge this gap, we propose a progressive algorithmic suite. First, we introduce the Evolutionary‑Guided Diffusion (EGD) operator, which executes crossover and mutation at an optimally calibrated noise level, leveraging a pre‑trained denoising network to project chimeric states back onto the valid chemical manifold. Second, to combat the severe loss of molecular structural diversity inherent in traditional EMO frameworks, we design a Structure‑Aware Environmental Selection (SAES) mechanism that explicitly enforces structural distinctiveness. Finally, synergizing EGD and SAES, we develop the Diffusion‑based Evolutionary Molecular Optimization (DEMO) framework for CMOPs. To safely navigate disjoint feasible regions, DEMO employs a tri‑population architecture with distinct goals: exploring novel chemical scaffolds, refining partially assembled intermediates, and fine‑tuning perfectly feasible elite molecules. Extensive experiments across single‑property targeting, unconstrained MOPs, multi‑fragment CMOPs, and 3D protein‑ligand docking demonstrate that our method comprehensively outperforms state‑of‑the‑art baselines and traditional EMO frameworks. Operating entirely zero‑shot, this suite consistently discovers highly diverse, chemically valid Pareto frontiers.
Authors: Kutalmış Coşkun, Ivo Kavisanczki, Amin Mirzaei, Tom Siegl, Bjarne C. Hiller, Stefan Lüdtke, Martin Becker
Abstract: In complex and low‑data domains such as biomedical research, incorporating background knowledge (BK) graphs, such as protein‑protein interaction (PPI) networks, into graph‑based machine learning pipelines is a promising research direction. However, while BK is often assumed to improve model performance, its actual contribution and the impact of imperfect knowledge remain poorly understood. In this work, we investigate the role of BK in an important real‑world task: cancer subtype classification. Surprisingly, we find that (i) state‑of‑the‑art GNNs using BK perform no better than uninformed models like linear regression, and (ii) their performance remains largely unchanged even when the BK graph is heavily perturbed. To understand these unexpected results, we introduce an evaluation framework, which employs (i) a synthetic setting where the BK is clearly informative and (ii) a set of perturbations that simulate various imperfections in BK graphs. With this, we test the robustness of BK‑aware models in both synthetic and real‑world biomedical settings. Our findings reveal that careful alignment of GNN architectures and BK characteristics is necessary but holds the potential for significant performance improvements.
Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble
Abstract: Mass spectrometry is the dominant technology in the field of proteomics, enabling high‑throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose‑built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre‑train a spectrum encoder using de novo sequencing as a pre‑training task. We then show that using these pre‑trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi‑task fine‑tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
Authors: Sebestyén Kamp, Giovanni Stracquadanio, T. Ian Simpson
Abstract: We present GNN‑Suite, a robust modular framework for constructing and benchmarking Graph Neural Network (GNN) architectures in computational biology. GNN‑Suite standardises experimentation and reproducibility using the Nextflow workflow to evaluate GNN performance. We demonstrate its utility in identifying cancer‑driver genes by constructing molecular networks from protein‑protein interaction (PPI) data from STRING and BioGRID and annotating nodes with features from the PCAWG, PID, and COSMIC‑CGC repositories.
Our design enables fair comparisons among diverse GNN architectures including GAT, GAT3H, GCN, GCN2, GIN, GTN, HGCN, PHGCN, and GraphSAGE and a baseline Logistic Regression (LR) model. All GNNs were configured as standardised two‑layer models and trained with uniform hyperparameters (dropout = 0.2; Adam optimiser with learning rate = 0.01; and an adjusted binary cross‑entropy loss to address class imbalance) over an 80/20 train‑test split for 300 epochs. Each model was evaluated over 10 independent runs with different random seeds to yield statistically robust performance metrics, with balanced accuracy (BACC) as the primary measure. Notably, GCN2 achieved the highest BACC (0.807 +/‑ 0.035) on a STRING‑based network, although all GNN types outperformed the LR baseline, highlighting the advantage of network‑based learning over feature‑only approaches.
Our results show that a common framework for implementing and evaluating GNN architectures aids in identifying not only the best model but also the most effective means of incorporating complementary data. By making GNN‑Suite publicly available, we aim to foster reproducible research and promote improved benchmarking standards in computational biology. Future work will explore additional omics datasets and further refine network architectures to enhance predictive accuracy and interpretability in biomedical applications.
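Balanced accuracy, the primary metric above, is the mean of the true-positive and true-negative rates, which keeps it informative under class imbalance. A minimal sketch for the binary case follows; the function names and toy numbers are illustrative and are not taken from GNN‑Suite:

```python
import numpy as np


def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of sensitivity and specificity for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # recall on the positive class
    tnr = np.mean(y_pred[y_true == 0] == 0)  # recall on the negative class
    return float(0.5 * (tpr + tnr))


def summarize_runs(scores):
    """Mean and sample standard deviation over independent seeded runs."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))


# Toy imbalanced example: 2 positives, 4 negatives.
print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Whether the reported "+/‑" values use the sample or population standard deviation is not stated in the abstract; `ddof=1` here is an assumption.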
Authors: Amira Alakhdar, Barnabas Poczos, Newell Washburn
Abstract: Developing bioactive molecules remains a central, time‑ and cost‑heavy challenge in drug discovery, particularly for novel targets lacking structural or functional data. Pharmacophore modeling presents an alternative for capturing the key features required for molecular bioactivity against a biological target. In this work, we present PharmaDiff, a pharmacophore‑conditioned diffusion model for 3D molecular generation. PharmaDiff employs a transformer‑based architecture to integrate an atom‑based representation of the 3D pharmacophore into the generative process, enabling the precise generation of 3D molecular graphs that align with predefined pharmacophore hypotheses. Through comprehensive testing, PharmaDiff demonstrates superior performance in matching 3D pharmacophore constraints compared to ligand‑based drug design methods. Additionally, it achieves higher docking scores across a range of proteins in structure‑based drug design, without the need for target protein structures. By integrating pharmacophore modeling with 3D generative techniques, PharmaDiff offers a powerful and flexible framework for rational drug design.
Authors: He Wang, Yikun Zhang, Jie Chen, Jian Zhan, Yaoqi Zhou
Abstract: Given the usefulness of protein language models (LMs) in structural and functional inference, RNA LMs have received increased attention in the last few years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (those pretrained on multiple RNA types (especially noncoding RNAs), those pretrained on specific‑purpose RNAs, and LMs that unify RNA with DNA or proteins or both) and compared 13 RNA LMs, along with 3 DNA LMs and 1 protein LM as controls, in zero‑shot prediction of RNA secondary structure and functional classification. The results show that models doing well on secondary structure prediction often perform worse in function classification, or vice versa, suggesting that more balanced unsupervised training is needed.
Authors: Laia Coronas Sala, Parfait Atchade-Adelemou
Abstract: We introduce Quantum Mechanics for Proteins (QMProt), a dataset developed to support quantum computing applications in protein research. QMProt contains precise quantum‑mechanical and physicochemical data, enabling accurate characterization of biomolecules and supporting advanced computational methods like molecular fragmentation and reassembly. The dataset includes 45 molecules covering all 20 essential human amino acids and their core structural elements: amino terminal groups, carboxyl terminal groups, alpha carbons, and unique side chains. QMProt primarily features organic molecules with up to 15 non‑hydrogen atoms (C, N, O, S), offering comprehensive molecular Hamiltonians, ground state energies, and detailed physicochemical properties. Publicly accessible, QMProt aims to enhance reproducibility and advance quantum‑enhanced simulations in molecular biology, biochemistry, and drug discovery.
Authors: Zakarya Benayad, Guillaume Stirnemann
Abstract: Enhanced sampling techniques are essential for exploring biomolecular conformational dynamics that occur on timescales inaccessible to conventional molecular dynamics (MD) simulations. This study introduces a framework that combines Hamiltonian replica exchange with solute tempering (REST2) with denoising diffusion probabilistic models (DDPMs) and importance sampling to enhance the mapping of conformational free‑energy landscapes. Building on previous applications of DDPMs to temperature replica exchange (TREM), we propose two key improvements. First, we adapt the method to REST2 by treating potential energy as a fluctuating variable. This adaptation allows for more efficient sampling in large biomolecular systems. Second, to further improve resolution in high‑barrier regions, we develop an iterative scheme combining replica exchange, DDPM, and importance sampling along known collective variables. Benchmarking on the mini‑protein CLN025 demonstrates that DDPM‑refined REST2 achieves comparable accuracy to TREM while requiring fewer replicas. Application to the enzyme PTP1B reveals a loop transition pathway consistent with prior complex biased simulations, showcasing the approach's ability to uncover high‑barrier transitions with minimal computational overhead with respect to conventional replica exchange approaches. Overall, this hybrid strategy enables more efficient exploration of free‑energy landscapes, expanding the utility of generative models in enhanced sampling simulations.
Authors: Wenderson R. F. Silva, Larissa C. P. Monteiro, Murilo C. Costa, Renato V. A. Boaventura, Eduardo N. D. de Araújo, Rafael O. R. R. Cunha, Tiago A. de O. Mendes, Rodrigo G. Lacerda, Joaquim B. S. Mendes
Abstract: This work presents an innovative magnetoelastic (ME) biosensor using graphene functionalized with the SARS‑CoV‑2 N protein for antibody detection via magnetoelastic resonance. Graphene was chosen for its biocompatibility and high surface area, enabling efficient antigen adsorption, validated by techniques such as energy‑dispersive X‑ray spectroscopy (EDX), atomic force microscopy (AFM), and micro‑Raman spectroscopy. Changes in Raman bands (a ~10 cm^-1 shift in the 2D band and an increase in the I_D/I_G ratio from 0.03 to 0.60) confirmed non‑covalent interactions and enhanced surface coverage with ~100 μg of N protein. Tests using human plasma (10 RT‑PCR‑positive and 10 negative samples) demonstrated a clear distinction between groups using graphene sensors functionalized with ~100 μg of N protein. Enzyme‑linked immunosorbent assay (ELISA) validation corroborated the results. Optimization of protein concentration and biofunctionalization time highlighted the importance of homogeneous surface coverage for reproducibility of the graphene‑based ME biosensor. The platform combines graphene's advantages with the wireless, real‑time detection capabilities of ME sensors, offering low cost, high sensitivity, and potential for automation, with applications in point‑of‑care diagnostics.
Authors: Pei-Kun Yang
Abstract: Structure‑based virtual screening must address a combinatorial explosion arising from up to 10^60 drug‑like molecules, multiple conformations of proteins and ligands, and all possible spatial translations and rotations of ligands within the binding pocket. Although these calculations are inherently parallelizable, their sheer volume remains prohibitive for classical CPU/GPU resources. Quantum computing offers a promising solution: by using n qubits to compute the binding energy of a single protein‑ligand pair and m additional qubits to encode different configurations, the algorithm can simultaneously evaluate 2^m combinations in a single quantum execution. To realize this potential, we propose a quantum algorithm that integrates classical force field models to compute electrostatic and van der Waals interactions on discretized grid points. Binding energy calculations are reformulated as matrix‑based inner products, while ligand translations and rotations are encoded using unitary operations. This approach circumvents explicit distance calculations and provides a scalable, quantum‑enhanced framework for efficient and high‑dimensional binding energy estimation in drug discovery.
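The reformulation of binding energy as an inner product on a grid can be illustrated classically. The sketch below is not the quantum algorithm: it is a NumPy analogue in which a hypothetical protein potential `phi` and ligand charge grid `rho` are flattened and contracted, and a ligand translation is modelled as a cyclic shift (a permutation, hence norm-preserving like a unitary). The grid size, the random potential, and the periodic boundary are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical protein potential sampled on a 4x4x4 grid (arbitrary units).
phi = rng.normal(size=(4, 4, 4))

# Hypothetical ligand: two point charges placed on grid nodes.
rho = np.zeros((4, 4, 4))
rho[1, 1, 1] = 1.0
rho[1, 2, 1] = -0.5


def energy(rho: np.ndarray, phi: np.ndarray) -> float:
    """Interaction energy as the inner product <rho, phi> over grid points."""
    return float(np.vdot(rho, phi))


def translate(rho: np.ndarray, shift: tuple) -> np.ndarray:
    """Cyclic grid translation; a permutation of grid entries, so the norm is unchanged."""
    return np.roll(rho, shift, axis=(0, 1, 2))


# Energy of the ligand at its original position and after a one-cell shift along x.
print(energy(rho, phi), energy(translate(rho, (1, 0, 0)), phi))
```

No pairwise distances are computed at evaluation time, because the distance dependence is absorbed into the precomputed potential grid, which is the property the abstract highlights.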
Authors: Muhamed Amin, Bernard R. Brooks
Abstract: We present the Boltzmann classifier, a novel distance‑based probabilistic classification algorithm inspired by the Boltzmann distribution. Unlike traditional classifiers that produce hard decisions or uncalibrated probabilities, the Boltzmann classifier assigns class probabilities based on the average distance to the nearest neighbors within each class, providing interpretable, physically meaningful outputs. We evaluate the performance of the method across three application domains: molecular activity prediction, oxidation state classification of transition metal complexes, and breast cancer diagnosis. In the molecular activity task, the classifier achieved the highest accuracy in predicting active compounds against two protein targets, with strong correlations observed between the predicted probabilities and experimental pIC50 values. For metal complexes, the classifier accurately distinguished between oxidation states II and III for Fe, Mn, and Co, using only metal‑ligand bond lengths extracted from crystallographic data, and demonstrated high consistency with known chemical trends. In the breast cancer dataset, the classifier achieved 97% accuracy, with low‑confidence predictions concentrated in inherently ambiguous cases. Across all tasks, the Boltzmann classifier performed competitively with or better than standard models such as logistic regression, support vector machines, random forests, and k‑nearest neighbors. Its probabilistic outputs were found to correlate with continuous physical or biological properties, highlighting its potential utility in both classification and regression contexts. The results suggest that the Boltzmann classifier is a robust and interpretable alternative to conventional machine learning approaches, particularly in scientific domains where underlying structure‑property relationships are important.
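One way to read the description above is to treat the mean distance to the k nearest neighbors in each class as an "energy" and to turn energies into probabilities with Boltzmann weights. The sketch below follows that reading; the exact functional form, the Euclidean metric, and the parameters `k` and `kT` are assumptions, not details given in the abstract:

```python
import numpy as np


def boltzmann_classify(X_train, y_train, x, k=3, kT=1.0):
    """Return (classes, probabilities) for a single query point x.

    Assumed form: p(c) is proportional to exp(-d_c / kT), where d_c is the mean
    Euclidean distance from x to its k nearest training points of class c.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x = np.asarray(x, dtype=float)

    classes = np.unique(y_train)
    energies = np.empty(len(classes))
    for i, c in enumerate(classes):
        dists = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        energies[i] = np.sort(dists)[:k].mean()  # uses fewer than k if the class is small

    # Subtract the minimum before exponentiating for numerical stability.
    weights = np.exp(-(energies - energies.min()) / kT)
    return classes, weights / weights.sum()


# Toy 1-D example: the query sits inside class 0 and far from class 1.
classes, probs = boltzmann_classify([[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1], [0.5], k=2)
print(classes, probs)
```

In this form `kT` acts as a temperature: small values push the output toward a hard nearest-class decision, while large values flatten the probabilities.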
Authors: Alberto Martinez-Serra, Gionni Marchetti, Francesco D'Amico, Ivana Fenoglio, Barbara Rossi, Marco P. Monopoli, Giancarlo Franzese
Abstract: When nanoparticles (NPs) are introduced into a biological solution, layers of biomolecules form on their surface, creating a corona. Understanding how the structure of the protein evolves into the corona is essential for evaluating the safety and toxicity of nanotechnology. However, the influence of NP properties on protein conformation is not well understood. In this study, we propose a new method that addresses this issue by analyzing multi‑component spectral data using Machine Learning (ML). We apply the method to fibrinogen, a crucial protein in human blood plasma, at physiological concentrations while interacting with hydrophobic carbon or hydrophilic silicon dioxide NPs, revealing striking differences in the temperature dependence of the protein structure between the two cases. Our unsupervised ML method a) does not suffer from the challenges associated with the curse of dimensionality, and b) simultaneously handles spectral data from various sources. The method offers a quantitative analysis of protein structural changes upon adsorption and enhances the understanding of the correlation between protein structure and NP interactions, which could support the development of nanomedical tools to treat various conditions.
Authors: Jacques Fries, Roxanne Berthin, Marie Jardat, Pierre Illien, Vincent Dahirel
Abstract: Biomolecular condensates play a crucial role in the spatial organization of living matter. These membrane‑less organelles, resulting from liquid‑liquid phase separation, operate far from thermodynamic equilibrium, with their size and stability influenced by non‑equilibrium chemical reactions. While condensates are frequently considered optimized nanoreactors that enhance molecular encounters, their actual impact on reaction kinetics remains unclear due to competing effects such as diffusion hindrance, and random trapping in non‑specific condensates. In this study, we develop a microscopic, stochastic model for chemically active droplets, incorporating reaction‑driven modulation of protein interactions. Using Brownian dynamics simulations, we investigate how protein interactions and active coupling to a free energy reservoir influence phase separation, molecular transport and reaction kinetics. We demonstrate that the intensity of the chemical drive governs surface dynamics, generating fluxes that modulate bimolecular reaction rates. Comparing active emulsions to homogeneous systems, we reveal that condensates can either accelerate or decelerate molecular encounters. Our findings provide key insights into the role of biomolecular condensates as potential regulators of intracellular reaction kinetics.
Authors: J. Bhatt Mitra, V. K. Sharma, M. Kumar, V. Garcia Sakai, A. Mukherjee
Abstract: Ribosomal protein S30 (RS30) exhibits potent antimicrobial activity against a broad spectrum of bacteria. Despite its efficacy, the underlying action mechanism remained elusive. In this study, we unravel the fundamental mechanism by which RS30 exerts its bactericidal effects, using a combination of microbiological assays and advanced biophysical techniques. Microbiological analyses reveal that RS30 kills bacteria primarily through membrane depolarization, despite limited membrane permeabilization, indicating an unconventional mode of action involving no or partial lysis of the membrane. Importantly, RS30 demonstrates time‑dependent bactericidal activity with no detectable cytotoxicity toward mammalian cells, underscoring its high selectivity. This selective action was further confirmed using biophysical experiments on model membrane systems composed of anionic (bacterial mimic) and zwitterionic (mammalian mimic) phospholipids. Our measurements suggested that RS30 preferentially binds to anionic membranes via electrostatic interactions, undergoes a conformational transition from a random coil to an α‑helix upon binding, and induces vesicle aggregation. Quasielastic neutron scattering (QENS) measurements provide microscopic insights, showing that RS30 significantly restricts the lateral diffusion of anionic lipids, thereby perturbing membrane dynamics and increasing susceptibility to external stress. Together, our findings uncover important insights into the antimicrobial action mechanism of RS30, characterized by selective membrane interaction, structural transformation, and dynamic modulation of lipid membranes.
Authors: Seunghee Han, Soongyu Choi, Joo-Young Kim
Abstract: Recent advances in Protein Structure Prediction Models (PPMs), such as AlphaFold2 and ESMFold, have revolutionized computational biology by achieving unprecedented accuracy in predicting three‑dimensional protein folding structures. However, these models face significant scalability challenges, particularly when processing proteins with long amino acid sequences (e.g., sequence length > 1,000). The primary bottleneck is the exponential growth in activation sizes, driven by the unique data structure in PPMs, which introduces an additional dimension that leads to substantial memory and computational demands. These limitations have hindered the effective scaling of PPMs for real‑world applications, such as analyzing large proteins or complex multimers with critical biological and pharmaceutical relevance.
In this paper, we present LightNobel, the first hardware‑software co‑designed accelerator developed to overcome scalability limitations on the sequence length in PPM. At the software level, we propose Token‑wise Adaptive Activation Quantization (AAQ), which leverages unique token‑wise characteristics, such as distogram patterns in PPM activations, to enable fine‑grained quantization techniques without compromising accuracy. At the hardware level, LightNobel integrates the multi‑precision reconfigurable matrix processing unit (RMPU) and versatile vector processing unit (VVPU) to enable the efficient execution of AAQ. Through these innovations, LightNobel achieves up to 8.44x, 8.41x speedup and 37.29x, 43.35x higher power efficiency over the latest NVIDIA A100 and H100 GPUs, respectively, while maintaining negligible accuracy loss. It also reduces the peak memory requirement up to 120.05x in PPM, enabling scalable processing for proteins with long sequences.
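To make the idea of token-wise activation quantization concrete, the sketch below applies plain per-token absmax quantization, where each row (token) of an activation matrix gets its own scale. This is a generic technique used for illustration; it does not implement the adaptive part of AAQ, and the function names and the 8-bit setting are assumptions:

```python
import numpy as np


def quantize_per_token(acts: np.ndarray, n_bits: int = 8):
    """Symmetric per-row quantization of a (tokens, channels) activation matrix.

    Assumption: n_bits <= 8 so the codes fit in int8.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(acts).max(axis=1, keepdims=True) / qmax  # one scale per token
    scale = np.where(scale == 0, 1.0, scale)                # avoid division by zero on all-zero rows
    q = np.clip(np.round(acts / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate floating-point activations."""
    return q.astype(np.float32) * scale


# Two tokens with very different dynamic ranges keep separate scales.
acts = np.array([[0.1, -0.2, 0.05], [10.0, -5.0, 2.5]])
q, scale = quantize_per_token(acts)
print(np.abs(dequantize(q, scale) - acts).max(axis=1))
```

Because each token carries its own scale, a token with small activations is not forced onto the coarse grid required by a token with large ones, which is the general motivation for token-level granularity.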
Authors: Anjie Qiao, Hao Zhang, Qianmu Yuan, Qirui Deng, Jingtian Su, Weifeng Huang, Huihao Zhou, Guo-Bo Li, Zhen Wang, Jinping Lei
Abstract: Generating molecules that bind to specific protein targets via diffusion models has shown promise for structure‑based drug design and molecule optimization. In particular, diffusion models with binding interaction guidance enable the generation of high‑affinity molecules by forming favorable interactions within the protein pocket. However, the generated molecules may not form interactions with the highly conserved residues, which are important for protein function and for the bioactivity of the ligands. Herein, we developed a new 3D target‑aware diffusion model, DiffDecip, which explicitly incorporates protein‑ligand binding interactions and the evolutionary conservation information of protein residues into both the diffusion and sampling processes, for molecule optimization through scaffold decoration. The results show that DiffDecip outperforms the baseline model DiffDec on molecule optimization towards higher affinity by forming more non‑covalent interactions with highly conserved residues in the protein pocket.
Authors: Li Ni, Ziqi Deng, Lin Mu, Lei Zhang, Wenjian Luo, Yiwen Zhang
Abstract: Hypergraphs, capable of representing high‑order interactions via hyperedges, have become a powerful tool for modeling real‑world biological and social systems. Inherent relationships within these real‑world systems, such as the encoding relationship between genes and their protein products, drive the establishment of interconnections between multiple hypergraphs. Here, we demonstrate how to utilize those interconnections between multiple hypergraphs to synthesize integrated information from multiple higher‑order systems, thereby enhancing understanding of underlying structures. We propose a model based on the stochastic block model, which integrates information from multiple hypergraphs to reveal latent high‑order structures. Real‑world hyperedges exhibit preferential attachment, where certain nodes dominate hyperedge formation. To characterize this phenomenon, our model introduces hyperedge internal degree to quantify nodes' contributions to hyperedge formation. This model is capable of mining communities, predicting missing hyperedges of arbitrary sizes within hypergraphs, and inferring inter‑hypergraph edges between hypergraphs. We apply our model to high‑order datasets to evaluate its performance. Experimental results demonstrate strong performance of our model in community detection, hyperedge prediction, and inter‑hypergraph edge prediction tasks. Moreover, we show that our model enables analysis of multiple hypergraphs of different types and supports the analysis of a single hypergraph in the absence of inter‑hypergraph edges. Our work provides a practical and flexible tool for analyzing multiple hypergraphs, greatly advancing the understanding of the organization in real‑world high‑order systems.
Authors: Junhao Xiong, Ishan Gaur, Maria Lukarska, Hunter Nisonoff, Luke M. Oltrogge, David F. Savage, Jennifer Listgarten
Abstract: Sequence generative models are transforming protein engineering. However, no principled framework exists for conditioning these models on auxiliary information, such as experimental data, without additional training of a generative model. Herein, we present ProteinGuide, a method for such "on‑the‑fly" conditioning, amenable to a broad class of protein generative models including Masked Language Models (e.g. ESM3), any‑order auto‑regressive models (e.g. ProteinMPNN) as well as diffusion and flow matching models (e.g. MultiFlow). ProteinGuide stems from our unifying view of these model classes under a single statistical framework. As proof of principle, we perform several in silico experiments. We first guide pre‑trained generative models to design proteins with user‑specified properties, such as higher stability or activity. Next, we design for optimizing two desired properties that are in tension with each other. Finally, we apply our method in the wet lab, using ProteinGuide to increase the editing activity of an adenine base editor in vivo with data from only a single pooled library of 2,000 variants. We find that a single round of ProteinGuide achieves a higher editing efficiency than was previously achieved using seven rounds of directed evolution.
Authors: Xule Lin
Abstract: Human‑AI scientific collaboration has evolved from tool‑user relationships into co‑evolutionary partnerships. When AlphaFold improved protein structure prediction, researchers engaged with an epistemic partner that transformed their approach to structure‑function problems. Yet existing frameworks position AI as either sophisticated tool or potential risk, overlooking how scientific understanding emerges through recursive interaction. We introduce Cognitio Emergens (CE), a framework that captures the co‑evolutionary nature of human‑AI epistemic partnerships.
Drawing from autopoiesis theory, social systems theory, and organizational modularity, CE integrates three components: Agency Configurations modeling how authority distributes through Directed, Contributory, and Partnership modes, with partnerships oscillating dynamically rather than following linear progression; Epistemic Dimensions capturing six capabilities along Discovery, Integration, and Projection axes, creating distinctive "capability signatures" that guide strategic development; and Partnership Dynamics identifying evolutionary forces including epistemic alienation, where researchers lose interpretive control over knowledge they formally endorse.
The framework equips researchers to diagnose dimensional imbalances, institutional leaders to design governance structures supporting multiple agency configurations, and policymakers to develop evaluations beyond simple performance metrics. By reconceptualizing human‑AI collaboration as fundamentally co‑evolutionary, CE provides conceptual tools for cultivating partnerships that preserve epistemic integrity while enabling transformative breakthroughs neither humans nor AI could achieve independently.
Authors: Xuan Lin, Qingrui Liu, Hongxin Xiang, Daojian Zeng, Xiangxiang Zeng
Abstract: Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) the lack of a large‑scale chemical synthesis‑related instruction dataset; (ii) existing fine‑tuning strategies ignore the close correlation between reaction and retrosynthesis prediction. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction‑and‑retrosynthesis of molecules as a related recombination‑and‑fragmentation process and constructs a large‑scale 4.4‑million‑instruction dataset. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi‑scale tokenizer and dual‑task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol‑Instruction and USPTO‑50K datasets demonstrate that ChemDual achieves state‑of‑the‑art performance in both reaction and retrosynthesis prediction, outperforming the existing conventional single‑task approaches and the general open‑source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.
Authors: Miguel Martin-Landrove, B. P. Embaid
Abstract: The multiple parameter logistic equation has previously been utilized to determine the global stability of ternary codes, based on the arrangement of different symbols within the code. This approach has been extended to DNA and RNA sequences, proposing a specific application in the context of reading and translation processes involved in DNA replication and RNA‑mediated protein codification. To address the complexity of mapping Liapunov exponents in terms of four parameters representing the different nucleotide bases, specialized mapping techniques have been developed. These include Liapunov exponent distributions for entire sequences, as well as binary maps that classify nucleotide bases based on their chemical type (purinic or pyrimidinic). Such methodologies provide a framework for examining the structural and functional properties of genetic material. The sequences analyzed encompass a wide range of DNA and RNA types, including those with and without introns, as well as codifying and noncodifying regions. This multifaceted approach offers valuable insights into the dynamic behavior and stability of nucleotide arrangements, contributing to a deeper understanding of the underlying processes that govern genetic replication and protein synthesis.
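As a rough illustration of the idea in the abstract above, the sketch below estimates a Liapunov exponent for a logistic map whose growth parameter is switched according to a nucleotide sequence. It is a minimal sketch, not the authors' method: the per-base parameter values in `R_BY_BASE`, the function name `lyapunov_exponent`, and the warm-up and step counts are all assumptions chosen for illustration.

```python
import math

# Hypothetical growth-rate parameter per nucleotide (illustrative values, not from the paper).
R_BY_BASE = {"A": 3.2, "C": 3.5, "G": 3.8, "T": 3.9}


def lyapunov_exponent(sequence, r_by_base=R_BY_BASE, x0=0.3, warmup=200, steps=2000):
    """Estimate the Liapunov exponent of x -> r_n * x * (1 - x).

    The parameter r_n cycles through the values assigned to the symbols of
    `sequence`. The exponent is the average of log|f'_n(x_n)| along the orbit,
    where f'_n(x) = r_n * (1 - 2x).
    """
    rates = [r_by_base[base] for base in sequence]
    x = x0
    total = 0.0
    for n in range(warmup + steps):
        r = rates[n % len(rates)]
        if n >= warmup:
            # Derivative of the map applied at this step, evaluated before the update.
            deriv = abs(r * (1.0 - 2.0 * x))
            total += math.log(max(deriv, 1e-300))  # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps


# A negative value indicates a stable orbit, a positive value a chaotic one.
example = lyapunov_exponent("ACGTTGCA")
```

For a single-symbol sequence with r = 2.5 the orbit settles on the fixed point x = 0.6, where the derivative has magnitude 0.5, so the estimate should approach ln 0.5.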
Authors: Michal H. Kolář, Klára Hlouchová
Abstract: Since the Hadean era of Earth's history, peptides/proteins and RNA have undergone a complex evolutionary trajectory. Originating from simple monomeric units, these molecules evolved abiotically under various biochemical and biophysical constraints into functional biomolecules that contributed to the emergence of the first living cells. Within these cells, their interactions could then evolve through Darwinian selection. In this review, we examine current understanding of how protein‑RNA interactions emerged under prebiotic conditions and developed into today's iconic biomolecular machines such as the ribosome. Particular emphasis is placed on the types of physicochemical interactions accessible to early protein‑RNA complexes and their roles in driving spatial organization and compartmentalization in protocellular environments.
Authors: Yiming Zhang, Koji Tsuda
Abstract: Nanobodies ‑‑ single‑domain antibody fragments derived from camelid heavy‑chain‑only antibodies ‑‑ exhibit unique advantages such as compact size, high stability, and strong binding affinity, making them valuable tools in therapeutics and diagnostics. While recent advances in pretrained protein and antibody language models (PPLMs and PALMs) have greatly enhanced biomolecular understanding, nanobody‑specific modeling remains underexplored and lacks a unified benchmark. To address this gap, we introduce NbBench, the first comprehensive benchmark suite for nanobody representation learning. Spanning eight biologically meaningful tasks across nine curated datasets, NbBench encompasses structure annotation, binding prediction, and developability assessment. We systematically evaluate eleven representative models ‑‑ including general‑purpose protein LMs, antibody‑specific LMs, and nanobody‑specific LMs ‑‑ in a frozen setting. Our analysis reveals that antibody language models excel in antigen‑related tasks, while performance on regression tasks such as thermostability and affinity remains challenging across all models. Notably, no single model consistently outperforms others across all tasks. By standardizing datasets, task definitions, and evaluation protocols, NbBench offers a reproducible foundation for assessing and advancing nanobody modeling.
Authors: Zhongxin Yang, Yuanwei Bin, Yipeng Shi, Xiang I. A. Yang
Abstract: Artificial intelligence (AI) has achieved human‑level performance in specialized tasks such as Go, image recognition, and protein folding, raising the prospect of an AI singularity, where machines not only match but surpass human reasoning. Here, we demonstrate a step toward this vision in the context of turbulence modeling. By treating a large language model (LLM), DeepSeek‑R1, as an equal partner, we establish a closed‑loop, iterative workflow in which the LLM proposes, refines, and reasons about near‑wall turbulence models under adverse pressure gradients (APGs), system rotation, and surface roughness. Through multiple rounds of interaction involving long‑chain reasoning and a priori and a posteriori evaluations, the LLM generates models that not only rediscover established strategies but also synthesize new ones that outperform baseline wall models. Specifically, it recommends incorporating a material derivative to capture history effects in APG flows, modifying the law of the wall to account for system rotation, and developing rough‑wall models informed by surface statistics. In contrast to conventional data‑driven turbulence modeling, which is often characterized by human‑designed, black‑box architectures, the models developed here are physically interpretable and grounded in clear reasoning.
Authors: Cong Qi, Hanzhang Fang, Siqi Jiang, Tianxing Hu, Zhi Wei
Abstract: Understanding the binding specificity between T‑cell receptors (TCRs) and peptide‑major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data‑scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model‑powered TCR‑Enhanced Recognition Network), a deep learning framework that combines large‑scale protein language models with chemical representations of peptides. By encoding TCR β‑chain sequences using ESM‑1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR‑peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero‑shot and few‑shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR‑pMHC binding prediction and support the development of personalized immunotherapies.
Authors: Jorge H. Melillo, Ido Braslavsky
Abstract: Hypothesis: Roughening transitions at solid‑liquid interfaces govern crystal morphology in diverse systems. In ice crystallization, these transitions control interfacial faceting and surface kinetics. Faceted morphologies are often associated with ice‑active molecules, which inhibit recrystallization and are essential for cryopreservation. We hypothesize that kinetic roughening transitions can induce faceting even in the absence of ice‑active agents, particularly at high solute concentrations with depressed melting points, potentially complicating the interpretation of crystal morphology as an indicator of ice activity.
Experiments: We investigated the kinetic roughening transition of ice in dimethyl sulfoxide (DMSO) and proline‑water solutions using cryomicroscopy and real‑time image analysis. Crystals grew in microdroplets, maintaining near‑equilibrium conditions as solute concentration increased during growth due to conversion of liquid water to ice. Antifreeze protein type III (AFPIII) was applied to distinguish intrinsic roughening from adsorption‑mediated effects.
Findings: A distinct kinetic roughening transition temperature (TR = −16.0 ± 0.2 °C) was identified, marking a shift from rounded disks at higher temperatures to faceted hexagonal plates at lower temperatures, independent of solute type. Recrystallization below TR revealed asymmetry between growth and melting interfaces. AFPIII promoted faceting even above TR, consistent with stabilization of step edges and elevation of the roughening transition temperature. These results clarify the interplay between intrinsic interface kinetics and molecular adsorption, with implications for interpreting ice morphology, surface roughening, and cryopreservation design.
Authors: Sneha Arora, Jishnu Narayanan S J, Idan Haritan, Amitava Adhikary, Achintya Kumar Dutta
Abstract: In this work, the effect of amino acid environment on the nucleobase‑centered anion radical shape resonances is investigated by employing uracil as a model system for pyrimidine base in RNA. Anionic uracil‑glycine complexes have been used to model the RNA‑protein interactions. The resonance positions and widths of these complexes have been simulated using the equation of motion coupled cluster method coupled with resonance via Padé approach. Our work shows that in the transient negative ion (TNI, or, the anion radical of glycine:uracil complex), glycine stabilizes the nucleobase‑centered resonances through hydrogen bonding, increasing the lifetime of TNI. At the same time, a glycine‑centered resonance shows the ability of amino acids to capture the electron density and move it away from the uracil nucleobase. At the micro‑solvation level, this modeling indicates that amino acids would have more influence on nucleobase‑centered resonances in the TNI than that displayed by the corresponding aqueous environment.
Authors: R. Gonzalo Parra, Diego U. Ferreiro
Abstract: The controlled dissipation of chemical potentials is the fundamental way cells make a living. Enzyme‑mediated catalysis allows the various transformations to proceed at biologically relevant rates with remarkable precision and efficiency. Theory, experiments and computational studies coincide to show that local frustration is a useful concept to relate protein dynamics with catalytic power. Local frustration gives rise to the asperities of the energy landscapes that can harness the thermal fluctuations to guide the functional protein motions. We review here recent advances into these relationships from various fields of protein science. The biologically relevant dynamics is tuned by the evolution of protein sequences that modulate the local frustration patterns to near optimal values.
Authors: Jie Yang, Yuwen Wang, Kaixuan Chen, Tongya Zheng, Yihe Zhou, Zhenbang Xiao, Ji Cao, Mingli Song, Shunyu Liu
Abstract: Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph‑based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long‑range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real‑world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree‑like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root‑to‑leaf paths, providing not only the final prediction but also the multi‑granular interpretability for the decision‑making process. Extensive experiments on the graph classification benchmarks with both synthetic and real‑world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state‑of‑the‑art counterparts.
Authors: Art'om Zolotarjov, Roland Kröger, Dmitri O. Pushkin
Abstract: Collagen is the most abundant structural protein in animals, forming hierarchically organised fibrils that provide mechanical support to tissues. Despite detailed structural studies, the physical principles that govern the formation of the characteristic axially‑periodic collagen microfibril remain poorly understood. Here, we present a theoretical framework that links the amino acid sequence of tropocollagen to its supramolecular organisation. By combining statistical modeling of residue geometry with sequence‑informed interaction potentials, we show that the chiral arrangement of outward‑facing residues induces directional intermolecular interactions that drive molecular supercoiling. These interactions favour the formation of right‑handed, pentameric microfibrils with a staggered axial periodicity of approximately 67 nm. Our simulations reveal that this structure emerges across a wide range of mammalian collagen sequences as a global energy minimum robust to biochemical noise. These findings provide a mechanistic explanation for collagen's supramolecular chirality and offer design principles for engineering synthetic collagen‑mimetic materials.
Authors: Ting Liang, Ke Xu, Eric Lindgren, Zherui Chen, Rui Zhao, Jiahui Liu, Esmée Berger, Benrui Tang, Bohan Zhang, Yanzhou Wang, Keke Song, Penghua Ying, Nan Xu, Haikuan Dong, Shunda Chen, Paul Erhart, Zheyong Fan, Tapio Ala-Nissila, Jianbin Xu
Abstract: While machine‑learned interatomic potentials offer near‑quantum‑mechanical accuracy for atomistic simulations, many are material‑specific or computationally intensive, limiting their broader use. Here we introduce NEP89, a foundation model based on neuroevolution potential architecture, delivering empirical‑potential‑like speed and high accuracy across 89 elements. A compact yet comprehensive training dataset covering inorganic and organic materials was curated through descriptor‑space subsampling and iterative refinement across multiple datasets. NEP89 achieves competitive accuracy compared to representative foundation models while being three to four orders of magnitude more computationally efficient, enabling previously impractical large‑scale atomistic simulations of inorganic and organic systems. In addition to its out‑of‑the‑box applicability to diverse scenarios, including million‑atom‑scale compression of compositionally complex alloys, ion diffusion in solid‑state electrolytes and water, rocksalt dissolution, methane combustion, and protein‑ligand dynamics, NEP89 also supports fine‑tuning for rapid adaptation to user‑specific applications, such as mechanical, thermal, structural, and spectral properties of two‑dimensional materials, metallic glasses, and organic crystals.
Authors: Anjie Qiao, Junjie Xie, Weifeng Huang, Hao Zhang, Jiahua Rao, Shuangjia Zheng, Yuedong Yang, Zhen Wang, Guo-Bo Li, Jinping Lei
Abstract: Molecular optimization, aimed at improving binding affinity or other molecular properties, is a crucial task in drug discovery that often relies on the expertise of medicinal chemists. Recently, deep learning‑based 3D generative models showed promise in enhancing the efficiency of molecular optimization. However, these models often struggle to adequately consider binding affinities with protein targets during lead optimization. Herein, we propose a 3D pocket‑aware and affinity‑guided diffusion model, named Diffleop, to optimize molecules with enhanced binding affinity. The model explicitly incorporates the knowledge of protein‑ligand binding affinity to guide the denoising sampling for molecule generation with high affinity. The comprehensive evaluations indicated that Diffleop outperforms baseline models across multiple metrics, especially in terms of binding affinity.
Authors: Qingzhi Yu, Shuai Yan, Wenfeng Dai, Xiang Cheng
Abstract: Protein‑protein interactions (PPIs) are fundamental for deciphering cellular functions, disease pathways, and drug discovery. Although existing neural networks and machine learning methods have achieved high accuracy in PPI prediction, their black‑box nature leads to a lack of causal interpretation of the prediction results and difficulty in capturing hierarchical geometries and multi‑scale dynamic interaction patterns among proteins. To address these challenges, we propose HyboWaveNet, a novel deep learning framework that combines hyperbolic graph neural networks (HGNNs) with a multi‑scale graph wavelet transform for robust PPI prediction. Mapping protein features to Lorentz space simulates hierarchical topological relationships among biomolecules via a hyperbolic distance metric, enabling node feature representations that better fit biological priors. HyboWaveNet inherently simulates hierarchical and scale‑free biological relationships, while the integration of wavelet transforms enables adaptive extraction of local and global interaction features across different resolutions. Our framework generates node feature representations via a graph neural network under the Lorentz model and generates pairs of positive samples under multiple different views for contrastive learning, followed by further feature extraction via multi‑scale graph wavelet transforms to predict potential PPIs. Experiments on public datasets show that HyboWaveNet improves over existing state‑of‑the‑art methods. We also demonstrate through ablation studies that the multi‑scale graph wavelet transform module improves the predictive performance and generalization ability of HyboWaveNet. This work links geometric deep learning and signal processing to advance PPI prediction, providing a principled approach for analyzing complex biological systems.
Authors: Tiziana Mancini, Nicole Luchetti, Salvatore Macis, Velia Minicozzi, Rosanna Mosetti, Alessandro Nucara, Stefano Lupi, Annalisa D'Arco
Abstract: The SARS‑CoV‑2 pandemic has led to a significant emergence of highly mutated forms of viruses with a great ability to adapt to the human host. Some mutations resulted in changes in the amino acid sequences of viral proteins, including the Spike glycoproteins, affecting the proteins' physico‑chemical properties and functionalities. Here, we propose, for the first time to the best of our knowledge, a systematic and comparative study of the monomeric spike protein subunit 1 of three SARS‑CoV‑2 variants at pH 7.4, combining an experimental approach, based on Attenuated Total Reflection Infrared and Circular Dichroism spectroscopies, with a computational approach via Molecular Dynamics simulations. Experimental data in combination with Molecular Dynamics and surface polarity calculations provide a comprehensive understanding of the variant proteins in terms of their secondary structure content, 3D conformational structure and order, and interaction with the solvent. The present structural investigation clarifies which changes in conformation and functionality occurred as mutations appeared in the amino acid sequences. This information is essential for preventive targeted actions, drug design, and biosensing applications.
Authors: Samantha Petti, Carlos Martí-Gómez, Justin B. Kinney, Juannan Zhou, David M. McCandlish
Abstract: Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence‑to‑function maps and (ii) decomposing sequence‑function maps to elucidate the contributions of individual subsequences. Because each sequence‑function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires "gauge‑fixing," i.e., defining a unique representation for each map. Recent work has established that most existing gauge‑fixed representations arise as the unique solutions to L_2‑regularized regression in an overparameterized "weight space" where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in "function space," i.e., the space of all real‑valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge. We show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges and characterize the implicit function space priors associated with the most common weight space regularizers. Finally, we derive the posterior distribution of a broad class of sequence‑to‑function statistics, including gauge‑fixed weights and multiple systems for expressing higher‑order epistatic coefficients. We show that such distributions can be efficiently computed for product‑kernel priors using a kernel trick.
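The gauge freedom described in the abstract above can be seen in a toy additive model. The sketch below is an illustration under stated assumptions, not the paper's construction: the two-letter alphabet, the sequence length, the weight values, and the helper names `features`, `shift_gauge`, and `zero_sum_gauge` are all hypothetical. It shows two different weight vectors that define the same sequence-function map, and one common way (the zero-sum convention) of picking a unique representative.

```python
import itertools
import numpy as np

ALPHABET = "AB"  # toy two-letter alphabet (assumption)
LENGTH = 2       # toy sequence length (assumption)
K = len(ALPHABET)


def features(seq):
    """Constant feature followed by one-hot features for each position."""
    x = [1.0]
    for ch in seq:
        x.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return np.array(x)


def shift_gauge(theta, position, c):
    """Add c to every weight at one position and subtract c from the constant."""
    out = theta.copy()
    start = 1 + position * K
    out[start:start + K] += c
    out[0] -= c
    return out


def zero_sum_gauge(theta):
    """Fix the gauge by making the weights at each position sum to zero."""
    out = theta.copy()
    for p in range(LENGTH):
        start = 1 + p * K
        mean = out[start:start + K].mean()
        out[start:start + K] -= mean
        out[0] += mean
    return out


sequences = ["".join(s) for s in itertools.product(ALPHABET, repeat=LENGTH)]
X = np.stack([features(s) for s in sequences])

theta = np.array([0.5, 1.0, -2.0, 0.3, 0.7])   # arbitrary illustrative weights
theta_shifted = shift_gauge(theta, position=0, c=3.0)

same_function = np.allclose(X @ theta, X @ theta_shifted)
same_after_fixing = np.allclose(zero_sum_gauge(theta), zero_sum_gauge(theta_shifted))
```

Because exactly one one-hot feature is active per position, the shift cancels for every sequence, so the weights differ while the predicted values do not; this is why the individual weights are only interpretable once a gauge is chosen.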
Authors: Alireza Ghafarollahi, Markus J. Buehler
Abstract: Advances in artificial intelligence (AI) promise autonomous discovery, yet most systems still resurface knowledge latent in their training data. We present Sparks, a multi‑modal multi‑agent AI model that executes the entire discovery cycle that includes hypothesis generation, experiment design and iterative refinement to develop generalizable principles and a report without human intervention. Applied to protein science, Sparks uncovered two previously unknown phenomena: (i) a length‑dependent mechanical crossover whereby beta‑sheet‑biased peptides surpass alpha‑helical ones in unfolding force beyond ~80 residues, establishing a new design principle for peptide mechanics; and (ii) a chain‑length/secondary‑structure stability map revealing unexpectedly robust beta‑sheet‑rich architectures and a "frustration zone" of high variance in mixed alpha/beta folds. These findings emerged from fully self‑directed reasoning cycles that combined generative sequence design, high‑accuracy structure prediction and physics‑aware property models, with paired generation‑and‑reflection agents enforcing self‑correction and reproducibility. The key result is that Sparks can independently conduct rigorous scientific inquiry and identify previously unknown scientific principles.
Authors: Decheng Kong, Jinlong Ren, Zhuang Li, Guangcun Shan, Zhongjian Wang, Ruiqin Zhang, Wei Huang, Kunpeng Dou
Abstract: To overcome antimalarial drug resistance, recent experimental work has suggested carbohydrate derivatives as selective inhibitors of PfHT1 acting at dual orthosteric and allosteric binding pockets. Inspired by this promising therapeutic strategy, we perform molecular dynamics simulations to investigate the molecular determinants of co‑administration of orthosteric and allosteric inhibitors targeting PfHT1. Our binding free energy analysis captures the essential trend of inhibitor binding affinity to the protein from published experimental IC50 data in three sets of distinct characteristics. In particular, we rank the contributions of key binding‑site residues, which are categorized into three groups based on the linker length, tail‑group size, and sugar moiety of the inhibitors. The pivotal roles of these key residues are further validated by mutant analysis, in which mutation to nonpolar alanine reduces affinities to different degrees. The exception is the fructose derivative, which exhibits significantly enhanced affinity upon mutation of orthosteric sites due to strongly changed binding poses. This study may provide useful information for the optimized design of precision medicines that circumvent drug‑resistant Plasmodium parasites with high efficacy.
Authors: Sudipta Bera, Ayelet Vilan, Sourav Das, Israel Pecht, David Ehre, Mordechai Sheves, David Cahen
Abstract: While solid‑state protein junctions have shown efficient electron transport over lengths that surpass those of conventional organic semiconducting systems, interfacial or contact effects may obscure the intrinsic protein charge transport properties. Therefore, contact resistance (RC) effects need to be quantified and then minimized, which poses a problem if 4‑probe geometries cannot be used. Here we show how RC can be extracted quantitatively from the measured junction resistance (RP) by using the extrapolated zero‑length resistance (RZLR) and short‑circuit resistance (RS). We used AC (impedance spectroscopy) and DC measurements to examine charge transport in junctions of human serum albumin (HSA) and bacteriorhodopsin (bR) films with varying thicknesses. Three contact configurations, Si‑Au, Au‑EGaIn, and, in a micropore device (MpD), Au‑Pd, were compared. While Si‑Au and Au‑EGaIn junctions exhibit substantial RC that we ascribe to interfacial oxides and electrostatic protein‑electrode interactions, MpD effectively eliminates RC, enabling measuring the intrinsic electron transport across HSA and bR films. The exponential length dependence of RP shows a transport decay constant (beta) that varies with interfacial conditions, underscoring the role of contact engineering. By minimizing RC, exceptionally low beta values (0.7 to 1.1 per nm) are found, proving that, indeed, proteins can have outstanding charge transport efficiencies.
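The exponential length dependence mentioned in the abstract above is commonly analyzed with a log-linear fit. The sketch below is a minimal illustration on synthetic numbers, not the authors' procedure or data: it assumes the form R_P(L) = R_ZLR · exp(beta · L), and the function name `fit_length_dependence` and all values are hypothetical. It recovers beta from the slope and the extrapolated zero-length resistance from the intercept. How the contact resistance then follows from R_ZLR and the short-circuit resistance is not specified in the abstract and is left out here.

```python
import numpy as np


def fit_length_dependence(thickness_nm, resistance_ohm):
    """Fit ln(R) = ln(R_zlr) + beta * L by linear least squares.

    Returns (beta in 1/nm, extrapolated zero-length resistance in ohm).
    """
    slope, intercept = np.polyfit(thickness_nm, np.log(resistance_ohm), 1)
    return float(slope), float(np.exp(intercept))


# Synthetic, noise-free example values (assumptions, not measurements).
thickness_nm = np.array([2.0, 4.0, 6.0, 8.0])
true_beta = 0.9      # 1/nm
true_r_zlr = 1.0e3   # ohm
resistance_ohm = true_r_zlr * np.exp(true_beta * thickness_nm)

beta, r_zlr = fit_length_dependence(thickness_nm, resistance_ohm)
```

With real measurements the fit would be noisy, so reporting a confidence interval for the slope would be a sensible addition.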
Authors: Maodong Li, Jiying Zhang, Zhe Wang, Bin Feng, Wenqi Zeng, Dechin Chen, Zhijun Pan, Yu Li, Zijing Liu, Yi Isaac Yang
Abstract: The kinetics and dynamics of drug‑protein binding and dissociation are crucial to understanding drug absorption and metabolism. Despite advances in artificial intelligence (AI) tools for drug‑protein interaction studies, existing training datasets remain limited to static structures or quasi‑static conformations. This paper proposes a novel computational approach for rapidly generating drug‑protein dissociation trajectories and presents the inaugural dynamically time‑resolved 4‑D (t, x, y, z) trajectory database DD‑13M. This dataset captures over 26,000 complete dissociation processes for 565 ligand‑protein complexes, providing nearly 13 million frames of all‑atom simulation trajectories. A deep equivariant generative model, UnbindingFlow, was trained using the DD‑13M dataset. This model has the capacity to produce dissociation trajectories for novel targets whilst accurately predicting their rate constants (koff). DD‑13M introduces a new type of training dataset for AI models, establishing a de novo paradigm for studying the dynamics of drug‑protein interactions.
Authors: Jigang Fan, Chunhao Zhu, Xiaobing Lan, Haiming Zhuang, Mingyu Li, Jian Zhang, Shaoyong Lu
Abstract: Neurotensin receptor 1 (NTSR1), a member of the Class A G protein‑coupled receptor superfamily, plays an important role in modulating dopaminergic neuronal activity and eliciting opioid‑independent analgesia. Recent studies suggest that promoting β‑arrestin‑biased signaling in NTSR1 may diminish drugs of abuse, such as psychostimulants, thereby offering a potential avenue for treating human addiction‑related disorders. In this study, we utilized a novel computational and experimental approach that combined nudged elastic band‑based molecular dynamics simulations, Markov state models, temporal communication network analysis, site‑directed mutagenesis, and conformational biosensors, to explore the intricate mechanisms underlying NTSR1 activation and biased signaling. Our study reveals a dynamic stepwise transition mechanism and activated transmission network associated with NTSR1 activation. It also yields valuable insights into the complex interplay between the unique polar network, non‑conserved ion locks, and aromatic clusters in NTSR1 signaling. Moreover, we identified a cryptic allosteric site located in the intracellular region of the receptor that exists in an intermediate state within the activation pathway. Collectively, these findings contribute to a more profound understanding of NTSR1 activation and biased signaling at the atomic level, thereby providing a potential strategy for the development of NTSR1 allosteric modulators in the realm of G protein‑coupled receptor biology, biophysics, and medicine.
Authors: Cece Zhang, Xuehuan Zhu, Nick Peterson, Jieqiong Wang, Shibiao Wan
Abstract: The subcellular localization of RNAs, including long non‑coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time‑consuming, resource‑demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large‑scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI‑based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence‑based, image‑based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.
Authors: Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
Abstract: Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in‑context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer‑based models to be particularly vulnerable to this effect. This behavior is mediated by a look‑up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.
Authors: Zakaria Lamine, Abdelatif Hafid, Mohamed Rahouti
Abstract: We present a machine learning approach that leverages persistent homology to classify bacterial flagellar motors into two functional states: rotated and stalled. By embedding protein structural data into a topological framework, we extract multiscale features from filtered simplicial complexes constructed over atomic coordinates. These topological invariants, specifically persistence diagrams and barcodes, capture critical geometric and connectivity patterns that correlate with motor function. The extracted features are vectorized and integrated into a machine learning pipeline that includes dimensionality reduction and supervised classification. Applied to a curated dataset of experimentally characterized flagellar motors from diverse bacterial species, our model demonstrates high classification accuracy and robustness to structural variation. This approach highlights the power of topological data analysis in revealing functionally relevant patterns beyond the reach of traditional geometric descriptors, offering a novel computational tool for protein function prediction.
Authors: Arnav Sharma, Anthony Gitter
Abstract: The ability to make zero‑shot predictions about the fitness consequences of protein sequence changes with pre‑trained machine learning models enables many practical applications. Such models can be applied for downstream tasks like genetic variant interpretation and protein engineering without additional labeled data. The advent of capable protein structure prediction tools has led to the availability of orders of magnitude more precomputed predicted structures, giving rise to powerful structure‑based fitness prediction models. Through our experiments, we assess several modeling choices for structure‑based models and their effects on downstream fitness prediction. Zero‑shot fitness prediction models can struggle to assess the fitness landscape within disordered regions of proteins, those that lack a fixed 3D structure. We confirm the importance of matching protein structures to fitness assays and find that predicted structures for disordered regions can be misleading and affect predictive performance. Lastly, we evaluate an additional structure‑based model on the ProteinGym substitution benchmark and show that simple multi‑modal ensembles are strong baselines.
Authors: Kun Meng, Linyan Nie, Johannes Berger, Nick R. von Grafenstein, Christopher Einholz, Stefan Weber, Lars-Oliver Essen, Roberto Rizzato, Erik Schleicher, Dominik B. Bucher
Abstract: Optically addressable spin systems, such as nitrogen‑vacancy centers in diamond, have been widely studied for quantum sensing applications. In this work, we demonstrate that certain flavoproteins, specifically cryptochrome and iLOV, which generate spin correlated radical pairs upon optical excitation, also exhibit optically detected magnetic resonance (ODMR). Remarkably, the iLOV protein, commonly used in cellular imaging, displays ODMR contrast approaching 50%. We present initial applications including widefield magnetic field sensing and spatial modulation of photoluminescence using radiofrequency pulses and magnetic field gradients. Our results establish radical pairs in proteins as a novel platform for optically addressable spin systems, offering the key advantages of molecular designability and genetic encodability. Moreover, due to the spin‑selective nature of radical pair chemistry, the results lay the groundwork for future radiofrequency‑based manipulation of biological systems.
Authors: Yujie Qin, Ming He, Changyong Yu, Ming Ni, Xian Liu, Xiaochen Bo
Abstract: The de novo design of proteins refers to creating proteins with specific structures and functions that do not naturally exist. In recent years, the accumulation of high‑quality protein structure and sequence data and technological advancements have paved the way for the successful application of generative artificial intelligence (AI) models in protein design. These models have surpassed traditional approaches that rely on fragments and bioinformatics. They have significantly enhanced the success rate of de novo protein design and reduced experimental costs, leading to breakthroughs in the field. Among various generative AI models, diffusion models have yielded the most promising results in protein design. In the past two to three years, more than ten protein design models based on diffusion models have emerged. Among them, the representative model, RFDiffusion, has demonstrated success rates in 25 protein design tasks that far exceed those of traditional methods and of other AI‑based approaches such as RFjoint and hallucination. This review will systematically examine the application of diffusion models in generating protein backbones and sequences. We will explore the strengths and limitations of different models, summarize successful cases of protein design using diffusion models, and discuss future development directions.
Authors: Y. Ricardo Espinosa, C. Manuel Carlevaro, C. Gastón Ferrara
Abstract: The disruption of protein structures by denaturants like urea is well studied, though its molecular mechanisms remain unclear. Using Molecular Dynamics (MD) simulations, we investigated how urea affects the structural stability of Bovine Serum Albumin (BSA) at concentrations from 0 M to 5 M. Our results reveal that urea induces a dehydration/rehydration cycle, characterized by displacement and partial replacement of water molecules in BSA's hydration shell. At low concentrations, urea reduces protein/water hydrogen bonds while enhancing protein‑urea interactions. At higher concentrations, urea aggregation limits these interactions, promoting rehydration and changes in tertiary structure, while secondary structure remains largely intact. These findings provide insights into the mechanisms of protein denaturation and stability by urea.
Authors: Krinos Li, Xianglu Xiao, Zijun Zhong, Guang Yang
Abstract: Protein‑ligand binding complexes are ubiquitous and essential to life. Protein‑ligand binding affinity prediction (PLA) quantifies the binding strength between ligands and proteins, providing crucial insights for discovering and designing potential candidate ligands. While recent advances have been made in predicting protein‑ligand complex structures, existing algorithms for interaction and affinity prediction suffer from a sharp decline in performance when handling ligands bound to novel, unseen proteins. We propose IPBind, a geometric deep learning‑based computational method that enables robust predictions by leveraging the interatomic potential between a complex's bound and unbound states. Experimental results on widely used binding affinity prediction benchmarks demonstrate the effectiveness and universality of IPBind. It also provides atom‑level insights into its predictions. This work highlights the advantage of leveraging machine learning interatomic potentials for predicting protein‑ligand binding affinity.
Authors: Peizheng Liu, Hitoshi Iba
Abstract: Transformer‑based architectures have recently propelled advances in sequence modeling across domains, but their application to the hydrophobic‑hydrophilic (H‑P) model for protein folding remains relatively unexplored. In this work, we adapt a Deep Q‑Network (DQN) integrated with attention mechanisms (Transformers) to address the 3D H‑P protein folding problem. Our system formulates folding decisions as a self‑avoiding walk in a reinforced environment, and employs a specialized reward function based on favorable hydrophobic interactions. To improve performance, the method incorporates validity checks, including symmetry‑breaking constraints, dueling and double Q‑learning, and prioritized replay to focus learning on critical transitions. Experimental evaluations on standard benchmark sequences demonstrate that our approach achieves several known best solutions for shorter sequences, and obtains near‑optimal results for longer chains. This study underscores the promise of attention‑based reinforcement learning for protein folding, and provides a prototype of a Transformer‑based Q‑network structure for 3‑dimensional lattice models.
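The lattice setting named in this abstract can be made concrete with a small example. The following is a minimal sketch of the standard 3D H‑P lattice energy (minus the number of H‑H contacts between residues that are not chain neighbors) together with a self‑avoidance check. It is not the paper's reward function, which the abstract does not specify; the move alphabet and the names `MOVES`, `walk_to_coords`, and `hp_energy` are hypothetical choices made for illustration.

```python
# Illustrative sketch only: textbook H-P lattice energy, not the paper's reward.
# The six unit moves on a cubic lattice (hypothetical single-letter encoding).
MOVES = {
    "L": (-1, 0, 0), "R": (1, 0, 0),
    "D": (0, -1, 0), "U": (0, 1, 0),
    "B": (0, 0, -1), "F": (0, 0, 1),
}


def walk_to_coords(moves):
    """Convert a move string to lattice coordinates.

    Returns None if the walk revisits a site (i.e. it is not self-avoiding).
    """
    pos = (0, 0, 0)
    coords = [pos]
    seen = {pos}
    for move in moves:
        dx, dy, dz = MOVES[move]
        pos = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
        if pos in seen:
            return None
        seen.add(pos)
        coords.append(pos)
    return coords


def hp_energy(sequence, moves):
    """Return -(number of H-H lattice contacts between non-consecutive residues).

    Returns None for a conformation that is not self-avoiding.
    """
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move per bond")
    coords = walk_to_coords(moves)
    if coords is None:
        return None
    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y, z) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy, dz in MOVES.values():
            j = index_at.get((x + dx, y + dy, z + dz))
            # j > i + 1 counts each pair once and skips chain neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts
```

For example, folding `"HPPH"` with moves `"RUL"` places the two H residues on adjacent sites, so `hp_energy("HPPH", "RUL")` evaluates to `-1`. In a reinforcement learning setup, a reward could be derived from this quantity, with a `None` result marking an invalid action.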
Authors: Jianxiong Li, Beining Zhang, Mingzhen Li, Siyu Hu, Jinzhe Zeng, Lijun Liu, Guojun Yuan, Zhan Wang, Guangming Tan, Weile Jia
Abstract: Neural network‑based molecular dynamics (NNMD) simulations incorporating long‑range electrostatic interactions have significantly extended the applicability to heterogeneous and ionic systems, enabling effective modeling of critical physical phenomena such as protein folding and dipolar surfaces while maintaining ab initio accuracy. However, neural network inference and long‑range force computation remain the major bottlenecks, severely limiting simulation speed. In this paper, we target DPLR, a state‑of‑the‑art NNMD package that supports long‑range electrostatics, and propose a set of comprehensive optimizations to enhance computational efficiency. We introduce (1) a hardware‑offloaded FFT method to reduce the communication overhead; (2) an overlapping strategy that hides long‑range force computations using a single core per node; and (3) a ring‑based load balancing method that enables even atom‑level task redistribution with minimal communication overhead. Experimental results on the Fugaku supercomputer show that our work achieves a 37x performance improvement, reaching a maximum simulation speed of 51 ns/day.
Authors: Manuel Mayo, Rodrigo Soto
Abstract: Bacterial chemotaxis in E. coli is controlled by methylation of chemoreceptors, which, through a biochemical pathway, regulates the concentration of the CheY‑P protein that ultimately controls the tumbling rate. As a consequence, the tumbling rate adjusts to changes in the concentration of relevant chemicals, producing a biased random walk toward chemoattractants or away from repellents. Methylation is a slow process, implying that the internal concentration of CheY‑P is not instantaneously adapted to the environment, and the tumbling rate presents memory. This implies that the Keller‑Segel (KS) equations used to describe chemotaxis at the macroscopic scale, which assume a local relation between the bacterial flux and the chemical gradient, are not fully valid, as memory and the associated nonlocal response are not considered. To derive the equations that replace the KS ones, we use a kinetic approach, in which a kinetic equation for the bacterial transport is written considering the dynamics of the protein concentration. When memory is large, the protein concentration field must be considered a relevant variable alongside the bacterial density. Working out the Chapman‑Enskog (CE) method, the dynamical equations for these fields are obtained, which have the form of reaction‑diffusion equations with flux and source terms depending on the gradients of the chemical signal. The transport coefficients are obtained entirely in terms of the microscopic dynamics, and their values are given for the case of E. coli. Solving the equations for an inhomogeneous signal, it is shown that the response is nonlocal, with a smoothing length as large as 170 μm for E. coli. The homogeneous response and the relaxational dynamics are also studied. The case of small memory is also studied, in which case the CE method reproduces the KS equations, with explicit expressions for the transport coefficients.
Authors: Manuel Mayo, Rodrigo Soto
Abstract: Chemotaxis in bacteria such as E. coli is controlled by the slow methylation of chemoreceptors. As a consequence, intrinsic time and length scales of tens of seconds and hundreds of micrometers emerge, making the Keller‑Segel equations invalid when the chemical signal changes on these scales, as occurs in several natural environments. Using a kinetic approach, we show that chemotaxis is described using the concentration field of the protein that controls tumbling in addition to bacterial density. The macroscopic equations for these fields are derived, which describe the nonlocal response.
Authors: Rafael Bicudo Ribeiro, Henrique Musseli Cezar
Abstract: Clustering techniques are consolidated as a powerful strategy for analyzing the extensive data generated from molecular modeling. In particular, some tools have been developed to cluster configurations from classical simulations with a standard focus on individual units, ranging from small molecules to complex proteins. Since the standard approach includes computing the Root Mean Square Deviation (RMSD) of atomic positions, accounting for the permutation between atoms is crucial for optimizing the clustering procedure in the presence of identical molecules. To address this issue, we present the clusttraj program, a solvent‑informed clustering package that fixes inflated RMSD values by finding the optimal pairing between configurations. The program combines reordering schemes with the Kabsch algorithm to minimize the RMSD of molecular configurations before running a hierarchical clustering protocol. By considering evaluation metrics, one can determine the ideal threshold in an automated fashion and compare the different linkage schemes available. The program capabilities are exemplified by considering solute‑solvent systems ranging from pure water clusters to a solvated protein or a small solute in different solvents. As a result, we investigate the dependence on different parameters, such as the system size and reordering method, and also the representativeness of the cluster medoids for the characterization of optical properties. clusttraj is implemented as a Python library and can be employed to cluster generic ensembles of molecular configurations that go beyond solute‑solvent systems.
Authors: Wouter Vervust, Daniel T. Zhang, Enrico Riccardi, Titus S. van Erp, An Ghysels
Abstract: Predicting the kinetics of drug‑protein interactions is crucial for understanding drug efficacy, particularly in personalized medicine, where protein mutations can significantly alter drug residence times. This study applies Replica Exchange Transition Interface Sampling (RETIS) and its Partial Path variant (REPPTIS) to investigate the dissociation kinetics of imatinib from Abelson nonreceptor tyrosine kinase (ABL) and mutants relevant to chronic myeloid leukemia therapy. These path‑sampling methods offer a bias‑free alternative to conventional approaches requiring qualitative predefined reaction coordinates. Nevertheless, the complex free‑energy landscape of ABL‑imatinib dissociation presents significant challenges. Multiple metastable states and orthogonal barriers lead to parallel unbinding pathways, complicating convergence in TIS‑based methods. Despite employing computational efficiency strategies such as asynchronous replica exchange, full convergence remained elusive. This work provides a critical assessment of path sampling in high‑dimensional biological systems, discussing the need for enhanced initialization strategies, advanced Monte Carlo path generation moves, and machine learning‑derived reaction coordinates to improve kinetic predictions of drug dissociation with minimal prior knowledge.
Authors: Bruno Stegani, Riccardo Capelli
Abstract: This study introduces a novel computational approach based on ratchet‑and‑pawl molecular dynamics (rMD) for accurately estimating ligand dissociation kinetics in protein‑ligand complexes. By integrating Kramers' theory with Bell's equation, our method systematically investigates the relationship between the effective biasing force applied during simulations and the ligand residence times. The proposed technique is demonstrated through extensive simulations of the benzamidine‑trypsin complex, employing first an implicit solvent model (multi‑eGO) to set the parameters of the approach and then an explicit solvent model. Our results illustrate the method's reliability, accuracy, and computational efficiency, with calculated kinetic rates closely matching experimental values. Overall, this study highlights rMD as a versatile and efficient non‑equilibrium methodology, broadly applicable to kinetic analyses in chemical and biological systems.
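To illustrate the kind of force‑to‑residence‑time relationship this abstract refers to, here is a minimal sketch that assumes a plain Bell‑type relation, τ(F) = τ0 · exp(−F · x‡ / kBT), and recovers the unbiased residence time τ0 by a log‑linear least‑squares fit. It does not reproduce the paper's combination with Kramers' theory; the function name `fit_bell`, the units, and the synthetic data in the usage note are assumptions for illustration only.

```python
import math


def fit_bell(forces, residence_times):
    """Fit ln(tau) = ln(tau0) - slope * F by ordinary least squares.

    Under the assumed Bell-type relation, `slope` corresponds to x_barrier / (kB * T).
    Returns (tau0, slope), where tau0 is the extrapolated zero-force residence time.
    """
    n = len(forces)
    if n < 2 or n != len(residence_times):
        raise ValueError("need at least two (force, time) pairs of equal length")
    log_times = [math.log(t) for t in residence_times]
    mean_f = sum(forces) / n
    mean_y = sum(log_times) / n
    sxx = sum((f - mean_f) ** 2 for f in forces)
    if sxx == 0:
        raise ValueError("forces must not all be identical")
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(forces, log_times))
    gradient = sxy / sxx                    # d ln(tau) / dF, negative for Bell-type decay
    intercept = mean_y - gradient * mean_f  # ln(tau0)
    return math.exp(intercept), -gradient
```

As a usage example with synthetic (not measured) data, residence times generated as `100.0 * math.exp(-0.5 * F)` for `F` in `[1.0, 2.0, 3.0, 4.0]` give back `tau0 ≈ 100.0` and `slope ≈ 0.5`. With real simulation output, the scatter around the fitted line indicates how well the Bell‑type assumption holds over the force range used.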
Authors: Zhengxi Lu, Shizhuo Cheng, Yuru Jiang, Yan Zhang, Min Zhang
Abstract: Recent advances in protein backbone generation have achieved promising results under structural, functional, or physical constraints. However, existing methods lack the flexibility for precise topology control, limiting navigation of the backbone space. We present ProtPainter, a diffusion‑based approach for generating protein backbones conditioned on 3D curves. ProtPainter follows a two‑stage process: curve‑based sketching and sketch‑guided backbone generation. For the first stage, we propose CurveEncoder, which predicts secondary structure annotations from a curve to parametrize sketch generation. For the second stage, the sketch guides the generative process in Denoising Diffusion Probabilistic Modeling (DDPM) to generate backbones. During this process, we further introduce a fusion scheduling scheme, Helix‑Gating, to control the scaling factors. To evaluate, we propose the first benchmark for topology‑conditioned protein generation, introducing Protein Restoration Task and a new metric, self‑consistency Topology Fitness (scTF). Experiments demonstrate ProtPainter's ability to generate topology‑fit (scTF > 0.8) and designable (scTM > 0.5) backbones, with drawing and dragging tasks showcasing its flexibility and versatility.
Authors: Yuqing Liu, Meng Zhao, Guanlan Hu, Yuchen Zhang
Abstract: Background. Diet and inflammation are critical factors influencing cancer risk. However, the combined impact of nutritional status and inflammatory biomarkers on cancer status and type, using machine learning (ML), remains underexplored.
Objectives. This study investigates the association between nutritional factors, inflammatory biomarkers, and cancer status, and whether these relationships differ across cancer types using National Health and Nutrition Examination Survey (NHANES) data.
Methods. We analyzed 24 macro‑ and micronutrients, C‑reactive protein (CRP), and the advanced lung cancer inflammation index (ALI) in 26,409 NHANES participants (2,120 with cancer). Multivariable logistic regression assessed associations with cancer prevalence. We also examined whether these features differed across the five most common cancer types. To evaluate predictive value, we applied three ML models ‑ Logistic Regression, Random Forest, and XGBoost ‑ on the full feature set.
Results. The cohort's mean age was 49.1 years; 34.7% were obese. Comorbidities such as anemia and liver conditions, along with nutritional factors like protein and several vitamins, were key predictors of cancer status. Among the models, Random Forest performed best, achieving an accuracy of 0.72.
Conclusions. Higher‑quality nutritional intake and lower levels of inflammation may offer protective effects against cancer. These findings highlight the potential of combining nutritional and inflammatory markers with ML to inform cancer prevention strategies.
Authors: Snigdha Tiwari, Sahil Sharma, Arvind Bagga, Aditi Sinha, Deepak Sharma
Abstract: Background. Telemedicine has the potential to provide secure and cost‑effective healthcare at the touch of a button. Nephrotic syndrome is a chronic childhood illness involving frequent relapses and demands long/complex treatment. Hence, developing a remote means of doctor‑patient interface will ensure the provision of quality healthcare to patients. Methods. The Utsarjan mobile App framework was built with Flutter that enables cross‑platform development (Android, iOS, Windows) with speed, smoothness, and open‑source benefits. The frontend uses Dart for user interaction, while the backend employs Node.js, Express, and NGINX for APIs, load balancing and high performance. MongoDB ensures a flexible database, Bcrypt secures passwords, PM2 handles deployment, uptime and logs, while Firebase Cloud Messaging powers free push notifications. Results. Utsarjan (means excretion) is a multi‑functional smartphone application for giving nephrotic care and real‑time assistance to all patients (especially those in rural regions and/or who do not have access to specialists). It helps patients and doctors by ensuring opportune visits, recording each clinical test/parameter and improving medication adherence. It gives a graphical visualization of relapses, medicine dosage as well as different anthropometric parameters (urine protein, BP, height and weight). This is the first nephrotic care App that enables prompt access to doctor's advice. Conclusions. Utsarjan is a mobile App to provide kidney care and real‑time assistance to children with nephrotic syndrome. It gives a graphical overview of changes in a patient's health over the long course of treatment. This will assist doctors in appropriately modifying the treatment regimen. Consequently, it will (hopefully) lead to the prevention of relapses and/or complications.
Authors: Pingfei Zhu, Chenyang Zhao, Haishi Zhao, Bo Yang
Abstract: AI‑powered drug discovery typically relies on the successful prediction of compound‑protein interactions, which are pivotal for the evaluation of designed compound molecules in structure‑based drug design and represent a core challenge in the field. However, accurately predicting compound‑protein affinity via regression models usually requires adequate‑binding poses, which are derived from costly and complex experimental methods or time‑consuming simulations with docking software. In response, we have introduced the GenShin model, which constructs a geometry‑enhanced structural graph module that separately extracts additional features from proteins and compounds. Consequently, it attains an accuracy on par with mainstream models in predicting compound‑protein affinities, while eliminating the need for an adequate‑binding pose as input. Our experimental findings demonstrate that the GenShin model vastly outperforms other models that rely on non‑input docking conformations, matching, or in some cases even exceeding, the performance of those requiring adequate‑binding poses. Further experiments indicate that our GenShin model is more robust to inadequate‑binding poses, affirming its higher suitability for real‑world drug discovery scenarios. We hope our work will inspire more endeavors to bridge the gap between AI models and practical drug discovery challenges.
Authors: Agnese Barbensi, Alexander R. Klotz, Dimos Gkountaroulis
Abstract: Simulations of knotting and unknotting in polymers or other filaments rely on random processes to facilitate topological changes. Here we introduce a method of topological steering to determine the optimal pathway by which a filament may knot or unknot while subject to a given set of physics. The method involves measuring the knotoid spectrum of a space curve projected onto many surfaces and computing the mean unravelling number of those projections. Several perturbations of a curve can be generated stochastically, e.g. using the Langevin equation or crankshaft moves, and a gradient can be followed that maximises or minimises the topological complexity. We apply this method to a polymer model based on a growing self‑avoiding tangent‑sphere chain, which can be made to model proteins by imposing a constraint that the bending and twisting angles between successive spheres must maintain the distribution found in naturally occurring protein structures. We show that without these protein‑like geometric constraints, topologically optimised polymers typically form alternating torus knots and composites thereof, similar to the stochastic knots predicted for long DNA. However, when the geometric constraints are imposed on the system, the frequency of twist knots increases, similar to the observed abundance of twist knots in protein structures.
Authors: Aingeru Ramos, Jose A. Pascual, Javier Navaridas, Ivan Coluzza
Abstract: Markov Chain Monte Carlo (MCMC) methods are algorithms for sampling probability distributions, commonly applied to the Boltzmann distribution in physical and chemical models such as protein folding and the Ising model. These methods enable exploration of such systems by sampling their most probable states. However, sampling multidimensional and multimodal distributions with MCMC requires substantial computational resources, leading to the development of techniques aimed at improving sampling efficiency. In this context, quantum computing, with its potential to accelerate classical methods, emerges as a promising solution to the sampling problem. In this work, we present the design of a new circuit based on the Discrete Quantum Walk (DQW) algorithm to perform MCMC sampling over a desired distribution. Simulation results show convergence behavior in the superposition of the quantum register that encodes the target distribution. This design is further refined to increase convergence speed and, consequently, the scalability of the algorithm.
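For readers unfamiliar with the classical baseline this abstract starts from, the following minimal sketch shows a Metropolis sampler drawing from a Boltzmann distribution over a handful of discrete energy levels. It is a classical illustration only and is not the quantum‑walk circuit proposed in the paper; the function names `metropolis_sample` and `boltzmann`, the uniform proposal, and the example energies are assumptions.

```python
import math
import random


def boltzmann(energies, beta):
    """Exact Boltzmann probabilities p_i proportional to exp(-beta * E_i)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def metropolis_sample(energies, beta, n_steps, seed=0):
    """Estimate state frequencies with the Metropolis algorithm.

    Uses a symmetric uniform proposal over all states, so the acceptance
    probability reduces to min(1, exp(-beta * (E_new - E_old))).
    """
    rng = random.Random(seed)
    n_states = len(energies)
    state = 0
    counts = [0] * n_states
    for _ in range(n_steps):
        proposal = rng.randrange(n_states)
        delta = energies[proposal] - energies[state]
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            state = proposal
        counts[state] += 1  # a rejected move counts the current state again
    return [c / n_steps for c in counts]
```

Running `metropolis_sample([0.0, 1.0, 2.0], beta=1.0, n_steps=200_000)` yields frequencies that should approach `boltzmann([0.0, 1.0, 2.0], 1.0)` as the number of steps grows. The cost of reaching that agreement on large, multimodal state spaces is the bottleneck that motivates accelerated samplers.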
Authors: Anna C. Nelson, Scott A. McKinley, Melissa M. Rolls, Maria-Veronica Ciocanel
Abstract: Microtubules (MTs) are dynamic protein filaments essential for intracellular organization and transport, particularly in long‑lived cells such as neurons. The plus and minus ends of neuronal MTs switch between growth and shrinking phases, and the nucleation of new filaments is believed to be regulated in both healthy and injury conditions. We propose stochastic and deterministic mathematical models to investigate the impact of filament nucleation and length‑regulation mechanisms on emergent properties such as MT lengths and numbers in living cells. We expand our stochastic continuous‑time Markov chain model of filament dynamics to incorporate MT nucleation and capture realistic stochastic fluctuations in MT numbers and tubulin availability. We also propose a simplified partial differential equation (PDE) model, which allows for tractable analytical investigation into steady‑state MT distributions under different nucleation and length‑regulating mechanisms. We find that the stochastic and PDE modeling approaches show good agreement in predicted MT length distributions, and that both MT nucleation and the catastrophe rate of large‑length MTs regulate MT length distributions. In both frameworks, multiple mechanistic combinations achieve the same average MT length. The models proposed can predict parameter regimes where the system is scarce in tubulin, the building block of MTs, and suggest that low filament nucleation regimes are characterized by high variation in MT lengths, while high nucleation regimes drive high variation in MT numbers. These mathematical frameworks have the potential to improve our understanding of MT regulation in both healthy and injured neurons.
Authors: Jack Shepherd, Mark Leake
Abstract: British biophysics has a tradition of scientific invention and innovation, resulting in new technologies transforming biological insight, such as rapid DNA sequencing, super‑resolution and label‑free microscopy, high‑throughput and single‑molecule bio‑sensing, and bio‑inspired synthetic materials. Some advances were established through democratised platforms and many have biomedical success, a key example involving the SARS‑CoV‑2 spike protein during the COVID‑19 pandemic. Here, three UK labs made crucial contributions revealing how the spike protein targets human cells, and how therapies of vaccines and neutralizing nanobodies work, enabled largely through biophysical innovations of cryo‑electron microscopy. Here, we discuss leading‑edge innovations which resulted from discovery‑led British 'Physics of Life' research (capturing blends of physical‑life sciences research in the UK including biophysics and biological physics) and have matured into wide‑reaching sustainable commercial ventures enabling translational impact. We describe the biophysical science which led to these academic spinouts, presenting the scientific questions that were addressed through innovating new techniques and approaches. We consider these examples through the lens of opportunities and challenges for academic biophysics research in partnership with British industry. We highlight how commercial breakthroughs have emerged organically from fundamental research rather than from technology‑first approaches but also discuss lessons to learn from past failures. Finally, we propose recommendations concerning future resourcing and structuring of UK biophysics research and the training and support of its researchers to ensure that UK plc punches above its weight in biophysics innovation, and a need to educate the policymakers and public that an absence of basic science impoverishes innovation.
Authors: Maximilian G. Schuh, Joshua Hesse, Stephan A. Sieber
Abstract: Antibiotic resistance presents a growing global health crisis, demanding new therapeutic strategies that target novel bacterial mechanisms. Recent advances in protein structure prediction and machine learning‑driven molecule generation offer a promising opportunity to accelerate drug discovery. However, practical guidance on selecting and integrating these models into real‑world pipelines remains limited. In this study, we develop an end‑to‑end, artificial intelligence‑guided antibiotic discovery pipeline that spans target identification to compound realization. We leverage structure‑based clustering across predicted proteomes of multiple pathogens to identify conserved, essential, and non‑human‑homologous targets. We then systematically evaluate six leading 3D‑structure‑aware generative models (spanning diffusion, autoregressive, graph neural network, and language model architectures) on their usability, chemical validity, and biological relevance. Rigorous post‑processing filters and commercial analogue searches reduce over 100 000 generated compounds to a focused, synthesizable set. Our results highlight DeepBlock and TamGen as top performers across diverse criteria, while also revealing critical trade‑offs between model complexity, usability, and output quality. This work provides a comparative benchmark and blueprint for deploying artificial intelligence in early‑stage antibiotic development.
Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching‑based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high‑quality single‑step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long‑chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task‑specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
Authors: Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
Abstract: Current state‑of‑the‑art generative models map noise to data distributions by matching flows or scores. A key limitation of these models is their inability to readily integrate available partial observations and additional priors. In contrast, energy‑based models (EBMs) address this by incorporating corresponding scalar energy terms. Here, we propose Energy Matching, a framework that endows flow‑based approaches with the flexibility of EBMs. Far from the data manifold, samples move from noise to data along irrotational, optimal transport paths. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize these dynamics with a single time‑independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. The present method substantially outperforms existing EBMs on CIFAR‑10 and ImageNet generation in terms of fidelity, while retaining simulation‑free training of transport‑based approaches away from the data manifold. Furthermore, we leverage the flexibility of the method to introduce an interaction energy that supports the exploration of diverse modes, which we demonstrate in a controlled protein generation setting. This approach learns a scalar potential energy, without time conditioning, auxiliary generators, or additional networks, marking a significant departure from recent EBM methods. We believe this simplified yet rigorous formulation significantly advances EBMs capabilities and paves the way for their wider adoption in generative modeling in diverse domains.
Authors: Julian Cremer, Ross Irwin, Alessandro Tibo, Jon Paul Janet, Simon Olsson, Djork-Arné Clevert
Abstract: We introduce FLOWR, a novel structure‑based framework for the generation and optimization of three‑dimensional ligands. FLOWR integrates continuous and categorical flow matching with equivariant optimal transport, enhanced by an efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a thoroughly curated dataset comprising ligand‑pocket co‑crystal complexes specifically designed to address existing data quality issues. Empirical evaluations demonstrate that FLOWR surpasses current state‑of‑the‑art diffusion‑ and flow‑based methods in terms of PoseBusters‑validity, pose accuracy, and interaction recovery, while offering a significant inference speedup, achieving up to 70‑fold faster performance. In addition, we introduce FLOWR:multi, a highly accurate multi‑purpose model allowing for the targeted sampling of novel ligands that adhere to predefined interaction profiles and chemical substructures for fragment‑based design without the need for re‑training or any re‑sampling strategies.
Authors: Krishna Rijal, Caroline M. Holmes, Samantha Petti, Gautam Reddy, Michael M. Desai, Pankaj Mehta
Abstract: Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene‑environment interactions. Recent advances in machine learning, particularly attention‑based models, offer a promising alternative. Initially developed for natural language processing, attention‑based models excel at capturing context‑dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention‑based models to quantitative genetics. We analyze the performance of this attention‑based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out‑of‑sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi‑environment attention‑based model to jointly analyze genotype‑phenotype maps across multiple environments and show that such architectures can be used for "transfer learning" ‑ predicting phenotypes in novel environments with limited training data.
Authors: Chaoran Cheng, Jiahan Li, Jiajun Fan, Ge Liu
Abstract: Recent efforts have extended the flow‑matching framework to discrete generative modeling. One strand of models directly works with the continuous probabilities instead of discrete tokens, which we colloquially refer to as Continuous‑State Discrete Flow Matching (CS‑DFM). Existing CS‑DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS‑DFM models, under which the existing variants can be understood as operating on different α‑representations of probabilities. Building upon the theory of information geometry, we introduce α‑Flow, a family of CS‑DFM models that adheres to the canonical α‑geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow matching loss for α‑flow establishes a unified variational bound for the discrete negative log‑likelihood. We comprehensively evaluate different instantiations of α‑flow on various discrete generation domains to demonstrate their effectiveness in discrete generative modeling, including intermediate values whose geometries have never been explored before. α‑flow significantly outperforms its discrete‑state counterpart in image and protein sequence generation and better captures the entropy in language modeling.
Authors: Aspen Erlandsson Brisebois, Jason Broderick, Zahed Khatooni, Heather L. Wilson, Steven Rayan, Gordon Broderick
Abstract: There is growing awareness that the success of pharmacologic interventions on living organisms is significantly impacted by context and timing of exposure. In turn, this complexity has led to an increased focus on regulatory network dynamics in biology and our ability to represent them in a high‑fidelity way, in silico. Logic network models show great promise here and their parameter estimation can be formulated as a constraint satisfaction problem (CSP) that is well‑suited to the often sparse, incomplete data in biology. Unfortunately, even in the case of Boolean logic, the combinatorial complexity of these problems grows rapidly, challenging the creation of models at physiologically‑relevant scales. That said, quantum computing, while still nascent, facilitates novel information‑processing paradigms with the potential for transformative impact in problems such as this one. In this work, we take a first step at actualizing this potential by identifying the structure and Boolean decisional logic of a well‑studied network linking 5 proteins involved in the neural development of the mammalian cortical area of the brain. We identify the protein‑protein connectivity and binary decisional logic governing this network by formulating it as a Boolean Satisfiability (B‑SAT) problem. We employ Grover's algorithm to solve the NP‑hard problem faster than the exponential time complexity required by deterministic classical algorithms. Using approaches deployed on both quantum simulators and actual noisy intermediate scale quantum (NISQ) hardware, we accurately recover several high‑likelihood models from very sparse protein expression data. The results highlight the differential roles of data types in supporting accurate models; the impact of quantum algorithm design as it pertains to the mutability of quantum hardware; and the opportunities for accelerated discovery enabled by this approach.
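As a rough illustration of the scaling that Grover's algorithm offers for unstructured search, the sketch below computes the standard iteration count and success probability for a search space of a given size. The function name `grover_plan` and the example sizes are hypothetical; the abstract does not state how many Boolean variables or satisfying assignments the 5‑protein network involves. Note that the speedup is quadratic in the number of oracle queries (about the square root of the search‑space size) relative to classical exhaustive search.

```python
import math


def grover_plan(n_candidates, n_solutions=1):
    """Return (iterations, success_probability) for textbook Grover search.

    n_candidates: size of the unstructured search space (e.g. 2**n assignments).
    n_solutions:  number of marked (satisfying) items, assumed known here.
    """
    if not 0 < n_solutions <= n_candidates:
        raise ValueError("need 0 < n_solutions <= n_candidates")
    # Each Grover iteration rotates the state by 2*theta toward the marked subspace.
    theta = math.asin(math.sqrt(n_solutions / n_candidates))
    iterations = math.floor(math.pi / (4 * theta))
    # Probability of measuring a marked item after that many iterations.
    p_success = math.sin((2 * iterations + 1) * theta) ** 2
    return iterations, p_success


if __name__ == "__main__":
    # Hypothetical example: 10 Boolean variables, one satisfying assignment.
    k, p = grover_plan(2 ** 10)
    print(f"iterations={k}, success probability={p:.4f}")
```

For a hypothetical space of 1024 assignments with a single solution, this gives on the order of 25 oracle calls, whereas a classical scan needs about 512 checks on average.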
Authors: Charles Rathkopf
Abstract: Generative AI increasingly supports scientific inference, from protein structure prediction to weather forecasting. Yet its distinctive failure mode, hallucination, raises epistemic alarm bells. I argue that this failure mode can be addressed by shifting from data‑centric to phenomenon‑centric assessment. Through case studies of AlphaFold and GenCast, I show how scientific workflows discipline generative models through theory‑guided training and confidence‑based error screening. These strategies convert hallucination from an unmanageable epistemic threat into bounded risk. When embedded in such workflows, generative models support reliable inference despite opacity, provided they operate in theoretically mature domains.
Authors: Unathi Skosana, Sthembiso Gumede, Mark Tame
Abstract: We present numerical calculations of the energetic separation between different spin states (singlet, triplet and quintet) for a simplified model of a deoxy‑myoglobin protein using the variational quantum eigensolver (VQE) algorithm. The goal is to gain insight into the workflow and challenges of VQE simulations for transition metal complexes, with emphasis on methodology over hardware‑specific implementation. The numerical calculations are performed using an in‑house statevector simulator with single‑ and multi‑reference trial wavefunctions based on the k‑unitary pair coupled‑cluster generalized singles and doubles or k‑UpCCGSD ansatz. The spin‑state energetics for active spaces of increasing size up to 10 spatial orbitals (20 spin orbitals or qubits) are computed with VQE and were found to agree with the classical complete active self‑consistent field or CASSCF method to within 1‑4 kcal/mol. We evaluate relevant multi‑reference diagnostics and show that the spin states computed with VQE possess a sufficient degree of multi‑reference character to highlight the presence of strong electron correlation effects. Our numerical simulations show that in the ideal case, the VQE algorithm is capable of reproducing spin‑state energetics of strongly correlated systems such as transition metal complexes for both single‑ and multi‑reference trial wavefunctions, asymptotically achieving good agreement with results from classical methods as the number of active orbitals increases.
Authors: Neeru Dubey, Elin Karlsson, Miguel Angel Redondo, Johan Reimegård, Anna Rising, Hedvig Kjellström
Abstract: The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging due to the intricate sequence‑structure‑function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we developed a lightweight GPT‑based generative model by distilling the pre‑trained ProtGPT2 protein language model. The distilled model was subjected to multilevel fine‑tuning using curated subsets of the Spider Silkome dataset. Specifically, we adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and further refine it with 572 repeats associated with experimentally determined fiber‑level mechanical properties. Our model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties while also predicting those properties for given sequences. Validation includes sequence‑level analysis, assessing physicochemical attributes and expected distribution of key motifs as well as secondary structure compositions. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirmed the predictive accuracy of the model. This framework advances the rational design of spider silk‑inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.
Authors: Alice Driessen, Benedek Harsanyi, Marianna Rapsomaniki, Jannis Born
Abstract: Learning the response of single‑cells to various treatments offers great potential to enable targeted therapies. In this context, neural optimal transport (OT) has emerged as a principled methodological framework because it inherently accommodates the challenges of unpaired data induced by cell destruction during data acquisition. However, most existing OT approaches are incapable of conditioning on different treatment contexts (e.g., time, drug treatment, drug dosage, or cell type) and we still lack methods that unanimously show promising generalization performance to unseen treatments. Here, we propose the Conditional Monge Gap which learns OT maps conditionally on arbitrary covariates. We demonstrate its value in predicting single‑cell perturbation responses conditional to one or multiple drugs, a drug dosage, or combinations thereof. We find that our conditional models achieve results comparable and sometimes even superior to the condition‑specific state‑of‑the‑art on scRNA‑seq as well as multiplexed protein imaging data. Notably, by aggregating data across conditions we perform cross‑task learning which unlocks remarkable generalization abilities to unseen drugs or drug dosages, widely outperforming other conditional models in capturing heterogeneity (i.e., higher moments) in the perturbed population. Finally, by scaling to hundreds of conditions and testing on unseen drugs, we narrow the gap between structure‑based and effect‑based drug representations, suggesting a promising path to the successful prediction of perturbation effects for unseen treatments.
Authors: Yanping Liu, Dui Qin, Xinwei Li, Guoqiang Li, Zhichao Liu, Kena Song, Wei Wang, Zhangyong Li
Abstract: Cell migration, which is strictly regulated by intracellular and extracellular cues, is crucial for normal physiological processes and the progression of certain diseases. However, there is a lack of an efficient approach to analyze super‑statistical and time‑varying characteristics of cell migration based on single trajectories. Here, we propose an approach to reconstruct single‑cell trajectories, which incorporates wavelet transform, power spectrum of an OU‑process, and fits of the power spectrum to analyze statistical and time‑varying properties of customized target‑finding and migration metrics. Our results reveal diverse relationships between motility parameters and dynamic metrics, especially the existence of an optimal parameter range. Moreover, the analysis reveals that the loss of Arpin protein enhances the migration potential of D. discoideum, and confirms a previously reported result that the rescued amoeba is distinguishable from the wild‑type amoeba. Significantly, time‑varying dynamic metrics exhibit periodic phenomena under the influence of irregularly changing parameters, and these correlate with migration potential. Our analysis suggests that the approach provides a powerful tool for estimating time‑dependent migration potential and statistical features of single‑cell trajectories, enabling a better understanding of the relationship between intracellular proteins and cellular behaviors. This also provides more insights on the migration dynamics of single cells and cell populations.
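For orientation, one common parameterization of an Ornstein‑Uhlenbeck (OU) velocity process and its one‑sided power spectrum is given below. The symbols are assumptions introduced here, not taken from the abstract: $v$ is a velocity component, $P$ the persistence time, $D$ the diffusion coefficient, and $\eta(t)$ unit white noise. Conventions differ between papers, so the prefactor should be checked against the original work before fitting.

```latex
\frac{dv}{dt} = -\frac{v}{P} + \frac{\sqrt{2D}}{P}\,\eta(t),
\qquad
S_v(f) = \frac{4D}{1 + (2\pi f P)^2}
```

Under this convention, fitting the Lorentzian $S_v(f)$ to a measured velocity spectrum yields estimates of $P$ (from the corner frequency) and $D$ (from the low‑frequency plateau).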
Authors: Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln
Abstract: Protein language models (PLMs) have revolutionised computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black‑box nature limits biological interpretation and translation to actionable insights. We present an explainable adapter layer, PLM‑eXplain (PLM‑X), that bridges this gap by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. Using embeddings from ESM2, our adapter incorporates well‑established properties, including secondary structure and hydropathy, while maintaining high performance. We demonstrate the effectiveness of our approach across three protein‑level classification tasks: prediction of extracellular vesicle association, identification of transmembrane helices, and prediction of aggregation propensity. PLM‑X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalisable solution for enhancing PLM interpretability across various downstream applications. This work addresses a critical need in computational biology by providing a bridge between powerful deep learning models and actionable biological insights.
Authors: Ivan Rossi, Guido Barducci, Tiziana Sanavia, Paola Turina, Emidio Capriotti, Piero Fariselli
Abstract: The prediction of protein stability changes following single‑point mutations plays a pivotal role in computational biology, particularly in areas like drug discovery, enzyme reengineering, and genetic disease analysis. Although deep‑learning strategies have pushed the field forward, their use in standard workflows remains limited due to resource demands. Conversely, potential‑like methods are fast, intuitive, and efficient. Yet, these typically estimate Gibbs free energy shifts without considering the free‑energy variations in the unfolded protein state, an omission that may breach mass balance and diminish accuracy. This study shows that incorporating a mass‑balance correction (MBC) to account for the unfolded state significantly enhances these methods. While many machine learning models partially model this balance, our analysis suggests that a refined representation of the unfolded state may improve the predictive performance.
Authors: Jakub Vašíček, Dafni Skiadopoulou, Ksenia G. Kuznetsova, Lukas Käll, Marc Vaudel, Stefan Bruckner
Abstract: In mass spectrometry‑based proteomics, experts usually project data onto a single set of reference sequences, overlooking the influence of common haplotypes (combinations of genetic variants inherited together from a parent). We recently introduced ProHap, a tool for generating customized protein haplotype databases. Here, we present ProHap Explorer, a visualization interface designed to investigate the influence of common haplotypes on the human proteome. It enables users to explore haplotypes, their effects on protein sequences, and the identification of non‑canonical peptides in public mass spectrometry datasets. The design builds on well‑established representations in biological sequence analysis, ensuring familiarity for domain experts while integrating novel interactive elements tailored to proteogenomic data exploration. User interviews with proteomics experts confirmed the tool's utility, highlighting its ability to reveal whether haplotypes affect proteins of interest. By facilitating the intuitive exploration of proteogenomic variation, ProHap Explorer supports research in personalized medicine and the development of targeted therapies.
Authors: Felix Wittwer, Nimmi Das Anthuparambil, Frederik Unger, Randeer Pratap Gautam, Silja Flenner, Imke Greving, Christian Gutt, Peter Modregger
Abstract: Upon heating, egg yolk transforms from a liquid to a gel due to protein denaturation. This process can serve as a useful model to better understand protein denaturation in general. Using x‑ray holographic tomography, we investigated the structural changes in egg yolk during boiling without the need for complex sample fixation or drying. Our results reveal a developing separation between proteins and lipids, with fatty components rapidly aggregating into large globules that subsequently evolve into bubbles.
Authors: Paola F. Antonietti, Mattia Corti, Sergio Gómez, Ilaria Perugia
Abstract: This work presents a structure‑preserving, high‑order, unconditionally stable numerical method for approximating the solution to the Fisher‑Kolmogorov equation on polytopic meshes, with a particular focus on its application in simulating misfolded protein spreading in neurodegenerative diseases. The model problem is reformulated using an entropy variable to guarantee solution positivity, boundedness, and satisfaction of a discrete entropy‑stability inequality at the numerical level. The scheme combines a local discontinuous Galerkin method on polytopal meshes for the space discretization with a ν‑step backward differentiation formula for the time integration. Implementation details are discussed, including a detailed derivation of the linear systems arising from Newton's iteration. The accuracy and robustness of the proposed method are demonstrated through extensive numerical tests. Finally, the method's practical performance is demonstrated through simulations of α‑synuclein propagation in a two‑dimensional brain geometry segmented from MRI data, providing a relevant computational framework for modeling synucleopathies (such as Parkinson's disease) and, more generally, neurodegenerative diseases.
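For reference, the Fisher‑Kolmogorov equation is commonly written in the following reaction‑diffusion form. The notation is an assumption made here rather than the paper's own: $c$ is the relative concentration of misfolded protein, $\mathbf{D}$ a diffusion tensor, and $\alpha$ a reaction (conversion) rate.

```latex
\frac{\partial c}{\partial t}
  = \nabla \cdot \left( \mathbf{D}\, \nabla c \right) + \alpha\, c\,(1 - c),
\qquad 0 \le c \le 1
```

The bound $0 \le c \le 1$ is the positivity and boundedness property that a structure‑preserving discretization aims to keep at the discrete level.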
Authors: Jordane Preto, Vania Calandrini, Elena Floriani, Gergely Katona, Marco Pettini
Abstract: Recent experimental evidence for collective protein vibrations in the terahertz (THz) domain indicates that energy in biomolecular systems can self‑organize in an orderly manner, as anticipated by Fröhlich's theory of condensates within a quantum framework. As a first step to bridge THz experiments with theory, we study the Hamiltonian dynamics of a classical network of coupled normal modes representing Fröhlich‑type systems. Our results demonstrate that biologically relevant condensates can emerge at room temperature under appropriate nonlinear coupling schemes. The condensation mechanism remains robust also when the original Fröhlich resonance conditions are relaxed.
Authors: L. N. Mohanam, R. Umeda, L. Gu, Y. Song, D. J. Tobias, A. I. Hochbaum, R. Wu, S. Sharifzadeh
Abstract: The anaerobic bacterium Geobacter sulfurreducens produces extracellular, electronically conductive cytochrome polymer wires that are conductive over micron length scales. Structure models from cryo‑electron microscopy data show OmcS wires form a linear chain of hemes along the protein wire axis, which is proposed as the structural basis supporting their electronic properties. The geometric arrangement of heme along OmcS wires is conserved in many multiheme c‑type cytochromes and other recently discovered microbial cytochrome wires. However, the mechanism by which this arrangement of heme molecules supports electron transport through proteins and supramolecular heme wires is unclear. Here, we investigate the site energies, inter‑heme coupling, and long‑range electronic conductivity within OmcS. We introduce an approach to extract charge carrier site information directly from Kohn‑Sham density functional theory, without employing projector schemes. We show that site and coupling energies are highly sensitive to changes in inter‑heme geometry and the surrounding electrostatic environment, as intuitively expected. These parameters serve as inputs for a quantum charge carrier model that includes decoherence corrections, with which we predict a diffusion coefficient comparable with other organic‑based electronic materials. Based on these simulations, we propose that dynamic disorder, particularly due to perturbative inter‑heme vibrations, allows the carrier to overcome trapping due to the presence of static disorder via small frequency‑dependent fluctuations. These studies provide insights into molecular and electronic determinants of long‑range electronic conductivity in microbial cytochrome wires and highlight design principles for bioinspired, heme‑based conductive materials.
Authors: Matthew K Burgess, Ryan T Murray, Veronica M Lucian, Zekun Liu, Robin O Cleveland, Callum J Beeston, Malavika Nair
Abstract: Conventional tissue engineering methodologies frequently depend on pharmacological strategies to induce or expedite tissue repair. However, bioengineered strategies incorporating biophysical stimulation have emerged as promising alternatives. Electroactive materials facilitate the provision of controlled electrical, mechanical, and electromechanical stimuli, which support cell proliferation and tissue remodelling. Despite their ability to supply external electrical and mechanical stimuli to the tissue microenvironment, the electroactive polymers in use today often lack critical biochemical signals essential for native‑like cell‑cell and cell‑scaffold interactions, thereby constraining their regenerative capabilities. To address the demand for biomimetic materials that possess enhanced capabilities in promoting cell and tissue stimulation, we present the development of a novel class of polymers called ionomeric extracellular matrices (iECMs). By utilising the linker‑mediated conjugation of sulfonic acid biomolecules (taurine) to the backbone of an extracellular matrix protein (collagen), we illustrate the potential of iECMs as the first electromechanical actuating material platform derived entirely from ECM materials, paving the way for dynamic and soft‑robotic platforms for a wide range of tissue engineering applications.
Authors: Yusef Ahsini, Marc Escoto, J. Alberto Conejero
Abstract: Anomalous diffusion occurs in a wide range of systems, including protein transport within cells, animal movement in complex habitats, pollutant dispersion in groundwater, and nanoparticle motion in synthetic materials. Accurately estimating the anomalous diffusion exponent and the diffusion coefficient from the particle trajectories is essential to distinguish between sub‑diffusive, super‑diffusive, or normal diffusion regimes. These estimates provide a deeper insight into the underlying dynamics of the system, facilitating the identification of particle behaviors and the detection of changes in diffusion states. However, analyzing short and noisy video data, which often yield incomplete and heterogeneous trajectories, poses a significant challenge for traditional statistical approaches. We introduce a data‑driven method that integrates particle tracking, an attention U‑Net architecture, and a change‑point detection algorithm to address these issues. This approach not only infers the anomalous diffusion parameters with high accuracy but also identifies temporal transitions between different states, even in the presence of noise and limited temporal resolution. Our methodology demonstrated strong performance in the 2nd Anomalous Diffusion (AnDi) Challenge benchmark within the top submissions for video tasks.
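To make the estimation target concrete, here is a minimal sketch of the classical baseline that such learned methods are usually compared against: a log‑log fit of the time‑averaged mean squared displacement (MSD), assuming MSD ≈ K·t^α. This is not the authors' attention U‑Net pipeline; the function names `time_averaged_msd` and `fit_anomalous_exponent` and the convention for the generalized coefficient K are assumptions for illustration.

```python
import numpy as np


def time_averaged_msd(traj, max_lag):
    """Time-averaged MSD of one trajectory for lags 1..max_lag (in frames).

    traj: array of shape (T,) or (T, d) holding positions at equal time steps.
    """
    traj = np.asarray(traj, dtype=float)
    if traj.ndim == 1:
        traj = traj[:, None]  # treat a 1-D track as shape (T, 1)
    lags = np.arange(1, max_lag + 1)
    msd = np.array([
        np.mean(np.sum((traj[k:] - traj[:-k]) ** 2, axis=1)) for k in lags
    ])
    return lags, msd


def fit_anomalous_exponent(lags, msd, dt=1.0):
    """Fit log(MSD) = log(K) + alpha * log(t) and return (alpha, K)."""
    slope, intercept = np.polyfit(np.log(lags * dt), np.log(msd), 1)
    return slope, np.exp(intercept)


if __name__ == "__main__":
    # Ballistic test track: x(t) = 0.5 * t, so MSD = 0.25 * t**2 exactly.
    track = 0.5 * np.arange(100)
    lags, msd = time_averaged_msd(track, max_lag=10)
    alpha, K = fit_anomalous_exponent(lags, msd)
    print(f"alpha={alpha:.3f}, K={K:.3f}")
```

An exponent α below 1 indicates sub‑diffusion, α near 1 normal diffusion, and α above 1 super‑diffusion. On short, noisy tracks this fit becomes unreliable, which is the limitation the abstract attributes to traditional statistical approaches.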
Authors: Xuefeng Liu, Songhao Jiang, Chih-chan Tien, Jinbo Xu, Rick Stevens
Abstract: Protein representation learning is critical for numerous biological tasks. Recently, large transformer‑based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence‑based tasks. However, pLMs lack structural context. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein‑related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs‑generated sequential representations and GNN‑extracted structural features, improving information exchange and enhancement across layers of the neural network. This bidirectional and hierarchical (Bi‑Hierarchical) fusion approach leverages the strengths of both modalities to capture richer and more comprehensive protein representations. Based on the framework, we further introduce local Bi‑Hierarchical Fusion with gating and global Bi‑Hierarchical Fusion with multihead self‑attention approaches. Our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including enzyme EC classification, model quality assessment, protein‑ligand binding affinity prediction, protein‑protein binding site prediction, and B cell epitopes prediction. Our method establishes a new state‑of‑the‑art for multimodal protein representation learning, emphasizing the efficacy of Bi‑Hierarchical Fusion in bridging sequence and structural modalities.
Authors: Ngoc-Quang Nguyen
Abstract: Accurate prediction of compound‑protein interactions (CPI) remains a cornerstone challenge in computational drug discovery. While existing sequence‑based approaches leverage molecular fingerprints or graph representations, they critically overlook three‑dimensional (3D) structural determinants of binding affinity. To bridge this gap, we present EquiCPI, an end‑to‑end geometric deep learning framework that synergizes first‑principles structural modeling with SE(3)‑equivariant neural networks. Our pipeline transforms raw sequences into 3D atomic coordinates via ESMFold for proteins and DiffDock‑L for ligands, followed by physics‑guided conformer re‑ranking and equivariant feature learning. At its core, EquiCPI employs SE(3)‑equivariant message passing over atomic point clouds, preserving symmetry under rotations, translations, and reflections, while hierarchically encoding local interaction patterns through tensor products of spherical harmonics. The proposed model is evaluated on BindingDB (affinity prediction) and DUD‑E (virtual screening), where EquiCPI achieves performance on par with or exceeding state‑of‑the‑art deep learning competitors.
Authors: Mohammad Amaan Sayeed, Engin Tekin, Maryam Nadeem, Nancy A. ElNaker, Aahan Singh, Natalia Vassilieva, Boulbaba Ben Amor
Abstract: Unlocking the next generation of biotechnology and therapeutic innovation demands overcoming the inherent complexity and resource‑intensity of conventional protein engineering methods. Recent GenAI‑powered computational techniques often rely on the availability of the target protein's 3D structures and specific binding sites to generate high‑affinity binders, constraints exhibited by models such as AlphaProteo and RFdiffusion. In this work, we explore the use of Protein Language Models (pLMs) for high‑affinity binder generation. We introduce Prot42, a novel family of Protein Language Models (pLMs) pretrained on vast amounts of unlabeled protein sequences. By capturing deep evolutionary, structural, and functional insights through an advanced auto‑regressive, decoder‑only architecture inspired by breakthroughs in natural language processing, Prot42 dramatically expands the capabilities of computational protein design based on language only. Remarkably, our models handle sequences up to 8,192 amino acids, significantly surpassing standard limitations and enabling precise modeling of large proteins and complex multi‑domain sequences. Demonstrating powerful practical applications, Prot42 excels in generating high‑affinity protein binders and sequence‑specific DNA‑binding proteins. Our innovative models are publicly available, offering the scientific community an efficient and precise computational toolkit for rapid protein engineering.
Authors: Xiaokun Liu, Sayedmohammadreza Rastegari, Yijun Huang, Sxe Chang Cheong, Weikang Liu, Wenjie Zhao, Qihao Tian, Hongming Wang, Yingjie Guo, Shuo Zhou, Sina Tabakhi, Xianyuan Liu, Zheqing Zhu, Wei Sang, Haiping Lu
Abstract: In cancer therapeutics, protein‑metal binding mechanisms critically govern the pharmacokinetics and targeting efficacy of drugs, thereby fundamentally shaping the rational design of anticancer metallodrugs. While conventional laboratory methods used to study such mechanisms are often costly, low throughput, and limited in capturing dynamic biological processes, machine learning (ML) has emerged as a promising alternative. Despite increasing efforts to develop protein‑metal binding datasets and ML algorithms, the application of ML in tumor protein‑metal binding remains limited. Key challenges include a shortage of high‑quality, tumor‑specific datasets, insufficient consideration of multiple data modalities, and the complexity of interpreting results due to the "black box" nature of complex ML models. This paper summarizes recent progress and ongoing challenges in using ML to predict tumor protein‑metal binding, focusing on data, modeling, and interpretability. We present multimodal protein‑metal binding datasets and outline strategies for acquiring, curating, and preprocessing them for training ML models. Moreover, we explore the complementary value provided by different data modalities and examine methods for their integration. We also review approaches for improving model interpretability to support more trustworthy decisions in cancer research. Finally, we offer our perspective on research opportunities and propose strategies to address the scarcity of tumor protein data and the limited number of predictive models for tumor protein‑metal binding. We also highlight two promising directions for effective metal‑based drug design: integrating protein‑protein interaction data to provide structural insights into metal‑binding events and predicting structural changes in tumor proteins after metal binding.
Authors: Hannah Janmohamed, Antoine Cully
Abstract: Quality‑Diversity algorithms are powerful tools for discovering diverse, high‑performing solutions. Recently, Multi‑Objective Quality‑Diversity (MOQD) extends QD to problems with several objectives while preserving solution diversity. MOQD has shown promise in fields such as robotics and materials science, where finding trade‑offs between competing objectives like energy efficiency and speed, or material properties is essential. However, existing methods in MOQD rely on tessellating the feature space into a grid structure, which prevents their application in domains where feature spaces are unknown or must be learned, such as complex biological systems or latent exploration tasks. In this work, we introduce Multi‑Objective Unstructured Repertoire for Quality‑Diversity (MOUR‑QD), a MOQD algorithm designed for unstructured and unbounded feature spaces. We evaluate MOUR‑QD on five robotic tasks. Importantly, we show that our method excels in tasks where features must be learned, paving the way for applying MOQD to unsupervised domains. We also demonstrate that MOUR‑QD is advantageous in domains with unbounded feature spaces, outperforming existing grid‑based methods. Finally, we demonstrate that MOUR‑QD is competitive with established MOQD methods on existing MOQD tasks and achieves double the MOQD‑score in some environments. MOUR‑QD opens up new opportunities for MOQD in domains like protein design and image generation.
Authors: Giuseppe Russo, Kristina Gligorić, Vincent Moreau, Robert West
Abstract: Reducing meat consumption is crucial for achieving global environmental and nutritional targets. Meat‑Free Day (MFD) is a widely adopted strategy to address this challenge by encouraging plant‑based diets through the removal of animal‑based meals. We assessed the environmental, behavioral, and nutritional impacts of MFD by implementing 67 MFDs over 18 months (once a week on a randomly chosen day) across 12 cafeterias on a large university campus, analyzing over 400,000 food purchases. MFD reduced on‑campus food‑related greenhouse gas (GHG) emissions on treated days by 52.9% and contributed to improved fiber (+26.9%) and cholesterol (‑4.5%) consumption without altering caloric intake. These nutritional benefits were, however, accompanied by a 27.6% decrease in protein intake and a 34.2% increase in sugar consumption. Moreover, the increase in plant‑based meals did not carry over to subsequent days, as evidenced by a 3.5% rebound in animal‑based meal consumption on days immediately following treated days. MFD also led to a 16.8% drop in on‑campus meal sales on treated days. Monte Carlo simulations suggest that if 8.7% of diners were to eat burgers off‑campus on treated days, MFD's GHG savings would be fully negated. As our analysis identifies on‑campus customer retention as the main challenge to MFD effectiveness, we recommend combining MFD with customer retention interventions to ensure environmental and nutritional benefits.
Authors: Valentin Lombard, Sergei Grudinin, Elodie Laine
Abstract: Proteins move and deform to ensure their biological functions. Despite significant progress in protein structure prediction, approximating conformational ensembles at physiological conditions remains a fundamental open problem. This paper presents a novel perspective on the problem by directly targeting continuous compact representations of protein motions inferred from sparse experimental observations. We develop a task‑specific loss function enforcing data symmetries, including scaling and permutation operations. Our method PETIMOT (Protein sEquence and sTructure‑based Inference of MOTions) leverages transfer learning from pre‑trained protein language models through an SE(3)‑equivariant graph neural network. When trained and evaluated on the Protein Data Bank, PETIMOT shows superior performance in time and accuracy, capturing protein dynamics, particularly large/slow conformational changes, compared to state‑of‑the‑art flow‑matching approaches and traditional physics‑based models.
Authors: Shengrui XU, Tianchi Lu, Zikun Wang, Jixiu Zhai
Abstract: Protein‑protein interaction (PPI) prediction plays a pivotal role in deciphering cellular functions and disease mechanisms. To address the limitations of traditional experimental methods and existing computational approaches in cross‑modal feature fusion and false‑negative suppression, we propose SCMPPI, a novel supervised contrastive multimodal framework. By effectively integrating sequence‑based features (AAC, DPC, ESMC‑CKSAAP) with network topology (Node2Vec embeddings) and incorporating an enhanced contrastive learning strategy with negative sample filtering, SCMPPI achieves superior prediction performance. Extensive experiments on eight benchmark datasets demonstrate its state‑of‑the‑art accuracy (98.13%) and AUC (99.69%), along with excellent cross‑species generalization (AUC > 99%). Successful applications in CD9 networks, Wnt pathway analysis, and cancer‑specific networks further highlight its potential for disease target discovery, establishing SCMPPI as a powerful tool for multimodal biological data analysis.
Authors: Jiannuo Li, Lan Yao
Abstract: Accurate prediction of the binding affinity between drugs and target proteins is a core task in computer‑aided drug design. Existing deep learning methods tend to ignore the internal sub‑structural features of drug molecules and drug‑target interactions, resulting in limited prediction performance. In this paper, we propose HCAF‑DTA, a drug‑target association prediction model based on a cross‑attention fusion hypergraph neural network. The model introduces a hypergraph representation in the feature extraction stage: drug molecule hypergraphs are constructed based on the tree decomposition algorithm, and sub‑structural and global features are extracted by fusing the hypergraph neural network with the graph neural network through skip connections, in which the hyperedges can efficiently characterise functional groups and other key chemical features. For protein feature extraction, a weighted graph is constructed from the residue contact maps predicted by the ESM model, and multilayer graph neural networks are used to capture spatial dependencies. In the prediction stage, a bidirectional multi‑head cross‑attention mechanism is designed to model intermolecular interactions from the dual viewpoints of atoms and amino acids, and cross‑modal features with correlated information are fused by attention. Experiments on benchmark datasets such as Davis and KIBA show that HCAF‑DTA outperforms state‑of‑the‑art methods in all three performance evaluation metrics, with MSE reaching 0.198 and 0.122, respectively, an improvement of up to 4% over the best baseline.
Authors: Christoph Brunken, Sebastien Boyer, Mustafa Omar, Martin Maarand, Olivier Peltre, Solal Attias, Bakary N'tji Diallo, Anastasia Markina, Olaf Othersen, Oliver Bent
Abstract: Coarse‑grained (CG) force field methods for molecular systems are a crucial tool to simulate large biological macromolecules and are therefore essential for characterisations of biomolecular systems. While state‑of‑the‑art deep learning (DL)‑based models for all‑atom force fields have improved immensely over recent years, we observe and analyse significant limitations of the currently available approaches for DL‑based CG simulations. In this work, we present the first transferable DL‑based CG force field approach (i.e., not specific to only one narrowly defined system type) applicable to a wide range of biosystems. To achieve this, our CG algorithm does not rely on hard‑coded rules and is tuned to output coarse‑grained systems optimised for minimal statistical noise in the ground truth CG forces, which results in significant improvement of model training. Our force field model is also the first CG variant that is based on the MACE architecture and is trained on a custom dataset created by a new approach based on the fragmentation of large biosystems covering protein, RNA and lipid chemistry. We demonstrate that our model can be applied in molecular dynamics simulations to obtain stable and qualitatively accurate trajectories for a variety of systems, while also discussing cases for which we observe limited reliability.
Authors: Phuong Thuy Bui, Trinh Xuan Hoang
Abstract: The ribosomal exit tunnel is the primary structure affecting the release of nascent proteins at the ribosome. The ribosomal exit tunnels from different species have elements of conservation and differentiation in structural and physico‑chemical properties. In this study, by simulating the elongation and escape processes of nascent proteins at the ribosomal exit tunnels of four different organisms, we show that the escape process has conserved mechanisms across the domains of life. Specifically, it is found that the escape process of proteins follows the diffusion mechanism given by a simple diffusion model and the median escape time positively correlates with the number of hydrophobic residues and the net charge of a protein for all the exit tunnels considered. These properties hold for twelve distinct proteins considered in two slightly different and improved Gō‑like models. It is also found that the differences in physico‑chemical properties of the tunnels lead to quantitative differences in the protein escape times. In particular, the relatively strong hydrophobicity of the E. coli tunnel and the unusually high number of negatively charged amino acids on the tunnel surface of H. marismortui lead to substantially slower escapes of proteins at these tunnels than at those of S. cerevisiae and H. sapiens.
Authors: Francesco Calvanese, Giovanni Peinetti, Polina Pavlinova, Philippe Nghe, Martin Weigt
Abstract: Generative probabilistic models have shown promise in designing artificial RNA and protein sequences but often suffer from high rates of false positives, where sequences predicted as functional fail experimental validation. To address this critical limitation, we explore the impact of reintegrating experimental feedback into the model design process. We propose a likelihood‑based reintegration scheme, which we test through extensive computational experiments on both RNA and protein datasets, as well as through wet‑lab experiments on the self‑splicing ribozyme from the group I intron RNA family where our approach demonstrates particular efficacy. We show that integrating recent experimental data enhances the model's capacity of generating functional sequences (e.g. from 6.7% to 63.7% of active designs at 45 mutations). This feedback‑driven approach thus provides a significant improvement in the design of biomolecular sequences by directly tackling the false‑positive challenge.
Authors: Wanqing Yang, Yanwei Wang, Yang Wang
Abstract: This systematic review outlines pivotal advancements in deep learning‑driven protein structure prediction and design, focusing on four core models (AlphaFold, RoseTTAFold, RFDiffusion, and ProteinMPNN) developed by 2024 Nobel Laureates in Chemistry: David Baker, Demis Hassabis, and John Jumper. We analyze their technological iterations and collaborative design paradigms, emphasizing breakthroughs in atomic‑level structural accuracy, functional protein engineering, and multi‑component biomolecular interaction modeling. Key innovations include AlphaFold3's diffusion‑based framework for unified biomolecular prediction, RoseTTAFold's three‑track architecture integrating sequence and spatial constraints, RFDiffusion's denoising diffusion for de novo protein generation, and ProteinMPNN's inverse folding for sequence‑structure co‑optimization. Despite transformative progress in applications such as binder design, nanomaterials, and enzyme engineering, challenges persist in dynamic conformational sampling, multimodal data integration, and generalization to non‑canonical targets. We propose future directions, including hybrid physics‑AI frameworks and multimodal learning, to bridge gaps between computational design and functional validation in cellular environments.
Authors: Junyu Hou
Abstract: De novo molecular design has extensive applications in drug discovery and materials science. The vast chemical space renders direct molecular searches computationally prohibitive, while traditional experimental screening is both time‑ and labor‑intensive. Efficient molecular generation and screening methods are therefore essential for accelerating drug discovery and reducing costs. Although reinforcement learning (RL) has been applied to optimize molecular properties via reward mechanisms, its practical utility is limited by issues in training efficiency, convergence, and stability. To address these challenges, we adopt Direct Preference Optimization (DPO) from NLP, which uses molecular score‑based sample pairs to maximize the likelihood difference between high‑ and low‑quality molecules, effectively guiding the model toward better compounds. Moreover, integrating curriculum learning further boosts training efficiency and accelerates convergence. A systematic evaluation of the proposed method on the GuacaMol Benchmark yielded excellent scores. For instance, the method achieved a score of 0.883 on the Perindopril MPO task, representing a 6% improvement over competing models. Subsequent target protein binding experiments confirmed its practical efficacy. These results demonstrate the strong potential of DPO for molecular design tasks and highlight its effectiveness as a robust and efficient solution for data‑driven drug discovery.
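The preference objective named in the abstract above can be illustrated with the standard DPO loss for a single (preferred, dispreferred) pair. This is a minimal sketch of the generic formulation from the NLP literature, not necessarily the exact variant or curriculum used in the paper; the function names, arguments, and the value of beta are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair of samples.

    logp_w, logp_l: policy log-likelihoods of the higher-scoring
        ("winner") and lower-scoring ("loser") molecule.
    ref_logp_w, ref_logp_l: the same quantities under the frozen
        reference model.
    beta: strength of the implicit KL constraint (assumed value).
    """
    # Difference of log-ratios between policy and reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss decreases as the policy raises the likelihood of the higher-scoring molecule relative to the lower-scoring one.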
Authors: Tudor-Stefan Cotet, Igor Krawczuk
Abstract: Bayesian optimization (BO) has recently become more prevalent in protein engineering applications and hence has become a fruitful target of benchmarks. However, current BO comparisons often overlook real‑world considerations like risk and cost constraints. In this work, we compare 72 model combinations of encodings, surrogate models, and acquisition functions on 11 protein binder fitness landscapes, specifically from this perspective. Drawing from the portfolio optimization literature, we adopt metrics to quantify the cold‑start performance relative to a random baseline, to assess the risk of an optimization campaign, and to calculate the overall budget required to reach a fitness threshold. Our results suggest the existence of Pareto‑optimal models on the risk‑performance axis, the shift of this preference depending on the landscape explored, and the robust correlation between landscape properties such as epistasis with the average and worst‑case model performance. They also highlight that rigorous model selection requires substantial computational and statistical efforts.
Authors: Nhung T. T. Nguyen, Pham Nam Phong, Duy Manh Le, Minh-Tien Tran, Trinh Xuan Hoang
Abstract: The aqueous solvent profoundly influences protein folding, yet its effects are relatively poorly understood. In this study, we investigate the impact of solvation on the folding of lattice proteins by using Monte Carlo simulations. The proteins are modelled as self‑avoiding 27‑mer chains on a cubic lattice, with compact native states and structure‑based Gō potentials. Each residue that makes no contacts with other residues in a given protein conformation is assigned a solvation energy ε_s, representing its full exposure to the solvent. We find that a negative ε_s, indicating a favorable solvation, increases the cooperativity of the folding transition by lowering the free energy of the unfolded state, increasing the folding free energy barrier, and narrowing the folding routes. This favorable solvation also significantly improves the correlation between folding rates and the native topology, measured by the relative contact order. Our results suggest that the Gō model may overestimate the importance of native interactions and that a solvation potential countering the native bias can play a significant role. The solvation energy in our model can be related to the polar interaction between water and peptide groups in the protein backbone. It is therefore suggested that the solvation of peptide groups may significantly contribute to the exceptional folding cooperativity and the pronounced topology‑dependence of folding rates observed in two‑state proteins.
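The energy function described in the abstract above (a Gō term for native contacts plus a solvation energy ε_s for each residue with no contacts) can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that a contact is any pair of non‑bonded residues on neighbouring lattice sites, that each native contact contributes -eps_go, and that all names and parameter values are hypothetical.

```python
from itertools import combinations


def lattice_energy(coords, native_contacts, eps_go=1.0, eps_s=-0.25):
    """Energy of a chain conformation on a cubic lattice.

    coords: list of (x, y, z) integer lattice sites, one per residue.
    native_contacts: set of (i, j) index pairs with i < j that are in
        contact in the native state.
    eps_go: energy gained per native contact (assumed Go-like term).
    eps_s: solvation energy for each residue that makes no contacts.
    """
    n = len(coords)
    has_contact = [False] * n
    energy = 0.0
    for i, j in combinations(range(n), 2):
        if j - i < 2:
            continue  # residues adjacent along the chain are bonded, not contacts
        # Manhattan distance 1 means nearest neighbours on the lattice.
        dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if dist == 1:
            has_contact[i] = has_contact[j] = True
            if (i, j) in native_contacts:
                energy -= eps_go
    # Fully exposed residues (no contacts at all) pay or gain eps_s each.
    energy += eps_s * has_contact.count(False)
    return energy


# A fully extended 4-mer (all residues exposed) versus a U-shaped one
# in which residues 0 and 3 form a native contact.
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
folded = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
e_extended = lattice_energy(extended, {(0, 3)})
e_folded = lattice_energy(folded, {(0, 3)})
```

With a negative eps_s the extended chain is stabilised by solvation, which illustrates how a favorable solvation term lowers the energy of unfolded conformations relative to a pure Gō potential.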
Authors: Fadi Alharbi, Nishant Budhiraja, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Harshith Guduru, Mohanad Mohammed
Abstract: The integration of heterogeneous multi‑omics datasets at a systems level remains a central challenge for developing analytical and computational models in precision cancer diagnostics. This paper introduces Multi‑Omics Graph Kolmogorov‑Arnold Network (MOGKAN), a deep learning framework that utilizes messenger‑RNA, micro‑RNA sequences, and DNA methylation samples together with Protein‑Protein Interaction (PPI) networks for cancer classification across 31 different cancer types. The proposed approach combines differential gene expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi‑omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov‑Arnold theorem principle and uses trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves classification accuracy of 96.28 percent and exhibits low experimental variability in comparison to related deep learning‑based models. The biomarkers identified by MOGKAN were validated as cancer‑related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. By integrating multi‑omics data with graph‑based deep learning, our proposed approach demonstrates robust predictive performance and interpretability with potential to enhance the translation of complex multi‑omics data into clinically actionable cancer diagnostics.
Authors: Hernán Barrio-Zhang, Glen McHale, Gary G. Wells, Rodrigo Ledesma-Aguilar, Rui Han, Nicholas Jakubovics, Jinju Chen
Abstract: Siliconization is widely used as a coating technique to engineer surface properties, such as in the pharmaceutical and medical device industries to lubricate motion, ensure complete dispensation of product, and to inhibit protein adsorption and biofilm growth. In the hitherto unconnected literature, there has recently been significant progress in understanding the concept of surfaces slippery to liquids. Whereas in the siliconization industry the wettability of surfaces focuses on the hydrophobicity, as measured by contact angle and surface energy, for surfaces slippery to liquids the focus is on the contact angle hysteresis (droplet‑on‑solid static friction). Moreover, it has been discovered that surfaces with similar static wetting properties can have dramatically different droplet kinetic friction. Here, we report a simple‑to‑apply coating method to create ultra‑low contact angle hysteresis liquid‑like coatings for glass (G), polydimethylsiloxane (PDMS), polyurethane (PU), and stainless steel (SS); materials that are used for pharmaceutical/parenteral packaging and medical equipment. Moreover, we demonstrate that the coating's slow sliding dynamics surface properties for water droplets, which indicate high droplet kinetic friction, can be converted into fast sliding dynamics, which indicate low droplet kinetic friction, by a simple molecular capping (methylation) process. Our results provide new insight into key aspects of siliconization coatings in the context of industrial/commercial processes.
Authors: Chiara Lombardo, Andrea Sottini, Sarina Seiter, Gerard Colas des Francs, Jaime Ortega Arroyo, Romain Quidant
Abstract: Interferometric‑based microscopies stand as powerful label‑free approaches for monitoring and characterising chemical reactions and heterogeneous nanoparticle systems in real time with single particle sensitivity. Nevertheless, coherent artifacts, such as speckle and parasitic interferences, together with limited photon fluxes from spatially incoherent sources, pose an ongoing challenge in achieving both high sensitivity and throughput. In this study, we systematically characterise how partial coherence affects both the signal contrast and the background noise level; thus, it offers a route to improve the signal‑to‑noise ratio from single nanoparticles (NPs), irrespective of their size and composition or the light source used. We first validate that lasers can be modified into partially coherent sources with performance matching that of spatially incoherent ones while providing higher photon fluxes. Secondly, we demonstrate that tuning the degree of partial coherence not only enhances the detection sensitivity of both synthetic and biological NPs, but also affects how signal contrasts vary as a function of the focus position. Finally, we apply our findings to single‑protein detection, confirming that these principles extend to differential imaging modalities, which deliver the highest sensitivity. Our results address a critical milestone in the detection of weakly scattering NPs in complex matrices, with wide‑ranging applications in biotechnology, nanotechnology, chemical synthesis, and biosensing, ushering in a new generation of microscopes that push both the sensitivity and throughput boundaries without requiring beam scanning.
Authors: Lukas Eriksson, Tim K. Esser, Marko Grabarics, Laurence T. Seeley, Simon B. Knoblauch, Jingjin Fan, Joseph Gault, Paul Fremdling, Thomas Reynolds, Justin L. P. Benesch, Carol V. Robinson, Jani R. Bolla, Lindsay Baker, Stephan Rauschenbach
Abstract: Electrospray ion beam deposition (ESIBD) is the intact, chemically selective deposition of molecular ions on surfaces in vacuum. Here, we present a general method and dedicated instrumentation for ESIBD‑based cryoEM sample preparation of soluble proteins. Precise control over deposition energy, sample environment, and reproducible growth of thin, homogeneous, vitreous ice films embedding the deposited proteins results in samples suitable for high‑resolution cryoEM structure determination. Applied to several protein complexes (β‑Galactosidase, GDH, RuBisCo, and GroEL), the workflow yields near‑atomic resolution cryoEM maps (2.5‑4.8 Å) from which atomic models are derived. Dehydration‑induced structural changes correlate with the magnitude of solvent exposure in the native structure: interior residues present high‑resolution density while surface‑exposed regions rearrange. Coherent rearrangements retain secondary and tertiary structure, whereas incoherent changes degrade resolution. These results establish ESIBD+cryoEM as a viable method for structure determination of chemically selected protein samples, directly linking native MS chemical information with near‑atomic structural resolution.
Authors: Sylwia Czach, Jakub Rydzewski, Wiesław Nowak
Abstract: Photoactive proteins absorb light and undergo structural changes that enable them to perform essential biological functions. These proteins are critical for understanding light‑induced biological processes, making them important in biophysics, biotechnology, and medicine. One effective approach to uncovering photoactive processes is through computational methods. These techniques provide atomic‑level insights into the structural, electronic, and dynamic changes that occur upon light absorption. By employing these methods, we can gain a better understanding of processes that are challenging to capture experimentally, such as chromophore isomerization and protein conformational changes. Here, we provide a brief overview of the different families of photoactive proteins and the computational methods used to study them, including bioinformatics, molecular dynamics, and enhanced sampling. Our review can serve as an introduction to computational methods for studying light‑activated molecular processes, specifically targeting researchers beginning their journey in this field.
Authors: Josef Cikhart, Aneta Leskourová, Michal H. Kolář
Abstract: Ribosomes are critical biomolecular nanomachines responsible for protein synthesis in all known organisms. The function and dynamics of ribosomes can be studied using molecular dynamics computer simulations. Although this task remains challenging at atomic level, several studies have reported all‑atom molecular dynamics simulations of the entire ribosome. However, for certain applications, atomistic simulations are impractical due to the limited simulation timescales achievable. In this study, we investigate the applicability of the coarse‑grained MARTINI model for simulations of the bacterial ribosome. After testing several simulation setups, we found that the structure of the ribosome and its components are generally well represented compared to the reference experimental structure. Compared with all‑atom simulations of the entire ribosome, coarse‑grained simulations result in a less flexible and smaller ribosome. We demonstrate how modifications of some parameters of the model can enhance the dynamics of the ribosome to better align with the atomistic model. Our work provides a detailed protocol for coarse‑grained simulations of the ribosome and highlights aspects of the model that need improvements.
Authors: Alexander J. Dear, Georg Meisl, Jing Hu, Tuomas P. J. Knowles, Sara Linse
Abstract: "Seeding" is the addition of preformed fibrils to a solution of monomeric protein to accelerate its aggregation into new fibrils. It is a versatile and widely‑used tool for scientists studying protein aggregation kinetics, as it enables the isolation and separate study of discrete reaction steps contributing to protein aggregation, specifically elongation and secondary nucleation. However, the seeding levels required to achieve dominating effects on each of these steps separately have been established largely by trial‑and‑error, due in part to the lack of availability of integrated rate laws valid for moderate to high seeding levels and generally applicable to all common underlying reaction mechanisms. Here, we improve on a recently developed mathematical method based on Lie symmetries for solving differential equations, and with it derive such an integrated rate law. We subsequently develop simple expressions for the amounts of seed required to isolate each step. We rationalize the empirical observation that fibril seeds must often be broken up into small pieces to successfully isolate elongation. We also derive expressions for average fibril lengths at different times in the aggregation reaction, and explore different methods to break up fibrils. This paper will provide an invaluable reference for future experimental and theoretical studies in which seeding techniques are employed, and should enable more sophisticated analyses than have been performed to date.
Authors: Xiuyuan Hu, Guoqing Liu, Can Chen, Yang Zhao, Hao Zhang, Xue Liu
Abstract: Structure‑based drug design (SBDD) is a critical task in drug discovery, requiring the generation of molecular information across two distinct modalities: discrete molecular graphs and continuous 3D coordinates. However, existing SBDD methods often overlook two key challenges: (1) the multi‑modal nature of this task and (2) the causal relationship between these modalities, limiting their plausibility and performance. To address both challenges, we propose TransDiffSBDD, an integrated framework combining autoregressive transformers and diffusion models for SBDD. Specifically, the autoregressive transformer models discrete molecular information, while the diffusion model samples continuous distributions, effectively resolving the first challenge. To address the second challenge, we design a hybrid‑modal sequence for protein‑ligand complexes that explicitly respects the causality between modalities. Experiments on the CrossDocked2020 benchmark demonstrate that TransDiffSBDD outperforms existing baselines.
Authors: Clara Fannjiang, Ji Won Park
Abstract: Algorithms for machine learning‑guided design, or design algorithms, use machine learning‑based predictions to propose novel objects with desired property values. Given a new design task (for example, to design novel proteins with high binding affinity to a therapeutic target), one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user‑specified success criterion (for example, that at least ten percent of designs' labels exceed a threshold). It does so by combining designs' predicted property values with held‑out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction‑powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
Authors: Ling-Nan Zou
Abstract: We describe Structured Random Binding (SRB), a minimal model of protein‑protein interactions rooted in the statistical physics of disordered systems. In this model, nonspecific binding is a generic consequence of the interaction between random proteins, exhibiting a phase transition from a high temperature state where nonspecific complexes are transient and lack well‑defined interaction interfaces, to a low temperature state where the complex structure is frozen and a definite interaction interface is present. Numerically, weakly‑bound nonspecific complexes can evolve into tightly‑bound, highly specific complexes, but only if the structural correlation length along the peptide backbone is short; moreover, evolved tightly‑bound homodimers favor the same interface structure that is predominant in real protein homodimers.
Authors: Yihan He, Ming-Chun Hong, Qiming Ding, Chih-Sheng Lin, Chih-Ming Lai, Chao Fang, Xiao Gong, Tuo-Hung Hou, Gengchiau Liang
Abstract: Molecular docking is a critical computational strategy in drug design and discovery, but the complex diversity of biomolecular structures and flexible binding conformations create an enormous search space that challenges conventional computing methods. Although quantum computing holds promise for these challenges, it remains constrained by scalability, hardware limitations, and precision issues. Here, we report a prototype of a probabilistic computer (p‑computer) that efficiently and accurately solves complex molecular docking for the first time, overcoming previously encountered challenges. At the core of the system is a p‑computing chip based upon our artificial tunable probabilistic bits (p‑bits), which are compatible with computing‑in‑memory schemes, based upon 180 nm CMOS technology and BEOL HfO2 RRAM. We successfully demonstrated the superior performance of the p‑computer in practical ligand‑protein docking scenarios. A 42‑node molecular docking problem of lipoprotein with LolA‑LolCDE complex, a key point in developing antibiotics against Gram‑negative bacteria, was successfully solved. Our results align well with the Protein‑Ligand Interaction Profiler tool. This work marks the first application of p‑computing in molecular docking‑based computational biology, which has great potential to overcome the limitations in success rate and efficiency of current technologies in addressing complex bioinformatics problems.
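For readers unfamiliar with p‑bits, the following is a minimal software sketch of the update rule commonly used in the p‑computing literature, applied to a toy two‑node Ising problem. It is an assumption‑laden illustration, not the authors' hardware or their docking formulation: the function names, the coupling matrix, and the way a docking problem would be mapped onto couplings are all hypothetical.

```python
import math
import random


def pbit_sweep(m, J, h, beta, rng):
    """One sequential sweep of p-bit updates over all nodes (in place)."""
    n = len(m)
    for i in range(n):
        # Local input: bias plus coupling to the other bits' current states.
        local = h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
        # p-bit rule: output +1 with probability (1 + tanh(beta * local)) / 2.
        r = rng.uniform(-1.0, 1.0)
        m[i] = 1 if math.tanh(beta * local) > r else -1
    return m


def energy(m, J, h):
    """Ising energy E = -sum_{i<j} J_ij m_i m_j - sum_i h_i m_i."""
    n = len(m)
    pair = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(h[i] * m[i] for i in range(n))


# Toy two-node problem with a strong ferromagnetic coupling (made-up values).
rng = random.Random(0)
J = [[0.0, 20.0], [20.0, 0.0]]
h = [0.0, 0.0]
state = [1, -1]
for _ in range(10):
    pbit_sweep(state, J, h, beta=1.0, rng=rng)
```

With a coupling this strong the two bits lock into an aligned, minimum‑energy configuration; weaker couplings or a smaller `beta` leave the bits fluctuating, which is what lets such a sampler explore a large search space.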
Authors: Mick Gardner, Audrey Billhymer, Rebecca Kamerer, Joanna Schmit, Trevor Park, Julie Nguyen-Edquilang, Rita Miller, Kim A Selting, Michael Oelze
Abstract: Quantitative ultrasound (QUS) characterizes the composition of cells to distinguish diseased from healthy tissue. QUS can reflect the complexity of the tumor and detect early lymph node (LN) metastasis ex vivo. The objective in this study was to gather preliminary QUS and cytokine data from dogs undergoing radiation therapy (RT) and correlate QUS data with both LN metastasis and tumor response. Spontaneous solid tumors were evaluated with QUS before and up to one year after receiving RT. Additionally, regional LNs were evaluated with QUS in vivo, then excised and examined with histopathology to detect metastasis. Paired t‑tests were used to compare QUS data of metastatic and non‑metastatic LNs within patients. Furthermore, paired t‑tests compared pre‑ versus post‑RT QUS data. Serum was collected at each time point for cytokine profiles. Most statistical tests were underpowered to produce significant p values, but interesting trends were observed. The lowest p values for LN tests were found with the envelope statistics K (p = 0.142) and μ (p = 0.181), which correspond to cell structure and number of scatterers. For tumor response, the lowest p values were found with K (p = 0.115) and μ (p = 0.127) when comparing baseline QUS data with QUS data 1 week after RT. Monocyte chemoattractant protein 1 (MCP‑1) was significantly higher in dogs with cancer when compared to healthy controls (p = 1.12e‑4). A weak correlation was found between effective scatterer diameter (ESD) and transforming growth factor beta 1 (TGFβ‑1). While statistical tests on the preliminary QUS data alone were underpowered to detect significant differences among groups, our methods create a basis for future studies.
Authors: Parisa Mollaei, Amir Barati Farimani
Abstract: In this study, we propose a Kernel‑PCA model designed to capture structure‑function relationships in a protein. This model also enables ranking of reaction coordinates according to their impact on protein properties. By leveraging machine learning techniques, including kernel methods and principal component analysis (PCA), our model uncovers meaningful patterns in high‑dimensional protein data obtained from molecular dynamics (MD) simulations. The effectiveness of our model in accurately identifying reaction coordinates has been demonstrated through its application to a G protein‑coupled receptor. Furthermore, this model utilizes a network‑based approach to uncover correlations in the dynamic behavior of residues associated with a specific protein property. These findings underscore the potential of our model as a powerful tool for protein structure‑function analysis and visualization.
Authors: Eoin Quinn, Ghassene Jebali, Maxime Seince, Oliver Bent
Abstract: We explore a framework for protein sequence representation learning that decomposes the task between manifold learning and distributional modelling. Specifically, we present a Latent Space Diffusion architecture which combines a protein sequence autoencoder with a denoising diffusion model operating on its latent space. We obtain a one‑parameter family of learned representations from the diffusion model, along with the autoencoder's latent representation. We propose and evaluate two autoencoder architectures: a homogeneous model forcing amino acids of the same type to be identically distributed in the latent space, and an inhomogeneous model employing a noise‑based variant of masking. As a baseline we take a latent space learned by masked language modelling, and evaluate discriminative capability on a range of protein property prediction tasks. Our finding is twofold: the diffusion models trained on both our proposed variants display higher discriminative power than the one trained on the masked language model baseline, but none of the diffusion representations achieve the performance of the masked language model embeddings themselves.
Authors: Delower Hossain, Jake Y Chen
Abstract: Over the last few decades, Artificial Intelligence (AI) scientists have been conducting investigations to attain human‑level performance by a machine in accomplishing a cognitive task. Within machine learning, the ultimate aspiration is to attain Artificial General Intelligence (AGI) through a machine. This pursuit has led to the exploration of two distinct AI paradigms. Symbolic AI, also known as classical or GOFAI (Good Old‑Fashioned AI) and Connectionist (Sub‑symbolic) AI, represented by Neural Systems, are two mutually exclusive paradigms. Symbolic AI excels in reasoning, explainability, and knowledge representation but faces challenges in processing complex real‑world data with noise. Conversely, deep learning (Black‑Box systems) research breakthroughs in neural networks are notable, yet they lack reasoning and interpretability. Neuro‑symbolic AI (NeSy), an emerging area of AI research, attempts to bridge this gap by integrating logical reasoning into neural networks, enabling them to learn and reason with symbolic representations. While a long path, this strategy has made significant progress towards achieving common sense reasoning by systems. This article conducts an extensive review of over 977 studies from prominent scientific databases (DBLP, ACL, IEEExplore, Scopus, PubMed, ICML, ICLR), thoroughly examining the multifaceted capabilities of Neuro‑Symbolic AI, with a particular focus on its healthcare applications, particularly in drug discovery, and Protein engineering research. The survey addresses vital themes, including reasoning, explainability, integration strategies, 41 healthcare‑related use cases, benchmarking, datasets, current approach limitations from both healthcare and broader perspectives, and proposed novel approaches for future experiments.
Authors: Shun-Cai Zhao, Yi-Meng Huang, Yi-Fan Yang, Zi-Ran Zhao
Abstract: Machine learning simulations of open quantum dynamics often rely on recursive predictors that accumulate error. We develop a non‑recursive convolutional neural network (CNN) that maps system parameters and a redundant time encoding directly to excitation‑energy‑transfer populations in the Fenna‑Matthews‑Olson complex. The encoding (modified logistic plus tanh functions) normalizes time and resolves fast, transitional, and quasi‑steady regimes, while physics‑informed labels enforce population conservation and inter‑site consistency. Trained only on 0 to 7 ps reference trajectories generated with a Lindblad model in QuTiP, the network accurately predicts 0 to 100 ps dynamics across a range of reorganization energies, bath rates, and temperatures. Beyond 20 ps, the absolute relative error remains below 0.05, demonstrating stable long‑time extrapolation. By avoiding step‑by‑step recursion, the method suppresses error accumulation and generalizes across timescales. These results show that redundant time encoding enables data‑efficient inference of long‑time quantum dissipative dynamics in realistic pigment‑protein complexes, and may aid the data‑driven design of light‑harvesting materials.
Authors: Romain Lacombe
Abstract: Evolution‑based protein structure prediction models have achieved breakthrough success in recent years. However, they struggle to generalize beyond evolutionary priors and on sequences lacking rich homologous data. Here we present a novel, out‑of‑domain benchmark based on sactipeptides, a rare class of ribosomally synthesized and post‑translationally modified peptides (RiPPs) characterized by sulfur‑to‑α‑carbon thioether bridges creating cross‑links between cysteine residues and backbone. We evaluate recent models on predicting conformations compatible with these cross‑link bridges for the 10 known sactipeptides with elucidated post‑translational modifications. Crucially, the structures of 5 of them have not yet been experimentally resolved. This makes the task a challenging problem for evolution‑based models, which we find exhibit limited performance (0.0% to 19.2% GDT‑TS on sulfur‑to‑α‑carbon distance). Our results point to the need for physics‑informed models to sustain progress in biomolecular structure prediction.
Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
Abstract: Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel‑Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel‑Softmax interpolant with a time‑dependent temperature. Using this interpolant, we introduce Gumbel‑Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel‑Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high‑quality, diverse generation and scales efficiently to higher‑dimensional simplices. To enable training‑free guidance, we propose Straight‑Through Guided Flows (STGFlow), a classifier‑based guidance method that leverages straight‑through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference‑time guidance using classifiers pre‑trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state‑of‑the‑art performance in conditional DNA promoter design, sequence‑only protein generation, and target‑binding peptide design for rare disease treatment.
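To make the role of temperature concrete, here is a minimal sketch of standard Gumbel‑Softmax sampling on the simplex, paired with a temperature schedule. The sampler is the generic textbook relaxation rather than the paper's interpolant, and the schedule (its exponential form, `tau_max`, and `decay`) is a hypothetical choice made for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed categorical sample on the probability simplex."""
    noisy = []
    for logit in logits:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logs finite
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_schedule(t, tau_max=10.0, decay=5.0):
    """Hypothetical schedule: temperature decays from tau_max as t goes 0 -> 1."""
    return tau_max * math.exp(-decay * t)


# Same noise, two temperatures: high temperature is smooth, low is near one-hot.
logits = [0.0, 1.0, 2.0, 0.5]
smooth = gumbel_softmax_sample(logits, temperature_schedule(0.0), random.Random(0))
sharp = gumbel_softmax_sample(logits, temperature_schedule(1.0), random.Random(0))
```

At high temperature the sample stays near the interior of the simplex; as the temperature falls it concentrates at a single vertex, which is the qualitative behavior the abstract describes for transporting smooth categorical distributions to one‑hot ones.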
Authors: Minsu Kim, Jiayao Gu, Ye Yuan, Taeyoung Yun, Zixuan Liu, Yoshua Bengio, Can Chen
Abstract: Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black‑box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking), exploiting model inaccuracies in unseen regions, or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model‑based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline‑specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out‑of‑distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single‑objective and multi‑objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out‑of‑distribution regions, and generative modeling, which explores high‑dimensional design spaces to identify high‑performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field, including safe control of superintelligent systems.
Authors: Viet Thanh Duy Nguyen, Truong-Son Hy
Abstract: Proteins are complex biomolecules that play a central role in various biological processes, making them critical targets for breakthroughs in molecular biology, medical research, and drug discovery. Deciphering their intricate, hierarchical structures, and diverse functions is essential for advancing our understanding of life at the molecular level. Protein Representation Learning (PRL) has emerged as a transformative approach, enabling the extraction of meaningful computational representations from protein data to address these challenges. In this paper, we provide a comprehensive review of PRL research, categorizing methodologies into five key areas: feature‑based, sequence‑based, structure‑based, multimodal, and complex‑based approaches. To support researchers in this rapidly evolving field, we introduce widely used databases for protein sequences, structures, and functions, which serve as essential resources for model development and evaluation. We also explore the diverse applications of these approaches in multiple domains, demonstrating their broad impact. Finally, we discuss pressing technical challenges and outline future directions to advance PRL, offering insights to inspire continued innovation in this foundational field.
Authors: Aahan Singh, Engin Tekin, Maryam Nadeem, Nancy A. ElNaker, Mohammad Amaan Sayeed, Natalia Vassilieva, Boulbaba Ben Amor
Abstract: Revolutionizing drug discovery demands more than just understanding molecular interactions ‑ it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target‑specific insights, restricting their ability to drive de‑novo ligand generation. Chem42, a cutting‑edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic‑level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross‑modal representation of molecular structures, interactions, and binding patterns. This innovative framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target‑aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target‑aware ligand design. The models are publicly available at huggingface.co/inceptionai.
Authors: Krithik Ramesh, Sameed M. Siddiqui, Albert Gu, Michael D. Mitzenmacher, Pardis C. Sabeti
Abstract: Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task‑specific models. The computational resources and large datasets required, however, limit their applicability in biological contexts. We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence‑to‑function relationships. Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships. We demonstrate that Lyra is performant across over 100 wide‑ranging biological tasks, achieving state‑of‑the‑art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g. disordered protein region functions), peptide engineering applications (e.g. antibody binding, cell‑penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design. It achieves this with orders‑of‑magnitude improvements in inference speed and reduction in parameters (up to 120,000‑fold in our tests) compared to recent biology foundation models. Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.
Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Yitao Liang, Weinan E, Linfeng Zhang, Guolin Ke
Abstract: 3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni‑3DAR, a unified autoregressive framework for cross‑scale 3D generation and understanding. At its core is a coarse‑to‑fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two‑level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next‑token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni‑3DAR surpasses previous state‑of‑the‑art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster.
Authors: Ashkan Dehghan, Paweł Prałat, François Théberge
Abstract: Many real‑world and artificial systems and processes can be represented as graphs. Some examples of such systems include social networks, financial transactions, supply chains, and molecular structures. In many of these cases, one needs to consider a collection of graphs, rather than a single network. This could be a collection of distinct but related graphs, such as different protein structures or graphs resulting from dynamic processes on the same network. Examples of the latter include the evolution of social networks, community‑induced graphs, or ego‑nets around various nodes. A significant challenge commonly encountered is the absence of ground‑truth labels for graphs or nodes, necessitating the use of unsupervised techniques to analyze such systems. Moreover, even when ground‑truth labels are available, many existing graph machine learning methods depend on complex deep learning models, complicating model explainability and interpretability. To address some of these challenges, we have introduced NEExT (Network Embedding Exploration Tool) for embedding collections of graphs via user‑defined node features. The advantages of the framework are twofold: (i) the ability to easily define your own interpretable node‑based features in view of the task at hand, and (ii) fast embedding of graphs provided by the Vectorizers library. In this paper, we demonstrate the usefulness of NEExT on collections of synthetic and real‑world graphs. For supervised tasks, we demonstrate that performance in graph classification tasks could be achieved similarly to other state‑of‑the‑art techniques while maintaining model interpretability. Furthermore, our framework can also be used to generate high‑quality embeddings in an unsupervised way, where target variables are not available.
Authors: Katarzyna Walczewska-Szewc, Jakub Rydzewski
Abstract: Neurodegenerative diseases, such as Alzheimer's and Parkinson's, pose a growing global health burden. Prolyl oligopeptidase (PREP) has emerged as a potential therapeutic target in these diseases. Recent studies have shown that direct interaction between PREP and pathological proteins, such as α‑synuclein and Tau, influences protein aggregation and neuronal function. While most known PREP inhibitors primarily target its enzymatic functions, a new class of ligands, known as HUPs, specifically modulates protein‑protein interactions (PPIs), which are crucial in neurodegenerative diseases. These structurally distinct ligands exhibit diverse binding behaviors, highlighting the importance of understanding their binding pathways. In this study, we analyzed the binding pathways and stability of diverse ligands using molecular dynamics simulations and enhanced sampling techniques. Traditional inhibitors, such as KYP‑2047, target the active site between the catalytic domains of PREP and the β‑propeller domain, while HUP ligands bind to alternative regions, such as the hinge site, potentially disrupting non‑enzymatic PPIs. We demonstrated that structural variations among ligands lead to distinct binding and unbinding pathways. Free‑energy profiles from umbrella sampling revealed key kinetic bottlenecks and differences in pathways. For example, HUP‑55 exhibits pathway hopping, characterized by diffuse exploration of binding regions before selecting an exit, while KYP‑2047 prefers the central tunnel of the β‑propeller domain even under perturbations. These results suggest that the dynamic interaction between ligands and PREP plays a critical role in their mechanism. The ability of HUPs to interact with multiple binding sites and adapt to PREP's conformational changes may be essential for their PPI‑targeting effects.
Authors: Suemin Lee, Ruiyu Wang, Lukas Herron, Pratyush Tiwary
Abstract: Predicting and characterizing phase transitions is crucial for understanding generic physical phenomena such as crystallization, protein folding and others. However, directly observing phase transitions is not always easy, and often one has limited observations far from the phase boundary and measured under some specific thermodynamic conditions. In this study, we propose a statistical physics and Generative AI driven framework that can take such limited information to generate samples of different phases under arbitrary thermodynamic conditions, which we name Exponentially Tilted Thermodynamic Maps (expTM). The central idea is to map collected data into a tractable simple prior expressed as an exponentially tilted Gaussian. We demonstrate how the variance and mean of the prior can be correlated with pairs of thermodynamic control variables, including temperature, pressure, and chemical potential. This gives us the ability to generate thermodynamically correct samples under any values of the control variables. To demonstrate the practical applicability of this approach, we use expTM to sample the lattice gas models with the Grand Canonical ensemble, capturing phase transitions under varying chemical potentials and temperatures. We further demonstrate how expTM can model the isothermal‑isobaric ensemble, with which we predict different phases of CO2 under varying pressure conditions. Both examples are trained on very limited data far from the phase boundary. These results establish expTM as a robust tool for understanding phase transitions across diverse thermodynamic conditions requiring only a small number of observations.
Authors: Jian Jiang, Long Chen, Lu ke, Bozheng Dou, Yueying Zhu, Yazhou Shi, Huahai Qiu, Bengong Zhang, Tianshou Zhou, Guo-Wei Wei
Abstract: Chaos is omnipresent in nature, and its understanding provides enormous social and economic benefits. However, the unpredictability of chaotic systems is a textbook concept due to their sensitivity to initial conditions, aperiodic behavior, fractal dimensions, nonlinearity, and strange attractors. In this work, we introduce, for the first time, chaotic learning, a novel multiscale topological paradigm that enables accurate predictions from chaotic systems. We show that seemingly random and unpredictable chaotic dynamics counterintuitively offer unprecedented quantitative predictions. Specifically, we devise multiscale topological Laplacians to embed real‑world data into a family of interactive chaotic dynamical systems, modulate their dynamical behaviors, and enable the accurate prediction of the input data. As a proof of concept, we consider 28 datasets from four categories of realistic problems: 10 brain waves, four benchmark protein datasets, 13 single‑cell RNA sequencing datasets, and an image dataset, as well as two distinct chaotic dynamical systems, namely the Lorenz and Rossler attractors. We demonstrate chaotic learning predictions of the physical properties from chaos. Our new chaotic learning paradigm profoundly changes the textbook perception of chaos and bridges topology, chaos, and learning for the first time.
Authors: Samya Sen, Changxin Dong, Carolyn K. Jons, Wencke Reineking, Alakesh Alakesh, Noah Eckman, Ye Eun Song, Alexander N. Prossnitz, Eric A. Appel
Abstract: Hydrogels are crosslinked polymer networks with high water content, widely employed in biomedical applications such as drug delivery, tissue engineering, and regenerative medicine. Injectable, depot‑forming hydrogels enable sustained release of therapeutic agents by modulating macromolecular diffusion through dynamic polymer networks. However, achieving reliable control over release kinetics remains a challenge, as the injection process induces shear‑mediated disruption of transient crosslinks, leading to an initial burst release that can cause local toxicity and compromise therapeutic efficacy. Here, we present a hydrogel formulation strategy designed to restore network structure post‑injection through rapid reformation of dynamic crosslinks, enabling time‑dependent regulation of diffusion properties. By tuning viscoelastic parameters, including stress relaxation time and network recovery rate, we reduced the extent of burst release without compromising sustained delivery. Using model protein cargo, we demonstrate in both in vitro and in vivo settings that hydrogels with faster crosslink reformation kinetics exhibit significantly lower early‑phase release while maintaining long‑term delivery comparable to unmodified formulations. These results establish a mechanistic framework for decoupling short‑ and long‑term release behavior, offering a broadly applicable strategy for precise drug delivery in soft tissue environments.
Authors: Pawel Rubach
Abstract: This paper presents the Kafka Slurm Agent (KSA), an open source (Apache 2.0 license) distributed computing and stream processing engine designed to help researchers distribute Python‑based computational tasks across multiple Slurm‑managed HPC clusters and workstations. Written entirely in Python, this extensible framework utilizes an Apache Kafka broker for asynchronous communication between its components. It is intended for non‑expert users and does not require administrative privileges or additional libraries to run on Slurm. The framework's development was driven by the introduction of the AlphaFold protein structure prediction model; specifically, it was first created to facilitate the detection of knots in protein chains within structures predicted by AlphaFold. KSA has since been applied to several structural bioinformatics research projects, among others, leading to the discovery of new knotted proteins with previously unknown knot types. These knotted structures are now part of the AlphaKnot 2.0 web server and database, where KSA is applied to manage the knot detection process for user‑uploaded structures.
Authors: Taslim Murad, Sarwan Ali, Murray Patterson
Abstract: The analysis of sequences (e.g., protein, DNA, and SMILES string) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery domains. Conventional analytical methods focus on transforming sequences into numerical representations for applying machine learning/deep learning‑based sequence characterization. However, their efficacy is constrained by the intrinsic nature of deep learning (DL) models, which tend to exhibit suboptimal performance when applied to tabular data. An alternative group of methodologies endeavors to convert biological sequences into image forms by applying the concept of Chaos Game Representation (CGR). However, a noteworthy drawback of these methods lies in their tendency to map individual elements of the sequence onto a relatively small subset of designated pixels within the generated image. The resulting sparse image representation may not adequately encapsulate the comprehensive sequence information, potentially resulting in suboptimal predictions. In this study, we introduce a novel approach to transform sequences into images using the Bézier curve concept for element mapping. Mapping the elements onto a curve enhances the sequence information representation in the respective images, hence yielding better DL‑based classification performance. We employed different sequence datasets to validate our system by using different classification tasks, and the results illustrate that our Bézier curve method is able to achieve good performance for all the tasks.
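The Bézier curve underlying this approach can be evaluated with de Casteljau's algorithm, sketched below. Only the curve evaluation is shown; how sequence elements are assigned control points and rasterized into an image is specific to the paper and is not reproduced here, so the helper names and the example control points are assumptions.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = [tuple(float(c) for c in p) for p in control_points]
    while len(points) > 1:
        # Linearly interpolate between each pair of consecutive points.
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_curve(control_points, n_samples):
    """Return n_samples evenly spaced (in t) points along the curve."""
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    return [
        bezier_point(control_points, i / (n_samples - 1))
        for i in range(n_samples)
    ]


# Hypothetical quadratic curve: start, one control point, end.
curve = sample_curve([(0, 0), (1, 2), (2, 0)], 5)
```

Sampling several points per element, rather than the single point per element that a Chaos Game Representation yields, is one plausible reading of why a curve‑based mapping fills more of the image.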
Authors: Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Tianyang Wang, Zheyu Yao, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Li Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence KQ Yan, Hongming Tseng, Yan Zhong, Yunze Wang, Ziyuan Qin, Bowen Jing, Junjie Yang, Jun Zhou, Chia Xin Liang, Junhao Song
Abstract: After AlphaFold won the Nobel Prize, protein prediction with deep learning once again became a hot topic. We comprehensively explore advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion based frameworks and novel pairwise attention modules. The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture, thereby illustrating the current state of the art in computational protein modelling. Subsequent chapters focus on practical applications, presenting case studies that range from individual protein predictions to complex biomolecular interactions. Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored. The later sections review the industry landscape of protein design, highlighting the transformative role of artificial intelligence in biotechnology and discussing emerging market trends and future challenges. Supplementary appendices provide essential resources such as databases and open source tools, making this volume a valuable reference for researchers and students.
Authors: Ewan R. S. Wallace, Nathan C. Frey, Joshua A. Rackers
Abstract: Ligand strain energy, the energy difference between the bound and unbound conformations of a ligand, is an important component of structure‑based small molecule drug design. A large majority of observed ligands in protein‑small molecule co‑crystal structures bind in low‑strain conformations, making strain energy a useful filter for structure‑based drug design. In this work we present a tool for calculating ligand strain with high accuracy. StrainRelief uses a MACE Neural Network Potential (NNP), trained on a large database of Density Functional Theory (DFT) calculations, to estimate ligand strain of neutral molecules with quantum accuracy. We show that this tool estimates strain energy differences relative to DFT to within 1.4 kcal/mol, more accurately than alternative NNPs. These results highlight the utility of NNPs in drug discovery, and provide a useful tool for drug discovery teams.
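The definition of strain in the abstract can be written as a one-line calculation. The sketch below is not the StrainRelief implementation. It assumes conformer energies in kcal/mol have already been computed by some potential, and that the unbound reference is the lowest-energy conformer in a sampled ensemble; the function names and the filter threshold are hypothetical.

```python
def ligand_strain(bound_energy, unbound_energies):
    """Strain = E(bound conformer) - min E(unbound conformers), in the input units.

    Assumption: the unbound state is represented by the lowest-energy conformer
    found in a sampled ensemble.
    """
    if not unbound_energies:
        raise ValueError("need at least one unbound conformer energy")
    return bound_energy - min(unbound_energies)


def passes_strain_filter(bound_energy, unbound_energies, threshold=5.0):
    """Hypothetical filter: keep poses whose strain is at or below the threshold."""
    return ligand_strain(bound_energy, unbound_energies) <= threshold


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    print(ligand_strain(-100.0, [-103.5, -102.0]))
```

If the conformer search misses the true minimum, the computed strain is an underestimate, so the quality of the ensemble matters as much as the accuracy of the potential.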
Authors: Beatriz Costa-Gomes, Joel Greer, Nikolai Juraschko, James Parkhurst, Jola Mirecka, Marjan Famili, Camila Rangel-Smith, Oliver Strickson, Alan Lowe, Mark Basham, Tom Burnley
Abstract: Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated datasets. Being able to easily access and utilise these is crucial for allowing researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM datasets and/or creating new synthetic cryoEM datasets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine‑learning based algorithms for aiding numerous steps in the processing and reconstruction of experimental datasets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large datasets which can be cumbersome to curate and unwieldy to make use of. In this paper we present a suite of Python software packages which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or Alphafold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive datasets in a machine‑learning compatible structure. The Class Aggregator for Key Electron‑microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine learning models on electron microscopy data, including electron‑cryo‑microscopy‑specific data augmentation and labelling. These packages may be utilised independently or as building blocks in workflows. All are available in open source repositories and designed to be easily extensible to facilitate more advanced workflows if required.
Authors: Renaud Baillou, Marta Pedrosa Garcia-Moreno, Quentin Guigue, Solene Meinier, Thierry Darnige, Gaspard Junot, Fernando Peruani, Eric Clément
Abstract: Navigation of microorganisms is controlled by internal processes ultimately sensitive to mechanical or chemical signaling encountered along the path. In many natural environments, such as porous soils or physiological ducts, motile species alternate between bulk and surface motion, displaying distinct kinematics in each case. This inherent complexity is key to many practical biological and ecological issues involving spreading and contamination, essential for understanding the spatiotemporal structuring of populations in their environment. However, the interplay between geometrical confinement and kinematics driven by internal biological responses remains poorly understood from a physical and biological standpoint. Here, we address this question through experimental and theoretical analysis in the heuristic situation of two parallel confining surfaces. We track wild‑type E. coli, a model peritrichous flagellated bacterium, in 3D over extended periods of time. We obtain the first experimental measurements of the emerging diffusivity and bulk/surface residence times as a function of confinement height and the specific chiral kinematics at surfaces. All experimental results are quantitatively reproduced, without parametric adjustment, by a non‑Markovian stochastic (BV) model that incorporates the internal biochemical memory carried by a phosphorylated protein switching the motor rotation. By matching the results with a Markovian (memoryless) companion model, we derive an analytical expression for the diffusivity and demonstrate how confining walls influence microbial long‑range dispersion. This approach also provides a general conceptual basis for understanding how microorganisms navigate complex environments, in which their movement alternates between bulk and surfaces.
Authors: Liming Wu, Wenbing Huang, Rui Jiao, Jianxing Huang, Liwei Liu, Yipeng Zhou, Hao Sun, Yang Liu, Fuchun Sun, Yuxiang Ren, Jirong Wen
Abstract: Predicting crystal structures from chemical compositions is a fundamental challenge in materials discovery, complicated by complex 3D geometries that distinguish it from fields like protein folding. Here, we present Diffusion‑based Crystal Omni (DAO), a pretrain‑finetune framework for crystal structure prediction integrating two Siamese foundation models: a structure generator and an energy predictor. The generator is pretrained via a two‑stage pipeline on a vast dataset of stable and unstable structures, leveraging the predictor to relax unstable configurations and guide the generative sampling. Across two well‑known benchmarks, pretraining significantly enhances performance across multiple backbone architectures. Ablation studies confirm that the synergy between the generator and predictor mutually benefits both components. We further validate DAO on three real‑world superconductors ($\text{Cr}_6\text{Os}_2$, $\text{Zr}_{16}\text{Rh}_8\text{O}_4$, and $\text{Zr}_{16}\text{Pd}_8\text{O}_4$) typically inaccessible to conventional computation. For $\text{Cr}_6\text{Os}_2$, DAO achieves a 100% match rate with experimental references and an atomic‑position error of 0.0012 under 20‑shot generation, performing over 2000× faster per iteration than DFT‑based structure predictors. These compelling results collectively highlight the potential of our approach for advancing materials science research.
Authors: Yigang Chen, Xiang Ji, Ziyue Zhang, Yuming Zhou, Yang-Chi-Dung Lin, Hsi-Yuan Huang, Tao Zhang, Yi Lai, Ke Chen, Chang Su, Xingqiao Lin, Zihao Zhu, Yanggyi Zhang, Kangping Wei, Jiehui Fu, Yixian Huang, Shidong Cui, Shih-Chung Yen, Ariel Warshel, Hsien-Da Huang
Abstract: Deep learning‑based drug‑target interaction (DTI) prediction methods have demonstrated strong performance; however, real‑world applicability remains constrained by limited data diversity and modeling complexity. To address these challenges, we propose SCOPE‑DTI, a unified framework combining a large‑scale, balanced semi‑inductive human DTI dataset with advanced deep learning modeling. Constructed from 13 public repositories, the SCOPE dataset expands data volume by up to 100‑fold compared to common benchmarks such as the Human dataset. The SCOPE model integrates three‑dimensional protein and compound representations, graph neural networks, and bilinear attention mechanisms to effectively capture cross domain interaction patterns, significantly outperforming state‑of‑the‑art methods across various DTI prediction tasks. Additionally, SCOPE‑DTI provides a user‑friendly interface and database. We further validate its effectiveness by experimentally identifying anticancer targets of Ginsenoside Rh1. By offering comprehensive data, advanced modeling, and accessible tools, SCOPE‑DTI accelerates drug discovery research.
Authors: Burak Suyunu, Özdeniz Dolu, Ibukunoluwa Abigail Olaosebikan, Hacer Karatas Bristow, Arzucan Özgür
Abstract: Proteins are the essential drivers of biological processes. At the molecular level, they are chains of amino acids that can be viewed through a linguistic lens where the twenty standard residues serve as an alphabet combining to form a complex language, referred to as the language of life. To understand this language, we must first identify its fundamental units. Analogous to words, these units are hypothesized to represent an intermediate layer between single residues and larger domains. Crucially, just as protein diversity arises from evolution, these units should inherently reflect evolutionary relationships. We introduce PUMA (Protein Units via Mutation‑Aware Merging) to discover these evolutionarily meaningful units. PUMA employs an iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations. This process creates a hierarchical genealogy where parent units and their mutational variants coexist, simultaneously producing a unit vocabulary and the genealogical structure connecting them. We validate that PUMA families are biologically meaningful; mutations within a PUMA family correlate with clinically benign variants and with high‑scoring mutations in high‑throughput assays. Furthermore, these units align with the contextual preferences of protein language models and map to known functional annotations. PUMA's genealogical framework provides evolutionarily grounded units, offering a structured approach for understanding the language of life.
Authors: Adrián Nadal-Rosa, Gonzalo Manzano
Abstract: Molecular motors are in charge of almost every process in the life cycle of cells, such as protein synthesis, DNA replication, and cell locomotion, and are hence of crucial importance for understanding cellular dynamics. However, given their size scales on the order of nanometers, direct measurements are rather challenging, and the information that can be extracted from them is limited. In this work, we propose strategies based on martingale theory in stochastic thermodynamics to infer thermodynamic properties of molecular motors using a limited amount of available information. In particular, we use two recent theoretical results valid for systems arbitrarily far from equilibrium: the integral fluctuation theorem (IFT) at stopping times, and a family of bounds to the maximal excursions of entropy production. The potential of these strategies is illustrated with a simple model for the F1‑ATPase rotary molecular motor, where our approach is able to estimate several quantities determining the thermodynamics of the motor, such as the rotational work of the motor performed against an externally applied force, or the effective environmental temperature.
Authors: Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel
Abstract: Diffusion models have achieved state‑of‑the‑art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task‑specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D2‑DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous‑time Markov chains. Our approach derives a novel loss function that directly fine‑tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D2‑DPO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D2‑DPO enables controlled fine‑tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning‑based approaches. Future research will explore extending D2‑DPO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
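For orientation, the sketch below computes the original DPO objective for a single preference pair from sequence log-probabilities. It is the standard formulation for models with tractable likelihoods, not the D2‑DPO loss for continuous-time Markov chains that the abstract introduces. The function name, argument names, and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred w, dispreferred l) pair.

    loss = -log sigmoid(beta * [(log pi(w) - log ref(w)) - (log pi(l) - log ref(l))])
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy raises the preferred sample and lowers the dispreferred one: loss drops.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The reference terms are what keep the fine-tuned model close to the reference distribution: only changes relative to the reference log-probabilities affect the loss.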
Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao
Abstract: Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure‑dependent. The absence of structure‑aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProtTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next‑Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state‑of‑the‑art domain expert model with a twofold increase in accuracy. Our framework enables high‑quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder‑only LLMs to effectively address a diverse spectrum of protein‑related tasks.
Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang Cheng, Zhixiang Ren
Abstract: In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug‑likeness, and interpretability, while fragment‑based approaches frequently overlook comprehensive factors that influence protein‑molecule interactions. To address these challenges, we propose a novel fragment‑based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept‑based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug‑likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept‑based model, our framework enhances interpretability, offering valuable insights into the molecular design process.
Authors: Dotan Goberman, Anjan Roy, Rami Pugatch
Abstract: To double the cellular population of ribosomes, a fraction of the active ribosomes is allocated to synthesize ribosomal proteins. Subsequently, these ribosomal proteins enter the ribosome self‑assembly process, synthesizing new ribosomes and forming the well‑known ribosome autocatalytic subcycle. Neglecting ribosome lifetime and the duration of the self‑assembly process, the doubling rate of all cellular biomass can be equated with the fraction of ribosomes allocated to synthesize an essential ribosomal protein times its synthesis rate. However, ribosomes have a finite lifetime, and the assembly process has a finite duration. Furthermore, the number of ribosomes is known to decrease with slow growth rates. The finite lifetime of ribosomes and the decline in their numbers present a challenge in sustaining slow growth solely through controlling the allocation of ribosomes to synthesize more ribosomal proteins. When the number of ribosomes allocated per mRNA of an essential ribosomal protein is approximately one, the resulting fluctuations in the production rate of new ribosomes increase, causing a potential risk that the actual production rate will fall below the ribosome death rate. Thus, in this regime, a significant risk of extinction of the ribosome population emerges. To mitigate this risk, we suggest that the ribosome translation speed is used as an alternative control parameter, which facilitates the maintenance of slow growth rates with a larger ribosome pool. We clarify the observed reduction in translation speed in harsh environments in E. coli and C. glutamicum, explore other mitigation strategies, and suggest additional falsifiable predictions of our model.
Authors: Suraj Deshmukh, Sougata Guha, Basudha Roy, Shivprasad Patil, Arnab Saha, Sudipto Muhuri
Abstract: Designing a miniature microscale engine that can override the role of thermal fluctuations has remained elusive and is an important open challenge. Here we provide the design and theoretical framework for a unique information‑based engine, a work‑to‑work converter, comprising a sub‑micron size bead and motor protein‑microtubule (MT) complex in an optical trap setup. We demonstrate how, by implementing a simple motor protein state‑dependent feedback protocol of the optical trap stiffness, this engine is able to harness and convert the movement of a motor protein into work output. Unlike other conventional microengines, the fidelity and performance of this engine are determined by the stochasticity of motor (un)binding characteristics. We obtain an analytical form of the work distribution function, average work output and average power output, providing quantitative predictions for engine performance which are validated by stochastic simulations. Remarkably, the average work output per cycle is at least an order of magnitude higher than the thermal fluctuations and exceeds the performance of other microscale engines realized so far.
Authors: Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Yang Yang, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Hongxia Hao, Yingce Xia, Marwin Segler, Maik Riechert, Wei Yang, Hao Jiang, Wen-Bin Zhang, Zhijun Zeng, Yi Zhu, Li Dong, Xiuyuan Hu, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin
Abstract: Function in natural systems arises from one‑dimensional sequences forming three‑dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under‑modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co‑generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder‑only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task‑specific tokens. Beyond setting new state‑of‑the‑art results on structure prediction tasks, the model demonstrates state‑of‑the‑art or competitive performance for function‑aware generation across domains: in materials, it achieves "conflicted" multi‑property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 Å) by over 23‑fold and enhances EC‑conditioned enzyme design. Ablation studies and cross‑domain transfer substantiate the benefits of joint discrete‑continuous training, establishing UniGenX as a significant advance from prediction to controllable, function‑aware generation.
Authors: Jiawen Wang, Samin Karim, Yuan Hong, Binghui Wang
Abstract: Diffusion models are powerful generative models in continuous data domains such as image and video data. Discrete graph diffusion models (DGDMs) have recently extended them to graph generation, which is crucial in fields like molecule and protein modeling, and achieved state‑of‑the‑art (SOTA) performance. However, it is risky to deploy DGDMs for safety‑critical applications (e.g., drug discovery) without understanding their security vulnerabilities. In this work, we perform the first study on graph diffusion models against backdoor attacks, a severe attack that manipulates both the training and inference/generation phases in graph diffusion models. We first define the threat model, under which we design the attack such that the backdoored graph diffusion model can generate 1) high‑quality graphs without backdoor activation, 2) effective, stealthy, and persistent backdoored graphs with backdoor activation, and 3) graphs that are permutation invariant and exchangeable, two core properties in graph generative models. Properties 1) and 2) are validated via empirical evaluations without and with backdoor defenses, while 3) is validated via theoretical results.
Authors: Yael Kapon, Dror Merhav, Gal Finkelstein-Zuta, Omer Blumen, Naomi Melamed Book, Yael Levi-Kalisman, Ilya Torchinsky, Shira Yochelis, Daniel Sharon, Lech Tomasz Baczewski, Ehud Gazit, Yossi Paltiel
Abstract: Protein aggregation into insoluble amyloid‑like fibrils is implicated in a wide range of diseases, and understanding its nucleation process is key to mechanistic insights and advancing therapeutics. The electronic charge of the amyloidogenic monomers significantly influences their self‑assembly process. However, the impact of electron spin interactions between monomers on amyloid nucleation has not been considered yet. Here, we studied amyloid formation on magnetic substrates using Scanning Electron Microscopy (SEM), fluorescence microscopy, and Attenuated Total Reflection Fourier Transform Infrared (ATR‑FTIR) Spectroscopy. We observed a preferred magnetization orientation of the ferromagnetic layer for fibril formation, leading to twice as many and significantly longer fibrils (up to 20 times) compared to the opposite magnetization orientation. This preference is related to monomer chirality. Additionally, fibril structure varied with substrate magnetization orientation. Our findings suggest a transient spin polarization in monomers during self‑assembly, driven by the Chiral Induced Spin Selectivity (CISS) effect. These effects are consistent for various molecule length scales, from A‑beta polypeptide to dipeptides and single amino acids, indicating a fundamental spin‑based dependence of biomolecular aggregation that could be applied in novel therapeutic interventions targeted at amyloid‑related diseases.
Authors: Mirja Granfors, Jesús Pineda, Blanca Zufiria Gerbolés, Joana B. Pereira, Carlo Manzo, Giovanni Volpe
Abstract: Graphs provide a powerful framework for modeling complex systems, but their structural variability poses significant challenges for analysis and classification. To address these challenges, we introduce GAUDI (Graph Autoencoder Uncovering Descriptive Information), a novel unsupervised geometric deep learning framework designed to capture both local details and global structure. GAUDI employs an innovative hourglass architecture with hierarchical pooling and upsampling layers linked through skip connections, which preserve essential connectivity information throughout the encoding‑decoding process. Even though identical or highly similar underlying parameters describing a system's state can lead to significant variability in graph realizations, GAUDI consistently maps them into nearby regions of a structured and continuous latent space, effectively disentangling invariant process‑level features from stochastic noise. We demonstrate GAUDI's versatility across multiple applications, including small‑world networks modeling, characterization of protein assemblies from super‑resolution microscopy, analysis of collective motion in the Vicsek model, and identification of age‑related changes in brain connectivity. Comparison with related approaches highlights GAUDI's superior performance in analyzing complex graphs, providing new insights into emergent phenomena across diverse scientific domains.
Authors: Lars Meuser, Alexandros Patsilinakos, Pietro Faccioli
Abstract: In silico de novo design can drastically cut the costs and time of drug development. In particular, a key advantage of bottom‑up physics‑based approaches is their independence from training datasets, unlike generative models. However, they require the simultaneous exploration of chemical and conformational space. In this study, we address this formidable challenge leveraging quantum annealers. Focusing on peptide de novo design, we introduce a multi‑scale framework that integrates classical and quantum computing for atomically resolved predictions. We assess this scheme by designing binders for several protein targets. The D‑Wave quantum annealer rapidly generates a chemically diverse set of binders with primary structures and binding poses that correlate well with experiments. These results demonstrate that, even in their current early stages, quantum technologies can already empower physics‑based drug design.
Authors: Anna T. Thomas, Adam Yee, Andrew Mayne, Maya B. Mathur, Dan Jurafsky, Kristina Gligorić
Abstract: Food systems are responsible for a third of human‑caused greenhouse gas emissions. We investigate what Large Language Models (LLMs) can contribute to reducing the environmental impacts of food production. We define a typology of design and prediction tasks based on the sustainable food literature and collaboration with domain experts, and evaluate six LLMs on four tasks in our typology. For example, for a sustainable protein design task, food science experts estimated that collaboration with an LLM can reduce time spent by 45% on average, compared to 22% for collaboration with another expert human food scientist. However, for a sustainable menu design task, LLMs produce suboptimal solutions when instructed to consider both human satisfaction and climate impacts. We propose a general framework for integrating LLMs with combinatorial optimization to improve reasoning capabilities. Our approach decreases emissions of food choices by 79% in a hypothetical restaurant while maintaining participants' satisfaction with their set of choices. Our results demonstrate LLMs' potential, supported by optimization techniques, to accelerate sustainable food development and adoption.
Authors: Ryan Barron, Maksim E. Eren, Duc P. Truong, Cynthia Matuszek, James Wendelberger, Mary F. Dorn, Boian Alexandrov
Abstract: Missing link prediction is a method for network analysis, with applications in recommender systems, biology, social sciences, cybersecurity, information retrieval, and Artificial Intelligence (AI) reasoning in Knowledge Graphs. Missing link prediction identifies unseen but potentially existing connections in a network by analyzing the observed patterns and relationships. In proliferation detection, this supports efforts to identify and characterize attempts by state and non‑state actors to acquire nuclear weapons or associated technology ‑ a notoriously challenging but vital mission for global security. Dimensionality reduction techniques like Non‑Negative Matrix Factorization (NMF) and Logistic Matrix Factorization (LMF) are effective but require selection of the matrix rank parameter, that is, of the number of hidden features, k, to avoid over/under‑fitting. We introduce novel Weighted (WNMFk), Boolean (BNMFk), and Recommender (RNMFk) matrix factorization methods, along with ensemble variants incorporating logistic factorization, for link prediction. Our methods integrate automatic model determination for rank estimation by evaluating stability and accuracy using a modified bootstrap methodology and uncertainty quantification (UQ), assessing prediction reliability under random perturbations. We incorporate Otsu threshold selection and k‑means clustering for Boolean matrix factorization, comparing them to coordinate descent‑based Boolean thresholding. Our experiments highlight the impact of rank k selection, evaluate model performance under varying test‑set sizes, and demonstrate the benefits of UQ for reliable predictions using abstention. We validate our methods on three synthetic datasets (Boolean and uniformly distributed) and benchmark them against LMF and symmetric LMF (symLMF) on five real‑world protein‑protein interaction networks, showcasing an improved prediction performance.
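To make the role of the rank parameter k concrete, the sketch below factorizes a small adjacency matrix with plain NMF using the classic multiplicative updates, then reads link scores from the reconstruction. It is not the WNMFk, BNMFk, or RNMFk methods from the abstract, and it does no automatic rank selection or uncertainty quantification. The helper names, the toy graph, and the iteration count are assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def nmf(x, rank, iters=200, seed=0, eps=1e-9):
    """Factor non-negative x (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # h <- h * (w^T x) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # w <- w * (x h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


def frobenius_error(x, w, h):
    """Frobenius norm of the reconstruction residual x - w h."""
    approx = matmul(w, h)
    return math.sqrt(sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar)))


if __name__ == "__main__":
    # Toy graph: two triangles {0,1,2} and {3,4,5}; the edge 0-2 is hidden.
    x = [
        [0, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    w, h = nmf(x, rank=2)
    scores = matmul(w, h)  # scores[i][j] is the predicted strength of link i-j
    print(round(frobenius_error(x, w, h), 3), round(scores[0][2], 3))
```

Plain NMF treats every zero as an observed non-link, which is one reason masked or weighted variants are attractive for this task. Too small a rank underfits the block structure and too large a rank memorizes the observed zeros, which is the over/under-fitting trade-off the abstract attributes to the choice of k.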
Authors: Zhenyu Wang, Zikang Wang, Jiyue Jiang, Pengan Chen, Xiangyu Shi, Yu Li
Abstract: Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single‑cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single‑cell transcriptomics. We also discuss several key challenges, including data scarcity, computational complexity, and cross‑omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.
Authors: Yiheng Zhu, Mingyang Li, Junlong Liu, Kun Fu, Jiansheng Wu, Qiuyi Li, Mingze Yin, Jieping Ye, Jian Wu, Zheng Wang
Abstract: Structure‑based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre‑trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre‑trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross‑domain relationships pivotal to SBDD. To fill this gap, we propose a general‑purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein‑ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture‑of‑Domain‑Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture‑of‑Structure‑Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture‑of‑experts approach enables BIT to achieve both deep fusion and domain‑specific encoding, effectively capturing fine‑grained molecular interactions within protein‑ligand complexes. Then, we perform cross‑domain pre‑training on the shared Transformer backbone via several unified self‑supervised denoising tasks. Experimental results on various benchmarks demonstrate that BIT achieves exceptional performance in downstream tasks, including binding affinity prediction, structure‑based virtual screening, and molecular property prediction.
Authors: Christos Papalitsas, Yanfei Guan, Shreyas Waghe, Athanasios Liakos, Ioannis Balatsos, Vassilios Pantazopoulos
Abstract: Molecular docking, which determines the optimal binding configuration of a drug to a target protein, is a critical step in drug discovery and is challenging because of the complexity and size of biomolecular systems. Hybrid classical‑quantum computing techniques offer a novel approach to address these challenges. The Quantum Approximate Optimization Algorithm (QAOA) and its variations are hybrid classical‑quantum techniques and a promising tool for combinatorial optimization problems. This paper presents a Digitized Counterdiabatic QAOA (DC‑QAOA) approach to molecular docking. Simulated quantum runs were conducted on a GPU cluster. We examined 14‑ and 17‑node instances and present the results; to the best of our knowledge, the largest previously published instance is the 12‑node one of Ding et al. Based on computational results, we conclude that binding interactions represent the anticipated exact solution. Additionally, as the size of the examined instance increases, the computational times exhibit a significant escalation.
Authors: Jiyue Jiang, Zikang Wang, Yuheng Shan, Heyan Chai, Jiayi Li, Zixian Ma, Xinrui Zhang, Yu Li
Abstract: Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that large language models significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academia and medicine. In this paper, we systematically investigate the application of prompt‑based methods with LLMs to biological sequences, including DNA, RNA, proteins, and drug discovery tasks. Specifically, we focus on how prompt engineering enables LLMs to tackle domain‑specific problems, such as promoter sequence prediction, protein structure modeling, and drug‑target binding affinity prediction, often with limited labeled data. Furthermore, our discussion highlights the transformative potential of prompting in bioinformatics while addressing key challenges such as data scarcity, multimodal fusion, and computational resource limitations. Our aim is for this paper to function both as a foundational primer for newcomers and a catalyst for continued innovation within this dynamic field of study.
Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
Abstract: The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure‑based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein‑ligand complexes, simulated by molecular dynamics, and propose a full‑atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo‑like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
Authors: Moritz Bensberg, Marco Eckhoff, F. Emil Thomasen, William Bro-Jørgensen, Matthew S. Teynor, Valentina Sora, Thomas Weymuth, Raphael T. Husistein, Frederik E. Knudsen, Anders Krogh, Kresten Lindorff-Larsen, Markus Reiher, Gemma C. Solomon
Abstract: Binding free energies are a key element in understanding and predicting the strength of protein‑drug interactions. While classical free energy simulations yield good results for many purely organic ligands, drugs including transition metal atoms often require quantum chemical methods for an accurate description. We propose a general and automated workflow that samples the potential energy surface with hybrid quantum mechanics/molecular mechanics (QM/MM) calculations and trains a machine learning (ML) potential on the QM energies and forces to enable efficient alchemical free energy simulations. To represent systems including many different chemical elements efficiently and to account for the different description of QM and MM atoms, we propose an extension of element‑embracing atom‑centered symmetry functions for QM/MM data as an ML descriptor. The ML potential approach takes electrostatic embedding and long‑range electrostatics into account. We demonstrate the applicability of the workflow on the well‑studied protein‑ligand complex of myeloid cell leukemia 1 and the inhibitor 19G and on the anti‑cancer drug NKP1339 acting on the glucose‑regulated protein 78.
Authors: Moritz Bensberg, Marco Eckhoff, Raphael T. Husistein, Matthew S. Teynor, Valentina Sora, William Bro-Jørgensen, F. Emil Thomasen, Anders Krogh, Kresten Lindorff-Larsen, Gemma C. Solomon, Thomas Weymuth, Markus Reiher
Abstract: We present a quantum‑in‑quantum embedding strategy coupled to machine learning potentials to improve on the accuracy of quantum‑classical hybrid models for the description of large molecules. In such hybrid models, relevant structural regions (such as those around reaction centers or pockets for binding of host molecules) can be described by a quantum model that is then embedded into a classical molecular‑mechanics environment. However, this quantum region may become so large that only approximate electronic structure models are applicable. To then restore accuracy in the quantum description, we here introduce the concept of quantum cores within the quantum region that are amenable to accurate electronic structure models due to their limited size. Huzinaga‑type projection‑based embedding, for example, can deliver accurate electronic energies obtained with advanced electronic structure methods. The resulting total electronic energies are then fed into a transfer learning approach that efficiently exploits the higher‑accuracy data to improve on a machine learning potential obtained for the original quantum‑classical hybrid approach. We explore the potential of this approach in the context of a well‑studied protein‑ligand complex for which we calculate the free energy of binding using alchemical free energy and non‑equilibrium switching simulations.
Authors: Nikolaos Nakis, Chrysoula Kosma, Anastasia Brativnyk, Michail Chatzianastasis, Iakovos Evdaimon, Michalis Vazirgiannis
Abstract: Accurately predicting complex protein‑protein interactions (PPIs) is crucial for decoding biological processes, from cellular functioning to disease mechanisms. However, experimental methods for determining PPIs are computationally expensive. Thus, attention has been recently drawn to machine learning approaches. Furthermore, insufficient effort has been made toward analyzing signed PPI networks, which capture both activating (positive) and inhibitory (negative) interactions. To accurately represent biological relationships, we present the Signed Two‑Space Proximity Model (S2‑SPM) for signed PPI networks, which explicitly incorporates both types of interactions, reflecting the complex regulatory mechanisms within biological systems. This is achieved by leveraging two independent latent spaces to differentiate between positive and negative interactions while representing protein similarity through proximity in these spaces. Our approach also enables the identification of archetypes representing extreme protein profiles. S2‑SPM's superior performance in predicting the presence and sign of interactions in SPPI networks is demonstrated in link prediction tasks against relevant baseline methods. Additionally, the biological prevalence of the identified archetypes is confirmed by an enrichment analysis of Gene Ontology (GO) terms, which reveals that distinct biological tasks are associated with archetypal groups formed by both interactions. This study is also validated regarding statistical significance and sensitivity analysis, providing insights into the functional roles of different interaction types. Finally, the robustness and consistency of the extracted archetype structures are confirmed using the Bayesian Normalized Mutual Information (BNMI) metric, proving the model's reliability in capturing meaningful SPPI patterns.
Authors: Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Dietrich Klakow, Andrea Volkamer
Abstract: Over the past six years, molecular transformer models have become key tools in drug discovery. Most existing models are pre‑trained on large, unlabeled datasets such as ZINC or ChEMBL. However, the extent to which large‑scale pre‑training improves molecular property prediction remains unclear. This study evaluates transformer models for this task while addressing their limitations. We explore how pre‑training dataset size and chemically informed objectives impact performance. Our results show that increasing the dataset beyond approximately 400K to 800K molecules from large‑scale unlabeled databases does not enhance performance across seven datasets covering five ADME endpoints: lipophilicity, permeability, solubility (two datasets), microsomal stability (two datasets), and plasma protein binding. In contrast, domain adaptation on a small, domain‑specific dataset (≤ 4K molecules) using multi‑task regression of physicochemical properties significantly boosts performance (p < 0.001). A model pre‑trained on 400K molecules and adapted with domain‑specific data outperforms larger models such as MolFormer and performs comparably to MolBERT. Benchmarks against Random Forest (RF) baselines using descriptors and Morgan fingerprints show that chemically and physically informed features consistently yield better performance across model types. While RF remains a strong baseline, we identify concrete practices to enhance transformer performance. Aligning pre‑training and adaptation with chemically meaningful tasks and domain‑relevant data presents a promising direction for molecular property prediction. Our models are available on HuggingFace for easy use and adaptation.
Authors: Kai Wang, Gabrielle Gilmer, Matheus Candia Arana, Hirotaka Iijima, Juliana Bergmann, Antonio Woollard, Boris Mesits, Meghan McGraw, Brian Zoltowski, Paola Cappellaro, Alex Ungar, David Pekker, David H. Waldeck, Sunil Saxena, Seth Lloyd, Fabrisia Ambrosio
Abstract: Diverse organisms exploit the geomagnetic field (GMF) for migration. Migrating birds employ an intrinsically quantum mechanical mechanism for detecting the geomagnetic field: absorption of a blue photon generates a radical pair whose two electrons precess at different rates in the magnetic field, thereby sensitizing cells to the direction of the GMF. In this work, using an in vitro injury model, we discovered a quantum‑based mechanism of cellular migration. Specifically, we show that migrating cells detect the GMF via an optically activated, electron spin‑based mechanism. Cell injury provokes acute emission of blue photons, and these photons sensitize muscle progenitor cells to the magnetic field. We show that the magnetosensitivity of muscle progenitor cells is (a) activated by blue light, but not by green or red light, and (b) disrupted by the application of an oscillatory field at the frequency corresponding to the energy of the electron‑spin/magnetic field interaction. A comprehensive analysis of protein expression reveals that the ability of blue photons to promote cell motility is mediated by activation of calmodulin calcium sensors. Collectively, these data suggest that cells possess a light‑dependent magnetic compass driven by electron spin dynamics.
Authors: Gokul Gowri, Igor Sadalski, Dan Raviv, Peng Yin, Jonathan Rosenfeld, Allon M. Klein
Abstract: Large genomic and imaging datasets can be used to train models that learn meaningful representations of cellular systems. Across domains, model performance improves predictably with dataset size and compute budget, providing a basis for allocating data and computation. Scientific data, however, is also limited by noise arising from factors such as molecular undersampling, sequencing errors, and image resolution. By fitting 1,670 representation learning models across three data modalities (gene expression, sequence, and image data), we show that noise defines a distinct axis along which performance improves. Noise scaling follows a logarithmic law. We derive the law from a model of noise propagation, and use it to define noise sensitivity and model capacity as benchmarking metrics. We show that protein sequence representations are noise‑robust while single cell transcriptomics models are not, with a Transformer‑based model showing greater noise robustness but lower saturating performance than a variational autoencoder model. Noise scaling metrics may support future model evaluation and experimental design.
Authors: Yue Gao, Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu
Abstract: Hypergraph neural networks (HGNNs) effectively model complex high‑order relationships in domains like protein interactions and social networks by connecting multiple vertices through hyperedges, enhancing modeling capabilities, and reducing information loss. Developing foundation models for hypergraphs is challenging due to their distinct data, which includes both vertex features and intricate structural information. We present Hyper‑FM, a Hypergraph Foundation Model for multi‑domain knowledge extraction, featuring Hierarchical High‑Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation and Hierarchical Multi‑Hypergraph Guided Structural Knowledge Extraction for structural information. Additionally, we curate 11 text‑attributed hypergraph datasets to advance research between HGNNs and LLMs. Experiments on these datasets show that Hyper‑FM outperforms baseline methods by approximately 13.4%, validating our approach. Furthermore, we propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance, unlike merely augmenting vertex and hyperedge counts. This underscores the critical role of domain diversity in scaling hypergraph models.
Authors: Alain M. Dikandé
Abstract: The Duffing oscillator describes the dynamics of a mass suspended on a spring with position‑dependent stiffness. The mass is assumed to experience a linear damping and a time‑dependent external forcing. The model has been instrumental in theoretical investigations of dynamical properties of systems with parity‑conserving symmetry, where a double‑well substrate connects two metastable states separated by a barrier. Physical systems of interest include nonlinear feedback‑controlled mass‑spring‑damper oscillators, active hysteresis circuits (e.g. memristors), protein chains prone to hydrogen bond‑mediated conformational transitions, centro‑symmetric crystals and so on. In this work we consider a Duffing‑type oscillator with a double‑well potential represented by a hyperbolic function of mass position. The hyperbolic double‑well potential has two degenerate minima that can be smoothly tuned by varying a deformability parameter, leaving the barrier height unchanged. We investigate solutions of the equation of motion in the absence and presence of damping and forcing. In the absence of perturbations, numerical solutions lead to a periodic train of anharmonic oscillations featuring a crystal of pulse solitons of sech type. However, when the hyperbolic double‑well potential is inverted, analytical solutions can be obtained which turn out to be kink‑soliton crystals described by Jacobi elliptic functions. When damping and forcing are taken into consideration, the system dynamics can transit from periodic to chaotic phases or vice‑versa via period‑doubling or period‑halving bifurcations, by simply varying the deformability parameter. The Poincaré map of the proposed model carries the well‑known characteristic signatures of chaos precursors of the standard Duffing model, which happens to be just a particular case of the bistable oscillator model with the hyperbolic double‑well potential.
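As a rough illustration of the dynamics described in the abstract above, the sketch below integrates the standard Duffing equation x'' + delta*x' + alpha*x + beta*x^3 = gamma*cos(omega*t) with a fixed-step fourth-order Runge-Kutta scheme. It assumes the textbook quartic double well (alpha < 0, beta > 0), which the abstract names as a particular case; the hyperbolic potential and its deformability parameter are not specified in the abstract and are not modeled here. All function names and parameter values are illustrative assumptions.

```python
import math


def duffing_rhs(t, x, v, alpha, beta, delta, gamma, omega):
    """First-order form of x'' + delta*x' + alpha*x + beta*x**3 = gamma*cos(omega*t)."""
    return v, -delta * v - alpha * x - beta * x**3 + gamma * math.cos(omega * t)


def integrate(x0, v0, t_end, dt, alpha=-1.0, beta=1.0, delta=0.0, gamma=0.0, omega=1.0):
    """Fixed-step RK4 integration; returns a list of (t, x, v) samples."""
    p = (alpha, beta, delta, gamma, omega)
    t, x, v = 0.0, x0, v0
    traj = [(t, x, v)]
    for _ in range(round(t_end / dt)):
        k1x, k1v = duffing_rhs(t, x, v, *p)
        k2x, k2v = duffing_rhs(t + dt / 2, x + dt / 2 * k1x, v + dt / 2 * k1v, *p)
        k3x, k3v = duffing_rhs(t + dt / 2, x + dt / 2 * k2x, v + dt / 2 * k2v, *p)
        k4x, k4v = duffing_rhs(t + dt, x + dt * k3x, v + dt * k3v, *p)
        x += dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += dt
        traj.append((t, x, v))
    return traj


def energy(x, v, alpha=-1.0, beta=1.0):
    """Kinetic energy plus the quartic double-well potential alpha*x^2/2 + beta*x^4/4."""
    return 0.5 * v * v + 0.5 * alpha * x * x + 0.25 * beta * x**4


if __name__ == "__main__":
    # Damped, forced run with illustrative (not source-derived) parameters.
    traj = integrate(0.5, 0.0, 50.0, 0.01, delta=0.3, gamma=0.5, omega=1.2)
    print("final state:", traj[-1])
```

With damping and forcing switched off, the energy function should stay constant along the trajectory, which gives a quick sanity check on the integrator before exploring forced regimes.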
Authors: Guanlue Li, Chenran Jiang, Ziqi Gao, Yu Liu, Chenyang Liu, Jiean Chen, Yong Huang, Jia Li
Abstract: Effective generation of molecular structures, or new chemical entities, that bind to target proteins is crucial for lead identification and optimization in drug discovery. Despite advancements in atom‑ and motif‑wise deep learning models for 3D molecular generation, current methods often struggle with validity and reliability. To address these issues, we develop the Atom‑Motif Consistency Diffusion Model (AMDiff), utilizing a joint‑training paradigm for multi‑view learning. This model features a hierarchical diffusion architecture that integrates both atom‑ and motif‑level views of molecules, allowing for comprehensive exploration of complementary information. By leveraging classifier‑free guidance and incorporating binding site features as conditional inputs, AMDiff ensures robust molecule generation across diverse targets. Compared to existing approaches, AMDiff exhibits superior validity and novelty in generating molecules tailored to fit various protein pockets. Case studies targeting protein kinases, including Anaplastic Lymphoma Kinase (ALK) and Cyclin‑dependent kinase 4 (CDK4), demonstrate the model's capability in structure‑based de novo drug design. Overall, AMDiff bridges the gap between atom‑view and motif‑view drug discovery and speeds up the process of target‑aware molecular generation.
Authors: Aakanksha J Shetty, Alexei Sirbu, Paolo Annibale
Abstract: G protein‑coupled receptors (GPCRs) represent a diverse and vital family of membrane proteins that mediate intracellular signaling in response to extracellular stimuli, playing critical roles in physiology and disease. Traditionally recognized as chemical signal transducers, GPCRs have recently been implicated in mechanotransduction, the process of converting mechanical stimuli into cellular responses. This review explores the emerging role of GPCRs in sensing and responding to mechanical forces, with a particular focus on the cardiovascular system. Cardiovascular homeostasis is heavily influenced by mechanical forces such as shear stress, cyclic stretch, and pressure, which are central to both normal physiology and the pathogenesis of diseases like hypertension and atherosclerosis. GPCRs, including the angiotensin II type 1 receptor (AT1R) and the β2‑adrenergic receptor (β2‑AR), have demonstrated the ability to integrate mechanical and chemical signals, potentially through conformational changes and/or modulation of lipid interactions, leading to biased signaling. Recent studies highlight the dual activation mechanisms of GPCRs, with β2‑AR now serving as a key example of how mechanical and ligand‑dependent pathways contribute to cardiovascular regulation. This review synthesizes current knowledge of GPCR mechanosensitivity, emphasizing its implications for cardiovascular health and disease, and explores advancements in methodologies poised to further unravel the mechanistic intricacies of these receptors.
Authors: Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Abstract: Recently, diffusion‑ and flow‑based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large‑scale flow‑based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine‑tuning strategies like LoRA for protein backbones, new guidance methods like classifier‑free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state‑of‑the‑art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high‑level secondary‑structure guidance as well as low‑level fold‑specific generation.
Authors: Gian Marco Visani, Michael N. Pun, Anastasia A. Minervina, Philip Bradley, Paul Thomas, Armita Nourmohammad
Abstract: T‑cells play a key role in adaptive immunity by mounting specific responses against diverse pathogens. An effective binding between T‑cell receptors (TCRs) and pathogen‑derived peptides presented on Major Histocompatibility Complexes (MHCs) mediates an immune response. However, predicting these interactions remains challenging due to limited functional data on T‑cell reactivities. Here, we introduce a computational approach to predict TCR interactions with peptides presented on MHC class I alleles, and to design novel immunogenic peptides for specified TCR‑MHC complexes. Our method leverages HERMES, a structure‑based, physics‑guided machine learning model trained on the protein universe to predict amino acid preferences based on local structural environments. Despite no direct training on TCR‑pMHC data, the implicit physical reasoning in HERMES enables us to make accurate predictions of both TCR‑pMHC binding affinities and T‑cell activities across diverse viral epitopes and cancer neoantigens, achieving up to 0.72 correlation with experimental data. Leveraging our TCR recognition model, we develop a computational protocol for de novo design of immunogenic peptides. Through experimental validation in three TCR‑MHC systems targeting viral and cancer peptides, we demonstrate that our designs, with up to five substitutions from the native sequence, activate T‑cells at success rates of up to 50%. Lastly, we use our generative framework to quantify the diversity of the peptide recognition landscape for various TCR‑MHC complexes, offering key insights into T‑cell specificity in both humans and mice. Our approach provides a platform for immunogenic peptide and neoantigen design, as well as for evaluating TCR specificity, offering a computational framework to inform design of engineered T‑cell therapies and vaccines.
Authors: Kisan Khatri, Ronald M. Levy, Allan Haldane
Abstract: Recent generative learning models applied to protein multiple sequence alignment (MSA) datasets include simple and interpretable physics‑based Potts covariation models and other machine learning models such as MSA‑Transformer (MSA‑T). The best models accurately reproduce MSA statistics induced by the biophysical constraints within proteins, raising the question of which functional forms best model the underlying physics. The Potts model is usually specified by an effective potential including pairwise residue‑residue interaction terms, but it has been suggested that MSA‑T can capture the effects induced by effective potentials which include more than pairwise interactions and implicitly account for phylogenetic structure in the MSA. Here we compare the ability of the Potts model and MSA‑T to reconstruct higher‑order sequence statistics reflecting complex biological sequence constraints. We find that the model performance depends greatly on the treatment of phylogenetic relationships between the sequences, which can induce non‑biophysical mutational covariation in MSAs. When using explicit corrections for phylogenetic dependencies, we find the Potts model outperforms MSA‑T in detecting epistatic interactions of biophysical origin.
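To make the pairwise functional form mentioned in the abstract above concrete, the following minimal sketch evaluates a Potts energy E(s) = sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j) and the double-mutant epistasis it implies. The sign convention (lower energy meaning higher probability) is an assumption, since conventions differ across the literature, and the toy fields and couplings are invented for illustration and do not come from the paper.

```python
from itertools import combinations


def potts_energy(seq, fields, couplings):
    """Pairwise Potts energy of a sequence of integer states.

    fields[i][a] is the single-site term for state a at position i;
    couplings[(i, j)][a][b] is the pair term for i < j.
    """
    e = sum(fields[i][a] for i, a in enumerate(seq))
    for i, j in combinations(range(len(seq)), 2):
        e += couplings[(i, j)][seq[i]][seq[j]]
    return e


def epistasis(wt, i, a, j, b, fields, couplings):
    """Double-mutant cycle: E(ij) - E(i) - E(j) + E(wt)."""
    def mutate(seq, changes):
        out = list(seq)
        for pos, state in changes:
            out[pos] = state
        return out

    e_wt = potts_energy(wt, fields, couplings)
    e_i = potts_energy(mutate(wt, [(i, a)]), fields, couplings)
    e_j = potts_energy(mutate(wt, [(j, b)]), fields, couplings)
    e_ij = potts_energy(mutate(wt, [(i, a), (j, b)]), fields, couplings)
    return e_ij - e_i - e_j + e_wt


# Toy model: 3 positions, 2 states, only positions 0 and 2 are coupled.
zero = [[0.0, 0.0], [0.0, 0.0]]
FIELDS = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.0]]
COUPLINGS = {(0, 1): zero, (1, 2): zero, (0, 2): [[-1.0, 0.0], [0.0, -1.0]]}

if __name__ == "__main__":
    print(epistasis([0, 0, 0], 0, 1, 2, 1, FIELDS, COUPLINGS))  # coupled pair
    print(epistasis([0, 0, 0], 0, 1, 1, 1, FIELDS, COUPLINGS))  # uncoupled pair
```

In a strictly pairwise model the double-mutant epistasis depends only on the coupling between the two mutated positions, which is why effects beyond pairwise terms, as discussed in the abstract, require a different functional form.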
Authors: Tom Pan, Evan Dramko, Mitchell D. Miller, George N. Phillips, Anastasios Kyrillidis
Abstract: Determining protein structures at an atomic level remains a significant challenge in structural biology. We introduce RecCrysFormer, a hybrid model that exploits the strengths of transformers with the aim of integrating experimental and ML approaches to protein structure determination from crystallographic data. RecCrysFormer leverages Patterson maps and incorporates known standardized partial structures of amino acid residues to directly predict electron density maps, which are essential for constructing detailed atomic models through crystallographic refinement processes. RecCrysFormer benefits from a "recycling" training regimen that iteratively incorporates results from crystallographic refinements and previous training runs as additional inputs in the form of template maps. Using a preliminary dataset of synthetic peptide fragments based on the Protein Data Bank, RecCrysFormer achieves good accuracy in structural predictions and shows robustness against variations in crystal parameters, such as unit cell dimensions and angles.
Authors: Higor V. M. Ferreira, Nelson H. T. Lemes, Yara L. Coelho, Luciano S. Virtuoso, Ana C. dos Santos Pires, Luis H. M. da Silva
Abstract: The application of surface plasmon resonance (SPR) has transformed the study of interactions between a ligand immobilized on the surface of a sensor chip, designated as L_S, and an analyte in solution, referred to as A. This technique enables the real‑time measurement of interactions with high sensitivity. The dynamics of the adsorption‑desorption process, A + L_S → AL_S, can be expressed mathematically as a set of coupled integer‑order differential equations. However, this approach has limited ability to account for temperature distribution, diffusion and transport effects involved in the reaction process. The fractional kinetic model provides a methodology for incorporating non‑local effects into the problem. In this study, the proposed model was applied to analyze data on the interaction between Immobilized Baru Protein (IBP) and Congo Red dye (CR) at concentrations ranging from 7.5 to 97.5 μM, at pH 7.4 and 16 °C. The variation in the kinetic constants was studied, and it was demonstrated that the integer‑order model is unable to adequately represent the experimental data. This work has shown that the fractional‑order model is capable of capturing the complexity of the adsorption‑desorption process involved in the SPR data.
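For orientation, the integer-order baseline that the abstract above contrasts with its fractional model can be sketched with the textbook 1:1 Langmuir association equation, dR/dt = ka*C*(R_max - R) - kd*R, at constant analyte concentration C. This is an assumed standard form, not necessarily the exact system of coupled equations used in the paper, and the rate constants and R_max below are hypothetical placeholders rather than fitted values. The fractional-order model itself is not implemented here.

```python
import math


def langmuir_closed_form(t, conc, ka, kd, rmax):
    """Analytical association curve R(t) for constant analyte concentration."""
    k_obs = ka * conc + kd             # observed rate constant (1/s)
    r_eq = rmax * ka * conc / k_obs    # equilibrium response
    return r_eq * (1.0 - math.exp(-k_obs * t))


def langmuir_euler(t_end, dt, conc, ka, kd, rmax):
    """Explicit Euler integration of dR/dt = ka*C*(rmax - R) - kd*R from R(0) = 0."""
    r = 0.0
    for _ in range(round(t_end / dt)):
        r += dt * (ka * conc * (rmax - r) - kd * r)
    return r


if __name__ == "__main__":
    # Hypothetical parameters; 7.5e-6 M matches the lowest concentration quoted above.
    conc, ka, kd, rmax = 7.5e-6, 1.0e4, 1.0e-2, 100.0
    for t in (10.0, 50.0, 200.0):
        print(t, langmuir_closed_form(t, conc, ka, kd, rmax))
```

Comparing the numerical and closed-form curves is a simple way to validate an integrator before replacing the first-order time derivative with a fractional one.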
Authors: Roman Klypa, Alberto Bietti, Sergei Grudinin
Abstract: Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of previously known interacting RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA‑BAnG, a deep learning‑based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein‑binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
Authors: Shoummo Ahsan Khandoker, Estelle M. Inack, Mohamed Hibat-Allah
Abstract: Understanding the principles of protein folding is a cornerstone of computational biology, with implications for drug design, bioengineering, and the understanding of fundamental biological processes. Lattice protein folding models offer a simplified yet powerful framework for studying the complexities of protein folding, enabling the exploration of energetically optimal folds under constrained conditions. However, finding these optimal folds is a computationally challenging combinatorial optimization problem. In this work, we introduce a novel upper‑bound training scheme that employs masking to identify the lowest‑energy folds in two‑dimensional Hydrophobic‑Polar (HP) lattice protein folding. By leveraging Dilated Recurrent Neural Networks (RNNs) integrated with an annealing process driven by temperature‑like fluctuations, our method accurately predicts optimal folds for benchmark systems of up to 60 beads. Our approach also effectively masks invalid folds from being sampled without compromising the autoregressive sampling properties of RNNs. This scheme is generalizable to three spatial dimensions and can be extended to lattice protein models with larger alphabets. Our findings emphasize the potential of advanced machine learning techniques in tackling complex protein folding problems and a broader class of constrained combinatorial optimization challenges.
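The objective that such methods minimize can be illustrated with a minimal sketch of the standard 2D HP lattice energy: each pair of hydrophobic (H) beads that sit on adjacent lattice sites without being chain neighbors contributes -1. The move encoding (U, D, L, R), the function names, and the `valid_moves` helper that mimics masking of self-intersecting steps are illustrative assumptions, not the authors' implementation, and no RNN or annealing is included.

```python
from typing import List, Tuple

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coords(moves: str) -> List[Tuple[int, int]]:
    """Turn a string of lattice moves into bead coordinates, starting at the origin."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def valid_moves(coords: List[Tuple[int, int]]) -> List[str]:
    """Moves from the last bead that do not land on an occupied site (a simple mask)."""
    occupied = set(coords)
    x, y = coords[-1]
    return [m for m, (dx, dy) in MOVES.items() if (x + dx, y + dy) not in occupied]


def hp_energy(sequence: str, moves: str) -> int:
    """HP energy: -1 per pair of non-bonded H beads on neighboring lattice sites."""
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move per bond")
    coords = fold_coords(moves)
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each lattice contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


if __name__ == "__main__":
    print(hp_energy("HPPH", "RUL"))  # square fold bringing the two H beads together
```

Filtering candidate moves at each step, as `valid_moves` does, keeps sampling autoregressive while guaranteeing that every completed fold is self-avoiding.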
Authors: Rafael B. Frigori
Abstract: The rapid evolution and global impact of coronaviruses, notably SARS‑CoV‑1 and SARS‑CoV‑2, underscore the importance of understanding their molecular mechanisms in detail. This study focuses on the receptor‑binding motif (RBM) within the Spike protein of these viruses, a critical element for viral entry through interaction with the ACE2 receptor. We investigate the sequence variations in the RBM across SARS‑CoV‑1, SARS‑CoV‑2 and its early variants of concern (VOCs). Utilizing multicanonical simulations and microcanonical analysis, we examine how these variations influence the folding dynamics, thermostability, and solubility of the RBMs. Our methodology includes calculating the density of states (DoS) to identify structural phase transitions and assess thermodynamic properties. Furthermore, we solve the Poisson‑Boltzmann equation to model the solubility of the RBMs in aqueous environments. This methodology is expected to elucidate structural and functional differences in viral evolution and pathogenicity, likely improving targeted treatments and vaccines.
Authors: Yu-Ching Tseng, Chamika Goonetilleke, Xiaotian Lu, Niladri Sekhar Mandal, Ali Borhan, Ayusman Sen
Abstract: Through a combination of experiments and modeling, we have demonstrated a novel pattern formation phenomenon in an isothermal miscible fluid system involving simple protein and sugar solutions. We introduced dye‑tagged protein solution into a petri dish with sugar solutions, which had higher density than the added protein solution. Initially, the protein spread and became more uniformly distributed at the air‑water interface. Subsequently, it concentrated in specific areas to form spiral patterns. We propose that the mechanism involves an interplay between Marangoni effects, evaporation, and airflow. This finding is unexpected as solute Marangoni‑related processes are generally characterized by fast spreading (seconds), while the pattern formation in our systems takes several minutes to form. Our work suggests that Turing reaction‑diffusion patterns can be replicated by replacing the reaction‑induced inhomogeneous solute distribution by evaporation‑induced inhomogeneity. In both cases, the fast diffusive or Marangoni spreading of the solute is counteracted by a slower step that serves to reverse the solute homogenization. In showing that dissipative patterns can form in the absence of thermal gradients or chemical reactions, our findings significantly expand the conditions that lead to pattern formation. The insights gained also enhance our ability to manipulate and control fluid motion and surface morphology, with promising implications for many areas such as coating technologies, materials science, and microfluidics.
Authors: Chao Fang, Yihan He, Xiao Gong, Gengchiau Liang
Abstract: In the post‑Moore era, the need for efficient solutions to non‑deterministic polynomial‑time (NP) problems is becoming more pressing. In this context, the Ising model implemented by probabilistic computing systems with probabilistic bits (p‑bits) has attracted attention due to the widespread availability of p‑bits and support for large‑scale simulations. This study marks the first work to apply probabilistic computing to tackle protein folding, a significant NP‑complete problem in biology. We represent proteins as sequences of hydrophobic (H) and polar (P) beads within a three‑dimensional (3‑D) grid and introduce a novel many‑body interaction‑based encoding method to map the problem onto an Ising model. Our simulations show that this approach significantly simplifies the energy landscape for short peptide sequences of six amino acids, halving the number of energy levels. Furthermore, the proposed mapping method achieves approximately 100 times acceleration for sequences consisting of ten amino acids in identifying the correct folding configuration. We predicted the optimal folding configuration for a peptide sequence of 36 amino acids by identifying the ground state. These findings highlight the unique potential of the proposed encoding method for solving protein folding and, importantly, provide new tools for solving similar NP‑complete problems in biology using a probabilistic computing approach.
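To make the H/P bead representation concrete, the following is a minimal sketch of how a lattice conformation can be scored. It assumes the textbook HP convention (an energy of -1 for every pair of H beads that are lattice neighbours but not adjacent along the chain); the abstract does not state the energy function used, and this sketch is not the paper's many‑body Ising encoding. The function name `hp_energy` and the example conformations are hypothetical.

```python
from itertools import combinations


def hp_energy(sequence, coords):
    """Score a 3-D lattice conformation under the standard HP convention.

    Assumption (not from the abstract): each pair of H beads that sit on
    neighbouring lattice sites without being consecutive in the chain
    contributes -1 to the energy.
    """
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per bead is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    def lattice_distance(p, q):
        # Manhattan distance; a value of 1 means nearest lattice neighbours.
        return sum(abs(x - y) for x, y in zip(p, q))

    # Consecutive beads of the chain must occupy neighbouring sites.
    for p, q in zip(coords, coords[1:]):
        if lattice_distance(p, q) != 1:
            raise ValueError("consecutive beads must be lattice neighbours")

    energy = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # skip beads that are bonded along the chain
        if sequence[i] == "H" and sequence[j] == "H":
            if lattice_distance(coords[i], coords[j]) == 1:
                energy -= 1
    return energy


# A four-bead chain folded into a square brings the two H ends into contact.
folded = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
# The same chain fully extended has no H-H contact.
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(hp_energy("HPPH", folded), hp_energy("HPPH", extended))
```

Under this convention the ground state is the self‑avoiding conformation with the most H‑H contacts, which is the quantity a ground‑state search would minimize.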
Authors: Chuanliu Fan, Ziqiang Cao, Zicheng Ma, Nan Yu, Yimin Peng, Jun Zhang, Yiqin Gao, Guohong Fu
Abstract: Goal‑oriented de novo molecule design, namely generating molecules with specific property or substructure constraints, is a crucial yet challenging task in drug discovery. Existing methods, such as Bayesian optimization and reinforcement learning, often require training multiple property predictors and struggle to incorporate substructure constraints. Inspired by the success of Large Language Models (LLMs) in text generation, we propose ChatMol, a novel approach that leverages LLMs for molecule design across diverse constraint settings. Initially, we crafted a molecule representation compatible with LLMs and validated its efficacy across multiple online LLMs. Afterwards, we developed specific prompts geared towards diverse constrained molecule generation tasks to further fine‑tune current LLMs while integrating feedback learning derived from property prediction. Finally, to address the limitations of LLMs in numerical recognition, we referred to the position encoding method and incorporated additional encoding for numerical values within the prompt. Experimental results across single‑property, substructure‑property, and multi‑property constrained tasks demonstrate that ChatMol consistently outperforms state‑of‑the‑art baselines, including VAE and RL‑based methods. Notably, in the multi‑objective binding affinity maximization task, ChatMol achieves a significantly lower KD value of 0.25 for the protein target ESR1, while maintaining the highest overall performance, surpassing previous methods by 4.76%. Meanwhile, with numerical enhancement, the Pearson correlation coefficient between the instructed property values and those of the generated molecules increased by up to 0.49. These findings highlight the potential of LLMs as a versatile framework for molecule generation, offering a promising alternative to traditional latent space and RL‑based approaches.
Authors: Xingyi Zhang, Kun Xie, Ningqiao Huang, Wei Liu, Peilin Zhao, Sibo Wang, Kangfei Zhao, Biaobin Jiang
Abstract: Recent advancements in protein design have leveraged diffusion models to generate structural scaffolds, followed by a process known as protein inverse folding, which involves sequence inference on these scaffolds. However, these methodologies face significant challenges when applied to hyper‑variable structures such as antibody Complementarity‑Determining Regions (CDRs), where sequence inference frequently results in non‑functional sequences due to hallucinations. Distinguished from prevailing protein inverse folding approaches, this paper introduces Igseek, a novel structure‑retrieval framework that infers CDR sequences by retrieving similar structures from a natural antibody database. Specifically, Igseek employs a simple yet effective multi‑channel equivariant graph neural network to generate high‑quality geometric representations of CDR backbone structures. Subsequently, it aligns sequences of structurally similar CDRs and utilizes structurally conserved sequence motifs to enhance inference accuracy. Our experiments demonstrate that Igseek not only proves to be highly efficient in structural retrieval but also outperforms state‑of‑the‑art approaches in sequence recovery for both antibodies and T‑Cell Receptors, offering a new retrieval‑based perspective for therapeutic protein design.
Authors: Gregory W. Kyro, Tianyin Qiu, Victor S. Batista
Abstract: Deep learning has transformed protein design, enabling accurate structure prediction, sequence optimization, and de novo protein generation. Advances in single‑chain protein structure prediction via AlphaFold2, RoseTTAFold, ESMFold, and others have achieved near‑experimental accuracy, inspiring successive work extended to biomolecular complexes via AlphaFold Multimer, RoseTTAFold All‑Atom, AlphaFold 3, Chai‑1, Boltz‑1 and others. Generative models such as ProtGPT2, ProteinMPNN, and RFdiffusion have enabled sequence and backbone design beyond natural evolution‑based limitations. More recently, joint sequence‑structure co‑design models, including ESM3, have integrated both modalities into a unified framework, resulting in improved designability. Despite these advances, challenges still exist pertaining to modeling sequence‑structure‑function relationships and ensuring robust generalization beyond the regions of protein space spanned by the training data. Future advances will likely focus on joint sequence‑structure‑function co‑design frameworks that are able to model the fitness landscape more effectively than models that treat these modalities independently. Current capabilities, coupled with the dizzying rate of progress, suggest that the field will soon enable rapid, rational design of proteins with tailored structures and functions that transcend the limitations imposed by natural evolution. In this review, we discuss the current capabilities of deep learning methods for protein design, focusing on some of the most revolutionary and capable models with respect to their functionality and the applications that they enable, leading up to the current challenges of the field and the optimal path forward.
Authors: Fanglei Xue, Meihan Zhang, Shuqi Li, Xinyu Gao, James A. Wohlschlegel, Wenbing Huang, Yi Yang, Weixian Deng
Abstract: Targeted protein degradation (TPD) induced by small molecules has emerged as a rapidly evolving modality in drug discovery, targeting proteins traditionally considered "undruggable". Proteolysis‑targeting chimeras (PROTACs) and molecular glue degraders (MGDs) are the primary small molecules that induce TPD. Both types of molecules form a ternary complex linking an E3 ligase with a target protein, a crucial step for drug discovery. While significant advances have been made in binary structure prediction for proteins and small molecules, ternary structure prediction remains challenging due to obscure interaction mechanisms and insufficient training data. Traditional methods relying on manually assigned rules perform poorly and are computationally demanding due to extensive random sampling. In this work, we introduce DeepTernary, a novel deep learning‑based approach that directly predicts ternary structures in an end‑to‑end manner using an encoder‑decoder architecture. DeepTernary leverages an SE(3)‑equivariant graph neural network (GNN) with both intra‑graph and ternary inter‑graph attention mechanisms to capture intricate ternary interactions from our collected high‑quality training dataset, TernaryDB. The proposed query‑based Pocket Points Decoder extracts the 3D structure of the final binding ternary complex from learned ternary embeddings, demonstrating state‑of‑the‑art accuracy and speed in existing PROTAC benchmarks without prior knowledge from known PROTACs. It also achieves notable accuracy on the more challenging MGD benchmark under the blind docking protocol. Remarkably, our experiments reveal that the buried surface area calculated from predicted structures correlates with experimentally obtained degradation potency‑related metrics. Consequently, DeepTernary shows potential in effectively assisting and accelerating the development of TPDs for previously undruggable targets.
Authors: Adolfo Ruiz-Sanmartín, Vicent Ribas, David Suñol, Luis Chiscano-Camón, Laura Martín, Iván Bajaña, Juliana Bastida, Nieves Larrosa, Juan José González, M Dolores Carrasco, Núria Canela, Ricard Ferrer, Juan Carlos Ruiz-Rodrígue
Abstract: Background: The search for new biomarkers that allow an early diagnosis in sepsis has become a necessity in medicine. The objective of this study is to identify potential protein biomarkers of differential expression between sepsis and non‑infectious systemic inflammatory response syndrome (NISIRS).
Methods: Prospective observational study of a cohort of septic patients activated by the Sepsis Code and patients admitted with NISIRS, during the period 2016‑2017. A mass spectrometry‑based approach was used to analyze the plasma proteins in the enrolled subjects. Subsequently, using recursive feature elimination (RFE) classification and cross‑validation with a vector classifier, the association of these proteins with sepsis, compared to NISIRS, was assessed. The protein‑protein interaction network was analyzed with String software.
Results: A total of 277 patients (141 with sepsis and 136 with NISIRS) were included. After performing RFE, 25 proteins in the study patient cohort showed statistical significance, with an accuracy of 0.960, specificity of 0.920, sensitivity of 0.973, and an AUC of 0.985. Of these, 14 proteins (vWF, PPBP, C5, C1RL, FCN3, SAA2, ORM1, ITIH3, GSN, C1QA, CA1, CFB, C3, LBP) have a greater relationship with sepsis while 11 proteins (FN1, IGFALS, SERPINA4, APOE, APOH, C6, SERPINA3, AHSG, LUM, ITIH2, SAA1) are more expressed in NISIRS.
Authors: Christoph Haessig, Flemming Møller
Abstract: The ability to measure protein functionality is critical for the development of plant‑based products, particularly with respect to gelation behavior, which is vital for food structure and texture. Small amplitude oscillatory shear tests remain the standard for monitoring protein gelation; however, these methods are costly, time‑consuming, and require physical contact with the sample. Laser speckle rheology, an optical‑based technique, offers a contactless alternative by assessing rheological properties through speckle pattern fluctuations. In this work, we present a simple laser speckle rheology setup, utilizing a diode laser and a digital camera, to monitor rheological changes during the rennet coagulation of milk. We use a viscoelasticity index, derived from a two‑dimensional linear correlation, to quantify speckle pattern fluctuations. The laser speckle rheology method is compared with conventional small amplitude oscillatory shear rheology. Results demonstrate that key characteristics of the coagulation process, including coagulation and gelation times, are temporally aligned between the two methods. Furthermore, the viscoelasticity index allows for the comparison of the complex modulus in samples with similar compositions under consistent acquisition parameters. These findings underscore the potential of laser speckle rheology as a cost‑effective, rapid, and contactless approach for capturing protein gelation, providing a viable alternative to conventional shear rheological methods.
Authors: Alex Havrilla, David Alvarez-Melis, Nicolo Fusi
Abstract: Large language models (LLMs) have emerged as a powerful method for discovery. Instead of utilizing numerical data, LLMs utilize associated variable semantic metadata to predict variable relationships. Simultaneously, LLMs demonstrate impressive abilities to act as black‑box optimizers when given an objective f and sequence of trials. We study LLMs at the intersection of these two capabilities by applying LLMs to the task of interactive graph discovery: given a ground truth graph G^* capturing variable relationships and a budget of I edge experiments over R rounds, minimize the distance between the predicted graph \hat{G}_R and G^* at the end of the R‑th round. To solve this task we propose IGDA, an LLM‑based pipeline incorporating two key components: 1) an LLM uncertainty‑driven method for edge experiment selection and 2) a local graph update strategy utilizing binary feedback from experiments to improve predictions for unselected neighboring edges. Experiments on eight different real‑world graphs show our approach often outperforms all baselines including a state‑of‑the‑art numerical method for interactive graph discovery. Further, we conduct a rigorous series of ablations dissecting the impact of each pipeline component. Finally, to assess the impact of memorization, we apply our interactive graph discovery strategy to a complex, new (as of July 2024) causal graph on protein transcription factors, finding strong performance in a setting where memorization is impossible. Overall, our results show IGDA to be a powerful method for graph discovery complementary to existing numerically driven approaches.
Authors: Chaohao Yuan, Kangfei Zhao, Ercan Engin Kuruoglu, Liang Wang, Tingyang Xu, Wenbing Huang, Deli Zhao, Hong Cheng, Yu Rong
Abstract: Graph Transformers (GTs) have demonstrated a strong capability in modeling graph structures by addressing the intrinsic limitations of graph neural networks (GNNs), such as over‑smoothing and over‑squashing. Recent studies have proposed diverse architectures, enhanced explainability, and practical applications for Graph Transformers. In light of these rapid developments, we conduct a comprehensive review of Graph Transformers, covering their architectures, theoretical foundations, and applications. We categorize the architecture of Graph Transformers according to their strategies for processing structural information, including graph tokenization, positional encoding, structure‑aware attention and model ensemble. From the theoretical perspective, we examine the expressivity of Graph Transformers in the various discussed architectures and contrast them with other advanced graph learning algorithms to discover the connections. Furthermore, we provide a summary of the practical applications where Graph Transformers have been utilized, such as molecule, protein, language, vision, traffic, brain and material data. At the end of this survey, we discuss the current challenges and prospective directions in Graph Transformers for potential future research.
Authors: Haocheng Tang, Jing Long, Beihong Ji, Junmei Wang
Abstract: In this work, we introduce Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN), a novel approach for molecular generation in small‑sample datasets. Traditional generative models often struggle with limited training data, particularly in drug discovery, where molecular datasets for specific therapeutic targets, such as nucleic acid binders and central nervous system (CNS) drugs, are scarce. ADSeqGAN addresses this challenge by integrating an auxiliary random forest classifier as an additional discriminator into the GAN framework, significantly improving molecular generation quality and class specificity. Our method incorporates a pretrained generator and the Wasserstein distance to enhance training stability and diversity. We evaluate ADSeqGAN across three representative cases. First, on nucleic acid‑ and protein‑targeting molecules, ADSeqGAN shows superior capability in generating nucleic acid binders compared to baseline models. Second, through oversampling, it markedly improves CNS drug generation, achieving higher yields than traditional de novo models. Third, in cannabinoid receptor type 1 (CB1) ligand design, ADSeqGAN generates novel druglike molecules, with 32.8% predicted actives surpassing hit rates of CB1‑focused and general‑purpose libraries when assessed by a target‑specific LRIP‑SF scoring function. Overall, ADSeqGAN offers a versatile framework for molecular design in data‑scarce scenarios, with demonstrated applications in nucleic acid binders, CNS drugs, and CB1 ligands.
Authors: Alireza Nourbakhsh, Hoda Mohammadzade
Abstract: Time Series Alignment is a critical task in signal processing with numerous real‑world applications. In practice, signals often exhibit temporal shifts and scaling, making classification on raw data prone to errors. This paper introduces a novel approach for Multiple Time Series Alignment (MTSA) leveraging Deep Learning techniques. While most existing methods primarily address Multiple Sequence Alignment (MSA) for protein and DNA sequences, there remains a significant gap in alignment methodologies for numerical time series. Additionally, conventional approaches typically focus on pairwise alignment, whereas our proposed method aligns all signals jointly (all the signals are aligned together at once). This innovation not only enhances alignment efficiency but also significantly improves computational speed. By decomposing the warping function into piece‑wise linear sections, we introduce varying levels of complexity into it. Additionally, our method ensures the satisfaction of three warping constraints: boundary, monotonicity, and continuity conditions. The utilization of a deep convolutional network allows us to employ a new loss function, addressing some limitations of Dynamic Time Warping (DTW). Experimental results on the UCR Archive 2018, comprising 129 time series datasets, demonstrate that employing our approach to align signals significantly enhances classification accuracy and warping average and also reduces the run time across the majority of these datasets.
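For context on the pairwise baseline the abstract contrasts itself with, here is a minimal sketch of classic Dynamic Time Warping. It is the textbook dynamic program, not the paper's deep‑learning method; the function name `dtw_distance` and the absolute‑difference local cost are choices made for this illustration. The recurrence enforces the three constraints named above: the path starts at the first pair and ends at the last (boundary), indices never decrease (monotonicity), and each step advances each index by at most one (continuity).

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences.

    Illustrative baseline only; local cost |a[i] - b[j]| is an assumption
    made for this sketch.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # boundary condition: alignment starts at the origin

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed predecessors: step in a, step in b, or step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]  # boundary condition: alignment ends at (n, m)


# A sequence and a time-stretched copy of it align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

This pairwise program costs O(nm) per pair of signals, which is one reason joint alignment of many signals is attractive.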
Authors: Zaifu Zhan, Jun Wang, Shuang Zhou, Jiawen Deng, Rui Zhang
Abstract: Objective: To optimize in‑context learning in biomedical natural language processing by improving example selection. Methods: We introduce a novel multi‑mode retrieval‑augmented generation (MMRAG) framework, which integrates four retrieval strategies: (1) Random Mode, selecting examples arbitrarily; (2) Top Mode, retrieving the most relevant examples based on similarity; (3) Diversity Mode, ensuring variation in selected examples; and (4) Class Mode, selecting category‑representative examples. This study evaluates MMRAG on three core biomedical NLP tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Text Classification (TC). The datasets used include BC2GM for gene and protein mention recognition (NER), DDI for drug‑drug interaction extraction (RE), GIT for general biomedical information extraction (RE), and HealthAdvice for health‑related text classification (TC). The framework is tested with two large language models (Llama2‑7B, Llama3‑8B) and three retrievers (Contriever, MedCPT, BGE‑Large) to assess performance across different retrieval strategies. Results: The results from the Random mode indicate that providing more examples in the prompt improves the model's generation performance. Meanwhile, Top mode and Diversity mode significantly outperform Random mode on the RE (DDI) task, achieving an F1 score of 0.9669, a 26.4% improvement. Among the three retrievers tested, Contriever outperformed the other two in a greater number of experiments. Additionally, Llama 2 and Llama 3 demonstrated varying capabilities across different tasks, with Llama 3 showing a clear advantage in handling NER tasks. Conclusion: MMRAG effectively enhances biomedical in‑context learning by refining example selection, mitigating data scarcity issues, and demonstrating superior adaptability for NLP‑driven healthcare applications.
Authors: Yingying Sun, Jun A, Zhiwei Liu, Rui Sun, Liujia Qian, Samuel H. Payne, Wout Bittremieux, Markus Ralser, Chen Li, Yi Chen, Zhen Dong, Yasset Perez-Riverol, Asif Khan, Chris Sander, Ruedi Aebersold, Juan Antonio Vizcaíno, Jonathan R Krieger, Jianhua Yao, Han Wen, Linfeng Zhang, Yunping Zhu, Yue Xuan, Benjamin Boyang Sun, Liang Qiao, Henning Hermjakob, Haixu Tang, Huanhuan Gao, Yamin Deng, Qing Zhong, Cheng Chang, Nuno Bandeira, Ming Li, Weinan E, Siqi Sun, Yuedong Yang, Gilbert S. Omenn, Yue Zhang, Ping Xu, Yan Fu, Xiaowen Liu, Christopher M. Overall, Yu Wang, Eric W. Deutsch, Luonan Chen, Jürgen Cox, Vadim Demichev, Fuchu He, Jiaxing Huang, Huilin Jin, Chao Liu, Nan Li, Zhongzhi Luan, Jiangning Song, Kaicheng Yu, Wanggen Wan, Tai Wang, Kang Zhang, Le Zhang, Peter A. Bell, Matthias Mann, Bing Zhang, Tiannan Guo
Abstract: Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)‑based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI‑friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein‑protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi‑omics data; and ultimately enabling AI‑empowered virtual cells.
Authors: Dengdeng Huang, Shikui Tu
Abstract: Deep generative models provide a promising approach to de novo 3D peptide design. Most of them jointly model the distributions of the peptide's position, orientation, and conformation, attempting to simultaneously converge to the target pocket. However, in the early stage of docking, optimizing conformation‑only modalities such as rotation and torsion can be physically meaningless, as the peptide is initialized far from the protein pocket and no interaction field is present. We define this problem as the multimodal temporal inconsistency problem and claim it is a key factor contributing to low binding affinity in generated peptides. To address this challenge, we propose THFlow, a novel flow matching‑based multimodal generative model that explicitly models the temporal hierarchy between peptide position and conformation. It employs a polynomial‑based conditional flow to accelerate positional convergence early on, and later aligns it with rotation and torsion for coordinated conformation refinement under the emerging interaction field. Additionally, we incorporate interaction‑related features, such as polarity, to further enhance the model's understanding of peptide‑protein binding. Extensive experiments demonstrate that THFlow outperforms existing methods in generating peptides with superior stability, affinity, and diversity, offering an effective and accurate solution for advancing peptide‑based therapeutic development.
Authors: Vivek Sharma, Poulomi Sadhukhan
Abstract: Fibrinogen is a protein found in blood that forms a Fibrin polymer network to build a clot during the wound healing process when a blood vessel is cut. The fibrin fiber is highly stretchable and shows complex mechanical properties. The fibrin monomer, Fibrinogen, has a very complex structure, which is responsible for its unusual elastic behaviour. In this work, we focus on the mechanism of unfolding of the D‑domain of Fibrinogen and study its effect on the mechanical behaviour. We develop a coarse‑grained (CG) bead‑spring model for Fibrinogen which captures the unfolding of folded D‑domains along with other necessary structural properties which affect the mechanical behaviour. The results from our unfolding‑incorporated coarse‑grained polymer (UCGP) model match the experimental results. This model has the capacity to serve as the minimal unit to build a large‑scale hierarchical structure of fibrin fiber and network to possibly unfold the mystery of fibrin's unusual elastic behaviour. This model can also be used for other polymers having folded domains or sacrificial bonds.
Authors: Xinxiang Chen, Jude Ann Vishnu, Pol Besenius, Julian König, Friederike Schmid
Abstract: Protein RNA‑binding domains selectively interact with specific RNA sites, a key interaction that determines the emergent cooperative behaviors in RNA‑protein mixtures. Through molecular dynamics simulations, we investigate the impact of the specific binding interactions on the phase transitions of an exemplary RNA‑protein system and compare it with predictions of the Semenov‑Rubinstein theory of associative polymers. Our findings reveal a sol‑gel (percolation) transition without phase separation, characterized by double reentrant behavior as the RNA or protein concentration increases. We highlight the crucial role of bridge formations in driving these transitions, particularly when binding sites are saturated. The theory quantitatively predicts the binding numbers at equilibrium in the semidilute regime, but it significantly overestimates the size of the concentration range where percolation is observed. This can partly be traced back to the fact that the mean‑field assumption in the theory is not valid in the dilute regime, and that the theory neglects the existence of cycles in the connectivity graph of the percolating cluster at the sol‑gel transition. Our study enriches the understanding of RNA‑protein phase behaviors, providing valuable insights for the interpretation of experimental observations.
Authors: Yusuke Uchida, Takaaki Fukui
Abstract: Cryo‑electron tomography (cryoET) is a crucial technique for unveiling the structure of protein complexes. Automatically analyzing tomograms captured by cryoET is an essential step toward understanding cellular structures. In this paper, we introduce the 4th place solution from the CZII ‑ CryoET Object Identification competition, which was organized to advance the development of automated tomogram analysis techniques. Our solution adopted a heatmap‑based keypoint detection approach, utilizing an ensemble of two different types of 2.5D U‑Net models with depth reduction. Despite its highly unified and simple architecture, our method achieved 4th place, demonstrating its effectiveness.
Authors: Zhuoqi Zheng, Bo Zhang, Kieran Didi, Kevin K. Yang, Jason Yim, Joseph L. Watson, Hai-Feng Chen, Brian L. Trippe
Abstract: The motif‑scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif‑scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed‑backbone sequence design methods. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult compared to earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state‑of‑the‑art methods fail to identify any solution.
Authors: Owen Lailey, Maria Agustina Alais, Liuhe Wang, Pinki Chahal, David G. Cory, Timothy Khoo, Ekaterina Olkhov-Mitsel, Dusan Sarenac, Dmitry A. Pushin, Jelena Mirkovic
Abstract: Amyloidosis is a protein misfolding disease caused by the deposition of large, insoluble aggregates (amyloid fibrils) of protein in a tissue, which has been associated with various conditions, such as lymphoid disorders, Alzheimer's disease, diabetes mellitus type 2, chronic inflammatory processes, and cancers. Amyloid fibrils are commonly diagnosed by qualitative observation of green birefringence from Congo red stained biopsy tissue samples under polarized light, a technique that is limited by lack of specificity, dependence on subjective interpretation, and technical constraints. Studies emphasize the utility of quantitative polarized light microscopy (PLM) methodology to diagnose amyloid fibrils in Congo red stained tissues. However, while Congo red enhances the intrinsic birefringence of amyloid fibrillar structures, there are significant disadvantages such as the appearance of multiple non‑green colors under polarized light and binding to other structures, which may result in misdiagnoses with Congo red dye and inconclusive explanations. In this work, we present an improved PLM methodology for quantitative detection of amyloid fibrils without requiring Congo red staining. We perform PLM measurements on four tissues: abdominal subcutaneous tissue biopsy, duodenal biopsy, thyroid biopsy, and breast biopsy, both with Congo red stain and H&E stain, and through Fourier analysis quantify birefringence, birefringent axis orientation, dichroism, optical activity, and relative amyloid density. These results emphasize a quantitative analysis for amyloid diagnosis rooted in Fourier signal harmonics that does not require Congo red dye and paves the way for rapid, simple, and accurate diagnosis of amyloid fibrils.
Authors: Heng Ma, Alexander Brace, Carlo Siebenschuh, Greg Pauloski, Ian Foster, Arvind Ramanathan
Abstract: The Large Language Model agent workflow enables the LLM to invoke tool functions to increase performance on domain‑specific scientific questions. Tackling large‑scale scientific research requires access to computing resources and a parallel computing setup. In this work, we integrated Parsl into the LangChain/LangGraph tool call setup to bridge the gap between the LLM agent and the computing resources. Two tool call implementations were set up and tested on both a local workstation and the HPC environment on Polaris/ALCF. The first implementation, with a Parsl‑enabled LangChain tool node, queues the tool functions concurrently to the Parsl workers for parallel execution. The second configuration is implemented by converting the tool functions into Parsl ensemble functions, and is more suitable for large tasks in a supercomputer environment. The LLM agent workflow was prompted to run molecular dynamics simulations with different protein structures and simulation conditions. These results showed that the LLM agent tools were managed and executed concurrently by Parsl on the available computing resources.
Authors: Andrei Skalkin, Razmik Unanyan, Michael Fleischhauer
Abstract: Noise is commonly regarded as an adverse effect disrupting communication and coherent transport processes or limiting their efficiency. However, as has been shown, for example, for small light‑harvesting protein complexes, decoherence processes can play a significant role in facilitating transport processes, a phenomenon termed environment‑assisted quantum transport (ENAQT). Here we study numerically and analytically how dephasing noise improves the efficiency of spin excitation transport in a two‑dimensional lattice with small homogeneous losses. In particular, we investigate the efficiency and time of excitation transfer from a random initial site to a specific target site and show that, for system sizes below a characteristic scale, it can be substantially enhanced by adding small dephasing noise. We derive approximate analytic expressions for the efficiency which become rather accurate in the two limits of small (coherent regime) and large noise (Zeno regime) and give a very good overall estimate. These analytic expressions provide a quantitative description of ENAQT in spatially extended systems and allow one to derive conditions for its existence.
Authors: Jirka Lhotka, Daniel Probst
Abstract: Molecules have various computational representations, including numerical descriptors, strings, graphs, point clouds, and surfaces. Each representation method enables the application of various machine learning methodologies from linear regression to graph neural networks paired with large language models. To complement existing representations, we introduce the representation of molecules through vector‑valued functions, or n‑dimensional vector fields, that are parameterized by neural networks, which we denote molecular neural fields. Unlike surface representations, molecular neural fields capture external features and the hydrophobic core of macromolecules such as proteins. Compared to discrete graph or point representations, molecular neural fields are compact, resolution independent and inherently suited for interpolation in spatial and temporal dimensions. These properties inherited by molecular neural fields lend themselves to tasks including the generation of molecules based on their desired shape, structure, and composition, and the resolution‑independent interpolation between molecular conformations in space and time. Here, we provide a framework and proofs‑of‑concept for molecular neural fields, namely, the parametrization and superresolution reconstruction of a protein‑ligand complex using an auto‑decoder architecture and the embedding of molecular volumes in latent space using an auto‑encoder architecture.
Authors: Amey P. Pasarkar, Adji Bousso Dieng
Abstract: The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data‑driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores ‑‑ a family of differentiable diversity metrics rooted in ecology and quantum mechanics ‑‑ and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high‑resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the 250 million protein sequences in the protein universe, discovering that over 200 million are near‑duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than 85% of the crystals with formation energy data are near‑duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from 13 different generative models and found that the best‑performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data‑driven science.
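As a rough illustration of the diversity metric underlying this abstract, the sketch below computes an order‑1 Vendi score. It assumes the commonly stated definition, the exponential of the Shannon entropy of the eigenvalues of K/n for an n×n similarity kernel K with unit diagonal; the abstract itself does not give a formula. The Vendiscope's per‑point weights are not reproduced. The function names are hypothetical, and the small Jacobi eigenvalue routine is used only to keep the example dependency‑free.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-24):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes a[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp).
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(kernel):
    """exp(Shannon entropy) of the eigenvalues of kernel / n (assumed definition)."""
    n = len(kernel)
    scaled = [[value / n for value in row] for row in kernel]
    eigenvalues = symmetric_eigenvalues(scaled)
    # Treat numerically zero eigenvalues as 0 * log(0) = 0.
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


if __name__ == "__main__":
    distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    duplicates = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(vendi_score(distinct), vendi_score(duplicates))
```

Under this definition the score behaves like an effective number of distinct items: fully dissimilar items score n, and exact duplicates score 1.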
Authors: Zequn He, Celia Reina
Abstract: We present Epistemic Variational Onsager Diffusion Models (EVODMs), a machine learning framework that integrates Onsager's variational principle with diffusion models to enable thermodynamically consistent learning of free energy and dissipation potentials (and associated evolution equations) from noisy, stochastic data in a robust manner. By further combining the model with Epinets, EVODMs quantify epistemic uncertainty with minimal computational cost. The framework is validated through two examples: (1) the phase transformation of a coiled‑coil protein, modeled via a stochastic partial differential equation, and (2) a lattice particle process (the symmetric simple exclusion process) modeled via Kinetic Monte Carlo simulations. In both examples, we aim to discover the thermodynamic potentials that govern their dynamics in the deterministic continuum limit. EVODMs demonstrate a superior accuracy in recovering free energy and dissipation potentials from noisy data, as compared to traditional machine learning frameworks. Meanwhile, the epistemic uncertainty is quantified efficiently via Epinets and knowledge distillation. This work highlights EVODMs' potential for advancing data‑driven modeling of non‑equilibrium phenomena and uncertainty quantification for stochastic systems.
Authors: Can Chen, Karla-Luise Herpoldt, Chenchao Zhao, Zichen Wang, Marcus Collins, Shang Shang, Ron Benson
Abstract: Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity. This paper explores a sequence‑only scenario for affinity maturation, using solely antibody and antigen sequences. Recently AlphaFlow wraps AlphaFold within flow matching to generate diverse protein structures, enabling a sequence‑conditioned generative model of structure. Building on this, we propose an alternating optimization framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure‑based affinity predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence‑based affinity predictor for post selection. A key challenge is the lack of labeled data for training both predictors. To address this, we develop a co‑teaching module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence‑based predictor selects consensus samples to teach the structure‑based predictor, and vice versa. Our method, AffinityFlow, achieves state‑of‑the‑art performance in affinity maturation experiments. We plan to open‑source our code after acceptance.
Authors: Bo Ni, Markus J. Buehler
Abstract: Proteins are dynamic molecular machines whose biological functions, spanning enzymatic catalysis, signal transduction, and structural adaptation, are intrinsically linked to their motions. Designing proteins with targeted dynamic properties, however, remains a challenge due to the complex, degenerate relationships between sequence, structure, and molecular motion. Here, we introduce VibeGen, a generative AI framework that enables end‑to‑end de novo protein design conditioned on normal mode vibrations. VibeGen employs an agentic dual‑model architecture, comprising a protein designer that generates sequence candidates based on specified vibrational modes and a protein predictor that evaluates their dynamic accuracy. This approach synergizes diversity, accuracy, and novelty during the design process. Via full‑atom molecular simulations as direct validation, we demonstrate that the designed proteins accurately reproduce the prescribed normal mode amplitudes across the backbone while adopting various stable, functionally relevant structures. Notably, generated sequences are de novo, exhibiting no significant similarity to natural proteins, thereby expanding the accessible protein space beyond evolutionary constraints. Our work integrates protein dynamics into generative protein design, and establishes a direct, bidirectional link between sequence and vibrational behavior, unlocking new pathways for engineering biomolecules with tailored dynamical and functional properties. This framework holds broad implications for the rational design of flexible enzymes, dynamic scaffolds, and biomaterials, paving the way toward dynamics‑informed AI‑driven protein engineering.
Authors: Masaki Sasai, Bhaswati Bhattacharyya, Shin Fujishiro, Yoshiaki Horiike
Abstract: Understanding the interplay among processes that occur over different timescales is a challenging issue in the physics of systems regulation. In gene regulation, the timescales for changes in chromatin states can differ from those for changes in the concentration of product protein, raising questions about how to understand their coupled dynamics. In this study, we examine the effects of these different timescales on eukaryotic gene regulation using a stochastic model that describes the landscapes and probability currents of nonequilibrium fluctuations. This model shows that slow, nonadiabatic transitions of chromatin states significantly impact gene‑regulation dynamics. The simulated circular flow of the probability currents indicates a maximum entropy production when the rates of chromatin‑state transitions are low in the intensely nonadiabatic regime. In the mildly nonadiabatic regime, this circular flow fosters hysteresis, suggesting that changes in chromatin states precede changes in transcription activity. Furthermore, calculations using a model of a circuit involving three core genes in mouse embryonic stem cells illustrate how the timescale difference can tune fluctuations in individual genes. These findings highlight the rich effects of nonadiabatic chromatin‑state transitions on gene regulation in eukaryotic cells.
Authors: Andrzej Balis, Georgi Gochev, Domenico Truzzolillo, Dawid Lupa, Liliana Szyk-Warszynska, Jan Zawala
Abstract: Protein nanoparticles have been proven to be highly effective stabilizers of water‑in‑water emulsions obtained from a number of different types of aqueous two‑phase systems (ATPS). The stabilizing efficiency of such particles is attributed to their affinity to the water/water interface of the relevant ATPS, and emulsion formulations with long‑term stability have been reported in recent years. In this study we investigated the macroscopic dynamics of the early‑stage time evolution of dextran‑in‑polyethylene glycol emulsions obtained from a single ATPS and containing beta‑lactoglobulin microgel particles of various diameters (ca. 40‑190 nm). The results revealed the existence of a threshold in microgel size above which the water‑in‑water emulsion is stabilized, and that the process of segregative phase separation is determined by the interplay of droplet coalescence and sedimentation. Efficient droplet coalescence inhibition was found for microgel particles larger than 60 nm. Based on previous literature results, we discuss our coalescence‑driven phase separation data in the context of the formation of durable particle layers on the emulsion droplets and the resulting droplet‑droplet interactions.
Authors: Edith Natalia Villegas Garcia, Alessio Ansuini
Abstract: The rapid advancements in transformer‑based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAE) to interpret the internal representations of protein language models, specifically focusing on the ESM‑2 8M parameter model. By performing a statistical analysis on each latent component's relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs.
We then leverage these insights to guide sequence generation, shortlisting the relevant latent components that can steer the model towards desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
Authors: Jingjing Zhao, Chen Huang, Ali Mostaed, Amirafshar Moshtaghpour, James M. Parkhurst, Ivan Lobato, Marcus Gallagher-Jones, Judy S. Kim, Mark Boyce, David Stuart, Elena A. Andreeva, Jacques-Philippe Colletier, Angus I. Kirkland
Abstract: Phase reconstruction is important in transmission electron microscopy for structural studies. We describe electron Fourier ptychography and its application to phase reconstruction of both radiation‑resistant and beam‑sensitive materials. We demonstrate that the phase of the exit wave can be reconstructed to high resolution using a modified iterative phase retrieval algorithm using data collected in an alternative optical geometry. This method achieves a spatial resolution of 0.63 nm at a fluence of $4.5 \times 10^{2}\, e^{-}/\text{nm}^{2}$, as validated on Cry11Aa protein crystals under cryogenic conditions. Notably, this method requires no additional hardware modifications, is straightforward to implement, and can be seamlessly integrated with existing data collection software, providing a broadly accessible alternative approach to structural studies.
Authors: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Ran Bi, Kehan Wu, Wei Zhang, Kaiyuan Gao, Qizhi Pei, Qian Wang, Xixian Liu, Yanting Li, Houtian Zhu, Yeqing Lu, Mingqian Ma, Zun Wang, Tian Xie, Krzysztof Maziarz, Marwin Segler, Zhao Yang, Zilong Chen, Yu Shi, Shuxin Zheng, Lijun Wu, Chen Hu, Peggy Dai, Tie-Yan Liu, Haiguang Liu, Tao Qin
Abstract: Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence‑based science foundation model designed for scientific discovery. Pre‑trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross‑domain generation/design, such as protein‑to‑molecule and protein‑to‑RNA generation; and (iii) top performance across different domains, matching or surpassing state‑of‑the‑art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
Authors: Zicheng Liu, Siyuan Li, Zhiyuan Chen, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Xiaoming Zhang, Stan Z. Li
Abstract: The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre‑trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the guidance of the central dogma to redesign both the data and model pipeline and offers a comprehensive framework, Life‑Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi‑omics data by reverse‑transcribing RNA and reverse‑translating amino acids into nucleotide‑based sequences. As for the model, we design a codon tokenizer and a hybrid long‑sequence architecture to encode the interactions between coding and non‑coding regions through masked modeling pre‑training. To model the translation and folding process with coding sequences, Life‑Code learns protein structures of the corresponding amino acids by knowledge distillation from off‑the‑shelf protein language models. Such designs enable Life‑Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi‑omics with the central dogma. Extensive experiments show that Life‑Code achieves state‑of‑the‑art results on various tasks across three omics, highlighting its potential for advancing multi‑omics analysis and interpretation.
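The data‑flow idea in this abstract, mapping RNA and protein sequences onto a nucleotide alphabet and then tokenizing by codon, can be sketched as follows. Every name here is hypothetical. The one‑codon‑per‑residue table is an arbitrary illustrative choice among synonymous codons, since the abstract does not say how codons are chosen, and the RNA step is simplified to a U‑to‑T alphabet change rather than a complementary‑strand reverse transcription.

```python
# Illustrative only: one arbitrarily chosen codon per amino acid (standard genetic
# code). A real pipeline must decide how to pick among synonymous codons.
EXAMPLE_CODON = {
    "M": "ATG", "W": "TGG", "K": "AAA", "F": "TTT",
    "G": "GGC", "A": "GCC", "L": "CTG", "*": "TAA",
}


def rna_to_dna_alphabet(rna):
    """Rewrite an RNA string in the DNA alphabet (U -> T); a simplifying assumption."""
    return rna.upper().replace("U", "T")


def reverse_translate(protein):
    """Map each amino acid to its example codon; raises KeyError for residues not in the table."""
    return "".join(EXAMPLE_CODON[residue] for residue in protein.upper())


def codon_tokenize(nucleotides):
    """Split a nucleotide string into non-overlapping triplets, dropping any trailing remainder."""
    usable = len(nucleotides) - len(nucleotides) % 3
    return [nucleotides[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    coding = reverse_translate("MKW")
    print(coding, codon_tokenize(coding))
    print(codon_tokenize(rna_to_dna_alphabet("AUGGCCUAA")))
```

Tokenizing at the codon level keeps each token aligned with one amino acid in coding regions, which is presumably why a codon tokenizer pairs naturally with reverse‑translated protein data.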
Authors: Lirong Wu, Yunfan Liu, Haitao Lin, Yufei Huang, Guojiang Zhao, Zhifeng Gao, Stan Z. Li
Abstract: The proteins that exist today have been optimized over billions of years of natural evolution, during which nature creates random mutations and selects them. The discovery of functionally promising mutations is challenged by the limited evolutionary accessible regions, i.e., only a small region on the fitness landscape is beneficial. There have been numerous priors used to constrain protein evolution to regions of landscapes with high‑fitness variants, among which the change in binding free energy (DDG) of protein complexes upon mutations is one of the most commonly used priors. However, the huge mutation space poses two challenges: (1) how to improve the efficiency of DDG prediction for fast mutation screening; and (2) how to explain mutation preferences and efficiently explore accessible evolutionary regions. To address these challenges, we propose a lightweight DDG predictor (Light‑DDG), which adopts a structure‑aware Transformer as the backbone and enhances it by knowledge distilled from existing powerful but computationally heavy DDG predictors. Additionally, we augmented, annotated, and released a large‑scale dataset containing millions of mutation data for pre‑training Light‑DDG. We find that such a simple yet effective Light‑DDG can serve as a good unsupervised antibody optimizer and explainer. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences, which accounts for the marginal benefit of each mutation per residue. To further explore accessible evolutionary regions, we conduct preference‑guided antibody optimization and evaluate antibody candidates quickly using Light‑DDG to identify desirable mutations.
Authors: Dennis Herb, Marco Trenti, Marilena Mantela, Constantinos Simserides, Joachim Ankerhold, Mirko Rossini
Abstract: The study of DNA charge dynamics is a highly interdisciplinary field that bridges physics, chemistry, biology, and medicine, and plays a critical role in processes such as DNA damage detection, protein‑DNA interactions, and DNA‑based nanotechnology. However, despite significant advances in each of these areas, knowledge often remains inaccessible to other scientific communities, limiting the broader impact of advances across disciplines. To bridge this gap, we present QuantumDNA, an open‑source Python package for simulating DNA charge transfer (CT) and excited states using quantum‑physical methods. QuantumDNA combines an efficient Linear Combination of Atomic Orbitals (LCAO) approach with tight‑binding (TB) models, incorporating open quantum systems techniques to account for environmental effects. This approach allows rapid yet accurate analysis of large DNA ensembles, enabling statistical studies of genetic and epigenetic phenomena. To ensure accessibility, the package features a graphical user interface (GUI), making it suitable for researchers across disciplines.
Authors: Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao
Abstract: Motivation: Proteins are of great significance in living organisms. However, understanding their functions encounters numerous challenges, such as insufficient integration of multimodal information, a large number of training parameters, limited flexibility of classification‑based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework. Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein‑text adapter to compress protein information into virtual tokens based on these vectors, achieving the early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the questions to generate answers. To optimize training efficiency, we froze the encoder and employed Low‑Rank Adaptation (LoRA) techniques for the LLM. Experiments on two datasets show that both automated metrics and expert evaluations demonstrate the superior performance of our model, and zero‑shot prediction results highlight its generalization ability. The models and codes are available at https://github.com/wangzc1233/Prot2Chat. Contact: zqcao@suda.edu.cn or wangzc025@163.com Key words: Protein Q&A, Early‑Fusion, LLM
Authors: Sanket Jantre, Tianle Wang, Gilchan Park, Kriti Chopra, Nicholas Jeon, Xiaoning Qian, Nathan M. Urban, Byung-Jun Yoon
Abstract: Identification of protein‑protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty‑aware adaptation of LLMs for PPI analysis, leveraging fine‑tuned LLaMA‑3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence‑calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty‑aware LLM adaptation for advancing precision medicine and biomedical research.
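To make the ensemble‑based uncertainty quantification mentioned above concrete, the sketch below applies the standard entropy decomposition used with model ensembles. It is a generic illustration, not the authors' implementation. The function names are hypothetical, and each ensemble member is assumed to output class probabilities for a single input (for example, "interacts" versus "does not interact").

```python
import math


def entropy(probabilities):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def ensemble_uncertainty(member_probs):
    """Return (mean prediction, total uncertainty, disagreement between members).

    member_probs: one probability vector per ensemble member for the same input.
    Total uncertainty is the entropy of the averaged prediction; disagreement is
    that entropy minus the average entropy of the individual members.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]
    total = entropy(mean)
    expected = sum(entropy(p) for p in member_probs) / n_members
    return mean, total, total - expected


if __name__ == "__main__":
    confident_but_split = [[1.0, 0.0], [0.0, 1.0]]
    unsure_but_agreeing = [[0.5, 0.5], [0.5, 0.5]]
    print(ensemble_uncertainty(confident_but_split))
    print(ensemble_uncertainty(unsure_but_agreeing))
```

Both toy inputs give the same averaged prediction, but only the first shows disagreement between members, which is the component usually read as model (epistemic) uncertainty.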
Authors: Ziqi Chen, Bo Peng, Tianhua Zhai, Daniel Adu-Ampratwum, Xia Ning
Abstract: Drug development is a critical but notoriously resource‑ and time‑consuming process. In this manuscript, we develop a novel generative artificial intelligence (genAI) method DiffSMol to facilitate drug development. DiffSMol generates 3D binding molecules based on the shapes of known ligands. DiffSMol encapsulates geometric details of ligand shapes within pre‑trained, expressive shape embeddings and then generates new binding molecules through a diffusion model. DiffSMol further modifies the generated 3D structures iteratively via shape guidance to better resemble the ligand shapes. It also tailors the generated molecules toward optimal binding affinities under the guidance of protein pockets. Here, we show that DiffSMol outperforms the state‑of‑the‑art methods on benchmark datasets. When generating binding molecules resembling ligand shapes, DiffSMol with shape guidance achieves a success rate of 61.4%, substantially outperforming the best baseline (11.2%), while producing molecules with novel molecular graph structures. DiffSMol with pocket guidance also outperforms the best baseline in binding affinities by 13.2%, and even by 17.7% when combined with shape guidance. Case studies for two critical drug targets demonstrate very favorable physicochemical and pharmacokinetic properties of the generated molecules, and thus the potential of DiffSMol in developing promising drug candidates.
Authors: Jinzhen Zhu
Abstract: Simulating large‑scale protein dynamics using traditional all‑atom molecular dynamics (MD) remains computationally prohibitive. We present a unified, universal framework for coarse‑grained molecular dynamics (CG‑MD) that achieves high‑fidelity structural reconstruction and generalizes across diverse protein systems. Central to our approach is a hierarchical, tree‑structured protein representation (TSCG) that maps Cartesian coordinates into a minimal set of interpretable collective variables. We extend this representation to accommodate multi‑chain assemblies, demonstrating sub‑angstrom precision in reconstructing full‑atom structures from coarse‑grained nodes. To model temporal evolution, we formulate protein dynamics as stochastic differential equations (SDEs), utilizing a Transformer‑based architecture as a universal propagator. By representing collective variables as language‑like sequences, our model transcends the limitations of protein‑specific networks, generalizing to arbitrary sequence lengths and multi‑chain configurations. The framework achieves an acceleration of over 10,000 to 20,000 times compared to traditional MD, generating microsecond‑long trajectories within minutes. Our results show that the generated trajectories maintain statistical consistency with all‑atom MD in RMSD profiles and structural ensembles. This universal model provides a scalable solution for high‑throughput protein simulation, offering a significant leap toward a foundation model for molecular dynamics.
Authors: Etienne Goffinet, Sen Yan, Fabrizio Gabellieri, Laurence Jennings, Lydia Gkoura, Filippo Castiglione, Ryan Young, Idir Malki, Ankita Singh, Thomas Launey
Abstract: Nuclear Magnetic Resonance (NMR) spectrometry uses radio‑frequency pulses to probe the resonance of a compound's nuclei, which is then analyzed to determine its structure. The acquisition time of high‑resolution NMR spectra remains a significant bottleneck, especially for complex biological samples such as proteins. In this study, we propose a novel and efficient sub‑sampling strategy based on a diffusion model trained on protein NMR data. Our method iteratively reconstructs under‑sampled spectra while using model uncertainty to guide subsequent sampling, significantly reducing acquisition time. Compared to state‑of‑the‑art strategies, our approach improves reconstruction accuracy by 52.9%, reduces hallucinated peaks by 55.6%, and requires 60% less time in complex NMR experiments. This advancement holds promise for many applications, from drug discovery to materials science, where rapid and high‑resolution spectral analysis is critical.
Authors: Alexander Denker, Shreyas Padhy, Francisco Vargas, Johannes Hertrich
Abstract: Diffusion models are an important tool for generative modelling, serving as effective priors in applications such as imaging and protein design. A key challenge in applying diffusion models for downstream tasks is efficiently sampling from resulting posterior distributions, which can be addressed using Doob's h‑transform. This work introduces a self‑supervised algorithm for fine‑tuning diffusion models by learning the optimal control, enabling amortised conditional sampling. Our method iteratively refines the control using a synthetic dataset resampled with path‑based importance weights. We demonstrate the effectiveness of this framework on class‑conditional sampling, inverse problems and reward fine‑tuning for text‑to‑image diffusion models.
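For readers unfamiliar with Doob's h‑transform, the standard textbook statement is sketched below for a diffusion with scalar noise level. The notation (drift b, diffusion coefficient σ_t, Brownian motion W_t, conditioning observation y) is assumed here for illustration and is not taken from the abstract. The learned control in the abstract presumably plays the role of the extra drift term.

```latex
% Assumed notation: drift b, scalar diffusion coefficient \sigma_t,
% Brownian motion W_t, and an observation or event y to condition on.
\begin{aligned}
\text{unconditional process:}\quad
  & \mathrm{d}X_t = b(X_t, t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t, \\
\text{h-function:}\quad
  & h(x, t) = p(y \mid X_t = x), \\
\text{conditioned process:}\quad
  & \mathrm{d}X_t = \big[\, b(X_t, t) + \sigma_t^{2}\,\nabla_x \log h(X_t, t) \,\big]\,\mathrm{d}t
    + \sigma_t\,\mathrm{d}W_t .
\end{aligned}
```

Sampling the conditioned process yields draws from the posterior given y; the practical difficulty is that h is generally intractable, which is what motivates learning the correction term instead.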
Authors: R. Gonzalo Parra, Elizabeth A. Komives, Peter G. Wolynes, Diego U. Ferreiro
Abstract: Molecules provide the ultimate language in terms of which physiology and pathology must be understood. Myriads of proteins participate in elaborate networks of interactions and perform chemical activities coordinating the life of cells. To perform these often amazing tasks, proteins must move and we must think of them as dynamic ensembles of three dimensional structures formed first by folding the polypeptide chains so as to minimize the conflicts between the interactions of their constituent amino acids. It is apparent however that, even when completely folded, not all conflicting interactions have been resolved so the structure remains "locally frustrated". Over the last decades it has become clearer that this local frustration is not just a random accident but plays an essential part of the inner workings of protein molecules. We will review here the physical origins of the frustration concept and review evidence that local frustration is important for protein physiology, protein‑protein recognition, catalysis and allostery. Also, we highlight examples showing how alterations in the local frustration patterns can be linked to distinct pathologies. Finally we explore the extensions of the impact of frustration in higher order levels of organization of systems including gene regulatory networks and the neural networks of the brain.
Authors: Giulia Pozzi, Giulia Mazzilli, Giulia D'Arrigo, Claudia Verderio, Giuseppe Legname, Stefano Turzi, Pasquale Ciarletta
Abstract: Neurodegenerative diseases are among the leading causes of global mortality, characterized by the progressive deterioration of specific neuron populations, ultimately leading to cognitive decline and dementia. Extracellular vesicles (EVs) are believed to play a role in the early stages of these diseases, acting as carriers of pathogens and contributing to neuroinflammation and disease propagation. This study presents a mathematical model aimed at characterizing the movement of EVs bearing prion protein (PrP) on their surface along neuronal surfaces. The model, informed by experimental data, investigates the influence of PrP and actin polymerization on EV transport dynamics and explores the possible interplay between passive and active mechanisms. EVs isolated from non‑human astrocytes were analyzed under three conditions: untreated control (Ctrl), neurons treated with Cytochalasin D (CytoD‑HN), and EVs treated with Cytochalasin D (CytoD‑EV). The mathematical model is data‑driven, testing different hypotheses regarding the underlying transport mechanisms. In the CytoD‑EV dataset, EV movement was modeled using a flashing Brownian ratchet to represent directed motion. For active transport in the CytoD‑HN set, a symmetric periodic potential was used to describe EV rolling along the neuron surface. The Ctrl scenario incorporates both mechanisms, reflecting a more complex transport behavior. A sensitivity analysis and comparison between numerical predictions and experimental data suggest that the model effectively captures key features of EV motion, providing a quantitative framework to interpret different transport regimes. While some variability remains, the approach offers a promising basis for future investigations into the role of cytoskeletal dynamics in EV‑mediated disease propagation.
Authors: Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Avishek Joey Bose, Alexander Tong, Pranam Chatterjee
Abstract: Any‑order generation of discrete data using masked diffusion models (MDMs) offers a compelling alternative to traditional autoregressive models, especially in domains that lack a natural causal ordering of data. However, current popular MDMs depart from their successful continuous diffusion model counterparts with simplified masked inference wherein unmasked tokens cannot be iteratively refined ‑‑ even if there is a mistake. In this paper, we extract the full power of MDMs by introducing a novel inference sampling strategy termed Path Planning (P2) that decomposes each generation step into two sub‑stages: planning and denoising. Under P2, the planner at every step selects appropriate tokens that are marked to be updated, which can then be sampled using the denoiser. We demonstrate that P2 generalizes all existing sampling strategies for MDMs and critically enhances generative quality through the new capability of refining and updating existing unmasked tokens. We theoretically prove that P2 establishes a (new) expanded evidence lower bound (ELBO) on the log marginal likelihood of data. We instantiate P2 with a family of planners including (1) Self‑Planning, (2) BERT‑Planning, and (3) Trained‑Planning with a learned planner, leading to SOTA generative performance for MDMs on a suite of domains. Specifically, solely using P2 inference, we observe relative improvements of 22% in protein sequence foldability, 8% in RNA sequence pLDDT, 4% in math reasoning, 68% in story generation (ROUGE score), and 33% in code generation for the challenging pass@1 metric.
Authors: Tianyang Wang, Silin Chen, Yunze Wang, Yichao Zhang, Xinyuan Song, Ziqian Bi, Ming Liu, Qian Niu, Junyu Liu, Pohsun Feng, Xintian Sun, Charles Zhang, Keyu Chen, Ming Li, Cheng Fei, Lawrence KQ Yan, Riyang Bao, Ziyuan Qin, Chong Jiang, Zekun Jiang, Benji Peng
Abstract: The integration of bioinformatics predictions and experimental validation plays a pivotal role in advancing biological research, from understanding molecular mechanisms to developing therapeutic strategies. Bioinformatics tools and methods offer powerful means for predicting gene functions, protein interactions, and regulatory networks, but these predictions must be validated through experimental approaches to ensure their biological relevance. This review explores the various methods and technologies used for experimental validation, including gene expression analysis, protein‑protein interaction verification, and pathway validation. We also discuss the challenges involved in translating computational predictions to experimental settings and highlight the importance of collaboration between bioinformatics and experimental research. Finally, emerging technologies, such as CRISPR gene editing, next‑generation sequencing, and artificial intelligence, are shaping the future of bioinformatics validation and driving more accurate and efficient biological discoveries.
Authors: Louis-Alexandre Leger, Maxine Leonardi, Andrea Salati, Felix Naef, Martin Weigert
Abstract: Understanding cell cycle dynamics is crucial for studying biological processes such as growth, development and disease progression. While fluorescent protein reporters like the Fucci system allow live monitoring of cell cycle phases, they require genetic engineering and occupy additional fluorescence channels, limiting broader applicability in complex experiments. In this study, we conduct a comprehensive evaluation of deep learning methods for predicting continuous Fucci signals using non‑fluorescence brightfield imaging, a widely available label‑free modality. To that end, we generated a large dataset of 1.3 M images of dividing RPE1 cells with full cell cycle trajectories to quantitatively compare the predictive performance of distinct model categories including single time‑frame models, causal state space models and bidirectional transformer models. We show that both causal and transformer‑based models significantly outperform single‑ and fixed frame approaches, enabling the prediction of visually imperceptible transitions like G1/S within 1h resolution. Our findings underscore the importance of sequence models for accurate predictions of cell cycle dynamics and highlight their potential for label‑free imaging.
Authors: Jianming Huang, Hiroyuki Kasai
Abstract: Graph data, with its structurally variable nature, represents complex real‑world phenomena like chemical compounds, protein structures, and social networks. Traditional Graph Neural Networks (GNNs) primarily utilize the message‑passing mechanism, but their expressive power is limited and their predictions lack explainability. To address these limitations, researchers have focused on graph substructures. Subgraph neural networks (SGNNs) and GNN explainers have emerged as potential solutions, but each has its limitations. SGNNs compute graph representations based on bags of subgraphs to enhance expressive power. However, they often rely on predefined algorithm‑based sampling strategies, which are inefficient. GNN explainers adopt data‑driven approaches to generate important subgraphs that provide explanations. Nevertheless, their explanations are difficult to translate into practical improvements on GNNs. To overcome these issues, we propose a novel self‑supervised framework that integrates SGNNs with the generation approach of GNN explainers, named the Reinforcement Walk Exploration SGNN (RWE‑SGNN). Our approach features a sampling model trained in an explainer fashion, optimizing subgraphs to enhance model performance. To achieve a data‑driven sampling approach, unlike traditional subgraph generation approaches, we propose a novel walk exploration process, which efficiently extracts important substructures, simplifying the embedding process and avoiding isomorphism problems. Moreover, we prove that our proposed walk exploration process has equivalent generation capability to the traditional subgraph generation process. Experimental results on various graph datasets validate the effectiveness of our proposed method, demonstrating significant improvements in performance and precision.
Authors: Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
Abstract: While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)‑invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this theory using both simulated 3D points and in the context of masked token prediction for proteins. Finally, we show that pre‑training protein Transformer encoders with structure improves performance on a downstream task, yielding better performance than custom structural models. Together, this work provides a basis for using standard Transformers as hybrid structure‑language models.
Authors: Amitay Sicherman, Kira Radinsky
Abstract: Computational prediction of enzymatic reactions represents a crucial challenge in sustainable chemical synthesis across various scientific domains, ranging from drug discovery to materials science and green chemistry. These syntheses rely on proteins that selectively catalyze complex molecular transformations. These protein catalysts exhibit remarkable substrate adaptability, with the same protein often catalyzing different chemical transformations depending on its molecular partners. Current approaches to protein representation in reaction prediction either ignore protein structure entirely or rely on static embeddings, failing to capture how proteins dynamically adapt their behavior to different substrates. We present Docking‑Aware Attention (DAA), a novel architecture that generates dynamic, context‑dependent protein representations by incorporating molecular docking information into the attention mechanism. DAA combines physical interaction scores from docking predictions with learned attention patterns to focus on protein regions most relevant to specific molecular interactions. We evaluate our method on enzymatic reaction prediction, where it outperforms previous state‑of‑the‑art methods, achieving 62.2% accuracy versus 56.79% on complex molecules and 55.54% versus 49.45% on innovative reactions. Through detailed ablation studies and visualizations, we demonstrate how DAA generates interpretable attention patterns that adapt to different molecular contexts. Our approach represents a general framework for context‑aware protein representation in biocatalysis prediction, with potential applications across enzymatic synthesis planning. We open‑source our implementation and pre‑trained models to facilitate further research.
Authors: Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin
Abstract: Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low‑dimensional, image‑based and high MI, and on real‑world data, i.e., protein language model embeddings.
Authors: Maike Scherer, Lukas Brand, Louis Wolf, Teena tom Dieck, Maximilian Schäfer, Sebastian Lotter, Andreas Burkovski, Heinrich Sticht, Robert Schober, Kathrin Castiglione
Abstract: We present a fluid‑based experimental molecular communication (MC) testbed which uses media modulation. Motivated by the natural human cardiovascular system, the testbed operates in a closed‑loop tube system. The proposed system is designed to be biocompatible, resource‑efficient, and controllable from outside the tube. As the signaling molecule, the testbed employs the green fluorescent protein variant "Dreiklang" (GFPD). GFPDs can be reversibly switched via light of different wavelengths between a bright fluorescent state and a less fluorescent state. GFPDs in solution are filled into the testbed prior to the start of information transmission and remain there for an entire experiment. For information transmission, an optical transmitter (TX) and an optical eraser (EX), which are located outside the tube, are used to write and erase the information encoded in the state of the GFPDs, respectively. At the receiver (RX), the state of the GFPDs is read out by fluorescence detection. In our testbed, due to the closed‑loop setup, we observe new forms of inter‑symbol interference (ISI), which do not occur in short experiments and open‑loop systems. For the testbed, we developed a communication scheme, which includes blind transmission start detection, symbol‑by‑symbol synchronization, and adaptive threshold detection. We comprehensively analyze our MC experiments using different performance metrics. Moreover, we experimentally demonstrate the error‑free transmission of 5370 bit at a data rate of $36\,\mathrm{bit}\,\mathrm{min}^{-1}$ using 8‑ary modulation and the error‑free binary transmission of around 90000 bit at a data rate of $12\,\mathrm{bit}\,\mathrm{min}^{-1}$. For the latter experiment, data was transmitted for a period of 125 hours. All signals recorded and parts of the evaluation code are publicly available on Zenodo and Github, respectively.
Authors: Arijit Khan, Xiangyu Ke, Yinghui Wu
Abstract: The ubiquity of machine learning, particularly deep learning, applied to graphs is evident in applications ranging from cheminformatics (drug discovery) and bioinformatics (protein interaction prediction) to knowledge graph‑based query answering, fraud detection, and social network analysis. Concurrently, graph data management deals with the research and development of effective, efficient, scalable, robust, and user‑friendly systems and algorithms for storing, processing, and analyzing vast quantities of heterogeneous and complex graph data. Our survey provides a comprehensive overview of the synergies between graph data management and graph machine learning, illustrating how they intertwine and mutually reinforce each other across the entire spectrum of the graph data science and machine learning pipeline. Specifically, the survey highlights two crucial aspects: (1) How graph data management enhances graph machine learning, including contributions such as improved graph neural network performance through graph data cleaning, scalable graph embedding, efficient graph‑based vector data management, robust graph neural networks, user‑friendly explainability methods; and (2) how graph machine learning, in turn, aids in graph data management, with a focus on applications like query answering over knowledge graphs and various data science tasks. We discuss pertinent open problems and delineate crucial research directions.
Authors: Md Rownak Hossain Chowdhury, Mostafizur Rahman
Abstract: Addressing the growing demands of artificial intelligence (AI) and data analytics requires new computing approaches. In this paper, we propose a reconfigurable hardware accelerator designed specifically for AI and data‑intensive applications. Our architecture features a messaging‑based intelligent computing scheme that allows for dynamic programming at runtime using a minimal instruction set. To assess our hardware's effectiveness, we conducted a case study in TSMC 28nm technology node. The simulation‑based study involved analyzing a protein network using the computationally demanding PageRank algorithm. The results demonstrate that our hardware can analyze a 5,000‑node protein network in just 213.6 milliseconds over 100 iterations. These outcomes signify the potential of our design to achieve cutting‑edge performance in next‑generation AI applications.
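For readers unfamiliar with the workload in the case study above, the following is a minimal software sketch of PageRank by power iteration. It is not the accelerator's implementation: the function name `pagerank`, the damping factor 0.85, the dense-matrix representation, and the 4-node toy graph are all assumptions made for illustration. Only the fixed budget of 100 iterations mirrors the abstract.

```python
import numpy as np

def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dense adjacency matrix.

    adj[i, j] = 1 means an edge from node i to node j.
    The damping value and dense layout are illustrative assumptions.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    # Row-normalise; nodes without out-edges jump uniformly to every node.
    transition = np.where(out_degree[:, None] > 0,
                          adj / np.maximum(out_degree[:, None], 1.0),
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = (1.0 - damping) / n + damping * (transition.T @ rank)
    return rank

# Hypothetical 4-protein interaction graph (undirected, so symmetric).
toy_network = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
scores = pagerank(toy_network)
```

Each iteration is one matrix-vector product, which is why the iteration count and the network size together determine the runtime that the abstract reports.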
Authors: Lea Bogensperger, Dominik Narnhofer, Ahmed Allam, Konrad Schindler, Michael Krauthammer
Abstract: The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state‑of‑the‑art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug‑and‑play framework that can be easily customized to suit various protein design tasks.
Authors: Lorenzo Rosset, Roberto Netti, Anna Paola Muntoni, Martin Weigt, Francesco Zamponi
Abstract: In this methods article, we provide a flexible but easy‑to‑use implementation of Direct Coupling Analysis (DCA) based on Boltzmann machine learning, together with a tutorial on how to use it. The package adabmDCA 2.0 is available in different programming languages (C++, Julia, Python) usable on different architectures (single‑core and multi‑core CPU, GPU) using a common front‑end interface. In addition to several learning protocols for dense and sparse generative DCA models, it directly addresses common downstream tasks like residue‑residue contact prediction, mutational‑effect prediction, scoring of sequence libraries and generation of artificial sequences for sequence design. It is readily applicable to protein and RNA sequence data.
Authors: Hiroshi Noguchi
Abstract: Membrane proteins are crucial in regulating biomembrane shapes and controlling the dynamic changes in membrane morphology during essential cellular processes. These proteins can localize to regions with their preferred curvatures (curvature sensing) and induce localized membrane curvature. Thus, this review describes the recent theoretical development in membrane remodeling performed by membrane proteins. The mean‑field theories of protein binding and the resulting membrane deformations are reviewed. The effects of hydrophobic insertions on the area‑difference elasticity energy and that of intrinsically disordered protein domains on the membrane bending energy are discussed. For the crescent‑shaped proteins, such as Bin/Amphiphysin/Rvs superfamily proteins, anisotropic protein bending energy and orientation‑dependent excluded volume significantly contribute to curvature sensing and generation. Moreover, simulation studies of membrane deformations caused by protein binding are reviewed, including domain formation, budding, and tubulation.
Authors: Manuel F. Mollon, Joaquin Gonzalez-Rodriguez, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano
Abstract: In this study, we expand upon the FLIP benchmark, designed for evaluating protein fitness prediction models in small, specialized prediction tasks, by assessing the performance of state‑of‑the‑art large protein language models, including ESM‑2 and SaProt, on the FLIP dataset. Unlike larger, more diverse benchmarks such as ProteinGym, which cover a broad spectrum of tasks, FLIP focuses on constrained settings where data availability is limited. This makes it an ideal framework to evaluate model performance in scenarios with scarce task‑specific data. We investigate whether recent advances in protein language models lead to significant improvements in such settings. Our findings provide valuable insights into the performance of large‑scale models in specialized protein prediction tasks.
Authors: Jiang Li, Yuan-Ting Li
Abstract: Identifying protein‑protein interactions (PPI) is crucial for gaining in‑depth insights into numerous biological processes within cells and holds significant guiding value in areas such as drug development and disease treatment. Currently, most PPI prediction methods focus primarily on the study of protein sequences, neglecting the critical role of the internal structure of proteins. This paper proposes a novel PPI prediction method named MgslaPPI, which utilizes graph attention to mine protein structural information and enhances the expressive power of the protein encoder through a multitask learning strategy. Specifically, we decompose the end‑to‑end PPI prediction process into two stages: amino acid residue reconstruction (A2RR) and protein interaction prediction (PIP). In the A2RR stage, we employ a graph attention‑based residue reconstruction method to explore the internal relationships and features of proteins. In the PIP stage, in addition to the basic interaction prediction task, we introduce two auxiliary tasks, i.e., protein feature reconstruction (PFR) and masked interaction prediction (MIP). The PFR task aims to reconstruct the representation of proteins in the PIP stage, while the MIP task uses partially masked protein features for PPI prediction, with both working in concert to prompt MgslaPPI to capture more useful information. Experimental results demonstrate that MgslaPPI significantly outperforms existing state‑of‑the‑art methods under various data partitioning schemes.
Authors: Monika Ghalawat, Virendra Kumar Meena, Sharda Prasad, Pankaj Poddar, Atanu Basu
Abstract: The spike protein (SP) of SARS‑CoV‑2 is the major molecular target for diagnostic tests, vaccines, and therapeutic development. We used a combination of transmission electron microscopy (TEM) and surface enhanced Raman microscopy (SERS) to study its structure. Using SERS on an aluminum substrate, we were able to detect a characteristic spectrum of SP mostly due to vibration of three aromatic amino acids producing Raman shifts at 466 cm⁻¹, 524 cm⁻¹, 773 cm⁻¹, 831 cm⁻¹, 1048 cm⁻¹, 1308 cm⁻¹, 1457 cm⁻¹, and 1610 cm⁻¹. Transmission Electron Microscopy (TEM) of the SP showed periodic 2D‑lattice orientation. The findings from this study have translational value for developing surface‑enhanced Raman spectroscopy (SERS) based detectors for screening and testing SARS‑CoV‑2 signatures in diagnostic settings and contamination tracking.
Authors: Xiaoqing Lian, Jie Zhu, Tianxu Lv, Shiyun Nie, Hang Fan, Guosheng Wu, Yunjun Ge, Lihua Li, Xiangxiang Zeng, Xiang Pan
Abstract: Significant differences in protein structures hinder the generalization of existing drug‑target interaction (DTI) models, which often rely heavily on pre‑learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive‑Associative pipeline inspired by the workflow of scientists who base their accumulated expertise on drawing insights into novel drug‑target pairs from weakly related references. BioBridge predicts novel drug‑target interactions using limited sequence data, incorporating multi‑level encoders with adversarial training to accumulate transferable binding principles. On the basis of these principles, BioBridge employs a dynamic prototype meta‑learning framework to associate insights from weakly related annotations, enabling robust predictions for previously unseen drug‑target pairs. Extensive experiments demonstrate that BioBridge surpasses existing models, especially for unseen proteins. Notably, when only homologous protein binding data is available, BioBridge proves effective for virtual screening of the epidermal growth factor receptor and adenosine receptor, underscoring its potential in drug discovery.
Authors: Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
Abstract: Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval‑Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein‑Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations in previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large‑scale knowledge graph (KG)‑based retrieve‑divide‑solve agent pipeline RAG framework to support large‑scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub‑tasks focused on the analysis of PPI edges.
Authors: Maximilian C. Hübl, Thomas E. Videbæk, Daichi Hayakawa, W. Benjamin Rogers, Carl P. Goodrich
Abstract: Experiments have reached a monumental capacity for designing and synthesizing microscopic particles for self‑assembly, making it possible to precisely control particle concentrations, shapes, and interactions. However, more physical insight is needed before we can take full advantage of this vast design space to assemble nanostructures with complex form and function. Here we show how a significant part of this design space can be quickly and comprehensively understood by identifying a class of thermodynamic constraints that act on it. These thermodynamic constraints form a high‑dimensional convex polyhedron that determines which nanostructures can be assembled at high equilibrium yield and reveals limitations that govern the coexistence of structures, which we verify through detailed, quantitative assembly experiments of nanoscale particles synthesized using DNA origami. Strong experimental agreement confirms the importance of the polyhedral structure and motivates its use as a predictive tool for the rational design of self‑assembly. These results uncover fundamental physical relationships underpinning many‑component programmable self‑assembly in equilibrium and form the basis for robust inverse‑design, applicable to a wide array of systems from biological protein complexes to synthetic nanomachines.
Authors: Michaela Cohrs, Shiwoo Koak, Yejin Lee, Yu Jin Sung, Wesley De Neve, Hristo L. Svilenov, Utku Ozbulak
Abstract: Protein‑based therapeutics play a pivotal role in modern medicine targeting various diseases. Despite their therapeutic importance, these products can aggregate and form subvisible particles (SvPs), which can compromise their efficacy and trigger immunological responses, emphasizing the critical need for robust monitoring techniques. Flow Imaging Microscopy (FIM) has been a significant advancement in detecting SvPs, evolving from monochrome to more recently incorporating color imaging. Complementing SvP images obtained via FIM, deep learning techniques have recently been employed successfully for stress source identification of monochrome SvPs. In this study, we explore the potential of color FIM to enhance the characterization of stress sources in SvPs. To achieve this, we curate a new dataset comprising 16,000 SvPs from eight commercial monoclonal antibodies subjected to heat and mechanical stress. Using both supervised and self‑supervised convolutional neural networks, as well as vision transformers in large‑scale experiments, we demonstrate that deep learning with color FIM images consistently outperforms monochrome images, thus highlighting the potential of color FIM in stress source classification compared to its monochrome counterparts.
Authors: Jiaqi Guan, Jiahan Li, Xiangxin Zhou, Xingang Peng, Sheng Wang, Yunan Luo, Jian Peng, Jianzhu Ma
Abstract: Molecular docking is a key task in computational biology that has attracted increasing interest from the machine learning community. While existing methods have achieved success, they generally treat each protein‑ligand pair in isolation. Inspired by the biochemical observation that ligands binding to the same target protein tend to adopt similar poses, we propose GroupBind, a novel molecular docking framework that simultaneously considers multiple ligands docking to a protein. This is achieved by introducing an interaction layer for the group of ligands and a triangle attention module for embedding protein‑ligand and group‑ligand pairs. By integrating our approach with a diffusion‑based docking model, we set a new state‑of‑the‑art performance on the PDBBind blind docking benchmark, demonstrating the effectiveness of our proposed molecular docking paradigm.
Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu
Abstract: Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pre‑trained protein large language models (LLMs) have shown promising results on protein sequence generation. However, to control sequence generation for specific attributes, existing work still exhibits poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi‑listwise preference optimization strategy to improve generation quality and support multi‑attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state‑of‑the‑art performance in both single‑attribute and multi‑attribute protein sequence generation.
Authors: Taslim Murad, Prakash Chourasia, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
Abstract: The availability of SARS‑CoV‑2 (severe acute respiratory syndrome coronavirus 2) virus data post‑COVID has grown exponentially to an enormous magnitude, opening research doors to analyze its behavior. Researchers have conducted various studies, such as genomic surveillance, to gain a deeper understanding of the virus so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc.) creates challenges in designing analytical systems for it. Therefore, we propose a neural network‑based (NN) mechanism to perform an efficient analysis of the SARS‑CoV‑2 data, as NNs exhibit generalized behavior upon training. Moreover, rather than using the full‑length genome of the virus, we apply our method to its spike region, as this region is known to have predominant mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed‑length numerical representation and then uses a Neuromorphic Spiking Neural Network to classify those sequences. We compare the performance of our method with various baselines using real‑world SARS‑CoV‑2 spike sequence data and show that our method is able to achieve higher predictive accuracy compared to the recent baselines.
Authors: Michael Fuest, Alfredo Cuesta, Kalyan Veeramachaneni
Abstract: Recent breakthroughs in large‑scale generative modeling have demonstrated the potential of foundation models in domains such as natural language, computer vision, and protein structure prediction. However, their application in the energy and smart grid sector remains limited due to the scarcity and heterogeneity of high‑quality data. In this work, we propose a method for creating high‑fidelity electricity consumption time series data for rare and unseen context variables (e.g. location, building type, photovoltaics). Our approach, Context Encoding and Normalizing Time Series Generation, or CENTS, includes three key innovations: (i) A context normalization approach that enables inverse transformation for time series context variables unseen during training, (ii) a novel context encoder to condition any state‑of‑the‑art time‑series generator on arbitrary numbers and combinations of context variables, (iii) a framework for training this context encoder jointly with a time‑series generator using an auxiliary context classification loss designed to increase expressivity of context embeddings and improve model performance. We further provide a comprehensive overview of different evaluation metrics for generative time series models. Our results highlight the efficacy of the proposed method in generating realistic household‑level electricity consumption data, paving the way for training larger foundation models in the energy domain on synthetic as well as real‑world data.
Authors: Mohammadamin Safdari, Siyu Li, Sanaz Panahandeh, Paul van der Schoot, Roya Zandi
Abstract: The encapsulation of polyanions, whether single‑stranded RNAs or synthetic polymers, is primarily driven by attractive electrostatic interactions between the positively charged, structurally disordered RNA‑binding domains of virus coat proteins and the negatively charged polyanions. Theoretically, this interaction is often modeled by coarse‑graining the charge distribution of the binding domains, either by projecting the charges onto the inner surface of the protein shell or by spreading them across a region representing the capsid lumen where the binding domains are located. In practice, however, the positive charges are not uniformly distributed across the binding domains, which themselves are positioned at discrete, specific sites on the shell surface. Here, we use molecular dynamics simulations to investigate the impact of localized interactions on the most probable or optimal length of the encapsulated polymer, revealing that the specific location of charges along the binding domains plays a significant role, consistent with experimental observations. Comparing the simulations with predictions from a simple mean‑field theory taken from the literature, we find that while the general trends are reasonably well captured, quantitative discrepancies arise between the two approaches.
Authors: Zhiyu Zhang, Yongjian Zhu, Liang Dai
Abstract: Knotted molecules occur naturally and are designed by scientists to gain special biological and material properties. Understanding and utilizing knotting require efficient methods to recognize and generate knotted structures, which are unsolved problems in mathematics and physics. Here, we solve these two problems using machine learning. First, our Transformer‑based neural network (NN) can recognize the knot types of given chain conformations with an accuracy of >99%. We can use a single NN model to recognize knots with different chain lengths, and our computational speed is about 4500 times faster than the most popular mathematical method for knot recognition: the Alexander polynomials. Second, we for the first time design a diffusion‑based NN model to generate conformations for given knot types. The generated conformations satisfy not only the desired knot types, but also the correct physical distributions of the radii of gyration and knot sizes. The results have several implications. First, the Transformer is suitable for handling knotting tasks, probably because of its strength in processing sequence information, a key component in knotting. Second, our NN can replace mathematical methods of knot recognition for faster speed on many occasions. Third, our models can facilitate the design of knotted protein structures. Lastly, analyzing how NN recognizes knot types can provide insight into the principle behind knots, an unsolved problem in mathematics. We provide an online website (http://144.214.24.236) for using our models.
Authors: Jesus Renero, Idoia Ochoa, Roberto Maestre
Abstract: Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state‑of‑the‑art causal discovery methods across diverse data generation processes, including non‑linear and additive noise models. Moreover, ReX was tested on the Sachs single‑cell protein‑signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taken together, these results showcase ReX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real‑world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.
Authors: Darin Tsui, Kunal Talreja, Amirali Aghazadeh
Abstract: Computing the Fourier transform of a q‑ary function $f:\mathbb{Z}_q^n \rightarrow \mathbb{R}$, which maps q‑ary sequences to real numbers, is an important problem in mathematics with wide‑ranging applications in biology, signal processing, and machine learning. Previous studies have shown that, under the sparsity assumption, the Fourier transform can be computed efficiently using fast and sample‑efficient algorithms. However, in most practical settings, the function is defined over a more general space ‑‑ the space of generalized q‑ary sequences $\mathbb{Z}_{q_1} \times \mathbb{Z}_{q_2} \times \cdots \times \mathbb{Z}_{q_n}$ ‑‑ where each $\mathbb{Z}_{q_i}$ corresponds to integers modulo $q_i$. Herein, we develop GFast, a coding theoretic algorithm that computes the $S$‑sparse Fourier transform of $f$ with a sample complexity of $O(Sn)$, computational complexity of $O(Sn \log N)$, and a failure probability that approaches zero as $N=\prod_{i=1}^n q_i \rightarrow \infty$ with $S = N^\delta$ for some $0 \leq \delta < 1$. We show that a noise‑robust version of GFast computes the transform with a sample complexity of $O(Sn^2)$ and computational complexity of $O(Sn^2 \log N)$ under the same high probability guarantees. Additionally, we demonstrate that GFast computes the sparse Fourier transform of generalized q‑ary functions 8× faster using 16× fewer samples on synthetic experiments, and enables explaining real‑world heart disease diagnosis and protein fitness models using up to 13× fewer samples compared to existing Fourier algorithms applied to the most efficient parameterization of the models as q‑ary functions.
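To make the object in this abstract concrete, the following is a minimal sketch of the dense (non-sparse) Fourier transform of a function on a generalized q‑ary space, which is just a multidimensional DFT with one axis per alphabet. It is not GFast: it reads all N function values, which is exactly the cost a sparse algorithm avoids. The function name `generalized_qary_fourier`, the 1/N normalization, and the toy alphabet sizes are assumptions made for illustration.

```python
import numpy as np


def generalized_qary_fourier(f_values: np.ndarray) -> np.ndarray:
    """Dense Fourier transform of f over Z_{q_1} x ... x Z_{q_n}.

    f_values has shape (q_1, ..., q_n); entry [x_1, ..., x_n] holds f(x).
    The 1/N normalization is an assumed convention, not taken from the paper.
    """
    return np.fft.fftn(f_values) / f_values.size


# Toy example with alphabet sizes q = (2, 3, 4), so N = 24 (hypothetical values).
qs = (2, 3, 4)
x1, x2, x3 = np.meshgrid(*[np.arange(q) for q in qs], indexing="ij")

# f = constant + one real cosine mode, i.e. three nonzero Fourier coefficients.
f = 1.0 + 2.0 * np.cos(2 * np.pi * (1 * x1 / 2 + 2 * x2 / 3))

F = generalized_qary_fourier(f)

# Indices (k_1, k_2, k_3) of the nonzero coefficients: the sparse support.
support = np.argwhere(np.abs(F) > 1e-9)
print(support.tolist())
```

Here the dense transform costs O(N log N) and needs every one of the N samples; the abstract's claim is that when only S coefficients are nonzero, the support and values can be recovered from far fewer samples.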
Authors: Eugenio Borzone, Leandro Di Persia, Matias Gerard
Abstract: This paper presents a novel graph‑based deep learning model for tasks involving relations between two nodes (edge‑centric tasks), where the focus lies on predicting relationships and interactions between pairs of nodes rather than node properties themselves. This model combines supervised and self‑supervised learning, with a loss function that accounts for the learned embeddings and for patterns both with and without ground truth. Additionally, it incorporates an attention mechanism that leverages both node and edge features. The architecture, trained end‑to‑end, comprises two primary components: embedding generation and prediction. First, a graph neural network (GNN) transforms raw node features into dense, low‑dimensional embeddings, incorporating edge attributes. Then, a feedforward neural model processes the node embeddings to produce the final output. Experiments demonstrate that our model matches or exceeds existing methods for protein‑protein interaction prediction and Gene Ontology (GO) term prediction. The model also performs effectively with one‑hot encoding for node features, providing a solution for the previously unsolved problem of predicting similarity between compounds with unknown structures.
Authors: Nitin Malapally, Marta Devodier, Giulia Rossetti, Paolo Carloni, Davide Mandelli
Abstract: Molecular dynamics (MD)‑based path sampling algorithms are a very important class of methods used to study the energetics and kinetics of rare (bio)molecular events. They sample the highly informative but highly unlikely reactive trajectories connecting different metastable states of complex (bio)molecular systems. The metadynamics of paths (MoP) method proposed by Mandelli, Hirshberg, and Parrinello [Phys. Rev. Lett. 125 2, 026001 (2020)] is based on the Onsager‑Machlup path integral formalism. This provides an analytical expression for the probability of sampling stochastic trajectories of given duration. In practice, the method samples reactive paths via metadynamics simulations performed directly in the phase space of all possible trajectories. Its parallel implementation is in principle infinitely scalable, allowing arbitrarily long trajectories to be simulated. Paving the way for future applications to study the thermodynamics and kinetics of protein‑ligand (un)binding, a problem of great pharmaceutical interest, we present here the efficient implementation of MoP in the HPC‑oriented biomolecular simulation software GROMACS. Our benchmarks on a membrane protein (150,000 atoms) show an unprecedented weak scaling parallel efficiency of over 70% up to 3200 GPUs on the pre‑exascale JUWELS Booster machine at the Jülich Supercomputing Center.
Authors: Masaaki Tsubouchi, Nobuhisa Ishii, Takatoshi Fujita, Motoyasu Adachi, Ryuji Itakura
Abstract: Phycobilisomes are antenna protein complexes in cyanobacteria and red algae. In phycobilisomes, energy transfer is unidirectional with an extremely high quantum efficiency close to unity. We investigate intraprotein energy relaxation and quantum coherence of constituent chromoproteins of allophycocyanin (APC) and two kinds of C‑phycocyanin (CPC) in phycobilisomes using two‑dimensional electronic spectroscopy (2D‑ES). These chromoproteins have similar adjacent pairs of pigments α84 and β84, which are excited to delocalized exciton states. However, the kinetics and coherence of exciton states are significantly different from each other. Even CPCs with almost the same molecular structure display significantly different spectra and kinetics when the locations in the phycobilisome are different. This difference may be one of the key mechanisms for the efficient and unidirectional energy transfer in phycobilisomes. We observe low‑frequency coherent vibrational motion of approximately 200 cm⁻¹ with large amplitude and a decay time of 200 fs. The wave packet motion involving energy relaxation and oscillatory motions on the potential energy surface of the exciton state is clearly visualized using beat‑frequency‑resolved 2D‑ES.
Authors: Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, Qing Li
Abstract: Considering the significance of proteins, computational protein science has always been a critical scientific field, dedicated to revealing knowledge and developing applications within the protein sequence‑structure‑function paradigm. In the last few decades, Artificial Intelligence (AI) has made significant impacts in computational protein science, leading to notable successes in specific protein modeling tasks. However, those previous AI models still meet limitations, such as the difficulty in comprehending the semantics of protein sequences, and the inability to generalize across a wide range of protein modeling tasks. Recently, LLMs have emerged as a milestone in AI due to their unprecedented language processing & generalization capability. They can promote comprehensive progress in fields rather than solving individual tasks. As a result, researchers have actively introduced LLM techniques in computational protein science, developing protein Language Models (pLMs) that skillfully grasp the foundational knowledge of proteins and can be effectively generalized to solve a diversity of sequence‑structure‑function reasoning problems. While witnessing prosperous developments, it's necessary to present a systematic overview of computational protein science empowered by LLM techniques. First, we summarize existing pLMs into categories based on their mastered protein knowledge, i.e., underlying sequence patterns, explicit structural and functional information, and external scientific languages. Second, we introduce the utilization and adaptation of pLMs, highlighting their remarkable achievements in promoting protein structure prediction, protein function prediction, and protein design studies. Then, we describe the practical application of pLMs in antibody design, enzyme design, and drug discovery. Finally, we specifically discuss the promising future directions in this fast‑growing field.
Authors: Sajjad Saleem, Adil Hussain, Nabila Majeed, Zahid Akhtar, Kamran Siddique
Abstract: Wheat is an important source of dietary fiber and protein that is negatively impacted by a number of risks to its growth. The difficulty of identifying and classifying wheat diseases is discussed with an emphasis on wheat loose smut, leaf rust, and crown and root rot. Addressing conditions like crown and root rot, this study introduces an innovative approach that integrates multi‑scale feature extraction with advanced image segmentation techniques to enhance classification accuracy. The proposed method trains the neural network models Xception, Inception V3, and ResNet 50 on the large wheat disease classification dataset 2020, in conjunction with an ensemble of machine vision classifiers, including voting and stacking. The study shows that the suggested methodology achieves an accuracy of 99.75% in the classification of wheat diseases, exceeding current state‑of‑the‑art approaches. Among the deep learning models, Xception showed the highest accuracy.
Authors: Lucas Laird, Circe Hsu, Asilata Bapat, Robin Walters
Abstract: Group theory has been used in machine learning to provide a theoretically grounded approach for incorporating known symmetry transformations in tasks from robotics to protein modeling. In these applications, equivariant neural networks use known symmetry groups with predefined representations to learn over geometric input data. We propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations. MatrixNet achieves higher sample efficiency and generalization over several standard baselines in prediction tasks over several finite groups and the Artin braid group. We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than in the training set.
Authors: Yinkai Wang, Jiaxing He, Yuanqi Du, Xiaohui Chen, Jianan Canal Li, Li-Ping Liu, Xiaolin Xu, Soha Hassoun
Abstract: We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild‑type sequence. Directed evolution, an iterative process that generates variants and selects among them via experimental feedback, has been a dominant paradigm in this field. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers. With a directed evolutionary method, an LLM can perform protein engineering through Pareto and experiment‑budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.
Authors: Thomas E. Videbæk, Daichi Hayakawa, Michael F. Hagan, Gregory M. Grason, Seth Fraden, W. Benjamin Rogers
Abstract: Programmable self‑assembly has recently enabled the creation of complex structures through precise control of the interparticle interactions and the particle geometries. Targeting ever more structurally complex, dynamic, and functional assemblies necessitates going beyond the design of the structure itself, to the measurement and control of the local flexibility of the inter‑subunit connections and its impact on the collective mechanics of the entire assembly. In this study, we demonstrate a method to infer the mechanical properties of multisubunit assemblies using cryogenic electron microscopy (cryo‑EM) and RELION's multi‑body refinement. Specifically, we analyze the fluctuations of pairs of DNA‑origami subunits that self‑assemble into tubules. By measuring the fluctuations of dimers using cryo‑EM, we extract mechanical properties such as the bending modulus and interparticle spring constant. These properties are then applied to elastic models to predict assembly outcomes, which align well with experimental observations. This approach not only provides a deeper understanding of nanoparticle mechanics, but also opens new pathways to refining subunit designs to achieve precise assembly behavior. This methodology could have broader applications in the study of nanomaterials, including protein assemblies, where understanding the interplay of mechanical properties and subunit geometry is essential for controlling complex self‑assembled structures.
Authors: D. Evan Piephoff, Jianshu Cao
Abstract: A fluctuation theorem is examined for the first‑passage time of a biomolecular machine (e.g., a motor protein or an enzyme) in a nonequilibrium steady‑state. For such machines in which the driven, observable process is coupled to a hidden process in a kinetically cooperative fashion, the entropy produced along first‑passage trajectories is no longer constant, resulting in a breakdown of this expression. Here, we consider the canonical model for this type of system, a kinetic scheme for conformation‑modulated single‑enzyme catalysis (a type of continuous‑time Markov process with relevance to β‑galactosidase and human glucokinase), as we explore this fluctuation theorem in cooperative biomolecular networks. Kinetic evaluations are performed using a novel, efficient pathway analysis technique, allowing us to attain surprising and concise results from complex calculations. We find that in the absence of hidden current, a fluctuation theorem can be established for the first‑passage time of the observable process, and we demonstrate that this dramatic reduction is a general feature applicable to a wide variety of cooperative networks. The validity of this expression can be experimentally tested, with its violation serving as a unique signature of hidden detailed balance breaking. In addition, we obtain a remarkably compact exact expression for the integrated correction to this first‑passage time fluctuation theorem, as well as the general form, revealing a thermodynamic bound on the kinetic branching ratio (a measure of directionality defined as the ratio of the forward observable process probability to the backward one). These results provide detailed insight into the rich connections between dynamic measurements and the underlying nonequilibrium thermodynamics for cooperative biomolecular machines.
Authors: Sudipta Mitra, Ranjit Biswas, Suman Chakrabarty
Abstract: Estimating rare event kinetics from molecular dynamics simulations is a non‑trivial task despite the great advances in enhanced sampling methods. Weighted Ensemble (WE) simulation, a special class of enhanced sampling techniques, offers a way to directly calculate kinetic rate constants from biased trajectories without the need to modify the underlying energy landscape using bias potentials. Conventional WE algorithms use different binning schemes to partition the collective variable (CV) space separating the two metastable states of interest. In this work, we have developed a new "binless" WE simulation algorithm to bypass the hurdles of optimizing binning procedures. Our proposed protocol (WeTICA) uses a low‑dimensional CV space to drive the WE simulation toward the specified target state. We have applied this new algorithm to recover the unfolding kinetics of three proteins: (A) TC5b Trp‑cage mutant, (B) TC10b Trp‑cage mutant, and (C) Protein G, with unfolding times spanning the range between 3 and 40 μs using projections along predefined fixed Time‑lagged Independent Component Analysis (TICA) eigenvectors as CVs. Calculated unfolding times converge to the reported values with good accuracy with more than one order of magnitude less cumulative WE simulation time than the unfolding time scales with or without a priori knowledge of the CVs that can capture unfolding. Our algorithm can be used with other linear CVs, not limited to TICA. Moreover, the new walker selection criteria for resampling employed in this algorithm can be used on more sophisticated nonlinear CV space for further improvements of binless WE methods.
Authors: Luca Maggi
Abstract: The study of microscopic protein dynamics has historically presented significant challenges to researchers seeking to develop a comprehensive and detailed description of its diverse and intriguing features. Recent experimental and theoretical studies have proposed the hypothesis that protein dynamics may be non‑ergodic. The implications of this finding are of paramount importance from both a practical and theoretical standpoint. In this study, we employ all‑atom molecular dynamics simulations to examine these results over a time window spanning from picoseconds to nanoseconds. To this end, we utilize widely used statistical tools. Our findings challenge the conclusions of previous studies, which suggested that proteins exhibit non‑ergodic dynamics. Instead, we demonstrate that deviations from ergodic behavior are due to incomplete convergence of the investigated quantities. Additionally, we discuss the implications of findings that suggest a potential breaking of the ergodic hypothesis over larger time windows, which were not directly investigated in this study.
Authors: Gabriel Bianchin de Oliveira, Helio Pedrini, Zanoni Dias
Abstract: Various approaches utilizing Transformer architectures have achieved state‑of‑the‑art results in Natural Language Processing (NLP). Based on this success, numerous architectures have been proposed for other types of data, such as in biology, particularly for protein sequences. Notably among these are the ESM2 architectures, pre‑trained on billions of proteins, which form the basis of various state‑of‑the‑art approaches in the field. However, the ESM2 architectures have a limitation regarding input size, restricting it to 1,022 amino acids, which necessitates the use of preprocessing techniques to handle sequences longer than this limit. In this paper, we present the long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.
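For context, the preprocessing this abstract says the 1,022‑residue limit forces is often some form of windowing. The following is a minimal sketch of one such scheme, overlapping fixed-size windows, and is not the method proposed in the paper (which extends the limit instead). The function name `chunk_sequence` and the default overlap of 128 residues are hypothetical choices for illustration.

```python
from typing import List, Tuple


def chunk_sequence(
    seq: str, max_len: int = 1022, overlap: int = 128
) -> List[Tuple[int, str]]:
    """Split a protein sequence into overlapping windows of at most max_len.

    Returns (start_offset, window) pairs so that per-residue outputs can be
    mapped back to positions in the full sequence. The overlap value is an
    illustrative assumption, not a recommendation from the paper.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    if len(seq) <= max_len:
        return [(0, seq)]

    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + max_len, len(seq))
        chunks.append((start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return chunks


# A 2,500-residue toy sequence needs three windows under the 1,022 limit.
chunks = chunk_sequence("A" * 2500)
print([(start, len(window)) for start, window in chunks])
```

Windowing of this kind loses interactions between residues that never share a window, which is the usual motivation for raising the input limit directly.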
Authors: Sepideh Maleki, Josh Vekhter, Keshav Pingali
Abstract: Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such higher order relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as simple hypergraphs and develop a novel, simple, and effective optimization architecture that addresses both tasks. Additionally, we introduce a novel feature extraction technique using node level clustering and we show how integrating data from node‑level labels can improve system performance. Our self‑supervised approach achieves significant improvement over state of the art baselines on several hyperedge prediction and knowledge hypergraph completion benchmarks.
Authors: Aram Ansary Ogholbake, Qiang Cheng
Abstract: Circadian rhythms regulate the physiology and behavior of humans and animals. Despite advancements in understanding these rhythms and predicting circadian phases at the transcriptional level, predicting circadian phases from proteomic data remains elusive. This challenge is largely due to the scarcity of time labels in proteomic datasets, which are often characterized by small sample sizes, high dimensionality, and significant noise. Furthermore, existing methods for predicting circadian phases from transcriptomic data typically rely on prior knowledge of known rhythmic genes, making them unsuitable for proteomic datasets. To address this gap, we developed a novel computational method using unsupervised deep learning techniques to predict circadian sample phases from proteomic data without requiring time labels or prior knowledge of proteins or genes. Our model involves a two‑stage training process optimized for robust circadian phase prediction: an initial greedy one‑layer‑at‑a‑time pre‑training which generates informative initial parameters followed by fine‑tuning. During fine‑tuning, a specialized loss function guides the model to align protein expression levels with circadian patterns, enabling it to accurately capture the underlying rhythmic structure within the data. We tested our method on both time‑labeled and unlabeled proteomic data. For labeled data, we compared our predictions to the known time labels, achieving high accuracy, while for unlabeled human datasets, including postmortem brain regions and urine samples, we explored circadian disruptions. Notably, our analysis identified disruptions in rhythmic proteins between Alzheimer's disease and control subjects across these samples.
Authors: Karishma Thakrar, Jiangqin Ma, Max Diamond, Akash Patel
Abstract: Predicting the impact of single‑point amino acid mutations on protein stability is essential for understanding disease mechanisms and advancing drug development. Protein stability, quantified by changes in Gibbs free energy (ΔΔG), is influenced by these mutations. However, the scarcity of data and the complexity of model interpretation pose challenges in accurately predicting stability changes. This study proposes the application of deep neural networks, leveraging transfer learning and fusing complementary information from different models, to create a feature‑rich representation of the protein stability landscape. We developed four models, with our third model, ThermoMPNN+, demonstrating the best performance in predicting ΔΔG values. This approach, which integrates diverse feature sets and embeddings through latent transfusion techniques, aims to refine ΔΔG predictions and contribute to a deeper understanding of protein dynamics, potentially leading to advancements in disease research and drug discovery.
Authors: Dexuan Xie, Liam Jemison, Yi Jiang
Abstract: The Poisson‑Boltzmann (PB) model is a widely used implicit solvent model in protein simulations. Although variants, such as the size modified PB and nonlocal modified PB models, have been developed to account for ionic size effects and nonlocal dielectric correlations, no existing PB variants simultaneously incorporate both, due to significant modeling and computational challenges. To address this gap, in this paper, a nonlocal size modified PB (NSMPB) model is introduced and solved using a finite element method for a protein with a three‑dimensional molecular structure and an ionic solution containing multiple ion species. In particular, a novel solution decomposition is proposed to overcome the difficulties caused by the increased nonlinearity, nonlocality, and solution singularities of the model. It is then applied to the development of the NSMPB finite element solver, which includes an efficient modified Newton iterative method, an effective damping parameter selection strategy, and good selections of initial iterations. Moreover, the construction of the modified Newton iterative method is mathematically justified. Furthermore, an NSMPB finite element package is developed by integrating a mesh generation tool, a protein data bank file retrieval program, and the PDB2PQR package to simplify and accelerate its usage and application. Finally, numerical experiments are conducted on an ionic solution with four species, proteins with up to 11439 atoms, and irregular interface‑fitted tetrahedral box meshes with up to 1188840 vertices. The numerical results confirm the fast convergence and strong robustness of the modified Newton iterative method, demonstrate the high performance of the package, and highlight the crucial roles played by the damping parameter and initial iteration selections in enhancing the method's convergence. The package will be a valuable tool in protein simulations.
Authors: En Xu, Can Rong, Jingtao Ding, Yong Li
Abstract: The evolutionary processes of complex systems contain critical information regarding their functional characteristics. The generation time of edges provides insights into the historical evolution of various networked complex systems, such as protein‑protein interaction networks, ecosystems, and social networks. Recovering these evolutionary processes holds significant scientific value, including aiding in the interpretation of the evolution of protein‑protein interaction networks. However, existing methods are capable of predicting the generation times of remaining edges given a partial temporal network but often perform poorly in cross‑network prediction tasks. These methods frequently fail in edge generation time recovery tasks for static networks that lack timestamps. In this work, we adopt a comparative paradigm‑based framework that fuses multiple networks for training, enabling cross‑network learning of the relationship between network structure and edge generation times. Compared to separate training, this approach yields an average accuracy improvement of 16.98%. Furthermore, given the difficulty in collecting temporal networks, we propose a novel diffusion‑model‑based generation method to produce a large number of temporal networks. By combining real temporal networks with generated ones for training, we achieve an additional average accuracy improvement of 5.46% through joint training.
Authors: Aurélien Decelle, Alfonso de Jesús Navas Gómez, Beatriz Seoane
Abstract: Maximum entropy methods, rooted in the inverse Ising/Potts problem from statistical physics, are widely used to model pairwise interactions in complex systems across disciplines such as bioinformatics and neuroscience. While successful, these approaches often fail to capture higher‑order interactions that are critical for understanding collective behavior. In contrast, modern machine learning methods can model such interactions, but their interpretability often comes at a prohibitive computational cost. Restricted Boltzmann Machines (RBMs) provide a computationally efficient alternative by encoding statistical correlations through hidden units in a bipartite architecture. In this work, we introduce a method that maps RBMs onto generalized Potts models, enabling the systematic extraction of interactions up to arbitrary order. Leveraging large‑N approximations, made tractable by the RBM's structure, we extract effective many‑body couplings with minimal computational effort. We further propose a robust framework for recovering higher‑order interactions in more complex generative models, and introduce a simple gauge‑fixing scheme for the effective Potts representation. Validation on synthetic data demonstrates accurate recovery of two‑ and three‑body interactions. Applied to protein sequence data, our method reconstructs contact maps with high fidelity and outperforms state‑of‑the‑art inverse Potts models. These results establish RBMs as a powerful and efficient tool for modeling higher‑order structure in high‑dimensional categorical data.
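As a worked illustration of why an RBM encodes interactions beyond pairwise, the standard marginalization over hidden units can be written out. This sketch assumes binary hidden units $h_\mu \in \{0,1\}$ and categorical visible units with $q$ states; the symbols $a_i$, $b_\mu$, and $W_{i\mu}$ are notation chosen here, and the paper may use other hidden-unit types or conventions.

```latex
E(v, h) = -\sum_{i} a_i(v_i) - \sum_{\mu} b_\mu h_\mu - \sum_{i,\mu} W_{i\mu}(v_i)\, h_\mu ,
\qquad v_i \in \{1, \dots, q\},\; h_\mu \in \{0, 1\}

p(v) \propto \sum_{h} e^{-E(v, h)}
     = \exp\Big( \sum_i a_i(v_i)
       + \sum_\mu \log\big( 1 + e^{\, b_\mu + \sum_i W_{i\mu}(v_i)} \big) \Big)
```

Each logarithmic term is a nonlinear function of a sum over all sites, so expanding it generates effective couplings among pairs, triplets, and larger groups of sites. Extracting those couplings order by order is the mapping onto a generalized Potts model that the abstract describes.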
Authors: Tracy Quynh Ha, Albert C. Aragonès, Qiankun Wang, Desmond Koomson, Nashili Kibria, Jhanelle White, Kavita Garg, Jessica Peate, Alex P. S. Brogan, Leigh Aldous, Sarah M. Barry, Ismael Díez-Pérez
Abstract: Single‑enzyme catalysis offers a promising approach for unravelling the dynamic behaviour of individual enzymes as they undergo a reaction, revealing the complex heterogeneity that is lost in the averaged ensembles. Here we demonstrate real‑time, label‑free monitoring of the electrical transduction of single‑protein enzymatic activity for two redox enzymes, cytochrome P450cam and glutathione reductase, trapped in an electrochemically controlled nanoscale tunnelling junction immersed in the aqueous enzymatic mixture. The conductance switching signal observed in individual transients of the electrical current flowing through the single‑protein junction shows that the tunnelling conductance is modulated by the enzymatic reaction; subtle changes of the enzyme redox state occurring during the chemical catalysis process result in fluctuations of the enzyme junction conductivity, which are captured as a switching signal. At the applied electrochemical reducing potential for electrocatalysis, the transient oxidation of the trapped enzyme in every catalytic cycle opens an additional redox‑mediated electron tunnelling channel in the single protein junction that results in a temporary current jump, contributing to the observed conductance switching features. The latter is experimentally assessed via electrochemically controlled conductance measurements of the single‑protein junction. The statistical analysis of the switching events captured over long time periods results in average frequencies that correlate well with the reported catalytic turnover values of both enzymes obtained in standard bulk assays. The single‑enzyme experiments reveal the acute heterogeneous behaviour of enzymatic catalysis and the quantification of single enzyme turnover frequencies.
Authors: Giovanni di Sarra, Barbara Bravi, Yasser Roudi
Abstract: Restricted Boltzmann Machines (RBMs) are simple yet powerful neural networks. They can be used for learning structure in data, and are used as a building block of more complex neural architectures. At the same time, their simplicity makes them easy to use and amenable to theoretical analysis, yielding interpretable models in applications. Here, we focus on reviewing the role that the activation functions, describing the input‑output relationship of single neurons in RBMs, play in the functionality of these models. We discuss recent theoretical results on the benefits and limitations of different activation functions. We also review applications to biological data analysis, namely neural data analysis, where RBM units are mostly taken to have sigmoid activation functions and binary units, and protein data analysis and immunology, where non‑binary units and non‑sigmoid activation functions have recently been shown to yield important insights into the data. Finally, we discuss open problems whose resolution could shed light on broader issues in neural network research.
Authors: Laia Coronas Sala, Parfait Atchade-Adelemou
Abstract: Protein characterization is one of the key components for understanding the human body and advancing drug discovery processes. While the future of quantum hardware holds the potential to accurately characterize these molecules, current efforts focus on developing strategies to fragment larger molecules into computationally manageable subsystems. In this work, we propose a novel strategy to enable quantum simulation using existing quantum algorithms. Our approach involves fragmenting proteins into their corresponding amino acids, simulating them independently, and then reassembling them post‑simulation while applying chemical corrections. This methodology demonstrates its accuracy by calculating the ground state energy of relatively small peptides through reassembling, achieving a mean relative error of only 0.00469 ± 0.01071%. Future directions include investigating, with larger quantum computers, whether this approach remains valid for larger proteins.
Authors: Amélie Chardac, Michael M. Norton, Jonathan Touboul, Guillaume Duclos
Abstract: Many essential cellular processes, including cell division and the establishment of cell polarity during embryogenesis, are regulated by pattern‑forming proteins. These proteins often need to bind to a substrate, such as the cell membrane, on which they interact and form two‑dimensional (2D) patterns. It is unclear how the membrane's continuity and dimensionality impact pattern formation. Here, we address this gap using the MinDE system, a prototypical example of pattern‑forming membrane proteins. We show that when the lipid substrate is fragmented into submicrometer‑sized diffusive liposomes, ATP‑driven protein‑protein interactions generate three‑dimensional (3D) spatially extended patterns, despite the complete loss of membrane continuity. Remarkably, these 3D patterns emerge at scales four orders of magnitude larger than the individual liposomes. By systematically varying protein concentration, liposome size, and density, we observed and characterized a variety of 3D dynamical patterns not seen on continuous 2D membranes, including traveling waves, dynamical spirals, and a coexistence phase. Simulations and linear stability analysis of a coarse‑grained model revealed that the physical properties of the dispersed membrane effectively rescale both the protein‑membrane binding rates and diffusion, two key parameters governing pattern formation and wavelength selection. These findings highlight the robustness of Min's pattern‑forming ability, suggesting that protein‑membrane suspensions could serve as an adaptable template for studying out‑of‑equilibrium self‑organization in 3D, beyond in vivo contexts.
Authors: Jian Jiang, Long Chen, Yueying Zhu, Yazhou Shi, Huahai Qiu, Bengong Zhang, Tianshou Zhou, Guo-Wei Wei
Abstract: Anesthetics are crucial in surgical procedures and therapeutic interventions, but they come with side effects and varying levels of effectiveness, calling for novel anesthetic agents that offer more precise and controllable effects. Targeting gamma‑aminobutyric acid (GABA) receptors, the primary inhibitory receptors in the central nervous system, could enhance their inhibitory action, potentially reducing side effects while improving the potency of anesthetics. In this study, we introduce a proteomic learning of GABA receptor‑mediated anesthesia based on 24 GABA receptor subtypes by considering over 4000 proteins in protein‑protein interaction (PPI) networks and over 1.5 million known binding compounds. We develop a corresponding drug‑target interaction network to identify potential lead compounds for novel anesthetic design. To ensure robust proteomic learning predictions, we curated a dataset comprising 136 targets from a pool of 980 targets within the PPI networks. We employed three machine learning algorithms, integrating advanced natural language processing (NLP) models such as pretrained transformer and autoencoder embeddings. Through a comprehensive screening process, we evaluated the side effects and repurposing potential of over 180,000 drug candidates targeting the GABRA5 receptor. Additionally, we assessed the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify those with near‑optimal characteristics. This approach also involved optimizing the structures of existing anesthetics. Our work presents an innovative strategy for the development of new anesthetic drugs, optimization of anesthetic use, and deeper understanding of potential anesthesia‑related side effects.
Authors: Wen-ran Li, Xavier F. Cadet, David Medina-Ortiz, Mehdi D. Davari, Ramanathan Sowdhamini, Cedric Damour, Yu Li, Alain Miranville, Frederic Cadet
Abstract: Protein design with desirable properties has been a significant challenge for many decades. Generative artificial intelligence is a promising approach and has achieved great success in various protein generation tasks. Notably, diffusion models stand out for their robust mathematical foundations and impressive generative capabilities, offering unique advantages in certain applications such as protein design. In this review, we first give the definition and characteristics of diffusion models and then focus on two strategies: Denoising Diffusion Probabilistic Models (DDPM) and Score‑based Generative Models (SGM), where DDPM is the discrete form of SGM. Furthermore, we discuss their applications in protein design, peptide generation, drug discovery, and protein‑ligand interaction. Finally, we outline the future perspectives of diffusion models to advance autonomous protein design and engineering. The E(3) group consists of all rotations, reflections, and translations in three dimensions. Equivariance under the E(3) group can preserve the physical stability of the frame of each amino acid as much as possible, and we reflect on how to keep the diffusion model E(3) equivariant for protein generation.
Authors: Ngoc-Duy Dinh, Marc Rodriguez-Garcia, Zenon Toprakcioglu, Yi Shen, Tuomas Knowles
Abstract: Microscale hydrogels comprised of macromolecular networks have increasingly been used for applications involving cell encapsulation, tissue engineering and for the storage and release of active cargo molecules. However, the majority of such microgels are formed from nonbiodegradable synthetic polymers, involving harmful solvents, or using animal proteins, such as silk and gelatin, which can have a negative environmental impact and lack sustainability. Furthermore, most encapsulation techniques involve either protecting hydrophobic or hydrophilic cargo, but rarely both. In order to address these issues, we employed droplet‑microfluidics to develop novel, plant protein microcapsules capable of containing both hydrophilic and hydrophobic cargo molecules. The microcapsule structure and cargo release rates were controlled by balancing osmotic pressures between the outer and inner phases of the capsules. Moreover, the digestibility of the microcapsules was comparable with that of pure pea protein, thereby enabling the use of these microcapsules for food and beverage applications. In addition, digestive enzymes can trigger the release of the encapsulated active ingredients, and hence, these microcapsules are well suited for the controlled delivery of active nutraceutical or pharmaceutical ingredients. Finally, we investigated the biodegradability of the microcapsules. It was determined that the plant protein microcapsules exhibited 98.0% biodegradability (as compared with cellulose), thereby fulfilling the biodegradability standards stipulated by the International Organization for Standardization (ISO 14851) for microplastics in freshwater conditions (90%). Hence, the plant protein microcapsules can have numerous applications in the food, nutraceutical, pharmaceutical, cosmetic, personal care, and agriculture industries.
Authors: Yuwei Miao, Yuzhi Guo, Hehuan Ma, Jingquan Yan, Feng Jiang, Rui Liao, Junzhou Huang
Abstract: Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D‑structures or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre‑training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self‑supervised multi‑label classification task that captures the explicit function relations. The specified masking and recovering task helps GoBERT find implicit patterns among functions. The pre‑trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
Authors: Francesc Sabanés Zariquiey, Stephen E. Farr, Stefan Doerr, Gianni De Fabritiis
Abstract: Accurate prediction of protein‑ligand binding affinities is crucial in drug discovery, particularly during hit‑to‑lead and lead optimization phases. However, limitations in ligand force fields continue to impact prediction accuracy. In this work, we validate relative binding free energy (RBFE) accuracy using neural network potentials (NNPs) for the ligands. We utilize a novel NNP model, AceFF 1.0, based on the TensorNet architecture for small molecules that broadens the applicability to diverse drug‑like compounds, including all important chemical elements and supporting charged molecules. Using established benchmarks, we show overall improved accuracy and correlation in binding affinity predictions compared with GAFF2 for molecular mechanics and ANI2‑x for NNPs. Compared with OPLS4, accuracy is slightly lower but correlations are comparable. We also show that we can run the NNP simulations at a 2 fs timestep, at least two times larger than previous NNP models, providing significant speed gains. The results show promise for further evolutions of free energy calculations using NNPs while demonstrating their practical use already with the current generation. The code and NNP model are publicly available for research use.
Authors: Weihang Dai
Abstract: Proteins are sequences of amino acids that serve as the basic building blocks of living organisms. Despite rapidly growing databases documenting structural and functional information for various protein sequences, our understanding of proteins remains limited because of the large possible sequence space and the complex inter‑ and intra‑molecular forces. Deep learning, which is characterized by its ability to learn relevant features directly from large datasets, has demonstrated remarkable performance in fields such as computer vision and natural language processing. It has also been increasingly applied in recent years to the data‑rich domain of protein sequences with great success, most notably with AlphaFold2's breakout performance in protein structure prediction. The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics, including protein design, one of the most difficult but useful tasks. In this paper, we broadly categorize problems in protein bioinformatics into three main categories: 1) structural prediction, 2) functional prediction, and 3) protein design, and review the progress achieved from using deep learning methodologies in each of them. We expand on the main challenges of the protein design problem and highlight how advances in structural and functional prediction have directly contributed to design tasks. Finally, we conclude by identifying important topics and future research directions.
Authors: Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White
Abstract: Solving complex real‑world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non‑standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature‑sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language‑grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi‑step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference‑time compute, we show that language agents backed by open‑source, non‑frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
Authors: Linus Zwaka
Abstract: Sequence alignment is a cornerstone of bioinformatics, widely used to identify similarities between DNA, RNA, and protein sequences and to study evolutionary relationships and functional properties. The Needleman‑Wunsch algorithm remains a robust and accurate method for global sequence alignment. However, its computational complexity, O(mn), poses significant challenges when processing large‑scale datasets or performing multiple sequence alignments. To address these limitations, we propose a hybrid implementation of the Needleman‑Wunsch algorithm that leverages CUDA for parallel execution on GPUs and MPI for distributed computation across multiple nodes on a supercomputer. CUDA efficiently offloads computationally intensive tasks to GPU cores, while MPI enables communication and workload distribution across nodes to handle large‑scale alignments.
This work details the implementation and performance evaluation of the Needleman‑Wunsch algorithm in a massively parallel computing environment. Experimental results demonstrate significant acceleration of the alignment process compared to traditional CPU‑based implementations, particularly for large input sizes and multiple sequence alignments. In summary, the combination of CUDA and MPI effectively overcomes the computational bottlenecks inherent to the Needleman‑Wunsch algorithm without requiring substantial modifications to the underlying algorithm, highlighting the potential of high‑performance computing in advancing sequence alignment workflows.
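For orientation, the O(mn) cost mentioned above comes from filling an (m+1) x (n+1) score table. The following is a minimal serial sketch of that recurrence in Python, not the CUDA/MPI implementation the abstract describes; the function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative defaults, not taken from the paper.
    """
    m, n = len(a), len(b)
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        # Column 0: aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That property is what GPU implementations of this recurrence commonly exploit, although the abstract does not state which decomposition the authors use.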
Authors: Abhinav Roy, Bhavesh Gyanchandani, Aditya Oza, Abhishek Sharma
Abstract: Parkinson's Disease (PD) is a degenerative neurological disorder that impairs motor and non‑motor functions, significantly reducing quality of life and increasing mortality risk. Early and accurate detection of PD progression is vital for effective management and improved patient outcomes. Current diagnostic methods, however, are often costly, time‑consuming, and require specialized equipment and expertise. This work proposes an innovative approach to predicting PD progression using regression methods, Long Short‑Term Memory (LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing spline‑parametrized univariate functions, allows for dynamic learning of activation patterns, unlike traditional linear models.
The Movement Disorder Society‑Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS‑UPDRS) is a comprehensive tool for evaluating PD symptoms and is commonly used to measure disease progression. Additionally, protein or peptide abnormalities are linked to PD onset and progression. Identifying these associations can aid in predicting disease progression and understanding molecular changes.
Comparing multiple models, including LSTM and KAN, this study aims to identify the method that delivers the highest metrics. The analysis reveals that KAN, with its dynamic learning capabilities, outperforms other approaches in predicting PD progression. This research highlights the potential of AI and machine learning in healthcare, paving the way for advanced computational models to enhance clinical predictions and improve patient care and treatment strategies in PD management.
Authors: Luca Donati, Surahit Chewle, Dominik St. Pierre, Vijay Natarajan, Marcus Weber
Abstract: Molecular Dynamics simulations are essential tools for understanding the dynamic behavior of biomolecules, yet extracting meaningful molecular pathways from these simulations remains challenging due to the vast amount of generated data. In this work, we present Molecular Kinetics via Topology (MoKiTo), a novel approach that combines the ISOKANN algorithm to determine the reaction coordinate of a molecular system with a topological analysis inspired by the Mapper algorithm. Our strategy efficiently identifies and characterizes distinct molecular pathways, enabling the detection and visualization of critical conformational transitions and rare events. This method offers deeper insights into molecular mechanisms, facilitating the design of targeted interventions in drug discovery and protein engineering.
Authors: Giovanny Espitia, Yui Tik Pang, James C. Gumbart
Abstract: We address protein structure prediction in the 3D Hydrophobic‑Polar lattice model through two novel deep learning architectures. For proteins under 36 residues, our hybrid reservoir‑based model combines fixed random projections with trainable deep layers, achieving optimal conformations with 25% fewer training episodes. For longer sequences, we employ a long short‑term memory network with multi‑headed attention, matching best‑known energy values. Both architectures leverage a stabilized Deep Q‑Learning framework with experience replay and target networks, demonstrating consistent achievement of optimal conformations while significantly improving training efficiency compared to existing methods.
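As background for the "energy values" mentioned above: in the standard Hydrophobic‑Polar lattice model, a conformation scores -1 for every pair of H residues that occupy neighbouring lattice sites without being adjacent in the chain. The sketch below evaluates that energy for a given 3D conformation. It illustrates the objective only, not the authors' reinforcement learning method, and the function name and input layout are assumptions.

```python
def hp_energy(sequence, coords):
    """Energy of a 3D HP lattice conformation.

    sequence: string of 'H' and 'P' characters.
    coords:   list of (x, y, z) integer tuples, one lattice site per residue.
    Returns -1 for each H-H pair on neighbouring sites that is not bonded
    along the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    energy = 0
    for i in range(len(sequence)):
        # Start at i + 2 to skip residues bonded along the chain.
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                manhattan = sum(abs(p - q) for p, q in zip(coords[i], coords[j]))
                if manhattan == 1:  # nearest neighbours on the cubic lattice
                    energy -= 1
    return energy


if __name__ == "__main__":
    square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
    print(hp_energy("HPPH", square))
```

A folding agent in this setting would be rewarded for reaching conformations that minimise this quantity; how the reward is shaped in the paper is not stated in the abstract.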
Authors: Ishmael Apachigawo, Dhruvil Solanki, Santanu Maity, Pradeep Shukla, Radhakrishna Rao, Prabhakar Pradhan
Abstract: Photonics/light localization techniques are important in understanding the structural changes in biological tissues at the nano‑ to sub‑micron scale. It is now known that structural alteration starts at the nanoscale at the beginning of cancer progression. This study examines the molecular‑specific nano‑structural alterations of chronic alcoholism and probiotic effects on colon cancer using a mouse model of colon cancer. Confocal microscopy and mesoscopic light‑scattering analysis are applied to quantify structural changes in DNA (chromatin), cytoskeleton, and Ki‑67 protein cells with appropriate staining dyes. We assessed alcohol‑treated and azoxymethane (AOM) with dextran sulfate sodium (DSS)‑induced colitis models, including ethanol (EtOH) and probiotic (L. casei) treatments separately and together. The inverse participation ratio (IPR) technique was employed to quantify the degree of light localization to assess the molecular‑specific spatial structural disorder as a biomarker for cancer progression detection. Significant enhancement of cancer progression was observed in the alcohol‑treated group, and probiotics treatment with alcohol showed partial reversal of these changes in colon cancer. The results underscore the potential of the IPR technique in detecting early structural changes in colon cancer, offering insights into the mitigating effects of probiotics on alcohol‑induced enhancement of colon cancer.
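To make the IPR quantity concrete: for a state with amplitudes psi_i, the inverse participation ratio is the sum of |psi_i|^4 divided by the square of the sum of |psi_i|^2. It is 1/N for a state spread evenly over N sites and 1 for a state confined to a single site. The sketch below computes only this textbook definition for a given amplitude vector; how the states are derived from the confocal images is not described in the abstract, and the function name is an assumption.

```python
def inverse_participation_ratio(amplitudes):
    """Return sum(|a|^4) / (sum(|a|^2))^2 for a sequence of amplitudes.

    Larger values indicate a more localized state.
    """
    norm_sq = sum(abs(a) ** 2 for a in amplitudes)
    if norm_sq == 0:
        raise ValueError("amplitude vector must be non-zero")
    return sum(abs(a) ** 4 for a in amplitudes) / norm_sq ** 2


if __name__ == "__main__":
    print(inverse_participation_ratio([1, 1, 1, 1]))  # evenly spread state
    print(inverse_participation_ratio([0, 0, 1, 0]))  # fully localized state
```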
Authors: Hanjing Zhou, Mingze Yin, Wei Wu, Mingyang Li, Kun Fu, Jintai Chen, Jian Wu, Zheng Wang
Abstract: The multi‑modality pre‑training paradigm that aligns protein sequences and biological descriptions has learned general protein representations and achieved promising performance in various downstream applications. However, these works were still unable to replicate the extraordinary success of language‑supervised visual foundation models due to the ineffective usage of aligned protein‑text paired data and the lack of an effective function‑informed pre‑training paradigm. To address these issues, this paper curates a large‑scale protein‑text paired dataset called ProtAnno with a property‑driven sampling strategy, and introduces a novel function‑informed protein pre‑training paradigm. Specifically, the sampling strategy determines the selection probability based on sample confidence and property coverage, balancing data quality and data quantity in the face of large‑scale noisy data. Furthermore, motivated by the significance of protein‑specific functional mechanisms, the proposed paradigm explicitly models protein static and dynamic functional segments with two segment‑wise pre‑training objectives, injecting fine‑grained information in a function‑informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi‑modality foundation model that comprehensively represents function‑aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross‑modal transformation, semantic similarity inference and protein‑protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross‑modal transformation benchmarks, 59.9% in GO‑CC and 39.7% in GO‑BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi‑modality foundation model.
Authors: Emilio Gallicchio
Abstract: We present the Alchemical Transfer with Coordinate Swapping (ATS) method to enable the calculation of the relative binding free energies between large congeneric ligands and single‑point mutant peptides to protein receptors with the Alchemical Transfer Method (ATM) framework. Similarly to ATM, the new method implements the alchemical transformation as a coordinate transformation, and works with any unmodified force fields and standard chemical topologies. Unlike ATM, which transfers the whole ligands in and out of the receptor binding site, ATS limits the magnitude of the alchemical perturbation by transferring only the portion of the molecules that differ between the bound and unbound ligands. The common region of the two ligands, which can be arbitrarily large, is unchanged and does not contribute to the magnitude and statistical fluctuations of the perturbation energy. Internally, the coordinates of the atoms of the common regions are swapped to maintain the integrity of the covalent bonding data structures of the molecular dynamics engine. The work successfully validates the method on protein‑ligand and protein‑peptide RBFE benchmarks. This advance paves the road for the application of the relative binding free energy Alchemical Transfer Method protocol to study the effect of protein and nucleic acid mutations on the binding affinity and specificity of macromolecular complexes.
Authors: Beyza E. Ortlek, Ozgur B. Akan
Abstract: Molecular communication (MC) is a bio‑inspired communication paradigm that utilizes molecules to transfer information and offers a robust framework for understanding biological signaling systems. This paper introduces a novel end‑to‑end MC framework for short‑chain fatty acid (SCFA)‑driven vagus nerve signaling within the gut‑brain axis (GBA) to enhance our understanding of gut‑brain communication mechanisms. SCFA molecules, produced by gut microbiota, serve as important biomarkers in physiological and psychological processes, including neurodegenerative and mental health disorders. The developed end‑to‑end model integrates SCFA binding to vagal afferent fibers, G protein‑coupled receptor (GPCR)‑mediated calcium signaling, and Hodgkin‑Huxley‑based action potential generation into a comprehensive vagus nerve signaling mechanism through GBA. Information‑theoretic metrics such as mutual information and delay are used to evaluate the efficiency of this SCFA‑driven signaling pathway model. Simulations demonstrate how molecular inputs translate into neural outputs, highlighting critical aspects that govern gut‑brain communication. In this work, the integration of SCFA‑driven signaling into the MC framework provides a novel perspective on gut‑brain communication and paves the way for the development of innovative therapeutic advancements targeting neurological and psychiatric disorders.
Authors: Zi Hao Liu, Maria Tsanai, Oufan Zhang, Teresa Head-Gordon, Julie Forman-Kay
Abstract: Intrinsically disordered proteins and regions are increasingly appreciated for their abundance in the proteome and the many functional roles they play in the cell. In this short review, we describe a variety of approaches used to obtain biological insight from the structural ensembles of disordered proteins, regions, and complexes and the integrative biology challenges that arise from combining diverse experiments and computational models. Importantly, we highlight findings regarding structural and dynamic characterization of disordered regions involved in binding and phase separation, as well as drug targeting of disordered regions, using a broad framework of integrative modeling approaches.
Authors: Regina Ibragimova, Dimitrios Iliadis, Willem Waegeman
Abstract: Recently, machine learning (ML) has gained popularity in the early stages of drug discovery. This trend is unsurprising given the increasing volume of relevant experimental data and the continuous improvement of ML algorithms. However, conventional models, which rely on the principle of molecular similarity, often fail to capture the complexities of chemical interactions, particularly those involving activity cliffs (ACs): compounds that are structurally similar but exhibit markedly different activity behaviors. In this work, we address two distinct yet related tasks: (1) activity cliff (AC) prediction and (2) drug‑target interaction (DTI) prediction. Leveraging insights gained from the AC prediction task, we aim to improve the performance of DTI prediction through transfer learning. A universal model was developed for AC prediction, capable of identifying activity cliffs across diverse targets. Insights from this model were then incorporated into DTI prediction, enabling better handling of challenging cases involving ACs while maintaining similar overall performance. This approach establishes a strong foundation for integrating AC awareness into predictive models for drug discovery. Scientific Contribution: This study presents a novel approach that applies transfer learning from AC prediction to enhance DTI prediction, addressing limitations of traditional similarity‑based models. By introducing AC‑awareness, we improve DTI model performance in structurally complex regions, demonstrating the benefits of integrating compound‑specific and protein‑contextual information. Unlike previous studies, which treat AC and DTI predictions as separate problems, this work establishes a unified framework to address both data scarcity and prediction challenges in drug discovery.
Authors: Conghao Wang, Jagath C. Rajapakse
Abstract: De novo design of bioactive drug molecules with potential to treat desired biological targets is a profound task in the drug discovery process. Existing approaches tend to leverage the pocket structure of the target protein to condition the molecule generation. However, even the pocket area of the target protein may contain redundant information, since not all atoms in the pocket are responsible for the interaction with the ligand. In this work, we propose PharmacoBridge, a pharmacophore‑guided de novo design approach to generate drug candidates inducing desired bioactivity via a diffusion bridge. Our method adapts the diffusion bridge to effectively convert pharmacophore arrangements in space into molecular structures via SE(3)‑equivariant transformations, providing sophisticated control over optimal biochemical feature arrangements on the generated molecules. PharmacoBridge is demonstrated to generate hit candidates that exhibit high binding affinity with potential protein targets.
Authors: Valentina Pederzoli, Mattia Corti, Davide Riccobelli, Paola F. Antonietti
Abstract: The aim of this paper is to introduce, analyse and test in practice a new mathematical model describing the interplay between pathogen diffusion and the biological tissue atrophy it drives, with applications to neurodegenerative disorders. This study introduces a novel mathematical and computational model comprising a Fisher‑Kolmogorov equation for species diffusion coupled with an elasticity equation governing mass loss. These equations are coupled through a logistic law dictating the reduction of the medium's mass. One potential application of this model lies in understanding the onset and development of Alzheimer's disease. Here, the equations can describe the propagation of misfolded tau‑proteins and the ensuing brain atrophy characteristic of the disease. To address the inherent complexities numerically, we propose a Polygonal Discontinuous Galerkin method on polygonal/polyhedral grids for spatial discretization, while time integration relies on the theta‑method. We present the mathematical model, examine its characteristics, and describe the discretization applied. Furthermore, convergence results are presented to validate the model, accompanied by simulations illustrating the application scenario of the onset of Alzheimer's disease.
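For readers unfamiliar with the two named ingredients, the block below writes out the textbook forms of the Fisher‑Kolmogorov equation and the theta‑method. The symbols (concentration c, diffusion tensor D, reaction rate alpha, time step Delta t) are assumptions chosen here for illustration; the abstract does not give the paper's notation, and the coupling to the elasticity equation is not reproduced.

```latex
% Fisher-Kolmogorov equation: diffusion plus logistic growth
\frac{\partial c}{\partial t}
  = \nabla \cdot \left( \mathbf{D} \nabla c \right) + \alpha \, c \, (1 - c)

% Theta-method for a semi-discrete system  du/dt = F(u),  0 <= theta <= 1
\frac{u^{n+1} - u^{n}}{\Delta t}
  = \theta \, F\!\left(u^{n+1}\right) + (1 - \theta) \, F\!\left(u^{n}\right)
```

Setting theta = 0 gives explicit Euler, theta = 1 gives implicit Euler, and theta = 1/2 gives the Crank‑Nicolson scheme.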
Authors: C. Yang, D. Ma, S. Hu, M. Li, Y. Lu
Abstract: Membrane proteins often need to be inserted into or attached on the cell membrane to perform their functions. Understanding their transmembrane topology and conformational dynamics during insertion is crucial for elucidating their roles. However, it remains challenging to monitor nanoscale changes in insertion depth of individual proteins in membranes. Here, we introduce two single molecule imaging methods, SIFA and LipoFRET, designed for in vitro observation of the nanoscale architecture of membrane proteins within membranes. These methods have demonstrated their efficacy in studying biomolecules interacting with bio‑membranes with sub‑nanometer precision.
Authors: Ashutosh Baheti, Debanjana Chakraborty, Faeze Brahman, Ronan Le Bras, Ximing Lu, Nouha Dziri, Yejin Choi, Mark Riedl, Maarten Sap
Abstract: Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering. Existing language model (LM) controllability methods for multi‑attribute constraint satisfaction often rely on specialized architectures or gradient‑based classifiers, limiting their flexibility to work with arbitrary black‑box evaluators and pretrained models. Current general‑purpose large language models, while capable, cannot achieve fine‑grained multi‑attribute control over external attributes. Thus, we create Multi‑Attribute Constraint Satisfaction (MACS), a generalized method capable of finetuning language models on any sequential domain to satisfy user‑specified constraints on multiple external real‑value attributes. Our method trains LMs as editors by sampling diverse multi‑attribute edit pairs from an initial set of paraphrased outputs. During inference, the LM iteratively improves upon its previous solution to satisfy constraints for all attributes by leveraging our designed constraint satisfaction reward. We additionally experiment with reward‑weighted behavior cloning to further improve the constraint satisfaction rate of LMs. To evaluate our approach, we present a new Fine‑grained Constraint Satisfaction (FineCS) benchmark, featuring two challenging tasks: (1) Text Style Transfer, where the goal is to simultaneously modify the sentiment and complexity of reviews, and (2) Protein Design, focusing on modulating fluorescence and stability of Green Fluorescent Proteins (GFP). Our empirical results show that MACS achieves the highest threshold satisfaction in both FineCS tasks, outperforming strong domain‑specific baselines. Our work opens new avenues for generalized and real‑value multi‑attribute control, with implications for diverse applications spanning NLP and bioinformatics.
Authors: Fatemah Alharthi, Dhruvil Solanki, Ishmael Apachigawo, Jianfeng Xiao, Mohammad Moshahid Khan, Prabhakar Pradhan
Abstract: Parkinson's disease (PD) is considered one of the most frequent neurological diseases in the world. There is a need to study early and efficient biomarkers of Parkinson's, such as changes in the structural disorder of components like DNA and chromatin, especially at the subcellular level in the human brain. We used two techniques, Partial wave spectroscopy (PWS) and Inverse Participation Ratio (IPR), to detect the changes in structural disorder in human brain tissue samples. It was observed from the PWS experiment that there was an increase in structural disorder in Parkinson's disease tissues and cells when compared to normal tissues and cells using mesoscopic light transport theory. Furthermore, the IPR experiment also showed DNA and chromatin structural alterations that have the same trend and support the PWS results. The increase in mass density in the nuclei components, such as DNA and chromatin, can be linked to the aggregation of alpha‑synuclein in the substantia nigra of the brain. This protein deposition is considered a significant cause of neuronal death in the brains of PD patients. We also did a histological analysis of brain tissues, which supports our results from dual photonics techniques. The results show that this dual technique is a powerful approach to detect the changes. Our results highlight the potential of the parameter, related to the structural disorder strength, as an efficient biomarker for PD progression, paving the way for research into early disease detection.
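As a rough illustration of the quantity named in the abstract above, the inverse participation ratio of a discretized state can be computed as the sum of fourth powers of its amplitudes after normalization. This is only a sketch of the textbook definition; the function name is hypothetical, and the authors' actual pipeline (how the states are obtained from tissue or nuclear images) is not reproduced or assumed here.

```python
def inverse_participation_ratio(psi):
    """Textbook IPR of a state vector: sum |psi_i|^4 / (sum |psi_i|^2)^2.

    Ranges from 1/N (amplitude spread evenly over N sites) up to 1
    (all amplitude on a single site), so larger values mean stronger
    localization.
    """
    norm = sum(abs(a) ** 2 for a in psi)
    return sum(abs(a) ** 4 for a in psi) / norm ** 2


# A fully localized state versus one spread evenly over four sites.
localized = inverse_participation_ratio([1.0, 0.0, 0.0, 0.0])
extended = inverse_participation_ratio([0.5, 0.5, 0.5, 0.5])
```

Dividing by the squared norm makes the function insensitive to the overall scale of the input vector.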
Authors: Lars L. Schaaf, Ilyes Batatia, Christoph Brunken, Thomas D. Barrett, Jules Tilly
Abstract: Simulating atomic‑scale processes, such as protein dynamics and catalytic reactions, is crucial for advancements in biology, chemistry, and materials science. Machine learning force fields (MLFFs) have emerged as powerful tools that achieve near quantum mechanical accuracy, with promising generalization capabilities. However, their practical use is often limited by long inference times compared to classical force fields, especially when running extensive molecular dynamics (MD) simulations required for many biological applications. In this study, we introduce BoostMD, a surrogate model architecture designed to accelerate MD simulations. BoostMD leverages node features computed at previous time steps to predict energies and forces based on positional changes. This approach reduces the complexity of the learning task, allowing BoostMD to be both smaller and significantly faster than conventional MLFFs. During simulations, the computationally intensive reference MLFF is evaluated only every N steps, while the lightweight BoostMD model handles the intermediate steps at a fraction of the computational cost. Our experiments demonstrate that BoostMD achieves an eight‑fold speedup compared to the reference model and generalizes to unseen dipeptides. Furthermore, we find that BoostMD accurately samples the ground‑truth Boltzmann distribution when running molecular dynamics. By combining efficient feature reuse with a streamlined architecture, BoostMD offers a robust solution for conducting large‑scale, long‑timescale molecular simulations, making high‑accuracy ML‑driven modeling more accessible and practical.
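The evaluation schedule described above (reference model every N steps, lightweight surrogate in between) can be sketched on a toy problem. Everything below is an assumption made for illustration: the harmonic "reference" force, the linear surrogate, and all function names are hypothetical stand-ins, not the BoostMD architecture or its node features.

```python
def reference_force(x, k=1.0):
    """Stand-in for the expensive reference model (here a harmonic spring)."""
    return -k * x


def surrogate_force(x, cache, k_est=1.0):
    """Stand-in for the cheap surrogate: reuse the cached reference output
    and correct it using only the displacement since the cache was made."""
    return cache["force"] + (-k_est) * (x - cache["x"])


def run_md(x0, v0, n_steps, every_n, dt=0.01, mass=1.0):
    """Integrate 1D dynamics, calling the reference only every `every_n` steps."""
    x, v = x0, v0
    cache = None
    ref_calls = 0
    for step in range(n_steps):
        if step % every_n == 0:
            # Expensive evaluation; store what the surrogate will reuse.
            f = reference_force(x)
            cache = {"x": x, "force": f}
            ref_calls += 1
        else:
            # Intermediate step handled by the surrogate.
            f = surrogate_force(x, cache)
        # Semi-implicit Euler update.
        v += dt * f / mass
        x += dt * v
    return x, v, ref_calls


x_end, v_end, calls = run_md(1.0, 0.0, n_steps=100, every_n=10)
print(calls)  # 10 reference evaluations for 100 steps
```

In this toy case the surrogate happens to be exact, so the trajectory matches one computed with the reference at every step; a learned surrogate would only approximate it.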
Authors: Akarsh Kumar, Chris Lu, Louis Kirsch, Yujin Tang, Kenneth O. Stanley, Phillip Isola, David Ha
Abstract: With the recent Nobel Prize awarded for radical advances in protein discovery, foundation models (FMs) for exploring large combinatorial spaces promise to revolutionize many scientific fields. Artificial Life (ALife) has not yet integrated FMs, thus presenting a major opportunity for the field to alleviate the historical burden of relying chiefly on manual design and trial‑and‑error to discover the configurations of lifelike simulations. This paper presents, for the first time, a successful realization of this opportunity using vision‑language FMs. The proposed approach, called Automated Search for Artificial Life (ASAL), (1) finds simulations that produce target phenomena, (2) discovers simulations that generate temporally open‑ended novelty, and (3) illuminates an entire space of interestingly diverse simulations. Because of the generality of FMs, ASAL works effectively across a diverse range of ALife substrates including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata. A major result highlighting the potential of this technique is the discovery of previously unseen Lenia and Boids lifeforms, as well as cellular automata that are open‑ended like Conway's Game of Life. Additionally, the use of FMs allows for the quantification of previously qualitative phenomena in a human‑aligned way. This new paradigm promises to accelerate ALife research beyond what is possible through human ingenuity alone.
Authors: Yilong Zang, Lingfei Ren, Yue Li, Zhikang Wang, David Antony Selby, Zheng Wang, Sebastian Josef Vollmer, Hongzhi Yin, Jiangning Song, Junhang Wu
Abstract: Graph neural networks (GNNs) have shown promise in integrating protein‑protein interaction (PPI) networks for identifying cancer genes in recent studies. However, due to the insufficient modeling of the biological information in PPI networks, a more faithful depiction of complex protein interaction patterns for cancer genes within the graph structure remains largely unexplored. This study takes a pioneering step toward bridging biological anomalies in protein interactions caused by cancer genes to statistical graph anomaly. We find a unique graph anomaly exhibited by cancer genes, namely weight heterogeneity, which manifests as significantly higher variance in edge weights of cancer gene nodes within the graph. Additionally, from the spectral perspective, we demonstrate that the weight heterogeneity could lead to the "flattening out" of spectral energy, with a concentration towards the extremes of the spectrum. Building on these insights, we propose the HIerarchical‑Perspective Graph Neural Network (HIPGNN) that not only determines spectral energy distribution variations from the spectral perspective, but also perceives detailed protein interaction context from the spatial perspective. Extensive experiments are conducted on two reprocessed datasets, STRINGdb and CPDB, and the experimental results demonstrate the superiority of HIPGNN.
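The notion of weight heterogeneity above, higher variance in the edge weights attached to a node, can be illustrated with a per-node variance computation. This is a minimal sketch under assumptions: the function name, the use of population variance, and the toy graph with placeholder node labels are all hypothetical, and the paper's exact statistic may differ.

```python
from collections import defaultdict
from statistics import pvariance


def edge_weight_variance(edges):
    """Map each node to the population variance of its incident edge weights.

    `edges` is an iterable of (u, v, weight) tuples for an undirected graph.
    """
    incident = defaultdict(list)
    for u, v, w in edges:
        incident[u].append(w)
        incident[v].append(w)
    return {node: pvariance(ws) for node, ws in incident.items()}


# Toy interaction graph with placeholder node names (not real genes).
toy_edges = [
    ("A", "B", 0.9),
    ("A", "C", 0.1),
    ("A", "D", 0.5),
    ("B", "C", 0.5),
    ("B", "D", 0.5),
]
variances = edge_weight_variance(toy_edges)
# Node "A" mixes strong and weak interactions, so its variance is the largest.
```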
Authors: Yan Zhu, Shihao Wang, Yong Han, Yao Lu, Shulan Qiu, Ling Jin, Xiangdong Li, Weixiong Zhang
Abstract: Air pollution, particularly airborne particulate matter (PM), poses a significant threat to public health globally. It is crucial to comprehend the association between PM‑associated toxic components and their cellular targets in humans to understand the mechanisms by which air pollution impacts health and to establish causal relationships between air pollution and public health consequences. Although many studies have explored the impact of PM on human health, the understanding of the association between toxins and the associated targets remains limited. Leveraging cutting‑edge deep learning technologies, we developed tipFormer (toxin‑protein interaction prediction based on transformer), a novel deep‑learning tool for identifying toxic components capable of penetrating human cells and instigating pathogenic biological activities and signaling cascades. It incorporates dual pre‑trained language models to encode protein sequences and chemicals. It employs a convolutional encoder to assimilate the sequential attributes of proteins and chemicals. It then introduces a learning module with a cross‑attention mechanism to decode and elucidate the multifaceted interactions pivotal for the hotspots binding proteins and chemicals. Experimental results show that tipFormer effectively captures interactions between proteins and toxic components. This approach offers significant value to air quality and toxicology researchers by allowing high‑throughput identification and prioritization of hazards. It supports more targeted laboratory studies and field measurements, ultimately enhancing our understanding of how air pollution impacts human health.
Authors: Heming Zhang, Di Huang, Yixin Chen, Fuhai Li
Abstract: The integration of multi‑omic data is pivotal for understanding complex diseases, but its high dimensionality and noise present significant challenges. Graph Neural Networks (GNNs) offer a robust framework for analyzing large‑scale signaling pathways and protein‑protein interaction networks, yet they face limitations in expressivity when capturing intricate biological relationships. To address this, we propose Graph Sequence Language Model (GraphSeqLM), a framework that enhances GNNs with biological sequence embeddings generated by Large Language Models (LLMs). These embeddings encode structural and biological properties of DNA, RNA, and proteins, augmenting GNNs with enriched features for analyzing sample‑specific multi‑omic data. By integrating topological, sequence‑derived, and biological information, GraphSeqLM demonstrates superior predictive accuracy and outperforms existing methods, paving the way for more effective multi‑omic data integration in precision medicine.
Authors: Edward Kim, Manil Shrestha, Richard Foty, Tom DeLay, Vicki Seyfert-Margolis
Abstract: Creation and curation of knowledge graphs can accelerate disease discovery and analysis in real‑world data. While disease ontologies aid in biological data annotation, codified categories (SNOMED‑CT, ICD10, CPT) may not capture patient condition nuances or rare diseases. Multiple disease definitions across data sources complicate ontology mapping and disease clustering. We propose creating patient knowledge graphs using large language model extraction techniques, allowing data extraction via natural language rather than rigid ontological hierarchies. Our method maps to existing ontologies (MeSH, SNOMED‑CT, RxNORM, HPO) to ground extracted entities.
Using a large ambulatory care EHR database with 33.6M patients, we demonstrate our method through the patient search for Dravet syndrome, which received ICD10 recognition in October 2020. We describe our construction of patient‑specific knowledge graphs and symptom‑based patient searches. Using confirmed Dravet syndrome ICD10 codes as ground truth, we employ LLM‑based entity extraction to characterize patients in grounded ontologies. We then apply this method to identify Beta‑propeller protein‑associated neurodegeneration (BPAN) patients, demonstrating real‑world discovery where no ground truth exists.
Authors: Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
Abstract: We consider the conditional generation of 3D drug‑like molecules with explicit control over molecular properties such as drug‑like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effectively binding to specific protein sites. To tackle this problem, we propose an E(3)‑equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment‑based coordinate loss to adapt equivariant networks for auto‑regressive de‑novo 3D molecule generation from scratch. Extensive experiments validate our model's effectiveness on property‑guided and context‑guided molecule generation, both for de‑novo 3D molecule design and structure‑based drug discovery against protein targets.
Authors: Maria Tsanai, Teresa Head-Gordon
Abstract: We employ a multiscale computational approach to investigate the condensation process of the C‑terminal low‑complexity region of the Caprin1 protein as a function of increasing ATP concentration for three states: the initial mixed state, nanocondensate formation, and the dissolution of the droplet as it reenters the mixed state. We show that upon condensation ATP assembles via pi‑pi interactions, resulting in the formation of a large cluster of stacked ATP molecules stabilized by sodium counterions. The surface of the ATP assembly interacts with the arginine‑rich regions of the Caprin1 protein, particularly with its N‑terminus, to promote the complete phase‑separated droplet on a lengthscale of tens of nanometers. In order to understand droplet stability, we analyze the near‑surface electrostatic potential (NS‑ESP) of Caprin1 and estimate the zeta potential of the Caprin1‑ATP assemblies. We predict a positive NS‑ESP at the Caprin1 surface for low ATP concentrations that defines the early mixed state, in excellent agreement with the NS‑ESP obtained from NMR experiments using paramagnetic resonance enhancement. By contrast, the NS‑ESP of Caprin1 at the surface of the nanocondensate at moderate levels of ATP is highly negative compared to the mixed state, and estimates of a large zeta potential outside the highly dense region of charge further explains the remarkable stability of this phase separated droplet assembly. As ATP concentrations rise further, the strong electrostatic forces needed for nanocondensate stability are replaced by weaker Caprin1‑ATP interactions that drive the reentry into the mixed state that exhibits a much lower zeta potential.
Authors: Søren Toxvaerd
Abstract: Living organisms have some common structures, chemical reactions and molecular structures. The organisms consist of cells with cell division, they have homochirality of protein and carbohydrate units, and metabolism, and genetics, and they are mortal. The molecular structures and chemical reactions underlying these features are common from the simplest bacteria to human beings. The origin of life is evolutionary with the emergence of a network of spontaneous biochemical reactions, and the evolution has taken place over a very long time. The evolution contains, however, some "landmarks" and bottlenecks, which in a revolutionary manner directed the evolution, and the article tries to establish the order of these events. The article advocates that a possible order in the emergence of life is that the first milestone in prebiotic evolution is the emergence of homochirality in proteins. The homochirality of peptides is, however, accompanied by instability and racemization, which cause aging of the peptides and mortality. The metabolism and genetics are established through homochiral enzymes in the Earth's crust approximately 4 Gyr ago. Finally, the cells with cell division are established in the Hot Springs environment at the interface between the crust and the Hadean Ocean.
Authors: Fatemah Alharthi, Ishmael Apachigawo, Dhruvil Solanki, Sazzad Khan, Himanshi Singh, Mohammad Moshahid Khan, Prabhakar Pradhan
Abstract: Understanding alterations in structural disorders in tissue or cells or building blocks, such as DNA or chromatin in the human brain, at the nano to submicron level provides us with efficient biomarkers for Alzheimer's detection. Here, we report a dual photonics technique to detect nano‑ to submicron‑scale alterations in brain tissues or cells and DNA or chromatin due to the early to late progression of Alzheimer's disease (AD) in humans. Using a recently developed mesoscopic light transport technique, fine‑focused nano‑sensitive partial wave spectroscopy (PWS), we measure the degree of structural disorder in tissues. Furthermore, the chemical‑specific inverse participation ratio technique (IPR) was used to measure the DNA or chromatin structural alterations. The results of the PWS and IPR experiments showed a significant increase in the degree of structural disorder at the nano to submicron scale at different stages of AD relative to their controls for both the tissue or cell and DNA cellular levels. The increase in the structural disorder in cells or tissues and DNA or chromatin in the nuclei can be attributed to higher mass density fluctuations in the tissue and DNA or chromatin damage in the nuclei caused by the rearrangements of macromolecules due to the deposition of the amyloid beta protein and damage in DNA or chromatin with the progress of AD.
Authors: Shivasankaran Vanaja Pandi, Bharath Ramsundar
Abstract: Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering. However, training these models from scratch requires significant computational resources, limiting their accessibility. To address this, we integrate a PLM into DeepChem, an open‑source framework for computational biology and chemistry, to provide a more accessible platform for protein‑related tasks.
We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks. Additionally, we present an exploration of generating plastic‑degrading enzyme candidates using the model's embeddings and latent space manipulation techniques. While the results suggest that further refinement is needed, this approach provides a foundation for future work in enzyme design. This study aims to facilitate the use of PLMs in research fields like synthetic biology and environmental sustainability, even for those with limited computational resources.
Authors: Sai Advaith Maddipatla, Nadav Bojan Sellam, Sanketh Vedula, Ailie Marx, Alex Bronstein
Abstract: Proteins are dynamic, adopting ensembles of conformations. The nature of this conformational heterogeneity is imprinted in the raw electron density measurements obtained from X‑ray crystallography experiments. Fitting an ensemble of protein structures to these measurements is a challenging, ill‑posed inverse problem. We propose a non‑i.i.d. ensemble guidance approach to solve this problem using existing protein structure generative models and demonstrate that it accurately recovers complicated multi‑modal alternate protein backbone conformations observed in certain single crystal measurements.
Authors: Hanxun Jin, William Goldberg, Zhenqin Wang, Huiyong Li, Yuxuan Huang, Marcus Foston, Guy M. Genin
Abstract: Renewable and biodegradable plastics derived from soy protein isolate (SPI) offer a promising alternative to conventional petroleum‑based plastics, particularly for film‑grade bioplastics applications such as plastic bags. However, even with reinforcement from cellulose nanocrystals (CNCs), their mechanical properties including stiffness lag behind those of petroleum‑based plastics. To identify pathways for improving CNC‑reinforced SPI composites, we studied stiffening mechanisms by interpreting experimental data using homogenization models that accounted for CNC agglomeration and the formation of CNC/SPI interphases. To model effects of surface modification of CNCs with polydopamine (polyDOPA), we incorporated two key mechanisms: enhanced CNC dispersion and modified CNC‑SPI interfacial interactions. Models accounted for interphases surrounding CNCs, arising from physicochemical interactions with the polyDOPA‑modified CNC surfaces. Consistent with experimental observations of polyDOPA modification enhancing mechanical properties through both increased spatial distribution of CNCs and matrix‑filler interactions, results demonstrated that improved dispersion and interfacial bonding contribute to increased composite stiffness. Results highlight the potential of biodegradable CNC/SPI bio‑nanocomposites as sustainable plastic alternatives, and suggest pathways for further enhancing their mechanical properties.
Authors: Yuwei Miao, Yuzhi Guo, Hehuan Ma, Jingquan Yan, Feng Jiang, Weizhi An, Jean Gao, Junzhou Huang
Abstract: Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. Gene studies require clean, factual datasets to ensure reliable results. Ontology graphs, neatly organized domain terminology graphs, provide ideal sources for domain facts. However, available gene ontology annotations are currently distributed across various databases without unified identifiers for genes and gene products. To address these challenges, we introduce Unified Entrez Gene Identifier Dataset and Benchmarks (UniEntrezDB), the first systematic effort to unify large‑scale public Gene Ontology Annotations (GOA) from various databases using unique gene identifiers. UniEntrezDB includes a pre‑training dataset and four downstream tasks designed to comprehensively evaluate gene embedding performance from gene, protein, and cell levels, ultimately enhancing the reliability and applicability of LLMs in gene research and other professional settings.
Authors: Johanna L. Hall, Shiun-Jr Yang, David T. Limmer, Graham R. Fleming
Abstract: Photosystem II (PSII) can achieve near‑unity quantum efficiency of light harvesting in ideal conditions and can dissipate excess light energy as heat to prevent formation of reactive oxygen species under light stress. Understanding how this pigment‑protein complex accomplishes these opposing goals is a topic of great interest that has so far been explored primarily through the lens of the system energetics. Despite PSII's known flat energy landscape, a thorough consideration of the entropic effects on energy transfer in PSII is lacking. In this work, we aim to discern the free energetic design principles underlying the PSII energy transfer network. To accomplish this goal, we employ a structure‑based rate matrix and compute the free energy terms in time following a specific initial excitation to discern how entropy and enthalpy drive ensemble system dynamics. We find that the interplay between the entropy and enthalpy components differs among each protein subunit, which allows each subunit to fulfill a unique role in the energy transfer network. This individuality ensures PSII can accomplish efficient energy trapping in the RC, effective NPQ in the periphery, and robust energy trapping in the other‑monomer RC if the same‑monomer RC is closed. We also show that entropy, in particular, is a dynamically tunable feature of the PSII free energy landscape accomplished through regulation of LHCII binding. These findings help rationalize natural photosynthesis and provide design principles for novel, more efficient solar energy harvesting technologies.
Authors: Fabio Zamio
Abstract: Liver cancer is a leading cause of cancer‑related mortality worldwide, with its high genetic heterogeneity complicating diagnosis and treatment. This study introduces DLSOM, a deep learning framework utilizing stacked autoencoders to analyze the complete somatic mutation landscape of 1,139 liver cancer samples, covering 20,356 protein‑coding genes. By transforming high‑dimensional mutation data into three low‑dimensional features, DLSOM enables robust clustering and identifies five distinct liver cancer subtypes with unique mutational, functional, and biological profiles. Subtypes SC1 and SC2 exhibit higher mutational loads, while SC3 has the lowest, reflecting mutational heterogeneity. Novel and COSMIC‑associated mutational signatures reveal subtype‑specific molecular mechanisms, including links to hypermutation and chemotherapy resistance. Functional analyses further highlight the biological relevance of each subtype. This comprehensive framework advances precision medicine in liver cancer by enabling the development of subtype‑specific diagnostics, biomarkers, and therapies, showcasing the potential of deep learning in addressing cancer complexity.
Authors: E. Faraji, P. Kurian, R. Franzosi, S. Mancini, E. Floriani, V. Calandrini, G. Pettini, M. Pettini
Abstract: In the present paper we address the general problem of selective electrodynamic interactions between DNA and protein, which is motivated by decades of theoretical study and our very recent experimental findings (M. Lechelon et al, Sci Adv 8, eabl5855 (2022)). Inspired by the Davydov and Holstein‑Fröhlich models describing electron motion along biomolecules, and using a model Hamiltonian written in second quantization, the time‑dependent variational principle (TDVP) is used to derive the dynamical equations of the system. We demonstrate the efficacy of this second‑quantized model for a well‑documented biochemical system consisting of a restriction enzyme, EcoRI, which binds selectively to a palindromic six‑base‑pair target within a DNA oligonucleotide sequence to catalyze a DNA double‑strand cleavage. The time‑domain Fourier spectra of the electron currents numerically computed for the DNA fragment and for the EcoRI enzyme, respectively, exhibit a cross‑correlation spectrum with a sharp co‑resonance peak. When the target DNA recognition sequence is randomized, this sharp co‑resonance peak is replaced with a broad and noisy spectrum. Such a sequence‑dependent charge transfer phenomenology is suggestive of a potentially rich variety of selective electrodynamic interactions influencing the coordinated activity of DNA substrates, enzymes, transcription factors, ligands, and other proteins under realistic biochemical conditions characterized by electron‑phonon excitations.
Authors: Elana Simon, James Zou
Abstract: Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. Here we present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM‑2, we identify up to 2,548 human‑interpretable latent features per layer that strongly correlate with up to 143 known biological concepts such as binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM‑2 reveals up to 46 neurons per layer with clear conceptual alignment across 15 known concepts, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM‑2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs. As community resources, we introduce InterPLM (interPLM.ai), an interactive visualization platform for exploring and analyzing learned PLM features, and release code for training and analysis at github.com/ElanaPearl/interPLM.
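For readers unfamiliar with sparse autoencoders, one common formulation is a linear encoder with a ReLU, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The sketch below assumes that formulation; the abstract does not specify the architecture or loss actually used, and all names, dimensions, weights, and the penalty coefficient are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative latent features, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, a) for a in pre]  # ReLU keeps only positively activated features
    x_hat = [a + b for a, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty encouraging sparse z."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return recon + l1_coeff * sparsity


# Toy overcomplete dictionary: a 2-dimensional input, 4 latent features.
W_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
# Only two of the four latent features are active for this input.
```

In practice the weights are learned by minimizing this loss over many embeddings rather than fixed by hand as in this toy example.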
Authors: Nuowei Liu, Changzhi Sun, Tao Ji, Junfeng Tian, Jianxin Tang, Yuanbin Wu, Man Lan
Abstract: Current Large Language Models (LLMs) for understanding proteins primarily treat amino acid sequences as a text modality. Meanwhile, Protein Language Models (PLMs), such as ESM‑2, have learned massive sequential evolutionary knowledge from the universe of natural protein sequences. Furthermore, structure‑based encoders like ProteinMPNN learn the structural information of proteins through Graph Neural Networks. However, whether the incorporation of protein encoders can enhance the protein understanding of LLMs has not been explored. To bridge this gap, we propose EvoLlama, a multimodal framework that connects a structure‑based encoder, a sequence‑based protein encoder and an LLM for protein understanding. EvoLlama consists of a ProteinMPNN structure encoder, an ESM‑2 protein sequence encoder, a multimodal projector to align protein and text representations and a Llama‑3 text decoder. To train EvoLlama, we fine‑tune it on protein‑oriented instructions and protein property prediction datasets verbalized via natural language instruction templates. Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced, outperforming other fine‑tuned protein‑oriented LLMs in zero‑shot settings by an average of 1%‑8% and surpassing the state‑of‑the‑art baseline with supervised fine‑tuning by an average of 6%. On protein property prediction datasets, our approach achieves promising results that are competitive with state‑of‑the‑art task‑specific baselines. We will release our code in a future version.
Authors: Hezha O. Rasul, Dlzar D. Ghafour, Bakhtyar K. Aziz, Bryar A. Hassan, Tarik A. Rashid, Arif Kivrak
Abstract: The drug development process is a critical challenge in the pharmaceutical industry due to its time‑consuming nature and the need to discover new drug potentials to address various ailments. The initial step in drug development, drug target identification, often consumes considerable time. While valid, traditional methods such as in vivo and in vitro approaches are limited in their ability to analyze vast amounts of data efficiently, leading to wasteful outcomes. To expedite and streamline drug development, an increasing reliance on computer‑aided drug design (CADD) approaches has emerged. These sophisticated in silico methods offer a promising avenue for efficiently identifying viable drug candidates, thus providing pharmaceutical firms with significant opportunities to uncover new prospective drug targets. The main goal of this work is to review in silico methods used in the drug development process with a focus on identifying therapeutic targets linked to specific diseases at the genetic or protein level. This article thoroughly discusses A‑to‑Z in silico techniques, which are essential for identifying the targets of bioactive compounds and their potential therapeutic effects. This review intends to improve drug discovery processes by illuminating the state of these cutting‑edge approaches, thereby maximizing the effectiveness and duration of clinical trials for novel drug target investigation.
Authors: Qingwen Tian, Yuxin Xu, Yixuan Yang, Zhen Wang, Ziqi Liu, Pengju Yan, Xiaolin Li
Abstract: Molecular 3D conformations play a key role in determining how molecules interact with other molecules or protein surfaces. Recent deep learning advancements have improved conformation prediction, but slow training speeds and difficulties in utilizing high‑degree features limit performance. We propose EquiFlow, an equivariant conditional flow matching model with optimal transport. EquiFlow uniquely applies conditional flow matching in molecular 3D conformation prediction, leveraging simulation‑free training to address slow training speeds. It uses a modified Equiformer model to encode Cartesian molecular conformations along with their atomic and bond properties into higher‑degree embeddings. Additionally, EquiFlow employs an ODE solver, providing faster inference speeds compared to diffusion models with SDEs. Experiments on the QM9 dataset show that EquiFlow predicts small molecule conformations more accurately than current state‑of‑the‑art models.
Authors: Zhuoran Qiao, Feizhi Ding, Thomas Dresselhaus, Mia A. Rosenfeld, Xiaotian Han, Owen Howell, Aniketh Iyengar, Stephen Opalenski, Anders S. Christensen, Sai Krishna Sirumalla, Frederick R. Manby, Thomas F. Miller, Matthew Welborn
Abstract: Structure determination is essential to a mechanistic understanding of diseases and the development of novel therapeutics. Machine‑learning‑based structure prediction methods have made significant advancements by computationally predicting protein and bioassembly structures from sequences and molecular topology alone. Despite substantial progress in the field, challenges remain to deliver structure prediction models to real‑world drug discovery. Here, we present NeuralPLexer3 ‑‑ a physics‑inspired flow‑based generative model that achieves state‑of‑the‑art prediction accuracy on key biomolecular interaction types and improves training and sampling efficiency compared to its predecessors and alternative methodologies. Examined through newly developed benchmarking strategies, NeuralPLexer3 excels in vital areas that are crucial to structure‑based drug design, such as physical validity and ligand‑induced conformational changes.
Authors: Shashank Pathak, Guohui Lin
Abstract: Motivation: Codon optimization of Open Reading Frame (ORF) sequences is essential for enhancing mRNA stability and expression in applications like mRNA vaccines, where codon choice can significantly impact protein yield which directly impacts immune strength. In this work, we investigate the use of a pre‑trained protein language model (PPLM) for getting a rich representation of amino acids which could be utilized for codon optimization. This leaves us with a simpler fine‑tuning task over PPLM in optimizing ORF sequences.
Results: The ORFs generated by our proposed models outperformed their natural counterparts encoding the same proteins on computational metrics for stability and expression. They also demonstrated enhanced performance against the benchmark ORFs used in mRNA vaccines for the SARS‑CoV‑2 viral spike protein and the varicella‑zoster virus (VZV). These results highlight the potential of adapting PPLM for designing ORFs tailored to encode target antigens in mRNA vaccines.
Authors: J. Kyle Brubaker, Kyle E. C. Booth, Akihiko Arakawa, Fabian Furrer, Jayeeta Ghosh, Tsutomu Sato, Helmut G. Katzgraber
Abstract: The peptide‑protein docking problem is an important problem in structural biology that facilitates rational and efficient drug design. In this work, we explore modeling and solving this problem with the quantum‑amenable quadratic unconstrained binary optimization (QUBO) formalism. Our work extends recent efforts by incorporating the objectives and constraints associated with peptide cyclization and peptide‑protein docking in the two‑particle model on a tetrahedral lattice. We propose a "resource efficient" QUBO encoding for this problem, and baseline its performance with a novel constraint programming (CP) approach. We implement an end‑to‑end framework that enables the evaluation of our methods on instances from the Protein Data Bank (PDB). Our results show that the QUBO approach, using a classical simulated annealing solver, is able to find feasible conformations for problems with up to 6 peptide residues and 34 target protein residues, but has trouble scaling beyond this problem size. In contrast, the CP approach can solve problems with up to 13 peptide residues and 34 target protein residues. We conclude that while QUBO can be used to successfully tackle this problem, its scaling limitations and the strong performance of the CP method suggest that it may not be the best choice.
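As a rough illustration of the QUBO formalism and the classical simulated annealing solver mentioned in this abstract, the sketch below minimizes x^T Q x over binary vectors x. The toy matrix `TOY_Q` (a "pick exactly one of three" penalty plus small linear costs), the geometric cooling schedule, and all function names are assumptions made for illustration. They are not the paper's encoding of peptide‑protein docking.

```python
import math
import random


def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x and a square matrix Q (list of lists)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def simulated_annealing(Q, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize x^T Q x with single-bit flips and a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = qubo_energy(Q, x)
    best_x, best_e = x[:], energy
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        new_energy = qubo_energy(Q, x)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_e:
                best_x, best_e = x[:], energy
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e


# Hypothetical toy problem: penalty 2 * (x0 + x1 + x2 - 1)^2 with the constant
# dropped, plus linear costs (0.3, 0.1, 0.2) added on the diagonal.
TOY_Q = [
    [-1.7, 2.0, 2.0],
    [2.0, -1.9, 2.0],
    [2.0, 2.0, -1.8],
]

best_x, best_e = simulated_annealing(TOY_Q)
```

The constraint is folded into the objective as a quadratic penalty, so the solver needs no separate constraint handling. Feasible assignments are simply the low‑energy ones. Re‑evaluating the full energy after every flip keeps the sketch short, but a practical solver would update the energy incrementally.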
Authors: Vivekananda Bal, Moo Sun Hong, Jacqueline M. Wolfrum, Paul W. Barone, Stacy L. Springs, Anthony J. Sinskey, Robert M. Kotin, Richard D. Braatz
Abstract: Crystallization of proteins, specifically proteins of medical relevance, is performed for various reasons such as to understand the protein structure and to design therapies. Obtaining kinetic constants in rate laws for nucleation and growth of advanced biotherapeutics such as capsids, an assembly of macromolecules, is challenging and essential to the design of the crystallization processes. In this work, coupled population balance and species balance equations are developed to extract nucleation and growth kinetics for crystallization of recombinant adeno‑associated virus (rAAV) capsids. A comparison of model results with experimental data for capsid crystallization in a hanging‑drop vapor diffusion system shows that the slow rate of vapor diffusion from the droplet controls the initial nucleation and growth processes, and that capsid nucleation occurs via heterogeneous nucleation in the microdroplet. Results also show that the capsids, which are of very high molecular weight (~3.6 MDa), have a similar tendency to nucleate as small organic molecules such as glycine (75 Da), low‑molecular‑weight proteins, and small‑molecule active pharmaceutical ingredients due to their ball‑shaped outer structure. Capsids also show a prolonged nucleation period, as do proteins and other macromolecules, but have a slow growth rate with a growth rate pre‑factor seven orders of magnitude smaller than that of lysozyme. The capsid crystal growth rate is weakly sensitive to the supersaturation compared to lysozyme and is limited by the transport of capsids due to slow Brownian motion resulting from the very high molecular weight.
Authors: Jinsong Shao, Qineng Gong, Zeyu Yin, Yu Chen, Yajie Hao, Lei Zhang, Linlin Jiang, Min Yao, Jinlong Li, Fubo Wang, Li Wang
Abstract: The imperfect modeling of ternary complexes has limited the application of computer‑aided drug discovery tools in PROTAC research and development. In this study, we developed LM‑PROTAC (language model driven Proteolysis Targeting Chimera), an AI‑assisted pipeline for PROTAC molecule design that embeds a transformer‑based generative model with dual constraints on structure and properties, referred to as the DCT. The pipeline uses a fragment‑based representation of molecules and proceeds in three steps. First, a language model driven affinity model for protein compounds screens molecular fragments with high affinity for the target protein. Second, the structural and physicochemical properties of these fragments are constrained during generation to meet specific scenario requirements. Finally, the preliminary generated molecules undergo a two‑round screening with a multidimensional property prediction model, yielding a batch of PROTAC molecules capable of degrading disease‑relevant target proteins for in vitro validation and thus a complete solution for AI‑assisted PROTAC drug generation. Taking the key tumor target Wnt3a as an example, the LM‑PROTAC pipeline successfully generated PROTAC molecules capable of inhibiting Wnt3a. The results show that DCT can efficiently generate PROTACs that target and hydrolyse Wnt3a.
Authors: Axel Levy, Rishwanth Raghu, David Shustin, Adele Rui-Yang Peng, Huan Li, Oliver Biggs Clarke, Gordon Wetzstein, Ellen D. Zhong
Abstract: Cryo‑electron microscopy (cryo‑EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near‑physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particular, datasets containing mixtures of multiple proteins require the joint inference of structure, pose, compositional class, and conformational states for 3D reconstruction. Here, we present Hydra, an approach that models both conformational and compositional heterogeneity fully ab initio by parameterizing structures as arising from one of K neural fields. We employ a new likelihood‑based loss function and demonstrate the effectiveness of our approach on synthetic datasets composed of mixtures of proteins with large degrees of conformational variability. We additionally demonstrate Hydra on an experimental dataset of a cellular lysate containing a mixture of different protein complexes. Hydra expands the expressivity of heterogeneous reconstruction methods and thus broadens the scope of cryo‑EM to increasingly complex samples.
Authors: Chenglin Wang, Yucheng Zhou, Zijie Zhai, Jianbing Shen, Kai Zhang
Abstract: Protein inverse folding is a fundamental problem in bioinformatics, aiming to recover the amino acid sequences from a given protein backbone structure. Despite the success of existing methods, they struggle to fully capture the intricate inter‑residue relationships critical for accurate sequence prediction. We propose a novel method that leverages diffusion models with representation alignment (DMRA), which enhances diffusion‑based inverse folding by (1) proposing a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) aligning noisy hidden representations with clean semantic representations during the denoising process. This is achieved by predefined semantic representations for amino acid types and a representation alignment method that utilizes type embeddings as semantic feedback to normalize each residue. In experiments, we conduct extensive evaluations on the CATH4.2 dataset to demonstrate that DMRA outperforms leading methods, achieving state‑of‑the‑art performance and exhibiting strong generalization capabilities on the TS50 and TS500 datasets.
Authors: Gyeo-Re Han, Artem Goncharov, Merve Eryilmaz, Shun Ye, Hyou-Arm Joung, Rajesh Ghosh, Emily Ngo, Aoi Tomoeda, Yena Lee, Kevin Ngo, Elizabeth Melton, Omai B. Garner, Dino Di Carlo, Aydogan Ozcan
Abstract: Democratizing biomarker testing at the point‑of‑care requires innovations that match laboratory‑grade sensitivity and precision in an accessible format. Here, we demonstrate high‑sensitivity detection of cardiac troponin I (cTnI) through innovations in chemiluminescence‑based sensing, imaging, and deep learning‑driven analysis. This chemiluminescence vertical flow assay (CL‑VFA) enables rapid, low‑cost, and precise quantification of cTnI, a key cardiac protein for assessing heart muscle damage and myocardial infarction. The CL‑VFA integrates a user‑friendly chemiluminescent paper‑based sensor, a polymerized enzyme‑based conjugate, a portable high‑performance CL reader, and a neural network‑based cTnI concentration inference algorithm. The CL‑VFA measures cTnI over a broad dynamic range covering six orders of magnitude and operates with 50 µL of serum per test, delivering results in 25 min. This system achieves a detection limit of 0.16 pg/mL with an average coefficient of variation under 15%, surpassing traditional benchtop analyzers in sensitivity by an order of magnitude. In blinded validation, the computational CL‑VFA accurately measured cTnI concentrations in patient samples, demonstrating a robust correlation against a clinical‑grade FDA‑cleared analyzer. These results highlight the potential of CL‑VFA as a robust diagnostic tool for accessible, rapid cardiac biomarker testing that meets the needs of diverse healthcare settings, from emergency care to underserved regions.
Authors: Peizhen Bai, Filip Miljković, Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, Haiping Lu
Abstract: Inverse protein folding generates valid amino acid sequences that can fold into a desired protein structure, with recent deep‑learning advances showing strong potential and competitive performance. However, challenges remain, such as predicting elements with high structural uncertainty, including disordered regions. To tackle such low‑confidence residue prediction, we propose a Mask‑prior‑guided denoising Diffusion (MapDiff) framework that accurately captures both structural information and residue interactions for inverse protein folding. MapDiff is a discrete diffusion probabilistic model that iteratively generates amino acid sequences with reduced noise, conditioned on a given protein backbone. To incorporate structural information and residue interactions, we develop a graph‑based denoising network with a mask‑prior pre‑training strategy. Moreover, in the generative process, we combine the denoising diffusion implicit model with Monte‑Carlo dropout to reduce uncertainty. Evaluation on four challenging sequence design benchmarks shows that MapDiff substantially outperforms state‑of‑the‑art methods. Furthermore, the in silico sequences generated by MapDiff closely resemble the physico‑chemical and structural characteristics of native proteins across different protein families and architectures.
Authors: Shuqi Li, Shufang Xie, Hongda Sun, Yuhan Chen, Tao Qin, Tianjun Ke, Rui Yan
Abstract: Traditional drug discovery processes are both time‑consuming and require extensive professional expertise. With the accumulation of drug‑target interaction (DTI) data from experimental studies, leveraging modern machine‑learning techniques to discern patterns between drugs and target proteins has become increasingly feasible. In this paper, we introduce the Multi‑channel Interaction Network (MIN), a novel framework designed to predict DTIs through two primary components: a representation learning module and a multi‑channel interaction module. The representation learning module features a C‑Score Predictor‑assisted screening mechanism, which selects critical residues to enhance prediction accuracy and reduce noise. The multi‑channel interaction module incorporates a structure‑agnostic channel, a structure‑aware channel, and an extended‑mixture channel, facilitating the identification of interaction patterns at various levels for optimal complementarity. Additionally, contrastive learning is utilized to harmonize the representations of diverse data types. Our experimental evaluations on public datasets demonstrate that MIN surpasses other strong DTI prediction methods. Furthermore, the case study reveals a high overlap between the residues selected by the C‑Score Predictor and those in actual binding pockets, underscoring MIN's explainability capability. These findings affirm that MIN is not only a potent tool for DTI prediction but also offers fresh insights into the prediction of protein binding sites.
Authors: Wang Liang
Abstract: Multilingual transfer ability, which reflects how well models fine‑tuned on one source language can be applied to other languages, has been well studied in multilingual pre‑trained models. However, the existence of such capability transfer between natural language and gene sequences/languages remains underexplored. This study addresses this gap by drawing inspiration from the sentence‑pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA‑pair classification (DNA sequence similarity) and DNA‑protein‑pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small‑scale pre‑trained model like GPT‑2‑small, which was pre‑trained on English, achieved an accuracy of 78% on the DNA‑pair classification task after being fine‑tuned on English sentence‑pair classification data (XTREME PAWS‑X). With a BERT model trained on multilingual text, the precision reached 89%. On the more complex DNA‑protein‑pair classification task, however, the model's output was barely distinguishable from random output. Experimental validation has confirmed that the transfer of capabilities from natural language to biological language is unequivocally present. Building on this foundation, we have also investigated the impact of model parameter scale and pre‑training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this capability. This study offers an intriguing new perspective on exploring the relationship between natural language and genetic language.
Authors: O. Pavón-Torres, J. R. Collantes-Collantes, M. A. Agüero-Granados
Abstract: Nonlinear molecular excitations in DNA have traditionally been modelled using the nonlinear Schrödinger equation (NLSE). An alternative approach is based on the plane‑base rotator model and the SU(2)/U(1) generalized spin coherent states, which leads to a cubic quintic NLSE. Higher‑order nonlinearities are particularly useful for modelling complex interactions, such as those in DNA‑protein systems, where multiple competing forces play a significant role. Additionally, the surrounding viscous medium introduces dissipative forces that affect the propagation of molecular excitations, leading to energy dissipation and damping effects. These damping effects are modelled using the quasi‑stationary method, which describes the system's near‑equilibrium behaviour. In this work, we explore the evolution of nonlinear molecular excitations in DNA‑protein systems, accounting for damping effects, and discuss potential applications to the transcription process.
Authors: Marsha Mariya Kappan, Joby George
Abstract: Pancreatic Ductal Adenocarcinoma (PDAC), a type of pancreatic cancer, has in past years been anticipated to become one of the main causes of mortality. Evidence from several studies supports the concept that the oncogenic KRAS (Ki‑ras2 Kirsten rat sarcoma viral oncogene) mutation is the major cause of pancreatic cancer. KRAS acts as an on‑off switch that promotes cell growth. When the KRAS gene is mutated, however, the switch stays in one position, allowing cells to grow uncontrollably. This uncontrolled multiplication of cells causes cancer growth. Therefore, KRAS was selected as the target protein in the study. Fifty plant‑derived compounds were selected for the study. To determine whether the examined compounds could bind to the KRAS complex's binding pocket, molecular docking was performed. Computational analyses were used to assess the possible ability of the tested substances to pass the Blood Brain Barrier (BBB). To predict the bioactivity of the ligands, five machine learning models were created and the best one among them was chosen for analyzing the bioactivity of each ligand. From the fifty plant‑derived compounds, the six with the lowest binding energies were selected, and their bioactivity was analyzed using a Random Forest Regression model. The Absorption, Distribution, Metabolism, Excretion (ADME) properties of the compounds were also analyzed. The results showed that borneol has powerful effects and acts as a promising agent for the treatment of pancreatic cancer. This suggests that borneol, found in plants like mint, ginger, and rosemary, is a successful compound for the treatment of pancreatic cancer.
Authors: Vivekananda Bal, Jacqueline M. Wolfrum, Paul W. Barone, Stacy L. Springs, Anthony J. Sinskey, Robert M. Kotin, Richard D. Braatz
Abstract: Gene therapies using recombinant adeno‑associated virus (rAAV) have been developed to treat monogenic and acquired diseases but are currently the most expensive drugs due, in part, to high manufacturing costs. The cells producing rAAV generate substantial quantities of empty (50‑90%) and partially filled capsids that must be removed prior to final formulation. The conventional separation processes are inefficient in removing empty and partially filled capsids, have low yield, scale poorly, are time consuming, and need additional purification steps. This article demonstrates one‑step separation of full capsids from a mixture of full, partially filled, and empty capsids and other protein impurities using selective crystallization. This is the first use of selective crystallization as a purification process for proteins, and the first time in the history of selective crystallization that it is performed without physically or chemically modifying the target component. The process is highly efficient, highly scalable, and economical. Hanging‑drop vapor diffusion experiments were used to scout crystallization conditions in which full and empty capsids crystallize, then to define conditions in which crystals of full, empty, or both full and empty capsids nucleate and grow. The experimental results for rAAV serotypes 5, 8, and 9 as exemplary vectors, together with scale‑up results, show that full capsids can be selectively crystallized and separated in one step and one round from a mixture of full, partially filled, and empty capsids and other proteins, with full capsid enrichment of greater than 80% (approximately 20% higher than existing methods) and yield of greater than 90% (more than approximately 30% higher than existing methods), keeping their biological activity intact. The process takes less than 4 h, an approximately 87% reduction from the existing processing time, and needs no additional purification steps.
Authors: Azwad Tamir, Jiann-Shiun Yuan
Abstract: Recent developments in next generation sequencing technology have led to the creation of extensive, open‑source protein databases consisting of hundreds of millions of sequences. To render these sequences applicable in biomedical applications, they must be meticulously annotated by wet lab testing or by extracting annotations from existing literature. Over the last few years, researchers have developed numerous automatic annotation systems, particularly deep learning models based on machine learning and artificial intelligence, to address this issue. In this work, we propose a transformer‑based fusion model capable of predicting Gene Ontology (GO) terms from full‑scale protein sequences, achieving state‑of‑the‑art accuracy compared to other contemporary machine learning annotation systems. The approach performs particularly well on clustered split datasets, which comprise training and testing samples originating from distinct distributions that are structurally diverse. This demonstrates that the model is able to understand both short and long term dependencies within the enzyme's structure and can precisely identify the motifs associated with the various GO terms. Furthermore, the technique is lightweight and less computationally expensive compared to the benchmark methods, while at the same time being unaffected by sequence length, rendering it appropriate for diverse applications with varying sequence lengths.
Authors: Nina Královič-Kanjaková, Ali Asi Shirazi, Lukáš Hubčík, Mária Klacsová, Atoosa Keshavarzi, Juan Carlos Martínez, Sophie Combet, José Teixeira, Daniela Uhríková
Abstract: The use of exogenous pulmonary surfactant (EPS) to deliver other relevant drugs to the lung is a promising strategy for combined therapy. We evaluated the interaction of polymyxin B (PxB) with clinically used EPS, the poractant alfa Curosurf (PSUR). The effect of PxB on the protein‑free model system (MS) composed of four phospholipids (diC16:0PC/16:0‑18:1PC/16:0‑18:2PC/16:0‑18:1PG) was examined in parallel to distinguish the specificity of the composition of PSUR. We used several experimental techniques (differential scanning calorimetry, small‑ and wide‑angle X‑ray scattering, small‑angle neutron scattering, fluorescence spectroscopy, and electrophoretic light scattering) to characterize the binding of PxB to both EPS. Electrostatic PxB‑EPS interactions are dominant. The results obtained support the concept of cationic PxB molecules lying on the surface of the PSUR bilayer, strengthening the multilamellar structure of the PSUR as derived from SAXS and SANS. A protein‑free MS mimics natural EPS well but was found to be less resistant to penetration of PxB into the lipid bilayer. PxB does not affect the gel‑to‑fluid phase transition temperature Tm of PSUR, while Tm increased by ~ +2 °C in MS. The decrease of the thickness of the lipid bilayer (dL) of PSUR upon PxB binding is negligible. The hydrophobic tail of the PxB molecule does not penetrate the bilayer as derived from SANS data analysis and changes in lateral pressure monitored by excimer fluorescence at two depths of the hydrophobic region of the bilayer. Changes in dL of protein‑free MS show a biphasic dependence on the adsorbed amount of PxB with a minimum close to the point of electroneutrality of the mixture. Our results do not discourage the concept of a combined treatment with PxB‑enriched Curosurf. However, the amount of PxB must be carefully assessed (less than 5 wt% relative to the mass of the surfactant) to avoid inversion of the surface charge of the membrane.
Authors: Daiheng Zhang, Yan Zeng, Xinyu Hong, Jinbo Xu
Abstract: Accurately predicting protein melting temperature changes (Delta Tm) is fundamental for assessing protein stability and guiding protein engineering. Leveraging multi‑modal protein representations has shown great promise in capturing the complex relationships among protein sequences, structures, and functions. In this study, we develop models based on powerful protein language models, including ESM‑2, ESM‑3 and AlphaFold, using various feature extraction methods to enhance prediction accuracy. By utilizing the ESM‑3 model, we achieve a new state‑of‑the‑art performance on the s571 test dataset, obtaining a Pearson correlation coefficient (PCC) of 0.50. Furthermore, we conduct a fair evaluation to compare the performance of different protein language models in the Delta Tm prediction task. Our results demonstrate that integrating multi‑modal protein representations could advance the prediction of protein melting temperatures.
Authors: Huiyu Li, Ao Ma
Abstract: The bottleneck in enhanced sampling lies in finding collective variables (CVs) that can effectively accelerate protein conformational changes. True reaction coordinates (tRCs) that can predict the committor are considered the optimal CVs, but identifying them requires unbiased natural reactive trajectories, which, paradoxically, depend on effective enhanced sampling. Using the generalized work functional method, we found that tRCs control both conformational changes and energy relaxation, enabling us to compute tRCs from energy relaxation simulations. Applying bias to tRCs accelerated conformational changes and ligand dissociation in HIV‑1 protease and the PDZ2 domain by 10^5 to 10^15‑fold. The resulting trajectories follow natural transition pathways, enabling efficient generation of natural reactive trajectories. In contrast, biased trajectories from empirical CVs often display non‑physical features. Furthermore, by computing tRCs from a single protein structure, our method enables predictive sampling of conformational changes. These findings significantly broaden the range of protein functional processes accessible to molecular dynamics simulations.
Authors: Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni
Abstract: The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text‑based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture‑agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi‑label, and multi‑class classification tasks. We apply these benchmark tasks to evaluate expression‑based models, large language models, protein language models, DNA‑based models, and traditional baselines. Our findings suggest that text‑based models and protein language models generally outperform expression‑based models in genomic properties and regulatory functions tasks, whereas expression‑based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at github.com/BiomedSciAI/gene‑benchmark.
Authors: Xiao-Yu Guo, Yi-Fan Li, Yuan Liu, Xiaoyong Pan, Hong-Bin Shen
Abstract: Protein design has become a critical method in advancing significant potential for various applications such as drug development and enzyme engineering. However, protein design methods utilizing large language models with solely pretraining and fine‑tuning struggle to capture relationships in multi‑modal protein data. To address this, we propose ProtDAT, a de novo fine‑grained framework capable of designing proteins from any descriptive protein text input. ProtDAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than separate entities. It leverages an innovative multi‑modal cross‑attention, integrating protein sequences and textual information for a foundational level and seamless integration. Experimental results demonstrate that ProtDAT achieves the state‑of‑the‑art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity. On 20,000 text‑sequence pairs from Swiss‑Prot, it improves pLDDT by 6%, TM‑score by 0.26, and reduces RMSD by 1.2 Å, highlighting its potential to advance protein design.
Authors: Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
Abstract: Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or even protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the data compressor. This two‑stage paradigm sets obstacles for unifying models across data domains, as hand‑crafted compressors architectures are used for different data modalities. To this end, we introduce INRFlow, a domain‑agnostic approach to learn flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point‑wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds and protein structure data, achieving strong performance in different domains and outperforming comparable approaches. INRFlow is a promising step towards domain‑agnostic flow matching generative models that can be trivially adopted in different data domains.
Authors: Matthew Ricci, Guy Pelc, Zoe Piran, Noa Moriel, Mor Nitzan
Abstract: Spatiotemporal dynamics pervade the natural sciences, from the morphogen dynamics underlying patterning in animal pigmentation to the protein waves controlling cell division. A central challenge lies in understanding how controllable parameters induce qualitative changes in system behavior called bifurcations. This endeavor is particularly difficult in realistic settings where governing partial differential equations (PDEs) are unknown and data is limited and noisy. To address this challenge, we propose TRENDy (Temporal Regression of Effective Nonlinear Dynamics), an equation‑free approach to learning low‑dimensional, predictive models of spatiotemporal dynamics. TRENDy first maps input data to a low‑dimensional space of effective dynamics through a cascade of multiscale filtering operations. Our key insight is the recognition that these effective dynamics can be fit by a neural ordinary differential equation (NODE) having the same parameter space as the input PDE. The preceding filtering operations strongly regularize the phase space of the NODE, making TRENDy significantly more robust to noise compared to existing methods. We train TRENDy to predict the effective dynamics of synthetic and real data representing dynamics from across the physical and life sciences. We then demonstrate how we can automatically locate both Turing and Hopf bifurcations in unseen regions of parameter space. We finally apply our method to the analysis of spatial patterning of the ocellated lizard through development. We found that TRENDy's predicted effective state not only accurately predicts spatial changes over time but also identifies distinct pattern features unique to different anatomical regions, such as the tail, neck, and body‑‑an insight that highlights the potential influence of surface geometry on reaction‑diffusion mechanisms and their role in driving spatially varying pattern dynamics.
Authors: Dilipkumar N. Asthagiri, Arjun Valiya Parambathu, Thomas L. Beck
Abstract: Earlier we showed that in the molecular dynamics simulation of a rigid model of water it is necessary to use an integration time‑step δt ≤ 0.5 fs to ensure equipartition between translational and rotational modes. Here we extend that study in the NVT ensemble to NpT conditions and to an aqueous protein. We study neat liquid water with the rigid, SPC/E model and the protein BBA (PDB ID: 1FME) solvated in the rigid, TIP3P model. We examine integration time‑steps ranging from 0.5 fs to 4.0 fs for various thermostat plus barostat combinations. We find that a small δt is necessary to ensure consistent prediction of the simulation volume. Hydrogen mass repartitioning alleviates the problem somewhat, but is ineffective for the typical time‑step used with this approach. The compressibility, a measure of volume fluctuations, and the dielectric constant, a measure of dipole moment fluctuations, are also seen to be sensitive to δt. Using the mean volume estimated from the NpT simulation, we examine the electrostatic and van der Waals contribution to the hydration free energy of the protein in the NVT ensemble. These contributions are also sensitive to δt. In going from δt = 2 fs to δt = 0.5 fs, the change in the net electrostatic plus van der Waals contribution to the hydration of BBA is already in excess of the folding free energy reported for this protein.
Authors: Ajay N. Jain, Ann E. Cleves, W. Patrick Walters
Abstract: The diffusion learning method, DiffDock, for docking small‑molecule ligands into protein binding sites was recently introduced. Results included comparisons to more conventional docking approaches, with DiffDock showing superior performance. Here, we employ a fully automatic workflow using the Surflex‑Dock methods to generate a fair baseline for conventional docking approaches. Results were generated for the common and expected situation where a binding site location is known and also for the condition of an unknown binding site. For the known binding site condition, Surflex‑Dock success rates at 2.0 Angstroms RMSD far exceeded those for DiffDock (Top‑1/Top‑5 success rates, respectively, were 68/81% compared with 45/51%). Glide performed with similar success rates (67/73%) to Surflex‑Dock for the known binding site condition, and results for AutoDock Vina and Gnina followed this pattern. For the unknown binding site condition, using an automated method to identify multiple binding pockets, Surflex‑Dock success rates again exceeded those of DiffDock, but by a somewhat lesser margin. DiffDock made use of roughly 17,000 co‑crystal structures for learning (98% of PDBBind version 2020, pre‑2019 structures) for a training set in order to predict on 363 test cases (2% of PDBBind 2020) from 2019 forward. DiffDock's performance was inextricably linked with the presence of near‑neighbor cases of close to identical protein‑ligand complexes in the training set for over half of the test set cases. DiffDock exhibited a 40 percentage point difference on near‑neighbor cases (two‑thirds of all test cases) compared with cases with no near‑neighbor training case. DiffDock has apparently encoded a type of table‑lookup during its learning process, rendering meaningful applications beyond its reach. Further, it does not perform even close to competitively with a competently run modern docking workflow.
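The Top‑1/Top‑5 success rates quoted above can be illustrated with a minimal sketch. It assumes that a test case counts as a success when any of its first k ranked poses has an RMSD at or below the threshold; whether the comparison is inclusive is an assumption, and the function name, argument names, and RMSD values are all hypothetical and are not taken from the study.

```python
from typing import Sequence


def top_k_success_rate(
    rmsds_per_case: Sequence[Sequence[float]],
    k: int,
    threshold: float = 2.0,
) -> float:
    """Fraction of cases with at least one of the first k ranked poses
    at or below the RMSD threshold (in Angstroms)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not rmsds_per_case:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for poses in rmsds_per_case
        if any(rmsd <= threshold for rmsd in poses[:k])
    )
    return hits / len(rmsds_per_case)


# Made-up RMSD values (Angstroms), best-ranked pose first; illustration only.
example_cases = [
    [1.2, 3.5, 4.0],
    [2.8, 1.9, 5.1],
    [6.3, 7.0, 4.4],
    [0.9, 1.1, 2.5],
]

top1 = top_k_success_rate(example_cases, k=1)
top3 = top_k_success_rate(example_cases, k=3)
print(f"Top-1: {top1:.0%}, Top-3: {top3:.0%}")
```

Under this definition the Top‑k rate can never decrease as k grows, which is why a Top‑5 figure is always at least as large as the corresponding Top‑1 figure.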
Authors: Saverio Rossi, Leonardo Di Bari, Martin Weigt, Francesco Zamponi
Abstract: Protein evolution involves mutations occurring across a wide range of time scales. In analogy with disordered systems in statistical physics, this dynamical heterogeneity suggests strong correlations between mutations happening at distinct sites and times. To quantify these correlations, we examine the role of various fluctuation sources in protein evolution, simulated using a data‑driven energy landscape as a proxy for protein fitness. By applying spatio‑temporal correlation functions developed in the context of disordered physical systems, we disentangle fluctuations originating from the initial condition, i.e. the ancestral sequence from which the evolutionary process originated, from those driven by stochastic mutations along independent evolutionary paths. Our analysis shows that, in diverse protein families, fluctuations from the ancestral sequence predominate at shorter time scales. This allows us to identify a time scale over which ancestral sequence information persists, enabling its reconstruction. We link this persistence to the strength of epistatic interactions: ancestral sequences with stronger epistatic signatures impact evolutionary trajectories over extended periods. At longer time scales, however, ancestral influence fades as epistatically constrained sites evolve collectively. To confirm this idea, we apply a standard ancestral sequence reconstruction algorithm and verify that the time‑dependent recovery error is influenced by the properties of the ancestor itself. Overall, our results reveal that the properties of ancestral sequences, particularly their epistatic constraints, influence the initial evolutionary dynamics and the performance of standard ancestral sequence reconstruction algorithms.
Authors: Alberto Megías, Sergio Contreras Arredondo, Cheng Giuseppe Chen, Chenyu Tang, Benoît Roux, Christophe Chipot
Abstract: This contribution introduces a neural‑network‑based approach to discover meaningful transition pathways underlying complex biomolecular transformations in coherence with the committor function. The proposed path‑committor‑consistent artificial neural network (PCCANN) iteratively refines the transition pathway by aligning it to the gradient of the committor. This method addresses the challenge of sampling rare events in high‑dimensional spaces in molecular dynamics simulations, which is often computationally limited. Applied to various benchmark potentials and biological processes such as peptide isomerization and protein‑model folding, PCCANN successfully reproduces established dynamics and rate constants, while revealing bifurcations and alternate pathways. By enabling precise estimation of transition states and free‑energy barriers, this approach provides a robust framework for enhanced‑sampling simulations of rare events in complex biomolecular systems.
Authors: Katarzyna Janocha, Annabel Ling, Alice Godson, Yulia Lampi, Simon Bornschein, Nils Y. Hammerla
Abstract: Cell and immunotherapy offer transformative potential for treating diseases like cancer and autoimmune disorders by modulating the immune system. The development of these therapies is resource‑intensive, with the majority of drug candidates failing to progress beyond laboratory testing. While recent advances in machine learning have revolutionised areas such as protein engineering, applications in immunotherapy remain limited due to the scarcity of large‑scale, standardised datasets and the complexity of cellular systems. In this work, we address these challenges by leveraging a high‑throughput experimental platform to generate data suitable for fine‑tuning protein language models. We demonstrate how models fine‑tuned using a preference task show surprising correlations to biological assays, and how they can be leveraged for few‑shot hit maturation in CARs. This proof‑of‑concept presents a novel pathway for applying ML to immunotherapy and could generalise to other therapeutic modalities.
Authors: Daiheng Zhang, Chengyue Gong, Qiang Liu
Abstract: Deep generative models have achieved tremendous success in structure‑based drug design in recent years, especially for generating 3D ligand molecules that bind to a specific protein pocket. Notably, diffusion models have transformed ligand generation by providing exceptional quality and creativity. However, traditional diffusion models are restricted by their conventional learning objectives, which limit their broader applicability. In this work, we propose a new framework, FlowSBDD, which is based on the rectified flow model and allows us to flexibly incorporate additional losses to optimize specific targets and to introduce additional conditions, either as an extra input condition or by replacing the initial Gaussian distribution. Extensive experiments on CrossDocked2020 show that our approach can achieve state‑of‑the‑art performance on generating high‑affinity molecules while maintaining proper molecular properties without specifically designing the binding site, with up to ‑8.50 Avg. Vina Dock score and 75.0% Diversity.
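For readers unfamiliar with rectified flow, the generic training target can be sketched as follows. This is a toy illustration of the standard rectified flow objective (straight‑line interpolation between a source sample and a data sample, with the constant displacement as the regression target), not FlowSBDD's actual architecture or losses; every function name and value below is hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from source sample x0 to data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Rectified flow regression target: the constant displacement x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def rectified_flow_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    predicted = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy pair: x0 stands in for a sample from the initial distribution,
# x1 for a data sample. Values are made up.
x0 = [0.0, 0.0, 0.0]
x1 = [1.0, -2.0, 0.5]


def oracle(x, t):
    # Hypothetical perfect velocity model for this single pair.
    return target_velocity(x0, x1)


loss = rectified_flow_loss(oracle, x0, x1, t=0.3)
sample = euler_sample(oracle, x0, steps=10)
```

Because the target does not depend on the form of the source distribution, x0 can in principle be drawn from something other than a Gaussian, which is the kind of flexibility the abstract refers to.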
Authors: Xinyu Shi, Yixin Tao, Shih-Chi Lin
Abstract: The accurate prediction of B‑cell epitopes is critical for guiding vaccine development against infectious diseases, including SARS and COVID‑19. This study explores the use of a deep neural network (DNN) model to predict B‑cell epitopes for SARS‑CoV and SARS‑CoV‑2, leveraging a dataset that incorporates essential protein and peptide features. Traditional sequence‑based methods often struggle with large, complex datasets, but deep learning offers promising improvements in predictive accuracy. Our model employs regularization techniques, such as dropout and early stopping, to enhance generalization, while also analyzing key features, including isoelectric point and aromaticity, that influence epitope recognition. Results indicate an overall accuracy of 82% in predicting COVID‑19 negative and positive cases, with room for improvement in detecting positive samples. This research demonstrates the applicability of deep learning in epitope mapping, suggesting that such approaches can enhance the speed and precision of vaccine design for emerging pathogens. Future work could incorporate structural data and diverse viral strains to further refine prediction capabilities.
Authors: Xiao Lin, Mingjie Li, Yisen Wang
Abstract: Graph Neural Networks (GNNs) have garnered significant attention from researchers due to their outstanding performance in handling graph‑related tasks, such as social network analysis, protein design, and so on. Despite their widespread application, recent research has demonstrated that GNNs are vulnerable to backdoor attacks, implemented by injecting triggers into the training datasets. Trained on the poisoned data, GNNs will predict target labels when attaching trigger patterns to inputs. This vulnerability poses significant security risks for applications of GNNs in sensitive domains, such as drug discovery. While there has been extensive research into backdoor defenses for images, strategies to safeguard GNNs against such attacks remain underdeveloped. Furthermore, we point out that conventional backdoor defense methods designed for images cannot work well when directly implemented on graph data. In this paper, we first analyze the key difference between image backdoor and graph backdoor attacks. Then we tackle the graph defense problem by presenting a novel approach called MADE, which devises an adversarial mask generation mechanism that selectively preserves clean sub‑graphs and further leverages masks on edge weights to eliminate the influence of triggers effectively. Extensive experiments across various graph classification tasks demonstrate the effectiveness of MADE in significantly reducing the attack success rate (ASR) while maintaining a high classification accuracy.
Authors: Sajjad Abdollahramezani, Darrell Omo-Lamai, Gerlof Bosman, Omid Hemmatyar, Sahil Dagli, Varun Dolia, Kai Chang, Nicholas A. Gusken, Hamish C. Delgado, Geert-Jan Boons, Mark L. Brongersma, Fareeha Safir, Butrus T. Khuri-Yakub, Parivash Moradifar, Jennifer A. Dionne
Abstract: Empirical investigation of the quintillion‑scale, functionally diverse antibody repertoires that can be generated synthetically or naturally is critical for identifying potential biotherapeutic leads, yet remains burdensome. We present high‑throughput nanophotonics‑ and bioprinter‑enabled screening (HT‑NaBS), a multiplexed assay for large‑scale, sample‑efficient, and rapid characterization of antibody libraries. Our platform is built upon independently addressable pixelated nanoantennas exhibiting wavelength‑scale mode volumes, high‑quality factors (high‑Q) exceeding 5000, and pattern densities exceeding one million sensors per square centimeter. Our custom‑built acoustic bioprinter enables individual sensor functionalization via the deposition of picoliter droplets from a library of capture antigens at rates up to 25,000 droplets per second. We detect subtle differentiation in the target binding signature through spatially‑resolved spectral imaging of hundreds of resonators simultaneously, elucidating antigen‑antibody binding kinetic rates, affinity constant, and specificity. We demonstrate HT‑NaBS on a panel of antibodies targeting SARS‑CoV‑2, Influenza A, and Influenza B antigens, with a sub‑picomolar limit of detection within 30 minutes. Furthermore, through epitope binning analysis, we demonstrate the competence and diversity of a library of native antibodies targeting functional epitopes on a priority pathogen (H5N1 bird flu) and on glycosylated therapeutic Cetuximab antibodies against epidermal growth factor receptor. With a roadmap to image tens of thousands of sensors simultaneously, this high‑throughput, resource‑efficient, and label‑free platform can rapidly screen for high‑affinity and broad epitope coverage, accelerating biotherapeutic discovery and de novo protein design.
Authors: Jiangbin Zheng, Qianhui Xu, Ruichen Xia, Stan Z. Li
Abstract: Identifying T‑cell receptors (TCRs) that interact with antigenic peptides provides the technical basis for developing vaccines and immunotherapies. The emergent deep learning methods excel at learning antigen binding patterns from known TCRs but struggle with novel or sparsely represented antigens. However, binding specificity for unseen antigens or exogenous peptides is critical. We introduce a domain‑adaptive peptide‑agnostic learning framework DapPep for universal TCR‑antigen binding affinity prediction to address this challenge. The lightweight self‑attention architecture combines a pre‑trained protein language model with an inner‑loop self‑supervised regime to enable robust TCR‑peptide representations. Extensive experiments on various benchmarks demonstrate that DapPep consistently outperforms existing tools, showcasing robust generalization capability, especially for data‑scarce settings and unseen peptides. Moreover, DapPep proves effective in challenging clinical tasks such as sorting reactive T cells in tumor neoantigen therapy and identifying key positions in 3D structures.
Authors: Jiangbin Zheng, Ge Wang, Han Zhang, Stan Z. Li
Abstract: Computational protein design (CPD) offers transformative potential for bioengineering, but current deep CPD models, focused on universal domains, struggle with function‑specific designs. This work introduces a novel CPD paradigm tailored for functional design tasks, particularly for enzymes, a key protein class often lacking specific application efficiency. To address structural data scarcity, we present CrossDesign, a domain‑adaptive framework that leverages pretrained protein language models (PPLMs). By aligning protein structures with sequences, CrossDesign transfers pretrained knowledge to structure models, overcoming the limitations of scarce structural data. The framework combines autoregressive (AR) and non‑autoregressive (NAR) states in its encoder‑decoder architecture, applying it to enzyme datasets and pan‑proteins. Experimental results highlight CrossDesign's superior performance and robustness, especially with out‑of‑domain enzymes. Additionally, the model excels in fitness prediction when tested on large‑scale mutation data, showcasing its stability.
Authors: Burak Suyunu, Enes Taylan, Arzucan Özgür
Abstract: Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte‑Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400‑6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
Authors: Ali Azadbakht, Daniela J. Kraft
Abstract: Lipid membrane deformations have been predicted to lead to indirect forces between the objects that induce these deformations. Recent experimental measurements have found an attractive interaction between spherical particles that all induce a deformation towards the inside of a giant unilamellar vesicle (GUV). Here, we complement these experimental observations by investigating the interactions between deformations pointing in opposite directions with respect to the membrane normal vector. This is experimentally realized by a particle deforming the membrane towards the inside of the GUV and pulling a membrane tube towards the outside of the membrane. Particles completely wrapped by the membrane are repelled from the tube with a strength of 3 k_BT at a distance of 0.5 μm. However, particles that strongly curve the membrane by adhering only to a patch of about 50% of their surface area are attracted to the center of the tube with a strength of ‑5.3 k_BT at a minimum distance of about 1 μm. We find that such Janus particles also experience attractive interactions when both deform the membrane in the same way. These quantitative experimental observations provide new insights into interactions between objects that deform the membrane in opposite directions, which are important for cooperative protein assembly at cell membranes and for interactions of microplastics with cell membranes.
Authors: Jacob S. Feder, Benjamin S. Soloway, Shreya Verma, Zhi Z. Geng, Shihao Wang, Bethel Kifle, Emmeline G. Riendeau, Yeghishe Tsaturyan, Leah R. Weiss, Mouzhe Xie, Jun Huang, Aaron Esser-Kahn, Laura Gagliardi, David D. Awschalom, Peter C. Maurer
Abstract: Optically‑addressable spin qubits form the foundation of a new generation of emerging nanoscale sensors. The engineering of these sensors has mainly focused on solid‑state systems such as the nitrogen‑vacancy (NV) center in diamond. However, NVs are restricted in their ability to interface with biomolecules due to their bulky diamond host. Meanwhile, fluorescent proteins have become the gold standard in bioimaging, as they are genetically encodable and easily integrated with biomolecules. While fluorescent proteins have been suggested to possess a metastable triplet state, they have not been investigated as qubit sensors. Here, we realize an optically‑addressable spin qubit in the Enhanced Yellow Fluorescent Protein (EYFP) enabled by a novel spin‑readout technique. A near‑infrared laser pulse allows for triggered readout of the triplet state with up to 44% spin contrast. Using coherent microwave control of the EYFP spin at liquid‑nitrogen temperatures, we measure a spin‑lattice relaxation time of (141 ± 5) μs, a (16 ± 2) μs coherence time under Carr‑Purcell‑Meiboom‑Gill (CPMG) decoupling, and a predicted oscillating (AC) magnetic field sensitivity with an upper bound of 183 fT mol^{1/2} Hz^{-1/2}. We express the qubit in mammalian cells, maintaining contrast and coherent control despite the complex intracellular environment. Finally, we demonstrate optically‑detected magnetic resonance at room temperature in aqueous solution with contrast up to 3%, and measure a static (DC) field sensitivity with an upper bound of 93 pT mol^{1/2} Hz^{-1/2}. Our results establish fluorescent proteins as a powerful new qubit sensor platform and pave the way for applications in the life sciences that are out of reach for solid‑state technologies.
Authors: Yiming Ma, Fei Ye, Yi Zhou, Zaixiang Zheng, Dongyu Xue, Quanquan Gu
Abstract: Nature creates diverse proteins through a 'divide and assembly' strategy. Inspired by this idea, we introduce ProteinWeaver, a two‑stage framework for protein backbone design. Our method first generates individual protein domains and then employs an SE(3) diffusion model to flexibly assemble these domains. A key challenge lies in the assembling step, given the complex and rugged nature of the inter‑domain interaction landscape. To address this challenge, we employ preference alignment to discern complex relationships between structure and interaction landscapes through comparative analysis of generated samples. Comprehensive experiments demonstrate that ProteinWeaver: (1) generates high‑quality, novel protein backbones through versatile domain assembly; (2) outperforms RFdiffusion, the current state‑of‑the‑art in backbone design, by 13% and 39% for long‑chain proteins; (3) shows the potential for cooperative function design through illustrative case studies. To sum up, by introducing a 'divide‑and‑assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.
Authors: Subhadip Basu, Oded Farago
Abstract: Many ternary mixtures composed of saturated and unsaturated lipids with cholesterol (Chol) exhibit a region of coexistence between liquid‑disordered (L_d) and liquid‑ordered (L_o) domains, bearing some similarities to lipid rafts in biological membranes. However, biological rafts also contain many proteins that interact with the lipids and modify the distribution of lipids. Here, we extend a previously published lattice model of ternary DPPC/DOPC/Chol mixtures by introducing a small amount of small proteins (peptides). We use Monte Carlo simulations to explore the mixing phase behavior of the components as a function of the interaction parameter representing the affinity between the proteins and the saturated DPPC chains, and for different mixture compositions. At moderate fractions of DPPC, the system is in a two‑phase L_d+L_o coexistence, and the proteins exhibit a simple partition behavior between the phases that depends on the protein‑lipid affinity parameter. At low DPPC compositions, the mixture is in L_d phase with local nanoscopic ordered domains. Addition of proteins with sufficiently strong attraction to the saturated lipids can induce the separation of a distinct L_o large domain with tightly‑packed gel‑like clusters of proteins and saturated lipids. Consistent with the theory of phase transitions, we observe that the domain sizes grow when the mixture composition is in the vicinity of the critical point. Our simulations show that the addition of a small amount of proteins to such mixtures can cause their size to grow even further, and lead to the formation of metastable dynamic L_o domains with sizes comparable to biological rafts.
Authors: Tong You, Johan Bielecki, Filipe R. N. C. Maia
Abstract: Single‑particle imaging (SPI) using X‑ray free‑electron lasers (XFELs) offers the potential to determine protein structures at high spatial and temporal resolutions without the need for crystallization or vitrification. However, the technique faces challenges due to weak diffraction signals from single proteins and significant background scattering from gases used for sample delivery. A recent observation of a diffraction pattern from an isolated GroEL protein complex had similar numbers of signal and background photons. Ongoing efforts aim to reduce the background created by sample delivery, with one approach replacing most of the gas used with helium. In this study, we investigate the effects of a potentially reduced background on the resolution limits for SPI of isolated proteins under different experimental conditions. As a test case, we used GroEL, and we used experimentally measured parameters for our simulations. We observe that background significantly impacts the achievable resolution, particularly when the signal strength is comparable to the background, and that a background reduction would lead to a significant improvement in resolution.
Authors: Florian B. Hinz, Matthew R. Masters, Julia N. Kieu, Amr H. Mahmoud, Markus A. Lill
Abstract: Water plays a fundamental role in the structure and function of proteins and other biomolecules. The thermodynamic profile of water molecules surrounding a protein is critical for ligand binding and recognition. Therefore, identifying the location and thermodynamic behavior of relevant water molecules is important for generating and optimizing lead compounds for affinity and selectivity to a given target. Computational methods have been developed to identify these hydration sites, but are largely limited to simplified models that fail to capture multi‑body interactions, or dynamics‑based methods that rely on extensive sampling. Here we present a method for fast and accurate localization and thermodynamic profiling of hydration sites for protein structures. The method is based on a geometric deep neural network trained on a large, novel dataset of explicit water molecular dynamics simulations. We confirm the accuracy and robustness of our model on experimental data and demonstrate its utility on several case studies.
Authors: Andrew T. McNutt, Abhinav K. Adduri, Caleb N. Ellington, Monica T. Dayao, Eric P. Xing, Hosein Mohimani, David R. Koes
Abstract: Virtual screening of small molecules against protein targets can accelerate drug discovery and development by predicting drug‑target interactions (DTIs). However, structure‑based methods like molecular docking are too slow to allow for broad proteome‑scale screens, limiting their application in screening for off‑target effects or new molecular mechanisms. Recently, vector‑based methods using protein language models (PLMs) have emerged as a complementary approach that bypasses explicit 3D structure modeling. Here, we develop SPRINT, a vector‑based approach for screening entire chemical libraries against whole proteomes for DTIs and novel mechanisms of action. SPRINT improves on prior work by using a self‑attention based architecture and structure‑aware PLMs to learn drug‑target co‑embeddings for binder prediction, search, and retrieval. SPRINT achieves SOTA enrichment factors in virtual screening on LIT‑PCBA, DTI classification benchmarks, and binding affinity prediction benchmarks, while providing interpretability in the form of residue‑level attention maps. In addition to being both accurate and interpretable, SPRINT is ultra‑fast: querying the whole human proteome against the ENAMINE Real Database (6.7B drugs) for the 100 most likely binders per protein takes 16 minutes. SPRINT promises to enable virtual screening at an unprecedented scale, opening up new opportunities for in silico drug repurposing and development. SPRINT is available on the web as ColabScreen: https://bit.ly/colab‑screen
Authors: Simeon Minic, Luka Velickovic, Burkhard Annighöfer, Aurélien Thureau, Nikola Gligorijevic, Zorana Jovanovic, Annie Brûlet, Sophie Combet
Abstract: The red macroalga Porphyra, commonly known as Nori, is widely used as food around the world due to its high nutrient content, including the significant abundance of coloured phycobiliproteins (PBPs). Among these, R‑phycocyanin (R‑PC) stands out for its vibrant purple colour and numerous bioactive properties, making it a valuable protein for the food industry. However, R‑PC's limited thermal stability necessitates alternative processing methods to preserve its colour and bioactive properties. Our study aimed to investigate the in‑situ stability of oligomeric R‑PC under high pressure (HP) conditions (up to 4,000 bar) using a combination of absorption, fluorescence, and small‑angle X‑ray scattering (SAXS) techniques. The unfolding of R‑PC is a multiphase process. Initially, low pressure induces conformational changes in the R‑PC oligomeric form (trimers). As pressure increases above 1,600 bar, these trimers dissociate into monomers, and at pressures above 3,000 bar, the subunits begin to unfold. When returned to atmospheric pressure, R‑PC partially refolds, retaining 50% of its original colour absorbance. In contrast, heat treatment causes irreversible and detrimental effects on R‑PC colour, highlighting the advantages of HP treatment in preserving both the colour and bioactive properties of R‑PC compared to heat treatment. SIGNIFICANCE: HP is a powerful probe that reveals intermediate states of proteins through subtle structural changes not accessible by other denaturation methods. By combining HP small‑angle X‑ray scattering with HP absorption and fluorescence spectroscopy, we elucidate the multiphase unfolding process of R‑phycocyanin. This process includes: 1) conformational changes, 2) oligomer dissociation at moderate pressures, and 3) monomer unfolding. Our approach provides new opportunities for the structural determination of protein intermediates and oligomers using HP.
Authors: Mahsa Sheikholeslami, Navid Mazrouei, Yousof Gheisari, Afshin Fasihi, Matin Irajpour, Ali Motahharynia
Abstract: Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer‑based model that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time‑consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine‑tuned on approved drug‑target interactions and optimized with proximal policy optimization. By giving reward feedback from protein‑ligand binding affinity prediction using pre‑trained transformers (PLAPT) and a customized invalid structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation compared to 95.5% with DrugGPT and produces molecules with higher predicted binding affinities (7.22 [6.30‑8.07]) compared to DrugGPT (5.81 [4.97‑6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules targeting binding sites effectively. For example, in the case of fatty acid‑binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, ‑9.537 and FABP5/5, ‑8.399) compared to the reference molecule (Palmitic acid, ‑6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high‑quality small molecules, DrugGen provides a high‑performance medium for advancing pharmaceutical research and drug discovery.
Authors: Oriol Vilanova, Alberto Martinez-Serra, Marco P Monopoli, Giancarlo Franzese
Abstract: Nanoparticles (NPs) in contact with biological fluid adsorb biomolecules into a corona. This corona comprises proteins that strongly bind to the NP (hard corona) and loosely bound proteins (soft corona) that dynamically exchange with the surrounding solution. While the kinetics of hard corona formation is relatively well understood, thanks to experiments and robust simulation models, the experimental characterization and simulation of the soft corona present a more significant challenge. Here, we review the current state of the art in soft corona characterization and introduce a novel open‑source computational model to simulate its dynamic behavior, for which we provide the documentation. We focus on the case of transferrin (Tf) interacting with polystyrene NPs as an illustrative example, demonstrating how this model captures the complexities of the soft corona and offers deeper insights into its structure and behavior. We show that the soft corona is dominated by a glassy evolution that we relate to crowding effects. This work advances our understanding of the soft corona, bridging experimental limitations with improved simulation techniques.
Authors: Poorya Khajouie, Titli Sarkar, Krishna Rauniyar, Li Chen, Wu Xu, Vijay Raghavan
Abstract: Protein structures represent the key to deciphering biological functions. The more detailed form of similarity among these proteins is sometimes overlooked by the conventional structural comparison methods. In contrast, further advanced methods, such as Triangular Spatial Relationship (TSR), have been demonstrated to make finer differentiations. Still, the classical implementation of TSR does not provide for the integration of secondary structure information, which is important for a more detailed understanding of the folding pattern of a protein. To overcome these limitations, we developed the SSE‑TSR approach. The proposed method integrates secondary structure elements (SSEs) into TSR‑based protein representations. This allows an enriched representation of protein structures by considering 18 different combinations of helix, strand, and coil arrangements. Our results show that using SSEs improves the accuracy and reliability of protein classification to varying degrees. We worked with two large protein datasets of 9.2K and 7.8K samples, respectively. We applied the SSE‑TSR approach and used a neural network model for classification. Interestingly, introducing SSEs improved performance statistics for Dataset 1, with accuracy moving from 96.0% to 98.3%. For Dataset 2, where the performance statistics were already good, further small improvements were found with the introduction of SSE, giving an accuracy of 99.5% compared to 99.4%. These results show that SSE integration can dramatically improve TSR key discrimination, with significant benefits in datasets with low initial accuracies and only incremental gains in those with high baseline performance. Thus, SSE‑TSR is a powerful bioinformatics tool that improves protein classification and understanding of protein function and interaction.
Authors: Itai Carmeli, Chanoch Carmeli
Abstract: The interaction of light with photosynthetic proteins is an extremely efficient process and has been thoroughly investigated. However, exploring light‑matter interactions in hybrid nano‑solid‑photosynthetic proteins is a relatively new and exciting field of research. The properties of these hybrid materials significantly influence the energy levels, non‑radiative energy transfer, absorption, and fluorescence of the photosynthetic proteins upon interaction with light. There is special interest in leveraging these light‑matter interactions for applications such as photo‑sensing and converting light energy to electricity. The development of efficient devices requires the formation of a junction for oriented attachment, facilitating efficient energy and electronic transfer between the solids and the proteins. This review will outline the major advancements in solid‑state photosynthetic protein devices, elucidate the underlying mechanism, and assess electron transfer efficiency. Furthermore, it will explore and analyze the effect of plasmons on the enhancement of absorption, fluorescence, and photocurrent in hybrid devices.
Authors: Yiliang Yuan, Mustafa Misir
Abstract: Molecular docking is a major element in drug discovery and design. It enables the prediction of ligand‑protein interactions by simulating the binding of small molecules to proteins. Despite the availability of numerous docking algorithms, no single algorithm consistently outperforms the others across a diverse set of docking scenarios. This paper introduces GNNAS‑Dock, a novel Graph Neural Network (GNN)‑based automated algorithm selection system for molecular docking in blind docking situations. GNNs are used to process the complex structural data of both ligands and proteins. They benefit from the inherent graph‑like properties to predict the performance of various docking algorithms under different conditions. The present study pursues two main objectives: 1) predict the performance of each candidate docking algorithm, in terms of Root Mean Square Deviation (RMSD), thereby identifying the most accurate method for specific scenarios; and 2) choose the best computationally efficient docking algorithm for each docking case, aiming to reduce the time required for docking while maintaining high accuracy. We validate our approach on the PDBBind 2020 refined set, which contains about 5,300 pairs of protein‑ligand complexes.
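The two selection objectives described in this abstract can be sketched as below. This is only an illustration, not the GNNAS‑Dock implementation: the predicted RMSD values would come from a trained model, and the function names, algorithm labels, runtimes, and the 2.0 Å cutoff are all assumptions introduced here.

```python
from typing import Dict


def most_accurate(pred_rmsd: Dict[str, float]) -> str:
    """Objective 1: pick the algorithm with the lowest predicted RMSD."""
    return min(pred_rmsd, key=pred_rmsd.get)


def fastest_accurate(pred_rmsd: Dict[str, float],
                     runtime_s: Dict[str, float],
                     rmsd_cutoff: float = 2.0) -> str:
    """Objective 2: pick the fastest algorithm whose predicted RMSD is
    within the cutoff (the 2.0 angstrom default is an assumed value).
    Falls back to the most accurate algorithm if none qualifies."""
    acceptable = [a for a, r in pred_rmsd.items() if r <= rmsd_cutoff]
    if not acceptable:
        return most_accurate(pred_rmsd)
    return min(acceptable, key=lambda a: runtime_s[a])


# Hypothetical predictions for one protein-ligand pair.
pred = {"algo_a": 1.8, "algo_b": 1.2, "algo_c": 3.5}
time_s = {"algo_a": 10.0, "algo_b": 60.0, "algo_c": 5.0}
print(most_accurate(pred))             # lowest predicted RMSD
print(fastest_accurate(pred, time_s))  # fastest within the cutoff
```

The fallback in the second function is a design choice made for this sketch, so that a docking case always receives some algorithm even when no candidate is predicted to be accurate.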
Authors: Tommaso Nottoli, Mattia Bondanza, Filippo Lipparini, Benedetta Mennucci
Abstract: We present a polarizable embedding quantum mechanics/molecular mechanics (QM/MM) framework for ground‑ and excited‑state Complete Active Space Self‑Consistent Field (CASSCF) calculations on molecules within complex environments, such as biological systems. These environments are modeled using the AMOEBA polarizable force field. This approach is implemented by integrating the OpenMMPol library with the CFour quantum chemistry software suite. The implementation supports both single‑point energy evaluations and geometry optimizations, facilitated by the availability of analytical gradients. We demonstrate the methodology by applying it to two distinct photoreceptors, exploring the impact of the protein environment on the structural and photophysical properties of their embedded chromophores.
Authors: Xiang Li, Gagan Agrawal, Rajiv Ramnath, Ruoming Jin
Abstract: Graph‑level representations (and clustering/classification based on these representations) are required in a variety of applications. Examples include identifying malicious network traffic, prediction of protein properties, and many others. Often, data has to stay in isolated local systems (i.e., cannot be centrally shared for analysis) due to a variety of considerations like privacy concerns, lack of trust between the parties, regulations, or simply because the data is too large to be shared sufficiently quickly. This points to the need for federated learning for graph‑level representations, a topic that has not been explored much, especially in an unsupervised setting.
Addressing this problem, this paper presents a new framework we refer to as Federated Contrastive Learning of Graph‑level Representations (FCLG). As the name suggests, our approach builds on contrastive learning. However, what is unique is that we apply contrastive learning at two levels. The first application is for local unsupervised learning of graph representations. The second level is to address the challenge associated with data distribution variation (i.e., the "Non‑IID issue") when combining local models. Through extensive experiments on the downstream task of graph‑level clustering, we demonstrate FCLG outperforms baselines (which apply existing federated methods on existing graph‑level clustering methods) with significant margins.
Authors: Sangita Mondal, Ved Mahajan, Biman Bagchi
Abstract: Dimerization and subsequent aggregation of polymers and biopolymers often occur under nonequilibrium conditions. When the initial state of the polymer is not collapsed or the final folded native state, the dynamics of dimerization can follow a course sensitive to both the initial conditions and the conformational dynamics. Here we study the dimerization process by using computer simulations and analytical theory, where both monomeric polymer chains are in the elongated state and are initially placed at a separation distance, d0. Subsequent dynamics lead to the concurrent processes of collapse, dimerization and/or escape. We employ Langevin dynamics simulations with a coarse‑grained model of the polymer to capture certain aspects of the dimerization process. At separations d0 much shorter than the length of the monomeric polymer, the dimerization could happen fast and irreversibly, from the partly extended polymer state itself. At an initial separation larger than a critical distance, dc, the polymer collapse precedes dimerization and a significant number of single polymers do not dimerize within the time scale of simulations. To quantify this competition, we introduce several time‑dependent order parameters, namely, (i) the time‑dependent radius of gyration of individual polymers describing the conformational state of the polymer, (ii) a centre‑to‑centre of mass distance parameter RMM, and (iii) a time‑dependent overlap function Q(t) between the two monomeric polymers, mimicking the contact order parameter popular in protein folding. In order to better quantify the findings, we perform a theoretical analysis to capture the stochastic processes of collapse and dimerization by using a dynamical disorder model.
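As a rough illustration of the first two order parameters named in this abstract, the sketch below computes a radius of gyration and a centre‑of‑mass distance for bead coordinates from a single simulation frame. It assumes equal‑mass beads and uses hypothetical function names; the paper's overlap function Q(t) is not reproduced because its definition is not given here.

```python
import math
from typing import List, Sequence, Tuple

Bead = Sequence[float]  # (x, y, z) position of one coarse-grained bead


def center_of_mass(chain: List[Bead]) -> Tuple[float, ...]:
    """Centre of mass of a chain, assuming all beads have equal mass."""
    n = len(chain)
    return tuple(sum(b[k] for b in chain) / n for k in range(3))


def radius_of_gyration(chain: List[Bead]) -> float:
    """Root-mean-square distance of the beads from the centre of mass."""
    cm = center_of_mass(chain)
    msd = sum(sum((b[k] - cm[k]) ** 2 for k in range(3)) for b in chain)
    return math.sqrt(msd / len(chain))


def com_distance(chain_a: List[Bead], chain_b: List[Bead]) -> float:
    """Centre-to-centre of mass distance between two chains."""
    return math.dist(center_of_mass(chain_a), center_of_mass(chain_b))


# Two toy two-bead chains, parallel and 3 length units apart.
chain_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(0.0, 3.0, 0.0), (2.0, 3.0, 0.0)]
print(radius_of_gyration(chain_a), com_distance(chain_a, chain_b))
```

Evaluating both functions on every frame of a trajectory would give time series of the kind the abstract describes.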
Authors: Adrián Morales-Pastor, Raquel Vázquez-Reza, Miłosz Wieczór, Clàudia Valverde, Manel Gil-Sorribes, Bertran Miquel-Oliver, Álvaro Ciudad, Alexis Molina
Abstract: RNA is a vital biomolecule with numerous roles and functions within cells, and interest in targeting it for therapeutic purposes has grown significantly in recent years. However, fully understanding and predicting RNA behavior, particularly for applications in drug discovery, remains a challenge due to the complexity of RNA structures and interactions. While foundational models in biology have demonstrated success in modeling several biomolecules, especially proteins, achieving similar breakthroughs for RNA has proven more difficult. Current RNA models have yet to match the performance observed in the protein domain, leaving an important gap in computational biology. In this work, we present ChaRNABERT, a suite of sample and parameter‑efficient RNA foundational models, that through a learnable tokenization process, are able to reach state‑of‑the‑art performance on several tasks in established benchmarks. We extend its testing in relevant downstream tasks such as RNA‑protein and aptamer‑protein interaction prediction. Weights and inference code for ChaRNABERT‑8M will be provided for academic research use. The other models will be available upon request.
Authors: Jacques Fries, Javier Diaz, Marie Jardat, Ignacio Pagonabarraga, Pierre Illien, Vincent Dahirel
Abstract: The formation of condensates is now considered as a major organization principle of eukaryotic cells. Several studies have recently shown that the properties of these condensates are affected by enzymatic reactions. We propose here a simple generic model to study the interplay between two enzyme populations and a two‑state protein. In one state, the protein forms condensed droplets through attractive interactions, while in the other state, the proteins remain dispersed. Each enzyme catalyzes the production of one of these two protein states only when reactants are in its vicinity. A key feature of our model is the explicit representation of enzyme trajectories, capturing the fluctuations in their local concentrations. The spatially dependent growth rate of droplets naturally arises from the stochastic motion of these explicitly modeled enzymes.
Using two complementary numerical methods, (1) Brownian Dynamics simulations, and (2) a hybrid method combining Cahn‑Hilliard‑Cook diffusion equations with Brownian Dynamics for the enzymes, we investigate how enzyme concentration and dynamics influence the evolution with time, and the steady‑state number and size of droplets. Our results show that the concentration and diffusion coefficient of enzymes govern the formation and size‑selection of biocondensates.
Authors: Thanh V. T. Tran, Nhat Khang Ngo, Viet Anh Nguyen, Truong Son Hy
Abstract: Latent space optimization (LSO) is a powerful method for designing discrete, high‑dimensional biological sequences that maximize expensive black‑box functions, such as wet lab experiments. This is accomplished by learning a latent space from available data and using a surrogate model to guide optimization algorithms toward optimal outputs. However, existing methods struggle when labeled data is limited, as training the surrogate model with few labeled data points can lead to subpar outputs, offering no advantage over the training data itself. We address this challenge by introducing GROOT, a Graph‑based Latent Smoothing for Biological Sequence Optimization. In particular, GROOT generates pseudo‑labels for neighbors sampled around the training latent embeddings. These pseudo‑labels are then refined and smoothed by Label Propagation. Additionally, we theoretically and empirically justify our approach, demonstrating GROOT's ability to extrapolate to regions beyond the training set while maintaining reliability within an upper bound of their expected distances from the training regions. We evaluate GROOT on various biological sequence design tasks, including protein optimization (GFP and AAV) and three tasks with exact oracles from Design‑Bench. The results demonstrate that GROOT equalizes and surpasses existing methods without requiring access to black‑box oracles or vast amounts of labeled data, highlighting its practicality and effectiveness. We release our code at https://anonymous.4open.science/r/GROOT‑D554
Authors: Xavier Viader-Godoy, Maria Manosas, Felix Ritort
Abstract: Base stacking is crucial in nucleic acid stabilization, from DNA duplex hybridization to single‑stranded DNA (ssDNA) protein binding. While stacking energies are tiny in ssDNA, they are inextricably mixed with hydrogen bonding in DNA base pairing, making their measurement challenging. We conduct unzipping experiments with optical tweezers of short poly‑purine (dA and alternating dG and dA) sequences of 20‑40 bases. We introduce a helix‑coil model of the stacking‑unstacking transition that includes finite length effects and reproduces the force‑extension curves. Fitting the model to the experimental data, we derive the stacking energy per base, finding the salt‑independent value ΔG_0 = 0.14(3) kcal/mol for poly‑dA and ΔG_0 = 0.07(3) kcal/mol for poly‑dGdA. Stacking in these polymeric sequences is predominantly cooperative with a correlation length of ~4 bases at zero force. The correlation length reaches a maximum of ~10 and 5 bases at the stacking‑unstacking transition force of ~10 and 20 pN for poly‑dA and poly‑dGdA, respectively. The salt dependencies of the cooperativity parameter in ssDNA and the energy of DNA hybridization are in agreement, suggesting that double‑helix stability is primarily due to stacking. Analysis of poly‑rA and poly‑rC RNA sequences shows a larger stacking stability but a lower stacking correlation length of ~2 bases.
Authors: Xiaoliang Luo, Michael Ramscar, Bradley C. Love
Abstract: The impressive performance of large language models (LLMs) has led to their consideration as models of human language processing. Instead, we suggest that the success of LLMs arises from the flexibility of the transformer learning architecture. To evaluate this conjecture, we trained LLMs on scientific texts that were either in a forward or backward format. Despite backward text being inconsistent with the structure of human languages, we found that LLMs performed equally well in either format on a neuroscience benchmark, eclipsing human expert performance for both forward and backward orders. Our results are consistent with the success of transformers across diverse domains, such as weather prediction and protein design. This widespread success is attributable to LLM's ability to extract predictive patterns from any sufficiently structured input. Given their generality, we suggest caution in interpreting LLM's success in linguistic tasks as evidence for human‑like mechanisms.
Authors: Anya Chauhan, Ayush Noori, Zhaozhi Li, Yingnan He, Michelle M Li, Marinka Zitnik, Sudeshna Das
Abstract: Alzheimer's disease (AD) is a complex, progressive neurodegenerative disorder characterized by extracellular Aβ plaques, neurofibrillary tau tangles, glial activation, and neuronal degeneration, involving multiple cell types and pathways. Current models often overlook the cellular context of these pathways. To address this, we developed a multiscale graph neural network (GNN) model, ALZ PINNACLE, using brain omics data from donors spanning the entire aging to AD spectrum. ALZ PINNACLE is based on the PINNACLE GNN framework, which learns context‑aware protein, cell type, and tissue representations within a unified latent space. ALZ PINNACLE was trained on 14,951 proteins, 206,850 protein interactions, 7 cell types, and 48 cell subtypes or states. After pretraining, we investigated the learned embedding of APOE, the largest genetic risk factor for AD, across different cell types. Notably, APOE embeddings showed high similarity in microglial, neuronal, and CD8 cells, suggesting a similar role of APOE in these cell types. Fine tuning the model on AD risk genes revealed cell type contexts predictive of the role of APOE in AD. Our results suggest that ALZ PINNACLE may provide a valuable framework for uncovering novel insights into AD neurobiology.
Authors: Ji Woong Yu, Daeseong Yong, Bae-Yeun Ha, Changbong Hyeon
Abstract: Inclusions in mobile brushes experience apparent (depletion) attraction, which arises from a tendency to minimize the volume of depletion zones around the inclusions, thereby to maximize the entropy of the surrounding polymers. Here, we study the brush‑induced depletion attraction between cylindrical inclusions using molecular dynamics simulations and the Asakura‑Oosawa theory. Our considerations find that the correlation blobs defined in the brush environment serve as the fundamental units of the attraction. In tall brushes, however, the entropy of the overgrown polymer competes with the depletion attraction between the inclusions. As a result, the brush‑induced depletion interaction displays non‑monotonic variations with the brush height. Our study not only expands the repertoire of colloid‑polymer mixtures to depletion interactions in brushes, but also suggests the brush‑induced depletion interaction as a previously unappreciated mechanism for glycocalyx‑induced protein cluster formation on cell surfaces.
Authors: Peter St. John, Dejun Lin, Polina Binder, Malcolm Greaves, Vega Shah, John St. John, Adrian Lange, Patrick Hsu, Rajesh Illango, Arvind Ramanathan, Anima Anandkumar, David H Brookes, Akosua Busia, Abhishaike Mahajan, Stephen Malina, Neha Prasad, Sam Sinai, Lindsay Edwards, Thomas Gaudelet, Cristian Regep, Martin Steinegger, Burkhard Rost, Alexander Brace, Kyle Hippe, Luca Naef, Keisuke Kamata, George Armstrong, Kevin Boyd, Zhonglin Cao, Han-Yi Chou, Simon Chu, Allan dos Santos Costa, Sajad Darabi, Eric Dawson, Kieran Didi, Cong Fu, Mario Geiger, Michelle Gill, Darren J Hsu, Gagan Kaushik, Maria Korshunova, Steven Kothen-Hill, Youhan Lee, Meng Liu, Micha Livne, Zachary McClure, Jonathan Mitchell, Alireza Moradzadeh, Ohad Mosafi, Youssef Nashed, Saee Paliwal, Yuxing Peng, Sara Rabhi, Farhad Ramezanghorbani, Danny Reidenbach, Camir Ricketts, Brian C Roland, Kushal Shah, Tyler Shimko, Hassan Sirelkhatim, Savitha Srinivasan, Abraham C Stern, Dorota Toczydlowska, Srimukh Prasad Veccham, Niccolò Alberto Elia Venanzi, Anton Vorontsov, Jared Wilber, Isabel Wilkinson, Wei Jing Wong, Eva Xue, Cory Ye, Xin Yu, Yang Zhang, Guoqing Zhou, Becca Zandstein, Alejandro Chacon, Prashant Sohani, Maximilian Stadler, Christian Hundt, Feiwen Zhu, Christian Dallago, Bruno Trentini, Emine Kucukbenli, Saee Paliwal, Timur Rvachov, Eddie Calleja, Johnny Israeli, Harry Clifford, Risto Haukioja, Nicholas Haemel, Kyle Tretina, Neha Tadimeti, Anthony B Costa
Abstract: Artificial Intelligence models encoding biology and chemistry are opening new routes to high‑throughput and high‑quality in‑silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre‑training and fine‑tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT‑based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open‑source and free for everyone to use.
Authors: Ryan K. Krueger, Megan C. Engel, Ryan Hausen, Michael P. Brenner
Abstract: Developing physics‑based models for molecular simulation requires fitting many unknown parameters to diverse experimental datasets. Traditionally, this process is piecemeal and difficult to reproduce, leading to a fragmented landscape of models. Here, we establish a systematic, extensible framework for fitting coarse‑grained molecular models to macroscopic experimental data by leveraging recently developed methods for computing low‑variance gradient estimates with automatic differentiation. Using a widely validated DNA force field as an exemplar, we develop methods for optimizing structural, mechanical, and thermodynamic properties across a range of simulation techniques, including enhanced sampling and external forcing, spanning micro‑ and millisecond timescales. We highlight how gradients enable efficient sensitivity analyses that yield physical insight. We then demonstrate the broad applicability of these techniques by optimizing diverse biomolecular systems, including RNA and DNA‑protein hybrid models. We show how conflict‑free gradient methods from multi‑task learning can be adapted to impose multiple constraints simultaneously without compromising accuracy. This approach provides a foundation for transparent, reproducible, community‑driven force field development, accelerating progress in molecular modeling.
Authors: Abdurahman Ali Mohammed, Catherine Fonder, Donald S. Sakaguchi, Wallapak Tavanapong, Surya K. Mallapragada, Azeez Idris
Abstract: We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time‑consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with more variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models under this study are able to achieve sufficiently accurate count to replace the manual methods. The dataset is available at https://figshare.com/articles/dataset/Dataset/21970604.
Authors: Hanqing Bi, Suresh Neethirajan
Abstract: This study investigates the correlation between dairy farm characteristics and methane concentrations as derived from satellite observations in Eastern Canada. Utilizing data from 11 dairy farms collected between January 2020 and December 2022, we integrated Sentinel‑5P satellite methane data with critical farm‑level attributes, including herd genetics, feeding practices, and management strategies. Initial analyses revealed significant correlations with methane concentrations, leading to the application of Variance Inflation Factor (VIF) and Principal Component Analysis (PCA) to address multicollinearity and enhance model stability. Subsequently, machine learning models ‑ specifically Random Forest and Neural Networks ‑ were employed to evaluate feature importance and predict methane emissions. Our findings indicate a strong negative correlation between the Estimated Breeding Value (EBV) for protein percentage and methane concentrations, suggesting that genetic selection for higher milk protein content could be an effective strategy for emissions reduction. The integration of atmospheric transport models with satellite data further refined our emission estimates, significantly enhancing accuracy and spatial resolution. This research underscores the potential of advanced satellite monitoring, machine learning techniques, and atmospheric modeling in improving methane emission assessments within the dairy sector. It emphasizes the critical role of farm‑specific characteristics in developing effective mitigation strategies. Future investigations should focus on expanding the dataset and incorporating inversion modeling for more precise emission quantification. Balancing ecological impacts with economic viability will be essential for fostering sustainable dairy farming practices.
Authors: Jin Han, Wu-Jun Li
Abstract: Protein structure similarity search (PSSS), which tries to search proteins with similar structures, plays a crucial role across diverse domains from drug design to protein function prediction and molecular evolution. Traditional alignment‑based PSSS methods, which directly calculate alignment on the protein structures, are highly time‑consuming with high memory cost. Recently, alignment‑free methods, which represent protein structures as fixed‑length real‑valued vectors, have been proposed for PSSS. Although these methods have lower time and memory cost than alignment‑based methods, their time and memory cost is still too high for large‑scale PSSS, and their accuracy is unsatisfactory. In this paper, we propose a novel method, called protein structure hashing (POSH), for PSSS. POSH learns a binary vector representation for each protein structure, which can dramatically reduce the time and memory cost for PSSS compared with real‑valued vector representation based methods. Furthermore, in POSH we also propose expressive hand‑crafted features and a structure encoder to model both node and edge interactions in proteins. Experimental results on real datasets show that POSH can outperform other methods to achieve state‑of‑the‑art accuracy. Furthermore, POSH achieves a memory saving of more than six times and speed improvement of more than four times, compared with other methods.
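To see why binary codes reduce time and memory cost relative to real‑valued vectors, consider the minimal sketch below. It is not the POSH method: POSH learns its codes with a structure encoder, whereas this example simply thresholds a real vector by sign as a stand‑in. The function names and the toy database are assumptions.

```python
from typing import Dict, List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into the bits of one integer
    (bit i is set when vec[i] > 0). Stand-in for a learned hash function."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits: one XOR plus a bit count."""
    return bin(a ^ b).count("1")


def search(query: int, database: Dict[str, int], k: int = 1) -> List[str]:
    """Return the k database entries closest to the query in Hamming distance."""
    return sorted(database, key=lambda name: hamming(query, database[name]))[:k]


# Toy database of 3-bit codes for two hypothetical structures.
db = {"p1": 0b101, "p2": 0b010}
print(binarize([0.5, -1.0, 2.0]))  # bits 0 and 2 set
print(search(0b100, db))           # nearest code by Hamming distance
```

Each code here occupies one bit per dimension instead of a 32‑ or 64‑bit float, and comparison is a bitwise operation, which is the general source of the savings that hashing‑based search relies on.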
Authors: Michael A. Sauer, Souvik Mondal, Brandon Neff, Sthitadhi Maiti, Matthias Heyden
Abstract: Protein function does not solely depend on structure but often relies on dynamical transitions between distinct conformations. Despite this fact, our ability to characterize or predict protein dynamics is substantially less developed compared to state‑of‑the‑art protein structure prediction. Molecular simulations provide unique opportunities to study protein dynamics, but the timescales associated with conformational changes generate substantial challenges. Enhanced sampling algorithms with collective variables can greatly reduce the computational cost of sampling slow processes. However, defining collective variables suitable to enhance sampling of protein conformational transitions is non‑trivial. Low‑frequency vibrations have long been considered as promising candidates for collective variable but their identification so far relied on assumptions inherently invalid at low frequencies. We recently introduced an analysis of molecular vibrations that does not rely on such approximations and remains accurate at low frequencies. Here, we modified this approach to efficiently isolate low‑frequency vibrations in proteins and applied it to a set of five proteins of varying complexity. We demonstrate that our approach is not only highly reproducible but results in collective variables that consistently enhance sampling of protein conformational transitions and associated free energy surfaces on timescales compatible with high throughput applications. This enables the efficient generation of protein conformational ensembles, which will be key for future prediction algorithms aiming beyond static protein structures.
Authors: Tetsuhiro S. Hatakeyama
Abstract: Increasing the enzyme concentration generally speeds up enzymatic reactions. However, in this Letter, we show that increasing the enzyme concentration can also slow down the enzymatic reaction. We consider a simple allosteric protein with multiple modification sites, catalyzed by two enzymes with the same catalytic activity, but slightly different affinities. We show that increasing the concentration of one enzyme can slow the relaxation to the equilibrium state. The mechanism for this slowing is similar to the Markovian Mpemba effect, and we name this phenomenon the Enzymatic Mpemba effect.
Authors: Jessica Irons, Patrick Cooper, Melanie McGrath, Shahroz Tariq, Andreas Duenser
Abstract: Artificial intelligence (AI) tools are now prevalent in many knowledge work industries. As AI becomes more capable and interactive, there is a growing need for guidance on how to employ AI most effectively. The A2C framework (Tariq, Chhetri, Nepal & Paris, 2024) distinguishes three decision‑making modes for engaging AI: automation (AI completes a task, including decision/action), augmentation (AI supports human to decide) and collaboration (iterative interaction between human and AI). However, selecting the appropriate mode for a specific application is not always straightforward. The goal of the present study was to compile and trial a simple set of criteria to support recommendations about appropriate A2C mode for a given application. Drawing on human factors and computer science literature, we identified key criteria related to elements of the task, impacts on worker and support needs. From these criteria we built a scoring rubric with recommendation for A2C mode. As a preliminary test of this approach, we applied the criteria to cognitive task analysis (CTA) outputs from three tasks in the science domain ‑ genome annotation, biological collections curation and protein crystallization ‑ which provided insights into worker decision points, challenges and expert strategies. This paper describes the method for connecting CTA to A2C, reflecting on the challenges and future directions.
Authors: Elsa Perez-Martin, Tristan Beranger, Laurent Bonnet, Frederic Teppe, Alvydas Lisauskas, Ketsukis Ikamas, Elwin Vrouwe, Elena Floriani, Gergely Katona, Didier Marguet, Vania Calandrini, Marco Pettini, Sandra Ruffenach, Jeremie Torres
Abstract: Electrodynamic interactions between biomolecules are of potential biological interest for signaling, warranting investigation of their activation through various mechanisms in living systems. Here, using as a model system a light harvesting protein within the phycobilisome antenna system of red algae, we proved that not only light exposure but also thermal energy alone can trigger attractive electrodynamic interactions up to hundreds of nanometers. The latter are sustained by low frequency collective modes, and while the second mode appears only upon illumination, the fundamental one can be activated by temperature alone. Activation of such collective modes and ED interactions might influence conformational rearrangements and energy transport within the phycobilisome antenna system. This is a paradigm shift that underscores the immense potential of biological systems in exploiting different forms of input energy to achieve optimal energy transfer.
Authors: Albin Joy, Anand Srivastava, Rajib Biswas
Abstract: Azurin and its derived peptides, notably p28, exhibit significant anticancer properties, primarily by stabilizing the tumor suppressor protein p53 and preventing its degradation. Previous studies have shown that p28 binds to p53's DNA‑binding domain, protecting it from degradation mechanisms. Expanding on these findings, our research explored whether p28 acts on additional cancer pathways beyond p53 stabilization. Specifically, we examined the interactions between p28 and Human Double Minute 2 (HDM2), a protein that downregulates p53's tumor‑suppressive activity by binding to its transactivation domain (TAD). HDM2 is crucial in diminishing p53's function, and our study aimed to determine if p28 disrupts this HDM2‑p53 interaction. Using HADDOCK docking and molecular dynamics simulations, we identified three stable conformations of the HDM2‑p28 complex. These conformations effectively block HDM2's hydrophobic pocket, allowing for sustained inter‑chain interactions and showing favorable binding energies. Further analysis pinpointed essential residues in these interactions, and we calculated interaction energies using the Molecular Mechanics Poisson‑Boltzmann Surface Area (MMPBSA) method. Our findings reveal that by blocking HDM2's binding sites, p28 helps maintain p53's transcriptional activity, thus enhancing its tumor‑suppressive functions, including apoptosis and cell cycle arrest in cancer cells. This study enhances understanding of azurin‑derived peptides' anticancer mechanisms and highlights p28's potential as a peptide‑based anticancer agent. These findings also suggest the possibility of designing additional peptide therapies targeting HDM2 and other cancer‑related pathways, opening new directions in anticancer therapeutics.
Authors: Luis F Seoane, Henry Secaira-Morocho, Ester Lázaro, Susanna Manrubia
Abstract: Understanding how viral mutant spectra organize and explore genotype space is essential for unraveling the mechanisms driving evolution at the finest scale. Here we use deep‑sequencing data of an amplicon in the A2 protein of the RNA bacteriophage Qβ to reconstruct genotype networks with tens of thousands of different haplotypes. The study of populations evolved under different temperature regimes uncovers generic topological features conditioned by fundamental structural motifs of genotype networks ‑‑ tetrahedrons, triangles, and squares ‑‑ that govern their local architecture. Mutant swarms display a hierarchical structure where sequences cluster around a highly connected and abundant sequence core that sustains population diversity. The immediate neighborhood of this core is comprehensively sampled, with no signs of selection, while a few mutations away sampling becomes dynamical and sparse, showing signs of purifying selection. By aggregating genotype networks from populations adapted to different temperatures, we capture the early stages of evolutionary divergence, with overlapping populations that remain connected through short mutational paths. Even at the time scale of these experiments, evolutionary pathways might be multiple, preventing the backward reconstruction of unique trajectories once mutations have been fixed. This analysis provides a detailed view of the local, fine‑scale processes shaping viral quasispecies evolution and underscores the usefulness of genotype networks as an enlightening visualization of the organization of mutant swarms.
Authors: Aya Abdelsalam Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Stanton, Taylor Joren, Joseph Kleinhenz, Allen Goodman, Héctor Corrada Bravo, Kyunghyun Cho, Nathan C. Frey
Abstract: We introduce Concept Bottleneck Protein Language Models (CB‑pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision‑making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre‑training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model's capabilities through real‑world experiments and expert knowledge. We scale our CB‑pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
Authors: Raushan Singh, Jaroslaw Glowacki, Marius Beaud, Federica Padovano, Robert S. Manning, John H. Maddocks
Abstract: DNA minicircles are closed double‑stranded DNA (dsDNA) fragments that have been demonstrated to be an important experimental tool to understand supercoiled, or stressed, DNA mechanics, such as nucleosome positioning and DNA‑protein interactions. Specific minicircles can be simulated using Molecular Dynamics (MD) simulation. However, the enormous sequence space makes it unfeasible to exhaustively explore the sequence‑dependent mechanics of DNA minicircles using either experiment or MD. For linear fragments, the cgNA+ model, a computationally efficient sequence‑dependent coarse‑grained model using enhanced Curves+ internal coordinates (rigid base plus rigid phosphate) of double‑stranded nucleic acids (dsNAs), predicts highly accurate nonlocal sequence‑dependent equilibrium distributions for an arbitrary sequence when compared with MD simulations. This article addresses the problem of modeling sequence‑dependent topologically closed and, therefore, stressed fragments of dsDNA. We introduce cgNA+min, a computational approach within the cgNA+ framework, which extends the cgNA+ model applicability to compute the sequence‑dependent energy minimising configurations of covalently closed dsDNA minicircles of various lengths and linking numbers (Lk). The main idea is to derive the appropriate chain rule to express the cgNA+ energy in absolute coordinates involving quaternions where the closure condition is simple to handle. We also present a semi‑analytic method for efficiently computing sequence‑dependent initial minicircles having arbitrary Lk and length. For different classes and lengths of sequences, we demonstrate that the dsDNA minicircle energies computed using cgNA+min agree well with the energies approximated from experimentally measured J‑factor values. Finally, we present the minicircle shape, energy, and multiplicity of Lk for more than 120K random DNA sequences of different lengths.
Authors: Aayush Shah, Shankar Jayaratnam
Abstract: Large language models (LLMs) have demonstrated significant success in natural language processing (NLP) tasks and have shown promising results in other domains such as protein sequence generation. However, there remain salient differences between LLMs used for NLP, which effectively handle multiple tasks and are available in small sizes, and protein language models that are often specialized for specific tasks and only exist in larger sizes. In this work, we introduce two small protein language models, based on Llama‑3‑8B and Phi‑3‑mini, that are capable of both uncontrollable and controllable protein generation. For the uncontrollable generation task, our best model achieves an average pLDDT score of 69.75, demonstrating robust performance in generating viable protein structures. For the controllable generation task, in which the model generates proteins according to properties specified in the prompt, we achieve a remarkable average TM‑Score of 0.84, indicating high structural similarity to target proteins. We chose 10 properties, including six classes of enzymes, to extend the capabilities of prior protein language models. Our approach utilizes the Low‑Rank Adaptor (LoRA) technique, reducing trainable parameters to just 4% of the original model size, lowering computational requirements. By using a subset of the UniRef50 dataset and small models, we reduced the overall training time by 70% without compromising performance. Notably, Phi‑3‑mini reduced trainable parameters by 60%, decreasing training cost by 30% compared to Llama 3. Consequently, Phi‑3 achieved a comparable TM‑Score of 0.81, demonstrating that smaller models can match the performance of larger ones, like Llama 3. We also demonstrate the deployment of our models on the energy efficient ET‑SoC‑1 chip, significantly improving the TPS/W by a factor of 3.
Authors: Zhenze Yang, Sarah K. Yorke, Tuomas P. J. Knowles, Markus J. Buehler
Abstract: Peptides are ubiquitous and important biologically derived molecules that have been found to self‑assemble into a wide array of structures. Extensive research has explored the impacts of both internal chemical composition and external environmental stimuli on the self‑assembly behavior of these systems. However, there is yet to be a systematic study that gathers this rich literature data and collectively examines these experimental factors to provide a global picture of the fundamental rules that govern protein self‑assembly behavior. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining facilitated by a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self‑assembly phases. Utilizing the collected data, ML models are trained and evaluated, demonstrating excellent accuracy (>80%) and efficiency in peptide assembly phase classification. Moreover, we fine‑tune our GPT model for peptide literature mining with the developed dataset, which exhibits markedly superior performance in extracting information from academic publications relative to the pre‑trained model. We find that this workflow can substantially improve efficiency when exploring potential self‑assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self‑assembly. In doing so, novel structures can be accessed for a range of applications including sensing, catalysis and biomaterials.
Authors: Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gräter, Jan Stühmer
Abstract: We introduce a generative model for protein backbone design utilizing geometric products and higher order message passing. In particular, we propose Clifford Frame Attention (CFA), an extension of the invariant point attention (IPA) architecture from AlphaFold2, in which the backbone residue frames and geometric features are represented in the projective geometric algebra. This enables the construction of geometrically expressive messages between residues, including higher order terms, using the bilinear operations of the algebra. We evaluate our architecture by incorporating it into the framework of FrameFlow, a state‑of‑the‑art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property that many state‑of‑the‑art generative models have so far achieved only insufficiently.
Authors: Prakash Chourasia, Tamkanat E Ali, Sarwan Ali, Murray Patterson
Abstract: Federated Learning (FL) is a distributed learning technique that maintains data privacy by providing a decentralized training method for machine learning models using distributed big data. This promising approach has also gained popularity in bioinformatics, where the privacy of biomedical data holds immense importance, especially when patient data is involved. Despite the successful implementation of federated learning in biological sequence analysis, rigorous consideration is still required to improve accuracy without compromising data privacy. Additionally, the optimal integration of federated learning, especially in protein sequence analysis, has not been fully explored. We propose a deep feed‑forward neural network‑based enhanced federated learning method for protein sequence classification to overcome these challenges. Our method introduces novel enhancements to improve classification accuracy. We introduce dynamic weighted federated learning (DWFL), a federated learning‑based approach in which local model weights are adjusted using weighted averaging based on their performance metrics. By assigning higher weights to well‑performing models, we aim to create a more potent initial global model for the federated learning process, leading to improved accuracy. We conduct experiments using real‑world protein sequence datasets to assess the effectiveness of DWFL. The results obtained using our proposed approach demonstrate significant improvements in model accuracy, making federated learning a preferred, more robust, and privacy‑preserving approach for collaborative machine‑learning tasks.
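The performance-weighted averaging described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_average`, the use of a single scalar score per client, and the normalization of scores to sum to one are all assumptions made for illustration.

```python
def weighted_average(local_weights, scores):
    """Combine local model weight vectors into one global vector.

    Hypothetical helper: each client's vector is scaled by its share of the
    total performance score, so better-performing clients count for more.
    """
    if not local_weights or len(local_weights) != len(scores):
        raise ValueError("need one score per local model")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    coeffs = [s / total for s in scores]  # normalized aggregation weights
    dim = len(local_weights[0])
    return [
        sum(c * w[i] for c, w in zip(coeffs, local_weights))
        for i in range(dim)
    ]


# Three clients with two parameters each; scores stand in for validation accuracy.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
accuracies = [0.5, 0.25, 0.25]
global_weights = weighted_average(clients, accuracies)
print(global_weights)
```

With equal scores this reduces to a plain unweighted mean of the client vectors; unequal scores shift the global model toward the better-performing clients.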
Authors: Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza
Abstract: The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity.
This study investigates the integration of state‑of‑the‑art LLMs, including GPN‑MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well‑annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state‑of‑the‑art tools, especially in handling ambiguous and clinically uncertain variants.
The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these advanced computational models in clinical environments, where they can significantly enhance the diagnostic processes for genetic disorders, ultimately pushing the boundaries of personalized medicine by offering more detailed and actionable genetic insights.
Authors: Klemens Flöge, Srisruthi Udayakumar, Johanna Sommer, Marie Piraud, Stefan Kesselheim, Vincent Fortuin, Stephan Günnemann, Karel J van der Weg, Holger Gohlke, Erinc Merdivan, Alina Bazarova
Abstract: Recent advances in Artificial Intelligence have enabled multi‑modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi‑modal AI for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine‑tuning scheme that focuses on pairwise alignment with sequence data rather than requiring full matches. This novel approach comprises a mix of Graph Neural Networks and transformer architectures. It demonstrates strong performance in retrieval tasks and showcases the efficacy of multi‑modal systems in Protein Machine Learning through a broad spectrum of downstream baselines, including enzyme function prediction and binding site analysis. Furthermore, OneProt enables the transfer of representational information from specialized encoders to the sequence encoder, enhancing capabilities for distinguishing evolutionarily related and unrelated sequences and exhibiting representational properties where evolutionarily related proteins align in similar directions within the latent space. In addition, we extensively investigate modality ablations to identify the encoders that contribute most to predictive performance, highlighting the significance of the binding site encoder, which has not been used in similar models previously. This work expands the horizons of multi‑modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.
Authors: Mohammad Tabish, Neil K. Chada, Stefan Klus
Abstract: The Koopman operator plays a crucial role in analyzing the global behavior of dynamical systems. Existing data‑driven methods for approximating the Koopman operator or discovering the governing equations of the underlying system typically require a fixed set of basis functions, also called dictionary. The optimal choice of basis functions is highly problem‑dependent and often requires domain knowledge. We present a novel gradient descent‑based optimization framework for learning suitable and interpretable basis functions from data and show how it can be used in combination with EDMD, SINDy, and PDE‑FIND. We illustrate the efficacy of the proposed approach with the aid of various benchmark problems such as the Ornstein‑Uhlenbeck process, Chua's circuit, a nonlinear heat equation, as well as protein‑folding data.
Authors: A. M. Begun, A. A. Korneev, A. V. Zorina
Abstract: Protein MJ0366 is a hypothetical protein from Methanocaldococcus jannaschii that has a rare and complex knot in its structure. The knot is a right‑handed trefoil knot that involves about half of the protein's residues. In this article, we investigate the thermal stability of protein MJ0366 using numerical simulations based on molecular dynamics and Monte Carlo methods. We compare the results with those of a similar unknotted protein and analyze the effects of the knot on the folding and unfolding processes. We show that the knot in protein MJ0366 increases its thermal stability by creating a topological barrier that prevents the protein from unfolding at high temperatures. We also discuss the possible biological implications of the knot for the function and evolution of protein MJ0366.
Authors: Martin Charron, Breeana Elliott, Nada Kerrouri, Liqun He, Vincent Tabard-Cossa
Abstract: Inspired by its central role in many biological processes, the transport of biopolymers across nanoscale pores is at the heart of a single‑molecule sensing technology aimed at nucleic acid and protein sequencing, as well as biomarker detection. When electrophoretically driven through a pore by an electric potential gradient, a translocating polymer hinders the flow of ions, producing a transient current blockage signature that can be mapped to physicochemical properties of the polymer. Although investigated theoretically and by simulations, few experimental studies have attempted to validate the predicted transport properties, mainly due to the complex nature of the non‑equilibrium translocation process. Here, we elucidate these fundamental concepts by constructing a patterned DNA nanostructure whose current signatures allow measurement of the instantaneous velocity throughout the translocation process. With simple physical insights from polymer and fluid dynamics, we show how the resulting molecular velocity profiles can be used to investigate the nanoscale forces at play and their dependence on experimental parameters such as polymer length, pore size and voltage. These results allow testing of theoretical models and outline their limitations. In addition to bridging experiment and theory, knowledge of the velocity fluctuation and force scaling during passage can assist researchers in designing nanopore experiments with optimized sensing performance.
Authors: Niklas Schmidinger, Lisa Schneckenreiter, Philipp Seidl, Johannes Schimunek, Pieter-Jan Hoedt, Johannes Brandstetter, Andreas Mayr, Sohvi Luukkonen, Sepp Hochreiter, Günter Klambauer
Abstract: Language models for biological and chemical sequences enable crucial applications such as drug discovery, protein engineering, and precision medicine. Currently, these language models are predominantly based on Transformer architectures. While Transformers have yielded impressive results, their quadratic runtime dependency on the sequence length complicates their use for long genomic sequences and in‑context learning on proteins and chemical sequences. Recently, the recurrent xLSTM architecture has been shown to perform favorably compared to Transformers and modern state‑space model (SSM) architectures in the natural language domain. Similar to SSMs, xLSTMs have a linear runtime dependency on the sequence length and allow for constant‑memory decoding at inference time, which makes them prime candidates for modeling long‑range dependencies in biological and chemical sequences. In this work, we tailor xLSTM towards these domains and propose a suite of architectural variants called Bio‑xLSTM. Extensive experiments in three large domains, genomics, proteins, and chemistry, were performed to assess xLSTM's ability to model biological and chemical sequences. The results show that models based on Bio‑xLSTM a) can serve as proficient generative models for DNA, protein, and chemical sequences, b) learn rich representations for those modalities, and c) can perform in‑context learning for proteins and small molecules.
Authors: Keir Adams, Kento Abeywardane, Jenna Fromer, Connor W. Coley
Abstract: Engineering molecules to exhibit precise 3D intermolecular interactions with their environment forms the basis of chemical design. In ligand‑based drug design, bioisosteric analogues of known bioactive hits are often identified by virtually screening chemical libraries with shape, electrostatic, and pharmacophore similarity scoring functions. We instead hypothesize that a generative model which learns the joint distribution over 3D molecular structures and their interaction profiles may facilitate 3D interaction‑aware chemical design. We specifically design ShEPhERD, an SE(3)‑equivariant diffusion model which jointly diffuses/denoises 3D molecular graphs and representations of their shapes, electrostatic potential surfaces, and (directional) pharmacophores to/from Gaussian noise. Inspired by traditional ligand discovery, we compose 3D similarity scoring functions to assess ShEPhERD's ability to conditionally generate novel molecules with desired interaction profiles. We demonstrate ShEPhERD's potential for impact via exemplary drug design tasks including natural product ligand hopping, protein‑blind bioactive hit diversification, and bioisosteric fragment merging.
Authors: Zhenning Liu, Xiantao Li, Chunhao Wang, Jin-Peng Liu
Abstract: Modeling and simulating the protein folding process overall remains a grand challenge in computational biology. We systematically investigate end‑to‑end quantum algorithms for simulating various protein dynamics with effects, such as mechanical forces or stochastic noises. A major focus is the read‑in of system settings for simulation, for which we discuss (i) efficient quantum algorithms to prepare initial states‑‑whether for ensemble or single‑state simulations, in particular, the first efficient procedure for preparing Gaussian pseudo‑random amplitude states, and (ii) the first efficient loading of the connectivity matrices of the protein structure. For the read‑out stage, our algorithms estimate a range of classical observables, including energy, low‑frequency vibrational modes, density of states, displacement correlations, and optimal control parameters. Between these stages, we simulate the dynamic evolution of the protein system, by using normal mode models‑‑such as Gaussian network models (GNM) and all‑atom normal mode models. In addition, we conduct classical numerical experiments focused on accurately estimating the density of states and applying optimal control to facilitate conformational changes. These experiments serve to validate our claims regarding potential quantum speedups. Overall, our study demonstrates that quantum simulation of protein dynamics represents a robust, end‑to‑end application for both early‑stage and fully fault‑tolerant quantum computing.
Authors: Ángel Morán Ledezma
Abstract: In this work, we study the dynamics of complex systems with time‑dependent transition rates, focusing on p‑adic analysis in modeling such systems. Starting from the master equation that governs the stochastic dynamics of a system with a large number of interacting components, we generalize it by p‑adically parametrizing the metabasins to account for states that are organized in a fractal and hierarchical manner within the energy landscape. This leads to a not necessarily time‑homogeneous Markov process described by a time‑dependent operator acting on an ultrametric space. We prove well‑posedness of the initial value problem and analyze the stochastic nature of the master equation with a time‑dependent transition operator. We demonstrate how ultrametricity simplifies the description of intra‑metabasin dynamics without increasing computational complexity. We apply our theoretical framework to two scenarios: glass relaxation under rapid cooling and protein folding dynamics influenced by temperature variations. In the glass relaxation model, we observe anomalous relaxation behavior where the dynamics slow down during cooling, with lasting effects depending on how drastic the temperature drop is. In the protein folding model, we incorporate temperature‑dependent transition rates to simulate folding and unfolding processes across the melting temperature. Our results capture a "whiplash" effect: from an unfolded state, the system folds and then returns to an unfolded state (which may differ from the initial one) in response to temperature changes. This study demonstrates the effectiveness of p‑adic parametrization and ultrametric analysis in modeling complex systems with dynamic transition rates, providing analytical solutions that improve our understanding of relaxation processes in material and biological systems.
Authors: Lucas Hedström, Seong-Gyu Yang, Ludvig Lizana
Abstract: We present a novel framework for understanding node target search in systems organized as hierarchical networks‑within‑networks. Our work generalizes traditional search models on complex networks, where the mean‑first passage time is typically inversely proportional to the node degree. However, real‑world search processes often span multiple network layers, such as moving from an external environment into a local network, and then navigating several internal states. This multilayered complexity appears in scenarios such as international travel networks, tracking email spammers, and the dynamics of protein‑DNA interactions in cells. Our theory addresses these complex systems by modeling them as a three‑layer multiplex network: an external source layer, an intermediate spatial layer, and an internal state layer. We derive general closed‑form solutions for the steady‑state flux through a target node, which serves as a proxy for inverse mean‑first passage time. Our results reveal a universal relationship between search efficiency and network‑specific parameters. This work extends the current understanding of multiplex networks by focusing on systems with hierarchically connected layers. Our findings have broad implications for fields ranging from epidemiology to cellular biology and provide a more comprehensive understanding of search dynamics in complex, multilayered environments.
Authors: Anton Bushuiev, Roman Bushuiev, Olga Pimenova, Nikola Zadorozhny, Raman Samusevich, Elisabet Manaskova, Rachel Seongeun Kim, Hannes Stärk, Jiri Sedlar, Martin Steinegger, Tomáš Pluskal, Josef Sivic
Abstract: Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self‑supervised pre‑training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self‑supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test‑Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state‑of‑the‑art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody‑antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general‑purpose AlphaFold2 and ESMFold struggle.
Authors: Hrant Topchyan, Win Nuding, Andreas Klümper, Ara Sedrakyan
Abstract: The Harris criterion imposes a constraint on the critical behavior of a system upon introduction of new disorder, based on its dimension d and localization length exponent ν. It states that the new disorder can be relevant only if dν < 2. We analyze the applicability of the Harris criterion to the GKNS network disorder formulated in the paper [I. A. Gruzberg, A. Klümper, W. Nuding, and A. Sedrakyan, Phys. Rev. B 95, 125414 (2017)] and show that the fluctuations of the geometry are relevant despite dν > 2, implying that the Harris criterion should be modified. We have observed that the fluctuations of the critical point in different quenched configurations of disordered network blocks are of order L^0, i.e., they do not depend on the block size L, in contrast to the expectation based on the Harris criterion that they should decrease as L^(‑d/2) according to the central limit theorem. Since L^0 > (x ‑ x_c) is always satisfied near the critical point, the mentioned network disorder is relevant and the critical indices of the system can be changed. We have also shown that the GKNS disordered network is fundamentally different from Voronoi‑Delaunay and dynamically triangulated random lattices: the probability of higher connectivity in the GKNS network decreases as a power law rather than exponentially, indicating that we are dealing with a "scale‑free" network, such as the Internet, protein‑protein interaction networks, etc.
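The scaling argument behind the criterion quoted in this abstract can be written out as a short worked derivation. This is the standard textbook form of the Harris argument, not a derivation taken from the paper. The symbols ξ (correlation length) and δx_c (block-to-block fluctuation of the critical point) are notation introduced here; x, x_c, L, d and ν follow the abstract.

```latex
% Central-limit estimate for the critical-point fluctuation in a block of size L,
% and the divergence of the correlation length near the critical point:
\delta x_c(L) \sim L^{-d/2}, \qquad \xi \sim |x - x_c|^{-\nu}.

% Evaluate the fluctuation at the scale of one correlation volume, L = \xi:
\delta x_c(\xi) \sim \xi^{-d/2} \sim |x - x_c|^{d\nu/2}.

% Disorder is irrelevant when this fluctuation vanishes faster than the
% distance to criticality as x \to x_c:
\delta x_c(\xi) \ll |x - x_c|
\;\Longleftrightarrow\; \frac{d\nu}{2} > 1
\;\Longleftrightarrow\; d\nu > 2.

% If instead the fluctuation does not decay with block size,
% \delta x_c(L) \sim L^{0}, the inequality above fails arbitrarily close to x_c
% for any \nu, which is the situation the abstract reports for the GKNS network.
```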
Authors: Kusal Debnath, Pratip Rana, Preetam Ghosh
Abstract: Drug‑target affinity (DTA) prediction is a critical aspect of drug discovery. The meaningful representation of drugs and targets is crucial for accurate prediction. Using 1D string‑based representations for drugs and targets is a common approach that has demonstrated good results in drug‑target affinity prediction. However, this approach lacks information on the relative position of the atoms and bonds. To address this limitation, graph‑based representations have been used to some extent. However, solely considering the structural aspect of drugs and targets may be insufficient for accurate DTA prediction. Integrating the functional aspect of these drugs at the genetic level can enhance the prediction capability of the models. To fill this gap, we propose GramSeq‑DTA, which integrates chemical perturbation information with the structural information of drugs and targets. We applied a Grammar Variational Autoencoder (GVAE) for drug feature extraction and utilized two different approaches for protein feature extraction: Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). The chemical perturbation data is obtained from the L1000 project, which provides information on the upregulation and downregulation of genes caused by selected drugs. This chemical perturbation information is processed, and a compact dataset is prepared, serving as the functional feature set of the drugs. By integrating the drug, gene, and target features in the model, our approach outperforms the current state‑of‑the‑art DTA prediction models when validated on widely used DTA datasets (BindingDB, Davis, and KIBA). This work provides a novel and practical approach to DTA prediction by merging the structural and functional aspects of biological entities, and it encourages further research in multi‑modal DTA prediction.
Authors: Yingze Wang, Kunyang Sun, Jie Li, Xingyi Guan, Oufan Zhang, Dorian Bagni, Teresa Head-Gordon
Abstract: Development of scoring functions (SFs) used to predict protein‑ligand binding energies requires high‑quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely‑used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi‑automated workflow, HiQBind‑WF, that curates non‑covalent protein‑ligand datasets to fix these problems. We also used this workflow to create an independent data set, HiQBind, by matching binding free energies from various sources including BioLiP, Binding MOAD and BindingDB with co‑crystalized ligand‑protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open‑source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
Authors: Boming Kang, Qinghua Cui
Abstract: Artificial intelligence based on machine learning and deep learning has made significant advances in various fields such as protein structure prediction and climate modeling. However, a central challenge remains: the "black box" nature of AI, where precise quantitative relationships between inputs and outputs are often lacking. Here, by analyzing 323 AI models trained to predict human essential proteins, we uncovered a ratio law showing that model performance and the ratio of minority to majority samples can be closely linked by two concise equations. Moreover, we mathematically proved that an AI model achieves its optimal performance on a balanced dataset. More importantly, we next explore whether this finding can further guide us to enhance AI models' performance. Therefore, we divided the imbalanced dataset into several balanced subsets to train base classifiers, and then applied a bagging‑based ensemble learning strategy to combine these base models. As a result, the equation‑guided strategy substantially improved model performance, with increases of 4.06% and 5.28%, respectively, outperforming traditional dataset balancing techniques. Finally, we confirmed the broad applicability and generalization of these equations using different types of classifiers and 10 additional, diverse binary classification tasks. In summary, this study reveals two equations precisely linking AI's input and output, which could be helpful for unboxing the mysterious "black box" of AI.
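The balanced-subset strategy described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names are hypothetical, leftover majority samples that do not fill a whole chunk are simply dropped, and simple majority voting stands in for whatever aggregation rule the authors actually used.

```python
import random


def balanced_subsets(minority, majority, seed=0):
    """Split the majority class into chunks the size of the minority class,
    pairing each chunk with the full minority set (1:1 class ratio)."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    k = len(minority)
    subsets = []
    # Leftover majority samples that cannot fill a chunk are dropped (assumption).
    for start in range(0, len(shuffled) - k + 1, k):
        subsets.append(list(minority) + shuffled[start:start + k])
    return subsets


def majority_vote(predictions):
    """Combine 0/1 predictions from the base classifiers; ties resolve to 0."""
    return int(2 * sum(predictions) > len(predictions))


if __name__ == "__main__":
    minority = [("pos", i) for i in range(3)]
    majority = [("neg", i) for i in range(10)]
    for subset in balanced_subsets(minority, majority):
        print(subset)
    print(majority_vote([1, 1, 0]))
```

Each subset would train one base classifier; the ensemble prediction is then the vote across those classifiers.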
Authors: Zhili Feng, Tanya Marwah, Nicolo Fusi, David Alvarez-Melis, Lester Mackey
Abstract: Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre‑trained next‑source‑token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out‑of‑domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
Authors: Rafał Powalski, Bazyli Klockiewicz, Maciej Jaśkowski, Bartosz Topolski, Paweł Dąbrowski-Tumański, Maciej Wiśniewski, Łukasz Kuciński, Piotr Miłoś, Dariusz Plewczynski
Abstract: Accelerating molecular docking (the process of predicting how molecules bind to protein targets) could boost small‑molecule drug discovery and revolutionize medicine. Unfortunately, current molecular docking tools are too slow to screen potential drugs against all relevant proteins, which often results in missed drug candidates or unexpected side effects occurring in clinical trials. To address this gap, we introduce RapidDock, an efficient transformer‑based model for blind molecular docking. RapidDock achieves at least a 100× speed advantage over existing methods without compromising accuracy. On the Posebusters and DockGen benchmarks, our method achieves 52.1% and 44.0% success rates (RMSD < 2 Å), respectively. The average inference time is 0.04 seconds on a single GPU, highlighting RapidDock's potential for large‑scale docking studies. We examine the key features of RapidDock that enable leveraging the transformer architecture for molecular docking, including the use of relative distance embeddings of 3D structures in attention matrices, pre‑training on protein folding, and a custom loss function invariant to molecular symmetries.
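The success criterion quoted in this abstract (RMSD < 2 Å) can be illustrated as follows. This is a simplified sketch: it assumes predicted and reference ligand atoms are already matched one to one and ignores molecular symmetry, which benchmark implementations typically account for. The function names are hypothetical.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) atoms."""
    if len(pred) != len(ref):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(pairs, threshold=2.0):
    """Fraction of (predicted, reference) pose pairs with RMSD below the threshold in Å."""
    hits = sum(1 for pred, ref in pairs if rmsd(pred, ref) < threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    near = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # shifted by 1 Å along x
    far = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]   # shifted by 3 Å along x
    print(success_rate([(near, reference), (far, reference)]))
```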
Authors: Liang He, Peiran Jin, Yaosen Min, Shufang Xie, Lijun Wu, Tao Qin, Xiaozhuan Liang, Kaiyuan Gao, Yuliang Jiang, Tie-Yan Liu
Abstract: Proteins, essential to biological systems, perform functions intricately linked to their three‑dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre‑training on vast unlabeled datasets, they often struggle to capture critical co‑evolutionary information, which evolutionary‑based methods excel at. In this study, we introduce a novel pre‑training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short‑range and long‑range co‑evolutionary features from sequence data. Trained on a large‑scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model's effectiveness in integrating co‑evolutionary information, marking a significant step forward in protein sequence‑based modeling.
Authors: Xin Ma, Dong Si
Abstract: Constructing atomic models from cryo‑electron microscopy (cryo‑EM) maps is a crucial yet intricate task in structural biology. While advancements in deep learning, such as convolutional neural networks (CNNs) and graph neural networks (GNNs), have spurred the development of sophisticated map‑to‑model tools like DeepTracer and ModelAngelo, their efficacy notably diminishes with low‑resolution maps beyond 4 Å. To address this shortfall, our research introduces DeepTracer‑LowResEnhance, an innovative framework that combines a deep learning‑enhanced map refinement technique with the power of AlphaFold. This methodology is designed to markedly improve the construction of models from low‑resolution cryo‑EM maps. DeepTracer‑LowResEnhance was rigorously tested on a set of 37 protein cryo‑EM maps, with resolutions ranging from 2.5 to 8.4 Å, including 22 maps with resolutions lower than 4 Å. The outcomes were compelling, demonstrating that 95.5% of the low‑resolution maps exhibited a significant increase in the count of total predicted residues. This denotes a pronounced improvement in atomic model building for low‑resolution maps. Additionally, a comparative analysis alongside Phenix's auto‑sharpening functionality shows DeepTracer‑LowResEnhance's superior capability in rendering more detailed and precise atomic models, thereby pushing the boundaries of current computational structural biology methodologies.
Authors: Mrunal Kamble, Evan Humberd, Tian Li, Girish S. Agarwal
Abstract: Analyzing the kinetics of biological processes plays a significant role in understanding fundamental cellular functions. Many physics‑based technologies used to study such processes are limited by the shot noise inherent to the coherent states of light. These technologies can greatly benefit by leveraging quantum probes to improve the sensitivity of measurements in cellular biology. The surface plasmon resonance (SPR) technique has been used effectively to achieve label‑free, real‑time measurements of protein binding kinetics, which constitutes an important biological phenomenon occurring near the cell membrane. Here, we demonstrate the integration of this technique with the two‑mode bright squeezed state, which has fewer fluctuations than the coherent state, to improve the sensitivity of measurement in studying a protein‑gold adsorption process. We show 4 dB of squeezing as we record the signal‑to‑noise ratio as a function of time, and it is maintained throughout the kinetic process. The quantum advantage, as shown in terms of squeezing, is achieved despite a total absorption of 74% from the source until the final detection after the sensor. Overall, we provide the most practical setup for improving the sensitivity of the time‑dependent measurements involved in various biological processes at the molecular level.
Authors: Caleb Musfeldt
Abstract: This thesis details a Python‑based software designed to calculate the Jones polynomial, a vital mathematical tool from Knot Theory used for characterizing the topological and geometrical complexity of curves in \( \mathbb{R}^3 \), which is essential in understanding physical systems of filaments, including the behavior of polymers and biopolymers. The Jones polynomial serves as a topological invariant capable of distinguishing between different knot structures. This capability is fundamental to characterizing the architecture of molecular chains, such as proteins and DNA. Traditional computational methods for deriving the Jones polynomial have been limited by closure‑schemes and high execution costs, which can be impractical for complex structures like those that appear in real life. This software implements methods that significantly reduce calculation times, allowing for more efficient and practical applications in the study of biological polymers. It utilizes a divide‑and‑conquer approach combined with parallel computing and applies recursive Reidemeister moves to optimize the computation, transitioning from an exponential to a near‑linear runtime for specific configurations. This thesis provides an overview of the software's functions, detailed performance evaluations using protein structures as test cases, and a discussion of the implications for future research and potential algorithmic improvements.
Authors: Zihan Pengmei, Zhengyuan Shen, Zichen Wang, Marcus Collins, Huzefa Rangwala
Abstract: Constructing transferable descriptors for conformation representation of molecular and biological systems finds numerous applications in drug discovery, learning‑based molecular dynamics, and protein mechanism analysis. Geometric graph neural networks (Geom‑GNNs) with all‑atom information have transformed atomistic simulations by serving as general learnable geometric descriptors for downstream tasks, including prediction of interatomic potentials and molecular properties. However, common practices involve supervising Geom‑GNNs on specific downstream tasks, which suffer from a lack of high‑quality data and from inaccurate labels, leading to poor generalization and performance degradation in out‑of‑distribution (OOD) scenarios. In this work, we explored the possibility of using pre‑trained Geom‑GNNs as transferable and highly effective geometric descriptors for improved generalization. To explore their representation power, we studied the scaling behaviors of Geom‑GNNs under self‑supervised pre‑training, supervised and unsupervised learning setups. We find that the expressive power of different architectures can differ on the pre‑training task. Interestingly, Geom‑GNNs do not follow power‑law scaling on the pre‑training task, and universally lack predictable scaling behavior on the supervised tasks with quantum chemical labels important for screening and design of novel molecules. More importantly, we demonstrate how all‑atom graph embedding can be organically combined with other neural architectures to enhance the expressive power. Meanwhile, the low‑dimensional projection of the latent space shows excellent agreement with conventional geometrical descriptors.
Authors: Wenxian Shi, Menghua Wu, Regina Barzilay
Abstract: Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub‑populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location‑specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub‑population specific protein evolution model, which predicts the time‑resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub‑populations and learns their interdependence from data. The change in protein distributions across all sub‑populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub‑populations. Multi‑year evaluation on both SARS‑CoV‑2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
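To make the idea of a linear ODE parametrized by transmission rates concrete, here is a toy sketch. It is not the authors' model: the two-location setup, the rate values, the helper names, and the forward-Euler integrator are all assumptions made for illustration. The vector p holds the share of one protein variant in each sub-population, and dp/dt = A p, where A is built from pairwise transmission rates so that total mass is conserved.

```python
def rate_matrix(rates):
    """Build generator A from rates[i][j] = transmission rate from location j to i (i != j).
    Diagonal entries are set so that every column sums to zero (mass conservation)."""
    n = len(rates)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        outflow = 0.0
        for i in range(n):
            if i != j:
                A[i][j] = rates[i][j]
                outflow += rates[i][j]
        A[j][j] = -outflow
    return A


def integrate(A, p0, t_end, steps=10000):
    """Integrate dp/dt = A p with forward Euler (a deliberately simple solver)."""
    dt = t_end / steps
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        dp = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        p = [p[i] + dt * dp[i] for i in range(n)]
    return p


if __name__ == "__main__":
    # Hypothetical rates: 0.2 from location 1 to 0, and 0.1 from location 0 to 1.
    A = rate_matrix([[0.0, 0.2], [0.1, 0.0]])
    print(integrate(A, [1.0, 0.0], t_end=100.0))
```

With these assumed rates the shares approach the stationary split (2/3, 1/3), the point where inflow and outflow between the two locations balance.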
Authors: Po-Yu Liang, Jun Bai
Abstract: Biologists frequently desire protein inhibitors for a variety of reasons, including use as research tools for understanding biological processes and application to societal problems in agriculture, healthcare, etc. Immunotherapy, for instance, relies on immune checkpoint inhibitors to block checkpoint proteins, preventing their binding with partner proteins and boosting immune cell function against abnormal cells. Inhibitor discovery has long been a tedious process, which in recent years has been accelerated by computational approaches. Advances in artificial intelligence now provide an opportunity to make inhibitor discovery smarter than ever before. While extensive research has been conducted on computer‑aided inhibitor discovery, it has mainly focused on either sequence‑to‑structure mapping, reverse mapping, or bio‑activity prediction, making it unrealistic for biologists to utilize such tools. Instead, our work proposes a new method of computer‑assisted inhibitor discovery: de novo pocket‑aware peptide structure and sequence generation network. Our approach consists of two sequential diffusion models for end‑to‑end structure generation and sequence prediction. By leveraging angle and dihedral relationships between backbone atoms, we ensure an E(3)‑invariant representation of peptide structures. Our results demonstrate that our method achieves comparable performance to state‑of‑the‑art models, highlighting its potential in pocket‑aware peptide design. This work offers a new approach for precise drug discovery using receptor‑specific peptide generation.
Authors: Xiaoqi Ling, Cheng Cai, Demin Kong, Zhisheng Wei, Jing Wu, Lei Wang, Zhaohong Deng
Abstract: Computational protein design (CPD) refers to the use of computational methods to design proteins. Traditional methods relying on energy functions and heuristic algorithms for sequence design are inefficient and do not meet the demands of the big data era in biomolecules, with their accuracy limited by the energy functions and search algorithms. Existing deep learning methods are constrained by the learning capabilities of the networks, failing to extract effective information from sparse protein structures, which limits the accuracy of protein design. To address these shortcomings, we developed Efficient attention‑based Models for Computational Protein Design using amino acid microenvironment (EMOCPD). It aims to predict the category of each amino acid in a protein by analyzing the three‑dimensional atomic environment surrounding the amino acids, and to optimize the protein based on the predicted high‑probability potential amino acid categories. EMOCPD employs a multi‑head attention mechanism to focus on important features in the sparse protein microenvironment and utilizes an inverse residual structure to optimize the network architecture. The proposed EMOCPD achieves over 80% accuracy on the training set and 68.33% and 62.32% accuracy on two independent test sets, respectively, surpassing the best comparative methods by over 10%. In protein design, the thermal stability and protein expression of the predicted mutants from EMOCPD show significant improvements compared to the wild type, effectively validating EMOCPD's potential in designing superior proteins. Furthermore, the content of each of the 20 amino acids influences EMOCPD's predictions positively, negatively, or minimally, so amino acids can be categorized as positive, negative, or neutral. Research findings indicate that EMOCPD is more suitable for designing proteins with lower contents of negative amino acids.
Authors: Xiangxin Zhou, Jiaqi Guan, Yijia Zhang, Xingang Peng, Liang Wang, Jianzhu Ma
Abstract: Dual‑target therapeutic strategies have become a compelling approach and attracted significant attention due to various benefits, such as their potential in overcoming drug resistance in cancer therapy. Considering the tremendous success that deep generative models have achieved in structure‑based drug design in recent years, we formulate dual‑target drug design as a generative task and curate a novel dataset of potential target pairs based on synergistic drug combinations. We propose to design dual‑target drugs with diffusion models that are trained on single‑target protein‑ligand complex pairs. Specifically, we align two pockets in 3D space with protein‑ligand binding priors and build two complex graphs with shared ligand nodes for SE(3)‑equivariant composed message passing, based on which we derive a composed drift in both 3D and categorical probability space in the generative process. Our algorithm can well transfer the knowledge gained in single‑target pretraining to dual‑target scenarios in a zero‑shot manner. We also repurpose linker design methods as strong baselines for this task. Extensive experiments demonstrate the effectiveness of our method compared with various baselines.
Authors: Dongyu Lyu, Matthias Holzenkamp, Vivin Vinod, Yannick Marcel Holtkamp, Sayan Maity, Carlos R. Salazar, Ulrich Kleinekathöfer, Peter Zaspel
Abstract: Natural light‑harvesting antenna complexes efficiently capture solar energy using chlorophyll, i.e., magnesium porphyrin pigments, embedded in a protein matrix. Inspired by this natural configuration, artificial clay‑porphyrin antenna structures have been experimentally synthesized and have demonstrated remarkable excitation energy transfer properties. The study presents the computational design and simulation of a synthetic light‑harvesting system that emulates natural mechanisms by arranging cationic free‑base porphyrin molecules on an anionic clay surface. We investigated the transfer of excitation energy among the porphyrin dyes using a multiscale quantum mechanics/molecular mechanics (QM/MM) approach based on the semi‑empirical density functional‑based tight‑binding (DFTB) theory for the ground state dynamics. To improve the accuracy of our results, we incorporated an innovative multifidelity machine learning (MFML) approach, which allows the prediction of excitation energies at the numerically demanding time‑dependent density functional theory level with the Def2‑SVP basis set. This approach was applied to an extensive dataset of 640K geometries for the 90‑atom porphyrin structures, facilitating a thorough analysis of the excitation energy diffusion among the porphyrin molecules adsorbed to the clay surface. The insights gained from this study, inspired by natural light‑harvesting complexes, demonstrate the potential of porphyrin‑clay systems as effective energy transfer systems.
Authors: Zaixi Zhang, Ruofan Jin, Kaidi Fu, Le Cong, Marinka Zitnik, Mengdi Wang
Abstract: Protein structure is key to understanding protein function and is essential for progress in bioengineering, drug discovery, and molecular biology. Recently, with the incorporation of generative AI, the power and accuracy of computational protein structure prediction/design have been improved significantly. However, ethical concerns such as copyright protection and harmful content generation (biosecurity) pose challenges to the wide implementation of protein generative models. Here, we investigate whether it is possible to embed watermarks into protein generative models and their outputs for copyright authentication and the tracking of generated structures. As a proof of concept, we propose a two‑stage method, FoldMark, as a generalized watermarking strategy for protein generative models. FoldMark first pretrains a watermark encoder and decoder, which can slightly adjust protein structures to embed user‑specific information and faithfully recover the information from the encoded structure. In the second step, protein generative models are fine‑tuned with watermark‑conditioned Low‑Rank Adaptation (LoRA) modules to preserve generation quality while learning to generate watermarked structures with high recovery rates. Extensive experiments are conducted on open‑source protein structure prediction models (e.g., ESMFold and MultiFlow) and de novo structure design models (e.g., FrameDiff and FoldFlow), and we demonstrate that our method is effective across all these generative models. Meanwhile, our watermarking framework exerts only a negligible impact on the original protein structure quality and is robust under potential post‑processing and adaptive attacks.
Authors: Tiangang Cui, Alex Gorodetsky
Abstract: We present a new sampling‑based approach for enabling efficient computation of low‑rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular‑value‑decomposition (SVD) parametrization of low‑rank matrices. Our prior is analogous to the seminal nuclear‑norm regularization used in the non‑Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (‑within‑Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two‑matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach to fitting the categorical data of a mice protein dataset and to the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real‑world benchmark problems we considered.
Authors: Siddharth Viswanath, Dhananjay Bhaskar, David R. Johnson, Joao Felipe Rocha, Egbert Castro, Jackson D. Grady, Alex T. Grigas, Michael A. Perlmutter, Corey S. O'Hern, Smita Krishnaswamy
Abstract: Understanding the dynamic nature of protein structures is essential for comprehending their biological functions. While significant progress has been made in predicting static folded structures, modeling protein motions on microsecond to millisecond scales remains challenging. To address these challenges, we introduce a novel deep learning architecture, Protein Transformer with Scattering, Attention, and Positional Embedding (ProtSCAPE), which leverages the geometric scattering transform alongside transformer‑based attention mechanisms to capture protein dynamics from molecular dynamics (MD) simulations. ProtSCAPE utilizes the multi‑scale nature of the geometric scattering transform to extract features from protein structures conceptualized as graphs and integrates these features with dual attention structures that focus on residues and amino acid signals, generating latent representations of protein trajectories. Furthermore, ProtSCAPE incorporates a regression head to enforce temporally coherent latent representations.
Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Abstract: The development of large language models and multi‑modal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large‑scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multi‑modal models combining text and molecules are often trained from scratch, without leveraging existing high‑quality pretrained models. Training from scratch consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter‑based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the entire PubChem dataset of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the dataset as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality and also generate candidate membrane permeable molecules.
Authors: Parthasarathy Suryanarayanan, Yunguang Qiu, Shreyans Sethi, Diwakar Mahajan, Hongyang Li, Yuxin Yang, Elif Eyigoz, Aldo Guzman Saenz, Daniel E. Platt, Timothy H. Rumbell, Kenney Ng, Sanjoy Dey, Myson Burch, Bum Chul Kwon, Pablo Meyer, Feixiong Cheng, Jianying Hu, Joseph A. Morrone
Abstract: Quality molecular representations are key to foundation model development in bio‑medical research. Previous efforts have typically focused on a single representation or molecular view, which may have strengths or weaknesses on a given task. We develop Multi‑view Molecular Embedding with Late Fusion (MMELON), an approach that integrates graph, image and text views in a foundation model setting and may be readily extended to additional representations. Single‑view foundation models are each pre‑trained on a dataset of up to 200M molecules. The multi‑view model performs robustly, matching the performance of the highest‑ranked single‑view. It is validated on over 120 tasks, including molecular solubility, ADME properties, and activity against G Protein‑Coupled receptors (GPCRs). We identify 33 GPCRs that are related to Alzheimer's disease and employ the multi‑view model to select strong binders from a compound screen. Predictions are validated through structure‑based modeling and identification of key binding motifs.
Authors: Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas, Justin Singh Kang, Amirali Aghazadeh
Abstract: The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley‑based interpretability to extract global biological insights requires evaluating thousands of sequences, incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large‑scale biological datasets. After a one‑time model sketching step, SHAP zero enables near‑zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high‑order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black‑box sequence models in biology.
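The exponential per-query cost mentioned in this abstract comes from the definition of the Shapley value, which averages a feature's marginal contribution over every subset of the other features. The sketch below computes exact Shapley values by brute force for a toy value function; it illustrates the baseline cost only and is not the SHAP zero algorithm. The function name and the toy model are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n):
    """Exact Shapley values for n features; `value` maps a frozenset of
    feature indices to a float. Enumerates all subsets, so cost grows as 2^n."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(s):
    """Toy model: two additive effects plus one pairwise interaction; feature 2 is inert."""
    return 2.0 * (0 in s) + 3.0 * (1 in s) + 4.0 * (0 in s and 1 in s)


if __name__ == "__main__":
    print(exact_shapley(toy_value, 3))
```

In this toy model the pairwise interaction of 4 is split equally between features 0 and 1, and the attributions sum to the difference between the full and empty inputs.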
Authors: Jiarui Lu, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, Jian Tang
Abstract: Proteins adopt multiple structural conformations to perform their diverse biological functions, and understanding these conformations is crucial for advancing drug discovery. Traditional physics‑based simulation methods often struggle with sampling equilibrium conformations and are computationally expensive. Recently, deep generative models have shown promise in generating protein conformations as a more efficient alternative. However, these methods predominantly rely on the diffusion process within a 3D geometric space, which typically centers around the vicinity of metastable states and is often inefficient in terms of runtime. In this paper, we introduce Structure Language Modeling (SLM) as a novel framework for efficient protein conformation generation. Specifically, the protein structures are first encoded into a compact latent space using a discrete variational auto‑encoder, followed by conditional language modeling that effectively captures sequence‑specific conformation distributions. This enables a more efficient and interpretable exploration of diverse ensemble modes compared to existing methods. Based on this general framework, we instantiate SLM with various popular LM architectures and propose ESMDiff, a novel BERT‑like structure language model fine‑tuned from ESM3 with masked diffusion. We verify our approach in various scenarios, including the equilibrium dynamics of BPTI, conformational change pairs, and intrinsically disordered proteins. SLM provides a highly efficient solution, offering a 20‑100x speedup over existing methods in generating diverse conformations, shedding light on promising avenues for future research.
Authors: Abhiram Sripat
Abstract: Mycorrhizal fungi form vast subterranean networks that are critical for plant nutrient uptake, carbon sequestration, and ecosystem resilience. Despite their ecological importance, optimizing these networks for precision agriculture, forestry, and carbon sequestration remains an open challenge, particularly when it comes to understanding the complex molecular and quantum‑scale processes that govern nutrient exchange. In this paper, we propose a novel experimental framework using mycoponics, a controlled, soil‑less environment for the study of plant‑fungal symbiosis, integrated with isotopic labeling and quantum dots to track real‑time nutrient transfer.
Authors: Yuzhi Xu, Haowei Ni, Qinhui Gao, Chia-Hua Chang, Yanran Huo, Fanyu Zhao, Shiyu Hu, Wei Xia, Yike Zhang, Radu Grovu, Min He, John Z. H. Zhang, Yuanqing Wang
Abstract: Computational molecular design, the endeavor to design molecules for various purposes with the aid of machine learning and molecular dynamics approaches, has been widely applied to create valuable new molecular entities, from small molecule therapeutics to protein biologics. In the small data regime, physics‑based approaches model the interaction between the molecule being designed and proteins of key physiological functions, providing structural insights into the mechanism. When abundant data has been collected, a quantitative structure‑activity relationship (QSAR) can be more directly constructed from experimental data, from which machine learning can distill key insights to guide the design of the next round of experiments. Machine learning methodologies can also facilitate physical modeling, from improving the accuracy of force fields and extending them to unseen chemical spaces, to more directly enhancing the sampling of conformational spaces. We argue that these techniques are mature enough to be applied to extend not just the longevity of life but also the beauty it manifests. In this perspective, we review the current frontiers in the research & development of skin care products, as well as the statistical and physical toolbox applicable to addressing the challenges in this industry. Feasible interdisciplinary research projects are proposed to harness the power of machine learning tools to design innovative, effective, and inexpensive skin care products.
Authors: Luran Wang, Chaoran Cheng, Yizhen Liao, Yanru Qu, Ge Liu
Abstract: Controlled generation with pre‑trained Diffusion and Flow Matching models has vast applications. One strategy for guiding ODE‑based generative models is through optimizing a target loss R(x_1) while staying close to the prior distribution. Along this line, some recent work showed the effectiveness of guiding flow models by differentiating through their ODE sampling process. Despite the superior performance, the theoretical understanding of this line of methods is still preliminary, leaving space for algorithm improvement. Moreover, existing methods predominantly focus on Euclidean data manifolds, and there is a compelling need for guided flow methods on complex geometries such as SO(3), which is prevalent in high‑stakes scientific applications like protein design. We present OC‑Flow, a general and theoretically grounded training‑free framework for guided flow matching using optimal control. Building upon advances in optimal control theory, we develop effective and practical algorithms for solving optimal control in guided ODE‑based generation and provide a systematic theoretical analysis of the convergence guarantee in both Euclidean and SO(3). We show that existing backprop‑through‑ODE methods can be interpreted as special cases of Euclidean OC‑Flow. OC‑Flow achieved superior performance in extensive experiments on text‑guided image manipulation, conditional molecule generation, and all‑atom peptide design.
Authors: Samir Rosas, Wihan Adi, Aidana Beisenova, Shovasis Kumar Biswas, Furkan Kuruoglu, Hongyan Mei, Mikhail A. Kats, David A. Czaplewski, Yuri S. Kivshar, Filiz Yesilkoy
Abstract: Optical metasurfaces provide novel solutions to label‑free biochemical sensing by localizing light resonantly beyond the diffraction limit, thereby selectively enhancing light‑matter interactions for improved analytical performance. However, high‑Q resonances in metasurfaces are usually achieved in the reflection mode, which impedes metasurface integration into compact imaging systems. Here, we demonstrate a novel metasurface platform for advanced biochemical sensing based on the physics of the bound states in the continuum (BIC) and electromagnetically induced transparency (EIT) modes, which arise when two interfering resonances from a periodic pattern of tilted elliptic holes overlap both spectrally and spatially, creating a narrow transparency window in the mid‑infrared spectrum. We experimentally measure these resonant peaks observed in transmission mode (Q~734 at ~8.8 um) in free‑standing silicon membranes and confirm their tunability through geometric scaling. We also demonstrate the strong coupling of the BIC‑EIT modes with a thinly coated PMMA film on the metasurface, characterized by a large Rabi splitting (32 cm‑1) and biosensing of protein monolayers in transmission mode. Our new photonic platform can facilitate the integration of metasurface biochemical sensors into compact and monolithic optical systems while being compatible with scalable manufacturing, thereby clearing the way for on‑site biochemical sensing in everyday applications.
Authors: Bahar Ali, Anwar Shah, Malik Niaz, Musadaq Mansoord, Sami Ullah, Muhammad Adnan
Abstract: Advanced automated AI techniques allow us to classify protein sequences and discern their biological families and functions. Conventional approaches for classifying these protein families often focus on extracting N‑Gram features from the sequences while overlooking crucial motif information and the interplay between motifs and neighboring amino acids. Recently, convolutional neural networks have been applied to amino acid and motif data, even with a limited dataset of well‑characterized proteins, resulting in improved performance. This study presents a model for classifying protein families using the fusion of 1D‑CNN, BiLSTM, and an attention mechanism, which combines spatial feature extraction, long‑term dependencies, and context‑aware representations. The proposed model (ProFamNet) achieved superior model efficiency with 450,953 parameters and a compact size of 1.72 MB, outperforming the state‑of‑the‑art model with 4,578,911 parameters and a size of 17.47 MB. Further, we achieved a higher F1 score (98.30% vs. 97.67%) with more instances (271,160 vs. 55,077) in fewer training epochs (25 vs. 30).
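The reported model sizes are consistent with the reported parameter counts under one simple reading. The sketch below is a back-of-the-envelope check, not the authors' procedure: it assumes 32-bit floating-point weights (4 bytes per parameter) and that "MB" means 2^20 bytes, neither of which the abstract states. The function name `size_mib` is hypothetical.

```python
# Assumption (not stated in the abstract): float32 weights, 4 bytes each,
# and "MB" interpreted as mebibytes (2**20 bytes).
BYTES_PER_PARAM = 4
BYTES_PER_MIB = 2 ** 20


def size_mib(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate storage for the weights alone, in MiB."""
    return n_params * bytes_per_param / BYTES_PER_MIB


profamnet = size_mib(450_953)    # parameter count reported for ProFamNet
baseline = size_mib(4_578_911)   # parameter count reported for the compared model

print(f"ProFamNet: {profamnet:.2f} MiB, baseline: {baseline:.2f} MiB")
print(f"parameter ratio: {4_578_911 / 450_953:.1f}x")
```

Under these assumptions the two figures round to the 1.72 MB and 17.47 MB quoted above, which suggests the sizes reflect raw weight storage.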
Authors: Yasha Ektefaie, Olivia Viessmann, Siddharth Narayanan, Drew Dresser, J. Mark Kim, Armen Mkrtchyan
Abstract: Protein inverse folding, that is, predicting an amino acid sequence that will fold into the desired 3D structure, is an important problem for structure‑based protein design. Machine learning based methods for inverse folding typically use recovery of the original sequence as the optimization objective. However, inverse folding is a one‑to‑many problem where several sequences can fold to the same structure. Moreover, for many practical applications, it is often desirable to have multiple, diverse sequences that fold into the target structure since it allows for more candidate sequences for downstream optimizations. Here, we demonstrate that although recent inverse folding methods show increased sequence recovery, their "foldable diversity" (i.e., their ability to generate multiple non‑similar sequences that fold into structures consistent with the target) does not increase. To address this, we present RL‑DIF, a categorical diffusion model for inverse folding that is pre‑trained on sequence recovery and tuned via reinforcement learning on structural consistency. We find that RL‑DIF achieves comparable sequence recovery and structural consistency to benchmark models but shows greater foldable diversity: experiments show RL‑DIF can achieve a foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset. The PyTorch model weights and sampling code are available on GitHub.
Authors: Johannes Karwounopoulos, Mateusz Bieniek, Zhiyi Wu, Adam L. Baskerville, Gerhard Koenig, Benjamin P. Cossins, Geoffrey P. F. Wood
Abstract: The development of machine‑learning (ML) potentials offers significant accuracy improvements compared to molecular mechanics (MM) because of the inclusion of quantum‑mechanical effects in molecular interactions. However, ML simulations are several times more computationally demanding than MM simulations, so there is a trade‑off between speed and accuracy. One possible compromise is a hybrid machine learning/molecular mechanics (ML/MM) approach with mechanical embedding that treats the intramolecular interactions of the ligand at the ML level and the protein‑ligand interactions at the MM level. Recent studies have reported improved protein‑ligand binding free energy results based on ML/MM with mechanical embedding, arguing that intramolecular interactions like torsion potentials of the ligand are often the limiting factor for accuracy. This claim is evaluated based on 108 relative binding free energy calculations for four different benchmark systems. As an alternative strategy, we also tested a tool that fits the MM dihedral potentials to the ML level of theory. Overall, the relative binding free energy results from MM with Open Force Field 2.2.0, MM with ML‑fitted torsion potentials, and the corresponding ML/MM end‑state corrected simulations show no statistically significant differences in the mean absolute errors (between 0.8 and 0.9 kcal/mol). Therefore, a well‑parameterized force field is on a par with simple mechanical embedding ML/MM simulations for protein‑ligand binding. In terms of computational costs, the reparametrization of poor torsional potentials is preferable to employing computationally intensive ML/MM simulations of protein‑ligand complexes with mechanical embedding. Also, the refitting strategy leads to lower variances of the protein‑ligand binding free energy results than the ML/MM end‑state corrections.
Authors: Leo Liberti
Abstract: The Buckminsterfullerene is an inorganic molecule consisting of 60 carbon atoms in the shape of a soccer ball. It was used in [Juhas et al., Nature 2006] to showcase algorithms that find the correct shape of a protein from limited data (length of inter‑atomic distances) without any further chemical experiment: in that case, by means of a complicated constructive heuristic based on genetic algorithms. In this paper we show that we can reconstruct the Buckminsterfullerene structure by means of mathematical programming, standard solver software, and little else.
Authors: Haowen Zhao, Francesco A. Aprile, Barbara Bravi
Abstract: The computational prediction and design of peptide binders targeting specific linear epitopes is crucial in biological and biomedical research, yet it remains challenging due to their highly dynamic nature and the scarcity of experimentally solved binding data. To address this problem, we built an unprecedentedly large‑scale library of peptide pairs within stable secondary structures (beta sheets), leveraging newly available AlphaFold predicted structures. We then developed a machine learning method based on the Transformer architecture for the design of specific linear binders, in analogy to a language translation task. Our method, TransformerBeta, accurately predicts specific beta strand interactions and samples sequences with beta sheet‑like molecular properties, while capturing interpretable physico‑chemical interaction patterns. As such, it can propose specific candidate binders targeting linear epitopes for experimental validation to inform protein design.
Authors: Tomas André, Ibrahim Dawod, Sebastian Cardoch, Emiliano De Santis, Nicusor Timneanu, Carl Caleman
Abstract: We simulated the Coulomb explosion dynamics due to the fast ionization induced by high‑intensity X‑rays in six proteins that share similar atomic content and shape. We followed and projected the trajectory of the fragments onto a virtual detector, providing a unique explosion footprint. After collecting 500 explosion footprints for each protein, we utilized principal component analysis and t‑distributed stochastic neighbor embedding to classify these. The results show that the classification algorithms were able to separate structurally similar proteins into distinct groups on the basis of their explosion footprints. The explosion footprints, therefore, provide a unique identifier for each of the proteins. We envision that this method could be used concurrently with single particle coherent imaging experiments to provide additional information on shape, mass, or conformation.
Authors: Shikhar Vashistha, Neetesh Kumar
Abstract: Traditional graph neural networks (GNNs) lack scalability and lose individual node characteristics due to over‑smoothing, especially in the case of deeper networks. This results in sub‑optimal feature representation, affecting the model's performance on tasks involving dynamically changing graphs. To address this issue, we present a neural network architecture for graph‑structured data based on Graph Selective States Focused Attention Networks (GSANs). The GSAN is enabled by multi‑head masked self‑attention (MHMSA) and selective state space modeling (S3M) layers to overcome the limitations of GNNs. The MHMSA allows GSAN to dynamically emphasize crucial node connections, particularly in evolving graph environments. The S3M layer enables the network to adjust dynamically to changing node states and improves predictions of node behavior in varying contexts without needing prior knowledge of the graph structure. Furthermore, the S3M layer enhances generalization to unseen structures and helps interpret how node states influence link importance. With this, GSAN outperforms prior approaches on inductive and transductive tasks and overcomes the issues that traditional GNNs experience. To analyze the performance behavior of GSAN, a set of state‑of‑the‑art comparative experiments is conducted on graph benchmark datasets, including the Cora, Citeseer, and Pubmed citation networks and a protein‑protein‑interaction dataset. As an outcome, GSAN improved the F1‑score by 1.56%, 8.94%, 0.37%, and 1.54%, respectively.
Authors: Aryan Abbasian, Mahtab Mirmohseni, Masoumeh Nasiri Kenari
Abstract: Recent experiments have demonstrated the feasibility of storing digital information in macromolecules such as DNA and protein. However, the DNA storage channel is prone to errors such as deletions, insertions, and substitutions. During the synthesis and reading phases of DNA strings, many noisy copies of the original string are generated. The problem of recovering the original string from these noisy copies is known as sequence reconstruction. A key concept in this problem is the error ball, which is the set of all possible sequences that can result from a limited number of errors applied to the original sequence. Levenshtein showed that the minimum number of noisy copies required for a given channel to recover the original sequence is equal to one plus the maximum size of the intersection of two error balls. Therefore, deriving the size of the error ball for any channel and any sequence is essential for solving the sequence reconstruction problem. In DNA storage systems, multiple types of errors such as deletion, insertion and substitution in a string could occur simultaneously. In this work, we aim to derive the size of the error ball for channels with multiple types of errors and at most three edits. Specifically, we consider the channels with single‑deletion double‑substitution, single‑deletion double‑insertion and single‑insertion single‑substitution errors.
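The error-ball and Levenshtein statements above can be checked by brute force on tiny inputs. The sketch below is illustrative only and is not the paper's derivation: the helper names are hypothetical, the channel is modeled as exactly one deletion followed by at most k substitutions (an assumption about how the combined channel is defined), and exhaustive enumeration is feasible only for very short strings.

```python
from itertools import product


def deletions(s):
    """All strings obtained by deleting exactly one symbol from s."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def substitutions(s, alphabet):
    """All strings obtained by changing exactly one symbol of s."""
    return {s[:i] + a + s[i + 1:]
            for i in range(len(s)) for a in alphabet if a != s[i]}


def at_most_k_substitutions(s, alphabet, k):
    """Substitution ball of radius k around s (includes s itself)."""
    ball, frontier = {s}, {s}
    for _ in range(k):
        frontier = {t for w in frontier for t in substitutions(w, alphabet)}
        ball |= frontier
    return ball


def deletion_substitution_ball(s, alphabet, k):
    """Assumed channel model: one deletion, then at most k substitutions."""
    return {t for d in deletions(s)
            for t in at_most_k_substitutions(d, alphabet, k)}


def min_copies(n, alphabet, ball):
    """Levenshtein's bound: 1 + the largest intersection of two error balls,
    maximized over all pairs of distinct length-n strings."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    balls = {w: ball(w) for w in words}
    largest = max(len(balls[x] & balls[y])
                  for i, x in enumerate(words) for y in words[i + 1:])
    return largest + 1


print(len(deletions("0011")))                          # one string per run
print(len(deletion_substitution_ball("000", "01", 2)))
print(min_copies(4, "01", deletions))                  # single-deletion channel
```

For the single-deletion channel the ball size equals the number of runs in the string, which is why "0011" yields two strings. Closed-form sizes for the combined channels are what the abstract sets out to derive; the enumeration here only serves as a sanity check against such formulas.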
Authors: Haibo Wang, Yuxuan Qiu, Yanze Wang, Rob Brekelmans, Yuanqi Du
Abstract: Simulating transition dynamics between metastable states is a fundamental challenge in dynamical systems and stochastic processes with wide real‑world applications in understanding protein folding, chemical reactions and neural activities. However, the computational challenge often lies in sampling exponentially many paths of which only a small fraction ends in the target metastable state due to the existence of high energy barriers. To amortize the cost, we propose a data‑driven approach to warm up the simulation by learning nonlinear interpolations from local dynamics. Specifically, we infer a potential energy function from local dynamics data. To find plausible paths between two metastable states, we formulate a generalized flow matching framework that learns a vector field to sample probable paths between the two marginal densities under the learned energy function. Furthermore, we iteratively refine the model by assigning importance weights to the sampled paths and buffering more likely paths for training. We validate the effectiveness of the proposed method to sample probable paths on both synthetic and real‑world molecular systems.
Authors: Sizhe Liu, Jun Xia, Lecheng Zhang, Yuchen Liu, Yue Liu, Wenjie Du, Zhangyang Gao, Bozhen Hu, Cheng Tan, Hongxin Xiang, Stan Z. Li
Abstract: Molecular relational learning (MRL) is crucial for understanding the interaction behaviors between molecular pairs, a critical aspect of drug discovery and development. However, the large feasible model space of MRL poses significant challenges to benchmarking, and existing MRL frameworks face limitations in flexibility and scope. To address these challenges, avoid repetitive coding efforts, and ensure fair comparison of models, we introduce FlexMol, a comprehensive toolkit designed to facilitate the construction and evaluation of diverse model architectures across various datasets and performance metrics. FlexMol offers a robust suite of preset model components, including 16 drug encoders, 13 protein sequence encoders, 9 protein structure encoders, and 7 interaction layers. With its easy‑to‑use API and flexibility, FlexMol supports the dynamic construction of over 70,000 distinct combinations of model architectures. Additionally, we provide detailed benchmark results and code examples to demonstrate FlexMol's effectiveness in simplifying and standardizing MRL model development and comparison.
Authors: Hanqun Cao, Mutian He, Ning Ma, Chang-yu Hsieh, Chunbin Gu, Pheng-Ann Heng
Abstract: DNA‑encoded library (DEL) screening has revolutionized the detection of protein‑ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL‑Ranking, a novel distribution‑correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self‑training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi‑dimensional molecular representations, protein‑ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI‑driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL‑Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero‑shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.
Authors: Rabea Khatun, Wahia Tasnim, Maksuda Akter, Md Manowarul Islam, Md. Ashraf Uddin, Md. Zulfiker Mahmud, Saurav Chandra Das
Abstract: Gallbladder cancer (GBC) is the most frequent cause of disease among biliary tract neoplasms. Identifying the molecular mechanisms and biomarkers linked to GBC progression has been a significant challenge in scientific research. Few recent studies have explored the roles of biomarkers in GBC. Our study aimed to identify biomarkers in GBC using machine learning (ML) and bioinformatics techniques. We compared GBC tumor samples with normal samples to identify differentially expressed genes (DEGs) from two microarray datasets (GSE100363, GSE139682) obtained from the NCBI GEO database. A total of 146 DEGs were found, with 39 up‑regulated and 107 down‑regulated genes. Functional enrichment analysis of these DEGs was performed using Gene Ontology (GO) terms and REACTOME pathways through DAVID. The protein‑protein interaction network was constructed using the STRING database. To identify hub genes, we applied three ranking algorithms: Degree, MNC, and Closeness Centrality. The intersection of hub genes from these algorithms yielded 11 hub genes. Simultaneously, two feature selection methods (Pearson correlation and recursive feature elimination) were used to identify significant gene subsets. We then developed ML models using SVM and RF on the GSE100363 dataset, with validation on GSE139682, to determine the gene subset that best distinguishes GBC samples. The hub genes outperformed the other gene subsets. Finally, NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
Abstract: Proteins are essential macromolecules defined by their amino acid sequences, which determine their three‑dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM‑2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup‑free quantization‑based tokenizer. By training on both experimental and high‑quality synthetic structures, DPLM‑2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm‑up strategy to exploit the connection between large‑scale evolutionary data and structural inductive biases from pre‑trained sequence‑based protein language models. Empirical evaluation shows that DPLM‑2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two‑stage generation approach. Moreover, DPLM‑2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure‑aware representations for predictive tasks.
Authors: Chenyu Wang, Masatoshi Uehara, Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, Aviv Regev
Abstract: Recent studies have demonstrated the strong empirical performance of diffusion models on discrete sequences across domains from natural language to biological sequence generation. For example, in the protein inverse folding task, conditional diffusion models have achieved impressive results in generating natural‑like sequences that fold back into the original structure. However, practical design tasks often require not only modeling a conditional distribution but also optimizing specific task objectives. For instance, we may prefer protein sequences with high stability. To address this, we consider the scenario where we have pre‑trained discrete diffusion models that can generate natural‑like sequences, as well as reward models that map sequences to task objectives. We then formulate the reward maximization problem within discrete diffusion models, analogous to reinforcement learning (RL), while minimizing the KL divergence against pretrained diffusion models to preserve naturalness. To solve this RL problem, we propose a novel algorithm, DRAKES, that enables direct backpropagation of rewards through entire trajectories generated by diffusion models, by making the originally non‑differentiable trajectories differentiable using the Gumbel‑Softmax trick. Our theoretical analysis indicates that our approach can generate sequences that are both natural‑like and yield high rewards. While similar tasks have been recently explored in diffusion models for continuous domains, our work addresses unique algorithmic and theoretical challenges specific to discrete diffusion models, which arise from their foundation in continuous‑time Markov chains rather than Brownian motion. Finally, we demonstrate the effectiveness of DRAKES in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, important tasks for gene therapies and protein‑based therapeutics.
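The Gumbel‑Softmax trick named above replaces a hard categorical sample with a temperature‑controlled softmax over noise‑perturbed logits. The sketch below shows only that relaxation in plain Python; it is not the DRAKES algorithm, the function names are hypothetical, and plain Python computes no gradients (in an automatic differentiation framework the same expression would be differentiable with respect to the logits).

```python
import math
import random


def gumbel_noise(rng):
    """One Gumbel(0, 1) sample via -log(-log(U)), with U clamped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + gumbel_noise(rng)) / temperature for l in logits]
    m = max(noisy)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.0]

# Lower temperatures push the relaxed sample toward a one-hot vector.
for tau in (5.0, 1.0, 0.1):
    print(tau, [round(p, 3) for p in gumbel_softmax(logits, tau, rng)])

# The argmax of the relaxed sample follows the categorical softmax(logits).
counts = [0, 0, 0]
for _ in range(20000):
    y = gumbel_softmax(logits, 1.0, rng)
    counts[y.index(max(y))] += 1
print([c / 20000 for c in counts], [round(p, 3) for p in softmax(logits)])
```

The temperature trades bias for gradient variance: high values give smooth but far‑from‑discrete samples, while low values approach true categorical samples at the cost of sharper, noisier gradients.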
Authors: Pedro Zuidberg Dos Martires, Vincent Derkinderen, Luc De Raedt, Marcus Krantz
Abstract: Recent developments in AI have reinvigorated pursuits to advance the (life) sciences using AI techniques, thereby creating a renewed opportunity to bridge different fields and find synergies. Headlines for AI and the life sciences have been dominated by data‑driven techniques, for instance, to solve protein folding with next to no expert knowledge. In contrast to this, we argue for the necessity of a formal representation of expert knowledge, either to develop explicit scientific theories or to compensate for the lack of data. Specifically, we argue that the fields of knowledge representation (KR) and systems biology (SysBio) exhibit important overlaps that have been largely ignored so far. This, in turn, means that relevant scientific questions are ready to be answered using the right domain knowledge (SysBio), encoded in the right way (SysBio/KR), and by combining it with modern automated reasoning tools (KR). Hence, the formal representation of domain knowledge is a natural meeting place for SysBio and KR. On the one hand, we argue that such an interdisciplinary approach will advance the field of SysBio by exposing it to industrial‑grade reasoning tools and thereby allowing novel scientific questions to be tackled. On the other hand, we see ample opportunities to move the state‑of‑the‑art in KR by tailoring KR methods to the field of SysBio, which comes with challenging problem characteristics, e.g. scale, partial knowledge, noise, or sub‑symbolic data. We posit that this proposed interdisciplinary research is necessary to attain a prominent long‑term goal in the health sciences: precision medicine.
Authors: Jakub Grudzien Kuba, Pieter Abbeel, Sergey Levine
Abstract: Large neural networks excel at prediction tasks, but their application to design problems, such as protein engineering or materials discovery, requires solving offline model‑based optimization (MBO) problems. While predictive models may not directly translate to effective design, recent MBO algorithms incorporate reinforcement learning and generative modeling approaches. Meanwhile, theoretical work suggests that exploiting the target function's structure can enhance MBO performance. We present Cliqueformer, a transformer‑based architecture that learns the black‑box function's structure through functional graphical models (FGM), addressing distribution shift without relying on explicit conservative approaches. Across various domains, including chemical and genetic design tasks, Cliqueformer demonstrates superior performance compared to existing methods.
Authors: Jiaqi Han, Minkai Xu, Aaron Lou, Haotian Ye, Stefano Ermon
Abstract: Generative models have shown great promise in generating 3D geometric systems, which is a fundamental problem in many natural science domains such as molecule and protein design. However, existing approaches only operate on static structures, neglecting the fact that physical systems are always dynamic in nature. In this work, we propose geometric trajectory diffusion models (GeoTDM), the first diffusion model for modeling the temporal distribution of 3D geometric trajectories. Modeling such a distribution is challenging as it requires capturing both the complex spatial interactions with physical symmetries and the temporal correspondence encapsulated in the dynamics. We theoretically justify that diffusion models with equivariant temporal kernels can lead to a density with the desired symmetry, and develop a novel transition kernel leveraging SE(3)‑equivariant spatial convolution and temporal attention. Furthermore, to induce an expressive trajectory distribution for conditional generation, we introduce a generalized learnable geometric prior into the forward diffusion process to enhance temporal conditioning. We conduct extensive experiments on both unconditional and conditional generation in various scenarios, including physical simulation, molecular dynamics, and pedestrian motion. Empirical results on a wide suite of metrics demonstrate that GeoTDM can generate realistic geometric trajectories with significantly higher quality.
Authors: Elena Chachkarova, Terence Tse, Yordan Yordanov, Yao Wei, Cedric Weber
Abstract: The real world obeys quantum physics, and quantum computing presents an alternative way to map physical problems to systems that follow the same laws. Such computation fundamentally constitutes a better way to understand the most challenging quantum problems. One such problem is the accurate simulation of highly correlated quantum systems. Due to the high dimensionality of the problem, classical computers require considerable computing power to accurately predict material properties, especially when strong electron interactions are present. Still, modern‑day quantum hardware has many limitations and only allows for modeling of very simple systems. Here we present for the first time a quantum computer model simulation of a complex hemocyanin molecule, which is an important respiratory protein involved in various physiological processes such as oxygen transport and immune defence, and is also used as a key component in therapeutic vaccines for cancer. To better characterise the mechanism by which hemocyanin transports oxygen, a variational quantum eigensolver (VQE) based on fermionic excitations and quantum embedding methods is used in the context of dynamic mean field theory to solve the Anderson impurity model (AIM). Finally, it is concluded that the magnetic structure of hemocyanin is largely influenced by the many‑body correction and that the computational effort for solving correlated electron systems could be substantially reduced with the introduction of quantum computing algorithms. We encourage the use of the Hamiltonian systems presented in this paper as a benchmark for testing the efficiency of quantum computing algorithms for chemistry applications.
Authors: Sarwan Ali, Taslim Murad, Prakash Chourasia, Haris Mansoor, Imdad Ullah Khan, Pin-Yu Chen, Murray Patterson
Abstract: Understanding the structural and functional characteristics of proteins is crucial for developing preventative and curative strategies that impact fields from drug discovery to policy development. An important and popular technique for examining how amino acids make up these characteristics of protein sequences is position‑specific scoring (PSS). While the string kernel is crucial in natural language processing (NLP), it is unclear if string kernels can extract biologically meaningful information from protein sequences, despite the fact that they have been shown to be effective in general sequence analysis tasks. In this work, we propose a weighted PSS kernel matrix (or W‑PSSKM) that combines a PSS representation of protein sequences, which encodes the frequency information of each amino acid in a sequence, with the notion of the string kernel. This results in a novel kernel function that outperforms many other approaches for protein sequence classification. We perform extensive experimentation to evaluate the proposed method. Our findings demonstrate that the W‑PSSKM significantly outperforms existing baselines and state‑of‑the‑art methods and achieves up to 45.1% improvement in classification accuracy.
Authors: Mehdi Yazdani-Jahromi, Mangal Prakash, Tommaso Mansi, Artem Moskalev, Rui Liao
Abstract: Messenger RNA (mRNA) plays a crucial role in protein synthesis, with its codon structure directly impacting biological properties. While Language Models (LMs) have shown promise in analyzing biological sequences, existing approaches fail to account for the hierarchical nature of mRNA's codon structure. We introduce Hierarchical Encoding for mRNA Language Modeling (HELM), a novel pre‑training strategy that incorporates codon‑level hierarchical structure into language model training. HELM modulates the loss function based on codon synonymity, aligning the model's learning process with the biological reality of mRNA sequences. We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre‑training as well as existing foundation model baselines on seven diverse downstream property prediction tasks and an antibody region annotation task by around 8% on average. Additionally, HELM enhances the generative capabilities of the language model, producing diverse mRNA sequences that better align with the underlying true data distribution compared to non‑hierarchical baselines.
Authors: Dongqi Fu, Liri Fang, Zihao Li, Hanghang Tong, Vetle I. Torvik, Jingrui He
Abstract: Graphs, as a relational data structure, have been widely used for various application scenarios, like molecule design and recommender systems. Recently, large language models (LLMs) have drawn attention in the AI community for their expected reasoning and inference abilities. Making LLMs understand graph‑based relational data has great potential, including but not limited to (1) distilling external knowledge bases to eliminate hallucination and break the context window limit for LLMs' inference during the retrieval‑augmented generation process; (2) taking graph data as the input and directly solving graph‑based research tasks like protein design and drug discovery. However, inputting the entire graph data to LLMs is not practical due to its complex topological structure, data size, and the lack of effective and efficient semantic graph representations. A natural question arises: Is there a kind of graph representation that can be described in natural language for LLMs' understanding and is also easy to acquire to serve as the raw input for LLMs? Based on statistical computation, graph laws pre‑define a set of parameters (e.g., degree, time, diameter) and identify their relationships and values by observing the topological distribution of plenty of real‑world graph data. We believe this kind of parametric representation of graphs, graph laws, can be a solution for making LLMs understand graph data as the input. In this survey, we first review the previous study of graph laws from multiple perspectives, i.e., macroscope and microscope of graphs, low‑order and high‑order graphs, static and dynamic graphs, different observation spaces, and newly proposed graph parameters. After we review various real‑world applications benefiting from the guidance of graph laws, we conclude the paper with current challenges and future research directions.
Authors: Marko Djukanović, Jaume Reixach, Ana Nikolikj, Tome Eftimov, Aleksandar Kartelj, Christian Blum
Abstract: This paper addresses the Restricted Longest Common Subsequence (RLCS) problem, an extension of the well‑known Longest Common Subsequence (LCS) problem. This problem has significant applications in bioinformatics, particularly for identifying similarities and discovering mutual patterns and important motifs among DNA, RNA, and protein sequences. Building on recent advancements in solving this problem through a general search framework, this paper introduces two novel heuristic approaches designed to enhance the search process by steering it towards promising regions in the search space. The first heuristic employs a probabilistic model to evaluate partial solutions during the search process. The second heuristic is based on a neural network model trained offline using a genetic algorithm. A key aspect of this approach is extracting problem‑specific features of partial solutions and the complete problem instance. An effective hybrid method, referred to as the learning beam search, is developed by combining the trained neural network model with a beam search framework. An important contribution of this paper is found in the generation of real‑world instances where scientific abstracts serve as input strings, and a set of frequently occurring academic words from the literature are used as restricted patterns. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed approaches in solving the RLCS problem. Finally, an empirical explainability analysis is applied to the obtained results. In this way, key feature combinations and their respective contributions to the success or failure of the algorithms across different problem types are identified.
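The beam search framework mentioned above can be illustrated with a minimal sketch. It assumes the common RLCS formulation in which the solution must be a common subsequence of all input strings and must not contain any restricted pattern as a subsequence. The ranking heuristic here (shortest remaining suffix) is a simple stand-in chosen for illustration; it is not the probabilistic model or the trained neural network described in the abstract, and the function names are hypothetical.

```python
def is_subsequence(pattern, text):
    """Return True if pattern is a subsequence of text."""
    it = iter(text)
    return all(ch in it for ch in pattern)


def beam_search_rlcs(strings, restricted, beam_width=10):
    """Heuristic beam search for RLCS (assumes non-empty strings and patterns)."""
    alphabet = sorted(set.intersection(*(set(s) for s in strings)))
    # A node is (partial solution, next free position in each input string,
    # greedily matched prefix length of each restricted pattern).
    beam = [("", tuple(0 for _ in strings), tuple(0 for _ in restricted))]
    best = ""
    while beam:
        children = []
        for sol, pos, matched in beam:
            for ch in alphabet:
                # Leftmost occurrence of ch at or after the current position.
                nxt = [s.find(ch, p) for s, p in zip(strings, pos)]
                if min(nxt) < 0:
                    continue  # ch is missing from some remaining suffix
                new_matched = tuple(
                    m + 1 if r[m] == ch else m
                    for r, m in zip(restricted, matched)
                )
                if any(m == len(r) for r, m in zip(restricted, new_matched)):
                    continue  # extension would complete a restricted pattern
                children.append(
                    (sol + ch, tuple(i + 1 for i in nxt), new_matched)
                )
        if not children:
            break
        # Illustrative heuristic: prefer nodes with more characters left
        # in their shortest remaining input suffix.
        children.sort(
            key=lambda n: min(len(s) - p for s, p in zip(strings, n[1])),
            reverse=True,
        )
        beam = children[:beam_width]
        best = beam[0][0]  # all nodes at this depth have the same length
    return best
```

Swapping the sort key for a learned scoring function is the point at which a trained model would plug into such a framework.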
Authors: Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai
Abstract: Modeling biological sequences such as DNA, RNA, and proteins is crucial for understanding complex processes like gene regulation and protein synthesis. However, most current models either focus on a single type or treat multiple types of data separately, limiting their ability to capture cross‑modal relationships. We propose that by learning the relationships between these modalities, the model can enhance its understanding of each type. To address this, we introduce BSM, a small but powerful mixed‑modal biological sequence foundation model, trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web. These datasets capture the genetic flow, gene‑protein relationships, and the natural co‑occurrence of diverse biological data, respectively. By training on mixed‑modal data, BSM significantly enhances learning efficiency and cross‑modal representation, outperforming models trained solely on unimodal data. With only 110M parameters, BSM achieves performance comparable to much larger models across both single‑modal and mixed‑modal tasks, and uniquely demonstrates in‑context learning capability for mixed‑modal tasks, which is absent in existing models. Further scaling to 270M parameters demonstrates even greater performance gains, highlighting the potential of BSM as a significant advancement in multimodal biological sequence modeling.
Authors: Jiaxian Yan, Zaixi Zhang, Jintao Zhu, Kai Zhang, Jianfeng Pei, Qi Liu
Abstract: Molecular docking, a technique for predicting ligand binding poses, is crucial in structure‑based drug design for understanding protein‑ligand interactions. Recent advancements in docking methods, particularly those leveraging geometric deep learning (GDL), have demonstrated significant efficiency and accuracy advantages over traditional sampling methods. Despite these advancements, current methods are often tailored for specific docking settings, and limitations such as the neglect of protein side‑chain structures, difficulties in handling large binding pockets, and challenges in predicting physically valid structures exist. To accommodate various docking settings and achieve accurate, efficient, and physically reliable docking, we propose a novel two‑stage docking framework, DeltaDock, consisting of pocket prediction and site‑specific docking. We innovatively reframe the pocket prediction task as a pocket‑ligand alignment problem rather than direct prediction in the first stage. Then we follow a bi‑level coarse‑to‑fine iterative refinement process to perform site‑specific docking. Comprehensive experiments demonstrate the superior performance of DeltaDock. Notably, in the blind docking setting, DeltaDock achieves a 31% relative improvement over the docking success rate compared with the previous state‑of‑the‑art GDL model. With the consideration of physical validity, this improvement increases to about 300%.
Authors: Juan Pablo Carrillo-Mora, Moniellen Pires Monteiro, Aníbal R. Lodeiro, V. I. Marconi, María Luisa Cordero
Abstract: The swimming motility of bacteria is driven by the action of bacterial flagellar motors, whose outermost structure is a long and thin helicoidal filament. When the flagellar filaments rotate, the fluid medium exerts an anisotropic viscous drag on them, ultimately leading to bacterial propulsion. The flagellar filaments are protein‑based flexible structures that can break due to interactions with fluid flows. Here, we study the evolution of flagellar filaments in the soil bacterium Bradyrhizobium diazoefficiens after being exposed to shear flows created in long microchannels, for shear rates between 1 s^‑1 and 10^5 s^‑1, and for durations between tens of milliseconds and minutes. We demonstrate that the average swimming speed and fraction of swimming cells decrease after exposure to shear, but both parameters can recover, at least partially, with time. These observations support the hypothesis that shear flows cut flagellar filaments but that reversibly damaged bacterial flagellar motors can be restored thanks to filament regeneration. By fitting our observations with phenomenological expressions, we obtain the individual growth rates of the two different flagellar filaments that B. diazoefficiens possesses, showing that the lateral filaments have a recovery time of about 40 min while the subpolar one requires more than 4.5 h to regrow. Our work demonstrates that simple monitoring of bacterial motility after exposure to shear can be used to characterize the process of flagellar filament breakup and growth, a phenomenon widely present in bacteria swimming in porous soil and exposed to shear flows due to rainfall and watering systems.
Authors: Gabin Schieffer, Ivy Peng
Abstract: In drug discovery, molecular docking aims at characterizing the binding of a drug‑like molecule to a macromolecule. AutoDock‑GPU, a state‑of‑the‑art docking software, estimates the geometrical conformation of a docked ligand‑protein complex by minimizing a scoring function. Our profiling results indicate that the current reduction operation that is heavily used in the scoring function is sub‑optimal. Thus, we developed a method to accelerate the sum reduction of four‑element vectors using matrix operations on NVIDIA Tensor Cores. We integrated the new reduction operation into AutoDock‑GPU and evaluated it on multiple chemical complexes on three GPUs. Our results show that our method for reduction operation is 4‑7 times faster than the AutoDock‑GPU baseline. We also evaluated the impact of our method on the overall simulation time in the real‑world docking simulation and achieved a 27% improvement on the average docking time.
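The idea of expressing a sum reduction as a matrix operation can be sketched in a few lines of NumPy. This only illustrates the algebraic identity (multiplying by a row of ones sums the rows); it is not AutoDock‑GPU code, it does not run on Tensor Cores, and the function name is hypothetical. A real Tensor Core implementation would work on fixed‑size tiles with hardware‑specific precision.

```python
import numpy as np


def reduce_with_matmul(vectors):
    """Sum N four-element vectors using a single matrix product."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if vectors.ndim != 2 or vectors.shape[1] != 4:
        raise ValueError("expected an array of shape (N, 4)")
    ones = np.ones((1, vectors.shape[0]), dtype=np.float32)
    # (1, N) @ (N, 4) -> (1, 4): each output entry is the sum of one component.
    return (ones @ vectors)[0]
```

For example, `reduce_with_matmul(np.arange(12).reshape(3, 4))` gives the same result as summing the three vectors component by component.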
Authors: Fabian H. Kreten, Barbara A. Niemeyer, Ludger Santen, Reza Shaebani
Abstract: A high degree of structural complexity arises in dynamic neuronal dendrites due to extensive branching patterns and diverse spine morphologies, which enable the nervous system to adjust function, construct complex input pathways and thereby enhance the computational power of the system. Owing to the determinant role of dendrite morphology in the functionality of the nervous system, recognition of pathological changes due to neurodegenerative disorders is of crucial importance. We show that the statistical analysis of a temporary signal generated by cargos that have diffusively passed through the complex dendritic structure yields vital information about dendrite morphology. As a feasible scenario, we propose engineering mRNA‑carrying multilamellar liposomes to diffusively reach the soma and release mRNAs, which are translated into a specific protein upon encountering ribosomes. The concentration of this protein over a large population of neurons can be externally measured, as a detectable temporary signal. Using a stochastic coarse‑grained approach for first‑passage through dendrites, we connect the key morphological properties affected by neurodegenerative diseases ‑‑ including the density and size of spines, the extent of the tree, and the segmental increase of dendrite diameter towards soma ‑‑ to the characteristics of the evolving signal. Thus, we establish a direct link between the dendrite morphology and the statistical characteristics of the detectable signal. Our approach provides a fast noninvasive measurement technique to indirectly extract vital information about the morphological evolution of dendrites in the course of neurodegenerative disease progression.
Authors: Allan dos Santos Costa, Ilan Mitnikov, Franco Pellegrini, Ameya Daigavane, Mario Geiger, Zhonglin Cao, Karsten Kreis, Tess Smidt, Emine Kucukbenli, Joseph Jacobson
Abstract: Mapping the conformational dynamics of proteins is crucial for elucidating their functional mechanisms. While Molecular Dynamics (MD) simulation enables detailed time evolution of protein motion, its computational toll hinders its use in practice. To address this challenge, multiple deep learning models for reproducing and accelerating MD have been proposed, drawing on transport‑based generative methods. However, existing work focuses on generation through transport of samples from prior distributions, which can often be distant from the data manifold. The recently proposed framework of stochastic interpolants, instead, enables transport between arbitrary distribution endpoints. Building upon this work, we introduce EquiJump, a transferable SO(3)‑equivariant model that bridges all‑atom protein dynamics simulation time steps directly. Our approach unifies diverse sampling methods and is benchmarked against existing models on trajectory data of fast folding proteins. EquiJump achieves state‑of‑the‑art results on dynamics simulation with a transferable model on all of the fast folding proteins.
Authors: Xiaoran Jiao, Weian Mao, Wengong Jin, Peiyuan Yang, Hao Chen, Chunhua Shen
Abstract: Predicting the change in binding free energy (ΔΔG) is crucial for understanding and modulating protein‑protein interactions, which are critical in drug design. Due to the scarcity of experimental ΔΔG data, existing methods focus on pre‑training, while neglecting the importance of alignment. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre‑trained inverse folding models to ΔΔG prediction. We begin by analyzing the thermodynamic definition of ΔΔG and introducing the Boltzmann distribution to connect energy with protein conformational distribution. However, the protein conformational distribution is intractable; therefore, we employ Bayes' theorem to circumvent direct estimation and instead utilize the log‑likelihood provided by protein inverse folding models for ΔΔG estimation. Compared to previous inverse folding‑based methods, our method explicitly accounts for the unbound state of the protein complex in the ΔΔG thermodynamic cycle, introducing a physical inductive bias and achieving both supervised and unsupervised state‑of‑the‑art (SoTA) performance. Experimental results on SKEMPI v2 indicate that our method achieves Spearman coefficients of 0.3201 (unsupervised) and 0.5134 (supervised), significantly surpassing the previously reported SoTA values of 0.2632 and 0.4324, respectively. Furthermore, we demonstrate the capability of our method on binding energy prediction, protein‑protein docking and antibody optimization tasks.
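One way to read the thermodynamic cycle described above is the following sketch. It assumes ΔG = G_bound − G_unbound = −kT·log(p_bound / p_unbound), applies Bayes' theorem so that each conformational probability is replaced by an inverse‑folding log‑likelihood log p(sequence | structure), and further assumes the wild type and mutant share the same backbone structures so that structure priors cancel. The function name, argument names, and the treatment of kT as a free scale are all illustrative assumptions, not the authors' implementation.

```python
def ddg_from_log_likelihoods(ll_mut_bound, ll_mut_unbound,
                             ll_wt_bound, ll_wt_unbound, kT=1.0):
    """Estimate ddG = dG_mut - dG_wt from four inverse-folding log-likelihoods.

    Each ll_* is log p(sequence | structure) for the mutant or wild-type
    sequence given the bound or unbound structure (hypothetical inputs).
    """
    # dG = -kT * [log p(seq | bound) - log p(seq | unbound)]
    dg_mut = -kT * (ll_mut_bound - ll_mut_unbound)
    dg_wt = -kT * (ll_wt_bound - ll_wt_unbound)
    return dg_mut - dg_wt
```

Under this sign convention, a negative value means the mutation is predicted to favor binding relative to the wild type. Dropping the two unbound terms recovers a bound‑state‑only score, which is the contrast the abstract draws with earlier inverse folding‑based methods.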
Authors: Salman Fariz Navas, Sabine H. L. Klapp
Abstract: The construction of coarse‑grained descriptions of a system's kinetics is well established in biophysics. One prominent example is Markov state models in protein folding dynamics. In this paper, we develop a coarse‑grained, discrete state model of a self‑aggregating colloidal particle system inspired by the concepts of Markov state modelling. The specific self‑aggregating system studied here involves field‑responsive colloidal particles in orthogonal electric and magnetic fields. Starting from particle‑resolved (Brownian dynamics) simulations, we define the discrete states by categorizing each particle according to its local structure. We then describe the kinetics between these states as a series of stochastic, memoryless jumps. In contrast to other works on colloidal self‑assembly, our coarse‑grained approach describes the simultaneous formation and evolution of multiple aggregates from single particles. Our discrete model also takes into account the changes in transition dynamics between the discrete states as the size of the largest cluster grows. We validate the coarse‑grained model by comparing the predicted population fraction in each of the discrete states with those calculated directly from the particle‑resolved simulations as a function of the largest cluster size. We then predict population fractions in the presence of noise‑averaging and in a situation where a model parameter is changed instantaneously after a certain time. Finally, we explore the validity of the detailed balance condition in the various stages of aggregation.
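The notion of memoryless jumps between discrete states can be made concrete with a minimal Markov‑state‑model sketch: count transitions in a discretized trajectory, row‑normalize to get a transition matrix, and propagate population fractions. This uses a single constant transition matrix, whereas the abstract describes transition dynamics that change with the largest cluster size; the function names and the integer state labels are assumptions made for illustration.

```python
import numpy as np


def estimate_transition_matrix(trajectory, n_states, lag=1):
    """Row-normalized transition counts from a sequence of integer state labels."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(trajectory[:-lag], trajectory[lag:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # States that are never left keep an all-zero row instead of dividing by 0.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)


def propagate(populations, transition_matrix, n_steps):
    """Evolve population fractions by repeated memoryless jumps."""
    p = np.asarray(populations, dtype=float)
    for _ in range(n_steps):
        p = p @ transition_matrix
    return p
```

Comparing propagated populations with fractions measured directly from the particle‑resolved simulation is the kind of validation the abstract describes.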
Authors: Fabio Deelan Cunden, Juan Carlos Jimenez-Castellanos, Raquel Ortega Munoz
Abstract: A defining feature of efflux pumps is multidrug polyspecificity, which to date still eludes some of the traditional dogmas of drug binding within protein science. We propose a combinatorial approach to explore the neighbourhood of efflux pump superfamilies in the vast sequence space of polypeptide chains. By generating new candidate structures through structured permutations of existing sequences, this framework aims to uncover hidden determinants of efflux pump functionality.
Authors: Benson Chen, Tomasz Danel, Gabriel H. S. Dreiman, Patrick J. McEnaney, Nikhil Jain, Kirill Novikov, Spurti Umesh Akki, Joshua L. Turnbull, Virja Atul Pandya, Boris P. Belotserkovskii, Jared Bryce Weaver, Ankita Biswas, Dat Nguyen, Kent Gorday, Mohammad Sultan, Nathaniel Stanley, Daniel M Whalen, Divya Kanichar, Christoph Klein, Emily Fox, R. Edward Watts
Abstract: DNA‑Encoded Libraries (DELs) represent a transformative technology in drug discovery, facilitating the high‑throughput exploration of vast chemical spaces. Despite their potential, the scarcity of publicly available DEL datasets presents a bottleneck for the advancement of machine learning methodologies in this domain. To address this gap, we introduce KinDEL, one of the largest publicly accessible DEL datasets and the first one that includes binding poses from molecular docking experiments. Focused on two kinases, Mitogen‑Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1), KinDEL includes 81 million compounds, offering a rich resource for computational exploration. Additionally, we provide comprehensive biophysical assay validation data, encompassing both on‑DNA and off‑DNA measurements, which we use to evaluate a suite of machine learning techniques, including novel structure‑based probabilistic models. We hope that our benchmark, encompassing both 2D and 3D structures, will help advance the development of machine learning models for data‑driven hit identification using DELs.
Authors: Anita Girelli, Maddalena Bin, Mariia Filianina, Michelle Dargasz, Nimmi Das Anthuparambil, Johannes Möller, Alexey Zozulya, Iason Andronis, Sonja Timmermann, Sharon Berkowicz, Sebastian Retzbach, Mario Reiser, Agha Mohammad Raza, Marvin Kowalski, Mohammad Sayed Akhundzadeh, Jenny Schrage, Chang Hee Woo, Maximilian D. Senft, Lara Franziska Reichart, Aliaksandr Leonau, Prince Prabhu Rajaiah, William Chèvremont, Tilo Seydel, Jörg Hallmann, Angel Rodriguez-Fernandez, Jan-Etienne Pudell, Felix Brausse, Ulrike Boesenberg, James Wrigley, Mohamed Youssef, Wei Lu, Wonhyuk Jo, Roman Shayduk, Trey Guest, Anders Madsen, Felix Lehmkühler, Michael Paulus, Fajun Zhang, Frank Schreiber, Christian Gutt, Fivos Perakis
Abstract: Understanding protein motion within the cell is crucial for predicting reaction rates and macromolecular transport in the cytoplasm. A key question is how crowded environments affect protein dynamics through hydrodynamic and direct interactions at molecular length scales. Using megahertz X‑ray Photon Correlation Spectroscopy (MHz‑XPCS) at the European X‑ray Free Electron Laser (EuXFEL), we investigate ferritin diffusion at microsecond time scales. Our results reveal anomalous diffusion, indicated by the non‑exponential decay of the intensity autocorrelation function g_2(q,t) at high concentrations. This behavior is consistent with the presence of cage‑trapping in between the short‑ and long‑time protein diffusion regimes. Modeling with the δγ‑theory of hydrodynamically interacting colloidal spheres successfully reproduces the experimental data by including a scaling factor linked to the protein direct interactions. These findings offer new insights into the complex molecular motion in crowded protein solutions, with potential applications for optimizing ferritin‑based drug delivery, where protein diffusion is the rate‑limiting step.
Authors: Yi Zhou, Yilai Li, Jing Yuan, Quanquan Gu
Abstract: Cryo‑electron microscopy (cryo‑EM) is a powerful technique in structural biology and drug discovery, enabling the study of biomolecules at high resolution. Significant advancements by structural biologists using cryo‑EM have led to the production of over 38,626 protein density maps at various resolutions. However, cryo‑EM data processing algorithms have yet to fully benefit from our knowledge of biomolecular density maps, with only a few recent models being data‑driven but limited to specific tasks. In this study, we present CryoFM, a foundation model designed as a generative model, learning the distribution of high‑quality density maps and generalizing effectively to downstream tasks. Built on flow matching, CryoFM is trained to accurately capture the prior distribution of biomolecular density maps. Furthermore, we introduce a flow posterior sampling method that leverages CryoFM as a flexible prior for several downstream tasks in cryo‑EM and cryo‑electron tomography (cryo‑ET) without the need for fine‑tuning, achieving state‑of‑the‑art performance on most tasks and demonstrating its potential as a foundational model for broader applications in these fields.
Authors: Germaine Neza Hozana, Gonzalo Díaz Mirón, Ali Hassanali
Abstract: Over the last decade, there has been a growing body of experimental work showing that proteins devoid of aromatic and conjugated groups can absorb light in the near‑UV beyond 300 nm and emit visible light. Understanding the origins of this phenomenon offers the possibility of designing non‑invasive spectroscopic probes for local interactions in biological systems. It was recently found that the synthetic protein α3C displays UV‑vis absorption between 250‑800 nm, which was shown to arise from charge‑transfer excitations between charged amino acids. In this work, we use data‑driven discovery to revisit the origins of these features using molecular dynamics and excited‑state simulations. Specifically, an unsupervised learning approach, beginning with encoding protein environments with local atomic descriptors, is employed to automatically detect relevant structural motifs. We identify three main motifs corresponding to different hydrogen‑bonding patterns that are subsequently used to perform QM/MM simulations including the entire protein and solvent bath with the density‑functional tight‑binding (DFTB) approach. Hydrogen‑bonding structures involving arginine and carboxylate groups appear to be the most prone to near‑UV absorption. We show that these features are highly sensitive to the size of the QM region employed as well as to the inclusion of explicit solvation, underscoring the limitations of using gas‑phase cluster models.
Authors: Olga Anosova, Alexey Gorelov, William Jeffcott, Ziqiu Jiang, Vitaliy Kurlin
Abstract: Proteins are large biomolecules that regulate all living organisms and consist of one or several chains. The primary structure of a protein chain is a sequence of amino acid residues whose three main atoms (alpha‑carbon, nitrogen, and carbonyl carbon) form a protein backbone. The tertiary structure is the rigid shape of a protein chain represented by atomic positions in 3‑dimensional space. Because different geometric structures often have distinct functional properties, it is important to continuously quantify differences in rigid shapes of protein backbones. Unfortunately, many widely used similarities of proteins fail axioms of a distance metric and discontinuously change under tiny perturbations of atoms.
This paper develops a complete invariant that identifies any protein backbone in 3‑dimensional space, uniquely under rigid motion. This invariant is Lipschitz bi‑continuous in the sense that it changes up to a constant multiple of a maximum perturbation of atoms, and vice versa. The new invariant has been used to detect thousands of (near‑)duplicates in the Protein Data Bank, whose presence inevitably skews machine learning predictions. The resulting invariant space allows low‑dimensional maps with analytically defined coordinates that reveal substantial variability in the protein universe.
Authors: Jarrid Rector-Brooks, Mohsin Hasan, Zhangzhi Peng, Zachary Quinn, Chenghao Liu, Sarthak Mittal, Nouha Dziri, Michael Bronstein, Yoshua Bengio, Pranam Chatterjee, Alexander Tong, Avishek Joey Bose
Abstract: Generative modeling of discrete data underlies important applications spanning text‑based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process ‑ typically via RLHF ‑ to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre‑trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation‑free, and thus scalable while applying to general non‑differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class‑conditional pixel‑level image modeling, RLHF‑based alignment of MDMs using text‑based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet‑lab validation, where we observe transient expression of reward‑optimized protein sequences.
Authors: Yuanqi Du, Michael Plainer, Rob Brekelmans, Chenru Duan, Frank Noé, Carla P. Gomes, Alán Aspuru-Guzik, Kirill Neklyudov
Abstract: Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's h‑transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's h‑transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation‑free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real‑world molecular simulation and protein folding tasks.
Authors: Xu Wang, Yiquan Wang, Tin-Yeh Huang, Yuhua Dong, Jia Deng, Longji Xu, Xiang Li, Rui He
Abstract: Navigating vast, rugged biological fitness landscapes to discover high‑value functional patterns, such as optimal protein sequences, is a central challenge in health informatics. However, conventional algorithms often struggle with the exploration‑exploitation dilemma, failing to synergize global search with deep local refinement, which leads to entrapment in suboptimal solutions. To overcome this barrier, we introduce Octopus Inspired Optimization (OIO), a novel hierarchical metaheuristic that mimics the octopus's unique neural architecture to intrinsically unify centralized global exploration and parallelized local exploitation. We validated OIO on a real‑world protein engineering benchmark, where it surpassed 15 competing metaheuristics. This success is underpinned by OIO's architectural suitability for protein‑like landscapes, confirmed by its top ranking on the NK‑Landscape benchmark, and its powerful optimization engine, demonstrated by its first‑place performance on the gold‑standard CEC2022 benchmark. OIO thus provides a robust, nature‑inspired computational tool for complex optimization problems in drug discovery and personalized medicine.
Authors: Fabio S. Ferreira, John Ashburner, Arabella Bouzigues, Chatrin Suksasilp, Lucy L. Russell, Phoebe H. Foster, Eve Ferry-Bolder, John C. van Swieten, Lize C. Jiskoot, Harro Seelaar, Raquel Sanchez-Valle, Robert Laforce, Caroline Graff, Daniela Galimberti, Rik Vandenberghe, Alexandre de Mendonca, Pietro Tiraboschi, Isabel Santana, Alexander Gerhard, Johannes Levin, Sandro Sorbi, Markus Otto, Florence Pasquier, Simon Ducharme, Chris R. Butler, Isabelle Le Ber, Elizabeth Finger, Maria C. Tartaglia, Mario Masellis, James B. Rowe, Matthis Synofzik, Fermin Moreno, Barbara Borroni, Samuel Kaski, Jonathan D. Rohrer, Janaina Mourao-Miranda
Abstract: In this study, we propose a novel approach to uncover subgroup‑specific and subgroup‑common latent factors addressing the challenges posed by the heterogeneity of neurological and mental disorders, which hinder disease understanding, treatment development, and outcome prediction. The proposed approach, sparse Group Factor Analysis (GFA) with regularised horseshoe priors, was implemented with probabilistic programming and can uncover associations (or latent factors) among multiple data modalities differentially expressed in sample subgroups. Synthetic data experiments showed the robustness of our sparse GFA by correctly inferring latent factors and model parameters. When applied to the Genetic Frontotemporal Dementia Initiative (GENFI) dataset, which comprises patients with frontotemporal dementia (FTD) with genetically defined subgroups, the sparse GFA identified latent disease factors differentially expressed across the subgroups, distinguishing between "subgroup‑specific" latent factors within homogeneous groups and "subgroup common" latent factors shared across subgroups. The latent disease factors captured associations between brain structure and non‑imaging variables (i.e., questionnaires assessing behaviour and disease severity) across the different genetic subgroups, offering insights into disease profiles. Importantly, two latent factors were more pronounced in the two more homogeneous FTD patient subgroups (progranulin (GRN) and microtubule‑associated protein tau (MAPT) mutation), showcasing the method's ability to reveal subgroup‑specific characteristics. These findings underscore the potential of sparse GFA for integrating multiple data modalities and identifying interpretable latent disease factors that can improve the characterization and stratification of patients with neurological and mental health disorders.
Authors: Ismail Erbas, Aporva Amarnath, Vikas Pandey, Karthik Swaminathan, Naigang Wang, Xavier Intes
Abstract: Fluorescence lifetime imaging (FLI) is a widely used technique in the biomedical field for measuring the decay times of fluorescent molecules, providing insights into metabolic states, protein interactions, and ligand‑receptor bindings. However, its broader application in fast biological processes, such as dynamic activity monitoring, and clinical use, such as in guided surgery, is limited by long data acquisition times and computationally demanding data processing. While deep learning has reduced post‑processing times, time‑resolved data acquisition remains a bottleneck for real‑time applications. To address this, we propose a method to achieve real‑time FLI using an FPGA‑based hardware accelerator. Specifically, we implemented a GRU‑based sequence‑to‑sequence (Seq2Seq) model on an FPGA board compatible with time‑resolved cameras. The GRU model balances accurate processing with the resource constraints of FPGAs, which have limited DSP units and BRAM. The limited memory and computational resources on the FPGA require efficient scheduling of operations and memory allocation to deploy deep learning models for low‑latency applications. We address these challenges by using STOMP, a queue‑based discrete‑event simulator that automates and optimizes task scheduling and memory management on hardware. By integrating a GRU‑based Seq2Seq model and its compressed version, called Seq2SeqLite, generated through knowledge distillation, we were able to process multiple pixels in parallel, reducing latency compared to sequential processing. We explore various levels of parallelism to achieve an optimal balance between performance and resource utilization. Our results indicate that the proposed techniques achieved a 17.7x and 52.0x speedup over manual scheduling for the Seq2Seq model and the Seq2SeqLite model, respectively.
Authors: Alex M. Tseng, Gokcen Eraslan, Tommaso Biancalani, Gabriele Scalia
Abstract: Deep neural networks excel in mapping genomic DNA sequences to associated readouts (e.g., protein‑DNA binding). Beyond prediction, the goal of these networks is to reveal to scientists the underlying motifs (and their syntax) which drive genome regulation. Traditional methods that extract motifs from convolutional filters suffer from the uninterpretable dispersion of information across filters and layers. Other methods which rely on importance scores can be unstable and unreliable. Instead, we designed a novel mechanistically interpretable architecture for regulatory genomics, where motifs and their syntax are directly encoded and readable from the learned weights and activations. We provide theoretical and empirical evidence of our architecture's full expressivity, while still being highly interpretable. Through several experiments, we show that our architecture excels in de novo motif discovery and motif instance calling, is robust to variable sequence contexts, and enables fully interpretable generation of novel functional sequences.
Authors: Fadi Alharbi, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Mohanad Mohammed
Abstract: Multi‑omics data is increasingly being utilized to advance computational methods for cancer classification. However, multi‑omics data integration poses significant challenges due to the high dimensionality, data complexity, and distinct characteristics of various omics types. This study addresses these challenges and evaluates three graph neural network architectures for multi‑omics (MO) integration based on graph‑convolutional networks (GCN), graph‑attention networks (GAT), and graph‑transformer networks (GTN) for classifying 31 cancer types and normal tissues. To address the high‑dimensionality of multi‑omics data, we employed LASSO (Least Absolute Shrinkage and Selection Operator) regression for feature selection, leading to the creation of LASSO‑MOGCN, LASSO‑MOGAT, and LASSO‑MOTGN models. Graph structures for the networks were constructed using gene correlation matrices and protein‑protein interaction networks for multi‑omics integration of messenger‑RNA, micro‑RNA, and DNA methylation data. Such data integration enables the networks to dynamically focus on important relationships between biological entities, improving both model performance and interpretability. Among the models, LASSO‑MOGAT with a correlation‑based graph structure achieved state‑of‑the‑art accuracy (95.9%) and outperformed the LASSO‑MOGCN and LASSO‑MOTGN models in terms of precision, recall, and F1‑score. Our findings demonstrate that integrating multi‑omics data in graph‑based architectures enhances cancer classification performance by uncovering distinct molecular patterns that contribute to a better understanding of cancer biology and potential biomarkers for disease progression.
Authors: Friso de Kruiff, Erik Bekkers, Ozan Öktem, Carola-Bibiane Schönlieb, Willem Diepeveen
Abstract: We propose Pullback Flow Matching (PFM), a novel framework for generative modeling on data manifolds. Unlike existing methods that assume or learn restrictive closed‑form manifold mappings for training Riemannian Flow Matching (RFM) models, PFM leverages pullback geometry and isometric learning to preserve the underlying manifold's geometry while enabling efficient generation and precise interpolation in latent space. This approach not only facilitates closed‑form mappings on the data manifold but also allows for designable latent spaces, using assumed metrics on both data and latent manifolds. By enhancing isometric learning through Neural ODEs and proposing a scalable training objective, we achieve a latent space more suitable for interpolation, leading to improved manifold learning and generative performance. We demonstrate PFM's effectiveness through applications in synthetic data, protein dynamics and protein sequence data, generating novel proteins with specific properties. This method shows strong potential for drug discovery and materials science, where generating novel samples with specific properties is of great interest.
Authors: Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, Jinkyoo Park
Abstract: Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out‑of‑distribution inputs. To address this, we propose a novel off‑policy search, δ‑Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high‑score offline sequences, we inject noise by randomly masking tokens with probability δ, then denoise them using our policy. We further adapt δ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances the off‑policy training, outperforming existing machine learning methods in discovering high‑score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.
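The noise-then-denoise step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `mask_with_probability`, `denoise`, and `uniform_policy` are hypothetical, the mask symbol is an assumption, and the uniform random policy merely stands in for a trained policy. The adaptive choice of δ from proxy uncertainty is not modeled here.

```python
import random

MASK = "_"  # assumed placeholder symbol for a masked token

def mask_with_probability(sequence, delta, rng):
    """Replace each token by MASK independently with probability delta."""
    return [MASK if rng.random() < delta else tok for tok in sequence]

def denoise(masked, policy, rng):
    """Fill masked positions left to right using policy(prefix, rng)."""
    out = []
    for tok in masked:
        out.append(policy(out, rng) if tok == MASK else tok)
    return out

def uniform_policy(prefix, rng):
    """Stand-in for a trained policy: picks a nucleotide uniformly at random."""
    return rng.choice("ACGT")

rng = random.Random(0)
seed_seq = list("ACGTACGTAC")  # toy stand-in for a high-score offline sequence
masked = mask_with_probability(seed_seq, 0.3, rng)
proposal = denoise(masked, uniform_policy, rng)
```

A small δ keeps the proposal close to the starting sequence, while a larger δ lets the policy rewrite more positions; in this sketch every unmasked token is carried over unchanged.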
Authors: Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro
Abstract: Low Rank Adaptation (LoRA) is a popular Parameter Efficient Fine Tuning (PEFT) method that effectively adapts large pre‑trained models for downstream tasks. LoRA parameterizes model updates using low‑rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, resource requirements during fine‑tuning. However, the lower bound on the number of trainable parameters remains high due to the use of the low‑rank matrix model. Recent works have addressed this limitation by proposing low rank tensor parameterizations for model updates. However, they only exploit redundancy across layers, or tensorize individual matrices using ad‑hoc schemes that introduce additional hyperparameters. In this work, we propose a higher‑order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation compared to existing matrix and tensor based PEFT methods. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization and Protein Folding benchmarks demonstrate that our method can achieve a reduction in the number of parameters while maintaining comparable performance.
Authors: Justin Airas, Bin Zhang
Abstract: Graph neural network (GNN) architectures have emerged as promising force field models, exhibiting high accuracy in predicting complex energies and forces based on atomic identities and Cartesian coordinates. To expand the applicability of GNNs, and machine learning force fields more broadly, optimizing their computational efficiency is critical, especially for large biomolecular systems in classical molecular dynamics simulations. In this study, we address key challenges in existing GNN benchmarks by introducing a dataset, DISPEF, which comprises large, biologically relevant proteins. DISPEF includes 207,454 proteins with sizes up to 12,499 atoms and features diverse chemical environments, spanning folded and disordered regions. The implicit solvation free energies, used as training targets, represent a particularly challenging case due to their many‑body nature, providing a stringent test for evaluating the expressiveness of machine learning models. We benchmark the performance of seven GNNs on DISPEF, emphasizing the importance of directly accounting for long‑range interactions to enhance model transferability. Additionally, we present a novel multiscale architecture, termed Schake, which delivers transferable and computationally efficient energy and force predictions for large proteins. Our findings offer valuable insights and tools for advancing GNNs in protein modeling applications.
Authors: Ayesha Ejaz, Markus Sutter, Sigal Lechno-Yossef, Cheryl A. Kerfeld, Allison Squires
Abstract: Photosynthetic organisms rely on sophisticated photoprotective mechanisms to prevent oxidative damage under high or fluctuating solar illumination. Cyanobacteria, which have evolved a modular, water‑soluble light harvesting complex (the phycobilisome), achieve photoprotection through a unique, photoactivatable quencher called the Orange Carotenoid Protein (OCP). Although phycobiliproteins are highly conserved, phycobilisomes take on different macromolecular architectures in different species of cyanobacteria, and it is not well understood whether or how these structures relate to changes in photoprotective function. To learn whether OCP functions similarly across species with different core architectures, we experimentally compare the photophysical states accessible to prototypical tricylindrical and pentacylindrical phycobilisomes, with and without OCP, at the single‑molecule level using an Anti‑Brownian ELectrokinetic (ABEL) trap. We compare our data to Monte Carlo simulations of exciton transfer in compartmental models of phycobilisomes with OCP bound at different combinations of predicted docking sites. Our results suggest that while some aspects of OCP function are influenced by phycobilisome architecture, others are surprisingly well‑conserved: OCP appears to bind at different locations in each architecture and cross‑species OCP‑phycobilisome compatibility is asymmetric, yet the quenching strength and dimeric binding of OCP appear to be similar for both phycobilisome architectures. Together, our findings provide new insights into how the uniquely modular architecture of phycobilisomes enables robust conservation as well as fine‑tuning of the OCP quenching mechanism across species.
Authors: Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang
Abstract: Large language models (LLMs) have a transformative impact on a variety of scientific tasks across disciplines including biology, chemistry, medicine, and physics. However, ensuring the safety alignment of these models in scientific research remains an underexplored area, with existing benchmarks primarily focusing on textual content and overlooking key scientific representations such as molecular, protein, and genomic languages. Moreover, the safety mechanisms of LLMs in scientific tasks are insufficiently studied. To address these limitations, we introduce SciSafeEval, a comprehensive benchmark designed to evaluate the safety alignment of LLMs across a range of scientific tasks. SciSafeEval spans multiple scientific languages (textual, molecular, protein, and genomic) and covers a wide range of scientific domains. We evaluate LLMs in zero‑shot, few‑shot and chain‑of‑thought settings, and introduce a "jailbreak" enhancement feature that challenges LLMs equipped with safety guardrails, rigorously testing their defenses against malicious intention. Our benchmark surpasses existing safety datasets in both scale and scope, providing a robust platform for assessing the safety and performance of LLMs in scientific contexts. This work aims to facilitate the responsible development and deployment of LLMs, promoting alignment with safety and ethical standards in scientific research.
Authors: Jason Yang, Aadyot Bhatnagar, Jeffrey A. Ruffolo, Ali Madani
Abstract: The conditional generation of proteins with desired functions is a key goal for generative models. Existing methods based on prompting of protein language models (PLMs) can generate proteins conditioned on a target functionality, such as a desired enzyme family. However, these methods are limited to simple, tokenized conditioning and have not been shown to generalize to unseen functions. In this study, we propose ProCALM (Protein Conditionally Adapted Language Model), an approach for the conditional generation of proteins using adapters to PLMs. While previous methods have used adapters for structure‑conditioned generation from PLMs, our implementation of ProCALM involves finetuning ProGen2 to condition generation based on versatile representations of protein function, e.g., enzyme family, taxonomy, or natural language descriptions. ProCALM matches or exceeds the performance of existing methods at conditional sequence generation from target functions. Impressively, it can also generalize to rare and unseen functions. Overall, ProCALM is a flexible and computationally efficient approach, and we expect that it can be extended to a wide range of generative language models.
Authors: Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang
Abstract: Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. The recent development of protein language models (pLMs) with supervised fine‑tuning provides a promising solution to this problem. However, a fine‑tuned model is tailored to a particular downstream prediction task, and achieving general‑purpose protein understanding remains a challenge. In this paper, we introduce the Structure‑Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure‑aware module into pLMs to enrich their structural knowledge, and subsequently integrates these enhanced pLMs with large language models (LLMs) to advance protein understanding. In this framework, we propose a novel instruction tuning pipeline. First, we warm up the enhanced pLMs using contrastive learning and structure denoising. Then, caption‑based instructions are used to establish a basic understanding of proteins. Finally, we refine this understanding by employing a mixture of experts (MoEs) to capture more complex properties and functional information with the same number of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general‑purpose protein understanding model. Extensive experiments on both open‑ended generation and closed‑set answer tasks demonstrate the superior performance of SEPIT over both closed‑source general LLMs and open‑source LLMs trained with protein knowledge.
Authors: Tiexin Qin, Mengxu Zhu, Chunyang Li, Terry Lyons, Hong Yan, Haoliang Li
Abstract: Understanding protein dynamics is essential for deciphering protein functional mechanisms and developing molecular therapies. However, the complex high‑dimensional dynamics and interatomic interactions of biological processes pose significant challenges for existing computational techniques. In this paper, we approach this problem for the first time by introducing Deep Signature, a novel computationally tractable framework that characterizes complex dynamics and interatomic interactions based on their evolving trajectories. Specifically, our approach incorporates soft spectral clustering that locally aggregates cooperative dynamics to reduce the size of the system, as well as a signature transform that collects iterated integrals to provide a global characterization of the non‑smooth interactive dynamics. Theoretical analysis demonstrates that Deep Signature exhibits several desirable properties, including invariance to translation, near invariance to rotation, equivariance to permutation of atomic coordinates, and invariance under time reparameterization. Furthermore, experimental results on three benchmarks of biological processes verify that our approach can achieve superior performance compared to baseline methods.
Authors: Wei Guo, Yuchen Zhu, Molei Tao, Yongxin Chen
Abstract: This article makes discrete masked models for the generative modeling of discrete data controllable. The goal is to generate samples of a discrete random variable that adheres to a posterior distribution, satisfies specific constraints, or optimizes a reward function. This methodological development enables broad applications across downstream tasks such as class‑specific image generation and protein design. Existing approaches for controllable generation of masked models typically rely on task‑specific fine‑tuning or additional modifications, which can be inefficient and resource‑intensive. To overcome these limitations, we propose a novel plug‑and‑play framework based on importance sampling that bypasses the need for training a conditional score. Our framework is agnostic to the choice of control criteria, requires no gradient information, and is well‑suited for tasks such as posterior sampling, Bayesian inverse problems, and constrained generation. We demonstrate the effectiveness of our approach through extensive experiments, showcasing its versatility across multiple domains, including protein design.
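As a rough illustration of gradient-free control by importance sampling, the sketch below applies generic self-normalized importance resampling to samples from an unconditional model, targeting a reward-tilted distribution proportional to p(x)·exp(r(x)). This is a textbook technique used here as an assumption; it is not the framework proposed in the paper, and `importance_resample`, `reward`, and the toy two-letter "model" are hypothetical.

```python
import math
import random

def importance_resample(samples, log_weight, n_out, rng):
    """Resample `samples` with probability proportional to exp(log_weight(x))."""
    log_w = [log_weight(s) for s in samples]
    shift = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    return rng.choices(samples, weights=weights, k=n_out)

def reward(seq):
    """Toy reward: favors sequences containing many 'A' tokens."""
    return 2.0 * seq.count("A")

def mean_a(seqs):
    """Average number of 'A' tokens per sequence."""
    return sum(s.count("A") for s in seqs) / len(seqs)

rng = random.Random(0)
# Toy stand-in for unconditional samples from a pretrained model.
base_samples = ["".join(rng.choice("AB") for _ in range(4)) for _ in range(2000)]
tilted = importance_resample(base_samples, reward, 2000, rng)
```

Only reward evaluations on existing samples are needed, with no gradients and no retraining, which is the property the abstract emphasizes. Comparing `mean_a(base_samples)` with `mean_a(tilted)` shows the shift toward high-reward sequences.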
Authors: Hossein Sholehrasa, Majid Jaberi-Douraki
Abstract: Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K‑means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor‑positive and triple‑negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein‑protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.
Authors: Erik Jansson, Jonathan Krook, Klas Modin, Ozan Öktem
Abstract: We address recovery of the three‑dimensional backbone structure of single polypeptide proteins from single‑particle cryo‑electron microscopy (Cryo‑SPA) data. Cryo‑SPA produces noisy tomographic projections of electrostatic potentials of macromolecules. From these projections, we use methods from shape analysis to recover the three‑dimensional backbone structure. Thus, we view the reconstruction problem as an indirect matching problem, where a point cloud representation of the protein backbone is deformed to match 2D tomography data. The deformations are obtained via the action of a matrix Lie group. By selecting a deformation energy, the optimality conditions are obtained, which lead to computational algorithms for optimal deformations. We showcase our approach on synthetic data, for which we recover the three‑dimensional structure of the backbone.
Authors: Xuefeng Liu, Songhao Jiang, Xiaotian Duan, Archit Vasan, Qinan Huang, Chong Liu, Michelle M. Li, Heng Ma, Thomas Brettin, Arvind Ramanathan, Fangfang Xia, Mengdi Wang, Abhishek Pandey, Marinka Zitnik, Ian T. Foster, Jinbo Xu, Rick L. Stevens
Abstract: Protein‑ligand binding is the process by which a small molecule (drug or inhibitor) attaches to a target protein. Binding affinity, which characterizes the strength of biomolecular interactions, is essential for tackling diverse challenges in life sciences, including therapeutic design, protein engineering, enzyme optimization, and elucidating biological mechanisms. Much work has been devoted to predicting binding affinity over the past decades. Here, we review recent significant works, with a focus on methods, evaluation strategies, and benchmark datasets. We note growing use of both traditional machine learning and deep learning models for predicting binding affinity, accompanied by an increasing amount of data on proteins and small drug‑like molecules. With improved predictive performance and the FDA's phasing out of animal testing, AI‑driven in silico models, such as AI virtual cells (AIVCs), are poised to advance binding affinity prediction; reciprocally, progress in building binding affinity predictors can refine AIVCs. Future efforts in binding affinity prediction and AI‑driven in silico models can enhance the simulation of temporal dynamics, cell‑type specificity, and multi‑omics integration to support more accurate and personalized outcomes.
Authors: Bernd Ulmann, Shrish Roy
Abstract: Oscillator‑based Ising machines are non‑von‑Neumann machines ideally suited for solving combinatorial problems that are otherwise intractable on classic stored‑program digital computers due to their run‑time complexity. Possible future applications are manifold, ranging from quantum simulations to protein folding, and are of high academic and commercial interest. Described in the following is a very simple such machine aimed at educational and research applications.
Authors: Prabeen Kumar Pattnayak, Aloke Kumar, Gaurav Tomar
Abstract: Advances in controlled polymerization have enabled the synthesis of mechanically interlocked polymers like molecular knots and linear[n]catenane. These aesthetic macromolecules with unique topological constraints in the form of mechanical bonds are well known for their fascinating transport and rheological properties in the development of molecular machines and in knotted protein dynamics in biological applications. The diffusion dynamics of such macromolecular structures with large internal degrees of freedom are generally studied by using an equivalent size parameter, i.e., hydrodynamic radius, defined using Zimm theory. Although diffusion rates are expected to depend strongly on the molecular topological constraints in macromolecules, their explicit effects on translational and reorientational dynamics are still unknown. Here, we perform an in silico study on the diffusion dynamics of seven topologically distinct polymer chains in the limit of infinite dilution using multi‑particle collision dynamics. The modeled polymers are linear, ring, linear[2]catenane, trefoil knot, linear[3]catenane, cyclic[3]catenane, and Borromean ring. The molecular weights of these macromolecules are selected such that the resulting hydrodynamic radius is approximately equal to each other. We show that while the translational diffusion coefficients of these topologically distinct polymer chains are approximately equal to each other in agreement with the Zimm theory, there are significant differences among the values of the corresponding rotational diffusion coefficients. We show that the presence of mechanical bonds in the polymer chains slows down the rotational diffusion significantly, thus suggesting the role of molecular topology on reaction kinetics of macromolecules.
Authors: Henrik Weyer, Tobias A. Roth, Erwin Frey
Abstract: For cellular functions like division and polarization, protein pattern formation driven by NTPase cycles is a central spatial control strategy. Operating far from equilibrium, no general theory links microscopic reaction networks and parameters to the pattern type and dynamics. We discover a generic mechanism giving rise to an effective interfacial tension organizing the macroscopic structure of non‑equilibrium steady‑state patterns. Namely, maintaining protein‑density interfaces by cyclic protein attachment and detachment produces curvature‑dependent protein redistribution which straightens the interface. We develop a non‑equilibrium Neumann angle law and Plateau vertex conditions for interface junctions and mesh patterns, thus introducing the concepts of "Turing mixtures" and "Turing foams". In contrast to liquid foams and mixtures, these non‑equilibrium patterns can select an intrinsic wavelength by interrupting an equilibrium‑like coarsening process. Data from in vitro experiments with the E. coli Min protein system verifies the vertex conditions and supports the wavelength dynamics. Our study uncovers interface laws with correspondence to thermodynamic relations that arise from distinct physical processes in active systems. It allows the design of specific pattern morphologies with potential applications as spatial control strategies in synthetic cells.
Authors: Yun Zhou, Gang Chen, Bing Xue, Mengjie Zhang, Jeremy S. Rooney, Kirill Lagutin, Andrew MacKenzie, Keith C. Gordon, Daniel P. Killeen
Abstract: The rapid and accurate detection of biochemical compositions in fish is a crucial real‑world task that facilitates optimal utilization and extraction of high‑value products in the seafood industry. Raman spectroscopy provides a promising solution for quickly and non‑destructively analyzing the biochemical composition of fish by associating Raman spectra with biochemical reference data using machine learning regression models. This paper investigates different regression models to address this task and proposes a new design of Convolutional Neural Networks (CNNs) for jointly predicting water, protein, and lipids yield. To the best of our knowledge, we are the first to conduct a successful study employing CNNs to analyze the biochemical composition of fish based on a very small Raman spectroscopic dataset. Our approach combines a tailored CNN architecture with a comprehensive data preparation procedure, effectively mitigating the challenges posed by extreme data scarcity. The results demonstrate that our CNN can significantly outperform two state‑of‑the‑art CNN models and multiple traditional machine learning models, paving the way for accurate and automated analysis of fish biochemical composition.
Authors: Rodrigo Henrique Ramos, Yago Augusto Bardelotte, Cynthia de Oliveira Lage Ferreira, Adenilso Simao
Abstract: Identifying driver genes is crucial for understanding oncogenesis and developing targeted cancer therapies. Driver discovery methods using protein or pathway networks rely on traditional network science measures, focusing on nodes, edges, or community metrics. These methods can overlook the high‑dimensional interactions that cancer genes have within cancer networks. This study presents a novel method using Persistent Homology to analyze the role of driver genes in higher‑order structures within Cancer Consensus Networks derived from main cellular pathways. We integrate mutation data from six cancer types and three biological functions: DNA Repair, Chromatin Organization, and Programmed Cell Death. We systematically evaluated the impact of gene removal on topological voids (β_2 structures) within the Cancer Consensus Networks. Our results reveal that only known driver genes and cancer‑associated genes influence these structures, while passenger genes do not. Although centrality measures alone proved insufficient to fully characterize impact genes, combining higher‑order topological analysis with traditional network metrics can improve the precision of distinguishing between drivers and passengers. This work shows that cancer genes play an important role in higher‑order structures, going beyond pairwise measures, and provides an approach to distinguish drivers and cancer‑associated genes from passenger genes.
Authors: Furio Surfaro, Fajun Zhang, Frank Schreiber, Roland Roth
Abstract: Patchy particles are an intriguing subject of study and indeed a model system in the field of soft matter physics. In recent years, patchy particle models have been applied to describe a wide variety of systems, including colloidal crystals, macromolecular interactions, liquid crystals, and nanoparticle assemblies. Given the importance of the topic, rationalizing and capturing the basic features of these models is crucial to their correct application in specific systems. In this study, we extend the ion‑activated attractive patchy particles model previously employed to elucidate the phase behavior of protein solutions in the presence of trivalent salts. Our extension incorporates the effect of repulsion between unoccupied and occupied binding sites, depicted as patches. Furthermore, we examine the influence of model parameters on the liquid‑vapor coexistence region within the phase diagram, employing numerical methods. A deeper understanding of this model will facilitate a better comprehension of the effects observed in experiments.
Authors: Melis Ilayda Bal, Pier Giuseppe Sessa, Mojmir Mutny, Andreas Krause
Abstract: Bayesian optimization (BO) is a powerful framework to optimize black‑box expensive‑to‑evaluate functions via sequential interactions. In several important problems (e.g., drug discovery, circuit design, neural architecture search), though, such functions are defined over large combinatorial and unstructured spaces. This makes existing BO algorithms infeasible due to the intractable maximization of the acquisition function over these domains. To address this issue, we propose GameOpt, a novel game‑theoretical approach to combinatorial BO. GameOpt establishes a cooperative game between the different optimization variables, and selects points that are game equilibria of an upper confidence bound acquisition function. These are stable configurations from which no variable has an incentive to deviate, analogous to local optima in continuous domains. Crucially, this allows us to efficiently break down the complexity of the combinatorial domain into individual decision sets, making GameOpt scalable to large combinatorial spaces. We demonstrate the application of GameOpt to the challenging protein design problem and validate its performance on four real‑world protein datasets. Each protein can take up to 20^X possible configurations, where X is the length of a protein, making standard BO methods infeasible. Instead, our approach iteratively selects informative protein configurations and very quickly discovers highly active protein variants compared to other baselines.
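To make the equilibrium idea concrete, here is a minimal Python sketch of a best-response sweep over sequence positions. It is a sketch under stated assumptions, not GameOpt itself: `best_response_equilibrium` is a hypothetical helper, and the toy `score` function replaces the upper confidence bound acquisition function. Each sweep costs at most 20·X score evaluations, compared with 20^X configurations for exhaustive search.

```python
def best_response_equilibrium(score, x, alphabet, max_rounds=100):
    """Update one position at a time until no single-position change improves score."""
    x = list(x)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(x)):
            # Best response of position i with all other positions held fixed.
            best = max(alphabet, key=lambda a: score(x[:i] + [a] + x[i + 1:]))
            if score(x[:i] + [best] + x[i + 1:]) > score(x):
                x[i] = best
                changed = True
        if not changed:
            break  # equilibrium: no position has an incentive to deviate
    return x

ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
TARGET = list("ACDE")                    # toy optimum, purely illustrative

def score(seq):
    """Toy stand-in for an acquisition function: matches to TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))

equilibrium = best_response_equilibrium(score, list("AAAA"), ALPHABET)
search_space = len(ALPHABET) ** len(TARGET)  # 20^X with X = 4
```

For this separable toy score the equilibrium coincides with the global optimum; in general an equilibrium is only stable against single-variable deviations, which matches the abstract's analogy to local optima.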
Authors: Kevin Borisiak, Gian Marco Visani, Armita Nourmohammad
Abstract: Predicting protein functional characteristics from structure remains a central problem in protein science, with broad implications from understanding the mechanisms of disease to designing novel therapeutics. Unfortunately, current machine learning methods are limited by scarce and biased experimental data, and physics‑based methods are either too slow to be useful, or too simplified to be accurate. In this work, we present Loop‑Diffusion, an energy‑based diffusion model that leverages a dataset of general protein loops from the entire protein universe to learn an energy function that generalizes to functional prediction tasks. We evaluate Loop‑Diffusion's performance on scoring TCR‑pMHC interfaces and demonstrate state‑of‑the‑art results in recognizing binding‑enhancing mutations.
Authors: Antonio Mirarchi, Raul P. Pelaez, Guillem Simeon, Gianni De Fabritiis
Abstract: All‑atom molecular simulations offer detailed insights into macromolecular phenomena, but their substantial computational cost hinders the exploration of complex biological processes. We introduce Advanced Machine‑learning Atomic Representation Omni‑force‑field (AMARO), a new neural network potential (NNP) that combines an O(3)‑equivariant message‑passing neural network architecture, TensorNet, with a coarse‑graining map that excludes hydrogen atoms. AMARO demonstrates the feasibility of training coarser NNPs, without prior energy terms, to run stable protein dynamics with scalability and generalization capabilities.
Authors: Luiz Felipe Vecchietti, Minji Lee, Begench Hangeldiyev, Hyunkyu Jung, Hahnbeom Park, Tae-Kyun Kim, Meeyoung Cha, Ho Min Kim
Abstract: Recent advancements in machine learning (ML) are transforming the field of structural biology. For example, AlphaFold, a groundbreaking neural network for protein structure prediction, has been widely adopted by researchers. The availability of easy‑to‑use interfaces and interpretable outcomes from the neural network architecture, such as the confidence scores used to color the predicted structures, have made AlphaFold accessible even to non‑ML experts. In this paper, we present various methods for representing protein 3D structures from low‑ to high‑resolution, and show how interpretable ML methods can support tasks such as predicting protein structures, protein function, and protein‑protein interactions. This survey also emphasizes the significance of interpreting and visualizing ML‑based inference for structure‑based protein representations that enhance interpretability and knowledge discovery. Developing such interpretable approaches promises to further accelerate fields including drug development and protein design.
Authors: Lucas Squillante, Isys F. Mello, Luciano S. Ricco, Marcos F. Minicucci, Aniekan Magnus Ukpong, Antonio C. Seridonio, Roberto E. Lagos-Monaco, Mariano de Souza
Abstract: Protein compartmentalization in the frame of a liquid‑liquid phase separation is a key mechanism to optimize spatiotemporal control of biological systems. Such a compartmentalization process reduces the intrinsic noise in protein concentration due to stochasticity in gene expression. Employing Flory‑Huggins solution theory, Avramov/Casalini's model, and the Grüneisen parameter, we propose, for the first time, a cellular Griffiths‑like phase (CGLP), which can impact cellular functionality and self‑organization. The proposed CGLP is relevant to questions ranging from the evolution of primary organisms to the treatment of diseases. Our findings pave the way for an alternative biophysical approach to investigate coacervation processes.
Authors: Kouhei Okitsu
Abstract: The behavior of X‑rays diffracted in a perfect or quasi‑perfect crystal can be described by the dynamical theory of X‑ray diffraction. The study of two‑beam cases, in which only the transmitted beam and one reflected X‑ray beam are strong, has a history of one hundred years. However, few researchers study the multiple‑beam cases (n‑beam cases), in which more than two beams are simultaneously strong. The present author has derived the Takagi‑Taupin (T‑T) dynamical theory that can be applied to the n‑beam cases, coded computer programs to solve it, and verified them experimentally using synchrotron X‑rays. The equivalence between the Ewald‑Laue (E‑L) and the T‑T dynamical theories, described by the Fourier transform, is explicitly verified in the present paper for the n‑beam cases as well. Further, the methods of the computer simulations and the experiments are also described.
Furthermore, a hypothesis concerning the excessively large R‑factor values in protein crystallography is described. This might be extremely important in protein crystallography in the future.
Authors: Saeed Omidi, Gianluca Fabi, Xiaopeng Wang, James C. M. Hwang, Yevgeny Berdichevsky
Abstract: Intracellular processes triggered by neural activity include changes in ionic concentrations, protein release, and synaptic vesicle cycling. These processes play significant roles in neurological disorders. The beneficial effects of brain stimulation may also be mediated through intracellular changes. There is a lack of label‑free techniques for monitoring activity‑dependent intracellular changes. Electromagnetic (EM) waves at frequencies above 1x10^6 Hz (1 MHz) were previously used to probe the intracellular contents of cells, as the cell membrane becomes transparent in this frequency range. EM waves interact with membranes of intracellular organelles, proteins, and water in the MHz‑GHz range. In this work, we developed a device for probing the interaction between intracellular contents of active neurons and EM waves. The device used an array of grounded coplanar waveguides (GCPWs) to deliver EM waves to a three‑dimensional (3D) spheroid of rat cortical neurons. Neural activity was evoked using optogenetics, with synchronous detection of propagation of EM waves. Broadband measurements were conducted in the MHz‑GHz range to track changes in transmission coefficients. Neuronal activity was found to reversibly alter EM wave transmission. Pharmacological suppression of neuronal activity abolished changes in transmission. Time constants of changes in transmission were in the range of seconds to tens of seconds, suggesting the presence of relatively slow, activity‑dependent intracellular processes. This study provides the first evidence that EM transmission through neuronal tissue is activity‑dependent in the MHz‑GHz range. The device developed in this work may find future applications in studies of the mechanisms of neurological disorders and the development of new therapies.
Authors: Vahid Nateghi, Feliks Nüske
Abstract: In this paper, we show how kernel‑based models for the Koopman generator (the gEDMD method) can be used to identify coarse‑grained dynamics on reduced variables, which retain the slowest transition timescales of the original dynamics. The centerpiece of this study is a learning method to identify an effective diffusion in coarse‑grained space, which is similar in spirit to the force matching method. By leveraging the gEDMD model for the Koopman generator, the kinetic accuracy of the CG model can be evaluated. By combining this method with a suitable learning method for the effective free energy, such as force matching, a complete model for the effective dynamics can be inferred. Using a two‑dimensional model system and molecular dynamics simulation data of alanine dipeptide and the Chignolin mini‑protein, we demonstrate that the proposed method successfully and robustly recovers the essential kinetic and also thermodynamic properties of the full model. The parameters of the method can be determined using standard model validation techniques.
Authors: Jiaxing Yang
Abstract: Structural prediction has long been considered critical in RNA research, especially following the success of AlphaFold2 in protein studies, which has drawn significant attention to the field. While recent advances in machine learning and data accumulation have effectively addressed many biological tasks, particularly in protein‑related research, RNA structure prediction remains a significant challenge due to data limitations. Obtaining RNA structural data is difficult because traditional methods such as nuclear magnetic resonance spectroscopy, X‑ray crystallography, and electron microscopy are expensive and time‑consuming. Although several RNA 3D structure prediction methods have been proposed, their accuracy is still limited. Predicting RNA structural information at another level, such as distance maps, remains highly valuable. Distance maps provide a simplified representation of spatial constraints between nucleotides, capturing essential relationships without requiring a full 3D model. This intermediate level of structural information can guide more accurate 3D modeling and is computationally less intensive, making it a useful tool for improving structural predictions. In this work, we demonstrate that using only primary sequence information, we can accurately infer the distances between RNA bases by utilizing a large pretrained RNA language model coupled with a well‑trained downstream transformer.
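For readers unfamiliar with the prediction target, a distance map is simply the matrix of pairwise distances between per‑nucleotide positions. The sketch below computes one from known coordinates. It illustrates the data structure only, not the prediction model in this abstract, and it assumes one representative 3D point per nucleotide (the choice of atom is left to the caller). The function name `distance_map` is hypothetical.

```python
import math


def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances.

    `coords` is a list of (x, y, z) tuples, one representative point per
    nucleotide (which atom to use is an assumption left to the caller).
    """
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dmap[i][j] = d
            dmap[j][i] = d
    return dmap
```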
Authors: Antonia Winter, Yuhao Liu, Alexander Ziepke, George Dadunashvili, Erwin Frey
Abstract: The self‑organization of proteins into enriched compartments and the formation of complex patterns are crucial processes for life on the cellular level. Liquid‑liquid phase separation is one mechanism for forming such enriched compartments. When phase‑separating proteins are membrane‑bound and locally disturb the membrane, the mechanical response of the membrane mediates interactions between these proteins. How these membrane‑mediated interactions influence the steady state of the protein density distribution is thus an important question to investigate in order to understand the rich diversity of protein and membrane‑shape patterns present at the cellular level. This work starts with a widely used model for membrane‑bound phase‑separating proteins. We numerically solve our system to map out its phase space and perform a careful, systematic expansion of the model equations to characterize the phase transitions through linear stability analysis and free energy arguments. We observe that the membrane‑mediated interactions, due to their long‑range nature, are capable of qualitatively altering the equilibrium state of the proteins. This leads to arrested coarsening and length‑scale selection instead of simple demixing and complete coarsening. In this study, we unambiguously show that long‑range membrane‑mediated interactions lead to pattern formation in a system that otherwise would not do so. This work provides a basis for further systematic study of membrane‑bound pattern‑forming systems.
Authors: Mohammed A. Al-Qadasi, Samantha M. Grist, Matthew Mitchell, Karyn Newton, Stephen Kioussis, Sheri J. Chowdhury, Avineet Randhawa, Yifei Liu, Piramon Tisapramotkul, Karen C. Cheung, Lukas Chrostowski, Sudip Shekhar
Abstract: Decentralized diagnostic testing that is accurate, portable, quantitative, and capable of making multiple simultaneous measurements of different biomarkers at the point‑of‑need remains an important unmet need in the post‑pandemic world. Resonator‑based biosensors using silicon photonic integrated circuits are a promising technology to meet this need, as they can leverage (1) semiconductor manufacturing economies of scale, (2) exquisite optical sensitivity, and (3) the ability to integrate tens to hundreds of sensors on a millimeter‑scale photonic chip. However, their application to decentralized testing has historically been limited by the expensive, bulky tunable lasers and alignment optics required for their readout. In this work, we introduce a segmented sensor architecture that addresses this important challenge by facilitating resonance‑tracking readout using a fixed‑wavelength laser. The architecture incorporates an in‑resonator phase shifter modulated by CMOS drivers to periodically sweep and acquire the resonance peak shifts as well as a distinct high‑sensitivity sensing region, maintaining high performance at a fraction of the cost and size. We show, for the first time, that fixed‑wavelength sensor readout can offer similar performance to traditional tunable laser readout, demonstrating a system limit of detection of 6.1 x 10^-5 RIU as well as immunoassay‑based detection of the SARS‑CoV‑2 spike protein. We anticipate that this sensor architecture will open the door to a new data‑rich class of portable, accurate, multiplexed diagnostics for decentralized testing.
Authors: Furio Surfaro, Ralph Maier, Kai-Florian Pastryk, Fajun Zhang, Frank Schreiber, Roland Roth
Abstract: The osmotic second virial coefficient B2 is an important parameter to describe the interactions and phase behavior of protein solutions, including colloidal systems and macromolecular solutions. Another key parameter to describe the driving force of the nucleation of a new phase is the supersaturation, which is used in the classical nucleation theory framework and is connected with the favorable contribution in the Gibbs free energy in the bulk solution. In this article, we establish a connection between B2 calculated from small‑angle X‑ray scattering (SAXS) data and the values of B2 obtained from supersaturation measurements using thermodynamic considerations. The values of the second virial coefficient calculated employing this method agree with those determined via SAXS in the region near the liquid‑liquid phase separation border for human serum albumin and bovine serum albumin. The general relations adopted are shown to be useful for the estimation of the second virial coefficient B2 for globular proteins, in the proximity of the binodal biphasic coexistent region.
Authors: Bohao Xu, Yingzhou Lu, Yoshitaka Inoue, Namkyeong Lee, Tianfan Fu, Jintai Chen
Abstract: Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein‑Mamba, a novel two‑stage model that leverages both self‑supervised learning and fine‑tuning to improve protein function prediction. The pre‑training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine‑tuning stage refines these insights using specific labeled datasets, resulting in superior prediction performance. Our extensive experiments demonstrate that Protein‑Mamba achieves competitive performance, compared with a couple of state‑of‑the‑art methods across a range of protein function datasets. This model's ability to effectively utilize both unlabeled and labeled data highlights the potential of self‑supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.
Authors: Thanh Son Do, Daniel B. Hier, Tayo Obafemi-Ajayi
Abstract: This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers. Higher prevalence of ontology IDs in the biomedical literature correlated with higher mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship.
In contrast, this pattern did not apply to mapping protein names to Human Genome Organisation's (HUGO) gene symbols. GPT‑4 achieved a high baseline performance (95%) in mapping protein names to HUGO gene symbols, with mapping accuracy unaffected by prevalence. We propose that the high prevalence of HUGO gene symbols in the literature has caused these symbols to become lexicalized, enabling GPT‑4 to map protein names to HUGO gene symbols with high accuracy. These findings highlight the limitations of LLMs in mapping ontology terms to low‑prevalence ontology IDs and underscore the importance of incorporating ontology ID prevalence into the training and evaluation of LLMs for biomedical applications.
Authors: Jens Weimar, Frank Hirschmann, Martin Oettel
Abstract: Colloidal model systems are successful in rationalizing emergent phenomena like aggregation, rheology and phase behaviour of protein solutions. Colloidal theory in conjunction with isotropic interaction models is often employed to estimate the stability of such solutions. In particular, a universal criterion for the reduced second virial coefficient at the critical point B_2^* is frequently invoked which is based on the behavior of short‑range attractive fluids (Noro‑Frenkel rule, B_2^* ≈ -1.5). However, if anisotropic models for the protein‑protein interaction are considered, e.g. the Kern‑Frenkel (KF) patchy particle model, the value of the B_2^* criterion is shifted to lower values and explicitly depends on the number of patches. If an explicit shape anisotropy is considered, as e.g. in a coarse‑grained protein model, the normalization of B_2^* becomes ambiguous to some extent, as no unique exclusion volume can be defined anymore. Here, we investigate a low‑resolution, coarse‑grained model for the globular protein bovine serum albumin (BSA) and study effects of charge‑anisotropy on the phase diagram (determined by simulations) at the isoelectric point. We present methods of assigning an "effective patchiness" to our protein model by comparing its critical properties to the KF model. We find that doubling the native charges increases the critical temperature T_c by ≈ 14% and that our BSA model can be compared to a 3 to 5 patch KF model. Finally, we argue that applying existing B_2^* criteria from colloidal theory should be done with care, due to multiple, physically plausible ways of how to assign effective diameters to shape‑anisotropic models.
Authors: Roman Joeres, Daniel Bojar
Abstract: Glycans are the most complex biological sequence, with monosaccharides forming extended, non‑linear sequences. As post‑translational modifications, they modulate protein structure, function, and interactions. Due to their diversity and complexity, predictive models of glycan properties and functions are still insufficient.
Graph Neural Networks (GNNs) are deep learning models designed to process and analyze graph‑structured data. These architectures leverage the connectivity and relational information in graphs to learn effective representations of nodes, edges, and entire graphs. Iteratively aggregating information from neighboring nodes, GNNs capture complex patterns within graph data, making them particularly well‑suited for tasks such as link prediction or graph classification across domains.
This work presents a new model architecture based on combinatorial complexes and higher‑order message passing to extract features from glycan structures into a latent space representation. The architecture is evaluated on an improved GlycanML benchmark suite, establishing a new state‑of‑the‑art performance. We envision that these improvements will spur further advances in computational glycosciences and reveal the roles of glycans in biology.
Authors: Sumukh Pinge, Weihong Xu, Wout Bittremieux, Niema Moshiri, Sang-Woo Jun, Tajana Rosing
Abstract: Mass spectrometry (MS) is essential for protein analysis but faces significant challenges with large datasets and complex post‑translational modifications, resulting in difficulties in spectral identification. Open Modification Search (OMS) improves the analysis of these modifications. We present RapidOMS, a solution leveraging the Samsung SmartSSD, which integrates SSD and FPGA in a near‑storage configuration to minimize data movement and enhance the efficiency of large‑scale database searching. RapidOMS employs hyperdimensional computing (HDC), a brain‑inspired, high‑dimensional data processing approach, exploiting the parallel processing and low‑latency capabilities of FPGAs, making it well‑suited for MS. Utilizing the parallelism and efficiency of bitwise operations in HDC, RapidOMS delivers up to a 60x speedup over the state‑of‑the‑art (SOTA) CPU tool ANN‑Solo and is 2.72x faster than the GPU tool HyperOMS. Furthermore, RapidOMS achieves an 11x improvement in energy efficiency compared to conventional systems, providing scalable, energy‑efficient solutions for large‑scale proteomics applications and advancing the efficient processing of proteomic data.
Authors: James Michels, Ramya Bandarupalli, Amin Ahangar Akbari, Thai Le, Hong Xiao, Jing Li, Erik F. Y. Hom
Abstract: Recent advances in Natural Language Processing (NLP) have ignited interest in developing effective methods for predicting protein‑ligand interactions (PLIs) given their relevance to drug discovery and protein engineering efforts and the ever‑growing volume of biochemical sequence and structural data available. The parallels between human languages and the "languages" used to represent proteins and ligands have enabled the use of NLP machine learning approaches to advance PLI studies. In this review, we explain where and how such approaches have been applied in the recent literature and discuss useful mechanisms such as long short‑term memory, transformers, and attention. We conclude with a discussion of the current limitations of NLP methods for the study of PLIs as well as key challenges that need to be addressed in future work.
Authors: Julia Buhmann, Ward Haddadin, Lukáš Pravda, Alan Bilsland, Hagen Triendl
Abstract: Predicting protein‑ligand binding affinity is an essential part of computer‑aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low data regimes. Despite the evolution of model architectures, current benchmarks are not well‑suited to probe the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset, minimizing similarity leakage between train and test sets and allowing for a fair and direct comparison between various model architectures. On this low similarity split, we demonstrate that, in general, 3D global models are superior to protein‑specific local models in low data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre‑training via quantum mechanical data, unsupervised pre‑training via small molecule diffusion, and explicitly modeling hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.
Authors: Amir Khosravanizadeh, Serge Dmitrieff
Abstract: We have used numerical simulations to investigate how the properties of motor proteins control the dynamical behavior of a driven flexible filament. The filament is pinned at one end and positioned on top of a patch of anchored motor proteins, a setup commonly referred to as a spiral gliding assay. In nature, there is a variety of motor proteins with different properties. In this study, we have investigated the role of detachment rate, detachment force, stall force, and unloaded speed of motors on the dynamical behavior of the filament. We found that this system generally can show three different regimes: 1) Fluctuation, where the filament undergoes random fluctuations because the motors are unable to bend it. 2) Rotation, in which the filament bends and then moves continuously in one direction. 3) Beating, where the filament's direction of rotation changes over time. We found that the transition between fluctuation and rotation occurs when motors exert a force sufficient to buckle the filament. The threshold force coincides with the second buckling mode of a filament under a continuously distributed load. Moreover, we showed that when motors near the pinning point work close to their stall force, they get stuck and act as a second pin, leading to the beating regime.
Authors: Lara Callea, Camilla Caprai, Laura Bonati, Toni Giorgino, Stefano Motta
Abstract: The interpretation of ligand‑target interactions at atomistic resolution is central to most efforts in computational drug discovery and optimization. However, the highly dynamic nature of protein targets, as well as possible induced fit effects, makes it difficult to sample many interactions effectively with docking studies or even with large‑scale molecular dynamics (MD) simulations. We propose a novel application of Self‑Organizing Maps (SOM) to address the sampling and dynamic mapping tasks, particularly in cases involving ligand flexibility and induced fit. The SOM approach offers a data‑driven strategy to create a map of the interaction process and pathways based on unbiased MD. Furthermore, we show how the preliminary SOM mapping is complementary to kinetic analysis, both with the employment of network‑based approaches and Markov State Models (MSM). We demonstrate the method by comprehensively mapping a large dataset of 640 μs of unbiased trajectories sampling the recognition process between the phosphorylated YEEI peptide and its high‑specificity target Lck‑SH2. The integration of SOM into unbiased simulation protocols significantly advances our understanding of the ligand binding mechanism. This approach serves as a potent tool for mapping intricate ligand‑target interactions with unprecedented detail, thereby enhancing the characterization of kinetic properties crucial to drug design.
Authors: Kairi Furui, Masahito Ohue
Abstract: Accurate prediction and optimization of protein‑protein binding affinity is crucial for therapeutic antibody development. Although machine learning‑based ΔΔG prediction methods are suitable for large‑scale mutant screening, they struggle to predict the effects of multiple mutations for targets without existing binders. Energy function‑based methods, though more accurate, are time‑consuming and not ideal for large‑scale screening. To address this, we propose an active learning workflow that efficiently trains a deep learning model to learn energy functions for specific targets, combining the advantages of both approaches. Our method integrates the RDE‑Network deep learning model with Rosetta's energy function‑based Flex ddG to efficiently explore mutants. In a case study targeting HER2‑binding Trastuzumab mutants, our approach significantly improved the screening performance over random selection and demonstrated the ability to identify mutants with better binding properties without experimental ΔΔG data. This workflow advances computational antibody design by combining machine learning, physics‑based computations, and active learning to achieve more efficient antibody development.
Authors: Binghao Yan, Yunbi Nam, Lingyao Li, Rebecca A. Deek, Hongzhe Li, Siyuan Ma
Abstract: Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein/genomic language modeling and their contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.
Authors: Weida Liao, Eric Lauga
Abstract: Cytoplasmic streaming, the persistent flow of fluid inside a cell, induces intracellular transport, which plays a key role in fundamental biological processes. In meiosis II mouse oocytes (developing egg cells) awaiting fertilisation, the spindle, which is the protein structure responsible for dividing genetic material in a cell, must maintain its position near the cell cortex (the thin actin network bound to the cell membrane) for many hours. However, the cytoplasmic streaming that accompanies this stable positioning would intuitively appear to destabilise the spindle position. Here, through a combination of numerical and analytical modelling, we reveal a new, hydrodynamic mechanism for stable spindle positioning beneath the cortical cap. We show that this stability depends critically on the spindle size and the active driving from the cortex, and demonstrate that stable spindle positioning can result purely from a hydrodynamic suction force exerted on the spindle by the cytoplasmic flow. Our findings show that local fluid dynamic forces can be sufficient to stabilise the spindle, explaining robustness against perturbations not only perpendicular but also parallel to the cortex. Our results shed light on the importance of cytoplasmic streaming in mammalian meiosis.
Authors: Jonathan R. Church, Ofir Blumer, Tommer D. Keidar, Leo Ploutno, Shlomi Reuveni, Barak Hirshberg
Abstract: We present a procedure for enhanced sampling of molecular dynamics simulations through informed stochastic resetting. Many phenomena, such as protein folding and crystal nucleation, occur over time scales that are inaccessible in standard simulations. We recently showed that stochastic resetting can accelerate molecular simulations that exhibit broad transition time distributions. However, standard stochastic resetting does not exploit any information about the reaction progress. For a model system and chignolin in explicit water, we demonstrate that an informed resetting protocol leads to greater accelerations than standard stochastic resetting in molecular dynamics and Metadynamics simulations. This is achieved by resetting only when a certain condition is met, e.g., when the distance from the target along the reaction coordinate is larger than some threshold. We use these accelerated simulations to infer important kinetic observables such as the unbiased mean first‑passage time and direct transit time. For the latter, Metadynamics with informed resetting leads to speedups of 2‑3 orders of magnitude over unbiased simulations with relative errors of only ~35‑70%. Our work significantly extends the applicability of stochastic resetting for enhanced sampling of molecular simulations.
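The difference between standard and informed resetting can be sketched on a toy problem. This is a discrete 1D random walk, not the molecular dynamics or Metadynamics setup of this abstract. The function name `first_passage_time`, its parameters, and the choice of target at x = 0 are illustrative assumptions.

```python
import random


def first_passage_time(rng, start=5, reset_prob=0.05, threshold=None,
                       max_steps=10**6):
    """Steps for an unbiased 1D random walk to first reach the target at x = 0.

    With probability `reset_prob` per step a reset is attempted. If `threshold`
    is None every attempt succeeds (standard resetting); otherwise the walker
    is returned to `start` only when it is farther than `threshold` from the
    target (informed resetting). Returns `max_steps` if the target is not hit.
    """
    x = start
    for step in range(1, max_steps + 1):
        x += rng.choice((-1, 1))
        if x == 0:
            return step
        if rng.random() < reset_prob and (threshold is None or abs(x) > threshold):
            x = start
    return max_steps
```

Averaging `first_passage_time` over many seeds, once with `threshold=None` and once with, for example, `threshold=start`, gives a rough way to compare the two protocols on this toy walk. The informed variant never resets a walker that is already closer to the target than its starting point.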
Authors: Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
Abstract: This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three‑dimensional structures to support a wide range of functions. We utilize pretrained BERT‑type models to encode raw RNA sequences into token‑level, biologically meaningful representations. A Query Transformer is employed to compress such representations into a set of fixed‑length latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models (surrogates for RNA functional properties) into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNAdiffusion generates non‑coding RNAs that align with natural distributions across various biological metrics. Further, we fine‑tune the diffusion model on mRNA 5' untranslated regions (5'‑UTRs) and optimize sequences for high translation efficiencies. Our guided diffusion model effectively generates diverse 5'‑UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing the trade‑off between rewards and structural stability. Our findings hold potential for advancing RNA sequence‑function research and therapeutic RNA design.
Authors: Mohammed Aledhari, Mohamed Rahouti
Abstract: Gene and RNA editing methods, technologies, and applications are emerging as innovative forms of therapy and medicine, offering more efficient implementation compared to traditional pharmaceutical treatments. Current trends emphasize the urgent need for advanced methods and technologies to detect public health threats, including diseases and viral agents. Gene and RNA editing techniques enhance the ability to identify, modify, and ameliorate the effects of genetic diseases, disorders, and disabilities. Viral detection and identification methods present numerous opportunities for enabling technologies, such as CRISPR, applicable to both RNA and gene editing through the use of specific Cas proteins. This article explores the distinctions and benefits of RNA and gene editing processes, emphasizing their contributions to the future of medical treatment. CRISPR technology, particularly its adaptation via the Cas13 protein for RNA editing, is a significant advancement in gene editing. The article will delve into RNA and gene editing methodologies, focusing on techniques that alter and modify genetic coding. A‑to‑I and C‑to‑U editing are currently the most predominant methods of RNA modification. CRISPR stands out as the most cost‑effective and customizable technology for both RNA and gene editing. Unlike permanent changes induced by cutting an individual's DNA genetic code, RNA editing offers temporary modifications by altering nucleoside bases in RNA strands, which can then attach to DNA strands as temporary modifiers.
Authors: Tomasz Bednarek, Jakub Jędrak
Abstract: In small systems, quantitative discrepancies between stochastic and deterministic descriptions of chemical kinetics can be significant, with their magnitude depending on the specific reaction network. Here, we study the Finke‑Watzky model, an irreversible autocatalysis, A + B → 2B, supplemented by an irreversible first‑order process, A → B. This model has been used to describe the formation of transition metal nanoparticles and protein misfolding and aggregation, but it may also serve as a minimal model for the spread of a non‑fatal but incurable disease. We show that, for certain parameter values, exceptionally large deviations can arise between stochastic and deterministic kinetics of the Finke‑Watzky model. Moreover, its stochastic time evolution may be highly sensitive to initial conditions. These properties are retained in the generalization of the model to reversible reactions. To quantify the differences between the predictions of deterministic and stochastic kinetics, we derive the explicit analytical solution of the Chemical Master Equation for the Finke‑Watzky model. This solution also allows us to derive analogous solutions for two related reaction networks: A + A → A + B, A → B, and A + A → A + B, A + B → 2B. Our findings may have implications for modeling epidemics and intracellular chemical processes, and more broadly for models of population dynamics.
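To make the stochastic versus deterministic comparison concrete, here is a minimal sketch of the two irreversible reactions named in the abstract. It uses a standard Gillespie simulation, not the analytical Chemical Master Equation solution the authors derive, and it assumes rate constants expressed per molecule (no volume scaling). The function names and parameter values are hypothetical.

```python
import math
import random


def gillespie_finke_watzky(n_a, n_b, k1, k2, rng):
    """Exact stochastic trajectory for A -> B (k1) and A + B -> 2B (k2).

    Both reactions convert one A into one B, so only the waiting time depends
    on the rates; the total propensity is n_a * (k1 + k2 * n_b).
    Returns a list of (time, n_a, n_b) tuples ending when A is exhausted.
    """
    t = 0.0
    trajectory = [(t, n_a, n_b)]
    while n_a > 0:
        propensity = n_a * (k1 + k2 * n_b)
        if propensity <= 0.0:
            break  # no reaction can fire (k1 == 0 and no B present)
        t += rng.expovariate(propensity)  # exponential waiting time
        n_a -= 1
        n_b += 1
        trajectory.append((t, n_a, n_b))
    return trajectory


def deterministic_a(t, a0, b0, k1, k2):
    """Closed-form solution of da/dt = -a * (k1 + k2 * (a0 + b0 - a)).

    Requires k1 + k2 * (a0 + b0) > 0; very large t can overflow math.exp.
    """
    c = k1 + k2 * (a0 + b0)
    return c * a0 / (k2 * a0 + (c - k2 * a0) * math.exp(c * t))


# Illustrative parameters only: 50 molecules of A, no B, slow nucleation.
trajectory = gillespie_finke_watzky(50, 0, k1=0.001, k2=0.01, rng=random.Random(42))
completion_time = trajectory[-1][0]
print("stochastic completion time:", completion_time)
print("deterministic A at that time:", deterministic_a(completion_time, 50, 0, 0.001, 0.01))
```

Repeating the stochastic run with different seeds gives a spread of completion times, while the deterministic curve only approaches zero asymptotically; this is the kind of discrepancy the abstract refers to for small systems.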
Authors: Jiri Käser, Kai Töpfer, Markus Meuwly
Abstract: The diffusional dynamics and vibrational spectroscopy of molecular hydrogen (H_2) in myoglobin (Mb) are characterized. Hydrogen has been implicated in a number of physiologically relevant processes, including cellular aging and inflammation. Here, the internal diffusion through the protein matrix was characterized and the vibrational spectroscopy was investigated using conventional empirical energy functions and improved models able to describe higher‑order electrostatic moments of the ligand. H_2 can occupy the same internal defects as already found for Xe or CO (Xe1 to Xe4 and B‑state). Furthermore, 4 additional sites were found, some of which had been discovered in earlier simulation studies. The vibrational spectra using the most refined energy function indicate that the spectroscopy of H_2 differs depending on the docking site. The maxima of the absorption spectra span ~20 cm^‑1, which indicates a pronounced effect of the surrounding protein matrix on the vibrational spectroscopy of the ligand. Electronic structure calculations show that H_2 forms a stable complex with the heme‑iron (stabilized by ~ ‑12 kcal/mol) but splitting of H_2 is unlikely due to a high activation energy (~ 50 kcal/mol).
Authors: Andreas Plesner, Hans Henrik Brandenborg Sørensen, Søren Hauberg
Abstract: Bessel functions are critical in scientific computing for applications such as machine learning, protein structure modeling, and robotics. However, currently available routines lack precision or fail for certain input ranges, such as when the order v is large, and GPU‑specific implementations are limited. We address the precision limitations of current numerical implementations while dramatically improving the runtime. We propose two novel algorithms for computing the logarithm of modified Bessel functions of the first and second kinds by computing intermediate values on a logarithmic scale. Our algorithms are robust and never have issues with underflows or overflows while having relative errors on the order of machine precision, even for inputs where existing libraries fail. In C++/CUDA, our algorithms have median and maximum speedups of 45x and 6150x for GPU and 17x and 3403x for CPU, respectively, over the ranges of inputs and third‑party libraries tested. Compared to SciPy, the algorithms have median and maximum speedups of 77x and 300x for GPU and 35x and 98x for CPU, respectively, over the tested inputs.
The ability to robustly compute a solution and the low relative errors allow us to fit von Mises‑Fisher, vMF, distributions to high‑dimensional neural network features. This is, e.g., relevant for uncertainty quantification in metric learning. We obtain image feature data by processing CIFAR10 training images with the convolutional layers of a pre‑trained ResNet50. We successfully fit vMF distributions to 2048‑, 8192‑, and 32768‑dimensional image feature data using our algorithms. Our approach provides fast and accurate results while existing implementations in SciPy and mpmath fail to fit successfully.
Our approach is readily implementable on GPUs, and we provide a fast open‑source implementation alongside this paper.
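The idea of keeping intermediate values on a logarithmic scale can be illustrated with the textbook power series for the modified Bessel function of the first kind, I_v(x) = sum_k (x/2)^(2k+v) / (k! Γ(k+v+1)). The sketch below is not the authors' algorithm; it is a plain log-sum-exp evaluation of that series using only the Python standard library, and the function name and stopping tolerance are hypothetical. It is practical only for moderate x, since the number of terms needed grows with x.

```python
import math


def log_bessel_i(v, x, drop=60.0, max_terms=100_000):
    """log I_v(x) for v >= 0 and x > 0, summing the power series in log space.

    Each term is stored as its logarithm, so large orders do not underflow.
    The log-terms are unimodal in k, so summation stops once a term falls
    `drop` below the largest term seen.
    """
    if x <= 0.0 or v < 0.0:
        raise ValueError("this sketch requires v >= 0 and x > 0")
    log_half_x = math.log(x / 2.0)
    log_terms = []
    peak = -math.inf
    for k in range(max_terms):
        term = 2 * k * log_half_x - math.lgamma(k + 1) - math.lgamma(k + v + 1)
        log_terms.append(term)
        if term > peak:
            peak = term
        elif term < peak - drop:
            break
    # log-sum-exp: factor out the peak before exponentiating
    scaled_sum = sum(math.exp(term - peak) for term in log_terms)
    return v * log_half_x + peak + math.log(scaled_sum)


# I_1000(1) is far below the smallest positive double, but its log is finite.
print(log_bessel_i(1000.0, 1.0))
```

A quick sanity check is the closed form I_{1/2}(x) = sqrt(2 / (πx)) sinh(x), which the series reproduces for moderate x.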
Authors: Jinghui Liu, Tom Burkart, Alexander Ziepke, John Reinhard, Yu-Chen Chao, Tzer Han Tan, S. Zachary Swartz, Erwin Frey, Nikta Fakhri
Abstract: Chemo‑mechanical waves on active deformable surfaces are a key component for many vital cellular functions. In particular, these waves play a major role in force generation and long‑range signal transmission in cells that dynamically change shape, as encountered during cell division or morphogenesis. Reconstituting and controlling such chemically controlled cell deformations is a crucial but unsolved challenge for the development of synthetic cells. Here, we develop an optogenetic method to elucidate the mechanism responsible for coordinating surface contraction waves that occur in oocytes of the starfish Patiria miniata during meiotic cell division. Using spatiotemporally‑patterned light stimuli as a control input, we create chemo‑mechanical cortical excitations that are decoupled from meiotic cues and drive diverse shape deformations ranging from local pinching to surface contraction waves and cell lysis. We develop a quantitative model that entails the hierarchy of chemical and mechanical dynamics, which allows us to relate the variety of mechanical responses to optogenetic stimuli. Our framework systematically predicts and explains transitions of programmed shape dynamics. Finally, we qualitatively map the observed shape dynamics to elucidate how the versatility of intracellular protein dynamics can give rise to a broad range of mechanical phenomenologies. More broadly, our results pave the way toward real‑time control over dynamical deformations in living organisms and can advance the design of synthetic cells and life‑like cellular functions.
Authors: Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao
Abstract: Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand‑based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein‑binding site information. Obtaining binding affinity data for complexes is highly expensive, resulting in a limited amount of available data that covers a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is challenging to identify an inductive bias that consistently maintains the integrity of molecular activity during data augmentation. To tackle these challenges, we propose S‑MolSearch, to our knowledge the first framework that leverages molecular 3D information and affinity information in semi‑supervised contrastive learning for ligand‑based virtual screening. Drawing on the principles of inverse optimal transport, S‑MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S‑MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S‑MolSearch demonstrates superior performance on the widely used benchmarks LIT‑PCBA and DUD‑E. It surpasses both structure‑based and ligand‑based virtual screening methods in AUROC, BEDROC, and EF.
Authors: Mohamed Dhouioui, Jonathan Barnoud, Rhoslyn Roebuck Williams, Harry J. Stroud, Phil Bates, David R. Glowacki
Abstract: Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD‑VR) has recently emerged as a "human‑in‑the‑loop" strategy for efficiently navigating hyper‑dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real‑time molecular simulations running on high‑performance computing architectures, iMD‑VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high‑dimensional molecular systems. Moreover, iMD‑VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher‑generated iMD‑VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and multi‑agent systems domains which are comparable to iMD‑VR, and discuss how iMD‑VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof‑of‑principle study where iMD‑VR data was used to train a CNN on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
Authors: Yonglei Yang, Zihui Liu, Fulu Zheng, Panpan Zhang, Hongxing He, Ajay Jha, Hong-Guang Duan
Abstract: The evolution of photosynthetic reaction centers (RCs) from anoxygenic bacteria to oxygenic cyanobacteria and plants reflects their structural and functional adaptation to environmental conditions. Chirality plays a significant role in influencing the arrangement and function of key molecules in these RCs. This study investigates chirality‑related energy transfer in two distinct RCs: Thermochromatium tepidum (BRC) and Thermosynechococcus vulcanus (PSII RC) using two‑dimensional electronic spectroscopy (2DES). Circularly polarized laser pulses reveal transient chiral dynamics, with 2DCD spectroscopy highlighting chiral contributions. BRC displays more complex chiral behavior, while PSII RC shows faster coherence decay, possibly as an adaptation to oxidative stress. Comparing the chiral dynamics of BRC and PSII RC provides insights into photosynthetic protein evolution and function.
Authors: Fei Ye, Zaixiang Zheng, Dongyu Xue, Yuning Shen, Lihao Wang, Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou, Quanquan Gu
Abstract: Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi‑metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In‑depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we publicly release the evaluation dataset, code, a leaderboard for further analysis, and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in‑depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.
Authors: Yang Jiao, Hananeh Derakhshan, Barbara St. Pierre Schneider, Emma Regentova, Mei Yang
Abstract: White blood cells (WBCs) are the most diverse cell types observed in the healing process of injured skeletal muscles. In the course of healing, WBCs exhibit dynamic cellular response and undergo multiple protein expression changes. The progress of healing can be analyzed by quantifying the number of WBCs or the amount of specific proteins in light microscopic images obtained at different time points after injury. In this paper, we propose an automated quantifying and analysis framework to analyze WBCs using light microscopic images of uninjured and injured muscles. The proposed framework is based on the Localized Iterative Otsu's threshold method with muscle edge detection and region of interest extraction. Compared with the threshold methods used in ImageJ, the LI Otsu's threshold method has high resistance to background area and achieves better accuracy. The CD68‑positive cell results are presented for demonstrating the effectiveness of the proposed work.
Authors: Taslim Murad, Prakash Chourasia, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
Abstract: Cancer is a complex disease characterized by uncontrolled cell growth. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti‑cancer activity and enabling TCR‑based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information. T‑cell protein sequences pose unique challenges due to their relatively short lengths compared to other biomolecules. An image‑based representation approach becomes a preferred choice for efficient embeddings, allowing for the preservation of essential details and enabling comprehensive analysis of T‑cell protein sequences. In this paper, we propose to generate images from the protein sequences using the idea of Chaos Game Representation (CGR) via the Kaleidoscopic images approach. This Deep Learning Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. We perform the classification of the T cell receptor (TCR) protein sequences in terms of their respective target cancer cells, as TCRs are known for their immune response against cancer. The TCR sequences are converted into images using the DANCE method. We employ deep‑learning vision models to perform the classification to obtain insights into the relationship between the visual patterns observed in the generated kaleidoscopic images and the underlying protein properties. By combining CGR‑based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.
Authors: Xiaoxi Liu, Ying-Chieh Lai, Di Cui, Shiang-Cheng Kung, Meyeon Park, Laszik Zoltan, Peder E. Z. Larson, Zhen J. Wang
Abstract: BACKGROUND: Kidney transplant is the treatment of choice for patients with end‑stage renal disease. Early detection of allograft injury is important to delay or prevent irreversible damage. PURPOSE: To investigate the feasibility of hyperpolarized (HP) [1‑13C]pyruvate MRI for assessing kidney allograft metabolism. SUBJECTS: 6 participants (mean age, 45.2 ± 12.4 years, 2 females) scheduled for kidney allograft biopsy and 5 patients (mean age, 59.6 ± 10.4 years, 2 females) with renal cell carcinoma (RCC). ASSESSMENT: Five of the six kidney allograft participants underwent biopsy after MRI. Estimated glomerular filtration rate (eGFR) and urine protein‑to‑creatinine ratio (uPCR) were collected within 4 weeks of MRI. Kidney metabolism was quantified from HP [1‑13C]pyruvate MRI using the lactate‑to‑pyruvate ratio in allograft kidneys and non‑tumor bearing kidneys from RCC patients. RESULTS: Biopsy was performed a mean of 9 days (range 5‑19 days) after HP [1‑13C]pyruvate MRI. Three biopsies were normal, one showed low‑grade fibrosis and one showed moderate microvascular inflammation. All had stable functioning allografts with eGFR > 60 mL/min/1.73 m^2 and normal uPCR. One participant who did not undergo biopsy had reduced eGFR of 49 mL/min/1.73 m^2 and elevated uPCR. The mean lactate‑to‑pyruvate ratio was 0.373 in participants with normal findings (n = 3) and 0.552 in participants with abnormal findings (n = 2). The lactate‑to‑pyruvate ratio was highest (0.847) in the participant with reduced eGFR and elevated uPCR. Native non‑tumor bearing kidneys had a mean lactate‑to‑pyruvate ratio of 0.309. DATA CONCLUSION: Stable allografts with normal findings at biopsy showed lactate‑to‑pyruvate ratios similar to native non‑tumor bearing kidneys, whereas allografts with abnormal findings showed higher lactate‑to‑pyruvate ratios.
Authors: Jakub Rydzewski
Abstract: Understanding the behavior of complex molecular systems is a fundamental problem in physical chemistry. To describe the long‑time dynamics of such systems, which is responsible for their most informative characteristics, we can identify a few slow collective variables (CVs) while treating the remaining fast variables as thermal noise. This enables us to simplify the dynamics and treat it as diffusion in a free‑energy landscape spanned by slow CVs, effectively rendering the dynamics Markovian. Our recent statistical learning technique, spectral map [Rydzewski, J. Phys. Chem. Lett. 2023, 14, 22, 5216‑5220], explores this strategy to learn slow CVs by maximizing a spectral gap of a transition matrix. In this work, we introduce several advancements into our framework, using a high‑dimensional reversible folding process of a protein as an example. We implement an algorithm for coarse‑graining Markov transition matrices to partition the reduced space of slow CVs kinetically and use it to define a transition state ensemble. We show that slow CVs learned by spectral map closely approach the Markovian limit for an overdamped diffusion. We demonstrate that coordinate‑dependent diffusion coefficients only slightly affect the constructed free‑energy landscapes. Finally, we present how spectral map can be used to quantify the importance of features and compare slow CVs with structural descriptors commonly used in protein folding. Overall, we demonstrate that a single slow CV learned by spectral map can be used as a physical reaction coordinate to capture essential characteristics of protein folding.
Authors: Daniel M. Steinberg, Rafael Oliveira, Cheng Soon Ong, Edwin V. Bonilla
Abstract: We develop VSD, a method for conditioning a generative model of discrete, combinatorial designs on a rare desired class by efficiently evaluating a black‑box (e.g. experiment, simulation) in a batch sequential manner. We call this task active generation; we formalize active generation's requirements and desiderata, and formulate a solution via variational inference. VSD uses off‑the‑shelf gradient based optimization routines, can learn powerful generative models for desirable designs, and can take advantage of scalable predictive models. We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence‑design problems in various protein and DNA/RNA engineering tasks.
Authors: I. Mihalcescu, H. Kaji, H. Maruyama, J. Giraud, M. Van-Melle Gateau, B. Houchmandzadeh, H. Ito
Abstract: The in vivo circadian clock in single cyanobacteria is studied here by time‑lapse fluorescence microscopy when the temperature is lowered below 25°C. We first disentangle the circadian clock behavior from the bacterial cold shock response by identifying a sequence of "death steps" based on cellular indicators. By analyzing only "alive" tracks, we show that the dynamic response of individual oscillatory tracks to a step‑down temperature signal is described by a simple Stuart‑Landau oscillator model. The same dynamical analysis applied to in vitro data (KaiC phosphorylation level following a temperature step‑down) allows for extracting and comparing both clocks' responses to a temperature step‑down. It appears, therefore, that both oscillators go through a similar supercritical Hopf bifurcation. Finally, to quantitatively describe the temperature dependence of the resulting in vivo and in vitro Stuart‑Landau parameters μ(T) and ω_c(T), we propose two simplified analytical models: temperature‑dependent positive feedback or time‑delayed negative feedback that is temperature compensated. Our results provide strong constraints for future models and emphasize the importance of studying transitory regimes along with temperature effects in circadian systems.
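The supercritical Hopf bifurcation mentioned in this abstract can be seen directly in the Stuart‑Landau normal form. The sketch below assumes the simplest version, dz/dt = (μ + iω) z − |z|² z with a unit real cubic coefficient, which may differ from the parametrization used in the paper. The function names, the 24‑hour period, and the values of μ are hypothetical.

```python
import math


def stuart_landau_rhs(z, mu, omega):
    """Right-hand side of dz/dt = (mu + i*omega) * z - |z|^2 * z."""
    return (mu + 1j * omega) * z - abs(z) ** 2 * z


def integrate_stuart_landau(z0, mu, omega, dt=0.01, n_steps=10_000):
    """Integrate with classical fourth-order Runge-Kutta; returns final z."""
    z = complex(z0)
    for _ in range(n_steps):
        k1 = stuart_landau_rhs(z, mu, omega)
        k2 = stuart_landau_rhs(z + 0.5 * dt * k1, mu, omega)
        k3 = stuart_landau_rhs(z + 0.5 * dt * k2, mu, omega)
        k4 = stuart_landau_rhs(z + dt * k3, mu, omega)
        z += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return z


omega = 2.0 * math.pi / 24.0  # hypothetical 24-hour period, time in hours

# Above the bifurcation (mu > 0) the amplitude settles at sqrt(mu).
z_oscillating = integrate_stuart_landau(0.1, mu=0.25, omega=omega)
# Below it (mu < 0) the oscillation is damped toward zero amplitude.
z_damped = integrate_stuart_landau(0.1, mu=-0.25, omega=omega)

print("amplitude for mu = +0.25:", abs(z_oscillating))
print("amplitude for mu = -0.25:", abs(z_damped))
```

In this reading, a temperature step‑down corresponds to moving μ(T) and ω_c(T) to new values and watching the amplitude relax, either to a smaller limit cycle or to zero.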
Authors: Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson
Abstract: Molecular sequence analysis is crucial for comprehending several biological processes, including protein‑protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. The CCP technique is still costly to compute even though it is effective for sequence visualization. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP‑NN)‑based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence‑to‑sequence correlations. As opposed to conventional methods, CCP doesn't rely on matrix diagonalization; therefore, it can be applied to a range of machine‑learning problems. We estimate the density map and compute the correlation using a nearest‑neighbor search technique. We performed molecular sequence classification using CCP and CCP‑NN representations to assess the efficacy of our proposed approach. Our findings show that CCP‑NN considerably improves classification accuracy and significantly outperforms CCP in computational runtime.
Authors: Shrimon Mukherjee, Madhusudan Ghosh, Partha Basuchowdhuri
Abstract: Application of artificial intelligence (AI) has been ubiquitous in the growth of research in the areas of basic sciences. Frequent use of machine learning (ML) and deep learning (DL) based methodologies by researchers has resulted in significant advancements in the last decade. These techniques led to notable performance enhancements in different tasks such as protein structure prediction, drug‑target binding affinity prediction, and molecular property prediction. In material science literature, it is well‑known that crystalline materials exhibit topological structures. Such topological structures may be represented as graphs, and graph neural network (GNN) based approaches could help encode them into an augmented representation space. Primarily, such frameworks adopt supervised learning techniques targeted towards downstream property prediction tasks on the basis of electronic properties (formation energy, bandgap, total energy, etc.) and crystalline structures. Generally, such frameworks rely heavily on handcrafted atom feature representations along with the structural representations. In this paper, we propose an unsupervised framework, namely CrysAtom, that uses untagged crystal data to generate dense vector representations of atoms, which can be utilized in existing GNN‑based property predictor models to accurately predict important properties of crystals. Empirical results show that our dense representation embeds chemical properties of atoms and enhances the performance of the baseline property predictor models significantly.
Authors: Huma Perveen, Julie Weeds
Abstract: Purpose: This study aimed to enhance protein sequence classification using natural language processing (NLP) techniques while addressing the impact of sequence similarity on model performance. We compared various machine learning and deep learning models under two different data‑splitting strategies: random splitting and ECOD family‑based splitting, which ensures evolutionarily related sequences are grouped together. Methods: The study evaluated models such as K‑Nearest Neighbors (KNN), Multinomial Naïve Bayes, Logistic Regression, Multi‑Layer Perceptron (MLP), Decision Tree, Random Forest, XGBoost, Voting and Stacking classifiers, Convolutional Neural Network (CNN), Long Short‑Term Memory (LSTM), and transformer models (BertForSequenceClassification, DistilBERT, and ProtBERT). Performance was tested using different amino acid ranges and sequence lengths with a focus on generalization across unseen evolutionary families. Results: The Voting classifier achieved the highest performance with 74% accuracy, 74% weighted F1 score, and 65% macro F1 score under random splitting, while ProtBERT obtained 77% accuracy, 76% weighted F1 score, and 61% macro F1 score among transformer models. However, performance declined across all models when tested using ECOD‑based splitting, revealing the impact of sequence similarity on classification performance. Conclusion: Advanced NLP techniques, particularly ensemble methods like Voting classifiers, and transformer models show significant potential in protein classification, with sufficient training data and sequence similarity management being crucial for optimal performance. However, the use of biologically meaningful splitting methods, such as ECOD family‑based splitting, is crucial for realistic performance evaluation and generalization to unseen evolutionary families.
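The difference between random and family‑based splitting comes down to which unit is shuffled. The following is a minimal sketch of a group‑aware split, assuming each sequence carries a family label; it is not the authors' pipeline, and the function name, the greedy fill rule, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def family_split(family_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no family appears in both train and test.

    Whole families are shuffled and assigned to the test set until it holds
    roughly `test_fraction` of all samples; the rest go to train.
    Returns (train_indices, test_indices), each sorted.
    """
    members = defaultdict(list)
    for index, family in enumerate(family_ids):
        members[family].append(index)

    families = sorted(members)               # deterministic starting order
    random.Random(seed).shuffle(families)

    target_test_size = test_fraction * len(family_ids)
    train, test = [], []
    for family in families:
        if len(test) < target_test_size:
            test.extend(members[family])     # the whole family goes to test
        else:
            train.extend(members[family])    # the whole family goes to train
    return sorted(train), sorted(test)


# Toy labels standing in for ECOD family identifiers.
labels = ["f1"] * 5 + ["f2"] * 3 + ["f3"] * 4 + ["f4"] * 2 + ["f5"] * 6
train_idx, test_idx = family_split(labels, test_fraction=0.2, seed=0)
print("train families:", sorted({labels[i] for i in train_idx}))
print("test families:", sorted({labels[i] for i in test_idx}))
```

Because whole families move together, the realized test fraction only approximates the requested one, but the test set contains no family seen during training, which is the property the abstract identifies as necessary for realistic evaluation.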
Authors: Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen
Abstract: Accurately measuring protein‑RNA binding affinity is crucial in many biological processes and drug design. Previous computational methods for protein‑RNA binding affinity prediction rely on either sequence or structure features and are unable to capture the binding mechanisms comprehensively. Recently emerging pre‑trained language models trained on massive unsupervised sequences of protein and RNA have shown strong representation ability for various in‑domain downstream tasks, including binding site prediction. However, applying different‑domain language models collaboratively for complex‑level tasks remains unexplored. In this paper, we propose CoPRA to bridge pre‑trained language models from different biological domains via Complex structure for Protein‑RNA binding Affinity prediction. We demonstrate for the first time that cross‑biological modal language models can collaborate to improve binding affinity prediction. We propose a Co‑Former to combine the cross‑modal sequence and structure information and a bi‑scope pre‑training strategy for improving Co‑Former's interaction understanding. Meanwhile, we build the largest protein‑RNA binding affinity dataset PRA310 for performance evaluation. We also test our model on a public dataset for mutation effect prediction. CoPRA reaches state‑of‑the‑art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict the protein‑RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.
Authors: Elyssa Sliheet, Md Abu Talha, Weihua Geng
Abstract: In this project, we present a deep neural network (DNN)‑based biophysics model that uses multi‑scale and uniform topological and electrostatic features to predict protein properties, such as Coulomb energies or solvation energies. The topological features are generated using element‑specific persistent homology (ESPH) on a selection of heavy atoms or carbon atoms. The electrostatic features are generated using a novel Cartesian treecode, which adds underlying electrostatic interactions to further improve the model prediction. These features are uniform in number for proteins of varying sizes; therefore, the widely available protein structure databases can be used to train the network. These features are also multi‑scale, allowing users to balance resolution and computational cost. The optimal model trained on more than 17,000 proteins for predicting Coulomb energy achieves MSE of approximately 0.024, MAPE of 0.073 and R^2 of 0.976. Meanwhile, the optimal model trained on more than 4,000 proteins for predicting solvation energy achieves MSE of approximately 0.064, MAPE of 0.081, and R^2 of 0.926, showing the efficiency and fidelity of these features in representing the protein structure and force field. The feature generation algorithms also have the potential to serve as general tools for assisting machine learning based prediction of protein properties and functions.
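For readers unfamiliar with the three scores quoted in this abstract, the sketch below computes them from their standard definitions. It assumes MAPE is reported as a fraction (as the quoted values of 0.073 and 0.081 suggest) and says nothing about how the authors normalized their energies; the function name and the example numbers are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAPE as a fraction, R^2) for paired true/predicted values.

    Requires non-empty inputs of equal length, non-zero true values (for
    MAPE), and true values that are not all identical (for R^2).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mape = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n

    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mape, r2


# Made-up numbers purely to exercise the function.
mse, mape, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(mse, mape, r2)
```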
Authors: Mateusz Polakowski, Miłosz Panfil
Abstract: Ion channels are protein structures that facilitate the selective passage of ions across the cell membranes of living organisms. They are known for their high conductance and high selectivity. The precise mechanism reconciling these two seemingly contradictory features is not yet firmly established. One possible candidate is quantum coherence. In this work we study the quantum model of the soft knock‑on conduction using the Lindblad equation, taking into account the non‑hermiticity of the model. We show that the model exhibits a regime in which high conductance coexists with high coherence. Our findings support the role of quantum effects in the transport properties of the ion channels.
Authors: Tomoei Takahashi, George Chikenji, Kei Tokita, Yoshiyuki Kabashima
Abstract: How typical elements that shape organisms, such as protein secondary structures, have evolved, and how evolutionarily susceptible or resistant they are to environmental changes, are significant issues in evolutionary biology, structural biology, and biophysics. According to Darwinian evolution, natural selection and genetic mutations are the primary drivers of biological evolution. However, the concept of "robustness of the phenotype to environmental perturbations across successive generations," which seems crucial from the perspective of natural selection, has not been formalized or analyzed. In this study, using Bayesian learning and statistical mechanics, we formalize this robustness as the stability of the free energy, in the space of amino acid sequences that can design a particular protein structure, against perturbations of the chemical potential of the water surrounding the protein. This evolutionary stability is defined as a decreasing function of a quantity analogous to the susceptibility in the statistical mechanics of magnetic bodies, specific to the amino acid sequence of a protein. In a two‑dimensional square lattice protein model composed of 36 residues, we found that as we increase the stability of the free energy against perturbations in environmental conditions, the structural space shows a steep step‑like reduction. Furthermore, lattice protein structures with higher stability against perturbations in environmental conditions tend to have a higher proportion of α‑helices and a lower proportion of β‑sheets. This result is qualitatively confirmed by an empirical validation using a protein database, which compares histograms of the secondary‑structure percentages of evolutionarily robust proteins and randomly selected proteins.
Authors: Pratyush Tiwary, Lukas Herron, Richard John, Suemin Lee, Disha Sanwal, Ruiyu Wang
Abstract: The recent surge in Generative Artificial Intelligence (AI) has introduced exciting possibilities for computational chemistry. Generative AI methods have made significant progress in sampling molecular structures across chemical species, developing force fields, and speeding up simulations. This Perspective offers a structured overview, beginning with the fundamental theoretical concepts in both Generative AI and computational chemistry. It then covers widely used Generative AI methods, including autoencoders, generative adversarial networks, reinforcement learning, flow models and language models, and highlights their selected applications in diverse areas including force field development, and protein/RNA structure prediction. A key focus is on the challenges these methods face before they become truly predictive, particularly in predicting emergent chemical phenomena. We believe that the ultimate goal of a simulation method or theory is to predict phenomena not seen before, and that Generative AI should be subject to these same standards before it is deemed useful for chemistry. We suggest that to overcome these challenges, future AI models need to integrate core chemical principles, especially from statistical mechanics.
Authors: Gokul Gowri, Xiao-Kang Lun, Allon M. Klein, Peng Yin
Abstract: Mutual information (MI) is a general measure of statistical dependence with widespread application across the sciences. However, estimating MI between multi‑dimensional variables is challenging because the number of samples necessary to converge to an accurate estimate scales unfavorably with dimensionality. In practice, existing techniques can reliably estimate MI in up to tens of dimensions, but fail in higher dimensions, where sufficient sample sizes are infeasible. Here, we explore the idea that underlying low‑dimensional structure in high‑dimensional data can be exploited to faithfully approximate MI in high‑dimensional settings with realistic sample sizes. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low‑dimensional representations learned by a simple, theoretically‑motivated model architecture. Using several benchmarks, we show that unlike existing techniques, LMI can approximate MI well for variables with > 10^3 dimensions if their dependence structure has low intrinsic dimensionality. Finally, we showcase LMI on two open problems in biology. First, we approximate MI between protein language model (pLM) representations of interacting proteins, and find that pLMs encode non‑trivial information about protein‑protein interactions. Second, we quantify cell fate information contained in single‑cell RNA‑seq (scRNA‑seq) measurements of hematopoietic stem cells, and find a sharp transition during neutrophil differentiation when fate information captured by scRNA‑seq increases dramatically.
Authors: Amir Shee, Vidur Sabharwal, Sandhya P. Koushika, Amitabha Nandi, Debasish Chaudhuri
Abstract: Cargo distribution within eukaryotic cells relies on the active transport mechanisms driven by molecular motors. Despite their critical role, the intricate relationship between motor transport properties and cargo binding ‑ and its impact on motor distribution ‑ remains inadequately understood. Additionally, improper regulation of ubiquitination, a pivotal post‑translational modification that affects protein degradation, activation, and localization, is associated with several neurodegenerative diseases. Recent data showed that ubiquitination can alter motor‑cargo binding of the Kinesin‑3 motor UNC‑104 / KIF1A that transports synaptic vesicles. To investigate how ubiquitin‑like modifications affect motor protein function, particularly cargo binding, transport properties, and distribution, we use the PLM neuron of C. elegans as a model system. Using fluorescence microscopy, we assess the distribution of cargo‑bound UNC‑104 motors along the axon and probe their dynamics using FRAP experiments. We model cargo binding kinetics with a Master equation and motor density dynamics using a Fokker‑Planck approach. Our combined experimental and theoretical analysis reveals that ubiquitin‑like knockdowns enhance UNC‑104's cooperative binding to its cargo. However, these modifications do not affect UNC‑104's transport properties, such as processivity and diffusivity. Thus, while ubiquitin‑like modifications significantly impact the cargo‑binding of UNC‑104, they do not alter its transport dynamics, keeping the homeostatic distribution of UNC‑104 unchanged.
Authors: A. Quadir, M. Sajid, M. Tanveer
Abstract: The identification of DNA‑binding proteins (DBPs) is essential due to their significant impact on various biological activities. Understanding the mechanisms underlying protein‑DNA interactions is essential for elucidating various life activities. In recent years, machine learning‑based models have been prominently utilized for DBP prediction. In this paper, to predict DBPs, we propose a novel framework termed a multiview random vector functional link (MvRVFL) network, which fuses neural network architecture with multiview learning. The MvRVFL model integrates both late and early fusion advantages, enabling separate regularization parameters for each view, while utilizing a closed‑form solution for efficiently determining unknown parameters. The primal objective function incorporates a coupling term aimed at minimizing a composite of errors stemming from all views. From each of the three protein views of the DBP datasets, we extract five features. These features are then fused together by incorporating a hidden feature during the model training process. The performance of the proposed MvRVFL model on the DBP dataset surpasses that of baseline models, demonstrating its superior effectiveness. We further validate the practicality of the proposed model across diverse benchmark datasets, and both theoretical analysis and empirical results consistently demonstrate its superior generalization performance over baseline models.
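To make the "closed‑form solution" concrete, here is a minimal sketch of a plain single‑view random vector functional link (RVFL) network: random, untrained hidden weights, direct input‑to‑output links, and output weights obtained by ridge regression. This is the generic RVFL building block only, not the authors' multiview MvRVFL objective with its coupling term. The function names, the tanh activation, the hidden size, and the regularization value are all assumptions made for illustration.

```python
import numpy as np


def fit_rvfl(X, y, n_hidden=50, reg=1e-2, seed=0):
    """Fit a single-view RVFL; returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    # Hidden-layer weights and biases are drawn once and never trained.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    # Direct links: concatenate the raw inputs with the hidden features.
    D = np.hstack([X, H])
    # Closed-form ridge solution: beta = (D^T D + reg * I)^{-1} D^T y
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ y)
    return W, b, beta


def predict_rvfl(X, W, b, beta):
    """Real-valued scores; take the sign for a +1/-1 class label."""
    D = np.hstack([X, np.tanh(X @ W + b)])
    return D @ beta


# Hypothetical synthetic data standing in for one feature view.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = np.sign(X_demo[:, 0] + X_demo[:, 1])
W, b, beta = fit_rvfl(X_demo, y_demo)
train_acc = np.mean(np.sign(predict_rvfl(X_demo, W, b, beta)) == y_demo)
```

Since only `beta` is learned, and through a single linear solve, training cost is dominated by forming and solving a system whose size is the number of input plus hidden features.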
Authors: Shingo Gibo, Teiji Kunihiro, Tetsuo Hatsuda, Gen Kurosawa
Abstract: Numerous biological processes accelerate as temperatures increase, but the period of circadian rhythms remains constant, a property known as temperature compensation, while the rhythms synchronize with the 24h light‑dark cycle. We theoretically explore the possible relevance of waveform distortions in circadian gene‑protein dynamics to temperature compensation and synchronization. Our analysis of the Goodwin model provides a coherent explanation of most temperature compensation hypotheses. Using the renormalization group method, we analytically demonstrate that the decreasing phase of circadian protein oscillations should lengthen with increasing temperature, leading to waveform distortions that maintain a stable period. This waveform‑period correlation also occurs in other oscillators such as the Lotka‑Volterra and van der Pol models. A reanalysis of known data confirms our findings on waveform distortion and its impact on synchronization range. Thus we conclude that circadian rhythm waveforms are fundamental to both temperature compensation and synchronization.
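For orientation, the following is a minimal sketch of a three‑variable Goodwin oscillator integrated with a fixed‑step fourth‑order Runge‑Kutta scheme. The equations are the textbook form of the model (mRNA x, protein y, repressor z, with z inhibiting production of x). The parameter values (a = 10, k = 1, n = 20), the step size, and all function names are assumptions chosen only so that the negative feedback loop oscillates; none of them are taken from the paper.

```python
def goodwin_rhs(state, a=10.0, k=1.0, n=20):
    """Right-hand side of the Goodwin model with equal degradation rates k."""
    x, y, z = state
    dx = a / (1.0 + z ** n) - k * x  # production of x is repressed by z
    dy = x - k * y
    dz = y - k * z
    return (dx, dy, dz)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * d for s, d in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * d for s, d in zip(state, k2)))
    k4 = f(tuple(s + dt * d for s, d in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (d1 + 2.0 * d2 + 2.0 * d3 + d4)
        for s, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


def simulate(t_end=100.0, dt=0.01, state=(0.0, 0.0, 0.0)):
    """Integrate from `state` and return the list of visited states."""
    trajectory = [state]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(goodwin_rhs, state, dt)
        trajectory.append(state)
    return trajectory


def count_peaks(series):
    """Number of strict interior local maxima in a sequence."""
    return sum(
        1
        for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    )


trajectory = simulate()
z_late = [s[2] for s in trajectory[len(trajectory) // 2:]]  # drop the transient
n_peaks = count_peaks(z_late)
```

With equal degradation rates, the steady state of this system loses stability only when the Hill coefficient n exceeds 8, which is why a steep value is used here. Measuring the rising and falling durations between successive peaks of `z_late` would be one way to quantify the kind of waveform asymmetry the abstract discusses.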
Authors: Zi Hao Liu, Maria Tsanai, Oufan Zhang, Julie Forman-Kay, Teresa Head-Gordon
Abstract: In 1999 Wright and Dyson highlighted the fact that large sections of the proteome of all organisms are comprised of protein sequences that lack globular folded structures under physiological conditions. Since then the biophysics community has made significant strides in unraveling the intricate structural and dynamic characteristics of intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs). Unlike crystallographic beamlines and their role in streamlining acquisition of structures for folded proteins, an integrated experimental and computational approach aimed at IDPs/IDRs has emerged. In this Perspective we aim to provide a robust overview of current computational tools for IDPs and IDRs, and most recently their complexes and phase separated states, including statistical models, physics‑based approaches, and machine learning methods that permit structural ensemble generation and validation against many solution experimental data types.
Authors: Alessandro Martinelli, Stefano Buzzaccaro, Quentin Galand, Juliette Behra, Niel Segers, Erik Leussink, Yadvender Singh Dhillon, Dominique Maes, James Lutsko, Roberto Piazza, Luca Cipelletti
Abstract: Colloidal Solids (COLIS) is a state‑of‑the‑art light scattering setup developed for experiments onboard the International Space Station (ISS). COLIS allows for probing the structure and dynamics of soft matter systems on a wide range of length scales, from a few nm to tens of microns, and on time scales from 100 ns to tens of hours. In addition to conventional static and dynamic light scattering, COLIS includes depolarized dynamic light scattering, a small‑angle camera, photon correlation imaging, and optical manipulation of thermosensitive samples through an auxiliary near‑infrared laser beam, thereby providing a unique platform for probing soft matter systems. We demonstrate COLIS through ground tests on standard Brownian suspensions, and on protein, colloidal glasses, and gel systems similar to those to be used in future ISS experiments.
Authors: Shania Mitra, Lei Huang, Manolis Kellis
Abstract: Protein function prediction is a crucial task in bioinformatics, with significant implications for understanding biological processes and disease mechanisms. While the relationship between sequence and function has been extensively explored, translating protein structure to function continues to present substantial challenges. Various models, particularly CNN‑ and graph‑based deep learning approaches that integrate structural and functional data, have been proposed to address these challenges. However, these methods often fall short in elucidating the functional significance of key residues essential for protein functionality, as they predominantly adopt a retrospective perspective, leading to suboptimal performance.
Inspired by region proposal networks in computer vision, we introduce the Protein Region Proposal Network (ProteinRPN) for accurate protein function prediction. Specifically, the region proposal module component of ProteinRPN identifies potential functional regions (anchors) which are refined through the hierarchy‑aware node drop pooling layer favoring nodes with defined secondary structures and spatial proximity. The representations of the predicted functional nodes are enriched using attention mechanisms and subsequently fed into a Graph Multiset Transformer, which is trained with supervised contrastive (SupCon) and InfoNCE losses on perturbed protein structures. Our model demonstrates significant improvements in predicting Gene Ontology (GO) terms, effectively localizing functional residues within protein structures. The proposed framework provides a robust, scalable solution for protein function annotation, advancing the understanding of protein structure‑function relationships in computational biology.
Authors: Siyu Li, Guillaume Tresset, Roya Zandi
Abstract: The packaging of genetic material within a protein shell, called the capsid, marks a pivotal step in the life cycle of numerous single‑stranded RNA viruses. Understanding how hundreds, or even thousands, of proteins assemble around the genome to form highly symmetrical structures remains an unresolved puzzle. In this paper, we design novel subunits and develop a model that enables us to explore the previously inaccessible assembly pathways and genome packaging mechanism of icosahedral viruses. Using molecular dynamics (MD) simulations, we observe capsid fragments, varying in protein number and morphology, assembling at different locations along the genome. Initially, these fragments create a disordered structure that later merges to form a perfectly symmetric capsid. The model demonstrates remarkable strength in addressing numerous unresolved issues surrounding virus assembly. For instance, it enables us to explore the advantages of RNA packaging by capsid proteins over linear polymers. Our MD simulations are in excellent agreement with our experimental findings from small‑angle X‑ray scattering and cryo‑transmission electron microscopy, which carefully analyze the assembly products of viral capsid proteins around RNAs with distinct topologies.
Authors: Lenard Neander, Cedric Hannemann, Roland R. Netz, Anil Kumar Sahoo
Abstract: Interactions of polyelectrolytes (PEs) with proteins play a crucial role in numerous biological processes, such as the internalization of virus particles into host cells. Although docking, machine learning methods, and molecular dynamics (MD) simulations are utilized to estimate binding poses and binding free energies of small‑molecule drugs to proteins, quantitative prediction of the binding thermodynamics of PE‑based drugs presents a significant obstacle in computer‑aided drug design. This is due to the sluggish dynamics of PEs caused by their size and strong charge‑charge correlations. In this paper, we introduce advanced sampling methods based on a force‑spectroscopy setup and theoretical modeling to overcome this barrier. We exemplify our method with explicit solvent all‑atom MD simulations of interactions of anionic PEs that show antiviral properties, namely heparin and linear polyglycerol sulfate (LPGS), with the SARS‑CoV‑2 spike protein receptor binding domain (RBD). Our prediction for the binding free energy of LPGS to the wild‑type RBD matches experimentally measured dissociation constants within thermal energy, kT, and correctly reproduces the experimental PE‑length dependence. We find that LPGS binds to the Delta‑variant RBD with an additional free‑energy gain of 2.4 kT, compared to the wild‑type RBD, in accord with electrostatic arguments. We show that the LPGS‑RBD binding is solvent‑dominated and enthalpy‑driven, though with a large entropy‑enthalpy compensation. Our method is applicable to general polymer adsorption phenomena and predicts precise binding free energies and re‑configurational friction as needed for drug and drug‑delivery design.
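As a quick sense of scale for the 2.4 kT figure: under the standard relation between binding free energy and dissociation constant, K_d = c0 * exp(ΔG / kT), a free‑energy difference expressed in units of kT maps directly to a ratio of dissociation constants, and the reference concentration c0 cancels. The sketch below applies that relation. The function name is illustrative, and the resulting ratio is an inference from the relation, not a number reported in the abstract.

```python
import math


def kd_fold_change(free_energy_gain_kT):
    """Ratio K_d(reference) / K_d(tighter binder) for a binding free-energy
    gain given in units of kT, assuming K_d = c0 * exp(dG / kT)."""
    return math.exp(free_energy_gain_kT)


# A 2.4 kT gain for the Delta-variant RBD relative to the wild type
fold = kd_fold_change(2.4)  # roughly an 11-fold lower K_d
```

By the same relation, agreement "within thermal energy, kT" corresponds to predicted and measured dissociation constants differing by at most a factor of about e, that is, roughly 2.7.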