Cover of Nature Machine Intelligence: BioMap Proposed Its Protein Structure Prediction Model

2023-11-27

In October 2023, Nature Machine Intelligence, a prestigious international scholarly journal, featured BioMap’s research on its cover: "A method for multiple-sequence-alignment-free protein structure prediction using a protein language model."

This groundbreaking research, conducted last year and first released on a preprint server, uses BioMap’s specialized biology language model xTrimo (academic version) and Baidu's computing infrastructure. It presents the first protein structure prediction model based on a large-scale language model that does not require MSA (multiple sequence alignment), improving the efficiency of protein prediction tasks by more than 100 times.

Furthermore, this initiative is the first academic open-source project from the xTrimo large-model system for life sciences. BioMap develops both a proprietary large model for life science projects and an academic version for academic collaboration, aiming to accelerate the technology ecosystem by sharing its expertise. We invite more leading scholars to participate and contribute.

 

AI researchers have made major progress on the challenge of protein structure prediction, with AlphaFold2 in particular pushing the frontier of prediction accuracy. However, the leading protein structure prediction methods, including AlphaFold2, rely heavily on co-evolutionary information derived from multiple sequence alignments (MSAs) and templates.

This research addresses structure prediction for general proteins. By leveraging the sequence understanding of large language models, it removes the speed bottleneck of MSA retrieval that mainstream models such as AlphaFold2 depend on. Average prediction speed for protein structures increases by a factor of several hundred, with predictions completed within seconds. The work offers a protein structure prediction solution with lower barriers to use and broader applicability, potentially advancing research in biomedicine, synthetic biology, and other life sciences.

 

Fig.1. The framework of HelixFold-Single

HelixFold-Single predicts structures far faster than AlphaFold2, completing predictions within seconds. For example, AlphaFold2 takes 1280 seconds (roughly 21 minutes) to predict the structure of gate protein 7et2_H (length 697). HelixFold-Single does the same in just 11 seconds, a speedup of roughly 115-fold.
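The arithmetic behind this comparison can be checked with a minimal sketch. The function and variable names here are illustrative and do not come from the paper. With the rounded timings quoted above, the ratio comes out at about 116, in line with the reported figure of roughly 115-fold, which was presumably computed from unrounded timings (an assumption).

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new method is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


# Timings quoted in the article for 7et2_H (697 residues), in seconds.
alphafold2_seconds = 1280.0
helixfold_single_seconds = 11.0

ratio = speedup(alphafold2_seconds, helixfold_single_seconds)
minutes = alphafold2_seconds / 60  # convert the baseline time to minutes

print(f"AlphaFold2: {minutes:.1f} min; speedup: {ratio:.0f}x")
```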

 

Fig.2. Comparison of median running times of MSA search, AlphaFold2, and HelixFold-Single

The efficient HelixFold-Single model is not only better suited to tasks that require frequent protein structure prediction, such as protein design and large-scale virtual screening, but also outperforms AlphaFold2 in scenarios involving highly variable proteins, particularly in large-molecule drug design.

Moreover, as a pre-trained model, HelixFold-Single extends to downstream applications such as protein function prediction, protein interaction prediction, and mutant protein prediction. It will help life science researchers interpret the composition and dynamic patterns of living organisms more conveniently, more efficiently, and at a deeper level. This supports more groundbreaking studies, such as exploring treatments for specific cancers and viral infections, and developing new antibiotics, targeted drugs, or more efficient industrial enzymes. In doing so, it contributes lasting value to both human health and industrial development.

Large AI models are now driving the evolution of bioinformatics. BioMap, a pioneer in large models for life science, introduced the xTrimoPGLM protein language model with 100 billion parameters in July 2023. It ranks high on the LifeScience Leaderboard (http://www.biomap.com/sota), outperforming similar models on more than 40 tasks.

 

Fig.3. The AIGP platform selects a CD40L-binding protein segment as the motif and uses various design strategies to design an entirely new miniprotein.

Dr. Song Le, BioMap's CTO and corresponding author of the paper, said, "BioMap is committed to building AI Foundation Models for decoding life. Our goal is to provide innovative solutions to challenging issues in life sciences by utilizing ultra-large-scale AI pre-training models to comprehend complex biological phenomena. The joint exploration of large models for protein structure prediction in this collaboration is based on BioMap's powerful Foundation Model technology and the rich accumulation of life science data and knowledge. We also hope for more in-depth collaboration in the future, particularly in areas like improving target analysis and high-precision protein design."

Paper Link: https://www.nature.com/articles/s42256-023-00721-6