xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

2023-07-07

In July 2023, BioMap, in partnership with Tsinghua University, unveiled the xTrimo Protein General Language Model (xTrimoPGLM), a protein language model with 100 billion parameters. The work was posted as a preprint on bioRxiv on July 7, 2023.

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently.

This paper proposes a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. 
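To make the two objective types concrete, here is a minimal sketch of how one protein sequence could be turned into an autoencoding (masked-residue) example and an autoregressive (span-infilling) example. It is an illustration only, not the authors' implementation: the function names, the special tokens `[MASK]`, `[SPAN]`, `<sop>` and `<eop>`, and the toy sequence are all assumptions introduced here.

```python
def mlm_example(seq, positions, mask_token="[MASK]"):
    """Autoencoding view: hide individual residues and predict them in place."""
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]   # remember the residue the model must recover
        tokens[i] = mask_token   # hide it in the input
    return tokens, targets


def span_infill_example(seq, start, end, span_token="[SPAN]",
                        sop="<sop>", eop="<eop>"):
    """Autoregressive view: blank out seq[start:end], then generate it left to right."""
    tokens = list(seq)
    span = tokens[start:end]
    # The whole span collapses to a single placeholder in the context.
    context = tokens[:start] + [span_token] + tokens[end:]
    # Decoder input is the target shifted right by one position.
    decoder_input = [sop] + span
    decoder_target = span + [eop]
    return context, decoder_input, decoder_target


seq = "MKTAYIAK"  # hypothetical toy sequence
masked, targets = mlm_example(seq, [1, 4])
context, dec_in, dec_out = span_infill_example(seq, 2, 5)
```

In a joint setup of this kind, a training batch could mix both example types so that a single model learns in-place prediction (useful for understanding tasks) and left-to-right generation (useful for design tasks). How xTrimoPGLM actually balances the two objectives is described in the preprint.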

Our extensive experiments reveal that xTrimoPGLM significantly outperforms other advanced baselines in diverse protein understanding tasks (13 out of 15 tasks across four categories) and generates novel protein sequences which are structurally similar to natural ones. Furthermore, using the same xTrimoPGLM framework, we train an antibody-specific model (xTrimoPGLM-Ab) using 1 billion parameters. This model set a new record in predicting antibody naturalness and structures, both essential to the field of antibody-based drug design, and demonstrated a significantly faster inference speed than AlphaFold2.

These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences.

While many avenues remain to be explored, the advent of a 100-billion-parameter model trained on 1 trillion tokens marks a closer integration of cutting-edge AI and biology. xTrimoPGLM is expected to help advance pharmaceutical research, with potential benefits for human healthcare and scientific discovery.

Link: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v3