xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

2023-03-25

In March 2023, BioMap unveiled xTrimoGene, a pre-trained model with 100 million parameters and a component of the xTrimo system of large-scale life science models. It marks a step forward in BioMap's effort to decode the language of life.

Compared with conventional sequencing, single-cell RNA sequencing (scRNA-seq) resolves genetic data at the level of individual cells, which has contributed significantly to the study of tissue and organ function, drug research, and related fields.

Single-cell RNA-seq data has now reached the billions scale, with the number of sequenced human cells reaching the ten-million mark. At approximately 20,000 protein-coding genes per cell, the total token count approaches one trillion, comparable to the number of words used to train natural language models such as GPT. This data volume also raises new problems for efficient scRNA-seq analysis and modeling: the high dimensionality and sparsity of the data demand new data-handling strategies.
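
The token arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes that every (cell, gene) pair counts as one token, and the function names are hypothetical, not part of any xTrimoGene code.

```python
GENES_PER_CELL = 20_000  # approximate number of protein-coding genes per cell


def token_count(n_cells: int, genes_per_cell: int = GENES_PER_CELL) -> int:
    """Total tokens if every (cell, gene) pair is treated as one token (assumption)."""
    return n_cells * genes_per_cell


def cells_needed(target_tokens: int, genes_per_cell: int = GENES_PER_CELL) -> int:
    """Number of cells required to reach a given token count."""
    return target_tokens // genes_per_cell


if __name__ == "__main__":
    ten_million_cells = 10_000_000
    one_trillion = 10**12
    print(token_count(ten_million_cells))  # tokens contributed by ten million cells
    print(cells_needed(one_trillion))      # cells needed for one trillion tokens
```

Under this assumption, ten million cells correspond to 200 billion tokens, and one trillion tokens corresponds to 50 million cells, so the trillion figure fits a corpus in the tens of millions of cells.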

xTrimoGene addresses these challenges with an asymmetric encoder-decoder architecture tailored to the characteristics of scRNA-seq data. It replaces the encoder used in earlier models such as scBERT with a new design that is more than 3x faster.
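
One way an asymmetric design can exploit sparse data is sketched below. It is an illustration under stated assumptions, not the authors' implementation: it assumes that the encoder only receives genes with non-zero expression, that the decoder works on the full-length gene vector, and that attention cost grows quadratically with sequence length. All function names are hypothetical.

```python
def split_for_encoder(expression):
    """Return (indices, values) of the non-zero genes.

    Assumption: only these entries are passed to the encoder.
    """
    indices = [i for i, value in enumerate(expression) if value != 0]
    values = [expression[i] for i in indices]
    return indices, values


def scatter_for_decoder(indices, encoded, n_genes, fill=0.0):
    """Place encoder outputs back at their gene positions.

    Positions the encoder never saw are set to `fill`, so the decoder
    receives a vector covering all genes.
    """
    full = [fill] * n_genes
    for i, e in zip(indices, encoded):
        full[i] = e
    return full


def attention_cost_ratio(n_genes, n_nonzero):
    """Encoder attention cost relative to a full-length encoder (quadratic cost assumed)."""
    return (n_nonzero ** 2) / (n_genes ** 2)


if __name__ == "__main__":
    cell = [0, 3, 0, 0, 5, 0, 0, 0, 1, 0]  # toy expression vector for one cell
    idx, vals = split_for_encoder(cell)
    # A real encoder would transform `vals`; here they pass through unchanged.
    restored = scatter_for_decoder(idx, vals, len(cell), fill=0)
    print(idx, vals, restored)
    print(attention_cost_ratio(20_000, 2_000))
```

In this toy setting, a cell in which one gene in ten is expressed would need about 1% of the encoder attention cost of a full-length input. The density of one in ten is an illustrative value, not a figure taken from the announcement.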

Its performance was validated on several datasets across three tasks that are central to single-cell research: cell classification, treatment prediction, and drug combination prediction.

To further explore the potential of pre-trained models in single-cell research, our team designed a new task for xTrimoGene, read-depth-aware (RDA) modeling, which targets multi-resolution and super-resolution handling of single-cell data. Building on xTrimoGene, we also developed scFoundation, a modified version that demonstrates broader applicability in downstream areas such as target discovery and drug synergy prediction.
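
To make the read-depth idea concrete, the sketch below builds one hypothetical training pair for a read-depth-aware task. It assumes that a lower read depth can be simulated by keeping each read independently with a fixed probability (binomial thinning), and that the model would be given both the input depth and the target depth. The announcement does not spell out these details, so treat the function names and data layout as illustrative only.

```python
import random


def downsample_counts(counts, keep_prob, rng):
    """Keep each read independently with probability `keep_prob` (binomial thinning)."""
    return [sum(1 for _ in range(c) if rng.random() < keep_prob) for c in counts]


def make_rda_example(counts, keep_prob, seed=0):
    """Pair a simulated low-depth profile with the original higher-depth profile."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    low = downsample_counts(counts, keep_prob, rng)
    return {
        "input": low,
        "input_depth": sum(low),        # total reads in the thinned cell
        "target": list(counts),
        "target_depth": sum(counts),    # total reads in the original cell
    }


if __name__ == "__main__":
    cell_counts = [0, 12, 0, 3, 40, 0, 7]  # toy raw counts for one cell
    example = make_rda_example(cell_counts, keep_prob=0.3)
    print(example)
```

A model trained on such pairs would be asked to recover the higher-depth profile from the thinned one, which is one reading of the "super-resolution" wording above.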

Both studies, carried out in collaboration between BioMap and Tsinghua University, were recently posted on the preprint server bioRxiv.

Links:

xTrimoGene: https://www.biorxiv.org/content/10.1101/2023.03.24.534055v1

scFoundation: https://www.biorxiv.org/content/10.1101/2023.05.29.542705v2