AI for Small Molecule

Our team focuses on multi-modal scientific data integration and intelligent platform development. By building comprehensive datasets across genomics, transcriptomics, proteomics, and metabolomics, and by developing predictive models such as MetaSCDrug, 3D-Mol, and ProtFAD, the team has advanced drug response prediction, molecular property modeling, and protein function analysis. This research emphasizes the deep integration of domain knowledge into AI models to improve explainability and prediction accuracy. The team has published over ten research papers, contributing to faster drug discovery, precision medicine research, and the broader application of data-driven life science innovations.

A quantitative analysis of knowledge-learning preferences in large language models in molecular science

Deep learning has significantly advanced molecular modelling and design, enabling an efficient understanding and discovery of novel molecules. In particular, large language models introduce a fresh research paradigm to tackle scientific problems from a natural language processing perspective. Large language models significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns. However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multimodal benchmark, named ChEBI-20-MM, and perform 1,263 experiments to assess the model’s compatibility with data modalities and knowledge acquisition. Through the modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our analysis offers an exploration of the learning mechanism and paves the way for advancing large language models in molecular science.

GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text

Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. We achieve a 5%–10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.

Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation

In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook comprehensive factors that influence protein-molecule interactions. To address these challenges, we propose a novel fragment-based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept-based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug-likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept-based model, our framework enhances interpretability, offering valuable insights into the molecular design process.

AI for Protein

Our research group applies state-of-the-art deep learning techniques to critical challenges in protein science, with a particular emphasis on accurate protein function prediction and the rational design of functional proteins. By integrating advanced neural network architectures with domain-specific insights, we develop models that capture rich structural representations of proteins while injecting biochemical and biophysical knowledge into the learned embeddings. By combining data-driven feature learning with expert-curated protein knowledge, we have published multiple high-impact journal articles demonstrating the effectiveness of our methods across a variety of AI-for-proteins applications.

Aligning sequence and structure representations leveraging protein domains for function prediction

Protein function prediction is traditionally approached through sequence or structural modeling, often neglecting the effective fusion of diverse data sources. Protein domains, as functionally independent building blocks, determine a protein’s biological function, yet their potential has not been fully exploited in function prediction tasks. To address this, we introduce a modality-fused neural network that uses function-aware domain embeddings as a bridge. We pre-train these embeddings by aligning domain semantics with Gene Ontology (GO) terms and textual descriptions. Additionally, we partition proteins into sub-views based on continuous domain regions for contrastive learning, supervised by a novel triplet InfoNCE loss. Our method outperforms state-of-the-art approaches across various benchmarks and differentiates proteins carrying distinct functions more clearly than competing methods.
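
For readers unfamiliar with the contrastive objective mentioned above, the following is a minimal sketch of the standard InfoNCE loss for a single anchor. It is not the triplet InfoNCE variant proposed in the paper, whose exact form is not given in this abstract. The function names, the cosine similarity, and the temperature of 0.1 are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one anchor (hypothetical helper).

    Computes -log( exp(s_pos / t) / sum_i exp(s_i / t) ), where the sum
    runs over the positive and all negatives. The temperature value is
    an assumed default, not taken from the paper.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Toy embeddings: the loss is small when the positive matches the anchor
# and large when a negative matches it instead.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a setting like the one described, the anchor and positive would plausibly be embeddings of two sub-views of the same protein, with sub-views of other proteins acting as negatives; that pairing is an assumption made here for illustration.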

Generative prediction of real-world prevalent SARS-CoV-2 mutation with in silico virus evolution

Predicting the mutation prevalence trends of emerging viruses in the real world is an efficient means to update vaccines or drugs in advance. It is crucial to develop a computational method for the prediction of real-world prevalent SARS-CoV-2 mutations considering the impact of multiple selective pressures within and between hosts. Here, a deep-learning generative framework for real-world prevalent SARS-CoV-2 mutation prediction, named ViralForesight, is developed on top of protein language models and in silico virus evolution. Through the paradigm of host-to-herd in silico virus evolution, ViralForesight reproduced previous real-world prevalent SARS-CoV-2 mutations for multiple lineages with superior performance. More importantly, ViralForesight correctly predicted the future prevalent mutations that dominated the COVID-19 pandemic in the real world more than half a year in advance with in vitro experimental validation. Overall, ViralForesight demonstrates a proactive approach to the prevention of emerging viral infections, accelerating the process of discovering future prevalent mutations with the power of generative deep learning.

AI for Transcriptomics

Our research group focuses on "AI for transcriptomics", leveraging machine learning and deep learning techniques to analyze complex gene expression data, including bulk RNA-seq, single-cell RNA-seq, and spatial transcriptomics. This approach enables us to uncover hidden patterns in high-dimensional data and supports a wide range of downstream tasks such as cell type annotation, gene regulatory network inference, trajectory analysis, drug response prediction, and multi-omics integration. By developing interpretable and robust AI models, our goal is to extract meaningful biological insights, facilitate precision medicine, and enhance our understanding of cellular diversity and disease mechanisms.

Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction

Language models pretrained by self-supervised learning (SSL) have been widely used to study protein sequences, whereas few models have been developed for genomic sequences, and those were limited to a single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlights the importance of pretraining genomic language models on a diverse range of species and suggests that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
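
As a rough illustration of the masked language modeling objective mentioned above, the sketch below hides a random subset of nucleotides and records the originals as prediction targets. It is not SpliceBERT's actual preprocessing. The single-nucleotide tokenization, the 15% masking rate (a common BERT-style convention), the `[MASK]` token string, and the function name `mask_sequence` are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # assumed placeholder token


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Mask a random subset of nucleotides (hypothetical helper).

    Returns (tokens, labels): `tokens` is the sequence with the chosen
    positions replaced by MASK, and `labels` maps each masked position
    to the nucleotide a model would be trained to recover.
    """
    tokens = list(seq)
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, labels


sequence = "ACGUACGUACGUACGUACGU"
tokens, labels = mask_sequence(sequence)

# Putting the labels back recovers the original sequence.
restored = "".join(labels.get(i, tok) for i, tok in enumerate(tokens))
```

During pretraining, a model sees `tokens` and is penalized for failing to predict the nucleotides stored in `labels`, so no manual annotation is needed; this is what makes the objective self-supervised.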

Cytokine expression patterns: A single-cell RNA sequencing and machine learning based roadmap for cancer classification

Cytokines are small protein molecules with potent immunoregulatory properties and are essential components of the tumor immune microenvironment (TIME). While some cytokines are known to be universally upregulated in TIME, the cytokine expression patterns unique to specific types of cancer have not been fully resolved. To address this challenge, we develop a TIME single-cell RNA sequencing (scRNA-seq) dataset designed to study cytokine expression patterns for precise cancer classification. The dataset, covering 39 cancers, is constructed by integrating 684 tumor scRNA-seq samples from multiple public repositories. After screening and processing, the dataset retains only the expression data of immune cells. With a machine learning classification model, unique cytokine expression patterns are identified for various cancer categories and, in a pioneering application, used for cancer classification with an accuracy of 78.01%. Our method will not only boost the understanding of cancer-type-specific immune modulations in TIME but also serve as a crucial reference for future diagnostic and therapeutic research in cancer immunity.

TCMM: A unified database for traditional Chinese medicine modernization and therapeutic innovations

Mining the potential of traditional Chinese medicine (TCM) in treating modern diseases requires a profound understanding of its mechanism of action and a comprehensive knowledge system that bridges modern medical insights with traditional theories. However, existing databases for modernizing TCM suffer from varying degrees of information loss, which impedes the multidimensional dissection of pharmacological effects. To address this challenge, we introduce traditional Chinese medicine modernization (TCMM), currently the largest modernized TCM database, which integrates pioneering intelligent pipelines. By aligning high-quality TCM and modern medicine data, TCMM offers the most extensive TCM modernization knowledge, including 20 types of modernized TCM concepts (such as prescription, ingredient, and target) and 46 biological relations among them, totaling 3,447,023 records. We demonstrate the efficacy and reliability of TCMM with two features, prescription generation and knowledge discovery, whose outcomes are consistent with biological experimental results. A publicly available web interface is at https://www.tcmm.net.cn/.