AI for Life Science
Our team focuses on multi-modal scientific data integration and intelligent platform development. By building comprehensive datasets across genomics, transcriptomics, proteomics, and metabolomics, and by developing predictive models such as MetaSCDrug, 3D-Mol, and ProtFAD, we have advanced drug response prediction, molecular property modeling, and protein function analysis. This research emphasizes the deep integration of domain knowledge into AI models, enhancing both explainability and prediction accuracy. The team has published more than ten research papers, contributing to faster drug discovery, precision medicine research, and the broader application of data-driven innovation in the life sciences.
TCMM: A unified database for traditional Chinese medicine modernization and therapeutic innovations
Mining the potential of traditional Chinese medicine (TCM) in treating modern diseases requires a profound understanding of its mechanisms of action and a comprehensive knowledge system that seamlessly bridges modern medical insights with traditional theories. However, existing databases for modernizing TCM suffer from varying degrees of information loss, which impedes the multidimensional dissection of pharmacological effects. To address this challenge, we introduce traditional Chinese medicine modernization (TCMM), currently the largest modernized TCM database, built with pioneering intelligent pipelines. By aligning high-quality TCM and modern medicine data, TCMM offers the most extensive TCM modernization knowledge, covering 20 types of modernized TCM concepts, such as prescription, ingredient, and target, and 46 biological relations among them, totaling 3,447,023 records. We demonstrate the efficacy and reliability of TCMM with two features, prescription generation and knowledge discovery; the outcomes are consistent with biological experimental results. A publicly available web interface is provided at https://www.tcmm.net.cn/.
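To make the knowledge-graph flavor of such a resource concrete, here is a minimal sketch of TCMM-style records represented as typed (head, relation, tail) triples with a simple two-hop traversal from prescription to targets. The concept and relation names below are purely illustrative and do not reflect the actual TCMM schema or its 20 concept and 46 relation types.

```python
# Illustrative sketch of a TCMM-style knowledge graph as typed triples.
# Entity and relation names are hypothetical, not the real TCMM schema.
from collections import defaultdict

triples = [
    ("Prescription:LiuWeiDiHuang", "contains_ingredient", "Ingredient:Quercetin"),
    ("Ingredient:Quercetin", "targets", "Target:TP53"),
    ("Target:TP53", "associated_with", "Disease:HepatocellularCarcinoma"),
]

# Index outgoing edges by (head, relation) for simple traversal queries.
index = defaultdict(list)
for head, rel, tail in triples:
    index[(head, rel)].append(tail)

def two_hop(prescription):
    """Hypothetical knowledge-discovery query: prescription -> ingredients -> targets."""
    hits = []
    for ingredient in index[(prescription, "contains_ingredient")]:
        for target in index[(ingredient, "targets")]:
            hits.append((ingredient, target))
    return hits

print(two_hop("Prescription:LiuWeiDiHuang"))
```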
3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information
Molecular property prediction, crucial for early drug candidate screening and optimization, has advanced considerably with deep learning-based methods, yet these methods often fall short of fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to extract spatial information inadequately, leading to ambiguous representations in which a single representation may correspond to multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformation, neglecting other viable conformations that occur in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled molecules, treating conformations with identical topological structures as weighted positive pairs and other conformations as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate its outstanding performance.
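The weighted positive-pair idea can be illustrated with a small contrastive objective: two conformations of the same molecule form a positive pair whose loss contribution is scaled by a similarity-derived weight. The encoder and weighting scheme below are placeholders standing in for the hierarchical graph encoder and descriptor/fingerprint similarity described in the paper.

```python
# Hedged sketch of weighted conformation-level contrastive pretraining.
import torch
import torch.nn.functional as F

def weighted_info_nce(z_a, z_b, pair_weights, temperature=0.1):
    """z_a, z_b: (N, d) embeddings of paired conformations.
    pair_weights: (N,) similarity-derived weights for each positive pair."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (N, N) pairwise similarities
    labels = torch.arange(z_a.size(0))             # positives sit on the diagonal
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    return (pair_weights * per_pair).mean()

# Toy usage: random embeddings stand in for the hierarchical graph encoder.
z_a = torch.randn(8, 128, requires_grad=True)
z_b = torch.randn(8, 128, requires_grad=True)
weights = torch.rand(8)                            # e.g. 3D-descriptor similarity scores
loss = weighted_info_nce(z_a, z_b, weights)
loss.backward()
```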
ProtFAD: Introducing function-aware domains as implicit modality towards protein function perception
Protein function prediction is currently achieved by encoding a protein's sequence or structure, where the sequence-to-function gap and the scarcity of high-quality structural data lead to obvious performance bottlenecks. Protein domains are the functionally independent "building blocks" of proteins, and their combinations determine the diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information contained in protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, together with a domain-joint contrastive learning strategy that distinguishes different protein functions while aligning the modalities. Specifically, we associate domains with GO terms as function priors to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms state-of-the-art methods on various benchmarks, and clearly differentiates proteins carrying distinct functions compared with competing methods.
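As a rough illustration of the contrastive component, the sketch below contrasts two sub-views of the same protein (anchor and positive) against views of functionally different proteins (negatives) in an InfoNCE-style loss. This is only a plausible instantiation; the exact triplet InfoNCE formulation and the domain-based view construction in the paper may differ.

```python
# Hedged sketch of a triplet-style InfoNCE objective over protein sub-views.
import torch
import torch.nn.functional as F

def triplet_info_nce(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (d,) views of the same protein.
    negatives: (K, d) views of proteins with different functions."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum() / temperature   # scalar
    neg_logits = negatives @ anchor / temperature          # (K,)
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])
    # The positive pair occupies index 0 of the logits vector.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

anchor, positive = torch.randn(256), torch.randn(256)
negatives = torch.randn(16, 256)
print(triplet_info_nce(anchor, positive, negatives).item())
```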
Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction
Language models pretrained by self-supervised learning (SSL) have been widely used to study protein sequences, whereas few models have been developed for genomic sequences, and those are limited to single species. Without genomes from multiple species, such models cannot effectively leverage evolutionary information. In this study, we developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT proved effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlights the importance of pretraining genomic language models on a diverse range of species and suggests that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
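The pretraining objective itself is standard masked language modeling over nucleotide tokens, sketched below with a toy tokenizer, masking rate, and a tiny transformer encoder. These choices are simplified placeholders rather than the released SpliceBERT implementation.

```python
# Minimal masked-language-modeling sketch on an RNA sequence.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "[MASK]": 4}

def mask_sequence(seq, mask_rate=0.15):
    tokens = torch.tensor([VOCAB[b] for b in seq])
    labels = tokens.clone()
    mask = torch.rand(len(tokens)) < mask_rate
    mask[0] = True                             # guarantee one masked position in this toy
    tokens[mask] = VOCAB["[MASK]"]
    labels[~mask] = -100                       # unmasked positions are ignored by the loss
    return tokens, labels

embed = nn.Embedding(len(VOCAB), 64)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
head = nn.Linear(64, len(VOCAB))

tokens, labels = mask_sequence("AUGGCUACGUAGCUAGCUAAUGGC")
logits = head(encoder(embed(tokens.unsqueeze(0))))
loss = nn.functional.cross_entropy(logits.view(-1, len(VOCAB)), labels.view(-1))
loss.backward()
```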
A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and materials science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and by challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). We then employ adaptive prompt learning to infuse this prior knowledge into a large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion of the model's capability to handle multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
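The overall idea of conditioning the LLM on an elicited reaction-type prior can be sketched in plain-text form as below. Both `predict_reaction_type` and the prompt wording are hypothetical; the paper's adaptive prompt learning operates on learned representations rather than a hand-written text template.

```python
# Hedged sketch of injecting a reaction-type (RT) prior into an LLM prompt.

def predict_reaction_type(product_smiles: str) -> str:
    """Placeholder for the knowledge-elicitation step that assigns an RT."""
    return "amide coupling"                    # illustrative output only

def build_prompt(product_smiles: str) -> str:
    rt = predict_reaction_type(product_smiles)
    return (
        f"Reaction type prior: {rt}.\n"
        f"Product: {product_smiles}\n"
        "Propose plausible reactants for a single retrosynthetic step."
    )

print(build_prompt("CC(=O)Nc1ccc(O)cc1"))      # paracetamol as a toy product
```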
Cytokine expression patterns: A single-cell RNA sequencing and machine learning based roadmap for cancer classification
Cytokines are small protein molecules with potent immunoregulatory properties and are essential components of the tumor immune microenvironment (TIME). While some cytokines are known to be universally upregulated in the TIME, the unique cytokine expression patterns of specific cancer types have not been fully resolved. To address this challenge, we develop a TIME single-cell RNA sequencing (scRNA-seq) dataset designed to study cytokine expression patterns for precise cancer classification. The dataset, covering 39 cancers, is constructed by integrating 684 tumor scRNA-seq samples from multiple public repositories. After screening and processing, the dataset retains only the expression data of immune cells. Using a machine learning classification model, unique cytokine expression patterns are identified for various cancer categories and, for the first time, applied to cancer classification, achieving an accuracy of 78.01%. Our method will not only deepen the understanding of cancer-type-specific immune modulation in the TIME but also serve as a crucial reference for future diagnostic and therapeutic research in cancer immunity.
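The classification step can be pictured as a standard multiclass problem over per-sample cytokine expression features, as in the sketch below. The feature matrix here is random; the actual study uses curated immune-cell expression profiles and its own model choice, which may differ from the logistic regression shown.

```python
# Illustrative sketch: cancer-type classification from cytokine expression features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_cytokines, n_cancer_types = 200, 50, 5
X = rng.lognormal(size=(n_samples, n_cytokines))      # pseudo expression values
y = rng.integers(0, n_cancer_types, size=n_samples)   # pseudo cancer-type labels

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, np.log1p(X), y, cv=5).mean())
```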
GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text
Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information contained in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. We achieve a 5%–10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform additional downstream tasks, such as compound name recognition and chemical reaction prediction.
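A highly simplified view of modality alignment is given below: each modality encoder is replaced by a random feature plus a linear projection head, and the shared latent space is enforced with a symmetric contrastive loss. GIT-Former itself uses a cross-attention architecture, so this is purely illustrative of the alignment objective, not of the model.

```python
# Simplified sketch of aligning graph, image, and text features in one latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_latent = 128
proj_graph = nn.Linear(300, d_latent)     # placeholder graph-encoder output dim
proj_image = nn.Linear(512, d_latent)     # placeholder image-encoder output dim
proj_text = nn.Linear(768, d_latent)      # placeholder text-encoder output dim

def align_loss(z_a, z_b, temperature=0.07):
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    labels = torch.arange(z_a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

batch = 16
z_g = proj_graph(torch.randn(batch, 300))
z_i = proj_image(torch.randn(batch, 512))
z_t = proj_text(torch.randn(batch, 768))
loss = align_loss(z_g, z_t) + align_loss(z_i, z_t)   # align graph and image to text
loss.backward()
```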
SCREP: Towards Single-Cell Drug Response Prediction by Pharmacogenomic Embedding Enhanced Meta-Pretraining and Few-Shot Transfer Learning
Analyzing drug response at the cellular level is crucial for identifying biomarkers and understanding mechanisms of resistance. Although studies of the drug response of individual cells can provide novel insights into tumor heterogeneity, pharmacogenomic data linked to single-cell (SC) RNA sequencing are often limited. Transfer learning offers a promising way to translate knowledge of drug response from bulk cell lines to SC analysis, potentially providing an effective solution to this challenge. Previous studies often use data from single drug-cell-line pairs to pre-train specific models and adapt them to SC datasets, which lacks pharmacogenomic information from other drugs and hinders model generalization. In this work, we introduce MetaSCDrug, a unified meta pre-training framework that integrates molecular information with transcriptomic data to simultaneously model cellular heterogeneity in response to multiple pre-trained drugs and generalize to unseen drugs. Our model requires only one pre-training session, followed by fine-tuning on multiple single-cell datasets via few-shot learning, achieving an average accuracy increase of 4.58% in drug response prediction compared to the baselines. Furthermore, our meta pre-training strategy effectively captures transcriptome heterogeneity when generalizing to unseen drugs, achieving a 20% improvement over the model without meta pre-training. Case studies of our framework highlight its capability to identify genes critical for resistance, providing a method for exploring drug action pathways and understanding resistance mechanisms.
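The basic setup, a response predictor that fuses a drug representation with a cell transcriptome representation and is later adapted with a handful of labeled single cells, can be sketched as follows. Encoders, dimensions, and the adaptation loop are simplified placeholders; the paper's meta pre-training procedure is not reproduced here.

```python
# Hedged sketch of a drug-response predictor with few-shot adaptation.
import torch
import torch.nn as nn

class ResponseNet(nn.Module):
    def __init__(self, n_genes=978, d_drug=256, d_hidden=128):
        super().__init__()
        self.cell_enc = nn.Sequential(nn.Linear(n_genes, d_hidden), nn.ReLU())
        self.drug_enc = nn.Sequential(nn.Linear(d_drug, d_hidden), nn.ReLU())
        self.head = nn.Linear(2 * d_hidden, 2)      # sensitive vs. resistant

    def forward(self, expr, drug_feat):
        z = torch.cat([self.cell_enc(expr), self.drug_enc(drug_feat)], dim=-1)
        return self.head(z)

model = ResponseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Few-shot adaptation on a small labeled single-cell set (random stand-ins).
expr, drug_feat = torch.randn(32, 978), torch.randn(32, 256)
labels = torch.randint(0, 2, (32,))
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(expr, drug_feat), labels)
    loss.backward()
    opt.step()
```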
AI for Temporal Data
Temporal data, also known as time series data, consists of sequences of observations recorded over time, often at regular intervals. Such data takes many forms, including stock prices, gravitational waves, weather records, and sensor readings from IoT devices. Artificial intelligence (AI) has revolutionized many fields, and temporal data analysis is no exception: the ability to understand and predict patterns in time series carries significant scientific and commercial value across industries. From astronomical discoveries and weather forecasting to customer behavior analysis and financial applications, the insights gained from temporal data can drive informed decision-making and competitive advantage. As AI techniques continue to evolve, we expect even more innovative applications and benefits in the future.
Dilated convolutional neural network for detecting extreme-mass-ratio inspirals
The detection of extreme-mass-ratio inspirals (EMRIs) is intricate due to their complex waveforms, extended duration, and low signal-to-noise ratio (SNR), making them more challenging to identify than compact binary coalescences. While matched-filtering-based techniques are known for their computational demands, existing deep learning-based methods primarily handle time-domain data and are often constrained by data duration and SNR. In addition, most existing work ignores time-delay interferometry (TDI) and applies the long-wavelength approximation in detector response calculations, thus limiting their ability to handle laser frequency noise. In this study, we introduce DECODE (DilatEd COnvolutional neural network for Detecting Extreme-mass-ratio inspirals), an end-to-end model focusing on EMRI signal detection through sequence modeling in the frequency domain. Centered around a dilated causal convolutional neural network and trained on synthetic data incorporating the TDI-1.5 detector response, DECODE can efficiently process a year's worth of multichannel TDI data with an SNR of around 50. We evaluate our model on 1-year data with accumulated SNR ranging from 50 to 120 and achieve a true positive rate of 96.3% at a false positive rate of 1%, with an inference time of less than 0.01 seconds. With the visualization of three showcased EMRI signals for interpretability and generalization, DECODE exhibits strong potential for future space-based gravitational wave data analysis.
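The core building block, a stack of dilated causal 1-D convolutions whose receptive field grows exponentially with depth, is sketched below. Channel counts, depth, and the frequency-domain input pipeline are placeholders rather than the published DECODE configuration.

```python
# Minimal sketch of a dilated causal convolution stack for sequence classification.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, c_in, c_out, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation           # left-pad only -> causal
        self.conv = nn.Conv1d(c_in, c_out, kernel, dilation=dilation)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

layers = []
channels = 2                                         # e.g. two TDI channels
for d in [1, 2, 4, 8, 16, 32]:                       # exponentially growing dilation
    layers += [CausalConv1d(channels, 32, dilation=d), nn.ReLU()]
    channels = 32
backbone = nn.Sequential(*layers, nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 2))

x = torch.randn(4, 2, 4096)                          # batch of frequency-domain segments
print(backbone(x).shape)                             # -> torch.Size([4, 2]), signal vs. noise
```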
Compact Binary Systems Waveform Generation with Generative Pre-trained Transformer
Space-based gravitational wave (GW) detection is one of the most anticipated GW detection projects of the next decade and promises to detect abundant compact binary systems. At present, deep learning methods have not been widely explored for GW waveform generation and extrapolation. To address the data processing difficulty and the increasing waveform complexity caused by the detector response and second-generation time-delay interferometry (TDI 2.0), we propose an interpretable pre-trained large model named CBS-GPT (Compact Binary Systems Waveform Generation with Generative Pre-trained Transformer). For compact binary system waveforms, three models were trained to predict the waveforms of massive black hole binaries (MBHBs), extreme-mass-ratio inspirals (EMRIs), and galactic binaries (GBs), achieving prediction accuracies of up to 99%, 91%, and 99%, respectively. CBS-GPT exhibits notable generalization and interpretability, with its hidden parameters effectively capturing the intricate information of waveforms even under the complex instrument response and across a wide parameter range. Our research demonstrates the potential of large models in the GW realm, opening up new opportunities and guidance for future research such as complex waveform generation, gap completion, and deep learning model design for GW science.
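The GPT-style formulation can be pictured as autoregressive prediction over waveform patches: the strain series is split into fixed-length patches and a causal transformer predicts each next patch from the preceding ones. Patch size, model size, and the regression loss below are illustrative choices, not the CBS-GPT configuration.

```python
# Hedged sketch of autoregressive next-patch waveform prediction.
import torch
import torch.nn as nn

patch, d_model = 64, 128
to_emb = nn.Linear(patch, d_model)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, 4, batch_first=True), num_layers=2)
to_patch = nn.Linear(d_model, patch)

wave = torch.randn(8, 32 * patch)                    # toy waveforms, 32 patches each
patches = wave.view(8, 32, patch)
causal_mask = nn.Transformer.generate_square_subsequent_mask(31)

hidden = decoder(to_emb(patches[:, :-1]), mask=causal_mask)   # predict patch t+1 from <= t
pred = to_patch(hidden)
loss = nn.functional.mse_loss(pred, patches[:, 1:])
loss.backward()
```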
WaveFormer: transformer-based denoising method for gravitational-wave data
With the advent of gravitational-wave astronomy and the discovery of more compact binary coalescences, data-quality improvement techniques are needed to handle the complex and overwhelming noise in gravitational wave (GW) observational data. Although recent machine learning-based studies have shown promising denoising results, they are unable to precisely recover both the GW signal amplitude and phase. To address this issue, we develop a deep-neural-network-centered workflow, WaveFormer, for significant noise suppression and signal recovery on observational data from the Laser Interferometer Gravitational-Wave Observatory (LIGO). WaveFormer has a science-driven architecture design with hierarchical feature extraction across a broad frequency spectrum. As a result, overall noise and glitches are decreased by more than one order of magnitude, and the signal recovery error is roughly 1% for the phase and 7% for the amplitude. Moreover, on 75 reported binary black hole events from LIGO, we obtain a significant improvement in inverse false alarm rate. Our work highlights the potential of large neural networks in GW data analysis and, while primarily demonstrated on LIGO data, its adaptable design indicates promise for broader application within the International Gravitational-Wave Observatory Network in future observing runs.
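At its simplest, the denoising task is supervised regression from a noisy strain segment to the underlying clean signal, as in the sketch below. The tiny model here is a stand-in; WaveFormer's hierarchical, broad-band architecture and its whitening and segmentation pipeline are considerably more involved.

```python
# Minimal sketch of transformer-based denoising of a strain segment.
import torch
import torch.nn as nn

seg_len, d_model = 256, 64
model = nn.Sequential(
    nn.Unflatten(1, (seg_len // 16, 16)),            # group samples into 16-sample tokens
    nn.Linear(16, d_model),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2),
    nn.Linear(d_model, 16),
    nn.Flatten(1),
)

clean = torch.sin(torch.linspace(0, 20, seg_len)).repeat(8, 1)   # toy "signal"
noisy = clean + 0.5 * torch.randn_like(clean)
loss = nn.functional.mse_loss(model(noisy), clean)               # recover amplitude and phase
loss.backward()
```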
GWAI: Harnessing Artificial Intelligence for Enhancing Gravitational Wave Data Analysis
Gravitational wave (GW) astronomy has opened new frontiers in understanding the cosmos, while the integration of artificial intelligence (AI) into science promises to revolutionize data analysis methodologies. However, a significant gap exists: there is currently no dedicated platform that enables scientists to develop, test, and evaluate AI algorithms efficiently. To address this gap, we introduce GWAI, a pioneering AI-centered software platform designed for gravitational wave data analysis. GWAI features a three-layered architecture that emphasizes simplicity, modularity, and flexibility, covering the entire analysis pipeline. GWAI aims to accelerate scientific discoveries, bridging the gap between advanced AI techniques and astrophysical research.
Taiji Data Challenge for Exploring Gravitational Wave Universe
The direct observation of gravitational waves (GWs) opens a new window for exploring new physics from quanta to the cosmos and provides a new tool for probing the evolution of the universe. GW detection in space covers a broad spectrum spanning more than four orders of magnitude and enables us to study rich physical and astronomical phenomena. Taiji is a proposed space-based GW detection mission that will be launched in the 2030s. Taiji will be exposed to numerous overlapping and persistent GW signals buried in the foreground and background, posing various data analysis challenges. To empower potential scientific discoveries, the Mock Laser Interferometer Space Antenna (LISA) Data Challenge and the LISA Data Challenge (LDC) were developed. While the LDC provides a baseline framework, it needs to be updated with more realistic simulations and with detector responses adjusted for Taiji's constellation. In this paper, we review the scientific objectives and the roadmap for Taiji, as well as the technical difficulties in data analysis and the data generation strategy, and present the associated data challenges. In contrast to the LDC, we utilize second-order Keplerian orbits and second-generation time-delay interferometry techniques. Additionally, we employ a new model for the extreme-mass-ratio inspiral waveform and the stochastic GW background spectrum, which enables us to test general relativity and measure the non-Gaussianity of curvature perturbations. Furthermore, we present a comprehensive showcase of parameter estimation using a toy dataset. This showcase not only demonstrates the scientific potential of the Taiji Data Challenge but also serves to validate the effectiveness of the pipeline. As the first data challenge for Taiji, we aim to build an open ground for data analysis related to Taiji sources and science. More details can be found on the official website http://taiji-tdc.ictp-ap.org.
Space-based gravitational wave signal detection and extraction with deep neural network
Space-based gravitational wave (GW) detectors will be able to observe signals from sources that are nearly impossible to detect with current ground-based facilities. Consequently, the well-established signal detection method, matched filtering, will require a complex template bank, leading to a computational cost that is prohibitive in practice. Here, we develop a high-accuracy GW signal detection and extraction method for all space-based GW sources. As a proof of concept, we show that a science-driven and uniform multi-stage self-attention-based deep neural network can identify synthetic signals that are submerged in Gaussian noise. Our method exhibits a detection rate exceeding 99% in identifying signals from various sources, with a signal-to-noise ratio of 50, at a false alarm rate of 1%, while obtaining at least 95% similarity compared with target signals. We further demonstrate the interpretability and strong generalization of the method in several extended scenarios.
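The quoted detection-rate-at-fixed-false-alarm-rate metric can be computed as in the short sketch below: choose the score threshold that yields a 1% false alarm rate on noise-only segments, then measure the fraction of signal segments above it. The scores here are random stand-ins for the network's outputs.

```python
# Illustrative computation of detection rate at a 1% false alarm rate.
import numpy as np

rng = np.random.default_rng(0)
noise_scores = rng.normal(0.0, 1.0, 100_000)         # network output on pure noise
signal_scores = rng.normal(4.0, 1.0, 10_000)         # network output on injected signals

threshold = np.quantile(noise_scores, 0.99)           # 1% of noise segments exceed this
detection_rate = (signal_scores > threshold).mean()
print(f"threshold={threshold:.2f}, detection rate={detection_rate:.3f}")
```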
Sampling with Prior Knowledge for High-dimensional Gravitational Wave Data Analysis
Extracting knowledge from high-dimensional data is notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This is a particularly profound issue for high-dimensional gravitational wave data analysis, where one must conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to construct the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, so that the model can learn subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, such that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model takes only approximately 1 s on a single V100 GPU to generate thousands of samples for probabilistic inference. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inference and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.
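The amortized-inference idea behind this speed is that, once a normalizing flow is trained, posterior sampling reduces to pushing base-distribution samples through the flow in a single forward pass. The two-parameter affine coupling below is a toy stand-in, far simpler than the expressive flow used in the paper.

```python
# Minimal sketch of posterior sampling with a toy normalizing flow.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms (x1, x2) -> (x1, x2 * exp(s(x1)) + t(x1))."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 2))

    def forward(self, z):
        z1, z2 = z[:, :1], z[:, 1:]
        s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, z2 * torch.exp(s) + t], dim=-1)

# A real flow would permute dimensions between coupling layers and be trained
# on physics-informed interim distributions; this stack is untrained and toy-sized.
flow = nn.Sequential(AffineCoupling(), AffineCoupling())

with torch.no_grad():
    base = torch.randn(10_000, 2)                     # thousands of samples in one pass
    posterior_samples = flow(base)
print(posterior_samples.shape)
```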