Next-generation sequencing, with rapidly increasing throughput and sharply falling cost, is fostering a wave of genomic, transcriptomic and epigenomic technologies. In particular, advances in sequencing technology have reduced the amount of DNA required for library preparation to the point that the DNA or RNA in a single cell suffices for sequencing. Single-cell sequencing studies have provided exciting insights into biology and medicine at the unitary resolution of life. For example, single-cell genome studies have identified somatic mutations in individual cells and reconstructed cancer cell lineage trees [1,2]. Single-cell RNA-seq has provided the gene expression profile of each individual cell and identified biologically distinct cells even when they cannot be distinguished by marker genes or cell morphology [3,4]. Furthermore, single-cell sequencing is expected to transform our understanding of fundamental questions in disciplines such as developmental biology, stem cell biology, cell fate decision, immunity, gene regulatory networks and metagenomics [5].
The development of single-cell sequencing has many clinical implications, particularly in precision medicine [6]. Precision medicine is the paradigm of future health care, in which each patient receives treatments matched precisely to his or her genetic features [7,8]. The rapid development of high-throughput biotechnologies and the identification of numerous patient-specific genetic variants have provided the basis for precision medicine. For example, genome sequencing has revealed the mutations underlying cystic fibrosis, and customized treatments are being developed [9,10]. However, the widespread cellular heterogeneity in clinical samples remains a challenge, because it can lead to selective drug resistance and treatment failure [11-14]. Single-cell sequencing is well suited to identifying variation in heterogeneous samples and is therefore critical to the success of precision medicine.
With a background in both biology and computation, I am interested in addressing questions in human genetics, epigenetics and biomedicine using high-throughput technologies. I would also like to improve and develop high-throughput biotechnologies, especially single-cell sequencing, to systematically investigate mechanisms of gene regulation. Furthermore, I am interested in clinical applications of single-cell sequencing such as clinical diagnosis and medical consultation. My research plans for the next several years will focus primarily on two closely related directions:
1. Developing single-cell sequencing and extending its clinical application. Single-cell sequencing can provide comprehensive genetic features, including cell subpopulations, cellular heterogeneity and cellular hierarchy. These features are essential for precision medicine, for example for understanding tumor development and drug resistance in cancer treatment. However, single-cell sequencing data usually have high false-negative rates and high noise, which has limited large-scale application.
1.1 Improving scDNase-seq and developing single-cell epigenomic technologies. Although significant advances have been made in single-cell genome sequencing and single-cell RNA-seq, progress in single-cell epigenomics has been slow because preparing single-cell epigenomic libraries is much more complex and usually causes serious DNA loss [6]. Our scDNase-seq has greatly contributed to the understanding of DNase I hypersensitive sites in single cells, but there is still considerable room for improvement [15]. For example, the whole experiment could be completed in one tube or well to reduce the DNA loss caused by tube changes, and the circular carrier DNA could be optimized to reduce nonspecific amplification. I am working on these challenges with my colleagues and expect the approach to improve significantly. Furthermore, the principles and ideas underlying scDNase-seq could be applied in other single-cell epigenomic studies. Beyond scDNase-seq, we are also working on single-cell ChIP-seq and single-cell MNase-seq. I am confident that our single-cell epigenomic studies will contribute significantly to biomedical research in the future.
1.2 Single-cell variant calling and imputation. With the development of next-generation sequencing, many software packages and computational tools have been developed, greatly facilitating its application. However, applying these tools directly to call variants from single-cell genomic data leads to a high false-positive rate, because such data usually comprise a large number of libraries, each with very low coverage and high noise. We will develop variant-calling software that considers the features of single-cell genomic data. More specifically, mutations and allelic imbalances introduced during genome amplification in library preparation will be taken into account. We will use three strategies to handle the errors introduced during amplification. First, genomic data from bulk cells can be used as a reference to reduce false positives. Second, two or three cells can be required to carry the same variant at the same position, which is unlikely to result from random amplification errors. Third, we will record loci that are not covered by any read in a single-cell genome as missing data, which makes it possible to recover the genetic information later. When some values at a locus are missing, most statistical methods and packages by default discard the locus in subsequent analysis, which reduces the representativeness of the data or even introduces bias. Imputation will preserve all loci by replacing missing data with estimated values inferred from haplotype information. In summary, we will develop software for single-cell variant calling that takes into account the specific features of single-cell genomic data.
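The three strategies can be illustrated with a minimal Python sketch. It is not the planned software: the function names (`filter_variants`, `genotype_table`), the input layout and the default threshold of two cells are assumptions made for illustration, and the imputation step itself is left out.

```python
from collections import Counter


def filter_variants(calls, bulk_variants, min_cells=2):
    """Keep candidate variants supported by the bulk sample or by several cells.

    calls: {cell_id: {locus: alt_allele}}, raw per-cell candidate calls
    bulk_variants: set of (locus, alt_allele) pairs seen in the bulk sample
    min_cells: hypothetical threshold for the "two or three cells" rule
    """
    # Count how many cells carry each (locus, allele) candidate.
    support = Counter(
        (locus, allele)
        for cell in calls.values()
        for locus, allele in cell.items()
    )
    return {v for v, n in support.items() if v in bulk_variants or n >= min_cells}


def genotype_table(calls, covered, accepted):
    """Build a per-cell genotype table that keeps uncovered loci as missing.

    covered: {cell_id: set of loci with at least one read}
    Values: 1 = variant present, 0 = covered without the variant,
            None = not covered (missing, left for later imputation).
    """
    table = {}
    for cell_id in calls:
        row = {}
        for locus, allele in sorted(accepted):
            if locus not in covered[cell_id]:
                row[(locus, allele)] = None
            else:
                row[(locus, allele)] = int(calls[cell_id].get(locus) == allele)
        table[cell_id] = row
    return table
```

Keeping `None` distinct from `0` is the point of the third strategy: a locus without reads is not evidence for the reference allele, and a later imputation step can fill it in rather than having the locus discarded.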
1.3 Pipeline and computational tools for single-cell sequencing studies. Single-cell sequencing studies usually generate a large number of single-cell genomes, transcriptomes or epigenomes. However, these data yield no insight until they are analyzed, and data analysis has become one of the most rate-limiting steps for scientists seeking biological insights. I realized that studying the distribution of variants and mutations, gene expression dynamics or epigenomic dynamics in a cell population is similar to classic population genetic analysis. Therefore, I will develop a computational tool that integrates most population genetic modules for the analysis of single-cell sequencing data. The tool will cluster the cell population based on distance functions that provide a quantitative measure of the differences between pairs of cells. It will generate a cell lineage tree when single-cell genomic data are provided, report the genetic variants or epigenetic changes that distinguish cell subpopulations, and infer potential driver mutations based on mutation frequency and haplotype. By integrating these software and computational tools, we will develop a pipeline for automated data analysis that directly provides genomic variants, epigenetic changes, cell lineage relationships, cell subpopulations, variants that differ between subpopulations, potential driver mutations and cellular dynamics.
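As an illustration of distance-based clustering of cells, the sketch below uses a simple mismatch distance and average-linkage agglomerative clustering. Both choices, and the names `cell_distance` and `cluster_cells`, are assumptions for this example and not the design of the planned tool. Missing genotypes are represented as `None` and ignored when comparing two cells.

```python
from itertools import combinations


def cell_distance(a, b):
    """Fraction of loci observed in both cells at which the genotypes differ."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        # Assumption: cells with no jointly observed locus are maximally distant.
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def cluster_cells(profiles, n_clusters):
    """Average-linkage agglomerative clustering of cells.

    profiles: {cell_id: list of genotypes, with None for missing values}
    Returns a list of clusters, each a list of cell ids.
    """
    clusters = [[cell_id] for cell_id in profiles]
    # Precompute all pairwise distances between cells.
    dist = {
        frozenset((a, b)): cell_distance(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > max(n_clusters, 1):
        # Merge the closest pair of clusters (i < j, so deleting j is safe).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The order of merges defines a dendrogram over the cells. That is only a similarity tree; inferring a true lineage tree would require an explicit evolutionary model.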
1.4 Modeling cellular epigenomic dynamics and regulatory networks. Epigenetic status varies in both time and space, even within the same cell, because of variation in the microenvironment. Furthermore, the gene expression data used to construct regulatory networks are average expression profiles of millions of cells, which may not reflect the regulatory network in each cell. Single-cell RNA-seq has provided the expression profiles of tens of thousands of single cells, which offers a unique opportunity to construct accurate gene regulatory networks. I will explore cellular epigenomic dynamics under different conditions in order to propose a model of cellular epigenomic dynamics. I will refine regulatory networks derived from bulk cells by checking each edge against single-cell transcriptome data. I will also explore deep learning approaches for projecting discrete networks into continuous, low-dimensional representations that filter noise.
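One simple way to check each edge of a bulk-derived network against single-cell data is to test whether regulator and target are co-expressed across cells. The sketch below uses Pearson correlation as a stand-in criterion. The threshold `min_abs_r`, the function names and the data layout are assumptions for illustration, and effects specific to single-cell data, such as dropout, are not modeled.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def check_edges(edges, expression, min_abs_r=0.5):
    """Split bulk-network edges by whether single-cell data support them.

    edges: iterable of (regulator, target) gene pairs from the bulk network
    expression: {gene: [expression value in each single cell]}
    min_abs_r: hypothetical cutoff on the absolute correlation
    Returns (supported, unsupported), each a list of (regulator, target, r).
    """
    supported, unsupported = [], []
    for regulator, target in edges:
        r = pearson(expression[regulator], expression[target])
        bucket = supported if abs(r) >= min_abs_r else unsupported
        bucket.append((regulator, target, r))
    return supported, unsupported
```

Using the absolute correlation keeps both activating and repressing edges. A real analysis would also need to account for confounders such as cell cycle and sequencing depth.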
1.5 Developing technology for generating multi-omic data from a single cell. Two studies have developed technologies to investigate the genome and transcriptome in the same cell [16,17]. Although the data are still very noisy, these studies have made interesting discoveries, for example that genes with high cell-to-cell variability in transcript numbers generally have lower genomic copy numbers. It will be even more interesting to systematically investigate the epigenome and transcriptome in the same cell, which could provide detailed information on gene regulation at the cellular level. Although single-cell RNA-seq and our recently developed scDNase-seq have emerged as powerful tools for studying gene expression and chromatin accessibility in single cells, it is currently not possible to explore the relationship between epigenome and transcriptome in the same cell, because both scDNase-seq and single-cell RNA-seq need a whole cell to prepare a library. I will pursue this idea over the long term, although capturing the epigenome and transcriptome of the same single cell is very challenging.
1.6 Applying single-cell sequencing to clinical samples. The heterogeneity of cancer cells poses significant challenges for designing effective cancer treatments. Recently, applications of single-cell genome or transcriptome sequencing to patient samples have greatly increased our knowledge of tumor heterogeneity. However, to my knowledge there has been no single-cell epigenome study of clinical samples, although such a study may provide interesting results. Applying multi-omic single-cell sequencing to cancer cells will also provide an opportunity to systematically characterize single cancer cells. We could identify genes or epigenetic states that change significantly during oncogenesis. Using motif analysis, we could further infer the transcription factors that reshape the epigenetic landscape among single cells and are therefore more likely to play a causal role in oncogenesis.
2. Predicting disease risk by multi-omics data integration. Accurate prediction of disease risk and drug sensitivity is one of the pillars of precision medicine. Although genome-wide association studies (GWAS) have identified thousands of reproducible SNPs associated with complex diseases and traits, the combined predictive power of these associations is generally too low to be clinically relevant. Recent technological advances have led to the creation of various omics data, from genomic to transcriptomic, epigenomic, proteomic and metabolomic data. These multi-dimensional omics data offer the potential to develop effective models that predict complex traits and diseases, although assembling all of them into a complete biological story is immensely challenging.
2.1 Gene regulatory networks integrating epigenetic data. Constructing gene regulatory networks is one of the most important problems in systems biology. Regulatory networks constructed from gene expression data are still inaccurate and noisy, partly because of limited data and the complexity of biology. Epigenetic features, including histone modifications, play key roles in genome function, cellular differentiation and human disease, so integrating epigenetic data into the analysis could significantly improve the regulatory network. I will apply dynamic Bayesian networks to infer gene regulatory networks from time-series gene expression data. I will also try to develop new methods for the systematic discovery and characterization of regulatory networks, in order to explore interactions among genetic variation, gene expression, the epigenome, the environment and disease status.
2.2 Network-centric association studies of complex diseases and traits. GWAS are the key approach for correlating genomic variation with changes in gene expression, epigenetic state and phenotypic traits or diseases. However, these studies typically focus on common genetic variants because of both cost and ease of representation. Using regulatory networks, I will reduce the dimensionality of each human genome by translating individual genomes into personalized transcriptional regulatory networks that differ between individuals according to their personal genetic burdens. My research will focus on identifying differences in regulatory network architecture between healthy individuals and patients. Initially, I will explore the approach using publicly available data such as The Cancer Genome Atlas (TCGA).
2.3 Data integration by meta-dimensional analysis. Data integration has been widely used for interpreting multi-omics data. The conventional approach divides the analysis into multiple steps, with signals enriched at each step. However, this approach will fail to model a complex trait if the trait results from a combination of genomic variants, gene expression variability, epigenetic states and protein structure rather than from a stepwise linear process. Meta-dimensional analysis combines multiple data types in a simultaneous analysis, which could significantly improve the prediction of complex traits or disease risk. I will try to improve meta-dimensional analysis by integrating Bayesian probability.
2.4 Interactive database for medical consultation. We will develop an interactive database that collects both genetic risk biomarkers and drug-sensitivity variants. Researchers will be encouraged to submit their results, particularly results published in peer-reviewed journals, which will make the database more comprehensive. Submitted items will be manually checked before being formally added to the database. Controversial items will be flagged so that researchers and customers are alerted when a conclusion is derived from them. Finally, the database will match customers' genetic features, including genetic variants, against its entries to provide medical advice and solutions. Customers will be encouraged to provide feedback after their medical care, which could further enhance the database.
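The submission, review and matching workflow described above could be modeled roughly as follows. This is a sketch under assumed names: the `Entry` fields, the `match_customer` function and the placeholder variant identifiers are hypothetical and do not describe an existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One submitted association between a variant and a risk or drug response."""
    variant: str                 # hypothetical variant identifier
    association: str             # free-text description of the finding
    source: str                  # e.g. the peer-reviewed publication
    reviewed: bool = False       # set after the manual check
    controversial: bool = False  # flagged when the evidence is disputed


def match_customer(customer_variants, entries):
    """Return reviewed entries matching a customer's variants, with caution notes.

    customer_variants: set of variant identifiers carried by the customer
    entries: iterable of Entry objects
    """
    hits = []
    for entry in entries:
        # Only manually reviewed items are used for matching.
        if entry.reviewed and entry.variant in customer_variants:
            note = "controversial: interpret with caution" if entry.controversial else ""
            hits.append((entry, note))
    return hits
```

Unreviewed submissions are never matched, and a controversial entry is returned together with a note rather than hidden, so the alert reaches whoever draws a conclusion from it.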
I have a strong passion for interdisciplinary research and therefore plan to assemble a research group of trainees with both biological and computational interests to pursue exciting science. I will work with scientists from diverse backgrounds to develop and apply single-cell sequencing and to predict disease risk by multi-omics data integration. I believe my expertise in epigenomics, population genetics, association studies, systems biology and statistics will be instrumental in addressing the challenges that arise in pursuing these research goals.
1. Meacham, C.E. & Morrison, S.J. Tumour heterogeneity and cancer cell plasticity. Nature 501, 328-37 (2013).
2. Hou, Y. et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148, 873-85 (2012).
3. Xue, Z. et al. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500, 593-7 (2013).
4. Shalek, A.K. et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236-40 (2013).
5. Shapiro, E., Biezuner, T. & Linnarsson, S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet 14, 618-30 (2013).
6. Gawad, C., Koh, W. & Quake, S.R. Single-cell genome sequencing: current state of the science. Nat Rev Genet 17, 175-88 (2016).
7. Reardon, S. Precision-medicine plan raises hopes. Nature 517, 540 (2015).
8. Collins, F.S. & Varmus, H. A new initiative on precision medicine. N Engl J Med 372, 793-5 (2015).
9. Green, D.M. Cystic fibrosis: a model for personalized genetic medicine. N C Med J 74, 486-7 (2013).
10. Kaiser, J. Personalized medicine. New cystic fibrosis drug offers hope, at a price. Science 335, 645 (2012).
11. Zahreddine, H.A. et al. The sonic hedgehog factor GLI1 imparts drug resistance through inducible glucuronidation. Nature 511, 90-3 (2014).
12. Alsford, S. et al. High-throughput decoding of antitrypanosomal drug efficacy and resistance. Nature 482, 232-6 (2012).
13. Rathert, P. et al. Transcriptional plasticity promotes primary and acquired resistance to BET inhibition. Nature 525, 543-7 (2015).
14. Straussman, R. et al. Tumour micro-environment elicits innate resistance to RAF inhibitors through HGF secretion. Nature 487, 500-4 (2012).
15. Jin, W. et al. Genome-wide detection of DNase I hypersensitive sites in single cells and FFPE tissue samples. Nature 528, 142-6 (2015).
16. Macaulay, I.C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods 12, 519-22 (2015).
17. Dey, S.S., Kester, L., Spanjaard, B., Bienko, M. & van Oudenaarden, A. Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol 33, 285-9 (2015).