RNA-sequencing and bioinformatics analyses were used to identify significantly differentially expressed transcription factors in Lin-CD34+ and Lin-CD34- subpopulations of mouse EML cells. These transcription factors might play important roles in determining the switch between self-renewing Lin-CD34+ and partially differentiated Lin-CD34- cells.
Hematopoietic stem cells (HSCs) are used clinically in transplantation to rebuild a patient's hematopoietic system in many diseases such as leukemia and lymphoma. Elucidating the mechanisms controlling HSC self-renewal and differentiation is important for the application of HSCs in research and clinical use. However, it is not possible to obtain large quantities of HSCs due to their inability to proliferate in vitro. To overcome this hurdle, we used a mouse bone marrow derived cell line, the EML (Erythroid, Myeloid, and Lymphocytic) cell line, as a model system for this study.
RNA-sequencing (RNA-Seq) has been increasingly used to replace microarray for gene expression studies. We report here a detailed method of using RNA-Seq technology to investigate the potential key factors in regulation of EML cell self-renewal and differentiation. The protocol provided in this paper is divided into three parts. The first part explains how to culture EML cells and separate Lin-CD34+ and Lin-CD34- cells. The second part of the protocol offers detailed procedures for total RNA preparation and the subsequent library construction for high-throughput sequencing. The last part describes the method for RNA-Seq data analysis and explains how to use the data to identify differentially expressed transcription factors between Lin-CD34+ and Lin-CD34- cells. The most significantly differentially expressed transcription factors were identified to be the potential key regulators controlling EML cell self-renewal and differentiation. In the discussion section of this paper, we highlight the key steps for successful performance of this experiment.
In summary, this paper offers a method of using RNA-Seq technology to identify potential regulators of self-renewal and differentiation in EML cells. The key factors identified are subjected to downstream functional analysis in vitro and in vivo.
Hematopoietic stem cells are rare blood cells that reside mainly in the adult bone marrow niche. They are responsible for the production of cells required to replenish the blood and the immune systems1. As stem cells, HSCs are capable of both self-renewal and differentiation. Elucidating the mechanisms that control the fate decision of HSCs, toward either self-renewal or differentiation, will offer valuable guidance on the manipulation of HSCs for blood disease research and clinical use2. One problem faced by researchers is that HSCs can be maintained and expanded in vitro only to a very limited extent; the vast majority of their progeny are partially differentiated in culture2.
In order to identify key regulators that control the processes of self-renewal and differentiation at a genome-wide scale, we used a mouse primitive hematopoietic progenitor cell line, EML, as a model system. This cell line was derived from murine bone marrow3,4. When fed with different growth factors, EML cells can differentiate into erythroid, myeloid, and lymphoid cells in vitro5. Importantly, this cell line can be propagated in large quantity in culture medium containing stem cell factor (SCF) while still retaining its multipotentiality. EML cells can be separated into subpopulations of self-renewing Lin-SCA+CD34+ and partially differentiated Lin-SCA-CD34- cells based on the surface markers CD34 and SCA6. Similar to short-term HSCs, SCA+CD34+ cells are capable of self-renewal. When treated with SCF, Lin-SCA+CD34+ cells can rapidly regenerate a mixed population of Lin-SCA+CD34+ and Lin-SCA-CD34- cells and continue to proliferate6. The two populations are similar in morphology and have similar levels of c-kit mRNA and protein6. Lin-SCA-CD34- cells are capable of propagating in media containing IL-3 instead of SCF3. Unveiling the key regulators of the EML cell fate decision will offer a better understanding of the cellular and molecular mechanisms of the early developmental transition during hematopoiesis.
In order to investigate the underlying molecular differences between the self-renewing Lin-SCA+CD34+ and partially differentiated Lin-SCA-CD34- cells, we used RNA-Seq to identify differentially expressed genes. In particular, we focus on transcription factors, as they are crucial in determining cell fate. RNA-Seq is a recently developed approach that utilizes the capabilities of next-generation sequencing (NGS) technologies to profile and quantify RNAs transcribed from the genome7,8. In brief, total RNA is poly-A selected and fragmented as the initial template. The RNA template is then converted into cDNA using reverse transcriptase. In order to map full-length RNA transcripts, it is important to use intact, non-degraded RNA for constructing the cDNA library. For the purpose of sequencing, specific adapter sequences are added to both ends of the cDNA. Then, in most cases, cDNA molecules are amplified by PCR and sequenced in a high-throughput manner.
After sequencing, the resulting reads can be aligned to a reference genome and a transcriptome database. The number of reads that map to each reference gene is counted, and this information can be used to estimate the gene expression level. The reads can also be assembled de novo without a reference genome, enabling the study of transcriptomes in non-model organisms9. RNA-Seq technology has also been used to detect splice isoforms10-12, novel transcripts13 and gene fusions14. In addition to the detection of protein-coding genes, RNA-Seq can also be used to detect novel non-coding RNAs and to analyze the transcription level of non-coding RNAs such as long non-coding RNA15,16, microRNA17, siRNA, etc.18. Because of the accuracy of this method, it has been utilized for detection of single nucleotide variations19,20.
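The count-based expression estimate described above is commonly normalized for transcript length and sequencing depth, for example as FPKM (fragments per kilobase of transcript per million mapped fragments). The following is a minimal Python sketch of that arithmetic only; the function name `fpkm` and all numbers are illustrative assumptions, not values from this study, and tools such as Cufflinks estimate abundances with a statistical model rather than this simple ratio.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical gene: 500 fragments on a 2 kb transcript in a library of 20 million fragments.
example = fpkm(fragment_count=500, transcript_length_bp=2_000,
               total_mapped_fragments=20_000_000)
print(f"FPKM = {example:.2f}")
```

Dividing by both length and depth is what makes values comparable between genes of different lengths and between libraries of different sizes.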
Before the advent of RNA-Seq technology, microarray was the main method used for analyzing gene expression profiles. Pre-designed probes are synthesized and subsequently attached to a solid surface to form a microarray slide21. mRNA is extracted and converted to cDNA. During the reverse transcription process, fluorescently labeled nucleotides are incorporated into the cDNA, and the cDNA is then hybridized onto the microarray slide. The intensity of the signal collected from a specific spot depends on the amount of cDNA binding to the specific probe on that spot21. Compared with RNA-Seq technology, microarray has several limitations. First, microarray relies on pre-existing knowledge of gene annotation, while RNA-Seq technology is able to detect novel transcripts. Second, microarray has a relatively high background level, which limits its use when the gene expression level is low. In addition, RNA-Seq technology has a much higher dynamic range of detection (8,000 fold)7, whereas, due to background and saturation of signals, the accuracy of microarray is limited for both highly and lowly expressed genes7,22. Finally, microarray probes differ in their hybridization efficiencies, which makes the results less reliable when comparing relative expression levels of different transcripts within one sample23. Although RNA-Seq has many advantages over microarray, its data analysis is complex, and various bioinformatics tools are required for RNA-Seq data processing and analysis24. This is one of the reasons that many researchers still use microarray instead of RNA-Seq.
Among several next-generation sequencing (NGS) platforms, 454, Illumina, SOLID and Ion Torrent are the most widely used. 454 was the first commercial NGS platform. In contrast to other sequencing platforms such as Illumina and SOLID, the 454 platform generates longer reads (average read length of 700 bases)25. Longer reads are better for initial characterization of a transcriptome due to their higher assembly efficiency25. The main disadvantage of the 454 platform is its high cost per megabase of sequence. The Illumina and SOLID platforms generate larger numbers of shorter reads, and their cost per megabase of sequence is much lower than that of the 454 platform. Due to the large numbers of short reads from the Illumina and SOLID platforms, data analysis is much more computationally intensive. For the Ion Torrent platform, the instrument and sequencing reagents are cheaper and the sequencing time is shorter25. However, its error rate and cost per megabase of sequence are higher than those of the Illumina and SOLID platforms. Different platforms have their own advantages and disadvantages and require different methods for data analysis. The platform should be chosen based on the sequencing purpose and the availability of funding.
In this paper, we take the Illumina RNA-Seq platform as an example. We used the EML cell line as a model system to investigate the key regulators of EML cell self-renewal and differentiation, and provide detailed methods of RNA-Seq library construction and data analysis for expression level calculation and novel transcript detection. We have shown in our previous publication2 that RNA-Seq study in the EML model system, when coupled with functional tests (e.g. shRNA knockdown), provides a powerful approach to understanding the molecular mechanism of the early stages of hematopoietic differentiation, and can serve as a model for the analysis of cell self-renewal and differentiation in general.
1. EML Cell Culture and Separation of Lin-CD34+ and Lin-CD34- Cells Using Magnetic Cell Sorting System and Fluorescence-activated Cell Sorting Method
2. RNA Preparation and Library Construction for High-throughput Sequencing
3. Data Analysis
For references for the software used in this part, please see Table 2.
Figure S1: Converting .bcl file to .fastq file using CASAVA software.
Figure S2: Mapping reads to reference genome using Tophat.
Figure S3: Detection of novel transcripts and expression level estimation.
Figure S4: Calling differentially expressed genes using the DESeq package.
Figure S5: Identification of differentially expressed transcription factors.
Figure S6: Converting mapping result for data visualization.
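As a rough orientation to how the mapping, quantification and visualization steps chain together, the sketch below assembles command lines for the tools listed in Table 2 without executing them. It is a hedged illustration, not the exact commands shown in the supplementary figures: the sample name, index name, file names and thread count are hypothetical placeholders, and the use of `bedtools genomecov -bg` to produce a bedGraph for `bedGraphToBigWig` is an assumption about the conversion step.

```python
from pathlib import Path


def build_commands(sample, bowtie_index, gtf, reads_1, reads_2, chrom_sizes, threads=4):
    """Return (argv, stdout_file) pairs for one sample; nothing is executed here."""
    out = Path(sample)
    bam = out / "tophat" / "accepted_hits.bam"  # TopHat's default alignment file name
    bedgraph = out / "coverage.bedGraph"
    return [
        # Map paired-end reads to the reference genome, guided by a known annotation.
        (["tophat", "-p", str(threads), "-G", gtf, "-o", str(out / "tophat"),
          bowtie_index, reads_1, reads_2], None),
        # Estimate expression levels of the transcripts in the annotation.
        (["cufflinks", "-p", str(threads), "-G", gtf, "-o", str(out / "cufflinks"),
          str(bam)], None),
        # Genome coverage in bedGraph format; bedtools writes it to standard output.
        (["bedtools", "genomecov", "-ibam", str(bam), "-bg"], str(bedgraph)),
        # Convert the bedGraph to bigWig for display in a genome browser.
        (["bedGraphToBigWig", str(bedgraph), chrom_sizes, str(out / "coverage.bw")], None),
    ]


# All names below are hypothetical placeholders.
commands = build_commands("CD34pos_rep1", "genome_index", "genes.gtf",
                          "CD34pos_rep1_R1.fastq", "CD34pos_rep1_R2.fastq",
                          "genome.chrom.sizes")
for argv, stdout_file in commands:
    print(" ".join(argv) + (f" > {stdout_file}" if stdout_file else ""))
```

Omitting `-G` from the Cufflinks call makes it assemble transcripts without a reference annotation, which corresponds to the first run of the two-run procedure described in the discussion. Printing the commands first makes it easy to review them before running anything.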
In order to analyze differentially expressed genes in Lin-CD34+ and Lin-CD34- EML cells, we used RNA-Seq technology. Figure 1 shows the workflow of the procedures. After isolation of lineage negative cells by magnetic cell sorting, we separated Lin-SCA+CD34+ and Lin-SCA-CD34- cells using FACS Aria. Lin-enriched EML cells were stained with anti-CD34, anti-Sca1 and lineage cocktail antibodies. Only Lin- cells were gated for analysis of Sca1 and CD34 expression. Two populations (SCA+CD34+ and SCA-CD34- EML cells) could be observed by FACS analysis (Figure 2)6.
After cell separation, we extracted total RNA from CD34+ and CD34- cells respectively and analyzed the quality of RNA. The accuracy of RNA-Seq data largely relies on the quality of RNA-Seq library and the quality of total RNA is vital for preparing a high quality library. High quality RNA sample should have an OD 260/280 value between 1.8 and 2.0. In addition to using the spectrophotometer, RNA quality was further assessed with greater accuracy by Bioanalyzer. Figure 3 shows a result of a high quality RNA sample with the RIN equal to 9.4. Only high quality total RNA sample with RIN value greater than 9 was used for mRNA extraction and subsequent library construction procedures.
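The two acceptance criteria above (OD 260/280 between 1.8 and 2.0, RIN greater than 9) can be written down as a simple check. This is a minimal sketch in Python; the function name and the sample values are hypothetical, apart from the RIN of 9.4 taken from Figure 3.

```python
def rna_passes_qc(od_260_280, rin, ratio_range=(1.8, 2.0), min_rin=9.0):
    """True if the sample meets both the purity (OD 260/280) and integrity (RIN) criteria."""
    low, high = ratio_range
    return low <= od_260_280 <= high and rin > min_rin


# Hypothetical measurements: (OD 260/280, RIN)
samples = {
    "CD34pos": (1.95, 9.4),
    "CD34neg": (1.62, 9.1),   # low ratio, e.g. residual organic solvent
    "degraded": (1.90, 7.8),  # acceptable purity but RIN too low
}
for name, (ratio, rin) in samples.items():
    status = "use for library construction" if rna_passes_qc(ratio, rin) else "do not use"
    print(f"{name}: {status}")
```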
Ribosomal RNA is the most abundant type of RNA in the cell. Currently two main strategies, depletion of rRNA or positive selection of polyadenylated mRNA (poly-A mRNA), are used for enrichment of target RNA before library construction. Non-polyadenylated RNA species are lost during the selection of poly-A mRNA. In contrast, rRNA depletion methods such as RiboMinus can preserve non-polyadenylated RNA species. The purpose of our study is to look for differentially expressed coding genes in two cell types, so we used the poly-A mRNA selection method for enrichment of target RNAs before library construction. When library construction was finished, the size of the DNA fragments in the library was checked using Bioanalyzer before sequencing. Figure 4 shows a good quality library with a fragment size peak at about 300 bp.
In the subsequent step, the library was subjected to high-throughput sequencing. In principle, a longer read length is helpful for read mapping, because it reduces the probability that a read maps to multiple locations due to similarity among duplicated genes or gene family members. Because paired-end sequencing reads each fragment from both ends, the read length chosen should be less than half of the average fragment length. If the main goal of the experiment is to measure expression levels rather than to construct transcript structures, single-end reads (75 or 100 bp) can reduce the cost without losing too much information. Paired-end sequencing is more useful for transcript structure construction, and a shorter read length can be used to reduce cost. Certainly, when sufficient funding is available, a longer read length is preferred.
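The rule of thumb for paired-end read length can be expressed as a small helper that picks the longest available read length below half of the average fragment length. This is a sketch under stated assumptions: the function name, the fragment length of 250 bp and the list of available read lengths are hypothetical, and the fragment length is taken to be the cDNA insert without adapters.

```python
def longest_non_overlapping_read(avg_fragment_bp, available_read_lengths):
    """Longest read length strictly below half the fragment length, or None if none fits."""
    limit = avg_fragment_bp / 2
    usable = [length for length in available_read_lengths if length < limit]
    return max(usable) if usable else None


# Hypothetical library with an average fragment length of 250 bp.
choice = longest_non_overlapping_read(250, [36, 50, 75, 100, 150])
print(f"Suggested paired-end read length: {choice} bp")
```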
For differential expression (DE) analysis, there are many alternative algorithms other than DESeq, including cuffdiff, which is part of the Cufflinks package32. DESeq is one of the most widely used count-based DE analysis algorithms. The DESeq method is based on a well-characterized statistical model, the negative binomial distribution. In our experience, DESeq is more stable compared to cuffdiff. Early versions of cuffdiff often gave significantly different numbers of DE genes. Therefore we used DESeq for DE analysis here.
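The negative binomial model mentioned above allows the variance of read counts to exceed their mean, which a Poisson model cannot. The sketch below shows the standard mean-variance relation, variance = mean + dispersion × mean², in Python. The dispersion value of 0.1 is an arbitrary illustration, and this is the textbook parametrization rather than DESeq's own estimation procedure.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count; dispersion = 0 reduces to the Poisson case."""
    return mean + dispersion * mean ** 2


for mean in (10, 100, 1000):
    poisson_var = nb_variance(mean, 0.0)
    nb_var = nb_variance(mean, 0.1)  # arbitrary illustrative dispersion
    print(f"mean={mean}: Poisson variance={poisson_var:.0f}, NB variance={nb_var:.0f}")
```

The extra quadratic term grows quickly with expression level, which is why count-based methods that ignore it tend to overstate significance for highly expressed genes.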
Because transcription factors are crucial for cell fate determination, we focused on the significantly differentially expressed transcription factors33. The TFs changed >1.5 fold between Lin-CD34+ and Lin-CD34- were found and are shown on the heatmap (Figure 5)2. Notably, the relative expression level of Tcf7 in Lin-CD34+ cells is more than 100 fold higher than that in Lin-CD34- cells. Thus Tcf7 was chosen for further ChIP-Sequencing (Chromatin Immunoprecipitation and sequencing) analysis and functional test to confirm Tcf7’s function in regulation of EML cell self-renewal and differentiation2.
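The fold-change filter on transcription factors can be sketched as follows. This Python example is an illustration under stated assumptions: the gene names, expression values, pseudocount and function name are hypothetical, and it applies only the >1.5 fold cutoff, whereas the analysis described here also requires statistical significance from DESeq.

```python
import math


def differential_tfs(expression, tf_names, min_fold=1.5, pseudocount=1.0):
    """Return {gene: log2(CD34+ / CD34-)} for transcription factors beyond the fold cutoff."""
    threshold = math.log2(min_fold)
    hits = {}
    for gene, (cd34_pos, cd34_neg) in expression.items():
        if gene not in tf_names:
            continue  # keep transcription factors only
        # The pseudocount avoids division by zero for unexpressed genes.
        log2_fc = math.log2((cd34_pos + pseudocount) / (cd34_neg + pseudocount))
        if abs(log2_fc) > threshold:
            hits[gene] = log2_fc
    return hits


# Hypothetical expression values: gene -> (CD34+, CD34-)
expression = {
    "TF_A": (199.0, 1.0),
    "TF_B": (10.0, 11.0),
    "TF_C": (3.0, 15.0),
    "Gene_D": (500.0, 5.0),  # not a transcription factor
}
hits = differential_tfs(expression, tf_names={"TF_A", "TF_B", "TF_C"})
for gene, log2_fc in sorted(hits.items()):
    print(f"{gene}: log2 fold change = {log2_fc:.2f}")
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a single absolute cutoff covers both directions.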
Figure 1: Workflow of the procedures. Lin-CD34+ and Lin-CD34- cells were separated by magnetic cell separation system and fluorescence-activated cell sorting method. Total RNA was extracted followed by mRNA purification and library construction. After analysis of library quality, samples were subjected to high throughput sequencing. Data were analyzed and differentially expressed transcription factors were identified.
Figure 2: Separation of Lin-CD34+ and Lin-CD34- EML cells6. Lin- EML cells were enriched by magnetic cell sorting. Lin- cells were stained with anti-CD34, anti-Sca1 and lineage mixture antibodies. Lin- cells were gated for expression of CD34 and Sca1. Lin-CD34+SCA+ and Lin-CD34-SCA- EML cell populations were sorted.
Figure 3: A representative of high-quality total RNA sample. The quality of total RNA was assessed by Bioanalyzer. The RNA Integrity Number is 9.4 (FU, Fluorescence Units).
Figure 4: Fragment size range of the paired-end library. The DNA size distribution of the library was analyzed using Bioanalyzer. Most fragments are within the size range of 250-500 bp.
Figure 5: Differentially expressed transcription factors (>1.5 fold) between Lin-CD34+ cells and Lin-CD34- cells2. For each cell type, two independent experiments were performed. Up-regulated genes are indicated as red color and down-regulated genes are indicated as green color.
BHK medium | |
100x Antibiotic-Antimycotic | 10 ml |
200 mM L-Glutamine | 10 ml |
FBS | 100 ml |
DMEM | 880 ml |
Total volume | 1,000 ml |
EML Basic medium | |
IMDM | 390 ml |
HI horse serum | 100 ml |
100 x Penicillin-Streptomycin | 5 ml |
200 mM L-Glutamine | 5 ml |
BHK medium | 75 ml |
Total Volume | 575 ml |
Filter through a 0.45 μm filter | |
FACS buffer | |
BSA | 0.50% |
EDTA | 1 mM |
Dissolve in PBS and filter through a 0.45 μm filter |
Table 1: Buffers and cell culture media.
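When a different batch size is needed, the recipes in Table 1 can be scaled proportionally. The following minimal Python sketch does this for the EML basic medium; the function name and the target volume are hypothetical, while the component volumes are those listed in the table.

```python
# Component volumes (ml) of EML basic medium as listed in Table 1 (total 575 ml).
EML_BASIC_ML = {
    "IMDM": 390,
    "HI horse serum": 100,
    "100x Penicillin-Streptomycin": 5,
    "200 mM L-Glutamine": 5,
    "BHK medium": 75,
}


def scale_recipe(recipe_ml, target_total_ml):
    """Scale every component so that the volumes sum to the requested total."""
    factor = target_total_ml / sum(recipe_ml.values())
    return {name: round(volume * factor, 2) for name, volume in recipe_ml.items()}


# Hypothetical half batch.
for name, volume in scale_recipe(EML_BASIC_ML, 287.5).items():
    print(f"{name}: {volume} ml")
```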
Software | Usage | Reference |
Bowtie 1.2.7 | Used by Tophat for mapping | [28] |
Tophat 1.3.3 | Mapping reads back to reference genome | [27] |
Cufflinks 1.3.0 | Transcripts construction and expression level estimation | [29] |
DESeq 1.16.0 | Differential expression analysis | [30] |
Bedtools 2.18 | Convert .bam file into .bed file | [31] |
bedGraphToBigWig | Convert .bed file to .bigwig file | http://genome.ucsc.edu/ |
Table 2: List of software for data analysis.
The mammalian transcriptome is very complex34-38. RNA-Seq technology plays an increasingly important role in transcriptome analysis, novel transcript detection, single nucleotide variation discovery, etc. It has many advantages over other methods for gene expression analysis. As mentioned in the introduction, it overcomes the hybridization artifacts of microarray and can be used to identify novel transcripts de novo. One limitation of RNA-sequencing is its relatively short read length compared with Sanger sequencing. However, with the rapid improvement of sequencing technology, read length is increasing constantly. In this paper, we provide detailed methods of using this technology to identify potential key regulators of mouse EML cell self-renewal and differentiation.
The first key step of this protocol is EML cell culture. Although EML is a hematopoietic precursor cell line that can be propagated in large quantity with SCF, its culture conditions require more attention than those of typical immortalized cell lines. The cells should be fed and passaged on a regular basis and handled gently; otherwise the cells could change in their properties of self-renewal and differentiation and undergo cell death. As the first step after collecting enough cells, we isolated lineage negative cells using a magnetic activated cell sorting system. Then we separated CD34+ and CD34- cells using fluorescence-activated cell sorting. The EML cells are normally passaged for fewer than 10 generations before being used for RNA extraction, and the numbers of CD34+ and CD34- cells should be similar after separation. If the two populations vary greatly in cell number, it is advisable to discard the culture and thaw another tube of cell stock for culture.
After separation of CD34+ and CD34- cells, total RNA extraction was performed, which is another important step for this study. High quality RNA is the basis for construction of a high quality library, which ensures the accuracy of the sequencing data. In this critical step, any contact with RNase should be avoided. All reagents should be RNase free. It is important to wear gloves at all times while handling RNA. A high quality RNA sample has an OD 260/280 value between 1.8 and 2.0. When collecting the aqueous phase containing RNA, be careful not to carry any organic phase over with the RNA sample. Any residual organic solvent such as phenol or chloroform in the RNA would result in an OD 260/280 value lower than 1.65. If the OD 260/280 value is lower than 1.65, precipitate the RNA again with ethanol. After washing with 75% ethanol, do not over-dry the RNA pellet. Drying the RNA pellet completely will reduce the solubility of the RNA and lead to a low yield.
The next key step of this protocol is library preparation. After total RNA extraction, a DNase treatment step to remove contaminating DNA is highly recommended, since DNA contamination might result in an incorrect estimate of the amount of total RNA used. It is recommended to perform the downstream procedure immediately after RNA isolation, since RNA will degrade to some degree after long-term storage and freeze-thawing. If the subsequent steps cannot be performed immediately after RNA isolation, store the RNA at -80 °C. Before total RNA is used for mRNA purification and cDNA synthesis, its quality should always be checked. Only high quality RNA should be used for library preparation. Using low quality or degraded RNA might lead to over-representation of 3' ends. Before sequencing, library quality was assessed to ensure maximum sequencing efficiency.
In the data analysis part, after performing a run of Cufflinks without a reference transcriptome, we combined the novel transcripts with known transcripts to form a reference .gtf file and ran Tophat and Cufflinks a second time. This two-run procedure is recommended, since it provides more accurate FPKM estimation than a single run. After data analysis, the differentially expressed genes were identified. Downstream experiments can be performed to validate the function of genes in vitro and in vivo. In our previous publication2, we chose the significantly differentially expressed transcription factors and identified the genomic binding sites of these factors by performing chromatin immunoprecipitation and sequencing (ChIP-Seq). In addition, we applied an shRNA knockdown assay to test the functional effect of Tcf7. We found that in Tcf7 knockdown cells, up-regulated genes were highly enriched in CD34- cells, while down-regulated genes were significantly enriched in CD34+ cells. Therefore, the gene expression profile of Tcf7 knockdown cells shifted toward a partially differentiated CD34- state. Overall, using the EML cell line as a model system coupled with RNA-Seq technology and functional assays, we identified and confirmed Tcf7 as an important regulator of EML cell self-renewal and differentiation.
The authors have nothing to disclose.
JQW, SZ, SD and KC are supported by grant from the National Institutes of Health and the Staman Ogilvie Fund—Memorial Hermann Foundation.
Name | Company | Catalog number | Use |
Antibiotic-Antimycotic | Invitrogen | 15240-062 | BHK cell culture |
Anti-Mouse CD34 FITC | eBioscience | 11-0341-81 | FACS sorting |
Anti-Mouse Ly-6A/E (Sca-1) PE | eBioscience | 12-5981-81 | FACS sorting |
APC Mouse Lineage Antibody Cocktail | BD Biosciences | 558074 | FACS sorting |
BD FACSAria Cell Sorter | BD Biosciences | Special offer system | FACS sorting |
Corning™ Cell Culture Treated Flasks 75cm2 | Corning incorporated | 430641 | Cell culture |
Corning™ Cell Culture Treated Flasks 25cm2 | Corning incorporated | 430639 | Cell culture |
Deoxyribonuclease I, Amplification Grade | Invitrogen | 18068-015 | Library preparation |
DMEM | Invitrogen | 11965-092 | BHK cell culture |
DPBS | Gibco | 14190 | Cell culture |
HI FBS | Invitrogen | 16140071 | BHK cell culture |
Horse Serum | Invitrogen | 16050-122 | EML cell culture |
IMDM | HyClone | SH30228.02 | EML cell culture |
L-Glutamine | Invitrogen | 25030-081 | Cell culture |
Lineage Cell Depletion Kit, mouse | Miltenyi Biotec | 130-090-858 | Isolation of lineage negative cells |
NanoVue Plus spectrophotometer | GE Healthcare | 28-9569-62 | Quality control |
Thermo Scientific™ Napco™ 8000 Water-Jacketed CO2 Incubators | Thermo Scientific | 15-497-002 | Cell culture |
Penicillin-Streptomycin | Invitrogen | 15140-122 | EML cell culture |
TRIzol® Reagent | Invitrogen | 15596-018 | RNA extraction |
TruSeq™ RNA Sample Prep Kit v2 -Set B (48rxn) | Illumina | RS-122-2002 | Library preparation |
2100 Electrophoresis Bioanalyzer Instrument | Agilent | G2939AA | Quality control |
0.25% Trypsin-EDTA | Gibco | 25200 | Cell culture |
0.45 µm Syringe Filters | Nalgene | 190-2545 | Cell culture |