You can perform a complete differential expression analysis on genes or transcripts using the workflows and tools available within GalaxEast. Datasets provided within GalaxEast and described below let you test these workflows and tools.
For this purpose, it is strongly recommended to use an Ensembl GTF annotation file. In the second step of this analysis, Ensembl gene IDs are used to annotate genes, and Ensembl annotations are used to compute gene lengths so that read counts can be normalized by gene length.
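As an illustration of what normalizing by gene length can look like, here is a minimal Python sketch. It assumes that the length of a gene is the number of bases covered by at least one of its exons; the tutorial does not state how the workflow computes gene length, so this definition, the function names and the identifiers `GENE_A`, `TX_A1` and `TX_A2` are all hypothetical.

```python
import re
from collections import defaultdict


def gene_lengths(gtf_lines):
    """Return {gene_id: length in bp}, counting each exonic base only once."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlapping exons: extend the current block
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


def reads_per_kilobase(count, length_bp):
    """Scale a raw read count by gene length expressed in kilobases."""
    return count / (length_bp / 1000.0)


# Hypothetical records: two overlapping exons and one separate exon.
example_gtf = [
    "\t".join(["1", "ensembl", "exon", "100", "199", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A1";']),
    "\t".join(["1", "ensembl", "exon", "150", "299", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A2";']),
    "\t".join(["1", "ensembl", "exon", "400", "499", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A1";']),
]
lengths = gene_lengths(example_gtf)
rpk = reads_per_kilobase(600, lengths["GENE_A"])
```

Merging overlapping exons avoids counting bases shared by several transcripts of the same gene more than once.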
To test these workflows and tools, you can use a real dataset (warning: the alignment can take hours) or a test dataset (the whole workflow takes less than one hour). These data are available in the “Shared Data → Data Libraries” menu (“RNA-seq analysis workflow - real data” and “RNA-seq analysis workflow - test data”).
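This tutorial covers the following steps of an RNA-seq data analysis: quality assessment of the raw reads, mapping of the reads to a reference genome, quantification of reads per gene or transcript, and differential expression analysis.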
In this tutorial, the quality of your data is assessed with the FastQC tool, a set of modules that provide a simple way to run quality control checks on raw sequence data. See the FastQC documentation and a typical FastQC report.
This is an important step of the RNA-seq data analysis. The general objective when aligning a collection of sequencing reads to a reference is to discover the true location (origin) of each read with respect to that reference. When mapping RNA-seq reads to a reference genome, you need an alignment tool that allows a read to be “split” between distant regions of the reference when the read spans two exons. Since the sequencing library was constructed from transcribed RNA, intronic sequence was not present, and the sequenced molecule natively spanned exon boundaries. This is why, in this tutorial, we will use an aligner well adapted to this purpose, Bowtie2/TopHat2 (see the TopHat2 publication).
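To make the idea of a “split” read concrete: in SAM output, a spliced alignment is recorded with an `N` operation in the CIGAR string, which marks a skipped region of the reference (here, an intron). The minimal Python sketch below turns a start position and a CIGAR string into the reference blocks covered by the read. The function name and the example values are hypothetical, and deletions are simply kept inside the current block.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return the 1-based inclusive reference blocks covered by a read."""
    blocks = []
    ref = pos            # next reference position to be consumed
    block_start = None   # start of the block currently being built
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":          # these operations consume the reference
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":           # skipped region: close the current block
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            ref += length
        # I, S, H and P do not consume the reference and are ignored here
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks


# Hypothetical read: 30 bases on one exon, a 500 bp intron, 20 bases on the next exon.
spliced = aligned_blocks(1000, "30M500N20M")
unspliced = aligned_blocks(1000, "50M")
```

A read that maps within a single exon yields one block, while a read spanning an exon junction yields one block per exon.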
The advantage of TopHat2 is that we can supply a set of gene model annotations and/or known transcripts as a GTF-formatted file (such as an Ensembl GTF annotation file). Throughout this tutorial, we will use this option: TopHat2 first extracts the transcript sequences and uses Bowtie2 to align reads to this virtual transcriptome. Only the reads that do not fully map to the transcriptome are then mapped to the genome. This step improves the mapping quality.
Example of output alignment file:
Six aligned reads are represented on this image. We can see the chromosomal location of each read, its sequence, and other optional “tags” (AS, XN, XM, …), which are described on the SAM format specification page.
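For readers who want to inspect such records programmatically, the following Python sketch splits one SAM alignment line into its eleven mandatory fields and its optional `TAG:TYPE:VALUE` fields, following the SAM format specification. The example record is made up for illustration and does not come from the tutorial datasets; only the `i` (integer) and `f` (float) types are converted, and all other values are left as strings.

```python
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one SAM alignment line into (mandatory fields, optional tags)."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])

    tags = {}
    for field in columns[11:]:
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        tags[tag] = value
    return record, tags


# Hypothetical record with the optional tags mentioned above.
example_line = "\t".join([
    "read_1", "0", "chr1", "14362", "50", "8M", "*", "0", "0",
    "ACGTACGT", "IIIIIIII", "AS:i:-5", "XN:i:0", "XM:i:1",
])
record, tags = parse_sam_line(example_line)
```

In practice a dedicated library is preferable for large files; this sketch only shows how the columns of one record are organized.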
After the mapping step, we can count the number of reads assigned to each feature (gene/transcript) based on their chromosomal location, and use these counts to approximate the abundance of those features in the original sample. Such measures of digital gene expression are then compared among samples or treatments in a statistical framework (see the next step).
In this tutorial, we will use the HTSeq-count tool (and, to a lesser extent, featureCounts) to determine the number of reads per gene. These tools count the reads that overlap the exons of a gene and sum the counts over all of its exons to determine its “overall digital expression”.
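The counting principle can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm implemented by HTSeq-count or featureCounts: reads are treated as single ungapped intervals, strand is ignored, and a read is counted for a gene only if it overlaps exons of exactly one gene. The `__no_feature` and `__ambiguous` labels mirror the special counters reported by HTSeq-count; the gene names and coordinates are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(exons, reads):
    """Count reads per gene.

    exons: list of (chrom, start, end, gene_id), 1-based inclusive coordinates
    reads: list of (chrom, start, end), 1-based inclusive coordinates
    """
    counts = Counter()
    for chrom, r_start, r_end in reads:
        # Collect every gene that has at least one exon overlapping the read.
        genes = {
            gene
            for e_chrom, e_start, e_end, gene in exons
            if e_chrom == chrom and e_start <= r_end and r_start <= e_end
        }
        if len(genes) == 1:
            counts[genes.pop()] += 1      # unambiguous assignment
        elif not genes:
            counts["__no_feature"] += 1   # read overlaps no annotated exon
        else:
            counts["__ambiguous"] += 1    # read overlaps several genes
    return counts


# Hypothetical annotation: geneA has two exons, geneB overlaps the end of geneA.
example_exons = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 300, 400, "geneA"),
    ("chr1", 380, 500, "geneB"),
]
example_reads = [
    ("chr1", 120, 170),   # first exon of geneA
    ("chr1", 310, 360),   # second exon of geneA
    ("chr1", 390, 440),   # overlaps both geneA and geneB
    ("chr1", 450, 499),   # geneB only
    ("chr1", 600, 650),   # outside any exon
]
gene_counts = count_reads_per_gene(example_exons, example_reads)
```

Collecting genes in a set means that a read touching two exons of the same gene is still counted once for that gene.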
Example of HTSeq-count/featureCounts output from Ensembl GTF annotation:
To determine the number of reads per transcript, we will use the Cufflinks suite. Cufflinks uses its RABT (reference annotation based transcript) assembly method, which determines the minimum number of transcripts needed to explain the sets of reads aligned to a genome, and then quantifies these transcripts.
Finally, the raw read counts of each sample can be normalized across samples and used to perform a differential gene/isoform analysis, in order to see which genes/isoforms are significantly differentially expressed between conditions.
To test for differential gene expression between conditions, we will use the DESeq2 package (see details). To test for differential isoform expression between conditions, we will use the Cufflinks suite.
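To give an idea of what normalizing counts between samples involves, the Python sketch below computes per-sample size factors with the median-of-ratios approach, which is the default normalization in DESeq2. It is a plain Python illustration of the principle, not the DESeq2 implementation, and the gene names and counts are made up. Genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Return one size factor per sample.

    counts: {gene_id: [raw count in sample 1, raw count in sample 2, ...]}
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if any(v <= 0 for v in values):
            continue  # no finite log geometric mean for this gene
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    # The size factor of a sample is the median ratio to the geometric mean.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(counts, factors):
    """Divide each raw count by the size factor of its sample."""
    return {
        gene: [v / f for v, f in zip(values, factors)]
        for gene, values in counts.items()
    }


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
raw_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [0, 3],
}
factors = size_factors(raw_counts)
normalized = normalize(raw_counts, factors)
```

With these values, the normalized counts of geneA, geneB and geneC become identical in both samples, which is the expected behavior when the only difference between samples is sequencing depth.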
Finally, using the p-value computed for each gene/isoform, you can filter the genes/isoforms that are significantly differentially expressed between two conditions.
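Because thousands of genes are tested at once, the filter is usually applied to p-values adjusted for multiple testing rather than to raw p-values. The Python sketch below applies the Benjamini-Hochberg adjustment and keeps the genes below a threshold. The 0.05 threshold, the function names and the gene names are assumptions chosen for illustration; DESeq2 reports its own adjusted p-values in the `padj` column, which should be used directly when available.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest to enforce monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(pvalues_by_gene, threshold=0.05):
    """Return the genes whose adjusted p-value is below the threshold."""
    genes = list(pvalues_by_gene)
    adjusted = benjamini_hochberg([pvalues_by_gene[g] for g in genes])
    return [g for g, p in zip(genes, adjusted) if p < threshold]


# Hypothetical raw p-values for four genes.
example_pvalues = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.5}
adjusted = benjamini_hochberg(list(example_pvalues.values()))
selected = significant_genes(example_pvalues)
```

In this example geneB and geneC have raw p-values below 0.05 but are no longer selected after adjustment.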
To learn more about RNA-seq, see the RNA-seqlopedia.