Introduction to Galaxy

Instructor: Stephanie Le Gras
Duration: 2 hours
Content: Description of the web interface of Galaxy (lecture); practical session on basic features of Galaxy (hands-on)
Prerequisites: None

Description of the training dataset

We are using RNA-seq and ChIP-seq data from:

Strub, T., Giuliano, S., Ye, T., Bonet, C., Keime, C., Kobi, D., Le Gras, S., Cormont, M., Ballotti, R., and Bertolotto, C. (2011). Essential role of microphthalmia transcription factor for DNA replication, mitosis and genomic stability in melanoma. Oncogene 30, 2319–2332.

RNA-seq data

In this study, the authors compared the transcriptomes of melanoma cells treated with an siRNA against MITF and cells treated with an siRNA against Luciferase (used as a control). The data are provided as a TSV (Tab-Separated Values) file that contains the number of reads per gene. Genes are identified by their Ensembl gene IDs.

The data were analyzed against the human genome assembly hg19/GRCh37.

Here are the columns of the file to be analyzed:

Column name                                            Description
Ensembl gene id                                        Ensembl gene ID
siLuc2 (raw read counts)                               Raw read counts - control - 1st biological replicate
siLuc3 (raw read counts)                               Raw read counts - control - 2nd biological replicate
siMitf3 (raw read counts)                              Raw read counts - siMITF - 1st biological replicate
siMitf4 (raw read counts)                              Raw read counts - siMITF - 2nd biological replicate
siLuc2 (normalized)                                    Normalized read counts - control - 1st biological replicate
siLuc3 (normalized)                                    Normalized read counts - control - 2nd biological replicate
siMitf3 (normalized)                                   Normalized read counts - siMITF - 1st biological replicate
siMitf4 (normalized)                                   Normalized read counts - siMITF - 2nd biological replicate
siLuc2 (normalized and divided by gene length in kb)   RPK - control - 1st biological replicate
siLuc3 (normalized and divided by gene length in kb)   RPK - control - 2nd biological replicate
siMitf3 (normalized and divided by gene length in kb)  RPK - siMITF - 1st biological replicate
siMitf4 (normalized and divided by gene length in kb)  RPK - siMITF - 2nd biological replicate
log2(siMitf/siLuc)                                     Log2 Fold change (siMitf/siLuc)
P-value (siMitf vs siLuc)                              P-value (siMitf/siLuc)
Adjusted p-value (siMitf vs siLuc)                     Adjusted p-value - FDR (siMitf/siLuc)
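
As a small worked illustration of two of the derived columns, the sketch below computes an RPK value (a normalized count divided by the gene length in kb) and a log2 ratio. The function names and all numbers are hypothetical and are not taken from the dataset. The page does not say which per-condition values feed the log2 fold change, so the second function only shows the arithmetic of the ratio.

```python
import math


def reads_per_kilobase(normalized_count: float, gene_length_bp: int) -> float:
    """Divide a normalized read count by the gene length expressed in kb."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return normalized_count / (gene_length_bp / 1000)


def log2_ratio(treated: float, control: float) -> float:
    """Return log2(treated / control); both values must be positive."""
    if treated <= 0 or control <= 0:
        raise ValueError("both values must be positive")
    return math.log2(treated / control)


# Hypothetical gene: 500 normalized reads over a 2,500 bp gene.
print(reads_per_kilobase(500, 2500))  # 200.0
# Hypothetical expression values: 8 in the siMITF cells, 2 in the control.
print(log2_ratio(8, 2))  # 2.0
```

A positive log2 ratio means the value is higher in the siMITF condition than in the control; a negative one means it is lower.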

ChIP-seq data

In this study, the authors also performed ChIP-seq on the transcription factor MITF. The data file we are using is a FASTQ file (see http://fr.wikipedia.org/wiki/FASTQ) that contains a selection of reads mapping to chromosome 2.
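
For readers new to FASTQ: in the common layout, each read occupies four lines. These are an identifier line starting with @, the sequence, a separator line starting with +, and a quality string of the same length as the sequence. The sketch below is a minimal parser that assumes this four-line layout. The function name is hypothetical and the record is made up, not taken from the training file.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record near {header!r}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Made-up record for illustration only.
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print(list(parse_fastq(example)))  # [('read1', 'ACGT', 'IIII')]
```

To read a real file, pass an open file handle instead of the list, for example `parse_fastq(open(path))` with a path of your choosing.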

Practical session

During the practical session, we are going to use the GalaxEast platform (http://use.galaxeast.fr).

Question 1

Upload the file S12040_genesdiff.txt into Galaxy.

Hints:

  • There is no need to uncompress the file before uploading it to GalaxEast.
  • Format of the file: tabular

Question 2

Import the dataset Homo_sapiens_genes_(GRCh37).txt from Shared data (top menu) > Data libraries > Introduction 2 Galaxy (datasets).

Question 3

Annotate S12040_genediff.txt with the Associated Gene Name, the Gene Biotype and the Gene Description found in the file Homo_sapiens_genes_(GRCh37).txt. Use the common field “Ensembl gene id” to join the two tables.

Help:

  • Use the tool Join two datasets (section: Join, Subtract and Group)
  • Look at the numbers of the shared columns in both datasets
  • The first file to join is S12040_genediff.txt
  • Join the two files using the common field (Ensembl gene id).
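
To make the idea of joining on a shared column concrete, here is a minimal sketch of an inner join on one key column, written in plain Python. It is not Galaxy's implementation. The function name, gene IDs and values are hypothetical, and column numbers are 1-based to mirror how columns are counted in the interface.

```python
from collections import defaultdict
from typing import Dict, List

Row = List[str]


def join_on_column(left: List[Row], right: List[Row],
                   left_col: int, right_col: int) -> List[Row]:
    """Inner join: for each left row, append every right row with the same key."""
    index: Dict[str, List[Row]] = defaultdict(list)
    for row in right:
        index[row[right_col - 1]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_col - 1], []):
            joined.append(row + match)
    return joined


# Hypothetical tables: read counts on the left, annotation on the right.
counts = [["GENE_A", "10"], ["GENE_B", "25"]]
annotation = [["GENE_A", "NameA", "protein_coding"],
              ["GENE_C", "NameC", "lincRNA"]]
print(join_on_column(counts, annotation, 1, 1))
# [['GENE_A', '10', 'GENE_A', 'NameA', 'protein_coding']]
```

In this sketch the key column appears twice in each joined row, once from each table, and rows without a partner in the other table are dropped.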

Question 4

There are now two “Ensembl gene id” columns. Remove the one located right before the “Associated Gene Name” column.

Help:

  • Use the tool Cut (section: Text Manipulation)
  • Identify the number of the second “Ensembl gene id” column
  • Find the total number of columns in the dataset
  • You have to tell the tool which columns to keep (not which columns to remove)
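
The "list what you keep" logic from the last hint can be sketched as follows. The `c1,c2,...` notation for naming columns is assumed here to match what the Cut tool's form expects. The function name and data are hypothetical, and the column numbers are not the answer to this question.

```python
from typing import List


def cut_columns(rows: List[List[str]], spec: str) -> List[List[str]]:
    """Keep only the columns listed in spec, e.g. "c1,c3" (1-based, in that order)."""
    keep = [int(name.strip().lstrip("c")) - 1 for name in spec.split(",")]
    return [[row[i] for i in keep] for row in rows]


# Hypothetical five-column row in which column 3 duplicates column 1.
rows = [["GENE_A", "10", "GENE_A", "NameA", "protein_coding"]]
print(cut_columns(rows, "c1,c2,c4,c5"))
# [['GENE_A', '10', 'NameA', 'protein_coding']]
```

Dropping a column therefore means listing every other column, which is why it helps to know the total number of columns first.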

Question 5

Are there any lincRNAs (long intergenic non-coding RNAs) in the data, that is, genes whose Gene Biotype is exactly equal to lincRNA? If so, how many genes are annotated as lincRNA?

Help:

  • Use the tool Filter (section: Filter and sort)
  • Find the number of the column that contains the lincRNA annotation (Gene Biotype)
  • Check the exact case of the lincRNA annotation (tools that use expressions, such as Filter, are case-sensitive)
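
The effect of case sensitivity can be seen in a minimal sketch of an exact-match filter. This is plain Python, not the Filter tool itself. The function name, column number and rows are hypothetical.

```python
from typing import List


def filter_rows(rows: List[List[str]], column: int, value: str) -> List[List[str]]:
    """Keep rows whose 1-based column is exactly equal to value (case-sensitive)."""
    return [row for row in rows if row[column - 1] == value]


# Hypothetical rows: gene ID, biotype.
rows = [
    ["GENE_A", "lincRNA"],
    ["GENE_B", "LincRNA"],
    ["GENE_C", "protein_coding"],
    ["GENE_D", "lincRNA"],
]
print(len(filter_rows(rows, 2, "lincRNA")))  # 2
print(len(filter_rows(rows, 2, "LINCRNA")))  # 0
```

A condition that differs from the annotation only by case matches nothing, so an unexpectedly empty result is worth checking against the spelling used in the file.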

Question 6

Rename your history to “prepare RNA-seq data” and rename the last dataset to “final data”.

Question 7

We have ChIP-seq data that accompany these RNA-seq data. The ChIP-seq data are raw sequencing data. Create a workflow that goes from raw data to peak annotation. Here are the steps to follow:

  • Step 1: input file
  • Step 2: Map reads with bowtie 1 (the parameter “Select a reference genome” should be set at runtime; output (sam) is the output to link to the next tool)
  • Step 3: MACS v1.4.2 (there is no control file; the parameters “Effective genome size” and “Tag size” should be set at runtime; output_bed_file (bed) is the output to link to the next tool)
  • Step 4: homer_annotatePeaks (the parameter “Genome version” should be set at runtime)

nb:

  • The goal of a workflow is to link the output of one tool to the input of the next tool(s)
  • Tools and the input file are found in the left panel (use the search field to find tools faster)
  • Tool parameters are set in the right panel
  • Two tools can be linked if the output format of the first tool matches the input format of the next tool. In this case, the arrow linking the two turns green.
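
The format-matching rule in the last point can be sketched as a simple check over a chain of steps. This is a simplified model, not how Galaxy represents workflows. The class and function names are hypothetical. The sam and bed output formats come from the step list above; the input formats and the final output format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str
    input_format: Optional[str]  # format the step accepts (None for the input file)
    output_format: str           # format the step produces


def check_links(steps: List[Step]) -> List[bool]:
    """For each consecutive pair, report whether the formats are compatible."""
    return [a.output_format == b.input_format for a, b in zip(steps, steps[1:])]


workflow = [
    Step("input file", None, "fastq"),
    Step("bowtie", "fastq", "sam"),              # input format assumed
    Step("MACS", "sam", "bed"),                  # input format assumed
    Step("homer_annotatePeaks", "bed", "txt"),   # both formats assumed
]
print(check_links(workflow))  # [True, True, True]
```

In this model, a link is valid only when the two formats are identical. Real tools may accept several input formats, which this sketch does not cover.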

Question 8

Import the dataset MITF(2)-chr2.fastq from Shared data (top menu) > Data libraries > Introduction 2 Galaxy (datasets). Run the workflow created in Question 7. Here are the parameters to use:

  • Step 2: Select a reference genome: hg19
  • Step 5 (1): Effective genome size: 182400000
  • Step 5 (2): Tag size: 54
  • Step 6: Genome version: hg19

training/introduction2galaxy.txt · Last modified: 2017/03/10 10:55 by slegras