Introduction to Galaxy

Instructor Stephanie Le Gras
Duration 1 hour
Content Practical session on basic analysis using Galaxy (Hands-on)
Prerequisites None

Description of the training dataset

The transcription factor PU.1 is a protein encoded by the SPI1 gene in humans. This gene encodes an ETS-domain transcription factor that activates gene expression during myeloid and B-lymphoid cell development (see the GeneCards entry for this gene).

We are going to use ChIP-seq data for the PU.1 transcription factor in mouse. The goal of a ChIP-seq experiment is to identify the DNA regions bound by a protein of interest. Transcription factors usually bind specific DNA motifs. The aim of this practical session is to identify the motif underlying the PU.1 binding sites.

These data have been published in this study:

Heinz S, Benner C, Spann N, Bertolino E et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010 May 28;38(4):576-89. PMID: 20513432

The data were downloaded from the GEO database (accession GSM537989).

This GEO dataset contains the peak coordinates of PU.1 binding sites. The original data are based on the mouse genome assembly mm8; the coordinates have been converted to mm9 to ease the analysis.

The dataset is of this form:

chr1    193580486       193580686       chr1-193457322-0        191     +
chr1    64972363        64972563        chr1-64860165-0 183     +
chr1    134238383       134238583       chr1-134169452-0        179     +
chr1    51991430        51991630        chr1-51879231-0 177     +
chr1    53880739        53880939        chr1-53768540-0 176     +
chr1    130487423       130487623       chr1-130418492-0        175     +
chr1    99556072        99556272        chr1-99490001-0 174     +

This is a BED file. It contains one line per identified PU.1 peak. The columns are:

  • column 1: chromosome of the peak
  • column 2: start position of the peak
  • column 3: end position of the peak
  • column 4: ID of the peak
  • column 5: score
  • column 6: strand
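To make the column layout concrete, here is a minimal Python sketch that parses one line of the file into named fields. The `Peak` type and the `parse_bed_line` function are hypothetical names introduced for illustration only; they are not part of Galaxy.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """One BED line, using the six columns described above (hypothetical type)."""
    chrom: str
    start: int
    end: int
    name: str
    score: float
    strand: str


def parse_bed_line(line: str) -> Peak:
    # BED fields are whitespace-separated; keep only the first six columns.
    chrom, start, end, name, score, strand = line.split()[:6]
    return Peak(chrom, int(start), int(end), name, float(score), strand)


# First line of the example shown above.
peak = parse_bed_line("chr1\t193580486\t193580686\tchr1-193457322-0\t191\t+")
print(peak.chrom, peak.start, peak.end, peak.name, peak.score, peak.strand)
```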

Practical session

During the practical session, we are going to use the GalaxEast platform (

Question 1

Upload the GSM537989_Sample7.Bcell-PU.1.mm9.bed file into Galaxy.


Question 2

2.1 - Estimate the size of the peaks contained in the file GSM537989_Sample7.Bcell-PU.1.mm9.bed.


  • Use the Compute tool.
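The size of a peak is its end position minus its start position (column 3 minus column 2). In the Compute tool this is typically written as an expression such as c3-c2, assuming the tool's cN column notation. Outside Galaxy, the same calculation can be sketched in Python; `add_peak_size` is a hypothetical helper name, and the two input lines are taken from the example above.

```python
def add_peak_size(bed_lines):
    """Return each BED line with one extra tab-separated column: end - start."""
    result = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        size = int(fields[2]) - int(fields[1])  # column 3 minus column 2
        result.append("\t".join(fields + [str(size)]))
    return result


example = [
    "chr1\t193580486\t193580686\tchr1-193457322-0\t191\t+",
    "chr1\t64972363\t64972563\tchr1-64860165-0\t183\t+",
]
with_size = add_peak_size(example)
for row in with_size:
    print(row)
```

Both example peaks come out with a size of 200, which is appended as a seventh column.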


2.2 - Plot a histogram of the peak sizes.


  • Use the Histogram tool
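As a rough idea of what a histogram of peak sizes computes, the sketch below groups sizes into fixed-width bins and prints a text bar per bin. The `histogram` function, the bin width of 50, and the list of sizes are all made up for illustration; they are not values from the dataset or settings of the Galaxy tool.

```python
from collections import Counter


def histogram(sizes, bin_width=50):
    """Count how many sizes fall into each bin; keys are the bin lower bounds."""
    counts = Counter((size // bin_width) * bin_width for size in sizes)
    return dict(sorted(counts.items()))


# Made-up peak sizes, for illustration only.
sizes = [200, 200, 180, 220, 260, 200]
bin_width = 50
hist = histogram(sizes, bin_width)
for lower, count in hist.items():
    print(f"{lower}-{lower + bin_width - 1}: {'#' * count}")
```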


Question 3

Randomly select 100 lines from the dataset GSM537989_Sample7.Bcell-PU.1.mm9.bed.


  • Use the “Select random lines” tool
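Selecting random lines amounts to sampling without replacement. A minimal Python sketch follows; `select_random_lines` and its `seed` parameter are hypothetical names, and the input lines are placeholders rather than real peaks. The order of the lines returned by the Galaxy tool may differ from what this sketch produces.

```python
import random


def select_random_lines(lines, n, seed=None):
    """Pick n distinct lines at random; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)
    return rng.sample(lines, n)


# Placeholder input: 1000 fake lines standing in for the BED file content.
all_lines = [f"line_{i}" for i in range(1000)]
subset = select_random_lines(all_lines, 100, seed=42)
print(len(subset), "lines selected")
```

Fixing the seed is useful when the same subset has to be reproduced later, for example to compare two motif detection runs on identical input.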


Question 4

Extract nucleotide sequences from the genomic coordinates of the 100 PU.1 peaks extracted in the previous step.


  • Use the “Extract Genome DNA” tool
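Conceptually, this step slices the reference genome at each peak's coordinates. BED start positions are 0-based and end positions are exclusive, which matches Python slicing. The sketch below uses a tiny made-up "genome" held in a dictionary; `extract_sequence` and `toy_genome` are hypothetical, and the reverse-complement handling of the minus strand is an assumption about the desired behaviour, not a description of the Galaxy tool.

```python
# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_sequence(genome, chrom, start, end, strand="+"):
    """Return genome[chrom][start:end]; reverse-complement it on the minus strand."""
    seq = genome[chrom][start:end]
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Made-up 12 bp chromosome, for illustration only.
toy_genome = {"chrT": "AAACCCGGGTTT"}
plus_seq = extract_sequence(toy_genome, "chrT", 0, 6, "+")
minus_seq = extract_sequence(toy_genome, "chrT", 0, 6, "-")
print(f">chrT:0-6(+)\n{plus_seq}")
print(f">chrT:0-6(-)\n{minus_seq}")
```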


Question 5

Detect de novo motifs within the nucleotide sequences extracted from the 100 PU.1 peaks.


  • Use “MEME”
  • Detect up to 3 motifs
  • Max motif length should be 15.
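For readers who also have the standalone MEME program available, the settings above correspond roughly to its -nmotifs and -maxw options. The sketch below only builds and prints such a command line; it does not run MEME. The file name pu1_peaks_100.fa, the output directory meme_out, and the `build_meme_command` helper are hypothetical, and the option names exposed by the Galaxy wrapper may differ.

```python
import shlex


def build_meme_command(fasta_path, out_dir, n_motifs=3, max_width=15):
    """Assemble a standalone MEME command line as a list of arguments."""
    return [
        "meme", fasta_path,
        "-dna",                       # input sequences are DNA
        "-nmotifs", str(n_motifs),    # report up to this many motifs
        "-maxw", str(max_width),      # maximum motif width
        "-oc", out_dir,               # output directory (overwritten if present)
    ]


# Hypothetical file and directory names.
command = build_meme_command("pu1_peaks_100.fa", "meme_out")
print(shlex.join(command))
```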


Question 6

Extract a workflow from your history and save this workflow as “Motif Detection”. Keep neither the “Compute” nor the “Histogram” step.


Question 7

Change history name to “Motif Detection” and add the tags “chip-seq, meme, motif” to the history.


Question 8

Copy the file GSM537989_Sample7.Bcell-PU.1.mm9.bed to a new history called “Comparison of motif detection”. Use this new history.


Question 9

Run the workflow “Motif Detection” twice on the file GSM537989_Sample7.Bcell-PU.1.mm9.bed.


training/introduction2galaxy2.txt · Last modified: 2017/03/10 10:57 by slegras