Online Motif Discovery with WebMOTIFS

These pages contain the FASTA-formatted input files, the intermediate motif discovery output files, and the clustering results files for Harbison et al. Links and descriptions of file formats can be found below.

Support Files -- Directory Structure

The files are organized into the following directories.  File formats are described in detail below.
fsafiles (directory or 2.2 MB tgz)
FASTA-formatted files for 310 location analysis experiments. Sequences for all probes on the 6k microarray can be found here (3.2 MB).
Binding Data (11.4 MB)
Binding data as a matrix of P-values indexed by probe name and experiment, in comma-separated format (".csv").
kellis (directory or 0.07 MB tgz)
Motifs found using the method described in Kellis et al.
converge (directory or 2.9 MB tgz)
Motifs found using the CONVERGE program described in Methods.
mdscan (directory or 2.1 MB tgz)
Motifs found using MDscan.
meme (directory or 0.4 MB tgz)
Motifs found using MEME.
meme_c (directory or 0.4 MB tgz)
Motifs found using MEME applied to a genome in which unconserved bases were replaced with "N"s.
Clustering Results (directory or 0.1 MB tgz)
Raw clustering results of significant motifs found by the programs above.
Final Motifs
Motifs used in assembling the map, assembled from the clusters above.
GFF files
Files in GFF format used to represent the map of the regulatory code at different levels of confidence. Updated 12-9-04.
For Spreadsheets
Text files representing the map of the regulatory code, formatted for importing into spreadsheets. Versions are provided at different levels of confidence.
Genomic Sequences
NOTE UPDATE: All genomic locations are based on the version of the yeast genome available from SGD in March 2003. The sequences are available in GenBank format (from SGD) here, and in FASTA format (locally) here.
Bound Genes
NOTE UPDATE: Lists of the genes/ORFs bound by each transcription factor.

Motif File Format

Motifs and their raw scores are represented in the format depicted to the right.

(a) The scoring matrix gives log2(Pi/Qi) at each position, where Pi is the probability of observing a letter at that position and Qi is the background probability of that letter. In the final set of motifs, the scoring matrix is followed by the probability matrix of Pi values (not shown in the example to the right).
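The log2(Pi/Qi) transformation in (a) can be illustrated with a minimal sketch. The function name `log_odds_matrix`, the dictionary-per-position layout, and the example probabilities and uniform background are all assumptions made for illustration; they are not taken from the distributed files.

```python
import math

def log_odds_matrix(prob_rows, background):
    """Convert per-position probabilities Pi into log2(Pi / Qi) scores.

    prob_rows  -- list of {base: Pi} dicts, one per motif position (assumed layout)
    background -- {base: Qi} dict of background probabilities
    """
    scores = []
    for row in prob_rows:
        scores.append({
            # A zero probability has no finite log-odds score.
            base: math.log2(p / background[base]) if p > 0 else float("-inf")
            for base, p in row.items()
        })
    return scores

# Hypothetical two-position motif and uniform background (illustrative values only).
example_probs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
example_scores = log_odds_matrix(example_probs, uniform_bg)
```

A letter twice as likely as background scores +1, a letter half as likely scores -1, and a letter at background frequency scores 0.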

(b) A text-based logo of the motif in which the height of letters is proportional to the sum of the relative information content of all the letters at that position.
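As a rough illustration of the quantity in (b), the sketch below sums Pi * log2(Pi/Qi) over the letters at one position. The page does not spell out the exact formula used to draw the logo, so treating "relative information content" as this relative-entropy sum is an assumption, as are the function name `position_information` and the uniform background.

```python
import math

def position_information(prob_row, background):
    """Sum of Pi * log2(Pi / Qi) over the letters at one position, in bits.

    Assumes the relative-entropy definition; letters with Pi == 0 contribute 0.
    """
    return sum(
        p * math.log2(p / background[base])
        for base, p in prob_row.items()
        if p > 0
    )

# Hypothetical uniform background (illustrative only).
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
fully_conserved = position_information({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}, uniform_bg)
uninformative = position_information(dict(uniform_bg), uniform_bg)
```

Under these assumptions a fully conserved position carries 2 bits against a uniform background, and a position matching the background carries none.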

(c) A single line repeating the one-letter summary of the motif, along with several scores and metrics. For example, the following line
(Bits: 14.82  MAP: 42.51   D: 0.000  -1) E: 16.860  ch: 18.34  f:  1.00  Ra: 0.9945
summarizes this information, in order:

Bits: Total number of bits
MAP: For AlignACE and MDscan, this is the MAP score. For MEME and MEME_c, this is -log10(E-value)
D: Not used
-1: Not used
E: Enrichment score, as described in Methods
ch: Specificity score, as described in Hughes et al. (not used)
f: Fraction of probes containing the motif (1.00 means all probes)
Ra: ROC area under the curve, as described in Methods
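
The summary line in (c) can be split into labeled fields with a short regular expression. This is a minimal sketch, assuming every labeled field has the form `label: number` as in the example line and that the one-letter motif summary contains no colon; the function name `parse_summary_line` is hypothetical and is not part of any distributed tool. The unlabeled `-1` value is skipped.

```python
import re

# A label made of letters, a colon, optional spaces, then a (possibly negative) number.
FIELD = re.compile(r"([A-Za-z]+):\s*(-?\d+(?:\.\d+)?)")

def parse_summary_line(line):
    """Return {label: value} for every 'label: number' pair found in the line."""
    return {label: float(value) for label, value in FIELD.findall(line)}

example = "(Bits: 14.82  MAP: 42.51   D: 0.000  -1) E: 16.860  ch: 18.34  f:  1.00  Ra: 0.9945"
fields = parse_summary_line(example)
print(fields["Bits"], fields["E"], fields["Ra"])
```

Returning a dictionary keyed by label keeps the parser independent of field order, which may be convenient if some files carry additional fields.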

(d) A set of 20 generated sequences that, taken together, approximate the probability values at each position of the matrix. They are provided mainly for convenience with other programs that take sets of sequences instead of numerical values.
The "4" value is not used. The trailing number (e.g. 17.996) is the score the matrix assigns to each sequence.
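
To show how a set of sequences such as the one in (d) approximates a probability matrix, the sketch below counts letters column by column. It assumes the sequences have already been extracted from the file as plain, equal-length strings (the exact line layout is not parsed here); the function name `frequency_matrix` and the toy sequences are hypothetical.

```python
from collections import Counter

def frequency_matrix(sequences, alphabet="ACGT"):
    """Estimate per-position letter frequencies from equal-length sequences."""
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for column in zip(*sequences):  # iterate over alignment columns
        counts = Counter(column)
        matrix.append({base: counts[base] / len(sequences) for base in alphabet})
    return matrix

# Toy input (illustrative only, not taken from the motif files).
toy_sequences = ["ACG", "ACG", "ATG", "ACT"]
toy_freq = frequency_matrix(toy_sequences)
```

With 20 sequences, each recovered frequency is a multiple of 0.05, which is why the sequence set only approximates the underlying matrix.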

Clustering Output File Format

These ".cluster" files are generated by filtering the motifs discovered by all programs for a particular location analysis experiment and submitting the results to k-medoids clustering, as described in Methods. The example at right shows the results of clustering significant motifs from the FKH2_H2O2Lo experiment.

All motifs that pass the significance tests are included in the .cluster files. Empty (zero-byte) files represent experiments for which no significant motifs were found.

(a) The discovered number of clusters (k=3) and the average intra-cluster distance are summarized at the top.

(b) Clusters are numbered starting from zero. Each row in a cluster summarizes a single motif that is a member of that cluster. The first column (containing mostly dots) is a placeholder, except for the entry marked "*-", which denotes the medoid of the cluster. The letters in the second column are not used in this study; they denote which motifs have the maximum (capital) and median (lowercase) scores (R: ROC AUC, E: Enrichment, B: total bits of information).

(c)  One-letter representation of the motif alignment that generated the cluster.  (Note that in this example, the zeroth cluster appears to contain the composite Fkh2-Mcm1 binding site.) 

(d)  Source and score information for each motif in each cluster.  In the example:
alignace     3 zE   3.43 E:  22.035  Ra: 0.7591
The motif was found by AlignACE, is the 3rd motif in the corresponding file (FKH2_H2O2Lo.ttace), and has a z-score of 3.43 (explained below*), an Enrichment score of 22.035, and a ROC AUC score of 0.7591. Some motifs in other files may also have a "k:" field, denoting the CC4 score as measured according to Kellis et al.
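
The source-and-score portion of a cluster row, as in the example above, can be read with a small parser. This is a minimal sketch, assuming the text passed in contains only that portion of the row (program name, motif index, a tag beginning with "z", the z-score, then `label: number` fields). The function name `parse_cluster_source` is hypothetical, and the tag (here "zE") is kept verbatim because its meaning is not stated on this page.

```python
import re

# program name, motif index, z tag (e.g. "zE"), z-score
ROW = re.compile(
    r"(?P<program>\S+)\s+(?P<index>\d+)\s+(?P<z_tag>z\S*)\s+(?P<z>-?\d+(?:\.\d+)?)"
)
# trailing 'label: number' fields such as E:, Ra:, and (in some files) k:
FIELD = re.compile(r"([A-Za-z]+):\s*(-?\d+(?:\.\d+)?)")

def parse_cluster_source(text):
    """Parse the source/score portion of one .cluster row into a dict."""
    match = ROW.search(text)
    if match is None:
        raise ValueError("unrecognized source/score text: %r" % text)
    record = {
        "program": match["program"],
        "index": int(match["index"]),
        "z_tag": match["z_tag"],
        "z_score": float(match["z"]),
    }
    # Collect whatever labeled scores follow the z-score.
    for label, value in FIELD.findall(text[match.end():]):
        record[label] = float(value)
    return record

row = parse_cluster_source("alignace     3 zE   3.43 E:  22.035  Ra: 0.7591")
```

Because the labeled fields are collected generically, an optional "k:" field would appear in the result without any change to the parser.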

(*) The z-score is computed both for the Enrichment score and the ROC auc score.  The maximal z-score is reported.