These pages contain the FASTA-formatted input files, the intermediate
motif discovery output files, and the clustering results files for
Harbison et al. Links and descriptions of file formats can be found below.
The files are organized into the following directories. File
formats are described in detail below.
or (2.2 MB tgz)
|FASTA formatted files for 310
location analysis experiments. Sequences for all probes on the 6k microarray can be found
here (3.2 MB).
|Binding Data (11.4 MB)
||Binding data as a matrix of P-values indexed by probe name and experiment,
in comma-separated format (".csv")
or (0.07 MB tgz)
|Motifs found using the method
described in Kellis et al.
or (2.9 MB tgz)
|Motifs found using the CONVERGE
program described in Methods
or (2.1 MB tgz)
|Motifs found using MDscan
or (0.4 MB tgz)
|Motifs found using MEME
or (0.4 MB tgz)
|Motifs found using MEME applied
to a genome in which unconserved bases were replaced with "Ns"
or (0.1 MB tgz)
|Raw clustering results of
significant motifs found by the programs above.
|Motifs used in assembling the map,
assembled from the clusters above.
GFF format used to represent the
map of the regulatory code at
different levels of confidence. Updated 12-9-04
Text files representing the map of the regulatory code, formatted for import into spreadsheets. Versions are
provided at different levels of confidence.
NOTE UPDATE: All genomic locations are based on the version of the yeast genome
available from SGD in March 2003. The sequences are available in GenBank format (from SGD)
here, and in FASTA format (locally) here.
NOTE UPDATE: Lists of the genes/orfs bound by each transcription factor.
|Motifs and their raw scores are
represented in the format depicted to the right.
(a) The scoring
matrix is represented according to the log2(Pi/Qi)
at each position, in which Pi is the probability of
observing the letter at a particular position, and Qi is the
background probability. In the final set of motifs, the scoring matrix
is followed by the probability matrix of Pi's (but not in the example
to the right.)
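The definition in (a) can be sketched in a few lines of Python. This is a minimal illustration only: the two-position motif, the uniform background, and the function name log_odds_matrix are assumptions made for the example and are not taken from the data files.

```python
import math

def log_odds_matrix(prob_rows, background):
    """Convert per-position probabilities Pi into log2(Pi/Qi) scores.

    Assumes every probability is greater than zero (for example after
    pseudocounts have been applied); log2(0) is undefined.
    """
    matrix = []
    for row in prob_rows:
        matrix.append({letter: math.log2(p / background[letter])
                       for letter, p in row.items()})
    return matrix

# Hypothetical two-position motif and uniform background (assumptions).
probs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

scores = log_odds_matrix(probs, background)
for position, row in enumerate(scores):
    print(position, row)
```

A letter that is twice as likely as background scores +1, a letter at background frequency scores 0, and a letter at half the background frequency scores -1.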
(b) A text-based logo of the motif
in which the height of the letters is proportional to
the sum of the relative information content of all the letters at that
position.
(c) A single line
repeating the one-letter summary of the motif, along
with several scores and metrics. For example, the following line
summarizes this information:
MAP: 42.51 D: 0.000 -1) E: 16.860 ch:
18.34 f: 1.00 Ra: 0.9945
|Total number of bits
|For AlignACE and
MDscan, this is the MAP score. For MEME and MEME_c, this is
|Enrichment Score, as
described in Methods
|Specificity Score, as
described in Hughes et al. (Not Used)
|Fraction of probes
containing motif (1.00 means all probes)
|ROC area under the
curve, as described in Methods
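For readers who want to pull these values out of the summary line programmatically, the following Python sketch may help. It assumes that every field is written as a label, a colon, and a number, as in the example line above; the function name parse_score_line and the regular expression are illustrative choices, not part of the original tools. Tokens that do not fit that pattern are ignored.

```python
import re

# A label made of letters, a colon, optional whitespace, then a number.
FIELD_RE = re.compile(r"([A-Za-z]+):\s*(-?\d+(?:\.\d+)?)")

def parse_score_line(line):
    """Return a dict mapping each 'label: number' pair to a float."""
    return {label: float(value) for label, value in FIELD_RE.findall(line)}

# The example summary line from the text, joined onto one line.
line = "MAP: 42.51 D: 0.000 -1) E: 16.860 ch: 18.34 f: 1.00 Ra: 0.9945"
fields = parse_score_line(line)
print(fields["E"], fields["Ra"])
```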
(d) A set of 20 generated
sequences that, when taken together,
approximate the probability values at each position of the
matrix. Used mainly for convenience with other programs that take
sets of sequences instead of numerical values.
The "4" value is
not used. The trailing number (e.g. 17.996) is the score the
matrix assigns to each sequence.
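The idea behind (d) can be illustrated with a short Python sketch. It is not the procedure used to produce the files: the rounding scheme, the assumption that a sequence's score is the sum of its per-position log2(Pi/Qi) values, the example probabilities, and the function names are all assumptions made for illustration.

```python
import math

def approximate_sequences(prob_rows, n=20):
    """Build n sequences whose per-column letter counts approximate prob_rows.

    Assumes each row of probabilities sums to 1.
    """
    columns = []
    for row in prob_rows:
        # Largest-remainder rounding so each column holds exactly n letters.
        raw = {letter: p * n for letter, p in row.items()}
        counts = {letter: int(x) for letter, x in raw.items()}
        leftover = n - sum(counts.values())
        by_remainder = sorted(raw, key=lambda l: raw[l] - counts[l], reverse=True)
        for letter in by_remainder[:leftover]:
            counts[letter] += 1
        columns.append("".join(letter * counts[letter] for letter in sorted(counts)))
    # Read across the columns to obtain the individual sequences.
    return ["".join(col[i] for col in columns) for i in range(n)]

def score_sequence(seq, prob_rows, background):
    """Sum log2(Pi/Qi) over the positions of seq (assumed scoring rule)."""
    return sum(math.log2(row[ch] / background[ch])
               for ch, row in zip(seq, prob_rows))

# Hypothetical two-position motif and uniform background (assumptions).
probs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

seqs = approximate_sequences(probs)
for s in seqs[:3]:
    print(s, score_sequence(s, probs, background))
```

This construction reproduces only the per-position letter frequencies. Because letters are laid out in sorted order within each column, it does not make the positions independent of one another.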
The .cluster files are generated by filtering the discovered motifs from all programs
for a particular location analysis experiment, and submitting the
results to k-medoids clustering as described in Methods.
The example at right demonstrates the results for clustering
significant motifs from the FKH2_H2O2Lo experiment.
All motifs that pass the significance tests are included in
the .cluster files. Empty (zero-byte) files represent experiments for
which no significant motifs were found.
(a) The number of clusters discovered (k=3) and the average
intra-cluster distance are
summarized at the top.
(b) Clusters are
numbered starting from zero. Each row in each cluster summarizes
information about a single motif that is a member of the cluster.
The first column (containing mostly dots) functions as a placeholder,
except for the entry marked "*-," which denotes the medoid of the
cluster. The letters in the second column are not used in this
study. They denote which motifs have the maximum (capital) and
median (lowercase) scores as noted (R - ROC auc, E - Enrichment, B -
Total bits of information).
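As a reminder of what the medoid marker means, the sketch below picks the cluster member with the smallest total distance to all other members, which is the standard definition of a medoid. The motif names and pairwise distances are hypothetical, and the distance measure actually used is the one described in Methods, not this lookup table.

```python
def medoid(members, distance):
    """Return the member with the smallest total distance to all members."""
    return min(members,
               key=lambda m: sum(distance(m, other) for other in members))

# Hypothetical pairwise distances between three motifs (assumptions).
pairwise = {("m0", "m1"): 0.2, ("m0", "m2"): 0.5, ("m1", "m2"): 0.4}

def lookup_distance(a, b):
    """Symmetric lookup; the distance from a motif to itself is zero."""
    if a == b:
        return 0.0
    return pairwise.get((a, b), pairwise.get((b, a)))

center = medoid(["m0", "m1", "m2"], lookup_distance)
print(center)
```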
(c) A representation of the motif alignment that generated the cluster.
(Note that in this example, the zeroth cluster appears to contain the
composite Fkh2-Mcm1 binding site.)
(d) Source and score information for each motif in each cluster. In the
example entry
alignace 3 zE 3.43 E: 22.035 Ra: 0.7591
the motif was found by AlignACE, is the 3rd motif in the corresponding
file (FKH2_H2O2Lo.ttace), and has
a z-score of 3.43 (explained more below*), an Enrichment score of
22.035, and a ROC auc score of 0.7591. Some motifs in other files
may also have a "k:" field, denoting the CC4 score as measured
according to Kellis et al.
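An entry of this kind can be split into its parts with a few lines of Python. The sketch assumes the layout of the example entry: program name, motif index, z-score label, z-score value, then any number of "label: number" fields. The function name and dictionary keys are illustrative and not part of the original tools.

```python
import re

def parse_cluster_entry(entry):
    """Split a source-and-score entry into a dict (layout is assumed)."""
    tokens = entry.split()
    record = {
        "program": tokens[0],           # e.g. alignace
        "motif_index": int(tokens[1]),  # position in the program's output file
        "z_label": tokens[2],           # e.g. zE
        "z_score": float(tokens[3]),
    }
    # Remaining fields follow the 'label: number' pattern (E:, Ra:, k:, ...).
    for label, value in re.findall(r"([A-Za-z]+):\s*(-?\d+(?:\.\d+)?)", entry):
        record[label] = float(value)
    return record

record = parse_cluster_entry("alignace 3 zE 3.43 E: 22.035 Ra: 0.7591")
print(record)
```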
(*) The z-score is computed both for the Enrichment score and the ROC
auc score. The maximal z-score is reported.
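The rule of reporting the larger of the two z-scores can be sketched as follows. The reference distributions against which the z-scores are computed are not described here, so the reference values below, like the function names, are purely hypothetical placeholders.

```python
from statistics import mean, stdev

def z_score(value, reference):
    """Standard z-score of value against a reference sample."""
    return (value - mean(reference)) / stdev(reference)

def max_z(scores, references):
    """Compute a z-score per metric and return the largest with its label."""
    z = {name: z_score(scores[name], references[name]) for name in scores}
    best = max(z, key=z.get)
    return best, z[best]

# Hypothetical motif scores and reference samples (assumptions).
scores = {"E": 4.0, "Ra": 0.6}
references = {"E": [1.0, 2.0, 3.0], "Ra": [0.4, 0.5, 0.6]}

best, value = max_z(scores, references)
print(best, value)
```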