When one does de novo motif discovery with multiple motifs, often many very similar motifs are discovered. To make it easier to analyze these motifs, WebMOTIFS clusters and averages motifs from basic, de novo motif discovery. This clustering step occurs after the motifs have been scored and filtered for significance.
WebMOTIFS' clustering and
averaging methods are adapted from the methods used by Harbison et al.
Harbison et al. "Transcriptional regulatory code of a eukaryotic genome." Nature. 431(7004) (2004 Sept. 3):99-104.
The explanations below are based on the supplementary
methods for that paper, which can be found here (PDF, external link).
Motif discovery programs often discover a number of very similar
motifs. In addition, if more than one motif discovery program is run,
several programs may identify the same basic motif.
To identify these common patterns, we cluster the significant motifs
and average all members of the same cluster. The average motif
can be represented and scored much like a single motif.
K-medoids Clustering: The set of significant motifs is clustered with
k-medoids clustering, using the inter-motif distance metric described
below. For a given number of clusters, we run the k-medoids algorithm 5
times to find a clustering with the minimal sum of within-cluster
distances. To find the optimal number of clusters, we begin
with 10 clusters, and then decrease the number of clusters
until the average distances between members each medoid and the members
of other clusters is sufficiently large (greater or equal to 0.15). This
avoids splitting very similar sets of motifs into an arbitrarily large
number of similar clusters.
For more information on the k-medoids clustering algorithm, see:
Hastie, T., Tibshirani, R. & Friedman, J. The elements of Statistical Learning;
Data mining, inference and prediction. Springer-Verlag: New York, 2001.
Please note that WebMOTIFS' clustering algorithm treats dimers as single motifs. This is a very basic approach. An alternative approach would be to separate dimers into parts and cluster them into monomers, as in Jensen and Liu's hierarchical Bayesian clustering method. (Shane Jensen, Lei Shen, and Jun Liu. "Combining phylogenetic motif discovery and motif clustering topredict co-regulated genes." Bioinformatics 21(20) 2005: pp. 3832-9.)
We use a distance metric to aid in the comparison of motifs. The distance D
between two aligned motifs "a" and "b" is defined as,
where w is the motif width, and ai,L and bi,L are the estimated probabilities of observing
base L at position i of motifs a and b/ respectively.
In practice, the optimal alignment of motifs is not known. We therefore use the
minimum distance between motifs among all alignments in which the motifs overlap by
at least seven bases, or when the motifs are shorter, by 2 bases fewer than the shortest
motif length. Alignments to the reverse complements of the motifs are included.
We compute a single motif representing each cluster. This motif is basically an average of the motifs in the cluster. We find the optimal alignment of the motifs in the cluster, then average the probabilities at each matrix position of the aligned motifs making up the cluster. [Low-information
positions on the flanks of the averaged motifs are removed.]
When computing a single motif to represent each cluster, we need to consider that each cluster can contain motifs from multiple motif discovery programs. Some of these programs tend to discover many nearly identical motifs, while others return only one motif of each type. To give both types of programs equal weight when computing the representative motif for a cluster, we include in our average only one motif of each type from each program.
Each average motif is then scored. For information on the metrics used, see the scoring and significance filtering page.
We provide the enrichment score, median enrichment z-score, bits, and group specificity scores for
each average motif: these are calculated as described on the scoring
and significance filtering page.