This page describes how WebMOTIFS scores and filters the raw output from the motif discovery methods. Scoring and significance filtering methods for basic motif discovery are adapted from the methods we used in:
Harbison et al. "Transcriptional regulatory code of a eukaryotic genome." Nature. 431(7004) (2004 Sept. 3):99-104.
Scoring methods for Bayesian motif discovery are adapted from the methods used in
MacIsaac et al. "A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data", Bioinformatics. 22(4) (2006 Feb. 15):423-429.
The explanations of the enrichment score and significance filtering below are based on the supplementary methods for Harbison et al., which can be found here (PDF, external link).
Enrichment score: We use the hypergeometric distribution to measure the probability that the observed number of motif matches in the input set would be found if the sequences had been selected at random from the genome.
We compute a p-value for the enrichment using the following hypergeometric tail formula:

p = Σ_{i = b}^{min(g, B)} [ C(g, i) · C(G − g, B − i) ] / C(G, B)

where B is the number of input sequences and G is the total number of sequences represented on the microarray or in the genome. The quantities b and g represent the subsets of B and G, respectively, that match the motif, and C(n, k) denotes the binomial coefficient.
The enrichment score is equal to -log10(p).
We report the enrichment score for each cluster average and for each individual motif within a cluster.
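The hypergeometric calculation above can be sketched directly with Python's standard library. This is a minimal illustration, not WebMOTIFS's actual implementation; the function names `enrichment_p_value` and `enrichment_score` are hypothetical.

```python
from math import comb, log10

def enrichment_p_value(b, g, B, G):
    """Hypergeometric tail probability of observing at least b motif-matching
    sequences among the B input sequences, when g of the G sequences in the
    genome (or on the microarray) match the motif."""
    total = comb(G, B)
    # Sum P(X = i) for i = b .. min(g, B); integer arithmetic keeps this exact
    return sum(comb(g, i) * comb(G - g, B - i)
               for i in range(b, min(g, B) + 1)) / total

def enrichment_score(b, g, B, G):
    """Enrichment score, defined as -log10 of the hypergeometric p-value."""
    return -log10(enrichment_p_value(b, g, B, G))
```

Summing the tail from b upward (rather than taking a single term) gives the probability of seeing b *or more* matches by chance, which is what makes small p-values evidence of enrichment.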
Program Specific Score: Each motif discovery program we use has its own native score for ranking motifs. For AlignACE and MDscan, this is the MAP (maximum a priori log likelihood) score.
For MEME, this is -log10(E-value).
For Weeder, this is the significance score.
Please refer to the documentation of AlignACE, MDscan, MEME, and Weeder for the underlying probability models.
We report the program specific score for each individual motif within a cluster.
Empirical Probability: When applied to randomly selected sequences, motif discovery programs often find motifs that would be scored as significant by many of the most commonly used metrics. To separate true motifs from false ones, we convert a motif's conventional score into the empirical probability that a motif with a similar score could be found by the same program in randomly selected sequences.
In practice, we compute the empirical probability of the enrichment score and the program specific score using fifty control calculations, carried out with the same parameters and approximately the same number of input sequences as the original calculation. This empirical probability is reported as a z-score.
The enrichment z-score of a motif is equal to:

z = (s − μ) / σ

where s is the motif's enrichment score, and μ and σ are the mean and standard deviation of the enrichment scores from the fifty control calculations.
The program specific z-score is computed in a similar manner.
We report the enrichment z-score and program specific z-score for each individual motif within a cluster. For each cluster, we also report the median enrichment z-score.
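The z-score computation described above can be sketched as follows; `empirical_z_score` is a hypothetical name for illustration, and the control scores stand in for the fifty randomized runs.

```python
from statistics import mean, stdev

def empirical_z_score(observed_score, control_scores):
    """Z-score of an observed score relative to scores from control
    (randomized) runs: (observed - mean) / standard deviation."""
    return (observed_score - mean(control_scores)) / stdev(control_scores)
```

A large positive z-score means the motif scored far better on the real input set than the same program typically scores on random sequences.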
Bits: The total information content of the motif, in bits. We report the number of bits for each cluster average.
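For a DNA position weight matrix, the total bits of information is conventionally the sum over positions of 2 minus the column's entropy (assuming a uniform background). A minimal sketch, with the hypothetical function name `information_bits`:

```python
from math import log2

def information_bits(pwm):
    """Total information content of a DNA position weight matrix, in bits.
    pwm is a list of per-position dicts mapping base -> probability.
    Each position contributes 2 - H(position) bits, assuming a uniform
    background over the four bases."""
    total = 0.0
    for column in pwm:
        entropy = -sum(p * log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total
```

A perfectly conserved position contributes the maximum of 2 bits; a position with all four bases equally likely contributes 0.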
Group Specificity Score, also called Church Score: The group specificity score was developed by Hughes et al. It measures "the degree to which the distribution of sites is skewed toward the input set": how specific a motif is to the input sequences in which it was found. Thus, it has an advantage over other over-representation metrics such as the MAP score because it does not give high scores to motifs that are over-represented throughout the genome.
We calculate a motif's group specificity score with the following hypergeometric formula:

S = Σ_{i = x}^{min(s1, s2)} [ C(s1, i) · C(N − s1, s2 − i) ] / C(N, s2)

where N is the total number of ORFs in the organism, s1 is the number of ORFs in the group used to find the motif, s2 is the number of ORFs in the target genes, x is the number of ORFs in the intersection of the group used to find the motif and the target genes, and C(n, k) denotes the binomial coefficient.
For more information on the group specificity score, see:
Hughes, JD, Estep, PW, Tavazoie S, and GM Church, "Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae", Journal of Molecular Biology. 296(5) (2000 March 10):1205-14.
We report the group specificity score for each cluster average.
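The group specificity score has the same hypergeometric-tail form as the enrichment p-value, just with ORF counts in place of sequence counts. A minimal sketch, with the hypothetical function name `group_specificity`:

```python
from math import comb

def group_specificity(x, s1, s2, N):
    """Hypergeometric tail probability that at least x of the s2 target ORFs
    fall within the s1-ORF group used to find the motif, out of N ORFs total.
    Smaller values indicate a motif more specific to the input group."""
    total = comb(N, s2)
    return sum(comb(s1, i) * comb(N - s1, s2 - i)
               for i in range(x, min(s1, s2) + 1)) / total
```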
Cross-validation error: The mean cross-validation error achieved by the hypothesis. THEME performs motif discovery multiple times for each initial hypothesis, using a different subset of the input sequences each time. The cross-validation error measures how often THEME's discovered motif correctly predicts binding sites in the held-out input sequences.
Lower cross-validation errors generally indicate better motifs, though the relationship between CV error and motif quality varies between FBPs and organisms. Typically, a CV error below .3 indicates a significant motif, and a CV error above .4 indicates an insignificant motif. The principled way to evaluate motif significance is to compare the CV error of a motif to the average CV error obtained by running THEME on randomly chosen sequences. Currently, these randomizations can be performed only with the downloadable version of THEME, but we hope to include them in WebTHEME soon.
Cross-validation error is the best measure of motif quality for motifs discovered by WebTHEME. Thus, in the final output, we rank motifs by their CV errors, from smallest to largest. Because motifs with CV error above .4 are rarely significant, we place motifs with CV error less than or equal to .4 in a table labeled "Most significant motifs", and we place motifs with CV error greater than .4 in a table labeled "Less significant motifs".
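The ranking and bucketing described above can be sketched as follows; `rank_by_cv_error` is a hypothetical name, and motifs are represented simply as (name, cv_error) pairs.

```python
def rank_by_cv_error(motifs):
    """Sort (name, cv_error) pairs from smallest to largest CV error, then
    split at the 0.4 cutoff into 'most significant' (<= 0.4) and
    'less significant' (> 0.4) groups, as in the final output tables."""
    ordered = sorted(motifs, key=lambda m: m[1])
    most = [m for m in ordered if m[1] <= 0.4]
    less = [m for m in ordered if m[1] > 0.4]
    return most, less
```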
Enrichment score: As for TAMO. See here.
Beta: The optimal weighting of the original hypothesis (determined by cross-validation). Beta values vary between 0 and 1. A high value of beta indicates a high degree of similarity between the refined motif and the initial hypothesis.
LLR Match Threshold: The log-likelihood ratio score used as a matching threshold by the best SVM classifier. Ranges from zero to one. Higher values indicate that only sequences very similar to the motif were considered as matches.
We filter discovered motifs by enrichment score. We consider motifs to be significant if the p-value of the enrichment is less than a specified threshold.
The p-value is calculated using the observed enrichment score for the motif, together with the mean and standard deviation of the enrichment scores calculated in randomization runs with the same program and approximately the same number of probes. Our calculation of the p-value assumes a normal distribution for the enrichment scores.
The user can choose between p-value thresholds of .001 (stringent), .01 (medium), and .1 (lenient). Stringent filtering, with a cutoff of .001, is the default; this is the threshold that Harbison et al. found minimized false positives and identified the correct motifs for many regulators with known specificity.
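Under the normality assumption stated above, the filtering step amounts to an upper-tail normal p-value computed from the randomization runs. A minimal sketch, with hypothetical function names `enrichment_p_from_controls` and `is_significant`:

```python
from math import erf, sqrt
from statistics import mean, stdev

def enrichment_p_from_controls(observed, control_scores):
    """Upper-tail p-value for the observed enrichment score, assuming the
    control-run enrichment scores are normally distributed."""
    z = (observed - mean(control_scores)) / stdev(control_scores)
    # Normal survival function: P(Z >= z)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def is_significant(observed, control_scores, threshold=0.001):
    """Apply a p-value cutoff; 0.001 matches the default 'stringent' filter."""
    return enrichment_p_from_controls(observed, control_scores) < threshold
```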
© Copyright The Fraenkel Lab 2010, Massachusetts Institute of Technology. For more information contact firstname.lastname@example.org