We found that the clusters found in microarray data do not in general agree with functional annotation classes. Although many statistically significant relationships can be found, the majority of clusters are not related to known biology (as described in annotation ontologies). This implies that use of guilt-by-association is not supported by annotation ontologies. Depending on the estimate of the amount of noise in the data, our results suggest that bioinformatics has only codified a small proportion of the biological knowledge required to understand microarray data.
The annotated clusters can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/.
Two main approaches have been employed in testing the reliability of microarray clusters: self-consistency, and consistency with known biology. In self-consistency the idea is to predict some left out information in a cluster. For example Yeung et al. [Yeung2001] used self-consistency to assess the results of their clustering by leaving out one experimental condition (or time point), and using this held out condition to test the predictive ability of the clusters. They were testing whether the cluster could predict the value of the missing experimental condition, or the value at the missing time point.
The idea behind the use of existing biological knowledge is that if a cluster is consistent with known biological knowledge then it reflects a real feature in the data. This is essentially a systematic version of the informal approach generally taken to evaluate clustering. This idea was used by Tavazoie et al. [Tavazoie1999] who clustered the S. cerevisiae cdc28 dataset [Cho1998] with k-means clustering (k=30), and checked the validity of their clustering by mapping the MIPS functional classes onto the clusters to show which classes were found more often than by chance in each cluster. Several clusters were significantly enriched for one or more classes.
In this paper we combine the ideas of self-consistency and using known biological knowledge to test systematically the relationship between microarray clusters and known biology. We examine, for a range of clustering algorithms, the quantitative self-consistency of the functional classifications in the clusters.
Hierarchical and k-means clustering are the two methods currently available on the Expression Profiler (http://ep.ebi.ac.uk/) web server. The Expression Profiler is an up to date set of tools available over the web for clustering, analysis, and visualisation of microarray data.
Hierarchical clustering is an agglomerative method, joining the closest two clusters each time, then recalculating the inter-cluster distances, and joining the next closest two clusters together. We chose the average linkage of Pearson correlation as the measure of distance between clusters, and a cut-off value of 0.3.
K-means clustering is a standard clustering technique, well known in the fields of statistics and machine learning. Correlation was used as distance measure. We used k=100.
The QT_CLUST algorithm is described in Heyer et al. [Heyer1999]. We implemented a modified version of this algorithm. The data was normalised so that the data for each ORF had mean 0 and variance 1. We did not implement Heyer et al.'s method of removal of outliers by computing jackknife correlations. Pearson correlation was used as a similarity measure, and each ORF was used to seed a cluster. Further ORFs were added to the cluster if their similarity with all ORFs in the cluster was greater than a fixed threshold (the cluster diameter). This was taken to be our final clustering. We henceforth refer to this method as QT_CLUST_MOD. We did not, as Heyer et al. do, set aside the largest cluster and recluster, since we did not demand a unique solution, and in fact we wanted a clustering in which each ORF could be in more than one cluster. Allowing each ORF to be in more than one cluster is similar to the situation when using the functional hierarchies as ORFs can belong to more than one functional class. There can be as many clusters as there are ORFs - there is no fixed number of clusters which has to be decided beforehand or as part of the training process. We used a cluster diameter (minimum similarity) of 0.7.
All clustering parameters (hierarchical cut-off value, k means, cluster diameter) were chosen to be reasonable after experimentation with various values. Although parameters could possibly have been further refined, this was not the aim of this work.
We form this measure as follows: in the k-means and hierarchical clusterings, each ORF appears in only one cluster, so we can take each ORF in turn and test whether or not the cluster without this ORF can predict its class; for the QT_CLUST_MOD scheme, the clusters are seeded by each ORF in turn, and we test for each seed in turn whether or not the rest of the cluster predicts the class of the seed. That is, we test whether or not the majority class of the cluster is one of the classes of the heldout ORF. The majority class is the class most frequently found among the ORFs in the cluster. We call this measure the `Predictive Power' of the clustering. The measure is related to our previous work in predicting ORF function from sequence [King2001]. Tests of predictive power were only carried out on clusters which had more than 5 ORFs, and only on ORFs which were not classified as unknown.
For example, if a cluster contained ORFs belonging to the following classes:
ORF Class orf1 A orf2 B orf3 A orf4 A orf5 A orf6 A orf7 A orf8 A orf9 B orf10 Athen the majority class of the cluster without orf1 would be "A", which is a correct prediction of the actual class of orf1. However orf2 and orf9 would be incorrectly predicted by this cluster. Class "A" is correctly predicted in 8/10 cases. (Class "B" is never predicted by this cluster, since it is never the majority class).
To test the statistical significance of this measure we used a Monte Carlo type approach. ORFs were chosen at random, without replacement, and random clusters formed using the same cluster size distribution as was observed in the real clusterings. The resulting random clusters were then analysed in the same manner as the real clusters. A thousand random clusterings were made each time, and both the mean results reported, and how many times the random cluster accuracy for a functional class was equal or exceeded that of the actual clusters.
An example of a strong cluster in shown in Table 1. The probability of this cluster occurring by chance is estimated to be less than 10^(-10) (calculated by hypergeometric distribution). However, note the mannose related sub-cluster in this cluster. The most likely explanation for the sub-cluster is: as the yeast data was formed to study cell division, histones and mannose are both required in the same time specific pattern during division. They therefore have either share the same transcription control mechanism, or have ones similarly controlled. This hypothetical common transcriptional control in cell division is not reflected in the current annotation.
Clusters which appeared to be unrelated were common. There were also clusters which did seem contain related ORFs, but less obviously than the histone cluster mentioned above. Table 2 shows an example of a cluster which does show a DNA processing theme, but this theme is not reflected in the variety of classifications of the ORFs.
To quantitatively test the relationship between clusters and annotations we evaluated all the clusters formed using our measure of predictive power (see Table 3). For S. cerevisiae we calculated the predictive power of each clustering method (k-means, hierarchical, and QT_CLUST_MOD) for each functional class in levels 1 and 2 of MIPS and GO. For E. coli we calculated this for each functional class in level 1 of the GenProtEC hierarchy. This produced a large number of annotated clusterings, and the complete set of these can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/. The same broad conclusions regarding the relationship between clusters and annotation were true for both species and using all clustering methods; we have therefore chosen to present the S. cerevisiae tables only for hierarchical clustering and for only the first levels of the GO and MIPS annotation hierarchies, and one table for E. coli. The predictive power of the different clustering methods can be seen in Tables 4, 5, 6 and 7. The results are broken down according to the majority classes of the clusters. Absence of data for a class indicates that this class was not the majority class of any cluster.
All the clustering methods produced statistically significant clusters with all three functional annotation schemes. This confirms that clusters produced from microarray data reflect some known biology. However, the predictive power of even the best clusters of the clearest functional classes is low (mostly <50%). This means that if you predict function based on guilt-by-association your predictions will contradict existing annotations a large percentage of the time.
One of the clearest messages from the data is the large difference in predictive power across the different: microarray experiments, clustering methods, and annotation schemes. There is no clear best approach and quite different significance results are obtained using different combinations.
Perhaps the most interesting differences are those between different microarray experiments: alpha, cdc15, cdc28, and elu. It is to be expected that different microarray experiments will highlight different features of cell organisation. However it is unclear how biologically significant these differences are. Using the GO annotation scheme:
The classes highlighted also differed significantly between clustering methods. Considering first the GO annotation scheme: the clustering method k-means is best for predicting enzyme class; QT_CLUST_MOD is best for the classes nucleic acid binding, chaperone, and cell-adhesion; and hierarchical clustering is best for structural protein, transporter, and ligand binding or carrier chaperone. Considering the MIPS annotation scheme: the clustering method k-means is best for predicting classes cell rescue, defence, cell death and ageing, and protein synthesis; QT_CLUST_MOD is best for the class transcription; and hierarchical clustering is best for cell organisation, metabolism, cell growth cell division and DNA synthesis.
The data also revealed some unexpected apparent negative correlations between clusters and classes. For example using the MIPS annotation scheme and hierarchical clustering the cdc28 clusters for class cellular transport and transport mechanism the random clustering produced a higher predictive power >95% of the time. Transport proteins seem particularly poorly predicted both in both S. cerevisiae and E. coli. A possible explanation for this is that their transcription control is synchronised with the specific pathways they are involved with rather than as a group.
Do the two annotation schemes of MIPS and GO agree on cluster consistency? Sometimes, with strong clusters, such as within the ribosomal clusters (see Figure 1). But on the whole, the correlation between the scores given by the two annotation schemes is approximately 0. This is partly due to the fact that GO had many fewer annotations than MIPS, so there are several clusters which show a trend under MIPS annotation that cannot be seen under GO, because too many ORFs have no annotation.
Are the annotation schemes improving over time with respect to these clusters? Tables 7 and 8 show the accuracies of the clusters found by hierarchical clustering under MIPS annotations. Table 7 uses the MIPS annotations from 21st December 2000, whereas Table 8 uses the MIPS annotations from 24th April 2002, 16 months later. The accuracies are almost identical.
Does simple preprocessing help? Table 9 shows a comparison of accuracy when normalisation or removal of ORFs with low standard deviation was used. There is no consistent trend of improvement or degradation and very little difference between the results.
It has been commented that perhaps other distance measures or a different choice of linkage could be more appropriate for the use of hierarchical clustering on expression data. We show that use of Euclidean distance and complete linkage does little to change the accuracy and in fact seems worse than correlation and average linkage for the alpha dataset. This can be seen in Table 10 by comparison to Table 9.
A major challenge in microarray analysis is therefore to discriminate between the old and new biology in the data and the noise. To achieve this we require
ORF | description | MIPS classes |
ybr008c | fluconazole resistance protein | 11/7/0/0 7/28/0/0 |
ypl127c | histone H1 protein | 30/10/0/0 30/13/0/0 |
ynl031c | histone H3 | 30/10/0/0 30/13/0/0 4/5/1/4 |
ynl030w | histone H4 | 30/10/0/0 30/13/0/0 4/5/1/4 |
ylr455w | weak similarity to human G/T mismatch binding protein | 99/0/0/0 |
ygl065c | mannosyltransferase | 1/5/1/0 6/7/0/0 |
yer003c | mannose-6-phosphate isomerase | 1/5/1/0 30/3/0/0 |
ydr225w | histone H2A | 30/10/0/0 30/13/0/0 4/5/1/4 |
ydr224c | histone H2B | 30/10/0/0 30/13/0/0 4/5/1/4 |
ydl055c | mannose-1-phosphate guanyltransferase | 1/5/1/0 9/1/0/0 |
ybr010w | histone H3 | 30/10/0/0 30/13/0/0 4/5/1/4 |
ybr009c | histone H4 | 30/10/0/0 30/13/0/0 4/5/1/4 |
ybl003c | histone H2A.2 | 30/10/0/0 30/13/0/0 4/5/1/4 |
ybl002w | histone H2B.2 | 30/10/0/0 30/13/0/0 4/5/1/4 |
Table 1: A yeast histone cluster (cdc15 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 203)
ORF | description | MIPS classes |
ycl064c | L-serine/L-threonine deaminase | 1/1/10/0 |
yol090w | DNA mismatch repair protein | 3/19/0/0 30/10/0/0 |
yol017w | similarity to YFR013w | 99/0/0/0 |
ynl273w | topoisomerase I interacting factor 1 | 99/0/0/0 |
ynl262w | DNA-directed DNA polymerase epsilon, catalytic subunit A | 11/4/0/0 3/16/0/0 3/22/1/0 30/10/0/0 |
ynl082w | DNA mismatch repair protein | 3/19/0/0 30/10/0/0 |
ynl072w | RNase H(35), a 35 kDa ribonuclease H | 1/3/16/0 |
ylr049c | hypothetical protein | 99/0/0/0 |
yjl074c | required for structural maintenance of chromosomes | 3/22/0/0 9/13/0/0 |
yhr153c | sporulation protein | 3/10/0/0 3/13/0/0 |
yhr110w | p24 protein involved in membrane trafficking | 6/4/0/0 8/99/0/0 |
ygr041w | budding protein | 3/4/0/0 |
ydl227c | homothallic switching endonuclease | 3/7/0/0 30/10/0/0 |
ydl164c | DNA ligase | 11/4/0/0 3/16/0/0 3/19/0/0 30/10/0/0 |
ydl156w | weak similarity to Pas7p | 99/0/0/0 |
ybr071w | hypothetical protein | 99/0/0/0 |
yar007c | DNA replication factor A, 69 KD subunit | 3/16/0/0 3/19/0/0 3/7/0/0 30/10/0/0 |
Table 2: A yeast DNA processing cluster. Note the variety of MIPS classes represented here. (cdc28 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 599)
MIPS ydr417c questionable ORF 99/0/0/0 ylr325c 60S large subunit ribosomal protein 30/3/0/0 5/1/0/0 yjl177w 60s large subunit ribosomal protein L17.e 30/3/0/0 5/1/0/0 yml063w ribosomal protein S3a.e 30/3/0/0 5/1/0/0 ydr447c ribosomal protein S17.e.B 30/3/0/0 5/1/0/0 ykl056c strong similarity to human IgE-dependent histamine-releasing factor 30/3/0/0 98/0/0/0 ylr061w ribosomal protein 5/1/0/0 ykr094c ubiquitin 30/3/0/0 5/1/0/0 6/13/1/0 yol039w acidic ribosomal protein P2.beta 30/3/0/0 5/1/0/0 ydr418w 60S large subunit ribosomal protein L12.e 30/3/0/0 5/1/0/0 ykl006w ribosomal protein 30/3/0/0 5/1/0/0 yol040c 40S small subunit ribosomal protein 30/3/0/0 5/1/0/0 yor167c 40S small subunit ribosomal protein S28.e.c15 30/3/0/0 5/1/0/0 ylr367w ribosomal protein S15a.e.c12 30/3/0/0 5/1/0/0 ypr102c ribosomal protein L11.e 30/3/0/0 5/1/0/0 ypr118w similarity to M.jannaschii translation initiation factor, eIF-2B 99/0/0/0 -------- GO ydr417c molecular_function unknown GO_0005554 ylr325c structural protein of ribosome GO_0005198 : GO_0003735 yjl177w structural protein of ribosome GO_0005198 : GO_0003735 yml063w structural protein of ribosome GO_0005198 : GO_0003735 ydr447c structural protein of ribosome GO_0005198 : GO_0003735 ykl056c molecular_function unknown GO_0005554 ylr061w structural protein of ribosome GO_0005198 : GO_0003735 ykr094c structural protein of ribosome GO_0005198 : GO_0003735 yol039w structural protein of ribosome GO_0005198 : GO_0003735 ydr418w structural protein of ribosome GO_0005198 : GO_0003735 ykl006w structural protein of ribosome* GO_0003676 GO_0005198 : GO_0003723 GO_0003735 yol040c structural protein of ribosome GO_0005198 : GO_0003735 yor167c structural protein of ribosome GO_0005198 : GO_0003735 ylr367w structural protein of ribosome GO_0005198 : GO_0003735 ypr102c structural protein of ribosome GO_0005198 : GO_0003735 ypr118w molecular_function unknown GO_0005554Figure 1: Ribosomal cluster as agreed by MIPS and GO. Semicolons separate the levels of GO classes. They agree on all except ykr094c. (This example is cluster ID: 390, alpha data, hierarchical clustering)
random | alpha | cdc15 | cdc28 | elu | E. coli | |
MIPS - hier | 56.257 | 61.062 (0) | 62.821 (0) | 61.341 (0) | 60.270 (0) | - |
MIPS - k | 59.304 | 58.795 (989) | 59.053 (908) | 59.677 (4) | 59.583 (17) | - |
MIPS - QT | 58.256 | 59.714 (16) | 61.697 (0) | 62.136 (0) | 59.631 (21) | - |
GO - hier | 52.265 | 62.526 (0) | 60.799 (0) | 61.087 (0) | 59.344 (0) | - |
GO - k | 59.301 | 61.649 (0) | 59.990 (1) | 62.067 (0) | 60.638 (0) | - |
GO - QT | 55.397 | 60.799 (0) | 60.236 (0) | 61.288 (0) | 60.146 (0) | - |
E. coli - hier | 52.906 | - | - | - | - | 57.785 (0) |
E. coli - k | 53.104 | - | - | - | - | 56.103 (0) |
E. coli - QT | 52.751 | - | - | - | - | 55.291 (0) |
Table 3: A summary of the average predictive power for each type of clustering for each experiment. Figures are percentage correct predictions. ``random'' shows mean over 1000 random clusterings. Figures in brackets show how many times out of 1000 the random clustering produced equal or greater than this percentage.
random | alpha | cdc15 | cdc28 | elu | |
enzyme | 59.487 | 64.578 (0) | 61.753 (64) | 61.771 (61) | 60.511 (238) |
nucleic acid binding | 16.355 | 26.531 (13) | 15.584 (574) | 25.439 (22) | 22.430 (78) |
structural protein | 12.767 | 50.725 (0) | 47.423 (0) | 57.792 (0) | 20.290 (52) |
transporter | 8.585 | 13.725 (154) | 32.432 (0) | 10.417 (330) | 5.357 (735) |
ligand binding or carrier | 6.825 | 4.167 (669) | 30.769 (0) | 3.571 (706) | 21.429 (4) |
chaperone | 3.170 | 13.636 (55) | - | 5.263 (278) | - |
signal transducer | 2.938 | 14.815 (34) | 15.000 (34) | 3.703 (303) | 11.765 (69) |
motor | 1.069 | - | - | - | 11.111 (41) |
Table 4: Yeast hierarchical clustering (cut-off=0.3) class by class breakdown at level 1 GO. First column shows average over 1000 random clusterings. alpha, cdc15, cdc28 and elu are the 4 cell-cycle synchronisation methods. Figures show percentage correct predictions. The figure in brackets is how many times out of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.
random | alpha | cdc15 | cdc28 | elu | |
cellular organization | 59.344 | 61.988 (4) | 63.258 (0) | 64.442 (0) | 61.146 (42) |
metabolism | 28.827 | 39.721 (1) | 40.136 (0) | 33.951 (82) | 37.061 (6) |
cell growth, cell division and DNA synthesis | 22.429 | 43.421 (0) | 46.961 (0) | 35.816 (1) | 22.963 (478) |
transcription | 20.879 | 23.333 (304) | 30.496 (26) | 33.333 (4) | 18.750 (681) |
protein destination | 14.988 | 17.021 (350) | 18.367 (266) | 14.865 (495) | 11.842 (710) |
cellular transport and transport mechanisms | 12.500 | 12.500 (491) | - | 2.273 (957) | 21.951 (70) |
cell rescue, defense, cell death and ageing | 9.495 | 18.750 (60) | 18.605 (60) | 21.739 (35) | 14.706 (163) |
transport facilitation | 8.024 | 14.815 (135) | 28.889 (2) | 2.703 (792) | - |
protein synthesis | 8.760 | 68.000 (0) | 15.385 (175) | 56.716 (0) | - |
energy | 5.891 | 13.636 (140) | - | 7.895 (349) | - |
cellular biogenesis | 5.241 | 11.765 (160) | - | 6.250 (367) | - |
ionic homeostasis | 2.805 | - | - | - | 14.286 (73) |
Table 5: Yeast hierarchical clustering (cut-off 0.3) class by class breakdown at level 1 MIPS. First column shows average over 1000 random clusterings. alpha, cdc15, cdc28 and elu are the 4 cell-cycle synchronisation methods. Figures show percentage correct predictions. The figure in brackets is how many timesout of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.
hier | k | QT | |
metabolism | 56.579 (0) | 55.581 (0) | 58.115 (0) |
location of gene products | 57.107 (0) | 55.037 (4) | 56.513 (0) |
cell structure | 59.794 (0) | 34.884 (294) | 30.755 (509) |
information transfer | 35.417 (116) | 26.315 (341) | 16.667 (699) |
regulation | 36.364 (68) | - | 7.692 (623) |
transport | 38.095 (69) | - | 13.333 (748) |
cell processes | 57.692 (11) | 58.140 (14) | 63.636 (8) |
extrachromosomal | 64.286 (0) | 39.189 (61) | 42.105 (3) |
Table 6: E. coli class by class breakdown at level 1. QT = QT_CLUST_MOD (0.7), k = k-means clustering (100), hier = hierarchical clustering (0.3). Figures show percentage correct predictions. The figure in brackets is how many times out of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.
hier | |
energy | 88.462 |
cellular organization | 65.163 |
protein destination | 62.500 |
metabolism | 56.738 |
cell growth, cell division and DNA synthesis | 48.649 |
transcription | 46.970 |
transport facilitation | 42.105 |
cellular transport and transport mechanisms | 37.500 |
cellular biogenesis | 20.000 |
cell rescue, defense, cell death and ageing | 16.667 |
Table 7: Gasch data set, MIPS classification as of 21/12/00. Class by class breakdown at level 1. Hierarchical clustering with cut-off 0.3. Figures show percentage correct predictions.
hier | |
energy | 79.310 |
protein fate (folding, modification, destination) | 66.667 |
subcellular localisation | 64.158 |
metabolism | 50.794 |
cell cycle and DNA processing | 48.148 |
transcription | 47.692 |
transport facilitation | 42.105 |
cellular transport and transport mechanisms | 35.294 |
control of cellular organization | 25.000 |
cell rescue, defense and virulence | 16.667 |
Table 8: Gasch data set, MIPS classification as of 24/4/02. Class by class breakdown at level 1. Hierarchical clustering with cut-off 0.3. Figures show percentage correct predictions.
plain | norm | rem low | norm and rem low | |
energy | 21.739 | 15.789 | 18.750 | 18.750 |
protein fate (folding, modification, destination) | 16.102 | 16.667 | 18.519 | 9.756 |
subcellular localisation | 61.079 | 61.460 | 62.097 | 62.267 |
metabolism | 39.432 | 38.387 | 37.500 | 29.208 |
cell cycle and DNA processing | 37.013 | 36.111 | 41.026 | 43.564 |
transcription | 20.800 | 29.134 | 16.346 | 18.447 |
transport facilitation | 11.765 | 19.355 | - | - |
cellular transport and transport mechanisms | 10.204 | 10.870 | - | - |
control of cellular organization | 8.696 | 10.000 | 15.385 | 20.000 |
cell rescue, defense and virulence | 21.622 | 21.622 | 33.333 | 22.222 |
cell fate | 16.129 | 17.544 | 25.000 | 14.634 |
protein synthesis | 61.765 | 67.741 | 43.478 | 52.174 |
regulation of/interaction with cellular environment | - | 15.385 | 14.286 | - |
Table 9: The effects of preprocessing on accuracy. Preprocessing of the data is compared. "plain" is a baseline, no preprocessing. "norm" is normalisation where mean and standard deviation are normalised to 0 and 1 respectively for each ORF. "rem low" is removal of ORFs with low standard deviation (in the bottom 25% of the data set). "rem low and norm" is removal of ORFs with low standard deviation followed by normalisation of the remaining ORFs. The dataset was alpha data, MIPS classification as of 24/4/02. Clustering was hierarchical clustering, average linkage of correlation, cut-off=0.3.
complete linkage, euclidean distance | |
energy | 18.182 |
protein fate (folding, modification, destination) | 24.107 |
subcellular localisation | 60.623 |
metabolism | 34.819 |
cell cycle and DNA processing | 19.388 |
transcription | 35.545 |
transport facilitation | - |
cellular transport and transport mechanisms | 10.169 |
control of cellular organization | 7.143 |
cell rescue, defense and virulence | 5.405 |
cell fate | 7.407 |
protein synthesis | 33.766 |
regulation of/interaction with cellular environment | 6.250 |
Table 10: Use of complete linkage and Euclidean distance for hierarchical clustering. Compare these values with the "plain column" in Table 9. Dataset was alpha data, MIPS classification as of 24/4/02. Clustering was hierarchical clustering, complete linkage of Euclidean distance, cut-off=1.0. (0.3 gave clusters containing a maximum of 2 ORFs only, so was too tight).
Duda, R., Hart, P. and Stork, P. (2000) Pattern Classification. John Wiley and Sons, New York