How well do we understand the clusters found in microarray data?

Amanda Clare* and Ross D. King

Department of Computer Science,
University of Wales Aberystwyth,
Penglais,
Aberystwyth
SY23 3DB
*Corresponding author:
Email: afc@aber.ac.uk

Abstract

We wished to quantify the state-of-the-art of our understanding of clusters in microarray data. To do this we systematically compared the clusters produced on sets of microarray data using a representative set of clustering algorithms (hierarchical, k-means, and a modified version of QT_CLUST) with the annotation schemes MIPS, GeneOntology and GenProtEC. We assumed that if a cluster reflected known biology its members would share related ontological annotations. This assumption is the basis of ``guilt-by-association'' and is commonly used to assign the putative function of proteins. To statistically measure the relationship between cluster and annotation we developed a new predictive discriminatory measure.

We found that the clusters found in microarray data do not in general agree with functional annotation classes. Although many statistically significant relationships can be found, the majority of clusters are not related to known biology (as described in annotation ontologies). This implies that use of guilt-by-association is not supported by annotation ontologies. Depending on the estimate of the amount of noise in the data, our results suggest that bioinformatics has only codified a small proportion of the biological knowledge required to understand microarray data.

The annotated clusters can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/.

Key words

ontology, statistics, machine-learning, transcriptome, microarray data.

Introduction

The most frequent method used to analyse microarray gene expression data is unsupervised learning in the form of clustering. This has been applied in many forms, including hierarchical clustering (top down and bottom up), Bayesian models, simple networks based on mutual information or jackknife correlation, simulated annealing, k-means clustering and self-organising maps [Alon1999, Eisen1998, Barash2001, Butte2000, Heyer1999, Lukashin2001, Tavazoie1999, Törönen1999]. How well do we understand the clusters produced by these methods? Most papers on expression data clustering report a selection of good clusters which the authors have selected by hand and which correspond to known biology. However microarray clustering experiments also generally produce clusters which seem to correspond less well with known biology. This type of cluster has received less attention in the literature. A common rationale behind using clustering algorithms is guilt-by-association [Walker1999]: if gene expression patterns are similar then the genes are likely to share similar biological function. Is this generally true?

Two main approaches have been employed in testing the reliability of microarray clusters: self-consistency, and consistency with known biology. In self-consistency the idea is to predict some left out information in a cluster. For example Yeung et al. [Yeung2001] used self-consistency to assess the results of their clustering by leaving out one experimental condition (or time point), and using this held out condition to test the predictive ability of the clusters. They were testing whether the cluster could predict the value of the missing experimental condition, or the value at the missing time point.

The idea behind the use of existing biological knowledge is that if a cluster is consistent with known biological knowledge then it reflects a real feature in the data. This is essentially a systematic version of the informal approach generally taken to evaluate clustering. This idea was used by Tavazoie et al. [Tavazoie1999] who clustered the S. cerevisiae cdc28 dataset [Cho1998] with k-means clustering (k=30), and checked the validity of their clustering by mapping the MIPS functional classes onto the clusters to show which classes were found more often than by chance in each cluster. Several clusters were significantly enriched for one or more classes.

In this paper we combine the ideas of self-consistency and using known biological knowledge to test systematically the relationship between microarray clusters and known biology. We examine, for a range of clustering algorithms, the quantitative self-consistency of the functional classifications in the clusters.

Methods

Microarray Data

To test our approach we used the classic microarray data from Spellman et al. [Spellman1998], which included 4 different experiments measuring cell-cycle expression levels in the S. cerevisiae genome: alpha-factor based synchronisation, cdc15-based synchronisation, elutrition-based synchronisation and the cdc28-based data from Cho et al. [Cho1998]. To show that these results are not specific to this dataset of S. cerevisiae we also used the data from Khodursky et al. [Khodursky2000] for E. coli. This data measured expression levels in response to changes in tryptophan metabolism. Then to show that the general trends are also true of more recent yeast data sets we used the data set from Gasch et al. [Gasch2000] which measured the expression levels of yeast cells when subject to a wide variety of environmental changes.

Classification Schemes

For S. cerevisiae we selected the most commonly used functional classification schemes: the Munich Information Center for Protein Sequences (MIPS) scheme (http://mips.gsf.de/proj/yeast/catalogues/funcat), and the SGD annotations of the GeneOntology (GO) consortium scheme (http://www.geneontology.org). The GO scheme is particularly interesting as it is part of a cross-species functional classification project. For E. coli we used GenProtEC's MultiFun classification scheme (http://genprotec.mbl.edu/start), arguably the most reliable of all functional classification schemes. These classification schemes are hierarchical (MIPS and GenProtEC are trees, GeneOntology a directed acyclic graph) and provide several levels of granularity in the classification, from very general classes at the top of the hierarchy to more specific classes lower down.

Clustering Methods

We chose three clustering methods to compare: agglomerative hierarchical clustering [Eisen1998], k-means clustering [Duda2000], and a modified version of the method of Heyer et al. [Heyer1999] ``QT_CLUST''. These were chosen not to show that any one was better than any other (this would be difficult to prove, given the variety of parameters that can be tuned in each algorithm), but rather, to give a representative sample of commonly used clustering algorithms for microarray data - to show that what we observe is approximately true for any reasonable clustering.

Hierarchical and k-means clustering are the two methods currently available on the Expression Profiler (http://ep.ebi.ac.uk/) web server. The Expression Profiler is an up to date set of tools available over the web for clustering, analysis, and visualisation of microarray data.

Hierarchical clustering is an agglomerative method, joining the closest two clusters each time, then recalculating the inter-cluster distances, and joining the next closest two clusters together. We chose the average linkage of Pearson correlation as the measure of distance between clusters, and a cut-off value of 0.3.

K-means clustering is a standard clustering technique, well known in the fields of statistics and machine learning. Correlation was used as distance measure. We used k=100.

The QT_CLUST algorithm is described in Heyer et al. [Heyer1999]. We implemented a modified version of this algorithm. The data was normalised so that the data for each ORF had mean 0 and variance 1. We did not implement Heyer et al.'s method of removal of outliers by computing jackknife correlations. Pearson correlation was used as a similarity measure, and each ORF was used to seed a cluster. Further ORFs were added to the cluster if their similarity with all ORFs in the cluster was greater than a fixed threshold (the cluster diameter). This was taken to be our final clustering. We henceforth refer to this method as QT_CLUST_MOD. We did not, as Heyer et al. do, set aside the largest cluster and recluster, since we did not demand a unique solution, and in fact we wanted a clustering in which each ORF could be in more than one cluster. Allowing each ORF to be in more than one cluster is similar to the situation when using the functional hierarchies as ORFs can belong to more than one functional class. There can be as many clusters as there are ORFs - there is no fixed number of clusters which has to be decided beforehand or as part of the training process. We used a cluster diameter (minimum similarity) of 0.7.

All clustering parameters (hierarchical cut-off value, k means, cluster diameter) were chosen to be reasonable after experimentation with various values. Although parameters could possibly have been further refined, this was not the aim of this work.

Predictive Power

We required a quantitative measure of how coherent the ORF clusters formed from microarray expression data are with the known functions of the ORFs. We considered using cross-entropy type measures, but decided in favour of using a predictive discriminatory measure as a more direct measure of coherence.

We form this measure as follows: in the k-means and hierarchical clusterings, each ORF appears in only one cluster, so we can take each ORF in turn and test whether or not the cluster without this ORF can predict its class; for the QT_CLUST_MOD scheme, the clusters are seeded by each ORF in turn, and we test for each seed in turn whether or not the rest of the cluster predicts the class of the seed. That is, we test whether or not the majority class of the cluster is one of the classes of the heldout ORF. The majority class is the class most frequently found among the ORFs in the cluster. We call this measure the `Predictive Power' of the clustering. The measure is related to our previous work in predicting ORF function from sequence [King2001]. Tests of predictive power were only carried out on clusters which had more than 5 ORFs, and only on ORFs which were not classified as unknown.

For example, if a cluster contained ORFs belonging to the following classes:

ORF  Class
orf1  A
orf2  B
orf3  A
orf4  A
orf5  A
orf6  A
orf7  A 
orf8  A
orf9  B
orf10 A

then the majority class of the cluster without orf1 would be "A", which is a correct prediction of the actual class of orf1. However orf2 and orf9 would be incorrectly predicted by this cluster. Class "A" is correctly predicted in 8/10 cases. (Class "B" is never predicted by this cluster, since it is never the majority class).

To test the statistical significance of this measure we used a Monte Carlo type approach. ORFs were chosen at random, without replacement, and random clusters formed using the same cluster size distribution as was observed in the real clusterings. The resulting random clusters were then analysed in the same manner as the real clusters. A thousand random clusterings were made each time, and both the mean results reported, and how many times the random cluster accuracy for a functional class was equal or exceeded that of the actual clusters.

Preprocessing of data

The majority of the results here are presented with no preprocessing of the data other than to interpolate missing values by using the average of the two adjacent values. This is to present a consistent approach to all datasets and algorithms we tested. Experiments were carried out to ascertain whether simple preprocessing would make a significant difference. It is common in microarray data processing to normalise the data to have a mean of zero and a standard deviation of one. It is also common to remove ORFs whose expression does not vary significantly over the timepoints, since these ORFs might cluster together without having a common reason. We chose to remove all ORFs that had a standard deviation in the first quartile of the standard deviations of the data set. This is all ORFs with a standard deviation in the bottom 25% of the dataset. This removed 1495 ORFs. Table 9 shows a comparison of the different preprocessing methods on result accuracy.

Results

The quality of the clusters produced by the different programs was, we believe, consistent with those shown in previous results. Initial inspection of the clusters showed some obviously good groupings, and other clusters were produced which generally seemed to have something in common, but the signal was less strong. However, most clusters on inspection did not appear to share anything in common at all.

An example of a strong cluster in shown in Table 1. The probability of this cluster occurring by chance is estimated to be less than 10^(-10) (calculated by hypergeometric distribution). However, note the mannose related sub-cluster in this cluster. The most likely explanation for the sub-cluster is: as the yeast data was formed to study cell division, histones and mannose are both required in the same time specific pattern during division. They therefore have either share the same transcription control mechanism, or have ones similarly controlled. This hypothetical common transcriptional control in cell division is not reflected in the current annotation.

Clusters which appeared to be unrelated were common. There were also clusters which did seem contain related ORFs, but less obviously than the histone cluster mentioned above. Table 2 shows an example of a cluster which does show a DNA processing theme, but this theme is not reflected in the variety of classifications of the ORFs.

To quantitatively test the relationship between clusters and annotations we evaluated all the clusters formed using our measure of predictive power (see Table 3). For S. cerevisiae we calculated the predictive power of each clustering method (k-means, hierarchical, and QT_CLUST_MOD) for each functional class in levels 1 and 2 of MIPS and GO. For E. coli we calculated this for each functional class in level 1 of the GenProtEC hierarchy. This produced a large number of annotated clusterings, and the complete set of these can be found at http://www.aber.ac.uk/compsci/Research/bio/dss/gba/. The same broad conclusions regarding the relationship between clusters and annotation were true for both species and using all clustering methods; we have therefore chosen to present the S. cerevisiae tables only for hierarchical clustering and for only the first levels of the GO and MIPS annotation hierarchies, and one table for E. coli. The predictive power of the different clustering methods can be seen in Tables 4, 5, 6 and 7. The results are broken down according to the majority classes of the clusters. Absence of data for a class indicates that this class was not the majority class of any cluster.

All the clustering methods produced statistically significant clusters with all three functional annotation schemes. This confirms that clusters produced from microarray data reflect some known biology. However, the predictive power of even the best clusters of the clearest functional classes is low (mostly <50%). This means that if you predict function based on guilt-by-association your predictions will contradict existing annotations a large percentage of the time.

One of the clearest messages from the data is the large difference in predictive power across the different: microarray experiments, clustering methods, and annotation schemes. There is no clear best approach and quite different significance results are obtained using different combinations.

Perhaps the most interesting differences are those between different microarray experiments: alpha, cdc15, cdc28, and elu. It is to be expected that different microarray experiments will highlight different features of cell organisation. However it is unclear how biologically significant these differences are. Using the GO annotation scheme:

alpha is best for predicting classes: enzyme; nucleic acid; chaperone; motor; and cell-adhesion.
cdc15 is best for predicting classes: transporter; and ligand binding or carrier.
cdc28 is best for predicting class: structural protein.

Using the MIPS annotation scheme:

alpha is best for predicting class: protein destination.
cdc15 is best for predicting classes: metabolism; cell growth, cell division and DNA synthesis; and transport facilitation.
cdc28 is best for predicting classes: cellular organisation; transcription; and protein synthesis.
elu is best for class cellular transport and transport mechanism.

A particularly dramatic difference is that for the GO class ligand binding or carrier using hierarchical clustering, where the cdc15 and elu experiments produced highly significant clusters whereas the alpha and cdc28 experiments produced clusters with negative correlation.

The classes highlighted also differed significantly between clustering methods. Considering first the GO annotation scheme: the clustering method k-means is best for predicting enzyme class; QT_CLUST_MOD is best for the classes nucleic acid binding, chaperone, and cell-adhesion; and hierarchical clustering is best for structural protein, transporter, and ligand binding or carrier chaperone. Considering the MIPS annotation scheme: the clustering method k-means is best for predicting classes cell rescue, defence, cell death and ageing, and protein synthesis; QT_CLUST_MOD is best for the class transcription; and hierarchical clustering is best for cell organisation, metabolism, cell growth cell division and DNA synthesis.

The data also revealed some unexpected apparent negative correlations between clusters and classes. For example using the MIPS annotation scheme and hierarchical clustering the cdc28 clusters for class cellular transport and transport mechanism the random clustering produced a higher predictive power >95% of the time. Transport proteins seem particularly poorly predicted both in both S. cerevisiae and E. coli. A possible explanation for this is that their transcription control is synchronised with the specific pathways they are involved with rather than as a group.

Do the two annotation schemes of MIPS and GO agree on cluster consistency? Sometimes, with strong clusters, such as within the ribosomal clusters (see Figure 1). But on the whole, the correlation between the scores given by the two annotation schemes is approximately 0. This is partly due to the fact that GO had many fewer annotations than MIPS, so there are several clusters which show a trend under MIPS annotation that cannot be seen under GO, because too many ORFs have no annotation.

Are the annotation schemes improving over time with respect to these clusters? Tables 7 and 8 show the accuracies of the clusters found by hierarchical clustering under MIPS annotations. Table 7 uses the MIPS annotations from 21st December 2000, whereas Table 8 uses the MIPS annotations from 24th April 2002, 16 months later. The accuracies are almost identical.

Does simple preprocessing help? Table 9 shows a comparison of accuracy when normalisation or removal of ORFs with low standard deviation was used. There is no consistent trend of improvement or degradation and very little difference between the results.

It has been commented that perhaps other distance measures or a different choice of linkage could be more appropriate for the use of hierarchical clustering on expression data. We show that use of Euclidean distance and complete linkage does little to change the accuracy and in fact seems worse than correlation and average linkage for the alpha dataset. This can be seen in Table 10 by comparison to Table 9.

Discussion and Conclusion

Given a clustering produced from microarray data and a protein functional classification scheme there are four possibilities:

the annotations confirm the cluster.
the annotations and the cluster disagree because the microarray data involves biological knowledge not explicitly represented in the annotations.
the annotations and the cluster disagree because the microarray data involves new biological knowledge.
the annotations and the cluster disagree because the cluster is noise.

This paper quantifies how often the first case occurs and illustrates the limitations of existing annotations in explaining microarray data. The consequences of these limitations are discussed in Kell & King [Kell2000]. These clusters where annotation and microarray data agree are the first choice of clusters to examine to gain knowledge of control of transcription. In favour of the second explanation is the fact the functional classification schemes are still ``under construction'' and do not reflect all that is known about biology. There are also many possible improvements in clustering algorithms which could improve the consistency. It is to be expected that the third case will predominate. Microarrays are a fascinating technology and an industry has been based on them. It is almost inconceivable that microarrays will not reveal large amounts of new and fascinating biological knowledge.

A major challenge in microarray analysis is therefore to discriminate between the old and new biology in the data and the noise. To achieve this we require

Better models of instrument noise in microarrays [Newton2001].
Data analysis methods explicitly designed to exploit time-series data (with most existing clustering methods you could permute the time points and get the same results).
Ways of combining information from different experiments to provide repeatability (known standards of data will help this, such as MIAME http://www.mged.org/Annotations-wg/index.html).
Deeper data analysis methods designed to elucidate the biological processes behind the clusters, such as genetic networks.

The consequences of our observations are that there is still much to be done to extract information from the results of microarray experiments. Clusters produced from the data reflect some known biology, however, the majority of clusters have no obvious common annotations from the current ORF annotation schemes. We expect that microarray data presents new biological knowledge and that in time the annotation schemes will represent this. We also conclude that unsupervised clustering is limited, and we recommend that deeper data analysis methods need to be used in future.

Acknowledgements

We would like to thank the people at EBI behind the Expression Profiler system (http://ep.ebi.ac.uk/) for making their clustering software easy to use and publicly available. Amanda Clare was supported by MRC grant G78/6609.

Table 1: A yeast histone cluster (cdc15 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 203)

ORF description MIPS classes

ycl064c L-serine/L-threonine deaminase 1/1/10/0

yol090w DNA mismatch repair protein 3/19/0/0 30/10/0/0

yol017w similarity to YFR013w 99/0/0/0

ynl273w topoisomerase I interacting factor 1 99/0/0/0

ynl262w DNA-directed DNA polymerase epsilon, catalytic subunit A 11/4/0/0 3/16/0/0 3/22/1/0 30/10/0/0

ynl082w DNA mismatch repair protein 3/19/0/0 30/10/0/0

ynl072w RNase H(35), a 35 kDa ribonuclease H 1/3/16/0

ylr049c hypothetical protein 99/0/0/0

yjl074c required for structural maintenance of chromosomes 3/22/0/0 9/13/0/0

yhr153c sporulation protein 3/10/0/0 3/13/0/0

yhr110w p24 protein involved in membrane trafficking 6/4/0/0 8/99/0/0

ygr041w budding protein 3/4/0/0

ydl227c homothallic switching endonuclease 3/7/0/0 30/10/0/0

ydl164c DNA ligase 11/4/0/0 3/16/0/0 3/19/0/0 30/10/0/0

ydl156w weak similarity to Pas7p 99/0/0/0

ybr071w hypothetical protein 99/0/0/0

yar007c DNA replication factor A, 69 KD subunit 3/16/0/0 3/19/0/0 3/7/0/0 30/10/0/0

Table 2: A yeast DNA processing cluster. Note the variety of MIPS classes represented here. (cdc28 data, QT_CLUST_MOD clustering algorithm, MIPS annotations, cluster id: 599)

MIPS
 
ydr417c  questionable ORF                           99/0/0/0
ylr325c  60S large subunit ribosomal protein        30/3/0/0 5/1/0/0
yjl177w  60s large subunit ribosomal protein L17.e  30/3/0/0 5/1/0/0
yml063w  ribosomal protein S3a.e                    30/3/0/0 5/1/0/0
ydr447c  ribosomal protein S17.e.B                  30/3/0/0 5/1/0/0
ykl056c  strong similarity to human IgE-dependent histamine-releasing factor  30/3/0/0 98/0/0/0
ylr061w  ribosomal protein                          5/1/0/0
ykr094c  ubiquitin                                  30/3/0/0 5/1/0/0 6/13/1/0
yol039w  acidic ribosomal protein P2.beta           30/3/0/0 5/1/0/0
ydr418w  60S large subunit ribosomal protein L12.e  30/3/0/0 5/1/0/0
ykl006w  ribosomal protein                          30/3/0/0 5/1/0/0
yol040c  40S small subunit ribosomal protein        30/3/0/0 5/1/0/0
yor167c  40S small subunit ribosomal protein S28.e.c15 30/3/0/0 5/1/0/0
ylr367w  ribosomal protein S15a.e.c12               30/3/0/0 5/1/0/0
ypr102c  ribosomal protein L11.e                    30/3/0/0 5/1/0/0
ypr118w  similarity to M.jannaschii translation initiation factor, eIF-2B  99/0/0/0

--------
GO
 
ydr417c  molecular_function unknown       GO_0005554
ylr325c  structural protein of ribosome   GO_0005198 :  GO_0003735
yjl177w  structural protein of ribosome   GO_0005198 :  GO_0003735
yml063w  structural protein of ribosome   GO_0005198 :  GO_0003735
ydr447c  structural protein of ribosome   GO_0005198 :  GO_0003735
ykl056c  molecular_function unknown       GO_0005554
ylr061w  structural protein of ribosome   GO_0005198 :  GO_0003735
ykr094c  structural protein of ribosome   GO_0005198 :  GO_0003735
yol039w  structural protein of ribosome   GO_0005198 :  GO_0003735
ydr418w  structural protein of ribosome   GO_0005198 :  GO_0003735
ykl006w  structural protein of ribosome*  GO_0003676 GO_0005198 : GO_0003723 GO_0003735
yol040c  structural protein of ribosome   GO_0005198 :  GO_0003735
yor167c  structural protein of ribosome   GO_0005198 :  GO_0003735
ylr367w  structural protein of ribosome   GO_0005198 :  GO_0003735
ypr102c  structural protein of ribosome   GO_0005198 :  GO_0003735
ypr118w  molecular_function unknown       GO_0005554

Figure 1: Ribosomal cluster as agreed by MIPS and GO. Semicolons separate the levels of GO classes. They agree on all except ykr094c. (This example is cluster ID: 390, alpha data, hierarchical clustering)

random alpha cdc15 cdc28 elu E. coli

MIPS - hier 56.257 61.062 (0) 62.821 (0) 61.341 (0) 60.270 (0) -

MIPS - k 59.304 58.795 (989) 59.053 (908) 59.677 (4) 59.583 (17) -

MIPS - QT 58.256 59.714 (16) 61.697 (0) 62.136 (0) 59.631 (21) -

GO - hier 52.265 62.526 (0) 60.799 (0) 61.087 (0) 59.344 (0) -

GO - k 59.301 61.649 (0) 59.990 (1) 62.067 (0) 60.638 (0) -

GO - QT 55.397 60.799 (0) 60.236 (0) 61.288 (0) 60.146 (0) -

E. coli - hier 52.906 - - - - 57.785 (0)

E. coli - k 53.104 - - - - 56.103 (0)

E. coli - QT 52.751 - - - - 55.291 (0)

Table 3: A summary of the average predictive power for each type of clustering for each experiment. Figures are percentage correct predictions. ``random'' shows mean over 1000 random clusterings. Figures in brackets show how many times out of 1000 the random clustering produced equal or greater than this percentage.

random alpha cdc15 cdc28 elu

enzyme 59.487 64.578 (0) 61.753 (64) 61.771 (61) 60.511 (238)

nucleic acid binding 16.355 26.531 (13) 15.584 (574) 25.439 (22) 22.430 (78)

structural protein 12.767 50.725 (0) 47.423 (0) 57.792 (0) 20.290 (52)

transporter 8.585 13.725 (154) 32.432 (0) 10.417 (330) 5.357 (735)

ligand binding or carrier 6.825 4.167 (669) 30.769 (0) 3.571 (706) 21.429 (4)

chaperone 3.170 13.636 (55) - 5.263 (278) -

signal transducer 2.938 14.815 (34) 15.000 (34) 3.703 (303) 11.765 (69)

motor 1.069 - - - 11.111 (41)

Table 4: Yeast hierarchical clustering (cut-off=0.3) class by class breakdown at level 1 GO. First column shows average over 1000 random clusterings. alpha, cdc15, cdc28 and elu are the 4 cell-cycle synchronisation methods. Figures show percentage correct predictions. The figure in brackets is how many times out of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.

random alpha cdc15 cdc28 elu

cellular organization 59.344 61.988 (4) 63.258 (0) 64.442 (0) 61.146 (42)

metabolism 28.827 39.721 (1) 40.136 (0) 33.951 (82) 37.061 (6)

cell growth, cell division and DNA synthesis 22.429 43.421 (0) 46.961 (0) 35.816 (1) 22.963 (478)

transcription 20.879 23.333 (304) 30.496 (26) 33.333 (4) 18.750 (681)

protein destination 14.988 17.021 (350) 18.367 (266) 14.865 (495) 11.842 (710)

cellular transport and transport mechanisms 12.500 12.500 (491) - 2.273 (957) 21.951 (70)

cell rescue, defense, cell death and ageing 9.495 18.750 (60) 18.605 (60) 21.739 (35) 14.706 (163)

transport facilitation 8.024 14.815 (135) 28.889 (2) 2.703 (792) -

protein synthesis 8.760 68.000 (0) 15.385 (175) 56.716 (0) -

energy 5.891 13.636 (140) - 7.895 (349) -

cellular biogenesis 5.241 11.765 (160) - 6.250 (367) -

ionic homeostasis 2.805 - - - 14.286 (73)

Table 5: Yeast hierarchical clustering (cut-off 0.3) class by class breakdown at level 1 MIPS. First column shows average over 1000 random clusterings. alpha, cdc15, cdc28 and elu are the 4 cell-cycle synchronisation methods. Figures show percentage correct predictions. The figure in brackets is how many timesout of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.

hier k QT

metabolism 56.579 (0) 55.581 (0) 58.115 (0)

location of gene products 57.107 (0) 55.037 (4) 56.513 (0)

cell structure 59.794 (0) 34.884 (294) 30.755 (509)

information transfer 35.417 (116) 26.315 (341) 16.667 (699)

regulation 36.364 (68) - 7.692 (623)

transport 38.095 (69) - 13.333 (748)

cell processes 57.692 (11) 58.140 (14) 63.636 (8)

extrachromosomal 64.286 (0) 39.189 (61) 42.105 (3)

Table 6: E. coli class by class breakdown at level 1. QT = QT_CLUST_MOD (0.7), k = k-means clustering (100), hier = hierarchical clustering (0.3). Figures show percentage correct predictions. The figure in brackets is how many times out of 1000 the random clustering produced equal or greater than this percentage. If less than 5, this value is highlighted.

hier

energy 88.462

cellular organization 65.163

protein destination 62.500

metabolism 56.738

cell growth, cell division and DNA synthesis 48.649

transcription 46.970

transport facilitation 42.105

cellular transport and transport mechanisms 37.500

cellular biogenesis 20.000

cell rescue, defense, cell death and ageing 16.667

Table 7: Gasch data set, MIPS classification as of 21/12/00. Class by class breakdown at level 1. Hierarchical clustering with cut-off 0.3. Figures show percentage correct predictions.

hier

energy 79.310

protein fate (folding, modification, destination) 66.667

subcellular localisation 64.158

metabolism 50.794

cell cycle and DNA processing 48.148

transcription 47.692

transport facilitation 42.105

cellular transport and transport mechanisms 35.294

control of cellular organization 25.000

cell rescue, defense and virulence 16.667

Table 8: Gasch data set, MIPS classification as of 24/4/02. Class by class breakdown at level 1. Hierarchical clustering with cut-off 0.3. Figures show percentage correct predictions.

plain norm rem low norm and rem low

energy 21.739 15.789 18.750 18.750

protein fate (folding, modification, destination) 16.102 16.667 18.519 9.756

subcellular localisation 61.079 61.460 62.097 62.267

metabolism 39.432 38.387 37.500 29.208

cell cycle and DNA processing 37.013 36.111 41.026 43.564

transcription 20.800 29.134 16.346 18.447

transport facilitation 11.765 19.355 - -

cellular transport and transport mechanisms 10.204 10.870 - -

control of cellular organization 8.696 10.000 15.385 20.000

cell rescue, defense and virulence 21.622 21.622 33.333 22.222

cell fate 16.129 17.544 25.000 14.634

protein synthesis 61.765 67.741 43.478 52.174

regulation of/interaction with cellular environment - 15.385 14.286 -

Table 9: The effects of preprocessing on accuracy. Preprocessing of the data is compared. "plain" is a baseline, no preprocessing. "norm" is normalisation where mean and standard deviation are normalised to 0 and 1 respectively for each ORF. "rem low" is removal of ORFs with low standard deviation (in the bottom 25% of the data set). "rem low and norm" is removal of ORFs with low standard deviation followed by normalisation of the remaining ORFs. The dataset was alpha data, MIPS classification as of 24/4/02. Clustering was hierarchical clustering, average linkage of correlation, cut-off=0.3.

complete linkage, euclidean distance

energy 18.182

protein fate (folding, modification, destination) 24.107

subcellular localisation 60.623

metabolism 34.819

cell cycle and DNA processing 19.388

transcription 35.545

transport facilitation -

cellular transport and transport mechanisms 10.169

control of cellular organization 7.143

cell rescue, defense and virulence 5.405

cell fate 7.407

protein synthesis 33.766

regulation of/interaction with cellular environment 6.250

Table 10: Use of complete linkage and Euclidean distance for hierarchical clustering. Compare these values with the "plain column" in Table 9. Dataset was alpha data, MIPS classification as of 24/4/02. Clustering was hierarchical clustering, complete linkage of Euclidean distance, cut-off=1.0. (0.3 gave clusters containing a maximum of 2 ORFs only, so was too tight).

References

Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D. and Levine, A. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. USA. 96(12) 6745-6750.

Barash, Y. and Friedman, N. (2001) Context-Specific Bayesian Clustering for Gene Expression Data. In Proceedings of the Fifth International Conference on Computational Molecular Biology (RECOMB 2001).

Butte, A. and Kohane, I. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In PSB 2000.

Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D., Lockhart, D. and Davis, R. (1998) A genome-wide transcription analysis of the mitotic cell cycle. Molecular Cell 2 65-73

Duda, R., Hart, P. and Stork, P. (2000) Pattern Classification. John Wiley and Sons, New York

Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. USA. 95 14863-14868.

Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D. and Brown, P. (2000) Genomic expression program in the response of yeast cells to environmental changes. Mol. Biol. Cell. 11 4241-4257.

Heyer, L. J., Kruglyak, S. and Yooseph, S. (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9(11) 1106-1115.

Kell, D. and King, R. (2000) On the optimization of classes for the assignment of unidentified reading frames in functional genomics programmes: the need for machine learning. Trends Biotechnol. 18 93-98.

Khodursky, A., Peter, B., Cozzarelli, N., Botstein, D. and Brown, P. (2000) DNA microarray analysis of gene expression in response to physiological and genetic changes that affect tryptophan metabolism in Escherichia coli Proc. Nat. Acad. Sci. USA 97(22) 12170-12175.

King, R., Karwath, A., Clare, A. and Dehaspe, L. (2001) The Utility of Different Representations of Protein Sequence for Predicting Functional Class. Bioinformatics 17(5) 445-454.

Lukashin, A. and Fuchs, R. (2001) Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 17(5) 405-414.

Newton, M.A., Kendziorski, C.M., Richmond C.S., Blattner, F.R. and Tsui, K.W. (2001) On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8 37-52.

Spellman, P, Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D. and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9 3273-3297

Tavazoie, S., Hughes, J., Campbell, M., Cho, R. and Church, G. (1999) Systematic determination of genetic network architecture. Nature Genetics 22 281-285.

Törönen, P., Kolehmainen, M., Wong, G. and Castrén, E. (1999) Analysis of gene expression data using self-organizing maps. FEBS Lett. 451(2) 142-146

Walker, M., Volkmuth, W., Sprinzak, E., Hodgson, D. and Klingler, T. (1999) Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res. 9(12) 1198-1203.

Yeung, K.Y., Haynor, D. and Ruzzo, W. (2001) Validating clustering for gene expression data. Bioinformatics 17(4) 309-318

ORF	description	MIPS classes
ybr008c	fluconazole resistance protein	11/7/0/0 7/28/0/0
ypl127c	histone H1 protein	30/10/0/0 30/13/0/0
ynl031c	histone H3	30/10/0/0 30/13/0/0 4/5/1/4
ynl030w	histone H4	30/10/0/0 30/13/0/0 4/5/1/4
ylr455w	weak similarity to human G/T mismatch binding protein	99/0/0/0
ygl065c	mannosyltransferase	1/5/1/0 6/7/0/0
yer003c	mannose-6-phosphate isomerase	1/5/1/0 30/3/0/0
ydr225w	histone H2A	30/10/0/0 30/13/0/0 4/5/1/4
ydr224c	histone H2B	30/10/0/0 30/13/0/0 4/5/1/4
ydl055c	mannose-1-phosphate guanyltransferase	1/5/1/0 9/1/0/0
ybr010w	histone H3	30/10/0/0 30/13/0/0 4/5/1/4
ybr009c	histone H4	30/10/0/0 30/13/0/0 4/5/1/4
ybl003c	histone H2A.2	30/10/0/0 30/13/0/0 4/5/1/4
ybl002w	histone H2B.2	30/10/0/0 30/13/0/0 4/5/1/4

ORF	description	MIPS classes
ycl064c	L-serine/L-threonine deaminase	1/1/10/0
yol090w	DNA mismatch repair protein	3/19/0/0 30/10/0/0
yol017w	similarity to YFR013w	99/0/0/0
ynl273w	topoisomerase I interacting factor 1	99/0/0/0
ynl262w	DNA-directed DNA polymerase epsilon, catalytic subunit A	11/4/0/0 3/16/0/0 3/22/1/0 30/10/0/0
ynl082w	DNA mismatch repair protein	3/19/0/0 30/10/0/0
ynl072w	RNase H(35), a 35 kDa ribonuclease H	1/3/16/0
ylr049c	hypothetical protein	99/0/0/0
yjl074c	required for structural maintenance of chromosomes	3/22/0/0 9/13/0/0
yhr153c	sporulation protein	3/10/0/0 3/13/0/0
yhr110w	p24 protein involved in membrane trafficking	6/4/0/0 8/99/0/0
ygr041w	budding protein	3/4/0/0
ydl227c	homothallic switching endonuclease	3/7/0/0 30/10/0/0
ydl164c	DNA ligase	11/4/0/0 3/16/0/0 3/19/0/0 30/10/0/0
ydl156w	weak similarity to Pas7p	99/0/0/0
ybr071w	hypothetical protein	99/0/0/0
yar007c	DNA replication factor A, 69 KD subunit	3/16/0/0 3/19/0/0 3/7/0/0 30/10/0/0

	random	alpha	cdc15	cdc28	elu	E. coli
MIPS - hier	56.257	61.062 (0)	62.821 (0)	61.341 (0)	60.270 (0)	-
MIPS - k	59.304	58.795 (989)	59.053 (908)	59.677 (4)	59.583 (17)	-
MIPS - QT	58.256	59.714 (16)	61.697 (0)	62.136 (0)	59.631 (21)	-
GO - hier	52.265	62.526 (0)	60.799 (0)	61.087 (0)	59.344 (0)	-
GO - k	59.301	61.649 (0)	59.990 (1)	62.067 (0)	60.638 (0)	-
GO - QT	55.397	60.799 (0)	60.236 (0)	61.288 (0)	60.146 (0)	-
E. coli - hier	52.906	-	-	-	-	57.785 (0)
E. coli - k	53.104	-	-	-	-	56.103 (0)
E. coli - QT	52.751	-	-	-	-	55.291 (0)

	hier	k	QT
metabolism	56.579 (0)	55.581 (0)	58.115 (0)
location of gene products	57.107 (0)	55.037 (4)	56.513 (0)
cell structure	59.794 (0)	34.884 (294)	30.755 (509)
information transfer	35.417 (116)	26.315 (341)	16.667 (699)
regulation	36.364 (68)	-	7.692 (623)
transport	38.095 (69)	-	13.333 (748)
cell processes	57.692 (11)	58.140 (14)	63.636 (8)
extrachromosomal	64.286 (0)	39.189 (61)	42.105 (3)

	hier
energy	88.462
cellular organization	65.163
protein destination	62.500
metabolism	56.738
cell growth, cell division and DNA synthesis	48.649
transcription	46.970
transport facilitation	42.105
cellular transport and transport mechanisms	37.500
cellular biogenesis	20.000
cell rescue, defense, cell death and ageing	16.667

	hier
energy	79.310
protein fate (folding, modification, destination)	66.667
subcellular localisation	64.158
metabolism	50.794
cell cycle and DNA processing	48.148
transcription	47.692
transport facilitation	42.105
cellular transport and transport mechanisms	35.294
control of cellular organization	25.000
cell rescue, defense and virulence	16.667

	plain	norm	rem low	norm and rem low
energy	21.739	15.789	18.750	18.750
protein fate (folding, modification, destination)	16.102	16.667	18.519	9.756
subcellular localisation	61.079	61.460	62.097	62.267
metabolism	39.432	38.387	37.500	29.208
cell cycle and DNA processing	37.013	36.111	41.026	43.564
transcription	20.800	29.134	16.346	18.447
transport facilitation	11.765	19.355	-	-
cellular transport and transport mechanisms	10.204	10.870	-	-
control of cellular organization	8.696	10.000	15.385	20.000
cell rescue, defense and virulence	21.622	21.622	33.333	22.222
cell fate	16.129	17.544	25.000	14.634
protein synthesis	61.765	67.741	43.478	52.174
regulation of/interaction with cellular environment	-	15.385	14.286	-

	complete linkage, euclidean distance
energy	18.182
protein fate (folding, modification, destination)	24.107
subcellular localisation	60.623
metabolism	34.819
cell cycle and DNA processing	19.388
transcription	35.545
transport facilitation	-
cellular transport and transport mechanisms	10.169
control of cellular organization	7.143
cell rescue, defense and virulence	5.405
cell fate	7.407
protein synthesis	33.766
regulation of/interaction with cellular environment	6.250