doi:10.1016/j.biosystems.2006.04.005
Copyright © 2006 Elsevier Ireland Ltd All rights reserved.
Benchmarking a memetic algorithm for ordering microarray data
P. Moscato a, A. Mendes a and R. Berretta a
a Newcastle Bioinformatics Initiative, School of Electrical Engineering and Computer Science, Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW 2308, Australia
Received 11 August 2004; revised 11 April 2006; accepted 11 April 2006. Available online 5 May 2006.
Abstract
This work introduces a new algorithm for “gene ordering”. Given a matrix of gene expression data values, the task is to find a permutation of the list of gene names such that genes with similar expression patterns are relatively close in the permutation. The algorithm is based on a combined approach that integrates a constructive heuristic with evolutionary and Tabu Search techniques in a single methodology. To evaluate the benefits of this method, we compared our results with the outputs of several widely used algorithms in functional genomics. We also compared the results with our own hierarchical clustering method when used in isolation. We show that the use of images, corrupted with known levels of noise, helps to illustrate some aspects of the performance of the algorithms and provides a complementary benchmark for the analysis. Because high-quality solutions are known for these images, they facilitate, in some cases, the assessment of the methods and help software development, validation and reproducibility of results. We also propose two quantitative measures of performance for gene ordering. Using these measures, we make a comparison with what is probably the most widely used algorithm (due to Eisen and collaborators, PNAS 1998) on a microarray dataset available in the public domain (the complete yeast cell cycle dataset).
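To make the task concrete, the sketch below scores a candidate permutation by summing the distances between expression profiles that lie within a small window of each other in the ordering. It is only an illustration: the Euclidean distance, the function names and the `window` parameter are assumptions made here, not taken from the paper, whose objective function is not reproduced in this abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression profiles (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ordering_cost(expr, perm, window=1):
    """Sum of distances between genes at most `window` positions apart.

    expr   : list of rows, one expression profile per gene
    perm   : list of row indices giving the order in which genes are displayed
    window : hypothetical neighbourhood size; 1 means adjacent genes only
    """
    total = 0.0
    n = len(perm)
    for i in range(n):
        for k in range(1, window + 1):
            if i + k < n:
                total += euclidean(expr[perm[i]], expr[perm[i + k]])
    return total


# Toy data: genes 0 and 2 are similar, as are genes 1 and 3.
expr = [
    [0.0, 0.0],
    [5.0, 5.0],
    [0.0, 1.0],
    [5.0, 6.0],
]
scrambled = [0, 1, 2, 3]
grouped = [0, 2, 1, 3]

# A lower cost means similar genes sit closer together in the ordering.
print(ordering_cost(expr, scrambled), ordering_cost(expr, grouped))
```

Under this assumed cost, the ordering that places similar genes side by side scores lower than the scrambled one, which is the property a gene-ordering method tries to achieve.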
Keywords: Memetic algorithms; Tabu search; Gene ordering; Clustering; Microarray
Fig. 1. Pseudo-code for the hierarchical clustering technique.
Fig. 2. Example of the hierarchical clustering with four genes.
Fig. 3. Pseudo-code for the memetic algorithm.
Fig. 4. Diagram of the population structure.
Fig. 5. Pre-order traversal representation of a solution.
Fig. 6. Example of a subtree-based recombination. A subtree of solution B is selected and inserted into solution A, in the left or in the right part of the ramification. All choices – subtree selection, insertion position and left/right insertion – are at random.
Fig. 7. Pseudo-code of the swap-based tabu search.
Fig. 8. Pseudo-code of the tree-flip-based tabu search.
Fig. 9. A local search algorithm is used to move “blocks” which are identified as consecutive pairs of highly dissimilar expression patterns. The figure illustrates a situation in which five blocks are clear and an ad hoc local search method is highly beneficial.
Fig. 10. Pseudo-code of the block-based local search.
Fig. 11. Instances derived from the Lenna 512×512-pixel image. On the left, the multiplied Lenna, which consists of ten copies of the original image. On the top are the expected optimal solution, the striped Lenna with types I + II noise, and the striped Lenna with noise type I only, respectively. On the bottom are four Whole Lenna instances with types I + II noise. This allows a controlled benchmark, as all algorithms receive a dataset in which all the rows have been permuted at random.
Fig. 12. Whole Lenna instances results for the four algorithms. Noise level varies from 0% to 40%.
Fig. 13. Striped Lenna instances results for the four algorithms. Noise types I + II level varies from 0% to 20%.
Fig. 14. Striped Lenna instances results for the four algorithms. Noise type I level varies from 0% to 20%.
Fig. 15. Results for the fibroblast and yeast (Saccharomyces cerevisiae) instances, with 517 and 6221 genes, respectively.
Fig. 16. A comparison of CLICK and MA results on the fibroblast instance. CLICK has apparently discovered four major clusters. However, if we randomly order the subsequence given by the MA for the genes of the major cluster, we can see that it is visually similar to the result given by CLICK. This shows that the MA is able to see a rich substructure within that cluster.
Fig. 17. Comparison between the memetic algorithm results with ws=1 and ws=5 and Eisen’s hierarchical clustering algorithm for the Yeast instance. The lines shown are the average correlation of gene expression patterns at distance d in the permutation ordering of the solutions found by the three methods (as a function of d). All of them have the same pattern; genes that are closer in the sequence have a higher correlation than genes that are farther apart. With a small window the average correlations between genes up to five positions apart are higher than in Eisen’s solution, making small windows better suited to find small groups of highly-correlated genes.
Fig. 18. Comparison between the memetic algorithm results with ws=1% and ws=5% and Eisen’s hierarchical clustering algorithm for the Yeast instance. The average correlation of gene expression patterns at distance d in the permutation ordering, ρ(d), has a very slow decay pattern for the MAs, meaning that genes that are well apart in the sequence still have a good correlation. These window sizes are better suited to find larger groups of correlated genes, or to order groups of smaller, highly-correlated clusters.
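The quantity ρ(d) used in Figs. 17 and 18 can be illustrated with a small sketch: for a given ordering, average the correlation between every pair of genes that sit exactly d positions apart. The use of Pearson correlation, the handling of constant profiles and the function names are assumptions made for this example; the paper's exact definition is not given in the captions.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length profiles (assumed measure)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return cov / (norm_u * norm_v)


def rho(expr, perm, d):
    """Average correlation between genes exactly d positions apart in perm."""
    pairs = [(perm[i], perm[i + d]) for i in range(len(perm) - d)]
    if not pairs:
        raise ValueError("d must be smaller than the number of genes")
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)


# Toy data: genes 0 and 1 rise over time, genes 2 and 3 fall.
expr = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
    [6.0, 4.0, 2.0],
]
perm = [0, 1, 2, 3]

for d in (1, 2):
    print(d, rho(expr, perm, d))
```

Plotting `rho` against d for the orderings produced by different methods gives curves of the kind the captions describe: a good ordering keeps ρ(d) high for small d.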
Fig. 19. Total number of cliques that have their genes ordered consecutively in the final display by the MA (ws=5) and Eisen’s hierarchical clustering.
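The count reported in Fig. 19 can be sketched as follows: a group of genes is ordered consecutively when its members occupy one unbroken run of positions in the final display. How the cliques themselves are built is not described in this caption, so the sketch simply takes them as given sets of gene names; the function names and toy data are hypothetical.

```python
def is_consecutive(group, perm):
    """True if every gene in `group` occupies one unbroken run in `perm`."""
    position = {gene: idx for idx, gene in enumerate(perm)}
    slots = [position[gene] for gene in set(group)]
    # A run of k distinct positions spans exactly k slots.
    return max(slots) - min(slots) + 1 == len(slots)


def count_consecutive(groups, perm):
    """Number of groups whose genes appear consecutively in the ordering."""
    return sum(1 for group in groups if is_consecutive(group, perm))


# Toy ordering and three hypothetical cliques.
perm = ["g3", "g1", "g4", "g2", "g5"]
groups = [
    {"g1", "g4"},        # positions 1, 2: consecutive
    {"g3", "g2"},        # positions 0, 3: split
    {"g4", "g2", "g5"},  # positions 2, 3, 4: consecutive
]

print(count_consecutive(groups, perm))
```

Comparing this count across orderings from different methods gives a simple measure of how well each ordering keeps tightly related genes together.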
Fig. 20. Protein synthesis functional group found by the memetic algorithm and Eisen’s hierarchical clustering. The memetic algorithm solution contains 89 genes, 77 of which are protein synthesis-related. Eisen’s solution contains one gene fewer, and 74 of its genes are protein synthesis-related. Even though larger groups are easy to identify and both methods perform almost the same on them, there is generally a slight difference in favor of the memetic algorithm.
Fig. 21. Smaller functional groups found by the memetic algorithm and Eisen’s hierarchical clustering. The first group is ATP synthesis + oxidative phosphorylation + TCA cycle, which Eisen’s method separated into two groups that are well apart. For the sterol metabolism and TCA cycle groups, the memetic algorithm found larger clusters. Finally, both methods grouped the same number of protein degradation genes, 18 in total.
Table 1. CPU times for the methods


Corresponding author. Tel.: +61 2 49292308; fax: +61 2 49216929.