A ground truth based comparative study on clustering of gene expression data

Frontiers in Bioscience-Landmark (FBL) is published by IMR Press from Volume 26 Issue 5 (2021). Previous articles were published by another publisher on a subscription basis, and they are hosted by IMR Press on imrpress.com as a courtesy and upon agreement with Frontiers in Bioscience.

Yitan Zhu¹, Zuyi Wang¹,², David J. Miller³, Robert Clarke⁴, Jianhua Xuan¹, Eric P. Hoffman², Yue Wang¹,*

¹ Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA

² Research Center for Genetic Medicine, Children’s National Medical Center, Washington, DC 20010, USA

³ Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802, USA

⁴ Department of Oncology and Physiology and Biophysics and Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20007, USA

*Author to whom correspondence should be addressed.

Front. Biosci. (Landmark Ed) 2008, 13(10), 3839–3849; https://doi.org/10.2741/2972

Published: 1 May 2008

Abstract

Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG™ toolkit (VIsual Statistical Data Analyzer, VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
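The performance measures named in the abstract can be illustrated with a minimal sketch. The abstract does not define them precisely, so the following assumes that partition accuracy is the fraction of samples correctly assigned under the best one-to-one mapping between clusters and ground-truth classes, and that cluster number detection accuracy is the fraction of runs in which the detected cluster count equals the true one. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import permutations
from statistics import mean, pstdev


def partition_accuracy(true_labels, cluster_labels):
    """Fraction of samples matched to their ground-truth class under the
    best one-to-one cluster-to-class mapping (assumed definition).

    Brute force over mappings, so only suitable for a handful of clusters.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with None so surplus clusters can be left unmatched.
    size = max(len(classes), len(clusters))
    padded_classes = classes + [None] * (size - len(classes))
    best_hits = 0
    for perm in permutations(padded_classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t
        )
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


def summarize_partition_accuracy(true_labels, runs):
    """Mean and (population) standard deviation of accuracy over runs."""
    accuracies = [partition_accuracy(true_labels, run) for run in runs]
    return mean(accuracies), pstdev(accuracies)


def cluster_number_detection_accuracy(true_k, detected_ks):
    """Fraction of runs whose detected cluster count equals true_k
    (assumed definition)."""
    return sum(1 for k in detected_ks if k == true_k) / len(detected_ks)


# Hypothetical toy data: six samples in two ground-truth classes,
# clustered three times (e.g. from different random initializations).
truth = [0, 0, 0, 1, 1, 1]
runs = [
    [1, 1, 1, 0, 0, 0],  # same partition, cluster ids swapped
    [0, 0, 1, 1, 1, 1],  # one sample misassigned
    [2, 2, 2, 0, 0, 1],  # one class split into two clusters
]
acc_mean, acc_std = summarize_partition_accuracy(truth, runs)
k_accuracy = cluster_number_detection_accuracy(2, [len(set(r)) for r in runs])
```

Under these assumptions, a low standard deviation across runs corresponds to the stability the abstract attributes to K-means clustering and self-organizing maps, while the mean reflects accuracy. For more than a few clusters, the brute-force search over mappings would need to be replaced by an assignment solver.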

Front. Biosci. (Landmark Ed) Print ISSN 2768-6701 Electronic ISSN 2768-6698