A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets.

Hdl Handle:
http://hdl.handle.net/10675.2/108
Title:
A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets.
Authors:
Sharma, Ashok; Podolsky, Robert H.; Zhao, Jieping; McIndoe, Richard A
Abstract:
MOTIVATION: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a need for algorithms that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1). AVAILABILITY: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
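The two-stage idea described in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation: the abstract does not give the hyperplane partitioning details, so stage 1 here is an assumed simplification that groups rows into threshold-based micro-clusters, each summarized by a count and a linear sum. The function names, the `threshold` parameter, and the toy data are all hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_scan_reduce(rows, threshold):
    """Stage 1 (assumed simplification): one pass over the data.

    Each micro-cluster stores only a count and a per-dimension linear sum,
    so no pairwise distance matrix over the raw rows is ever built.
    """
    micro = []  # each entry is [count, linear_sum]
    for row in rows:
        best, best_d = None, threshold
        for m in micro:
            centroid = [s / m[0] for s in m[1]]
            d = dist(row, centroid)
            if d <= best_d:
                best, best_d = m, d
        if best is None:
            micro.append([1, list(row)])  # start a new micro-cluster
        else:
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], row)]
    # Return (centroid, weight) summaries for stage 2.
    return [([s / n for s in sums], n) for n, sums in micro]


def weighted_kmeans(points, k, iters=50, seed=0):
    """Stage 2: conventional k-means run on the (centroid, weight) summaries."""
    rng = random.Random(seed)
    centers = [list(p) for p, _ in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every summary to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                  for p, _ in points]
        new_centers = []
        for j in range(k):
            members = [(p, w) for (p, w), lab in zip(points, labels) if lab == j]
            if not members:
                new_centers.append(centers[j])  # keep the center of an empty cluster
                continue
            total = sum(w for _, w in members)
            dim = len(members[0][0])
            new_centers.append(
                [sum(p[i] * w for p, w in members) / total for i in range(dim)]
            )
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels


# Toy expression profiles (hypothetical values): two well-separated groups.
rows = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
summaries = single_scan_reduce(rows, threshold=1.0)
centers, labels = weighted_kmeans(summaries, k=2)
```

In this sketch k-means only ever sees the micro-cluster summaries, which is the source of the memory and time savings the abstract attributes to the first stage; weighting each summary by its count keeps the final centers equal to the means of the underlying rows.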
Citation:
Bioinformatics. 2009 May 1; 25(9):1152-1157
Issue Date:
24-Apr-2009
URI:
http://hdl.handle.net/10675.2/108
DOI:
10.1093/bioinformatics/btp123
PubMed ID:
19261720
PubMed Central ID:
PMC2672630
Type:
Journal Article; Research Support, N.I.H., Extramural
ISSN:
1367-4811
Appears in Collections:
Department of Pathology: Faculty Research and Presentations

Full metadata record

DC Field | Value | Language
dc.contributor.author | Sharma, Ashok | en_US
dc.contributor.author | Podolsky, Robert H. | en_US
dc.contributor.author | Zhao, Jieping | en_US
dc.contributor.author | McIndoe, Richard A | en_US
dc.date.accessioned | 2010-09-24T22:03:21Z | -
dc.date.available | 2010-09-24T22:03:21Z | -
dc.date.issued | 2009-04-24 | en_US
dc.identifier.citation | Bioinformatics. 2009 May 1; 25(9):1152-1157 | en_US
dc.identifier.issn | 1367-4811 | en_US
dc.identifier.pmid | 19261720 | en_US
dc.identifier.doi | 10.1093/bioinformatics/btp123 | en_US
dc.identifier.uri | http://hdl.handle.net/10675.2/108 | -
dc.description.abstract | MOTIVATION: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a need for algorithms that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1). AVAILABILITY: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. | en_US
dc.rights | The PMC Open Access Subset is a relatively small part of the total collection of articles in PMC. Articles in the PMC Open Access Subset are still protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the license statement in each article for specific terms of use. The license terms are not identical for all articles in this subset. | en_US
dc.subject.mesh | Algorithms | en_US
dc.subject.mesh | Cluster Analysis | en_US
dc.subject.mesh | Computational Biology / methods | en_US
dc.subject.mesh | Oligonucleotide Array Sequence Analysis / methods | en_US
dc.subject.mesh | Pattern Recognition, Automated / methods | en_US
dc.subject.mesh | Software | en_US
dc.title | A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets. | en_US
dc.type | Journal Article | en_US
dc.type | Research Support, N.I.H., Extramural | en_US
dc.identifier.pmcid | PMC2672630 | en_US
dc.contributor.corporatename | Department of Pathology | en_US

All Items in Scholarly Commons are protected by copyright, with all rights reserved, unless otherwise indicated.