
Frontiers in Bioscience-Elite (FBE) is published by IMR Press from Volume 13 Issue 2 (2021). Previous articles were published by another publisher on a subscription basis, and they are hosted by IMR Press on imrpress.com as a courtesy and upon agreement with Frontiers in Bioscience.

Article

MotifOrganizer: a scalable model-based motif clustering tool for mammalian genomes

1 Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta GA 30322, USA
2 Department of Biomedical Informatics, Emory University School of Medicine, Atlanta GA 30322, USA
3 Center for Comprehensive Informatics, Emory University, Atlanta GA 30322, USA
4 British Columbia Cancer Agency Genome Sciences Centre, Vancouver, BC, V5Z 4E6, Canada
5 Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor MI 48109, USA

*Author to whom correspondence should be addressed.

 

Front. Biosci. (Elite Ed) 2013, 5(2), 785–797; https://doi.org/10.2741/E659
Published: 1 January 2013
Abstract

Assembling a comprehensive catalog of all transcription factors (TFs) and the genes that they regulate (regulons) is important for understanding gene regulation. The sequence-specific conserved binding profiles of TFs can be characterized from whole-genome sequences with phylogenetic approaches, and a large number of such profiles have been released. Effective mining of these data sources could reveal novel functional elements computationally. Because binding sites are variable, it is necessary to generalize the profiles pertaining to the same TF by clustering. The summarized familial profile is effective in identifying unknown binding sites, thus leading to the prediction of gene co-regulation. Here we report MotifOrganizer, a scalable model-based clustering algorithm designed for grouping motifs identified from large-scale comparative genomics studies on mammalian species. The new algorithm allows grouping of motifs with variable widths, and a novel two-stage operation scheme further increases scalability. MotifOrganizer demonstrated favorable performance compared with distance-based and single-stage model-based clustering tools on simulated data. Tests on approximately 150k motifs from the cisRED human database demonstrated that MotifOrganizer can effectively cluster whole-genome sets of mammalian motifs.
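
To make the idea of a summarized familial profile concrete, the following is a minimal sketch and not the MotifOrganizer algorithm itself. It assumes that the motifs assigned to one cluster are already aligned and of equal width (MotifOrganizer handles variable widths; this sketch does not), pools their base counts into a single profile, and scores a candidate site against a uniform background. The function names `familial_profile` and `log_odds_score`, the pseudocount, the background frequency, and the two example motifs are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"  # column order assumed for every count matrix


def familial_profile(count_matrices, pseudocount=0.5):
    """Pool equal-width count matrices into one column-normalized profile.

    Each matrix is a list of positions; each position holds counts for A, C, G, T.
    Assumption: the motifs are pre-aligned and share the same width.
    """
    width = len(count_matrices[0])
    if any(len(m) != width for m in count_matrices):
        raise ValueError("this sketch assumes pre-aligned motifs of equal width")
    profile = []
    for pos in range(width):
        # Sum the counts of each base across all motifs, then smooth.
        pooled = [sum(m[pos][b] for m in count_matrices) + pseudocount
                  for b in range(len(BASES))]
        total = sum(pooled)
        profile.append([c / total for c in pooled])
    return profile


def log_odds_score(profile, site, background=0.25):
    """Sum of per-position log2(p / background) for a candidate site."""
    if len(site) != len(profile):
        raise ValueError("site length must match profile width")
    return sum(math.log2(col[BASES.index(base)] / background)
               for col, base in zip(profile, site))


# Two hypothetical 3-bp motifs (rows = positions, columns = A, C, G, T counts).
motif_a = [[8, 0, 0, 0], [0, 8, 0, 0], [0, 0, 7, 1]]
motif_b = [[6, 1, 0, 1], [0, 8, 0, 0], [0, 0, 8, 0]]
profile = familial_profile([motif_a, motif_b])
```

In this toy setting, a site that matches the pooled consensus (for example `log_odds_score(profile, "ACG")`) receives a positive score, while a mismatching site such as `"TTT"` receives a negative one, which is the sense in which a familial profile can be used to flag candidate binding sites.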

Keywords
Model-based clustering
Transcription factor binding site
Motif
Bayesian
Scalability