W. Dyrka, J.-C. Nebel and M. Kotulska
Algorithms for Molecular Biology
8:31, 2013
Abstract
Background:
Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions. Capturing such interactions requires the expressive power of at least context-free grammars. However, the application of more powerful grammar formalisms to protein analysis has been surprisingly limited.
Results:
In this work, we present a probabilistic grammatical framework for problem-specific protein languages and apply it to the classification of transmembrane helix-helix pair configurations. The core of the model consists of a probabilistic context-free grammar, automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training samples. The model was applied to produce sequence-based descriptors of four classes of transmembrane helix-helix contact site configurations. The highest performance of the classifiers reached an AUC ROC of 0.70. The analysis of grammar parse trees revealed the ability to represent structural features of helix-helix contact sites.
Conclusions:
We demonstrated that our probabilistic context-free framework for the analysis of protein sequences outperforms the state of the art in the task of helix-helix contact site classification. However, this is achieved without necessarily requiring the modeling of long-range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human-readable. Thus, they could provide biologically meaningful information for molecular biologists.
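As an illustration of how a probabilistic context-free grammar assigns a score to a sequence, the sketch below runs the standard inside (probabilistic CYK) algorithm on a toy grammar. It is not the grammar or the code from the paper: the nonterminals `S`, `X`, `H`, `P`, the rule probabilities, and the function name `inside_probability` are all assumptions made up for this example, and the genetic-algorithm inference step is not shown.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (NOT the grammar from the paper).
# Lexical rules: (nonterminal, amino-acid letter) -> probability.
LEXICAL = {
    ("H", "L"): 0.5, ("H", "A"): 0.5,  # H: a hydrophobic residue (Leu or Ala)
    ("P", "S"): 0.5, ("P", "T"): 0.5,  # P: a polar residue (Ser or Thr)
}
# Binary rules: (parent, left child, right child) -> probability.
# Together they generate nested pairs: S -> H S P | H P.
BINARY = {
    ("S", "H", "X"): 0.6,
    ("S", "H", "P"): 0.4,
    ("X", "S", "P"): 1.0,
}


def inside_probability(seq, start="S"):
    """Return the total probability that `start` derives `seq` (inside algorithm)."""
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[i][j][A] = probability that nonterminal A derives seq[i..j] inclusive.
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    # Spans of length 1 come from the lexical rules.
    for i, residue in enumerate(seq):
        for (nonterminal, terminal), prob in LEXICAL.items():
            if terminal == residue:
                chart[i][i][nonterminal] += prob
    # Longer spans combine two adjacent sub-spans through a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (parent, left, right), prob in BINARY.items():
                    p_left = chart[i][k].get(left, 0.0)
                    p_right = chart[k + 1][j].get(right, 0.0)
                    if p_left and p_right:
                        chart[i][j][parent] += prob * p_left * p_right
    return chart[0][n - 1].get(start, 0.0)


if __name__ == "__main__":
    for sequence in ("LS", "LASS", "LL"):
        print(sequence, inside_probability(sequence))
```

In this toy grammar each hydrophobic residue is paired with a polar residue at a mirrored position, which is the kind of nested, non-local dependency that a context-free grammar can express and a regular model cannot. Scores of this kind could in principle serve as sequence-based descriptors for a classifier, but how the paper actually derives its descriptors is not specified here.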
Keywords: Probabilistic context-free grammar, Grammar inference, Genetic algorithm, Helix-helix contact, Protein structure prediction
Cited by: 13 (Google Scholar: 13)