TDI Kernel database

Posted on 06.13.08.

Methods

Computational pipeline

We have assembled a computational pipeline that relies on several databases and programs, taking as input protein sequences and producing an output containing protein models as well as predicted locations of binding sites for small molecules on their surfaces and predicted types of molecules they bind. The pipeline, which relies on the MODPIPE package[1] and the AnnoLyze program[2], has been applied to genomes of ten pathogens that cause tropical diseases. The output of the pipeline has been stored in a relational database for easy searching and dissemination over the web.

TDI target genomes

We selected the following ten target genomes based on both disease burden and the completeness of published sequences: Cryptosporidium hominis (CryptoDB[3]), Cryptosporidium parvum (CryptoDB[3]), Leishmania major (GeneDB[4]), Mycobacterium leprae (OrthoMCL-DB[5]), Mycobacterium tuberculosis (TubercuList[6]), Plasmodium falciparum (PlasmoDB[7]), Plasmodium vivax (PlasmoDB[7]), Trypanosoma brucei (GeneDB[4]), Trypanosoma cruzi (GeneDB[4]), and Toxoplasma gondii (ToxoDB[8]). We then mapped the transcript sequences onto UniProt IDs[9].

Annotation databases

Functional annotation for predicted binding sites in our models relied on the following databases: (i) UniProt[9], which contains 385,721 sequences from the SwissProt database and 5,814,087 sequences from the TrEMBL database, was used to annotate the transcripts from the target genomes; (ii) MODBASE[10], which contains 6,805,385 comparative models calculated by MODPIPE for domains in 1,810,521 proteins, was used to store all comparative models; (iii) DBAli[11], which contains 1.7 billion pairwise alignments generated by an all-against-all comparison of known protein structures, was used to identify structure relationships between our modeling templates and other known protein structures; (iv) LigBase[12], which contains 232,852 structurally defined ligand-binding sites in PDB, was used as a resource for AnnoLyze to predict ligand binding sites on pathogen protein models; (v) MSDChem[13], which contains 8,287 small ligands, was used as an annotated repository of small molecules in the PDB database; and (vi) DrugBank[14], which contains 4,765 drug-like compounds (including 1,485 FDA-approved small molecule drugs, 128 FDA-approved biotech drugs, 71 nutraceuticals and 3,243 experimental drugs), was used to identify small molecules in the MSDChem database that have similar chemical composition to known drugs.

Comparative protein structure prediction

Models for all sequences from the ten target genomes were calculated using MODPIPE, our automated software pipeline for comparative modeling[1, 15]. MODPIPE relies primarily on the various modules of MODELLER[16] for its functionality and is adapted for large-scale operation on a cluster of PCs using scripts written in Perl and Python. Sequence-structure matches are established using a variety of fold-assignment methods, including sequence-sequence[17], profile-sequence[18, 19] and profile-profile alignments[19, 20]. To increase the odds of finding a template structure, a permissive E-value threshold of 1.0 is used. By default, ten models are calculated for each alignment[16]. A representative model for each alignment is then chosen by ranking the models with the atomic distance-dependent statistical potential DOPE[21]. Finally, the fold of each model is evaluated using a composite model quality criterion that includes the coverage of the modeled sequence, the sequence identity implied by the sequence-structure alignment, the fraction of gaps in the alignment, the compactness of the model, and various statistical potential Z-scores[21-23]. We used only the models that were predicted to have a “correct” fold (i.e., a MODPIPE quality score higher than 1.0); based on our benchmarking studies, we expect a true positive rate of 93% and a false positive rate of 5% at this threshold.
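The selection logic described above can be illustrated with a minimal Python sketch. It is not MODPIPE code: the `Model` record, its field names, and the `select_models` function are hypothetical, and the sketch assumes that a lower DOPE score indicates a better model and that the quality score has already been computed for each model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

QUALITY_CUTOFF = 1.0  # quality-score threshold stated in the text


@dataclass(frozen=True)
class Model:
    """Hypothetical record for one comparative model."""
    alignment_id: str     # alignment the model was built from
    model_id: str
    dope: float           # DOPE score; lower is assumed to be better
    quality_score: float  # composite quality score, assumed precomputed


def select_models(models: Iterable[Model]) -> List[Model]:
    """Keep the best-DOPE model per alignment, then apply the quality cutoff."""
    best: Dict[str, Model] = {}
    for m in models:
        current = best.get(m.alignment_id)
        if current is None or m.dope < current.dope:
            best[m.alignment_id] = m
    # Retain only representatives scoring strictly above the cutoff.
    return [m for m in best.values() if m.quality_score > QUALITY_CUTOFF]


if __name__ == "__main__":
    candidates = [
        Model("aln1", "m1", -100.0, 1.2),
        Model("aln1", "m2", -150.0, 1.1),
        Model("aln2", "m3", -90.0, 0.8),
    ]
    for m in select_models(candidates):
        print(m.alignment_id, m.model_id)
```

In this toy input, the second model of the first alignment is kept as the representative, and the only model of the second alignment is discarded because its quality score does not exceed 1.0.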

Binding site prediction

The AnnoLyze program[2] was used to predict binding sites for small molecules on all well-assessed models. Briefly, AnnoLyze predicts ligand-binding sites on the surface of a model by transferring known ligands in the LigBase database[12] via the target-template alignment. Such predictions are made in a two-step process (Figure 4): (i) transfer of a binding site between known structures (i.e., a ligand bound to a structure is transferred if at least 75% of the LigBase-defined binding site residues are within 4 Å of the template residues in a global superposition of the two structures and if at least 75% of the binding site residue types are invariant); and (ii) transfer of a binding site to a comparative model from its template (i.e., a ligand bound to a template structure is transferred to the comparative model if at least 75% of the LigBase-defined binding site residues are within 4 Å of the template residues and if at least 75% of the binding site residue types are invariant). Using these cutoffs, approximately 30% of the selected models had at least one predicted binding site for small molecules (Table 1); the predicted ligands were then mapped to MSDChem entries.
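The two 75% criteria can be made concrete with a short Python sketch. It is an illustration only, not the AnnoLyze implementation: the function name and data layout are hypothetical, each residue is reduced to a single representative coordinate (the text does not say which atoms are used for the distance), and the structures are assumed to be already superposed.

```python
import math
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]
Residue = Tuple[str, Coord]  # (residue type, representative coordinate)


def can_transfer_site(
    pairs: Sequence[Tuple[Residue, Optional[Residue]]],
    max_dist: float = 4.0,       # distance cutoff in Å, from the text
    min_fraction: float = 0.75,  # 75% cutoff, from the text
) -> bool:
    """Decide whether a binding site passes both transfer criteria.

    Each element of `pairs` couples one binding-site residue with its
    equivalent residue in the other structure, or None if there is none.
    """
    if not pairs:
        return False
    close = 0
    conserved = 0
    for (site_type, site_xyz), partner in pairs:
        if partner is None:
            continue  # unmatched residues count against both fractions
        partner_type, partner_xyz = partner
        if math.dist(site_xyz, partner_xyz) <= max_dist:
            close += 1
        if partner_type == site_type:
            conserved += 1
    total = len(pairs)
    return (close / total >= min_fraction
            and conserved / total >= min_fraction)
```

A site with four residues, three of which lie within 4 Å of their equivalents and three of which keep the same residue type, passes both tests at exactly 75%; dropping either count to two of four fails the check.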

From ligands to drugs

The jcsearch program from the JChem package[24] was used with default parameters to match related compounds in MSDChem and DrugBank. Four types of matches were collected: (i) exact matches (i.e. their SMILES strings[25] matched with a Tanimoto score[26] equal to 1.0); (ii) substructure matches in which a matched DrugBank query molecule is a part of an MSDChem molecule; (iii) substructure matches in which an identified MSDChem molecule is a part of a DrugBank query molecule; and (iv) similar matches with a Tanimoto score between MSDChem and DrugBank molecules of at least 0.9.
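As a rough illustration of the four match types, the sketch below computes a Tanimoto score on sets of fingerprint bits and assigns a match label. It does not call jcsearch or reproduce its fingerprints: the function names and labels are hypothetical, the substructure relationships are assumed to have been determined beforehand by a cheminformatics toolkit, and the order in which the categories are tested is an assumption.

```python
from typing import FrozenSet, Optional


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto score of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # both fingerprints empty; defined as 0.0 in this sketch
    return len(a & b) / union


def classify_match(
    score: float,
    drug_in_ligand: bool,  # DrugBank molecule is part of the MSDChem molecule
    ligand_in_drug: bool,  # MSDChem molecule is part of the DrugBank molecule
    similar_cutoff: float = 0.9,  # similarity cutoff, from the text
) -> Optional[str]:
    """Return a label for the match type, or None if no criterion is met."""
    if score == 1.0:
        return "exact"
    if drug_in_ligand:
        return "drug_in_ligand"
    if ligand_in_drug:
        return "ligand_in_drug"
    if score >= similar_cutoff:
        return "similar"
    return None


if __name__ == "__main__":
    fp_ligand = frozenset({1, 2, 3})
    fp_drug = frozenset({2, 3, 4})
    score = tanimoto(fp_ligand, fp_drug)
    print(score, classify_match(score, False, False))
```

Representing fingerprints as sets of bit indices keeps the example free of external dependencies; a real workflow would obtain both the fingerprints and the substructure flags from a chemistry toolkit.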

Data storage, sharing and licensing

The entire kernel, including all predicted models and binding sites, is freely available over the web (http://www.tropicaldisease.org/kernel). The server uses the WordPress package (http://www.wordpress.org), a widely used platform that facilitates easy creation, storage and dissemination of each target entry in our database. WordPress also supports numerous “plugins”, including a rating system that allows TDI web site users to rate targets for “druggability.” The package also supports bookmarking by most web-based social networks. In particular, each of the TDI kernel’s target pages includes a “blog it” button that allows registered users of The Synaptic Leap (TSL, http://www.thesynapticleap.org) to post TDI entries directly into the TSL discussion panels. TSL is our web-based “collaboratory” portal that is designed to host open source drug discovery projects in much the same way SourceForge hosts software collaborations.

The TDI kernel is freely downloadable as public domain data. Options include direct downloads of individually requested targets, pre-defined sets for each of our ten target genomes, and user-defined batch downloads. Users receive the data without restriction, in accordance with the Science Commons protocol for implementing open access data[27], which was designed to embody normal academic attribution norms and facilitate tracking of work based on the kernel. We note that our predictions are public domain, but some of the drugs used in our predictions might be subject to patents.

References

1. Eswar, N., et al., Tools for comparative protein structure modeling and analysis. Nucleic Acids Res, 2003. 31(13): p. 3375-80.

2. Marti-Renom, M.A., et al., The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics, 2007. 8 Suppl 4: p. S4.

3. Heiges, M., et al., CryptoDB: a Cryptosporidium bioinformatics resource update. Nucleic Acids Res, 2006. 34(Database issue): p. D419-22.

4. Hertz-Fowler, C., et al., GeneDB: a resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res, 2004. 32(Database issue): p. D339-43.

5. Chen, F., et al., OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res, 2006. 34(Database issue): p. D363-8.

6. Cole, S.T., et al., Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature, 1998. 393(6685): p. 537-44.

7. Stoeckert, C.J., Jr., et al., PlasmoDB v5: new looks, new genomes. Trends Parasitol, 2006. 22(12): p. 543-6.

8. Gajria, B., et al., ToxoDB: an integrated Toxoplasma gondii database resource. Nucleic Acids Res, 2008. 36(Database issue): p. D553-6.

9. Wu, C.H., et al., The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 2006. 34(Database issue): p. D187-91.

10. Pieper, U., et al., MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res, 2006. 34(Database issue): p. D291-5.

11. Marti-Renom, M.A., et al., DBAli tools: mining the protein structure space. Nucleic Acids Res, 2007. 35(Web Server issue): p. W393-7.

12. Stuart, A.C., V.A. Ilyin, and A. Sali, LigBase: a database of families of aligned ligand binding sites in known protein sequences and structures. Bioinformatics, 2002. 18(1): p. 200-1.

13. Golovin, A., et al., MSDsite: a database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins, 2005. 58(1): p. 190-9.

14. Wishart, D.S., et al., DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res, 2008. 36(Database issue): p. D901-6.

15. Eswar, N., et al., ModPipe: a large-scale protein structure modeling pipeline for the genomic era. 2008. Submitted.

16. Sali, A. and T.L. Blundell, Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol, 1993. 234(3): p. 779-815.

17. Smith, T.F. and M.S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.

18. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402.

19. Eswar, N., et al., Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics, 2006. Chapter 5: p. Unit 5 6.

20. Marti-Renom, M.A., M.S. Madhusudhan, and A. Sali, Alignment of protein sequences by their profiles. Protein Sci, 2004. 13(4): p. 1071-87.

21. Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Sci, 2006. 15(11): p. 2507-24.

22. Eramian, D., et al., A composite score for predicting errors in protein structure models. Protein Sci, 2006. 15(7): p. 1653-66.

23. Melo, F., R. Sanchez, and A. Sali, Statistical potentials for fold assessment. Protein Sci, 2002. 11(2): p. 430-48.

24. Csizmadia, F., JChem: Java applets and modules supporting chemical database handling from web browsers. J Chem Inf Comput Sci, 2000. 40(2): p. 323-4.

25. Weininger, D., A. Weininger, and J.L. Weininger, SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci, 1989. 29: p. 97-101.

26. Gower, J.C., A general coefficient of similarity and some of its properties. Biometrics, 1971. 27: p. 857-71.

27. Science Commons, Protocol for Implementing Open Access Data. 2008 [cited July 2008]; Available from: http://sciencecommons.org/projects/publishing/open-access-data-protocol/.
