sma3s: Sequence Massive Annotation With 3 modules
Sma3s is a biological sequence annotation tool focused on the massive annotation of sequences obtained from any kind of gene library or genome. It is composed of 3 modules that sequentially obtain the annotation from:
- Already existing annotated sequences.
- Orthologous sequences.
- Groups of sequences sharing statistically significant patterns.
Sma3s provides (i) gene ontology terms, (ii) Swiss-Prot keywords, (iii) Swiss-Prot pathways, (iv) InterPro domains, and (v) IntAct interactions.
Biological sequence annotation is the process of finding, retrieving and incorporating relevant biological information available in public databases about an individual sequence or a massive collection of sequences. New descriptors annotated for a given sequence provide insights into its function, cellular location, biological process and/or protein structure. This is of special interest in genome and EST (expressed sequence tag) sequencing projects, as well as in gene-expression experiments and other active areas of research.
Sma3s is a contribution to this field, focused on the massive annotation of sequences obtained from any kind of gene library or genome. It aims to provide high prediction accuracy, with high sensitivity and specificity, while requiring minimal human participation and computational resources. It is composed of three modules that address the problem in order of increasing complexity, all of them based on a preliminary exhaustive Blast search. The 3 modules sequentially handle the increasingly complex cases of obtaining the annotation from: (a) already existing annotated sequences; (b) orthologous sequences; and (c) groups of sequences sharing statistically significant patterns. Blast outputs are the basic source of information in every case of the annotation procedure. As a result, Sma3s obtains the following biological descriptors for each sequence: gene ontology (GO) terms, Swiss-Prot keywords and pathways, InterPro domains, and IntAct interactions.
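The sequential use of the three modules can be pictured as a cascade. The following Python sketch is only an illustration and is not taken from the Sma3s source: it assumes that each module can be modelled as a function returning an annotation (or nothing), and that a sequence is passed to the next module only when the previous one found no annotation. The names annotate_sequentially, Annotation and Module are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical types: an annotation maps a descriptor type (e.g. "GO")
# to a list of values; a module takes a sequence id and returns an
# annotation, or None when it cannot annotate that sequence.
Annotation = Dict[str, List[str]]
Module = Callable[[str], Optional[Annotation]]


def annotate_sequentially(
    seq_ids: Sequence[str],
    modules: Sequence[Tuple[str, Module]],
) -> Dict[str, Optional[Tuple[str, Annotation]]]:
    """Try each module in order and keep the first annotation found.

    Assumption (not confirmed by the text above): a sequence annotated
    by an earlier module is not passed on to the later ones.
    """
    results: Dict[str, Optional[Tuple[str, Annotation]]] = {}
    for seq_id in seq_ids:
        results[seq_id] = None  # stays None if no module succeeds
        for name, module in modules:
            annotation = module(seq_id)
            if annotation:
                results[seq_id] = (name, annotation)
                break
    return results
```

Under these assumptions, the modules would be supplied in the order (a), (b), (c), so that the simplest source of annotation is always tried first.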
To annotate a sequence dataset, the user needs:
- the sequences in FASTA format.
- a taxonomic division of the UniProt database, which can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/
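Before starting an annotation run, it can be useful to confirm that the input really is in FASTA format. The sketch below is an independent Python helper, not part of Sma3s; the function name read_fasta is hypothetical, and it takes the first word after ">" as the sequence identifier, which is a common convention rather than something stated here.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA-formatted stream.

    Raises ValueError if sequence data appears before any '>' header.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else ""
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage with a small in-memory example (toy data, not real sequences)
example = io.StringIO(">seq1 first toy record\nMKT\nAAL\n>seq2\nGGC\n")
records = list(read_fasta(example))
```

A file that raises ValueError, or that yields no records at all, is probably not a valid FASTA input.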
Sma3s uses the blastall and blastclust programs from the Blast package to search for homology relationships, and it indexes the files for fast access. The program therefore needs the Perl language (http://www.perl.com), the Blast package (ftp://ftp.ncbi.nlm.nih.gov/blast/), and the following Perl libraries (http://www.bioperl.org):
- Bio::SearchIO (included in Bioperl)
- Bio::Index::Fasta (included in Bioperl)
- Bio::Index::Swissprot (included in Bioperl)
- Bio::Tools::Run::StandAloneBlast (included in Bioperl)
- IO::String (http://search.cpan.org/~gaas/IO-String-1.08/String.pm)
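The requirements above can be checked before running the tool. The following Python sketch is an assumption-laden convenience script, not something shipped with Sma3s: it looks for the perl, blastall and blastclust executables on the PATH and tests each Perl module by asking perl to load it (perl -MModule -e 1). The function names are hypothetical.

```python
import shutil
import subprocess

# Programs and Perl modules named in the requirements list above
REQUIRED_PROGRAMS = ["perl", "blastall", "blastclust"]
REQUIRED_PERL_MODULES = [
    "Bio::SearchIO",
    "Bio::Index::Fasta",
    "Bio::Index::Swissprot",
    "Bio::Tools::Run::StandAloneBlast",
    "IO::String",
]


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the programs that cannot be found on the PATH."""
    return [p for p in programs if shutil.which(p) is None]


def perl_module_available(module):
    """Return True if perl can load the module, False otherwise."""
    if shutil.which("perl") is None:
        return False
    result = subprocess.run(
        ["perl", f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_perl_modules(modules=REQUIRED_PERL_MODULES):
    """Return the Perl modules that perl fails to load."""
    return [m for m in modules if not perl_module_available(m)]


if __name__ == "__main__":
    print("Missing programs:", missing_programs())
    print("Missing Perl modules:", missing_perl_modules())
```

An empty list from both functions suggests that the listed dependencies are installed, although it does not verify their versions.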