Genome sequencing and structural genomics initiatives have produced a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. Computational approaches that can predict protein function are therefore essential for bridging this widening annotation gap.
PIRSF is a resource that classifies full-length proteins. It applies a set of rules to define primary and curated clusters, which also draw on textual information (protein names, literature) and parent-child relationships. These clusters (named superfamilies) are further divided into those with full-length similarity (that is, a common domain architecture) and those sharing an ancestral domain. PIRSF covers more than two-thirds of the protein sequence space.
Sequence annotations describe regions or sites of interest in the protein sequence, such as post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics.
Family and superfamily classification also serves as the basis for rule-based procedures that provide rich automatic functional annotation among homologous sequences and perform integrity checks. By combining classification information with sequence patterns or profiles, rules have been defined to predict position-specific sequence features such as active sites, binding sites, modification sites, and sequence motifs. We derive family-specific patterns for such features from alignments of closely related sequences, some of which have experimentally determined properties. Studying proteins at the domain level, in turn, allows more accurate functional inference and is useful for predicting the function of novel domain combinations that may give rise to new protein functions.
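As an illustration of how a position-specific rule of this kind might be applied, here is a minimal Python sketch. The family identifier, the pattern and the sequence are all invented for the example and are not taken from any real resource; the only point is that a rule fires when the protein belongs to the right family and the family-specific pattern matches.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SiteRule:
    family: str    # family the rule is restricted to (hypothetical ID)
    feature: str   # feature type to annotate, e.g. "Active site"
    pattern: str   # regex; capture group 1 marks the feature residue


@dataclass(frozen=True)
class PredictedSite:
    feature: str
    position: int  # 1-based position in the sequence
    residue: str


def apply_rule(rule: SiteRule, family: str, sequence: str) -> List[PredictedSite]:
    """Return predicted sites if the protein is in the rule's family."""
    if family != rule.family:
        return []  # classification check comes first
    sites = []
    # Note: re.finditer reports non-overlapping matches only.
    for match in re.finditer(rule.pattern, sequence):
        sites.append(
            PredictedSite(rule.feature, match.start(1) + 1, match.group(1))
        )
    return sites


# Toy rule: a serine flanked by glycines two residues away on each side.
TOY_RULE = SiteRule(family="FAM_TOY", feature="Active site", pattern=r"G.(S).G")
TOY_SEQUENCE = "MKTAGDSAGLLV"

sites = apply_rule(TOY_RULE, "FAM_TOY", TOY_SEQUENCE)
for site in sites:
    print(site.feature, site.position, site.residue)
```

Restricting the rule to a family before matching the pattern is what keeps a short motif from being propagated to unrelated sequences that happen to contain it by chance.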
The process of functional annotation involves assessing available evidence and reaching a conclusion about what we think the protein is doing in the cell and why.
Functional annotations should only be as specific as the supporting evidence allows.
All evidence that led to an annotation conclusion must be stored.
In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
Basic set of protein annotations
Protein name: descriptive common name for the protein, e.g. “ribokinase”
Gene symbol: mnemonic abbreviation for the gene, e.g. “recA”
EC number: only applicable to enzymes, e.g. 1.4.3.2
Role: what the protein is doing in the cell and why, e.g. “amino acid biosynthesis”
Supporting evidence: accession numbers of BER and HMM matches, TMHMM, SignalP and LipoP results, and any other information used to make the annotation
Unique identifier: e.g. locus IDs
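The basic annotation set above maps naturally onto a small record type. The following Python sketch is one possible layout, not a description of any particular annotation pipeline: the class names, the `problems` check and the placeholder locus ID and accession are assumptions made for illustration, and the EC number pattern is a simplification that accepts four dot-separated fields with trailing fields allowed to be "-".

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified EC number shape: e.g. "1.4.3.2" or a partial "1.4.-.-".
EC_PATTERN = re.compile(r"^\d+(\.(\d+|-)){3}$")


@dataclass
class Evidence:
    method: str     # e.g. "HMM", "BER", "TMHMM", "SignalP", "LipoP"
    accession: str  # accession or other reference for the match


@dataclass
class ProteinAnnotation:
    locus_id: str                      # unique identifier
    protein_name: str                  # descriptive common name
    role: str                          # what the protein does in the cell
    gene_symbol: Optional[str] = None  # mnemonic abbreviation, if assigned
    ec_number: Optional[str] = None    # enzymes only
    evidence: List[Evidence] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List basic integrity problems with this record."""
        issues = []
        if self.ec_number is not None and not EC_PATTERN.match(self.ec_number):
            issues.append(f"malformed EC number: {self.ec_number}")
        if not self.evidence:
            issues.append("no supporting evidence stored")
        return issues


# Illustrative record; locus ID and accession are placeholders.
record = ProteinAnnotation(
    locus_id="LOCUS_0001",
    protein_name="ribokinase",
    role="ribose utilization",
    gene_symbol="rbsK",
    ec_number="2.7.1.15",
    evidence=[Evidence(method="HMM", accession="HYPOTHETICAL_HMM_0001")],
)
print(record.problems())
```

Keeping the evidence list on the record itself makes it straightforward to enforce the guideline that every annotation must be traceable to the evidence behind it.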
Protein annotation provides the following useful information:
Molecule processing: Initiator methionine, Signal, Transit peptide, Propeptide, Chain, Peptide
Regions: Topological domain, Transmembrane, Intramembrane, Domain, Repeat, Calcium binding, Zinc finger, DNA binding, Nucleotide binding, Region, Coiled coil, Motif, Compositional bias
Sites: Active site, Metal binding, Binding site
Amino acid modifications: Non-standard residue, Modified residue, Lipidation, Glycosylation, Disulfide bond
Natural variations: Alternative sequence, Natural variant
Experimental info: Mutagenesis, Sequence uncertainty, Sequence conflict, Non-adjacent residues, Non-terminal residue
Secondary structure: Helix, Turn, Beta strand
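The feature types listed above fall into seven groups, and a small lookup table makes that grouping usable in code. The Python sketch below simply transcribes the list; the variable and function names are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Feature groups and their feature types, transcribed from the list above.
FEATURE_CATEGORIES: Dict[str, List[str]] = {
    "Molecule processing": [
        "Initiator methionine", "Signal", "Transit peptide",
        "Propeptide", "Chain", "Peptide",
    ],
    "Regions": [
        "Topological domain", "Transmembrane", "Intramembrane", "Domain",
        "Repeat", "Calcium binding", "Zinc finger", "DNA binding",
        "Nucleotide binding", "Region", "Coiled coil", "Motif",
        "Compositional bias",
    ],
    "Sites": ["Active site", "Metal binding", "Binding site"],
    "Amino acid modifications": [
        "Non-standard residue", "Modified residue", "Lipidation",
        "Glycosylation", "Disulfide bond",
    ],
    "Natural variations": ["Alternative sequence", "Natural variant"],
    "Experimental info": [
        "Mutagenesis", "Sequence uncertainty", "Sequence conflict",
        "Non-adjacent residues", "Non-terminal residue",
    ],
    "Secondary structure": ["Helix", "Turn", "Beta strand"],
}

# Reverse index: feature type -> group.
CATEGORY_OF: Dict[str, str] = {
    feature: category
    for category, features in FEATURE_CATEGORIES.items()
    for feature in features
}


def group_by_category(feature_types: Iterable[str]) -> Dict[str, List[str]]:
    """Group a protein's feature types under their categories."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for feature in feature_types:
        # Unknown feature types raise KeyError rather than being guessed.
        grouped[CATEGORY_OF[feature]].append(feature)
    return dict(grouped)


grouped = group_by_category(["Signal", "Transmembrane", "Active site", "Helix"])
for category, features in grouped.items():
    print(f"{category}: {', '.join(features)}")
```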