WP10 - Enhancing protein annotation standards
Development of standardized annotation according to scientific needs, in particular the exact description of the protein molecule to which an annotation actually applies, whenever this information is available.
Task 1: The standardization of the annotation of protein-protein interactions (PPIs) in UniProtKB/Swiss-Prot (D10.01).
As a result of a collaborative activity, high-confidence binary PPIs are imported automatically from IntAct into the UniProtKB/Swiss-Prot CC INTERACTION lines, which were created specifically for this purpose. PPIs that are not yet covered by IntAct are manually annotated within the CC SUBUNIT blocks. To facilitate the extraction of binary interactions, we will modify the format of the CC INTERACTION blocks to cover both manually annotated and imported PPIs. The new format will allow the exact description of the proteins involved in an interaction, using UniProtKB accession numbers, isoform identifiers (IsoIds) and feature identifiers (FTIds). The source of the information will be indicated by the PubMed identifiers of the relevant publications. More than 111,000 CC SUBUNIT annotation blocks (as of January 2008) will have to be checked for binary interactions, and the relevant information transferred into the new CC INTERACTION block format. Tools will be developed to assist the format conversion in a semi-automated way, and annotation tools and other applications will be adapted accordingly. The new format of the manually curated CC INTERACTION blocks will be suitable for exporting data from UniProtKB/Swiss-Prot to other databases such as IntAct, and it will allow scientists to extract binary interactions from UniProtKB/Swiss-Prot easily and unambiguously.
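To illustrate the kind of information the new CC INTERACTION format is meant to carry, the following minimal Python sketch models a binary interaction as two participants, each described by a UniProtKB accession number with an optional IsoId and FTId, plus the supporting PubMed identifiers. It is not the actual flat-file format, which is not specified here: the class names, the `label` and `key` conventions, the `source` field and the `merge_interactions` helper are all assumptions made for illustration, and the identifiers in the usage example are placeholders rather than real annotations.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Participant:
    """One interacting protein molecule (hypothetical model)."""
    accession: str                      # UniProtKB accession number
    isoform_id: Optional[str] = None    # IsoId, if a specific isoform is known
    feature_id: Optional[str] = None    # FTId, if a specific chain/peptide is known

    def label(self) -> str:
        # Most specific identifier available; the "isoform:feature" layout
        # is an illustrative convention, not a UniProtKB one.
        base = self.isoform_id or self.accession
        return f"{base}:{self.feature_id}" if self.feature_id else base


@dataclass(frozen=True)
class BinaryInteraction:
    """A binary PPI with its literature evidence (hypothetical model)."""
    a: Participant
    b: Participant
    pubmed_ids: Tuple[str, ...] = ()
    source: str = "manual"              # e.g. "manual" or "IntAct" (assumed values)

    def key(self) -> Tuple[str, str]:
        # Order-independent key, so that A-B and B-A count as the same pair.
        first, second = sorted((self.a.label(), self.b.label()))
        return (first, second)


def merge_interactions(
    interactions: Iterable[BinaryInteraction],
) -> Dict[Tuple[str, str], List[str]]:
    """Collapse duplicate pairs and pool their PubMed identifiers."""
    merged: Dict[Tuple[str, str], set] = {}
    for interaction in interactions:
        merged.setdefault(interaction.key(), set()).update(interaction.pubmed_ids)
    return {pair: sorted(ids) for pair, ids in merged.items()}


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    p1 = Participant("P12345", isoform_id="P12345-2")
    p2 = Participant("Q67890", feature_id="PRO_0000000001")
    records = [
        BinaryInteraction(p1, p2, pubmed_ids=("00000001",), source="manual"),
        BinaryInteraction(p2, p1, pubmed_ids=("00000002",), source="IntAct"),
    ]
    for pair, pmids in merge_interactions(records).items():
        print(pair, pmids)
```

Keying each pair on the most specific identifier available reflects the goal stated above: an interaction annotated for a particular isoform or processed chain stays distinct from one annotated only at the level of the entry, while manually curated and imported records for the same pair can be pooled.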
Task 2: The development of naming standards for proteins (D10.02).
Consistent nomenclature is indispensable for communication, literature searching and entry retrieval. Many species-specific gene nomenclature committees have been established that try to assign consistent and, where possible, meaningful gene symbols. Other scientific communities have established protein nomenclatures for particular sets of proteins based on sequence similarity and/or function. However, there is no established organization involved in the standardization of protein names, nor are there any efforts to establish naming rules that are valid across the widest possible spectrum of species. UniProt constantly strives to further standardize the nomenclature for a given protein across related organisms. This is accomplished via protein family-driven annotation, through both manual and automated pipelines. It also involves the ongoing standardization of all existing UniProtKB/Swiss-Prot protein names according to our guidelines. We aim to assign a recommended name to every protein in UniProtKB/Swiss-Prot (cf. WP05, Support offered in this project).