Editorial Board-Authored
Annotation Drivers for Sequence Databases

Daniel H. Haft and Bruce C. Orcutt
National Biomedical Research Foundation, Washington, DC

Poster Presentation:
American Society for Biochemistry and Molecular Biology
May 16-20, 1998
Washington Convention Center
Washington, DC

Abstract

We introduce a general mechanism for annotating sequence databases effectively by distributing editorial responsibility to scientific experts and collecting their submissions as annotation drivers. Annotation drivers are rules or data that can be applied automatically to force new annotation. We demonstrate the model for the PIR-International Protein Sequence Database. PIR's ATLAS program supports query by sequence, taxon, superfamily classification, features, etc. It defines an implicit, intuitive scripting language we call AQL (ATLAS Query Language) for selecting groups of entries, testing attributes, and reporting needed changes. Each run adds content searchable by other AQL scripts, so annotation builds upon annotation. The newly established PIR Editorial Board is our panel of experts; their contributions become AQL scripts by which annotation is attached to protein entries. We are working to expand the editorial board panel, develop tools to help construct these annotation drivers, and offer alternative search criteria that make the drivers directly portable to other protein databases. This work is supported in part by the NLM grant LM05798.

 

Introduction

Annotated protein databases are a resource of critical importance to biological research. The immense scale of accumulated scientific information and the rapid pace of discovery of new proteins require new, scalable approaches. We have developed a model for achieving more efficient, more responsive, higher quality annotation. First, we distribute the editorial responsibility of determining which proteins merit which annotations to the community of experts in those protein groups. These scientists are the PIR Editorial Board. Second, we capture the results of each expert review in the form of tools that drive automatic annotation of protein sequences; we call these tools annotation drivers.

The classification of proteins by superfamily, family, and homology domain [1] has been a major focus of the Protein Information Resource (PIR) in the development of the PIR-International Protein Sequence Database [2]. The ATLAS query program can select groups of proteins by this classification and other annotations, both for assignment to Editorial Board members for review and for annotation by automated tools. The commands recognized by the ATLAS program comprise the ATLAS Query Language, or AQL, a scripting language in which we have already written a number of powerful annotation drivers. By enlisting the help of the scientific community, and by working on the development of tools rather than on the entries themselves, we achieve scalable database annotation in two compelling ways.

 

The Protein Information Resource

The Protein Information Resource (PIR) supports biological research by stimulating and facilitating inquiry into protein structure, function, modification, regulation, interaction, and evolution. We pursue this mission through continued development of the Protein Sequence Database of PIR-International and supporting auxiliary databases. By an integrated approach of data collection, computation, collaboration, and curation, we seek to take information that is theoretically available to the scientific community by search or calculation and make it readily accessible, clearly understandable, immediately suggestive of further inquiry, and robustly interconnected to the community of interoperable biological databases.

The Protein Sequence Database is available free over the Internet for downloading from FTP sites or searching at our Web site. The database is also distributed on CD-ROM together with search indexes and versions of the ATLAS query program that run on a number of different computer platforms.

A major project of PIR is the classification of proteins by homology domains, superfamily (full-length homology), and family (closely related proteins) [1]. This classification scheme provides a framework for understanding the relationships among proteins, the structural and functional significance of regions of similarity, and the range over which empirically-based annotations ought to be propagated.

The PIR Editorial Board is a group of scientists who have agreed to assist the PIR in the many specialized biological and biomedical fields involved in describing the structure, function, classification, and evolution of the proteins included in the PIR-International Protein Sequence Database.

 

Challenges in protein annotation

Resources for annotation

Goals in annotation

 

About ATLAS

The Atlas Multidatabase Information Retrieval System (ATLAS), developed by the National Biomedical Research Foundation (NBRF), is a retrieval program specifically designed to provide simultaneous access to many macromolecular sequence, alignment, and auxiliary databases [3]. Fields such as Entry title, Journal, Author, Species, Keyword, Superfamily, Feature, and others are indexed to allow fast text retrieval. A powerful pattern matching routine is available for sequence searching.

ATLAS maintains a current list of the results of the last search command. This list can be refined by successive search commands. Intermediate results can be inspected easily. This provides a highly intuitive user interface for developing complex queries.
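The current-list model described above can be illustrated with a small Python sketch. This is not ATLAS code: the Entry fields, the CurrentList class, and its method names are hypothetical, chosen only to mirror the idea of a result list that successive commands refine, subtract from, or restore.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """A toy database entry; the field names are illustrative only."""
    code: str
    superfamily: str
    features: frozenset = field(default_factory=frozenset)


class CurrentList:
    """Holds the result of the last search and the one before it."""

    def __init__(self, database):
        self.database = list(database)
        self.current = []
        self.previous = []

    def _set(self, entries):
        # Build the new list first, then remember the old one for restore().
        self.previous, self.current = self.current, list(entries)
        return self

    def search(self, predicate):
        """Start a fresh current list from the whole database."""
        return self._set(e for e in self.database if predicate(e))

    def refine(self, predicate):
        """Keep only entries of the current list that match (like /CUR)."""
        return self._set(e for e in self.current if predicate(e))

    def subtract(self, predicate):
        """Drop entries of the current list that match (like /SUB)."""
        return self._set(e for e in self.current if not predicate(e))

    def restore(self):
        """Swap back to the previous current list (like LIST / RESTORE)."""
        self.current, self.previous = self.previous, self.current
        return self


def demo():
    """Refine a toy database step by step and return the visible codes."""
    db = [
        Entry("A1", "kinase", frozenset({"P-loop"})),
        Entry("A2", "helicase", frozenset({"P-loop"})),
        Entry("A3", "globin"),
    ]
    session = CurrentList(db)
    session.search(lambda e: "P-loop" in e.features)
    session.subtract(lambda e: "helicase" in e.superfamily)
    refined = [e.code for e in session.current]
    session.restore()
    restored = [e.code for e in session.current]
    return refined, restored
```

Because each step replaces the current list and keeps the previous one, intermediate results can be inspected or undone, which is the property that makes incremental query building comfortable.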

 

Table I. Some ATLAS commands

Command                Description
GET filename           Retrieve a list of entries from a file as the new current list.
LIST / OUT=filename    Save the current list of entries to a file.
LIST / RESTORE         Replace the current list with the previous current list.
SCAN peptide           Find all entries with the exact subsequence peptide.
FEAT / CUR P-loop      Refine the CURrent list: keep only entries with "P-loop" in the FEAture table. Similar commands query other indexed fields.
SUP / SUB helicase     SUBtract all entries with "helicase" in the SUPerfamily field from the current list.
FILE filename          Pass control of ATLAS to the script in filename.
!                      A comment line (ignored by ATLAS).
!!                     A message line (sent to the current output file).

The MATCH command

Peptide specification

Symbol       Meaning
W            Trp
[WFY]        Trp, Phe, or Tyr
{P}          anything but Pro
X            any residue
(W)          Trp exactly, or no match (no matter what /MIS=n)
([WFY])      Trp, Phe, or Tyr exactly, or no match
^ (caret)    display the match position following the caret (^)

Selected modifiers

Modifier               Effect
/DEFINE                Keep only entries that match
/DEFINE / SUBTRACT     Remove all entries that match
/MISMATCH=n            Allow up to n mismatches
/VIEW                  Mark positions with mismatches
/PEPTIDE=expression    Specify peptide on command line (otherwise prompted)
/PRINT=filename        Send display results to filename

 

Example
! For the existing current list, find all sequences that resemble the typical P-loop,
! allowing up to one mismatch but requiring the Lys, and report the location of that Lys
MATCH / CUR / MIS=1 / PRINT = atp_feat.transact / PEPT=GXXGXG^(K)[TS]
!! Features found: /Binding site: ATP (Lys) #status predicted
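The matching rules described for MATCH (residue sets, exclusions, wildcards, parenthesized positions that must match exactly, a mismatch allowance, and a caret marking the reported position) can be sketched in Python as follows. This is an illustrative reimplementation under our reading of the table above, not the ATLAS algorithm; the function names parse_peptide and match_peptide and the toy sequence in the usage note are assumptions.

```python
def parse_peptide(spec):
    """Parse a peptide specification into (tokens, caret_index).

    Each token is (residues, negate, required). A residue satisfies a token
    when (residue in residues) != negate. caret_index is the index of the
    token that follows the caret, or None if there is no caret.
    """
    tokens, caret, required, i = [], None, False, 0
    spec = spec.replace(" ", "")
    while i < len(spec):
        ch = spec[i]
        if ch == "^":
            caret = len(tokens)
            i += 1
        elif ch == "(":
            required = True
            i += 1
        elif ch == ")":
            required = False
            i += 1
        elif ch in "[{":
            close = spec.index("]" if ch == "[" else "}", i)
            tokens.append((set(spec[i + 1:close]), ch == "{", required))
            i = close + 1
        elif ch == "X":
            # Wildcard: "not in the empty set" is true for every residue.
            tokens.append((set(), True, required))
            i += 1
        else:
            tokens.append(({ch}, False, required))
            i += 1
    return tokens, caret


def match_peptide(sequence, spec, max_mismatches=0):
    """Return 1-based hit positions in sequence.

    The position reported is that of the residue following the caret, or the
    start of the match when the specification has no caret.
    """
    tokens, caret = parse_peptide(spec)
    hits = []
    for start in range(len(sequence) - len(tokens) + 1):
        mismatches = 0
        for offset, (residues, negate, required) in enumerate(tokens):
            if (sequence[start + offset] in residues) != negate:
                continue
            if required:
                break  # a parenthesized position may never mismatch
            mismatches += 1
            if mismatches > max_mismatches:
                break
        else:
            hits.append(start + (caret or 0) + 1)
    return hits
```

For example, match_peptide("MAGSGGVGKSTLA", "GXXGXG^(K)[TS]", max_mismatches=1) reports position 9, the Lys of this made-up sequence. Changing that Lys to Arg gives no hit even with one mismatch allowed, because the parenthesized position is required.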

 

From ATLAS to AQL

 

Working with AQL scripts

 

Figure 1. Sample Integrity Check

! An integrity check for organelle-specific keywords
! in prokaryotic sequence entries:
! This list will be empty unless some error has occurred.
!
! Choose annotated database sections of PIR (Protein Information Resource):
BASES PIR1, PIR2
!
! Search by keyword for organelle-specific annotations:
KEY / BRIEF lysosome
KEY / BRIEF / ADD plast
KEY / BRIEF / ADD Golgi
KEY / BRIEF / ADD nucleus
! etc.

...

! Should now have all entries that have organelle-specific keywords.
! If any entry is prokaryotic, there is a problem: bad keyword or bad species
!
! From current list, select prokaryotic examples:
TAXONOMY / CURRENT prok
!
! Create a list of problem entries (only if the list is non-empty)
LIST / IF / OUT=problem_kw_organelle_tax.cod

An example of an integrity check script in AQL. Integrity checks can find logical inconsistencies in annotations. Examples include organelle-specific keywords for bacterial proteins, complex carbohydrate binding site predictions for cytosolic or nuclear proteins, etc. Exceptions lists can be built into integrity check scripts as necessary to avoid rechecking unusual but validated annotations.
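The logic of this integrity check, including an exceptions list, can be expressed outside AQL as a short Python sketch. The entry layout (dictionaries with code, keywords, and taxonomy fields), the entry codes, and the assumption that keyword and taxonomy terms match as substrings or prefixes are all hypothetical simplifications, not a description of how ATLAS indexes these fields.

```python
# Search terms taken from the script above, lowercased for comparison.
ORGANELLE_TERMS = ("lysosome", "plast", "golgi", "nucleus")


def organelle_keyword_conflicts(entries, exceptions=()):
    """Return codes of prokaryotic entries that carry organelle keywords.

    Entries whose code appears in exceptions are skipped, so annotations
    that are unusual but already validated are not reported again.
    """
    problems = []
    for entry in entries:
        if entry["code"] in exceptions:
            continue
        keywords = [k.lower() for k in entry["keywords"]]
        has_organelle = any(
            term in keyword for keyword in keywords for term in ORGANELLE_TERMS
        )
        is_prokaryote = entry["taxonomy"].lower().startswith("prok")
        if has_organelle and is_prokaryote:
            problems.append(entry["code"])
    return problems


# Toy data with made-up entry codes.
SAMPLE_ENTRIES = [
    {"code": "E1", "keywords": ["chloroplast"], "taxonomy": "prokaryotae"},
    {"code": "E2", "keywords": ["nucleus"], "taxonomy": "eukaryota"},
    {"code": "E3", "keywords": ["kinase"], "taxonomy": "prokaryotae"},
]
```

An empty result means the check passed; any code returned points to either a bad keyword or a bad species assignment.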

 

Figure 2. Sample Annotation Driver

! An annotation driver for proteins classified as having
! phosphoglycerate mutase homology.
!
! Based on a write-up submitted by PIR editorial
! board member Alex J. Lange
!
! Find protein by superfamily classification
SUP / BRIEF / EXACT "phosphoglycerate mutase homology"
!
! remove those for which the feature is already present
FEATURE / BRIEF / SUBTRACT
"Active site: His (phosphohistidine intermediate)"
!
! and remove those in which the site is defective
! for example, PFK26 (PIR:S48465)
FEATURE / BRIEF / SUBTRACT
"Region: defective catalytic site"
!
! and now look for the active site / catalytic zinc feature
! One mismatch allowed, but the His must match exactly
MATCH / CUR / DEFINE / SUBTR / MIS=1 / PRINT=filename
[LVI] X [LVI] [LVI] R ^(H) G [EQ]
!! matched His residues receive active site feature:
!! Active site: His (phosphohistidine intermediate)
!
FIND / BR / SUB
fragment
!
! Entries remaining should be investigated, if there are any
LIST / IF / OUT = why_no_site.cod

An example annotation driver script in AQL. AQL (ATLAS) commands and modifiers are given in bold face capital letters. Lines beginning with a single exclamation point are comments ignored by ATLAS. Lines beginning with a double exclamation point are messages passed by ATLAS to the last output file. This script reports newly found active sites to one file and the problem list of non-fragments with no discernable active sites to another file.

 

Figure 3. Sample Alert Trigger

BASES pir1, pir2
!
! Look for the first archaeal globin
! None is expected, but if one is found
! then alert our expert.
!
SUPERFAMILY / BRIEF
globin homology
TAXONOMY / CURRENT arc
!
!
! Current list has archaeal globins (surprise!)
! If there is one, alert somebody
LIST / IF / PRINT=archaeal_globin.alert
!
! and write contact info to file for parsing
!! ALERT: name@address "Found globin(s) in the Archaea!"

An example of an alert trigger script in AQL. This script will create an output file if and only if the special condition (an archaeal sequence classified as a globin homolog) occurs.
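The "output file if and only if" behavior of the alert trigger can be sketched in Python as below. The entry layout, the function names, the prefix match on the taxonomy field, and the exact layout of the alert file are assumptions made for illustration; only the conditional-output idea comes from the script above.

```python
import tempfile
from pathlib import Path


def run_alert_trigger(entries, out_path, contact, message):
    """Write an alert file only if an archaeal globin homolog is present.

    Returns True when the file was written, False otherwise.
    """
    hits = [
        e["code"]
        for e in entries
        if "globin homology" in e["superfamily"]
        and e["taxonomy"].lower().startswith("arc")
    ]
    if not hits:
        return False  # like LIST / IF: an empty list creates no file
    lines = hits + [f'ALERT: {contact} "{message}"']
    Path(out_path).write_text("\n".join(lines) + "\n")
    return True


def demo():
    """Run the trigger on toy data (made-up codes) in a temporary directory."""
    entries = [
        {"code": "X0001", "superfamily": "globin homology", "taxonomy": "Eukaryota"},
        {"code": "X0002", "superfamily": "globin homology", "taxonomy": "Archaea"},
    ]
    text = "Found globin(s) in the Archaea!"
    with tempfile.TemporaryDirectory() as tmp:
        quiet = Path(tmp) / "quiet.alert"
        loud = Path(tmp) / "loud.alert"
        wrote_quiet = run_alert_trigger(entries[:1], quiet, "name@address", text)
        wrote_loud = run_alert_trigger(entries, loud, "name@address", text)
        return wrote_quiet, quiet.exists(), wrote_loud, loud.read_text()
```

A downstream process then only has to check whether the alert file exists and, if so, parse the contact line.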

 

Figure 4. Annotations Trigger More Annotations

AQL script #3
  if:    has "EC 3.4.24." in title
  then:  add to KEYWORD field (if missing): keyword "hydrolase", keyword "metalloproteinase"

AQL script #2
  if:    matrix metalloproteinase homology, astacin homology, atrolysin C, serralysin, or certain other superfamilies, and has both active site and zinc-binding features
  then:  add to TITLE field (if missing): "EC 3.4.24."

AQL script #1
  if:    matrix metalloproteinase homology
  then:  find and add to FEATURE field (if missing): active site, catalytic zinc-binding site, autoinhibitory region

A hierarchy of AQL scripts is shown for progressive annotation of metalloproteinase-like proteins. Script #1 (shown at the bottom of the hierarchy) covers the fewest entries but adds the most specific information. If it adds active site and zinc-binding features to some database entry, that entry (after update) will also fall under the scope of script #2. Script #2 adds the appropriate enzyme nomenclature to metalloproteinase homologs if and only if they have both a functional active site and a zinc-binding site. Entries modified by script #2 will next fall under the scope of script #3, as will entries that received the enzyme designation in other ways (e.g. author submission). Script #3 ensures the addition of keywords as needed to accompany the "3.4.24." portion of the title. Annotation by these scripts may be viewed as object-oriented in the sense that the queries in AQL scripts define classes of entries, and that annotations are inherited by entries according to their membership in various classes.
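The way one script's output brings an entry into the scope of the next can be sketched in Python by running the three rules repeatedly until nothing changes. The entry layout and function names are hypothetical, and script 1 here simply assumes the sequence search succeeded; in the real driver the features would be added only where a pattern match locates them.

```python
METALLOPROTEINASE_SUPERFAMILIES = {
    "matrix metalloproteinase homology",
    "astacin homology",
    "atrolysin C",
    "serralysin",
}


def add_missing(values, new_values):
    """Append values not already present; report whether anything changed."""
    changed = False
    for value in new_values:
        if value not in values:
            values.append(value)
            changed = True
    return changed


def script_1(entry):
    """Add site features to matrix metalloproteinase homologs."""
    if "matrix metalloproteinase homology" not in entry["superfamily"]:
        return False
    # Assumption: the sequence search found all three features.
    return add_missing(
        entry["features"],
        ["active site", "catalytic zinc-binding site", "autoinhibitory region"],
    )


def script_2(entry):
    """Add the EC designation when both site features are present."""
    in_scope = any(s in METALLOPROTEINASE_SUPERFAMILIES for s in entry["superfamily"])
    has_sites = {"active site", "catalytic zinc-binding site"} <= set(entry["features"])
    if in_scope and has_sites and "EC 3.4.24." not in entry["title"]:
        entry["title"] += " (EC 3.4.24.)"
        return True
    return False


def script_3(entry):
    """Add keywords to any entry whose title carries the EC designation."""
    if "EC 3.4.24." not in entry["title"]:
        return False
    return add_missing(entry["keywords"], ["hydrolase", "metalloproteinase"])


def run_until_stable(entry, scripts):
    """Apply all scripts repeatedly until a full pass changes nothing."""
    passes = 0
    changed = True
    while changed:
        changed = False
        for script in scripts:
            if script(entry):
                changed = True
        passes += 1
    return passes
```

Every rule adds something only when it is missing, so the loop always terminates. Running the scripts in the least favorable order (3, then 2, then 1) on an unannotated matrix metalloproteinase homolog takes three productive passes plus a final pass that confirms nothing is left to add.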

 

Conclusion

We have extended the capabilities of the ATLAS query program to create a scripting language, AQL, in which to develop and run collections of stored queries. A stored query can become a permanent extension of the database that is carefully validated, traceable in its effects, and able to enforce complex rules of biology consistently. Ad hoc queries don't have these attributes. Categories of AQL script include integrity checks, alert triggers, and annotation drivers. The query in each script defines the class of entries supervised by that script; one entry may belong to a number of different classes. AQL scripts thus offer an object-oriented style of annotation.

Most protein annotations should come, ultimately, from critical readings of the literature. We have started collaboration with a panel of outside experts, the PIR Editorial Board, to produce electronic mini-reviews of various groups of proteins. The content of each write-up will be linked to a scope (protein entries selected dynamically by a stored query) over which the review is applicable and its generalizations valid. The process thus results in the creation of annotation drivers, interesting by themselves as concise expressions of biological rules, but also enabling computer annotation, an important goal in bioinformatics.

Future work will include expanding the roster of Editorial Board members, developing WWW-based tools to assist Editorial Board members in creating and testing annotation-driving scripts, and adapting the resulting database of annotation drivers for portability to other protein databases.

 

References

  1. The superfamily classification in the PIR-International Protein Sequence Database, Winona C. Barker, Friedhelm Pfeiffer, and David George, in: Methods in Enzymology, R.F. Doolittle, ed., Academic Press, Orlando, FL, pp. 59-71, 1996.
  2. The PIR-International Protein Sequence Database, Winona C. Barker, John S. Garavelli, Daniel H. Haft, Christopher R. Marzec, Bruce C. Orcutt, Geetha Y. Srinivasarao, Lai-Su L. Yeh, Robert S. Ledley, Hans-Werner Mewes, Friedhelm Pfeiffer, and Akira Tsugita, Nucleic Acids Res. 26, 27-32, 1998.
  3. ATLAS User's Guide, National Biomedical Research Foundation, Washington, DC, 159 pp., December 1995.