Editorial Board-Authored Annotation Drivers for Sequence Databases

Daniel H. Haft and Bruce C. Orcutt
National Biomedical Research Foundation, Washington, DC

Poster Presentation:
American Society for Biochemistry and Molecular Biology
May 16-20, 1998
Washington Convention Center
Washington, DC

Abstract

We introduce a general mechanism for annotating sequence databases effectively by distributing editorial responsibility to scientific experts and collecting their submissions as annotation drivers. Annotation drivers are rules or data that can be applied automatically to force new annotation. We demonstrate the model for the PIR-International Protein Sequence Database. PIR's ATLAS program supports query by sequence, taxon, superfamily classification, features, etc. It defines an implicit, intuitive scripting language we call AQL (ATLAS Query Language) for selecting groups of entries, testing attributes, and reporting needed changes. Each run adds content searchable by other AQL scripts, so annotation builds upon annotation. The newly established PIR Editorial Board is our panel of experts; their contributions become AQL scripts by which annotation is attached to protein entries. We are working to expand the editorial board panel, develop tools to help construct these annotation drivers, and offer alternative search criteria that make the drivers directly portable to other protein databases. This work is supported in part by the NLM grant LM05798.

Introduction

Annotated protein databases are a resource of critical importance to biological research. The immense scale of accumulated scientific information and the rapid pace of discovery of new proteins require new, scalable approaches. We have developed a model for achieving more efficient, more responsive, higher quality annotation. First, we distribute the editorial responsibility of determining which proteins merit which annotations to the community of experts in those protein groups. These scientists are the PIR Editorial Board. Second, we capture the results of each expert review in the form of tools that drive automatic annotation of protein sequences; we call these tools annotation drivers.

The classification of proteins by superfamily, family, and homology domain [1] has been a major focus of the Protein Information Resource (PIR) in the development of the PIR-International Protein Sequence Database [2]. The ATLAS query program can select groups of proteins by this classification and other annotations, both for assignment to Editorial Board members for review and for annotation by automated tools. The commands recognized by the ATLAS program comprise the ATLAS Query Language, or AQL, a scripting language in which we have already written a number of powerful annotation drivers. We thus achieve scalable database annotation in two ways: by enlisting the help of the scientific community, and by working on the development of tools rather than on the entries themselves.

The Protein Information Resource

The Protein Information Resource (PIR) supports biological research by stimulating and facilitating inquiry into protein structure, function, modification, regulation, interaction, and evolution. We pursue this mission through continued development of the Protein Sequence Database of PIR-International and supporting auxiliary databases. By an integrated approach of data collection, computation, collaboration, and curation, we seek to take information that is theoretically available to the scientific community by search or calculation and make it readily accessible, clearly understandable, immediately suggestive of further inquiry, and robustly interconnected to the community of interoperable biological databases.

The Protein Sequence Database is available free over the Internet for downloading from FTP sites or searching at our Web site. The database is also distributed on CD-ROM together with search indexes and versions of the ATLAS query program that run on a number of different computer platforms.

A major project of PIR is the classification of proteins by homology domains, superfamily (full-length homology), and family (closely related proteins) [1]. This classification scheme provides a framework for understanding the relationships among proteins, the structural and functional significance of regions of similarity, and the range over which empirically-based annotations ought to be propagated.

The PIR Editorial Board is a group of scientists who have agreed to assist the PIR in the many specialized biological and biomedical fields involved in describing the structure, function, classification, and evolution of the proteins included in the PIR-International Protein Sequence Database.

Challenges in protein annotation

How can annotation be made more efficient?
What is the source of each annotation?
Which annotations can be trusted?
Which hypotheses in the primary literature have won general acceptance?
How can incorrect annotations be detected and removed?
What are the rules used to create and spread annotations?
How should information travel from the literature to database entries?
How can annotations be standardized?

Resources for annotation

Existing annotated databases, e.g. the Protein Sequence Database of PIR.

Query programs for selecting sequences from a database, e.g. ATLAS.

Sequence analysis programs and similarity search programs.
A community of interested scientists with expertise in specific groups of proteins.
The Internet.

Goals in annotation

Involve the broader scientific community in the annotation of protein databases.
Create mechanisms for abstracting biological principles into rules that govern the appropriateness of each annotation.
Create, test, validate, use, and distribute sets of tools (annotation drivers) that apply these rules to assign annotations automatically.
Keep annotations current with the literature.

About ATLAS

The Atlas Multidatabase Information Retrieval System (ATLAS), developed by the National Biomedical Research Foundation (NBRF), is a retrieval program specifically designed to provide simultaneous access to many macromolecular sequence, alignment, and auxiliary databases [3]. Fields such as Entry title, Journal, Author, Species, Keyword, Superfamily, Feature, and others are indexed to allow fast text retrieval. A powerful pattern matching routine is available for sequence searching.

ATLAS maintains a current list of the results of the last search command. This list can be refined by successive search commands. Intermediate results can be inspected easily. This provides a highly intuitive user interface for developing complex queries.

Table I. Some ATLAS commands

GET filename	Retrieve list of entries from a file as the new current list.
LIST / OUT=filename	Save current list of entries to a file.
LIST / RESTORE	Replace the current list with the previous current list.
SCAN peptide	Find all entries with the exact subsequence peptide.
FEAT / CUR P-loop	Refine the CURrent list. Keep only entries with "P-loop" in the FEAture table. Similar commands query other indexed fields.
SUP / SUB helicase	SUBtract all entries with "helicase" in the SUPerfamily field from the current list.
FILE filename	Pass control of ATLAS to the script in filename
!	A comment line (ignored by ATLAS)
!!	A message line (sent to the current output file)
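
The current-list behavior behind these commands can be pictured with a short Python sketch. It is only a toy model under stated assumptions: the Session and Entry classes, the entry codes, and the mode names are hypothetical and are not part of ATLAS. The "new", "cur", and "sub" modes loosely mirror a plain search, the /CUR modifier, and the /SUB modifier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """Toy database entry; the codes and fields used here are made up."""
    code: str
    superfamily: str = ""
    features: tuple = ()


@dataclass
class Session:
    """Hypothetical stand-in for an ATLAS session, not the real program."""
    database: list
    current: list = field(default_factory=list)
    previous: list = field(default_factory=list)

    def _replace(self, entries):
        # Remember the old list so that restore() can bring it back.
        self.previous, self.current = self.current, list(entries)

    def search(self, predicate, mode="new"):
        if mode == "new":      # plain command: search the whole database
            self._replace(e for e in self.database if predicate(e))
        elif mode == "cur":    # like /CUR: keep only current entries that match
            self._replace(e for e in self.current if predicate(e))
        elif mode == "sub":    # like /SUB: drop current entries that match
            self._replace(e for e in self.current if not predicate(e))
        else:
            raise ValueError(f"unknown mode: {mode}")
        return [e.code for e in self.current]

    def restore(self):         # like LIST / RESTORE
        self.current = list(self.previous)
        return [e.code for e in self.current]


db = [
    Entry("A001", "kinase", ("P-loop",)),
    Entry("A002", "helicase", ("P-loop",)),
    Entry("A003", "globin"),
]
session = Session(db)
session.search(lambda e: "P-loop" in e.features)                   # A001, A002
session.search(lambda e: "helicase" in e.superfamily, mode="sub")  # A001
session.restore()                                                  # A001, A002
```

Each step replaces the current list and keeps one previous list, which is all that is needed to inspect an intermediate result and step back from a refinement that removed too much.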

The MATCH command

Peptide specification
W	Trp
[WFY]	Trp, Phe, or Tyr
{P}	Anything but Pro
X	Any residue
(W)	Trp exactly, or no match (no matter what /MIS=n)
([WFY])	Trp, Phe, or Tyr exactly, or no match
^ (caret)	Display the match position following the caret (^)
Selected modifiers
/DEFINE	Keep only entries that match
/DEFINE / SUBTRACT	Remove all entries that match
/MISMATCH=n	Allow up to n mismatches
/VIEW	Mark positions with mismatches
/PEPTIDE=expression	Specify peptide on command line (otherwise prompted)
/PRINT=filename	Send display results to filename

Example
! For the existing current list, find all sequences that resemble the typical P-loop,
! allowing up to one mismatch but requiring the Lys, and report the location of that Lys
MATCH / CUR / MIS=1 / PRINT = atp_feat.transact / PEPT=GXXGXG^(K)[TS]
!! Features found: /Binding site: ATP (Lys) #status predicted
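
The matching rules in the peptide specification can be illustrated with a small Python sketch. This is not the ATLAS implementation: the function names are hypothetical, the reported position is assumed to be 1-based, and the test sequences are invented. It shows the three ideas in the table: per-position residue sets, a mismatch budget, and parenthesized positions that must match regardless of that budget.

```python
def parse_peptide(spec):
    """Parse a peptide specification into positions and a caret index.

    Each position is (residues, negate, required); residues=None means any.
    """
    spec = spec.replace(" ", "")
    positions, caret, required, i = [], None, False, 0
    while i < len(spec):
        ch = spec[i]
        if ch == "^":                 # report the position that follows
            caret = len(positions)
            i += 1
        elif ch in "()":              # positions inside ( ) must match exactly
            required = ch == "("
            i += 1
        elif ch in "[{":              # [WFY] = one of, {P} = anything but
            close = spec.index("]" if ch == "[" else "}", i)
            positions.append((set(spec[i + 1:close]), ch == "{", required))
            i = close + 1
        else:                         # single residue, X = any residue
            positions.append((None if ch == "X" else {ch}, False, required))
            i += 1
    return positions, caret


def position_matches(position, residue):
    residues, negate, _ = position
    return residues is None or ((residue in residues) != negate)


def find_matches(sequence, spec, max_mismatches=0):
    """Return the 1-based positions flagged by the caret for every match."""
    positions, caret = parse_peptide(spec)
    hits = []
    for start in range(len(sequence) - len(positions) + 1):
        mismatches = 0
        for offset, position in enumerate(positions):
            if position_matches(position, sequence[start + offset]):
                continue
            mismatches += 1
            if position[2] or mismatches > max_mismatches:
                break                 # required residue missed, or budget spent
        else:
            hits.append(start + (caret or 0) + 1)
    return hits


P_LOOP = "GXXGXG^(K)[TS]"
exact = find_matches("MAGSPGAGKTAA", P_LOOP)                       # [9]
one_off = find_matches("MAGSPGAGKAAA", P_LOOP, max_mismatches=1)   # [9]
no_lys = find_matches("MAGSPGAGRTAA", P_LOOP, max_mismatches=1)    # []
```

In the third call the Lys is replaced by Arg, so the window is rejected even though one mismatch is allowed, which is the behavior the example script relies on.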

From ATLAS to AQL

The ATLAS query program interprets and executes a variety of powerful search and display commands for indexed text and sequence databases.
Because ATLAS is a command interpreter, the range of allowable queries defines the scripting language ATLAS Query Language, or AQL.
We have added several new capabilities to ATLAS to increase the power of AQL, including the FILE command to run scripts from files instead of the command line, and comment line capability for both scripts and output files.
ATLAS queries can first create and then modify current lists. AQL scripts can manipulate current lists to do a variety of operations on a query-defined set of related proteins.
The principle of working with a current list and the syntax of AQL are simple and intuitive enough for AQL scripts to be developed by biologists, not just database specialists.
AQL scripts can be validated, stored, and then run automatically at regular intervals. Their automated reuse is a scalable method for tackling annotation in the age of genomics.
Stored AQL scripts are extensions of the database system because they express and enforce complex rules of biology.
AQL scripts offer an object-oriented style of database extension. Every entry found by a particular AQL script belongs to a class defined by that script and is subject to the methods of that class for annotation, integrity checking, etc. A single entry may belong to many classes at once and inherit the characteristic properties of each class.
Each database entry can be linked to all the AQL scripts able to drive its annotation. Therefore, each annotation within an entry can be assessed as to whether or not it is derivable from an existing AQL script. AQL scripts can therefore give authority to existing annotations in large, heterogeneously annotated databases.

Working with AQL scripts

Integrity checks are AQL scripts that report problems in annotation, such as the keyword "chloroplast" appearing in a bacterial protein entry.

Annotation drivers are AQL scripts that find and report new annotations, such as active site features located by matching a regular expression.

Alert triggers are AQL scripts set to look for predetermined lists of unexpected events and trigger automatic notification of the interested parties.
Annotation through AQL scripts occurs iteratively. One script adds annotations that result in detection of the entry by another script. Good AQL programming style will rely on this property to simplify the task of each script.
Annotation responsibility ultimately belongs to the greater scientific community. The Protein Information Resource (PIR) is working to recruit a panel of experts, the PIR Editorial Board, to review groups of proteins and formalize the annotation rules that belong in AQL scripts.
Currently, we perform the conversion of annotation rules to AQL annotation drivers on behalf of the Editorial Board. We aim to develop Web-based tools to assist Editorial Board members or other end users in creating and testing AQL scripts themselves.
AQL scripts can start with the GET command to create a current list from a set of codes listed in a file. This code list may come from the results of any database search program, and not necessarily ATLAS itself. For example, a query may start by loading codes found by a Hidden Markov Model search program.
AQL scripts could be used, in principle, to annotate any protein database. The portability depends on the query interface available to that database. We plan to make our AQL scripts available for export to other database projects.

Figure 1. Sample Integrity Check

! An integrity check for organelle-specific keywords
! in prokaryotic sequence entries:
! This list will be empty unless some error has occurred.
!
! Choose annotated database sections of PIR (Protein Information Resource):
BASES PIR1, PIR2
!
! Search by keyword for organelle-specific annotations:
KEY / BRIEF lysosome
KEY / BRIEF / ADD plast
KEY / BRIEF / ADD Golgi
KEY / BRIEF / ADD nucleus
! etc.

...

! Should now have all entries that have organelle-specific keywords.
! If any entry is prokaryotic, there is a problem: bad keyword or bad species
!
! From current list, select prokaryotic examples:
TAXONOMY / CURRENT prok
!
! Create a list of problem entries (only if the list is non-empty)
LIST / IF / OUT=problem_kw_organelle_tax.cod

An example of an integrity check script in AQL. Integrity checks can find logical inconsistencies in annotations. Examples include organelle-specific keywords for bacterial proteins, complex carbohydrate binding site predictions for cytosolic or nuclear proteins, etc. Exceptions lists can be built into integrity check scripts as necessary to avoid rechecking unusual but validated annotations.
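
For readers without access to ATLAS, the logic of this integrity check can be sketched in Python. The sketch makes several assumptions: entries are plain dictionaries with invented codes, keyword and taxonomy searches are treated as case-insensitive substring matches, and the exceptions list is a simple set of codes. None of these details are taken from ATLAS itself.

```python
# Search terms taken from the script in Figure 1.
ORGANELLE_TERMS = ("lysosome", "plast", "golgi", "nucleus")


def organelle_keyword_conflicts(entries, exceptions=frozenset()):
    """Return codes of prokaryotic entries carrying organelle-specific keywords."""
    problems = []
    for entry in entries:
        if entry["code"] in exceptions:
            continue  # unusual but already validated annotation
        keywords = [kw.lower() for kw in entry["keywords"]]
        has_organelle_kw = any(
            term in kw for kw in keywords for term in ORGANELLE_TERMS
        )
        is_prokaryote = "prok" in entry["taxonomy"].lower()
        if has_organelle_kw and is_prokaryote:
            problems.append(entry["code"])
    return problems


# Invented sample entries; codes and annotations are for illustration only.
entries = [
    {"code": "T0001", "keywords": ["chloroplast"], "taxonomy": "Eukaryota; Viridiplantae"},
    {"code": "T0002", "keywords": ["chloroplast"], "taxonomy": "Prokaryota; Bacteria"},
    {"code": "T0003", "keywords": ["hydrolase"], "taxonomy": "Prokaryota; Bacteria"},
    {"code": "T0004", "keywords": ["Golgi apparatus"], "taxonomy": "Prokaryota; Archaea"},
]
conflicts = organelle_keyword_conflicts(entries)                 # T0002, T0004
reviewed = organelle_keyword_conflicts(entries, {"T0004"})       # T0002
```

As in the AQL script, an empty result means no inconsistency was found; any code returned points to either a bad keyword or a bad species assignment.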

Figure 2. Sample Annotation Driver

! An annotation driver for proteins classified as having
! phosphoglycerate mutase homology.
!
! Based on a write-up submitted by PIR editorial
! board member Alex J. Lange
!
! Find protein by superfamily classification

SUP / BRIEF / EXACT "phosphoglycerate mutase homology"
!
! remove those for which the feature is already present

FEATURE / BRIEF / SUBTRACT
"Active site: His (phosphohistidine intermediate)"
!
! and remove those in which the site is defective
! for example, PFK26 (PIR:S48465)
FEATURE / BRIEF / SUBTRACT
"Region: defective catalytic site"
!
! and now look for the active site His
! One mismatch allowed, but the His must match exactly
MATCH / CUR / DEFINE / SUBTR / MIS=1 / PRINT=filename
[LVI] X [LVI] [LVI] R ^(H) G [EQ]

!! matched His residues receive active site feature:
!! Active site: His (phosphohistidine intermediate)
!
FIND / BR / SUB fragment
!
! Entries remaining should be investigated, if there are any
LIST / IF / OUT = why_no_site.cod

An example annotation driver script in AQL. AQL (ATLAS) commands and modifiers are given in capital letters. Lines beginning with a single exclamation point are comments ignored by ATLAS. Lines beginning with a double exclamation point are messages passed by ATLAS to the last output file. This script reports newly found active sites to one file and the problem list of non-fragments with no discernible active sites to another file.
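
The overall flow of this driver (select by superfamily, subtract entries already handled, match the motif, and list what remains) can be sketched in Python. This is a simplified illustration, not a translation of ATLAS behavior: entries are invented dictionaries, the motif is matched exactly with a regular expression rather than with one allowed mismatch, only the first match per sequence is reported, and the FIND command is assumed to search the entry title.

```python
import re

ACTIVE_SITE = "Active site: His (phosphohistidine intermediate)"
DEFECTIVE = "Region: defective catalytic site"
# Exact-match version of [LVI] X [LVI] [LVI] R ^(H) G [EQ]; group 1 is the His.
MOTIF = re.compile(r"[LVI].[LVI][LVI]R(H)G[EQ]")


def drive_annotation(entries):
    """Return (new active-site positions by code, codes needing investigation)."""
    new_sites, unexplained = {}, []
    for entry in entries:
        if entry["superfamily"] != "phosphoglycerate mutase homology":
            continue
        features = entry["features"]
        if any(ACTIVE_SITE in f or DEFECTIVE in f for f in features):
            continue  # already annotated, or known to lack a working site
        match = MOTIF.search(entry["sequence"])
        if match:
            new_sites[entry["code"]] = match.start(1) + 1  # 1-based His position
        elif "fragment" not in entry["title"].lower():
            unexplained.append(entry["code"])
    return new_sites, unexplained


# Invented entries for illustration only.
SUPERFAMILY = "phosphoglycerate mutase homology"
entries = [
    {"code": "M0001", "title": "phosphoglycerate mutase", "superfamily": SUPERFAMILY,
     "features": [], "sequence": "MTKLVLIRHGESAW"},
    {"code": "M0002", "title": "phosphoglycerate mutase", "superfamily": SUPERFAMILY,
     "features": ["9 " + ACTIVE_SITE], "sequence": "MTKLVLIRHGESAW"},
    {"code": "M0003", "title": "mutase homolog", "superfamily": SUPERFAMILY,
     "features": [], "sequence": "MTKAAAAAAAAA"},
    {"code": "M0004", "title": "mutase (fragment)", "superfamily": SUPERFAMILY,
     "features": [], "sequence": "MTKAAA"},
]
new_sites, unexplained = drive_annotation(entries)
```

Here new_sites corresponds to the file of newly found active sites, and unexplained corresponds to the list written to why_no_site.cod.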

Figure 3. Sample Alert Trigger

BASES pir1, pir2
!
! Look for the first archaeal globin
! None is expected, but if one is found
! then alert our expert.
!
SUPERFAMILY / BRIEF globin homology
TAXONOMY / CURRENT arc
!
! Current list has archaeal globins (surprise!)
! If there is one, alert somebody
LIST / IF / PRINT=archaeal_globin.alert
!
! and write contact info to file for parsing
!! ALERT: name@address "Found globin(s) in the Archaea!"

An example of an alert trigger script in AQL. This script will create an output file if and only if the special condition (an archaeal sequence classified as a globin homolog) occurs.
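
The "output file if and only if" behavior of LIST / IF can be mimicked outside ATLAS with a few lines of Python. The function name and file layout below are assumptions made for illustration; only the alert line format follows the message line in Figure 3.

```python
import tempfile
from pathlib import Path


def write_alert_if_any(codes, path, contact, message):
    """Write the entry codes and an ALERT line to path only if codes is non-empty."""
    if not codes:
        return False  # nothing unexpected found, so no file is created
    lines = list(codes) + [f'ALERT: {contact} "{message}"']
    Path(path).write_text("\n".join(lines) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    quiet = Path(tmp) / "quiet.alert"
    fired = Path(tmp) / "archaeal_globin.alert"
    # "X0001" is a made-up entry code; name@address is the placeholder from Figure 3.
    quiet_written = write_alert_if_any([], quiet, "name@address", "no hits")
    fired_written = write_alert_if_any(
        ["X0001"], fired, "name@address", "Found globin(s) in the Archaea!"
    )
    quiet_exists = quiet.exists()
    fired_text = fired.read_text()
```

A scheduled job can then treat the mere existence of the alert file as the signal to parse the contact line and send the notification.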

Figure 4. Annotations Trigger More Annotations

AQL script # 3

if	has "EC 3.4.24." in title
then	add to KEYWORD field (if missing): keyword "hydrolase", keyword "metalloproteinase"

AQL script # 2

if	matrix metalloproteinase homology, astacin homology, atrolysin C, serralysin, or certain other superfamilies and has both active site and zinc-binding features
then	add to TITLE field (if missing) "EC 3.4.24."

AQL script # 1

if	matrix metalloproteinase homology
then	find and add to FEATURE field (if missing): active site, catalytic zinc-binding site, autoinhibitory region

A hierarchy of AQL scripts is shown for progressive annotation of metalloproteinase-like proteins. Script #1 (shown at the bottom of the hierarchy) covers the fewest entries but adds the most specific information. If it adds active site and zinc-binding features to some database entry, that entry (after update) will also fall under the scope of script #2. Script #2 adds the appropriate enzyme nomenclature to metalloproteinase homologs if and only if they have a functional active site and zinc-binding site. Entries modified by script #2 will next fall under the scope of script #3, as will entries that received the enzyme designation in other ways (e.g. author submission). Script #3 ensures the addition of keywords as needed to accompany the "3.4.24." portion of the title. Annotation by these scripts may be viewed as object-oriented in the sense that the queries in AQL scripts define classes of entries, and that annotations are inherited by entries according to their membership in various classes.
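
The way one script's output brings an entry into the scope of the next can be sketched as a fixed-point loop in Python. This is a deliberately simplified model: the entry is an invented dictionary, script #1 adds its features unconditionally instead of locating them by sequence matching, and only two of the superfamilies named in script #2 are listed. The scripts are run in the "wrong" order on purpose to show that repeated passes still converge.

```python
ZINC_SUPERFAMILIES = {"matrix metalloproteinase homology", "astacin homology"}
SITE_FEATURES = {"active site", "zinc-binding site"}


def add_missing(target, items):
    """Add items to the set target; report whether anything was new."""
    missing = set(items) - target
    target.update(missing)
    return bool(missing)


def script1(entry):
    # Stub: the real script would find these sites by pattern matching.
    if entry["superfamily"] == "matrix metalloproteinase homology":
        return add_missing(entry["features"], SITE_FEATURES)
    return False


def script2(entry):
    if (entry["superfamily"] in ZINC_SUPERFAMILIES
            and SITE_FEATURES <= entry["features"]
            and "EC 3.4.24." not in entry["title"]):
        entry["title"] += " (EC 3.4.24.)"
        return True
    return False


def script3(entry):
    if "EC 3.4.24." in entry["title"]:
        return add_missing(entry["keywords"], {"hydrolase", "metalloproteinase"})
    return False


def annotate(entry, scripts):
    """Rerun all scripts until a full pass changes nothing; return the pass count."""
    passes, changed = 0, True
    while changed:
        changed = False
        for script in scripts:
            if script(entry):
                changed = True
        passes += 1
    return passes


entry = {
    "code": "X0001",  # made-up entry for illustration
    "superfamily": "matrix metalloproteinase homology",
    "title": "hypothetical proteinase",
    "features": set(),
    "keywords": set(),
}
passes = annotate(entry, [script3, script2, script1])
```

With this ordering each pass enables exactly one further script, so three passes add annotation and a fourth confirms that nothing more changes. Because every script adds only what is missing, rerunning the whole collection at regular intervals is safe.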

Conclusion

We have extended the capabilities of the ATLAS query program to create a scripting language, AQL, in which to develop and run collections of stored queries. A stored query can become a permanent extension of the database that is carefully validated, traceable in its effects, and able to enforce complex rules of biology consistently. Ad hoc queries don't have these attributes. Categories of AQL script include integrity checks, alert triggers, and annotation drivers. The query in each script defines the class of entries supervised by that script; one entry may belong to a number of different classes. AQL scripts thus offer an object-oriented style of annotation.

Most protein annotations should come, ultimately, from critical readings of the literature. We have started collaboration with a panel of outside experts, the PIR Editorial Board, to produce electronic mini-reviews of various groups of proteins. The content of each write-up will be linked to a scope (protein entries selected dynamically by a stored query) over which the review is applicable and its generalizations valid. The process thus results in the creation of annotation drivers, interesting by themselves as concise expressions of biological rules, but also enabling computer annotation, an important goal in bioinformatics.

Future work will include expanding the roster of Editorial Board members, developing WWW-based tools to assist Editorial Board members in creating and testing annotation-driving scripts, and adapting the resulting database of annotation drivers for portability to other protein databases.

References

[1] The superfamily classification in the PIR-International Protein Sequence Database, Winona C. Barker, Friedhelm Pfeiffer, and David George, in: Methods in Enzymology, R.F. Doolittle, ed., Academic Press, Orlando, FL, pp. 59-71, 1996.
[2] The PIR-International Protein Sequence Database, Winona C. Barker, John S. Garavelli, Daniel H. Haft, Christopher R. Marzec, Bruce C. Orcutt, Geetha Y. Srinivasarao, Lai-Su L. Yeh, Robert S. Ledley, Hans-Werner Mewes, Friedhelm Pfeiffer, and Akira Tsugita, Nucleic Acids Res. 26, 27-32, 1998.
[3] ATLAS User's Guide, National Biomedical Research Foundation, Washington, DC, 159 pp., December 1995.