Prediction Tools for Protein Homology Domain-Associated Post-Translational Modifications in the RESID Database

J.S. Garavelli, D.J. Miller, and G.Y. Srinivasarao

Protein Information Resource
National Biomedical Research Foundation

Georgetown University Medical Center, Washington, DC 20007

Poster presented at the Protein Society Meeting, July 23 - 27, 1999, Boston, MA

ABSTRACT

Many post-translational modifications arise through local sequence motifs recognized by relatively non-specific enzymes. However, some post-translational modifications are characteristically or uniquely associated with defined protein homology domains recognized by highly specific enzymes. The RESID Database is a unique database of protein structure modifications, covalent binding sites and cross-links publicly distributed by the Protein Information Resource (PIR) and made available on the Web site at /pirwww/dbinfo/resid.html. Using this database, we identify a class of post-translational modifications that occur characteristically or uniquely in defined protein homology domains from the PIR-ALN Database of protein sequence alignments. Because of recognized deficiencies and limitations in consensus sequence methods for recognizing homology domains, especially in highly divergent proteins, we develop protein sequence profiles based on these alignments. These profiles are used to identify this class of modification-associated homology domains in newly determined genomic sequences and to produce appropriate feature annotations. We also investigate information-theory-based procedures for quantitatively assessing the divergence between orthologous domains and deciding when it is appropriate to consider them as either the same or different homology domains. Specific cases considered are the cytochrome c and cytochrome c6 homology domains and the lipoyl/biotin-binding homology domain.

The RESID database is supported by NSF grant DBI-9808414.

INTRODUCTION

A homology domain is a region within a single protein chain that appears to have a common evolutionary origin with similar regions in structurally or functionally related proteins. A set of proteins sharing a common homology domain constitutes a domain superfamily [1]. Many proteins require post-translational modification by specific enzymes to become functional. The sites of these post-translational modifications are under dual selective constraint: they first have to assume the conformation necessary to be modified by the activating enzyme, and then they have to assume the dynamic conformations required for their function. It is reasonable to assume that there are functionally significant modifications associated with particular homology domains that have evolved with such constraints. The Protein Information Resource (PIR) is developing tools to investigate this hypothesis.

The RESID database of protein structure post-translational modifications was first produced by PIR in 1995 [2]. It was originally designed to support quality assurance of feature annotations in the PIR Protein Sequence Database (PIR-PSD) [3] and to assist users and annotators in interpreting those features. Funding for its enhancement and public distribution was obtained from the NSF in 1998. The RESID Database, accessible on the web at /pirwww/dbinfo/resid.html, provides detailed chemical and structural information for more than 240 post-translational modifications as of the June 1999 release 18.00. Table 1 lists the entries in the latest release of the RESID Database.

The PIR-ALN is a curated database of protein sequence alignments derived from sequences and annotation in the PIR-PSD [4]. Alignments include family alignments of sequences that are less than 55% different from each other, superfamily alignments of sequences from different families, and alignments of homology domains. Conserved residues and consensus sequences calculated from the alignment are displayed along with annotation information derived from the PIR-PSD, although not yet feature annotations. It has been observed that the set of sequences chosen to construct an alignment can drastically affect the consensus sequences that result. It is generally recognized that the methods of consensus sequences and regular expression patterns are flawed because they cannot convey the relative significance of sequence variations. The less variety an alignment contributes to a consensus sequence, the more false negatives the pattern will produce; the more variety it contributes, the more false positives the pattern will produce. Other methods such as profiles [5,6], logos based on information theory [7], and hidden Markov models (HMMs) [8] offer greater flexibility and sensitivity in detecting homology domains.
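To make the contrast with consensus sequences concrete, the following is a minimal sketch of the per-column information content that underlies a sequence logo [7]. It assumes a 20-letter amino acid alphabet, ignores gap characters, and applies the usual approximate small-sample correction; the function names and the toy alignment are hypothetical and are not taken from the PIR tools.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(column, alphabet=AMINO_ACIDS):
    """Information content (bits) of one alignment column; gaps are ignored."""
    residues = [r for r in column if r in alphabet]
    n = len(residues)
    if n == 0:
        return 0.0
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Approximate small-sample correction e(n) = (s - 1) / (2 ln(2) n).
    correction = (len(alphabet) - 1) / (2 * math.log(2) * n)
    # Clipping at zero is a choice made for this sketch only.
    return max(0.0, math.log2(len(alphabet)) - entropy - correction)


def alignment_information(sequences):
    """Information content for every column of equal-length aligned sequences."""
    return [column_information(col) for col in zip(*sequences)]


# Hypothetical toy alignment, not real PIR-ALN data.
toy_alignment = ["CAHCG", "CSHCA", "CAHCT", "CKHCE"]
for position, bits in enumerate(alignment_information(toy_alignment), start=1):
    print(position, round(bits, 2))
```

Unlike a consensus string, the resulting values grade each position by how strongly it is conserved, which is the quantity a logo displays as stack height.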

Using the RESID and PIR-ALN databases, we prepared a list of all homology domains that had annotated features occurring within them in the PIR-PSD. We selected for closer examination 15 particular cases of post-translational features associated with homology domains. In 2 cases new homology domains were identified based on the occurrence of the same post-translational features in more than one superfamily. Together these 17 cases cover 21 entries in the RESID Database with 20 features, and 28 homology domains from 2460 entries in the PIR-PSD.

Table 1. RESID Database

Table 2. Post-Translational Modifications Associated with Homology Domains

DISCUSSION

Glycine Radical: The preponderance of catalyzed biochemical reactions are initiated by nucleophilic or electrophilic attack. Many are simple or coupled oxidation/reduction reactions. Few proceed by free radical mechanisms. Free radical mechanisms may not have been favored during the course of evolution because of difficulties either in producing and maintaining suitable initiators for free radical reactions, or in eliminating unproductive side reactions. Four free radical post-translational modifications are known: of cysteine, tryptophan, tyrosine and, most recently discovered, glycine. The glycyl free radical seems to be generated by a very closely associated iron-sulfur center. The Glycyl Radical Homology Domain identified in this work is approximately 60 residues in length and has been found in at least 3 homeomorphic superfamilies. The corresponding PROSITE pattern has 9 positions and produces 1 false positive. The glycyl radical may occur in other proteins lacking the Glycyl Radical Homology Domain. The Logo for the entire domain is presented in Figure 1.

Erythro-beta-Hydroxyasparagine and Erythro-beta-Hydroxy-Aspartic Acid: The hydroxylation of asparagine or aspartic acid residues is carried out by peptide-aspartate beta-dioxygenase (EC 1.14.11.16). In animals, the modification occurs at a specific position in the EGF Homology Domain; however, that position is not always conserved. In some bacteria, the modification occurs in proteins lacking the EGF Homology Domain. The function of the modification is not known. The Logo for the immediate site within the homology domain is presented in Figure 2.

Gamma-Carboxyglutamic Acid: In vertebrates the vitamin K-dependent gamma-carboxylation of glutamic acid appears to be restricted to the Gla Homology Domain; however, a number of glutamic acid residues in different sequence contexts are modified within that domain. The gamma-carboxyglutamic acid functions to bind calcium. In molluscs this modification occurs in proteins lacking the Gla Homology Domain. The Logo for 324 individual carboxyglutamic acid sites (not for the aligned homology domain) is presented in Figure 3. This Logo represents the information available to the carboxylating enzyme through local determinants at each site and not the aggregate information available in the complete homology domain.
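A site-centred Logo of this kind is built from windows aligned on the modified residue rather than on the domain alignment. The sketch below shows one way such windows might be extracted; the function name, the flank width, the padding character and the toy sequence with its site positions are all assumptions made for illustration and are not taken from the poster.

```python
def site_windows(sequence, sites, flank=7, pad="-"):
    """Return fixed-width windows centred on 1-based site positions.

    Windows that run past either end of the sequence are padded so that
    the modified residue always sits in the middle column.
    """
    windows = []
    for pos in sites:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"site {pos} lies outside the sequence")
        i = pos - 1  # convert to a 0-based index
        left = sequence[max(0, i - flank):i]
        right = sequence[i + 1:i + 1 + flank]
        windows.append(left.rjust(flank, pad) + sequence[i] + right.ljust(flank, pad))
    return windows


# Hypothetical toy sequence and site positions, not a real PIR-PSD entry.
toy_sequence = "ANSFLEELRPGSLERECKEE"
toy_sites = [6, 7, 14, 16]
for window in site_windows(toy_sequence, toy_sites, flank=4):
    print(window)
```

Stacking the windows from many annotated sites gives an alignment in which every row shares the same centre column, and that alignment can then be summarized column by column.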

3'-Methylhistidine: The specific methylation of histidine in myosin and actin is carried out by protein-histidine N-methyltransferase (EC 2.1.1.85). In myosin the modification occurs at a histidine within the Myosin Motor Homology Domain, which does not occur in actin. The function of the modification is not known. There is no corresponding PROSITE pattern.

3'-FAD-Histidine: This modification occurs at a specific histidine in bacterial proteins with the Fumarate Reductase Flavoprotein Homology. The modification also occurs in proteins lacking this homology domain.

O-Phosphopantetheine-Serine: This widespread modification occurs exclusively at a specific serine in proteins with the Acyl Carrier Protein Homology. The modification is carried out by the enzyme holo-[acyl-carrier protein] synthase (EC 2.7.8.7). The corresponding PROSITE pattern produces almost 50% false positives (81) and 10 false negatives. The Logo for 94 phosphopantetheine binding sites is presented in Figure 4.

Biotinyl-Lysine and Lipoyl-Lysine: Biotin is attached to lysine by biotin-protein ligase (EC 6.3.4.-) and lipoamide is attached by lipoate-protein ligase (EC 6.3.4.-) through similar mechanisms. However, the two activating enzymes have no apparent sequence similarity. Both the biotin and the lipoyl modifications occur at a specific lysine in the Lipoyl/Biotin-Binding Homology Domain, and although the two subsets have fairly distinctive sequences, the conformations are observed to be similar. We are developing an approach that applies an analysis of variance to the information contents of the sequences in the two subsets and in the aggregate to aid in deciding whether the subsets of these homology domain sequences should be “lumped or split”. Figures 5 and 6 present the Logos for 27 biotin and 40 lipoamide binding sites respectively. Figures 7 and 8 present the Logos for the complete Lipoyl/Biotin-Binding Homology Domains of 18 biotin and 20 lipoamide binding sequences respectively. Figure 9 presents the Logo for an aggregate of 125 Lipoyl/Biotin-Binding Homology Domains.
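The poster does not spell out the analysis of variance procedure, so the following is only an illustrative sketch of one ANOVA-like decomposition, not the authors' method. For a single alignment column it splits the entropy of the pooled sequences into a within-subset part (the size-weighted mean of the subset entropies) and a between-subset part (the remainder, which equals a weighted Jensen-Shannon divergence). The function names and toy columns are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a Counter of residue counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in counts.values() if c)


def split_entropy(column_a, column_b, gap="-"):
    """Return (total, within, between) entropy in bits for one column.

    column_a and column_b hold the residues that two subsets of sequences
    (for example biotin-binding and lipoyl-binding) show at the same
    aligned position. Gap characters are ignored.
    """
    a = Counter(r for r in column_a if r != gap)
    b = Counter(r for r in column_b if r != gap)
    n_a, n_b = sum(a.values()), sum(b.values())
    n = n_a + n_b
    if n == 0:
        return 0.0, 0.0, 0.0
    total = entropy(a + b)
    within = (n_a / n) * entropy(a) + (n_b / n) * entropy(b)
    return total, within, total - within


# Hypothetical columns: one shared by both subsets, one that separates them.
print(split_entropy("KKKK", "KKKK"))
print(split_entropy("AAAA", "GGGG"))
```

Under this assumption, columns where most of the total entropy falls in the between-subset term would argue for splitting, while columns dominated by the within-subset term would argue for lumping.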

N6-Pyridoxal Phosphate-Lysine: Most enzymes with this modification may be self-activating. This modification occurs in a relatively large number of different homology domains. Four of these are listed in Table 2.

Phytochromobilin-Cysteine: The lyase that produces this modification has not been characterized. The modification occurs exclusively in the Phytochrome Homology Domain.

Nitrogenase Iron-Molybdenum Cofactor: This prosthetic group, cysteinyl homocitryl molybdenum-heptairon-nonasulfide, and a similar iron-vanadium cofactor are produced by a class of synthases, some of which share the Nitrogenase Vanadium-Iron Protein Alpha Chain Homology. The modification does not occur in proteins lacking this homology domain.

Cysteinyl Molybdopterin: This prosthetic group and a class of related prosthetic groups containing molybdenum or tungsten are synthesized by a class of synthases, some of which may share the Molybdopterin-Binding Domain Homology. However, this modification may also occur in proteins lacking this homology domain.

S-Methyl-Cysteine: This modification is formed by the “enzyme” itself which thereby becomes inactive; no reversing metabolic reaction has been characterized. The Methylated-DNA-Protein-Cysteine S-Methyltransferase Homology Domain is approximately 80 residues in length. This modification probably also occurs in proteins lacking this homology domain.

Heme-Biscysteine: This important modification is produced by holocytochrome-c synthase (EC 4.4.1.17). The modification is found in a number of different homology domains, multiple times in some of them. Some of the homology domains are obviously distantly related, and the method for analysis of variance in the sequence information contents is being used to help classify these.

Tetrakiscysteinyl Iron: This modification is probably produced only with the assistance of an iron carrier. The modification occurs in two different, well-characterized homology domains, and probably in others as well.

Tetrakiscysteinyl Diiron Disulfide, Biscysteinyl Bishistidino Diiron Disulfide, Triscysteinyl Triiron Tetrasulfide, Tetrakiscysteinyl Tetrairon Tetrasulfide: These modifications are produced by a class of nifS protein homologs. These synthases obtain sulfur from free cysteine to modify an internal cysteine residue to a cysteine persulfide intermediate (RESID:AA0269) before contributing the sulfur to the iron-sulfur cluster. The four-cysteine-ligated 2Fe-2S cluster occurs predominantly but not exclusively in the Ferredoxin [2Fe-2S] Homology Domain. The 3Fe-4S and 4Fe-4S clusters occur predominantly but not exclusively in the Ferredoxin 2[4Fe-4S] Homology Domain. However, the two-cysteine-two-histidine-ligated 2Fe-2S cluster occurs exclusively in the Rieske [2Fe-2S] Homology Domain. This feature was known to occur in at least two different homeomorphic superfamilies, but the sequences were not sufficiently similar, except for the binding motif, to be mutually detectable using BLAST with normal parameters. After an alignment of sequences from the two superfamilies was prepared using the experimentally determined binding sites, an HMM procedure successfully detected 19 additional sequences that could be reliably predicted to have the feature. The Rieske [2Fe-2S] Homology Domain is being refined by further rounds of hidden Markov modeling. Sequences from different superfamilies are being tested by the analysis of variance procedure to decide whether the subsets are sufficiently homogeneous to be classified in a single homology domain.
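The idea of searching with a model built from an alignment anchored on known binding sites can be illustrated with a much simpler stand-in for a profile HMM: an ungapped position-specific scoring matrix in the spirit of profile analysis [5,6]. This sketch has no insert or delete states, assumes a uniform background and a flat pseudocount, and uses hypothetical function names and toy sequences; it is not the HMM procedure used in this work.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds scores (bits) per column against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, sequence):
    """Return (score, 0-based start) of the best ungapped window, or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20-letter alphabet contribute a score of zero.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical toy motif alignment and query, not real Rieske sequences.
toy_motif = ["CTHLGC", "CPHLGC", "CTHRGC"]
toy_pssm = build_pssm(toy_motif)
print(best_hit(toy_pssm, "MKAACTHLGCVPQ"))
```

In practice a score threshold would have to be calibrated against sequences known to carry or lack the feature before any hit could be treated as a prediction.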

ACKNOWLEDGMENTS

We thank Thomas D. Schneider at the National Cancer Institute, Laboratory of Experimental and Computational Biology, Frederick, Maryland, for his remote assistance in preparing the sequence logo PostScript files, and Joseph Janda for administrative support.

REFERENCES

[1] George, D.G. (1993) Proposal for the Definition of a “Protein Superfamily”.
National Biomedical Research Foundation, Washington, DC, pp.1-13.
/pirwww/otherinfo/sfdef.html

[2] Garavelli, J.S. (1999) The RESID database of protein structure modifications.
Nucleic Acids Res. 27(1), 198-199.

[3] Barker, W.C., Garavelli, J.S., McGarvey, P.B., Marzec, C.R., Orcutt, B.C.,
Srinivasarao, G.Y., Yeh, L.S.L., Ledley, R.S., Mewes, H.W., Pfeiffer, Tsugita, A.,
and Wu, C. (1999) The PIR-International Protein Sequence Database. Nucleic Acids Res. 27(1), 39-43.

[4] Srinivasarao, G.Y., Yeh, L.S.L., Marzec, C.R., Orcutt, B.C. and Barker, W.C. (1999)
PIR-ALN: a database of protein sequence alignments. Bioinformatics 15(5), 382-390.

[5] Gribskov, M., Luthy, R., and Eisenberg, D. (1990)
Profile analysis. Meth. Enzymol. 183, 146-159.

[6] Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987)
Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84, 4355-4358.

[7] Schneider, T.D. and Stephens, R.M. (1990)
Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18(20), 6097-6100.

[8] Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994)
Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501-1531.