"Domain" Record
The format for the "Domain:" record is
"Domain:" ["(or "hyphenated pairs")"] domain name ["("form")"] ["#status" status] "<" tag ">"
This record should generally be applied to a single hyphenated pair. A
"domain" carries the connotation of having some degree of spatial coherence,
that is, secondary or tertiary structure. Separate segments of sequence that
together form the same domain should be placed in the same record. Separate
segments of sequence that form spatially distinct domains that happen to have
the same description should be placed in separate records.
We have attempted to standardize most "Domain" records, but this format is
still somewhat variable. Here we set forth some very general guidelines
pertaining to certain types of domains.
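The record format given at the start of this section can be sketched as a small parser. This is only an illustration: the regular expression and the function name `parse_domain` are assumptions, not part of any PIR software, and the sketch handles only the description (the text starting at "Domain:"), not a preceding location field.

```python
import re

# Hypothetical pattern for the description part of a "Domain:" record.
DOMAIN_RE = re.compile(
    r'^Domain:\s*'
    r'(?:\((?P<alternatives>or [^)]*)\)\s*)?'  # optional "(or 5-32 or 11-32)"
    r'(?P<name>.*?)'                            # domain name, including any "(form)"
    r'(?:\s*#status\s+(?P<status>\w+))?'        # optional "#status predicted"
    r'(?:\s*<(?P<tag>[^>]+)>)?'                 # optional "<TAG>"
    r'\s*$'
)

def parse_domain(record):
    """Split a "Domain:" description into its fields (absent fields are None)."""
    match = DOMAIN_RE.match(record)
    if match is None:
        raise ValueError("not a Domain record: %r" % record)
    fields = match.groupdict()
    fields["name"] = fields["name"].strip()
    return fields

print(parse_domain("Domain: transit peptide (chloroplast) #status predicted <TNP>"))
```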
General Principles
Use the same name for the same kind of domain. Insofar as possible, use
the same or similar tags for the same kind of domain. Domain names should be
INFORMATIVE; avoid names such as "first", "A", "II", etc. A domain or region should be annotated only when it is biologically significant and the name
should reflect that interesting structural or functional property. Names that
are obvious or used only for the convenience of particular authors should be
suspect.
Do not include enumeration within the names given to repeated domains of the
same type within the same sequence. This results in needless proliferation of
names that are all the same except for a number or letter. The enumeration
should be in the tag instead.
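A rough way to spot enumerated names is to look for a trailing number, Roman numeral, or single capital letter. The sketch below is a heuristic under that assumption; the name `looks_enumerated` is hypothetical, and a name that legitimately ends in a letter or number would be flagged too, so hits need manual review.

```python
import re

# Trailing Arabic number, Roman numeral (I, V, X only), or single capital letter.
ENUMERATED = re.compile(r"\s(?:\d+|[IVX]+|[A-Z])$")

def looks_enumerated(name):
    """Return True if a domain name appears to end with an enumeration."""
    return bool(ENUMERATED.search(name))

for name in ("transmembrane 2", "transmembrane II", "transmembrane"):
    print(name, looks_enumerated(name))
```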
The boundaries of domains are assumed to have "predicted" status and are
understood to be not necessarily precise; usually no additional indication of
uncertainty is needed. If there is considerable uncertainty, for example if
any of three Mets might be the initiator, this is indicated by an initial
parenthetical phrase. For example,
Domain: (or 5-32 or 11-32)
The "or" form should be avoided whenever possible.
The boundaries of homology domains are understood to be more or less
arbitrary and defined on the basis of sequence similarities; do not put a status
on such domains.
The question of what types of regions we should call domains is still under
discussion.
Each instance of a given kind of domain within a sequence should have a
separate domain record, thus use
20-42/Domain: transmembrane #status predicted
50-72/Domain: transmembrane #status predicted
and do not use
20-42,50-72/Domain: transmembrane #status predicted
In some cases a single 3-dimensionally defined domain does consist of separated
segments of sequence, and a list of ranges may appear in such cases but this
is rare.
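Records that list several ranges can be found mechanically and then reviewed by hand. The following sketch assumes features are written as "location/description" as in the examples above; the function names are hypothetical. A multi-range record is not necessarily wrong, since a single three-dimensional domain may consist of separated segments, so the check only flags candidates.

```python
def parse_location(feature):
    """Split '20-42,50-72/Domain: ...' into a list of (start, end) and the description."""
    location, _, description = feature.partition("/")
    ranges = []
    for part in location.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges, description

def has_multiple_ranges(feature):
    """Return True if the location field lists more than one hyphenated pair."""
    ranges, _ = parse_location(feature)
    return len(ranges) > 1

print(has_multiple_ranges("20-42,50-72/Domain: transmembrane #status predicted"))
```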
Signal sequences and transit peptides
These domains have been standardized in PIR. Please follow the format
given below for the simple cases; in more complex cases, use the examples
as a guide. A form must appear with a transit peptide. Tags are required,
but these examples are suggestions.
"Domain: " ["(or"hyphenated pair ["or" hyphenated pair ...]")"]
"signal sequence" ["(fragment)"] ["#status" status] "<SIG>"
"Domain:" ["(or" hyphenated pair ["or" hyphenated pair ...] ")"]
"transit peptide ("form")" ["(fragment)"] ["#status" status] "<TNP>"
form is "mitochondrion" | "chloroplast" | "amyloplast" | "chromoplast" |
"cyanelle" | "glyoxysome" | "hydrogenosome" | "plastid" | "thylakoid"
Examples:
Domain: signal sequence #status predicted <SIG>
Domain: signal sequence (fragment) #status experimental <SIG>
Domain: transit peptide (amyloplast) #status predicted <TNP>
Domain: transit peptide (chloroplast) #status predicted <TNP>
Domain: transit peptide (chloroplast) (fragment) #status experimental <TNP>
Domain: transit peptide (mitochondrion) #status predicted <TNP>
The "or" form should be avoided whenever possible.
Domain: (or 1-15) signal sequence (fragment) #status predicted <SIG>
Domain: (or 1-43 or 1-49) signal sequence #status predicted <SIG>
When the boundary between a signal sequence and the following domain has not
been determined or predicted, use a record like one of these:
Domain: signal sequence and propeptide #status predicted <SIG>
Domain: signal sequence (fragment) and propeptide #status predicted <SIG>
When more than one protein product is presented in an entry, use this format:
"Domain: signal sequence (of "product name")" ["#status" status] "<SIG>"
For example
Domain: signal sequence (of membrane glycoprotein E1) #status predicted <SIG>
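The simple transit peptide format described in this section can be assembled and checked against the list of permitted forms. This is a minimal sketch; the function name and its parameters are assumptions made for illustration, and the "or" form is deliberately not supported.

```python
# Permitted values of "form", copied from the format definition in this section.
TRANSIT_FORMS = {
    "mitochondrion", "chloroplast", "amyloplast", "chromoplast", "cyanelle",
    "glyoxysome", "hydrogenosome", "plastid", "thylakoid",
}

def transit_peptide_record(form, status=None, fragment=False):
    """Build a transit peptide "Domain:" description in the standard order."""
    if form not in TRANSIT_FORMS:
        raise ValueError("unknown transit peptide form: %r" % form)
    parts = ["Domain: transit peptide (%s)" % form]
    if fragment:
        parts.append("(fragment)")        # "(fragment)" comes before the status
    if status:
        parts.append("#status " + status)
    parts.append("<TNP>")                 # suggested tag for transit peptides
    return " ".join(parts)

print(transit_peptide_record("chloroplast", "experimental", fragment=True))
```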
Membrane-crossing regions
These are currently annotated as domains.
Domain: transmembrane #status predicted <TMM>
Domain: transmembrane beta strand #status predicted <TMM>
Domain: transmembrane helix #status experimental <TMM>
Try to be consistent in assigning boundaries of transmembrane domains within a
group of closely related proteins. Lacking any other criteria, use the
minimum range suggested by the ALOM program. The preferred tags are "<TMM>"
when there is only one, and "<TM1>", "<TM2>", etc., when there is more than one. When there are more than nine, use tags like "<TM01>".
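The tagging rule for transmembrane domains can be written as a short function. This is a sketch with a hypothetical name; it assumes the domains are numbered in sequence order starting from 1.

```python
def transmembrane_tags(count):
    """Return the preferred tags for `count` transmembrane domains in one sequence."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return ["<TMM>"]
    width = 2 if count > 9 else 1  # zero-pad to two digits when there are more than nine
    return ["<TM%0*d>" % (width, i) for i in range(1, count + 1)]

print(transmembrane_tags(3))
```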
Do not use the following kinds of names for transmembrane domains:
"transmembrane 2" or "transmembrane II"
(the numbers should not be part of the name)
"transmembrane domain"
"transmembrane region"
"membrane-spanning segment"
"potential transmembrane sequence"
"membrane anchor domain"
The following cases also appear and are under review.
Domain: intramembrane
Domain: membrane anchor
Domain: membrane associated
Domain: membrane insertion
Domain: membrane-bound
Domain: transmembrane amphipathic helix #status predicted
Homology domains
Homology domains form a special class. They are distinguished by the property
that those of a given type (with the same name) are homeomorphic and share
sequence homology although they are found in different (nonhomeomorphic)
proteins. The names of such domains end with the word "homology". Many such
domains are homologous with most of the entire length of some other protein,
in which case they may be named after such a protein, either exactly ("trypsin
homology") or with a more general designator ("protein kinase homology"). Other
domains have, so far, been found as domains within multidomain proteins
("homeobox homology"). Some are named to indicate that they are repeated in
a certain protein ("complement factor H repeat homology"). The conversion of
homology domain names to include the terms "homology" or "repeat homology"
is still incomplete.
The boundaries of homology domains should be consistent with an alignment of
representative domains of the named type. Dr. Barker is collecting such
alignments. A preferred tag will be assigned for each type of homology domain.
Some examples:
Domain: basic proteinase inhibitor homology <BPI>
Domain: cytochrome b5 core homology <CB5>
Domain: protein kinase homology <KIN>
Domain: calmodulin repeat homology <EF1>
Note that some domains with names of proteins may have been assigned not by
sequence homology but by predicted activity. Defining a homology domain is
preferable to a name that predicts structure ("EF hand") or function
("calcium-binding") because structure may be distorted or function lost in
homologous domains. Do NOT add the word "homology" without affirming that
there is sequence homology! Please read carefully the discussion and proposal
of the use of "Domain" and "Region" records for repeated sequence elements.
Do not use a status for this type of domain. Such domains are assigned only by
homology, which is always an inference (predicted, never experimental).
Boundaries are understood to be somewhat a matter of human judgement.
The following names are NOT acceptable:
"alpha chain homolog"
"complement binding protein-related"
"endozepine-like"
"homology with Ig C region domains"
"malK protein homolog 1"
"lipoyl domain 1 #status predicted"
In this last example, it's not clear whether it is a homology domain or a
function prediction. The word "domain" is always superfluous. Domains should not be enumerated.
To denote regions that are under consideration as homology domains, it has
become acceptable practice to annotate them as a "similarity" like
Domain: platelet-derived growth factor chain B similarity
with the understanding that they will be changed at a later date.
Repeat Domains
Domains that are repeated in a protein should be named as homology domains if
they are also known to occur in diverse proteins. Otherwise, they may be named
for the specific protein or an example of it omitting the term "homology", for
example "CDC23 repeat".
Miscellaneous Rules
Use a hyphen before "binding" as in "cAMP-binding" when it is used
attributively, that is as an adjective with a following noun. For example,
Domain: DNA-binding core #status predicted
Domain: alpha-actinin actin-binding domain homology
Otherwise, if it is used nominatively, as a name with no following noun,
do not use a hyphen. For example,
Domain: DNA binding #status predicted
Region: actin binding #status predicted
If required, use "fragment". It appears before the status (if used) and tag.
Always try to use names that are at least 3 characters in length.
Old "Duplication" records cannot be used. Any features of this type should be
entered as "Region" or "Domain" records in accordance with the discussion
below.
Following these guidelines will at least reduce the heterogeneity in
the current database and make it easier to convert.
Duplications are of two major types: short repeats (usually tandem) and longer
domains. There is no firm cut-off without studying the situation; as a
guideline, a duplication of 25 or fewer residues is a repeat, one of 50 or more
is a domain, and one in between must be a domain if such domains exist in other
types of proteins (e.g., EGF-like) but may be treated as a tandem repeat
if it is unique to this type of protein.
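The length guideline above can be expressed as a small decision function. It is only a sketch of the stated rule of thumb: the function name and the boolean parameter are assumptions, and the text itself notes that there is no firm cut-off without studying the situation.

```python
def classify_duplication(length, found_in_other_proteins=False):
    """Apply the 25/50-residue guideline to a duplicated segment."""
    if length <= 25:
        return "repeat"
    if length >= 50:
        return "domain"
    # Between 26 and 49 residues the answer depends on whether the same
    # kind of domain is known in other types of proteins.
    return "domain" if found_in_other_proteins else "repeat"

print(classify_duplication(40, found_in_other_proteins=True))
```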
Repeats
Repeats are very short to fairly short, usually occur in tandem, and the pattern is
often, but not always, specific to this type of protein. The annotation to
use is
"Region:" record
Use a hyphenated pair for the entire region in the location field, "22-300",
and do not give the boundaries of individual repeats.
Several unconnected regions may be listed if they contain the same pattern,
"22-100, 200-298" (note: we don't have the permission to use a semi-colon yet;
hopefully it comes very soon!)
If there is a list, then all the other information within the record must be
applicable to the entire list.
In the description field, use the following format
n "-residue repeats" ["("sequence pattern")"] [", " descriptive phrase]
For example
Region: 11-residue repeats (D-P-A-K-A-S-Q-G-G-L-E)
"n" is the typical number of residues in the repeat pattern and the number of
repeats is not given.
A sequence pattern is a simple representation of the canonical pattern using
the single-letter code separated by hyphens and when necessary alternatives
are indicated for only the most common residues separated by a slash.
For example,
(A-C-D/E-F-G)
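A sequence pattern in this notation can be turned into a regular expression for checking it against a sequence, and into the "Region:" description itself. The sketch below makes two assumptions that the text does not state: that "X" stands for any residue, and that "n" can be taken as the number of positions in the pattern. Both function names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Convert a pattern such as 'A-C-D/E-F-G' into a compiled regular expression."""
    pieces = []
    for position in pattern.split("-"):
        residues = position.split("/")
        if residues == ["X"]:
            pieces.append(".")                       # assumed wildcard position
        elif len(residues) == 1:
            pieces.append(residues[0])               # single fixed residue
        else:
            pieces.append("[" + "".join(residues) + "]")  # alternatives
    return re.compile("".join(pieces))

def repeat_description(pattern):
    """Build the 'Region:' description, taking n as the pattern length."""
    n = len(pattern.split("-"))
    return "Region: %d-residue repeats (%s)" % (n, pattern)

print(pattern_to_regex("A-C-D/E-F-G").pattern)
print(repeat_description("D-P-A-K-A-S-Q-G-G-L-E"))
```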
No tag is usually used with the "Region" record.
"tandem repeat" is used as a KEYWORD if the repeats are tandem.
"repeat" may be allowed as a KEYWORD for non-tandem repeats.
Domains
It is suggested that the longer domains should be annotated as domains,
individually represented, and tagged so that these subsequences can be
retrieved. All domains of the same type should be given exactly the same name.
For domains defined by homology, there will be eventually an alignment of
selected examples which, in effect, is the definition of the domain. Dr.
Barker is curator for homology domains and welcomes any such alignments from
other PIR-International staff for the definition of new domains or for the
standardization of currently heterogeneous domains.
"Product" Records
A Product is any relatively stable (i.e. isolatable) peptide chain, including
chains that experience cleavage of a precursor form and remain bound together
in the same molecule. This definition has several implications.
Some sequence elements previously identified as "Peptide" are probably not
stable and will not fit the proposed definition of "Product". You may use
"Domain" or "Region" for these. Activation peptides are normally annotated as
a "Domain" unless they have been isolated and appear to be physiologically
significant. What can usually be easily determined is what segments are
present in the final mature protein(s) and what segments are removed.
Do not use Product records like
20-50,70-90/Product: mcguffin A and B chains #status experimental <MAT>
Several options are available, and there are good examples where it has been
necessary to use one or the other of these forms. You may represent the chains
in two separate "Product" features.
20-50/Product: mcguffin chain A #status experimental <ACH>
70-90/Product: mcguffin chain B #status experimental <BCH>
It is also possible to present a single "Product" feature and two "Domain"
features, especially when the chains are covalently linked and only a single
molecular entity with one molecular weight actually exists.
20-50,70-90/Product: mcguffin #status experimental <MAT>
20-50/Domain: mcguffin chain A #status experimental <ACH>
70-90/Domain: mcguffin chain B #status experimental <BCH>
[The use of the second approach is evident in annotating protein splicing.]
Do not mix these forms in the same entry, and try to standardize them across
a family.
So far there has been little standardization of "Product" records; however, the
following guidelines should be used.
Do not use "Product" for a segment that has insufficient lifetime to be
isolated.
A name in a "Product" feature should repeat the protein name as given in the
entry title or a name in the "Contains" record, usually omitting "precursor"
and including a chain designation. Version and clone designations may be
omitted. This may be enforced at a later date.
Chain designations that are words or Greek letters should precede the word
"chain" and designations that are English letters, numbers (Arabic or Roman) or
combinations of them should follow the word "chain": thus,
"chain B2"
"chain IV"
"pi chain"
"heavy chain"
"catalytic chain"
The tag is required. Use "<MAT>" for a single mature product.
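The placement rule for chain designations given above can be approximated as follows. The sketch is a heuristic: it treats a single letter, an Arabic number, a Roman numeral written with I, V and X, or a letter and number combination as a designation that follows "chain", and everything else as a word or Greek letter name that precedes it. The function name and the exact pattern are assumptions.

```python
import re

# Single letter, digits, Roman numeral (I/V/X), or a letter-number combination.
FOLLOWS_CHAIN = re.compile(r"[A-Za-z]|\d+|[IVX]+|[A-Za-z]\d+|\d+[A-Za-z]")

def chain_name(designation):
    """Place a chain designation before or after the word 'chain'."""
    if FOLLOWS_CHAIN.fullmatch(designation):
        return "chain " + designation
    return designation + " chain"

for designation in ("B2", "IV", "pi", "heavy"):
    print(chain_name(designation))
```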
If you can determine that both boundaries of a product have been
experimentally determined AS PROTEIN, with enough of the portion between them
also determined to leave little doubt that additional processing or splice
forms do not occur, then use the status "#status experimental". Use "#status predicted" if
the boundaries are assigned by homology or the sequence is determined
substantially as nucleic acid. The experimental determination of only one
end (almost always the amino end) is not sufficient to justify use of
"#status experimental" for an entire "Product" feature because protein
splicing, alternate transcripts, frame-shift errors and carboxyl-terminal
propeptide processing introduce too many uncertainties.
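The status rule for "Product" features reduces to a simple test: all three kinds of protein-level evidence must be present before "experimental" is justified. The sketch below is one reading of that rule; the function and parameter names are hypothetical.

```python
def product_status(amino_end_as_protein, carboxyl_end_as_protein, interior_as_protein):
    """Choose the status for a Product feature from the protein-level evidence."""
    if amino_end_as_protein and carboxyl_end_as_protein and interior_as_protein:
        return "#status experimental"
    # One determined end alone (usually the amino end) is not sufficient.
    return "#status predicted"

print(product_status(True, False, True))
```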
Do not use "amino end of" and "carboxyl end of". Instead use the modifier "(fragment)".
"Region" Record
This record remains generally unstandardized at this time to allow the
annotation of new features that are not yet well-understood or standardized.
A "Region" should probably carry only the connotation of being contiguous
sequence, as opposed to the spatial connotation of a "Domain". The following
guidelines should be followed:
The tag is not usually used.
Status is often not appropriate.
See the discussion elsewhere for how to handle regions of tandem repeats.
The word "rich" should be appended with a hyphen. The word "binding" should
be appended with a hyphen if it is used as an adjective, and it should not
have a hyphen if it is not followed by a noun.
Do not use the word "region" in the description; no "Region: xxx region".
Avoid using expressions which match other record types, such as:
Region: active site
Region: extracellular domain
This first expression should especially not be used if specific residues are
listed. It should either be annotated as an "Active site" or as
Region: catalytic
The second would be better as
Domain: extracellular #status predicted
Regions of a specific type of secondary structure should not be annotated.
In the NRL_3D database only, the PDB HELIX, TURN and SHEET features are
converted to PIR "Region" features. The definitions and descriptions will
use the PDB annotations in parentheses.
Region: helix (right hand alpha)
Region: turn (type II)
Region: beta sheet
Region: beta barrel
No other PIR databases should have entries with such conformational information
annotated. The feature
Domain: beta barrel
is acceptable.
Motifs or patterns combining various types of secondary structure may be
annotated as "Regions". For example,
Region: helix-turn-helix motif <HTH>
Do not use the word "motif" except for defined or accepted sequence motifs.
Use the word "pattern" instead.
Do not use "#status predicted" for any feature that is defined by a sequence
motif or pattern. Do not use something like
Region: pentapeptide motif (X-F-X-F-G) #status predicted
This is nonsensical because either the pattern is in the sequence or it is not.
Instead use
Region: pentapeptide motif (X-F-X-F-G)
Unfortunately it becomes more difficult to appreciate this rule when the name
given to the motif is supposedly descriptive of a function. In cases like
Region: DNA-binding motif (K/R-G-R-G-R-P)
it is very tempting to use "#status predicted". But does the status mean that
the property of DNA binding is predicted, or only that a motif is present? If
the motif is present, it certainly isn't predicted, it is experimentally
observed. But putting "experimental" would suggest that "DNA-binding" is not
just a name but an observation. Don't be confused or confusing; never use
a status with "motif", "pattern", "homology", "similarity", etc.
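The rule that a status never accompanies "motif", "pattern", "homology" or "similarity" is easy to check automatically. This is a minimal sketch with a hypothetical function name; it covers only the four words named explicitly and not whatever else the "etc." may include.

```python
import re

NO_STATUS_WORDS = ("motif", "pattern", "homology", "similarity")

def status_misused(record):
    """Return True if a record combines "#status" with one of the listed words."""
    description, separator, _ = record.partition("#status")
    if not separator:
        return False  # no status present, nothing to complain about
    words = re.findall(r"[a-z]+", description.lower())
    return any(word in NO_STATUS_WORDS for word in words)

print(status_misused("Region: pentapeptide motif (X-F-X-F-G) #status predicted"))
```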
Suggestions for Annotators
Annotators may wish to use this checklist in preparing an annotation.
Usually the annotation should be the same as an annotation already in the
database. Check for the feature in other database entries. Only these record types should be used:
Active site:
Binding site:
Cleavage site:
Cross-link:
Disulfide bonds:
Domain:
Inhibitory site:
Modified site:
Product:
Region:
The use of only these types is enforced in PIR databases.
Except for the special cases of "selenocysteine" and "N-formylmethionine",
standard 3-letter residue codes should appear after the colon of
"Active site", "Cleavage site" and "Inhibitory site" records, and in
parentheses immediately after the first name in "Binding site",
"Cross-link" and "Modified site" records. Be certain the residue code
appears, that the residue has the correct number and that it corresponds
to the proper residue in the sequence.
This identity check is enforced in PIR databases.
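The identity check can be pictured as a lookup of the annotated residue in the sequence. The sketch below is not the PIR implementation; it assumes 1-based residue numbering, a sequence in single-letter code, and covers only the twenty standard residues, so the special cases "selenocysteine" and "N-formylmethionine" are not handled.

```python
# Standard 3-letter to 1-letter residue codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def residue_matches(sequence, position, code):
    """Return True if the residue at a 1-based position matches the 3-letter code."""
    if not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == THREE_TO_ONE[code]

print(residue_matches("MKSH", 3, "Ser"))
```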
Check that all other required fields are present and in the preferred
order. The status should always be added to new entries in these records:
Active site:
Binding site:
Cleavage site:
Cross-link:
Disulfide bonds:
Inhibitory site:
Modified site:
Product:
A status may be appropriately used in only some "Region" and "Domain"
records.
If the extent field is used, only the word "partial" should appear; it
should be placed immediately before the status, and the status should be
"experimental", not "predicted".
Check your spelling and punctuation. Spelling errors in chemical terms can
be especially difficult to catch. When appropriate, check that the names
in "Product" records correspond to names in the title or "Contains" record.
Check that there are unique tags on all "Product" and "Domain"
records, and that they are different from other tags in the entry.
Tags are not required on any other types of features.
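Two items on this checklist, the permitted record types and the unique tags on "Product" and "Domain" records, lend themselves to a mechanical check. The following is a sketch only: the function name, the message wording, and the assumption that each feature is written as an optional "location/" prefix followed by the description are all illustrative.

```python
import re
from collections import Counter

ALLOWED_TYPES = {
    "Active site", "Binding site", "Cleavage site", "Cross-link",
    "Disulfide bonds", "Domain", "Inhibitory site", "Modified site",
    "Product", "Region",
}

def check_entry(features):
    """Return a list of problems found in the feature records of one entry."""
    problems, tags = [], []
    for feature in features:
        # Drop a leading location field such as "20-50,70-90/" if present.
        description = re.sub(r"^[\d,\s-]+/", "", feature)
        record_type = description.split(":", 1)[0].strip()
        if record_type not in ALLOWED_TYPES:
            problems.append("record type not allowed: " + record_type)
        tag = re.search(r"<([^>]+)>\s*$", description)
        if tag:
            tags.append(tag.group(1))
        elif record_type in ("Product", "Domain"):
            problems.append("missing tag: " + feature)
    for name, count in Counter(tags).items():
        if count > 1:
            problems.append("duplicate tag: <%s>" % name)
    return problems

print(check_entry([
    "20-50/Product: mcguffin chain A #status experimental <ACH>",
    "70-90/Product: mcguffin chain B #status experimental <ACH>",
]))
```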
Revised 10/22/01