THE PROTEIN INFORMATION RESOURCE DATABASES FOR GENOMIC RESEARCH
The Protein Information Resource (PIR) supports research on molecular evolution, functional genomics, and computational biology by maintaining a comprehensive, non-redundant, well-classified, and freely available protein sequence database. Data from whole genome sequencing projects are incorporated into the database, and sequence analysis tools are applied to classify all entries into families, superfamilies and homology domains. This comprehensive classification effort allows large-scale annotation at the family level, detection of potential sequence errors, and identification of redundant entries to be merged. PIR has been able to improve genomic sequence reports by identifying and annotating correct translation initiation sites, translational frameshifts, and translational stop codon exceptions such as selenocysteine. Sequence entries are extensively cross-referenced to major nucleic acid, literature, genome, structure, sequence alignment and family databases. PIR maintains several auxiliary databases to support annotation and integrity checking. These include: PIR-ALN, containing alignments of superfamilies, families and homology domains; FAMBASE, a searchable database of family representatives; and the RESID Database of covalent protein modifications. All the databases can be accessed on the PIR Web site (http://www-nbrf.georgetown.edu/pir/) and contain hypertext links to each other and to relevant external databases. The Web site is being redesigned to include new BLAST similarity search engines and pattern matching capabilities. The latest quarterly release of the databases can be accessed through the ATLAS multi-database retrieval software on the Atlas CD-ROM and downloaded by FTP.
The Protein Information Resource (PIR) was established in 1984 by the National Biomedical Research Foundation (NBRF). The PIR Protein Sequence Database evolved from the original NBRF Protein Sequence Database, developed over 20 years by the late Margaret O. Dayhoff and published as the "Atlas of Protein Sequence and Structure". PIR-International is a collaboration between NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID) to collect and publish what is now the oldest and largest database of biomolecular sequence, source, bibliographic, and feature information.
PIR-International Protein Sequence Database: an annotated, non-redundant and cross-referenced database of protein sequences.
PIR Alignment Database, PIR-ALN: contains sequence alignments of superfamilies, families and homology domains produced from information in the Protein Sequence Database.
NRL_3D Sequence--Structure Database: produced from sequences and annotation in the Protein DataBank of three-dimensional structures.
RESID Database of Amino Acid Modifications: based on feature information in the Protein Sequence Database.
PATCHX Protein Sequence Database: derived from publicly available databases and contains protein sequences and associated information not yet included in the PIR-International Protein Sequence Database.
FAMBASE Family Database: a searchable database containing a single representative sequence from each protein family.
MIPS Alignment Database, MIPSALN: contains automatically generated alignments of families having 2 or more sequences. MIPSALN is a subset of the PROT-FAM database produced by our collaborators at MIPS.
PIR incorporates data from genome sequencing projects following publication and submission to the major nucleotide sequence databases. A FASTA comparison of each entry with all other entries in the database is performed to identify similar sequences, and independent reports of the same protein from the same species are merged into a single entry. During this process PIR has improved a number of genome sequence reports by identifying and annotating correct translation initiation sites, translational frameshifts, and translational stop codon exceptions such as selenocysteine.
The new Genomes page on the PIR Web site provides access to all the completed genomic sequences. We are working towards placing all genome entries into superfamilies and providing alignments for all families with 2 or more members. This will provide an excellent tool for researchers to work on comparative genomics. Statistics on the classification of the complete genomes in PIR are shown in Table 1.
Note: ORFs that require translational frameshifting or readthrough of termination codons have been identified and reported by several genome sequencing projects. However, these ORFs are not always translated in the major nucleotide sequence databases. PIR provides translations and annotates these sequences.
Table 1. Classification of the complete genomes in PIR
Species | # Entries | # Placed (superfamilies) | # MIPSALN (family alignments) |
Aquifex aeolicus | 1517 | 681 | 250 |
Archaeoglobus fulgidus | 2398 | 931 | 386 |
Bacillus subtilis | 4212 | 1700 | 742 |
Borrelia burgdorferi (Lyme disease spirochete) | 1417 | 445 | 147 |
Chlamydia trachomatis | 999 | 74 | 36 |
Escherichia coli | 5315 | 2705 | 1654 |
Haemophilus influenzae | 1773 | 983 | 920 |
Helicobacter pylori | 1583 | 537 | 212 |
Methanobacterium thermoautotrophicum | 1963 | 913 | 491 |
Methanococcus jannaschii | 1736 | 810 | 445 |
Mycobacterium tuberculosis | 3950 | 38 | 28 |
Mycoplasma genitalium | 467 | 264 | 411 |
Mycoplasma pneumoniae | 679 | 299 | 434 |
Pyrococcus horikoshii | 2061 | 26 | 9 |
Synechocystis sp. PCC6803 | 3151 | 1040 | 517 |
Treponema pallidum (syphilis spirochete) | 1040 | 17 | 4 |
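As a rough illustration of how the statistics in Table 1 can be read, the sketch below computes the share of each genome's entries that have been placed in a superfamily. The four rows are copied from the table; the function names `placed_fraction` and `coverage_report` are hypothetical and are not part of any PIR software.

```python
# Rows copied from Table 1:
# (species, entries, entries placed in superfamilies, MIPSALN family alignments)
TABLE_1 = [
    ("Escherichia coli", 5315, 2705, 1654),
    ("Haemophilus influenzae", 1773, 983, 920),
    ("Mycobacterium tuberculosis", 3950, 38, 28),
    ("Treponema pallidum", 1040, 17, 4),
]


def placed_fraction(entries: int, placed: int) -> float:
    """Fraction of a genome's entries already assigned to a superfamily."""
    return placed / entries


def coverage_report(rows):
    """Return (species, percent placed) pairs, highest coverage first."""
    report = [
        (name, 100.0 * placed_fraction(entries, placed))
        for name, entries, placed, _alignments in rows
    ]
    return sorted(report, key=lambda item: item[1], reverse=True)


for species, percent in coverage_report(TABLE_1):
    print(f"{species:28s} {percent:5.1f}% placed")
```

Sorting by coverage makes it easy to see which genomes still need most of the classification work.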
The PIR-International Protein Sequence Database, PIR-ALN, NRL_3D, RESID, and PATCHX databases are available on the "ATLAS of Protein and Genomic Sequences CD-ROM". Included on the CD-ROM is the ATLAS database retrieval software.
The ATLAS Multidatabase Information Retrieval Program is designed to provide simultaneous access to many macromolecular sequence, alignment, and auxiliary databases. Fields such as Title, Journal, Author, Species, Keyword, Superfamily, Feature, and others are indexed to allow fast retrieval. ATLAS maintains a current list of search results that can be refined by additional search commands. A powerful pattern matching command is included to search for sequence patterns in the current list. Several display options are available and the command line interface is easy to use. The User's Guide for the ATLAS program is included on the CD-ROM. The ATLAS program, written in C, currently runs on PC-DOS, VAX/VMS, OpenVMS, DEC UNIX, SunOS, SGI/IRIX, and Macintosh systems.
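The idea of refining a current list with a sequence pattern can be sketched in a few lines of Python. This is not the ATLAS command syntax, which is not described here; it uses ordinary regular expressions instead. The dictionary `current_list`, the function `find_pattern`, and the entry `TOY001` are hypothetical. The RGECGG fragment is residues 151-180 of the entry shown in this document, and `G.{4}GK` is a relaxed P-loop-like pattern chosen only for illustration.

```python
import re

# Hypothetical "current list": entry identifier -> sequence fragment.
# RGECGG holds residues 151-180 of the entry; TOY001 is an invented control.
current_list = {
    "RGECGG": "RIIGRLSRSSISVLINGESGTGKELVAHAL",
    "TOY001": "MKTAYIAKQR",
}


def find_pattern(entries, pattern):
    """Keep only entries matching the pattern; report 1-based match starts."""
    regex = re.compile(pattern)
    hits = {}
    for entry_id, sequence in entries.items():
        positions = [match.start() + 1 for match in regex.finditer(sequence)]
        if positions:
            hits[entry_id] = positions
    return hits


# Refine the list: G, any four residues, then G-K.
refined = find_pattern(current_list, "G.{4}GK")
for entry_id, positions in refined.items():
    print(entry_id, positions)
```

Positions are relative to the fragment, so a hit at position 17 in a fragment that starts at residue 151 corresponds to residue 167 of the full sequence.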
PIR1 section of the Protein Sequence Database, release 58.02, 23-Oct-1998, assembled and annotated by the PIR-International.
PIR1:RGECGG
nitrogen regulation protein I ntrC - Escherichia coli
Species: Escherichia coli
Date: 31-Mar-1990 #sequence_revision 31-Mar-1990 #text_change 17-Jul-1998
Accession: B30377; S40813; G65191; Q90553

Miranda-Rios, J.; Sanchez-Pescador, R.; Urdea, M.; Covarrubias, A.A.
Nucleic Acids Res. 15, 2757-2770, 1987
Title: The complete nucleotide sequence of the glnALG operon of Escherichia coli K12.
Reference number: A30377; MUID:87174797
  Accession: B30377
  Molecule type: DNA
  Residues: 1-468
  Cross-references: EMBL:X05173; NID:g41562; PID:g41565
  Experimental source: strain K-12

Plunkett III, G.; Burland, V.; Daniels, D.L.; Blattner, F.R.
Nucleic Acids Res. 21, 3391-3398, 1993
Title: Analysis of the Escherichia coli genome. III. DNA sequence of the region from 87.2 to 89.2 minutes.
Reference number: S40802
  Accession: S40813
  Status: nucleic acid sequence not shown; translation not shown
  Molecule type: DNA
  Residues: 1-141,'GEA',144-468
  Cross-references: EMBL:L19201; NID:g304961; PID:g304973
  Experimental source: strain K-12, substrain MG1655
  Note: the nucleotide sequence was submitted to the EMBL Data Library, October 1993

Blattner, F.R.; Plunkett III, G.; Bloch, C.A.; Perna, N.T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J.D.; Rode, C.K.; Mayhew, G.F.; Gregor, J.; Davis, N.W.; Kirkpatrick, H.A.; Goeden, M.A.; Rose, D.J.; Mau, B.; Shao, Y.
Science 277, 1453-1462, 1997
Title: The complete genome sequence of Escherichia coli K-12.
Reference number: A64720; MUID:97426617
  Accession: G65191
  Status: nucleic acid sequence not shown; translation not shown
  Molecule type: DNA
  Residues: 1-141,'GEA',144-468
  Cross-references: GB:AE000462; GB:U00096; NID:g1790295; PID:g1790299; UWGP:b3868
  Experimental source: strain K-12, substrain MG1655

Genetics:
  Gene: glnG; ntrC; glnT
  Map position: 87 min
Function:
  Description: de-uridylylated P-II forms a complex with nitrogen regulation protein II (ntrB); ntrB, when complexed with de-uridylylated P-II, dephosphorylates nitrogen regulation protein I (ntrC); the uridylylated form of P-II does not complex with ntrB; free ntrB phosphorylates nitrogen regulation protein I (ntrC)
  Note: phosphorylated nitrogen regulation protein I (ntrC) activates transcription of the glutamine synthase (glnA) gene via interaction with sigma-54 factor (DNA-looping) for transcription activation: assembly of a multimeric ntrC complex at the enhancer DNA sequence
Superfamily: nitrogen assimilation regulatory protein ntrC; response regulator homology; RNA polymerase sigma factor interaction domain homology
Keywords: ATP; DNA binding; P-loop; phosphoprotein; signal transduction; transcription regulation

Residues   Feature
6-115      Domain: response regulator homology
140-361    Domain: RNA polymerase sigma factor interaction domain homology
167-174    Region: nucleotide-binding motif A (P-loop) #status atypical
234-238    Region: nucleotide-binding motif B
54         Binding site: phosphate (Asp) (covalent) #status predicted

Summary: #length 468 #molecular_weight 52196

            5        10        15        20        25        30
  1 M Q R G I V W V V D D D S S I R W V L E R A L A G A G L T C
 31 T T F E N G A E V L E A L A S K T P D V L L S D I R M P G M
 61 D G L A L L K Q I K Q R H P M L P V I I M T A H S D L D A A
 91 V S A Y Q Q G A F D Y L P K P F D I D E A V A L V E R A I S
121 H Y Q E Q Q Q P R N V Q L N G P T T D I I A K P A M Q D V F
151 R I I G R L S R S S I S V L I N G E S G T G K E L V A H A L
181 H R H S P R A K A P F I A L N M A A I P K D L I E S E L F G
211 H E K G A F T G A N T I R Q G R F E Q A D G G T L F L D E I
241 G D M P L D V Q T R L L R V L A D G Q F Y R V G G Y A P V K
271 V D V R I I A A T H Q N L E Q R V Q E G K F R E D L F H R L
301 N V I R V H L P P L R E R R E D I P R L A R H F L Q V A A R
331 E L G V E A K L L H P E T E A A L T R L A W P G N V R Q L E
361 N T C R W L T V M A A G Q E V L I Q D L P G E L F E S T V A
391 E S T S Q M Q P D S W A T L L A Q W A D R A L R S G H Q N L
421 L S E A Q P E L E R T L L T T A L R H T Q G H K Q E A A R L
451 L G W G R N T L T R K L K E L G M E

ALIGNMENTS containing RGECGG:
  SA1144 nitrogen assimilation regulatory protein ntrC superfamily 2887.0
Associated Alignments:
  DA1066 response regulator homology
  DA1489 RNA polymerase sigma factor interaction domain homology
Related Links (Superfamily classification and Alignment):
  Protein Classification for Entry=RGECGG at MIPS, Germany.
  ProClass for Entry=RGECGG at Univ. of Texas, USA.
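The feature lines of the RGECGG entry give residue positions that are 1-based and inclusive. A minimal sketch of mapping those coordinates back onto the sequence follows. The sequence rows and the three features are copied from the entry; the helper `feature_residues` and the dictionary layout are assumptions made for illustration, not a PIR file format or API.

```python
# Sequence of PIR1:RGECGG, copied from the entry in rows of 30 residues.
ROWS = [
    "MQRGIVWVVDDDSSIRWVLERALAGAGLTC",
    "TTFENGAEVLEALASKTPDVLLSDIRMPGM",
    "DGLALLKQIKQRHPMLPVIIMTAHSDLDAA",
    "VSAYQQGAFDYLPKPFDIDEAVALVERAIS",
    "HYQEQQQPRNVQLNGPTTDIIAKPAMQDVF",
    "RIIGRLSRSSISVLINGESGTGKELVAHAL",
    "HRHSPRAKAPFIALNMAAIPKDLIESELFG",
    "HEKGAFTGANTIRQGRFEQADGGTLFLDEI",
    "GDMPLDVQTRLLRVLADGQFYRVGGYAPVK",
    "VDVRIIAATHQNLEQRVQEGKFREDLFHRL",
    "NVIRVHLPPLRERREDIPRLARHFLQVAAR",
    "ELGVEAKLLHPETEAALTRLAWPGNVRQLE",
    "NTCRWLTVMAAGQEVLIQDLPGELFESTVA",
    "ESTSQMQPDSWATLLAQWADRALRSGHQNL",
    "LSEAQPELERTLLTTALRHTQGHKQEAARL",
    "LGWGRNTLTRKLKELGME",
]
SEQUENCE = "".join(ROWS)

# Features copied from the entry: description -> (first, last), 1-based inclusive.
FEATURES = {
    "nucleotide-binding motif A (P-loop)": (167, 174),
    "nucleotide-binding motif B": (234, 238),
    "phosphate binding site (Asp)": (54, 54),
}


def feature_residues(sequence: str, first: int, last: int) -> str:
    """Return residues first..last using 1-based inclusive coordinates."""
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("feature lies outside the sequence")
    return sequence[first - 1:last]


for description, (first, last) in FEATURES.items():
    print(f"{first}-{last}\t{feature_residues(SEQUENCE, first, last)}\t{description}")
```

Checking that the binding site at residue 54 is an aspartate, and that the total length equals the #length value in the Summary line, is a quick consistency test on a transcribed entry.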
PIR-ALN section of the Protein Alignment Database, release 21.02, 23-Oct-1998, assembled and annotated by the PIR-International.
PIRALN:SA1144
nitrogen assimilation regulatory protein ntrC superfamily 2887.0
Date: 11-Aug-1994 #sequence_revision 05-Dec-1997 #text_change 20-Jun-1998
Members: RGECGG; RGKBCP; S42745; A26934; B26499; A38449; B33862
  RGECGG nitrogen regulation protein I ntrC - Escherichia coli
  RGKBCP nitrogen regulation protein I ntrC - Klebsiella pneumoniae
  S42745 nitrogen assimilation regulatory protein ntrC - Azospirillum brasilense
  A26934 nitrogen assimilation regulatory protein ntrC - Rhizobium meliloti
  B26499 nitrogen assimilation regulatory protein ntrC - Bradyrhizobium sp.
  A38449 regulatory protein algB - Pseudomonas aeruginosa
  B33862 transcription regulator hydG - Escherichia coli
Cross-references: PCF:A00579
Superfamily: nitrogen assimilation regulatory protein ntrC; response regulator homology; RNA polymerase sigma factor interaction domain homology
Placement: 2887.0
Other members: PL0151; S23901; S53024; I39494; I39719; S18622; S36203; B64992; S19606; C33586; B26981; S18625; A38533; S35232; S32951; A41896; S26601; A65033; S04376; S49540; S71029; B70195; H70320; C70396; D70315; S70529
Cross-references: MIPSALN:M03321; PIRALN:DA1066; MIPSALN:M07032; PIRALN:DA1489; MIPSALN:M20642
Comment: This superfamily has 16 families and 33 members.
Keywords: DNA binding; phosphoprotein; transcription regulation
Other keywords: ATP; signal transduction; two-component regulatory system; P-loop
Alignment: #sequences 7 #positions 492

               10        20        30        40        50        60
RGECGG MQRGIV-----WVVDDDSSIRWVLERALAGAGLTCTTFENGAEVLEALASKTPDVLLSDI
RGKBCP MQRGIA-----WIVDDDSSIRWVLERALTGAGLSCTTFESGNEVLDALTTKTPDVLLSDI
S42745 MSARTI-----LVADDDRAIRTVLTQALARLGHEVRTTGNASTLWRWVADGQGDLIITDV
A26934 MTGATI-----LVADDDAAIRTVLNQALSRAGYDVRITSNAATLWRWIAAGDGDLVVTDV
B26499 MPAGSI-----LVADDDTAIRTVLNQALSRAGYEVRLTGNAATLWRWVSQGEGDLVITDV
A38449 METTSEKQGRILLVDDESAILRTFRYCLEDEGYSVATASSAPQAEALLQRQVFDLCFLDL
B33862 MTHDNID---ILVVDDDISHCTILQALLRGWGYNVALANSGRQALEQVREQVFDLVLCDV
conser * . ...**. ...... .* .*. . . .. *. *.
consen MxxxxI LVVDDDxAIRTVLxxALxxAGYxVxTxxNAxxxxxxxxxxxxDLxxxDV

               70        80        90       100       110       120
RGECGG RMPGMDGLALLKQIKQRHPMLPVIIMTAHSDLDAAVSAYQQGAFDYLPKPFDIDEAVALV
RGKBCP RMPGMDGLALLKQIKQRHPMLPVIIMTAHSDLDAAVSAYQQGAFDYLPKPFDIDEAVALV
S42745 VMPDENGLDLIPRIKKIRPDLRIIVMSAQNTLITAVKAAERGAFEYLPKPFDLKELVSVV
A26934 VMPDENAFDLLPRIKKARPDLPVLVMSAQNTFMTAIKASEKGAYDYLPKPFDLTELIGII
B26499 VMPDENAFDLLPRIKKMRPNLPVIVMSAQNTFMTAIRPSERGAYEYLPKPFDLKELITIV
A38449 RLGEDNGLDVLAQMRVQAPWMRVVIVTAHSAVDTAVDAMQAGAVDYLVKPCSPDQLRLAA
B33862 RMAEMDGIATLKEIKALNPAIPVLIMTAYSSVETAVEALKTGALDYLIKPLDFDNLQATL
conser ... ...... .. * .......* . .*. . ** .**.**.. ... .
consen RMPxxNGLDLLxxIKxxxPxLPVIIMTAxSxxxTAVxAxxxGAxDYLPKPFDxDELxxxV

              130       140       150       160       170       180
RGECGG ERAI--SHYQEQQQPRNVQLNGPTTDIIAK-PAMQDVFRIIGRLSRSSISVLINGESGTG
RGKBCP DRAI--SHYQEQQQPRNAPINSPTADIIGEAPAMQDVFRIIGRLSRSSISVLINGESGTG
S42745 ERALNSNTPPAALPADAGEAD-EQLPLIGRSPAMQEIYRVLARLMGTDLTVTITGESGTG
A26934 GRAL--AEPKRRPSKLEDDSQ-DGMPLVGRSAAMQEIYRVLARLMQTDLTLMITGESGTG
B26499 GRAL--AEPKERVSSPADDGEFDSIPLVGRSPAMQEIYRVLARLMQTDLTVMISGESGTG
A38449 AKQLEVRQLTARLEALEDEVRRQGDGLESHSPAMAAVLETARQVAATDANILILGESGSG
B33862 EKAL---AHTHSIDAETPAVTASQFGMVGKSPAMQHLLSEIALVAPSEATVLIHGDSGTG
conser ... . . ..**. . ... .. ...* *.**.*
consen xRAL xxxxxxxxxxxxxx xxxxLxGxSPAMQxxxRxxARLxxTDxTVLIxGESGTG

              190       200       210       220       230       240
RGECGG KELVAHALHRHSPRAKAPFIALNMAAIPKDLIESELFGHEKGAFTGANTIRQGRFEQADG
RGKBCP KELVAHALHRHSPRAKAPFIALNMAAIPKDLIESELFGHEKGAFTGANTVRQGRFEQADG
S42745 KELVARALHDYGKRRNGPFVAINMAAIPRELIESELFGHEKGAFTGATNRSTGRFEQAQG
A26934 KELVARALHDYGKRRNGPFVAINMAAIPRDLIESELFGHEKGAFTGAQTRSTGRFEQAEG
B26499 KELVARALHDYGRRRNGPFVAVNMAAIPRDLIESELFGHERGAFTGANTRASGRFEQAEG
A38449 KGELARAIHTWSKRAKKPQVTINCPSLTAELMESELFGHSRGAFTGATESTLGRVSQADG
B33862 KELVARAIHASSARSEKPLVTLNCAALNESLLESELFGHEKGAFTGADKRREGRFVEADG
conser *...*.*.* . * *... *..... .*.*******..****** .. **...*.*
consen KELVARALHxxSxRxxxPFVAxNMAAIPxDLIESELFGHEKGAFTGAxTRxxGRFEQADG

              250       260       270       280       290       300
RGECGG GTLFLDEIGDMPLDVQTRLLRVLADGQFYRVGGYAPVKVDVRIIAATHQNLEQRVQEGKF
RGKBCP GTLFLDEIGDMPLDVQTRLLRVLADGQFYRVGGYAPVKVDVRIIAATHQNLELRVQEGKF
S42745 GTLFLDEIGDMPLEAQTRLLRVLQEGEYTTVGGRTPIKTDVRIVAATHRDLRTLIRQGLF
A26934 GTLFLDEIGDMPMDAQTRLLRVLQQGEYTTVGGRTPIRSDVRIVAATNKDLKQSINQGLF
B26499 GTLFLDEIGDMPMEAQTRLLRVLQQGEYTTVGGRTPIKTDVRIVAASNKDLRILIQQGLF
A38449 GTLFLDEIGDFPLTLQPKLLRFIQDKEYERVGDPVTRRADVRILAATNRDLGAMVAQGQF
B33862 GTLFLDEIGDISPMMQVRLLRAIQEREVQRVGSNQIISVDVRLIAATHRDLAAEVNAGRF
conser **********... *..***... ... .**. ... ***. **.. .* . .* *
consen GTLFLDEIGDMPLxxQTRLLRVLQxGEYxRVGGxxPIKxDVRIxAATHxDLxxxVxQGxF

              310       320       330       340       350       360
RGECGG REDLFHRLNVIRVHLPPLRERREDIPRLARHFLQVAARELGVEAKLLHPETEAALTRLAW
RGKBCP REDLFHRLNVIRVHLPPLRERREDIPRLARHFLQIAARELGVEAKQLHPETEMALTRLAW
S42745 REDLFYRLCVVPIRLPPLRERTEDVPLLVRHFLNQCSAQ-GLPVKSIDQPAMDRLKRYRW
A26934 REDLYYRLNVVPLRLPPLRDRAEDIPDLVRHFVQQAEKE-GLDVKRFDQEALELMKAHPW
B26499 REDLFFRLNVVPLRVPPLRERIEDLPDLIRHFFSLAEKD-GLPPKKLDAQALERLKQHRW
A38449 REDLLYRLNVIVLNLPPLRERAEDILGLAERFLARFVKDYGRPARGFSEAAREAMRQYPW
B33862 RQDLYYRLNVVAIEVPSLRQRREDIPLLAGHFLQRFAERNRKAVKGFTPQAMDLLIHYDW
conser *.**..**.*. .*.**.* **.. *...*.. . . . . . *
consen REDLFYRLNVVxxxLPPLRERxEDIPxLARHFLQxAxxx GxxxKxxxxxAxxxLxxxxW

              370       380       390       400       410       420
RGECGG PGNVRQLENTCRWLTVMAAGQEVLIQDLPGELFESTVAESTSQMQPDSWATL-LAQWADR
RGKBCP PGNVRQLENTCRWLTVMAAGQEVLTQDLPSELFETAIPDNPTQMLPDSWATL-LGQWADR
S42745 PGNVRELENLVRRLAALYS-QEVIGLDVVEAELADTTPAAQPVEEPQGEG---LSAAVER
A26934 PGNVRELENLVRRLTALYP-QDVITREIIENELRSEIPDSPIEKAAARSGSLSISQAVEE
B26499 PGNVRELENLARRLAALYP-QDVITASVIDGEL---APPAVTSGSTATVGVDNLGGAVEA
A38449 PGNVRELRNVIERASIICNQELVDVDHLGFSAA-------QSASSAPRIGE-SLS-----
B33862 PGNIRELENAVERAVVLLTGEYISERELPLAIASTPIPLGQSQDIQP-------------
conser ***.*.*.* ... . . . . . . .
consen PGNVRELENxxRRLxxLxx QxVxxxxLxxxxx P xxxxxxx G L

              430       440       450       460       470       480
RGECGG ALRSGHQNLLSEAQP---------ELERTLLTTALRHTQGHKQEAARLLGWGRNTLTRKL
RGKBCP ALRSGHQNLLSEAQP---------EMERTLLTTALRHTQGHKQEAARLLGWGRNTLTRKL
S42745 HLKDYFAAHKDGMPSNGLYDRVLREVERPLISLSLSATRGNQIKAAQLLGLNRNTLRKKI
A26934 NMRQYFASFGDALPPSGLYDRVLAEMEYPLILAALTATRGNQIKAADLLGLNRNTLRKKI
B26499 YLSSHFSGFPNGVPPPGLYHRILKEIEIPLLTAALAATRGNQIRAADLLGLNRNTLRKKI
A38449 ----------------------LEDLEKAHITAVM-ASSATLDQAAKTLGIDASTLYRKR
B33862 ----------------------LVEVEKEVILAALEKTGGNKTEAARQLGITRKTLLAKL
conser . . . . * ...... .. .. ** .** ..** *
consen L P L ExExxLITAAL ATxGNxxxAAxLLGxxRNTLxxKx

              490
RGECGG KELGME
RGKBCP KELGME
S42745 RDLDIQVVRGLK
A26934 RELGVSVYRSLA
B26499 RDLDIQVYRSGG
A38449 KQYGL
B33862 SR
conser ..
consen xxLG

Matrix:        Number of differences
               1    2    3    4    5    6    7
1 RGECGG       .   36  269  272  274  318  278
2 RGKBCP       7    .  273  272  272  318  278
3 S42745      56   56    .  156  152  318  291
4 A26934      56   56   32    .  126  313  291
5 B26499      57   56   31   26    .  323  304
6 A38449      66   66   65   64   67    .  267
7 B33862      59   59   61   61   63   58    .
               Percent difference
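The difference matrix at the end of the alignment reports, for each pair of sequences, a number of differences and a percent difference. The sketch below shows one plausible way to count differences between two aligned rows. The exact rule PIR-ALN uses for gaps and for the percentage denominator is not stated here, so this code assumes that columns where both sequences have a gap are skipped and that a gap opposite a residue counts as a difference. It is not claimed to reproduce the published matrix. The three rows are the first 60 alignment positions copied from the alignment; `count_differences` and `difference_matrix` are hypothetical names.

```python
from itertools import combinations

GAP = "-"

# First 60 alignment positions of three members, copied from the alignment.
BLOCK_1 = {
    "RGECGG": "MQRGIV-----WVVDDDSSIRWVLERALAGAGLTCTTFENGAEVLEALASKTPDVLLSDI",
    "RGKBCP": "MQRGIA-----WIVDDDSSIRWVLERALTGAGLSCTTFESGNEVLDALTTKTPDVLLSDI",
    "S42745": "MSARTI-----LVADDDRAIRTVLTQALARLGHEVRTTGNASTLWRWVADGQGDLIITDV",
}


def count_differences(row_a: str, row_b: str):
    """Return (differences, compared columns) for two aligned rows.

    Assumption: columns where both rows have a gap are ignored, and a gap
    opposite a residue counts as one difference.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    differences = compared = 0
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences, compared


def difference_matrix(alignment):
    """Map each unordered pair of IDs to (differences, percent difference)."""
    matrix = {}
    for id_a, id_b in combinations(alignment, 2):
        diffs, compared = count_differences(alignment[id_a], alignment[id_b])
        matrix[(id_a, id_b)] = (diffs, 100.0 * diffs / compared)
    return matrix


for pair, (diffs, percent) in difference_matrix(BLOCK_1).items():
    print(pair, diffs, f"{percent:.0f}%")
```

Running the same functions over all 492 positions and all seven members would give a full matrix under these assumptions, which can then be compared against the published values to infer the rule PIR-ALN actually applies.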
The World Wide Web provides the primary means to access the PIR-International Protein Sequence Database. The PIR home page is found at: http://www-nbrf.georgetown.edu/pir.
The PIR Web site is undergoing a major hardware and software upgrade. The new PIR home page is shown below. The upgraded Web site will be available to the public by December 1, 1998 and contain the following important features: