Pan Proteomes (PPs)


For each reference proteome cluster, also known as a representative proteome group (RPG) (Chen et al., 2011), the pan proteome is the set of all sequences in the reference proteome plus the unique protein sequences that are found in other proteomes of the cluster but not in the reference proteome. These additional sequences are identified by UniRef50 membership. Analogous to pan genomes, pan proteomes are useful for studies of proteome diversity, gene evolution, gene transfer and phylogenetic comparisons.

Algorithm

Proteome Cluster          UniRef50 A   UniRef50 B   UniRef50 C   UniRef50 D   UniRef50 E   ...
Proteome 1 (Reference)    A1           -            -            D1           E1           ...
Proteome 2                A2           -            C2           D2           E2           ...
Proteome 3                A3           B3           -            D3           E3           ...
Proteome 4                -            B4           -            D4           E4           ...
Proteome 5                A5           B5           -            D5           E5           ...

The pan proteome consists of those proteins in RED.


For each proteome cluster, or representative proteome group (RPG):
  1. Add all sequences from the reference proteome to the pan proteome.
  2. If the RPG contains non-reference proteomes (NRPs):
    a. Sort the NRPs by proteome priority score: NRP1, NRP2, ...
    b. For NRP1, from each UniRef50 cluster that has no protein sequence already in the pan proteome under construction, select the one protein with the top protein annotation score (AS).
    c. Repeat step b for each remaining NRP, in sorted order.
Note: The proteome priority score (PPS) is the same as the one we use for generating the reference proteome.
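
The steps above can be sketched in Python as follows. This is a minimal illustration, not the PIR implementation: the Protein record, the build_pan_proteome function and the input layout are hypothetical names introduced here. The sketch assumes that an NRP with a higher proteome priority score is visited first, and that ties on annotation score go to the first protein encountered.

```python
from typing import NamedTuple


class Protein(NamedTuple):
    """Hypothetical record for one protein in a proteome."""
    accession: str
    uniref50: str            # UniRef50 cluster the protein belongs to
    annotation_score: float  # protein annotation score (AS)


def build_pan_proteome(reference, non_reference):
    """Build the pan proteome for one RPG.

    reference:     list of Protein from the reference proteome.
    non_reference: list of (priority_score, [Protein, ...]) tuples,
                   one tuple per non-reference proteome (NRP).
    """
    # Step 1: every reference sequence goes into the pan proteome.
    pan = list(reference)
    covered = {p.uniref50 for p in reference}

    # Step 2a: visit NRPs by priority score (assumption: highest first).
    ordered = sorted(non_reference, key=lambda item: item[0], reverse=True)
    for _, proteins in ordered:
        # Step 2b: for each UniRef50 cluster not yet represented,
        # keep the protein with the top annotation score.
        best = {}
        for p in proteins:
            if p.uniref50 in covered:
                continue
            current = best.get(p.uniref50)
            if current is None or p.annotation_score > current.annotation_score:
                best[p.uniref50] = p
        pan.extend(best.values())
        covered.update(best)  # adds the newly represented cluster IDs
    return pan
```

Applied to the table above, and assuming priority scores that rank Proteome 2 through Proteome 5 in that order, this sketch keeps A1, D1 and E1 from the reference and then adds C2 from Proteome 2 and B3 from Proteome 3.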

Download PPs files

Co-membership Cutoff (%)    #RPGs    PP file
75                          18367    pp-75.txt
55                          11753    pp-55.txt
35                          5942     pp-35.txt
15                          1906     pp-15.txt

File format:
>PanProteomeID TaxonId OSCode OrgName TaxonGroup ProteomePriorityScore(PPS:RefP,PrevP,#PMID,ProteomeMeanAS,#Entry) CorrCutoff IsRefP
#UPID TaxonId #proteins
...
UniProtAC UPID TaxonId UniRef50
...

Example:
>Pan-Proteome_UP000001544       398511  BACPE   Bacillus pseudofirmus (strain OF4)      Bac/Firmicute   37117.16681(PPS:1,1,17,11.73,4119)      55(CUTOFF)      RefP
 #UP000001544   398511  4310
 #UP000017170   1188261 760
 A7LKG4 UP000001544     398511  UniRef50_O05267
 A7LKG5 UP000001544     398511  UniRef50_A7LKG5
 D3FPP4 UP000001544     398511  UniRef50_D6YVU9
 D3FPP5 UP000001544     398511  UniRef50_Q1GBQ0
 ...
 Q9RGZ4 UP000001544     398511  UniRef50_O05259
 Q9RGZ5 UP000001544     398511  UniRef50_Q9K2S2
 U6SH21 UP000017170     1188261 UniRef50_U6SH21
 U6SH81 UP000017170     1188261 UniRef50_E5WS36
 ...
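
A file in this format could be read with a small parser along the following lines. This is a sketch under stated assumptions, not a tool provided by PIR: the function name parse_pp_records and the dictionary keys are hypothetical, header columns are assumed to be separated by tabs or by runs of two or more spaces (the organism name itself contains single spaces), and lines consisting only of "..." are skipped.

```python
import re

# Column names taken from the ">" line of the file format description;
# the snake_case spellings are this sketch's own choice.
HEADER_FIELDS = (
    "pan_proteome_id", "taxon_id", "os_code", "org_name",
    "taxon_group", "pps", "cutoff", "is_refp",
)


def parse_pp_records(lines):
    """Yield one dict per pan proteome from an iterable of text lines."""
    record = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            continue
        if line.startswith(">"):
            # A new pan proteome starts; emit the previous one first.
            if record is not None:
                yield record
            fields = re.split(r"\s{2,}|\t", line[1:])
            record = {
                "header": dict(zip(HEADER_FIELDS, fields)),
                "proteomes": [],  # (UPID, TaxonId, #proteins)
                "members": [],    # (UniProtAC, UPID, TaxonId, UniRef50)
            }
        elif record is None:
            continue  # ignore anything before the first ">" line
        elif line.startswith("#"):
            upid, taxon_id, n_proteins = line[1:].split()
            record["proteomes"].append((upid, int(taxon_id), int(n_proteins)))
        else:
            accession, upid, taxon_id, uniref50 = line.split()
            record["members"].append((accession, upid, int(taxon_id), uniref50))
    if record is not None:
        yield record
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example `with open("pp-55.txt") as handle: records = list(parse_pp_records(handle))`. A line with an unexpected number of columns raises a ValueError rather than being silently dropped.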

Download PPs sequences

The pan proteome sequences for singleton and non-singleton reference proteome clusters (75% proteome similarity for Eukaryota and 55% proteome similarity for Archaea and Bacteria) can be downloaded from here.

The pan proteome sequences for non-singleton reference proteome clusters (75% proteome similarity for Fungus and 55% proteome similarity for Archaea and Bacteria) can be downloaded from here.

The tarball contains the following files:
  1. PPMembership.txt
     A tab-delimited, two-column file with a header. Column 1 contains the UPIds of the pan proteomes. Column 2 contains the UPIds of the pan proteome members.

  2. Compressed sequence files in FASTA format
     Each file is named UPxxxxxxxxx.fasta.gz, where UPxxxxxxxxx is the UPId of the pan proteome. The description line of each FASTA record is the standard UniProt FASTA header plus the UPId and the pan proteome ID (PPId).
     Example:
     >sp|Q9RGZ4|MRPB_BACPE Na(+)/H(+) antiporter subunit B OS=Bacillus pseudofirmus (strain OF4) GN=mrpB PE=1 SV=1 UPId=UP000001544 PPId=UP000001544
     MKNLKSNDVLLHTLTRVVTFIILAFSVYLFFAGHNNPGGGFIGGLMTASALLLMYLGFDM
     RSIKKAIPFDFTKMIAFGLLIAIFTGFGGLLVGDPYLTQYFEYYQIPILGETELTTALPF
     DLGIYLVVIGIALTIILTIAEDDM

  3. README
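
The two file types described above could be read with helpers like the ones below. This is a hedged sketch rather than code shipped in the tarball: the function names parse_pp_fasta_header and read_membership are hypothetical, and the sketch assumes only what the descriptions state, namely that the description line carries UPId= and PPId= tags and that PPMembership.txt has one header row followed by tab-separated pairs.

```python
import csv
import re

# Matches the two extra tags appended to the standard UniProt header.
_TAG = re.compile(r"\b(UPId|PPId)=(\S+)")


def parse_pp_fasta_header(header):
    """Return (accession, UPId, PPId) from one FASTA description line.

    UPId or PPId is None when the corresponding tag is absent.
    """
    # Standard UniProt headers start with "db|accession|entry_name ...".
    _db, accession, _rest = header.lstrip(">").split("|", 2)
    tags = dict(_TAG.findall(header))
    return accession, tags.get("UPId"), tags.get("PPId")


def read_membership(lines):
    """Map each pan proteome UPId to the list of its member UPIds.

    lines: any iterable of text lines, such as an open file handle.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    members = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        members.setdefault(row[0], []).append(row[1])
    return members
```

For the example record above, parse_pp_fasta_header returns the accession Q9RGZ4 with UP000001544 as both the UPId and the PPId. The sequence files are gzip-compressed, so they would typically be opened in text mode with Python's gzip.open(path, "rt") before the description lines are passed to the parser.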


©2018 Protein Information Resource