Pan Proteomes (PPs)
For each reference proteome cluster, also known as a representative proteome group (RPG) (Chen et al., 2011), the pan proteome is the set of all sequences in the reference proteome plus the unique protein sequences found in other proteomes of the cluster but not in the reference proteome. These additional sequences are identified using UniRef50 membership. Analogous to pan genomes, pan proteomes are useful for studies of proteome diversity, gene evolution, gene transfer and phylogenetic comparisons.
Algorithm
Proteome Cluster | UniRef50 A | UniRef50 B | UniRef50 C | UniRef50 D | UniRef50 E | ... |
Proteome 1 (Reference) | A1 | - | - | D1 | E1 | ... |
Proteome 2 | A2 | - | C2 | D2 | E2 | ... |
Proteome 3 | A3 | B3 | - | D3 | E3 | ... |
Proteome 4 | - | B4 | - | D4 | E4 | ... |
Proteome 5 | A5 | B5 | - | D5 | E5 | ... |
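The pan proteome consists of the proteins shown in red. Applying the steps below with Proteomes 2 to 5 taken in that priority order gives A1, D1 and E1 from the reference proteome, plus C2 and B3.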
For each proteome cluster, or representative proteome group (RPG):
- Add all sequences from the reference proteome to the pan proteome.
- If the RPG contains non-reference proteomes (NRPs):
  a. Sort the NRPs by proteome priority score: NRP1, NRP2, ...
  b. For NRP1, from each UniRef50 cluster that does not yet have a protein sequence in the pan proteome under construction, select the one protein with the top protein annotation score (AS).
  c. Repeat step b for all remaining NRPs.
Note: The proteome priority score (PPS) is the same as the one we use for generating the reference proteome.
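The steps above can be sketched in Python as follows. This is a minimal illustration, not the production pipeline: the function name `build_pan_proteome`, the tuple layout, and the example scores are all hypothetical, and it assumes that a higher proteome priority score means the proteome is processed earlier (the text does not state the sort direction).

```python
from typing import Dict, List, Tuple

# A protein is (protein_id, uniref50_cluster, annotation_score).
Protein = Tuple[str, str, float]


def build_pan_proteome(reference: List[Protein],
                       nrps: List[Tuple[float, List[Protein]]]) -> List[str]:
    """Return the protein IDs of the pan proteome for one RPG.

    `nrps` pairs each non-reference proteome with its priority score.
    """
    # Step 1: every reference sequence goes into the pan proteome.
    pan = [protein_id for protein_id, _, _ in reference]
    covered = {cluster for _, cluster, _ in reference}

    # Step a: ASSUMPTION: highest priority score is processed first.
    for _, proteins in sorted(nrps, key=lambda item: item[0], reverse=True):
        # Step b: best-annotated protein per UniRef50 cluster not yet covered.
        best: Dict[str, Protein] = {}
        for protein in proteins:
            _, cluster, score = protein
            if cluster in covered:
                continue
            if cluster not in best or score > best[cluster][2]:
                best[cluster] = protein
        for cluster in sorted(best):
            pan.append(best[cluster][0])
            covered.add(cluster)
    return pan


# The example table, with made-up priority and annotation scores.
EXAMPLE_REFERENCE = [("A1", "A", 1.0), ("D1", "D", 1.0), ("E1", "E", 1.0)]
EXAMPLE_NRPS = [
    (4.0, [("A2", "A", 1.0), ("C2", "C", 1.0), ("D2", "D", 1.0), ("E2", "E", 1.0)]),
    (3.0, [("A3", "A", 1.0), ("B3", "B", 1.0), ("D3", "D", 1.0), ("E3", "E", 1.0)]),
    (2.0, [("B4", "B", 1.0), ("D4", "D", 1.0), ("E4", "E", 1.0)]),
    (1.0, [("A5", "A", 1.0), ("B5", "B", 1.0), ("D5", "D", 1.0), ("E5", "E", 1.0)]),
]

print(build_pan_proteome(EXAMPLE_REFERENCE, EXAMPLE_NRPS))
```

With these illustrative scores, Proteome 2 contributes C2 and Proteome 3 contributes B3; Proteomes 4 and 5 add nothing because every UniRef50 cluster they contain is already represented.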
Download PPs files
File format:
>PanProteomeID TaxonId OSCode OrgName TaxonGroup ProteomePriorityScore(PPS:RefP,PrevP,#PMID,ProteomeMeanAS,#Entry) CorrCutoff IsRefP
#UPID TaxonId #proteins
...
UniProtAC UPID TaxonId UniRef50
...
Example:
>Pan-Proteome_UP000001544 398511 BACPE Bacillus pseudofirmus (strain OF4) Bac/Firmicute 37117.16681(PPS:1,1,17,11.73,4119) 55(CUTOFF) RefP
#UP000001544 398511 4310
#UP000017170 1188261 760
A7LKG4 UP000001544 398511 UniRef50_O05267
A7LKG5 UP000001544 398511 UniRef50_A7LKG5
D3FPP4 UP000001544 398511 UniRef50_D6YVU9
D3FPP5 UP000001544 398511 UniRef50_Q1GBQ0
...
Q9RGZ4 UP000001544 398511 UniRef50_O05259
Q9RGZ5 UP000001544 398511 UniRef50_Q9K2S2
U6SH21 UP000017170 1188261 UniRef50_U6SH21
U6SH81 UP000017170 1188261 UniRef50_E5WS36
...
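A file in this layout could be read with a short parser like the one below. It is a sketch under stated assumptions: the class and function names are hypothetical, fields on the `#` and member lines are assumed to be whitespace-separated with no embedded spaces, and the `>` header is kept as a raw string because the organism name contains spaces and the exact delimiter is not stated here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class PanProteome:
    header: str  # raw '>' line with the leading '>' removed
    # UPID -> (TaxonId, number of proteins), from the '#' lines
    proteomes: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    # (UniProtAC, UPID, TaxonId, UniRef50) for each member line
    members: List[Tuple[str, str, str, str]] = field(default_factory=list)


def parse_pan_proteomes(lines: Iterable[str]) -> List[PanProteome]:
    """Group '#' and member lines under the preceding '>' header."""
    records: List[PanProteome] = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":  # skip blanks and documentation ellipses
            continue
        if line.startswith(">"):
            records.append(PanProteome(header=line[1:]))
        elif line.startswith("#"):
            upid, taxon_id, n_proteins = line[1:].split()
            records[-1].proteomes[upid] = (taxon_id, int(n_proteins))
        else:
            accession, upid, taxon_id, uniref50 = line.split()
            records[-1].members.append((accession, upid, taxon_id, uniref50))
    return records
```

For example, `parse_pan_proteomes(open(path))` would return one `PanProteome` per `>` header, where `path` is wherever the downloaded file was saved.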
Download PPs sequences
The pan proteome sequences for singleton and non-singleton reference proteome clusters (75% proteome similarity for Eukaryota and 55% proteome similarity for Archaea and Bacteria) can be downloaded from here.
The pan proteome sequences for non-singleton reference proteome clusters (75% proteome similarity for Fungus and 55% proteome similarity for Archaea and Bacteria) can be downloaded from here.
The tarball contains the following files:
- PPMembership.txt
  A tab-delimited, two-column file with a header. Column 1 contains the UPIds of the pan proteomes; column 2 contains the UPIds of the pan proteome members.
- Compressed sequence files in FASTA format
  Each file is named UPxxxxxxxxx.fasta.gz, where UPxxxxxxxxx is the UPId of the pan proteome.
  The description line of each FASTA record is the standard UniProt FASTA header plus the UPId and the pan proteome ID (PPId).
Example:
>sp|Q9RGZ4|MRPB_BACPE Na(+)/H(+) antiporter subunit B OS=Bacillus pseudofirmus (strain OF4) GN=mrpB PE=1 SV=1 UPId=UP000001544 PPId=UP000001544
MKNLKSNDVLLHTLTRVVTFIILAFSVYLFFAGHNNPGGGFIGGLMTASALLLMYLGFDM
RSIKKAIPFDFTKMIAFGLLIAIFTGFGGLLVGDPYLTQYFEYYQIPILGETELTTALPF
DLGIYLVVIGIALTIILTIAEDDM
- README
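The two file types in the tarball could be read along the following lines. This is a hedged sketch: the function names are hypothetical, and it relies only on what is described above (a tab-delimited membership file with one header row, and `UPId=` / `PPId=` tags appended to a standard UniProt FASTA description line).

```python
import csv
import re
from typing import Dict, Iterable, List

_TAG = re.compile(r"\b(UPId|PPId)=(\S+)")


def read_membership(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each pan proteome UPId to the UPIds of its member proteomes."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    membership: Dict[str, List[str]] = {}
    for row in reader:
        if len(row) >= 2:
            membership.setdefault(row[0], []).append(row[1])
    return membership


def parse_pp_fasta_header(header: str) -> Dict[str, str]:
    """Extract the accession, UPId and PPId from a FASTA description line."""
    fields = dict(_TAG.findall(header))
    # Standard UniProt headers start with '>db|accession|entry_name'.
    parts = header.lstrip(">").split("|")
    if len(parts) >= 3:
        fields["accession"] = parts[1]
    return fields
```

The sequence files are gzip-compressed, so they can be opened in text mode with `gzip.open(path, "rt")` from the standard library and each line starting with `>` passed to `parse_pp_fasta_header`.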