Thursday, February 23, 2012
Peptide Validator vs. Percolator, Part 3
I finally got around to the original citation: "Semi-supervised learning for peptide identification from shotgun proteomics datasets."
The paper begins by describing how false-discovery rates (FDRs) are normally used in proteomics. The FDR is applied after all the peptides have been scored: if the FDR is set at 5%, a score cutoff is chosen so that roughly 5% of the peptides passing it are expected to be false matches, and everything scoring below the cutoff is dropped as artificial.
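To make the cutoff idea concrete, here is a minimal sketch of standard target-decoy FDR thresholding. The function name, the toy scores, and the decoy-counting estimate (#decoy hits / #target hits above a threshold) are my own illustration, not code from the paper:

```python
def fdr_threshold(target_scores, decoy_scores, fdr=0.05):
    """Return the lowest score cutoff at which the estimated FDR
    (decoy hits / target hits at or above the cutoff) stays <= fdr.
    Returns None if no cutoff achieves the requested FDR."""
    best = None
    # Try each observed target score as a candidate cutoff, highest first.
    for t in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= t for s in target_scores)
        n_decoy = sum(s >= t for s in decoy_scores)
        if n_target and n_decoy / n_target <= fdr:
            best = t  # keep relaxing the cutoff while the FDR holds
    return best
```

Everything at or above the returned cutoff is accepted; everything below is discarded, which is exactly the all-or-nothing behavior the authors object to.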
The authors note that this can be a problem: peptides below the cutoff may in fact be real, and bad peptides above it may still slip through, all on the basis of an arbitrary cutoff.
Percolator works by creating a decoy database of scrambled peptide sequences, presumably generated from the FASTA file you are using, which serve as negative examples. The best-scoring peptides from the real search are used as positive examples. Percolator uses these examples to train a "machine learning algorithm" (a support vector machine), then re-scores everything and repeats.
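The semi-supervised loop can be sketched roughly as follows. This is my own toy illustration, not Percolator's actual code: a simple perceptron stands in for the SVM, "top half of targets by current score" stands in for the paper's FDR-based positive selection, and the feature lists are hypothetical:

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def percolator_sketch(target_feats, decoy_feats, rounds=3, lr=0.1, epochs=20):
    """Each round: score all target PSMs with the current weights, take the
    top-scoring half as confident positives, train a linear classifier
    against the decoys (negatives), and re-score with the new weights."""
    w = [0.0] * len(target_feats[0])
    # Bootstrap with the first feature (e.g. XCorr) as the initial score.
    scores = [x[0] for x in target_feats]
    for _ in range(rounds):
        ranked = sorted(zip(scores, target_feats), reverse=True)
        positives = [x for _, x in ranked[: len(ranked) // 2]]
        data = [(x, 1) for x in positives] + [(x, -1) for x in decoy_feats]
        for _ in range(epochs):
            random.shuffle(data)
            for x, y in data:
                if y * dot(w, x) <= 0:  # misclassified: perceptron update
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        scores = [dot(w, x) for x in target_feats]  # re-score all targets
    return w, scores
```

The key trick is that no hand-labeled training set is needed: the decoys are guaranteed negatives, and the iteration bootstraps its own positives from the data.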
The paper goes on to show that Percolator works better than XCorr plus an FDR cutoff by evaluating a tryptic digest searched with Sequest with no enzyme specificity selected.
All in all, they provide a pretty convincing argument that this works well.
Problem: in their analysis, the Sequest search took 3 days to process the sample, while the Percolator analysis required only 4 additional minutes.
As noted in my previous entry, the absolute rate-limiting step in my analyses is the Percolator module. On single runs, Percolator takes 2 to 2.5 times as long as Peptide Validator. On a complex data set, Percolator has added as much as 2 days to my processing time. I don't know why this module takes so much longer than the one described in the paper, but until this is resolved, high-throughput labs like mine are going to find it too much of a handicap to employ.