The University of Washington has been nice enough to provide us with some of the best software ever developed for mass spectrometry -- for free. I discovered two new ones that I'd never even heard of while looking for an image for this post. Luckily, I have expert <ctrl +> skills!
Percolator is still the one I use the most, since it is integrated into commercial proteomics packages like PD and Mascot. As good as this algorithm is at distinguishing good peptide matches from bad ones (it is the gold standard -- for good reason!), it has one drawback: it isn't the fastest part of your data processing pipeline, and on 1e6 or more spectra it may be your bottleneck.
Enter Percolator 3.0 as described in the JASMS that was delivered yesterday!
What is it good for?
1) Big data sets
2) Fast and accurate protein inference
3) Do you need more than this?
How did they test it? They Percolated the entire JHU Human Proteome Draft Map (plus two other datasets: one huge, and one small yeast proteome).
The JHU dataset is 2.7e7 MS/MS spectra alone. I honestly think I could do this on my Destroyer with the Percolator in PD 2.2, but it might take days.
In a search I did this week, 3e5 spectra took roughly 1 hour to Percolate -- so if we assume this would scale linearly, that's about 90 hours for the whole JHU Draft Map dataset.
...and they could Percolate it in MINUTES. And through some bioinformatic magic, they were able to do this with only 30GB of RAM!
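The back-of-envelope estimate above is just a ratio; here it is spelled out (using my benchmark numbers, which are of course specific to my hardware and search settings):

```python
# Back-of-envelope estimate: if ~3e5 spectra take ~1 hour to Percolate,
# how long would the 2.7e7-spectrum JHU Draft Map take at linear scaling?
benchmark_spectra = 3e5   # spectra in my benchmark search
benchmark_hours = 1.0     # wall-clock time for that search
jhu_spectra = 2.7e7       # MS/MS spectra in the JHU Draft Map

estimated_hours = jhu_spectra / benchmark_spectra * benchmark_hours
print(f"Linear-scaling estimate: ~{estimated_hours:.0f} hours")  # ~90 hours
```

Ninety hours versus minutes -- that's the scale of the speedup they're reporting.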
It gets even better. Percolator is a machine learning algorithm, and what it trains on has a lot to do with how effectively it works. They improved the training sets and methods, and the darned thing works even better! Remember the small yeast proteome dataset I mentioned above? They show that even though Percolator 3.0 can tear through world-class datasets in terms of size, it still works just as well on small ones!
In terms of protein inference, the paper walks you through the reasoning behind their approach. They settle on one that not only produces a good protein-level score, but also adjusts for the length of the protein (something that not all inference algorithms take into account -- or, more commonly, do in a really dumb way, but let's talk about that some other time!). The observations they make on the different inference adjustments alone are worth analyzing at length!
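To see why protein length matters at all, here is a toy sketch -- emphatically NOT the paper's actual formula, just a hypothetical adjustment I made up for illustration. A long protein produces more candidate peptides, so it accumulates matches by chance more easily than a short one, and an unadjusted score rewards it for that:

```python
import math

# Toy illustration (hypothetical, not Percolator 3.0's method): penalize a
# summed peptide-evidence score by the log of protein length, so a long
# protein needs more evidence than a short one to reach the same score.
def length_adjusted_score(peptide_scores, protein_length, ref_length=500):
    """Sum peptide scores, minus a penalty that grows with protein length
    relative to an arbitrary reference length (all names/values invented)."""
    raw = sum(peptide_scores)
    penalty = math.log(protein_length / ref_length + 1)
    return raw - penalty

# Identical peptide evidence, very different protein lengths:
short = length_adjusted_score([2.0, 1.5], protein_length=300)
long_ = length_adjusted_score([2.0, 1.5], protein_length=3000)
# The short protein scores higher, since its evidence is less explainable
# by chance matching against a large peptide space.
```

Again, the real scoring and adjustment in the paper are more principled than this; the sketch only shows the direction of the correction.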
In a random aside -- I'd like to mention the search parameters that produced what they Percolated: semi-tryptic digestion with 2 missed cleavages -- on 2.7e7 spectra? Wow....
To wrap up -- this is another awesome advance in proteomics software, created and distributed for free by those guys in Seattle. The code is all available to download now, ready to make our lives easier and our data much, much better.