Wednesday, August 17, 2016
In-depth study of protein inference!
August and September are a crazy time for my day job, so the blog is probably going to be worse than usual for a while...so if I write about something during this time, it's 'cause I really, really like it.
Case in point:
(Direct link here)
These guys totally had a protein inference party! Protein inference is the problem that I think is really well described in the image at the top (this supposedly was taken from the wall in a 1st grade class....this is definitely a better school than the one I went to....).
We KNOW the Peptide Spectral Match (probably...). Our search engines are great at that: "This MS1 mass, and the MS/MS fragmentation of the area around that MS1 mass, matches this peptide from our FASTA index". The tricky part is working out which protein is actually present, since the same peptide sequence can show up in more than one protein.
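To make the problem concrete, here is a minimal sketch of one simple way to go from peptides to proteins: a greedy "parsimony" pass that keeps picking whichever protein explains the most still-unexplained peptides. This is only an illustration and not how any of the tools in the paper work. The peptide and protein names, and the function greedy_parsimony, are made up for the example.

```python
def greedy_parsimony(peptide_to_proteins):
    """Pick a small set of proteins that explains every observed peptide."""
    # Invert the map: protein -> set of observed peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    inferred = []
    while unexplained and protein_to_peptides:
        # Take the protein that explains the most still-unexplained peptides.
        # Candidates are sorted first, so ties go to the alphabetically
        # earliest name and the result is deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
        )
        if not protein_to_peptides[best] & unexplained:
            break  # leftover peptides map to no protein at all
        inferred.append(best)
        unexplained -= protein_to_peptides[best]
    return inferred


# Hypothetical search results: each matched peptide and the FASTA entries
# that contain it.
observed = {
    "PEPTIDEA": ["ProtX"],
    "PEPTIDEB": ["ProtX", "ProtY"],
    "PEPTIDEC": ["ProtY", "ProtZ"],
}
result = greedy_parsimony(observed)
print(result)
```

In this toy data ProtX is an easy call because PEPTIDEA is unique to it. PEPTIDEC is the headache: once PEPTIDEB is already explained by ProtX, ProtY and ProtZ explain it equally well, and the sketch only picks ProtY because of the alphabetical tie-break. That arbitrary choice is exactly the part the real inference algorithms try to handle more sensibly.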
So this group of slackers did what anyone with the technical capabilities would do: they took most of the protein inference algorithms and put them into the same operating environment. They used something called KNIME, which is an open-source workflow platform where you build an analysis by wiring processing nodes together. To get everything working together they assembled an OpenMS workflow within this environment. Of course, they made it all available to download on GitHub here (under the really cool name, KNIME-OMICS).
Once they got everything operating under the same technical conditions and parameters... Wait. Let me describe everything first:
They used the search engines: Mascot, X!Tandem and MS-GF+
And then the inference algorithms: FIDO, PIA, ProteinProphet, ProteinLP and MSBayesPro (I don't know the last 2. No time to investigate.)
They picked 4 datasets of varying levels of complexity from public repositories. They range from a yeast digest all the way up to a lung cancer analysis. Then they go to work. I'd like to mention that the paper is really well written. No guesswork regarding which settings they used for which algorithm. Every one that I have the concentration to really look at, on this little sleep, seems clearly detailed.
Good news for us non-programmers in the Proteome Discoverer world, 'cause FIDO seems to perform really well in these studies: of all the inference algorithms, it provides the highest number of unique proteins inferred as the number of databases increases. Go FIDO!
Probably the coolest conclusion is something we've probably all observed a little: increasing the number of search algorithms doesn't always increase the number of proteins inferred at the end. But it does increase the number of peptides, which increases the strength and accuracy of our inferences.
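Here is a toy way to see how that can happen. It is a sketch with made-up identifications, not data from the paper: the peptide and protein names and the combine helper are all hypothetical, and each peptide is assumed to map to exactly one protein to keep things simple.

```python
# Hypothetical identifications per search engine: peptide -> protein.
engine_hits = {
    "Mascot": {"PEPA": "Prot1", "PEPB": "Prot2"},
    "X!Tandem": {"PEPA": "Prot1", "PEPC": "Prot1"},
    "MS-GF+": {"PEPB": "Prot2", "PEPD": "Prot2"},
}


def combine(hits, engines):
    """Merge the peptide lists of the chosen engines and count what we have."""
    peptides = {}
    for name in engines:
        peptides.update(hits[name])
    proteins = set(peptides.values())
    return len(peptides), len(proteins)


# Add one engine at a time and record (peptide count, protein count).
names = list(engine_hits)
summary = [combine(engine_hits, names[:n]) for n in (1, 2, 3)]
for n, (n_peptides, n_proteins) in zip((1, 2, 3), summary):
    print(f"{n} engine(s): {n_peptides} peptides, {n_proteins} proteins")
```

In this made-up case every extra engine adds a new peptide, but each new peptide lands on a protein that was already on the list. The protein count stays flat while the evidence behind each protein gets deeper.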
Despite the fact we didn't see Sequest or Comet employed, I think we can infer from the other 3 algorithms and the strength of the observations that what they show here would reproduce well in the most used search algorithms as well. This is the most thorough and best controlled study I've ever seen on protein inference, so I'll definitely take it!