Wednesday, September 7, 2016
Origin of Disagreements in Tandem Mass Spectra!
When you search the same RAW file containing tandem mass spectra versus the same database using different search engines, you are going to see some disagreements in the results.
For example, if I take a proteomic sample from myself and I run it through Mascot and I run it through Sequest separately, the results are probably not going to be exactly the same. Mascot will identify some peptides that Sequest won't, and vice versa. It is also likely that I'll see a few MS/MS spectra that Sequest said were one sequence and Mascot said...were something different...
Considering that the database we're searching this against is constructed making some textbook assumptions and is starting from a DNA sequence....that is not mine....we do pretty darned good though!
Where do these disagreements come from? That is the topic of this new paper from Dominique Tessier et al., in this month's JPR. To evaluate this question, these researchers grab a cancer dataset from the Gygi lab off of PRIDE and then run some plant samples in house on an Orbitrap Velos using high/low (or...medium/low? 30k MS1 + Top5 ion trap MS/MS).
The RAW files are then searched with Mascot, MSGF+, X!Tandem, and TPP (presumably also using X!Tandem), and an analysis of the conflicts between the results is performed.
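To make "conflict" concrete, here is a minimal sketch of how I'd think about comparing two engines on the same spectra. This is not the authors' method. The function name `compare_engines`, the scan IDs, and the peptide sequences are all made up for illustration, and it assumes each engine's output has already been reduced to one top-ranked peptide per spectrum.

```python
from typing import Dict, Set


def compare_engines(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Set[str]]:
    """Sort spectrum IDs into agreement categories.

    a and b map a spectrum ID to the top peptide sequence that
    one search engine reported for that spectrum (hypothetical input shape).
    """
    shared = a.keys() & b.keys()                 # spectra both engines identified
    agree = {s for s in shared if a[s] == b[s]}  # same sequence from both
    return {
        "agree": agree,
        "conflict": shared - agree,              # both identified it, sequences differ
        "only_a": a.keys() - b.keys(),           # identified by the first engine only
        "only_b": b.keys() - a.keys(),           # identified by the second engine only
    }


# Made-up example results, not real data
mascot = {"scan_101": "SAMPLEPEPTIDEK", "scan_102": "TESTSEQR", "scan_103": "AAGHWR"}
sequest = {"scan_101": "SAMPLEPEPTIDEK", "scan_102": "TESTQESR", "scan_104": "VLDEYK"}

result = compare_engines(mascot, sequest)
for category, scans in result.items():
    print(category, sorted(scans))
```

The "conflict" bucket is the interesting one here: both engines were confident enough to report something for the spectrum, but they disagree on what it is.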
The results are interesting, and the processed results are more conflicting than I've ever seen. The authors develop a concept of "peptide space" and conclude that optimization of the search parameters for each engine is essential to getting the best and most overlapping data. They also note that in some versions of the software they utilize, the parameters they need to change to get the best data are sometimes not easily user accessible.
I think this is a nice study and a good look at some of the problems we have in the statistics behind the scenes. It is sometimes easy to forget these days what an enormous undertaking, from a mathematical perspective, developing all these tools has been over the last couple of decades. Today's proteomics researchers coming in can simply push a play button to get good results, and it's easy to take that for granted!
1) The RAW files were converted by different tools that I believe are quite different in their underlying mechanisms. I think this is a variable that should have been eliminated by using the same tool for every conversion. Would it have an effect? I dunno...but it's a variable that could be knocked out with 5 minutes more work.
2) PD 1.7? Wow, I don't have that one! ;)
3) I think the function of the search engines is being focused on 'cause it's the easiest thing to implicate. The FDR estimations employed were different for each engine. I think this could have a big impact on these results. I'd suspect that if FDR was controlled the same way for each of these results, the level of agreement would be a little better.
4) The in-house generated data is just a little weird. 30k MS1 followed by 5 MS/MS for plant fractions is going to yield only high copy number proteins and using a search parameter of 0.4 Da for the fragments is probably too tight and will affect the downstream results a little.
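On point 3, here is a rough sketch of what I mean by controlling FDR the same way for every engine. It is a simple target-decoy estimate (decoys / targets above a score cutoff), which is one common approach and not necessarily what any of these engines or the paper used. The function name `filter_at_fdr`, the scores, and the thresholds are all hypothetical, and it assumes higher scores are better.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); hypothetical minimal PSM record


def filter_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Return the target PSMs accepted under a simple target-decoy FDR cutoff.

    Walks down the score-ranked list and keeps the deepest position at which
    the estimated FDR (decoy hits / target hits so far) is still <= max_fdr.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cut = i  # FDR is still acceptable down to this rank
    # Report only the target hits above the cutoff
    return [p for p in ranked[:best_cut] if not p[1]]


# Made-up scores, not real data
psms = [(90.0, False), (85.0, False), (80.0, False),
        (70.0, True), (65.0, False), (60.0, True)]

strict = filter_at_fdr(psms, max_fdr=0.10)
loose = filter_at_fdr(psms, max_fdr=0.25)
print(len(strict), len(loose))
```

The idea would be to push every engine's raw scores through the same function with the same threshold, so that differences in the overlap come from the engines and not from how each one draws its cutoff.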
Again, minor criticisms from somebody who just does proteomics as a hobby. Please feel free to ignore! I do like this paper and I'm glad Twitter (PastelBio!) recommended it for my breakfast paper today.