Saturday, January 9, 2016
Are there really missing values in normal (DDA) proteomics data?
A common premise in proteomics over the last few years is that our normal shotgun proteomics approaches -- in particular, data-dependent acquisition (DDA) techniques -- suffer from something called "missing values." This statement has been parroted about quite a bit, but is it actually true?
Not according to this very nice new paper in MCP from Bo Zhang et al. First of all, let me say that there is a lot of good stuff in this study, but let me pull my favorite quote out:
"Contrary to what [is] frequently being claimed, missing values do not have to be an intrinsic problem of DDA approaches that perform quantification at the MS1 level."
This isn't exactly revolutionary, of course -- many people have made this statement (there were very nice posters to this effect at ASMS the last couple of years from the Qu lab) -- but it sure is nice to get things like that out of the way.
Here is the central premise: The few studies and loads of marketing material that claim missing values in DDA data are focusing on one thing: that within a single quantitative digital proteomic map (i.e., RAW data file) we will not fragment every possible ion. And this is obviously true.
So how do these researchers contest this point? By pointing out that if we have high-resolution, accurate-mass MS1 scans, we don't need to fragment every ion in every single run. If our goal is to compare sample A and sample B, and we fragment and identify a peptide in sample A, then we do not have to fragment it in sample B. An accurate, high-resolution MS1 mass and retention time are enough to confirm that the peptide in sample A and the peptide in sample B are the same thing.
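To make that concrete, here is a minimal sketch of what such cross-run matching can look like: take a peptide identified in sample A and look for an MS1 feature in sample B that agrees within a mass tolerance (in ppm) and a retention-time window. The function names, tolerances, and toy numbers are my own illustration, not the actual DeMix-Q implementation.

```python
# Hypothetical sketch: match an identified peptide from run A to an
# unidentified MS1 feature in run B by accurate mass + retention time.
# Tolerances and data are illustrative only.

def ppm_error(mz_obs, mz_ref):
    """Mass error in parts per million."""
    return (mz_obs - mz_ref) / mz_ref * 1e6

def match_feature(identified, features, ppm_tol=5.0, rt_tol=0.5):
    """Return the MS1 feature (mz, rt, intensity) from another run with
    the smallest mass error inside both tolerance windows, or None."""
    best, best_err = None, None
    for mz, rt, intensity in features:
        err = abs(ppm_error(mz, identified["mz"]))
        if err <= ppm_tol and abs(rt - identified["rt"]) <= rt_tol:
            if best is None or err < best_err:
                best, best_err = (mz, rt, intensity), err
    return best

# Peptide fragmented and identified in sample A:
peptide_a = {"mz": 652.3370, "rt": 42.1}
# MS1 features detected in sample B as (mz, rt in minutes, intensity):
run_b = [(652.3371, 42.3, 1.8e6), (652.9903, 42.2, 9.1e5), (700.1000, 10.0, 2.0e5)]

print(match_feature(peptide_a, run_b))  # only the first feature is within ~5 ppm and 0.5 min
```

The key design point is the same one the paper relies on: at high mass accuracy, the ppm window is tight enough that a co-eluting feature with matching mass is very unlikely to be a different peptide.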
If you use the precursor ion area detector in Proteome Discoverer, the awesome OpenMS label-free quantification nodes, or LFQ in MaxQuant, this software makes that assumption automatically. So, how did this paper make MCP?
The answer is DeMix-Q, and you can get it from GitHub here. It controls the false discovery rate (FDR) by running target and decoy label-free quantification matches through an algorithm that considers many factors, including retention time.
What do you get at the end? Data that is complete. Did you run 400 samples? Imagine that in sample 237 you had an MS/MS fragmentation that was unique to that RAW file -- but there is a clear (though low) MS1 signal in every other dataset. This approach will allow you to quantify that peptide in all 400 samples! And it gives you a metric for your confidence via FDR!
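The target-decoy idea behind that confidence metric can be sketched in a few lines: score every cross-run match, score a matched set of decoy (known-wrong) matches the same way, and estimate FDR at any score cutoff as decoys passing divided by targets passing. This is a generic target-decoy sketch with made-up scores, not the scoring model DeMix-Q actually uses.

```python
# Generic target-decoy FDR sketch for cross-run feature matches.
# Decoy matches model the random-match rate; scores are illustrative.

def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimated FDR at a cutoff: decoys passing / targets passing."""
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return d / t if t else 0.0

def best_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays under max_fdr."""
    for thr in sorted(set(target_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, thr) <= max_fdr:
            return thr
    return None  # no cutoff reaches the requested FDR

# Toy match scores for real (target) and decoy cross-run matches:
targets = [0.95, 0.90, 0.85, 0.80, 0.40, 0.30]
decoys = [0.45, 0.35, 0.20, 0.10, 0.05, 0.02]

print(best_threshold(targets, decoys, max_fdr=0.05))  # accepts the four high-scoring matches
```

The point is that a propagated MS1-only quantification doesn't have to be taken on faith: every match carries a score, and the decoys tell you how often a score that good happens by chance.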
Can you get away with this on every instrument? Probably not. If you've got a lower-mass-accuracy instrument, you probably can't distinguish between peptides of similar m/z from run to run, and you are going to see a lot of false measurements. In that case you are probably better off taking the big hit in dynamic range and using a data-INdependent method like SWATH so you can back up your identifications with fragmentation data in every single run. For those of us, though, who have to go after the really low copy number proteins, or who have collaborators looking for complete proteome coverage, it looks like we'll still be a lot better off with smarter data processing of DDA data.
TL;DR: A smart paper that combats a current myth in our field and shows us a great method for applying false discovery rates to label-free DDA data.