Tuesday, May 24, 2016
Systematic errors in shotgun proteomics
Wow, Google, you totally outdid yourself here. Did someone draw this freehand with a pencil? It does the trick, though, and frames this somewhat sobering new paper from Boris Bogdanow et al., in press at MCP.
The premise of this paper is that there are PTMs in the stuff we're running. A LOT of PTMs.
"...It is estimated that every unmodified peptide is accompanied by ~10 modified versions that are typically less abundant (16)..." If you want to follow up on this, the reference for that statement is here. I'll be honest: that is a bigger number than I had in my head...
We know there are lots of PTMs. Big deal. The problem they point out is that PTMs may be the majority of our false discoveries. By assuming that the most common thing we'll be seeing is the direct, unmodified gene product, we propagate these errors. It helps that the unmodified version is likely to be the most intense, but if, for example, 1% of an albumin peptide is modified, that modified peptide is still going to be waaaaaaaay more abundant than most transcription factors.
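To put rough numbers on that argument (these copy numbers are completely made up for illustration, not from the paper):

```python
# Back-of-the-envelope sketch with hypothetical relative abundances:
# plasma albumin vs. a typical low-abundance transcription factor.
albumin_copies = 1e8   # hypothetical albumin abundance (arbitrary units)
tf_copies = 1e3        # hypothetical transcription factor abundance

# Even if only 1% of the albumin peptide carries a modification...
modified_albumin = 0.01 * albumin_copies

# ...the modified form still dwarfs the unmodified transcription factor.
print(modified_albumin / tf_copies)  # -> 1000.0
```

So a "rare" modified peptide from an abundant protein can easily out-compete an unmodified peptide from a rare protein, which is why assigning it to the wrong unmodified sequence is so tempting for a search engine.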
Okay. They succeeded. I'm totally stressed out about this problem. Thanks, guys!
Wait...it's in MCP because they propose a solution! Unfortunately, it isn't the easiest thing ever. This paper is currently open access, so I can put in a figure, right? If not (don't sue me!) email me: firstname.lastname@example.org and I'll take it down, but it's easier to explain it this way.
Now...they talk about other search algorithms, but this study solely employs MaxQuant and leans heavily on a feature I didn't previously know about called ModifiComb. ModifiComb takes an unbiased look at the potential PTMs in your data. As described in the paper, I'm going to consider it somewhat analogous to Preview, because it's pulling out the most common dynamic modifications hanging out in your samples. (Preview is trademarked, so I can't say whether it uses the same logic described for this MaxQuant feature, but the end result seems the same.)
Okay. This makes sense so far and makes me feel pretty good about the search strategies we recommended at the PD 2.x workshops last year. But they diverge here a little, and I think I like it. They run parallel searches, each with just one individual modification. Then they compile those results and do the FDR calculations afterward. According to the output, it works really well. Why would this work better than using Preview to get the most common dynamic mods and throwing them all into one search? No idea, but if you're running PD it would be easy to try something similar to this approach.
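The compile-then-FDR idea can be sketched in a few lines. This is a toy illustration of the general strategy (one search per modification, keep the best hit per spectrum, then do target-decoy counting on the pooled list), not the paper's or MaxQuant's actual implementation:

```python
# Toy sketch: merge parallel single-modification searches, then estimate FDR.
def merged_fdr(searches, threshold):
    """searches: one list per single-mod search, each a list of
    (spectrum_id, score, is_decoy) tuples. Returns decoys/targets FDR
    at the given score threshold."""
    best = {}  # best-scoring PSM per spectrum across all parallel searches
    for psms in searches:
        for spectrum, score, is_decoy in psms:
            if spectrum not in best or score > best[spectrum][0]:
                best[spectrum] = (score, is_decoy)
    accepted = [hit for hit in best.values() if hit[0] >= threshold]
    decoys = sum(1 for _, is_decoy in accepted if is_decoy)
    targets = len(accepted) - decoys
    return decoys / max(targets, 1)  # classic decoy/target FDR estimate

# Hypothetical results from two parallel searches:
phospho = [("scan1", 80, False), ("scan2", 40, True)]
oxidation = [("scan1", 65, False), ("scan3", 70, False)]
print(merged_fdr([phospho, oxidation], threshold=50))  # -> 0.0
```

The key point is that the FDR is computed once, on the merged list, rather than separately inside each narrow search.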
Run your data through the Preview node and grab the modifications individually (ignore the screenshot, grabbed randomly...)
Then, set up parallel PD searches like this:
They do something interesting with the target-decoy searches involving doubling the number of scrambled peptides, but this is just a preliminary look, right? I'd much rather just use the node than use the Fixed Value node, pull my decoys, double 'em, and plot manually, but that might be necessary.
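If you did want to do the decoy-doubling by hand, the bookkeeping is simple: with a decoy database twice the size of the target database, each decoy hit over-counts the expected false targets by a factor of two. A minimal sketch of that correction (my reading of the idea, not the authors' code):

```python
# Hypothetical sketch: FDR estimation with a 2x-size decoy database.
def fdr_doubled_decoys(target_hits, decoy_hits):
    """decoy_hits is assumed to come from a decoy database twice the
    size of the target database, so halve it before estimating."""
    expected_false = decoy_hits / 2
    return expected_false / max(target_hits, 1)

print(fdr_doubled_decoys(1000, 20))  # -> 0.01, i.e. 1% FDR
```

The larger decoy set just gives you a less noisy estimate of the false-match rate; the division by two puts it back on the same scale as the targets.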
In terms of the Consensus workflows, I'm a little hazy on how they built their proteins from their PSMs, but it doesn't sound too far out of the box. Hey. Might be worth a shot, though!
They also integrate protein quantification data to assist with their FDR, which other people have proposed (and honestly, this might be some of the most powerful data we get from transcriptomics...that's in here somewhere, right? I forget), but I don't have time to spend on the supplemental figures this morning.
TL;DR: Nice paper emphasizing that PTMs might be a bigger problem for FDR than we generally think. It proposes a reasonably straightforward approach that might help reduce the number of false discoveries coming from mis-assignments of PTMs.