Wednesday, June 5, 2013

False discovery rate calculations part 2 -- real data

To read part 1 of this monologue on false discovery rates, see the previous post.


Part 2:  Real data

Here is the setup: a cell line digest was separated over a 140-minute gradient on an Orbitrap Elite operating in high-low mode.  I believe it was a Top20 method with a dynamic exclusion of 1 (I ran this back in November).

For databases I used UniProt/SwissProt parsed on the term "sapiens".  This is what I'll refer to as the "normal" database.  I then used COMPASS to make a reversed copy of this database and append it to the end of the normal one, which I'll refer to as the "concatenated" database.
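The actual decoy database here was built with COMPASS; purely to illustrate the idea (the file names and the "REV_" accession prefix are placeholders of mine, not what COMPASS produces), a reversed-and-concatenated FASTA could be put together with a few lines of Python:

# Sketch only: write each target entry followed by a reversed-sequence decoy
# entry. File names and the "REV_" prefix are illustrative placeholders;
# the decoy database in this post was actually made with COMPASS.
def read_fasta(path):
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

with open("human_concatenated.fasta", "w") as out:
    for header, seq in read_fasta("human_swissprot.fasta"):
        out.write(">" + header + "\n" + seq + "\n")            # target (forward) entry
        out.write(">REV_" + header + "\n" + seq[::-1] + "\n")  # reversed decoy entry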

The sample was run twice in PD with default parameters; carbamidomethylation of cysteine was the only modification (as always, write me if you want more details).  The Fixed Value PSM node was used, so no FDR, just the default XCorr cutoffs.  The only difference between the two runs was the database employed: normal or concatenated.

Normal run:  4051 protein groups, 17735 peptides
Concatenated:  4636 protein groups, 19618 peptides

Now, assuming all things are equal, 585 new protein groups (14.4% of the normal total) were added.  That means there is a possibility that 14.4% of the protein IDs occurred here not because they are true, but due to random chance.  There are, of course, other explanations, like homologous contaminating peptides and so on, but I'm going to ignore them here.

Really, we should be looking at the peptide level rather than the protein level: 1883 extra peptides, or 10.6% of the normal total, that we can chalk up to random matches.
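Just to show the arithmetic explicitly, here is the same back-of-the-envelope estimate in a few lines of Python (the numbers are simply the search results listed above):

# Naive decoy-based estimate: the extra IDs gained by adding the reversed
# half of the database approximate the number of random (false) matches.
normal_proteins, concat_proteins = 4051, 4636
normal_peptides, concat_peptides = 17735, 19618

extra_proteins = concat_proteins - normal_proteins   # 585
extra_peptides = concat_peptides - normal_peptides   # 1883

print("protein level: %d (%.1f%%)" % (extra_proteins, 100.0 * extra_proteins / normal_proteins))
print("peptide level: %d (%.1f%%)" % (extra_peptides, 100.0 * extra_peptides / normal_peptides))
# protein level: 585 (14.4%)
# peptide level: 1883 (10.6%)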

Alright!  Now we should just be able to cut out the 10.6% lowest-scoring peptides, right?  As I'll keep reiterating, it is trickier than that.  Look at this overlap at the protein level:

Uhhhh...so there were proteins ID'd in the normal search that were not identified in the concatenated one?  This means that some peptides actually matched the decoy half of the database BETTER than they matched the real (forward) sequences.  Of course, I should have done this comparison at the peptide level, but the point carries through.  Even if we chop off the 10.6% lowest-scoring peptide IDs, we still don't know that we've caught all the random matches, because the random matches may not be low-scoring peptides at all!
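To make that concrete, the usual way around it is to rank every PSM by score and estimate the FDR at each possible cutoff as (decoy hits)/(target hits) above that cutoff, rather than lopping off a fixed percentage from the bottom.  Here is a minimal sketch, assuming each PSM is just a (score, is_decoy) pair; the example scores are made up:

# Walk down the score-ranked PSM list from a concatenated target-decoy search
# and estimate the FDR at each score cutoff as decoys/targets above it.
# The example PSMs are invented; real input would be XCorr plus a decoy flag.
def fdr_at_thresholds(psms):
    targets, decoys, out = 0, 0, []
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        out.append((score, float(decoys) / max(targets, 1)))
    return out

example = [(4.2, False), (3.9, False), (3.7, True), (3.5, False), (2.1, True)]
for score, fdr in fdr_at_thresholds(example):
    print("XCorr >= %.1f : estimated FDR %.2f" % (score, fdr))

Note the decoy sitting at 3.7: a random match can outscore plenty of genuine ones, which is exactly why a flat percentage cut from the bottom doesn't work.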

This is why we have to take a step away from doing this arithmetically and go to statistics.  The simplest and best-known example is controlling the false discovery rate at a chosen level (alpha) with the Benjamini-Hochberg procedure:
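In its usual form, with the m peptide (or PSM) p-values sorted in ascending order, $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$, the procedure finds the largest k such that

$$ P_{(k)} \le \frac{k}{m}\,\alpha $$

and accepts the k peptides with the smallest p-values.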


This equation is out of the scope of the blog, but it determines the false discovery rate for each peptide (or PSM, or anything else) at a specified level (alpha) and is solved for the highest possible (k).  This is only one of many variations on the same theme.  Ultimately, the goal is to use the results of the target-decoy search to establish a statistical framework for the true likelihood of a peptide match.
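As a concrete (and admittedly simplified) illustration of that procedure, not of how any particular search engine implements it, here is Benjamini-Hochberg in a few lines of Python; the p-values are made-up stand-ins for per-peptide p-values:

# Benjamini-Hochberg: with p-values sorted ascending, find the largest k with
# p_(k) <= (k/m) * alpha and accept the k smallest p-values.
# The p-values below are invented for illustration.
def benjamini_hochberg(pvalues, alpha=0.05):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= (rank / float(m)) * alpha:
            k = rank
    accepted = set(order[:k])
    return [i in accepted for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.760]
print(benjamini_hochberg(pvals, alpha=0.05))   # only the first two pass at alpha = 0.05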

For more information, please refer to the classic paper from Gygi's lab.
