Tuesday, June 4, 2013

False discovery rates: Part 1 -- Fictitious history and introduction to target decoy searches

In my opinion, the single most controversial aspect of mass spectrometry based proteomics is the false discovery rate calculation.  It is controversial because it is so poorly understood.  It is poorly understood because it is extremely complicated.

I'm going to try to take this apart as methodically as I possibly can in order to help simplify it as much as possible.  Honestly, it is going to be a bit of an oversimplification and probably more than a bit inaccurate, but I think I can cut through to the fundamentals by conducting multiple experiments on the same database and sample set and by telling it like a simple story.

In the good old days, our super sensitive and speedy mass spectrometers could do a reasonably good job of looking at one compound at a time and fragmenting it.  We could then go to the mass of the compound, make a hypothesis about what it is, and then see if there is fragmentation data to support that hypothesis.  Sometimes we still have to do this.  I have some friends who still manually verify every MS/MS spectrum that they have sequenced before they will publish it.  It takes forever, even when you have developed a knack for it.

One day, however, the mass spectrometers got much, much speedier.  I blame it on the Finnigan LTQ, quite honestly.  All of a sudden, the instruments were generating a full MS1 and multiple fragmentation spectra per second.  People started doing crazy things like not separating their samples out in 2 (or 3) dimensions before they put them on the LC-MS.  Sometimes they were collecting fractions that contained hundreds of proteins for MS/MS.  With this new speed, the absolute limiting factor became the number of human hours it took to manually verify spectra (which, as I said, some people still do and I applaud them for it!  Nothing beats it.)

By this time lots of scoring methods for peptide quality had already been invented, like Xcorr and Mascot score (whatever that is), but these methods were also showing their weaknesses.  Again, they worked best when you set your parameters based on what criteria you personally trusted and then manually verified everything.  In order to shortcut around manually verifying every experiment, we borrowed the False Discovery Rate (FDR) idea from the genomics people.

The first real idea was the target decoy.  In Proteome Discoverer terms, we call this Peptide Validator or the Target Decoy PSM Validator.  The math is complex, but essentially it consists of 3 steps:
1)  Run your MS/MS spectra vs. your database
2)  Run your MS/MS spectra vs. a messed up version of your database (backwards or scrambled)
3)  Use the percentage of matches to your messed up database to estimate what percentage of your matches to the real database could be random, and use that degree of randomness to cut out your lowest scoring peptides.
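
To make those three steps a little more concrete, here is a minimal Python sketch.  It is not what Proteome Discoverer does internally; it assumes the simplest kind of decoy (each protein sequence reversed) and the simplest estimate (decoy hits divided by target hits).  The function names, the DECOY_ prefix, and the two toy sequences are all made up for illustration.

```python
def reverse_decoys(targets):
    """Build one decoy entry per target protein by reversing its sequence.

    `targets` maps a protein name to its amino acid sequence.
    The "DECOY_" prefix is just a convention chosen for this sketch.
    """
    return {"DECOY_" + name: seq[::-1] for name, seq in targets.items()}


def estimated_fdr(n_target_hits, n_decoy_hits):
    """Estimate the fraction of target hits that could be random matches.

    Assumes random matches are about equally likely against the target
    and decoy databases, so the decoy hit count stands in for the number
    of random hits hiding among the target hits.
    """
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits


# Toy database with two made-up sequences.
targets = {"PROT1": "MKWVTFISLLK", "PROT2": "AEFVEVTKLVTDLTK"}
decoys = reverse_decoys(targets)

print(decoys["DECOY_PROT1"])
print(estimated_fdr(n_target_hits=100, n_decoy_hits=10))
```

Steps 1 and 2 (the actual searches) are the search engine's job; the sketch only covers building the messed up database and turning the two hit counts into a rate.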

For example:  if I searched my RAW file against my decoy database and it came up with 10 hits, but when I searched it against my real database it got 100 hits, we would think that about 10% of our good hits may just be sheer random occurrence.  Now, we could just drop the lowest-scoring 10% of hits and be done with it, but some of those might actually be good spectra, and there might still be random matches hiding among the hits we kept.  Of course, there are all sorts of maths out there that can help, but I find this plot from PEAKS to be the easiest for my little brain:

In this simplification, we are taking our false matches and our true matches and overlapping them.  The confidence interval must be moved in a way to minimize the false matches while getting the most possible true hits.  I particularly like this because it demonstrates 2 key points:  1) Using an FDR calculation is going to lose you good data.  2) There is no guarantee that any use of FDR is going to prevent the existence of bad matches.  Some peptides are just going to be there due to a combination of random occurrence and statistical anomaly.  For more description on this particular graph, please check out this nice tutorial at PEAKS.
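
If you want to play with that trade-off yourself, here is a rough Python sketch of sliding a score cutoff until the estimated FDR drops under a limit you pick.  The scores are invented and the function name is hypothetical; real tools use more careful math than a plain decoy-to-target ratio, so treat this as a toy under those assumptions.

```python
def cutoff_for_fdr(target_scores, decoy_scores, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    The estimate at each cutoff is (decoy hits at or above the cutoff)
    divided by (target hits at or above the cutoff).
    Returns None if no cutoff satisfies the limit.
    """
    best = None
    # Try every observed target score as a candidate cutoff, highest first.
    for cutoff in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            best = cutoff  # keep lowering the cutoff while the limit holds
    return best


# Invented scores, purely for illustration.
target_scores = [5.1, 4.8, 4.2, 3.9, 3.5, 3.1, 2.6, 2.2, 1.9, 1.5]
decoy_scores = [2.4, 1.7, 1.2]

loose = cutoff_for_fdr(target_scores, decoy_scores, max_fdr=0.15)
strict = cutoff_for_fdr(target_scores, decoy_scores, max_fdr=0.0)
print(loose, sum(s >= loose for s in target_scores))
print(strict, sum(s >= strict for s in target_scores))
```

With these toy numbers the stricter limit keeps fewer target hits, and some of the hits it throws away may well have been real.  That is point 1 above in miniature.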

For those of you solely interested in FDR calculation nodes for Proteome Discoverer, I suggest you watch this video I constructed on Vimeo for Proteome Discoverer 1.4.

In the next section, I'll take a look at the results we get when using 1) no decoy library, 2) a decoy library appended to the end of a normal FASTA (with no cutoffs), 3) a normal target decoy search (PSM validator), and 4) Percolator, all on the same dataset.
