Monday, May 18, 2015

Highly scalable false discovery rates


This article gets a picture of my puppy, because you don't get anything interesting if you Google Image Search False Discovery Rates. And if you Google "FDR" you get pictures of some dead guy.

The article in question is this one from Mikhail Savitski, Mathias Wilhelm, et al., and it is currently in press at MCP.

What is it?  A new way of doing FDR.

Why would we need one of those?  Don't we have several?  We do, but they have drawbacks.  Classic target-decoy can't keep up with big datasets and big databases.  Don't believe me?  Run your most recent sample versus the organism database and then run that exact same sample versus the entire TrEMBL database.  The number of decoy hits goes through the roof and your positive protein IDs drop through the floor.
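
If you want to see why, here's roughly what classic target-decoy boils down to.  This is just a sketch of the general idea in Python, with made-up names, not anything from the paper:

    # Classic target-decoy FDR in a nutshell: count decoy hits vs. target
    # hits above a score cutoff. psms is a list of (score, is_decoy) tuples
    # from whatever search engine you ran; all names here are mine.
    def classic_fdr(psms, score_cutoff):
        targets = sum(1 for score, is_decoy in psms
                      if score >= score_cutoff and not is_decoy)
        decoys = sum(1 for score, is_decoy in psms
                     if score >= score_cutoff and is_decoy)
        return decoys / targets if targets else 0.0

A bigger database means more decoy sequences, which means more chances for a decoy to sneak above your cutoff by dumb luck.  To hold the line at 1% FDR you have to raise the cutoff, and real IDs fall out the bottom right along with it.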

Well, how does this group propose we do it?

Okay, this is pretty smart.  Say you have a protein that comes up positive.  As I read it, every target protein gets paired off with its own decoy counterpart, and the two are set off against each other head to head; only the better-scoring member of each pair gets kept for the FDR calculation.  That narrows the competition down to matched pairs rather than taking the entire database, flipping it backwards, and letting everything compete with everything.  Make sense?  I kind of get it.  It still seems a little foggy, but I'm also really sleepy.
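
For the code-minded, here is my rough reading of that pairing idea in Python.  Treat it as a sketch of how I understand the competition, not the authors' actual implementation; all the names are mine:

    # The "picked" idea as I understand it: each target protein is paired
    # with its own decoy counterpart, they go head to head, and only the
    # better-scoring member of each pair survives into the FDR calculation.
    def pick_survivors(target_scores, decoy_scores):
        # target_scores / decoy_scores: dicts of protein -> best score
        survivors = []  # (score, is_decoy), one winner per protein pair
        for protein, t_score in target_scores.items():
            d_score = decoy_scores.get(protein, float("-inf"))
            if t_score >= d_score:
                survivors.append((t_score, False))  # target wins the pair
            else:
                survivors.append((d_score, True))   # decoy wins the pair
        return survivors  # then count decoys vs. targets in here as usual

The upshot is that the decoy pool can't balloon independently of the targets the way it does when you flip the whole database and let everything compete with everything.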

The thing is, it seems to work.  I recommend this paper to anyone curious about how FDR works.  Even if you skip all the new stuff they've done here, it is a great review of FDR and the various target-decoy-based mechanisms that have been proposed.

To test this mechanism, they pulled 19,000 LC-MS runs (yup, almost 20k runs!) and ran this approach.  They got better data out of it than the classic target-decoy approach.

Okay, this is cool and everything, but what about Percolator!?!?  You're very right, Percolator is the gold standard for this stuff right now.  But they processed 19,000 LC-MS files.  I did the crude math, and Percolator in its current form would need around 11 years to dig through that many files, and I'm guessing they did this a little faster!
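
For the curious, the crude math goes something like this, where the per-file time is my own ballpark guess and not a number from the paper:

    ~5 hours of Percolator post-processing per file (my assumption)
    19,000 files x 5 hours = 95,000 hours
    95,000 hours / 8,760 hours per year ≈ 11 years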



2 comments:

  1. "Don't believe me? Run your most recent samples versus the organism database and then run that exact same sample versus the entire TREMBL database. The number of decoy hits go[es] through the roof and your positive protein IDs drop through the floor."

    If you search large databases with a lot of internal similarity, the number of false positives will go through the roof, and you will lose good IDs to hit competition and to a lack of discrimination in your scores. Your test does not seem to evaluate the performance of the target-decoy approach, but rather the performance of the search engine.
    It is actually a rather good sign that the number of decoys grows with the number of false positives. Your example thus seems to show that target-decoy does tend to keep up with big databases?

  2. Good point. But the bigger your forward database, the bigger the decoy database that gets generated from it. That means the random chance of getting hits goes up. That's pretty much my point. If I have 12k entries, I have 12k decoys. If I use a database with 10M entries, the chance that something will match better by chance goes up dramatically.
    Thanks for your comments!
