Wednesday, July 10, 2019

Ben's inaccurate guide to FDR in Proteome Discoverer -- what to use and when!

Every once in a while I start a blog post and I think "oh no. what am I doing?" This is probably going to be one of them. I'm probably going to keep typing anyway, but keep these two things in mind.

1) I have well over my 10,000 hours of kicking data of all types around with different search engines and false discovery rate (FDR) algorithms -- so I have real world advice.

2) This involves statistics. My last formal course on statistics was in the 90s. I did tutor statistics. Also in the 90s. We didn't use computers in statistics in my undergrad institution -- because it was the 90s...

Inaccurate FDR guide time! 

Let's use the Proteome Discoverer stuff above because it's what I actually know the most about. Let's start with the easiest one, using PD 2.2's fantastic and logical numbering system.

4) Fixed Value PSM Validator = Translation: No FDR!

What is it? That translation is actually inaccurate in the later versions, but up to PD 2.1 (I'm pretty sure) there were no settings in this box. It is basically there because you have to have something. It relies on the cutoffs within the search engine itself. If you're using SeQuest, you've just moved the scoring to rely entirely on the XCorr cutoffs within your Admin --> Configuration --> SeQuest --> Node.

In PD 2.2 and 2.3(?) this now includes your DeltaCN cutoff. DeltaCN just details what you do when you have 2 sequences that match your MS/MS spectrum. Imagine you have these 2 matches and one gets a score of (bear with me) 100 and the other gets a score of 96. If your DeltaCN is 0.05, you get a peptide spectral match (PSM) for both sequences, because the second score is within 5% (0.05) of your highest match. However, if you get 2 potential peptide matches and one is 100 and the second is 84, then only the top PSM is reported.
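If the 100 vs. 96 vs. 84 thing reads better as code, here is a minimal sketch of that logic. To be clear, this is my toy version and not what PD or SeQuest actually runs: the function name and the plain list of scores are made up for illustration, and I'm assuming the delta is just the gap to the top score divided by the top score.

```python
def psms_within_delta_cn(scores, max_delta_cn=0.05):
    """Keep candidate scores whose normalized gap to the top score is <= max_delta_cn.

    Hypothetical helper for illustration only; not a Proteome Discoverer API.
    """
    if not scores:
        return []
    top = max(scores)
    if top <= 0:
        # No meaningful normalization possible; report only the top candidate.
        return [top]
    ranked = sorted(scores, reverse=True)
    return [s for s in ranked if (top - s) / top <= max_delta_cn]


# The two cases from the paragraph above.
kept_close = psms_within_delta_cn([100.0, 96.0])  # gap of 0.04: both reported
kept_far = psms_within_delta_cn([100.0, 84.0])    # gap of 0.16: only the top one
print(kept_close, kept_far)
```

With the default 0.05 cutoff the first spectrum keeps both candidates and the second keeps only the top-scoring one.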

When do I use it? When I'm looking at a single protein, or 5 proteins, or 10 proteins. Maybe up to 100 proteins? Not many proteins! Or if I want to get a ton of IDs from a global dataset and I don't care about the level of accuracy in those IDs....which is a joke. I don't use this for global and you shouldn't unless you've got other filters like >3 unique peptides per protein or something.

6) Target Decoy PSM Validator = Translation: The first kind of FDR for proteomics

What is it? It was a revelation in proteomics about a decade ago. Unless I'm mistaken it was first described by Elias and Gygi? Although, the paper I typically associate it with is this paper from the same authors a few years later.

You essentially do 2 searches:
1) Normal FASTA database
2) Your FASTA database that you screwed up. You either read all the entries in reverse or you scrambled them.
3) You adjust the scoring on the two searches to cutoffs that only allow a small percentage of the screwed up database to match. For example (and yes -- I know -- this is only a useful statement for visualization...I've heard all the statistician tirades...) you adjust the scoring until only 1% of your screwed up FASTA database gets picked up as real hits. This is the whole misnomer of 1% FDR. Never say this in front of an audience.

The assumption you lean on is that your screwed up database won't get matches, or will get very very few. On a big database that has been screwed up, there are ALWAYS going to be matches by your search engine. The goal here is to control the random chance of PSMs occurring (this works if you assume the scrambled database matches would only occur by random chance, which - due to things like organisms biasing toward specific amino acid usage and the biological value of having repeating sequences in some proteins -- isn't really true, but it is uncommon enough that it makes a good metric).
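Here is a rough sketch of the "adjust the scoring until only a small percentage of decoys get through" idea. Everything in it is an assumption for illustration: the function name is made up, the scores are toy numbers, I'm estimating FDR as simply (decoy hits passing) / (target hits passing), and I'm pretending there are no tied scores. The real Target Decoy PSM Validator may well do this differently.

```python
def score_cutoff_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest score cutoff where decoys/targets passing is <= max_fdr.

    Hypothetical helper; returns None if no cutoff satisfies the limit.
    """
    labelled = [(s, False) for s in target_scores]
    labelled += [(s, True) for s in decoy_scores]
    # Walk from the best score down, counting what would pass at each cutoff.
    labelled.sort(key=lambda pair: pair[0], reverse=True)

    targets = 0
    decoys = 0
    best_cutoff = None
    for score, is_decoy in labelled:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score  # still acceptable this far down the list
    return best_cutoff


# Toy data: five target PSMs and two decoy PSMs.
toy_targets = [10.0, 9.0, 8.0, 7.0, 6.0]
toy_decoys = [5.5, 3.0]
strict_cutoff = score_cutoff_for_fdr(toy_targets, toy_decoys, max_fdr=0.1)
loose_cutoff = score_cutoff_for_fdr(toy_targets, toy_decoys, max_fdr=0.5)
print(strict_cutoff, loose_cutoff)
```

The stricter limit stops the cutoff before the first decoy sneaks in, and the looser one lets the cutoff slide further down the list. That is the whole trade: more IDs, more junk.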

When do I use it? When I need to be conservative with my identifications. Particularly for big databases, target decoy is going to throw out a LOT of good data. I do not typically use this for single proteins. Maybe I do as a comparison?

(Too much uninterrupted text in a row makes me nervous. Pug Pile!!) 

5) Percolator = Translation: Try to find the good data that target decoy threw out by looking at a bunch of factors.

What is it? It is essentially the first foray of mainstream proteomics into intelligent learning machine deepness (insert arbitrary sounding big data computer term here -- again -- check Wikipedia a LOT before you use any of these terms for an audience. They all sound like the same thing but they aren't. Percolator is really a "semi-supervised learning" thing, and people take these different terms very seriously. I once made the innocent mistake of referring to some Harley Davidsons (sp?) outside a bar as "nice bicycles" inside said bar, which, by the way, they definitely technically are, and it was almost as poor of an interaction over a technicality as I've seen at conferences where people got their artificial learning things mixed up.) Percolator was probably first described here. A while back a colleague assembled a list of all the things that Percolator considers when rescoring a Peptide Spectral Match (PSM) and that list is here. Since Percolator rescores peptides with its own metrics, it's great for combining the results of different engines.

When do I use it? When I've got a LOT of data and I want to get the most identifications possible. I NEVER use Percolator for a single protein digest. I don't know the realistic cutoff where I trust Percolator output. I might put it at around 1,000 proteins. BY THE WAY. If someone gives you an immunoprecipitation or Co-IP and it's got 4,000 proteins in it, there are lots of great guides online on how to do one of these experiments properly. Ask that person to read one and try again. Don't spend all your time trying to interpret garbage pulldowns. If you are going to try and make sense of it, don't treat it like a Co-IP experiment. Treat it as an affinity enrichment and just do everything they do in this paper.

7) IMP-Elutator = Translation: Percolator + Chromatography data

What is it? As best as I can figure out, Elutator takes the Percolator approach but then also incorporates chromatography information predictions. I think IMP-Elutator was first detailed here.

When do I use it? Big datasets with a lot of PTMs. MSAmanda uses more sophisticated features for PTM identification and confidence in those identifications than older code like SeQuest. MSAmanda rewards your scores for higher mass accuracy (measured in ppm, rather than mass range bins) and can handle things like neutral losses from your peptides (akin to MaxQuant searches). Since I'm already using MSAmanda (2.0!) for getting high confidence PTMs, I go ahead and use the tools that were designed to work with it. More importantly, as you add PTMs, your search space increases exponentially, as does your likelihood of getting bad matches. By taking chromatography into account you can further improve on your FDR.
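To put a number on "increases exponentially", here is a back-of-the-envelope sketch. It assumes the simplest possible model: one variable modification, where each modifiable site on a peptide is either modified or not. The function name and the optional cap on modifications per peptide are mine, for illustration, and not taken from any search engine's settings.

```python
from math import comb


def modified_forms(modifiable_sites, max_mods_per_peptide=None):
    """Count the candidate forms of one peptide under a single variable modification.

    Hypothetical helper: each site is either modified or unmodified, optionally
    capped at max_mods_per_peptide modified sites.
    """
    if max_mods_per_peptide is None:
        limit = modifiable_sites
    else:
        limit = min(max_mods_per_peptide, modifiable_sites)
    # Choose which k sites carry the modification, for every allowed k.
    return sum(comb(modifiable_sites, k) for k in range(limit + 1))


uncapped_small = modified_forms(4)
uncapped_large = modified_forms(10)
capped_large = modified_forms(10, max_mods_per_peptide=3)
print(uncapped_small, uncapped_large, capped_large)
```

Going from 4 modifiable sites to 10 takes a single peptide from 16 candidate forms to 1,024 with no cap, and that is before you add a second modification type. Every one of those extra candidates is another chance for a bad match.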

Okay -- I hope this helps? Thanks to Dr. R.R. who wrote me the question that made me realize I hadn't updated anything on FDR in PD since like 2012 and I should reeeeeaaaallly go back and delete some old stuff!
