Very soon...we're all going to be having a lot of conversations on the topic of "Imputation". And it's gonna be all sorts of fun. Partly 'cause it means a lot of different things to a lot of different people -- and partly 'cause even the people who all think it means the same thing do it 12 different ways.
The different definitions come from whatever background you came from to get into proteomics. If you came from genetics, like I did, you've done all sorts of imputation on microarrays. You have to, because a microarray is a picture of a tiny piece of glass that has thousands of different probes on it. And the biologists at the end expect a measurement from every single one of those probes...even if something like this happens...
See that thing at the bottom? That isn't massive overexpression of all the genes/transcripts in that area of the array (each pixel is a gene or transcript, and the intensity of the color correlates to how much of it there is) -- these instruments can do thousands of these arrays each day -- and not all of them are perfect.
I did the QC on one of the (at the time) world's largest microarray projects and when you have thousands of arrays you need to have a strategy for getting every measurement that you can -- and for figuring out what measurements you can't trust. And we call that imputation!
For a great review on some of the techniques for imputation in proteomics you should check out this
It is going to be more in-depth than anything I'll go into here -- and more in-depth than anything you'll see from any commercial software package anytime soon. This review is partly so deep because this group at PNNL has a completely different idea of how we should be doing proteomics. In numerous studies they have shown that having high resolution accurate mass and accurate retention time is enough data to accurately do quantitative proteomics -- once you have a library of exact masses and retention times. And they have applied numerous statistical algorithms (some deriving from genomics techniques and others from pure statistics) to a few high quality data sets they have created to find the best ones -- and this is a wrap-up of those. BTW, I LOVE these papers, I just don't think we're quite ready to apply these to the diseases I care about.
Wow, this is rambly.
Important parts of this paper -- there are 3 ways of doing imputation:
1) Single value
2) Local similarity
3) Global structure
1) Single value:
Pro -- Easy, fast
Overview -- Makes one very powerful assumption and lives -- or dies -- by it: that if you have a missing value, it is because it is outside of your dynamic range -- and is therefore super significant!
This can be AWESOME in proteomics of cells with unstable genomes. In homologous recombination/excision events in cancers, where entire ends of both copies of a chromosome are gone, you aren't going to have a down-regulation of that protein. That protein will be gone. If you didn't do some sort of imputation outside of your dynamic range, you could walk away empty-handed.
Cons -- You can have tons of false positives that you think are significant, that totally aren't. Your spray bottomed out for 2 seconds during sample one of your treated samples -- when you only had an n of 3? Now you have 10 peptides that look really important -- that aren't.
P.S. Up until now this was the only imputation strategy available in Proteome Discoverer + Quan: "Replace missing values with minimum intensity"
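To make that concrete, here's a minimal sketch of single-value imputation in Python with NumPy -- my own toy version, not what Proteome Discoverer or the paper actually runs. The function name and the example matrix are made up for illustration; every NaN just gets swapped for the smallest intensity you actually measured.

```python
import numpy as np

def impute_min(intensities):
    """Single-value imputation sketch: replace every missing (NaN)
    intensity with the minimum observed intensity in the matrix,
    on the assumption that missing = below the dynamic range."""
    filled = intensities.astype(float)       # astype returns a copy
    floor = np.nanmin(filled)                # smallest value actually measured
    filled[np.isnan(filled)] = floor         # every gap becomes that floor
    return filled

# peptides (rows) x samples (columns), NaN = missing measurement
data = np.array([[12.1, np.nan, 11.8],
                 [ 9.4,  9.9,  np.nan]])
print(impute_min(data))
```

One substitution per missing cell, no neighbors, no model -- which is exactly why it's fast, and exactly why a 2-second spray dropout turns into a "super significant" hit.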
2) Local similarity:
Pro -- Kinda easy...well...you can download the scripts from PNNL in MATLAB format (I think the DanteR package also has stuff, but I can't remember)
Overview -- This one derives from engineering principles -- the idea is this: sending back a reading of "zero" could wreck the whole system, so avoid it at all costs. The textbook example is that you've got 10 temperature sensors down a vat and one shorts out a couple of times and comes back with 0 degrees. Better to replace that zero with the temperature of its nearest neighbor.
If we apply that thought to proteomics -- the idea is that it is better to have a 1:1 measurement for a peptide than to have no value for the peptide at all. The most used approach is gonna be KNN, or k-nearest neighbors -- this goes one beyond the 1:1 idea I stated above -- the most similar peptides (in a Euclidean sense) to the one with the missing value are used to impute the intensity of the missing peptide.
Further elaboration on why this could be useful --> what if your collaborator (or postdoc advisor) made you filter out any proteins that didn't have 2 or more unique peptide values before they'd look at your protein list? This is still common, right? If you had 2 peptides ID'ed and one was 10:1 and one was 10:(missing value), it sure would be nice to put something in there, right? Even if the nearest neighbors made this ratio 1:1, you'd still have 2 peptides and the quan would average out to 5.5:1. Sure beats having no protein (in the appropriate circumstances).
Cons -- It is almost the opposite of the single value example I mentioned above. You can end up with something approaching ratio suppression here.
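Here's a bare-bones sketch of the KNN idea in NumPy -- again my own toy, not the PNNL MATLAB scripts. For each peptide with a gap, it finds the k peptides with the most similar intensity profiles (Euclidean distance over the columns both have measured) and fills the gap with their average at that column. All names are mine.

```python
import numpy as np

def knn_impute(X, k=2):
    """KNN imputation sketch: fill each missing value with the mean of
    that column across the k peptides (rows) whose observed intensity
    profiles are closest in a Euclidean sense."""
    X = X.astype(float)
    filled = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = []
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])  # columns both rows measured
            if not shared.any():
                continue
            d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            dists.append((d, j))
        dists.sort()                                     # closest profiles first
        neighbors = [j for _, j in dists[:k]]
        for col in np.where(miss)[0]:
            vals = [X[j, col] for j in neighbors if not np.isnan(X[j, col])]
            if vals:
                filled[i, col] = np.mean(vals)           # neighbor average fills gap
    return filled
```

Notice the failure mode the Cons line is pointing at: the fill value is an average of peptides that *were* measured, so it gets dragged toward the middle of the distribution -- the opposite bias of the minimum-value trick.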
3) Global structure:
Pro -- Statistically valid, probably
Overview -- This model assumes that missing values are the result of random events that are evenly distributed across all measurements. By reducing everything down to its principal components and then reconstructing the matrix from those components, you fill in the missing measurements.
Cons -- Well...computationally really, really expensive. For example, in this paper, this group couldn't complete a BPCA global-structure imputation on a dataset with 15 samples -- that contained a total of 1,500 peptides -- IN A WEEK.
For reference's sake, I crunched >1,000 files with Minora and imputed with KNN (local similarity) 16,000 peptides across all files in 48 hours...with database searching. Maybe this is something that really requires a server to do.
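For a feel of what "global structure" means, here's a toy iterative low-rank sketch in NumPy -- a crude stand-in for BPCA, not what the paper does (BPCA is a full Bayesian treatment). It seeds the gaps with column means, then repeatedly rebuilds the whole matrix from its top principal components and overwrites only the missing cells until they settle. All names and parameters are mine.

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=100):
    """Global-structure imputation sketch: fill missing values from a
    low-rank (PCA-style) reconstruction of the whole matrix, iterating
    until the filled-in values stabilize. Toy stand-in for BPCA."""
    X = X.astype(float)                                   # copy; keep NaN mask
    miss = np.isnan(X)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])  # seed gaps at column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]     # rank-r reconstruction
        filled[miss] = approx[miss]                       # only overwrite the gaps
    return filled
```

Even this toy shows where the cost goes: every iteration is a full SVD of the entire peptide-by-sample matrix, which is why the global approaches choke on datasets the single-value and KNN strategies shrug off.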
This is the first conversation of probably many that everyone will be having soon, but seriously, this is a good review of the topic!