Sunday, February 28, 2016

Imputation strategies in label free quantification

This bright Sunday morning, I learned a new word: "imputation". And since Google Images only gives you really weird stuff if you try searching for this word, here is a picture of my dog dressed as a sheep.

Google says: "In statistics, imputation is the process of replacing missing data with substituted values." The paper where I learned this term was Just Accepted at JPR and you can find it here (open access if you are logged in).

This paper and I got off on the wrong foot on the very first line of the introduction, where the authors state: "Missing values are a genuine issue in label-free quantitative proteomics." We're going to agree to disagree here, because I fall firmly into the camp that "missing values" on modern instrumentation (i.e., Orbitraps) are an illusion, propagated by clever marketing from groups with alternative agendas and by the fact that software that can assess all the values in our HRAM RAW data files either didn't exist until recently or is only just on its way. Again, agree to disagree, and on into this interesting paper!

This team of talented statisticians is going to assume that:

In this run we didn't get a PSM for this peptide = missing value

Missing value = problem
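To make that definition concrete, here is a minimal sketch of what a "missing value" looks like in a peptide-by-run table. The peptide names and intensities are made up for illustration (they are not from the paper), and using NaN to mark "no PSM in this run" is just my assumption about how you might encode it.

```python
import math

# Hypothetical peptide intensities across three runs.
# math.nan marks a run where no PSM was obtained for that peptide.
intensities = {
    "PEPTIDEA": [1.2e6, 1.1e6, math.nan],
    "PEPTIDEB": [3.4e5, math.nan, math.nan],
    "PEPTIDEC": [8.9e6, 9.1e6, 9.0e6],
}

def missing_fraction(table):
    """Return the fraction of peptide/run cells that have no value."""
    total = sum(len(values) for values in table.values())
    missing = sum(
        1 for values in table.values() for v in values if math.isnan(v)
    )
    return missing / total

print(f"{missing_fraction(intensities):.0%} of cells are missing")
```

In this toy table, 3 of the 9 cells are empty, and those empty cells are what the paper sets out to fill in.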

Since they've defined this as a problem, how do they move forward? First, they define 3 reasons why they wouldn't have gotten this PSM in this run, which come down to whether the missing value was due to a random occurrence (and how random). Next, they use a very simple equation to simulate replacing a value in one set with the value obtained in a separate set. Third, they take a good SUPER SILAC dataset and check their values, and then they go nuts with a bunch of equations.

What did we learn here?

Well, if you are going to do imputation, you'd better do it at the peptide level (though the authors may actually mean PSM here). And if you are going to impute (that is, plug in new values for missing ones), then you should really take into account the reason that the value is missing in the first place. So having algorithms in place that can diagnose why a value is missing will be a valuable tool for correcting those missing values.
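Here is a rough sketch of what "take the reason into account" could look like in practice. To be clear, this is not the paper's method: the two mechanism labels ("random" and "censored"), the function name, and the fill rules (mean of the observed replicates versus a low floor value) are all my own assumptions, picked only to show that the fill-in value can depend on why the value is missing.

```python
import math
from statistics import mean

def impute_peptide(values, mechanism, floor):
    """Fill missing (NaN) values for one peptide across runs.

    mechanism (hypothetical labels, not the paper's terminology):
      "random"   - value dropped out by chance: use the mean of observed runs
      "censored" - value was likely too low to be picked up: use `floor`
    """
    observed = [v for v in values if not math.isnan(v)]
    if mechanism == "random":
        if not observed:
            # Nothing to average, so leave the peptide untouched.
            return list(values)
        fill = mean(observed)
    elif mechanism == "censored":
        fill = floor
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    return [fill if math.isnan(v) else v for v in values]

# Toy log-intensities for one peptide in three runs (made-up numbers).
runs = [10.0, 14.0, math.nan]
print(impute_peptide(runs, "random", floor=1.0))
print(impute_peptide(runs, "censored", floor=1.0))
```

The same gap gets a very different replacement depending on the assumed reason, which is why diagnosing the reason first matters so much.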

1 comment:

  1. Has anyone made a direct comparison between labelled and label-free experiments in terms of missing data? It's interesting to consider how methods like TMT or iTRAQ compare to unlabelled experiments.