Saturday, January 25, 2020
EPIFANY -- A smart and fast method for protein inference!!!
This new study in press at JPR is critically important for shotgun proteomics. Smarter people than me (which means just about everyone) should really take a good look at it and 1) verify it is as good as it looks and 2) see about integrating the source code or primary logic into all sorts of other tools. (An earlier draft was also made available through bioRxiv.)
Okay -- so -- shotgun proteomics is really really good at one thing -- making Peptide Spectrum Match(es) -- (PSM)s.
And if we're working with the best proteins in the whole entire world then each and every one of those PSMs is unique to 1 particular protein and when we identify that PSM and quantify it in a sample we have proven that particular protein is there and we can even get quantification estimates / measurements on that one protein from that one PSM. (I need a word count on sentences).
However -- from an evolutionary perspective it doesn't make a ton of sense for each protein to have developed in isolation with no relation to any other protein. So...a lot of PSMs could be derived from more than one protein. And if you only identify PSMs that could originate in more than one protein, what do you do?
You INFER the protein identity.
How do you do that?
Well -- probably by a set of mostly arbitrary rules that were chosen because....we had to do something...and it's a great idea if we keep them to ourselves...because they don't reflect well on us or our field.
The best one? When you've got equal evidence, it's probably the biggest protein in your FASTA database.... Some tools use the highest percent coverage instead, but then you'll get all weirded out, because if UniProt contains your full-length variant and 4 alternative "fragment of" protein sequences, you'll only ever see the fragments, and then you'll be afraid your lysis method broke off all your C-termini...which...you can't rule out. See...it doesn't sound great when you say it out loud. I hate explaining it when I can tell people are paying attention. I go ahead and get the idea of a "razor" peptide out of the way next, because it's better to get two things that damage your credibility out of the way at the same time, and then you can spend the rest of your talk or lecture trying to gain it back.
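To make the arbitrariness concrete, here's a tiny Python sketch of that kind of rule. It is not taken from any real tool: the peptide names, protein names, and lengths are made up, and the rule itself (give a shared peptide to the protein with the most peptides, break ties with the longest sequence) is just my reading of the usual "razor" plus "biggest protein" logic.

```python
from collections import defaultdict

# Hypothetical toy data: identified peptide -> proteins that contain it.
peptide_to_proteins = {
    "PEPA": ["P1"],
    "PEPB": ["P1", "P2"],
    "PEPC": ["P2", "P3"],
    "PEPD": ["P1"],
    "PEPE": ["P3"],
}
# Hypothetical protein lengths, used only as the tie-breaker.
protein_length = {"P1": 300, "P2": 450, "P3": 520}


def assign_razor(peptide_to_proteins, protein_length):
    """Assign every peptide to exactly one protein using two heuristics."""
    # Count how many identified peptides each protein could explain.
    counts = defaultdict(int)
    for proteins in peptide_to_proteins.values():
        for protein in proteins:
            counts[protein] += 1
    assignment = {}
    for peptide, proteins in peptide_to_proteins.items():
        # Most peptides wins; on equal evidence the longer protein wins.
        assignment[peptide] = max(
            proteins, key=lambda p: (counts[p], protein_length[p])
        )
    return assignment


assignment = assign_razor(peptide_to_proteins, protein_length)
for peptide, protein in sorted(assignment.items()):
    print(peptide, "->", protein)
```

In this toy case P2 and P3 have equal evidence for PEPC, and P3 gets it purely because it is longer. That's the part that is hard to say out loud.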
I'm oversimplifying a complex and varied environment of protein informatics software here. It isn't all this way. From the paper:
"Some methods tackle this problem by either ignoring shared peptides (Percolator 7,8), employing maximum parsimony principles and finding a minimal set of proteins explaining found peptides or PSMs (PIA4 ), iteratively distributing its evidence among all parents (ProteinProphet 9 ) or incorporating the evidence in a fully probabilistic manner (Fido 10, MSBayesPro11, MIPGEM12)"
The best way to do this? An exhaustive recent analysis of the iPRG 2016 dataset (the big ABRF study that comes up a lot) showed that the full probabilistic models are the way to go. More statistics, FTW!
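For contrast with those probabilistic models, here's roughly what the maximum parsimony option from the quote boils down to. This is a minimal sketch with made-up data, and it uses a simple greedy set cover, which only approximates a truly minimal protein set. I'm not claiming this is how PIA or any other tool actually implements it.

```python
# Hypothetical toy data: protein -> set of identified peptides it contains.
protein_to_peptides = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
}


def greedy_parsimony(protein_to_peptides):
    """Pick proteins until every identified peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Take the protein explaining the most still-unexplained peptides.
        # Sorting the names first makes ties resolve deterministically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


parsimonious_set = greedy_parsimony(protein_to_peptides)
print(parsimonious_set)
```

P2 never makes the list here, because everything it could explain is already explained by P1. Parsimony has no way to say "P2 might also be there, with some probability", which is the gap the probabilistic tools try to fill.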
However -- the only one of these I've used is Fido, and it required a whole lot more processing time/power than even Percolating a large dataset. And this study suggests it's not just Fido...it's a brute force approach that, in the end, may not be realistic.
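Here's a toy illustration of why brute force gets expensive. This is my own simplified sketch, not Fido's code: it assumes a noisy-OR style model where `gamma` is the prior that a protein is present, `alpha` is the chance a present protein produces one of its peptides, and `beta` is the chance a peptide shows up spuriously. Those parameter names and values, and the decision to treat every listed peptide as confidently observed, are assumptions for illustration only.

```python
from itertools import product

# Hypothetical toy data: observed peptide -> candidate parent proteins.
peptide_parents = {
    "unique_1": ["P1"],
    "shared_1": ["P1", "P2"],
}
proteins = ["P1", "P2"]


def brute_force_posteriors(peptide_parents, proteins,
                           alpha=0.9, beta=0.01, gamma=0.5):
    """Posterior P(protein present | peptides) by enumerating every state."""
    totals = {p: 0.0 for p in proteins}
    evidence = 0.0
    # Every protein is either absent (0) or present (1): 2**n combinations.
    for state in product([0, 1], repeat=len(proteins)):
        present = {p for p, s in zip(proteins, state) if s}
        weight = 1.0
        for s in state:
            weight *= gamma if s else (1.0 - gamma)
        for parents in peptide_parents.values():
            k = len(present.intersection(parents))
            # Noisy-OR: the peptide is seen unless every cause fails.
            weight *= 1.0 - (1.0 - beta) * (1.0 - alpha) ** k
        evidence += weight
        for p in present:
            totals[p] += weight
    return {p: totals[p] / evidence for p in proteins}


posteriors = brute_force_posteriors(peptide_parents, proteins)
print(posteriors)
print("states enumerated:", 2 ** len(proteins))
```

With two proteins that's four states. With a connected cluster of 40 proteins sharing peptides it's about a trillion, and that's the wall the exhaustive approach runs into.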
EPIFANY uses some fancy statistics to achieve the same (better?) inference results, but uses alternative logic (something about loopy belief propagation) that massively reduces the data processing load.
Full disclaimer -- I'm still trying to figure out how to use it because it runs in KNIME and I might be too dumb for it. I just found this cool KNIME cheatsheet thing -- with this and the full pipeline and all data available here I'm hoping to work my way through it. [Hooooly cow. You can run it from command line....how did I miss that!?!? ]
However -- the evidence here is solid that this is a better way to infer protein identifications. The authors test it on multiple datasets, including the iPRG, against all sorts of other ways to infer the protein identities, and EPIFANY is the best -- or close enough -- and finishes in a reasonable time.
And -- look -- even if it didn't work any better at all, wouldn't it be better for us to use the tools that at least tried to use intelligent statistics to infer our protein identities? Grant review boards are grumpy by design. We don't need to give them excuses to fund more transcriptomics.