Sunday, July 20, 2014

Heavy analysis of the human proteome drafts

I'm certainly not the only person who has jumped on the new resources provided by the human proteome drafts and checked them out.  In this brand new paper in JPR, a group out of Madrid takes a look at some of their favorite proteins in the human proteome drafts and comes back with an interesting analysis.  (Abstract here.)

I love the fact that, in this paper, they did the same experiment Alexis and I did the day the drafts came out.  We chose proteins that we knew would lead to cancer if they were over- or under-expressed and analyzed those.  This group took proteins from nasal tissue (olfactory receptor proteins) and looked for those in the various tissues.

At first glance, the image on the abstract looks pretty damning:

These are olfactory (smelling) receptors.  What are they doing being expressed in colon cells and platelets?!?!  (It is worth noting that the image above is from the data from the Pandey lab.)

The authors of this analysis indicate, even in the abstract, that the "experimental data from these studies should be used with caution."  And I agree.  There is inherent error in studies this big; hell, a 1% false discovery rate on 100 million observations is 1 million observations that are false, right?  But...the experimental data from every study should be used with caution.  And we all know that (by "we" I mean you proteomics experts who read this).  I am glad that this caution is stated, though, for the people outside our field who have discovered this resource through mainstream news outlets.
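
To put numbers on that back-of-the-envelope math, here is a minimal Python sketch.  It assumes the false discovery rate applies uniformly to every observation, which is a simplification, and the function name is made up for illustration.

```python
def expected_false_observations(n_observations: int, fdr: float) -> float:
    """Expected number of false observations, assuming a uniform FDR."""
    return n_observations * fdr


# The same 1% rate looks very different as the dataset grows.
for n in (10_000, 1_000_000, 100_000_000):
    n_false = expected_false_observations(n, 0.01)
    print(f"{n:,} observations at 1% FDR: about {n_false:,.0f} expected to be false")
```

The rate stays the same, but the absolute number of false hits scales with the size of the study, which is why a handful of odd-looking identifications in a dataset this big is not surprising by itself.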

That being said, I have some problems with this experimental design.  There are 3 big assumptions being made here:
1) The annotations of these proteins are 100% correct
2) These proteins have 1 function
3) These proteins only function in one tissue

Number 1 is easy.  Annotations suck.  The system for annotation sucks.  The first person to identify a protein in the first tissue gets to name it, right?  So there are tons and tons of proteins named in tissues that are heavily studied.

Number 2 is relatively easy, as well.  Making new proteins takes a ton of energy.  Evolutionarily (that's not a word? whatever...) there will be a lot of pressure for proteins to function in more than one way, in more than one context.  (Side note: one of my graduate committee members, Jiann-Shin Chen, proved the first dual substrate in the 1970s...sorry, couldn't find the link, I'll add it later if you're interested.)  Considering the sophistication of eukaryotic proteins, it is naive to think that a protein annotated as "Butt_itching_protein_1" would ONLY be utilized in the itchy butt response pathway.

Number 3 is an impressive coincidence.  Like millions of Americans, I subscribe to "I Fucking Love Science" and get Elise's feed of cool articles.  From this feed I know that:  zebrafish embryos highly express functional olfactory response proteins and olfactory receptors are highly active in human skin.  Heck, I've looked through more than a few high quality proteomics assays and seen "olfactory response proteins" in bunches of different tissues.  So...I think this was a poor choice for analysis.

TL;DR:  One, please interpret the results of the human proteome draft maps with caution.  They are draft maps.  Two, consider proteins in an evolutionary context before using those proteins to generate excessive criticism of datasets that a ton of work went into.

Thanks, Karl, for suggesting something to read over coffee this morning!


  1. Ben, the Pandey paper also identified 31 out of 54 Y-chromosome genes in their ovary samples. Simple biological errors like this should have been caught by the authors themselves or at least by the reviewers and they do demonstrate that the protein false positive rate was higher than suggested by the manuscript.

    It is true that the manuscript is labelled as a "draft". However, it was still a flagship publication in Nature and the authors have a responsibility to try to be as accurate as possible and to test the validity of their statistical analysis.

  3. Hi,
    A balanced interpretation of the results and a nice read. I especially liked the part on assumptions in the olfactory protein analyses and their detection across multiple tissues. I would like to add one more possible reason that might cause misinterpretation of results.
    In proteomics we detect peptides and map them back to the protein sequences to infer the proteins. A minimum of two peptides is what we all want to see. That being said, in my recent work on rat proteogenomics, when I mapped peptides back to proteins, 20% of peptides were shared between genes (not isoforms). This is when I took only identical matches into consideration, and this is probably taken care of by protein inference algorithms. However, SNPs can result in differences in peptides. When I mapped peptides with one mismatch allowed at any position, another 20% of peptides were shared. The differences in peptides might cause a different protein to be called as expressed when it is simply a peptide from another gene that has accumulated an SNP. The extent of the effect of point mutations on results from large studies like the draft human proteome map is yet to be estimated (trying), but it surely would affect conclusions. This might also explain some of the Y-chromosome genes being shown as expressed in ovary tissues, as mentioned by Ronald Beavis, as many Y-chromosome genes have high similarity to X-chromosome genes.