Tuesday, January 3, 2017

Where does the billions of proteoforms idea come from?!?

A really interesting post from a much more serious blogger has been making the rounds on Twitter today. You can check it out here -- but it throws into question the possibility of billions of different proteoforms in humans.

I don't plan to argue this -- and, in fact, the references that are cited on the page he refers to do appear to be wrong. I've contacted some people inside that company (I know a few) to see if we can find better references. While looking for them -- I thought "wow...are you EVER going to finish your Favorite Papers of 2016 Post?" Might as well look at this thing!

As is often the case when I start digging through the amazing work you guys are out there doing -- I got distracted...

Assume for a minute that the blog post above didn't question things that every person who's tried doing top-down proteomics knows are factual -- and that there are millions to billions of possible proteoforms. As top-down proteomics continues to grow -- how are we going to keep millions of things straight?

Qiang Kou et al. have a really interesting solution that is a step forward from some combinational bioinformatics graphs -- they call these mass graphs.

There is a ton of maths in here, but I think this figure gets the idea across (can I show this? Email me: orsburn@vt.edu if I can't -- I'll take it down! Oh..it is in the manual, shouldn't be a problem!)

How sharp is that?  It summarizes the possibilities semi-linearly!
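
To make the "semi-linearly" point concrete, here is a toy sketch of the general idea: one node per position along the sequence, one edge per possible residue mass (unmodified or modified), so that every path from the first node to the last is one candidate proteoform. This is an illustration only and not TopMG's actual data structure or code. The function names, the example peptide, and the choice of which residues can carry which modification are all hypothetical. The residue and modification masses are standard monoisotopic values.

```python
from math import prod

# Monoisotopic residue masses (Da) for the few residues used in this toy example.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "T": 101.04768, "K": 128.09496}

# Hypothetical variable modifications: phospho on S/T, acetyl on K (mass shifts in Da).
PTM_SHIFTS = {"S": [79.96633], "T": [79.96633], "K": [42.01057]}


def build_mass_graph(sequence):
    """Return edges (from_node, to_node, mass); node i sits before residue i."""
    edges = []
    for i, aa in enumerate(sequence):
        base = RESIDUE_MASS[aa]
        edges.append((i, i + 1, base))              # unmodified residue
        for shift in PTM_SHIFTS.get(aa, []):
            edges.append((i, i + 1, base + shift))  # modified residue
    return edges


def count_paths(sequence, edges):
    """Count start-to-end paths, i.e. distinct modification patterns."""
    options = [0] * len(sequence)
    for start, _end, _mass in edges:
        options[start] += 1
    return prod(options)


seq = "GASTKSTK"  # made-up peptide
edges = build_mass_graph(seq)
print(f"{len(edges)} edges encode {count_paths(seq, edges)} candidate proteoforms")
```

The graph grows by a handful of edges per residue, while the number of paths through it doubles with every modifiable site. That gap is why a compact graph is attractive once the possibilities run into the millions.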

You can check the software out directly here!  P.S. It is called TopMG

What was I talking about?

Oh....who first said that if the average protein had X possible length variants times Y post-translational modifications, there might be billions of proteoforms?
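
For anyone who wants to see how fast that multiplication runs away, here is a rough back-of-the-envelope sketch. Every number in it is an illustrative placeholder (3 length variants, 10 or 16 modifiable sites, roughly 20,000 protein-coding genes), not a value taken from any of the papers discussed here, and the function name is made up. It also assumes every site is independently either modified or not, which real proteins certainly don't obey, so treat the result as an upper bound on what is combinatorially possible rather than a count of what actually exists.

```python
def proteoform_upper_bound(length_variants, ptm_sites, states_per_site=2):
    """Combinatorial ceiling: length variants x (states per site) ** (number of sites)."""
    return length_variants * states_per_site ** ptm_sites


# Hypothetical round number for human protein-coding genes.
HYPOTHETICAL_GENE_COUNT = 20_000

for sites in (10, 16):
    per_gene = proteoform_upper_bound(length_variants=3, ptm_sites=sites)
    total = per_gene * HYPOTHETICAL_GENE_COUNT
    print(f"{sites} sites: {per_gene:,} per gene, {total:,} across all genes")
```

With these made-up inputs, going from 10 to 16 modifiable sites per protein is enough to move the ceiling from tens of millions to billions.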

Uh oh! Gonna need this (couldn't find a T.A.R.D.I.S.)

This is a paper we can blame for this idea!  2006!?!?

Check out this nugget!

Here in 2006 we have a hint of the number. Not considering alternative splicing events -- just PTMs, we might have millions of distinct proteins floating around. Quick -- those people at Northwestern talk about all the proteoforms -- grab a quick paper from them and see who they blame.

Wow!  Okay, I'll come back to 21 Tesla ETD top-down. 'Cause that sounds great -- I love doing fragmentation at 1 Hz...

I need to borrow some more equipment!

To get me back to 2009 -- to this nice open paper!

This tackles the topic head-on, showing how even canonical proteins like GAPDH have multiple proteoforms performing multiple functions.

But even this paper doesn't drop the --

(another side note -- Carl Sagan's book by this same name is one of my favs from him)

Okay -- I'm gonna have to keep digging. The evidence is here -- and somebody totally was first to do the math and drop the B-bomb.

EDIT (1/5/17): Found it!  Had to go to the Proteoform man, himself!

In this beautiful (and Open Access!) paper from 2012, we're going to run into this here figure.

...and this makes an awful lot more sense, right?  Heck, I'm impressed to find out there are 4,000 cell types. Now...it is fair to consider that each cell type isn't necessarily going to have a completely different protein profile...but it makes sense that we'd see different proteoforms in some of them.  And..I'd like to point out one of my favorite papers from 2015.

59!! Proteoforms of Ovalbumin!! 


  1. It's the wrong question - "How many protein isoforms make up the human proteome?" Only physicists specializing in mass spectrometry worry about that. EGFR has 12 glycosylation sites, each occupied by giant sugar chains that together increase its MW from 134 kDa to 175 kDa. The sugar chains have microheterogeneity that causes tiny, immeasurable pI changes. Why would anyone count the isoforms? It's the tyrosine phosphorylation sites that are important.
    A better question is "How do changes in protein expression and post-translational modification in specific cell types cause disease?"

    That's my 2 cents, anyway.

    1. Thanks for adding! I definitely agree 100% with the second part. I don't care how many proteoforms exist if they aren't affecting the diseases I care about. I'd much rather ignore the fact that alternative proteoforms exist...but the suggestion that they aren't there (how I read the other blogger's article) is definitely untrue. The further I get in, the larger their numbers appear to be.