A whole international consortium got together in 2022 and found something like 10% more human proteins!
Does that mean that you now have a FASTA you can reprocess your data with and get like 10% more IDs?
...not exactly...at least not yet....but it's super cool! Here is the preprint!
Wow. That's a lot of names, including some of the wettest blankets in all of proteomics - "false discovery this" - "analytical metrics of precision that" - "standard pipelines and data storage types" - on and on. Names you may not recognize are even worse - they're RiboSeq people.... (I wrote up some stuff on what RiboSeq is a few years ago here, if you're interested.)
Please read the paragraph above in this voice, if you didn't already.
With the important stuff out of the way, what is all of this? Well, it calls into question how we build those nice protein-level FASTA files that everyone in mass spec based proteomics takes for granted today - until you don't have one.
In a nutshell, they threw out some of the assumptions and looked at a few billion human MS/MS sequences on ProteomeXchange that are from tryptic datasets. Billion with a B. And they looked at a few hundred million MS/MS sequences from HLA immunopeptidomics experiments. Honestly, I was pretty surprised there was that much HLA data publicly available. Y'all have been busy! There wasn't very much (good) stuff out there when I un-retired from science in 2018.
Have you ever 6 frame translated your own genomic data in MaxQuant? There is a little tool for it. And it defaults to something like 50 amino acids. What if the genomics people have also been doing something like that all along? Would you care? What use is a 31 amino acid protein? At 110 Da each that's only 3,410 Da. Cut it with trypsin once or twice and it is probably too small to detect. And you won't get more than 1 peptide for it.
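If you want to play with that back-of-the-envelope math yourself, here is a minimal sketch. Big caveats: the 31 amino acid sequence is completely made up by me for illustration (it is not from the study), the 7-30 residue "detectable" window is just an assumed typical search setting, and the digest uses the simple textbook trypsin rule (cut after K or R unless the next residue is P) with no missed cleavages.

```python
import re

AVG_RESIDUE_MASS_DA = 110.0  # rough per-residue average, the same ballpark number used above
MIN_LEN, MAX_LEN = 7, 30     # ASSUMPTION: a typical peptide length window, not a universal rule


def rough_mass_da(n_residues):
    """Back-of-the-envelope protein mass: residue count times ~110 Da."""
    return n_residues * AVG_RESIDUE_MASS_DA


def tryptic_peptides(sequence):
    """Cut after K or R unless the next residue is P; no missed cleavages.

    Splitting on a zero-width pattern needs Python 3.7 or newer.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def detectable(peptides, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only the peptides whose length falls inside the assumed window."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# HYPOTHETICAL 31 amino acid microprotein, invented for this example only.
MICROPROTEIN = "MAKRGSKLLPEDVTAWQNFGHRCKSEKPLRK"

peptides = tryptic_peptides(MICROPROTEIN)
print(rough_mass_da(len(MICROPROTEIN)))  # estimated mass in Da
print(peptides)                          # every tryptic fragment
print(detectable(peptides))              # fragments inside the length window
```

For this particular made-up sequence only one fragment lands inside the window, which is the problem in a nutshell. Move the K and R residues around and you will get a different answer, so treat it as a toy and not a prediction.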
Here is where it gets cool, though. For about 4 years people have been confidently finding peptides on the cell surface (MHCs or HLAs or NeoAntigens, whatever you want to call them) that map to genetic information that isn't in our FASTAs. There was a flurry of this in 2020-2022. In the study I know the best out of these, Amol Prakash found over 700 that he was super confident about. And that was one of maybe 5 papers that dropped over this period of time where everyone was like ...ummm....WTF...?
And - get this - the RiboSeq nerds have been seeing the same thing. There are mRNA transcripts going to the ribosome - presumably to be ribosomed into chains of amino acids - and they come from regions of the DNA that are annotated as noncoding.
So these two groups worked on it for like 2 years and this is what they found - overlapping data supported by both MS based proteomics data in repositories and whatever stuff the RiboSeq thingamabobs produce.
And what did they find? I'm just going to take screenshots of the coolest stuff. I started this on my phone earlier today.
100 codons! Wait. That's 100 amino acids, right? That's not as small as my example above!
Then they remind us (me? you?) that there are long established rules about calling something a protein that were agreed upon by the Chromosome-Centric Human Proteome Project (you're using the SpongeBob voice now, right?). And that there is validation and other stuff. BUT - this is all still really cool.
If you find a section of the DNA turned into mRNA and hanging out inside of a ribosome AND you find (probably gross looking, immunopeptides are no fun) MS1 and MS2 fragmentation spectra showing that same sequence occupying one of the HLA things you pulled down - that's probably around somewhere doing stuff, right? AND what if you're being all nosy and looking in other people's proteomics data AND you see those peptides there?
Evolution is pretty stingy. It doesn't generally go out of its way to make new mRNA and then put it in the thing that translates it and then leave it floating around for some nerds to detect, right? Accidents happen where there isn't sufficient evolutionary pressure to lead to the removal of things, but they are the exception rather than the rule.
Super exciting stuff, right?
Man, while I was looking for the preprint on my PC and this was half written, I found a much better breakdown of the study and results in the form of a Tweetorial.
You can check it out here.
Did I forget to make fun of the fact they used the Trans-Proteomic Pipeline thing? That is what they processed the data with. And the fun people at the Broad probably sequenced the peptides by hand with a ruler.
I'll leave you with one last screenshot from this super cool, likely textbook-altering study.
(BTW, they're calling these cool things they found "ncORFs". They leave a lot of questions open to the community for how these should be categorized and dealt with, etc., but you'll have to go to the paper for those.)
If you are new here I should probably clarify that me taking the time to poke fun at a study wherever I can is the highest form of compliment I generally can come up with. This study may contribute to answering so many riddles, like: What are these other spectra? Why is our coverage of the immunopeptidome so abysmal? It also shows why we can't just target the proteome for every study. What percentage of the proteome do we even understand now?!? If these data were all from targeted experiments, we'd never know that the genome/proteome may be 10% larger than we thought. What other stuff is hiding?
I can't recommend this (51 pages???? WTaF?) enough.