Thursday, September 12, 2019

Who needs MS/MS? 1,000 proteins in 5 MINUTES with Direct MS1!

You knew this was coming, right? We've been working our way back this direction the last couple of years -- and here it is.

What is "match between runs"? It's essentially just MS1 based identification.

That's why BoxCar retrieves so many identified peptides/proteins. It increases the S/N and increases the number of MS1 based identifications. You lose MS/MS -- because it relies on MS1 based libraries. It seemed inevitable that we'd soon see the intelligent application of stand-alone MS1 based proteomics, but -- I'll be honest -- I didn't expect the data to look this good.

The idea of using MS1 exclusively for your peptide/protein IDs is not new. Peptide mass fingerprinting was described during the FIRST SEASON of Walker, Texas Ranger.
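For anyone who hasn't thought about MS1-only identification in a while, here is a minimal sketch of the core idea: compare an observed precursor m/z against theoretical peptide masses within a ppm tolerance. This is not the Direct MS1 pipeline from the paper. The function names, the 5 ppm default, and the tiny library are all assumptions for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mz(mass, charge):
    """m/z of a neutral mass carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def match_ms1(observed_mz, charge, library, tol_ppm=5.0):
    """Return (sequence, ppm error) for library peptides within tol_ppm."""
    hits = []
    for seq in library:
        theo = mz(peptide_mass(seq), charge)
        ppm = (observed_mz - theo) / theo * 1e6
        if abs(ppm) <= tol_ppm:
            hits.append((seq, round(ppm, 2)))
    return hits
```

A real MS1-only workflow also leans on retention time, isotope patterns, and proper FDR control. Mass alone gets ambiguous fast, which is the whole reason this was considered hard.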

Some people in Washington were doing MS1 based ID quan in proteomics on big Helium cooled magnet systems and ultra-high quality HPLC systems before the first commercial Orbitrap came out, but as good as the resolution was, they were sloooooooooow and expensive and, I'd argue, the biggest weakness was that our understanding of the depth of the proteome was more than a little flawed. Now that we know that in basically every HRAM MS1 scan there is probably a PTM-modified peptide (or 10) and our libraries can grow up to reflect this...these approaches start to make more sense and false discoveries become somewhat(?) less ubiquitous.

These authors argue some additional points. 120,000 is a lot of resolution, and if you can get more than 4 scans/second, you can do some nice HPLC. And -- if we have learned anything in the last few years it's that the informatics side of proteomics has been lacking -- in every area -- in every regard. (I do not mean this as a slight in any way to any of the great programs out there, but the people out there writing the new stuff aren't doing it in a vacuum. They're taking the traditional stuff, identifying the weaknesses, and fixing them. The reanalysis of beautiful old data with new better algorithms is basically what half our field is doing right now).

I can't follow all the weird Greek letters and all the Python scripts that this group has either developed, or painstakingly chosen for their daily operation from other groups (comparisons described in numerous previous studies) but I think that this idea is definitely worth exploring and you should check this paper out!

My favorite observation from the paper might be that going up to 240,000 resolution did not improve the number of identifications over 120,000 resolution. The authors' conclusion is that it's the relative loss in # of MS1 scans. In the end, the Orbitrap doesn't get any bigger when you crank up resolution. Any gains you get in resolving coeluting peaks are offset by the loss in speed.
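The trade-off is easy to put numbers on. Here is a back-of-the-envelope sketch, assuming the time per scan scales linearly with resolution and using 4 scans/second at 120,000 as the reference point. Both are simplifying assumptions rather than instrument specs, and the 10 second peak width in the usage note is made up.

```python
def scans_per_second(resolution, ref_resolution=120_000, ref_rate=4.0):
    """Estimated MS1 scan rate.

    Assumption: acquisition time per scan is proportional to resolution,
    so doubling the resolution halves the scan rate.
    """
    return ref_rate * ref_resolution / resolution


def points_across_peak(peak_width_s, resolution):
    """Number of MS1 scans that land on one chromatographic peak."""
    return peak_width_s * scans_per_second(resolution)
```

With those assumptions, `points_across_peak(10, 120_000)` gives 40 points and `points_across_peak(10, 240_000)` gives 20. Half the sampling of every peak is a lot to give up for resolving power.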

The deisotoping and peak detection was done with the Dinosaur algorithm. I only mention this now so I can use this as a valid excuse in my mind to use this great picture I just found.

Wednesday, September 11, 2019

SWARM-- Remove the adducts and clean up the data!

I'm not sure I get it. I probably shouldn't admit this, since the authors call it "straightforward" twice.

Wait. 3 times! And I think I get it now! Go espresso go!

Here's the problem: When you use ESI to ionize a protein you always get some dumb adducts, particularly if you are using some conditions to try and get the whole protein-protein or protein-ligand measurements and use ammonium acetate or whatever. It is less stressful when you've got one protein or a simple mixture, but it's a lot more stressful as your samples get more complicated.

The Sliding Windows part of SWARM is post-acquisition processing stuff. This is what threw me off. What if you assume that the adduct formation is a constant? Imagine you've got a single protein and you're going to incubate it with a ligand that will bind to it 1 or 2 or 3 or 4 times. You're already looking at intact mass spectra that aren't fun to figure out. Then imagine that you've got no adduct + adduct in there. Counting your no adduct no ligand protein you've got 10 (?) actual protein combinations present and multiple charge states of each! Gross. Your deconvolution algorithm is going to have a hard time on this and every time it picks an adduct on accident -- fake mass generated....

In the simplest instance of SWARM (if I've got this right) you would run your protein alone, with no ligands. Then you'd figure out what is your protein and what your protein adducts are. Now you make the assumption that no matter if you add 1 or 4 ligands the level of the buffer adducts wouldn't change. So you subtract out all the peaks that have the + adduct signature! Yeah! I think this is what it's doing.
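To convince myself the logic holds, here is a toy sketch of that subtraction in the deconvolved mass domain. This is NOT the authors' SWARM code. It only illustrates the "adduct pattern is constant" assumption, with a made-up function name and a spectrum stored as intensities on an evenly spaced mass grid. The adduct profile (offset in grid bins mapped to relative intensity) is what you'd measure from the protein-alone run.

```python
def remove_adducts(intensities, adduct_profile):
    """Subtract a constant adduct pattern from a mass-domain spectrum.

    intensities: list of intensities on a uniform mass grid (low to high).
    adduct_profile: {offset_in_bins: intensity relative to the bare species},
        with every offset > 0. Assumed identical for every species present.
    """
    cleaned = list(intensities)
    for i in range(len(cleaned)):
        # Whatever signal remains here is treated as a real (bare) species.
        base = max(cleaned[i], 0.0)
        cleaned[i] = base
        # Remove the adduct peaks that this species is expected to produce.
        for offset, ratio in adduct_profile.items():
            j = i + offset
            if j < len(cleaned):
                cleaned[j] -= base * ratio
    return cleaned
```

Walking from low mass to high means each real species gets cleaned up before its own adducts are subtracted from the bins above it. If the adduct ratio drifts between runs, the leftovers show up as small positive or clipped negative residuals, which is the caveat the authors flag.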

The authors demonstrate this works in simple cases then in more complicated cases, then backwards. The spectra are all acquired on a Waters QTOF with possibly an interesting nanospray ionization hack I'm unfamiliar with. (Could just be I haven't been around a Waters system in a loooong time). Deconvolution is handled mostly with UniDec and SWARM is implemented through custom Python scripts. If they're publicly available, I missed the link in the paper.

I'm glad I continued to stare at this thing in sleepy puzzlement. There is a lot of power here, just not in my espresso this morning. I hope that the deconvolution software writing people take note of this. For something like antibody drug conjugates, this could be enormously valuable. The authors are careful to note that the main assumption (that adduct formation dynamics are consistent) may not hold true in all cases, but where it does? I'll take any decrease in spectral complexity I can get.

Tuesday, September 10, 2019

Do you have TMTPro (TMT 16-plex)?!? Here is how you process the data!

First off. TMT and TMTPro and probably the word "plex" are the sole properties of Proteome Sciences. Trademark. Copyright. Whatever is necessary to keep me out of trouble. (Big R with a circle around it?)

Important stuff! (Don't sue me) We can plex 16 channels!!

Next -- HUGE shoutout to -- (Wait. Don't sue them either....I should anonymize the person who works for a company, shouldn't I....?) You know who you are, anyway! Dr. Secret Scientist 1 and Dr. Ed Emmott for the resources. I did none of this. Wait. I'll totally make the method templates for Proteome Discoverer because someone wrote me this morning and asked me for them. I'll contribute something!

#1: MAXQUANT for TMTPro?  Best of luck. Have fun. I won't help you at all with this, however...

Dr. Emmott (who will be opening his lab in 7 weeks in Liverpool) has made all the XML add-ins you'll need to modify MaxQuant to use these reagents and made them available via this DropBox.

(...thank you Ed! and good luck with the new program! Need help carrying boxes?)


This will require a couple of steps.

Step 1: You need to add the modifications to your instance of PD.

I recommend you update your UniMod

Both TMTPro 16 and TMTPro ZERO were uploaded today! I don't care about TMTPro Zero (sorry if you do, but you can figure it out. I believe in you! You're very very smart and people like you for obvious reasons.)

If you can't update your UniMod (offline or whatever) you can download this XML from my 100% totally nonprofit Google Drive thingy here.

Then you have to checkmark your TMTPro reagents, hit apply, and then in PD 2.2 I had to close my software and reopen it for it to take effect. Maybe in PD 2.3 as well.

Next you'll need to go to your Administration and import this quantification method (thanks Super Secret Scientist 1!)

Now you should have the picture at the very top of this way-too-long blog post!

16 quantification channels!

For proof that I've contributed something meaningful to human existence today. Here is the processing method for MS2 based TMTPro. You may note that the method name includes the words "probably wrong". I suggest you never get your methods from a completely nonprofit -- costs me a surprising amount of money each month to keep all these things going -- blog.

I wanted to make this as a reminder that TMTPro does not have the same mass you're used to at MS1.
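To put a number on that reminder, here is a quick sketch comparing precursor m/z for the same peptide labeled with the older TMT 6/10-plex tag versus TMTPro. The tag masses are the monoisotopic deltas as I remember them from UniMod (229.162932 and 304.207146), so check them against your freshly updated UniMod before trusting me. The function name is made up.

```python
# Monoisotopic tag masses in Da (recalled from UniMod; verify before use).
TAG_MASS = {"TMT6plex": 229.162932, "TMTpro": 304.207146}
PROTON = 1.007276


def labeled_mz(peptide_mass, n_sites, tag, charge):
    """Precursor m/z of a peptide carrying `n_sites` tags.

    n_sites is normally 1 (the N-terminus) plus the number of lysines.
    """
    total = peptide_mass + n_sites * TAG_MASS[tag]
    return (total + charge * PROTON) / charge
```

For a doubly charged peptide with one lysine (2 label sites), the TMTPro precursor lands about 75 m/z higher than the TMT 6-plex version. That matters for your scan range and any inclusion lists.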

And...according to Dr. Kamath (at a University, I checked, put them lawyers back on their leashes), who was just at a talk today about these reagents, you'll need to think about tuning your collision energy down for these reagents (hopefully to 27 NCE on a QE!)  I don't have details yet!

I think I'll be a reporter when I grow up. Today's blog isn't all that bad for a 9 year old.

Monday, September 9, 2019

Proteomics is not an island!

Okay -- move fast. That gif is super distracting and annoying! 

I just gave a Multi-Omics talk or two and this was great to draw from. What is daunting is that my talk centered on using metabolomics and genomics and proteomics in tandem.

What is a bummer is that
and some other things are coming and -- if the big three don't hold the answer to your disease or model, it may realistically be the others.....

Sunday, September 8, 2019

The Glycan (glycomics?) field is coming -- and they're not messing around!

There is a mandatory quote from someone at every talk about glycan modifications of proteins that's something like "glycans are involved in every human disease." I spent some time trying to find where that came from, but since I couldn't find anything conclusive, I'm going to blame Jerry Hart for it.

Glycan chain analysis suuuuuucks....they all have the same stupid masses. Is it a GlcNAc or is it a GalNAc? It's all the same stupid HexNAc mass. It sucks whether you approach them when you've liberated them or when they're still attached to the peptides. The bond energy is waaaaay different for glycosidic and peptide bonds and if you are using fragmentation that is biased toward bond strength like CID/HCD you are only going to get part of the picture. But smart people are still finding ways to tackle this stuff.
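Here is why the mass spectrometer can't help you with that first question. GlcNAc and GalNAc are isomers, so as residues in a chain they share the same elemental composition (C8H13NO5) and therefore the same monoisotopic mass. A minimal sketch, with a made-up function name:

```python
# Monoisotopic masses (Da) of the elements we need.
ELEMENT_MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}

# Residue (in-chain) compositions. GlcNAc and GalNAc differ only in
# stereochemistry, which elemental composition cannot see.
RESIDUE_FORMULA = {
    "GlcNAc": {"C": 8, "H": 13, "N": 1, "O": 5},
    "GalNAc": {"C": 8, "H": 13, "N": 1, "O": 5},
    "Hex": {"C": 6, "H": 10, "O": 5},
}


def monoisotopic_mass(formula):
    """Sum element masses for a composition like {"C": 8, "H": 13, ...}."""
    return sum(ELEMENT_MASS[el] * n for el, n in formula.items())
```

Both HexNAc entries come out at about 203.0794 Da, and no amount of resolution will separate two things with identical formulas. That's why the chromatography and fragmentation matter so much here.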

Today's awesome example: 

Okay -- as cool as it is to say the glycan and disease thing above -- it's a lot harder to do something that opens up glycomics capabilities to the world. And that is what we see in this great new preprint!  I'm not going to embarrass myself or insult anyone with my interpretation of the biology. You know what is important and I do get? This is making cell specific libraries of glycans. That you can get and I can get and everyone can get and now we can use them! Because as stupid as the masses of the individual sugar things are -- when they make chains they are different!

Something this group has been working on for a while is the use of porous graphite columns to resolve glycans chromatographically. The work I'd seen previously had looked great, but the MS/MS was ion trap (still cool!) but here we see this technique powered up on an Orbi Velos Pro.

This group is....intimidatingly.... good at this stuff.

Pooled QC samples? Check
Randomized samples? Check
Internal standards? Of course
Blank + internal standards to verify carryover protection? Oh yeah.
Data publicly available? On Panorama Public here and Glycopost!

Wait. What's a GlycoPost? This is where I change the title of this blog post. Cause this is a new-to-me data storage site for the glycan stuff!  WoooHooo!

It's just as easy to use as ProteomeXchange! I just pulled down some RAW files from glycan analysis of the Atlantic salmon!  Look....I'll never tell you what to do with your whatever you want. I've got salmon sugars to look at!

Back to the Crashwood -- how do you analyze glycan/glycomics data, anyway? Byonic (Protein Metrics) is used a lot, as well as GlycoMod -- which is really cool (link here) -- and something called the GlycoWorkbench (all code is available here!), and this is all used to filter down to Skyline for the quan and statistics. Glycomics still is not easy. It's hard even getting my head wrapped around all the stuff they did!

Okay -- but like I mentioned before -- I don't have to! Because I can just go to the Panorama public link above and just download their library!

BOOM! How cool is that? Look -- any study we do is a phenomenal amount of work and it's still definitely more work, but if you make it as easy as possible to get your resources and output that's how to ensure that you're making the future better!

Great study! 100% recommended.

I should be working on a talk, but I'm going to keep typing in this box.

Part 2: I want to remind you about SugarQb.

What's that? That's a totally free glycoproteomics search engine that works within the Proteome Discoverer framework. (There might be a stand-alone -- I forget)

There hasn't been a SugarQb paper, but it's been applied in a couple of great studies. You can get the nodes at

Here is the thing. SugarQb is great -- but only as great as the libraries that it has. This new resource from Gundry lab I just rambled on about allows me to power up my instance of SugarQb, because I can add this great new data to the human glycan library that I've got (it's just a CSV file!)
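Since it's just a CSV, adding new entries can be as dumb as merging two files and dropping the duplicates. A minimal sketch using only the standard library. The column name `composition` and the two-column layout in the usage note are hypothetical; I'm not reproducing the real SugarQb library format here, so check your own file's header first.

```python
import csv
import io


def merge_glycan_libraries(existing_csv, new_csv, key="composition"):
    """Merge two CSV texts, keeping the first row seen for each key value.

    Assumes both files share the same header row (hypothetical layout).
    """
    merged = {}
    fieldnames = None
    for text in (existing_csv, new_csv):
        reader = csv.DictReader(io.StringIO(text))
        if fieldnames is None:
            fieldnames = reader.fieldnames
        for row in reader:
            # setdefault keeps the existing entry if the key is already present.
            merged.setdefault(row[key], row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged.values())
    return out.getvalue()
```

Usage would look like `merge_glycan_libraries(open("old.csv").read(), open("new.csv").read())`, with the result written back out to a new file so the original library stays intact.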

I haven't had Byonic in a while, but unless things have changed -- it works the same way, and obviously this all works in Skyline as well!

Saturday, September 7, 2019

30 seconds to make the world better? Help Skyline!

The amazing wonderful and -- free! -- unifier of all things mass spectrometry -- Skyline software is supported by a combination of grants and direct vendor support. To keep this great package of tools we take for granted going, we periodically need to inform people that 1) we are here! and we're a growing field! 2) we take Skyline and our ability to compare data from instrument to instrument and from lab to lab for granted in what we all do.

Grants are due next week, meaning Skyline needs your help right now!  You can spend 30 seconds showing support here at this link (or on the image below), or spend 3 minutes writing a letter and upload it.

Friday, September 6, 2019

TMT 16-plex is out! And someone already used it for single cell!!

We've been hearing rumors about this for months. And now you can order it here!

...and some people have already had access to it -- check out this KILLER application of it in ScoPE-MS!! for single cell!! Since the effects of ScoPE-MS are essentially additive, more channels equals more sensitivity (though...1 cell at a time isn't a lot...but it's better than having one cell less!)

I don't have time to read this..yet...but the HF-X collected MS/MS at 45,000 resolution, so it looks like it seamlessly integrates into your workflow -- just with a bunch more channels!

Oh. Need to process the data? Here you go!

Correction: This study doesn't appear to use the TMTPro. Still a great title, though....

Thursday, September 5, 2019

Determination of Proteolytic Proteoforms with HUNTER!

I really truly try to read at least one paper in its entirety each day. It's a rule that I started when I worked for the great Michal Fried and I thought it was the only way I'd ever be able to have a chance of having any context for how to help apply what I know how to do to the brilliant medical stuff she does. A really bad day for me is when I don't get to even a single one.

A great day is when I start something like this brilliant new MCP study and I am learning from the very first sentence!


Wait. So....did you know this? Should I blame this on all-too-frequent head impacts?  Look, I know about caspases. I know that they're amazing things to ignore when we're doing proteomics (that your quantitative difference might actually be that one set of cells has decided to go into its own death cycle and is degrading its own proteins) but is there truly something that we should just be considering in all systems that is as broad-stroke as N-terminal degradation?    My ignorance in biology aside -- that we could be deriving important context from how one side of a protein is systematically degraded -- how on earth would you quantitatively measure something like this?

TAILS would be my first thought. This technique is covered in these past posts (1, 2, 3)

HUNTER is detailed here and seems like TAILS went all Super Saiyan....wait...[Google]

....okay...of course that is a thing. YouTube video here....

I'd like to point out here that TAILS is a tough experiment from the sample prep side. HUNTER looks crazy hard. As someone who IS NOT good at sample prep, this looks like something I'd only try once I found someone really talented at doing it (or programming a sample handling robot) to do it.


The authors walk you through how to do it manually as well -- but here are step by step instructions (sorry) on how to set up a robot for an impossible sample prep design!

As further proof of 1) this technique totally works and 2) it can be applied to various biological systems and 3) it produces useful biological data from all of them -- they apply this method to a variety of human systems and to plants.

By selectively labeling and enriching for N-terminal peptides, they demonstrate the recovery, identification and quantification of >1,000 N-terminals even when they start with micrograms of material.....

The LCMS work is demonstrated on both a Q Exactive HF and a Bruker Impact II, showing that this technique with all of its power and apparent biological significance can be applied in any proteomics lab. Do I fully get why you'd want to do it from a biology level? Nope! But I know an awful lot of biological models out there where the -omics hasn't solved the phenotype...and here is a fully mature technique provided in excruciating detail that might be the way to the answer.

Wednesday, September 4, 2019

6 hour gradients + HF-X + DIA = 10,000 Human Proteins

...well....maybe I'll interpret the peak width stuff a little later....BECAUSE EVERY INDIVIDUAL FILE IS OVER 15GB!!!  (at least the ones I'm interested in from this study, based on the results they report)

The title of this post kind of sums it up.

This team looks at several different chromatography conditions and materials to gradually build an ideal gradient for their ultra-long run DIA analysis. I think they settle on a 60cm column of CSH solid phase at 250nL/min. This is probably a really good idea because they use an EasyNLC 1200 system and at higher flow rates you'd probably run short of buffer.

0.3 uL/min x 360 min (6 hours x 60 min) = 108 uL? (That's rounding the flow up to 300 nL/min; at the 250 nL/min above it's 90 uL.) Okay, so not as bad as I'd have thought. The total pump capacity is only 140 uL on each pump. If you assume that you use around 12 uL to load (didn't look, but that's typically what I expect) you're still okay.
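If you want to play with the numbers for your own gradients, the arithmetic fits in a few lines. A small sketch: the 140 uL capacity and 12 uL loading volume are the figures quoted above (the loading volume is my guess, not from the paper), and the function names are made up.

```python
def gradient_volume_ul(flow_nl_per_min, gradient_min):
    """Total solvent volume (uL) delivered over the gradient."""
    return flow_nl_per_min * gradient_min / 1000.0


def fits_pump(flow_nl_per_min, gradient_min, load_ul=12.0, capacity_ul=140.0):
    """True if gradient plus loading volume fits in one pump fill.

    Simplification: treats the whole run volume as coming from one pump,
    which is a worst case for a two-pump binary gradient.
    """
    return gradient_volume_ul(flow_nl_per_min, gradient_min) + load_ul <= capacity_ul
```

At 300 nL/min for 360 minutes that's 108 uL (120 uL with loading), which fits. Push the flow to 400 nL/min and you're at 156 uL total, which doesn't.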

They use an HF-X system with 120,000 resolution MS1 and it looks like 30,000 resolution DIA windows, but 60,000 when the gradient gets to 6 hours and beyond. 60,000 resolution scans take a long time. Your peaks are gonna shift by the time you get through a full cycle. To compensate for this, they throw in an additional MS1 scan part-way through to allow AGC to have better data to work off of.

MS1 (AGC calculation) - DIA/DIA/DIA/ MS1 (AGC calculation) DIA/DIA/DIA/DIA - Repeat (number of windows not accurate)
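That interleaving is simple to express as a scan list. A toy sketch that drops an MS1 scan in front of every block of DIA windows. Like my cartoon above, the window count and block size are placeholders rather than the paper's actual method, and the function name is made up.

```python
def build_cycle(n_windows, ms1_every):
    """Return one acquisition cycle as a list of scan labels.

    An MS1 scan is inserted before every `ms1_every` DIA windows so AGC
    has a fresh survey scan to work from part-way through the cycle.
    """
    cycle = []
    for i in range(n_windows):
        if i % ms1_every == 0:
            cycle.append("MS1")
        cycle.append(f"DIA{i + 1}")
    return cycle
```

`build_cycle(6, 3)` gives MS1, DIA1, DIA2, DIA3, MS1, DIA4, DIA5, DIA6, and then the whole thing repeats.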

3 normalized energies are used -- 25.5, 27, and 30. I find this surprising because a lot of the recent DIA work I've seen has used direct eV for the fragmentation since the normalization doesn't do much. This is easier, and it's interesting to me that such a small step is worth the effort of putting it in!

SpectroNaut is the data analysis software and they do some interesting stuff with the data processing. In some experiments they rely on a library made directly from .FASTA, though it looks like ultimately the best data is obtained when they use it in combination with real libraries.

I'd hoped to look at the RAW data, but it looks like my ConCast home internet has said no. I've got 4GB downloaded and it still says 2 hours for one file. If you're interested in DIA there is a solid amount to learn from this new study.

Tuesday, September 3, 2019

It's MASH Suite time!

Have I talked about this yet? I forget and don't care!

I have the deepest and most profound respect for Dr. Neil Kelleher. He's always looking 20 years ahead and his lab has produced some of the best mass spectrometrists I've ever had the pleasure to work with. And he's done this all by being disciplined and 100% serious at all times. You'll never catch him wasting a second doing anything ridiculous. That's his secret, I think.

But -- I'll be honest here -- I've never ever in years of working with it been able to figure out how to use ProSightPC. I have many friends who have figured it out and use it all the time.  It's amazingly powerful software. It is the industry standard by 100 miles, but I'm too dumb to get it.

And -- what else do you use for top down proteomics? The weird command line thing someone at NCBI wrote in 1981? I mean, that'll totally do stuff probably. Not for me. (I assume all command line things were originally written on a Commodore 64).

In what I think might be the first serious back up plan for those of us with various ProSightPC deficiencies -- you should 100% check out the free MASH Suite software!


How much does it cost? Nuthing!

Do you have to read a manual? I guess not! I sure didn't and it's been deconvoluting and searching intact protein data for me for months.

For real, it might be a neurochemistry issue. Maybe my childhood fear of PUFfins has something to do with it (top down protein analysis jokes...) but I can make this software do stuff.

And check out all the options it has!

5 deconvoluters! And none of them are Xtract!! Xtract is awesome for one protein. Xtract is NOT AWESOME for cell lysates. If I'm firing up BioPharma Finder with Xtract, I do it before I go home for the night. Fingers crossed  -- it might be done in the morning!

At the very least this is a great new set of tools -- for free!  And they're surprisingly easy to use!

You can get them at the Ye lab website here.

Oh -- and v1.1 just came out this weekend. If you've got the older one, go to "Remove programs" and uninstall it. You'll want v1.1.

Monday, September 2, 2019

How many proteins should there be, anyway?!?

I've spent a lot of time this year wondering things like "okay -- so -- how many fracking proteins are there supposed to be here, anyway?" And the answers are surprisingly murky.

What I do know -- proteomics loooooooves to use cancer cell lines. You know why? Because...

They aren't normal human cell environments. For one, most of them can't stop dividing regardless of what damage they pick up. "Oh....this neuroblastoma cell line is now expressing tooth enamel production proteins? Not normal, but it probably won't stop that cell from continuing to grow."

If you're doing work on healthy human brain tissue, you probably shouldn't see those tooth enamel production proteins, right?

We all have a decent feel for what we should get out of HeLa digests on our instruments (or HEK or K562 or whatever) and unless you're doing cancer stuff all day those numbers are probably crazy high compared to what you're normally doing. Here is the question, though, how many should be there?

The picture at the top is taken from this Human Protein Atlas page.  Of 19,000 or so human proteins, around 11,000 are found in the human liver. Okay -- I actually chose the human liver as an example at random, but this actually comes from this brand new paper.

There aren't just liver cells -- the liver is an organ made of all sorts of different types of cells.

I'd assume that there is no way that a Kerpuffle cell would express every protein that a Marovaculus encoshelail cell would (if they did, they'd be the same cell, right?) so if we subsection the liver cells by flow cytometry or by laser capture microdissection then we'd expect that number of proteins to drop off markedly, right? We're talking less than 11,000 now. A lot less?

Seems very cell-type specific. For example, probably on the low end are the boring simple old red blood cells. Two recent studies (posts 1 and 2 here) suggest they may only have 2,000 or 3,000 total proteins. They don't have to do much but haul hemoglobin and malaria parasites around. They don't need a ton of proteins. I'd expect everything else goes up from there?

Getting a good answer this morning has been tougher than I thought it would be...if anyone knows of a good breakdown or review, that would be great. I feel like I should be able to make one of the Atlas projects make a chart for me, but I haven't figured it out yet. I also can't figure out my stupid washing machine (what ever happened to a dial? what's wrong with the spring loaded -- wash -- spin -rinse -spin? why does a washing machine need a really crappy touch screen user interface?) so -- grain of salt...it's probably easy....

Scholar insists that the answer is in this paper (it isn't. this title promises a lot. the paper doesn't deliver)

What about the human protein map (JHU version)?  AHA!

There is this sweet chart that provides solid insight --

The bottom chart is all 30 tissues they tested. There are 2,350 (far right) proteins that were found in every cell type they checked out. On the opposite end are genes/proteins that are unique to one single tissue/cell. Most are in the middle. I think this says a lot -- like the Venn diagram would be horrendous to look at -- OMG -- it would make the best UpSetR plot...though....okay......I've got other stuff I should be doing. This makes sense to me. I don't think RBCs were done, but they'd be the low end -- in this 2,500 protein range and we'd see this complexity all the way up, since this should all be additive, but each human cell type would exist on a spectrum ranging from 2,500 proteins right on up.
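If you have per-tissue protein lists, rebuilding that kind of chart yourself takes a few lines. A minimal sketch: count how many tissues each protein shows up in, then histogram those counts. The toy data in the usage note is invented, not pulled from the Human Proteome Map, and the function name is made up.

```python
from collections import Counter


def tissue_count_histogram(tissue_to_proteins):
    """Map 'number of tissues' -> 'number of proteins seen in that many'.

    tissue_to_proteins: {tissue_name: iterable of protein identifiers}
    """
    per_protein = Counter()
    for proteins in tissue_to_proteins.values():
        # set() so a protein listed twice in one tissue is counted once.
        per_protein.update(set(proteins))
    return Counter(per_protein.values())
```

With a toy input like `{"liver": ["ALB", "ACTB"], "brain": ["ACTB", "GFAP"], "rbc": ["ACTB", "HBB"]}` you get three proteins found in exactly one tissue and one found in all three. The far-right bar of the real chart is the "found everywhere" bin.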

Wait. What was the point of this? It wasn't to ask a question and then say -- "sorry, I totally don't know" but that seems to be what happened. There is a take-away, though!

If you're running some proteomics experiments, don't freak out if you don't get the 6,000 or 8,000 or 16,000 proteins that you expect from your HeLa cell line under the same conditions. Your cells probably don't have that many proteins. Probably if you look hard enough in the literature for your specific organ or cell, there is guidance on what you should expect. (Transcript studies like this one might be useful guidance -- if it isn't transcribed, it won't be translated so it may be the high numbers excluding posttranscriptional/translational thingies).

Chances are it's a lot lower than your cancer control digest, and the more homogeneous the cells going into your digest are the lower that total # of proteins ID'ed should be.

Sunday, September 1, 2019

There is still no convincing evidence for the frequent occurrence of posttranslationally spliced HLA-I peptides.

HLA peptides are the hottest thing to talk about in mass spectrometry in the US. There are probably 20 posts on this dumb blog about them demonstrating how little I know about them, and -- immunology in general...

Why they're important:
If you know the neoantigen things on the cell surface you can specifically target those cells for destruction. There have been some successes and many, many, many failures.

A recent hypothesis is very controversial, that part of the reason we can't figure it out is that during the protein processing in the whatever-its-called proteins are post-translationally spliced and kicked out.

Zach Rolfs et al. disagree. This is the abstract. The entire abstract.

This short paper brings up a really important point. So important that I'll use both italics and bold again. Your database you use for both forward and decoy searches can massively influence the results of your proteomics search. I assume that most of the misguided souls who read this blog just rolled their eyes so hard at this last sentence that it hurt their ears. Yes. Obviously it does. However, have you seen an example as important as this?

There are people who are attempting to make antibody based drugs to target these peptides that are being demonstrated on the cell surface of cancer and other diseases.

This team goes to a previous study and reanalyzes the previous study's data with the same software using the same settings and all they do is change the way the FDR stuff is done.

And the results are completely different. The spliced peptides appear to disappear. Almost completely.
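For anyone who hasn't looked under the hood lately, here is the basic target-decoy estimate that everything hinges on. This is a minimal sketch of the generic approach, not the specific procedure used in either the original study or the reanalysis. The function name and input layout are made up.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) at a score cutoff.

    psms: list of (score, is_decoy) tuples; higher score means better match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

The estimate is only as good as the decoys. If the target search space is huge (like every possible spliced peptide) but the decoys don't mirror it, the decoy count comes out too low and the reported FDR looks better than it really is.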

Which is right? The original study? The new re-analysis? Why would you ask a blogger?

What I do know? That biology shit looks hard. Last year I did hundreds of quality checks on antibody drug conjugates, and there was exactly one person there (who makes like $27k in an exploitation that the NIH is allowed to do that is called a "postbac") who ever sent an antibody that actually was what he was trying to make.

My assumption isn't that everyone else was dumb and useless, it was that you had to be really gifted to make antibody based drugs.

And if someone is going to pay a really gifted person $8 an hour to work on this -- we should manually review the data we send them, or at least come up with a better and smarter way to do FDR on endogenous peptides! And this looks like a step in the right direction!

Saturday, August 31, 2019

An amazing clinical scale proteogenomic study of lung cancer!

This is just a beautiful piece of work demonstrating what proteomics and genomics can do in tandem to really find the true differences in patient samples in diseases that may look -- to traditional assays -- to be the same disease.

I'm going to try and not type much because I can't do this justice. I don't even know what the word squamous means. I'd assume that if there is a squamous then that is a subtype of lung cancer and that these might be enough of a subcategory that we'd treat them the same way. What this huge amount of work shows, clearly enough that it even gets through to me, is that these aren't the same thing and that if we didn't know that -- and we treated these patients the same way -- we'd get massive differences in treatment response.

Details I do understand:
Start with 108 patient samples that have corresponding clinical, molecular, pathology AND outcome data.
Have someone who understands that experimental design needs to be done up front. Up front. You need to think about this. Maybe get a statistician to say a lot of boring things but help you so that you can draw meaningful data out later.

TMT 6-plex was used with pooled controls and lots of smart statistics to combine all the data.
Combine the output with genomics data you also get on all the patients. (Q Exactive Plus for the TMT. Sorry MS3 nerds. MS2 totally works). DIA is also employed with little tiny 5 Da windows!

Instead of making fun of the other technology (which...I do all the time, of course...look, I realize RNA technologies have value. I just don't think they have 100x the value of protein data, but that's what the outside world spends on the RNA stuff) combine it.

Instead of making fun of how primitive clinical diagnostic assays in the US are because we have to give $99 out of every $100 we spend on healthcare here to corporate money hoarders, use these data to help make sense of the patterns you've found with these modern assays your hospitals could totally afford if we'd recognize that being a billionaire is actually a hoarding disease that is way more gross than having 32 cats in your house.

What can you get if you combine all this? New and more powerful therapeutic opportunities for disease subtypes.
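On the "pooled controls" bit from the TMT details above: the usual reason to put the same pooled sample in every plex is so each channel can be expressed relative to it, which puts separate plexes on a common scale. A minimal sketch of that idea. The paper's actual statistics are surely smarter than this, and the function name and data layout are made up.

```python
def normalize_to_pool(plex, pool_channel="pool"):
    """Convert reporter intensities to ratios against the pooled channel.

    plex: {protein: {channel_name: intensity}}
    Proteins with a missing or zero pool intensity are skipped.
    """
    ratios = {}
    for protein, channels in plex.items():
        ref = channels.get(pool_channel)
        if not ref:
            continue
        ratios[protein] = {
            ch: value / ref for ch, value in channels.items() if ch != pool_channel
        }
    return ratios
```

Once every plex is converted this way, a patient in plex 1 and a patient in plex 12 are both "fold versus the same pool", and you can start comparing them.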

Okay -- I lied -- I was going to type a lot. Check this out: You've developed a great new chemotherapeutic and it goes through all the hurdles to get to human clinical trials. You get a bunch of patients with squamous cell lung cancer -- if you didn't know there were these important subgroups and lumped them all together you could fail that trial! For a drug that could totally and completely help some of the patients in that subgroup!

Awesome study. 100% recommended. Lots and lots of words in it I don't understand, but I'm still really optimistic about it.

Friday, August 30, 2019

The Multi-Omics Cannabis Draft Map Project!

I swear, I think this will be the last self-promotional post for a while. I'm just so insanely unbelievably relieved to get this off my desk and out for review. It's been a long 7 months of work on this project. Some people had seen the first preprint, but that was, in large part, just me letting the world know that I'd started doing work in this brand new field where we're allowed to do research now in the U.S.

Want to check it out?

Oh and here is the new preprint.

It's been almost impossible in the US to do research on Cannabis plants because they've been illegal. In January or something, the federal government passed "the hemp act" and -- BOOM! people can start growing these plants with different state to state restrictions although some massive confusion exists about what is allowed federally and from state-to-state. A lot of universities now have research greenhouses!

Turns out -- no one has EVER done modern proteomics on these plants! There were some 2D-gel based studies a while back and some MALDI stuff here and there and all of a sudden I'd ended up in a position, mostly by chance where I was one of the first people who could get access to research material. Yeah! The little research I've done in my life has been mainly incremental biology or technical things.  How awesome would it be to do the first comprehensive proteome on an organism people have heard of?  A bunch of my friends were on the Johns Hopkins version of the Human Proteome Draft. What they did is way more important. What we did was funnier.

Everything was going really well. It turned out to be super easy to get the proteins out and the peptides digested at high efficiency by just freezing the material, smacking it with a hammer and then doing FASP. We did it all match between runs style: combine/fractionate, build a library (2 hour runs on the HF-X), then run each individual sample separately with match between runs for quan.

Then it hit us -- NO ONE HAD EVEN COMPLETED THE GENOME OF THE PLANT. Ugh. People had done some sequencing and deposited the data. So we 6 frame translated 3 (surprisingly bad) genomes, combined the data, did proteomics and went to ASMS bound and determined to figure out how to make an annotated FASTA file from genomic data.
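For anyone who hasn't done it, here is a minimal sketch of what a 6 frame translation looks like. This is not the pipeline we used; the function names and the `min_len` cutoff are made up for illustration, and it assumes the standard genetic code and plain A/C/G/T input. Each strand is read at three offsets, translated, and split at stop codons to give candidate protein sequences.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=3):
    """Return (strand, offset, peptide) for every stop-free stretch in all 6 frames.

    min_len is a hypothetical cutoff; a real search database would use
    something much longer to keep the size manageable.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for piece in protein.split("*"):
                if len(piece) >= min_len:
                    orfs.append((strand, offset, piece))
    return orfs
```

Run that over a whole genome and you can see where tens of millions of theoretical sequences come from: most of the frames are nonsense, and nothing in the DNA alone tells you which ones.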

What we learned in Atlanta:
1) Everybody wants to know how to easily combine genomics and proteomics.
2) 5 -- maybe 6 people in our field can actually do it
3) It involves using genomics tools and genomics data to filter and QC and align the data.
 A) I don't know how to do
 B) I don't actually want to know how to do, because -- for real -- next gen genomics data is crappy. You acquire 100x coverage over your genomes because 99% of it is crap. For real, I'm not making that up. There was a great paper to this effect years ago. I thought MacCoss was on it, but I can't find it right now... Imagine an ion trap running at 500 Hz and what that data quality would be like (yes, I made up that number). Sure, there is real data in there, but you could also say anything you wanted by lining up your hypothesis to the noise. That's why bioinformaticians and clusters are so important for next gen. You need power and experience to tell the real stuff from crap (mostly by looking for the most repeats of the evidence, particularly in short read sequencing where you might never have 100% overlap of your repeats).
This is taken from a talk David Tabb gave at ASMS this year. Everything in green is the stuff I didn't know how to do that I thought I could get away with never learning how to do because I'd use this as my filter. Guess what -- I think I'm right. I think we can easily use proteomics to help us build good FASTAs of unsequenced organisms! I have some old projects that I couldn't complete because I couldn't do this before.

I have >40M theoretical protein sequences from the next gen stuff.
Only like 400,000 non-redundant ones have matches to my high resolution MS/MS. There's my filter! Throw away 39.6M theoretical sequences!
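The filter idea is simple enough to sketch in a few lines. This is a toy version under some loud assumptions: the function and variable names are hypothetical, the peptide list is assumed to already be FDR-filtered by your search engine, and a plain substring check stands in for the peptide-to-protein mapping a real search tool would report (you would not want to brute-force 40M sequences this way).

```python
def filter_by_peptide_evidence(candidates, peptides, min_peptides=1):
    """Keep non-redundant candidate sequences supported by identified peptides.

    candidates: dict of {name: theoretical protein sequence}
    peptides:   iterable of peptide sequences identified by MS/MS
    min_peptides: hypothetical threshold for how many distinct peptides
                  must map to a sequence before it is kept
    """
    peptides = set(peptides)
    kept = {}
    seen = set()
    for name, seq in candidates.items():
        if seq in seen:
            continue  # identical sequence already evaluated under another name
        seen.add(seq)
        hits = sum(1 for pep in peptides if pep in seq)
        if hits >= min_peptides:
            kept[name] = seq
    return kept


# Tiny made-up example: two redundant ORFs with evidence, one with none.
candidates = {"orf1": "MAKLLSR", "orf2": "MAKLLSR", "orf3": "GGGGGG"}
supported = filter_by_peptide_evidence(candidates, ["LLSR"])
```

Here `supported` keeps only `orf1`: `orf2` is a redundant copy and `orf3` has no peptide evidence. Raising `min_peptides` to 2 is the obvious knob if single-peptide hits make you nervous.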

All the stuff that is in the red circle above is also stuff I didn't know how to do, but learned how!

What do we get?
The first ever annotated protein FASTA for Cannabis plants. BOOM!
Then we could use that FASTA with all the normal tools to finish the project right.

I submit for your amusement my favorite protein bioinformatic flowchart of all time.

 What did we learn?

1) Lysine acetylation is highly involved in the production of the chemicals people seem to care about in cannabis plants. Definitely the terpenes, possibly the cannabinoids (there is a noisy spectrum in the supplemental that needs to be verified later)
2) The proteins may be glycosylated everywhere, but we need to work on it more because it looks like the second sugar in every chain is not one of the ones I know about.
3) The flowers of the plant make hundreds of unknown small molecules (there are way way more than 20 variants on the normal cannabinoids. There are hundreds!)

All MS/MS spectra are available at the site in MGF form.
We have created a Skyline spectral library containing all the PSMs.
We've also created a 40GB file containing every spectral annotation.
While we were doing this another group released some Orbi Velos based proteomics on Cannabis flowers (paper link). Since they used only the 300 or so proteins available on UniProt for the plant, they only identified maybe 180 proteins? Using our FASTA we can re-search their data and come up with around 2k-3k (more what you'd expect from their experimental design).

Oh yeah! And we made an online tool that will tell you what chromosome each protein we ID'ed came from (unfortunately the chromosomes come from short-read sequencing, so it's not the most comprehensive; I've got an idea to fix that). We've also done some fun things like aligning our new proteins to ones that have Swiss-Model 3D models. Oh, and we did a little proof of concept trying to figure out how to identify fake Cannabis products using rapid metabolomics. People are making all sorts of counterfeit "vape cartridges" here in the US that have made people seriously ill. Maybe metabolomics can help tell the really sophisticated counterfeits from the real thing.

The protein FASTA can be downloaded on the site, as well as all our metabolites with our hypotheses for the identities of said metabolites. There are also two sneak peeks into cool new informatic software that Conor has been developing around his 70 hour/week job and final year of classes.

The first is a tool that can pull out spectra that contain diagnostic ions. For example, if you're interested in lysine acetylation, the ProteomeTools project showed that these peptides commonly produce a strong diagnostic ion at m/z 126.0913. Conor's script can just pull out and count all the spectra that have those. The second is a tool kit for easy correlation analysis between metabolites, transcripts and proteins if you have quantitative data on all of them. Both of these things are Python tools, and bundling them into a more user friendly interface is an area of focus.
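To give a feel for the diagnostic ion idea, here is a rough sketch of pulling spectra with a 126.0913 peak out of an MGF file. To be clear, this is not Conor's script. The function names, the 10 ppm tolerance and the relative intensity cutoff are all my assumptions, and the parser only handles the basic BEGIN IONS / END IONS layout with a TITLE line and "m/z intensity" peak lines.

```python
def read_mgf(lines):
    """Yield (title, [(mz, intensity), ...]) for each spectrum in MGF lines."""
    title, peaks, inside = None, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            title, peaks, inside = None, [], True
        elif line == "END IONS":
            if inside:
                yield title, peaks
            inside = False
        elif inside and line:
            if line.startswith("TITLE="):
                title = line[len("TITLE="):]
            elif line[0].isdigit():
                # Peak lines start with a number; header lines (PEPMASS=...) do not.
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


def has_diagnostic_ion(peaks, target_mz=126.0913, tol_ppm=10.0,
                       min_rel_intensity=0.0):
    """True if any peak is within tol_ppm of target_mz and intense enough.

    min_rel_intensity is a fraction of the base peak (0.0 accepts any peak).
    """
    if not peaks:
        return False
    base_peak = max(intensity for _, intensity in peaks)
    tol = target_mz * tol_ppm / 1e6
    return any(
        abs(mz - target_mz) <= tol and intensity >= min_rel_intensity * base_peak
        for mz, intensity in peaks
    )


def spectra_with_ion(lines, **kwargs):
    """Return the titles of all spectra containing the diagnostic ion."""
    return [title for title, peaks in read_mgf(lines)
            if has_diagnostic_ion(peaks, **kwargs)]
```

Usage would be something like `with open("my_run.mgf") as handle: hits = spectra_with_ion(handle)` (the file name is a placeholder), and `len(hits)` gives the count. The ppm tolerance only makes sense for high resolution MS/MS, which is what we have here.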

Final note (warning?): if you create a cool protein informatics tool and you don't create a cool icon for it, I may have to do it myself.