Saturday, August 31, 2019

An amazing clinical scale proteogenomic study of lung cancer!

This is just a beautiful piece of work demonstrating what proteomics and genomics can do in tandem to really find the true differences in patient samples in diseases that may look -- to traditional assays -- to be the same disease.

I'm going to try and not type much because I can't do this justice. I don't even know what the word squamous means. I'd assume that if there is a "squamous" then that's a subtype of lung cancer, and that it's enough of a subcategory that we'd treat all of those patients the same way. What this huge amount of work shows -- clearly enough that it even gets through to me -- is that these aren't the same thing, and that if we didn't know that and treated these patients the same way, we'd get massive differences in treatment response.

Details I do understand:
Start with 108 patient samples that have corresponding clinical, molecular, pathology AND outcome data.
Have someone who understands that experimental design needs to be done up front. Up front. You need to think about this. Maybe get a statistician to say a lot of boring things but help you so that you can draw meaningful data out later.

TMT 6-plex was used with pooled controls and lots of smart statistics to combine all the data.
Combine the output with genomics data you also get on all the patients. (Q Exactive Plus for the TMT. Sorry MS3 nerds. MS2 totally works). DIA is also employed with little tiny 5 Da windows!
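The study's actual batch-correction statistics are far more involved than this, but the basic trick of bridging TMT plexes through a shared pooled control channel can be sketched like so (channel layout and function name are mine, not the paper's):

```python
import numpy as np

def bridge_normalize(plex_intensities, ref_channel=0):
    """Express each sample channel as a log2 ratio to the pooled
    reference channel, so values are comparable across TMT plexes."""
    x = np.asarray(plex_intensities, dtype=float)
    ref = x[:, ref_channel:ref_channel + 1]        # pooled control channel
    samples = np.delete(x, ref_channel, axis=1)    # patient channels
    return np.log2(samples / ref)

# Two toy 6-plexes that share a pooled reference in channel 0.
# Same biology at half the absolute signal gives identical ratios.
plex_a = [[100.0, 200.0, 50.0, 100.0, 400.0, 100.0]]
plex_b = [[50.0, 100.0, 25.0, 50.0, 200.0, 50.0]]
assert np.allclose(bridge_normalize(plex_a), bridge_normalize(plex_b))
```

The point of the toy assert: once everything is a ratio to the shared pooled control, plex-to-plex differences in absolute signal drop out.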

Instead of making fun of the other technology (which...I do all the time, of course...look, I realize RNA technologies have value. I just don't think they have 100x the value of protein data, but that's what the outside world spends on the RNA stuff) combine it.

Instead of making fun of how primitive clinical diagnostic assays in the US are because we have to give $99 out of every $100 we spend on healthcare here to corporate money hoarders, use these data to help make sense of the patterns you've found with these modern assays your hospitals could totally afford if we'd recognize that being a billionaire is actually a hoarding disease that is way more gross than having 32 cats in your house.

What can you get if you combine all this? New and more powerful therapeutic opportunities for disease subtypes.

Okay -- I lied -- I was going to type a lot. Check this out: You've developed a great new chemotherapeutic and it goes through all the hurdles to get to human clinical trials. You get a bunch of patients with squamous cell lung cancer -- if you didn't know there were these important subgroups and lumped them all together, you could fail that trial! For a drug that could totally and completely help some of the patients in one of those subgroups!

Awesome study. 100% recommended. Lots and lots of words in it I don't understand, but I'm still really optimistic about it.

Friday, August 30, 2019

The Multi-Omics Cannabis Draft Map Project!

I swear, I think this will be the last self-promotional post for a while. I'm just so insanely unbelievably relieved to get this off my desk and out for review. It's been a long 5 or 6 months of work on this project. Some people had seen the first preprint, but that was, in large part, just me letting the world know that I'd started doing work in this brand new field where we're allowed to do research now in the U.S.

Want to check it out?

Oh and here is the new preprint.

It's been almost impossible in the US to do research on Cannabis plants because they've been illegal. In January or something, the federal government passed "the hemp act" and -- BOOM! -- people can start growing these plants, with restrictions that vary from state to state, although massive confusion remains about what is actually allowed federally versus at the state level. A lot of universities now have research greenhouses!

Turns out -- no one has EVER done modern proteomics on these plants! There were some 2D-gel based studies a while back and some MALDI stuff here and there, and all of a sudden I'd ended up in a position, mostly by chance, where I was one of the first people who could get access to research material. Yeah! The little research I've done in my life has been mainly incremental biology or technical things. How awesome would it be to do the first comprehensive proteome of an organism people have actually heard of? A bunch of my friends were on the Johns Hopkins version of the Human Proteome Draft. What they did is way more important. What we did was funnier.

Everything was going really well. It turned out to be super easy to get the proteins out and the peptides digested with high efficiency by just freezing the material, smacking it with a hammer, and then doing FASP. We did it all match-between-runs style: combine and fractionate, build a library (2 hour runs on an HF-X), then run each individual sample separately with match between runs for quan.

Then it hit us -- NO ONE HAD EVEN COMPLETED THE GENOME OF THE PLANT. Ugh. People had done some sequencing and deposited the data. So we 6 frame translated 3 (surprisingly bad) genomes, combined the data, did proteomics and went to ASMS bound and determined to figure out how to make an annotated FASTA file from genomic data.
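For what it's worth, the six-frame translation step itself is the easy part. A minimal sketch (standard codon table; "X" for anything unexpected):

```python
# Minimal six-frame translation: three reading frames on each strand.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {b1 + b2 + b3: aa
          for (b1, b2, b3), aa in zip(
              ((x, y, z) for x in BASES for y in BASES for z in BASES), AAS)}
COMP = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one frame; 'X' for codons with ambiguous bases."""
    return "".join(CODONS.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """All six reading frames: forward and reverse complement."""
    rc = dna.translate(COMP)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, rc) for frame in range(3)]

frames = six_frame("ATGGCCTGA")
assert frames[0] == "MA*"   # frame +1: ATG GCC TGA -> Met-Ala-stop
```

The hard part, of course, is everything after this: ORF calling, redundancy, and turning millions of these into a usable FASTA.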

What we learned in Atlanta:
1) Everybody wants to know how to easily combine genomics and proteomics.
2) 5 -- maybe 6 -- people in our field can actually do it
3) It involves using genomics tools and genomics data to filter and QC and align the data.
 A) ...which I don't know how to do
 B) ...which I don't actually want to know how to do, because -- for real -- next gen genomics data is crappy. You acquire 100x coverage over your genome because 99% of it is crap. For real, I'm not making that up. There was a great paper to this effect years ago. I thought MacCoss was on it, but I can't find it right now... Imagine an ion trap running at 500 Hz and what that data quality would be like (yes, I made up that number). Sure, there is real data in there, but you could also say anything you wanted by lining your hypothesis up with the noise. That's why bioinformaticians and clusters are so important for next gen. You need power and experience to tell the real stuff from the crap (mostly by looking for the most repeats of the evidence, particularly in short read sequencing, where you might never have 100% overlap of your repeats).
This is taken from a talk David Tabb gave at ASMS this year. Everything in green is the stuff I didn't know how to do and thought I could get away with never learning, because I'd use this as my filter. Guess what -- I think I'm right. I think we can easily use proteomics to help us build good FASTAs of unsequenced organisms! I have some old projects that I couldn't complete because I couldn't do this before.

I have >40M theoretical protein sequences from the next gen stuff.
Only like 400,000 non-redundant ones have matches to my high resolution MS/MS. There's my filter! Throw away 39.6M theoretical sequences!
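The filter itself is conceptually trivial -- keep the theoretical entries that picked up confident PSMs and throw the rest away (a sketch with made-up IDs; the hard part is generating the matches, not this):

```python
def filter_fasta(entries, matched_ids):
    """Keep only theoretical protein entries with MS/MS evidence.

    entries: dict of {sequence_id: protein_sequence} from the
             six-frame translations.
    matched_ids: set of IDs with at least one confident PSM.
    """
    return {sid: seq for sid, seq in entries.items() if sid in matched_ids}

# Toy numbers: 5 theoretical entries, 2 with PSM evidence
theoretical = {f"orf_{i}": "PEPTIDE" for i in range(5)}
evidence = {"orf_1", "orf_3"}
kept = filter_fasta(theoretical, evidence)
assert len(kept) == 2  # the other 3 get thrown away
```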

All the stuff that is in the red circle above is also stuff I didn't know how to do, but learned how!

What do we get?
The first ever annotated protein FASTA for Cannabis plants. BOOM!
Then we could use that FASTA with all the normal tools to finish the project right.

I submit for your amusement my favorite protein bioinformatic flowchart of all time.

 What did we learn?

1) Lysine acetylation is highly involved in the production of the chemicals people seem to care about in cannabis plants. Definitely the terpenes, possibly the cannabinoids (there is a noisy spectrum in the supplemental that needs to be verified later)
2) The proteins may be glycosylated everywhere, but we need to work on it more because it looks like the second sugar in every chain is not one of the ones I know about.
3) The flowers of the plant make hundreds of unknown small molecules (there are way way more than 20 variants on the normal cannabinoids. There are hundreds!)

All MS/MS spectra are available at the site in MGF form.
We have created a Skyline spectral library containing all the PSMs.
We've also created a 40GB file containing every spectral annotation.
While we were doing this, another group released some Orbi Velos based proteomics on Cannabis flowers (paper link). Since they used only the 300 or so proteins available on UniProt for the plant, they only identified maybe 180 proteins? Using our FASTA we can re-search their data and come up with around 2k-3k (more like what you'd expect from their experimental design).

Oh yeah! And we made an online tool that will tell you what chromosome each protein we ID'ed came from (unfortunately the chromosome assemblies come from short-read sequencing, so it's not the most comprehensive; I've got an idea to fix that). We've also done some fun things like align our new proteins to ones that have Swiss-Model 3D models. Oh, and did a little proof of concept trying to figure out how to identify fake Cannabis products using rapid metabolomics. People are making all sorts of counterfeit "vape cartridges" here in the US that have made people seriously ill. Maybe metabolomics can help distinguish the really sophisticated counterfeits from the real thing.

The protein FASTA can be downloaded on the site, as well as all our metabolites with our hypotheses for the identities of said metabolites. There are also two sneak peeks into cool new informatic software that Conor has been developing around his 70 hour/week job and final year of classes.

The first is a tool that can pull out spectra that contain diagnostic ions. For example, if you're interested in lysine acetylation, the ProteomeTools project showed that these peptides commonly produce a strong diagnostic ion of 126.0913. Conor's script can just pull out and count all the spectra that have those. The second is a tool kit for easy correlation analysis between metabolites, transcripts and proteins if you have quantitative data on all of them. Both of these things are python tools and bundling those into a more user friendly interface is an area of focus.
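Conor's script is his own, but the core of a diagnostic-ion counter is simple to sketch: walk the MGF and flag any spectrum with a fragment within tolerance of 126.0913 (parsing kept deliberately minimal; function names are mine):

```python
def has_diagnostic_ion(peaks, target_mz=126.0913, tol_ppm=20.0):
    """True if any fragment m/z falls within tol_ppm of target_mz."""
    tol = target_mz * tol_ppm / 1e6
    return any(abs(mz - target_mz) <= tol for mz, _ in peaks)

def count_diagnostic_spectra(mgf_lines, **kw):
    """Count MGF spectra containing the diagnostic ion."""
    count, peaks, in_spec = 0, [], False
    for line in mgf_lines:
        line = line.strip()
        if line == "BEGIN IONS":
            in_spec, peaks = True, []
        elif line == "END IONS":
            in_spec = False
            count += has_diagnostic_ion(peaks, **kw)
        elif in_spec and line and line[0].isdigit():   # peak lines only
            mz, inten = line.split()[:2]
            peaks.append((float(mz), float(inten)))
    return count

mgf = ["BEGIN IONS", "TITLE=scan1", "126.0915 5000", "300.2 100", "END IONS",
       "BEGIN IONS", "TITLE=scan2", "250.1 900", "END IONS"]
assert count_diagnostic_spectra(mgf) == 1  # only scan1 has the 126.09 ion
```

Swap in any other diagnostic mass (oxonium ions for glyco, for example) by changing `target_mz`.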

Final note (warning?) if you create a cool Protein informatics tool and you don't create a cool icon for it, I may have to do it myself.

Thursday, August 29, 2019

FDA Scientific Computing Days are Open to the Public!

The FDA is this big thing with a lot of gates and guards and buildings all around DC with little sites around the U.S. I think I've been on every campus at one point or another and they're all still mysterious to me.

FDA Scientific Computing is open to the public! You can see what goes on behind the curtain in how the FDA regulates increasingly complex products coming in. I don't know, but it sounds like a lot of fun to me.

You can register here!

Monday, August 26, 2019


It's no surprise that your coverage improves if you use more than one enzyme for shotgun proteomics. However, on the data processing side of things it can be a serious pain. You set up a workflow for each enzyme separately and then go back and combine the data, for most tools, right? What if a data processing pipeline came ready, out of the box, to combine loads of different enzymes?


Oh -- here is the paper link.

There's all sorts of smart software to do protein inference (you've got multiple peptides that can be assigned to multiple different proteins, but you don't know which is which), but NOTHING beats having higher sequence coverage (...well...except for top-down, obviously, but that's another topic for later....nothing beats having higher sequence coverage for shotgun proteomics, is that better?). But you, for real, might get so bummed out about setting up digestions with 10 different enzymes (none of which work as well as trypsin!), and about the pain in the neck the data processing is going to be, that you'll just give up, go back to your LysC/Trypsin digestion, and never use the others.
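The coverage argument is easy to see with a toy in-silico digest (simplified cleavage rules, no missed cleavages, made-up sequence) -- each enzyme leaves different unobservable stretches, and the union of peptides covers more of the protein:

```python
import re

def digest_spans(protein, cleave_after, min_len=6, max_len=30):
    """In-silico digest: cut after the given residues, keep
    'observable' peptides, return their (start, end) spans."""
    peptides = [p for p in re.split(f"(?<=[{cleave_after}])", protein) if p]
    spans, pos = [], 0
    for pep in peptides:
        if min_len <= len(pep) <= max_len:
            spans.append((pos, pos + len(pep)))
        pos += len(pep)
    return spans

def coverage(protein, span_lists):
    """Fraction of residues covered by any observable peptide."""
    covered = set()
    for spans in span_lists:
        for start, end in spans:
            covered.update(range(start, end))
    return len(covered) / len(protein)

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
trypsin = digest_spans(protein, "KR")   # trypsin cuts after K/R
gluc    = digest_spans(protein, "DE")   # Glu-C cuts after D/E
single = coverage(protein, [trypsin])
combined = coverage(protein, [trypsin, gluc])
assert combined >= single               # extra enzymes never hurt coverage
```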

The pic at the very top shows the issue and demonstrates how much cooler it is to work with multiple proteases if your software is designed up front for it.

Minor criticism: The authors didn't put the ProteomeXchange ID in, just the MASSIVE ID, so you have to spend 30, seriously, maybe 45 seconds finding the files. Or you can go to this link right to the FTP download site.

THIS IS ALL YOU HAVE TO DO!  Your file specific parameters are just swapped to a different enzyme from the front page.

Want a walkthrough like the one I stole this screenshot from? Here is a link to the YouTube video.

Sunday, August 25, 2019

Human phosphorylation isn't limited to just S/T/Y!!!

Okay -- my screenshot of Supplemental 12 looks pretty bad. That's my fault for using a PDF supplemental and using the Snipping tool.

What you can hardly see there is E-phospho with 100% sequence coverage. From a human cancer cell line. I picked this one out of the supplemental images because it is my favorite. I think that almost all of the Tyrosine-GalNAc peptides ever reported in humans are just S-GlcNAcs, and I'm paranoid now. If a PTM is somewhere it doesn't belong, I'm more comfortable if there aren't traditionally expected amino acids floating around.

I've been through this brand new paper with some skepticism and thoroughness (at least what I have that passes for it) and:  1) I have a headache  2) I...geez....the evidence looks good to me that the authors are right -- we've only been seeing S/T/Y phospho because our enrichment methods only work "well" for those.... (the headache is because not every piece of software I use is easy to add a new chemical modification to an amino acid it shouldn't be on....and I'm never pumped to learn a new way to enrich phosphopeptides.)

What paper?? Sorry...this one!

Non-canonical? Oh yeah.

These authors demonstrate with Strong Anion Exchange (SAX) enrichment and a dizzying number of spectra (with comparisons to synthetic spectra in many cases) evidence that S/T/Y is just the beginning....

How's your head doing?

Wait. Do you see the super positive thing in the figure above? LOOK AT HOW MANY Y-phospho sites!!!  This isn't an insanely frustrating and low recovery antibody based technique that no one has improved noticeably on since FACE was first described like 10 years ago. (I mean, it's great, but I've always hoped it would be improved upon somehow, but rabbit blood is...well...rabbit blood.... antibodies aren't just going to magically get more reliable....) THIS is a chemical enrichment that yields Y+phospho! Any reason to let the rabbits keep their blood is good for everyone (particularly reproducibility, and rabbits, of course!)

I need to type this again. SAX enrichment can get you Tyrosine Phosphopeptides!

I'm not going further into it. The paper is open access and you should check it out.

Are you skeptical enough that you'd like to start a RAW file download before you start reading? They're all available here.

You probably want to add some new chemical modifications to your search engine first....

Saturday, August 24, 2019

BOLT -- A Scalable Cloud Based Search Engine with an easy GUI input/output!

Full Disclaimer up front: Two papers featuring the Bolt search engine were recently accepted, one on the engine and one pushing the crap out of someone else's "Cloud computers" to look at EVERY currently known cancer mutation that alters human protein sequences in a bunch of files, and I'm an author on both. Somehow I ended up last, but that was clearly just a nod to the fact that I'm, by far, the oldest contributor. It'll happen to you one day. "Oh no, look how hard it is for Ben to get out of his chair after all those knee surgeries...it's so sad...let's put him in as the senior author...."

To unnecessarily clarify my position on the papers: The Bolt engine is the invention of OptysTech (and if you don't want to take my word for it and read all these poorly written words, just go here and contact them for a demo of the software!). Conor and I were lucky enough to get involved in this project and provide the comparison data (we'd search the files on different software, mainly through PD, and then compare it to Bolt output) and feedback on the input/output and biological interpretation of the data. OptysTech has given me nothing to write this blog post or the papers (see the disclaimers page over there somewhere --> no one wants my irate responses; space on this blog is not for sale, fortunately...who the heck would want it?), except for access to the beta and demo versions of Bolt, and I guess they paid the publication fees on the manuscripts if the journals charge them. I do subscribe to their other great software package, Pinnacle, and pay the normal annual fees. And, to be honest, I think I forgot to pay them for this year, meaning that my license is probably expired now (it's August...?...oh...yeah, definitely expired...), and I ought to look at that and getting them a P.O....cause there is a ton of DIA data we need to look at, and Pinnacle is my preferred way of looking at DIA data due to the cool thumbnails that allow you to QC hundreds of peptides by eye almost instantly.

END DISCLAIMERS. Start cool stuff!

We knocked out this preprint on Bolt a few months ago. Like many preprints, it improved a ton during what ended up being a really positive peer review process (more engines are compared and there's a lot more exploration of FDR).

Using "Cloud Computing" isn't a crazy new idea or anything. Everyone is using the Cloud for everything else. Big clouds like Amazon Web Services are said out loud in so many places by so many people that I now think when someone starts saying the word "always" that they are starting to say "AWS." Using the Cloud for proteomics isn't a new concept either. This team set up the Trans Proteomic Pipeline to run on AWS 4 years ago and search over 1,000 files in 9 hours. I'm not going to read through the paper, my thoughts are on this dumb blog somewhere, but I remember thinking it was amazingly inexpensive. Dollars or cents per file.

Great proof of concept, with 2 major flaws
1) I can hardly figure out the TPP on my desktop (I'm dumb)
2) I have exactly zero chance whatsoever of setting that up myself in my lifetime.

So when a long time friend offered to show me a rough Cloud interface his team was working on that I could actually use? Yes. Sign me up.

Bolt is a commercial product and I haven't run a search on it in a while, so it might be a little different now that it's available -- but it was already in an interface that I knew exactly how to use. I load my data into Pinnacle -- Pinnacle spends a few minutes converting the file, exports it to the Cloud, and it's done. The first time I saw it, there was some showmanship involved. Like "pick any human file on your PC and load it into this new box in Pinnacle -- and -- let's talk about something else for a minute -- BOOM that popup is your completed file searched against every mutation in this huge library, and I threw in several PTMs"

I made this picture that sums it up, I think. Our first pressure test started with just single files (HeLa we got off ProteomeXchange or something) and then we loaded as many sequences as we possibly could into it and kept adding modifications to see what would break it. Turns out we could load every sequence we could find!

Behind the scenes --

(--I've always liked pointing this out...) -- but behind Bolt is something called an Azure. Around my eyes glazing over when the younger authors explain these things, I have absorbed the interpretation that this is Microsoft's equivalent to AWS.

This Azure Cloud thing can, apparently, scale according to the demands put upon it. Therefore if you do something stupid, like load up the entire NCI-60 proteome project and then search that against every mutation in COSMIC (for example. btw, COSMIC is free for academic use and you should check with them if you're going to use it for not academic use) and then throw in 30+ PTMs and partial cleavage (which -- now that we've really taken a look at -- there are an awful lot of....) Bolt isn't sitting around for days thinking about it. Bolt just magically (from my perspective) uses a ton more cores and memory and things (I'd assume Microsoft's power bill goes up? which I guess probably bills OptysTech more? Magic!) and you get your output in just about the same amount of time as you do for a single file.

I say "just about" with this caveat. Bolt is much faster at my work with the super speed internet than it is on Holiday Inn WiFi. You've got to get the files there and get the interpreted data back. The files are converted before going up and integrated when they return. However, there isn't much of a difference whether I'm using a laptop or my PC tower. The conversion is fast and it might be a little faster on my big tower, but that's it.

And -- this is a serious perk -- all the quan is done and interpreted in the same informative interface in Pinnacle that I'm used to, where little thumbnails for all the signals used to generate that numeric ratio are visible and I can go right into them to examine any thumbnail that looks funky. For me this is a huge advantage. I've only got so much space left in my brain to learn new software. I already use Pinnacle. I mostly kinda know how to use it! If you're a Pinnacle subscriber already, there's almost nothing to learn!

At ASMS this year I saw a couple new Cloud proteomics technologies on posters. Our data is getting so large that it's inevitable, right? But it has taken us an awful long time to get here compared to every other field in the world and (given there are over 1,000 proteomics software packages out there, who knows, maybe there have been easy Cloud engines for a while, I can't keep track of 4 dogs and where they pooped at the park all at once -- 1,000 software packages?? but this was definitely the first I've ever seen) but, if nothing else, Bolt is a great proof of concept that you can have an easy-to-use GUI software with powerful visual output without sacrificing behind-the-scenes power.

That's a lot of words, I know, this is a thing I've wanted to talk about for quite a while!

Worth noting -- the last I checked, Bolt could only search human data, but that's just cause they have to load the FASTAs on the back end.

And -- I've totally got to point this out --- there is a lot of proteomics software you can buy out there -- and I've been using Pinnacle for quite a while, in part, due to this page on their website:

I spent a lot of time contracting for the US government and they love price level caps. You can order a single nanospray column without getting permission from anyone because it's under your personal spending cap. However, you need to provide a written justification for why 6 columns for $3,502 is a better deal than buying 3 columns for $2,200 now and then repeating it later. Software is even worse. You have to find where the IT guys are playing FortNite (or whatever) and get one of them to sign off on it. The fact that you can customize and lease for an amount that doesn't require 90 minutes of tracking Dorito crumbs and body odor through your building's sub-basement? Intrinsically valuable and, admittedly, what first drew me in and got me hooked on this powerful software.

I don't know what Bolt will cost, but if OptysTech's other software is any indication, I don't think they're going to try and use it to buy an island or anything....

Friday, August 23, 2019

IDBac -- Who needs expensive BioTyper software?

I had a great talk this week where discussions ended up on BioTyping after hitting a few dozen other topics. Full disclaimer: I've never used one, am only vaguely familiar, but have always thought they were cool.

What I learned -- apparently there isn't anything magical about the hardware. They're just MALDI-TOFs with software and libraries. So...what if you had an extremely similar model you were considering getting rid of? Could you repurpose it for microbial ID?

Maybe? Again, a lack of familiarity with the technology might mean that the amazing open source software described in this recent paper actually does rely on having the correct configurations, but as best as I can tell it doesn't seem to. (Did I just have a sentence of one word and then follow it with a sentence of 40 words?)

I found this because the group that developed it just did a JOVE on it! If you aren't familiar, it's the coolest thing. It's peer reviewed videos. I'm about to submit one. You go through their library and find a method that you are a grand master of that they don't have in their library, then you put together a normal-ish methods paper. If they decide to publish your paper, they send a film crew to you so then you do the thing and they film you doing it.

It isn't the whole normal journalism thing where you put food coloring in test tubes and stare at something while you're pipetting. As an aside -- I've been lucky enough in my life to ruin several of these by doing something that well-meaning photographers didn't know was incorrect...such as pouring liquid directly into a centrifuge...

...this was so long ago I can't swear that I really poured it or not, the photo kinda looks like the lid is still on....but I digress...

You can get this amazing seeming piece of software here! It's got a GUI!

Thursday, August 22, 2019

A thousand and one tales of proteomics software!

I knew we had a lot of proteomics software out there. But more than ONE THOUSAND??

It's "Just Accepted" so my library hasn't indexed it yet, but it looks like this group is trying to organize it all. (They've apparently got 750 of them sorted out -- a huge undertaking!)

The title made me think of a grainy VHS tape at my grandmother's house that was something like "Popeye the Sailor and the tales of a thousand and one nights" that I'd watched as a child when visiting. I found it on YouTube just now and it is real and from the 1930s and altogether strange and, honestly a little disturbing, but that has been my adult impression of a lot of early animation....

Wednesday, August 21, 2019

Okinawa Analytical Instrument Network Meeting 2019!

There are a lot of mass spec meetings around the world and @PastelBio does an amazing job of keeping up with them.

So it's a shock and a huge honor to be invited to speak at one in Okinawa(!!!) that isn't even on that list. It is a rotating meeting on Analytical Instrumentation. Last year was NMR and this year is mass spectrometry!

The site isn't up yet, but here is last year's. If you're from that side of the world, you should check out this 50-person workshop focused on the newest hardware and how to squeeze the highest coverage and most reproducible data out of it.

Monday, August 19, 2019

Two new smart strategies for raw spectra -- Alphabet and pClean!

Oh no. Are those Greek letters on spectra? Did someone write the draft of this paper using R? Do I have a hunch it is really smart but I absolutely have no idea what is going on? Are we just looking for patterns in all the chaos of MS/MS data? What a great idea! Can I explain it better than this? Nope.

If spectral bioinformagics are your thing, maybe you should just check this one out yourself -- patterns in de novo is the theme.

And this is not the same thing at all -- but just happens to be in the most recent JPR as well, and something I think most of us have wondered about -- could we denoise some of the junk out of spectra ahead of time?

I know people who are still using Proteome Discoverer 1.4 in large part because they can use the (now defunct? or entirely incorporated into MSAmanda only?) MS2 Spectrum Processor prior to their Mascot searches. pClean is an R package that has much of this same functionality but goes a little bit further with its fancy network stuff AND -- even if that isn't useful (there's no way I have the capability to judge) -- pClean can remove fragments that come from your TMT or iTRAQ tag that your search engine can't utilize properly in making assignments! That is certainly useful!

You can get pClean at this GitHub.
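To be clear, the sketch below is not pClean's actual algorithm -- it's just the basic idea of stripping label-derived fragments before the search (reporter masses listed are the TMT 6-plex region as I remember them; treat them as illustrative):

```python
def strip_label_peaks(peaks, label_mzs, tol_da=0.01):
    """Remove fragment peaks matching label-derived ions (reporter
    ions etc.) that a search engine can't use for sequence assignment.
    Not pClean's actual algorithm -- just the basic idea."""
    def is_label(mz):
        return any(abs(mz - t) <= tol_da for t in label_mzs)
    return [(mz, inten) for mz, inten in peaks if not is_label(mz)]

# Illustrative TMT 6-plex reporter region (treat masses as approximate)
tmt6_reporters = [126.1277, 127.1311, 128.1344, 129.1378, 130.1411, 131.1382]
spectrum = [(126.1278, 9e5), (129.1377, 7e5), (175.119, 2e4), (476.25, 1e4)]
cleaned = strip_label_peaks(spectrum, tmt6_reporters)
assert [mz for mz, _ in cleaned] == [175.119, 476.25]  # reporters removed
```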

Sunday, August 18, 2019

Catcher of the Rye: Using proteomics to find contaminated cereals!

Is "elegant" the right word for a study that is straightforward, did something I'd never have thought of using the tools I have sitting around here, and totally worked?

I'm not sure if the whole "gluten free" obsession/fad/thing has passed by in the world, but I hear a lot less about it. I have a good friend whose issue with it is at the it-will-legit-kill-me level and, whether the rest of the fad was founded in fact or not, the awareness of gluten was a really good thing for some people.

This group takes a bunch of grains and performs shotgun proteomics on them. They aren't sample limited (there are lots of grains), so they can use a relatively low dynamic range instrument (6600 3xTOF) in data (information) dependent mode and build a comprehensive proteome. Then they work up a list of peptides that are specific to the grains they are concerned about -- BOOM! targeted panel.

When examining a bunch of cereals they can pick out some that are clearly full of grains that the label says aren't supposed to be there!

Saturday, August 17, 2019

False Discovery Rates for Hybrid Search!

Family's in town and I'm late for dinner. Gotta type fast!

boom! (here's this paper!)  EDIT: 8/19/19 Link fixed. Thanks, Dr. Koller!

As spectral libraries take off like crazy again (thanks, DIA!) and smart new tools take advantage of the fact our computers are hundreds of times faster than they were the last go-around -- WHAT ABOUT FDR for us old fashioned DDA hold outs who want to use libraries?!??

This is particularly important as we step away from the library (previously its worst limitation) by throwing in delta masses and PTMs.

NIST to the rescue!

What if you did like 100 iterations of your library all randomized up and then figured out what makes sense in terms of your FDR filters and approaches?

Real proof that your FDR makes sense? That's your tax dollars at work, 'Murica!
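As I read it, the trick is repeating the whole randomize-and-search exercise many times and checking whether your decoy counts behave. The bookkeeping underneath is the ordinary target-decoy estimate -- a sketch with made-up scores and a shuffled-decoy generator:

```python
import random

def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Classic target-decoy FDR estimate at a score cutoff."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / max(targets, 1)

def shuffled_decoy(peptide, rng):
    """One randomized decoy: shuffle everything but the C-terminal
    residue (as in common decoy schemes for tryptic peptides)."""
    core = list(peptide[:-1])
    rng.shuffle(core)
    return "".join(core) + peptide[-1]

rng = random.Random(0)
# 100 independent randomizations of one library entry
decoys = [shuffled_decoy("ELVISLIVESK", rng) for _ in range(100)]
assert all(d.endswith("K") for d in decoys)
assert all(sorted(d) == sorted("ELVISLIVESK") for d in decoys)
# Made-up score lists: 3 targets and 1 decoy pass a cutoff of 5
assert abs(fdr_at_threshold([10, 9, 8, 1], [9, 0.5], 5) - 1 / 3) < 1e-9
```

Repeat the whole thing over many independent randomizations and you can see whether the FDR estimate itself is stable -- which, as I read it, is the NIST point.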

I love this paper, but I should probably find some shoes!

Friday, August 16, 2019

Ubiquitin clipping?

I'm going to leave this here and go sit in traffic for 2 hours trying to get 30 miles. I'm also going to leave the PDF open on my desktop and send it to people since I can't figure out what is going on here and whether this is useful for a ubiquitin linked patient phenotype we've been working on.

What I do get -- new enzyme derived from "foot and mouth disease" that does something interesting to ubiquitin chains that is cooler than just tryptically cleaving it.

Intact mass analysis (using a QE Plus) can distinguish between proteins that have single and multiple ubiquitination sites. Maybe even count the number of distinct ubiquitin sites? Not sure...

Has application to full cell lysates? I think that is something that is maybe shown here just in gels.

Somehow can distinguish between straight chain and branched chain ubiquitin? Maybe my brain is rejecting all this because I didn't know until now that ubiquitin branches? I thought it was a relatively short peptide chain with 3 or 4 known configurations? Like GGK(something)end or LRGGK(something)end, and GGK(something longer)K(something)end -- which is, by the way, tough to sort out in my hands, particularly with incomplete digestion in an attempt to miss the first event. Despite the fact this paper says the K-GG is "easily detectable by MS", I'd contend that most search engines don't like this mod very much, and definitely don't like the longer variants. So -- even though I don't understand what is going on here -- I realize any improvement here is worth knowing about!

Thursday, August 15, 2019


THIS IS REAL LIFE. Single cell proteomics is totally a thing. 

Is there room to improve? Absolutely. But there is this tipping point you need to get to. If you've just got a couple of ambitious labs working on something in relative isolation they're only going to get so far, regardless of how brilliant they are. When everyone is looking at it? When there are meetings all around the world on a new topic? When Nature Methods runs a review on it? When said review points out the dominant technology (mRNA for single cell) has serious serious serious --

--- WHAT?!?! You can dissect this one all day.

3 years ago a big school near my home went crazy and spent almost $4M on proteomics technologies. In one year!!  It was really exciting to be in town. At that same school, one professor spent $20M on RNA-Seq instruments and reagents. One lab. A big lab, but still. We play 4th seat cello to these "next gen" technologies year after year after year.

And -- to see something like this -- of course mRNA abundance and true protein abundance don't line up! These vaunted technologies are only getting 3 transcript measurements per gene across a cell? Which only barely puts them in the 4th order of dynamic range for genetic measurements? What?!? When the true protein measurements go orders beyond that? My take-away is that the post-transcriptional regulation machinery is massively more important to protein regulation than the raw transcript abundance. There simply aren't enough mRNA molecules present for it to be otherwise. And -- even if there were, and this is a technological limitation, like they actually are there but the tech can't see them -- there isn't enough coverage from the transcript measurement data to interpret it. Either way -- direct measurement of protein should be where resources are focused! (duh.)

There is a lot more here -- including single cell SWATH (ummm....which hopefully has some ion accumulation or amplification thing, cause TOFs are nowhere near sensitive enough for single cell without it) and they talk about mass cytometry and a couple of new technologies I've never heard of.

Wednesday, August 14, 2019

Combining multiple TMT batches, missing channels and false discoveries....

This new study in MCP is kind of a down-to-earth moment for labeled proteomics.

My rule for combining TMT sets came from something Dr. Pandey said during a conversation. Something like "you can combine 2 without much problem, but if you throw in a 3rd set or more you have so many missing values it isn't worth it." Made enough sense to me to never try a 3rd set.

In this study we see it really quantified --

Yikes, right? At the protein level it isn't going to be quite as bad, but -- if you combine 5 sets you are only comparing quantification of the same peptides from these proteins like 60% of the time? You'd better have some smart normalization planned, because TMT is peptide-level quan. Rolling up to the protein level from true pairwise measurements of Peptide A across samples 1-10 is a whole lot more linear than rolling up Peptide A from samples 1-10 together with Peptide B from the next set.

This study has some smart normalization. It also has smart ways to set up the experimental design if you've got to multi-batch TMT it. This is one of those papers to keep in your back pocket for when the situation arises.
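Out of curiosity, the shrinking overlap is easy to simulate -- treat each TMT batch as a random subset of identified peptides and watch the intersection collapse as batches are added (identification rates here are totally made up, just to show the trend):

```python
import random

# Toy model: each TMT batch identifies a random subset of the same
# underlying peptide population; the peptides quantified in EVERY batch
# shrink as more batches are combined.
random.seed(0)
population = [f"pep{i}" for i in range(10000)]

def run_batch(id_rate=0.8):
    """One batch identifies each peptide with probability id_rate."""
    return {p for p in population if random.random() < id_rate}

shared = run_batch()
for n_batches in range(2, 6):
    shared &= run_batch()
    print(n_batches, "batches -> shared fraction:",
          round(len(shared) / len(population), 2))
```

Even with an optimistic 80% identification rate per batch, the fraction of peptides seen in all 5 batches drops to roughly a third.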

I'm personally putting it into the "Combine TMT channels" folder with this paper that has been sitting there alone for a few years.

In terms of technical details, the newer MCP article utilizes an Orbitrap Fusion with MS3 for the TMT analysis and 4 hour gradients. Data were processed in MaxQuant.

There are other interesting observations in this paper and this is at least the second time I've mentioned it on this blog (I either saw it as a preprint or a poster or both, I forget). A lot of the more interesting observations derive from the fact that they look for sex chromosome specific markers in samples from known genders. This is the second or third time I've wondered if I could utilize that somehow in the bigger clinical sets where I have access to things like the patient's (anonymous, of course) age/gender/blood type/other stuff that I basically just throw into the dendrograms after the unsupervised clustering is done...
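If I ever get around to it, the check itself would be trivial -- something like the sketch below, where a Y-linked protein's intensity gets compared against the recorded sex in the metadata to flag possible sample swaps (the column names, the cutoff, and the numbers are all completely made up for illustration):

```python
import pandas as pd

# Hypothetical sanity check: compare a Y-linked marker's intensity against
# the recorded sex in the (anonymized) clinical metadata. All column names
# and the 1e5 cutoff are invented for this sketch.
quant = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "DDX3Y_intensity": [1.2e6, 3.0e3, 1.1e6, 2.5e3],  # Y-linked protein
})
meta = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "recorded_sex": ["M", "F", "M", "F"],
})

df = quant.merge(meta, on="sample")
df["inferred_sex"] = df["DDX3Y_intensity"].gt(1e5).map({True: "M", False: "F"})
mismatches = df[df["inferred_sex"] != df["recorded_sex"]]
print(len(mismatches), "possible sample swaps")
```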

Tuesday, August 13, 2019

FusionPro -- Find the fused proteins with transcript data!

Fusion proteins -- at this point -- seem like the result of relatively rare events, at least in humans. And...woweee....have some of them been controversial....but they absolutely do exist.

If you want to find them there are different ways of doing it, but FusionPro goes at it with a different toolbox than what I'd use.  Oh no. This gets rambly. Here is the paper!

I'd also like to throw this in up front -- this tool will NOT work for post-translational protein/peptide fusion products. This is purely for fusion events that produce a chimeric transcript (RNA) that would then be translated to protein. These are still important, but might not be what you're looking for.

The most well-characterized example for me to ramble about is probably the BCR/ABL protein fusion that is the result of the "Philadelphia Chromosome"

(Photo taken from the source used in this link and used here in accordance with the GNU Agreement, thanks Pmx!)

It's a little hard to make out, but these are human chromosomes stained with DAPI (blue) and the ends of two separate chromosomes are tagged, probably by FISH (not the awful band, the fluorescence in situ hybridization).

What you should see is 2 chromosomes with green dots and 2 chromosomes with red dots.

That thing in the upper corner with both colors is the Philly chromosome thing. An unfortunate event has caused the ends of two to break and rejoin together making an evil little chromosome.

There might be more negative effects of this chromosome fusion -- but the one I know about is that the break points line up to make a long reading frame that produces a BCR/ABL Fusion transcript and then protein. Unless something has changed recently, we don't actually know what the BCR protein does in the normal context, but it appears to be a Serine/Threonine kinase. (Responsible for direct S/T + phospho). ABL is a tyrosine kinase (Y+ phospho). See where this is going?

Tyrosine phosphorylation is supposed to be our fast-response, sensitive signal regulatory system -- and now this stupid fusion protein is turning on tyrosine phosphorylation -- permanently. This triggers an entire regulatory system inside the cells carrying this protein that is saying divide! divide! divide! You don't want every cell trying to divide all the time, and you especially DO NOT want damaged cells to divide -- you want them to stop, try to repair themselves and then divide, or kill themselves if the damage is too bad to repair...but a damaged cell with this protein will still want to make damaged copies of itself.

Wow. That was a lot of words. I'll link the paper way above.

Back to FusionPro -- I think the standard way of doing this kind of work, for us, is to de novo sequence every peptide and then try to work your way back. To be honest, that's what I'd still do first. But proteomics is 1000x easier when you have a database to reference. And this is where FusionPro comes in. It can build that database for you from transcript data.

You'll need a bioinformagician for this. FusionPro uses a combination of Perl and Python and you can get it all here. It totally works, though. They go through CPTAC data that has both high read depth RNA-Seq and deep proteomic data (95% sure this is CPTAC 2 since it's labeled reporter ion data) and they find fusion events with high accuracy. Pretty nice to have a load of transcript reads and a high resolution labeled MS/MS spectra to make your case that you found a new fusion protein!
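To be clear, this is NOT FusionPro's actual code -- just a sketch of the core database-building idea: given two protein sequences and a breakpoint in each (which you'd get from the transcript evidence), emit the junction-spanning sequence so the search engine finally has something to match:

```python
# Not FusionPro's actual code -- just the core concept: given two protein
# sequences and a breakpoint in each (derived from the fusion transcript),
# build the junction-spanning fusion sequence for the search database.
def fusion_sequence(seq_a, break_a, seq_b, break_b, flank=10):
    """Return the fusion junction region: the last `flank` residues of
    protein A before its breakpoint joined to the first `flank` residues
    of protein B after its breakpoint."""
    left = seq_a[max(0, break_a - flank):break_a]
    right = seq_b[break_b:break_b + flank]
    return left + right

# Hypothetical sequences and breakpoints, purely for illustration:
print(fusion_sequence("MABCDEFGHIJKLM", 7, "XNOPQRSTUVWXYZ", 3, flank=4))
```

Peptides that straddle that junction are the ones that prove the fusion protein actually exists -- neither parent database entry can explain them.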

Monday, August 12, 2019

The Role of Proteomics in Precision Medicine -- A Great Perspective!

I don't know how I missed this! Wow.... In my always humble opinion, I think that no one should be allowed to say the words "precision medicine" before reading this paper and passing a quiz on its contents.

Is there a phrase as important as precision medicine that has been abused more in the last few years? What's it even mean? It certainly seems to mean votes for some politicians...and then some vague stuff....this is the best definition I've seen -- and, in particular, what it means for us and proteomics as we continue to demonstrate that we're all grown up now -- let us back into the clinic. Yes, we know, we said this before -- but for real this time!

The big thing here is pointing out the diseases where the genetics aren't useful. Sure, you can have a predisposition toward some cardiac issues, but lifestyle has a huge and obvious part in that (says the guy who just had the best cheesesteak in Pennsylvania) and it is protein and/or small molecule markers that indicate that problems are coming. Same thing with more and more of these brain diseases. They're protein stuff. Wait. Holy shit. I've forgotten to blog about the recent Bateman lab papers...that group is killing it and they seem to have markers that can predict Alzheimer's WAY in advance. Seriously, before any other assays can detect any symptoms. These assays are going to get out there fast. They're too powerful and too well developed.

For real, if you're thinking of trying to score some of the precision medicine grants out there with a mass spec, check this out. Make sure you aren't going after a disease where the established assays or PCR or something makes more sense. Even if we come up with a smarter, cheaper or faster way to do a test, there is a lot of activation energy required to push the medical industry mountain forward. If we come in and tackle the stuff that only LCMS can do? That's how more of us get our feets... Wait. What's wrong with "feets"...that's definitely a word, dumb Google.... in the door, have successes where there have been none and effectively lower the activation energy for the next assay!

Sunday, August 11, 2019


I recently received an awesome gift of a couple of new URLs that will link directly to this blog.

My favorite is this: Just type: PROTEOMICS.ROCKS (no www required) into your browser address bar thingy and it will take you right here.

Correlation tools for coregulation(?) analysis!

This is another win from the awesome scientific tool that is Tweeter. I wish there was one just for science, though...

I'm studying two things right now where we're paving new ground. One is human brains and the other is plant material. I can't go to Ingenuity Pathways to help me interpret my data (yes, there is brain stuff there, but no one has ever seen what we're looking at, and no established pathways line up with our differential proteins. This is surprisingly common, by the way. Overlaying your data on established networks is starting to seem less powerful to me every time I do it.)

Last year I saw a talk by Wilhelm Haas that has stuck with me for a load of reasons. One of those reasons is that it seemed like coregulation is what his team finds most important in cancer. It's a simple idea that I'd almost stumbled blindly to myself, but is now a central thought in every study I do.

Here is the idea ---> what is important is the proteins that all go up or down in abundance at the same time. Wait. Was that exactly what you were always doing anyway? Of course! But here is where it diverges from my thinking. I get a list of up regulated and a list of down regulated and I try to overlay that list on an established pathway. Seriously, the more I think about this the more dumb I feel... What if I've got a new pathway no one has seen before? Did I just bias my results toward old data that might not be relevant? What about the cool new proteins that changed? Where did they go? Did they just get ignored because they don't line up with the pretty Ingenuity figures...?...

By establishing a list of the coregulated proteins as intrinsically important we might be able to find the pathways ourselves! Should I keep typing if every word makes me seem even more clueless than the last.....

Bernard (not named after Dr. Delanghe, despite what you might have heard) struggling to escape a koala costume (best picture we could get. He's grumpy for a Belgian Terrier) puts things in perspective for me and I feel much better!

Okay -- back on topic. I'm building new pathways where they've never been found before. Time for coregulation analysis!
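Here's roughly what I mean, as a toy sketch -- correlate every protein against every other across samples and pull out the tightly co-varying groups, with no reference pathway involved (simulated data; the |r| > 0.9 cutoff is arbitrary):

```python
import numpy as np

# Minimal coregulation sketch: 6 proteins x 8 samples of toy data, where
# proteins 0-2 follow a shared trend and proteins 3-5 are independent.
rng = np.random.default_rng(0)
base = rng.normal(size=8)  # the shared "pathway" signal
data = np.vstack([
    base + rng.normal(scale=0.1, size=8),   # proteins 0-2 coregulated
    base + rng.normal(scale=0.1, size=8),
    base + rng.normal(scale=0.1, size=8),
    rng.normal(size=8),                     # proteins 3-5 independent
    rng.normal(size=8),
    rng.normal(size=8),
])

corr = np.corrcoef(data)  # 6x6 Pearson correlation matrix
# Call a pair "coregulated" if |r| exceeds an (arbitrary) 0.9 cutoff:
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)
         if abs(corr[i, j]) > 0.9]
print(pairs)
```

The coregulated trio falls out of the correlation matrix on its own -- no Ingenuity, no established pathway required.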

Before I made Conor (@SpecInformatics) spend his weekend writing what I wanted, I went to Tweeter and asked for suggestions. Obviously, I could use Perseus for correlation analysis. Or Excel. I don't want to use the former because I'm lazy. I don't want to use the latter because people will make fun of me. Sometimes people in my own home...

Want some powerful correlation analysis tools on your desktop? Check out this Java thing at Metscape. It's called the correlation calculator. Definitely download the text file example if you want to use it so you get the formatting correct.

HeatMapper is a great suggestion. 100% recommended. Web based and easy. It's here. A heatmapper figure is definitely going in the supplemental of this paper I'm currently putting together, I guess...hey, it's my Saturday, I can procrastinate a little....

Just in case you know R or Python. These suggestions were popular solutions as well.

I presume that corrr would be an easy-to-find central package in R. Maybe part of the tidyverse thing I hear so much about.

Payne lab already has a Python package out that utilizes the looping of pandas or something. It's available here.

All great suggestions, and I sincerely appreciate the Twitter Proteomics community for the ones I'm not going to get to here as well.

In the end, I took the data (1,000 metabolites quantified over 12 samples) and the by-the-book IonStar results from the proteomics of those same 12 samples (about 5,000 proteins) and gave the two lists to Conor, and late on Saturday night I got back a list of each metabolite and each protein and the Pearson and Spearman correlation coefficients and corresponding p-values for each one.... I don't ask details, I just ask for them to be typed up in the methods section, but I presume that it is similar to the flipping Pandas thing from the CPTAC Github above....
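I don't actually know what Conor wrote, but my guess is the core of it looks something like this -- every metabolite correlated against every protein across the same samples, with both coefficients and p-values collected into one table (tiny random data standing in for the real 1,000 x 5,000 lists):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# My guess at the core of the analysis: correlate every metabolite against
# every protein across the same 12 samples. Toy random data for illustration.
rng = np.random.default_rng(1)
metabolites = pd.DataFrame(rng.normal(size=(3, 12)),
                           index=["met_a", "met_b", "met_c"])
proteins = pd.DataFrame(rng.normal(size=(5, 12)),
                        index=[f"prot_{i}" for i in range(5)])

rows = []
for met, m_vals in metabolites.iterrows():
    for prot, p_vals in proteins.iterrows():
        pr, pp = pearsonr(m_vals, p_vals)    # linear correlation
        sr, sp = spearmanr(m_vals, p_vals)   # rank correlation
        rows.append((met, prot, pr, pp, sr, sp))

result = pd.DataFrame(rows, columns=["metabolite", "protein",
                                     "pearson_r", "pearson_p",
                                     "spearman_r", "spearman_p"])
print(result.head())
```

Sort that table by pearson_r and the 0.999 hits jump straight to the top.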

The most accurate Tweet of my summer... And the results are exactly what I wanted!  Data from just 6 samples shown below. My collaborator is very interested in this metabolite. Patient 4? Tons of it!

The top correlating protein from the PD results? 0.999 Pearson?

2 peptides for this protein in sample 4. Spot checking more of them suggests this worked great!

Is this what Dr. Haas meant by coregulation analysis? Maybe? Maybe this song is just a tribute to the greatest data interpretation method in the world. However, I'm not sure I've ever had collaborators as pumped to see a spreadsheet before....