Saturday, February 29, 2020
I would like to interrupt your explaining coronavirus to your second cousin (once removed) who called you this afternoon, so I can have an inappropriate rant about something awesome!
Here is the preprint!
Okay -- they used a MALDI-TOF -- don't ask me, astronomy people are weird. Proof -- look at the specs on the mass specs on the Mars Rovers. They represent a....unique ...perspective on mass spectrometry. Like...maybe we peaked in the 1970s?
Wait -- back to the meteor. I think they didn't know they had a protein so they were using the MALDI to ionize it. They mess around with a bunch of different matrices. Seriously -- look at that resolution.
Man....it brings back the memories of all the nights I worked late in grad school because mass spectrometers were garbage back then.
Please -- notice to anyone in the world. If you happen to find, I dunno,
a PROTEIN FROM OUTER SPACE
AND YOU DON'T KNOW ANYONE WITH A GOOD MASS SPEC
PLEASE PLEASE PLEASE EMAIL ME. (There are a number of other obvious examples, I hope.)
I will bug every mass spectrometrist on earth till we find you something awesome you can use. I will bum favors. I will sell my house. I will break and enter.
A surprising number of people just give me keys to their labs, so I probably won't have to do any of that. I'll help you get...I dunno...data with decimal points on the masses? With a resolution greater than 400?
How do we know it's from outer space?
The isotopes are all messed up!! They aren't earth isotope distributions! (Yes, you can see it with this thing).
Okay -- so think about this -- a protein that has been moved through space and has had to deal with whatever reasons we need space suits is probably going to be pretty weird.
It's just strands of glycine. About 1/5 are oxidized and it's got an iron stuck to both termini. Still cool.
Wait. They acknowledge Bruker for use of an instrument.
BRUKER HAS REAL INSTRUMENTS! Why did they use whatever this thing is?
OMG. Cornell has great instruments. I've helped develop apps on at least 3 up there over the years. What is happening?
Oh. I get it. These people thought they had a space protein and some guy in a suit at Bruker was like
and told one of the demo chemists to humor them and fire up the thing they use as a coat rack -- and -- wanna guess what the regret level is? Over 10,000? That they didn't load this up on a -- 2 MILLION RESOLUTION FTICR or something?
Okay -- this is still super amazingly cool, man, I'm such a resolution snob, but I'd love to be able to see this protein a little/lot better!
Thursday, February 27, 2020
Okay -- I'm already bored of this post and I haven't finished the first sentence. It could be said that I don't tolerate boredom ....particularly well.... but this is so important that I'm going to stick it out to the end because...
--THE FUTURE NEEDS US to make some really boring decisions and do some really boring stuff! (Okay...maybe I jazz this up some? Let's see!)
How many times have you gone to get a proteomics data file from one of these things
-- and ran into a list like this --
-- where the list of samples in the folders doesn't line up AT ALL with anything listed in the methods or results section of the paper. (Chances are you actually haven't ever run into that until now, when I started uploading extreme examples into PRIDE to make a point or three.)
However, I bet you've had trouble figuring out which file is which a few times. I know I've bugged a lot of people because I've been confused. This is fun, I think, because I get to email someone maybe I haven't met and compliment their work and theeeeeeeen eeeeeaasssse my way into WHY DIDN'T YOU LABEL THESE THINGS BETTER? WHAT IS WRONG WITH YOU??
Okay -- imagine this, though: what if you are a bioinformatician of today or the future and you get back behind the web interface of PRIDE because you want to compare 10 studies you found on the same type of cancer, because 8 people did global proteomics on it at different points, the Olsen lab did really nice phospho, and the Gundry lab did glycoproteomics on it, and you want to loop it all together. (You can actually do this now through the API!)
What? You have to contact 7-10 busy people? And...in this situation you're a bioinformatician. Have you met a real one? Contacting 10 busy people might not be at the top of their favorite things list. (I'm generalizing because it's funnier that way.)
Chances are they'll just drop the ones that they can't figure out on their own, or where the corresponding author was on vacation. Maybe that's your work they just dropped. No meta-analysis or citations for you!
We can work on a universal upload organization standard thing now -- something like this (a click will expand it)
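Just to make the idea concrete, here's what a dead-simple file-to-sample manifest could look like as a tab-delimited table, plus a few lines of Python to sanity-check it before upload. To be clear: the column names here are mine, not anything the ProteomeXchange folks have blessed -- a sketch of the concept, not the standard.

```python
import csv
import io

# Hypothetical manifest layout -- these column names are my own invention,
# not an official ProteomeXchange/PRIDE standard.
COLUMNS = ["raw_file", "sample_id", "condition", "replicate", "fraction"]

def validate_manifest(tsv_text):
    """Return a list of problems found in a file-to-sample manifest."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != COLUMNS:
        problems.append(f"expected columns {COLUMNS}, got {reader.fieldnames}")
        return problems
    seen = set()
    for i, row in enumerate(reader, start=2):  # line 1 is the header
        if any(not row[c].strip() for c in COLUMNS):
            problems.append(f"line {i}: empty field")
        if row["raw_file"] in seen:
            problems.append(f"line {i}: duplicate raw_file {row['raw_file']}")
        seen.add(row["raw_file"])
    return problems

manifest = "\t".join(COLUMNS) + "\n" + \
           "run01.raw\tS1\ttumor\t1\t1\n" + \
           "run02.raw\tS2\tcontrol\t1\t1\n"
print(validate_manifest(manifest))  # -> []
```

Twenty lines of validation like this, run before every upload, would save the poor bioinformatician above a whole lot of emails.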
One of the things the genomics people have done right (probably this just means many of them) is standardize their data uploads. This probably evolved out of the fact that one data file can be 1 TERAbyte and you don't want to upload/download that more than once in your life.
Our smart and forward-thinking bioinformaticians in proteomics (or probably the people that we're right on the cusp of driving completely insane -- probably the same people) are working on a plan and they would like community help and guidance.
There are a lot of boring ideas here, but this is going to impact us and the future and whether our data is going to be remembered and reused later.
To kick it off, they have assembled a list of the 100 most downloaded proteomics datasets. If nothing else, you should check it out -- it says a lot about where we are today!
ProteomeTools is on this list 3 times! I think PXD000138 is the synthesized phosphopeptides. And the proteomics of 29 healthy tissues -- that's relatively new -- but what a great idea that was! This makes me think I should check out every set of files that I don't recognize.
If we get these annotated this gives the people of the great global ProteomeXchange consortium -- and the field of Proteomics --
-- a pathway to the future
Okay... the original video of this might be better than this....
Whew...I did it. That wasn't as boring as I thought it was going to be.
Update: Here are some organized notes on this!
Wednesday, February 26, 2020
Busy week here and this one definitely deserves more time than I can afford this morning.
-- 30 minute digestion
-- High enough efficiency to be competitive with our friend trypsin
-- Doesn't care that your integral membrane proteins have no basic residues, so better coverage of membrane proteins than I think I've ever seen before -- but, again, I'm moving fast.
Tuesday, February 25, 2020
The Cancer Proteome Atlas appears to have leveled up. If you're doing cancer network stuff and you're tired of Ingenuity giving you 1e-3243453453445793593593535935 p-values for MAPK, this is worth a look.
You can read about these pretty looking updates here.
Monday, February 24, 2020
Dropping this link here so I remember to read this!
The questions are coming since single cell transcriptomics and proteomics seem so complementary. I put in a proposal last year with a government agency to put some resources toward sorting it out.
The response was something like (I'm obviously paraphrasing) "Holy cow. Are you crazy? We have no idea how to deal with all this single cell genomics data and EVERYONE is doing it. We have to devote resources to this mess first!!"
That was less like paraphrasing and more like me exaggerating a lot, but I do like when we can see behind the genomics curtain and see that it isn't nearly as tidy back there as we've been led to believe. Unfortunately, that curtain is made of an increasingly thick layer of cash, so it is increasingly difficult to move it out of the way.
Sunday, February 23, 2020
On the list of things I'm psyched to add to my list of things to be concerned about I would like to add #7,412.
Whew! Okay, well, I'm safe, because Dr. Sudipto Das showed me a couple years ago that you should dissolve your peptides in 0.1% TFA so that at least a few of them will stick to a PepMap C-18 trap column. Thank you Sudipto, and....I guess....thank you...PepMap...? I don't see hydrophilic peptides, and my phosphopeptides (that don't wash through) all come off in the first 10% of the gradient, but I don't have artificial formylation!
I jest...PepMap isn't that bad, but there are good reasons that CPTAC and even the vendor demo labs don't use it anymore.
Saturday, February 22, 2020
(Previously known as the First Annual News in Proteomics Data Mineathon Challenge)
Some people have signed up from these places*
(*Of course, participation is voluntary and does not imply, in any way, the endorsement of these institutions.)
However, this looks pretty impressive, right? And it's not all of them!
Participants in the challenge are coming from 5 continents!
Details can be downloaded from the page -- but here is an overview.
Amyotrophic Lateral Sclerosis (ALS), often called Lou Gehrig's disease, fucking sucks.
You can read about it in the New England Journal of Medicine here.
Here is an article in Frontiers that helps elaborate on how things have been going:
There isn't a cure yet and, best I can tell, the diagnostics aren't very good. There are some promising treatments, but they aren't getting out to patients rapidly at all.
Here is one from a company called Brainstorm that is showing promise
as well as
another recent advance from researchers in Houston.
I'm no ALS expert, I'm a loud-mouthed mass spectrometrist, but I've been listening to people about this a lot lately. My understanding is that in some cases there are genomic components, but in some patients, there is not -- or it hasn't been uncovered.
However, in both cases, like most of these neurological diseases, this is a post-transcriptional problem. It's either proteins or it is post-translational modifications.
No surprise that there has been little in the way of success on it, right? Genetic diseases can be approached with genomics technology that is either mature -- or, well, at least 10 years ahead of protein technology, in most regards -- and definitely when it comes to informatics!
This challenge is a test to see if we can help find something interesting in ALS patient data with today's proteomics data analysis techniques. We're going to use files from this study. (Chorus 1439)
What do we have?
33 cerebral spinal fluid samples from patients with ALS
33 matched controls
Plasma samples as well!
However, we've chosen to focus on the CSF for this (we will NOT turn down plasma data, but the focus is on the CSF)
It's from Michael Bereman's lab (AutoQC, sProCOP), so the quality is obviously great. QE Plus single shot data. Still a lot to process for PTMs.
I'll put a FAQ up on the page that is more formal, however, these have been some questions:
Q1) I'm from a software vendor, can I participate?
A1) Fuck yes, you can participate. If you find the important PTMs and have the best data, I will buy your software. That is a promise. I'll go around telling everyone else they should buy it.
Q2) Can I just use a commercial software package I have?
A2) Please see answer A1.
Q3) Weren't you involved in some software development stuff? (No one has actually asked that)
A3) Yes -- but I am not judging in any way. I'm the hype man. Picture DJ Khaled with a more nasal voice, just going "Yo!" and "What!" while you're talking. That is what I'm doing here.
Q4) What is the goal again?
A4) We want to find the most important PTMs that appear linked to the disease. Bereman lab's original study found some interesting proteins. Let's take it beyond that. Let's see if, as a huge -- kinda scary huge -- team we can find something that can help move ALS research forward.
Official emails will go out to all the participants as soon as I figure out how to get everyone's email addresses into my contacts folder correctly.
Tuesday, February 18, 2020
This might possibly be one of my very favorite ideas I've had in my life.
Current status, we have:
1) A ridiculously cool and important
2) Some impressive impartial judges
3) An almost ready launch page!
4) Like 30+ awesome (and patient) participants!!!!
We're reviewing the rules and some final details and we should be good to go.
Had no one signed up, it would have launched right on time (probably) but with so many people volunteering their time I owe it to everyone to try and come as close as possible to doing it right.
Wednesday, February 12, 2020
Okay....are you guys ready for this one? I wish I could say I was, but it's too important for us as a field to not think about....
"Analytical figures of merit"?? Hey! This is the proteomics party, don't you come in here with all your boring analytical chemistry validation stuff....oh.....ugh...okay....
Why is this (study) important? In part because it addresses 2 separate concepts that need to be separated -- and they're right in the abstract:
"....Our results demonstrate that increasing the number of detected peptides in a proteomics experiment does not necessarily result in increased numbers of peptides that can be measured quantitatively....."
First of all, this study is like 4 pages or something and it represents an absurd amount of work. SRMs and DIA experiments (QE HF, I think) and a bunch of different HPLCs and the matrices are all sorts of fun -- CSF and FFPE and yeast digest and maybe I missed one.
What's the point? Well, I think the goal was to set out and develop some powerful standard curves without heavy standards, but the quote above suggests a really powerful fundamental truth was kind of a side effect and it kind of steals the show.
We do a lot of relative quan stuff in proteomics. And....it's seriously just relative....and a lot of the results make no sense at all. And this study looks at an absurd amount of data and -- look -- some peptides are just not quantifiable in their background matrix. Real quan has things like linear dynamic range and other boring terms like LOQ/LOD/LLOQ/LLLLOQ and if you really dig into them the way this team did, there is only one solution --
"....Our results demonstrate that increasing the number of detected peptides in a proteomics experiment does not necessarily result in increased numbers of peptides that can be measured quantitatively....."
Same quote twice....? Why not.
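If you want to play with those boring terms yourself, the textbook way to estimate LOD/LOQ from a dilution series is the 3.3*sigma/slope and 10*sigma/slope convention. This is a generic sketch of that convention with toy numbers -- NOT the curve-fitting approach this team actually used:

```python
# Figures of merit from a dilution series, using the common ICH-style
# conventions LOD = 3.3*sigma/slope, LOQ = 10*sigma/slope, where sigma is
# the standard deviation of the calibration-curve residuals.
import statistics

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
            sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def lod_loq(conc, intensity):
    slope, intercept = fit_line(conc, intensity)
    resid = [b - (slope * a + intercept) for a, b in zip(conc, intensity)]
    sigma = statistics.stdev(resid)
    return 3.3 * sigma / slope, 10 * sigma / slope

# toy dilution series: fmol loaded on column vs. integrated peak area
conc = [1, 2, 5, 10, 20, 50]
area = [1100, 1950, 5200, 9800, 20500, 49500]
lod, loq = lod_loq(conc, area)
```

Run this on a peptide whose response is flat or wildly scattered in your matrix and the LOQ lands above the range you care about -- which is exactly the "detected but not quantifiable" situation the quote is describing.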
Tuesday, February 11, 2020
There are seriously 10 papers open on my desktop that I want to blog about -- and will! -- but I'm busy, so time for another super lazy post.
Last year some cool people asked me if I'd be interested in doing some articles about things happening in proteomics that I absolutely thought the outside world should know about. My first thought?!?? Single cell proteomics (by SCoPE-MS).
This is the best I could come up with.
(Of course, I love to type, so I also talked about the study I credit with making proteomics a reality for the rest of us.)
On this topic, I recently was so sleepy that I went through all the "comments" on my blog. There were around 2,000 spam messages suggesting all sorts of terribleness, but there were also some legit comments. And -- I tell you what -- SCoPE-MS gets some comments. Particularly regarding aspects of the RAW data in the public repositories, and I think that is something we will really need to talk about at some point.
My opinion is that we've been really lucky as a field in that we....mostly haven't actually been sample limited. Ten years ago the people doing cell culture would look at me like I was a tyrant when I said I needed 1mg of protein for global + PTMs. I get the same exact look now when I ask for 50 micrograms.
With the exception of PTMs on tyrosine, glycopeptides and a few other weird things, I'd feel comfortable saying that >90% of the peptide MS/MS spectra reported in the literature have looked like this --
>80% sequence coverage thanks to
1) An abundance of signal
2) Really really friendly charge distribution thanks to basic residues
In SCoPE-MS we don't have #1. There is a limit to how much you can load your carrier channel without fogging your single cell signal (as an aside, I have a crazy hypothesis that this limit is very different depending on whether you are using a D20 or D30 Orbitrap). So...the spectra are always flirting with the background noise. And...at low signal, nothing is all that pretty.
Here is the big question though:
How many fragments do you actually need for confidence in that identification?
Another question: If you were doing targeted peptide stuff with SRMs how many do you need to trust an identification? 3? With unit resolution? And a good reproducible retention time?
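For anyone who wants to poke at that question with actual numbers, the raw ingredient is just counting how many theoretical b/y ions show up in the observed spectrum within a mass tolerance. A toy sketch (singly charged ions only, a tiny residue table, and my own simplifications throughout):

```python
# Count how many theoretical b/y fragment ions of a peptide appear in an
# observed peak list within a ppm tolerance. Toy sketch: singly charged
# ions, no modifications, standard monoisotopic residue masses.
PROTON, WATER = 1.007276, 18.010565
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
           "D": 115.02694, "E": 129.04259, "F": 147.06841, "R": 156.10111}

def by_ions(peptide):
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def matched(peptide, peaks, ppm=10.0):
    """How many theoretical fragments have an observed peak within ppm."""
    return sum(1 for mz in by_ions(peptide)
               if any(abs(mz - p) / mz * 1e6 <= ppm for p in peaks))
```

The philosophical part -- how big that count has to be before you believe the ID at single-cell signal levels -- is exactly what the field hasn't settled yet.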
I think we've got a philosophical hurdle at some level for this one, particularly for people in our field with Analytical Chemistry as their background. If you look at who got really comfortable with the SCoPE-MS stuff and jumped on it first, I think it has been the people who are coming from the genomics or informatics world.
I promise, if you had been looking at microarrays yesterday, the SCoPE-MS data is a huge and beautiful upgrade. But, if you are used to loading 1ug of peptides on your Q Exactive....SCoPE-MS data is going to take some getting used to.
Monday, February 10, 2020
Hey you! Are you looking for a tool to help you select viral peptides for targeted assays?
Unrelated --- what is the best color of dinosaur?
I got you, yo. Check this out.
Before you panic, when they wrote the paper "Purple" was just a Python script that you can get here. I assure you this is no longer the case. There is a very straight-forward (to install) executable that will set you up with a GUI that looks just like this --
-- that you can get here.
What does it do? Well, it helps you select peptides that are ideal for targeted assays from the databases you feed it. Imagine Picky, but you can load stuff that isn't human into it. (If you are doing human proteomics -- you should be using Picky, btw. It's amazing).
Purple: feed it the peptide sequences you're interested in, feed it your contaminating background, choose your rules, and get your peptides!
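If you want the flavor of what a tool like this is doing under the hood, here's a toy version of the idea -- in silico tryptic digest of your targets, then subtract anything that also shows up in the background. This is my own sketch of the concept, not Purple's actual code:

```python
import re

# Toy "select unique target peptides" workflow: digest everything with
# trypsin in silico, then keep only peptides that never occur in the
# background proteome. My own sketch -- not how Purple is implemented.
def tryptic_digest(sequence, min_len=6, max_len=30):
    """Cleave after K/R but not before P; no missed cleavages."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in peptides if min_len <= len(p) <= max_len}

def unique_target_peptides(targets, background):
    bg_peptides = set()
    for seq in background:
        bg_peptides |= tryptic_digest(seq)
    candidates = set()
    for seq in targets:
        candidates |= tryptic_digest(seq)
    return candidates - bg_peptides
```

A real tool layers the "rules" on top of this -- length windows, missed cleavages, avoiding methionines, hydrophobicity cutoffs -- but set subtraction against the background is the heart of it.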
Sunday, February 9, 2020
A lot of people downloaded my ugly FASTA for 2019-nCoV after I posted it. UniProt has done their normal crazy meticulous job of assembling all the data and is a much better resource.
You can check it all out here.
Thursday, February 6, 2020
I've only got a few minutes, but -- wow -- is this ever worth reading!
Microbial ID by shotgun proteomics is NOT new. But promising study after promising study seems to end up with -- no new clinical assays.
MALDI-TOF with a BioTyper is easier in the clinic, I guess, but maybe we just need the right technologies to get us over the hump. Clearly, the insistence of researchers to continue utilizing NanoLC is a big hurdle, but maybe innovative sample prep methods would also help bridge the gap?
They use some crazy technology in this one. A flow cell digestion method that allows a tryptic digest of bacterial proteins in one hour? And a depletion technology that removes "host" (human!) biomass??
I have to mention that this study is a big collaboration between groups in Stockholm (where HUPO 2020 is!) and Gothenburg, a city blessed by some dark metal gods or something to be the birthplace of the greatest bands that have ever walked this earth. Yup, I definitely had to mention that.
Tuesday, February 4, 2020
Sometimes I take a dataset and compare 2 different data processing pipelines. One time, maybe I compared 3?
22? What? Wow! Why do we even have 22 pipelines? The abstract suggests that there are very good reasons, actually -- the results aren't the same....and they propose a solution for this. Only a paywall and a biological requirement for sleep stand in my way of reading this right now!
As a reminder -- there is a super epic community proteomics PTM challenge coming up in less than 2 weeks and I think maybe 10 labs have signed up for it so far.
I think that this is probably a great resource to help set the stage.
If there is an easier looking experimental method to measure protein misfolding in vivo, I've never seen it.
If you are interested in structural proteomics stuff at all, I highly recommend this preprint.
Formaldehyde is pretty efficient at binding to proteins! Turns out that:
1) you can get heavy stable isotopically labeled formaldehyde
2) in your cells the formaldehyde can only get access to the outside of your protein 3D structures, effectively "painting" the surface of them.
3) You can compare different biological conditions by using "heavy" and "light" formaldehyde.
Digest your proteins with chymotrypsin and voilà -- you can quantitatively compare the outside of your proteins and protein-protein complexes!
The downside here is that you have to think hard about the peptide identifications, as the near-isobaric label pairs -- CDH2 vs 13CH3, and 13CHD2 vs CD3 -- could turn this into a Disaster Level: "deuterated deamidation" study.
To fully eliminate this as an issue, these authors acquired MS/MS at 120,000 resolution! Which...in my opinion is overkill, but on the instrument they used, they've got 60,000 or 120,000 to choose from, and 60,000 is going to get a little sketchy on the larger fragment ions. (Loosely related...I commonly run at 90,000 resolution on another instrument...)
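To put numbers on why the resolution matters here -- using standard monoisotopic masses, the two "medium" methyl labels differ only by swapping a 13C for a deuterium, which is about 2.9 mDa:

```python
# The near-isobaric label problem in numbers: a 13CH3 methyl vs. a CDH2
# methyl differ only by the 13C-vs-deuterium substitution. Standard
# monoisotopic atomic masses below.
H, D = 1.00782503, 2.01410178
C12, C13 = 12.0, 13.00335484

m_13CH3 = C13 + 3 * H      # 13C-labeled methyl group
m_CDH2 = C12 + D + 2 * H   # singly deuterated methyl group
delta = m_CDH2 - m_13CH3   # ~0.0029 Da

# rough resolving power needed to pull the pair apart (R ~ m / delta-m)
R_needed = 400 / delta     # for a fragment at m/z 400
print(round(delta, 5), round(R_needed))
```

Even at a modest fragment m/z of 400 you need six-figure resolving power, which is why 60,000 gets sketchy and 120,000 starts looking less like overkill than I joked above.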
Despite the decreased number of scans possible on an LC time scale, they come back with a tremendous amount of data.
In case any of the authors see this -- unless I'm completely misunderstanding what I'm seeing -- Extended Data Figure #4 is possibly my favorite visualization I've seen of anything so far this year. (Maybe I should put this comment on the bioRxiv thing like I'm supposed to....)
Oh yeah! I almost forgot! On top of how cool the technique is, the authors make some interesting findings regarding protein folding and Alzheimer's!
Sunday, February 2, 2020
It's about time that we talked about how to add....
...well...deep learning...(but...come on, I HAD to use that when I found it, right?!?) to your proteomics workflow!
Don't want to read my rambling about why Prosit is awesome and just want to do it? Skip to Part 2 below!
I almost guarantee that there is someone at your facility who drops all sorts of words like this around -- and maybe that same person has given you reason to question their intelligence in other matters, but as long as they keep saying things about "neural networks" and "semi-supervised" whatevers it seems like everyone wants to talk to them, and maybe give them lots of money. Follow this easy walkthrough and THAT COULD BE YOU.
I jest, because Prosit is the real deal and has real world advantages, including more and higher confidence identifications right now.
For a biomolecule, the peptide bond is a joy to work with -- energetically -- crudely optimize the collision energy and you'll break most of them. Our friends in the small molecule world, where I continue to dabble, don't have it anywhere near as good. There seems to be no rhyme or reason to what energy will break which bonds. When I do QE metabolomics, I step my CE, typically with 10, 30, 100. Just to come close. The ID-X even has something called "assisted" where it tries to help. Most of the time when you've got a molecule you really want to study, it makes sense to run it 10 times with different energies....
However -- just because peptides are better than most molecules at fragmenting, that doesn't make them consistent. Look at them. Why on earth would you miss the y7 in this peptide or the y4 in that one? It's just not there. And -- at some level it must make sense --energetically.
Prosit was described here last year:
In as few words as I appear capable of writing -- Prosit looks at the ProteomeTools database (you know that thing where they are synthesizing EVERY human peptide and then fragmenting them and making libraries?) and it models the peptides YOU give it against that library with this deep learning thingy.
PART 2: How to use Prosit!
You will need:
1) A protein .FASTA database.
2) The EncyclopeDIA (you can get it here)
3) That's it. I just felt dumb making a list with 2 entries in it.
EncyclopeDIA can do all sorts of smart stuff (some of which I wrote not smart stuff about here) -- and it also has awesome utilities. Such as "Create Prosit CSV from FASTA"
As an aside, I heard from the Prosit team -- they'll have this integrated soon, but if you wanted to put the words "deep learning" on your ASMS abstract that is due tomorrow you have to do what I am doing.
This is ridiculously easy. Add your FASTA. It will make you a Prosit .CSV file. I believe very strongly in you and your abilities. You'll definitely be able to do it!
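(If you'd rather script this step than click through EncyclopeDIA, the whole thing is just a digest plus a three-column CSV. Fair warning: the column names below are what the Prosit upload appeared to expect when I tried it -- double-check them against proteomicsdb.org/prosit before trusting my memory.)

```python
import csv
import re

# Minimal stand-in for EncyclopeDIA's "Create Prosit CSV from FASTA":
# in silico tryptic digest, then one row per peptide/charge combination.
def read_fasta(path):
    """Return the protein sequences from a FASTA file."""
    seqs, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(line.strip())
    if current:
        seqs.append("".join(current))
    return seqs

def write_prosit_csv(fasta_path, out_path, ce=27, charges=(2, 3)):
    peptides = set()
    for seq in read_fasta(fasta_path):
        # cleave after K/R, not before P; keep Prosit-friendly lengths
        for pep in re.split(r"(?<=[KR])(?!P)", seq):
            if 7 <= len(pep) <= 30:
                peptides.add(pep)
    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["modified_sequence", "collision_energy", "precursor_charge"])
        for pep in sorted(peptides):
            for z in charges:
                w.writerow([pep, ce, z])
```

Same output either way -- EncyclopeDIA's utility is the easier route, this is just for the people who like to see the plumbing.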
Now -- go to proteomicsdb.org/prosit and load that CSV you just made.
Hit next and then tell Prosit the format of your output library:
I'm using MSP because I can't afford Spectronaut yet. Then submit your job!
Now -- this is important. When you submit the job you'll go into the queue. You'll want to copy the link URL it gives you and/or the Task ID number. Do not close your browser without doing this, or you won't be able to get your library. When it's ready you'll get a download link!
If you want to check the quality of your MSP library -- the PDV is a nice, lightweight, java program that will allow you to flip through all of them. If you've already got the NIST MS Interpreter installed it will also load them. PDV will look something like this!
For this peptide, Prosit predicts that for a CE of 27 I'm not going to see every b/y ion. There are some bonds that it thinks, from the hundreds of thousands of real peptides it has studied, just won't fragment well.
And if, for example, you are looking at that real peptide. And it's right? Then you aren't penalized for missing that fragment when using this library!
Saturday, February 1, 2020
Yeah....maybe I need a hobby....but I think this stuff is cool AND I've learned how to use some new tools thanks to my curiosity about this new virus and thinking about how I would analyze proteomics data from the virus if I could get my hands on it....
Here is the question: PTMs don't typically just happen indiscriminately. There are particular motifs that are the targets of the enzymes that add the PTMs. So...can we start with just some unknown linear proteins and predict what PTMs that we would find?
And...are those predictions any good? I can't yet answer that part directly, but I'm trying.
There are a LOT of tools that predict PTM sites. After two late nights of trying a few of them and doing a lot of failing -- this older one is my current leading favorite -- and you can read about it here.
If you've got better things to do on a Saturday than read, I got you, yo!
You can also just go and dump stuff into their server at ModPred.org. The interface is super straight-forward. Put in your protein FASTA entry (one at a time), pick your mods and hit the button. (You can also install it locally, but I'd rather use their electricity.)
You are capped at 5,000 amino acids per model with the web interface of their server. And you are definitely penalized for longer sequences. At 1,000 amino acids, I recommend walking your dog.
Okay -- so only one protein from the 2019-nCoV translated FASTA is over the cap, so I broke it into 5 separate translated regions in order to have a large overlap in peptide sequences (in case the domains it is modeling against for PTM prediction are large ones). And -- it took basically all morning.
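If you'd rather not eyeball the split points, a few lines of Python will chop a long sequence into overlapping windows that stay under the cap (my own little helper, nothing to do with ModPred itself):

```python
# Split a long protein into overlapping windows so each piece fits under a
# server's length cap (ModPred's web form caps at 5,000 aa). The overlap
# keeps any modification motif intact across the split points, and the
# start offsets let you map predictions back to absolute positions.
def windows(sequence, max_len=5000, overlap=500):
    step = max_len - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append((start, sequence[start:start + max_len]))
        if start + max_len >= len(sequence):
            break
    return chunks

# a predicted site at position p within a chunk sits at absolute
# position (start + p) in the original protein
```

Pick an overlap at least as wide as the largest motif you expect the predictor to care about, and deduplicate sites that land in two windows.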
You get a pretty output that you can keep, or have it kick you out a tab(?)-delimited text file. I spent a lot of time swearing while combining everything into a single Excel file. (I need to grow up and stop using Excel. It always seems like it will be easier -- even though it increasingly is not the easiest solution.)
Okay -- and here I'm talking smack about Excel -- and the Ideas button just did something smart!! Normally, it's just funny to hit the button, but -- darn -- it made a decent Pivot Table!
If you're interested in the actual motifs predicted to be modified, you can download them from my Google drive here.
Okay -- so -- that's all nice and all. Predicted PTMs are a pretty big step away from actual PTMs.
Can we test this?
I mentioned a couple of days ago that there was some cool unpublished MERS-CoV proteomics data on MassIVE.
Now -- this is CID ion trap MS/MS data -- not my favorite source of data for identifying PTMs. It also kind of rules out some of my favorite tools, because they were designed with HRAM MS/MS data in mind. So...back in the time machine to the 1990s to fire up SeQuest and take a minute to polish up my sense of skepticism....
Okay -- this will take more than a minute or two....I forgot how long CID MS/MS takes to search with a couple of PTMs.
I broke it up into queues and only one has finished -- aaaaaaannnnnddddd....nothing!
Okay...so I do actually need another hobby....maybe something I can do inside, in case I screw up my knee and have to do a lot of sitting around for a while.
However -- there is A LOT wrong with this system. One -- we're looking at single shot analysis from 2009's best mass spectrometer -- in a human cell background. We're not exactly digging to the full depth of the proteome -- and PTMs rarely want to announce themselves. Two -- I'm using a prediction model of one virus that is similar to another, but we are definitely reaching when trying to make predictions off the little data across the board. Three through 41 --? I didn't even look to see if that region of the similar protein is even digested by trypsin. Maybe that is for next Saturday.