Wednesday, February 12, 2020

Is a peptide quantitatively measurable? Here's how you find out!

Okay....are you guys ready for this one? I wish I could say I was, but it's too important for us as a field to not think about....

Matrix matching?
"Analytical figures of merit"??  Hey! This is the proteomics party, don't you come in here with all your boring analytical chemistry validation stuff....oh.....ugh...okay....

(Yes. I had to make that. You're welcome.)

Why is this (study) important? In part because it addresses two separate concepts that need to be kept separate -- and they're right in the abstract:

"....Our results demonstrate that increasing the number of detected peptides in a proteomics experiment does not necessarily result in increased numbers of peptides that can be measured quantitatively....." 


First of all, this study is like 4 pages or something and it represents an absurd amount of work. SRMs and DIA experiments (QE HF, I think) and a bunch of different HPLCs and the matrices are all sorts of fun -- CSF and FFPE and yeast digest and maybe I missed one.

What's the point? Well, I think the goal was to set out to develop some powerful standard curves without heavy standards, but the quote above suggests a really powerful fundamental truth came along as a side effect -- and it kind of steals the show.

We do a lot of relative quan stuff in proteomics. And it's seriously just relative....and a lot of the results make no sense at all. This study looks at an absurd amount of data and -- look -- some peptides are just not quantifiable in their background matrix. Real quan has things like linear dynamic range and other boring terms like LOQ/LOD/LLOQ/LLLLOQ, and if you really dig into them the way this team did, there is only one conclusion --
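Since LOD/LLOQ get thrown around loosely, here's a minimal sketch (with made-up calibration numbers, NOT from this paper) of the common ICH-style way they're estimated from a standard curve:

```python
import numpy as np

# Hypothetical calibration data for one peptide (NOT from the paper):
# spiked amount (fmol) vs. measured peak area from SRM/DIA extraction.
amount = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
area = np.array([110.0, 205.0, 520.0, 1010.0, 2040.0, 5050.0])

# Ordinary least-squares fit: area = slope * amount + intercept
slope, intercept = np.polyfit(amount, area, 1)

# Residual standard deviation of the fit
residuals = area - (slope * amount + intercept)
sigma = residuals.std(ddof=2)

# Common ICH-style estimates
lod = 3.3 * sigma / slope    # limit of detection
lloq = 10.0 * sigma / slope  # lower limit of quantification

print(f"slope={slope:.1f}  LOD={lod:.2f} fmol  LLOQ={lloq:.2f} fmol")
```

If your peptide's response in its background matrix isn't linear across the range you care about, no amount of fitting saves you -- which is exactly the paper's point.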

"....Our results demonstrate that increasing the number of detected peptides in a proteomics experiment does not necessarily result in increased numbers of peptides that can be measured quantitatively....." 

Same quote twice....? Why not.

Tuesday, February 11, 2020

The single cell proteomics revolution!

There are seriously 10 papers open on my desktop that I want to blog about -- and will! -- but I'm busy, so time for another super lazy post.

Last year some cool people asked me if I'd be interested in doing some articles about things happening in proteomics that I absolutely thought the outside world should know about. My first thought?!?? Single cell proteomics (by SCoPE-MS).

This is the best I could come up with.

(Of course, I love to type, so I also talked about the study I credit with making proteomics a reality for the rest of us.)

On this topic, I was recently so sleepy that I went through all the "comments" on my blog. There were around 2,000 spam messages suggesting all sorts of terribleness, but there were also some legit comments. And -- I tell you what -- SCoPE-MS gets some comments. Particularly regarding aspects of the RAW data in the public repositories, and I think that is something we will really need to talk about at some point.

My opinion is that we've been really lucky as a field in that we....mostly haven't actually been sample limited. Ten years ago the people doing cell culture would look at me like I was a tyrant when I said I needed 1mg of protein for global + PTMs. I get the same exact look now when I ask for 50 micrograms.

With the exception of PTMs on tyrosine, glycopeptides and a few other weird things, I'd feel comfortable saying that >90% of the peptide MS/MS spectra reported in the literature have looked like this --

>80% sequence coverage thanks to
1) An abundance of signal
2) Really really friendly charge distribution thanks to basic residues

In SCoPE-MS we don't have #1. There is a limit to how much you can load your carrier channel without fogging your single cell signal (as an aside, I have a crazy hypothesis that this limit is very different depending on whether you are using a D20 or D30 Orbitrap). So...the spectra are always flirting with the background noise. With low signal, nothing is all that pretty.

Here is the big question though:
How many fragments do you actually need for confidence in that identification?

Another question: If you were doing targeted peptide stuff with SRMs how many do you need to trust an identification? 3? With unit resolution? And a good reproducible retention time?

I think we've got a philosophical hurdle at some level for this one, particularly for people in our field with analytical chemistry as their background. If you look at who got really comfortable with the SCoPE-MS stuff and jumped on it first, I think it has been the people coming from the genomics or informatics world.

I promise, if you had been looking at microarrays yesterday, the SCoPE-MS data is a huge and beautiful upgrade. But, if you are used to loading 1ug of peptides on your Q Exactive....SCoPE-MS data is going to take some getting used to.

Monday, February 10, 2020

Purple -- Pick unique peptides for viral (and other?) experiments from FASTA!

Hey you! Are you looking for a tool to help you select viral peptides for targeted assays? 

Unrelated --- what is the best color of dinosaur? 

I got you, yo. Check this out. 

Before you panic, when they wrote the paper "Purple" was just a Python script that you can get here. I assure you this is no longer the case. There is a very straight-forward (to install) executable that will set you up with a GUI that looks just like this --

-- that you can get here.

What does it do? Well, it helps you select peptides that are ideal for targeted assays from the databases you feed it. Imagine Picky, but you can load stuff that isn't human into it. (If you are doing human proteomics -- you should be using Picky, btw. It's amazing).

Purple: feed it the peptide sequences you're interested in; feed it your contaminating background; choose your rules; get your peptides!

Sunday, February 9, 2020

UniProt has a page and resources set up for 2019-nCoV now!

A lot of people downloaded my ugly FASTA for 2019-nCoV after I posted it. UniProt has done their normal crazy meticulous job of assembling all the data and is a much better resource.

You can check it all out here.

Thursday, February 6, 2020

Peptide biomarkers for bacterial pathogens!

I've only got a few minutes, but -- wow -- is this ever worth reading!

Microbial ID by shotgun proteomics is NOT new. But promising study after promising study seems to end up with -- no new clinical assays.

MALDI-TOF with a BioTyper is easier in the clinic, I guess, but maybe we just need the right technologies to get us over the hump. Clearly, researchers' insistence on sticking with nanoLC is a big hurdle, but maybe innovative sample prep methods would also help bridge the gap?

They use some crazy technology in this one. A flow cell digestion method that allows a tryptic digest of bacterial proteins in one hour? And a depletion technology that removes "host" (human!) biomass??

I have to mention that this study is a big collaboration between groups in Stockholm (where HUPO 2020 is!) and Gothenburg, a city blessed by some dark metal gods or something to be the birthplace of the greatest bands that have ever walked this earth. Yup, I definitely had to mention that.

Tuesday, February 4, 2020

22 Phosphoproteomics Data Analysis solutions go head to head!

Sometimes I take a dataset and compare 2 different data processing pipelines. One time, maybe I compared 3? 

22? What? Wow! Why do we even have 22 pipelines? The abstract suggests that there are very good reasons, actually -- the results aren't the same....and they propose a solution for this. Only a paywall and a biological requirement for sleep stand in my way of reading this right now!

As a reminder -- there is a super epic community proteomics PTM challenge coming up in less than 2 weeks and I think maybe 10 labs have signed up for it so far.

I think that this is probably a great resource to help set the stage.

Covalent Protein Painting to measure in vivo protein misfolding!

If there is an easier looking experimental method to measure protein misfolding in vivo, I've never seen it.

If you are interested in structural proteomics stuff at all, I highly recommend this preprint.

Formaldehyde is pretty efficient at binding to proteins! Turns out that:

1) You can get heavy stable isotopically labeled formaldehyde.
2) In your cells, the formaldehyde can only get access to the outside of your protein 3D structures, effectively "painting" the surface of them.
3) You can compare different biological conditions by using "heavy" and "light" formaldehyde.

Digest your proteins with chymotrypsin and voilà -- you can quantitatively compare the outside of your proteins and protein-protein complexes!

The downside here is that you have to think hard about the peptide identifications, as the near-isobaric label pairs -- CDH2 : 13CH3, 13CH3 : CDH2, 13CHD2 : CD3, CD3 : 13CHD2 -- could turn this into a Disaster Level: "deuterated deamidation" study.

To fully eliminate this as an issue, these authors acquired MS/MS at 120,000 resolution! In my opinion that is overkill, but on the instrument they used, they've got 60,000 or 120,000 to choose from, and 60,000 is going to get a little sketchy on the larger fragment ions. (Loosely related...I commonly run at 90,000 resolution on another instrument...)
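If you want to see why the resolution matters, here's some quick arithmetic on those methyl isotopologues (standard monoisotopic masses; the 500 m/z fragment is just an example I picked):

```python
# Standard monoisotopic masses (Da)
H, D = 1.00782503, 2.01410178
C12, C13 = 12.0, 13.00335484

# The methyl-group isotopologues from the heavy/light labels
labels = {
    "CH3": C12 + 3 * H,
    "CDH2": C12 + D + 2 * H,
    "13CH3": C13 + 3 * H,
    "CD3": C12 + 3 * D,
    "13CHD2": C13 + 2 * D + H,
}

# The troublesome pair: one deuterium vs one 13C
delta = labels["CDH2"] - labels["13CH3"]  # ~2.9 mDa
print(f"CDH2 - 13CH3 = {delta * 1000:.2f} mDa")

# Very rough resolving power needed to split that pair on a 500 m/z fragment
mz = 500.0
print(f"R ~ m/dm = {mz / abs(delta):,.0f}")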

Despite the decreased number of scans possible on an LC time scale, they come back with a tremendous amount of data.

In case any of the authors see this -- unless I'm completely misunderstanding what I'm seeing -- Extended Data Figure #4 is possibly my favorite visualization I've seen of anything so far this year. (Maybe I should put this comment on the bioRxiv thing like I'm supposed to....)

Oh yeah! I almost forgot! On top of how cool the technique is, the authors make some interesting findings regarding protein folding and Alzheimer's!

Sunday, February 2, 2020

Remember that Prosit thing everyone was talking about? It is super easy to use!

It's about time that we talked about how to add....

...well...deep learning...(but...come on, I HAD to use that when I found it, right?!?) to your proteomics workflow!

Don't want to read my rambling about why Prosit is awesome and just want to do it? Skip to Part 2 below!

I almost guarantee that there is someone at your facility who drops all sorts of words like this around -- and maybe that same person has given you reason to question their intelligence in other matters, but as long as they keep saying things about "neural networks" and "semi-supervised" whatevers it seems like everyone wants to talk to them, and maybe give them lots of money. Follow this easy walkthrough and THAT COULD BE YOU.

I jest, because Prosit is the real deal and has real world advantages, including more and higher confidence identifications right now.

For a biomolecule, the peptide bond is a joy to work with -- energetically -- crudely optimize the collision energy and you'll break most of them. Our friends in the small molecule world, where I continue to dabble, don't have it anywhere near as good. There seems to be no rhyme or reason to which energy will break which bonds. When I do QE metabolomics, I step my CE, typically with 10, 30, 100 -- just to come close. The ID-X even has something called "assisted" CE, where it tries to help. Most of the time when you've got a molecule you really want to study, it makes sense to run it 10 times with different energies....

However -- just because peptides are better than most molecules at fragmenting, that doesn't make them consistent. Look at them. Why on earth would you miss the y7 in this peptide or the y4 in that one? It's just not there. And -- at some level it must make sense --energetically.

Prosit was described here last year:

In as few words as I appear capable of writing -- Prosit looks at the ProteomeTools database (you know that thing where they are synthesizing EVERY human peptide and then fragmenting them and making libraries?) and it models the peptides YOU give it against that library with this deep learning thingy.

PART 2: How to use Prosit! 

You will need:
1) A protein .FASTA database.
2) EncyclopeDIA (you can get it here)
3) That's it. I just felt dumb making a list with 2 entries in it.

EncyclopeDIA can do all sorts of smart stuff (some of which I wrote not smart stuff about here) -- and it also has awesome utilities. Such as "Create Prosit CSV from FASTA"

As an aside, I heard from the Prosit team -- they'll have this integrated soon, but if you wanted to put the words "deep learning" on your ASMS abstract that is due tomorrow you have to do what I am doing.

This is ridiculously easy. Add your FASTA. It will make you a Prosit .CSV file. I believe very strongly in you and your abilities. You'll definitely be able to do it!
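If you're curious what EncyclopeDIA is actually making for you, here's a minimal sketch of the file (the three-column layout is what Prosit's server expected at the time of writing; the peptide list is just a few stand-ins, NOT a real digest):

```python
import csv

# Example tryptic peptides (the Biognosys iRT peptides, just as stand-ins);
# EncyclopeDIA's "Create Prosit CSV from FASTA" generates the real list.
peptides = ["LGGNEQVTR", "GAGSSEPVTGLDAK", "VEATFGVDESNAK"]

with open("prosit_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # The three columns Prosit's server expected at the time of writing
    writer.writerow(["modified_sequence", "collision_energy", "precursor_charge"])
    for pep in peptides:
        writer.writerow([pep, 27, 2])  # NCE 27, 2+ precursors
```

One row per peptide/charge/CE combination -- that's the whole file.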

Now -- go to the Prosit site and load that CSV you just made.

Hit next and then tell Prosit the format of your output library:

I'm using MSP because I can't afford Spectronaut yet. Then submit your job!

Now -- this is important. When you submit the job you'll go into the queue. Copy the link URL it gives you and/or the Task ID number before you close your browser -- otherwise you won't get your library. When it's ready you'll get a download link!

If you want to check the quality of your MSP library -- PDV is a nice, lightweight Java program that will let you flip through all the spectra. If you've already got the NIST MS Interpreter installed, it will also load them. PDV will look something like this!

For this peptide, Prosit predicts that for a CE of 27 I'm not going to see every b/y ion. There are some bonds that it thinks, from the hundreds of thousands of real peptides it has studied, just won't fragment well.

And if, for example, you are looking at that real peptide. And it's right? Then you aren't penalized for missing that fragment when using this library!

Saturday, February 1, 2020

Predicting PTMs in 2019-nCoV Wuhan Coronavirus

Yeah....maybe I need a hobby....but I think this stuff is cool AND I've learned how to use some new tools thanks to my curiosity about this new virus and thinking about how I would analyze proteomics data from the virus if I could get my hands on it....

Here is the question: PTMs don't typically just happen indiscriminately. There are particular motifs that are the targets of the enzymes that add the PTMs. So...can we start with just some unknown linear proteins and predict what PTMs that we would find?

And...are those predictions any good? I can't yet answer that part directly, but I'm trying.

There are a LOT of tools that predict PTM sites. After two late nights of trying a few of them and doing a lot of failing -- this older one is my current leading favorite -- and you can read about it here.

If you've got better things to do on a Saturday than read, I got you, yo! 

You can also just go and dump stuff into their server. The interface is super straight-forward. Put in your protein FASTA entry (one at a time), pick your mods and hit the button. (You can also install it locally, but I'd rather use their electricity.)

You are capped at 5,000 amino acids per model with the web interface of their server.  And you are definitely penalized for longer sequences. At 1,000 amino acids, I recommend walking your dog.

Okay -- so only one protein from the 2019-nCoV translated FASTA is over the cap, so I broke it into 5 separate translated regions with a large overlap in peptide sequences (in case the domains it is modeling against for PTM prediction are large ones). And -- it took basically all morning.
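If you ever need to do the same chunking yourself, here's a sketch of the idea (the 5,000 is the server's cap; the 500-residue overlap is just an arbitrary number I picked):

```python
def chunk_sequence(seq, max_len=5000, overlap=500):
    """Split a long protein into overlapping windows for a size-capped
    prediction server. Each chunk carries its offset so predicted site
    positions can be mapped back to the full sequence."""
    if len(seq) <= max_len:
        return [(0, seq)]
    chunks, start = [], 0
    while start < len(seq):
        chunks.append((start, seq[start:start + max_len]))
        if start + max_len >= len(seq):
            break
        start += max_len - overlap
    return chunks
```

The overlap means any motif near a chunk boundary is fully contained in at least one window, so the predictor never sees a motif cut in half.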

You get a pretty output that you can keep, or you can have it kick out a tab(?) delimited text file. I spent a lot of time swearing while combining everything into a single Excel file. (I need to grow up and stop using Excel. It always seems like it will be easier -- even though it increasingly is not the easiest solution.)

Okay -- and here I'm talking smack about Excel -- and the Ideas button just did something smart!! Normally, it's just funny to hit the button, but -- darn -- it made a decent Pivot Table!

If you're interested in the actual motifs predicted to be modified, you can download them from my Google drive here.

Okay -- so -- that's all nice and all. Predicted PTMs are a pretty big step away from actual PTMs.

..and rightly so...

Can we test this?

I mentioned a couple of days ago that there was some cool unpublished MERS-CoV proteomics data on MASSIVE.

Now -- this is CID ion trap MS/MS data -- not my favorite source of data for identifying PTMs. It also kind of rules out some of my favorite tools, because they were designed with HRAM MS/MS data in mind. So...back in the time machine to the 1990s to fire up SeQuest and take a minute to polish up my sense of skepticism....

Okay -- this will take more than a minute or two....I forgot how long CID MS/MS takes to search with a couple of PTMs.

I broke it up into queues and only one has finished -- aaaaaaannnnnddddd....nothing! I do actually need another hobby....maybe something I can do inside, in case I screw up my knee and have to do a lot of sitting around for a while.

However -- there is A LOT wrong with this system. One -- we're looking at single shot analysis from 2009's best mass spectrometer -- in a human cell background. We're not exactly digging to the full depth of the proteome -- and PTMs rarely want to announce themselves. Two -- I'm using a prediction model built on one virus that is similar to another, so we are definitely reaching when making predictions off this little data across the board. Three through 41 --? I didn't even look to see if that region of the similar protein is even digested by trypsin. Maybe that is for next Saturday.

Posting a friendly reminder from Dr. Yates.

One of the laziest posts I've ever made...but I've got a lot of stuff to do this weekend.....


Friday, January 31, 2020

PCR + Mass spectrometry for Coronavirus detection!

This study is a couple of years old, but it highlights a clever way of detecting pathogens -- amplifying DNA and then doing mass spectrometry on the amplicons.

The upsides:
-You can start with virtually no DNA and make a ton of it
-MALDI-TOF is fast when you've got a ton of samples to screen
-MALDI instruments, while maybe not very common in research environments, are increasingly common in clinical labs. (Big question, though, is the flexibility -- I think that a lot of these are locked down to performing one specific assay, but -- still -- those places would have staff with the technical expertise to prep the samples and run the instruments.)
-If you pick primers well, you'd be resistant to mutations in the viral strains

The downsides:
-PCR takes time. Is it faster than it was?
-It takes some people a long time to make primers and to verify they don't cross-react. (I've heard the tools have gotten better.)
-MALDI is almost always connected to lower-end TOF devices (sensitivity, resolution and accuracy are typically the weak points)

Check out this alternative technique that would alleviate some of this --

Same general idea -- 15 years ago -- this group amplified their viral DNA (SARS coronavirus) and then did FTICR....aren't there a few thousand smaller, faster FTMS devices around the world right now? If you've already done PCR, would you even need to couple it to HPLC? FIA-MS on an Orbitrap?

Sorry if you're tired of hearing about these viruses, but I'm motivated to read/write about it

On the topic of the 2019-nCOV (Wuhan) coronavirus -- check out this beautiful resource from NextStrain!

If you are working on this from a proteomics/mass spec/clinical detection perspective and want to talk, please reach out. I'd be happy to lend a hand developing/troubleshooting assays or by (much more usefully) connecting you to people who could be of great assistance.

And -- while I'm sleepy rambling -- check this out -- ModPred thinks palmitoylation is a dominant PTM -- if you are developing peptide-specific assays -- I'd skip that entire protein terminus.

Thursday, January 30, 2020

MaxQuant.Live 1.2 is live!

I'm a little behind on this and don't have actual data to compare yet, but MaxQuant.Live 1.2 is up for download.

Important factor for many of us -- it still appears to rely on the same Foundation and Tune 2.9 (no 2020 mandatory upgrades if you were already using it in 2019).

The interface looks a little cleaner, but if you were also hoping for some magical new data acquisition method to appear in the App store, you'll be disappointed.

Given how active the developers have been on the awesome MaxQuant.Live Google group discussion forum, I think we're looking at a great piece of software that still hasn't been utilized to its potential, but has had some minor bugs ironed out.

APEX Proteomics applied to Stress Granules Provides insight into ALS progression!

A great way to see trends in proteomics is to go to ProteomeXchange and see what everyone is uploading.

The word "spatial" is completely blowing up. I think there are a couple versions of the modern "spatial" techniques. APEX, however, might be the best example. There is a great overview of this technique on the Krogan lab website that you can view here (I don't want to steal their nice image!)

Want the 18 second description/reminder?
1) You make a version of your protein of interest with a peroxidase on it, then you put it into your biological system and let it interact with all its friends.
2) You put in some biotin-phenols that float around in your system doing (presumably) no harm or alterations to your system.
3) !!SURPRISE!! your system by adding hydrogen peroxide!
4) Nothing special happens to any other proteins (except normal H2O2 effects, I guess), but your protein of interest has that peroxidase on it -- it reacts with the biotin-phenols around it, which causes all its local protein friends to be labeled with biotin!
5) Pull down the biotin labeled proteins and now you know all the proteins in the general area of your protein of interest. Cool, right?

(It is a bit more complex than this. You don't want the APEX fusion being expressed all the time, for example -- you can wait and activate it when your cells get to the right stage -- and you also need to quench the reaction, but now I need to change how many seconds this took again...)

However, if you want to see it in action in a medical context -- check out this new preprint!

This group used multiple APEX Fusions to study stress granules (SG)s which according to the authors "form in response to a variety of cellular stresses by phase-separation of proteins associated with non-translating mRNAs" (yes, stolen from the first sentence of the abstract).

Unlike a lot of spatial labeling techniques -- this one requires nothing special from the mass spectrometer. The biotin-labeled proteins are pulled down and digested. This group used a QE Plus for part of the study and it looks like they upgraded to a QE HF at some point. The LC separation was a long (3 hour) gradient on 50cm columns. A comparable cycle time is maintained between the two instruments by using 17,500 resolution for MS/MS on the Plus and 30,000 resolution on the HF. The data processing here is MaxQuant and it looks very typical.

With this system set up, the experiment gets more complex, with the addition of ALS-linked dipeptides that show alterations in the stress granules and the SUMOylation -- which is where the biology goes beyond me.

What I do get:
1) A fantastic application of this powerful new technique
2) A method that demonstrates that I could definitely do my part of this.
3) Important new insights into the progression of a protein disease?

Wednesday, January 29, 2020

Skyline for small molecules/metabolomics and Skyline 20.1!

Skyline has had support for small molecules and metabolites for years now-- but I still have a lot of trouble setting it up and have to bug smarter (and typically younger...) people for help a lot.  What I could use is a Step-by-Step protocol and template files I can download. 

While I'm on the Skyline topic -- I just got this great email overnight -- Skyline 20.1 is up. 

It does require a manual download and install (which you can download here) -- but the Skyline team hasn't forgotten about proteomics.

Edit: On my laptop that has the Windows 10 disease, I did have to manually remove the last install of Skyline and reboot to install 20.1.

I've trimmed the email to remove any mention of command line Skyline and stats words I'm unfamiliar with. And highlighted my favorite parts. 

MSFragger spectral libraries!! Pull out the dark proteome and then quantify it?!?  For the biopharma groups that are finding multi attribute monitoring the most cost-effective way forward? Supported! 

Improvements since Skyline 19.1 include:
  • Prosit spectrum and iRT prediction support directly integrated into the UI
    • Building libraries for targeted peptides in a document through Peptide Settings - Library - Build button.
    • Prosit spectrum prediction viewing in the Spectrum Match plot with new right-click menus, including mirror plotting
    • Settings in Tools > Options > Prosit
  • Support for spectral library building from MS Fragger pepXML search results
  • Support for diaPASEF!
    • We have run this with 2 separate 3-organism datasets through the LFQBench statistical assessment and that works.
  • Improved ddaPASEF and initial prmPASEF support.
  • Performance gains in importing Agilent and Waters IMS data as much as 2x or more.
  • Parallel file import with proteomewide DIA in the UI or by default on the command-line has performance similar to what was previously only available from the command-line using --import-process-count. Choose "Many" on your next import or just ignore threading the next time you import from the command-line.
  • Optimized spectrum memory handling for instrument vendors with .NET data reader libraries, benefitting Agilent, SCIEX, and Thermo
  • A new "Consistency" tab in the Refine > Advanced form, supporting CV and q value cut-offs
  • New checkbox for Refine > Advanced - Results tab Max precursor peak only
  • Support for Multiple Attribute Model (MAM) grouping with Peptide.AttributeGroupID and PeptideResults.AttributeAreaProportion
  • Added File.SampleID and .SerialNumber (of the instrument) as fields in Document Grid custom reports
  • Transitions Settings - Full-Scan - MS/MS filtering has been extended to apply to all non-MS1 spectra (e.g. MS3) as long as the MS1-level precursor matches the target precursor m/z
  • The redundant library filtering phase of spectral library building is around 20x faster
  • Improved iRT calibration UI making it easy to create new sets of standards based on existing sets that can be used in spectral library building and the Import Peptide Search wizard
  • More iRT improvements including more intelligent use of 80+ CiRT peptides when CiRT is chosen during library building
  • New right-click > Quantitative menu item for changing the Quantitative property on transitions in the Targets view
  • TIC and BPC now come from raw data files and do not need to be extracted from MS1 spectra, which has performance benefits for MS1 filtering
  • New global "QC" transitions have been added such as the pressure trace
  • Calibration curve fixes to make ImCal (Isotopolog Calibration Curves) work
  • New "Calculated" annotations have been added which support storing Skyline calculated values in annotations for future use with AutoQC
  • Support KEGG IDs as molecular identifiers in small molecule targets.
  • Improved support for D used in chemical formulae in place of the Skyline default H'
  • Added support for Thermo Exploris and Eclipse instruments
  • Support for opening .skyp files downloaded directly from Panorama

Tuesday, January 28, 2020

Publicly available (unpublished?) proteomic, metabolomic and lipidomic (MERS-CoV) coronavirus data!

Wow. Do I ever love ProteomeXchange!

Skip my reading and go to MASSIVE and get proteomic data from cancer cells infected with a coronavirus and -- if you're into that sort of thing -- you can get metabolomic data here and lipidomic data here! 

The RAW files currently heating my apartment may not have been published yet, but they are publicly available. I just contacted the uploader, but I'm moving fast because this data is 1) awesome and 2) pertinent.

The Wuhan coronavirus (2019-nCoV) has a very close neighbor (possibly the closest, according to my rough pBLAST of the entire translated sequence -- though that may just be a consequence of it emerging more recently, as sequencing technology has gotten cheaper and more common, leading to more data) -- called MERS-CoV (here is the entry from UniProt), or Middle East respiratory syndrome-related coronavirus.

The experiment is 9 files from Calu cells (which appear to be an immortalized human cancer cell line) infected with the virus and 3 files from "Mock" (presumably uninfected).

The files were acquired on an Orbitrap Velos in "high/low" mode (120k resolution MS1 and CID ion trap fragmentation). The files appear to originate from PNNL, where it is rumored they know a thing or two about running mass spectrometers.

MetaMorpheus recalibration shows the MS1 is spot on -- something like -1 ppm off actual when compared against human -- and -- get this -- I can get >70% coverage of the main capsule protein from the virus in the virus-infected proteomes. This is really cool because that protein is well conserved between the two (by pBLAST score, anyway).
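For reference, since ppm errors come up constantly -- the arithmetic is just:

```python
def ppm_error(observed_mz, theoretical_mz):
    """Mass accuracy in parts-per-million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

# A -1 ppm systematic error at m/z 800 is less than a millidalton:
print(f"{ppm_error(799.9992, 800.0):.2f} ppm")  # prints -1.00 ppm
```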

Update: More fast moving science!! Just because the pBLAST scores line up, it doesn't mean that the peptides do -- check this out!

Again -- big disclaimer -- this is a mass spectrometrist's blog. I know very little about viruses and am just interested in this topic!

Single cell RNASeq + Plasma Proteomics + Machine Learning!

You should check out this new preprint here! 

What a great week or 10 days for proteomics. Holy cow. January was kind of laggy and then -- BOOOOOOOOOOOM --!!

Okay -- yet another paper is going into a file called "January 2020 papers you must read!!" -- which -- is too many words for this cursed Windows 10 thing --


Back to the paper -- if you just read the abstract, the phrasing will make you think that our friends at the Max Planck jumped on the SCoPE-MS electric Porsche into the future, but inside you'll find more standard plasma profiling (which looks a lot to me like the clinical proteomics proof-of-concept work we've seen from the Mann lab -- high fractionation, rapid HF runs [relatively affordable instrument!!] for individual patients, and MBR). You can read my rambling about one of my favorites of these recent studies here.

Couple that to high throughput single cell transcriptomics, then use machine learning to link the plasma proteome features to the single cell transcripts across 31 clinically derived factors from these patients and -- it looks like the future to me, but it appears they took the Tesla.

...which...of course, that is a thing, right?

Since I'm still rambling -- this preprint was posted in medrXiV, which has some great disclaimers.

Monday, January 27, 2020

Predicting human life span with deep plasma proteomics??

...and the 2020 Grammy for most eye grabbing title goes to....

...this brand new study that is the first or second thing I'll get to once I'm safely behind a university library's paywall....

To be clear, I haven't read this and I'll probably doubly verify the QC/QA checks on my baloney detectors before I do. But if you think there is a force on earth that can keep me from reading this today --

...I mean...besides the paywall....$8.99....

Sunday, January 26, 2020

Wuhan Coronavirus (2019-nCoV) Complete Protein FASTA download

Edit 2/10/2020: UniProt has resources up. These are better. You can check them all out here!

I was looking for a complete protein FASTA database for the Wuhan coronavirus and came up empty.

The NCBI database was just updated yesterday (direct link here) so I pulled the newest sequences and just assembled them into a single file.

You can download the complete protein FASTA from this Google drive link here.

Hit me up if you have any issues with it.

Image above is from this preprint which was updated on 1/22 after it was ORIGINALLY POSTED ON 1/21!! This is how fast science can be, people!

Yikes -- okay, well I guess the way that blew up I wasn't the only person looking for it.

Disclaimers: I'm a loud mouthed mass spectrometrist who knows very little about viruses. I just put all the sequences NCBI translated into one file so the common proteomics software on my computers will accept it.
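If you're curious what "put all the sequences into one file" amounts to, here's a minimal Python sketch of the same idea -- the filenames and headers below are made up for illustration, not the actual NCBI files:

```python
# Toy sketch of assembling individual protein FASTA files into one
# database file. Filenames and sequence headers here are made up.
from pathlib import Path

def merge_fasta(inputs, output):
    """Write every record from the input FASTA files into one file."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text().strip()
            if text:
                out.write(text + "\n")

# Tiny stand-in files so the sketch runs end to end.
Path("orf1ab.fasta").write_text(">made_up_orf1ab\nMESLVPGFNEK\n")
Path("spike.fasta").write_text(">made_up_spike\nMFVFLVLLPLV\n")

merge_fasta(["orf1ab.fasta", "spike.fasta"], "2019-nCoV_complete.fasta")
print(Path("2019-nCoV_complete.fasta").read_text())
```

Any proteomics search engine that accepts a standard FASTA should take the merged file as-is.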

An Encyclopedia with Quantitative Proteomics of 375 cancer cell lines!?!?!?

Ummm.....whoa....I'm just going to leave this here. This is far too large of a resource for me to tackle on a Sunday morning.

Here is an overview and a lot of links to/around/about the study -- including a query-able SQL database in case you're not sure where to put 4,000 Fusion RAW files....

Correction: It's only 500 or so files. Multiplexin'

And here is a short paper about it....

Saturday, January 25, 2020

EPIFANY -- A smart and fast method for protein inference!!!

This new study in press at JPR is critically important for shotgun proteomics and smarter people than me (which means just about everyone) should really take a good look at this and 1) verify it is as good as it looks and 2) see about integrating the source code or primary logic into all sorts of other tools. (An earlier draft was also made available through bioRxiv.)

Okay -- so -- shotgun proteomics is really really good at one thing -- making Peptide Spectrum Matches (PSMs).

And if we're working with the best proteins in the whole entire world then each and every one of those PSMs is unique to 1 particular protein and when we identify that PSM and quantify it in a sample we have proven that particular protein is there and we can even get quantification estimates / measurements on that one protein from that one PSM. (I need a word count on sentences).

However -- from an evolutionary perspective it doesn't make a ton of sense for each protein to have developed in isolation with no relation to any other protein. So...a lot of PSMs could be derived from more than one protein. And if you only identify PSMs that could originate in more than one protein, what do you do?

You INFER the protein identity.
How do you do that?
Well -- probably by a set of mostly arbitrary rules that were chosen because....we had to do something...and it's a great idea if we keep them to ourselves...because they don't reflect well on us or our field.

The best one? When you've got equal evidence, it's probably the biggest protein in your FASTA database....(some tools use the highest percent coverage, but then you'll get all weirded out because if UniProt contains your full length variant and 4 alternative "fragment of" protein sequences you'll only ever see the fragments and then you'll be afraid your lysis method broke off all your protein termini...and you can't rule that out!). It doesn't sound great when you say it out loud. I hate explaining it when I can tell people are paying attention. I go ahead and get the idea of a "razor" peptide out of the way next, because it's better to get two things that damage your credibility out of the way at the same time and then you can spend the rest of your talk or lecture trying to gain it back.
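To make those "arbitrary rules" concrete, here's a toy Python sketch of razor-peptide assignment with the biggest-protein tie-break. All the names and numbers are made up, and real tools (MaxQuant, etc.) do this far more carefully:

```python
# Toy sketch of razor-peptide assignment: a shared peptide is handed to
# the protein with the most total peptide evidence, and ties go to the
# longest protein in the FASTA. Accessions and lengths are made up.

def assign_razor_peptides(peptide_to_proteins, protein_lengths):
    """Assign each shared peptide to a single 'razor' protein."""
    # Count how many peptides (unique + shared) point at each protein.
    evidence = {}
    for proteins in peptide_to_proteins.values():
        for prot in proteins:
            evidence[prot] = evidence.get(prot, 0) + 1

    assignment = {}
    for peptide, proteins in peptide_to_proteins.items():
        # Most evidence wins; ties broken by protein length -- the
        # "biggest protein in your FASTA" rule from above.
        assignment[peptide] = max(
            proteins, key=lambda p: (evidence[p], protein_lengths[p])
        )
    return assignment

peptides = {
    "LVNEVTEFAK": ["ALBU_HUMAN"],                # unique peptide
    "AEFAEVSK":   ["ALBU_HUMAN", "FRAGMENT_1"],  # shared peptide
}
lengths = {"ALBU_HUMAN": 609, "FRAGMENT_1": 120}

print(assign_razor_peptides(peptides, lengths))
# The shared peptide goes to ALBU_HUMAN (2 peptides of evidence vs 1)
```

Note that nothing in this sketch is probabilistic -- which is exactly the problem the inference papers below are trying to fix.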

I'm oversimplifying a complex and varied environment of protein informatics software here. It isn't all this way. From the paper:

"Some methods tackle this problem by either ignoring shared peptides (Percolator 7,8), employing maximum parsimony principles and finding a minimal set of proteins explaining found peptides or PSMs (PIA4 ), iteratively distributing its evidence among all parents (ProteinProphet 9 ) or incorporating the evidence in a fully probabilistic manner (Fido 10, MSBayesPro11, MIPGEM12)"

The best way to do this? An exhaustive recent analysis on the iPRG 2016 dataset (the big ABRF study that comes up a lot) showed that the fully probabilistic models are the way to go. More statistics, FTW!

However -- I've only used Fido, and it required a whole lot more processing time/power than even Percolating a large dataset. And this study suggests it's not just me...it's a brute force approach that, in the end, may not be realistic.
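For a feel of why the brute force route blows up, here's a toy sketch -- my own simplification, emphatically not Fido's actual code or model -- of exact Bayesian protein inference by enumerating every present/absent combination of proteins, which is 2^N configurations:

```python
# Toy exact Bayesian protein inference. All parameters are made up;
# the point is the itertools.product loop: 2^N configurations means
# 40 proteins already costs ~a trillion iterations.
from itertools import product

def exact_protein_posterior(proteins, peptide_parents, peptide_prob,
                            prior=0.5, emission=0.9, noise=0.01):
    """P(protein present | observed peptides), by brute-force enumeration.

    peptide_parents: {peptide: set of proteins that could produce it}
    peptide_prob:    {peptide: probability the PSM is correct}
    A peptide is emitted via a noisy-OR over its present parent proteins.
    """
    posteriors = {p: 0.0 for p in proteins}
    total = 0.0
    # Enumerate all 2^N present/absent assignments -- exponential!
    for config in product([0, 1], repeat=len(proteins)):
        present = {p for p, bit in zip(proteins, config) if bit}
        weight = 1.0
        for p in proteins:
            weight *= prior if p in present else (1 - prior)
        for pep, parents in peptide_parents.items():
            n = len(parents & present)
            p_emit = 1 - (1 - noise) * (1 - emission) ** n  # noisy-OR
            weight *= (p_emit * peptide_prob[pep]
                       + (1 - p_emit) * (1 - peptide_prob[pep]))
        total += weight
        for p in present:
            posteriors[p] += weight
    return {p: v / total for p, v in posteriors.items()}

probs = exact_protein_posterior(
    proteins=["A", "B"],
    peptide_parents={"pep1": {"A"}, "pep2": {"A", "B"}},
    peptide_prob={"pep1": 0.99, "pep2": 0.95},
)
print(probs)  # A, which has a unique peptide, scores well above B
```

Approximate message-passing schemes (which is roughly what the "loopy beliefs" below are about) exist precisely to avoid that exponential loop.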

EPIFANY uses some fancy statistics to achieve the same (better?) inference results, but uses alternative logic (something about "loopy belief propagation") that massively reduces the data processing load.

Full disclaimer -- I'm still trying to figure out how to use it because it runs in KNIME and I might be too dumb for it. I just found this cool KNIME cheatsheet thing -- with this and the full pipeline and all data available here I'm hoping to work my way through it. [Hooooly cow. You can run it from the command line....how did I miss that!?!?]

However -- the evidence here is solid that this is a better way to infer protein identifications. The authors test it against multiple datasets including the iPRG and use all sorts of ways to infer the protein identities and EPIFANY is the best -- or close enough -- and finishes in a reasonable time.

And -- look -- even if it didn't work any better at all, wouldn't it be better for us to use the tools that at least tried to use intelligent statistics to infer our protein identities? Grant review boards are grumpy by design. We don't need to give them excuses to fund more transcriptomics.