Tuesday, July 30, 2019

Ribo-Seq + TMT Peptidomics!

Just this summer I heard about Ribo-Seq (Wikipedia article here) for the first time.  Honestly, Ribo-Seq, or ribosome profiling, is one of those technologies that I think will start taking lots of the easier work from proteomics facilities -- if it hasn't done so already. You get a ton closer to accurately predicting protein levels if you're actually measuring what gets to the ribosome, right?

In this new study the two are used in unison to look for new brain specific neuropeptides.  

The approach used here seems to combine the results of TMT proteomics and Ribo-Seq and look for concordance. It is worth noting that another approach (Proteoformer 2.0 post here), which is a bit newer than this March publication, has demonstrated combining the two a bit further upstream to amplify the results.

In the Nucifora Lab study we see that the two are highly complementary and have the ability to differentiate organ specific peptides and proteins.

Monday, July 29, 2019

Don't throw out all the Keratins! They might be important!

Now that I'm back in the part of the world where you can't see stars at night but you do have internet service, and for no particular reason, I might highlight research done right here in rat-infested Baltimore.

This is a short one because I've got to get my fingers used to typing again, but this cool recent paper in Science Translational Medicine highlights why you shouldn't use a filter that throws out the word "keratin" in your results! 

The patients that have this disease have massively more of some weird keratins....

For real -- I don't think I could have caught this with several of my default filter setups. Particularly if I was doing skin proteomics? I might have tossed every word that started with a K....

Thursday, July 18, 2019

ABRF GenomeWeb Follow up.

If you missed the ABRF GenomeWeb talks we did the other day, you can still watch them on demand by going to this link and registering. Since my animations didn't work, you can get my slides with working animations here.

We went a little long and didn't get to all the questions, so I took screenshots of them and have been working through them.

Q1: Is there something like RAWMeat that can monitor specifics of instrument performance longitudinally?

A1: Yes. There are several, but these 2 are my favorites:


(New!) QCloud

AutoQC and sPRoCoP have to be mentioned as well. There's loads on this dumb blog about them.

Q2: TMT vs DIA comparisons with today's instrumentation? How do they compare?

A2: It sounds like Dr. Schilling's group is doing a massive and comprehensive comparison and we should watch out for it, but this is a great recent paper where time constraints were utilized as a parameter:

Q3: Can you use a Q Exactive Classic for DIA? 

A3: Totally! Is it as fast as the high field stuff? No, but once you adjust for these limitations you're set. Here is a great example:

Q4: Is there a reason to generate libraries now that Prosit is out there?

Uninformed Opinion 4: Scott Peterson did a really extensive analysis years ago -- I have slides here somewhere -- and ALWAYS the in-house generated spectral libraries were better than any in silico models that he tried. It wasn't even close. There are individual instrument, lab, and sample variations. Prosit is a big step forward. These new intelligent machine based libraries are better than anything we've ever seen, but I bet you that in-house generated will still be better. I don't have data to back this up. I hope that the margin is small. I'll definitely go to Prosit first. Instrument time isn't free. However, I'd comfortably bet $4.13 that when the dust settles and we see the inevitable 10-15 papers that compare the two in the next 18 months or so, in-house will be at least marginally better. (I do hope I'm wrong, though.)

Q5: How much longer do you think that MS will be the dominant technology in proteomics? 

Uninformed Rambling Opinion 5: It depends on how we define proteomics. Are we talking peptide abundance? If so, then I don't expect MS to be the leader for much longer. Arrays are coming. Nanopores are coming. If we're defining proteomics as modified protein abundance and/or top-down, intact protein analysis and quantification? I think we can safely say that MS will be the dominant force for the remainder of my career.

Arrays are neat and everything. And cheap. And they'll get better, but you have to know so much in advance to use them. I'd guess only a few more years before it is better/cheaper/faster to use arrays if you want to find out protein abundance. Honestly, though, who cares about protein abundance? Really? RiboSeq is getting a lot more accurate and faster/cheaper.

We might even see arrays replace phosphoproteomics. I'd have a party. Let's give phosphoproteomics to someone else, because, at the end of the day we all know that LC-MS/MS isn't very good at it. Not really.

Where LC-MS/MS has a tremendous advantage is all the other PTMs that we've only barely explored: glycosylation, acetylation, succinylation, SUMOylation, ubiquitination -- and our ability to look at all of them. And intact protein analysis on a global scale? Yeah -- it's getting closer all the time, and no one on earth wants that problem except for us.

Wednesday, July 17, 2019

WinProphet -- Trans Proteomic Pipeline for Everyone!

I'd like to offer a big thank you to these authors! The Trans Proteomic Pipeline is something I've always respected and read lots about, but have always been far too dumb to use for myself.

You know what I need? A friendly Graphical User Interface (GUI) and that's what WinProphet is.

BOOM -- access to all sorts of tools you've probably heard about. You were probably smart enough to use them (I'm not), but you didn't have time to learn a new command line program with all the options you have for searching proteomics data.

You can skip all the reading and just get WinProphet here!

Tuesday, July 16, 2019

Expert Tips? Running a Mass Spec Core Amid Rapid Change Webinar Tomorrow! 1pm EST.

If I did this right again this time, you can register for this ABRF/GenomeWeb joint webinar by clicking anything from this line up.

And....so...it's obviously totally July right now. Not only July but also like past the middle of July and I'm 100% prepared for things that are to happen in the middle of July like my meeting with some super important government persons today about the second lab I've built from scratch in less than 5 months, as well as giving a webinar for freaking GenomeWeb with 2 of my personal professional heroes and of course my webinar is totally done and suitably well rehearsed....

....you should definitely register, the other 2 talks are great and, at the very least, I'll try to say something funny, but I'll try to also come up with something smart. (This is a recap of our ABRF 2019 workshop with updated material. I definitely know where those slides are, but I enjoyed writing it this way more.)

Monday, July 15, 2019

Biden Cancer Moonshot suspends operations....

I hope this doesn't have consequences for people funded by this amazing program, but conflict of interest used to be a real thing our government considered.....

AP story is here.

Wednesday, July 10, 2019

Ben's inaccurate guide to FDR in Proteome Discoverer -- what to use and when!

Every once in a while I start a blog post and think "oh no, what am I doing?" This is probably going to be one of them, but I'm going to keep typing anyway -- just keep this in mind:

1) I have well over my 10,000 hours of kicking data of all types around with different search engines and false discovery rate (FDR) algorithms -- so I have real world advice.

2) This involves statistics. My last formal course on statistics was in the 90s. I did tutor statistics. Also in the 90s. We didn't use computers in statistics in my undergrad institution -- because it was the 90s...

Inaccurate FDR guide time! 

Let's use the Proteome Discoverer stuff above because it's what I actually know best. Let's start with the easiest one, using PD 2.2's fantastic and logical numbering system:

4) Fixed Value PSM Validator = Translation: No FDR!

What is it? The name is actually inaccurate in the later versions, but up to PD 2.1 (I'm pretty sure) there were no settings in this box. It is basically there because you have to have something. It relies on the cutoffs within the search engine itself. If you're using SeQuest, you've just moved the scoring to rely entirely on the XCorr cutoffs in your Admin --> Configuration --> SeQuest node.

In PD 2.2 and 2.3(?)  this now includes your DeltaCN cutoff. DeltaCN is just detailing what you do when you have 2 sequences that match your MS/MS spectra. Imagine you have these 2 matches and one gets a score of (bear with me) 100 and the other gets a score of 96. If your DeltaCN is 0.05, you get a peptide spectral match (PSM) for both sequences, because your delta off of your highest match is within 5% (0.05). However, if you get 2 potential peptide matches and one is 100 and the second is 84, then only the top PSM is reported.
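That cutoff logic is simple enough to sketch in a few lines of Python. This is a toy illustration of the idea using the made-up 100/96/84 scores from above, not PD's actual implementation:

```python
# Toy sketch of the DeltaCN idea: report a runner-up PSM only if it
# scores within the cutoff fraction of the best match for that spectrum.
def passes_deltacn(best_score, candidate_score, deltacn_cutoff=0.05):
    delta = (best_score - candidate_score) / best_score
    return delta <= deltacn_cutoff

passes_deltacn(100, 96)  # delta = 0.04, within 0.05 -> both PSMs reported
passes_deltacn(100, 84)  # delta = 0.16 -> only the top PSM is reported
```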

When do I use it? When I'm looking at a single protein, or 5 proteins, or 10 proteins. Maybe up to 100 proteins? Not many proteins! Or if I want to get a ton of IDs from a global dataset and I don't care about the level of accuracy in those IDs....which is a joke. I don't use this for global and you shouldn't unless you've got other filters like >3 unique peptides per protein or something.

6) Target Decoy PSM Validator = Translation: The first kind of FDR for proteomics

What is it? It was a revelation in proteomics about a decade ago. Unless I'm mistaken it was first described by Elias and Gygi? Although, the paper I typically associate it with is this paper from the same authors a few years later.

You essentially do 2 searches, then set a cutoff:
1) Normal FASTA database
2) Your FASTA database that you screwed up. You either read all the entries in reverse or you scrambled them.
3) You adjust the scoring on the two searches to cutoffs that only allow a small percentage of the screwed up database to match. For example (and yes -- I know -- this is only a useful statement for visualization...I've heard all the statistician tirades...) you adjust the scoring until only 1% of your screwed up FASTA database gets picked up as real hits. This is the whole misnomer of 1% FDR. Never say this in front of an audience.

The assumption you lean on is that your screwed up database won't get matches, or will get very very few. On a big database that has been screwed up, there are ALWAYS going to be matches by your search engine. The goal here is to control the random chance of PSMs occurring (this works if you assume the scrambled database matches would only occur by random chance, which - due to things like organisms biasing toward specific amino acid usage and the biological value of having repeating sequences in some proteins -- isn't really true, but it is uncommon enough that it makes a good metric).
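Here's the whole target-decoy recipe sketched in Python -- a toy version with made-up scores, nothing like the real validator nodes, just the shape of the idea:

```python
# Toy target-decoy sketch: build a "screwed up" database entry and
# estimate FDR as the fraction of decoy hits above a score cutoff.
def reverse_decoy(seq):
    """One common way to screw up a FASTA entry: read it in reverse."""
    return seq[::-1]

def fdr_at_cutoff(psms, cutoff):
    """psms: list of (score, is_decoy) pairs. Fraction of decoys among
    everything passing the cutoff approximates the false discovery rate."""
    kept = [is_decoy for score, is_decoy in psms if score >= cutoff]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

psms = [(95, False), (90, False), (88, True), (80, False), (60, True)]
fdr_at_cutoff(psms, 85)  # 1 decoy out of 3 surviving hits -> ~0.33
```

In practice you would slide the cutoff until the decoy fraction drops to your target (the "1%" everyone loosely quotes).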

When do I use it? When I need to be conservative with my identifications. Particularly for big databases, target decoy is going to throw out a LOT of good data. I do not typically use this for single proteins. Maybe I do as a comparison?

(Too much uninterrupted text in a row makes me nervous. Pug Pile!!) 

5) Percolator = Translation: Try to find the good data that target decoy threw out by looking at a bunch of factors.

What is it? It is essentially the first foray of mainstream proteomics into intelligent learning machine deepness (insert arbitrary sounding big data computer term here -- again -- check Wikipedia a LOT before you use any of these terms for an audience. They all sound like the same thing but they aren't. Percolator is really a "semi-supervised learning" thing, and people take these different terms very seriously. I once made the innocent mistake of referring to some Harley Davidsons (sp?) outside a bar as "nice bicycles" inside said bar, which, by the way, they definitely technically are, and it was almost as poor of an interaction over a technicality as I've seen at conferences where people got their artificial learning things mixed up.)  Percolator was probably first described here.  A while back a colleague assembled a list of all the things that Percolator considers when rescoring a Peptide Spectral Match (PSM) and that list is here.  Since Percolator rescores peptides with its metrics, it's great for combining the results of different engines.
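The feature-rescoring half of that idea is easy to sketch; the semi-supervised SVM training (using decoys as negatives and high-confidence targets as positives, iterating until the weights settle) is the hard part that Percolator actually does. The features and weights below are made up for illustration:

```python
# Toy sketch of multi-feature PSM rescoring: combine several quality
# features into one score instead of trusting XCorr alone. Percolator
# learns these weights from your own data; here they are invented.
def rescore(psms, weights):
    """Each PSM is a dict of features; return a weighted-sum score."""
    return [sum(w * psm[f] for f, w in weights.items()) for psm in psms]

psms = [
    {"xcorr": 3.1, "delta_cn": 0.40, "ppm_err": 0.5},   # solid-looking PSM
    {"xcorr": 1.2, "delta_cn": 0.05, "ppm_err": 8.0},   # junky-looking PSM
]
weights = {"xcorr": 1.0, "delta_cn": 2.0, "ppm_err": -0.1}
rescore(psms, weights)  # the better PSM comes out with the higher score
```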

When do I use it? When I've got a LOT of data and I want to get the most identifications possible. I NEVER use Percolator for a single protein digest. I don't know the realistic cutoff where I trust Percolator output. I might put it at around 1,000 proteins. BY THE WAY. If someone gives you an immunoprecipitation or Co-IP and it's got 4,000 proteins in it, there are lots of great guides online on how to do one of these experiments properly. Ask that person to read one and try again. Don't spend all your time trying to interpret garbage pulldowns. If you are going to try and make sense of it, don't treat it like a Co-IP experiment. Treat it as an affinity enrichment and just do everything they do in this paper.

7) IMP-Elutator = Translation: Percolator + Chromatography data

What is it? As best as I can figure out, Elutator takes the Percolator approach but then also incorporates chromatography information predictions. I think IMP-Elutator was first detailed here.

When do I use it? Big datasets with a lot of PTMs. MSAmanda uses more sophisticated features for PTM identification and confidence in those identifications than older code like SeQuest. MSAmanda rewards your scores for higher mass accuracy (measured in PPM accuracy, rather than mass range bins) and can handle things like neutral losses from your peptides (akin to MaxQuant searches). Since I'm already using MSAmanda (2.0!) for getting high confidence PTMs, I go ahead and use the tools that were designed to work with it. More importantly, as you've added PTMs, your search matrix increases exponentially, as does your likelihood of getting bad matches. By taking chromatography into account you can further improve on your FDR.

Okay -- I hope this helps? Thanks to Dr. R.R. who wrote me the question that made me realize I hadn't updated anything on FDR in PD since like 2012 and I should reeeeeaaaallly go back and delete some old stuff!

Tuesday, July 9, 2019

InterPro and UniFire -- UniProt is more than a place to download FASTAs!

Sorry for all the "building FASTAs and genomes from proteomics and nextgen stuff" posts. If your organism is all annotated and clonal this is probably boring. However, if you are studying something that hasn't been perfectly annotated or has individual genetic variation, this is a great time for you! So many options!

InterPro isn't new, by any means, but InterProScan (scan is the web portal) was just upgraded last week and EMBL-EBI has some very convincing arguments (like near the end of this YouTube video) for why you should be using it for your annotations.

Okay -- having a gorgeous web interface shouldn't be the reason to use something -- but when it's backed with this kind of ridiculous power, it doesn't hurt. All these databases feed in seamlessly and you can choose to use some or all of them.

Like the pBLAST web interface, you can only search one thing at a time, which is great if you're down to a small list of quantitatively interesting things. However, there is a stand-alone program (InterPro, I think). It can only be installed on 64-bit Linux, so no casual Windows people.

Okay -- but the reason I've been running InterProScan is actually because of something else from EMBL -- it's called UniFire and there isn't tons out on it yet.

However, there IS a free class on Wednesday July 10th on UniFire. You can register for it here.

Monday, July 8, 2019

DIA Community Study Webinar Tomorrow Tuesday the 9th!

If I did it right you should be able to click anywhere on the image above and go to the free registration thing. If I didn't, you can go to it by clicking here instead.

Were you one of the 65 labs that got the HeLa DIA samples sent to you?

Are you curious how that turned out?

Are you just interested in DIA and wonder how 65 different labs did when sent the same samples? Then tomorrow's GenomeWeb/ABRF webinar thing is for you!

Sunday, July 7, 2019

An incredibly comprehensive evaluation of LFQ software!

This certainly isn't the first comparison of label free quan algorithms, but it might be one of the most comprehensive and well executed ones.

5 publicly available data sets (all spike-ins of digests into complex backgrounds).

5 common software packages used for all comparisons

Exploration of different imputation methods and how they affect each one of these software package results.

Cool and complex metrics developed and clearly explained (I had to circle each one so I could reference back to it -- and the morning after reading it I cannot tell you what a LogFC is without going back to look, but it seemed logical!)

And Open Access!  I am surprised to just now notice that it is from last fall?

Friday, July 5, 2019

eggNOG for the 4th of July? Annotate your nextgen derived FASTA!

I didn't go out for fireworks. I sat at home and learned how to annotate these protein FASTA databases that I generated from Illumina ("short read sequencing") and PacBio ("long read sequencing") data.

I started with BlastP command line, but half way through I decided to again see how long it takes me to manually annotate. From 5:08pm to 5:45pm I manually annotated 42 FASTA entries. That's less than one every 90 seconds. Let's call it one minute. I only have 17,166 to go. If I could keep going 24 hours a day it would only take me 11 days to get through it. Don't check my math. I probably did it wrong.
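Since I told you not to check my math, here's a Python scrap that checks it anyway:

```python
# Checking the annotation math above. Numbers come straight from the
# post: 42 entries between 5:08 pm and 5:45 pm, 17,166 left to go.
entries_done = 42
minutes_spent = 37
remaining = 17166

sec_per_entry = minutes_spent * 60 / entries_done        # ~53 s each
days_left = remaining * sec_per_entry / (60 * 60 * 24)   # ~10.5 days nonstop
```

At the actual rate it's about 10.5 days; rounded up to one entry per minute it's closer to 12. Either way, "11 days" of sleepless manual annotation was never going to happen.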

I have something important to do in like 12 days (this cool ABRF recap webinar series!), and if I didn't sleep for 11 days straight, I probably wouldn't do a very good job. New plan! eggNOG!

This is the newest paper, but what I actually appear to be using is 2.0. There is a paper from 2017 on 1.0, but just user documentation in-between.

What's better than reading? Dumping all your cool nextgen filtered data stuff into someone else's server and seeing if it works!

You can dump your data into their server here.

Or you can get the code to run it locally for yourself here.

What's it do? Okay -- so what I have from all the next gen data that I 6-frame translated to proteins is a crappy annotation that looks like this:


What's that annotation mean? Nothing useful. It may be the number of the probe used for the sequencing in that experiment. Then you have the filtered protein sequence that I translated from that probe. What I want to know is what that protein actually is.

pBLAST through the web interface (using my personally preferred reliable older, somewhat slower version) took about 3 minutes for this individual protein sequence to give me this annotation:

 (The way I do it faster is to have 10 tabs open at one time. Please don't do it this way. You're smarter than me.)

eggNOG is listed in the original paper as being 15x faster than pBLAST. It's more like 10,000x faster than a human with pBLAST.

Check this out. Online -- it did my FASTA with almost 36,000 entries in under 90 minutes!

AWESOME. Okay. So this is what you get out of it.

....perhaps the biggest bummer in the world on the 4th of July is the fact that it doesn't combine the .FASTA with the .Annotations. Yo. That's what I'm here for.

So -- what I did was use Excel (trigger groans) and the VLOOKUP function. This is essentially a lookup: if a value matches something in Column X, pull in the corresponding value from Column Y. Yes, there are smarter ways of doing this, but mine totally worked. I changed the file extensions of both files to .fasta, which allows them to be opened by Proteome Discoverer.
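If Excel makes you sad, the same VLOOKUP idea is a few lines of Python. Fair warning: the tab-separated column layout and the "first word of the header is the name" logic below are assumptions -- check them against your own eggNOG output before trusting it:

```python
# Toy sketch of merging eggNOG-style annotations back into FASTA headers.
# The column layout (name in the first field, description in the last)
# is an assumption for illustration.
def load_annotations(lines):
    """Map query name -> description from tab-separated annotation lines."""
    table = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        table[fields[0]] = fields[-1]
    return table

def annotate_fasta(fasta_lines, annotations):
    """Rewrite each FASTA header with the looked-up description."""
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()[0].strip()
            out.append(f">{name} {annotations.get(name, 'no eggNOG hit')}")
        else:
            out.append(line.rstrip("\n"))
    return out
```

Read the .annotations file with `load_annotations`, feed the 6-frame FASTA through `annotate_fasta`, write the result out, and Proteome Discoverer never has to know Excel was an option.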

I plan to release a package called "Ben's dumb Excel tools for mass spectrometrists" when I get time and this should be part of it. I should post that here sometime....

Wednesday, July 3, 2019

GIX -- Turn your internet browser into a genetic search engine!

Time to upgrade your browser with this awesome add-in!

It's called GIX and you can read about it here:

(there is also a preprint here)

You can just go ahead and install GIX by going to the GIX site (if you are using Firefox or Chrome -- but...why wouldn't you...?...are there better options?) here.

Once you add GIX to your browser, go to any webpage, anywhere. If there is a gene name displayed anywhere in your browser, double click on it!!

For example:


If you double click on it you should get ALL THIS INFO as something that pops up on the right side of your screen. That's all you have to do. And, as far as I can tell, it always looks just like this. I can find the MW of any protein just by clicking on the name of the gene -- same format -- any site. No more "UniProt has that info here, and NCBI keeps it in this pulldown thing"


It looks like you can customize it a lot, and there is an API, but I like it exactly the way it is.

Tuesday, July 2, 2019

Orbiter -- Double your TMT efficiency!

I just reread the short preprint describing Orbiter and feel that my rush job of going over it didn't do it justice at all.

In Synchronous Precursor Selection (SPS) MS/MS/MS (MS3) reporter ion quantification, there are 3 main steps (I'm trying to write out all my acronyms after some reader comments. There are new people coming into this field all the time! Don't kill them with acronyms!).

The first is the MS1 scan. All the masses that the Orbitrap can see.

The second step is/are the MS/MS (MS2) scan(s). Partway through the MS1 scan the computers onboard the Tribrid (Orbitrap Fusion 1, 2, or 3) have enough information about the ions present that they start selecting the most interesting ions for MS2. The ion trap is crazy fast and in the remaining time left in the MS1 scan you can select dozens of ions for MS2. The combination of the high resolution accurate mass (HRAM) of the MS1 with the ion trap fragments is enough, in many cases, to fully sequence the peptide of interest. In SPS MS3 there is an additional step.

Ions in the MS2 are selected by SPS, isolated, combined, and then subjected to MS3. High energy fragmentation is used to fully liberate all of the reporter ions from all of the ions SPS selects. The Orbitrap is used with a high (or what is now funny to consider "relatively high" because it wasn't realistically attainable by anything not all that long ago -- but maybe I'm getting old) resolution of, ideally, 50,000 at m/z 200 or so. This is way slower than the ion trap.

What Orbiter is doing is looking at the MS2 scans as they're being acquired, doing a search on them and providing information back to the Tribrid on what ions it should do MS3 on.

Yes. As I mentioned, the ion trap is crazy fast. So is Orbiter. Orbiter is using the Comet search engine. If you aren't familiar, it was written by a guy who had kind of a lot to do with something called Sequest. They share some similarities in their base functionalities, except he stopped working on Sequest a long time ago and he's still involved in improvements on Comet. (Other people have made improvements on Sequest, but I've never heard of a comparison between the two that Comet didn't win).  The focus here is to do the search accurately AND FAST.  Orbiter can complete a yeast search with some pretty loose tolerances (less restrictive tolerances typically increase search time) in 5ms and using all human UniProt + Isoforms in 17ms. This captures it better --

And it isn't a sloppy search, either. It's modeling peptide quality using a new and faster mechanism, a multi-feature linear discriminant analysis (LDA), that takes into account the XCorr, mass accuracy, percentage of matching fragments (this is not a feature in classic peptide search engines, btw, but increasingly common, thankfully) and other features to tell good peptides from bad peptides. The Comet E-value is too slow, but the LDA method seems to be a huge win all around for speed and accuracy of the search.

The limiting factor here is really the relatively slow MS3 scan, which at 50,000 resolution takes 86ms. As long as Orbiter can figure out the peptide and what MS2 fragments from that ion that should be selected and the Fusion can select them and get them ready before the 86ms of the previous scan has elapsed, then this is a complete and total win.
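The timing budget is worth writing out. The 86 ms MS3 transient and the 5/17 ms search times are the numbers quoted above; the rest is back-of-envelope arithmetic:

```python
# Back-of-envelope Orbiter timing budget, using the numbers quoted
# in the post. Everything else the instrument does in parallel is
# ignored here -- this is just the headline comparison.
ms3_transient_ms = 86    # Orbitrap MS3 at 50,000 resolution
yeast_search_ms = 5      # real-time Comet search, yeast, loose tolerances
human_search_ms = 17     # human UniProt + isoforms

# Even the worst-case search hides entirely inside the previous MS3 scan:
assert human_search_ms < ms3_transient_ms
slack_ms = ms3_transient_ms - human_search_ms   # 69 ms to spare
```

Which is why the real-time search is effectively free: the instrument was going to spend that 86 ms acquiring the transient anyway.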

What's important to note here is that you do not get more MS1, MS2, OR MS3 scans with Orbiter. What you get is far fewer MS3 scans wasted on junk that would otherwise slip through without Orbiter's intelligent filtering.

I don't have any Orbiter RAW files yet, but based on files I've got from ProteomeXchange of Lumos MS3 run the normal way (there are Lumos files in the PRIDE repository, as well as Fusion 1 files), such as this study, and comparing them to the files from this QE HF-X TMT MS2 study, MS2-based quan on a benchtop is still faster. You get more MS2 scans than the Lumos gets MS3 scans. Even with the improvements on the Fusion 3 Vader system, I don't see where it could make up for this speed gap.

However, with no real-time searching, the HF-X is fragmenting a lot of things that just slipped through. Monoisotopic precursor selection (MIPS), which is probably called "Peptide Match" on your system, helps a lot, but it still lets stuff through that you aren't going to be able to sequence. So the number of matched peptides per unit time is going to be way higher with Orbiter.

Comparing SPS MS3 with and without Orbiter -- Orbiter allows practically the same number of identifications in HALF the time.

There is still just one downside in the back of my head that I'm concerned about, and that is individual genetic variation and PTMs we don't know about yet. Comet/Orbiter is fast enough to take in a surprising number of modifications and keep up performance (can NOT wait for hands on. I have some ideas I want to try). But what if your target organism has a cool mutation that isn't in your database? Or, simpler to think about, what if the thing that is really cool about your organism is a PTM that you don't know about yet? What if it's a sulfonylation, for example? If Orbiter doesn't know you're looking for a sulfonylation, is it going to skip all the peptides that have that modification?

I don't mean to end on a low note for what is an incredibly brilliant bit of coding that leads to somewhat unbelievable levels of performance improvements, but it is something to think about.

Monday, July 1, 2019

Perhaps the winner for best experimental design (and title?) of the year?

I'll be honest. I can't actually read this paper. My library doesn't carry this journal and I don't have $39.95 for a single PDF download. However....I think we're looking at experimental design of the year award.

1) Get 50 people
2) Give them 1g of ethanol per kg of body weight. (Siri says I'm 84kg...okay...you've got my attention....

...ummm...84/14 = 6 drinks for me. That's a bottle and a half of wine? Ouch. Okay. Admittedly, I've experienced that,  but never as the subject of a scientific study.)

It gets better. The participants had to finish this alcohol in one hour!!

3) The best part. Blood draws every 15 minutes for 3 hours. Then every 30 minutes for the next 3 hours and then every hour for the next 9 hours.

3+3+9 = 15 hours of blood draws! 27 blood draws over the next 15 hours.
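Checking the napkin math (14 g of ethanol per US standard drink is my assumption; the dose and the draw schedule are from the study as described above):

```python
# Sanity-checking the numbers in the post.
body_kg = 84
drinks = body_kg * 1 / 14        # 1 g ethanol per kg -> 6.0 standard drinks

# Every 15 min for 3 h, every 30 min for 3 h, then hourly for 9 h:
draws = 3 * 4 + 3 * 2 + 9 * 1    # 12 + 6 + 9 = 27 draws over 15 hours
```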

Can you imagine a study design that would be more unpleasant for the participants, study organizers, and perhaps particularly the phlebotomists? How fun is it to get a blood draw on someone who just chugged 6 glasses of wine? Is it worse then, or after 15 hours of visiting?

But that's what science is. An incorruptible pursuit of the truth. Kudos to these authors for asking the big questions.