Tuesday, July 16, 2019
If I did this right again this time, you can register for this ABRF/GenomeWeb Joint Webinar by clicking anything from this line up.
And....so...it's obviously totally July right now. Not only July but also like past the middle of July and I'm 100% prepared for things that are to happen in the middle of July like my meeting with some super important government persons today about the second lab I've built from scratch in less than 5 months, as well as giving a webinar for freaking GenomeWeb with 2 of my personal professional heroes and of course my webinar is totally done and suitably well rehearsed....
....you should definitely register, the other 2 talks are great and, at the very least, I'll try to say something funny, and I'll try to come up with something smart, too. (This is a recap of our ABRF 2019 workshop with updated material. I definitely know where those slides are, but I enjoyed writing it this way more.)
Monday, July 15, 2019
Wednesday, July 10, 2019
Every once in a while I start a blog post and I think "oh no. what am I doing?" This is probably going to be one of them, but I'm going to keep typing anyway, so keep this in mind:
1) I have well over my 10,000 hours of kicking data of all types around with different search engines and false discovery rate (FDR) algorithms -- so I have real world advice.
2) This involves statistics. My last formal course on statistics was in the 90s. I did tutor statistics. Also in the 90s. We didn't use computers in statistics in my undergrad institution -- because it was the 90s...
Inaccurate FDR guide time!
Let's use the Proteome Discoverer stuff above because it's what I know best. Let's start with the easiest one, using PD 2.2's fantastic and logical numbering system:
4) Fixed Value PSM Validator = Translation: No FDR!
What is it? That's actually inaccurate in the later versions, but up to PD 2.1 (I'm pretty sure) there were no settings in this box. It is basically there because you have to have something. This relies on the cutoffs within the search engine itself. If you're using SeQuest, you've just moved the scoring to rely entirely on the XCorr cutoffs within your Admin --> Configuration --> SeQuest node.
In PD 2.2 and 2.3(?) this now includes your DeltaCN cutoff. DeltaCN just determines what you do when you have 2 sequences that match your MS/MS spectrum. Imagine you have these 2 matches and one gets a score of (bear with me) 100 and the other gets a score of 96. If your DeltaCN is 0.05, you get a peptide spectral match (PSM) for both sequences, because your delta off of your highest match is within 5% (0.05). However, if you get 2 potential peptide matches and one is 100 and the second is 84, then only the top PSM is reported.
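If it helps, here's that DeltaCN logic as a few lines of Python. To be clear, this is just a sketch of the filter as I described it above -- the scores and sequences are made up and this is not Proteome Discoverer's actual code:

```python
# Hypothetical sketch of the DeltaCN filter -- the scores and peptide
# sequences below are invented for illustration.

def psms_passing_deltacn(candidates, delta_cn_cutoff=0.05):
    """Keep candidate matches whose normalized score gap from the top
    hit is within the DeltaCN cutoff.

    candidates: list of (peptide_sequence, score) for one MS/MS spectrum.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_score = ranked[0][1]
    return [
        (seq, score) for seq, score in ranked
        if (top_score - score) / top_score <= delta_cn_cutoff
    ]

# The 100 vs 96 example: the delta is 0.04, so BOTH PSMs are reported.
print(psms_passing_deltacn([("PEPTIDER", 100.0), ("PEPTLDER", 96.0)]))

# The 100 vs 84 example: the delta is 0.16, so only the top PSM survives.
print(psms_passing_deltacn([("PEPTIDER", 100.0), ("QWERTYK", 84.0)]))
```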
When do I use it? When I'm looking at a single protein, or 5 proteins, or 10 proteins. Maybe up to 100 proteins? Not many proteins! Or if I want to get a ton of IDs from a global dataset and I don't care about the level of accuracy in those IDs....which is a joke. I don't use this for global and you shouldn't unless you've got other filters like >3 unique peptides per protein or something.
6) Target Decoy PSM Validator = Translation: The first kind of FDR for proteomics
What is it? It was a revelation in proteomics about a decade ago. Unless I'm mistaken it was first described by Elias and Gygi? Although, the paper I typically associate it with is this paper from the same authors a few years later.
You essentially do 2 searches and then apply a filter:
1) Normal FASTA database
2) Your FASTA database that you screwed up. You either read all the entries in reverse or you scrambled them.
3) You adjust the scoring on the two searches to cutoffs that only allow a small percentage of the screwed up database to match. For example (and yes -- I know -- this is only a useful statement for visualization...I've heard all the statistician tirades...) you adjust the scoring until only 1% of your screwed up FASTA database gets picked up as real hits. This is the whole misnomer of 1% FDR. Never say this in front of an audience.
The assumption you lean on is that your screwed up database won't get matches, or will get very very few. On a big database that has been screwed up, there are ALWAYS going to be matches by your search engine. The goal here is to control the random chance of PSMs occurring (this works if you assume the scrambled database matches would only occur by random chance, which - due to things like organisms biasing toward specific amino acid usage and the biological value of having repeating sequences in some proteins -- isn't really true, but it is uncommon enough that it makes a good metric).
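The steps above can be sketched in a few lines of Python. This is a toy illustration of the target-decoy idea, not any search engine's real implementation -- the score lists would come from whatever your engine emits:

```python
# Toy sketch of reversed-database target-decoy FDR filtering.

def reverse_decoy(sequence):
    """Build a 'screwed up' decoy entry by reading the sequence backward."""
    return sequence[::-1]

def fdr_at_cutoff(target_scores, decoy_scores, cutoff):
    """Estimate FDR as (decoy hits) / (target hits) above a score cutoff."""
    targets = sum(1 for s in target_scores if s >= cutoff)
    decoys = sum(1 for s in decoy_scores if s >= cutoff)
    return decoys / targets if targets else 0.0

def cutoff_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Raise the score cutoff until the screwed-up database contributes
    no more than max_fdr (e.g. 1%) of the accepted matches."""
    for cutoff in sorted(set(target_scores)):
        if fdr_at_cutoff(target_scores, decoy_scores, cutoff) <= max_fdr:
            return cutoff
    return float("inf")

# e.g. reverse_decoy("PEPTIDEK") gives "KEDITPEP"
```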
When do I use it? When I need to be conservative with my identifications. Particularly for big databases, target decoy is going to throw out a LOT of good data. I do not typically use this for single proteins. Maybe I do as a comparison?
(Too much uninterrupted text in a row makes me nervous. Pug Pile!!)
5) Percolator = Translation: Try to find the good data that target decoy threw out by looking at a bunch of factors.
What is it? It is essentially mainstream proteomics' first foray into intelligent learning machine deepness (insert arbitrary-sounding big data computer term here -- again -- check Wikipedia a LOT before you use any of these terms for an audience. They all sound like the same thing but they aren't. Percolator is really a "semi-supervised learning" thing, and people take these different terms very seriously. I once made the innocent mistake of referring to some Harley-Davidsons outside a bar as "nice bicycles" inside said bar -- which, by the way, they technically are -- and it was almost as poor an interaction over a technicality as I've seen at conferences where people got their artificial learning things mixed up.) Percolator was probably first described here. A while back a colleague assembled a list of all the things that Percolator considers when rescoring a peptide spectral match (PSM) and that list is here. Since Percolator rescores peptides with its own metrics, it's great for combining the results of different search engines.
When do I use it? When I've got a LOT of data and I want to get the most identifications possible. I NEVER use Percolator for a single protein digest. I don't know the realistic cutoff where I trust Percolator output. I might put it at around 1,000 proteins. BY THE WAY. If someone gives you an immunoprecipitation or Co-IP and it's got 4,000 proteins in it, there are lots of great guides online on how to do one of these experiments properly. Ask that person to read one and try again. Don't spend all your time trying to interpret garbage pulldowns. If you are going to try and make sense of it, don't treat it like a Co-IP experiment. Treat it as an affinity enrichment and just do everything they do in this paper.
7) IMP-Elutator = Translation: Percolator + Chromatography data
What is it? As best as I can figure out, Elutator takes the Percolator approach but then also incorporates chromatography information predictions. I think IMP-Elutator was first detailed here.
When do I use it? Big datasets with a lot of PTMs. MSAmanda uses more sophisticated features for PTM identification and confidence in those identifications than older code like SeQuest. MSAmanda rewards your scores for higher mass accuracy (measured in PPM accuracy, rather than mass range bins) and can handle things like neutral losses from your peptides (akin to MaxQuant searches). Since I'm already using MSAmanda (2.0!) for getting high confidence PTMs, I go ahead and use the tools that were designed to work with it. More importantly, as you've added PTMs, your search matrix increases exponentially, as does your likelihood of getting bad matches. By taking chromatography into account you can further improve on your FDR.
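On that "search matrix increases exponentially" point -- here's the back-of-envelope version. If every modifiable residue can independently carry a variable PTM or not, the number of peptide forms your engine has to score doubles with each site (this is a simplification; real engines cap the mods per peptide, which is exactly why):

```python
# Why variable PTMs explode the search space: each modifiable site can
# be modified or not, so forms grow as 2^(number of sites). Simplified --
# real engines limit mods per peptide for this exact reason.

def variable_mod_forms(peptide, modifiable_residues):
    """Count peptide forms when each listed residue type can
    independently carry (or not carry) one variable modification."""
    sites = sum(1 for aa in peptide if aa in modifiable_residues)
    return 2 ** sites

# One phospho-acceptor (the T): only 2 forms to score.
print(variable_mod_forms("PEPTIDEK", "STY"))
# Seven S/T/Y sites: 128 forms for one peptide. It takes off fast.
print(variable_mod_forms("SSTTYYPEPTIDEK", "STY"))
```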
Okay -- I hope this helps? Thanks to Dr. R.R. who wrote me the question that made me realize I hadn't updated anything on FDR in PD since like 2012 and I should reeeeeaaaallly go back and delete some old stuff!
Tuesday, July 9, 2019
Sorry for all the "building FASTAs and genomes from proteomics and nextgen stuff" posts. If your organism is all annotated and clonal, this is probably boring. However, if you are studying something that hasn't been perfectly annotated or has individual genetic variation, this is a great time for you! So many options!
InterPro isn't new, by any means, but InterProScan (scan is the web portal) was just upgraded last week and EMBL-EBI has some very convincing arguments (like near the end of this YouTube video) for why you should be using it for your annotations.
Okay -- having a gorgeous web interface shouldn't be the reason to use something -- but when it's backed with this kind of ridiculous power, it doesn't hurt. All these databases feed in seamlessly and you can choose to use some or all of them.
Like the pBLAST web interface, you can only search one thing at a time, which is great if you're down to a small list of quantitatively interesting things. However, there is a stand-alone program (InterPro, I think). It can only be installed on 64-bit Linux, so no casual Windows people.
Okay -- but the reason I've been running InterProScan is actually because of something else from EMBL -- it's called UniFire and there isn't tons out on it yet.
However, there IS a free class on Wednesday July 10th on UniFire. You can register for it here.
Monday, July 8, 2019
If I did it right you should be able to click anywhere on the image above and go to the free registration thing. If I didn't, you can go to it by clicking here instead.
Were you one of the 65 labs that got the HeLa DIA samples sent to you?
Are you curious how that turned out?
Are you just interested in DIA and wonder how 65 different labs did when sent the same samples? Then tomorrow's GenomeWeb/ABRF webinar thing is for you!
Sunday, July 7, 2019
This certainly isn't the first comparison of label free quan algorithms, but it might be one of the most comprehensive and well executed ones.
5 publicly available data sets (all spike-ins of digests into complex backgrounds).
5 common software packages used for all comparisons
Exploration of different imputation methods and how they affect each one of these software package results.
Cool and complex metrics developed and clearly explained (I had to circle each one so I could reference back to it -- and the morning after reading it I cannot tell you what a LogFC is without going back to look, but it seemed logical!)
And Open Access! I am surprised to just now notice that it is from last fall?
Friday, July 5, 2019
I didn't go out for fireworks. I sat at home and learned how to annotate these protein FASTA databases that I generated from Illumina ("short read sequencing") and PacBio ("long read sequencing") data.
I started with BlastP command line, but half way through I decided to again see how long it takes me to manually annotate. From 5:08pm to 5:45pm I manually annotated 42 FASTA entries. That's less than one every 90 seconds. Let's call it one minute. I only have 17,166 to go. If I could keep going 24 hours a day it would only take me 11 days to get through it. Don't check my math. I probably did it wrong.
I have something important to do in like 12 days (this cool ABRF recap webinar series!), and if I didn't sleep for 11 days straight, I probably wouldn't do a very good job. New plan! eggNOG!
This is the newest paper, but what I actually appear to be using is 2.0. There is a paper from 2017 on 1.0, but just user documentation in-between.
What's better than reading? Dumping all your cool nextgen filtered data stuff into someone else's server and seeing if it works!
You can dump your data into their server here.
Or you can get the code to run it locally for yourself here.
What's it do? Okay -- so what I have from all the next gen data that I 6 frame translated to proteins is a crappy annotation that looks like this:
What's that annotation mean? Nothing useful. It may be the number of the probe used for the sequencing in that experiment. Then you have the filtered protein sequence that I translated from that probe. What I want to know is what that protein actually is.
pBLAST through the web interface (using my personally preferred reliable older, somewhat slower version) took about 3 minutes for this individual protein sequence to give me this annotation:
(The way I do it faster is to have 10 tabs open at one time. Please don't do it this way. You're smarter than me.)
eggNOG is listed in the original paper as being 15x faster than pBLAST. It's more like 10,000x faster than a human with pBLAST.
Check this out. Online -- it did my FASTA with almost 36,000 entries in under 90 minutes!
AWESOME. Okay. So this is what you get out of it.
....perhaps the biggest bummer in the world on the 4th of July is the fact that it doesn't combine the .FASTA with the .Annotations. Yo. That's what I'm here for.
So -- what I did was use Excel (trigger groans) and the VLOOKUP function. This is essentially a lookup where you say: if a value appears in Column X, pull in the corresponding value from Column Y. Yes, there are smarter ways of doing this, but mine totally worked. I changed the extensions of both files to .fasta, which allows them to be opened by Proteome Discoverer.
I plan to release a package called "Ben's dumb Excel tools for mass spectrometrists" when I get time and this should be part of it. I should post that here sometime....
Wednesday, July 3, 2019
Time to upgrade your browser with this awesome add-in!
It's called GIX and you can read about it here:
(there is also a preprint here)
You can just go ahead and install GIX by going to the GIX site (if you are using Firefox or Chrome -- but...why wouldn't you...?...are there better options?) here.
Once you add GIX to your browser, go to any webpage, anywhere. If there is a gene name displayed anywhere in your browser, double click on it!!
If you double click on it you should get ALL THIS INFO as something that pops up on the right side of your screen. That's all you have to do. And, as far as I can tell, it always looks just like this. I can find the MW of any protein just by clicking on the name of the gene -- same format -- any site. No more "UniProt has that info here, and NCBI keeps it in this pulldown thing"
HOW COOL IS THAT???
It looks like you can customize it a lot, and there is an API, but I like it exactly the way it is.
Tuesday, July 2, 2019
I just reread the short preprint describing Orbiter and feel that my rush job of going over it didn't do it justice at all.
In Synchronous Precursor Selection (SPS) MS/MS/MS (MS3) reporter ion quantification, there are 3 main steps (I'm trying to write out all my acronyms after some reader comments. There are new people coming into this field all the time! Don't kill them with acronyms!).
The first is the MS1 scan. All the masses that the Orbitrap can see.
The second step is/are the MS/MS (MS2) scan(s). Part way through the MS1 scan the computers onboard the Tribrid (Orbitrap Fusion 1/2 or 3) have enough information about the ions present that they start selecting the most interesting ions for MS2. The ion trap is crazy fast and in the remaining time left in the MS1 scan you can select dozens of ions for MS2. The combination of the high resolution accurate mass (HRAM) of the MS1 with the ion trap fragments is enough, in many cases, to fully sequence the peptide of interest. In SPS MS3 there is an additional step.
Ions in the MS2 are selected by SPS, isolated, combined, and then subjected to MS3. High energy fragmentation is used to fully liberate all of the reporter ions from all of the ions SPS selects. The Orbitrap is used with a high (or what is now funny to consider "relatively high," because it wasn't realistically attainable by anything not all that long ago -- but maybe I'm getting old) resolution of, ideally, 50,000 at m/z 200 or so. This is way slower than the ion trap.
What Orbiter is doing is looking at the MS2 scans as they're being acquired, doing a search on them and providing information back to the Tribrid on what ions it should do MS3 on.
Yes. As I mentioned, the ion trap is crazy fast. So is Orbiter. Orbiter is using the Comet search engine. If you aren't familiar, it was written by a guy who had kind of a lot to do with something called Sequest. They share some similarities in their base functionalities, except he stopped working on Sequest a long time ago and he's still involved in improvements on Comet. (Other people have made improvements on Sequest, but I've never heard of a comparison between the two that Comet didn't win). The focus here is to do the search accurately AND FAST. Orbiter can complete a yeast search with some pretty loose tolerances (less restrictive tolerances typically increase search time) in 5ms and using all human UniProt + Isoforms in 17ms. This captures it better --
And it isn't a sloppy search, either. It's modeling peptide quality using a new and faster mechanism, a multi-feature linear discriminant analysis (LDA), that takes into account the XCorr, mass accuracy, percentage of matching fragments (this is not a feature in classic peptide search engines, btw, but increasingly common, thankfully) and other features to tell good peptides from bad peptides. The Comet E-value is too slow, but the LDA method seems to be a huge win all around for speed and accuracy of the search.
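To give a feel for the multi-feature idea, here's a toy linear discriminant. These weights are completely invented for illustration -- the real Orbiter weights are fit against training data -- but it shows why combining features beats XCorr alone:

```python
# Toy linear discriminant: combine several PSM features into one score.
# The weights here are made up -- real LDA weights are learned from data.

def lda_score(xcorr, abs_ppm_error, fraction_fragments_matched):
    weights = {"xcorr": 1.0, "ppm": -0.2, "matched": 3.0}
    return (weights["xcorr"] * xcorr
            + weights["ppm"] * abs_ppm_error
            + weights["matched"] * fraction_fragments_matched)

# A decent XCorr with terrible mass error and few matched fragments
# can now lose to a slightly lower XCorr with clean supporting evidence.
sloppy = lda_score(xcorr=3.1, abs_ppm_error=9.0, fraction_fragments_matched=0.2)
clean = lda_score(xcorr=2.8, abs_ppm_error=0.5, fraction_fragments_matched=0.8)
print(clean > sloppy)  # the cleaner match wins despite the lower XCorr
```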
The limiting factor here is really the relatively slow MS3 scan, which at 50,000 resolution takes 86ms. As long as Orbiter can figure out the peptide and what MS2 fragments from that ion that should be selected and the Fusion can select them and get them ready before the 86ms of the previous scan has elapsed, then this is a complete and total win.
What's important to note here is that you do not get more MS1, MS2, OR MS3 scans with Orbiter. What you get is far fewer MS3 scans wasted on junk that would otherwise slip through without Orbiter's intelligent filtering.
I don't have any Orbiter RAW files yet, but based on files I've got from ProteomeXchange from Lumos MS3 the normal way (there are Lumos files in the PRIDE repository as well as Fusion 1), such as this study, and comparing them to the files from this QE HF-X TMT MS2 study, MS2 based quan on a benchtop is still faster. You get more MS2 scans than the Lumos gets MS3. Even with the improvements on the Fusion 3 Vader system, I don't see where it could make up for this speed gap.
However, with no real time searching, the HF-X is fragmenting a lot of things that just slipped through. Monoisotopic precursor selection (MIPS) which is probably called "Peptide Match" on your system helps a lot, but it still allows stuff to get through that you aren't going to be able to sequence. So the number of matched peptides per unit time is going to be way higher with Orbiter.
Comparing SPS MS3 with and without Orbiter -- Orbiter allows practically the same number of identifications in HALF the time.
There is still one downside in the back of my head that I'm concerned about, and that is individual genetic variation and PTMs we don't know about yet. Comet/Orbiter is fast enough to take in a surprising number of modifications and keep up performance (I can NOT wait for hands-on time. I have some ideas I want to try). But what if your target organism has a cool mutation that isn't in your database yet? Or, simpler to think about, what if the thing that is really cool about your organism is a PTM that you don't know about yet? What if it's a sulfonylation, for example? If Orbiter doesn't know you're looking for a sulfonylation, is it going to skip all the peptides that have that modification?
I don't mean to end on a low note for what is an incredibly brilliant bit of coding that leads to somewhat unbelievable levels of performance improvements, but it is something to think about.
Monday, July 1, 2019
I'll be honest. I can't actually read this paper. My library doesn't carry this journal and I don't have $39.95 for a single PDF download. However....I think we're looking at the experimental design of the year award.
1) Get 50 people
2) Give them 1g of ethanol per kg of body weight. (Siri says I'm 84kg...okay...you've got my attention....
...ummm...84/14 = 6 drinks for me. That's a bottle and a half of wine? Ouch. Okay. Admittedly, I've experienced that, but never as the subject of a scientific study.)
It gets better. The participants had to finish this alcohol in one hour!!
3) The best part. Blood draws every 15 minutes for 3 hours. Then every 30 minutes for the next 3 hours and then every hour for the next 9 hours.
3+3+9 = 15 hours of blood draws! 27 blood draws over the next 15 hours.
Can you imagine a study design that would be more unpleasant for the participants, study organizers, and perhaps particularly the phlebotomists? How fun is it to get a blood draw on someone who just chugged 6 glasses of wine? Is it worse then, or after 15 hours of visiting?
But that's what science is. An incorruptible pursuit of the truth. Kudos to these authors for asking the big questions.
Sunday, June 30, 2019
Sooooooo....umm...I'm relatively sure I just learned a few things from this new paper ASAP at JPR and I think you should read it, particularly if you want to learn/unlearn/overlearn what you thought you understood about how XIC (eXtracted Ion Chromatogram) based clustering (a step most commonly employed as a part of label free quantification workflows by data dependent LC-MS/MS) works.
Now. I'm on the fence here, because I don't understand this very well, but I'm really excited about this study and how they did it. Stop typing? Type faster? Screenshots? ....Screenshots!
Okay -- so WTFTICR is any of these here words?
Either I was in the sun too long or all of these are new programs to me. Presumably they exist behind the scenes as steps in processing pipelines I know about? I don't know. This paper is worth reading just to look up new software! But that's not all it is. It is about a new way to do XIC clustering based on Bayesian thingamathings, which of course are:
...I totally knew that....
You can get XNET at this GitHub, and it's released under an Apache License, which is something I'd seen written and...I give up. I had no idea what that was either, but it is an open source license that you can read about here.
And my favorite part about this paper might be how brazenly just honest and just good the whole thing seems. My interpretation is this:
1) Bayesian network things might be a smart way to do XIC clustering quan
2) This is what this is and how we set it up.
3) Here is the potential it might have
4) Here we stacked it up against stuff that you already know, like MaxQuant and OpenMS
5) Sure -- we don't actually win this comparison, but here is all our code under this cool open source license so you can use it and build on it.
The only thing that might possibly improve this intimidatingly smart and positive example of how science should work might be ending it like this.
Thursday, June 27, 2019
Okay -- no time to read this -- I've really got to run to meetings -- but -- is this FINALLY it? Is this finally moving isotope analysis from instruments that only have maximum mass ranges of like 5 Th to instruments that can do other things?
I don't know....but it kind of looks like it is.... For real, if you aren't familiar, you should take a look at isotope analysis and how it hasn't changed at all since the 60s....and compare it to this!
Big shoutout to Dr. Kermit Murray who does a better job of keeping track of inorganic mass spectrometry advances than I do.
Holy cow....C&EN is already running with a press release of a JASMS study -- I think this is it!! You can check it out here. No, this isn't proteomics, but this is potentially a light year jump for our long-suffering friends in the inorganic MS world!
Wednesday, June 26, 2019
Disclaimer: You should definitely 100% completely follow every step of your vendor-provided protocols. You probably spent close to a million $$ on that instrument; what are your total reagent costs in comparison? The "next gen" people drop close to $1,000 per sample no matter what on their global assays....
However...as just a thought experiment...one that I would never recommend ever setting up in real life... maybe take a look at this interesting paper?
A while back I noted some other (also unrecommended) thought experiments from some people at Harvard where they were using less than the ratios that you should be using (I can't find the link...) but I think they were using 1:2 ratios but typically getting 95%-ish labeling efficiency.
In this (don't do it, for real) thought experiment, these theorists find they can obtain over 99% efficiency with a 1:1 ratio of reagent after some tweaking.
You certainly shouldn't do this.
And you certainly shouldn't take these authors' word that this works. You can check their work; it's at ProteomeXchange here. Definitely download the .csv file, since the file naming isn't quite intuitive. And keep in mind that about half the optimization was done with the TMT-0 reagent, which, if you haven't used it before, may not be a default in your software (this is where you find/add it in Proteome Discoverer; depending on your version, you may need to Apply the modification and then close your open workflows so you can see it).
(TMT-0 is +224, TMT-duplex is +225, TMT6-11plex is +229, for reference in any other software program)
I'm definitely not going to do this myself, obviously, I'm going to follow my vendor recommended protocols, just as I will urge you to do, but --
-- a quick filter says that out of 5,234 PSMs labeled in this method, I'm getting 3 unlabeled PSMs from the authors' data....that's >99.942% labeling efficiency....which....wow....??...what? Okay...so this was me just processing one of the short gradient optimization TMT-0 experiments, so grain of salt here on the larger complex sets, but wow....not a bad start. You still shouldn't do it, but it's an interesting thing to think about.
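For anyone who wants to reproduce that "quick filter," here's roughly what it amounts to in Python. The PSM records and the "modifications" field name here are mocked up -- adapt it to however your own PSM-level export labels the TMT modification:

```python
# Rough sketch of the labeling-efficiency filter: count PSMs carrying a
# TMT modification versus those that don't. The mock PSM records and the
# 'modifications' field name are assumptions -- match them to your export.

def labeling_efficiency(psms):
    """psms: list of dicts with a 'modifications' string, as exported
    from a PSM-level results table. Returns percent labeled."""
    labeled = sum(1 for p in psms if "TMT" in p["modifications"])
    return 100.0 * labeled / len(psms)

# The numbers from the authors' file: 3 unlabeled out of 5,234 total.
mock = ([{"modifications": "TMT0 (N-term)"}] * 5231
        + [{"modifications": ""}] * 3)
print(round(labeling_efficiency(mock), 3))  # 99.943
```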
Tuesday, June 25, 2019
My library hasn't indexed this (Just Accepted), but I think I can guess the idea. We've got gas lines just hanging out unused when we run nanospray -- chances are you just have them blocked off, but you could use them!
In this case, these authors turn on the gas when the troublesome stuff is eluting. I pulled a RAW file from ProteomeXchange/PRIDE here and it looks quite convincing!
This shouldn't be confused with the ABird, which is constantly removing background ions. Heck, the two might work really well in unison.....
Dr. Dave Sarracino used to do something that might be similar to NanoBlow, but I can't find any record of it. Correction, found it! I apologize to everyone if this isn't the same kind of thing, I'll correct things when I can read the paper!
Monday, June 24, 2019
At ABRF this year I talked a lot about SOPs and LIMS and why I love them and why they are critical to our future development as a field.
Laboratory Information Management Systems (LIMS) are maybe starting to take off now, in some form or another, after a couple of false starts? I think proteomics changed way, way too fast -- 2D gels were still cutting edge not all that long ago and are mostly a niche technology now. A lot of coding went into technologies that aren't our central pipelines anymore.
What's a LIMS? Well...the word means a lot of different things to a lot of different people, but this is how the old Johns Hopkins Clinical LIMS worked when I was there a million years ago.
1) Samples come in and are barcoded
2) Samples only move to a new place when a new person (tracked person) scans them and moves them to a new location (sample collected. sample arrives at new location)
3) Samples are only processed by the strict criteria in the computer (multiple choice options -- no creativity allowed).
4) Sample data is reported out into the LIMS as it is achieved
5) Most of this data is directly uploaded to the computer by the instrument
6) The operator verifies this data is correct
7) The data goes into permanent encrypted storage that is only accessible by the appropriately credentialed parties.
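The whole workflow above boils down to an append-only chain of custody per barcode. Here's a bare-bones sketch -- obviously a real LIMS is a database with authentication and encrypted storage, not a Python dict, and every name here (MiniLIMS, the field names) is invented for illustration:

```python
# Bare-bones sketch of the sample tracking steps above: every action on
# a barcoded sample is appended, timestamped, to its custody record.
# All class/field names invented; a real LIMS is a proper database.

from datetime import datetime, timezone

class MiniLIMS:
    def __init__(self):
        self.samples = {}  # barcode -> list of custody/result events

    def _log(self, barcode, event):
        event["timestamp"] = datetime.now(timezone.utc).isoformat()
        self.samples.setdefault(barcode, []).append(event)

    def receive(self, barcode, operator):
        # Step 1: sample comes in and gets a barcode
        self._log(barcode, {"action": "received", "operator": operator})

    def move(self, barcode, operator, location):
        # Step 2: samples only move when a tracked person scans them
        self._log(barcode, {"action": "moved", "operator": operator,
                            "location": location})

    def report(self, barcode, operator, assay, value):
        # Steps 4-6: instrument data is logged and operator-verified
        self._log(barcode, {"action": "result", "operator": operator,
                            "assay": assay, "value": value})

lims = MiniLIMS()
lims.receive("BC0001", "ben")
lims.move("BC0001", "ben", "digestion bench")
lims.report("BC0001", "ben", "CRP", "2.1 mg/L")
print(len(lims.samples["BC0001"]))  # 3 events in the chain of custody
```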
This is obviously a lot trickier for some assays than others. What was the CRP level? The total albumin, etc.? That's a lot easier than -- "what was the relative shift in the PARPylation level across the entire proteome normalized against the global level?" Extra steps are involved, but it's the same stuff.
Whoa! Here is a great review that is better than what I'm going to write. You should check it out instead!
Do we have LIMS options now? I think so, and I hope that we'll have more. (Let's stick to global untargeted first)
In no particular order check out Proteios.
First you'll think -- that paper is 10 years old! Well, here is a paper where it was used just a few months ago!
Another great one (more focused on the data management side, as far as I can tell) is msLIMS. It runs via a really slick Java GUI that you can get here.
Okay -- for something completely different (and something you have to buy, no idea whatsoever the cost, but there is a 30 day free trial, disclaimers over there somewhere... -->) check out this LabKey thing!
Back to the stuff for global -- here is another one from the rush around 2010 to get platforms out (MasSpectra) that has been seen in the literature as recently as the last few years (Bonus, it's a 2014 malaria flagella paper that I've never seen before!!)
Are there more? I hope so! But this is a decent start and more than I thought I'd find today!!