Saturday, November 30, 2013

Comprehensive history of the Orbitrap by Dr. Makarov!

There are a lot of stories out there about the development of the Orbitrap system.  Want the whole story directly from Alexander Makarov?  Check out this month's issue of the Analytical Scientist, cause he wrote out the whole history.

The article, "Orbitrap Against All Odds" is a good read for both people inside the field as well as for anyone who is trying to push through an idea that they believe in, despite the opinions of others.

You can download the complete PDF here!  (You may need to register, first, but it is free!)

Friday, November 29, 2013

How does TMT10 affect peptide charge states?

A reader wrote in with this very sensible question regarding one of my posts on the TMT 10plex reagents.  The question from Javi:  How does the new TMT 10plex reagent affect peptide charge states?  For example, he notes that iTRAQ can lead to an increase in charge states.  The TMT0 reagent, as well, is often used in ETD studies because the charge state gets pushed up.

 But does the TMT 10plex do the same thing?  I could probably ask someone, but since I have a lot of data lying about in all of these portable hard drives, maybe I should just look at a few.  Who needs scientific rigor?  This is a blog, after all!

  Anyway, I picked two tryptic digests that each had roughly 30,000 MS/MS events, both are adherent cancer cell line digests.

The black bars are the unlabeled digest, and the majority of its peptides appear to be +2.  The TMT-labeled digest seems a little biased toward +3.

So, in my completely unscientific (and roughly 4 minute) analysis, I'd say yes, the TMT 10plex is similar to other isobaric peptide labels in that it tends to lead to an increase in peptide charge state.
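If you want to repeat this kind of quick look on your own runs, the tally is easy to script.  This is just a minimal sketch, assuming you have already exported the precursor charge of every MS/MS event into a plain list.  The function name and the toy numbers are made up for illustration and are NOT the data from the two digests above.

```python
from collections import Counter


def charge_state_fractions(charges):
    """Return {charge: fraction of MS/MS events} for a list of precursor charges."""
    counts = Counter(charges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {z: counts[z] / total for z in sorted(counts)}


# Hypothetical toy data, NOT the numbers from the two digests in this post.
unlabeled = [2] * 6 + [3] * 3 + [4] * 1
tmt_labeled = [2] * 4 + [3] * 5 + [4] * 1

unlabeled_frac = charge_state_fractions(unlabeled)
tmt_frac = charge_state_fractions(tmt_labeled)

# Print the two distributions side by side, one charge state per line.
for z in sorted(set(unlabeled_frac) | set(tmt_frac)):
    print(f"+{z}: unlabeled {unlabeled_frac.get(z, 0):.0%}, TMT {tmt_frac.get(z, 0):.0%}")
```

Plot the two dictionaries as side-by-side bars and you have the same kind of figure as above.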

Keep these questions coming!  Sometimes I seriously just run out of things I'm interested enough to write about!

Thursday, November 28, 2013

Turkey (egg shell) proteomics!

Happy Turkey day!  This time, that isn't my pug, he just looks just like him!  That costume is ridiculously expensive.  I'll get it after the holiday when it goes on sale for next year!

Now, I often wonder strange things like:  A whole lot of researchers sure do work with green monkeys as disease models, but the genome has never been finished?  But the turkey genome was finished way back in 2010 (partially, I believe due to my alma mater and the fact that a neutered turkey is our mascot...).  Wow.  This isn't even close to a coherent thought at all!  If you're used to my disjointed rambling, you're probably okay with it, or you've already skipped ahead to the science.

So...what do we do with a turkey genome?  Turkey proteomics, of course!  In this study, Karlheinz and Matthias Mann take a look at the turkey egg shell in comparison to that of the chicken egg shell. Seriously!  And yes, it is a pretty interesting paper!  Ever wondered how to extract protein out of an egg shell?  Not any more!  This (open access) paper has a clear method.

The extracted proteins were run on an Orbitrap Elite in high:high mode and comparisons were done with the fancy statistics in MaxQuant/Andromeda.  It is pretty neat because the extraction required the subfractionation of the proteins present by what they were dissolved in.  Now, I'm a little bit confused about the analysis.  It appears that they used MaxQuant to do a meta-analysis (in short, old data from a database compared to newly acquired data) of the new dataset (turkey) vs an old dataset (chicken).  I do a lot of meta-analysis of genomics data.  But we have all the nice statistical tools we need, as well as enough replicates of the data to verify statistical robustness (a single microarray may have as many as 20-100 signals per protein depending on the array type).   I am unclear as to how this can be done in MaxQuant.  It is likely that the newer/est versions of MaxQuant have some new statistics tools and I just haven't upgraded recently enough.  That probably means it's time for some MaxQuant reviews!

Now we just need to get working on that green monkey genome.

TL/DR:  Matthias Mann's lab did proteomics on turkey eggs.  Ben wrote this and a lot of other words because he thought it would be funny to write about turkey proteomics on turkey day.

Wednesday, November 27, 2013

What does a good TMT or iTRAQ MS/MS spectrum look like?

Holy cow!  I haven't posted anything in almost a week.  Normally there are very good reasons for this, like 1) I changed my password and forgot it or 2) It was nouveau week, or 3) The super cool projects I'm working on in my spare time are either a) something I can't yet tell you about cause it's secret or b) something that didn't actually work.  Possibly a combination of all of these, but I'd appreciate it if y'all would assume it isn't primarily 3b!

But now I'm back, full of espresso and I'm excited to throw this one out here.   I may have written about this before, and I plan to actually do another entry later on "this is a good spectrum, this is not" but this one is pertinent to a lot of people due to the explosive popularity of the (fantastic!) TMT10 reagents.

Here is the question:  When I'm looking at an MS/MS spectrum of a reporter ion tagged peptide, what am I looking for to tell that I have a good one?  I.e., how can I tell that my HCD collision energy is too much or too little?

Disclaimer:  I totally made the following up.  I ran iTRAQ for years for my own research and I help at least one person a week optimize their reporter ions.  This is the way I do it.  People have probably published other better ways of doing it, but this one is faster.

I base my opinion of whether I'm looking at a good MS/MS spectrum on only 2 things:
1) Are there reporter ions?
2) Can I find my parent ion at <5% base peak intensity?

Randomly chosen example from a friend's TMT10 run:

This is on an Orbi Velos.  The chromatogram isn't ugly because of spray stability issues (Patricia doesn't mess around when it comes to technique, the spray stability is great).  It is ugly because of the relatively high amount of time it takes the Orbi Velos to do a Top15 method with MS/MS at 30k resolution.

Let's look at criterion #1

Reporter ions!  Check

What about criterion #2?  The base peak is 1.2E6

How about my parent ion?

Hard to read, but the parent is 4E4.  Less than 5%, but still there.  HCD can be a little tricky to optimize.  It can be easy to over-blast your peptides and not have enough left to sequence.  If there is still a small percentage of the parent around, then I can feel pretty confident that I didn't hit the peptide too hard.  If there is a lot of parent around then I didn't hit it hard enough.  The 5% rule is a crude estimation.  Is there parent?  Is there just a tiny bit?  Perfect.
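For anyone who would rather not eyeball every spectrum, here is a rough sketch of the same two checks in Python.  Everything about it is an assumption on my part: the function name, the peak list format (a list of m/z and intensity pairs), the 126.0-131.5 window I use to catch the TMT reporter region, and the 0.02 m/z tolerance for finding the parent.  The m/z values in the example are invented.  Only the two intensities (1.2E6 base peak, 4E4 parent) come from the spectrum above.

```python
def check_reporter_spectrum(peaks, precursor_mz,
                            reporter_window=(126.0, 131.5),
                            mz_tol=0.02,
                            max_parent_fraction=0.05):
    """Apply the two quick checks from this post to one MS/MS spectrum.

    peaks: list of (m/z, intensity) tuples.
    reporter_window and mz_tol are assumed values, tune them for your data.
    """
    if not peaks:
        raise ValueError("spectrum has no peaks")
    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        raise ValueError("spectrum has no signal")

    # Check 1: is there any signal in the reporter ion region?
    low, high = reporter_window
    has_reporters = any(low <= mz <= high and intensity > 0
                        for mz, intensity in peaks)

    # Check 2: is the parent still there, but only at a low level?
    parent_intensity = max(
        (intensity for mz, intensity in peaks
         if abs(mz - precursor_mz) <= mz_tol),
        default=0.0,
    )
    parent_fraction = parent_intensity / base_peak

    return {
        "has_reporters": has_reporters,
        "parent_fraction": parent_fraction,
        "parent_ok": 0 < parent_fraction < max_parent_fraction,
    }


# Invented m/z values; the 1.2E6 and 4E4 intensities are the ones quoted above.
example_peaks = [(126.1277, 3.0e5), (127.1248, 2.5e5),
                 (450.25, 1.2e6), (612.34, 4.0e4)]
result = check_reporter_spectrum(example_peaks, precursor_mz=612.34)
print(f"reporters: {result['has_reporters']}, "
      f"parent at {result['parent_fraction']:.1%} of base peak, "
      f"parent ok: {result['parent_ok']}")
```

A parent fraction of zero means you probably hit the peptide too hard, and a large one means you didn't hit it hard enough.  Like the 5% rule itself, this is a crude estimate, not a substitute for processing the data.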

So this brings into play the big advantage that I perceive between the iTRAQ 8 plex and TMT10, and why every person I've seen do the comparison has switched to TMT10.  This is much easier to optimize.  The iTRAQ 8 chemistry is tricky.  It takes several passes to get your collision energy where you have reporter ions AND you have enough peptide left to sequence.  It is significantly easier to get this right with the TMT10, because the reporters come off with at least the same efficiency as the breaking of the peptide backbone.  When in doubt process the data!  I bet you'll find that spectra optimized like this will end up sequenced with good quan data at a pretty high efficiency.

TL/DR:  It's a good reporter ion MS/MS spectrum if you have reporter ions and you can still find some of your parent ion at a low level in the spectrum.

Thursday, November 21, 2013

SCAMPI-- A statistical approach to protein quantification

We need more statistics in proteomics.  We all know that.  We particularly need them in our quantification studies.  This is a little easier when we're doing label free but, of course, that comes with its own set of new challenges.

I get all sorts of excited when I see a proteomics paper that looks like a listing of fraternity houses, and this new paper from Sarah Gerster, et al., definitely fits that description.  In this study, the team describes SCAMPI, a protein quantification tool written in R, everyone's favorite statistics program.

Now, this is where this blogger stops.  I drew your attention to it.  I looked at every page.  I think anything where we start to treat proteomics like every other science and do robust statistical magic is going to move us forward.  I can't really tell you if this is a good one, but it looks nice and it's got Ruedi Aebersold's name on it, so I figure it's worth checking out.  At the very least it has a memorable name.

Wednesday, November 20, 2013

How to do intact or top-down analysis of intact proteins on an Orbitrap

It is funny that I haven't written about this before, particularly when it is such a common question for me to be asked, and even more particularly because it is so counter-intuitive.

First of all, I don't understand the physics or anything, I just have these simple concepts in my head (heck, as far as I can tell, the physics seems a little controversial anyway).

Concept 1)  Proteins hate to be trapped.
In my head, I visualize the fact that we can't achieve a perfect vacuum, so there are some gas molecules in the traps, regardless of how well we pump them down.  The longer our big ol' proteins are in the trap, the more likely it is that they'll run into one of these stray gas molecules.

Concept 2) Crap sticks to proteins, so we need to blast them a little.
I was around when some previous students of Neil Kelleher's had a lively discussion regarding the physics around this.  My brain was its normal reliable self and went to thinking about something like this:
Fortunately, for all intents and purposes, all I really need to know is:  crap sticks to proteins, so blast them a little.

Okay, so those are my concepts.  These are directly linked to how I'm going to get a bad ass intact protein MS1 spectrum:

The steps:
1) Find a nice protein standard and direct inject it.  If it is apomyoglobin, bring it up in 30% organic or higher or it won't dissolve (thanks Rosa!).  Start small, say 10-30kDa.  If doing high flow, you're going to need quite a bit.  It definitely depends on your instrument, sensitivity, etc., but for an Orbi Velos or QE, I'll probably start with something as high as 0.1ug/uL in 30-50% acetonitrile with 0.1%-0.3% formic acid.  Once I get it, I can always dilute the next injection.

2) Use the lowest resolution your instrument has.  (See, counter-intuitive, right?)

3) Fill time is not your friend, that's just more trapping time.  Keep it low, but your AGC target high (3E6 AGC, but 50 or 100ms fill time at most).

4) Microscans ARE your friend.  Rather than filling for 200ms, which is one set of proteins given a chance to react with spare gas molecules, you can do 4 microscans of 50ms, giving 4 times the number of ions 1/4 of the time to get messed up.

5) S-lens RF or tube voltage, depending on the kind of instrument, are going to be interesting things for optimization.  Mess around with them till you get the best signal.

6) Adjust the spray voltage and capillary temperatures.  In general, turn them down lower than you have been using for cal mix.  These can beat up your proteins.  A lot of times if I'm using a HESI source, I just turn off the auxiliary heater (just set it to 0, it will always show you a red mark by that temperature, but that's okay!)

7) Try adding some in-source collision energy to knock some crap off your protein.  Watch for a drop in signal due to fragmentation as you raise the energy levels.

8) Acquire a set number of MS1 scans.  I like 100.  Open the file, average the spectra and see how that looks.  Does it suck?  Increase your microscans and adjust all the things I mentioned above.  Try again till it looks nice.

9) Are you happy with your resolution?  If no, raise the resolution, repeat steps 3-8.  If yes, move on to a bigger protein, and start at number 3 again!  Try cutting your concentration and repeating.  What is your limit of detection?  Keep in mind the rough numbers, because if you move from the ESI to micro or nano-flow, you're going to have increases in sensitivity in most cases (not all!  these big proteins can be harder to solvate with nano than high flow ESI).
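The microscan arithmetic from step 4 is simple enough to write down.  This is only a bookkeeping sketch with a made-up function name.  It does not talk to any instrument software, it just splits a total fill time into equal microscans so you can see how long each packet of ions sits in the trap.

```python
def microscan_plan(total_fill_ms, n_microscans):
    """Split a total fill time evenly across microscans.

    Returns the fill time each ion packet sees, which is the number
    to keep low according to Concept 1.
    """
    if n_microscans < 1:
        raise ValueError("need at least one microscan")
    return {
        "microscans": n_microscans,
        "fill_ms_each": total_fill_ms / n_microscans,
        "total_fill_ms": total_fill_ms,
    }


# The example from step 4: one 200 ms fill versus 4 microscans of 50 ms.
for n in (1, 4):
    plan = microscan_plan(200, n)
    print(f"{plan['microscans']} microscan(s): "
          f"{plan['fill_ms_each']:.0f} ms each, {plan['total_fill_ms']} ms total")
```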

Intacts are hard to do.  Keep that in mind.  This is a process.  It is best to start with a higher concentration of a lower molecular weight protein at low resolution and work your way up to that antibody.  Once you get a nice signal, then you can start thinking about things like SIM scans for better signal and think about fragmenting these big things!

Important note:  When you buy a protein standard, it comes all full of junk.  There are salts and detergents and preservatives and often other proteins that are in there to preserve that protein.  Most standards will benefit greatly, maybe enormously, by some sort of pre-cleanup method. 

Can't get those last air bubbles out of your nano-LC system?

So you've purged and flushed air, and run your LC at high speed, but you've still got some pesky air bubbles eluting from the tip of your emitter?  Don't just get super angry, do something about it!

I just learned this trick this week after spending a couple days trying to solve exactly this situation.  I received a suggestion from a coworker that seemed a little nutty.  Fortunately, if it had involved a ritual rain dance, I probably would have tried it at that point.

DISCLAIMER:  I don't know very much about LCs at all.  I know they pump liquid of a specific volume in a certain direction at a user-controlled rate.  Do not take any advice from me on this (or quite frankly, on anything else!) without consulting your service manual, engineer or tech support.

Anyway, what I ended up doing, based on this suggestion, was to run an injection of 100% isopropanol through the system as it was.  I set the LC to an artificial "1 column setup"  (there were 2, but I didn't tell the LC).  This way all of the isopropanol was pushed through both the trap and analytical column.

And you know what?  It totally worked.  It might not work for you or for anyone else you know.  But it worked for me, and I looked less silly doing it than if I had gone with the option the guys below chose.  Honestly, they look pretty cool.  I would look far less cool doing it, but if the IPA injection fails....

Tuesday, November 19, 2013

Shortix: Cut silica correctly every time!

A group I'm working with this week has this awesome little tool.  It is perfect for people like me who can't cut fused silica cleanly and evenly every single time they try.

You push the silica into the device while pressing a little entry button that holds the diamond cutter out of the way.  You tighten this thing down so it holds the silica evenly, let go of the button, rotate the wheel and BOOM! perfectly cut silica.

Down-side?  It is $300.  You can purchase it here.

Monday, November 18, 2013

IPRG 2012 -- What did we learn?

In general, the ABRF (The Association of Biomolecular Resource Facilities) has some awesome ideas and the IPRG 2012 study is no exception.

In this study, synthetic peptides were produced that contained common modifications on their respective amino acids, including phosphorylation, acetylation, methylation, sulfation and nitration events.  The synthetic peptides were spiked into yeast tryptic digest.  The anonymous participants of the study ran these samples and attempted to search for these PTMs using a variety of LC-MS/MS and processing conditions.  While the level/number of identified spectra was a measured metric, the real focus of this study was the efficiency of identification of the modified peptides and the correct localization of those modifications.

The results are definitely interesting across the board.  One place of particular interest is a breakdown in the paper of the number of peptides, both consensus and unique that were identified by each research group.  The study showed that the clear winners were a group that used Byonic as the primary search engine.  Surprisingly, the one researcher who used Proteome Discoverer/Sequest had the lowest number of identified peptides in the study.  Having personally compared PD to every one of the search engines compared in this study on at least a few, if not numerous datasets, I have to think that this group had issues either with their instrumentation or experimental design.  Nothing short of that would explain the discrepancy.  While it would be interesting to know for sure what happened, that would negate a good bit of the anonymity of this study.

Another place where Byonic really showed power was in the identifications of the known modifications and the correct placement of them.  Interestingly, nearly all of the instruments and methodologies had trouble with one modification in particular, tyrosine sulfation.

Now, I want to throw out my cautious opinion on this study. I definitely see the value in comparing lab to lab, particularly when reproducibility is such an active criticism for our field.  It is definitely worth thinking about the small sample size and the huge array of variables that this study is taking a swing at.  Different instruments, LC gradients, packing material, ionization sources and their relative efficiencies, processing schemes, etc., etc., all contribute to these results.

Is it valuable to know where we are in terms of global abilities to accurately assess PTMs?  Absolutely, and this is certainly a valuable snapshot of where we are.  But we should be slow to make judgments based on this small sample size and intrinsic variability.

You can read the paper, In Press, here.

Saturday, November 16, 2013

Nerdy computer note of the month: DDR4 release!

PC nerd alert.  DDR4 memory is about to release.  Crucial says they'll have the first modules out next month.  Want your processing PC to access memory faster, but also use less energy?  Enter DDR4.  Twice as much memory per stick (16GB!  woooohooo!) with access speeds twice that of DDR3.  Read the marketing press release here.

59 proteoforms of ovalbumin?

I'm currently just overwhelmed in my raw appreciation of just how cool science is and of how very very little we seem to know about our world around us.  There is stuff to discover absolutely everywhere!

Case in point, this new paper out of Albert Heck's lab where they use a modified Exactive (essentially the Exactive plus EMR) to study ovalbumin in its native state.  The same ovalbumin that we use as a molecular weight marker for SDS-PAGE.  The same ovalbumin that is sitting on a fridge shelf for some reason or another in virtually every lab in the world.

And what do they see in their nifty native analysis?  59 proteoforms!  59 distinct variations of this standard protein.  Seriously?  59?!?!  Does that blow anyone else's mind a little?

Step sideways a second:  Remember the human genome project release?  When we were super excited that we had 30,000 genes sequenced or whatever after close to a decade of work? (I drank a lot of beers yesterday with a friend who told me that he can sequence a human genome with 30x coverage in 1 day, but that's a different thought for a different day).  So we had 30k genes all worked out, and that is a lot of complexity.  But even if we ignore all the variations in transcription/translation that we know about now, and just assume that 1 gene makes one transcript and that transcript makes one protein, here we see 59 variants of that protein that, for the most part, we couldn't/wouldn't find (or it would be pretty difficult to discern) unless we looked at the protein in its intact and native state.  That is a lot of complexity!  But think about the fact that we know there are possibly millions of protein variants at just the linear amino acid/modification level, and throw in the fact that these can actually result in a much larger combination of proteoforms and Wow!  Doesn't that make it seem amazing that we have come so far, and exciting how much further we have to go?!?!

Overwhelming feeling here?  We all need to do more intact and native analysis.  (In a related note, this week I'll be doing some top-down work on a QE Plus with the Protein Mode upgrade {Woooohoooo!}.  Of course, my opinions/results on that will follow!)

Check out this paper.  If only to get an idea about how many things biologically kind of make sense, but don't really, that might make sense if we took into account the fact that what we think of as 1 protein could actually be dozens of variants that we just haven't had the tools (until now!) to even see.

The paper that has inspired me to get out of bed with a ton of appreciation for the world today is called:

Analyzing Protein Micro-Heterogeneity in Chicken Ovalbumin by High-Resolution Native Mass Spectrometry Exposes Qualitatively and Semi-Quantitatively 59 Proteoforms

TL/DR:  Read this paper.

Friday, November 15, 2013

Weak statistics and lack of reproducibility

Umm...this one is disturbing.
Let's start at the title:

Weak statistical standards implicated in scientific irreproducibility

and then move to the subtitle:

One-quarter of studies that meet commonly used statistical cutoff may be false.

Ummm...already disturbing, right?  It gets worse when you start to think about the 2 most common criticisms of our field:  1) A lack of robust statistics and 2) A lack of reproducibility (you generally don't hear them in exactly this order...)

I'm actually not going to go any further.  You should check out this short editorial, though, and the 2 references.  This is a dialogue we're going to need to continue to have as a field through the future.  Yes, I'm dreading it.  Cause I don't want to be doing a lot of statistics either....

The editorial is here (and under 1 page!)

Thursday, November 14, 2013

Poo proteomics!

It is probably a little immature that I'm taking this very serious, interesting, and well published study and reducing it to the term "poo proteomics."  But sometimes, that just happens, and it's still my blog (please refer to disclaimer page)!  (I humbly issue an apology to the authors of this very nice paper if you find it offensive.  You have to admit that my title is catchier.)

The paper is actually called "Host-centric proteomics of stool: A novel strategy focused on intestinal responses to the gut microbiota," and is from a team out of Stanford.  In this very serious study, the researchers use a number of complex in vivo models of different gut flora and perform proteomics on the output.

Just a side note (and I'm totally cracking up here):  I'm picturing the staff scientist who runs this instrument and his/her face when they explain what they want to inject into his/her extremely well maintained analytical instrument....  To my good friends out there in Core lab type roles, I apologize because I've pictured a lot of your faces during this imaginary dialogue in my head.

Back to serious:  What they demonstrate:  more complex gut flora equals more complex poo proteome.  The results sound obvious, but imagine how useful an assay would be for gut infections (like the crazy deadly C. difficile variants) if you only had to take a tiny sample of stool (which a lot of hospitals acquire anyway) to classify. And come on, somebody was going to do this eventually, right?!?

mMass -- easy open source tools for mass spectra

I just happened across this one when 2 people asked me about a nice open source in silico fragmentation predictor in the same day.  Sounds like a search that will end up as a post!

I looked around, downloaded a few, and found my favorite, and it is mMass.  You can check it out at  This very nice piece of open sourceware has a ton of nice options, and is written by a guy who states that programming is his hobby.  The world needs to find more hobby programmers like this!

The program is super easy to download, install and use.  And the interface is very intuitive.  Besides fragment prediction, it is also a file converter, sharer, and viewer.  It can pick peaks, recalibrate your spectra and do some processing.  And on and on.

Definitely definitely worth a free download!

Wednesday, November 13, 2013

Uniprot update available today

Uniprot update time!  Last update of 2013.  Update here!

Open source tools for top down proteomics

Want to do some top-down data processing on the cheap?  Are you willing to write a command line here and there and jump through a data conversion hoop or two?  Then there are a couple of tools that will work for you or the bioinformatics guy who is doing your processing.

The first is MS-Deconv from the CCMS, which does simple deconvolution of MS and MS/MS spectra.  It is available for download as a command line driven algorithm, or with a simple graphical user interface.  In order to run this program, you will first have to convert your data to mzXML.  Unfortunately, unlike in some programs, it doesn't seem like you can get away with uploading mzML.  That X is essential here.  (For Thermo Elite, QE, or Fusion, first convert your RAW file to mzML with the PD full version or viewer software, then use ProteoWizard to convert mzML to mzXML, instructions here.)  For most instruments, you can convert your RAW files directly to mzXML with this tool, but I haven't tested this tool for the newer Thermo instruments in quite a while and it didn't seem to like the RAW data for these when I last did.
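If you have a pile of files to push through the mzML to mzXML step, it is worth scripting.  Here is a minimal sketch that wraps ProteoWizard's msconvert command line tool from Python.  The helper names and the file and folder names are hypothetical, and it assumes msconvert is installed and on your PATH.  The `--mzXML` and `-o` options are the standard msconvert ones for picking the output format and output folder.

```python
import shutil
import subprocess


def build_msconvert_command(mzml_path, out_dir):
    """Build the msconvert call that writes an mzXML copy of one mzML file."""
    return ["msconvert", str(mzml_path), "--mzXML", "-o", str(out_dir)]


def convert_to_mzxml(mzml_path, out_dir):
    """Run msconvert; raises if the tool is missing or the conversion fails."""
    if shutil.which("msconvert") is None:
        raise FileNotFoundError("msconvert was not found on PATH")
    subprocess.run(build_msconvert_command(mzml_path, out_dir), check=True)


# Hypothetical file and folder names, shown only to illustrate the command.
print(" ".join(build_msconvert_command("intact_run01.mzML", "mzxml_out")))
```

Loop `convert_to_mzxml` over your files and hand the resulting mzXML files to MS-Deconv.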

The next tool is MS-Align, which can directly take the output of MS-Deconv and process it for LC driven intact analysis.

I'm doing some intact analysis with a QE today, we'll see how these two tools compare to other ones out there.  Yes, there are some hoops to jump through (and you'll notice a lack of control settings in the MS-Deconv algorithm that you may want), but these could be a nice complementary resource for your top-down studies.

ItunesU free courses in mass spectrometry

Aside from the crazy randomness of the internet, my second favorite thing about it is the easy access to information about everything.  Suddenly fascinated by the fact that gel nail polish is polymerized by placing the customer's nails in a UV light and want to know the chemistry?  Easy access to that information.

In a note more related to the supposed topic of this blog, if you are new to proteomics there are tons of tools out there.  It doesn't need to be that daunting to get into this field.  As an example, yesterday I learned about this resource that is available on iTunes.  A whole slew of intro to proteomics videos produced by our friends at the Broad (like toad) Institute!  They are less than a year old (no old info here!) and broken into concise topics for easy digestion.

Tuesday, November 12, 2013

iPRG 2013 -- Next gen sequencing + proteomics!

In my humble opinion, proteomics is just on the verge of going into full-out revolution.  We keep hinting at it with the "dirty genomics" idea and a few studies here and there that have shown the potential of searching MS/MS spectra against "next-gen" sequencing data.

ABRF has jumped on this with iPRG 2013.  In this study labs are being actively recruited to look into the real potential of searching MS/MS spectra vs RNA seq data.  This is really exciting, as RNA seq technology can give you a picture of your transcriptome at an exact chronological point.

Beyond the potential of massively boosting your peptide spectral matches, the linking of these 2 technologies could give you the ability to look at the proteome to see what is there and at the RNA-seq to more easily see what is being expressed and when.

For more information, check out the ABRF website here.  Direct link to flier from the study.

Monday, November 11, 2013

Transferred subgroup FDR for rare PTMs

A new paper currently in press at MCP takes a novel approach to calculating the false discovery rates of post translational modifications.  The work, from Fu and Qian, researchers at two facilities in Beijing, tries separating out FDR for modified and unmodified peptides.

The paper is really stats heavy.  Too many Sigmas for this blogger to be really insightful here.  The gist, however, is that both search engines and FDR calculations are pretty rough on PTMs in comparison to their unmodified counterparts.   By modeling on a slew of artificial spectra this team demonstrates an increase in high-confidence peptide spectral matches of modified peptides.  The FDR calculations are actually linked so the analysis isn't separate, which also seems to improve the results.

In theory it looks really good.  I am curious as to how this works in practice.  It is one thing to model an engine on perfect in silico fragmented spectra.  When you look at the intrinsic noise, variability and inevitably missing (or low intensity) fragment ions in real spectra it can be a completely different animal.  A few images with manual validation of real spectra would have been a nice addition to this paper.  Hopefully there will be a follow-up.  But, hey, any new approach we can come up with for working out this FDR mess is a good thing in my book!
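To make the "separate FDR for modified and unmodified peptides" idea concrete, here is a toy sketch.  To be loud about it: this is NOT the transferred FDR method from the paper, which links the two calculations.  It is only the plain target-decoy estimate (decoys passing divided by targets passing) run once on everything and once per subgroup.  The field names, scores and threshold are all invented.

```python
def fdr_at_threshold(psms, threshold):
    """Plain target-decoy FDR estimate: decoys passing / targets passing."""
    passing = [p for p in psms if p["score"] >= threshold]
    targets = sum(1 for p in passing if not p["is_decoy"])
    decoys = sum(1 for p in passing if p["is_decoy"])
    return decoys / targets if targets else 0.0


def fdr_by_group(psms, threshold):
    """Same estimate, computed separately for modified and unmodified PSMs."""
    groups = {}
    for p in psms:
        key = "modified" if p["is_modified"] else "unmodified"
        groups.setdefault(key, []).append(p)
    return {key: fdr_at_threshold(group, threshold) for key, group in groups.items()}


def make_psm(score, is_decoy, is_modified):
    return {"score": score, "is_decoy": is_decoy, "is_modified": is_modified}


# Invented toy PSMs: lots of clean unmodified hits, few and noisier modified hits.
psms = (
    [make_psm(60, False, False) for _ in range(4)]
    + [make_psm(55, False, True) for _ in range(2)]
    + [make_psm(52, True, True)]
)

print("combined:", fdr_at_threshold(psms, threshold=50))
print("by group:", fdr_by_group(psms, threshold=50))
```

In this toy set the combined estimate looks much better than the modified-only one, because the unmodified hits dilute the decoy.  The other side of the coin is that a rare modification leaves you with very few decoys to count, so the per-group number gets noisy.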

Sunday, November 10, 2013

What kind of fish is it?

Counterfeiting food is a surprisingly big problem in first world countries these days.  (Problems that I'm sure seem pretty minor to people on the other side of the developing/developed boundary...)  The problem was highlighted on a segment of 'This American Life' a while back when they investigated restaurants serving hog rectums in lieu of calamari.

In a much less gross analysis, a paper in this month's MCP demonstrates an ability to distinguish between different fish by MS/MS analysis by comparing spectral libraries in bulk.  The work by Tune Wulff et al., demonstrates a method by which commercially available fish can be distinguished, even when the muscle has been heavily processed.

To be perfectly honest, this one only caught my eye because I liked the abstract illustration:

And if you've read this so far, I invite you to watch the Fish Slapping Dance!


Saturday, November 9, 2013

Is protein denaturation the key to the serum proteomics?

Serum and plasma proteomics experiments suck.  They just do.  With modern instrumentation you can take a cell pellet from just about anything and knock out an easy 2,000 plus unique proteins.  But then someone brings you some plasma or serum.  You do a BCA assay and it's the highest protein concentration you've ever seen.  So you digest it exactly the same way, pop it onto the instrument and walk away with 300 or 400 proteins and see that less than 10% of your spectra matched anything in your database.

You have options to increase this coverage.  You can deplete with a fancy column or these sweet new spin cartridges I haven't had a chance to try yet, or you can go to 2- or 3- dimensional fractionation, but you never get to the number of proteins that you can find in any cell pellet.  There are lots of explanations for this; the dynamic range is terrible, 15 or so proteins make up >90% of plasma proteins, there are glycos and lipids everywhere, and on and on.  The worst part of all of this is that we know that plasma contains a wealth of information about the physiological condition of the animal it comes from.  But we have to go to herculean efforts to see anything in there.

What if there was a factor that we weren't considering?  Vincenzo Verdoliva thinks that there is a really simple one.  In this new paper at Plos One, Verdoliva et al., examine how denaturation conditions affect our ability to see into the serum proteome.  They try nearly 70 different protocols to denature serum proteins and find that changing these conditions dramatically changes what we can find by LC-MS/MS analysis.  In an even simpler way, they find that these conditions even change how the plasma proteins look on an SDS-PAGE gel.

The researchers seem as surprised by these findings as I am.  Proteins just denature, right?  Why wouldn't they do it the same way in serum as in everything else?  They do a pretty good job of leaving this story open-ended by stating that further study into denaturing proteins is obviously needed to see what effect this is really having on our ability to see what we want to.

If you are doing serum or plasma proteomics, you should give the aptly named "Differential Denaturation of Serum Proteome Reveals a Significant Amount of Hidden Information in Complex Mixtures of Proteins," a read.  It is definitely worth thinking about.

Friday, November 8, 2013

MSAmanda, Sequest, Percolator discussion part 2

This is a continuation of a previous analysis from last week.  Part 1 is here.

Okay, here is the question.  What, if anything, is MSAmanda giving us that we aren't getting from Sequest + Percolator?  In the previous entry I think I did a good job of highlighting 2 things: 1) MSAmanda and Percolator work VERY well together and 2) We get more proteins from high resolution MS/MS spectra with MSAmanda.

I guess the question is this:  at the peptide level, how many are unique?  Are we scoring the same stuff, mostly, or is this really complementary data?  In the end, does it really matter?  More peptides is a great thing, right?  But I want to have a solid metric (on one data set...) to say, "adding this search engine can give you XX% more results," or something.  There is a lot of data out there regarding different complementary search engines used together, like this poster.  In general, however, I tend to expect an extra search engine to boost my IDs by ~10%.

First of all:  I mentioned last time that in this particular run, I was not happy with the Sequest + Percolator PSMs that made it through my filter.  I need to narrow those down to peptides that I trust.

I used the method that I mentioned last time:  I cut back my FDR cutoff at the "high" confidence level (since this is Percolator, this is based on q-value), until I got to consistently good peptides.

Here is a summary:
At q value 0.01, I had 11,600 peptides from Sequest + Percolator
At q value 0.009, I had 11,482, and they still didn't meet my threshold cutoff
At q value 0.006, I had 11,108, but I still didn't trust all of the lowest scoring peptides, and so on.
I ended up cutting it to 0.001, which left me with 10,226 peptides, ~1,400 fewer than I started with, but still a big boost over what I got from Sequest + Target decoy.  On a related note, I did this a second way, by cutting the original Xcorr factor to a minimum of 1.75.  Using the same sampling technique, I liked the peptides and came up with close to the same numbers.  Interesting, but maybe coincidental.

Here are the peptide numbers from each analysis:
Sequest + Percolator:  10,226 (trusted)
MSAmanda + target decoy:  9,262
MSAmanda + Percolator:  12,241

And here is what it looks like:

Not a bad chart, right?  By the way, I'm completely fascinated by the fact that target decoy search sometimes gives me peptides that I don't get from Percolator.  It makes me wonder if we should be doing both in order to boost our ID counts.  Remember from the last entry that my quick and lazy analysis said that the peptides from both Amanda runs seemed trustworthy.

Anyway, I guess I was looking for a hard number.  So, if we add those up, it looks like we get 12,532 unique peptides from this one run.  And if we just look at the unique ones from Amanda + Percolator, we get 2086 (1585+501).  That is a 16% boost in trustable (that isn't a word either?  WV public education...) unique peptide IDs.  It's actually a little better than that since the total I have here also has the additional peptides from the MSAmanda + TD search, but I'm not going to do that math.  It's late, and this is reasonably close.
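
For anyone who wants to do this kind of overlap counting on their own exported peptide lists, here is a minimal Python sketch.  The peptide sequences and the `compare_engines` helper are made up for illustration (they are not from this dataset, and this is not how Proteome Discoverer does it).  The only real numbers are the 2,086 and 12,532 quoted above, which work out to about 16.6% of the combined list.

```python
def compare_engines(baseline, added):
    """Count what a second search engine adds on top of a baseline engine."""
    baseline, added = set(baseline), set(added)
    new_ids = added - baseline   # peptides only the added engine found
    union = baseline | added     # everything found by either engine
    return {
        "shared": len(baseline & added),
        "new": len(new_ids),
        "union": len(union),
        "new_pct_of_union": 100.0 * len(new_ids) / len(union),
    }


# Hypothetical peptide lists, for illustration only.
sequest_peptides = ["PEPTIDEK", "LVNELTEFAK", "AEFVEVTK", "YLYEIAR"]
amanda_peptides = ["LVNELTEFAK", "AEFVEVTK", "HLVDEPQNLIK",
                   "QTALVELLK", "DAFLGSFLYEYSR"]

summary = compare_engines(sequest_peptides, amanda_peptides)
print(summary)

# The same percentage, using the counts quoted in the post.
post_pct = 100.0 * 2086 / 12532
print(f"{post_pct:.1f}% of the combined peptide list")
```

Using sets means duplicate PSMs for the same sequence collapse to one peptide, which is what you want when counting unique peptide IDs.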

Okay, so I know running MSAmanda takes extra time.  But so does adding extra time to your gradient.  This is a 2 hour run that I'm analyzing.  If we added an extra hour to it we might have boosted our peptide IDs by another 10-20% (just guessing, but I should do that analysis, I have data just like that on a hard drive somewhere).  We could also have boosted this by running this same sample on a faster instrument like a Fusion.  We could also run it with a longer gradient + DMSO on a Fusion and do this, and you know what, we'd get a ridiculous number of peptide IDs.  The point is, this is free data right there in your RAW file, you just need to take the time to pull it out.

TL/DR:  Are the peptides from MSAmanda unique?  A lot of them sure are!  When running vs Sequest in this dataset it gave us 16% new peptide IDs in exchange for a little extra processing time.

Thursday, November 7, 2013

Fame in science

Uh oh!  This one has been percolating in my head for a while.  Do I write it and risk offending a lot of really smart people?  Or should I just do some vinyasa flow and just let this negative energy drain into the cosmos?  In the end, I did both.  My IT bands feel great, and I wrote something I feel is a lot more balanced than it could have been.

Let's start here:

In 2006, Anil Potti was a shining star.  He was in a fellowship at Duke and was on a streamlined path toward a full professorship.  Between 2006 and 2010, he published a slew of papers in all of the highest impact journals.  You see, Potti was a microarray expert while genomics was still the king of the roost.  During Potti's prestigious fellowship he had figured out how to decode the extremely complex relationships between drugs and cancer cell responses.  Figured it out?  He mastered it.  He could tell you from a microarray the target of the drug and whether it would work on one particular cancer cell and not another.  He got so good at it, in fact, that patients were being treated based off of what Potti could figure out about the microarray of their particular tumor.  The program was absolutely groundbreaking and signaled that genomics was finally coming into its own and was going to turn the battle against cancer in our favor.

There was just one problem.  The data was full of fabrications.  Microarray outputs are commonly converted into simple Excel format and you process from there.  I line mine up across, say control vs. treated, and divide to get my fold changes, sort by fold change and toss all the low numbers.  A number of short communications have been written about Potti's papers, like this one.  In them you'll find all sorts of fantastic observations, such as when the results didn't match what Potti wanted, he simply cut the columns and sorted them until they showed what he wanted.  He didn't do it a little.  He did it a lot.

In 2010, the papers began to be retracted, and the clinical trials were stopped.  Keep in mind, people were actually being treated with the chemotherapy agents that these fabricated microarrays were telling physicians to use.  The biggest problem?  The blatant errors in these microarray analyses were pointed out by a team at M.D. Anderson in 2007.  The primary author of the letter to Nature Medicine was Kevin Coombes, a guy who had written a whole bunch of proteomics papers and knew a little something about reproducibility of -omics data.

And here is where, in my humble opinion, Scientific Fame came into play.  In 2007 the group at M.D.A. pointed out blatant errors in one of Potti's initial studies, and he immediately hit back.  The M.D.A. evidence was solid, but it was too late.  Potti's star was already rising, and rising so fast that one detractor couldn't slow it down.  It wasn't stopped until things had gone to the worst possible level:  actually endangering the well-being of real people in clinical trials.

And this is why I have a problem with this thing:

The Analytical Scientist (whatever that is) set up a ranking system based on a secret nomination and judging system to rank the 100 most influential scientists in analytical chemistry.  Are there some great scientists on that list?  Absolutely.  Are these people who have changed some of the fundamentals of chemistry and how we do it, possibly forever?  Yes there are.  I'm not arguing that there aren't some great people on this list.  What I'm arguing is:  what's the point?  I know the point for "The Analytical Scientist":  this thing is generating some ad revenue.  People like lists.

But here is the danger:  This is science.  We're supposed to be weighing out every idea by its merits and by the strength of the proof behind it.  If we start to judge the idea by who said it, rather than purely by the merits of the evidence, then we've missed the point.  The next revolution in chemistry may come from a student at a small school with 300 students in the mountains of Kyrgyzstan and her ideas should be treated with the exact same degree of skepticism as the ideas of every person on this list.  I'm not saying that we're not doing that, but it sure seems like if we're going to invite the possibility of that kind of bias, then this would be a good start in that direction.

End rant.

Update:  Yes, I understand the hypocrisy in the fact that this is all being written by a guy who blogs a sizeable percentage of his thoughts into the universe every day.  That's what makes it fun(ny)!  Don't trust anything I write here, I try to warn you about my biases, and certainly don't think that I wouldn't have been super psyched if my name was on that list.  Maybe they'll extend it to the top 10,000 and I'll make the cut one day and I'll never write a bad thing about whatever that magazine was called again.

Wednesday, November 6, 2013

What is a TopN Peaks filter?

One at a time, I've been going through the PD nodes, new and old, and evaluating them in exactly the way that one should.  Using my current favorite dataset, I simply run the same sample with and without the new node.  It makes for some easy entries.  On the down-side, using just one dataset may not be an accurate representation of what these nodes can do, as they may be more useful for more specialized datasets.

The TopN filter is an interesting one.  It has two settings:  1) the number of MS/MS fragments to look at and 2) the window width in which to look for these fragments.

For example, the defaults are Top 6 with 100 Da.  What this does is go through each and every MS/MS spectrum and break it into 100 Da windows.  Within each window, it determines the 6 most intense ions and eliminates everything else.  If you scanned from 400-1400, that is 10 windows of 6 peaks each, so you've reduced each MS/MS spectrum to (at most) its 60 most abundant peaks and dropped a lot of noise from your spectra.
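
To make the idea concrete, here is a minimal Python sketch of a TopN-style filter.  It is an illustration of the concept only, not the actual PD node:  the function name `top_n_filter`, the synthetic spectrum, and the choice to start windows at multiples of the window width are all my assumptions.

```python
from collections import defaultdict


def top_n_filter(peaks, n=6, window_da=100.0):
    """Keep the n most intense peaks in each m/z window of width window_da.

    peaks is a list of (mz, intensity) tuples. Windows are assumed to start
    at multiples of window_da (0-100, 100-200, ...).
    """
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // window_da)].append((mz, intensity))

    kept = []
    for bucket in windows.values():
        # Sort each window by intensity, most intense first, and keep the top n.
        bucket.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(bucket[:n])
    return sorted(kept)  # back in m/z order


# Synthetic spectrum: one peak per Da from m/z 400 to 1399, made-up intensities.
spectrum = [(400.0 + i, float((i * 37) % 101 + 1)) for i in range(1000)]
filtered = top_n_filter(spectrum, n=6, window_da=100.0)
print(len(spectrum), "peaks in,", len(filtered), "peaks out")
```

With 10 windows between 400 and 1400 and 6 peaks kept per window, the 1,000 synthetic peaks come down to 60.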

Sooooo... what does this button do!?!?
For one, it's fast.  On my current favorite dataset, a 2 hour HeLa high-high dataset, it takes about 2 minutes to run.  This is offset by the fact that the spectrum selector ends up taking less time.  My search using a Target Decoy ran 6 minutes, whether I used this filter or not.  Yes, my laptop knocks out PD searches in 6 minutes.  Let me know if you want the specs on it, it wasn't very expensive at all.

Okay, so there are no apparent consequences, time-wise, to doing it!  How are the peptides?
Well, in both the case of the target decoy and percolator searches, we ended up with slightly fewer peptides and protein groups when we use the TopN filter.  Yup, fewer.  End of entry.

Nope!  I'm joking.  Not about there being fewer peptides.  There are fewer, but remember a few entries back where I was talking about Percolator trying too hard on Sequest searches and letting some junk through?  What if there was now less of that junk?  That would be a perk, right?

And it is.  The number of peptides drops (from ~11,600 to ~11,100 in this search) but when you look at the worst scoring peptides that made it through the Percolator cutoff, they aren't nearly as bad.  The thing is that Sequest and Percolator are just digging too deep and making mis-assignments on what is essentially noise.  But if you do a good job of eliminating that noise, then we're looking at fewer false positives.

I encourage you to check out this node.  I'd love to know how it performs on a larger dataset.  I would expect it to work much better, but who knows.

Tuesday, November 5, 2013

OpenMS -- new algorithm for metabolomics

In press at MCP right now is this paper:  "Automated Label-Free Quantification of Metabolites from LC-MS Data," by Erhan Kenar et al.

Now, I know this is a proteomics blog, but I try to keep my ear to the ground in regard to this metabolomics thing that has been exploding.  And this is a nice new one.

First of all, it is built into the OpenMS platform, which has a big support network and is available on all platforms (and crazy easy to install!).  The cool part, however, is the use of a support vector machine (SVM) to rapidly and accurately identify metabolites.  An SVM is a supervised learning algorithm (think Percolator, or artificial neural networks in genomics) that makes classifications in a non-probabilistic manner.  In this case, this sophisticated algorithm is used to determine whether ions in your run are metabolites of your ions of interest.

If you are doing metabolite ID and quan, you should take a minute to look through this paper and download OpenMS 1.11.  You can find it here.

Monday, November 4, 2013

Does our target threshold for MS/MS actually matter?

This very thoughtful question was recently posed to me by someone.  And I had a nice long think about it on a plane today.
Here it is:  If we are always going for our most intense ions, Top10 or 20 or whatever, would it even matter what we put our target threshold at?  Or would we never get down into that junk?

So here is my crude bumpy-airplane-ride attempt at an answer.
1) Start with my target dataset:  A 1 ug HeLa lysate run on an Orbitrap Elite with a Top15 method in "high:high" mode with HCD (MS/MS at 15,000 resolution).

2) Filter for MS1 scans only
3) Set the bottom screen to only show the peak list
4) Go through all of the below analyses, return to Xcalibur, change the settings to "Display all" or you do a bunch of Excel work for nothing....erk....exactly why they serve alcohol on bumpy plane rides....
5) Export the peak lists.  For the sake of brevity, I exported one at each of these time points (in minutes):  10,20,30,40,50,60,70,80,90,100,110
6) Find out how many peaks are there and what the average intensities are

So, at the MS1 level in this extremely complex digest, each MS1 spectrum contained, on average, 2,413 peaks that Xcalibur could detect (+/-200 or so).  Of those peaks, the average recorded intensity was 1.2E5!

Okay, more maths:  How does this compare to our MS1/MS2 ratios:
Considering the length of the gradient, we ended up getting a full scan every 2.41 seconds in this experiment.

I'm going to need to make some big assumptions to do the rest of the math.  So bear with me.  I think it will be worth it.

Let's assume that we have a 30s peak width (should be close) and we'll assume that is uniform, so each compound is detectable in the system for 30s, and that is it.  This breaks our gradient into 240 measurable time windows, of which I sampled 11.

Now, if we assume that the average number of ions around is a good measurement, 2413 x 240 gives us 579,000 ions that the instrument was able to detect and assign an intensity.  This is a tryptic run, but I'm going to ignore the fact that a lot of these are singly charged and unlikely to be sequenceable (which isn't a real word, I guess.)

So there are 579,000 ions and we looked at the most intense 26,494.  This is roughly the top 4.6%.  Compare that to just the average intensity, and that means that at the absolute minimum, every ion we fragmented (given all of these assumptions are true) was at least 1.2E5.
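
If you would rather not trust napkin math, the arithmetic is easy to reproduce.  This is just a sketch of the calculation with the numbers from this post plugged in.  The 120 minute gradient is what the 240 windows of 30s imply, and all the variable names are mine.

```python
# Napkin math from the post, written out. All inputs are the post's numbers
# and assumptions; nothing here is measured by the code itself.
gradient_min = 120        # run length implied by 240 windows of 30 s
peak_width_s = 30         # assumed uniform chromatographic peak width
peaks_per_ms1 = 2413      # average peaks per MS1 spectrum
msms_events = 26494       # ions actually selected for MS/MS

time_windows = gradient_min * 60 // peak_width_s
detectable_ions = peaks_per_ms1 * time_windows
fraction_fragmented = msms_events / detectable_ions

print("time windows:", time_windows)
print("detectable ions:", detectable_ions)
print(f"fraction fragmented: {fraction_fragmented:.1%}")
```

Changing `peak_width_s` is a quick way to see how sensitive the conclusion is to the peak width assumption:  narrower peaks mean more windows, more estimated ions, and an even smaller fraction fragmented.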

This is on paper (partially on a napkin, to be perfectly honest), but according to the math:  no, there is no reason in a complex mixture to spend time fretting over whether you set your MS/MS triggering threshold to 5,000 or 2,000 or 500 or even 50.  If you're always going for the most intense ion, you'll never be digging into junk that low in intensity.

However, there are beginnings and ends to each gradient, and those shouldn't have peptides in them (in an ideal world they shouldn't have anything in them at all, but we all know this isn't how it works).  If your threshold is too low, then you will be triggering on noise there and increasing your file size, but that would be the only drawback.  From a statistical, FDR-type level, this would actually be good for you if the premise that "the more bad MS/MS events we have for FDR, the better it works" is true (I've written about that somewhere in one of my previous and long FDR rants).

But, wait a minute!  Didn't I just write about the importance of thresholds in dynamic exclusion settings?  Yup!  I promise I'm going somewhere with this, but I'm out of time.  More later, maybe.

TL/DR:  On paper (or on a napkin) there doesn't seem to be a good reason to worry about your minimum MS/MS intensity threshold cutoff in a TopN experiment in a complex mixture of fairly high load.

Sunday, November 3, 2013

Dynamic exclusion -- which camp are you in?

I've noticed in my traveling that there are 2 very distinct ways that people set up their dynamic exclusion settings.  I've taken to referring to them as the two camps.  Here are some really crooked illustrations with somewhat arbitrary and not-to-scale numbers on crooked bars.
Camp #1:  The two-timers

The two timers will set their dynamic exclusion so that every prospective peak has a possibility of being fragmented twice -- once at low threshold, then again near the peak apex.  This is done by using a low MS/MS triggering cutoff (think 2E3) and then doing a repeat that approximates the half-peak width.  The illustration above demonstrates the ideal circumstance:  one MS/MS event at an intensity that may or may not give you a good fragmentation spectrum, followed by a second that most certainly will.

Camp #2:  The soloists

The soloists give their instrument one shot to get a good MS/MS spectrum for searching and then move on to the next target.  In general, I see the soloists using a higher target value than the two timers.  The benefits here can be big.  If you only fragment each ion once and it fragments successfully the first try, then you can fragment double the number of ions/run as a two timer using otherwise identical conditions.  The downside occurs when that one fragmentation event isn't enough to efficiently ID your peptides of interest.
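
Here is a toy Python simulation of the two camps for a single precursor.  Everything in it is an assumption for illustration:  a Gaussian elution peak, a 2.4s cycle time, an exclusion time of about half the peak width for the two timers and a full peak width for the soloists, and made-up thresholds.  It also ignores competition from every other ion in the scan, so treat it as a cartoon of the idea and not a model of any real instrument method.

```python
import math


def simulate_msms(threshold, exclusion_s, apex_s=30.0, sigma_s=5.0,
                  apex_intensity=1e6, cycle_s=2.4, run_s=60.0):
    """Return the times (s) at which one precursor would trigger MS/MS.

    The precursor elutes as a Gaussian peak. It is fragmented whenever its
    intensity is at or above the threshold and it is not dynamically excluded.
    """
    events = []
    excluded_until = float("-inf")
    n_scans = int(run_s / cycle_s) + 1
    for i in range(n_scans):
        t = i * cycle_s
        intensity = apex_intensity * math.exp(
            -((t - apex_s) ** 2) / (2 * sigma_s ** 2))
        if t >= excluded_until and intensity >= threshold:
            events.append(round(t, 1))
            excluded_until = t + exclusion_s  # start dynamic exclusion
    return events


# Hypothetical settings for each camp.
two_timers = simulate_msms(threshold=2e3, exclusion_s=15.0)
soloists = simulate_msms(threshold=5e4, exclusion_s=30.0)

print("two timers fragment at (s):", two_timers)
print("soloists fragment at (s):", soloists)
```

In this cartoon the two timers get one early, low intensity shot and a second one near the apex, while the soloists get a single shot partway up the peak.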

Now, this is the end of this for me.  I won't tell you which camp I'm in, because I can too clearly see the benefits of both and I struggle with it every time I set up an experiment.  I just wanted to clearly identify the two groups so that we know what some people are doing and to set the backdrop for the experiments I'm currently planning to perform.