Friday, May 30, 2014

Proteome Discoverer as a multivendor platform?

It is clear that I'm biased.  I love Proteome Discoverer.  I didn't always.  When PD 1.0 came out, I was like "No thanks! I'll keep using Bioworks."  I evaluated PD 1.1, but it didn't win me over until PD 1.2.  I think 1.4 is on par with any shotgun proteomics package ever.

But, what if you don't have a Thermo mass spec?  Do you miss out on all the fun?  Not at all!

PD can support virtually any instrument.  It has had this capability almost all along, but I have never had the chance to verify it.  A friend of mine (who will remain nameless right now) recently left the Thermo-only lab he worked in for his postdoc and is now running a facility that has Orbitraps AND a bunch of Q-TOFs.

Where to start?  Well, PD can't directly accept RAW files from other vendors, so you need to use distiller or ProteoWizard to convert the data to a compatible file type.  For simplicity, we chose MGF.
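Once the conversion is done, it's worth a quick sanity check that the MGF really contains MS/MS spectra with precursor masses and charges before you hand it to PD.  Here is a minimal Python sketch, assuming the usual BEGIN IONS / END IONS layout with PEPMASS and CHARGE lines.  The function name and the example spectrum are made up for illustration, and real converter output may carry extra fields this ignores.

```python
def read_mgf(lines):
    """Yield one dict per spectrum from an iterable of MGF text lines."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is None or not line:
            continue  # skip blank lines and anything outside a spectrum block
        elif "=" in line and not line[0].isdigit():
            # header lines such as TITLE=..., PEPMASS=..., CHARGE=...
            key, value = line.split("=", 1)
            spectrum["params"][key.upper()] = value
        else:
            # peak lines: m/z, then intensity (intensity may be missing)
            fields = line.split()
            intensity = float(fields[1]) if len(fields) > 1 else 0.0
            spectrum["peaks"].append((float(fields[0]), intensity))


# Made-up example spectrum, not real data
example = """BEGIN IONS
TITLE=made-up spectrum 1
PEPMASS=487.7325 12345.0
CHARGE=2+
175.119 1000.0
276.167 2500.5
END IONS
""".splitlines()

spectra = list(read_mgf(example))
precursor_mz = float(spectra[0]["params"]["PEPMASS"].split()[0])
print(len(spectra), "spectrum, precursor m/z", precursor_mz)
```

If the count of spectra is zero or the PEPMASS/CHARGE fields are missing, fix the conversion settings before blaming the search.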

You'll also probably need to tell PD what instrument you are using in your method template.

That's what all these settings in the Spectrum Selector that we never use are for.  Check out the blue box.  TOFMS!  Just go through those settings and match them to your instrument and save the template.  Boom!  Easy, right?

You'll need to set your mass tolerances to what you normally would when running a search on that instrument.  But other than that, PD will process your data just the same as described in all the tutorial videos on the right side of this screen and you can process all of your data from all of your instruments on one (very good!) software package.  One less variable to worry about!

Thursday, May 29, 2014

Two human proteome maps! How do they compare?

We have two first drafts of the Human Proteome!  What did you expect me to do?  Let's compare what they did and what we end up getting out of it!

First of all, both these studies are awesome and big and give our field a load of credibility, but they are very different.

Instrumentation:  Both groups used Orbitraps, of course.  Pandey's lab exclusively used High/high data.  So their MS/MS spectra were high res accurate mass.  The Kuster lab used a mix of high/high and high/low data.  Due to the increased sensitivity and speed of the high/low experiments, we'd expect Kuster to end up with more MS/MS spectra, and they do -- by a long shot, but the overall quality of the data is probably a little better from the Pandey lab.  Pandey's lab generated all of the data on their own.  The Kuster study draws strongly on Orbitrap data that was previously generated.

Tissues analyzed:  
The Pandey lab evaluated 30 histologically separate protein samples
The Kuster lab evaluated:  60 tissue samples, 13 body fluids and 147 cell lines...holy cow....this was 6,380 runs.  I'm not joking.  This study redefines what we consider a HUGE proteomics study
In defense of the Pandey lab, the Telegraph reported that the entire project was pulled off for under $700,000.  That's pretty amazing, considering that they generated all of this data on their own!

Okay, so both of these studies kick ass.  They took tons of individual tissues and painstakingly detailed them via shotgun proteomics using the world's best instrumentation.  Next question?  What's in it for me?!?!?

The Pandey lab's data is available at the

The site has a simple/handy interface:

You can search by genes or by preloaded pathways, you can compare different tissues and cell lines.  No instructions necessary.

The output is even more simple:

Perhaps...disappointingly simple.  For this example protein we see that it is expressed in two tissues.  Clicking on the gene identified doesn't help much:

We see that for this protein, the study identified one single peptide.  And that it was identified only in 2 tissues.  It was not identified in any other tissues, including the human pancreas.  This doesn't mean that it wasn't there (not having it almost always means cancer, by the way....) it just wasn't detected.

Let's try something easier.  What about HPRT1 (housekeeper gene strongly expressed in virtually all human tissues)?
Okay, that's much better!  The protein is seen in every tissue here.

Let's test the same proteins on

Not as simple as the other interface, but there is a lot more that we can do here!

Searching for CDKN2A?

Wait a minute!  ProteomicsDB knows that CDKN2A has important isoforms?  We're looking at the data from a protein centric level.  Yes, it's less clean, but there is so much more data here!  This makes me really happy.  The Human Proteome Map looks at proteomics data like it's analogous with genes, which is how we've always thought about it.  ProteomicsDB looks at proteins the way Neil Kelleher and Albert Heck look at proteins, in that isoforms and variants are seriously seriously important and we need to think about them, regardless of how much we don't want to.

What about expression profiles for this protein (I'm looking at isoform #1)?

Check out how much information is here!  They must have been working on this for years!  The expression tab is just one of 8 pages of information on this protein?  Unreal!  And the increased coverage here shows that we're seeing this protein isoform in tons of tissues (as we should...I won't show it here, but we're also seeing virtually every peptide for the protein).  This is a mind boggling amount of work and data.  Unreal...

I can spend an hour looking through information on just this one protein.  I'm not joking.  Check it out.  What if I said that you could directly examine the MS/MS spectra for every peptide identified?  Would you believe me?  Check it out.  It's there.  All of it at your fingertips.  This might be the most thorough resource tool ever developed for human proteomics.

There is no way I have time to tell you everything that you can do on this page.  Not without taking the day off from my real job.  But I want to leave you with this bit of awesomeness:

Chromosome maps.  Incredibly well curated proteomics data of every human chromosome.  Expandable to just a crazy level.  The amount of information here is unbelievable.  Have we really come this far?!?

Let me sum this up.  Both these studies obviously belong in Nature.  They represent enormous undertakings that provide new information for everyone (I haven't even gotten into all the protein data that we have that genetics thought was from regions of DNA that don't make protein!!!!  Which is a primary focus of Pandey's paper!).  These are super powerful new tools that really demonstrate where proteomics is right now and where it's going.

The Pandey lab did an amazing amazing job with the resources they had to work with.
The Kuster lab just changed the scale.  This may be the most thorough and sophisticated study anyone has ever done in our field and an enormous amount of effort has gone into making all of this data available to everyone.  Unbelievable.

Update: 6/5/14:  For even more on what ProteomicsDB can do, check out part 2 here!

Wednesday, May 28, 2014

Proteomics in the mainstream news!!!!

Proteomics made the mainstream news!  Obviously not in the U.S. but in a country that has real news!!!!

Check out this article from the BBC where they cover the release of the first complete drafts of human proteomes.  One was completed by the Pandey lab at Hopkins (where I'm working this week, w00t!) and the second from the Kuester lab.

They will share the cover of Nature!

Tuesday, May 27, 2014

IMAC or TiO2 enrichment for phosphopeptides? How 'bout both in one step?

Heck, this lab is good!  Albert Heck, that is!

While all of us were sitting around trying to decide whether to do IMAC phosphopeptide enrichment first or TiO2 enrichment, or even which one we would use if we didn't have time to do both, the Heck lab was making beads that use both.

In this new paper, currently in press at MCP (and open access now) the Heck lab shows that these new reagents work really really well.  They demonstrate the monitoring of >10k phosphosites over 6 different points in a time course experiment.

Totally worth checking out.

Monday, May 26, 2014

The MicroXeno project

At long last -- I can finally talk about the huge genomics project I've been working on!

Xenografts are human tumors that are grown in mice.  They are important models for learning about the development of cancer and for testing in vivo drug efficacy and stuff.  An increasingly common application is to take a piece of tumor from someone, grow the pieces up in a number of mice and figure out which chemotherapy drug (or combination of drugs) works best on this particular tumor.  Powerful tools, right?

Well, we wondered 1) how different xenografted tumors were from liquid growing cell lines and 2) how stable xenografts are through passages; in other words, how stable are xenografts if we grow one in a mouse and then take some of that and move it to another mouse.  By passaging xenografts we can get more material while limiting the suffering of individual mice.

The easiest way to get a feel for how these cell lines differ?  A shit ton of genomics via microarray.  My job was analysis and quality control of the arrays.  Did I mention there were a lot?  Yeah....for just the first 49 cell lines completed (there were going to be 50, but we dropped HeLa on ethical grounds) there are 823 arrays, each representing about 14,000 genes.  This project is on-going.  Eventually there will be thousands of arrays, with a series for every cell line studied by the National Cancer Institute.  This paper is an introduction to the project and an overview of the first cohort.

What we've found out so far:  some cells are super stable.  Some cells differentiate like crazy and will be very different after a few passages.  PCA can be used to determine which cell lines are permeated by mouse tissue, and it may be possible to track sensitivity to some drugs across cohorts this large when the pathways of sensitivity/resistance are well understood.

You can check out this paper at BMC Genomics here.

If you are really interested, you can download the RAW array data for every sample set here.

I wonder if there is any material left over from this study to do proteomics on...

Sunday, May 25, 2014

Proteomics of liver regeneration

Like a lot of scientists, I'm pretty concerned with liver regeneration -- for a variety of reasons.  It's nice to know some people are concerned enough about it to do interesting proteomics studies to check it out.

In this new paper at JPR, Thilo Bracht et al. dive right in to this subject.  They compare the proteomics of mouse liver regeneration in "normal" mice to that in mice with a knockout in BIRC5.  BIRC5 encodes a protein called Survivin that I hope you can tell from the name is pretty useful.  They go in, chop out some of the mouse liver, let the mice grow some of it back (or try) and take the rest of the livers for proteomics studies later.

Interesting for a lot of reasons.  Partially cause the changes in STAT1/STAT2 are at the expression level rather than at the phospho level (surprise to this guy!)  Direct link to the paper is here.

Saturday, May 24, 2014

RawMeat for Fusion and Q Exactive data

I'm posting this cause I get questions on this both on this forum and through my day job.  As you can tell by flipping through these posts, RawMeat is one of my favorite little programs for evaluating RAW data.

If you have a Q Exactive or Orbitrap Fusion, you will find that some of the features won't work here.  These include the bar charts and time plots for ion injection times and anything that is calculated from your injection times.

I've contacted the author of the software and there are no plans at this time to update the software.  It is a freeware after all and, like most of us, the author has a taxing day job.  The other features of this awesome software package should, however, work just fine for these instruments.  If you see other things missing, let me know and I'll update this.

Friday, May 23, 2014

How to set up a Q Exactive for intact protein analysis.

Most common question I get in the course of my day job?  How to get an intact protein mass on the Q Exactive.  I've put some guidelines up on this blog somewhere, but I'm in full out video making mode and I thought that would be an easy one to do.  I've added the Vimeo link to the Q Exactive videos (on the right side of the blog).

I've also uploaded it to YouTube so I can embed it here (it looks a lot better on Vimeo than Youtube):  Again, not an official video from my day job, just something that y'all might find useful.

Want a ton of information about the proteins in your FASTA database?

Want an in-depth look at every entry in your FASTA database?  How about the protein MW, the pI and the % amino acid composition.  Want it in about a minute?

Then check out this cool tool from Brum et al., out of the Biological Institute of São Paulo, Brazil.  I don't know why you'd need this information, but if you did, the tool exists to get it for you!
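If you only need the basics, the MW and composition parts are simple enough to sketch yourself.  This is not the Brum et al. tool, just a rough Python illustration: the residue masses are standard average values, pI is left out because it depends on which pKa set you assume, and the function names and the toy FASTA entry are mine.

```python
from collections import Counter

# Average residue masses in Da (amino acid minus one water)
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def average_mw(sequence):
    """Average molecular weight; raises KeyError on non-standard letters (X, B, Z, U...)."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


def percent_composition(sequence):
    """Percentage of the sequence made up by each amino acid present."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


# Toy entry, not a real protein
toy_fasta = [">toy_peptide made-up entry", "GGAK"]
for header, seq in read_fasta(toy_fasta):
    print(header, round(average_mw(seq), 2), percent_composition(seq))
```

For a whole database you would loop over an open file instead of the toy list, and decide what to do with entries that contain ambiguous residues.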

Thursday, May 22, 2014

Thermo's full ASMS technical schedule is up now

The ASMS thing is rapidly approaching!  Time to find some alternative pants if your company frowns on pants that are made of denim or have lots of holes...

The full schedule for the technical program from Thermo is now up.  You can check it out here.

Wednesday, May 21, 2014

Personalized genetics?

It seems like all my geneticist/biologist friends know about this, but whenever I mention it to my proteomics friends no one is familiar...and it's surprisingly relevant to human proteomics.

23andMe is a personalized genetics service that uses topnotch technology to sequence important sections of your DNA and give that data to you.  I've thought about it for years (In 2008 Time Magazine named it the invention of the year), but I was pretty sure it would end up invalidating me for health insurance cause of some pre-existing condition I didn't know about.  Thanks to the Affordable Care Act, I don't have to worry about pre-existing conditions anymore, so I ordered a kit and had it shipped to my parents in WV cause you can't get them delivered here in MD.

This service used to provide you with disease information, but they can't anymore due to some lawsuits from doctors or something, but they provide you with your RAW sequencing data and there are plenty of genomics tools out there.  I've been digging through my big genetics file with the Interpretome software (article here, and open access!)  There is so much data.  But that isn't the real story.

The real story is the huge amount of genetic variation that we have.  In proteomics, I don't think we consider it a lot.  Seriously.  We take the peptide MS/MS data from every human sample and we compare it vs. Uniprot.  We're making the assumption that the proteins from my plasma are going to match the protein sequence of the proteins from the FASTA of the sequenced person in the UniProt file.  And for the most part, we're probably right.  Albumin is albumin.  It is pretty well conserved among all mammals.  But what about other proteins?  Just randomly selecting some data from my genomics data and a pathway that is pretty well annotated:

[Table: 28 SNPs from my 23andMe data, each tagged "Parkinson's disease", with the variable bases listed at the end of each row]

Check this out!  Every entry here is a position where people are known to differ genetically, within the pathways we currently know lead to Parkinson's disease, that is tested by this $100 kit.  Those letters at the end?  Those are the nucleotides that vary between different human beings.  Look at the first one.  At that point in the DNA you can either have an A or a G.  If that is the third letter in a codon, this is probably not a problem:

The third letter shift from A to G rarely results in a different amino acid being put into place.  Look at isoleucine, though.  ATA means isoleucine.  ATG means methionine!  If it is the first in the codon, then it commonly means that your peptide sequence has a different amino acid in it than mine does.  And if I'm searching my peptides vs your protein sequence there is nothing for that spectrum to match to and that spectrum goes into the trash.
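You can check the codon argument for yourself with a few lines of Python.  This is a small sketch using the standard genetic code (DNA alphabet); the function name is mine, and it only looks at A to G swaps like the one in the example.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order for first, second and third base; "*" is a stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def a_to_g_changes(position):
    """List codons where an A->G swap at `position` (0, 1 or 2) changes the residue."""
    changed = []
    for codon, residue in sorted(CODON_TABLE.items()):
        if codon[position] != "A":
            continue
        mutant = codon[:position] + "G" + codon[position + 1:]
        if CODON_TABLE[mutant] != residue:
            changed.append((codon, residue, mutant, CODON_TABLE[mutant]))
    return changed


third = a_to_g_changes(2)  # swap in the third letter of the codon
first = a_to_g_changes(0)  # swap in the first letter of the codon
print(len(third), "of 16 codons change at the third letter:", third)
print(len(first), "of 16 codons change at the first letter")
```

In the standard code only 2 of the 16 codons ending in A change (ATA to ATG, and the TGA stop to TGG), while all 16 codons starting with A code for something different once that A becomes a G.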

I find this 1) a little scary and 2) super interesting and 3) exciting, cause someone out there is going to solve this stuff and I can't wait to see how you do it!

I've been pretty philosophical here the last two days.  It's because I'm doing some hard technical stuff all day. Results are on the way, I promise!

Tuesday, May 20, 2014

Have we exceeded all other techniques in quantitative accuracy?

This is a really interesting article Alexis found in BioEssays.  And it got me to philosophizing.  Where are we these days as a field?  We've come pretty far, for sure!  But are we at a point where we can beat the other techniques out there?

For example:

Which do we trust more?
An LC-MS obtained quantitative value
One obtained by horseradish peroxidase stuck to some stuff from mouse blood
Densitometry of a gel band stained with silver
A hand plotted enzyme kinetics curve where the slope is calculated as the limit approaching something or other

This essay takes into account the measurements of protein counts per cell via various techniques and then compares them to the values obtained by mass spectrometers.  Is it time to readjust the values in our text books?  Even if the mass spec says that they're off by a log value compared to these other assays?

Worth a thought!  If you're interested, you can check out this article here.

Monday, May 19, 2014

Load levels vs injection times

This post comes from an excellent suggestion for a post in the comments section of something I wrote last week.  What about injection times vs. different sample loads and complexities?

For this I'm going to brazenly steal from my absolute favorite talk of 2014 so far.  Tara Schroeder of the Thermo NJ demo labs put together a talk for the iORBI tour on obtaining maximum peptide coverage with the QE.  I'm going to refer to this slide deck often.

One of the many excellent bits of info is a general starting point for target values and injection times:

Again, this is a starting point.  Chromatography conditions will vary a lot, as will the true definition of "simple" and "complex" mixtures depending on what you think you are looking at (also compared to what you actually are looking at, right?  Sometimes they don't exactly line up!)

For something complex and high load (segue, isn't it awesome that we are at a point in time where we are considering over 100 nanograms high load?!?!) we aren't as concerned with hitting our target values as we are about getting as many MS/MS events as possible.

When we drop into the low nanogram range, we are truly concerned that 50ms is not going to be enough time to hit that magic level for each individual peptide that will give us high scores.  We sacrifice the number of MS/MS events that we can get in order to increase our chances of getting good ones.

Now, for simple stuff, we simply treat it like low load.  By "simple" what we really mean is: that we can easily obtain a single MS/MS event for at least a few peptides for every protein that is present.  I think that is a fair starting point.  The stress here isn't in getting enough MS/MS events.  The real concern is converting every possible MS/MS event into a peptide ID.  Again, we sacrifice the number of possible MS/MS events a little in return for giving us twice the possible signal to convert these peptides into high quality MS/MS spectra that the search engine would love.

I want to introduce one other variable here:  Dynamic exclusion (previously discussed here.)

For the complex stuff at high load:  Almost always, we want to use the soloist approach.  1 MS/MS event and then put the ion onto the exclusion list.

For the complex stuff at low load:  This is a toss up.  The soloist approach will give you more MS/MS events but at lower efficiency than the two timer approach.  Tough to say which will be more effective for a given experiment without more information

For the simple stuff:  Two timer!  If you've got plenty of cycle time to fragment peptides from every protein present, give yourself a chance to get each peptide at least twice.  Yes, the number of unique MS/MS targets decreases (by half), but by claiming your sample is "simple" you've already said that isn't a concern.  For single protein pull-downs, I'd allow as many MS/MS events for each peptide as possible.  I worked with a group recently where our #1 goal was maximum sequence coverage of a single small protein and its PSMs.  Best coverage occurred when we allowed 4 fragmentations of each ion before dynamic exclusion kicked in (this gave us a pesky phosphopeptide that just wouldn't ionize well!)
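If the soloist / two timer idea is fuzzy, here is a toy Python model of the bookkeeping.  It is only a sketch of the concept, not how the instrument firmware does it: the class name, the 10 ppm matching window and the 30 second exclusion are assumptions for illustration, and real methods also have a repeat duration window that I'm ignoring.

```python
class DynamicExclusion:
    """Toy exclusion list: allow `repeat_count` MS/MS events per ion, then exclude it."""

    def __init__(self, repeat_count, exclusion_s, tol_ppm=10.0):
        self.repeat_count = repeat_count  # 1 = soloist, 2 = two timer
        self.exclusion_s = exclusion_s    # how long an ion stays on the list
        self.tol_ppm = tol_ppm            # m/z matching window (assumed value)
        self._entries = []                # each entry: [mz, times_fragmented, excluded_until]

    def _find(self, mz):
        for entry in self._entries:
            if abs(entry[0] - mz) / entry[0] * 1e6 <= self.tol_ppm:
                return entry
        return None

    def should_fragment(self, mz, time_s):
        """Return True if this precursor may be picked for MS/MS at time_s."""
        entry = self._find(mz)
        if entry is None:
            entry = [mz, 0, None]
            self._entries.append(entry)
        if entry[2] is not None:
            if time_s < entry[2]:
                return False              # still on the exclusion list
            entry[1], entry[2] = 0, None  # exclusion expired, start counting again
        entry[1] += 1
        if entry[1] >= self.repeat_count:
            entry[2] = time_s + self.exclusion_s
        return True


soloist = DynamicExclusion(repeat_count=1, exclusion_s=30.0)
two_timer = DynamicExclusion(repeat_count=2, exclusion_s=30.0)
print([soloist.should_fragment(500.0, t) for t in (0.0, 5.0, 31.0)])
print([two_timer.should_fragment(500.0, t) for t in (0.0, 1.0, 2.0)])
```

Setting repeat_count to 4 gives the "4 fragmentations before exclusion" behavior from the pull-down example.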

If we go lower, we will need to increase these fill times.  But this gives us a crude starting point.

Not sure about your sample complexity or want to double check your run to see if you set it up right?
RAW Meat time!

This is the analysis of the MS2 fill times from an IP run my friend Patricia and I did.  The maximum injection time for each MS/MS event was 100ms.  We hit the maximum almost every single time.  This suggests that we are loading so little sample that we need significantly more than 100ms of fill time.  After a pull down it is currently just about impossible to determine your peptide load.  We often don't have nearly enough material for a standard protein measurement assay (hopefully someone will come up with something more sensitive soon...)  If this were a monoclonal pull down I'd say, crank up that fill time and try running it again!

What do we want to see?

This.  This is a low load complex run from my friend Rosa.  She used a maximum fill time ~150ms for this run and it was perfectly appropriate for this sample.  The first bar represents fill times of <50ms, the second represents ~50ms, the third ~100ms and the 4th bar is maxing out.  The vast majority of peptides hit target value in less than maximum -- in fact, less than half of max fill time.  But there were a large number of MS/MS events that required at least half the max and about 1/8 of the peptides needed the full 150ms.
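If you can get the MS2 injection times out of your run as a plain list of numbers, the same check takes a few lines of Python.  This is a sketch with made-up fill times, not Rosa's data, and the function name and the three bins are my own choices rather than anything RawMeat reports.

```python
def summarize_fill_times(fill_times_ms, max_fill_ms):
    """Fraction of MS/MS scans under half the max fill time, in between, and at max."""
    n = len(fill_times_ms)
    if n == 0:
        raise ValueError("no MS/MS scans supplied")
    at_max = sum(1 for t in fill_times_ms if t >= max_fill_ms)
    under_half = sum(1 for t in fill_times_ms if t < max_fill_ms / 2)
    return {
        "under_half": under_half / n,
        "half_to_max": (n - under_half - at_max) / n,
        "at_max": at_max / n,
    }


# Made-up fill times (ms) for a run with a 150 ms maximum
toy_times = [20, 30, 40, 45, 60, 80, 120, 150]
summary = summarize_fill_times(toy_times, max_fill_ms=150)
print(summary)
```

If "at_max" is close to 1.0, you are in the situation from the IP run above and the fill time (or the load) needs to go up.  If nearly everything lands in "under_half", you can probably afford a shorter maximum.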

I hope this is clear.  Thanks to Kristian for the questions and Tara, Patricia, and Rosa for the data to let me put this together this weekend.

Saturday, May 17, 2014

Nth order double play

I bet a lot of y'all know this trick.  But I don't think everyone does.

I didn't know it until I joined my current employer.  All of my LTQ and LTQ Orbitrap methods looked like this:

Where I'd have 21 scan events for my "top 20" experiment.  On sample #4 I'd realize that I screwed up scan event number 12 and I was really doing MS3 on the ion from scan event 11 or something else that would seem really stupid later.  The worst was if I wanted to change a single parameter!  Then I have to go through every one of these stupid things and edit them individually.

The trick?  Nth order doubleplay:

If I build my method like this I get just 2 scan events:

And whatever settings I put in for this dependent scan can be carried over for every other dependent scan that I do.  No more:  take the 6th most intense from scan number 1 bologna.

BTW, a new colleague of mine, Donna Earley, suggested that I try to do a Ben's Application tip of the week.  If I can come up with more than this one, I'm going to count this as number 1, lol!

Since the work week is over and this is a very work-heavy post, I present this crowd surfing pug!

Friday, May 16, 2014

Cool proteomics conference in Vienna!

Wow.  This looks amazing.  A two day proteomics summit in Austria that is loaded with tons of great speakers and workshop hosts?

Check out the full page here.

Highlights?  For me, anyway!

NanoLC practical workshops
Post translational modifications
de novo sequencing strategies
Peptide MRM optimizations
Orbitrap method optimization and on and on.

Best of all?  Conference registration is free and includes nightly beer tastings!

Thursday, May 15, 2014

96 FASP!

Do you love FASP but desperately wish it was faster to prep a ton of samples that way?  Well, Yanbao Yu et al. have a solution for you:  FASP in a 96 well plate!

Benefits?  FAST (per sample), reproducible (check out that correlation factor!), and since there are lots of robots, workflows, and special pipettemen, etc., for automatic processing of 96 well plates it is very friendly for automation.

You can find this paper in this month's ACS here.

Wednesday, May 14, 2014

What is the maximum theoretical coverage of a protein?

Recently, I worked with a couple of labs that use single protein digests and % coverage as a QC metric.  Lots of people do this.  This isn't my favorite QC, but as long as people are benchmarking their instruments with some sort of constant standard, I'm sure not going to stand in the way.  A question occurred to me when I saw very high % of peptide coverage:  how much can we actually see with a single enzyme digest and mass spectrometry?


Take this coverage map for example.  This is the Mascot coverage output for one of these QC proteins.  Mascot says 79% coverage (what was found is in red).

Something that I've started to be very concerned about, due to the amount of intact and top-down analysis I've been doing, is the signal and propeptide sequences.  This protein is BSA, but the first 24 amino acids are not actually part of the true BSA sequence.  They are part of the translational process and are cleaved off before you get mature BSA, so I don't think they should count.

Let's look at what is left:  If we assume 100% cleavage, we have:


What are our requirements for settings for our instruments?  I, for one, almost never look at ions with a mass to charge of <400.  I also ignore anything with less than 2 charges, because they don't sequence in most cases.  Ignoring the fact that not all amino acids can/will accept protons, if I only use the requirement that my peptide has a mass >800 Da, only DLGEEHFK makes the cut.  It also has two basic amino acids, so it should charge to at least +2.  If it charges to +3 or above, this would explain why we didn't see it, as it won't meet our >400 m/z cutoff as a +3.
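The m/z reasoning for DLGEEHFK is easy to check in Python.  This is a small sketch using standard monoisotopic residue masses and no modifications; the function names are mine.

```python
# Monoisotopic residue masses in Da (amino acid minus one water)
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def mono_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO_RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (mono_mass(peptide) + charge * PROTON) / charge


for z in (1, 2, 3):
    print("DLGEEHFK", f"+{z}", round(mz("DLGEEHFK", z), 4))
```

DLGEEHFK comes out at roughly 973.45 Da, which puts the +2 ion near 487.7 m/z (inside a >400 m/z scan range) and the +3 ion near 325.5 m/z (outside it).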

So, if we actually consider our coverage of what is possible?  If we start with the FASTA BSA sequence of 608 a.a. and subtract our non-expressed region (24 a.a.) then we get 584 amino acids in the fully expressed protein.  There are 109 amino acids in the peptides I just deemed too short for my mass spec analysis.  584-109 = 475.  Let's assume that DLGEEHFK will charge +2, so it counts as one that we can see but didn't, so (475-8)/475 = 98% achievable coverage of BSA in this example.

Real achievable coverage (RAC? is that in use?) is 475/608 = 78% of the FASTA sequence coverage.  I wonder if that is anywhere near consistent in natural proteins?
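To try the same bookkeeping on another protein, here is a rough Python sketch.  The assumptions are all mine: trypsin cleaves after K or R except before P, there are no missed cleavages, and "observable" is approximated by a length cutoff of 7 residues as a crude stand-in for the >800 Da rule (roughly 110 Da per residue).  The toy sequence is made up; the last few lines just redo the BSA arithmetic from the numbers given above.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (a common simplification)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def achievable_fraction(peptides, is_observable):
    """Fraction of residues that sit in peptides passing the `is_observable` test."""
    total = sum(len(p) for p in peptides)
    visible = sum(len(p) for p in peptides if is_observable(p))
    return visible / total


# Toy sequence, not BSA
toy = "AKDLGEEHFKRPLK"
toy_peptides = tryptic_digest(toy)
toy_fraction = achievable_fraction(toy_peptides, lambda p: len(p) >= 7)
print(toy_peptides, round(toy_fraction, 2))

# The BSA numbers from the post
fasta_length = 608       # residues in the FASTA entry
signal_and_pro = 24      # cleaved before the mature protein
too_short = 109          # residues in peptides deemed too small to see
observable = fasta_length - signal_and_pro - too_short
print(observable, round(100 * observable / fasta_length), "% of the FASTA sequence")
```

Swap the length cutoff for a real mass and m/z filter if you want the answer to track your actual instrument method.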

Tuesday, May 13, 2014

PTMs for dummies?

UCSF has brought us tons of great mass spec resources over the years.  The first that pops out in my head is the great protein prospector.

What if you are new to proteomics or are considering moving your biological problem over to allow proteomics to take a look?  UCSF hosts a great presentation by Dr. Chris Walsh that provides a clear and thorough overview into PTMs and their analysis by mass spec.

You can find the original presentation here.

Monday, May 12, 2014

What factors does Percolator consider when used with SequestHT?

I received this cool list today from a colleague.  This is the list of features from SequestHT that Percolator uses in its rescoring algorithm.

Delta Cn2
 Binomial score
 % Isolation Interference
 Observed Mass
 Delta Mass [Da]
 Delta Mass [ppm]
 Absolute Delta Mass [Da]
 Absolute Delta Mass [ppm]
 Peptide Length
 Is z=1
 Is z=2
 Is z=3
 Is z=4
 Is z=5
 Is z>5
 # Missed Cleavages
 Log Peptides Matched
 Log TIC
 Fraction Matched Intensity
 Fragment Coverage Series ABC
 Fragment Coverage Series XYZ
 Log Matched Fragment Series Intensities ABC
 Log Matched Fragment Series Intensities XYZ
 Longest Sequence Series ABC
 Longest Sequence Series XYZ
 IQR Fragment Delta Mass [Da]
 IQR Fragment Delta Mass [ppm]
 Mean Fragment Delta Mass [Da]
 Mean Fragment Delta Mass [ppm]
 Mean Absolute Fragment Delta Mass [Da]
 Mean Absolute Fragment Delta Mass [ppm]


Thanks, Kai, for generating this list!
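To make a few of the features in the list above concrete, here is a rough Python sketch of how some of the simpler ones could be computed for a single PSM.  This is my illustration only, not the code Proteome Discoverer or Percolator actually runs: the function name and arguments are made up, the missed cleavage count assumes tryptic rules, and the example masses are invented.

```python
def simple_psm_features(observed_mass, theoretical_mass, charge, peptide):
    """Compute a handful of the listed features for one peptide-spectrum match."""
    delta_da = observed_mass - theoretical_mass
    delta_ppm = delta_da / theoretical_mass * 1e6
    features = {
        "Observed Mass": observed_mass,
        "Delta Mass [Da]": delta_da,
        "Delta Mass [ppm]": delta_ppm,
        "Absolute Delta Mass [Da]": abs(delta_da),
        "Absolute Delta Mass [ppm]": abs(delta_ppm),
        "Peptide Length": len(peptide),
        # internal K/R not followed by P, assuming tryptic cleavage rules
        "# Missed Cleavages": sum(
            1 for i, aa in enumerate(peptide[:-1])
            if aa in "KR" and peptide[i + 1] != "P"
        ),
    }
    # charge state is encoded as a set of yes/no flags rather than one number
    for z in range(1, 6):
        features[f"Is z={z}"] = int(charge == z)
    features["Is z>5"] = int(charge > 5)
    return features


# Invented masses for illustration
example = simple_psm_features(973.4485, 973.4505, charge=2, peptide="DLGEEHFK")
for name, value in example.items():
    print(name, value)
```

The signed and absolute mass errors both show up in the list, as do separate flags for each charge state, which is why the charge is split into "Is z=" columns here.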