Friday, April 4, 2014

Confetti. Everything you ever wanted to know about the HeLa proteome

As a field, we have a tendency to ignore what is going on in other fields.  I understand, for real!  This field is awesome.  New technology, new software, new techniques.  It's almost impossible to keep up with what is happening even in your favorite subsection of proteomics, on your favorite PTM.  However, there are times when we should probably lift our heads up and look around.  Maybe that time is when people keep getting sued for studying and releasing data on a particular cell line.

So... it's with a little hesitation that I write about this awesome compilation of data.  The work is top notch.  I wish it were on K562 or MCF-7 or one of the other amazing and well-characterized cell lines out there.  But it isn't.  It's on HeLa, and despite my reservations, Confetti is a great contribution to our field and deserves to be acknowledged as such.

Off the soap box!  This paper is currently open access at MCP, but won't be for long.  So grab it here!

What they did:  They extracted protein and digested it (with FASP!) using different combinations of 7 different enzymes.  They used a variety of run techniques, including unfractionated runs and SAX fractionation, to get huge coverage.  How huge?

8,539 proteins.
44.7% coverage over the sequences of those 8,539 proteins!  Seriously.  Q Exactive power!

Big deal?  People have gotten numbers like this before.  But with UniProt human as the only database?  Very, very few people have ever obtained coverage this deep against a manually annotated genome.  Very few.  Still not impressed?

Well, what if this group made a simple web application so that you could directly build and test SRMs off of this data set?  Guess what.  They did.

I'd put up a screenshot, but it is currently down.  The paper hasn't officially been released yet, so the application may be going through some growing pains still.  But the figures in the paper make it seem easy to use and powerful.  Hopefully it will be up soon.  Regardless, check out this paper.  This is an idea of what we can get with proteomics when we really need to go deep and get awesome coverage.

When the application is back up, you can check it out here!

Update 4/15/14:  The Confetti application had a minor web address issue, now resolved (see comments below).  The resource is up and looks great!


  1. I'm not impressed by the number, but by the method the authors used for peptide identification.
    A mass tolerance of 0.1 Da for a Q Exactive? Man, that equals 500 ppm at 200 m/z. That is bloody WRONG! Any MS/MS peak matched at more than 20 ppm error is just a wrong match.

    1. Hi,
      Good catch on the methods! I missed that detail (and good math!). Now, there is a school of thought in FDR calculations that decoy searches work better if we deliberately feed our FDR calculator bad data. I occasionally work with a bioinformatician in Seattle who routinely does this. He will go back through old MS/MS data and search it with large tolerances. The first time he did this to data from my Orbi Velos I was positively horrified, but he was able to pull out a number of false negatives (good peptide matches) that had been excluded by my tighter tolerances. I'm not convinced enough to do this in my own searches, but people out there do use these big windows. In the end, it would be pretty easy to throw out any extra matches that came in more than 10 ppm out, but the HOPE is that the improved FDR calculation wouldn't let any of those through anyway. Just a thought. Since one of the authors of the paper has commented here, I would love to know their thoughts on using a tolerance window that big and what the final ppm limits at the MS1/MS2 levels are in the reported data.

    2. See below for my full explanation, but also note that most fragment ions aren't at 200 m/z. 0.1 Da = 50 ppm at 2000 m/z, and yes, we do have very large peptides giving fragments at ~2000 m/z. If a search engine can't take ppm fragment tolerances, then setting the tolerance too tight on a Da scale means it might not match those very high m/z fragments, where mass error tends to be highest (on our QE, at least).

  2. Hi Ben,

    I'm one of the authors on the paper. The server was working, but it required a trailing slash at the end of the address, i.e.

    I've fixed this now so it works without the trailing slash.

    1. Hi!
      Just in case you didn't see it in the conversation above, would you be willing to comment on the logic that went into the big mass tolerances you used in your search windows? No pressure, it would just be good to know!

  3. Hi, I also had a query by email on the mass accuracy issue - maybe from the same person who commented above.

    Like Ben's acquaintance in Seattle I'm sometimes concerned about the validity of applying very tight mass tolerances and then using the empirical target/decoy method for FDR estimation. However in this case it's not the driving force.

    At MS1 level we generally use PeptideProphet's accurate mass model. This fits mass-accuracy distributions for positive/negative hits. Erroneous hits have a broader mass-error distribution, so wide mass tolerances can be penalised. Search tolerance has to be wide enough for the difference in distributions to be seen.

    At MS2 level the 0.1 Da setting is due to our use of various search engines, and a need to be conservative in this work. We absolutely could not afford to repeat all of the no-enzyme searches if there was any issue, so we stuck to settings that might not be the best, but were robust. In prior experimentation I had seen issues with some search engines when using very tight MS/MS tolerances. Some of the search engines in our pipeline don't support ppm fragment tolerances, so we can't specify in ppm. COMET, until relatively recently (June 2013), required prohibitively large amounts of RAM at small fragment_bin_tol settings. Since we were combining results from multiple search engines, we used a single tolerance that we were certain worked with all of them. We stuck with these settings throughout for consistency, even when improvements to our or other software meant that a tighter tolerance might have given an improvement.

    Remember that this is a project that involved a very large number of long MS runs, from a huge number of digests. It took a long time to acquire and search the data - some searches were performed back in October 2012. The RAW data is in ProteomeXchange, so anyone is welcome to try and do better than us, or use it for other purposes. I don't doubt that it's possible to beat our numbers. If I did it from the beginning I would probably change some things, but I'm 2 years wiser now.

    Raw data is at:

    It's 700 GB of RAW files, and a lot of it needs a non-specific cleavage search, so it may take you a while to get through ;-)


    Dave Trudgian
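For anyone wanting to sanity-check the arithmetic in the thread above, the Da vs. ppm tolerance conversion is simple. A minimal sketch (the function names here are my own for illustration, not from any proteomics library):

```python
def da_to_ppm(tol_da, mz):
    """Express an absolute tolerance (Da) as ppm at a given m/z."""
    return tol_da / mz * 1e6

def ppm_to_da(tol_ppm, mz):
    """Express a relative tolerance (ppm) as Da at a given m/z."""
    return tol_ppm * mz / 1e6

# A fixed 0.1 Da window is very loose, relatively speaking, for small
# fragments but much tighter for large ones:
print(da_to_ppm(0.1, 200.0))   # 500 ppm at m/z 200
print(da_to_ppm(0.1, 2000.0))  # 50 ppm at m/z 2000
print(ppm_to_da(20.0, 2000.0)) # a 20 ppm window at m/z 2000 is only 0.04 Da
```

This is why a single Da-scale fragment tolerance is a compromise: tight enough to be meaningful at high m/z, it becomes enormous in ppm terms at low m/z, and vice versa.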