News in Proteomics Research: June 2013

Monday, June 10, 2013

Thermo Reveals Tribrid Mass Spectrometer at ASMS!

I'm not at ASMS, but I'm following closely on Twitter, and talk about a big show!
Thermo has already revealed its new Orbitrap Fusion system, which they are calling a Tribrid system because it has a quadrupole, an ion trap and the fastest and most powerful Orbitrap ever. Who wants to run at 450,000 resolution or 20 scans per second from the same little box? What about an ETD system on the front that never has to be cleaned or tinkered with? If you answered "everybody" you're probably right. The picture above was taken by Julian Saba this morning at ASMS.

To find the official release information and a video of the Fusion in action go here.

If you want to follow the action, I suggest you check out the feeds from @ScientistSaba and @kWheelz.

Sunday, June 9, 2013

FASTX Toolkit -- Convert your next gen sequencing data to something usable

Recent studies have shown that having a cell line specific database can significantly boost your proteomics matches. Really, it makes sense. The human sequence we primarily use is derived from the genome sequence of J. Craig Venter. My proteins might be a little different. This opens the door for next generation DNA sequencing to finally contribute something useful to science! (I don't really mean that, I only said it because it is funny...)

Many labs are now doing sequencing on their cells of interest followed by proteomics searches versus this new sequencing. If you have tried to use this data, I'm sure you've noticed that the output isn't exactly the quality level of what we're used to from Uniprot manually curated data.

Never fear, though: the FASTX toolkit is a set of tools that can clean up this data and make it a whole lot more presentable to our favorite peptide search algorithms. You can find out more about FASTX here.

Saturday, June 8, 2013

Bioconductor -- Open Source tools for Bioinformatics

I'm starting to wonder if I'm even noticing 10% of the software that is out there. This field is blowing up like crazy!
My friend Patricia just tipped me off to this one. A complete open source software package based on R that has tons of neat tools in it. The sobering thing is that this project has been ongoing since 2001. Yup, 2001. Brand new to me. Patricia's lab is using it to analyze iTRAQ data. I'm going to just assume that this is a new feature, since the package was originally designed for genomics applications.

Friday, June 7, 2013

Identification of human proteins that may be essential for HIV infection

This impressive new study comes from David Graham's lab at Johns Hopkins. After HIV infects a cell it releases new particles to infect other cells. Like many diseases, HIV appears to evade the immune system by hiding itself in host proteins.

The Graham lab captured post-infection HIV particles and identified 25 human proteins that appear to be essential to the HIV infection. These 25 were targeted because they were conserved in very different HIV infection models.

You can read the Hopkins press release here (along with a nice video by Dr. Graham). The original paper in JPR is available here.

Thursday, June 6, 2013

28 tissues in the SILAC mouse

The Heavy SILAC mouse (not the one above!) put to good use!!

In this month's issue of MCP, we have this new and pretty incredible study from the Mann lab. In it they do a tissue-by-tissue study using the SILAC mouse. 28 tissues are dissected out of a light and a heavy labeled mouse and compared.

False Discovery Rate Calculations Part 3: Do we gain anything by running two FDR algorithms in tandem?

You can read part 2 of this monologue here.

In part 1 I rambled a little about FDR. In part 2 I demonstrated what happens when you run the same sample against a regular database and against the same database with a reversed decoy concatenated to it.

Here in part 3, I want to highlight an issue with FDR with a very extreme example. What happens when you use more than 1 level of FDR at the peptide level? I mention this because there is some nice post-processing software out there. Scaffold is the one I run into the most, often in conjunction with Proteome Discoverer. People run their data through PD, get files and then import the MSF into Scaffold. In Scaffold, you have the option of running FDR.

Should you use it?

If you have used a target decoy database in PD, the answer is almost certainly NO.

To illustrate this point, I'm going to combine the experiment from part 2 with the Target Decoy PSM validator node in PD 1.4.

Experiment (same sample from yesterday, same parameters, etc.)

Run1: Normal database: 2958 protein groups, 10,483 peptides
Run2: Concatenated: 0 proteins, 0 peptides

Again, this is an extreme example but it highlights the main point here. The job of the FDR calculator is to find bad peptides and throw them out. In most cases, it will find bad peptides even when there are no bad peptides to see.

By using the same FDR method twice (reverse target decoy) we've eliminated ALL peptides. Something very similar will occur if you use two similar FDRs, though it will be less extreme.

Keep in mind that I'm not saying that you can't use the FDR in Scaffold. Ultimately, I've heard very good things about this algorithm. If you are going to use it, however, do not use an FDR in Proteome Discoverer. Instead, use the Fixed value PSM validator node and import that resulting file into Scaffold.

Wednesday, June 5, 2013

New paper from the Hess lab -- optimize your Orbitrap

Is your Orbitrap fully tuned up? Are you using all the right features? This new paper from the Hess lab takes a swing at answering these questions by testing them all out and showing how they change or improve the number of peptide IDs.

False discovery rate calculations part 2 -- real data

To read part 1 of this monologue on False Discovery rates....

Part 2: Real data

Here is the setup: a cell line digest was separated on a 140 minute gradient on an Orbitrap Elite operating in high-low mode. I believe it was a Top20 method with a dynamic exclusion of 1 (I ran this back in November).

For databases I used Uniprot/Swissprot parsed on the term "sapiens". This is what I'll refer to as the "normal" sample. I then used COMPASS to make a reverse of this database and append it to the end of the normal one, which I'll refer to as the "concatenated" database.
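If you want to see what a "concatenated" database looks like under the hood, here is a minimal Python sketch. To be clear, this is not how COMPASS does it; it is just the general idea of reversing each sequence and appending the reversed copies to the end of the normal FASTA. The `REV_` prefix, the function names and the toy protein entry are all my own assumptions for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def concatenated_database(entries, decoy_prefix="REV_"):
    """Return the target entries followed by reversed (decoy) copies.

    The decoy_prefix is a hypothetical tag so decoys can be recognized later.
    """
    decoys = [(decoy_prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys


def to_fasta(entries, width=60):
    """Write (header, sequence) tuples back out as FASTA text."""
    lines = []
    for header, seq in entries:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy, made-up entry (not a real protein record).
example = ">toy_protein_1 made-up entry\nMKTAYIAK\nQRQISFVK\n"
targets = read_fasta(example)
combined = concatenated_database(targets)
print(to_fasta(combined))
```

The search engine never knows which half is real, so anything it matches in the `REV_` half is, by construction, a random hit.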

The sample was run twice on PD with default parameters; carbamidomethylation of cysteine was the only modification (as always, write me if you want more details). The Fixed value PSM node was used. So no FDR, just the default XCorr cutoffs. The only difference was the database employed, normal or concatenated.

Normal run: 4051 protein groups, 17735 peptides
Concatenated: 4636 protein groups, 19618 peptides

Now, assuming all things are equal, 585 new protein groups (14.4% of the normal total) were added. Meaning that there is a possibility that 14.4% of protein IDs occurred here, not because they are true, but due to random chance. There are, of course, other explanations like homologous contaminating peptides and so on, but I'm going to ignore them here.

Really, we should be looking at the peptide level and not the protein one. There, the concatenated run picked up 1883 extra, presumably random, peptides (10.6% of the normal total).
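For anyone who wants to check my arithmetic, the subtraction and percentages above can be sketched in a few lines of Python. The counts come straight from the two runs; the function and variable names are just ones I made up for this example.

```python
# Counts reported for the two searches above.
normal = {"protein_groups": 4051, "peptides": 17735}
concatenated = {"protein_groups": 4636, "peptides": 19618}


def extra_ids(normal_count, concatenated_count):
    """Return (extra IDs, extra IDs as a percentage of the normal count)."""
    extra = concatenated_count - normal_count
    percent = 100.0 * extra / normal_count
    return extra, percent


for level in ("protein_groups", "peptides"):
    extra, percent = extra_ids(normal[level], concatenated[level])
    print(f"{level}: {extra} extra IDs ({percent:.1f}% of the normal total)")
```

Keep in mind this treats every extra ID as random, which is the same simplification I'm making in the text.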

Alright! Now we should just be able to cut out the 10.6% lowest scoring peptides, right? As I'll keep reiterating, it is trickier than that. Look at this overlap at the protein level:

Uhhhh...so there were proteins ID'ed in the normal sample that were not identified in the concatenated? This means that there were actually some peptides that matched the decoy database BETTER than they matched the real one. Of course, I should have done this at the peptide level, but the point carries through. Even if we chop off 10.6% of the lowest scoring peptide IDs, we still don't know that we've got them all. This is because the random matches may not actually be low scoring peptides at all!

This is why we have to take a step away from doing this arithmetically and go to statistics. The simplest and best-known example is controlling the false discovery rate with the Benjamini-Hochberg equation:

This equation is out of the scope of the blog, but it ranks the p-value of each peptide (or PSM, or anything else) and compares it against a cutoff set by a specified level (alpha), and is solved for the highest rank (k) that still passes. This is only one of many variations on the same theme. Ultimately, the goal is to use the results of the target decoy to establish a statistical frame for the true likelihood of a peptide match.
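If you are curious anyway, here is a minimal Python sketch of the textbook Benjamini-Hochberg step-up rule: sort the p-values, find the largest rank k where the k-th smallest p-value is at or below (k / m) * alpha, and accept everything up to that rank. I'm not claiming this is what any particular search engine or PD node does internally, and the p-values below are made up purely for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the item is accepted as a discovery.

    Textbook step-up procedure: find the largest rank k such that the k-th
    smallest p-value is <= (k / m) * alpha, then accept the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    accepted = [False] * m
    for idx in order[:largest_k]:
        accepted[idx] = True
    return accepted


# Made-up p-values for ten hypothetical peptide matches (unsorted on purpose).
p_values = [0.041, 0.001, 0.212, 0.008, 0.039, 0.06, 0.074, 0.205, 0.042, 0.216]
accepted = benjamini_hochberg(p_values, alpha=0.05)
print(accepted)
```

With these toy numbers only the two smallest p-values pass. Note that 0.039 would clear a plain 0.05 cutoff but not the rank-adjusted one, which is the whole point of the correction.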

For more information, please refer to the classic paper from Gygi's lab.

Tuesday, June 4, 2013

False discovery rates: Part 1 -- Fictitious history and introduction to target decoy searches

In my opinion, the single most controversial aspect of mass spectrometry based proteomics is the false discovery rate calculation. It is controversial because it is so poorly understood. It is poorly understood because it is extremely complicated.

I'm going to try to take this apart as methodically as I possibly can in order to help simplify it as much as possible. Honestly, it is going to be a bit of an oversimplification and probably more than a bit inaccurate, but I think I can cut through to the fundamentals by conducting multiple experiments on the same database and sample set and by telling it like a simple story.

In the good old days, our super sensitive and speedy mass spectrometers could do a reasonably good job of looking at one compound at a time and fragmenting it. We could then go to the mass of the compound, make a hypothesis about what it is, and then see if there is fragmentation data to support that hypothesis. Sometimes we still have to do this. I have some friends that still manually verify every MS/MS spectrum that they have sequenced before they will publish it. It takes forever, even when you have developed a knack for it.

One day, however, the mass spectrometers got much much speedier. I blame it on the Finnigan LTQ, quite honestly. All of a sudden, the instruments were generating a full MS1 and multiple fragmentation spectra per second. People started doing crazy things like not separating their samples out in 2 (or 3) dimensions before they put them on the LC-MS. Sometimes they were collecting fractions that contained hundreds of proteins for MS/MS. This new speed made the absolute limiting factor the amount of human hours it took to manually verify spectra (which, as I said, some people still do and I applaud them for it! Nothing beats it.)

By this time lots of scoring methods for peptide quality had already been invented, like Xcorr and Mascot score (whatever that is), but these methods were also showing their weaknesses. Again, they worked best when you set your parameters based on what criteria you personally trusted and then manually verified everything. In order to shortcut around manually verifying every experiment, we borrowed the False Discovery Rate (FDR) idea from the genomics people.

The first real idea was the target decoy. In Proteome Discoverer terms, we call this Peptide Validator or the Target Decoy PSM Validator. The math is complex, but essentially it consists of 3 steps:
1) Run your MS/MS spectra vs. your database
2) Run your MS/MS spectra vs. a messed up version of your database (backwards or scrambled)
3) Use the percentage of matches to your messed up database to determine what percentage of your matches to the real database could be random, and use that degree of randomness to cut out your lowest scoring peptides.

For example: if I compared my RAW file to my decoy database and it came up with 10 hits, but when I compared it to my real database it got 100 hits, we would think that 10% of our good hits may just be sheer random occurrence. Now, we could just drop the lowest scoring 10% of hits and be done with it, but those might actually be good spectra and there might be more random matches in there. Of course, there are all sorts of maths out there that can help, but I find this plot from PEAKS to be the easiest for my little brain:

In this simplification, we are taking our false matches and our true matches and overlapping them. The confidence interval must be moved in a way to minimize the false matches while getting the most possible true hits. I particularly like this because it demonstrates 2 key points: 1) Using an FDR calculation is going to lose you good data. 2) There is no guarantee that any use of FDR is going to prevent the existence of bad matches. Some peptides are just going to be there due to a combination of random occurrence and statistical anomaly. For more description on this particular graph, please check out this nice tutorial at PEAKS.
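The idea of sliding a cutoff along the score axis can be sketched in Python. This is a bare-bones illustration, not what PEAKS or Proteome Discoverer actually compute: I'm assuming separate target and decoy searches, estimating FDR as simply (decoy hits above the cutoff) / (target hits above the cutoff), and using made-up scores and an unrealistically loose 20% limit so the toy numbers work out.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Return (estimated FDR, number of target hits kept) at a score cutoff."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    if targets == 0:
        return 0.0, 0
    return decoys / targets, targets


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest cutoff whose estimated FDR is at or below max_fdr.

    Returns None if no cutoff satisfies the limit.
    """
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        fdr, _ = fdr_at_threshold(target_scores, decoy_scores, threshold)
        if fdr <= max_fdr:
            best = threshold
    return best


# Made-up scores for illustration only.
target_scores = [4.8, 4.5, 4.1, 3.9, 3.6, 3.2, 2.9, 2.5, 2.1, 1.8]
decoy_scores = [3.0, 2.2, 1.9, 1.5]

cutoff = lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.2)
fdr, kept = fdr_at_threshold(target_scores, decoy_scores, cutoff)
print(f"cutoff={cutoff}, estimated FDR={fdr:.3f}, target hits kept={kept}")
```

In this toy set the cutoff throws away two target hits that might have been perfectly good, and a decoy scoring 3.0 still sits above it, which are exactly the two key points above.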

For those of you solely interested in FDR calculation nodes for Proteome Discoverer, I suggest you watch this video I constructed on Vimeo for Proteome Discoverer 1.4.

In the next section, I'll take a look at the results we get when using 1) no decoy library, 2) a decoy library appended to the end of a normal FASTA (with no cutoffs), 3) a normal target decoy search (PSM validator) and 4) Percolator, all using the same dataset.

Sunday, June 2, 2013

ProteomicsNews is now on Twitter

In an effort to join everyone else in this decade, I've joined Twitter. I'm hoping it will become less confusing and more useful in bringing interesting new advances to my attention. You can follow me @ProteomicsNews.