News in Proteomics Research: Processing times for different enzyme efficiencies

Monday, March 23, 2015

Processing times for different enzyme efficiencies

A local chemist in my area politely emailed me with a question regarding no enzyme searches and how much time these could take.

I realized that I'd never actually looked at this stuff before. Since her instrument is a Q Exactive, I happen to have a nice QE-like run around, and I haven't left my desk anyway, I queued up a bunch of runs in PD 2.0.

PC details:
8 PC running at 4.7 GHz/ thread
16 GB of RAM
Solid state drive
Atlanta Hawks/San Antonio Spurs game streaming in HD (NBA League Pass FTW!) in the background.

Sample: HeLa digest (1 ug) run on an Elite operating in High-High mode (high resolution MS1/high resolution MS2, so just like a Q Exactive should run), separated on a 2 hr gradient with a 50 cm EasySpray C18 column

FASTA: The human UniProt database (UniProt/Swiss-Prot parsed on the term "sapiens" in PD 1.4), ~15 MB

10ppm MS1 tolerance
20 mmu (0.02 Da) MS/MS tolerance
Oxidation of M as a variable mod
Carbamidomethylation of C as a static mod
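
For a feel of how tight a 10 ppm window is next to the 0.02 Da fragment window, here is a minimal sketch that converts a ppm tolerance into an absolute window. The m/z values are made up for illustration and the helper name `ppm_to_da` is my own, not anything from PD.

```python
def ppm_to_da(mz: float, ppm: float) -> float:
    """Absolute half-width of the window that a ppm tolerance gives at this m/z."""
    return mz * ppm / 1e6


# Hypothetical precursor m/z values, chosen only for illustration.
for mz in (400.0, 800.0, 1200.0):
    print(f"m/z {mz:7.1f}: 10 ppm = +/- {ppm_to_da(mz, 10.0):.4f}")
```

The window scales with m/z, so a ppm tolerance is tighter for small precursors than for large ones, while the 20 mmu MS/MS window stays the same width everywhere.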

All pretty normal, right?

Here is what I changed around between runs

1) Trypsin; 0 missed cleavages
2) Trypsin; 1 missed cleavage
3) Trypsin; 2 missed cleavages
4) Trypsin; 0 missed cleavages, Semi-tryptic digestion
5) No enzyme search; 0 missed cleavages
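
To make the missed-cleavage settings concrete, here is a rough sketch of an in silico tryptic digest. It assumes the usual rule (cut after K or R, but not when the next residue is P) and a made-up toy sequence; it is not how SequestHT actually builds its peptide list, just an illustration of what "0, 1, or 2 missed cleavages" adds.

```python
def cleavage_sites(seq):
    """Positions where trypsin may cut: after K/R unless followed by P, plus both termini."""
    sites = [0]
    for i in range(len(seq) - 1):
        if seq[i] in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(seq))
    return sites


def tryptic_peptides(seq, missed=0):
    """All fully tryptic peptides with at most `missed` skipped cleavage sites."""
    sites = cleavage_sites(seq)
    peptides = []
    for i in range(len(sites) - 1):
        # j = i + 1 is zero missed cleavages; each further step skips one more site.
        for j in range(i + 1, min(i + missed + 2, len(sites))):
            peptides.append(seq[sites[i]:sites[j]])
    return peptides


toy = "AAKCCRDDKPEERGG"  # made-up sequence, not a real protein
for missed in (0, 1, 2):
    print(missed, tryptic_peptides(toy, missed))
```

Each extra allowed missed cleavage only adds roughly one more peptide per cleavage site, which is why the search space grows slowly across the three tryptic settings.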

Now, in hindsight I should have queued these up in PD 1.4, because PD 2.0 throws my numbers off: it started running the first 4 at once, and when the first one finished it queued up the last one. They are probably close enough. Let's just talk about the Sequest search, 'cause that is what I'm most interested in here.

SequestHT times (as reported by PD)
1) 1 min 37 seconds
2) 1 min 57 seconds (I'm not joking. When people round things to the nearest 7, I assume they are making them up.)
3) 1 min 19 seconds? What? Umm... This may have had to do with PD queuing up different runs and maybe how the game was buffering.
4) Semi-tryptic: 17 min 39 seconds. I'm just going to go right out there and admit I don't entirely know what that means. (Don't tell anybody.) I've always assumed it's like this: sometimes it hits trypsin, sometimes it doesn't. In my head I assumed it's about the same as 2 missed cleavages. It obviously isn't. It requires a whole lot more power than that. Guess I'd better look it up..ugh...
According to Matrix Science (original page here):

"semiTrypsin" means that Mascot will search for peptides that show tryptic specificity (KR not P) at one terminus, but where the other terminus may be a non-tryptic cleavage. This is a half-way house between choosing "Trypsin" and "None". It will only fail to find peptides that are non-specific at both ends.

I take back the "ugh". This is actually pretty cool. But I digress...

5) No enzyme; 3 hours 4 minutes and an odd number of seconds.
Now, this may have gotten a boost because it was the last run, and toward the end PD could focus solely on processing that run. Plus the Spurs won; honestly it wasn't even close and the game was definitely taking up processing power.
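
To put those times on one scale, here is a quick sketch that converts them to seconds and compares each to run 1. The seconds for the no-enzyme run weren't recorded above, so the code assumes 0 there; treat that ratio as approximate.

```python
# (hours, minutes, seconds) as reported above; run 5's seconds are assumed to be 0.
times = {
    "1) Trypsin, 0 missed": (0, 1, 37),
    "2) Trypsin, 1 missed": (0, 1, 57),
    "3) Trypsin, 2 missed": (0, 1, 19),
    "4) Semi-tryptic": (0, 17, 39),
    "5) No enzyme": (3, 4, 0),
}

seconds = {name: h * 3600 + m * 60 + s for name, (h, m, s) in times.items()}
baseline = seconds["1) Trypsin, 0 missed"]

for name, sec in seconds.items():
    print(f"{name:22s} {sec:6d} s  ({sec / baseline:6.1f}x run 1)")
```

Given that the first four runs shared the machine, the ratios are only a rough guide, but the jump from tryptic to semi-tryptic to no enzyme is about an order of magnitude each time.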

Interesting observation 1: On normal searches and a ton of threads with tight tolerances, PD 2.0 just tears through these data sets. A minute or two each, give or take.

Interesting observation 2: Semi-tryptic search is a big boost in search space (that is, the size of the in silico theoretical digest that we are comparing our spectra to).
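
Here is a toy sketch of why the search space balloons. It counts candidate peptides from one made-up sequence under three rules: fully tryptic (both ends at a cleavage site), semi-tryptic (at least one end at a cleavage site), and no enzyme (any substring). The length limits and the way the missed-cleavage cap is applied are my assumptions for illustration, not SequestHT's actual settings.

```python
def cleavage_sites(seq):
    """Tryptic cut positions (after K/R, not before P), plus both protein termini."""
    sites = {0, len(seq)}
    for i in range(len(seq) - 1):
        if seq[i] in "KR" and seq[i + 1] != "P":
            sites.add(i + 1)
    return sites


def count_candidates(seq, specificity, missed=0, min_len=6, max_len=30):
    """Count substrings allowed under 'full', 'semi', or 'none' enzyme specificity."""
    sites = cleavage_sites(seq)
    count = 0
    for start in range(len(seq)):
        for end in range(start + min_len, min(start + max_len, len(seq)) + 1):
            if specificity != "none":
                specific_ends = (start in sites) + (end in sites)
                needed = 2 if specificity == "full" else 1
                # Assumption: the missed-cleavage cap counts sites strictly inside the peptide.
                internal = sum(1 for s in sites if start < s < end)
                if specific_ends < needed or internal > missed:
                    continue
            count += 1
    return count


# Made-up sequence, not a real protein.
toy = "MAAKCCDEFRGHILKPNNQRSTVWYKAEDGGRLLMMK"
for rule in ("full", "semi", "none"):
    print(rule, count_candidates(toy, rule))
```

Every fully tryptic peptide is also a valid semi-tryptic candidate, and every semi-tryptic candidate is also a valid no-enzyme candidate, so the counts can only go up as the rule is relaxed.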

Interesting observation 3: During the semi-tryptic and no enzyme searches, PD 2.0 doesn't make an index. It says so here (highlighted). I circled and drew a smiley face around the 2 missed cleavage search that took 1 min. I'm not entirely sure why, though I'm gonna blame it on cold medicine.

Okay, the next question that pops into my strangely performing brain is this: Did I gain anything here?

Here are the number of peptides from each run:
1) 6675
2) 8068
3) 8068
4) 8863
5) 9044

Interesting. Recently, I began working on making videos for Protein Metrics nodes for PD. When I ran this same dataset with the Preview node, it told me that 15 or so percent of my cleavage sites were missed. That comes dangerously close to being right on the money (15% more than 6675 is ~7700). Interestingly, we didn't gain anything at all by looking at the second missed cleavage event. At 15% probability, missing 2 seems unlikely, but to gain exactly zero? Seems like a big coincidence. I'll rerun this one later and update if necessary. The interesting thing is how well the semi-tryptic search did. It took a whole lot more time to run than the tryptic searches, but it came back with nearly as many peptides as the no enzyme search (8863 vs. 9044) in a fraction of the time. I did my old manual verification sampling trick and I think these are good matches.
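
For anyone who wants the gains as percentages, here is a small sketch using only the peptide counts listed above, with run 1 as the baseline. The 15% figure is the Preview estimate mentioned in the text; the variable names are mine.

```python
peptides = {
    "1) Trypsin, 0 missed": 6675,
    "2) Trypsin, 1 missed": 8068,
    "3) Trypsin, 2 missed": 8068,
    "4) Semi-tryptic": 8863,
    "5) No enzyme": 9044,
}

baseline = peptides["1) Trypsin, 0 missed"]

# What a flat 15% gain over run 1 would look like.
print(f"15% more than {baseline} is {baseline * 1.15:.0f}")

for name, n in peptides.items():
    gain = 100.0 * (n - baseline) / baseline
    print(f"{name:22s} {n:5d} peptides  ({gain:+5.1f}% vs run 1)")
```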

I guess, for high resolution MS/MS spectra and small databases, you might as well allow a couple of missed cleavages when searching. It will hardly affect a high-end PC running SequestHT at all. But I think we learned that trypsin isn't perfect. It is supposed to cut at K and R. And it does...most of the time...but it misses sometimes; at least 15% of the time it will blow right by a K or R and go to the next one. However, it might make more mistakes than that. I'm sure this is in the literature somewhere!
