Monday, November 14, 2016

Optimize Proteome Discoverer 2.1 to operate on super computers!



First of all -- shoutout to the team at OmicsPCs for allowing me remote access to one of their "mid-level" computers (actually, this might be one of the bigger non-server rack ones above, lol!) after seeing my complaints about having trouble crunching >1k RAW files on my own desktop a while back.

On my first run, I hit a striking revelation -- if I just leave the Administration properties at default, this monster with 6x more processing cores than mine isn't any faster on a 24-fraction human lysate digest. They ran neck and neck -- which is weird cause these Xeons are only running 3-4GHz (I forget what he said) and my Destroyer runs at 5GHz and my solid state drive is recent and was the fastest on the market when I got it. (Don't ask him cause he starts talking in fractions about laten purses or cachets or something).

Time to mess around -- especially cause when I looked at the Task Manager, it didn't look like the one above (going hard on all cores!) at all.



If you go into the Admin section on PD 2.1 you've got a lot you can optimize. Way more than in any previous version of PD. You can tell Sequest, MSAmanda, Byonic, and even ptmRS how many processing threads they are allowed to use. There may be more. At this point, I'm going to leave them all at default so they will use the maximum available.

Then -- I'm going to modify the Parallel Job Execution thing -- this is a shot of the defaults from my desktop. I can only allow 4 workflows to proceed at once and up to 4 consensus workflows.

On the PC above (shoulda taken a screenshot before they changed the password...) I could set each one as high as 24!! Messing around with changing these parameters and a workflow like this....



...didn't change things!  Until you clicked the "As Batch" button -- and then the high core box finished over 3x faster!!  WTFourier?!?

Now...this probably bears further examination. Might be some things going on here, like maybe setting Sequest to 0 (to recognize all cores) doesn't work on these components? But my thought here is that PD doesn't need all these threads to finish each individual RAW file, so it doesn't use them. But if you are doing them in Batch -- then you are using all sorts of different Processing and Consensus workflows and you can use all sorts of extra resources. So....

Why don't we batch them and then recompile the results? I've always been told that if you fractionate your samples you should search and Percolate them all together for the best statistics. But...PD 2.1 allows you to do peptide group level and protein FDR. So...maybe it's okay?

First off, we've got to get those Consensus workflows out of there -- cause we only want to do 1 of them.

Did you know (I didn't) that if you only want PD to stop at making an MSF file you can? You just make a Consensus workflow that looks like this!


BOOM!  No consensus!

So...I messed around trying a few different things. But, ultimately, 12 and 24 processing workflows and 24 consensus workflows (it still fires up the Consensus -- even if it doesn't actually do anything) appeared to be the fastest setups. Each processing workflow ran enough faster with 12 workflows (4 threads each on this PC) that it made up for only running half as many at once compared to 24 processing workflows at 2 threads each. Honestly, I was watching Cavs/Celtics while running this remotely...so...within the error margin of an NBA time out? (3-7 minutes at most). But they were pretty close.
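The thread math behind those two setups is just division. Here is a minimal Python sketch of it, assuming the box exposes 48 logical threads (my inference from 12 x 4 and 24 x 2 above, not something read off the machine) and assuming PD splits threads evenly between the workflows running at once, which I have not verified. The function name is made up for illustration.

```python
def threads_per_workflow(total_threads: int, parallel_workflows: int) -> int:
    """Evenly split logical threads among workflows running at the same time.

    Purely illustrative arithmetic; how PD actually schedules threads is
    an assumption here, not documented behavior.
    """
    if parallel_workflows < 1:
        raise ValueError("need at least one workflow")
    # Never report fewer than 1 thread per workflow.
    return max(1, total_threads // parallel_workflows)


# Assumption: 48 logical threads, inferred from "12 workflows at 4 threads"
# and "24 workflows at 2 threads" in the post.
TOTAL_THREADS = 48

for jobs in (4, 12, 24):
    share = threads_per_workflow(TOTAL_THREADS, jobs)
    print(f"{jobs} parallel workflows: {share} threads each")
```

Either way the whole machine is busy, which fits with the two setups finishing so close together.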

Then you need to take the MSFs that you generated and make one Consensus workflow!


When you get your 24 processed MSFs you highlight them and now this option (highlighted) is no longer greyed out!

All you need to do then is to make a normal consensus workflow -- here, to keep all things the same (and cause Marcus Smart is hitting 3's!) I just used the same "Basic Consensus" that I used for the previous runs...and....!

...from a pure number level...they are pretty darned close...the top is them batched. The bottom is the RAW files run in one processing workflow....

Oh...wait...I was talking about speed as well. The top one? Under 30 minutes (this is only 2 dynamic mods -- MetOx and protein N-term Acetyl). The bottom one? Can't find my notes, but 1.5-2 hours.
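For the record, here is that timing difference as a fold change, in a quick Python sketch. The 30 minutes is the upper bound for the batched run and 90 to 120 minutes is my from-memory range for the single workflow, so treat the output as a rough floor, not a benchmark.

```python
def fold_speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is compared with the baseline."""
    return baseline_minutes / new_minutes


BATCHED_MINUTES = 30  # "under 30 minutes", so the real speed-up is a bit higher

for single_minutes in (90, 120):  # the remembered 1.5-2 hour range
    fold = fold_speedup(single_minutes, BATCHED_MINUTES)
    print(f"{single_minutes} min vs {BATCHED_MINUTES} min: {fold:.1f}x faster")
```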

Out of curiosity (of course!) how do the lists line up?

Here I'm just going to lazily use the Venny Venn.

At the PSM level!


Venny isn't set to draw "to scale" so this looks extreme...but we're talking a total difference of about 2.7%. Now...I actually don't know if Venny is reading the upper/lower case in my sequences (they have PTMs) correctly. And I'm too lazy to check.
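If you are less lazy than me, the same comparison is a few lines of Python sets, and Python string comparison is definitely case sensitive. This is only a sketch: the sequences are made-up toy data, it assumes lowercase letters mark modified residues in your export, and the "percent difference" here (items found in only one list, divided by everything found in either list) is my assumption about a reasonable definition, not necessarily how the 2.7% above was figured.

```python
def venn_counts(batched, single):
    """Count items unique to each list and shared by both (case sensitive)."""
    a, b = set(batched), set(single)
    return {
        "only_batched": len(a - b),
        "shared": len(a & b),
        "only_single": len(b - a),
    }


def percent_difference(batched, single):
    """Items seen in only one list, as a percentage of all items seen."""
    a, b = set(batched), set(single)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a ^ b) / len(union)


# Toy data (hypothetical): the lowercase "m" stands in for an oxidized Met,
# so "mPEPTIDEK" and "MPEPTIDEK" must NOT be counted as the same sequence.
batched = ["PEPTIDEK", "mPEPTIDEK", "ELVISK", "SAMPLER"]
single = ["PEPTIDEK", "MPEPTIDEK", "ELVISK", "SAMPLER"]

print(venn_counts(batched, single))
print(f"{percent_difference(batched, single):.1f}% difference")
```

If a tool lowercased or uppercased everything before comparing, the two modified/unmodified forms would collapse into one entry and the lists would look more alike than they are.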

More importantly...how's it do at the protein group level?

Interesting!  A little bit more! But still within a few percent. I wonder what those few percent are? I've got a hunch...

If I drop out the 1 hit wonders (at least 2 unique peptides, which is kinda harsh) we're looking at almost 99% agreement here between these 2 datasets and dropping the net processing time by at least 3 fold.
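Dropping the 1 hit wonders before comparing can be sketched like this in Python. The accessions and peptide counts below are invented toy data, and the function names are mine; the only thing taken from the run above is the "at least 2 unique peptides" cutoff.

```python
def drop_one_hit_wonders(unique_peptides_per_protein, min_unique=2):
    """Keep only proteins supported by at least `min_unique` unique peptides."""
    return {
        protein
        for protein, n_unique in unique_peptides_per_protein.items()
        if n_unique >= min_unique
    }


def percent_agreement(a, b):
    """Shared proteins as a percentage of all proteins seen in either set."""
    union = a | b
    if not union:
        return 100.0
    return 100.0 * len(a & b) / len(union)


# Toy data (hypothetical accessions): protein -> number of unique peptides.
batched = {"P1": 5, "P2": 2, "P3": 1}
single = {"P1": 4, "P2": 3, "P4": 1}

before = percent_agreement(set(batched), set(single))
after = percent_agreement(
    drop_one_hit_wonders(batched), drop_one_hit_wonders(single)
)
print(f"agreement before filter: {before:.0f}%, after filter: {after:.0f}%")
```

In the toy data all of the disagreement sits in the single-peptide proteins, which is the hunch I had about the real lists.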

Not saying this is the way to do this, but it might be something to check out if your data processing time is a serious bottleneck!

BTW, the Venny 2.1 defaults are a little ugly!

2 comments:

  1. Well, this comment should be in a "PD for dummies", but here it goes, since it took me a while to figure out what I was doing wrong.
    When you follow these instructions, after you click the "Use these results to create new (multi)consensus", remember to delete the previous processing step you just ran; otherwise it will run again and waste a lot of time.

    Thanks for the great post, Ben.
