Saturday, September 15, 2018

Miss the days when PD was slower and less stable? Tips to relive those days!


This has come up a lot over the years and I was surprised to see I couldn't find a case of me rambling about it here! So....I present Ben's guide to running Proteome Discoverer way slower and with lots of weird random chaotic surprises!

Tip 1 (picture above):

Keep all your stuff on separate drives while processing! Bonus points if you keep your RAW files on network drives. Double bonus points if you process your RAW files over your network from one drive and then deposit the results on a different processing drive!!  Want to level all the way up? Pull your RAW files from one network storage drive. THEN transmit your processed results to a DIFFERENT network storage drive!


...hours of processing....

Besides the fact that you've gone from eSATA data transfer rates (according to Google -- 6 Gbps) to, at best, 1 Gbps on a true gigabit ethernet LAN (6x slower on paper) or a whopping 100 Mbps on older Fast Ethernet (a minimum of 60x slower), you also get to deal with a bunch of cool extra things that are described well in this page.

It totally cracks me up that the physical distance between your network drive and your PC is a tangible factor that can affect your network rate. High traffic on your network doesn't speed things up either (a win for us nocturnal scientists!), but that win is often negated by the huge FAIL that network drives tend to run their backups and security scans at 2am, when there is only the one weird guy in the building using them.

Honestly, our files aren't all that big. We just did some deep fractionated proteomes (15 fractions) and they're maybe 24GB per patient. Going from 6 Gbps to 100 Mbps shouldn't be that big of a change, even if you had 10 of them, right? However, it isn't just one reading step. It's constant R/W steps (have you seen the funny huge ".scratch" file that is generated while you're running?). You are constantly reading and writing that back and forth across the network.
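To put rough numbers on that, here is a back-of-envelope sketch. It only uses the figures from this post (24GB per patient, 6 Gbps local, 100 Mbps network) and assumes the link runs at its full nominal rate with zero overhead, which it never does. The function name and constants are made up for illustration.

```python
# Back-of-envelope only: nominal link rates, no protocol overhead, no latency,
# no competing traffic. Real transfers will be slower than this.

def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal seconds to move size_gb gigabytes over a link_mbps megabit/s link."""
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / link_mbps

LOCAL_MBPS = 6000    # the 6 Gbps figure from the post
NETWORK_MBPS = 100   # the 100 Mbps figure from the post
PATIENT_GB = 24      # roughly one fractionated proteome, per the post

if __name__ == "__main__":
    for n_patients in (1, 10):
        size = PATIENT_GB * n_patients
        local = transfer_seconds(size, LOCAL_MBPS)
        network = transfer_seconds(size, NETWORK_MBPS)
        print(f"{n_patients:>2} patient(s): local {local / 60:.1f} min, "
              f"network {network / 60:.1f} min")
```

On paper, one pass over a single patient goes from about half a minute to about 32 minutes, and ten patients go from a bit over 5 minutes to a bit over 5 hours. That is for touching the data exactly once. Every extra trip of the scratch file across the network multiplies it.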

Beyond the fact that PD is super slow this way -- you also get all sorts of hilarious strange bugs. This week I saw one where PD would claim there was something wrong with the name of the output file that someone was trying to use! Wins all around!

Tip 2: Even on the same PC --- process your data on different drives!  

I think I have proof around here somewhere. I think I worked it out to 24x slower if you R/W the same data across different drives as opposed to processing it all on one drive.  I think it's on my old PC....I'll update if I find it, it's striking.

Wait -- side note --- did you know that even HDDs can have markedly different speeds? They totally do! There are drives designed for storage that are much slower than ones meant for working on. I've described my problem with that recently on here I think.
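If you want a rough feel for how your own drives compare, something like the sketch below can help. It is a crude sequential-write test, not a real benchmark: it writes a temporary file into a folder on the drive you care about, forces it to disk, and reports MB/s. The function name and default sizes are my own picks for illustration, and it says nothing about read speed or the small random R/W that processing software really does.

```python
import os
import tempfile
import time

def write_throughput_mb_s(directory: str, size_mb: int = 256, chunk_mb: int = 4) -> float:
    """Write a temporary file of about size_mb into directory and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random bytes, so nothing compresses away
    n_chunks = max(1, size_mb // chunk_mb)
    fd, path = tempfile.mkstemp(dir=directory, suffix=".bench")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as fh:
            for _ in range(n_chunks):
                fh.write(chunk)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the data really reached the drive
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the temporary file
    return n_chunks * chunk_mb / elapsed

if __name__ == "__main__":
    # Point this at a folder on each drive you want to compare.
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.0f} MB/s")
```

Run it a few times per drive and compare the numbers against each other rather than trusting any single value, since caching and whatever else the PC is doing will move them around.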

This is from a paper currently in review from our lab, but I think it's cool to use it here out of context ---


The cool part is how our new software makes processing huge proteomics sets much faster while kicking out the same data -- but what is pertinent in this ramble is the two shorter bars. Using the exact same files, huge multi-gig proteogenomic FASTA, and software settings, we can drop a processing run from 24 hours down to 14 or so just by moving everything from a HDD to a faster standard commercial Solid State Drive (SSD). If you aren't processing on these, I'd recommend checking them out again. They are getting cheaper every day. I think we just ordered some 1TB ones for less than $200. Bonus: I've still never had an SSD fail. And I've got 2 HDDs on different boxes that sound like they are popping popcorn (not the best sign ever) that aren't as old as the SSDs sharing space with them.

Can I call this a "guide" if there are only two tips? On the first Saturday in approximately 3 years in Maryland where the sun is shining? Looks like I sure can. I need to put on some brake pads.

TL;DR: PD HATES processing over network drives. Move your data and output files to the same drive when running PD, then put them back. Yeah, transferring is a pain, but you'll more than make up for it in processing your data faster and with less random chaos.
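The "move it, run it, put it back" routine is easy to script. Below is a minimal sketch, assuming your RAW files sit in one network folder and you have a working folder on a local drive. The helper name `stage_files` and the file patterns are hypothetical, and it deliberately does not try to launch PD: you run your workflow by hand between the two copy steps.

```python
import shutil
from pathlib import Path

def stage_files(src_dir: Path, dst_dir: Path, pattern: str) -> list[Path]:
    """Copy every file in src_dir matching pattern into dst_dir.

    Files already present with the same size are skipped, so the
    function is safe to re-run after an interrupted copy.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = dst_dir / src.name
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

Usage would look something like `stage_files(network_folder, local_folder, "*.raw")` before processing, then `stage_files(local_folder, network_results_folder, "*.pdResult")` afterwards. Swap in whatever result extensions your PD version writes. Note that the pattern match is case-sensitive on some systems.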

Big shoutout to the two great scientists who introduced me to new PD errors this week that inspired this post!  I promise I'm not making fun. This really does come up a lot. It's too tempting to use your >100TB network storage rather than move things around, but I think system architecture needs to improve before you can do it bug-free.
