Monday, June 29, 2015

Is "generation loss" hurting your results?

I thought this was one isolated event, but I may have just run into this issue again and did some investigation this morning.  I don't have a full grasp of it yet, but I wanted to get the idea out to as many people as possible.

Here is the issue in a nutshell.  A few weeks ago I visited a lab that was getting bad results here and there.  This was a lab of the very best sort.  People with more experience than me, with great instrumentation, flawless methods and chromatography, and thorough quality control at every level.  But their processed proteomic data looked bad in a semi-random sense.  The data coming off the many acquisition computers was automatically transferred to a big storage server maintained by the University (awesome, right?!?), then the data could be transferred to the processing PCs.  It turns out that sometime between the two transfers, HOLES got poked in their data.  No joke.  After the transfer, the .RAW files would have spectra that had nothing in them.  Not a darned thing.  And this messed up data processing.  The program would see these errored spectra and flip out a little, sometimes jumping past many MS/MS spectra and not getting a thing out of them.
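A quick sanity check for this kind of damage is just to scan the file and flag spectra with zero peaks.  Here's a minimal sketch of the idea; the spectrum dicts and the `find_empty_spectra` helper are hypothetical, standing in for whatever your RAW/mzML reader actually hands you:

```python
# Hypothetical sanity check: flag spectra that came through a transfer
# with no peaks in them.  A "spectrum" here is just a dict with a scan
# number and a list of (m/z, intensity) pairs -- a real reader would
# supply these from the .RAW or mzML file.

def find_empty_spectra(spectra):
    """Return the scan numbers of spectra containing zero peaks."""
    return [s["scan"] for s in spectra if len(s["peaks"]) == 0]

# Toy run: scan 2 lost its peaks somewhere in transit.
spectra = [
    {"scan": 1, "peaks": [(445.12, 1.0e6), (446.12, 2.5e5)]},
    {"scan": 2, "peaks": []},          # a "hole" poked in the data
    {"scan": 3, "peaks": [(500.30, 8.0e5)]},
]
print(find_empty_spectra(spectra))     # -> [2]
```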

Again, I assumed this was an isolated incident...then it possibly reared its head again while I was on vacation last week.  So it's time for my first sober post in a while!

I'm no expert on this, but I've read several Wikipedia articles on it today.  And I think that we're looking at something called "generation loss" (or something similar).  You can read about it here.  In a nutshell...

...sorry.  Sober post, I swear (though I just realized how long it's been since I last saw that great movie)!

Anyway...back on topic...when we compress and/or transfer data, we lose stuff sometimes.  It's like the old Xerox effect.  You can only Xerox a copy of a document so many times before it's junk.  Data compression is one issue (I'll come back to it), but data transfer is another.  There are many ways to transfer data from one place to another, and sometimes a system has to decide between two things -- speed vs. quality (there is something related here.)

A few years ago we all started moving away from one data transfer mechanism called FTP (though FTP sites are still around).  FTP is super fast and relatively easy to use, but it has no data correction native to the format.  (Supporting evidence here.)  So I would have someone send me a 1 GB Orbi Elite file, and maybe it would get there intact...and maybe it wouldn't...  FTP can be wrapped with extra features that include error correction, but better data transfer mechanisms exist.  What was interesting, though, was that most of the time if I had an FTP transfer error I simply couldn't open the file.  Though I don't actually have a good tool to determine if some spectra were missing.  Again, there are tons of ways to transfer data, but I think from the equations in the link above that there is an inverse relationship between speed and quality, particularly when data correction algorithms are used.
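One way to catch a bad transfer without trying to open the file is to checksum it on both ends and compare.  Here's a minimal sketch using SHA-256 from Python's standard library (MD5 works the same way via `hashlib.md5`); the file names and the simulated corruption are just for illustration:

```python
# Hash a file before and after transfer and compare -- a mismatch means
# something changed in transit.  Hashing in chunks keeps memory use flat
# even for multi-GB RAW files.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256, one megabyte at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a transfer that flipped some bytes along the way.
with open("source.raw", "wb") as f:
    f.write(b"\x01\x02\x03" * 1000)
with open("copied.raw", "wb") as f:
    f.write(b"\x01\x00\x03" * 1000)   # "corrupted" copy

print(sha256_of("source.raw") == sha256_of("copied.raw"))  # -> False
```

This only detects the damage, of course; fixing it means transferring again, which is exactly the speed-vs-quality trade-off above.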

Right.  So that makes sense, right?  So we should transfer slower and get better data quality.  Even a Fusion only maxes out around 1GB/hour, right?  And most PCs these days have gigabit ethernet connections so that should be no problem.  However, what if you had a ton of systems transferring this much data?  And what if the data coming off the Fusion or Q Exactive was relatively small in comparison to the other data coming through?  Then you've got some tough decisions to make.

I think this is related, though.  DNA/RNA sequencers generate much more data than we do, and at a much faster rate.  And they've been doing it the whole time.  Integrated into these sequencing technologies has been (and surely had to be) data compression and transfer mechanisms (some related info here).  There is no alternative for them.  This data has to be compressed in some way.  When you are getting terabytes of data per day from a HiSeq platform, you need to do something with it.
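The key distinction here is lossless vs. lossy compression.  Lossless compression (the zlib/gzip family, for instance) shrinks the data but gives back byte-for-byte what went in, no matter how many generations of compress/decompress you run.  A tiny demonstration with Python's standard library, using a fake, highly repetitive peak list as the payload:

```python
# Lossless round trip: zlib shrinks repetitive data a lot, and
# decompression recovers it exactly -- no generation loss.
import zlib

raw = b"449.7401,120533.2\n" * 5000    # fake, highly repetitive peak list
packed = zlib.compress(raw, level=9)

print(len(raw), "->", len(packed))     # large reduction on repetitive data
print(zlib.decompress(packed) == raw)  # -> True, every single time
```

Lossy schemes (like the ones that make JPEGs and MP3s small) are where repeated generations actually degrade the data.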

This is where I need to speculate a little.  What if you are in a big institution and you share resources with a genomics core?  Would these mechanisms be automatic?  Would the central storage server run at higher speeds that would cause issues with data fidelity?  Would they use some level of data compression to control storage of files above a minimum size?  I don't know.  What I do know is that twice in the last month or so I've seen data that had lost quality.  Hopefully it's a coincidence and not a pattern.

The next question, of course, is how do our universal formats like mzML and such deal with compression and transfer?
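For mzML specifically, the peak arrays are stored as base64 text over little-endian 32- or 64-bit floats, optionally run through zlib first, so the compression is lossless (the only lossy choice is dropping from 64- to 32-bit precision).  Here's a rough sketch of that round trip (64-bit + zlib), built only from the standard library; the two helper names are mine, not from any mzML library:

```python
# Sketch of how mzML packs a peak array: little-endian doubles,
# zlib-compressed, then base64-encoded into the XML.  The round trip
# is lossless for the stored values.
import base64
import struct
import zlib

def encode_mzml_array(values):
    """Pack floats the way an mzML binaryDataArray (64-bit, zlib) would."""
    raw = struct.pack("<%dd" % len(values), *values)  # little-endian doubles
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def decode_mzml_array(text):
    """Reverse the encoding back into a list of floats."""
    raw = zlib.decompress(base64.b64decode(text))
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))

mz = [445.1200, 446.1234, 500.3001]
encoded = encode_mzml_array(mz)
print(decode_mzml_array(encoded) == mz)   # -> True: lossless round trip
```

So the format itself shouldn't poke holes in anything; the transfer layer underneath it still can.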

Sorry I don't have great answers here.  Definitely curious if you guys with actual knowledge on these topics can weigh in here!


  1. Your example is perfect, since this is exactly a very well-known problem in photography.
    You start out happy that your compressions of compressions make the file tinier, until it becomes useless.
    Well, the surest way to transfer data would still be to physically move storage bays from one point to another ;)  And it's sometimes faster, especially for very big files and local servers.
    You can also verify data with checksums (like MD5, but there are others).  Each checksum type can be blind to certain kinds of errors, so to be exhaustive you have to check multiple different sums.  That's basically what the most secure technologies do: they check at both the packet and file levels.  One method could be to save data locally and then transfer it to the server hosts at a slower pace.
    But the problem could matter at the acquisition level too.
    If a lossy compression method is applied in order to increase speed, you lose important data without ever knowing about it.
    That's very important in photography again, where some preprocessing is done at the sensor level to avoid filling the buffer and to increase burst speed, and it was found to strongly influence image quality in the end, especially in very bright or dark areas.
    Which, for MS, translates to dynamic range.
    All of this is an optimization problem between usability and speed.
    Regarding encoding, there are a lot of formats that use lossless compression.  Since Fourier transforms are basically used to encode music too, I'd guess some of these formats can be applied to some types of MS data as well.

    1. Lawrence, thanks so much for weighing in here and for the additional details!!!