Monday, June 29, 2015
Is "generation loss" hurting your results?
I thought this was an isolated event, but I may have just run into this issue again, and I did some investigation this morning. I don't have a full handle on it yet, but I wanted to get the idea out to as many people as possible.
Here is the issue in a nutshell. A few weeks ago I visited a lab that was getting bad results here and there. This was a lab of the very best sort: people with more experience than me, great instrumentation, flawless methods and chromatography, and thorough quality control at every level. But their processed proteomic data looked bad in a semi-random sense. The data coming off the many acquisition computers was automatically transferred to a big storage server maintained by the University (awesome, right?!?), and from there the data could be transferred to the processing PCs. It turns out that sometime between the two transfers, HOLES got poked in their data. No joke. After the transfer, the .RAW files would contain spectra that had nothing in them. Not a darned thing. And this messed up data processing. The program would see these errored spectra and flip out a little, sometimes skipping many MS/MS spectra and not getting a thing out of them.
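I don't have a one-click way to check a vendor .RAW file for this, but once the spectra have been parsed out by whatever converter or library you use, flagging the empty ones is trivial. Here's a minimal sketch in Python; the `spectra` structure (each spectrum reduced to a plain list of peak intensities) is my own stand-in for illustration, not any real vendor API:

```python
# Flag spectra that came through a transfer with nothing in them.
# Each "spectrum" here is just a list of peak intensities -- a stand-in
# for whatever your converter hands you after parsing the file.

def find_empty_spectra(spectra):
    """Return the indices of spectra that contain no peaks at all."""
    return [i for i, peaks in enumerate(spectra) if len(peaks) == 0]

# A toy run: scan 3 lost everything in transit.
spectra = [
    [1050.2, 3300.8, 870.1],   # scan 1: normal
    [220.5, 990.0],            # scan 2: normal
    [],                        # scan 3: a "hole" -- nothing in it
]
print(find_empty_spectra(spectra))  # -> [2]
```

Run over a whole file, a nonzero result like this would be a red flag that something happened between acquisition and processing.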
Again, I assumed this was an isolated incident...then it possibly reared its head again while I was on vacation last week. So it's time for my first sober post in a while!
I'm no expert on this, but I've read several Wikipedia articles on it today, and I think we're looking at something called "generation loss" (or something similar). You can read about it here. In a nutshell...
...sorry. Sober post, I swear (though I just realized how long it's been since I last saw that great movie)!
Anyway...back on topic...when we compress and/or transfer data, we sometimes lose stuff. It's like the old Xerox effect: you can only Xerox a document so many times before it's junk. Data compression is one issue (I'll come back to it), but data transfer is another. There are many ways to transfer data from one place to another, and sometimes a system has to decide between two things -- speed vs. quality (there is something related here).
A few years ago we all started moving away from one data transfer mechanism called FTP (though FTP sites are still around). FTP is super fast and relatively easy to use, but it has no error correction native to the protocol (supporting evidence here). So someone would send me a 1 GB Orbi Elite file, and maybe it would get there intact...and maybe it wouldn't... FTP can be wrapped with extra features that include error checking, but better data transfer mechanisms exist. What was interesting, though, was that most of the time when I had an FTP transfer error I simply couldn't open the file -- though I don't actually have a good tool to determine whether some spectra went missing. Again, there are tons of ways to transfer data, but I gather from the equations in the link above that there's an inverse relationship between speed and quality, particularly when data correction algorithms are used.
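Since plain FTP won't tell you whether the file that arrived is the file that left, the usual workaround is to hash the file on both ends and compare. A quick sketch using only Python's standard library (the file paths in the comments are made up for illustration):

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so even multi-GB .RAW files fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compute this on the acquisition PC before the transfer, then again on
# the processing PC afterwards. If the two hex strings differ, the
# transfer poked a hole in your data.
# before = file_checksum("D:/data/run_042.raw")      # hypothetical path
# after  = file_checksum("//server/share/run_042.raw")
# assert before == after, "file changed in transit!"
```

It doesn't fix anything, but at least you'd know to re-transfer the file instead of processing a damaged one.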
Right. So that makes sense, right? We should transfer slower and get better data quality. Even a Fusion only maxes out around 1 GB/hour, right? And most PCs these days have gigabit Ethernet connections, so that should be no problem. However, what if you had a ton of systems transferring this much data? And what if the data coming off the Fusion or Q Exactive was relatively small in comparison to the other data coming through? Then you've got some tough decisions to make.
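The back-of-the-envelope math there is worth spelling out. The instrument rate is the rough figure from above; the rest is just unit conversion:

```python
# Rough numbers: a Fusion writing ~1 GB/hour vs. a gigabit link.
instrument_rate_gb_per_hour = 1.0
link_rate_bits_per_s = 1_000_000_000          # gigabit Ethernet, in theory
link_rate_gb_per_hour = link_rate_bits_per_s / 8 / 1e9 * 3600

print(round(link_rate_gb_per_hour))  # -> 450 GB/hour of theoretical capacity

# So one instrument uses a fraction of a percent of the pipe. The crunch
# only comes when many systems (or a sequencing core) share it.
print(f"{instrument_rate_gb_per_hour / link_rate_gb_per_hour:.2%}")
```

One mass spec is nothing on that link; a building full of instruments plus a genomics core is a different story.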
I think this is related, though. DNA/RNA sequencers generate much more data than we do, and at a much faster rate -- and they've been doing it the whole time. Data compression and transfer mechanisms have had to be (and surely have been) integrated into these sequencing technologies from the start (some related info here). There is no alternative for them: this data has to be compressed in some way. When you are getting terabytes of data per day from a HiSeq platform, you need to do something with it.
This is where I need to speculate a little. What if you are in a big institution and you share resources with a genomics core? Would these mechanisms be automatic? Would the central storage server run at higher speeds that could cause issues with data fidelity? Would it apply some level of data compression to files above a minimum size in order to control storage? I don't know. What I do know is that twice in the last month or so I've seen data that had lost quality. Hopefully it's a coincidence and not a pattern.
The next question, of course, is how do our universal formats like mzML and such deal with compression and transfer?
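For what it's worth, the compression commonly used inside mzML (zlib applied to the binary peak arrays) is lossless, so compression by itself shouldn't poke holes in anything: the round-trip gives back the exact same bytes. A quick check with Python's standard library, using a made-up four-peak intensity array:

```python
import struct
import zlib

# Pack a small "intensity array" as little-endian doubles, roughly the
# way mzML stores peak data before base64 encoding, then zlib it.
intensities = [1050.2, 3300.8, 870.1, 220.5]
raw = struct.pack("<%dd" % len(intensities), *intensities)

compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

# Lossless: the round-trip is bit-for-bit identical.
print(restored == raw)                 # -> True
print(struct.unpack("<4d", restored))  # the same four intensities come back
```

So if data is losing quality somewhere, I'd point the finger at the transfer or storage layer before blaming lossless compression in the file format itself.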
Sorry I don't have great answers here. I'm definitely curious whether you guys with actual knowledge on these topics can weigh in!