Okay y'all. I'm going to approach this one with a healthy pile of skepticism, but I need a solution - and probably you do as well. A small label free single cell study for us - like one 384 well plate - generates maybe 500-600 GB of raw (Bruker .d) data. Then to run our data in Spectronaut we first have to do the absurdly infuriating process of converting it to a special Spectronaut format. It's called .HTRMS, which is probably Swiss for "Hard drive (T?) Room Makes no Sense". This takes your .d file and makes a second file that can only be read by Spectronaut and is almost exactly the same size. Now one 384 well plate puts you at a TERAbyte or more.
The problem here is that neither of these things is a UNIVERSAL format. The always forgotten, frequently cursed, consistently ignored, but-they-keep-going-anyway Proteomics Standards Initiative has forever tried to come up with ways to store mass spec and proteomics data in "universal formats". They've had some great ideas. We've ignored all of them. They've evolved those ideas as files went from megabytes to gigabytes, which didn't really change much because we - as a field - ignored all the stuff they were talking about anyway.
mzML or mzXML or whatever we're supposed to be working with doesn't work for me. A Bruker .d file still increases in size by about 10x when converted. So...my 384 well plate is now 6 TB, and that's the size of my largest onboard hard drives.
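If you want to put a number on it for your own runs, the simplest check is comparing on-disk size before and after conversion. Here's a rough Python sketch of that, assuming you have ProteoWizard's msconvert on your PATH - the file names are hypothetical, and the expansion factor you see will depend on your conversion options (peak picking, compression, 32 vs 64 bit).

```python
import os
import subprocess

def dir_size_gb(path):
    """Total size of a directory tree (a Bruker .d 'file' is really a folder)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Hypothetical file names -- point these at one of your own runs.
raw_run = "plate01_A1.d"
out_dir = "converted"

# 64-bit, zlib-compressed mzML via ProteoWizard's msconvert.
subprocess.run(
    ["msconvert", raw_run, "--mzML", "--64", "--zlib", "-o", out_dir],
    check=True,
)

mzml_path = os.path.join(out_dir, "plate01_A1.mzML")
raw_gb = dir_size_gb(raw_run)
mzml_gb = os.path.getsize(mzml_path) / 1e9

print(f".d folder: {raw_gb:.1f} GB")
print(f"mzML file: {mzml_gb:.1f} GB")
print(f"expansion: {mzml_gb / raw_gb:.1f}x")
```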
What we need is something that can not only deal with the files of today, but allow us all to deal with the files of tomorrow. I've got some files from the Billie prototype, and those files are 10x larger than anything my TIMSTOFs generate. It's not out of the question that things like the 8600, which are doing scanning SWATHs at absurd rates of speed, are also going to generate preposterous amounts of data - let alone behemoths such as the Astral and Athena.
I don't know. A bunch of these people seem to contribute to the PSI, which immediately makes me want to ignore this whole thing, but the Parquet file format does sound like it might have a bunch of advantages over .sqlites and XMLs and even locked proprietary binary files that are easily corrupted by transferring from one location to another.
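For what it's worth, the appeal is easy to demo. Parquet is a columnar format with compression built in, so peak data can live in a plain table and you only pull back the columns you actually need. Here's a toy pyarrow sketch - the column layout is purely illustrative, not the actual mzPeak schema:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Fake peak data standing in for ~1000 spectra. Illustration only --
# these column names are NOT the real mzPeak schema.
n_peaks = 1_000_000
table = pa.table({
    "spectrum_index": np.repeat(np.arange(1000), 1000).astype(np.uint32),
    "mz": np.random.uniform(300, 1500, n_peaks),
    "intensity": np.random.exponential(1e4, n_peaks).astype(np.float32),
    "ion_mobility": np.random.uniform(0.6, 1.6, n_peaks).astype(np.float32),
})

# Columnar and compressed on disk; zstd generally does well on m/z and intensity arrays.
pq.write_table(table, "peaks.parquet", compression="zstd")

# Read back only the columns you need -- no XML parsing, no full-file scan.
mz_only = pq.read_table("peaks.parquet", columns=["spectrum_index", "mz"])
print(mz_only.num_rows, "rows,", round(mz_only.nbytes / 1e6), "MB in memory")
```

And because it's an open columnar format, the same file is readable from Python, R, Rust, or whatever a vendor happens to ship, without anybody reverse engineering a binary blob.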
Last author on the paper, and coordinator of the standardization effort here, so take this with a grain of salt.
We've worked hard on a couple of different paths to try to prevent this effort from being the "15th competing standard" (https://xkcd.com/927/):
- Holding roundtables at HUPO and ASMS to discuss with interested parties in both academia and industry.
- Running a poll to determine current needs for storing MS data.
- Writing and distributing a whitepaper within the committee and to key opinion leaders in industry to make sure that our goals align with their needs.
- Publishing this paper *before* finalizing technical details of the project.
- Making it clear that one of the deliverables from the project is going to be well-publicized reference implementations.
Now, none of that guarantees that this will see widespread acceptance, but it does address shortcomings in previous standardization efforts:
- Getting community buy-in in both industry and academia.
- Getting technical feedback from the vendors to make sure that we aren't promising impossible things.
- Making the integration of the standard into existing products as easy as possible.
I'm happy to answer any further questions and really hope to be able to move this forward.
Also, I'd be remiss not to link to the current prototype that Joshua Klein has been developing: https://github.com/mobiusklein/mzpeak_prototyping
It's a PSI (or PSI-adjacent) initiative. But that's not necessarily a bad thing. Unfortunately, I don't see the point of this one. They seem to expect vendors to move away from their proprietary formats, and that's a bit crazy, because it would mean all the auxiliary information (status log, method parameters, etc.) would need to be stored in the mzPeak file as well.
I think mzMLb is good enough. It gives me a 4x expansion from Bruker TDF. I suppose Spectronaut doesn't read that, though (nor much else, to be honest; but they could!). But Bruker owns Biognosys: doesn't Spectronaut read the TDF directly? Isn't HTRMS conversion an optional thing, only there to speed up repeated analyses?
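For anyone who hasn't looked under the hood: mzMLb is basically the mzML document packed into an HDF5 container, with the binary data arrays pulled out into separate, chunk-compressed datasets. So you can poke at one with nothing more exotic than h5py - the file name here is hypothetical, and the exact dataset names depend on how the file was written:

```python
import h5py

# Hypothetical file name -- any mzMLb file will do.
with h5py.File("run.mzMLb", "r") as f:
    # List the top-level datasets: the embedded mzML document plus the
    # chunk-compressed binary data arrays it points into.
    for name, node in f.items():
        if isinstance(node, h5py.Dataset):
            print(f"{name:40s} shape={node.shape} dtype={node.dtype} "
                  f"compression={node.compression}")
```

(pyteomics also has an mzmlb reader if you'd rather get spectra back than raw datasets.)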