Thursday, February 27, 2020

Building a proteomics METADATA standard!


Okay -- I'm already bored of this post and I haven't finished the first sentence. It could be said that I don't tolerate boredom ....particularly well.... but this is so important that I'm going to stick it out to the end because...



--THE FUTURE NEEDS US to make some really boring decisions and do some really boring stuff! (Okay...maybe I can jazz this up some? Let's see!)

How many times have you gone to get a proteomics data file from one of these things


-- and ran into a list like this -- 


-- where the list of samples in the folders doesn't line up AT ALL with anything listed in the methods or results section of the paper. (Chances are you actually haven't ever run into that until now, when I started uploading extreme examples into PRIDE to make a point or three.)


However, I bet you've had trouble figuring out which file is which a few times. I know I've bugged a lot of people because I've been confused. This is fun, I think, because I get to email someone maybe I haven't met and compliment their work and theeeeeeeen eeeeeaasssse my way into WHY DIDN'T YOU LABEL THESE THINGS BETTER? WHAT IS WRONG WITH YOU??

Okay -- imagine this, though: what if you are a bioinformatician of today or the future and you get back behind the web interface of PRIDE because you want to compare 10 studies you found on the same type of cancer? 8 people did global proteomics on it at different points, the Olsen lab did really nice phospho, the Gundry lab did glycoproteomics on it, and you want to loop it all together. (You can actually do this now through the API!)

What? You have to contact 7-10 busy people? And...in this situation you're a bioinformatician. Have you met a real one? Contacting 10 busy people might not be at the top of their favorite things list. (I'm generalizing because it's funnier that way.)

Chances are they'll just drop the ones that they can't figure out on their own, or where the corresponding author was on vacation. Maybe that's your work they just dropped. No meta-analysis or citations for you! 

OR 

We can work on a universal upload organization standard thing now -- something like this
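
To make the idea concrete, here is a rough sketch of what a sample-to-file table buys you. This is my own toy illustration, NOT the format the consortium is proposing: the column names (`file_name`, `sample`, `tissue`, `enrichment`, `fraction`), the file names, and the helper functions are all made up. The point is just that once every uploaded file has a row, a script can figure out which file is which without emailing anyone, and can flag the files that nobody annotated.

```python
import csv
import io

# Hypothetical sample-to-file table. Column names and values are
# illustrative assumptions, not an actual community standard.
METADATA_TSV = """\
file_name\tsample\ttissue\tenrichment\tfraction
run_01.raw\tpatient_A\ttumor\tnone\t1
run_02.raw\tpatient_A\ttumor\tphospho\t1
run_03.raw\tpatient_B\tnormal\tnone\t1
"""


def load_annotations(tsv_text):
    """Parse the tab-separated table into {file_name: row_dict}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["file_name"]: row for row in reader}


def unannotated_files(uploaded_files, annotations):
    """Return the uploaded files that have no row in the metadata table."""
    return sorted(f for f in uploaded_files if f not in annotations)


def files_for(annotations, **criteria):
    """Return file names whose annotation matches every key=value given."""
    return sorted(
        name
        for name, row in annotations.items()
        if all(row.get(key) == value for key, value in criteria.items())
    )


if __name__ == "__main__":
    annotations = load_annotations(METADATA_TSV)
    uploaded = ["run_01.raw", "run_02.raw", "run_03.raw", "QC_final_v2_REAL.raw"]
    # Files someone would otherwise have to email the author about.
    print(unannotated_files(uploaded, annotations))
    # Files a meta-analysis could pull in automatically.
    print(files_for(annotations, tissue="tumor", enrichment="phospho"))
```

With a table like this sitting next to the raw files, the "which file is which" question turns into a lookup instead of an email thread.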


One of the things the genomics people have done right (well, probably many of them, not all) is standardize their data uploads. This probably evolved out of the fact that one data file can be 1 TERAbyte and you don't want to upload/download that more than once in your life.

Our smart and forward-thinking bioinformaticians in proteomics (or probably the people that we're right on the cusp of driving completely insane -- probably the same people) are working on a plan and they would like community help and guidance. 


There are a lot of boring ideas here, but this is going to impact us and the future and whether our data is going to be remembered and reused later.  



ProteomeTools is on this list 3 times! I think PXD000138 is the synthesized phosphopeptides. And the proteomics of 29 healthy tissues -- that's relatively new -- but what a great idea that was!  This makes me think I should check out every set of files that I don't recognize. 

If we get these annotated, this gives the people of the great global ProteomeXchange consortium -- and the field of Proteomics --


-- a pathway to the future 

Okay... the original video of this might be better than this....


Whew...I did it. That wasn't as boring as I thought it was going to be.

Update: Here are some organized notes on this!
