Sunday, January 6, 2019
CHESS -- The New Human Genome Catalog!
Okay....so...this...is...a...thing...I...definitely...do...NOT...need.... I don't even like chess. It stresses me out....and I'm already off topic. Back!
Let's talk about this CHESS.
If you do human-based proteomics, chances are you base this on human-based genomics stuff that has been converted and cleaned up and annotated into nice human protein .FASTA or .XML files for you.
The trouble is that the genetics people can't seem to decide yet on how many human genes there actually are. It's gotten so bad that we've had to get involved with projects like C-HPP.
But there is SO MUCH genomics and transcriptomics data out there. Couldn't someone just get 9,795 human RNA-Seq files and come up with a brilliant way to figure out what genes humans actually make transcripts for? Is that so hard? What would it be, at most, 900 BILLION transcript measurements?
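For a sense of scale, here's a quick back-of-the-envelope sketch using only the two numbers above (9,795 files and a ceiling of 900 billion measurements). The variable names are mine, and the per-file figure is just an average of that upper bound, not something reported by the paper.

```python
# Rough scale check using the two numbers quoted above.
n_files = 9_795                       # RNA-Seq files
max_measurements = 900_000_000_000    # "at most" 900 billion

# Average ceiling per file (an upper bound, not a measured value).
per_file = max_measurements / n_files
print(f"~{per_file / 1e6:.1f} million measurements per file, at most")
```

So that ceiling works out to somewhere around 92 million measurements per file.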
That's what this group did. The scale here is just ridiculous. I'm not all up on the conversion of transcript reads and things, but my understanding is that the HiSeq generates around 200GB of data -- and that system is rapidly being replaced by one that generates a TERABYTE or more of data per sequencing sample. The MacCoss lab did some stuff with data minimization of RNA-Seq a few years ago, but for this analysis I don't think there is any way this group could do that. How much HPC firepower did they have access to? A lot.
Honestly, this is yet another paper this weekend that I can read and hear the sounds the words are making in my head, but I can't really grasp. What I can grasp is a HUMAN PROTEIN FASTA I CAN DOWNLOAD!! You can get it here.
The format is funny when I look at it in my FASTA browser thing.
-- but this can't be anything but useful, right? This is the stuff that we, as a species, make transcripts for!!
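If you'd rather peek at the headers yourself than squint at a FASTA browser, here's a minimal sketch of a plain-Python FASTA reader. It assumes nothing beyond the generic FASTA layout (a `>` header line followed by sequence lines). The example records are made up for illustration and are NOT what the CHESS headers actually look like, and `chess_proteins.fasta` in the comment is a hypothetical file name.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from generic FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes out the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, only to show the mechanics.
example = [">entry_1 made-up description", "MKT", "AYI", ">entry_2", "GGS"]
records = list(read_fasta(example))
for header, seq in records:
    print(header, len(seq))

# On a real download (hypothetical file name), the same function works on
# an open file handle:
#     with open("chess_proteins.fasta") as fh:
#         for header, seq in read_fasta(fh):
#             print(header)
```

Printing just the headers is usually enough to see how the entries are named before pointing a search engine at the file.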