Friday, February 7, 2014
RNAseq plus Proteomics. It's coming!
RNAseq "next gen sequencing" data is becoming more and more common every day. The instruments are getting better, cheaper and faster all the time. From what I'm seeing and hearing, I'm expecting to see a ton of posters and papers out there where shotgun LC-MS/MS data is searched against RNAseq data.
The big problem? The size of the two datasets. Especially the RNAseq data. It's pretty tough to search against a 500GB genomics file with our LC-MS/MS data.
Sunghee Woo et al. to the rescue! In the paper "Proteogenomic Database Construction Driven from Large Scale RNAseq Data", this team from UCSD (with some help from Mike MacCoss) demonstrates how to efficiently use the two technologies together.
They start by dramatically reducing the RNAseq data from a 400GB output down to a 400MB FASTA file, by a stepwise removal of redundancy and less useful data. The logic is clearly defined and seems really smart.
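To give a feel for what "stepwise removal of redundancy" can mean, here is a minimal Python sketch. It is not the pipeline from the paper. It just assumes you already have translated sequences as (header, sequence) pairs, and the function names `collapse_redundant` and `to_fasta` are made up for illustration. It drops exact duplicates first, then drops any sequence that is fully contained in a longer one.

```python
def collapse_redundant(records):
    """Reduce a list of (header, sequence) pairs in two steps.

    Illustrative only: a real proteogenomic pipeline does far more
    (and this quadratic containment check would not scale to 400GB).
    """
    # Step 1: drop exact duplicate sequences, keeping the first header seen.
    first_header = {}
    for header, seq in records:
        first_header.setdefault(seq, header)

    # Step 2: walk from longest to shortest and drop any sequence that is
    # a substring of a longer sequence we have already kept.
    kept = []
    for seq in sorted(first_header, key=len, reverse=True):
        if not any(seq in longer for _, longer in kept):
            kept.append((first_header[seq], seq))
    return kept


def to_fasta(records):
    """Render (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


if __name__ == "__main__":
    demo = [
        ("read_a", "MKTAYIAK"),
        ("read_b", "MKTAYIAK"),  # exact duplicate of read_a
        ("read_c", "TAYI"),      # contained in read_a
        ("read_d", "GGGG"),      # unique, stays
    ]
    print(to_fasta(collapse_redundant(demo)), end="")
```

Each step throws away entries that cannot add a new peptide match, which is the general idea behind shrinking the search space before the LC-MS/MS search.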
Next they compare the two datasets, the LC-MS/MS data and the new FASTA, in such a way that they end up complementing each other. So, not only do they get better matches for their proteomics data than they would with a fully annotated FASTA (what most of us want to use this for), but they also improve the annotation of their genomics data. That's right. They use the proteomics to correct the genomics data. Because the proteomics data can reveal single amino acid substitutions and splice variants, it improves the understanding of the genome.
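Here is a toy Python illustration of how a peptide can flag a single amino acid substitution against a reference sequence. This is only a string comparison under my own assumptions, not how the paper (or any real search engine, which scores spectra) does it, and `find_single_substitutions` is a hypothetical helper name.

```python
def find_single_substitutions(peptide, protein):
    """Find places where `peptide` matches `protein` except for one residue.

    Returns a list of (position, reference_residue, observed_residue),
    where position is the 0-based index in `protein`.
    """
    hits = []
    n = len(peptide)
    # Slide a peptide-sized window along the reference sequence.
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        diffs = [i for i in range(n) if window[i] != peptide[i]]
        # Exactly one mismatch means a candidate substitution.
        if len(diffs) == 1:
            i = diffs[0]
            hits.append((start + i, window[i], peptide[i]))
    return hits


if __name__ == "__main__":
    reference = "MKTAYIAKQR"
    observed_peptide = "TAYVAK"  # differs from the reference by I -> V
    print(find_single_substitutions(observed_peptide, reference))
```

A hit like this is the kind of evidence that could feed back into the genomic annotation, once it is backed by good spectra.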