Sunday, July 14, 2013

Dirty genomics and proteomics

The term "dirty genomics" is a new one to me as of this morning, but it is a good name for something that is coming up all the time now: how do we use the wealth of high-throughput genomics data that keeps pouring down the pipeline?

Many of these sequences are incomplete, and virtually all are annotated by algorithm only. From what I've seen, these algorithms struggle to fit the sequences into a specific template (like FASTA). Even so, the sequences are undoubtedly very valuable.

For many organisms, these incomplete sequences are all that is available. On top of that, I have heard from a number of researchers that supplementing the standard human FASTA databases with incomplete sequencing data from the specific cell line or strain (or person) being studied adds significant value to the proteomics data set.
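To make the idea concrete, here is a minimal Python sketch of supplementing a reference FASTA database with extra sequences. It is not a published pipeline, just one way this could look. It assumes the incomplete sequencing data has already been translated into protein sequences somewhere upstream. The function names, the `SUPP|` header prefix, and the `min_length` cutoff of 7 residues are all my own placeholders, not anything from a specific tool.

```python
from typing import Iterable, Iterator, List, Tuple

Entry = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(reference: Iterable[Entry],
                supplement: Iterable[Entry],
                min_length: int = 7) -> List[Entry]:
    """Keep every reference entry, then append supplement entries whose
    sequence is not already present. min_length is an assumed cutoff."""
    merged: List[Entry] = []
    seen = set()
    for header, seq in reference:
        merged.append((header, seq))
        seen.add(seq)
    for header, seq in supplement:
        seq = seq.rstrip("*")  # drop a trailing stop symbol, if any
        if len(seq) < min_length or seq in seen:
            continue  # too short to be useful, or an exact duplicate
        seen.add(seq)
        # Hypothetical prefix so supplemented hits are easy to spot later.
        merged.append(("SUPP|" + header, seq))
    return merged
```

Tagging the added entries means any identification that comes only from the "dirty" sequences can be pulled out and looked at separately. Note that this only removes exact duplicates; near-identical sequences would need a smarter comparison.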

The published information seems a bit sparse at this point and appears to be restricted to microorganisms, but there is some out there, including this 2009 PLOS ONE article in which the authors had success using an incomplete sequence.

I'm not going to introduce a solid opinion here, since I need more data, but I really wanted to throw this dirty genomics idea out there, along with the thought that this may be a resource we can use to improve our results.


  1. I agree in general with the idea of using genomics (or transcriptomics) data in a proteomics experiment. That's very interesting indeed. Have you already dug into proteogenomics?

  2. Celine,
    I had to do this to a certain extent during my previous life as a malaria researcher. We were frequently updating our FASTA databases with newly sequenced environmental isolates, and occasionally going back to re-mine our old data with the new sequences as they appeared. I've also helped some people with this. I think the next challenge will be developing a high-throughput way to merge these rapidly growing data sources.