Sunday, July 14, 2013
Dirty genomics and proteomics
The term "dirty genomics" is a new one to me as of this morning, but it is a good name for something that is coming up all the time now: how do we use the wealth of high-throughput genomics data that keeps pouring down the pipeline?
Many of these sequences are incomplete, and virtually all are annotated by algorithm only. From what I've seen, these algorithms struggle to fit the sequences into a specific template (like FASTA). Even so, the sequences are undoubtedly very valuable.
For many organisms, these incomplete sequences are all that is available. On top of that, I have heard from a number of researchers that supplementing the standard human FASTA databases with incomplete sequencing data from the particular cell line, strain, or person being studied adds significant value to the proteomics data set.
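To make the idea of "supplementing" a database concrete, here is a minimal Python sketch of one way it could be done: read a reference FASTA and a sample-specific FASTA, then append only the sample entries whose sequence is not already in the reference. This is my own illustration, not a method from any of the researchers mentioned. The function names, the "dirty" header tag, and the toy headers and sequences are all made up for the example, and a real workflow would need to handle things this ignores (partial matches, translation of nucleotide reads, decoys for false discovery rate estimation).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(reference, supplement, tag="dirty"):
    """Return the reference entries plus supplement entries with new sequences.

    Supplement headers get a hypothetical `tag|` prefix so that hits from the
    incomplete data can be told apart in the search results.
    """
    merged = list(reference)
    seen = {seq for _, seq in merged}
    for header, seq in supplement:
        if seq and seq not in seen:
            seen.add(seq)
            merged.append((f"{tag}|{header}", seq))
    return merged


def format_fasta(entries, width=60):
    """Render (header, sequence) pairs as FASTA text, wrapped at `width`."""
    out = []
    for header, seq in entries:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy data only: invented headers and sequences, not real proteins.
reference = list(read_fasta(""">ref_protein_1
MKTAYIAK
>ref_protein_2
GAVLIPFW
""".splitlines()))

supplement = list(read_fasta(""">contig_17_orf_3
GAVLIPFW
>contig_42_orf_1
MKTAYLAK
""".splitlines()))

merged = merge_fasta(reference, supplement)
print(format_fasta(merged), end="")
```

In this toy case the first supplement entry is an exact duplicate of a reference sequence and is skipped, while the second differs by one residue and is kept with the tag on its header.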
The published information seems a bit sparse at this point and appears to be limited to microorganisms, but there is some out there, including this 2009 PLoS ONE article in which the authors had success using an incomplete sequence.
I'm not going to offer a solid opinion here; I need more data first. But I really wanted to kick out this dirty genomics idea, and the thought that this may be a resource we can use to improve our results.