Saturday, November 29, 2014
Using the ENCODE database to hunt down novel proteoforms?
Is this stuff ever going to get simple? The answer appears to be a resounding "NO!" We are some crazy complex creatures and the more I learn the more I realize that we are getting just a tiny bit of the biological picture with any technique we use. Fortunately there are super smart people out there thinking of ways to integrate all of our tools so we can really get to the bottom of stuff!
I'm going to back up a little. ENCODE is short for the Encyclopedia of DNA Elements and it is an amazing genomics resource at UCSC that has been ongoing since 2003. ENCODE has been one way of trying to make sense of the wealth of DNA sequencing and expression data that has been rapidly building up out there. You can learn more about ENCODE at these two pages: (the original) and (the new ENCODE portal).
Now, when we look at the genomics stuff, one of the big problems is that we know the starting material, either the DNA or the RNA transcripts present. Like in proteomics, or anything else where we're making thousands, millions, or billions of observations, false discoveries are a problem. And we can only score false discoveries based on what we currently know to be true. Man, am I mangling this post or what?
The reason I'm rambling about this is that this sweet paper in press at JPR took a swing at integrating data from ENCODE with proteomics data in an effort to expand on the C-HPP (the Chromosome-Centric Human Proteome Project). While I'm simplifying this completely into the ground, the idea is: how many of the unexplained things that were scored as false observations in genomics can be explained by unmatched spectra from the proteomics run?
Turns out? Unsurprisingly, maybe? Quite a few! If unmatched spectra are driving you crazy, you might want to check out this cool paper and see if it might help you explain some of them.