Tuesday, January 19, 2016

moCluster - Rapidly make sense of huge multi-omics data sets!

So somebody out there has generated a few dozen terabytes of transcriptomics data on the samples that you are generating millions of spectra on? What should you do next?

Maybe you need to grab that data and do some clustering to see what stands out. There are a couple of algorithms out there that will do this, like iCluster, but what if you want mo'?

Then the Kuster lab would like to introduce you to moCluster. In this paper, Chen Meng et al. describe a new R Bioconductor package (under "mogsa") that can perform these clustering analyses better, and 1,000x faster!

How's it work? Well, it starts here...


...and then it gets complicated. If any of you guys who are good with math that contains letters have feedback, please leave some comments. I'm not exactly...qualified to review this part of the paper.

You might wonder why I'm writing this. And why I think the Kuster Cluster passes muster.

Remember this awesome study where they did proteomics of the entire NCI-60 panel? What if they showed that they could take their algorithm, this proteomics data set AND some transcriptomics data, and start to see differentiation in clustering based on cell origin and cancer type?

If you take the transcriptomic data from the NCI-60 panel and do normal clustering via principal component analysis (PCA), you're probably going to end up with a figure like this one I published a couple years ago (the ones in red are stromal invasion):


A couple of weird things are going to score as nasty outliers, and then you are going to end up squeezing your maths and getting just a gobbledygook of your cell lines all clustered together. If you go through and remove "outlier" after "outlier" you'll eventually start to see something that approaches clustering. The problem is that cancer cells are so messed up and variable in what makes them cancerous that it is just about impossible to make sense out of what you are seeing. Not impossible, but hard.
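
If you want to see that squeezing effect for yourself, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the data are random toy numbers (not NCI-60 measurements), and the helper names `center`, `first_pc_scores` and `toy_sample` are hypothetical. It finds the first principal component by power iteration, once with a single extreme sample in the mix and once without it.

```python
import random

def center(rows):
    """Subtract each column's mean so PCA describes variation, not offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def first_pc_scores(rows, iterations=200, seed=0):
    """Score every sample on the first principal component.

    The component is approximated by power iteration on X^T X,
    where X is the column-centered data matrix.
    """
    x = center(rows)
    start_rng = random.Random(seed)
    v = [start_rng.uniform(-1.0, 1.0) for _ in x[0]]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]      # X v
        w = [sum(s * row[j] for s, row in zip(scores, x))               # X^T (X v)
             for j in range(len(v))]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

# Toy data (an assumption, not real measurements): 5 "genes" per sample,
# and only the first two genes differ between the two groups.
rng = random.Random(42)

def toy_sample(shift):
    signal = [shift + rng.gauss(0, 0.5) for _ in range(2)]
    noise = [rng.gauss(0, 0.5) for _ in range(3)]
    return signal + noise

group_a = [toy_sample(0.0) for _ in range(10)]
group_b = [toy_sample(4.0) for _ in range(10)]
outlier = toy_sample(0.0)
outlier[4] += 60.0  # one gene goes haywire in one sample

with_outlier = first_pc_scores(group_a + group_b + [outlier])
without_outlier = first_pc_scores(group_a + group_b)

# With the outlier present, PC1 is spent describing that one sample.
worst = max(range(len(with_outlier)), key=lambda i: abs(with_outlier[i]))
print("largest |PC1 score| with outlier present: sample", worst)

# Without it, PC1 is free to separate the two groups.
print("group A:", [round(s, 1) for s in without_outlier[:10]])
print("group B:", [round(s, 1) for s in without_outlier[10:]])
```

In this toy case one bad sample eats the whole first component, and the real group structure only shows up on PC1 once that sample is gone. Real data sets have many such samples, which is why the remove-and-repeat routine gets old fast.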

Clustering, btw, is a really central technique to how genomics people make sense of their data.
So what if you had the complementary information of the transcriptomics and proteomics and did some clustering? Can that improve what you're seeing?


From the proteomic data (right) you can't really differentiate the melanoma cell lines from everything else. However, melanoma does fall out in the transcriptomics data. The opposite goes for the leukemia. When you look at the transcriptomics data, the leukemia is right in the mix with the rest of the cell lines, but it differentiates strongly in the proteomics data. How does leukemia differ from other cancers? Look at the proteins that are driving this differentiation!! Is the differentiating factor the original cell type involved? Maybe, but maybe the difference is that family of protein pumps that also makes melanomas so resistant to that chemotherapeutic you've been working with!
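
To make the "complementary information" idea concrete, here is a small plain-Python sketch. To be clear, this is NOT the moCluster algorithm (I said above I'm not qualified to reproduce that math). It is a much simpler stand-in: scale each data block so neither one dominates, glue the blocks together, and check whether each sample's nearest neighbor comes from the same cancer type. The data are invented toy numbers in which "melanoma" only stands out in the transcript block and "leukemia" only stands out in the protein block. The names `noisy`, `scale_block` and `nearest_neighbor_hits` are hypothetical.

```python
import random

rng = random.Random(7)

def noisy(values, sd=0.5):
    """Add Gaussian noise to a list of toy expression values."""
    return [v + rng.gauss(0, sd) for v in values]

# Toy design (an assumption, not real data): 16 cell lines, two omics blocks.
labels = ["melanoma"] * 4 + ["leukemia"] * 4 + ["other"] * 8
transcripts = [noisy([6, 6, 6, 0, 0, 0] if lab == "melanoma" else [0] * 6)
               for lab in labels]
proteins = [noisy([6, 6, 0, 0] if lab == "leukemia" else [0] * 4)
            for lab in labels]

def scale_block(rows):
    """Center each column, then divide the whole block by its overall norm
    so that every block contributes the same total variance."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    norm = sum(x * x for row in centered for x in row) ** 0.5
    return [[x / norm for x in row] for row in centered]

def nearest_neighbor_hits(rows, labels, group):
    """Count samples of `group` whose nearest other sample shares its label."""
    hits = 0
    for i, row in enumerate(rows):
        if labels[i] != group:
            continue
        others = (k for k in range(len(rows)) if k != i)
        nearest = min(others, key=lambda k: sum((a - b) ** 2
                                                for a, b in zip(row, rows[k])))
        if labels[nearest] == group:
            hits += 1
    return hits

t_block = scale_block(transcripts)
p_block = scale_block(proteins)
combined = [t + p for t, p in zip(t_block, p_block)]  # concatenate features

for name, rows in [("transcripts", t_block), ("proteins", p_block),
                   ("combined", combined)]:
    for group in ("melanoma", "leukemia"):
        hits = nearest_neighbor_hits(rows, labels, group)
        print(f"{name:12s} {group:9s} {hits}/4 nearest neighbors match")
```

In the combined space both toy cancer types should find their own kind, while each single block can only be trusted for the type it happens to see. Real multi-omics integration has to cope with far messier signal than this, which is where the fancier math comes in.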

TL;DR: Proteomics improves everything genomics, if you can figure out how to leverage it. moCluster looks like an awesome and fast new step forward in unifying global -omics information!


