Friday, June 22, 2018

MOFA -- Reduce the dimensionality of all the data!

What a great way to start my morning!
1) My Twitter feed popped up a paper that I checked out because it had a funny name (MOFA! link here!) and, while a little scared to check Google Images, it turns out it is a moped that has pedals!

2) I realized on page 3 or so of the paper that this is the one my wife was talking about, the one that started a conversation about how we should write to some journals and suggest that software links be provided in abstracts.

You can get the software for R or Python here.  This post isn't just rambly wasted time! The link is hard to find in the paper. With my service today complete -- time for (probably inaccurate) rambling!

What is MOFA?!?!  Multi-Omics Factor Analysis, duh.
Could that mean anything? Sure it could!

What does it mean here?

It means a new way of integrating data from all sorts of input -- the more I think about it, the more I like it. However, after 4 shots of espresso there is a period of time in the morning when I like everything, especially my cat.  Sorry, this has been cracking me up all week....he's fine with business catsual.

Stop laughing at the cat, Ben -- be serious and talk about dimensionality reduction!

How are we doing things in proteogenomics/metabogenomics/multi-omics right now?

1) Somebody does transcriptomics on the cell/patient and works out a huge list of the transcripts that are changing (and probably those that are unique to the cell -- variant call files and such, but let's ignore those right now).
2) Somebody else finds a list of small molecule features that are changing from sample to sample and assigns the best metabolite ID they can to all of those features
3) You identify as many PSMs as you can and then quantify those.

Generally these lists are reduced to what appears to be significantly different between the groups -- based on the significance that makes sense for each individual experiment. This is likely highly driven by the depth of coverage and the number of samples. It isn't hard to imagine a problem if you had 300 metabolites quantified compared to 30,000 transcripts quantified, right? Is the significance cutoff for the two lists the same? Sure, your cutoffs make sense in each individual experiment....
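To put rough numbers on that worry, here is a tiny sketch. The Bonferroni correction is my pick for illustration only (plenty of people use FDR-style corrections instead), and the raw p-value is made up. The feature counts are the ones from the paragraph above.

```python
# Toy illustration: the same raw p-value passes or fails depending on how
# many features were tested. Bonferroni is used only as an example correction.

def bonferroni_cutoff(alpha, n_tests):
    """Per-feature p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

alpha = 0.05
metabolite_cutoff = bonferroni_cutoff(alpha, 300)      # 300 metabolites
transcript_cutoff = bonferroni_cutoff(alpha, 30_000)   # 30,000 transcripts

raw_p = 1e-4  # hypothetical p-value for one feature (assumption, not real data)

print(f"metabolite cutoff: {metabolite_cutoff:.2e}")
print(f"transcript cutoff: {transcript_cutoff:.2e}")
print("significant as a metabolite:", raw_p < metabolite_cutoff)
print("significant as a transcript:", raw_p < transcript_cutoff)
```

Same evidence, different verdict, purely because of how deep each experiment went.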

Then someone converts those lists to something universal -- probably the proteins to gene IDs (which has some serious weaknesses I should ramble about some day) and then puts those lists all into KEGG or Ingenuity(tm) or something similar. (Perhaps the complete lists are fed to Ingenuity).

MOFA says -- before you do all that stuff -- why don't you just try reducing all the factors to what changes between your sample sets?

What is the output from all of these things? 3 dimensions.

Dimension 1: The patient or sample
Dimension 2: The transcript, PSM/protein, metabolite ID
Dimension 3: The relative quantification you get for Dimension 2
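If it helps to see those three dimensions as data, here is a minimal sketch. Every sample name, feature name, and value is invented, and `to_matrix` is a hypothetical helper of mine, not anything from the MOFA software.

```python
# Hypothetical long-format records: (sample, omics layer, feature, relative quan).
# All names and values are invented for illustration.
records = [
    ("patient_1", "rna", "TP53", 1.8),
    ("patient_1", "protein", "P04637", 0.9),
    ("patient_1", "metabolite", "lactate", 2.4),
    ("patient_2", "rna", "TP53", 0.4),
    ("patient_2", "protein", "P04637", 0.2),
    ("patient_2", "metabolite", "lactate", 1.1),
]

def to_matrix(records):
    """Pivot long-format records into a samples x features table.

    The omics layer is folded into the column name, so the resulting matrix
    treats every measurement as just another feature of the sample.
    """
    samples = sorted({r[0] for r in records})
    features = sorted({f"{layer}:{feat}" for _, layer, feat, _ in records})
    lookup = {(s, f"{layer}:{feat}"): q for s, layer, feat, q in records}
    # Missing measurements stay as None rather than being guessed.
    matrix = [[lookup.get((s, f)) for f in features] for s in samples]
    return samples, features, matrix

samples, features, matrix = to_matrix(records)
print(features)
for sample, row in zip(samples, matrix):
    print(sample, row)
```

Each row is a sample (Dimension 1), each column is a feature (Dimension 2), and each cell is the quan (Dimension 3).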

What if -- for just a minute -- you forget where that data came from? What if you didn't care that this was a metabolite and this was a transcript and so on? Now you just have a big list of things about your sample versus the other samples and their quan. Could you just reduce the data to seek the factors that are explaining the variance between Sample A and Sample B? (More realistically -- Sample Set A and Sample Set B -- a big n is going to be required to do it)
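Here is a rough sketch of that "forget where it came from" idea. To be loud about it: this is NOT MOFA. It is a plain PCA-style first component found by power iteration, which I'm using as a stand-in for "find the axis that explains the most variance". The data are invented (4 samples, 2 pretend transcripts, 2 pretend metabolites), and the function names are mine.

```python
# Stand-in for factor discovery: leading principal component via power iteration.
# This is ordinary PCA on a concatenated matrix, not the MOFA model.

data = [
    # rna:a  rna:b  met:a  met:b   (invented values)
    [2.0, 1.9, 0.1, 1.0],  # sample set A
    [2.2, 2.1, 0.0, 1.1],  # sample set A
    [0.1, 0.2, 2.0, 1.0],  # sample set B
    [0.0, 0.1, 2.1, 0.9],  # sample set B
]

def center_columns(rows):
    """Subtract each column's mean so variance is measured around zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def project(rows, w):
    """Score each sample along the direction w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]

def first_component(rows, n_iter=200):
    """Return (loadings, scores) for the leading principal component."""
    x = center_columns(rows)
    n_feat = len(x[0])
    w = [1.0] + [0.0] * (n_feat - 1)  # deterministic starting direction
    for _ in range(n_iter):
        scores = project(x, w)                                   # X w
        w = [sum(s * row[j] for s, row in zip(scores, x))        # X^T (X w)
             for j in range(n_feat)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w, project(x, w)

loadings, scores = first_component(data)
print("loadings:", [round(v, 3) for v in loadings])
print("scores:  ", [round(s, 3) for s in scores])
```

The scores split set A from set B on a single axis, and the loadings show which features drive that split regardless of which omics layer they came from. The real method adds a lot on top of this (sparsity, noise models, handling of separate views), so treat this only as the intuition.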

This is probably inaccurate -- but this is how I interpret what MOFA is doing. Massive multiomics data reduction.  Figure 5 was what finally convinced me I was on the right track logically about what was happening here. I suggest scrolling down to it and then starting into the results section.

The paper is open access and you should check it out. They look at 200 patient samples with multiomics data integration and pull out some really interesting observations, which suggests that this approach makes a lot of sense.

30,000 transcripts with abundance --> get a significant list
+ 3,000 metabolites with quan --> get a significant list
+ 8,000 proteins quantified --> get a significant list
Try to combine those significant lists with cutoffs that make sense in terms of the data source itself but perhaps border on arbitrary compared to the total variance from all the data as a whole.

OR MOFA it all down to what is really different between your samples first while using the sum of all the data points you've all worked so hard to generate together to increase the true power of this huge effort?

By the way -- they don't ignore the mutations and stuff in their study. They integrate all that too!

1 comment:

  1. I enjoyed reading this ;)
    And you are indeed on the right track. Rather than looking at the high-dimensional space in a supervised and feature-wise manner, why not find the relevant (latent) dimensions where interesting stuff happens? The intuition is exactly the same as in PCA (in fact, running MOFA in a single view should yield a similar solution to PCA, but with the benefits of sparsity and probabilistic modelling of noise).
    However, the key difference with PCA is that the MOFA factors disentangle the shared sources of heterogeneity between assays from the unique heterogeneity within assays. This allows you to summarise the variability of an entire multi-omics data set in a single plot (see Figure 1b, 2b and 5b).