Tuesday, November 22, 2016

Polymorphic Peptide Variants and Propagation in Spectral Networks


Subtitle: Why everyone needs to take a whack at proteomics data!

Need a paper to mull on while avoiding discussing politics with your family this holiday weekend? Think on this one!




What is it? Wait, you can't tell from the title? Come on!

In all seriousness, it is a really unique (to me, at least) way of thinking about what that unmatched spectra might be in that organism you don't have a good database for. And it might just be brilliant. I can't tell.

I gave it a good read and then thought about it in my car while I enjoyed the combination of normal D.C. and possibly early holiday traffic(?) and this is what I think is going on.(And I might totally have this wrong).

Imagine we're starting off with this organism that no one has sequenced before and we need to do proteomics on it. The mass spec side is the same as always (as long as it wasn't hard to lyse or whatever, of course) but then we've got no database for it. We could de novo it or use BICEPS, but these are both going to be super computationally expensive, full of false discoveries or require that you spend 2 years studying Python to use it (this approach may fail in one of these regards as well, I'll have to check).

Spectral networks goes sideways here. What if you could lower your bioinformatic load (what?!?) by running more samples? They go the easy route here and take 3 bacteria and do dd-MS2 on them. Then they take the spectra that are the most similar (by MS/MS fragments) and network them together. In this way you can 1) Find the most important features and 2) Start to limit what you're going to have to search.

I know this is wacky. Who has spare mass spec time?!? To this, I answer -- who can find a good bioinformatician for that salary that you can't seem to find a good mass spectrometrist for? Nobody, that's why!


Seriously -- what choice do we you are told to get some proteomics data on this organism? Wait and hope the genomics people are considering it a priority, will sequence it this year, and will annotate it by 2020?

Example set: They start with 3 species (or strains) of Cyanothece that biofuels people are seriously interested in that someone has done proteomics on. Serious proteomics:

Start with:
 >1e6 spectra/organism
Cluster the completely homologous peptides (identical ones from each run AND organism)
 = heck, if you search those conserved ones you're gonna have a massive reduction in search space (but you're going to miss what makes that organism why it isn't the other)
Cluster the MS/MS spectra that are only different in one mass shift. For example, the y ions are awesome till you get to the high mass ones and then each one is off by 8 in species 2 and 14 in species 3. (Or whatever). then move onto the next pairing!

As a side effect here, btw, you're going to get a quick understanding of evolutionary relatedness here -- without any genomic information on these guys! Most these MS/MS spectra are the same and you didn't get the samples mixed up? These things are related for sure!

In this run through they break their spectra into something like 16,000 networks. So....this is just a little more complex than the example 2 paragraphs up, but it is for illustration purposes only.

But check this out -- you now have these networks, where this spectrumA is equal to spectrumB (+8Da at y7/y8/y9) and spectrumC is equal to spectrumB (- something). Now that it is all linked you dump in some matched spectral data. Some stuff that is ID'ed and perfect. The MS/MS spectra are linked to IDs and it falls together like dominoes.

Does it work? They probably wouldn't have sent it to MCP if it didn't, but it definitely looks like it works. I find it makes more sense to me the more I think about it....

The pipeline is more complex than I described.


...but all the tools are freely available here. 

No comments:

Post a Comment