Wednesday, February 24, 2016
Use more than one search engine? Why?
This week I got some great questions from a researcher who is relatively new to the field. I love getting these kinds of questions because they remind me of things I wondered about when I started out and later forgot. (I've started a new FAQ page that will appear on the right at some point -- soon, if I finally find the entrance to Kami's hyperbolic time chamber...)
One question was specifically about search engines, including what advantages you get from using more than one. There have been several good papers on this over the years, but I'd argue that this one from David Shteynberg et al. is the most comprehensive look at the subject.
While the primary focus of the paper is more on how to deal with FDR when using a bunch of different algorithms, there are a number of interesting figures and details on the peptide spectrum matches (PSMs) that show up when you add these other algorithms. I think if you really dug for it you'd find at least 10 papers over the years that come up with something like this:
1 engine = x PSMs
2 engines = ~10% more PSMs than 1
Add a third? = maybe 3% more PSMs than 2
I'm embarrassed by how long it took me to write that (as well as how dumb the endpoint looks), but I hope it is reasonably clear.
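If it helps, here's the same diminishing-returns idea as a toy Python sketch. The numbers are invented for illustration (not from any real search results); each engine's PSMs are modeled as a set of spectrum identifiers, so combining engines is just a set union:

```python
# Toy model of diminishing returns from adding search engines.
# PSM identifiers are just integers here; the overlaps are made up.
engine1 = set(range(10000))          # engine 1 finds 10,000 PSMs
engine2 = set(range(1000, 11000))    # mostly overlaps engine 1, adds 1,000 new
engine3 = set(range(2000, 11330))    # mostly redundant, adds 330 new

union12 = engine1 | engine2
gain2 = len(union12) / len(engine1) - 1      # ~10% more than one engine

union123 = union12 | engine3
gain3 = len(union123) / len(union12) - 1     # ~3% more than two engines

print(len(union12), round(gain2, 2))
print(len(union123), round(gain3, 2))
```

The union sizes (11,000 then 11,330) mirror the "10% then 3%" pattern above: each added engine mostly re-finds PSMs the earlier engines already matched.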
Now, there may be big differences here. Some algorithms are very similar -- Comet and Sequest, for example, have very similar underlying algorithms, so using the two together might not give you 10% more IDs. In the paper I mention above, they define a concept called Search Engine Complementarity (I'll add this to the Translator now!); the equation is in the paper, but in general it measures the amount of overlap between two search engines. Ben's bad-at-math translation:
Search Engine Complementarity (SEC); higher SEC = more new peptide IDs from the same dataset
In this example Sequest + Comet would have a lower SEC than two very different algorithms. The super-secret Mascot engine and InSpecT were found to have the highest SEC in this dataset.
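To make the intuition concrete, here's a toy Python stand-in. To be clear, this is NOT the SEC equation from the Shteynberg paper (go read the paper for that) -- it's just my rough sketch scoring "what fraction of the combined IDs did only one engine find," with made-up peptide lists:

```python
def complementarity(ids_a, ids_b):
    """Toy complementarity score: fraction of the combined peptide IDs
    that were found by exactly one of the two engines. NOT the paper's
    actual SEC equation -- just a rough illustration of the idea."""
    union = ids_a | ids_b
    unique = union - (ids_a & ids_b)   # IDs found by only one engine
    return len(unique) / len(union)

# Similar algorithms (think Sequest vs. Comet): big overlap, low score.
sequest_like = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"}
comet_like   = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDEE"}
print(complementarity(sequest_like, comet_like))   # 0.4

# Very different algorithms: small overlap, high score.
engine_x = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEF", "PEPTIDEG"}
engine_y = {"PEPTIDEA", "PEPTIDEH", "PEPTIDEI", "PEPTIDEJ"}
print(complementarity(engine_x, engine_y))         # ~0.857
```

The second pair scores much higher because almost everything each engine found was new to the other -- which is exactly why pairing very different algorithms buys you more new peptide IDs.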
This paper is a couple of years old, so some new stuff has popped up that didn't make the study -- notably MS Amanda and MS-GF+. If you follow the figures in the launch paper for the latter, you'll see what looks like a very high SEC for MS-GF+ and Mascot (the only software it was compared against). In my hands, I find a lower SEC when it's paired with Sequest (though I feel it's higher when paired with Mascot), but these are just rough measurements.
An interesting factor to consider here, though, is that these are all complex statistical algorithms, and a concept like the SEC may be drastically altered when looking at different datasets. Case in point: Sequest + MS Amanda produce very similar results in my hands -- until I'm looking at a relatively high number of dynamic modifications in high-resolution MS/MS, and then I see the two begin to diverge.