Today's lunchtime reading was an older paper. Some of the proteins we are interested in have high-variability regions. The way we've been dealing with them is a complex FASTA file containing all known sequences of these proteins from dozens of partially sequenced field isolates. Unfortunately, it looks like we're only seeing the tip of the iceberg. The next plan is to filter our peptide data and remove everything that matches. What we're going to be interested in is the stuff that doesn't match any entries in our database.
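To make the filtering idea concrete, here is a minimal sketch in Python. It assumes the peptides are already available as plain sequence strings and treats a "match" as an exact substring hit against any FASTA entry, which is a simplification of what a real search engine does. The function names, isolate headers and sequences are all made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current entry.
            records[header] += line.upper()
    return records


def unmatched_peptides(peptides, records):
    """Return the peptides that are not an exact substring of any entry."""
    sequences = list(records.values())
    return [p for p in peptides if not any(p.upper() in s for s in sequences)]


# Hypothetical database and peptide list, for illustration only.
fasta_text = """>isolate_A (hypothetical)
MKTAYIAKQR
QISFVKSHFS
>isolate_B (hypothetical)
MKTAYLAKQRQISFVK
"""

db = read_fasta(fasta_text)
novel = unmatched_peptides(["TAYIAK", "SHFS", "TAYVAK"], db)
print(novel)
```

Whatever survives this kind of filter is the candidate list to hand to a de novo tool.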
The first program I've chosen to evaluate is the PepNovo software.
The following paper was cited in the link above, and it's short, so I figured it was a good place to start (it first appeared in JPR in 2006).
De Novo Peptide Sequencing and Identification with Precision Mass Spectrometry
The central concept of this paper is that of homeometric peptides, which the authors define as different peptides with similar theoretical MS/MS spectra. The authors cite a number of reasons that these can and do occur, though I'm sure they are FAR more likely when you are looking at lower-resolution MS/MS spectra.
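A quick way to get a feel for this is to compute theoretical b and y ions for two peptides and see how many peaks they share at different mass tolerances. The sketch below is my own illustration, not the paper's method. It uses standard monoisotopic residue masses and singly charged ions only, and the peptides and tolerances are made-up examples.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_ions(peptide):
    """Return singly charged b and y ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal)
    return ions


def shared_fraction(pep_a, pep_b, tol):
    """Fraction of pep_a's ions that have a pep_b ion within tol Da."""
    ions_a = theoretical_ions(pep_a)
    ions_b = theoretical_ions(pep_b)
    hits = sum(1 for a in ions_a if any(abs(a - b) <= tol for b in ions_b))
    return hits / len(ions_a)


# Q and K differ by about 0.036 Da; I and L have identical masses.
low_res = shared_fraction("PEPTQDE", "PEPTKDE", tol=0.5)
high_res = shared_fraction("PEPTQDE", "PEPTKDE", tol=0.01)
leu_ile = shared_fraction("SAMPLER", "SAMPIER", tol=0.01)
print(low_res, high_res, leu_ile)
```

With the wide tolerance the Q/K pair is indistinguishable, while the tight tolerance separates every fragment that contains the substituted residue. The I/L pair stays identical at any tolerance, since the two residues have the same mass.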
The authors propose that the software should produce multiple candidate de novo sequences, which can then be narrowed down or filtered by other means.
The big advance in this paper is the description of the Dancik scoring algorithm. From what I can understand of this algorithm, in addition to what normal de novo sequencers do, the Dancik scoring ranks the fragment ions by intensity (1st most intense, 2nd, etc.). The most intense fragments are considered the most likely to be b or y ions, and the probability assigned to the output peptide sequence takes this into consideration.
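To illustrate the intensity-rank idea as I understand it, here is a toy sketch. It is not the scoring function from the paper: the `1 / rank` weighting, the function names and the peak list are all assumptions made up for illustration. The only point is that a candidate explaining the most intense peaks should score higher than one explaining the same number of weak peaks.

```python
def intensity_ranks(peaks):
    """Map each peak's m/z to its intensity rank (1 = most intense)."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return {mz: rank for rank, (mz, _) in enumerate(ordered, start=1)}


def rank_weighted_score(peaks, theoretical_mz, tol=0.01):
    """Sum 1/rank over observed peaks explained by a theoretical ion.

    The 1/rank weight is an arbitrary illustrative choice, not the
    published scoring function.
    """
    score = 0.0
    for mz, rank in intensity_ranks(peaks).items():
        if any(abs(mz - t) <= tol for t in theoretical_mz):
            score += 1.0 / rank
    return score


# Hypothetical spectrum as (m/z, intensity) pairs.
peaks = [(100.0, 50.0), (200.0, 900.0), (300.0, 400.0), (400.0, 10.0)]

# Two hypothetical candidates, each explaining exactly two peaks.
score_a = rank_weighted_score(peaks, [200.0, 300.0])  # the two strongest peaks
score_b = rank_weighted_score(peaks, [100.0, 400.0])  # the two weakest peaks
print(score_a, score_b)
```

Both candidates match two peaks, so a plain peak count could not separate them, while the rank-weighted score favors the one that accounts for the intense fragments.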
They then take this scoring algorithm and interrogate a dataset generated by a 7-tesla(!) FT machine and conclude that it is an improvement over other scoring methods.