I'm moving fast this morning, but I thought this was a fun thing to bring up. It would be great if every proteomics sample ONLY had the proteins that you digested and wanted to see, but that's not the case, right? You've got your stupid protease hanging around, and you've probably got dog and trash cat keratin falling off of you all the time. In the winter you get this great boost in wool peptide identifications. Common contaminant databases are critical, just about everyone uses them, and a lot of cool software now has the option to add them automatically.
And...maybe we ought to take a critical look at some of these lists.....
Imagine that you're doing some laser capture microdissection experiments on the epithelium of a tissue slice, and then imagine your surprise when you don't detect one of the major protein constituents that should be there. Weird, right? Did you toss keratin 7 because you hard filter your results and use a contaminants database that flags a few extra keratins?
If you're using the default contaminants.fasta that comes with every MaxQuant download, that might be the case.
There might be 10 new proteomics studies this fall already on Ubiquitin-Conjugating Enzymes. It's a hot topic out there. I wish you all luck. Blech. If you're doing a meta-analysis of this data and using a hard filter, you might not see a few of the proteins if your contaminant database is derived directly from the great Global Proteomics Machine cRAP database.
The GPM website clearly breaks out the contaminants in the database by type. There are a bunch of human proteins on the list that are common contaminants if you use the Sigma UPS standard, which a lot of labs do. However, there are some really cool proteins in those standards!
The direct FASTA download doesn't break the proteins out that way (it can't, FASTA isn't exactly a flexible thing) and it looks like a couple of pieces of software have either taken cRAP verbatim or have started with it and added their own in-house observations to it. The MetaMorpheus contaminants XML definitely has these proteins in it, for example.
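If you want to check whether your own contaminants FASTA contains something you actually care about, a quick header scan is one way to do it. This is just a minimal sketch: the function name `find_entries` and the example headers are made up for illustration and are not taken from cRAP, MaxQuant, or MetaMorpheus, so swap in the text of your real file and your own keywords.

```python
def find_entries(fasta_text, keywords):
    """Return FASTA headers that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for line in fasta_text.splitlines():
        # Header lines in FASTA start with '>'; everything else is sequence.
        if line.startswith(">"):
            header = line[1:].strip()
            if any(k in header.lower() for k in lowered):
                hits.append(header)
    return hits


# Hypothetical headers and sequences, for illustration only.
example_fasta = """>CONT_0001 Trypsin, porcine
AAAA
>CONT_0002 Keratin, type II cytoskeletal 7
CCCC
>CONT_0003 Ubiquitin-conjugating enzyme E2 (hypothetical entry)
DDDD
"""

hits = find_entries(example_fasta, ["keratin", "ubiquitin-conjugating"])
for header in hits:
    print(header)
```

Anything this prints is a protein your hard filter would quietly remove, so it's worth a look before you trust the default list.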
The answer? Probably not hard filtering, I guess. (I have a default filter that makes anything on my contaminants database invisible in PD, and when I open a .tsv from other software I toss anything with an X in the contaminants column. That's on me, but hey! Now I know better!)
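One alternative to tossing every flagged row is a "soft" filter with a keep-list of proteins you expect to be real in your sample. Here's a rough sketch of that idea. The column names `Gene` and `Contaminant`, the `X` marker, and the function `soft_filter` are all assumptions for illustration; real output tables from different software name these columns differently, so adjust to match yours.

```python
import csv
import io


def soft_filter(rows, keep_genes, flag_column="Contaminant", gene_column="Gene"):
    """Split rows into (kept, dropped), rescuing flagged rows on the keep-list."""
    keep = {g.upper() for g in keep_genes}
    kept, dropped = [], []
    for row in rows:
        flagged = (row.get(flag_column) or "").strip().upper() == "X"
        rescued = (row.get(gene_column) or "").strip().upper() in keep
        if flagged and not rescued:
            dropped.append(row)  # contaminant we really don't want
        else:
            kept.append(row)     # not flagged, or flagged but on the keep-list
    return kept, dropped


# Made-up table for illustration only; the intensities are not real data.
example_tsv = (
    "Gene\tContaminant\tIntensity\n"
    "KRT7\tX\t1200\n"
    "KRT1\tX\t900\n"
    "ALB\t\t5000\n"
)

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
kept, dropped = soft_filter(rows, keep_genes=["KRT7"])
print([r["Gene"] for r in kept])
print([r["Gene"] for r in dropped])
```

Returning the dropped rows instead of silently discarding them means you can still eyeball what got removed, which is the whole point of not hard filtering.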