ComPIL -- Infinitely scalable databases for metaproteomics!!

WHOA!!!!  Have you tried processing any metaproteomics data you found in a repository online (or generated yourself)?  If you naively barge into it you will rapidly discover a problem.

There are hundreds of GB of sequenced bacteria/virus/archaea databases out there! The sequencing labs aren't slowing down, either. In 2006 there were 300 completed bacterial genomes. In 2016 there were over 2,000 completed genomes of E. coli alone! (Ref)

If you really don't know what species might be in your sample of mud or sewage or whatever (that's metaproteomics, btw: just digest what is there and run LC-MS/MS), you either need to find some database reduction steps -- or -- ComPIL!

ComPIL is a metaproteomics search system designed for the future. In theory -- it is infinitely scalable and may be able to keep up with these busy genomics centers and all the information they are kicking out into the world.

Nope, I don't get how they made it work, but I do understand the evidence they used to validate it.

They generated some LC-MS/MS data on HEK293 (an immortalized human cell line) and searched it two ways: against a human database in the normal way, and against a ComPIL'ed database where human protein entries are a very small percentage of the total. Normally when you do something like this you get loads of false discoveries thanks to homology and plain old database scaling issues, and your number of IDs drops through the floor. Here, at the same FDR, they only lose 15% of their PSMs.
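If "at the same FDR" feels abstract, here is a toy sketch of the comparison. To be clear, this is not ComPIL's code, and I'm assuming a plain target-decoy FDR estimate (decoy hits divided by target hits above a score cutoff), which is a common approach but not necessarily what the authors did. The function names and the (score, is_decoy) tuple format are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical toy format: (score, is_decoy). Higher score = better match.
PSM = Tuple[float, bool]


def count_psms_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> int:
    """Count target PSMs kept at the most permissive score cutoff
    whose estimated FDR (decoys / targets) stays at or below max_fdr."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranked list that still passes.
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


def percent_lost(small_db_count: int, big_db_count: int) -> float:
    """Percent of PSMs lost when moving from the small to the big database."""
    return 100.0 * (small_db_count - big_db_count) / small_db_count
```

You would run count_psms_at_fdr once on the human-only search results and once on the giant-database results, using the same max_fdr, then hand both counts to percent_lost. A result of 15.0 would correspond to the 15% drop described above.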

It gets better -- HEK293 was immortalized with an adenovirus infection. Using this massive database -- they identify the virus incorporation sites!!  This thing is POWERFUL. They do some more validation with some bacterial proteomes (which is what it is intended for) and it appears to work better on those!

The data from the paper was deposited in PRIDE/ProteomeXchange under PXD003896 and PXD003907.

