Wednesday, June 29, 2022

Using transcriptomics to REDUCE databases in proteogenomics!

 

This new study in Genome Biology is, on the surface, probably counter-intuitive. Our smallest databases are the nice reviewed ones from UniProt/SwissProt. When we start looking at databases that are a little (lot) more biologically relevant because of things like genetic variation, it is easy for the input database size to blow up to astronomical proportions. We frequently use an input database with millions of entries (around 2 million of those are known cancer-linked mutations, so they're just little snippets of sequence), which requires special class-based FDR and a lot of computational power. When we start to toss in databases derived from next-gen DNA or RNA sequencing of those same samples, things really blow up.

BTW, neither of those is nearly as clean as you'd guess, given that we're on "3rd gen" sequencing technology with 1 TB of data coming off per sample on these new sequencers. There are fundamental questions being asked right now, like: wait, is the genome way, way, way more complex than we ever thought, or is Illumina generating less and less relevant data with each generation, and more literal garbage, hoping to cash out before someone stops to consider that the latter is the simpler explanation? Today's data density might buy them another 10 years, because it will take that long to process the data from 4 patients.
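Since class-based FDR came up, here is a minimal sketch of the idea, my own toy version and definitely not anything from the paper: split PSMs by class (canonical vs. variant peptides, for example) and run the target-decoy estimate within each class, so the enormous variant portion of the database can't sneak junk past a single global threshold. The PSM fields and the 1% cutoff are just assumptions for illustration.

```python
# Toy class-based target-decoy FDR: estimate FDR separately per peptide
# class so variant hits are judged against variant decoys only.
# A real pipeline would convert these running estimates to q-values.

def class_based_fdr(psms, threshold=0.01):
    """Keep target PSMs whose running FDR within their own class is <= threshold."""
    accepted = []
    for pep_class in {p["pep_class"] for p in psms}:
        group = sorted((p for p in psms if p["pep_class"] == pep_class),
                       key=lambda p: p["score"], reverse=True)
        decoys = 0
        for rank, psm in enumerate(group, start=1):
            if psm["is_decoy"]:
                decoys += 1
            fdr = decoys / max(rank - decoys, 1)  # decoys / targets so far
            if not psm["is_decoy"] and fdr <= threshold:
                accepted.append(psm)
    return accepted

example_psms = [
    {"score": 95.2, "is_decoy": False, "pep_class": "canonical"},
    {"score": 90.1, "is_decoy": False, "pep_class": "canonical"},
    {"score": 88.7, "is_decoy": True,  "pep_class": "variant"},
    {"score": 77.1, "is_decoy": False, "pep_class": "variant"},
]
print(len(class_based_fdr(example_psms)))  # the variant hit scoring below a decoy gets dropped
```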

This wasn't supposed to be a genomics rant, but while I'm going -- long read sequencing is the way to go for us, y'all. Illumina, and whatever the thing is that Thermo sells that no one uses, generate really short reads. 6-frame translating those little things (dividing the length by 3 to get amino acids) gives you a lot of tiny, annoying sequences to search against -- there's a quick sketch of that below. PacBio and Nanopore both produce much longer outputs, and that is transformative for us, both by cutting out a ton of redundancy and by giving us more contiguous sequence to match against.
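For anyone who hasn't had to do it, here is roughly what 6-frame translation of a short read looks like. This is a quick sketch using Biopython; the toy read and the minimum fragment length are made up. Every stop codon splits a frame into fragments, so a ~100 bp read gives you a pile of tiny peptides, while a multi-kb PacBio/Nanopore read gives real, searchable ORFs.

```python
# Sketch of 6-frame translation: three forward frames plus three on the
# reverse complement, length divided by three, and every stop codon ("*")
# chops the frame into even smaller fragments. Requires Biopython.

from Bio.Seq import Seq

def six_frame_peptides(read, min_len=7):
    """Return all fragments between stop codons from all six frames."""
    peptides = []
    for strand in (Seq(read), Seq(read).reverse_complement()):
        for frame in range(3):
            # Trim to a whole number of codons before translating
            trimmed = strand[frame:frame + (len(strand) - frame) // 3 * 3]
            protein = str(trimmed.translate())
            peptides += [p for p in protein.split("*") if len(p) >= min_len]
    return peptides

# A 96 bp toy "read", roughly the length of a short Illumina read
read = "ATGGCTAGCATTGACCTGAAATAG" * 4
print(six_frame_peptides(read))
```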

All of the words starting with the 2nd sentence were meant to impart the fact that, unless I've been doing it totally wrong for 10 years (which is completely possible), proteogenomics databases don't get smaller. They just keep getting bigger. It would be awesome if there were some way, any way, to reduce them.

There is a lot here, and the paper tackles two different concepts. The first is a recently proposed strategy for database reduction that I won't go into, because the authors don't like it. The second big concept, the one they actually use, is leveraging transcriptomics (RNA-seq) from the same samples to reduce the database. The logic is that if no transcripts are expressed for a gene, it seems silly to go looking for that gene's protein. Using this approach they report "more sensitive" peptide detection (you'll see their terminology, and it makes sense a little way into the paper), even using standard target-decoy based approaches.
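If I were hacking the reduction step together myself (and to be clear, this is my own hand-wavy sketch, not their pipeline), it would look something like the following: take a gene-level TPM table from the RNA-seq, keep only the genes above some expression cutoff, and write out a FASTA containing just those proteins. The file names, the TPM cutoff, and the FASTA header layout here are all assumptions.

```python
# Hand-wavy transcript-guided database reduction: drop protein entries
# whose genes show no transcript expression in the matched RNA-seq data.
# File names, the TPM cutoff, and the header layout are assumptions.

import csv

TPM_CUTOFF = 1.0  # "expressed" = at least 1 TPM in the matched RNA-seq

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

# Genes with transcripts above the cutoff (assumed TSV with gene_id and TPM columns)
expressed = set()
with open("rnaseq_quant.tsv") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["TPM"]) >= TPM_CUTOFF:
            expressed.add(row["gene_id"])

# Keep only entries whose gene is expressed; assume the gene ID sits in the
# second pipe-delimited field of the FASTA header
kept = total = 0
with open("proteome_reduced.fasta", "w") as out:
    for header, seq in read_fasta("proteome_full.fasta"):
        total += 1
        if header.split("|")[1] in expressed:
            out.write(f">{header}\n{seq}\n")
            kept += 1

print(f"kept {kept} of {total} protein entries")
```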

Big caveat here, of course: if you are perturbing a system, for example by irradiating cells in a way that induces a rapid response and shuts off transcription, you definitely shouldn't do this. Proteins with long half-lives relative to their transcript counterparts would still be hanging around, and then you wouldn't have entries for them in your reduced database. That's just the first example off the top of my head of what a bummer thinking about these systems in a real biological context probably is.

Also, I am not sure what the figure I chose for this blog post is displaying, but I really liked their choice of colors.
