Friday, July 5, 2019

eggNOG for the 4th of July? Annotate your next-gen derived FASTA!


I didn't go out for fireworks. I sat at home and learned how to annotate these protein FASTA databases that I generated from Illumina ("short read sequencing") and PacBio ("long read sequencing") data.

I started with BLASTP on the command line, but halfway through I decided to see, once again, how long it takes me to annotate manually. From 5:08pm to 5:45pm I manually annotated 42 FASTA entries. That's a bit under one minute each. Let's call it one minute. I only have 17,166 to go. If I could keep going 24 hours a day it would only take me 11 days to get through it. Don't check my math. I probably did it wrong.
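For anyone who does want to check it, here is a quick Python sketch using only the numbers above (the variable names are just for illustration). At a flat one minute per entry it lands a little under 12 days, so the estimate is in the right ballpark.

```python
# Back-of-the-envelope check of the manual annotation estimate.
minutes_spent = (17 * 60 + 45) - (17 * 60 + 8)  # 5:08pm to 5:45pm
entries_done = 42
entries_left = 17_166

# Actual pace during the timed session.
seconds_per_entry = minutes_spent * 60 / entries_done

# Rounded up to one minute per entry, working around the clock.
days_nonstop = entries_left * 1 / 60 / 24

print(f"{seconds_per_entry:.0f} seconds per entry")
print(f"{days_nonstop:.1f} days at one entry per minute, no sleep")
```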

I have something important to do in like 12 days (this cool ABRF recap webinar series!), and if I didn't sleep for 11 days straight, I probably wouldn't do a very good job. New plan! eggNOG!


This is the newest paper, but what I actually appear to be using is 2.0. There is a paper from 2017 on 1.0, but only user documentation in between.

What's better than reading? Dumping all your cool nextgen filtered data stuff into someone else's server and seeing if it works!

You can dump your data into their server here.

Or you can get the code to run it locally for yourself here.

What's it do? Okay -- so what I have from all the next-gen data that I six-frame translated into proteins is a crappy annotation that looks like this:

>9385295:True:1830
MLFTYYCLYSERICSQFYKDSEMGDSKGCFLEYFHSGDYSSLWKSHGAYGIAGAVVVGILIPVIISSFFIGKKKGKLRGVPVDVGGDSAYTVRNSRVTELIEVPWEGATTMAHLFEQSCKRNSRNQFLGTRKFIERDFVAASDGRKFEKLHFGEYEWQTYGEAFDRACNFASGLIKLGHNVDTRAALFSETRAEWLIAFQVCYLMHFYIQLLLPYLLFVLVYFNLLLKTQDNINNSRISGMLPTKYYCCYYLCNSWGGCANPRT

What's that annotation mean? Nothing useful. It may be the number of the probe used for the sequencing in that experiment. Then you have the filtered protein sequence that I translated from that probe. What I want to know is what that protein actually is.
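For reference, a FASTA entry like the one above is just a header line starting with ">" followed by one or more sequence lines. Here is a minimal sketch of reading entries into (header, sequence) pairs in plain Python. `read_fasta` is a made-up helper name, not part of eggNOG or any other package, and the example sequence is shortened and split over two lines for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes out the previous entry, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Shortened version of the entry above, split across two lines.
example = [">9385295:True:1830", "MLFTYYCLYSERICSQ", "FYKDSEMGDSKGCF"]
for name, seq in read_fasta(example):
    print(name, len(seq))
```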

BLASTP through the web interface (using my personally preferred, reliable, older, somewhat slower version) took about 3 minutes to give me this annotation for this individual protein sequence:


 (The way I do it faster is to have 10 tabs open at one time. Please don't do it this way. You're smarter than me.)

eggNOG is listed in the original paper as being 15x faster than BLASTP. It's more like 10,000x faster than a human with BLASTP.

Check this out. Online -- it did my FASTA with almost 36,000 entries in under 90 minutes!


AWESOME. Okay. So this is what you get out of it.


...Perhaps the biggest bummer in the world on the 4th of July is that it doesn't combine the .FASTA with the .Annotations. Yo. That's what I'm here for.


So -- what I did was use Excel (trigger groans) and the VLOOKUP function. This is essentially a Find/Replace where you say: if a value shows up in Column X, swap in the matching value from Column Y. Yes, there are smarter ways of doing this, but mine totally worked. I changed the file extension (the part of the name after the dot) of both files to .fasta, which allows them to be opened by Proteome Discoverer.
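One of those smarter ways might look something like the Python sketch below. It is not what I did, just the same lookup idea written as a dictionary. It assumes the annotation file is tab-separated, that lines starting with "#" are comments, and that the first column holds the same name as the FASTA header. Which column holds the description is also an assumption (it can differ between versions of the output), so `desc_col` is a parameter you would need to check against your own file. The function names and the example annotation text are made up for illustration.

```python
def load_annotations(lines, name_col=0, desc_col=1):
    """Build a {query name: description} lookup from tab-separated lines.

    Column positions are assumptions; check them against your own file.
    """
    lookup = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/header lines
        fields = line.split("\t")
        if len(fields) > max(name_col, desc_col):
            lookup[fields[name_col]] = fields[desc_col]
    return lookup


def annotate_fasta(fasta_lines, lookup):
    """Append the looked-up description to each matching FASTA header.

    Headers with no match are left unchanged, like a VLOOKUP miss.
    """
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            if name in lookup:
                line = f"{line} {lookup[name]}"
        out.append(line)
    return out


# Made-up annotation table; the description text is a placeholder.
annotation_lines = [
    "#query_name\tdescription",
    "9385295:True:1830\tmade-up example description",
]
fasta_lines = [
    ">9385295:True:1830",
    "MLFTYYCLYSERICSQ",
    ">no_match_here",
    "MKLV",
]

annotated = annotate_fasta(fasta_lines, load_annotations(annotation_lines))
print("\n".join(annotated))
```

Sequence lines pass through untouched, so the result can be written straight back out as a new .fasta file.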

I plan to release a package called "Ben's dumb Excel tools for mass spectrometrists" when I get time and this should be part of it. I should post that here sometime....
