Tuesday, January 16, 2018
Automatically building big & complex multi-organism FASTA databases in PD
Imagine that you are sitting there minding your own business and someone walks in with some Louisiana crayfish peptides. As cool as proteomics is, you'll have to convince me that there is a more appropriate use of crayfish than this....
...but let's assume that this is super important (and we only need a few micrograms of peptides anyway)
What if there isn't a good sequenced crayfish FASTA database? I don't know if there is; I'm working on something else and I only chose this example because I'm hungry. Before you go all out and start de novo sequencing everything, maybe you can start with just a giant FASTA. (I learned today that this is a hard "A": fast-AY or fast-(Canadian)-Eh. Who knew? Everybody?)
You can start by building a FASTA that has all related organisms. If you have Mascot, you're in luck. You can just choose the taxonomy in your pulldown (assuming the complete database has been loaded). I don't have Mascot access at home, so I went to Google and the first link was some terrifying exercise in BLASTP+ from the command line where you cross-reference your taxonomy list from UniProt to the complete FASTA....
Then I remembered one of the perks of having PD maintenance: something about FASTA downloads. Turns out it is pretty cool. If you look up crayfish (p.s. in my state we call them crawldaddies, no idea why) on Wikipedia you can find the entire taxonomy. You can then follow either of the links in the box in the top image. (This pops up when you go FASTA Database Utilities --> Download from ProteinCenter.) I chose Arthropoda in the taxonomy, which gave me that number, and then I just hit Download.
It queues up and does everything, building you a huge database (prepare to wait a while if you choose TrEMBL) that you can then run to see if it actually finds you some hits.
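If you don't have PD maintenance, the cross-referencing idea from that terrifying Google result is less scary than it sounds. Here is a minimal Python sketch of it, not what PD or ProteinCenter does internally. It relies on UniProt FASTA headers carrying an OX= field with the organism's taxonomy ID. The function name, the accessions, and the sequences below are made up for illustration, and the taxon IDs in the demo should be checked against the UniProt taxonomy before you trust them. One catch: OX= holds the ID of the specific organism, so for something as broad as Arthropoda you first need the full list of IDs that sit under it.

```python
import re

# UniProt FASTA headers include "OX=<taxonomy id>"; this pulls the number out.
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def filter_fasta_by_taxon(lines, wanted_taxids):
    """Yield the FASTA lines (headers and sequence) of every record whose
    header OX= value is in wanted_taxids (a set of ID strings).

    Hypothetical helper written for this post, not part of any library.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # New record: decide once, then apply to its sequence lines too.
            match = OX_PATTERN.search(line)
            keep = match is not None and match.group(1) in wanted_taxids
        if keep:
            yield line


# Tiny made-up demo: accessions and sequences are placeholders,
# and the taxon IDs are illustrative.
records = [
    ">tr|X00001|X00001_PROCL Hypothetical protein OS=Procambarus clarkii OX=6728 PE=4 SV=1\n",
    "MKTAYIAK\n",
    ">sp|X00002|FAKE_HUMAN Hypothetical protein OS=Homo sapiens OX=9606 PE=1 SV=1\n",
    "MSDNE\n",
]
kept = list(filter_fasta_by_taxon(records, {"6728"}))
print("".join(kept), end="")
```

For a real download you would pass an open file handle instead of the list and write the kept lines to a new .fasta file, which you can then import into PD like any other database.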