Thursday, August 15, 2013

How often should you update your FASTA database?


Great question, right?  How much are these databases really changing?
During one of my postdocs, one of the staff scientists told me that I should update my FASTAs every 3 months.  In order to compare, I dug up an old complete Uniprot-SwissProt library that I downloaded in July, 2011 and I downloaded the newest version while making dinner tonight.

So!  What's the difference?  First off, I notice that the new file is about 2.5 MB bigger than my FASTA from July 2011.  That doesn't seem like a lot.

Random comparisons:
1) Burkholderia:
Old FASTA: 8206 entries
New FASTA: 8208 entries

2) Green monkey (Chlorocebus):
Old FASTA: 128 entries
New FASTA: 130 entries

3) Malaria (Plasmodium falciparum):
Old FASTA: 298 entries
New FASTA: 300 entries

It looks like I'm making this up.  They've added exactly 2 new annotations to each of the databases?  Or I pseudo-randomly chose 3 databases that had exactly 2 new entries?  Seems made up.
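If you want to run this kind of head count on your own files, here is a minimal Python sketch. It assumes the organism name appears somewhere in each FASTA header line (as it does in the OS= field of UniProt headers) and simply counts the headers that contain a keyword. The function names are made up for illustration, and this is not necessarily how the numbers above were generated.

```python
def count_entries(fasta_lines, keyword):
    """Count FASTA header lines (those starting with '>') that contain
    the keyword, ignoring case."""
    keyword = keyword.lower()
    return sum(
        1
        for line in fasta_lines
        if line.startswith(">") and keyword in line.lower()
    )


def count_in_file(path, keyword):
    """Same count, reading the FASTA from a file on disk."""
    with open(path, encoding="utf-8") as handle:
        return count_entries(handle, keyword)
```

Calling `count_in_file` on the old and the new download with the same keyword (for example "Burkholderia") gives the two numbers to compare. A plain substring match can overcount if the keyword also shows up in a protein description, so treat the result as a rough figure.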

Better analysis!  How different are the results when the same raw file is searched in the same way against the two FASTAs?

I'll parse each database on "sapiens" and run the same method against the same cell line digest, run on a 120 min gradient on an Orbitrap Elite operating in High-high mode using a Top15 method.
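Parsing a database on a keyword can be sketched in a few lines of Python if you don't have a tool handy for it. This is only an illustration under the assumption that "parsing on sapiens" means keeping every record whose header contains that word; `filter_fasta` is a hypothetical helper, not a function from any search software.

```python
def filter_fasta(lines, keyword):
    """Yield the lines of every FASTA record whose header contains the
    keyword (case-insensitive). Sequence lines follow their header."""
    keyword = keyword.lower()
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = keyword in line.lower()
        if keep:
            yield line
```

Writing the yielded lines to a new file gives a species-specific FASTA, one for the old download and one for the new.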

Both FASTAs came up with 1431 unique proteins


...only one of which had changed.

What about annotations?

Here is where we see some differences.  29 annotations in our species have been updated in the last 2 years.
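A rough way to see what changed between two versions of a FASTA is to compare them accession by accession. The Python sketch below assumes UniProt-style headers of the form `>db|ACCESSION|NAME description` and reports entries that were added, removed, or had their header text or sequence change. Header text is only a crude stand-in for annotation, so don't expect this to reproduce the count of 29 above. Both function names are hypothetical.

```python
def read_fasta(lines):
    """Map accession -> (header, sequence) for a FASTA given as lines."""
    records = {}
    accession, header, chunks = None, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if accession is not None:
                records[accession] = (header, "".join(chunks))
            header = line[1:]
            parts = header.split("|")
            # Fall back to the first word if the header is not pipe-delimited.
            accession = parts[1] if len(parts) >= 3 else (header.split() or [""])[0]
            chunks = []
        elif line and accession is not None:
            chunks.append(line)
    if accession is not None:
        records[accession] = (header, "".join(chunks))
    return records


def diff_fasta(old, new):
    """Compare two dictionaries produced by read_fasta."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "header_changed": sorted(a for a in shared if old[a][0] != new[a][0]),
        "sequence_changed": sorted(a for a in shared if old[a][1] != new[a][1]),
    }
```

Separating header changes from sequence changes matters here: only a changed sequence can alter which peptides match in a search, while a changed header affects how the protein is labeled in the report.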

What's the takeaway?  I guess it is that if we are working with a well annotated organism, such as us or mice, we probably shouldn't be too paranoid about remaking all our databases every couple of months when the new Uniprot-Swissprot is posted.

It is important to remember that this is the best annotated protein database out there.  These are manually verified.  Databases that are annotated via automated processes (like TrEMBL?) are probably going to be a whole lot more dynamic than this one, so all these thoughts go out the window.  For example, the PlasmoDB (malaria) database is at the complete opposite end of the spectrum.  It changes constantly.  (It has to, the global Plasmodium genome has probably gone through millions of changes since I started writing this entry....)

Good news for Proteome Discoverer 1.4 users: the Annotation module that is linked to ProteinCenter automatically updates these annotations to the newest available for you.  It works perfectly for slowly changing databases, and does a surprisingly good job even for dynamic databases such as Plasmodium.

