News in Proteomics Research: What is in a FASTA file?

Thursday, May 23, 2013

What is in a FASTA file?

With a whole ton of new next-gen sequencing data showing up in new databases around the world, one question keeps popping up: what is in a FASTA database?

This is what the NCBI says:

FASTA

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

  >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
  QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
  KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
  VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
  FLFLIKHNPTNTIVYFGRYWSP

Blank lines are not allowed in the middle of FASTA input.
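To make that description concrete, here is a minimal sketch of a reader for this format in Python. The function name `parse_fasta` is made up for illustration and is not part of any NCBI tool. It assumes every record starts with a ">" line and it quietly skips blank lines, even though the format says they should not be there.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each defline (without the leading '>') to its joined sequence.

    Hypothetical helper for illustration only.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: tolerate stray blank lines instead of failing
        if line.startswith(">"):
            header = line[1:]      # drop the '>' that marks the defline
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence data may span many lines
    return records


# The example entry quoted above, as a list of lines.
example = """\
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
""".splitlines()

records = parse_fasta(example)
for defline, sequence in records.items():
    print(defline)
    print(len(sequence), "residues")
```

For anything beyond a quick look, an established parser is probably a better choice than a hand-rolled one like this.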

UniProt/Swiss-Prot entries are going to look different from TrEMBL entries, and those are going to look different from RefSeq entries, and so on, mostly in how the description line is written, but they are all going to follow this basic format.
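Since the description line is where the databases differ, it can help to pull one apart. The sketch below splits the NCBI example defline from above into its pipe-separated identifier fields and its free-text description. The function name `split_defline` is hypothetical, and the code assumes the identifier block ends at the first space. What each field means depends on the database, so the code does not try to interpret them.

```python
from typing import List, Tuple


def split_defline(defline: str) -> Tuple[List[str], str]:
    """Split a defline into identifier fields and free-text description.

    Hypothetical helper; assumes the identifier block ends at the first space.
    """
    text = defline[1:] if defline.startswith(">") else defline
    identifier, _, description = text.partition(" ")
    return identifier.split("|"), description


fields, description = split_defline(
    ">gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)"
)
print(fields)       # ['gi', '129295', 'sp', 'P01013', 'OVAX_CHICK']
print(description)  # GENE X PROTEIN (OVALBUMIN-RELATED)
```

A defline with no "|" characters comes back as a single field, so the same function can at least be pointed at entries from other sources without crashing.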
