Definition, Origin, and Dual Role of FASTA
FASTA, short for “FAST-All,” refers to two related things in bioinformatics: a ubiquitous data format and a suite of sequence comparison programs. Its lineage begins with FASTP, a protein comparison program published by David J. Lipman and William R. Pearson in 1985. The FASTA program itself followed in 1988 as an evolution of the earlier tools, FAST-P (for protein) and FAST-N (for nucleotide), combining the ability to work with all sequence types. The simplicity and text-based nature of the FASTA format quickly made it a near-universal standard for representing biological sequences.
Fundamentally, FASTA is a text-based format designed to store either nucleotide sequences (DNA, RNA) or amino acid (protein) sequences. The main goal of the related FASTA software package is to rapidly and accurately identify statistically significant sequence similarity between a query sequence and sequences housed within large, comprehensive databases. This sequence comparison is crucial for inferring evolutionary relationships, known as homology, and for characterizing newly determined sequences by identifying functional domains and motifs.
The FASTA File Format: Structure and Conventions
The FASTA format is structured text with two essential components for each sequence: a definition line (or header) and the sequence data itself. The definition line must be the first line of the sequence entry and is distinguished by a greater-than symbol (>) in the first column.
The definition line typically holds summary information, such as a unique sequence identifier (SeqID), an accession number, the name of the gene or protein, and the source organism. By convention, the definition line must not contain any hard returns: all identifying information must sit on a single line of text. In multi-FASTA files, where several sequences are stored in one file, each new sequence must begin with its own definition line starting with the ‘>’ symbol.
The sequence data immediately follows the definition line and consists of single-letter codes for nucleotides or amino acids, following IUPAC (International Union of Pure and Applied Chemistry) notation. The sequence may be wrapped across several lines; a common recommendation, though not a strict requirement, is to keep every line of a FASTA file shorter than 80 characters for human readability and compatibility across bioinformatics tools. Commonly followed conventions also cover a few special characters: lowercase letters are accepted and, by many tools, mapped to uppercase; a single hyphen (-) represents a gap; and in amino acid sequences, ‘U’ (selenocysteine) and ‘*’ (translation stop) are acceptable letters.
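The conventions above can be illustrated with a minimal parser sketch. It is not taken from any particular library: the function name `parse_fasta` and the example records are hypothetical, and mapping lowercase to uppercase is one common choice rather than a requirement of the format.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from multi-FASTA text."""
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over many lines; join them
            # and map lowercase letters to uppercase (assumed convention).
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record multi-FASTA input.
example = """>seq1 hypothetical protein
mktay
IAKQR
>seq2 hypothetical DNA fragment
acgt-ACGT
"""

records = list(parse_fasta(example))
for definition, sequence in records:
    print(definition, sequence)
```

The parser treats every line that starts with ‘>’ as the start of a new record, which is why the definition line cannot be split across lines.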
The FASTA Software Package and Programs
The FASTA software package, maintained by its original creator, W. R. Pearson, provides a comprehensive set of sequence comparison programs. While the FASTA programs are generally not as fast as the ubiquitous BLAST package, they are often considered equally sensitive and can provide more accurate statistical estimates because they calculate statistical parameters directly from the distribution of similarity scores determined during the search process. This allows them to effectively use a wide range of similarity scoring matrices and gap penalties.
The suite includes programs tailored to specific comparison tasks. For example, *fasta36* is the standard tool for comparing a protein query against a protein database or a DNA query against a DNA database. Translated comparisons are handled by *fastx36*, which compares a DNA query against a protein database, and *tfastx36*, which compares a protein query against a DNA database; in both cases the DNA is translated in three reading frames and the alignment allows for frameshifts. For highly accurate but slower optimal searches, the package includes *ssearch36* (local similarity) and *ggsearch36* (global similarity). The programs also offer advanced features, including the ability to incorporate functional site and domain annotations into alignments, which aids the biological interpretation of search results.
Working Principle and Steps of the FASTA Algorithm
The FASTA algorithm operates using a rapid, heuristic approach to achieve high search speed against massive databases. It comprises four main steps to identify sequence similarity:
The first step is **Identifying Regions** of high similarity, often referred to as the hashing step. The query sequence is broken into short overlapping “words” called k-tuples, whose length is set by the ktup parameter. The value of ktup is typically 2 for protein sequences and 6 for nucleotide sequences. A lookup table records the position of every k-tuple in the query and is then used to find exact k-tuple matches in the database sequences. Matches that share the same offset between query position and database position lie on the same diagonal of a two-dimensional comparison matrix, so diagonals that accumulate many matches mark regions of high similarity.
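The hashing step can be sketched in a few lines. This is an illustration of the idea, not the package's actual implementation; the function names and the toy sequences are hypothetical, and ktup is set to 4 only to keep the example small.

```python
from collections import Counter, defaultdict


def build_lookup(query: str, ktup: int) -> dict:
    """Map every k-tuple in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def diagonal_hits(query: str, subject: str, ktup: int) -> Counter:
    """Count exact k-tuple matches per diagonal (query pos - subject pos)."""
    table = build_lookup(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        # .get avoids inserting empty entries into the lookup table.
        for i in table.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


# Toy sequences (hypothetical data).
hits = diagonal_hits("ACGTACGTGA", "TTACGTACGT", ktup=4)
print(hits.most_common())
```

In this toy case most matches fall on diagonal -2, meaning the subject is shifted two positions relative to the query; that diagonal would be passed on to the next step.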
The second step is **Re-Scoring** the diagonals. The ten best-scoring diagonal regions found in the first step are selected and re-scored with a substitution matrix, such as a PAM or BLOSUM matrix, rather than the simple word-match count. The score of the best single re-scored region is reported as init1.
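Re-scoring a diagonal amounts to finding the best-scoring ungapped segment along it. The sketch below assumes a toy scoring function (+5 for a match, -4 for a mismatch) as a stand-in for a real PAM or BLOSUM lookup; those values and the function names are illustrative assumptions.

```python
def score_pair(a: str, b: str) -> int:
    # Toy stand-in for a PAM/BLOSUM lookup (assumed values).
    return 5 if a == b else -4


def rescore_diagonal(query: str, subject: str, diag: int):
    """Return (best_score, (q_start, q_end)) for the best ungapped
    segment on the diagonal diag = query_pos - subject_pos."""
    i = max(diag, 0)    # starting position in the query
    j = max(-diag, 0)   # starting position in the subject
    best = running = 0
    best_span = (i, i)
    start = i
    while i < len(query) and j < len(subject):
        if running == 0:
            start = i  # a new candidate segment begins here
        running += score_pair(query[i], subject[j])
        if running <= 0:
            running = 0  # drop segments that no longer pay off
        elif running > best:
            best = running
            best_span = (start, i + 1)  # end index is exclusive
        i += 1
        j += 1
    return best, best_span


result = rescore_diagonal("ACGTACGTGA", "TTACGTACGT", diag=-2)
print(result)
```

With a real substitution matrix only `score_pair` would change; the scan along the diagonal stays the same.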
The third step applies a **Joining Threshold** to link the initial high-scoring regions. A score cutoff filters out regions unlikely to contribute to the final alignment. The remaining regions are checked to see whether they can be joined into a single gapped alignment: regions on different diagonals are connected by introducing gaps, with a penalty charged for each join. The best combined score is reported as initn.
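The joining step can be approximated as a chaining problem: pick a set of compatible regions whose combined score, minus a penalty per join, is highest. The sketch below is a simplification under stated assumptions. It uses a single flat `join_penalty` rather than separate gap opening and extension penalties, and the `Region` type, the cutoff, and the example values are all hypothetical.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    q_start: int
    q_end: int    # exclusive
    s_start: int
    s_end: int    # exclusive
    score: int


def best_joined_score(regions: List[Region], join_penalty: int, cutoff: int) -> int:
    """Best score obtainable by chaining regions that pass the cutoff."""
    kept = sorted(
        (r for r in regions if r.score >= cutoff),
        key=lambda r: (r.q_start, r.s_start),
    )
    best_ending = []  # best chain score ending at each kept region
    for k, region in enumerate(kept):
        best = region.score
        for p in range(k):
            prev = kept[p]
            # A region can follow another only if it starts after the
            # other ends in both the query and the subject.
            if prev.q_end <= region.q_start and prev.s_end <= region.s_start:
                best = max(best, best_ending[p] + region.score - join_penalty)
        best_ending.append(best)
    return max(best_ending, default=0)


# Hypothetical regions from the re-scoring step.
regions = [
    Region(0, 8, 2, 10, 40),
    Region(10, 16, 14, 20, 30),
    Region(3, 7, 20, 24, 12),
    Region(20, 22, 0, 2, 6),
]
joined = best_joined_score(regions, join_penalty=20, cutoff=10)
print(joined)
```

Here the first two regions can be joined because the second starts after the first ends in both sequences, and the combined score outweighs the penalty; the third is incompatible with both and the fourth falls below the cutoff.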
The fourth and final step is **Final Alignment and Statistical Significance**. The highest-scoring database sequences are realigned more rigorously with a banded variant of the Smith-Waterman local alignment algorithm, which considers only a band around the most promising region and yields the optimized (opt) score. The statistical significance of each match is then evaluated, primarily through the E-value (expected value): the number of matches with a score at least as high that would be expected by chance in a database of that size. The output also includes the bit score and the Z-score, the latter giving the number of standard deviations by which a score differs from the mean score of the database search.
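The Z-score as defined above can be computed directly from the distribution of search scores. The sketch below is deliberately simplified: it uses the plain mean and population standard deviation, whereas the real package applies further corrections (for example for sequence length) that are not modeled here. The function name and scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def z_scores(scores: List[float]) -> List[float]:
    """Standard deviations of each score from the mean of all scores."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no score stands out from the rest.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Hypothetical similarity scores from one database search.
search_scores = [20, 22, 18, 20, 60]
z = z_scores(search_scores)
print([round(value, 2) for value in z])
```

A score far above the bulk of unrelated sequences receives a large positive Z-score, which is the intuition behind using the score distribution of the search itself to judge significance.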
Key Uses and Applications of FASTA
The FASTA format and its associated programs have a wide array of applications that are essential to modern molecular biology and bioinformatics research. The primary use is **Sequence Alignment** and **Database Searching**, where researchers compare a novel sequence against vast public databases to rapidly find matches or homologous sequences. This function is indispensable for characterizing the new sequence.
By identifying regions of similarity between sequences, FASTA supports the **Inference of Homology**, that is, a common evolutionary origin and potentially shared function between two genes or proteins. The alignment output also helps identify **Conserved Regions, Functional Domains, and Motifs**. These conserved elements often correspond to biologically important areas, such as enzyme active sites or binding domains, and so provide insight into the function, structure, and mechanism of the sequence. Finally, the FASTA package serves as a basic computational tool for sequence analysis in disciplines including genomics, proteomics, and structure-based drug discovery.