seqaln-intro - introduction to sequence alignment molecular biology software written at USC.
These programs are similar, and take arguments of the form

    global db seqfile match/matrix mismatch/csub alpha beta [ flags ]
    fit db seqfile/profile match/matrix mismatch/csub alpha beta [ flags ]
    over db seqfile match/matrix mismatch/csub alpha beta [-1/-2/+1/+2/+3] [ flags ]
    local db seqfile match/matrix mismatch/csub alpha beta [ flags ]
    srlocalS seqfile match/matrix mismatch/csub alpha beta [ flags ]
    trlocalS db pattern match/matrix mismatch/csub alpha [ flags ]
    pvlocalS db seqfile match/matrix mismatch/csub alpha beta [ dis1 [ dis2 ]] [-s#] [-d#] [-g# -p#] [ flags ]
    pvsrlocalS seqfile match/matrix mismatch/csub alpha beta [ dis ] [-s#] [-d#] [-g# -p#] [ flags ]
The programs in the seqaln package share underlying functions that reflect the commonality in many of the algorithms for the above types of alignments: comparing all of one sequence against all of another; trying to fit one sequence into another; searching for overlaps of two sequences; searching for regions of local similarity between two sequences, self-repeats in a sequence, tandem repeats of a given pattern in a sequence, and sequences having statistically significant local similarity (found through p-values).

These programs are built upon a common sequence alignment library; see libseqaln(3). The library contains routines for reading sequence and similar files, comparing sequences, and printing tracebacks and alignments. Sequence files are specified in the Pearson/FASTA format and substitution matrix files are in BLAST format, allowing users to employ sequences and substitution matrices directly available from the NCBI. Beyond the immediate utility of the standalone programs, they also serve as examples of software that can be written using the libseqaln library.

We assume an understanding of this software at the level of Michael S. Waterman's Introduction to Computational Biology: Maps, sequences and genomes. Chapman & Hall, 1995, ISBN 0-412-99391-0. Chapter 9 is particularly relevant to this software package.

The categories for these sequence alignment programs are:

    global     find global alignment of two entire sequences.
    fit        fit the sequence in seqfile or profile into the sequence or sequences in db.
    over       find overlaps at the beginning of one sequence and the end of another.
    local      find regions of local similarity in two sequences.
    srlocal    search for any number of self-repeats of any pattern within one sequence.
    trlocal    search for any number of tandem repeats of a given pattern within a sequence.
    pvlocal    search for sequence alignments having statistically significant scores, through p-values.
    pvsrlocal  search for self-repeats having statistically significant scores, through p-values.

Preceding the above program type names are two possible prefixes designating special scoring:

    m  use a substitution matrix for score calculation, rather than a uniform score for each match or mismatch found.
    p  use a profile file to search for a given subsequence. [Profiles are most meaningful in, and only used in, fit software.]

Appended to these alignment category names, the following suffixes determine the type of scoring metric:

    S  find similarities between two sequences, based on score.
    D  find distances between two sequences, as a score. [There is no local or overlap distance software.]

All results are printed on stdout. Errors are printed to stderr.

Of all possible combinations, the following programs exist:

    globalS      find global similarity alignments between two sequences.
    mglobalS     find global similarity alignments between two sequences, using a substitution matrix for scoring aligned letters.
    globalD      find global distance alignments between two sequences.
    mglobalD     find global distance alignments between two sequences, using a substitution matrix for scoring aligned letters.
    fitS         fit one sequence into another sequence with similarity scoring.
    mfitS        fit one sequence into another sequence with similarity scoring, using a substitution matrix for scoring aligned letters.
    pfitS        fit a profile into a sequence with similarity scoring.
    fitD         fit one sequence into another sequence with distance scoring.
    mfitD        fit one sequence into another sequence with distance scoring, using a substitution matrix for scoring aligned letters.
    overS        find overlaps of two sequences with similarity scoring.
    moverS       find overlaps of two sequences with similarity scoring, using a substitution matrix for scoring aligned letters.
    localS       find local similarity alignments between two sequences.
    mlocalS      find local similarity alignments between two sequences, using a substitution matrix for scoring aligned letters.
    srlocalS     find self-repeats in a sequence using local similarity alignment.
    msrlocalS    find self-repeats in a sequence using local similarity alignment, using a substitution matrix for scoring aligned letters.
    trlocalS     find tandem repeats in a sequence using local similarity alignment.
    mtrlocalS    find tandem repeats in a sequence using local similarity alignment, using a substitution matrix for scoring aligned letters.
    pvlocalS     find gamma, p, and p-values for local similarity alignment to search for statistically significant scores assuming a Poisson distribution of random sequence scores.
    mpvlocalS    find gamma, p, and p-values as in pvlocalS, but using a substitution matrix for scoring aligned letters.
    pvsrlocalS   find gamma, p, and p-values as in pvlocalS, but using self-repeat alignments.
    mpvsrlocalS  find gamma, p, and p-values as in pvsrlocalS, but using a substitution matrix for scoring aligned letters.
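The naming scheme described above (an optional m or p prefix, a category, and an S or D suffix) can be illustrated with a short sketch. This is not part of the seqaln package: the function name split_program_name and the CATEGORIES tuple are hypothetical, and the sketch only splits a name into its parts. It does not check whether that combination is among the programs that actually exist.

```python
# Hypothetical helper, not part of seqaln: split a program name into
# (prefix, category, metric) following the naming scheme on this page.
CATEGORIES = ("pvsrlocal", "pvlocal", "srlocal", "trlocal",
              "global", "local", "over", "fit")


def split_program_name(name):
    """Return (prefix, category, metric) for a name such as 'mfitS'."""
    if not name or name[-1] not in ("S", "D"):
        raise ValueError("name must end in S (similarity) or D (distance)")
    stem, metric = name[:-1], name[-1]
    for category in CATEGORIES:
        if stem.endswith(category):
            prefix = stem[:len(stem) - len(category)]
            # Only '', 'm' (substitution matrix) and 'p' (profile) are valid.
            if prefix in ("", "m", "p"):
                return prefix, category, metric
    raise ValueError("unrecognized program name: " + name)


if __name__ == "__main__":
    for prog in ("globalD", "pfitS", "mpvsrlocalS"):
        print(prog, split_program_name(prog))
```

For example, mpvsrlocalS splits into the prefix m, the category pvsrlocal, and the metric S.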
Several events can occur when comparing two sequences, all of which are factored into the total score. There can be an exact match between two letters; there can be a mismatch between two letters (i.e., a substitution of one letter for another); and a subsequence can be inserted in one sequence (or deleted in the other), known as an indel. Indels can be one or more letters in length. The scoring parameters associate a score with each of these events.

The standard program parameters are:

    db        the database file, containing one or more sequences in FASTA format.
    seqfile   the file containing the sequence to be compared against sequences in db, in FASTA format. Optional start and stop locations for scoring within the sequence may be specified as `seqfile(start-loc)' (begins at start-loc and proceeds to the end of the sequence), `seqfile(start-loc,stop-loc)' (begins at start-loc and ends at stop-loc), or `seqfile(,stop-loc)' (begins at the start of the sequence and ends at stop-loc). The default is to begin scoring at the start of the sequence, i.e., at position 1, and to end scoring at the end. Positions in the alignment output always reflect positions in the original full sequence. The ability to specify start and stop locations is particularly useful in specifying open reading frames for use with the -x and -r flags for nucleotide to protein translation with or without reverse complementing.
    profile   the file containing profile information, where a separate match score is assigned for each letter in each position of a sequence. This differs from the penalty matrix versions (see the matrix option below), where one score is assigned to a given letter match regardless of its position in the sequence. Profile files have a suffix of ".pro"; they are specified without this extension (for example, "globins.pro" is specified simply as "globins"). For more information on the format of a profile file, see profile(5).
    match     the score for aligning identical letters, in non-substitution-matrix versions.
    mismatch  the amount to subtract for a letter mismatch, in non-substitution-matrix versions.
    matrix    the matrix of scores for aligning letters, in substitution matrix versions. Penalty matrices have a suffix of ".mat"; they are specified without this extension (for example, "PAM250.mat" is specified simply as "PAM250").
    csub      the threshold for printing `:' between aligned letters, designating such pairs as conservative substitutions. Vertical bars (`|') are printed between matching aligned letters (see the examples at the end of this man page); colons are printed between non-matching aligned letters where a substitution score is greater than or equal to csub for similarity scoring, or where a substitution score is less than or equal to csub for distance scoring.
    alpha     the amount to score for the first letter of an insertion or deletion sequence (indel).
    beta      the amount to score for subsequent letters in an indel. For example, if there is a five-letter indel, i.e. k = 5, then the score will change by alpha + beta * (k - 1) = alpha + 4 * beta.
    dis       a 26-letter frequency distribution file, defining letter distributions in the single sequence in p-value self-repeat sequence simulation; for more information, see distribution-file(5).
    dis1      a 26-letter frequency distribution, defining letter distributions in db in p-value sequence simulation; for more information, see distribution-file(5).
    dis2      a 26-letter frequency distribution, defining letter distributions in seqfile in p-value sequence simulation; for more information, see distribution-file(5).

Additional flags are taken by most of the programs. Flags unique to a particular program or programs are designated as such. Most of these flags manipulate variables in the SEQALN_CONSTANTS data structure. To see the effect each flag has, examine the files parseargs.c and results_init.c.
    +1      Print the best-scoring alignment ending at the first sequence, in overlap software.
    +2      Print the best-scoring alignment ending at the second sequence, in overlap software.
    +3      Print both the best-scoring alignment ending at the first sequence, and the best-scoring alignment ending at the second sequence, in overlap software.
    +A      Print the alignment. This is the default.
    +D      Turn on debugging output. N.B.: You probably don't want this unless you're writing new software to interface with the seqaln library.
    +Efile  Append stderr to file.
    +L      Obtain score in linear space, not quadratic space. This option saves significant space and time, but disallows tracebacks and alignment outputs. It is very useful when scoring a sequence against a large database. Same as -L.
    +M      Print the matrix of dynamic programming scores. A tic mark (') appears after an entry in a matrix when it has already appeared in an alignment as a match or mismatch. These marked positions are then never aligned again. See the -c# and -n# options for more information on repeated alignments. The default is not to print the matrix.
    +Ofile  Append stdout to file.
    +P      Print a table consisting of score, observed, and predicted probabilities for p-value software following the linear regression on the simulation results.
    +S      Print the alignment score. This is the default.
    +T      Print the alignment traceback coordinates. The default is not to print this coordinate list.
    +V      Print the program name and its version number; verbose mode.
    -1      Don't print the best-scoring alignment ending at the first sequence, in overlap software.
    -2      Don't print the best-scoring alignment ending at the second sequence, in overlap software.
    -A      Don't print the alignment. The default is to print the alignment.
    -D      Turn off debugging output. This is the default.
    -Efile  Truncate file and write stderr to it.
    -L      Obtain score in linear space, not quadratic space. This option saves significant space and time, but disallows tracebacks and alignment outputs. It is very useful when scoring a sequence against a large database. Same as +L.
    -M      Don't print the matrix of dynamic programming scores. This is the default.
    -Ofile  Truncate file and write stdout to it.
    -P      Don't print a table consisting of score, observed, and predicted probabilities for p-value software following the linear regression on the simulation results. This is the default.
    -Rfile  Random number generator seed file is file. This file is read in for p-value software, and written out at the end of p-value simulations. For more information on the random number generator, see GFSR(3).
    -S      Don't print the alignment score. The default is to print the score.
    -T      Don't print the traceback list. This is the default.
    -V      Don't print the program name and its version number; non-verbose mode. This is the default.
    -W#     Specify an alignment output width of # aligned letters per line. The default is 60 letters per line.
    -a      Use all high scores in computing gamma and p for p-value simulations. By default, the lowest 10% and highest 10% of scores are discarded, producing a better linear regression when the number of simulations is at least 1000.
    -b      Report both upper and lower tracebacks.
    -c#     Print alignments scoring at a cutoff score of # or better, for local and fit alignment.
    -d#     Perform # declumps per p-value direct simulation. See also -s#. Not recommended at present, owing to
    -e      Trace back an alignment envelope, either upper (-u), lower (-l), or both (-b).
    -f      Flip the preferred indel traceback direction between left and up from one indel to the next.
    -g#     Specify gamma of #, for p-value software. Must be specified with -p#.
    -l      Give preference to the leftward indel direction over the upper indel direction. With -e, this will trace the lower envelope.
    -m#     Shift the mean of the scores in a substitution matrix by #; this is an offset added to each element of the substitution matrix.
    -n#     Print the best # alignments, for local and fit alignment. The default is to print the best alignment.
    -p#     Specify p of #, for p-value software. Must be specified with -g#.
    -r      Reverse-complement the second sequence. Use the reverse-complement of the sequence in the second file, using IUB reverse complement conventions. The basic alphabet assumed is ACGT. U is converted into A, and A is always converted into T. This follows the Genbank convention. N and X represent any letter and remain unchanged. The special case of letter `.' is not supported.
    -s#     Number of direct simulations to perform (and consequently number of sequence pairs to generate), for p-value software. See also -d#.
    -u      Give preference to the upward indel direction over the leftward indel direction. With -e, this will trace the upper envelope.
    -w#     Specify the window size of the number of pattern copies to string together in performing tandem repeat alignment. For example, if w = 5, a pattern will be repeated five times across the scoring matrix. This allows pattern recognition across an indel of at most just under five patterns in length.
    -x      Translate the nucleotide sequence in seqfile into a protein sequence. Terminal codons are translated as `*'. Sequences are truncated to be a multiple of three nucleotides in length.

Most programs compare the sequence[s] in db with the sequence in seqfile. The format of sequence files db and seqfile is the Pearson/FASTA format. The first line begins with `>', after which the sequence description (up to 512 characters) appears. Subsequent lines contain the sequence to be used. The sequences themselves may contain blanks, returns, and other whitespace for readability. The sequence terminates at end-of-file, or if another `>' is read.
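The sequence file layout described above can be illustrated with a minimal reader. This is a sketch only, not the libseqaln reading routine: the name read_fasta is hypothetical, a record is assumed to start only where `>' is the first character of a line, and the 512-character description limit is not enforced.

```python
# Hypothetical sketch of a Pearson/FASTA reader; not the libseqaln code.
def read_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new '>' line terminates the previous sequence, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Blanks and other whitespace inside a sequence are dropped.
            chunks.append("".join(line.split()))
    # End of input terminates the last sequence.
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = ">f1 first test sequence\nTAAAA TAG\nAT\n>f2\nTAGTAGATAGTAGAT\n"
    for description, sequence in read_fasta(sample):
        print(description, sequence)
```

Text before the first `>' line is ignored in this sketch; whether the real programs accept such input is not stated here.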
Programs that use a single match score add this value to the cumulative score for each exact letter match; a mismatch between two letters has score mismatch subtracted from the cumulative score.

Programs that use a substitution matrix (i.e., those beginning with `m') use parameter matrix as the name of the substitution score matrix file.

Programs that use a profile (currently, only pfitS(1)) specify a match score for each letter in each relative position in the search sequence. Also, profile programs use alpha and beta to compute gap penalties only for the sequence being searched, not for the profile. The first line of a profile file contains its name. The second line contains alpha and beta for profile gap penalties (see profile(5)).

Indel substrings are penalized by a score of alpha for the first letter, and by beta for subsequent letters in the indel.
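The scoring rules above for the single-match-score programs can be written out as a short sketch. The function names indel_penalty and score_alignment are hypothetical and not part of seqaln; the sketch assumes an alignment is given as two equal-length rows with `-' marking indel positions, and that similarity scoring is used.

```python
# Hypothetical sketch of the match/mismatch/indel scoring rules;
# not the libseqaln implementation.
def indel_penalty(alpha, beta, k):
    """Penalty for one indel of k letters: alpha + beta * (k - 1)."""
    return alpha + beta * (k - 1)


def score_alignment(top, bottom, match, mismatch, alpha, beta, gap="-"):
    """Similarity score of a gapped alignment given as two equal rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = 0
    open_row = None  # row holding the indel that is currently open, if any
    for x, y in zip(top, bottom):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if x == gap or y == gap:
            row = "top" if x == gap else "bottom"
            # First letter of an indel costs alpha, later letters cost beta.
            score -= beta if open_row == row else alpha
            open_row = row
        else:
            score += match if x == y else -mismatch
            open_row = None
    return score


if __name__ == "__main__":
    print(indel_penalty(2, 1, 5))
    print(score_alignment("TA--A-A-A-TAGAT", "TAGTAGATAGTAGAT", 10, 2, 2, 1))
```

With alpha = 2 and beta = 1, a five-letter indel costs 2 + 1 * 4 = 6.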
The examples below illustrate the application of scoring parameters to evaluate alignments. They compare the sequence TAAAATAGAT with the sequence TAGTAGATAGTAGAT, and demonstrate how different choices for the scoring parameters can produce different alignments. In both examples, only the best (i.e., highest scoring) alignment is requested.

In the examples below, localS is used. Similar output is generated by all other programs: typically at least one alignment is printed, with a corresponding score for each alignment.

    example% localS f1 f2 10 2 20 1

finds 7 matches and 2 mismatches beginning at position 2 in the first sequence and position 7 in the second sequence:

     2 A A A A T A G A T 10
       |   |   | | | | |
     7 A T A G T A G A T 15

for a score of 10(7) - 2(2) = 66.

    example% localS f1 f2 10 2 2 1

finds 10 matches, 3 one-letter indels and 1 two-letter indel beginning at position 1 in both sequences:

     1 T A - - A - A - A - T A G A T 10
       | |     |   |   |   | | | | |
     1 T A G T A G A T A G T A G A T 15

for a score of 10(10) - 3(2) - 1(2+1) = 91.
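How a best local score can depend on alpha and beta may be seen with a small dynamic programming sketch. The code below is a textbook local alignment recurrence with affine indel costs, offered as an illustration only; it is not the localS or libseqaln implementation, the name local_score is hypothetical, and it returns only the best score (no traceback or alignment).

```python
# Hypothetical sketch: best local similarity score with affine indel
# costs (alpha for the first indel letter, beta for each later one).
def local_score(a, b, match, mismatch, alpha, beta):
    neg = float("-inf")
    n, m = len(a), len(b)
    # h: best score of an alignment ending at (i, j), or 0 to start fresh.
    # e: best score ending with an indel that consumes b[j-1].
    # f: best score ending with an indel that consumes a[i-1].
    h = [[0] * (m + 1) for _ in range(n + 1)]
    e = [[neg] * (m + 1) for _ in range(n + 1)]
    f = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = max(h[i][j - 1] - alpha, e[i][j - 1] - beta)
            f[i][j] = max(h[i - 1][j] - alpha, f[i - 1][j] - beta)
            pair = match if a[i - 1] == b[j - 1] else -mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + pair, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    s1, s2 = "TAAAATAGAT", "TAGTAGATAGTAGAT"
    print(local_score(s1, s2, 10, 2, 20, 1))
    print(local_score(s1, s2, 10, 2, 2, 1))
```

Under these assumptions the sketch gives 66 for alpha = 20 and 91 for alpha = 2, matching the scores reported in the two examples: a costly indel favors the mismatch-only alignment, while a cheap one favors opening gaps.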
M.S. Waterman. Introduction to Computational Biology: Maps, sequences and genomes. Chapman & Hall. London: 1995. ISBN 0-412-99391-0.
globalS(1), mglobalS(1), globalD(1), mglobalD(1), fitS(1), mfitS(1), pfitS(1), fitD(1), mfitD(1), overS(1), moverS(1), localS(1), mlocalS(1), srlocalS(1), msrlocalS(1), trlocalS(1), mtrlocalS(1), pvlocalS(1), mpvlocalS(1), distribution-file(5), profile(5), penalty-matrix(5), sequence-file(5).