seqaln-intro



NAME

     seqaln-intro - introduction to sequence alignment  molecular
     biology software written at USC.


SYNOPSIS

     These programs are similar, and take arguments of the form

     global db seqfile match/matrix mismatch/csub  alpha  beta  [
     flags ]

     fit db seqfile/profile match/matrix mismatch/csub alpha beta
     [ flags ]

     over db seqfile match/matrix mismatch/csub alpha beta  [-1/-
     2/+1/+2/+3] [ flags ]

     local db seqfile match/matrix  mismatch/csub  alpha  beta  [
     flags ]

     srlocalS seqfile match/matrix  mismatch/csub  alpha  beta  [
     flags ]

     trlocalS db pattern match/matrix mismatch/csub alpha [ flags
     ]

     pvlocalS db seqfile match/matrix mismatch/csub alpha beta  [
     dis1 [ dis2 ]] [-s#] [-d#] [-g# -p#] [ flags ]

     pvsrlocalS seqfile match/matrix mismatch/csub alpha  beta  [
     dis ] [-s#] [-d#] [-g# -p#] [ flags ]


DESCRIPTION

     The programs in the seqaln package  share  underlying  func-
     tions that reflect the commonality in many of the algorithms
     for the above types of alignments:   comparing  all  of  one
     sequence  against all of another; trying to fit one sequence
     into another;  searching  for  overlaps  of  two  sequences;
     searching  for  regions  of  local  similarity  between  two
     sequences, self-repeats in a sequence, tandem repeats  of  a
     given  pattern in a sequence, and sequences having statisti-
     cally significant local similarity (found through p-values).
     These  programs  are  built upon a common sequence alignment
     library; see libseqaln(3). The library contains routines for
     reading sequence and similar files, comparing sequences, and
     printing tracebacks  and  alignments.   Sequence  files  are
     specified  in  the  Pearson/FASTA  format  and  substitution
     matrix files are in BLAST format, allowing users  to  employ
     sequences  and substitution matrices directly available from
     the NCBI.  Beyond the immediate utility  of  the  standalone
     programs,  they  also serve as examples of software that can
     be written using the libseqaln library.  We assume familiarity
     with the underlying methods at the level of Michael S.
     Waterman's Introduction to Computational Biology: Maps,
     sequences and genomes, Chapman & Hall, 1995, ISBN
     0-412-99391-0.  Chapter 9 is particularly relevant to this
     software package.

     The categories for these sequence alignment programs are:

     global    find global alignment of two entire sequences.

     fit       fit the sequence in seqfile or  profile  into  the
               sequence or sequences in db.

     over      find overlaps at the beginning of one sequence and
               the end of another.

     local     find regions of local similarity in two sequences.

     srlocal   search for any number of self-repeats of any  pat-
               tern within one sequence.

     trlocal   search for any number of tandem repeats of a given
               pattern within a sequence.

     pvlocal   search for sequence  alignments  having  statisti-
               cally significant scores, through p-values.

     pvsrlocal search for self-repeats having statistically  sig-
               nificant scores, through p-values.

     Preceding the above program type names are two possible pre-
     fixes designating special scoring:

     m         use a substitution matrix for  score  calculation,
               rather  than  a  uniform  score  for each match or
               mismatch found.

     p         use a profile file to search for  a  given  subse-
               quence.   [Profiles  are  most  meaningful in, and
               only used in, fit software.]

     Appended to these alignment category names, the following
     suffixes determine the type of scoring metric:

     S         find similarities between two sequences, based  on
               score.

     D         find distances between two sequences, as a  score.
               [There is no local or overlap distance software.]

     All results are printed on stdout.  Errors  are  printed  to
     stderr.

     Of all possible combinations, the following programs exist:

          globalS   find global similarity alignments between two
                    sequences.

          mglobalS  find global similarity alignments between two
                    sequences,  using  a  substitution matrix for
                    scoring aligned letters.

          globalD   find global distance alignments  between  two
                    sequences.

          mglobalD  find global distance alignments  between  two
                    sequences,  using  a  substitution matrix for
                    scoring aligned letters.

          fitS      fit one sequence into another  sequence  with
                    similarity scoring.

          mfitS     fit one sequence into another  sequence  with
                    similarity   scoring,  using  a  substitution
                    matrix for scoring aligned letters.

          pfitS     fit a profile into a sequence with similarity
                    scoring.

          fitD      fit one sequence into another  sequence  with
                    distance scoring.

          mfitD     fit one sequence into another sequence with
                    distance scoring, using a substitution matrix
                    for scoring aligned letters.

          overS     find overlaps of two sequences with  similar-
                    ity scoring.

          moverS    find overlaps of two sequences with  similar-
                    ity  scoring, using a substitution matrix for
                    scoring aligned letters.

          localS    find local similarity alignments between  two
                    sequences.

          mlocalS   find local similarity alignments between  two
                    sequences,  using  a  substitution matrix for
                    scoring aligned letters.

          srlocalS  find self repeats in a sequence  using  local
                    similarity alignment.

          msrlocalS find self repeats in a sequence  using  local
                    similarity  alignment,  using  a substitution
                    matrix for scoring aligned letters.

          trlocalS  find tandem repeats in a sequence using local
                    similarity alignment.

          mtrlocalS find tandem repeats in a sequence using local
                    similarity  alignment,  using  a substitution
                    matrix for scoring aligned letters.

          pvlocalS  find gamma, p, and p-values for  local  simi-
                    larity  alignment to search for statistically
                    significant scores assuming a Poisson distri-
                    bution of random sequence scores.

          mpvlocalS find gamma, p, and p-values as  in  pvlocalS,
                    but  using  a substitution matrix for scoring
                    aligned letters.

          pvsrlocalS
                    find gamma, p, and p-values as  in  pvlocalS,
                    but using self-repeat alignments.

          mpvsrlocalS
                    find gamma, p, and p-values as in pvsrlocalS,
                    but  using  a substitution matrix for scoring
                    aligned letters.
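
     The naming scheme above (optional prefix, category, scoring
     suffix) can be illustrated with a short Python sketch.  The
     function name split_program_name is hypothetical and is not part
     of the seqaln package; the sketch only splits a name into its
     parts and does not check that the resulting program exists.

```python
# Hypothetical helper illustrating the naming convention described above.
# Categories are taken from the list in this page.
CATEGORIES = ["pvsrlocal", "pvlocal", "srlocal", "trlocal",
              "global", "local", "fit", "over"]


def split_program_name(name):
    """Return (prefix, category, suffix) for a name such as 'mglobalD'."""
    suffix = name[-1:]
    if suffix not in ("S", "D"):
        raise ValueError("name must end in S (similarity) or D (distance)")
    stem = name[:-1]
    # Try the longest categories first so 'pvsrlocal' wins over 'local'.
    for category in sorted(CATEGORIES, key=len, reverse=True):
        if stem.endswith(category):
            prefix = stem[:len(stem) - len(category)]
            if prefix in ("", "m", "p"):
                return prefix, category, suffix
    raise ValueError("unrecognized program name: " + name)


if __name__ == "__main__":
    for example in ("globalS", "mfitD", "pfitS", "mpvsrlocalS"):
        print(example, split_program_name(example))
```

     Matching the longest category first keeps pvlocalS from being
     read as a profile (p) variant of some shorter category.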


OPTIONS

     Several events can occur when comparing two  sequences,  all
     of which are factored into the total score.  There can be an
     exact match between two letters; there  can  be  a  mismatch
     between  two letters (i.e., a substitution of one letter for
     another); and a subsequence can be inserted in one  sequence
     (or  deleted in the other), known as an indel. Indels can be
     one or more letters in length.  The scoring parameters asso-
     ciate  a score with each of these events.  The standard pro-
     gram parameters are:

     db        the  database  file,  containing   one   or   more
               sequences in FASTA format.

     seqfile   the file containing the sequence  to  be  compared
               against   sequences   in   db,  in  FASTA  format.
               Optional start and stop locations for scoring with
               the  sequence  may be specified as `seqfile(start-
               loc)' (begins at start-loc and proceeds to the end
               of  the  sequence),  `seqfile(start-loc,stop-loc)'
               (begins at start-loc and ends at stop-loc), or
               `seqfile(,stop-loc)' (begins at the start of the
               sequence and ends at stop-loc).  The default is
               to  begin  scoring  at  the start of the sequence,
               i.e., at position 1, and to  end  scoring  at  the
               end.   Positions  in  the  alignment output always
               reflect positions in the original  full  sequence.
               The ability to specify start and stop locations is
               particularly useful  in  specifying  open  reading
               frames  for  use  with  the  -x  and  -r flags for
               nucleotide to protein translation with or  without
               reverse complementing.

     profile   the file containing profile information,  where  a
               separate  match  score is assigned for each letter
               in each position of a sequence.  This differs from
               the penalty matrix versions (see the matrix option
               below), where one score is  assigned  to  a  given
               letter  match  regardless  of  its position in the
               sequence.  Profile files have a suffix of  ".pro";
               they  are  specified  without  this extension (for
               example,  "globins.pro"  is  specified  simply  as
               "globins").  For more information on the format of
               a profile file, see profile(5).

     match     the score for aligning identical letters, in  non-
               substitution-matrix versions.

     mismatch  the amount to subtract for a letter  mismatch,  in
               non-substitution-matrix versions.

     matrix    the matrix of scores for aligning letters, in sub-
               stitution  matrix versions.  Penalty matrices have
               a suffix of ".mat";  they  are  specified  without
               this   extension  (for  example,  "PAM250.mat"  is
               specified simply as "PAM250").

     csub      the threshold for  printing  `:'  between  aligned
               letters,  designating  such  pairs as conservative
               substitutions.  Vertical bars  (`|')  are  printed
               between matching aligned letters (see the examples
               at the end of this man page); colons  are  printed
               between  non-matching aligned letters where a sub-
               stitution score is greater than or equal  to  csub
               for  similarity  scoring,  or where a substitution
               score is less than or equal to csub  for  distance
               scoring.

     alpha     the amount to score for the  first  letter  of  an
               insertion or deletion sequence (indel).

     beta      the amount to score for subsequent letters  in  an
               indel.   For  example,  if  there is a five-letter
               indel, i.e.  k = 5, then the score will change  by
               alpha + beta * ( k - 1 ) = alpha + beta * (4).

     dis       A 26-letter frequency distribution file,  defining
               letter  distributions in the single sequence in p-
               value self-repeat sequence  simulation;  for  more
               information, see distribution-file(5).

     dis1      A  26-letter  frequency   distribution,   defining
               letter  distributions  in  db  in p-value sequence
               simulation;    for    more    information,     see
               distribution-file(5).

     dis2      A  26-letter  frequency   distribution,   defining
               letter   distributions   in   seqfile  in  p-value
               sequence simulation;  for  more  information,  see
               distribution-file(5).
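
     As an illustration of the seqfile(start-loc,stop-loc) notation
     described above, the following Python sketch splits such an
     argument and applies the range to a sequence.  The function
     names are hypothetical and this is not the package's own parser.
     Positions are 1-based as stated above; treating stop-loc as
     inclusive is an assumption.

```python
import re

# Hypothetical parser for 'name', 'name(start)', 'name(start,stop)'
# and 'name(,stop)'.  Not taken from the seqaln sources.
_ARG = re.compile(r"([^()]+)(?:\((\d*)(?:,(\d*))?\))?")


def parse_seqfile_arg(arg):
    """Return (filename, start, stop); a missing bound is None."""
    m = _ARG.fullmatch(arg)
    if m is None:
        raise ValueError("cannot parse seqfile argument: " + arg)
    name, start, stop = m.groups()
    return name, (int(start) if start else None), (int(stop) if stop else None)


def apply_range(seq, start, stop):
    """Select the scored region; 1-based, stop assumed inclusive."""
    first = 1 if start is None else start
    last = len(seq) if stop is None else stop
    return seq[first - 1:last]


if __name__ == "__main__":
    name, start, stop = parse_seqfile_arg("f1(2,5)")
    print(name, apply_range("TAAAATAGAT", start, stop))
```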

     Additional flags are taken by most of the  programs.   Flags
     unique to a particular program or programs are designated as
     such.  Most of  these  flags  manipulate  variables  in  the
     SEQALN_CONSTANTS  data  structure.   To  see the effect each
     flag has, examine the files parseargs.c and results_init.c.

          +1        Print the best-scoring  alignment  ending  at
                    the first sequence, in overlap software.

          +2        Print the best-scoring  alignment  ending  at
                    the second sequence, in overlap software.

          +3        Print both the best-scoring alignment  ending
                    at  the  first sequence, and the best-scoring
                    alignment ending at the second  sequence,  in
                    overlap software.

          +A        Print the alignment.  This is the default.

          +D        Turn on debugging output.  N.B.: You probably
                    don't  want  this  unless  you're writing new
                    software  to  interface   with   the   seqaln
                    library.

          +Efile    Append stderr to file.

          +L        Obtain score in linear space,  not  quadratic
                    space.   This  option saves significant space
                    and time, but disallows tracebacks and align-
                    ment  outputs.  This parameter is very useful
                    when scoring a sequence against a large data-
                    base. Same as -L.

          +M        Print  the  matrix  of  dynamic   programming
                    scores.   A  tic  mark  (')  appears after an
                    entry  in  a  matrix  when  it  has   already
                    appeared  in  an  alignment  as  a  match  or
                    mismatch.  These marked  positions  are  then
                    never  aligned  again.   See  the -c# and -n#
                    options  for  more  information  on  repeated
                    alignments.   The default is not to print the
                    matrix.

          +Ofile    Append stdout to file.

          +P        Print a table consisting of score,  observed,
                    and   predicted   probabilities  for  p-value
                    software following the linear  regression  on
                    the simulation results.

          +S        Print  the  alignment  score.   This  is  the
                    default.

          +T        Print the  alignment  traceback  coordinates.
                    The  default  is not to print this coordinate
                    list.

          +V        Print  the  program  name  and  its   version
                    number; verbose mode.

          -1        Don't print the best-scoring alignment ending
                    at the first sequence, in overlap software.

          -2        Don't print the best-scoring alignment ending
                    at the second sequence, in overlap software.

          -A        Don't print the alignment.  The default is to
                    print the alignment.

          -D        Turn  off  debugging  output.   This  is  the
                    default.

          -Efile    Truncate file and write stderr to it.

          -L        Obtain score in linear space,  not  quadratic
                    space.    This  parameter  saves  significant
                    space and time, but disallows tracebacks  and
                    alignment  outputs.   This  parameter is very
                    useful when  scoring  a  sequence  against  a
                    large database. Same as +L.

          -M        Don't print the matrix of dynamic programming
                    scores.  This is the default.

          -Ofile    Truncate file and write stdout to it.

          -P        Don't print  a  table  consisting  of  score,
                    observed,  and predicted probabilities for p-
                    value software following the  linear  regres-
                    sion  on the simulation results.  This is the
                    default.

          -Rfile    Random number generator seed  file  is  file.
                    This  file  is  read in for p-value software,
                    and written out at the end of p-value simula-
                    tions.   For  more  information on the random
                    number generator, see GFSR(3).

          -S        Don't print the alignment score.  The default
                    is to print the score.

          -T        Don't print the traceback list.  This is  the
                    default.

          -V        Don't print the program name and its  version
                    number;   non-verbose   mode.   This  is  the
                    default.

          -W#       Specify  an  alignment  output  width  of   #
                    aligned  letters per line.  The default is 60
                    letters per line.

          -a        Use all high scores in computing gamma and  p
                    for  p-value  simulations.   By  default, the
                    lowest 10% and highest 10% of scores are dis-
                    carded,  producing a better linear regression
                    when the number of simulations  is  at  least
                    1000.

          -b        Report both upper and lower tracebacks.

          -c#       Print alignments scoring at a cutoff score of
                    # or better, for local and fit alignment.

          -d#       Perform # declumps per p-value direct simula-
                    tion.   See  also  -s#.  Not  recommended  at
                    present, owing to

          -e        Trace  back  an  alignment  envelope,  either
                    upper (-u), lower (-l), or both (-b).

          -f        Flip the preferred indel traceback  direction
                    between  left  and  up  from one indel to the
                    next.

          -g#       Specify gamma of  #,  for  p-value  software.
                    Must be specified with -p#.

          -l        Give preference to the leftward indel direction
                    over the upward indel direction.  With -e, this
                    will trace the lower envelope.

          -m#       Shift the mean of the scores in  a  substitu-
                    tion  matrix by #; this is an offset added to
                    each element of the substitution matrix.

          -n#       Print the best # alignments,  for  local  and
                    fit  alignment.   The default is to print the
                    best alignment.

          -p#       Specify p of #, for p-value  software.   Must
                    be specified with -g#.

          -r        Reverse-complement the second sequence: the
                    sequence in the second file is replaced by its
                    reverse complement, following IUB conventions.
                    The basic alphabet assumed is
                    ACGT.  U is converted into A, and A is always
                    converted  into  T.  This follows the Genbank
                    convention.  N and X represent any letter and
                    remain unchanged.  The special case of letter
                    `.' is not supported.

          -s#       Number of direct simulations to perform  (and
                    consequently number of sequence pairs to gen-
                    erate), for p-value software.  See also -d#.

          -u        Give preference to the upward indel direction
                    over  the leftward indel direction.  With -e,
                    this will trace the upper envelope.

          -w#       Specify the window size of the number of pat-
                    tern  copies to string together in performing
                    tandem repeat alignment.  For example, if w =
                    5,  a  pattern  will  be  repeated five times
                    across the scoring matrix.  This allows  pat-
                    tern  recognition  across an indel of at most
                    just under five patterns in length.

          -x        Translate the nucleotide sequence in  seqfile
                    into a protein sequence.  Terminal codons are
                    translated as `*'.  Sequences  are  truncated
                    to  be  a  multiple  of  three nucleotides in
                    length.
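
          The behavior of the -r and -x flags can be sketched in
          Python as follows.  This is only an illustration under
          stated assumptions, not the package's implementation: the
          function names are hypothetical, only the basic alphabet
          (plus U, N and X) is handled rather than the full IUB code
          set, the standard genetic code is assumed for translation,
          and codons containing N or X are rendered as `X'.

```python
# Complement rules as described for -r: ACGT alphabet, U converted to A,
# A converted to T, N and X left unchanged.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "U": "A", "N": "N", "X": "X"}


def reverse_complement(seq):
    """Reverse the sequence and complement each letter."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# Standard genetic code (an assumption), laid out in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate nucleotides to protein; stop codons become '*'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # truncate to a multiple of three
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, usable, 3))


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(translate("ATGGCCTAAGG"))
```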

          Most programs compare the sequence[s] in  db  with  the
          sequence  in  seqfile.  The format of sequence files db
          and seqfile is the  Pearson/FASTA  format.   The  first
          line begins with `>', after which the sequence descrip-
          tion (up to 512 characters) appears.  Subsequent  lines
          contain  the  sequence to be used.  The sequences them-
          selves may contain  blanks,  returns,  and  other  whi-
          tespace  for  readability.   The sequence terminates at
          end-of-file, or if another `>' is read.

          Programs that use a single match score add  this  value
          to  the cumulative score for each exact letter match; a
          mismatch between two letters has  score  mismatch  sub-
          tracted from the cumulative score.  Programs that use a
          substitution matrix (i.e., those  beginning  with  `m')
          use  parameter  matrix  as the name of the substitution
          score  matrix  file.   Programs  that  use  a   profile
          (currently, only pfitS(1)) specify a match score for
          each letter in each relative  position  in  the  search
          sequence.  Also, profile programs use alpha and beta to
          compute gap  penalties  only  for  the  sequence  being
          searched,  not  for  the  profile.  The first line of a
          profile file contains its name.  The second  line  con-
          tains  alpha  and  beta  for profile gap penalties (see
          profile(5)).
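
          The Pearson/FASTA layout described above (a `>' line
          carrying the description, followed by sequence lines that
          may contain whitespace) can be read with a short Python
          sketch.  The function read_fasta is hypothetical and is not
          the libseqaln reader; it assumes that `>' starts a new
          record only at the beginning of a line and does not enforce
          the 512-character description limit.

```python
def read_fasta(lines):
    """Return a list of (description, sequence) pairs.

    `lines` is any iterable of text lines, for example an open file.
    """
    records = []
    description, chunks = None, []
    for line in lines:
        if line.startswith(">"):
            # A new '>' line ends the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Drop blanks, returns and other whitespace inside the sequence.
            chunks.append("".join(line.split()))
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = [">first example\n", "TAAAA TAGAT\n",
              ">second example\n", "TAGTAGA\n", "TAGTAGAT\n"]
    for desc, seq in read_fasta(sample):
        print(desc, seq)
```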

          Indel substrings are penalized by a score of alpha  for
          the first letter, and by beta for subsequent letters in
          the indel.


EXAMPLES

     The examples below illustrate  the  application  of  scoring
     parameters   to   evaluate  alignments.   They  compare  the
     sequence TAAAATAGAT with the sequence  TAGTAGATAGTAGAT,  and
     demonstrate how different choices for the scoring parameters
     can produce different alignments.  In  both  examples,  only
     the best (i.e., highest scoring) alignment is requested.  In
     the examples below, localS is used.  Similar output is  gen-
     erated  by all other programs: typically at least one align-
     ment is printed, with a corresponding score for each  align-
     ment.

          example% localS f1 f2 10 2 20 1

          finds 7 matches and 2 mismatches beginning at position 2 in the first sequence
          and position 7 in the second sequence:

               2   A A A A T A G A T   10
                   |   |   | | | | |
               7   A T A G T A G A T   15

          for a score of 10(7) - 2(2) = 66.

          example% localS f1 f2 10 2 2 1

          finds 10 matches, 3 one-letter indels and 1 two-letter indel beginning at
          position 1 in both sequences:

               1   T A - - A - A - A - T A G A T   10
                   | |     |   |   |   | | | | |
               1   T A G T A G A T A G T A G A T   15

          for a score of 10(10) - 3(2) - 1(2+1) = 91.
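
          The arithmetic in the two examples can be checked with a
          small Python sketch that scores a finished alignment from
          its two gapped rows.  The function score_alignment is
          hypothetical and only re-derives a score under the
          similarity scoring described in this page (add match,
          subtract mismatch, subtract alpha for the first letter of an
          indel and beta for each subsequent letter); it does not
          search for the best alignment as localS does.

```python
def score_alignment(top, bottom, match, mismatch, alpha, beta):
    """Score two equal-length gapped rows; '-' marks an indel letter."""
    if len(top) != len(bottom):
        raise ValueError("rows must have the same length")
    score = 0
    gap_row = None  # row holding the indel run in progress, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First letter of an indel costs alpha, later letters beta.
            score -= beta if row == gap_row else alpha
            gap_row = row
        else:
            score += match if a == b else -mismatch
            gap_row = None
    return score


if __name__ == "__main__":
    # First example: match=10, mismatch=2, alpha=20, beta=1
    print(score_alignment("AAAATAGAT", "ATAGTAGAT", 10, 2, 20, 1))
    # Second example: match=10, mismatch=2, alpha=2, beta=1
    print(score_alignment("TA--A-A-A-TAGAT", "TAGTAGATAGTAGAT", 10, 2, 2, 1))
```

          With alpha = 20 a single indel costs more than it can
          recover, so the first example stays gap-free; lowering alpha
          to 2 makes the gapped alignment of the second example score
          higher.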


REFERENCES

     M.S. Waterman.  Introduction to Computational Biology: Maps,
     sequences  and  genomes. Chapman & Hall. London: 1995.  ISBN
     0-412-99391-0.


SEE ALSO

     globalS(1), mglobalS(1), globalD(1), mglobalD(1), fitS(1),
     mfitS(1), pfitS(1), fitD(1), mfitD(1), overS(1), moverS(1),
     localS(1), mlocalS(1), srlocalS(1), msrlocalS(1),
     trlocalS(1), mtrlocalS(1), pvlocalS(1), mpvlocalS(1),
     distribution-file(5), profile(5), penalty-matrix(5),
     sequence-file(5).




 
