pvlocalS



NAME

     pvlocalS - p-value  or  statistical  significance  of  local
     alignments.


SYNOPSIS

     pvlocalS  db file2 match mismatch alpha beta [ dis2 [ dis1 ]
     ] [ -s# ] [ -d# ] [ -g# -p# ] [ flags ]


DESCRIPTION

     pvlocalS generates values of gamma  and  p  for  statistical
     significance  of  similarity  scores, to estimate successive
     probabilities ( p-values ) of at least one  random  sequence
     having  a higher alignment score.  By default, the McCaldon-
     Argos amino acid distribution is used to  simulate  two  iid
     sequences  to  determine  gamma and p; however, other letter
     frequency distributions can be provided for the input  files
     as  dis2  and  dis1,  respectively.   They  are specified in
     reverse  order  because  ordinarily  one   would   use   the
     McCaldon-Argos  distribution  as  the frequency distribution
     for the first sequence and tailor the frequency distribution
     for  the  second file to an individual database.  The format
     of dis2 and dis1 is described in distribution-file(5).

     The values of gamma and p are used to estimate the  signifi-
     cance  of  the local alignment scores derived from comparing
     the sequences in db and the sequence in file2 with the scor-
     ing parameters described below.
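
     As a rough illustration of how gamma and p could turn a score
     into a p-value, the sketch below assumes the Poisson
     approximation form P(S > t) = 1 - exp(-gamma * m * n * p^t),
     where m and n are the two sequence lengths.  That functional
     form, the function name p_value, and all numeric values are
     assumptions made for this example; they are not taken from the
     pvlocalS source.

```python
import math


def p_value(score, gamma, p, m, n):
    """Approximate probability that the best local score exceeds `score`.

    Assumes the form 1 - exp(-gamma * m * n * p**score); this form is an
    assumption of the sketch, not a statement about pvlocalS internals.
    """
    expected_hits = gamma * m * n * p ** score
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


# Hypothetical parameter values, chosen only to exercise the function.
print(p_value(score=60, gamma=0.05, p=0.8, m=300, n=250))
```

     Under this assumed form, a larger score always gives a smaller
     p-value when 0 < p < 1.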

     By default, gamma and p are determined by direct  simulation
     of  1000  sequence  pairs.   The  -s#  flag specifies that #
     simulations be performed instead.  Declumping  is  not  used
     unless specified with -d#, in which case # declumps are per-
     formed for each direct simulation.  If declumping  is  used,
     the  matrix  size  is  set to a minimum of 900 by 900 so the
     matrix will be large enough that alignments  do  not  inter-
     sect.   For  this  reason,  we recommend simply using direct
     simulation with no declumping with the current software.  We
     also  recommend a minimum of 1000 direct simulations with no
     declumping (i.e., -s1000), or  10  direct  simulations  with
     declumping  300  times  per  direct  simulation  (i.e., -s10
     -d300).
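
     The direct simulation described above draws pairs of iid
     sequences from letter frequency distributions.  A minimal
     sketch of that sampling step follows; the function name
     simulate_pairs, the toy three-letter distribution, and the
     sequence lengths are hypothetical, and the alignment scoring of
     each pair is not shown.

```python
import random


def simulate_pairs(freqs1, freqs2, len1, len2, n_sims=1000, seed=0):
    """Yield n_sims pairs of iid random sequences.

    freqs1 and freqs2 map letters to relative frequencies for the first
    and second sequence; both dictionaries are illustrative inputs.
    """
    rng = random.Random(seed)
    letters1, weights1 = zip(*freqs1.items())
    letters2, weights2 = zip(*freqs2.items())
    for _ in range(n_sims):
        # Each position is drawn independently from the same distribution.
        a = "".join(rng.choices(letters1, weights=weights1, k=len1))
        b = "".join(rng.choices(letters2, weights=weights2, k=len2))
        yield a, b


toy = {"A": 0.5, "C": 0.3, "G": 0.2}  # hypothetical distribution
pairs = list(simulate_pairs(toy, toy, len1=12, len2=8, n_sims=3))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # 3 12 8
```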

     The sequence files db and file2 use our standard format, the
     Pearson/FASTA  format.   The first line of a sequence is its
     name, which also serves as a description.  Subsequent  lines
     contain  the  sequence  itself,  which  may  contain blanks,
     returns, and other whitespace for readability.   A  sequence
     terminates at end-of-file or when a `>' is read, which begins
     a new sequence in the FASTA format.  Multiple sequences  are
     processed only in the first file (db).
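
     The format rules above can be sketched as a small reader.  This
     is an illustration only, not the parser used by pvlocalS; the
     function name read_fasta and the sample text are hypothetical,
     and ignoring any text before the first `>' is an assumption.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from Pearson/FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A `>' line closes the previous sequence and starts a new one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Drop blanks and other whitespace inside the sequence.
            chunks.append("".join(line.split()))
    if name is not None:
        # End-of-file terminates the last sequence.
        records.append((name, "".join(chunks)))
    return records


example = ">seq1 first protein\nMKT AYI\nAKQR\n>seq2\nGAVL\n"
print(read_fasta(example))
# [('seq1 first protein', 'MKTAYIAKQR'), ('seq2', 'GAVL')]
```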



OPTIONS

     db        the first file of sequences.

     file2     the second sequence file, whose length is used for
               determination of gamma and p.

     matrix    is a lower-diagonal penalty matrix  with  26  rows
               and  columns,  corresponding  to the 26 letters of
               the alphabet.  This allows matrices  to  be  built
               for protein, DNA, and RNA sequences depending upon
               the letters used.  The most  common  use  of  this
               matrix  is to compare amino acid sequences in pro-
               teins, but the flexibility  of  the  matrix  input
               allows  other  types  of sequences to be compared.
               The matrix file name ends in ".mat";  this  suffix
               is  omitted  on the command line.  If the matrix is
               not found in the current directory, the  directory
               given by the environment variable MATDIR is searched.


     csub      is the lower limit for conservative substitutions,
               which  are non-matching substitutions printed with
               a `:' in alignment output.


     alpha     the amount to subtract for the first letter of  an
               insertion or deletion sequence (indel).

     beta      is the amount to subtract for each subsequent letter
               in  an  indel.   For example, for a five-letter
               indel (k = 5), alpha + beta * (k - 1) =  alpha  +
               beta * 4 is subtracted from the score.


     dis2      distribution file giving the letter frequencies for
               file2,   used  in  place  of  the  McCaldon-Argos
               distribution.

     dis1      distribution file giving the letter frequencies for
               db, the first sequence file.

     +P        Print a table of probabilities  after  the  linear
               regression  on  high  scores.   Three  columns are
               printed: the score; the observed probability  that
               a  score  is less than or equal to this score from
               the simulations;  and  the  predicted  probability
               that  a score is less than or equal to this score,
               using gamma and p. Only those scores used  in  the
               linear  regression  are reported in this table; if
               only the middle 80% of scores are used, only  they
               are reported.

     -P        Do not print the above-mentioned table of observed
               and predicted probabilities.  This is the default.

     -a        Use all high scores, rather than the  middle  80%.
               This  could be used when the number of simulations
               is small.  However, we recommend taking the middle
               80%  of  scores  from at least 1000 direct simula-
               tions, which is the default behavior.

     -s#       Perform # direct  simulations  (i.e.,  generate  #
               sequence  pairs).   Default is 1000 direct simula-
               tions.

     -d#       Perform # declumps  for  each  direct  simulation.
               Not   recommended   at  present.   Default  is  no
               declumps.

     -g#       Specify value of gamma for subsequent runs.   Must
               also specify p at the same time.

     -p#       Specify value of p for subsequent runs.  Must also
               specify gamma at the same time.
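
     The indel penalty defined by the alpha and beta entries above
     can be checked with a short calculation.  The function name
     indel_penalty and the values alpha = 10 and beta = 2 are
     hypothetical and chosen only for illustration.

```python
def indel_penalty(k, alpha, beta):
    """Penalty for an indel of k letters: alpha + beta * (k - 1)."""
    if k < 1:
        raise ValueError("an indel must contain at least one letter")
    return alpha + beta * (k - 1)


# Five-letter indel: 10 + 2 * 4 = 18
print(indel_penalty(5, alpha=10, beta=2))  # 18
```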


REFERENCES

     T.F. Smith and M.S.  Waterman.   "Identification  of  Common
          Molecular Subsequences".  Journal of Molecular Biology,
          147, (1981) 195.

     M.  Vingron  and  M.S.  Waterman.  "Sequence  Alignment  and
          Penalty  Choice:  review  of concepts, case studies and
          implications".  Journal  of  Molecular  Biology,   234,
          (1993).

     M.S. Waterman and M. Eggert.   "A  New  Algorithm  for  Best
          Subsequence  Alignments  with  Application to tRNA-rRNA
          Comparisons".  Journal of Molecular Biology 197  (1987)
          723-728.

     M.S. Waterman and M. Vingron.  "Rapid and Accurate  Estimates
          of  Statistical  Significance  for  Sequence  Database
          Searches".  Proc. Natl. Acad.  Sci.  USA  91  (1994)
          4625-4628.

     M.S. Waterman and M. Vingron.  "Sequence Comparison Significance
          and  Poisson  Approximation".   Statistical  Science  2
          (1994) 367-381.

     M.S. Waterman.  Introduction to Computational Biology: Maps,
          Sequences  and  Genomes.   Chapman & Hall, London, 1995.
          ISBN 0-412-99391-0.



SEE ALSO

     seqaln-intro(1), pfitS(1).