pvsrlocalS



NAME

     pvsrlocalS - p-value or statistical  significance  of  self-
     repeat local alignments.


SYNOPSIS

     pvsrlocalS  file match mismatch alpha beta [ dis ] [ -s# ] [
     -d# ] [ -g# -p# ] [ flags ]


DESCRIPTION

     pvsrlocalS generates values of gamma and p, the parameters used
     to assess the statistical significance of similarity scores.
     From them it estimates successive probabilities (p-values) that
     at least one random sequence has a higher alignment score. By
     default, the McCaldon-Argos amino acid distribution is used to
     simulate two iid (independent, identically distributed)
     sequences from which gamma and p are determined; a different
     letter frequency distribution for the input file can be
     supplied as dis. The format of dis is described in
     distribution-file(5).
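
     As an illustration of what simulating an iid sequence from a
     letter frequency distribution involves, here is a minimal Python
     sketch. It is not the program's own code. The function name and
     the toy frequency table are hypothetical, and the table does not
     contain the McCaldon-Argos values.

```python
import random

# Hypothetical toy letter frequencies for illustration only;
# these are NOT the McCaldon-Argos amino acid frequencies.
TOY_FREQUENCIES = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}


def simulate_iid_sequence(length, frequencies, rng):
    """Draw `length` letters independently from one fixed distribution."""
    letters = list(frequencies)
    weights = [frequencies[letter] for letter in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


# A seeded generator keeps the sketch reproducible.
rng = random.Random(0)
first = simulate_iid_sequence(50, TOY_FREQUENCIES, rng)
second = simulate_iid_sequence(50, TOY_FREQUENCIES, rng)
```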

     The values of gamma and p are used to estimate the  signifi-
     cance  of  the local alignment scores derived from comparing
     the sequence in file against itself with the scoring parame-
     ters described below.
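
     This page does not state the formula that links gamma and p to
     a p-value. The sketch below assumes the Poisson approximation
     described in the Waterman and Vingron papers listed under
     REFERENCES, in which the probability that the best random score
     exceeds t is about 1 - exp(-gamma * m * n * p^t) for sequences
     of lengths m and n. The function name is hypothetical, and
     treating a self-comparison as m = n = length of the sequence is
     an assumption, not a documented behavior of pvsrlocalS.

```python
import math


def poisson_p_value(score, gamma, p, m, n):
    """Approximate P(best random score > score).

    Assumes the Poisson approximation 1 - exp(-gamma * m * n * p**score);
    the exact form used by the program is not documented on this page.
    """
    expected_count = gamma * m * n * p ** score
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-expected_count)
```

     Under this assumption a higher score always gives a smaller
     p-value, because p lies between 0 and 1.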

     By default, gamma and p are determined from 1000 direct
     simulations. The -s# flag specifies that # simulations be
     performed instead. Declumping is not used unless specified with
     -d#, in which case # declumps are performed for each direct
     simulation. If declumping is used, the matrix size is set to a
     minimum of 900 by 900 so that the matrix is large enough for
     alignments not to intersect. For this reason, we recommend
     direct simulation without declumping with the current software.
     Without declumping, use at least 1000 direct simulations (i.e.,
     -s1000). If declumping is used, use at least 10 direct
     simulations with 300 declumps each (i.e., -s10 -d300).
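
     The options below mention a linear regression on the high
     scores from the simulations. One way such a fit could work is
     sketched here, assuming the model P(S <= t) = exp(-gamma * m *
     n * p^t), so that log(-log P) is linear in t with slope log(p)
     and intercept log(gamma * m * n). The function name, the
     plotting position rank/(N+1), and the trimming rule are
     assumptions made for illustration, not the program's documented
     procedure.

```python
import math


def fit_gamma_and_p(scores, m, n, use_all=False):
    """Estimate (gamma, p) by least squares on log(-log(empirical CDF))."""
    ordered = sorted(scores)
    count = len(ordered)
    # rank / (count + 1) keeps the empirical CDF strictly inside (0, 1).
    points = [
        (t, math.log(-math.log(rank / (count + 1))))
        for rank, t in enumerate(ordered, start=1)
    ]
    if not use_all:
        # Keep the middle 80%: drop the lowest and highest 10% of scores.
        trim = count // 10
        points = points[trim:count - trim]
    mean_t = sum(t for t, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # slope = log(p); intercept = log(gamma * m * n)
    return math.exp(intercept) / (m * n), math.exp(slope)
```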

     The sequence file file must be in our standard format or in
     Pearson/FASTA format. The first line holds the sequence name
     and serves as its description. Subsequent lines contain the
     sequence to be used. The sequence itself may contain blanks,
     returns, and other whitespace for readability. The sequence
     terminates at end-of-file, when `//' is read in our format, or
     when `>' is read in the Pearson format.
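
     The reading rules above can be sketched as a small Python
     parser. This is an illustration under stated assumptions, not
     the program's reader: the function name is hypothetical, a
     leading `>' on the name line is stripped, and a `>' is treated
     as a terminator only at the start of a line. See
     sequence-file(5) for the authoritative format.

```python
def parse_sequence(text):
    """Return (name, sequence) for the first sequence found in `text`."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty sequence file")
    # First line: the sequence name / description.
    name = lines[0].lstrip(">").strip()
    residues = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # Pearson format: the next record begins here.
        body, terminator, _ = line.partition("//")
        # Whitespace inside the sequence is only for readability.
        residues.append("".join(body.split()))
        if terminator:
            break  # Standard format: `//' ends the sequence.
    return name, "".join(residues)
```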


OPTIONS

     file      the sequence file, whose length is used for deter-
               mination of gamma and p.

     matrix    is a lower-triangular penalty matrix with 26 rows
               and columns, corresponding to the 26 letters of
               the alphabet.  This allows matrices to be built
               for protein, DNA, and RNA sequences, depending on
               the letters used.  The most common use of this
               matrix is to compare amino acid sequences in
               proteins, but the flexibility of the matrix input
               allows other types of sequences to be compared.
               The matrix file name ends in ".mat"; omit this
               suffix when naming the matrix.  If the matrix is
               not found in the current directory, the directory
               given by the environment variable MATDIR is
               searched.


     csub      is the lower limit for conservative substitutions,
               which  are non-matching substitutions printed with
               a `:' in alignment output.


     alpha     is the amount to subtract for the first letter of
               an insertion or deletion sequence (indel).

     beta      is the amount to subtract for each subsequent
               letter in an indel.  For example, for a five-
               letter indel (k = 5), alpha + beta * (k - 1) =
               alpha + 4 * beta is subtracted from the score.


     dis       is a distribution file giving the frequencies of
               the letters in file, used in place of the
               McCaldon-Argos distribution.

     +P        Print a table of probabilities  after  the  linear
               regression  on  high  scores.   Three  columns are
               printed: the score; the observed probability  that
               a  score  is less than or equal to this score from
               the simulations;  and  the  predicted  probability
               that  a score is less than or equal to this score,
               using gamma and p. Only those scores used  in  the
               linear  regression  are reported in this table; if
               only the middle 80% of scores are used, only  they
               are reported.

     -P        Do not print the above-mentioned table of observed
               and predicted probabilities.  This is the default.

     -a        Use all high scores, rather than the  middle  80%.
               This  could be used when the number of simulations
               is small.  However, we recommend taking the middle
               80%  of  scores  from at least 1000 direct simula-
               tions, which is the default behavior.

     -s#       Perform # direct  simulations  (i.e.,  generate  #
               sequence  pairs).   Default is 1000 direct simula-
               tions.

     -d#       Perform # declumps  for  each  direct  simulation.
               Not   recommended   at  present.   Default  is  no
               declumps.

     -g#       Specify the value of gamma for subsequent runs.
               -p# must be specified at the same time.

     -p#       Specify the value of p for subsequent runs.  -g#
               must be specified at the same time.
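
     The indel penalty defined by alpha and beta above can be written
     as a one-line function. The Python sketch below only restates
     that formula; the function name is hypothetical.

```python
def indel_penalty(alpha, beta, k):
    """Amount subtracted from the score for an indel of k letters."""
    if k < 1:
        raise ValueError("indel length k must be at least 1")
    # alpha covers the first letter, beta each of the remaining k - 1.
    return alpha + beta * (k - 1)
```

     For instance, with alpha = 3 and beta = 1, a five-letter indel
     costs 3 + 1 * 4 = 7.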


REFERENCES

     T.F. Smith and M.S.  Waterman.   "Identification  of  Common
          Molecular Subsequences".  Journal of Molecular Biology,
          147, (1981) 195.

     M.  Vingron  and  M.S.  Waterman.  "Sequence  Alignment  and
          Penalty  Choice:  review  of concepts, case studies and
          implications".  Journal  of  Molecular  Biology,   234,
          (1993).

     M.S. Waterman and M. Eggert.   "A  New  Algorithm  for  Best
          Subsequence  Alignments  with  Application to tRNA-rRNA
          Comparisons".  Journal of Molecular Biology 197  (1987)
          723-728.

     M.S. Waterman and M. Vingron. Rapid and  accurate  estimates
          of   statistical  significance  for  sequence  database
          searches.   Proc.  Natl.   Acad.   Sci.   USA   (1994).
          91:4625-28.

     M.S. Waterman and M. Vingron. Sequence comparison significance
          and Poisson approximation.  Statistical Science (1994).
          9:367-81.

     M.S. Waterman.  Introduction to Computational Biology: Maps,
          sequences and genomes.  Chapman & Hall.  London: 1995.
          ISBN 0-412-99391-0.


SEE ALSO

     seqaln-intro(1), localS(1), mlocalS(1), srlocalS(1),  msrlo-
     calS(1),  trlocalS(1),  mtrlocalS(1), mpvlocalS(1), mpvsrlo-
     calS(1), sequence-file(5), distribution-file(5).