pvsrlocalS
NAME
pvsrlocalS - p-value or statistical significance of self-
repeat local alignments.
SYNOPSIS
pvsrlocalS file match mismatch alpha beta [ dis ] [ -s# ] [
-d# ] [ -g# -p# ] [ flags ]
DESCRIPTION
pvsrlocalS generates values of gamma and p, the parameters
used to assess the statistical significance of similarity
scores, that is, to estimate the probability (p-value) that
at least one random sequence has a higher alignment score.
By default, the McCaldon-Argos amino acid distribution is
used to simulate two iid sequences to determine gamma and p;
however, another letter frequency distribution can be
supplied as dis. The format of dis is described in
distribution-file(5).
The values of gamma and p are used to estimate the
significance of the local alignment scores derived from
comparing the sequence in file against itself with the
scoring parameters described below.
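As a rough illustration of how gamma and p translate into a
p-value, the following Python sketch assumes the Poisson
approximation form given in the Waterman and Vingron papers
listed under REFERENCES, P(score >= t) ~ 1 - exp(-gamma * n * n * p^t),
where n is the sequence length. The function name, the use of
n * n for a self-comparison, and the example values are
assumptions made for illustration; they are not taken from the
pvsrlocalS source.

```python
import math


def self_alignment_p_value(score, gamma, p, length):
    """Approximate the chance that a random self-comparison reaches `score`.

    Assumes the Poisson approximation 1 - exp(-gamma * n * n * p**score),
    with n = length. This form is an assumption, not the tool's own code.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if gamma < 0.0 or length < 0:
        raise ValueError("gamma and length must be non-negative")
    expected_hits = gamma * length * length * (p ** score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the function.
    print(self_alignment_p_value(score=20, gamma=0.1, p=0.5, length=100))
```

Under this assumed form, a higher score always yields a smaller
p-value, and a longer sequence yields a larger one for the same
score.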
By default, direct simulation of 1000 repeated sequences is
used to determine gamma and p. The -s# flag specifies that #
simulations be performed instead. Declumping is not used
unless specified with -d#, in which case # declumps are
performed for each direct simulation. If declumping is used,
the matrix size is set to a minimum of 900 by 900 so the
matrix will be large enough that alignments do not
intersect. For this reason, we recommend simply using direct
simulation with no declumping with the current software. We
also recommend a minimum of 1000 direct simulations with no
declumping (i.e., -s1000), or 10 direct simulations with
declumping 300 times per direct simulation (i.e., -s10
-d300).
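One way gamma and p could be recovered from a set of simulated
high scores is sketched below in Python. It assumes the model
P(score <= t) = exp(-gamma * N * p^t), so that log(-log F(t)) is
linear in t with slope log(p) and intercept log(gamma * N). The
function name, the rank / (k + 1) estimate of F, and the way the
middle 80% of scores is selected are assumptions; this page does
not specify how pvsrlocalS carries out its regression.

```python
import math


def fit_gamma_p(scores, size, use_all=False):
    """Least-squares fit of log(-log F(t)) = log(gamma * size) + t * log(p).

    `size` is the product of the two sequence lengths (n * n for a
    self-comparison). `use_all=True` keeps every score; otherwise only
    the middle 80% of the sorted scores enter the regression.
    """
    ordered = sorted(scores)
    k = len(ordered)
    points = []
    for rank, t in enumerate(ordered, start=1):
        cdf = rank / (k + 1)  # empirical P(score <= t), never exactly 0 or 1
        points.append((t, math.log(-math.log(cdf))))
    if not use_all:
        trim = k // 10  # drop the lowest and highest 10% of scores
        points = points[trim:k - trim]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two scores for a regression")
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0.0:
        raise ValueError("all scores are identical; slope is undefined")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept) / size, math.exp(slope)  # (gamma, p)
```

With integer scores many observations tie at the same value; the
sketch simply lets tied scores carry different ranks, which a
more careful implementation might handle by grouping them.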
The sequence file file may be in our standard format or in
the Pearson/FASTA format. The first line holds the sequence
name and serves as a description. Subsequent lines contain
the sequence to be used. The sequence itself may contain
blanks, returns, and other whitespace for readability. The
sequence terminates at end-of-file, or when `//' is read in
our format or `>' is read in the Pearson format.
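The reading rules above can be sketched in Python as follows.
This is a minimal reader under stated assumptions: the function
name is hypothetical, a leading `>' on the name line is stripped,
and either terminator is honored regardless of which format the
file uses. The authoritative description of the format is in
sequence-file(5).

```python
def read_first_sequence(text):
    """Return (name, sequence) for the first record in `text`.

    The first line is the name; the body runs until end-of-file, `//`,
    or `>`, whichever comes first. All whitespace in the body is dropped.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty sequence file")
    name = lines[0].lstrip(">").strip()
    body = "\n".join(lines[1:])
    cut = len(body)
    for terminator in ("//", ">"):
        position = body.find(terminator)
        if position != -1:
            cut = min(cut, position)
    # split() with no arguments removes blanks, tabs, and newlines alike.
    sequence = "".join(body[:cut].split())
    return name, sequence


if __name__ == "__main__":
    sample = ">demo protein\nMKV LA\nGG\n>next record\nAAA\n"
    print(read_first_sequence(sample))
```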
OPTIONS
file the sequence file, whose length is used for
determination of gamma and p.
matrix is a lower-diagonal penalty matrix with 26 rows
and columns, corresponding to the 26 letters of
the alphabet. This allows matrices to be built
for protein, DNA, and RNA sequences depending upon
the letters used. The most common use of this
matrix is to compare amino acid sequences in
proteins, but the flexibility of the matrix input
allows other types of sequences to be compared.
The matrix file name ends in ".mat"; this suffix
is omitted when naming the matrix. If the matrix
is not found in the current directory, the
directory given by the environment variable
MATDIR will be examined.
csub is the lower limit for conservative substitutions,
which are non-matching substitutions printed with
a `:' in alignment output.
alpha the amount to subtract for the first letter of an
insertion or deletion sequence (indel).
beta is the amount to subtract for each subsequent
letter in an indel. For example, for a five-letter
indel (k = 5), alpha + beta * (k - 1) =
alpha + 4 * beta will be subtracted from the
score.
dis distribution file giving the letter frequencies
of file, used in place of the McCaldon-Argos
distribution.
+P Print a table of probabilities after the linear
regression on high scores. Three columns are
printed: the score; the observed probability that
a score is less than or equal to this score from
the simulations; and the predicted probability
that a score is less than or equal to this score,
using gamma and p. Only those scores used in the
linear regression are reported in this table; if
only the middle 80% of scores are used, only they
are reported.
-P Do not print the above-mentioned table of observed
and predicted probabilities. This is the default.
-a Use all high scores, rather than the middle 80%.
This could be used when the number of simulations
is small. However, we recommend taking the middle
80% of scores from at least 1000 direct
simulations, which is the default behavior.
-s# Perform # direct simulations (i.e., generate #
sequence pairs). Default is 1000 direct
simulations.
-d# Perform # declumps for each direct simulation.
Not recommended at present. Default is no
declumps.
-g# Specify value of gamma for subsequent runs. Must
also specify p at the same time.
-p# Specify value of p for subsequent runs. Must also
specify gamma at the same time.
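The indel penalty defined under alpha and beta can be written
as a small Python function. The function name and the numeric
values in the usage lines are illustrative assumptions only.

```python
def indel_penalty(length, alpha, beta):
    """Amount subtracted from the score for an indel of `length` letters.

    alpha is charged for the first letter and beta for each of the
    remaining (length - 1) letters.
    """
    if length < 1:
        raise ValueError("an indel contains at least one letter")
    return alpha + beta * (length - 1)


if __name__ == "__main__":
    # Hypothetical penalties: alpha = 2.0, beta = 0.5.
    print(indel_penalty(1, 2.0, 0.5))  # single-letter indel costs alpha only
    print(indel_penalty(5, 2.0, 0.5))  # alpha + 4 * beta
```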
REFERENCES
T.F. Smith and M.S. Waterman. "Identification of Common
Molecular Subsequences". Journal of Molecular Biology,
147, (1981) 195.
M. Vingron and M.S. Waterman. "Sequence Alignment and
Penalty Choice: review of concepts, case studies and
implications". Journal of Molecular Biology, 234,
(1993).
M.S. Waterman and M. Eggert. "A New Algorithm for Best
Subsequence Alignments with Application to tRNA-rRNA
Comparisons". Journal of Molecular Biology 197 (1987)
723-728.
M.S. Waterman and M. Vingron. Rapid and accurate estimates
of statistical significance for sequence database
searches. Proc. Natl. Acad. Sci. USA (1994).
91:4625-28.
M.S. Waterman and M. Vingron. Sequence comparison
significance and Poisson approximation. Statistical Science
(1994). 9:367-81.
M.S. Waterman. Introduction to Computational Biology: Maps,
sequences and genomes. Chapman & Hall. London: 1995. ISBN
0-412-99391-0.
SEE ALSO
seqaln-intro(1), localS(1), mlocalS(1), srlocalS(1),
msrlocalS(1), trlocalS(1), mtrlocalS(1), mpvlocalS(1),
mpvsrlocalS(1), sequence-file(5), distribution-file(5).