tacg(1) Version 3.0-Beta tacg(1)
[-c h H l L q Q s S v] [--dam] [--dcm] [-b begin] [-e end] [-C 0-12] [--cost units/$] [-D 0-4] [-f 0|1] [-F 0-3] [-g LoCutOff(,HiCutOff)] [-G bin_size,X|Y|L] [-i (--idonly) 0-2] [-m total_min_hits] [-M total_Max_hits] [-n 3-8] [--notics] [-o 0|1|3|5] [-O 1-6(x),min_ORF] [-p Name,Pattern,Err] [-P NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB] [-r (--regex) 'Label:RegexPat' || 'FILE:FileOfRegexPatterns'] [-R alt_Rebase | alt_Matrix] [--raw] [--rules 'NameA:min:Max[&|]NameB:min:Max[&|]..'] [--silent] [--strands] [-T 0|1|3|6,1|3] [-w 1|width] [-V 1-3] [-W #] [-x (--explicit) 'NameA(,=),NameB..(,C)'] [-X (--extract) b,e,[0|1]] [-# %_Match_Cutoff]
NB: Most flags are the same as in earlier
versions with the exception of these changes:
-C is current with the Codon tables from CUTG/NCBI (13 tables).
-q the default is now quiet (doesn't send UDP info back).
-R is also used to specify an alternative Matrix file.
-T and -t (co-translation with the Linear Map) have been
merged.
-V now has 3 levels of verbosity.
-w now has a special case of '1' to generate 1 line output.
-x (used to be -r) now has 2 additional optional modifiers,
'=' and ',C'
and these additions:
--cost filters REs based on units/$ of cost (from a recent NEB
catalog)
--dam simulates Dam methylation of the DNA. The file rebase.dam
contains only those REs which are Dam-sensitive.
--dcm simulates Dcm methylation of the DNA. The file rebase.dcm
contains only those REs which are
Dcm-sensitive.
-i (--idonly) controls the amount of output for those sequences
which did not have any matches, ranging from full output regardless of
hits to only the SEQ id of those that did have hits.
--notics removes the tics under the strands for maximum compaction of output
-Q UNquiet; sends UDP data back to me to let me know what options
have been used, so I can adjust docs and options to make it easier to use.
-r (--regex) searches for regular expressions with built-in
IUPAC translations (y->[ct]).
--raw allows tacg to consider all input to be IUPAC sequence for processing
file fragments or editor buffers
--rules allows users to compose complex logical phrases of matches,
with grammar including logical ANDs and ORs and per-pattern minimum and
maximum limits, over the whole sequence or within a Sliding Window (see
below)
--silent searches for possible SILENT RE sites (those that won't
cause the translation to change).
--strands sets the number of strands to display in the Linear Map
-W (--slidwin) defines the sliding window in which the --rules
and min/Max values are in effect.
-X (--extract) eXtracts the sequence around the pattern matched.
-# is used to set the cutoff for the Matrix match.
Unless told not to via the --raw flag, tacg now automagically translates most ASCII formats (Genbank, FASTA, etc.) via Jim Knight's SEQIO library and handles multiple sequences at one time, internally converting 'u's to 't's. It considers both strands at the same time, so you don't have to manually reverse complement the sequence, and by default it accepts all IUPAC degeneracies (y r m k w s b d h v), performing all possible operations on that sequence. How it treats degeneracies in the input sequence depends on the -D flag (see below). It either strips all letters other than a c g t and analyzes the sequence as 'pure' using a fast incremental hashing algorithm, or it treats the sequence as degenerate and analyzes it via a slower algorithm. By default, it treats the sequence as 'pure' unless it detects an IUPAC degeneracy, in which case it adaptively switches back and forth between the fast and slow routines. See also RELATED PROGRAMS at bottom.
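The input handling described above (u-to-t conversion, both strands, IUPAC degeneracies) can be illustrated with a short Python sketch. This is not tacg's code; the function names `normalize`, `reverse_complement` and `is_pure` are hypothetical, and the only facts relied on are the standard IUPAC complement pairs.

```python
# Illustrative sketch only; tacg itself is written in C and works differently inside.

# Standard IUPAC complements: a<->t, c<->g, y<->r, m<->k, b<->v, d<->h; w, s, n are self-complementary.
IUPAC_COMPLEMENT = str.maketrans("acgtyrmkwsbdhvn", "tgcarykmwsvhdbn")


def normalize(seq: str) -> str:
    """Lower-case the sequence and convert RNA 'u' to DNA 't'."""
    return seq.lower().replace("u", "t")


def reverse_complement(seq: str) -> str:
    """Return the bottom strand read 5'->3', honoring IUPAC codes."""
    return normalize(seq).translate(IUPAC_COMPLEMENT)[::-1]


def is_pure(seq: str) -> bool:
    """True if the normalized sequence contains only a, c, g and t."""
    return set(normalize(seq)) <= set("acgt")


if __name__ == "__main__":
    print(reverse_complement("gaattc"))  # EcoRI site is its own reverse complement
    print(reverse_complement("AUGY"))
    print(is_pure("acgy"))
```

A 'pure' test like `is_pure` is the kind of check that would decide between a fast exact-match path and a slower degenerate path; how tacg actually makes that switch is not shown here.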
NB: tacg can produce lots of output; while it's possible to pipe direct to lp/lpr, you'll probably regret it.
Flag | Value | Explanation |
---|---|---|
-b |
|
select the beginning of a subsequence from a larger sequence file; 1* for 1st base of sequence. In the Linear Map output, the upper label indicates numbering from beginning of subsequence; the lower label indicates numbering from the beginning of the entire sequence. The SMALLEST SEQUENCE that tacg can handle is 4 bases (10 for the ladder map (-l)). This allows analysis of primers and linkers. |
-e |
|
select the end of a subsequence from a larger sequence file; 0* for last base of sequence. This subsequence can also be made circular via the -f flag. The largest sequence that tacg can handle depends on how much memory you have, although for practical purposes, assume 1 billion bases. |
--cost |
|
select REs by their cost (units/$; >100 is cheap, <10 is very expensive) |
-c | order the output by # of cuts/fragments by each RE (Strider style) and thence alphabetically; otherwise output is by order of appearance in the REBASE file. | |
--dam | simulate cutting in the presence of Dam methylase (GmATC). rebase.dam contains all REs that are Dam-sensitive. | |
--dcm | simulate cutting in the presence of Dcm methylase (CmCWGG). rebase.dcm contains all REs that are Dcm-sensitive. | |
-C | 0*-12 | Codon Usage table to use for translation:
0 - Standard
1 - Vert_Mito
2 - Yeast_Mito
3 - Mold_Mito
4 - Invert_Mito
5 - Ciliate_Mito
6 - Echino_Mito
7 - Euplotid_Nuclear
8 - Bacterial
9 - Alt_Yeast
10 - Ascidian_Mito
11 - Flatworm
12 - Blepharisma |
-D | 0-4 | Degeneracy flag - controls input and analysis of degenerate
sequence input where:
0 FORCES excl'n of degens in seq; only 'acgtu' accepted
1* cut as NONdegen unless degen's found; then cut as '-D3'
2 degen's OK; ignore in KEY, but match outside of KEY
3 degen's OK; expand in KEY, find only EXACT matches
4 degen's OK; expand in KEY, find ALL POSSIBLE matches
The pattern matching is adaptive; given a small window of nondegenerate sequence, the algorithm will match very fast; if degenerate sequence is detected, it will switch to a slower, iterative approach. This results in a slowdown that is proportional to the degeneracy in most cases. If you have long sequences of 'n's (inserted as placeholders, for instance), -D2 may be a better choice. In all cases, as soon as degeneracy of the KEY hexamer exceeds a compiled-in limit (usually 256-fold degeneracy), the KEY is skipped. |
-f | 0|1* | form (or topology) of DNA - 0 (zero) for circular; 1 for linear. This flag also operates on subsequences. |
-F | 0*-3 | print/sort Fragments; 0*-omit; 1-unsorted; 2-sorted; 3-both. |
-g | Lo i#(,Hi i#) | specify if you want a pseudo-graphic gel map, with a low end cutoff of Lo# bases (converted to an integer multiple of 10), and (if present), a high end cutoff of Hi#. In Ver <2, the Lo# was restricted to 10 or 100; now it can be any integer power of 10 (10, 100, 1000, etc), as can the Hi#. If Hi# is omitted or is larger than the sequence length, it takes the value of the sequence length. See examples below. |
-G | binsize,X|Y|L | Graphic data output, so (mis)named for its original use, where:
binsize = # bases for which hits should be pooled
X|Y|L indicates whether the BaseBins should be on the X or Y axis or in 'Long' form, where Basebins (as X) and Name data (as Y) are reiterated in 2 columns for all the Named patterns:
X:
BaseBins 1000 2000 3000 ..
NameA 0 4 0 ..
NameB 22 57 98 .. (#s = matches per bin)
NameC 1 0 0 ..
.
Y:
BaseBins NameA NameB NameC ..
1000 0 22 1 ..
2000 4 57 0 ..
3000 0 98 0 ..
.
L:
Basebins NameA
1000 0
2000 4
. .
Basebins NameB
1000 22
2000 57
. .
This addresses some missing features: it allows the export of hit data for the selected Names so that you can manipulate it as you wish, as an alternative to rewriting the program's code. Like other output, it is streamed to stdout, so it's not wise to mix -G with other analyses; the lines generated (esp. w/ the X option) can be quite long and are NOT governed by the -w flag. |
-h | brief help page (condensed man page). | |
-H | generates partial HTML tags for the Web version. Not useful in the command line version. | |
-i (--idonly) |
{0|1*|2} | controls output for sequences that have no hits:
0 - ID line and normal output printed regardless of hits
1 - (default) ID line and normal output are printed ONLY IF there are hits
2 - ONLY ID line is printed if there are hits |
-l | specify if you want a ladder map of selected enzymes, much like the GCG MAPPLOT output. Also appends a summary of those enzymes that match few times. This last # is length-sensitive in the distributed source code, but it is easy to set another default as a '#define' in 'tacg.h'. | |
-L | specify if you WANT a Linear map a la Strider or GCG's MAP (but better - tacg indicates the actual CUT site as opposed to the 1st base in the pattern as do other mapping programs). In Ver 3.x, the Linear Map only includes those REs or patterns which pass the filtering criteria set via the -n, -o, -m, -M, --cost, etc. | |
--strands | {1|2*} | specifies how many strands get printed in the linear map. Allows you to slightly compact the linear map, especially when used with the --notics flag below |
--notics | do NOT print the tic marks below the DNA in the linear map. Allows you to slightly compact the linear map, especially when used with the --strands flag above. | |
-m | i# | select enzyme by minimum # cuts in the whole sequence. Default is no minimum (ie ALL). Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map -L, and ladder map (-l) flags. |
-M | i# | select enzyme by Maximum # cuts in the whole sequence. Default is 32,000. Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map -L, and ladder map (-l) flags. |
-n | 3*-8 | select enzymes by magnitude of recognition site; 3 = all, 5 = 5,6,7,8... n's don't count, other degeneracies are summed ie: tgca=4, tgyrca=5, tgcnnngca=6, tannnnnnnnnnta=4 |
-o | 0|1*|3|5 | select enzymes by overhang generated; 5 = 5', 3 = 3', 0 for blunt, 1 for all |
-O | 1-6(x),MinSiz | ORF analysis where any frame combination can be specified ('126'
or '45' or '13456') along with the minimum ORF Size you want to detect. Produces
either a single line (if -w1 is specified)
or a block, (with the Amino Acids wrapped at the specified width) for each
ORF including:
NB: Because the output can be in a single line for each ORF, other line-oriented pattern-matching tools (grep, perl, awk) can examine the ORFs generated for matching regular expressions (see the GNU grep man page for an explanation of regular expressions). In this way you can search all 6 frames of >=MinSize AAs for whatever pattern interests you.
Examples:
-O 145x,25 (search frames 1,4,5 with extended AA information on all ORFs > 25 AAs)
-O 2,66 (search frame 2 with a min ORF size of 66 AAs) |
-p | Name,Pat,Err | allows entry of search patterns from the command line, where
Name = name by which pattern is labeled (<=1 chars)
Pat = <30 IUPAC characters (ie. gryttcnnngt)
Err = max # of errors that are tolerated (<=5)
Also logs the patterns you've entered into a file tacg.patterns in the correct format for later copying to a REBASE file. Can enter up to 10 of these at a time. Patterns should consist of <=30 IUPAC bases. Long sequences with large errors will cause SUBSTANTIAL cpu usage in validating the patterns. |
-P | NameA,
[+-][lg] Dist_Lo [-Dist_Hi], NameB MBQ |
Pattern proximity matching to search for spatial relationships
between factors, 2 at a time (up to a total of 10).
NameA and NameB must be in a REBASE file, either the default rebase.data or another specified by the -R flag, and are case INsensitive. NameA/B patterns can be composed of any IUPAC bases and ERRORs can be specified in the REBASE entry, ie:
Pit1 5 WWTATNCATW 0 2 ! a Pit1 site with 2 errors
Tataa 4 TATAAWWWW 0 1 ! a Tataa site with 1 error
+ NameA is DOWNSTREAM of NameB (default is either)
Examples:
-PPit1,-30-2500,Tataa
|
-q | Be quiet. DISallows sending diagnostic udp info back to author, now the default behavior (so unless you TELL the program to send data back, it won't). | |
-Q | Be UNquiet. Allows the program to send diagnostic udp info back to author. In version 2.x, this was the default behavior, but it has served its purpose, so unless you WANT me to log your usage, I won't. | |
--raw | tells tacg to consider ALL input as valid sequence (as with version 2), instead of using SEQIO to parse the input as a standard sequence format. Useful for analyzing file fragments or editor buffers, which may be missing valid format. Note that specifying this flag will tell tacg to eat headers, comments, etc as well as sequence, if it encounters them. ALL IUPAC degeneracies will be analyzed. | |
--rules | 'ruleA[&|] ruleB[&|] ruleC[&|]..' MBQ |
allows you to compose complex logic rules to determine whether a sequence matches your profile, using logical ANDs and ORs. Parens () enforce logic; otherwise expressions are evaluated left -> right. Each rule in the phrase can be either a single pattern definition (NameA:m:M, where m = the minimum # of matches, M = Maximum # of matches allowed in the sequence or in a sliding window (-W) of the sequence) or a collection of them i.e. (NameA:1:7&(NameB:4:17|NameC:23:99)). |
-r (--regex) |
'Label:RegexPat'
or 'FILE:RegexFile' MBQ |
searches for regular expressions entered from the commandline
using the 1st option or searches for the regular expressions read from
a file using the 2nd option. The regular expression syntax can be formal
regex patterns or the IUPAC'ed version thereof; the translation from one
to the other is handled automatically. ie:
gy(tt|gc)nc{2,3}m -> g[ct]\(tt\|gc\).c\{2,3\}[ca]
When trying to specify a file, the term FILE must be in CAPs (so don't use 'FILE' as a pattern name). Specific regex patterns from the file can be specified by using the -x flag to name them explicitly. |
-R | REBASE or
MATRIX file |
specifies an alternative Restriction Enzyme file (in GCG format) or
Matrix file (in TRANSFAC format) to use. The latest REBASE files are available
via FTP or via WWW.
The latest TRANSFAC files are also available via FTP or WWW. There are several such files included in the std distribution:
|
-s | prints the summary of site information, describing how many times each enzyme or pattern matches the sequence. Those that cut zero times are shown first. In Ver >=2, only those that match at least once are shown in the second part (the 0-matchers are not reiterated) | |
-S | prints the actual match Sites in tabular form. | |
--silent | requests that the NA sequence be translated starting at the 1st base, in frame 1 (use -b to shift the starting base), according to the Codon Translation table specified with -C, then reverse translated, using the same table, using all the possible degeneracies, then restrict that (quite) degenerate sequence and show all the REs that will match it. You should use the -L and -T flags to generate the linear map which shows both the REs and the cotranslated sequence to verify that all is as it should be. NB: Depending on Codon Table, some AAs are not reversibly translatable. Using the standard table, Arg (=mgn), Leu (=ytn), and Ser (=wsn) cannot be Forward translated from their Reverse translation. | |
-T | 0*|1|3|6,1|3 | In the Linear map, beneath the DNA sequence, include the translated protein in
0*, 1, 3(= frames 123), or 6 (=123456) frames of Translation
with 1 or 3 letter codes.
ie. -T 3,3 (includes frames 1,2,3 with 3 letter labels)
-T 6,1 (includes frames 1,2,3,4,5,6, with 1 letter labels) |
-v | asks for program version (there may be multiple versions of the same functional program to track its migration). | |
-V | 1-3 | Verbose - requests all kinds of diagnostic info to be spat to the screen. May be useful in diagnosing why tacg did not behave as expected..but maybe not. Higher numbers mean more output and are generally downwardly inclusive. |
-w | output width in bp's (must be between 60* and 210, truncated to a # exactly divisible by 15; '-w 100' will be interpreted as '-w 90'). Actual printed output will be about 20 characters wider. Also applies to output of the ladder and gel maps, so if you're trying to get more accuracy and your output device can display small fonts, you may want to use this flag to widen the output. If you want as much output on one line as possible for external parsing/analysis, specify -w 1. | |
-W | width of Sliding Window in bp's. Used primarily with the '--rules' flag above. If it is not specified, it is assumed that the window under consideration is the whole sequence. Most analyses other than '--rules' ignore this setting. | |
-x | 'NameA(,=), NameB, NameC, NameD,...(,C)' |
used to explicitly name those enzymes or patterns to be used
in the analysis (up to a maximum of 15). Case INsensitive (HindIII=hindiii=HinDiIi),
but the name HAS to be spelled exactly like the entry in the REBASE or
MATRIX file with no spaces (HindIII != Hind III != Hind3).
The ',=' tag appended to a name indicates that it is the tagged RE in a Hookey/AFLP analysis; only those fragments that have at least one end generated by the tagged RE will be shown. This has been shown to be useful in AFLP analysis. The trailing ',C', if added, requests a combined digestion using all the REs specified with this flag.
Examples:
-xHindIII,BamHI,NruI,C requests data for these REs both individually, and combined.
-x EcoRI=,MseI,Hinf
|
-X (--extract) |
b,e,[0|1] | eXtracts the sequence around the pattern matched, from b bases preceding to e bases following the MIDDLE of the pattern if it is a normal pattern, or the START of the pattern if it is a regular expression. If the pattern is found in the bottom strand AND the last field = 1, the sequence is reverse-complemented before it's extracted so all patterns are in the same orientation; if the last field = 0, it is NOT reverse-complemented. In any event, the sequences are FASTA-formatted on output. |
-# | % CutOff | The percentage of the optimal matrix score that you will accept as
a match. ie. if the matrix (as below) was 10 bases long, and had a maximum
score of 69 (scoring a 100% match at each position as '1'), then if you
indicated a -# 75, you would accept a score of 51.75 (69 x .75)
as a match.
Consensus a t g g c y t r g g
Position 1 2 3 4 5 6 7 8 9 10
a 8 0 1 1 1 0 1 4 0 0
c 1 3 1 0 9 6 0 0 2 0
g 1 0 8 7 0 0 0 6 7 10
t 0 7 0 2 0 4 9 0 1 0
Sum of Max (bold) = 69 |
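The cutoff arithmetic in the -# entry above can be sketched in Python. This is a minimal illustration, not tacg's scoring code: it assumes the maximum score is the sum of the per-position maxima, that a window's score is the sum of the counts for the base seen at each position, and that a window matches when its score reaches the cutoff. The three-position matrix `TOY` and all function names are hypothetical.

```python
# Illustrative sketch of a weight-matrix cutoff; names and the toy matrix are assumptions.

def max_matrix_score(matrix: dict) -> int:
    """Sum of the highest count at each position (each column of the matrix)."""
    return sum(max(column) for column in zip(*matrix.values()))


def cutoff_score(max_score: float, percent: float) -> float:
    """Score a window must reach to count as a match at the given percentage."""
    return max_score * percent / 100


def window_score(matrix: dict, window: str) -> int:
    """Score of one sequence window: the count for the observed base at each position."""
    return sum(matrix[base][i] for i, base in enumerate(window.lower()))


# Hypothetical 3-position matrix: one list of counts per base.
TOY = {"a": [8, 0, 1], "c": [1, 3, 1], "g": [1, 0, 8], "t": [0, 7, 0]}

if __name__ == "__main__":
    print(cutoff_score(69, 75))                      # the man page's own numbers
    limit = cutoff_score(max_matrix_score(TOY), 75)  # 75% of the toy maximum
    for window in ("atg", "ctg"):
        print(window, window_score(TOY, window) >= limit)
```

With the figures quoted in the entry, `cutoff_score(69, 75)` gives the same 51.75 threshold.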
However, if an external program IS needed for format interconversion, I also strongly recommend Don Gilbert's excellent 'readseq' program (available in source or executable via FTP). Why recommend readseq when I've used SEQIO? SEQIO is a great library of functions to use in other programs, but readseq is easier to use for stand-alone, interactive use, chiefly due to a more standard interface. Both are scriptable; for scripting use, it's a toss-up.
You can also use the paging utility 'less' to move thru your sequence file and use its marking and piping facility to punt the sequence of interest to 'tacg'. Many editors also allow piping a selection of text to an external program and inclusion of the result into another window, notably nedit, crisp, and the ubiquitous, omnipotent emacs and its GUI doppelganger xemacs, among others.
Much of tacg's output benefits from wider-than-normal printing. The '-w#' flag allows output up to about 230 characters wide; however, to print this without wrapping, you need to print in landscape mode, using very small fonts. A number of unix printing utilities allow you to do this, notably genscript aka GNU Enscript, residing in the GNU repository.
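To estimate how wide a page you need, the width rule given for the -w flag (a value of 1, or 60 to 210 truncated to a multiple of 15, with printed lines about 20 characters wider) can be sketched as follows. The function names are hypothetical, and raising an error for out-of-range values is an assumption; what tacg really does with such values is not documented here.

```python
# Illustrative sketch of the documented -w rule; not taken from tacg's source.

def effective_width(requested: int) -> int:
    """Return the sequence width implied by '-w requested'."""
    if requested == 1:
        return 1  # special case: as much output on one line as possible
    if not 60 <= requested <= 210:
        # Assumption: the real program may clamp or warn instead of failing.
        raise ValueError("width must be 1 or between 60 and 210")
    return requested - requested % 15  # truncate to a multiple of 15


def approx_printed_width(requested: int) -> int:
    """Approximate printed line width: about 20 characters more than the sequence width."""
    return effective_width(requested) + 20


if __name__ == "__main__":
    for w in (60, 100, 210):
        print(w, effective_width(w), approx_printed_width(w))
```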
Used alone:
tacg -f0 -n5 -T3,1 -sL -F3 -g 100 <input.seq.file >output.seq.file
Translation: read sequence from input.seq.file and analyze it as circular (-f0), with 5+ cutters (-n5), returning both site info and Linear map (-sL) as well as sorted and unsorted fragment data (-F3) and do 1,2,3 frame translation w/ 1 letter codes (-T3,1) on the linear map, writing the output to output.seq.file. Also, include a pseudo gel diagram for those enzymes that pass the filtering, with a low end cutoff of 100 bases (-g100).
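The -n5 filter used above selects enzymes by the "magnitude" of their recognition site, for which the -n entry gives four worked values (tgca=4, tgyrca=5, tgcnnngca=6, tannnnnnnnnnta=4). The Python sketch below reproduces those four values under an assumed weighting: each position contributes log2(4/k)/2, where k is the number of bases the IUPAC code stands for, so a plain base counts 1, a twofold code 0.5 and n counts 0. The weight this gives threefold codes (b, d, h, v) is a guess, and tacg's own rule may differ.

```python
import math

# Number of bases each IUPAC code stands for.
DEGENERACY = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "y": 2, "r": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def magnitude(pattern: str) -> float:
    """Assumed site magnitude: 1 per plain base, 0.5 per twofold code, 0 per n."""
    return sum(math.log2(4 / DEGENERACY[code]) / 2 for code in pattern.lower())


def passes_n_filter(pattern: str, n: int) -> bool:
    """True if a recognition site would survive '-n n' under the assumed weighting."""
    return magnitude(pattern) >= n


if __name__ == "__main__":
    for site in ("tgca", "tgyrca", "tgcnnngca", "tannnnnnnnnnta"):
        print(site, magnitude(site), passes_n_filter(site, 5))
```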
Used to search for Matrix Matches:
tacg -# 75 -R yeast.matrices -sS < yeast.chr_4 | less
Translation: search the file yeast.chr_4 for all the matrix definitions in the file 'yeast.matrices', with a cutoff of 75% of the maximum score possible, listing also the summary and the Sites information, piping the output to the pager less.
tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL < rprlPromo.seq > promo.map
Translation: search the sequence from the file rprlPromo.seq for the patterns labeled Pit1 and ap2, allowing 1 error each, and print the results (summary (-s), Sites (S), and the Linear Map (L)) 90 characters wide (-w90) to the file promo.map.
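What "1 error each" could mean for a pattern such as Pit1 (tatwcata) is sketched below in Python. It assumes an error is a single-base mismatch (no insertions or deletions), scans the top strand only, and reports 0-based start positions; none of this is taken from tacg's source, and the names `mismatches` and `find_hits` are hypothetical.

```python
# Illustrative sketch of IUPAC pattern matching with a mismatch allowance.

# Bases each IUPAC code can stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "y": "ct", "r": "ag", "m": "ac", "k": "gt", "w": "at", "s": "cg",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}


def mismatches(pattern: str, window: str) -> int:
    """Count positions where the sequence base is not allowed by the pattern code."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, window))


def find_hits(pattern: str, seq: str, max_err: int) -> list:
    """Return 0-based start positions where the pattern matches with <= max_err mismatches."""
    pattern, seq = pattern.lower(), seq.lower()
    k = len(pattern)
    return [
        i for i in range(len(seq) - k + 1)
        if mismatches(pattern, seq[i:i + k]) <= max_err
    ]


if __name__ == "__main__":
    print(find_hits("tatwcata", "ggtatacatagg", 0))  # exact site
    print(find_hits("tatwcata", "ggtatgcatagg", 0))  # w position holds g: no exact hit
    print(find_hits("tatwcata", "ggtatgcatagg", 1))  # found once one error is allowed
```

Each window is compared in full, so the cost grows with pattern length times sequence length.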
Used to search for a Regular Expression:
tacg --regex 'yadda:gm(tt|ag)ggn{3,5}tgy' -SL < some.seq | less
Translation: search the file some.seq for the regular expression gm(tt|ag)ggn{3,5}tgy, piping the information about Sites (-S) and the Linear Map (L) to the pager 'less'.
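The IUPAC-to-regex translation that --regex performs can be approximated in a few lines of Python. This sketch emits Python `re` syntax rather than the backslash-escaped form shown in the -r entry, maps n to '.' as that entry does, and simply substitutes every IUPAC letter wherever it appears, so a degenerate letter inside an explicit [...] class would be mistranslated. The function name `iupac_to_regex` is hypothetical.

```python
import re

# Character classes for the degenerate IUPAC codes; plain bases pass through unchanged.
IUPAC_CLASS = {
    "y": "[ct]", "r": "[ag]", "m": "[ca]", "k": "[gt]", "w": "[at]", "s": "[cg]",
    "b": "[cgt]", "d": "[agt]", "h": "[act]", "v": "[acg]", "n": ".",
}


def iupac_to_regex(pattern: str) -> str:
    """Replace each degenerate IUPAC letter with its character class."""
    return "".join(IUPAC_CLASS.get(ch, ch) for ch in pattern.lower())


if __name__ == "__main__":
    rx = iupac_to_regex("gm(tt|ag)ggn{3,5}tgy")
    print(rx)
    # Search a made-up sequence and report where the match starts.
    hit = re.search(rx, "ccgattggacgtgcaa")
    print(hit.start() if hit else None)
```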
Used to search the entire yeast 500bp Upstream Regulatory sequences (a file containing 6226 500 bp sequences) for matches to the MATa1 binding site (from TRANSFAC) :
tacg -R TRANSFAC.data -sScw1 -xMATa1 -#85 < utr5_sc_500.fasta > yeast.summary
Translation: translate each of the FASTA formatted entries in the input file utr5_sc_500.fasta into usable sequence, and after finding the MATa1 (-x MATa1) matrix description from the database (-R TRANSFAC.data), search the sequences for matches at 85% of the maximum score that it has in the TRANSFAC database (-# 85), returning the summary (-s), the sites (S) sorted in Strider order (c) with results printed on 1 line (w1), directing the output into the file yeast.summary.
- tacg, when run with '-Q' (and unless recompiled without this feature), sends back about 100 bytes of information about its use (the hosting OS, the command line flags, and the sequence length) to enable me to track how and how often it is being used. Quiet ('-q') is now the default, so nothing is sent unless you ask for it.
- the inclusion of the seqio functions has caused an ~10fold increase in the compiled size of the executable to ~400kB (up from ~50kb before). If I get a lot of complaints about this, I'll look into stripping out the functions that I use from the SEQIO library, but I'd rather not as it does include a lot of (hidden) functionality that I plan to use later.
- tacg will not currently cut sequence shorter than 4 bases; if you need to analyze sequences shorter than this, perhaps you're using the wrong program.
- tacg has been made re-entrant for the inclusion of SEQIO and as such a number of memory leaks have been plugged (with the use of Gray Watson's excellent dmalloc library). It's not perfect but it's a lot more robust.
The command line handling has been completely re-written, using the getopt() and getopt_long() functions, so the flags are considerably less sensitive to spacing and order.
- translation in 6 frames assumes circular sequence regardless of '-f' flag, so that the last amino acids in frames 5 and 6 in the 1st output block are obviously incorrect if you are assuming linear sequence.
For other bugs which the author thinks are less problematic, see the manual.