tacg(1)               Version 3.0-Beta                      tacg(1)

NAME

tacg - takes input from stdin, automagically translates most standard ASCII formats of Nucleic Acid (NA) sequence, then analyses that sequence for restriction enzyme (RE) sites and other NA motifs such as Transcription Factor (TF) binding sites, matrix matches, and regular expressions, finally writing analyses to stdout. It can also translate the NA input to protein in any frame, using a number of Codon translation tables, and search for Open Reading Frames (ORFs), as well as perform many other analyses.

SYNOPSIS

tacg -flag [option] --flag [option] ... <input.file >output.file
tacg takes input from stdin (| or <); spits output to screen (default), >file, | next command
In the following summary, flags are in bold, options are underlined, ranges are specified with a dash (-), commas (,) if present ARE REQUIRED as part of the option string, alternatives are separated with a vertical line (|), Name options are the names of the patterns in the REBASE or MATRIX file, Pattern indicates a nucleic acid pattern composed of IUPAC codes (acgtyrmkwsbdhv), parentheses () indicate optional parts of the option, and other indicators are hopefully reasonably clear. The underlined flags are linked to a longer explanation later in the text. Version 3 also now includes 'long options' which are preceded by 2 dashes (--longopt).


[-c h H l L q Q s S v] [--dam] [--dcm] [-b begin] [-e end] [-C 0-12] [--cost units/$] [-D 0-4] [-f 0|1] [-F 0-3] [-g LoCutOff(,HiCutOff)] [-G bin_size,X|Y|L] [-i (--idonly) 0-2] [-m total_min_hits] [-M total_Max_hits] [-n 3-8] [--notics] [-o 0|1|3|5] [-O 1-6(x),min ORF] [-p Name,Pattern,Err] [-P NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB] [-r (--regex) 'Label:RegexPat' || 'FILE:FileOfRegexPatterns'] [-R alt_Rebase | alt_Matrix] [--raw] [--rules 'NameA:min:Max[&|]NameB:min:Max[&|]..'] [--silent] [--strands] [-T 0|1|3|6,1|3] [-w 1|width] [-V 1-3] [-W #] [-x (--explicit) 'NameA(,=),NameB..(,C)'] [-X (--extract) b,e,[0|1]] [-# %_Match_Cutoff]

NB: Most flags are the same as in earlier versions with the exception of these changes:
-C is current with the Codon tables from CUTG/NCBI (13 tables).
-q the default is now quiet (doesn't send UDP info back).
-R is also used to specify an alternative Matrix file.
-T and -t (co-translation with the Linear Map) have been merged.
-V now has 3 levels of verbosity.
-w now has a special case of '1' to generate 1 line output.
-x (used to be -r) now has 2 additional optional options, '=' and ',C'

and these additions:
--cost filters REs based on units/$ of cost (from a recent NEB catalog).
--dam simulates Dam methylation of the DNA.  The file rebase.dam contains only those REs which are Dam-sensitive.
--dcm simulates Dcm methylation of the DNA.  The file rebase.dcm contains only those REs which are Dcm-sensitive.
-i (--idonly) controls the amount of output for those sequences which did not have any matches, ranging from full output regardless of hits to only the SEQ id of those that did have hits.
--notics removes the tics under the strands for maximum compaction of output
-Q UNquiet; sends UDP data back to me to let me know what options have been used, so I can adjust docs and options to make it easier to use.
-r (--regex) searches for regular expressions with built-in IUPAC translations (y->[ct]).
--raw allows tacg to consider all input to be IUPAC sequence for processing file fragments or editor buffers
--rules allows users to compose complex logical phrases of matches, with grammar including logical ANDs and ORs and per-pattern minimum and maximum limits, over the whole sequence or within a Sliding Window (see below)
--silent searches for possible SILENT RE sites (those that won't cause the translation to change).
--strands sets the number of strands to display in the Linear Map
-W (--slidwin) defines the sliding window in which the --rules and min/Max values are in effect.
-X (--extract) eXtracts the sequence around the pattern matched.
-# is used to set the cutoff for the Matrix match.


DESCRIPTION

tacg searches the sequence(s) read from stdin for matches based on descriptions stored in a database of patterns. These are either explicit sequences, possibly containing IUPAC degeneracies (default file is rebase.data, in an extended GCG format; plain GCG format can also be read), or matrix descriptions (default file is matrix.data, in TRANSFAC format). Based on the matches and the options entered on the command line, it sends ALL output to stdout. Since SEQIO can read collections of sequences, tacg inherits that ability, and can now apply its analyses over multiple sequence files.
Unless requested (by -V1-3), it no longer sends errors to stderr (except failure errors) and it no longer emits default output - you have to request all output. Most of the internals use dynamic memory so there are few limits on sequence input size and pattern number. I've generated >6000 patterns and searched 14Mb of input sequence. It's ~ 5-35x faster than the comparable routines in the GCG pkg and being written in ANSI C, is portable to all unix variants. It has been ported to Linux (Intel, Alpha, PPC), SunOS, Solaris, Compaq Tru64 Unix (aka DEC Unix aka TUFKAO (the Unix formerly known as OSF)), Ultrix, IRIX, NeXTStep, ConvexOS, and HP/UX.

Unless told not to via the --raw flag, tacg now automagically translates most ASCII formats (Genbank, FASTA, etc) via Jim Knight's SEQIO library and now handles multiple sequences at one time, internally converting 'u's to 't's. It considers both strands at the same time so you don't have to manually reverse complement the sequence and will by default accept all IUPAC degeneracies (y r m k w s b d h v), performing all possible operations on that sequence. It treats degeneracies in the input sequence in different ways depending on the -D flag (see below). It either strips all letters other than a c g t and analyzes the sequence as 'pure' using a fast incremental hashing algorithm or it treats it as degenerate and analyzes it via a slower algorithm. By default, it treats it as 'pure' unless it detects an IUPAC degeneracy, in which case it will adaptively switch back and forth between the fast and slow routines. See also RELATED PROGRAMS at bottom.

NB: tacg can produce lots of output; while it's possible to pipe direct to lp/lpr, you'll probably regret it.


REQUIREMENTS

tacg 3.0 again requires an external Codon file (codon.data - tacg V2.xx did not) but does not absolutely require a pattern/REBASE file, allowing you to enter patterns via the command line with the -p flag. However, most users will want to use a REBASE file in GCG format to supply the RE definitions. By default the name of this (supplied) file is rebase.data, altho other files in the same format can be specified by the -R flag. Searching for Matrix matching requires the use of a TRANSFAC-formatted file (also supplied) in the default name of matrix.data. Regular expressions can be supplied in a simplified REBASE format; the default name is regex.data. Examples of all data files are supplied. The codons/pattern/matrix/regex data files will be found in any of 3 locations which are searched in the order of: the current directory ($PWD), your home directory ($HOME), or the tacg lib ($TACGLIB). Many shells are set to define the 1st two; the last must be specified either via command line or in your .cshrc file (or equivalent). If they are in another location, you'll have to specify it with either an explicit full or relative path name.
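
The lookup order described above can be sketched in Python as follows. This only illustrates the search order ($PWD, then $HOME, then $TACGLIB); it is not tacg's own code, and the helper name find_data_file and its env/cwd parameters are made up for the example.

```python
import os

def find_data_file(name, env=None, cwd=None):
    """Return the first existing path for `name`, or None.

    Hypothetical helper: searches the current directory, then $HOME,
    then $TACGLIB, which is the order given in the text.
    """
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    candidates = [cwd, env.get("HOME"), env.get("TACGLIB")]
    for directory in candidates:
        if not directory:
            continue  # variable not set; skip this location
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None  # not found; an explicit path would be needed
```

A call such as find_data_file("rebase.data") returns the copy that would take priority when the same file name exists in more than one of the three locations.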

FLAGS and Options

Flag Value Explanation
-b i# select the beginning of a subsequence from a larger sequence file; 1* for 1st base of sequence. In the Linear Map output, the upper label indicates numbering from beginning of subsequence; the lower label indicates numbering from the beginning of the entire sequence. The SMALLEST SEQUENCE that tacg can handle is 4 bases (10 for the ladder map (-l)). This allows analysis of primers and linkers.
-e i# select the end of a subsequence from a larger sequence file; 0* for last base of sequence. This subsequence can also be made circular via the -f flag. The largest sequence that tacg can handle depends on how much memory you have, although for practical purposes, assume 1 billion bases.
--cost i# select REs by their cost (units/$ - >100 is cheap; <10 is v. expensive)
-c order the output by # of cuts/fragments by each RE (Strider style) and thence alphabetically; otherwise output is by order of appearance in the REBASE file.
--dam simulate cutting in the presence of Dam methylase (GmATC). rebase.dam contains all REs that are Dam-sensitive.
--dcm simulate cutting in the presence of Dcm methylase (CmCWGG).  rebase.dcm contains all REs that are Dcm-sensitive.
-C 0*-12 Codon Usage table to use for translation:
0 - Standard     5 - Ciliate_Mito      10 - Ascidian_Mito
1 - Vert_Mito    6 - Echino_Mito       11 - Flatworm
2 - Yeast_Mito   7 - Euplotid_Nuclear  12 - Blepharisma
3 - Mold_Mito    8 - Bacterial
4 - Invert_Mito  9 - Alt_Yeast
-D 0-4 Degeneracy flag - controls input and analysis of degenerate sequence input where:
 0  FORCES excl'n of degens in seq; only 'acgtu' accepted
 1* cut as NONdegen unless degen's found; then cut as '-D3'
 2  degen's OK; ignore in KEY, but match outside of KEY
 3  degen's OK; expand in KEY, find only EXACT matches
 4  degen's OK; expand in KEY, find ALL POSSIBLE matches
The pattern matching is adaptive; given a small window of nondegenerate sequence, the algorithm will match very fast; if degenerate sequence is detected, it will switch to a slower, iterative approach. This results in speed that is proportional to degeneracy for most cases. If you have long sequences of 'n's (inserted as placekeepers, for instance), -D2 may be a better choice. In all cases, as soon as degeneracy of the KEY hexamer exceeds a compiled-in limit (usually 256-fold degeneracy), the KEY is skipped. 
-f 0|1* form (or topology) of DNA - 0 (zero) for circular; 1 for linear. This flag also operates on subsequences.
-F 0*-3 print/sort Fragments; 0*-omit; 1-unsorted; 2-sorted; 3-both.
-g Lo i#(,Hi i#) specify if you want a pseudo-graphic gel map, with a low end cutoff of Lo# bases (converted to an integer multiple of 10), and (if present), a high end cutoff of Hi#. In Ver <2, the Lo# was restricted to 10 or 100; now it can be any integer power of 10 (10, 100, 1000, etc), as can the Hi#. If Hi# is omitted or is larger than the sequence length, it takes the value of the sequence length. See examples below.
-G binsize,X|Y|L Graphic data output, so (mis)named for its original use, where:
binsize = # bases for which hits should be pooled
X|Y|L indicates whether the BaseBins should be on the X or Y axis or in 'Long' form where Basebins (as X) and Name data (as Y) are reiterated in 2 columns for all the Named patterns:
X: BaseBins 1000 2000 3000  ..
   NameA      0    4    0   ..   
   NameB     22   57   98   ..     (#s = matches per bin)
   NameC      1    0    0   ..
   .
Y: BaseBins  NameA   NameB   NameC   ..
     1000      0      22       1     ..
     2000      4      57       0     .. 
     3000      0      98       0     ..
     .
L: Basebins  NameA
     1000      0    
     2000      4    
        .      .
   Basebins  NameB
     1000     22
     2000     57
        .      .
This addresses some missing features by allowing the export of hit data for the selected Names, so that you can manipulate it as you wish without rewriting the program's code. Like other output, it is streamed to stdout, so it's not wise to mix -G with other analyses; the lines generated (esp. w/ the X option) can be quite long and are NOT governed by the -w flag.
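
The pooling of hits into BaseBins can be sketched in Python roughly as below. This is an illustration, not tacg's implementation: the function names are invented, and it assumes that a hit at position p is counted in the bin labelled with the upper bound of its range (positions 1-1000 go to bin 1000), which the layouts above suggest but do not state.

```python
from collections import Counter

def bin_hits(hits, binsize):
    """hits maps a pattern name to a list of 1-based match positions.

    Returns (labels, table), where table[name][label] is the number of
    hits pooled into that bin. Assumption: bins are labelled by their
    upper bound, e.g. positions 1..1000 -> bin 1000.
    """
    counts = {}
    top = 0
    for name, positions in hits.items():
        c = Counter(((p - 1) // binsize + 1) * binsize for p in positions)
        counts[name] = c
        if c:
            top = max(top, max(c))
    labels = list(range(binsize, top + 1, binsize))
    table = {n: {b: c.get(b, 0) for b in labels} for n, c in counts.items()}
    return labels, table

def format_y(labels, table):
    """Render the 'Y' layout: one row per bin, one column per Name."""
    names = list(table)
    lines = ["BaseBins " + " ".join(f"{n:>7}" for n in names)]
    for b in labels:
        row = " ".join(f"{table[n][b]:>7}" for n in names)
        lines.append(f"{b:>8} " + row)
    return "\n".join(lines)
```

Output in this shape (one row per bin) is easy to load into a spreadsheet or plotting tool, which is the kind of external manipulation -G is meant to support.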
-h brief help page (condensed man page).
-H generates partial HTML tags for the Web version.  Not useful in the command line version.
-i (--idonly) {0|1*|2} controls output for sequences that have no hits
0 - ID line and normal output printed regardless of hits
1 - (default) ID line and normal output are printed ONLY IF there are hits.
2 - ONLY ID line is printed if there are hits.
-l specify if you want a ladder map of selected enzymes, much like the GCG MAPPLOT output. Also appends a summary of those enzymes that match few times. This last # is length-sensitive in the distributed source code, but it is easy to set another default as a '#define' in 'tacg.h'.
-L specify if you WANT a Linear map a la Strider or GCG's MAP (but better - tacg indicates the actual CUT site as opposed to the 1st base in the pattern as do other mapping programs). In Ver 3.x, the Linear Map only includes those REs or patterns which pass the filtering criteria set via the -n, -o, -m, -M, --cost, etc.
--strands {1|2*} specifies how many strands get printed in the linear map. Allows you to slightly compact the linear map, especially when used with the --notics flag below
--notics do NOT print the tic marks below the DNA in the linear map. Allows you to slightly compact the linear map, especially when used with the --strands flag above
-m i# select enzyme by minimum # cuts in the whole sequence. Default is no minimum (ie ALL). Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map (-L), and ladder map (-l) flags.
-M i# select enzyme by Maximum # cuts in the whole sequence. Default is 32,000. Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map (-L), and ladder map (-l) flags.
-n 3*-8 select enzymes by magnitude of recognition site; 3 = all, 5 = 5,6,7,8... n's don't count, other degeneracies are summed ie: tgca=4, tgyrca=5, tgcnnngca=6, tannnnnnnnnnta=4
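
The magnitude arithmetic in the -n examples can be reproduced with the short Python sketch below. It is a guess at the rule, not tacg's code: each base contributes 1 - log4(k), where k is the number of bases the IUPAC code stands for, so a c g t count 1, two-fold codes count 0.5 and n counts 0. The value this gives for the three-fold codes (b d h v) is an assumption, since the text gives no example for them.

```python
import math

# Number of bases each IUPAC code can stand for.
IUPAC_FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "y": 2, "r": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,   # weight for these is an assumption
    "n": 4,
}

def site_magnitude(pattern):
    """Sum the per-base contributions 1 - log4(fold) over the pattern."""
    return sum(1.0 - math.log(IUPAC_FOLD[ch], 4) for ch in pattern.lower())

def passes_n_filter(pattern, n):
    """True if the pattern's magnitude is at least n (as with '-n 5')."""
    return site_magnitude(pattern) >= n - 1e-9  # tolerate float rounding
```

With this rule, tgca, tgyrca, tgcnnngca and tannnnnnnnnnta come out at 4, 5, 6 and 4, matching the values listed for -n.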
-o 0|1*|3|5 select enzymes by overhang generated; 5 = 5', 3 = 3', 0 for blunt, 1 for all 
-O 1-6(x),MinSiz ORF analysis where any frame combination can be specified ('126' or '45' or '13456') along with the minimum ORF Size you want to detect. Produces either a single line (if -w1 is specified) or a block, (with the Amino Acids wrapped at the specified width) for each ORF including:
  • Frame of the Current ORF
  • Sequence # of the Current ORF in that frame
  • Offset from the start in both bases and AAs
  • Size of the ORF in AAs and KDa
  • estimated pI
  • ORF in 1 letter code for external analysis
If 'x' is appended to the frame specification, 3 additional lines are appended, which give proportion of each AA in # and %. This breaks the FASTA format, but can be easily stripped later as each line is prefixed with a '#'.
NB: Because the output can be in a single line for each ORF, other line-oriented pattern-matching tools (grep, perl, awk) can examine the ORF generated for matching regular expressions (see the GNU grep man page for an explanation of regular expressions). In this way you can search all 6 frames of >=MinSize AAs for whatever pattern interests you.
Examples:
-O 145x,25   (search frames 1,4,5 with extended AA information on all ORFs > 25 AAs)
-O 2,66   (search frame 2 with a min ORF size of 66 AAs)
-p Name,Pat,Err allows entry of search patterns from the command line, where 
Name = name by which pattern is labeled (<=10 chars)
Pat = <30 IUPAC characters (ie. gryttcnnngt)
Err = max # of errors that are tolerated (<=5)
Also logs the patterns you've entered into a file tacg.patterns in the correct format for later copying to a REBASE file. Can enter up to 10 of these at a time. Patterns should consist of <=30 IUPAC bases. 
Long sequences with large errors will cause SUBSTANTIAL cpu usage in validating the patterns. 
-P NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB MBQ
Pattern proximity matching to search for spatial relationships between factors, 2 at a time (up to a total of 10).
NameA and NameB must be in a REBASE file, either the default rebase.data or another specified by the -R flag and are case INsensitive. NameA/B patterns can be composed of any IUPAC bases and ERRORs can be specified in the REBASE entry ie: 
Pit1 5 WWTATNCATW 0 2 ! a Pit1 site with 2 errors
Tataa 4 TATAAWWWW 0 1 ! a Tataa site with 1 error

+ NameA is DOWNSTREAM of NameB (default is either)
- NameA is UPSTREAM of NameB (ditto)
l NameA is LESS THAN Dist_Lo from NameB (default)
g NameA is GREATER THAN Dist_Lo from NameB
Dist_Hi - if used, implies a RANGE, obviates l or g

Examples:
-PHindIII,350,bamhi
Matches HindIII sites within 350 bp of BamHI sites

-PPit1,-30-2500,Tataa
Match Pit1 sites 30 to 2500 bp UPSTREAM of a Tataa site.
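
To make the -P option grammar concrete, here is a small Python sketch that splits an option string into its parts. It is illustrative only: the Proximity record and parse_proximity are hypothetical names, and the sketch covers just the syntax described above (optional +/-, optional l/g, Dist_Lo, optional -Dist_Hi), not the matching itself.

```python
import re
from typing import NamedTuple, Optional

class Proximity(NamedTuple):
    name_a: str
    direction: str           # '+', '-' or '' (either side, the default)
    mode: str                # 'l', 'g' or 'range'
    dist_lo: int
    dist_hi: Optional[int]
    name_b: str

# [+-][lg]Dist_Lo[-Dist_Hi]
_DIST = re.compile(r"^([+-]?)([lg]?)(\d+)(?:-(\d+))?$")

def parse_proximity(arg):
    """Parse 'NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB' (hypothetical helper)."""
    name_a, dist, name_b = arg.split(",")
    m = _DIST.match(dist)
    if m is None:
        raise ValueError("bad distance field: %r" % dist)
    sign, lg, lo, hi = m.groups()
    # A Dist_Hi implies a range and obviates l or g; 'l' is the default.
    mode = "range" if hi is not None else (lg or "l")
    # Names are case-insensitive, so normalise them to lower case.
    return Proximity(name_a.lower(), sign, mode, int(lo),
                     int(hi) if hi is not None else None, name_b.lower())
```

For the two examples above, 'HindIII,350,bamhi' parses as a less-than-350 rule on either side, and 'Pit1,-30-2500,Tataa' as an upstream range of 30 to 2500.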

-q Be quiet. DISallows sending diagnostic udp info back to author, now the default behavior (so unless you TELL the program to send data back, it won't). 
-Q Be UNquiet. Allows the program to send diagnostic udp info back to author. In version 2.x, this was the default behavior, but it has served its purpose, so unless you WANT me to log your usage, I won't.
--raw tells tacg to consider ALL input as valid sequence (as with version 2) instead of using SEQIO to parse the input as a standard sequence format. Useful for analyzing file fragments or editor buffers, which may lack valid formatting. Note that specifying this flag will tell tacg to eat headers, comments, etc as well as sequence, if it encounters them. ALL IUPAC degeneracies will be analyzed.
--rules 'ruleA[&|]ruleB[&|]ruleC[&|]..' MBQ
allows you to compose complex logic rules to determine whether a sequence matches your profile, using logical ANDs and ORs. Parens () enforce logic; otherwise expressions are evaluated left -> right. Each rule in the phrase can be either a single pattern definition (NameA:m:M, where m = the minimum # of matches, M = Maximum # of matches allowed in the sequence or in a sliding window (-W) of the sequence) or a collection of them i.e. (NameA:1:7&(NameB:4:17|NameC:23:99)).
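
The evaluation order described for --rules (strictly left to right, with parens overriding) can be sketched in Python like this. It is a toy evaluator written for illustration, not tacg's parser: evaluate_rules is a made-up name, and it takes the per-pattern match counts as a plain dictionary rather than searching any sequence.

```python
import re

_TOKEN = re.compile(r"\s*([()&|]|[^()&|\s]+)")

def evaluate_rules(phrase, counts):
    """Evaluate a rules phrase against {name: number_of_matches}.

    '&' and '|' have equal precedence and are applied left to right;
    parentheses group. Names are compared case-insensitively.
    """
    counts = {k.lower(): v for k, v in counts.items()}
    tokens = _TOKEN.findall(phrase)
    pos = 0

    def atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of rules phrase")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        name, lo, hi = tok.split(":")      # Name:min:Max
        return int(lo) <= counts.get(name.lower(), 0) <= int(hi)

    def expr():
        nonlocal pos
        value = atom()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            right = atom()
            value = (value and right) if op == "&" else (value or right)
        return value

    result = expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: %r" % tokens[pos])
    return result
```

Note the consequence of left-to-right evaluation: 'A:1:9|B:1:9&C:1:9' is read as (A|B)&C, so parens are needed if A|(B&C) is what you mean.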
-r (--regex) 'Label:RegexPat' or 'FILE:RegexFile' MBQ
searches for regular expressions entered from the command line using the 1st option or searches for the regular expressions read from a file using the 2nd option. The regular expression syntax can be formal regex patterns or the IUPAC'ed version thereof; the translation from one to the other is handled automatically. ie:
gy(tt|gc)nc{2,3}m -> g[ct]\(tt\|gc\).c\{2,3\}[ca]
When trying to specify a file, the term FILE must be in CAPs (so don't use 'FILE' as a pattern name). Specific regex patterns from the file can be specified by using the -x flag to name them explicitly. 
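
As a rough illustration of the IUPAC-to-regex translation, the Python sketch below expands each degenerate code into a character class. It is not tacg's translator: iupac_to_regex is an invented name, and the output uses Python's re syntax, so groups, alternation and repeat counts are left unescaped rather than written with backslashes as in the example above. It also assumes the input contains no bracket expressions of its own.

```python
import re

IUPAC_CLASS = {
    "y": "[ct]", "r": "[ag]", "m": "[ac]", "k": "[gt]",
    "w": "[at]", "s": "[cg]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": ".",
}

def iupac_to_regex(pattern):
    """Replace IUPAC degeneracy codes with regex character classes.

    Everything else (a c g t, parens, '|', '{2,3}', ...) passes through.
    """
    return "".join(IUPAC_CLASS.get(ch, ch) for ch in pattern.lower())

# Example: translate the pattern from the text and try it on a sequence.
translated = iupac_to_regex("gy(tt|gc)nc{2,3}m")
hit = re.search(translated, "aaagcttgccattt")
```

Here translated is 'g[ct](tt|gc).c{2,3}[ac]', and hit matches the substring 'gcttgcca'.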
-R REBASE or MATRIX file
specifies an alternative Restriction Enzyme file (in GCG format) or Matrix file (in TRANSFAC format) to use. The latest REBASE and TRANSFAC files are available via FTP or WWW. There are several such files included in the std distribution:
  • rebase.data - the main restriction enzyme pattern database (including those in rebase.dam, rebase.dcm)
  • rebase.dam - only those REs that are Dam sensitive
  • rebase.dcm - only those REs that are Dcm sensitive
  • regex.data - a few example regular expression patterns
  • matrix.data - all the TRANSFAC matrices
  • transfac.data - the entire TRANSFAC database in GCG format
The file specified here is also searched for in the same order as the other data files: $PWD (the current directory), $HOME (your home directory), $TACGLIB (the TACGLIB directory).
-s prints the summary of site information, describing how many times each enzyme or pattern matches the sequence. Those that cut zero times are shown first. In Ver >=2, only those that match at least once are shown in the second part (the 0-matchers are not reiterated)
-S prints the actual match Sites in tabular form.
--silent requests that the NA sequence be translated starting at the 1st base, in frame 1 (use -b to shift the starting base), according to the Codon Translation table specified with -C, then reverse translated, using the same table, using all the possible degeneracies, then restrict that (quite) degenerate sequence and show all the REs that will match it. You should use the -L and -T flags to generate the linear map which shows both the REs and the cotranslated sequence to verify that all is as it should be. NB: Depending on Codon Table, some AAs are not reversibly translatable. Using the standard table, Arg (=mgn), Leu (=ytn), and Ser (=wsn) cannot be Forward translated from their Reverse translation.
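
The NB about Arg, Leu and Ser can be checked with a few lines of Python. The sketch below builds the standard codon table, expands a degenerate codon into every concrete codon it covers, and reports which amino acids those encode. It is an independent illustration, not code from tacg, and the names are invented for the example.

```python
from itertools import product

# Standard genetic code, codons ordered t, c, a, g at each position.
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "y": "ct", "r": "ag", "m": "ac", "k": "gt", "w": "at", "s": "cg",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def amino_acids(degenerate_codon):
    """Set of amino acids ('*' = stop) encoded by a degenerate codon."""
    choices = [IUPAC[ch] for ch in degenerate_codon.lower()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}
```

amino_acids("mgn") contains both R and S, and amino_acids("ytn") contains both L and F, so forward translation of the reverse-translated codon is ambiguous; a codon such as gcn (Ala) maps back to a single amino acid.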
-T 0*|1|3|6,1|3 In the Linear map, beneath the DNA sequence, include the translated protein in 0*, 1, 3(= frames 123), or 6 (=123456) frames of Translation with 1 or 3 letter codes.
ie.
-T 3,3 (includes frames 1,2,3 with 3 letter labels)
-T 6,1 (includes frames 1,2,3,4,5,6, with 1 letter labels)
-v asks for program version (there may be multiple versions of the same functional program to track its migration).
-V 1-3 Verbose - requests all kinds of diagnostic info to be spat to the screen. May be useful in diagnosing why tacg did not behave as expected..but maybe not. Higher numbers mean more output and are generally downwardly inclusive.
-w 1 | i# output width in bp's (must be between 60* and 210; truncated to a # exactly divisible by 15, so '-w 100' will be interpreted as '-w 90'). Actual printed output will be about 20 characters wider. Also applies to output of the ladder and gel maps, so if you're trying to get more accuracy and your output device can display small fonts, you may want to use this flag to widen the output. If you want as much output on one line as possible for external parsing/analysis, specify -w 1
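
The width rule for -w amounts to a small piece of arithmetic, sketched in Python below. The function name is made up, and rejecting out-of-range values with an error is an assumption; the text only says the width must lie between 60 and 210.

```python
def effective_width(requested):
    """Width tacg would use for '-w requested', per the rule in the text.

    1 is the special one-line case; otherwise the value is truncated
    down to a multiple of 15. Raising on out-of-range input is an
    assumption made for this sketch.
    """
    if requested == 1:
        return 1
    if not 60 <= requested <= 210:
        raise ValueError("width must be 1 or between 60 and 210")
    return requested - requested % 15
```

For example, effective_width(100) gives 90, and effective_width(209) gives 195.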
-W i# width of Sliding Window in bp's. Used primarily with the '--rules' flag above. If it is not specified, it is assumed that the window under consideration is the whole sequence. Most analyses other than '--rules' ignore this setting.
-x 'NameA(,=),NameB,NameC,NameD,...(,C)' used to explicitly name those enzymes or patterns to be used in the analysis (up to a maximum of 15). Case INsensitive (HindIII=hindiii=HinDiIi), but the name HAS to be spelled exactly like the entry in the REBASE or MATRIX file with no spaces (HindIII != Hind III != Hind3).
The ',=' tag appended to a name indicates that it is the tagged RE in a Hookey/AFLP analysis; only those fragments that have at least one end generated by the tagged RE will be shown. This has been shown to be useful in AFLP analysis.
The trailing ',C', if added, requests a combined digestion using all the REs specified with this flag.
Examples:
-xHindIII,BamHI,NruI,C
requests data for these REs both individually, and combined.

-x EcoRI=,MseI,Hinf
requests AFLP formatted data, with EcoRI tagged; NO combined results (per se, altho that is a part of the AFLP analysis).

-X (--extract) b,e,[0|1] eXtracts the sequence around the pattern matched, from b bases preceding to e bases following the MIDDLE of the pattern if it is a normal pattern, or the START of the pattern if it is a regular expression. If the pattern is found in the bottom strand AND the last field = 1, the sequence is reverse-complemented before it's extracted so all patterns are in the same orientation; if the last field = 0, it is NOT reverse-complemented. In any event, the sequences are FASTA-formatted on output.
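
A minimal Python sketch of the extraction step follows. It is an interpretation of the description above rather than tacg's code: coordinates are 0-based, the "middle" is taken as start + length // 2, the window is clipped at the sequence ends (circular sequences are not handled), and all names are invented for the example.

```python
# Complement table covering a c g t and the IUPAC degeneracy codes.
_COMP = str.maketrans("acgtyrmkwsbdhvn", "tgcarykmwsvhdbn")

def revcomp(seq):
    """Reverse complement of a lower- or upper-case IUPAC sequence."""
    return seq.lower().translate(_COMP)[::-1]

def extract(seq, start, length, before, after,
            bottom_strand=False, flip=1, is_regex=False):
    """Return the bases around a match (hypothetical helper).

    start/length locate the match (0-based). The anchor is the START
    for a regex match, otherwise the MIDDLE of the pattern.
    """
    anchor = start if is_regex else start + length // 2
    lo = max(0, anchor - before)
    hi = min(len(seq), anchor + after)
    piece = seq[lo:hi]
    if bottom_strand and flip == 1:
        piece = revcomp(piece)   # put all hits in the same orientation
    return piece
```

With seq = 'aaaaccccggggtttt' and a 4-base match at offset 4, extract(seq, 4, 4, 3, 3) returns 'accccg'; the same call with bottom_strand=True returns its reverse complement.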
-# % CutOff The percentage of the optimal matrix score that you will accept as a match. ie. if the matrix (as below) is 10 bases long and has a maximum score of 77 (the sum of the highest value at each position), then if you indicate -# 75, you accept a score of 57.75 (77 x .75) or more as a match.
     a  t  g  g  c  y  t  r  g  g   Consensus
     1  2  3  4  5  6  7  8  9 10   Position
  a  8  0  1  1  1  0  1  4  0  0
  c  1  3  1  0  9  6  0  0  2  0   Sum of Max = 77
  g  1  0  8  7  0  0  0  6  7 10
  t  0  7  0  2  0  4  9  0  1  0
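
The cutoff arithmetic can be worked through in Python using the matrix values as printed above. This is a sketch under a simple assumption, namely that a site's score is the sum of the matrix entries for its bases and that a match needs at least the given percentage of the best possible sum; tacg's actual scoring may differ in detail. For the matrix as printed, the column maxima add up to 77, so a 75% cutoff works out to 57.75.

```python
# Rows of the example matrix, one count per position (1..10).
MATRIX = {
    "a": [8, 0, 1, 1, 1, 0, 1, 4, 0, 0],
    "c": [1, 3, 1, 0, 9, 6, 0, 0, 2, 0],
    "g": [1, 0, 8, 7, 0, 0, 0, 6, 7, 10],
    "t": [0, 7, 0, 2, 0, 4, 9, 0, 1, 0],
}

def max_score(matrix):
    """Best possible score: the largest entry in each column, summed."""
    return sum(max(col) for col in zip(*matrix.values()))

def score(matrix, site):
    """Score of one candidate site of the same length as the matrix."""
    return sum(matrix[base][i] for i, base in enumerate(site.lower()))

def is_match(matrix, site, cutoff_percent):
    """True if the site reaches cutoff_percent of the maximum score."""
    return score(matrix, site) >= max_score(matrix) * cutoff_percent / 100.0
```

A site such as 'atggcttagg' scores 73 and passes a 75% cutoff, while a run of ten c's scores 22 and does not.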


RELATED PROGRAMS

As noted above, tacg now incorporates some of Jim Knight's SEQIO pkg to perform sequence format translation, so external conversion programs are no longer needed for most common sequence formats. This is an extremely useful, well thought-out, and relatively easy-to-use pkg - I'm sorry I waited so long to include it. The package is no longer at its original URL, but it has been archived around the internet.

However, if an external program IS needed for format interconversion, I also strongly recommend Don Gilbert's excellent 'readseq' program (available in source or executable via FTP). Why recommend readseq when I've used SEQIO? SEQIO is a great library of functions to use in other programs, but readseq is easier to use for stand-alone, interactive use, chiefly due to a more std interface. Both are scriptable; for scripting use, it's a toss-up.

You can also use the paging utility 'less' to move thru your sequence file and use its marking and piping facility to punt the sequence of interest to 'tacg'. Many editors also allow piping a selection of text to an external program and inclusion of the result into another window, especially nedit, crisp, the ubiquitous, omnipotent emacs and its gui doppelganger xemacs, and others.

Much of tacg's output benefits from wider-than-normal printing. The '-w#' flag allows output up to about 230 characters wide; however, to print this without wrapping, you need to print in landscape mode, using very small fonts. A number of unix printing utilities allow you to do this, notably genscript aka GNU Enscript, residing in the GNU repository.

EXAMPLES

Used alone:

tacg -f0 -n5 -T3,1 -sL -F3 -g 100 <input.seq.file >output.seq.file

Translation: read sequence from input.seq.file and analyze it as circular (-f0), with 5+ cutters (-n5), returning both site info and Linear map (-sL) as well as sorted and unsorted fragment data (-F3) and do 1,2,3 frame translation w/ 1 letter codes (-T3,1) on the linear map, writing the output to output.seq.file. Also, include a pseudo gel diagram for those enzymes that pass the filtering, with a low end cutoff of 100 bases (-g100). 


Used to search for Matrix Matches:

tacg -# 75 -R yeast.matrices -sS < yeast.chr_4 | less

Translation: search the file yeast.chr_4 for all the matrix definitions in the file 'yeast.matrices', with a cutoff of 75% of the maximum score possible, listing also the summary and the Sites information, piping the output to the pager less.


Used to search for degenerate transcription factor sites with errors:

tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL < rprlPromo.seq > promo.map

Translation: search the sequence from the file rprlPromo.seq for the patterns labeled Pit1 and ap2, allowing 1 error each, and print the results (summary (-s), Sites (S), and the Linear Map (L)), 90 characters wide (-w90), to the file promo.map.


Used to search for a Regular Expression:

tacg --regex 'yadda:gm(tt|ag)ggn{3,5}tgy' -SL < some.seq | less

Translation: search the file some.seq for the regular expression gm(tt|ag)ggn{3,5}tgy, piping the information about Sites (-S) and the Linear Map (L) to the pager 'less'. 


Used to search the entire yeast 500bp Upstream Regulatory sequences (a file containing 6226 500 bp sequences) for matches to the MATa1 binding site (from TRANSFAC) :

tacg -R TRANSFAC.data -sScw1 -xMATa1 -#85 < utr5_sc_500.fasta > yeast.summary

Translation: translate each of the FASTA formatted entries in the input file utr5_sc_500.fasta into usable sequence, and after finding the MATa1 (-x MATa1) matrix description from the database (-R TRANSFAC.data), search the sequences for matches at 85% of the maximum score that it has in the TRANSFAC database (-# 85), returning the summary (-s), the sites (S) sorted in Strider order (c) with results printed on 1 line (w1), directing the output into the file yeast.summary


BUGS and ODDITIES

Major

- tacg, unless recompiled without it, spits back about 100 bytes of information about its use (the hosting OS, the command line flags, and the sequence length) to enable me to track how and how often it is being used. If you are uncomfortable with this trait, you may disable it from the command line ('-q').

- the inclusion of the seqio functions has caused an ~10fold increase in the compiled size of the executable to ~400kB (up from ~50kb before). If I get a lot of complaints about this, I'll look into stripping out the functions that I use from the SEQIO library, but I'd rather not as it does include a lot of (hidden) functionality that I plan to use later.

- tacg will not currently cut sequence shorter than 4 bases; if you need to analyze sequences shorter than this, perhaps you're using the wrong program.

- tacg has been made re-entrant for the inclusion of SEQIO and as such a number of memory leaks have been plugged (with the use of Gray Watson's excellent dmalloc library). It's not perfect but it's a lot more robust.

- The command line handling has been completely re-written, using the getopt() and getopt_long() functions, so the flags are considerably less sensitive to spacing and order.

- translation in 6 frames assumes circular sequence regardless of '-f' flag, so that the last amino acids in frames 5 and 6 in the 1st output block are obviously incorrect if you are assuming linear sequence.

for other bugs which the author thinks are less problematic, see the manual