Welcome to tacg Version 3.0
Document Version 3.0.2 - Feb 13th, 2000
Please note that this document is in flux, incomplete, and certainly
inaccurate in places. The man page is more likely to be correct than
this doc, and the help section even more likely to be correct. In the event
of complete incomprehensibility, please mail me.
A web interface to version 3 is now available in beta form;
you can try it out here.
The Web interface that uses most of Ver 2's abilities is still available!
Try it out here!
This documentation is subject to change - compare the version # above
to your local copy, if you got one with your source code or binary distribution.
This page provides information about both the commandline and Web interface of
tacg Ver
3.x, describes some of the philosophy behind its design, provides
links to more documentation, and where it can be got (both ANSI C source
code and precompiled binaries for some popular systems).
Table of Contents
Cross Index of Command-line Options
What's new or changed with Version 3?
Briefly:
- integration (finally) with Jim Knight's SEQIO to allow automatic
sequence conversion on input and scanning multiple sequence databases (or
not - with --raw it still supports the ability to
assume that ALL input is sequence - useful for analyzing file fragments
or editor buffers).
- can use it to search for arbitraily complex combinations of patterns at
the same time by combining pattern specifications with the logic
conjunctions AND, NOT. see --rules.
- searching for regular expressions (in nucleic acid), with autoconversion
of IUPAC degeneracies to the appropriate regex.
- searching for TRANSFAC matrices, with user-specified cutoffs.
- gel simulations with low and high end cutoffs for expansion.
- selection of Restriction Enzymes explicitly, by overhang generated,
magnitude of recognition site, price, minimum, maximum number of cuts
(overall or on a per-pattern basis).
- supports Combination Cuts of up to 15 REs at a time.
- supports limited AFLP fragment matching / simulation.
- simulates Dam and/or Dcm methylation of DNA
- searches for silent sites, with reverse translation.
- ORF finding in any combination of frames with FASTA output, with offsets
in DNA, protein, Molecular Wt, pI, with optional additional info on
AA frequency in #s or %.
- conditional output based on matches, for scanning large numbers of
sequences at a time.
- sequence extraction surrounding pattern matches, with variable upstream,
downstream inclusion, optional reverse translation in FASTA format.
- various ways of making the output even more compact (printing only
single strands, or eliminating tics)
- uses autoconf/configure to ease building on different platforms.
- by default, tacg no longer sends usage data back to me.
- includes an explicit example function to show how to add your own funtionality.
- much more robust, debugged, especially as to memory allocation.
Since most of the routines tacg run in a loop now, memory leaks are
much more apparent and get plugged quicker.
- BUT: more features -> more (new) bugs.
What is it?
tacg is a character-based, command line tool
for unix-like operating systems (and even a few extremely un-unix-like
ones) for pattern-matching in nucleic acids and performing some of the basic
protein manipulations. It was originally designed for
restriction enzyme analysis of DNA, but has been extended to other types
of matching. It now handles degenerate sequence input in a variety of matching
approaches, as well as patterns with errors, regular expressions and TRANSFAC-formatted
matrices.
It was designed to be a 'grep' for DNA and like the original grep,
it's capabilities have grown so that now the author has to keep calling
up the help page to figure out which flags (now ~50) mean what.
tacg itself is NOT a GUI application in any sense. However, it's existance
as a strictly command-line tool lends itself well to Webification and I've
made Version 2 available as a cgi program.
Version 3 will be getting an even better one. Also, by it's strict
command line nature, it can parasitize the interface of a number of GUI
programs that allow passing selections made within their windows to external
programs. This is particularly effective with a free, Xwindow editor called
NEdit. With minimal reconfiguration, NEdit can become a fairly good biosequence
editor and analysis tool a la Strider for Xwindows. See the entry for
NEdit below.
A real world example, quickly.. say the Genbank entry for the lambda
genome.
tacg -w75 -n6 -F2 -l -c -g'100,10000' -L -T3,1 -O145,25 <
lamcg.gb | gzip - > lambda.map.gz
Translation: Process the file 'lamcg.gb' (Genbank format) thru
the seqio routines (transparent to users) and pipe the following sequence
to tacg, returning the analysis:
- 75 characters wide (-w75)
- on all 6+ cutters (-n6)
- giving me the sorted fragment sizes (-F2) of those Enz's that match
- providing a ladder map summary (-l)
- sort tabular ouput by number of hits and thence alphabetically (-c)
- provide a gel simulation with a 100 bp LOW cutoff and a 10000 bp HIGH
cutoff (-g100)
- with a linear restriction map (-L)
- with 1 letter, 3 frame translation (-T3,1)
- with the ORF table for frames 1,4,5 at a minimum ORF size of 25 AAs with
the extended composition info (-O145x,25)
- and pipe the output thru 'gzip' to a file called lambda.map.gz.
Altho it's not clear why you would be interested in seeing this particular
analysis, here's the resulting
output (130KB text, gzipped - you'll need a gzip-compatible app to
view it, such as Stuffit
Expander, Stuffit For Windows, Winzip,
or even ... gzip).
The lambda genome used to generate this map is also included in the
tacg distribution directory (called lamcg.gb.gz)
What does it do with incoming sequence?
The input stream is usually processed by the SEQIO routines. If the stream
contains multiple sequences, they are passed to tacg serially.
tacg reads ASCII characters from the buffer passed in by SEQIO, determines
whether there are IUPAC characters in the stream and, depending on the
-D flag option, EITHER strips out anything not an 'a', 'c', 't',
'u', or 'g', OR allows all IUPAC characters (but strips all others,
including numbers, spaces, tabs, linefeeds, etc), and places that input
sequence in a buffer.
What it does then depends on what other flags are set. The 1st writ
function and probably the most popular is Restriction Enzyme Analysis.
Flags indicating this function cause tacg to read in a series of named
patterns (either from the command line, or more typically, from a GCG-formatted
REBASE file, now somewhat extended for increased functionality) and match
them against the sequence already read in. These matches are noted, then
some basic math is done with them and the output is sent to stdout, in
the form of tables of cut sites, fragments generated, various maps and
simulations.
The number of recognition patterns was originally hard-coded, but is
now dynamically allocated so 1000s of patterns can be searched for at once
(I've generated ~7000 without problems), as long as they are in the simple
format required. For instance it can now process all the patterns
in the TRANSFAC database at one shot (altho I'm not sure that would produce
anything usable). You can download tacg-formatted TRANSFAC pattern
files from here
A similar function matches (possibly degenerate) input with patterns
that contain errors (as well as simple degeneracies). This means that if
the pattern is gartc with 1 error, it could match
nartc,
gnrtc,
gantc,
garnc, or
gartn.
Of course, the 1st and last sequences decay
to artc and gart, as a leading or lagging 'n' makes no difference
in the pattern. NB: See the note on how subsequence
patterns are currently handled.
Related functions (-P and --rules)
filter the data from the above searches by parameters that you give them,
filtering the number of reports by your proximity rules.
A new function allows Open Reading Frames to be
extracted. This function is not very sophisticated, but does present the
data in a way that allows it to be analyzed by other programs (each ORF
is written in a FASTA format so that you can use the resulting file for
other types of searching). I suspect this function will be updated quite
frequently as users request more features.
Enzyme selection can be filtered by a variety of criteria; these criteria
affect what is printed as output (if only 3 enzymes match your selection
criteria, only those 3 will show up in most analyses. They can be chosen
by 'magnitude' of recognition site (4, 5, 6 cutters, etc), by overhang
generated, by minimum/maximum times they cut, etc.
The entire feature set is described in more detail below, but it includes:
- Enzyme cut-frequency summary tables,
- Tables of site data (sorted or not)
- Tables of fragment data (sorted or not; ordered or not)
- Linear map with names stacked efficiently with/without translations in
1,3, or 6 frames, from 1 of 8 Codon Usage tables
- Pseudo-graphical ladder maps a la GCG
- Summary map of infrequent cutters
- Gel simulations (using a simple log fit).
- Tables of ORFs for any frames, with offsets in bases, AAs, and MolWt. [new]
- Entering patterns from the commandline [new]
- Searching for patterns with errors [new]
- Proximity matching - reports only sites that match distance criteria (up/downstream,
less/more than a distance, within a range of distances) [new]
- Other match data output formats
What will it do next?
What you want it to do will have an effect on what I work on. If you want
something added, especially a variant of an existing analysis, I'd be happy
to consider it, so let me know your opinion. Otherwise, I try to add what
I believe to be the most useful features, especially if they are missing from
other tools.
Anyone who believes the promises of software suppliers also believes
in the tooth fairy. That said...
- postscript circular and linear graphic maps have been promised for a while
now and have even been worked on a bit, but keep getting delayed
No more - top priority.
- well, maybe not TOP priority - that would be getting as much functionality
into a new, much nicer web interface.
- I'd like to add secondary structure prediction output to the postscript
maps, so if I can wrangle some public domain code into shape, I may be
able to carry thru on this wild promise as well.
- BSML output would be a neat additional feature.
- A javascript interface would also go a LONG way in making the Web interface
more usable.
Where can I get it?
tacg V3 is available via FTP in
ANSI C source code
and precompiled
binaries for several architectures. Instructions
are included with each package. Note that the precompiled binaries include
only a statically linked binary executable and the Data and Doc files necessary
to run it; be sure to pick up both pkgs if you want both.
Is there a GUI or Web Interface?
tacg is a command line program (see Simplicity,
below) but as such, it lends itself well to Webification as a cgi program.
The Web interface that uses most of Ver 2's abilities is still running
- Try it out here!
The Version 3 Web interface is actually composed of THREE similar pages that
emphasize different aspects of tacg's abilities. All three pages address the same
CGI (perl-based for this version), and call the same tacg executable. The reason for
the different pages is just for clarity.
The entry point to the Version 3 Web interface is available
from this point
and you can also run it from these servers (to the best of my knowledge).
If you know of others that support public access or any of the following don't
work, please
let me know.
The only substantial crippling is that it currently has a cutoff at
500,000 characters (up from 100K in Ver 1.7x) as the output for that size
of analysis starts to get very large for some analyses). If you routinely
use more sequence than this, it's probably worth your time to install it
locally. NB The 500,000 character limit, while probably sufficient
for most analyses, counts the header and comment characters as well.
Is there a man page I can look at?
Yup. You have your choice of a REAL manpage, or a
PREFORMATTED
man page or a Webified
man page, all included with both the source code and the binary
pkgs).
Design Criteria
Simplicity (and a few Alternatives)
In answer to: "Why does the world need another restriction mapping program,
especially one as crude as this?" or Why didn't you just use (or modify)
XXX?", I enumerate:
- Well, in the 1st place, crude is not always bad... 'grep' could be
classified as crude and it's one of the most useful and used programs
in the world. It's free, simple, flexible, chainable, and very fast;
it provides the desired info in a useable form with a minimum of
mucking about.
- There are a variety of programs available for Windows and (especially)
the Mac for doing these kinds of analysis, but there's little along
the same lines for unix, especially Linux, which is an extremely
attractive OS for bioinformatics.
- Many available programs, even the commercial ones, start to come undone
at sequence lengths quite a bit smaller than those now being
published. Dynamic memory allocation relieves this limit, so I
decided to write one that incorporated this feature.
- Even some of my favorite programs (such as the beautifully designed DNA
Strider) would refuse to analyze degenerate DNA sequence, making the
generation of useable plasmid maps from the usual bits of known
sequence, polylinker, and hand-drawn maps impossible or extremely
unpleasant.
- While I have had access to GCG for quite some time, I have been
increasingly aggravated at it's speed, output, cost, and the fact that
it doesn't run on Linux. While tacg's code is sufficiently arcane that
few bench biologists will attempt to add functionality to it, an undergrad
CS student could, especially with the inclusion of the new example
function.
- There is also the challenge of creating a program that, like grep has
significant utility in a relatively small footprint.
- It was fun.
However, if you want alternatives, here are a few:
- There's the stalwart (if ugly) GCG, but it's $ware and does not
(yet?) run on Linux. It also remains essentially a collection of
unfriendly command-line programs (hey, nice glass house..) sheltered
under a common but still not too intuitive GUI. The GUI is strangely
warped - it appears to be very similar to a HTML forms-based GUI, but
with neither the benefit of HTML, nor many of the advantages of a
'native' GUI. The controls are surprisingly crudely laid out (much like
the tacg Web form), which is odd for a 'native' app; I would have
expected a much more compact and intuitive interface.
- Webcutter
2 is a Web-only cgi (no command-line version, no source code) that Max Heiman put together at
about the same time that I was writing tacgi. I've tried it a few times
and while it may be useful, I found it too slow and the output a bit
info-sparse. It does have some features that tacg lacks (coloration of
selected Restriction Enzymes (REs), grabbing sequences direct from
Genbank to restrict), but for my own use they weren't that compelling.
Although the access is free, you cannot install it locally, nor examine
the source code. Nevertheless, if anyone thinks that a feature of
WebCutter would make tacg considerably more usable, I'd be interested in
hearing about it.
- WebGene is a
service offered from Indiana which offers a small subset of the options
and analyses that WWWtacg does. It has a similar Restriction Enzyme
selection interface that the original WWWtacg had, but I decided that
while it was a nice display, it took too much HTML to code which made
selecting the REs too slow. It offers both an ORF interface as well as a
Restriction Enzyme interface.
- Cloneit
- EMBOSS
- eBioinformatics
- Curatools
- restmap (perl thingie out of stanford by Steve Chervitz, now at
Neomorphic). Quite good Web output, including colored image maps
that really nice) but being perl, it's a bit slow and doesn't have nearly
the features that tacg has.
- Shankar Subramanian's (now at UCSD) Biologist's Workbench is an
excellent example of how a Web Interface can make coherent a great number
of the free software tools of the net. Now that you mention it, tacg is
one of its included tools :).
The only exception I take to Web interfaces (including my own) is the
response time/uncertainty of the connection and the inability to retain
state easily. But it is a great way to take advantage of programs that
you'd never be able to take the time to get working and working
reasonably easily and a well thought-out interface can do a lot to reduce
the problem of retained state.
- Steve Smith's 'GDE' has been compiled for
Linux, but requires a fair amount of dedication to put and hold it
all together; last time I looked it was available only an OpenLook
interface, which is an increasing disadvantage. Steve is now at GCG (I
think - the new versions of GCG seem to show the GDE influence.) so there
probably won't be a huge amount of work going into it.
- Brian Fristensky's
estimable suite of programs BIRCH (of
which tacg is also a part) is pretty good and his comments and
suggestions have resulted in a a number of the improvements seen here,
especially the Degenerate Sequence matching.
- MSI has just released an app called Gene Explorer, which includes quite
a bit of functionality as a loss-leader to attract researchers to their
higher-end products. Haven't had enough experience with it to comment
further.
- James Tisdall's 'DNA Workbench'
is designed along the same idea as tacg, but
requires perl and is therfore interpreted.
Despite perl's many syntactic and other advantages, I wanted something
even simpler (a standalone app) and therefore easier to install, faster,
and more complete.
- Don Gilbert's
multiplatform, GUI ' SeqPup' is
the closest to an ideal setup in terms of design and interface. It
has recently been recoded in to Java and Don is still leading the
way in terms of innovation and interface in bioinformatics. Anyone
interested in interface design for bioinformatics is well-advised to
check out and follow Don's contributions. His have consistently led the
field in terms of usability, Internet use, cohesiveness, and
inventiveness.
- On some systems, you can run Executor, Macintosh Application Environment,
SoftWindows, and Virtual
PC and thus have access to the Mac and Windows programs, but that's
also somewhat costly, slow, and complex.
Installation
tacg REQUIRES only 2 files - the executable (tacg,
which can be renamed) and the Codon table file 'codon.prefs'. Most people will
want to use the restriction enzyme database file (rebase.data; included with
the distributions, which is simply the GCG-formatted version of Rich
Robert's REBASE), but it is not strictly needed - you can enter patterns
from the command line since Ver 2. The codon usage file (codon.prefs) is also
required if you want to do any translation. It was required in the 1st few
versions, then was compiled into the app in version 2 and now has been
externalized again for easier access and modification. As the sequencing
factories roll thru more organisms, there will more need for changing or adding
to the file.
The data files (rebase.data, codon.prefs, other pattern files) will be
found in any of 3 places automatically, in the following order:
the current directory ($PWD), the home directory
($HOME), or the tacg lib directory ($TACGLIB). The
program searches these paths and gives up only if the file can't be found in
any of these places. 'rebase.data' is ASCII text and can be edited and modified
by the user, if required. I'd suggest placing 'rebase.data' in the TACGLIB
directory and keeping modified versions of these files, if required, in your
home directory which is searched first.
tacg can use the NEB/REBASE-supplied file of restriction enzyme data as is,
(in GCG format - latest unmodified version
here) but can also make use of additional extended information if
available. Version 3 will use error, cost, and methylation information added
to the file as shown
here.
The format of the unmodified version is:
Comments
..
; Top Recognition Bottom
;Name Offset Pattern Offset ! Isoschiz >Supplier Index
;AarI 3 CACCTGC 0 ! >
;AatI 3 AGG'CCT 0 ! StuI,Eco147I,Pme55I,SseBI >O
AatII 5 G_ACGT'C -4 ! >ADEFIKLMNOR
;AauI 1 T'GTAC_A 4 ! Bsp1407I,BsrGI,SspBI >I
AccI 2 GT'mk_AC 2 ! FblI,XmiI >ABDEGJKLMNOQRS
Comments
..
; Top Recognition Bottom
;Name Offset Pattern Offset Err dam dcm units $ ! Isoschiz >Supplier Index
;AarI 3 CACCTGC 0 0 100 100 0 0 ! >
AatII 5 G_ACGT'C -4 0 100 100 2500 176 ! >ADEFIKLMNOR
AccI 2 GT'mk_AC 2 0 100 100 2500 220 ! FblI,XmiI >ABDEGJKLMNOQRS
AccII 2 CG'CG 0 0 100 100 444 99 ! Bsh1236I,BstFNI >KQR
Acc65I 1 G'GTAC_C 4 0 100 3 5000 200 ! KpnI,Asp718I >DFINR
In the above, the extra fields are:
- Err - the maximum number of errors that can be tolerated in the search.
- dam - the offset from the Top Offset that if methylated by Dam
methylase, blocks cutting
- dcm - the offset from the Top Offset that if methylated by Dcm
methylase, blocks cutting
- units - the number of units packaged (by NEB) in their Small size lots
- $ - the cost for that many units above.
Program Output
The output of this program uses only alphanumeric characters so that all
of its output can be viewed on a vanilla vt100-like terminal, although
you can do more useful things if you're using an X display or have imported
the output into a Word Processor (some of the output can best be viewed
using very small fonts or in multiple columns on a page).
The different output sections have been prefixed with "==" so that you can
easily jump to the different sections if you're viewing the output with an
editor or pager such as less. In the WWW version, the
different sections have been labeled with HTML tags so that clicking the
automatically generated Table of Results entry will drop you to the
corresponding results section.
For the Web interface, if you're using a Mac, printing to a
Laserwriter-type device, you might try the '105' width, with the 'Small' font,
then use Page Setup to select Landscape output, 2-up (2 pages fitted to one
page).
If you're using MS Windows, there are probably similar output
selections that you can use.
If you're using a unix box, save the output in whatever width
you want and pass it thru a text-to-postscript filter to generate nicely
formatted output. Some that I've used are:
- lptops - included on SGI's; probably available somewhere for other platforms.
- nenscript
- (Free implementation of Adobe's enscript)
- genscript aka GNU enscript
- Free, most features, currently supported, no good reason to use any other...
- a2ps - a monster text-printing system that will handle almost any
conceivable file format (and allows you to color code your own if you
want).
Use it thusly:
genscript -1GZ -r -fCourier7 -p- [filename] | lpr -P[PSprinter]
Most Xterminals allow different font selections and window re-sizing,
so you can probably view the results adequately that way as well.
High Portability
The program is written in vanilla ANSI C. After being directed to the correct
libraries, it compiles with few complaints on:
- Silicon Graphics/IRIX 5.3, 6.2 (R3000, R4X00, R5000, R10K) (cc)
- Sun SPARC/SunOS (4.1.3) (gcc)
- Sun SPARC/Solaris (gcc, cc)
- DEC Alpha/DEC Unix aka OSF/1 (V3.0/347) (cc, gcc)
- Convex C3480/ConvexOS (cc, gcc)
- HP-UX/Exemplar OS (single CPU of an SPP2000) (cc, gcc)
- NeXT boxes (Black)/NeXTSTEP (3.2) (except the UDP code) (gcc)
- Intel PCs running Linux (Slackware, RedHat, Caldera) (gcc)
- DEC Alpha Linux (gcc)
- MkLinux on PowerMacs (gcc)
- Probably anything with gcc.
Bewildering to me, on those systems where both the native commercial compiler
(cc) and gcc have been available, gcc has consistently outperformed cc.
YMMV. I think this is because almost all the core code is integer ops
and therefore the carefully coded math and floating point libs make little
difference. Stemming from this, tacg runs faster on an Intel P6 architecture
than on many RISC workstations of similar or even higher clock speed.
Speed and Capacity
The program uses a pre-calculated hashtable lookup of the restriction enzyme
recognition sites, so that only about half of the sequence is checked any
further than the initial hash (which itself is generated via the very efficient
shift/add DFA algorithm). Depending on the degeneracy of the input sequence, the kind
of output you request and the i/o of the machine, the program processes DNA at
this speed:
Speed* |
Hardware |
OS |
Compiler, flags |
68 |
i486/66 |
RH Linux 2.0.18 |
gcc -O2 -m486 |
140 |
R4000/100 Indigo2 |
IRIX 5.3 |
cc -O2 -mips2 |
165 |
early DEC Alpha |
OSF/1 |
gcc -O2 |
290 |
R4400/200 Indigo2 |
IRIX 5.3 |
cc -O2 -mips2 |
313 |
SPP/HP-PA |
SPP-UX 5.1/HP-UX 10.1 |
gcc -O2 |
300 |
PMac8500/PPC604/120 |
Mklinux 2.0.27 |
gcc -O2 |
330 |
DECAlphaStation/233 |
RH Linux 2.0.29 |
gcc -O2 |
450 |
Sparc Ultra1/??MHz |
Sun OS v5.5 |
gcc -O2 |
454 |
R10000/175 Indigo2 |
IRIX 6.2 |
cc -O2 -mips2 |
515 |
Intel PPro/200 |
RHLinux 2.0.27 |
gcc -O2 -486 |
* in Kb/sec, crudely adjusted for cpu usage via the `time` shell command,
using 215 enzymes, digesting the E coli genome (4.638 MB), for only the summary listing (-s).
Requesting the full analysis (-n4 -sSlc -F3 -g100) takes about twice this
time. The full Linear Map doesn't take much more CPU time, but considerably
more wall time due to the i/o (output is 10x input).
NB: This is about the slowest performance you can expect. Because
of the size of the input sequence, the program has to go thru several rounds
of memory allocation. Shorter sequences, more typical of cloning projects,
complete essentially immediately even on slow machines.
Here's a table of how fast the same analysis completes on various sequence
lengths (for a PPro/200/96M/PCI-SCSI and a 486/66/32M/VESA-SCSI):
Sequence Length |
Seconds(PPro) |
Bases/s(PPro) |
Seconds(486) |
Bases/s(486) |
100 |
.04 |
2.5k |
.28 |
360 |
1,000 |
.05 |
20k |
.3 |
3.3k |
10,000 |
.07 |
143k |
.4 |
24.4k |
100,000 |
.2 |
500k |
1.4 |
71.4k |
1,000,000 |
1.6 |
617k |
11.3 |
88.5k |
4,630,000 |
9.1 |
509k |
68 |
68k |
10,000,000 |
22.7 |
492k |
261 |
38k |
From earlier, crude analyses it appears to be about 7-35 times faster
than the equivalent routines in the GCG 'map' program.
In my experience, interacting with tacgi/tacg via it's web interface (on an
100MHz R4000 SGI) over 2 LAN hops is both faster and easier than using GCG's
native Xwindows interface (on a 200MHz R4400 SGI). There may be some subjective
feeling involved here, but objectively, it's faster by 10X (on a machine that's
less than 1/2 as fast :) ). GCG has recently been trumpeting that its code is now
parallel, allowing much faster operation. Putting the above in a similar
context, you'd have to buy ~20 CPUs to run your analyses with GCG's routines
compared to 1 if you ran them on tacg.
Memory Usage
tacg uses dynamic memory allocation for most data structures so that while
there are a few hard-coded limitations (mostly in output format), it easily
handles sequences into the millions of bases.
As a practical rule of thumb, the program needs about 2 MB of free memory
for itself, and then, depending on what results you've asked for, as much
as 3 times the input sequence length to hold all the intermediate results
before they are barfed to stdout. For the E Coli genome (4.7Mbases), tacg
memory usage tops out at about 13 MB, so on a 16 MB machine, you'll probably
be doing some swapping in this case. On a 32 MB Linux PC, I had no swapping
on the above 4.7 Mbase analysis, but quite a bit with 10MB (about 40MB
total required, according to 'top'). YMMV.
In the Web form, there is a cutoff of 500,000 characters (not
bases) as input; currently this includes extraneous characters such as
headers, numbers, spaces, etc, so if you think you are being unfairly denied
service, check how many bytes your file actually is. The command-line version
is limited only by your machine's RAM and your patience.
Feature Set
Inspired by [add mailto] Christian Marck's elegant DNA Strider, I tried for a
similiar output format, changing a few things I didn't like, adding a few
things I wished he had. I'm open to suggestions for enhancements, interface
changes, etc.
Kvetch or it won't change.
Sequence Input
tacg will take any ASCII data as input. If it isn't a recognized
format, it will assume that it's raw sequence and will process it accordingly.
You can also explicit tell it that all input is to be considered sequence.
If this is the case, it will ignore anything that isn't a,c,g, u, or t
(or IUPAC degeneracies - see below) and subject that sequence to analysis.
Because of this, numbering, spacing, line width, etc, of the input file
should not be a problem.
The Web interface allows you to upload entire files of sequence for
analysis. Recently, with the recruitment of Jim Knight's SEQIO,
Version 3 in both commandline and Web form allows most ASCII sequence formats
(see below) to
be read and processed without further modification or editing.
Input formats supported by SEQIO (and therefore tacg and WWWtacg):
IG/Stanford |
GenBank/GB |
NBRF |
EMBL |
GCG |
DNAStrider (ASCII) |
Fitch |
Pearson/Fasta |
Zuker (in-only) |
Olsen (in-only) |
Phylip3.2 |
Phylip |
Plain/Raw |
PIR/CODATA |
MSF |
ASN.1 |
PAUP/NEXUS |
|
However, it will NOT accept (at least reliably) binary file formats
such as those from DNA Strider, MacVector, NBI, GeneWorks, etc. However, most
commercial programs have an ASCII export function that should work. For the WWW
interface, anything that you can cut and paste into the window and looks like
sequence ought to be fine. The '--raw' option skips SEQIO and so reverts tacg
to it's former aggressive behavior of considering EVERYTHING sequence.
IUPAC Degeneracies
IUPAC Degeneracies are those characters which denote uncertainty in the
sequence; most are familiar with y (pyrimidines - c or t)
and r (purines - a or g), but there are several others
that can be useful. The entire set is:
Base |
Name |
Bases Represented |
Complementary Base |
A |
Adenine |
A |
T |
T |
Thymidine |
T |
A |
U |
Uridine(RNA only) |
U |
A |
G |
Guanidine |
G |
C |
C |
Cytidine |
C |
G |
Y |
pYrimidine |
C T |
R |
R |
puRine |
A G |
Y |
S |
Strong(3Hbonds) |
G C |
W |
W |
Weak(2Hbonds) |
A T |
S |
K |
Keto |
T/U G |
M |
M |
aMino |
A C |
K |
B |
not A |
C G T |
V |
D |
not C |
A G T |
H |
H |
not G |
A C T |
D |
V |
not T/U |
A C G |
B |
N |
Unknown |
A C G T |
N |
Handling Degeneracy
tacg V2 was modified to handle degenerate input sequence as well as the
previous gatc-only sequences. Version 3 has the same matching engine as
Version 2 which matches
sequence using the fast, hexamer hashtable look-ahead algorithm that DNA
Strider uses (my implementation). When it hits a degeneracy in the leading
hexamer window, it switches to a more extensive (but slower)
matching approach until the degeneracy is cleared, whereupon it
switches back to the faster approach. This allows tacg to accurately search
sequence
that is only slightly degenerate (such as many genomic sequences) essentially
as fast as nondegenerate sequence.
Although tacg behaves as above in the default mode, it can be forced into
several other modes if desired. Truth be told, I'm not aware of ANYONE
besidesactually using the other modes, tho they are formally useful.
[Table of Degeneracies and options]
Enzyme Selection
Restriction Enzymes can be selected either by:
Explicit pick from all the active entries
from the latest REBASE.
Enzymes can be picked explicitly via the -x flag (-x NameA,NameB,NameC...
where NameX is the case INsensitive name of the Enzyme or pattern you wish
to use in the GCG-formatted REBASE file, either 'rebase.data' or another
specified by the -R flag (-R Alt.rebase.file).)
or
Selected recursively by characteristic, based on:
Magnitude of recognition sequence
The 'magnitude' of the recognition sequence depends on the number of
defined bases that make up the site. Degenerate bases can also contribute:
acgt each count '1' magnitude point
yrwsmk each count '1/2' magnitude point
bdhu each count '1/4' magnitude point
n doesn't count at all
(tgca=4, tgyrca=5, tgcnnngca=6, etc)
Overhang of resulting ends (5', 3', blunt)
This refers to the stretch of unpaired bases at the border of the cut
created by offset breaks in the phosphate backbone. Enzymes can leave:
5' overhangs - ie. BamHI(G'GATC_C)
v BamHI
5'...tagG GATC_Ccga...3' -> 5'...tagG GATCCcga...3'
3'...atcC CTAG Ggct...5' -> 3'...atcCCTAG Ggct...5'
^
3' overhangs - ie. BbeI(G_GCGC'C)
v BbeI
5'...tagG_GCGC'Ccga...3' -> 5'...tagGGCGC Ccga...3'
3'...atcC CGCG Ggct...5' -> 3'...atcC CGCGGgct...5'
^
Blunt (no overhang) ... You get the idea...
Minimum, Maximum times they cut the sequence
This is pretty self explanatory, but..
'Minimum' indicates the minimum # of times an enzyme HAS
TO cut before it's included in the
output (the default is 1).
'Maximum' indicates the maximum # of times an enzyme CAN
cut before it's excluded from the output (no
default maximum).
Alternative REBASE database files to use
This option allows you to specify an alternative REBASE file to use
in the matching, perhaps edited to contain only cheap enzymes or otherwise
customized for your lab/site. It starts at 300 entries, which is
enough for most Restriction Enzyme databases, such as REBASE, but if you
format TRANSFAC or TFD into REBASE form, it will easily process them (about
3000 entries).
Linear Restriction Map
Linear Map Mode:
- produces linear restriction maps, with enzyme names nested compactly, readably.
- is routinely tested to more 4.6 Mbases, has been tested to 14 Mbases
- handles linear/circular topologies, subsequences.
- shows EXACT cutting position (not just the start of the recognition
sequence - minor quibble with Strider and other programs)
- can show same-page translation
- in 1/3/6 frames
- in 1 or 3 letter codes
- using one of several Codon Usage Tables
- can scale output (reasonably intelligently) in widths up to ~230 characters.
- shows numeric indices for both the whole sequence (4635, below) and the subsequence (181, below) if specified.
Bsu36I
Sau3AI
DpnI
AlwI NlaIV
BstYI Tsp4CI
BbvI BamHI AlwI MseI MaeIII SfaNI
\ \ \ \ \ \ \ \ \\
181 gagaaacaacaatggatcctaaggttagaacattgttaaaagttactgttgaagatgctt 240
4635 ctctttgttgttacctaggattccaatcttgtaacaattttcaatgacaacttctacgaa 4694
^ * ^ * ^ * ^ * ^ * ^ *
E K Q Q W I L R L E H C Z K L L L K M L
Simple ORF Analysis
tacg can search for Open Reading Frames (ORFs) in any of the 6 frames with
a user-specified minimum ORF size. It is not a sophisticated search algorithm;
the rules are:
- It assumes that the sequence starts in a ORF - this may cause ORFs to be
reported without starting M's and may lead to short ORFs being
reported in short input sequences. (If it has not hit a STOP before it
gets to the end of the input sequence, it will report the cruft that it
assumed was an aleady started ORF.
- If not already in an ORF, a Met codon starts an ORF and a
STOP codon ends it.
- If the ORF is smaller than the specified size when it hits a STOP, it's
discarded.
- Internal Met's are not noted in any other way, as they are in some other
apps such as Lasergene's Protean (good graphic design there), although
this is an easy addition.
- If there is an unidentified AA (because of degeneracies in the DNA) the
AA is translated as an 'X' and given the averaged MolWt (118.89 Da), but
the ORF is allowed to continue (see output below).
Very little analysis is done (so far). The beginning and ending Offsets
in both base pairs and Amino Acids are reported so that it can be found on
a linear map (offsets for frames in the opposite strand are calculated
so that they refer to the current top strand). The size in both Amino Acids
and KDa is calculated, as is the pI. Perhaps most usefully, the results
are writ to stdout in FASTA format, with the stats as the ID line
so that it can be examined by a more sophisticated pattern matcher
such as grep, agrep, perl, etc (see Using tacg with
Other Programs
The output of this analysis can be in either a single line (by specifying
-w 1) in the case where you're concerned that the external pattern matcher
won't be able to pick up a pattern split by a linefeed) or wrapped at the
regular width.
Programming Note: The data structure that holds the ORF information
is dynamically allocated so that all the ORF info is available until the
end of the program. As well this allows very large sequences to be scanned
in one sweep - my aging 486/66/32M will process the E coli genome, searching
all the ORFs for a degenerate 10 AA motif with 'agrep' in under 5 minutes,
but this means that it can suck up QUITE a lot of memory.
Here is some sample output:
% ./tacg -O156,60 < dpp.mel.seq
== ORF Analysis for Frame 1: 7 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 1 1 2581 / 860 2922 / 974 114 13155.40
MWTVLWHAENFMTSFADHQASGAEAPNKKHSPRLAHNRPRFCLLMASLERFSGCNTRNFVQRQHNSPKDSLNFDQLSDWVAAAPQHHRNWLAEKLELHPLLNGPIRNPQDFTDG
> 1 2 3091 / 1030 3282 / 1094 64 7107.97
MLNKNDKTQPRIVICNLCTCTLNHLVAPVLLLDAPPAPILFMPLSVAHCCCCCHTHTHIRTTQK
> 1 3 5929 / 1976 6216 / 2072 96 9692.80
MEVDAGSGMLGAGQVGYGLGWVLAPLRFPGHADAADGARGATTPADKAVRVQLPTPAREASAQATKLAKMQRDQGMSGGVLARMGLRLGLCGINCA
> 1 4 7522 / 2507 7704 / 2568 61 7430.89
MYIYLFFACVQYQTLMFNKSGKNLYCAYKNHQNHKGYALPYEPTYLTRSQQTLHFESIHWV
> 1 5 7870 / 2623 8160 / 2720 97 10690.71
MIYVWDVGECVSVCWRRLGGWGAERADRAGRSDPAESSSKTTTAAAIMIHSRNVGNGCLTILFFFSIFGSRRRRSCAGYEIHTTQILTHTATVAFTT
> 1 6 11071 / 3690 11277 / 3759 69 7770.53
MWNKLARASQRFSRSFFSSLRATISAALFRGQRMMGTLPRVTNRAKSESTMESIRASISSSRVLARLMA
> 1 7 12406 / 4135 12768 / 4256 121 14055.17
MILKEEHPHQSIETAANAARQAQVRWRMAHLKALSRTRTPAHGNCCGRVVSKNHFFKHSRAFLWFLLCNLVMNADAFAHSQLLINVQNQVRNRKKKWFKTLKSQQRKTQQRQTATLAALPI
== ORF Analysis for Frame 5: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 5 1 11241 / 3747 10885 / 3627 120 13410.25
MEALMDSIVDSLFALFVTLGNVPIIRCPRNSAAEMVARKLEKKLRENLWDARANLFHMDATQAGGGVFSFQRPVLLLLDRNMDLATPLHHTWSYQALVHDVLDLGLNLVYVEDETASAGK
> 5 2 8007 / 2669 7708 / 2568 101 10973.69
MAAAVVVLLDDSAGSDRPARSARSAPQPPNLRQHTDTHSPTSHTYIIILTTLNKAEQIPKFEEPDSNSQSGSRTIPQNEVPARQRRRSSDTASALALGWDL
> 5 3 2304 / 768 2113 / 703 65 7230.86
MLAQHQIEIKVYPPFRSIPCYVCFLPGQNNSALFWALGATQKVMIKLCGDPVEGTRLRGYPSTSS
> 5 4 1734 / 578 1492 / 496 82 9039.19
MPISWHKKMRQPCPHSPAITEGTGLCIIFNEDICQQKCSGVLFMAFKAKPIFITTATAGQRKTTRDAVVVVVVLIIDTSYAN
> 5 5 762 / 254 583 / 193 61 6683.20
MLCPSILPFTLKHKFMLSVHFFQCTTKLTPCGAESATRRRSAKGKGAHGQGTGCATDWLTR
== ORF Analysis for Frame 6: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 6 1 11181 / 3727 10997 / 3664 63 7224.82
MCPSSAVRGTVPLKWWHASWRRSCARTFGMRVPICSTWMPRKPAAECSAFRDLCSCCWTGTWI
> 6 2 10764 / 3588 9728 / 3241 347 39136.07
MTHKGSPFPTVAEAIQEELESYRNSEEEIKRLKTSMGIEGESDIAFSLVNDTTARLTNAVNSLPQLMEKKRLIDMHTKIATAILNFIKARRLDSFFEIEEKVMSKQTLDRPLLDLLRDGEFGQAEDKLRLYI
IYFICAQQLPESEQERLKEALQAAGCDLTALAYVQRWKGIMNRSPSISQATQYEGGGTKTVSMFTKLVSQGSSFVMEGVKNLVVKRHVSFSFQIQIQVLYQNCLQNLPVTKITEQVMECRSNAETDDYLYLD
PKLLKGGEVLPKNRAPFQDAVVFMVGGGNYIEYQNLVDFIKQKQTSNVQRRIIYGASTLTNARQFLKELSALGGEIQSPTATS
> 6 3 6774 / 2258 6530 / 2175 83 9280.03
MQGNCNRLAGEWTQKGQPRMREDAALPARNPTFLGLHYPGLGRFFVRPTAPNGHQQTAANSFIKHKQIPWTCSALTPHAAFEH
> 6 4 5901 / 1967 5687 / 1894 73 8163.18
MVLFLRVVCFSPVDNKVQPTCGYRTDRFRVLVALVLGASRWLTLRANDAKQQIKGAIRKIAARDVLLNCSSKA
> 6 5 1647 / 549 1310 / 435 114 13523.17
MKTSVSKNVRAFCLWHLRQNRFLLPQRQQGKGKPHATLSSLSSSSSLTHRMQIKSVSQPDHKVAAPDRRICRFQCWERGDPWRFRSRIPDTDTETRLEVKRTRPIRRDLWHESS
==-------------------------- End of Analysis --------------------------==
Tabular Output
All these output formats, as well as the Linear Map and Summaries are affected
by the width parameter (-w); if you can manage the wider output,
it's more efficient and easier to view. The -w flag has a special
case besides the normal. If you specify -w1, it will set the
output to be 10,000,000 characters wide, assuring that all output for a
particular stanza will be in 1 line. This is sometimes useful for
post-processing data.
Summary of All Cuts
This is no longer printed by default; you have to explicitly request all
output. This was to make tacg more amenable to inclusion into other programs
as a shell command - see Using tacg with Other Programs.
This output includes info on EVERY active (uncommented) enzyme/pattern
in the rebase file that has passed all the filtering flags (-n, -o, -m/M).
Restriction Enzymes that DO NOT CUT in this sequence:
BbeI EheI FseI KasI NarI NheI NotI
PacI PaeR7I SalI SfiI SpeI SwaI XhoI
Total Number of Cuts per Restriction Enzyme:
AatII 5 BsiYI 130 EcoNI 5 MluI 7 SalI 0
AccI 5 BsmI 30 EcoO109I 2 MmeI 8 SapI 7
AflII 2 BsmAI 26 EcoRI 3 MnlI 184 SauI 1
AflIII 13 Bsp120I 1 EcoRII 49 MscI 17 Sau96I 61
AgeI 12 Bsp1286I 26 EcoRV 14 MseI 106 ScaI 4
AluI 89 BspEI 22 EheI 0 MspI 278 ScrFI 145
[remainder omitted]
Tables of Cutting Sites
(for enzymes that pass the filtering flags)
All of the following output formats (except the Infrequent Cutters
Map) can have the REs listed by order of their names in the database or
sorted
by # of cuts and thence alphabetically, like Strider, using the
'-c' flag.
In the next example output, the first example is sorted by 'order in
database', the second is sorted by '# cuts, then alphabetically'.
sorted by 'order in database'
== Cut Sites by Restriction Enzyme
AatII G_ACGT'C - 2 Cut(s)
1118 12143
AccI GT'mk_AC - 16 Cut(s)
632 3008 3073 3079 5393 5816 6881 7974 8815 9978 11962
14136 16857 18087 20067 20222
AceIII CAGCTCnnnnnnn'nnnn_ - 8 Cut(s)
2127 5417 11681 11750 15151 17252 17584 21795
AciI C'CG_C - 48 Cut(s)
567 850 1512 2455 2931 3300 3339 3490 4842 5468 5600
5907 6174 6318 7533 8123 8807 9086 9745 10482 10805 11626
11669 11896 12050 12214 12404 12417 12675 13337 13646 14217 14881
15949 16013 16668 17060 17089 17101 17132 18035 18140 18820 18896
19781 21146 21536 21958
AflII C'TTAA_G - 5 Cut(s)
1292 4185 11172 18334 20663
[remainder omitted]
sorted by '# cuts, then alphabetically'
== Cut Sites by Restriction Enzyme
ApaI G_GGCC'C - 1 Cut(s)
21979
ApaLI G'TGCA_C - 1 Cut(s)
10676
Bpu10I CC'TnA_GC - 1 Cut(s)
844
Bpu1102I GC'TnA_GC - 1 Cut(s)
13369
NarI GG'CG_CC - 1 Cut(s)
16014
NgoAIV G'CCGG_C - 1 Cut(s)
15606
[remainder omitted]
Tables of Fragment Sizes
(unsorted by fragment size, sorted or both; 'vertical' ordering can also
be sorted with '-c' as in the above example.
** SORTED Fragment Sizes by Restriction Enzyme **
AatII G_ACGT'C - 5 Fragment(s)
1849 3449 3731 4289 5110 14062
AccI GT'mk_AC - 5 Fragment(s)
639 1187 2192 3574 11828 13070
AflII C'TTAA_G - 2 Fragment(s)
6078 6541 19871
AflIII A'CryG_T - 13 Fragment(s)
35 170 459 493 956 1268 1712 1913 2360 2419 4091
4920 5733 5961
[remainder omitted]
Ladder Map
-
overhangs indicated: 5'(\), 3'(/) blunt(|) cutters
-
if more than 1 cut maps to a space, the number of cuts is shown numerically
(see AluI, below)
-
total # of cuts indicated at end of line).
-
width-sensitive, so you can increase accuracy by choosing a wider output.
See also the complete
output of the lambda analysis.
Ladder Map of Restriction Enzyme Cut Sites:
900 1800 2700 3600 4500 5400 6300 7200 8100 9000
: : : : : : : : : :
AccI ----------------\-----------------\-------------------- 2
AciI -----------------\-----------\--\---\\------\--\\------ 8
AflII ---------\-------------\---------\--\----\---\--------- 6
AhdI ---------------------------/--------------------------- 1
: : : : : : : : : :
AluI --|---|||--23||||2||-|2--2---|2|232||-|-|22--2-|3|-||-- 50
AlwI ------2---\-----------\--2-------------\2-------------- 9
Alw26I ----------------------------------------------------\-- 1
AlwNI -----------------------/--------------/--/------------- 3
: : : : : : : : : :
Infrequent Cutters Map
(a la Strider) of enzymes that cut less than X (a small number of times)
- (In version 2/3 this has been changed to be length-sensitive; X = 2 until
the sequence increases to beyond 10,000, then X increases proportional
to the log of the increased size). This output option is not called explicitly,
but is appended to the Ladder map (-l).
It should be obvious, but each label is of the form:
Name@Cut_Site
== Summary of Enzymes that cut ** 2 ** times or less:
XcmI@9286
XcmI@2466 UbaEI@9054
Sse8647I@1801 SexAI@9288 UbaDI@13309
SmaI@2882 PstI@10119 PstI@15573
Psp5II@1801 Pfl1108I@10898 SgrAI@16556
NruI@2248 HgiEII@10152 PflMI@18025
NcoI@2848 SexAI@8621 EcoNI@12615 NruI@17146
BstXI@2962 EcoNI@10850 NgoAIV@15606 PmlI@20299
BstEII@4076 EagI@12054 NarI@16014 Pfl1108I@20364
BstEII@1183 Bsu36I@10980 HgiEII@18039 MluI@22230
BstDSI@2848 NcoI@7724 Bpu1102I@13369 PacI@18544 MluI@22201
Bpu10I@844 PshAI@5399 ApaLI@10676 EagI@16150 BstXI@20204
AatII@1118 BstDSI@7724 AatII@12143 Bsu36I@16688 ApaI@21979
||| | | | | | ||| | || | | | || || | | | ||
-------------------------------------------------------------------
: : : : : : : : : : :
2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000
Pseudo Gel Map
- This option shows approximately how different digests would look if run
on an agarose gel. Turn the page 90°, blur your eyes slightly, and
it'll look like a massive agarose gel... really. Useful for selecting
restriction enzymes for best separation of fragments...
- It uses a straight log10() approximation, but the addition of math to
simulate different percentages of agarose/polyacryamide would be trivial.
If anyone has a favorite approximation, I'd be happy to include it.
- It uses the same representation as the ladder map,
with single fragments represented as '|', multiple fragments that cannot
be resolved as a digit showing how many map to that space.
- the flag values set the LOW and HIGH CUTOFFs - fragments smaller or
larger than this window are excluded from the simulation. The value must
be an integer multiple of 10 (10, 100, 1000, etc).
Below is a section of gel output generated with the width set to '-w75';
the sequence length was 22380 bp (so the gel is automatically extended
to 30 Kb).
From examining the UDP data so far, this is the option that is used
incorrectly most of the time. To obtain the results below (a 10 bp cutoff),
it should be used thusly:
tacg -g10 [other flags] < input.file
If you wanted a 1000 bp cutoff, change it to:
tacg -g1000 [other flags] < input.file
== Pseudo-Gel Map of Digestions:
10 100 1000 10000
. . . . . . .... . . . . . .... . . . . . .... . .
AatII | || 2
AccI | | | | | | |2| 222 | 16
AceIII | | | 2 || | | 8
AciI || || || | | | |22 || 33||4 |227||3| |2 48
. . . . . . .... . . . . . .... . . . . . .... . .
AflII | | | | 2 5
AflIII | | | | | | 2 2 | | 11
AhdI 3 | | | | | 7
AluI | 3 ||3 3|222 |||2223 3334|2352|2|42||| 2|2 74
. . . . . . .... . . . . . .... . . . . . .... . .
AlwI 5 | || | ||| ||||2|| | 3| || 26
Alw26I | | 2 | | | |2|| 3 32||3| |4|2 | | 35
AlwNI | | | 2 | | | 7
ApaI | 1
[rest deleted]
Pattern Matching
The pattern matching used in tacg has been extended to a more general approach
for nucleic acids so that it is applicable for searching for degenerate
patterns with a specified number of errors (in possibly degenerate sequences).
Pattern Matching with Errors
The original tacg allowed searching for nucleic acid patterns with the
standard IUPAC degeneracies, but V2/3 allows you to extend this capability
by allowing searching with errors.
Quite often when you are searching for binding sites for transcription
patterns, you wish to search for not only degenerate sites but degenerate
sites with 'a bit of slop'. You can use tacg to do this in 2 ways, using
the '-p' flag:
Say you had defined a site (called NPR) on a promoter by footprinting
as 'gtcagggcgaat' and you wanted to search another implicated region for
this sequence but you wanted to allow 2 errors to be allowed, so that you
could find a sequence that would, for example match as follows:
...gtcagggcgaat...
x x
...gtcggggcgatt...
An appropriate flag/option sequence to obtain such a match would be:
tacg -pNPR,gtcagggcgaat,2 -Sl -w75 <input.file |less
Translation: search input.file for the NPR sequence gtcagggcgaat,
allowing at most 2 errors (-pNPR,gtcagggcgaat,2),
returning the
Site data and a ladder map (-Sl) formatted
75 characters wide (-w75) and pipe the results directly to
the pager program 'less' (|less)
or you can specify that the sequence is
always
supposed to be searched 'with slop' via the addition of a single character
in the standard GCG-formatted
REBASE
file entry:
Here is the standard GCG REBASE format line:
NPR 6 gtcagggcgaat 0 !NPR is searched with 0 errors
Here is my hack to the standard GCG format line:
NPR 6 gtcagggcgaat 0 2 !NPR is always searched with 2 errors
You can also add more info to the REBASE file as described
here.
The algorithm to do this is reasonably brute force: an 'n' scan of the
sequence is done, replacing every base of the sequence with an 'n'. If
there is more than 1 error allowed, then each successive generation is
'n'-scanned. At each iteration, each generated pattern is compared with
the previous ones so that only novel patterns are actually stored. This
approach dramatically decreases the number of patterns searched for, but
can consume substantial CPU resources in the validation for long patterns
with many errors allowed. There's a tweak to the algorithm that will also
substantially decrease this time, but I haven't implemented it yet. I'll
see if anyone uses it 1st. Udi Manber's very fast, elegant, bitmasked
agrep [link]
could also be applied to this approach, but I haven't spent the time to
do so.
Proximity Matching
This is another option designed specifically for Transcription Factor (TF)
analyses. (see the man page section for details) In many cases, you are
interested in a particular juxtaposition of 2 or more TFs; this option
allows you to specify up to 10 sets of relationships between TFs and extract
only those which match your criteria. The output has been organized so
that you can see both the raw data (All) and
the filtered data (Matched) in the same output
stanza (below):
== Proximity Matches:
lef(1E)/IHF(0E) * 1 matches * found from search parameters: lef,200,IHF
57309/57116
== Ladder Map of Enzyme Sites:
# hits
10000 20000 30000 40000 50000 60000 70000 80000 90000100000
: : : : : : : : : :
lef |--|-|-|463222222||||25|2|243---|324|222222--32223|-||3-3|-- 98 (All)
lef ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------|-----|------------------------- 2 (All)
: : : : : : : : : :
Pattern Match Data Output by 'Bins' NEW
This option (-Gbin_width,X|Y|L)
was included to allow the results of the matching to be exported to other
programs for further analysis. Especially where there is a high number
of matches, it may be of more use to plot or analyze these data using an
external program, whether it be a spreadsheet, plotting app, or custom-writ
program. I don't presume to know what you might want to do with the data
and so to make tacg play nicely with other apps in the best unix tradition,
I've tried to address some of the simpler data output formats that might
be of most use. Output is possible in 2 rectangular formats (X and
Y;
one is the diagonal flip of the other) as well as in a long form (L)that
may be better suited for simple, custom-writ programs that just want to
read x,y pairs. The very slick gnuplot
can read the multiple columns of the Y form with only a very small
amount of editing or filtering of the output. An example of gnuplot's processing
of some tacg-generated data is shown as well. See also the gnuplot
stanza in the section below.
Using tacg with Other Programs
tacg is obviously not the only molbio program you need to do your work.
There are many analyses tacg just cannot do; to that end, it at least regurgitates
your data in a re-digestible form so that you can masticate it further
elsewhere. If you can program a bit, for quick and dirty programs you probably
can't do better than perl (but if you
can program a bit, you already know that). If you can't program, importing
your data into a spreadsheet and manipulating it cell by cell is certainly
viable with the sophistication of modern spreadsheets.
In terms of working with your data in the context of tacg, there are
lots of other programs that make input easier, and output more readable/browsable.
Here are some that I use on a regualr basis:
-
NEdit A
well-designed, actively supported, free GUI editor for Xwindows on unix
and VMS. Besides its own features, NEdit allows you to seamlessly pipe
selected text to external programs (...like tacg) and display the output
in a NEdit window, and even add menu items of external programs (...like
tacg), effectively making the NEdit/tacg combination a decent GUI biosequence
analysis tool.
I've included a few of my own shell commands and other NEdit settings
in the .nedit file accompanying the distribution. Either copy your .nedit
to a safe place and start NEdit with mine, or edit the tasty bits of mine
into yours. If you're just browsing, here's my .nedit.
While the NEdit/tacg combo doesn't allow quite the same flexibility
as Strider, it does allow you to use an exceptional editor to modify or
annotate your sequence in a very natural way and also allows you to submit
parts of that sequence for some of the usual analyses. On slow machines,
(~486/66) there is a perceptible pause while the data is passed to tacg
and tacg's output is redirected back into nedit, but the window does appear;
at least you don't have to go searching for it as you usually do in GCG.
-
joe
I can't stand vi or emacs so when I have to text-edit thru a term window,
I use joe. Now a standard part of linux, it compiles on just about everything
and while it isn't as powerful as emacs, it's intuitive and doesn't force
you to be bimodal as does vi. In order to use it with tacg, you can (as
with vi and emacs, use it's shell command to pass-thru whatever you've
selected to an external command:
Example:
-
^KB to mark the beginning of a block
-
^KK to mark the end of the block
-
^K/ to pipe the selected, hilited text to whatever program you wish,
such as tacg -n4 -sSl (to get summary, Sites and the ladder map
imported into joe) so that you can move around and edit the results from
within joe.
-
less
My favorite pager - slick design, fast, memory miser. Like joe, it can
pass selected text (but by lines only) to an external app with the '|from
mark X to here' command.
Example:
-
go to the start of the selected text
-
hit 'm', then a key to mark the current position (say 'a')
-
move to the end of the text
-
hit '|', then the same indicator key as above ('a'), then the command you
wish to pipe it to (|a tacg -n4 -sSl)
-
readseq
The standard for biological sequence format conversion, from that biocomputing
code wiz Don Gilbert. You can quibble with the interface a bit, but it's
a lifesaver. If only GCG could convert sequence like readseq...
-
gnuplot
gnuplot is a another of those many unix utilities that seems crude and
clunky until you play with it a while. And after a bit, things fall into
place and you wonder how you got along without it. Using tacg's -G
output with Y format, it is easy to plot the distribution of A/T
rich regions (using a pattern entry called 'at-rich') in the E coli genome,
along with the distribution of an IHF site (pattern called 'ihf') allowing
1 error. While they should both be further reduced with a sliding window,
you can see that it is reasonably easy to visualize primary data from tacg.
Example:
-
agrep
A really useful tool that works very much like the other members of the
'grep' family but with a few notable differences. It's VERY fast, and like
tacg, you can use it to search for patterns with errors. I was originally
going to use it as the basis for tacg's searching with errors, but by the
time I started to understand the bit operations that it uses, I had already
implemented a workable scheme using the code I had already written, which
was more applicable to the kind of output I wanted. I might switch later
- it's a very cool algorithm (underlies Jim Knight's grepseq as well as
the Glimpse and Harvest search tools). The technical papers describing
it (in the same FTP tree) are a model of readability.
-
Genscript
(Gnu enscript) This is an ASCII to postscript converter that allows
you to scale and rotate text on the printed page. For many tacg analyses,
it's convenient to have very wide output and genscript allows you to squeeze
the text you want in the space you have.
-
Alpha
Not a unix tool at all, but a hair-raisingly, spine-tinglingly good shareware
text editor for the Mac from Pete Keleher. Often compared with the commercial
BBedit. The reason I use it is because it works the way I work - almost
everything I'd like in an editor is there and if it isn't, Alpha is mostly
written in tcl so that it can be extended to do whatever you want. If I
was going to try a native Mac port of tacg, I'd try it 1st as an add-in
to Alpha, similar to the shell commands added to NEdit, described
above.
Using tacg to Scan Databases
Thanks to the SEQIO lib, taqcg now handles this just fine. Another
thing that would be very handy for database handling is the ability to
deal transparently with gzipped/compressed data; this is available via
the zlib library.
The Odd one: UDP Reporting
tacg was also designed to track it's own use
and spread - something in which I'm also interested, To that end,
the binaries have been compiled with code (udping.c) that spits a small
amount of data back to me at each usage IF YOU EXPLICITLY ALLOW IT TO.
However, with version 3, this has ceased to be the default. You can
still explicitly send back info (and I'd be grateful, but it annoyed a
few people and I think it has fulfilled it's usefulness.
To wit:
-
the IP number of the hosting machine
-
the cpu type
-
Operating System/Version #
-
what flags were used in calling the program
-
the number of bases processed.
It does NOT return host or domain names, user names, or actual sequence.
The exact data (minus IP number which is inherent in the UDP protocol)
that is returned is shown on stderr (usually the screen) at each use. ie.
Returned: [hw=IP22 os=IRIX osver=5.3]
[TACG Version 1.5F] /usr/local/bin/tacg -n5 -o5 -F2 -slL -g100 <100000
bp>
If your machine is not connected to a network, is behind a firewall,
or there is a problem in getting the data to my machine, there is no penalty,
as it's UDP output (unverified). This allows me to find out what OSs it's
been ported to, what the popular flags are and which ones are used little
if at all. I understand that some people may be queasy about this function,
especially if they haven't seen the source code (which is freely
available), so this function can be disabled from the command line
(-q) and can easily be eliminated when recompiling (see tacg.h in the source
code). If you do this, I'd appreciate an email letting me know what you
think of it otherwise. I'd also be interested in hearing suggested refinements
and critiques of this process.
Changes and Fixes since Ver 2.28
Major
-
In prepping the readseq-enabled Web version (2.36), I came across a minor
bug in the '-b' and '-e' flags, now corrected.
-
Thanks again to Catherine Letondal for catching a bug in the ORF (-O) section
which would cause a segfault if there were more than 100 ORFs in any frame.
No problem if your sequences were either small or you were looking for
large ORFs, but looking for small ORFs in large sequences (what tacg is
designed to do) would get you there. The Web version is already patched
as well.
-
Thanks to Claude Aflalo, a bug was fixed that stopped the correct placement
of tics very close to the origin in the Summary Map (and stopped the program).
-
The routine for reading REBASE files now checks for both Carriage Returns
and Line Feeds to enable uploads from Macs and PCs.
-
Thanks to Catherine Letondal, numerous bugs were fixed, including bad casts
that crashed the DEC Unix versions whenever log10() was called.
-
In the -O flag, a stupid mistake prevented frames 4, 5, and 6 from being
translated correctly (thanks to Kirk Fry for pointing this out). Fixed.
Minor
-
Ashok Aiyar suggested that since tacg is pretty terse, it emit a message
when no other output has been requested. Incorporated.
-
Catherine Letondal (CL) pointed out a number of glitches in the man page,
HTML documentation, and wording of a number of sections. Most of these
have been incorporated.
-
CL also caught an 'off-by-one' error in analyses where the sequence was
defined as circular.
-
While fixing the -O flag bug, I realized that the offsets listed in the
output were less than straightforward, so I fixed them as well, adding
a Beginning offset and Ending offset. Frames 4, 5, 6 have their offsets
manipulated so that they refer to the current top strand instead of the
previous bottom strand (so that index numbers are easier to find on the
linear map.
Changes and Fixes since Ver 1.71
Major
-
Added Degenerate sequence analyses, with
5 modes of searching, including adaptive switching of the search algorithm,
based on sequence input. The 'fast' searching of nondegenerate sequence
will digest/search about 500kbases/sec on a 200MHz PPro. The 'slow' search
type is about 10 times slower. The default method runs using the 'fast'
method unless it hits degeneracies in a small window leading the current
position. In that case, the degeneracies are processed with the 'slow'
method, and then once the window is cleared of degeneracies, tacg speeds
back up on non-degenerate sequence. Thus, other things being equal, search
speed is inversely proportional to number of degeneracies, although long
runs of 'n's will not slow it down.
-
Added Searching with Errors. Using the '-p'
flag, you can denote both the pattern and number of errors you want to
allow. This also works with the -P (Proximity) flag (see below) and can
be applied to any existing pattern by appending a single term to its REBASE
entry. For each error allowed, this involves making an 'n' scan across
the sequence length. For each level of error, each sequence thus generated
is checked against the previously validated sequences to check for redundancies
and is only added to the list if it is novel. This backchecking results
in some time being lost in the startup, but it enormously reduces the number
of sequences to be checked and thus increases the speed at which the input
sequence is scanned by a similar amount.
-
Added a method to extract 'match data by bins'
(of bases) so that much of the internal data is available for external
analysis or plotting - one form of output is of the type preferred by gnuplot.
Another is useable by most spreadsheets, such as NeXS, Applix, or even
Excel BP .
-
Added -p flag, allowing patterns to be entered from the command
line. The patterns so entered are logged (with the date) to 'tacg.pattern'
in the current directory in the same form as the GCG-formatted REBASE that
is the default for tacg.
-
Added -O flag, allowing simple Open Reading
Frame analysis. Allows any of the six frames to be translated; can specify
the minimum ORF length to be allowed. Output is currently only DNA and
Amino Acids offsets and MolWt, but adding more analyses should be relatively
easy.
-
Added -P flag to allow Proximity matching.
This also works with the variable error rate described above. This allows
multiple relationships to be defined and only the sites that pass the filter
are reported. These relationships can be composed by terms that restrict
it by 'upstream vs downstream', 'less than' or 'greater than' a distance,
'within a range of distances', or logical combinations of these terms.
-
Fixed (or substantially fixed) the previous 30bp minimum sequence requirement.
It's now down to 4bp, allowing analysis of polylinkers and primers. If
you need analysis on fewer than 4bp, this is probably the wrong program
for you.
-
Major internal re-write of major data structures to streamline and
compact code. Substantial memory savings for Enzyme struct 'RE' which is
also now dynamically allocated, so there are no longer any hard limits
on how many patterns can be used. Several major logical bugs were
rooted out and addressed. Probably an equal number were created..
-
Minor
-
Assorted Interface tweaks
-
The number of Summary Cuts is now calculated based on length of sequence,
rather than being static.
-
Changes and Fixes since Ver 1.5
Major
-
Addition of the '-c' flag which sorts tabular data 1st by # of cuts and
then alphabetically. Makes output run from smallest number of cuts to largest
so that you can group the REs for those cuts in you're interested.
-
Addition of the '-r' option that allows the explicit selection of REs by
name (up to 15 without recompilation, but you can #define any number you
want). This was primarily for the Web version but has been incorporated
into the commandline version as well.
-
Addition of the '-H' option that helps out the Web version by generating
some crude HTML tags for formatting and building the Table of Contents
of the Results. As of now, it's not very useful outside of the Web version,
but maybe it'll evolve to generate standalone HTML output
-
Fixed a bad bug in the sequence input routine (GetSeq.c) that would crash
the Webified program if it was taking sequence input from a Mac or PC that
did not terminate lines in a Unix-style.
-
Documentation is now mostly HTML, so it can be read via Web Browser for
easier movement.
-
Addition of Web interface via Tom Boutell's cgic library and and Fred Criscuolo's
help. This has driven many of the above changes and helped in uncovering
(and generating) many bugs.
Minor
-
Included the '==' prefix to the major output stanzas so that you can immediately
jump to them with a searching pager such as 'less' or 'more'.
-
Tidied and spiffed the HTML page and re-ordered the flags in the help section
and man page to be ~alphabetical.
-
Many smaller bug fixes and cosmetic changes.
-
The fast gets faster. tacg on a PPro/200 is faster than on a similarly
clocked SGI Indigo2/R4400, and even faster than an 233MHz Alpha (?!).
-
Fixed some nonfatal, but bothersome messages from SetFlags().
-
For Web interface, made up Vendor-specific versions of the REBASE file.
As well, Dana Macelis of NEB helped with the generation of a price field
so I could offer a 'Cheap' file.
Changes and Fixes since Ver 1.34
Major
-
the command line handling has been completely re-written, using the getopt()
function, so the flags are considerably less sensitive to spacing and order.
"-n 6" or "-n6" are now equally acceptable and flags without variables
can be grouped "-Ls" (but not "-Lsn6" - that must be entered "-Ls -n6"
-
All output is to stdout, instead of to named files in the cases of the
ladder and gel map.
-
All output is dependent on the width setting (-w#), rather than having
different width settings for different options
-
the linear map is now produced only on request (-L), reducing the amount
of output under normal circumstances.
-
the enzymes displayed on the map are filtered to represent only those selected
by the various flags; previously all enzymes were displayed. This was the
result of one vocal request; I'd be interested in hearing supporting or
opposing views.
Minor Several bugs and much seemingly
unconnected weirdness disappeared when I used getopt().
Bugs and Oddities
Major
-
tacg, unless recompiled without it, spits
back about 100 bytes of information about it's use (the result of `uname`,
the program version, the command line flags, and the sequence length) to
enable me to track how and how often it is being used. If you are uncomfortable
with this trait, you may disable it from the command line ('-q', see
above).
-
noted above; tacg has no sequence format conversion
abilities, it will happily eat and analyze the acgt's from the header (or
your thesis) if fed them.
-
both main() and functions were written as single pass code; little effort
was made to free() memory grabbed in the analytical stages. If you plan
on using any of this code for re-entrant programs, you'll have to be *very*
wary of this. As if... High priority to fix..(still).
-
translation in 6 frames assumes circular sequence regardless of '-f' flag,
so that the last amino acids in frames 5 and 6 in the 1st output block
are obviously incorrect if you are assuming linear sequence. This is still
incorrect as of V2, but is also reasonably obvious.
-
Related to above, if you want translation of circular DNA, the translation
routines do not yet support translation through the endpoints, so if there
is an ORF in progress when it hits the end of the sequence, it will not
continue the translation across the boundary (nor in the reverse direction).
This occurred to me just prior to release of V2, and it will be one of
the 1st things addressed. This bugfix may also fix the bug mentioned directly
above.
-
if you are running tacg on a machine that
was set up on a TCP/IP network and then disconnected it from the network
(took it home or pulled the ethernet card), the UDP reporting code will
cause the program to hang at the end. If this is the case, either use the
quiet flag (-q), recompile the code with the REPORT #define set to be quiet,
or hit [Cntrl+C] to terminate it.
Minor
-
the code is still laden with debugging statements, many of which are still
operative if you specify the -V (Verbose) flag. Previous to Ver 2, tacg
was quite chatty while in progress, but is now silent by default, unless
a major error occurs or you request the output. This allows tacg to be
used by other programs (Nedit, for example) which trap stderr and open
windows to display it.
-
in the linear map, although the enzyme names stack efficiently, they sometimes
appear to stack inappropriately. This is a result of the order in which
they were encountered; a site that has a long offset to the cut site, but
encountered in the opposite strand, can map before but still be higher
in the labels.
-
NOT a bug but a slight difference in accuracy. tacg
finds all the sites that Strider does and a few more. Strider is apparently
confused by some degenerate sites that overlap with themselves. I've only
seen it with an enzyme called GdiII. There is also a surprising bug
in Strider 1.2 that causes long, non-palindromic patterns to be found
correctly in the top strand, but ignored in the bottom strand. I found
this because a (corrected) bug produced the same error in tacg in a few
cases.
-
relatedly, because of the algorithm used to find sites, and the offset
recognition sequence of some enzymes, tacg will
find cut sites for the same enzyme in opposite strands that map to the
same nucleotide, resulting in fragments of zero length. After some thought,
I've decided to consider this a feature, not a bug ;) better to know all
the possible sites than not.
-
the tic markers in the last block of linear map do not stop with the sequence.
V low priority.
-
when searching for patterns with errors,
some patterns generated are shorter than the parent pattern. Because of
the way this has been implmented, a hit will often be a subsequence of
another hit (aagctg is subsequence of gaagctg, but will be
noted as a separate hit. As in the case above, I've decided to call this
a feature, not a bug (so far). Just be careful...
-
some flags are ignored; if you decline a linear map, translation format
requests will be ignored because that is the only way for them to be written
at this point. As the protein analysis code appears, this may change.
-
in the process of writing this, I've come to realize that there are many
cases where the code is fragile, wrong-headed, inefficient, and/or crude
beyond imagining. In many cases, I'm planning to fix this; in others, it
doesn't impact the speed (hashing), I'll probably leave it alone.
I am extremely interested in hearing about new bugs, suggestions for new
features, and overall comments on the program.
To Do List
-
Incorporation of a sequence parser for the input, either from 'readseq'
or 'SEQIO', so that it does not need a separate program to do so, although
because of the organization of many sequence formats, many files can be
simply be 'tail'ed to remove the header and the rest of the sequence is
digested smoothly, filtering out numbers, spaces, newlines, etc. Crude,
but effective. But perhaps not - I like the simplicity of it this way -
arguments?
-
An earlier version of this code has already been converted to a GUI with
tcl/tk. Depending on feedback and time constraints, I'd like to convert
the latest version to a GUI, probably using FORMS or other C-based GUI
toolkit (the Vibrant toolkit?), and in fact some of the things I'm planning
can really only be done using a full GUI program. We'll see... tcl/tk is
questionable. For a programmer, it's very nice and flexible, but it's not
really standard on all unix machines yet and the setup for mere mortals
is horrendous. If tacg goes to a GUI, it'll
probably be as a standalone application, but don't hold your breath.
-
Even without a GUI, I plan on converting some of the output to graphics,
so that more information can be put on one page. Circular maps are difficult
to implement using only alphabetic characters. Color can be used to great
effect in some presentations, although I'm a firm believer in Tufte's maxim:
Maximum info, Minimum ink. On hold to see how the Web interface can be
used.
-
Assuming significant usage, analysis of returned data to put together a
picture of how the program spreads and how it's being used.
Cheers Harry
...And The Small Print Taketh Away...
COPYRIGHT NOTICE
Copyright (c) 1996,1997 by Harry Mangalam at Univ. of California, Irvine
Permission to use, copy, modify, and distribute this software and its
documentation is hereby granted, subject to the following restrictions
and understandings:
-
Any copy of this software or any copy of software derived from it must
include this copyright notice in full.
-
All materials or software developed as a consequence of the use of this
software or software derived from it must duly acknowledge such use, in
accordance with the usual standards of acknowledging credit in academic
research.
-
The software may be used by anyone for any purpose, except that its redistribution
for profit or inclusion in other software sold for profit requires the
express, written permission of the author. Note that the written permission
of the author is not required for any use of the software that may result
in profit, and the author waives any rights to any part of profit gained
by any use other than its redistribution for profit.
In plain English, the only thing that requires the express, written
permission is if you want to sell software that includes or uses this software.
Any use other than selling software that uses this software is fine, and
any profit made from that use is yours to keep.
-
This software is provided AS IS with no warranties of any kind. The author
shall have no liability with respect to the infringement of copyrights,
trade secrets or any patents by this software or any part thereof. In no
event will the author be liable for any lost revenue or profits or other
special, indirect and consequential damages.