Welcome to tacg Version 3.5
Document Version 4.0 - June 6, 2003
This page provides information about both the commandline and Web interface of tacg Ver 4.0, describes some of the philosophy behind its
design, provides links to more documentation, and where it can be got.
Please note that this document is in flux, incomplete, and certainly
inaccurate in places. The man page is more likely to be correct than
this doc, and the help section even more likely to be correct. In the event
of complete incomprehensibility, please mail me.
A web interface to version 4.0 is available;
you can find some sites that offer it here.
Table of Contents
Cross Index of Command-line Options
What's new or changed with Version 4.0 ? (vs 3.x)
Briefly:
- Added circular graphic plasmid maps in Postscript and PDF.
- Added 'Rules-based' searching
- Added cloning features to find REs that hit or
don't hit in particular regions.
- brought man pages and HTML-ized man pages up to date.
- use of -p (pattern) also sets -S (Sites table) so that tacg acts more grep-like.
- changed license to DUAL license, either GPL or proprietary.
- better autoconf integration (tho no automatic regression testing yet)
- integration (finally) with Jim Knight's SEQIO to allow automatic
sequence conversion on input and scanning multiple sequence databases (or
not - with --raw it still supports the ability to
assume that ALL input is sequence - useful for analyzing file fragments
or editor buffers).
- can use it to search for arbitraily complex combinations of patterns at
the same time by combining pattern specifications with the logic
conjunctions AND, NOT. see --rule.
- searching for regular expressions (in nucleic acid), with autoconversion
of IUPAC degeneracies to the appropriate regex.
- searching for TRANSFAC matrices, with user-specified cutoffs.
- gel simulations with low and high end cutoffs for expansion.
- selection of Restriction Enzymes explicitly, by overhang generated,
magnitude of recognition site, price, minimum, maximum number of cuts
(overall or on a per-pattern basis).
- supports Combination Cuts of up to 15 REs at a time.
- supports limited AFLP fragment matching / simulation.
- simulates Dam and/or Dcm methylation of DNA
- searches for silent sites, with reverse translation.
- ORF finding in any combination of frames with FASTA output, with offsets
in DNA, protein, Molecular Wt, pI, with optional additional info on
AA frequency in #s or %.
- conditional output based on matches, for scanning large numbers of
sequences at a time.
- sequence extraction surrounding pattern matches, with variable upstream,
downstream inclusion, optional reverse translation in FASTA format.
- various ways of making the output even more compact (printing only
single strands, or eliminating tics)
- uses autoconf/configure to ease building on different platforms.
- by default, tacg no longer sends usage data back to me.
- includes an explicit example function to show how to add your own funtionality.
- much more robust, debugged, especially as to memory allocation.
Since most of the routines tacg run in a loop now, memory leaks are
much more apparent and get plugged quicker.
- BUT: more features leads to more (new) bugs so be on guard..
What is it?
tacg is a character-based, command line tool
for unix-like operating systems for pattern-matching in nucleic acids and
performing some of the basic
protein manipulations. It was originally designed for
restriction enzyme analysis of DNA, but has been extended to other types
of matching. It now handles degenerate sequence input in a variety of matching
approaches, as well as patterns with errors, regular expressions and TRANSFAC-formatted
matrices.
It was designed to be a 'grep' for DNA and like the original grep,
it's capabilities have grown so that now the author has to keep calling
up the help page to figure out which flags (now ~50) mean what.
tacg itself is NOT a GUI application in any sense. However, it's existance
as a strictly command-line tool lends itself well to Webification and it is
now distributed with a web interface form and a Perl CGI handler (the original
C CGI written with Tom Boutell's CGIC code is also distributed with it for
historical/educational purposes, but Perl is a MUCH easier interface language.
NO question about it.)
Also, by it's strict command line nature, it can parasitize the interface of a number of GUI
programs that allow passing selections made within their windows to external
programs. This is particularly effective with a free, Xwindow editor called
NEdit. With minimal reconfiguration, NEdit can become a fairly good biosequence
editor and analysis tool a la Strider for XWindows. See the entry for
NEdit below.
A real world example may help to illumninate.. say the Genbank entry for the lambda
genome.
tacg -w75 -n6 -F2 -l -c -g'100,10000' -L -T3,1 -O145x,25 <
lamcg.gb | gzip - > lambda.map.gz
Translation: Process the file 'lamcg.gb' (Genbank format) thru
the seqio routines (transparent to users) and pipe the following sequence
to tacg, returning the analysis:
- 75 characters wide (-w75)
- on all 6+ cutters (-n6)
- giving me the sorted fragment sizes (-F2) of those Enz's that match
- providing a ladder map summary (-l)
- sort tabular ouput by number of hits and thence alphabetically (-c)
- provide a gel simulation with a 100 bp LOW cutoff and a 10000 bp HIGH
cutoff (-g100)
- with a linear restriction map (-L)
- with 1 letter, 3 frame translation (-T3,1)
- with the ORF table for frames 1,4,5 at a minimum ORF size of 25 AAs with
the extended composition info (-O145x,25)
- and pipe the output thru 'gzip' to a file called lambda.map.gz.
Altho it's not clear why you would be interested in seeing this particular
analysis, here's the resulting
output (130KB text, gzipped - you'll need a gzip-compatible app to
view it, such as Stuffit
Expander, Stuffit For Windows, Winzip,
or even ... gzip).
The lambda genome used to generate this map is also included in the
tacg distribution directory (called lamcg.gb.gz)
How does it handle incoming sequence?
If the --raw flag is used, the file is processed directly by the GetSequence() function
which reads everything and strips out anything not interpretable as an IUPAC character.
Note that there are a lot of IUPAC characters and the typical letter to your Mum read
with the --raw flag will probably yield quite a bit of acceptable, if uninterpretable,
sequence.
If the --raw flag is not used, the input stream is processed by the SEQIO routines.
If the stream
contains multiple sequences, they are passed to tacg serially, in an internal loop.
tacg reads ASCII characters from the buffer passed in by SEQIO, determines
whether there are IUPAC characters in the stream and, depending on the
-D flag option, EITHER strips out anything not an 'a', 'c', 't',
'u', or 'g', OR allows all IUPAC characters (but strips all others,
including numbers, spaces, tabs, linefeeds, etc), and places that input
sequence in a buffer.
What it does then depends on what other flags are set. The 1st writ
function (probably still the most popular) is Restriction Enzyme Analysis.
Flags calling this function cause tacg to read in a series of named
patterns (either from the command line, or more typically, from a GCG-formatted
REBASE file, now somewhat extended for increased functionality) and match
them against the sequence. These matches are logged, then
some basic math is done with them and the output is sent to stdout, in
the form of tables of cut sites, fragments generated, various maps and
simulations.
The number of recognition patterns is
now dynamically allocated so 1000s of patterns can be searched for at once
(I've generated ~7000 without problems), as long as they are in the simple
format required. For instance it can now process all the patterns
in the TRANSFAC database at one shot (altho I'm not sure that would produce
anything usable).
A similar function matches (possibly degenerate) input with patterns
that contain errors (as well as simple degeneracies). This means that if
the pattern is gartc with 1 error, it could match
nartc,
gnrtc,
gantc,
garnc, or
gartn.
Of course, the 1st and last sequences decay
to artc and gart, as a leading or lagging 'n' makes no difference
in the pattern. NB: See the note on how subsequence
patterns are currently handled.
Related functions (-P and --rule)
analyze the data from the above searches by parameters that you give them and
filter the number of reports by your proximity rules.
The -O flag allows Open Reading Frames to be
extracted. This function is not very sophisticated, but does present the
data in a way that allows it to be analyzed by other programs (each ORF
is written in a FASTA format so that you can use the resulting file for
other types of searching).
Enzyme selection can be filtered by a variety of criteria; these criteria
affect what is printed as output (if only 3 enzymes match your selection
criteria, only those 3 will show up in most analyses. They can be chosen
by 'magnitude' of recognition site (4, 5, 6 cutters, etc), by overhang
generated, by minimum/maximum times they cut, cost, dam/dcm protection, etc.
The entire feature set is described in more detail below, but it includes:
- Enzyme cut-frequency summary tables,
- Tables of site data (sorted or not)
- Tables of fragment data (sorted or not; ordered or not)
- Linear map with names stacked efficiently with/without translations in
1,3, or 6 frames, from 1 of 8 Codon Usage tables
- Pseudo-graphical ladder maps a la GCG
- Summary map of infrequent cutters
- Gel simulations (using a simple log fit).
- Tables of ORFs for any frames, with offsets in bases, AAs, and MolWt. [new]
- Entering patterns from the commandline [new]
- Searching for patterns with errors [new]
- Proximity matching - reports only sites that match distance criteria (up/downstream,
less/more than a distance, within a range of distances) [new]
- Other match data output formats
What will it do next?
What you want it to do will have an effect on what I work on. If you want
something added, especially a variant of an existing analysis, I'd be happy
to consider it, so let me know your opinion. Otherwise, I try to add what
I believe to be the most useful features, especially if they are missing from
other tools.
Anyone who believes the promises of software suppliers also believes
in the tooth fairy. That said...
- I'd like to add secondary structure prediction output to the postscript
maps, so if I can wrangle some public domain code into shape, I may be
able to carry thru on this wild promise as well.
- XML output (BSML for example)
would be a neat additional feature.
- support for more sequence formats to Seqio would be useful (support for some
binary formats (Lasergene, DNA Strider, GFF, etc).
- automated regression testing would be very useful for development testing and
porting to different platforms.
- adding support for more efficient searching using Manber's agrep code for extremely fast
approximate searching.
Where can I get it?
tacg V3.5 and higher is now available via its own
Sourceforge Project site in ANSI C source code. Precompiled binaries for a few
architectures may follow. Instructions are included with each package.
Is there a GUI or Web Interface?
tacg is a command line program (see Simplicity,
below) but as such, it lends itself well to Webification as a cgi program.
The Version 3 Web interface is actually composed of three similar pages that emphasize
different aspects of tacg's abilities. All three pages address the same CGI
(perl-based for this version), and call the same tacg executable. The reason for the
different pages is just for clarity. Only the 1st is finished and you can find sites
that make it publicly available via the Sourceforge
project site
The only substantial crippling is that it currently has a cutoff at 500,000 characters
as the output for that size of analysis can get very large for some analyses. If you
routinely use more sequence than this, it's probably worth your time to install it
locally. NB The 500,000 character limit, while probably sufficient for most
analyses, counts the header and comment characters as well.
Is there a man page I can look at?
Yup. You have your choice of
All are included with both the source code and the binary pkgs.
Design Criteria
Simplicity (and a few Alternatives)
In answer to: "Why does the world need another restriction mapping program,
especially one as crude as this?" or Why didn't you just use (or modify)
XXX?", I enumerate:
- Well, in the 1st place, crude is not always bad... 'grep' could be
classified as crude and it's one of the most useful and used programs
in the world. It's free, simple, flexible, chainable, and very fast;
it provides the desired info in a useable form with a minimum of
mucking about.
- There are a variety of programs available for Windows and (especially)
the Mac for doing these kinds of analysis, but there's little along
the same lines for unix, especially Linux, which is an extremely
attractive OS for bioinformatics. There is the very complete
EMBOSS package
tho, which does a LOT more than tacg, but what they both do, tacg does much
faster (20-50x faster).
- Many available programs, even the commercial ones, start to come undone
at sequence lengths quite a bit smaller than those now being
published. Dynamic memory allocation relieves this limit, so I
decided to write one that incorporatedS this feature. tacg is relatively miserly
in memory requrirements - vs the EMBOSS restrict program which uses >100MB RAM
to analyze the E coli genome (4.6 MB), tacg uses 13MB.
- Even some of my favorite programs (such as the beautifully designed DNA
Strider) refuse to analyze degenerate DNA sequence, making the
generation of useable plasmid maps from the usual bits of known
sequence, polylinker, and hand-drawn maps impossible or extremely
unpleasant.
- While I have had access to GCG for quite some time, I have been
increasingly aggravated at it's speed, output, cost, and the fact that
it doesn't run on Linux.
While tacg's code is sufficiently arcane that
few bench biologists will attempt to add functionality to it, an undergrad
CS student could, especially with the inclusion of the new example
function.
- There is also the challenge of creating a program that, like grep has
significant utility in a relatively small footprint.
- It was fun.
However, if you want alternatives, here are a few:
- The very complete
EMBOSS package has to top the list. It's well-designed, autoconfigs, has good
documentation, and is Open Source (GPL). All you could ask for is a little more
speed...
- Roland Walker's
SEALS Perl package is also very comprehensive, but requires a lot of add-ons
and admin to get it up an running.
- There's the stalwart (if ugly) GCG, but it's $ware and does not
(yet?) run on Linux. It also remains essentially a collection of
unfriendly command-line programs (hey, nice glass house..) sheltered
under a common but still not too intuitive GUI. The GUI is strangely
warped - it appears to be very similar to a HTML forms-based GUI, but
with neither the benefit of HTML, nor many of the advantages of a
'native' GUI. The controls are surprisingly crudely laid out (much like
the tacg Web form), which is odd for a 'native' app; I would have
expected a much more compact and intuitive interface.
- Webcutter
2 is a Web-only cgi (no command-line version, no source code) that Max Heiman put together at
about the same time that I was writing tacgi. I've tried it a few times
and while it may be useful, I found it too slow and the output a bit
info-sparse. It does have some features that tacg lacks (coloration of
selected Restriction Enzymes (REs), grabbing sequences direct from
Genbank to restrict), but for my own use they weren't that compelling.
Although the access is free, you cannot install it locally, nor examine
the source code. Nevertheless, if anyone thinks that a feature of
WebCutter would make tacg considerably more usable, I'd be interested in
hearing about it.
- WebGene is a
service offered from Indiana which offers a small subset of the options
and analyses that WWWtacg does. It has a similar Restriction Enzyme
selection interface that the original WWWtacg had, but I decided that
while it was a nice display, it took too much HTML to code which made
selecting the REs too slow. It offers both an ORF interface as well as a
Restriction Enzyme interface.
- Cloneit, a
unix-based cloning program with a web interface as well.
- BioNavigator from eBioinformatics, now
Entigen, VERY nice (uses tacg, incidentally), but for-pay.
- Curatools, Curagen's genomics portal
(still free, I believe).
- restmap (perl thingie out of stanford by Steve Chervitz, now at
Neomorphic). Quite good Web output, including colored image maps
that really nice) but being perl, it's a bit slow and doesn't have nearly
the features that tacg has. Appears to have been taken offline..?
- The Biology Workbench is an
excellent example of how a Web Interface can make coherent a great number
of the free software tools of the net. Now that you mention it, tacg is
one of its included tools :).
The only exception I take to Web interfaces (including my own) is the
response time/uncertainty of the connection and the inability to retain
state easily. But it is a great way to take advantage of programs that
you'd never be able to take the time to get working and working
reasonably easily and a well thought-out interface can do a lot to reduce
the problem of retained state.
- Steve Smith's 'GDE' has been compiled for
Linux, but requires a fair amount of dedication to put and hold it
all together; last time I looked it was available only an OpenLook
interface, which is an increasing disadvantage. Steve is now at GCG (I
think - the new versions of GCG seem to show the GDE influence.) so there
probably won't be a huge amount of work going into it.
- Brian Fristensky's
estimable suite of programs BIRCH (of
which tacg is also a part) is pretty good and his comments and
suggestions have resulted in a a number of the improvements seen here,
especially the Degenerate Sequence matching.
- MSI has just released an app called Gene Explorer, which includes quite
a bit of functionality as a loss-leader to attract researchers to their
higher-end products. Haven't had enough experience with it to comment
further.
- James Tisdall's 'DNA Workbench'
is designed along the same idea as tacg, but
requires perl and is therfore interpreted.
Despite perl's many syntactic and other advantages, I wanted something
even simpler (a standalone app) and therefore easier to install, faster,
and more complete.
- Don Gilbert's
multiplatform, GUI ' SeqPup' is
the closest to an ideal setup in terms of design and interface. It
has recently been recoded in Java and Don is still leading the
way in terms of innovation and interface in bioinformatics. Anyone
interested in interface design for bioinformatics is well-advised to
check out and follow Don's contributions. His have consistently led the
field in terms of usability, Internet use, cohesiveness, and
inventiveness.
- On some systems, you can run Executor, Macintosh Application Environment,
SoftWindows, and Virtual
PC and thus have access to the Mac and Windows programs, but that's
also somewhat costly, slow, and complex.
Installation
tacg REQUIRES only 2 files - the executable (tacg,
which can be renamed) and the Codon table file 'codon.prefs'. Most people will
want to use the restriction enzyme database file (rebase.data; included with
the distributions, which is simply the GCG-formatted version of Rich
Robert's REBASE), but it is not strictly needed - you can enter patterns
from the command line since Ver 2. The codon usage file (codon.prefs) is also
required if you want to do any translation. It was required in the 1st few
versions, then was compiled into the app in version 2 and now has been
externalized again for easier access and modification. As the sequencing
factories roll thru more organisms, there will more need for changing or adding
to the file.
The data files (rebase.data, codon.prefs, other pattern files) will be
found in any of 3 places automatically, in the following order:
the current directory ($PWD), the home directory
($HOME), or the tacg lib directory ($TACGLIB). The
program searches these paths and gives up only if the file can't be found in
any of these places. 'rebase.data' is ASCII text and can be edited and modified
by the user, if required. I'd suggest placing 'rebase.data' in the TACGLIB
directory and keeping modified versions of these files, if required, in your
home directory which is searched first.
tacg can use the NEB/REBASE-supplied file of restriction enzyme data as is, (in GCG
format - latest unmodified version
here) but can also make use of additional extended information if available.
Version 3 will use error, cost, and methylation information added to the file as
shown here.
The format of the unmodified version is:
Comments
..
; Top Recognition Bottom
;Name Offset Pattern Offset ! Isoschiz >Supplier Index
;AarI 3 CACCTGC 0 ! >
;AatI 3 AGG'CCT 0 ! StuI,Eco147I,Pme55I,SseBI >O
AatII 5 G_ACGT'C -4 ! >ADEFIKLMNOR
;AauI 1 T'GTAC_A 4 ! Bsp1407I,BsrGI,SspBI >I
AccI 2 GT'mk_AC 2 ! FblI,XmiI >ABDEGJKLMNOQRS
The format of the MODIFIED version is:
Comments
..
; Top Recognition Bottom
;Name Offset Pattern Offset Err dam dcm units $ ! Isoschiz >Supplier Index
;AarI 3 CACCTGC 0 0 100 100 0 0 ! >
AatII 5 G_ACGT'C -4 0 100 100 2500 176 ! >ADEFIKLMNOR
AccI 2 GT'mk_AC 2 0 100 100 2500 220 ! FblI,XmiI >ABDEGJKLMNOQRS
AccII 2 CG'CG 0 0 100 100 444 99 ! Bsh1236I,BstFNI >KQR
Acc65I 1 G'GTAC_C 4 0 100 3 5000 200 ! KpnI,Asp718I >DFINR
In the above, the extra fields are:
- Err - the maximum number of errors that can be tolerated in the search.
- dam - the offset from the Top Offset that if methylated by Dam
methylase, blocks cutting
- dcm - the offset from the Top Offset that if methylated by Dcm
methylase, blocks cutting
- units - the number of units packaged (by NEB) in their Small size lots
- $ - the cost for that many units above.
Program Output
The output of tacg uses only alphanumeric characters (except for the recent circular
graphics maps) so that all of its output can be viewed on a vanilla vt100-like
terminal, although you can do more useful things if you're using an X display or have
imported the output into a Word Processor (some of the output can best be viewed using
very small fonts or in multiple columns on a page).
The different output sections have been prefixed with "==" so that you can easily jump
to the different sections if you're viewing the output with an editor or pager such as
less. In the WWW version, the different sections have been labeled
with HTML tags so that clicking the automatically generated Table of Results entry will
drop you to the corresponding results section.
Plasmid Maps
- WARNING: This is a beta-level function and may explode without warning or
give inaccurate results. If you suspect the latter, please let me know.
- Select the # of REs to place via the Restriction Enzyme Selection section above,
paying special attention to the Maximum Number of Cuts setting.
Usually a Max of 1 or 2 is what you want.
- To place ORFs on map, check the "Stream Out Open Reading Frames" stanza
immediately above, setting the minimum size of the ORF to detect, and the frames
to display. ORF (The extended info will be printed but is currently ignored in
the plasmid map.) The map will not accurately track ORFs that cross the
origin, so if you suspect that an ORF crosses the origin, edit or subselect the
sequence to verify.
For the Web interface, if you're using a Mac, printing to a
Laserwriter-type device, you might try the '105' width, with the 'Small' font,
then use Page Setup to select Landscape output, 2-up (2 pages fitted to one
page).
If you're using MS Windows, there are probably similar output
selections that you can use.
If you're using a unix box, save the output in whatever width
you want and pass it thru a text-to-postscript filter to generate nicely
formatted output. Some that I've used are:
- lptops - included on SGI's; probably available somewhere for other platforms.
- nenscript
- (Free implementation of Adobe's enscript)
- genscript aka GNU enscript
- Free, most features, currently supported, no good reason to use any other...
- a2ps - a monster text-printing system that will handle almost any
conceivable file format (and allows you to color code your own if you
want).
Use it thusly:
genscript -1GZ -r -fCourier7 -p- [filename] | lpr -P[PSprinter]
Most Xterminals allow different font selections and window re-sizing,
so you can probably view the results adequately that way as well.
High Portability
The program is written in vanilla ANSI C. After being directed to the correct
libraries, it compiles with few complaints on:
- Silicon Graphics/IRIX 5.3, 6.2 (R3000, R4X00, R5000, R10K) (cc)
- Sun SPARC/SunOS (4.1.3) (gcc)
- Sun SPARC/Solaris (gcc, cc)
- DEC Alpha/DEC Unix aka OSF/1 (V3.0/347) (cc, gcc)
- Convex C3480/ConvexOS (cc, gcc)
- HP-UX/Exemplar OS (single CPU of an SPP2000) (cc, gcc)
- NeXT boxes (Black)/NeXTSTEP (3.2) (except the UDP code) (gcc)
- Intel PCs running Linux (Slackware, RedHat, Caldera) (gcc)
- DEC Alpha Linux (gcc)
- MkLinux on PowerMacs (gcc)
- Probably anything with gcc.
Bewildering to me, on those systems where both the native commercial compiler
(cc) and gcc have been available, gcc has consistently outperformed cc.
YMMV. I think this is because almost all the core code is integer ops
and therefore the carefully coded math and floating point libs make little
difference. Stemming from this, tacg runs faster on an Intel P6 architecture
than on many RISC workstations of similar or even higher clock speed.
Speed and Capacity
The program uses a pre-calculated hashtable lookup of the restriction enzyme
recognition sites, so that only about half of the sequence is checked any
further than the initial hash (which itself is generated via the very efficient
shift/add DFA algorithm). Depending on the degeneracy of the input sequence, the kind
of output you request and the i/o of the machine, the program processes DNA at
this speed:
Speed* |
Hardware |
OS |
Compiler, flags |
68 |
i486/66 |
RH Linux 2.0.18 |
gcc -O2 -m486 |
140 |
R4000/100 Indigo2 |
IRIX 5.3 |
cc -O2 -mips2 |
165 |
early DEC Alpha |
OSF/1 |
gcc -O2 |
290 |
R4400/200 Indigo2 |
IRIX 5.3 |
cc -O2 -mips2 |
313 |
SPP/HP-PA |
SPP-UX 5.1/HP-UX 10.1 |
gcc -O2 |
300 |
PMac8500/PPC604/120 |
Mklinux 2.0.27 |
gcc -O2 |
330 |
DECAlphaStation/233 |
RH Linux 2.0.29 |
gcc -O2 |
450 |
Sparc Ultra1/??MHz |
Sun OS v5.5 |
gcc -O2 |
454 |
R10000/175 Indigo2 |
IRIX 6.2 |
cc -O2 -mips2 |
515 |
Intel PPro/200 |
RHLinux 2.0.27 |
gcc -O2 -486 |
* in Kb/sec, crudely adjusted for cpu usage via the `time` shell command,
using 215 enzymes, digesting the E coli genome (4.638 MB), for only the summary listing (-s).
Requesting the full analysis (-n4 -sSlc -F3 -g100) takes about twice this
time. The full Linear Map doesn't take much more CPU time, but considerably
more wall time due to the i/o (output is 10x input).
NB: This is about the slowest performance you can expect. Because
of the size of the input sequence, the program has to go thru several rounds
of memory allocation. Shorter sequences, more typical of cloning projects,
complete essentially immediately even on slow machines.
Here's a table of how fast the same analysis completes on various sequence
lengths (for a PPro/200/96M/PCI-SCSI and a 486/66/32M/VESA-SCSI):
Sequence Length |
Seconds(PPro) |
Bases/s(PPro) |
Seconds(486) |
Bases/s(486) |
100 |
.04 |
2.5k |
.28 |
360 |
1,000 |
.05 |
20k |
.3 |
3.3k |
10,000 |
.07 |
143k |
.4 |
24.4k |
100,000 |
.2 |
500k |
1.4 |
71.4k |
1,000,000 |
1.6 |
617k |
11.3 |
88.5k |
4,630,000 |
9.1 |
509k |
68 |
68k |
10,000,000 |
22.7 |
492k |
261 |
38k |
From earlier, crude analyses it appears to be about 7-35 times faster
than the equivalent routines in the GCG 'map' program.
In my experience, interacting with tacgi/tacg via it's web interface (on an
100MHz R4000 SGI) over 2 LAN hops is both faster and easier than using GCG's
native Xwindows interface (on a 200MHz R4400 SGI). There may be some subjective
feeling involved here, but objectively, it's faster by 10X (on a machine that's
less than 1/2 as fast :) ). GCG has recently been trumpeting that its code is now
parallel, allowing much faster operation. Putting the above in a similar
context, you'd have to buy ~20 CPUs to run your analyses with GCG's routines
compared to 1 if you ran them on tacg.
Memory Usage
tacg uses dynamic memory allocation for most data structures so that while
there are a few hard-coded limitations (mostly in output format), it easily
handles sequences into the millions of bases.
As a practical rule of thumb, the program needs about 2 MB of free memory
for itself, and then, depending on what results you've asked for, as much
as 3 times the input sequence length to hold all the intermediate results
before they are barfed to stdout. For the E Coli genome (4.7Mbases), tacg
memory usage tops out at about 13 MB, so on a 16 MB machine, you'll probably
be doing some swapping in this case. On a 32 MB Linux PC, I had no swapping
on the above 4.7 Mbase analysis, but quite a bit with 10MB (about 40MB
total required, according to 'top'). YMMV.
In the Web form, there is a cutoff of 500,000 characters (not
bases) as input; currently this includes extraneous characters such as
headers, numbers, spaces, etc, so if you think you are being unfairly denied
service, check how many bytes your file actually is. The command-line version
is limited only by your machine's RAM and your patience. (Actually, I think Linux's max
per-PID allocation is currently ~900MB...?
Feature Set
Inspired by [add mailto] Christian Marck's elegant DNA Strider, I tried for a
similiar output format, changing a few things I didn't like, adding a few
things I wished he had. I'm open to suggestions for enhancements, interface
changes, etc.
Kvetch or it won't change.
Sequence Input
tacg will take any ASCII data as input. If it isn't a recognized
format, it will assume that it's raw sequence and will process it accordingly.
You can also explicit tell it that all input is to be considered sequence with the
--raw flag.
If this is the case, it will ignore anything that isn't a,c,g, u, or t
(or IUPAC degeneracies - see below) and subject that sequence to analysis.
Because of this, numbering, spacing, line width, etc, of the input file
should not be a problem.
The Web interface allows you to upload entire files of sequence for
analysis. Recently, with the recruitment of Jim Knight's SEQIO,
Version 3 in both commandline and Web form allows most ASCII sequence formats
(see below) to
be read and processed without further modification or editing.
Input formats supported by SEQIO (and therefore tacg and WWWtacg):
IG/Stanford |
GenBank/GB |
NBRF |
EMBL |
GCG |
DNAStrider (ASCII) |
Fitch |
Pearson/Fasta |
Zuker (in-only) |
Olsen (in-only) |
Phylip3.2 |
Phylip |
Plain/Raw |
PIR/CODATA |
MSF |
ASN.1 |
PAUP/NEXUS |
|
However, it will NOT accept (at least reliably) binary file formats such as
those from DNA Strider, MacVector, NBI, GeneWorks, etc. However, most commercial
programs have an ASCII export function that should work. For the WWW interface,
anything that you can cut and paste into the window and looks like sequence ought to be
fine. As mentioned above, the --raw option skips SEQIO and so reverts tacg to
it's former aggressive behavior of considering EVERYTHING sequence.
IUPAC Degeneracies
IUPAC Degeneracies are those characters which denote uncertainty in the
sequence; most are familiar with y (pyrimidines - c or t)
and r (purines - a or g), but there are several others
that can be useful. The entire set is:
Base |
Name |
Bases Represented |
Complementary Base |
A |
Adenine |
A |
T |
T |
Thymidine |
T |
A |
U |
Uridine(RNA only) |
U |
A |
G |
Guanidine |
G |
C |
C |
Cytidine |
C |
G |
Y |
pYrimidine |
C T |
R |
R |
puRine |
A G |
Y |
S |
Strong(3Hbonds) |
G C |
W |
W |
Weak(2Hbonds) |
A T |
S |
K |
Keto |
T/U G |
M |
M |
aMino |
A C |
K |
B |
not A |
C G T |
V |
D |
not C |
A G T |
H |
H |
not G |
A C T |
D |
V |
not T/U |
A C G |
B |
N |
Unknown |
A C G T |
N |
Handling Degeneracy
tacg V2 was modified to handle degenerate input sequence as well as the
previous gatc-only sequences. Version 3 has the same matching engine as
Version 2 which matches
sequence using the fast, hexamer hashtable look-ahead algorithm that DNA
Strider uses (my implementation). When it hits a degeneracy in the leading
hexamer window, it switches to a more extensive (but slower)
matching approach until the degeneracy is cleared, whereupon it
switches back to the faster approach. This allows tacg to accurately search
sequence
that is only slightly degenerate (such as many genomic sequences) essentially
as fast as nondegenerate sequence.
Although tacg behaves as above in the default mode, it can be forced into
several other modes if desired. Truth be told, I'm not aware of ANYONE
besides me actually using the other modes, tho they are formally useful.
See the explanation of the
-D options in the man page.
Enzyme Selection
Restriction Enzymes can be selected either by:
Explicit pick from all the active entries from the latest REBASE.
Enzymes can be picked explicitly via the
-x flag
(-x NameA,NameB,NameC... where NameX is the case INsensitive name of the Enzyme or
pattern you wish to use in the GCG-formatted REBASE file, either 'rebase.data' or
another specified by the
-R flag (-R alt.rebase.file).)
or
Selected recursively by characteristic, based on:
Magnitude of recognition sequence
-n
The 'magnitude' of the recognition sequence depends on the number of
defined bases that make up the site. Degenerate bases can also contribute:
acgt each count '1' magnitude point
yrwsmk each count '1/2' magnitude point
bdhu each count '1/4' magnitude point
n doesn't count at all
(tgca=4, tgyrca=5, tgcnnngca=6, etc)
Overhang of resulting ends (5', 3', blunt)
-o
This refers to the stretch of unpaired bases at the border of the cut
created by offset breaks in the phosphate backbone. Enzymes can leave:
5' overhangs - ie. BamHI(G'GATC_C)
v BamHI
5'...tagG GATC_Ccga...3' -> 5'...tagG GATCCcga...3'
3'...atcC CTAG Ggct...5' -> 3'...atcCCTAG Ggct...5'
^
3' overhangs - ie. BbeI(G_GCGC'C)
v BbeI
5'...tagG_GCGC'Ccga...3' -> 5'...tagGGCGC Ccga...3'
3'...atcC CGCG Ggct...5' -> 3'...atcC CGCGGgct...5'
^
Blunt (no overhang) ... You get the idea...
Minimum, Maximum times they hit the sequence
-m, -M
This is pretty self explanatory, but..
'Minimum' indicates the minimum # of times a pattern HAS
TO match before it's included in the
output (the default is 1).
'Maximum' indicates the maximum # of times an enzyme CAN
match before it's excluded from the output (no
default maximum).
Alternative REBASE database files to use
-R
This option allows you to specify an alternative REBASE file to use
in the matching, perhaps edited to contain only cheap enzymes or otherwise
customized for your lab/site. It starts at 300 entries, which is
enough for most Restriction Enzyme databases, such as REBASE, but if you
format TRANSFAC or TFD into REBASE form, it will easily process them (about
3000 entries).
Linear Restriction Map
-L
- produces linear restriction maps, with enzyme names nested compactly, readably.
- is routinely tested to more 4.6 Mbases, has been tested to 70 Mbases.
- handles linear/circular topologies
-f, subsequences
-b, -e.
- shows EXACT cutting position (not just the start of the recognition
sequence - minor quibble with Strider and other programs)
- can make the Linear Map even more compact by eliminating the second strand
--strands or the tic marks
--notics
- can show co-linear translation
- in 1/3/6 frames, in 1 or 3 letter codes
-T
- using one of several Codon Usage Tables
-C
- can scale output (reasonably intelligently) in widths up to ~230 characters
-w
- can re-number the sequence arbitrarily, negative numbers as well as positives
--numstart
- shows numeric indices for both the whole sequence (4635, below) and the subsequence (181, below) if specified.
Bsu36I
Sau3AI
DpnI
AlwI NlaIV
BstYI Tsp4CI
BbvI BamHI AlwI MseI MaeIII SfaNI
\ \ \ \ \ \ \ \ \\
181 gagaaacaacaatggatcctaaggttagaacattgttaaaagttactgttgaagatgctt 240
4635 ctctttgttgttacctaggattccaatcttgtaacaattttcaatgacaacttctacgaa 4694
^ * ^ * ^ * ^ * ^ * ^ *
E K Q Q W I L R L E H C Z K L L L K M L
Plasmid maps can be generated by using these flags. The --ps flag causes it to be
generated in the original Postscript. The --pdf flag invokes the --ps
option silently and then ghostscript to convert
the postscript to be converted to Adobe's Portable Document Format (aka pdf). The plasmid
map shows a variant of the standard
circular plasmid map, with Restriction Enzyme hits marked around the periphery. The
algorithm that lays out the distribution of RE names is sub-optimal, but works OK for
small numbers of hits. You can generate some interesting spirograph-like effects by
forcing large numbers of hits to be plotted. In addition, it plots any degeneracy as
shaded marks around the circumference. You can also add any ORFs from any 6 frames as
radial lines by using the -O flag to find ORFs
larger than the minimum ORF size you specify. For example, to request a map in PDF format
with ORFs larger than 50 Amino acids from frames 1, 3, and 5, the appro flags would be:
--pdf -O135,50
Simple ORF Analysis
-O
tacg can search for Open Reading Frames (ORFs) in any of the 6 frames with
a user-specified minimum ORF size
-O.
It is not a sophisticated search algorithm;
the rules are:
- It assumes that the sequence starts in a ORF - this may cause ORFs to be
reported without starting M's and may lead to short ORFs being
reported in short input sequences. (If it has not hit a STOP before it
gets to the end of the input sequence, it will report the cruft that it
assumed was an aleady started ORF.
- If not already in an ORF, a Met codon starts an ORF and a
STOP codon ends it.
- If the ORF is smaller than the specified size when it hits a STOP, it's
discarded.
- Internal Met's are not noted in any other way, as they are in some other
apps such as Lasergene's Protean (good graphic design there), although
this is an easy addition.
- If there is an unidentified AA (because of degeneracies in the DNA) the
AA is translated as an 'X' and given the averaged MolWt (118.89 Da), but
the ORF is allowed to continue (see output below).
Very little analysis is done (still..). The beginning and ending Offsets
in both base pairs and Amino Acids are reported so that it can be found on
a linear map (offsets for frames in the opposite strand are calculated
so that they refer to the current top strand). The size in both Amino Acids
and KDa is calculated, as is the pI. Perhaps most usefully, the results
are writ to stdout in FASTA format, with the stats as the ID line
so that it can be examined by a more sophisticated pattern matcher
such as grep, agrep, perl, etc (see Using tacg with
Other Programs
The output of this analysis can be in either a single line (by specifying
-w 1) in the case where you're concerned that the external pattern matcher
won't be able to pick up a pattern split by a linefeed) or wrapped at the
regular width.
Programming Note: The data structure that holds the ORF information
is dynamically allocated so that all the ORF info is available until the
end of the program. As well this allows very large sequences to be scanned
in one sweep - my aging 486/66/32M will process the E coli genome, searching
all the ORFs for a degenerate 10 AA motif with 'agrep' in under 5 minutes,
but this means that it can suck up QUITE a lot of memory.
Here is some sample output:
% ./tacg -O156,60 < dpp.mel.seq
== ORF Analysis for Frame 1: 7 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 1 1 2581 / 860 2922 / 974 114 13155.40
MWTVLWHAENFMTSFADHQASGAEAPNKKHSPRLAHNRPRFCLLMASLERFSGCNTRNFVQRQHNSPKDSLNFDQLSDWVAAAPQHHRNWLAEKLELHPLLNGPIRNPQDFTDG
> 1 2 3091 / 1030 3282 / 1094 64 7107.97
MLNKNDKTQPRIVICNLCTCTLNHLVAPVLLLDAPPAPILFMPLSVAHCCCCCHTHTHIRTTQK
> 1 3 5929 / 1976 6216 / 2072 96 9692.80
MEVDAGSGMLGAGQVGYGLGWVLAPLRFPGHADAADGARGATTPADKAVRVQLPTPAREASAQATKLAKMQRDQGMSGGVLARMGLRLGLCGINCA
> 1 4 7522 / 2507 7704 / 2568 61 7430.89
MYIYLFFACVQYQTLMFNKSGKNLYCAYKNHQNHKGYALPYEPTYLTRSQQTLHFESIHWV
> 1 5 7870 / 2623 8160 / 2720 97 10690.71
MIYVWDVGECVSVCWRRLGGWGAERADRAGRSDPAESSSKTTTAAAIMIHSRNVGNGCLTILFFFSIFGSRRRRSCAGYEIHTTQILTHTATVAFTT
> 1 6 11071 / 3690 11277 / 3759 69 7770.53
MWNKLARASQRFSRSFFSSLRATISAALFRGQRMMGTLPRVTNRAKSESTMESIRASISSSRVLARLMA
> 1 7 12406 / 4135 12768 / 4256 121 14055.17
MILKEEHPHQSIETAANAARQAQVRWRMAHLKALSRTRTPAHGNCCGRVVSKNHFFKHSRAFLWFLLCNLVMNADAFAHSQLLINVQNQVRNRKKKWFKTLKSQQRKTQQRQTATLAALPI
== ORF Analysis for Frame 5: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 5 1 11241 / 3747 10885 / 3627 120 13410.25
MEALMDSIVDSLFALFVTLGNVPIIRCPRNSAAEMVARKLEKKLRENLWDARANLFHMDATQAGGGVFSFQRPVLLLLDRNMDLATPLHHTWSYQALVHDVLDLGLNLVYVEDETASAGK
> 5 2 8007 / 2669 7708 / 2568 101 10973.69
MAAAVVVLLDDSAGSDRPARSARSAPQPPNLRQHTDTHSPTSHTYIIILTTLNKAEQIPKFEEPDSNSQSGSRTIPQNEVPARQRRRSSDTASALALGWDL
> 5 3 2304 / 768 2113 / 703 65 7230.86
MLAQHQIEIKVYPPFRSIPCYVCFLPGQNNSALFWALGATQKVMIKLCGDPVEGTRLRGYPSTSS
> 5 4 1734 / 578 1492 / 496 82 9039.19
MPISWHKKMRQPCPHSPAITEGTGLCIIFNEDICQQKCSGVLFMAFKAKPIFITTATAGQRKTTRDAVVVVVVLIIDTSYAN
> 5 5 762 / 254 583 / 193 61 6683.20
MLCPSILPFTLKHKFMLSVHFFQCTTKLTPCGAESATRRRSAKGKGAHGQGTGCATDWLTR
== ORF Analysis for Frame 6: 5 ORF(s) > 60 AAs
F# ORF# Begin(bp/AAs) End(bp/AAs) #AAs MWt(KDa)
> 6 1 11181 / 3727 10997 / 3664 63 7224.82
MCPSSAVRGTVPLKWWHASWRRSCARTFGMRVPICSTWMPRKPAAECSAFRDLCSCCWTGTWI
> 6 2 10764 / 3588 9728 / 3241 347 39136.07
MTHKGSPFPTVAEAIQEELESYRNSEEEIKRLKTSMGIEGESDIAFSLVNDTTARLTNAVNSLPQLMEKKRLIDMHTKIATAILNFIKARRLDSFFEIEEKVMSKQTLDRPLLDLLRDGEFGQAEDKLRLYI
IYFICAQQLPESEQERLKEALQAAGCDLTALAYVQRWKGIMNRSPSISQATQYEGGGTKTVSMFTKLVSQGSSFVMEGVKNLVVKRHVSFSFQIQIQVLYQNCLQNLPVTKITEQVMECRSNAETDDYLYLD
PKLLKGGEVLPKNRAPFQDAVVFMVGGGNYIEYQNLVDFIKQKQTSNVQRRIIYGASTLTNARQFLKELSALGGEIQSPTATS
> 6 3 6774 / 2258 6530 / 2175 83 9280.03
MQGNCNRLAGEWTQKGQPRMREDAALPARNPTFLGLHYPGLGRFFVRPTAPNGHQQTAANSFIKHKQIPWTCSALTPHAAFEH
> 6 4 5901 / 1967 5687 / 1894 73 8163.18
MVLFLRVVCFSPVDNKVQPTCGYRTDRFRVLVALVLGASRWLTLRANDAKQQIKGAIRKIAARDVLLNCSSKA
> 6 5 1647 / 549 1310 / 435 114 13523.17
MKTSVSKNVRAFCLWHLRQNRFLLPQRQQGKGKPHATLSSLSSSSSLTHRMQIKSVSQPDHKVAAPDRRICRFQCWERGDPWRFRSRIPDTDTETRLEVKRTRPIRRDLWHESS
==-------------------------- End of Analysis --------------------------==
Tabular Output
All these output formats, as well as the Linear Map and Summaries are affected
by the width parameter
-w;
if you can manage the wider output,
it's more efficient and easier to view. The -w flag has a special
case besides the normal. If you specify -w1, it will set the
output to be 10,000,000 characters wide, assuring that all output for a
particular stanza will be in 1 line. This is sometimes useful for
post-processing data.
Summary of All Cuts
-s
This is no longer printed by default; you have to explicitly request all
output. This was to make tacg more amenable to inclusion into other programs
as a shell command - see Using tacg with Other Programs.
This output includes info on EVERY active (uncommented) enzyme/pattern
in the rebase file that has passed all the filtering flags (-n, -o, -m/M).
Restriction Enzymes that DO NOT CUT in this sequence:
BbeI EheI FseI KasI NarI NheI NotI
PacI PaeR7I SalI SfiI SpeI SwaI XhoI
Total Number of Cuts per Restriction Enzyme:
AatII 5 BsiYI 130 EcoNI 5 MluI 7 SalI 0
AccI 5 BsmI 30 EcoO109I 2 MmeI 8 SapI 7
AflII 2 BsmAI 26 EcoRI 3 MnlI 184 SauI 1
AflIII 13 Bsp120I 1 EcoRII 49 MscI 17 Sau96I 61
AgeI 12 Bsp1286I 26 EcoRV 14 MseI 106 ScaI 4
AluI 89 BspEI 22 EheI 0 MspI 278 ScrFI 145
[remainder omitted]
Tables of Cutting Sites
-S
(for enzymes that pass the filtering flags)
All of the following output formats (except the Infrequent Cutters
Map) can have the REs listed by order of their names in the database or
sorted
by # of cuts and thence alphabetically, like Strider, using the
-c flag.
In the next example output, the first example is sorted by 'order in
database', the second is sorted by '# cuts, then alphabetically'.
sorted by 'order in database'
== Cut Sites by Restriction Enzyme
AatII G_ACGT'C - 2 Cut(s)
1118 12143
AccI GT'mk_AC - 16 Cut(s)
632 3008 3073 3079 5393 5816 6881 7974 8815 9978 11962
14136 16857 18087 20067 20222
AceIII CAGCTCnnnnnnn'nnnn_ - 8 Cut(s)
2127 5417 11681 11750 15151 17252 17584 21795
AciI C'CG_C - 48 Cut(s)
567 850 1512 2455 2931 3300 3339 3490 4842 5468 5600
5907 6174 6318 7533 8123 8807 9086 9745 10482 10805 11626
11669 11896 12050 12214 12404 12417 12675 13337 13646 14217 14881
15949 16013 16668 17060 17089 17101 17132 18035 18140 18820 18896
19781 21146 21536 21958
AflII C'TTAA_G - 5 Cut(s)
1292 4185 11172 18334 20663
[remainder omitted]
sorted by '# cuts, then alphabetically'
-c flag
== Cut Sites by Restriction Enzyme
ApaI G_GGCC'C - 1 Cut(s)
21979
ApaLI G'TGCA_C - 1 Cut(s)
10676
Bpu10I CC'TnA_GC - 1 Cut(s)
844
Bpu1102I GC'TnA_GC - 1 Cut(s)
13369
NarI GG'CG_CC - 1 Cut(s)
16014
NgoAIV G'CCGG_C - 1 Cut(s)
15606
[remainder omitted]
Tables of Fragment Sizes
-F
(unsorted by fragment size, sorted or both; 'vertical' ordering can also
be sorted with '-c' as in the above example.
** SORTED Fragment Sizes by Restriction Enzyme **
AatII G_ACGT'C - 5 Fragment(s)
1849 3449 3731 4289 5110 14062
AccI GT'mk_AC - 5 Fragment(s)
639 1187 2192 3574 11828 13070
AflII C'TTAA_G - 2 Fragment(s)
6078 6541 19871
AflIII A'CryG_T - 13 Fragment(s)
35 170 459 493 956 1268 1712 1913 2360 2419 4091
4920 5733 5961
[remainder omitted]
Ladder Map
-l (el, not one)
-
overhangs indicated: 5'(\), 3'(/) blunt(|) cutters
-
if more than 1 cut maps to a space, the number of cuts is shown numerically
(see AluI, below)
-
total # of cuts indicated at end of line).
-
width-sensitive, so you can increase accuracy by choosing a wider output.
See also the complete
output of the lambda analysis.
Ladder Map of Restriction Enzyme Cut Sites:
900 1800 2700 3600 4500 5400 6300 7200 8100 9000
: : : : : : : : : :
AccI ----------------\-----------------\-------------------- 2
AciI -----------------\-----------\--\---\\------\--\\------ 8
AflII ---------\-------------\---------\--\----\---\--------- 6
AhdI ---------------------------/--------------------------- 1
: : : : : : : : : :
AluI --|---|||--23||||2||-|2--2---|2|232||-|-|22--2-|3|-||-- 50
AlwI ------2---\-----------\--2-------------\2-------------- 9
Alw26I ----------------------------------------------------\-- 1
AlwNI -----------------------/--------------/--/------------- 3
: : : : : : : : : :
Infrequent Cutters Map
part of the -l flag
(a la Strider) of enzymes that cut less than X (a small number of times)
- (In version 2/3 this has been changed to be length-sensitive; X = 2 until
the sequence increases to beyond 10,000, then X increases proportional
to the log of the increased size). This output option is not called explicitly,
but is appended to the Ladder map (-l).
It should be obvious, but each label is of the form:
Name@Cut_Site
== Summary of Enzymes that cut ** 2 ** times or less:
XcmI@9286
XcmI@2466 UbaEI@9054
Sse8647I@1801 SexAI@9288 UbaDI@13309
SmaI@2882 PstI@10119 PstI@15573
Psp5II@1801 Pfl1108I@10898 SgrAI@16556
NruI@2248 HgiEII@10152 PflMI@18025
NcoI@2848 SexAI@8621 EcoNI@12615 NruI@17146
BstXI@2962 EcoNI@10850 NgoAIV@15606 PmlI@20299
BstEII@4076 EagI@12054 NarI@16014 Pfl1108I@20364
BstEII@1183 Bsu36I@10980 HgiEII@18039 MluI@22230
BstDSI@2848 NcoI@7724 Bpu1102I@13369 PacI@18544 MluI@22201
Bpu10I@844 PshAI@5399 ApaLI@10676 EagI@16150 BstXI@20204
AatII@1118 BstDSI@7724 AatII@12143 Bsu36I@16688 ApaI@21979
||| | | | | | ||| | || | | | || || | | | ||
-------------------------------------------------------------------
: : : : : : : : : : :
2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000
Pseudo Gel Map
-g
- This option shows approximately how different digests would look if run
on an agarose gel. Turn the page 90°, blur your eyes slightly, and
it'll look like a massive agarose gel... really. Useful for selecting
restriction enzymes for best separation of fragments...
- It uses a straight log10() approximation, but the addition of math to
simulate different percentages of agarose/polyacryamide would be trivial.
If anyone has a favorite approximation, I'd be happy to include it.
- It uses the same representation as the ladder map,
with single fragments represented as '|', multiple fragments that cannot
be resolved as a digit showing how many map to that space.
- the flag values set the LOW and HIGH CUTOFFs - fragments smaller or
larger than this window are excluded from the simulation. The value must
be an integer multiple of 10 (10, 100, 1000, etc).
Below is a section of gel output generated with the width set to '-w75';
the sequence length was 22380 bp (so the gel is automatically extended
to 30 Kb).
From examining the UDP data so far, this is the option that is used
incorrectly most of the time. To obtain the results below (a 10 bp cutoff),
it should be used thusly:
tacg -g10 [other flags] < input.file
If you wanted a 1000 bp cutoff, change it to:
tacg -g1000 [other flags] < input.file
== Pseudo-Gel Map of Digestions:
10 100 1000 10000
. . . . . . .... . . . . . .... . . . . . .... . .
AatII | || 2
AccI | | | | | | |2| 222 | 16
AceIII | | | 2 || | | 8
AciI || || || | | | |22 || 33||4 |227||3| |2 48
. . . . . . .... . . . . . .... . . . . . .... . .
AflII | | | | 2 5
AflIII | | | | | | 2 2 | | 11
AhdI 3 | | | | | 7
AluI | 3 ||3 3|222 |||2223 3334|2352|2|42||| 2|2 74
. . . . . . .... . . . . . .... . . . . . .... . .
AlwI 5 | || | ||| ||||2|| | 3| || 26
Alw26I | | 2 | | | |2|| 3 32||3| |4|2 | | 35
AlwNI | | | 2 | | | 7
ApaI | 1
[rest deleted]
--rule allows you to specify arbitrarily complex logical associations of
small motifs to detect the patterns that interest you. Here's an example:
Say you wanted to search for an enhancer that you suspected might be involved in the
transcriptional regulation of a pituitary-specific gene. You knew that you were looking
for a sequence about 1000 bp long in which there were at least 2 Pit1 sites and 3-5
Estrogen response elements, but NO TATAA boxes. If you had defined these patterns in a
file called pit.specific as:
Pit1 0 WWTATNCATW 0 1 ! Pit1 site w/ 1 error
ERE 0 GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
TATAA 0 tataawwww 0 0 ! TATAA site, no errors allowed
you could specify this search by:
tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)' -R pit.specific <
input_sequence
This query searches a sliding window of 1000 bps (-W 1000) for ((2-7 Pit1 AND 3-5 ERE
sites) AND (0 TATAA sites)). These combinations can be as large as your OS allows
your command-line to be with arbitraily complex relations represented with
logical AND (&), OR (|), and XOR (^) as conjunctions. Parens enforce
groupings; otherwise it's evaluated left to right.
After rules are evaluated from the commandline, if the rule parses
correctly, the rule is written to the tacg.patterns file (created in the current directory
if it doesn't exist) so that it can be incorporated into a FILE of such rules which can
then be evaluated en masse using the --rulefile flag. This option, followed by a
path to the file, allows all the uncommented rules in the file to be evaluated at once.
The format for the rules in both the tacg.patterns and the rulefile is as follows:
free-format comments
free-format comments down to the '..' separater between header and body.
..
;comments below the '..' must have a ';' character as the 1st character of the line.
;Name , Rule , sliding window - Note that they must be separated by a ','
TestRule1 , (MwoI:1:8 | NaeI:1:8 | NarI:1:8) , 866
;rules can be split over multiple lines with a '\' as the last (continuation) character
TestRule2 , ((MwoI:1:8 | NaeI:1:8 | NarI:1:8) ^ ( NciI:1:8 & NcoI:1:8) | \
(NdeI:1:8 | (NgoMIV:1:8 & NheI:1:8))),866
Cloning functions
--clone
The --clone flag allows you to specify regions of the sequence which can or cannot be
cleaved by the REs selected by the other RE selection criteria (-o, -n, --cost, -m,
-M, etc). You can specify up to the compiled-in limit of 15 criteria at once in the form
of: --clone #_#,#x#,#x#,#_#... where # is an integer
indicating the sequence offset and each #_# or #x# is a criterium.
#_# indicates a range that MAY
NOT be cleaved; #x# indicates a range that MUST BE cleaved.
The output indicates which criteria are met, with any combinations that perfectly meet all
the criteria reported 1st.
Pattern Matching
-p
The pattern matching used in tacg has been extended to a more general approach
for nucleic acids so that it is applicable for searching for degenerate
patterns with a specified number of errors (in possibly degenerate sequences).
Pattern Matching with Errors
The original tacg allowed searching for nucleic acid patterns with the
standard IUPAC degeneracies, but V2/3 allows you to extend this capability
by allowing searching with mismatch errors.
Quite often when you are searching for binding sites for transcription
patterns, you wish to search for not only degenerate sites but degenerate
sites with 'a bit of slop'. You can use tacg to do this in 2 ways, using
the
-p flag:
Say you had defined a site (called NPR) on a promoter by footprinting
as 'gtcagggcgaat' and you wanted to search another implicated region for
this sequence but you wanted to allow 2 errors to be allowed, so that you
could find a sequence that would, for example match as follows:
...gtcagggcgaat...
x x
...gtcggggcgatt...
An appropriate flag/option sequence to obtain such a match would be:
tacg -pNPR,gtcagggcgaat,2 -Sl -w75 <input.file |less
Translation: search input.file for the NPR sequence gtcagggcgaat,
allowing at most 2 errors (-pNPR,gtcagggcgaat,2),
returning the
Site data and a ladder map (-Sl) formatted
75 characters wide (-w75) and pipe the results directly to
the pager program 'less' (|less)
NB: In versions < 3.5, the error term can be omitted if you want perfect matches (O mismatches
allowed). -pNPR,gtcagggcgaat indicates that no mismatches are allowed.
Additionally, when the -p flag is invoked, it also sets the -S flag so that tacg now
behaves more like a grep.
or you can specify that the sequence is
always
supposed to be searched 'with slop' via the addition of a single character
in the standard GCG-formatted
REBASE
file entry:
Here is the standard GCG REBASE format line:
NPR 6 gtcagggcgaat 0 !NPR is searched with 0 errors
Here is my hack to the standard GCG format line:
NPR 6 gtcagggcgaat 0 2 !NPR is always searched with 2 errors
You can also add more info to the REBASE file as described
here.
The algorithm to do this is reasonably brute force: an 'n' scan of the
sequence is done, replacing every base of the sequence with an 'n'. If
there is more than 1 error allowed, then each successive generation is
'n'-scanned. At each iteration, each generated pattern is compared with
the previous ones so that only novel patterns are actually stored. This
approach dramatically decreases the number of patterns searched for, but
can consume substantial CPU resources in the validation for long patterns
with many errors allowed. There's a tweak to the algorithm that will also
substantially decrease this time, but I haven't implemented it yet. I'll
see if anyone uses it 1st.
Udi Manber's very fast, elegant, bitmasked
agrep
could also be applied to this approach, but I haven't spent the time to
do so yet.
Proximity Matching
-P
This is another option designed specifically for Transcription Factor (TF)
analyses. I apologize that the commandline is particularly horrible, but
In many cases, you are
interested in a particular juxtaposition of 2 or more TFs; this option
allows you to specify up to 10 sets of relationships between TFs and extract
only those which match your criteria. The output has been organized so
that you can see both the raw data (All) and
the filtered data (Matched) in the same output
stanza (below):
== Proximity Matches:
lef(1E)/IHF(0E) * 1 matches * found from search parameters: lef,200,IHF
57309/57116
== Ladder Map of Enzyme Sites:
# hits
10000 20000 30000 40000 50000 60000 70000 80000 90000100000
: : : : : : : : : :
lef |--|-|-|463222222||||25|2|243---|324|222222--32223|-||3-3|-- 98 (All)
lef ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------------|------------------------- 1 (Matched)
IHF ----------------------------|-----|------------------------- 2 (All)
: : : : : : : : : :
Pattern Match Data Output by 'Bins' NEW
-G
This option
was included to allow the results of the matching to be exported to other
programs for further analysis. Especially where there is a high number
of matches, it may be of more use to plot or analyze these data using an
external program, whether it be a spreadsheet, plotting app, or custom-writ
program. I don't presume to know what you might want to do with the data
and so to make tacg play nicely with other apps in the best unix tradition,
I've tried to address some of the simpler data output formats that might
be of most use. Output is possible in 2 rectangular formats (X and
Y;
one is the diagonal flip of the other) as well as in a long form (L)that
may be better suited for simple, custom-writ programs that just want to
read x,y pairs. The very slick gnuplot
can read the multiple columns of the Y form with only a very small
amount of editing or filtering of the output. An example of gnuplot's processing
of some tacg-generated data is shown as well. See also the gnuplot
stanza in the section below.
Using tacg with Other Programs
tacg is obviously not the only molbio program you need to do your work.
There are many analyses tacg just cannot do; to that end, it at least regurgitates
your data in a re-digestible form so that you can masticate it further
elsewhere. If you can program a bit, for quick and dirty programs you probably
can't do better than perl (but if you
can program a bit, you already know that). If you can't program, importing
your data into a spreadsheet and manipulating it cell by cell is certainly
viable with the sophistication of modern spreadsheets.
In terms of working with your data in the context of tacg, there are
lots of other programs that make input easier, and output more readable/browsable.
Here are some that I use on a regular basis:
-
NEdit A
well-designed, actively supported, free GUI editor for Xwindows on unix
and VMS. Besides its own features, NEdit allows you to seamlessly pipe
selected text to external programs (...like tacg) and display the output
in a NEdit window, and even add menu items of external programs (...like
tacg), effectively making the NEdit/tacg combination a decent GUI biosequence
analysis tool.
I've included a few of my own shell commands and other NEdit settings
in the dotnedit
file accompanying the distribution. Either copy your .nedit
to a safe place and start NEdit with mine, or edit the tasty bits of mine
into yours. If you're just browsing,
here's my .nedit.
And incidentally,
here's another from Thomas Siegmund
which allows biosequence coloration inside Nedit. He has also developed some very neat
Kaptain scripts to wrap many of the
EMBOSS applications (some
examples shown here.
While the NEdit/tacg combo doesn't allow quite the same flexibility
as Strider, it does allow you to use an exceptional editor to modify or
annotate your sequence in a very natural way and also allows you to submit
parts of that sequence for some of the usual analyses. On slow machines,
(~486/66) there is a perceptible pause while the data is passed to tacg
and tacg's output is redirected back into nedit, but the window does appear;
at least you don't have to go searching for it as you usually do in GCG.
-
less
My favorite pager - slick design, fast, memory miser. Like joe, it can
pass selected text (but by lines only) to an external app with the '|from
mark X to here' command.
Example:
-
go to the start of the selected text
-
hit 'm', then a key to mark the current position (say 'a')
-
move to the end of the text
-
hit '|', then the same indicator key as above ('a'), then the command you
wish to pipe it to (|a tacg -n4 -sSl)
-
readseq
The standard for biological sequence format conversion, from that biocomputing
code wiz Don Gilbert. You can quibble with the interface a bit, but it's
a lifesaver. If only GCG could convert sequence like readseq...
-
gnuplot
gnuplot is a another of those many unix utilities that seems crude and
clunky until you play with it a while. And after a bit, things fall into
place and you wonder how you got along without it. Using tacg's -G
output with Y format, it is easy to plot the distribution of A/T
rich regions (using a pattern entry called 'at-rich') in the E coli genome,
along with the distribution of an IHF site (pattern called 'ihf') allowing
1 error. While they should both be further reduced with a sliding window,
you can see that it is reasonably easy to visualize primary data from tacg.
Example:
-
agrep
A really useful tool that works very much like the other members of the
'grep' family but with a few notable differences. It's VERY fast, and like
tacg, you can use it to search for patterns with errors. I was originally
going to use it as the basis for tacg's searching with errors, but by the
time I started to understand the bit operations that it uses, I had already
implemented a workable scheme using the code I had already written, which
was more applicable to the kind of output I wanted. I might switch later
- it's a very cool algorithm (underlies Jim Knight's grepseq as well as
the Glimpse and Harvest search tools). The technical papers describing
it (in the same FTP tree) are models of readability.
-
Genscript
(Gnu enscript) This is an ASCII to postscript converter that allows
you to scale and rotate text on the printed page. For many tacg analyses,
it's convenient to have very wide output and genscript allows you to squeeze
the text you want in the space you have.
-
Alpha
Not a unix tool at all, but a hair-raisingly, spine-tinglingly good shareware
text editor for the Mac from Pete Keleher. Often compared with the commercial
BBedit. The reason I use it is because it works the way I work - almost
everything I'd like in an editor is there and if it isn't, Alpha is mostly
written in tcl so that it can be extended to do whatever you want. If I
was going to try a native Mac port of tacg, I'd try it 1st as an add-in
to Alpha, similar to the shell commands added to NEdit, described
above.
Using tacg to Scan Databases
Thanks to the SEQIO lib, tacg now handles this just fine. Another
thing that would be very handy for database handling is the ability to
deal transparently with gzipped/compressed data; this is available via
the zlib library.
The Odd one: UDP Reporting
tacg was also designed to track it's own use
and spread - something in which I'm also interested.
However, with version 3.x, this has ceased to be the default. You can
still explicitly send back info with the
-Q flag
(and I'd be grateful), but it annoyed a
few people and I think it has fulfilled it's usefulness.
The binaries have been compiled with code (udping.c) that spits a small
amount of data back to me at each usage IF YOU EXPLICITLY ALLOW IT TO.
The data that gets sent back is:
- the IP number of the hosting machine
- the cpu type
- Operating System and Version # (output of 'uname -a')
- what flags were used in calling the program
- the number of bases processed.
It does NOT return user names or actual sequence.
An example of the data that is returned:
Returned: [hw=IP22 os=IRIX osver=5.3]
[TACG Version 1.5F] /usr/local/bin/tacg -n5 -o5 -F2 -slL -g100 <100000
bp>
...And The Small Print Taketh Away...
COPYRIGHT NOTICE
Copyright (c) 1994-2001 by Harry Mangalam, tacg Informatics.
hjm@tacgi.com, 949 856 2847
Permission to use, copy, modify, and distribute this software and its
documentation is hereby granted, subject to the following restrictions
and understandings:
- Any copy of this software or any copy of software derived from it must
include this copyright notice in full.
- All materials or software developed as a consequence of the use of this
software or software derived from it must duly acknowledge such use, in
accordance with the usual standards of acknowledging credit in academic
research.
- This software is licensed under the
Free Software Foundation's General Public which is available by clicking the preceding
link or at http://www.gnu.org/copyleft/gpl.html
and is also included in the source code distribution in the file COPYING
- If the GPL is unacceptable to you or your organization, you may also contact
the author to negotiate a proprietary license under which you are relieved of the terms of the
GPL.
- This software is provided AS IS with no warranties of any kind. The author
shall have no liability with respect to the infringement of copyrights,
trade secrets or any patents by this software or any part thereof. In no
event will the author be liable for any lost revenue or profits or other
special, indirect and consequential damages. Period.