tacg (1) Version 4.3 tacg (1)

NAME

tacg - takes input from stdin, automagically translates most standard ASCII formats of Nucleic Acid (NA) sequence, then analyses that sequence for restriction enzyme (RE) sites and other NA motifs such as Transcription Factor (TF) binding sites, matrix matches, and regular expressions, finally writing analyses to stdout. It also can translate the NA input to protein in any frame, using a number of Codon translations tables, and print Open Reading Frames (ORFs), pseudo-graphics ORF and MET/STOP maps as well as perform many other analyses.

SYNOPSIS

tacg -flag [option] --flag [option] ... <input.file >output.file
tacg takes input from stdin (| or <); spits output to screen (default), >file, | next command

In the following summary, note that versions > 3 include 'long options', which are preceded by 2 dashes (--longopt).

[-c h H l L q Q s S v] [-b begin] [-e end] [--clone #_#,#x#..] [-C 0-16] [--cost units/$] [--dam] [--dcm] [-D 0-4] [--example] [-f 0|1] [-F 0-3] [-g LoCutOff(,HiCutOff)] [-G bin_size,X|Y|L] [-i (--idonly) 0-2] [-m min_hits] [-M Max_hits] [-n 3-8] [--notics] [--numstart] [-o0|1|3|5] [-O 1-6(x),minORF] [--orfmap] [-p Name,Pattern,Err] [-P NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB] [--ps] [--pdf] [--logdegens] [-r (--regex) 'Label:RegexPat' || 'FILE:FileOfRegexPatterns'] [-R alt_Rebase | alt_Matrix] [--raw] [--rev] [--comp] [--revcomp] [--rules 'NameA:min:Max[&|]NameB:min:Max[&|]..'] [--rulefile /path/to/rulefile] [--silent] [--strands] [--tmppath /path/to/tmp/dir] [-T 0|1|3|6,1|3] [-w 1|width] [-V 1-3] [-W #] [-x (--explicit) 'NameA(,=),NameB..(,C)'] [-X (--eXtract) b,e,[0|1]] [-# %_Match_Cutoff]

NB: Most flags are the same as in earlier versions with the exception of these changes:

and these additions:


DESCRIPTION

tacg searches the sequence(s) read from stdin (or using the --infile option) for matches based on descriptions stored in a database of patterns (default is rebase.data, in normal or extended GCG format). It can read either explicit sequences, possibly containing IUPAC degeneracies, or matrix descriptions (default is matrix.data, in TRANSFAC format), and, based on matches and options entered on the command line, sends ALL output to stdout. Since SEQIO can read collections of sequences, tacg inherits that ability, and can now apply its analyses over multiple sequence files.

Unless requested (by -V1-3), it no longer sends errors to stderr (except failure errors) and it no longer emits default output (except for one special case; see -p); you have to request all output. Most of the internals use dynamic memory so there are few limits on sequence input size and pattern number. I've generated >6000 patterns and searched 230Mb of input sequence. It's ~ 5-35x faster than the comparable routines in the GCG pkg and being written in ANSI C, is portable to all unix variants. It has been ported to Linux (Intel, Alpha, PPC), MacOSX, SunOS/Solaris, Compaq Tru64 Unix (aka DEC Unix aka TUFKAO (the Unix formerly known as OSF)), Ultrix, IRIX, NeXTStep, ConvexOS, and HP/UX. But it likes Linux best.

Unless told not to via the --raw flag, tacg now automagically translates most ASCII formats (Genbank, FASTA, etc) via Jim Knight's SEQIO library and now handles multiple sequences at one time, internally converting 'u's to 't's. It considers both strands at the same time so you don't have to manually reverse complement the sequence and will by default accept all IUPAC degeneracies (y r m k w s b d h v), performing all possible operations on that sequence. It treats degeneracies in the input sequence in different ways depending on the -D flag (see below). It either strips all letters other than a c g t and analyzes the sequence as 'pure' using a fast incremental hashing algorithm or it treats it as degenerate and analyses it via a slower algorithm. By default, it treats it as 'pure' unless it detects an IUPAC degeneracy, in which case it will adaptively switch back and forth between the fast and slow hashing routines. See also RELATED PROGRAMS at bottom.

NB: tacg can produce lots of output; while it's possible to pipe direct to lp/lpr, you'll probably regret it.


REQUIREMENTS

tacg 4.3 requires an external Codon file (codon.data) but does not absolutely require a pattern/REBASE file, allowing you to enter patterns via the command line with the -p flag. However, most users will want to use a REBASE file in GCG format to supply the RE definitions. By default the name of this (supplied) file is rebase.data, altho other files in the same format can be specified by the -R flag. Searching for Matrix matching requires the use of a TRANSFAC-formatted file (also supplied) in the default name of matrix.data. Regular expressions can be supplied in a simplified REBASE format; the default name is regex.data. Examples of all data files are supplied. The codons/pattern/matrix/regex data files will be found in any of 3 locations which are searched in the order of: the current directory ($PWD), your home directory ($HOME), or the tacg lib ($TACGLIB). Many shells are set to define the 1st two; the last must be specified either via command line or in your .cshrc file (or equivalent).
csh %> setenv TACGLIB /usr/local/lib/tacg   [csh/tcsh]

bash #>export TACGLIB=/usr/local/lib/tacg [bash]

If they are in another location, you'll have to specify it with either an explicit full or relative path name.
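The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the documented order ($PWD, then $HOME, then $TACGLIB), not tacg's actual C code; the function name find_data_file is hypothetical.

```python
import os

def find_data_file(name, env=None):
    """Return the first existing path for `name`, searching the current
    directory, then $HOME, then $TACGLIB (the documented order)."""
    env = os.environ if env is None else env
    candidates = [os.getcwd(), env.get("HOME"), env.get("TACGLIB")]
    for directory in candidates:
        if not directory:              # variable unset: skip that location
            continue
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None                        # caller must then give an explicit path
```

A copy of rebase.data in the current directory would therefore shadow one in $HOME, which in turn shadows the one in $TACGLIB.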

FLAGS and Options

Flag Value Explanation
-b
i#
select the beginning of a subsequence from a larger sequence file; 1* for 1st base of sequence. In the Linear Map output, the upper label indicates numbering from beginning of subsequence; the lower label indicates numbering from the beginning of the entire sequence. The SMALLEST SEQUENCE that tacg can handle is 4 bases (10 for the ladder map (-l)). This allows analysis of primers and linkers.
-e
i#
select the end of a subsequence from a larger sequence file; 0* for last base of sequence. This subsequence can also be made circular via the -f flag. The largest sequence that tacg can handle depends on how much memory you have, although for practical purposes, assume 1 billion bases. 
-c   order the output by # of cuts/fragments by each RE (Strider style) and thence alphabetically; otherwise output is by order of appearance in the REBASE file.
--clone
#_#,#x#...
(all integers)
Clone finds sequence ranges which either MUST NOT be cut (#_#) or that MUST be cut (#x#), up to a maximum of 15 at once. Ranges not specified can be either cut or not cut. The output first lists all REs (if any) which match ALL the rules, then all REs which match SOME rules as long as all NO-CUT rules are respected. The same filters that work in other RE selections (-n, -o, -m, -M, --cost, --dam/dcm) can be applied here to fine-tune the selection.
-C
0*-16
Codon Usage table to use for translation:
0 Standard       6 Echino_Mito        12 Blepharisma
1 Vert_Mito      7 Euplotid_Nuclear   13 Chloro_mito
2 Yeast_Mito     8 Bacterial          14 Trematode_mito
3 Mold_Mito      9 Alt_Yeast          15 Scenedes_mito
4 Invert_Mito    10 Ascidian_Mito     16 Thrausto_mito
5 Ciliate_Mito   11 Alt_Flatworm_mito
--cost
i#
select REs by their cost (units/$ - >100 is cheap; <10 is v. expensive)
--dam   simulate cutting in the presence of Dam methylase (GmATC). rebase.dam contains all REs that are Dam-sensitive.
--dcm   simulate cutting in the presence of Dcm methylase (CmCWGG).  rebase.dcm contains all REs that are Dcm-sensitive.
-D
0-4
Degeneracy flag - controls input and analysis of degenerate sequence input where:
 0  FORCES excl'n of degens in seq; only 'acgtu' accepted
 1* cut as NONdegen unless degen's found; then cut as '-D3'
 2  degen's OK; ignore in KEY, but match outside of KEY
 3  degen's OK; expand in KEY, find only EXACT matches
 4  degen's OK; expand in KEY, find ALL POSSIBLE matches
The pattern matching is adaptive; given a small window of nondegenerate sequence, the algorithm will match very fast; if degenerate sequence is detected, it will switch to a slower, iterative approach. This results in speed that is proportional to degeneracy for most cases. If you have long sequences of 'n's (inserted as placekeepers, for instance), -D2 may be a better choice. In all cases, as soon as degeneracy of the KEY hexamer exceeds a compiled-in limit (usually 256-fold degeneracy), the KEY is skipped. 
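To make the 'fold degeneracy' of a KEY hexamer concrete, here is a minimal Python sketch. It assumes the degeneracy is the product of the number of bases each IUPAC code can stand for, and that a KEY is skipped only when it exceeds the limit; the function names and the default limit of 256 are taken from the description above for illustration, not from the tacg source.

```python
# Number of concrete bases each IUPAC code can represent.
IUPAC_FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1, "u": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}

def key_degeneracy(hexamer):
    """Number of non-degenerate sequences the hexamer expands to."""
    fold = 1
    for base in hexamer.lower():
        fold *= IUPAC_FOLD[base]
    return fold

def key_is_skipped(hexamer, limit=256):
    """True if the KEY is too degenerate to be worth expanding (assumed rule)."""
    return key_degeneracy(hexamer) > limit
```

Under these assumptions 'gryttc' expands to 4 hexamers, 'nnnnac' to exactly 256 (still searched), and 'nnnnnc' to 1024 (skipped).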
--example
 
example code to show how to add your own flags and functions. Search for 'EXAMPLE' in 'SetFlags.c' and 'tacg.c' for the code.
-f
0|1*
form (or topology) of DNA - 0 (zero) for circular; 1 for linear. This flag also operates on subsequences.
-F
0*-3
print/sort Fragments; 0*-omit; 1-unsorted; 2-sorted; 3-both.
-g
Lo i#(,Hi i#)
specify if you want a pseudo-graphic gel map, with a low end cutoff of Lo# bases (converted to an integer multiple of 10), and (if present), a high end cutoff of Hi#. In Ver <2, the Lo# was restricted to 10 or 100; now it can be any integer power of 10 (10, 100, 1000, etc), as can the Hi#. If Hi# is omitted or is larger than the sequence length, it takes the value of the sequence length. See examples below.
-G
binsize,X|Y|L
Graphic data output, so (mis)named for its original use, where:
binsize = # bases for which hits should be pooled
X|Y|L indicates whether the BaseBins should be on the X or Y axis or in 'Long' form where Basebins (as X) and Name data (as Y) are reiterated in 2 columns for all the Named patterns:
X: BaseBins 1000 2000 3000  ..
   NameA      0    4    0   ..   
   NameB     22   57   98   ..     (#s = matches per bin)
   NameC      1    0    0   ..
   .
Y: BaseBins  NameA   NameB   NameC   ..
     1000      0      22       1     ..
     2000      4      57       0     .. 
     3000      0      98       0     ..
     .
L: Basebins  NameA
     1000      0    
     2000      4    
        .      .
   Basebins  NameB
     1000     22
     2000     57
        .      .
This addresses some missing features: it allows the export of hit data for the selected Names so that you can manipulate it as you wish. Like other output, it is streamed to stdout, so it's not wise to mix -G with other analyses; the lines generated (esp. w/ the X option) can be quite long and are NOT governed by the -w flag.
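The pooling that -G performs can be pictured with a small Python sketch. It assumes 1-based hit positions and that each bin is labelled by its upper bound (1000, 2000, ...), as the layouts above suggest; bin_hits is a hypothetical name and tacg's own binning may differ in detail.

```python
from collections import Counter

def bin_hits(positions, binsize):
    """Pool 1-based hit positions into bins labelled by their upper bound."""
    counts = Counter()
    for pos in positions:
        label = ((pos - 1) // binsize + 1) * binsize
        counts[label] += 1
    return dict(counts)

# One row of the 'X' layout for a single pattern name.
hits = bin_hits([1, 999, 1000, 1001, 2500], 1000)
print("NameA", *(hits.get(label, 0) for label in (1000, 2000, 3000)))
```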
-h
--help
  brief help page (condensed man page).
-H
--HTML
{0*|1}
generates HTML tags for inclusion into Web pages.
0 - (default) makes standalone HTML page, with header, footer, and Table of Contents.
1 - does not generate HTML page headers, only TOC, to embed in other HTML pages.
-i
(--idonly)
{0|1*|2}
controls output for sequences that have no hits
0 - ID line and normal output printed regardless of hits
1 - (default) ID line and normal output are printed ONLY IF there are hits.
2 - ONLY ID line is printed if there are hits.
--infile
{filename}
allows those wanting to specify a file by commandline flag to do so (helps in some kinds of GUI wrapping functions)
-l
(el)
  specify if you want a ladder map of selected enzymes, much like the GCG MAPPLOT output. Also appends a summary of those enzymes that match few times. This last # is length-sensitive in the distributed source code, but it is easy to set another default as a '#define' in 'tacg.h'.
-L   specify if you WANT a Linear map a la Strider or GCG's MAP (but better - tacg indicates the actual CUT site as opposed to the 1st base in the pattern as do other mapping programs). In Ver 3.x, the Linear Map only includes those REs or patterns which pass the filtering criteria set via the -n, -o, -m, -M, --cost, etc.
--strands
{1|2*}
specifies how many strands get printed in the linear map. Allows you to slightly compact the linear map, especially when used with the --notics flag below
--notics   do NOT print the tic marks below the DNA in the linear map. Allows you to slightly compact the linear map, especially when used with the --strands flag above
--numstart
i#
(-|+)
the value given with this flag is the beginning number in the Linear Map (-L) output. This can be used to force a particular numbering scheme on the output or to force upstream (negative) numbering for promoter sequences. If a negative number is used, the zero position is omitted at the transition from - to +.
-m
i#
select enzyme by minimum # cuts in the whole sequence. Default is no minimum (ie ALL). Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map (-L), and ladder map (-l) flags.
-M
i#
select enzyme by Maximum # cuts in the whole sequence. Default is 32,000. Affects the number of enzymes displayed by the sites (-s), fragments (-F), Linear map (-L), and ladder map (-l) flags.
-n
3*-10
select enzymes by magnitude of recognition site; 3 = all, 5 = 5,6,7,8... n's don't count, other degeneracies are summed ie: tgca=4, tgyrca=5, tgcnnngca=6, tannnnnnnnnnta=4
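The four magnitudes quoted above are consistent with a simple weighting: each a/c/g/t counts 1, each 2-fold degeneracy counts 0.5, and each n counts 0. A Python sketch of that reading follows. The weight used for 3-fold codes (b d h v) is my own assumption, chosen as 1 - log4(3) so that it falls between the other cases; the text above does not say how tacg scores them.

```python
import math

# Bases represented by each IUPAC code.
IUPAC_FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}

# Contribution of one position to the site magnitude.
# The 3-fold weight is an assumption, not documented behaviour.
WEIGHT = {1: 1.0, 2: 0.5, 3: 1.0 - math.log(3, 4), 4: 0.0}

def site_magnitude(pattern):
    """Effective size of a recognition site, as used for -n filtering."""
    return sum(WEIGHT[IUPAC_FOLD[base]] for base in pattern.lower())

for pat in ("tgca", "tgyrca", "tgcnnngca", "tannnnnnnnnnta"):
    print(pat, site_magnitude(pat))
```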
-o
0|1*|3|5
select enzymes by overhang generated; 5 = 5', 3 = 3', 0 for blunt, 1 for all 
-O
1-6(x),MinSiz
ORF analysis where any frame combination can be specified ('126' or '45' or '13456') along with the minimum ORF Size you want to detect. Produces either a single line (if -w1 is specified) or a block, (with the Amino Acids wrapped at the specified width) for each ORF including:
  • Frame of the Current ORF
  • Sequence # of the Current ORF in that frame
  • Offset from the start in both bases and AAs
  • Size of the ORF in AAs and KDa
  • estimated pI
  • ORF in 1 letter code for external analysis
If 'x' is appended to the frame specification, 3 additional lines are appended, which give proportion of each AA in # and %. This breaks the FASTA format, but can be easily stripped later as each line is prefixed with a '#'.
NB: Because the output can be in a single line for each ORF, other line-oriented pattern-matching tools (grep, perl, awk) can examine the ORF generated for matching regular expressions (see the GNU grep man page for an explanation of regular expressions). In this way you can search all 6 frames of >=MinSize AAs for whatever pattern interests you.
Examples:
-O 145x,25   (search frames 1,4,5 with extended AA information on all ORFs > 25 AAs)
-O 2,66   (search frame 2 with a min ORF size of 66 AAs)
--orfmap
 
in conjunction with -O (above) this option draws a pseudographic character-based ORF map showing all the frames specified with -O and all the ORFs larger than the minimum size mapped to their relative positions on the map. It also prints a similar map of all the MET and STOP codons.
-p
Name,Pat,Err
allows entry of search patterns from the command line, where 
Name = name by which pattern is labeled (<=10 chars)
Pat = <30 IUPAC characters (ie. gryttcnnngt)
Err = max # of errors that are tolerated (<=5)
Also logs the patterns you've entered into a file tacg.patterns in the correct format for later copying to a REBASE file. Can enter up to 10 of these at a time. Patterns should consist of <=30 IUPAC bases. 
Long sequences with large errors will cause SUBSTANTIAL cpu and memory usage in validating the patterns. 
-P
NameA,
[+-][lg]
Dist_Lo
[-Dist_Hi],
NameB

MBQ

Pattern proximity matching to search for spatial relationships between factors, 2 at a time (up to a total of 10).
NameA and NameB must be in a REBASE file, either the default rebase.data or another specified by the -R flag and are case INsensitive. NameA/B patterns can be composed of any IUPAC bases and ERRORs can be specified in the REBASE entry ie: 
Pit1 5 WWTATNCATW 0 2 ! a Pit1 site with 2 errors
Tataa 4 TATAAWWWW 0 1 ! a Tataa site with 1 error

+ NameA is DOWNSTREAM of NameB (default is either)
- NameA is UPSTREAM of NameB (ditto)
l NameA is LESS THAN Dist_Lo from NameB (default)
g NameA is GREATER THAN Dist_Lo from NameB
Dist_Hi - if used, implies a RANGE, obviates l or g

Examples:
-PHindIII,350,bamhi
Matches HindIII sites within 350 bp of BamHI sites

-PPit1,-30-2500,Tataa
Match Pit1 sites 30 to 2500 bp UPSTREAM of a Tataa site.
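The distance and orientation tests behind -P can be sketched as follows. This is an illustration under stated assumptions, not tacg's implementation: positions are single top-strand coordinates, 'upstream' means a smaller coordinate, and the range bounds are inclusive. The function name proximity_pairs is hypothetical.

```python
def proximity_pairs(a_sites, b_sites, lo, hi=None, direction=None, mode="l"):
    """Return (a, b) position pairs satisfying a -P style constraint.

    direction: '+' = A downstream of B, '-' = A upstream of B, None = either.
    mode: 'l' = closer than lo, 'g' = farther than lo; ignored if hi is given.
    """
    pairs = []
    for a in a_sites:
        for b in b_sites:
            offset = a - b             # >0: A downstream of B; <0: A upstream
            if direction == "+" and offset <= 0:
                continue
            if direction == "-" and offset >= 0:
                continue
            dist = abs(offset)
            if hi is not None:
                ok = lo <= dist <= hi  # Dist_Hi given: treat as a range
            elif mode == "g":
                ok = dist > lo
            else:
                ok = dist < lo
            if ok:
                pairs.append((a, b))
    return pairs

# Roughly '-PHindIII,350,bamhi' and '-PPit1,-30-2500,Tataa' on made-up positions.
print(proximity_pairs([100, 5000], [400], 350))
print(proximity_pairs([100, 3000, 450], [500], 30, hi=2500, direction="-"))
```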

--ps
 
generates a PostScript plasmid map (and multiple pages with the same parameters if fed a multi-sequence file). The output file is named tacg_Map.ps and additional plots will be appended to it if it exists in the same directory. REs to be plotted can be selected with the usual parameters (-m -M --cost -n -x -p), but you'll usually want to use -M1 or -M2. Degeneracies are plotted along the rim as grayscale arcs (remember tacg can tolerate degeneracies in sequence, so you can compose accurate plasmid maps by connecting known sequences with N's). ORFs from any and all frames can be plotted internal to the sequence ring by using the -O flag.
--pdf
 
Invokes --ps above and automatically converts the PostScript output to Adobe's Portable Document Format, which is considerably more compact. You'll need a PDF viewer to view the results: Adobe's Acrobat Reader, xpdf, gv, or functional equivalent. Requires a working local Ghostscript installation, with gs installed at or linked to /usr/bin/gs. NB: If the standard Type1 fonts aren't installed, it will fail.
--logdegens
 
(off by default) Using this flag forces the logging of every degeneracy in the sequence, trivial if a short sequence (<1Mb), but of concern for chromosome-sized chunks. This info will be used for drawing graphic maps of the sequence and shading degeneracies differently (invoked by --ps & --pdf above). It is quite memory intensive as it marks the beginning and end of every degeneracy run. No external data is produced, but could be as it's just a simple 2-step array.
-q
REMOVED
Be quiet. DISallows sending diagnostic udp info back to author, now the default behavior (so unless you TELL the program to send data back, it won't). 
-Q
REMOVED
Be UNquiet. Allows the program to send diagnostic udp info back to author. In version 2.x, this was the default behavior, but it has served its purpose, so unless you WANT me to log your usage, I won't.
The report stream includes this info: Date Time IP# UID hardware OS OS_version TACG_version [tacg commandline] < # bases analyzed > ie. 1996-03-08 17:02:26 128.23.4.24:[uid=502 hw=i486 os=Linux osver=1.2.6] [TACG Version 1.33F] tacg -t 3 -n 6 < 434 bp >
--raw   tells tacg to consider ALL input as valid sequence (as with version 2), instead of using SEQIO to parse the input as a standard sequence format. Useful for analyzing file fragments or editor buffers, which may be missing valid format. Note that specifying this flag will tell tacg to eat headers, comments, etc as well as sequence, if it encounters them. ALL IUPAC degeneracies will be analyzed.
--rev   tells tacg to reverse all input sequences before analyzing it. tacg -> gcat
--comp   tells tacg to complement all input sequences before analyzing it. tacg -> atgc
--revcomp   tells tacg to reverse-complement all input sequences before analyzing it. tacg -> cgta
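For reference, the three transformations above can be reproduced with this short Python sketch, which also complements the IUPAC degeneracy codes (r<->y, m<->k, b<->v, d<->h; s, w and n are their own complements). It is an independent illustration, not code from tacg.

```python
# Complement table covering a c g t plus the IUPAC degeneracy codes.
COMPLEMENT = str.maketrans("acgtrymkswbdhvn", "tgcayrkmswvhdbn")

def reverse(seq):
    return seq[::-1]

def complement(seq):
    return seq.lower().translate(COMPLEMENT)

def revcomp(seq):
    return complement(seq)[::-1]

print(reverse("tacg"), complement("tacg"), revcomp("tacg"))  # gcat atgc cgta
```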
--rules
'ruleA[&|^]
ruleB[&|^]
ruleC[&|^]..,i#'

MBQ

--rule allows you to specify arbitrarily complex logical associations of characteristics to detect the patterns that interest you. Admittedly, that phrase is incomprehensible, so let me give an example:

Say you wanted to search for an enhancer that you suspected might be involved in the transcriptional regulation of a pituitary-specific gene. You knew that you were looking for a sequence about 1000 bp long in which there were at least 2 Pit1 sites and 3-5 Estrogen response elements, but NO TATAA boxes.
If you had defined these patterns in a file called pit.specific as:

       Pit1  0  WWTATNCATW    0 1 ! Pit1 site w/ 1 error
       ERE   0  GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
       TATAA 0  tataawwww     0 0 ! TATAA site, no errors allowed 
you could specify this search by:
tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)' \
-R pit.specific < input_sequence > output

This query searches a sliding window of 1000 bps (',1000') for ((2-7 Pit1 AND 3-5 ERE sites) AND (0 TATAA sites)). These combinations can be as large as your OS allows your command-line to be with arbitrarily complex relations represented with logical AND (&), OR (|), and XOR (^) as conjunctions. Parens enforce groupings; otherwise it's evaluated left to right.
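To show what the example rule means, here is a Python sketch that hard-codes that one rule and slides a window over lists of hit positions. It assumes 1-based positions and a window that advances one base at a time; how tacg actually steps the window and reports overlapping matches is not stated here, and the function names are hypothetical.

```python
def count_in_window(sites, start, width):
    """Number of hit positions falling in [start, start + width)."""
    return sum(1 for pos in sites if start <= pos < start + width)

def window_matches(hits, start, width=1000):
    """Evaluate ((Pit1:2:7&ERE:3:5)&(TATAA:0:0)) for one window."""
    pit1 = count_in_window(hits.get("Pit1", []), start, width)
    ere = count_in_window(hits.get("ERE", []), start, width)
    tataa = count_in_window(hits.get("TATAA", []), start, width)
    return (2 <= pit1 <= 7 and 3 <= ere <= 5) and tataa == 0

def matching_windows(hits, seq_len, width=1000):
    """Start positions of every window that satisfies the rule."""
    return [s for s in range(1, seq_len - width + 2)
            if window_matches(hits, s, width)]

# Made-up hit positions on a 2000 bp sequence.
hits = {"Pit1": [100, 400], "ERE": [200, 300, 900], "TATAA": [1500]}
starts = matching_windows(hits, 2000)
print(starts[0], starts[-1])
```

A general version would parse the rule string into an expression tree of AND/OR/XOR nodes, but the per-window counting would stay the same.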

--rulefile
/path/to/file
This option allows you to read in a complete file of the kind of complex rules described above and have them all evaluated. The file format is described in the supplied example data file, rules.data.
-r
(--regex)
'Label:RegexPat'
or
'FILE:RegexFile'

MBQ

searches for regular expressions entered from the commandline using the 1st option or searches for the regular expressions read from a file using the 2nd option. The regular expression syntax can be formal regex patterns or the IUPAC'ed version thereof; the translation from one to the other is handled automatically. ie:
gy(tt|gc)nc{2,3}m -> g[ct]\(tt\|gc\).c\{2,3\}[ca]
When trying to specify a file, the term FILE must be in CAPs (so don't use 'FILE' as a pattern name). Specific regex patterns from the file can be specified by using the -x flag to name them explicitly. 
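The IUPAC-to-regex translation can be sketched in Python as a simple character substitution. Note two differences from the example above: Python's re module uses unescaped ( | ) { } where the translation shown uses the backslash-escaped basic-regex forms, and this naive sketch does not handle IUPAC letters placed inside an explicit [..] class. The function name is hypothetical.

```python
import re

# Character class for each IUPAC degeneracy code; 'n' matches any base.
IUPAC_CLASS = {
    "r": "[ag]", "y": "[ct]", "m": "[ca]", "k": "[gt]",
    "w": "[at]", "s": "[gc]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": ".",
}

def iupac_to_regex(pattern):
    """Expand IUPAC codes in a pattern, leaving regex operators untouched."""
    return "".join(IUPAC_CLASS.get(ch, ch) for ch in pattern.lower())

rx = iupac_to_regex("gy(tt|gc)nc{2,3}m")
print(rx)                                   # g[ct](tt|gc).c{2,3}[ca]
print(bool(re.search(rx, "aagcgctccaa")))   # True
```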
-R
REBASE, REGEX, or
MATRIX file
specifies an alternative Restriction Enzyme file (in GCG format), regular expression file, or Matrix file (in TRANSFAC format) to use. The latest REBASE files are available via FTP or via WWW.
The latest TRANSFAC files are also available via FTP or WWW. There are several such files included in the std distribution:
  • rebase.data - the main restriction enzyme pattern database (including those in rebase.dam, rebase.dcm)
  • rebase.dam - only those REs that are Dam sensitive
  • rebase.dcm - only those REs that are Dcm sensitive
  • regex.data - a few example regular expression patterns
  • matrix.data - all the TRANSFAC matrices
  • transfac.data - the entire TRANSFAC database in GCG format
The file specified here is also searched for in the same order as the other data files: $PWD (the current directory), $HOME (your home directory), $TACGLIB (the TACGLIB directory).
-s   prints the summary of site information, describing how many times each enzyme or pattern matches the sequence. Those that cut zero times are shown first. In Ver >=2, only those that match at least once are shown in the second part (the 0-matchers are not reiterated)
-S   prints the actual match Sites in tabular form.
--silent   requests that the NA sequence be translated starting at the 1st base, in frame 1 (use -b to shift the starting base), according to the Codon Translation table specified with -C, then reverse translated, using the same table and all the possible degeneracies; tacg then restricts that (quite) degenerate sequence and shows all the REs that will match it. You should use the -L and -T flags to generate the linear map which shows both the REs and the cotranslated sequence to verify that all is as it should be. NB: Depending on Codon Table, some AAs are not reversibly translatable. Using the standard table, Arg (=mgn), Leu (=ytn), and Ser (=wsn) cannot be Forward translated from their Reverse translation.
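Why Arg, Leu and Ser are not reversibly translatable can be checked with a short Python sketch: expand each degenerate codon into every concrete codon it covers and translate them with the standard genetic code. The table is built from the usual t/c/a/g ordering; the function name is hypothetical and none of this is taken from the tacg source.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t",
         "r": "ag", "y": "ct", "m": "ac", "k": "gt", "w": "at", "s": "cg",
         "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt"}

def encoded_amino_acids(degenerate_codon):
    """Set of amino acids (or '*' for stop) covered by a degenerate codon."""
    choices = [IUPAC[b] for b in degenerate_codon.lower()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}

print(sorted(encoded_amino_acids("mgn")))   # ['R', 'S']
print(sorted(encoded_amino_acids("ytn")))   # ['F', 'L']
```

So the degenerate codon for Arg also covers two Ser codons, and the one for Leu also covers Phe, which is why the round trip is not exact for these residues.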
--tmppath
'/path/to/tmp/dir'
passes the path to tacg to cooperate with CGIs or other programs that need to tell tacg where to place the ps/pdf files for access by other processes.
-T
0*|1|3|6,1|3
In the Linear map, beneath the DNA sequence, include the translated protein in 0*, 1, 3(= frames 123), or 6 (=123456) frames of Translation with 1 or 3 letter codes.
ie.
-T 3,3 (includes frames 1,2,3 with 3 letter labels)
-T 6,1 (includes frames 1,2,3,4,5,6, with 1 letter labels)
-v   asks for program version (there may be multiple versions of the same functional program to track its migration). Also reports build date, kernel version, and GCC version used.
-V
1-3
Verbose - requests all kinds of diagnostic info to be spat to the screen. May be useful in diagnosing why tacg did not behave as expected..but maybe not. Higher numbers mean more output and are generally downwardly inclusive.
-w
1 | i#
output width in bp's (must be between 60* and 210, truncated to a # exactly divisible by 15, so '-w 100' will be interpreted as '-w 90'); actual printed output will be about 20 characters wider. Also applies to output of the ladder and gel maps, so if you're trying to get more accuracy and your output device can display small fonts, you may want to use this flag to widen the output. If you want as much output on one line as possible for external parsing/analysis, specify -w 1.
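The width rule amounts to integer truncation to a multiple of 15, as this small Python sketch shows. Clamping out-of-range requests to 60..210 is an assumption made for the sketch (tacg may instead reject them), and effective_width is a hypothetical name.

```python
def effective_width(requested):
    """Width in bp that a '-w' request would resolve to under these assumptions."""
    if requested == 1:
        return 1                              # special value: one-line output
    clamped = min(max(requested, 60), 210)    # assumed handling of bad values
    return (clamped // 15) * 15               # truncate to a multiple of 15

print(effective_width(100))   # 90
```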
-x
'NameA(,=),
NameB,
NameC,
NameD,...(,C)'
used to explicitly name those enzymes or patterns to be used in the analysis (up to a maximum of 15). Case INsensitive (HindIII=hindiii=HinDiIi), but the name HAS to be spelled exactly like the entry in the REBASE or MATRIX file with no spaces (HindIII != Hind III != Hind3).
The ',=' tag appended to a name indicates that it is the tagged RE in an AFLP analysis; only those fragments that have at least one end generated by the tagged RE will be shown.
The trailing ',C', if added, requests a combined digestion using all the REs specified with this flag.
Examples:
-xHindIII,BamHI,NruI,C
requests data for these REs both individually, and combined.

-x EcoRI=,MseI,Hinf
requests AFLP formatted data, with EcoRI tagged; NO combined results (per se, altho that is a part of the AFLP analysis).

-X
(--extract)
b,e,[0|1]
eXtracts the sequence around the pattern matched, from b bases preceding to e bases following the MIDDLE of the pattern if a normal pattern, or the START of the pattern if a regular expression. If the pattern is found in the bottom strand AND the last field = 1, the sequence is rev-compl'ed before it's extracted so all patterns are in the same orientation; if the last field = 0, it is NOT reverse compl'ed. In any event, the sequences are FASTA-formatted on output.
-#
% CutOff
The percentage of the optimal matrix score that you will accept as a match. ie. if the matrix (as below) was 10 bases long, and had a maximum score of 77 (the sum of the highest value at each position), then if you indicated -# 75, you would accept a score of 57.75 (77 x .75) as a match.
     a  t  g  g  c  y  t  r  g  g   Consensus
     1  2  3  4  5  6  7  8  9 10   Position
  a  8  0  1  1  1  0  1  4  0  0  
  c  1  3  1  0  9  6  0  0  2  0   Sum of column maxima = 77
  g  1  0  8  7  0  0  0  6  7 10
  t  0  7  0  2  0  4  9  0  1  0
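The cutoff arithmetic can be checked with a short Python sketch using the matrix as printed. Summing the largest entry in each of its ten columns gives 77, so a 75% cutoff corresponds to a score of 57.75. The sketch scores a candidate site by adding the matrix entry for its base at each position and accepts it if the total reaches the cutoff; whether tacg scores raw counts in exactly this way, and whether the comparison is >= or >, are assumptions.

```python
# Example matrix from the text: one row per base, one column per position.
MATRIX = {
    "a": [8, 0, 1, 1, 1, 0, 1, 4, 0, 0],
    "c": [1, 3, 1, 0, 9, 6, 0, 0, 2, 0],
    "g": [1, 0, 8, 7, 0, 0, 0, 6, 7, 10],
    "t": [0, 7, 0, 2, 0, 4, 9, 0, 1, 0],
}

def max_score(matrix):
    """Best achievable score: the largest entry in each column, summed."""
    return sum(max(column) for column in zip(*matrix.values()))

def score(matrix, site):
    """Score of one candidate site (same length as the matrix)."""
    return sum(matrix[base][i] for i, base in enumerate(site.lower()))

def is_match(matrix, site, percent_cutoff):
    return score(matrix, site) >= max_score(matrix) * percent_cutoff / 100.0

print(max_score(MATRIX), max_score(MATRIX) * 0.75)   # 77 57.75
print(score(MATRIX, "atggcttagg"), is_match(MATRIX, "atggcttagg", 75))
```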


RELATED PROGRAMS

As noted above, tacg now incorporates some of Jim Knight's SEQIO pkg to perform sequence format translation, so external conversion programs are no longer needed for most common sequence formats. This is an extremely useful, well thought-out, and relatively easy-to-use pkg - I'm sorry I waited so long to include it. The package is no longer at its original URL, but it has been archived around the internet.

However, if an external program IS needed for format interconversion, I also strongly recommend Don Gilbert's excellent readseq program (available in source or executable via FTP). Why recommend readseq when I've used SEQIO? SEQIO is a great library of functions to use in other programs, but readseq is easier to use for stand-alone, interactive use, chiefly due to a more std interface. Both are scriptable; for scripting use, it's a toss-up.

You can also use the paging utility less to move thru your sequence file and use its marking and piping facility to punt the sequence of interest to 'tacg'. Many editors also allow piping a selection of text to an external program and inclusion of the result into another window, especially nedit (a .nedit extract can add some tacg functions to the nedit Background (aka right-click) menu system), as well as crisp, and the ubiquitous, omnipotent emacs and its gui doppelganger xemacs.

Much of tacg's output benefits from wider-than-normal printing. The '-w#' flag allows output up to about 230 characters wide; however, to print this without wrapping, you need to print in landscape mode, using very small fonts. A number of unix printing utilities allow you to do this, notably genscript aka GNU Enscript, residing in the GNU repository.

EXAMPLES

Used alone:

tacg -f0 -n5 -T3,1 -sL -F3 -g 100 <input.seq.file >output.seq.file

Translation: read sequence from input.seq.file and analyze it as circular (-f0), with 5+ cutters (-n5), returning both site info and Linear map (-sL) as well as sorted and unsorted fragment data (-F3) and do 1,2,3 frame translation w/ 1 letter codes (-T3,1) on the linear map, writing the output to output.seq.file. Also, include a pseudo gel diagram for those enzymes that pass the filtering, with a low end cutoff of 100 bases (-g100). 


Used to search for Matrix Matches:

tacg -# 75 -R yeast.matrices -sS < yeast.chr_4 | less

Translation: search the file yeast.chr_4 for all the matrix definitions in the file 'yeast.matrices', with a cutoff of 75% of the maximum score possible, listing the summary and the Sites information, and piping the output to the pager less.


Used to search for degenerate transcription factor sites with errors:

tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL < rprlPromo.seq > promo.map

Translation: search the sequence from the file rprlPromo.seq for the patterns labeled Pit1 and ap2, allowing 1 error each, printing the results (summary (-s), Sites (S), and the Linear Map (L)) 90 characters wide (-w90) to the file promo.map.


Used to search for a Regular Expression:

tacg --regex 'yadda:gm(tt|ag)ggn{3,5}tgy' -SL < some.seq | less

Translation: search the file some.seq for the regular expression gm(tt|ag)ggn{3,5}tgy, piping the information about Sites (-S) and the Linear Map (L) to the pager 'less'. 


Used to search the entire yeast 500bp Upstream Regulatory sequences (a file containing 6226 500 bp sequences) for matches to the MATa1 binding site (from TRANSFAC) :

tacg -R TRANSFAC.data -sScw1 -xMATa1 -#85 < utr5_sc_500.fasta > yeast.summary

Translation: translate each of the FASTA formatted entries in the input file utr5_sc_500.fasta into usable sequence, and after finding the MATa1 (-x MATa1) matrix description from the database (-R TRANSFAC.data), search the sequences for matches at 85% of the maximum score that it has in the TRANSFAC database (-# 85), returning the summary (-s), the sites (S) sorted in Strider order (c) with results printed on 1 line (w1), directing the output into the file yeast.summary


BUGS and ODDITIES

Major

- tacg will not currently cut sequence shorter than 4 bases; if you need to analyze sequences shorter than this, perhaps you're using the wrong program.

- tacg has been made re-entrant for the inclusion of SEQIO and as such a number of memory leaks have been plugged (with the use of Gray Watson's excellent dmalloc library). tacg's not perfect yet but it's a lot more robust.

- the command line handling has been completely re-written, using the getopt() and getopt_long() functions, so the flags are considerably less sensitive to spacing and order.

- translation in 6 frames assumes circular sequence regardless of '-f' flag, so that the last amino acids in frames 5 and 6 in the 1st output block are obviously incorrect if you are assuming linear sequence.