Jonathan Badger
June 29, 1999
CRITICA is a series of programs designed to find protein coding regions in intron-less genomic DNA. It doesn't require (or use) any sequence annotation for its predictions. However, it does require that at least some of the genes in the sequence have homologs in data-banks for CRITICA's comparative analysis and that that the overall dicodon bias in these genes is typical of the organism for CRITICA's non-comparative analysis. In practice, this means that one should either analyze complete genomes with CRITICA or a sizeable chunk (>100,000 nucleotides). It is not required that the sequence be in one contig, however.
CRITICA is implemented as a series of Perl 5.0 scripts and ANSI C
code. To make the C programs simply type make (if your compiler
is not GCC, you'll have to edit the Makefile). Put the
generated programs somewhere on your path. The scripts in the
scripts directory also need to be on your path
(Critica.pm should be placed somewhere on your PERLLIB
path). The location of perl in the scripts may have to be
edited if your perl is not in /usr/bin. Also, you may
have to adjust several environmental variables for the script
blast-contigs: CRITICA_BLASTN (default value ``blastn'')
is the name of the BLASTN program used by
CRITICA. CRITICA_BLASTPARM (default value ``-warnings
-hspmax=500 E=1e-4 E2=1e-4'') is a set of parameters to be sent to
BLASTN (if you use a version of BLASTN that generates gapped
alignments, you should turn off this feature by giving a flag in
CRITICA_BLASTPARM). CRITICA_BLASTDB (default value
``gb'') is the name of the BLASTN database to be searched. The two
files critica.tex and critica.html are simply the
documentation that you are reading now in LATEX and html format.
Probably the best way to demonstrate how to use CRITICA is to walk though a typical analysis. To start with, you need the DNA to be analyzed to be in FASTA (also called Pearson) format. Here is an example of part of a FASTA file:
>ECOLI AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
Let's say you were going to analyze E. coli with CRITICA and the
genomic data was contained in a file named contigs. The first
part of a CRITICA analysis requires running blastn on small
subsequences of the genomic data: the script to do this is called
blast-contigs. Below is an example:
blast-contigs contigs > EC.blast
If you had additional data not in the main database. you could additionally run blast-contigs as follows:
blast-contigs -db=mydata contigs >> EC.blast
BLASTing a genome against GenBank often takes overnight; be prepared.
Next the BLAST output has to be processed into a format readable by CRITICA.
make-blastpairs EC.blast > EC.blast.pairs
Now the program scanblastpairs can be used to tabulate the
triplets aligned to E. coli in the blast.pairs
file
scanblastpairs contigs EC.blast.pairs EC.triplets
Now we can predict coding regions. Because it is best to have CRITICA
iterate several times, a script iterate-critica has been
supplied. It is run as follows:
iterate-critica ecorfs contigs EC.triplets
This will take twenty minutes (on a 300 Mhz Pentium II) to several
hours (on slower machines) to run and will generate numerous files,
among them ecorfs0.cds, ecorfs1.cds, ecorfs2.cds,
and ecorfs3.cds. These files are the predictions made by
CRITICA at each iteration. The last file (ecorfs3.cds) should
be the most accurate.
Here is part of the ecorfs3.cds file for interpretation:
ECOLI 35371 34781 2.86e-35 16 3399 2708 -93 GTG 48 12 AGGA ECOLI 36162 35377 8.28e-50 32 2171 6818 62 ATG 40 8 GGAG ECOLI 37824 36271 8.18e-45 64 1002 7385 63 ATG 84 3 GAGGTG ECOLI 39115 37898 7.12e-53 128 0 10183 63 ATG 76 7 AGGAG ECOLI 40386 39244 4.09e-50 128 324 9258 63 ATG 73 7 GAGGT ECOLI 41931 40417 1.29e-53 64 225 10286 63 ATG -123 0 - ^ ^ ^ ^ ^ ^ ^ ^^^^^^ ^^^^^^^^^^^^ contig start end P-value M comp dicod initiator SD
The P-value is the amount of statistical support for the coding region. Like BLAST, low scores are better. The matrix (M) is the best comparative matrix - high matrices mean the homologs found were more distant. The start and end are the start and end of the coding region, including the terminator. Comp and dicod represent the scores the orf has from comparative support and dicodon support respectively, The initiator information consists of the score given to the initiator codon and the codon itself. The SD information consists of the score given to the best SD sequence for the chosen initiator, the number of nucleotides between the end of the SD sequence and the start of the coding region, and the SD sequence itself.
Usage: addlongorfs [options] cds-file seq-file
Valid addlongorfs options:
-orf-aa-length=length
-genetic-code=number
This program lists orfs over a certain length that aren't included in
the cds file given as input. The main use of this program is that it
is called by iterate-critica if the -add-longorfs=
option is given.
usage: annotation-compare annotation.cds prediction.cds
This program compares the results of a CRITICA run against a file of
known coding regions of the format contig-name start end (...).
It outputs the percentage of false positives and false negatives, as
well as the percentages of too early and late start positions. It also
generates files ending in .fp, .fn, .early and
.late which have the respective CRITICA calls in them.
Usage: blast-contigs [-db=dbname] fasta-file
This script blasts the DNA in a fasta file against a DNA
database. Either the default database (usually GenBank or a
non-redundant DNA database) is used, or if the -db= option is
chosen, a database of the user's choice. The DNA to be blasted is
split into small fragments before BLASTing.
Usage: compare-orf-lists list1 list2
This script compares lists of coding regions. The lists can either be in the format of CRITICA .cds files or in a simple format:
contig start end
The resulting list is a list of coding regions in list1 not in list2 and the vice versa. The start position of the coding region doesn't matter in these comparisons.
Usage: critica [options] seq triplet-scores [more options]
Valid critica options:
-scoring-matrix=file
-dicodon-scores=file
-init-scores=file
-sd-scores=file
-genetic-code=number
-threshold=value
-alpha=value
-strict-threshold
-frameshift-threshold=value
-quick-stats
This is the main CRITICA program. It takes as arguments the sequence
to be analyzed and set of comparative triplets (created by
scanblastpairs -- see above). It outputs a list of predicted
coding regions with start positions within two orders of magnitude of
the best start position for each coding region.
-scoring-matrix=file -- this causes critica to use
an empirically derived scoring matrix instead of the theoretically
determined scores as described in the manuscript. An advantage of the
empirical scores is that conservative changes can be
rewarded. However, in practice I've usually found that this does not
significantly change the results, suggesting that identities are
sufficient for finding coding regions.
-dicodon-scores -- this causes CRITICA to
use a set of dicodon scores (generated by the dicodontable
program -- see below) for noncomparative support.
-init-scores -- this causes CRITICA to use
a set of initiator scores (generated by iterate-critica --
see below) for initiator choice.
-sd-scores -- this causes CRITICA to use a
set of SD scores (generated by iterate-critica -- see
below) for initiator choice.
-genetic code=number -- this causes CRITICA
to use a different genetic code than the ``universal'' one. The
numbers used are are the standard NCBI genetic code numbers.
-threshold=value -- this sets the threshold
for the maximum p-value allowed for an region to be considered a
coding region. The default is
1 x 10-4, which seems to work
the best.
-alpha=value -- this sets the parameter
``alpha'' which scales the dicodon scores down to compensate for their
non-independent nature. The default value is 0.8.
-strict-threshold -- by default CRITICA
accepts as coding regions regions that meet the demand of the
threshold even if extending them to a terminator causes the region to
drop below the amount of support required by the threshold. Using
-strict-threshold eliminates them. In practice this tends not
to be desirable, however.
-frameshift-threshold=value -- causes
CRITICA to check for potential frameshift errors in the sequences it
is analyzing. The value supplied is the number of nucleotides between
two regions of coding evidence in difference frames below which the
regions are assumed to be caused by a frameshift. A good value to use
is 10. The potential frameshifts are outputed to stderr - in the
case of iterate-critica (below) the messages are redirected to
the .mesg files.
-quick-stats -- one of the more time
consuming parts of a CRITICA run is calculating the parameters K and
lambda. This option causes CRITICA to use conservative values (0.2
and 0.015 respectively) and thus speeds the run (at the expense of
losing some marginal cases).
Usage: dicodontable [options] cds-file seq-file
Valid dicodontable options:
-fraction-coding=fraction
This program generates a set of dicodon scores from a list of coding
regions (generated by best-inits, above) and the FASTA file of
the DNA to be analyzed. The option -fraction-coding allows
adjustment for the fact that in the first (comparative evidence only
iteration) less than a realistic percentage of the DNA is called
coding. The default behavior of iterate critica is to set this
parameter to 0.8 in the first iteration, and not set it at all in
following iterations. The table generated by dicodontable has the
following format:
dicodon dicodon-score frequency-in-noncoding DNA
Usage: empirical-matrix coding-blast-pairs
This program creates a scoring matrix including conservative changes
from a blast-pairs file created from a BLASTN run of known coding
regions in the organism of interest. The scores are based on the log
odds of two triplets being aligned in the coding frame (the +1 frame)
vs. being aligned in other frames. The matrix is used via the
-scoring-matrix option of critica and
iterate-critica.
Usage: extractcoding [options] fasta-file cds-file
valid extractcoding options:
-desc (include gene descriptions in headers)
-prot (output translated sequences)
-genetic-code=NCBI-num
-min=num (minimum length (in amino acids) of extracted genes)
-max=num (maximum length (in amino acids) of extracted genes)
This program generates FASTA files of the genes predicted by CRITICA
(actually, it can read any file that starts contig-name start end).
The default is to output DNA sequences; the program can also
translate the genes if desired (through the -prot option. The
-desc option is useful particularly for generating FASTA files
from annotation files of the format
contig-name start end description....
Usage: histo-orf bin-size cds-file
This script produces a histogram of coding-region lengths (in amino
acids). The arguments are the size (in amino acids) of the histogram
bin size and the list of coding regions. The list can either be in the
format of CRITICA .cds files or any file that starts
contig-name start end.
The program produces input showing the number of orfs that have a length less than the number on the left (and whose length is greater or equal to the number on the left on the previous line). The percent of all coding regions in the bin is listed, as well as the culminative percentage.
Usage: intergenic [options] fasta-file cds-file
valid intergenic options:
-desc (include gene descriptions in headers)
-min=num (minimum length (in amino acids) of extracted genes)
-max=num (maximum length (in amino acids) of extracted genes)
This program generates FASTA files of the regions of the sequences provided that are not covered by the genes listed in cds-file.
iterate-critica: [options] output-name seq-file triplets-scores
Valid iterate-critica options:
-iterations=number (default: 3)
-scoring-matrix=file
-initial-dicod=file
-initial-init=file
-initial-sd=file
-no-sdscores
-threshold=value
-fraction-coding=value
-add-longorfs=length
-genetic-code=number
-strict-threshold
-frameshift-threshold=value
-quick-stats
This script is the normal way of automating a CRITICA run. The options are mostly identical to those of the CRITICA program itself except for the ability to set the number of iterations.
Some options of note -- -fraction-coding also changing the
default value is 0.8, (80%), which is typical of the fraction of DNA
which is actually part of a gene in prokaryotes. If you know that your
organism is quite different from this you can set your own value.
-add-longorfs=length adds to the first iteration all ORFS that
would encode proteins over a certain length (given in amino
acids). This can be helpful if there is very little comparative
information available for the organism being analyzed.
-no-sdscores turns off support for SD scoring. This is useful
if you know that your organism doesn't use SD sequences.
Usage: lookat [options] fasta-file [contig-name] start end
valid lookat options:
-num include num bases upstream of given region
+num include num bases downstream of given region
This program allows the user to examine any sub-sequence of a DNA or protein sequence. If the start is less than end, the reverse complement of the subsequence is given (for DNA of course!)
Usage: make-blastpairs [-exclude=string] blast-file
This script converts BLAST output (generated by blast -contigs,
above), into a FASTA file with alternating records representing query
DNA and its corresponding BLAST match. The option -exclude=
allows the exclusion of DNA from a particular genus or species,
generally the same genus or species of the query organism. In general,
this is unneeded because make-blastpairs automatically excludes
matches with more than 97% identity.
Usage: map-critica-orfs cds-list
This script lists the amount of overlap of each coding region predicted by CRITICA with the previous coding region. The output is the same as the input but with two additional columns - the first is the number of nucleotides of overlap with the previous coding region (negative numbers mean there is no overlap but a gap between the two genes). The second is the amount of overlap represented as the percent of the length of the coding region.
Usage: motiffind [-options] pattern fasta-file
Valid motiffind options:
-compstrand
This program finds instances of the pattern supplied in the FASTA
file. In the FASTA file contains DNA sequences, ambiguous codes such as
R and Y can be used (the program decides if a sequence is protein if
its G+C is less than 0.20). If the compstrand option is
selected, the reverse complement is searched (for DNA sequences).
Usage: removeoverlaps seq-file cds-file percent-allowedThis program removes coding regions which overlap more than the percentage given. The coding region removed is the one with the least support.
Usage: reportscore [options] seq triplet-scores contig start end matrix
Valid options:
-dicodon-scores=file
-init-scores=file
-sd-scores=file
-genetic-code=number
The program allows one to determine the score and p-value of a region of DNA. This is useful for understanding why CRITICA did not call an expected region a coding region.
Usage: scanblastpairs seq blast-pairs triplets
This program takes as inputs the FASTA file of the sequence to be analyzed, the blast-pairs generated by make-blastpairs (see above), and generates a table of triplets aligned to each triplet in the sequence file.
Usage: translate [options] fasta-file [contig-name] start end
valid translate options:
-num include num residues upstream of given region
+num include num residues downstream of given region
-genetic-code=NCBI-num
This program allows the user to translate any sub-sequence of a DNA sequence. If the start is less than end, the reverse complement of the subsequence is used.
Jonathan Badger Department of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1 Canada email: jhbadger@monod.uwaterloo.ca phone: (519) 888-4567 ext. 5321 fax: (519) 885-1208
This document was generated using the LaTeX2HTML translator Version 98.2 beta6 (August 14th, 1998)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 1 -no_navigation -show_section_numbers critica.tex
The translation was initiated by Jonathan Badger on 1999-06-29