Added/modified in Ver. 3.5.2 (2010-02-06)
(1A) SA x SA : DP-RA
(1N) SN x SN : DP-RA
(2A) SA x MA or MA x MA : CL-AF
(2N) SN x MN or MN x MN : CL-AF
(3) SN x SA or SN x MA : DP-AF
where DP and CL imply dynamic programming and candidate-list algorithms, respectively. CL is a generalized version of DP, and was introduced to rigorously optimize a group-to-group alignment score with an affine gap-penalty function [3]. Aln converts the input MA or MN into a generalized profile [4], if the performance is expected to be improved compared to the ordinary character-based method.
[1] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162, 705-708.
[2] Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol. 52, 359-373.
[3] Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to multiple sequence alignment." CABIOS 9, 361-370.
[4] Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with generalized profile operations." CABIOS 10, 379-387.
[5] Gotoh, O. (1998) "Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes." Mol. Biol. Evol. 15, 1447-1459.
[6] Gotoh, O. (2000) " Homology-based gene structure prediction: simplified matching algorithm by the use of translated codon (tron) and improved accuracy by allowing for long gaps." Bioinformatics, 16, 190-202.
[7] Gotoh, O. "A space-efficient and accurate
method for mapping and aligning cDNA sequences onto genomic sequence"
Nucleic Acids Research 36 (8) 2630-2638 (2008).
[8] Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence"
Bioinformatics 24 (21) 2438-2444 (2008).
| File 1: | | | File 2: |
| | | ||
| 3 16 | | | 2 16 |
| >Seq1 | | | LOCUS Seq1 |
| aaatttcccggg | | | ORIGIN |
| // | | | 1 AAATTTCCCGGG |
| >Seq2 | | | // |
| atcgatcgatcgat | | | LOCUS NCODE |
| // | | | ORIGIN |
| >NCODE | | | 1 -ACMGRSVTWYHKDBN |
| ACMGRSVTWYHKDBN | | | // |
| // | | |
Interleaved (native) format of multiple sequences
This native format is designed for multiple-sequence alignment to be naturally
recognized by human eyes. The alignment produced by aln can be used as an input
to aln or prrn, and this is the most common way to have access to sets of pre-
aligned sequences. Thus, the format of an aligned sequence file is the same as
the default output format of aln. The first non-blank line in a file must
indicate the number of sequences, N, involved in the alignment. This number is
obtained as the sum of numbers in square brackets, e.g., when the first line is
Seq1[3] - Seq2[4]
N is calculated to be 7. Subsequent lines up to the first blank line are ignored. The rest of the file is composed of one or more blocks of a fixed column width of less than 254 characters. Each block is composed of N 'sequence lines' and other (optional) 'non-sequence lines'. The general format of a sequence line is:
<Position> <Sequence> <Name>
where <Position> is a numeral that indicates the sequence position of the first letter in <Sequence|> (Usually all <Position>s in the first block are 1, but it is not a prerequisite. Negative values are also appropriate). A line lacking the <Position> field is regarded as a 'non-sequence line' and ignored upon reading. The i-th sequence line in the second block is concatenated to the i-th sequence line in the first block, and so on. There is no particular limit on N or the length, but the total number of characters to be stored is limited by MAXAREA defined in src/sqio.h. Several examples of native format are provided in the sample directory.
For native format of multiple sequences, lines starting with ";B ", ";b ", and ";m " represent the information about organizations of corresponding genes. The first number that follows ";B ", NP, indicates the number of alignment positions where an intron intervenes at least one of the genes corresponding to the aligned protein or cDNA sequences. For a protein sequence, the position means that of coding nucleotide sequence so that phase as well alignment position is also significant. The second number in this line, NI, indicates the total number of introns. The numbers that follow ";b " indicate the number of sequences that contain the intron at the 1st, 2nd, ..., NP-th position. The numbers that follow ";m " present the list of sequences that contain the intron at the 1st, 2nd, ..., NP-th position in this order. See the example ce13a.mfa in sample/pas directory.
[1] aln [option2] [2] aln [option2] seq1 seq2 [3] aln option1 [option2] [seq2]Option2 is common to other programs (e.g. prrn), and default values will be used if omitted. Method [1] is a 'conversational mode', which will be discussed in detail below. Method [2] is an 'instant mode'; the calculation starts immediately using the sequence data in files seq1 and seq2. Method [3] is a 'batch mode', which will be discussed after explanation of conversational mode.
Usually, aln automatically judges the type of a sequence (amino acids or nucleotides) according to the composition of characters read in. You may explicitly specify the sequence type by adding an 'attribute' as described below.
Menu Prompt: 1 2 pgris equivalent to
Menu Prompt: 1 Menu Prompt: 2 Menu Prompt: p Menu Prompt: g Menu Prompt: r
[[path][filename]] [start] [end] [attribute]
Any terms may be omitted, except start when you wish to specify end. When path
is omitted, the default path is used (this can be specified with option2, see
[2] instant mode below). Start and end specify the range of the sequence to be
analyzed. The default values are 1 and the last position of the sequence,
respectively. Some attribute terms are applicable only to a nucleotide
sequence. Currently meaningful attributes are defined below.
^ : the sequence is complemented. (nucleotide sequence only)
- : the sequence is reversed.
< : the sequence is reverse-complemented.
% : forces the sequences into profile.
A : amino acid sequence.
P : amino acid sequence.
D : nucleotide sequence.
R : nucleotide sequence.
N : nucleotide sequence, delete ambiguous codes ('N').
T : tron sequence. See ref. 5.
@N : read coding parts only, according to the N-th CDS (GenBank only).
#N : retain only the columns composed of >N% of non-deletion characters.
If there is no attribute character (default), the sequence remains as read. If
filename is specified, the remaining terms are reset to the default values,
otherwise non-specified terms are unchanged. Of course, filename should not be
omitted in the beginning session.
Default Meaning
pam 250 PAM level of Dayhoff's mutation data matrix MDM [0 - 300] step by
50.
bias 0 Value to be added to each element of MDM.
scale 1 Precision of the MDM matrix ( >= 1 ).
v 9 Constant term of the gap-weighting function.
u 2 Proportional coefficient of the gap-weighting function for a gap
shorter than or equal to k1.
s[=] 2 Similarity measure between identical nucleotides.
s[#] -2 Similarity measure between different nucleotides.
u1 1 Proportional coefficient of the gap-weighting function for a gap
longer than k1.
k1 10 The flection point.
shldr 100 Window size ( >= 0 ), see note 1.
trial 0 Number of pairs of randomized sequences used for a Monte Carlo test
for the significance of sequence similarity.
lpw 60 Number of residues per line in an output alignment.
lseg 2 Number of pieces of linear functions that represent a gap-weighting
function [1 - 3], see Ref. [2]. For DNA vs. protein sequence
alignment, lseg = 2 (affine gap penalty) and lseg = 3 (restricted
affine penalty) indicate the potential existence of introns in the
DNA sequence.
pmode 1 Control the format of the output alignment.
consl 0 do. How strictly to regard a site as a consensus site.
lorn 0 do. Sequence label is name (0) or numbering (1).
cmode 1 Calculation mode [0 - 1], see note 2.
algor 0 Select the algorithm for group-to-group alignment. If 0, the
program chooses conceivably the best one.
av/sum 1 Sum of pairs measure (1) or averaged score (0).
(note 1)
The program sets a window of size (|M - N| + 2 * shldr) along the center of
the main diagonal, where M and N are the lengths of the sequences. For DNA
vs. protein sequence alignment, the length of the protein sequence is
multiplied by 3 to calculate the window size.
(note 2) Mode = 0: Get only the distance value and other statistics (no alignment). Mode = 1: Get a single optimal alignment.
option default range description -AN 0 0-5 Select algorithm. N=0: aln's choice; N=1: DP-AF; N=2: DP-RA; N>=3: CL-AF. A value of other than 0 is not recommended. -FN 1 1-9 Output format. N=1-5: native; N=6: Phylip; N=7: GCG; N=8: GDE; N=9: Concatenated Fasta. -lN 60 >8 Set line width = N. -L[N] 0 0-15 Control penalty values given to terminal gaps. -mS Amino acid exchange matrix. -n Report only statistics rather than an alignment. -oS In the interactive mode, S indicates the default directory where the result is written. In other modes, S is the file name itself. If this option is not set, the output goes to stdout. -ON Output mode. See the 'Gene structure prediction'. -p Prompt to enter parameter values interactively. -r Don't remove intermediate files (with -a or -b). -RN 0 >=0 The number of random sequence pairs used for a shuffle test. -sS Set the default path where sequence files reside. -TS ALN_TAB The 'table' directory. -uN 2 >=0 Set the gap-extension penalty u = N. -vN 9 >=0 Set the gap-opening penalty v = N. -wN 100 >=0 Set shldr = N. -ybN 0 Add N to each element of an amino acid exchange matrix. -yeN 0 >=0 Additional penalty assigned to each deletion. -yiN* 68 >0 Intron penalty. -yjN 0 >=0 The u' value in an RA gap penalty. Must be 0 >= u' < u. -ykN 10 >0 The K value in an RA gap penalty. -ylN* 1 1-3 N=1: assume the absence of introns; N=2: assume the presence of introns, and use the AF penalty: N=3: assume the presence of introns, and use the RA penalty. -ymN 2 >=0 Similarity measure between identical nucleotides. -ynN 2 <=0 Similarity measure between different nucleotides. -ypN 250 0-300 The 'PAM' value of the mutation data matrix. -yuN 2 >=0 Same as -uN. -yvN 9 >=0 Same as -vN. -ywN 100 >=0 Same as -wN. -yxN* 20 >=0 Penalty for a frameshift. -yyN* 16 >=0 Relative contribution rate of translational and splicing boundary signals to the total alignment scores. -yzN* 1 >=0 Relative contribution rate of coding potential to the total alignment scores.With options -p, -P, -q, and -Q, you are asked to input new parameter values just as with 'p' + 'P' commands in the Interactive mode. Prompt and echo are simply for your convenience. Use -p when inputting from a terminal, -P or -Q when inputting from a redirected file, and -q if you like quiet.
-acatalog Construct multiple-sequence alignment simply by adding on in the
order of the file names in the catalog.
-btree Construct multiple-sequence alignment by the progressive method.
-ecatalog All pairs of sequences in the catalog.
-fcatalog Given sequence vs. all sequences in the catalog.
-gcatalog Every sequence in group1 vs. every sequence in group2. The two
groups are separated by a blank line.
-hcatalog All pairs of sequences in the catalog in a different order from -e.
-icatalog Self comparisons. Not useful with aln.
-jcatalog (2i-1)-th sequence vs. 2i-th sequence (i > 0).
upg/nj is not included in the present distribution version. You can perform the equivalent calculations with the prrn -b option. Therefore, the following explanations might be obsolete, but are retained for compatibility with earlier versions.
AA --------- 1 | BB | 0 | Content of a tree file. CC | AA,...EE show sequence file names. 3 | Numbers indicate the order of calculations. DD | 2 | EE --------In this example, BB and CC are aligned first, and the result is stored in a file named _s0 in the current directory. _s0 is then aligned with AA to give _s1. Alignment between DD and EE is stored in _s2, which is further aligned with _s1, to give _s3, the final result. The number between sequence names indicates the distance to the internal node from the leaf furthest from the root. Since the present implementation uses only topological information, you need not care about precise distance values. These intermediate files _s.. are deleted automatically unless you give the '-r' command line option, and the final result is written in either the file specified by -o option or './ALN' if not specified. A second run of 'aln -b tree' may replace the previous files, so that the final result (ALN or _s3 with -r option in this example) should be copied or renamed before another run of aln.
The strategy adopted here is the so-called 'progressive method' similar to those proposed by Waterman and Perlwitz (1984) Bull. Math. Biol. 46, 567-577, Feng and Doolittle (1987) J. Mol. Evol. 25, 351-360, and others. This strategy rapidly generates a reasonably good alignment, which may be further refined by prrn. Please refer to the accompanying document 'prrn.doc' for further details.
[[path][filename]] [start] [end] [attribute]With the -g option, a blank line separates two groups, otherwise a blank line (optional) indicates the end of the list.
CodePot TransInit TransTerm Splice3 Splice5 Intron53 (optional) AlnParam (optional)These files are included in the distribution package. At this moment, however, only a single set for the C. elegans genome is provided.
By the default setting, the nucleotide sequence is conceptually translated in three frames and the translated sequence is optimally matched with the protein sequence. This mode is most useful for detecting potential frameshift errors in an mRNA or cDNA sequence.
The option -yl2 or -yl3 tells the program that the nucleotide sequence may contain introns which are skipped at translation. As a consequence, the exon/intron structure of a genomic sequence can be predicted [5,6]. An affine (-yl2) or a restricted affine (-yl3) function may be chosen to penalize a gap (indel) at the translated amino-acid sequence level. The -L option should also usually be set to perform semi-global alignment. The standard command would look like:
% aln -yl2 (or -yl3) -L -ON 'DNA N1 N2' reference (sense strand) % aln -yl2 (or -yl3) -L -ON 'DNA N1 N2 <' reference (antisense strand)where N1 and N2 specify the region in the genomic DNA sequence within which the objective gene may reside, and 'reference' indicates a protein sequence or profile used as the reference. The -ON option controls the output:
-O0, -O4: gene sequence. -O1, -O5: alignment between the DNA sequence and the protein. -O2, -O6: catenated exonic sequence i.e. 'cNDA' sequence. -O3, -O7: translated amino acid sequence.With -O4, -O6, and -O7, boundary signal strengths are also reported after the end of the sequence. In addition, translation is conducted after correction for potential frameshift errors with -O7, while no correction is made with -O3. However, less memory is required with -O0, -O2 and -O3 than with other selections.
A GenBank-like format is also obtainable if you use -Fg (or -F1) instead of -ON.
% aln -yl2 (or -yl3) -L -Fg 'DNA N1 N2' reference > OutputFile (sense strand) % aln -yl2 (or -yl3) -L -Fg -oOutputFile 'DNA N1 N2 <' reference (antisense strand)The accompanying file 'Exam' contains some examples for the application of aln to test data with several different option settings.
Copyright © 1997-2009 Osamu Gotoh