ALN Information




Present Version: 3.5.2 Last Update: 2010-02-06

Change from previous versions

Added/modified in Ver. 3.5.2 (2010-02-06)

  1. Bugs with -L option have been fixed.
Added/modified in Ver. 3.5.1 (2009-12-26)

  1. For alignment of single sequences (Sa x Sa, Sn x Sn), the quick mode is added in (1A) and (1N). To use this mode, set QN (N = 1, 2, or 3) option. The quick mode has already been implemented in spliced alignment, (3A) and (3N).
  2. If information about the exon-intron organization of the corresponding gene is provided in the sequence files and yBN (N > 0) option is set, a bonus is given to a match of intron positions along the resultant alignment.
  3. The LS (local similarity) option invokes Smith-Waterman type local alignment. Thus aln can now virtually replace swg that is an implementation of the Smith-Waterman algorithm modified by Gotoh [2] so that suboptimal local alignments can also be reported. However, aln sets the cutting corners approximation, and hence some similarities far from the main diagonals might be missed. In combination with M2 option, aln reports only "unique" alignments suppressing repetitious outputs, which may be reported with M3 option.

Overview

Aln is a program for aligning a pair of nucleotide or amino acid sequences or groups of sequences. Aln accepts various combinations of distinct types of inputs: a single nucleotide sequence (SN), a single amino acid sequence (SA), a pre-aligned group of nucleotide sequences (MN), and a group of amino acid sequences (MA). Notably, aln can now perform mixed alignment between a nucleotide sequence and a protein sequence or a group of protein sequences [5]. This novel feature of aln enables us to predict eukaryotic gene structures (protein-coding exons) based on sequence homology with known protein sequence(s). The details of this procedure will be described in a later section entitled 'Gene-structure prediction'.
In sequence alignment, the hardest problem is to develop an efficient optimization algorithm under realistic constraints on deletions and insertions (indels or gaps). Currently the most popular methods use affine functions (AF) of the form, w(k) = uk + v, to penalize an indel [1]. A slightly more general penalty function is a piecewise linear function [2], a special and the simplest form of which is a restricted affine function (RA), w(k) = uk + v for k <= K and w(k) = w(K) + u'(k-K) for (k > K), where u, u' (< u), v, and K are non-negative constants. With the default settings, aln automatically selects seemingly the best algorithm, depending on the types of input sequences, as follows.

(1A) SA x SA : DP-RA
(1N) SN x SN : DP-RA
(2A) SA x MA or MA x MA : CL-AF
(2N) SN x MN or MN x MN : CL-AF
(3) SN x SA or SN x MA : DP-AF

where DP and CL imply dynamic programming and candidate-list algorithms, respectively. CL is a generalized version of DP, and was introduced to rigorously optimize a group-to-group alignment score with an affine gap-penalty function [3]. Aln converts the input MA or MN into a generalized profile [4], if the performance is expected to be improved compared to the ordinary character-based method.

References

[1] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162, 705-708.

[2] Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol. 52, 359-373.

[3] Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to multiple sequence alignment." CABIOS 9, 361-370.

[4] Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with generalized profile operations." CABIOS 10, 379-387.

[5] Gotoh, O. (1998) "Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes." Mol. Biol. Evol. 15, 1447-1459.

[6] Gotoh, O. (2000) " Homology-based gene structure prediction: simplified matching algorithm by the use of translated codon (tron) and improved accuracy by allowing for long gaps." Bioinformatics, 16, 190-202.

[7] Gotoh, O. "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence" Nucleic Acids Research 36 (8) 2630-2638 (2008).
[8] Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence" Bioinformatics 24 (21) 2438-2444 (2008).

System

The programs are written in C/C++ (ISO/C++), and distributed as source code. Users must compile the programs on their own system. Although the programs have been tested only on linux and Solaris 2.2-5, they are likely to be portable to most UNIX systems with little or no modifications. The distribution package also contains source codes of a few related programs, including prrn, etc.

Sequence File

A sequence file may contain either a single sequence or a set of pre-aligned sequences. Avoid file names starting with a numeral.

Single sequence

Almost all popular formats for nucleotide and amino acid sequences, FASTA, GenBank, EMBL, SwissProt, PIR, and ProDB, are acceptable. A simple text file without any additional information is also fine. In any format, a line beginning with a semicolon (;) is regarded as a comment line. The specific format is recognized automatically by the first word in the first non-comment line of each file. Only single-letter codes are accepted for an amino acid. The nomenclature recommended by NC-IUB (Eur. J. Biochem. (1985) 150, 1-5) is followed to represent ambiguous nucleotides. The programs are not case-sensitive.

Sequential format of multiple sequences

The concatenation of multiple sequences in one of the above single-sequence formats, except for the naked sequence-alone format, may be used as a program input. However, the first line of the file must consist of two numbers, the number of sequences and the length of the longest sequence, separated by one or more spaces. Deletion characters are automatically padded at the end of shorter sequences. If the sequences are prealigned, internal deletion characters are preserved. Two examples of sequential formats are shown below.

File 1: |File 2:
|
3 16 |2 16
>Seq1 |LOCUS Seq1
aaatttcccggg|ORIGIN
// |1 AAATTTCCCGGG
>Seq2 |//
atcgatcgatcgat |LOCUS NCODE
// |ORIGIN
>NCODE |1 -ACMGRSVTWYHKDBN
ACMGRSVTWYHKDBN|//
// |

Interleaved (native) format of multiple sequences
This native format is designed for multiple-sequence alignment to be naturally recognized by human eyes. The alignment produced by aln can be used as an input to aln or prrn, and this is the most common way to have access to sets of pre- aligned sequences. Thus, the format of an aligned sequence file is the same as the default output format of aln. The first non-blank line in a file must indicate the number of sequences, N, involved in the alignment. This number is obtained as the sum of numbers in square brackets, e.g., when the first line is

Seq1[3] - Seq2[4]

N is calculated to be 7. Subsequent lines up to the first blank line are ignored. The rest of the file is composed of one or more blocks of a fixed column width of less than 254 characters. Each block is composed of N 'sequence lines' and other (optional) 'non-sequence lines'. The general format of a sequence line is:

<Position> <Sequence> <Name>

where <Position> is a numeral that indicates the sequence position of the first letter in <Sequence|> (Usually all <Position>s in the first block are 1, but it is not a prerequisite. Negative values are also appropriate). A line lacking the <Position> field is regarded as a 'non-sequence line' and ignored upon reading. The i-th sequence line in the second block is concatenated to the i-th sequence line in the first block, and so on. There is no particular limit on N or the length, but the total number of characters to be stored is limited by MAXAREA defined in src/sqio.h. Several examples of native format are provided in the sample directory.

Special characters

Three characters, dash '-', tilde '~' and caret '^' have special meanings in a multiple-sequence file of the native format. A dash indicates a 'deletion' introduced by some alignment procedure. Be careful not to use a space or dot instead of a dash. Spaces and dots are simply ignored, so that the file may be interpreted in a totally unexpected way. A tilde means the same residue as that in the first sequence line on that column in the block. On the other hand, a caret means the same residue as that in the previous sequence line on that column. These ditto characters are convenient in representing an alignment of closely related sequences. Neither '~' nor '^' is allowed in the first sequence line in each block.

Information on gene structures

For a single sequence, lines that start with ";C " indicate the gene (exon-intron) organization of the gene encoding the transcript. The format after ";C " is essentially the same as that of Feature field of a GenBank file. Start and end positions of each exon are separated by two dots. Individual exons are delimited by a comma. See the examples ce13a1 and ce13a2 in sample/pas directory.

For native format of multiple sequences, lines starting with ";B ", ";b ", and ";m " represent the information about organizations of corresponding genes. The first number that follows ";B ", NP, indicates the number of alignment positions where an intron intervenes at least one of the genes corresponding to the aligned protein or cDNA sequences. For a protein sequence, the position means that of coding nucleotide sequence so that phase as well alignment position is also significant. The second number in this line, NI, indicates the total number of introns. The numbers that follow ";b " indicate the number of sequences that contain the intron at the 1st, 2nd, ..., NP-th position. The numbers that follow ";m " present the list of sequences that contain the intron at the 1st, 2nd, ..., NP-th position in this order. See the example ce13a.mfa in sample/pas directory.

User Operation

There are three methods to invoke a program.
[1]	aln [option2]
[2]	aln [option2] seq1 seq2
[3]	aln option1 [option2] [seq2]
Option2 is common to other programs (e.g. prrn), and default values will be used if omitted. Method [1] is a 'conversational mode', which will be discussed in detail below. Method [2] is an 'instant mode'; the calculation starts immediately using the sequence data in files seq1 and seq2. Method [3] is a 'batch mode', which will be discussed after explanation of conversational mode.

Notice

At the beginning of each process, several files in the 'table' directory are read. Follow the instructions under Install to properly locate these files.

Usually, aln automatically judges the type of a sequence (amino acids or nucleotides) according to the composition of characters read in. You may explicitly specify the sequence type by adding an 'attribute' as described below.

[1] Conversational mode

This menu-driven mode provides an easy way to use the programs. You may read a sequence file, change parameter values, or send the results to a desired device or file.

Menu

In response to the prompt, you can select one or more single-letter commands. Case is significant except for a few commands. Numerals must be separated from other commands by one or more spaces, whereas other characters should be typed contiguously. The commands are processed from left to right. For example
	Menu Prompt: 1 2 pgr 
is equivalent to
	Menu Prompt: 1 
	Menu Prompt: 2 
	Menu Prompt: p 
	Menu Prompt: g 
 	Menu Prompt: r 

Sequence

When the programs successfully start, or when you select a numeral (1 or 2) as the menu command, input of a sequence is prompted. (If you give a plus sign in addition to a numeral, i.e., 1+ or 2+, the new sequence is added (concatenated) to the existing one, otherwise the new sequence replaces the older one.) You should respond with
     [[path][filename]] [start] [end] [attribute]
Any terms may be omitted, except start when you wish to specify end. When path is omitted, the default path is used (this can be specified with option2, see [2] instant mode below). Start and end specify the range of the sequence to be analyzed. The default values are 1 and the last position of the sequence, respectively. Some attribute terms are applicable only to a nucleotide sequence. Currently meaningful attributes are defined below.
^	: the sequence is complemented. (nucleotide sequence only)
-	: the sequence is reversed.
<	: the sequence is reverse-complemented.
%	: forces the sequences into profile.
A	: amino acid sequence.
P	: amino acid sequence.
D	: nucleotide sequence.
R	: nucleotide sequence.
N	: nucleotide sequence, delete ambiguous codes ('N').
T	: tron sequence.  See ref. 5.
@N	: read coding parts only, according to the N-th CDS (GenBank only).
#N	: retain only the columns composed of >N% of non-deletion characters.
If there is no attribute character (default), the sequence remains as read. If filename is specified, the remaining terms are reset to the default values, otherwise non-specified terms are unchanged. Of course, filename should not be omitted in the beginning session.

Parameters

When you select the 'p' or 'P' command in the menu, you can change the current parameter values, which are shown in parentheses. Lowercase 'p' and uppercase 'P' refer to different sets of parameters. Carriage/Return terminates the input, leaving the remaining terms unchanged. Hence, only hit Carriage/Return if you do not want to change the current values shown in the line. The following table shows the default values and meanings of the parameters used. Note that the default values vary with the sequence types to be aligned, and may be different from those shown here.

Parameters

     Default                    Meaning

pam	250	PAM level of Dayhoff's mutation data matrix MDM [0 - 300] step by
		50.
bias	0	Value to be added to each element of MDM.
scale	1	Precision of the MDM matrix ( >= 1 ).
 v	9	Constant term of the gap-weighting function.
 u	2	Proportional coefficient of the gap-weighting function for a gap
		shorter than or equal to k1.
s[=]	2	Similarity measure between identical nucleotides.
s[#]	-2	Similarity measure between different nucleotides.
u1	1	Proportional coefficient of the gap-weighting function for a gap
		longer than k1.
k1	10	The flection point.
shldr	100	Window size ( >= 0 ), see note 1.
trial	0	Number of pairs of randomized sequences used for a Monte Carlo test
		for the significance of sequence similarity.
lpw	60	Number of residues per line in an output alignment.
lseg	2	Number of pieces of linear functions that represent a gap-weighting
		function [1 - 3], see Ref. [2].  For DNA vs. protein sequence
		alignment, lseg = 2 (affine gap penalty) and lseg = 3 (restricted
		affine penalty) indicate the potential existence of introns in the
		DNA sequence.
pmode	1	Control the format of the output alignment.
consl	0	do. How strictly to regard a site as a consensus site.
lorn	0	do. Sequence label is name (0) or numbering (1).
cmode	1	Calculation mode [0 - 1], see note 2.
algor	0	Select the algorithm for group-to-group alignment.  If 0, the
		program chooses conceivably the best one.
av/sum	1	Sum of pairs measure (1) or averaged score (0).

(note 1)
The program sets a window of size (|M - N| + 2 * shldr) along the center of
the main diagonal, where M and N are the lengths of the sequences.   For DNA
vs. protein sequence alignment, the length of the protein sequence is
multiplied by 3 to calculate the window size.
(note 2)
Mode = 0:	Get only the distance value and other statistics (no alignment).
Mode = 1:	Get a single optimal alignment.

Go, and Reprint

The 'g' command starts actual calculation. After calculation is complete, aln prompts for where the result is to be output. At the prompt, select a number (-1, 0, 1, 2, 3, or 4, the default is 0). When you select 3 or 4, the destination file name is prompted (if you select 4, the file is opened in an append mode). The alignment is saved in a file which may be further edited or used as an input file for multiple-sequence alignment. The 'r' command re-displays the latest result, and is effective only after a 'g' command.

Quit

To exit from the program, type 'q' in response to the menu prompt.

[2] Instant mode

This mode is the simplest to use. Two sequence files to be compared are given in the command line arguments, and the result is reported to stdout (standard output). Command line options, option2, are common to all modes, but they are described here because they are most likely to be used in non-conversational modes.

Option2

Each option is specified by a dash followed by a single character. Some options take a second argument as shown below, where N is a number (real or integer), and S is a word (file name in most cases). The options with an asterisk are meaningful only for DNA vs. protein sequence alignment.
option	default	range	description
-AN	0	0-5	Select algorithm.  N=0: aln's choice; N=1: DP-AF; N=2: DP-RA;
			N>=3: CL-AF.  A value of other than 0 is not recommended.
-FN	1	1-9	Output format.  N=1-5: native; N=6: Phylip; N=7: GCG; N=8: GDE;
			N=9: Concatenated Fasta.
-lN	60	>8	Set line width = N.  
-L[N]	0	0-15	Control penalty values given to terminal gaps.
-mS			Amino acid exchange matrix.
-n			Report only statistics rather than an alignment.
-oS			In the interactive mode, S indicates the default directory where
			the result is written.  In other modes, S is the file name
			itself.  If this option is not set, the output goes to stdout.
-ON			Output mode.  See the 'Gene structure prediction'.
-p			Prompt to enter parameter values interactively.
-r			Don't remove intermediate files (with -a or -b).
-RN	0	>=0	The number of random sequence pairs used for a shuffle test.
-sS			Set the default path where sequence files reside.
-TS	ALN_TAB		The 'table' directory.
-uN	2	>=0	Set the gap-extension penalty u = N.
-vN	9	>=0	Set the gap-opening penalty v = N.
-wN	100	>=0	Set shldr = N.

-ybN	0		Add N to each element of an amino acid exchange matrix.
-yeN	0	>=0	Additional penalty assigned to each deletion.
-yiN*	68	>0	Intron penalty.
-yjN	0	>=0	The u' value in an RA gap penalty.  Must be 0 >= u' < u.
-ykN	10	>0	The K value in an RA gap penalty.
-ylN*	1	1-3	N=1: assume the absence of introns; N=2: assume the presence of
			introns, and use the AF penalty: N=3: assume the presence of
			introns, and use the RA penalty.
-ymN	2	>=0	Similarity measure between identical nucleotides.
-ynN	2	<=0	Similarity measure between different nucleotides.
-ypN	250	0-300	The 'PAM' value of the mutation data matrix.
-yuN	2	>=0	Same as -uN.
-yvN	9	>=0	Same as -vN.
-ywN	100	>=0	Same as -wN.
-yxN*	20	>=0	Penalty for a frameshift.
-yyN*	16	>=0	Relative contribution rate of translational and splicing
			boundary signals to the total alignment scores.
-yzN*	1	>=0	Relative contribution rate of coding potential to the total
			alignment scores.
With options -p, -P, -q, and -Q, you are asked to input new parameter values just as with 'p' + 'P' commands in the Interactive mode. Prompt and echo are simply for your convenience. Use -p when inputting from a terminal, -P or -Q when inputting from a redirected file, and -q if you like quiet.

[3] Batch mode

This mode enables several calculations with a single command. Given a list of sequence names in a catalog file, you can compare various combinations of sequences according to option1. Option1's are mutually exclusive; you should choose one of them. In a special case, you get a multiple-sequence alignment by recurrently applying pairwise alignment to groups of sequences. In this case, you must prepare a special format of tree file instead of a catalog file.

Option1

-acatalog   Construct multiple-sequence alignment simply by adding on in the
            order of the file names in the catalog.
-btree      Construct multiple-sequence alignment by the progressive method.
-ecatalog   All pairs of sequences in the catalog.
-fcatalog   Given sequence vs. all sequences in the catalog.
-gcatalog   Every sequence in group1 vs. every sequence in  group2.  The two
            groups are separated by a blank line.
-hcatalog   All pairs of sequences in the catalog in a different order from -e.
-icatalog   Self comparisons.  Not useful with aln.
-jcatalog   (2i-1)-th sequence vs. 2i-th sequence (i > 0).

Tree file


<<<Notes on Version 3.0>>>

upg/nj is not included in the present distribution version. You can perform the equivalent calculations with the prrn -b option. Therefore, the following explanations might be obsolete, but are retained for compatibility with earlier versions.


The tree topology may be specified in the 'New Hampshire Standard' format, in which the OTU names correspond to the sequence file names to be aligned. Alternatively, you may directly indicate the order of calculation as shown in the following example.
AA     ---------
1               |
BB              |
0               |    Content of a tree file.
CC              |    AA,...EE show sequence file names.
3               |    Numbers indicate the order of calculations.
DD              |
2               |
EE     --------
In this example, BB and CC are aligned first, and the result is stored in a file named _s0 in the current directory. _s0 is then aligned with AA to give _s1. Alignment between DD and EE is stored in _s2, which is further aligned with _s1, to give _s3, the final result. The number between sequence names indicates the distance to the internal node from the leaf furthest from the root. Since the present implementation uses only topological information, you need not care about precise distance values. These intermediate files _s.. are deleted automatically unless you give the '-r' command line option, and the final result is written in either the file specified by -o option or './ALN' if not specified. A second run of 'aln -b tree' may replace the previous files, so that the final result (ALN or _s3 with -r option in this example) should be copied or renamed before another run of aln.

The strategy adopted here is the so-called 'progressive method' similar to those proposed by Waterman and Perlwitz (1984) Bull. Math. Biol. 46, 567-577, Feng and Doolittle (1987) J. Mol. Evol. 25, 351-360, and others. This strategy rapidly generates a reasonably good alignment, which may be further refined by prrn. Please refer to the accompanying document 'prrn.doc' for further details.

Catalog file

Each line in a catalog file has the same format as that used for input of a sequence name under Conversational mode:
[[path][filename]] [start] [end] [attribute]
With the -g option, a blank line separates two groups, otherwise a blank line (optional) indicates the end of the list.

Gene-structure prediction

The latest version of aln supports mixed alignment between a nucleotide sequence and a protein sequence or a protein profile. To run the program in this mode, you must confirm that the following five files are present in the 'table' directory.
CodePot
TransInit
TransTerm
Splice3
Splice5
Intron53 (optional)
AlnParam (optional)

These files are included in the distribution package. At this moment, however, only a single set for the C. elegans genome is provided.

By the default setting, the nucleotide sequence is conceptually translated in three frames and the translated sequence is optimally matched with the protein sequence. This mode is most useful for detecting potential frameshift errors in an mRNA or cDNA sequence.

The option -yl2 or -yl3 tells the program that the nucleotide sequence may contain introns which are skipped at translation. As a consequence, the exon/intron structure of a genomic sequence can be predicted [5,6]. An affine (-yl2) or a restricted affine (-yl3) function may be chosen to penalize a gap (indel) at the translated amino-acid sequence level. The -L option should also usually be set to perform semi-global alignment. The standard command would look like:

% aln -yl2 (or -yl3) -L -ON 'DNA N1 N2' reference	(sense strand)

% aln -yl2 (or -yl3) -L -ON 'DNA N1 N2 <' reference	(antisense strand)
where N1 and N2 specify the region in the genomic DNA sequence within which the objective gene may reside, and 'reference' indicates a protein sequence or profile used as the reference. The -ON option controls the output:
-O0, -O4:	gene sequence.
-O1, -O5:	alignment between the DNA sequence and the protein.
-O2, -O6:	catenated exonic sequence i.e. 'cNDA' sequence.
-O3, -O7:	translated amino acid sequence.
With -O4, -O6, and -O7, boundary signal strengths are also reported after the end of the sequence. In addition, translation is conducted after correction for potential frameshift errors with -O7, while no correction is made with -O3. However, less memory is required with -O0, -O2 and -O3 than with other selections.

A GenBank-like format is also obtainable if you use -Fg (or -F1) instead of -ON.

% aln -yl2 (or -yl3) -L -Fg 'DNA N1 N2' reference > OutputFile 	(sense strand)

% aln -yl2 (or -yl3) -L -Fg -oOutputFile 'DNA N1 N2 <' reference (antisense strand)
The accompanying file 'Exam' contains some examples for the application of aln to test data with several different option settings.

Copyright © 1997-2009 Osamu Gotoh