Added/modified in Ver. 3.1.3 (2010-02-06)
[1] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162, 705-708.
[2] Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol. 52, 359-373.
[3] Berger, M.P., and Munson, P.J. (1991) "A novel randomized iterative strategy for aligning multiple protein sequences." CABIOS 7, 479-484.
[4] Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to multiple sequence alignment." CABIOS 9, 361-370.
[5] Gotoh, O. (1993) "Extraction of conserved or variable regions from a multiple sequence alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.
[6] Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with generalized profile operations." CABIOS 10, 379-387.
[7] Gotoh, O. (1995) "A weighting system and algorithm for aligning many phylogenetically related sequences." CABIOS, 11, 543-551.
[8] Gotoh, O. (1996) "Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments." J. Mol. Biol. 264, 823-838.
[9] Gotoh, O. (1999) "Multiple sequence alignment: algorithms and applications." Adv. Biophys. 36, 159-206.
[10] Gotoh, O., Yamada, S., and Yada, T. (2006) Multiple Sequence Alignment, in Handbook of Computational Molecular Biology, (Aluru, S. ed.) Chapman & Hall/CRC, Computer and Information Science Series, Vol. 9, pp. 3.1-3.36.
| File 1: | | | File 2: |
| | | ||
| 3 16 | | | 2 16 |
| >Seq1 | | | LOCUS Seq1 |
| aaatttcccggg | | | ORIGIN |
| // | | | 1 AAATTTCCCGGG |
| >Seq2 | | | // |
| atcgatcgatcgat | | | LOCUS NCODE |
| // | | | ORIGIN |
| >NCODE | | | 1 -ACMGRSVTWYHKDBN |
| ACMGRSVTWYHKDBN | | | // |
| // | | |
Interleaved (native) format of multiple sequences
This native format is designed for multiple-sequence alignment to be naturally
recognized by human eyes. The alignment produced by aln can be used as an input
to aln or prrn, and this is the most common way to have access to sets of pre-
aligned sequences. Thus, the format of an aligned sequence file is the same as
the default output format of aln. The first non-blank line in a file must
indicate the number of sequences, N, involved in the alignment. This number is
obtained as the sum of numbers in square brackets, e.g., when the first line is
Seq1[3] - Seq2[4]
N is calculated to be 7. Subsequent lines up to the first blank line are ignored. The rest of the file is composed of one or more blocks of a fixed column width of less than 254 characters. Each block is composed of N 'sequence lines' and other (optional) 'non-sequence lines'. The general format of a sequence line is:
<Position> <Sequence> <Name>
where <Position> is a numeral that indicates the sequence position of the first letter in <Sequence|> (Usually all <Position>s in the first block are 1, but it is not a prerequisite. Negative values are also appropriate). A line lacking the <Position> field is regarded as a 'non-sequence line' and ignored upon reading. The i-th sequence line in the second block is concatenated to the i-th sequence line in the first block, and so on. There is no particular limit on N or the length, but the total number of characters to be stored is limited by MAXAREA defined in src/sqio.h. Several examples of native format are provided in the sample directory.
[1] prrn [options1] [options2] [2] prrn [options1] [options2] seq1 [seq2 seq3 ...]Options1 is specific to prrn, whereas options2 is common to other programs, and default values will be used if it is omitted. Method [1] is a 'conversational mode', which will be discussed in detail below. Method [2] is a 'command-line mode'; the calculation starts immediately using the sequence data in file(s) seqn. When two or more sequences or multiple sequences are given, they are combined but not prealigned. Use the -a[N], -b[N], or -U option described below to start the iterative refinement from a prealigned state.
Menu Prompt: 1 pgris equivalent to
Menu Prompt: 1 Menu Prompt: p Menu Prompt: g Menu Prompt: r
[[path][filename]] [start] [end] [attribute]Any terms may be omitted, except start when you wish to specify end. When path is omitted, the default path is used (these can be specified with options2, see [2] instant mode below). Start and end specify the range of the sequence to be analyzed. The default values are 1 and the last position of the sequence, respectively. Some attribute terms are applicable only to a nucleotide sequence. Currently meaningful attributes are defined below.
^ : the sequence is complemented. (nucleotide sequence only)
- : the sequence is reversed.
< : the sequence is reverse-complemented.
% : forces the sequences into profile.
A : amino acid sequence.
P : amino acid sequence.
D : nucleotide sequence.
R : nucleotide sequence.
N : nucleotide sequence, delete ambiguous codes ('N').
T : tron sequence. See ref. 5.
@N : read coding parts only, according to the N-th CDS (GenBank only).
#N : retain only the columns composed of >N% of non-deletion characters.
If there is no attribute character (default), the sequence remains as read. If filename is
specified, the remaining terms are reset to the default values, otherwise non-specified terms
are unchanged. Of course, filename should not be omitted in the beginning session.
Default Meaning
pam 250 PAM level of Dayhoff's mutation data matrix MDM [0-300] step by 50.
bias 0 Value to be added to each element of MDM.
scale 1 Precision of the MDM matrix ( >= 1 ).
v 2 Constant term of the gap-weighting function.
u 9 Proportional coefficient of the gap-weighting function for a gap shorter than or
equal to k1.
Parameters for nucleotide sequences
s[=] 2 Similarity measure between identical nucleotides. s[#] 2 Similarity measure between different nucleotides. v 4 Constant term of the gap-weighting function. u 2 Proportional coefficient of the gap-weighting function.Parameters common to nucleotide and protein sequences
shldr 100 Window size ( >= 0 ), see note 1.
series 1 Number of iteration series. The result with the best score among the series is
finally reported.
omode 0 Message level [0 - 2], see note 2.
algor 0 Select the algorithm for group-to-group alignment. If 0, the program chooses
conceivably the best one.
dmode 2 This mode controls the way multiple sequences are divided into two groups, see
note 3.
newrn 1 Set the seed of a series of quasi-random numbers.
group 2 Mutual alignment between some sequences in the given alignment may be fixed, see
note 4.
Level = 0: Fewest messages Level = 1: The course of refinement is monitored.All of these messages are reported to stderr.
Mode = 1: A randomly chosen single sequence is aligned with the remaining (M-1)
sequences. The average convergence takes O(M) steps.
Mode = 2: Choose only the divisions that bisect an unrooted tree representing the
mutual relationship of the sequences in the given alignment. The average
convergence takes O(2M) steps. (default)
Mode = 0: Previous grouping is used. Mode = 1: Groups should be manually indicated. Mode = 2: No grouping.
Options1 Default Meaning
-a[N] 0 Prealign by sequentially adding in the order
N=0:input; N=1:from shorter; N=2: from longer;
-b[N] 1 Prealign by a progressive method.
-U[N] 0 Same as -a but members in seq1 are replaced by sequences of the same
names in seqn (n>1). Seq1 must be a multiple alignment whose members
should have unique names.
-ES SE Destination of supplementary messages.
-GN 2 Grouping.
-HN 20 Threshold value used in finding conserved regions.
-IN 10 Maximum number of iterations in the outer loop.
-JN 1 N=1: UPGMA method; N=2: NJ method for tree.
-ON 0 Set message level (omode) to N.
-RN Seed of the series of random numbers.
-SM 1 Number of iteration series.
-t The order of sequences in the output file is rearranged according to the
calculated phylogenetic relationship.
Options2 Default Meaning
-AN 0 Use algorithm N = [0-5]. A value of other than 0 is not recommended.
-FN 1 Format of output. N=1-5: native; N=6: Phylip; N=7: GCG; N=8: GDE;
N=9: Concatenated Fasta.
-lN 60 Set lpw (# of residues per line) = N > 8.
-mS Amino acid exchange matrix.
-oS SO Output resultant alignment to file.
-sS ./ Set the default path to sequence files to path.
-uN 2 Set gap-extension penalty u = N.
-vN 4/9 Set gap-opening penalty v = N.
-wN 100 Set shldr = N.
-ypN 250 Set pam = N.
(Note) SE and SO mean standard error (stderr) and standard output (stdout), respectively.
Copyright © 1997-2009 Osamu Gotoh