Present Version: 3.1.3 Last Update: 2010-02-06
The program prrn is an implementation of the randomized iterative strategy
for multiple sequence alignment. Prrn accepts either nucleotide or protein
sequences. Prrn repeatedly uses pairwise group-to-group alignment to
improve the overall weighted sum-of-pairs score at each iterative step,
where the pair weights are introduced to correct for uneven representations
of the sequences to be aligned. The weights are fixed in the earlier (RIW)
method, whereas the doubly-nested randomized iterative (DNR) method
described in Ref.  tries to make alignment, phylogenetic tree and pair
weights mutually consistent. These two strategies can be selected by setting
different options, and DNR is now the default. These are hill-climbing
strategies, and the true optimum may not be attained. These strategies
work most effectively for refining a crude alignment obtained by other
more rapid methods, e.g. progressive alignment. Setting option -a
or -b generates such a provisional alignment from unaligned sequences.
Change from previous versions
Added/modified in Ver. 3.1.3 (2010-02-06)
Added/modified in Ver. 3.1.1
- Conservation of gene (exon-intron) structure can be taken into account by setting yBN option (N > 0). This option is motivated by the fact that gene structures are generally better conserved than amino acid sequences.
Added/modified in Ver. 3.1.0
- 1. After version 3.1, the source codes are ANSI-C/C++ compatible but no longer compatible with a KR-compiler.
- -L[N] option is added. If set, terminal gaps are not penalized, i.e. semi-global alignment is obtained. However, this option sometimes produces unexpected results. Use this option only to refine an existing alignment.
- Some serious bugs were fixed. The frequency of core dumps, which often occurred when
many or complex gaps were present, has been greatly reduced.
- An "update mode" was added, with which certain members within an existing alignment are
replaced by new ones. This mode is designed to cooperate with a eukaryotic gene structure
prediction method so that predicted structures of a set of paralogous genes can be refined to
yield progressively better alignment of translation products (Ref. 10).
- After version 3.0, the single program 'prrn' can accept either nucleotide or protein
sequences. The program usually judges the type according to the composition of input
- After version 3.0, the objective score is maximized rather than minimized, so that the
sign of a score value is opposite that reported by previous versions of the programs.
- After version 2.5, the source codes are ANSI-C-compatible. This may improve portability
of the programs. Moreover, the gcc compiler, which supports ANSI-C, generates better (smaller
and faster) code@than the original SUN compiler.
- Rigorous optimal alignment between two groups of sequences is the key procedure in the
present iterative strategy. An improved algorithm was devised, which simplified the codes and
improved the execution rate by an average of about 20%.
- When either of the two groups contains no internal gap (typically a single sequence), the
alignment algorithm can be simplified and calculation time is considerably reduced. Distinct
subroutines were prepared to incorporate these, which are found in more than half of tree-
- Optional output formats now include GCG's MSF format (-F7).
Compatibility with Phylip format has been improved (-F6).
 Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol.
 Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol.
 Berger, M.P., and Munson, P.J. (1991) "A novel randomized iterative strategy for aligning
multiple protein sequences." CABIOS 7, 479-484.
 Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to
multiple sequence alignment." CABIOS 9, 361-370.
 Gotoh, O. (1993) "Extraction of conserved or variable regions from a multiple sequence
alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.
 Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with
generalized profile operations." CABIOS 10, 379-387.
 Gotoh, O. (1995) "A weighting system and algorithm for aligning many phylogenetically
related sequences." CABIOS, 11, 543-551.
 Gotoh, O. (1996) "Significant improvement in accuracy of multiple protein sequence
alignments by iterative refinement as assessed by reference to structural alignments." J. Mol.
Biol. 264, 823-838.
 Gotoh, O. (1999) "Multiple sequence alignment: algorithms and applications." Adv.
Biophys. 36, 159-206.
 Gotoh, O., Yamada, S., and Yada, T. (2006) Multiple Sequence Alignment, in
Handbook of Computational Molecular Biology, (Aluru, S. ed.) Chapman &
Hall/CRC, Computer and Information Science Series, Vol. 9, pp. 3.1-3.36.
The programs are written in C++ (ISO/C++), and distributed as source code.
Users must compile the programs on their own system. Although the programs have
been tested only on linux and Solaris 2.2-5, they are likely to be portable
to most UNIX systems with little or no modifications. The distribution package
also contains source codes of a few related programs, including prrn, etc.
A sequence file may contain either a single sequence or a set of pre-aligned
sequences. Avoid file names starting with a numeral.
Almost all popular formats for nucleotide and amino acid sequences, FASTA,
GenBank, EMBL, SwissProt, PIR, and ProDB, are acceptable. A simple text file
without any additional information is also fine. In any format, a line
beginning with a semicolon (;) is regarded as a comment line. The specific
format is recognized automatically by the first word in the first non-comment
line of each file. Only single-letter codes are accepted for an amino acid.
The nomenclature recommended by NC-IUB (Eur. J. Biochem. (1985) 150, 1-5) is
followed to represent ambiguous nucleotides. The programs are not case-sensitive.
Sequential format of multiple sequences
The concatenation of multiple sequences in one of the above single-sequence
formats, except for the naked sequence-alone format, may be used as a program
input. However, the first line of the file must consist of two numbers, the
number of sequences and the length of the longest sequence, separated by one or
more spaces. Deletion characters are automatically padded at the end of shorter
sequences. If the sequences are prealigned, internal deletion characters are
preserved. Two examples of sequential formats are shown below.
|File 1: |||||File 2:
|3 16 |||||2 16
|>Seq1 |||||LOCUS Seq1
|// |||||1 AAATTTCCCGGG
|atcgatcgatcgat |||||LOCUS NCODE
|>NCODE |||||1 -ACMGRSVTWYHKDBN
Interleaved (native) format of multiple sequences
This native format is designed for multiple-sequence alignment to be naturally
recognized by human eyes. The alignment produced by aln can be used as an input
to aln or prrn, and this is the most common way to have access to sets of pre-
aligned sequences. Thus, the format of an aligned sequence file is the same as
the default output format of aln. The first non-blank line in a file must
indicate the number of sequences, N, involved in the alignment. This number is
obtained as the sum of numbers in square brackets, e.g., when the first line is
Seq1 - Seq2
N is calculated to be 7. Subsequent lines up to the first blank line are
ignored. The rest of the file is composed of one or more blocks of a fixed
column width of less than 254 characters. Each block is composed of N 'sequence
lines' and other (optional) 'non-sequence lines'. The general format of a
sequence line is:
<Position> <Sequence> <Name>
where <Position> is a numeral that indicates the sequence position of the first
letter in <Sequence|> (Usually all <Position>s in the first block are 1, but it
is not a prerequisite. Negative values are also appropriate). A line lacking
the <Position> field is regarded as a 'non-sequence line' and ignored upon
reading. The i-th sequence line in the second block is concatenated to the i-th
sequence line in the first block, and so on. There is no particular limit on N
or the length, but the total number of characters to be stored is limited by
MAXAREA defined in src/sqio.h. Several examples of native format are provided
in the sample directory.
Three characters, dash '-', tilde '~' and caret '^' have special meanings in a
multiple-sequence file of the native format. A dash indicates a 'deletion'
introduced by some alignment procedure. Be careful not to use a space or dot
instead of a dash. Spaces and dots are simply ignored, so that the file may be
interpreted in a totally unexpected way. A tilde means the same residue as that
in the first sequence line on that column in the block. On the other hand, a
caret means the same residue as that in the previous sequence line on that
column. These ditto characters are convenient in representing an alignment of
closely related sequences. Neither '~' nor '^' is allowed in the first sequence
line in each block.
There are two methods to invoke a program.
 prrn [options1] [options2]
 prrn [options1] [options2] seq1 [seq2 seq3 ...]
Options1 is specific to prrn, whereas options2 is common to other programs, and default values
will be used if it is omitted. Method  is a 'conversational mode', which will be discussed
in detail below. Method  is a 'command-line mode'; the calculation starts immediately
using the sequence data in file(s) seqn. When two or more sequences or multiple sequences are
given, they are combined but not prealigned. Use the -a[N], -b[N], or -U option described
below to start the iterative refinement from a prealigned state.
At the beginning of each process, prrn reads the score files. Follow the instructions above
to tell the program where these files are located.
 Conversational mode
This menu-driven mode provides an easy way to use the program. You may read a sequence file,
change parameter values, or send the results to a desired device or file.
In response to the prompt, you can select one or more single-letter commands. Case is
significant, except for a few commands. Numerals must be separated from other commands by one
or more spaces, whereas other characters should be typed contiguously. The commands are
processed from left to right. For example
Menu Prompt: 1 pgr
is equivalent to
Menu Prompt: 1
Menu Prompt: p
Menu Prompt: g
Menu Prompt: r
When the programs successfully start, or when you select '1' as the menu command, input of a
sequence is prompted. (If you add a plus sign, i.e. 1+, the new sequence is added
(concatenated) to the existing one, otherwise the new sequence replaces the older one.) You
should respond with
[[path][filename]] [start] [end] [attribute]
Any terms may be omitted, except start when you wish to specify end. When path is omitted,
the default path is used (these can be specified with options2, see  instant mode below).
Start and end specify the range of the sequence to be analyzed. The default values are 1 and
the last position of the sequence, respectively. Some attribute terms are applicable only to
a nucleotide sequence. Currently meaningful attributes are defined below.
^ : the sequence is complemented. (nucleotide sequence only)
- : the sequence is reversed.
< : the sequence is reverse-complemented.
% : forces the sequences into profile.
A : amino acid sequence.
P : amino acid sequence.
D : nucleotide sequence.
R : nucleotide sequence.
N : nucleotide sequence, delete ambiguous codes ('N').
T : tron sequence. See ref. 5.
@N : read coding parts only, according to the N-th CDS (GenBank only).
#N : retain only the columns composed of >N% of non-deletion characters.
If there is no attribute character (default), the sequence remains as read. If filename is
specified, the remaining terms are reset to the default values, otherwise non-specified terms
are unchanged. Of course, filename should not be omitted in the beginning session.
When you select the 'p' command in the menu, you can change the current parameter values,
which are shown in parentheses. Carriage/Return terminates the input, leaving the remaining
terms unchanged. Hence, only hit Carriage/Return if you do not want to change the current
values shown in the line. The following tables show the default values and meaning of the
Parameters for protein sequences
pam 250 PAM level of Dayhoff's mutation data matrix MDM [0-300] step by 50.
bias 0 Value to be added to each element of MDM.
scale 1 Precision of the MDM matrix ( >= 1 ).
v 2 Constant term of the gap-weighting function.
u 9 Proportional coefficient of the gap-weighting function for a gap shorter than or
equal to k1.
Parameters for nucleotide sequences
s[=] 2 Similarity measure between identical nucleotides.
s[#] 2 Similarity measure between different nucleotides.
v 4 Constant term of the gap-weighting function.
u 2 Proportional coefficient of the gap-weighting function.
Parameters common to nucleotide and protein sequences
shldr 100 Window size ( >= 0 ), see note 1.
series 1 Number of iteration series. The result with the best score among the series is
omode 0 Message level [0 - 2], see note 2.
algor 0 Select the algorithm for group-to-group alignment. If 0, the program chooses
conceivably the best one.
dmode 2 This mode controls the way multiple sequences are divided into two groups, see
newrn 1 Set the seed of a series of quasi-random numbers.
group 2 Mutual alignment between some sequences in the given alignment may be fixed, see
The program sets a window of size (|M - N| + 2 * shldr) along the center of the main
diagonal, where M and N are the lengths of the sequences.
Level = 0: Fewest messages
Level = 1: The course of refinement is monitored.
All of these messages are reported to stderr.
Mode = 1: A randomly chosen single sequence is aligned with the remaining (M-1)
sequences. The average convergence takes O(M) steps.
Mode = 2: Choose only the divisions that bisect an unrooted tree representing the
mutual relationship of the sequences in the given alignment. The average
convergence takes O(2M) steps. (default)
Mode = 0: Previous grouping is used.
Mode = 1: Groups should be manually indicated.
Mode = 2: No grouping.
Go and Reprint
The 'g' command conducts the alignment process. The program prompts for where the result is
to be output. At the prompt, select a number (-1, 0, 1, 2, 3, or 4, the default is 0). When
you select 3 or 4, the destination file name is prompted (if you select 4, the file is opened
in an append mode).
To exit from the program, type 'q' in response to the menu prompt.
 Instant mode
This mode is the simplest to use. The sequence file containing the preliminary alignment is
given in the command line arguments, and the result is reported to SO or the file specified by
option -o (see below). Most of the command line options are also effective in the
conversational mode, but they are described here since they are most likely to be used in this
mode. Each option is specified by a dash followed by a single character. Some options take a
second argument as shown below.
Options1 Default Meaning
-a[N] 0 Prealign by sequentially adding in the order
N=0:input; N=1:from shorter; N=2: from longer;
-b[N] 1 Prealign by a progressive method.
-U[N] 0 Same as -a but members in seq1 are replaced by sequences of the same
names in seqn (n>1). Seq1 must be a multiple alignment whose members
should have unique names.
-ES SE Destination of supplementary messages.
-GN 2 Grouping.
-HN 20 Threshold value used in finding conserved regions.
-IN 10 Maximum number of iterations in the outer loop.
-JN 1 N=1: UPGMA method; N=2: NJ method for tree.
-ON 0 Set message level (omode) to N.
-RN Seed of the series of random numbers.
-SM 1 Number of iteration series.
-t The order of sequences in the output file is rearranged according to the
calculated phylogenetic relationship.
Options2 Default Meaning
-AN 0 Use algorithm N = [0-5]. A value of other than 0 is not recommended.
-FN 1 Format of output. N=1-5: native; N=6: Phylip; N=7: GCG; N=8: GDE;
N=9: Concatenated Fasta.
-lN 60 Set lpw (# of residues per line) = N > 8.
-mS Amino acid exchange matrix.
-oS SO Output resultant alignment to file.
-sS ./ Set the default path to sequence files to path.
-uN 2 Set gap-extension penalty u = N.
-vN 4/9 Set gap-opening penalty v = N.
-wN 100 Set shldr = N.
-ypN 250 Set pam = N.
(Note) SE and SO mean standard error (stderr) and standard output (stdout), respectively.
If you use the NJ method to construct a phylogenetic tree, a negative edge length may appear.
No particular countermeasure has been worked out, so the program may stop unexpectedly or
produce nonsensical results in such cases.
Copyright © 1997-2009 Osamu Gotoh