PRRN information




Present Version: 3.1.3 Last Update: 2010-02-06

Overview

The program prrn is an implementation of the randomized iterative strategy for multiple sequence alignment. Prrn accepts either nucleotide or protein sequences. Prrn repeatedly uses pairwise group-to-group alignment to improve the overall weighted sum-of-pairs score at each iterative step, where the pair weights are introduced to correct for uneven representations of the sequences to be aligned. The weights are fixed in the earlier (RIW) method, whereas the doubly-nested randomized iterative (DNR) method described in Ref. [8] tries to make alignment, phylogenetic tree and pair weights mutually consistent. These two strategies can be selected by setting different options, and DNR is now the default. These are hill-climbing strategies, and the true optimum may not be attained. These strategies work most effectively for refining a crude alignment obtained by other more rapid methods, e.g. progressive alignment. Setting option -a or -b generates such a provisional alignment from unaligned sequences.

Major changes from previous versions

Change from previous versions

Added/modified in Ver. 3.1.3 (2010-02-06)

  1. Conservation of gene (exon-intron) structure can be taken into account by setting yBN option (N > 0). This option is motivated by the fact that gene structures are generally better conserved than amino acid sequences.
Added/modified in Ver. 3.1.1

  1. 1. After version 3.1, the source codes are ANSI-C/C++ compatible but no longer compatible with a KR-compiler.
  2. -L[N] option is added. If set, terminal gaps are not penalized, i.e. semi-global alignment is obtained. However, this option sometimes produces unexpected results. Use this option only to refine an existing alignment.
Added/modified in Ver. 3.1.0

  1. Some serious bugs were fixed. The frequency of core dumps, which often occurred when many or complex gaps were present, has been greatly reduced.
  2. An "update mode" was added, with which certain members within an existing alignment are replaced by new ones. This mode is designed to cooperate with a eukaryotic gene structure prediction method so that predicted structures of a set of paralogous genes can be refined to yield progressively better alignment of translation products (Ref. 10).
  3. After version 3.0, the single program 'prrn' can accept either nucleotide or protein sequences. The program usually judges the type according to the composition of input characters.
  4. After version 3.0, the objective score is maximized rather than minimized, so that the sign of a score value is opposite that reported by previous versions of the programs.
  5. After version 2.5, the source codes are ANSI-C-compatible. This may improve portability of the programs. Moreover, the gcc compiler, which supports ANSI-C, generates better (smaller and faster) code@than the original SUN compiler.
  6. Rigorous optimal alignment between two groups of sequences is the key procedure in the present iterative strategy. An improved algorithm was devised, which simplified the codes and improved the execution rate by an average of about 20%.
  7. When either of the two groups contains no internal gap (typically a single sequence), the alignment algorithm can be simplified and calculation time is considerably reduced. Distinct subroutines were prepared to incorporate these, which are found in more than half of tree- dependent partitioning.
  8. Optional output formats now include GCG's MSF format (-F7). Compatibility with Phylip format has been improved (-F6).

References

[1] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162, 705-708.

[2] Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol. 52, 359-373.

[3] Berger, M.P., and Munson, P.J. (1991) "A novel randomized iterative strategy for aligning multiple protein sequences." CABIOS 7, 479-484.

[4] Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to multiple sequence alignment." CABIOS 9, 361-370.

[5] Gotoh, O. (1993) "Extraction of conserved or variable regions from a multiple sequence alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.

[6] Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with generalized profile operations." CABIOS 10, 379-387.

[7] Gotoh, O. (1995) "A weighting system and algorithm for aligning many phylogenetically related sequences." CABIOS, 11, 543-551.

[8] Gotoh, O. (1996) "Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments." J. Mol. Biol. 264, 823-838.

[9] Gotoh, O. (1999) "Multiple sequence alignment: algorithms and applications." Adv. Biophys. 36, 159-206.

[10] Gotoh, O., Yamada, S., and Yada, T. (2006) Multiple Sequence Alignment, in Handbook of Computational Molecular Biology, (Aluru, S. ed.) Chapman & Hall/CRC, Computer and Information Science Series, Vol. 9, pp. 3.1-3.36.

System

The programs are written in C++ (ISO/C++), and distributed as source code. Users must compile the programs on their own system. Although the programs have been tested only on linux and Solaris 2.2-5, they are likely to be portable to most UNIX systems with little or no modifications. The distribution package also contains source codes of a few related programs, including prrn, etc.

Sequence File

A sequence file may contain either a single sequence or a set of pre-aligned sequences. Avoid file names starting with a numeral.

Single sequence

Almost all popular formats for nucleotide and amino acid sequences, FASTA, GenBank, EMBL, SwissProt, PIR, and ProDB, are acceptable. A simple text file without any additional information is also fine. In any format, a line beginning with a semicolon (;) is regarded as a comment line. The specific format is recognized automatically by the first word in the first non-comment line of each file. Only single-letter codes are accepted for an amino acid. The nomenclature recommended by NC-IUB (Eur. J. Biochem. (1985) 150, 1-5) is followed to represent ambiguous nucleotides. The programs are not case-sensitive.

Sequential format of multiple sequences

The concatenation of multiple sequences in one of the above single-sequence formats, except for the naked sequence-alone format, may be used as a program input. However, the first line of the file must consist of two numbers, the number of sequences and the length of the longest sequence, separated by one or more spaces. Deletion characters are automatically padded at the end of shorter sequences. If the sequences are prealigned, internal deletion characters are preserved. Two examples of sequential formats are shown below.

File 1: |File 2:
|
3 16 |2 16
>Seq1 |LOCUS Seq1
aaatttcccggg|ORIGIN
// |1 AAATTTCCCGGG
>Seq2 |//
atcgatcgatcgat |LOCUS NCODE
// |ORIGIN
>NCODE |1 -ACMGRSVTWYHKDBN
ACMGRSVTWYHKDBN|//
// |

Interleaved (native) format of multiple sequences
This native format is designed for multiple-sequence alignment to be naturally recognized by human eyes. The alignment produced by aln can be used as an input to aln or prrn, and this is the most common way to have access to sets of pre- aligned sequences. Thus, the format of an aligned sequence file is the same as the default output format of aln. The first non-blank line in a file must indicate the number of sequences, N, involved in the alignment. This number is obtained as the sum of numbers in square brackets, e.g., when the first line is

Seq1[3] - Seq2[4]

N is calculated to be 7. Subsequent lines up to the first blank line are ignored. The rest of the file is composed of one or more blocks of a fixed column width of less than 254 characters. Each block is composed of N 'sequence lines' and other (optional) 'non-sequence lines'. The general format of a sequence line is:

<Position> <Sequence> <Name>

where <Position> is a numeral that indicates the sequence position of the first letter in <Sequence|> (Usually all <Position>s in the first block are 1, but it is not a prerequisite. Negative values are also appropriate). A line lacking the <Position> field is regarded as a 'non-sequence line' and ignored upon reading. The i-th sequence line in the second block is concatenated to the i-th sequence line in the first block, and so on. There is no particular limit on N or the length, but the total number of characters to be stored is limited by MAXAREA defined in src/sqio.h. Several examples of native format are provided in the sample directory.

Special Characters

Three characters, dash '-', tilde '~' and caret '^' have special meanings in a multiple-sequence file of the native format. A dash indicates a 'deletion' introduced by some alignment procedure. Be careful not to use a space or dot instead of a dash. Spaces and dots are simply ignored, so that the file may be interpreted in a totally unexpected way. A tilde means the same residue as that in the first sequence line on that column in the block. On the other hand, a caret means the same residue as that in the previous sequence line on that column. These ditto characters are convenient in representing an alignment of closely related sequences. Neither '~' nor '^' is allowed in the first sequence line in each block.

User Operation

There are two methods to invoke a program.
[1]  prrn [options1] [options2]
[2]  prrn [options1] [options2] seq1 [seq2 seq3 ...]
Options1 is specific to prrn, whereas options2 is common to other programs, and default values will be used if it is omitted. Method [1] is a 'conversational mode', which will be discussed in detail below. Method [2] is a 'command-line mode'; the calculation starts immediately using the sequence data in file(s) seqn. When two or more sequences or multiple sequences are given, they are combined but not prealigned. Use the -a[N], -b[N], or -U option described below to start the iterative refinement from a prealigned state.

Notice

At the beginning of each process, prrn reads the score files. Follow the instructions above to tell the program where these files are located.

[1] Conversational mode

This menu-driven mode provides an easy way to use the program. You may read a sequence file, change parameter values, or send the results to a desired device or file.
Menu
In response to the prompt, you can select one or more single-letter commands. Case is significant, except for a few commands. Numerals must be separated from other commands by one or more spaces, whereas other characters should be typed contiguously. The commands are processed from left to right. For example
	Menu Prompt: 1 pgr 
is equivalent to
	Menu Prompt: 1 
	Menu Prompt: p 
	Menu Prompt: g 
	Menu Prompt: r 

Sequence

When the programs successfully start, or when you select '1' as the menu command, input of a sequence is prompted. (If you add a plus sign, i.e. 1+, the new sequence is added (concatenated) to the existing one, otherwise the new sequence replaces the older one.) You should respond with
	[[path][filename]] [start] [end] [attribute]
Any terms may be omitted, except start when you wish to specify end. When path is omitted, the default path is used (these can be specified with options2, see [2] instant mode below). Start and end specify the range of the sequence to be analyzed. The default values are 1 and the last position of the sequence, respectively. Some attribute terms are applicable only to a nucleotide sequence. Currently meaningful attributes are defined below.
^	: the sequence is complemented. (nucleotide sequence only)
-	: the sequence is reversed.
<	: the sequence is reverse-complemented.
%	: forces the sequences into profile.
A	: amino acid sequence.
P	: amino acid sequence.
D	: nucleotide sequence.
R	: nucleotide sequence.
N	: nucleotide sequence, delete ambiguous codes ('N').
T	: tron sequence.  See ref. 5.
@N	: read coding parts only, according to the N-th CDS (GenBank only).
#N	: retain only the columns composed of >N% of non-deletion characters.
If there is no attribute character (default), the sequence remains as read. If filename is specified, the remaining terms are reset to the default values, otherwise non-specified terms are unchanged. Of course, filename should not be omitted in the beginning session.

Parameters

When you select the 'p' command in the menu, you can change the current parameter values, which are shown in parentheses. Carriage/Return terminates the input, leaving the remaining terms unchanged. Hence, only hit Carriage/Return if you do not want to change the current values shown in the line. The following tables show the default values and meaning of the parameters used.

Parameters for protein sequences

       Default                    Meaning
 pam    250     PAM level of Dayhoff's mutation data matrix MDM [0-300] step by 50.
 bias     0     Value to be added to each element of MDM.
scale     1     Precision of the MDM matrix ( >= 1 ).
  v       2     Constant term of the gap-weighting function.
  u       9     Proportional coefficient of the gap-weighting function for a gap shorter than or
                equal to k1.
Parameters for nucleotide sequences
 s[=]     2     Similarity measure between identical nucleotides.
 s[#]     2     Similarity measure between different nucleotides.
  v       4     Constant term of the gap-weighting function.
  u       2     Proportional coefficient of the gap-weighting function.
Parameters common to nucleotide and protein sequences
shldr   100      Window size ( >= 0 ), see note 1.
series    1      Number of iteration series.  The result with the best score among the series is
                 finally reported.
omode     0      Message level [0 - 2], see note 2.
algor     0      Select the algorithm for group-to-group alignment.  If 0, the program chooses
                 conceivably the best one.
dmode     2      This mode controls the way multiple sequences are divided into two groups, see
                 note 3.
newrn     1      Set the seed of a series of quasi-random numbers.
group     2      Mutual alignment between some sequences in the given alignment may be fixed, see
                 note 4.
(note 1)
The program sets a window of size (|M - N| + 2 * shldr) along the center of the main diagonal, where M and N are the lengths of the sequences.
(note 2)
Level = 0:    Fewest messages
Level = 1:    The course of refinement is monitored.
All of these messages are reported to stderr.
(note 3)
Mode = 1:     A randomly chosen single sequence is aligned with the remaining (M-1)
              sequences.  The average convergence takes O(M) steps.
Mode = 2:     Choose only the divisions that bisect an unrooted tree representing the
              mutual relationship of the sequences in the given alignment.  The average
              convergence takes O(2M) steps. (default)
(note 4)
Mode = 0:     Previous grouping is used.
Mode = 1:     Groups should be manually indicated.
Mode = 2:     No grouping.

Go and Reprint

The 'g' command conducts the alignment process. The program prompts for where the result is to be output. At the prompt, select a number (-1, 0, 1, 2, 3, or 4, the default is 0). When you select 3 or 4, the destination file name is prompted (if you select 4, the file is opened in an append mode).

Quit

To exit from the program, type 'q' in response to the menu prompt.

[2] Instant mode

This mode is the simplest to use. The sequence file containing the preliminary alignment is given in the command line arguments, and the result is reported to SO or the file specified by option -o (see below). Most of the command line options are also effective in the conversational mode, but they are described here since they are most likely to be used in this mode. Each option is specified by a dash followed by a single character. Some options take a second argument as shown below.
Options1  Default         Meaning

-a[N]     0     Prealign by sequentially adding in the order
                N=0:input; N=1:from shorter; N=2: from longer;
-b[N]     1     Prealign by a progressive method.
-U[N]     0     Same as -a but members in seq1 are replaced by sequences of the same
                names in seqn (n>1).  Seq1 must be a multiple alignment whose members
                should have unique names.
-ES      SE     Destination of supplementary messages.
-GN       2     Grouping.
-HN      20     Threshold value used in finding conserved regions.
-IN      10     Maximum number of iterations in the outer loop.
-JN       1     N=1: UPGMA method; N=2: NJ method for tree.
-ON       0     Set message level (omode) to N.
-RN	            Seed of the series of random numbers.
-SM       1     Number of iteration series.
-t              The order of sequences in the output file is rearranged according to the
                calculated phylogenetic relationship.
Options2  Default       Meaning

-AN       0     Use algorithm N = [0-5].  A value of other than 0 is not recommended.
-FN       1     Format of output. N=1-5: native; N=6: Phylip; N=7: GCG; N=8: GDE; 
                N=9: Concatenated Fasta.
-lN      60     Set lpw (# of residues per line) = N > 8.
-mS             Amino acid exchange matrix.
-oS      SO     Output resultant alignment to file.
-sS      ./     Set the default path to sequence files to path.
-uN       2     Set gap-extension penalty u = N.
-vN     4/9     Set gap-opening penalty v = N.
-wN     100     Set shldr = N.
-ypN    250     Set pam = N.
(Note) SE and SO mean standard error (stderr) and standard output (stdout), respectively.

caveat

If you use the NJ method to construct a phylogenetic tree, a negative edge length may appear. No particular countermeasure has been worked out, so the program may stop unexpectedly or produce nonsensical results in such cases.


Copyright © 1997-2009 Osamu Gotoh