PRRN, ALN, and REFGS.PL




Present Version: 5.2.0 Last Update: 2020-10-12

Overview

This site contains sequence alignment and homology-based gene prediction programs developed by Osamu Gotoh and associates. Aln performs pairwise alignment between sequences or pre-aligned groups of sequences, whereas Prrn performs multiple sequence alignment (MSA) of DNA or protein sequences. These programs use common internal codes and the same sequence file format described below. Refgs.pl is a Perl script that performs concerted gene prediction and multiple sequence alignment, taking advantage of the unique feature of prrn that generates gene-structure-aware multiple protein sequence alignment (GSA-MPSA). A related program, Spaln, which seamlessly performs genome mapping and spliced alignment, is presented on a separate page. For further details, click on the individual tags.

Install

The programs are written in C++ (ISO/C++) and distributed as source code. Users must compile the programs on their own system. First install aln and the other associated programs, then install prrn. Prrn is installed separately because it uses double while aln and the other programs use float to encode floating-point variables; in very rare cases, prrn compiled with float variables can fail during the refinement phase.

% cd src
% ./configure [--help]
% make all
% make install
% make clearall

% CFLAGS="-O3 -DDVAL=1" ./configure
% make prrn
% make install_prrn
% make clearall

If you have changed the location of the table directory after installation, set the environment variable ALN_TAB:

    % setenv ALN_TAB New_Aln_Tab (csh/tcsh)
    $ export ALN_TAB=New_Aln_Tab (sh/bash)

Sequence File

A sequence file may contain either a single sequence or a set of pre-aligned sequences. Avoid file names starting with a numeral.

Single sequence

Almost all popular formats for nucleotide and amino acid sequences, FASTA, GenBank, EMBL, SwissProt, PIR, and ProDB, are acceptable. A simple text file without any additional information is also fine. In any format, a line beginning with a semicolon (;) is regarded as a comment line. The specific format is recognized automatically by the first word in the first non-comment line of each file. Only single-letter codes are accepted for an amino acid. The nomenclature recommended by NC-IUB (Eur. J. Biochem. (1985) 150, 1-5) is followed to represent ambiguous nucleotides. The programs are not case-sensitive.

To represent the exon-intron structure of the parental gene, the FASTA format is extended. A line starting with ';C' shows the exon boundaries (inclusive). More than one line may be used if necessary. The format after ';C ' is essentially the same as that of the Feature field of a GenBank file. The start and end positions of each exon are separated by two dots. Individual exons are delimited by a comma. The term 'complement' indicates that the corresponding gene lies on the complementary strand of the genomic sequence. Two examples are as follows:

>ce13a1 C. elegans chromosome II positive strand  
;C join(9803525..9803710,9803766..9804097,9804152..9804251,  
;C 9804299..9804855,9804926..9805069,9805115..9805349)  
MSFSILIAIAIFVGIISYYLWIWSFWIRKGVKGPRGLPFLGVIHKFTNYENPGALKFSEW  
TKKYGPVYGITEGVEKTLVISDPEFVHEVFVKQFDNFYGRKLTAIQGDPNKNKRVPLVAA  
QGHRWKRLRTLASPTFSNKSLRKIMGTVEESVTELVRSLEKASAEGKTLDMLEYYQEFTM  
DIIGKMAMGQEKSLMFRNPMLDKVKTIFKEGRNNVFMISGIFPFVGIALRNIFAKFPSLQ  
MATDIQSILEKALNKRLEQREADEKAGIEPSGEPQDFIDLFLDARSTVDFFEGEAEQDFA  
KSEVLKVDKHLTFDEIIGQLFVFLLAGYDTTALSLSYSSYLLATHPEIQKKLQEEVDREC  
PDPEVTFDQLSKLKYLECVVKEALRLYPLASLVHNRKCLKTTNVLGMEIEAGTNINVDTW  
SLHHDPKVWGDDVNEFKPERWESGDELFFAKGGYLPFGMGPRICIGMRLAMMEMKMLLTN  
ILKNYTFETTPETVIPLKLVGTATIAPSSVLLKLKSRF  
[EOF]

>ce13a2 C. elegans chromosome II negative strand  
;C complement(join(9798263..9798503,9798584..9798727,9798905..9799461,  
;C 9799519..9799618,9799680..9800011,9800058..9800243))  
MSLSILIAGASFIGLLTYYIWIWSFWIRKGVKGPRGFPFFGVIHEFQDYENPGLLKLGEW  
TKEYGPIYGITEGVEKTLIVSNPEFVHEVFVKQFDNFYGRKTNPIQGDPNKNKRAHLVSA  
QGHRWKRLRTLSSPTFSNKNLRKIMSTVEETVVELMRHLDDASAKGKAVDLLDYYQEFTL  
DIIGRIAMGQTESLMFRNPMLPKVKGIFKDGRKLPFLVSGIFPIAGTMFREFFMRFPSIQ  
PAFDIMSTVEKALNKRLEQRAADEKAGIEPSGEPQDFIDLFLDARANVDFFEEESALGFA  
KTEIAKVDKQLTFDEIIGQLFVFLLAGYDTTALSLSYSSYLLARHPEIQKKLQEEVDREC  
PNPEVTFDQISKLKYMECVVKEALRMYPLASIVHNRKCMKETNVLGVQIEKGTNVQVDTW  
TLHYDPKVWGEDANEFRPERWESGDELFYAKGGYLPFGMGPRICIGMRLAMMEKKMLLTH  
ILKKYTFETSTQTEIPLKLVGSATTAPRSVMLKLTPRHSN  
[EOF]

The line starting with ';C' may be simplified as follows:

>ce13a1 C. elegans chromosome II positive strand  
;C + 9803525 9803710 9803766 9804097 9804152 9804251  
;C 9804299 9804855 9804926 9805069 9805115 9805349  
...

>ce13a2 C. elegans chromosome II negative strand  
;C - 9798263 9798503 9798584 9798727 9798905 9799461  
;C 9799519 9799618 9799680 9800011 9800058 9800243  
...
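The two ';C' notations above carry the same information, so a reader can normalize both into a strand and a list of inclusive (start, end) pairs. The following is a minimal sketch, not the parser used by aln or prrn; the function name parse_exon_lines is hypothetical, and the sketch assumes that every number on a ';C' line is an exon coordinate and that the strand is marked either by the word 'complement' or by a leading '-'.

```python
import re

def parse_exon_lines(lines):
    """Return (strand, [(start, end), ...]) from the ';C' lines of one record.

    Hypothetical helper: accepts both the GenBank-like form
    (join(a..b,c..d), optionally wrapped in complement(...)) and the
    simplified form ('+'/'-' followed by blank-separated coordinates).
    """
    # Concatenate the payload of all ';C' lines; continuation lines are allowed.
    text = " ".join(line[2:].strip() for line in lines if line.startswith(";C"))
    # Assumption: minus strand is flagged by 'complement' or a leading '-'.
    minus = "complement" in text or text.startswith("-")
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) % 2:
        raise ValueError("odd number of exon coordinates")
    # Pair consecutive numbers as inclusive exon boundaries.
    exons = list(zip(numbers[0::2], numbers[1::2]))
    return ("-" if minus else "+"), exons

if __name__ == "__main__":
    record = [
        ">ce13a1 C. elegans chromosome II positive strand",
        ";C + 9803525 9803710 9803766 9804097 9804152 9804251",
        ";C 9804299 9804855 9804926 9805069 9805115 9805349",
    ]
    strand, exons = parse_exon_lines(record)
    # Boundaries are inclusive, so each exon spans end - start + 1 bases.
    print(strand, exons, sum(end - start + 1 for start, end in exons))
```

Because the boundaries are inclusive, the length of each exon is end - start + 1.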

When the parental gene contains one or more frame shifts, an output from aln or spaln contains the corresponding number of lines starting with ';M' such that:

    ;M Deleted n chars at p
    ;M Insert n chars at p

The first line indicates that n (n = 1 or 2) nucleotides have been deleted from the parental genomic sequence beginning at site p. Likewise, the second line indicates that n blank characters are inserted after site p to maintain the open reading frame. This information is used to properly juxtapose intron positions along the alignment.
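A downstream script may want to collect these frame-shift records. The sketch below matches the two ';M' line shapes shown above with a regular expression; the function name parse_frameshifts and the positions in the usage example are hypothetical, and the exact spacing produced by aln or spaln is an assumption.

```python
import re

# Matches ';M Deleted n chars at p' and ';M Insert n chars at p'.
FRAMESHIFT = re.compile(r"^;M\s+(Deleted|Insert)\s+(\d+)\s+chars\s+at\s+(\d+)")

def parse_frameshifts(lines):
    """Return a list of (kind, n, p) tuples for every ';M' line found.

    kind is 'deletion' (n nucleotides removed starting at site p) or
    'insertion' (n blank characters inserted after site p).
    """
    events = []
    for line in lines:
        match = FRAMESHIFT.match(line.strip())
        if match:
            kind = "deletion" if match.group(1) == "Deleted" else "insertion"
            events.append((kind, int(match.group(2)), int(match.group(3))))
    return events

if __name__ == "__main__":
    # Hypothetical positions, for illustration only.
    sample = [
        ">gene1",
        ";M Deleted 1 chars at 1200",
        ";M Insert 2 chars at 3456",
        "MSFSILIAIA",
    ]
    print(parse_frameshifts(sample))
```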

Sequential format of multiple sequences

The concatenation of multiple sequences in one of the above single-sequence formats, except for the naked sequence-alone format, may be used as a program input. Extended FASTA format sequences as described above should be used to indicate the exon-intron organizations of the parental genes. You can indicate the number of sequences and the length of the longest sequence in the first line of the file by two numbers separated by one or more spaces. However, this is no longer mandatory and a file in the ordinary multi-fasta format is fine. Deletion characters are automatically padded at the end of shorter sequences. If the sequences are pre-aligned, internal deletion characters are preserved. Two examples of sequential formats are shown below. A multi-fasta file with gene structure information may be found in the sample directory of the distribution package.

File 1:          |File 2:
                 |
>Seq1            |LOCUS Seq1
aaatttcccggg     |ORIGIN
...              |1 AAATTTCCCGGG
>Seq2            |//
atcgatcgatcgat   |LOCUS NCODE
...              |ORIGIN
>NCODE           |1 -ACMGRSVTWYHKDBN
ACMGRSVTWYHKDBN  |...
...              |//
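The padding rule described above (shorter sequences are extended with deletion characters at the end) can be illustrated for a multi-FASTA input as follows. This is a sketch under stated assumptions rather than the programs' own reader: the helper names read_multifasta and pad_to_longest are hypothetical, lines beginning with ';' are skipped as comments, and an optional leading line holding the two counts is ignored because it precedes the first '>' header.

```python
def read_multifasta(text):
    """Return [(name, sequence), ...] from multi-FASTA text."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment line (including ';C' lines)
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # The name is the first word after '>' (empty if none is given).
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
        # Anything before the first header (e.g. the optional count line) is ignored.
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def pad_to_longest(records, gap="-"):
    """Right-pad every sequence with deletion characters to a common length."""
    width = max((len(seq) for _, seq in records), default=0)
    return [(name, seq.ljust(width, gap)) for name, seq in records]

if __name__ == "__main__":
    text = ">Seq1\naaatttcccggg\n>Seq2\natcgatcgatcgat\n"
    for name, seq in pad_to_longest(read_multifasta(text)):
        print(name, seq)
```

Internal deletion characters in pre-aligned input are left untouched, since only the right end of each sequence is extended.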

Interleaved (native) format of multiple sequences

This native format is designed so that a multiple-sequence alignment can be read easily by eye. The alignment produced by aln can be used as an input to aln or prrn, and this is the most common way to obtain sets of pre-aligned sequences. Thus, the format of an aligned sequence file is the same as the default output format of aln. The first non-blank line in a file must indicate the number of sequences, N, involved in the alignment. This number is obtained as the sum of the numbers in square brackets, e.g., when the first line is

Seq1[3] - Seq2[4]

N is calculated to be 7. Subsequent lines up to the first blank line are ignored. The rest of the file is composed of one or more blocks of a fixed column width of less than 254 characters. Each block is composed of N 'sequence lines' and other (optional) 'non-sequence lines'. The general format of a sequence line is:

<Position> <Sequence> <Name>

where <Position> is a numeral that indicates the sequence position of the first letter in <Sequence>. (Usually every <Position> in the first block is 1, but this is not required; negative values are also allowed.) A line lacking the <Position> field is regarded as a 'non-sequence line' and ignored upon reading. The i-th sequence line in the second block is concatenated to the i-th sequence line in the first block, and so on. There is no particular limit on N or the length, but the total number of characters to be stored is limited by MAXAREA defined in src/sqio.h. Several examples of native format are provided in the sample directory.
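The reading rules above (N from the bracketed numbers, header lines skipped up to the first blank line, blocks concatenated row by row) can be sketched as follows. This is an illustration, not the reader implemented in the distribution; the names count_sequences and parse_native are hypothetical. The sketch assumes that <Name> is a single whitespace-free token at the end of the line, and that any line whose first field is not an integer is a non-sequence line.

```python
import re

def count_sequences(header):
    """N is the sum of the numbers in square brackets, e.g. 'Seq1[3] - Seq2[4]'."""
    return sum(int(n) for n in re.findall(r"\[(\d+)\]", header))

def parse_native(text):
    """Return [(name, sequence), ...] from an interleaved (native) alignment."""
    lines = text.splitlines()
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1                      # find the first non-blank line
    if i == len(lines):
        raise ValueError("empty file")
    n = count_sequences(lines[i])
    if n == 0:
        raise ValueError("no sequence count found in the first line")
    while i < len(lines) and lines[i].strip():
        i += 1                      # ignore lines up to the first blank line
    names, seqs, row = [None] * n, [""] * n, 0
    for line in lines[i:]:
        fields = line.split()
        if len(fields) < 3:
            continue                # blank or non-sequence line
        try:
            int(fields[0])          # <Position>; may be negative
        except ValueError:
            continue                # no <Position> field: non-sequence line
        k = row % n                 # i-th line of each block extends sequence i
        names[k] = fields[-1]
        # Spaces are dropped by split(); dots are ignored as well.
        seqs[k] += "".join(fields[1:-1]).replace(".", "")
        row += 1
    return list(zip(names, seqs))

if __name__ == "__main__":
    print(count_sequences("Seq1[3] - Seq2[4]"))
```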

Special Characters

Three characters, dash '-', tilde '~' and caret '^' have special meanings in a multiple-sequence file of the native format. A dash indicates a 'deletion' introduced by some alignment procedure. Be careful not to use a space or dot instead of a dash. Spaces and dots are simply ignored, so that the file may be interpreted in a totally unexpected way. A tilde means the same residue as that in the first sequence line on that column in the block. On the other hand, a caret means the same residue as that in the previous sequence line on that column. These ditto characters are convenient in representing an alignment of closely related sequences. Neither '~' nor '^' is allowed in the first sequence line in each block.
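The ditto characters can be expanded mechanically within one block. The sketch below is a minimal illustration with a hypothetical function name, expand_dittos; it assumes that all rows in the block have the same width and that a caret refers to the already expanded residue of the previous sequence line.

```python
def expand_dittos(block):
    """Expand '~' and '^' in one block of aligned sequence lines.

    block: list of equal-length strings, one per sequence line, in file order.
    '~' copies the residue of the first line in that column;
    '^' copies the (expanded) residue of the previous line in that column.
    """
    if not block:
        return []
    first = block[0]
    if "~" in first or "^" in first:
        raise ValueError("ditto characters are not allowed in the first line")
    expanded = [first]
    for row in block[1:]:
        if len(row) != len(first):
            raise ValueError("all lines in a block must have the same width")
        previous = expanded[-1]
        chars = []
        for col, ch in enumerate(row):
            if ch == "~":
                chars.append(first[col])
            elif ch == "^":
                chars.append(previous[col])
            else:
                chars.append(ch)    # residue or '-' is kept as is
        expanded.append("".join(chars))
    return expanded

if __name__ == "__main__":
    for line in expand_dittos(["ACGT-A", "~~C~~~", "^^^A^^"]):
        print(line)
```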

Information on gene structures

For the native format of multiple sequences, lines starting with ";B ", ";b ", and ";m " describe the exon-intron organization of the corresponding genes. The first number that follows ";B ", NP, indicates the number of alignment positions where an intron interrupts at least one of the genes corresponding to the aligned protein or cDNA sequences. For a protein sequence, the position refers to that of the coding nucleotide sequence, so that the phase as well as the alignment column is significant. The second number in this line, NI, indicates the total number of introns. The numbers that follow ";b " indicate the number of sequences that contain the intron at the 1st, 2nd, ..., NP-th position. The numbers that follow ";m " list the sequences that contain the intron at the 1st, 2nd, ..., NP-th position, in this order. See the example ce13a.mfa in the sample/pas directory.
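One way to read these three line types together is sketched below. It rests on an interpretation of the description above that should be checked against ce13a.mfa: the ";b " counts are assumed to sum to NI, and the ";m " list is assumed to be a flat list of NI integer sequence identifiers that is split into NP groups according to those counts. The function name parse_gene_structure and the numbers in the usage example are hypothetical.

```python
def parse_gene_structure(lines):
    """Group the ';m' sequence list by intron position.

    Returns a list of NP lists; the k-th list holds the sequences that
    carry an intron at the k-th intron-containing alignment position.
    """
    numbers = {";B": [], ";b": [], ";m": []}
    for line in lines:
        fields = line.split()
        if fields and fields[0] in numbers:
            # Values may continue over several lines with the same prefix.
            numbers[fields[0]].extend(int(f) for f in fields[1:])
    if len(numbers[";B"]) < 2:
        raise ValueError("';B' line must give NP and NI")
    n_positions, n_introns = numbers[";B"][:2]
    counts, members = numbers[";b"], numbers[";m"]
    # Consistency checks implied by the description (an assumption).
    if len(counts) != n_positions:
        raise ValueError("';b' must list one count per intron position")
    if sum(counts) != n_introns or len(members) != n_introns:
        raise ValueError("counts and members must both add up to NI")
    groups, start = [], 0
    for count in counts:
        groups.append(members[start:start + count])
        start += count
    return groups

if __name__ == "__main__":
    # Hypothetical values: 2 intron positions, 3 introns in total.
    print(parse_gene_structure([";B 2 3", ";b 2 1", ";m 1 3 2"]))
```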

References

[1] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162, 705-708.

[2] Gotoh, O. (1990) "Optimal sequence alignment allowing for long gaps." Bull. Math. Biol. 52, 359-373.

[3] Berger, M.P., and Munson, P.J. (1991) "A novel randomized iterative strategy for aligning multiple protein sequences." CABIOS 7, 479-484.

[4] Gotoh, O. (1993) "Optimal alignment between groups of sequences and its application to multiple sequence alignment." CABIOS 9, 361-370.

[5] Gotoh, O. (1993) "Extraction of conserved or variable regions from a multiple sequence alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.

[6] Gotoh, O. (1994) "Further improvement in group-to-group sequence alignment with generalized profile operations." CABIOS 10, 379-387.

[7] Gotoh, O. (1995) "A weighting system and algorithm for aligning many phylogenetically related sequences." CABIOS 11, 543-551.

[8] Gotoh, O. (1996) "Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments." J. Mol. Biol. 264, 823-838.

[9] Gotoh, O. (1999) "Multiple sequence alignment: algorithms and applications." Adv. Biophys. 36, 159-206.

[10] Gotoh, O., Yamada, S., and Yada, T. (2006) "Multiple sequence alignment." In Handbook of Computational Molecular Biology (Aluru, S., ed.), Chapman & Hall/CRC, Computer and Information Science Series, Vol. 9, pp. 3.1-3.36.

Download

source

Binary