SPALN information




Overview

Spaln (space-efficient spliced alignment) is a stand-alone program that maps and aligns a set of cDNA or protein sequences onto a whole genomic sequence in a single job. From Version 1.4, spaln supports a combination of a protein sequence database and a given genomic segment. From Version 2.2, spaln also performs rapid similarity search and (semi-)global alignment of a set of protein sequence queries against a protein sequence database. Spaln adopts multi-phase heuristics that make it possible to perform the job on a conventional personal computer running Unix/Linux with limited memory. The program is written in C++ and distributed as source code and also as executables for a few platforms. Unless a binary is provided for their platform, users must compile the program on their own system. Although the program has been tested only on Linux, it is likely to be portable to most Unix systems with little or no modification. The accessory program sortgrcd sorts the gene loci found by spaln in the order of chromosomal position and orientation.

References

[1] Gotoh, O. "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence" Nucleic Acids Research 36 (8) 2630-2638 (2008).
[2] Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence" Bioinformatics 24 (21) 2438-2444 (2008).
[3] Iwata, H. and Gotoh, O. "Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features" Nucleic Acids Research 40 (20) e161 (2012).

Present Version: 2.2.2, Last update: 2016-05-10

Install

From source

To compile the source code with the default settings, follow the instructions below. If you download the source file (spaln2.2.2) into the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download. Note: The default destinations of the installed files of version 1 are somewhat different from those described here.

To modify the location of executables and/or other settings, run './configure --help' at step 6 below. (Warning: Full path names, rather than relative path names, must be given for executables or other directories as arguments to the configure command.) These locations are hard-coded in spaln. The locations of the 'seqdb' and 'table' directories are denoted below by seqdb and table, respectively. Hence, in the default settings, seqdb=download/spalnXX/seqdb and table=download/spalnXX/table.

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.tar.gz
  4. % tar xfz spalnXX.tar.gz
  5. % cd ./spalnXX/src
  6. % ./configure [--help]
  7. % make
  8. % make install
  9. % make clearall
  10. Add download/spalnXX/bin to your PATH. Preferably, add this setting to your startup rc file (e.g. ~/.bashrc).
    Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
  11. If you have changed the location of the table and/or seqdb directory after installation, set the environment variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
  12. Proceed to Sequence data formation.

From binaries

Binaries for a 32-bit (spaln2.0.4.linux32) or 64-bit (spaln2.2.2.linux64) Linux machine are available. To use the binaries, follow the instructions below.

Case I: Assume the directory work is your workspace, where all materials are stored. In this case, seqdb=work.

  1. % mkdir work
  2. % cd work
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add work/bin to your PATH
    Or move or copy work/bin/* to a directory on your PATH
  6. % mv ./table/* .; rmdir ./table
  7. % mv ./seqdb/* .; rmdir ./seqdb
  8. Now proceed to Sequence data formation.

Case II: Assume your workspace work is distinct from seqdb

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add download/bin to your PATH
    Or move or copy download/bin/* to a directory on your PATH
  6. % setenv ALN_TAB download/table (csh/tcsh)
    $ export ALN_TAB=download/table (sh/bash)
  7. % setenv ALN_DBS download/seqdb (csh/tcsh)
    $ export ALN_DBS=download/seqdb (sh/bash)
  8. Add the above lines to your rc file so that you do not have to repeat the commands at every login.
  9. Now proceed to Sequence data formation
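
Before moving on, it can help to confirm that the two environment variables are actually visible to new processes. The following is only a minimal sketch, not part of the spaln distribution: the function name check_spaln_env is hypothetical, and the sketch assumes nothing about spaln beyond the variable names ALN_TAB and ALN_DBS given above.

```python
import os


def check_spaln_env(environ=None):
    """Report whether ALN_TAB and ALN_DBS are set and point to directories.

    `environ` defaults to the real process environment; passing a plain
    dict makes the function easy to try out. (Hypothetical helper.)
    """
    environ = os.environ if environ is None else environ
    report = {}
    for name in ("ALN_TAB", "ALN_DBS"):
        value = environ.get(name)
        if not value:
            report[name] = "unset"
        elif os.path.isdir(value):
            report[name] = "ok"
        else:
            report[name] = "not a directory"
    return report


if __name__ == "__main__":
    for name, status in check_spaln_env().items():
        print(f"{name}: {status}")
```

A status of "not a directory" typically means that a relative path was exported from a different working directory; full path names avoid this.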

Sequence data formation

If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format. The genomic sequence must be formatted before use.
  1. % cd seqdb
  2. Download or copy genomic sequences or protein database sequence in multi-fasta format.
  3. Chromosomal sequences should be concatenated into a single file. Alternatively, you can use multiple chromosomal files without concatenation; this procedure is described at the end of this section. For the 'make' command to work, the extension of the genomic sequence file should be '.mfa' or '.gf', and that of the protein database sequence file should be '.faa'. Hereafter, the file name is assumed to be xxxgnm.mfa or prosdb.faa. The total number of residues in a file must be less than 2**32.
  4. % ./makeidx.pl -i[n|p|np] xxxgnm.mfa or
    % ./makeidx.pl -i[a] prosdb.faa
  5. % make xxxgnm.idx (for genomic sequence) or
    % make prosdb.idx (for protein database sequence)
  6. % make xxxgnm.bkn (for cDNA queries) or
    % make xxxgnm.bkp (for protein queries) or
    % make prosdb.bka (for protein database)
  7. It is possible to generate xxxgnm.idx and the other three files directly from the input files without concatenation:
    This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
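
The size limit in step 3 (fewer than 2**32 residues per file) can be checked before formatting. The sketch below is an illustration only and is not how spaln itself counts residues: it assumes that every non-blank character on a non-header line is one residue, and the names count_residues, fits_limit and check_file are hypothetical.

```python
LIMIT = 2 ** 32  # the residue count of one file must stay below this value


def count_residues(lines):
    """Count residues in (multi-)fasta text given as an iterable of lines.

    Header lines (starting with '>') and blank lines are skipped; every
    remaining character is counted as one residue (an assumption).
    """
    total = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def fits_limit(total):
    """True if a residue count is strictly below 2**32."""
    return total < LIMIT


def check_file(path):
    """Return (residue_count, fits) for a fasta file on disk."""
    with open(path) as handle:
        total = count_residues(handle)
    return total, fits_limit(total)
```

For example, check_file("xxxgnm.mfa") returns the residue count together with a flag telling whether the file is small enough; if it is not, the multiple-file procedure of step 7 is the alternative.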

Execution

  1. Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta format (denoted by query below)
  2. Store query in work
  3. % cd work
  4. Run spaln in one of the following three modes. Spaln does not support comparison between two genomic segments.
    (A) % spaln -Q[0|1|2|3] [-ON] [other options] genome_segment query
    (B) % spaln -Q[4|5|6|7] [-ON] [other options] -dxxxgnm query
    (C) % spaln -Q[4|5|6|7] [-ON] [other options] -aprosdb query

    Only a subset of queries may be examined if query is replaced with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
    To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
    (A) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
    (B) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
    (C) % spaln -Q7 ...
    However, the procedure will be simplified if a multi-thread operation is used as follows:
    (D) % spaln -Q7 -O12 -oxxx -dxxxgnm -t[N] query

    Options: (default value)

  5. % sortgrcd [options] *.grd
    Sortgrcd is used to recover the output of spaln with -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs.
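
When the query set is split by hand for several CPUs, as in the 'query (from to)' commands of step 4, the ranges and output names are easy to get wrong. The following sketch only generates the command lines as text; it does not run spaln. The function names and the default prefix xxxO are illustrative assumptions, while the options themselves (-Q7, -O12, -o, -d) are copied from the examples above.

```python
def split_ranges(n_queries, chunk):
    """Split entry numbers 1..n_queries into consecutive (from, to) ranges."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    ranges = []
    start = 1
    while start <= n_queries:
        end = min(start + chunk - 1, n_queries)
        ranges.append((start, end))
        start = end + 1
    return ranges


def spaln_commands(n_queries, chunk, genome="xxxgnm", query="query", prefix="xxxO"):
    """Build one spaln command line per range, each with its own output name."""
    commands = []
    for i, (lo, hi) in enumerate(split_ranges(n_queries, chunk), start=1):
        commands.append(
            f"spaln -Q7 -O12 -o{prefix}{i} -d{genome} '{query} ({lo} {hi})'"
        )
    return commands


if __name__ == "__main__":
    for command in spaln_commands(2500, 1000):
        print(command)
```

The printed lines can be pasted into separate shells or a job scheduler, and the resulting files summarized with sortgrcd as in step 5. With the -t[N] option this splitting is unnecessary.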

Example

Changes from previous versions

Added/modified in Ver. 2.2.2 (2016-05-06):

  1. Although implemented earlier, Spaln now formally supports rapid sequence similarity search against a protein sequence database with a set of protein sequence queries. Unlike blastp and some other programs for local sequence similarity search, spaln calculates (semi-)global alignments. Currently, p-values are not estimated, as spaln intends to find only a small number of strong similarities (to use the found sequence as a template for spliced alignment, when the query is a genomic segment).
  2. Specification of the species-specific parameter set has been made easier. An eight-character genus-species identifier (e.g. mus_musc) or a genus name (e.g. Mus) is now allowed as the argument to the -T option for specifying the corresponding parameter set (Tetrapod in this example). To activate this facility, the new "gnm2tab" file must be installed in the "table" directory.
  3. A rare problem in which the mapping routine slowed down considerably due to internal repeats in query sequences has been fixed.
  4. Estimation of maximum gene length in a genome from the total genome size has been modified. Remember that you must set the expected maximum gene length by -GXn option (n is the expected maximum gene length) at the formatting phase or at each run time if your "genome" is not full size.
  5. Intron length distribution is modelled by a combination of up to three (formerly two) Frechet distributions.

Added/modified in Ver. 2.2.1a-e (2016-02-22):

  1. Surplus lines in seqdb/makblk.pl have been removed
  2. configure script has been modified.
  3. A few modifications have been made to accord with Mac OS X.
  4. Several bugs in minor run modes have been fixed.
  5. The problem of skipping the first sequence line in some input fasta files has been fixed.
Added/modified in Ver. 2.2.1 (2015-12-02):

  1. A few bugs have been fixed. In particular, output in SAM format has been corrected.
  2. For mapping amino acid sequences, the numbers of matches and mismatches are displayed correctly.
Added/modified in Ver. 2.2.0 (2015-08-14):

  1. One-time run of database search has been enabled. In this mode, you need not format database sequences beforehand. However, the formatted results are not retained.
  2. A few primitive bugs have been fixed. In particular, the problem with long header line has been fixed. Output formats with -O3 (BED format) and -O10 (SAM format) are also rectified.
Added/modified in Ver. 2.1.4 (2015-01-30):

  1. Three-byte addressing has been abolished. Accordingly, data files formatted with an older version of spaln can be used by the current version, but the opposite will result in a failure.
  2. A few primitive bugs have been fixed.
Added/modified in Ver. 2.1.3 (2014-12-15):

  1. -CN option is added to deal with non-standard genetic codes, where N denotes the "transl_table number" defined in NCBI transl_table. For example, -C6 option enables proper translation of Tetrahymena or Paramecium genes.
  2. A failure to identify the 3' splice junction followed by a long gap has been amended.
  3. A few bugs with -O1 option have been fixed.
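
To illustrate why the -CN option matters, the sketch below translates the same DNA with NCBI transl_table 1 (standard) and transl_table 6 (ciliate nuclear code), in which TAA and TAG encode glutamine instead of a stop. This is an independent illustration, not spaln's translation routine; the function name translate and the handling of ambiguous codons (reported as 'X') are assumptions.

```python
BASES = "TCAG"  # base order used by the NCBI genetic code tables

# Amino acids for the 64 codons in TTT, TTC, TTA, TTG, TCT, ... order.
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    6: "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def translate(dna, table=1):
    """Translate DNA in frame 0 using the given transl_table number."""
    amino_acids = TABLES[table]
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        try:
            index = (16 * BASES.index(codon[0])
                     + 4 * BASES.index(codon[1])
                     + BASES.index(codon[2]))
        except ValueError:
            protein.append("X")  # codon contains a non-ACGT character
            continue
        protein.append(amino_acids[index])
    return "".join(protein)


if __name__ == "__main__":
    seq = "ATGTAAGGATGA"
    print(translate(seq, 1))  # TAA read as a stop
    print(translate(seq, 6))  # TAA read as glutamine
```

Under the standard code the internal TAA truncates the product, whereas table 6 reads through it, which is why Tetrahymena or Paramecium genes need -C6 for proper alignment with protein queries.
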
Added/modified in Ver. 2.1.2 (2013-10-17):

  1. A few bugs in spaln and sortgrcd have been fixed.
  2. Inching toward more C++-style coding.
Added/modified in Ver. 2.1.1 (2013-7-1):

  1. The bugs in makdbs have been fixed.
  2. The placements of exon-intron boundaries are modified when spaln is run under -ya[N] (N = 2 or 3) options with protein queries.
Added/modified in Ver. 2.1.0 (2013-5-15):

From this version, the formats of the database files and output of -O12 option of spaln have been changed. These modifications are intended to eliminate the limitations on the lengths of identifiers of both genomic and query sequences (formerly up to 20 letters). To accord with these modifications, the associated programs, makdbs and sortgrcd, have been updated. Despite these modifications in the file formats, the user interface of each program is unchanged. spaln_2.1 can read database sequences formatted by an older version, provided that the length limitations on the identifiers are satisfied. However, sortgrcd_2 can no longer process outputs from an older version of spaln. From Version 2.1.3, spaln can deal with non-standard genetic codes.

Major changes in Version 2

The major revision number of spaln has been updated from 1 to 2. The new version incorporates additional features for intron recognition in addition to other revisions. The major points of revision are as follows.

  1. The code has been extensively rewritten in C++, so it can no longer be compiled by a C compiler.
  2. A heuristic routine is added after the HSP search to reduce the number of calls of the restricted DP routine.
  3. -t[N] option specifies the number of CPUs involved in a multi-thread operation. If N is omitted, all available CPUs are used.
  4. A branch point signal and an intron propensity based on its oligomer compositions are incorporated in the scoring system.
  5. The splice junction signals are composed of two terms: (1) universal signal that depends on the dinucleotide pair at the ends of an intron and (2) species-specific signal that depends on the sequence around each splice junction. The relative contributions of these two terms can be freely adjusted by -ySN option.
  6. The intron penalty is now automatically adjusted depending on the values of other parameters.
  7. Species-specific parameters are available for 61 divergent eukaryotes. These parameters are stored in the respective subdirectories of the 'table' directory.
  8. A set of benchmark data named "SPAliBASE" is ready for download.

  1. The limitations on the length of the sequence id of both genomic and query sequences are removed. According to this modification, makdbs and sortgrcd programs are updated.
  2. The default order of output from sortgrcd has been changed.
  3. A few bugs under the -M option have been fixed.
Added/modified in Ver. 2.0.6 (2012-01-23):

  1. The limitation on the length of the sequence id (formerly max. 20 letters) has been removed. Note that the length of an entry id in a database file is still limited to 14 letters. This limitation will be removed in the next release.
  2. -XsN option is added to allow for overlapping k-mer seeds at the block search phase, where N (1<=N<=k) indicates the distance between adjacent seeds. The sensitivity of block search will be improved with a small N, which is particularly beneficial for mapping of short queries. Compared with the default setting (N=k, i.e. tiling seeds), however, the memory consumption for the k-mer table will be increased by a factor of k/N.
  3. Spaln now supports mapping of paired-end reads. Use the -ip option for paired input files (matching 5' and 3' reads must appear in the same order) or the -ia option for a single input file (matching 5' and 3' reads must appear alternately).
  4. The performance of the GvsA mode (query = genomic segment, database = amino acid sequences) has been improved.
  5. Memory leak errors have been fixed (20130226)
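
The memory trade-off of the -XsN option in item 2 follows directly from how many seeds are taken from a sequence. The sketch below is plain arithmetic written for illustration, not code from spaln; the function names are hypothetical, and the seed count assumes that seeds start at position 0 and that only complete k-mers are used.

```python
def seed_count(length, k, n):
    """Number of k-mer seeds when adjacent seeds start n positions apart."""
    if not 1 <= n <= k:
        raise ValueError("N must satisfy 1 <= N <= k")
    if length < k:
        return 0
    return (length - k) // n + 1


def table_growth_factor(k, n):
    """Approximate growth of the k-mer table relative to tiling seeds (N = k)."""
    if not 1 <= n <= k:
        raise ValueError("N must satisfy 1 <= N <= k")
    return k / n


if __name__ == "__main__":
    k = 12
    for n in (12, 6, 4, 1):
        print(n, seed_count(1_000_000, k, n), table_growth_factor(k, n))
```

Halving N roughly doubles the number of seeds and hence the table size, so a small N is best reserved for short queries, where the gain in sensitivity is largest.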

Added/modified in Ver. 2.0.4 (2011-9-10, 2011-10-11, 2011-11-17)

  1. A bug that occurred when the database file was constructed from multiple input sequence files has been fixed. Makefile and makblk.pl in the 'seqdb' directory have also been modified to accommodate multiple input files.
  2. The output with the -O3 option is changed from Psl-like to Bed format.
  3. The default parameters used in polyA detection have been modified to reduce the chance of removing genome-encoded polyA-like sequences.
  4. The procedure to collect related HSPs within and adjacent to significant blocks has been simplified.
  5. The procedure to assemble HSPs has been improved to reduce the chance of chimera formation among tandemly repeated paralogous genes.
  6. A bug concerning very long genes has been fixed.
Added/modified in Ver. 1.4.5 (2010-04-23, 2010-05-08, 2010-08-04)

  1. Graphical output is supplemented in the Web interface.
  2. -U option is added to spaln. With this option, alignment is computed without splicing. It may be useful when genomic fragment(s) are mapped/aligned to a whole genome.
  3. Several small bugs in sortgrcd have been fixed.
  4. Occasional core dump of spaln with -M option has been remedied.
  5. A problem with the makdbs command that occurred when the character '|' was contained in the second or a later word of the header line of a fasta file has been fixed.
Added/modified in Ver. 1.4.4 (2009-12-01, 2010-01-25, 2010-02-6, 2010-02-23)

  1. -SN option is added to spaln. When N = 3 (default), both orientations of query sequence(s) are examined. When N = 1 or N = 2, only the forward or the reverse orientation is examined, respectively. When N = 0, the orientation to be examined depends on the annotation (header line of FASTA format) of the query sequence. The list of phrases that indicate the orientation is given in the file "StrandPhrase" in the table/spalnXX/seqdb. If no phrase in this list is found in the annotation, both orientations are examined.
  2. -yaN option of spaln has been modified. When N = 0 (default), only canonical intron boundary sequences (GT..AG, GC..AG, AT..AC) are allowed. When N = 1, the third consensus is relaxed as AT..AN. When N = 2, one mismatch from GT..AG is allowed in addition to the relaxed consensus of AT..AN. When N = 3, any sequence is allowed as a boundary.
  3. Many bugs of sortgrcd have been fixed.
  4. -O15 option of sortgrcd outputs information about unique introns merged from the mappings of several transcripts.
  5. Unnecessary spaces in Gff3 format have been removed from outputs of both spaln and sortgrcd.
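
The four levels of the -yaN option in item 2 can be read as a small decision rule. The sketch below restates that rule for illustration and is not taken from the spaln source. In particular, it assumes that "one mismatch from GT..AG" means at most one differing base among the four boundary bases taken together, and the function name boundary_allowed is hypothetical.

```python
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def boundary_allowed(donor, acceptor, n):
    """Decide whether an intron with the given end dinucleotides is allowed.

    donor    -- first two bases of the intron (e.g. "GT")
    acceptor -- last two bases of the intron (e.g. "AG")
    n        -- the N of -yaN (0, 1, 2 or 3)
    """
    if n not in (0, 1, 2, 3):
        raise ValueError("N must be 0, 1, 2 or 3")
    donor, acceptor = donor.upper(), acceptor.upper()
    if len(donor) != 2 or len(acceptor) != 2:
        raise ValueError("donor and acceptor must be dinucleotides")
    if n == 3:
        return True  # any sequence is allowed
    if (donor, acceptor) in CANONICAL:
        return True  # GT..AG, GC..AG, AT..AC
    if n == 0:
        return False
    if donor == "AT" and acceptor[0] == "A":
        return True  # relaxed third consensus AT..AN
    if n == 1:
        return False
    # n == 2: at most one mismatch from GT..AG (interpretation assumed above)
    mismatches = sum(a != b for a, b in zip(donor + acceptor, "GTAG"))
    return mismatches <= 1
```

For example, boundary_allowed("AT", "AA", 0) is False, while the same boundary is accepted with N = 1; a GT..AT boundary is accepted only from N = 2 upward.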
Added/modified in Ver. 1.4.3c (2009-08-21)

  1. Some incompatibilities between documents and programs have been rectified.
  2. Species-specific parameters in table/spalnXX/seqdb have been updated.
Added/modified in Ver. 1.4.3b (2009-07-10) (2009-07-24)

  1. Sortgrcd has been modified so that files with many chromosome/contig entries can be handled.
  2. Spaln's write mode now accepts plural arguments each corresponding to a piece of the whole genome (e.g. each chromosomal sequence). This modification has made it possible for spaln to format a whole genomic sequence on a platform that cannot handle a large file (e.g. > 2GB). For example, the human genome is composed of 24 chromosomes. The total genome size exceeds 2 Gb, while the maximum chromosome size is less than 300 Mb. Even if a system fails to read the whole-genome file at once, spaln can still process the chromosomal sequences sequentially. In the example below, "my_genome" is the genome name, and chr1, chr2, ... chrn are chromosomal sequence files.
    % makdbs -nmy_genome -KD chr1 chr2 ... chrn
    % spaln -Wmy_genome.bkn -Xk12 -Xb54272 -XG2802688 -Xg64 -KD chr1 chr2 ... chrn
    Then, you can use spaln in a usual way:
    % spaln -Q7 -dmy_genome cDNAs
  3. A few bugs are fixed to prevent core dumps under some conditions.

Added/modified in Ver. 1.4.2 (2009-05-16) (2009-06-17)

  1. Bugs in sortgrcd have been fixed. sortgrcd now supports -O5 (intron-oriented format) output.
  2. Several bugs that were conditionally problematic have also been fixed.

Added/modified in Ver. 1.4.1 (2009-02-16) (2009-03-16)

  1. A combination of a protein sequence database and a genomic segment(s) is now supported. The protein sequence database must be formatted beforehand. At the run time, a set of ORFs in the genomic segment are translated and rapidly searched against the database sequences, and then the best-hit sequence is used as the template of the spliced alignment between the relevant portion of the genomic segment and the protein sequence retrieved.
  2. The format of .bkn and .bkp files has been slightly changed. This has introduced some incompatibility between formatted files and software: spaln Ver. 1.4 can use files formatted by spaln Ver. 1.3, but the reverse will cause an error.
  3. The output form now includes Gff3 (gene or match form) and Psl-like format.
  4. Sortgrcd also supports Gff3 gene format in addition to the native format.
  5. The default intron-penalty function has been modified from the well-shaped to a 'generic' function derived from intron-length distributions of several species.
  6. Many small modifications have been made to improve the stability of the software. In particular, multiple outputs of the same locus have been suppressed when -M option is set.

Added/modified in Ver. 1.3.2. (2008-09-12) (2008-09-25) (2008-10-21)

  1. Divide by zero error with makblk.pl when genome size is less than 1Mb has been fixed.
  2. Missing genomic fragment name in -Q[0-3] mode has been recovered.
  3. A few bugs have been fixed to avoid core dumps due to incompatibility between HSP coordinates and sequence ends.
  4. Known exon boundary information can be incorporated into the objective function when a cDNA, as well as a protein, is used as a template. To do so, simply add the -yBN option, where N is the bonus given to each matched location of exon-exon boundary.
  5. Compile time errors on some systems have been fixed.

Added/modified in Ver. 1.3.1. (2008-08-21)

  1. The portability has been improved for spaln to run on 64-bit machines.
  2. The code has been cleaned up so that fewer warnings are output at compile time.
  3. 'make install' command is modified so that species-specific parameter files should be copied to the working table/spalnXX/seqdb directory.
  4. The test procedure described at the end of this document now includes the case of protein sequence queries.

Added/modified in Ver. 1.3.0

  1. Protein sequences can now be used as queries. If properly formatted, users do not need to be conscious of the difference in the type of queries at the run time.
  2. Bugs with the -O12 option have been fixed to restore normal output.
  3. The best amino acid substitution matrices are now chosen in different phases of execution. In this connection, the format of mdm files has been changed. Accordingly, the revised makmdm command must be run before the use of the new version of spaln.
  4. Combination of relevant blocks has been modified to improve sensitivity.
  5. The 'Salvage' procedure was added to find homologues that are missed by the formal criteria.

Download

Source Binaries Data

Copyright (c) 2007-2016 Osamu Gotoh all rights reserved