SPALN information




Overview

Spaln (space-efficient spliced alignment) is a stand-alone program that maps and aligns a set of cDNA or protein sequences onto a whole genomic sequence in a single job. Spaln adopts multi-phase heuristics that make it possible to perform the job on a conventional personal computer running Unix/Linux with limited memory. The program is written in C++ and is distributed as source code and as executables for a few platforms. If binaries are not provided for their platform, users must compile the program on their own system. Although the program has been tested only on Linux, it is likely to be portable to most Unix systems with little or no modification. The accessory program sortgrcd sorts the gene loci found by spaln in the order of chromosomal position and orientation.

References

[1] Gotoh, O. "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence" Nucleic Acids Research 36 (8) 2630-2638 (2008).
[2] Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence" Bioinformatics 24 (21) 2438-2444 (2008).
[3] Iwata, H. and Gotoh, O., in preparation.

Present Version: 2.0.4, Last update: 2011-11-17

Major changes in Version 2

The major revision number of spaln has been updated from 1 to 2. The new version incorporates additional features for intron recognition in addition to other revisions. The major points of revision are as follows.

  1. The code has been extensively rewritten in C++, so it can no longer be compiled with a C compiler.
  2. A heuristic routine has been added after the HSP search to reduce the number of calls to the restricted DP routine.
  3. The -t[N] option specifies the number of CPUs involved in a multi-thread operation. If N is omitted, all available CPUs are used.
  4. A branch point signal and an intron propensity based on its oligomer compositions are incorporated in the scoring system.
  5. The splice junction signals are composed of two terms: (1) a universal signal that depends on the dinucleotide pair at the ends of an intron and (2) a species-specific signal that depends on the sequence around each splice junction. The relative contributions of these two terms can be freely adjusted with the -ySN option.
  6. The intron penalty is now automatically adjusted depending on the values of other parameters.
  7. Species-specific parameters are available for 61 divergent eukaryotes. Those parameters are stored in subdirectories of the 'table' directory.
  8. A set of benchmark data named "SPAliBASE" is ready for download.

Install

From source

To compile the source code with the default settings, follow the instructions below. If you download the source file (spaln.2.0.4) into the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download. Note: The default destinations of the installed files of version 1 are somewhat different from those described here.

To modify the location of executables and/or other settings, run 'configure --help' at step 6 below. (Warning: Full path names rather than relative path names must be given for executables or other directories as arguments of the configure command.) These locations are hard-coded in spaln. The locations of the 'seqdb' and 'table' directories will be denoted by seqdb and table, respectively, below. Hence, seqdb=download/spalnXX/seqdb and table=download/spalnXX/table in the default settings.

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.tar.gz
  4. % tar xfz spalnXX.tar.gz
  5. % cd ./spalnXX/src
  6. % ./configure [--help]
  7. % make
  8. % make install
  9. % make clearall
  10. Add download/spalnXX/bin to your PATH. Preferably, add this setting to your start-up rc file (e.g. ~/.bashrc).
    Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
  11. If you have changed the location of table and/or seqdb directory after installation, set the env variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
  12. Proceed to Sequence data formation.

From binaries

Binaries for a 32 bit (spaln.2.0.4.linux32) or 64 bit (spaln.2.0.4.linux64) Linux machine are available. To use the binaries, follow the instructions below.

Case I: Assume the directory work is your workspace where all materials are stored. In this case, seqdb=work.

  1. % mkdir work
  2. % cd work
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add work/bin to your PATH
    Or move or copy work/bin/* to a directory on your PATH
  6. % mv ./table/* .; rmdir ./table
  7. % mv ./seqdb/* .; rmdir ./seqdb
  8. Now proceed to Sequence data formation.

Case II: Assume your workspace work is distinct from seqdb.

  1. % mkdir download
  2. % cd download
  3. Download spalnXX.PC.tar.gz, where PC is a platform code
  4. % tar xfz spalnXX.PC.tar.gz
  5. Add download/bin to your PATH
    Or move or copy download/bin/* to a directory on your PATH
  6. % setenv ALN_TAB download/table (csh/tcsh)
    $ export ALN_TAB=download/table (sh/bash)
  7. % setenv ALN_DBS download/seqdb (csh/tcsh)
    $ export ALN_DBS=download/seqdb (sh/bash)
  8. Add the above lines to your rc file, so that you don't have to repeat the commands at every login time.
  9. Now proceed to Sequence data formation
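
For sh/bash users, the PATH and environment settings of Case II can be collected into one shell function, sketched below. The function name spaln_env is hypothetical and is not part of the spaln distribution. The sketch resolves its argument to a full path before exporting, on the assumption that full paths are the safer choice for ALN_TAB and ALN_DBS.

```shell
# spaln_env is a hypothetical helper, not part of the spaln distribution.
# Usage: spaln_env DIR   (DIR is where spalnXX.PC.tar.gz was unpacked)
spaln_env() {
    # Resolve DIR to a full path; fail if the directory does not exist.
    top=$(cd "$1" >/dev/null 2>&1 && pwd) || {
        echo "spaln_env: no such directory: $1" >&2
        return 1
    }
    ALN_TAB="$top/table"
    ALN_DBS="$top/seqdb"
    export ALN_TAB ALN_DBS
    # Prepend bin to PATH only if it is not already listed.
    case ":$PATH:" in
        *":$top/bin:"*) ;;
        *) PATH="$top/bin:$PATH"; export PATH ;;
    esac
}
```

Placing the function definition and a call such as 'spaln_env download' in the rc file has the same effect as steps 5 to 8.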

Sequence data formation

If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format. The genomic sequence must be formatted before use.
  1. % cd seqdb
  2. Download or copy genomic sequences or protein database sequence in multi-fasta format.
  3. Chromosomal sequences should be concatenated into a single file. Alternatively, you can use multiple chromosomal files without concatenation; this procedure is described at the end of this section. For the 'make' command to work, the extension of the genomic sequence file should be '.mfa', and that of the protein database sequence file should be '.faa'. Hereafter, the file name is assumed to be xxxgnm.mfa or prosdb.faa. The total number of residues in a file must be less than 2**32.
  4. % make xxxgnm.idx (for genomic sequence) or
    % make prosdb.idx (for protein database sequence)
  5. % make xxxgnm.bkn (for cDNA queries) or
    % make xxxgnm.bkp (for protein queries) or
    % make prosdb.bka (for protein database)
  6. It is possible to generate xxxgnm.idx and the other three files directly from the input files without concatenation.
    This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
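
Before formatting, the size limit in step 3 (fewer than 2**32 residues per file) can be checked with a short sh/bash sketch like the one below. The function names fasta_residues and fits_spaln_limit are hypothetical and not part of spaln. The count here simply skips header lines and ignores whitespace, which is an assumption about what counts as a residue.

```shell
# fasta_residues and fits_spaln_limit are hypothetical helpers,
# not part of the spaln distribution.

# Print the total number of residues in the given (multi-)fasta files.
# Header lines (starting with '>') are skipped and whitespace is ignored.
fasta_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) }
         END   { printf "%.0f\n", n }' "$@"
}

# Succeed if the files together hold fewer than 2**32 residues.
fits_spaln_limit() {
    total=$(fasta_residues "$@") || return 1
    [ "$total" -lt 4294967296 ]
}
```

For example, 'fits_spaln_limit xxxgnm.mfa || echo "too large"' reports a file that exceeds the limit.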

Execution

  1. Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta format (denoted by query below)
  2. Store query to work
  3. % cd work
  4. Run spaln in one of the following three modes. Spaln does not support comparison between two genomic segments.
    (A) % spaln -Q[0|1|2|3] [-ON] [-MN] [other options] genome_segment query
    (B) % spaln -Q[4|5|6|7] [-ON] [-MN] [other options] -dxxxgnm query
    (C) % spaln -Q[4|5|6|7] [-ON] [-MN] [other options] -aprosdb query

    Only a subset of queries may be examined if query is replaced with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
    To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
    (A) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
    (B) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
    (C) % spaln -Q7 ...
    However, the procedure will be simplified if a multi-thread operation is used as follows:
    (D) % spaln -Q7 -O12 -oxxx -dxxxgnm -t[N] query

    Options: (default value)

  5. % sortgrcd [options] *.grd
    Sortgrcd is used to recover the output of spaln with -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs.
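
The command lines for the split runs in step 4 can be generated rather than typed by hand. The sh/bash sketch below only prints the commands and does not run spaln. The function name spaln_chunk_cmds and its arguments are hypothetical, and the -Q7 -O12 options and the output names (xxxO1, xxxO2, ...) simply follow the example above.

```shell
# spaln_chunk_cmds is a hypothetical helper; it only prints command lines
# and does not run spaln.
# Usage: spaln_chunk_cmds GENOME QUERY OUT TOTAL CHUNK
#   GENOME  name of the formatted genome (xxxgnm in the text)
#   QUERY   query file
#   OUT     prefix of the output names (xxx in the text)
#   TOTAL   number of entries in QUERY
#   CHUNK   number of entries per spaln run
spaln_chunk_cmds() {
    genome=$1 query=$2 out=$3 total=$4 chunk=$5
    # Refuse a non-positive chunk size, which would never terminate.
    [ "$chunk" -ge 1 ] || return 1
    from=1
    i=1
    while [ "$from" -le "$total" ]; do
        to=$((from + chunk - 1))
        # The last chunk may be shorter than CHUNK.
        if [ "$to" -gt "$total" ]; then to=$total; fi
        printf "spaln -Q7 -O12 -o%sO%d -d%s '%s (%d %d)'\n" \
            "$out" "$i" "$genome" "$query" "$from" "$to"
        from=$((to + 1))
        i=$((i + 1))
    done
}
```

The number of entries in a fasta file can be obtained with grep -c '^>' query. After all runs have finished, the results are summarized with sortgrcd as in step 5.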

Example

Changes from previous versions

Added/modified in Ver. 2.0.4 (2011-9-10, 2011-10-11, 2011-11-17)

  1. A bug that occurred when the database file is constructed from multiple input sequence files has been fixed. Makefile and makblk.pl in the 'seqdb' directory have also been modified to accommodate multiple input files.
  2. The output with the -O3 option is changed from Psl-like to Bed format.
  3. The default parameters used in polyA detection have been modified to reduce the chance of removing genome-encoded polyA-like sequences.
  4. The procedure to collect related HSPs within and adjacent to significant blocks has been simplified.
  5. The procedure to assemble HSPs has been improved to reduce the chance of chimera formation among tandemly repeated paralogous genes.
  6. A bug concerning very long genes has been fixed.

Added/modified in Ver. 1.4.5 (2010-04-23, 2010-05-08, 2010-08-04)

  1. Graphical output is supplemented in the Web interface.
  2. -U option is added to spaln. With this option, alignment is computed without splicing. It may be useful when genomic fragment(s) are mapped/aligned with whole genome.
  3. Several small bugs in sortgrcd have been fixed.
  4. Occasional core dump of spaln with -M option has been remedied.
  5. A problem in the makdbs command that occurred when the character '|' is contained in the second or a later word of the header line of a fasta file has been fixed.

Added/modified in Ver. 1.4.4 (2009-12-01, 2010-01-25, 2010-02-6, 2010-02-23)

  1. -SN option is added to spaln. When N = 3 (default), both orientations of the query sequence(s) are examined. When N = 1 or N = 2, only the forward or the reverse orientation is examined, respectively. When N = 0, the orientation to be examined depends on the annotation (header line of FASTA format) of the query sequence. The list of phrases that indicate the orientation is given in the file "StrandPhrase" in the table/spalnXX/seqdb. If no phrase in this list is found in the annotation, both orientations are examined.
  2. -yaN option of spaln has been modified. When N = 0 (default), only canonical intron boundary sequences (GT..AG, GC..AG, AT..AC) are allowed. When N = 1, the third consensus is relaxed to AT..AN. When N = 2, one mismatch from GT..AG is allowed in addition to the relaxed consensus of AT..AN. When N = 3, any sequence is allowed as a boundary.
  3. Many bugs in sortgrcd have been fixed.
  4. -O15 option of sortgrcd outputs information about unique introns merged from the mappings of several transcripts.
  5. Unnecessary spaces in Gff3 format have been removed from outputs of both spaln and sortgrcd.
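
The four levels of the -yaN option in item 2 can be restated as a small sh/bash sketch. The function intron_ends_allowed is hypothetical: it mirrors the rules as worded above and is not spaln's own code. In particular, "one mismatch from GT..AG" is read here as at least three of the four terminal bases agreeing with GT..AG, which is an assumption.

```shell
# intron_ends_allowed is a hypothetical helper that mirrors the -yaN rules
# as described in the text; it is not spaln's own code.
# Usage: intron_ends_allowed N DONOR ACCEPTOR
#   DONOR     first two bases of the intron (e.g. GT)
#   ACCEPTOR  last two bases of the intron (e.g. AG)
# Returns 0 if allowed, 1 if not allowed, 2 on malformed input.
intron_ends_allowed() {
    level=$1
    ends=$(printf '%s%s' "$2" "$3" | tr 'acgtn' 'ACGTN')
    [ "${#ends}" -eq 4 ] || return 2

    # N = 3: any sequence is allowed as a boundary.
    if [ "$level" -ge 3 ]; then return 0; fi

    # All levels: the canonical pairs GT..AG, GC..AG and AT..AC.
    case $ends in
        GTAG|GCAG|ATAC) return 0 ;;
    esac
    if [ "$level" -eq 0 ]; then return 1; fi

    # N >= 1: the third consensus is relaxed to AT..AN.
    case $ends in
        ATA?) return 0 ;;
    esac
    if [ "$level" -eq 1 ]; then return 1; fi

    # N = 2: additionally allow one mismatch from GT..AG
    # (assumed: at least three of the four positions agree).
    case $ends in
        ?TAG|G?AG|GT?G|GTA?) return 0 ;;
    esac
    return 1
}
```

For instance, an intron with ends GG..AG is rejected when N = 0 or N = 1 but accepted when N = 2 or N = 3.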

Added/modified in Ver. 1.4.3c (2009-08-21)

  1. Some incompatibilities between documents and programs have been rectified.
  2. Species-specific parameters in table/spalnXX/seqdb have been updated.

Added/modified in Ver. 1.4.3b (2009-07-10) (2009-07-24)

  1. Sortgrcd has been modified so that files with many chromosome/contig entries can be handled.
  2. Spaln's write mode now accepts multiple arguments, each corresponding to a piece of the whole genome (e.g. each chromosomal sequence). This modification has made it possible for spaln to format a whole genomic sequence on a platform that cannot handle a large file (e.g. > 2GB). For example, the human genome is composed of 24 chromosomes. The total genome size exceeds 2 Gb, while the maximum chromosome size is less than 300 Mb. Even if a system fails to read the whole-genome file at once, spaln can still process the chromosomal sequences sequentially. In the example below, "my_genome" is the genome name, and chr1, chr2, ... chrn are chromosomal sequence files.
    % makdbs -nmy_genome -KD chr1 chr2 ... chrn
    % spaln -Wmy_genome.bkn -Xk12 -Xb54272 -XG2802688 -Xg64 -KD chr1 chr2 ... chrn
    Then, you can use spaln in a usual way:
    % spaln -Q7 -dmy_genome cDNAs
  3. A few bugs are fixed to prevent core dumps under some conditions.

Added/modified in Ver. 1.4.2 (2009-05-16) (2009-06-17)

  1. Bugs in sortgrcd have been fixed. sortgrcd now supports -O5 (intron-oriented format) output.
  2. Several bugs that were conditionally problematic have also been fixed.

Added/modified in Ver. 1.4.1 (2009-02-16) (2009-03-16)

  1. A combination of a protein sequence database and a genomic segment(s) is now supported. The protein sequence database must be formatted beforehand. At the run time, a set of ORFs in the genomic segment are translated and rapidly searched against the database sequences, and then the best-hit sequence is used as the template of the spliced alignment between the relevant portion of the genomic segment and the protein sequence retrieved.
  2. The format of .bkn and .bkp files has been slightly changed. This has introduced some incompatibility between formatted files and software: spaln Ver. 1.4 can use files formatted by spaln Ver. 1.3, but the reverse causes an error.
  3. The output form now includes Gff3 (gene or match form) and Psl-like format.
  4. Sortgrcd also supports Gff3 gene format in addition to the native format.
  5. The default intron-penalty function has been modified from the well-shaped to a 'generic' function derived from intron-length distributions of several species.
  6. Many small modifications have been made to improve the stability of the software. In particular, multiple outputs of the same locus have been suppressed when -M option is set.

Added/modified in Ver. 1.3.2. (2008-09-12) (2008-09-25) (2008-10-21)

  1. Divide by zero error with makblk.pl when genome size is less than 1Mb has been fixed.
  2. Missing genomic fragment name in -Q[0-3] mode has been recovered.
  3. A few bugs have been fixed to avoid core dumps due to incompatibility between HSP coordinates and sequence ends.
  4. Known exon boundary information can be incorporated into the objective function when a cDNA, as well as a protein, is used as a template. To do so, simply add the -yBN option, where N is the bonus given to each matched location of exon-exon boundary.
  5. Compile time errors on some systems have been fixed.

Added/modified in Ver. 1.3.1. (2008-08-21)

  1. The portability has been improved for spaln to run on 64-bit machines.
  2. The code has been cleaned up so that fewer warnings are output at compile time.
  3. 'make install' command is modified so that species-specific parameter files are copied to the working table/spalnXX/seqdb directory.
  4. The test procedure described at the end of this document now includes the case of protein sequence queries.

Added/modified in Ver. 1.3.0

  1. Protein sequences can now be used as queries. If properly formatted, users do not need to be conscious of the difference in the type of queries at the run time.
  2. A bug with the -O12 option has been fixed to restore normal output.
  3. The best amino acid substitution matrices are now chosen in different phases of execution. In this connection, the format of mdm files has been changed. Accordingly, the revised makmdm command must be run before the use of the new version of spaln.
  4. Combination of relevant blocks has been modified to improve sensitivity.
  5. The 'Salvage' procedure was added to find homologues that are missed by the formal criteria.

Download

Source Binaries Data

Copyright (c) 2007-2011 Osamu Gotoh all rights reserved