Spaln (space-efficient spliced alignment) is a
stand-alone program that maps and aligns a set of cDNA or protein sequences
onto a whole genomic sequence in a single job. Spaln adopts multi-phase
heuristics that makes it possible to perform the job on a conventional personal
computer running under Unix/Linux with limited memory. The program is written
in C++ and distributed as source codes and also as executables for a few platforms.
Unless binaries are not provided, users must compile the program on their
own system. Although the program has been tested only on a Linux operating
system, it is likely to be portable to most Unix systems with little or no
modifications. The accessory program sortgrcd sorts the gene loci found
by spaln in the order of chromosomal position and orientation.
[1] Gotoh, O. "
A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence"
Nucleic Acids Research 36 (8) 2630-2638 (2008).
[2] Gotoh, O. "
Direct mapping and alignment of protein sequences onto genomic sequence"
Bioinformatics 24 (21) 2438-2444 (2008).
[3] Iwata, H. and Gotoh, O., in preparation.
Present Version: 2.0.4, Last update: 2011-11-17
The major revision number of spaln has been updated
from 1 to 2. The new version incorporates additional features
for intron recognition in addition to other revisions. The major points of revision are as follows.
- The codes have been extensively rewritten in C++, so that they are no longer compiled by a C compiler.
- A heuristic routine is added after the HSP search to reduce the number of calls of the restricted DP routine.
- -t[N] option specifies the number of CPUs involved in a multi-thread operation. If N is omitted, all available CPUs are used.
- A branch point signal and an intron propensity based on its oligomer compositions are incorporated in the scoring system.
- The splice junction signals are composed of two terms: (1) universal signal that depends on the dinucleotide pair at the ends of an intron and (2) species-specific signal that depends on the sequence around each splice junction. The relative contributions of these two terms can be freely adjusted by -ySN option.
- The intron penalty is now automatically adjusted depending on the
values of other parameters.
- Species specific parameters are available for 61 divergent eukaryotes. Those parameters are stored in each subdirectory in the 'table' directory.
- A set of benchmark data named "SPAliBASE" is ready for download.
From source
To compile the source codes in the default settings, follow the instructions below. If you download the source file (spaln.2.0.4) in the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download. Note: The default destinations of the installed files of version 1 are somewhat different from those described here.
- bin : binaries
- doc : documents
- seqdb : sample sequences. In this directory you should format genomic or database files
- src : source codes
- table : parameter files used by spaln
To modify the location of executables and/or other settings, run 'configure --help' at step 6 below. (Warning: Full path name rather than relative path name must be given for executables or other directories as the arguments of the configure command.) These locations are hard coded in spaln. The locations of the 'seqdb' and 'table' directories will be respectively denoted by seqdb and table below. Hence, seqdb=download/spalnXX/seqdb, and table=download/spalnXX/table in the default settings.
- % mkdir download
- % cd download
- Download spalnXX.tar.gz
- % tar xfz spalnXX.tar.gz
- % cd ./spalnXX/src
- % ./configure [--help]
- Please manually edit Makefile if $(CC) does not indicate a C++ compiler or
- % CC=g++ ./configure [other options]
- % make
- % make install
- Executables are copied to ../bin
- makmdm program makes mutation data matrices of various PAM levels in the ../table directory
- % make clearall
- Add download/spalnXX/bin to your PATH
% setenv PATH $PATH:download/spalnXX/bin (csh/tsh)
$ export PATH=$PATH:download/spalnXX/bin (sh/bsh)
Preferably, you may add the above line in your start up rc file (e.g. ~/.bashrc)
Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
- If you have changed the location of table and/or seqdb directory after installation, set the env variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
- Proceed to Sequence data formation.
From binaries
Binaries for a 32 bit (spaln.2.0.4.linux32) or 64 bit (spaln.2.0.4.linux64) Linux machine are available. To use the binaries, follow the instructions below.
Case I: Assume the directory work is your workspace where every material is stored. In this case, seqdb=work.
- % mkdir work
- % cd work
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add work/bin to your PATH
Or move or copy work/bin/* to a directory on your PATH
- % mv ./table/* .; rmdir ./table
- % mv ./seqdb/* .; rmdir ./seqdb
- Now proceed to Sequence data formation.
Case II: Assume your workspace work is distict from seqdb
- % mkdir download
- % cd download
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add download/bin to your PATH
Or move or copy download/bin/* to a directory on your PATH
- % setenv ALN_TAB download/table (csh/tsh)
$ export ALN_TAB=download/table (sh/bsh)
- % setenv ALN_DBS download/seqdb (csh/tsh)
$ export ALN_DBS=download/seqdb (sh/bsh)
- Add the above lines to your rc file, so that you don't have to repeat the commands at every login time.
- Now proceed to Sequence data formation
If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format. The genomic sequence must be formatted before use.
- % cd seqdb
- Download or copy genomic sequences or protein database sequence in multi-fasta format.
- Chromosomal sequences should be concatenated into a single file. Alternatively, you can use multiple chromosomal files without concatenation. This procedure will be described at the end of this section. To render the 'make' command effective, the extension of the genomic sequence file should be '.mfa', and protein database sequence should be '.faa'. Hereafter, the file name is assumed to be xxxgnm.mfa or prosdb.faa. The total number of residues in a file must not be greater than or equal to 2**32.
- % make xxxgnm.idx (for genomic sequence) or
% make prosdb.idx (for protein database sequence)
-
This command converts the sequence into a binary format. Four files, xxxgnm.seq, xxxgnm.idx, xxxgnm.hed, and xxxgnm.grp, are constructed (prosdb instead of xxxgnm in case of make prosdb.idx). It may take several tens of minutes to construct the files for mammalian genome.
- % make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries) or
% make prosdb.bka (for protein database)
- This command makes the block index table. This process may take another several tens of minutes.
- Internally, the make command invokes
'spaln -Wxxxgnm.bkn -KD [Options] xxxgnm.mfa' or
'spaln -Wxxxgnm.bkp -KP [Options] xxxgnm.mfa' or
'spaln -Wprosdb.bka -KA [Options] prosdb.faa'.
- If xxxgnm.grp or prosdb.grp were successfully constructed at step 4 above, the option values below would be automatically calculated by script makblk.pl. WARNING: The estimated maximal gene size can be
inadequately small if only a part of the genome (e.g a single chromosome)
is formatted. At that time, explicitly specify the maximal gene size by
the -XGN option of makblk.pl or at the runtime of spaln.
N can have suffix 'k' and 'M' to indicate that the number is measured in kbp and Mbp, respectively.
- Options: (default value)
- -XkN: word size (11 for DNA, 5 for protein)
- -XGN: maximum gene length (262144)
- -XbN: block size (4096)
N < 65536 and 'genome size' < 65536 * N must be satisfied. An estimate of N is sqrt(genome size). For mammal, N is nearly equal to 54000.
- -XaN: a parameter used to filter excessively abundant words (10)
- -XgN: maximal distance in block number between 5' terminal and 3' terminal blocks (16)
- It is possible to generate xxxgnm.idx and other three files directly from the input files without concatenatation:
% makdbs -nxxxgnm -KD file1 ... fileN
% make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries)
This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
- Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta format (denoted by query below)
- Store query to work
- % cd work
- Run spaln in one of the following three modes. Spaln
does not support comparison between two genomic segments.
(A) % spaln -Q[0|1|2|3] [-ON] [-MN] [other options]
genome_segment query
(B) % spaln -Q[4|5|6|7] [-ON] [-MN] [other options]
-dxxxgnm query
(C) % spaln -Q[4|5|6|7] [-ON] [-MN] [other options]
-aprosdb query
Only a subset of queries may be examined if query is replaced with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
(A) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
(B) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
(C) % spaln -Q7 ...
However, the procedure will be simplified if a multi-thread operation is
used as follows:
(D) % spaln -Q7 -O12 -oxxx -dxxxgnm -t[N] query
Options: (default value)
- -HN: Output is suppressed if the alignment score is less than N. See also -pw. (35)
- -K[D|P|A]: Format either genomic DNA for sequence search with DNA (D) or Protein (P) queries or protein database sequences for search with a genomic segment (A). Use in combination with -W option below.
- -LS: Smith-Waterman-type local alignment. This option may prune out weakly matched terminal regions.
- -M[N]: Single or multiple output for each query
- No option (default): single locus
- No argument: Multiple loci up to the maximum number specified by the program (4 in the present implementation)
- N=1: Re-search the query region not aligned in the first trial. May be useful to detect chimera or fragmented genomic region
- N>1: Output multiple loci maximally up to N
- -ON: Select output format (4)
- N=0: Gff3 gene format
- N=1: Alignment
- N=2: Gff3 match format
- N=3: Bed format
- N=4: Exon-oriented format similar to the output of megablast -D 3
- N=5: Intron-oriented output
- N=6: Concatenated exon sequence
- N=7: Translated amino-acid sequence. Presently not very useful for cDNA queries because the entire exon rather than an ORF is translated
- N=8: Mapping (block) information only. Use with -Q4
- N=12: Output the same information as -O4 in the
binary format. If -oOutput is set, two files named
Output.grd and Output.erd will be created. Otherwise,
xxx.grd and xxx.erd will be created.
- -QN: Select algorithm (3)
- 0<=N<=3: Genomic segment in the fasta format given by the first argument vs. query given by the second argument. One may skip the formatting step described above if only this mode of operation is used.
- 4<=N<=7: Genome mapping and alignment. The genomic sequence must be formatted beforehand.
- N=0,4: DP procedure without HSP search. Considerably slow
- N=1,2,3,5,6,7: Recursive HSP search up to the level of (N % 4)
- -RS: Read block index table from file S.
If omitted, the X.bkn, X.bkp,
or X.bka file will be read depending on the type of query. The
appropriate file is searched for in the current directory, the directory
specified by the env variable 'ALN_DBS', and the
'seqdb' directory specified at the compile time in this order.
- -SN: Orientation of the DNA query sequence (0)
- N=0: The
orientation is inferred from the phrases (e.g. 5' end) in the header
line of each entry within a fasta file. If no information is available,
both orientations are examined, and the result with the better score is
reported.
- N=1: Forward orientation only
- N=2: Reverse-complement orientation only
- N=3: Examine both orientations
- -Txxx: Specify the species. For genome vs. DNA comparison, -yS flag should also be set in combination with this option. xxx corresponds to the subdirectory in the table/spalnXX/seqdb directory.
- -VN: Minimum space to induce Hirschberg's algorithm (16M)
- -WS: Write block index table to file S.
- -i[a|p]: Input mode with -Q[0N3].
- -ia: Alternative mode; a genomic segment of an odd numbered entry in the input file is aligned with the query of the following entry.
- -ip: Parallel mode; the i-th entry in the file specified by the first argument is aligned with the i-th entry in the file specified by the second argument.
- default: The genomic segment specified by the first argument is aligned with each entry in the file specified by the second argument.
- -oS: Destination of output file name (stdout)
- -pa: Terminal polyA or polyT sequence is not trimmed.
- -pw: Report result even if alignment score is below threshold value.
- -xBS: Bit pattern of seeds used for HSP search at level 1
- -xbS: Bit pattern of seeds used for HSP search at level 3
- -uN: Gap-extension penalty (3, 2, 2)
- -vN: Gap-opening penalty (8, 6, 9)
- -yaN: Dinucleotide pairs at the ends of an intron (0)
- N=0: Accept only the canonical pairs (GT..AG,GC..AG,AT..AC)
- N=1: accept also AT..AN
- N=1: allow up to one mismatch from GT..AG
- N=3: accept any pairs. An omission of N implies N = 3
- -yiN: Intron penalty. No longer recommended to use this option.
- -yjN: Incline of long gap penalty (0.6)
- -ykN: Flex point where the incline of gap penalty changes (7)
- -ylN: Double affine gap penalty if N=3; otherwise affine gap penalty
- -ymN: Score for a nucleotide match (2, 2)
- -ynN: Penalty for nucleotide mismatch (6, 2)
- -yoN: Penalty for a in-frame termination codon (100)
- -ypN: PAM level used in the alignment (third) phase (150)
- -yqN: PAM level used in the second phase (50)
- -yxN: Penalty for a frame shift (100)
- -yyN: Relative contribution of splicing signal (8)
- -yzN: Relative contribution of coding potential (2)
- -yAN: Relative contribution of the translational initiation or termination signal (8)
- -yBN: Relative contribution of branch point signal (0)
- -yEN: Minimum exon length (2)
- -yIS: Intron distribution parameters
- -yJN: Relative contribution of the bonus given
to a conserved intron position
- -yLN: Minimum intron length (30, 30)
- -yS: For
cDNA queries, use species-specific exon-intron boundary signals. For
protein queries, invoke the 'salvage' procedure to examine all blocks
with positive scores.
- -ySN: N specifies the percentile
contribution of the species-specific splice signal. The other part is
derived from the universal signal given to the dinucleotides at the ends
of an intron. An omission of N implies N = 100.
- -yX: For a DNA query, this option sets parameter
values for cross-species comparison. The actual values are given as the
second number in the parentheses of the above lines. -yS100 is also
automatically invoked. Conversely, this option specifies an intra-species
mode for a protein query, whereby -yS30 is invoked.
- -yYN: Relative contribution of length-dependent part of intron penalty (8)
- -yZN: Relative contribution of oligomer composition within an intron (0)
- % sortgrcd [options] *.grd
Sortgrcd is used to recover the output of spaln with -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs.
- Options:
- -CN: Minimum cover rate = % nucleotides in predicted exons / length of query (x 3 if query is protein) (0-100)
- -FN: Filter level (N=0: no; N=1: mild; N=2: medium; N=3: stringent)
- -IN: Minimum sequence identity (0-100)
- -HN: Minimum alignment score (35)
- -ON: Output mode. N=0: Gff3 gene form; N=4: same as -O4 of spaln (default)
- -VN: Internal memory size used for core sort. If the
data size is greater than N, the sorting procedure will be
done in pieces.
- -mN: Maximum number of mismatches within 20 bp from the nearest exon-intron boundary
- -nN: Maximum number of non-canonical (other than GT..AG, GC..AG, AT..AC) intron ends
- -uN: Maximum number of unpaired (gap) sites within 20 bp from the nearest exon-intron boundary
- By default, no filter listed above is applied.
- When the output of spaln is separated into several files, the combined results are subjected to the sorting. Although *.grd files are assigned as the argument, there must be corresponding *.erd files in the same directory.
- In the default output format, the gene structure corresponding to each transcript is delimited by a line starting with '@', whereas each gene locus is delimited by a line starting with '!'. Two transcripts belong to the same locus if their corresponding genomic regions overlap by at least one nucleotide on the same strand.
- With -O0 option, the outputs follow the instruction of Gff3 (http://www.sequenceontology.org/gff3.shtml) where a gene locus is defined as described above.
- To experience the flow of procedures with the samples in seqdb, type in the following series of commands after moving to seqdb.
% make ddignm.idx
% make ddignm.bkn
% make ddi.cdna
% make ddi.srd
% make ddignm.bkp
% make ddi.spp
% make ddiaa.idx
% make ddiaa.bka
Added/modified in Ver. 2.0.4 (2011-9-10, 2011-10-11, 2011-11-17)
- A bug when the database file is constructed from multiple input sequence files has been fixed. Makefile and makblk.pl in 'seqdb' directory are also modified to accord for multiple input files.
- The output with the -O3 option is changed from Psl-like to Bed format.
- The default parameters used in polyA detection have been
modified to reduce the chance of removing genome-encoded
polyA-like sequences.
- The procedure to collect related HSPs within and adjacent significant
blocks has been simplified.
- The procedure to assemble HSPs has been improved to reduce the chance
of chimera formation among tandemly repeated paralogous genes.
- A bug concerning with very long genes has been fixed.
Added/modified in Ver. 1.4.5 (2010-04-23, 2010-05-08, 2010-08-04)
- Graphical output is supplemented in the Web interface.
- -U option is added to spaln. With this option, alignment is computed without splicing. It may be useful when genomic fragment(s) are mapped/aligned with whole genome.
- Several small bugs in sortgrcd have been fixed.
- Occasional core dump of spaln with -M option has been remedied.
- A problem of makdbs command when the character '|' is contained in a second or later word in the header line of fasta file was fixed.
Added/modified in Ver. 1.4.4 (2009-12-01, 2010-01-25, 2010-02-6, 2010-02-23)
- -SN option is added to spaln. When N = 3 (default) both orientations
of query sequence(s) are examined. When N = 1 and N = 2, only forward and reverse
direction is examined, respectively. When N = 0, the orientation to be examined
depends on the annotation (header line of FASTA format) of the query sequence.
The list of phrases that indicate the orientation is given in the file
"StrandPhrase" in the table/spalnXX/seqdb. If no phrase in this list is found in the
annotation, both orientations are examined.
- -yaN option of spaln has been modified. When N = 0 (default), only
canonical intron boundary sequences (GT..AG, GC..AG, AT..AC) are allowed.
When N = 1, the third consensus is relaxed as AT..AN. When N = 2, one mismatch
from GT..AG is allowed in addition to the relaxed consensus of AT..AN.
When N = 3, any sequence is allowed as a boundary.
- Many bugs of sortgrcd have been fixed.
- -O15 option of sortgrcd outputs information about unique introns merged
from the mappings of several transcripts.
- Unnecessary spaces in Gff3 format have been removed from outputs of
both spaln and sortgrcd.
Added/modified in Ver. 1.4.3c (2009-08-21)
- Some incompatibilities between documents and programs have been rectified.
- Species-specific parameters in table/spalnXX/seqdb have been updated.
Added/modified in Ver. 1.4.3b (2009-07-10) (2009-07-24)
- Sortgrcd has been modified so that files with may chromosome/contig entries should be handled.
- Spaln's write mode now accepts plural arguments each corresponding to a piece of the whole genome (e.g. each chromosomal sequence). This modification has made it possible for spaln to format whole genomic sequence on a platform that cannot handle a large file (e.g. > 2GB). For example, the human genome is composed of 24 chromosomes. The total genome size exceeds 2 Gb, while the maximum chromosome size is less than 300 Mb. Even though a system fails to read the file of whole genome at a time, it can support spaln to process chromosomal sequences sequentially. In the example bellow, "my_genome" is the genome name, and chr1, chr2, ... chrn are chromosomal sequence files.
% makdbs -nmy_genome -KD chr1 chr2 ... chrn
% spaln -Wmy_genome.bkn -Xk12 -Xb54272 -XG2802688 -Xg64 -KD chr1 chr2 ... chrn
Then, you can use spaln in a usual way:
% spaln -Q7 -dmy_genome cDNAs
- A few bugs are fixed to prevent core dumps under some conditions.
Added/modified in Ver. 1.4.2 (2009-05-16) (2009-06-17)
- Bugs of sortgrcd has been fixed. sortgrcd now supports -O5 (intron-oriented format) output.
- Several bugs that were conditionally problematic have also been fixed.
Added/modified in Ver. 1.4.1 (2009-02-16) (2009-03-16)
- A combination of a protein sequence database and a genomic segment(s) is now supported. The protein sequence database must be formatted beforehand. At the run time, a set of ORFs in the genomic segment are translated and rapidly searched against the database sequences, and then the best-hit sequence is used as the template of the spliced alignment between the relevant portion of the genomic segment and the protein sequence retrieved.
- The format of .bkn and .bkp files is slightly changed. This has introduced some incompatibility between formatted files and software; spaln Ver.1.4 can use files formatted by spaln Ver.1.3, but the inverse would make an error.
- The output form now includes Gff3 (gene or match form) and Psl-like format.
- Sortgrcd also supports Gff3 gene format in addition to the native format.
- The default intron-penalty function has been modified from the well-shaped to a 'generic' function derived from intron-length distributions of several species.
- Many small modifications have been made to improve the stability of the software. In particular, multiple outputs of the same locus have been suppressed when -M option is set.
Added/modified in Ver. 1.3.2. (2008-09-12) (2008-09-25) (2008-10-21)
- Divide by zero error with makblk.pl when genome size is less than 1Mb has been fixed.
- Missing genomic fragment name in -Q[0-3] mode has been recovered.
- A few bugs have been fixed to avoid core dumps due to incompatibility between HSP coordinates and sequence ends.
- Known exon boundary information can be incorporated into the objective function when a cDNA, as well as a protein, is used as a template. To do so, simply add the -yBN option, where N is the bonus given to each matched location of exon-exon boundary.
- Compile time errors on some systems have been fixed.
Added/modified in Ver. 1.3.1. (2008-08-21)
- The portability has been improved for spaln to run on 64-bit machines.
- The codes are cleaned up so that less number of warnings should be output at the compilation time.
- 'make install' command is modified so that species-specific parameter files should be copied to the working table/spalnXX/seqdb directory.
- The test procedure described at the end of this document now includes the case of protein sequence queries.
Added/modified in Ver. 1.3.0
- Protein sequences can now be used as queries. If properly formatted, users do not need to be conscious of the difference in the type of queries at the run time.
- Bugs with -O12 option was fixed to regain normal output.
- The best amino acid substitution matrices are now chosen in different phases of execution. In this connection, the format of mdm files has been changed. Accordingly, the revised makmdm command must be run before the use of the new version of spaln.
- Combination of relevant blocks has been modified to improve sensitivity.
- The 'Salvage' procedure was added to find homologues that are missed by the formal criteria.
Source
Binaries
Data
Copyright (c) 2007-2011 Osamu Gotoh all rights reserved