Spaln (space-efficient spliced alignment) is a
stand-alone program that maps and aligns a set of cDNA or protein sequences
onto a whole genomic sequence in a single job. Spaln also performs spliced
or ordinary alignment after rapid similarity search against a protein sequence
database, if a genomic segment or an amino acid sequence is given as a query.
From Version 1.4, spaln supports a combination of protein sequence database and
a given genomic segment. From Version 2.2, spaln also performs rapid similarity
search and (semi-)global alignment of a set of protein sequence queries against
a protein sequence database. Spaln adopts multi-phase
heuristics that make it possible to perform the job on a conventional personal
computer running under Unix/Linux with limited memory. The program is written
in C++ and distributed as source code and also as executables for a few platforms.
Unless binaries are provided for their platform, users must compile the program on their
own system. Although the program has been tested only on a Linux operating
system, it is likely to be portable to most Unix systems with little or no
modification. The accessory program sortgrcd sorts the gene loci found
by spaln in the order of chromosomal position and orientation. From
version 2.3.2, spaln and sortgrcd can handle some gzipped files
without prior expansion if USE_ZLIB mode is activated upon compilation.
From version 2.3.2a, compressed query sequence file(s) may also be accepted.
From version 2.4.0, multiple files corresponding to different output
forms can be generated at a single run.
Gotoh, O. "A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence" Nucleic Acids Research 36 (8) 2630-2638 (2008).
Gotoh, O. "Direct mapping and alignment of protein sequences onto genomic sequence" Bioinformatics 24 (21) 2438-2444 (2008).
Iwata, H. and Gotoh, O. "Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features" Nucleic Acids Research 40 (20) e161 (2012).
Nagasaki, H., Arita, M., Nishizawa, T., Suwa, M., Gotoh, O. "Automated classification of alternative splicing and transcriptional initiation and construction of a visual database of the classified patterns" Bioinformatics 22 (10) 1211-1216 (2006).
To compile the source code with the default settings, follow the instructions below.
If you download the source file (spaln2.4.0) in the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download.
- bin : binaries
- doc : documents
- seqdb : sample sequences. In this directory you should format genomic or database files
- src : source codes
- table : parameter files used by spaln
To modify the location of executables and/or other settings, run 'configure --help' at step 6 below. (Warning: Full path name rather than relative path name must be given for executables or other directories as the arguments of the configure command.) These locations are hard coded in spaln. The locations of the 'seqdb' and 'table' directories will be respectively denoted by seqdb and table below. Hence, seqdb=download/spalnXX/seqdb, and table=download/spalnXX/table in the default settings.
- % mkdir download
- % cd download
- Download spalnXX.tar.gz
- % tar xfz spalnXX.tar.gz
- % cd ./spalnXX/src
- % ./configure [--help]
- If $(CC) in the generated Makefile does not indicate a C++ compiler, either edit the Makefile manually or rerun configure as follows:
- % CXX=g++ ./configure [other options]
- To make spaln and sortgrcd handle gzipped files, run the ./configure command with --use_zlib=1. Alternatively, you may manually edit the generated Makefile so that -DUSE_ZLIB=1 is included in the compile options.
- % make
- % make install
- Executables are copied to ../bin
- makmdm program makes mutation data matrices of various PAM levels in the ../table directory
- % make clearall
- Add download/spalnXX/bin to your PATH
% setenv PATH $PATH:download/spalnXX/bin (csh/tcsh)
$ export PATH=$PATH:download/spalnXX/bin (sh/bash)
Preferably, add the appropriate line above to your startup rc file (e.g. ~/.cshrc or ~/.bashrc).
Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
- If you have changed the location of table and/or seqdb directory after installation, set the env variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
- Proceed to Sequence data formation.
Binaries for a 32-bit (spaln2.0.4.linux32) or 64-bit
(spaln2.4.0.linux64) Linux machine are available. The
executables also run in the WSL environment of 64-bit Windows 10 without any modification.
To use the binaries, follow the instructions below.
Case I: Assume the directory work is your workspace, where all materials are stored. In this case, seqdb=work.
- % mkdir work
- % cd work
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add work/bin to your PATH
Or move or copy work/bin/* to a directory on your PATH
- % mv ./table/* .; rmdir ./table
- % mv ./seqdb/* .; rmdir ./seqdb
- Now proceed to Sequence data formation.
Case II: Assume your workspace work is distinct from seqdb
- % mkdir download
- % cd download
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add download/bin to your PATH
Or move or copy download/bin/* to a directory on your PATH
- % setenv ALN_TAB download/table (csh/tcsh)
$ export ALN_TAB=download/table (sh/bash)
- % setenv ALN_DBS download/seqdb (csh/tcsh)
$ export ALN_DBS=download/seqdb (sh/bash)
- Add the above lines to your rc file, so that you don't have to repeat the commands at every login time.
- Now proceed to Sequence data formation
If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format.
To perform genome mapping, the genomic sequence must be formatted before use. Formatting is optional for amino acid sequence database search.
- % cd seqdb
- Download or copy genomic sequences or protein database sequences in multi-fasta format. If spaln has been compiled with USE_ZLIB activated, gzipped files need not be uncompressed (the file name must be X.gz).
- To use 'makeidx.pl' command, chromosomal sequences must be concatenated into a single file. The extension of the genomic sequence file must be '.mfa' or '.gf', and protein database sequence must be '.faa', to render 'make' command effective. With 'spaln -W' command, these restrictions are not obligatory. Hereafter, the file name is assumed to be xxxgnm.gf or prosdb.faa.
The total number of residues in a file must be less than 2**32.
- To format xxxgnm.gf(.gz), run either of the following two commands, which are equivalent except that the former is faster, accepts multiple input files, and does not need a Makefile.
% spaln -W -K[D|P] [-XGMAX_GENE] [other spaln options] xxxgnm.gf(.gz) ...
% ./makeidx.pl -i[n|p|np] [-XGMAX_GENE] [other spaln options] xxxgnm.gf(.gz)
To format a protein database sequence, use either of the following two commands:
% spaln -W -KA [other spaln options] prosdb.faa(.gz) ...
% ./makeidx.pl -ia [other spaln options] prosdb.faa(.gz)
As the spaln -W command accepts multiple input files and generates all necessary files in a single operation, you can skip the following instructions if you use it.
- -KX (or corresponding -ix) option specifies the "block file" xxxgnm.bkx to be constructed, where X is 'A', 'D' or 'P' and x is 'a', 'n' or 'p'. The -inp option will construct both xxxgnm.bkn (for cDNA queries)
and xxxgnm.bkp (for protein queries) together with the xxxgnm.idx and associated files.
-KX option is mandatory. If -ix is omitted or x is empty, xxxgnm.idx and associated files are created but no block file is constructed.
- The block size and k-mer size are estimated from the genome size, unless explicitly specified (see below).
- If MAX_GENE (the length of the plausibly longest gene on the genome) is
not specified, MAX_GENE is also estimated from the genome size.
- WARNING: Do not forget to specify MAX_GENE if xxxgnm.gf represents only a part of the genome (e.g. a single chromosome). Otherwise, MAX_GENE may be seriously underestimated. MAX_GENE can take the suffix 'k' or 'M' to indicate that the number is measured in kbp or Mbp, respectively.
- -g option generates gzipped outputs.
- -tN option enables N-thread operation.
- -E option generates local lookup table.
Makeidx.pl command performs the following series of operations 5-6, if the input is a single sequence file.
- % make xxxgnm.idx (for genomic sequence) or
% make prosdb.idx (for protein database sequence)
- This command converts the sequence into a binary format. Four or five files, xxxgnm.seq, xxxgnm.idx, xxxgnm.ent, xxxgnm.grp, and optionally xxxgnm.odr, are constructed (prosdb instead of xxxgnm in the case of make prosdb.idx). It may take several tens of minutes to construct the files for a mammalian genome.
- % make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries) or
% make prosdb.bka (for protein database)
- This command makes the block index table. This process may take another several tens of minutes.
- Internally, the make command invokes
'spaln -Wxxxgnm.bkn -KD [Options] xxxgnm.gf' or
'spaln -Wxxxgnm.bkp -KP [Options] xxxgnm.gf' or
'spaln -Wprosdb.bka -KA [Options] prosdb.faa'.
- If xxxgnm.grp or prosdb.grp was successfully constructed at step 5 above, the option values below are calculated automatically by the script makblk.pl.
- Options: (default value)
- -XAN: Alphabet size of the reduced amino acids: 6 < N <= 20 (20)
- -XBS: Bit patterns of the spaced seeds. The pattern should be asymmetric when the number of patterns > 2.
- -XCN: Number of seed patterns: 0 <= N <= 5 (0: contiguous seed)
- -XGN: Maximum gene length (262144)
- -XaN: A parameter used to filter excessively abundant words (10)
- -XbN: Block size (4096)
N < 65536 and 'genome size' < 65536 * N must be satisfied. An estimate of N is sqrt(genome size). For mammals, N is nearly equal to 54000.
- -XgN: Maximal distance in block number between 5' terminal and 3' terminal blocks (16)
- -XkN: Word size (11 for DNA, 5 for protein)
- -XsN: Distance between neighboring seeds (= k)
- It is possible to generate xxxgnm.idx and other three files directly from the input files without concatenation:
% makdbs -nxxxgnm -KD file1 ... fileN
This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
% make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries)
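The two constraints stated above for the -XbN block size (N < 65536 and 'genome size' < 65536 * N), together with the sqrt(genome size) estimate, can be checked by hand before formatting. The following is only a minimal sketch: the helper names estimate_block_size and check_block_size and the 3 Gbp genome size are assumptions made for illustration, and the script is not part of the spaln distribution.

```shell
#!/bin/sh
# Sketch: estimate a block size N ~ sqrt(genome size) and test the two
# constraints given for -XbN. Helper names are hypothetical.

estimate_block_size() {
    # $1 = genome size in bp; prints floor(sqrt(genome size))
    awk -v g="$1" 'BEGIN { printf "%d\n", sqrt(g) }'
}

check_block_size() {
    # $1 = genome size in bp, $2 = candidate block size N
    # exit status 0 only if N < 65536 and genome size < 65536 * N
    awk -v g="$1" -v n="$2" 'BEGIN { exit !(n < 65536 && g < 65536 * n) }'
}

genome=3000000000   # assumed size of a mammalian-scale genome
n=$(estimate_block_size "$genome")
if check_block_size "$genome" "$n"; then
    echo "-Xb$n satisfies both constraints"
else
    echo "-Xb$n violates a constraint" >&2
fi
```

For a 3 Gbp genome the estimate is 54772, in line with the value of about 54000 quoted for mammals. The same check shows that the default block size of 4096 would violate the second constraint at this genome size.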
- Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta
or extended (multi-)fasta (see -O6 option)
format (denoted by query below). From 2.3.2a, gzipped fasta file(s) may
be used as the query without prior expansion. Note, however, that compressed
query can considerably slow down the execution rate.
- Store query to work
- % cd work
- Run spaln in one of the following four modes. Spaln
does not support comparison between two genomic segments.
(A) % spaln -Q[0|1|2|3] [-ON] [other options] genomic_segment query
(B) % spaln -Q[4|5|6|7] [-ON] [other options] -[d|D] xxxgnm query
(C) % spaln -Q[4|5|6|7] [-ON] [other options] -[a|A] prosdb query
(D) % spaln -Q[4|5|6|7] [-ON] [other options] prosdb.faa query
In the last case, prosdb.faa will be internally formatted, and the formatted results will be discarded after the end of execution.
Only a subset of queries may be examined if query is replaced with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
(a) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
(b) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
(c) % spaln -Q7 ...
However, the procedure will be simplified if a multi-thread operation is
used as follows:
(d) % spaln -Q7 -ON -oxxx -dxxxgnm -t[N] query
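The chunked commands (a), (b), (c) can be generated rather than typed by hand. The sketch below only prints the command lines and does not run spaln. The helper name print_chunk_commands, the total of 2500 query entries, and the chunk size of 1000 are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: print one spaln command per chunk of query entries, following
# the 'query (from to)' notation. print_chunk_commands is hypothetical.

print_chunk_commands() {
    total=$1    # number of entries in the query file
    chunk=$2    # entries per spaln run
    from=1
    i=1
    while [ "$from" -le "$total" ]; do
        to=$((from + chunk - 1))
        if [ "$to" -gt "$total" ]; then
            to=$total
        fi
        printf "spaln -Q7 -O12 -oxxxO%d -dxxxgnm 'query (%d %d)'\n" "$i" "$from" "$to"
        from=$((to + 1))
        i=$((i + 1))
    done
}

# Assumed example: 2500 query entries processed 1000 at a time.
print_chunk_commands 2500 1000
```

The printed lines can be reviewed and then dispatched to separate CPUs or a job scheduler. The resulting -O12 outputs are then combined with sortgrcd.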
Options: (default value)
- -CN: Use the genetic code specified by the "transl_table number" defined in NCBI transl_table (1).
- -E: Use local lookup table.
- -HN: Output is suppressed if the alignment score is less than N. See also -pw. (35)
- -LS: Smith-Waterman-type local alignment. This option may prune out weakly matched terminal regions.
- -M[N[.M]]: Single or multiple output for each query
- No option (default): single locus
- No argument: Multiple loci up to the maximum number specified by the program (4 in the present implementation)
- N=1: Re-search the query region not aligned in the first trial. May be useful to detect chimera or fragmented genomic region
- N>1: Output multiple loci maximally up to N
- M: Maximal number of candidate loci to be subjected to spliced alignment (4). If M < N, M is reset to N.
- -ON[,N2,N3...]: Select output format for genome vs cDNA or aa (4)
It is possible to output multiple files with extensions of .ON at a run by multiply applying this option, or by concatenating the format numbers with commas (,) or colons (:), ex. -O0,1,4. See also -o option.
- N=0: Gff3 gene format
- N=1: Alignment
- N=2: Gff3 match format
- N=3: Bed format
- N=4: Exon-oriented format similar to the output of megablast -D 3
- N=5: Intron-oriented output
- N=6: Concatenated exon sequence in extended (multi-)fasta format, in which the exon-intron structure of the parental gene is supplied by one
or more comment lines starting with ';C'.
- N=7: Translated amino-acid sequence in extended (multi-)fasta format. Presently not very useful for cDNA queries because the entire exon rather than an ORF is translated
- N=8: Mapping (block) information only. Use with -Q4
- N=10: SAM format
- N=12: Output the same information as -O4 in the
binary format. If -oOutput is set, three files named
Output.grd, Output.erd, and Output.qrd will be created. Otherwise,
query.grd, query.erd, and query.qrd will be created.
- -ON[,N2,N3...]: Select output format for aa vs aa (4)
It is possible to output multiple files with extensions of .ON at a run by multiply applying this option, or by concatenating the format numbers with commas (,) or colons (:), ex. -O0,1,4. See also -o option.
- N=0: statistics (%divergence alignment_score #match #mismatch #gap #unpaired)
- N=1: Alignment
- N=2: Sugar format
- N=3: Psl format
- N=4: XYL = Coordinate + match length
- N=5: statistics + XYL
- N=8: Cigar format
- N=9: Vulgar format
- N=10: SAM format
- -QN: Select algorithm (3)
- 0<=N<=3: Genomic segment in the fasta format given by the first argument vs. query given by the following arguments. See also -i option. One may skip the formatting step described above if only this mode of operation is used.
- 4<=N<=7: Genome mapping and alignment. The genomic sequence must be formatted beforehand.
- N=0,4: DP procedure without HSP search. Considerably slow
- N=1,2,3,5,6,7: Recursive HSP search up to the level of (N % 4)
- -RS: Read block index table from file S.
If omitted, the xxxgnm.bkn, xxxgnm.bkp,
or prosdb.bka file will be read depending on the type of query. The
appropriate file is searched for in the current directory, the directory
specified by the env variable 'ALN_DBS', and the
'seqdb' directory specified at the compile time in this order.
- -SN: Orientation of the DNA query sequence (3)
- N=0: The
orientation is inferred from the phrases (e.g. 5' end) in the header
line of each entry within a fasta file. If no information is available,
both orientations are examined, and the result with the better score is
reported. Terminal polyA or polyT sequence is not trimmed.
- N=1: Forward orientation only. PolyA tail may be trimmed off.
- N=2: Reverse-complement orientation only. Leading polyT may be trimmed off.
- N=3: Examine both orientations. Terminal polyA or polyT sequence may be trimmed off.
- -TS: Specify the species-specific parameter set. S corresponds to the subdirectory in the table directory. Alternatively, S may be the 1st or the 3rd term in table/gnm2tab file, where the 2nd term on the line indicates the subdirectory.
- -VN: Minimum space to induce Hirschberg's algorithm (16M)
- -WS: Write block index table to file S.bkx. If S is omitted, the file name (without directory and extension) of the first argument is used as S.
- -g: gzipped output used in combination with -W or -O12 option.
- -i[a|p]: Input mode with -Q[0..3].
- -ia: Alternative mode; a genomic segment of an odd numbered entry in the input file is aligned with the query of the following entry.
- -ip: Parallel mode; the i-th entry in the file specified by the first argument is aligned with the i-th entry in the file specified by the second argument.
- default: The genomic segment specified by the first argument is aligned with each entry in the file specified by the following arguments.
- -oS: Destination of output file name (stdout). If multiple output formats are specified by -O option(s), S specifies the directory or prefix to which the file names with .ON extensions are concatenated.
- -paN: Terminal polyA or polyT sequence longer than N (12) is trimmed off and the orientation is fixed accordingly. If N = 0 or empty, these functionalities are disabled.
- -pi: Mark exon-intron junctions by color in the alignment (-O1).
- -pq: Suppress warning messages sent to stderr.
- -pw: Report result even if alignment score is below threshold value.
- -px: Suppress self-comparisons in the execution mode (C) or (D).
- -xBS: Bit pattern of seeds used for HSP search at level 1
- -xbS: Bit pattern of seeds used for HSP search at level 3
- -uN: Gap-extension penalty (3, 2, 2)
- -vN: Gap-opening penalty (8, 6, 9)
- -yaN: Dinucleotide pairs at the ends of an intron (0)
- N=0: Accept only the canonical pairs (GT..AG,GC..AG,AT..AC)
- N=1: Also accept AT..AN
- N=2: Allow up to one mismatch from GT..AG
- N=3: Accept any pairs. An omission of N implies N = 3
- -yiN: Intron penalty (11, 8, 11)
- -yjN: Incline of long gap penalty (0.6)
- -ykN: Flex point where the incline of gap penalty changes (7)
- -ylN: Double affine gap penalty if N=3; otherwise affine gap penalty
- -ymN: Score for a nucleotide match (2, 2)
- -ynN: Penalty for nucleotide mismatch (6, 2)
- -yoN: Penalty for an in-frame termination codon (100)
- -ypN: PAM level used in the alignment (third) phase (150)
- -yqN: PAM level used in the second phase (50)
- -yxN: Penalty for a frame shift (100)
- -yyN: Relative contribution of splicing signal (8)
- -yzN: Relative contribution of coding potential (2)
- -yAN: Relative contribution of the translational initiation or termination signal (8)
- -yBN: Relative contribution of branch point signal (0)
- -yEN: Minimum exon length (2)
- -yIS: Intron distribution parameters
- -yJN: Relative contribution of the bonus given
to a conserved intron position
- -yLN: Minimum intron length (30, 30)
- -ySN: N specifies the percentile
contribution of the species-specific splice signal. The other part is
derived from the universal signal given to the dinucleotides at the ends
of an intron. An omission of N implies N = 100.
- -yXN: N = 0: set parameter values for intra-species
comparison. N = 1: set parameter values for cross-species comparison. The default value for N is 0 or 1 for DNA or protein query, respectively.
- -yYN: Relative contribution of length-dependent part of intron penalty (8)
- -yZN: Relative contribution of oligomer composition within an intron (0)
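The -QN value listed above encodes two things at once: the operation mode (0 to 3 versus 4 to 7) and the HSP search level (N % 4). The following sketch merely restates that rule as a small decoder. The function name describe_q is hypothetical and nothing here calls spaln.

```shell
#!/bin/sh
# Sketch: decode a -Q value according to the rules given in the option
# list. describe_q is a hypothetical helper, not a spaln feature.

describe_q() {
    q=$1
    case "$q" in
        [0-7]) ;;
        *) echo "invalid -Q value: $q" >&2; return 1 ;;
    esac
    if [ "$q" -le 3 ]; then
        mode="genomic segment vs. query"
    else
        mode="genome mapping and alignment"
    fi
    level=$((q % 4))
    if [ "$level" -eq 0 ]; then
        search="DP without HSP search"
    else
        search="recursive HSP search up to level $level"
    fi
    echo "-Q$q: $mode; $search"
}

describe_q 3
describe_q 4
describe_q 7
```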
Sortgrcd is used to recover the output of spaln with -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs. It is invoked by:
% sortgrcd [options] X.grd [Y.grd ...] or
% sortgrcd [options] X.grd.gz [Y.grd.gz ...]
- -CN: Minimum cover rate = % nucleotides in predicted exons / length of query (x 3 if query is protein) (0-100)
- -EN: Report only the best (N=1) or all (N=2) results per gene locus (1)
- -FN: Filter level (N=0: no; N=1: mild; N=2: medium; N=3: stringent)
- -IN: Minimum sequence identity (0-100)
- -HN: Minimum alignment score (35)
- -ON: Output mode. N=0 or 3-7: same as -ON of spaln; N=15: -O5 format for unique introns (default: N=4)
- -SC: Sort chromosomes/contigs in the order of C=a: alphabetical, b: abundance, c: appearance in genome database, r: reverse order for minus strand
- -VN: Internal memory size used for core sort. If the
data size is greater than N, the sorting procedure will be
done in pieces.
- -mN: Maximum number of mismatches within 20 bp from the nearest exon-intron boundary
- -nN: Maximum number of non-canonical (other than GT..AG, GC..AG, AT..AC) intron ends
- -uN: Maximum number of unpaired (gap) sites within 20 bp from the nearest exon-intron boundary
- By default, no filter listed above is applied.
- When the output of spaln is separated into several files, the combined results are subjected to the sorting. Although *.grd files are assigned as the argument, there must be corresponding *.erd and *.qrd files in the same directory.
- In the default output format, the gene structure corresponding to each transcript is delimited by a line starting with '@', whereas each gene locus is delimited by a line starting with '!' . Two transcripts belong to the same locus if their corresponding genomic regions overlap by at least one nucleotide on the same strand.
- With -O0 option, the outputs follow the instruction of Gff3 (http://www.sequenceontology.org/gff3.shtml) where a gene locus is defined as described above.
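The locus definition above (two transcripts share a locus if their genomic regions overlap by at least one nucleotide on the same strand) can be illustrated with a short sort/awk pipeline. This is a sketch under stated assumptions and not sortgrcd's implementation. The five-column input layout (chromosome, strand, start, end, transcript id, with 1-based inclusive coordinates), the helper name assign_loci, and the transitive treatment of overlaps are all assumptions.

```shell
#!/bin/sh
# Sketch: group transcripts into loci by same-strand overlap.
# Input layout and the helper name assign_loci are hypothetical.

assign_loci() {
    # stdin: chrom strand start end transcript_id
    # stdout: locus_number transcript_id
    LC_ALL=C sort -k1,1 -k2,2 -k3,3n |
    awk '{
        # Start a new locus when the chromosome or strand changes, or when
        # this transcript begins beyond the furthest end seen so far.
        if ($1 != chrom || $2 != strand || $3 + 0 > maxend + 0) {
            locus++
            maxend = $4
        } else if ($4 + 0 > maxend + 0) {
            maxend = $4
        }
        chrom = $1
        strand = $2
        print locus, $5
    }'
}

# Assumed toy data: t1 and t2 share exactly one nucleotide (position 500).
assign_loci <<'EOF'
chr1 + 100 500 t1
chr1 + 500 900 t2
chr1 - 450 800 t3
chr1 + 901 1200 t4
chr2 + 100 200 t5
EOF
```

In the toy data t1 and t2 fall into one locus, t4 starts one base past the end of t2 and therefore opens a new locus, and t3 is kept apart because it lies on the opposite strand.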
- To experience the flow of procedures with the samples in seqdb, type in the following series of commands after moving to seqdb.
% make dictdisc.cf
% make dictdisc.faa
% make dictdisc_g.gf
% ./makeidx.pl -inp dictdisc_g.gf
% make dictdisc.srd
% make dictdisc.spn
Added/modified in Ver. 2.4.0 (2019-11-18):
Added/modified in Ver. 2.3.3f (2019-08-26):
- Spaln can now directly format genomic sequences without relying on 'make' command. See Sequence data formation.
- The internal format of index files is slightly modified. Although previously-formatted files can be used by the new version, the opposite is not true. Note that use of older files with the new version can lead to a slight loss in sensitivity.
- The above change has been done to facilitate multi-thread operation at the format time.
- Multiple output forms can be produced at a single run. See -O and -o options.
- The traditional bidirectional Hirschberg algorithm is changed to the unidirectional variant.
- Also, the bidirectional 'sandwich' or 'attack by both sides' spliced alignment algorithm has been changed to unidirectional 'skipped' spliced alignment algorithm. This and the preceding changes have considerably reduced code complexity.
- Local lookup table (xxxgnm.lun or xxxgnm.lup) is generated and used with -E option. Use this option with caution, as a large disk space is required to store the generated file, and a large memory is required at the runtime.
- Paired-ends mode has been removed.
- Many small bugs have been fixed.
Added/modified in Ver. 2.3.3e (2019-07-23):
- Prevent segmentation fault invoked with -ia or -ip option.
Added/modified in Ver. 2.3.3d (2019-06-10):
- Prevent segmentation fault invoked with extremely short query sequence (< 2 x word size).
- Prevent possible infinite loop when query sequence possesses its own splice site information.
- Fix small bugs in gsinfo.c.
Added/modified in Ver. 2.3.3c (2019-05-02):
- Update help messages of spaln and sortgrcd.
- Fix a small bug in gsinfo.c.
Added/modified in Ver. 2.3.3b (2019-04-26):
- Update alignment engines to prevent rare segmentation faults.
Added/modified in Ver. 2.3.3a (2019-03-28):
- Minor changes in help/error messages and documentations.
Added/modified in Ver. 2.3.3 (2019-02-12):
- The heuristic alignment engine has been updated, resulting in marginal but significant improvement in speed and accuracy,
especially with -Q3/7 option.
- In this connection, copy constructor and assign (=) operator have been implemented for Mfile class.
- chachr.pl has been extended to accept GenBank/DDBJ and EMBL-formatted files in
addition to FASTA files. It may be used as a format converter.
Added/modified in Ver. 2.3.2a (2018-09-03):
- The maximal path size has been extended from 255 to 2047 characters.
- The NEVSEL constant value has been changed to avoid underflow of 2 * NEVSEL.
- In utilseq.c and .h, a member variable in class PatMat has been moved to a local variable to recover thread safety.
- When Spaln is run as '% spaln protein genome', the order of the 1st and 2nd arguments is exchanged with a warning message.
Added/modified in Ver. 2.3.2 (2018-07-26):
- From this version, compressed query fasta file(s) may be accepted.
- The new option of spaln '-g' directly generates compressed output(s) when used in combination with -W or -O12 option.
- makeidx.pl and makblk.pl have been modified to accord with gzipped genome/database fasta files.
- A small bug in makdbs.c has been fixed.
Added/modified in Ver. 2.3.1 (2017-06-22):
- From this version, input fasta files (X.mfa, X.gf, or X.faa), formatted data files (X.seq, X.bka, X.bkn, and X.bkp) for spaln, and X.grd, X.erd, and X.qrd for sortgrcd may be gzipped if USE_ZLIB mode is activated upon compilation. The compressed file name must end with .gz (e.g. X.gf.gz). Note: other data files (X.ent, X.grp, X.idx, and X.odr) must not be compressed.
- A serious bug concerning multiple queries has been fixed. This has considerably improved mapping sensitivity, particularly when -M option is set under single thread operation.
- Fixes for several smaller bugs and fine tuning of the code have further enhanced mapping sensitivity and specificity, particularly for short protein queries.
- -ON option of sortgrcd has been extended to support -O3 (BED), -O6 (cDNA = concatenated exons), and -O7 (translated amino acid sequence).
Added/modified in Ver. 2.3.0 (2017-02-15):
- The data structure of a block file (X.bkn, X.bkp, X.bka) has been simplified to reduce its size. With this modification, this version can still read block files formatted by earlier versions, but earlier versions of spaln cannot read block files formatted with this version of spaln.
- The internal representation of splicing junctions has been modified, which has led to considerable reduction in code size and memory requirement, and slight reduction in execution time.
- The algorithm for limiting the range of blocks to be passed to alignment phases has been modified. This modification will facilitate detection of the first or the last exon that is separated from the other exons by more than one block size.
- The default value for nucleotide mismatch penalty with -yX option has been restored to that of earlier versions.
Added/modified in Ver. 2.2.2a (2016-11-19):
- The limitation on the total number of residues in a database sequence less than 2**32 has been removed. To realize this update, the database format has been slightly changed. Spaln ver.2.3 can read database sequence formatted by a previous version of spaln, but the opposite is not true. In some cases, reformatting the database would result in better performance, even if the data size is less than 2**32.
- Several 'new' non-standard genetic codes are supported, following the update of NCBI's transl_table.
- Makeidx.pl has been updated. Formatting options such as -XCN may be transferred to spaln through this command. The default size of k-mers with -p option has been slightly modified.
- Miscellaneous small modifications have been done to facilitate better maintenance.
Added/modified in Ver. 2.2.2 (2016-05-06):
- In the mapping mode, an approximate genic range is now inferred from the query, if its gene structure is provided in the extended fasta format. Examples of the extended fasta format can be obtained by running spaln with -O6 or -O7 option. This modification relies on the observation that orthologous introns tend to have similar lengths, and has proven effective when the first intron or the few following it are so long that the 5' end of the gene is hard to locate with the previous version of spaln.
- The expression of genomic site numbers in an alignment is corrected.
- Excessive new line after sequence identifier is suppressed, when the header line of the query has no comment.
Added/modified in Ver. 2.2.1a-e (2016-02-22):
- Although implemented earlier, Spaln now formally supports rapid sequence similarity search against a protein sequence database with a set of protein sequence queries. Unlike blastp and some other programs for local sequence similarity search, spaln calculates (semi-)global alignments. Currently, p-values are not estimated, as spaln intends to find only a small number of strong similarities (to use the found sequence as a template for spliced alignment, when the query is a genomic segment).
- Specification of species-specific parameter set is made easier. An eight character genus-species identifier (e.g. mus_musc) or genus name (e.g. Mus) is now allowed as the argument to the -T option for specifying the corresponding parameter set (Tetrapod in this example). To activate this facility, new "gnm2tab" file must be installed in the "table" directory.
- A rare problem in which the mapping routine slowed down considerably due to internal repeats in query sequences has been fixed.
- Estimation of maximum gene length in a genome from the total genome size has been modified. Remember that you must set the expected maximum gene length by -XGN option (N is the expected maximum gene length) at the formatting phase or at each run time if your "genome" is not full size.
- Intron length distribution is modelled by a combination of up to three (formerly two) Frechet distributions.
Added/modified in Ver. 2.2.1 (2015-12-02):
- Surplus lines in seqdb/makblk.pl have been removed
- configure script has been modified.
- A few modifications have been made to accord with Mac OS X.
- Several bugs in minor run modes have been fixed.
- The problem of skipping the first sequence line in some input fasta files has been fixed.
Added/modified in Ver. 2.2.0 (2015-08-14):
- A few bugs have been fixed. In particular, output in SAM format has been corrected.
- For mapping amino acid sequences, the numbers of matches and mismatches are displayed correctly.
Added/modified in Ver. 2.1.4 (2015-01-30):
- One-time run of database search has been enabled. In this mode, you need not format database sequences beforehand. However, the formatted results are not retained.
- A few primitive bugs have been fixed. In particular, the problem with long header line has been fixed. Output formats with -O3 (BED format) and -O10 (SAM format) are also rectified.
Added/modified in Ver. 2.1.3 (2014-12-15):
- Three-byte addressing has been abolished. Accordingly, data files formatted with an older version of spaln can be used by the current version, but the opposite will result in a failure.
- A few primitive bugs have been fixed.
Added/modified in Ver. 2.1.2 (2013-10-17):
- -CN option is added to deal with non-standard genetic codes, where N denotes the "transl_table number" defined in NCBI transl_table. For example, -C6 option enables proper translation of Tetrahymena or Paramecium genes.
- A failure to identify the 3' splice junction followed by a long gap has been amended.
- A few bugs with -O1 option have been fixed.
Added/modified in Ver. 2.1.1 (2013-7-1):
- A few bugs in spaln and sortgrcd have been fixed.
- Inching toward more C++ style coding.
Added/modified in Ver. 2.1.0 (2013-5-15):
- The bugs in makdbs have been fixed.
- The placements of exon-intron boundaries are modified when spaln is run under -ya[N] (N = 2 or 3) options with protein queries.
From this version, the formats of the database files and output of -O12 option of spaln have been changed. These modifications are intended to eliminate the limitations on the lengths of identifiers of both genomic and query sequences (formerly up to 20 letters). To accord with these modifications, the associated programs, makdbs and sortgrcd, have been updated. Despite these modifications in the file formats, the user interface of each program is unchanged. spaln_2.1 can read database sequences formatted by an older version, provided that the length limitations on the identifiers are satisfied. However, sortgrcd_2 can no longer process outputs from an older version of spaln. From Version 2.1.3, spaln can deal with non-standard genetic codes.
The major revision number of spaln has been updated
from 1 to 2. The new version incorporates additional features
for intron recognition in addition to other revisions. The major points of revision are as follows.
- The codes have been extensively rewritten in C++, so that they are no longer compiled by a C compiler.
- A heuristic routine is added after the HSP search to reduce the number of calls of the restricted DP routine.
- -t[N] option specifies the number of CPUs involved in a multi-thread operation. If N is omitted, all available CPUs are used.
- A branch point signal and an intron propensity based on its oligomer compositions are incorporated in the scoring system.
- The splice junction signals are composed of two terms: (1) universal signal that depends on the dinucleotide pair at the ends of an intron and (2) species-specific signal that depends on the sequence around each splice junction. The relative contributions of these two terms can be freely adjusted by -ySN option.
- The intron penalty is now automatically adjusted depending on the
values of other parameters.
- Species specific parameters are available for 61 divergent eukaryotes. Those parameters are stored in each subdirectory in the 'table' directory.
- A set of benchmark data named "SPAliBASE" is ready for download.
Added/modified in Ver. 2.0.6 (2012-01-23):
- The limitations on the length of the sequence id of both genomic and query sequences are removed. In accordance with this modification, the makdbs and sortgrcd programs are updated.
- The default order of output from sortgrcd has been changed.
- A few bugs under the -M option are fixed.
- The limitation on the length of the sequence id (formerly max. 20 letters) has been removed. Note that the length of the entry id in a database file is still limited to 14 letters. This limitation will be removed in the next release.
- -XsN option is added to allow for overlapping k-mer seeds at the block search phase, where N (1<=N<=k) indicates the distance between adjacent seeds. The sensitivity of block search will be improved with a small N, which is particularly beneficial for mapping of short queries. Compared with the default setting (N=k, i.e. tiling seeds), however, the memory consumption for the k-mer table will be increased by the factor of k/N.
- Spaln now supports mapping of paired-end reads. Use -ip option for paired input files (matching 5' and 3' reads must appear in the same order) or -ia option for a single input file (matching 5' and 3' reads must appear alternately).
- The performance of the GvsA mode (query = genomic segment, database = amino acid sequences) has been improved.
- Memory leak errors have been fixed (20130226)
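The trade-off behind the -XsN option described above can be made concrete with a small Python sketch. It is an illustration under simple assumptions, not Spaln's implementation: the function name seed_positions is hypothetical, and seeds are assumed to start at position 0 and to be sampled every N bases.

```python
def seed_positions(seq_len, k, step):
    """Start positions of k-mer seeds sampled every `step` bases (1 <= step <= k)."""
    if not 1 <= step <= k:
        raise ValueError("step must satisfy 1 <= step <= k")
    return list(range(0, seq_len - k + 1, step))


if __name__ == "__main__":
    k = 12
    length = 1_000_000
    tiling = seed_positions(length, k, k)        # default: N = k, non-overlapping seeds
    overlapping = seed_positions(length, k, 4)   # N = 4, adjacent seeds overlap by 8 bases
    # The number of seeds, and hence the table size, grows by roughly k/N.
    print(len(overlapping) / len(tiling))
```

With k = 12 and N = 4 the number of sampled seeds is about three times that of the tiling case, which matches the k/N factor stated for the memory consumption of the k-mer table.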
Added/modified in Ver. 2.0.4 (2011-9-10, 2011-10-11, 2011-11-17)
Added/modified in Ver. 1.4.5 (2010-04-23, 2010-05-08, 2010-08-04)
- A bug that occurred when the database file is constructed from multiple input sequence files has been fixed. Makefile and makblk.pl in the 'seqdb' directory are also modified to accommodate multiple input files.
- The output with the -O3 option is changed from Psl-like to Bed format.
- The default parameters used in polyA detection have been
modified to reduce the chance of removing genome-encoded
- The procedure to collect related HSPs within and adjacent to significant
blocks has been simplified.
- The procedure to assemble HSPs has been improved to reduce the chance
of chimera formation among tandemly repeated paralogous genes.
- A bug concerning very long genes has been fixed.
Added/modified in Ver. 1.4.4 (2009-12-01, 2010-01-25, 2010-02-6, 2010-02-23)
- Graphical output is supplemented in the Web interface.
- -U option is added to spaln. With this option, alignment is computed without splicing. It may be useful when genomic fragment(s) are mapped/aligned with whole genome.
- Several small bugs in sortgrcd have been fixed.
- Occasional core dump of spaln with -M option has been remedied.
- A problem in the makdbs command that occurred when the character '|' was contained in the second or a later word of the header line of a fasta file has been fixed.
Added/modified in Ver. 1.4.3c (2009-08-21)
- -SN option is added to spaln. When N = 3 (default) both orientations
of query sequence(s) are examined. When N = 1 or N = 2, only the forward or reverse
orientation is examined, respectively. When N = 0, the orientation to be examined
depends on the annotation (header line of fasta format) of the query sequence.
The list of phrases that indicate the orientation is given in the file
"StrandPhrase" in the table/spalnXX/seqdb. If no phrase in this list is found in the
annotation, both orientations are examined.
- -yaN option of spaln has been modified. When N = 0 (default), only
canonical intron boundary sequences (GT..AG, GC..AG, AT..AC) are allowed.
When N = 1, the third consensus is relaxed as AT..AN. When N = 2, one mismatch
from GT..AG is allowed in addition to the relaxed consensus of AT..AN.
When N = 3, any sequence is allowed as a boundary.
- Many bugs of sortgrcd have been fixed.
- -O15 option of sortgrcd outputs information about unique introns merged
from the mappings of several transcripts.
- Unnecessary spaces in Gff3 format have been removed from outputs of
both spaln and sortgrcd.
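The four levels of the -yaN option listed above can be summarized as a small Python predicate. This is a reading of the description only, not code taken from spaln: the function name boundary_allowed is hypothetical, and "one mismatch from GT..AG" is assumed to be counted over the four boundary nucleotides taken together.

```python
def boundary_allowed(donor, acceptor, n=0):
    """Return True if the intron ends donor..acceptor are accepted at level n (0-3)."""
    donor, acceptor = donor.upper(), acceptor.upper()
    if len(donor) != 2 or len(acceptor) != 2:
        raise ValueError("donor and acceptor must be dinucleotides")
    if n == 3:
        return True                      # any sequence is allowed
    if (donor, acceptor) in {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}:
        return True                      # canonical boundaries, allowed at every level
    if n >= 1 and donor == "AT" and acceptor[0] == "A":
        return True                      # third consensus relaxed to AT..AN
    if n == 2:
        # Assumption: mismatches are counted over the concatenated ends.
        mismatches = sum(a != b for a, b in zip(donor + acceptor, "GTAG"))
        return mismatches <= 1
    return False


if __name__ == "__main__":
    for level in range(4):
        print(level, boundary_allowed("AT", "AG", level), boundary_allowed("GG", "AG", level))
```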
Added/modified in Ver. 1.4.3b (2009-07-10) (2009-07-24)
- Some incompatibilities between documents and programs have been rectified.
- Species-specific parameters in table/spalnXX/seqdb have been updated.
- Sortgrcd has been modified so that files with many chromosome/contig entries can be handled.
- Spaln's write mode now accepts plural arguments, each corresponding to a piece of the whole genome (e.g. each chromosomal sequence). This modification has made it possible for spaln to format a whole genomic sequence on a platform that cannot handle a large file (e.g. > 2GB). For example, the human genome is composed of 24 chromosomes. The total genome size exceeds 2 Gb, while the maximum chromosome size is less than 300 Mb. Even if a system fails to read the file of the whole genome at once, spaln can still process the chromosomal sequences sequentially. In the example below, "my_genome" is the genome name, and chr1, chr2, ... chrn are chromosomal sequence files.
% makdbs -nmy_genome -KD chr1 chr2 ... chrn
% spaln -Wmy_genome.bkn -Xk12 -Xb54272 -XG2802688 -Xg64 -KD chr1 chr2 ... chrn
Then, you can use spaln in a usual way:
% spaln -Q7 -dmy_genome cDNAs
- A few bugs are fixed to prevent core dumps under some conditions.
Added/modified in Ver. 1.4.2 (2009-05-16) (2009-06-17)
- Bugs in sortgrcd have been fixed. sortgrcd now supports -O5 (intron-oriented format) output.
- Several bugs that were conditionally problematic have also been fixed.
Added/modified in Ver. 1.4.1 (2009-02-16) (2009-03-16)
- A combination of a protein sequence database and a genomic segment(s) is now supported. The protein sequence database must be formatted beforehand. At the run time, a set of ORFs in the genomic segment are translated and rapidly searched against the database sequences, and then the best-hit sequence is used as the template of the spliced alignment between the relevant portion of the genomic segment and the protein sequence retrieved.
- The format of .bkn and .bkp files is slightly changed. This has introduced some incompatibility between formatted files and software; spaln Ver.1.4 can use files formatted by spaln Ver.1.3, but the reverse will cause an error.
- The output form now includes Gff3 (gene or match form) and Psl-like format.
- Sortgrcd also supports Gff3 gene format in addition to the native format.
- The default intron-penalty function has been modified from the well-shaped to a 'generic' function derived from intron-length distributions of several species.
- Many small modifications have been made to improve the stability of the software. In particular, multiple outputs of the same locus have been suppressed when -M option is set.
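The first step of the genomic-segment-versus-protein-database mode described above, collecting and translating ORFs from a genomic segment, can be sketched in Python as follows. This is a generic six-frame illustration, not the procedure implemented in spaln: the names reverse_complement and open_frames and the min_len threshold are hypothetical, the standard genetic code is assumed, and an "ORF" is taken here to be any stop-free stretch in one of the six reading frames. The database search and spliced alignment steps are not shown.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (NCBI transl_table 1).
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AAS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def open_frames(dna, min_len=30):
    """Yield (strand, frame, peptide) for stop-free stretches of at least min_len residues."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = "".join(
                CODON.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
            )
            # Stop codons ('*') delimit the candidate ORFs in this frame.
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    yield strand, frame, peptide


if __name__ == "__main__":
    for hit in open_frames("ATGGCCAAATAA", min_len=3):
        print(hit)
```

Each translated peptide would then serve as a query for the rapid search against the formatted protein database.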
Added/modified in Ver. 1.3.2. (2008-09-12) (2008-09-25) (2008-10-21)
- Divide by zero error with makblk.pl when genome size is less than 1Mb has been fixed.
- Missing genomic fragment name in -Q[0-3] mode has been recovered.
- A few bugs have been fixed to avoid core dumps due to incompatibility between HSP coordinates and sequence ends.
- Known exon boundary information can be incorporated into the objective function when a cDNA, as well as a protein, is used as a template. To do so, simply add the -yBN option, where N is the bonus given to each matched location of exon-exon boundary.
- Compile time errors on some systems have been fixed.
Added/modified in Ver. 1.3.1. (2008-08-21)
- The portability has been improved for spaln to run on 64-bit machines.
- The codes are cleaned up so that fewer warnings are output at compilation time.
- 'make install' command is modified so that species-specific parameter files should be copied to the working table/spalnXX/seqdb directory.
- The test procedure described at the end of this document now includes the case of protein sequence queries.
Added/modified in Ver. 1.3.0
- Protein sequences can now be used as queries. If properly formatted, users do not need to be conscious of the difference in the type of queries at the run time.
- Bugs with the -O12 option were fixed to restore normal output.
- The best amino acid substitution matrices are now chosen in different phases of execution. In this connection, the format of mdm files has been changed. Accordingly, the revised makmdm command must be run before the use of the new version of spaln.
- Combination of relevant blocks has been modified to improve sensitivity.
- The 'Salvage' procedure was added to find homologues that are missed by the formal criteria.
Copyright (c) 2007-2019 Osamu Gotoh all rights reserved