doi:10.1016/j.compbiolchem.2004.02.001
Copyright © 2004 Elsevier Ltd. All rights reserved.
An adaptive and iterative algorithm for refining multiple sequence alignment
Yi Wang and Kuo-Bin Li
Bioinformatics Institute, 30 Biopolis Street, Singapore 138671, Singapore
Received 2 December 2003;
Revised 10 February 2004;
Accepted 10 February 2004.
Available online 4 May 2004.
Abstract
Multiple sequence alignment is a basic tool in computational genomics. The art of multiple sequence alignment is about placing gaps. This paper presents a heuristic algorithm that iteratively improves multiple protein sequence alignments. A consistency-based objective function is used to evaluate candidate moves. During the iterative optimization, well-aligned regions are detected and kept intact. Columns of gaps are inserted to help the algorithm escape from locally optimal alignments. The algorithm has been evaluated using the BAliBASE benchmark alignment database. Results show that the performance of the algorithm depends little on the initial (seed) alignment. Given a perfect consistency library, the algorithm is able to produce alignments that are close to the global optimum. We demonstrate that the algorithm is able to refine alignments produced by other software, including ClustalW, SAGA and T-COFFEE. The program is available upon request.
Author Keywords: Iterative algorithm; Multiple sequence alignment; Alignment improver
Fig. 1. The iterative algorithms of AIMSA in pseudo code.
Fig. 2. An example of a block-gap.
Fig. 3. An example of the detection of well-aligned regions (denoted by bold typeface): (a) the whole region will be recognized as a well-aligned region when the maximum number of mismatches is set to four; (b) two well-aligned regions are to be recognized when the maximum number of mismatches is set to two.
Fig. 4. An example of indirect gap insertion: (a) the alignment initially has two well-aligned regions; (b) inserting two gaps directly would damage the second well-aligned region; (c) instead, columns of gaps can be inserted; (d) two gaps are then moved to the targeted insertion site; (e) the redundant gap columns can be removed easily. The alignment now has three well-aligned regions.
Fig. 5. An example showing that a poorly aligned region may be refined by AIMSA without damaging its neighboring well-aligned regions. The sequences in bold form a poorly aligned region between two well-aligned regions.
Fig. 6. The two inserted buffering zones affect only the sub regions A and C. As a result, additional column-gaps need to be inserted at a randomly chosen position so that adding or deleting gaps in the sub region B is possible.
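The column-gap operations described in the caption of Fig. 4 (steps c and e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AIMSA implementation: the alignment is assumed to be a list of equal-length strings with `-` as the gap character, and the function names `insert_gap_columns` and `remove_gap_columns` are hypothetical.

```python
def insert_gap_columns(alignment, position, count, gap="-"):
    """Insert `count` all-gap columns before column `position` in every row.

    Assumption: `alignment` is a list of equal-length strings.
    """
    return [row[:position] + gap * count + row[position:] for row in alignment]


def remove_gap_columns(alignment, gap="-"):
    """Drop every column that contains only gap characters."""
    # Keep the index of each column that has at least one residue.
    keep = [
        c
        for c, column in enumerate(zip(*alignment))
        if any(ch != gap for ch in column)
    ]
    return ["".join(row[c] for c in keep) for row in alignment]


if __name__ == "__main__":
    aln = ["ACGT", "A-GT"]
    widened = insert_gap_columns(aln, 2, 2)
    print(widened)
    print(remove_gap_columns(widened))
```

Because an all-gap column aligns no residues, inserting or removing one leaves every pairwise residue match unchanged, which is why such columns can serve as free working space and be discarded afterwards.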
Table 1. The average improvement on the COFFEE objective scores obtained by AIMSA

All scores (in percentage) are COFFEE scores, which measure the consistency between a multiple sequence alignment and the pairwise library. The COFFEE scores of the initial alignments (obtained from either ClustalW, SAGA or T-COFFEE) and the final alignments (from AIMSA) are listed.
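As a rough illustration of what a consistency score of this kind measures, the Python sketch below computes the percentage of residue pairs aligned in a multiple alignment that also appear in a pairwise library. It is a simplified, unweighted stand-in and not the COFFEE function itself: the published COFFEE score weights each sequence pair and normalizes differently. The data layout (a dictionary mapping a sequence-index pair to a set of residue-index pairs) and all function names are assumptions made for this example.

```python
from itertools import combinations

GAP = "-"


def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue-index pairs aligned between two rows."""
    pairs = set()
    i = j = 0  # running residue indices (gaps are not counted)
    for a, b in zip(row_a, row_b):
        if a != GAP and b != GAP:
            pairs.add((i, j))
        if a != GAP:
            i += 1
        if b != GAP:
            j += 1
    return pairs


def library_from_alignment(alignment):
    """Build a pairwise library from an alignment (e.g. a reference alignment)."""
    return {
        (s, t): aligned_pairs(alignment[s], alignment[t])
        for s, t in combinations(range(len(alignment)), 2)
    }


def consistency_score(alignment, library):
    """Percentage of aligned residue pairs that are also found in the library."""
    matched = total = 0
    for s, t in combinations(range(len(alignment)), 2):
        pairs = aligned_pairs(alignment[s], alignment[t])
        total += len(pairs)
        matched += len(pairs & library.get((s, t), set()))
    return 100.0 * matched / total if total else 0.0
```

Calling `library_from_alignment` on a trusted alignment gives a library against which any working alignment of the same sequences can be scored; an alignment scored against its own library reaches 100.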
Table 2. The average improvement on the alignment quality by AIMSA

All scores are BAliBASE sum-of-pair (SP) scores. Given a multiple alignment, a score of 100 indicates that all amino acid pairs in the core blocks are correctly aligned compared with the reference alignment. The definition of this sum-of-pair scoring scheme can be found in BAliBASE (Thompson et al., 1999b). The SP scores of the initial alignments (obtained from either ClustalW, SAGA or T-COFFEE) and the final alignments (from AIMSA) are listed. S.D. and average are the standard deviations and averages of the alignment scores across the five reference sets, respectively. Increment denotes the percentage increase achieved by AIMSA relative to the initial score.
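The sum-of-pair idea and the increment column can be made concrete with a short Python sketch. It is a simplification under stated assumptions: it counts every aligned residue pair of the reference alignment rather than only those in the BAliBASE core blocks, it assumes the test and reference alignments list the same sequences in the same order, and the function names are hypothetical.

```python
from itertools import combinations


def residue_pairs(alignment, gap="-"):
    """Set of (seq_s, res_i, seq_t, res_j) tuples for residues sharing a column."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for column in zip(*alignment):
        present = []
        for s, ch in enumerate(column):
            if ch != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        for (s, i), (t, j) in combinations(present, 2):
            pairs.add((s, i, t, j))
    return pairs


def sp_score(test, reference):
    """Percentage of reference residue pairs reproduced by the test alignment."""
    ref = residue_pairs(reference)
    if not ref:
        return 0.0
    return 100.0 * len(residue_pairs(test) & ref) / len(ref)


def increment(initial, final):
    """Percentage increase of the final score relative to the initial score."""
    return 100.0 * (final - initial) / initial
```

For example, an initial score of 50 refined to 75 corresponds to an increment of 50%, since the gain of 25 points is half of the initial score.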
Table 3. The average improvement of AIMSA given a “correct” guidance

The scores shown here are BAliBASE SP scores. Initial alignments are produced by ClustalW, SAGA or T-COFFEE. For each type of initial alignment, we list its original SP scores as well as those obtained after refinement by AIMSA. The correct guidance is provided by a manually created pairwise library based on the BAliBASE reference alignments. In other words, the COFFEE function now computes the consistency between a working multiple alignment and the correct answer.
Table 4. Comparison of the average time costs (in seconds) to complete a BAliBASE test in the five reference sets using ClustalW, SAGA, T-COFFEE and AIMSA

The initial alignments for AIMSA were created by T-COFFEE. The time cost of AIMSA does not include the time used to create the initial alignment. All tests were performed on a 1.4 GHz Pentium III computer. AIMSA is implemented in Java, whereas the other three programs are written in the C programming language.