Abstract

Patterns of linkage disequilibrium plays a central role in genome-wide association studies aimed at identifying genetic variation responsible for common human diseases. These patterns in human chromosomes show a block-like structure, and regions of high linkage disequilibrium are called haplotype blocks. A small subset of SNPs, called tag SNPs, is sufficient to capture the haplotype patterns in each haplotype block. Previously developed algorithms completely partition a haplotype sample into blocks while attempting to minimize the number of tag SNPs. However, when resource limitations prevent genotyping all the tag SNPs, it is desirable to restrict their number. We propose two dynamic programming algorithms, incorporating many diversity evaluation functions, for haplotype block partitioning using a limited number of tag SNPs. We use the proposed algorithms to partition the chromosome 21 haplotype data. When the sample is fully partitioned into blocks by our algorithms, the 2,266 blocks and 3,260 tag SNPs are fewer than those identified by previous studies. We also demonstrate that our algorithms find the optimal solution by exploiting the nonmonotonic property of a common haplotype-evaluation function.

1. Introduction

Single-nucleotide polymorphisms (SNPs) play important roles in disease association studies owing to their high abundance in the human genome, low mutation rate, and amenability to high-throughput genotyping. A small subset of SNPs directly influences the quality or quantity of the gene product and increases the risks of certain diseases or of severe side effects of drugs. Alleles of SNPs that are close together tend to be inherited together. A haplotype refers to a set of SNPs found to be statistically associated on a single chromosome. Haplotypes defined by common SNPs have important uses in identifying disease association and human traits [15].

Genome-wide association studies based on linkage disequilibrium (LD) offer a promising approach to detect genetic variation responsible for common human diseases. The patterns of linkage disequilibrium (LD) observed in human chromosome show a block-like structure [1, 6, 7], such that the entire chromosome can be partitioned into high-LD regions interspersed with low-LD regions. The high-LD regions are called haplotype blocks and the low-LD regions are referred to as recombination hotspots. A few common haplotypes account for most of the variation from person to person in haplotype blocks. Furthermore, each haplotype block, comprising large regions of low diversity, can be characterized with a small number of SNPs, which are referred to as tag SNPs [8]. Identification of tag SNPs is aimed at tagging candidate genes which can capture the most information in the haplotype blocks. The new technologies allow to genotype rarer variants than before [9]; therefore, there are more and more genotyping data needed to be analyzed, and the structure of haplotype blocks will be more complicated. Despite great progress in current genotyping developments which allow intensive genotyping at cheap prices, the concept of tag SNP selection is more and more significant due to exploded genotyping data. Most tag SNP selection strategies are based on haplotype blocks and have the aim of identifying a minimal subset of SNPs able to tag the most common haplotypes [7, 10].

Several methods have been used to identify haplotype-block structures, including LD-based [6, 11], recombination-based [12, 13], information-complexity-based [1416], and diversity-based [7, 17, 18] methods. The result of block partitioning and the meaning of a haplotype block may differ according to the assessment criteria. The diversity-based test methods can be classified into two categories: those that divide strings of SNPs into blocks on the basis of the decay of LD across block boundaries and those that delineate blocks on the basis of some haplotype-diversity measure within the blocks. Patil et al. [7] defined a haplotype block as a region in which a fraction of a percent or more of all the observed haplotypes are represented at least times or at a given threshold in the sample. They applied the optimization criteria outlined by Zhang et al. [10, 19] and described a general algorithm that defines block boundaries in a way that minimizes the number of SNPs that are required to identify all the haplotypes in a region. They identified a total of 4,563 tag SNPs and 4,135 blocks defining the haplotype structure of human chromosome 21 [7]. In each block they required that at least 80% of the haplotypes must be represented more than once in the block.

In this paper, we propose two dynamic programming algorithms, incorporating several haplotype diversity evaluation functions, for haplotype block partitioning with constraints on diversity and the number of tag SNPs. Part of data related to this paper have been open to public domain http://www.cs.pu.edu.tw/~yawlin/ (web-site page of a coauthor), these data are also published at local conference [20].

2. Diversity Functions

Several operational definitions have been used to identify haplotype-block structures [6, 7, 1118] and result in different results of block partitioning and meanings of a haplotype block, depending on the objective function. Haplotype blocks are genome regions with high LD, so that distinct haplotype patterns within the block are few and the diversity of the block is low. In terms of diversity functions, the block selection problem can be viewed as finding a segmentation of a given haplotype matrix such that the diversities of chosen blocks satisfy a certain value constraint. We use the following definitions.

Definition 1 (haplotype block diversity). Given an interval of a haplotype matrix and a diversity function, is an evaluation function measuring the diversity of the submatrix .

Given an haplotype matrix , a block of matrix is viewed as haplotype strings; they are partitioned into groups by merging identical haplotype strings into the same group. The probability of each haplotype pattern is defined accordingly, such that . Li [21] proposes a diversity function defined by

Note that is the probability that two haplotype strings chosen at random from are different from each other. Other measurements of diversity can be obtained by choosing different diversity functions; for example, to measure information complexity one can choose the information entropy (negative-log) function [1416] as follows:

Patil et al. [7] and Zhang et al. [10, 18] define a haplotype block as a region where at least 80% of observed haplotypes within a block must be common haplotypes. Using the same definition of common haplotype, the coverage of common haplotypes of the block can be formulated as a form of diversity as follows: where denotes unambiguous haplotypes, denotes common haplotypes, and denotes singleton haplotypes. In other words, Patil’s method requires that .

Some methods [6, 22, 23] presented a definition of a haplotype block based on the LD measure ; however, there is no consensus definition as yet. Zhang and Jin [22] defined a haplotype block as a region in which no pairwise values are lower than a threshold . Let denote a haplotype interval . We define the diversity as the complement of minimal of . By this definition, is a haplotype block if its diversity is lower than . Consider

Zhang et al. [22] also propose a definition for haplotype block; they require a proportion at least of SNP pairs having strong LD (pairwise greater than a threshold) in each block. We use this definition to redefine the function as the proportion of SNP pairs without strong LD. denotes the number of SNP pairs without strong LD in the interval . The diversity function is as follows: Diversity measurement usually reflects recombination events that occurred during evolution. Generally, haplotype blocks with low diversity indicate conserved regions of the genome.

Definition 2 (monotonic diversity). A diversity function is said to be monotonic if for any haplotype block (interval) of haplotype matrix , it follows that whenever ; that is, the diversity of any subinterval of is no larger than the diversity of .

The diversity functions (1) and (2) are monotonic. However, the evaluation function for common haplotypes proposed by Patil et al. [7] does not satisfy the monotonic property when the haplotype sample has missing data. For example, Figure 1 shows a small portion of human chromosome 21 haplotype sample provided in [7], where denotes missing data. In the sample, the common-haplotype coverage of interval is 9/10, which is greater than 80%. Therefore, according to the definition, it is a feasible haplotype block. In contrast, the common haplotype coverage of interval is 3/7, which is less than 80%, so that it is not a feasible haplotype block. Both interval and interval terminate at the same SNP locus, and the interval , which has more SNPs, is a feasible haplotype block, whereas interval is not. Tag SNPs can capture most of the haplotype diversity in blocks and thereby could potentially capture most of the information about association between a trait and the SNP marker loci. The diversity and features of each haplotype block can be described easily and economically with tag SNPs. For these reasons, we want to define the haplotype structure using as few tag SNPs as possible. In previous studies, Patil et al. [7] defined a haplotype block as a region in which a fraction of percent or more of all the observed haplotypes are represented at least times or at a given threshold in the sample. They applied the optimization criteria outlined by Zhang et al. [10, 18] and described a general algorithm that defined block boundaries in a way that minimizes the number of tag SNPs that are required to distinguish uniquely a certain percentage of all the haplotypes in a region. The greedy algorithm [7] identified a total of 4,563 tag SNPs and a total of 4,135 blocks to define the haplotype structure of human chromosome 21. In each block, they required that at least 80% of haplotypes are represented more than once in the block. In addition, Zhang et al. [10] used a dynamic programming approach to reduce the numbers of blocks and tag SNPs to 2,575 and 3,582, respectively.

Both of the algorithms [7, 10] fully partition the haplotype sample into blocks with the objective of minimizing the tag SNPs. However, when the resources are limited, investigators and biologists may be unable to genotype all the tag SNPs and instead must restrict the number of tag SNPs to be identified by the algorithms. In this paper, we propose two dynamic programming algorithms for the haplotype-block partitioning problem.

Problem 3 (longest--blocks). Given a haplotype matrix and a diversity upper limit , we wish to find disjoint blocks whose diversity is less then D such that the total length is maximized. That is, output the set , with for each , such that is maximized. Here denote the length of block .

Assuming that the given diversity function is monotonic and the given haplotype matrix is preprocessed for finding the indices of the farthest site, called good partner site, indices from current site, the longest--block problem can be solved in space and time. The good partner of locus refers to the left farthest locus from such that is a haplotype block whose diversity is less then the upper limit constraint. The idea of left good partner is shown in Figure 2.

Problem 4 (longest-blocks--tags). Given a haplotype matrix and a diversity upper limit , we wish to find a list of disjoint blocks whose total tag SNP number is less than such that the total length is maximized. That is, output the set such that (for all ) and denote the number of tag SNPs required for block , so that is maximized. Here denote the length of block .

Assuming that all of the feasible blocks and tag SNPs required for each block have been preprocessed, the longest-blocks--tags problem can be solved in time, where denotes the total number of feasible blocks. For the same sample used, based on the same criteria adopted by [23], our algorithm identifies a total of 2,266 blocks, which can be tagged by 3,260 tag SNPs. The number of blocks and tag SNPs we identified are 45.2% and 28.6% fewer than those identified by [7]. These results are also better than those by Zhang’s method with respect to the number of tag SNPs used and the total block numbers.

The definition of the haplotype-block diversity evaluation function we use in this paper is equal to the ratio of singleton haplotypes to unambiguous haplotypes in the blocks. It is also equal to 1 minus the ratio of common haplotypes to unambiguous haplotypes; in other words, 80% of common-haplotype coverage is equal to 20% (or 0.2) of haplotype diversity by the definition presented in [7]. That is, we require the diversity of each block to be ≤0.2. Here we propose two linear-space algorithms for these two problems.

3. Method

We propose two dynamic programming algorithms to partition haplotype blocks with constraints on diversity and number of tag SNPs. The proposed algorithms are able to find the longest segmentation into blocks such that the diversity of each block is less than an upper limit and the total number of tag SNPs required for these blocks does not exceed a specified number , and they are time-efficient and linear-space algorithms for solving Problems 3 and 4. In the first algorithm, the longest segmentation consisting of feasible blocks can be found in time and linear space after the preprocessing of the leftmost site (good partner site) and the rightmost site for each SNP marker . After partitioning blocks, we select tag SNPs in each block. Using this method, we can partition a haplotype into a minimum number of blocks with a modest number of tag SNPs. In the second algorithm, the longest segmentation covered by tag SNPs can be found in time after the preprocessing of left good partners for each marker and tag SNPs required for each feasible block. Using this method, we can partition a haplotype into a minimum number of blocks with a minimum number of tag SNPs.

3.1. Tag SNP Selection Algorithm

Our algorithms begin with the preprocessing of the farthest site (good partner) for each SNP marker. According to the haplotype block definition defined by [7], at least 80% of unambiguous haplotypes must be represented more than once. Using the same criteria as in [7], for each block we want to minimize the number of SNPs that distinguish uniquely at least 80% of the unambiguous haplotypes in the block. Those SNPs can be thought of as a signature of the haplotype-block partition.

In general, the number of tag SNPs required increases as the length of the haplotype block increases. But an exception is shown in Figure 3. The block consisting of 3 SNPs needs 3 tag SNPs to distinguish each haplotype uniquely, but the block consisting of 4 SNPs needs only 2 tag SNPs (column 2 and column 4).

The problem of finding the minimum number of tag SNPs within a block that uniquely distinguishes all the haplotypes is known as the MINIMUM TEST SET problem and has been proven to be NP complete [24]. Thus, there is no polynomial-time algorithm that guarantees to find the optimal solution for any input, though approximate, greedy algorithms have been proposed [2527]. In order to find the optimal solution, we adopt a brute-force method to find tag SNPs within a block. Our strategy for selecting the tag SNPs in haplotype blocks is as follows. First, the common haplotypes are grouped into distinct patterns by merging the compatible haplotypes in each block. After the missing data are assigned in each group, we determine the smallest number of tag SNPs required based on the smallest number of haplotype groups needing to be distinguished such that haplotypes in these groups contain at least 80% of the unambiguous haplotypes in the block. Finally, we select a locus set consisting of the minimum number of SNPs in the haplotypes such that at least 80% of the unambiguous haplotypes can be uniquely distinguished. An exhaustive search can be used very efficiently, given that the number of tag SNPs needed for each block is usually modest. The exhaustive-search algorithm shown in Algorithm 1 enumerates the -combinations in lexicographic order to generate the next candidate tag SNP set until each pattern can be uniquely distinguished.

FINDTAG( )   Find the number of tag SNPs required for haplotype block
       such that percentage of unambiguous haplotypes in can be
        distinguished uniquely.
Input: A percentage and haplotype block with unambiguous haplotype pattern
   ; the haplotypes number in each pattern is .
Output: The tag SNPs required for haplotype block .
(1) Sort haplotype patterns in .
   ’s are listed in decreasing order of the number of haplotype strings.
(2)    is the number of percentage of unambiguous haplotypes.
(3) Find the minimum number such that
(4)    is the minimum number of tag SNPs required.
(5) for     to     do initiate the tag SNP loci set .
(6)   
(7)
(8) while     do    is the total haplotype strings that can be distinguished.
(9)     generate the next -combination in lexicographic order.
(10)  while     do    is the length of haplotype string.
(11)     
(12)   if  
(13)    
(14)   for     to     do
(15)     
(16)   else
(17)    
(18)   for     to     do
(19)     
(20)    , is the haplotype that can be distinguished by tag SNP.}
(21) return  

3.2. A Linear-Space Algorithm for Haplotype Block Partitioning

Patil et al. [7] studied the global haplotype structure on chromosome 21 and identified 20 haplotypes for 24,047 SNPs spanning over about 32.4 Mbps. By the sample, they applied a greedy algorithm to partition the haplotype into blocks of limited haplotype diversity. Using the same criteria as in Patil et al., Zhang et al. [18, 23, 28] provided a dynamic programming algorithm to partition the same sample totally into 2.575 blocks and identify a total of 3,582 tag SNPs that are 37.7% and 21.5% smaller, respectively, then those identified by Patil et al. The space complexity for Zhang et al.’s algorithm is and the time complexity is , where is the total number of tag SNPs, is the total length of haplotype sample, and is the number of SNPs contained in the largest block. The idea behind the Zhang et al.’s algorithm is illustrated in Figure 4. The maximized segmentation consisting of disjoint blocks between sites 1 and with the constraint of using at most tag SNPs will have two cases, either the site is included in the last block of or not. If site is not included in the last block of , it will find between sites 1 and ; otherwise there will exist a site , , such that is the last block of . In the latter case, the tag SNPs required for block is , so it can find other blocks which are covered by other tag SNPs between site 1 and site .

Assuming a monotonic diversity function, the recurrence relation is

The idea behind the recurrence relation is that either the th block of the maximal segment in does not include site or the block must be the last block of . Note that can be determined in time if all of the and have been calculated. It follows that ’s can be calculated from the in time. Thus a computation proceeding from the , to the takes time. Lemma 5 presents the dynamic programming theory for the general case.

Lemma 5. Given a submatrix of an haplotype matrix and a diversity upper limit , for all constrained intervals , , find a segmentation consisting of k feasible blocks such that the total length can be maximized in time after the preprocessed leftmost markers (tag SNP selection), ’s are prepared.

Finding a segmentation that consists of feasible blocks and maximum total length can be completed using dynamic programming based on the recurrence relation. However, it is difficult to retrieve the intervals in linear space. To solve this problem, we can use a concept similar to that of [29]. We find a cut point to divide SNP sites into two parts, and , and then there are blocks in and blocks in and . We now have the following recursion relation. While , the boundaries of the block can be found by scanning the leftmost marker array and appending the longest feasible block in to a global data structure. The algorithm is shown in Algorithm 2.

LIS( )   Lis: List blocks in with maximized total length.
Input: Interval and number of blocks .
Output: 's blocks and their boundaries.
Global variable: .      and are used to store the good partner pointers and
      which have been preprocessed dependent on diversity constraints is used to store the
      results of LIS( ), are global temporary working storages.
(1) if     then
(2) Append the boundaries of the longest block in to
(3) return
(4) for     to     do Initiate the boundary condition of .
(5)     
(6) for     to     do Compute , .
(7)  for     to     do
(8)   if     then
(9)    
(10)   else
(11)    
(12)    if     then       , exceeding the boundary region.
(13)    
(14)    else
(15)    
(16)   
(17)   copy to for next iteration of    is used to store the temporary results of .
(18) for     to     do Initiate the boundary condition of .
(19)  
(20) for     to     do Compute .
(21)   for     downto     do
(22)   if     then
(23)    
(24)   else
(25)    
(26)   if     then      , exceeding the boundary region.
(27)    
(28)   else
(29)    
(30)   
(31)   copy to for next iteration of    is used to store the temporary results of .
(32)
(33) for     to     do Find .
(34)  if     then
(35)   
(36)   
(37) LIS     recursive call.
(38) LIS    recursive call.

Theorem 6 (longest--blocks). Given a haplotype matrix A and a diversity upper limit D, the longest k-block and their boundaries can be computed in time and space after the preprocessed left- and rightmost markers, and are prepared.

Proof. We propose an time algorithm, Lis, shown in Algorithm 1. Note that time suffices for preprocessing to find the rightmost markers and leftmost markers for each marker site as shown in [30].

In this algorithm, we use six global data structures involving arrays , , , ,, and -list. and are used to store the good partner points and that have been calculated in preprocessing. -list is used to store the boundaries of blocks. Arrays and are used to store the results of the and . During the computation of the and the , we use array , replacing a table to store temporary results that will be used to calculate further results. The size of each of arrays , , , , and is . The size of -list is , in the general case, so that the space used by the algorithm is .

The time complexity of the algorithm is as shown in the following by induction. Let denote the time needed for Lis. Assume that for all ,  . According to the algorithm, we have By induction, Letting , the above inequality is satisfied, so that we can prove the time complexity of the algorithm to be . Although we assume that the block diversity evaluation function we used here is monotonic, we can modify the algorithm slightly such that it can be applied to nonmonotonic blocks. In the case of nonmonotonic blocks, for each SNP , we use to denote the set of all such that is a feasible haplotype block. Let , where is the average number of for each marker . It can be shown that the modified algorithm uses time and space.

3.3. A Linear Space Algorithm for Haplotype Block Partitioning with Limited Number of Tag SNPs

Using a similar concept as in [31], we find a cut point to divide SNP sites into two parts, and , and use tag SNPs for and the other tag SNPs for such that the total size of blocks covered by tags in and tags in is maximized. We obtain the following recurrence relation:

The idea behind the recurrence relation is illustrated in Figure 5. Note that in order to maximize the total size of blocks tagged by SNPs and SNPs, we are unable to assign half of to directly, because in some cases the th and the th SNP will be used to tag the same block that is a member of the longest segmentation. If we use the first to the th SNPs to tag the blocks in and use the th to the th SNPs to tag the blocks in , we will not obtain the longest segmentation for SNPs. In the general case there will be many pairs of and solutions that satisfy our requirement. For time efficiency, we want to make and approximate half of and as nearly as possible. Let denote the maximum number of tag SNPs required among all feasible blocks. In order to find the appropriate value of and , we can examine in continuous possible values, and examine in all SNPs loci for each selection of . Since is small in the general case, we can find and quickly. After finding appropriate values of and , we can execute the steps recursively to partition the original problem to two subproblems repeatedly. Until , we simply use the dynamic programming algorithm to solve each subproblem. The algorithm traces back to output the boundaries of each block. The algorithm is shown in Algorithm 3.

HBPTS Lis: List blocks covered by tag SNPs in with maximized total length.
Input: Interval and number of tag SNPs .
Output: The boundaries of blocks covered by tag SNPs.
Global variable: .      and are used to store the good partner pointers and
      which have been preprocessed dependent on diversity constraints ,
      is used to store the tag SNPs required for each feasible blocks,
     two dimensional array and are global temporary working storages.
(1) if     then
(2)  for     to  T  do
(3)   for     to     do
(4)    Directly compute according to recursion relation 3.
(5)  Trace back on array to output the boundaries of blocks covered by tag SNPs.
(6)  return
(7) for     to     do
   Compute .
(8)  for     to     do
(9)   Compute by formula 3.
(10) for     to     do
   Compute , .
(11)  for     down to     do
(12)   Compute by formula 3 with opposite direction.
(13) Find = arg .
   Pointer back tracking to find the , and
(14) HBPTS      recursive call to report blocks in interval [i, x*] with t* tags.
(15) HBPTS     recursive call to report blocks in interval with tags.

Theorem 7 (longest-blocks--tags). Assume that the maximum number of tag SNPs required among all feasible blocks, , is a constant. Given a haplotype matrix , a diversity upper limit , and a number of tag SNPs t, find a segmentation S consisting of k feasible blocks such that (for all ) and , so that maximizing the total length of can be done in time and using linear space after the preprocessing of , , and , , for each SNP .

In the algorithm, named HBPTS, we use five global data structures involving arrays , , , , and . Arrays and are used to store the good partner points and for each SNP , and array is used to store the tag SNPs required for each feasible block. The data in arrays , , and were calculated in preprocessing, and the size of each array is , the number of all feasible blocks. In addition, we use a two-dimensional array for computing the and for computing the . Note that the computation of will compare the values of , , and . Therefore, if denotes the maximum , the maximum number of tag SNPs required among all feasible blocks, we need to store at most the values of and while computing the value of . In our experience, the will be equal to 8 at most, as seen, for example, in the haplotype data of Patil et al. [7]. Thus the space of two dimensional arrays and is , so the space complexity for the algorithm is . Since is generally a constant and in most practical cases, we can prove that the space used by the algorithm is . The flowchart of HBPTS is shown in Figure 6.

Proof. We propose an time algorithm, HBPTS, shown in Algorithm 2. The time complexity of the algorithm is as shown in the following by induction. Let denote the time needed for HBPTS. Assume that for all , . According to the algorithm, we have By induction, Letting , the above inequality will be satisfied, so that we can prove the time complexity of the algorithm to be .

4. Experiments

We applied our dynamic programming algorithms, which find the longest segmentation covered by a specific number of tag SNPs, to the haplotype data for chromosome 21 provided by Patil et al. [7]. The data contain 20 haplotype samples and each contains 24,047 SNPs spanning 32.4 Mb of chromosome 21. The minor-allele frequency at each marker locus is at least 10%. Using the proposed algorithms with the same criteria as in [7] with coverage of common haplotypes in blocks, 3,260 tag SNPs and 2,266 haplotype blocks are identified. In contrast, 4,563 tag SNPs and 4,135 blocks are identified in [7], and 3,582 tag SNPs and 2,575 blocks are identified in [10]. The proposed algorithms reduce the number of tag SNPs and blocks by 28.6% and 45.2% compared to [7]. We also demonstrate that the results shown in [10] are not optimal.

Table 1 shows a comparison of properties of haplotype blocks found by [7, 10] and our algorithms with 80% coverage of common haplotypes. The proposed algorithms discover 736 blocks containing more than 10 SNPs per block. Blocks with SNPs account for 32.5% of all blocks. The average number of SNPs for all of the blocks is 10.6. The largest block contains 128 common SNPs, which is longer than the largest block (containing 114 SNPs) identified by [7] and the same as identified by [10]. Tables 2 and 3 show more experimental results. According to these results, we can partition 38.6% of the genome region into blocks that require no tag SNPs. This is because most of these blocks contain only a few common SNPs, and 80% of the unambiguous haplotypes have the same haplotype pattern (are compatible) in these blocks. We term these SNP loci as uninformative markers because they are the same among most (80%) of the population. These data also show that as the covered genome region increases, we need to add more and more tag SNPs to capture the haplotype information of the blocks, and the number of zero-tagged blocks becomes fewer. Although the average length of non-zero-tagged blocks becomes shorter as the covered chromosome region increases, the average length of all blocks becomes longer.

Figure 7(a) shows the percentage of tag SNPs identified by the proposed algorithms when blocks cover a specified percent of the genome region. According to our experimental results, when blocks cover 70% of the genome region, the proposed algorithm required only 19.1% (about 623) of the tag SNPs to capture most of the haplotype information. This also indicates that the proposed method discovers that only a few tag SNPs are needed to capture most of the genome-region information. Figure 7(b) shows that the percentage of covered genome region increases while the tag SNPs identified by the proposed methods increase by 5%. Note that as the number of tag SNPs increases, the marginal percentage of genome region covered decreases. This indicates that, as the genome region coverage increases, fewer common SNPs are covered by each tag SNP on average. Figure 7(c) shows the added tag SNPs needed to increase the genome-region coverage by 5%. We find that as genome-region coverage increases, many more tag SNPs are needed to capture the haplotype information. In particular, when the genome-region coverage increases from 95% to 100%, we need to use another extra 1,014 tag SNPs, about 31.1% of the total tag SNPs. It is interesting to note that the proposed method discovers that the marginal utility of tag SNPs decreases as genome-region coverage increases. From the results, our algorithms obtain better results than those by the other methods [7, 10] on the same haplotype sample. One of the main reasons is that their algorithms presume that the common-haplotypes evaluation function satisfies the monotonic property. However, when the haplotype sample has missing data, the diversity function does not satisfy the monotonic property. For example, Table 5 shows the analysis results of Zhang’s and our algorithms on the same haplotype sample; this sample has just 69 SNPs, which is a small part of Patil’s haplotype data [7]; the contig number is NT 001035. Using the sample criteria (80% of common haplotype), our methods partition the sample into 20 blocks and identify 18 tag SNPs, whereas Zhang’s algorithm partitions the sample into 23 blocks and uses 22 tag SNPs. The results are similar in interval , but in interval , our methods discover 3 blocks and 3 tag SNPs, whereas Zhang’s gives 6 blocks and 6 tag SNPs. In interval , both Zhang’s and our methods find 2 blocks, but our method needs only 3 tag SNPs rather than the 4 found by Zhang’s method. These cases demonstrate that Zhang’s algorithm fails to find the optimal solution owing to the nonmonotonic property of the common-haplotype evaluation function.

We also apply our algorithm on biological data set from chromosome 20 from HapMap data bulk (http://www.hapmap.org), respectively. The data set contains 120 individuals that include 71,539 SNPs from the Yoruba in Ibadan, Nigeria (abbreviated YRI). We select the first 30 individuals and the first 5,000 SNPs for the input sample of the algorithm. Using the diversity function (3) with coverage of common haplotypes in blocks, the total number of blocks and tag SNPs identified by the algorithm is 293 and 1,047. The haplotype sample also being applied to Zhang’s method with minimum number of tag SNPs implemented on HapBlock [23], using the same criteria, 344 blocks and 1,184 tag SNPs are obtained by Zhang’s algorithm.

Table 4 shows a comparison of properties of haplotype blocks identified by Zhang’s and our algorithms with 80% coverage of common haplotypes. Our algorithm discovers 168 blocks containing more than 10 SNPs per block. Blocks with SNPs account for 57.3% of all blocks. The average number of SNPs for all of the blocks is 17.1. The largest block contains 92 common SNPs, which is longer than the largest block (containing 79 SNPs) identified by Zhang’s algorithm. Note that the haplotype sample has no missing data, and using the data set, the diversity function (3) will satisfy the monotonic property.

5. Conclusion

In this paper, we examine several haplotype diversity evaluation functions. By use of appropriate diversity functions, the block selection problem can be viewed as finding a segmentation of a given haplotype matrix such that the diversities of chosen blocks satisfy a given value constraint. Tag SNPs can capture most of the haplotype diversity in the blocks and thereby can potentially capture most of the information for association between a trait and the SNP marker loci. Instead of genotyping all of the SNPs on the chromosome, one may wish to use only the genotype information of tag SNPs. We can infer the haplotype features of most populations by genotyping only a few SNPs. Thus, identifying tag SNPs can dramatically reduce the time and effort needed for genotyping, without loss of much haplotype information.

We present two dynamic programming algorithms for haplotype-block partitioning such that total block length is maximized and the total tag SNPs required are minimized. We also show in Theorem 6 that finding the longest -block segmentation with diversity constraints can be done in time and space. In Theorem 7, we show that finding a maximum segmentation with limited tag SNPs number can be done in time; furthermore, we reduce the space complexity into . We point out that these efficiency results of our algorithms can be applied in many different definitions of diversity functions, provided that we can precompute the boundaries of all feasible blocks and tag SNPs required for these blocks.

We also show that the experimental results discovered by our methods are superior to those by Zhang’s algorithm. We demonstrate that owing to the nonmonotonic property of the common-haplotype evaluation function, Zhang’s algorithm will not find an optimal solution when the haplotype samples have missing data.

Acknowledgment

This research was supported by the National Science Council under the Grants NSC-100-2221-E-126-007-MY3.