doi:10.1016/j.dam.2005.09.021
Copyright © 2006 Elsevier B.V. All rights reserved.
Integer linear programming approaches for non-unique probe selection
aMathematics in Life Sciences, Free University Berlin, Arnimallee 3, D-14195 Berlin, Germany
bDFG Research Center MATHEON “Mathematics for Key Technologies”, Berlin, Germany
cAlgorithms and Statistics for Systems Biology, Genome Informatics, Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany
dComputational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, D-14195 Berlin, Germany
eAlgorithmic Bioinformatics, Free University Berlin, Takustr. 9, D-14195 Berlin, Germany
Received 17 July 2004;
revised 17 January 2005;
accepted 24 September 2005.
Available online 31 October 2006.
Abstract
In addition to their prevalent use for analyzing gene expression, DNA microarrays are an efficient tool for biological, medical, and industrial applications because of their ability to assess the presence or absence of biological agents, the targets, in a sample. Given a collection of genetic sequences of targets, one faces the challenge of finding short oligonucleotides, the probes, which allow detection of targets in a sample by hybridization experiments. The experiments are conducted using either unique or non-unique probes, and the problem at hand is to compute a minimal design, i.e., a minimal set of probes from which the targets in the sample can be inferred given the hybridization results. If we allow testing for more than one target in the sample, the design of the probe set becomes difficult in the case of non-unique probes.
Building upon previous work on group testing for microarrays we describe the first approach to select a minimal probe set for the case of non-unique probes in the presence of a small number of multiple targets in the sample. The approach is based on an integer linear programming formulation and a branch-and-cut algorithm. Our implementation significantly reduces the number of probes needed while preserving the decoding capabilities of existing approaches.
Keywords: Integer linear programming; Microarray; Probe; Oligonucleotide; Design; Group testing
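To illustrate what a minimal design means in the simplest setting (pairwise separation only, a single target in the sample), the following is a small sketch rather than the authors' method. It replaces the integer linear program and branch-and-cut algorithm with an exhaustive search, which is only feasible for tiny instances. The incidence matrix `H` and the function names `separates_all_pairs` and `minimal_design` are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy target-probe incidence matrix (not from the article):
# H[i][j] == 1 means probe candidate j hybridizes to target i.
H = [
    [1, 1, 0, 0],  # target 0
    [1, 0, 1, 0],  # target 1
    [0, 1, 1, 0],  # target 2
    [0, 0, 0, 1],  # target 3
]


def separates_all_pairs(H, probes):
    """Return True if every pair of targets differs on at least one chosen probe."""
    # Restrict each target's row to the chosen probe columns.
    signatures = [tuple(row[j] for j in probes) for row in H]
    # All pairs are separated exactly when no two signatures coincide.
    return len(set(signatures)) == len(signatures)


def minimal_design(H):
    """Exhaustively search for a smallest probe subset that separates all target pairs.

    Assumption: brute force stands in for the ILP formulation; it is
    exponential in the number of probe candidates.
    """
    n_probes = len(H[0])
    for size in range(1, n_probes + 1):
        for probes in combinations(range(n_probes), size):
            if separates_all_pairs(H, probes):
                return list(probes)
    return None  # no subset separates all pairs (e.g. two identical rows)


if __name__ == "__main__":
    print(minimal_design(H))
```

In this toy instance no single probe can distinguish four targets, so any valid design needs at least two probes. Separating groups of several targets, as considered in the article, would require additional constraints that this sketch does not model.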
Fig. 2. The two evolutionary tree models used for sequence family generation. Branch lengths are indicated next to sample branches. The 256 (respectively, 400) leaf sequences were taken as family members. In (a) not all children of the nodes are shown.
Table 1.
Target-probe incidence matrix H

Table 2.
For each artificial data set (a)1 to (b)5 and for the Markmann [6] meiobenthic data (M), the table shows the number m of targets, the number #cand of probe candidates, and the number of probes n chosen by the greedy design heuristic and the ILP approach, using pairwise separation only

Percentages represent the number of selected probes in relation to the number of probe candidates. The probe ratio nGreedy/nILP and the ratio tGreedy/tILP of the required design times are also shown.
Table 3.
Comparison of design size between greedy and ILP solution

The column targets contains the number of sequences in each data set and the column candidates the number of probe candidates in the incidence matrix H. The next four columns show the number of probes in the design matrix computed by the heuristic algorithm without and with random separation (500,000 random groups) and by the ILP-based algorithm without (ILP1) and with (ILP2) group separation (the maximal cardinality of groups was set to c=5).
Table 4.
Decoding results for the greedy heuristic design and the ILP design on the Markmann [6] data set (M)

Reading example: the value 0.93 for k=2 targets among “top 2” implies that 93 of the 100 true positives in the 50 repetitions of the Monte Carlo experiment were ranked among the top two by the decoding. Similarly, 98 of the 100 true positives were found among those ranked first to fourth.
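The arithmetic behind such a table entry can be sketched as follows, under the assumption that each repetition yields a ranked list of predicted targets and a set of k true targets. With k=2 and 50 repetitions there are 2 × 50 = 100 true positives in total, so 93 hits among the top two give 93/100 = 0.93. The function name `top_n_rate` and the toy rankings below are hypothetical and not taken from the article.

```python
def top_n_rate(rankings, true_targets, n):
    """Fraction of all true targets that appear among the top n ranked predictions.

    rankings:     one ranked list of target ids per repetition (best first)
    true_targets: one set of true target ids per repetition
    """
    # Count true targets recovered within the first n ranks of each repetition.
    hits = sum(len(set(ranked[:n]) & truth)
               for ranked, truth in zip(rankings, true_targets))
    # Total number of true positives over all repetitions (k per repetition).
    total = sum(len(truth) for truth in true_targets)
    return hits / total


if __name__ == "__main__":
    # Hypothetical toy data: 2 repetitions with k = 2 true targets each.
    rankings = [[3, 1, 2, 0], [0, 2, 1, 3]]
    true_targets = [{1, 3}, {0, 1}]
    print(top_n_rate(rankings, true_targets, 2))  # 3 of 4 true targets in the top 2
    print(top_n_rate(rankings, true_targets, 3))  # all 4 true targets in the top 3
```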
Table 5.
Decoding results for artificial data set (a)1

See Table 4 for further explanations.
Table 6.
Decoding results for artificial data set (b)3

See Table 4 for further explanations.
Table 7.
Comparison of decoding capabilities for data set (a)1 using the heuristic design, the ILP design with group separation and the ILP design without group separation

Reading example: averaged over all experiments with k=5 true targets, we found 91.2% of the true targets among the top 10 predicted ones in the heuristic design (1165 probes; cf. Table 3). The success rate decreases to 82.6% for the ILP design with group separation (515 probes) and to 75.2% without group separation (503 probes).
Table 8.
Comparison of decoding capabilities for data set (cl)1, similar to Table 7
