Elsevier

Pattern Recognition Letters

Volume 29, Issue 6, 15 April 2008, Pages 768-772
A compression algorithm for pre-simulated Monte Carlo p-value functions: Application to the ontological analysis of microarray studies

https://doi.org/10.1016/j.patrec.2007.12.007

Abstract

Monte Carlo simulation is frequently employed to compute p-values for test statistics with unknown null distributions. However, the computations can be exceedingly time-consuming, and in such cases pre-computed simulations can be used to increase speed. This approach is attractive in principle, but complicated in practice because the size of the pre-computed data can be prohibitively large. We developed an algorithm for computing size-reduced representations of Monte Carlo p-value functions. We show that, in typical settings, this algorithm reduces the size of the pre-computed data by several orders of magnitude, while provably bounding the approximation error at an explicitly controllable level. The algorithm is data-independent, fully non-parametric, and easy to implement. We exemplify its practical utility by applying it to the threshold-free ontological analysis of microarray data. The presented algorithm simplifies the use of pre-computed Monte Carlo p-value functions in software, including specialized bioinformatics applications.

Section snippets

Background

Monte Carlo (MC) simulation (Metropolis and Ulam, 1949) is widely used for estimating p-values when the distribution of the test statistic is unknown or cannot be computed exactly. In essence, the MC method artificially recreates the underlying chance process to generate a series of simulated test statistics under the null distribution, and defines the p-value for an observed statistic as the proportion of simulations greater than or equal to that value. Formally, given an
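
The upper-tail definition given in this snippet (the proportion of simulated statistics greater than or equal to the observed value) can be sketched in Python as follows. This is a minimal illustration rather than code from the article; the function name `mc_p_value` and the sample values are assumptions made for the example.

```python
from bisect import bisect_left


def mc_p_value(simulated, observed):
    """Upper-tail Monte Carlo p-value (hypothetical helper, not from the article).

    Returns the proportion of simulated statistics that are greater than
    or equal to the observed statistic.
    """
    ordered = sorted(simulated)
    # bisect_left returns the number of values strictly below `observed`,
    # so the remaining values are all >= observed.
    n_below = bisect_left(ordered, observed)
    return (len(ordered) - n_below) / len(ordered)


# Illustrative values only: 3 of the 5 simulations are >= 0.9.
simulations = [0.1, 0.5, 0.9, 1.3, 2.0]
p = mc_p_value(simulations, 0.9)
```

Evaluating this function exactly requires the full set of simulated statistics, which is the storage cost that motivates a compressed representation.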

Proposed algorithm

Given a set of simulated test statistics $X=\{x_k\}_{k=1}^{N}$, we seek to find a non-increasing function $\tilde{p}(x):\mathbb{R}\to[0,1]$ that approximates the Monte Carlo p-value function for the upper tail subject to the following competing constraints: first, the representation of $\tilde{p}(x)$ should be storage-efficient, i.e. the amount of data needed for its evaluation should be small compared to the size of $X$; second, it is desirable to have a way to control the approximation error. Here, we require that the relative
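
To make the two constraints concrete, the sketch below shows one generic way to build a small, non-increasing step function with a bounded relative error. It is not the algorithm from the article, whose details are not part of this excerpt. It is a simple geometric-rank scheme, chosen here only for illustration: it stores the r-th largest simulated value for a geometrically growing sequence of ranks r. All names (`compress`, `p_tilde`, `exact_p`, `eps`) are assumptions.

```python
import math
from bisect import bisect_right


def exact_p(simulated, x):
    """Exact upper-tail MC p-value: proportion of simulations >= x."""
    return sum(1 for s in simulated if s >= x) / len(simulated)


def compress(simulated, eps):
    """Keep the r-th largest value for geometrically spaced ranks r.

    Illustrative scheme only (an assumption, not the article's method).
    Consecutive ranks r and r' satisfy r' - 1 <= (1 + eps) * r.
    """
    desc = sorted(simulated, reverse=True)
    n = len(desc)
    ranks = [1]
    while ranks[-1] < n:
        nxt = math.floor((1 + eps) * ranks[-1]) + 1
        ranks.append(min(nxt, n))
    # Negate so the stored thresholds are ascending and bisect can be used.
    neg_thresholds = [-desc[r - 1] for r in ranks]
    return ranks, neg_thresholds, n


def p_tilde(x, ranks, neg_thresholds, n):
    """Non-increasing step function approximating the MC p-value from above."""
    # j = number of stored thresholds that are >= x.
    j = bisect_right(neg_thresholds, -x)
    if j == 0:
        return 0.0  # x exceeds the largest simulation: exact count is 0
    if j == len(ranks):
        return 1.0  # x is at or below the smallest simulation
    # The exact count c satisfies ranks[j-1] <= c <= ranks[j] - 1;
    # report the upper end so the p-value is never understated.
    return (ranks[j] - 1) / n
```

Under this scheme the approximation never falls below the exact p-value and exceeds it by at most a factor of (1 + eps), while the number of stored values grows roughly logarithmically in N for fixed eps. Reporting the upper end of each rank interval is a design choice that keeps the approximation conservative.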

Results

In this section, we first explore the performance of the presented algorithm for different simulation sizes and requested error bounds. We next exemplify the effect of varying the function used for reconstructing the MC p-values from the compressed data. Finally, we illustrate the utility of the algorithm by applying it to the ontological analysis of microarray data.

Discussion

We have presented an algorithm designed to facilitate the use of Monte Carlo p-values in cases when it is feasible and desirable to pre-compute the simulations but the sheer size of the data complicates its distribution. This problem arises in specialized data analysis applications, including, as shown, certain problems in bioinformatics.

The benefits of the proposed algorithm can be summarized as follows: first, the reconstruction error is explicitly controllable at design time via the

Conclusion

In conclusion, we propose a new algorithm for obtaining size-reduced representations of MC p-value functions, originally motivated by the ontological analysis of microarray data. Hence, this work contributes to the further development of pattern analysis applications in bioinformatics.

