Abstract
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of how large a database this medium can hold. As an example, this paper examines the problem of storing the Trésor de la Langue Française on a CD-ROM. The text of this database alone is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access the data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compressing the various files are reviewed, and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
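The coin-flip claim can be illustrated with a small experiment. The sketch below is not the paper's method or its model: it assumes one particular "simple model of text generation", namely symbols drawn independently from a hypothetical four-letter alphabet whose probabilities are exact powers of 1/2. Under that assumption every Huffman codeword has length equal to the negative base-2 logarithm of its probability, so each output bit should behave like a fair coin flip. The names `huffman_code`, `PROBS`, `encode_sample`, and `ones_fraction` are illustrative only.

```python
import heapq
import random
from itertools import count


def huffman_code(probs):
    """Build a Huffman code; returns a dict mapping symbol -> bit-string."""
    tiebreak = count()  # unique counter so equal weights never compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Assumption: a hypothetical source with dyadic (power-of-1/2) probabilities.
PROBS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}


def encode_sample(n_symbols, seed=0):
    """Draw n_symbols independently from PROBS and Huffman-encode them."""
    rng = random.Random(seed)
    code = huffman_code(PROBS)
    symbols = rng.choices(list(PROBS), weights=list(PROBS.values()), k=n_symbols)
    return "".join(code[s] for s in symbols)


def ones_fraction(bits):
    """Fraction of '1' characters in a bit-string."""
    return bits.count("1") / len(bits)


if __name__ == "__main__":
    bits = encode_sample(20000)
    print(f"{len(bits)} bits, fraction of ones = {ones_fraction(bits):.4f}")
```

With this source the fraction of ones should come out close to 0.5. A bit-frequency check is only a necessary condition for randomness, and real text is neither dyadic nor memoryless, so the sketch shows the flavor of the claim rather than verifying it.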
Index Terms
- Storing text retrieval systems on CD-ROM: compression and encryption considerations