
Storing text retrieval systems on CD-ROM: compression and encryption considerations

Published: 01 July 1989

Abstract

The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum-size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption. Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
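The coin-flip claim can be illustrated with a small experiment. The sketch below is not the authors' code. It assumes one particular "simple model of text generation" chosen for illustration: independent symbols whose probabilities are powers of 1/2 (here 1/2, 1/4, 1/8, 1/8). Under that assumption every branch of the Huffman tree is taken with probability 1/2, so the share of 1 bits in the encoded output should be close to one half. The names `huffman_code`, `encode`, and `ones_fraction_demo` are hypothetical.

```python
import heapq
import random


def huffman_code(weights):
    """Build a Huffman code {symbol: bitstring} from {symbol: weight}."""
    # Heap entries are (weight, tie_breaker, partial_code_table); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)  # two lightest subtrees
        w1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords of a symbol sequence."""
    return "".join(code[s] for s in symbols)


def ones_fraction_demo(n_symbols=100_000, seed=0):
    """Encode a random dyadic source and return the fraction of 1 bits."""
    # Assumed model: independent symbols with probabilities 1/2, 1/4, 1/8, 1/8.
    weights = {"a": 4, "b": 2, "c": 1, "d": 1}
    code = huffman_code(weights)
    rng = random.Random(seed)
    text = rng.choices(list(weights), weights=list(weights.values()), k=n_symbols)
    bits = encode(text, code)
    return bits.count("1") / len(bits)


if __name__ == "__main__":
    print(f"fraction of 1 bits: {ones_fraction_demo():.4f}")
```

For real text the symbol probabilities are not exact powers of 1/2, so the balance would only be approximate. A single bit-count statistic like this one is also far weaker than the indistinguishability result stated in the abstract.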



Reviews

Thomas C. Lowe

Practitioners will find this work's many observations interesting and perhaps immediately relevant. The authors discuss several related topics at levels that range from mathematical analysis to casual observations. The paper does not describe a system development process in the usual sense of presenting a system designed or built to meet a requirement, nor is it a paper on a single abstract research problem or topic. Rather, the authors present a collection of their observations and thoughts related to making a full-text CD-ROM–based storage and retrieval system for an unchanging natural language database, in which “We therefore can take as much time as needed on the fine tuning of the parameters of the compression procedures.”

The data are about 700 megabytes of collected writings (Trésor de la Langue Française) plus associated overhead (such as dictionaries and concordances). With suitable data compression, they could fit on a single CD-ROM. The data should be protected against wholesale copying of cleartext; the contents are all published works, however, so security requirements are aimed at preventing pilferage, not protecting secrets. Data compression therefore becomes part of and a basis for the requisite security. Coding is thus important first for compression and then for security. The authors develop an argument that a Huffman coding, perhaps itself simply encrypted, would provide all the necessary protection.

A discussion of dictionaries and concordance implementation follows; characteristics of the particular database under consideration are used to illustrate options. Apparently the concordances would contain only word location information. The authors briefly describe bit-mapped implementations, stop words, and other topics in the intended application context.
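As a rough illustration of a concordance that holds only word location information, the sketch below maps each word to the list of positions where it occurs. It is a minimal example and not the paper's implementation: the function name `build_concordance`, the lowercasing, and the regular-expression tokenizer are all assumptions made here.

```python
import re
from collections import defaultdict


def build_concordance(text):
    """Map each lowercased word to the 0-based positions where it occurs."""
    concordance = defaultdict(list)
    # Assumed tokenizer: a word is any maximal run of letters, digits, or "_".
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        concordance[match.group()].append(position)
    return dict(concordance)


if __name__ == "__main__":
    index = build_concordance("Le trésor de la langue et la langue du trésor")
    print(index["langue"])
```

A retrieval system would look a query term up in the dictionary and then read its position list from such a structure. Frequent words produce very long lists, which is one reason stop words and concordance compression come up in this context.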


Published in

ACM Transactions on Information Systems, Volume 7, Issue 3
July 1989
134 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/65943

Copyright © 1989 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
