Abstract
The infrastructure of a typical search engine can be used to calculate and resolve persistent document identifiers: strings that uniquely identify and locate a document on the Internet without reference to its original location (URL). Bookmarking a document with such an identifier allows it to be retrieved even if the document's URL and, in many cases, its contents change. Web client applications can let users bookmark a page by reference to a search engine and the persistent identifier instead of the original URL. The identifiers are calculated using a global Internet term index; a document's unique identifier consists of a word or word combination that occurs only in that document. We use a genetic algorithm to locate a minimal unique document identifier: the shortest word or word combination that will locate the document. We tested our approach by implementing tools for indexing a document collection, calculating the persistent identifiers, performing queries, and distributing the computation and storage load among many computers.
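The idea of a minimal unique document identifier can be illustrated with a minimal sketch, assuming a toy in-memory collection rather than a global Internet term index. The sketch uses exhaustive search over word combinations in place of the genetic algorithm described above, which is only practical for tiny vocabularies. The names `build_index`, `resolve`, `minimal_identifier`, and the sample `docs` are hypothetical and do not come from the article.

```python
from collections import defaultdict
from itertools import combinations


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def resolve(index, words):
    """Return the ids of the documents that contain every word in `words`."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def minimal_identifier(docs, index, doc_id, max_words=3):
    """Return the shortest word combination that matches only `doc_id`.

    Exhaustive search, shortest combinations first. This stands in for
    the genetic algorithm of the article and does not scale to a real
    term index. Returns None if no combination of up to `max_words`
    words is unique to the document.
    """
    vocabulary = sorted(set(docs[doc_id].lower().split()))
    for size in range(1, max_words + 1):
        for combo in combinations(vocabulary, size):
            if resolve(index, combo) == {doc_id}:
                return combo
    return None


# Hypothetical three-document collection used only for illustration.
docs = {
    "a": "genetic algorithm search",
    "b": "search engine index",
    "c": "genetic index engine",
}
index = build_index(docs)
identifier_a = minimal_identifier(docs, index, "a")
identifier_b = minimal_identifier(docs, index, "b")
```

In this toy collection a single word identifies document `a`, while document `b` needs two words because each of its words also occurs in another document. Resolving the identifier with `resolve(index, identifier_b)` returns the document regardless of where it is stored, which is the property the bookmarking scheme relies on.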
Cite this article
Spinellis, D. Index-Based Persistent Document Identifiers. Information Retrieval 8, 5–24 (2005). https://doi.org/10.1023/B:INRT.0000048494.05013.6a
DOI: https://doi.org/10.1023/B:INRT.0000048494.05013.6a