Abstract
This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour, and the best performance is reached at different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.
Cite this article
Pehcevski, J., Thom, J.A. & Vercoustre, AM. Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database. Inf Retrieval 8, 571–600 (2005). https://doi.org/10.1007/s10791-005-0748-1