1 Introduction

The web continues to expand, and the dominant search engines, Google and Yahoo!, claim to have indexed more than 20 billion pages (Mayer 2005). Recent statistics on Internet usage by language show that 31.2% of users are English-speaking and 68.8% are non-English-speaking (Internet World Statistics 2007b). As non-English web usage increases, a growing number of non-English queries must be handled by the search engines.

The goals of this research are to:

  (a) evaluate how well search engines respond to Greek language queries;

  (b) assess whether the Greek or the global search engines are more effective in satisfying user requests; and,

  (c) evaluate the extent of coverage of the Greek web by the ten search engines.

Preliminary results of the present study, as they pertain to (a) and part of (b) above, appeared in Efthimiadis et al. (2008). To achieve these goals the study was conducted as follows:

  1. a set of queries was searched in 10 search engines (5 Greek, 5 global) and the results were evaluated to see if the correct answer was returned;

  2. all the URLs found in the result sets were retrieved to identify the percentage that were live (active) or dead (non-active) links;

  3. a sample of 32480 active URLs from the Greek web was used to evaluate whether the search engines had them indexed.

The organization of the paper is as follows: Sect. 2 reviews related work, Sect. 3 gives a brief overview of the Greek language, Sect. 4 presents the methodology, Sect. 5 discusses the results, and the conclusions are given in Sect. 6.

2 Related work

Bar-Ilan and Gutman (2005) explored how search engines respond to queries in four non-English languages: Russian, French, Hungarian, and Hebrew. For each language they searched three global search engines, AltaVista, FAST, and Google, and two or three local engines. The local engines were the Russian Yandex, Rambler, and Aport; the French Voila, AOL France, and La Toile du Québec; the Hungarian Origo-vizsla, Startlap, and Heureka; and the Hebrew Morfix and Walla. For each of the four languages the authors developed queries that emphasized specific linguistic characteristics of that language. The first ten results of each search were evaluated not for relevance, but for whether the exact word form or a morphological variant of the query was retrieved. They found that the search engines ignored the special language characteristics and did not handle diacritics well.

Moukdad (2004) studied how three global search engines, AltaVista, AllTheWeb, and Google, handle Arabic queries compared to three Arabic engines, Al bahhar, Ayna, and Morfix. He employed the same methodology used by Bar-Ilan and Gutman (2005). A set of eight Arabic search terms was selected and run in the six search engines. He found that the global search engines had shortcomings in handling Arabic.

Sroka (2000) evaluated Polish versions of English language search engines and Polish search engines. The evaluation focused on search capability and retrieval performance. Precision was based on relevance judgments for the first 10 matches from each search engine. The overlap of retrieved documents and the response time for each search engine were recorded. Of the five search engines that were evaluated, Polski Infoseek and Onet.pl had the best precision scores, and Polski Infoseek turned out to be the fastest Web search engine.

Kelly-Holmes (2006) conducted a study searching with Irish Gaelic words on the Irish language version of Google. Five words from ‘typical’ and ‘non-typical’ domains for Irish were used, and the results were analyzed in terms of the “authenticity” of the search process and results, the language usage in the sites found through the search process, and the domains represented by the results. The study identified a number of problems encountered when searching using the Irish Gaelic language.

Bitirim et al. (2002) investigated the performance of four Turkish search engines, Arabul, Arama, Netbul, and Superonline, with respect to precision, normalized recall, coverage, and novelty ratios, using seventeen queries. The queries were carefully selected to assess a search engine's ability to handle broad or narrow topics, exclude particular information, identify and index Turkish characters, retrieve authoritative pages, stem Turkish words, and correctly interpret Boolean operators. Arama appeared to be the best Turkish search engine in terms of average precision, normalized recall, and coverage of Turkish sites. The handling of Turkish characters and stemming still caused problems for the Turkish search engines. Superonline and Netbul make use of the indexing information in meta-tag fields to improve retrieval results.

Griesbaum (2004) investigated the retrieval effectiveness of three popular German Web search services: AltaVista.de, Google.de and Lycos.de. Fifty queries were used, both in German and in their English translation, and the top twenty results were evaluated for precision. The findings indicated that Google performed significantly better than AltaVista, but there was no significant difference between Google and Lycos. Lycos also achieved better values than AltaVista, but the differences were not statistically significant. Compared to a similar study by the author in 2002 the results were similar, but the gaps between the engines had narrowed. The overall conclusion of the study was that the engines' retrieval performance was very similar.

Lazarinis (2007) evaluated the performance of eleven search engines, seven global (AlltheWeb, AltaVista, AOL, ASK, Google, MSN, Yahoo) and four Greek (Anazitisis, In.gr, Pathfinder, Robby), using six Greek language queries. He employed thirty-one users, divided into six groups, each group searching one query. Each group member retrieved twenty results, and the results retrieved by all group members were evaluated for relevance collectively by the members of each group. Lazarinis reports that the precision of all engines was very similar. Based on the six queries, the study further investigated how the engines handle upper and lower case input, diacritics, stemming, and stop words, and noted variations in the handling of Greek.

Moukdad and Cui (2005) investigated how Chinese language queries are handled by Google and AlltheWeb, as well as by the Chinese search engines Sohu and Baidu. They created ten queries by selecting terms from a Chinese-English dictionary; the terms emphasized certain linguistic characteristics of Chinese. The queries were searched in the Simplified Chinese script in use in mainland China, and the results were evaluated on the number of retrieved documents, word segmentation, and correct display of Chinese characters. Moukdad and Cui found that the global search engines did not apply any linguistic processing and thus could not process the Chinese queries satisfactorily, which introduced unexpected results.

3 The Greek language

Greek is a rich, highly inflectional language that dates to the 9th century BC. The Greek language uses a different script from that of Latin-based languages. The Greek alphabet has twenty-four upper case letters, twenty-five lower case letters (lower case sigma has two forms, σ and ς), and a number of diacritics or accent marks, depending on the form of the language used (see Fig. 1).

Fig. 1
figure 1

The Greek language alphabet

The most commonly known forms of the Greek language are ancient or classical Greek, Katharevousa, and Demotic Greek (Dhimotiki) (Babiniotis 1998). Depending on the system of accents used, Greek is either polytonic or monotonic. The polytonic orthography for Greek uses three accents, two breathings, iota subscripts, and the diaeresis. The polytonic system had been in use since ancient times and was simplified into the monotonic system in 1982. The monotonic system uses one accent and the diaeresis, which signifies that two adjacent vowels are pronounced separately and not as a diphthong.

Transliteration of Greek to Latin letters is common but adds to the complexity of processing Greek because of the different transliteration standards. Furthermore, individuals often ignore the standards and apply their own phonetic interpretation. The widespread use of computers and the Internet coupled with the slow progress in adopting non-Latin-based scripts has given rise to Greeklish, which is a form of transliteration used to exchange email messages and post to discussion forums (Karakos 2003; Tzekou et al. 2007).
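To make the variability concrete, the sketch below applies one possible Greeklish mapping in Python. The mapping table and function name are ours, and the table reflects just one common phonetic convention; as noted above, real users mix phonetic and visual conventions freely, so no single table is authoritative.

```python
# Illustrative only: GREEKLISH is one of many competing phonetic conventions.
GREEKLISH = {
    "α": "a", "ά": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "έ": "e",
    "ζ": "z", "η": "i", "ή": "i", "θ": "th", "ι": "i", "ί": "i",
    "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x", "ο": "o", "ό": "o",
    "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "y", "ύ": "y",
    "φ": "f", "χ": "x", "ψ": "ps", "ω": "o", "ώ": "o",
}

def to_greeklish(word: str) -> str:
    """Transliterate a Greek word character by character; characters not
    in the mapping (Latin letters, digits) pass through unchanged."""
    return "".join(GREEKLISH.get(ch, ch) for ch in word.lower())
```

Under this particular convention "χωριό" becomes "xorio"; another user might equally well write "xwrio" or "horio", which is exactly why Greeklish complicates matching.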

Alevizos et al. (1988) discuss the challenges faced by search systems in handling Greek. Kalamboukis (1995) introduces the inflectional aspects of Greek and presents a stemming approach.

4 Methodology

The methodology used in carrying out this research is presented in this section. A user need scenario is first introduced. The search engines and the search process are presented. The subject categories selected for the navigational queries follow. A discussion of the evaluation criteria used for each part of the study concludes the section.

4.1 User needs

The use of the Internet by Greeks saw a threefold increase between 2000 and 2006, jumping from 9.1% to 33.5% (Internet World Statistics 2007a). Similarly, the Greek web has proliferated, with an increasing presence of governmental and commercial entities. In both 2004 and 2006, most Greek web pages (63.5% and 63.4%, respectively) were in the Greek language (Efthimiadis and Castillo 2004). Most Greeks learn a second language to some degree of proficiency; however, it is reasonable to assume that Greeks would search in Greek to find information in the Greek web. Following the Broder (2002) classification of web queries we selected the "navigational" class as the basis of a user task definition. We assume that a user will search to find the specific site of an organization. In that respect our methodology relates to that of Hawking et al. (2001).

4.2 Search engines and the search process

Ten search engines were used in this study. These were divided into two groups, five global or international in scope, and five Greek search engines. The global search engines are: A9, AltaVista, Google, MSN Search (this is not Live Search, as Live was introduced after the study was concluded), and Yahoo!. The Greek engines are: Anazitisis, Ano-Kato, Phantis, Trinity, and Visto. The Appendix lists the engines and the corresponding URLs used to send the search requests.

A program in Java was developed to submit queries to each search engine automatically. The returned results were downloaded and stored in a MySQL database for further processing and analysis. The process is depicted in Fig. 2 and discussed throughout the methodology section.
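The submission step amounts to substituting each percent-encoded query into an engine-specific request URL. The Python sketch below illustrates the idea only; the study's actual harness was written in Java with MySQL storage, and the URL patterns shown here are hypothetical stand-ins for the entries listed in the Appendix.

```python
from urllib.parse import quote

# Hypothetical engine -> request-URL patterns (the real ones are in the
# Appendix; these two are illustrative stand-ins).
SEARCH_URL_TEMPLATES = {
    "google": "http://www.google.com/search?q={query}",
    "anazitisis": "http://www.anazitisis.gr/search?q={query}",
}

def build_request_url(engine: str, query: str) -> str:
    """Percent-encode the query, so that Greek (non-ASCII, UTF-8) keywords
    survive the query string, and substitute it into the engine's pattern."""
    return SEARCH_URL_TEMPLATES[engine].format(query=quote(query))
```

For example, the Greek keyword "χωριό" is sent as the UTF-8 byte sequence `%CF%87%CF%89%CF%81%CE%B9%CF%8C`.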

Fig. 2
figure 2

The search process

4.3 Subject categories and queries

Ten broad subject categories were identified using professional and business directories. The categories are: government departments, universities, colleges, travel agencies, museums, media (TV, radio, newspapers), transportation, and banks. Two hundred and seventeen (217) organizations that had a web presence were selected for searching. For each organization we established the formal name in Greek, its non-Greek equivalent if available (usually in English or other Latin-based language) and the URL(s) of the web site.

The URLs available for these organizations were used to download the corresponding webpages and verify that they were active. In addition, the robots.txt file was checked for every URL in order to establish whether there were any indexing restrictions on the page. At that time none of the organizations restricted search engines from crawling and indexing their pages. Consequently, all search engines should have had access to them.
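The robots.txt check described above can be sketched with Python's standard `urllib.robotparser`; the function name is ours.

```python
from urllib.robotparser import RobotFileParser

def crawl_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits crawling `path`.
    A missing or empty robots.txt imposes no restrictions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

An organization with no `Disallow` rules for `*` (the situation found for all 217 organizations) yields `True` for every page, i.e. no indexing restrictions.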

Queries were generated from the Greek and non-Greek (English or transliterated) versions of the names of the selected businesses or organizations. Table 1 lists the subject categories and the numbers of the Greek organizations that correspond to each category. There were a total of 217 organizations, of which 92 had a corresponding English or other non-Greek equivalent name, thus, resulting in 309 queries.

Table 1 Queries by subject category and language searched

Searches were submitted automatically to the engines in October 2004 and in August 2006; the exact same queries were used on both occasions. Examples of the queries are given in Table 2, which lists queries and their corresponding subject categories. Both the Greek form and the English or transliterated form of each name are given, together with the target URL. As appropriate, there is an indication of whether the non-Greek version of an organization's name is a direct translation into English, a transliteration, a combined form of translation and transliteration, or whether initials were used, new words were added, or part of the name was dropped. To simulate the input of a non-expert searcher, the queries were submitted in the typical lay-searcher format: the keywords typed out, separated by spaces. Advanced search operators and techniques were not used. Since these were "known item" searches, the ideal retrieval would be to get the target URL of that organization ranked first in the result set.

Table 2 Examples of queries used in the evaluation

4.4 Evaluation criteria

This section presents the criteria on which the evaluation was based. As this study aimed at evaluating both the effectiveness and the coverage of search engines in searching the Greek Web, the evaluation criteria are organized accordingly. The criteria used for evaluating the effectiveness of search engines are: (a) qualitative assessment of how the engines handle the Greek language; (b) precision at 10 documents (P@10); (c) mean reciprocal rank (MRR); (d) Navigational Query Discounted Cumulative Gain (NQ-DCG), a new heuristic evaluation measure developed for the study; (e) response time; and (f) the ratio of dead URL links returned. Coverage and freshness of the search engine indices are measured by the presence or absence of a large sample of Greek domain URLs and by the decay observed over the period of the study.

4.4.1 Evaluating search engines effectiveness in searching the Greek web

4.4.1.1 Greek language processing

Greek uses a different script from Latin, is highly inflectional, and has variable forms and orthography. To evaluate how the engines handle Greek, a set of queries was used that included keywords with and without accents. The results were qualitatively assessed in order to establish whether the engines take these differences into account.

4.4.1.2 Precision at 10 documents (P@10)

For each of the 309 queries searched, the top ten results were retrieved and evaluated. The methodology used for the evaluation includes the rank distribution of the successful results, failure rates, and precision at 10 documents (P@10). Precision at k documents is a well-established evaluation measure; however, it treats all top k answers equally.
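For a known-item (navigational) query, P@10 effectively reduces to whether the single target URL appears in the top ten. A minimal sketch, with function names of our choosing:

```python
def precision_at_k(results: list[str], target: str, k: int = 10) -> float:
    """Fraction of the top-k results equal to the target URL.
    For a navigational query this is 1/k when the target appears once
    in the top k, and 0 otherwise."""
    return sum(1 for url in results[:k] if url == target) / k

def success_at_k(results: list[str], target: str, k: int = 10) -> int:
    """1 if the target URL appears anywhere in the top k, else 0;
    per-engine success rates aggregate this value over all queries."""
    return int(target in results[:k])
```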

4.4.1.3 Mean reciprocal rank (MRR)

Each search engine is scored using the mean reciprocal rank (MRR) of the target URL (Hawking and Craswell 2002; Voorhees 1999). The reciprocal rank is the inverse of the rank at which the correct target URL is found; these values are then averaged across all queries. A score of zero is assigned if no correct target URL is found in the top 10 results.
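The definition above can be written out directly; the sketch below (function names are ours) assigns 1/rank when the target is found in the top ten and zero otherwise, then averages.

```python
def reciprocal_rank(results: list[str], target: str, k: int = 10) -> float:
    """1/rank of the target URL within the top k, or 0.0 if absent."""
    for rank, url in enumerate(results[:k], start=1):
        if url == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], str]], k: int = 10) -> float:
    """Average the reciprocal ranks over (result list, target URL) pairs."""
    return sum(reciprocal_rank(r, t, k) for r, t in runs) / len(runs)
```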

4.4.1.4 Navigational query discounted cumulative gain (NQ-DCG)

The Navigational Query Discounted Cumulative Gain (NQ-DCG) is a new heuristic evaluation measure developed for the study. For every search the top ten results were downloaded and their rank order was recorded. These were then evaluated as to whether the target URL or its variants were found in the result set. For exact or partial matches the rank position was recorded. The measure includes two components, the rank position, and the depth of the page as indicated in the URL. The latter gives some credit for partial matches, assuming that the searcher will be able to identify that the returned result is related to the desired result. This way the search engine is penalized for the additional navigational effort that will be required by the user. This heuristic evaluation measure relates to the discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen 2000, 2002).

A more formal description of the measure is given below. If m is the number of search engines examined, then each search engine j, (where j = 1,2,…,m) is allocated a score based on the first k results for each query. If the position of the returned result is i (where i = 1,2,…,k) and V ji is the value of the returned result at position i for engine j, then

$$ V_{ji} = k-i + 1 $$

The contributed score (W ji ) of result at position i to the search engine j, is thus calculated as

$$ W_{ji} = \begin{cases} (k - n)\, V_{ji}, & n < k \\ 0, & n \ge k \end{cases} $$
(1)

where j = 1,2,…,m, and i = 1,2,…,k; where n is the number of subdomains in the returned URL result and n < k. Hence, if no subdomains exist in the returned URL then n = 0. Finally, the total score NQ-DCG j (where j = 1,2,…,m) for each search engine is calculated as:

$$ {\text{NQ-DCG}}_{j} = \sum\limits_{i = 1}^{k} {W_{ji} } $$
(2)

For the purposes of this study only the first 10 returned results are considered, k = 10, and the number of engines evaluated is 10 therefore m = 10.

In the examples below the total score assigned to a search engine is calculated based on the above NQ-DCG heuristic evaluation measure.

Example:

Let http://www.ypepth.gr be the target URL for the Ministry of Education (Υπουργείο Εθνικής Παιδείας και Θρησκευμάτων).

  (a) If the result returned by a search engine, say engine 7, for a query is found in the third place (i = 3) and contains only the main page (http://www.ypepth.gr), then n = 0 and, following the notation above, V 73 = k − i + 1 = 10 − 3 + 1 = 8, so the contributed score is W 73 = (k − n) * V 73 = (10 − 0) * 8 = 80.

  (b) If the result returned by the same search engine for the query is found in the second place (i = 2) and contains one subdomain (http://www.ypepth.gr/el_ec_category1806.htm), hence n = 1, then V 72 = k − i + 1 = 10 − 2 + 1 = 9, and the contributed score is W 72 = (k − n) * V 72 = (10 − 1) * 9 = 81.

  (c) For the URL below, returned in the eighth position, i = 8 and n = 2, as it contains two subdomains (http://www.ypepth.gr/docs/aitisi_ipotrofion_klirodotimatvn.doc); then V 78 = k − i + 1 = 10 − 8 + 1 = 3, yielding a contributed score of W 78 = (k − n) * V 78 = (10 − 2) * 3 = 24.

Without any loss of generality, assume that the remaining seven results contributed no weight at all (this could happen for example if all contained 10 subdomains).

  (d) Hence the total weight for search engine number 7, from Eq. 2, is calculated as

$$ {\text{NQ-DCG}}_{7} = \sum\limits_{i = 1}^{k} {W_{7i} } = 81 + 80 + 24 = 185 $$

It must be noted here that the coefficient (k − n) in (1) plays the role of a diminishing or discounting factor by penalizing results that only partially match the target URL and contain subdomains in the returned URLs. This can be implemented in different ways: for example, as it is implemented here, or by introducing a factor such as 1/(1 + n) based on the number of subdomains only. In both approaches the result would follow the same principle of penalizing the presence of subdomains, which in practice resembles DCG with more emphasis on the presence of subdomains. The proposed NQ-DCG approach has been adopted here as it reflects the presence of the subdomains more directly. Evaluation of these alternatives is beyond the scope of the present paper.
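The definitions of V ji and W ji, together with the paper's convention of counting URL path segments as "subdomains", can be checked against the worked example with a short sketch (function names are ours):

```python
from urllib.parse import urlparse

def subdomain_count(url: str) -> int:
    """n: the number of path segments ("subdomains" in the text);
    a main page such as http://www.ypepth.gr has n = 0."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def result_value(i: int, k: int = 10) -> int:
    """V_ji = k - i + 1 for a match at rank position i."""
    return k - i + 1

def contributed_score(i: int, n: int, k: int = 10) -> int:
    """W_ji = (k - n) * V_ji when n < k, else 0 (Eq. 1)."""
    return (k - n) * result_value(i, k) if n < k else 0

def nq_dcg(matches: list[tuple[int, int]], k: int = 10) -> int:
    """Eq. 2: sum of contributed scores over the (i, n) pairs of matches;
    results without a match (or with n >= k) contribute nothing."""
    return sum(contributed_score(i, n, k) for i, n in matches)
```

With the three matches of the example, (i, n) = (2, 1), (3, 0), and (8, 2), the sketch reproduces the contributed scores 81, 80, and 24 and the total of 185.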

The proposed NQ-DCG evaluation measure follows a similar approach to NDCG. For example, NDCG uses a decay factor and measures the gain of each contribution depending on the level of relevance of each returned result. It accumulates the gains by calculating the sum of the gains. It discounts the gain of a returned result that is ranked low so that a highly ranked result will attribute more toward the gain. All these steps are encapsulated in the NQ-DCG evaluation measure.

NQ-DCG measures the gain for each result by discounting its merit according to how low it has been found in the rank order. This is reflected in the calculation of V ji . Thus a high score indicates good retrieval performance and a low score poor performance.

The cumulative gain for each search engine is calculated under NQ-DCG j , thus reflecting the similarities between the two measures. However, the major difference between the two schemes is that the proposed measure considers the results in terms of the number of subdomains the returned URL contains. This is calculated by W ji , where account is taken of the number of subdomains in the URL in an automated fashion, by discounting the relevance of the URL based on its distance from the target. Therefore, NQ-DCG is in principle similar to NDCG, and both model a person's judgment of a search engine better than measures such as precision at 10 or MRR.

4.4.1.5 Evaluating the returned results: response time

Response time, that is, the time from query submission to receipt of the result set, was recorded for each search engine using the computer's clock. Times were collected in all data collection periods and provide a measure for comparing the search speed of the ten search engines.
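The measurement amounts to wrapping each request with a wall-clock timer. A minimal sketch, where `fetch` is a stand-in for the actual HTTP call:

```python
import time

def timed(fetch, *args):
    """Run `fetch(*args)` and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fetch(*args)
    return result, time.perf_counter() - start
```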

In 2004 the searches were sent from the Athens University of Economics and Business (AUEB) computer lab. The 2006 data collection was conducted at the University of Athens and at the University of Macedonia computer labs. The computers used were desktops with Intel Pentium processors running Windows XP, with similar configurations. The network infrastructure is the same, since all the universities use the GRNET/EDET network. Therefore, conditions were very similar within each year.

4.4.1.6 Evaluating the returned results: Live versus dead links

The top ten results of each search were recorded and the URLs were extracted. Each URL was then called and its status was recorded in a binary mode as active (live) or non-active (dead) link. No further attempts were made to retrieve inaccessible links, since the average user usually would not persist once a “404 not found” error message is received. The search engines were therefore penalized for returning non-active links (Hawking et al. 2001). This provides an indication of the freshness of each search engine’s index and contributes to users’ cost, because it is associated with user frustration, time wasted, and overall dissatisfaction with the quality of the results and the search engine itself.
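The binary live/dead classification can be sketched as follows. The function names are ours; a single attempt is made per URL, mirroring the no-retry policy described above, and only the classification helper is exercised without network access.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_status(url: str, timeout: float = 10.0):
    """Single attempt: return the HTTP status code, or None if the
    request fails entirely (no retries)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except (URLError, OSError):
        return None

def classify(status) -> str:
    """Live for a 2xx/3xx status; dead otherwise (e.g. 404, or no response)."""
    if status is not None and 200 <= status < 400:
        return "live"
    return "dead"
```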

4.4.2 Evaluating search engine coverage of the Greek web

To further measure the extent of coverage of the Greek Web (.gr) and the freshness of the index of the search engines we used a sample of 32480 top level domain URLs that were crawled from the Greek Web (.gr) (Efthimiadis and Castillo 2004). These URLs were all active at the time of the first data collection in May 2005. Some were inaccessible either permanently or temporarily during the second data collection in October 2006. These were treated as dead links and excluded from the evaluation in order to avoid penalizing the search engines for not returning them (Hawking et al. 2001).

The 32480 URLs were submitted automatically to the search engines as queries through the developed Java program. The pseudo code of the algorithm is given in Table 3. The query syntax was tailored to each search engine. A similar methodology was used in the evaluation of Google, AltaVista, and AllTheWeb (Vaughan and Thelwall 2004) and of Google, Yahoo and Live (Vaughan and Zhang 2007). For example, for AltaVista, A9, Google, MSN, and Yahoo a URL could be searched using "site:www.aueb.gr", which returns a list of pages indexed by the search engine from that particular domain. Although the Greek search engines did not support the "site:", "link:", or "url:" types of searches, it was possible to search for the URL string and receive results that contained it. The results were then examined to determine whether the target URL was present. If the URL was not found in a search engine's index it was subsequently called with an HTTP request and the response was noted. If "HTTP 404: file not found" was returned, the URL was treated as a dead link; otherwise, the URL was considered to exist but not to be indexed by the search engine. Table 3 summarizes this process.
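The process just described can be paraphrased in Python. Here `query_index` and `http_status` are stand-ins for the engine-specific "site:" (or URL-string) search and the follow-up HTTP request, so the sketch can be exercised with stubs:

```python
def coverage_status(url: str, query_index, http_status) -> str:
    """Classify one sample URL for one engine (cf. Table 3):
      - "indexed":     the engine returns the URL for a site:/URL-string query
      - "dead":        not indexed, and the URL itself returns HTTP 404
      - "not_indexed": the URL exists but the engine has not indexed it
    `query_index(url)` -> bool; `http_status(url)` -> int HTTP status code.
    """
    if query_index(url):
        return "indexed"
    if http_status(url) == 404:
        return "dead"
    return "not_indexed"
```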

Table 3 Pseudo code of algorithm for searching the URLs

5 Results

The study results are presented in this section. An overview of the issues surrounding Greek language processing and their effects on searching is given first. The evaluation of the 309 navigational queries, which were searched in the 10 search engines (5 Greek, 5 global), follows. The freshness of each search engine's index is measured by the percentage of the returned URLs that were live (active) or dead (non-active) links, and the extent of the coverage of the Greek web by the search engines is measured by evaluating whether a sample of active URLs from the Greek web appears in their indices.

5.1 Greek language processing by search engines and effects on searching

The way search engines handled the Greek language is presented in Table 4. The table shows whether the engines handled articles, prepositions, pronouns, etc. The table also reports on whether the results of Greek language queries that are submitted to search engines with or without accent marks are the same. For example, a searcher using either keyword “χωριο” or “χωριό” (village) as a query would expect to get the same results because the accent mark does not change the meaning of the word. However, this is not the case as reported in Table 4.

Table 4 How search engines process Greek language input

The five global search engines and one Greek engine returned different results for the accented and unaccented forms. The differences observed in the top ten results varied from totally different results to some small overlap, but with differences in rank order.

It appears that Google, MSN, and Yahoo handle Greek in very similar ways, which amount to the following: they do not use any Greek-specific processing software, so they perform no special segmentation or stemming for Greek. The default algorithm for Greek, or any non-English language, seems to be simple white-space delimiting to find words, followed by indexing of these words minus a universal stop word list. At a minimum, these search engines seem to recognize at least two encodings, Unicode and the Windows Greek code page.
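The behaviour just described, white-space tokenization minus stop words with no accent folding, can be contrasted with the normalization step that would make "χωριο" and "χωριό" match. The sketch below is ours; the stop word list is a tiny illustrative subset, not any engine's actual list.

```python
import unicodedata

GREEK_STOPWORDS = {"και", "το", "η", "ο", "σε"}  # tiny illustrative subset

def tokenize(text: str) -> list[str]:
    """The apparent default behaviour: split on white space, drop stop words."""
    return [t for t in text.lower().split() if t not in GREEK_STOPWORDS]

def strip_accents(text: str) -> str:
    """The accent folding the engines appear NOT to do: decompose to NFD,
    drop combining marks (the accents), recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)
```

With such folding applied at indexing and query time, the accented and unaccented forms of the example query would retrieve the same results.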

5.2 Search results by rank order and P@10

The 309 navigational queries, 217 in Greek and 92 in English, were submitted to each of the 10 search engines, for a total of 3090 searches. Table 5 presents the rank distribution of the results for both the Greek and English queries by search engine for 2004 and 2006. The table also lists the number of organizations missed by each engine and their success rate measured as precision at 10 (P@10). Of the organizations found, most appeared in the first three ranks.

Table 5 Rank position of the top ten search results for the 309 queries by search engine

The global search engines have higher success rates for both of the comparison years than the Greek engines. In 2004 the performance of the global engines ranges from 54.04% to 68.61% and in 2006 from 48.54% to 73.79%. The Greek engines range from 10.03% to 58.58% in 2004 and from 10.68% to 52.43% in 2006. Google is the best performing global engine and Trinity is the best Greek engine in both years. However, Trinity is ranked fourth overall in both 2004 and 2006.

Figure 3 shows the overall success rate of the ten search engines, ranked in descending order based on their 2004 performance. What is also remarkable here is that the Greek search engines, with the exception of Visto, scored the same as or slightly worse than they had in 2004. For Anazitisis in particular, the success rate, as can be seen in Fig. 3, was more than halved. By contrast, among the global engines three did better, whereas for two of them, namely A9 and MSN, the scores differ significantly between 2004 and 2006; for A9 in particular the success rate dropped by almost a quarter. Furthermore, A9 dropped from third position in 2004 to fifth in 2006, swapping places with Yahoo. The rest of the engines maintained their positions.

Fig. 3
figure 3

Search engine success rate over all queries, 2004–2006

Table 6 shows the percentage change in relevant results retrieved between 2004 and 2006. Table 7 shows the percentage change in relevant results retrieved at the first rank, and the data are graphed in Fig. 4. The range in percentage change is wide, from −55.13% for Anazitisis to 21.88% for Visto, and four engines show a negative overall change (Table 6). The percentage change is more pronounced in the first-rank results (Table 7, Fig. 4), where five engines show negative change ranging from −5.61% (MSN) to −62.22% (Anazitisis).
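The percentage changes in Tables 6 and 7 follow the usual relative-change definition; for completeness, a one-line sketch with illustrative counts:

```python
def percent_change(count_2004: int, count_2006: int) -> float:
    """Relative change between the two data collections, in percent."""
    return (count_2006 - count_2004) / count_2004 * 100.0
```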

Table 6 Percentage change on overall results
Table 7 Percentage change on first rank results
Fig. 4
figure 4

Percentage change on first rank, 2004–2006

The above results give an overall performance rate for the search engines but do not show how the engines respond to Greek or non-Greek queries. Tables 8 and 9 present the rank distributions of the results by language, Greek and English respectively. In Table 8 it can be seen that AltaVista in 2006 handled Greek queries better than all the other engines with a success rate of 72.81%, while Google follows closely with 70.96%, whereas MSN and A9 are fourth and fifth with 50.60% and 50.23% respectively. The best performance of the Greek engines was recorded by Trinity with 49.30%. The rank distribution of the results from the queries in either English or in transliterated form is given in Table 9. These show mixed results, as we observe variations in performance for almost all the search engines. When compared to the 2006 results from the Greek queries (Table 8), Google has increased its performance (80.43%), Yahoo!’s performance remained about the same (63.04%), whereas MSN, AltaVista, and A9 decreased theirs. Of the Greek search engines Trinity’s performance increased to 59.78%, whereas the performance of all other engines decreased.

Table 8 Rank distribution of results for Greek queries, 2004–2006
Table 9 Rank distribution of results for English queries, 2004–2006

A closer look at Table 8, which depicts how the 10 engines handled Greek queries, shows the following interesting results. Judging by the success rates for both years, the last four places are clearly occupied by four Greek engines with almost identically feeble performance: Visto, Ano-Kato, Anazitisis, and Phantis, the worst. Among the remaining six engines, Trinity dropped from fourth place to sixth in 2006 and A9 from second to fifth, while Yahoo climbed from fifth to third and MSN from sixth to fourth. AltaVista finished first overall, leaving Google in second place, although Google had been the clear winner in 2004. However, and this must also be highlighted, if the analysis were based solely on how the engines scored in returning results at the first rank in each year, the ranking would be slightly different. More specifically, Google would have outperformed AltaVista, Trinity would have been two positions higher in both years, Yahoo would have maintained the positions it held on overall performance, and MSN would have remained in sixth place in both years. A9 would have dropped one place in 2004, trailing second-placed AltaVista by only one unit, but would have dropped considerably, to fifth place, in 2006. For the remaining four Greek engines the ranking would be unaltered in both years. These findings suggest that the global engines scored better than their Greek rivals, not only overall but also at the highest rank.

A similar analysis of Table 9, which records how the 10 engines handled English queries, shows the following. The ordering is more stable than in Table 8, with the discrepancies now occurring at the lower end of the scores, among the Greek engines, and towards the middle, among the global ones. More specifically, Anazitisis dropped from eighth position to tenth and A9 from third to fifth. For the rest of the engines the ordering remained unchanged, with the global engines again scoring better than the Greek ones. The only Greek engine that managed to climb slightly was Trinity. The clear winner in both years was, once again, Google. However, if the analysis were again based on the first-rank ordering, Trinity would have tied with AltaVista for second place in both 2004 and 2006. The remaining engines would have kept the positions given by the overall success rate.

A further analysis based on the second, third, and subsequent ranks is pointless, as the counts there are too small to be statistically sound. Nevertheless, these findings do suggest that Google performed better than the other engines.

To substantiate the claims made above, an analysis of variance (ANOVA) was performed on the search results, for both the 2004 and the 2006 data, using the statistical package SPSS. The analysis showed a significant difference at the 100% level in the mean performance of all 10 search engines when the entire sample of queries, both Greek and English, is taken into account. This holds for both groups of search engines, Greek and global.

When the Greek queries were evaluated separately, a significant difference at the 100% level was also found between the means of the Greek search engines; the same could not be said for the global search engines handling Greek queries. Conversely, when the English queries were analyzed, the global engines showed a significant difference at the 97% level in their mean performance, whereas this could not be established with confidence for the Greek engines. The picture when all queries are considered together resembles the latter finding. The main reason is the excess of zero entries at ranks four through ten scored by the Greek search engines when handling English queries.
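As an illustration of the kind of test behind these comparisons, a one-way ANOVA F-statistic can be computed directly from per-engine score samples. The sketch below uses small hypothetical samples, not the study's data; SPSS was the tool actually used.

```python
# One-way ANOVA F-statistic over per-engine score samples.
# All data here are hypothetical, not the study's measurements.

def anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                          # number of engines
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical engines' per-query scores; a large F relative to the
# F-distribution's critical value suggests the engine means differ.
f = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F leads to rejecting the hypothesis that all engines have the same mean performance, which is the form of conclusion reported above.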

Tests on the paired differences show that in some cases it can be argued with confidence that certain engines performed worse than the others: Anazitisis for all search results and for Greek queries, A9 and MSN for all search results, and MSN for English queries. The engines that performed best within their groups are Google and Trinity for all search results; Trinity, AltaVista and Google for Greek queries; and Google and Trinity for English queries.

It should be mentioned that the ANOVA tests run on the 2004 and 2006 data sets showed the same behavior for both groups, suggesting that the engines behaved similarly during both periods of the study.

Moreover, Student's t-tests for paired comparisons between the 2004 and 2006 results, by rank and category (e.g., 2004 Greek engines with Greek queries against 2006 Greek engines with Greek queries), show that the samples are statistically similar, i.e., that the sample means are the same. This holds for all searches, for both Greek and English queries, and for both groups of engines.
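The paired comparison described above can be sketched as a paired t-statistic; the samples below are hypothetical, not the study's counts.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """t-statistic for paired samples: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical per-rank counts for one engine/query group, 2004 vs 2006.
# A |t| well below the critical value fails to reject equality of means,
# the form of similarity reported above.
t = paired_t([12, 9, 7, 5], [11, 9, 8, 6])
```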

Hence the different runs in 2004 and 2006 showed an overall similar behavior in the search ability of the engines.

5.3 Mean reciprocal rank

The mean reciprocal rank (MRR) for all searches that were presented in Sect. 5.2 above was calculated and is presented in Table 10. The data in the table are sorted using the MRR results for all queries in 2006. The analysis of the data showed that for all queries (both Greek and English) Google, AltaVista, and Yahoo had increases in MRR performance from 2004 to 2006. However, Google was the only search engine that had an increase in MRR for both Greek and English queries between 2004 and 2006. All other engines either remained the same or had worse performance. Yahoo, AltaVista and Visto had some increases in the MRR performance of the Greek queries, but a drop for English queries for the same period.
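As a reminder of the measure, MRR averages the reciprocal of the rank at which the correct answer appears, counting a miss as zero. A minimal sketch, with hypothetical data:

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct URL per query, or None if not found."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Four hypothetical queries: hits at ranks 1, 2 and 4, plus one miss.
mrr = mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```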

Table 10 Mean reciprocal rank for Greek and English/Latin queries in 2004 and 2006

5.4 Search results by subject category and NQ-DCG

Using the Navigational Query Discounted Cumulative Gain (NQ-DCG) method discussed in Sect. 4.4.1.4, all queries were scored and then grouped by category, enabling a finer-grained evaluation of the performance of the search engines in the study. Table 11 shows the results of this evaluation, grouped by language and by subject category, for 2004 and 2006. Under this scoring, the larger the number, the better the retrieval performance of a search engine. Google among the global engines and Trinity among the Greek engines outperformed the other engines in their respective groups. This is not to say that Trinity's performance is good; on the contrary, when the Greek and global engines are compared, the Greek engines failed miserably.
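The exact NQ-DCG weighting is the one defined in Sect. 4.4.1.4; purely as a generic illustration of discounted cumulative gain, the standard log-discounted form can be sketched as follows (the gain values are hypothetical and not the NQ-DCG gain scale):

```python
from math import log2

def dcg(gains):
    """Standard DCG: the gain at 1-based rank i is discounted by log2(i + 1)."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains))

# The same correct answer is worth more near the top of the ranking:
top = dcg([1, 0, 0])    # 1 / log2(2) = 1.0
third = dcg([0, 0, 1])  # 1 / log2(4) = 0.5
```

The discount is what lets a per-query score reward engines that place the target URL high rather than merely somewhere in the top ten.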

Table 11 Sum of the scores of the top ten results by subject category, language, and search engine

Based on the aggregate results for all search engines per category for Greek queries the coverage of the categories is in the following rank order: travel agencies, universities, banks, government departments, newspapers, colleges (TEI), radio stations, museums, transportation & communication services, TV stations.

Similarly, the aggregate results for all search engines for English queries show the following rank order of category coverage: universities, newspapers, banks, government departments, colleges, transportation & communication services, travel agents, radio stations, and TV stations. The travel agencies category shows the greatest variation in rank between the Greek and English queries, at positions 1 and 7 respectively. Newspapers also moved, from rank 5 for Greek queries to rank 2 for English queries.

The analysis of variance (ANOVA) of the results by subject category for the Greek queries (Table 11) shows a significant difference at the 100% level in the mean performance of all engines, whereas for the English queries the difference is at the 95% level. Here too, the ANOVA tests carried out on the 2004 and 2006 samples showed an overall similar behavior in the search ability of the engines when the subject category classification was considered.

The only discrepancy is found in the results of two engines: Anazitisis, for both Greek and English queries, and A9, for English queries only. Under the per-category classification of the searches (Table 11), neither Anazitisis nor A9 passed the comparative tests, although both passed them under the rank classification. The reason, as can be seen from the results (Table 11 and Fig. 3), is that the 2006 results are much worse for Anazitisis and A9; that is, these two engines performed better in 2004 than in 2006.

5.5 Live versus dead links

For each of the ten search engines, the 309 queries could generate up to 3090 results. These results were further evaluated by measuring the percentage of live (active) versus dead (non-active) links. This evaluation measures the freshness of a search engine's index and indicates the level of frustration a searcher would experience. It is an additional way of evaluating the precision of the search and the cost to users should they follow the dead links.
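A liveness check of this kind can be sketched with a HEAD request; the function names and timeout below are illustrative, and any HTTP or network error counts the link as dead.

```python
from urllib import request, error

def is_live(url, timeout=10):
    """True if the URL answers a HEAD request, False on any HTTP/network error."""
    try:
        req = request.Request(url, method="HEAD")
        with request.urlopen(req, timeout=timeout):
            return True
    except (error.URLError, ValueError):
        return False

def dead_link_rate(urls):
    """Percentage of dead (non-active) links in a result set."""
    dead = sum(not is_live(u) for u in urls)
    return 100.0 * dead / len(urls)
```

Whether an HTTP error (e.g. 404) should count as dead while a timeout merely counts as unreachable is a methodological choice; the sketch treats both as dead.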

The results presented in Table 12 show the aggregate number of returned URLs for all 309 queries submitted to each search engine, divided into those that were active (live) and those that were non-active (dead). In 2006, of the global search engines, A9 had the highest percentage of active links (96.78%), whereas Yahoo had the highest percentage of dead links (8.65%). Of the Greek search engines, Trinity had the highest percentage of active links (94.49%), whereas Ano-Kato had the highest percentage of dead links (27%).

Table 12 Active versus dead links for all queries

These results, although a good indication of the freshness of the index, should not be considered in isolation. Compared to Table 5, which shows the overall success rate of the search engines, A9, for example, successfully retrieves only 49.19% of the correct answers, with 3.22% dead links in its results, while Google retrieves 73.79% of the correct results with 4% non-active links. From a user's point of view, higher precision in the top ten results is probably the more valuable of the two.

The performance of the Greek search engines with respect to active links in their result sets is disappointing. The dead links for all engines but Trinity (5.51%) range from 13.05% to 27%; Trinity is thus the best performing Greek search engine.

Figure 5 presents the cumulative results for the active (live) and dead links found in the result sets of the Greek and English queries. From 2004 to 2006 the results for the Greek queries show a 4.88% increase in live links but a 25.11% increase in dead links. Over the same period the results for the English queries show a 2.66% drop in active links and a dramatic 44.6% increase in dead links.

Fig. 5
figure 5

Live versus dead links, 2004–2006

5.6 Response time

The response time data in Table 13 and Fig. 6 show an increase in speed from 2004 to 2006. Most search engines improved their response time, with percentage changes ranging from −0.40 to −0.95. The greatest increase in speed is seen in MSN, which had been the slowest engine in 2004. Three engines, two Greek (Phantis and Visto) and one global (A9), became slower, with percentage changes of 0.59 to 0.99. The overall faster response times could be attributed to hardware upgrades implemented at the Greek universities during that period, such as better computers in the university labs and the GRNET network upgrade both in the universities and in the backbone infrastructure (1–2.5 Gbps). Similarly, hardware upgrades at the search engines could also have contributed to the faster response times.
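The sign convention can be made explicit: change is computed relative to the 2004 time, so negative values mean a faster engine in 2006. The values below are hypothetical.

```python
def percent_change(t_2004, t_2006):
    """Relative change in response time; negative means faster in 2006."""
    return (t_2006 - t_2004) / t_2004

halved = percent_change(2.0, 1.0)   # -0.5: response time halved (faster)
slower = percent_change(1.0, 1.59)  # ~0.59: the engine became slower
```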

Table 13 Average response time over all searches per SE in seconds
Fig. 6
figure 6

Average response time by engine, 2004–2006

5.7 Search engine coverage of the Greek web

To measure the extent of coverage of the Greek Web (.gr) and the freshness of the search engines' indices, a sample of 32480 top-level-domain URLs crawled from the Greek Web was used (see Sect. 4.4.2). This sample was estimated at about 40% of the registered .gr domains in 2004 (Efthimiadis and Castillo 2004). Table 14 and Fig. 7 present the results by year and search engine, giving actual numbers and percentages of indexed and not-indexed URLs, as well as the URLs that were dead in 2006. Table 15 shows the percentage change between 2005 and 2006 for the indexed and not-indexed URLs.

Table 14 Indexed versus not-indexed URLs, 2005–2006
Fig. 7
figure 7

Indexed versus non-Indexed URLs, 2005–2006

Table 15 Percent change for indexed and not-indexed URLs, 2005–2006

In 2005, Google, with 98.04%, had almost all the URLs indexed. Yahoo follows with 84.06%, A9 with 77.41% and MSN with 66.63%. Anazitisis, with 61.72%, is at the top of the Greek search engines, while Phantis, having indexed a dismal 4.87%, is at the bottom of the list. Visto is not included in 2005 because its data was corrupted. For 2006, we observe a drop in indexed URLs across all search engines except Trinity and Phantis.

In Table 14 the data for 2006 take into consideration the decay of the URLs searched in 2005. Although the initial sample contained 32480 URLs, the “URLs checked” column of the table reports fewer URLs per engine, owing to network problems during searching. It was therefore decided to include only the successful returns, as this gives a more accurate picture. The “URLs not-indexed by SE and dead” were subtracted from the total number of “URLs checked” so as not to penalize search engines for failing to index dead pages. Since only URLs not found in a search engine's index were checked for liveness, dead URLs still present in the engines' indices were accepted without penalty, following Hawking (2001). In 2006, Google maintained its lead over all other search engines, although its coverage dropped to 91.37% of the sampled URLs. A9 also dropped, to 72.75%, while Yahoo and AltaVista, which use the same index, increased their coverage to 86.2%, and MSN to 71.61%.
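The denominator adjustment described above can be sketched as follows; the figures are hypothetical, not those of Table 14.

```python
def coverage_pct(indexed, urls_checked, not_indexed_and_dead):
    """Coverage excluding dead URLs the engine had justifiably not indexed."""
    return 100.0 * indexed / (urls_checked - not_indexed_and_dead)

# Hypothetical engine: 28,000 of 31,000 checked URLs indexed; 500 of the
# not-indexed URLs turned out to be dead, so they leave the denominator.
cov = coverage_pct(28000, 31000, 500)  # 28000 / 30500, about 91.8%
```

Removing dead not-indexed URLs from the denominator means an engine is only compared against the live pages it could reasonably have crawled.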

The bar charts in Fig. 7 provide a visual representation of the indexed versus not-indexed results from the search engines. The bars are stacked, with the 2005 results to the left of the middle gridline and the 2006 results to the right of it. In both 2005 and 2006 Google has the best coverage of Greek URLs, followed by Yahoo, AltaVista, A9, and MSN. The Greek search engines show a rather different ordering in the two years: in 2005 the best coverage was provided by Anazitisis, followed by Ano-Kato, Trinity and Phantis; in 2006 Trinity had the best coverage, followed by Ano-Kato, Anazitisis, Visto, and Phantis.

A closer examination of the results, especially the percentage change between 2005 and 2006 of the indexed and not-indexed URLs (Table 15), reveals that Google, A9, and Anazitisis had the biggest losses in coverage. Google dropped from its index about four times as many URLs as it was missing in 2005, but still has the best coverage of all engines. Yahoo, AltaVista and MSN improved their coverage at about the same rate. The most noticeable improvements came from Trinity and Ano-Kato, albeit still performing below the global engines.

6 Conclusions

This study evaluated how search engines handle Greek language queries, assessed whether the Greek or the global search engines are more effective in satisfying user requests for navigational queries, and measured the extent of coverage of the Greek web and the freshness of the engines' indices. Ten search engines were evaluated, five Greek and five global. Our results corroborate and extend the findings of Lazarinis (2007). The analysis shows that the global search engines ignore the special characteristics of the Greek language when handling Greek queries. Despite this, the global search engines outperformed the Greek engines in both years of the evaluation, 2004 and 2006. A set of 309 navigational queries was used in the evaluation. The rank distribution of all search results indicates that, on average, the search engines retrieved the relevant target URL in the first three rank positions. However, the rate of success leaves much to be desired, as the most successful engine, Google, found the correct answer to only 73.91% of the English and 60.37% of the Greek queries. The global engines have good coverage of the Greek web relative to the sample of 32480 URLs tested, but the results returned differ depending on how the searcher types the Greek query, e.g., with or without accents.

The implications for Greek users are therefore many, as they need to be aware of these nuances when searching in Greek. The study was conducted during different periods: in 2004 and 2006 for the navigational queries, and in 2005 and 2006 for the indexing coverage of the sample of 32480 URLs. The coverage of the URLs for 2006 ranged from as low as 12.6% for Phantis to as high as 91.37% for Google. Although Google's coverage seems high, it has dropped since 2005.

The results obtained were statistically analyzed to establish that the sample means of the search outcomes differed per engine. It was also statistically confirmed that the behavior of the engines was similar across the different years.

A possible explanation for the poor performance of the Greek search engines is the lack of the sophisticated crawling, searching, and ranking algorithms found in the global search engines. The Greek search engines also have very low coverage of the Greek web, ranging from 4.87% to 61.72% (see Table 14).

Although the global search engines outperformed the Greek engines, there is much room for improving their performance in both retrieval effectiveness and coverage. Given the better performance of the global engines reported in this study, it could be expected that if they took the characteristics of the Greek language into account, their performance would improve further.