Linguistic based search facilities in snowflake-like database schemes

https://doi.org/10.1016/S0169-023X(03)00104-6

Abstract

Development of generic and general search is one of the most difficult tasks in website development. In everyday life, search is not as difficult. The simplification is based on the association and context facilities provided by the language and the application area.

We aim to develop a simple and powerful approach to search, based on a generalization of the theory of word fields to concept fields and on an appropriate meta-structuring within database schemata that reflects context- or application-based search more appropriately. The internal meta-structuring is based on star and snowflake meta-structures within the schema.

Introduction

It is commonly agreed that sophisticated web systems must support all kinds of search requests people may pose to the system. This requirement can be satisfied if search is based on a more general approach that allows users to ask questions in an approximate form, questions associating the requested information with other terms, questions leading to an answer through construction, or questions that are put into a utilization context. The metaphor [45] of an informed librarian or bookseller is often used to illustrate such general search. Everybody is familiar with the situation in which a customer in a bookshop asks for an appropriate book. If the customer's profile is known to the bookseller, the needs of this customer can be met in a much simpler and more appropriate form.

Therefore, we need an approach that supports users with very sophisticated search facilities. Such support does not come from nowhere. It is based on the information that the information system can provide. It is not possible to generate all possible queries and to provide a meta-querying interface that lets the user pick the most appropriate query. In this paper we develop an approach that makes it possible to generate a good portion of the possible queries. Our approach is based on star and snowflake schemata that support query generation.

Search is one of the most common facilities in information-intensive systems. It requires

  • to examine the data and information on hand, and

  • to carefully look at or through or into the data and the information.


Search is an activity that can be generalized to an investigation into the database. There is a large variety of kinds of information search, such as:

  • querying data sets (by providing query expressions in the informed search approach),

  • seeking information on data (by browsing, understanding and compiling),

  • questing data formally (by providing appropriate search terms during step-wise refinement),

  • ferreting out necessary data (by discovering the information requested by searching out or browsing through the data),

  • searching by associations and drilling down (by appropriate refinement of the search terms),

  • casting about and digging into the data (with a transformation of the query and the data to a common form), and

  • zapping through data sets (by jumping through data provided, e.g., by uninformed search).


This variety of search approaches is applied almost everywhere in daily life. We observe this search behavior in users of internet search engines, in TV consumers, in people using a railway information system, and in information seekers who visit internet sites that provide information such as cultural events. Information systems must support all these different kinds of search.

The first and the second kinds of search (direct querying) are supported by text retrieval systems [4], [22], [39], [49], [50] and seldom lead to satisfactory results. The database query language SQL provides good support for skilled and trained users, provided that the meaning of the data and the semantics of the database schema are entirely known.

The third kind of search (questing) is supportable by systems which use the information provided by the database schema [21] and by word analysis. In [47], [48] an approach has been developed which generates an SQL query for natural language utterances by extending the results of sentence parsing and analysis

  • by the meaning and associations of words using WordNet [15] information,

  • by the hierarchy of application terms or topics [34] ordered in an ontology [20], [35], or a query concept lattice [3], [19], [24], and

  • by associating the terms to database schema information [41].


The fourth kind of search (ferreting out) is currently not supported at all. It may be, however, partially supported if the information and associations on hand are properly used. In this case, context of terms, special context provided by applications and the search profile of the user can be used for generating a general context of the search utterance.

The fifth kind of search (association-based search or browsing) can partially be supported by techniques of artificial intelligence and by careful analysis of the meta-information [37] provided by the information system.

The sixth kind of search (investigating and casting) requires a powerful transformer of search terms, of meta-information provided by the database schema and of general context. Support for this kind of search is not yet in sight for information systems. Support for the last kind of search (zapping) is far more difficult.

At the same time, information systems respond to search requests by informing the requester about the information on hand and answering the search request. Informing and answering is a task that is as difficult as searching. Informing also has a number of facets, such as:

  • returning the data that matches the search terms to the user in some format,

  • replying by giving an answer to the request with inclusion of other helpful information,

  • responding by reacting to the search request and providing some information, and

  • retorting by answering back quickly and cleverly.


The addition of helpful information can be supported by extracting useful associations throughout the database, e.g., by utilizing associations in the database schema. Responding requires a selection criterion for choosing the information, which is either retrieved as data from the database or condensed on the basis of data available in the information system. Retorting to a request requires contraction of data.

Answering is related to informing the user. Informing has a similar pattern.

As a typical example, let us consider a website search facility. The Cottbus InfoServices team has developed 34 large websites, including city and region information sites, group and association sites, edutainment sites, and B2B and B2C sites. All these sites include sophisticated search support. In a regional information site, users search for a large variety of information: events, hotels, restaurants, places of interest, traffic, streets, etc. Search facilities also include hotel search. Hotel search has a large number of facets, such as informed search (looking for a specific hotel or a hotel chain), property-based search (looking for main characteristics of hotels), profile-based search (based on the seeker's profile), association-based search (based on relationships of hotels to specific points of interest such as sightseeing points or transportation), browsing through the set of available hotels, or zapping through the list of hotels. This situation is similar to daily life, e.g., in a service agency supporting hotel and accommodation search.

Generic components [26], [31] were developed and applied a long time ago. Typical examples of generic functions are the database manipulation functions insert, delete, and update in relational databases, which are defined as soon as the structure to which they apply is defined. The mechanism for generating the concrete database manipulation functions may serve as the first example of generative programming.
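The idea of deriving concrete manipulation functions from a structure definition can be illustrated with a minimal sketch. It is not taken from the paper: the helper `make_manipulation_functions`, the `hotel` table, and its columns are hypothetical names chosen for illustration, and SQLite merely stands in for a relational system.

```python
import sqlite3


def make_manipulation_functions(table, columns, key):
    """Derive insert, delete, and update functions from a table definition.

    Hypothetical helper: table and column names are assumed to be trusted
    identifiers, since they are interpolated into the SQL text.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    non_key = [c for c in columns if c != key]
    assignments = ", ".join(f"{c} = ?" for c in non_key)

    def insert(conn, row):
        conn.execute(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
            [row[c] for c in columns],
        )

    def delete(conn, key_value):
        conn.execute(f"DELETE FROM {table} WHERE {key} = ?", [key_value])

    def update(conn, row):
        # Overwrite all non-key columns of the row identified by its key.
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key} = ?",
            [row[c] for c in non_key] + [row[key]],
        )

    return insert, delete, update


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# The three functions exist as soon as the structure is known.
insert, delete, update = make_manipulation_functions(
    "hotel", ["id", "name", "city"], "id"
)

insert(conn, {"id": 1, "name": "Hotel A", "city": "Cottbus"})
update(conn, {"id": 1, "name": "Hotel B", "city": "Cottbus"})
rows_after_update = conn.execute("SELECT id, name, city FROM hotel").fetchall()

delete(conn, 1)
rows_after_delete = conn.execute("SELECT id, name, city FROM hotel").fetchall()
```

The same generator could be applied to any other table definition, which is the sense in which the functions are generic.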

In this paper, we develop a general approach to search support that makes it possible to build generative search facilities on the basis of a specific inner structuring [17], [28], [42], [43] of the database schema. This approach supports the first six kinds of search (query, seek, quest, ferret out, association-based browsing, and cast about). It uses the inner structuring of the schema or the meta-structuring of the schema.

Generative programming [12] aims to improve productivity, quality, and time-to-market in software development through the deployment of both standard components and production automation. Therefore, system families are developed rather than single systems. The approach replaces manual search, adaptation, and assembly of components by the automatic generation of needed components on demand.

Generative and component-based software engineering [13] is based on a number of software engineering approaches such as:

  • Parameterization is used for statically bound, simultaneous and non-simultaneous dimensions of objects from various classes [7], [33].

  • Aspect-oriented programming improves the modularity of designs and implementations by allowing a better encapsulation of cross-cutting concerns such as distributed transfer, synchronization, data traversal, tracing, caching, etc. in a new kind of modularity called “aspects”.

  • Subject-oriented programming focuses on capturing different subjective perspectives on a single object model. It basically allows composing applications out of “subjects” (partial object models) by means of declarative composition rules [8], [11].

  • Software transformation technology and systems aid software development activities by providing mechanized support for manipulating program representations, e.g., extracting views, refinement, refactoring, and optimizations of program representations. A specific approach uses transformation systems based on formal algebraic specifications.

  • Intentional programming provides an extendible programming environment based on transformation technology and direct manipulation of active program representations. New programming notations and transformations can be distributed and used as plug-ins.

  • Domain engineering comprises the development of a common model and concrete components, generators, and reuse infrastructures for a family of software systems.


Generic representation is based on generic features. They abstract and relate facilities of an entire group or class. Therefore, a concept is represented by families of concepts. The process of creation or generation of concrete instances is a task of generative programming beyond generic representation.

Databases are often very complex and are based on database schemata that are themselves rather complex. If the structure is simple, querying becomes somewhat simpler, as long as only one tuple variable per table is needed. If, however, several tuple variables are necessary for one of the tables, formulating a query becomes very complex and highly error-prone. For instance, the cube operator must be considered harmful and misleading [29]. If a simple database schema is used for complex applications, data maintenance becomes a nightmare. Moreover, integrity maintenance is infeasible. Typical databases of this kind are developed in OLAP applications [25]. OLAP schemata aim to ease search by introducing high redundancy into the schema. It has been shown in [30] that OLAP and XML applications are much simpler if the OLAP or XML data are based on views on richly structured databases. This approach makes it possible to build an OLAP or XML application on top of an operational OLTP system. The question of whether views must be materialized can be resolved by considering performance parameters and maintenance procedures for materialized views and the queries defined on them.

If a richer structuring is used, the formulation of a query must be based on the structure of the database. It is easier for an experienced database programmer to develop complex queries. Casual users and programmers are, however, completely lost in the schema. Formulating a query requires a complete understanding of the syntax and of the semantics coded within the schema. Therefore, we need a generic approach to query formulation.

We observe, further, that large schemata display various aspects of the application in a different or repetitive form. Therefore, stating a query is rather difficult and has to consider a variety of aspects. The repetition and redundancy in schemata and the variety of aspects are also caused by
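The difference between one and several tuple variables per table can be made concrete with a small sketch. The `hotel` table and its rows are invented for illustration and do not come from the paper; the second query needs two tuple variables (`h1`, `h2`) over the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT, city TEXT, stars INTEGER)"
)
# Illustrative data only.
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?, ?, ?)",
    [
        (1, "Alpha", "Cottbus", 3),
        (2, "Beta", "Cottbus", 4),
        (3, "Gamma", "Rostock", 4),
    ],
)

# One tuple variable (h): hotels located in a given city.
one_variable = conn.execute(
    "SELECT h.name FROM hotel h WHERE h.city = ? ORDER BY h.name",
    ("Cottbus",),
).fetchall()

# Two tuple variables (h1, h2) over the same table: pairs of hotels in the
# same city where the second one has a higher star rating.
two_variables = conn.execute(
    """
    SELECT h1.name, h2.name
    FROM hotel h1
    JOIN hotel h2 ON h1.city = h2.city AND h1.stars < h2.stars
    ORDER BY h1.name, h2.name
    """
).fetchall()
```

Even in this tiny case the second query forces the author to keep two roles of the same table apart, which is where casual users tend to make mistakes.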

  • different usage of similar types of the schema,

  • minor differences of the type structure in application views, and

  • semantic differences of variants of types.


Typical examples are the schemata used by Scheer, e.g., the one in [36]. These schemata are highly redundant, display variations of types in the same schema and mix abstraction levels of modeling. The size of the schema in [36] can be cut by half if non-redundancy is a quality criterion. Therefore, we need approaches that allow reasoning about repeating structures inside schemata, about semantic differences and about differences in the usage of objects.

Large schemata also suffer from a lack of variation detection: the same or similar content is often repeated in a schema without anyone noticing it. The similarity of the schemata is often not detected by teams and causes a number of redundancy and inconsistency problems.

Analyzing the retrieval functionalities in use, we obtain the following characteristics, which are related to different scientific areas:

  • Search patterns depending on information needs and actors, integration with browsing, search iterations;

  • Search input: keywords, alternative terms, misspelling, multilingual, natural language searches, text entry support, spelling-reduced searches, fuzzy formulation, modes of searching, clear search options, support for judgement, and information retrieval techniques;

  • Representation of search results: prioritizing, clustering, navigation support, and feedback always or not;

  • AI techniques: mining, discovery, concept hierarchies, information structuring, agents, uncertainty, incompleteness, heuristics;

  • Search style: search without spelling, scoped searches, expression logics (AND, OR, NOT), buttons, and search capabilities.


That is why the content of a database must be expressed in different ways for the users. So, we focus on the following kinds of retrieval interfaces: a step-by-step form-based database search, a multi-step semantic search and a natural language text search. The step-by-step form search is shown in [41], [46], [47], [48].

Psychologists claim that humans are able to perceive 5 ± 2 concepts at the same time. Humans are better at recognizing connected or associated concepts. Thus, it seems that schemata similar to a star are easier to perceive and to understand.

Star typing has been used for a long time outside the database community. Let us consider the example in Fig. 1. It shows a part of the standardized description of screws used in mechanical engineering. Each screw is characterized by basic data. Additionally, properties of the manufacturer, suppliers, material, form such as head, etc. may be added.

Thus, a star type [44] is characterized by a kernel entity type used for storing basic data and by a number of subtypes of the entity type that are used for additional properties. These additional properties are clustered according to their occurrence for the things under consideration.
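A star type of this shape can be sketched in a few lines, following the screw example. Everything concrete here is an assumption made for illustration: the table names `screw`, `screw_material` and `screw_head`, the sample rows, and the helper `star_view`, which joins the kernel with its optional subtypes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Kernel entity type: basic data every screw has.
    CREATE TABLE screw (id INTEGER PRIMARY KEY, designation TEXT NOT NULL);
    -- Subtypes: optional clusters of additional properties.
    CREATE TABLE screw_material (
        screw_id INTEGER PRIMARY KEY REFERENCES screw(id), material TEXT);
    CREATE TABLE screw_head (
        screw_id INTEGER PRIMARY KEY REFERENCES screw(id), head_form TEXT);

    -- Illustrative rows only.
    INSERT INTO screw VALUES (1, 'M6x20'), (2, 'M8x40');
    INSERT INTO screw_material VALUES (1, 'steel');
    INSERT INTO screw_head VALUES (1, 'hexagon'), (2, 'countersunk');
    """
)


def star_view(conn, kernel, key, subtypes):
    """Return each kernel object together with its optional subtype data.

    `subtypes` is a list of (table, foreign_key, column) triples; all names
    are assumed to be trusted identifiers.
    """
    columns = ", ".join(
        [f"{kernel}.*"] + [f"{table}.{column}" for table, _, column in subtypes]
    )
    joins = " ".join(
        f"LEFT JOIN {table} ON {table}.{fk} = {kernel}.{key}"
        for table, fk, _ in subtypes
    )
    query = f"SELECT {columns} FROM {kernel} {joins} ORDER BY {kernel}.{key}"
    return conn.execute(query).fetchall()


rows = star_view(
    conn,
    "screw",
    "id",
    [("screw_material", "screw_id", "material"),
     ("screw_head", "screw_id", "head_form")],
)
```

The left joins keep kernel objects that lack a subtype entry, so the second screw appears with no material value rather than disappearing from the result.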

This observation has been taken into account by the OLAP community. Kimball [25] claims that ER modeling is completely wrong and that database modeling should be based on stars and snowflakes in the sense in which the OLAP community uses them. This claim is far too strict. In the same fashion, the snowflake schema displayed partially without attributes in Fig. 2 can be generated on a schema used for representing the information structure on purchases.

Star structuring and snowflake structuring are becoming popular in the XML community. Modeling with optional parts is a typical approach used for XML structures. The cohesion of elements must be expressed through constraints. Star structuring allows the cohesion of elements to be expressed by developing a subtype that contains the coexisting elements.

The design process of web sites [6], [10] often does not include the design of search facilities. These facilities are often added after the whole web site has been designed. Search tools are used for realizing site maps (e.g. [51]) or catalogue structures. But site maps or catalogues

  • are often too big, especially for very large web sites,

  • do not have the abstraction level the user needs, and

  • do not follow semantic guidelines resulting from the application.


Another kind of search facility is the keyword-based search engine [32].

The problems of worldwide search engines are well known:

  • The result set is too big.

  • The ranking algorithm does not fit the user's needs. Keywords are not specific enough to decide which document is more important than another for a specific user. Search engines that accept natural language are very rare and have to deal with the problem of ambiguity.

  • The search engines do not rate the profundity of information. On the one hand, a user may want more general information on a topic, which is represented, e.g., by information at the top of a website. On the other hand, a user may want details of a topic, which are represented by the leaves of a website. Most keyword search engines do not support search at different abstraction levels.


These drawbacks are also relevant if such a search engine is used for one's own site, like MyGoogle. In addition, the provider's meta-knowledge about the site is lost.

Search strategies have been developed based on set-theoretic approaches in SQL. These approaches are very general. They allow a very large set of queries to be formulated. In most cases we do not need this generality. Often we can rely on simpler retrieval facilities. These retrieval facilities may be directly derived from the schema.

In this paper, we will illustrate that star and snowflake structures allow the following set of generic search and retrieval methods to be derived directly:

  • String bag approach based on meta-properties.

  • Retrieval as star function.

  • Meta-property-based retrieval.

  • Association-based retrieval.

  • Fuzzy retrieval.

  • Retrieval based on special functions.


These methods are derived on the basis of a linguistic instrument: concept fields. Concept fields are an abstraction of word fields [23] and are used to define classes of words with the same basic semantic description. The functions of concept fields can be directly mapped onto star- and snowflake-like schemes. The multi-dimensional meta-structures based on stars and snowflakes are described in [14].
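Two of the retrieval methods listed above, the string bag approach and fuzzy retrieval, are only named in this section, not defined. The following sketch shows one possible reading and should not be taken as the authors' formulation: it assumes that a string bag is the multiset of words attached to an object, and it uses approximate string matching from Python's standard `difflib` module as a stand-in for fuzzy retrieval. The `hotels` records are invented.

```python
from collections import Counter
from difflib import get_close_matches

# Illustrative records only; not data from the paper.
hotels = {
    "Alpha": {"city": "Cottbus", "features": ["sauna", "parking"]},
    "Beta": {"city": "Rostock", "features": ["harbour view", "parking"]},
}


def string_bag(record):
    """Collect all words occurring in a record into a bag (multiset).

    Assumption: this is one plausible interpretation of a string bag.
    """
    bag = Counter()
    for value in record.values():
        values = value if isinstance(value, list) else [value]
        for text in values:
            bag.update(str(text).lower().split())
    return bag


def search(term, records, cutoff=0.75):
    """Return the names of records whose bag holds a word close to `term`."""
    term = term.lower()
    hits = []
    for name, record in records.items():
        words = list(string_bag(record))
        # A similarity of 1.0 is an exact hit; lower values tolerate typos.
        if get_close_matches(term, words, n=1, cutoff=cutoff):
            hits.append(name)
    return sorted(hits)


exact_hits = search("parking", hotels)
fuzzy_hits = search("sauno", hotels)  # misspelled search term
```

In a star schema the bag of an object would be filled from the kernel type and its subtypes, so the search function does not need to know which attribute a word came from.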

Section snippets

Word fields

The mind map of dimensions is based on the theory of word fields [27], [40]. A word field is a linguistic system [9] in which similar words

  • that describe the “same” basic seme and

  • that are used in the same context


are combined to a common structure and data set.

In contrast to a common synonym dictionary, word fields define the possible or necessary actors, the actions and the context. Word fields can be used for verbs, nouns and adjectives. We focus on verb fields and extend the implementations of the

Automatic derivation of generic search in star and snowflake schemata

Star and snowflake schemata have functions which are simpler. These functions can be formalized by generic functions similar to the classical generic modification functions insert, delete, and update. Based on those functions, any modification of a database can be specified. The description of a modification is sometimes cumbersome. The same observation is valid for retrieval functions. SQL allows a large number of queries to be formulated. The experienced database programmer can also formulate

Conclusion

Search is at the same time one of the central common facilities of information systems and one of the most difficult tasks. There is no general approach to the development of search interfaces. In this paper we develop an approach that is general provided the system is supported by additional information and by conceptions used in computational linguistics and in advanced database modeling:

  • Word fields are linguistic systems that describe the same basic seme and that are used in the same

Acknowledgements

The authors are thankful to the reviewers of their NLDB’2002 contribution and to the reviewer of this paper. Their detailed and substantial remarks have led to improvements of the concepts and of the style of the paper.

References (51)

  • M. Broy, Compositional refinement of interactive systems, Journal of the ACM (1997)
  • L. Brown, Integration models: templates for business transformation (2000)
  • K.-H. Bünting, Introduction to Linguistics (1996)
  • S. Ceri, P. Fraternali, S. Paraboschi, Data-driven, one-to-one web site generation for data-intensive applications, in:...
  • K. Czarnecki et al., Generative Programming (2000)
  • A. Düsterhöft, B. Thalheim, Linguistic Search Facilities in Snowflake-Like Database Schemes (long version). BTU...
  • C. Fellbaum, The English verb lexicon as a semantic net, International Journal of Lexicography (1990)
  • T. Feyer et al., E/R based scenario modeling for rapid prototyping of web information services
  • T. Feyer et al., Many-dimensional schema modeling
  • T. Feyer et al., Conceptual design and development of information services
  • B. Ganter et al., Formal Concept Analysis: Mathematical Foundations (1998)
  • Available from...
  • Y. Gurevich, Draft of the ASM guide, Technical Report, EECS Department, University of Michigan,...

Antje Duesterhöft studied computer science and linguistics in Rostock and received her Ph.D. in computer science from the Technical University of Cottbus, Germany. She has been a senior researcher at the database group of the University of Rostock and headed the Getess project team at Rostock University. Currently, she is a professor at the University of Applied Science at Wismar. Her research interests include data and ontology modeling, linguistics, website development methodologies, website technology and internet search engines.

Bernhard Thalheim (born in 1952) studied mathematics and computer science at the universities of Dresden and Moscow. He has held professorships in Dresden, Kuwait, Rostock and now Cottbus. He works on the theory of the relational and object-relational data models, on the theory and pragmatics of Entity-Relationship and object-relational modeling, and on the technology of object-relational DBMS and their design. He has led several database design tool projects ((DB)2, RADD) and application projects in collaboration with industrial partners, has been a member of more than three-score program committees, and is a consultant for large enterprises. Currently, he chairs the InfoServices team at Cottbus Tech, which has developed, installed and maintains more than 30 large information-intensive websites.

    Expanded version of the talk given at NLDB’2002 in Stockholm in June 2002.
