Software Tool Article

AQUA: an Advanced QUery Architecture for the SPARC Portal

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 16 Sep 2021

This article is included in the Hackathons collection.

Abstract

The Stimulating Peripheral Activity to Relieve Conditions (SPARC) program integrates biological and neural information to create anatomical and functional maps of the peripheral nervous system. The SPARC Portal hosts dynamic storage for datasets, models, and resources to help researchers find and produce data. Currently, the SPARC Portal provides a basic search tool that lacks several features needed for a good search experience. To retrieve the required information from the stored datasets and resources in a targeted way, we have developed an Advanced QUery Architecture (AQUA) for the SPARC Portal. The major features of AQUA are near-real-time auto-completion of queries, close-match suggestions, and multiple filters to narrow or sort the results, with the goal of enhancing the usability of the SPARC search engine. AQUA is available from: https://github.com/SPARC-FAIR-Codeathon/aqua

Keywords

AQUA, SPARC, biological query, natural language processing, NIFS Ontology, text mining, Codeathon

F1000 Research Statement of Endorsement

David Nickerson confirms that the author has an appropriate level of expertise to conduct this research, and confirms that the submission is of an acceptable scientific standard. David Nickerson declares he is NF’s primary supervisor and one of the organisers of the 2021 SPARC FAIR Codeathon. Affiliation: Auckland Bioengineering Institute, University of Auckland.

Introduction

The Stimulating Peripheral Activity to Relieve Conditions (SPARC) program is a platform to assist neuroscientists in developing new medical devices.1 It aims to leverage our understanding of nerve-organ interactions in biological entities and advance existing medical tools. It hosts over a hundred datasets, projects, and resources that are increasing in number, and in the future, there will be a need for a robust tool to explore the expanding content. A targeted data retrieval from the SPARC Portal can boost the researcher-portal interaction experience and help users find the data they seek. However, the search features of the SPARC Portal are limited.

Currently, the search engine of the SPARC Portal does not account for close matches or misspelt words. The basic display of the returned results does not emphasise the matched text and does not allow users to filter or sort the results. This makes it hard for users to find the resources they need and, once results are returned, to narrow or sort them. Moreover, the description currently shown for each returned result does not necessarily contain the matched keywords, which can cause confusion. We have developed an application that we believe will enhance the SPARC Portal search by addressing these issues, moving the Portal towards a FAIR (Findable, Accessible, Interoperable and Reusable) repository that benefits researchers globally.

Advanced QUery Architecture (AQUA) is an application that aims to improve the search capabilities of the SPARC Portal. In particular, it makes the search engine better at reading and interpreting queries. It also enhances the result display of the SPARC Portal by making it more user-friendly and by providing users with more sophisticated result filtering and sorting options. Our end goal is to substantially improve the visibility of the SPARC datasets. This, in turn, will benefit the SPARC community as a whole, since its datasets will be more discoverable for reuse and subsequent collaborations.

AQUA was started and completed during the two-week 2021 SPARC FAIR Codeathon held in July. In AQUA, we have incorporated Artificial Intelligence tools to process and refine the queries on the SPARC Portal and to implement predictive typing that gives feasible suggestions. AQUA then auto-corrects the queries to match the existing data on the SPARC Portal and the Neuroscience Information Framework Standard (NIFS) Ontology. This returns the most probable datasets that match the search keywords, together with a list of related new keywords. To enhance the current results display, we have added features that, first, filter and sort the results more precisely; second, emphasise the matched text for easier skimming; and third, when no matching results are available, allow users to enter their email addresses and be notified when their requested dataset is published.

In this paper, we first review the implementation of AQUA and how its main components interact with the user and the SPARC Portal. Next, we provide more details on the sub-sections of each component and the tools and packages they use. We describe the features added in the AQUA User Interface (UI) and discuss how it differs from the existing SPARC Portal. Finally, we describe how AQUA can change the search tool on the SPARC Portal and outline possible future developments of AQUA.

Methods

Implementation

This section discusses the improvement of the search tool on the SPARC Portal. Figure 1 demonstrates how the AQUA UI (also referred to as the frontend) and the AQUA server-side data-access layer (also referred to as the backend) form a bridge between the user and the SPARC Knowledge Base. The AQUA UI receives the user’s queries, formulates them in JSON, and transfers them to the AQUA backend module. The AQUA backend searches the SPARC Knowledge Base for the formulated queries. Once the matching datasets/resources are found, the AQUA backend returns the ranked results to the AQUA UI, which displays them according to the user’s ranking/filtering preferences. The AQUA UI is implemented using the HTML-CSS-JS trio, and the main tools used for the AQUA backend are Python, Docker, SQLite,2 and SciGraph.
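
As an illustration of the UI-to-backend hand-off described above, the following minimal sketch packages a search request as JSON. The field names (`query`, `match`, `sort_by`, `filters`) and the function `build_query` are hypothetical and are not taken from the AQUA source code; the example author value is made up.

```python
import json


def build_query(text, match="any", sort_by="relevance", filters=None):
    """Package a search request as a JSON string (hypothetical field names)."""
    if match not in {"any", "exact"}:
        raise ValueError("match must be 'any' or 'exact'")
    payload = {
        "query": text.strip(),     # raw or refined query text
        "match": match,            # "any" of the words, or "exact" phrase
        "sort_by": sort_by,        # e.g. relevance, date, title
        "filters": filters or {},  # e.g. {"author": [...], "keyword": [...]}
    }
    return json.dumps(payload, sort_keys=True)


# Example request body that a frontend could send to a backend endpoint.
request_body = build_query("brainstem neuron", filters={"author": ["Example Author"]})
print(request_body)
```

Keeping the request a plain JSON document means the backend can stay independent of how the frontend is built.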


Figure 1. AQUA workflow.

Figure 2 depicts the pipeline of AQUA in three major sections:

  • Query refinement:

    • 1. Auto-completion: Based on the typed term, AQUA automatically completes the query if the term partially or completely matches any keyword. It then sends the selected keyword to the AQUA backend.

    • 2. Suggestions: If no exact matches are found, AQUA finds close matches and suggests them to the user by displaying the phrase: “Showing results for ...”. If the user chooses to search for the initial query instead, AQUA sends the raw, uncorrected query to the AQUA backend.

  • Results filtering:

    • 1. Sort by: When the results for the query are displayed, the user has the option of sorting them by Relevance, Date published, or Alphabetical order.

    • 2. Filter by: The results can also be filtered based on Keyword, Author, Category, and Publication date.

    • 3. Matched text emphasised: The searched keywords will be emphasised in the returned results.

  • “Notify me”: Finally, if no results are returned by the AQUA backend, our tool asks users whether they want to be notified when a related resource is published. The tool checks the validity of the given email address and then stores it using SQLite. Thereafter, it checks for any updated/uploaded related resource on the SPARC Portal every day at 2 AM EDT. When the requested resource becomes available, it sends a notification email to the registered user.


Figure 2. An overview of the AQUA pipeline.

The grey and yellow boxes correspond to the “Query refinement” and “Notify me” modules of the AQUA backend, respectively. The green box corresponds to the “Results filtering” function of the AQUA frontend on displaying the results. The purple boxes illustrate the filters and sorting options.
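
The auto-completion step of the pipeline can be sketched roughly as follows. This is a simplified prefix match over a small word list, assuming that the three-letter threshold applies to the whole typed query; it does not use or reproduce the fast-autocomplete library, and the names `MIN_CHARS` and `autocomplete` are hypothetical.

```python
MIN_CHARS = 3  # assumed threshold: start completing after the third typed letter


def autocomplete(partial_query, vocabulary, limit=5):
    """Complete the last word of a partially typed query by prefix match."""
    if len(partial_query.strip()) < MIN_CHARS:
        return []  # too short: avoid returning a flood of candidates
    *head, last = partial_query.lower().split()
    completions = sorted(word for word in vocabulary if word.startswith(last))
    # Re-attach the words typed before the one being completed.
    return [" ".join(head + [word]) for word in completions[:limit]]


vocabulary = ["brain", "brainstem", "neuron", "in", "rat"]
print(autocomplete("brainstem neu", vocabulary))
```

Because each word is completed independently, a multi-word query can be completed from separately indexed words, which is the behaviour the text attributes to fast-autocomplete.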

The AQUA platform integrates Python libraries, data mining tools, a SQL database engine, and the Document Object Model (DOM) API to mimic an environment similar to the SPARC Portal, with search functionality improved in multiple ways.

AQUA backend

The AQUA backend is responsible for querying the SPARC Knowledge Base for information, delivering data to the frontend, and processing any logic that the AQUA UI requires. The SPARC Knowledge Base comprises two references: the SPARC dataset metadata and the NIFS ontology. Metadata is “data about data”, i.e., additional information provided about datasets. The SPARC dataset metadata includes information such as title, description, and techniques, as well as the number of files, formats, licenses, etc. (SPARC dataset metadata), and the NIFS ontology is a set of community ontologies used by SPARC to annotate data and models.

The AQUA backend focuses on two main features: Query refinement and Email notification. Below, we give a brief introduction to these added features.

  • Query refinement:

    When the initial query term is inserted it goes through two paths: auto-completion (yellow box in Figure 3) and suggestions (purple box in Figure 3).

    • 1. Auto-completion:

      The AQUA query refinement module auto-completes queries after the third inserted letter, while the user is typing. The idea of auto-completion is to prevent typos and to give a better user experience in the SPARC Portal. We have created an n-gram model for auto-completion and used the Python library fast-autocomplete. In the spelling correction task, an n-gram is a contiguous sequence of n letters from a given sample of text. An n-gram model is used to compare strings and compute the similarity between two words by counting the number of n-grams they share. This technique is language independent. The more n-grams two words share, the more similar they are.3

      Elasticsearch’s auto-complete suggester was not fast enough for our needs and does not do everything that we require. Consequently, we used the Python library fast-autocomplete, which provides a much faster process (reducing the average latency from 900 ms to 30 ms). In addition, Elasticsearch’s auto-complete suggester does not handle combinations of the words in query terms. For example, fast-autocomplete can handle “brainstem neuron in rat” when the words “brainstem”, “neuron”, “in”, and “rat” are fed into it separately, while Elasticsearch’s auto-complete needs the whole phrase to be fed to it before it can show it in the auto-complete results.

    • 2. Suggestions:

      Simultaneously, AQUA uses SciGraph for auto-correction and suggestion. SciGraph represents ontologies and ontology-encoded knowledge in a Neo4j graph. However, we found that using SciGraph alone is not sufficient, because SciGraph returns alternative queries/suggestions without correcting the initial query. For example, if there is a typo or a missing space between the words of a query (scriptio continua), SciGraph returns either no results or irrelevant results from Elasticsearch. Therefore, we have added a new auto-correction feature that segments queries with missing spaces and fixes spelling errors through a pipeline to SymSpellPy. SymSpellPy is a Python port of SymSpell for spelling correction, fuzzy search and approximate string matching. This improves the query before the request is sent to Elasticsearch. The auto-correction result is combined with the suggestion results and then executed as the final query search terms. This is demonstrated within the purple box in Figure 3.

      1. Word segmentation:

        Word segmentation divides a string into words by inserting missing spaces at the appropriate positions.

      2. Spelling correction:

        This step supports spelling correction (word splitting/merging) of multi-word input strings in three cases4:

        1) Extra space inserted into a correct word which leads to two incorrect terms; 2) Removed space between two correct words which leads to one incorrect term; 3) Multiple independent input terms with/without spelling errors.

      To read more on AQUA query refinement visit: https://github.com/SPARC-FAIR-Codeathon/aqua/blob/main/Documentation/QueryRefinement.md.

  • Email notification

    The primary purpose of this module is to notify users whenever a new dataset is published matching their search terms. However, users can still use the same function to receive a summary table including basic information and links to all datasets currently matching their keywords. Additionally, as the “Notify me” module saves the requests in a database, this information can be further accessed and analysed to improve the content (Figure 4).

    We can summarize the “Notify me” actions as follows:

    • 1. Adds email requests with keywords;

    • 2. Scans for existing search hits and sends email;

    • 3. Moves the pending requests to a waiting list that is scanned daily;

    • 4. Moves the fulfilled requests to an archive;

    • 5. Any failed requests (that already have matching hits) will remain on the waiting list for one month, during which the “Notify me” module will try to send the email daily. Afterwards, if the email still fails, it will be moved to the archive with a “failed” status.

    To read more visit: https://github.com/SPARC-FAIR-Codeathon/aqua/blob/main/Documentation/NotifyMe.md.


Figure 3. Query refinement by Auto-completion/Suggestions.

The purple box corresponds to the path into returned suggestions and the yellow box corresponds to the auto-completion path. The procedure is demonstrated by an example of inserting a misspelt initial query (braistem) into the module.
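
To make the two refinement ideas concrete, the sketch below scores close matches by shared character n-grams and restores missing spaces with a small dynamic-programming segmenter. It is an illustration under stated assumptions: the Dice normalisation is our choice (the text only says that shared n-grams are counted), the segmenter is a plain dictionary lookup rather than SymSpellPy's frequency-based method, and the tiny vocabulary is made up.

```python
def char_ngrams(word, n=2):
    """Return the set of contiguous n-letter sequences in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over shared character n-grams (assumed normalisation)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def suggest(query, vocabulary, n=2):
    """Return the vocabulary term that shares the most n-grams with the query."""
    return max(vocabulary, key=lambda term: ngram_similarity(query, term, n))


def segment(text, vocabulary):
    """Insert missing spaces so every piece is a known word, else return None."""
    words = set(vocabulary)
    best = [None] * (len(text) + 1)  # best[i] holds a segmentation of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in words:
                best[end] = best[start] + [text[start:end]]
                break
    return " ".join(best[-1]) if best[-1] is not None else None


vocabulary = ["brain", "brainstem", "neuron", "stem"]
print(suggest("braistem", vocabulary))          # misspelt query from Figure 3
print(segment("brainstemneuron", vocabulary))   # query with a missing space
```

With bigrams, "braistem" and "brainstem" share six of their 7 and 8 bigrams respectively, giving a Dice score of 12/15 = 0.8, which is why "brainstem" is preferred over "brain" here.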


Figure 4. The pipeline of the AQUA “Notify me” module.
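
The “Notify me” workflow can be approximated with a few lines of Python and SQLite. This is a minimal sketch, not the AQUA implementation: the table layout, the status values, the loose email pattern, and the `search` and `send_email` callables are all assumptions, and the one-month expiry of failed requests is omitted for brevity.

```python
import re
import sqlite3

# Deliberately loose validity check; the real module may validate differently.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def open_store(path=":memory:"):
    """Open (or create) the request database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "id INTEGER PRIMARY KEY, email TEXT, keywords TEXT, status TEXT)"
    )
    return conn


def add_request(conn, email, keywords):
    """Validate the address and put the request on the waiting list."""
    if not EMAIL_PATTERN.match(email):
        raise ValueError(f"invalid email address: {email!r}")
    conn.execute(
        "INSERT INTO requests (email, keywords, status) VALUES (?, ?, 'waiting')",
        (email, keywords),
    )
    conn.commit()


def daily_scan(conn, search, send_email):
    """Email users whose keywords now have hits and archive those requests."""
    rows = conn.execute(
        "SELECT id, email, keywords FROM requests WHERE status = 'waiting'"
    ).fetchall()
    for request_id, email, keywords in rows:
        hits = search(keywords)
        # A request stays on the waiting list if there are no hits or sending fails.
        if hits and send_email(email, hits):
            conn.execute(
                "UPDATE requests SET status = 'archived' WHERE id = ?",
                (request_id,),
            )
    conn.commit()
```

In such a setup, `daily_scan` would be triggered by a scheduler once a day, with `search` wrapping the backend query and `send_email` returning `True` only when delivery succeeds.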

AQUA UI

The AQUA UI receives the user’s queries, formulates them, and transfers them to the AQUA backend module. When the response from the AQUA backend is received, the AQUA UI interprets it and displays the content on the screen. Like the SPARC Portal web application, the AQUA UI is implemented using VueJS and NuxtJS. Nuxt is a higher-level framework built on top of Vue.js for designing and creating advanced web applications.5 The AQUA UI displays the customised list of results with the searched keywords emphasised.

Operation

To start the application follow the steps in Installation.

How to use the features added by AQUA to the SPARC Portal search engine?

The application works like other similar search engines with a user interface mimicking the SPARC Portal environment.

  • 1. Predictive search typing:

    AQUA provides auto-completion for users’ queries as they type. This feature is powered by SciGraph and training data from the SPARC Knowledge Base. AQUA only shows auto-completions after users have typed three or more letters, to avoid returning too many results, which would slow down the application.

  • 2. Advanced search options:

    By expanding the “Advanced search” tab under the search box, users can select whether AQUA searches for Exact match for their query or Any of the words. The default is Any of the words match.

  • 3. Advanced sorting:

    The existing SPARC Portal allows sorting by dataset title (alphabetically) and by published date. AQUA adds a “Relevance” sorting criterion that orders results by how relevant they are to the user’s search query. This is set as the default sorting option.

  • 4. Advanced filtering:

    The existing SPARC Portal only allows for filtering based on “Dataset status”, which is either Published or Embargoed. AQUA adds more sophisticated filtering options. Users can filter datasets by one or several keywords, authors, and categories. Hit “Enter” after each “Keyword”, “Author”, or “Category” in their respective box to register it. After the entries are registered, click “Apply” to filter dataset results.

  • 5. Email notifications for new matched datasets:

    Users can opt in to receive emails about new datasets that match their search query. We believe this is a much-needed option for users to stay updated about their search and SPARC datasets. Simply click on “Create alerts” under the search box and enter an email address. AQUA will send an email when newly added dataset(s) that match the search query are published by SPARC. This is a one-time-only email subscription.

  • 6. Emphasise matched texts in result display:

    When a dataset is returned, any matched text in the dataset title and description will be emphasised for easy and convenient lookup.
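
The sorting, filtering, and emphasis features listed above could be sketched as follows. The result fields (`title`, `score`, `published`, `keywords`, `authors`) and the function names are hypothetical, requiring every selected keyword and author to match is an assumption about the filter semantics, and the `<b>` tag stands in for whatever markup the UI actually uses.

```python
import re


def filter_results(results, keywords=(), authors=()):
    """Keep results that carry every requested keyword and author (assumed AND)."""
    def matches(item):
        return (set(keywords) <= set(item["keywords"])
                and set(authors) <= set(item["authors"]))
    return [item for item in results if matches(item)]


def sort_results(results, by="relevance"):
    """Sort by relevance score (default), publication date, or title."""
    if by == "relevance":
        return sorted(results, key=lambda r: r["score"], reverse=True)
    if by == "date":
        return sorted(results, key=lambda r: r["published"], reverse=True)
    if by == "title":
        return sorted(results, key=lambda r: r["title"].lower())
    raise ValueError(f"unknown sort option: {by}")


def emphasise(text, query):
    """Wrap each case-insensitive occurrence of a query word in <b> tags."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return text
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", text)


print(emphasise("Rat brainstem neurons", "brainstem"))
```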

Use case

We conducted experiments to compare the performance of the AQUA query refinement module when deploying either SciGraph or fast-autocomplete. We analysed the auto-completion of queries in terms of performance and execution time, and compared these two criteria in two scenarios: correct queries, and queries with one typo. Our experiment revealed that fast-autocomplete returns more completions than SciGraph both for correct queries and for queries with a typo. In addition, fast-autocomplete returned results 24 times faster for correct queries and 11 times faster for queries with typos.

We tested the performance of the AQUA spelling correction module and compared the results with the SPARC’s Elasticsearch. To do this, we randomly selected 22 sets of queries from the SPARC dataset, each containing fifty keywords or phrases. The queries were then modified to include different types of typos (deletion, insertion, replacement). We calculated the Mean Average Precision (MAP) of AQUA and the SPARC’s Elasticsearch in spelling correction. Results showed that as the number of terms in a query increases, the performance of AQUA noticeably surpasses that of the SPARC’s Elasticsearch (Table 1). The same steps were taken for queries on the name(s) of author(s) as keywords, over 9 test collections. Table 2 shows that AQUA performs better in correcting misspellings that appear in a two-term “author” query. The most significant difference appears for “author” queries that have lost the space between terms, where AQUA’s MAP is 0.92 while the SPARC’s Elasticsearch’s MAP is only 0.12.

Table 1. Mean Average Precision (MAP) of AQUA and the SPARC’s Elasticsearch over 22 test collections consisting of biological keywords as queries.

Typo             1 term              2 terms             3 terms
                 AQUA      ES        AQUA      ES        AQUA      ES
0 typo           0.714785  0.711452  0.569673  0.569673  0.680431  0.677097
1 del            0.635935  0.677184  0.555371  0.505849  0.668609  0.653644
1 insert         0.704785  0.742356  0.56559   0.572663  0.680431  0.661312
1 replace        0.644126  0.772202  0.548968  0.568364  0.680431  0.646185
no space         NaN       NaN       0.568006  0.987667  0.667097  0.816667
no space 1 typo  NaN       NaN       0.559696  0.995918  0.670508  0.056122
no space 2 typo  NaN       NaN       0.484005  0.056667  0.644305  0.010204
no space 3 typo  NaN       NaN       0.446296  0.184211  0.589903  0.003472
3 typo           NaN       NaN       0.540761  0.481212  0.646919  0.621238

Table 2. Mean Average Precision (MAP) of AQUA and the SPARC’s Elasticsearch over 9 test collections consisting of authors as queries.

Typo             1 term              2 terms
                 AQUA      ES        AQUA      ES
0 typo           0.863212  0.897673  0.926911  0.952778
1 del            0.613025  0.675974  0.818579  0.797889
1 insert         0.843871  0.914193  0.926944  0.96
1 replace        0.822374  0.867786  0.913039  0.913265
no space         NaN       NaN       0.926911  0.1245

The experiment results and description are available here. The code for running the experiments and the data are also available on: https://github.com/SPARC-FAIR-Codeathon/aqua/tree/main/experiment.
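
For readers unfamiliar with the metric reported in Tables 1 and 2, the following sketch shows the standard definition of Mean Average Precision. It is not the evaluation script from the repository; the function names and the toy rankings are illustrative only.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_results, relevant_items) pairs, one per query."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: one query ranks relevant items first and third, another finds none.
runs = [(["a", "b", "c"], ["a", "c"]), (["x"], ["y"])]
print(mean_average_precision(runs))
```

For the first toy query the average precision is (1/1 + 2/3) / 2, and the second contributes zero, so a system that fails completely on some queries is pulled down sharply, as in the "no space" rows above.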

Conclusions and next steps

This paper demonstrated how the SPARC Portal could be more FAIR by improving its search feature through AQUA. Since the first contact between researchers and a repository of datasets/models/resources is through the website’s search engine, we enhanced the search system’s functionality and the user interface. In AQUA, we deployed multiple tools and packages to make querying the data more precise, convenient, and effective.

We propose to add a view type to the existing SPARC Portal to enhance the users’ experience with the website. The SPARC Portal’s existing view type is “List”, and AQUA proposes to add a “Gallery” view option in the future. We also plan to add a new discovery feature to the SPARC Portal to find resources by querying snapshots of simulations. This can be done by segmenting the simulation results into smaller time intervals or other chunks of data. Currently, the AQUA “Notify me” feature is a one-time-only email notification; options to be alerted more than once can be added in the future. AQUA can also enhance the SPARC search engine further by improving the user’s next query. This will be done by developing a session-based search that draws on the user’s search or clickthrough history on the Portal. The feature will create a personalized experience for users and thus enhance their overall experience with the SPARC Portal.

Software availability

Source code available from: https://github.com/SPARC-FAIR-Codeathon/aqua/blob/main/LICENSE

Archived source code as at time of publication: https://doi.org/10.5281/zenodo.5352470.6

License: MIT

The AQUA application can be installed and run by cloning the main GitHub repository and following the command line instructions. Instructions on how to clone a GitHub repository can be found here.

how to cite this article
Shahidi N, Lin X, Munarko Y et al. AQUA: an Advanced QUery Architecture for the SPARC Portal [version 1; peer review: 1 approved with reservations, 1 not approved] F1000Research 2021, 10:930 (https://doi.org/10.12688/f1000research.73018.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 26 Jul 2022
Maryann E. Martone, Department of Neurosciences, Center for Research in Biological Systems, University of California, San Diego, San Diego, California, USA 
Approved with Reservations
The authors describe a query tool they developed for the SPARC Portal during a code-a-thon held in July 2021. The authors correctly identified several shortcomings of the portal search at that time, and created a service that would address them. ... Continue reading
HOW TO CITE THIS REPORT
Martone ME. Reviewer Report For: AQUA: an Advanced QUery Architecture for the SPARC Portal [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2021, 10:930 (https://doi.org/10.5256/f1000research.76636.r141807)
Reviewer Report 01 Nov 2021
Vijay Rajagopal, Department of Biomedical Engineering, University of Melbourne, Melbourne, VIC, Australia 
Not Approved
The authors present a search and retrieve tool for the SPARC knowledge database. Overall, the contribution is important and in line with making research data FAIR. The article is also written well but is missing some key components that would ... Continue reading
HOW TO CITE THIS REPORT
Rajagopal V. Reviewer Report For: AQUA: an Advanced QUery Architecture for the SPARC Portal [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2021, 10:930 (https://doi.org/10.5256/f1000research.76636.r94519)