
MetaQA: Enhancing human-centered data search using Generative Pre-trained Transformer (GPT) language model and artificial intelligence

  • Diya Li,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Geography, Texas A&M University, College Station, Texas, United States of America

  • Zhe Zhang

    Roles Data curation, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    zhezhang@tamu.edu

    Affiliations Department of Geography, Texas A&M University, College Station, Texas, United States of America, Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, United States of America

Abstract

Accessing and utilizing geospatial data from various sources is essential for scientific research that addresses complex scientific and societal challenges requiring interdisciplinary knowledge. The traditional keyword-based geosearch approach is insufficient due to the uncertainty inherent in spatial information and in how it is presented on data-sharing platforms. For instance, the Gulf of Mexico Coastal Ocean Observing System (GCOOS) data search platform stores geoinformation and metadata in a complex tabular format. Users can search for data by entering keywords or selecting data from a drop-down menu in the user interface. However, the search results provide limited information about the data product; detailed descriptions, potential uses, and relationships with other data products are still missing. Language models (LMs) have demonstrated great potential in tasks such as question answering, sentiment analysis, text classification, and machine translation. However, they struggle when dealing with metadata represented in tabular format. To overcome these challenges, we developed the Meta Question Answering System (MetaQA), a novel spatial data search model. MetaQA integrates end-to-end AI models with a generative pre-trained transformer (GPT) to enhance geosearch services. Using GCOOS metadata as a case study, we tested the effectiveness of MetaQA. The results revealed that MetaQA outperforms state-of-the-art question-answering models in handling tabular metadata, underlining its potential for user-inspired geosearch services.

1. Introduction

Addressing societal challenges such as disaster management necessitates the efficient and accurate sharing and usage of geospatial data from various sources. A key part of this process is service matching, where user requirements for spatial data are used to search across archives. However, traditional search approaches, such as the one used by the Gulf of Mexico Coastal Ocean Observing System (GCOOS) [1], part of the U.S. Integrated Ocean Observing System (https://ioos.noaa.gov/), often fall short. These systems, which rely on keyword inputs from users, face limitations due to the inherent uncertainty within spatial information and its representation during the data search process. Furthermore, GCOOS, despite collecting and ensuring the reliability of thousands of data points, presents metadata in complex tables that are difficult to interpret. The system allows data search using keywords or by drawing a polygon on a map, but the returned results, primarily metadata tables, lack detailed descriptions of the data, its usage, and potential linkage with other data products. This limitation makes the system more suitable for experienced researchers familiar with the data and related products, while new users must resort to additional resources to understand the data product and its application in research, a process that can be time-consuming.

The current geosearch system is limited because GCOOS stores metadata in complex data tables, which are difficult to interpret. As search results, the system returns a metadata table that describes the data title, attribute names, and a data download link. Detailed descriptions of the data, its use, and possible linkage with other data products are missing from the current GCOOS data-sharing platform. Fig 1 illustrates an example of the GCOOS data search user interface, where users can enter a keyword such as “wind speed” in the keyword field to search for data. The system then returns a data table describing the data name and ID, metadata, and download link. The GCOOS advanced data search portal enables users to search for data at a spatiotemporal scale, where users can draw a polygon on the map and specify the data coverage time period. However, the results of the data search remain at the metadata tabular level, which cannot convey more detailed knowledge about the data or point to other data or literature related to the searched item. In other words, the current GCOOS data search system is useful for experienced researchers (or users) who know the data and related data products well. New users who are unfamiliar with the research area must use other literature or resources to learn about the data product and how to use the data in research, which is time-consuming [2, 3].

Fig 1. An illustration of using a keyword-based search in the GCOOS platform.

https://doi.org/10.1371/journal.pone.0293034.g001

The above research challenges raise a question: how can we develop a user-inspired data-sharing system that supports geosearch and makes spatial data adhere to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) [4]? Language models (LMs) have proven to be highly effective in question-answering systems, sentiment analysis [5], text classification [6], and machine translation [7]. However, despite their success in comprehending free-form natural language sentences by learning from massive amounts of unstructured text data, LMs still face significant challenges when metadata are represented as tables for question answering. The structure of tables and the relationships between the data are often implicit and difficult to recognize with raw language models [8–10]. To address this challenge, various methods have been developed to improve LMs’ ability to handle structured data such as tables. For example, fine-tuning techniques can be highly effective in adapting pre-trained models to specific tasks, and it is promising to see these techniques being applied to absorb table-specific representations [11]. Similarly, synthetic SQL (Structured Query Language) refinement through the generation of artificial queries and tables can help language models learn to better understand and work with structured data [12–14]. These methods are crucial for building user-driven data-sharing frameworks that support geosearch and render spatial data FAIR by enabling efficient and accurate data exchange [12, 15]. There are various research gaps in the development of tabular question-answering models, including the need for an extensive training corpus of high quality, the requirement to infer correlations among columns and rows, and the necessity to understand the implications of complex structures in the table. Addressing these gaps can result in improved models that are better equipped to reason about the components of a table and provide accurate responses.

In this project, we developed a novel data search model called MetaQA (Metadata Question Answering) to enable user-friendly geosearch and service matching. The MetaQA system uses the Generative Pre-trained Transformer (GPT) language model and machine learning algorithms to understand and interpret the user’s query and provides relevant answers based on the metadata and the GPT knowledge base. We applied our MetaQA model to GCOOS metadata to help GCOOS improve its data services. Finally, we tested the performance of MetaQA on the Metadata Question Answering datasets by comparing different tabular QA models and showed that MetaQA performed better than the work described in [15–17]. Our approach leverages the power of the GPT language model to provide users with rich information about the data and its relevant literature based on the keyword question [18]. Furthermore, the output of the MetaQA system can provide context and background knowledge about the scientific data, which enables researchers to find relevant data more quickly, ultimately leading to new scientific discoveries and insights [18].

The remainder of this article is organized as follows. Section 2 introduces the literature background of our research. Section 3 presents the data collection and processing methods. Section 4 illustrates the MetaQA methodology for the metadata tabular question-answering model. Section 5 presents experiments with the proposed method, analyzes the results, and discusses applications and limitations. The last section concludes the study.

2. Background

Data-sharing platforms (e.g., GCOOS, National Aeronautics and Space Administration, and U.S. Geological Survey data portals) are becoming increasingly important for researchers in various fields as they enable researchers to access, reuse, and compare data across disciplinary boundaries [19, 20]. By making data more widely available, these platforms can also accelerate scientific discovery, facilitate collaboration, and help ensure the reproducibility of research findings. For instance, researchers working in disaster management can access disaster-related data from various data-sharing platforms and analyze the data to observe climate patterns and conduct disaster mitigation plans [21–24]. Additionally, a growing amount of spatiotemporal data is being collected and analyzed across various domains, including earth sciences, urban science, transportation, epidemiology, and social sciences [25–28]. Spatiotemporal data is distinct from relational data, as it includes both spatial and temporal attributes with multiple actual measurements [26, 29].

The traditional keyword-based geosearch approach may not be sufficient for searching and discovering spatiotemporal data on data-sharing platforms, particularly when spatial information is uncertain or when the data is scattered across multiple platforms with unstructured metadata [30]. In such cases, alternative approaches such as semantic search, machine learning, and natural language processing may be more effective in discovering and integrating relevant data. These approaches can help to identify relevant data based on the context and meaning of the search query rather than just relying on exact keyword matches. They can also help to identify and extract spatial information from unstructured metadata, and to integrate data from multiple sources based on their semantic similarity [3]. For example, Elnozahy et al. [31] found that the QA system can help to overcome the barriers posed by limited metadata and domain-specific knowledge. By leveraging large language models trained on relevant knowledge, a question-answering system can provide more effective and efficient access to scientific data, enabling researchers to discover and understand the data they need more easily [11, 32, 33].

Recent pre-trained language models such as BERT [6] and BART [7] have achieved state-of-the-art results on a wide range of natural language processing tasks. While language models have shown remarkable success in interpreting unstructured natural language, they can encounter difficulties with tabular question-answering tasks where the metadata is represented as structured data [8–10]. Tabular data such as GCOOS data has a different format and structure compared to natural language text: it is typically represented in a structured format with rows and columns, and each column has a specific data type and format [34, 35]. Different techniques and approaches have been proposed to address this challenge, including specialized handling of tabular data [17], semantic parsing [36], and entity linking [35]. These techniques aim to extract the relevant information from the table and map it to a structured representation that language models can use for question-answering. Semantic parsing techniques [36], in particular, translate the table schema and contents into a formal representation that language models can use for question-answering. The arrangement of tables and the associations between data are often implicit and not easily identified by raw language models, which may fail to deduce the relationships between columns and rows or to grasp the implications of absent values in the table [34, 35]. This shortcoming can lead to incorrect or inadequate answers when confronted with questions that necessitate reasoning about the relationships between various components of the table [5].

Developing tabular question-answering models can be challenging due to the need for a large-scale, high-quality training corpus. Tabular data is often complex and heterogeneous, and the natural language queries used to retrieve information from the tables can be diverse and nuanced. Liu et al. [15] illustrated these challenges, which include collecting parallel data comprising natural language sentences and tables as the pre-training corpus, and either crawling tables or synthesizing natural language sentences over available tables. While the crawling approach offers the advantage of a larger corpus, it is also more prone to noise and requires complicated heuristics to clean [15]. The synthesis approach, on the other hand, offers better control over the quality of the data, but it is also more costly and often lacks diversity. Recent studies [37, 38] found that using a generative pre-trained transformer (GPT) model to build a training corpus can accelerate synthetic data creation, allowing researchers to obtain a larger and more diverse corpus without relying on manual annotation or data crawling [37]. Moreover, researchers can fine-tune GPT on a smaller annotated corpus (e.g., metadata tables) and questions, allowing the model to learn the specific patterns and relationships in the data that are relevant to the tabular question-answering task [38].

Recent research [39] on prompt tuning in NLP has shown strong performance on downstream tasks. Compared to traditional fine-tuning methods, prompt tuning can use a smaller dataset consisting of carefully crafted prompts to optimize the performance of a pre-trained model on specific tasks [39]. The goal of prompt learning is to automatically generate a diverse set of high-quality questions that represent the data and can be used to train and evaluate question-answering models [40]. Similar to synthetic fine-tuning [15], using a template to generate and fit the parameters during training makes it easier for the model to understand and learn the patterns and relationships in the data. While large language models and prompt tuning show promise for generating high-quality questions from metadata tables, more research is needed to explore their potential for supporting metadata search and discovery [41, 42]. Prompt tuning typically follows the design of verbalizer mapping [43]. For example, if the input format is a structured query that requires a specific output format, such as a table or a list, the verbalizer mapping converts the output of the language model into the appropriate format [44]. Similarly, if the input format is a natural language question, the verbalizer mapping converts the output of the language model into a grammatically correct and meaningful sentence that answers the question [43]. Designing the verbalizer mapping in prompt tuning requires considerable effort to test the effectiveness of the template [39]. Existing research has found that a much larger general language model, such as GPT-3.5-turbo, can handle template generation [45].

3. Data

In this project, we used two types of data sets: the GCOOS metadata and the Wikipedia Tabular Question Answering Dataset. The GCOOS data was used to create a metadata question answering dataset, and the Wikipedia Tabular Question Answering Dataset was used to fine-tune the model to understand free-form tabular question answering.

As shown in Table 1, we evaluated the performance of our MetaQA model using the Wikipedia Tabular Question Answering (WikiTQA) Dataset [36] and the Metadata QA dataset. The GCOOS metadata was downloaded from the Gulf of Mexico Coastal Ocean Observing System (GCOOS) data service (https://gcoos.org/). Three kinds of data products were collected from the ERDDAP data servers (https://data.gcoos.org/erddap.php), including the historical collection of oceanographic and meteorological data, near-real-time observations, and biological and socioeconomic data. In this project, we built a metadata table question-answering dataset by collecting 25,810 raw metadata records from GCOOS. The attributes used for building the Metadata Question Answering dataset are described in Table 2. After preprocessing the raw metadata, 890 metadata tables were selected. The following data preprocessing steps were included (a minimal sketch follows the list):

  • Data Cleaning: Removing duplicate, missing, or irrelevant data to ensure the quality of the dataset.
  • Data Normalization: Converting data into a standard format, such as converting text to lowercase or converting dates into a standard format.
  • Data Integration: Merging data from different sources to create a unified dataset.
  • Data Transformation: Transforming the data into a suitable format for use in the question-answering system, such as converting text into numerical values or creating a table structure from unstructured data.
  • Data Sampling: Reducing the size of the training batch by sampling only a batch of rows and columns to make sure the input data satisfies the maximum input tokens.
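
The following is a minimal sketch of the preprocessing steps listed above, assuming the raw GCOOS metadata has been loaded into pandas DataFrames; the column heuristics and sampling limits are illustrative rather than the exact values used in the project.

```python
import pandas as pd

MAX_ROWS, MAX_COLS = 30, 20  # illustrative sampling limits for the tokenizer budget

def preprocess_metadata(tables: list[pd.DataFrame]) -> list[pd.DataFrame]:
    """Clean, normalize, and sample raw metadata tables (illustrative sketch)."""
    cleaned = []
    for df in tables:
        # Data cleaning: drop duplicate rows and fully empty rows/columns
        df = df.drop_duplicates().dropna(axis=0, how="all").dropna(axis=1, how="all")
        # Data normalization: lowercase text columns, standardize date-like columns
        df = df.apply(lambda col: col.str.lower() if col.dtype == "object" else col)
        for col in df.columns:
            if "date" in col.lower() or "time" in col.lower():
                df[col] = pd.to_datetime(df[col], errors="coerce").dt.strftime("%Y-%m-%d")
        # Data sampling: keep only a batch of rows and columns so the flattened
        # table stays within the model's maximum input tokens
        cleaned.append(df.iloc[:MAX_ROWS, :MAX_COLS])
    return cleaned
```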

Using the prompting method, example question-answer pairs were generated from these tables. This dataset will serve as a valuable resource for developing and evaluating AI models for question-answering systems, particularly those that operate on metadata tables.

4. Methodology

4.1 Masked language model

Masked language models involve predicting a masked word or phrase within a sentence or sequence of text [6]. By replacing certain tokens or words with a special token, often denoted as [MASK], these models are trained to predict the original word or token based on the context of the surrounding words. In this project, we applied a modified Bidirectional and Auto-Regressive Transformer (BART) [7] as the backbone language model. BART is a large-scale sequence-to-sequence model that combines autoencoder and autoregressive architectures for generating text. For example, in Fig 2, given the sentence “New York City is [MASK] region”, the weights of the masked language model are learned by predicting the missing word, such as “east” or “south”, from the context of the surrounding words. This training process lets the model learn to recognize the underlying structure of the text and to be more robust to noise or errors in the input. The approach can be applied to question-answering tasks by using the model to predict the answer to a given question based on a given context. In this project, we modified the BART-large model, which follows a standard sequence-to-sequence Transformer architecture, with GeLU activation functions for a smoother effect:

\mathrm{GeLU}(x) = \frac{x}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right) \quad (1)

where erf is the error function used in statistics to calculate the probability of an event. Adapting the question-answering task to the masked language model is similar to the text-completion example shown above. The input passed to the encoder contains a natural language sentence along with a corresponding table. It is crucial to mention that the quality of this natural language sentence significantly impacts the quality of the output. The sentence, ideally a question, should be clear, concise, and directly related to the data in the table for the model to provide a precise and accurate answer. Ambiguous or overly complex questions might lead to inaccurate or vague answers. Furthermore, both the natural language sentence and the table should be stringified so they can be transformed through tokenization, which breaks the text down into smaller units. Tokenization is an essential step in NLP tasks, as it allows the text to be processed by a computer in a more structured and efficient manner.
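
As a point of reference, the GeLU activation in Eq (1) can be computed directly with Python's standard library; this small numerical sketch is illustrative and independent of the actual BART implementation.

```python
import math

def gelu(x: float) -> float:
    """GeLU activation from Eq (1): (x / 2) * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# GeLU is smooth around zero, which is the "smoother effect" mentioned above
print([round(gelu(v), 4) for v in (-2.0, -0.5, 0.0, 0.5, 2.0)])
# approximately [-0.0455, -0.1543, 0.0, 0.3457, 1.9545]
```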

Fig 2. The demonstration of the BART model: A document with random errors or noise (left) is considered “corrupted”.

The model encodes this corrupted document bidirectionally and then uses an autoregressive decoder to calculate the likelihood of the original, error-free document (right). To fine-tune the model, an uncorrupted document is inputted to both the encoder and decoder [7].

https://doi.org/10.1371/journal.pone.0293034.g002

4.2 Metadata tabular question-answering

In MetaQA, we employed an encoder-decoder BART model and, as shown in Fig 3, applied pre-training as a sequence-generation task starting from a language model pre-trained on free-form natural language. We followed the table pre-training via execution [15] training architecture, which applies tabular question-answering downstream tasks in an end-to-end workflow: it starts with a BART base model as a pre-trained free-form language model to conduct the SQL synthesis. After that, the pre-trained LM is tuned with a multi-round tuning method on various datasets, including the Wikipedia Question Answering Dataset and the Metadata Question Answering Dataset.

4.2.1 Spatial-temporal SQL synthesizes.

Many geoscience data, such as GCOOS data, contain spatial-temporal information that is difficult to retrieve during the data search process [46]. To design a suitable architecture for metadata tabular question-answering, we argue that the structured spatial-temporal information in metadata can be exposed to the language model via SQL commands, since traditional search methods often use SQL execution in a relational database (e.g., MySQL) to find exact results. However, the existing SQL synthesizing method lacks spatial-temporal relationships [15]. Therefore, as shown in Fig 4, we adopt spatial-temporal SQL commands during the synthesis step. The synthesized dataset is automatically generated and sampled from the SQUALL dataset [14] with spatial-temporal keyword insertion. An example SQL template is data_product start_date >= val, where data_product and start_date denote column names with temporal attributes and val represents the value stored in certain rows. Note that the spatial-temporal query is a complex operation in Geographic Information Science (GIS). By tracing back to the base operations at the SQL execution level, we simplified the operations into eight categories: select, filter, aggregate, superlative, arithmetic, comparative, group, and spatial union & intersection. In this step, the input text is the stringified SQL command together with the corresponding tables. In practice, the table is flattened into a long-form text so that it can be passed directly to the tokenizer. Before flattening, the table is pre-processed with special indicators that express the table structure in a natural language way. For example, a table T with an n × m cell structure can be flattened as

T^{*} = [\mathrm{HEAD}]\ h_{1}, \ldots, h_{m}\ [\mathrm{ROW}]\ c_{1,1}, \ldots, c_{1,m}\ [\mathrm{ROW}]\ \ldots\ [\mathrm{ROW}]\ c_{n,1}, \ldots, c_{n,m}\ [\mathrm{TAIL}] \quad (2)

where [HEAD], [ROW], and [TAIL] are indicators that locate the headers, rows, and the end of the table. In this way, the spatial-temporal attributes are addressed during the training steps by applying multiple spatial-temporal queries.
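
The flattening in Eq (2) can be sketched in a few lines, assuming each metadata table is held in a pandas DataFrame; the indicator strings follow the description above and the example table is illustrative.

```python
import pandas as pd

def flatten_table(df: pd.DataFrame) -> str:
    """Flatten an n x m table into one string with [HEAD]/[ROW]/[TAIL] indicators (Eq 2)."""
    parts = ["[HEAD]", ", ".join(str(h) for h in df.columns)]
    for _, row in df.iterrows():
        parts.append("[ROW]")
        parts.append(", ".join(str(v) for v in row.tolist()))
    parts.append("[TAIL]")
    return " ".join(parts)

# Illustrative metadata table with a temporal attribute
meta = pd.DataFrame({"data product": ["IOOS-STATION-WMO-42896"], "start_date": ["2015-06-01"]})
print(flatten_table(meta))
# [HEAD] data product, start_date [ROW] IOOS-STATION-WMO-42896, 2015-06-01 [TAIL]
```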

Fig 4. The demonstration of transforming an executable SQL command into a natural language question.

https://doi.org/10.1371/journal.pone.0293034.g004

4.2.2 Prior knowledge fine-tuning.

Prior knowledge fine-tuning is an important step in metadata tabular QA because it allows the model to learn how to effectively use external information to improve its performance on the task [47]. The previous pre-trained language model is trained on large amounts of text data without targeting scientific question-answering tasks [48]. By fine-tuning a pre-trained language model on a spatial data metadata dataset, the model can learn to make more accurate predictions without modifying the architecture or parameter dimensions of the model. The WikiTQA dataset [36] is a science-related question-answering dataset designed to test the performance of table question-answering models on the downstream task. Because the question-answer sets were manually labeled on thousands of Wikipedia tables, an LM fine-tuned on WikiTQA can take free-form natural language questions, instead of SQL commands, and answer them by simply running inference with its fine-tuned weights. In this step, the pre-trained model is further trained with a smaller dataset and similar input and output. The detailed fine-tuning steps are:

  1. Select a pre-trained LM: We load the BART model with 400 million parameters, pre-trained on the SQL synthesis step, to initialize the fine-tuning model weights appropriately.
  2. Prepare the data: We collect and preprocess the dataset in this step. The preprocessing step is mentioned in Section 3.
  3. Fine-tune the model: In this step, we fine-tune the model with the Adam optimization algorithm [49], updating the model’s parameters to minimize a cross-entropy loss function in the fairseq library [50], which measures the difference between the model’s predictions and the true labels (a minimal sketch is shown after this list).
  4. Evaluate the model: After training, evaluate the model’s performance on validation and testing sets. The evaluation metric is denotation accuracy, which is described in Section 5.
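
A minimal sketch of the fine-tuning step (step 3 above) is shown below. The actual training used the fairseq library [50]; this simplified version uses the Hugging Face BART-large checkpoint and a plain PyTorch loop, and the truncation length is illustrative.

```python
from torch.optim import Adam
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = Adam(model.parameters(), lr=3e-5)

def training_step(question: str, flat_table: str, answer: str) -> float:
    """One fine-tuning step: (question + flattened table) -> answer, cross-entropy loss."""
    inputs = tokenizer(question + " " + flat_table, return_tensors="pt",
                       truncation=True, max_length=1024)
    labels = tokenizer(answer, return_tensors="pt").input_ids
    outputs = model(**inputs, labels=labels)  # loss is cross-entropy over the target tokens
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```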

4.2.3 Question-answering prompting.

Prompt tuning, the process of fine-tuning a pre-trained language model (LM) on a specific domain using relevant prompts or examples, has been widely recognized for its efficacy in domain-specific tasks [40]. The literature proposes various strategies to enhance the performance of LMs in specific domains through prompt tuning. In our approach, we forgo the use of complex templates for mapping predictions from fine-tuned models to verbalizers [51]. Instead, we harness the capabilities of a more powerful general language model, specifically an Artificial General Intelligence (AGI), to backtrack original questions from the provided answer template. We assert that the answers generated by the tabular question-answering system should have a strong correlation with the information of specific cells. As illustrated in Fig 5, we employ GPT-3.5-turbo [52] via LangChain (https://github.com/hwchase17/langchain) to generate question-answering templates. These templates incorporate constraints such as question type, device location, temporal features, or spatial features. In this system, we propose integrating a two-step chain-of-thought (CoT) prompting design into our metadata question-answering system. CoT prompting, a method that has shown substantial potential in enhancing the reasoning capabilities of large language models, is particularly effective in tasks requiring complex, multi-hop reasoning [53]. In our context, the CoT design involves planning the analysis goal and identifying relevant metadata terminology. This process provides a sequence of thought demonstrations, or ‘chains’, that guide the language model in its reasoning process. Each ‘thought’ corresponds to a specific piece of metadata, and the model processes these ‘thoughts’ sequentially, effectively decomposing the complex reasoning task into manageable steps. For instance, given the first example in Table 3, the first CoT step identifies the data product id (e.g., “IOOS-STATION-WMO-42896”) based on the cell information sampled from the metadata table’s miscellaneous content. The second step applies the generated indicator field (e.g., “{data product}”) to the LLM chain to support the previously fine-tuned question-answering system. These pre-defined LLM chains can be applied iteratively with a looping condition to enhance the quality of the question with a set of temperatures, also known as the diversity parameter.
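
The following is a minimal sketch of the two-step chain-of-thought template generation described above; call_llm is a hypothetical helper standing in for the GPT-3.5-turbo call made through LangChain, and the prompt wording is illustrative.

```python
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical helper that sends a prompt to the chat model and returns the completion."""
    raise NotImplementedError("wire this to a chat completion API")

def generate_question(flat_table: str, answer_cell: str, temperature: float = 0.7) -> str:
    # Step 1: identify the data product the answer cell belongs to (first 'thought')
    step1 = (
        "Given this flattened metadata table:\n"
        f"{flat_table}\n"
        f"Identify the data product id that the following cell belongs to: {answer_cell}\n"
        "Answer with the id only."
    )
    data_product_id = call_llm(step1, temperature)

    # Step 2: backtrack a natural language question whose answer is the given cell,
    # constrained by the identified {data product} indicator field
    step2 = (
        f"Data product: {data_product_id}\n"
        f"Answer: {answer_cell}\n"
        "Write one clear question about this data product whose answer is exactly the "
        "text above. Include its spatial or temporal context if relevant."
    )
    return call_llm(step2, temperature)
```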

Fig 5. The demonstration of the two-step chain-of-thought prompting design used to generate question-answering templates.

https://doi.org/10.1371/journal.pone.0293034.g005

Table 3. The example questions and answers for our model on experimental datasets.

https://doi.org/10.1371/journal.pone.0293034.t003

5. Experiments and results

5.1 Experiments

The implementation of this project was carried out on the FASTER cluster, a high-performance computing resource provided by Texas A&M University and funded by the National Science Foundation. The training process utilized NVIDIA A100 GPUs and lasted approximately 40 hours, encompassing up to 20,000 steps. A batch size of 512 was employed, accompanied by an initial learning rate of 3e-5, which decayed polynomially. Given the substantial amount of metadata involved, a maximum tokenization length of 4,000 was defined, with any exceeding tokenized text being truncated. During the training phase, a warm-up stage was incorporated to help the model avoid suboptimal solutions early in the process. This warm-up stage allowed for gradual increases in the learning rate, enabling the model to explore the solution space more effectively and achieve better convergence within a shorter time frame. In our experiments, the warm-up stage persisted for 5,000 steps. To reduce the frequency of parameter updates, gradient accumulation was employed every 12 epochs. This approach reduced the computational burden by reducing the number of updates made to the model’s parameters. Furthermore, evaluation metrics were computed every 1,000 epochs using the equation specified in the subsequent section. The evaluation procedures were conducted through ten independent experiments, ensuring the robustness and reliability of the results. A fixed list of random seeds ranging from 42 to 52 was used, enhancing the reproducibility of the findings. The average evaluation derived from these experiments is presented later in this section, providing a comprehensive summary of the overall performance.
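
For illustration, the schedule described above (warm-up followed by polynomial decay, with gradient accumulation) could be wired as in the sketch below, reusing the model from the earlier fine-tuning sketch; train_loader and the wiring itself are assumptions, not the actual fairseq training script.

```python
from torch.optim import Adam
from transformers import get_polynomial_decay_schedule_with_warmup

LEARNING_RATE = 3e-5     # initial learning rate from the experiments above
WARMUP_STEPS = 5_000     # warm-up stage: learning rate ramps up gradually
TOTAL_STEPS = 20_000     # training encompassed up to 20,000 steps
ACCUMULATION = 12        # gradient accumulation interval

optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

for step, batch in enumerate(train_loader):   # train_loader yields tokenized batches (assumed)
    loss = model(**batch).loss / ACCUMULATION
    loss.backward()
    if (step + 1) % ACCUMULATION == 0:        # accumulate gradients before each parameter update
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```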

5.2 Evaluation

5.2.1 Denotation accuracy.

Denotation accuracy in question-answering language models refers to the ability of the model to correctly identify and output the correct answer to a given question. It is typically measured by comparing the model’s output to the reference answer and calculating the proportion of questions for which the model’s output exactly matches the answer:

\mathrm{Acc} = \frac{c}{Q} \quad (3)

where c denotes the number of correct answers and Q denotes the total number of questions. As shown in Table 4, the denotation accuracy is compared with a recent state-of-the-art table question-answering model trained using fine-tuning with prior knowledge. The true answers are sourced from the WikiTQA benchmark dataset.
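
A minimal sketch of Eq (3), assuming answers are compared as normalized strings:

```python
def denotation_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions whose prediction exactly matches the reference (Eq 3)."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

# Exact matching is strict: a reordered but correct answer counts as wrong
print(denotation_accuracy(["94.82W, 29.31N"], ["29.31N, 94.82W"]))  # 0.0
```

This strictness motivates the soft measure introduced next.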

5.2.2 Soft denotation accuracy.

Metadata table question-answering usually generates more complex answers, which may contain a detailed date-time, latitude, and longitude. For example, the answer “29.31N, 94.82W” would be considered wrong compared with the reference “94.82W, 29.31N” under exact matching. Herzig et al. [16] found that a soft form of accuracy measurement yields an overall accuracy that evaluates the key-information-retrieval ability of the question-answering model, as it relaxes the strict equality to a task-specific similarity score:

P = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathrm{sim}(x_i, \hat{x}_j) \quad (4)

where P denotes the precision score, x denotes the predicted text, and \hat{x} denotes the reference text from the dataset. The precision score is commonly used in evaluating text generation tasks [16, 54]: each token in x is greedily matched to the most similar token in \hat{x} to compute the maximum precision. Similarly, the recall score, which gives the percentage of reference tokens well predicted by the model, can be expressed as:

R = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathrm{sim}(x_i, \hat{x}_j) \quad (5)

and the F1 score, which combines precision and recall through their harmonic mean, is expressed as:

F_1 = \frac{2 P R}{P + R} \quad (6)

As shown in Fig 6a and 6b, the two tokenized sentences are compared according to the accuracy score. The overall accuracy in Fig 6a and 6b is 0.9879 and 1.000, respectively.
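
A minimal sketch of Eqs (4)-(6), using a simple character-level token similarity in place of the embedding-based similarity used by BERTScore-style metrics [54]; the similarity function is illustrative.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Illustrative token similarity; BERTScore-style metrics use embeddings instead."""
    return SequenceMatcher(None, a, b).ratio()

def soft_scores(predicted: str, reference: str) -> tuple[float, float, float]:
    """Greedy-matching precision (Eq 4), recall (Eq 5), and F1 (Eq 6) over tokens."""
    x, x_hat = predicted.split(), reference.split()
    precision = sum(max(sim(xi, xj) for xj in x_hat) for xi in x) / len(x)
    recall = sum(max(sim(xj, xi) for xi in x) for xj in x_hat) / len(x_hat)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Reordered latitude/longitude tokens now receive high soft scores
print(soft_scores("94.82W, 29.31N", "29.31N, 94.82W"))
```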

Fig 6. The demonstration of comparing the accuracy between the predicted candidate and the actual reference.

(a) The comparison between “New York City is in east region” and “New York City is in northeast region”. (b) The comparison between “New York City is in northeast region” and “New York City is in northeast region”.

https://doi.org/10.1371/journal.pone.0293034.g006

5.3 Results and discussion

5.3.1 Evaluation metrics comparison.

Tables 5 and 6 indicate that our fine-tuned model outperforms several similar existing models [11, 16, 17] in metadata question answering on the Metadata Question Answering datasets. Specifically, our model achieved an F1 score of 0.8911, a precision score of 0.9034, and a recall score of 0.8891, surpassing the baselines of the other methods. The evaluation was carried out on an unseen test set that was randomly sampled from the original Metadata Question Answering dataset. Moreover, we also report experiments on the Wikipedia Table Question Answering dataset.

Table 5. Average soft accuracy on metadata question answering datasets.

https://doi.org/10.1371/journal.pone.0293034.t005

Table 6. Average soft accuracy on Wikipedia table question answering datasets.

https://doi.org/10.1371/journal.pone.0293034.t006

5.3.2 Example results.

Table 3 shows an example of our model’s output in a standard free-form natural language format, which enables users to ask questions related to the desired metadata table and receive relevant answers. For instance, in the first example, the user expects to retrieve the name of a data product that contains technical specifications. Our model retrieves the information presented in the ‘summary’ columns and returns the name of the data product from the ‘data product title’ column. In the third example, the user expects a result with location information. Our model organizes the information found in ‘geospatial latitude maximum’, ‘geospatial longitude maximum’, ‘geospatial latitude minimum’, etc., to provide a brief latitude and longitude answer. The fourth example showcases our model’s ability to find answers from metadata pertaining to a specific storm event, as long as the relevant records are present in the metadata of the data center.

5.3.3 Intuitive comparison.

By sampling a few typical question-answering sets from the Metadata Question Answering dataset, Table 7 provides an intuitive comparison between our model and OmniTab [17]. The inference result is directly generated from a stringified metadata table and question, where inference refers to the process of using a trained model to generate predictions for new input data by passing the input through the model and using its learned parameters. The comparison results demonstrate that our model handles spatial-temporal question-answering tasks better than OmniTab, which achieved the second-best performance score in Table 7. This is because OmniTab can only retrieve information directly from the cells, whereas our MetaQA model can extract information and generate natural language answers for abstract questions. While both models perform well when the answer requires only retrieving latitude and longitude information directly from cells, MetaQA outperforms OmniTab for complex question settings that require information extraction and natural language generation. Additionally, the first example in Table 7 illustrates that MetaQA can generate a simpler answer format than OmniTab, which only concatenates the raw text from the ‘time coverage start’ and ‘time coverage end’ cells.

Table 7. The comparison between MetaQA and recent table question answering models.

https://doi.org/10.1371/journal.pone.0293034.t007

5.3.4 Inference efficiency.

In addition to accuracy, the efficiency of the inference process is a crucial metric in the application of large language models. Poor performance in this regard could severely limit the practical usability of the model in production environments. As part of our evaluation process, we recorded the inference speed of our developed model every 1000 epochs. Fig 7a shows that the runtime for each evaluation was consistently around 2.88 seconds, with a maximum of 3.609 seconds and a minimum of 2.695 seconds. Meanwhile, Fig 7b reveals that our model was able to process approximately 65 samples per second during the evaluation steps. This translates to a cost of less than 0.02 seconds for each question-answering task. Overall, our developed model demonstrated high efficiency in inference while maintaining impressive accuracy, making it a practical and reliable choice for various applications.

Fig 7. The efficiency metric recorded during the evaluation steps.

(a) The runtime in seconds for every evaluation step. (b) The number of processed samples per second for every evaluation step.

https://doi.org/10.1371/journal.pone.0293034.g007

5.3.5 Limitations.

One major limitation of our approach is its inability to handle large metadata databases. While our flattening technique works well for relatively small tables, its performance deteriorates as the table size increases. This is because our adopted tokenizer has a maximum tokenization length, and even increasing the limit would not enable us to load the entire metadata into memory at once. Another limitation is the specificity of our approach to a particular type of metadata stored in structured databases. The performance and applicability of our method may vary when applied to different datasets (e.g., WikiTQA). To address these limitations, we plan to explore compression techniques that can reduce the size of the metadata table without sacrificing any detailed information. Additionally, we aim to develop strategies for handling different types of datasets, including those with unstructured or semi-structured formats. Furthermore, some rows and columns, such as ‘geospatial latitude units’, are rarely used and are more akin to scale values. We believe that creating a set of scale values a priori can help reduce the size of the metadata table. In future research, we plan to develop an evaluation method that can balance the influence of adopting AGI models in the end-to-end framework. By doing so, we can ensure that our approach remains reliable and robust for a wide range of applications.

6. Conclusion

In this project, we developed a MetaQA model to enable user-friendly spatiotemporal data search and sharing using a large language model. The paper identified the drawbacks of traditional geosearch methods that rely on users’ keyword-based search, such as lack of knowledge and descriptions of data, and complex metadata queries (e.g., tabular metadata format). It addressed them by designing an efficient end-to-end QA system. The development of the MetaQA system involves the use of NLP techniques to create a structured and machine-readable representation of datasets and their associated metadata. This can help researchers identify datasets that are most relevant to their research questions and improve the efficiency of the discovery process. The findings of this research demonstrate the effectiveness of the MetaQA model in outperforming other question-answering models, which provides a more efficient and effective way to access and analyze the large amount of data generated by scientific facilities. Overall, the development of this MetaQA system represents an important contribution to the field of GIScience, as it has the potential to improve the efficiency and effectiveness of interdisciplinary research by enabling scientists to discover and access datasets of interest across domain boundaries.

References

  1. Jochens AE, Watson SM. The Gulf of Mexico Coastal Ocean Observing System: An integrated approach to building an operational regional observing system. Marine Technology Society Journal. 2013;47(1).
  2. Zhang W, Byna S, Niu C, Chen Y. Exploring metadata search essentials for scientific data management. In: 2019 IEEE 26th international conference on high performance computing, data, and analytics (HiPC). IEEE; 2019. p. 83–92.
  3. Pinoli P, Ceri S, Martinenghi D, Nanni L. Metadata management for scientific databases. Information Systems. 2019;81:1–20.
  4. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data. 2016;3(1):1–9. pmid:26978244
  5. Howard J, Ruder S. Universal language model fine-tuning for text classification. arXiv preprint arXiv:180106146. 2018.
  6. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.
  7. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:191013461. 2019.
  8. Nan L, Hsieh C, Mao Z, Lin XV, Verma N, Zhang R, et al. FeTaQA: free-form table question answering. Transactions of the Association for Computational Linguistics. 2022;10:35–49.
  9. Khalid MA, Jijkoun V, de Rijke M. Machine learning for question answering from tabular data. In: 18th International Workshop on Database and Expert Systems Applications (DEXA 2007). IEEE; 2007. p. 392–396.
  10. Chen W, Zha H, Chen Z, Xiong W, Wang H, Wang W. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:200407347. 2020.
  11. Gupta P, Gupta V. A survey of text question answering techniques. International Journal of Computer Applications. 2012;53(4).
  12. Wang P, Shi T, Reddy CK. Text-to-SQL generation for question answering on electronic medical records. In: Proceedings of The Web Conference 2020; 2020. p. 350–361.
  13. Zhong V, Xiong C, Socher R. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:170900103. 2017.
  14. Shi T, Zhao C, Boyd-Graber J, Daumé III H, Lee L. On the potential of lexico-logical alignments for semantic parsing to SQL queries. arXiv preprint arXiv:201011246. 2020.
  15. Liu Q, Chen B, Guo J, Ziyadi M, Lin Z, Chen W, et al. Tapex: Table pre-training via learning a neural sql executor. arXiv preprint arXiv:210707653. 2021.
  16. Herzig J, Nowak PK, Müller T, Piccinno F, Eisenschlos JM. TaPas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:200402349. 2020.
  17. Jiang Z, Mao Y, He P, Neubig G, Chen W. OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2022.
  18. Tang Y, Han H, Yu X, Zhao J, Liu G, Wei L. An intelligent question answering system based on power knowledge graph. In: 2021 IEEE Power & Energy Society General Meeting (PESGM). IEEE; 2021. p. 01–05.
  19. De Carlo F, Gürsoy D, Marone F, Rivers M, Parkinson DY, Khan F, et al. Scientific data exchange: a schema for HDF5-based storage of raw and analyzed data. Journal of synchrotron radiation. 2014;21(6):1224–1230. pmid:25343788
  20. Fayyad U, Haussler D, Stolorz P. Mining scientific data. Communications of the ACM. 1996;39(11):51–57.
  21. Zhang Z, Demšar U, Rantala J, Virrantaus K. A fuzzy multiple-attribute decision-making modelling for vulnerability analysis on the basis of population information for disaster management. International Journal of Geographical Information Science. 2014;28(9):1922–1939.
  22. Seppänen H, Luokkala P, Zhang Z, Torkki P, Virrantaus K. Critical infrastructure vulnerability—A method for identifying the infrastructure service failure interdependencies. International Journal of Critical Infrastructure Protection. 2018;22:25–38.
  23. Li D, Chaudhary H, Zhang Z. Modeling spatiotemporal pattern of depressive symptoms caused by COVID-19 using social media data mining. International Journal of Environmental Research and Public Health. 2020;17(14):4988. pmid:32664388
  24. Zhang Z, Li D, Zhang Z, Duffield N. A time-series clustering algorithm for analyzing the changes of mobility pattern caused by COVID-19. In: Proceedings of the 1st ACM SIGSPATIAL International Workshop On Animal Movement Ecology And Human Mobility; 2021. p. 13–17.
  25. Geetha R, Sumathi N, Sathiyabama S, Tiruchengode TT. A survey of spatial, temporal and spatio-temporal data mining. Journal of Computer Applications. 2008;1(4):31–33.
  26. Atluri G, Karpatne A, Kumar V. Spatio-temporal data mining: A survey of problems and methods. ACM Computing Surveys (CSUR). 2018;51(4):1–41.
  27. Alizadeh B, Li D, Hillin J, Meyer MA, Thompson CM, Zhang Z, et al. Human-centered flood mapping and intelligent routing through augmenting flood gauge data with crowdsourced street photos. Advanced Engineering Informatics. 2022;54:101730.
  28. Li D, Zhang Z. Urban computing cyberinfrastructure: visualizing human sentiment using social media and augmented reality. In: Proceedings of the 4th ACM SIGSPATIAL International Workshop on Advances in Resilient and Intelligent Cities; 2021. p. 27–31.
  29. Yao X. Research issues in spatio-temporal data mining. In: Workshop on geospatial visualization and knowledge discovery, University Consortium for Geographic Information Science, Virginia. vol. 1. Citeseer; 2003.
  30. Pallickara SL, Pallickara S, Zupanski M, Sullivan S. Efficient metadata generation to enable interactive data discovery over large-scale scientific data collections. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science. IEEE; 2010. p. 573–580.
  31. Elnozahy WA, El Khayat GA, Cheniti-Belcadhi L, Said B. Question Answering System to Support University Students’ Orientation, Recruitment and Retention. Procedia Computer Science. 2019;164:56–63.
  32. Brill E, Lin J, Banko M, Dumais ST, Ng AY, et al. Data-Intensive Question Answering. In: TREC. vol. 56; 2001. p. 90.
  33. Goodwin TR, Harabagiu SM. Medical question answering for clinical decision support. In: Proceedings of the 25th ACM international on conference on information and knowledge management; 2016. p. 297–306.
  34. Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:190310676. 2019.
  35. Jin N, Siebert J, Li D, Chen Q. A survey on table question answering: Recent advances. In: Knowledge Graph and Semantic Computing: Knowledge Graph Empowers the Digital Economy: 7th China Conference, CCKS 2022, Qinhuangdao, China, August 24–27, 2022, Revised Selected Papers. Springer; 2022. p. 174–186.
  36. Pasupat P, Liang P. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:150800305. 2015.
  37. Budzianowski P, Vulić I. Hello, it’s GPT-2–how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:190705774. 2019.
  38. Chintagunta B, Katariya N, Amatriain X, Kannan A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In: Machine Learning for Healthcare Conference. PMLR; 2021. p. 354–372.
  39. Liu X, Ji K, Fu Y, Du Z, Yang Z, Tang J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:211007602. 2021.
  40. Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:210408691. 2021.
  41. Reppert J, Rachbach B, George C, Byun LSJ, Appleton M, Stuhlmüller A. Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes. arXiv preprint arXiv:230101751. 2023.
  42. Hosseini M, Rasmussen LM, Resnik DB. Using AI to write scholarly publications; 2023.
  43. Ding N, Hu S, Zhao W, Chen Y, Liu Z, Zheng HT, et al. Openprompt: An open-source framework for prompt-learning. arXiv preprint arXiv:211101998. 2021.
  44. Gu Y, Han X, Liu Z, Huang M. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:210904332. 2021.
  45. Trummer I. CodexDB: Generating Code for Processing SQL Queries using GPT-3 Codex. arXiv preprint arXiv:220408941. 2022.
  46. Zhang Z, Wang Z, Li A, Ye X, Usery EL, Li D. An AI-based Spatial Knowledge Graph for Enhancing Spatial Data and Knowledge Search and Discovery. In: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Searching and Mining Large Collections of Geospatial Data; 2021. p. 13–17.
  47. Sarasaen C, Chatterjee S, Breitkopf M, Rose G, Nürnberger A, Speck O. Fine-tuning deep learning model parameters for improved super-resolution of dynamic MRI with prior-knowledge. Artificial Intelligence in Medicine. 2021;121:102196. pmid:34763811
  48. Lu P, Mishra S, Xia T, Qiu L, Chang KW, Zhu SC, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:220909513. 2022.
  49. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980. 2014.
  50. Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, et al. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:190401038. 2019.
  51. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys. 2023;55(9):1–35.
  52. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems. 2020;33:1877–1901.
  53. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824–24837.
  54. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:190409675. 2019.