Published February 22, 2020 | Version v1
Dataset Restricted

MOIED: Magi Open Information Extraction Dataset

Description

Magi Open Information Extraction Dataset (MOIED) is a Chinese Open IE dataset containing 7,618,181 records extracted from plain text across 3,319,763 webpages in various domains. Each record consists of a (subject, predicate, object) tuple, an associated confidence score, and context information. The dataset comprises 1,427,742 distinct facts involving 272,522 entities and 117,731 predicates.
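
As a rough illustration of the record structure described above, the sketch below models a single record in Python. The field names (`subject`, `predicate`, `object`, `confidence`, `url`, `context`) are assumptions made for illustration only; this description does not specify the actual file format or schema of MOIED, and the example values are invented placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    """One hypothetical MOIED record; field names are illustrative, not the real schema."""
    subject: str
    predicate: str
    object: str
    confidence: float  # associated confidence score (scale is not stated in the description)
    url: str           # webpage where the mention was found
    context: str       # context information for the mention

    def fact(self) -> Tuple[str, str, str]:
        """Return the (subject, predicate, object) tuple that identifies the fact."""
        return (self.subject, self.predicate, self.object)


# Invented placeholder values, not taken from the dataset.
example = Record(
    subject="Magi",
    predicate="is",
    object="extraction engine",
    confidence=0.9,
    url="https://example.com/page",
    context="Magi is an extraction engine.",
)
```

Separating the `fact()` tuple from the per-mention fields reflects the distinction the description draws between distinct facts and the individual records that mention them.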

A notable property of MOIED is that each distinct fact has multiple records with URLs referring to mentions in diverse contexts, which enables multiple-instance learning (MIL) and other correlative approaches.
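
One way to prepare such data for multiple-instance learning is to group records into one "bag" per distinct fact, where each bag holds the mentions of that fact. The following is a minimal sketch under assumed inputs: the row layout `(subject, predicate, object, url, confidence)` and the helper name `build_bags` are hypothetical, and the sample rows are invented placeholders rather than dataset content.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Fact = Tuple[str, str, str]   # (subject, predicate, object)
Mention = Tuple[str, float]   # (url, confidence); hypothetical fields


def build_bags(
    rows: Iterable[Tuple[str, str, str, str, float]]
) -> Dict[Fact, List[Mention]]:
    """Group (subject, predicate, object, url, confidence) rows into one bag per fact."""
    bags: Dict[Fact, List[Mention]] = defaultdict(list)
    for subj, pred, obj, url, conf in rows:
        bags[(subj, pred, obj)].append((url, conf))
    return dict(bags)


# Invented placeholder rows, not taken from the dataset.
rows = [
    ("A", "p", "B", "https://example.com/1", 0.9),
    ("A", "p", "B", "https://example.org/2", 0.7),
    ("C", "q", "D", "https://example.com/3", 0.8),
]
bags = build_bags(rows)

# Average bag size implied by the published counts:
# 7,618,181 records over 1,427,742 distinct facts.
avg_records_per_fact = 7_618_181 / 1_427_742
```

With the published counts, `avg_records_per_fact` comes out to roughly 5.3 records per distinct fact, which gives a sense of the typical bag size available to a MIL model.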

MOIED is a paragraph-level Open IE dataset: at least 45.1% of its records can only be extracted by synthesizing information from multiple sentences.

Magi is an extraction engine that continuously learns from the Internet. It combines cross-referencing, timeline analysis, and other heuristics to mitigate the inevitable false positives in its extractions. All records in MOIED were randomly sampled from a database dump of magi.com made in January 2020. To provide more reliable evaluation results, human annotators examined the dataset and selected 19,161 verified records for the dev and test sets.

Disclaimers

  1. The dataset is expected to be used in weakly supervised scenarios since the records in the training set are not human-annotated and could be imprecise or erroneous.
  2. Records are not guaranteed to be universally correct. The correctness of extractions should be evaluated based on contexts (specified by the URLs).
  3. Each extraction was made at the time Magi visited the URL, so there is no guarantee that the URL is still accessible or that the content has remained unmodified since the extraction.
  4. Due to legal and regulatory issues, the webpage URLs are mostly ones accessible from Mainland China. Nevertheless, the content of certain webpages, as well as the extraction results, could violate the laws and regulations of certain countries or regions.

Notes

This dataset contains content from the Internet. For copyright reasons, please do not redistribute it or use it for non-research purposes.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

Please state your name, your institution, and how this dataset could help your research. Thanks!