Abstract
The Internet is growing exponentially, and its rising number of users and their demands have triggered an unprecedented wave of digital data of many kinds on the Web. Much of this data is relevant and can be turned into actionable insights, but its sheer volume and unstructured format make it difficult to handle, so it cannot meet the requirements of professionals and end users. In the context of the biodiversity domain, this paper proposes a conceptual data science approach to extract and structure data seamlessly, making sense of biodiversity-rich data and multiple-record documents while saving time and energy. The major drawback of manual extraction and storage of biodiversity data is that it introduces errors (such as spelling mistakes and skipped data fields) that are difficult to correct at the processing stage, so the resulting data cannot meet research demands. Such drawbacks can be addressed by applying a data science approach within the system, and this automated approach will be fast, flexible, reliable and accurate. The one thing that needs care in the extraction approach is regular monitoring and analysis of the Hypertext Markup Language (HTML) structure, documents, and links of the target sources. Such a large set of data contains many errors and noisy characters; to eliminate them, a data cleaning algorithm has been used to make the data error-free and ready for further systematic research. Because of the wide variety of data formats, achieving interoperability is a daunting task, since some datasets do not follow their own schema structure. To cope with this demand, semantic interoperability has proved helpful, exchanging data through web services between independent, loosely coupled systems.
This paper presents an overview of semantic interoperability and case studies of projects that implemented it for biodiversity data sharing.
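As a rough illustration of the extract-then-clean idea summarised above, the following Python sketch pulls records out of an HTML table and normalises noisy characters. It is not the algorithm used in the paper: the names `TableRowParser`, `clean_field`, `extract_records` and the `SAMPLE_HTML` records are all hypothetical, and the sketch assumes a simple, well-formed table whose first row holds the field names.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def clean_field(text):
    """Normalise Unicode, drop control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return re.sub(r"\s+", " ", text).strip()


def extract_records(html):
    """Return a list of (record, missing_fields) pairs, one per data row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header = [clean_field(cell) for cell in parser.rows[0]]
    results = []
    for row in parser.rows[1:]:
        cells = [clean_field(cell) for cell in row]
        cells += [""] * (len(header) - len(cells))  # pad short rows
        record = dict(zip(header, cells))
        missing = [name for name in header if not record[name]]
        results.append((record, missing))
    return results


# Hypothetical sample page; the records are made up for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Species</th><th>Family</th><th>Locality</th></tr>
  <tr><td>  Panthera\u00a0tigris </td><td>Felidae</td><td>Dehradun\u200b</td></tr>
  <tr><td>Shorea   robusta</td><td></td><td>Rajaji</td></tr>
</table>
"""

if __name__ == "__main__":
    for record, missing in extract_records(SAMPLE_HTML):
        print(record, "missing:", missing)
```

Reporting skipped fields alongside each record, rather than silently dropping the row, keeps the kind of error the abstract attributes to manual entry visible for later correction. Because the parser depends on the table layout, a change in the source page's HTML structure would require updating it, which is why regular monitoring of the target sources matters.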
Singh, P., Kumar, D. & Saran, S. Interoperable framework for improving data quality using semantic approach: use case on biodiversity. Environmental Sustainability 1, 367–381 (2018). https://doi.org/10.1007/s42398-018-00033-1
DOI: https://doi.org/10.1007/s42398-018-00033-1