DOI: 10.1145/2756406.2756923
research-article

Demystifying the Semantics of Relevant Objects in Scholarly Collections: A Probabilistic Approach

Published: 21 June 2015

ABSTRACT

Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since information search in this setting is mostly entity-centric. Previous work has recognized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names, to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions remain to be answered: Given a scientific discipline, what are the important entities? How can they be automatically identified? Are all of them really relevant, i.e., do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level, we must treat topic-specific entities as first-class citizens and deeply integrate their semantics into the search process. To support this, we propose a novel probabilistic approach that not only provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and provides insights into the use of our approach in different scenarios. Finally, we show how our results can benefit information providers.

      • Published in

        JCDL '15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
        June 2015
        324 pages
        ISBN: 9781450335942
        DOI: 10.1145/2756406
        • General Chairs: Paul Logasa Bogen, Suzie Allard, Holly Mercer, Micah Beck
        • Program Chairs: Sally Jo Cunningham, Dion Goh, Geneva Henry

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        JCDL '15 paper acceptance rate: 18 of 60 submissions (30%). Overall acceptance rate: 415 of 1,482 submissions (28%).
