research-article

Demystifying the Semantics of Relevant Objects in Scholarly Collections: A Probabilistic Approach

Authors:
Jose Maria Gonzalez Pinto, IFIS TU Braunschweig, Braunschweig, Germany
Wolf-Tilo Balke, IFIS TU Braunschweig, Braunschweig, Germany

JCDL '15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, June 2015, Pages 157–164. https://doi.org/10.1145/2756406.2756923

Published: 21 June 2015

ABSTRACT

Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since information search in this setting is mostly entity-centric. Previous work has recognized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names, to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions remain to be answered: Given a scientific discipline, what are the important entities? How can they be automatically identified? Are all of them really relevant, i.e., do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level, we must treat topic-specific entities as first-class citizens and deeply integrate their semantics into the search process. To support this, we propose a novel probabilistic approach that not only provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and offers insights into using our approach in different scenarios. Finally, we show how our results can benefit information providers.

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published in
JCDL '15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2015
324 pages
ISBN:9781450335942
DOI:10.1145/2756406
General Chairs:
Paul Logasa Bogen, Google, USA
Suzie Allard, School of Information Sciences, University of Tennessee, USA
Holly Mercer, University Libraries, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Program Chairs:
Sally Jo Cunningham, Waikato University, New Zealand
Dion Goh, Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore
Geneva Henry, University Libraries, The George Washington University, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
hidden knowledge
probabilistic topic models
scientific digital libraries
semantics entities
Acceptance Rates
JCDL '15 Paper Acceptance Rate: 18 of 60 submissions, 30%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%