poster

Learning to bridge colloquial and formal language applied to linking and search of E-Commerce data

Authors:
Ivan Vulić, KU Leuven, Heverlee, Belgium
Susana Zoghbi, KU Leuven, Heverlee, Belgium
Marie-Francine Moens, KU Leuven, Heverlee, Belgium

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, July 2014, Pages 1195–1198. https://doi.org/10.1145/2600428.2609543

Published: 03 July 2014

ABSTRACT

We study the problem of linking information between different idiomatic usages of the same language, for example, colloquial and formal language. We propose a novel probabilistic topic model called multi-idiomatic LDA (MiLDA). Its modeling principles follow the intuition that certain words are shared between two idioms of the same language, while other words are non-shared, that is, idiom-specific. We demonstrate the ability of our model to learn relations between cross-idiomatic topics in a dataset containing product descriptions and reviews. We intrinsically evaluate our model by the perplexity measure. Following that, as an extrinsic evaluation, we present the utility of the new MiLDA topic model in a recently proposed IR task of linking Pinterest pins (given in colloquial English on the users' side) to online webshops (given in formal English on the retailers' side). We show that our multi-idiomatic model outperforms the standard monolingual LDA model and the pure bilingual LDA model both in terms of perplexity and MAP scores in the IR task.
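
The two evaluation measures named in the abstract, perplexity and mean average precision (MAP), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy probabilities, and the toy rankings are assumptions made for illustration. In practice the per-token probabilities would come from a trained topic model such as MiLDA, and the rankings from the pin-to-webshop retrieval task.

```python
import math

def perplexity(token_probs):
    """Perplexity of held-out text, given the model probability of each token."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical toy data: a model that assigns 0.25 to every held-out token,
# and two queries with hand-made rankings.
toy_probs = [0.25, 0.25, 0.25, 0.25]
toy_runs = [
    (["shop_a", "shop_b", "shop_c"], {"shop_a", "shop_c"}),
    (["shop_x", "shop_y"], {"shop_y"}),
]
print(perplexity(toy_probs))
print(mean_average_precision(toy_runs))
```

Lower perplexity indicates a better fit to held-out text, while higher MAP indicates that relevant items are ranked closer to the top.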

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
July 2014
1330 pages
ISBN: 9781450322577
DOI: 10.1145/2600428
General Chairs: Shlomo Geva (Queensland University of Technology), Andrew Trotman (University of Dunedin)
Program Chairs: Peter Bruza (Queensland University of Technology), Charles L.A. Clarke (University of Waterloo), Kal Järvelin (University of Tampere)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
personalized linking
recommendation systems
topic models
unstructured data
user interests
user-generated data
Qualifiers
- poster
Acceptance Rates
SIGIR '14 Paper Acceptance Rate: 82 of 387 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%