ABSTRACT
We study the problem of linking information between different idiomatic usages of the same language, for example, colloquial and formal language. We propose a novel probabilistic topic model called multi-idiomatic LDA (MiLDA). Its modeling principles follow the intuition that some words are shared between two idioms of the same language, while other words are non-shared, that is, idiom-specific. We demonstrate the ability of our model to learn relations between cross-idiomatic topics in a dataset containing product descriptions and reviews. As an intrinsic evaluation, we measure the perplexity of our model. As an extrinsic evaluation, we demonstrate the utility of the new MiLDA topic model in a recently proposed IR task: linking Pinterest pins (written in colloquial English on the users' side) to online webshops (written in formal English on the retailers' side). We show that our multi-idiomatic model outperforms both the standard monolingual LDA model and the pure bilingual LDA model, in terms of perplexity and of mean average precision (MAP) scores in the IR task.
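The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of perplexity and MAP and is not taken from the paper's own evaluation code; the function names, the input layout (per-token natural-log probabilities, ranked lists of item identifiers without duplicates) and the toy values are all assumptions made for illustration.

```python
import math
from typing import Hashable, Sequence, Set


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of held-out text from per-token natural-log probabilities.

    Lower is better: exp of the negative mean log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision of one ranked list against its set of relevant items.

    Assumes the ranking contains no duplicate items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Mean of the per-query average precision values."""
    if len(rankings) != len(relevant_sets) or not rankings:
        raise ValueError("need one relevant set per non-empty list of rankings")
    scores = [average_precision(r, s) for r, s in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries (e.g. pins) with ranked candidate shops.
toy_rankings = [["a", "b", "c"], ["x", "y"]]
toy_relevant = [{"a", "c"}, {"y"}]
toy_map = mean_average_precision(toy_rankings, toy_relevant)
toy_ppl = perplexity([math.log(0.25)] * 4)
```

In this reading, a model that assigns higher probability to held-out text has lower perplexity, while a retrieval model that ranks relevant webshops nearer the top has higher MAP.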
Index Terms
- Learning to bridge colloquial and formal language applied to linking and search of E-Commerce data