Table Topic Models for Hidden Unit Estimation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9994)

Abstract

We propose a method for estimating the hidden units of numbers written in tables. Focusing on Wikipedia tables, we propose an algorithm that estimates which units are appropriate for a cell that contains a number but no unit words. The algorithm estimates such hidden units from surrounding context, such as a cell in the first row. To improve performance, we propose a table topic model that models tables and their surrounding sentences simultaneously.
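As a rough illustration of how surrounding context could be turned into features, the sketch below builds a binary indicator vector over context words for a number-only cell, taking the first-row cell of the same column as the context. The function names, the toy table, and the vocabulary are hypothetical and are not taken from the paper.

```python
import re

def first_row_context(table, col):
    """Collect lowercase word tokens from the first-row cell of a column."""
    header = table[0][col]
    return re.findall(r"[a-z]+", header.lower())

def binary_features(words, vocab):
    """Return a 0/1 vector with 1 in dimension i when vocab[i] occurs in the context."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

# Toy table (made-up values); the last cell of row 1 is a number-only cell.
table = [
    ["Name", "Population", "Area (km2)"],
    ["Example City", "12,345", "67.8"],
]
# Hypothetical vocabulary; the index in this list plays the role of the word ID.
vocab = ["area", "city", "km", "population"]

x = binary_features(first_row_context(table, col=2), vocab)
print(x)
```

A vector of this kind could then be passed to any standard classifier that predicts a unit label; which context cells to include is a design choice that this sketch does not settle.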

Notes

  1. The total number of tables found in the corpus was 255,039.

  2. We observed that 39.1% of randomly selected number cells were number-only cells (i.e., cells without any unit).

  3. Note that in this paper \(x_i\) is assumed to be a vector whose value is 1 in the i-th dimension, where i is the ID of each context word for the cell.

  4. If we see the number “1987”, we think of it as a number that indicates a year.

  5. For example, if the unit word is “yen”, the surrounding words are likely to contain the word “price”.

  6. In preliminary experiments, we observed that using all “same row” cells worsened the accuracy, so we do not use those cells.

  7. In our data set, 266 (93.7%) of the 284 tables that contain one or more hand-annotated cells were row-wise.

  8. It is inspired by the Polya-tree models for modeling continuous values.

  9. We also use some additional digits, for example for signs, but omit them here for the sake of simplicity.

  10. We currently set \(N=2\).

  11. We use some rules to parse the number string, so different expressions such as “95,300” are also handled.

  12. We divided the corpus in such a way that the cells from the same table are not included in the same subset. The accuracy is calculated by counting the correct and incorrect predictions over all cells, i.e., the accuracy is micro-averaged.

  13. Each Gibbs sampling run performed 500 iterations. The distribution of the topic IDs sampled in the final 200 iterations was used as the input features for the logistic regression (i.e., we added each topic ID observed for the column of each cell in the test data, with its relative frequency as a weight).

Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers JP15K00309, JP15K00425, JP15K16077.

Author information

Corresponding author

Correspondence to Minoru Yoshida.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Yoshida, M., Matsumoto, K., Kita, K. (2016). Table Topic Models for Hidden Unit Estimation. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_23

  • DOI: https://doi.org/10.1007/978-3-319-48051-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48050-3

  • Online ISBN: 978-3-319-48051-0

  • eBook Packages: Computer Science, Computer Science (R0)