ABSTRACT
Charts are commonly used for data visualization. Generating a chart usually involves data transformations, including data pre-processing and aggregation. These tasks can be cumbersome and time-consuming, even for experienced data scientists. Reproducing an existing chart can also be challenging when information about the data transformations behind it is no longer available.
In this paper, we tackle the problem of recovering data transformations from existing charts. Given an input table and a chart, our goal is to automatically recover the data transformation program underlying the chart. We divide our approach into four steps: (1) data extraction, (2) candidate generation, (3) candidate ranking, and (4) candidate disambiguation. We implemented our approach in a tool called UnchartIt and evaluated it on a set of 50 benchmarks from Kaggle. Experimental results show that UnchartIt ranks the correct data transformation among the top-10 programs in 92% of the benchmarks. To disambiguate the top-ranking programs, we use a new interactive procedure, which resolves 98% of the ambiguous benchmarks while asking the user fewer than two questions on average.
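To make the problem statement concrete, the following is a minimal sketch of the candidate-generation idea only, and not UnchartIt's actual implementation. It assumes a hypothetical program space of the form "group by one column, aggregate another", and it assumes the chart's data points have already been extracted into a dictionary. All names (`AGGREGATIONS`, `run_program`, `consistent_programs`) and the toy tables are illustrative.

```python
from collections import defaultdict
from itertools import product

# Hypothetical program space: group by one column, aggregate another.
AGGREGATIONS = {
    "sum": sum,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
}


def run_program(table, group_col, value_col, agg_name):
    """Group rows by group_col and aggregate value_col within each group."""
    groups = defaultdict(list)
    for row in table:
        groups[row[group_col]].append(row[value_col])
    return {key: AGGREGATIONS[agg_name](vals) for key, vals in groups.items()}


def matches(result, chart_points, tol=1e-6):
    """Check that a program's output reproduces the extracted chart data."""
    if result.keys() != chart_points.keys():
        return False
    return all(abs(result[k] - chart_points[k]) <= tol for k in chart_points)


def consistent_programs(table, chart_points, group_cols, value_cols):
    """Enumerate every candidate program whose output matches the chart."""
    found = []
    for g, v, a in product(group_cols, value_cols, AGGREGATIONS):
        if matches(run_program(table, g, v, a), chart_points):
            found.append((g, v, a))
    return found


# Toy input table and chart data (assumed already extracted from the image).
table = [
    {"region": "north", "sales": 10.0},
    {"region": "north", "sales": 30.0},
    {"region": "south", "sales": 20.0},
    {"region": "south", "sales": 20.0},
]
chart_points = {"north": 40.0, "south": 40.0}
programs = consistent_programs(table, chart_points, ["region"], ["sales"])

# With one row per group, sum, max, and mean all agree: the chart is ambiguous.
small_table = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 20.0},
]
small_chart = {"north": 10.0, "south": 20.0}
ambiguous = consistent_programs(small_table, small_chart, ["region"], ["sales"])
```

In the first case only the sum-by-region program reproduces the chart. In the second, several programs are consistent with the same chart, which is the kind of situation that motivates the ranking and disambiguation steps.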
Index Terms
- UnchartIt: an interactive framework for program recovery from charts