Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes

Published: 01 March 2012

Abstract

Using JavaScript and dynamic DOM manipulation on the client side is becoming a widespread approach for achieving rich interactivity and responsiveness in modern Web applications. At the same time, such techniques, collectively known as Ajax, shatter the concept of webpages with unique URLs, on which traditional Web crawlers are based. This article describes a novel technique for crawling Ajax-based applications through automatic dynamic analysis of user-interface-state changes in Web browsers. Our algorithm scans the DOM tree, spots candidate elements that are capable of changing the state, fires events on those candidate elements, and incrementally infers a state machine that models the various navigational paths and states within an Ajax application. This inferred model can be used, for instance, in program comprehension, in analysis and testing of dynamic Web states, or for generating a static version of the application. In this article, we discuss our sequential and concurrent Ajax crawling algorithms. We present our open source tool called Crawljax, which implements the concepts and algorithms discussed in this article. Additionally, we report several empirical studies in which we apply our approach to open-source and industrial Web applications and elaborate on the obtained results.
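The crawl loop outlined in the abstract (scan for candidate elements, fire events on them, compare the resulting DOM, grow a state machine) can be illustrated with a small Python sketch. This is not Crawljax code or its API. The names `FakeBrowser`, `APP`, `state_id`, `replay`, and `crawl` are hypothetical, the "browser" is a hard-coded lookup table rather than a real DOM, states are compared by hashing the whole DOM string, and the breadth-first order is an assumption made for brevity.

```python
from collections import deque
import hashlib

# Hypothetical toy application: DOM string -> {clickable element: resulting DOM}.
APP = {
    "<home>": {"news": "<news>", "about": "<about>", "logo": "<home>"},
    "<news>": {"home": "<home>", "more": "<news+more>"},
    "<news+more>": {"home": "<home>"},
    "<about>": {"home": "<home>"},
}


class FakeBrowser:
    """Stand-in for a real browser (assumption: no real DOM or JavaScript)."""

    def __init__(self, app, index):
        self.app = app
        self.index = index
        self.dom = index

    def reset(self):
        # Reload the index page; Ajax states have no URL of their own.
        self.dom = self.index

    def candidates(self):
        # Elements in the current DOM that may change the state when clicked.
        return sorted(self.app.get(self.dom, {}))

    def fire(self, element):
        # Simulate a click event and the DOM update it triggers.
        self.dom = self.app[self.dom][element]


def state_id(dom):
    """Identify a state by a hash of its DOM (a simplifying assumption)."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()[:8]


def replay(browser, path):
    """Return to a state by reloading and re-firing the events that led to it."""
    browser.reset()
    for element in path:
        browser.fire(element)


def crawl(browser, max_states=100):
    """Infer a state machine: returns (root, states, edges)."""
    browser.reset()
    root = state_id(browser.dom)
    paths = {root: []}  # state -> event path from the index page
    edges = set()       # (source state, fired element, target state)
    queue = deque([root])
    while queue:
        source = queue.popleft()
        replay(browser, paths[source])
        for element in browser.candidates():
            replay(browser, paths[source])
            browser.fire(element)
            target = state_id(browser.dom)
            if target == source:
                continue  # the event did not change the state
            if target not in paths:
                if len(paths) >= max_states:
                    continue  # stop adding states once the cap is reached
                paths[target] = paths[source] + [element]
                queue.append(target)
            edges.add((source, element, target))
    return root, set(paths), edges


if __name__ == "__main__":
    root, states, edges = crawl(FakeBrowser(APP, "<home>"))
    print(len(states), "states,", len(edges), "transitions")
```

Because an Ajax state cannot be revisited through a URL, the sketch returns to a state by replaying the recorded event path from the index page. A real crawler would also need a more tolerant DOM comparison than an exact hash, since small dynamic differences would otherwise register as new states.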

Published in

ACM Transactions on the Web, Volume 6, Issue 1
March 2012
109 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2109205

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 March 2012
• Accepted: 1 August 2011
• Revised: 1 June 2011
• Received: 1 December 2010

Qualifiers

• research-article
• Research
• Refereed
