research-article

Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes

Authors:
Ali Mesbah, University of British Columbia
Arie van Deursen, Delft University of Technology
Stefan Lenselink, Delft University of Technology

ACM Transactions on the Web, Volume 6, Issue 1, Article No. 3, pp. 1–30. https://doi.org/10.1145/2109205.2109208

Published: 01 March 2012


Abstract

Using JavaScript and dynamic DOM manipulation on the client side is becoming a widespread approach for achieving rich interactivity and responsiveness in modern Web applications. At the same time, such techniques, collectively known as Ajax, shatter the concept of webpages with unique URLs, on which traditional Web crawlers are based. This article describes a novel technique for crawling Ajax-based applications through automatic dynamic analysis of user-interface-state changes in Web browsers. Our algorithm scans the DOM tree, spots candidate elements that are capable of changing the state, fires events on those candidate elements, and incrementally infers a state machine that models the various navigational paths and states within an Ajax application. This inferred model can be used, for instance, in program comprehension, in analysis and testing of dynamic Web states, or for generating a static version of the application. In this article, we discuss our sequential and concurrent Ajax crawling algorithms. We present our open-source tool called Crawljax, which implements the concepts and algorithms discussed in this article. Additionally, we report several empirical studies in which we apply our approach to open-source and industrial Web applications and elaborate on the obtained results.
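The loop outlined in the abstract (scan the DOM, spot candidate elements, fire events, infer a state machine) can be illustrated with a minimal Python sketch. This is not Crawljax's API or the authors' implementation. The `ToyBrowser` class, its made-up four-state application, and the `replay` and `crawl` helpers are all hypothetical names introduced for illustration. The sketch also makes simplifying assumptions that the abstract does not state: two states count as equal only when their DOM strings are identical, and the crawler returns to an earlier state by reloading and replaying the recorded event path.

```python
from collections import deque


class ToyBrowser:
    """Hypothetical stand-in for a real browser session (not Crawljax's API)."""

    # Made-up single-URL app: DOM snapshot -> {clickable id: DOM after the click}
    APP = {
        "<div id='home'>": {"news": "<div id='news'>", "about": "<div id='about'>"},
        "<div id='news'>": {"home": "<div id='home'>", "more": "<div id='more'>"},
        "<div id='about'>": {"home": "<div id='home'>"},
        "<div id='more'>": {"news": "<div id='news'>"},
    }
    INITIAL = "<div id='home'>"

    def __init__(self):
        self.dom = self.INITIAL

    def reload(self):
        """Return to the initial state, as reloading the page would."""
        self.dom = self.INITIAL

    def candidates(self):
        """Scan the current DOM for elements that may change the state."""
        return sorted(self.APP[self.dom])

    def fire(self, element):
        """Fire a click event on a candidate element and update the DOM."""
        self.dom = self.APP[self.dom][element]


def replay(browser, path):
    """Reload the app and re-fire a recorded event path to restore a state."""
    browser.reload()
    for event in path:
        browser.fire(event)


def crawl(browser):
    """Incrementally infer a state machine; returns (states, edges)."""
    browser.reload()
    states = {browser.dom: 0}   # DOM snapshot -> state id
    paths = {0: []}             # state id -> events leading to it from the start
    edges = set()               # (source id, fired element, target id)
    queue = deque([0])
    while queue:
        source = queue.popleft()
        replay(browser, paths[source])
        for element in browser.candidates():
            replay(browser, paths[source])   # start each click from the source state
            browser.fire(element)
            if browser.dom not in states:    # assumption: exact DOM match = same state
                target = len(states)
                states[browser.dom] = target
                paths[target] = paths[source] + [element]
                queue.append(target)
            edges.add((source, element, states[browser.dom]))
    return states, edges


states, edges = crawl(ToyBrowser())
for source, element, target in sorted(edges):
    print(f"state {source} --click {element}--> state {target}")
```

Although the toy application has a single entry point and no per-state URLs, the crawler still recovers all four states and the transitions between them. A real crawler would drive an actual browser and would likely need a more tolerant state comparison than exact string equality.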



Published in

ACM Transactions on the Web Volume 6, Issue 1
March 2012
109 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2109205

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2012
- Accepted: 1 August 2011
- Revised: 1 June 2011
- Received: 1 December 2010
Author Tags
Ajax
Crawling
DOM crawling
Web 2.0
dynamic analysis
hidden web
Qualifiers
- research-article
- Research
- Refereed