An Introduction to Intrusive Web Crawling
Web spiders typically operate by crawling the targeted website and extracting static links from its pages. While this approach can reveal the forms (e.g. GET or POST forms) that lead to pages only accessible through form submissions, most web crawlers fall short of a fundamental feature: being able to submit those forms and then crawl the pages returned after the submission.
For this reason, I have written a relatively small crawler in Python that can do exactly that. The project has been dubbed "Jick", and it is available here on GitHub.
I refer to this process as "intelligent" web crawling: the crawler intelligently generates form data, submits the form, and then continues crawling the HTML that comes back.
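To make the idea concrete, here is a minimal sketch of such a crawl-and-submit loop, written with the third-party requests and BeautifulSoup libraries. It is not Jick's actual code, and the generate_form_data() helper it relies on is an assumption of mine, sketched a little further down.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_pages=50):
    # Pages already seen and pages still waiting to be fetched.
    seen = {start_url}
    queue = [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")

        # Ordinary crawling: follow the static links found in the page.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)

        # The "intelligent" part: fill in and submit every form, then
        # continue crawling the HTML that the submission returns.
        for form in soup.find_all("form"):
            action = urljoin(url, form.get("action") or url)
            method = (form.get("method") or "get").lower()
            # generate_form_data() is sketched further below.
            data = generate_form_data(form)
            if method == "post":
                result = requests.post(action, data=data, timeout=10)
            else:
                result = requests.get(action, params=data, timeout=10)
            returned = BeautifulSoup(result.text, "html.parser")
            for a in returned.find_all("a", href=True):
                link = urljoin(result.url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return seen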
The form-submission data is generated from a combination of attributes in the HTML tags. For example, placeholder and maxlength attribute values are used to build a form parameter that is most likely to be accepted by the given form. Different <input> types are taken into account as well; for instance, <input type='email'> will cause the crawler to always generate a string of text that could be a valid email address.
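Below is a rough illustration of how such attribute-driven value generation could look. The specific rules (the fallback length, the example email domain, and so on) are assumptions made for this sketch, not necessarily the exact heuristics Jick uses.

import random
import string


def generate_value(input_tag):
    # Derive a plausible value from the tag's own attributes.
    input_type = (input_tag.get("type") or "text").lower()
    placeholder = input_tag.get("placeholder")
    maxlength = int(input_tag.get("maxlength") or 12)

    if input_type == "email":
        # Always produce something that could pass as a valid address.
        return "user%d@example.com" % random.randint(1, 9999)
    if input_type == "number":
        return str(random.randint(1, 100))
    if placeholder:
        # The placeholder usually hints at the expected format.
        return placeholder[:maxlength]
    # Fall back to a short alphabetic string that respects maxlength.
    return "".join(random.choices(string.ascii_lowercase, k=min(8, maxlength)))


def generate_form_data(form):
    # Build a dict of name -> value for every named field in the form.
    data = {}
    for input_tag in form.find_all(["input", "textarea"]):
        name = input_tag.get("name")
        if name:
            data[name] = generate_value(input_tag)
    return data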
It is essential that the form-submission data is generated in a calculated manner, based on the information contained in the HTML tags and attributes. If this is not done properly, the submitted data is effectively random nonsense, and effective crawling depends on the server-side scripts receiving valid data in the GET/POST parameters. For example, if a server-side script expects one of the parameters to be a zip code but the client sends it a string of random letters and numbers, the returned HTML may be far less useful for furthering the crawl than it would have been had proper values been submitted.
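Building on the previous sketch, the field names themselves can also serve as hints, which is one plausible way to handle cases like the zip-code parameter above. The NAME_HINTS table and value_from_name() helper below are hypothetical examples of such name-based rules, not Jick's actual implementation.

import random

NAME_HINTS = {
    "zip": lambda: "%05d" % random.randint(0, 99999),      # plausible US zip code
    "phone": lambda: "555-%04d" % random.randint(0, 9999),
    "email": lambda: "user%d@example.com" % random.randint(1, 9999),
    "year": lambda: str(random.randint(1970, 2020)),
}


def value_from_name(field_name):
    # Return a plausible value if the field name matches a known hint,
    # otherwise None so the caller falls back to attribute-based generation.
    lowered = field_name.lower()
    for hint, make_value in NAME_HINTS.items():
        if hint in lowered:
            return make_value()
    return None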
One could also call this "intrusive" web crawling, as it involves submitting GET or POST forms in ways that may wreak havoc on the crawled website. For example, if there is a POST form for subscribing to a newsletter, this web spider (Jick), and any other spider with the same form-submitting capability, may submit the form and thereby register an invalid newsletter subscription.
For this reason, this is not an ethical way to crawl a website. However, anyone who wants to perform a truly robust penetration test or security audit of a web application cannot afford to limit their approach by moral principles that a black-hat cracker will not be bound by.
In any case, the lack of the ability to generate and submit form data has been a shortcoming of most web crawlers, and I hope this nifty tool will help establish the feature as a basic standard in all future web-spidering software.
We plan to write many articles here on securityandpentesting.org, and a number of them will use this tool as a basis for discussing further topics.