Because we crawl a large number of URLs each day, and in order to be a good netizen, we fully respect robots.txt directives.
We currently identify ourselves with one of two User-Agent strings, depending on whether JavaScript rendering is enabled:

ExtractBot/1.0 (+http://extractbot.com/docs/crawler) [text-only]

Mozilla/5.0 (compatible; ExtractBot 1.0; +http://extractbot.com/docs/crawler) [JavaScript enabled]
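For example, a site owner who wants to restrict our crawler can target it in robots.txt. This sketch assumes the bot matches on the ExtractBot user-agent token implied by the strings above:

```
# Block ExtractBot from a private section while allowing the rest of the site
User-agent: ExtractBot
Disallow: /private/

# Leave all other crawlers unrestricted
User-agent: *
Disallow:
```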
The JavaScript-enabled crawler runs a modified version of PhantomJS 2.0 to render documents post-crawl.
You can change the User-Agent string we identify as during your crawls; however, we will continue to adhere to robots.txt directives for both our own UA and your newly supplied UA (in that order).
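As a rough sketch, a crawl request that supplies a custom User-Agent might look like the following. The endpoint URL and the token and userAgent parameter names are illustrative assumptions, not the documented API:

```python
import requests

# Hypothetical endpoint and parameter names, shown for illustration only.
resp = requests.get(
    "http://extractbot.com/api/crawl",  # assumed endpoint
    params={
        "token": "YOUR_API_TOKEN",      # assumed auth parameter
        "url": "https://example.com/page",
        # Assumed override parameter; robots.txt is still checked for
        # both this UA and ExtractBot's own UA, in that order.
        "userAgent": "MyCompanyBot/2.1 (+https://example.com/bot)",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```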
We will not disrespect robots.txt directives under any circumstances. If you need us to parse a page that is restricted to our crawler, you must fetch the HTML yourself and pass it to us in the html parameter of your request.
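A minimal sketch of that workflow follows. Only the html parameter comes from the documentation above; the endpoint URL and token parameter are illustrative assumptions:

```python
import requests

# Fetch the restricted page yourself, under a UA the site permits you to use.
page = requests.get("https://example.com/restricted-page", timeout=30)
page.raise_for_status()

# Pass the raw HTML to the API via the html parameter.
resp = requests.post(
    "http://extractbot.com/api/extract",  # assumed endpoint
    data={
        "token": "YOUR_API_TOKEN",        # assumed auth parameter
        "url": "https://example.com/restricted-page",
        "html": page.text,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```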