Because we crawl a large number of URLs each day, and in order to be a good netizen, we fully respect robots.txt directives.
We currently identify ourselves with one of two User-Agent strings, depending on whether JavaScript rendering is enabled:

ExtractBot/1.0 (+http://extractbot.com/docs/crawler) [text-only]

Mozilla/5.0 (compatible; ExtractBot 1.0; +http://extractbot.com/docs/crawler) [JavaScript enabled]
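For example, a site owner who wants to restrict our crawler can target it in robots.txt. This sketch assumes the bot matches on the ExtractBot user-agent token implied by the strings above:

```
# Block ExtractBot from a private section while allowing the rest of the site
User-agent: ExtractBot
Disallow: /private/

# Leave all other crawlers unrestricted
User-agent: *
Disallow:
```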
The JavaScript-enabled crawler runs a modified version of PhantomJS 2.0 to render documents post-crawl.
You can change the User-Agent string we identify as during your crawls; however, we will continue to adhere to robots.txt directives for both our own UA and your newly supplied UA (in that order).
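As a rough sketch, a crawl request that supplies a custom User-Agent might look like the following. The endpoint URL and the token and userAgent parameter names are illustrative assumptions, not the documented API:

```python
import requests

# Hypothetical endpoint and parameter names, shown for illustration only.
resp = requests.get(
    "http://extractbot.com/api/crawl",  # assumed endpoint
    params={
        "token": "YOUR_API_TOKEN",      # assumed auth parameter
        "url": "https://example.com/page",
        # Assumed override parameter; robots.txt is still checked for
        # both this UA and ExtractBot's own UA, in that order.
        "userAgent": "MyCompanyBot/2.1 (+https://example.com/bot)",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```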
We will not disrespect robots.txt directives under any circumstances. If you need us to parse a page that is restricted to our crawler, you must fetch the HTML yourself and pass it to us in the html parameter of your request.
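A minimal sketch of that workflow follows. Only the html parameter comes from the documentation above; the endpoint URL and token parameter are illustrative assumptions:

```python
import requests

# Fetch the restricted page yourself, under a UA the site permits you to use.
page = requests.get("https://example.com/restricted-page", timeout=30)
page.raise_for_status()

# Pass the raw HTML to the API via the html parameter.
resp = requests.post(
    "http://extractbot.com/api/extract",  # assumed endpoint
    data={
        "token": "YOUR_API_TOKEN",        # assumed auth parameter
        "url": "https://example.com/restricted-page",
        "html": page.text,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```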