The API has a single endpoint (/) and accepts a POST request with a JSON-encoded data
parameter containing the following:
Parameter | Type | Description |
---|---|---|
auth | JSON | An encoded JSON string containing a key parameter and, optionally, a secret parameter. |
steps | JSON | An encoded JSON string containing the steps required to parse the HTML. The contents of steps are explained in more detail below. |
url | string (optional) | The full url that you want to parse, e.g. http://money.cnn.com/data/markets/nasdaq/ |
html | string (optional) | The raw HTML that you want to parse. If a url and an html parameter are both present then the html takes precedence. |
user-agent | string (optional) | This will override our user-agent when you also supply a url param. Please note that we will adhere to robots.txt directives for our original UA and then your newly supplied UA (in that order). |
The request must contain either the url or the html parameter. If a url is supplied, an HTTP GET request will be made to that URL to obtain the HTML to parse. If you do not include a url parameter, you should instead supply the HTML itself in the html parameter.
Sending the HTML explicitly is ideal if, for example, you want to parse a page that is restricted by a robots.txt exclusion and that we cannot access directly.
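The parameter rules above can be sketched in Python as follows. This is a minimal illustration, not an official client: `API_URL` is a hypothetical placeholder, the helper names `build_payload` and `build_request` are invented for this example, and the sketch assumes that the JSON document is sent as a form field named `data` with auth and steps nested as plain JSON values.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder; replace with the real endpoint of the service.
API_URL = "https://api.example.com/"


def build_payload(api_key, steps, url=None, html=None, user_agent=None, secret=None):
    """Assemble the request parameters listed in the table above."""
    if url is None and html is None:
        raise ValueError("supply either url or html")
    auth = {"key": api_key}
    if secret is not None:
        auth["secret"] = secret  # optional, per the auth description
    payload = {"auth": auth, "steps": steps}
    if url is not None:
        payload["url"] = url
    if html is not None:
        payload["html"] = html  # takes precedence over url when both are present
    if user_agent is not None:
        payload["user-agent"] = user_agent
    return payload


def build_request(payload, api_url=API_URL):
    """Wrap the payload in a form-encoded POST (assumes the field is named 'data')."""
    body = urlencode({"data": json.dumps(payload)}).encode("utf-8")
    return Request(api_url, data=body, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send the request; the sketch stops short of that so it can be inspected without network access.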
{ "auth": {"key": "your-api-key"}, "steps": [{ "name": "your-step-identifier", "robot": "xpath", "value": "//h2" }], "url": "http://www.extractbot.com/" }
{ "your-step-identifier": { "robot": "xpath", "value": "//h2", "result": [ { "html": "<h2 style=\"color:#989d91\">Meet our robots</h2>", "inner_html": "Meet our robots", "text": "Meet our robots", "attr": { "style": "color:#989d91" } }, { "html": "<h2 style=\"color:#989d91\">Content extraction</h2>", "inner_html": "Content extraction", "text": "Content extraction", "attr": { "style": "color:#989d91" } }, { "html": "<h2 style=\"color:#989d91\">Super fast API</h2>", "inner_html": "Super fast API", "text": "Super fast API", "attr": { "style": "color:#989d91" } }, { "html": "<h2 style=\"color:#0f0e0d\">Pricing</h2>", "inner_html": "Pricing", "text": "Pricing", "attr": { "style": "color:#0f0e0d" } }, { "html": "<h2 style=\"color:#989d91\">Try it out!</h2>", "inner_html": "Try it out!", "text": "Try it out!", "attr": { "style": "color:#989d91" } } ] } }
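As an illustration of how a response shaped like the one above might be consumed, the following Python sketch collects the text of every match for a named step. The helper name `texts_for_step` is invented for this example, and `sample` is a shortened copy of the response shown above.

```python
def texts_for_step(response, step_name):
    """Return the 'text' field of every result recorded for one named step."""
    step = response.get(step_name, {})
    return [item["text"] for item in step.get("result", [])]


# Shortened copy of the example response, already parsed from JSON.
sample = {
    "your-step-identifier": {
        "robot": "xpath",
        "value": "//h2",
        "result": [
            {
                "html": '<h2 style="color:#989d91">Meet our robots</h2>',
                "inner_html": "Meet our robots",
                "text": "Meet our robots",
                "attr": {"style": "color:#989d91"},
            },
            {
                "html": '<h2 style="color:#0f0e0d">Pricing</h2>',
                "inner_html": "Pricing",
                "text": "Pricing",
                "attr": {"style": "color:#0f0e0d"},
            },
        ],
    }
}

print(texts_for_step(sample, "your-step-identifier"))
# ['Meet our robots', 'Pricing']
```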
The use parameter lets you chain robots into assemblies. use is a JSON pointer that extracts a string from each result of the parent step and feeds it into the child step as input. For example, to take the attr->style string and feed it into a regex robot, you would use the following template:
{ "auth": {"key": "your-api-key"}, "url": "http://www.extractbot.com/", "steps": [{ "name": "your-step-identifier", "robot": "xpath", "value": "//h2", "steps": [{ "use": "/attr/style", "name": "your-sub-step-identifier", "robot": "regex", "value": ".*(#.*?)$" }] }] }
{ "your-step-identifier": { "robot": "xpath", "value": "//h2", "result": [ { "html": "<h2 style=\"color:#989d91\">Meet our robots</h2>", "inner_html": "Meet our robots", "text": "Meet our robots", "attr": { "style": "color:#989d91" }, "next": { "your-sub-step-identifier": { "robot": "regex", "value": ".*(#.*?)$", "result": [ { "0": "color:#989d91", "1": "#989d91" } ] } } }, ... ] } }
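To make the chaining concrete, the Python sketch below reproduces locally what the template above describes: it resolves the use pointer against one parent result, applies the regex to the extracted string, and separately reads the nested next results out of a parsed response. It is an approximation under stated assumptions. The helper names are invented for this example, the pointer syntax is assumed to follow RFC 6901, and Python's `re` module may not behave identically to the regex engine used by the service.

```python
import re


def resolve_pointer(document, pointer):
    """Resolve a JSON pointer such as '/attr/style' against parsed JSON."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON pointer must be empty or start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        token = token.replace("~1", "/").replace("~0", "~")  # RFC 6901 escapes
        current = current[int(token)] if isinstance(current, list) else current[token]
    return current


def run_regex_child(parent_result, use, pattern):
    """Local stand-in for a regex child step fed by the 'use' pointer."""
    match = re.search(pattern, resolve_pointer(parent_result, use))
    if match is None:
        return []
    groups = (match.group(0),) + match.groups()
    return [{str(index): group for index, group in enumerate(groups)}]


def child_results(response, step_name, child_name):
    """Collect the results of a child step from every parent result's 'next'."""
    collected = []
    for item in response[step_name]["result"]:
        child = item.get("next", {}).get(child_name, {})
        collected.extend(child.get("result", []))
    return collected


parent = {"text": "Meet our robots", "attr": {"style": "color:#989d91"}}
print(run_regex_child(parent, "/attr/style", r".*(#.*?)$"))
# [{'0': 'color:#989d91', '1': '#989d91'}]
```

Calling `child_results(response, "your-step-identifier", "your-sub-step-identifier")` on a parsed response like the one above would return the regex matches for all parent results as one flat list.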
Output can become extremely verbose, especially when dealing with large numbers of results and large chained assemblies. For this reason you can choose to map output, explained in detail here.