Request & response

The API has a single endpoint (/) and accepts a POST request with a JSON-encoded data parameter containing the following:

Parameter Type Description
auth JSON An encoded JSON string containing a key parameter and, optionally, a secret parameter.
steps JSON An encoded JSON string containing the steps required to parse the HTML. The contents of steps are explained in more detail below.
url string (optional) The full URL that you want to parse, e.g. http://money.cnn.com/data/markets/nasdaq/
html string (optional) The raw HTML that you want to parse. If both a url and an html parameter are present, the html takes precedence.
user-agent string (optional) Overrides our user-agent when you also supply a url parameter. Please note that we will adhere to robots.txt directives for our original UA and then your newly supplied UA (in that order).

The request must contain either the url or the html parameter. If a url is supplied, an HTTP GET request is made to that URL to obtain the HTML to parse. If you do not include a url parameter, supply the HTML itself in the html parameter.

Sending the HTML explicitly is ideal if, for example, you want to parse a page that is restricted by a robots.txt exclusion, so we cannot access it directly.
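As a rough illustration of the rules above, the following Python sketch builds (but does not send) such a POST request. The endpoint host, the build_request helper and the choice to send data as a form-encoded field are assumptions made for this example, not details confirmed by this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real host is not given in this section.
API_ENDPOINT = "https://api.example.com/"


def build_request(api_key, steps, url=None, html=None, user_agent=None):
    """Build (but do not send) a POST request carrying a JSON `data` field."""
    # The request must contain either url or html.
    if url is None and html is None:
        raise ValueError("supply either url or html")
    payload = {"auth": {"key": api_key}, "steps": steps}
    if url is not None:
        payload["url"] = url
    if html is not None:
        payload["html"] = html
    if user_agent is not None:
        payload["user-agent"] = user_agent
    # Assumption: `data` is a form-encoded field holding the JSON text.
    body = urlencode({"data": json.dumps(payload)}).encode("utf-8")
    return Request(API_ENDPOINT, data=body, method="POST")


request = build_request(
    "your-api-key",
    [{"name": "your-step-identifier", "robot": "xpath", "value": "//h2"}],
    url="http://www.extractbot.com/",
)
```

The request object could then be passed to urllib.request.urlopen once the real endpoint is known. To send raw HTML instead, pass html and omit url.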

Example request

{
    "auth": {"key": "your-api-key"},
    "steps": [{
        "name":  "your-step-identifier",
        "robot": "xpath",
        "value": "//h2"
    }],
    "url": "http://www.extractbot.com/"
}

Example response

{
  "your-step-identifier": {
    "robot": "xpath",
    "value": "//h2",
    "result": [
      {
        "html": "<h2 style=\"color:#989d91\">Meet our robots</h2>",
        "inner_html": "Meet our robots",
        "text": "Meet our robots",
        "attr": {
          "style": "color:#989d91"
        }
      },
      {
        "html": "<h2 style=\"color:#989d91\">Content extraction</h2>",
        "inner_html": "Content extraction",
        "text": "Content extraction",
        "attr": {
          "style": "color:#989d91"
        }
      },
      {
        "html": "<h2 style=\"color:#989d91\">Super fast API</h2>",
        "inner_html": "Super fast API",
        "text": "Super fast API",
        "attr": {
          "style": "color:#989d91"
        }
      },
      {
        "html": "<h2 style=\"color:#0f0e0d\">Pricing</h2>",
        "inner_html": "Pricing",
        "text": "Pricing",
        "attr": {
          "style": "color:#0f0e0d"
        }
      },
      {
        "html": "<h2 style=\"color:#989d91\">Try it out!</h2>",
        "inner_html": "Try it out!",
        "text": "Try it out!",
        "attr": {
          "style": "color:#989d91"
        }
      }
    ]
  }
}
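To show how a response of this shape might be consumed, here is a small Python sketch that pulls the text of every match out of a step. The step_texts helper is a hypothetical name, and the response dict is an abbreviated copy of the example above (two of the five results).

```python
# Abbreviated copy of the example response (two of the five results).
response = {
    "your-step-identifier": {
        "robot": "xpath",
        "value": "//h2",
        "result": [
            {
                "html": '<h2 style="color:#989d91">Meet our robots</h2>',
                "inner_html": "Meet our robots",
                "text": "Meet our robots",
                "attr": {"style": "color:#989d91"},
            },
            {
                "html": '<h2 style="color:#0f0e0d">Pricing</h2>',
                "inner_html": "Pricing",
                "text": "Pricing",
                "attr": {"style": "color:#0f0e0d"},
            },
        ],
    }
}


def step_texts(response, step_name):
    """Return the text of every match produced by the named step."""
    return [item["text"] for item in response[step_name]["result"]]


headings = step_texts(response, "your-step-identifier")
print(headings)  # ['Meet our robots', 'Pricing']
```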
 

Assemblies: chaining robots

The use parameter lets you chain robots into assemblies. use is a JSON pointer that selects a string from each result of the parent step and passes it to the child step as input. For example, to take the attr->style string and feed it into a regex robot, you would use the following template:

Request

{
    "auth": {"key": "your-api-key"},
    "url": "http://www.extractbot.com/",
    "steps": [{
        "name": "your-step-identifier",
        "robot": "xpath",
        "value": "//h2",
        "steps": [{
            "use": "/attr/style",
            "name": "your-sub-step-identifier",
            "robot": "regex",
            "value": ".*(#.*?)$"
        }]
    }]
}
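To make the use pointer in the request above more concrete, the Python sketch below resolves "/attr/style" against a single parent result and applies the same regular expression locally. The resolve_pointer helper is a hypothetical, minimal JSON pointer (RFC 6901) resolver written for this example. Python's re module stands in for whatever regex engine the service uses, so behaviour may differ for other patterns.

```python
import re


def resolve_pointer(document, pointer):
    """Resolve a JSON pointer such as "/attr/style" against parsed JSON."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON pointer must be empty or start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" means "/", then "~0" means "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# One result item as the parent xpath step might return it.
parent_result = {
    "text": "Meet our robots",
    "attr": {"style": "color:#989d91"},
}

child_input = resolve_pointer(parent_result, "/attr/style")
match = re.search(r".*(#.*?)$", child_input)

print(child_input)     # color:#989d91
print(match.group(1))  # #989d91
```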

Response

{
  "your-step-identifier": {
    "robot": "xpath",
    "value": "//h2",
    "result": [
      {
        "html": "<h2 style=\"color:#989d91\">Meet our robots</h2>",
        "inner_html": "Meet our robots",
        "text": "Meet our robots",
        "attr": {
          "style": "color:#989d91"
        },
        "next": {
          "your-sub-step-identifier": {
            "robot": "regex",
            "value": ".*(#.*?)$",
            "result": [
              {
                "0": "color:#989d91",
                "1": "#989d91"
              }
            ]
          }
        }
      },
      ...
    ]
  }
}
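As a sketch of how the nested next entries in such a response could be read, the Python snippet below pairs each parent match with the results of one sub-step. The collect_substep helper is a hypothetical name, and the response dict contains only the first result shown above, since the remaining results are elided.

```python
# Only the first result from the response above; the rest are elided there.
response = {
    "your-step-identifier": {
        "robot": "xpath",
        "value": "//h2",
        "result": [
            {
                "text": "Meet our robots",
                "attr": {"style": "color:#989d91"},
                "next": {
                    "your-sub-step-identifier": {
                        "robot": "regex",
                        "value": ".*(#.*?)$",
                        "result": [{"0": "color:#989d91", "1": "#989d91"}],
                    }
                },
            }
        ],
    }
}


def collect_substep(response, step_name, substep_name):
    """Pair each parent match's text with the results of one sub-step."""
    collected = []
    for item in response[step_name]["result"]:
        sub = item.get("next", {}).get(substep_name)
        if sub is not None:
            collected.append((item["text"], sub["result"]))
    return collected


pairs = collect_substep(
    response, "your-step-identifier", "your-sub-step-identifier"
)
# Key "1" holds the first capture group of the regex robot.
colours = [m["1"] for _, matches in pairs for m in matches]
print(colours)  # ['#989d91']
```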

Mapping output

Output can become extremely verbose, especially with large numbers of results and long chained assemblies. For this reason you can choose to map output, which is explained in detail here.