REST API Reading with Nessy — How-To

This guide shows how to read REST APIs into Spark DataFrames using ReadAPIAction in Nessy. You'll learn the core options, pagination patterns we support (page_based, limit_offset), how to fan out requests from prior steps, and a few production-ready tips. Examples use the public Rick and Morty API.


What ReadAPIAction does

  • Makes one or many HTTP requests (optionally with pagination).
  • Runs requests in parallel (via Spark’s mapInPandas).
  • Returns a Spark DataFrame with one column: json_response (an array of page results for each request).
  • Each page entry includes the raw JSON (as a string) and rich request/response metadata.

Minimal pipeline

from pathlib import Path

from cloe_nessy.pipeline import PipelineParsingService

yaml_str = """
name: Read API (basic)
steps:
  Read Characters:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
"""

# parse() expects a file path (as in the end-to-end example below), so write
# the inline YAML to a file first.
Path("read_api_basic.yaml").write_text(yaml_str)

pipeline = PipelineParsingService.parse(path="./read_api_basic.yaml")
pipeline.run()

Key options (at a glance)

  • base_url (required): API base (trailing slash optional).
  • endpoint (required unless you use requests_from_context): path to call.
  • method (default GET), timeout, params, headers, data, json_body, default_headers.
  • auth: chainable auth providers (basic, env, secret_scope, azure_oauth).
  • pagination: only page_based and limit_offset are supported.
  • max_retries, backoff_factor, max_concurrent_requests.
  • key: optional JSON path to extract (e.g., results or data.items).
  • requests_from_context: build many requests from an upstream DataFrame.

Pagination (supported strategies)

We currently support page-based and limit/offset. Both strategies accept these shared/advanced fields (a conceptual sketch of the stopping logic follows the list):

  • check_field: dotted path to a list/field used to test for “any data”.
  • next_page_field: dotted path that signals “has next” (true/URL/non-empty).
  • max_page: hard cap (-1 / None = all).
  • pages_per_array_limit: split results into sub-arrays per N pages (useful for chunking large crawls).
  • preliminary_probe: pre-scan the API to determine all page params up front (then fully parallelize).
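
Conceptually, the stop condition driven by check_field and next_page_field works like the sketch below. This is illustrative only, not Nessy's actual implementation; the assumption is that dotted paths are resolved against each page's decoded JSON.

from typing import Any, Optional

def resolve(payload: dict, dotted_path: str) -> Any:
    """Walk a dotted path like 'info.next' through nested dicts."""
    value: Any = payload
    for part in dotted_path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

def has_more_pages(payload: dict, check_field: Optional[str], next_page_field: Optional[str]) -> bool:
    """Decide whether another page should be requested."""
    if next_page_field is not None:
        # True, a URL, or any non-empty value means "there is a next page".
        return bool(resolve(payload, next_page_field))
    if check_field is not None:
        # Fall back to "keep going while the data list is non-empty".
        return bool(resolve(payload, check_field))
    return False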

1) Page-based (Rick & Morty)

The API exposes a page query parameter and returns

  • the data under results
  • an info.next pointer

name: Read API with page-based pagination
steps:
  Read Characters:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      method: GET
      timeout: 900
      max_retries: 5
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page           # required
        check_field: results       # list to check for emptiness
        next_page_field: info.next # truthy => more pages
        pages_per_array_limit: 1   # one page per array entry (nice for filtering)

Optionally, add a filter step to drop empty responses:

  Handle Empty Response:
    action: TRANSFORM_FILTER
    options:
      condition: size(json_response.response) > 0

What requests are made? GET …/character?page=1, ?page=2, … until info.next is empty or results is empty.

2) Limit/offset (generic example)

If your API paginates with a limit/offset scheme:

name: Read API with limit/offset pagination
steps:
  Read Products:
    action: READ_API
    options:
      base_url: https://api.example.com
      endpoint: products
      params:
        limit: 50
        offset: 0
      pagination:
        strategy: limit_offset
        limit_field: limit        # required
        offset_field: offset      # required
        check_field: data.items   # optional (where the list lives)
        next_page_field: page_info.has_next  # optional, if available we trust it
        max_page: -1
        pages_per_array_limit: -1

Requests will look like:

?limit=50&offset=0 → &offset=50 → &offset=100 → …
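
That is, the offset advances by limit on every request. A tiny sketch of the sequence (hypothetical helper, not part of Nessy):

def offset_params(limit, start=0):
    """Yield successive {'limit': ..., 'offset': ...} request params."""
    offset = start
    while True:
        yield {"limit": limit, "offset": offset}
        offset += limit

gen = offset_params(50)
print(next(gen))  # {'limit': 50, 'offset': 0}
print(next(gen))  # {'limit': 50, 'offset': 50}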

Pre-scan pages for full parallelism

Set preliminary_probe: true to first probe the API for the last page and then fan out all requests at once (handy with large datasets and higher max_concurrent_requests).
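
To see why this helps, here is a rough standalone sketch of the probe-then-fan-out idea using plain requests (illustrative only; Nessy handles this internally). The Rick and Morty API reports the total page count under info.pages:

from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://rickandmortyapi.com/api/character"

# Probe: one request tells us how many pages exist.
first = requests.get(BASE, params={"page": 1}, timeout=30).json()
total_pages = first["info"]["pages"]

def fetch(page):
    return requests.get(BASE, params={"page": page}, timeout=30).json()

# With the page count known up front, all requests can be issued in parallel.
with ThreadPoolExecutor(max_workers=16) as pool:
    pages = list(pool.map(fetch, range(1, total_pages + 1)))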

Rick & Morty example:

name: Read API with probe
steps:
  Read Characters (probe):
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
        preliminary_probe: true
      max_concurrent_requests: 16

Extract a nested list via key

If you only want a nested list (e.g., results) instead of the whole JSON:

name: Read API (extract list)
steps:
  Read Characters (results only):
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
      key: results

Now json_response[*].response contains just the serialized results array for each page.
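
Downstream, you can decode those strings with Spark. A minimal sketch, assuming the step's output DataFrame is available as df and using a hand-written schema for a few character fields (field names per the Rick and Morty API):

from pyspark.sql import functions as F

schema = "array<struct<id:int, name:string, status:string, species:string>>"

characters = (
    df.select(F.explode("json_response").alias("page"))            # one row per page
      .select(F.from_json("page.response", schema).alias("chars")) # decode the results array
      .select(F.explode("chars").alias("c"))                       # one row per character
      .select("c.*")
)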


Authentication (optional, chainable)

Combine multiple auth sources; headers from env/secret_scope are merged with bearer/basic flows:

name: Read API (auth)
steps:
  Read Secure Endpoint:
    action: READ_API
    options:
      base_url: https://secure.example.com/api
      endpoint: v1/resources
      auth:
        - type: basic
          username: ${USER}
          password: ${PASSWORD}
        - type: env
          header_template:
            "X-API-Key": "<ENV_API_KEY>"
        - type: secret_scope
          secret_scope: my_scope
          header_template:
            "X-ORG-Token": "<ORG_TOKEN>"
        - type: azure_oauth
          client_id: <client-id>
          client_secret: <client-secret>
          tenant_id: <tenant-id>
          scope: <entra-app-id-uri/.default>

Tip: Don’t hardcode secrets in YAML; prefer secret scopes or environment variables.


Drive requests from prior steps (requests_from_context)

If an upstream step produces a DataFrame with columns endpoint, params, headers, data, json_body, you can fan out heterogeneous calls:

name: Read API (dynamic)
steps:
  Build Requests:
    action: TRANSFORM_SQL
    options:
      # produce a DataFrame with the required columns
      # e.g., SELECT 'character' AS endpoint, map('page','1') AS params, null AS headers, null AS data, null AS json_body

  Read Many:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      requests_from_context: true
      method: GET
      timeout: 45
      # optional pagination applied to every request:
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
        preliminary_probe: true

When requests_from_context: true, the top-level endpoint is optional; each distinct row drives a request (or a set of pages when pagination is active).
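
For reference, a sketch of building such a DataFrame by hand (the column types here are assumptions about the contract; spark is an active SparkSession):

rows = [
    ("character", {"page": "1"}, None, None, None),
    ("location",  {"page": "1"}, None, None, None),
    ("episode",   {"page": "1"}, None, None, None),
]
requests_df = spark.createDataFrame(
    rows,
    "endpoint string, params map<string,string>, headers map<string,string>, "
    "data string, json_body string",
)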


Retries, backoff, concurrency

options:
  max_retries: 3         # retry on 429/503/504 or connection errors
  backoff_factor: 2      # exponential backoff (capped)
  max_concurrent_requests: 16
  timeout: 60
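
As a rule of thumb, exponential backoff waits roughly backoff_factor * 2**attempt seconds between retries (a sketch; the exact formula and cap Nessy applies may differ):

backoff_factor = 2
for attempt in range(3):  # max_retries: 3
    delay = backoff_factor * (2 ** attempt)
    print(f"attempt {attempt + 1}: wait {delay}s before retrying")  # 2s, 4s, 8s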

Output shape

ReadAPIAction returns a DataFrame:

  • Column: json_response (ArrayType of structs); each struct contains:

    • response (string): the JSON payload (or the extracted key)

    • __metadata (struct):

      • timestamp, base_url, url, status_code, reason, elapsed
      • endpoint, query_parameters (as a string map)

With pages_per_array_limit: 1, each array element corresponds to a single page, which makes downstream filtering easy (e.g., drop empty response).
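
A minimal inspection sketch, assuming the step's output is available as df: explode the array, pull per-page metadata for auditing, and keep only pages with a meaningful payload:

from pyspark.sql import functions as F

pages = df.select(F.explode("json_response").alias("page"))

audit = pages.select(
    F.col("page.__metadata.url").alias("url"),
    F.col("page.__metadata.status_code").alias("status_code"),
    F.length("page.response").alias("payload_length"),
)

# Keep pages whose payload is more than an empty JSON array/object ("[]"/"{}").
non_empty = pages.where(F.col("page.response").isNotNull() & (F.length("page.response") > 2))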


End-to-end Rick & Morty example (copy-paste)

YAML (read_api_with_pagination.yaml)

name: Read API with Pagination
steps:
  Read API:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      method: GET
      timeout: 900
      max_retries: 10
      params:
        page: 1
      pagination:
        strategy: page_based
        check_field: results
        next_page_field: info.next
        page_field: page
        pages_per_array_limit: 1

  Handle Empty Response:
    action: TRANSFORM_FILTER
    options:
      condition: size(json_response.response) > 0

Python

from cloe_nessy.pipeline import PipelineParsingService

pipeline = PipelineParsingService.parse(path="./read_api_with_pagination.yaml")
pipeline.run()

Troubleshooting

  • “A value for base_url must be supplied” — set options.base_url.
  • “endpoint must be supplied” — set options.endpoint or use requests_from_context: true with a valid upstream DataFrame.
  • Pagination validation:
    • strategy: page_based → page_field is required.
    • strategy: limit_offset → limit_field and offset_field are required.
  • Empty pages — set check_field to where the list lives (e.g., results) and optionally add TRANSFORM_FILTER to drop empties.
  • Slow crawls — enable preliminary_probe: true and increase max_concurrent_requests (be considerate of rate limits).
  • Rate limits (429) — set max_retries and backoff_factor.

That’s it! With the samples above you can fetch paginated data, chunk it to taste, and keep your pipelines secret-safe and parallel.