# REST API Reading with Nessy — How-To
This guide shows how to read REST APIs into Spark DataFrames using
`ReadAPIAction` in Nessy. You'll learn the core options, the pagination
patterns we support (`page_based`, `limit_offset`), how to fan out requests
from prior steps, and a few production-ready tips. Examples use the public
Rick and Morty API.
## What ReadAPIAction does

- Makes one or many HTTP requests (optionally with pagination).
- Runs requests in parallel (via Spark's `mapInPandas`).
- Returns a Spark DataFrame with one column: `json_response` (an array of page results for each request).
- Each page entry includes the raw JSON (as a string) and rich request/response metadata.
## Minimal pipeline

```python
from cloe_nessy.pipeline import PipelineParsingService

yaml_str = """
name: Read API (basic)
steps:
  Read Characters:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
"""

pipeline = PipelineParsingService.parse(path=yaml_str)
pipeline.run()
```
## Key options (at a glance)

- `base_url` (required): API base (trailing slash optional).
- `endpoint` (required unless you use `requests_from_context`): path to call.
- `method` (default `GET`), `timeout`, `params`, `headers`, `data`, `json_body`, `default_headers`.
- `auth`: chainable auth providers (`basic`, `env`, `secret_scope`, `azure_oauth`).
- `pagination`: only `page_based` and `limit_offset` are supported.
- `max_retries`, `backoff_factor`, `max_concurrent_requests`.
- `key`: optional JSON path to extract (e.g., `results` or `data.items`).
- `requests_from_context`: build many requests from an upstream DataFrame.
## Pagination (supported strategies)

We currently support page-based and limit/offset. Both accept shared/advanced fields:

- `check_field`: dotted path to a list/field used to test for "any data".
- `next_page_field`: dotted path that signals "has next" (true/URL/non-empty).
- `max_page`: hard cap (`-1` / `None` = all).
- `pages_per_array_limit`: split results into sub-arrays per N pages (useful for chunking large crawls).
- `preliminary_probe`: pre-scan the API to determine all page params up front (then fully parallelize).
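To make the dotted-path fields concrete, here is a small illustrative sketch of how `check_field` and `next_page_field` could be evaluated against a parsed page payload. The helper names (`get_path`, `has_more_pages`) are hypothetical, not Nessy's internals; only the semantics described above are from the docs.

```python
from functools import reduce

def get_path(payload: dict, dotted: str):
    """Resolve a dotted path like 'info.next' against a parsed JSON payload."""
    return reduce(lambda obj, key: obj.get(key) if isinstance(obj, dict) else None,
                  dotted.split("."), payload)

def has_more_pages(payload: dict, check_field: str, next_page_field: str) -> bool:
    """A page has a successor when check_field is non-empty AND next_page_field is truthy."""
    return bool(get_path(payload, check_field)) and bool(get_path(payload, next_page_field))

# Shaped like a Rick and Morty API response:
page = {"info": {"next": "https://rickandmortyapi.com/api/character?page=2"},
        "results": [{"id": 1, "name": "Rick Sanchez"}]}
last = {"info": {"next": None}, "results": [{"id": 826}]}

print(has_more_pages(page, "results", "info.next"))  # True
print(has_more_pages(last, "results", "info.next"))  # False
```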
### 1) Page-based (Rick & Morty)

The API exposes a `page` query parameter and returns:

- the data under `results`
- an `info.next` pointer
```yaml
name: Read API with page-based pagination
steps:
  Read Characters:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      method: GET
      timeout: 900
      max_retries: 5
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page              # required
        check_field: results          # list to check for emptiness
        next_page_field: info.next    # truthy => more pages
        pages_per_array_limit: 1      # one page per array entry (nice for filtering)
```
Optionally, add a filter step to drop empty responses:

```yaml
  Handle Empty Response:
    action: TRANSFORM_FILTER
    options:
      condition: size(json_response.response) > 0
```
What requests are made?

`GET …/character?page=1`, `?page=2`, … until `info.next` is empty or `results` is empty.
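Conceptually (outside Spark), the page-based strategy behaves like the loop below. This is an illustrative sketch, not Nessy's implementation; `fake_api` is a stand-in for the real HTTP call to the Rick and Morty API.

```python
def crawl_page_based(fetch_page, page_field="page", start=1, max_page=-1):
    """Illustrative page-based loop: request successive pages until the API
    signals no next page, or max_page is hit. fetch_page(params) -> parsed JSON."""
    pages, page = [], start
    while max_page < 0 or page <= max_page:
        payload = fetch_page({page_field: page})
        if not payload.get("results"):                # check_field empty -> stop
            break
        pages.append(payload)
        if not payload.get("info", {}).get("next"):   # next_page_field falsy -> stop
            break
        page += 1
    return pages

# Stub standing in for GET .../character?page=N (three pages of data).
def fake_api(params):
    n = params["page"]
    return {"results": [f"char-{n}"],
            "info": {"next": f"?page={n + 1}" if n < 3 else None}}

print(len(crawl_page_based(fake_api)))  # 3
```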
### 2) Limit/offset (generic example)

If your API uses a limit/offset shape:

```yaml
name: Read API with limit/offset pagination
steps:
  Read Products:
    action: READ_API
    options:
      base_url: https://api.example.com
      endpoint: products
      params:
        limit: 50
        offset: 0
      pagination:
        strategy: limit_offset
        limit_field: limit                   # required
        offset_field: offset                 # required
        check_field: data.items              # optional (where the list lives)
        next_page_field: page_info.has_next  # optional; if available we trust it
        max_page: -1
        pages_per_array_limit: -1
```
Requests will look like `GET …/products?limit=50&offset=0`, `?limit=50&offset=50`, `?limit=50&offset=100`, … until `data.items` comes back empty (or `page_info.has_next` is false).
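A small sketch of how the strategy derives each request's query params, assuming the offset advances by `limit` each page (the common limit/offset convention; the function name is illustrative):

```python
def limit_offset_params(limit=50, start_offset=0, max_pages=3):
    """Illustrative: yield the query params a limit/offset strategy would send,
    assuming the offset advances by `limit` each page."""
    for i in range(max_pages):
        yield {"limit": limit, "offset": start_offset + i * limit}

print(list(limit_offset_params()))
# [{'limit': 50, 'offset': 0}, {'limit': 50, 'offset': 50}, {'limit': 50, 'offset': 100}]
```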
### Pre-scan pages for full parallelism

Set `preliminary_probe: true` to first probe the API for the last page and
then fan out all requests at once (handy with large datasets and higher
`max_concurrent_requests`).

Rick & Morty example:

```yaml
name: Read API with probe
steps:
  Read Characters (probe):
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
        preliminary_probe: true
      max_concurrent_requests: 16
```
### Extract a nested list via key

If you only want a nested list (e.g., `results`) instead of the whole JSON:

```yaml
name: Read API (extract list)
steps:
  Read Characters (results only):
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      params:
        page: 1
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
      key: results
```

Now `json_response[*].response` contains just the serialized `results` array for
each page.
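The effect of `key` on a single page can be sketched in plain Python. `extract_key` is a hypothetical helper, not part of Nessy; it just mirrors what ends up in `response` when `key: results` is set:

```python
import json
from functools import reduce

def extract_key(raw_json: str, key: str) -> str:
    """Illustrative: pull a dotted-path sub-object out of a page payload and
    re-serialize it, mirroring what `key: results` leaves in `response`."""
    payload = json.loads(raw_json)
    value = reduce(lambda obj, k: obj[k], key.split("."), payload)
    return json.dumps(value)

page = '{"info": {"next": null}, "results": [{"id": 1, "name": "Rick Sanchez"}]}'
print(extract_key(page, "results"))  # '[{"id": 1, "name": "Rick Sanchez"}]'
```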
## Authentication (optional, chainable)

Combine multiple auth sources; headers from `env`/`secret_scope` are merged with
bearer/basic flows:

```yaml
name: Read API (auth)
steps:
  Read Secure Endpoint:
    action: READ_API
    options:
      base_url: https://secure.example.com/api
      endpoint: v1/resources
      auth:
        - type: basic
          username: ${USER}
          password: ${PASSWORD}
        - type: env
          header_template:
            "X-API-Key": "<ENV_API_KEY>"
        - type: secret_scope
          secret_scope: my_scope
          header_template:
            "X-ORG-Token": "<ORG_TOKEN>"
        - type: azure_oauth
          client_id: <client-id>
          client_secret: <client-secret>
          tenant_id: <tenant-id>
          scope: <entra-app-id-uri/.default>
```

Tip: Don't hardcode secrets in YAML; prefer secret scopes or environment variables.
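To see what "chained" auth means in practice, here is a rough sketch of merging a basic-auth header with headers resolved from environment variables. This is an assumption about the merge behavior, not Nessy's actual resolution logic, and `merged_headers` is a made-up name:

```python
import base64
import os

def merged_headers(username: str, password: str, env_template: dict) -> dict:
    """Illustrative sketch of chained auth: a basic-auth header plus headers
    filled from environment variables. Not Nessy's actual resolution logic."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    # Resolve each template value (e.g. "<ENV_API_KEY>") from the environment.
    for header, env_var in env_template.items():
        value = os.environ.get(env_var.strip("<>"))
        if value is not None:
            headers[header] = value
    return headers

os.environ["ENV_API_KEY"] = "demo-key"  # for demonstration only
print(merged_headers("user", "pass", {"X-API-Key": "<ENV_API_KEY>"}))
```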
## Drive requests from prior steps (requests_from_context)

If an upstream step produces a DataFrame with columns `endpoint`, `params`,
`headers`, `data`, `json_body`, you can fan out heterogeneous calls:

```yaml
name: Read API (dynamic)
steps:
  Build Requests:
    action: TRANSFORM_SQL
    options:
      # produce a DataFrame with the required columns
      # e.g., SELECT 'character' AS endpoint, map('page','1') AS params, null AS headers, null AS data, null AS json_body
  Read Many:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      requests_from_context: true
      method: GET
      timeout: 45
      # optional pagination applied to every request:
      pagination:
        strategy: page_based
        page_field: page
        check_field: results
        next_page_field: info.next
        preliminary_probe: true
```

When `requests_from_context: true`, the top-level `endpoint` is optional; each
distinct row drives a request (or a set of pages when pagination is active).
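The row-to-request fan-out can be sketched in plain Python. This illustrates how the `endpoint`/`params` columns become URLs; `build_urls` is a hypothetical helper, and the real action also consumes `headers`, `data`, and `json_body`:

```python
from urllib.parse import urlencode

def build_urls(base_url: str, rows: list[dict]) -> list[str]:
    """Illustrative fan-out: turn one upstream row per request into a full URL,
    the way requests_from_context consumes endpoint/params columns."""
    urls = []
    for row in rows:
        url = f"{base_url.rstrip('/')}/{row['endpoint']}"
        if row.get("params"):
            url += "?" + urlencode(row["params"])
        urls.append(url)
    return urls

rows = [
    {"endpoint": "character", "params": {"page": "1"}},
    {"endpoint": "location", "params": None},
]
print(build_urls("https://rickandmortyapi.com/api", rows))
# ['https://rickandmortyapi.com/api/character?page=1', 'https://rickandmortyapi.com/api/location']
```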
## Retries, backoff, concurrency

```yaml
options:
  max_retries: 3              # retry on 429/503/504 or connection errors
  backoff_factor: 2           # exponential backoff (capped)
  max_concurrent_requests: 16
  timeout: 60
```
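For intuition, here is one common exponential-backoff schedule these options imply. The exact formula and cap Nessy uses may differ; this is an assumed `backoff_factor * 2**attempt` pattern for illustration:

```python
def retry_delays(max_retries: int, backoff_factor: float, cap: float = 60.0) -> list[float]:
    """Illustrative exponential-backoff schedule (a common pattern; the exact
    formula Nessy uses may differ): backoff_factor * 2**attempt, capped."""
    return [min(backoff_factor * (2 ** attempt), cap) for attempt in range(max_retries)]

print(retry_delays(3, 2))  # [2, 4, 8]
```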
## Output shape

ReadAPIAction returns a DataFrame:

- Column: `json_response` (ArrayType of structs)
    - `response` (string): the JSON payload (or the extracted `key`)
    - `__metadata`: `timestamp`, `base_url`, `url`, `status_code`, `reason`, `elapsed`, `endpoint`, `query_parameters` (as a string map)

With `pages_per_array_limit: 1`, each array element corresponds to a single
page, which makes downstream filtering easy (e.g., drop empty `response`).
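Downstream consumption of this shape can be sketched in plain Python (in the pipeline you would do the equivalent on the Spark DataFrame, e.g. with the `TRANSFORM_FILTER` step shown earlier). `flatten_pages` is an illustrative helper, not part of Nessy:

```python
import json

def flatten_pages(json_response: list[dict]) -> list[dict]:
    """Illustrative: parse each page's `response` string and drop empty pages,
    mirroring the TRANSFORM_FILTER step but in plain Python."""
    records = []
    for page in json_response:
        rows = json.loads(page["response"])
        if rows:  # skip empty pages
            records.extend(rows)
    return records

pages = [
    {"response": '[{"id": 1, "name": "Rick Sanchez"}]', "__metadata": {"status_code": 200}},
    {"response": "[]", "__metadata": {"status_code": 200}},
]
print(flatten_pages(pages))  # [{'id': 1, 'name': 'Rick Sanchez'}]
```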
## End-to-end Rick & Morty example (copy-paste)

YAML (`read_api_with_pagination.yaml`):

```yaml
name: Read API with Pagination
steps:
  Read API:
    action: READ_API
    options:
      base_url: https://rickandmortyapi.com/api
      endpoint: character
      method: GET
      timeout: 900
      max_retries: 10
      params:
        page: 1
      pagination:
        strategy: page_based
        check_field: results
        next_page_field: info.next
        page_field: page
        pages_per_array_limit: 1
  Handle Empty Response:
    action: TRANSFORM_FILTER
    options:
      condition: size(json_response.response) > 0
```

Python:

```python
from cloe_nessy.pipeline import PipelineParsingService

pipeline = PipelineParsingService.parse(path="./read_api_with_pagination.yaml")
pipeline.run()
```
## Troubleshooting

- "A value for base_url must be supplied" — set `options.base_url`.
- "endpoint must be supplied" — set `options.endpoint` or use `requests_from_context: true` with a valid upstream DataFrame.
- Pagination validation:
    - `strategy: page_based` → `page_field` is required.
    - `strategy: limit_offset` → `limit_field` and `offset_field` are required.
- Empty pages — set `check_field` to where the list lives (e.g., `results`) and optionally add `TRANSFORM_FILTER` to drop empties.
- Slow crawls — enable `preliminary_probe: true` and increase `max_concurrent_requests` (be considerate of rate limits).
- Rate limits (429) — set `max_retries` and `backoff_factor`.
That’s it! With the samples above you can fetch paginated data, chunk it to taste, and keep your pipelines secret-safe and parallel.