crawlee mirrors the architecture of Crawlee in pure R. A crawler owns:
You build a crawler with `crawler()` and configure it with `cr_*` verbs that compose through the native pipe (`|>`).
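Pipe-friendly verbs usually follow one convention: each verb takes the object as its first argument and returns an updated copy. The sketch below illustrates that convention with a plain list. `toy_crawler()`, `toy_max_depth()` and `toy_on_html()` are hypothetical stand-ins written for this example, not crawlee functions, and the fields they set are assumptions.

```r
# Toy illustration of pipe-composable verbs (hypothetical names, base R only).
toy_crawler <- function(start_url) {
  list(start_url = start_url, max_depth = Inf, handlers = list())
}

# Each verb takes the crawler first and returns the updated crawler,
# which is what lets the calls chain through |>.
toy_max_depth <- function(crawler, depth) {
  crawler$max_depth <- depth
  crawler
}

toy_on_html <- function(crawler, handler, label = "default") {
  crawler$handlers[[label]] <- handler
  crawler
}

cr <- toy_crawler("https://example.com") |>
  toy_max_depth(2) |>
  toy_on_html(function(ctx) NULL, label = "article")

names(cr$handlers)
```

Because every step returns a complete object, a configuration can be built up incrementally and inspected at any point in the chain.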
Every handler receives a context object, conventionally named `ctx`:
| Element | Description |
|---|---|
| `ctx$request` | The current request (`url`, `label`, `depth`, …). |
| `ctx$response` | The raw httr2 response. |
| `ctx$page` | The parsed page (`xml_document`) for HTML/XML, else `NULL`. |
| `ctx$push_data(data)` | Append a record (list or data frame) to the dataset. |
| `ctx$enqueue_links(...)` | Discover and enqueue links from the page. |
| `ctx$log` | Logging helpers (`info()`, `success()`, `warn()`, `error()`). |
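To make the shape of `ctx` concrete, here is a sketch of a handler exercised against a hand-built mock context. The mock is an assumption made for illustration: it imitates only `ctx$request`, `ctx$push_data()` and `ctx$log$info()` with base R closures, and says nothing about how crawlee constructs the real object. `make_mock_ctx()` and the `store` field are hypothetical.

```r
# Mock context (illustrative only): collects pushed records in an environment.
make_mock_ctx <- function(url, label = NULL, depth = 0L) {
  store <- new.env()
  store$records <- list()
  store$messages <- character()
  list(
    request  = list(url = url, label = label, depth = depth),
    response = NULL,
    page     = NULL,  # would be an xml_document for HTML/XML responses
    push_data = function(data) {
      store$records[[length(store$records) + 1L]] <- data
      invisible(NULL)
    },
    log = list(info = function(msg) {
      store$messages <- c(store$messages, msg)
      invisible(NULL)
    }),
    store = store  # exposed so the example can inspect what was collected
  )
}

# A handler is just a function of ctx.
handler <- function(ctx) {
  ctx$log$info(paste("Visited", ctx$request$url))
  ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
}

ctx <- make_mock_ctx("https://example.com/post/1", label = "article", depth = 1L)
handler(ctx)
length(ctx$store$records)
```

Writing handlers as plain functions of `ctx` also makes them easy to unit test with a mock like this one, without touching the network.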
`enqueue_links()` accepts glob and include/exclude patterns plus a `same_domain` flag (on by default), so you follow only the links you care about.
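The filtering idea can be sketched in base R. This is not crawlee's implementation: `filter_links()` and `url_host()` are hypothetical helpers that turn a glob into a regular expression with `utils::glob2rx()` and compare hosts extracted by a deliberately simple regex. The argument names are assumptions chosen to echo the description above.

```r
# Extract the host of an absolute http(s) URL (simplified; ignores userinfo).
url_host <- function(url) {
  tolower(sub("^https?://([^/:?#]+).*$", "\\1", url))
}

# Hypothetical helper: keep links matching a glob, optionally on the same host.
filter_links <- function(links, base_url, glob = "*", same_domain = TRUE) {
  keep <- grepl(utils::glob2rx(glob), links)
  if (same_domain) {
    keep <- keep & url_host(links) == url_host(base_url)
  }
  links[keep]
}

links <- c(
  "https://example.com/blog/a",
  "https://example.com/about",
  "https://other.org/blog/b"
)

# Only the same-host blog link survives both filters.
filter_links(links, "https://example.com/", glob = "*/blog/*")
```

With `same_domain = FALSE`, the `other.org` blog link would pass the glob filter as well.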
Requests enqueued with a label are routed to the handler registered under the same label, for example `cr_on_html(..., label = "article")`.
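Label routing amounts to a lookup from label to handler. The sketch below shows the idea with a named list. `route_request()` and the fallback `"default"` key are illustrative assumptions, not part of crawlee.

```r
# Handlers keyed by label (toy example; real handlers would receive a full ctx).
handlers <- list(
  default = function(ctx) paste("listing:", ctx$request$url),
  article = function(ctx) paste("article:", ctx$request$url)
)

# Hypothetical dispatcher: pick the handler whose name matches the request label,
# falling back to "default" when the label is missing or unknown.
route_request <- function(handlers, ctx) {
  label <- ctx$request$label
  if (is.null(label) || is.null(handlers[[label]])) {
    label <- "default"
  }
  handlers[[label]](ctx)
}

route_request(handlers, list(request = list(url = "https://example.com/p/1", label = "article")))
route_request(handlers, list(request = list(url = "https://example.com/", label = NULL)))
```

Keeping one handler per page type tends to read better than a single handler that branches on the URL.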
The request queue deduplicates URLs by a normalised key (see `cr_normalize_url()`), so the same page is never fetched twice and crawls are deterministic. Persistent, resumable storage backends (DuckDB, Parquet) are on the roadmap.
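Deduplication by a normalised key can be sketched as follows. The normalisation rules used here (lowercase the scheme and host, drop the fragment, strip a trailing slash) are assumptions chosen for illustration; the rules `cr_normalize_url()` actually applies may differ. `normalize_key()` and `make_queue()` are hypothetical names.

```r
# Illustrative normalisation: not necessarily what cr_normalize_url() does.
normalize_key <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  m <- regmatches(url, regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]+)(.*)$", url))[[1]]
  if (length(m) == 3L) {
    url <- paste0(tolower(m[2]), m[3])  # lowercase scheme and host only
  }
  sub("/$", "", url)  # strip a single trailing slash
}

# Minimal in-memory queue that rejects URLs whose key was already seen.
make_queue <- function() {
  seen <- character()
  pending <- character()
  list(
    add = function(url) {
      key <- normalize_key(url)
      if (key %in% seen) return(invisible(FALSE))
      seen <<- c(seen, key)
      pending <<- c(pending, url)
      invisible(TRUE)
    },
    size = function() length(pending)
  )
}

queue <- make_queue()
queue$add("https://example.com/a")
queue$add("HTTPS://Example.com/a/#top")  # same key, so it is skipped
queue$size()
```

Because both URLs reduce to the same key, only the first is queued.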