A tidy R interface for reproducible web crawling — inspired by the architecture of Crawlee, implemented in pure R.
crawlee brings the unified-crawler idea to R: a deduplicating, resumable request queue, content-type-aware handlers, structured storage, and rich console logging via cli. It can crawl HTML pages, sitemaps, RSS and Atom feeds, and PDF documents, with reproducibility as a first-class concern.
It is built entirely on the R web-scraping ecosystem (httr2, rvest, xml2, chromote) — no Node.js runtime required.
A crawl is a loop: requests flow through a deduplicating queue to a
fetch engine; each response is dispatched to a handler that extracts
data (push_data()) and discovers more links
(enqueue_links()), which flow back into the queue until it
drains.
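
The loop described above can be illustrated with a minimal base R sketch. It is not crawlee's implementation: `crawl_sketch()` and the in-memory `site` list are hypothetical names introduced here, the "fetch" is a list lookup rather than an HTTP request, and the use of `utils::glob2rx()` for link filtering is an assumption about how a glob filter could behave.

```r
# Hypothetical in-memory "site": each URL maps to the links found on that page.
# It stands in for real HTTP fetches so the sketch runs offline.
site <- list(
  "https://example.com/"       = c("https://example.com/blog/a",
                                   "https://example.com/about"),
  "https://example.com/blog/a" = c("https://example.com/blog/b",
                                   "https://example.com/"),
  "https://example.com/blog/b" = c("https://example.com/blog/a"),
  "https://example.com/about"  = character()
)

crawl_sketch <- function(start, site, glob = "*", max_depth = 2) {
  queue   <- list(list(url = start, depth = 0))  # pending requests
  seen    <- start                               # dedup set: every URL ever enqueued
  dataset <- list()                              # rows collected by the handler
  pattern <- utils::glob2rx(glob)                # translate the glob into a regex

  while (length(queue) > 0) {
    req   <- queue[[1]]                          # take the next request
    queue <- queue[-1]

    links <- site[[req$url]]                     # "fetch": look the page up
    if (is.null(links)) next                     # unknown URL, nothing to handle

    # Handler, part 1: extract data (the role push_data() plays).
    dataset[[length(dataset) + 1]] <- data.frame(
      url = req$url, depth = req$depth, n_links = length(links),
      stringsAsFactors = FALSE
    )

    # Handler, part 2: discover links (the role enqueue_links() plays).
    if (req$depth < max_depth) {
      fresh <- unique(links[grepl(pattern, links) & !(links %in% seen)])
      seen  <- c(seen, fresh)
      for (u in fresh) {
        queue[[length(queue) + 1]] <- list(url = u, depth = req$depth + 1)
      }
    }
  }
  do.call(rbind, dataset)                        # queue drained: return the dataset
}

result <- crawl_sketch("https://example.com/", site, glob = "*/blog/*")
print(result)
```

With the `*/blog/*` filter, the sketch visits the start page and the two blog pages, each exactly once, and never enqueues the about page. Tracking `seen` at enqueue time, rather than at fetch time, is what keeps a URL from entering the queue twice when several pages link to it.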
# install.packages("pak")
pak::pak("StrategicProjects/crawlee")

library(crawlee)
resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2, respect_robots = TRUE) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/blog/*")
  }) |>
  cr_run() |>
  cr_collect()
resultado
#> # A tibble: 1 × 2
#>   url                 titulo
#>   <chr>               <chr>
#> 1 https://example.com Example Domain

- Optional dependencies (listed under Suggests), loaded only when used.
- cr_* verbs compose with the native pipe and always return tibbles.
- robots.txt awareness by default.

| Milestone | Scope | Status |
|---|---|---|
| M1 | Core: queue, HTTP, HTML handlers, dataset, cli logs | ✅ |
| M2 | Sitemap & RSS discovery, robots.txt enforcement | ✅ |
| M3 | PDF / document handlers (pdftools) | ✅ |
| M4 | Headless browser backend (chromote) | ✅ |
| M5 | RAG helpers (chunking, embeddings, export) | ✅ |
| M6 | Persistent & resumable storage (jsonl/duckdb, cr_persist()) | ✅ |
| M7 | Parallel fetching (cr_parallel()) | ✅ |
| M8 | Autoscaling (cr_autoscale()) & streaming pool (cr_stream()) | ✅ |
| M9 | Adaptive streaming + per-host pacing | ✅ |
MIT © crawlee authors.