the species names never quite match
Offline taxonomic name matching against local Darwin Core backbones, with matching done in C.
Hand it a column of messy species names. taxify cleans
them, matches them against a backbone you already have on disk, resolves
synonyms to accepted names, and returns one standardized data.frame.
Every step runs locally against a versioned snapshot, so there are no
API calls, no rate limits, and the same input gives the same output on
any machine. The matching engine is written in C and built on the vectra columnar engine.
library(taxify)
# match against WFO (downloads the backbone on first use, ~120 MB)
taxify(c(
"Quercus robur",
"Pinus abies", # synonym, resolved to Picea abies
"Quercus robus", # typo, fuzzy-corrected to Q. robur
"Taraxacum officinale"
))

The usual route for name resolution, taxize, calls out
to around twenty web services (NCBI, ITIS, GBIF, EOL, IUCN, WoRMS,
Tropicos, …). That covers everything, but it ties each run to network
latency, service uptime, and rate limits, and the answer can change
between runs as upstream services update. taxify ships the
backbones as pre-built local snapshots and matches against them in C, so
a list of thousands of names resolves in seconds and a result is reproducible
from the recorded backbone version.
The closest local analogue is taxadb, which also stores backbone snapshots on disk; the migration vignette walks through the differences in matching strategy, output schema, and enrichment.
taxify ships ten backbones as compressed
.vtr files, downloaded once and matched locally. Pass
several and they form a fallback chain: a name unmatched by the first
backbone cascades to the next.
# WFO first (plants), then GBIF for whatever WFO doesn't cover
taxify(
c("Quercus robur", "Panthera leo", "Amanita muscaria"),
backend = c("wfo", "gbif")
)

| Backend | Scope | Approx. names |
|---|---|---|
| WFO | Vascular plants | ~400k |
| COL | All kingdoms | ~4.5M |
| GBIF | All kingdoms | ~10M |
| ITIS | US focus, freshwater/marine | ~900k |
| NCBI Taxonomy | All life | ~2.5M |
| Open Tree of Life | All life (synthetic) | ~4M |
| WoRMS | Marine/aquatic | ~600k |
| Euro+Med | European/Mediterranean plants | ~132k |
| Species Fungorum | Fungi | ~329k |
| AlgaeBase | Algae | ~172k |
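
The fallback chain can be pictured with a small base-R sketch. This is only an illustration of the cascade idea, not taxify's implementation: the toy `backbones` list, the made-up IDs, and the helper `match_with_fallback()` are all hypothetical, and only exact matching is shown.

```r
# Toy "backbones": named character vectors mapping a name to an ID.
# Illustrative stand-ins with made-up IDs, not taxify's real data structures.
backbones <- list(
  wfo  = c("Quercus robur" = "wfo-0001"),
  gbif = c("Quercus robur" = "gbif-0001",
           "Panthera leo" = "gbif-0002",
           "Amanita muscaria" = "gbif-0003")
)

# Hypothetical helper: try each backbone in order, but only for the
# names that no earlier backbone has matched yet.
match_with_fallback <- function(input, backbones) {
  out <- data.frame(
    input = input,
    id = NA_character_,
    backend = NA_character_,
    stringsAsFactors = FALSE
  )
  for (b in names(backbones)) {
    todo <- which(is.na(out$id))          # rows still unmatched
    if (length(todo) == 0) break
    hit <- unname(backbones[[b]][out$input[todo]])  # NA where name is absent
    found <- !is.na(hit)
    out$id[todo[found]] <- hit[found]
    out$backend[todo[found]] <- b         # record which backbone answered
  }
  out
}

res <- match_with_fallback(
  c("Quercus robur", "Panthera leo", "Amanita muscaria"),
  backbones
)
print(res)
```

In this toy setup the first backbone answers for Quercus robur, so the second is never consulted for that name; the other two names fall through to the second backbone.
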
Input names are normalized first, so the fuzzy pass only runs on names that genuinely differ from the backbone rather than on names that just carry extra authorship or qualifiers:
"Quercus robur L." -> "Quercus robur" # authorship stripped
"Pinus cf. sylvestris" -> "Pinus sylvestris" # qualifier removed
"Nothofagus x alpina" -> "Nothofagus alpina" # hybrid marker normalized
"Betula pendula (Roth) Doll" -> "Betula pendula" # parenthesized author stripped

Fuzzy matching is configurable (Damerau-Levenshtein, Levenshtein, or Jaro-Winkler, with a distance threshold), and runs genus-blocked so a typo only competes against names in the same genus.
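
As a rough illustration of normalization followed by genus-blocked fuzzy matching, here is a minimal base-R sketch. It is not taxify's C engine: `normalize_name()` and `fuzzy_match_blocked()` are hypothetical helpers, the regular expressions are simplified assumptions that only cover the four example patterns, and the distance is plain Levenshtein from `utils::adist()` rather than any of the configurable metrics.

```r
# Hypothetical, simplified normalizer: keeps only "Genus epithet".
# Real names (infraspecific ranks, multi-word authors) need more care.
normalize_name <- function(x) {
  x <- gsub("\\s*\\([^)]*\\)", "", x)           # drop parenthesized authors
  x <- gsub("\\s+(cf|aff)\\.\\s+", " ", x)      # drop cf./aff. qualifiers
  x <- gsub("\\s+(x|\u00d7)\\s+", " ", x)       # drop a standalone hybrid marker
  parts <- strsplit(trimws(x), "\\s+")
  # Keep the first two tokens, which discards trailing authorship
  vapply(parts, function(p) paste(head(p, 2), collapse = " "), character(1))
}

# Hypothetical genus-blocked fuzzy matcher: a query is only compared
# against backbone names that share its genus (the first word).
fuzzy_match_blocked <- function(query, backbone, max_dist = 2) {
  genus_of <- function(n) sub("\\s.*$", "", n)
  vapply(query, function(q) {
    cand <- backbone[genus_of(backbone) == genus_of(q)]
    if (length(cand) == 0) return(NA_character_)
    d <- utils::adist(q, cand)[1, ]             # Levenshtein distances
    if (min(d) <= max_dist) cand[which.min(d)] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

raw <- c("Quercus robur L.", "Pinus cf. sylvestris",
         "Nothofagus x alpina", "Betula pendula (Roth) Doll")
clean <- normalize_name(raw)

toy_backbone <- c("Quercus robur", "Quercus rubra",
                  "Picea abies", "Pinus sylvestris")
fixed <- fuzzy_match_blocked("Quercus robus", toy_backbone)
```

Blocking keeps the candidate set small, since each query is scored only against its own genus. The trade-off in this sketch is that a typo in the genus itself finds no candidates at all.
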
On the same WFO backbone and the same 5,000 plant names (Windows, R 4.5.2), matching against the local snapshot in C avoids the per-name cost of the CSV-into-RAM approach:
| | taxify | WorldFlora |
|---|---|---|
| Exact match (1,000 names) | 0.1 s | 1.3 s |
| Fuzzy match (1,000 names) | 1.0 s | 1,862 s (31 min) |
| Fuzzy match (5,000 names) | 1.1 s | ~83 min (extrapolated) |
| Backbone load | ~3 s (first call) | 33 s (CSV into RAM) |
taxify() returns one row per input name with a fixed
16-column schema: the matched and accepted names, IDs, rank, family,
genus, epithet, authorship, synonym and hybrid flags, the match type
(exact, exact_ci, fuzzy, or
none), the fuzzy distance, the backend, and the backbone
version used. summary() prints a compact digest of how the
batch resolved.
result <- taxify(c("Quercus robur", "Pinus abies", "Quercus robus", "Taraxacum officinale"))
summary(result)
#> -- taxify results ----------------------------------------------------
#> backend: WFO | 4 names submitted
#>
#> matched 4 (exact: 2, case-insensitive: 0, fuzzy: 2)
#> unmatched 0

Twenty-seven enrichment layers join published trait and status data to your results through the backbone-resolved accepted name, so synonyms in either dataset land on the same key:
# plants
taxify(plant_names) |>
add_conservation_status() |> # IUCN Red List
add_invasive_status("AT") |> # GRIIS
add_woodiness() |> # Zanne et al.
add_eive() # EIVE indicator values
# fish
taxify(fish_names, backend = "col") |>
add_fishbase() |> # FishBase morphology & ecology
add_fish_traits() # FISHMORPH functional traits

Sources span all kingdoms: IUCN, GRIIS, GBIF common names, WCVP, EIVE, Diaz et al., LEDA, FungalTraits, FUNGuild, AlgaeTraits, EltonTraits, AVONET, PanTHERIA, AmphiBIO, FISHMORPH, FishBase, AnAge, GloNAF, LepTraits, AnimalTraits, and regional plant-trait sets for France (Baseflor), Britain (Ecoflora), and Germany (FloraWeb), and more. The enrichments vignette lists the full set with references and licenses.
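
Why joining through the accepted name matters can be shown with a small base-R sketch. Everything here is a toy assumption for illustration: the `backbone` synonym table, the `resolve()` helper, and the placeholder trait values are hypothetical and do not reflect taxify's internal format or any real dataset.

```r
# Toy synonym table standing in for a backbone (illustrative only)
backbone <- data.frame(
  name     = c("Picea abies", "Pinus abies", "Quercus robur"),
  accepted = c("Picea abies", "Picea abies", "Quercus robur"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map any known name to its accepted name
resolve <- function(x) backbone$accepted[match(x, backbone$name)]

# Your data uses the synonym; the trait table uses the accepted name
observations <- data.frame(
  species = c("Pinus abies", "Quercus robur"),
  stringsAsFactors = FALSE
)
traits <- data.frame(
  species = c("Picea abies", "Quercus robur"),
  trait   = c("a", "b"),                 # placeholder values, not real data
  stringsAsFactors = FALSE
)

# Resolve both sides to the same key, then left-join on it
observations$accepted <- resolve(observations$species)
traits$accepted <- resolve(traits$species)
joined <- merge(
  observations,
  traits[c("accepted", "trait")],
  by = "accepted",
  all.x = TRUE                           # keep every observation row
)
print(joined)
```

A direct join on the raw `species` column would miss the Pinus abies row, because the trait table lists it under Picea abies. Resolving both sides first gives them a shared key.
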
To join your own table, add_data() auto-detects the
species column, matches it through the same backbone(s) used in the
original call, and left-joins. It accepts data.frames, CSV, CSV.GZ,
XLSX, SQLite, and .vtr.
result |> add_data("TRY_traits.csv")
result |> add_data("TRY_traits.csv", cols = c("LeafArea", "SLA", "PlantHeight"))

install.packages("pak")
pak::pak("gcol33/taxify") # vectra is installed automatically

“Software is like sex: it’s better when it’s free.” (Linus Torvalds)
I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.
If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.
MIT (see the LICENSE.md file)