README

Offline taxonomic name matching against local Darwin Core backbones, with matching done in C.

Hand it a column of messy species names. taxify cleans them, matches them against a backbone stored on disk, resolves synonyms to accepted names, and returns one standardized data.frame. Every step runs locally against a versioned snapshot, so there are no API calls, no rate limits, and the same input gives the same output on any machine. The matching runs in C through the vectra columnar engine.

library(taxify)

# match against WFO (downloads the backbone on first use, ~120 MB)
taxify(c(
  "Quercus robur",
  "Pinus abies",        # synonym, resolved to Picea abies
  "Quercus robus",      # typo, fuzzy-corrected to Q. robur
  "Taraxacum officinale"
))

Local, not over the wire

The usual route for name resolution, taxize, calls out to around twenty web services (NCBI, ITIS, GBIF, EOL, IUCN, WoRMS, Tropicos, …). That covers everything, but it ties each run to network latency, service uptime, and rate limits, and the answer can change between runs as upstream services update. taxify ships the backbones as pre-built local snapshots and matches against them in C, so a list of thousands of names resolves in seconds and each result is reproducible from the recorded backbone version.

The closest local analogue is taxadb, which also stores backbone snapshots on disk; the migration vignette walks through the differences in matching strategy, output schema, and enrichment.

Thirteen backbones, one call

taxify ships thirteen backbones as compressed .vtr files, downloaded once and matched locally. Pass several and they form a fallback chain: a name unmatched by the first backbone cascades to the next.

# WFO first (plants), then GBIF for whatever WFO doesn't cover
taxify(
  c("Quercus robur", "Panthera leo", "Amanita muscaria"),
  backend = c("wfo", "gbif")
)
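The fallback chain can be pictured with a few lines of base R. This is only a sketch of the idea under stated assumptions: the two toy backbones and the helper `match_chain()` are hypothetical stand-ins, not taxify's data or internals.

```r
# Toy backbones (hypothetical stand-ins for real backbone snapshots)
backbones <- list(
  wfo  = c("Quercus robur", "Picea abies"),
  gbif = c("Quercus robur", "Panthera leo", "Amanita muscaria")
)

# Hypothetical helper: try each backbone in order and keep the first
# backbone that contains the name; later backbones only see leftovers.
match_chain <- function(x, backbones) {
  backend <- rep(NA_character_, length(x))
  for (id in names(backbones)) {
    pending <- is.na(backend)                 # names still unmatched
    hit <- pending & x %in% backbones[[id]]   # exact hits in this backbone
    backend[hit] <- id
  }
  data.frame(name = x, backend = backend, stringsAsFactors = FALSE)
}

chain_result <- match_chain(
  c("Quercus robur", "Panthera leo", "Amanita muscaria"),
  backbones
)
print(chain_result)
```

In this toy run the oak is claimed by the first backbone, while the lion and the fungus fall through to the second.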

Backend	Scope	Approx. names
WFO	Vascular plants	~400k
COL	All kingdoms	~4.5M
GBIF	All kingdoms	~10M
ITIS	US focus, freshwater/marine	~900k
NCBI Taxonomy	All life	~2.5M
Open Tree of Life	All life (synthetic)	~4M
WoRMS	Marine/aquatic	~600k
Euro+Med	European/Mediterranean plants	~132k
Species Fungorum	Fungi	~329k
AlgaeBase	Algae	~172k
FishBase	Fishes	~100k
SeaLifeBase	Non-fish marine/aquatic	~134k
Reptile Database	Reptiles	~50k

Names are cleaned before matching

Input names are normalized first, so the fuzzy pass only runs on names that genuinely differ from the backbone rather than on names that just carry extra authorship or qualifiers:

"Quercus robur L."            ->  "Quercus robur"      # authorship stripped
"Pinus cf. sylvestris"        ->  "Pinus sylvestris"   # qualifier removed
"Nothofagus x alpina"         ->  "Nothofagus alpina"  # hybrid marker normalized
"Betula pendula (Roth) Doll"  ->  "Betula pendula"     # parenthesized author stripped

Fuzzy matching is configurable (Damerau-Levenshtein, Levenshtein, or Jaro-Winkler, with a distance threshold), and runs genus-blocked so a typo only competes against names in the same genus.
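To make the two steps concrete, here is a rough base-R sketch of normalization followed by a genus-blocked fuzzy pass. It is not taxify's C implementation: `clean_name()` and `fuzzy_in_genus()` are hypothetical helpers, the regular expressions cover only the four cases shown above, the toy backbone is invented for illustration, and `utils::adist()` (plain Levenshtein) stands in for the configurable distance metrics.

```r
# Hypothetical normalizer: handles only the patterns illustrated above
clean_name <- function(x) {
  x <- gsub("\\s*\\([^)]*\\)", "", x)    # drop parenthesized authors
  x <- gsub("\\s+(cf|aff)\\.", "", x)    # drop qualifiers such as "cf."
  x <- gsub("\\s+x\\s+", " ", x)         # drop an ASCII hybrid marker
  # Simplification: keep only genus + epithet (the first two tokens),
  # which discards trailing authorship and any infraspecific parts
  vapply(strsplit(trimws(x), "\\s+"),
         function(p) paste(head(p, 2), collapse = " "),
         character(1))
}

# Hypothetical genus-blocked fuzzy match: a query only competes against
# backbone names that share its genus
fuzzy_in_genus <- function(query, backbone, max_dist = 2) {
  genus_of <- function(n) sub(" .*", "", n)
  candidates <- backbone[genus_of(backbone) == genus_of(query)]
  if (length(candidates) == 0) return(NA_character_)
  d <- drop(adist(query, candidates))    # Levenshtein distances
  if (min(d) > max_dist) return(NA_character_)
  candidates[which.min(d)]
}

raw <- c("Quercus robur L.", "Pinus cf. sylvestris",
         "Nothofagus x alpina", "Betula pendula (Roth) Doll")
cleaned <- clean_name(raw)
print(cleaned)

toy_backbone <- c("Quercus robur", "Quercus rubra", "Quercus suber",
                  "Pinus sylvestris")
hit <- fuzzy_in_genus("Quercus robus", toy_backbone)
print(hit)
```

Blocking by genus keeps the candidate set small, which is the reason a typo in the epithet is cheap to correct. A strict block like this one cannot recover a misspelled genus, so treat that as a limitation of the sketch rather than a statement about taxify.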

On the same WFO backbone and the same 5,000 plant names (Windows, R 4.5.2), matching against the local snapshot in C avoids the per-name cost of WorldFlora's CSV-into-RAM approach:

	taxify	WorldFlora
Exact match (1,000 names)	0.1 s	1.3 s
Fuzzy match (1,000 names)	1.0 s	1,862 s (31 min)
Fuzzy match (5,000 names)	1.1 s	~83 min (extrapolated)
Backbone load	~3 s (first call)	33 s (CSV into RAM)

What you get back

taxify() returns one row per input name with a fixed schema: the matched and accepted names with their IDs and authorship, rank, family, genus, epithet, synonym / hybrid / ambiguity flags, any taxonomic qualifier, the match type (exact, exact_ci, fuzzy, abbrev, out_of_scope, or none), the fuzzy distance, a coarse kingdom / taxon-group label, the backend, and the backbone version used. summary() prints a compact digest of how the batch resolved.

result <- taxify(c("Quercus robur", "Pinus abies", "Quercus robus", "Taraxacum officinale"))
summary(result)
#> -- taxify results ----------------------------------------------------
#>   backend: WFO  |  4 names submitted
#>
#>   matched         4  (exact: 2, case-insensitive: 0, fuzzy: 2, abbrev: 0)
#>   --------------------------------------------------------------
#>   taxon groups: vascular plant: 4

Trait and status enrichment

More than eighty enrichment layers join published trait and status data to your results through the backbone-resolved accepted name, so synonyms in either dataset land on the same key:

# plants
taxify(plant_names) |>
  add_iucn() |>                  # IUCN Red List
  add_griis("AT") |>             # GRIIS
  add_zanne() |>                 # Zanne et al. woodiness
  add_eive()                     # EIVE indicator values

# fish
taxify(fish_names, backend = "col") |>
  add_fishbase() |>              # FishBase morphology & ecology
  add_fishmorph()                # FISHMORPH functional traits

# one trait, gathered across every source that has it
taxify(plant_names) |>
  add_trait("seed_mass")         # Diaz + GIFT, harmonized to mg
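Why joining through the accepted name matters can be shown with a toy example in base R. Everything here is an illustrative assumption: the small synonym table, the helper `resolve()`, and the `toy_trait` values are invented, and the real enrichment functions may work differently.

```r
# Hypothetical synonym table: each name maps to its accepted name
synonyms <- data.frame(
  name     = c("Picea abies", "Pinus abies", "Quercus robur"),
  accepted = c("Picea abies", "Picea abies", "Quercus robur"),
  stringsAsFactors = FALSE
)
resolve <- function(x) synonyms$accepted[match(x, synonyms$name)]

# The observations use the accepted name, the trait table uses the synonym,
# so a direct join on the raw names would miss the spruce
observations <- data.frame(
  input = c("Picea abies", "Quercus robur"),
  stringsAsFactors = FALSE
)
traits <- data.frame(
  species   = c("Pinus abies", "Quercus robur"),
  toy_trait = c(20, 10),               # made-up values, not real trait data
  stringsAsFactors = FALSE
)

# Resolve both sides to the accepted name, then left-join on that key
observations$key <- resolve(observations$input)
traits$key       <- resolve(traits$species)
joined <- merge(observations, traits[c("key", "toy_trait")],
                by = "key", all.x = TRUE)
print(joined)
```

Both rows pick up a trait value because the synonym in the trait table and the accepted name in the observations resolve to the same key.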

Sources span all kingdoms: IUCN, GRIIS, GBIF common names, WCVP, EIVE, Diaz et al., LEDA, GIFT, FungalTraits, FUNGuild, AlgaeTraits, EltonTraits, AVONET, PanTHERIA, AmphiBIO, FISHMORPH, FishBase, AnAge, GloNAF, LepTraits, AnimalTraits, regional plant-trait sets for France (Baseflor), Britain (Ecoflora), and Germany (FloraWeb), and more. The enrichments vignette lists the full set with references and licenses.

To join your own table, add_data() auto-detects the species column, matches it through the same backbone(s) used in the original call, and left-joins. It accepts data.frames, CSV, CSV.GZ, XLSX, SQLite, and .vtr.

result |> add_data("TRY_traits.csv")
result |> add_data("TRY_traits.csv", cols = c("LeafArea", "SLA", "PlantHeight"))

Check a list before you trust it

inspect() returns only the names that look wrong, each labelled with what stands out and the name to use instead. It catches typos, retired synonyms, made-up genera, near-duplicate spellings, and the lone animal in a list of plants, and ranks them by whether they need a decision, a second look, or just optional cleanup.

inspect(field_names)                  # offline register and list checks
inspect(field_names, backbones = TRUE) # also typos, synonyms, ambiguity

For regional field lists, region steers a fuzzy correction toward species that actually occur where you work, so a misspelling resolves to the plant that grows there even when a one-letter neighbour from another continent sits just as close in spelling. Pass a region name, a TDWG code, or coordinates.

taxify(field_names, region = "Belgium")
taxify(field_names, coords = c(4.35, 50.85))
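One way to picture region steering is as a tie-break among equally close fuzzy candidates. The sketch below is an assumption about the general idea, not taxify's algorithm: `pick_regional()` is a hypothetical helper, the species names are invented placeholders, and `utils::adist()` (Levenshtein) supplies the distances.

```r
# Hypothetical tie-break: among the closest candidates, prefer one that
# appears in the regional checklist; otherwise fall back to the first
pick_regional <- function(query, candidates, regional) {
  d <- drop(adist(query, candidates))    # Levenshtein distance to each
  best <- candidates[d == min(d)]        # every candidate tied at the minimum
  in_region <- best[best %in% regional]
  if (length(in_region) > 0) in_region[1] else best[1]
}

# Invented placeholder names: both sit one edit away from the query
candidates <- c("Examplea alba", "Examplea alta")
steered   <- pick_regional("Examplea alda", candidates, regional = "Examplea alta")
unsteered <- pick_regional("Examplea alda", candidates, regional = character(0))
print(c(steered = steered, unsteered = unsteered))
```

With a regional checklist the tie goes to the name recorded in the region; without one the choice falls back to candidate order.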

Installation

install.packages("taxify")

# or install from GitHub with pak
install.packages("pak")
pak::pak("gcol33/taxify")          # vectra is installed automatically

Documentation

Support

I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.
