Biodiversity data analysis almost always starts with a name-matching step. Field records, herbarium labels, and literature extractions use different spellings, outdated synonyms, and informal qualifiers. Before any statistical work can begin, those raw strings need to be resolved to a single accepted name per taxon.
Two R packages handle this well: taxize queries online APIs in real time, and WorldFlora matches names offline against the WFO backbone. taxify builds on the same idea but works offline against ten backbone databases at once, with fuzzy matching in C, synonym resolution, and a built-in enrichment pipeline for joining trait and status data.
taxify takes a different approach. It downloads Darwin Core backbone snapshots to disk once, converts them to a compressed columnar format (.vtr files powered by the vectra engine), and runs all matching offline against those local copies. Nine backbones are available: WFO (plants), COL (all-kingdom catalogue), GBIF (all kingdoms, largest), ITIS (North American focus), NCBI Taxonomy (molecular/genomic), Open Tree of Life (synthetic tree), WoRMS (marine taxa), Species Fungorum (fungi), and AlgaeBase (algae). A single function call matches names, resolves synonyms, and returns a uniform 16-column data.frame regardless of which backbone was used.
The choice of backbone matters. WFO is maintained by the World Flora Online consortium and represents the most authoritative source for vascular plants, bryophytes, and ferns. COL (the Catalogue of Life) covers all kingdoms but with less taxonomic depth per group; it is a good second choice when a dataset mixes plants with fungi or animals. GBIF has the widest raw coverage because it aggregates multiple source taxonomies, but its synonym handling is coarser. For marine taxa, WoRMS is the standard. For molecular work, NCBI Taxonomy aligns with GenBank accession metadata. The remaining backbones serve more specialized needs: ITIS for North American regulatory contexts, OTT for phylogenetic placement, Species Fungorum for fungal nomenclature, and AlgaeBase for algal taxonomy.
This vignette walks through the core workflow: installing a backbone, matching names, reading the output, and layering on enrichment data. The code chunks are not evaluated here because the backbone files are too large for CRAN build infrastructure, but every example uses real species names and realistic outputs.
The first call to taxify() auto-downloads the WFO
backbone if no local copy exists. For a more deliberate setup, or to
pre-install several backbones before an analysis session, use
taxify_download_vtr().
#> i WFO backbone not found locally. Downloading v2024-12...
#> v WFO backbone ready (v2024-12, 148 MB).
The file lands in a platform-appropriate data directory. On Linux
that is typically
~/.local/share/R/taxify/wfo/latest/wfo.vtr; on macOS,
~/Library/Application Support/R/taxify/wfo/latest/wfo.vtr;
on Windows,
%LOCALAPPDATA%/R/data/R/taxify/wfo/latest/wfo.vtr.
#> [1] "/home/user/.local/share/R/taxify"
Multiple backbones can be installed in one call. Each backbone is independent and occupies its own subdirectory.
taxify checks backbone versions once per R session. If a newer
release appears on Zenodo, the next taxify() call downloads
the update automatically. Pinned versions (useful for reproducibility)
are also supported:
taxify_download_vtr("wfo", version = "2024.06") downloads
into a separate directory that is never overwritten. This distinction
matters for long-running projects. The “latest” directory always tracks
the most recent release, while a pinned directory preserves an exact
snapshot. If a collaborator needs to reproduce your results six months
later, the pinned version guarantees that the same backbone rows are
used even if WFO has published a new release in the interim.
The core function is taxify(). It accepts a character
vector of taxonomic names and returns a data.frame with one row per
input name.
result <- taxify(c(
"Quercus robur",
"Pinus sylvestris",
"Betula pendula",
"Fagus sylvatica",
"Acer pseudoplatanus"
))#> Matching 5 names...
The result is a standard data.frame. Every column is character or logical, so it plays well with dplyr, data.table, or base R subsetting without type coercion surprises.
#> input_name accepted_name family match_type
#> 1 Quercus robur Quercus robur Fagaceae exact
#> 2 Pinus sylvestris Pinus sylvestris Pinaceae exact
#> 3 Betula pendula Betula pendula Betulaceae exact
#> 4 Fagus sylvatica Fagus sylvatica Fagaceae exact
#> 5 Acer pseudoplatanus Acer pseudoplatanus Sapindaceae exact
All five names matched exactly. The family column comes
from the backbone, not from a separate taxonomy lookup, so it is always
consistent with the accepted name.
Every taxify() call returns the same 16 columns,
regardless of which backbone produced the match. This uniformity means
downstream code never needs to branch on backend type.
| Column | Type | Description |
|---|---|---|
input_name |
character | The original string as submitted |
matched_name |
character | The backbone entry that matched |
accepted_name |
character | The currently accepted name (equals
matched_name when the match is not a synonym) |
taxon_id |
character | Backend-specific ID of the matched name |
accepted_id |
character | ID of the accepted name |
rank |
character | Taxonomic rank: species, subspecies, genus, family, etc. |
family |
character | Family of the accepted name |
genus |
character | Genus of the accepted name |
epithet |
character | Specific epithet |
authorship |
character | Taxonomic authority string |
is_synonym |
logical | TRUE if the matched name is a synonym |
is_hybrid |
logical | TRUE if a hybrid marker was detected in the input |
match_type |
character | One of exact, exact_ci,
fuzzy, out_of_scope, or none |
fuzzy_dist |
numeric | Normalized edit distance (0–1), NA for exact matches |
backend |
character | Which backbone was used (e.g., wfo,
col, gbif) |
backbone_version |
character | Backend name, version, and download date for reproducibility |
To see what a single row looks like in practice, consider the synonym “Pinus abies” matched against WFO.
#> [,1]
#> input_name "Pinus abies"
#> matched_name "Pinus abies"
#> accepted_name "Picea abies"
#> taxon_id "wfo-0000483065"
#> accepted_id "wfo-0000471692"
#> rank "species"
#> family "Pinaceae"
#> genus "Picea"
#> epithet "abies"
#> authorship "L."
#> is_synonym "TRUE"
#> is_hybrid "FALSE"
#> match_type "exact"
#> fuzzy_dist NA
#> backend "wfo"
#> backbone_version "wfo:2024-12 (2026-04-01)"
Several things stand out. The taxon_id is the WFO
identifier of the row that actually matched (“Pinus abies”), while
accepted_id points to the currently accepted taxon (“Picea
abies”). These two IDs differ whenever a synonym is involved. The
genus and family columns always reflect the
accepted name, not the matched synonym, so downstream joins on genus or
family work correctly even for synonym inputs. The
backbone_version string encodes both the WFO release
version and the date the backbone was downloaded. This is useful for
methods sections: “We matched names against WFO v2024-12, downloaded
2026-04-01.”
When a name is not a synonym, taxon_id and
accepted_id are identical, matched_name and
accepted_name are identical, and is_synonym is
FALSE. The fuzzy_dist column holds NA for all exact and
case-insensitive matches; it only gets a numeric value for fuzzy
matches. This makes it straightforward to filter for uncertain matches
with result[!is.na(result$fuzzy_dist), ].
The rank column deserves a brief note. Most matched
names will have rank “species”, but taxify also matches genus-level
names (rank “genus”), infraspecific names (rank “subspecies”, “variety”,
“form”), and higher-rank names (rank “family”, “order”, etc.) when they
appear in the backbone. If you submit “Quercus” without an epithet,
taxify matches the genus-level entry and returns rank “genus”. If you
submit “Pinus sylvestris var. hamata” and the variety exists in the
backbone, you get rank “variety”; if it does not exist, taxify falls
back to the species-level match and returns rank “species”.
The authorship column contains the taxonomic authority
as recorded in the backbone. For WFO this is typically the standard
abbreviation (“L.”, “Sm.”, “(Aiton) Sm.”), while COL and GBIF may
include the full unabbreviated author name. Note that this is the
authorship of the matched name, not necessarily of the accepted
name. When a synonym is matched, the authorship reflects the synonym’s
authority. This can be useful for disambiguating homonyms (different
species that share the same binomial but differ in authorship).
taxify cleans input names before matching, so messy real-world data works without manual preprocessing. The cleaning pipeline runs entirely on the user’s input vector (which is small); the backbone is already clean. The following transformations happen in order:
Qualifier stripping. Prefixes and infixes like
cf., aff., s.l.,
s.str., sp., spp.,
subsp., var., f.,
auct., sensu, agg. are removed.
The qualifier is recorded separately and can be retrieved later with
add_qualifier_info().
Authorship removal. Parenthesized authorship
strings like “(L.)” or “(Aiton) Sm.” are stripped first, then trailing
authorship patterns like “L.” or “ex DC.” are removed. The backbone’s
own authorship column still carries the authority for the
matched name.
Whitespace and case normalization. Multiple spaces collapse to one. Everything except the genus initial is lowercased. “ACER PSEUDOPLATANUS” becomes “Acer pseudoplatanus”.
Hybrid marker detection. The multiplication
sign, the letter “x” between genus and epithet, or “x” between two
binomials are recognized as hybrid markers. The is_hybrid
flag is set, and the marker is stripped for matching purposes.
Latin orthographic normalization. Common epithet spelling alternations are reduced to a canonical form. Pairs like “hirtaeformis”/“hirtiformis”, “caeruleum”/“ceruleum”, and “phyllum”/“fillum” all normalize to the same key. This catches mismatches that are really just alternative transliterations, not typos.
Here are those transformations in action.
messy_result <- taxify(c(
"Quercus robur L.", # trailing authorship
"cf. Betula pendula", # qualifier prefix
"Pinus sylvestris var. hamata", # infraspecific qualifier
" Fagus sylvatica ", # extra whitespace
"ACER PSEUDOPLATANUS" # all caps
))
messy_result[, c("input_name", "accepted_name", "match_type")]#> input_name accepted_name match_type
#> 1 Quercus robur L. Quercus robur exact
#> 2 cf. Betula pendula Betula pendula exact
#> 3 Pinus sylvestris var. hamata Pinus sylvestris exact
#> 4 Fagus sylvatica Fagus sylvatica exact
#> 5 ACER PSEUDOPLATANUS Acer pseudoplatanus exact_ci
The authorship “L.” after “Quercus robur” was removed before
matching. The “cf.” prefix on “Betula pendula” was stripped (the
qualifier itself is recorded internally and can be retrieved with
add_qualifier_info()). “Pinus sylvestris var. hamata” did
not match at the variety rank, so taxify fell back to matching “Pinus
sylvestris” at species rank and reported it as an exact match. The
all-caps version of Acer pseudoplatanus matched after case folding, so
it received match type exact_ci.
Latin orthographic normalization is grouped under
exact_ci as well, since no edit distance algorithm is
involved. The normalizer handles six common alternation patterns in
Latin epithets: ae/i (as in
caeruleum/ceruleum), oe/i, terminal
ii/i, y/i,
ph/f (phyllum/fillum),
rh/r, and th/t.
These transformations are applied only to the epithet, never to the
genus, and only during the normalization matching pass. They catch a
class of discrepancies that would otherwise require fuzzy matching and
consume edit-distance budget that might be needed for genuine typos.
The cleaning pipeline is conservative by design. It strips known qualifiers and authorship patterns but preserves the core binomial. It does not attempt to correct obvious misspellings (that is the fuzzy matcher’s job), and it does not guess at abbreviated genus names. The goal is to remove noise while leaving the signal intact for the matching engine to handle.
Synonyms are resolved transparently. When a submitted name matches a
synonym in the backbone, matched_name shows what was found
and accepted_name shows what it resolves to. The
is_synonym flag marks these rows.
syn_result <- taxify(c(
"Picea abies",
"Pinus abies", # basionym / synonym of Picea abies
"Quercus robur",
"Quercus pedunculata" # synonym of Quercus robur
))
syn_result[, c("input_name", "matched_name", "accepted_name", "is_synonym")]#> input_name matched_name accepted_name is_synonym
#> 1 Picea abies Picea abies Picea abies FALSE
#> 2 Pinus abies Pinus abies Picea abies TRUE
#> 3 Quercus robur Quercus robur Quercus robur FALSE
#> 4 Quercus pedunculata Quercus pedunculata Quercus robur TRUE
Both “Pinus abies” and “Picea abies” resolve to the same
accepted_name, and both share the same
accepted_id. This is the key that enrichment joins and
add_data() use, so trait data attached via accepted ID
propagates correctly regardless of which synonym the user submitted.
Some species accumulate many synonyms over their taxonomic history.
The common Norway spruce has been described under at least four
different genera: Pinus abies L. (the Linnaean basionym),
Abies picea Mill., Picea excelsa (Lam.) Link, and the
accepted Picea abies (L.) H.Karst. All four names are present
in the WFO backbone as synonyms pointing to the same accepted taxon ID.
Submitting any of them to taxify returns the same
accepted_name and accepted_id.
spruce <- taxify(c(
"Picea abies", # accepted name
"Pinus abies", # Linnaean basionym
"Abies picea", # Miller's combination
"Picea excelsa" # Link's combination
))
spruce[, c("input_name", "accepted_name", "accepted_id", "is_synonym")]#> input_name accepted_name accepted_id is_synonym
#> 1 Picea abies Picea abies wfo-0000471692 FALSE
#> 2 Pinus abies Picea abies wfo-0000471692 TRUE
#> 3 Abies picea Picea abies wfo-0000471692 TRUE
#> 4 Picea excelsa Picea abies wfo-0000471692 TRUE
The identical accepted_id across all four rows means any
downstream operation that groups or joins on accepted ID treats them as
the same species. This is the entire point of synonym resolution: it
collapses the many-to-one relationship between historical names and the
current consensus.
Taxonomic synonyms come in two flavours. A homotypic synonym
(also called a nomenclatural synonym) is based on the same type specimen
as the accepted name; the species was simply moved to a different genus.
“Pinus abies” is a homotypic synonym of “Picea abies” because both are
based on the same Linnaean type. A heterotypic synonym (also
called a taxonomic synonym) is based on a different type specimen but
was later judged to represent the same species. taxify does not
distinguish between the two in the output: the is_synonym
column is TRUE for both. Some backbones (GBIF, COL) do record whether a
synonym is homotypic or heterotypic, and that information is available
via add_gbif_info() or add_col_info(), but for
most workflows the simple TRUE/FALSE flag is sufficient.
Synonym resolution handles chains. If synonym A points to synonym B, which points to accepted name C, taxify follows the chain (up to 10 hops) and returns C. This matters for backbones like COL and GBIF where synonym chains of length two or three are common.
taxify classifies every match into one of five categories. Each category reflects a different level of confidence in the result.
exact: The cleaned input matches a backbone entry character for character. This is the fastest path and the most reliable. In practice, the majority of well-formatted species names fall into this category.
#> input_name match_type fuzzy_dist
#> 1 Quercus robur exact NA
exact_ci: The input matches after case folding or after Latin orthographic normalization. No edit distance is involved. This category catches two distinct classes of mismatch: pure capitalization differences, and Latin spelling alternations.
#> input_name match_type fuzzy_dist
#> 1 quercus robur exact_ci NA
The name “quercus robur” (all lowercase) does not match “Quercus
robur” character-for-character, but it does match after case folding.
The match type is exact_ci and fuzzy_dist
remains NA because no edit distance algorithm was needed.
fuzzy: The input does not match exactly but falls within the allowed edit distance of a backbone entry. The default threshold is 0.2 (normalized Damerau-Levenshtein), which allows roughly one edit per five characters. Fuzzy matching is genus-blocked: “Quercus robor” will only be compared against other Quercus entries, not the entire backbone.
#> input_name accepted_name match_type fuzzy_dist
#> 1 Quercus robor Quercus robur fuzzy 0.07142857
The typo “robor” (missing the “u”) was corrected to “Quercus robur”
with a normalized edit distance of about 0.07. The
fuzzy_dist column always holds the normalized distance, so
values are comparable across names of different lengths.
out_of_scope: The genus is recognized in the genus register but is not covered by the requested backbone. Submitting “Panthera leo” to the WFO backbone (plants only) produces this classification, because taxify knows Panthera is a real genus that belongs to a different backbone.
#> input_name match_type life_form
#> 1 Panthera leo out_of_scope animal
The life_form column (populated from the genus register)
shows “animal”, which explains why the name is out of scope for a
plant-only backbone. The summary method uses this information to suggest
alternative backends.
none: No match was found and the genus is either unknown or also covered by the requested backend. This means the name is genuinely absent from the backbone, not merely scoped to a different one.
#> input_name match_type
#> 1 Fakegenus fakus none
“Fakegenus fakus” is not recognized by any backbone, so it receives
none. There is no genus register entry for “Fakegenus”, no
alternative backend to suggest. In real-world datasets,
none typically indicates a garbled name, a common name that
was not converted to Latin, or an organism described in a publication
that predates the backbone’s coverage.
The five categories together give a complete picture of match quality. A quick diagnostic pass over the output might look like this:
types_result <- taxify(c(
"Quercus robur", # exact
"quercus robur", # exact_ci (case folding)
"Quercus robor", # fuzzy (one-char typo)
"Panthera leo", # out_of_scope (animal in WFO)
"Fakegenus fakus" # none
))
table(types_result$match_type)#> exact exact_ci fuzzy none out_of_scope
#> 1 1 1 1 1
Fuzzy matching is controlled by three arguments.
fuzzy = FALSE disables it entirely.
fuzzy_threshold sets the maximum normalized distance
(default 0.2) or, when >= 1, a raw edit count (e.g.,
fuzzy_threshold = 2L allows at most 2 edits regardless of
name length). fuzzy_method selects the algorithm:
"dl" (Damerau-Levenshtein, default),
"levenshtein", or "jw" (Jaro-Winkler).
Calling summary() on a taxify result prints a compact
digest of match quality. This is the fastest way to assess whether a run
went well or whether something needs attention upstream.
mixed <- taxify(c(
"Quercus robur", "Pinus sylvestris", "Betula pendula",
"Picea abies", "Pinus abies",
"Quercus robor", # typo
"Panthera leo", # animal in WFO
"Felis catus", # animal in WFO
"Fakus invalidus" # genuinely absent
))
summary(mixed)#> -- taxify results ------------------------------------------------------------
#> backend: WFO v2024-12 | 9 names submitted
#>
#> matched 6 (exact: 4, case-insensitive: 0, fuzzy: 1)
#> out of scope 2 (animal: 2 -- not in WFO, try backend = "col", "gbif")
#> unmatched 1 (taxon_group: unknown: 1)
#> ------------------------------------------------------------
#> taxon groups: vascular plant: 6 animal: 2 unknown: 1
The first line identifies the backend and version, and the total
number of names submitted. The “matched” line breaks down by match type
so we can immediately see that four names matched exactly, zero needed
case folding, and one required fuzzy correction. The “out of scope” line
reports two animal names that have no business being in a plant
backbone, and helpfully suggests "col" or
"gbif" as alternatives. The “unmatched” line tallies
genuinely absent names, broken down by taxon group from the genus
register. The taxon-groups summary at the bottom shows the life-form
composition of the full input. If a dataset that should be all plants
shows 50 animals, something went wrong upstream.
When enrichments have been applied, the summary includes them as well. Each enrichment layer gets its own line showing the source, version, and how many species received data.
#> -- taxify results ------------------------------------------------------------
#> backend: WFO v2024-12 | 9 names submitted
#>
#> matched 6 (exact: 4, case-insensitive: 0, fuzzy: 1)
#> out of scope 2 (animal: 2 -- not in WFO, try backend = "col", "gbif")
#> unmatched 1 (taxon_group: unknown: 1)
#> ------------------------------------------------------------
#> taxon groups: vascular plant: 6 animal: 2 unknown: 1
#>
#> enrichments:
#> conservation_status (IUCN Red List 2024.12) -- 4 of 9 matched
#> woodiness (Zanne et al. 2014 2024.12) -- 5 of 9 matched
The enrichment lines show that 4 of the 9 input names received an IUCN conservation status, and 5 received woodiness data. The difference is expected: the two animal names and the unmatched name have no records in either enrichment, and not every plant species has been assessed by IUCN.
A single backbone rarely covers everything in a mixed-kingdom dataset. Passing multiple backend names creates a fallback chain: names matched by an earlier backend are not re-matched by later ones.
multi <- taxify(
c("Quercus robur", "Panthera leo", "Amanita muscaria",
"Escherichia coli", "Salmo trutta"),
backend = c("wfo", "col", "gbif")
)#> Matching 5 names against 3 backends: wfo -> col -> gbif
#> [wfo] Matching 5 names...
#> [col] Matching 3 remaining names...
#> [gbif] Matching 1 remaining names...
The progress messages tell the story. WFO receives all 5 names. It matches “Quercus robur” (a plant) and fails on the other four. COL then receives the 3 remaining names and matches “Panthera leo” (a mammal), “Amanita muscaria” (a fungus), and “Salmo trutta” (a fish). That leaves only “Escherichia coli” (a bacterium) for GBIF.
#> input_name accepted_name backend
#> 1 Quercus robur Quercus robur wfo
#> 2 Panthera leo Panthera leo col
#> 3 Amanita muscaria Amanita muscaria col
#> 4 Escherichia coli Escherichia coli gbif
#> 5 Salmo trutta Salmo trutta col
The backend column records which backbone resolved each
name. This column is essential for reproducibility: it tells a reviewer
(or your future self) that Quercus robur was resolved using WFO while
Panthera leo used COL.
The fallback order matters. WFO is the most authoritative source for
plants, so putting it first ensures plant names get WFO-quality synonym
resolution. COL covers all kingdoms but with less taxonomic depth per
group. GBIF has the widest coverage but coarser synonym handling. A
sensible default for mixed-kingdom work is
backend = c("wfo", "col", "gbif").
A subtle point: once a name matches in an earlier backend, it is removed from the pool sent to subsequent backends. This means the COL and GBIF backends never see “Quercus robur” at all, which both speeds up matching and avoids conflicting results for names that exist in multiple backbones. If Quercus robur had been sent to all three backbones, it would match in each, potentially returning different taxon IDs and slightly different synonym chains. The fallback design avoids this ambiguity by construction.
One practical consequence: the accepted_id values in a
multi-backend result are not globally unique across backends. A WFO ID
like “wfo-0000306015” and a COL ID like “9TQBG” are both valid
identifiers but belong to different namespaces. Downstream joins that
combine results from different taxify runs should join on
accepted_name (which is a real taxonomic name) rather than
accepted_id (which is backend-specific).
The summary for a multi-backend run aggregates across all backends but preserves the per-backend breakdown in the out-of-scope tally. It also shows the backend names in the header line, making it clear that multiple sources contributed to the result.
taxify ships with 12 enrichment layers that join external trait and
status data to a taxify() result. Each enrichment is a
separate .vtr file downloaded on first use and cached locally. The join
key is accepted_name, so synonyms in the original input
resolve correctly.
#> name version nrow static trait_cols source_url
#> 1 conservation_status 2024.12 166342 TRUE conservation_status https://doi.org/10.15468/39omei
#> 2 griis 2024.12 25918 TRUE invasive_status https://doi.org/10.15468/6jbdk3
#> 3 wcvp 2024.12 356224 TRUE native_status https://doi.org/10.34885/gah...
#> 4 eive 2024.12 6937 TRUE light, temperature, moisture, reaction https://doi.org/10.1111/jvs.13031
#> 5 elton_traits 2024.12 9994 TRUE diet_inv, diet_vend, ..., body... https://doi.org/10.1890/13-1917.1
#> 6 avonet 2024.12 11009 TRUE beak_length, wing_length, migrat... https://doi.org/10.1111/ele.13898
#> 7 pantheria 2024.12 5416 TRUE longevity_mo, litter_size, gestation... https://doi.org/10.1890/08-1494.1
#> 8 amphibio 2024.12 6776 TRUE body_size_mm, age_maturity_d, reproduc... https://doi.org/10.1038/sdata.2017.123
#> 9 common_names 2024.12 982445 FALSE common_name https://doi.org/10.15468/39omei
#> 10 woodiness 2024.12 47898 TRUE woodiness https://doi.org/10.1038/nature12872
#> 11 diaz_traits 2024.12 7381 TRUE seed_mass_mg, plant_height_m https://doi.org/10.1038/s41586-015-...
#> 12 leda 2024.12 3625 TRUE raunkiaer_life_form, dispersal_type, ... https://doi.org/10.1111/j.1365-...
Each add_*() function appends one or more columns to the
result. The functions download their .vtr on first use, so no separate
installation step is needed. The static column in the
listing above indicates whether the dataset is version-locked (TRUE
means it will never change; FALSE means taxify checks for updates once
per session).
The enrichment join key is accepted_name, not
input_name. This is a deliberate choice. If two rows in the
taxify result were submitted as “Pinus abies” (a synonym) and “Picea
abies” (the accepted name), both resolve to the same
accepted_name and therefore receive the same trait values
from the enrichment layer. The enrichment .vtr files are built with
cross-backbone name resolution, meaning a species name is resolved
against all seven backbones during the enrichment build pipeline, and
the union of all resulting accepted names is stored. This ensures that
add_woodiness() works correctly regardless of whether the
user matched against WFO, COL, or GBIF.
Some enrichments are kingdom-specific. Woodiness, EIVE, WCVP, Diaz traits, and LEDA cover plants only. EltonTraits covers birds and mammals. AVONET is bird-only. PanTHERIA is mammal-only. AmphiBIO is amphibian-only. Conservation status and common names are cross-kingdom. When an enrichment does not cover a particular taxon group, those rows simply receive NA. The summary method reports how many rows received data, so the coverage gap is immediately visible.
add_conservation_status() joins IUCN Red List
categories. Coverage is global across all taxonomic groups,
approximately 166,000 species.
conservation <- taxify(c(
"Panthera tigris",
"Quercus robur",
"Ailuropoda melanoleuca",
"Pinus sylvestris",
"Spheniscus demersus"
), backend = c("wfo", "col")) |>
add_conservation_status()
conservation[, c("input_name", "accepted_name", "conservation_status")]#> input_name accepted_name conservation_status
#> 1 Panthera tigris Panthera tigris EN
#> 2 Quercus robur Quercus robur LC
#> 3 Ailuropoda melanoleuca Ailuropoda melanoleuca VU
#> 4 Pinus sylvestris Pinus sylvestris LC
#> 5 Spheniscus demersus Spheniscus demersus EN
The IUCN abbreviations are standard: LC (Least Concern), NT (Near Threatened), VU (Vulnerable), EN (Endangered), CR (Critically Endangered), EW (Extinct in the Wild), EX (Extinct). Species not yet assessed by the IUCN receive NA. The Sumatran tiger and African penguin both show EN; Quercus robur and Pinus sylvestris are LC.
add_common_names() joins GBIF vernacular names filtered
by ISO 639-1 language code. The default is English.
common <- taxify(c(
"Quercus robur",
"Pinus sylvestris",
"Betula pendula"
)) |>
add_common_names()
common[, c("input_name", "common_name")]#> input_name common_name
#> 1 Quercus robur Pedunculate Oak
#> 2 Pinus sylvestris Scots Pine
#> 3 Betula pendula Silver Birch
Other languages work the same way. German names for the same species:
common_de <- taxify(c(
"Quercus robur",
"Pinus sylvestris",
"Betula pendula"
)) |>
add_common_names(lang = "de")
common_de[, c("input_name", "common_name")]#> input_name common_name
#> 1 Quercus robur Stieleiche
#> 2 Pinus sylvestris Waldkiefer
#> 3 Betula pendula Hängebirke
When multiple vernacular names exist for a species in the requested language, the most commonly used one is returned.
add_woodiness() joins the Zanne et al. (2014) woodiness
classification. Coverage is about 48,000 plant species, each labelled as
“woody”, “herbaceous”, or “variable” (species that can be either
depending on growth conditions).
woody <- taxify(c(
"Quercus robur",
"Trifolium repens",
"Salix caprea",
"Plantago lanceolata"
)) |>
add_woodiness()
woody[, c("input_name", "accepted_name", "woodiness")]#> input_name accepted_name woodiness
#> 1 Quercus robur Quercus robur woody
#> 2 Trifolium repens Trifolium repens herbaceous
#> 3 Salix caprea Salix caprea woody
#> 4 Plantago lanceolata Plantago lanceolata herbaceous
Enrichments stack naturally in a pipe. The columns added by each
add_*() function are independent, so the order of
application does not matter.
stacked <- taxify(c(
"Quercus robur",
"Betula pendula",
"Pinus sylvestris"
)) |>
add_conservation_status() |>
add_woodiness() |>
add_common_names()
stacked[, c("accepted_name", "conservation_status",
"woodiness", "common_name")]#> accepted_name conservation_status woodiness common_name
#> 1 Quercus robur LC woody Pedunculate Oak
#> 2 Betula pendula LC woody Silver Birch
#> 3 Pinus sylvestris LC woody Scots Pine
add_data() joins any external dataset to a taxify result
through backbone matching. The external data’s species names are run
through the same backbone(s) that produced the original result, and the
join is performed on accepted_id. This means synonyms in
either the user’s data or the external dataset resolve to the same
key.
traits <- data.frame(
species = c("Quercus robur", "Quercus pedunculata",
"Pinus sylvestris", "Betula pendula"),
max_height_m = c(40, 40, 35, 25),
shade_tolerance = c("moderate", "moderate", "intolerant", "intolerant"),
stringsAsFactors = FALSE
)
result <- taxify(c("Quercus robur", "Pinus sylvestris", "Betula pendula"))
enriched <- result |>
add_data(traits, species_col = "species")#> Matching 4 names from 'species' through WFO backbone...
#> Matching 4 names...
#> add_data: 3 of 3 species matched (100.0%). 0 names in data unmatched.
#> input_name accepted_name max_height_m shade_tolerance
#> 1 Quercus robur Quercus robur 40 moderate
#> 2 Pinus sylvestris Pinus sylvestris 35 intolerant
#> 3 Betula pendula Betula pendula 25 intolerant
The traits data.frame contained both “Quercus robur” and “Quercus
pedunculata” (a synonym). Because both resolve to the same accepted ID,
the join works correctly without deduplication on the user’s side. If
the two rows had different trait values for the same accepted species,
add_data() would raise an error rather than picking one
arbitrarily.
When species_col is not specified,
add_data() auto-detects it by probing the first 10 rows of
each character column against the backbone. The column with the highest
match rate wins. If no column reaches 50% match rate, an error asks the
user to specify species_col explicitly. The auto-detection
runs a small taxify() call internally, so it adds a brief
delay on first use, but it saves time in exploratory workflows where the
column name varies across datasets (“species”, “taxon”,
“scientific_name”, “Taxon.name”, and so on).
The join itself is performed on accepted_id, not on the
raw species name. This is the key difference from a naive
merge(). If the user’s CSV contains “Quercus pedunculata”
(a synonym) and the taxify result contains “Quercus robur” (the accepted
name), a raw string merge would miss the connection. The
add_data() join resolves both names through the backbone,
discovers that they share the same accepted ID, and links them
correctly. Duplicate handling is strict: if two rows in the external
data resolve to the same accepted ID with different trait values,
add_data() raises an error rather than picking one row
arbitrarily.
add_data() reads .csv,
.csv.gz, .xlsx (requires the openxlsx2
package), .sqlite/.db (requires DBI and
RSQLite), and .vtr files natively. For SQLite files,
specify the table name with the table argument. Any other
format can be read into a data.frame first and passed directly.
taxify detects hybrid markers in input names (the multiplication
sign, the letter x between genus and epithet, or between two binomials)
and sets is_hybrid = TRUE in the output.
add_hybrid_info() goes further, parsing hybrid formulas to
extract parent names and classify the hybrid type.
hybrids <- taxify(c(
"Quercus x rosacea", # nothospecies
"Quercus pyrenaica x Q. petraea", # hybrid formula
"x Cuprocyparis leylandii", # nothogenus
"Betula pendula" # not a hybrid
)) |>
add_hybrid_info()
hybrids[, c("input_name", "is_hybrid", "hybrid_type",
"hybrid_parent_1", "hybrid_parent_2")]#> input_name is_hybrid hybrid_type hybrid_parent_1 hybrid_parent_2
#> 1 Quercus x rosacea TRUE nothospecies <NA> <NA>
#> 2 Quercus pyrenaica x Q. petraea TRUE formula Quercus pyrenaica Quercus petraea
#> 3 x Cuprocyparis leylandii TRUE nothogenus <NA> <NA>
#> 4 Betula pendula FALSE <NA> <NA> <NA>
Three hybrid types are recognized. “nothospecies” is a named hybrid species (the multiplication sign appears between genus and epithet). “formula” is a hybrid cross written as “A x B”, where the parser expands abbreviated genera (the “Q.” in “Q. petraea” is expanded to “Quercus” based on the first parent). “nothogenus” is a hybrid genus (the multiplication sign precedes the genus name).
For formulas, the extracted parent names are full binomials that can
be submitted to taxify() themselves for further resolution.
A common workflow for hybrid-heavy datasets (e.g., ornamental
horticulture records) is to run taxify() on the full list,
then pipe through add_hybrid_info(), and finally re-run
taxify() on the extracted parent names to obtain their full
taxonomic resolution. The hybrid detection runs on the raw input before
any cleaning, so it correctly handles both the Unicode multiplication
sign and the ASCII “x” notation.
taxify maintains a unified genus register built from all installed
backbones. It maps each genus to its family, higher classification, and
a broad life-form category. lookup_genus() queries this
register.
#> genus kingdom phylum class order family kingdom_group taxon_group life_form
#> 1 Quercus Plantae <NA> Magnoliopsida Fagales Fagaceae plantae vascular plant vascular plant
#> genus kingdom phylum class order family kingdom_group taxon_group life_form
#> 1 Panthera Animalia Chordata Mammalia Carnivora Felidae animalia animal animal
The register is what powers the out_of_scope
classification and the taxon-group breakdown in summary().
It is built once from the union of WFO, COL, and GBIF genera, with
classification conflicts resolved by priority (COL > GBIF > WFO
for higher taxonomy, since COL and GBIF carry kingdom through order
while WFO only has family).
taxify_register_coverage() shows which backbones contain
a given genus, which helps decide which backend to use for a particular
taxonomic group.
#> genus backend version date_added
#> 1 Quercus wfo 2024-12 2026-04-01
#> 2 Quercus col 2024.4 2026-04-01
#> 3 Quercus gbif 2024-08 2026-04-01
#> genus backend version date_added
#> 1 Panthera col 2024.4 2026-04-01
#> 2 Panthera gbif 2024-08 2026-04-01
Quercus appears in all three major backbones. Panthera is absent from
WFO (plants only) but present in COL and GBIF. This information is used
automatically during matching, but it is also useful when planning which
backends to install for a particular project. If your dataset is
entirely marine invertebrates, you might check a few representative
genera with taxify_register_coverage() and discover that
WoRMS is the only backend that covers them, saving the time of
downloading WFO and COL.
The register currently contains approximately 100,000 genera drawn
from the union of WFO, COL, and GBIF. It ships as a pre-built
.vtr file and is updated with each taxify release. The
register is small enough to fit comfortably in memory and is cached for
the duration of the R session, so lookup_genus() calls are
effectively instant.
Backbone .vtr files, enrichment .vtr files, and the genus register
all live under taxify_data_dir(). During an R session,
taxify caches file paths in memory so that repeated
taxify() calls do not re-scan the file system.
taxify_clear_cache() drops these in-memory handles. The
next taxify() call will re-load from disk. This is rarely
needed, but it can help after manually moving or deleting .vtr
files.
To force a fresh manifest fetch (e.g., after the maintainer publishes
a new backbone version mid-session), use
taxify_refresh_manifest().
The on-disk files themselves are never deleted by taxify. To reclaim
disk space, delete the contents of taxify_data_dir()
manually.
# See where everything lives
taxify_data_dir()
# To remove all taxify data (backbones, enrichments, register):
# unlink(taxify_data_dir(), recursive = TRUE)The total disk footprint depends on how many backbones and enrichments are installed. WFO alone is about 150 MB; all seven backbones plus all 12 enrichments total roughly 2.5 GB.
We will close with an end-to-end worked example. The input is a realistic list of 22 species names from a hypothetical European biodiversity survey. It includes six clean accepted plant names, three historical synonyms (Pinus abies, Quercus pedunculata, Picea excelsa), two typos (Quercus robor, Fagus sylvatyca), four messy field annotations (qualifier, trailing authorship, infraspecific rank, excess whitespace), one hybrid, four animal names that do not belong in a plant backbone, and two entirely fictitious names. We will match against WFO and COL, inspect the summary, and layer on three enrichments.
survey_names <- c(
"Quercus robur", "Fagus sylvatica", "Betula pendula",
"Pinus sylvestris", "Alnus glutinosa", "Fraxinus excelsior",
"Pinus abies", "Quercus pedunculata", "Picea excelsa",
"Quercus robor", "Fagus sylvatyca",
"cf. Sorbus aucuparia", "Acer pseudoplatanus L.",
"Pinus sylvestris var. hamata", " Tilia cordata ",
"Quercus x rosacea",
"Panthera leo", "Salmo trutta", "Cervus elaphus", "Parus major",
"Notareal plantus", "Randomus specius"
)There are 22 names in total. We match against WFO first (best for plants), with COL as a fallback for the animal names.
#> Matching 22 names against 2 backends: wfo -> col
#> [wfo] Matching 22 names...
#> [wfo] Fuzzy matching 6 unmatched...
#> [col] Matching 4 remaining names...
WFO handled the plant names (exact, fuzzy, synonym, and all). The 4 animal names did not match in WFO and were forwarded to COL, which resolved all four. Let us look at the summary first.
#> -- taxify results ------------------------------------------------------------
#> backend: WFO + COL | 22 names submitted
#>
#> matched 18 (exact: 13, case-insensitive: 0, fuzzy: 2)
#> out of scope 0
#> unmatched 2 (taxon_group: unknown: 2)
#> ------------------------------------------------------------
#> taxon groups: vascular plant: 16 animal: 4 unknown: 2
Eighteen of 22 names matched. The two unknowns are “Notareal plantus” and “Randomus specius”, which do not exist in any backbone. The animal names are not out of scope this time because COL covers them.
Now we layer on three enrichments: conservation status, woodiness, and common names.
The summary now reflects the enrichment coverage.
#> -- taxify results ------------------------------------------------------------
#> backend: WFO + COL | 22 names submitted
#>
#> matched 18 (exact: 13, case-insensitive: 0, fuzzy: 2)
#> out of scope 0
#> unmatched 2 (taxon_group: unknown: 2)
#> ------------------------------------------------------------
#> taxon groups: vascular plant: 16 animal: 4 unknown: 2
#>
#> enrichments:
#> conservation_status (IUCN Red List 2024.12) -- 12 of 22 matched
#> woodiness (Zanne et al. 2014 2024.12) -- 14 of 22 matched
#> common_names (GBIF vernacular names 2024.12) -- 16 of 22 matched
Conservation status matched 12 of 22 names, woodiness matched 14, and common names matched 16. These numbers make sense. Conservation status has gaps because not every species has been assessed by IUCN; the common European trees are assessed (mostly LC), but some of the less prominent species may not be. Woodiness covers about 48,000 species from the Zanne et al. dataset, so most temperate trees are included. Common names have the widest coverage because the GBIF vernacular names dataset is very large, though it still misses some species in less commonly documented languages.
We can now pull out the columns we care about for downstream analysis. The result is a plain data.frame, so standard R subsetting, dplyr verbs, or data.table operations all work without special handling.
analysis <- result[, c("input_name", "accepted_name", "family",
"match_type", "is_synonym", "backend",
"conservation_status", "woodiness",
"common_name")]A few diagnostic queries round out the workflow. These are patterns that come up in nearly every biodiversity data cleaning session.
# Which names were synonyms?
result[result$is_synonym == TRUE,
c("input_name", "accepted_name", "accepted_id")]#> input_name accepted_name accepted_id
#> 7 Pinus abies Picea abies wfo-0000471692
#> 8 Quercus pedunculata Quercus robur wfo-0000306015
#> 9 Picea excelsa Picea abies wfo-0000471692
Three synonyms were submitted. “Pinus abies” and “Picea excelsa” both resolve to the same accepted species, Picea abies, with the same accepted ID. “Quercus pedunculata” resolves to Quercus robur. Because these synonyms share accepted IDs with other rows in the result, the enrichment columns carry the same trait values as their accepted-name counterparts. This is a common pattern in biodiversity databases: the same physical species appears under different names in different records, and the synonym resolution step collapses those variants so that trait lookups and species counts are correct. Without this step, a species count would overcount Picea abies (listing it three times under three different names) and a trait join would miss the synonym rows entirely.
# Which names needed fuzzy correction?
result[result$match_type == "fuzzy",
c("input_name", "accepted_name", "fuzzy_dist")]#> input_name accepted_name fuzzy_dist
#> 10 Quercus robor Quercus robur 0.07142857
#> 11 Fagus sylvatyca Fagus sylvatica 0.06666667
Both typos were corrected with very low edit distances. The “y” in “sylvatyca” was caught by Damerau-Levenshtein, and the missing “u” in “robor” was caught the same way.
# Threatened species in the survey
result[!is.na(result$conservation_status) &
result$conservation_status %in% c("VU", "EN", "CR"),
c("accepted_name", "conservation_status", "common_name")]#> accepted_name conservation_status common_name
#> 17 Panthera leo VU Lion
Only one species in this survey carries a threatened IUCN status. The rest are either Least Concern or not yet assessed. For a European forest plot this is unsurprising. A tropical or marine dataset would likely show more threatened species. The conservation status enrichment is cross-kingdom, so it works for the animal names as well (Panthera leo is VU globally).
#> herbaceous woody <NA>
#> 0 14 8
All matched plants in this survey are woody, which makes sense for a European forest plot. The 8 NAs correspond to the 4 animal names (woodiness is a plant-only enrichment), the 2 unknown names, and 2 plants that happen to fall outside the Zanne et al. dataset coverage.
The entire pipeline ran offline after the initial backbone download.
There are no API rate limits, no network dependency during analysis, and
the backbone_version column in the output ensures full
reproducibility.
A few closing notes on performance. taxify uses vectra’s columnar
engine for all backbone queries. Exact matching is index-accelerated
(hash indexes on the name and genus columns), so even against the GBIF
backbone (over 7 million rows) a batch of 10,000 names typically
completes in a few seconds. Fuzzy matching is slower because it involves
string distance computation, but it is genus-blocked: each input name is
only compared against backbone entries in the same genus, which reduces
the search space by several orders of magnitude. For very large batches
(100,000+ names), the main bottleneck is fuzzy matching of names with
misspelled genera, where the genus blocking cannot help. In that case,
consider running a first pass with fuzzy = FALSE to pick
off the easy matches, then re-running only the unmatched names with
fuzzy enabled.
The enrichment joins are also fast. Each add_*() call
performs a single vectra inner join on accepted_name, which
is O(n) in the size of the taxify result. Stacking multiple enrichments
in a pipe adds columns incrementally without re-reading the
backbone.