Getting started with taxify

What taxify solves

Almost every biodiversity dataset starts as a column of names. Before any analysis, those strings have to resolve to one accepted name per taxon, and they rarely line up on their own: authorship, field qualifiers, capitalization, historical synonyms, hybrids, and plain typos all keep two records of the same species apart. A bare merge() on raw strings silently drops every row that disagrees, so the matching has to come first.

taxify() takes a character vector and returns one standardized table: each name cleaned, matched against a backbone you keep on disk, synonyms resolved to the accepted name. Matching runs in C through the vectra engine, so there are no web services and no rate limits, and the same input gives the same output on any machine. Thirteen Darwin Core backbones are available (WFO, COL, GBIF, ITIS, NCBI, OTT, WoRMS, Euro+Med, Species Fungorum, AlgaeBase, FishBase, SeaLifeBase, Reptile Database), all queried offline.

library(taxify)

The first taxify() call downloads a backbone once (WFO is about 150 MB) and caches it under taxify_data_dir(). After that, nothing touches the network.

One call

Hand taxify() a vector of names. The list below is deliberately small and deliberately messy: every entry takes a different route to its accepted name.

field_names <- c(
  "Quercus robur L.",        # authorship to strip
  "Quercus robus",           # typo
  "cf. Betula pendula",      # field qualifier
  "FAGUS SYLVATICA",         # caps
  "Quercus pedunculata",     # historical synonym of Q. robur
  "Q. petraea",              # abbreviated genus
  "Pinus abies",             # synonym of Picea abies (a different genus)
  "Festuca rubrra",          # typo
  "Fallopia japonica",       # synonym of Reynoutria japonica (invasive)
  "Taraxacum officinale"
)

res <- taxify(field_names)
res[, c("input_name", "accepted_name", "family",
        "is_synonym", "match_type", "fuzzy_dist")]

#>             input_name        accepted_name       family is_synonym match_type fuzzy_dist
#> 1     Quercus robur L.        Quercus robur     Fagaceae      FALSE      exact         NA
#> 2        Quercus robus        Quercus robur     Fagaceae      FALSE      fuzzy      0.077
#> 3   cf. Betula pendula       Betula pendula   Betulaceae      FALSE      exact         NA
#> 4      FAGUS SYLVATICA      Fagus sylvatica     Fagaceae      FALSE   exact_ci         NA
#> 5  Quercus pedunculata        Quercus robur     Fagaceae       TRUE      exact         NA
#> 6           Q. petraea      Quercus petraea     Fagaceae      FALSE     abbrev         NA
#> 7          Pinus abies          Picea abies     Pinaceae       TRUE      exact         NA
#> 8       Festuca rubrra        Festuca rubra      Poaceae      FALSE      fuzzy      0.071
#> 9    Fallopia japonica  Reynoutria japonica Polygonaceae       TRUE      exact         NA
#> 10 Taraxacum officinale Taraxacum officinale  Asteraceae      FALSE      exact         NA

Ten names, ten rows, every match readable. Each row also carries genus, authorship, taxon and accepted IDs, a hybrid flag, the backend, and the exact backbone version. The full table is wider than the excerpt shown here and has the same columns for any input.

Each name reaches its accepted name a different way. The animation below walks one name at a time through the pipeline: the clean step strips authorship, a qualifier, or case; the match step is exact, case-folded, fuzzy, or abbreviated; the resolve step follows a synonym to the current name.

Quercus robur L. loses its authorship before matching. cf. Betula pendula loses the qualifier. FAGUS SYLVATICA matches after case folding (exact_ci). Q. petraea resolves on the genus initial plus epithet (abbrev). The three synonyms (Quercus pedunculata, Pinus abies, Fallopia japonica) resolve to their accepted names, the last being the current name for a well-known invader. The two typos go to the fuzzy pass, which is the next thing worth seeing.
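The clean step can be approximated in a few lines of base R. This is a minimal sketch, not taxify's own cleaner: clean_name() is a hypothetical helper that drops a leading cf. or aff. qualifier, keeps only the genus and epithet (so trailing authorship falls away), and normalizes case. It ignores hybrids, infraspecific ranks, and abbreviated genera.

```r
# Hypothetical helper for illustration only; not part of taxify.
clean_name <- function(x) {
  x <- trimws(x)
  # Drop a leading field qualifier such as "cf." or "aff."
  x <- sub("^(cf|aff)\\.\\s+", "", x, ignore.case = TRUE)
  parts <- strsplit(x, "\\s+")
  out <- vapply(parts, function(p) {
    # Keep genus and epithet only; anything after them is treated as authorship
    p <- p[seq_len(min(2, length(p)))]
    genus <- paste0(toupper(substring(p[1], 1, 1)), tolower(substring(p[1], 2)))
    if (length(p) == 1) return(genus)
    paste(genus, tolower(p[2]))
  }, character(1))
  unname(out)
}

cleaned <- clean_name(c("Quercus robur L.", "cf. Betula pendula", "FAGUS SYLVATICA"))
print(cleaned)
```

A real cleaner has to tell authorship apart from infraspecific epithets, which this two-token rule cannot do.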

Why a typo barely costs anything

The fuzzy pass never scores a name against the whole backbone. It blocks on genus first, so Quercus robus is compared only against the other Quercus names. A one-letter slip is found in a handful of comparisons rather than across every name on disk.

The default threshold allows about one edit per five characters, so common typos resolve while genuinely different names do not. Fuzzy matching is controlled by fuzzy, fuzzy_threshold, and fuzzy_method; the fuzzy-matching vignette covers the sub-blocking for very large genera and the genus-typo fallback in full.
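Genus blocking is easy to sketch in base R with utils::adist(). The example below is an illustration under stated assumptions, not the vectra engine: the six-name backbone is a toy, fuzzy_in_genus() is a hypothetical helper, the 0.2 threshold is a reading of "one edit per five characters", and the distance is normalized by the longer of the two strings. That normalization is a guess, although it does reproduce the fuzzy_dist values printed earlier.

```r
# Toy backbone; a real one holds far more names.
backbone <- c("Quercus robur", "Quercus petraea", "Quercus rubra",
              "Festuca rubra", "Festuca ovina", "Picea abies")

genus_of <- function(x) sub(" .*$", "", x)

# Hypothetical helper: compare the query only against names in the same genus.
fuzzy_in_genus <- function(query, backbone, threshold = 0.2) {
  candidates <- backbone[genus_of(backbone) == genus_of(query)]
  if (length(candidates) == 0) return(NULL)
  # Levenshtein distance, normalized by the longer string (an assumption)
  d <- drop(utils::adist(query, candidates))
  norm <- d / pmax(nchar(query), nchar(candidates))
  best <- which.min(norm)
  if (norm[best] > threshold) return(NULL)
  data.frame(input = query, match = candidates[best],
             n_compared = length(candidates), dist = round(norm[best], 3))
}

hit1 <- fuzzy_in_genus("Quercus robus", backbone)
hit2 <- fuzzy_in_genus("Festuca rubrra", backbone)
print(rbind(hit1, hit2))
```

Each query is scored against only the names sharing its genus (three for Quercus, two for Festuca), which is why the cost of a typo does not grow with the size of the backbone.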

Check the batch at a glance

summary() prints a digest, the fastest way to see whether a run went cleanly.

summary(res)

#> ── taxify results ──────────────────────────────────────────────────────────
#>   backend: WFO v2024-12  |  10 names submitted
#>
#>   matched        10  (exact: 6, case-insensitive: 1, fuzzy: 2, abbrev: 1)
#>   ────────────────────────────────────────────────────────────
#>   taxon groups: angiosperm: 8  gymnosperm: 1  unknown: 1

The digest reports the backend and version, the match-route breakdown (all ten resolved here, including the abbreviated Q. petraea), and the taxon-group mix. When a name is out of scope (an animal in a plant-only backbone) or genuinely absent, the digest tallies it and suggests an alternative backend. The match types and the multi-backend fallback (backend = c("wfo", "col", "gbif")) are covered in the backends vignette.
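The route breakdown in the digest can be cross-checked from the match_type column. The sketch below copies the ten match_type values from the table printed earlier rather than calling taxify(), so it runs without a backbone; on a real result the equivalent call would be table(res$match_type).

```r
# match_type values copied from the ten-row result shown earlier
match_type <- c("exact", "fuzzy", "exact", "exact_ci", "exact",
                "abbrev", "exact", "fuzzy", "exact", "exact")

routes <- table(match_type)
print(routes)
```

The counts agree with the digest: six exact, one case-insensitive, two fuzzy, one abbreviated.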

Offline, and how much faster

Every match runs against the local snapshot, so a run reproduces exactly, and the backbone_version column records the WFO release and download date for a methods section. Many in this field reach for WorldFlora’s WFO.match for the same task. Both tools run against a local copy and return the same matches; the difference is where the matching happens: taxify scores names in C against the compiled backbone, WorldFlora in R. The benchmark used 1,000 plant names with fuzzy matching on (Windows, R 4.5.2).

Exact matching is close (0.1 s against 1.3 s); the gap opens on fuzzy matching, where the genus blocking keeps taxify near a second while the in-R scan grows with the list. The full benchmark and large-batch strategy are in the large-scale vignette.

Add your own attributes

Once names resolve to an accepted name, any table keyed on species joins cleanly. add_data() takes a data.frame, CSV, XLSX, or SQLite file, runs its species column through the same backbone, and joins on the accepted name, so a synonym on either side still lines up.

my_traits <- data.frame(
  species      = c("Quercus pedunculata",   # synonym of Q. robur
                   "Pinus sylvestris",
                   "Betula pendula"),
  seed_mass_mg = c(3200, 7.5, 0.2)
)

taxify(c("Quercus robur", "Pinus sylvestris", "Betula pendula")) |>
  add_data(my_traits, species_col = "species")

#> add_data: 3 of 3 species matched (100.0%). 0 names in data unmatched.
#>         input_name    accepted_name seed_mass_mg
#> 1    Quercus robur    Quercus robur       3200.0
#> 2 Pinus sylvestris Pinus sylvestris          7.5
#> 3   Betula pendula   Betula pendula          0.2

The trait table used Quercus pedunculata and the result used Quercus robur; a plain merge() would have missed that row. add_data() joins on the accepted name, so it lines up. Formats, auto-detection, and strict duplicate handling are in the custom-data vignette.
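The difference between the two joins can be reproduced in base R. This sketch stands in a one-entry synonym lookup for the backbone; to_accepted() is a hypothetical helper and not how add_data() is implemented. The trait values are the ones from my_traits above.

```r
# Toy stand-in for a backbone: one synonym mapped to its accepted name
synonyms <- c("Quercus pedunculata" = "Quercus robur")

# Hypothetical helper: return the accepted name, or the input if it is not a synonym
to_accepted <- function(x) {
  hit <- unname(synonyms[x])
  ifelse(is.na(hit), x, hit)
}

species <- data.frame(
  input_name = c("Quercus robur", "Pinus sylvestris", "Betula pendula")
)
my_traits <- data.frame(
  species      = c("Quercus pedunculata", "Pinus sylvestris", "Betula pendula"),
  seed_mass_mg = c(3200, 7.5, 0.2)
)

# Join on raw strings: the Quercus row is silently dropped
plain <- merge(species, my_traits, by.x = "input_name", by.y = "species")

# Join on accepted names: both sides are resolved first, so all rows line up
species$accepted   <- to_accepted(species$input_name)
my_traits$accepted <- to_accepted(my_traits$species)
joined <- merge(species, my_traits[, c("accepted", "seed_mass_mg")], by = "accepted")

print(nrow(plain))
print(nrow(joined))
```

The raw-string join returns two rows and the accepted-name join returns three.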

The enrichment layers

taxify also ships published trait and status layers. There are more than sixty, spanning the tree of life and covering the conservation and invasion records this kind of work needs. Each add_*() matches its own source against the backbone and attaches on the accepted name, so any of them stacks into a pipeline the same way. Run list_enrichments() for the current set, versions, and coverage.

For invasion work the GloNAF, GRIIS, and alien-first-record layers attach naturalized status, invasive status, and first-record years on the same accepted name, so a resolved species list carries its invasion history without a second join. The full menu and per-layer detail are in the enrichments vignette.

Stack layers, then test an idea

Field lists run to hundreds of names. Here is a realistic one, about a hundred European species, matched in one call. Pick any layers and stack them: woodiness, the EIVE ecological indicator values, and plant height from the Diaz global trait dataset all attach on the accepted name.

field <- c(
  "Quercus petraea", "Pinus sylvestris", "Picea abies", "Betula pendula",
  "Acer pseudoplatanus", "Acer platanoides", "Acer campestre", "Corylus avellana",
  "Fraxinus excelsior", "Carpinus betulus", "Sorbus aucuparia", "Tilia cordata",
  "Ulmus glabra", "Alnus glutinosa", "Salix caprea", "Populus tremula",
  "Prunus avium", "Prunus spinosa", "Crataegus monogyna", "Sambucus nigra",
  "Cornus sanguinea", "Viburnum opulus", "Euonymus europaeus", "Ligustrum vulgare",
  "Frangula alnus", "Juniperus communis", "Taxus baccata", "Larix decidua",
  "Abies alba", "Rosa canina", "Rubus idaeus", "Hedera helix",
  "Clematis vitalba", "Berberis vulgaris", "Betula pubescens", "Prunus padus",
  "Rhamnus cathartica", "Lonicera xylosteum", "Trifolium repens", "Trifolium pratense",
  "Festuca ovina", "Dactylis glomerata", "Plantago lanceolata", "Plantago major",
  "Plantago media", "Achillea millefolium", "Ranunculus acris", "Ranunculus repens",
  "Urtica dioica", "Poa pratensis", "Poa annua", "Galium mollugo",
  "Galium aparine", "Bellis perennis", "Cardamine pratensis", "Cirsium arvense",
  "Cirsium vulgare", "Daucus carota", "Heracleum sphondylium", "Anthriscus sylvestris",
  "Lotus corniculatus", "Medicago lupulina", "Vicia cracca", "Lathyrus pratensis",
  "Stellaria media", "Silene dioica", "Silene vulgaris", "Geranium pratense",
  "Geranium robertianum", "Glechoma hederacea", "Lamium album", "Prunella vulgaris",
  "Ajuga reptans", "Veronica chamaedrys", "Rumex acetosa", "Rumex obtusifolius",
  "Chenopodium album", "Capsella bursa-pastoris", "Senecio vulgaris", "Leucanthemum vulgare",
  "Centaurea jacea", "Knautia arvensis", "Campanula rotundifolia", "Primula veris",
  "Anemone nemorosa", "Filipendula ulmaria", "Lythrum salicaria", "Robinia pseudoacacia",
  "Solidago canadensis", "Solidago gigantea", "Impatiens glandulifera",
  "Heracleum mantegazzianum", "Prunus serotina", "Quercus rubra",
  "Quercus robur L.", "FAGUS SYLVATICA", "Quercus robus",
  "cf. Taraxacum officinale", "Quercus pedunculata", "Festuca rubrra"
)

dat <- taxify(field) |>
  add_zanne() |>
  add_eive() |>
  add_diaz_traits()

head(dat[, c("accepted_name", "woodiness", "eive_light",
             "eive_reaction", "eive_nutrients", "plant_height_m")], 6)

#>         accepted_name woodiness eive_light eive_reaction eive_nutrients plant_height_m
#> 1     Quercus petraea     woody       5.83          4.79           3.55          31.44
#> 2    Pinus sylvestris     woody       7.10          5.13           2.69          19.03
#> 3         Picea abies     woody       4.36          4.24           4.31          40.69
#> 4      Betula pendula     woody       6.84          4.38           3.64          12.02
#> 5 Acer pseudoplatanus     woody       3.80          5.92           6.89          24.46
#> 6    Acer platanoides     woody       3.87          6.34           5.66          21.93

A few species lack an EIVE or a height value, so those cells are NA. cor.test() drops incomplete pairs on its own and lm() drops incomplete rows by default, but mean() and cor() need na.rm = TRUE or use = "complete.obs". From here it is ordinary analysis. Do species on more base-rich soils also sit higher on the nutrient axis?

cor.test(dat$eive_reaction, dat$eive_nutrients)

#>  Pearson's product-moment correlation
#>
#> data:  dat$eive_reaction and dat$eive_nutrients
#> t = 3.2378, df = 96, p-value = 0.001654
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.1230070 0.4821711
#> sample estimates:
#>       cor
#> 0.3137695

Across about a hundred species the correlation is positive and significant, though modest (r = 0.31, p = 0.002): base-rich soils tend to carry higher nutrient values. The same three lines work for any attribute the package can attach.
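The degrees of freedom in that output show how many species actually entered the test: for a Pearson correlation df = n - 2, so df = 96 means 98 of the 100 names had both values. The toy vectors below (made-up numbers, unrelated to the EIVE data) show the same behavior on a small scale: cor.test() uses only the complete pairs, while cor() returns NA unless told otherwise.

```r
# Toy data with one NA in each vector; four complete pairs remain
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 1, 4, 3, 5, NA)

ct <- cor.test(x, y)          # silently drops the two incomplete pairs
df_used <- unname(ct$parameter)

r_naive    <- cor(x, y)                          # NA: cor() does not drop by default
r_complete <- cor(x, y, use = "complete.obs")    # same pairs cor.test() used

print(df_used)
print(r_naive)
print(r_complete)
```

Checking df against the expected number of rows is a quick way to see how much data a missing layer value has cost.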