wcvpmatch is an R package for scientific plant name
standardization and taxonomic reconciliation against the World Checklist of
Vascular Plants (WCVP).
The package is inspired by the matching workflow implemented in ‘treemendous’ (especially its staged ‘matching’ logic), and extends that approach with broader functionality for robust and reproducible taxonomic resolution workflows.
wcvpmatch addsclassify_spnames())wcvp_matching())allow_duplicates, input_index)matched_taxon_name,
accepted_taxon_name, status fields)matched_taxon_authors,
accepted_taxon_authors)output_name_style = "snake_case")You can install the development version from GitHub with:
# install.packages("pak")
pak::pak("PaulESantos/wcvpmatch")wcvpmatch depends on fozziejoin for fuzzy
matching in the backend. For that reason, a working Rust toolchain is a
practical prerequisite for a fully functional installation of
wcvpmatch, especially when fozziejoin is
installed from source.
Install Rust on your computer first. This is required so that the
fozziejoin backend used by wcvpmatch can
compile and run the fuzzy matching components correctly. You can
download the installer from https://rust-lang.org/tools/install/.
On Windows, configure Rust for R + Rtools compatibility by running this command in the R terminal:
rustup override set stable-x86_64-pc-windows-gnupak::pak("fozziejoin")Or install the development version of fozziejoin:
pak::pak("fozzieverse/fozziejoin/fozziejoin-r")If you want the package to use the default WCVP backbone
automatically, install the companion package wcvpdata from
r-universe:
install.packages(
"wcvpdata",
repos = c("https://paulesantos.r-universe.dev", "https://cloud.r-project.org")
)After loading wcvpmatch, you can check whether the
default backbone is available with:
library(wcvpmatch)
wcvp_setup_info()library(wcvpmatch)
sp_names <- c(
"Aniba heterotepala",
"Anthurium quipuscoae",
"Centropogon reflexus",
"Chuquiraga jonhstonii",
"Cyathea carolinae",
"Ditassa violascens",
"Cleistocactus fieldianus",
"Opuntia yanganucensis",
"Epidendrum trachydipterum",
"Hebeclinium hylophorbum",
"Jaltometa sagastegui", # Jaltomata sagastegui
"Lepechinia tomentosa",
"Lupinus cookianos", # Lupinus cookianus
"Oxalis hochreutinerii", # Oxalis hochreutineri
"Passiflora heterohelix",
"Peperomia arborigaudens", # Peperromia arborigaudens
"Piper setulosum",
"Pycnophyllum aristattum", # Pycnophyllum aristatum
"Salvia subscadens", # Salvia subscandens
"Stellaria macbridei",
"Stemodia piurenses", # Stemodia piurensis
"Weberbauerella brongnartioides"
)
res <-
classify_spnames(sp_names) |>
wcvp_matching(
prefilter_genus = TRUE,
allow_duplicates = TRUE,
max_dist = 2,
method = "osa",
output_name_style = "snake_case"
)
res |> dim()
#> [1] 22 43
res |>
dplyr::select(input_name,
matched_taxon_name,
accepted_taxon_name,
taxon_status) |>
as.data.frame()
#> input_name matched_taxon_name
#> 1 Aniba heterotepala Aniba heterotepala
#> 2 Anthurium quipuscoae Anthurium quipuscoae
#> 3 Centropogon reflexus Centropogon reflexus
#> 4 Chuquiraga jonhstonii Chuquiraga johnstonii
#> 5 Cyathea carolinae Cyathea carolinae
#> 6 Ditassa violascens Ditassa violascens
#> 7 Cleistocactus fieldianus Cleistocactus fieldianus
#> 8 Opuntia yanganucensis Opuntia yanganucensis
#> 9 Epidendrum trachydipterum Epidendrum trachydipterum
#> 10 Hebeclinium hylophorbum Hebeclinium hylophorbum
#> 11 Jaltometa sagastegui Jaltomata sagastegui
#> 12 Lepechinia tomentosa Lepechinia tomentosa
#> 13 Lupinus cookianos Lupinus cookianus
#> 14 Oxalis hochreutinerii Oxalis hochreutineri
#> 15 Passiflora heterohelix Passiflora heterohelix
#> 16 Peperomia arborigaudens Peperomia arborigaudens
#> 17 Piper setulosum Piper setulosum
#> 18 Pycnophyllum aristattum Pycnophyllum aristatum
#> 19 Salvia subscadens Salvia subscandens
#> 20 Stellaria macbridei Stellaria macbridei
#> 21 Stemodia piurenses Stemodia piurensis
#> 22 Weberbauerella brongnartioides Weberbauerella brongniartioides
#> accepted_taxon_name taxon_status
#> 1 Aniba heterotepala accepted
#> 2 Anthurium quipuscoae accepted
#> 3 Centropogon reflexus accepted
#> 4 Chuquiraga johnstonii accepted
#> 5 Cyathea carolinae accepted
#> 6 Ditassa violascens accepted
#> 7 Borzicactus fieldianus synonym
#> 8 Austrocylindropuntia floccosa synonym
#> 9 Epidendrum trachydipterum accepted
#> 10 Hebeclinium hylophorbum accepted
#> 11 Jaltomata sagastegui accepted
#> 12 Lepechinia tomentosa accepted
#> 13 Lupinus cookianus accepted
#> 14 Oxalis hochreutineri accepted
#> 15 Passiflora heterohelix accepted
#> 16 Peperomia arborigaudens accepted
#> 17 Piper setulosum accepted
#> 18 Pycnophyllum aristatum accepted
#> 19 Salvia subscandens accepted
#> 20 Stellaria macbridei accepted
#> 21 Stemodia piurensis accepted
#> 22 Weberbauerella brongniartioides acceptedwcvp_matching() follows a staged pipeline:
wcvp_direct_match()wcvp_genus_match()wcvp_fuzzy_match_genus()wcvp_direct_match_species_within_genus()wcvp_suffix_match_species_within_genus()wcvp_fuzzy_match_species_within_genus()This staged design prioritizes strict evidence first, then progressively relaxes criteria to recover additional valid matches.
wcvp_matching() accepts:
classify_spnames()Genus and
SpeciesOptional columns:
Infra.Rank, Infraspecies for trinomial
matchingInput.Name to preserve original query textAuthor (from parser) as contextual information in
outputwcvp_matching() returns:
input_indexOrig.*) and matched
components (Matched.*)direct_match,
fuzzy_match_genus, etc.)matchedmatched_plant_name_idmatched_taxon_namematched_taxon_authorstaxon_statusaccepted_plant_name_idaccepted_taxon_nameaccepted_taxon_authorsis_accepted_nameWhen input genera have tied fuzzy candidates,
wcvp_matching() emits:
Multiple fuzzy matches for some genera (tied distances). The first match is selected.
Example with an FIA-like table (Orig.Genus,
Orig.Species):
fia
#> # A tibble: 2,169 × 2
#> Orig.Genus Orig.Species
#> <chr> <chr>
#> 1 Abies amabilis
#> 2 Abies balsamea
#> 3 Abies bracteata
#> 4 Abies concolor
#> 5 Abies fraseri
#> 6 Abies grandis
#> 7 Abies lasiocarpa
#> 8 Abies magnifica
#> 9 Abies procera
#> 10 Abies shastensis
#> # ℹ 2,159 more rows
fia_result <- fia |>
wcvp_matching(
allow_duplicates = TRUE,
max_dist = 2
) |>
dplyr::select(
input_name,
orig_genus,
matched_genus,
orig_species,
matched_species,
taxon_status,
accepted_taxon_name
)
#> Warning: ! Multiple fuzzy matches for some genera (tied distances).
#> ℹ The first match is selected.
fia_result
#> # A tibble: 2,169 × 7
#> input_name orig_genus matched_genus orig_species matched_species taxon_status
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Abies ama… Abies Abies amabilis amabilis accepted
#> 2 Abies bal… Abies Abies balsamea balsamea accepted
#> 3 Abies bra… Abies Abies bracteata bracteata accepted
#> 4 Abies con… Abies Abies concolor concolor accepted
#> 5 Abies fra… Abies Abies fraseri fraseri accepted
#> 6 Abies gra… Abies Abies grandis grandis accepted
#> 7 Abies las… Abies Abies lasiocarpa lasiocarpa accepted
#> 8 Abies mag… Abies Abies magnifica magnifica accepted
#> 9 Abies pro… Abies Abies procera procera accepted
#> 10 Abies sha… Abies Abies shastensis shastensis synonym
#> # ℹ 2,159 more rows
#> # ℹ 1 more variable: accepted_taxon_name <chr>Inspect ambiguous genus cases from the result attribute:
amb_g <- attr(fia_result, "ambiguous_genus")
amb_g
#> # A tibble: 4 × 28
#> # Groups: .row_id [2]
#> Orig.Genus Orig.Species genus_match Input.Name Orig.Name Author
#> <chr> <chr> <lgl> <chr> <chr> <chr>
#> 1 Howeia forsteriana FALSE Howeia forsteriana <NA> ""
#> 2 Howeia forsteriana FALSE Howeia forsteriana <NA> ""
#> 3 Hyeronima clusioides FALSE Hyeronima clusioides <NA> ""
#> 4 Hyeronima clusioides FALSE Hyeronima clusioides <NA> ""
#> # ℹ 22 more variables: Orig.Infraspecies <chr>, Infra.Rank <chr>, Rank <dbl>,
#> # has_cf <lgl>, has_aff <lgl>, is_sp <lgl>, is_spp <lgl>, had_hybrid <lgl>,
#> # rank_late <lgl>, rank_missing_infra <lgl>, had_na_author <lgl>,
#> # implied_infra <lgl>, sorter <dbl>, input_index <int>, .dedup_key <chr>,
#> # direct_match <lgl>, Matched.Genus <chr>, Matched.Species <chr>,
#> # Matched.Infraspecies <chr>, Matched.Infra.Rank <chr>, .row_id <int>,
#> # fuzzy_genus_dist <dbl>Re-run only those ambiguous names for focused review:
amb_g |>
dplyr::ungroup() |>
dplyr::select(Orig.Genus, Orig.Species) |>
wcvp_matching(
allow_duplicates = TRUE,
max_dist = 2
) |>
dplyr::select(
input_name,
orig_genus,
matched_genus,
orig_species,
matched_species,
taxon_status,
accepted_taxon_name
)
#> Warning: ! Multiple fuzzy matches for some genera (tied distances).
#> ℹ The first match is selected.
#> # A tibble: 4 × 7
#> input_name orig_genus matched_genus orig_species matched_species taxon_status
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Howeia for… Howeia Howea forsteriana forsteriana accepted
#> 2 Howeia for… Howeia Howea forsteriana forsteriana accepted
#> 3 Hyeronima … Hyeronima Hieronyma clusioides clusioides accepted
#> 4 Hyeronima … Hyeronima Hieronyma clusioides clusioides accepted
#> # ℹ 1 more variable: accepted_taxon_name <chr>In this example, the ambiguous genus spellings (Howeia,
Hyeronima) are resolved to accepted genera
(Howea, Hieronyma) after species-level
checks.
For medium/large inputs, recommended settings are:
wcvp_matching(
parsed_df,
allow_duplicates = TRUE
)This typically reduces runtime by limiting candidate genera before species-level matching and avoids recomputing duplicated names.
classify_spnames()wcvp_matching()build_genus_index()prefilter_target_by_genus()wcvp_direct_match()wcvp_genus_match()wcvp_fuzzy_match_genus()wcvp_direct_match_species_within_genus()wcvp_fuzzy_match_species_within_genus()wcvp_suffix_match_species_within_genus()wcvpmatch builds on ideas used in the treemendous
matching workflow and extends them for WCVP-focused reconciliation,
richer parser diagnostics, and reproducible row-level traceability.