| Title: | Taxonomic Name Reconciliation Against the 'WCVP' Backbone |
| Version: | 0.0.1 |
| Description: | Standardizes and reconciles scientific plant names against a World Checklist of Vascular Plants ('WCVP')-style taxonomic backbone. The package parses names into taxonomic components and applies staged exact and fuzzy matching for binomial and trinomial inputs, including infraspecific rank-aware checks. It also returns accepted-name context and row-level matching flags to support reproducible, auditable preprocessing for downstream biodiversity, spatial, and trait analyses. A user-supplied backbone can be passed through 'target_df'; when the optional companion package 'wcvpdata' is installed, its default checklist can also be used. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | assertthat, cli, dplyr, fozziejoin, lifecycle, magrittr, memoise, purrr, stringdist, stringr, tibble, tidyr |
| Depends: | R (≥ 4.1.0) |
| LazyData: | true |
| LazyDataCompression: | xz |
| Suggests: | rlang (≥ 1.0.0), testthat (≥ 3.0.0), wcvpdata |
| Additional_repositories: | https://paulesantos.r-universe.dev |
| Config/testthat/edition: | 3 |
| URL: | https://github.com/PaulESantos/wcvpmatch |
| BugReports: | https://github.com/PaulESantos/wcvpmatch/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-03-18 02:22:55 UTC; PC |
| Author: | Paul Efren Santos Andrade [aut, cre, cph] |
| Maintainer: | Paul Efren Santos Andrade <paulefrens@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-23 09:10:03 UTC |
wcvpmatch: Taxonomic Name Reconciliation Against the 'WCVP' Backbone
Description
Standardizes and reconciles scientific plant names against a World Checklist of Vascular Plants ('WCVP')-style taxonomic backbone. The package parses names into taxonomic components and applies staged exact and fuzzy matching for binomial and trinomial inputs, including infraspecific rank-aware checks. It also returns accepted-name context and row-level matching flags to support reproducible, auditable preprocessing for downstream biodiversity, spatial, and trait analyses. A user-supplied backbone can be passed through 'target_df'; when the optional companion package 'wcvpdata' is installed, its default checklist can also be used.
Author(s)
Maintainer: Paul Efren Santos Andrade paulefrens@gmail.com [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/PaulESantos/wcvpmatch/issues
Build a Genus Index for Fast Prefiltering
Description
Creates a compact genus-level index from the target backbone. The index stores
one row per genus and a list-column with candidate plant_name_id values
associated with each genus.
If plant_name_id is not present in target_df, a surrogate integer ID is
created to keep the index usable with custom backbones.
Usage
build_genus_index(target_df = NULL)
Arguments
target_df |
Optional custom target table. If |
Value
A tibble with columns:
- genus
Genus name (character).
- plant_name_id
List-column of unique IDs per genus.
- n_records
Number of IDs per genus.
- genus_nchar
Number of characters in the genus name.
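The index layout described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy `target_df`, its `genus` column name, and the use of `split()` are assumptions made only for illustration.

```r
# Hypothetical toy backbone; real backbones have many more columns.
target_df <- data.frame(
  plant_name_id = c(101L, 102L, 103L, 201L),
  genus = c("Opuntia", "Opuntia", "Opuntia", "Rosa"),
  stringsAsFactors = FALSE
)

# Surrogate integer ID when plant_name_id is absent, as the description states.
if (!"plant_name_id" %in% names(target_df)) {
  target_df$plant_name_id <- seq_len(nrow(target_df))
}

# One entry per genus, each holding the unique IDs recorded for that genus.
ids_by_genus <- lapply(split(target_df$plant_name_id, target_df$genus), unique)

genus_index <- data.frame(genus = names(ids_by_genus), stringsAsFactors = FALSE)
genus_index$plant_name_id <- unname(ids_by_genus)        # list-column
genus_index$n_records <- unname(lengths(ids_by_genus))   # IDs per genus
genus_index$genus_nchar <- nchar(genus_index$genus)      # genus name length

genus_index
```

Storing the IDs in a list-column keeps the index at one row per genus, so a genus lookup returns all candidate records in a single step.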
Examples
library(wcvpmatch)
build_genus_index()
Classify Scientific Plant Names into Taxonomic Components
Description
Parse and classify scientific plant names into taxonomic components: genus, specific epithet, infraspecific rank, infraspecific epithet, and author.
Output is aligned to a backbone convention:
- Orig.Genus in Title Case (first letter uppercase, rest lowercase).
- Orig.Species and Orig.Infraspecies epithets in lowercase.
- Infra.Rank in lowercase (subsp., var., subvar., f., subf.).
- Author is recovered from the input and preserved in its original casing/punctuation (no forced uppercasing).
- Orig.Name is reconstructed as: genus + species + (rank + infra) + author.
Robustness rules:
- cf./aff. are removed from parsing but preserved as flags (has_cf, has_aff).
- Hybrid markers (x/×) as standalone tokens are removed with had_hybrid = TRUE.
- sp./spp. triggers genus-only classification (Rank = 1, Orig.Species = NA) and sets is_sp/is_spp.
- If an infraspecific rank is present but the infraspecific epithet is missing, sets rank_missing_infra = TRUE and keeps Infra.Rank while Orig.Infraspecies = NA.
- If rank appears "late" (after author-like tokens), parsing is best-effort and rank_late = TRUE.
- If there is no explicit rank and a third token exists, the function can infer an unranked infraspecific epithet when the third token looks epithet-like (all lowercase), and does not look like the start of an author. In that case implied_infra = TRUE, Orig.Infraspecies is filled, Infra.Rank = NA, and Rank = 3.
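A few of the rules above (qualifier flags, hybrid markers, casing, and the sp./spp. rule) can be sketched in base R. The helper `parse_simple()` is hypothetical and deliberately incomplete: it ignores authors and infraspecific ranks and is not how classify_spnames() is implemented.

```r
# Simplified, hypothetical parser covering only a subset of the rules.
parse_simple <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  lower <- tolower(tokens)

  # Qualifiers and standalone hybrid markers are flagged, then dropped.
  has_cf <- any(lower %in% c("cf", "cf."))
  has_aff <- any(lower %in% c("aff", "aff."))
  had_hybrid <- any(tokens %in% c("x", "\u00D7"))
  drop_token <- lower %in% c("cf", "cf.", "aff", "aff.") |
    tokens %in% c("x", "\u00D7")
  tokens <- tokens[!drop_token]

  # Genus in Title Case, specific epithet in lowercase.
  genus <- paste0(toupper(substr(tokens[1], 1, 1)),
                  tolower(substring(tokens[1], 2)))
  second <- if (length(tokens) >= 2) tolower(tokens[2]) else NA_character_

  # sp./spp. triggers genus-only classification.
  is_sp <- identical(second, "sp.")
  is_spp <- identical(second, "spp.")
  species <- if (is_sp || is_spp) NA_character_ else second

  list(
    Orig.Genus = genus,
    Orig.Species = species,
    Rank = if (is.na(species)) 1 else 2,
    has_cf = has_cf, has_aff = has_aff,
    is_sp = is_sp, is_spp = is_spp,
    had_hybrid = had_hybrid
  )
}

parse_simple("OPUNTIA cf. Ficus-indica")
parse_simple("opuntia sp.")
```

Keeping the qualifiers as flags rather than discarding them silently means the uncertainty expressed in the original name can still be audited after matching.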
Usage
classify_spnames(splist)
Arguments
splist |
Character vector. Scientific plant names. |
Value
A tibble with one row per input name and standardized columns/flags:
- sorter
Numeric index of original order.
- Input.Name
Original input string as provided by user.
- Orig.Name
Reconstructed standardized name aligned to backbone + original-cased author.
- Orig.Genus
Genus in Title Case.
- Orig.Species
Specific epithet in lowercase, or NA for genus-only (sp./spp.).
- Author
Recovered author string (original casing/punctuation) or "".
- Orig.Infraspecies
Infraspecific epithet in lowercase (ranked or implied), or NA.
- Infra.Rank
Infraspecific rank in lowercase (subsp., var., subvar., f., subf.), or NA.
- Rank
Numeric level: 1 genus-only, 2 genus+species, 3 includes infraspecific epithet.
- has_cf, has_aff, is_sp, is_spp, had_hybrid, rank_late, rank_missing_infra, had_na_author, implied_infra
Logical flags.
Examples
library(wcvpmatch)
classify_spnames(c("Opuntia sp.", "Rosa canina subsp. coriifolia (Fr.) Leffler"))
classify_spnames(c("Cydonia japonica tricolor")) # implied unranked infra epithet
Direct Match Infraspecific Rank within Species
Description
Direct Match Infraspecific Rank within Species
Usage
direct_match_infra_rank_within_species(df, target_df = NULL)
Cleaned Master Tree Species List from FIA
Description
A cleaned dataset containing tree species recorded by the
Forest Inventory and Analysis (FIA) program of the U.S. Forest Service.
This dataset is used in the examples and README of the wcvpmatch
package. The data were downloaded in November 2022 from the official
webpage of the Forest Inventory and Analysis National Program
and were originally used during the development of the treemendous
package. For wcvpmatch, the variable names have been standardized
to Orig.Genus and Orig.Species.
Usage
fia
Format
A data frame with 2169 rows and 2 variables:
- Orig.Genus
Genus name of the species binomial
- Orig.Species
Specific epithet of the species binomial
Fuzzy Match Infraspecific Epithet within Species
Description
Fuzzy Match Infraspecific Epithet within Species
Usage
fuzzy_match_infraspecies_within_species(
df,
target_df = NULL,
max_dist = 1,
method = "osa"
)
Prefilter Target Backbone by Input Genera (Exact + Fuzzy)
Description
Reduces the target backbone to genera relevant for the current input names.
This is designed as a pre-step before wcvp_matching() to reduce search space.
Strategy:
- Exact genus candidates are always included.
- Optional fuzzy genus candidates are included when include_fuzzy = TRUE.
- Returned object preserves the standard target schema used by the package.
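The strategy above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the toy tables and the `genus` column name are hypothetical, fuzzy candidates are searched only for genera without an exact match, and `adist()` computes Levenshtein distance, which differs from the "osa" method in that an adjacent transposition costs 2 rather than 1.

```r
# Hypothetical input names and toy backbone.
df <- data.frame(Genus = c("Opuntia", "Rosaa"), stringsAsFactors = FALSE)
target_df <- data.frame(
  genus = c("Opuntia", "Opuntia", "Rosa", "Quercus"),
  species = c("ficus-indica", "stricta", "canina", "robur"),
  stringsAsFactors = FALSE
)
include_fuzzy <- TRUE
max_dist <- 1

target_genera <- unique(target_df$genus)
input_genera <- unique(df$Genus)

# Exact genus candidates are always included.
exact_genera <- intersect(input_genera, target_genera)

# Fuzzy candidates: backbone genera within max_dist of an unmatched input genus.
fuzzy_genera <- character(0)
unmatched <- setdiff(input_genera, exact_genera)
if (include_fuzzy && length(unmatched) > 0) {
  d <- adist(unmatched, target_genera)  # rows: inputs, columns: backbone genera
  fuzzy_genera <- target_genera[colSums(d <= max_dist) > 0]
}

# Keep every column of the backbone; only the rows are reduced.
candidate_genera <- union(exact_genera, fuzzy_genera)
prefiltered <- target_df[target_df$genus %in% candidate_genera, , drop = FALSE]
attr(prefiltered, "candidate_genera") <- candidate_genera

prefiltered
```

Because only rows are dropped, the reduced table keeps the same schema as the full backbone and can be passed on wherever a `target_df` is expected.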
Usage
prefilter_target_by_genus(
df,
target_df = NULL,
genus_index = NULL,
include_fuzzy = TRUE,
max_dist = 1,
method = "osa"
)
Arguments
df |
Input tibble/data.frame with either |
target_df |
Optional custom target table. If |
genus_index |
Optional pre-built index from |
include_fuzzy |
Logical. If |
max_dist |
Maximum fuzzy distance for genus matching (used when |
method |
String distance method passed to |
Value
A prefiltered target_df tibble compatible with wcvp_matching(target_df = ...).
Attributes:
- candidate_genera
Character vector of selected genera.
- exact_genera
Character vector of exact matched genera.
- fuzzy_genera
Character vector of fuzzy matched genera.
Examples
library(wcvpmatch)
df <- data.frame(Genus = "Opuntia", Species = "yanganucensis")
prefilter_target_by_genus(df)
Direct Match Species & Genus Binomial or Trinomial names
Description
Tries to directly match binomial (Genus + Species) or trinomial (Genus + Species + Rank + Infraspecies) names against the 'WCVP' data.
Usage
wcvp_direct_match(df, target_df = NULL)
Arguments
df |
|
target_df |
Optional custom target table. If |
Value
Returns a tibble containing the original columns plus Matched.Genus, Matched.Species, Matched.Infra.Rank, and Matched.Infraspecies, together with the additional logical column direct_match, indicating whether the name was successfully matched (TRUE) or not (FALSE).
Examples
library(wcvpmatch)
# Simple binomial match
df_parsed <- classify_spnames("Opuntia yanganucensis")
wcvp_direct_match(df_parsed)
Direct Match Species within Genus
Description
Tries to directly match the specific epithet within an already matched genus in 'WCVP'.
Usage
wcvp_direct_match_species_within_genus(df, target_df = NULL)
Arguments
df |
|
target_df |
Optional custom target table. If |
Value
Returns a tibble with the additional logical column direct_match_species_within_genus, indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
Fuzzy Match Genus Name
Description
Tries to fuzzy match the genus name to the 'WCVP' table (using the optional wcvpdata checklist by default when available).
Usage
wcvp_fuzzy_match_genus(df, target_df = NULL, max_dist = 1, method = "osa")
Arguments
df |
|
target_df |
Optional custom target table. If |
max_dist |
Maximum edit distance used for fuzzy genus matching. |
method |
String distance method passed to |
Value
Returns a tibble with the additional logical column fuzzy_match_genus, indicating whether the genus was successfully matched (TRUE) or not (FALSE).
In addition, the column fuzzy_genus_dist gives the string distance for each match.
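To make the role of max_dist concrete, the following base R sketch compares one misspelled genus with a hypothetical list of backbone genera. It uses `adist()` (Levenshtein distance) as a stand-in; the package's "osa" method additionally counts an adjacent transposition as a single edit, so the two can differ for swapped letters. The output column names simply mirror those mentioned in this manual.

```r
# Hypothetical inputs for illustration only.
input_genus <- "Opuntiaa"
backbone_genera <- c("Opuntia", "Rosa", "Quercus")
max_dist <- 1

# Edit distance from the input genus to every backbone genus.
d <- drop(adist(input_genus, backbone_genera))

# Keep candidates within the allowed distance and report that distance.
keep <- d <= max_dist
hits <- data.frame(
  Matched.Genus = backbone_genera[keep],
  fuzzy_genus_dist = d[keep],
  stringsAsFactors = FALSE
)

hits
```

A small max_dist such as 1 tolerates a single typo while keeping the risk of matching an unrelated genus low.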
Examples
library(wcvpmatch)
df <- data.frame(Orig.Genus = "Opuntiaa", Orig.Species = "yanganucensis")
wcvp_fuzzy_match_genus(df)
Fuzzy Match Species within Genus
Description
Tries to fuzzy match the species epithet within a matched genus against 'WCVP' (using the optional wcvpdata checklist by default when available).
Usage
wcvp_fuzzy_match_species_within_genus(
df,
target_df = NULL,
max_dist = 1,
method = "osa"
)
Arguments
df |
|
target_df |
Optional custom target table. If |
max_dist |
Maximum edit distance used for fuzzy species matching within genus. |
method |
String distance method passed to |
Value
Returns a tibble with the additional logical column fuzzy_match_species_within_genus, indicating whether the specific epithet was successfully fuzzy matched within the matched genus (TRUE) or not (FALSE).
Match Genus name
Description
Tries to match the genus name to the 'WCVP' table (using the optional wcvpdata checklist by default when available).
Usage
wcvp_genus_match(df, target_df = NULL)
Arguments
df |
|
target_df |
Optional custom target table. If |
Value
Returns a tibble with the additional logical column genus_match, indicating whether the genus was successfully matched (TRUE) or not (FALSE).
Match Scientific Names Against WCVP
Description
Runs a matching pipeline with exact and partial matching for binomial and trinomial names, including infraspecific rank validation.
Usage
wcvp_matching(
df,
target_df = NULL,
prefilter_genus = TRUE,
allow_duplicates = FALSE,
max_dist = 1,
method = "osa",
add_name_distance = FALSE,
name_distance_method = "osa",
profile = FALSE,
output_name_style = c("snake_case", "legacy")
)
Arguments
df |
Input tibble/data.frame with either |
target_df |
Optional custom target table. If |
prefilter_genus |
Logical. If |
allow_duplicates |
Logical. If |
max_dist |
Maximum distance used in all fuzzy matching stages (genus, species, infraspecies). |
method |
A string indicating the fuzzy matching method (passed to
|
add_name_distance |
Logical. If |
name_distance_method |
Method passed to |
profile |
Logical. If |
output_name_style |
Naming style for output columns:
|
Value
Tibble with matched names, process flags, and taxonomic context
columns: matched_plant_name_id, matched_taxon_name, taxon_status,
accepted_plant_name_id, accepted_taxon_name, is_accepted_name.
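One way the taxonomic context columns might be used downstream is sketched below. The result table is hand-built with placeholder names rather than produced by wcvp_matching(), and the interpretation of the columns (is_accepted_name is TRUE when the matched name is itself accepted, otherwise accepted_taxon_name holds the accepted name) is an assumption based on the column names.

```r
# Hypothetical, hand-built result table with placeholder taxon names.
res <- data.frame(
  matched_taxon_name = c("Genus alpha", "Genus betae", NA),
  taxon_status = c("Accepted", "Synonym", NA),
  accepted_taxon_name = c("Genus alpha", "Genus beta", NA),
  is_accepted_name = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Resolve every row to the accepted name; unmatched rows stay NA.
res$final_name <- ifelse(
  res$is_accepted_name,
  res$matched_taxon_name,
  res$accepted_taxon_name
)

res[, c("matched_taxon_name", "taxon_status", "final_name")]
```

Keeping the matched name next to the resolved one preserves an audit trail of which inputs were synonyms.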
Examples
library(wcvpmatch)
# Match a single name
wcvp_matching(data.frame(Genus = "Opuntia", Species = "yanganucensis"))
# Match multiple names with snake_case output
names <- c("Aniba heterotepala", "Anthurium quipuscoae")
df <- classify_spnames(names)
wcvp_matching(df, output_name_style = "snake_case")
# Attach per-stage timings for profiling
out <- wcvp_matching(df, output_name_style = "snake_case", profile = TRUE)
attr(out, "timings")
Check Default Backbone Setup
Description
Reports whether the optional companion package wcvpdata is available for
use as the default WCVP backbone and, if not, explains how to install it
from r-universe.
Usage
wcvp_setup_info(inform = TRUE)
Arguments
inform |
Logical. If |
Value
Invisibly returns a named list with setup status fields:
default_backbone_available, wcvpdata_installed,
wcvpdata_has_backbone, wcvpdata_version, repository, and
install_command.
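The kind of check this function reports can be approximated with base R alone. The sketch below is not the function's implementation; the r-universe URL is taken from the Additional_repositories field of this package, while the CRAN mirror and the wording of the messages are assumptions.

```r
# Check for the optional companion package without attaching it.
has_wcvpdata <- requireNamespace("wcvpdata", quietly = TRUE)

if (has_wcvpdata) {
  message("wcvpdata is installed, version ",
          as.character(utils::packageVersion("wcvpdata")))
} else {
  # Assumed install command; adjust the CRAN mirror as needed.
  message(
    "wcvpdata is not installed. One possible install command:\n",
    "install.packages(\"wcvpdata\", repos = c(",
    "\"https://paulesantos.r-universe.dev\", ",
    "\"https://cloud.r-project.org\"))"
  )
}
```

Using `requireNamespace()` rather than `library()` avoids an error when the optional package is missing, which suits packages listed under Suggests.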
Examples
library(wcvpmatch)
wcvp_setup_info()
Suffix Match Species within Genus
Description
Tries to match the specific epithet by exchanging common suffixes within an already matched genus in 'WCVP'.
The following suffixes are captured: c("a", "i", "is", "um", "us", "ae").
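The idea of exchanging suffixes can be sketched in base R as follows. The helper `suffix_variants()` and the candidate epithets are hypothetical, and the package may implement the exchange differently; only the suffix list is taken from the text above.

```r
suffixes <- c("a", "i", "is", "um", "us", "ae")

# Hypothetical helper: for each suffix the epithet ends with, strip it and
# append every other suffix from the list.
suffix_variants <- function(epithet) {
  variants <- character(0)
  for (s in suffixes) {
    if (endsWith(epithet, s)) {
      stem <- substr(epithet, 1, nchar(epithet) - nchar(s))
      variants <- c(variants, paste0(stem, setdiff(suffixes, s)))
    }
  }
  unique(variants)
}

# Hypothetical epithets recorded for the already matched genus.
epithets_in_genus <- c("yanganucensis", "ficus-indica")

variants <- suffix_variants("yanganucensa")
matched <- intersect(variants, epithets_in_genus)
matched
```

Restricting the comparison to epithets within the matched genus keeps the number of candidate variants small and avoids accidental matches in unrelated genera.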
Usage
wcvp_suffix_match_species_within_genus(df, target_df = NULL)
Arguments
df |
|
target_df |
Optional custom target table. If |
Value
Returns a tibble with the additional logical column suffix_match_species_within_genus, indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
Examples
library(wcvpmatch)
df <- data.frame(Orig.Genus = "Opuntia", Orig.Species = "yanganucensa", Matched.Genus = "Opuntia")
wcvp_suffix_match_species_within_genus(df)