regextable

Shirlyn Dong

2026-02-02

Introduction

regextable extracts regex pattern matches from a data frame or character vector using a pattern lookup table. For each input row, it returns every matching pattern along with the matched substring, an internal row identifier, and any additional columns requested through data_return_cols (columns of the input data) and regex_return_cols (metadata columns of the pattern table). A single text that matches several patterns produces several output rows.
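To make the lookup-table idea concrete, the following base-R sketch matches every text against every pattern and emits one row per hit. It is an illustration only, not the package's implementation: the toy data frames texts and patterns and the helper match_lookup() are hypothetical names invented for this example.

```r
# Hypothetical toy inputs; these are not the package's datasets.
texts <- data.frame(
  text = c("senator smith spoke first", "jones and smith agreed", "no names here"),
  stringsAsFactors = FALSE
)
patterns <- data.frame(
  pattern = c("smith|senator smith", "jones"),
  id = c(101, 102),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (text row, pattern) pair that matches.
match_lookup <- function(texts, patterns) {
  rows <- list()
  for (i in seq_len(nrow(texts))) {
    for (j in seq_len(nrow(patterns))) {
      # regexpr() returns -1 when the pattern does not occur in the text.
      hit <- regexpr(patterns$pattern[j], texts$text[i], perl = TRUE)
      if (hit != -1) {
        rows[[length(rows) + 1]] <- data.frame(
          row_id = i,                                  # position of the text in the input
          text = texts$text[i],                        # column carried over from the data
          id = patterns$id[j],                         # metadata carried over from the lookup table
          pattern = patterns$pattern[j],
          match = regmatches(texts$text[i], hit),      # first matched substring
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

sketch <- match_lookup(texts, patterns)
print(sketch)
```

In this sketch the second text matches both patterns and therefore appears twice in the output, while the third text matches nothing and is dropped.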

Load the package, along with kableExtra for rendering tables (install both first if needed):

library(regextable)
library(kableExtra)

Data

For demonstration, we use two included datasets:

data("members")
kable(members)
congress chamber bioname pattern icpsr state_abbrev district_code first_name last_name
110 President BUSH, George Walker george bush|george walker bush|bush|george w bush|bush|(^|senator |representative )bush|bush, george|bush george|bush, g|president bush|g w bush 99910 USA 0 George BUSH
110 House BONNER, Jr., Josiah Robins (Jo) josiah bonner|josiah josiah robins bonner|bonner|josiah j bonner|jo bonner|jo josiah robins bonner|jo j bonner|(^|senator |representative )bonner|bonner, jo|bonner, josiah|bonner josiah|bonner, j|representative bonner|j j bonner 20300 AL 1 Josiah BONNER
110 House ROGERS, Mike Dennis mike rogers|mike dennis rogers|rogers.{1,4}al|mike d rogers|michael rogers|michael dennis rogers|michael d rogers|(^|senator |representative )rogers{1,4}al|rogers, michael|rogers, mike|rogers mike|representative rogers{1,4}al|m d rogers 20301 AL 3 Mike ROGERS
110 House DAVIS, Artur artur davis|davis|(^|senator |representative )davis{1,4}al|davis, artur|davis artur|davis, a|representative davis{1,4}al 20302 AL 7 Artur DAVIS
110 House CRAMER, Robert E. (Bud), Jr.  robert cramer|robert e cramer|cramer|bud cramer|bud e cramer|cramer|(^|senator |representative )cramer|cramer, bud|cramer, robert|cramer robert|cramer, r|cramer, b|representative cramer|r e cramer 29100 AL 5 Robert CRAMER

data("cr2007_03_01")
kable(subset(cr2007_03_01, select = -c(url, url_txt)))
date text header
2007-03-01 HON. SAM GRAVES;Mr. GRAVES RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35
2007-03-01 HON. MARK UDALL;Mr. UDALL INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35
2007-03-01 HON. JAMES R. LANGEVIN;Mr. LANGEVIN BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35
2007-03-01 HON. JIM COSTA;Mr. COSTA A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35
2007-03-01 HON. SAM GRAVES;Mr. GRAVES RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT

Text Cleaning

extract() cleans text by default, so the user does not need to call clean_text() manually. Cleaning standardizes spacing, punctuation, and capitalization, which makes regex pattern matching more reliable.

Example of clean_text():

text <- "  HELLO---WORLD  "
clean_text(text)
#> [1] "hello world"
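For readers curious what this kind of cleaning involves, here is a rough base-R approximation. It is a sketch under assumptions: clean_text_sketch() is a hypothetical stand-in written for this example, and the package's clean_text() may apply different or additional rules.

```r
# Hypothetical stand-in for illustration; the package's clean_text() may differ.
clean_text_sketch <- function(x) {
  x <- tolower(x)                      # standardize capitalization
  x <- gsub("[[:punct:]]+", " ", x)    # replace runs of punctuation with a space
  x <- gsub("\\s+", " ", x)            # collapse repeated whitespace
  trimws(x)                            # drop leading and trailing spaces
}

cleaned_demo <- clean_text_sketch("  HELLO---WORLD  ")
cleaned_record <- clean_text_sketch("HON. SAM GRAVES;Mr. GRAVES")
print(cleaned_demo)
print(cleaned_record)
```

Lowercasing and stripping punctuation in this way is what allows lowercase patterns such as those in members to match the uppercase names in the Congressional Record text.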

Basic Extraction

The simplest use of extract():

result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

kable(head(result))
row_id text icpsr pattern match
1 HON. SAM GRAVES;Mr. GRAVES 20124 samuel graves|graves|sam graves|(^|senator |representative )graves|graves, sam|graves, samuel|graves samuel|graves, s|representative graves SAM GRAVES
2 HON. MARK UDALL;Mr. UDALL 29906 mark udall|udall|mark e udall|udall|(^|senator |representative )udall{1,4}co|udall, mark|udall mark|udall, m|representative udall{1,4}co|m e udall MARK UDALL
3 HON. JAMES R. LANGEVIN;Mr. LANGEVIN 20136 james langevin|langevin|james r langevin|jim langevin|jim r langevin|(^|senator |representative )langevin|langevin, jim|langevin, james|langevin james|langevin, j|representative langevin|j r langevin james r langevin
4 HON. JIM COSTA;Mr. COSTA 20501 jim costa|costa|james costa|(^|senator |representative )costa|costa, james|costa, jim|costa jim|costa, j|representative costa JIM COSTA
5 HON. SAM GRAVES;Mr. GRAVES 20124 samuel graves|graves|sam graves|(^|senator |representative )graves|graves, sam|graves, samuel|graves samuel|graves, s|representative graves SAM GRAVES

Explanation:

- data: the text dataset to search.
- col_name: which column of data contains the text (omitted in the call above, which relies on the default).
- regex_table: the lookup table of patterns.
- data_return_cols: additional columns from data to include in the result.
- regex_return_cols: additional columns from the pattern table to attach.

Each row in the output corresponds to one detected match and includes both the original text and the matching pattern.

Advanced Usage

extract() can also filter data by date, remove acronyms (all-uppercase patterns with 2+ characters), and select specific output columns. This is useful for more controlled extraction.

Explanation:

- date_col, date_start, date_end: filter rows by date.
- remove_acronyms: skip patterns like “NASA” or “USA”.

You can combine these filters with any subset of columns for flexible outputs.
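The two filters can be pictured with plain base R. The sketch below is not the package's code: the toy objects docs and patterns are hypothetical, the date bounds are assumed to be inclusive, and the acronym test simply encodes the description above (all-uppercase, two or more characters).

```r
# Hypothetical toy data, not one of the package's datasets.
docs <- data.frame(
  date = as.Date(c("2007-02-28", "2007-03-01", "2007-03-02")),
  text = c("day before", "target day", "day after"),
  stringsAsFactors = FALSE
)

# Date filter: keep rows between date_start and date_end (assumed inclusive).
date_start <- as.Date("2007-03-01")
date_end <- as.Date("2007-03-01")
in_range <- docs$date >= date_start & docs$date <= date_end
docs_filtered <- docs[in_range, , drop = FALSE]

# Acronym filter: drop patterns made only of uppercase letters, length >= 2.
patterns <- c("NASA", "USA", "george bush", "A", "Nasa")
is_acronym <- grepl("^[[:upper:]]{2,}$", patterns)
patterns_kept <- patterns[!is_acronym]

print(docs_filtered)
print(patterns_kept)
```

Only the 2007-03-01 row survives the date filter, and "NASA" and "USA" are the only patterns removed; the single letter "A" and the mixed-case "Nasa" do not meet the acronym definition used here.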

Parallel Matching

extract() supports parallel processing via the cl parameter, which takes a cluster object such as one created by parallel::makeCluster():

library(parallel)
clust <- makeCluster(2)
result_parallel <- extract(
  data = cr2007_03_01,
  regex_table = members,
  cl = clust,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
stopCluster(clust)
head(result_parallel)
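As a general illustration of how a cluster can share out regex matching, the sketch below uses parLapply() from the parallel package on a toy character vector and checks that the parallel result equals the serial one. How extract() uses cl internally is not shown in this vignette, so this is only an assumption-level picture; find_match() and the toy texts vector are hypothetical names. Wrapping the work in tryCatch(..., finally = ) ensures the cluster is stopped even if matching fails.

```r
library(parallel)

# Hypothetical toy inputs.
texts <- c("senator smith spoke", "jones replied", "nothing here", "smith and jones")
pattern <- "smith|jones"

# Hypothetical helper: first matched substring, or NA when there is no match.
find_match <- function(x, pattern) {
  hit <- regexpr(pattern, x, perl = TRUE)
  if (hit == -1) NA_character_ else regmatches(x, hit)
}

# Serial baseline.
serial <- vapply(texts, find_match, character(1), pattern = pattern, USE.NAMES = FALSE)

# Same work distributed over two worker processes.
clust <- makeCluster(2)
parallel_res <- tryCatch(
  unlist(parLapply(clust, texts, find_match, pattern = pattern)),
  finally = stopCluster(clust)   # always release the workers
)

print(identical(serial, parallel_res))
```

For inputs as small as these, the cost of starting workers outweighs any speedup; parallel matching tends to pay off only with many rows or many patterns.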

Summary