regextable

Introduction

regextable extracts regex-based pattern matches from a data frame or character vector using a pattern lookup table. For each input row, all matching patterns are returned, along with the matched substring, an internal row identifier, and additional columns specified in data_return_cols and regex_return_cols. Optional metadata from the pattern table can also be included. Multiple rows may be returned for a single text if it matches multiple patterns.

After installing the package, load it together with kableExtra, which is used here to display tables:

library(regextable)
library(kableExtra)

Data

For demonstration, we use two included datasets:

members: A lookup table of regex patterns for member names.
cr2007_03_01: A sample text dataset to search.

data("members")
kable(members)

congress	chamber	bioname	pattern	icpsr	state_abbrev	district_code	first_name	last_name
110	President	BUSH, George Walker	george bush\|george walker bush\|bush\|george w bush\|bush\|(^\|senator \|representative )bush\|bush, george\|bush george\|bush, g\|president bush\|g w bush	99910	USA	0	George	BUSH
110	House	BONNER, Jr., Josiah Robins (Jo)	josiah bonner\|josiah josiah robins bonner\|bonner\|josiah j bonner\|jo bonner\|jo josiah robins bonner\|jo j bonner\|(^\|senator \|representative )bonner\|bonner, jo\|bonner, josiah\|bonner josiah\|bonner, j\|representative bonner\|j j bonner	20300	AL	1	Josiah	BONNER
110	House	ROGERS, Mike Dennis	mike rogers\|mike dennis rogers\|rogers.{1,4}al\|mike d rogers\|michael rogers\|michael dennis rogers\|michael d rogers\|(^\|senator \|representative )rogers{1,4}al\|rogers, michael\|rogers, mike\|rogers mike\|representative rogers{1,4}al\|m d rogers	20301	AL	3	Mike	ROGERS
110	House	DAVIS, Artur	artur davis\|davis\|(^\|senator \|representative )davis{1,4}al\|davis, artur\|davis artur\|davis, a\|representative davis{1,4}al	20302	AL	7	Artur	DAVIS
110	House	CRAMER, Robert E. (Bud), Jr.	robert cramer\|robert e cramer\|cramer\|bud cramer\|bud e cramer\|cramer\|(^\|senator \|representative )cramer\|cramer, bud\|cramer, robert\|cramer robert\|cramer, r\|cramer, b\|representative cramer\|r e cramer	29100	AL	5	Robert	CRAMER


data("cr2007_03_01")
kable(subset(cr2007_03_01, select = -c(url, url_txt)))

date	text	header
2007-03-01	HON. SAM GRAVES;Mr. GRAVES	RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35
2007-03-01	HON. MARK UDALL;Mr. UDALL	INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35
2007-03-01	HON. JAMES R. LANGEVIN;Mr. LANGEVIN	BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35
2007-03-01	HON. JIM COSTA;Mr. COSTA	A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35
2007-03-01	HON. SAM GRAVES;Mr. GRAVES	RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT

Text Cleaning

extract() cleans text by default, so you do not need to call clean_text() yourself. Cleaning standardizes spacing, punctuation, and capitalization, which makes regex pattern matching more reliable.

Example of clean_text():

text <- "  HELLO---WORLD  "
clean_text(text)
#> [1] "hello world"
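
To make the effect of cleaning concrete, here is a rough base-R approximation of the three steps named above (capitalization, punctuation, spacing). clean_sketch() is a hypothetical stand-in written for this illustration; it is not the package's clean_text() implementation and may differ from it on edge cases.

```r
# clean_sketch() is an illustrative helper, NOT regextable's clean_text()
clean_sketch <- function(x) {
  x <- tolower(x)                     # standardize capitalization
  x <- gsub("[[:punct:]]+", " ", x)   # replace runs of punctuation with a space
  x <- gsub("[[:space:]]+", " ", x)   # collapse repeated whitespace
  trimws(x)                           # drop leading and trailing whitespace
}

clean_sketch("  HELLO---WORLD  ")
#> [1] "hello world"
```

Under these assumptions, a record such as "HON. SAM GRAVES;Mr. GRAVES" becomes "hon sam graves mr graves", which is the form the lowercase, punctuation-free patterns in members are written to match.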

Basic Extraction

The simplest use of extract():

result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

kable(head(result))

row_id	text	icpsr	pattern	match
1	HON. SAM GRAVES;Mr. GRAVES	20124	samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves	SAM GRAVES
2	HON. MARK UDALL;Mr. UDALL	29906	mark udall\|udall\|mark e udall\|udall\|(^\|senator \|representative )udall{1,4}co\|udall, mark\|udall mark\|udall, m\|representative udall{1,4}co\|m e udall	MARK UDALL
3	HON. JAMES R. LANGEVIN;Mr. LANGEVIN	20136	james langevin\|langevin\|james r langevin\|jim langevin\|jim r langevin\|(^\|senator \|representative )langevin\|langevin, jim\|langevin, james\|langevin james\|langevin, j\|representative langevin\|j r langevin	james r langevin
4	HON. JIM COSTA;Mr. COSTA	20501	jim costa\|costa\|james costa\|(^\|senator \|representative )costa\|costa, james\|costa, jim\|costa jim\|costa, j\|representative costa	JIM COSTA
5	HON. SAM GRAVES;Mr. GRAVES	20124	samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves	SAM GRAVES

Explanation:

- data: the text dataset to search.
- col_name: the column of data that contains the text (not set in the call above, so its default is used).
- regex_table: the lookup table of patterns.
- data_return_cols: additional columns from data to include in the result.
- regex_return_cols: additional columns from the pattern table to attach.

Each row in the output corresponds to one detected match. In this example it includes the original text, the icpsr identifier, the matching pattern, and the matched substring.
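
The one-row-per-match shape of the output can be sketched in base R. The inputs below are made-up miniature tables, and match_lookup() is a hypothetical helper written for this illustration; it shows the general lookup-table idea, not how extract() is implemented.

```r
# Hypothetical miniature inputs (not the package's datasets)
texts <- data.frame(
  text = c("mr graves spoke first", "mr udall and mr graves replied"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  pattern = c("sam graves|graves", "mark udall|udall"),
  id = c(1L, 2L),
  stringsAsFactors = FALSE
)

# match_lookup() is an illustrative helper, NOT a regextable function.
# It tests every text against every pattern and keeps one row per hit.
match_lookup <- function(texts, lookup) {
  rows <- list()
  for (i in seq_len(nrow(texts))) {
    for (j in seq_len(nrow(lookup))) {
      # regmatches() returns character(0) when regexpr() finds no match
      hit <- regmatches(texts$text[i], regexpr(lookup$pattern[j], texts$text[i]))
      if (length(hit) == 1) {
        rows[[length(rows) + 1]] <- data.frame(
          row_id = i,
          text = texts$text[i],
          id = lookup$id[j],
          pattern = lookup$pattern[j],
          match = hit,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

result_sketch <- match_lookup(texts, lookup)
result_sketch
```

The second text matches both patterns, so it contributes two rows that share the same row_id. This is the behavior to expect when a single text mentions several entries of the lookup table.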

Advanced Usage

extract() can also filter data by date, remove acronyms (all-uppercase patterns with 2+ characters), and select specific output columns. This is useful for more controlled extraction.

Explanation:

- date_col, date_start, date_end: filter rows by date.
- remove_acronyms: skip patterns like “NASA” or “USA”.

You can combine these filters with any subset of columns for flexible outputs.
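
A call combining these options might look like the sketch below. The parameter names are the ones listed above; the argument values are assumptions: that the date column of cr2007_03_01 is named date (as in the table shown earlier), that date_start and date_end accept Date values, and that remove_acronyms takes a logical flag. Check ?extract before relying on these formats.

```r
library(regextable)

data("members")
data("cr2007_03_01")

# Assumed argument formats: Date bounds and a logical flag (see ?extract)
result_filtered <- extract(
  data = cr2007_03_01,
  regex_table = members,
  date_col = "date",                        # column holding each record's date
  date_start = as.Date("2007-03-01"),       # keep rows on or after this date
  date_end = as.Date("2007-03-01"),         # keep rows on or before this date
  remove_acronyms = TRUE,                   # skip all-uppercase patterns
  data_return_cols = c("text", "header"),
  regex_return_cols = c("icpsr", "last_name")
)

head(result_filtered)
```

Because every record in the sample dataset is dated 2007-03-01, this date window should keep all rows; narrowing or shifting it is what would reduce the input before matching.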

Parallel Matching

extract() supports parallel processing via the cl parameter:

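library(parallel)

# Start a local cluster with two worker processes
clust <- makeCluster(2)

result_parallel <- extract(
  data = cr2007_03_01,
  regex_table = members,
  cl = clust,  # cluster used for parallel processing
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

# Release the worker processes once matching is done
stopCluster(clust)

head(result_parallel)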

Summary

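regextable extracts regex pattern matches from text using a lookup table of patterns.
Use the included datasets (members and cr2007_03_01) to get started, or supply your own lookup tables.
By default, extract() cleans the text before matching; pass a cluster via cl to match in parallel.
Optional parameters (date_col, date_start, date_end, remove_acronyms, data_return_cols, regex_return_cols) control filtering and output columns.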