regextable extracts regex-based pattern matches from a
data frame or character vector using a pattern lookup table. For each
input row, all matching patterns are returned, along with the matched
substring, an internal row identifier, and additional columns specified
in data_return_cols and regex_return_cols.
Optional metadata from the pattern table can also be included. Multiple
rows may be returned for a single text if it matches multiple
patterns.
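The one-row-per-match behaviour described above can be illustrated with a small base R sketch. This is not the package's implementation: `match_table()`, the toy `texts` and `patterns` tables, and the `id` column are hypothetical names chosen for illustration, and the case-insensitive matching is an assumption.

```r
# Hypothetical toy inputs; these are not the package's bundled datasets.
texts <- data.frame(
  text = c("Mr. GRAVES and Mr. UDALL spoke", "HON. JIM COSTA", "no names here"),
  stringsAsFactors = FALSE
)
patterns <- data.frame(
  pattern = c("sam graves|graves", "mark udall|udall", "jim costa|costa"),
  id = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Return one row per (text, pattern) pair that matches, with the matched substring.
match_table <- function(texts, patterns) {
  out <- list()
  for (i in seq_len(nrow(texts))) {
    for (j in seq_len(nrow(patterns))) {
      m <- regexpr(patterns$pattern[j], texts$text[i],
                   ignore.case = TRUE, perl = TRUE)
      if (as.integer(m) != -1L) {
        out[[length(out) + 1]] <- data.frame(
          row_id = i,                           # index of the input row
          text = texts$text[i],                 # original text
          id = patterns$id[j],                  # metadata from the pattern table
          pattern = patterns$pattern[j],        # the pattern that matched
          match = regmatches(texts$text[i], m), # the matched substring
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, out)
}

result <- match_table(texts, patterns)
```

In this sketch the first text matches two patterns and so contributes two rows, while the third text matches nothing and contributes none.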
Install and load the package:
For demonstration, we use two included datasets:

- members: A lookup table of regex patterns for member names.
- cr2007_03_01: A sample text dataset to search.

| congress | chamber | bioname | pattern | icpsr | state_abbrev | district_code | first_name | last_name |
|---|---|---|---|---|---|---|---|---|
| 110 | President | BUSH, George Walker | george bush\|george walker bush\|bush\|george w bush\|bush\|(^\|senator \|representative )bush\|bush, george\|bush george\|bush, g\|president bush\|g w bush | 99910 | USA | 0 | George | BUSH |
| 110 | House | BONNER, Jr., Josiah Robins (Jo) | josiah bonner\|josiah josiah robins bonner\|bonner\|josiah j bonner\|jo bonner\|jo josiah robins bonner\|jo j bonner\|(^\|senator \|representative )bonner\|bonner, jo\|bonner, josiah\|bonner josiah\|bonner, j\|representative bonner\|j j bonner | 20300 | AL | 1 | Josiah | BONNER |
| 110 | House | ROGERS, Mike Dennis | mike rogers\|mike dennis rogers\|rogers.{1,4}al\|mike d rogers\|michael rogers\|michael dennis rogers\|michael d rogers\|(^\|senator \|representative )rogers{1,4}al\|rogers, michael\|rogers, mike\|rogers mike\|representative rogers{1,4}al\|m d rogers | 20301 | AL | 3 | Mike | ROGERS |
| 110 | House | DAVIS, Artur | artur davis\|davis\|(^\|senator \|representative )davis{1,4}al\|davis, artur\|davis artur\|davis, a\|representative davis{1,4}al | 20302 | AL | 7 | Artur | DAVIS |
| 110 | House | CRAMER, Robert E. (Bud), Jr. | robert cramer\|robert e cramer\|cramer\|bud cramer\|bud e cramer\|cramer\|(^\|senator \|representative )cramer\|cramer, bud\|cramer, robert\|cramer robert\|cramer, r\|cramer, b\|representative cramer\|r e cramer | 29100 | AL | 5 | Robert | CRAMER |

| date | text | header |
|---|---|---|
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. MARK UDALL;Mr. UDALL | INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. JAMES R. LANGEVIN;Mr. LANGEVIN | BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. JIM COSTA;Mr. COSTA | A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT |
The simplest use of extract():

```r
library(knitr)  # provides kable() for printing the result table

result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
kable(head(result))
```

| row_id | text | icpsr | pattern | match |
|---|---|---|---|---|
| 1 | HON. SAM GRAVES;Mr. GRAVES | 20124 | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | SAM GRAVES |
| 2 | HON. MARK UDALL;Mr. UDALL | 29906 | mark udall\|udall\|mark e udall\|udall\|(^\|senator \|representative )udall{1,4}co\|udall, mark\|udall mark\|udall, m\|representative udall{1,4}co\|m e udall | MARK UDALL |
| 3 | HON. JAMES R. LANGEVIN;Mr. LANGEVIN | 20136 | james langevin\|langevin\|james r langevin\|jim langevin\|jim r langevin\|(^\|senator \|representative )langevin\|langevin, jim\|langevin, james\|langevin james\|langevin, j\|representative langevin\|j r langevin | james r langevin |
| 4 | HON. JIM COSTA;Mr. COSTA | 20501 | jim costa\|costa\|james costa\|(^\|senator \|representative )costa\|costa, james\|costa, jim\|costa jim\|costa, j\|representative costa | JIM COSTA |
| 5 | HON. SAM GRAVES;Mr. GRAVES | 20124 | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | SAM GRAVES |

Explanation:

- data: the text dataset to search.
- col_name: which column contains the text.
- regex_table: the lookup table of patterns.
- data_return_cols: additional columns from data to include in the result.
- regex_return_cols: additional columns from the pattern table to attach.

Each row in the output corresponds to a detected match and includes both the original text and the matching pattern.

extract() can also filter data by date, remove acronyms
(all-uppercase patterns with 2+ characters), and select specific output
columns. This is useful for more controlled extraction.

Explanation:

- date_col, date_start, date_end: filter rows by date.
- remove_acronyms: skip patterns like “NASA” or “USA”.

You can combine these filters with any subset of columns for flexible outputs.

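To make the two filters concrete, here is a base R sketch of the underlying idea. It is not the package's code: `filter_by_date()`, `drop_acronyms()`, and the toy `docs` and `patterns` objects are hypothetical, and treating both date bounds as inclusive is an assumption.

```r
# Hypothetical toy inputs for illustration only.
docs <- data.frame(
  date = as.Date(c("2007-02-28", "2007-03-01", "2007-03-02")),
  text = c("first", "second", "third"),
  stringsAsFactors = FALSE
)
patterns <- c("NASA", "USA", "graves|sam graves", "A", "Udall")

# Keep rows whose date falls inside [date_start, date_end] (inclusive by assumption).
filter_by_date <- function(data, date_col, date_start, date_end) {
  d <- as.Date(data[[date_col]])
  data[d >= as.Date(date_start) & d <= as.Date(date_end), , drop = FALSE]
}

# Drop patterns that consist only of two or more uppercase letters.
drop_acronyms <- function(patterns) {
  patterns[!grepl("^[A-Z]{2,}$", patterns, perl = TRUE)]
}

kept_docs <- filter_by_date(docs, "date", "2007-03-01", "2007-03-01")
kept_patterns <- drop_acronyms(patterns)
```

Here only the 2007-03-01 row survives the date filter, and "NASA" and "USA" are dropped while the single-letter "A" and the mixed-case "Udall" are kept.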
extract() supports parallel processing via the cl parameter.
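What values cl accepts is not stated here, so the following is only a sketch of the general technique using base R's parallel package, not a call into regextable. The helper `matching_patterns()` and the toy inputs are hypothetical.

```r
library(parallel)

# Hypothetical toy inputs for illustration only.
texts <- c("Mr. GRAVES spoke", "HON. JIM COSTA", "Mr. UDALL rose", "no names here")
patterns <- c("graves", "costa", "udall")

# For one text, return the patterns that match it (case-insensitive).
# `patterns` is passed explicitly because workers do not share the caller's variables.
matching_patterns <- function(text, patterns) {
  hits <- vapply(patterns, grepl, logical(1), x = text, ignore.case = TRUE)
  unname(patterns[hits])
}

# Start two worker processes, spread the texts across them, then shut them down.
cl <- makeCluster(2)
hits <- tryCatch(
  parLapply(cl, texts, matching_patterns, patterns = patterns),
  finally = stopCluster(cl)
)
```

Each text is matched independently of the others, which is why this kind of work splits cleanly across workers; for small inputs the cost of starting the cluster can outweigh the gain.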
regextable is a tool for extracting data from text. By default, extract() handles text cleaning and efficient matching.