This package maps a group of derivative texts to their parent text, based on how frequently phrases from the parent text occur in the derivative texts. The parent text is highlighted according to this frequency, creating a ‘heatmap’ of popular phrases found in the derivative texts.
This example is taken from the initial description of a crime used in
a study of jury perception of algorithm use and demonstrative evidence.
The notepad_example data frame contains an ‘ID’ number
corresponding to a study participant, as well as their notes, labelled
as ‘Text’. The first six observations are shown below.
# load the library
library(highlightr)
library(knitr)
# View first 6 observations
knitr::kable(head(notepad_example))

| ID | Text |
|---|---|
| 121 | Richard Cole - charged with discharging firearm in business. // felony . NOT GUILTY. |
| 197 | Richard Cole - Def: Willfully discarge firearm in biz - Felony. Pleaded NG |
| 168 | willfully discharging firearm in a business - felony. not guilty |
| 131 | discharged firearm in business, intentionally |
| 77 | In this case, the defendant - Richard Cole - has been charged with willfully discharging a firearm in a place of business. This crime is a felony. |
| 24 | defendant - Richard Cole discharging a firearm in a place of business. pleaded not guilty. |
Additionally, the source document (or study transcript) is included
in notepad_example with an ID of ‘source’. The original
transcript is shown here:
study_transcript <- notepad_example[notepad_example$ID == "source",]$Text
knitr::kable(study_transcript)

| x |
|---|
| In this case, the defendant - Richard Cole - has been charged with willfully discharging a firearm in a place of business. This crime is a felony. Mr. Cole has pleaded not guilty to the charge. You will now read a summary of the case. This summary was prepared by an objective court clerk. It describes select evidence that was presented at trial. |
Fuzzy collocations are used to match the tokenized derivative
texts to the phrases in the tokenized source text. This function first
determines the number of times a collocation of length 5 occurs in
derivative texts, or participant notes on the case. Fuzzy (or indirect)
matches are then added to the frequency count of the source collocation
that is the closest match. These fuzzy matches are weighted based on the
Jaccard similarity between the source collocation and the indirect phrase:
\[
\frac{n \cdot d}{m}
\]
Here, \(n\) is the frequency of the fuzzy collocation, \(d\) is the Jaccard similarity between the fuzzy collocation and the source collocation (ranging from 0 to 1, where 1 indicates identical strings), and \(m\) is the number of closest matches for the fuzzy collocation. The total count is divided by the number of times a collocation occurs in the source document.
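As a rough illustration of the weighting above, the following base R sketch computes the contribution of a single fuzzy collocation. The helper `jaccard_words()` is hypothetical and not part of highlightr, the values of `n` and `m` are made up for the example, and the similarity is computed on whole words for simplicity; the package's own comparison may tokenize differently and give a different value of \(d\).

```r
# Hypothetical helper (not part of highlightr): Jaccard similarity on word sets.
# Assumption: words are the unit of comparison; the package may differ.
jaccard_words <- function(a, b) {
  set_a <- unique(strsplit(tolower(a), "\\s+")[[1]])
  set_b <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

source_colloc <- "willfully discharging a firearm in"
fuzzy_colloc  <- "willfully discharged a firearm in"

n <- 3  # made-up value: times the fuzzy collocation occurs in the notes
m <- 1  # made-up value: number of equally close source collocations
d <- jaccard_words(fuzzy_colloc, source_colloc)  # 4 shared words / 6 distinct words

# weighted count added to the source collocation's frequency
weighted_count <- n * d / m
weighted_count
```

Under these assumptions, four of the six distinct words are shared, so \(d = 2/3\) and the fuzzy collocation adds 2 to the count of the source collocation instead of the 3 that an exact match would add.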
The collocation_frequency() function attaches the
collocation counts to the full text of the transcript. The collocation
frequencies are averaged per word.
# connect collocation frequencies to source document
merged_frequency <- collocation_frequency(notepad_example, source_row=which(notepad_example$ID=="source"), text_column = "Text", fuzzy=TRUE)
#> Warning in join_func(a = a, b = b, by_a = by_a, by_b = by_b, block_by_a = block_by_a, : A pair of records at the threshold (0.7) have only a 95% chance of being compared.
#> Please consider changing `n_bands` and `band_width`.
knitr::kable(head(merged_frequency), digits=2)

| words | word_num | to_merge | col_1 | col_2 | col_3 | col_4 | col_5 | collocation | Freq |
|---|---|---|---|---|---|---|---|---|---|
| In | 1 | in | 6.96 | NA | NA | NA | NA | in this case the defendant | 6.96 |
| this | 2 | this | 7.00 | 6.96 | NA | NA | NA | this case the defendant richard | 6.98 |
| case, | 3 | case | 7.93 | 7.00 | 6.96 | NA | NA | case the defendant richard cole | 7.30 |
| the | 4 | the | 10.00 | 7.93 | 7.00 | 6.96 | NA | the defendant richard cole has | 7.97 |
| defendant | 5 | defendant | 10.00 | 10.00 | 7.93 | 7.00 | 6.96 | defendant richard cole has been | 8.38 |
| - | 6 |  | NA | NA | NA | NA | NA | NA | NaN |
The warning regarding the chance of comparisons for a threshold of
0.7 is generated from zoomerjoin::jaccard_right_join(). If
desired, threshold, n_bands, and
band_width can be adjusted via corresponding values in
collocation_frequency(). The output assigns the frequency
of each collocation to each word that occurs in that collocation. For
example, the first collocation in the description is “in this case the
defendant”, which occurs with a frequency of 6.96. This is the only
collocation in which the first word will appear, so this is the only
collocation value provided for the first word. The second word, “this”
appears in the next collocation as well: “this case the defendant
richard”, whose frequency is 7, and so on for all words in the
description. Collocations are weighted by the number of times they
appear in the transcript text.
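The relationship between collocations and per-word frequencies can be sketched in base R. This is only an illustration under stated assumptions: the tokenization (lower-casing and stripping punctuation) is inferred from the `to_merge` and `collocation` columns and is not the package's actual code, and the collocation frequencies are the rounded `col_1` values copied from the table above.

```r
# Assumed tokenization: lower-case, punctuation removed, split on whitespace
text <- "In this case, the defendant - Richard Cole - has been"
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]

# Sliding window of length k = 5 over the tokens
k <- 5
starts <- seq_len(length(tokens) - k + 1)
collocations <- vapply(
  starts,
  function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
  character(1)
)
collocations

# Frequencies of the collocations starting at words 1 to 5
# (rounded `col_1` values from the table above)
colloc_freq <- c(6.96, 7.00, 7.93, 10.00, 10.00)

# Word i belongs to the collocations that start at words max(1, i - k + 1) to i;
# its frequency is the mean over those collocations
word_freq <- vapply(
  seq_along(colloc_freq),
  function(i) mean(colloc_freq[max(1, i - k + 1):i]),
  numeric(1)
)

# Compare with the `Freq` column of the table, allowing for rounding
table_freq <- c(6.96, 6.98, 7.30, 7.97, 8.38)
all(abs(word_freq - table_freq) < 0.01)
```

The first five windows reproduce the `collocation` column, and the running means agree with the `Freq` column to within the rounding of the displayed values.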
The combined document is then fed through ggplot to assign gradient colors based on frequency, and the minimum and maximum values are recorded.
After colors have been assigned, HTML output for the
highlighted text is created based on frequency, along with a gradient
bar indicating the high and low values. The left side of each word’s
gradient indicates the previous word’s averaged
collocation frequency, while the right side indicates the current word’s
averaged collocation frequency. This HTML output can be rendered into
highlighted text by specifying `r page_highlight` in an R
Markdown document outside of a code chunk and knitting to HTML.
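The code that creates `page_highlight` is not shown in the steps so far. Following the pattern of the non-fuzzy and two-word examples later in this document, it would presumably look like the sketch below. The intermediate name `freq_plot` is an assumption chosen by analogy with `freq_plot_nonfuzzy`; only `page_highlight` is named in the text.

```r
library(highlightr)

# connect collocation frequencies to source document (fuzzy matching, as above)
merged_frequency <- collocation_frequency(
  notepad_example,
  source_row = which(notepad_example$ID == "source"),
  text_column = "Text",
  fuzzy = TRUE
)

# create a `ggplot` object of the transcript
# (`freq_plot` is an assumed name, by analogy with `freq_plot_nonfuzzy`)
freq_plot <- collocation_plot(merged_frequency)

# add html tags to source document
page_highlight <- highlighted_text(freq_plot)
```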
Alternatively, the xml2 package can be used to save the
output as an html file, as shown in the following code:
# load `xml2` library
library(xml2)
# save html output to desired location
xml2::write_html(xml2::read_html(page_highlight), "filename.html")

Non-fuzzy matching can also be used by removing the
fuzzy=TRUE argument. This will only include direct matches
between derivative documents and the parent document. In this case, the
highlighting pattern resembles the pattern produced when fuzzy matches are included,
but the maximum value reached is smaller. Note also that the colors used
in highlighting can be changed via the “colors” argument of the
collocation_plot function.
# connect collocation frequencies to source document
merged_frequency_nonfuzzy <- collocation_frequency(notepad_example, source_row=which(notepad_example$ID=="source"), text_column = "Text")
# create a `ggplot` object of the transcript, and change colors of the gradient
freq_plot_nonfuzzy <- collocation_plot(merged_frequency_nonfuzzy, colors=c("#15bf7e", "#fcc7ed"))
# add html tags to source document
page_highlight_nonfuzzy <- highlighted_text(freq_plot_nonfuzzy)

Additionally, the length of the collocation can be changed. The default collocation length (shown above) is 5 words. Below, this collocation length has been changed to 2 words.
In these shorter collocations, we can see that the collocation containing the name “Richard Cole” is popular, with a frequency of 89.
# connect collocation frequencies to source document
merged_frequency_2col <- collocation_frequency(notepad_example, source_row=which(notepad_example$ID=="source"), text_column = "Text", collocate_length = 2)
# create a `ggplot` object of the transcript
freq_plot_2col <- collocation_plot(merged_frequency_2col)
# add html tags to source document
page_highlight_2col <- highlighted_text(freq_plot_2col)