---
title: "Chapter 03: Kernel runners and wrappers — the glmbayes pattern"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 03: Kernel runners and wrappers — the glmbayes pattern}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, eval = FALSE}
library(opencltools)
```

# Introduction

Once you have OpenCL installed (Chapter 01) and understand how to assemble
a program from annotated library shards (Chapter 02), the next step is to
wire GPU computation into your R package in a maintainable way.

`glmbayes` provides the canonical reference implementation of this pattern.
It uses a **two-layer architecture** — a *kernel runner* that manages the
raw OpenCL API, and a *kernel wrapper* that adapts R inputs and outputs —
together with a "fail gracefully" strategy that falls back to CPU execution
whenever OpenCL is unavailable or slow.

This chapter describes both layers and the fail-gracefully pattern so you
can replicate them in your own package.  All source code referenced here
lives in `glmbayes`; see `?glmbayes::EnvelopeEval` for entry points.

---

# 1. Architecture overview

The two-layer design cleanly separates concerns:

| Layer | Name | File | Responsibility |
|-------|------|------|----------------|
| **Wrapper** | `f2_f3_opencl` | `kernel_wrappers.cpp` | Flatten R inputs, select kernel, assemble program, call runner, reshape outputs |
| **Runner** | `f2_f3_kernel_runner` | `kernel_runners.cpp` | Platform/device setup, buffer management, kernel launch, result readback |

### Call path

```
EnvelopeEval (R or C++)
    |-- if use_opencl && pilot/check passes:
    |       f2_f3_opencl()  ->  f2_f3_kernel_runner()
    \-- else:
            f2_f3_non_opencl()  ->  CPU family functions
```

`EnvelopeEval` receives a coefficient grid, design matrix, prior parameters,
and family/link specification.  It selects the GPU or CPU path and
dispatches accordingly.  The GPU path calls `f2_f3_opencl`, which assembles
the OpenCL program and invokes `f2_f3_kernel_runner`; the CPU path calls
`f2_f3_non_opencl`, which dispatches to the appropriate C++ function.

---

# 2. Kernel structure

Every family/link has a dedicated `.cl` file that implements the f2 (negative
log posterior) and f3 (gradient) computations for one model.  Each kernel
follows the same five-step pattern:

### Step 1 — Work-item mapping

```c
int j = get_global_id(0);
if (j >= m1) return;
```

One OpenCL work item handles one grid point `j`.  The total number of work
items equals `m1` (the size of the tangency grid), so all grid points are
evaluated in parallel.

### Step 2 — Prior term

```c
// tmp[k] = [P * (B_j - mu)]_k
double tmp[MAX_L2];
for (int k = 0; k < l2; ++k) {
  double acc = 0.0;
  for (int ell = 0; ell < l2; ++ell)
    acc += P[k*l2 + ell] * (B[j*l2 + ell] - mu[ell]);
  tmp[k] = acc;
}

// Quadratic form 0.5 * (B_j - mu)' * P * (B_j - mu)
double res_acc = 0.0;
for (int k = 0; k < l2; ++k)
  res_acc += (B[j*l2 + k] - mu[k]) * tmp[k];
res_acc *= 0.5;
```

### Step 3 — Prior gradient

```c
double g_loc[MAX_L2];
for (int k = 0; k < l2; ++k)
  g_loc[k] = tmp[k];   // P * (B_j - mu)
```

### Step 4 — Data loop

For each observation `i`, compute the linear predictor, apply the link
function, add the likelihood contribution to `res_acc`, and accumulate the
gradient.  This step calls the ported nmath functions (e.g. `dnorm4`,
`dgamma`, `dbinom_raw`, `pnorm5`):

```c
for (int i = 0; i < l1; ++i) {
  double dot = alpha[i];
  for (int k = 0; k < l2; ++k)
    dot += X[k*l1 + i] * B[j*l2 + k];

  // Gaussian example: -log dnorm(y | mean=dot, sd=1/sqrt(wt))
  double sd_i = 1.0 / sqrt(wt[i]);
  double ll   = dnorm4(y[i], dot, sd_i, 1);   // give_log = 1
  res_acc    -= ll;

  double resid = wt[i] * (dot - y[i]);
  for (int k = 0; k < l2; ++k)
    g_loc[k] += X[k*l1 + i] * resid;
}
```

### Step 5 — Write outputs

```c
qf[j] = res_acc;
for (int k = 0; k < l2; ++k)
  grad[k*m1 + j] = g_loc[k];   // column-major
```

`qf[j]` holds the negative log posterior for grid point `j`; `grad` holds
the gradient in column-major layout (dimension `l2 × m1`).

---

## 2.1 Supported family/link kernels in glmbayes

| Family | Link | Kernel file | nmath calls |
|--------|------|-------------|-------------|
| binomial | logit | `f2_f3_binomial_logit.cl` | `dbinom_raw` |
| binomial | probit | `f2_f3_binomial_probit.cl` | `dbinom_raw`, `dnorm4`, `pnorm5` |
| binomial | cloglog | `f2_f3_binomial_cloglog.cl` | `dbinom_raw`, `expm1` |
| Poisson | log | `f2_f3_poisson.cl` | *(uses OpenCL builtin `lgamma`)* |
| Gamma | inverse | `f2_f3_gamma.cl` | `dgamma` |
| Gaussian | identity | `f2_f3_gaussian.cl` | `dnorm4` |

Each file is annotated with `@calls_nmath`, `@depends_nmath`, and
`@all_depends_nmath` (see Chapter 02, §6) so the kernel runner loads only
the library shards it needs.

---

# 3. The kernel wrapper

The wrapper (`f2_f3_opencl` in `kernel_wrappers.cpp`) is an Rcpp-exported
function.  Its responsibilities are:

1. **Input flattening** — convert R `NumericMatrix` / `NumericVector` to
   contiguous `std::vector<double>` in the memory layout the runner expects
   (`flattenMatrix`, `copyVector` from `openclPort`).

2. **Kernel selection** — map `(family, link)` to a kernel name and
   kernel file path:
   ```cpp
   std::string kernel_name = "f2_f3_binomial_logit";
   std::string kernel_file = "src/f2_f3_binomial_logit.cl";
   ```

3. **Program assembly** — call `load_library_for_kernel()` for the minimal
   nmath subset, then concatenate:
   ```cpp
   std::string all_src =
       load_kernel_source("OPENCL.cl", pkg)
     + load_kernel_library("rmath", pkg)
     + load_kernel_library("dpq",   pkg)
     + load_library_for_kernel(kernel_file, nmath_dir, "all_depends_nmath")
     + load_kernel_source(kernel_file, pkg);
   ```

4. **Runner call** — invoke `f2_f3_kernel_runner(all_src, kernel_name, ...)`.

5. **Output reshaping** — wrap the flat `qf` and `grad` results into R
   `NumericVector` and `arma::mat` for return to R.

### Linking against opencltools

The wrapper can use `opencltools` C++ utilities by adding to the downstream
package's `DESCRIPTION`:

```
LinkingTo: opencltools, Rcpp, RcppArmadillo
```

This gives access to `openclPort.h` which declares `flattenMatrix`,
`copyVector`, `configureOpenCL`, and `load_library_for_kernel`.

---

# 4. The kernel runner

The runner (`f2_f3_kernel_runner` in `kernel_runners.cpp`) handles the
low-level OpenCL API.  Each call is stateless — it sets up a complete
OpenCL context, launches the kernel, reads back results, and releases
all resources:

```
1. clGetPlatformIDs / clGetDeviceIDs (CL_DEVICE_TYPE_DEFAULT)
2. clCreateContext / clCreateCommandQueueWithProperties
3. clCreateProgramWithSource(all_src) + clBuildProgram
4. clCreateKernel(kernel_name)
5. Create read/write buffers; copy input data to device
6. clEnqueueNDRangeKernel (global_size = m1, local_size = NULL)
7. clEnqueueReadBuffer for qf and grad
8. Sanity check: if both outputs are all-zero, throw (kernel build failure)
9. Release buffers, kernel, program, queue, context
```

The runner is GLM-specific (it knows the buffer layout for X, B, mu, P,
alpha, y, wt, qf, grad) but the OpenCL plumbing portion is fully generic.
The general utilities — kernel loading, device enumeration, `fp64` probing —
live in `openclPort` and are re-exported by `opencltools` for use by any
downstream package.

---

# 5. Fail gracefully — the dispatch pattern

The fail-gracefully pattern is the key architectural decision that makes
GPU acceleration optional in a downstream package.  The idea is that
**every** GPU-dispatching function has a CPU fallback of identical interface:

```cpp
// In EnvelopeEval.cpp (glmbayes)
if (use_opencl && opencl_pilot_ok) {
  result = f2_f3_opencl(X, B, mu, P, alpha, y, wt, family, link);
} else {
  result = f2_f3_non_opencl(X, B, mu, P, alpha, y, wt, family, link);
}
```

`f2_f3_non_opencl` is a plain C++ function that calls the same family
functions (`f2_f3_binomial_logit`, `f2_f3_gaussian`, etc.) without any
OpenCL involvement.

This pattern means:

- **`has_opencl()` returning `FALSE`** — the package still installs and
  works; `use_opencl` is silently ignored.
- **OpenCL build failure** (kernel compile error) — the runner throws; the
  caller catches and falls back to CPU.
- **Numerical mismatch** — the pilot can compare a small GPU result against
  the CPU result; if they differ beyond tolerance, fall back automatically.

---

# 6. The pilot pattern

For large workloads (e.g. a tangency grid with > 50,000 points),
`glmbayes` runs a **pilot** before committing to the full GPU evaluation.
The pilot:

1. **Warm-up** — run the GPU kernel on a tiny slice (10–20 points) to
   initialize the context and just-in-time (JIT) compile the program.
2. **Calibration** — run on slices of ~1% and ~2% of the grid, time each.
3. **Extrapolation** — `refined_est_total_sec = per_grid_sec × m1`.
4. **5-minute safeguard** — if `refined_est_total_sec > 300`:
   - **Interactive session**: prompt "Do you want to continue? [y/N]"
   - **Non-interactive session**: proceed automatically (CI / batch jobs)

This pattern is reusable in any package that dispatches large parallel
workloads to a GPU.  The 5-minute threshold and prompt wording can be
tuned to the specific workload.

---

# 7. Setting up OpenCL before first use

`glmbayes` calls `configureOpenCL()` during package initialization to probe
the OpenCL device for native `expm1` and `log1p` support.  The probe
compiles a tiny test kernel and checks the result:

- If the device computes `expm1(0)` accurately → define `USE_OPENCL_EXPM1`.
- If the device computes `log1p(0)` accurately → define `USE_OPENCL_LOG1P`.

These flags are passed as `-D` options to `clBuildProgram` and control
whether the nmath shims (`expm1.cl`, `log1p.cl`) use the native builtin or
the ported fallback.

```cpp
// In your package's initialization (e.g., called from .onLoad)
OpenCLConfig cfg = configureOpenCL();
// cfg.build_options contains -DUSE_OPENCL_EXPM1 etc. as appropriate
// Pass cfg.build_options to clBuildProgram
```

`configureOpenCL` is declared in `openclPort.h` and is available to any
package that lists `opencltools` under `LinkingTo`.

---

# Cross-references

- **Chapter 01** — OpenCL installation and `has_opencl()`
- **Chapter 02** — Program assembly, `load_library_for_kernel()`
- **`?load_kernel_source`**, **`?load_kernel_library`**
- **`?load_library_for_kernel`**
- **`glmbayes::EnvelopeEval`**, **`glmbayes::EnvelopeBuild`** — reference implementation

---

# References

Nygren, K. N., & Nygren, L. M. (2006). Likelihood subgradient densities.
*Journal of the American Statistical Association*, 101(475), 1144–1156.
<https://doi.org/10.1198/016214506000000357>

The `f2_f3_*` kernels documented here evaluate negative log-posterior and
gradient terms that feed the likelihood-subgradient envelope construction
from this paper.

Nygren, K. N. (2026). *opencltools: OpenCL Tools for R Package Developers*.
R package. `citation("opencltools")`.