| Type: | Package |
| Title: | Super Learning with Flexible Formulas |
| Version: | 0.0.1 |
| Description: | A functional-programming-based implementation of the super learner algorithm with an emphasis on supporting the use of formulas to specify learners. This approach offers several improvements over past implementations, including the ability to easily use random effects specified in formulas (like y ~ (age | strata) + ...); constructing a new learner is as simple as writing and passing a new function. The super learner algorithm was originally described in van der Laan et al. (2007) https://biostats.bepress.com/ucbbiostat/paper222/. |
| License: | MIT + file LICENSE |
| Depends: | R (≥ 4.2.0) |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Suggests: | MASS, ggplot2, knitr, palmerpenguins, rmarkdown, survival, testthat (≥ 3.0.0), withr |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| URL: | https://ctesta01.github.io/nadir/, https://github.com/ctesta01/nadir/ |
| BugReports: | https://github.com/ctesta01/nadir/issues |
| Imports: | dplyr, earth, future, future.apply, gbm, glmnet, hal9001, lifecycle, lme4, methods, mgcv, nnet, nnls, origami, randomForest, ranger, tibble, tidyr, VGAM, xgboost |
| NeedsCompilation: | no |
| Packaged: | 2026-02-18 00:35:22 UTC; cht180 |
| Author: | Christian Testa |
| Maintainer: | Christian Testa <ctesta@hsph.harvard.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-20 10:30:07 UTC |
nadir: Super Learning with Flexible Formulas
Description
A functional-programming-based implementation of the super learner algorithm with an emphasis on supporting the use of formulas to specify learners. This approach offers several improvements over past implementations, including the ability to easily use random effects specified in formulas (like y ~ (age | strata) + ...); constructing a new learner is as simple as writing and passing a new function. The super learner algorithm was originally described in van der Laan et al. (2007) https://biostats.bepress.com/ucbbiostat/paper222/.
Author(s)
Maintainer: Christian Testa ctesta@hsph.harvard.edu (ORCID)
Authors:
Nima Hejazi (ORCID) [thesis advisor]
See Also
Useful links:
https://ctesta01.github.io/nadir/
https://github.com/ctesta01/nadir/
Report bugs at https://github.com/ctesta01/nadir/issues
Add a Screener to a Learner
Description
Add a Screener to a Learner
Usage
add_screener(learner, screener, screener_extra_args = NULL)
Arguments
learner |
A learner to be modified by wrapping a screening stage on top of it. |
screener |
A screener to be added on top of the learner |
screener_extra_args |
Extra arguments to be passed to the screener |
Value
A modified learner that, when called on data and a formula, runs a screening stage before fitting the learner and returning a prediction function.
Examples
# construct a learner where variables with less than .6 correlation are screened out
lnr_glm_with_cor_60_thresholding <-
add_screener(
learner = lnr_glm,
screener = screener_cor,
screener_extra_args = list(threshold = .6)
)
# train that on the mtcars dataset; also check that extra arguments are properly passed to glm
lnr_glm_with_cor_60_thresholding(mtcars, formula = mpg ~ ., family = "gaussian")(mtcars)
# if we've screened out variables with low correlation to mpg, one such variable is qsec,
# so changing qsec shouldn't modify the predictions from our learned algorithm
mtcars_but_qsec_is_changed <- mtcars
mtcars_but_qsec_is_changed$qsec <- rnorm(n = nrow(mtcars))
identical(
lnr_glm_with_cor_60_thresholding(mtcars, formula = mpg ~ .)(mtcars),
lnr_glm_with_cor_60_thresholding(mtcars, formula = mpg ~ .)(mtcars_but_qsec_is_changed)
)
# earth version
lnr_earth_with_cor_60_thresholding <-
add_screener(
learner = lnr_earth,
screener = screener_cor,
screener_extra_args = list(threshold = .6)
)
lnr_earth_with_cor_60_thresholding(mtcars, formula = mpg ~ .)(mtcars)
identical(
lnr_earth_with_cor_60_thresholding(mtcars, formula = mpg ~ .)(mtcars),
lnr_earth_with_cor_60_thresholding(mtcars, formula = mpg ~ .)(mtcars)
)
# note that this 'test' does not pass for a learner like randomForest that has
# some randomness in its predictions.
Binary Learners in {nadir}
Description
- lnr_nnet
- lnr_rf_binary
- lnr_logistic
Details
The important thing to know about binary learners is that they
need to produce predictions of the probability that the outcome is 1 or TRUE.
Also, for binary outcomes, we should make sure to use
determine_weights_for_binary_outcomes in our calls to
super_learner(), which calculates the estimated probability of the observed
outcome (either 0 or 1) and then applies the negative log loss function.
This can be done automatically by declaring outcome_type = 'binary'
when calling super_learner().
See Also
density_learners learners
Examples
super_learner(
data = mtcars,
learners = list(logistic1 = lnr_logistic, logistic2 = lnr_logistic, lnr_rf_binary),
formulas = list(
.default = am ~ .,
logistic2 = am ~ mpg * hp + .),
outcome_type = 'binary'
)
Validate that a formula has a simple left-hand side
Description
For example, a complex left-hand side would be one that includes a transformation
like log(y) ~ x1 + x2 or, as is commonly done in survival modeling, a
survival outcome as in Surv(time, death) ~ x1 + x2. Both of these
examples are considered "complex" left-hand sides by nadir and are not
currently supported. This function simply checks that the left-hand side is
simple (as in, not complex) and returns TRUE in that case. An error is thrown
if the left-hand side is complex.
Usage
check_simple_lhs(formula)
Arguments
formula |
A formula to be checked to ensure its left-hand-side (dependent/outcome) variable is not complex. |
Value
Invisibly TRUE if okay; otherwise errors.
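A minimal usage sketch (not from the package manual; that check_simple_lhs is callable this way is an assumption):
check_simple_lhs(mpg ~ hp + wt)          # invisibly returns TRUE
try(check_simple_lhs(log(mpg) ~ hp))     # errors: the left-hand side is complex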
Compare Learners
Description
Compare learners using the specified loss_metric
Usage
compare_learners(sl_output, y_variable = NULL, loss_metric)
Arguments
sl_output |
Output from running super_learner(). |
y_variable |
A character vector indicating the outcome variable.
|
loss_metric |
A loss metric, like the mean-squared-error or negative-log-loss to be used in comparing the learners. A loss metric should take two (vector) arguments: predictions, and true outcomes, and produce a single statistic summarizing the performance of each learner. |
Value
A data.frame with the loss-metric on the held-out data for each learner.
Examples
sl_model <- super_learner(
data = mtcars,
learners = list(lm = lnr_lm, rf = lnr_rf, mean = lnr_mean),
formula = mpg ~ .)
compare_learners(sl_model)
sl_model <- super_learner(
data = mtcars,
learners = list(lnr_logistic, lnr_rf_binary, mean = lnr_mean),
formula = am ~ mpg,
outcome_type = 'binary')
compare_learners(sl_model)
Cross Validation Training/Validation Splits with Characters/Factor Columns
Description
Designed to handle cross-validation on models like randomForest, ranger, glmnet, etc., where the model matrix of newdata must match exactly the model matrix of the training dataset, this function answers the need: "the training datasets need to have every level of every discrete-type column that appears in the data."
Usage
cv_character_and_factors_schema(
data,
n_folds = 5,
cv_sl_mode = TRUE,
check_validation_datasets_too = TRUE
)
Arguments
data |
Data to use in training a super_learner. |
n_folds |
The number of cross-validation folds to use in constructing the training and validation splits. |
cv_sl_mode |
A binary (default: TRUE) indicator for whether the output
training/validation data lists will be used inside another layer of cross-validation, as in cv_super_learner. |
check_validation_datasets_too |
Enforce that the validation datasets
produced also have every level of every character / factor type column
present. This is particularly useful for learners like lnr_glmnet, where the model matrix of the prediction data must match the training data exactly. |
Details
The fundamental idea is to check if the unique levels of character and/or factor columns are represented in every training dataset.
Above and beyond this, this function is designed to support cv_super_learner,
which inherently involves two layers of cross-validation. As a result, more stringent
conditions are imposed when cv_sl_mode is enabled. For convenience, this
mode is enabled by default.
Value
A list of two lists ($training_data and $validation_data)
which are each lists of length n_folds. In each of those entries is a
data.frame that contains the nth training or validation fold of the data.
Examples
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
training_validation_splits <- cv_character_and_factors_schema(
palmerpenguins::penguins)
# we can see the population breakdown across all the training
# splits:
sapply(training_validation_splits$training_data, function(df) {
table(df$species)
})
# notably, none of them are empty! this is crucial for certain
# types of learning algorithms that must see all levels appear in the
# training data, like random forests.
# certain models like glmnet require that the prediction dataset
# newx have the _exact_ same shape as the training data, so it
# can be important that every level appears in the validation data
# as well. check that by looking into these types of tables:
sapply(training_validation_splits$validation_data, function(df) {
table(df$species)
})
# if you don't need this level of stringency, but you just want
# to make cv_splits where every level appears in the training_data,
# you can do so using the check_validation_datasets_too = FALSE
# argument.
penguins_small <- palmerpenguins::penguins[c(1:3, 154:156, 277:279), ]
penguins_small <- penguins_small[complete.cases(penguins_small),]
training_validation_splits <- cv_character_and_factors_schema(
penguins_small,
cv_sl_mode = FALSE,
n_folds = 5,
check_validation_datasets_too = FALSE)
sapply(training_validation_splits$training_data, function(df) {
table(df$species)
})
# now you can see plenty of non-appearing levels in the validation data:
sapply(training_validation_splits$validation_data, function(df) {
table(df$species)
})
}
Cross-Validation with Origami
Description
Cross-Validation with Origami
Usage
cv_origami_schema(
data = data,
n_folds = 5,
fold_fun = origami::folds_vfold,
cluster_ids = NULL,
strata_ids = NULL,
...
)
Arguments
data |
a data.frame (or similar) to split into training and validation datasets. |
n_folds |
The number of cross-validation folds to create. |
fold_fun |
An origami fold-generating function, such as origami::folds_vfold. |
cluster_ids |
A vector of cluster ids. Clusters are treated as a unit –
that is, all observations within a cluster are placed in either the
training or validation set. See origami::make_folds. |
strata_ids |
A vector of strata ids. Strata are balanced: insofar as
possible the distribution in the sample should be the same as the
distribution in the training and validation sets. See origami::make_folds. |
... |
Extra arguments to be passed to the fold-generating function. |
Value
A list of n_folds training_data and validation_data data.frames
Examples
# to use origami::folds_vfold behind the scenes, just tell nadir::super_learner
# you want to use cv_origami_schema.
sl_model <- super_learner(
data = mtcars,
formula = mpg ~ cyl + hp,
learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
cv_schema = cv_origami_schema
)
# if you want to use a different origami::folds_* function, pass it into cv_origami_schema
sl_model <- super_learner(
data = mtcars,
formula = mpg ~ cyl + hp,
learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
cv_schema = \(data, n_folds) {
cv_origami_schema(data, n_folds, fold_fun = origami::folds_loo)
}
)
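The cluster_ids argument can also be used directly; a sketch, assuming the output uses the same $training_data / $validation_data structure as the other cv schemas:
# keep all rows sharing a cyl value together in either training or validation:
splits <- cv_origami_schema(mtcars, n_folds = 3, cluster_ids = mtcars$cyl)
# each validation fold should contain whole cyl-clusters:
sapply(splits$validation_data, function(df) unique(df$cyl))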
Assign Data to One of n_folds Randomly and Produce Training/Validation Data Lists
Description
Each row in the data is assigned to one of 1:n_folds at random.
Then for each i in 1:n_folds, training_data[[i]]
comprises the data with sl_fold != i, i.e., capturing
roughly a (n_folds-1)/n_folds proportion of the data. The validation data
is a list of data frames, each comprising roughly a 1/n_folds proportion of the
data.
Usage
cv_random_schema(data, n_folds = 5)
Arguments
data |
a data.frame (or similar) to split into training and validation datasets. |
n_folds |
The number of folds to split the data into. |
Details
Since the assignment to folds is random, the proportions are not
exact or guaranteed and there is some variability in the size of
each training_data data frame, and likewise for the validation_data
data frames.
Value
A list of two lists ($training_data and $validation_data)
which are each lists of length n_folds. In each of those entries is a
data.frame that contains the nth training or validation fold of the data.
Examples
data(Boston, package = 'MASS')
training_validation_data <- cv_random_schema(Boston, n_folds = 3)
# take a look at what's in the output:
str(training_validation_data, max.level = 2)
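To see the fold-size variability described in the Details, one can tabulate row counts (a small follow-up sketch):
# fold sizes vary a bit because assignment to folds is random:
sapply(training_validation_data$training_data, nrow)
sapply(training_validation_data$validation_data, nrow)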
Cross-Validating a super_learner
Description
Produce a cross-validated estimate of risk (e.g., the cv-RMSE) for a super_learner specified by a closure that
accepts data and returns a super_learner prediction function.
Usage
cv_super_learner(
data,
learners,
formulas,
y_variable = NULL,
n_folds = 5,
determine_super_learner_weights = determine_super_learner_weights_nnls,
ensemble_or_discrete = "ensemble",
cv_schema = cv_random_schema,
outcome_type = "continuous",
extra_learner_args = NULL,
cluster_ids = NULL,
strata_ids = NULL,
weights = NULL,
loss_metric,
use_complete_cases = FALSE
)
Arguments
data |
Data to use in training a super_learner. |
learners |
A list of predictor/closure-returning-functions. See Details. |
formulas |
Either a single regression formula or a vector of regression formulas. |
y_variable |
Typically NULL; the outcome variable is inferred from the formulas, but it may be given explicitly as a character string. |
n_folds |
The number of cross-validation folds to use in constructing the training and validation splits. |
determine_super_learner_weights |
A function/method to determine the weights for each of the candidate learners. |
ensemble_or_discrete |
Defaults to 'ensemble'; set to 'discrete' to select the single best candidate learner instead of a weighted ensemble. |
cv_schema |
A function that takes data and n_folds and returns a list of training and validation datasets, like cv_random_schema. |
outcome_type |
One of 'continuous', 'binary', 'multiclass', or 'density'. |
extra_learner_args |
A list of equal length to the learners list, holding extra arguments (or NULL) for each learner. |
cluster_ids |
(default: null) If specified, clusters will either be entirely assigned to training or validation (not both) in each cross-validation split. |
strata_ids |
(default: null) If specified, strata are balanced across training and validation splits so that strata appear in both the training and validation splits. |
weights |
If specified, (per observation) weights are used to indicate that risk minimization across models (i.e., the meta-learning step) should be targeted to higher weight observations. |
loss_metric |
A loss metric function, like the mean-squared-error or negative-log-loss to be
used in evaluating the learners on held-out data and minimized through convex optimization.
A loss metric should take two (vector) arguments:
predictions, and true outcomes, and produce a single statistic summarizing the
performance of each learner. Defaults to the mean-squared-error |
use_complete_cases |
(default: FALSE) If TRUE, rows of the data with missing values are dropped before training. |
Details
The idea is that cv_super_learner splits the data into training/validation
splits, trains super_learner on each training split, and then
evaluates their predictions on the held-out validation data, calculating
a root-mean-squared-error on those held-out data.
This function prints a message if the loss_metric argument is
not set explicitly, letting the user know that the mean-squared-error will be
used by default. Pass loss_metric = nadir:::mse to
cv_super_learner() if you'd like to suppress this message, or use a
similar approach for the appropriate loss function depending on context.
Value
A list containing $trained_learners and $cv_loss which
respectively include 1) the trained super learner models on each fold of the data, their holdout predictions and,
2) the cross-validated estimate of the risk (expected loss) on held-out data.
Examples
cv_super_learner(
data = mtcars,
formula = mpg ~ cyl + hp,
learners = list(lnr_mean, lnr_lm))
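As noted in the Details, passing the loss metric explicitly suppresses the default-loss message; a sketch using the package-internal mse:
cv_super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(lnr_mean, lnr_lm),
  loss_metric = nadir:::mse)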
Apply Cross-Validation to a Super Learner Closure
Description
Taking an sl_closure, a function that trains a super learner on one
argument data and produces a predictor function, cv_super_learner_internal
applies cross validation to this sl_closure with the data passed.
Usage
cv_super_learner_internal(
data,
sl_closure,
y_variable = NULL,
n_folds = 5,
cv_schema = cv_random_schema,
loss_metric,
outcome_type = "continuous"
)
Arguments
data |
Data to use in training a super_learner. |
sl_closure |
A function that takes in data and produces a super_learner prediction function. |
y_variable |
The string name of the outcome column in data. |
n_folds |
The number of cross-validation folds to use in constructing the training and validation splits. |
cv_schema |
A function that takes data and n_folds and returns a list of training and validation datasets, like cv_random_schema. |
loss_metric |
A loss metric function, like the mean-squared-error or negative-log-loss to be
used in evaluating the learners on held-out data and minimized through convex optimization.
A loss metric should take two (vector) arguments:
predictions, and true outcomes, and produce a single statistic summarizing the
performance of each learner. Defaults to the mean-squared-error |
outcome_type |
One of 'continuous', 'binary', 'multiclass', or 'density'. |
Value
A list containing $trained_learners and $cv_loss which
respectively include 1) the trained super learner models on each fold of the data, their holdout predictions and,
2) the cross-validated estimate of the risk (expected loss) on held-out data.
Conditional Density Estimation in the {nadir} Package
Description
The following learners are available for conditional density estimation:
- lnr_lm_density
- lnr_glm_density
- lnr_homoskedastic_density
Details
There are a few important things to know about conditional density
estimation in the nadir package.
Firstly, conditional density learners must produce prediction functions that predict densities at the new outcome values given the new covariates.
Secondly, the implemented density estimators come in two flavors:
those with a strong assumption (that of conditional normality), and those
with much weaker assumptions. The strong assumption is encoded
into learners like lnr_lm_density and lnr_glm_density
and says "after we model the predicted mean given covariates, we expect
the remaining errors to be normally distributed." The
more flexible learners produced by lnr_homoskedastic_density
are similar in spirit, except they fit a stats::density kernel
bandwidth smoother to the error distribution (after predicting the
conditional expected mean).
A subpoint to the above point that's worth calling attention to is that
lnr_homoskedastic_density is a learner factory. That is to say,
given a mean_lnr, lnr_homoskedastic_density produces a
conditional density learner that uses that mean_lnr.
Work is ongoing on implementing a lnr_heteroskedastic_density
learner that allows for predicting higher or lower variance in the
conditional density given covariates.
Conditional density learners should be combined with the negative log loss
function when using super_learner() or using compare_learners.
Refer to the 2003 Dudoit and van der Laan paper for a starting place on the
appropriate loss functions to use for different types of outcomes.
https://biostats.bepress.com/ucbbiostat/paper130/
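To make the workflow concrete, here is a sketch of density super learning; supplying mean_lnr to lnr_homoskedastic_density through extra_learner_args is an assumption here, not a documented pattern:
super_learner(
  data = mtcars,
  formulas = mpg ~ hp,
  learners = list(
    lm_density = lnr_lm_density,
    rf_density = lnr_homoskedastic_density),
  extra_learner_args = list(
    lm_density = NULL,
    rf_density = list(mean_lnr = lnr_rf)),
  outcome_type = 'density')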
See Also
learners binary_learners multiclass_learners
Determine SuperLearner Weights with Nonnegative Least Squares
Description
This function accepts a dataframe that is structured to have
one column Y and other columns with unique names corresponding to
different model predictions for Y, and it will use nonnegative
least squares to determine the weights to use for a SuperLearner.
Usage
determine_super_learner_weights_nnls(data, y_variable, obs_weights = NULL)
Arguments
data |
A data frame consisting of an outcome (y_variable) and other columns corresponding to predictions from candidate learners. |
y_variable |
The string name of the outcome column in data. |
obs_weights |
A vector of weights for each observation that dictate how prediction should be more targeted to higher weighted observations. |
Value
A vector of weights to be used for each of the learners.
Examples
# suppose that we have a data.frame of predictions from different candidate
# learners:
prediction_data <- data.frame(
lm = lnr_lm(mtcars, mpg ~ hp)(mtcars),
rf = lnr_rf(mtcars, mpg ~ hp)(mtcars),
rf2 = lnr_rf(mtcars, mpg ~ hp, ntree = 20)(mtcars),
earth = lnr_earth(mtcars, mpg ~ hp)(mtcars))
# make sure it includes the outcome y_variable
prediction_data$mpg <- mtcars$mpg
# we can use the determine_super_learner_weights_nnls() function to apply the
# non-negative least squares algorithm to produce weights for averaging the learners
determine_super_learner_weights_nnls(
data = prediction_data,
y_variable = 'mpg')
Determine Weights Appropriately for Super Learner given Binary Outcomes
Description
Determine Weights Appropriately for Super Learner given Binary Outcomes
Usage
determine_weights_for_binary_outcomes(data, y_variable, obs_weights = NULL)
Arguments
data |
A data.frame with columns corresponding to predicted probabilities of 1 from each learner and the true y_variable from held-out data |
y_variable |
A character indicating the outcome variable in the data.frame. |
obs_weights |
A vector of weights for each observation that dictate how prediction should be more targeted to higher weighted observations. |
Value
A vector of weights to be used for each of the learners.
Examples
predicted_probabilities <- data.frame(
logistic = lnr_logistic(mtcars, am ~ hp)(mtcars),
nnet = lnr_nnet(mtcars, am ~ hp)(mtcars),
am = mtcars$am)
determine_weights_for_binary_outcomes(predicted_probabilities, y_variable = 'am')
Determine Weights for Density Estimators for SuperLearner
Description
Determine Weights for Density Estimators for SuperLearner
Usage
determine_weights_using_neg_log_loss(data, y_variable, obs_weights = NULL)
Arguments
data |
A data.frame with columns corresponding to predicted densities from each learner and the true y_variable from held-out data |
y_variable |
A character indicating the outcome variable in the data.frame. |
obs_weights |
A vector of weights for each observation that dictate how prediction should be more targeted to higher weighted observations. |
Value
A vector of weights to be used for each of the learners.
Examples
predicted_densities <- data.frame(
lm = lnr_lm_density(mtcars, mpg ~ hp)(mtcars),
earth = lnr_homoskedastic_density(mtcars, mpg ~ hp, mean_lnr = lnr_earth)(mtcars),
rf = lnr_homoskedastic_density(mtcars, mpg ~ hp, mean_lnr = lnr_rf)(mtcars),
rf2 = lnr_homoskedastic_density(mtcars, mpg ~ hp, mean_lnr = lnr_rf,
mean_lnr_args = list(ntree = 20))(mtcars),
mpg = mtcars$mpg)
determine_weights_using_neg_log_loss(predicted_densities, y_variable = 'mpg')
Repeat Observations for Survival Stacking
Description
Per the approach in "A review of survival stacking: a method to cast survival regression analysis as a classification problem" (https://www.degruyterbrill.com/document/doi/10.1515/ijb-2022-0055/html,
https://arxiv.org/abs/2107.13480), we provide df_to_survival_stacked as
a helper function for converting traditional survival data (one observation = one row) into
the survival-stacked data structure: a repeated-observations structure with
multiple rows per individual, one for each timepoint at which they were still in the
risk set, up to and including their event time.
Usage
df_to_survival_stacked(
data,
id_col = NULL,
time_col,
status_col,
covariate_cols,
period_duration = 1,
custom_times = NULL
)
Arguments
data |
A data frame with survival-type outcomes including an event indicator and a time-to-event-or-censoring column |
id_col |
(string) name of the id column that is unique to each observation in data |
time_col |
(string) name of the time-to-event column |
status_col |
(string) name of the 0/1 event indicator column |
covariate_cols |
(string vector) names of your predictors |
period_duration |
(numeric) length of each time-period (e.g. 1) |
custom_times |
(numeric vector) (optional) A vector of the time-period breakpoints. If events could have occurred at any time after zero, this should begin with 0. |
Value
a data.frame of repeated observations, one row for each time-step, that indicates which observations remain in the risk set and whether or not an event occurs.
Examples
if (requireNamespace("survival", quietly = TRUE)) {
df_to_survival_stacked(
data = survival::kidney,
id_col = 'id',
time_col = 'time',
status_col = 'status',
covariate_cols = c('age', 'sex', 'disease', 'frail'))
}
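As a follow-up sketch, the row expansion can be checked directly:
if (requireNamespace("survival", quietly = TRUE)) {
  stacked <- df_to_survival_stacked(
    data = survival::kidney,
    id_col = 'id',
    time_col = 'time',
    status_col = 'status',
    covariate_cols = c('age', 'sex', 'disease', 'frail'))
  # each subject contributes one row per period spent in the risk set,
  # so the stacked data has many more rows than the original:
  c(original = nrow(survival::kidney), stacked = nrow(stacked))
}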
Extract Y Variable from a list of Regression Formulas and Learners
Description
Extract Y Variable from a list of Regression Formulas and Learners
Usage
extract_y_variable(formulas, learner_names, data_colnames, y_variable = NULL)
Arguments
formulas |
A vector of formulas used for super learning |
learner_names |
A character vector of names for the learners |
data_colnames |
The column names of the dataset for super learning |
y_variable |
(Optional) the y_variable specified by the user |
Value
A character value corresponding to the column name of the outcome variable.
Learners in the {nadir} Package
Description
The following learners are available for continuous outcomes:
Details
- lnr_mean
- lnr_earth
- lnr_gam
- lnr_glm
- lnr_glmer
- lnr_glmnet
- lnr_hal
- lnr_lm
- lnr_lmer
- lnr_ranger
- lnr_rf
- lnr_xgboost
See ?density_learners to learn more about using conditional density
estimation in nadir.
lnr_mean is generally provided only for benchmarking: comparing
other learners against it helps confirm they are correctly specified, since any
reasonable prediction algorithm should (in theory) outperform using the mean of
the outcome as the prediction for every observation.
If you'd like to build a new learner, we recommend reading the
source code of several of the learners provided with {nadir} to
get a sense of how they should be specified.
A learner, as {nadir} understands the term, is a function which
takes in data, a formula, and possibly ..., and
returns a function that predicts on its input newdata.
A simple example is reproduced here for ease of reference:
Examples
lnr_glm <- function(data, formula, weights = NULL, ...) {
model <- stats::glm(formula = formula, data = data, weights = weights, ...)
return(function(newdata) {
predict(model, newdata = newdata, type = 'response')
})
}
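A learner written this way can be used directly or handed to super_learner() like any built-in learner; a brief sketch:
# train the learner above and predict on the training data:
lnr_glm(mtcars, mpg ~ hp + wt)(mtcars)
# or include it among the candidate learners of a super learner:
super_learner(
  data = mtcars,
  formulas = mpg ~ hp + wt,
  learners = list(glm = lnr_glm, mean = lnr_mean))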
List Known Learners
Description
List Known Learners
Usage
list_known_learners(type = "any")
Arguments
type |
One of 'any' or a supported outcome type in nadir including
at least 'continuous', 'binary', 'multiclass', 'density'. See nadir_supported_types. |
Value
A character vector of functions that were automatically recognized as nadir learners with the prediction/outcome type given.
Examples
list_known_learners()
list_known_learners('continuous')
list_known_learners('binary')
list_known_learners('density')
list_known_learners('multiclass')
Earth Learner
Description
A wrapper for earth::earth() for use in nadir::super_learner().
Usage
lnr_earth(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_earth(mtcars, mpg ~ hp + disp + am + wt)(mtcars)
Generalized Additive Model Learner
Description
A wrapper for mgcv::gam() for use in nadir::super_learner().
Usage
lnr_gam(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_gam(mtcars, mpg ~ s(hp) + disp + am + wt)(mtcars)
lnr_gam(mtcars, mpg ~ s(hp) + disp + am + wt, family = Gamma)(mtcars)
Gradient Boosting Machines Learner
Description
A wrapper for gbm::gbm() for use in nadir::super_learner().
Usage
lnr_gbm(data, formula, verbose = FALSE, n.minobsinnode = 0, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
verbose |
(default: FALSE) if set to TRUE, information about the automatic
outcome type inferred by lnr_gbm is printed. |
n.minobsinnode |
(default: 0) An integer specifying the minimum number of observations in the terminal nodes of the trees. See the gbm documentation for more. Set here to 0 to account for the potential of very small splits in cross-fitting. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_gbm(mtcars, mpg ~ hp)(mtcars)
GLM Learner
Description
A wrapper for stats::glm() for use in nadir::super_learner().
Usage
lnr_glm(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_glm(mtcars, mpg ~ hp + disp + am + wt)(mtcars)
lnr_glm(mtcars, mpg ~ hp + disp + am + wt, family = Gamma)(mtcars)
Conditional Normal Density Estimation Given Mean Predictors — with GLMs
Description
This is a step up from lnr_lm_density in that it uses
a glm for the conditional mean model.
Note that this allows for specification of glm features
like family = ... in the ... arguments, and
that is the main advantage over lnr_lm_density.
Also note that this still differs from using lnr_homoskedastic_density
with mean_lnr = lnr_glm, because lnr_homoskedastic_density
uses stats::density to do kernel bandwidth smoothing
on the error distribution of the mean predictions.
Usage
lnr_glm_density(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
a closure (function) that produces density estimates
at the newdata given according to the fit model.
Examples
# for example, we could use a Poisson assumption with identity link:
lnr_glm_density(mtcars, hp ~ mpg, family = poisson(link = 'identity'))(mtcars)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 1000)
plot(
x = hp_seq,
y = lnr_glm_density(mtcars, hp ~ mpg, family = poisson(link = 'identity'))(
data.frame(hp = hp_seq, mpg = rep(mean(mtcars$mpg), 1000))),
xlab = 'hp',
ylab = 'density',
main = 'normal density model of horsepower given mean(mpg)')
Generalized Linear Mixed-Effects (lme4::glmer) Learner
Description
A wrapper for lme4::glmer() for use in nadir::super_learner().
Usage
lnr_glmer(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
# random intercepts for each level of cyl column:
suppressMessages({
# singular fit, but that's ok if all you need is prediction:
lnr_glmer(mtcars, mpg ~ (1|cyl) + disp + wt, family = Gamma)(mtcars)
})
glmnet Learner
Description
A wrapper for glmnet::glmnet() for use in nadir::super_learner().
Usage
lnr_glmnet(data, formula, weights = NULL, lambda = 0.2, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
lambda |
The multiplier parameter for the penalty; see glmnet::glmnet. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Details
By default, if lambda is unspecified, glmnet returns a matrix of predictions across a sequence of lambda values; hence the need to handle the lambda argument explicitly when building glmnet learners.
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_glmnet(mtcars, mpg ~ hp + disp + am + wt, lambda = .5)(mtcars)
Highly Adaptive Lasso
Description
Highly Adaptive Lasso
Usage
lnr_hal(data, formula, weights = NULL, lambda = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
lambda |
The multiplier parameter for the penalty; see hal9001::fit_hal. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
suppressWarnings({
# hal prints a lot of messages about some threads not reaching convergence
lnr_hal(mtcars, mpg ~ hp)(mtcars)
})
Conditional Density Estimation with Heteroskedasticity
Description
Conditional Density Estimation with Heteroskedasticity
Usage
lnr_heteroskedastic_density(
data,
formula,
mean_lnr,
var_lnr,
mean_lnr_args = NULL,
var_lnr_args = NULL,
density_args = NULL
)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
mean_lnr |
A learner (function) passed in to be trained on the data with
the given formula and then used to predict conditional means for provided
newdata. |
var_lnr |
A learner (function) passed in to be trained on the squared
error from the mean_lnr, used to model the conditional variance. |
mean_lnr_args |
Extra arguments to be passed to the mean_lnr. |
var_lnr_args |
Extra arguments to be passed to the var_lnr. |
density_args |
Extra arguments to be passed to the kernel
density smoother stats::density. |
Value
a closure (function) that produces density estimates
at the newdata given according to the fit model.
Examples
# fit a conditional density model with mean model as a randomForest
fit_density_hetero <- lnr_heteroskedastic_density(
data = mtcars,
formula = mpg ~ hp,
mean_lnr = lnr_rf,
var_lnr = lnr_lm)
# and what we should get back should be predicted densities at the
# observed mpg given the covariates hp
fit_density_hetero(mtcars)
if (requireNamespace("ggplot2", quietly = TRUE)) {
hp_grid <- with(mtcars, seq(min(hp), max(hp), length.out=100))
mpg_grid <- with(mtcars, seq(min(mpg), max(mpg), length.out=100))
mt_grid <- expand.grid(mpg = mpg_grid, hp = hp_grid)
plt_df <- cbind(mt_grid, pred_dens = fit_density_hetero(mt_grid))
require(ggplot2)
ggplot(plt_df, aes(x = hp, y = mpg, fill = pred_dens)) +
geom_tile() +
scale_fill_viridis_c() +
ggtitle("Density Model of MPG given HP")
}
Conditional Density Estimation with Homoskedasticity Assumption
Description
This function accepts a mean_lnr, which it trains on the data
and formula given. Then stats::density is fit to the errors (the differences
between the observed outcomes and the mean_lnr predictions).
Usage
lnr_homoskedastic_density(
data,
formula,
mean_lnr,
mean_lnr_args = NULL,
density_args = NULL,
weights = NULL
)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
mean_lnr |
A learner (function) passed in to be trained on the data with
the given formula and then used to predict conditional means for provided
newdata. |
mean_lnr_args |
Extra arguments to be passed to the mean_lnr. |
density_args |
Extra arguments to be passed to the kernel
density smoother stats::density. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
Details
This returns a function that takes in newdata and produces density
estimates by evaluating the estimated stats::density fit at the error
between the observed outcome in newdata and the prediction from the mean_lnr.
That is to say, this follows the following procedure (assuming Y as the outcome
and X as a matrix of predictors):
1. Obtain \hat{\mathbb{E}}(Y \mid X) using the mean_lnr.
2. Fit \hat{f} \gets \mathrm{density}(Y - \hat{\mathbb{E}}(Y \mid X)).
3. Return function(newdata) { \hat{f}(newdata$Y - \hat{\mathbb{E}}[Y \mid newdata$X]) }.
Value
A predictor function that takes in newdata and produces density
estimates
Examples
# fit a conditional density model with mean model as a randomForest
fit_density_lnr <- lnr_homoskedastic_density(
data = mtcars,
formula = mpg ~ hp,
mean_lnr = lnr_rf)
# and what we should get back should be predicted densities at the
# observed mpg given the covariates hp
fit_density_lnr(mtcars)
Linear Model Learner
Description
A wrapper for lm() for use in nadir::super_learner().
Usage
lnr_lm(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_lm(mtcars, mpg ~ hp + disp + am + wt)(mtcars)
Conditional Normal Density Estimation Given Mean Predictors
Description
This is about the simplest conditional density estimator one might
entertain. It fits an lm model to the data and
uses the variance of the residuals to parameterize a
model of the data as \mathcal N(y | \beta x, \sigma^2).
Usage
lnr_lm_density(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
a closure (function) that produces density estimates
at the newdata given according to the fit model.
Examples
lnr_lm_density(mtcars, hp ~ mpg)(mtcars)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 1000)
plot(
x = hp_seq,
y = lnr_lm_density(mtcars, hp ~ mpg)(
data.frame(hp = hp_seq, mpg = rep(mean(mtcars$mpg), 1000))),
xlab = 'hp',
ylab = 'density',
main = 'normal density model of horsepower given mean(mpg)')
Random/Mixed-Effects (lme4::lmer) Learner
Description
A wrapper for lme4::lmer for use in nadir::super_learner().
Usage
lnr_lmer(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
# random intercepts for each level of cyl column:
lnr_lmer(mtcars, mpg ~ (1|cyl) + disp + am + wt)(mtcars)
Standard Logistic Regression for Binary Classification
Description
A wrapper provided for convenience around lnr_glm that sets
family = binomial(link = 'logit').
Usage
lnr_logistic(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata, which returns
predictions for the probability of the outcome being 1/TRUE (a numeric
vector of values, one for each row of newdata).
Examples
lnr_logistic(mtcars, am ~ hp)(mtcars)
Mean Learner
Description
This is a very naive learner that simply predicts the mean of the
outcome for every row of input newdata. This is primarily
useful for benchmarking and confirming that other learners are
performing better than lnr_mean. Additionally, it may be the case
that some learners are over-fitting the data, and giving some weight to
lnr_mean helps to reduce over-fitting in super_learner().
Usage
lnr_mean(data, formula, weights = NULL)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_mean(mtcars, mpg ~ hp)(mtcars)
nnet::multinom Multinomial Learner
Description
nnet::multinom Multinomial Learner
Usage
lnr_multinomial_nnet(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of density prediction values at the
outcome value observed in the newdata, conditional on the predictor
variables in newdata).
Examples
df <- mtcars
df$cyl <- as.factor(df$cyl)
lnr_multinomial_nnet(df, cyl ~ hp + mpg)(df)
lnr_multinomial_nnet(iris, Species ~ .)(iris)
VGAM::vglm Multinomial Learner
Description
VGAM::vglm Multinomial Learner
Usage
lnr_multinomial_vglm(data, formula, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of density prediction values at the
outcome value observed in the newdata, conditional on the predictor
variables in newdata).
Examples
df <- mtcars
df$cyl <- as.factor(df$cyl)
lnr_multinomial_vglm(df, cyl ~ hp + mpg)(df)
lnr_multinomial_vglm(iris, Species ~ .)(iris)
Use nnet for Binary Classification
Description
Use nnet for Binary Classification
Usage
lnr_nnet(data, formula, trace = FALSE, size, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
trace |
Whether nnet should print out its optimization success |
size |
Size for neural network hidden layer |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
Examples
lnr_nnet(mtcars, am ~ ., size = 50)(mtcars)
lnr_nnet(iris, I(Species=='setosa') ~ ., size = 50)(iris)
ranger Learner
Description
A wrapper for ranger::ranger() for use in nadir::super_learner().
Usage
lnr_ranger(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_ranger(mtcars, mpg ~ hp)(mtcars)
ranger Learner for Binary Outcomes
Description
A wrapper for ranger::ranger() for use in nadir::super_learner().
Usage
lnr_ranger_binary(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_ranger_binary(mtcars, am ~ hp)(mtcars)
randomForest Learner
Description
A wrapper for randomForest::randomForest() for use in nadir::super_learner().
Usage
lnr_rf(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_rf(mtcars, mpg ~ hp + disp + am + wt, ntree = 20)(mtcars)
Use Random Forest for Binary Classification
Description
Use Random Forest for Binary Classification
Usage
lnr_rf_binary(data, formula, weights = NULL, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata, which returns
predictions for the probability of the outcome being 1/TRUE (a numeric
vector of values, one for each row of newdata).
Examples
lnr_rf_binary(data = mtcars, am ~ mpg)(mtcars)
lnr_rf_binary(mtcars, am ~ hp)(mtcars)
XGBoost Learner
Description
A wrapper for xgboost::xgboost() for use in nadir::super_learner().
Usage
lnr_xgboost(data, formula, weights = NULL, nrounds = 1000, ...)
Arguments
data |
A dataframe to train a learner / learners on. |
formula |
A regression formula to use inside this learner. |
weights |
Observation weights; see the documentation of the underlying model-fitting function. |
nrounds |
The max number of boosting iterations |
... |
Any extra arguments that should be passed to the internal model for model fitting purposes. |
Value
A prediction function that accepts newdata,
which returns predictions (a numeric vector of values, one for each row
of newdata).
See Also
learners
Examples
lnr_xgboost(mtcars, mpg ~ hp)(mtcars)
Make Unique Learner Names
Description
Make Unique Learner Names
Usage
make_learner_names_unique(learners)
Arguments
learners |
A list of learners. See learners. |
Value
A list of learners with (possibly) improved names.
Examples
learners <-
list(
mean = lnr_mean,
rf = lnr_rf,
rf = lnr_rf,
lnr_glm,
lnr_xgboost,
function(data, formula) {},
function(data, formula) {})
learners <- nadir::make_learner_names_unique(learners)
names(learners)
learners <-
list(
lnr_mean,
lnr_rf,
lnr_rf,
lnr_glm,
lnr_xgboost,
function(data, formula) {},
function(data, formula) {})
learners <- nadir::make_learner_names_unique(learners)
names(learners)
Mean Squared Error
Description
Mean Squared Error
Usage
mse(x, y)
Value
A numeric value of the mean squared difference between x and y
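A quick worked example; mse is internal, so the triple-colon access follows the pattern shown in the cv_super_learner Details:
nadir:::mse(c(1, 2, 3), c(1, 2, 5))
# (0^2 + 0^2 + 2^2) / 3 = 1.333...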
Multiclass Learners in {nadir}
Description
- lnr_multinomial_nnet
- lnr_multinomial_vglm
Details
Suppose one of these is trained on some data and the fit learner is stored.
Suppose we are going to call it on newdata where newdata$class is
the outcome variable being predicted.
The important thing to know about multiclass learners is that they
produce predictions of the probability that the outcome class is equal to
newdata$class given the covariates specified in
newdata.
Similar to density estimation, we want to use
determine_weights_using_neg_log_loss in our calls to
super_learner(). This can be done automatically by declaring outcome_type = 'multiclass'
when calling super_learner().
See Also
density_learners binary_learners learners
Examples
super_learner(
data = iris,
learners = list(lnr_multinomial_vglm, lnr_multinomial_vglm, lnr_multinomial_nnet),
formulas = list(
.default = Species ~ .,
multinomial_vglm2 = Species ~ Petal.Length*Petal.Width + .),
outcome_type = 'multiclass'
)
Outcome types supported by {nadir}
Description
The following outcome types are supported in the {nadir}
package:
Usage
nadir_supported_types
Format
An object of class character of length 4.
Details
- continuous
- binary
- multiclass
- density
See Also
super_learner
Negative Log Loss
Description
Negative Log Loss
Usage
negative_log_loss(predicted_densities, ...)
Arguments
predicted_densities |
The predicted densities from a learner, evaluated at the observed outcome values in held-out data. |
... |
Because the predicted densities are already evaluated at the observed outcomes, no second argument is needed; ... is accepted so that negative_log_loss has the same two-argument calling pattern as other loss metrics. |
Details
negative_log_loss encodes the logic:
if \hat{p}_n is a good model of the conditional densities, then it should minimize:
-\sum_i \log(\hat{p}_n(X_i))
Value
A sum of the negative log loss given a vector of predicted probabilities/densities for some observed outcome.
Examples
# suppose we have some prediction probabilities _at the true values_:
predicted_probabilities <- lnr_logistic(mtcars, am ~ hp)(mtcars)
# we can calculate the -log(loss) for binary predicted probabilities like so:
negative_log_loss(predicted_probabilities)
Negative Log Loss for Binary
Description
Negative Log Loss for Binary
Usage
negative_log_loss_for_binary(predicted_probabilities, true_outcomes)
Arguments
predicted_probabilities |
The predicted probabilities from a learner, evaluated on held-out data. |
true_outcomes |
A vector of true outcomes to use in calculating the negative log loss of the relevant predicted probabilities. |
Value
A sum of the negative log loss given a vector of predicted probabilities for
outcome == 1 or equivalently a 'success'.
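A sketch mirroring the negative_log_loss example above:
# predicted probabilities that am == 1, plus the observed outcomes:
predicted_probabilities <- lnr_logistic(mtcars, am ~ hp)(mtcars)
negative_log_loss_for_binary(predicted_probabilities, mtcars$am)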
Parse Extra Arguments
Description
Similar to the job of parse_formulas(), this function handles the
extra_learner_args argument to super_learner() to ensure that
each learner has a corresponding list (potentially empty) of
any extra arguments being passed to it.
Usage
parse_extra_learner_arguments(extra_learner_args, learner_names)
Arguments
extra_learner_args |
A list of extra learner arguments |
learner_names |
The names of the learners |
Value
A list of extra arguments for each learner, in the same order as learner_names
Parse Formulas for Super Learner
Description
This function converts the user-input argument formulas to
super_learner() into a standardized format. More specifically, it deals
with things like the use of a .default formula that learners assume if one
is not specified for each particular learner. The end result is a named list
of formulas, one for each learner, with names matching the learners each
formula corresponds to.
Usage
parse_formulas(formulas, learner_names)
Arguments
formulas |
Formulas to be passed to each learner of a super learner |
learner_names |
The names of each of the learners passed to a super learner |
Value
A named list of formulas, one for each learner
Predict from a nadir::super_learner() model
Description
Predict from a nadir::super_learner() model
Usage
## S3 method for class 'nadir_sl_model'
predict(object, newdata, ...)
Arguments
object |
An object of class inheriting from nadir_sl_model. |
newdata |
A tabular data structure (data.frame or matrix) of predictor variables. |
... |
Ellipses, solely provided so that the method is consistent with the signature of the predict generic. |
Value
a numeric vector of predicted values
Examples
sl_fit <- super_learner(mtcars, mpg ~ hp,
learners = list(lnr_lm, lnr_rf, lnr_earth))
predict(sl_fit, newdata = mtcars)
Correlation Threshold Based Screening
Description
Correlation Threshold Based Screening
Usage
screener_cor(data, formula, threshold = 0.2, cor... = NULL)
Arguments
data |
A dataframe intended to be used with a learner or with super_learner(). |
formula |
The formula specifying the regression to be done |
threshold |
The correlation coefficient cutoff, below which variables are screened out from the dataset and regression formula. |
cor... |
An optional list of extra arguments to pass to stats::cor. |
Details
If a variable has little correlation with the outcome being predicted, we might want to screen that variable out from the predictors.
In large datasets this is quite important, as a huge number of
columns can make running super_learner() computationally intractable or
frustratingly time-consuming.
Value
A list of $data with columns screened out,
$formula with variables screened out, and $failed_to_correlate_names
the names of variables that failed to correlate with the outcome at least at the threshold
level.
Examples
screener_cor(
data = mtcars,
formula = mpg ~ .,
threshold = .5)
# We're also showing how to specify that you want the Spearman rank-based
# correlation coefficient, to get away from the assumption of linearity.
screener_cor(
data = mtcars,
formula = mpg ~ .,
threshold = .5,
cor... = list(method = 'spearman')
)
Top-n Correlation Based Screening
Description
Top-n Correlation Based Screening
Usage
screener_cor_top_n(data, formula, keep_n_terms, cor... = NULL)
Arguments
data |
A dataframe intended to be used with a learner or with super_learner(). |
formula |
The formula specifying the regression to be done |
keep_n_terms |
Set to an integer value >=1, this indicates that the top n terms in the model frame with greatest absolute correlation with the outcome will be kept. |
cor... |
An optional list of extra arguments to pass to stats::cor. |
Details
If a variable has little correlation with the outcome being predicted, we might want to screen that variable out from the predictors.
In large datasets this is quite important, as a huge number of
columns can make running super_learner() computationally intractable or
frustratingly time-consuming.
Value
A list of $data with columns screened out,
$formula with variables screened out, and $failed_to_correlate_names
the names of variables screened out for not ranking among the top
keep_n_terms in absolute correlation with the outcome.
Examples
screener_cor_top_n(
data = mtcars,
formula = mpg ~ .,
keep_n_terms = 5)
# We're also showing how to specify that you want the Spearman rank-based
# correlation coefficient, to get away from the assumption of linearity.
screener_cor_top_n(
data = mtcars,
formula = mpg ~ .,
keep_n_terms = 5,
cor... = list(method = 'spearman')
)
t-test Based Screening: Thresholds on p.values and/or t statistics
Description
Screens out variables from the formula and dataset based on a p.value and/or the absolute value of the t statistic from a univariate linear regression (with intercept and one term) comparing each predictor to the outcome (dependent) variable.
Usage
screener_t_test(
data,
formula,
p_value_threshold = NULL,
t_statistic_threshold = NULL
)
Arguments
data |
a dataset with variables mentioned in the formula |
formula |
a regression formula whose terms are candidates for screening |
p_value_threshold |
A numeric scalar where terms pass if the t test for
the linear model coefficient has p value lower than or equal to the
p_value_threshold. |
t_statistic_threshold |
A numeric scalar where terms pass if they have a t test
statistic greater than or equal to the t_statistic_threshold in absolute value. |
Details
The intended use of screener_t_test and other screeners is for
pragmatic purposes: when there are a very large number of candidate
predictors, such that super_learner is very slow to run, predictor
variables that fail to have a detectable association with the dependent
variable of the formula can be dropped from the learner.
Value
A list of $data with columns screened out,
$formula with variables screened out, and $failed_to_pass_threshold
the names of variables that failed to associate with the outcome at least at the threshold
level.
See Also
screeners, add_screener, screener_cor_top_n
Wrapping Learners with a Screener
Description
Screeners work on the principle that they should take the same arguments that a learner does and return a modified dataset and formula in which variables that fail to meet some threshold have been screened out.
Details
A screener can be added to a learner by using the add_screener(learner, screener)
function provided. This returns a modified learner that implements screening based on the
data and formula passed.
So far, the screeners implemented rely on being able to call model.matrix and therefore
only support standard (generalized) linear model syntax like those mentioned in ?formula.
See Also
screener_cor, screener_cor_top_n, screener_t_test, add_screener
Examples
# examples for setting up a screened regression problem:
#
# users can just run a screener to see what data and formula terms pass the
# given screener conditions:
screened_regression_problem <- screener_cor(data = mtcars,
formula = mpg ~ ., threshold = 0.5)
screened_regression_problem
screened_regression_problem2 <- screener_cor(data = mtcars,
formula = mpg ~ ., threshold = 0.5, cor... = list(method = 'spearman'))
screened_regression_problem2
screened_regression_problem3 <- screener_t_test(data = mtcars,
formula = mpg ~ ., t_statistic_threshold = 10)
screened_regression_problem3
# build a new learner with screening builtin:
lnr_rf_screener_top_5_cor_terms <- add_screener(
learner = lnr_rf,
screener = screener_cor_top_n,
screener_extra_args = list(cor... = list(method = 'spearman'),
keep_n_terms = 5)
)
# train learner
trained_learner <- lnr_rf_screener_top_5_cor_terms(data = mtcars, formula = mpg ~ .)
mtcars_modified <- mtcars
mtcars_modified['gear'] <- 1 # gear is one of the least correlated variables with mpg
identical(trained_learner(mtcars), trained_learner(mtcars_modified))
Softmax
Description
A common transformation used to go from a collection of numbers from R to numbers in 0 to 1 (inclusive) such that they sum to 1.
Usage
softmax(beta)
Arguments
beta |
A vector of numeric values to transform |
Value
The numeric vector rescaled (via the softmax method) so that all of its values are between 0 to 1, inclusive.
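For reference, the transformation computed is softmax(\beta)_i = \exp(\beta_i) / \sum_j \exp(\beta_j); a small sketch (assuming softmax is exported):
softmax(c(0, 1, 2))
sum(softmax(c(0, 1, 2)))  # sums to 1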
Round up or down randomly with probability equal to the decimal part of x
Description
Round up or down randomly with probability equal to the decimal part of x
Usage
stochastic_round(x)
Arguments
x |
A numeric vector |
Value
A vector of integer values
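A small sketch of the behavior (assuming stochastic_round is exported):
set.seed(1)
stochastic_round(c(0.2, 2.5, 3.9))
# 0.2 rounds up to 1 with probability 0.2 and down to 0 otherwise, and so on;
# in expectation each rounded entry equals the original value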
Super Learner: Cross-Validation Based Ensemble Learning
Description
Super learning with functional programming!
Usage
super_learner(
data,
learners,
formulas,
y_variable = NULL,
n_folds = 5,
determine_super_learner_weights,
ensemble_or_discrete = "ensemble",
cv_schema,
outcome_type = "continuous",
extra_learner_args = NULL,
cluster_ids = NULL,
strata_ids = NULL,
weights = NULL,
use_complete_cases = FALSE
)
Arguments
data |
Data to use in training a super_learner. |
learners |
A list of predictor/closure-returning-functions. See Details. |
formulas |
Either a single regression formula or a vector of regression formulas. |
y_variable |
Typically NULL; the outcome variable is inferred from the formulas, but it may be given explicitly as a character string. |
n_folds |
The number of cross-validation folds to use in constructing the training and validation splits. |
determine_super_learner_weights |
A function/method to determine the weights for each of the candidate learners. |
ensemble_or_discrete |
Defaults to 'ensemble'; set to 'discrete' to select the single best candidate learner instead of a weighted ensemble. |
cv_schema |
A function that takes data and n_folds and returns a list of training and validation datasets, like cv_random_schema. |
outcome_type |
One of 'continuous', 'binary', 'multiclass', or 'density'. |
extra_learner_args |
A list of equal length to the learners list, holding extra arguments (or NULL) for each learner. |
cluster_ids |
(default: null) If specified, clusters will either be entirely assigned to training or validation (not both) in each cross-validation split. |
strata_ids |
(default: null) If specified, strata are balanced across training and validation splits so that strata appear in both the training and validation splits. |
weights |
If specified, (per observation) weights are used to indicate that risk minimization across models (i.e., the meta-learning step) should be targeted to higher weight observations. |
use_complete_cases |
(default: FALSE) If TRUE, rows of the data with missing values are dropped before training. |
Details
The goal of any super learner is to use cross-validation and a set of candidate learners to 1) evaluate how the learners perform on held-out data and 2) use that evaluation to produce a weighted average of the learners (the ensemble super learner) or to pick the single best learner (the discrete super learner).
Super learner and its statistically desirable properties have been written about at length, beginning with van der Laan et al. (2007) https://biostats.bepress.com/ucbbiostat/paper222/.
nadir::super_learner adopts several user-interface design-perspectives
that will be useful to know in understanding what it does and how it works:
- The specification of learners should be very flexible, really only constrained by the fact that candidate learners should be designed for the same prediction problem; their details can vary wildly from learner to learner.
- It should be easy to specify a customized or new learner.
nadir::super_learner at its core accepts data,
a formula (a single one passed to formulas is fine),
and a list of learners.
learners are taken to be lists of functions of the following specification:
- a learner must accept data and formula arguments,
- a learner may accept more arguments, and
- a learner must return a prediction function that accepts newdata and produces a vector of prediction values given newdata.
In essence, a learner is specified to be a function taking (data, formula, ...)
and returning a closure (see http://adv-r.had.co.nz/Functional-programming.html#closures for an introduction to closures)
which is a function accepting newdata returning predictions.
Since many candidate learners will have hyperparameters that should be tuned,
like depth of trees in random forests, or the lambda parameter for glmnet,
extra arguments can be passed to each learner via the extra_learner_args
argument. extra_learner_args should be a list of lists, one list of
extra arguments for each learner. If no additional arguments are needed
for some learners, but some learners you're using do require additional
arguments, you can just put a NULL value into the extra_learner_args.
See the examples.
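For instance, a sketch in which only the random forest receives extra arguments (with a NULL entry for the linear model):
super_learner(
  data = mtcars,
  formulas = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm),
  extra_learner_args = list(rf = list(ntree = 20), lm = NULL))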
In order to seamlessly support using features implemented by extensions
to the formula syntax (like random intercepts or slopes that use the
(age | strata) syntax in
lme4, or splines like s(age) in mgcv), we allow for the
formulas argument to either be one fixed formula that
super_learner will use for all the models, or a vector of formulas,
one for each learner specified.
Note that in the examples a mean-squared-error (mse) is calculated on
the same training/test set, and this is only useful as a crude diagnostic to
see that super_learner is working. A more rigorous performance metric to
evaluate super_learner on is the cv-rmse produced by cv_super_learner.
Value
An object of class inheriting from nadir_sl_model. This is an S3 object,
with elements including a $predict(newdata) method, and some information
about the fit model including y_variable, outcome_type, learner_weights,
holdout_predictions and optionally information about any errors thrown by the
learner fitting process.
See Also
predict.nadir_sl_model compare_learners
cv_super_learner
Examples
learners <- list(
glm = lnr_glm,
rf = lnr_rf,
glmnet = lnr_glmnet,
lmer = lnr_lmer
)
# mtcars example ---
formulas <- c(
.default = mpg ~ cyl + hp, # first three models use same formula
lmer = mpg ~ (1 | cyl) + hp # lme4 uses different language features
)
# fit a super_learner
sl_model <- super_learner(
data = mtcars,
formula = formulas,
learners = learners)
# We recommend taking a look at this object to see what's contained inside it:
sl_model
compare_learners(sl_model)
# iris example ---
sl_model <- super_learner(
data = iris,
formula = list(
.default = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
lmer = Sepal.Length ~ (Sepal.Width | Species) + Petal.Length),
learners = learners)
# produce super_learner predictions and compare against the individual learners
compare_learners(sl_model)
Validate Learner Types
Description
Validate Learner Types
Usage
validate_learner_types(learners, outcome_type)
Arguments
learners |
A list of learners. See learners. |
outcome_type |
An outcome type that the learners should be compatible with; see nadir_supported_types. |
Value
In the case of success, returns NULL invisibly; if there are mis-matches
between the outcome_type given and the learners' types, warnings
will be thrown.