library(tidylearn)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)

Automated Machine Learning (AutoML) streamlines the model development
process by automatically trying multiple approaches and selecting the
best one. tidylearn’s tl_auto_ml() function explores
various modeling strategies including dimensionality reduction,
clustering, and different supervised methods.
Note: AutoML orchestrates the wrapped packages
(glmnet, randomForest, xgboost, etc.) rather than implementing new
algorithms. Each model in the leaderboard wraps an established package,
and you can access the raw model objects via model$fit.
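For example, the fitted objects in the result's models list (keyed by leaderboard name, as the ensemble example later in this article uses) expose the raw wrapped-package object. This is a sketch, and the name "baseline_forest" is an assumption based on the leaderboard naming shown in the quick-example output:

```r
# Sketch: pull the raw randomForest object out of an AutoML result.
# "baseline_forest" is an assumed leaderboard name; check automl$leaderboard.
automl <- tl_auto_ml(iris, Species ~ ., time_budget = 60)
wrapper <- automl$models[["baseline_forest"]]
class(wrapper$fit)   # the raw object from the wrapped package
# Package-specific tooling then works directly on the raw object, e.g.:
# randomForest::importance(wrapper$fit)
```

This is useful when you need diagnostics that tidylearn does not re-export, such as variable-importance measures from the underlying package.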
The tl_auto_ml() function follows a four-phase pipeline.
Which phases actually run – and how thoroughly – depends on the
time_budget and the toggle parameters
use_reduction and use_clustering.
| Phase | What it does | Models added (classification) | Models added (regression) |
|---|---|---|---|
| 1. Baselines | Trains standard models | tree, logistic, forest | tree, linear, forest |
| 2. PCA variants | PCA preprocessing + baseline methods | pca_tree, pca_logistic, pca_forest | pca_tree, pca_linear, pca_forest |
| 3. Cluster variants | Adds cluster assignments as features | clustered_tree, clustered_logistic, clustered_forest | clustered_tree, clustered_linear, clustered_forest |
| 4. Advanced | Tries heavier methods | svm, xgboost | ridge, lasso |
Each model is first fit on the full training data, then (if budget allows) evaluated with k-fold cross-validation. When time is tight, models fall back to training-set metrics instead of CV.
The time_budget parameter (in seconds) is the most
important knob for controlling the speed/thoroughness trade-off. It is
checked between model fits, not during them. Once a
model starts training it runs to completion, because many of the wrapped
packages (randomForest, xgboost, e1071) execute C-level code that R
cannot safely interrupt mid-execution. This means the actual wall-clock
time may modestly exceed the budget by the duration of the last model
that started before the budget expired.
| Budget | Baselines | CV | PCA/Cluster variants | Advanced models | Typical models | Use case |
|---|---|---|---|---|---|---|
| < 30 s | tree + logistic/linear only | No (training metrics) | No | No | 2 | Quick sanity check, interactive use |
| 30–120 s | tree + logistic/linear + forest | When time remains | If enabled and > 10 % budget left | If > 40 % budget left | 3–7 | Development iteration, notebook exploration |
| 120 s+ | All | Yes | Yes (if enabled) | Yes | 9–11 | Thorough comparison, final model selection |
The “forest” baseline and all advanced models (SVM, XGBoost, ridge,
lasso) involve C-level code that typically takes 3–15 seconds per fit
depending on data size. They are only attempted when
time_budget >= 30.
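The between-fit check can be pictured as a small loop. The following is a minimal plain-R sketch of the pattern, not tidylearn's actual internals:

```r
# Minimal sketch of between-fit budget checking (illustrative only).
# Each "fit" runs to completion once started, so the last fit to start
# may push total wall-clock time past the budget.
run_with_budget <- function(fits, time_budget) {
  start <- Sys.time()
  results <- list()
  for (name in names(fits)) {
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed >= time_budget) break   # checked BETWEEN fits only
    results[[name]] <- fits[[name]]()   # uninterruptible once started
  }
  results
}

# Dummy "fits" standing in for model-training calls
fits <- list(
  tree   = function() "tree fit",
  forest = function() { Sys.sleep(0.1); "forest fit" }
)
run_with_budget(fits, time_budget = 5)   # both complete
run_with_budget(fits, time_budget = 0)   # nothing is attempted
```

The key design point is that the loop never kills a running fit; it only declines to start the next one.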
A single tl_model() call fits one model.
Cross-validation (tl_cv(folds = 5)) fits
five models on subsets, so it costs roughly 5x the
time. The function checks the remaining budget after each model fit and
skips CV when it would likely exceed the budget, falling back to
training-set evaluation instead. Reducing cv_folds
(e.g. from 5 to 2) is the most effective way to stay closer to the
budget while still getting out-of-sample estimates.
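As a rough illustration of that trade-off (timings here are illustrative and depend on hardware and data size):

```r
# Same budget, fewer folds: more of the budget goes to fitting candidate
# models rather than to repeated CV refits (illustrative sketch)
system.time(tl_auto_ml(iris, Species ~ ., time_budget = 30, cv_folds = 5))
system.time(tl_auto_ml(iris, Species ~ ., time_budget = 30, cv_folds = 2))
```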
# Quick sanity check -- 2 fast models, no CV, done in ~1s
quick <- tl_auto_ml(iris, Species ~ .,
time_budget = 10,
use_reduction = FALSE,
use_clustering = FALSE)
quick$leaderboard
#> baseline_tree, baseline_logistic
# Development iteration -- baselines + forest, some CV
medium <- tl_auto_ml(iris, Species ~ .,
time_budget = 60,
cv_folds = 3)
medium$leaderboard
#> 5--7 models depending on data size
# Thorough search -- all phases, full CV
thorough <- tl_auto_ml(iris, Species ~ .,
time_budget = 300,
cv_folds = 5)
thorough$leaderboard
#> 9--11 models with cross-validated scores

AutoML automatically detects the task type from the response variable.
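Concretely, a factor response such as iris$Species yields classification models, while a numeric response such as mtcars$mpg yields regression models; task can also be set explicitly, as the mtcars example in this article does with task = "regression". A quick sketch, assuming default arguments:

```r
# Task type is inferred from the response variable
cls <- tl_auto_ml(iris, Species ~ ., time_budget = 30)   # factor  -> classification
reg <- tl_auto_ml(mtcars, mpg ~ ., time_budget = 30)     # numeric -> regression
cls$leaderboard
reg$leaderboard
```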
# Disable dimensionality reduction
no_reduction <- tl_auto_ml(iris, Species ~ .,
use_reduction = FALSE,
time_budget = 60)
# Disable cluster features
no_clustering <- tl_auto_ml(iris, Species ~ .,
use_clustering = FALSE,
time_budget = 60)
# Baseline models only
baseline_only <- tl_auto_ml(iris, Species ~ .,
use_reduction = FALSE,
use_clustering = FALSE,
time_budget = 30)

# Split data for evaluation
split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)
# Run AutoML on training data
automl_iris <- tl_auto_ml(split$train, Species ~ .,
time_budget = 90,
cv_folds = 5)
# Evaluate on test set
test_preds <- predict(automl_iris$best_model, new_data = split$test)
test_accuracy <- mean(test_preds$.pred == split$test$Species)
cat("AutoML Test Accuracy:", round(test_accuracy * 100, 1), "%\n")# Split mtcars data
split_mtcars <- tl_split(mtcars, prop = 0.7, seed = 42)
# Run AutoML
automl_mpg <- tl_auto_ml(split_mtcars$train, mpg ~ .,
task = "regression",
time_budget = 90)
# Evaluate
test_preds_mpg <- predict(automl_mpg$best_model, new_data = split_mtcars$test)
rmse <- sqrt(mean((test_preds_mpg$.pred - split_mtcars$test$mpg)^2))
cat("AutoML Test RMSE:", round(rmse, 2), "\n")# Preprocess data first
processed <- tl_prepare_data(
split$train,
Species ~ .,
scale_method = "standardize",
remove_correlated = TRUE
)
# Run AutoML on preprocessed data
automl_processed <- tl_auto_ml(processed$data, Species ~ .,
time_budget = 60)
# Note: the same preprocessing must be applied to the test data. Re-running
# tl_prepare_data() on the test set (as here) re-estimates the scaling from
# the test data itself; if tl_prepare_data() exposes the training-set
# parameters, reusing those on the test set is preferable.
test_processed <- tl_prepare_data(
split$test,
Species ~ .,
scale_method = "standardize"
)
test_preds_proc <- predict(
automl_processed$best_model,
new_data = test_processed$data
)

# Manual approach: choose one model
manual_model <- tl_model(split$train, Species ~ ., method = "forest")
manual_preds <- predict(manual_model, new_data = split$test)
manual_acc <- mean(manual_preds$.pred == split$test$Species)
# AutoML approach
automl_model <- tl_auto_ml(split$train, Species ~ ., time_budget = 60)
automl_preds <- predict(automl_model$best_model, new_data = split$test)
automl_acc <- mean(automl_preds$.pred == split$test$Species)
cat("Manual Selection:", round(manual_acc * 100, 1), "%\n")
cat("AutoML:", round(automl_acc * 100, 1), "%\n")# First pass: quick exploration
quick_automl <- tl_auto_ml(split$train, Species ~ .,
time_budget = 30,
use_reduction = TRUE,
use_clustering = FALSE)
# Analyze what worked — best model name is in the leaderboard
best_name <- quick_automl$leaderboard$model[1]
best_method <- quick_automl$best_model$spec$method
cat("Best model:", best_name, "(method:", best_method, ")\n")
# Second pass: if a PCA variant won, invest more in reduction
if (grepl("^pca_", best_name)) {
refined_automl <- tl_auto_ml(split$train, Species ~ .,
time_budget = 60,
use_reduction = TRUE,
use_clustering = TRUE)
}

# Get top 3 models
top_models <- automl_iris$leaderboard %>%
arrange(desc(score)) %>%
head(3)
# Make predictions with each
ensemble_preds <- list()
for (i in seq_len(nrow(top_models))) {
model_name <- top_models$model[i]
model <- automl_iris$models[[model_name]]
ensemble_preds[[i]] <- predict(model, new_data = split$test)$.pred
}
# Majority vote for classification
final_pred <- apply(do.call(cbind, ensemble_preds), 1, function(x) {
names(which.max(table(x)))
})
ensemble_acc <- mean(final_pred == split$test$Species)
cat("Ensemble Accuracy:", round(ensemble_acc * 100, 1), "%\n")time_budget = 10 to verify the pipeline runs, then increase
to 60–120s for real evaluation, and 300s for final model selection.cv_folds before reducing
time_budget: Going from 5-fold to 2-fold CV cuts
evaluation time by ~60% while still providing out-of-sample estimates. A
30s budget with cv_folds = 2 is often more useful than a
30s budget with the default 5 folds (which will skip CV entirely).Good use cases:
Consider manual selection when you already know which method suits your
problem or want full control over a single model's settings.
If runs exceed time_budget: the budget is checked between model fits. A single random forest or XGBoost fit can take 5–30 seconds depending on data size, and R cannot safely interrupt C-level code mid-execution. To stay closer to the budget:
# 1. Reduce CV folds (biggest impact)
fast_result <- tl_auto_ml(data, formula,
cv_folds = 2,
time_budget = 30)
# 2. Disable slow phases
baseline_result <- tl_auto_ml(data, formula,
use_reduction = FALSE,
use_clustering = FALSE,
time_budget = 30)
# 3. Use a budget under 30s to skip forest/SVM/XGBoost entirely
quick_result <- tl_auto_ml(data, formula, time_budget = 10)

If an evaluation metric isn’t found in the results: the most common cause is using training-set evaluation (short budget), where the metric names differ from CV output. Try increasing the budget so that CV runs, or specify the metric explicitly.
tidylearn’s AutoML provides an automated search over baseline, PCA,
cluster-feature, and advanced models, a leaderboard of scored candidates,
and a best model ready for prediction.
# Complete AutoML workflow
workflow_split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)
automl_result <- tl_auto_ml(
data = workflow_split$train,
formula = Species ~ .,
task = "auto",
use_reduction = TRUE,
use_clustering = TRUE,
time_budget = 120,
cv_folds = 5
)
# Evaluate best model
final_preds <- predict(automl_result$best_model, new_data = workflow_split$test)
final_accuracy <- mean(final_preds$.pred == workflow_split$test$Species)
cat("Final AutoML Accuracy:", round(final_accuracy * 100, 1), "%\n")
cat("Best approach:", automl_result$best_model$spec$method, "\n")AutoML makes machine learning accessible and efficient, allowing you to quickly find good solutions while learning which approaches work best for your data.