Automated Machine Learning with tidylearn

library(tidylearn)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

Introduction

Automated Machine Learning (AutoML) streamlines the model development process by automatically trying multiple approaches and selecting the best one. tidylearn’s tl_auto_ml() function explores various modeling strategies including dimensionality reduction, clustering, and different supervised methods.

Note: AutoML orchestrates the wrapped packages (glmnet, randomForest, xgboost, etc.) rather than implementing new algorithms. Each model in the leaderboard wraps an established package, and you can access the raw model objects via model$fit.

Basic Usage

Classification Task

# Run AutoML on iris dataset
result <- tl_auto_ml(iris, Species ~ .,
                     task = "classification",
                     time_budget = 60)

# View best model
print(result$best_model)
# View all models tried
names(result$models)
# View leaderboard
result$leaderboard

Regression Task

# Run AutoML on regression problem
result_reg <- tl_auto_ml(mtcars, mpg ~ .,
                         task = "regression",
                         time_budget = 60)

# Best model
print(result_reg$best_model)

How AutoML Works

The tl_auto_ml() function follows a four-phase pipeline. Which phases actually run – and how thoroughly – depends on the time_budget and the toggle parameters use_reduction and use_clustering.

| Phase | What it does | Models added (classification) | Models added (regression) |
|---|---|---|---|
| 1. Baselines | Trains standard models | tree, logistic, forest | tree, linear, forest |
| 2. PCA variants | PCA preprocessing + baseline methods | pca_tree, pca_logistic, pca_forest | pca_tree, pca_linear, pca_forest |
| 3. Cluster variants | Adds cluster assignments as features | clustered_tree, clustered_logistic, clustered_forest | clustered_tree, clustered_linear, clustered_forest |
| 4. Advanced | Tries heavier methods | svm, xgboost | ridge, lasso |

Each model is first fit on the full training data, then (if budget allows) evaluated with k-fold cross-validation. When time is tight, models fall back to training-set metrics instead of CV.

Understanding the Time Budget

The time_budget parameter (in seconds) is the most important knob for controlling the speed/thoroughness trade-off. It is checked between model fits, not during them. Once a model starts training it runs to completion, because many of the wrapped packages (randomForest, xgboost, e1071) execute C-level code that R cannot safely interrupt mid-execution. This means the actual wall-clock time may modestly exceed the budget by the duration of the last model that started before the budget expired.
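The between-fit check can be illustrated in base R. This is a simplified sketch of the behavior described above, not tidylearn's internals; `run_with_budget` and the `fits` list are hypothetical stand-ins for the real scheduler and model fits:

```r
# Simplified sketch: the budget is checked BETWEEN fits, so a fit that
# has already started always runs to completion, even past the budget.
run_with_budget <- function(fits, time_budget) {
  start <- Sys.time()
  results <- list()
  for (name in names(fits)) {
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed >= time_budget) break   # checked before each fit only
    results[[name]] <- fits[[name]]()   # uninterruptible once started
  }
  results
}

# Stand-ins for actual model fits
fits <- list(
  tree   = function() "tree fit",
  forest = function() "forest fit"
)
run_with_budget(fits, time_budget = 60)
```

With `time_budget = 0` the loop exits before the first fit, which mirrors why very small budgets produce very short leaderboards.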

Budget tiers at a glance

| Budget | Baselines | CV | PCA/Cluster variants | Advanced models | Typical models | Use case |
|---|---|---|---|---|---|---|
| < 30 s | tree + logistic/linear only | No (training metrics) | No | No | 2 | Quick sanity check, interactive use |
| 30–120 s | tree + logistic/linear + forest | When time remains | If enabled and > 10 % budget left | If > 40 % budget left | 3–7 | Development iteration, notebook exploration |
| 120 s+ | All | Yes | Yes (if enabled) | Yes | 9–11 | Thorough comparison, final model selection |

The “forest” baseline and all advanced models (SVM, XGBoost, ridge, lasso) involve C-level code that typically takes 3–15 seconds per fit depending on data size. They are only attempted when time_budget >= 30.

Why CV is the expensive step

A single tl_model() call fits one model. Cross-validation (tl_cv(folds = 5)) fits five models on subsets, so it costs roughly 5x the time. The function checks the remaining budget after each model fit and skips CV when it would likely exceed the budget, falling back to training-set evaluation instead. Reducing cv_folds (e.g. from 5 to 2) is the most effective way to stay closer to the budget while still getting out-of-sample estimates.
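The ~5x arithmetic is easy to see with a hand-rolled 5-fold loop in base R, using `lm()` on mtcars purely as a stand-in for a `tl_model()` fit:

```r
# Why 5-fold CV costs roughly 5x a single fit: one model per fold.
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fit_count <- 0
errors <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)   # one fit per fold
  fit_count <- fit_count + 1
  errors[i] <- sqrt(mean((predict(fit, test) - test$mpg)^2))
}
fit_count      # 5 fits, versus 1 for a plain training run
mean(errors)   # out-of-sample RMSE estimate
```

Dropping `k` from 5 to 2 drops the fold loop from 5 fits to 2, which is the ~60% saving mentioned in the best practices below.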

Practical examples

# Quick sanity check -- 2 fast models, no CV, done in ~1s
quick <- tl_auto_ml(iris, Species ~ .,
                    time_budget = 10,
                    use_reduction = FALSE,
                    use_clustering = FALSE)
quick$leaderboard
#> baseline_tree, baseline_logistic

# Development iteration -- baselines + forest, some CV
medium <- tl_auto_ml(iris, Species ~ .,
                     time_budget = 60,
                     cv_folds = 3)
medium$leaderboard
#> 5–7 models depending on data size

# Thorough search -- all phases, full CV
thorough <- tl_auto_ml(iris, Species ~ .,
                       time_budget = 300,
                       cv_folds = 5)
thorough$leaderboard
#> 9–11 models with cross-validated scores

Task Type Detection

AutoML automatically detects the task type from the response variable:

# Factor/character response -> classification
result_class <- tl_auto_ml(iris, Species ~ ., task = "auto")

# Numeric response -> regression
result_reg <- tl_auto_ml(mtcars, mpg ~ ., task = "auto")
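The detection rule can be sketched in base R. This is a plausible illustration of the rule stated above, not tidylearn's exact code; `detect_task` is a hypothetical helper:

```r
# Sketch of "auto" task detection: factor/character responses imply
# classification, numeric responses imply regression.
detect_task <- function(data, formula) {
  response <- model.response(model.frame(formula, data))
  if (is.factor(response) || is.character(response)) {
    "classification"
  } else {
    "regression"
  }
}

detect_task(iris, Species ~ .)   # "classification"
detect_task(mtcars, mpg ~ .)     # "regression"
```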

Understanding Results

Accessing Models

result <- tl_auto_ml(iris, Species ~ ., time_budget = 60)

# Best performing model
best_model <- result$best_model

# All models trained
all_models <- result$models

# Specific model
baseline_logistic <- result$models$baseline_logistic
pca_forest <- result$models$pca_forest

Leaderboard

# View performance comparison
leaderboard <- result$leaderboard

# Sort by score: higher is better for accuracy (classification),
# so sort descending; for RMSE (regression), sort ascending instead
leaderboard <- leaderboard %>%
  arrange(desc(score))            # classification (accuracy)
# leaderboard %>% arrange(score)  # regression (RMSE)

print(leaderboard)

Making Predictions

# Use best model for predictions (new_data: a data frame with the same
# predictor columns as the training data)
predictions <- predict(result$best_model, new_data = new_data)

# Or use a specific model
predictions_pca <- predict(result$models$pca_forest, new_data = new_data)

Practical Examples

Example 1: Iris Classification

# Split data for evaluation
split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)

# Run AutoML on training data
automl_iris <- tl_auto_ml(split$train, Species ~ .,
                          time_budget = 90,
                          cv_folds = 5)

# Evaluate on test set
test_preds <- predict(automl_iris$best_model, new_data = split$test)
test_accuracy <- mean(test_preds$.pred == split$test$Species)

cat("AutoML Test Accuracy:", round(test_accuracy * 100, 1), "%\n")
# Compare models
for (model_name in names(automl_iris$models)) {
  model <- automl_iris$models[[model_name]]
  preds <- predict(model, new_data = split$test)
  acc <- mean(preds$.pred == split$test$Species)
  cat(model_name, ":", round(acc * 100, 1), "%\n")
}

Example 2: MPG Prediction

# Split mtcars data
split_mtcars <- tl_split(mtcars, prop = 0.7, seed = 42)

# Run AutoML
automl_mpg <- tl_auto_ml(split_mtcars$train, mpg ~ .,
                         task = "regression",
                         time_budget = 90)

# Evaluate
test_preds_mpg <- predict(automl_mpg$best_model, new_data = split_mtcars$test)
rmse <- sqrt(mean((test_preds_mpg$.pred - split_mtcars$test$mpg)^2))

cat("AutoML Test RMSE:", round(rmse, 2), "\n")

Example 3: Custom Preprocessing + AutoML

# Preprocess data first
processed <- tl_prepare_data(
  split$train,
  Species ~ .,
  scale_method = "standardize",
  remove_correlated = TRUE
)

# Run AutoML on preprocessed data
automl_processed <- tl_auto_ml(processed$data, Species ~ .,
                               time_budget = 60)

# Caution: this re-runs tl_prepare_data on the test set, which
# re-estimates the scaling from test-set statistics. If the training
# preprocessing parameters are available, reuse those instead.
test_processed <- tl_prepare_data(
  split$test,
  Species ~ .,
  scale_method = "standardize"
)

test_preds_proc <- predict(
  automl_processed$best_model,
  new_data = test_processed$data
)

Comparing AutoML with Manual Selection

# Manual approach: choose one model
manual_model <- tl_model(split$train, Species ~ ., method = "forest")
manual_preds <- predict(manual_model, new_data = split$test)
manual_acc <- mean(manual_preds$.pred == split$test$Species)

# AutoML approach
automl_model <- tl_auto_ml(split$train, Species ~ ., time_budget = 60)
automl_preds <- predict(automl_model$best_model, new_data = split$test)
automl_acc <- mean(automl_preds$.pred == split$test$Species)

cat("Manual Selection:", round(manual_acc * 100, 1), "%\n")
cat("AutoML:", round(automl_acc * 100, 1), "%\n")

Advanced AutoML Strategies

Strategy 1: Iterative AutoML

# First pass: quick exploration
quick_automl <- tl_auto_ml(split$train, Species ~ .,
                           time_budget = 30,
                           use_reduction = TRUE,
                           use_clustering = FALSE)

# Analyze what worked — best model name is in the leaderboard
best_name <- quick_automl$leaderboard$model[1]
best_method <- quick_automl$best_model$spec$method
cat("Best model:", best_name, "(method:", best_method, ")\n")

# Second pass: if a PCA variant won, invest more in reduction
if (grepl("^pca_", best_name)) {
  refined_automl <- tl_auto_ml(split$train, Species ~ .,
                               time_budget = 60,
                               use_reduction = TRUE,
                               use_clustering = TRUE)
}

Strategy 2: Ensemble of AutoML Models

# Get top 3 models
top_models <- automl_iris$leaderboard %>%
  arrange(desc(score)) %>%
  head(3)

# Make predictions with each, as character vectors so that cbind()
# does not coerce factor predictions to their integer codes
ensemble_preds <- list()
for (i in seq_len(nrow(top_models))) {
  model_name <- top_models$model[i]
  model <- automl_iris$models[[model_name]]
  ensemble_preds[[i]] <- as.character(predict(model, new_data = split$test)$.pred)
}

# Majority vote for classification
final_pred <- apply(do.call(cbind, ensemble_preds), 1, function(x) {
  names(which.max(table(x)))
})

ensemble_acc <- mean(final_pred == split$test$Species)
cat("Ensemble Accuracy:", round(ensemble_acc * 100, 1), "%\n")
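The majority vote above only applies to classification. For regression tasks the natural analogue is averaging the top models' numeric predictions, sketched here with made-up `.pred` vectors:

```r
# Regression ensemble: average the numeric predictions of the top models.
preds_a <- c(21.0, 18.5, 30.2)   # hypothetical .pred vectors from
preds_b <- c(20.4, 19.1, 29.8)   # three leaderboard models
preds_c <- c(21.6, 18.2, 30.6)

ensemble_pred <- rowMeans(cbind(preds_a, preds_b, preds_c))
ensemble_pred
#> [1] 21.0 18.6 30.2
```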

Performance Metrics

Classification Metrics

# AutoML automatically uses accuracy for classification
result_class <- tl_auto_ml(iris, Species ~ .,
                           metric = "accuracy",
                           time_budget = 60)

Regression Metrics

# AutoML automatically uses RMSE for regression
result_reg <- tl_auto_ml(mtcars, mpg ~ .,
                         metric = "rmse",
                         time_budget = 60)

Best Practices

  1. Start fast, then expand: Use time_budget = 10 to verify the pipeline runs, then increase to 60–120s for real evaluation, and 300s for final model selection.
  2. Reduce cv_folds before reducing time_budget: Going from 5-fold to 2-fold CV cuts evaluation time by ~60% while still providing out-of-sample estimates. A 30s budget with cv_folds = 2 is often more useful than a 30s budget with the default 5 folds (which will skip CV entirely).
  3. Preprocess when needed: Handle missing values before AutoML.
  4. Split your data: Always evaluate on held-out test data.
  5. Examine multiple models: The “best” model may not always be robust.
  6. Consider ensemble approaches: Combine top models for better performance.
  7. Understand training-set metrics: When CV is skipped (short budgets), the leaderboard uses training-set metrics which are optimistically biased. These are useful for ranking but not for reporting final performance.

When to Use AutoML

Good use cases:

  - Getting a quick, credible baseline on a new dataset
  - Comparing several modeling approaches without hand-writing each one
  - Exploratory analysis under a fixed time budget

Consider manual selection when:

  - You already know which model family suits the problem
  - You need fine-grained control over hyperparameters or preprocessing
  - A single model fit is expensive enough that trying many is impractical

Troubleshooting

AutoML takes longer than time_budget

The budget is checked between model fits. A single random forest or XGBoost fit can take 5–30 seconds depending on data size, and R cannot safely interrupt C-level code mid-execution. To stay closer to the budget:

# 1. Reduce CV folds (biggest impact)
fast_result <- tl_auto_ml(data, formula,
                          cv_folds = 2,
                          time_budget = 30)

# 2. Disable slow phases
baseline_result <- tl_auto_ml(data, formula,
                              use_reduction = FALSE,
                              use_clustering = FALSE,
                              time_budget = 30)

# 3. Use a budget under 30s to skip forest/SVM/XGBoost entirely
quick_result <- tl_auto_ml(data, formula, time_budget = 10)

Leaderboard scores are all NA

This happens when the evaluation metric isn’t found in the results. The most common cause is using training-set evaluation (short budget) where the metric names differ from CV output. Try increasing the budget so that CV runs, or specify the metric explicitly:

result <- tl_auto_ml(data, formula,
                     metric = "accuracy",  # or "rmse" for regression
                     time_budget = 60)

Not enough models tried

# Increase time budget to unlock all phases
thorough_result <- tl_auto_ml(data, formula, time_budget = 300)

# Ensure feature engineering is enabled
full_result <- tl_auto_ml(data, formula,
                          use_reduction = TRUE,
                          use_clustering = TRUE,
                          time_budget = 300)

Summary

tidylearn’s AutoML provides:

  - Automatic task detection from the response variable (task = "auto")
  - A phased search over baseline, PCA, clustered, and advanced models
  - A time budget that bounds the total search time
  - A leaderboard for comparing every model tried

# Complete AutoML workflow
workflow_split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)

automl_result <- tl_auto_ml(
  data = workflow_split$train,
  formula = Species ~ .,
  task = "auto",
  use_reduction = TRUE,
  use_clustering = TRUE,
  time_budget = 120,
  cv_folds = 5
)

# Evaluate best model
final_preds <- predict(automl_result$best_model, new_data = workflow_split$test)
final_accuracy <- mean(final_preds$.pred == workflow_split$test$Species)

cat("Final AutoML Accuracy:", round(final_accuracy * 100, 1), "%\n")
cat("Best approach:", automl_result$best_model$spec$method, "\n")

AutoML makes machine learning accessible and efficient, allowing you to quickly find good solutions while learning which approaches work best for your data.