Introduction to sshist

The sshist package provides a fast and robust method for selecting the optimal number of bins for histograms. It implements the algorithm developed by Shimazaki and Shinomoto (2007), which minimizes the expected L2 loss function between the histogram and the unknown underlying density function.

This method is particularly superior to standard rules (like Sturges or Freedman-Diaconis) when the data exhibits:

library(sshist)

1. One-Dimensional Optimization

Let’s look at the classic “Old Faithful” geyser data. It has two distinct modes (short and long eruptions).

Standard Approach vs. sshist

Standard histograms often hide the structure if the bin width is not chosen carefully.

data(faithful)
x_data <- faithful$eruptions

# 1. Standard R histogram (Sturges rule by default)
par(mfrow = c(1, 2))
hist(x_data, main = "Standard hist()", xlab = "Duration of Eruptions", col = "gray90")

# 2. Shimazaki-Shinomoto Optimization
# The function returns an S3 object with optimal parameters
res <- sshist(x_data)
hist(x_data, breaks=res$edges,
       main=paste("Optimal Hist (N=", res$opt_n, ")"),
       xlab = "Duration of Eruptions", col = "gray90")


par(mfrow = c(1, 1))

As you can see, sshist automatically detects the bimodal nature and sharp peaks much better than the conservative default.

Accessing Parameters

You can extract the optimal number of bins (opt_n) or bin width (opt_d) to use in other plotting libraries like ggplot2.

print(res)
#> Shimazaki-Shinomoto Histogram Optimization
#> ------------------------------------------
#> Optimal Bins (N): 18 
#> Bin Width (D):    0.1944 
#> Cost Minimum:     -2726

# The plot method shows the Cost Function Graph and the Optimal Histogram
plot(res)

The left plot shows the cost function graph. The red dot indicates the global minimum (optimal binning). The right plot shows the resulting histogram.

2. Two-Dimensional Optimization

sshist also supports 2D histograms, solving the problem of choosing bin sizes for X and Y axes simultaneously.

# Run 2D optimization
# This calculates the cost function for a grid of (Nx, Ny) combinations
res_2d <- sshist_2d(iris$Petal.Length, iris$Petal.Width)

# optimal number of bins
print(res_2d)
#> Shimazaki-Shinomoto 2D Histogram Optimization
#> ---------------------------------------------
#> Optimal Bins X:   46 
#> Optimal Bins Y:   23 
#> Bin Width X:      0.1283 
#> Bin Width Y:      0.1043 
#> Cost Minimum:     -1640

# The plot method shows the Cost Function Landscape and the Optimal Histogram
plot(res_2d)

The left plot shows the cost function landscape. The red dot indicates the global minimum (optimal binning). The right plot shows the resulting histogram.

Performance

The core cost function calculation is implemented in C++ (via Rcpp) and parallelized with OpenMP where possible, making it efficient even for large datasets or dense grid searches in 2D.

References