An outlier is an observation, or subset of observations, that appears inconsistent with the rest of the data and does not fit the overall trend. Outliers arise for various reasons, including mistakes in data collection or recording, natural variation, or uncommon but valid data points. In statistics and data science, identifying outliers is an essential preprocessing step for ensuring the robustness and accuracy of analyses. While univariate outliers can often be detected easily, multivariate outliers, which result from complex interactions among multiple variables, are more difficult to identify. If these hidden anomalies are not properly addressed, they can bias conclusions and significantly degrade model performance. Multivariate outlier detection therefore plays a vital role in the statistical analysis of multidimensional data.
The package includes several useful components. It offers three distinct techniques for identifying outliers in multivariate data: Mahalanobis distance, Minimum Covariance Determinant (MCD), and Principal Component Analysis (PCA)-based distance. It uses the Rcpp package’s C++ integration, which speeds up computations and makes handling larger datasets possible. The package also provides pairwise plots that highlight possible outliers, making the results easier to interpret for those without a background in statistics.
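As a rough illustration of the first technique, the sketch below applies the classical Mahalanobis rule in base R. It is not the package's implementation: the chi-squared cutoff at the 0.975 quantile and the names X, d2, cutoff, and flagged are assumptions made for this example.

```r
# Simulated data: 50 standard-normal points plus one planted point at (5, 5)
set.seed(123)
X <- cbind(x = c(rnorm(50), 5), y = c(rnorm(50), 5))

# Squared Mahalanobis distance of every row from the sample mean,
# using the ordinary (non-robust) sample covariance matrix
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Assumed rule: under multivariate normality, d2 is approximately
# chi-squared with p = ncol(X) degrees of freedom
cutoff <- qchisq(0.975, df = ncol(X))

# Row indices whose squared distance exceeds the cutoff
flagged <- which(d2 > cutoff)
flagged
```

Because the sample mean and covariance are themselves pulled toward extreme points, this classical rule can miss outliers that occur in groups; robust estimators such as MCD are designed to reduce that effect.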
The package can be installed directly from GitHub using the devtools package.
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install the MOutliers package from GitHub (with tests)
devtools::install_github(
  "SenuYasara/Multivariate_Outlier_Detection_R_Package",
  INSTALL_opts = "--install-tests"
)
Once installed, the package can be loaded in R as follows:
library(MOutliers)
To run the included unit tests after installation:
library(testthat)
test_package("MOutliers")
These tests confirm that the functions detect_multivariate_outliers() and plot_outliers() behave as expected, producing correct outputs and handling invalid inputs appropriately.
This example demonstrates detecting multivariate outliers using simulated data.
set.seed(123)
df <- data.frame(
  x = c(rnorm(50), 5),
  y = c(rnorm(50), 5)
)
head(df)
#>             x           y
#> 1 -0.56047565  0.25331851
#> 2 -0.23017749 -0.02854676
#> 3  1.55870831 -0.04287046
#> 4  0.07050839  1.36860228
#> 5  0.12928774 -0.22577099
#> 6  1.71506499  1.51647060
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.4151305   FALSE
#> 2 -0.23017749 -0.02854676 0.1188414   FALSE
#> 3  1.55870831 -0.04287046 2.0584218   FALSE
#> 4  0.07050839  1.36860228 1.1817300   FALSE
#> 5  0.12928774 -0.22577099 0.1948453   FALSE
#> 6  1.71506499  1.51647060 2.3906742   FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df, method = "mcd", alpha = 0.975)
head(result_mcd)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.4591213   FALSE
#> 2 -0.23017749 -0.02854676 0.1299266   FALSE
#> 3  1.55870831 -0.04287046 2.5319996   FALSE
#> 4  0.07050839  1.36860228 2.7497316   FALSE
#> 5  0.12928774 -0.22577099 0.2077008   FALSE
#> 6  1.71506499  1.51647060 6.5143416   FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df, method = "pca", alpha = 0.975)
head(result_pca)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.3621629   FALSE
#> 2 -0.23017749 -0.02854676 0.1566441   FALSE
#> 3  1.55870831 -0.04287046 1.6023335   FALSE
#> 4  0.07050839  1.36860228 1.0066602   FALSE
#> 5  0.12928774 -0.22577099 0.1726165   FALSE
#> 6  1.71506499  1.51647060 3.1785118   FALSE
This example demonstrates detecting multivariate outliers using a real dataset (mtcars) with three variables: mpg, hp, and wt.
df_mtcars <- mtcars[, c("mpg", "hp", "wt")]
head(df_mtcars)
#>                    mpg  hp    wt
#> Mazda RX4         21.0 110 2.620
#> Mazda RX4 Wag     21.0 110 2.875
#> Datsun 710        22.8  93 2.320
#> Hornet 4 Drive    21.4 110 3.215
#> Hornet Sportabout 18.7 175 3.440
#> Valiant           18.1 105 3.460
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df_mtcars, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 1.4554908   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.6848547   FALSE
#> Datsun 710        22.8  93 2.320 1.8717032   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.5058688   FALSE
#> Hornet Sportabout 18.7 175 3.440 0.1960802   FALSE
#> Valiant           18.1 105 3.460 2.0085341   FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df_mtcars, method = "mcd", alpha = 0.975)
head(result_mcd)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 1.4032515   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.4356093   FALSE
#> Datsun 710        22.8  93 2.320 1.7928535   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.7528113   FALSE
#> Hornet Sportabout 18.7 175 3.440 1.8629727   FALSE
#> Valiant           18.1 105 3.460 3.1254814   FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df_mtcars, method = "pca", alpha = 0.975)
head(result_pca)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 0.5460497   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.3829775   FALSE
#> Datsun 710        22.8  93 2.320 1.5163542   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.3326773   FALSE
#> Hornet Sportabout 18.7 175 3.440 0.2723783   FALSE
#> Valiant           18.1 105 3.460 0.4647775   FALSE
This example demonstrates visualizing 2D scatterplots for each pair of variables in the dataset using simulated data.
# Mahalanobis Distance
plot_outliers(df, method = "mahalanobis", alpha = 0.975)
# Minimum Covariance Determinant (MCD)
plot_outliers(df, method = "mcd", alpha = 0.975)
This example demonstrates visualizing 2D scatterplots for each pair of variables in the dataset using a real dataset (mtcars) with three variables: mpg, hp, and wt.
# Mahalanobis Distance
plot_outliers(df_mtcars, method = "mahalanobis", alpha = 0.975)
# Minimum Covariance Determinant (MCD)
plot_outliers(df_mtcars, method = "mcd", alpha = 0.975)
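For readers curious how a pairwise plot with highlighted points can be put together, here is a minimal base-R sketch using the same three mtcars variables. It is not how plot_outliers() is implemented: the flagging rule (squared Mahalanobis distance above the 0.975 chi-squared quantile), the colours, and the names X, d2, and is_outlier are assumptions chosen for illustration.

```r
# Same three mtcars variables as in the examples above
X <- mtcars[, c("mpg", "hp", "wt")]

# Assumed flagging rule (not taken from the package source):
# squared Mahalanobis distance above the 0.975 chi-squared quantile
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
is_outlier <- d2 > qchisq(0.975, df = ncol(X))

# One scatterplot per pair of variables; flagged rows are drawn as
# filled red points, all other rows as open grey points
pairs(
  X,
  col = ifelse(is_outlier, "red", "grey40"),
  pch = ifelse(is_outlier, 19, 1),
  main = "Pairwise scatterplots with flagged points in red"
)
```

Passing per-row col and pch vectors to pairs() keeps each observation's styling consistent across all panels, so a flagged point can be followed from one pair of variables to the next.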