| Type: | Package |
| Title: | Fitting Generalized Linear Models Using Sparse Matrices |
| Version: | 0.1.0 |
| Description: | Fits Generalised Linear Models (GLMs) with sparse and dense 'Matrix' matrices for memory efficiency. Acts as a wrapper for the glm4() function in the 'MatrixModels' package <doi:10.32614/CRAN.package.MatrixModels>, but adds convenient model methods and functions designed to mimic those associated with the glm() function from the 'stats' package. |
| URL: | https://github.com/awhug/glm4 |
| BugReports: | https://github.com/awhug/glm4/issues |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | methods, Matrix, MatrixModels, stats |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-04-09 23:40:09 UTC; angus |
| Author: | Angus Hughes |
| Maintainer: | Angus Hughes <angus-hughes+glm4@outlook.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-16 09:30:02 UTC |
Analysis of Deviance for Generalized Linear Model Fits
Description
Compute an analysis of deviance table for one or more generalized linear model fits.
Usage
## S3 method for class 'glm4'
anova(object, ..., dispersion = NULL, test = NULL)
Arguments
object |
an object of class "glm4" |
... |
additional objects of class "glm4" for multi-model comparison |
dispersion |
the dispersion parameter for the fitting family. By default it is obtained from the object(s). |
test |
a character string, (partially) matching one of "Chisq", "LRT", "Rao", "F" or "Cp". See Details. |
Details
Specifying a single object gives a sequential analysis of deviance table for that fit. That is, the reductions in the residual deviance as each term of the formula is added in turn are given as the rows of a table, plus the residual deviances themselves.
If more than one object is specified, the table has a row for the residual degrees of freedom and deviance for each model. For all but the first model, the change in degrees of freedom and deviance is also given. (This only makes statistical sense if the models are nested.) It is conventional to list the models from smallest to largest, but this is up to the user.
The table will optionally contain test statistics (and P values)
comparing the reduction in deviance for the row to the residuals.
For models with known dispersion (e.g., binomial and Poisson fits)
the chi-squared test is most appropriate, and for those with
dispersion estimated by moments (e.g., gaussian,
quasibinomial and quasipoisson fits) the F test is
most appropriate. If the anova method can determine which of these
cases applies then by default it will use one of the above tests.
If the dispersion argument is supplied, the dispersion is
considered known and the chi-squared test will be used.
Argument test=FALSE suppresses the test statistics and P values.
Mallows' C_p statistic is the residual
deviance plus twice the estimate of \sigma^2 times
the residual degrees of freedom, which is closely related to AIC (and
a multiple of it if the dispersion is known).
You can also choose "LRT" and
"Rao" for likelihood ratio tests and Rao's efficient score test.
The former is synonymous with "Chisq" (although both have
an asymptotic chi-square distribution).
The dispersion estimate will be taken from the largest model, using
the value returned by summary.glm. As this will in most
cases use a Chi-squared-based estimate, the F tests are not based on
the residual deviance in the analysis of deviance table shown.
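As a concrete illustration of the single- and multi-model forms described above, here is a sketch using the base stats::glm() fit from '?glm' (the 'glm4' method is designed to mimic this behaviour; the data below are taken from that help page):

```r
# Poisson fit from ?glm (stats); anova for 'glm4' objects mirrors this
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())

# Sequential (single-model) analysis of deviance, chi-squared test
a1 <- anova(glm.D93, test = "Chisq")

# Multi-model comparison against the larger interaction model
glm.D93a <- update(glm.D93, ~ treatment * outcome)
a2 <- anova(glm.D93, glm.D93a, test = "Rao")
```

For the Poisson family the dispersion is known, so the chi-squared test is the natural default here.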
Value
An object of class "anova" inheriting from class "data.frame".
Warning
The comparison between two or more models will only be valid if they
are fitted to the same dataset. This may be a problem if there are
missing values and R's default of na.action = na.omit is used,
and anova will detect this with an error.
References
Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
See Also
drop1 for
so-called ‘type II’ ANOVA where each term is dropped one at a
time respecting their hierarchy.
Examples
## --- Continuing the Example from '?glm':
anova(glm.D93, test = FALSE)
anova(glm.D93, test = "Cp")
anova(glm.D93, test = "Chisq")
glm.D93a <-
update(glm.D93, ~treatment*outcome) # equivalent to Pearson Chi-square
anova(glm.D93, glm.D93a, test = "Rao")
Case and Variable Names of Fitted Models
Description
Simple utilities returning (non-missing) case names, and (non-eliminated) variable names.
Usage
## S3 method for class 'glm4'
case.names(object, full = FALSE, ...)
Arguments
object |
an R object, typically a fitted model. |
full |
logical; if TRUE, all names are returned, including those of cases with zero weight. |
... |
further arguments passed to or from other methods. |
Value
A character vector.
See Also
lm; further, all.names,
all.vars for functions with a similar name but only slightly
related purpose.
Examples
x <- 1:20
y <- setNames(x + (x/4 - 2)^3 + rnorm(20, sd = 3),
paste("O", x, sep = "."))
ww <- rep(1, 20); ww[13] <- 0
summary(lmxy <- lm(y ~ x + I(x^2)+I(x^3) + I((x-10)^2), weights = ww),
correlation = TRUE)
variable.names(lmxy)
variable.names(lmxy, full = TRUE) # includes the last
case.names(lmxy)
case.names(lmxy, full = TRUE) # includes the 0-weight case
Confidence Intervals for Model Parameters
Description
Computes confidence intervals for one or more parameters in a fitted
model. There is a default and a method for objects inheriting from class
"lm".
Usage
## S3 method for class 'glm4'
confint(object, parm, level = 0.95, ...)
Arguments
object |
a fitted model object. |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
... |
additional argument(s) for methods. |
Details
confint is a generic function. The default method assumes
normality, and needs suitable coef and
vcov methods to be available. The default method can be
called directly for comparison with other methods.
For objects of class "lm" the direct formulae based on t
values are used.
Methods for classes "glm"
and "nls" call the appropriate profile method,
then find the confidence intervals by interpolation in the profile
traces. If the profile object is already available it can be used
as the main argument rather than the fitted model object itself.
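The default normal-approximation interval can be reproduced directly from coef() and vcov(), which is a useful check on the machinery described above (a sketch using a base stats::glm() fit):

```r
# Poisson fit (data from ?glm); confint.default assumes asymptotic normality
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit <- glm(counts ~ outcome + treatment, family = poisson())

# Wald interval: estimate +/- z * standard error
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
manual <- cbind(est + qnorm(0.025) * se, est + qnorm(0.975) * se)

wald <- confint.default(fit)  # same construction, nicely labelled
```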
Value
A matrix (or vector) with columns giving lower and upper confidence limits for each parameter. These will be labelled as (1-level)/2 and 1 - (1-level)/2 in % (by default 2.5% and 97.5%).
References
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
See Also
Original versions: confint.glm and
confint.nls in package MASS.
Examples
fit <- lm(100/mpg ~ disp + hp + wt + am, data = mtcars)
confint(fit)
confint(fit, "wt")
## from example(glm)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3, 1, 9); treatment <- gl(3, 3)
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
confint(glm.D93)
confint.default(glm.D93) # based on asymptotic normality
Regression Deletion Diagnostics
Description
This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models discussed in Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982), etc.
Usage
## S3 method for class 'glm4'
cooks.distance(model, infl = influence.glm4(model), ...)
Arguments
model |
a fitted 'glm4' model object. |
infl |
influence structure as returned by influence.glm4(). |
... |
further arguments passed to or from other methods. |
Details
The primary high-level function is influence.measures, which produces an
object of class "infl", a tabular display showing the DFBETAs for
each model variable, DFFITs, covariance ratios, Cook's distances and
the diagonal elements of the hat matrix. Cases which are influential
with respect to any of these measures are marked with an asterisk.
The functions dfbetas, dffits,
covratio and cooks.distance provide direct access to the
corresponding diagnostic quantities. Functions rstandard and
rstudent give the standardized and Studentized residuals
respectively. (These re-normalize the residuals to have unit variance,
using an overall and leave-one-out measure of the error variance
respectively.)
Note that for multivariate lm() models (of class
"mlm"), these functions return 3d arrays instead of matrices,
or matrices instead of vectors.
Values for generalized linear models are approximations, as described
in Williams (1987) (except that Cook's distances are scaled as
F rather than as chi-square values). The approximations can be
poor when some cases have large influence.
The optional infl, res and sd arguments are there
to encourage the use of these direct access functions, in situations
where, e.g., the underlying basic influence measures (from
lm.influence or the generic influence) are
already available.
Note that cases with weights == 0 are dropped from all
these functions, but that if a linear model has been fitted with
na.action = na.exclude, suitable values are filled in for the
cases excluded during fitting.
For linear models, rstandard(*, type = "predictive") provides
leave-one-out cross validation residuals, and the “PRESS”
statistic (PREdictive Sum of Squares, the same as
the CV score) of model model is
PRESS <- sum(rstandard(model, type="pred")^2)
The function hat() exists mainly for S (version 2)
compatibility; we recommend using hatvalues() instead.
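For an ordinary linear model, Cook's distance can be recovered exactly from the standardized residuals and hat values, which shows how the pieces above fit together (a base-R sketch on a built-in dataset, not specific to this package):

```r
# Cook's distance from its building blocks, for a linear model
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
h <- hatvalues(fit)    # leverages: diagonal of the hat matrix
r <- rstandard(fit)    # standardized residuals
p <- length(coef(fit)) # number of estimated coefficients

# D_i = r_i^2 * h_i / (p * (1 - h_i))
manual <- r^2 * h / (p * (1 - h))
```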
Value
A numeric vector of Cook's distances, one per observation.
Note
For hatvalues, dfbeta, and dfbetas, the method
for linear models also works for generalized linear models.
Author(s)
Several R core team members and John Fox, originally in his ‘car’ package.
References
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman and Hall.
Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. Applied Statistics, 36, 181–191. doi:10.2307/2347550
See Also
influence (containing lm.influence).
‘plotmath’ for the use of hat in plot annotation.
Examples
require(graphics)
## Analysis of the life-cycle savings data
## given in Belsley, Kuh and Welsch.
lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
inflm.SR <- influence.measures(lm.SR)
which(apply(inflm.SR$is.inf, 1, any))
# which observations 'are' influential
summary(inflm.SR) # only these
inflm.SR # all
plot(rstudent(lm.SR) ~ hatvalues(lm.SR)) # recommended by some
plot(lm.SR, which = 5) # an enhanced version of that via plot(<lm>)
## The 'infl' argument is not needed, but avoids recomputation:
rs <- rstandard(lm.SR)
iflSR <- influence(lm.SR)
all.equal(rs, rstandard(lm.SR, infl = iflSR), tolerance = 1e-10)
## to "see" the larger values:
1000 * round(dfbetas(lm.SR, infl = iflSR), 3)
cat("PRESS :"); (PRESS <- sum( rstandard(lm.SR, type = "predictive")^2 ))
stopifnot(all.equal(PRESS, sum( (residuals(lm.SR) / (1 - iflSR$hat))^2)))
## Show that "PRE-residuals" == L.O.O. Crossvalidation (CV) errors:
X <- model.matrix(lm.SR)
y <- model.response(model.frame(lm.SR))
## Leave-one-out CV least-squares prediction errors (relatively fast)
rCV <- vapply(seq_len(nrow(X)), function(i)
y[i] - X[i,] %*% .lm.fit(X[-i,], y[-i])$coefficients,
numeric(1))
## are the same as the *faster* rstandard(*, "pred") :
stopifnot(all.equal(rCV, unname(rstandard(lm.SR, type = "predictive"))))
## Huber's data [Atkinson 1985]
xh <- c(-4:0, 10)
yh <- c(2.48, .73, -.04, -1.44, -1.32, 0)
lmH <- lm(yh ~ xh)
summary(lmH)
im <- influence.measures(lmH)
im
is.inf <- apply(im$is.inf, 1, any)
plot(xh,yh, main = "Huber's data: L.S. line and influential obs.")
abline(lmH); points(xh[is.inf], yh[is.inf], pch = 20, col = 2)
## Irwin's data [Williams 1987]
xi <- 1:5
yi <- c(0,2,14,19,30) # number of mice responding to dose xi
mi <- rep(40, 5) # number of mice exposed
glmI <- glm(cbind(yi, mi -yi) ~ xi, family = binomial)
summary(glmI)
signif(cooks.distance(glmI), 3) # ~= Ci in Table 3, p.184
imI <- influence.measures(glmI)
imI
stopifnot(all.equal(imI$infmat[,"cook.d"],
cooks.distance(glmI)))
Model Deviance
Description
Returns the deviance of a fitted model object.
Usage
## S3 method for class 'glm4'
deviance(object, ...)
Arguments
object |
an object for which the deviance is desired. |
... |
additional optional arguments. |
Details
This is a generic function which can be used to extract deviances for fitted models. Consult the individual modeling functions for details on how to use this function.
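For a Gaussian family the deviance is simply the residual sum of squares, which gives a quick sanity check (a sketch using base stats::glm(), since the 'glm4' method is designed to match this behaviour):

```r
# For family = gaussian(), deviance == residual sum of squares
fit <- glm(mpg ~ cyl + wt, data = mtcars, family = gaussian())
rss <- sum(residuals(fit)^2)
dev <- deviance(fit)
```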
Value
A numeric scalar giving the model deviance.
References
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
See Also
df.residual,
extractAIC,
glm,
lm.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
deviance(fit)
Residual Degrees-of-Freedom
Description
Returns the residual degrees-of-freedom extracted from a fitted model object.
Usage
## S3 method for class 'glm4'
df.residual(object, ...)
Arguments
object |
an object for which the degrees-of-freedom are desired. |
... |
additional optional arguments. |
Details
This is a generic function which can be used to extract residual degrees-of-freedom for fitted models. Consult the individual modeling functions for details on how to use this function.
The default method just extracts the df.residual component.
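The residual degrees of freedom are the number of observations minus the number of estimated coefficients; a sketch with base stats::glm(), whose behaviour the 'glm4' method mimics:

```r
fit <- glm(mpg ~ cyl + wt, data = mtcars, family = gaussian())
n <- nobs(fit)         # 32 observations in mtcars
p <- length(coef(fit)) # intercept plus two slopes
```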
Value
An integer giving the residual degrees of freedom.
See Also
deviance, glm4.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
df.residual(fit)
Extract AIC from a Fitted Model
Description
Computes the (generalized) Akaike An Information Criterion for a fitted parametric model.
Usage
## S3 method for class 'glm4'
extractAIC(fit, scale = 0, k = 2, ...)
Arguments
fit |
fitted model, usually the result of a fitter like glm4. |
scale |
optional numeric specifying the scale parameter of the model; see scale in step. |
k |
numeric specifying the ‘weight’ of the equivalent degrees of freedom (edf) part in the AIC formula. |
... |
further arguments (currently unused in base R). |
Details
This is a generic function, with methods in base R for classes
"aov", "glm" and "lm" as well as for
"negbin" (package MASS) and "coxph" and
"survreg" (package survival).
The criterion used is
AIC = - 2\log L + k \times \mbox{edf},
where L is the likelihood and edf the equivalent degrees
of freedom (i.e., the number of free parameters for usual parametric
models) of fit.
For linear models with unknown scale (i.e., for lm and
aov), -2\log L is computed from the
deviance and uses a different additive constant to
logLik and hence AIC. If RSS
denotes the (weighted) residual sum of squares then extractAIC
uses for - 2\log L the formulae RSS/s - n (corresponding
to Mallows' C_p) in the case of known scale s and
n \log (RSS/n) for unknown scale.
AIC only handles unknown scale and uses the formula
n \log (RSS/n) + n + n \log 2\pi - \sum \log w
where w are the weights. Further AIC counts the scale
estimation as a parameter in the edf and extractAIC does not.
For glm fits the family's aic() function is used to
compute the AIC: see the note under logLik about the
assumptions this makes.
k = 2 corresponds to the traditional AIC, using k =
log(n) provides the BIC (Bayesian IC) instead.
Note that the methods for this function may differ in their
assumptions from those of methods for AIC (usually
via a method for logLik). We have already
mentioned the case of "lm" models with estimated scale, and
there are similar issues in the "glm" and "negbin"
methods where the dispersion parameter may or may not be taken as
‘free’. This is immaterial as extractAIC is only used
to compare models of the same class (where only differences in AIC
values are considered).
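The unknown-scale formula above can be checked directly for a linear model (a base-R sketch; extractAIC for "lm" fits uses n log(RSS/n) + k * edf with the default k = 2):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)       # equivalent degrees of freedom
aic <- n * log(rss / n) + 2 * edf # k = 2: the traditional AIC
out <- extractAIC(fit)            # c(edf, AIC)
```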
Value
A numeric vector of length 2, with first and second elements giving
edf |
the ‘equivalent degrees of freedom’
for the fitted model |
AIC |
the (generalized) Akaike Information Criterion for fit. |
Note
This function is used in add1, drop1
and step and the similar functions in package
MASS from which it was adopted.
Author(s)
B. D. Ripley
References
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. New York: Springer (4th ed).
See Also
AIC, deviance, add1, step.
Examples
utils::example(glm)
extractAIC(glm.D93) #>> 5 15.129
Family Objects for Models
Description
Family objects provide a convenient way to specify the details of the
models used by functions such as glm. See the
documentation for glm for the details on how such model
fitting takes place.
Usage
## S3 method for class 'glm4'
family(object, ...)
Arguments
object |
a fitted model object (here, of class 'glm4') from which to extract the family. |
... |
further arguments passed to methods. |
Details
family is a generic function with methods for classes
"glm" and "lm" (the latter returning gaussian()).
For the binomial and quasibinomial families the response
can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
The quasibinomial and quasipoisson families differ from
the binomial and poisson families only in that the
dispersion parameter is not fixed at one, so they can model
over-dispersion. For the binomial case see
McCullagh and Nelder (1989, pp. 124–8).
Although they show that there is (under some
restrictions) a model with
variance proportional to mean as in the quasi-binomial model, note
that glm does not compute maximum-likelihood estimates in that
model. The behaviour of S is closer to the quasi- variants.
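The returned family object bundles the link, inverse link and variance functions directly, as the Value section below describes; for example, for binomial() (a base-R sketch):

```r
fam <- binomial()        # logit link by default
# The link maps means to the linear predictor; linkinv inverts it
eta <- fam$linkfun(0.5)  # logit(0.5) = 0
mu  <- fam$linkinv(0)    # inverse logit of 0 = 0.5
v   <- fam$variance(0.25)  # binomial variance function: mu * (1 - mu)
```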
Value
An object of class "family" (which has a concise print method).
This is a list with elements
family |
character: the family name. |
link |
character: the link name. |
linkfun |
function: the link. |
linkinv |
function: the inverse of the link function. |
variance |
function: the variance as a function of the mean. |
dev.resids |
function giving the deviance for each observation
as a function of (y, mu, wt). |
aic |
function giving the AIC value if appropriate (but NA for the quasi- families). |
mu.eta |
function: derivative of the inverse-link function
with respect to the linear predictor, i.e., d(mu)/d(eta). |
initialize |
expression. This needs to set up whatever data
objects are needed for the family as well as n and mustart (see glm). |
validmu |
logical function. Returns TRUE if a mean vector mu is within the domain of variance. |
valideta |
logical function. Returns TRUE if a linear predictor eta is within the domain of linkinv. |
simulate |
(optional) function simulate(object, nsim) to be called by the "lm" method of simulate. |
dispersion |
(optional since R version 4.3.0) numeric: value of the
dispersion parameter, if fixed, or NA_real_ if free. |
Note
The link and variance arguments have rather awkward
semantics for back-compatibility. The recommended way is to supply
them as quoted character strings, but they can also be supplied
unquoted (as names or expressions). Additionally, they can be
supplied as a length-one character vector giving the name of one of
the options, or as a list (for link, of class
"link-glm"). The restrictions apply only to links given as
names: when given as a character string all the links known to
make.link are accepted.
This is potentially ambiguous: supplying link = logit could mean
the unquoted name of a link or the value of object logit. It
is interpreted if possible as the name of an allowed link, then
as an object. (You can force the interpretation to always be the value of
an object via logit[1].)
Author(s)
The design was inspired by S functions of the same names described
in Hastie & Pregibon (1992) (except quasibinomial and
quasipoisson).
References
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
Dobson, A. J. (1983) An Introduction to Statistical Modelling. London: Chapman and Hall.
Cox, D. R. and Snell, E. J. (1981). Applied Statistics; Principles and Examples. London: Chapman and Hall.
Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
See Also
For binomial coefficients, choose;
the binomial and negative binomial distributions,
Binomial, and NegBinomial.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
family(fit)
Model Formulae
Description
The generic function formula and its specific methods provide a
way of extracting formulae which have been included in other objects.
as.formula is almost identical, additionally preserving
attributes when object already inherits from
"formula".
Usage
## S3 method for class 'glm4'
formula(x, ...)
Arguments
x |
a fitted 'glm4' model object (or another R object possessing a formula). |
... |
further arguments passed to or from other methods. |
Details
The models fitted by, e.g., the lm and glm
functions are specified in a compact symbolic form.
The ~ operator is basic in the formation of such models.
An expression of the form y ~ model is interpreted
as a specification that the response y is modelled
by a linear predictor specified symbolically by model.
Such a model consists of a series of terms separated
by + operators.
The terms themselves consist of variable and factor
names separated by : operators.
Such a term is interpreted as the interaction of
all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are
useful in model formulae.
The * operator denotes factor crossing: a*b is interpreted as a + b + a:b.
The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c), which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions.
The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b.
The / operator provides a shorthand, so that a / b is equivalent to a + b %in% a.
The - operator removes the specified terms, hence (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
While formulae usually involve just variable and factor
names, they can also involve arithmetic expressions.
The formula log(y) ~ a + log(x) is quite legal.
When such arithmetic expressions involve
operators which are also used symbolically
in model formulae, there can be confusion between
arithmetic and symbolic operator use.
To avoid this confusion, the function I()
can be used to bracket those portions of a model
formula where the operators are used in their
arithmetic sense. For example, in the formula
y ~ a + I(b+c), the term b+c is to be
interpreted as the sum of b and c.
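The effect of I() is visible in the term labels that terms() reports (a base-R sketch):

```r
# Without I(), b + c would be read as two separate model terms
f1 <- y ~ a + b + c
f2 <- y ~ a + I(b + c)  # b + c taken as one arithmetic term
t1 <- attr(terms(f1), "term.labels")
t2 <- attr(terms(f2), "term.labels")
```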
Variable names can be quoted by backticks `like this` in
formulae, although there is no guarantee that all code using formulae
will accept such non-syntactic names.
Most model-fitting functions accept formulae with right-hand-side
including the function offset to indicate terms with a
fixed coefficient of one. Some functions accept other
‘specials’ such as strata or cluster (see the
specials argument of terms.formula).
There are two special interpretations of . in a formula. The
usual one is in the context of a data argument of model
fitting functions and means ‘all columns not otherwise in the
formula’: see terms.formula. In the context of
update.formula, only, it means ‘what was
previously in this part of the formula’.
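The update.formula meaning of . can be sketched in one line of base R:

```r
# '.' in update.formula means "what was previously in this part"
f  <- y ~ x1 + x2
f2 <- update(f, . ~ . + x3)  # keep the old RHS and add x3
```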
When formula is called on a fitted model object, either a
specific method is used (such as that for class "nls") or the
default method. The default first looks for a "formula"
component of the object (and evaluates it), then a "terms"
component, then a formula parameter of the call (and evaluates
its value) and finally a "formula" attribute.
There is a formula method for data frames. When there is a
"terms" attribute with a formula, e.g., for a
model.frame(), that formula is returned. If you'd like the
previous (R <= 3.5.x) behavior, use the auxiliary
DF2formula() which does not consider a "terms" attribute.
Otherwise, if
there is only
one column this forms the RHS with an empty LHS. For more columns,
the first column is the LHS of the formula and the remaining columns
separated by + form the RHS.
Value
All the functions above produce an object of class "formula"
which contains a symbolic model formula.
Environments
A formula object has an associated environment, and
this environment (rather than the parent
environment) is used by model.frame to evaluate variables
that are not found in the supplied data argument.
Formulas created with the ~ operator use the
environment in which they were created. Formulas created with
as.formula will use the env argument for their
environment.
Note
In R versions up to 3.6.0, character vectors x of length
greater than one were parsed as separate lines of R code and the first
complete expression was evaluated into a formula when possible. This
silently truncated such vectors, inefficiently and to some
extent inconsistently, as this behaviour had been undocumented. For this
reason, such use has been deprecated. If you must work via character
x, do use a string, i.e., a character vector of length one.
E.g., eval(call("~", quote(foo + bar))) has been an order of magnitude
more efficient
than formula(c("~", "foo + bar")).
Further, character “expressions” needing an eval()
to return a formula are now deprecated.
References
Chambers, J. M. and Hastie, T. J. (1992) Statistical models. Chapter 2 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
See Also
For formula manipulation: update.formula,
terms.formula, and all.vars.
For typical use: lm, glm, and
coplot.
For formula construction: reformulate.
Examples
class(fo <- y ~ x1*x2) # "formula"
fo
typeof(fo) # R internal : "language"
terms(fo)
environment(fo)
environment(as.formula("y ~ x"))
environment(as.formula("y ~ x", env = new.env()))
## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
## Equivalent with reformulate():
fmla2 <- reformulate(xnam, response = "y")
stopifnot(identical(fmla, fmla2))
Fitting Generalized Linear Models Using Sparse Matrices
Description
'glm4()' is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
It is very similar to the standard 'stats::glm()' function, but supports sparse matrices via the Matrix package, which can dramatically improve memory and computational efficiency on large and/or high-dimensional data.
Usage
glm4(formula, data, ...)
Arguments
formula |
an object of class 'formula' (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
an optional data frame, list or environment (or object
coercible by as.data.frame to a data frame) containing the variables in the model. |
... |
additional arguments passed on to fitter functions; currently unused. |
Details
This function is a wrapper for 'MatrixModels::glm4()' which returns a more user-friendly object designed to resemble the output of 'stats::glm()' as closely as possible. Behind the scenes, it extracts the relevant model details from the S4 object returned by 'MatrixModels::glm4()' and calculates new ones where necessary (e.g., AIC, deviance, residual degrees of freedom), as per 'stats::glm()'.
Sparse matrix storage is not enabled by default; pass 'sparse = TRUE' to use it.
Value
A list object of class 'glm4'. See 'stats::glm()' for more details on returned components.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
print(fit)
Regression Deletion Diagnostics
Description
This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models discussed in Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982), etc.
Usage
## S3 method for class 'glm4'
hatvalues(model, batch_size = 1000L, verbose = FALSE, ...)
Arguments
model |
a fitted 'glm4' model object. |
batch_size |
integer; number of rows processed per batch when the model matrix is sparse. Reduce if memory is limited. |
verbose |
logical; if 'TRUE', a progress bar is printed during the batched sparse computation. |
... |
further arguments passed to or from other methods. |
Details
The primary high-level function is influence.measures, which produces an
object of class "infl", a tabular display showing the DFBETAs for
each model variable, DFFITs, covariance ratios, Cook's distances and
the diagonal elements of the hat matrix. Cases which are influential
with respect to any of these measures are marked with an asterisk.
The functions dfbetas, dffits,
covratio and cooks.distance provide direct access to the
corresponding diagnostic quantities. Functions rstandard and
rstudent give the standardized and Studentized residuals
respectively. (These re-normalize the residuals to have unit variance,
using an overall and leave-one-out measure of the error variance
respectively.)
Note that for multivariate lm() models (of class
"mlm"), these functions return 3d arrays instead of matrices,
or matrices instead of vectors.
Values for generalized linear models are approximations, as described
in Williams (1987) (except that Cook's distances are scaled as
F rather than as chi-square values). The approximations can be
poor when some cases have large influence.
The optional infl, res and sd arguments are there
to encourage the use of these direct access functions, in situations
where, e.g., the underlying basic influence measures (from
lm.influence or the generic influence) are
already available.
Note that cases with weights == 0 are dropped from all
these functions, but that if a linear model has been fitted with
na.action = na.exclude, suitable values are filled in for the
cases excluded during fitting.
For linear models, rstandard(*, type = "predictive") provides
leave-one-out cross validation residuals, and the “PRESS”
statistic (PREdictive Sum of Squares, the same as
the CV score) of model model is
PRESS <- sum(rstandard(model, type="pred")^2)
The function hat() exists mainly for S (version 2)
compatibility; we recommend using hatvalues() instead.
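A useful check on hat values is that they sum to the number of estimated coefficients (the trace of the hat matrix); a base-R sketch on a linear model, since the 'glm4' method is designed to agree with the base behaviour:

```r
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
h <- hatvalues(fit)
p <- length(coef(fit))
# Each leverage lies in (0, 1]; together they sum to p
```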
Value
A numeric vector of hat values (leverages), one per observation.
Note
For hatvalues, dfbeta, and dfbetas, the method
for linear models also works for generalized linear models.
Author(s)
Several R core team members and John Fox, originally in his ‘car’ package.
References
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman and Hall.
Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. Applied Statistics, 36, 181–191. doi:10.2307/2347550
See Also
influence (containing lm.influence).
‘plotmath’ for the use of hat in plot annotation.
Examples
require(graphics)
## Analysis of the life-cycle savings data
## given in Belsley, Kuh and Welsch.
lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
inflm.SR <- influence.measures(lm.SR)
which(apply(inflm.SR$is.inf, 1, any))
# which observations 'are' influential
summary(inflm.SR) # only these
inflm.SR # all
plot(rstudent(lm.SR) ~ hatvalues(lm.SR)) # recommended by some
plot(lm.SR, which = 5) # an enhanced version of that via plot(<lm>)
## The 'infl' argument is not needed, but avoids recomputation:
rs <- rstandard(lm.SR)
iflSR <- influence(lm.SR)
all.equal(rs, rstandard(lm.SR, infl = iflSR), tolerance = 1e-10)
## to "see" the larger values:
1000 * round(dfbetas(lm.SR, infl = iflSR), 3)
cat("PRESS :"); (PRESS <- sum( rstandard(lm.SR, type = "predictive")^2 ))
stopifnot(all.equal(PRESS, sum( (residuals(lm.SR) / (1 - iflSR$hat))^2)))
## Show that "PRE-residuals" == L.O.O. Crossvalidation (CV) errors:
X <- model.matrix(lm.SR)
y <- model.response(model.frame(lm.SR))
## Leave-one-out CV least-squares prediction errors (relatively fast)
rCV <- vapply(seq_len(nrow(X)), function(i)
y[i] - X[i,] %*% .lm.fit(X[-i,], y[-i])$coefficients,
numeric(1))
## are the same as the *faster* rstandard(*, "pred") :
stopifnot(all.equal(rCV, unname(rstandard(lm.SR, type = "predictive"))))
## Huber's data [Atkinson 1985]
xh <- c(-4:0, 10)
yh <- c(2.48, .73, -.04, -1.44, -1.32, 0)
lmH <- lm(yh ~ xh)
summary(lmH)
im <- influence.measures(lmH)
im
is.inf <- apply(im$is.inf, 1, any)
plot(xh,yh, main = "Huber's data: L.S. line and influential obs.")
abline(lmH); points(xh[is.inf], yh[is.inf], pch = 20, col = 2)
## Irwin's data [Williams 1987]
xi <- 1:5
yi <- c(0,2,14,19,30) # number of mice responding to dose xi
mi <- rep(40, 5) # number of mice exposed
glmI <- glm(cbind(yi, mi - yi) ~ xi, family = binomial)
summary(glmI)
signif(cooks.distance(glmI), 3) # ~= Ci in Table 3, p.184
imI <- influence.measures(glmI)
imI
stopifnot(all.equal(imI$infmat[,"cook.d"],
cooks.distance(glmI)))
Regression Diagnostics
Description
This function provides the basic quantities which are used in forming a wide variety of diagnostics for checking the quality of regression fits.
Usage
## S3 method for class 'glm4'
influence(model, batch_size = 1000L, verbose = FALSE, ...)
Arguments
model |
an object of class "glm4". |
batch_size |
integer; passed to 'hatvalues.glm4()'. |
verbose |
logical; passed to 'hatvalues.glm4()'. |
... |
further arguments passed to or from other methods. |
Details
The influence.measures() and other functions listed in
See Also provide a more user-oriented way of computing a
variety of regression diagnostics. These all build on
lm.influence. Note that for GLMs (other than the Gaussian
family with identity link) these are based on one-step approximations
which may be inadequate if a case has high influence.
An attempt is made to ensure that computed hat values that are
probably one are treated as one, and the corresponding rows in
sigma and coefficients are NaN. (Dropping such a
case would normally result in a variable being dropped, so it is not
possible to give simple drop-one diagnostics.)
naresid is applied to the results and so will fill in
with NAs if the fit had na.action = na.exclude.
qr.influence() is a low-level interface to parts of
lm.influence(*, do.coef = FALSE) provided for cases where speed
is more important than user safety.
Value
A list containing the following components of the same length or
number of rows n, which is the number of non-zero weights.
Cases omitted in the fit are omitted unless a na.action
method was used (such as na.exclude) which restores them.
hat |
a vector containing the diagonal of the ‘hat’ matrix. |
coefficients |
(unless do.coef is false) a matrix whose i-th row contains the change in the estimated coefficients which results when the i-th case is dropped from the regression. |
sigma |
a vector whose i-th element contains the estimate
of the residual standard deviation obtained when the i-th
case is dropped from the regression. (The approximations needed for
GLMs can result in this being NaN.) |
wt.res |
a vector of weighted (or for class glm rather deviance) residuals. |
qr.influence() returns a list with the two components hat and
sigma, as above but without names.
Note
The coefficients returned by the R version
of lm.influence differ from those computed by S.
Rather than returning the coefficients which result
from dropping each case, we return the changes in the coefficients.
This is more directly useful in many diagnostic measures.
Since these need O(n p^2) computing time, they can be omitted by
setting do.coef = FALSE.
Note that cases with weights == 0 are dropped (contrary
to the situation in S).
If a model has been fitted with na.action = na.exclude (see
na.exclude), cases excluded in the fit are
considered here.
References
See the list in the documentation for influence.measures.
Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
See Also
summary.lm for summary and related methods;
influence.measures,
hat for the hat matrix diagonals,
dfbetas,
dffits,
covratio,
cooks.distance,
lm.
Examples
## Analysis of the life-cycle savings data
## given in Belsley, Kuh and Welsch.
summary(lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi,
data = LifeCycleSavings),
correlation = TRUE)
utils::str(lmI <- lm.influence(lm.SR))
qRes <- qr(lm.SR) # == lm.SR $ qr
qrI <- qr.influence(qRes, residuals(lm.SR))
strip <- function(x) lapply(lapply(x, unname), drop)
stopifnot(identical(strip(qrI),
strip(lmI[c("hat", "sigma")])))
## For more "user level" examples, use example(influence.measures)
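A minimal sketch of the 'glm4' method itself, assuming the package is attached (the batch_size value here is purely illustrative):

```r
## influence() for a glm4 fit; batch_size and verbose are forwarded
## to hatvalues.glm4()
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
infl <- influence(fit, batch_size = 500L, verbose = FALSE)
str(infl$hat)  # diagonal of the 'hat' matrix, one value per observation
```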
Extract Log-Likelihood
Description
This function is generic; method functions can be written to handle
specific classes of objects. Classes which have methods for this
function include: "glm", "lm", "nls" and
"Arima". Packages contain methods for other classes, such as
"fitdistr", "negbin" and "polr" in package
MASS, "multinom" in package nnet and
"gls", "gnls", "lme" and others in package
nlme.
Usage
## S3 method for class 'glm4'
logLik(object, ...)
Arguments
object |
any object from which a log-likelihood value, or a contribution to a log-likelihood value, can be extracted. |
... |
some methods for this generic function require additional arguments. |
Details
logLik is most commonly used for a model fitted by maximum
likelihood, and some uses, e.g. by AIC, assume
this. So care is needed where other fit criteria have been used, for
example REML (the default for "lme").
For a "glm" fit the family does not have to
specify how to calculate the log-likelihood, so this is based on using
the family's aic() function to compute the AIC. For the
gaussian, Gamma and
inverse.gaussian families it is assumed that the dispersion
of the GLM is estimated and has been counted as a parameter in the AIC
value, and for all other families it is assumed that the dispersion is
known. Note that this procedure does not give the maximized
likelihood for "glm" fits from the Gamma and inverse gaussian
families, as the estimate of dispersion used is not the MLE.
For "lm" fits it is assumed that the scale has been estimated
(by maximum likelihood or REML), and all the constants in the
log-likelihood are included. That method is only applicable to
single-response fits.
Value
Returns an object of class logLik. This is a number with at
least one attribute, "df" (degrees of freedom),
giving the number of (estimated) parameters in the model.
There is a simple print method for "logLik" objects.
There may be other attributes depending on the method used: see the
appropriate documentation. One that is used by several methods is
"nobs", the number of observations used in estimation (after
the restrictions if REML = TRUE).
Author(s)
José Pinheiro and Douglas Bates
References
Harville, D.A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61, 383–385. doi:10.2307/2334370
See Also
logLik.gls, logLik.lme, in
package nlme, etc.
Examples
NULL
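A minimal sketch mirroring the examples on the other method pages (assumes the glm4 package is attached):

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
logLik(fit)  # a "logLik" object with a "df" attribute
AIC(fit)     # computed from logLik(), counting the estimated dispersion
```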
Extracting the Model Frame from a Formula or Fit
Description
model.frame (a generic function) and its methods return a
data.frame with the variables needed to use
formula and any ... arguments.
Usage
## S3 method for class 'glm4'
model.frame(formula, ...)
Arguments
formula |
a model formula or terms object or an R object. |
... |
for model.frame methods, a mix of further arguments such as data, na.action, subset to pass to the default method. |
Details
Exactly what happens depends on the class and attributes of the object
formula. If this is an object of fitted-model class such as
"lm", the method will either return the saved model frame
used when fitting the model (if any, often selected by argument
model = TRUE) or pass the call used when fitting on to the
default method. The default method itself can cope with rather
standard model objects such as those of class
"lqs" from package MASS if no other
arguments are supplied.
The rest of this section applies only to the default method.
If either formula or data is already a model frame (a
data frame with a "terms" attribute) and the other is missing,
the model frame is returned. Unless formula is a terms object,
as.formula and then terms is called on it. (If you wish
to use the keep.order argument of terms.formula, pass a
terms object rather than a formula.)
Row names for the model frame are taken from the data argument
if present, then from the names of the response in the formula (or
rownames if it is a matrix), if there is one.
All the variables in formula, subset and in ...
are looked for first in data and then in the environment of
formula (see the help for formula() for further
details) and collected into a data frame. Then the subset
expression is evaluated, and it is used as a row index to the data
frame. Then the na.action function is applied to the data frame
(and may well add attributes). The levels of any factors in the data
frame are adjusted according to the drop.unused.levels and
xlev arguments: if xlev specifies a factor and a
character variable is found, it is converted to a factor (as from R
2.10.0).
Because variables in the formula are evaluated before rows are dropped
based on subset, the characteristics of data-dependent bases such
as orthogonal polynomials (i.e. from terms using poly) or
splines will be computed based on the full data set rather than the
subsetted one.
Unless na.action = NULL, time-series attributes will be removed
from the variables found (since they will be wrong if NAs are
removed).
Note that all the variables in the formula are included in the
data frame, even those preceded by -.
Only variables whose type is raw, logical, integer, real, complex or character can be included in a model frame: this includes classed variables such as factors (whose underlying type is integer), but excludes lists.
get_all_vars returns a data.frame containing the
variables used in formula plus those specified in ...
which are recycled to the number of data frame rows.
Unlike model.frame.default, it returns the input variables and
not those resulting from function calls in formula.
Value
A data.frame containing the variables used in
formula plus those specified in .... It will have
additional attributes, including "terms" for an object of class
"terms" derived from formula,
and possibly "na.action" giving information on the handling of
NAs (which will not be present if no special handling was done,
e.g. by na.pass).
References
Chambers, J. M. (1992) Data for models. Chapter 3 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
See Also
model.matrix for the ‘design matrix’,
formula for formulas,
model.extract to extract components, and
expand.model.frame for model.frame manipulation.
Examples
data.class(model.frame(dist ~ speed, data = cars))
## using a subset and an extra variable
model.frame(dist ~ speed, data = cars, subset = speed < 10, z = log(dist))
## get_all_vars(): new var.s are recycled (iff length matches: 50 = 2*25)
ncars <- get_all_vars(sqrt(dist) ~ I(speed/2), data = cars, newVar = 2:3)
stopifnot(is.data.frame(ncars),
identical(cars, ncars[,names(cars)]),
ncol(ncars) == ncol(cars) + 1)
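For a 'glm4' fit the method can be sketched as follows (an assumption: that the method recovers the frame used in fitting, as model.frame does for "lm" and "glm" objects):

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
mf <- model.frame(fit)
## mtcars has no missing values, so no rows should be dropped
stopifnot(is.data.frame(mf), nrow(mf) == nrow(mtcars))
```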
Extract the Number of Observations from a Fit
Description
Extract the number of ‘observations’ from a model fit. This is
principally intended to be used in computing BIC (see AIC).
Usage
## S3 method for class 'glm4'
nobs(object, ...)
Arguments
object |
a fitted model object. |
... |
further arguments to be passed to methods. |
Details
This is a generic function, with an S4 generic in package stats4.
There are methods in this package for objects of classes
"lm", "glm", "nls" and
"logLik", as well as a default method (which throws an
error, unless use.fallback = TRUE when it looks for
weights and residuals components – use with care!).
The main usage is in determining the appropriate penalty for BIC, but
nobs is also used by the stepwise fitting methods
step, add1 and drop1 as a
quick check that different fits have been fitted to the same set of
data (and not, say, that further rows have been dropped because of NAs
in the new predictors).
For lm, glm and nls fits, observations with zero
weight are not included.
Value
An integer giving the number of observations used in model fitting.
See Also
AIC.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
nobs(fit)
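As the Details note, zero-weight observations are excluded from the count for lm, glm and nls fits; assuming the glm4 method behaves the same, this can be sketched as:

```r
w <- rep(1, nrow(mtcars)); w[1] <- 0  # give the first case zero weight
fitw <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian(),
             weights = w)
nobs(fitw)  # one fewer than nrow(mtcars) if zero-weight cases are dropped
```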
Predict Method for GLM Fits
Description
Obtains predictions and optionally estimates standard errors of those predictions from a fitted generalized linear model object.
Usage
## S3 method for class 'glm4'
predict(...)
Arguments
... |
further arguments passed to or from other methods. |
Details
If newdata is omitted the predictions are based on the data
used for the fit. In that case how cases with missing values in the
original fit are handled is determined by the na.action argument of that
fit. If na.action = na.omit omitted cases will not appear in
the residuals, whereas if na.action = na.exclude they will
appear (in predictions and standard errors), with residual value
NA. See also napredict.
Value
If se.fit = FALSE, a vector or matrix of predictions.
For type = "terms" this is a matrix with a column per term, and
may have an attribute "constant".
If se.fit = TRUE, a list with components
fit |
Predictions, as for se.fit = FALSE. |
se.fit |
Estimated standard errors. |
residual.scale |
A scalar giving the square root of the dispersion used in computing the standard errors. |
Note
Variables are first looked for in newdata and then searched for
in the usual way (which will include the environment of the formula
used in the fit). A warning will be given if the
variables found are not of the same length as those in newdata
if it was supplied.
Examples
require(graphics)
## example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
budworm.lg <- glm(SF ~ sex*ldose, family = binomial)
summary(budworm.lg)
plot(c(1,32), c(0,1), type = "n", xlab = "dose",
ylab = "prob", log = "x")
text(2^ldose, numdead/20, as.character(sex))
ld <- seq(0, 5, 0.1)
lines(2^ld, predict(budworm.lg, data.frame(ldose = ld,
sex = factor(rep("M", length(ld)), levels = levels(sex))),
type = "response"))
lines(2^ld, predict(budworm.lg, data.frame(ldose = ld,
sex = factor(rep("F", length(ld)), levels = levels(sex))),
type = "response"))
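For a 'glm4' object the call is analogous (a sketch; all arguments beyond the fitted object are forwarded via '...', so support for newdata and type follows the underlying method):

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
head(predict(fit))  # predictions on the data used for the fit
```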
Accessing Generalized Linear Model Fits
Description
These functions are all methods for class glm or
summary.glm objects.
Usage
## S3 method for class 'glm4'
resid(object, ...)
Arguments
object |
an object of class "glm4". |
... |
further arguments passed to or from other methods. |
Details
The references define the types of residuals: Davison & Snell is a good reference for the usages of each.
The partial residuals are a matrix of working residuals, with each column formed by omitting a term from the model.
How residuals treats cases with missing values in the original
fit is determined by the na.action argument of that fit.
If na.action = na.omit omitted cases will not appear in the
residuals, whereas if na.action = na.exclude they will appear,
with residual value NA. See also naresid.
For fits done with y = FALSE the response values are computed
from other components.
Value
A numeric vector of residuals.
References
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In: Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, eds. Hinkley, D. V., Reid, N. and Snell, E. J., Chapman & Hall.
Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
See Also
glm for computing glm.obj, anova.glm;
the corresponding generic functions, summary.glm,
coef, deviance,
df.residual,
effects, fitted,
residuals.
influence.measures for deletion diagnostics, including
standardized (rstandard)
and studentized (rstudent) residuals.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
resid(fit)
Accessing Generalized Linear Model Fits
Description
These functions are all methods for class glm or
summary.glm objects.
Usage
## S3 method for class 'glm4'
residuals(object, ...)
Arguments
object |
an object of class "glm4". |
... |
further arguments passed to or from other methods. |
Details
The references define the types of residuals: Davison & Snell is a good reference for the usages of each.
The partial residuals are a matrix of working residuals, with each column formed by omitting a term from the model.
How residuals treats cases with missing values in the original
fit is determined by the na.action argument of that fit.
If na.action = na.omit omitted cases will not appear in the
residuals, whereas if na.action = na.exclude they will appear,
with residual value NA. See also naresid.
For fits done with y = FALSE the response values are computed
from other components.
Value
A numeric vector of residuals.
References
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In: Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, eds. Hinkley, D. V., Reid, N. and Snell, E. J., Chapman & Hall.
Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
See Also
glm for computing glm.obj, anova.glm;
the corresponding generic functions, summary.glm,
coef, deviance,
df.residual,
effects, fitted,
residuals.
influence.measures for deletion diagnostics, including
standardized (rstandard)
and studentized (rstudent) residuals.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
residuals(fit)
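resid() is simply an alias for residuals(), so for a glm4 fit the two extractors agree:

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
stopifnot(all.equal(residuals(fit), resid(fit)))
```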
Regression Deletion Diagnostics
Description
This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models discussed in Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982), etc.
Usage
## S3 method for class 'glm4'
rstandard(
model,
infl = influence.glm4(model),
type = c("pearson", "deviance"),
...
)
Arguments
model |
an object of class "glm4". |
infl |
influence structure as returned by
influence.glm4(). |
type |
character; the type of residuals used, one of "pearson" (the default) or "deviance". |
... |
further arguments passed to or from other methods. |
Details
The primary high-level function is influence.measures, which produces a
tabular display (a class "infl" object) showing the DFBETAS for
each model variable, DFFITS, covariance ratios, Cook's distances and
the diagonal elements of the hat matrix. Cases which are influential
with respect to any of these measures are marked with an asterisk.
The functions dfbetas, dffits,
covratio and cooks.distance provide direct access to the
corresponding diagnostic quantities. Functions rstandard and
rstudent give the standardized and Studentized residuals
respectively. (These re-normalize the residuals to have unit variance,
using an overall and leave-one-out measure of the error variance
respectively.)
Note that for multivariate lm() models (of class
"mlm"), these functions return 3d arrays instead of matrices,
or matrices instead of vectors.
Values for generalized linear models are approximations, as described
in Williams (1987) (except that Cook's distances are scaled as
F rather than as chi-square values). The approximations can be
poor when some cases have large influence.
The optional infl, res and sd arguments are there
to encourage the use of these direct access functions, in situations
where, e.g., the underlying basic influence measures (from
lm.influence or the generic influence) are
already available.
Note that cases with weights == 0 are dropped from all
these functions, but that if a linear model has been fitted with
na.action = na.exclude, suitable values are filled in for the
cases excluded during fitting.
For linear models, rstandard(*, type = "predictive") provides
leave-one-out cross validation residuals, and the “PRESS”
statistic (PREdictive Sum of Squares, the same as
the CV score) of model model is
PRESS <- sum(rstandard(model, type="pred")^2)
The function hat() exists mainly for S (version 2)
compatibility; we recommend using hatvalues() instead.
Value
A numeric vector of standardised residuals, one per observation.
Note
For hatvalues, dfbeta, and dfbetas, the method
for linear models also works for generalized linear models.
Author(s)
Several R core team members and John Fox, originally in his ‘car’ package.
References
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman and Hall.
Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. Applied Statistics, 36, 181–191. doi:10.2307/2347550
See Also
influence (containing lm.influence).
‘plotmath’ for the use of hat in plot annotation.
Examples
require(graphics)
## Analysis of the life-cycle savings data
## given in Belsley, Kuh and Welsch.
lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
inflm.SR <- influence.measures(lm.SR)
which(apply(inflm.SR$is.inf, 1, any))
# which observations 'are' influential
summary(inflm.SR) # only these
inflm.SR # all
plot(rstudent(lm.SR) ~ hatvalues(lm.SR)) # recommended by some
plot(lm.SR, which = 5) # an enhanced version of that via plot(<lm>)
## The 'infl' argument is not needed, but avoids recomputation:
rs <- rstandard(lm.SR)
iflSR <- influence(lm.SR)
all.equal(rs, rstandard(lm.SR, infl = iflSR), tolerance = 1e-10)
## to "see" the larger values:
1000 * round(dfbetas(lm.SR, infl = iflSR), 3)
cat("PRESS :"); (PRESS <- sum( rstandard(lm.SR, type = "predictive")^2 ))
stopifnot(all.equal(PRESS, sum( (residuals(lm.SR) / (1 - iflSR$hat))^2)))
## Show that "PRE-residuals" == L.O.O. Crossvalidation (CV) errors:
X <- model.matrix(lm.SR)
y <- model.response(model.frame(lm.SR))
## Leave-one-out CV least-squares prediction errors (relatively fast)
rCV <- vapply(seq_len(nrow(X)), function(i)
y[i] - X[i,] %*% .lm.fit(X[-i,], y[-i])$coefficients,
numeric(1))
## are the same as the *faster* rstandard(*, "pred") :
stopifnot(all.equal(rCV, unname(rstandard(lm.SR, type = "predictive"))))
## Huber's data [Atkinson 1985]
xh <- c(-4:0, 10)
yh <- c(2.48, .73, -.04, -1.44, -1.32, 0)
lmH <- lm(yh ~ xh)
summary(lmH)
im <- influence.measures(lmH)
im
is.inf <- apply(im$is.inf, 1, any)
plot(xh,yh, main = "Huber's data: L.S. line and influential obs.")
abline(lmH); points(xh[is.inf], yh[is.inf], pch = 20, col = 2)
## Irwin's data [Williams 1987]
xi <- 1:5
yi <- c(0,2,14,19,30) # number of mice responding to dose xi
mi <- rep(40, 5) # number of mice exposed
glmI <- glm(cbind(yi, mi - yi) ~ xi, family = binomial)
summary(glmI)
signif(cooks.distance(glmI), 3) # ~= Ci in Table 3, p.184
imI <- influence.measures(glmI)
imI
stopifnot(all.equal(imI$infmat[,"cook.d"],
cooks.distance(glmI)))
Extract Residual Standard Deviation 'Sigma'
Description
Extract the estimated standard deviation of the errors, the
“residual standard deviation” (misnamed also
“residual standard error”, e.g., in
summary.lm()'s output) from a fitted model.
Many classical statistical models have a scale parameter,
typically the standard deviation of a zero-mean normal (or Gaussian)
random variable which is denoted as \sigma.
sigma(.) extracts the estimated parameter from a fitted
model, i.e., \hat\sigma.
Usage
## S3 method for class 'glm4'
sigma(object, ...)
Arguments
object |
an R object, typically resulting from a model fitting
function such as glm4(). |
... |
potentially further arguments passed to and from
methods. Passed to deviance() for the default method. |
Details
The stats package provides the S3 generic, a default method,
and a method for objects of class "glm".
The default method is correct typically for (asymptotically / approximately)
generalized gaussian (“least squares”) problems, since it is
defined as
sigma.default <- function (object, use.fallback = TRUE, ...)
sqrt( deviance(object, ...) / (NN - PP) )
where NN <- nobs(object, use.fallback = use.fallback)
and PP <- sum(!is.na(coef(object))) – where in older R
versions this was length(coef(object)) which is too large in
case of undetermined coefficients, e.g., for rank deficient model fits.
Value
Typically a number, the estimated standard deviation of the errors (“residual standard deviation”) for Gaussian models, and—less interpretably—the square root of the residual deviance per degree of freedom in more general models.
Very strictly speaking, \hat{\sigma} (“\sigma hat”)
is actually \sqrt{\widehat{\sigma^2}}.
For generalized linear models (class "glm"), the
sigma.glm method returns the square root of the dispersion
parameter (See summary.glm). For families with free
dispersion parameter, sigma is estimated from the root mean
square of the Pearson residuals. For families with fixed dispersion,
sigma is not estimated from the residuals but extracted directly
from the family of the fitted model. Consequently, for binomial or
Poisson GLMs, sigma is exactly 1.
For multivariate linear models (class "mlm"), a vector
of sigmas is returned, each corresponding to one column of Y.
Note
The misnomer “Residual standard error” has been part of too many R (and S) outputs to be easily changed there.
See Also
deviance, nobs, vcov,
summary.glm.
Examples
## -- lm() ------------------------------
lm1 <- lm(Fertility ~ . , data = swiss)
sigma(lm1) # ~= 7.165 = "Residual standard error" printed from summary(lm1)
stopifnot(all.equal(sigma(lm1), summary(lm1)$sigma, tolerance=1e-15))
## -- nls() -----------------------------
DNase1 <- subset(DNase, Run == 1)
fm.DN1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
sigma(fm.DN1) # ~= 0.01919 as from summary(..)
stopifnot(all.equal(sigma(fm.DN1), summary(fm.DN1)$sigma, tolerance=1e-15))
## -- glm() -----------------------------
## -- a) Binomial -- Example from MASS
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
sigma(budworm.lg <- glm(SF ~ sex*ldose, family = binomial))
## -- b) Poisson -- from ?glm :
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
sigma(glm.D93 <- glm(counts ~ outcome + treatment, family = poisson()))
## equal to
sqrt(summary(glm.D93)$dispersion) # == 1
## and the *Quasi*poisson's dispersion
sigma(glm.qD93 <- update(glm.D93, family = quasipoisson()))
sigma(glm.qD93)^2 # 1.2933 equal to
summary(glm.qD93)$dispersion # == 1.2933
## -- Multivariate lm() "mlm" -----------
utils::example("SSD", echo=FALSE)
sigma(mlmfit) # is the same as {but more efficient than}
sqrt(diag(estVar(mlmfit)))
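For a gaussian 'glm4' fit, sigma() should agree with the corresponding glm() result (a sketch; assumes the glm4 package is attached):

```r
fit4 <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
fit0 <- glm(mpg ~ cyl + wt, data = mtcars, family = gaussian())
all.equal(sigma(fit4), sigma(fit0))
```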
Summarising Generalized Linear Models Using Sparse Matrices
Description
Generates a summary of the 'glm4' object to evaluate coefficients, standard errors, model fit criteria, etc.
Usage
## S3 method for class 'glm4'
summary(
object,
p.adjust = NULL,
dispersion = NULL,
correlation = FALSE,
symbolic.cor = FALSE,
...
)
Arguments
object |
an object of class "glm4" |
p.adjust |
returns p-values adjusted using one of several methods implemented in 'stats::p.adjust'. Defaults to 'NULL' for no adjustment, consistent with 'stats::glm'. |
dispersion |
the dispersion parameter for the family used. Either a single numerical value or NULL (the default), when it is inferred from object (see 'stats::summary.glm()' details). |
correlation |
logical; if 'TRUE', the correlation matrix of the estimated parameters is returned and printed. |
symbolic.cor |
logical; if 'TRUE', and 'correlation' is 'TRUE', the correlation matrix is printed in symbolic form. |
... |
further arguments passed to or from other methods. |
Details
This function is designed to resemble 'stats::summary.glm()' as closely as possible. It calculates the (un)scaled variance-covariance matrix from a 'MatrixModels::glm4()' object using 'Matrix::chol2inv()' and produces a coefficient table for easy inspection of model parameters.
Value
A list object of class 'c("summary.glm", "summary.glm4")'. See 'stats::summary.glm()' for more details on returned components.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
summary(fit)
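The package-specific p.adjust argument accepts the method names understood by stats::p.adjust (e.g. "holm", "bonferroni"); for example:

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
summary(fit, p.adjust = "holm")   # Holm-adjusted p-values
summary(fit, correlation = TRUE)  # also print coefficient correlations
```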
Case and Variable Names of Fitted Models
Description
Simple utilities returning (non-missing) case names, and (non-eliminated) variable names.
Usage
## S3 method for class 'glm4'
variable.names(object, full = FALSE, ...)
Arguments
object |
an R object, typically a fitted model. |
full |
logical; if TRUE, all variable names are returned, including any eliminated in the fit. |
... |
further arguments passed to or from other methods. |
Value
A character vector.
See Also
lm; further, all.names,
all.vars for functions with a similar name but only slightly
related purpose.
Examples
x <- 1:20
y <- setNames(x + (x/4 - 2)^3 + rnorm(20, sd = 3),
paste("O", x, sep = "."))
ww <- rep(1, 20); ww[13] <- 0
summary(lmxy <- lm(y ~ x + I(x^2)+I(x^3) + I((x-10)^2), weights = ww),
correlation = TRUE)
variable.names(lmxy)
variable.names(lmxy, full = TRUE) # includes the last
case.names(lmxy)
case.names(lmxy, full = TRUE) # includes the 0-weight case
Calculate Variance-Covariance Matrix for a Fitted Model Object
Description
Returns the variance-covariance matrix of the main parameters of
a fitted model object. The “main” parameters of model
correspond to those returned by coef, and typically do
not contain a nuisance scale parameter (sigma).
Usage
## S3 method for class 'glm4'
vcov(object, complete = TRUE, ...)
Arguments
object |
a fitted model object, typically. Sometimes also a
summary() object of such a fitted model. |
complete |
logical indicating if the full variance-covariance matrix should be returned in case of an over-determined system where some coefficients are undefined (and coef(.) contains NAs). |
... |
additional arguments for method functions. For the
glm method this can be used to pass a dispersion parameter. |
Details
vcov() is a generic function and functions with names beginning
in vcov. will be methods for this function.
Classes with methods for this function include:
lm, mlm, glm, nls,
summary.lm, summary.glm,
negbin, polr, rlm (in package MASS),
multinom (in package nnet)
gls, lme (in package nlme),
coxph and survreg (in package survival).
(vcov() methods for summary objects allow more
efficient and still encapsulated access when both
summary(mod) and vcov(mod) are needed.)
.vcov.aliased() is an auxiliary function useful for
vcov method implementations which have to deal with singular
model fits encoded via NA coefficients: It augments a vcov–matrix
vc by NA rows and columns where needed, i.e., when
some entries of aliased are true and vc is of smaller dimension
than length(aliased).
Value
A matrix (or sparse 'Matrix') of estimated covariances between coefficient estimates.
Examples
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
vcov(fit)
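The square roots of the diagonal of vcov(fit) should correspond to the standard errors reported by summary(fit) (a sketch; as.matrix() covers the case where a sparse 'Matrix' is returned):

```r
fit <- glm4(mpg ~ cyl + wt, data = mtcars, family = gaussian())
sqrt(diag(as.matrix(vcov(fit))))  # coefficient standard errors
```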