Title: | Goodness of Fit Tests Based on Empirical Distribution Functions |
---|---|
Description: | Routines that allow the user to run goodness of fit tests based on empirical distribution functions for formal model evaluation in a general likelihood model. In addition, functions are provided to test a sample against Normal or Gamma distributions, validate the normality assumptions in a linear model, and examine the appropriateness of a Gamma distribution in generalized linear models with various link functions. Michael Arthur Stephens (1976) <http://www.jstor.org/stable/2958206>. |
Authors: | Richard Lockhart [aut], Payman Nickchi [aut, cre] |
Maintainer: | Payman Nickchi <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0 |
Built: | 2025-02-16 08:23:12 UTC |
Source: | https://github.com/pnickchi/gofedf |
This function is used in the testYourModel function for example purposes.
IGMLE(obs, ...)
obs |
a numeric vector of sample observations. |
... |
a list of additional parameters to define the likelihood. |
The function computes the MLE of the parameters of the Inverse Gaussian distribution and returns a vector of estimates. The first and second elements of the vector are the MLEs of the mean and shape, respectively.
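For orientation, the Inverse Gaussian MLE has a closed form, which can be sketched in base R. This is only an illustration under the assumption that observation i has shape lambda * w_i; the helper name ig_mle_sketch and its w argument are hypothetical, and the sketch is not the package's implementation of IGMLE.

```r
# Hypothetical helper (not part of gofedf): closed-form MLE for the
# Inverse Gaussian distribution, assuming shape lambda * w[i] for obs[i].
ig_mle_sketch <- function(obs, w = rep(1, length(obs))) {
  stopifnot(is.numeric(obs), all(obs > 0), length(w) == length(obs))
  n <- length(obs)
  mean_hat  <- sum(w * obs) / sum(w)                  # weighted sample mean
  shape_hat <- n / sum(w * (1 / obs - 1 / mean_hat))  # root of d loglik / d lambda
  c(mean = mean_hat, shape = shape_hat)
}

x <- c(0.5, 1.2, 2.3, 0.8, 3.1)
est <- ig_mle_sketch(x)
est
```

With unit weights the mean estimate reduces to the ordinary sample mean.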
This function is used in the testYourModel function for example purposes.
IGPIT(obs, ...)
obs |
A numeric vector of sample observations. |
... |
A list of additional parameters to define the likelihood. |
A numeric vector of probability integral transformed values of sample observations.
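A probability integral transform simply evaluates the fitted CDF at each observation. The base R sketch below uses the standard closed-form Inverse Gaussian CDF; the helper name ig_pit_sketch and its arguments are hypothetical and are not the package's IGPIT interface.

```r
# Hypothetical helper (not part of gofedf): Inverse Gaussian CDF at obs,
# for a given mean and shape.
ig_pit_sketch <- function(obs, mean, shape) {
  z <- sqrt(shape / obs)
  a <- pnorm(z * (obs / mean - 1))
  # Second term computed on the log scale to avoid overflow in exp(2 * shape / mean)
  b <- exp(2 * shape / mean + pnorm(-z * (obs / mean + 1), log.p = TRUE))
  a + b
}

x <- c(0.5, 1.2, 2.3, 0.8, 3.1)
pit <- ig_pit_sketch(x, mean = 2, shape = 3)
pit
```

If the model is correct and the parameters are known, these values behave like a sample from the uniform distribution on (0,1).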
This function is used in the testYourModel function for example purposes.
IGScore(obs, ...)
obs |
a numeric vector of sample observations. |
... |
a list of additional parameters to define the likelihood. |
The score matrix with n rows (number of sample observations) and 2 columns (mean and shape).
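As a rough illustration of what such a score matrix contains, the sketch below differentiates the Inverse Gaussian log-density with respect to the mean and the shape for each observation (unit weights assumed). The helper name ig_score_sketch is hypothetical; this is not the package's IGScore code.

```r
# Hypothetical helper (not part of gofedf): per-observation score of the
# Inverse Gaussian log-density with respect to (mean, shape).
ig_score_sketch <- function(obs, mean, shape) {
  d_mean  <- shape * (obs - mean) / mean^3
  d_shape <- 1 / (2 * shape) - (obs - mean)^2 / (2 * mean^2 * obs)
  cbind(mean = d_mean, shape = d_shape)
}

x <- c(0.5, 1.2, 2.3, 0.8, 3.1)
mean_hat  <- mean(x)                                # closed-form MLE of the mean
shape_hat <- length(x) / sum(1 / x - 1 / mean_hat)  # closed-form MLE of the shape
S <- ig_score_sketch(x, mean_hat, shape_hat)
colSums(S)  # both column sums are numerically zero at the MLE
```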
Performs the goodness-of-fit test based on the empirical distribution function to check whether an i.i.d. sample follows an Exponential distribution.
testExponential( x, discretize = FALSE, ngrid = length(x), gridpit = FALSE, hessian = FALSE, method = "cvm" )
x |
a non-empty numeric vector of sample data. |
discretize |
If |
ngrid |
the number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
hessian |
logical. If |
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list with two components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
set.seed(123)
n <- 50
sim_data <- rexp(n, rate = 2)
testExponential(x = sim_data)
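The two statistics selected by the method argument can be written down directly from sorted PIT values. The base R sketch below uses the standard computing formulas for the Cramer-von Mises and Anderson-Darling statistics; the helper name edf_stats_sketch is hypothetical, and the sketch does not reproduce the package's p-value calculation, which depends on the limiting distribution under parameter estimation.

```r
# Hypothetical helper (not part of gofedf): EDF statistics from PIT values.
edf_stats_sketch <- function(u) {
  n <- length(u)
  u <- sort(u)
  i <- seq_len(n)
  # Cramer-von Mises: squared distance of sorted PITs from uniform plotting positions
  cvm <- sum((u - (2 * i - 1) / (2 * n))^2) + 1 / (12 * n)
  # Anderson-Darling: rev(u)[i] is the (n + 1 - i)-th order statistic
  ad <- -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
  c(cvm = cvm, ad = ad)
}

set.seed(123)
x <- rexp(50, rate = 2)
rate_hat <- 1 / mean(x)        # MLE of the exponential rate
u <- pexp(x, rate = rate_hat)  # PIT values under the fitted model
edf_stats_sketch(u)
```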
Performs the goodness-of-fit test based on the empirical distribution function to check whether an i.i.d. sample follows a Gamma distribution.
testGamma( x, discretize = FALSE, ngrid = length(x), gridpit = FALSE, hessian = FALSE, rate = TRUE, method = "cvm" )
x |
a non-empty numeric vector of sample data. |
discretize |
If |
ngrid |
the number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
hessian |
logical. If |
rate |
logical. If |
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list with two components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
set.seed(123)
sim_data <- rgamma(n = 50, shape = 3)
testGamma(x = sim_data)
sim_data <- runif(n = 50)
testGamma(x = sim_data)
testGLMGamma is used to check the validity of the Gamma assumption for the response variable when fitting a generalized linear model. Common link functions in glm can be used here.
testGLMGamma( x, y, fit = NULL, l = "log", discretize = FALSE, ngrid = length(y), gridpit = TRUE, hessian = FALSE, start.value = NULL, control = NULL, method = "cvm" )
x |
is either a numeric vector or a design matrix. In the design matrix, rows indicate observations and columns represent covariates. |
y |
is a vector of numeric values with the same number of observations or number of rows as x. |
fit |
is an object of class |
l |
a character vector indicating the link function that should be used
for Gamma family. Acceptable link functions for Gamma family are inverse,
identity and log. For more details see |
discretize |
If |
ngrid |
the number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
hessian |
logical. If |
start.value |
a numeric value or vector. This is the same as |
control |
a list of parameters to control the fitting process in
|
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list containing the following components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
converged: logical indicating whether the IWLS algorithm has converged.
set.seed(123)
n <- 50
p <- 5
x <- matrix( rnorm(n*p, mean = 10, sd = 0.1), nrow = n, ncol = p)
b <- runif(p)
e <- rgamma(n, shape = 3)
y <- exp(x %*% b) * e
testGLMGamma(x, y, l = 'log')
myfit <- glm(y ~ x, family = Gamma('log'), x = TRUE, y = TRUE)
testGLMGamma(fit = myfit)
testLMNormal is used to check the normality assumption of residuals in a linear model. This function can take the response variable and design matrix, fit a linear model, and apply the goodness-of-fit test. Alternatively, it can take an object of class "lm" and apply the goodness-of-fit test directly. The function returns a goodness-of-fit statistic along with an approximate p-value.
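To make the idea concrete, the sketch below shows one way PIT values could be formed from linear model residuals under a normal error assumption. It assumes the error standard deviation is estimated by maximum likelihood (dividing by n); this choice and the variable names are assumptions for illustration, and the package's internal computation may differ.

```r
set.seed(123)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
res <- residuals(fit)
sigma_hat <- sqrt(mean(res^2))  # MLE of sigma (divides by n, not n - p)
pit <- pnorm(res / sigma_hat)   # PIT values under the fitted normal errors
range(pit)
```

Under normal errors these values should look roughly uniform on (0,1), which is what the EDF statistics measure.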
testLMNormal( x, y, fit = NULL, discretize = FALSE, ngrid = length(y), gridpit = TRUE, hessian = FALSE, method = "cvm" )
x |
is either a numeric vector or a design matrix. In the design matrix, rows indicate observations and columns represent covariates. |
y |
is a vector of numeric values with the same number of observations or number of rows as x. |
fit |
an object of class "lm" returned by |
discretize |
If |
ngrid |
the number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
hessian |
logical. If |
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list with two components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
set.seed(123)
n <- 50
p <- 5
x <- matrix( runif(n*p), nrow = n, ncol = p)
e <- rnorm(n)
b <- runif(p)
y <- x %*% b + e
testLMNormal(x, y)
# Or pass lm.fit object directly:
lm.fit <- lm(y ~ x, x = TRUE, y = TRUE)
testLMNormal(fit = lm.fit)
Performs the goodness-of-fit test based on the empirical distribution function to check whether an i.i.d. sample follows a Normal distribution.
testNormal( x, discretize = FALSE, ngrid = length(x), gridpit = TRUE, hessian = FALSE, method = "cvm" )
x |
a non-empty numeric vector of sample data. |
discretize |
If |
ngrid |
the number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
hessian |
logical. If |
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list with two components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
set.seed(123)
sim_data <- rnorm(n = 50)
testNormal(x = sim_data)
sim_data <- rgamma(n = 50, shape = 3)
testNormal(x = sim_data)
This function applies the goodness-of-fit test based on the empirical distribution function. The required inputs depend on whether the model involves parameter estimation. If the model is fully specified and no parameters are estimated, the function requires only the probability integral transformed (PIT) values of the sample, as a numeric vector. If parameters are estimated, the function additionally requires the score as a matrix with n rows and p columns, where n is the sample size and p is the number of estimated parameters. The function checks that the column sums of the score are near zero at the estimated parameter (which is assumed to be the maximum likelihood estimate).
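As a small illustration of preparing these inputs for a model of your own, the sketch below builds the PIT vector and an n by 1 score matrix for an Exponential model with estimated rate. The model choice and variable names are assumptions made for illustration; the call to testYourModel runs only if the gofedf package is installed.

```r
set.seed(123)
x <- rexp(50, rate = 2)

rate_hat <- 1 / mean(x)                         # MLE of the exponential rate
pit      <- pexp(x, rate = rate_hat)            # PIT values at the MLE
score    <- matrix(1 / rate_hat - x, ncol = 1)  # d/d(rate) of log density, n x 1

# The column sums of the score should vanish at the MLE
stopifnot(abs(colSums(score)) < 1e-9)

# Run the test only if gofedf is available
if (requireNamespace("gofedf", quietly = TRUE)) {
  gofedf::testYourModel(pit = pit, score = score)
}
```

For a model with p estimated parameters, the score matrix would have one column per parameter, in the same order as the parameter vector.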
testYourModel( pit, score = NULL, discretize = FALSE, ngrid = length(pit), gridpit = TRUE, precision = 1e-09, method = "cvm" )
pit |
The probability integral transformed (PIT) values of the sample, as a numeric vector. |
score |
The default value is NULL, which corresponds to the case with no parameter estimation. If parameters are estimated, the score must be a matrix with n rows and p columns, where n is the sample size and p is the number of estimated parameters. |
discretize |
If |
ngrid |
The number of equally spaced points to discretize the (0,1) interval for computing the covariance function. |
gridpit |
logical. If |
precision |
The theory behind the goodness-of-fit test based on the empirical distribution function (EDF) holds when the MLE is a root of the derivative of the log-likelihood function. A precision of 1e-9 (the default value) is used to check this. A warning message is generated if the score evaluated at the MLE is not close enough to zero. |
method |
a character string indicating which goodness-of-fit statistic is to be computed. The default value is 'cvm' for the Cramer-von-Mises statistic. Other options include 'ad' for the Anderson-Darling statistic, and 'both' to compute both cvm and ad. |
A list with two components:
Statistic: the value of the goodness-of-fit statistic.
p-value: the approximate p-value for the goodness-of-fit test. If method = 'cvm' or method = 'ad', the statistic and the p-value are each a single numeric value. If method = 'both', each is a numeric vector with two elements, one for each statistic.
# Example: Inverse Gaussian (IG) distribution with weights
# Set the seed to reproduce example.
set.seed(123)
# Set the sample size
n <- 50
# Assign weights
weights <- rep(1.5,n)
# Set mean and shape parameters for IG distribution.
mio <- 2
lambda <- 2
# Generate a random sample from IG distribution with weighted shape.
sim_data <- statmod::rinvgauss(n, mean = mio, shape = lambda * weights)
# Compute MLE of parameters, score matrix, and pit values.
theta_hat <- IGMLE(obs = sim_data, w = weights)
ScoreMatrix <- IGScore(obs = sim_data, w = weights, mle = theta_hat)
pitvalues <- IGPIT(obs = sim_data , w = weights, mle = theta_hat)
# Apply the goodness-of-fit test.
testYourModel(pit = pitvalues, score = ScoreMatrix)