Type I Error Rate Higher than Significance Level in Likelihood Ratio Test for Exponential Distribution
Problem Description
I am running a likelihood ratio test (LRT) to decide whether data are adequately described by a one-parameter exponential distribution (location parameter fixed at zero) or require the two-parameter (shifted) exponential distribution. In simulations under the null hypothesis, however, the observed Type I error rate is substantially higher than the nominal significance level, and I would like to understand why.
Mathematical Formulation
The LRT compares two models:
- Full Model: A two-parameter exponential distribution, estimating both the rate $\lambda$ and location $\theta$ parameters.
- Reduced Model: A one-parameter exponential distribution, with the location parameter fixed at zero and only the rate parameter estimated.
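For concreteness, the two densities being compared are (writing $\lambda$ for the rate and $\theta$ for the location):

$$f_{\text{full}}(x;\theta,\lambda)=\lambda e^{-\lambda(x-\theta)},\quad x\ge\theta, \qquad f_{\text{reduced}}(x;\lambda)=\lambda e^{-\lambda x},\quad x\ge 0.$$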
The hypothesis tested is:
- Null Hypothesis ($H_0$): The data follows a one-parameter exponential distribution ($\theta = 0$).
- Alternative Hypothesis ($H_1$): The data follows a two-parameter exponential distribution ($\theta \neq 0$).
The test statistic $\Lambda$ is calculated as $\Lambda = -2\,(\text{LL}_{\text{reduced}} - \text{LL}_{\text{full}})$, where $\text{LL}_{\text{reduced}}$ and $\text{LL}_{\text{full}}$ are the maximized log-likelihoods of the reduced and full models, respectively. The p-value is then obtained from the chi-square distribution with 1 degree of freedom, since the full model has one extra free parameter ($\theta$).
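For what it's worth, both maximizations have closed-form solutions (unless I've slipped in the algebra), so $\Lambda$ can be written explicitly:

$$\hat\theta = x_{(1)}, \qquad \hat\lambda_{\text{full}} = \frac{1}{\bar{x} - x_{(1)}}, \qquad \hat\lambda_{\text{reduced}} = \frac{1}{\bar{x}},$$

which gives

$$\Lambda = 2n \log\frac{\bar{x}}{\bar{x} - x_{(1)}},$$

where $x_{(1)}$ is the sample minimum and $\bar{x}$ the sample mean.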
Python Implementation
Here's the code I'm using for the LRT and the simulation to calculate the Type I error rate:
```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import minimize

def log_likelihood_two_param(data, params):
    """Log-likelihood of the shifted (two-parameter) exponential."""
    theta, lamb = params
    # The density is zero for any observation below theta.
    if np.any(data < theta) or lamb <= 0:
        return -np.inf
    return np.sum(np.log(lamb) - lamb * (data - theta))

def log_likelihood_one_param(data, lamb):
    """Log-likelihood of the exponential with location fixed at zero."""
    if lamb <= 0:
        return -np.inf
    return np.sum(np.log(lamb) - lamb * data)

def likelihood_ratio_test(data):
    # Full model: the likelihood increases in theta up to min(data),
    # so bound theta from above by the sample minimum.
    res_full = minimize(lambda params: -log_likelihood_two_param(data, params),
                        x0=[np.min(data), 1 / np.mean(data)],
                        method='L-BFGS-B',
                        bounds=[(None, np.min(data)), (1e-12, None)])
    LL_full = -res_full.fun
    # Reduced model: only the rate parameter is estimated.
    res_reduced = minimize(lambda lamb: -log_likelihood_one_param(data, lamb[0]),
                           x0=[1 / np.mean(data)],
                           method='L-BFGS-B',
                           bounds=[(1e-12, None)])
    LL_reduced = -res_reduced.fun
    Lambda = -2 * (LL_reduced - LL_full)
    # p-value from chi-square with 1 df (one extra parameter in the full model)
    p_value = chi2.sf(Lambda, 1)
    return p_value

def simulate_and_test(sample_size, theta_true, lambda_true,
                      significance_level, num_simulations):
    type_1_errors = 0
    for _ in range(num_simulations):
        # NumPy parameterizes the exponential by its scale, 1/lambda.
        data = np.random.exponential(1 / lambda_true, sample_size) + theta_true
        p_value = likelihood_ratio_test(data)
        if p_value < significance_level:
            type_1_errors += 1
    return type_1_errors

# Simulation parameters
num_simulations = 1000
significance_level = 0.05
sample_size = 100
theta_true = 0     # null hypothesis is true
lambda_true = 0.1  # rate parameter

type_1_error_count = simulate_and_test(sample_size, theta_true, lambda_true,
                                       significance_level, num_simulations)
type_1_error_rate = type_1_error_count / num_simulations
print(type_1_error_rate)
```
The output reports a Type I error rate of around 0.15, roughly three times the nominal significance level of 0.05.
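The same inflation appears when I bypass the numerical optimizer and plug in closed-form MLEs (my own derivation, worth double-checking), so I don't think it is an optimization artifact:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, lam, reps = 100, 0.1, 5000

# Closed-form MLEs for this model (assumed, derived by hand):
#   theta_hat          = min(x)
#   lambda_hat_full    = 1 / (mean(x) - min(x))
#   lambda_hat_reduced = 1 / mean(x)
# which give Lambda = 2 * n * log(mean(x) / (mean(x) - min(x))).
stats = np.empty(reps)
for i in range(reps):
    x = rng.exponential(1 / lam, n)  # theta = 0, so H0 is true
    stats[i] = 2 * n * np.log(x.mean() / (x.mean() - x.min()))

# Fraction of null simulations rejected at the chi2(1) 5% critical value
rate = (stats > chi2.ppf(0.95, 1)).mean()
print(rate)  # well above 0.05 for me, close to the 0.15 seen above
```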
Question
Could anyone help me understand why the Type I error rate is higher than expected in this simulation? Are there any issues with the implementation of the LRT or the assumptions underlying the test?