Deriving Poisson Regression Estimates with Confounding Using Statistics Without Probability (SWOP) and the Corrected Treatment Effect (CTE) – A Secondary Analysis of the Doll-Hill Study
Poisson regression is used in frequentist statistics when the outcome being modelled is a rate. An example is a comparison of smokers and nonsmokers in which the outcome is deaths per person-year of follow-up, adjusted for the person's age at study entry. This example is the famous Doll-Hill study of doctors who did and did not smoke and their risk of death. This paper reanalyzes the Doll-Hill dataset using an alternative technique that makes none of the assumptions of Poisson regression and does not even depend on the axioms of probability.
Poisson regression in frequentist statistics is a generalised linear model from the Poisson family, with various possible link functions including the log link. Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean (overdispersion), a negative binomial model is often used instead. The observations in Poisson regression also have to be independent; when they are not, the standard errors have to be adjusted for the correlation. Poisson regression also assumes sparseness of events. That is, “[The probability of occurrence of an event] in a short interval is proportional to the length of the interval” and “[The] probability of another occurrence in such a short interval is zero”. This assumption of sparsity is quite difficult to test. It is also often ignored in practice, because it excludes many types of datasets where the counts are not sparse, e.g. counts greater than 100.
Despite all these assumptions, Poisson regression is widely used, because modelling rates is a very common problem in science, medicine and social research, especially when unit-level data are unavailable. There is an alternative method of analyzing rates which has its own system of point estimation, interval estimation, hypothesis testing and adjustment for confounding, referred to here as SWOP+CTE.
Evaluation of the SWOP+CTE method for rates:
Smoking is the intervention or predictor variable, and exposure is binary as in the Doll-Hill study. Here the data are presented as the number of deaths and the person-years of follow-up. Age is a confounder and has been categorized into groups.
For the SWOP+CTE method, I transformed the data below into a binary dataset. Death is the outcome and each record represents a person-year. Age is still a confounder and has been categorized into groups. Correlated error terms are not an issue for SWOP, because SWOP does not use standard errors to calculate interval estimates or P-values.
The transformed data set and original dataset can be accessed in the attached link: RateSWOP.zip
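As an illustration of the transformation described above, the following is a minimal Python sketch that expands the rate table into one binary record per person-year. The layout is an assumption (each death is assigned to one person-year record in its age and smoking cell); the files in RateSWOP.zip may be organised differently, and the names `TABLE` and `expand` are hypothetical.

```python
# Doll-Hill table as (age category, smokes, deaths, person-years).
TABLE = [
    ("35-44", 1, 32, 52407), ("45-54", 1, 104, 43248),
    ("55-64", 1, 206, 28612), ("65-74", 1, 186, 12663),
    ("75-84", 1, 102, 5317),
    ("35-44", 0, 2, 18790), ("45-54", 0, 12, 10673),
    ("55-64", 0, 28, 5710), ("65-74", 0, 28, 2585),
    ("75-84", 0, 31, 1462),
]


def expand(table):
    """Yield one (agecat, smokes, death) record per person-year.

    Assumption: within each cell, the first `deaths` records carry
    death = 1 and the remaining person-years carry death = 0.
    """
    for agecat, smokes, deaths, pyears in table:
        for i in range(pyears):
            yield (agecat, smokes, 1 if i < deaths else 0)


records = list(expand(TABLE))
n_records = len(records)
n_deaths = sum(death for _, _, death in records)
print(n_records, n_deaths)
```

The record count and death count should match the Sum row of the data table (181467 person-years and 731 deaths).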
Data:
Agecat | Smokes | Deaths | Pyears | Average (deaths per person-year) |
35-44 | 1 | 32 | 52407 | 0.000611 |
45-54 | 1 | 104 | 43248 | 0.002405 |
55-64 | 1 | 206 | 28612 | 0.0072 |
65-74 | 1 | 186 | 12663 | 0.014688 |
75-84 | 1 | 102 | 5317 | 0.019184 |
Group | Smoker | 630 | 142247 (78.4%) | 0.004429 |
35-44 | 0 | 2 | 18790 | 0.000106 |
45-54 | 0 | 12 | 10673 | 0.001124 |
55-64 | 0 | 28 | 5710 | 0.004904 |
65-74 | 0 | 28 | 2585 | 0.010832 |
75-84 | 0 | 31 | 1462 | 0.021204 |
Group | Non-smoker | 101 | 39220 (21.6%) | 0.002575 |
Sum | All | 731 | 181467 | Crude RR: 1.7198 |
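The Average column, the person-year shares and the crude RR in the table can be recomputed directly from the deaths and person-years. The sketch below does this in Python; the variable names are illustrative only.

```python
# (age category, smokes, deaths, person-years) copied from the table.
ROWS = [
    ("35-44", 1, 32, 52407), ("45-54", 1, 104, 43248),
    ("55-64", 1, 206, 28612), ("65-74", 1, 186, 12663),
    ("75-84", 1, 102, 5317),
    ("35-44", 0, 2, 18790), ("45-54", 0, 12, 10673),
    ("55-64", 0, 28, 5710), ("65-74", 0, 28, 2585),
    ("75-84", 0, 31, 1462),
]

# Per-row average: deaths per person-year.
for agecat, smokes, deaths, pyears in ROWS:
    print(agecat, smokes, round(deaths / pyears, 6))

# Group totals for smokers (1) and non-smokers (0).
totals = {1: [0, 0], 0: [0, 0]}
for _, smokes, deaths, pyears in ROWS:
    totals[smokes][0] += deaths
    totals[smokes][1] += pyears

rate_smokers = totals[1][0] / totals[1][1]
rate_nonsmokers = totals[0][0] / totals[0][1]
smoker_share = totals[1][1] / (totals[1][1] + totals[0][1])
crude_rr = rate_smokers / rate_nonsmokers

print(round(smoker_share, 3))  # share of person-years among smokers
print(round(crude_rr, 4))      # crude relative risk
```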
- Smoking Crude Relative Risk (expanded binary dataset):
Point Estimate: RR: 1.7198
Interval Estimate: 95%CI: 1.394 – 2.121
P Value: P<0.0001
- Smoking GLM Relative Risk (expanded binary dataset):
Point Estimate: RR: 1.7198
Interval Estimate: 95%CI: 1.394 – 2.121
P Value: P<0.0001
- Smoking Adjusted for age Relative Risk using GLM (binomial distribution and logarithmic link) (expanded binary dataset):
Point Estimate: RR: 1.4974
Interval Estimate: 95%CI: 1.215 – 1.846
P Value: P<0.0001
- Smoking Poisson IRR (compressed Rates dataset):
Point Estimate: RR: 1.7198
Interval Estimate: 95%CI: 1.394 – 2.121
P Value: P<0.0001
- Smoking Adjusted for age IRR Poisson:
Point Estimate: RR: 1.5014
Interval Estimate: 95%CI: 1.217 – 1.852
P Value: P<0.0001
- Crude Smoking OR (expanded binary dataset):
Point Estimate: 1.723024
Interval Estimate: 1.396134 – 2.126452
Hypothesis Testing: P<0.0001
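For orientation, the crude RR and crude OR intervals reported above for the expanded binary dataset can be approximately reproduced with conventional Wald intervals on the log scale. This is a frequentist cross-check under an assumed method (the text does not state how the intervals were computed), and it is not the SWOP interval procedure. The helper name `wald_interval` is hypothetical.

```python
import math

# Crude counts from the data table; each record is a person-year.
deaths_s, pyears_s = 630, 142247   # smokers
deaths_n, pyears_n = 101, 39220    # non-smokers
Z95 = 1.959964                     # two-sided 95% normal quantile


def wald_interval(estimate, se_log):
    """Wald interval computed on the log scale, then exponentiated."""
    log_est = math.log(estimate)
    return (math.exp(log_est - Z95 * se_log),
            math.exp(log_est + Z95 * se_log))


# Relative risk, treating person-year records as binomial trials.
rr = (deaths_s / pyears_s) / (deaths_n / pyears_n)
se_log_rr = math.sqrt(1 / deaths_s - 1 / pyears_s
                      + 1 / deaths_n - 1 / pyears_n)
rr_lo, rr_hi = wald_interval(rr, se_log_rr)

# Odds ratio with the usual log-odds-ratio standard error.
nodeath_s = pyears_s - deaths_s
nodeath_n = pyears_n - deaths_n
odds_ratio = (deaths_s * nodeath_n) / (deaths_n * nodeath_s)
se_log_or = math.sqrt(1 / deaths_s + 1 / nodeath_s
                      + 1 / deaths_n + 1 / nodeath_n)
or_lo, or_hi = wald_interval(odds_ratio, se_log_or)

# Two-sided P-value for the null hypothesis RR = 1.
z = math.log(rr) / se_log_rr
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(rr, 4), round(rr_lo, 3), round(rr_hi, 3))
print(round(odds_ratio, 6), round(or_lo, 3), round(or_hi, 3))
print(p_value < 0.0001)
```

Under these assumptions the results should land close to the reported RR interval of 1.394 to 2.121 and OR interval of 1.396 to 2.126.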
- Smoking OR Adjusted for age CTE using SWOP (expanded binary dataset):
CTE for logarithmic adjustment equation and logarithmic estimation equation.
CTE = x / inverse_logit(bz) = Exposure / inverse logit of the constant in the adjustment equation
x = Exposure = Person Years smoked/Total Person Years = .78387255
Adjustment Equation from command:
logit smoking i.agecat
bz = 1.025715
inverse_logit(bz) = .73608432
Corrected Treatment Effect = CTE = x/inverse_logit(bz) = 0.78387255/.73608432 = 1.0649222
Smoking Adjusted for age CTE = OR*CTE = 1.0649222*1.723024 = 1.8348865 OR
Smoking OR Adjusted for age CTE using SWOP = 1.83 OR
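The arithmetic of the CTE steps above can be retraced in a few lines of Python. One assumption is made here and should be checked against the actual model output: in a logit of smoking on age-category indicators fitted to person-year records, the constant equals the log odds of smoking in the reference category, taken to be the 35-44 group. Under that assumption the constant can be computed from the table without fitting the model.

```python
import math

# Person-years from the data table.
pyears_smokers, pyears_total = 142247, 181467
ref_smokers, ref_nonsmokers = 52407, 18790   # 35-44 group (assumed reference)

# Crude OR from the expanded binary dataset.
crude_or = (630 * (39220 - 101)) / (101 * (142247 - 630))


def inverse_logit(v):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-v))


# x: exposure, the share of person-years contributed by smokers.
x = pyears_smokers / pyears_total

# bz: constant of the adjustment equation, assumed equal to the
# log odds of smoking in the reference age category.
bz = math.log(ref_smokers / ref_nonsmokers)

cte = x / inverse_logit(bz)
adjusted_or = crude_or * cte

print(round(x, 8), round(bz, 6), round(inverse_logit(bz), 6))
print(round(cte, 4), round(adjusted_or, 2))
```

The intermediate values should agree with those quoted above (x = .78387255, bz = 1.025715, CTE of about 1.0649, adjusted OR of about 1.83) up to rounding of bz.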
Conclusion:
After adjusting for age using SWOP+CTE, the point estimate increased from an OR of 1.72 to an OR of 1.83. When adjusting for age using Poisson regression, the point estimate decreased from an RR of 1.72 to an RR of 1.50. This is a change in the direction of confounding by age, which is critically important in data analysis. Given the right dataset, this could lead to a change in the direction of effect of a treatment, from effective to not effective or vice versa.
CTE cancels out the effect of age on smoking leading to death, whereas regression adjustment uses conditional probability to adjust for the effect of age on smoking leading to death. CTE transforms the results in such a way that age groups are balanced in smokers and non-smokers, similar to a randomised controlled trial, while conveniently using observational data.