*Deriving Poisson Regression Estimates with confounding using Statistics Without Probability (SWOP) and the Corrected Treatment Effect (CTE) – A Secondary Analysis of the Doll Hill Study*

Poisson regression is used in frequentist statistics when the outcome being modelled is a rate. An example is a comparison of smokers and non-smokers in which the outcome is deaths per person-year, adjusted for the age at which each person entered the study. This example is the famous Doll-Hill study of doctors who did and did not smoke and their risk of death. This paper reanalyzes the Doll-Hill dataset using an alternative technique to Poisson regression that makes none of the assumptions of Poisson regression and does not even depend on the axioms of probability.

In frequentist statistics, Poisson regression is a generalised linear model from the Poisson family, with various possible link functions, including the log link. It makes several assumptions. First, it assumes that the variance equals the mean (equidispersion). When the variance is greater than the mean (overdispersion), a negative binomial model often has to be used instead. Second, the observations have to be independent. When they are not, the standard errors have to be adjusted for the correlation. Third, it assumes that events are sparse. That is, “[the probability of occurrence of an event] in a short interval is proportional to the length of the interval” and “[the] probability of another occurrence in such a short interval is zero”. This assumption of sparsity is quite difficult to test. It is also often ignored in practice, because it excludes many types of datasets in which the counts are not sparse, e.g. counts greater than 100.
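For reference, the log-link Poisson rate model described above is conventionally written as follows. The symbols are notation assumed here for illustration (deaths $D_i$ and person-years $T_i$ in cell $i$, coefficients $\beta$ and $\gamma$) and are not taken from the original analysis:

$$
\log \mathrm{E}[D_i] = \log T_i + \beta_0 + \beta_1\,\mathrm{smokes}_i + \sum_{k} \gamma_k\,\mathrm{agecat}_{ik},
\qquad \mathrm{Var}(D_i) = \mathrm{E}[D_i]
$$

Under this model, $\exp(\beta_1)$ is the incidence rate ratio (IRR) for smokers versus non-smokers, and the second equality is the equidispersion assumption.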

Despite all these assumptions, Poisson regression is widely used, because modelling rates is a very common problem in science, medicine and social research, especially when unit-level data are unavailable. There is an alternative method of analyzing rates, with its own system of point estimation, interval estimation, hypothesis testing and adjustment for confounding, which is referred to here as SWOP+CTE.

Evaluation of the SWOP+CTE method for rates:

Smoking is the intervention or predictor variable and exposure is binary as per the Doll-Hill study. Here the data is presented as the number of deaths per person-year follow-up. Age is a confounder and has been categorized into groups.

For the SWOP+CTE method, I transformed the data below into a binary dataset. Death is the outcome and each record represents one person-year. Age is still a confounder and has been categorized into groups. Correlated error terms are not an issue for SWOP, because SWOP does not use standard errors to calculate interval estimates or P-values.

The transformed data set and original dataset can be accessed in the attached link: RateSWOP.zip

Data:

| Agecat | Smokes | Deaths | Pyears | Average (deaths per person-year) |
|---|---|---|---|---|
| 35-44 | 1 | 32 | 52407 | 0.000611 |
| 45-54 | 1 | 104 | 43248 | 0.002405 |
| 55-64 | 1 | 206 | 28612 | 0.0072 |
| 65-74 | 1 | 186 | 12663 | 0.014688 |
| 75-84 | 1 | 102 | 5317 | 0.019184 |
| Group | Smoker | 630 | 142247 (78.4%) | 0.004429 |
| 35-44 | 0 | 2 | 18790 | 0.000106 |
| 45-54 | 0 | 12 | 10673 | 0.001124 |
| 55-64 | 0 | 28 | 5710 | 0.004904 |
| 65-74 | 0 | 28 | 2585 | 0.010832 |
| 75-84 | 0 | 31 | 1462 | 0.021204 |
| Group | Non-smoker | 101 | 39220 (21.6%) | 0.002575 |
| Sum | | 731 | 181467 | RR: 1.7198 |
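The transformation from the rates table to a binary dataset with one record per person-year can be sketched as follows. This is a minimal illustration, not the script used to build RateSWOP.zip: the function name `expand_to_binary` and the record layout are hypothetical, and it assumes each person-year contributes at most one death.

```python
from collections import Counter

# (agecat, smokes, deaths, pyears) rows copied from the table above
RATES = [
    ("35-44", 1, 32, 52407),
    ("45-54", 1, 104, 43248),
    ("55-64", 1, 206, 28612),
    ("65-74", 1, 186, 12663),
    ("75-84", 1, 102, 5317),
    ("35-44", 0, 2, 18790),
    ("45-54", 0, 12, 10673),
    ("55-64", 0, 28, 5710),
    ("65-74", 0, 28, 2585),
    ("75-84", 0, 31, 1462),
]


def expand_to_binary(rows):
    """Yield one (agecat, smokes, death) record per person-year.

    Hypothetical helper: within each cell, the first `deaths` person-years
    are flagged death=1 and the remaining person-years death=0.
    """
    for agecat, smokes, deaths, pyears in rows:
        for i in range(pyears):
            yield agecat, smokes, 1 if i < deaths else 0


records = list(expand_to_binary(RATES))
n_records = len(records)                          # total person-years
n_deaths = sum(death for _, _, death in records)  # total deaths

# 2x2 cell counts keyed by (smokes, death), used for crude RR and OR
by_group = Counter((smokes, death) for _, smokes, death in records)

print(n_records, n_deaths)
print(dict(by_group))
```

The record and death totals can be checked against the Sum row of the table.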

__Smoking Crude Relative Risk (expanded binary dataset):__

Point Estimate: RR: 1.7198

Interval Estimate: 95%CI: 1.394 – 2.121

P Value: P<0.0001
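As a cross-check of the crude relative risk, the point estimate and a conventional frequentist interval can be recomputed from the group totals. The sketch below assumes a Wald interval on the log scale, which may not be the exact routine that produced the figures above. The function name `risk_ratio_ci` is hypothetical.

```python
import math


def risk_ratio_ci(d1, n1, d0, n0, z=1.96):
    """Crude risk ratio with a log-scale Wald interval (assumed method).

    d1, n1: deaths and person-year records among smokers
    d0, n0: deaths and person-year records among non-smokers
    """
    rr = (d1 / n1) / (d0 / n0)
    # Standard error of log(RR) for two independent binomial proportions
    se = math.sqrt(1 / d1 - 1 / n1 + 1 / d0 - 1 / n0)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


rr, rr_lower, rr_upper = risk_ratio_ci(630, 142247, 101, 39220)
print(f"RR = {rr:.4f}, 95% CI {rr_lower:.3f} to {rr_upper:.3f}")
```

Under these assumptions, the result should agree with the reported RR and interval to about three decimal places.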

__Smoking GLM Relative Risk (expanded binary dataset):__

Point Estimate: RR: 1.7198

Interval Estimate: 95%CI: 1.394 – 2.121

P Value: P<0.0001

__Smoking Adjusted for age Relative Risk using GLM (binomial distribution and logarithmic link) (expanded binary dataset):__

Point Estimate: RR: 1.4974

Interval Estimate: 95%CI: 1.215 – 1.846

P Value: P<0.0001

__Smoking Poisson IRR (compressed Rates dataset):__

Point Estimate: IRR: 1.7198

Interval Estimate: 95%CI: 1.394 – 2.121

P Value: P<0.0001

__Smoking Adjusted for age IRR Poisson:__

Point Estimate: IRR: 1.5014

Interval Estimate: 95%CI: 1.217 – 1.852

P Value: P<0.0001

__Crude Smoking OR (expanded binary dataset):__

Point Estimate: 1.723024

Interval Estimate: 1.396134 – 2.126452

Hypothesis Testing: P<0.0001
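The crude odds ratio can be recomputed from the 2x2 counts of the expanded binary dataset. The sketch below assumes the usual log-scale (Woolf) interval, which is a frequentist cross-check and not the SWOP interval procedure. The function name `odds_ratio_ci` is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a log-scale (Woolf) interval (assumed method).

    a: smokers who died          b: smoker person-years without death
    c: non-smokers who died      d: non-smoker person-years without death
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Counts derived from the group totals: 630 of 142247 and 101 of 39220
crude_or, or_lower, or_upper = odds_ratio_ci(630, 142247 - 630, 101, 39220 - 101)
print(f"OR = {crude_or:.6f}, 95% CI {or_lower:.4f} to {or_upper:.4f}")
```

Because deaths are rare in every cell, the odds ratio is expected to sit very close to the crude relative risk.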

__Smoking OR Adjusted for age CTE using SWOP (expanded binary dataset):__

CTE for logarithmic adjustment equation and logarithmic estimation equation.

CTE = x / inverse_logit(b_{z}) = Exposure / inverse logit of the constant in the adjustment equation

*x = Exposure = Person-years smoked / Total person-years = 142247 / 181467 = 0.78387255*

Adjustment equation from the command:

logit smoking i.agecat

*b_{z} = 1.025715* (the constant in the adjustment equation)

inverse_logit(b_{z}) = 0.73608432

Corrected Treatment Effect = CTE = x / inverse_logit(b_{z}) = 0.78387255 / 0.73608432 = 1.0649222

Smoking OR adjusted for age = crude OR × CTE = 1.723024 × 1.0649222 = 1.8348865

**Smoking OR Adjusted for age CTE using SWOP = 1.83 OR**
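The arithmetic of the CTE adjustment above can be reproduced in a few lines. This is a sketch of the calculation as stated in this section: the value of b_z is taken as reported from the adjustment equation, not re-estimated. The final check additionally assumes that the youngest group (35-44) is the base category of `i.agecat`, in which case the constant equals the log odds of smoking person-years in that group.

```python
import math


def inverse_logit(b):
    """Map a log-odds value to a proportion."""
    return 1.0 / (1.0 + math.exp(-b))


smoker_pyears = 142247
total_pyears = 181467
b_z = 1.025715        # constant reported from `logit smoking i.agecat`
crude_or = 1.723024   # crude smoking odds ratio reported above

x = smoker_pyears / total_pyears   # exposure: share of person-years smoked
p_z = inverse_logit(b_z)           # inverse logit of the constant
cte = x / p_z                      # corrected treatment effect
adjusted_or = crude_or * cte       # smoking OR adjusted for age

print(f"x = {x:.8f}")
print(f"inverse_logit(b_z) = {p_z:.8f}")
print(f"CTE = {cte:.7f}")
print(f"adjusted OR = {adjusted_or:.4f}")

# Assumption: base age category is 35-44, so the constant should equal
# log(smoker person-years / non-smoker person-years) in that group.
b_z_from_table = math.log(52407 / 18790)
print(f"b_z from table = {b_z_from_table:.6f}")
```

If the base-category assumption holds, the last line should agree with the reported constant, which shows that the CTE compares the overall exposure share with the exposure share in the base age group.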

**Conclusion:**

After adjusting for age using SWOP+CTE, the point estimate *increased from 1.72 to 1.83 OR*. When adjusting for age using Poisson regression, the point estimate **decreased from 1.72 RR to 1.50 RR**. *This is a change in the direction of confounding by age, which is critically important in data analysis. Given the right dataset, this could lead to a change in the direction of effect of a treatment from effective to not effective or vice versa.*

CTE cancels out the effect of age on smoking leading to death, whereas regression adjustment uses conditional probability to *adjust* for the effect of age on smoking leading to death. CTE transforms the results in such a way that age groups are balanced in smokers and non-smokers, similar to a randomised controlled trial, while conveniently using observational data.