Regression Analysis and Visualization

library(clinpubr)
library(dplyr)
library(survival)
library(ggplot2)

Introduction

The clinpubr package provides a comprehensive suite for regression analysis in clinical research, supporting Cox proportional hazards, logistic, and linear regression with publication-ready visualizations. This vignette covers the typical analysis workflow:

  1. Regression Scan — Screen predictors with multiple transformations
  2. Comprehensive Results — Generate multi-model specification tables
  3. Forest Plots — Visualize effect sizes
  4. Predictor Effects — RCS plots and effect plots
  5. Interaction Analysis — Test and visualize effect modification
  6. Subgroup Analysis — Subgroup forest plots

The clinpubr package supports frequently used regression models: Cox Proportional Hazards, Logistic Regression, Linear Regression, and provides unified interface for them. The model used would be infered based on the settings of the y and time arguments. If y is binary and time is given, the model would be Cox Proportional Hazards; if y is a binary variable, and time not given, the model would be Logistic Regression; if y is a continuous variable, the model would be Linear Regression.

Data Preparation

We’ll use the NCCTG Lung Cancer dataset (228 patients with advanced lung cancer):

data(cancer, package = "survival")
cancer$dead <- cancer$status == 2

cancer_clean <- cancer %>%
  mutate(
    sex = factor(sex, labels = c("Male", "Female"))
  ) %>%
  na.omit()

knitr::kable(data.frame(
  Rows = nrow(cancer_clean),
  Columns = ncol(cancer_clean),
  Description = "Complete cases for analysis"
), caption = "Analysis Dataset Summary")
Analysis Dataset Summary
Rows Columns Description
167 11 Complete cases for analysis

knitr::kable(head(cancer_clean), caption = "Sample of Analysis Variables")
Sample of Analysis Variables
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss dead
2 3 455 2 68 Male 0 90 90 1225 15 TRUE
4 5 210 2 57 Male 1 90 60 1150 11 TRUE
6 12 1022 1 74 Male 1 50 80 513 0 FALSE
7 7 310 2 68 Female 2 70 60 384 10 TRUE
8 11 361 2 71 Female 2 60 80 538 1 TRUE
9 1 218 2 53 Male 1 70 80 825 16 TRUE

Regression Scan

regression_scan() automatically tests multiple variable transformations for each predictor:

Cox Proportional Hazards Model

cox_scan <- regression_scan(
  data = cancer_clean,
  y = "status",
  time = "time",
  # set predictors = NULL to scan all available variables
  # predictors = NULL,
  predictors = c("age", "sex", "ph.karno", "pat.karno", "wt.loss", "meal.cal"),
  covars = NULL,
  save_table = FALSE
)

knitr::kable(cox_scan, caption = "Cox Regression Scan Results")
Cox Regression Scan Results
predictor nvalid original.HR original.pval original.padj logarithm.HR logarithm.pval logarithm.padj categorized.HR categorized.pval categorized.padj rcs.overall.pval rcs.overall.padj rcs.nonlinear.p best.var.trans
4 pat.karno 167 0.9812708 0.0021995 0.0131971 0.2933451 0.0024286 0.0097142 0.5831016 0.0038615 0.0193076 0.0210379 0.0698049 0.9680713 original
2 sex 167 0.6192558 0.0147907 0.0443721 NA NA NA NA NA NA NA NA NA original
1 age 167 1.0200874 0.0641959 0.1029532 3.2622315 0.0688278 0.1275780 1.3335186 0.1177009 0.1471261 0.1464013 0.1830016 0.4447903 original
3 ph.karno 167 0.9878729 0.0686354 0.1029532 0.4378243 0.0956835 0.1275780 0.6274233 0.0184766 0.0461915 0.0279220 0.0698049 0.0803601 categorized
6 meal.cal 167 0.9998804 0.6209701 0.7451641 0.8967025 0.5548170 0.5548170 0.8640874 0.4291805 0.4291805 0.7222351 0.7222351 0.6013624 categorized
5 wt.loss 167 1.0001532 0.9817372 0.9817372 NA NA NA 1.4330287 0.0527764 0.0879607 0.1053526 0.1755877 0.0466637 categorized

The scan reports the effect size, p-value, adjusted p-value (BH correction by default), and the best transformation.

Logistic Regression Scan

logistic_scan <- regression_scan(
  data = cancer_clean,
  y = "dead",
  predictors = c("age", "sex", "ph.karno", "pat.karno", "wt.loss"),
  covars = NULL,
  save_table = FALSE
)

knitr::kable(logistic_scan, caption = "Logistic Regression Scan Results")
Logistic Regression Scan Results
predictor nvalid original.OR original.pval original.padj logarithm.OR logarithm.pval logarithm.padj categorized.OR categorized.pval categorized.padj rcs.overall.pval rcs.overall.padj rcs.nonlinear.p best.var.trans
2 sex 167 0.3742964 0.0053682 0.0268408 NA NA NA NA NA NA NA NA NA original
4 pat.karno 167 0.9701901 0.0184097 0.0460243 0.1480629 0.0465029 0.0465029 0.4427438 0.0373640 0.0498187 0.0689533 0.1457501 0.5271516 original
3 ph.karno 167 0.9703678 0.0396626 0.0528173 0.0984975 0.0461388 0.0465029 0.3157895 0.0161386 0.0475296 0.0995735 0.1457501 0.2382474 categorized
1 age 167 1.0389903 0.0422538 0.0528173 10.2713218 0.0335189 0.0465029 1.7416268 0.1120568 0.1120568 0.1457501 0.1457501 0.5010354 logarithm
5 wt.loss 167 1.0084347 0.5300552 0.5300552 NA NA NA 2.2308546 0.0237648 0.0475296 0.1176075 0.1457501 0.0679239 categorized

Comprehensive Regression Results

In clinical publications, it’s standard to present multiple model specifications (Crude, Partially adjusted, Fully adjusted) to show how confounding affects the estimated association.

Cox Model with Multiple Specifications

cox_results <- regression_basic_results(
  data = cancer_clean,
  x = "age",
  y = "status",
  time = "time",
  model_covs = list(
    Crude = c(),
    Model1 = c("sex"),
    Model2 = c("sex", "ph.karno", "pat.karno")
  ),
  pers = c(1, 5, 10),
  save_output = FALSE
)

knitr::kable(cox_results$table, caption = "Cox Regression Results for Age")
Cox Regression Results for Age
Terms Count Crude Crude Model1 Model1 Model2 Model2
age (All) 167 HR P HR P HR P
Continuous NA 1.02(1.00,1.04) 0.064 1.02(1.00,1.04) 0.109 1.01(0.99,1.03) 0.352
Continuous, per 1 NA 1.02(1.00,1.04) 0.064 1.02(1.00,1.04) 0.109 1.01(0.99,1.03) 0.352
Continuous, per 5 NA 1.10(0.99,1.23) 0.064 1.09(0.98,1.21) 0.109 1.06(0.94,1.18) 0.352
Continuous, per 10 NA 1.22(0.99,1.51) 0.064 1.19(0.96,1.47) 0.109 1.11(0.89,1.40) 0.352
Continuous, per 1 SD NA 1.20(0.99,1.46) 0.064 1.17(0.97,1.43) 0.109 1.10(0.90,1.36) 0.352
Continuous, logarithm NA 3.26(0.91,11.66) 0.069 2.79(0.78,10.06) 0.116 1.89(0.49,7.34) 0.359
Grouped by Quartiles NA NA NA NA NA NA NA
Q1 41 1 (Reference) NA 1 (Reference) NA 1 (Reference) NA
Q2 42 0.79(0.46,1.34) 0.373 0.72(0.42,1.23) 0.228 0.71(0.42,1.22) 0.215
Q3 40 1.03(0.61,1.73) 0.905 0.98(0.58,1.65) 0.941 0.87(0.51,1.49) 0.613
Q4 44 1.34(0.81,2.20) 0.256 1.23(0.75,2.04) 0.414 1.08(0.64,1.83) 0.775
P for trend NA NA 0.144 NA 0.208 NA 0.561
Grouped by Median Value NA NA NA NA NA NA NA
Low 83 1 (Reference) NA 1 (Reference) NA 1 (Reference) NA
High 84 1.33(0.93,1.91) 0.118 1.31(0.92,1.88) 0.138 1.16(0.79,1.71) 0.437

Logistic Model with Clinical Cut-points

logistic_results <- regression_basic_results(
  data = cancer_clean,
  x = "ph.karno",
  y = "dead",
  model_covs = list(
    Crude = c(),
    Adjusted = c("age", "sex")
  ),
  factor_breaks = c(80),
  factor_labels = c("<80", ">=80"),
  save_output = FALSE
)

knitr::kable(logistic_results$table, caption = "Logistic Regression Results for Performance Status")
Logistic Regression Results for Performance Status
Terms Count Crude Crude Adjusted Adjusted
ph.karno (All) 167 OR P OR P
Continuous NA 0.97(0.94,1.00) 0.040 0.97(0.94,1.00) 0.085
Continuous, per 0.1 NA 1.00(0.99,1.00) 0.040 1.00(0.99,1.00) 0.085
Continuous, per 10 NA 0.74(0.55,0.98) 0.040 0.76(0.55,1.03) 0.085
Continuous, per 100 NA 0.05(0.003,0.80) 0.040 0.07(0.003,1.36) 0.085
Continuous, per 1 SD NA 0.68(0.47,0.97) 0.040 0.71(0.47,1.04) 0.085
Continuous, logarithm NA 0.10(0.009,0.87) 0.046 0.12(0.009,1.34) 0.097
Grouped by Quartiles NA NA NA NA NA
Q1 20 1 (Reference) NA 1 (Reference) NA
Q2 24 2.75(0.48,21.64) 0.275 2.58(0.43,20.67) 0.314
Q3 47 0.53(0.14, 1.76) 0.326 0.56(0.14, 1.99) 0.393
Q4 76 0.48(0.13, 1.47) 0.229 0.50(0.12, 1.69) 0.291
P for trend NA NA 0.039 NA 0.077
Grouped by Median Value NA NA NA NA NA
Low 44 1 (Reference) NA 1 (Reference) NA
High 123 0.32(0.11,0.76) 0.016 0.34(0.12,0.88) 0.035
Grouped by Clinical Value NA NA NA NA NA
<80 44 1 (Reference) NA 1 (Reference) NA
>=80 123 0.32(0.11,0.76) 0.016 0.34(0.12,0.88) 0.035

Quantile-Based Categories

When clinical cut-points aren’t established, use tertiles for balanced group comparison:

custom_results <- regression_basic_results(
  data = cancer_clean,
  x = "age",
  y = "status",
  time = "time",
  quantile_breaks = c(0.33, 0.67),
  quantile_labels = c("Youngest Third", "Middle Third", "Oldest Third"),
  label_with_range = TRUE,
  save_output = FALSE
)

knitr::kable(custom_results$table, caption = "Cox Regression Results - Age Tertiles")
Cox Regression Results - Age Tertiles
Terms Count Crude Crude
age (All) 167 HR P
Continuous NA 1.02(1.00,1.04) 0.064
Continuous, per 0.1 NA 1.00(1.00,1.00) 0.064
Continuous, per 10 NA 1.22(0.99,1.51) 0.064
Continuous, per 100 NA 7.31(0.89,60.04) 0.064
Continuous, per 1 SD NA 1.20(0.99,1.46) 0.064
Continuous, logarithm NA 3.26(0.91,11.66) 0.069
Grouped by Quartiles NA NA NA
Q1 [39,57) 41 1 (Reference) NA
Q2 [57,64) 42 0.79(0.46,1.34) 0.373
Q3 [64,70) 40 1.03(0.61,1.73) 0.905
Q4 [70,82] 44 1.34(0.81,2.20) 0.256
P for trend NA NA 0.144
Grouped by Median Value NA NA NA
Low [39,64) 83 1 (Reference) NA
High [64,82] 84 1.33(0.93,1.91) 0.118
Values NA NA NA
Youngest Third [39,58.8) 55 1 (Reference) NA
Middle Third [58.8,68) 54 0.92(0.58,1.46) 0.721
Oldest Third [68,82] 58 1.37(0.89,2.13) 0.154
P for trend NA NA 0.145

Forest Plots

Univariate Forest Plot

uni_forest <- regression_forest(
  data = cancer_clean,
  model_vars = c("age", "ph.karno", "pat.karno", "wt.loss"),
  y = "status",
  time = "time",
  as_univariate = TRUE, # Generate univariate forest plot
  save_plot = FALSE
)

uni_forest

The univariate forest plot displays:

In this example, pat.karno shows a significant protective effect (HR ≈ 0.98, p < 0.01), while age, ph.karno, and wt.loss do not reach statistical significance in univariate analysis.

Multivariate Forest Plot

multi_forest <- regression_forest(
  data = cancer_clean,
  model_vars = list(
    Model1 = c("age"),
    Model2 = c("age", "ph.karno"),
    Model3 = c("age", "ph.karno", "pat.karno")
  ),
  y = "status",
  time = "time",
  as_univariate = FALSE, # Generate multivariate forest plot
  save_plot = FALSE
)

multi_forest

The multivariate forest plot compares different model specifications:

Key observations: - As more covariates are added, the effect estimate for age changes slightly - The confidence intervals may widen or narrow depending on the correlation between predictors - This visualization helps assess confounding effects and model stability

Customizing Forest Plots

custom_forest <- regression_forest(
  data = cancer_clean,
  model_vars = c("age", "ph.karno", "pat.karno", "wt.loss", "meal.cal"),
  y = "status",
  time = "time",
  as_univariate = TRUE,
  show_vars = c("age", "ph.karno", "pat.karno"),
  est_nsmall = 2,
  p_nsmall = 4,
  save_plot = FALSE
)

custom_forest

The customized forest plot demonstrates several formatting options:

This allows focused presentation of key predictors with precise numerical formatting suitable for publication.

Predictor Effect Visualization

Predictor Effect Plot

A unified interface for both continuous and categorical predictors:

effect_linear <- predictor_effect_plot(
  data = cancer_clean,
  x = "age",
  y = "status",
  time = "time",
  covars = c("sex"),
  method = "linear",
  add_hist = TRUE,
  save_plot = FALSE
)

effect_linear

The linear predictor effect plot shows:

This plot assumes a linear relationship between the predictor and outcome, useful for confirming linearity assumptions or presenting simple linear effects.

effect_cat <- predictor_effect_plot(
  data = cancer_clean,
  x = "sex",
  y = "status",
  time = "time",
  covars = c("age", "ph.karno"),
  method = "categorical",
  save_plot = FALSE
)

effect_cat

The categorical predictor effect plot displays:

In this example, females show a significantly lower hazard compared to males (HR ≈ 0.62, 95% CI does not include 1.0), indicating better survival outcomes for female patients after adjusting for age and performance status.

RCS Plots

RCS plots visualize non-linear relationships while preserving the continuous nature of the data, the plot can also be generated using predictor_effect_plot() with method = "rcs":

rcs_age <- rcs_plot(
  data = cancer_clean,
  x = "age",
  y = "status",
  time = "time",
  covars = c("sex", "ph.karno"),
  knot = 4,
  ref = "x_median",
  add_hist = TRUE,
  save_plot = FALSE
)

rcs_age

The RCS (Restricted Cubic Spline) plot visualizes the non-linear relationship between age and survival, the element interpretation is similar to the linear plot. There is one additional element:

Key features: - The spline with 4 knots (knot = 4) allows flexible modeling of non-linear relationships - Reference point (ref = "x_median") is set at the median age (HR = 1.0) - The plot shows how the hazard changes relative to the median age

Customized RCS Plot

custom_rcs <- rcs_plot(
  data = cancer_clean,
  x = "wt.loss",
  y = "status",
  time = "time",
  covars = c("age", "sex"),
  knot = 5,
  ref = 0,
  add_hist = FALSE,
  trans = "log10",
  y_lim = c(0.5, 5),
  save_plot = FALSE
)

custom_rcs

The customized RCS plot demonstrates advanced options:

This plot examines the relationship between weight loss and survival, with the reference at zero weight loss. The log-scale y-axis helps visualize effect sizes across a wide range.

Subgroup Analysis

Scanning for Interactions

Interaction analysis is essential in clinical research to assess whether the effect of a predictor variable on the outcome differs across subgroups defined by another variable (effect modification). In medical papers, this is commonly referred to as subgroup analysis with interaction testing.

The interaction_scan() function systematically evaluates potential interactions between multiple predictors and grouping variables:

Clinical interpretation: A significant interaction p-value (typically < 0.05) indicates that the effect of the predictor on the outcome is modified by the grouping variable, suggesting that treatment effects or risk associations may differ between subgroups.

interaction_scan <- interaction_scan(
  data = cancer_clean,
  y = "status",
  time = "time",
  predictors = c("age", "ph.karno", "pat.karno"),
  group_vars = c("sex", "ph.karno"), # continuous variables will be split by median into subgroups
  save_table = FALSE
)

knitr::kable(interaction_scan, caption = "Interaction Scan Results")
Interaction Scan Results
predictor group.by nvalid linear.p.int rcs.p.int linear.p.adj rcs.p.adj
2 age ph.karno 167 0.0972655 0.0117594 0.3101036 0.0235189
3 ph.karno sex 167 0.1240414 NA 0.3101036 NA
5 pat.karno ph.karno 167 0.5731909 NA 0.7485232 NA
1 age sex 167 0.6276238 0.9342510 0.7485232 0.9342510
4 pat.karno sex 167 0.7485232 NA 0.7485232 NA

Understanding the output columns:

Column Description
predictor The continuous or categorical variable being tested for effect modification
group.by The subgroup variable (e.g., sex) that may modify the predictor’s effect
nvalid Number of complete observations used in the interaction test
linear.p.int Raw p-value for linear interaction effect
linear.p.adj BH-adjusted p-value for linear interaction
rcs.p.int Raw p-value for non-linear (RCS) interaction effect
rcs.p.adj BH-adjusted p-value for RCS interaction

Visualizing Interactions

interaction_age_sex <- interaction_plot(
  data = cancer_clean,
  predictor = "age",
  y = "status",
  time = "time",
  group_var = "sex",
  covars = c("ph.karno"),
  save_plot = FALSE
)

interaction_age_sex
#> $lin

#> 
#> $rcs

The interaction plot displays two panels:

Linear Interaction Plot ($lin): - Shows the effect of age on survival separately for males and females - Different colored lines represent different sex groups - Parallel lines suggest no interaction; diverging/converging lines suggest interaction - Shaded areas represent 95% confidence intervals

RCS Interaction Plot ($rcs): - Shows non-linear effects of age within each sex group - Allows detection of non-linear interactions - Each curve represents the HR relative to the reference point for that sex

In this example, the lines appear roughly parallel, suggesting no significant interaction between age and sex (confirmed by p-value > 0.05).

Subgroup Forest Plot

subgroup_result <- subgroup_forest(
  data = cancer_clean,
  x = "age",
  y = "status",
  time = "time",
  subgroup_vars = c("sex", "pat.karno"),
  covars = c("ph.karno"),
  save_plot = FALSE
)

subgroup_result

The subgroup forest plot displays the effect of age on survival stratified by groups:

Each row shows: - Sample size for that subgroup - Effect estimate with 95% confidence interval - P-value of interaction

This visualization helps assess: - Whether the effect is consistent across subgroups - Potential effect modification by the stratifying variable

Summary

Function Purpose
regression_scan() Automated screening of predictors with multiple transformations
regression_basic_results() Comprehensive results with various model specifications
regression_forest() Publication-ready forest plots
rcs_plot() / predictor_effect_plot() Flexible visualization of predictor effects
interaction_scan() / interaction_plot() Interaction analysis and visualization
subgroup_forest() Subgroup analysis with forest plots