A pre-specified secondary analysis of an open clinical dataset

Glycaemic control, vascular brain injury & cognition in type 2 diabetes

This project works a small, messy clinical dataset end to end: 88 subjects, 1,111 columns, a plan fixed in advance, transparent accounting of how the sample shrinks, small-sample-robust inference, and a machine-learning layer that declines to predict what n=44 cannot support. No exposure showed a robust association with cognition. The contribution is showing exactly what that null can and cannot rule out.

n=44 analytic sample · 88 records · 1,111 columns · 0/4 robust exposure associations · HbA1c: excludes ≥0.5 SD · Scripts regenerate tables and figures

Dataset: PhysioNet GE-75, Cerebral Perfusion & Cognitive Decline in T2DM (v1.0.1)

Analytic cohort: 75 DM subjects · 44 with complete cognitive composite · 41 with WMH data

Approach: Planned OLS · HC3 robust SE · Bootstrap sensitivity · Complete-case

Status: Analysis complete · Plan predates model outputs · Not prospectively registered

The question

Among older adults with type 2 diabetes, is poorer current glycaemic control (higher HbA1c, a roughly 3-month marker) associated with poorer global cognition after adjustment for age, sex and BMI? The secondary, exploratory exposures were white-matter-hyperintensity burden normalised by intracranial volume (WMH/ICV), gait speed and global cerebral vasoreactivity. The analysis is deliberately narrow: it answers what the released table can support rather than reconstructing the original study's full mechanistic aims.

In this small group, blood-sugar control did not track with thinking ability, and the analysis is careful about how strong a conclusion that supports.

Data reality

The central constraint is not model choice; it is usable data. The released summary table contains 75 people with diabetes and only 13 controls, so the inferential analysis is diabetes-only. Cognitive testing is almost all-or-nothing: 44 diabetic participants have all five cognitive tests, 30 have none, and one has only two. MRI and vasoreactivity are sparser still.

Stage	n
All summary rows	88
DM subjects	75
Cognitive composite	44
Vascular model with WMH/ICV	41
Vasoreactivity model	36

The analysis is a protocol-completer story: cognition, MRI and haemodynamic measures co-occur in a much smaller subset than the released table initially suggests.
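The attrition accounting above can be sketched as a sequence of filters, each recording how many rows survive. This is a minimal illustration on a toy frame, not the project's audit script: the column names (`group`, `cog_composite`, `wmh_icv`) are hypothetical, and the stages here are applied cumulatively, which may not match how every stage in the table was derived.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the released summary table; all column names are hypothetical.
df = pd.DataFrame({
    "group": ["DM", "DM", "DM", "DM", "CONTROL"],
    "cog_composite": [0.2, np.nan, -0.4, 0.1, 0.3],
    "wmh_icv": [0.01, np.nan, np.nan, 0.02, 0.01],
})

def attrition(frame, steps):
    """Apply boolean filters in order and record the surviving n after each."""
    rows, current = [("All summary rows", len(frame))], frame
    for label, keep in steps:
        current = current[keep(current)]
        rows.append((label, len(current)))
    return pd.DataFrame(rows, columns=["Stage", "n"])

steps = [
    ("DM subjects", lambda d: d["group"].eq("DM")),
    ("Cognitive composite", lambda d: d["cog_composite"].notna()),
    ("Vascular model with WMH/ICV", lambda d: d["wmh_icv"].notna()),
]
table = attrition(df, steps)
print(table)
```

Keeping the filters in one ordered list means the reported sample flow and the analysis subset come from the same code path.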

Figure 1 Selected-variable missingness in the diabetes analysis base. The heatmap makes the effective sample size visible: cognition is concentrated in the same protocol-completer subset that drives the inferential models.

Methods at a glance

Outcome

Five-test cognitive composite; Trail-Making Test Part B reverse-scored; higher values mean better cognition.
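One common way to build such a composite is to z-score each test, flip the sign of any time-based score, and average. The sketch below assumes that construction; the source does not state the exact aggregation, and the column names and the three-test toy data are hypothetical.

```python
import pandas as pd

# Hypothetical test columns; "tmt_b_seconds" is a completion time, so higher is worse.
tests = pd.DataFrame({
    "test_a": [10, 12, 14, 16],
    "test_b": [5, 7, 6, 8],
    "tmt_b_seconds": [120, 90, 100, 70],
})

# z-score each test across participants
z = (tests - tests.mean()) / tests.std(ddof=1)

# Reverse-score the timed test so that higher always means better cognition
z["tmt_b_seconds"] = -z["tmt_b_seconds"]

# skipna=False leaves the composite missing unless every component is present,
# which mirrors a complete-case composite
composite = z.mean(axis=1, skipna=False)
print(composite)
```

Requiring all components is what makes the composite all-or-nothing: a participant with only some tests gets no score rather than a score on a different scale.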

Primary model

Cognitive composite ~ HbA1c + age + sex + BMI.

Inference

OLS with HC3 robust standard errors for small-sample heteroskedasticity.
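A minimal sketch of fitting the primary model with HC3 standard errors in statsmodels follows. The data are simulated noise and the column names are hypothetical (sex is coded as a 0/1 `male` indicator for simplicity), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 44

# Simulated stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "hba1c": rng.normal(7.0, 1.0, n),
    "age": rng.normal(65, 8, n),
    "male": rng.integers(0, 2, n),
    "bmi": rng.normal(29, 4, n),
})
df["cog"] = rng.normal(0, 0.75, n)  # outcome is pure noise by construction

# cov_type="HC3" swaps the classical covariance for the HC3 sandwich estimator;
# the coefficient estimates themselves are unchanged.
fit = smf.ols("cog ~ hba1c + age + male + bmi", data=df).fit(cov_type="HC3")

beta = fit.params["hba1c"]
ci_low, ci_high = fit.conf_int().loc["hba1c"]
print(f"beta={beta:.3f}, HC3 95% CI {ci_low:.3f} to {ci_high:.3f}")
```

HC3 inflates the contribution of high-leverage residuals, which tends to make it more conservative than HC0 or HC1 in small samples.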

Precision checks

Percentile bootstrap CIs and post-hoc TOST equivalence against a +/-0.5 SD precision bound.
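A percentile bootstrap for a regression coefficient can be sketched with NumPy alone: resample participants with replacement, refit, and take the 2.5th and 97.5th percentiles of the refitted coefficients. The data, the number of resamples (2,000) and the helper name `slope` are assumptions for illustration, not the project's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 44

# Simulated design: intercept plus four covariates; column 1 plays the exposure.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = rng.normal(size=n)

def slope(Xm, yv, col=1):
    """Least-squares coefficient for one column of the design matrix."""
    coef, *_ = np.linalg.lstsq(Xm, yv, rcond=None)
    return coef[col]

boot = np.empty(2000)
for b in range(boot.size):
    idx = rng.integers(0, n, n)      # resample whole participants with replacement
    boot[b] = slope(X[idx], y[idx])

ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"percentile bootstrap 95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

Resampling whole rows (a pairs bootstrap) makes no assumption of constant error variance, which is why it is a natural cross-check on HC3 intervals.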

Robustness

Leave-one-out refits and Cook's distance.

Confounding transparency

Education unavailable in the released table; GDS sensitivity model added; diabetes duration, hypertension and stroke/TIA tabulated.

Machine learning

Elastic net under leakage-safe nested cross-validation, framed as feature prioritisation rather than clinical prediction.
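A minimal sketch of leakage-safe nested cross-validation with scikit-learn is shown here. Scaling and tuning sit inside the inner search, so the outer test folds never influence preprocessing or hyperparameters. The grid values, fold counts and simulated data are assumptions for illustration, not the project's configuration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(44, 5))
y = rng.normal(size=44)  # noise outcome: no model should beat the mean

# The scaler is part of the pipeline, so it is refit on each training fold only.
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))
grid = {
    "elasticnet__alpha": [0.01, 0.1, 1.0],
    "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
}
inner = GridSearchCV(pipe, grid, cv=5, scoring="neg_root_mean_squared_error")

# Outer loop: repeated K-fold gives an error estimate for the whole tuning procedure.
outer = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

def cv_rmse(model):
    scores = cross_val_score(model, X, y, cv=outer,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

rmse_enet = cv_rmse(inner)
rmse_mean = cv_rmse(DummyRegressor(strategy="mean"))
print(f"elastic net RMSE={rmse_enet:.2f}, mean baseline RMSE={rmse_mean:.2f}")
```

Scoring the mean-only baseline on the same outer folds is what makes "did not beat the baseline" a like-for-like statement.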

Exposure label

Refined from "glycaemic burden" to "current glycaemic control" because a single HbA1c is a roughly 3-month marker.

Results

All four intervals cross zero. WMH/ICV has the largest negative point estimate but also the widest, most fragile uncertainty. Gait and vasoreactivity point positive but remain imprecise and exploratory.

Figure 2 Main exposure coefficients standardised per SD exposure, in composite-SD outcome units, with HC3 95% confidence intervals.

The primary HbA1c model was null: beta = -0.052 per 1% HbA1c, HC3 95% CI -0.279 to +0.176, p=0.65. On a fully standardised scale, beta* = -0.093. A post-hoc TOST equivalence analysis rules out HbA1c-cognition associations of +/-0.5 SD or larger in this sample (TOST p=0.025; 90% CI -0.43 to +0.25), while smaller associations remain indistinguishable from noise.

A null is not "no effect"; it is a statement of precision. Here, the data can rule out a large association between glycaemic control (HbA1c) and cognition, but the estimates for the secondary vascular and haemodynamic exposures are too imprecise to exclude even a +/-0.5 SD association.
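The equivalence logic can be checked roughly from the reported numbers alone. The sketch below backs a standard error out of the reported 90% CI and runs the two one-sided tests against +/-0.5 SD. It assumes the 90% CI is on the fully standardised scale, that the interval is symmetric, and that a normal reference distribution is adequate; the source does not state which distribution was used, and the inputs are rounded, so only approximate agreement with the reported TOST p of 0.025 should be expected.

```python
from statistics import NormalDist

beta_std = -0.093                   # reported fully standardised coefficient
ci90_low, ci90_high = -0.43, 0.25   # reported 90% CI (rounded)
bound = 0.5                         # equivalence bound in SD units

z = NormalDist()

# Back out the standard error from the CI width (normal approximation, an assumption)
se = (ci90_high - ci90_low) / (2 * z.inv_cdf(0.95))

# Two one-sided tests: reject both to conclude the association lies inside the bounds
p_lower = 1 - z.cdf((beta_std + bound) / se)  # H0: beta <= -bound
p_upper = z.cdf((beta_std - bound) / se)      # H0: beta >= +bound
p_tost = max(p_lower, p_upper)

print(f"se={se:.3f}, p_lower={p_lower:.4f}, p_upper={p_upper:.4f}, TOST p={p_tost:.4f}")
```

The TOST p-value is the larger of the two one-sided p-values, which is equivalent to asking whether the 90% CI sits entirely inside the +/-0.5 SD bounds.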

Figure 3 HbA1c versus cognitive composite: unadjusted scatter with a bivariate fit, shown for description only. It shows why the result is visually unsurprising: most participants cluster around HbA1c 6-7%, with a thin higher-HbA1c tail and no clear cognitive trend.

A model that declines to predict

Under repeated nested cross-validation, the elastic net did not beat a mean-only baseline and its full-fit coefficients shrank to zero. A random forest performed worse. The disciplined conclusion is that no prediction model is warranted at n=44; the value is a leakage-safe workflow that reports its own uncertainty and declines to manufacture a result.

Feature set	Model	CV RMSE
5 complete	Mean baseline	0.74
5 complete	Elastic net	0.74
5 complete	Random forest	0.81
7 imputed	Elastic net	0.75

What it can and cannot say

Can say

The released GE-75 summary table does not show a robust association between the cognitive composite and any of the four exposures (HbA1c, WMH/ICV, gait speed, vasoreactivity) in the analytic diabetes sample.
For HbA1c, the data can reject a standardised association of 0.5 SD or larger in either direction.
The WMH/ICV estimate is fragile and method-dependent: HC3 includes zero, bootstrap marginally excludes zero, and Cook's distance identifies one high-influence observation.
A leakage-safe ML workflow does not support prediction at this sample size.

Cannot say

It cannot prove no glycaemic-cognition relationship exists.
It cannot estimate small effects precisely.
It cannot make causal claims.
It cannot make clinical prediction claims.
It cannot fully address residual confounding because education is effectively unavailable in the released summary table.
It rests on a complete-case assumption: the observed comparisons of completers with excluded participants are consistent with that assumption but cannot prove it.

Reproducibility and attribution

The project is scripted end to end: one audit script, one analysis script and one machine-learning script regenerate the tables and figures used in the report. The current environment verification used Python 3.14, pandas 3.0, NumPy 2.4 and scikit-learn 1.9.

Verified in the current environment; scripts and generated tables are available on request.

Python · pandas · statsmodels · HC3 OLS · SciPy · scikit-learn · nested CV · TOST equivalence · Matplotlib/Seaborn

Data: Novak V, Quispe R, Saunders C. Cerebral perfusion and cognitive decline in type 2 diabetes (version 1.0.1). PhysioNet, 2022. CC BY 4.0. https://doi.org/10.13026/whjz-e968