From Clinical Trials to Real-World Impact: Introducing a Computational Framework to Detect Endpoint Bias in Opioid Use Disorder Research.
Article
Odom, Gabriel J., Brandt, Laura, Marker, Aaron, et al. (2026). From Clinical Trials to Real-World Impact: Introducing a Computational Framework to Detect Endpoint Bias in Opioid Use Disorder Research. Drug and Alcohol Review, 45(1), e70085. https://doi.org/10.1111/dar.70085
Introduction
Clinical trial endpoints are a 'finite sequence of instructions to perform a task' (here, measuring treatment effectiveness), which makes them algorithms by definition. Consequently, they may exhibit algorithmic bias: internal and external performance can vary across demographic groups, affecting fairness, validity and clinical decision-making.
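To make the 'endpoint as algorithm' framing concrete, consider a common OUD-style binary endpoint: abstinence over the final weeks of a trial, judged from weekly urine drug screens. Written out, it is a short deterministic procedure. The function below is a hypothetical illustration, not an endpoint taken from this study:

```python
# Hypothetical OUD endpoint expressed as an algorithm:
# 'responder' = no opioid-positive urine drug screens (UDS) across the
# final four weekly visits, with missed visits counted as positive.
def abstinent_final_four_weeks(weekly_uds):
    """weekly_uds[i] is True if visit i was opioid-positive,
    False if negative, and None if the visit was missed."""
    final_four = weekly_uds[-4:]
    # Responder only if all of the last four screens are observed negatives.
    return len(final_four) == 4 and all(r is False for r in final_four)
```

Because the instructions are fixed in advance, such an endpoint can be audited exactly like any other algorithm.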
Methods
We developed the open-source Detecting Algorithmic Bias (DAB) Pipeline in Python to identify endpoint 'performance variance', a specific form of algorithmic bias, as the proportion of minority participants changes. The pipeline assesses internal performance (on demographically matched test data) and external performance (on demographically diverse validation data) using metrics including F1 score and area under the receiver operating characteristic curve (AUROC). We applied it to representative opioid use disorder (OUD) trial endpoints.
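As a rough sketch of what this internal/external comparison looks like in code (illustrative only: scikit-learn is assumed, and the function and its inputs are invented for exposition, not the DAB Pipeline's actual API):

```python
# Sketch of internal vs. external endpoint evaluation
# (illustrative; not the DAB Pipeline's actual interface).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

def evaluate_endpoint(X_train, y_train, internal_set, external_set):
    """Fit on demographically matched training data, then score on a
    matched test set (internal) and a diverse validation set (external)."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = {}
    for name, (X, y) in (("internal", internal_set),
                         ("external", external_set)):
        prob = model.predict_proba(X)[:, 1]   # predicted endpoint probability
        scores[name] = {"F1": f1_score(y, (prob >= 0.5).astype(int)),
                        "AUROC": roc_auc_score(y, prob)}
    return scores
```

A gap between the two sets of scores is the 'performance variance' the pipeline is designed to surface.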
Results
F1 scores remained stable across minority representation levels, suggesting that the precision-recall balance holds despite demographic shifts. Conversely, AUROC was more sensitive, revealing significant performance variance. Training on demographically homogeneous populations boosted internal performance (accuracy within similar cohorts) but critically compromised external generalisability (accuracy within diverse cohorts). This pattern reveals an 'endpoint bias trade-off': optimising performance for homogeneous populations versus generalising to the demographically diverse real world.
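The trade-off can be reproduced qualitatively on synthetic data. In the sketch below (illustrative only; the cohorts, coefficients and resulting numbers are invented, not the study's data), two cohorts have different feature-outcome relationships, so a model fit mostly to one transfers poorly to the other; increasing the minority share of the training data shifts AUROC from the 'internal' holdout toward the 'external' one:

```python
# Toy demonstration of the endpoint bias trade-off on synthetic data.
# These cohorts and numbers are invented, NOT the study's results.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n, coefs):
    """Toy cohort; `coefs` sets the (group-specific) feature-outcome link."""
    X = rng.normal(0.0, 1.0, size=(n, 3))
    y = (X @ np.asarray(coefs) + rng.normal(0.0, 1.0, n) > 0).astype(int)
    return X, y

X_maj, y_maj = make_cohort(2000, coefs=[1.0, -0.5, 0.25])   # 'majority'
X_min, y_min = make_cohort(2000, coefs=[-0.5, 1.0, 0.25])   # 'minority'

for p in (0.0, 0.1, 0.3, 0.5):          # minority share of the training data
    n_min = int(1000 * p)
    X_tr = np.vstack([X_maj[:1000 - n_min], X_min[:n_min]])
    y_tr = np.concatenate([y_maj[:1000 - n_min], y_min[:n_min]])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Simplification: 'internal' = majority holdout, 'external' = minority holdout.
    internal = roc_auc_score(y_maj[1000:], model.predict_proba(X_maj[1000:])[:, 1])
    external = roc_auc_score(y_min[1000:], model.predict_proba(X_min[1000:])[:, 1])
    print(f"minority share {p:.0%}: internal AUROC {internal:.3f}, "
          f"external AUROC {external:.3f}")
```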
Discussion and conclusions
Endpoints that perform consistently for one demographic profile may lose generalisability under population shifts, potentially introducing endpoint bias. Increasing minority representation in the training data consistently improved generalisability. The endpoint bias trade-off reinforces the importance of diverse recruitment in OUD trials. The DAB Pipeline helps researchers systematically pinpoint when an endpoint may suffer 'performance variance' (i.e., bias). As an open-source tool, it promotes transparent endpoint evaluation and supports the selection of demographically invariant OUD endpoints.