Original Research

Design differences and variation in results between randomised trials and non-randomised emulations: meta-analysis of RCT-DUPLICATE data

Abstract

Objective To explore how design emulation and population differences relate to variation in results between randomised controlled trials (RCT) and non-randomised real world evidence (RWE) studies, based on the RCT-DUPLICATE initiative (Randomised, Controlled Trials Duplicated Using Prospective Longitudinal Insurance Claims: Applying Techniques of Epidemiology).

Design Meta-analysis of RCT-DUPLICATE data.

Data sources Trials included in RCT-DUPLICATE, a demonstration project that emulated 32 randomised controlled trials using three real world data sources: Optum Clinformatics Data Mart, 2004-19; IBM MarketScan, 2003-17; and subsets of Medicare parts A, B, and D, 2009-17.

Eligibility criteria for selecting studies Trials where the primary analysis resulted in a hazard ratio; 29 RCT-RWE study pairs from RCT-DUPLICATE.

Results Differences and variation in effect sizes between the results from randomised controlled trials and real world evidence studies were investigated. Most of the heterogeneity in effect estimates between the RCT-RWE study pairs in this sample could be explained by three emulation differences in the meta-regression model: treatment started in hospital (which does not appear in health insurance claims data), discontinuation of some baseline treatments at randomisation (which would have been an unusual care decision in clinical practice), and delayed onset of drug effects (which would be under-reported in real world clinical practice because of the relatively short persistence of the treatment). Adding the three emulation differences to the meta-regression reduced heterogeneity from 1.9 to almost 1 (absence of heterogeneity).

Conclusions This analysis suggests that a substantial proportion of the observed variation between results from randomised controlled trials and real world evidence studies can be attributed to differences in design emulation.

What is already known on this topic

  • Real world evidence studies can complement randomised controlled trials by providing insights on the effectiveness of a medical treatment in clinical practice

  • Concerns about confounding have limited the use of real world evidence studies in clinical practice and policy decisions

What this study adds

  • This study suggests that heterogeneity among pairs of randomised controlled trials and their non-randomised emulations can be explained by differences in design emulation

How this study might affect research, practice, or policy

  • These results could inform researchers and clinicians on the degree to which apparent divergence in results between randomised controlled trials and real world evidence studies can be driven by differences in study design and the research question

Introduction

Real world evidence (RWE) has been defined as evidence on the effects of medical products that is derived from the analysis of real world data, which includes different sources of patient health data, particularly data collected as part of routine clinical practice, including electronic health records and insurance claims data.1 Interest in the use of real world evidence from real world data to support clinical practice and policy decisions has been increasing.2–5 Concerns remain, however, about the validity of this evidence compared with the traditional randomised controlled trial (RCT).5–7

These concerns come from a misleading dichotomy that sets randomised controlled trials against database studies instead of viewing them as providing complementary information that informs a better understanding of the effects of drugs.8 Results from databases and randomised controlled trials have been compared, and some have found high concordance, supporting the ability of well designed database studies to generate valid causal conclusions.9–13 Others have used observed differences in results to criticise database studies as intractably confounded.7 14–17

The RCT-DUPLICATE initiative (Randomised, Controlled Trials Duplicated Using Prospective Longitudinal Insurance Claims: Applying Techniques of Epidemiology) is one effort comparing randomised controlled trials with database studies.10 18–20 RCT-DUPLICATE set out to emulate 32 trials by prospectively designing a series of insurance claims database studies to match the design of each randomised controlled trial as closely as possible within the confines and limitations of using data that were not collected for research purposes. Because of the nature of using routinely collected data from clinical practice, some elements of the trial design could not be exactly emulated (eg, measures to ensure prolonged adherence over long follow-up periods). These emulation differences can be summarised as differences in outcome measurements, demographics of the included patients, treatment implementation in clinical practice, and lack of placebo in clinical practice. Design emulation and population differences change the question or estimand being looked at in the randomised controlled trial compared with the database study.21 22

Our aim was to use the RCT-DUPLICATE collection of emulated trials to assess how design emulation and population differences relate to variation in results between randomised controlled trials and real world evidence database studies that were designed to emulate them. We explored whether the characteristics of design emulation and population differences can reduce and therefore explain the residual heterogeneity in differences in effect size in a meta-regression analysis.

Methods

Our analysis was exploratory rather than confirmatory, meaning that the data used for the analysis were collected for another purpose. The conclusions drawn from our analysis might therefore help to formulate hypotheses to be tested in a subsequent confirmatory study. Our aim was to better understand emulation differences and how they affect variation in results between RCT-RWE study pairs.

RCT-DUPLICATE

The selection process for the RCT-DUPLICATE initiative is described in detail elsewhere.18 23 In summary, the RCT-DUPLICATE consortium emulated 32 randomised controlled trials that were relevant to regulatory decision making and were potentially feasible to emulate based on insurance claims data because key study parameters, such as the primary outcome, treatment strategies, and inclusion and exclusion criteria were measurable. The selected trials included a mix of superiority and non-inferiority trials, trials with large and small effect sizes, and a mix of trials with active comparators and placebo added to active standard of care treatments. The consortium used three real world data sources to implement the database studies that emulated the randomised controlled trials: Optum Clinformatics Data Mart, 2004-19; IBM MarketScan, 2003-17; and subsets of Medicare parts A, B, and D (data from 2011 to 2017 including all patients with a diagnosis of diabetes or heart failure, and data from 2009 to 2017 including all patients who had been prescribed an oral anticoagulant). Whenever possible, the emulations of the randomised controlled trials were implemented in more than one of the data sources with a while on-treatment analysis (chosen because of the shorter duration of drug use in clinical practice whereas adherence to treatment is generally longer in randomised controlled trials) and the final analyses were based on estimates resulting from a fixed effects meta-analysis of the implementations in all databases.
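The pooling step described above can be illustrated with a minimal sketch. It assumes the standard inverse variance form of a fixed effects meta-analysis on the log hazard ratio scale (the exact estimator is not spelled out here), and the function name and input values are hypothetical rather than taken from RCT-DUPLICATE. The sketch also returns Cochran's Q, the statistic on which a test of heterogeneity across databases is commonly based.

```python
import math

def pool_fixed_effect(log_hrs, ses):
    """Inverse variance fixed effects pooling of log hazard ratios (assumed estimator)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_hrs))
    return pooled, pooled_se, q

# Hypothetical log hazard ratios from three data sources (not RCT-DUPLICATE values)
log_hrs = [math.log(0.80), math.log(0.90), math.log(0.85)]
ses = [0.10, 0.10, 0.10]
pooled, pooled_se, q = pool_fixed_effect(log_hrs, ses)
pooled_hr = math.exp(pooled)  # back-transformed pooled hazard ratio
```

With equal standard errors the pooled log hazard ratio reduces to the simple mean of the database-specific estimates, which gives a quick sanity check of the weighting.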

In this study, only trials where the primary analysis resulted in a hazard ratio were used. The LEAD2 trial with continuous outcome was excluded. For two trials (ISAR-REACT5 and VERO) a χ2 test indicated that the results were heterogeneous across databases so that the meta-analysis could not be performed to obtain a pooled real world evidence estimate for the hazard ratio19 and these trials were also excluded. Online supplemental file, section A has a summary of the 29 trials included in the analysis. We evaluated hazard ratios that were adjusted for confounding by 1:1 nearest neighbour propensity score matching on prespecified risk factors (chosen in discussion with clinical experts), as described in Franklin et al,19 for the RCT-RWE comparisons.

Design emulation and population differences identified in RCT-DUPLICATE

Emulation differences were recorded as covariates in RCT-DUPLICATE. Differences in age and sex distributions were captured as numerical variables representing the difference in mean age or percentage of women (the value in the randomised controlled trial minus the value in the real world evidence pooled emulation). Table 1 shows the categorical emulation difference characteristics by reference category recorded in RCT-DUPLICATE.18

Table 1 | Categorical emulation differences with possible levels, reference category, and description. All characteristics are binary

All characteristics in table 1 were summarised as a binary composite covariate, indicating if the RCT-RWE study pair was closely emulated or not closely emulated. More specifically, a study pair was considered closely emulated if the comparator and outcome emulations were at least moderate, and at least one of them was good, and if none of the following was true: follow-up started in hospital; run-in window that selectively included responders to one treatment arm; effects of randomisation and discontinuation of baseline treatment were mixed; and delayed effect over a long period of follow-up. The composite indicator was defined as part of the post hoc explorations by the RCT-DUPLICATE team18 to evaluate concordance in the results for randomised controlled trial-database pairs, with closer versus less close emulation of the design and research question based on the randomised controlled trial PICOT (population, intervention, comparator, outcome, time).
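The rule for the composite indicator can be written out as a short sketch to make the logic explicit. The field names and the three level coding of emulation quality ("poor", "moderate", "good") are assumptions made for illustration; only the "at least moderate" and "at least one good" conditions and the four exclusion criteria come from the description above.

```python
from dataclasses import dataclass

# Assumed ordering of emulation quality levels (labels are hypothetical)
RANK = {"poor": 0, "moderate": 1, "good": 2}

@dataclass
class EmulationFlags:
    comparator_quality: str
    outcome_quality: str
    started_in_hospital: bool
    selective_run_in: bool
    mixed_randomisation_and_discontinuation: bool
    delayed_effect: bool

def closely_emulated(f: EmulationFlags) -> bool:
    """Return True if the study pair meets the described criteria for close emulation."""
    comparator = RANK[f.comparator_quality]
    outcome = RANK[f.outcome_quality]
    # Both emulations at least moderate, and at least one of them good
    quality_ok = min(comparator, outcome) >= 1 and max(comparator, outcome) == 2
    # None of the four listed emulation differences may be present
    any_flag = (f.started_in_hospital or f.selective_run_in
                or f.mixed_randomisation_and_discontinuation or f.delayed_effect)
    return quality_ok and not any_flag

# Hypothetical study pair
example = EmulationFlags("good", "moderate", False, False, False, False)
is_close = closely_emulated(example)
```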

Statistical analysis

All statistical analyses required that the effect estimates from randomised controlled trials and real world evidence were approximately normally distributed. Hence log transformations were applied on hazard ratios. The standardised differences in the RCT-RWE study pairs were computed by dividing the difference in log hazard ratios by the standard error of the difference. The squared standardised difference is the Q statistic which was used to perform the Q test for heterogeneity between the randomised controlled trials and real world evidence studies.24 25 The sum of all computed Q statistics was used as an overall test for heterogeneity between the RCT-RWE study pairs included in RCT-DUPLICATE.
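As an illustration of this calculation, the sketch below computes the standardised difference and Q statistic for a study pair. It assumes that the difference is taken as randomised controlled trial minus real world evidence (in line with how the age and sex differences are defined), that the two estimates are independent so that the standard error of the difference is the square root of the sum of the squared standard errors, and that standard errors can be recovered from 95% confidence limits. All function names and numerical values are hypothetical.

```python
import math

def se_from_ci(lower, upper, z=1.959964):
    """Standard error of a log hazard ratio recovered from its 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def q_statistic(hr_rct, se_rct, hr_rwe, se_rwe):
    """Standardised difference in log hazard ratios and its square (Q)."""
    diff = math.log(hr_rct) - math.log(hr_rwe)      # assumed direction: RCT minus RWE
    se_diff = math.sqrt(se_rct ** 2 + se_rwe ** 2)  # assumes independent estimates
    z = diff / se_diff
    return z, z ** 2

def p_value_one_df(q):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2))

# Hypothetical pairs: (HR RCT, SE RCT, HR RWE, SE RWE); not RCT-DUPLICATE values
pairs = [(0.80, 0.10, 0.80, 0.12), (1.00, 0.30, math.exp(-0.5), 0.40)]
q_values = [q_statistic(*pair)[1] for pair in pairs]
# Overall statistic: under no heterogeneity it is referred to a chi-squared
# distribution with one degree of freedom per study pair
overall_q = sum(q_values)
```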

Heterogeneity can be quantified as a multiplicative parameter,26 which is an overdispersion parameter generally larger than one, inflating the model's standard errors. As described by Mawdsley et al,27 multiplicative heterogeneity is estimated by fitting a weighted linear regression on the observed differences from all RCT-RWE study pairs against a constant, with weights defined as the inverse of the squared standard error of the differences. Multiplicative heterogeneity is then simply this model's residual standard error, and absence of heterogeneity is achieved if the parameter is equal to 1. The heterogeneity parameter is set to its lower bound of 1 if estimated to be <1.
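A minimal sketch of this estimate for the intercept-only model is given below. It assumes that the model's standard error refers to the residual standard error of the weighted fit (the square root of the weighted residual sum of squares divided by the residual degrees of freedom); the function name and data are hypothetical.

```python
import math

def multiplicative_heterogeneity(diffs, ses):
    """Weighted regression of differences on a constant; returns (intercept, heterogeneity)."""
    weights = [1.0 / se ** 2 for se in ses]
    # Weighted least squares with only a constant is the weighted mean
    intercept = sum(w * d for w, d in zip(weights, diffs)) / sum(weights)
    weighted_rss = sum(w * (d - intercept) ** 2 for w, d in zip(weights, diffs))
    residual_se = math.sqrt(weighted_rss / (len(diffs) - 1))
    # Truncate at the lower bound of 1 (no heterogeneity)
    return intercept, max(1.0, residual_se)

# Hypothetical differences in log hazard ratio with a common standard error
diffs = [0.0, 0.2, -0.2, 0.4, -0.4]
ses = [0.1] * 5
intercept, phi = multiplicative_heterogeneity(diffs, ses)

# Differences that agree more closely than their standard errors imply hit the bound
_, phi_bounded = multiplicative_heterogeneity([0.10, 0.11, 0.09], [0.2, 0.2, 0.2])
```

In this example the differences scatter far more widely than a standard error of 0.1 would suggest, so the estimated parameter is well above 1, whereas the second set of differences returns the lower bound.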

Characteristics describing emulation differences are used to explain heterogeneity. With meta-regression methods (chapter 7 of Handbook of meta-analysis28), the characteristics of the emulation differences (ie, differences in age and sex distributions as well as each of the binary characteristics summarised in table 1 and the composite indicator) are added to the weighted linear regression models estimating multiplicative heterogeneity. If the extracted residual heterogeneity from the more complex, adjusted model is smaller than the heterogeneity measured with the simple model (with only a constant), part of the variation can be explained by the set of included emulation differences.
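The comparison between the simple and adjusted models can be sketched with a small weighted least squares routine. This is an illustration under stated assumptions rather than the implementation used for the analysis: the solver, function names, and data are hypothetical, and residual heterogeneity is again taken to be the residual standard error of the weighted fit, truncated at 1.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def meta_regression(x_rows, diffs, ses):
    """Weighted least squares fit; returns (coefficients, residual heterogeneity)."""
    w = [1.0 / se ** 2 for se in ses]
    p = len(x_rows[0])
    xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, x_rows)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * d for wi, r, d in zip(w, x_rows, diffs)) for j in range(p)]
    beta = solve(xtwx, xtwy)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in x_rows]
    rss = sum(wi * (d - f) ** 2 for wi, d, f in zip(w, diffs, fitted))
    return beta, max(1.0, math.sqrt(rss / (len(diffs) - p)))

# Hypothetical data: one binary emulation difference (0 = reference category)
flag = [0, 0, 0, 1, 1, 1]
diffs = [-0.1, 0.0, 0.1, 0.4, 0.5, 0.6]
ses = [0.05] * 6

_, phi_simple = meta_regression([[1.0] for _ in flag], diffs, ses)
beta, phi_adjusted = meta_regression([[1.0, float(f)] for f in flag], diffs, ses)
```

Here the binary characteristic accounts for the shift between the two groups of study pairs, so the residual heterogeneity of the adjusted model is smaller than that of the simple model, which is the pattern the text interprets as variation explained by the emulation difference.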

To reduce the complexity of the meta-regression, avoid overfitting, and choose only the most predictive of the p candidate characteristics, leave-one-out cross validated mean squared errors29 were computed for all 2^p possible candidate models. Many of the included characteristics were suspected to be dependent. The simplest model, with a mean squared error of at most one standard error from the smallest mean squared error across all models, was selected.30 The model coefficients for the included characteristics have to be interpreted with respect to the model's intercept, the difference in RCT-RWE effect estimates that remains when all binary characteristics of emulation differences and the centred continuous characteristics are set to their reference or zero, respectively. Online supplemental file, section B gives a detailed description of the statistical analyses. All analyses were performed in R version 4.3.2.31 Code and data to reproduce the analyses and recompile this manuscript are available from https://gitlab.com/heyardr/hte-in-rwe and from Heyard and Wang.32
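The selection procedure can be sketched as an exhaustive search over all subsets of candidate characteristics. Several details below are assumptions, because they are not specified here: the leave-one-out error is taken as the unweighted squared prediction error, the standard error is that of the mean squared error of the best model, and ties are broken by the smaller mean squared error. Names and data are hypothetical, and the sketch does not handle singular fits (eg, a characteristic present in only one study pair).

```python
import math
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def wls(x_rows, y, w):
    """Weighted least squares coefficients via the normal equations."""
    p = len(x_rows[0])
    xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, x_rows)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * yi for wi, r, yi in zip(w, x_rows, y)) for j in range(p)]
    return solve(xtwx, xtwy)

def loo_squared_errors(x_rows, y, w):
    """Squared prediction error for each study pair when it is left out of the fit."""
    errors = []
    for i in range(len(y)):
        keep = [j for j in range(len(y)) if j != i]
        beta = wls([x_rows[j] for j in keep], [y[j] for j in keep], [w[j] for j in keep])
        prediction = sum(b * x for b, x in zip(beta, x_rows[i]))
        errors.append((y[i] - prediction) ** 2)
    return errors

def select_model(covariates, y, ses):
    """Evaluate all 2^p subsets and apply the one standard error rule."""
    w = [1.0 / se ** 2 for se in ses]
    names = list(covariates)
    results = []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            rows = [[1.0] + [covariates[nm][i] for nm in subset] for i in range(len(y))]
            errs = loo_squared_errors(rows, y, w)
            mse = sum(errs) / len(errs)
            sd = math.sqrt(sum((e - mse) ** 2 for e in errs) / (len(errs) - 1))
            results.append((subset, mse, sd / math.sqrt(len(errs))))
    best = min(results, key=lambda r: r[1])
    eligible = [r for r in results if r[1] <= best[1] + best[2]]
    # Simplest eligible model; ties broken by smaller mean squared error
    return min(eligible, key=lambda r: (len(r[0]), r[1])), results

# Hypothetical data: characteristic "a" shifts the difference, "b" is noise
covariates = {"a": [0, 0, 0, 0, 1, 1, 1, 1], "b": [0, 1, 0, 1, 0, 1, 0, 1]}
diffs = [0.02, -0.03, 0.01, -0.01, 0.48, 0.52, 0.51, 0.49]
chosen, results = select_model(covariates, diffs, [0.05] * 8)
```

With two candidate characteristics the search covers four models; the number of fits grows as 2^p times the number of study pairs, which remains manageable for a small set of candidates.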

Patient and public involvement

As a reanalysis of publicly available data, no patients or members of the public were involved in the conception, development, analysis, interpretation, or reporting of the results of our study. There are no plans to disseminate the study findings to patient and public communities.

Results

Figure 1 shows the estimated hazard ratios from the randomised controlled trials against the hazard ratios estimated with the pooled real world evidence studies (with 95% confidence intervals). Estimates from perfectly emulated trials would scatter around the diagonal line. Although more than half of the pooled estimates from the real world evidence studies tended to be smaller than the estimates for the randomised controlled trials, many were also larger. This finding is different from the results seen in the large scale replication projects where the effect size estimated in the replication study was generally smaller than in the original study, which might be attributable to publication bias or other questionable research practices that are unlikely to be operating in this study.33 This phenomenon is referred to as shrinkage of effect size.34 Also, an overall test of heterogeneity suggested strong evidence of variation between all study pairs in RCT-DUPLICATE (online supplemental figure C.1).

Figure 1

Hazard ratios (95% confidence intervals) estimated in randomised controlled trials and real world evidence studies (pooled for all data sources). Diagonal line represents perfect emulation; all trials with points on the right side of the diagonal have an effect size estimated in the randomised controlled trial (RCT) that is larger than the effect size estimated in the pooled real world evidence study (RWE)

To better understand the variability in results in RCT-DUPLICATE, variation was quantified and its sources were investigated. Figure 2 represents the differences in log hazard ratio for each study pair depending on whether the study was closely emulated or not closely emulated. Trials categorised as not closely emulated based on the indicator tended towards positive differences. The average difference in log hazard ratio over all included trials was estimated to be slightly negative (−0.015, 95% confidence interval −0.084 to 0.054), suggesting that, on average, the hazard ratio estimated with the real world data was larger than in the randomised controlled trial.

Table 2 shows the estimated multiplicative heterogeneity comparing the pooled real world evidence studies with the randomised controlled trials, together with the model intercept and coefficient values (with 95% confidence intervals). The simple model refers to the weighted regression with only a constant whereas the second model is a meta-regression adjusted for the binary characteristic, close emulation. Including close emulation in the weighted linear regression model reduced heterogeneity from 1.905 to 1.725, indicating that part of the observed variation between estimates in RCT-RWE study pairs can be attributed to the composite covariate. Although the intercept of the simple model was close to zero, the intercept of the adjusted model (difference in log hazard ratio for trials that were not closely emulated) tended to be positive. Closely emulated trials had, on average, slightly negative differences (figure 3).

Figure 2

Difference in effect estimates (log hazard ratio with 95% confidence interval) between the randomised controlled trials and pooled real world evidence studies, depending on whether the study was closely emulated or not closely emulated. Horizontal line represents no difference between real world evidence and randomised controlled trial estimates (supplementary table E.2 shows more information on each study)

Table 2 | Model intercept and coefficient values (with 95% confidence intervals), and heterogeneity between real world evidence studies and randomised controlled trials, depending on model used. Heterogeneity close to 1 represents homogeneous effect size differences between study pairs

Figure 3

Bubble plots showing associations of the difference in log hazard ratio between the randomised controlled trial and pooled real world evidence with (top graph) whether study pairs were closely emulated (yes or no), and with (bottom graph) possible combinations of three binary characteristics (treatment started in hospital, discontinuation of maintenance treatment without washout, and delayed onset of effects of drugs). The larger the bubble, the more precise the estimate or trial. Horizontal jitter has been applied on the bubbles to enhance visibility. 95% prediction intervals are compiled by the meta-regression, including the binary composite covariate

We explored the use of a set of explanatory characteristics instead of the composite covariate, close emulation. Table 3 shows the univariate coefficients, respective model intercept, and residual heterogeneity. Some of the characteristics reduced heterogeneity more than others; for example, adding the characteristic, discontinuation of maintenance treatment without washout, gave the largest decrease in heterogeneity, from 1.905 to 1.260. The intercept in table 3 can be interpreted as the difference in log hazard ratio for the respective reference category of the binary characteristics or no difference in the distribution for the two continuous characteristics, age and percentage of women. Then all possible candidate models (2^10=1024), depending on which of the 10 characteristics are included, were fitted and the models' leave-one-out mean squared errors were computed. The final model was the simplest model with leave-one-out mean squared errors smaller than the minimum mean squared errors plus one standard error (online supplemental figure D.2). With this tuning parameter, three characteristics would be included. Table 4 shows the coefficient estimates of the models with the best performance for each number of included characteristics. The models summarised in table 4 resulted in the model performance and heterogeneity illustrated in online supplemental figure D.2.

Table 3 | Univariate coefficients (with 95% confidence intervals) for each candidate characteristic, ordered by increasing heterogeneity. For each row (each characteristic) a separate model was fitted, resulting in separate intercept and residual heterogeneity. The closer residual heterogeneity is to 1, the more the characteristic explains part of the variations. Residual heterogeneity and R2 values were added to further explain the proportion of variation for each covariate

Table 4 | Model selection. Coefficient estimates for the best model with respect to leave-one-out mean squared errors for each number of characteristics included

The best model with three design emulation differences includes delayed onset of effect of drugs, discontinuation of maintenance treatment without washout, and treatment started in hospital. This model's residual heterogeneity was 1.003. Figure 3 shows the association between the combination of these finally selected characteristics and outcome (difference in log hazard ratio). Only the prediction intervals for the combinations with observations are displayed; for example, none of the trials in RCT-DUPLICATE had more than one of the three emulation differences set to yes. The three included characteristics were mutually exclusive and together were better at reducing observed heterogeneity than close emulation. Hence the remaining characteristics only added noise to the indicator for close emulation, or cancelled each other out.

Discussion

Principal findings

Based on data from the RCT-DUPLICATE initiative, comparing results from RCT-RWE study pairs, we found that three study emulation characteristics (delayed effect of treatment, discontinuation of treatment during run-in period, and treatment started in hospital) explained most of the observed variation in effect estimates beyond chance in this sample. The results suggest that, on average, the hazard ratios estimated with real world data tended to be slightly larger than the hazard ratios estimated in the randomised controlled trials.

Surprisingly little variation was explained by placebo comparator, which was thought to be an emulation challenge, in the absence of placebo in clinical practice, and a source of confounding bias. This result might have been influenced by the quality of the placebo proxy that was used in emulation of placebo controlled trials for RCT-DUPLICATE. Although all of the included studies focused on a hazard ratio for the primary result, the proposed analysis can be applied to studies investigating other outcome measures (eg, risk ratios or risk differences). The meta-regression analyses, however, required that the estimates for all studies were on the same scale. Appropriate transformations could be applied to include studies whose primary analyses used a different scale.

Randomised controlled trials are seen as the standard in establishing the efficacy of medical products, but these studies might not be free of flaws in their implementation and might not always represent clinical practice. The results of multiple clinical trials that look at similar questions, even identical twin trials, can vary in their findings.35–39 Discordant results between randomised controlled trials and real world evidence studies that investigate similar use of drugs and outcomes should not necessarily discredit the real world evidence study before considering emulation differences that might result in assessing a slightly different causal question. Therefore, the emphasis should be on understanding where these differences come from, and the clinical or research question that is being asked by each study type.

Limitations of this study

Our study had some limitations. We have presented the results of an exploratory analysis with a limited sample size from 29 RCT-RWE study pairs, non-randomly selected from the RCT-DUPLICATE initiative. Therefore, we could only include a limited number of explanatory emulation characteristics in our models. Other emulation differences could further reduce residual heterogeneity. A follow-up study designed for purpose could derive and investigate other emulation characteristics that might be informative in the meta-regression.

The trials included in RCT-DUPLICATE were selected as having a high probability of being feasible to emulate with insurance claims data. Therefore, our results provide an understanding of how concordance in results between randomised controlled trials and database studies is influenced by concordance in design, but the specific coefficients should not be interpreted as generalisable because of the highly selected sample of trials. Also, the design emulation and population differences recorded in this study might not be a comprehensive list of all of the important emulation challenges that could be considered. Different emulation differences might be more or less relevant for different clinical areas, and the direction of the effect of these differences is context dependent, limiting the generalisability of our empirical findings. Furthermore, the emulations were conducted with insurance claims data. Emulated randomised controlled trials from registry data or data from electronic health records might have other design emulation and population differences (eg, challenges to defining observable time when data from fragmented healthcare systems are used).

Conclusion

Overall, our study showed that a substantial proportion of heterogeneity between the results of randomised controlled trials and real world evidence studies can be attributed to differences in design emulation. Furthermore, our study showed how meta-regression can be used to define a more nuanced understanding of emulation differences.

Ethics approval

Not applicable.