Original articles
Combined evidence from multiple outcomes in a clinical trial

https://doi.org/10.1016/S0895-4356(00)00238-9

Abstract

Clinical investigators are encouraged to apply recently developed statistical methodology. For each patient in a trial, favorable and unfavorable results on multiple outcomes may be combined in a single summary measure. This summary measure may be used in a two-sample t-test to decide which treatment is better. An example illustrates how the evidence from the main outcome criteria may be combined. The required study size depends on the mean treatment effect on the outcomes in the summary measure. When separate outcomes are considered, there is a multiple comparisons problem, for which Hochberg offered a simple solution. Evaluation of a single summary measure may require a smaller or a larger study size than evaluation of separate outcomes, depending on whether the treatment effects on the outcomes are about the same or very different.

Introduction

If a single outcome measure is clearly more important than all other outcome measures, then this most important measure may decide which treatment should be chosen for future patients. In some clinical trials, however, treatment groups are compared with respect to several outcome measures that are about equally important to patients. In that case the main outcome measures may be replaced by a single summary measure that totals favorable and unfavorable results. Although this article focuses on quantitative outcome measures, dichotomous outcomes are also considered in this introduction to clarify the line of thought.

A zero–one variable, also called a dummy or indicator variable, may be used to indicate the absence (0) or presence (1) of a certain outcome. If there are two or more important dichotomous outcomes, these outcomes may be combined in several ways. A new dichotomy may indicate whether or not at least one of the outcomes is present in the patient under consideration, or a sum score may be used, or some ordinal scale may be adequate. This is explained in the next paragraphs.

As an example, consider a randomized clinical trial that compares two methods of treatment of the appendix stump after appendectomy. If the possible outcomes “wound infection” and “postoperative ileus” are considered equally important, the two outcome measures “infection” and “ileus” may be replaced by the single outcome measure “infection or ileus,” defined as the presence of infection or ileus or both, that is, the presence of at least one of the unfavorable outcomes. The combined outcome measure “infection or ileus” should indicate which treatment is better. If one treatment, compared with the other, reduces the risk of infection from 8% to 4% and the risk of ileus from 2% to 1%, then the combined risk is reduced from about 10% to about 5%. In this case, the combined outcome has better power than each of the separate outcomes; in other words, the combined risk requires a slightly smaller study size. However, if the risk of ileus is 2% for both treatments, the combined risk is reduced from about 10% to about 6%, which gives less power than the risk of infection alone; in other words, the combined risk requires a slightly larger study size. If it is known beforehand that both treatments have the same risk of ileus, ileus should not be considered an important outcome and should not be incorporated into a combined outcome measure, since it would not help to choose between treatments. Of course, the combined outcome should be defined in the research protocol, before any data are gathered.
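The arithmetic in this example can be checked with a minimal sketch. It assumes that infection and ileus occur independently (the article does not state this) and uses the usual normal-approximation sample size formula for two proportions at two-sided α = .05 and 80% power; both choices, and the function names, are illustrative assumptions rather than the article's own calculation.

```python
import math
from statistics import NormalDist


def combined_risk(*risks):
    """Risk of at least one event, ASSUMING the events occur independently."""
    prob_none = 1.0
    for r in risks:
        prob_none *= 1.0 - r
    return 1.0 - prob_none


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate group size for comparing two proportions
    (normal approximation, two-sided alpha; an assumed textbook formula)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance_sum / (p1 - p2) ** 2)


# Infection alone: 8% versus 4%.
infection_only = n_per_group(0.08, 0.04)
# Both risks halved: "infection or ileus" falls from about 10% to about 5%.
both_halved = n_per_group(combined_risk(0.08, 0.02), combined_risk(0.04, 0.01))
# Ileus risk 2% in both groups: combined risk falls from about 10% to about 6%.
ileus_unchanged = n_per_group(combined_risk(0.08, 0.02), combined_risk(0.04, 0.02))

print(infection_only, both_halved, ileus_unchanged)
```

Under these assumptions the ordering of the three group sizes matches the text: the combined outcome needs fewer patients than infection alone when both risks are halved, and more when the ileus risk is the same in both groups.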

As another example, consider a randomized clinical trial in which patients with a high risk of stroke are treated with either aspirin or placebo. It is expected that aspirin substantially reduces the risk of a major cerebral infarction, but it is also expected that aspirin slightly increases the risk of a major cerebral bleeding. The combined risk of infarction or bleeding has lower statistical power than the risk of infarction alone. But it is more honest to compare treatments regarding the combined risk, that is, infarction or bleeding, thus simultaneously taking account of advantages and disadvantages in a single statistical test (logrank or chi-square).

A sum score may be better if important outcomes are present in many patients. Absence or presence may be coded as zero or one. Such zero–one variables may be totalled to create a sum score that represents the number of important outcomes (events) in a patient, within a certain time period; a sum score may be used in a nonparametric statistical test.
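As a small illustration of a sum score, the sketch below totals zero–one indicators per patient and computes the Mann–Whitney U statistic as one possible nonparametric comparison. The choice of test, the function names, and the data are assumptions made for illustration; the article does not prescribe a particular nonparametric test.

```python
def sum_scores(indicators):
    """indicators: one list of 0/1 values per patient.
    Returns the number of events observed in each patient."""
    return [sum(row) for row in indicators]


def mann_whitney_u(x, y):
    """U statistic for x versus y: each pair with x > y counts 1,
    each tie counts 0.5, each pair with x < y counts 0."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Invented data: three patients per group, two possible events per patient.
group_a = sum_scores([[0, 0], [1, 0], [1, 1]])
group_b = sum_scores([[0, 1], [1, 1], [1, 1]])

u = mann_whitney_u(group_a, group_b)
# u / (n_a * n_b) estimates the chance that a patient in group A has more
# events than a patient in group B, with ties counted as one half.
print(group_a, group_b, u / (len(group_a) * len(group_b)))
```

The sketch stops at the statistic; a P-value would come from tables or from a statistics package.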

If a treatment has serious side effects, the following ordered outcome scores may be used to evaluate each patient in a global way.

  • Score 1: Serious side effects and no beneficial effect, or side effects are much more important than beneficial effects.

  • Score 2: Side effects are (slightly) more important than beneficial effects.

  • Score 3: Balance. Side effects and beneficial effects are (about) equally important, or both are absent.

  • Score 4: Beneficial effects are (slightly) more important than side effects.

  • Score 5: Great beneficial effect and no side effect, or beneficial effects are much more important than side effects.

Of course, in a particular trial this ordinal scale may be adapted and clarified according to the research question in that trial. The combined outcome measure should be defined in the research protocol, in a way that best reflects what is important for patients admitted to the trial. Moreover, it must be decided whether the outcome is best assessed by the patient or the physician or both.

It may be unavoidable that some patients withdraw before the end of the planned follow-up period. Withdrawals for reasons that are certainly not related to the outcome or to the treatment (moving away, not randomized, incorrectly admitted to the trial) can be excluded from the statistical analyses, since their exclusion would not bias the treatment comparison [1, 2, 3]. Withdrawals for reasons that may be related to outcome or to treatment, however, should be included in the statistical analysis. Patients who withdraw because of serious side effects (or unpleasant trial procedures, or death) or lack of any beneficial therapeutic effect demonstrate the worst possible outcome [1]: Score 1 in the previous paragraph, or even Score 0 as an extension of the scoring system. Patients who withdraw because of early recovery demonstrate the best possible outcome [1]: Score 5 in the previous paragraph, or even Score 6 as another extension of the scoring system. In some trials, however, the scoring system may be reduced to a dichotomy that just distinguishes treatment successes and treatment failures [3]. If an investigator insists on comparing group means of a quantitative outcome measure in study completers, this investigator should also compare group proportions of withdrawals due to unfavorable reasons [2].
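The handling of withdrawals described above amounts to a simple lookup, sketched below. The reason labels and the function name are hypothetical; in a real trial the categories would be fixed in the research protocol.

```python
WORST, BEST = 1, 5  # the text also allows 0 and 6 as extensions of the scale

# Hypothetical reason labels, grouped as in the paragraph above.
UNRELATED = {"moved away", "not randomized", "incorrectly admitted"}
UNFAVORABLE = {"serious side effects", "unpleasant trial procedures",
               "death", "no beneficial effect"}
FAVORABLE = {"early recovery"}


def withdrawal_score(reason):
    """Return the outcome score assigned to a withdrawn patient,
    or None if the patient may be excluded from the analysis."""
    if reason in UNRELATED:
        return None      # exclusion would not bias the treatment comparison
    if reason in UNFAVORABLE:
        return WORST     # worst possible outcome
    if reason in FAVORABLE:
        return BEST      # best possible outcome
    # Unlisted reasons must be classified explicitly, not guessed.
    raise ValueError(f"unclassified withdrawal reason: {reason!r}")


print(withdrawal_score("death"), withdrawal_score("early recovery"),
      withdrawal_score("moved away"))
```

Raising an error for unlisted reasons is a design choice of this sketch: it forces every withdrawal to be classified deliberately instead of being dropped silently.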

The remainder of this article considers only quantitative outcome measures. These quantitative measures may have different standard deviations and, therefore, should not be totalled in a straightforward manner. Moreover, there may be clusters of similar outcomes that are highly correlated, which should be taken into account.

A simple procedure consists of the following four steps: (1) the primary outcome measures are stated in the research protocol; (2) similar measures are replaced by their average; (3) each outcome measure is properly standardized to take account of different standard deviations; and (4) a mean summary measure is computed for each patient in the trial and then used in a two-sample t-test. An example demonstrates the computational simplicity of the procedure.
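Steps 2 to 4 can be sketched in a few lines. The sketch assumes that every outcome is oriented so that higher values are better, that similar measures in a cluster share a scale, and that "properly standardized" means dividing by the pooled within-group standard deviation; the last point is an assumption, since the standardization actually used is given only in the full text. The function names and the data are invented for illustration.

```python
import math
from statistics import mean, variance


def pooled_sd(x1, x2):
    """Pooled within-group standard deviation of two samples."""
    n1, n2 = len(x1), len(x2)
    ss = (n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)
    return math.sqrt(ss / (n1 + n2 - 2))


def average_cluster(columns):
    """Step 2: replace a cluster of similar outcomes by the per-patient average."""
    return [mean(values) for values in zip(*columns)]


def summary_scores(group1, group2):
    """Steps 3 and 4: standardize each outcome, then average per patient.
    group1, group2: one list per outcome, one entry per patient."""
    z1, z2 = [], []
    for x1, x2 in zip(group1, group2):
        centre, sd = mean(x1 + x2), pooled_sd(x1, x2)
        z1.append([(v - centre) / sd for v in x1])
        z2.append([(v - centre) / sd for v in x2])
    return [mean(p) for p in zip(*z1)], [mean(p) for p in zip(*z2)]


def two_sample_t(a, b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    se = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (mean(a) - mean(b)) / se, len(a) + len(b) - 2


# Invented data: two outcomes on very different scales, four patients per group.
active = [[3, 4, 5, 6], [30, 40, 50, 60]]
placebo = [[1, 2, 3, 4], [10, 20, 30, 40]]
s_active, s_placebo = summary_scores(active, placebo)
t, df = two_sample_t(s_active, s_placebo)
print(round(t, 3), df)
```

Because each outcome is divided by its own standard deviation before averaging, the second outcome does not dominate the summary measure despite its tenfold larger scale.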

The summary measure combines all the evidence and it has great power if the separate outcome measures show about the same treatment effect. But the summary measure may have poor power if some outcome measures show a much smaller treatment effect than other outcome measures. If the outcome measures point in different directions, the overall conclusion may be that the investigated treatments are about equivalent. A sample size formula is presented that takes account of the mean treatment effect on the outcomes in the summary measure, the number of outcomes, and their mean correlation. If conclusions are drawn regarding separate outcomes, there is a multiple comparisons problem, and P-values may be adjusted according to Bonferroni or Hochberg.
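Two points in this paragraph can be illustrated with a short sketch. The first function is a normal-approximation group size for a summary measure defined as the mean of k standardized outcomes with mean pairwise correlation ρ, whose variance is (1 + (k − 1)ρ)/k; it is an assumed textbook form, not necessarily the formula presented in the article. The other two functions compute Bonferroni and Hochberg adjusted P-values for separate outcomes. All function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def n_per_group_summary(mean_delta, k, mean_rho, alpha=0.05, power=0.80):
    """Approximate group size for a t-test on the mean of k standardized
    outcomes (ASSUMED normal-approximation formula, two-sided alpha)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_summary = (1 + (k - 1) * mean_rho) / k  # variance of the summary measure
    return math.ceil(2 * z ** 2 * var_summary / mean_delta ** 2)


def bonferroni_adjust(p_values):
    """Bonferroni: multiply each P-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def hochberg_adjust(p_values):
    """Hochberg step-up: the largest P-value is multiplied by 1, the next
    largest by 2, and so on, keeping a running minimum."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, i in enumerate(order):  # rank 0 is the largest P-value
        running_min = min(running_min, (rank + 1) * p_values[i])
        adjusted[i] = running_min
    return adjusted


print(n_per_group_summary(0.5, 1, 0.0), n_per_group_summary(0.5, 4, 0.5))
print(bonferroni_adjust([0.01, 0.04, 0.03]), hochberg_adjust([0.01, 0.04, 0.03]))
```

With the same mean effect, adding outcomes lowers the required group size, and more so when the outcomes are weakly correlated. If added outcomes have smaller effects, the mean effect drops and the gain can turn into a loss, which is the trade-off described above. The Hochberg adjusted P-values are never larger than the Bonferroni ones.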

Section snippets

Example

Seventy-two patients with acute lateral distortions of the ankle (sprained ankle) were treated with an ointment [4]. In a randomized study, n1 = 36 patients received the active treatment and another n2 = 36 patients were treated with a placebo. The summary measure procedure consists of four steps. The present section contains the first two steps that are basic and very generally applicable. The next section contains the third and fourth step.

Summary measure procedure

In step 2, each cluster of highly correlated outcomes is replaced by a new outcome measure that represents the cluster; step 2 may be omitted if there are no such clusters. In steps 3 and 4 the new outcome measures are combined into a global summary measure that is used in a two-sample t-test.

Smallest clinically relevant treatment effect

It is assumed that there is no bias in a well-designed experiment (randomization, blinding, analysis by intention to treat, and so on). Regarding a certain outcome measure, μ1 and μ2 denote the expected means in the treatment groups, and μ1−μ2 denotes the true difference in effectiveness between the treatments; for Pain this may be 5 points on the visual analogue scale, which is about half the standard deviation. The standardized difference δ = (μ1−μ2)/σ is the expected difference in treatment

Discussion

The ultimate goal of a clinical trial is to decide which treatment should be chosen for future patients. It may be hard to make a scientifically valid choice if there are many important outcome measures that may create a diffuse picture. The research protocol should describe a clearly structured statistical analysis, with probability .05 (or less) of a false significant result and sufficient power under sensible hypothesized treatment effects. The recommended first step is that the research

References (17)

  • A.L. Ries et al. Use of factor analysis to consolidate multiple outcome measures in chronic obstructive pulmonary disease. J Clin Epidemiol (1991)

  • Y. Hochberg et al. Extensions of multiple testing procedures based on Simes' test. J Stat Plan Inference (1995)

  • A.L. Gould. A new approach to the analysis of clinical drug trials with withdrawals. Biometrics (1980)

  • W.J. Shih et al. Testing for treatment differences with dropouts present in clinical trials—a composite approach. Stat Med (1997)

  • S.J. Pocock. Clinical trials: a practical approach (1983)

  • W. Lehmacher et al. Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics (1991)

  • J. Läuter. Exact t and F-tests for analyzing studies with multiple endpoints. Biometrics (1996)

  • H.J.A. Schouten. Planning group sizes in clinical trials with a continuous outcome and repeated measures. Stat Med (1999)

There are more references available in the full text version of this article.
