Randomized controlled trials (RCTs) are considered the gold standard approach for estimating treatment effects. However, not all clinical research involves randomization of subjects into treatment and control groups. These studies are commonly referred to as non-experimental or observational studies. Some examples of non-experimental (observational) studies include:

In observational studies, the treatment selection is influenced by the characteristics of subjects. Therefore, any differences between groups are not randomized out, and the baseline characteristics of the subjects could differ between treatment and control groups. In essence, the baseline characteristics systematically differ and we must find a way to account for these systematic baseline differences.

Researchers often use regression adjustments to account for baseline differences between groups. A regression adjustment uses one or more (usually more) variables obtained at baseline as predictors, with the treatment outcome as the dependent variable. Using a regression adjustment to account for baseline differences in observational studies has the following issues:

Propensity scores, and matching subjects from each of the study groups using propensity scores, are constructed without taking the treatment outcome into consideration. The use of propensity scores keeps the researcher’s attention on baseline characteristics only. However, once the subjects are scored and matched (defined as balanced), a regression model can be analyzed to further adjust for any residual imbalance between the groups. So regression models still have their use! But they are used after the propensity scoring and matching.
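One common way to check whether matching actually produced balanced groups is the standardized mean difference (SMD) of each baseline covariate. The snippet below is a minimal sketch using only the Python standard library. The function name, the ages, and the 0.1 cutoff are my own illustrative assumptions (0.1 is a frequently quoted rule of thumb, not a hard rule).

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated_values, control_values):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated_values) + variance(control_values)) / 2)
    return (mean(treated_values) - mean(control_values)) / pooled_sd

# Hypothetical ages, made up for illustration only.
age_treated_before = [52, 61, 58, 67, 49]
age_control_before = [34, 41, 38, 45, 29]

# Hypothetical ages of the subjects that remain after matching.
age_treated_after = [52, 49, 58]
age_control_after = [51, 50, 57]

smd_before = standardized_mean_difference(age_treated_before, age_control_before)
smd_after = standardized_mean_difference(age_treated_after, age_control_after)

print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
print("Balanced on age (|SMD| < 0.1, assumed cutoff):", abs(smd_after) < 0.1)
```

You would repeat this for every baseline covariate. Anything still imbalanced after matching is a candidate for the follow-up regression adjustment.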

Propensity scores have the added benefit of allowing a researcher to see the actual amount of overlap, or lack of it, between treatment groups. After propensity scores are assigned to each individual in each group, the researcher uses the scores to “match” subjects in the treatment group with subjects in the control group. The easiest way to match is one-to-one: one treatment subject to one control subject. More advanced matching techniques can match one-to-many.

After matching, the researcher can see if there are many unmatched individuals left over (indicative of large differences in baseline characteristics between the groups). A large number of unmatched individuals might not just indicate that the treatment and control groups differed at baseline, but that the differences might be too large to draw any meaningful conclusion about the effect of the intervention. After all, the reason for propensity score matching is to derive groups with comparable baseline characteristics. If the subjects can’t be matched, the groups were just not similar, and treatment efficacy cannot be established from these data.
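A quick way to gauge overlap before (or alongside) matching is to look at the range of propensity scores the two groups share, often called the region of common support. Here is a minimal sketch. The function names and the scores are hypothetical, and a real analysis would usually plot the two score distributions as well.

```python
def common_support(scores_treated, scores_control):
    """Return the (low, high) range of scores that both groups cover."""
    low = max(min(scores_treated), min(scores_control))
    high = min(max(scores_treated), max(scores_control))
    return low, high

def share_outside(scores, low, high):
    """Fraction of subjects whose score falls outside the shared range."""
    outside = [s for s in scores if s < low or s > high]
    return len(outside) / len(scores)

# Hypothetical propensity scores, made up for illustration only.
scores_treated = [0.35, 0.50, 0.62, 0.80, 0.91]
scores_control = [0.08, 0.15, 0.33, 0.48, 0.60]

low, high = common_support(scores_treated, scores_control)
treated_outside = share_outside(scores_treated, low, high)
control_outside = share_outside(scores_control, low, high)

print(f"Common support: {low} to {high}")
print(f"Treated outside: {treated_outside:.0%}, control outside: {control_outside:.0%}")
```

If a large share of either group sits outside the shared range, that is an early warning that the groups may be too different to compare.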

Here is a very simple example of the use of propensity scores and matching for a non-experimental study:

A study is performed to assess the treatment effects of two analgesic drugs given to patients presenting to an emergency room with severe cluster headaches. The type of drug given, A or B, is decided based on various factors such as the time span of the current headache, frequency of headaches, age of the patient, and various comorbidities. Thus, the patients were not randomized into the two drug treatments.

This non-random assignment opens the door to “selection bias”: the clinician selected the treatment to give to each patient. If more than one clinician was involved in the treatment decisions, we should control for this also! Perhaps some clinicians prefer one treatment over the other.

The outcome is time to pain management. Looking at the data without any adjustment, Drug A appears to relieve pain significantly faster than Drug B. But maybe this is not the case. Something else could be at work here, or maybe there isn’t a difference between the drugs at all. Or maybe the patients in Group A (patients who took Drug A) are much too different from the patients in Group B (patients who took Drug B) to make any assessment of efficacy.

So, we will develop a propensity score for each patient based on the covariates we believe (or better, know from our knowledge and the literature) influence which drug a patient receives. The propensity score, let’s call it Z, is the estimated probability that a person gets Drug A given those covariates.
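Propensity scores are very often estimated with logistic regression, with treatment received (not the outcome) as the dependent variable. The sketch below fits such a model by plain gradient descent using only the Python standard library. The covariates (age in decades, headaches per week), the patient data, and the function names are all assumptions for illustration. In practice you would use a statistics package rather than hand-rolled fitting.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity_model(covariates, got_drug_a, learning_rate=0.1, epochs=5000):
    """Fit a logistic regression by gradient descent.

    Returns the weights as [intercept, coefficient_1, coefficient_2, ...].
    """
    n, k = len(covariates), len(covariates[0])
    weights = [0.0] * (k + 1)
    for _ in range(epochs):
        gradient = [0.0] * (k + 1)
        for x, t in zip(covariates, got_drug_a):
            z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
            error = _sigmoid(z) - t
            gradient[0] += error
            for j, xj in enumerate(x):
                gradient[j + 1] += error * xj
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights

def propensity_score(weights, x):
    """Estimated probability Z of receiving Drug A given covariates x."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return _sigmoid(z)

# Hypothetical patients: (age in decades, headaches per week).
covariates = [(2.5, 1), (3.0, 2), (3.5, 1), (4.0, 3),
              (4.5, 2), (5.0, 4), (5.5, 3), (6.0, 5)]
# 1 = received Drug A, 0 = received Drug B (made-up assignments).
got_drug_a = [0, 0, 1, 0, 1, 1, 0, 1]

weights = fit_propensity_model(covariates, got_drug_a)
scores = [propensity_score(weights, x) for x in covariates]

for x, t, z in zip(covariates, got_drug_a, scores):
    print(f"covariates={x} drug_a={t} Z={z:.2f}")
```

Note that the time to pain management never enters the model. Only baseline characteristics and the drug actually given are used.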

We then group people with similar propensity scores between the two groups of patients, such that patients with, say, Z = .30 in Group A are matched with patients with Z = .30 in Group B.
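Here is a minimal sketch of one simple way to do this: greedy one-to-one nearest-neighbor matching without replacement, with a caliper (a maximum allowed difference in Z). The patient IDs, the scores, and the caliper of 0.05 are hypothetical choices of mine, and this is only one of many matching algorithms.

```python
def match_one_to_one(scores_a, scores_b, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    scores_a and scores_b map patient IDs to propensity scores.
    Returns (pairs, unmatched_a, unmatched_b).
    """
    available_b = dict(scores_b)  # Group B patients not yet matched
    pairs, unmatched_a = [], []
    # Process Group A patients in order of increasing score.
    for a_id, a_score in sorted(scores_a.items(), key=lambda item: item[1]):
        if not available_b:
            unmatched_a.append(a_id)
            continue
        # Closest remaining Group B patient by absolute score difference.
        b_id = min(available_b, key=lambda b: abs(available_b[b] - a_score))
        if abs(available_b[b_id] - a_score) <= caliper:
            pairs.append((a_id, b_id))
            del available_b[b_id]  # without replacement
        else:
            unmatched_a.append(a_id)
    return pairs, unmatched_a, sorted(available_b)

# Hypothetical propensity scores, made up for illustration only.
group_a = {"A1": 0.30, "A2": 0.55, "A3": 0.90}
group_b = {"B1": 0.31, "B2": 0.52, "B3": 0.10}

pairs, unmatched_a, unmatched_b = match_one_to_one(group_a, group_b)
print("Matched pairs:", pairs)
print("Unmatched in Group A:", unmatched_a)
print("Unmatched in Group B:", unmatched_b)
```

In this toy data, A3 and B3 have no counterpart within the caliper, so they drop out of the matched sample. Many leftovers like these would be a sign that the groups overlap poorly.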

Nothing is as good as the RCT, but I hope I’ve opened up your thinking a bit to the use of propensity scores in observational studies. There are many ways to score and match and analyze observational study groups. A good reference to start with is the article by Austin (2011) listed below. Rosenbaum and Rubin (1983) wrote the seminal work on propensity scores, but even I think it is a bit too theory heavy. But it is also listed in the references below if you are so inclined.

Not everyone likes propensity scores for matching cases. The article by King and Nielsen (2016, also referenced below) presents some limitations of propensity score matching and some remedies for when many individual cases remain unmatched after the matching attempt.

Stata has the teffects command for estimating treatment effects, and its psmatch subcommand (teffects psmatch) performs propensity score matching. You can also run post-estimation commands after fitting the model.

For R fans, here is a nice tutorial on propensity score matching. Not as nice as the Stata code, but hey, it’s free! http://pareonline.net/pdf/v19n18.pdf

**References**

Austin, Peter C. (2011). An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research, 46, 399-424. doi:10.1080/00273171.2011.568786

Rosenbaum, Paul R., & Rubin, Donald B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70, 41-55. doi:10.2307/2335942

*Not everyone likes propensity scores for matching:*

King, Gary, & Nielsen, Richard. (2016). Why Propensity Scores Should Not Be Used for Matching. Working paper. Copy at http://j.mp/2ovYGsW