Coding the groups of a nominal (categorical) independent variable is not a problem in analyses such as t-tests or ANOVA. However, things can fall apart fairly quickly if nominal variables are not coded properly for use as predictors in linear models such as multiple regression.

Let’s take a simple example of a variable for Gender, which we plan on using as an independent variable. In a t-test comparing males and females, the data are classified into independent groups and could be coded as 1 = female and 2 = male. We would just be comparing the means of two separate and independent groups, so we could label the groups anything we want. In a regression model, however, the codes are treated as numbers: the model reads males, coded as 2, as one unit higher than females, coded as 1, as if gender were a ranked variable. With only two groups the slope still equals the difference between the group means, but the intercept then describes a nonexistent group coded as 0. With three or more groups, the arbitrary codes impose an order and an equal spacing on the groups that do not exist. The regression model would return results that may look just fine…but the findings can be invalid.

Ordinal categorical variables, such as education level (high school, some college, college graduate) or class standing (1st grade, 2nd grade, 3rd grade) can be coded 1, 2, 3, and used in a regression model. But truly nominal variables, such as gender (male vs. female), type of teacher (History, Mathematics, English) or race/ethnicity (White, Black, Hispanic) must be coded in a way that represents group membership of each participant in the study…but keeps the groups on equal footing. There are quite a few ways to achieve this, but one of the simplest is dummy coding.
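To see what goes wrong when a nominal variable with three groups is coded 1, 2, 3, here is a minimal sketch in plain Python. The scores, the code assignments, and the helper name `simple_ols` are all invented for illustration; the point is only that a single numeric predictor forces the three fitted values onto a straight line, which need not match the actual group means.

```python
# Hypothetical test scores by type of teacher; every number here is invented.
scores = {
    "History": [70, 72, 74],
    "Mathematics": [88, 90, 92],
    "English": [76, 78, 80],
}
# Arbitrary numeric codes, as in a naive 1/2/3 coding of a nominal variable.
codes = {"History": 1, "Mathematics": 2, "English": 3}


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# One row per participant: the numeric code as predictor, the score as outcome.
x = [codes[group] for group, values in scores.items() for _ in values]
y = [value for values in scores.values() for value in values]

intercept, slope = simple_ols(x, y)

# Compare what the line predicts for each group with that group's actual mean.
group_means = {group: sum(v) / len(v) for group, v in scores.items()}
fitted = {group: intercept + slope * codes[group] for group in scores}

for group in scores:
    print(group, "actual mean:", group_means[group], "fitted:", fitted[group])
```

In this made-up data the Mathematics group has the highest mean, yet the fitted line simply climbs from History to English because that is the order of the codes. Swapping the codes around would change the slope, even though nothing about the participants changed.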

Dummy coding involves using binary data, 0’s and 1’s, to indicate which group of an independent variable each participant belongs to. One part that confuses many researchers is that you need one fewer dummy variable than you have groups. For instance, gender has two groups, but we only need to code gender onto one variable to know who is who. Instead of 1 = female, 2 = male, we could dummy code on 1 variable with female = 0 and male = 1. That one variable, coded in 0’s and 1’s, will tell us the gender of each individual in the study. Easy peasy!
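As a rough illustration of what the 0/1 coding buys you in a regression, the sketch below dummy codes gender and fits a one-predictor least-squares line by hand. The data values and variable names are hypothetical. With female = 0 and male = 1, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means.

```python
# Hypothetical data: (gender, outcome score). All values are invented.
data = [
    ("female", 10), ("female", 12), ("female", 14),
    ("male", 15), ("male", 17), ("male", 19),
]

# Dummy code gender on a single variable: female = 0, male = 1.
male = [1 if gender == "male" else 0 for gender, _ in data]
y = [score for _, score in data]

# Least-squares intercept and slope for y = intercept + slope * male.
n = len(y)
mean_x = sum(male) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(male, y))
sxx = sum((xi - mean_x) ** 2 for xi in male)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Group means computed directly, for comparison with the regression results.
female_scores = [s for g, s in data if g == "female"]
male_scores = [s for g, s in data if g == "male"]
female_mean = sum(female_scores) / len(female_scores)
male_mean = sum(male_scores) / len(male_scores)

print("intercept:", intercept, "female mean:", female_mean)
print("slope:", slope, "male minus female:", male_mean - female_mean)
```

The group coded 0 acts as the baseline, so which group gets the 0 is a choice about interpretation rather than a ranking of the groups.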

It gets a little strange when there are more than 2 categories on a nominal variable. Let’s look at race/ethnicity: We have 3 independent groups of (1) White, (2) Black, and (3) Hispanic. Even though we have 3 groups, each individual in the data will be coded on only 2 dummy variables, and those 2 variables tell us everything we need to know about each person’s race/ethnicity. The data will look something like this:

Subject ID | Race/Ethnicity | Race = White | Race = Black |
1 | 1 | 1 | 0 |
2 | 2 | 0 | 1 |
3 | 3 | 0 | 0 |
4 | 3 | 0 | 0 |
5 | 1 | 1 | 0 |

We can see that of the 5 study participants, 2 are White (coded as 1 on the Race/Ethnicity variable), 1 is Black (coded as 2), and 2 are Hispanic (coded as 3). The two columns on the right are the two dummy coded variables, one for White and one for Black, but none for Hispanic! If you look at the variable for Race = White, you can see that the two participants who are White are coded as 1, and those who are not White are coded as 0. Same thing for participants who are Black…the one Black participant is coded as 1 on the Race = Black variable and everyone else is coded as 0. The trick is in seeing that those who are neither White nor Black, the Hispanic participants, are coded with 0’s on both the White and Black dummy variables. That makes Hispanic the reference group, the baseline against which the other two groups are compared in the regression. So those two variables tell us everything we need to know about participant membership in the 3 race/ethnicity groups, on only two dummy coded variables. Very cool!
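The coding in the table can be sketched in a few lines of plain Python. The function name `dummy_code` and its `reference` parameter are hypothetical, chosen here for illustration; the input list mirrors the five participants above.

```python
def dummy_code(values, reference):
    """Build one 0/1 variable per group, skipping the reference group.

    Returns a dict mapping each non-reference group to a list of 0's and 1's,
    one entry per participant, in the original order.
    """
    levels = []
    for value in values:
        if value != reference and value not in levels:
            levels.append(value)
    return {
        level: [1 if value == level else 0 for value in values]
        for level in levels
    }


# Race/ethnicity of subjects 1 through 5, as in the table above.
race = ["White", "Black", "Hispanic", "Hispanic", "White"]

# Hispanic is left out, so it is identified by 0's on both dummy variables.
dummies = dummy_code(race, reference="Hispanic")

for subject_id, group in enumerate(race, start=1):
    print(subject_id, group, dummies["White"][subject_id - 1],
          dummies["Black"][subject_id - 1])
```

Three groups produce two dummy variables here, and in general this approach yields one fewer dummy variable than there are groups.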

This information is very cursory and of course more is involved in coding properly for analysis. If you’d like to see some of the dummy coding and reasoning in action, visit the Stats for the Masses page on YouTube to review a recording of one of our past webinars, GIGO NO-NO’s Problems and Solutions in Data Preparation: