R, Damned R, and Statistics

I recently had a small project to implement Odds Ratio and Risk Ratio for contingency tables, along with Fisher’s Exact Test and a few other significance tests. To help with the project, I used R as a reference. Most of the values I wanted to test against were available from a handy dandy package called “epitools.”

If you aren’t really familiar with contingency tables, they are usually set up to show how two variables, each with two values (or rather a value and a not-that-value), are related.

For example, for a dataset with two variables \mathcal{X} and \mathcal{Y} with the values X and “not X” and Y and “not Y,” the table would look like this:

        X    ¬X
Y       a    b
¬Y      c    d

where a is the number of instances in the dataset where the value X and Y both occur, b is the number of instances where the value Y occurs with the value(s) that are not X and so on.
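To make the cell definitions concrete, here is a minimal sketch that counts a, b, c and d from raw observations. The function name `contingency_counts` and the toy data are my own inventions for illustration; each observation is assumed to be a pair of booleans (has X, has Y).

```python
from collections import Counter


def contingency_counts(observations):
    """Return (a, b, c, d) from an iterable of (has_x, has_y) boolean pairs.

    Layout follows the table above: rows are Y / not Y, columns are X / not X.
    """
    counts = Counter((bool(x), bool(y)) for x, y in observations)
    a = counts[(True, True)]    # X and Y
    b = counts[(False, True)]   # not X, but Y
    c = counts[(True, False)]   # X, but not Y
    d = counts[(False, False)]  # neither X nor Y
    return a, b, c, d


# Hypothetical toy data, not from any real study.
data = [
    (True, True), (True, True),
    (False, True),
    (True, False),
    (False, False), (False, False), (False, False),
]
print(contingency_counts(data))
```

Reading the cells off in the order a, b, c, d walks the table left to right, then top to bottom.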

All of the literature describes the material in this way. The boxes go from left to right and then top to bottom: a \rightarrow b \rightarrow c \rightarrow d.

There is a formula for calculating the Odds Ratio, which after a bit of fiddlin’ turns into  OR = \frac{a*d}{b*c} .
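The “fiddlin’” is just dividing the odds in the Y row (a/b) by the odds in the ¬Y row (c/d), which simplifies to a*d over b*c. A small sketch of that formula follows; the function name `odds_ratio` and the example counts are made up for illustration and have nothing to do with any package’s API.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        # The ratio a*d / (b*c) has a zero denominator in this case.
        raise ZeroDivisionError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: a=20, b=80, c=10, d=90.
print(odds_ratio(20, 80, 10, 90))  # (20*90) / (80*10) = 1800 / 800 = 2.25
```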

So far, so good.

When you want to calculate the Risk Ratio, there is a similar formula using the values in this table: RR = \frac{a/(a+b)}{c/(c+d)}. Easy, right?
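Here is the same kind of sketch for the Risk Ratio, following the formula above term by term. Again, `risk_ratio` and the counts are hypothetical names and numbers chosen for illustration.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio for a 2x2 table laid out as [[a, b], [c, d]].

    Compares the proportion a/(a+b) in the first row with the
    proportion c/(c+d) in the second row.
    """
    if a + b == 0 or c + d == 0:
        raise ZeroDivisionError("risk ratio is undefined when a row is empty")
    if c == 0:
        raise ZeroDivisionError("risk ratio is undefined when c is zero")
    risk_first_row = a / (a + b)
    risk_second_row = c / (c + d)
    return risk_first_row / risk_second_row


# Hypothetical counts: a=20, b=80, c=10, d=90.
# (20/100) / (10/100) works out to a ratio of about 2.
print(risk_ratio(20, 80, 10, 90))
```

Unlike the Odds Ratio, this formula treats rows and columns differently, so the result depends on which way round the table is laid out.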

As the wise man once said: The great thing about standards is that there are so many to choose from.

Within the same epitools package, though, the riskratio() function gives the wrong result when you feed it the values that should be in a, b, c and d.

Completely confused by this, I went to the interwebs and asked the crowd. The answer was that the a, b, c, d convention used by the other functions in epitools, e.g., oddsratio(), doesn’t apply to riskratio(). WTF^10? Why would the R people do something so completely stupid like that? Why do that and not document it? R is really cool about some stuff, but at the end of the day there’s another truism from that wise man: The great thing about R is that it’s written by statisticians. The worst thing about R is that it’s written by statisticians.
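One thing worth knowing when checking against any reference implementation: the Odds Ratio cannot tell you whether you have the table layout right, but the Risk Ratio can. The sketch below is plain arithmetic on hypothetical counts. It says nothing about how epitools actually arranges its input (that is not something I have confirmed here); it only shows that transposing or rotating the table leaves a*d/(b*c) unchanged while the Risk Ratio moves.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Risk ratio for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts in the a, b, c, d order used in this post.
a, b, c, d = 20, 80, 10, 90

layouts = {
    "as written": (a, b, c, d),
    "transposed": (a, c, b, d),         # rows and columns swapped
    "rotated 180 degrees": (d, c, b, a),  # both row and column order reversed
}

for name, cells in layouts.items():
    print(f"{name}: OR={odds_ratio(*cells):.4f}  RR={risk_ratio(*cells):.4f}")
```

So a test that only compares Odds Ratios can pass with a mismatched layout, and the mismatch only shows up once you compare Risk Ratios.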

So if you want some good Odds Ratio and Risk Ratio lovin’, use the implementation in KNIME. I know it works, and it’s consistent.

UPDATE: I forgot to mention that I sent an email on 22 June to the epitools package maintainer. I haven’t heard anything back yet.


