KNIME and R Integration with Unbalanced Classes in Test and Train Partitions

Integration of data analysis (whither “Data Mining”?) tools can lead to some interesting interactions. For example, on a recent project, I was trying to use R and R’s bnlearn package with KNIME. KNIME has some very cool R nodes that help pull out what R does best and mix it with what KNIME does best. In this case, I was using Bayesian networks from bnlearn as classifiers with the R Learner and R Predictor nodes.

With most of the datasets from the UCI ML repository, this worked flawlessly. However, with the ecoli dataset split into 80/20 train/test partitions, I kept getting the following error:

<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R Script failed with exit code: 1
<date> : ERROR : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data

Okay, not very helpful, but there seemed to be something wrong with the check.data function in bnlearn.

When testing this against a reference R installation, i.e., without using KNIME, sure enough the following warnings (or similar, depending on the output of the 80/20 train/test split) can be found:

Warning messages:
1: In check.data(data) :
  variable V2 has levels that are not observed in the data.
2: In check.data(data) :
  variable V3 has levels that are not observed in the data.
3: In check.data(data) :
  variable V4 has levels that are not observed in the data.
4: In check.data(data) :
  variable V7 has levels that are not observed in the data.
5: In check.data(data) :
  variable V8 has levels that are not observed in the data.
6: In check.data(data) :
  variable V2 has levels that are not observed in the data.

But the KNIME R nodes aren’t supposed to stop execution on a warning, only when the R script returns an error. What gives?!?! A quick check of the train/test partitions shows that, yes indeed, the test data has columns (in this case, “factors” in R) that are missing some of their values (“levels” in R), because those values occur too rarely to show up in the smaller partition.
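To make the missing-levels situation concrete, here is a minimal base-R sketch on toy data (not the ecoli dataset). The column name V2 is borrowed from the warnings above; the values, the split indices, and the idea that a partition arrives in R as a freshly built table are assumptions for illustration only.

```r
# Toy data: one factor column with a rare level ("c").
full <- data.frame(V2 = factor(c(rep("a", 12), rep("b", 7), "c")))

# Hypothetical 80/20 split in which the rare level lands in the training rows.
train_idx <- c(1:10, 13:17, 20)
train <- full[train_idx, , drop = FALSE]
test  <- full[-train_idx, , drop = FALSE]

# Subsetting inside one R session keeps all declared levels,
# even though "c" never occurs in the test rows.
levels(test$V2)   # "a" "b" "c"
table(test$V2)

# Rebuilding the factor from the observed values alone drops the unseen level.
test_fresh <- data.frame(V2 = factor(as.character(test$V2)))
levels(test_fresh$V2)   # "a" "b"
```

A column whose test rows all share a single value would end up with just one level after the same rebuild.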

After a bit more poking around, it turns out that the R Predictor node was failing not because of the warning but because of an error:

Error in check.data(data) : variable Col3 must have at least two levels.
Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
Execution halted

So, the warning in the R reference installation was a red herring! <sarcasm>Great.</sarcasm>

Okay, it all makes sense now. The separate partitions live in separate data frames in the two different R nodes (unlike in the R reference installation). The check.data function, which is used both while learning the model and while using the model for prediction, checks the number of levels of each column (factor) and trips over a test column that has too few. Seems fair, but in this case it is clearly a problem, and maybe an area for improvement for the bnlearn maintainer. (Suggestion: different check.data implementations for learn and predict.)
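One alternative to touching the package, sketched here under assumptions, is to give the test factors the same level set as the training factors before predicting. The helper align_levels below is hypothetical (it is not part of bnlearn or KNIME), and it assumes the training levels are available wherever the prediction runs, which may not hold inside the R Predictor node.

```r
# Hypothetical helper: give every factor column in `test` the level set
# declared in `train`. Test values unknown to `train` become NA.
align_levels <- function(test, train) {
  for (col in names(test)) {
    if (is.factor(train[[col]])) {
      test[[col]] <- factor(as.character(test[[col]]),
                            levels = levels(train[[col]]))
    }
  }
  test
}

train <- data.frame(Col3 = factor(c("x", "x", "y", "x")))
# A small test partition that happens to contain only one value.
test  <- data.frame(Col3 = factor(c("x", "x")))

nlevels(test$Col3)           # 1, which a "fewer than two levels" check rejects
test_aligned <- align_levels(test, train)
nlevels(test_aligned$Col3)   # 2
```

The values themselves are unchanged; only the declared levels differ, so a check on the number of levels no longer fires for this column.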

The check.data function is located in the utils-sanitization.R file in the bnlearn package, and the quick solution in this case is to just comment out the offending error check. (Skipping the check could be a problem in a production run, and it should certainly be performed somewhere else before learning the model.)

      # check the number of levels of discrete variables, to guarantee that
      # the degrees of freedom of the tests are positive.
#comment out      if (nlevels(x[, col]) < 2)
#comment out        stop("variable ", col, " must have at least two levels.")
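Since the check no longer runs inside bnlearn after this edit, it can be done by hand on the training data before learning. A minimal sketch, assuming the training partition is an ordinary data frame; single_level_factors is a hypothetical helper name, not a bnlearn function.

```r
# Hypothetical helper: names of factor columns with fewer than two levels.
single_level_factors <- function(df) {
  is_bad <- vapply(df, function(x) is.factor(x) && nlevels(x) < 2, logical(1))
  names(df)[is_bad]
}

# Toy training data: V2 is constant, so it has a single level.
train <- data.frame(
  V1 = factor(c("a", "b", "a")),
  V2 = factor(c("k", "k", "k"))
)

bad <- single_level_factors(train)
if (length(bad) > 0) {
  message("columns with fewer than two levels: ", paste(bad, collapse = ", "))
}
```

Whether to stop, warn, or drop such columns at that point is a judgment call for the workflow.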

Now the trick is to get the modified bnlearn package into the R installation.

The first thing to do is to remove the package from the current R installation.

remove.packages("bnlearn", lib="/usr/lib64/R/library")

In this case, I had it installed as root; I made the modification and reinstalled it as my local user.

Compile and install the local copy (the directory where the package source was downloaded earlier to make the edit):

R CMD INSTALL ~/bin/bnlearn

The modified bnlearn is then accessible to the KNIME installation.
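To confirm which copy of the package an R session actually picks up, something like the following can be run in that session. This is a sketch that assumes KNIME uses the same R installation and library paths as the session where it is run.

```r
pkg <- "bnlearn"

# Library directories searched by this R session, in order.
lib_paths <- .libPaths()
print(lib_paths)

# Installed location of the package; an empty string means it was not found.
pkg_path <- system.file(package = pkg)
if (nzchar(pkg_path)) {
  cat(pkg, "found at", pkg_path, "\n")
  cat("version:", as.character(utils::packageVersion(pkg)), "\n")
} else {
  cat(pkg, "is not installed in any of the library paths above\n")
}
```

If the reported path is still the system library rather than the user library, the old copy has not been removed or the new one was installed elsewhere.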

Now the check is skipped, and unbalanced factors can have different numbers of levels while using KNIME and R together.
