Integration of data analysis (whither “Data Mining”?) tools can lead to some interesting interactions. For example, on a recent project, I was trying to use R and R’s bnlearn package with KNIME. KNIME has some very cool R nodes that help with pulling out what R does best and mixing it with what KNIME does best. In this case, I was using Bayesian Networks from bnlearn as classifiers with the R Learner and R Predictor nodes.
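For context, the scripts in the two nodes boil down to something like the following minimal sketch (made-up data and column names; the KNIME-specific plumbing that hands the fitted model and the data frames between the nodes is omitted):

library(bnlearn)

# Toy data: discrete (factor) columns, which is what bnlearn's discrete
# Bayesian networks expect.
set.seed(42)
full <- data.frame(
  V1    = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  V2    = factor(sample(c("x", "y"),      200, replace = TRUE)),
  Class = factor(sample(c("yes", "no"),   200, replace = TRUE))
)

# 80/20 train/test split, as in the workflow described here.
idx   <- sample(seq_len(nrow(full)), size = 0.8 * nrow(full))
train <- full[idx, ]
test  <- full[-idx, ]

# "R Learner" side: learn the structure, then fit the parameters.
dag    <- hc(train)                  # hill-climbing structure learning
fitted <- bn.fit(dag, data = train)  # parameter estimation

# "R Predictor" side: predict the class node on the held-out partition.
pred <- predict(fitted, node = "Class", data = test)
table(pred, test$Class)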
Running the workflow, the R Predictor node failed, and the KNIME log showed the following:
<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R Script failed with exit code: 1
<date> : ERROR : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
Okay, not very helpful, but there seemed to be something wrong with the check.data function in bnlearn.
When testing this against a reference R installation, i.e., without using KNIME, sure enough the following warnings (or similar, depending on how the 80/20 train/test split falls out) show up:
Warning messages:
1: In check.data(data) : variable V2 has levels that are not observed in the data.
2: In check.data(data) : variable V3 has levels that are not observed in the data.
3: In check.data(data) : variable V4 has levels that are not observed in the data.
4: In check.data(data) : variable V7 has levels that are not observed in the data.
5: In check.data(data) : variable V8 has levels that are not observed in the data.
6: In check.data(data) : variable V2 has levels that are not observed in the data.
But the KNIME R nodes aren’t supposed to stop execution on a warning, only when the R script returns an error. What gives?!?! Quick verification of the train/test partitions shows that yes indeed, the test data has columns (in this case, “factors” in R) without all of the values (“levels” in R), simply because some values are rare enough to miss the smaller partition.
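A check along these lines (a sketch, assuming train and test data frames named as in the snippet above) makes the mismatch easy to spot:

# For each column, compare the number of declared factor levels with the
# number of values actually observed. Subsetting a data frame in plain R
# keeps all original levels, so rare values show up as levels with zero
# observations in the test partition.
level_report <- function(df)
  sapply(df, function(col)
    c(levels   = nlevels(col),
      observed = length(unique(as.character(col)))))

level_report(train)
level_report(test)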
After a bit more poking around, it turned out that the R Predictor node was failing not because of the warning, but because of an error:
Error in check.data(data) : variable Col3 must have at least two levels.
Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
Execution halted
So, the warning in the R reference installation was a red herring! <sarcasm>Great.</sarcasm>
Okay, it now all makes sense. The separate partitions live in separate data frames in the two different R nodes (unlike in the R reference installation, where both partitions are subsets of the same data frame and so keep the full set of levels), so the check.data function, which is used both while learning the model and while using the model for prediction, compares the number of levels for each column (factor) and sees a discrepancy. Seems fair, but in this case clearly a problem, and maybe an area for improvement for the bnlearn maintainer. (Suggestion: different check.data implementations for learning and for prediction.)
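As an aside, a less invasive workaround than patching the package (a sketch, not what I actually did here) would be to force the predictor-side data frame to carry the same level sets the model was trained with, assuming those levels are saved on the learner side and passed along with the model:

# Hypothetical helper: make the factor columns of `test` carry exactly the
# level sets recorded at training time. train_levels is a named list of
# character vectors saved alongside the fitted model. Values unseen at
# training time become NA.
relevel_to_training <- function(test, train_levels) {
  for (col in names(train_levels))
    test[[col]] <- factor(test[[col]], levels = train_levels[[col]])
  test
}

train_levels <- lapply(Filter(is.factor, train), levels)   # learner side
test <- relevel_to_training(test, train_levels)            # predictor side
pred <- predict(fitted, node = "Class", data = test)

That still produces the “levels that are not observed” warnings, but as noted above, warnings alone don’t stop the KNIME R nodes.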
The check.data function is located in the utils-sanitization.R file in the bnlearn package, and the quick solution in this case is to just comment out the offending error check. (This could be a problem in a production run, and certainly this check should be implemented somewhere else before learning the model.)
# check the number of levels of discrete variables, to guarantee that
# the degrees of freedom of the tests are positive.
#comment out  if (nlevels(x[, col]) < 2)
#comment out    stop("variable ", col, " must have at least two levels.")
Now the trick is to get the modified bnlearn package into the R installation.
The first thing to do is to remove the package from the current R installation.
In this case, I had it installed as root; I reinstalled and modified it as my local user.
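From an R session, the removal and the source download can be done with the standard utils helpers (the paths and exact steps here are illustrative; removing the root-installed copy needs write access to the system library):

remove.packages("bnlearn")                       # drop the copy installed as root
dl <- download.packages("bnlearn", destdir = path.expand("~/bin"), type = "source")
untar(dl[1, 2], exdir = path.expand("~/bin"))    # leaves the source in ~/bin/bnlearn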
Compile and install the local instance (from the directory where the package source was downloaded earlier to actually make the edit):
R CMD INSTALL ~/bin/bnlearn
and the modified bnlearn will be accessible to the KNIME installation.
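A quick sanity check from R (or from an R snippet inside KNIME) confirms which copy will be loaded, assuming the user library comes before the system library on the library path:

find.package("bnlearn")    # should point at the user library, not the system one
packageVersion("bnlearn")
.libPaths()                # the user library should be listed first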
Now the check is skipped, and unbalanced factors can have different numbers of levels while using KNIME and R together.