Programming Nodes in KNIME with Scala, Part 1

This post will help get you started programming nodes for KNIME using the Scala programming language. If you are unfamiliar with Scala, you might want to change that. From Wikipedia:

Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higher-order types (but not higher-rank types), and anonymous types. Other features of Scala not present in Java include operator overloading, optional parameters, named parameters, raw strings, and no checked exceptions.

Scala 2.11 also has direct integration with the Akka library, which makes programming parallel and distributed applications really easy, especially on a Spark installation.

KNIME is programmed in Java, which makes interoperating with Scala very easy, since both are languages native to the JVM.

If you are already familiar with programming nodes with the KNIME SDK, then skip ahead.

This isn’t intended to be a primer on programming in Java or on programming nodes for KNIME in general, but it starts by assuming nothing about the development environment, to make sure this guide is reproducible. This guide was written using the KNIME SDK v3.1.1.

Download and install the KNIME SDK.

Over at the KNIME download page, either sign up for the low-traffic newsletter and announcements list, or just jump straight to the download section by clicking tab number 2. At the bottom are the downloads for the SDK, which includes the Eclipse development environment prepackaged for programming nodes and/or plugins for KNIME. Choose the correct one for your OS; Linux, MacOS, and Windows are supported.

knime-sdk-download

Unzip or install the package, depending on your OS, and launch Eclipse. Depending on when you run this, one of the first things to do is to run Help->Check for Updates… to pick up the latest bug fixes. This step updated Eclipse to v4.5.2.

You’ll need to set up the Debug and Run configurations by clicking the arrow next to the Debug button and double-clicking “Eclipse application” to create a new configuration called, by default and aptly enough, “New_configuration.” In the “Main” tab, in the “Program to Run” group, choose to run a “product” and select “org.knime.product.KNIME_PRODUCT” from the dropdown menu.

Under the “Arguments” tab (shown below), copy these values into the VM arguments edit space:

vm-arguments

-Xms40m -Xmx2048m -ea -Dosgi.classloader.lock=classname 
-XX:+UnlockDiagnosticVMOptions 
-XX:+UnsyncloadClass 
-Dknime.enable.fastload=true

Press “Debug” and KNIME should start from Eclipse in Debug mode. Congrats! You are now a KNIME programmer!

Another good tip for programming/using KNIME: enable Window->Preferences->General->Show Heap Status to show the heap status.

show-heap-status

Close KNIME and return to the Eclipse IDE.

Install the Scala IDE into Eclipse

The Eclipse Scala IDE plug-in is installed by adding a new Update Site to Eclipse at the Help->Install New Software… dialog.

install-new-software

Add a new update site by pressing “Add…” and adding the site: http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site. Select the checkbox for “Scala IDE for Eclipse” since that’s the only one that’s necessary. Press “Next >”, “Next >”, accept the License agreement and “Finish.”

The software will install and require a restart of Eclipse. You may need to adjust the heap size used by Eclipse in the eclipse.ini file. (This is not the same as the VM settings set for the KNIME executable above.) Adjust the heap to be 1G or larger. (I have mine set to 2G.) The Scala IDE is now integrated into your Eclipse IDE, and you can develop regular Scala programs as well as Java programs, and, most interesting of all, node plug-ins (or just simply “Nodes”) for KNIME.

Create a new node using the Node Wizard.

The finished project is available over at GitHub. You can download it there, or go through the steps here. The New Node Wizard is located under File->New->Other->KNIME->Create a new KNIME Node-Extension.

new-node-wizard

The dialog will require a few fields. I’ve shown an example for the Project MyFirstScalaKnimeNode with the Node Name MyFirstScalaKnimeNode in the package de.uni_konstanz.knime.scala.myfirst. In this case, select the Node type to be a Manipulator. Right now, the only thing the Node type affects is the color of the node.

node-settings

Press “Finish” and a new project with the name MyFirstScalaKnimeNode will be created, containing the package de.uni_konstanz.knime.scala.myfirst with a few files in a src folder. The two most important files we’re concerned with are the ...NodeModel.java and the ...NodeDialog.java source files. The NodeModel file is where the actual execution and processing of data takes place in a KNIME node, and the NodeDialog file handles the Configuration dialog for the node.

About the Scala Node

The KNIME Scala node is just going to do a very simple task. It will read a value from an Integer column on the input, and then output the factorial of that number using the BigInteger data structure. We can’t directly output BigIntegers into a column type in KNIME, but we can output their String representation. I may gloss over some of the implementation details, so you might want to download the project from my GitHub site and read along with it there.

Set up the NodeDialog to select a column of Integers

In the NodeModel, the CFGKEY_COUNT and DEFAULT_COUNT constants and the m_count member variable can be deleted. They’re to be replaced with a constant String CFGKEY_FACTOR = "Factor", a member variable m_factorColumn, and another constant PORT_IN_DATA = 0.

CFGKEY_FACTOR is the key the settings framework uses to identify the value of the field in the dialog component. m_factorColumn is the variable that will hold the name of the column with the Integers to be factored. PORT_IN_DATA is just a constant representing which input port of the node carries the data table. This is a bit of overkill for a one-input/one-output node, but good programming habits are (hopefully) hard to break.
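Something like this, as a minimal sketch (the exact modifiers are up to you; m_factorColumn itself is declared in a moment):

static final String CFGKEY_FACTOR = "Factor"; // settings key for the dialog field
static final int PORT_IN_DATA = 0;            // index of the node's (only) input data port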

We’ll insert a static helper method to initialize m_factorColumn so we only have to type it once.


public static SettingsModelString createSettingsModelFactorColumn() {
	return new SettingsModelString(CFGKEY_FACTOR, "");
}

Set m_factorColumn to this:

private SettingsModelString m_factorColumn = createSettingsModelFactorColumn(); 

Each of the saveSettingsTo(), loadValidatedSettingsFrom(), and validateSettings() methods needs to be changed to call the corresponding method on m_factorColumn, as sketched below.
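Something like this (a minimal sketch; the save/load/validate calls are the standard methods provided by the SettingsModel classes):

@Override
protected void saveSettingsTo(final NodeSettingsWO settings) {
	m_factorColumn.saveSettingsTo(settings);
}

@Override
protected void loadValidatedSettingsFrom(final NodeSettingsRO settings)
		throws InvalidSettingsException {
	// Load the already-validated value into our settings model.
	m_factorColumn.loadSettingsFrom(settings);
}

@Override
protected void validateSettings(final NodeSettingsRO settings)
		throws InvalidSettingsException {
	// Let the settings model check the value without applying it.
	m_factorColumn.validateSettings(settings);
}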

We want to change the dialog to select a column name (a String) of a column of type Integer. Replace the addDialogComponent statement with the following statement:


addDialogComponent(new DialogComponentColumnNameSelection(
		MyFirstScalaKnimeNodeNodeModel.createSettingsModelFactorColumn(),
		"Factor Column:", MyFirstScalaKnimeNodeNodeModel.PORT_IN_DATA,
		IntValue.class));

This is done by using an instance of DialogComponentColumnNameSelection. It uses the same static method to create the SettingsModel, labels the field “Factor Column:”, and names the input port of the node (PORT_IN_DATA, or 0) which should have this column type. The cool part is the last argument: this is the value type that is searched for in the incoming DataTableSpec, and it restricts the selection to columns of that type. If there’s only one column of that type, it’s automatically selected.
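For context, here is roughly what the whole NodeDialog ends up looking like after the change (a sketch; the class name, protected constructor, and super() call are what the Node Wizard generates):

public class MyFirstScalaKnimeNodeNodeDialog extends DefaultNodeSettingsPane {

	protected MyFirstScalaKnimeNodeNodeDialog() {
		super();
		// Let the user pick an Integer column from input port 0.
		addDialogComponent(new DialogComponentColumnNameSelection(
				MyFirstScalaKnimeNodeNodeModel.createSettingsModelFactorColumn(),
				"Factor Column:", MyFirstScalaKnimeNodeNodeModel.PORT_IN_DATA,
				IntValue.class));
	}
}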

Set up the NodeModel.

The magic in the NodeModel happens in the execute() method. This is where the data table is passed in to be processed by the node. The sample code from the Node Wizard creates a node that takes a count value from the Configure dialog, outputs that number of rows, and does different things with the counter in its three output columns, like outputting the value as a string, multiplying it by 0.5, or simply outputting it unchanged. This example is going to show a slightly more advanced construct for KNIME nodes using the ColumnRearranger class, which will just append a new column of type StringCell on a row-by-row basis.

Note: Using the ColumnRearranger is really only useful for code that can process one row at a time, or can output some information one row at a time, without having to look at a sequence of rows (as you would for a moving average or similar).

Replace the contents of your execute() with this:


ColumnRearranger c = new ColumnRearranger(inData[PORT_IN_DATA].getDataTableSpec());

MyFirstScalaNodeCellFactory f = new MyFirstScalaNodeCellFactory(
		inData[PORT_IN_DATA].getDataTableSpec().findColumnIndex(m_factorColumn.getStringValue()),
		exec);

c.append(f);

final BufferedDataTable returnVal = exec.createColumnRearrangeTable(inData[PORT_IN_DATA], c, exec);

return new BufferedDataTable[] { returnVal };

Add any imports as necessary.
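If Eclipse’s organize-imports doesn’t offer them automatically, the two new classes used here come from the KNIME core API (package names from the standard API; verify against your SDK version):

import org.knime.core.data.container.ColumnRearranger;
import org.knime.core.node.BufferedDataTable;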

What’s happening here is not entirely obvious. A new object of type ColumnRearranger is created with the input DataTableSpec as input. A factory (that we’ll create in a moment) is instantiated with the column index and execution context it will need later. The cells created by this factory are appended to the columns passed through (rearranged or not) by the ColumnRearranger via the append() method. The real work is done in the CellFactory.

Add this private class to the end of the NodeModel.


private class MyFirstScalaNodeCellFactory implements CellFactory, AppendedCellFactory {

	private int m_factorColumnIdx;
	private ExecutionContext m_exec;

	public MyFirstScalaNodeCellFactory(int factorColumnIdx, ExecutionContext exec) {
		m_factorColumnIdx = factorColumnIdx;
		m_exec = exec;
	}

	@Override
	public DataCell[] getAppendedCell(DataRow row) {
		return getCells(row);
	}

	@Override
	public DataCell[] getCells(DataRow row) {
		// Purposely left empty for now; we'll fill this in after writing the Scala code.
		return new DataCell[]{};
	}

	@Override
	public DataColumnSpec[] getColumnSpecs() {
		return new DataColumnSpec[]{
				new DataColumnSpecCreator("Factorial", StringCell.TYPE).createSpec()};
	}

	@Override
	public void setProgress(int curRowNr, int rowCount, RowKey lastKey, ExecutionMonitor exec) {
		m_exec.setProgress((double)curRowNr / rowCount);
	}
}

Let’s have a closer look at what’s happening in the class. The CellFactory and AppendedCellFactory interfaces have to be implemented to append cell(s). The constructor initializes the private class’s fields to the passed-in values. The only values we need are the column index, to fetch the input integer value, and the ExecutionContext, to update KNIME about the current progress. getAppendedCell() simply calls getCells(), because we’re appending cells and getCells() just comes along with the CellFactory interface.

I’ve purposely left getCells() blank here.

Program the processing part in Scala

Now’s the time to start programming in Scala. The first thing we need to do is convert the Java project to a Scala project. The main difference is that Eclipse will then use the Scala compiler to compile both the Scala and the Java code. This is done by changing the “nature” of the project to a Scala nature: right-click the project and select Configure->Add Scala Nature.

add-scala-nature

You may get an “Updated Required Bundles” error. You can ignore this. You may also get the following error.

org.apache.commons.io_2.4.0.jar of MyFirstScalaKnimeNode build path is cross-compiled with an incompatible version of Scala (2.4.0). In case this report is mistaken, this check can be disabled in the compiler preference page.

The report is mistaken in this case (the “2.4.0” is the commons-io bundle version, not a Scala version), or can at least be ignored for now, so the check needs to be disabled in the compiler preference page.

Under Project->Properties->Scala Compiler, check “Use Project Settings” and then, under the Build Manager tab, deselect “withVersionClasspathValidator”.

check-scala-compile

Click “Apply” and “Okay” and the error has been eliminated.

You may also get one or both of the following errors:

SBT builder crashed while compiling. The error message is ‘class “org.knime.core.node.InvalidSettingsException”’s signer information does not match signer information of other classes in the same package’. Check Error Log for details.

Error in Scala compiler: class “org.knime.core.node.InvalidSettingsException”’s signer information does not match signer information of other classes in the same package

This requires setting the Scala compiler target to 1.8 (on the same Scala Compiler preference page) to match the already-compiled Java code.

Now we can finally write some Scala code!

In the de.uni_konstanz.knime.scala.myfirst package, create a new Scala class called BigIntFactorial.

create-scala-class

The Scala IDE’s new-class wizard is a bit different from that of the Java IDE, so just type the class name at the end of the prefilled-in package name and click “Finish.”

We’ll just use a simple product to calculate the factorial. We could have done this with recursion, but with larger integers, this could cause us to run out of stack space.

Add this method to the Scala class:


def calcFactorial(f : Int) : BigInt = {
    var fac : BigInt = 1
    for (i <- 1 to f)
      fac = fac * i
    fac
}

This method is exactly what you think it is: it takes an integer and returns a BigInt (Scala’s version of BigInteger, which actually wraps a java.math.BigInteger), holding the product of the integers from 1 up to f. No error checking is performed for values less than 0.

Back in the NodeModel, we’ll change the method getCells() in the private CellFactory class to the following:


@Override
public DataCell[] getCells(DataRow row) {
	// BigIntFactorial is our Scala class; scala.math.BigInt also needs to be imported.
	BigIntFactorial bif = new BigIntFactorial();
	int f = ((IntCell)row.getCell(m_factorColumnIdx)).getIntValue();

	BigInt i = bif.calcFactorial(f);

	return new DataCell[] {new StringCell(i.toString())};
}

We instantiate a Scala(!!) object in the method just like we would a Java object and call the method just as if it were a Java method. No translation necessary. We can even use BigInt, a Scala class, as a type in Java just like we would BigInteger. Seamless!

From the KNIME API, we get the value from the passed-in DataRow by using the getCell() method, casting it to an IntCell and calling the getIntValue() method.

Test the Code

Time to get this thing running! Click the debug button. If you get the following error message:

JDT Weaving is currently disabled. The Scala IDE needs JDT Weaving to be active,
or it will not work correctly.

Activate JDT Weaving and Restart Eclipse? (Highly Recommended)

[OK] [Cancel]

click “Cancel” as this won’t affect our test.

Create a simple workflow with the Table Creator node and your Scala node, put some simple values in it and test your code.

After that works, try it with some large number: 1000 is good. Look at the output to really test the BigInt and String structures.

big-factorial

In Part 2, we’ll do some things that will make integrating Scala worth the effort.


Microsoft is trying its old tricks?

When I read this:

When Microsoft introduced the Azure Data Lake, we included a new language, U-SQL, to make Big Data processing easy. U-SQL unifies the declarative power of SQL and the extensibility of C# to make writing custom processing of Big Data easy.

I guess it’s not too hard to understand that all I see is dead people, er, “Embrace, Extend, Extinguish.”

Free Webinar: U-SQL for Big Data – A Definitive Guide https://blogs.technet.microsoft.com/machinelearning/2016/02/09/free-webinar-u-sql-for-big-data-a-definitive-guide/


KNIME and R Integration with Unbalanced Classes in Test and Train Partitions

Integration of data analysis (whither “Data Mining”?) tools can lead to some interesting interactions. For example, on a recent project, I was trying to use R and R’s bnlearn package with KNIME. KNIME has some very cool R nodes to help pull out what R does best and mix it with what KNIME does best. In this case, I was using Bayesian networks from bnlearn as classifiers with the R Learner and R Predictor nodes.

With most of the datasets from the UCI ML repository, this worked flawlessly. However, with the ecoli dataset split into 80/20 train/test partitions, I kept getting the following error:

<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R Script failed with exit code: 1
<date> : ERROR : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
<date> : DEBUG : KNIME-Worker-54 : R Snippet : R Predictor : 0:224:221 : Execution of R script failed: Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data

Okay, not very helpful, but there seemed to be something wrong with the check.data function in bnlearn.

When testing this against a reference R installation, i.e., without using KNIME, sure enough the following warnings (or similar, depending on the output of the 80/20 train/test split) can be found:

Warning messages:
1: In check.data(data) :
  variable V2 has levels that are not observed in the data.
2: In check.data(data) :
  variable V3 has levels that are not observed in the data.
3: In check.data(data) :
  variable V4 has levels that are not observed in the data.
4: In check.data(data) :
  variable V7 has levels that are not observed in the data.
5: In check.data(data) :
  variable V8 has levels that are not observed in the data.
6: In check.data(data) :
  variable V2 has levels that are not observed in the data.

But the KNIME R nodes aren’t supposed to stop execution on a warning, only when the R script returns an error. What gives?!?! A quick verification of the train/test partitions shows that, yes indeed, the test data has columns (“factors” in R) that are missing some of the values (“levels” in R), due to the low frequency of those values.

After a bit more poking around, it turns out that the R Predictor node was failing not because of the warning, but because of an error:

Error in check.data(data) : variable Col3 must have at least two levels.
Calls: cbind ... predict -> predict.bn -> predict.bn.fit -> check.data
Execution halted

So, the warning in the R reference installation was a red herring! <sarcasm>Great.</sarcasm>

Okay, now it all makes sense. The separate partitions have separate data frames in the two different R nodes (unlike in the R reference installation), so the check.data function, which is used both while learning the model and while using the model for prediction, compares the number of levels for each column (factor) and sees a discrepancy. Seems fair, but in this case clearly a problem, and maybe an area for improvement for the bnlearn maintainer. (Suggestion: different check.data implementations for learn and predict.)

The check.data function is located in the utils-sanitization.R file in the bnlearn package and the quick solution in this case is to just comment out the offending error check. (This could be a problem in a production run, and certainly this check should be implemented somewhere else before learning the model.)

      # check the number of levels of discrete variables, to guarantee that
      # the degrees of freedom of the tests are positive.
      #if (nlevels(x[, col]) < 2)
      #  stop("variable ", col, " must have at least two levels.")

Now the trick is to get the modified bnlearn package into the R installation.

The first thing to do is to remove the package from the current R installation.

remove.packages("bnlearn", lib="/usr/lib64/R/library")

In this case, I had installed it as root; I reinstalled and modified it as my local user.

Compile and install the local instance (where the package was downloaded earlier in order to make the edit):

R CMD INSTALL ~/bin/bnlearn

and the modified bnlearn will be accessible to the KNIME installation.

Now the check is skipped, and unbalanced factors can have different numbers of levels while using KNIME and R together.

Posted in Data Mining, KNIME, R, Tutorial | Tagged , , , | Leave a comment

Do daughters make you more conservative?

Now that I’ve reached the age where my friends from when I was a kid and from my college days (Virginia Tech, Computer Engineering, 1992) have kids who are teenagers and young adults, and thanks to the incredible social experiment that is Facebook, I’ve made a couple of observations (we data people would call this “anecdata”):

  1. (Some of) the wild ones, in particular the women, from that time have become intensely religious, and to a degree socially conservative, even though they never uttered a single word about religion at the time, and their behavior demonstrated what one would describe as distinctly non-religious* qualities.
  2. (Some of) the men, and those who demonstrated what one would describe as typical young-adult and college types of behavior, e.g., drinking, partying, chasing girls, and who have had daughters since that time, have become intensely socially conservative, if not also religious.

It’s this second point that has inspired this post. I was thinking that this would be a great way to design a long-term study: measure the behavioral and political mores of young people, then sample the same people 20 or 30 years later to see if there was any change in those views, and how it correlated with things like having children and the role the gender of the children might play in that. Now that would be job security!

However, it seems I’m just a little too late. There’s an article from the Atlantic from 2013, linking to another article, that describes a paper, “The Effect of Daughters on Partisanship and Social Attitudes Toward Women” by Dalton Conley and Emily Rauscher, that describes exactly this effect. Even more surprising is that this is contrary to other studies at the time.

The researchers note that their results fly in the face of the few other studies that test the effect of daughters on political attitudes. Among them is a 2008 voting analysis of members of Congress. It found U.S. Senators and Representatives with more daughters voted more liberally than other members.  A 2010 study in Great Britain found having daughters increased the likelihood of voting for the Labor or Liberal Democrat parties as opposed to the Conservative Party, though the data are limited to “children who live at home, do not include information on those who have left home, and include step-children,” Rauscher and Conley write.

However, their findings are consistent with a recent study that found boys who grew up with sisters in the house were more likely to identify as adults with the Republican Party.

But why would having a daughter cause parents to become more Republican? The authors speculate that men and women might want more socially conservative policies when they have daughters and thus be more attracted to the GOP.

Well, yeah…um…duh. But why do they want more conservative policies? I speculate that it’s exactly because the men remember how they were when they were young, and actually want to protect their daughters from the visions they have of their former selves.

Well, my plans for a grant proposal have gone up in smoke. The research is there and being done by very capable researchers. I just hope they dig a bit deeper into the “why?” part of it all. They’ll probably find a lot of hidden remorse and self-loathing, which seems to drive most of the contemporary hate-oriented right-wing policies.

*I do not really wish to open up the can of worms about what behavior is condoned or condemned by religion, and intend here to keep it to the more traditional, or even stereotypical, view of how one expects a religious person to behave.


Book Review: Infinite Ascent, A Short History of Mathematics

4 of 5 Stars. The mathematical topics covered, and the depth to which they are covered, make David Berlinski’s Infinite Ascent: A Short History of Mathematics enjoyable reading. The survey of several different areas and the history of those areas makes for an entertaining overview. It’s almost like looking onto different faces of a polyhedron, where each side is just a glimpse of the whole. My major dislike of this book is that the author seems to be writing as much about himself in the subject as about the subject itself, without actually saying as much, and even some of the liberties he takes with the narrative become more distracting than amusing.


Book Review: Is God a Mathematician?

I’m going to give Mario Livio’s Is God a Mathematician? five stars despite its grievous flaw. The book is a very well written history of Mathematics, with an emphasis on the discussion of whether Mathematics is something invented by humans to describe the world or an intrinsic part of the world discovered by humans. The pacing is excellent, and it reads as much like a novel as a history. The explanations of mathematical concepts are also excellent, giving enough detail to keep the informed interested without delving so far into the details as to scare off the novice. The book is simply fantastic up until the last chapter, where the author tries to answer the question of the origin of Mathematics.

He ends up splitting the difference and claims that Mathematics is a combination of both discovery and invention, which is, indeed, very unsatisfying, especially after he seems to build a rock-solid case for its discovery. After reading this book, I am more convinced that Mathematics is discovered, that God is a Mathematician, or, more likely, the other way around.


The Age of Robots is Arriving One Brick at a Time.

The Age of Robots is arriving one brick at a time. In what is an absolutely perfect example of how robots will help and augment human labor, a robot lays bricks, essentially doing the grunt work, while a mason does the detail work and the work that is (for now) too tricky for the robot.

In this human-robot team, the robot is responsible for the more rote tasks: picking up bricks, applying mortar, and placing them in their designated location. A human handles the more nuanced activities, like setting up the worksite, laying bricks in tricky areas, such as corners, and handling aesthetic details, like cleaning up excess mortar.

But the kicker is the efficiency:

a human mason can lay about 300 to 500 bricks a day, while SAM can lay about 800 to 1,200 bricks a day. One human plus one SAM equals the productivity of having four or more masons on the job.

There’s quite a bit of cool technology at work:

The robot is able to do all of this using a set of algorithms, a handful of sensors that measure incline angles, velocity, and orientation, and a laser. The laser is rigged up between two poles at the extreme left and right sides of the robot’s work space, and moves up and down the wall as work progresses to act as an anchor point for the robot.

Even though the price is high at ~$500,000, large commercial jobs will be the perfect place for this machine.

The revolution is coming slowly, but it is coming, with the best summary of the future coming from Marc Andreessen:

The spread of computers and the Internet will put jobs in two categories. People who tell computers what to do, and people who are told by computers what to do.

(h/t: O’Reilly Radar)
