Programming Nodes in KNIME with Scala, Part 1

This post will help you get started programming nodes in KNIME using the Scala programming language. If you are unfamiliar with Scala, you might want to change that. From Wikipedia:

Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higher-order types (but not higher-rank types), and anonymous types. Other features of Scala not present in Java include operator overloading, optional parameters, named parameters, raw strings, and no checked exceptions.

Scala 2.11 also has direct integration with the Akka library, which makes programming parallel and distributed applications really easy, especially on a Spark installation.

KNIME is programmed in Java, which makes interoperating with Scala very easy, since both languages are native to the JVM.

If you are already familiar with programming nodes with the KNIME SDK, then skip ahead.

This isn’t intended to be a primer on programming in Java or on programming nodes for KNIME in general, but it does start from a clean development environment to make sure the steps are reproducible. This guide was written using the KNIME SDK v3.1.1.

Download and install the KNIME SDK.

Over at the KNIME download page, either sign up for the low-traffic newsletter and announcements list, or jump straight to the download section by clicking the second tab. At the bottom are the downloads for the SDK, which include the Eclipse development environment prepackaged for programming nodes and plug-ins for KNIME. Choose the correct one for your OS; Linux, macOS, and Windows are supported.

[Screenshot: knime-sdk-download]

Unzip or install the package, depending on your OS, and start Eclipse. Depending on when you do this, one of the first things to do is run Help->Check for Updates… to pick up the latest bug fixes. For me, this step updated Eclipse to v4.5.2.

You’ll need to set up the Debug and Run configurations: click the arrow next to the Debug button and double-click “Eclipse Application” to create a new configuration called, by default and aptly enough, “New_configuration.” In the “Main” tab, under the “Program to Run” group, choose to run a product and select “org.knime.product.KNIME_PRODUCT” from the dropdown menu.

Under the “Arguments” tab, copy these values into the VM arguments edit space:

-Xms40m -Xmx2048m -ea -Dosgi.classloader.lock=classname 
-XX:+UnlockDiagnosticVMOptions 
-XX:+UnsyncloadClass 
-Dknime.enable.fastload=true

Press “Debug” and KNIME should start from Eclipse in Debug mode. Congrats! You are now a KNIME programmer!

Another good tip for programming/using KNIME: enable Window->Preferences->General->Show Heap Status to show the heap status.

[Screenshot: show-heap-status]

Close KNIME and return to the Eclipse IDE.

Install the Scala IDE into Eclipse

The Eclipse Scala IDE plug-in is installed by adding a new Update Site to Eclipse at the Help->Install New Software… dialog.

[Screenshot: install-new-software]

Add a new update site by pressing “Add…” and adding the site: http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site. Select the checkbox for “Scala IDE for Eclipse” since that’s the only one that’s necessary. Press “Next >”, “Next >”, accept the License agreement and “Finish.”

The software will install and require a restart of Eclipse. You may need to adjust the heap size used by Eclipse itself in the eclipse.ini file. (This is not the same as the heap set for the KNIME executable in the VM arguments above.) Adjust the heap (the -Xmx value under -vmargs) to 1 GB or larger. (I have mine set to 2 GB.) The Scala IDE is now integrated into your Eclipse IDE and you can develop regular Scala programs as well as Java programs and, most interesting of all, node plug-ins (or simply “Nodes”) for KNIME.

Create a new node using the Node Wizard.

The finished project is available over at GitHub. You can download it there, or go through the steps here. The New Node Wizard is located under File->New->Other->KNIME->Create a new KNIME Node-Extension.

[Screenshot: new-node-wizard]

The dialog requires a few fields. I’ve shown an example for the project MyFirstScalaKnimeNode with the node name MyFirstScalaKnimeNode in the package de.uni_konstanz.knime.scala.myfirst. In this case, select “Manipulator” as the Node type. Right now, the only thing the Node type affects is the color of the node.

[Screenshot: node-settings]

Press “Finish” and a new project named MyFirstScalaKnimeNode will be created, containing the package de.uni_konstanz.knime.scala.myfirst and a few files in a src folder. The two files we’re most concerned with are the ...NodeModel.java and ...NodeDialog.java source files. The NodeModel file is where the actual execution and processing of data takes place in a KNIME node, and the NodeDialog file handles the node’s configuration dialog.

About the Scala Node

The KNIME Scala node is going to do a very simple task: read a value from an Integer column on the input and output the factorial of that number, computed with a BigInteger data structure. We can’t output BigIntegers directly as a KNIME column type, but we can output their String representation. I may gloss over some of the implementation details, so you might want to download the project from my GitHub site and read along with it there.

Set up the NodeDialog to select a column of Integers

In the NodeModel, the CFGKEY_COUNT and DEFAULT_COUNT constants and the m_count member variable can be deleted. They are to be replaced with a constant String CFGKEY_FACTOR = "Factor", a member variable m_factorColumn, and another constant PORT_IN_DATA = 0.
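
For reference, here’s a minimal sketch of those two constants in the NodeModel (the names come from this post; the exact modifiers are my choice and may differ slightly from the finished project on GitHub):

static final String CFGKEY_FACTOR = "Factor"; // settings key for the column selection
static final int PORT_IN_DATA = 0;            // index of the node's (only) input port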

CFGKEY_FACTOR is the key used to store and retrieve the value of this setting in the node’s configuration. m_factorColumn is the variable that will hold the name of the column containing the Integers whose factorials we compute. PORT_IN_DATA is just a constant naming the input port the data arrives on. This is a bit of overkill for a one-input/one-output node, but good programming habits are (hopefully) hard to break.

We’ll add a static helper method that creates the SettingsModel for m_factorColumn, so we only have to type it once.


public static SettingsModelString createSettingsModelFactorColumn() {		
	return new SettingsModelString(CFGKEY_FACTOR, "");
}

Set m_factorColumn to this:

private SettingsModelString m_factorColumn = createSettingsModelFactorColumn(); 

Each of the saveSettingsTo(), loadValidatedSettingsFrom(), and validateSettings() methods needs to be changed to call the corresponding method on m_factorColumn, as sketched below.
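
Here’s a rough sketch of what those three methods look like after the change, assuming the usual delegation pattern for a single SettingsModel (the method signatures are the ones generated by the wizard):

@Override
protected void saveSettingsTo(final NodeSettingsWO settings) {
	m_factorColumn.saveSettingsTo(settings);
}

@Override
protected void loadValidatedSettingsFrom(final NodeSettingsRO settings)
		throws InvalidSettingsException {
	m_factorColumn.loadSettingsFrom(settings);
}

@Override
protected void validateSettings(final NodeSettingsRO settings)
		throws InvalidSettingsException {
	// only checks that the setting can be read; doesn't change m_factorColumn
	m_factorColumn.validateSettings(settings);
}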

We want the dialog to let the user select the name (a String) of a column of type Integer. Replace the addDialogComponent statement in the NodeDialog with the following:


addDialogComponent(new DialogComponentColumnNameSelection(
		MyFirstScalaKnimeNodeNodeModel.createSettingsModelFactorColumn(),
		"Factor Column:", MyFirstScalaKnimeNodeNodeModel.PORT_IN_DATA,
		IntValue.class));

This is done using an instance of DialogComponentColumnNameSelection. It uses the same static method to create the SettingsModel, labels the field “Factor Column:”, and names the input port of the node (PORT_IN_DATA, i.e. 0) whose table should contain the column. The cool part is the last argument: that value class is searched for in the incoming DataTableSpec, so only columns of that type can be selected. If there’s only one column of that type, it’s selected automatically.
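
For context, here’s roughly what the whole wizard-generated NodeDialog class boils down to after the change (a sketch with imports omitted; the class name follows the wizard’s naming scheme and your generated file may contain additional comments):

public class MyFirstScalaKnimeNodeNodeDialog extends DefaultNodeSettingsPane {

	protected MyFirstScalaKnimeNodeNodeDialog() {
		super();
		// let the user pick a column of Integers from the input table
		addDialogComponent(new DialogComponentColumnNameSelection(
				MyFirstScalaKnimeNodeNodeModel.createSettingsModelFactorColumn(),
				"Factor Column:", MyFirstScalaKnimeNodeNodeModel.PORT_IN_DATA,
				IntValue.class));
	}
}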

Set up the NodeModel.

The magic in the NodeModel happens in the execute() method: this is where the input data table is handed to the node for processing. The sample code from the Node Wizard creates a node with three output columns; it takes a count value from the Configure dialog, outputs that number of rows, and does different things with the counter, like outputting the value as a string, multiplying it by 0.5, or just outputting it as-is. This example is going to show a slightly more advanced construct for KNIME nodes using the ColumnRearranger class, which will simply append a new column of type StringCell on a row-by-row basis.

Note: using the ColumnRearranger is really only useful for code that can process or output one row at a time, without having to look at a sequence of rows (as something like a moving average would).

Replace the contents of your execute() with this:


ColumnRearranger c = new ColumnRearranger(inData[PORT_IN_DATA].getDataTableSpec());
MyFirstScalaNodeCellFactory f = new MyFirstScalaNodeCellFactory(
		inData[PORT_IN_DATA].getDataTableSpec().findColumnIndex(m_factorColumn.getStringValue()),
		exec);

c.append(f);

final BufferedDataTable returnVal = exec.createColumnRearrangeTable(inData[PORT_IN_DATA], c, exec);

return new BufferedDataTable[] { returnVal };

Add any imports as necessary.
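
For the snippet above, the main new import is the ColumnRearranger itself; the package path below is the one I know from the KNIME core API, but let Eclipse’s Organize Imports confirm it:

import org.knime.core.data.container.ColumnRearranger;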

What’s happening here is not entirely obvious. A new ColumnRearranger is created with the input DataTableSpec. A cell factory (which we’ll create next) is instantiated with the values it will need later: the index of the selected column and the ExecutionContext. The column (cells) produced by this factory is appended, via append(), to the columns passed through (or rearranged) by the ColumnRearranger. The real work is done in the CellFactory.

Add this private class to the end of the NodeModel.


private class MyFirstScalaNodeCellFactory implements CellFactory, AppendedCellFactory {

	// index of the selected Integer column and the execution context for progress updates
	private int m_factorColumnIdx;
	private ExecutionContext m_exec;

	public MyFirstScalaNodeCellFactory(int factorColumnIdx, ExecutionContext exec) {
		m_factorColumnIdx = factorColumnIdx;
		m_exec = exec;
	}

	@Override
	public DataCell[] getAppendedCell(DataRow row) {
		return getCells(row);
	}

	@Override
	public DataCell[] getCells(DataRow row) {
		// purposely left blank for now; filled in once the Scala code exists
		return new DataCell[]{};
	}

	@Override
	public DataColumnSpec[] getColumnSpecs() {
		// the single appended column: a String column named "Factorial"
		return new DataColumnSpec[]{new DataColumnSpecCreator("Factorial", StringCell.TYPE)
				.createSpec()};
	}

	@Override
	public void setProgress(int curRowNr, int rowCount, RowKey lastKey, ExecutionMonitor exec) {
		m_exec.setProgress((double) curRowNr / rowCount);
	}

}

Let’s have a closer look at what’s happening in the class. The CellFactory and AppendedCellFactory interfaces have to be implemented to append cell(s). The constructor initializes the private class’s fields to the passed-in values; the only values we need are the column index, to get the input integer value, and the ExecutionContext, to keep KNIME updated about the current progress. getAppendedCell() simply calls getCells(), because we’re appending cells and getCells() just comes along with the CellFactory interface.

I’ve purposely left getCells() blank here.

Program the processing part in Scala

Now’s the time to start programming in Scala. The first thing we need to do is convert the Java project to a Scala project. The main difference is that Eclipse will then use the Scala compiler to compile both the Scala and the Java code. This is done by changing the “nature” of the project to a Scala nature: right-click the project and choose Configure->Add Scala Nature.

[Screenshot: add-scala-nature]

You may get an “Updated Required Bundles” error. You can ignore this. You may also get the following error.

org.apache.commons.io_2.4.0.jar of MyFirstScalaKnimeNode build path is cross-compiled with an incompatible version of Scala (2.4.0). In case this report is mistaken, this check can be disabled in the compiler preference page.

This is mistaken in this case, or can at least be ignored for now, so the check needs to be disabled in the compiler preference page.

Under Project->Properties->Scala Compiler, check “Use Project Settings” and then, under the Build Manager tab, deselect “withVersionClasspathValidator”.

[Screenshot: check-scala-compile]

Click “Apply” and “OK” and the error is gone.

You may also get one or both of the following errors:

SBT builder crashed while compiling. The error message is ‘class “org.knime.core.node.InvalidSettingsException”‘s signer information does not match signer information of other classes in the same package’. Check Error Log for details.

Error in Scala compiler: class “org.knime.core.node.InvalidSettingsException”‘s signer information does not match signer information of other classes in the same package

This requires setting the Scala compiler’s target to JVM 1.8 (also on the Scala Compiler preference page) to match the already-compiled Java code.

Now we can finally write some Scala code!

In the de.uni_konstanz.knime.scala.myfirst package, create a new Scala class called BigIntFactorial.

[Screenshot: create-scala-class]

The Scala IDE’s new-class wizard is a bit different from the Java one, so just type the class name after the prefilled package name and click “Finish.”

We’ll just use a simple product to calculate the factorial. We could have done this with recursion, but with larger integers, this could cause us to run out of stack space.

Add this method to the Scala class:


def calcFactorial(f : Int) : BigInt = {
    var fac : BigInt = 1
    for (i <- 1 to f)
      fac = fac * i
    fac
}

This method is exactly what you think it is: it takes an integer f and returns a BigInt (Scala’s version of BigInteger, which actually wraps a BigInteger underneath) holding the product of the counter values from 1 up to f; for example, calcFactorial(5) returns 120. No error checking is performed for values less than 0.

Back in the NodeModel, we’ll change the method getCells() in the private CellFactory class to the following:


public DataCell[] getCells(DataRow row) {
	BigIntFactorial bif = new BigIntFactorial();
	int f = ((IntCell)row.getCell(m_factorColumnIdx)).getIntValue();
			
	BigInt i = bif.calcFactorial(f);
			
	return new DataCell[] {new StringCell(i.toString())};
}

We instantiate a Scala(!!) object in the method just as we would a Java object and call the method just as if it were a Java method. No translation necessary. We can even use BigInt, a Scala class, as a type in Java just like we would BigInteger. Seamless!

From the KNIME API, we get the value from the passed-in DataRow by using the getCell() method, casting it to an IntCell and calling the getIntValue() method.

Test the Code

Time to get this thing running! Click the debug button. If you get the following error message:

JDT Weaving is currently disabled. The Scala IDE needs JDT Weaving to be active,
or it will not work correctly.

Activate JDT Weaving and Restart Eclipse? (Highly Recommended)

[OK] [Cancel]

click “Cancel” as this won’t affect our test.

Create a simple workflow with the Table Creator node and your Scala node, put some simple values in it and test your code.

After that works, try it with a large number: 1000 is good. The output, a result with over 2,500 digits, really tests the BigInt and String handling.

[Screenshot: big-factorial]

In Part 2, we’ll do some things that will make integrating Scala worth the effort.
