Switching Kernels: Using Python 2.7 and Python 3.5 in Jupyter Notebooks

Jupyter Notebooks are a great way to work with Python interactively. Integrating Python code into documents is very useful for reports or for writing executable documentation of algorithms and functions. The text can be structured and exported in various formats. With the ever-increasing popularity of Python, driven by the data science hype, more and more libraries are available. Although Python 3 is considered to be the future of Python, no consensus has yet been reached on the question of Python 2.7 vs. Python 3.5. There are quite a few differences between the two, and Python 3 is not backwards compatible, so the same code often cannot be executed with both versions without modification. When you install Jupyter Notebooks via Anaconda, Python 3 is recommended, but Python 2.7 packages also exist.

As a large number of libraries have not yet been ported to Python 3, it can be useful to switch between the language versions within a Jupyter Notebook. The following example assumes that you have both Python versions already installed.

Installing a new Kernel

In Jupyter Notebooks, the kernel is responsible for executing Python code. When you install the Anaconda System for Python3, this version also becomes the default for the notebooks. In order to enable Python 2.7 in your notebooks, you need to install a new kernel like this:
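One common way to do this, assuming a `python2` interpreter with the `ipykernel` package installed, is to run `python2 -m ipykernel install --user`. In essence, that command writes a small kernel specification file named kernel.json. The following sketch builds such a file by hand to show what it contains. The interpreter path `/usr/bin/python2`, the kernel name `python2` and the helper function names are assumptions made for illustration, and the file is written to a temporary directory rather than to the real per-user kernel directory.

```python
import json
import os
import tempfile


def build_kernel_spec(python_executable, display_name):
    """Return the content of a kernel.json file as a dictionary."""
    return {
        # Command Jupyter runs to start the kernel process.
        "argv": [python_executable, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        # Name shown in the notebook's kernel menu.
        "display_name": display_name,
        "language": "python",
    }


def write_kernel_spec(kernels_dir, name, spec):
    """Write spec to <kernels_dir>/<name>/kernel.json and return the path."""
    spec_dir = os.path.join(kernels_dir, name)
    if not os.path.isdir(spec_dir):
        os.makedirs(spec_dir)
    path = os.path.join(spec_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2)
    return path


# Hypothetical interpreter path; adjust it to your Python 2.7 installation.
demo_dir = tempfile.mkdtemp()
spec_path = write_kernel_spec(
    demo_dir, "python2", build_kernel_spec("/usr/bin/python2", "Python 2.7"))

with open(spec_path) as handle:
    loaded = json.load(handle)
print(json.dumps(loaded, indent=2))
```

On Linux, per-user kernel specifications usually live below ~/.local/share/jupyter/kernels, one directory per kernel.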

Restart Jupyter to activate the new Python 2.7 kernel.

Switching Kernels

After restarting Jupyter, you can easily select the kernel, and thereby the Python version that runs your code, from the Kernel menu:
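To confirm which interpreter the selected kernel actually uses, a notebook cell along the following lines can help. This is only a small sketch; the function name `kernel_summary` is made up for this example.

```python
import platform
import sys


def kernel_summary():
    """Collect basic facts about the interpreter behind the current kernel."""
    return {
        "version": platform.python_version(),  # e.g. "2.7.x" or "3.5.x"
        "major": sys.version_info[0],
        "executable": sys.executable,          # path of the running interpreter
    }


summary = kernel_summary()
print(summary)
```

Running the same cell again after switching the kernel should report the other version.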

 

 



A MySQL 5.7 Cluster Based on Ubuntu 16.04 LTS – Part 2

In a recent article, I described how to set up a basic MySQL Cluster with two data nodes and a combined SQL and management node. In this article, I am going to highlight a few more things and we are going to adapt the cluster a little bit.

Using Hostnames

To make our lives easier, we can use hostnames, which are easier to remember than IP addresses. Hostnames can be specified for each VM in the file /etc/hosts . For each request to the hostname, the operating system will look up the corresponding IP address. We need to change this file on all three nodes to the following example:

Now we can look up IP addresses by the associated hostname.
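As a rough illustration of the entries in question, the following sketch renders /etc/hosts lines from a mapping. The IP addresses and the hostnames node1 to node3 are hypothetical placeholders, not the values of the actual setup, and the function name is made up for this example.

```python
from collections import OrderedDict

# Hypothetical addresses and names; substitute the values of your own VMs.
CLUSTER_HOSTS = OrderedDict([
    ("192.168.0.101", "node1"),  # management and SQL node
    ("192.168.0.102", "node2"),  # first data node
    ("192.168.0.103", "node3"),  # second data node
])


def render_hosts_entries(hosts):
    """Return one 'IP<TAB>hostname' line per entry, as used in /etc/hosts."""
    return "\n".join("{}\t{}".format(ip, name) for ip, name in hosts.items())


hosts_text = render_hosts_entries(CLUSTER_HOSTS)
print(hosts_text)
```

The printed lines would be appended to /etc/hosts on every node so that all machines resolve the same names.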

Adapting Node1

The management and the SQL node are both running in the same VM. Now that we have introduced hostnames for all machines, we need to reconfigure the cluster manager and the MySQL Server. The cluster configuration is stored in the file /opt/mysql/server-5.7/mysql-cluster/config.ini . Replace all the IP addresses with the corresponding hostnames.

For reconfiguring the SQL node, we need to adapt the file /etc/my.cnf . We need to replace all IP addresses in this file too.
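If you prefer not to edit the files by hand, the replacement can be scripted. The following is a minimal sketch that swaps IP addresses for hostnames in a configuration text. The addresses, hostnames and the function name are assumptions for illustration. The pattern only matches complete addresses, so a listed address is never replaced inside a longer one.

```python
import re


def replace_ips_with_hostnames(text, ip_to_hostname):
    """Replace every listed IP address in text with its hostname."""
    alternatives = "|".join(re.escape(ip) for ip in ip_to_hostname)
    # The lookarounds make sure we only match whole addresses, so that
    # 192.168.0.10 is not replaced inside 192.168.0.101.
    pattern = re.compile(
        r"(?<!\d)(?<!\d\.)(" + alternatives + r")(?!\.?\d)")
    return pattern.sub(lambda match: ip_to_hostname[match.group(1)], text)


# Hypothetical mapping and configuration fragment.
mapping = {"192.168.0.101": "node1", "192.168.0.102": "node2"}
sample = "[ndb_mgmd]\nHostName=192.168.0.101\n\n[ndbd]\nHostName=192.168.0.102\n"

print(replace_ips_with_hostnames(sample, mapping))
```

The same function can be applied to the content of config.ini and of /etc/my.cnf before writing the files back.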

When done, stop the cluster and the SQL node like this:

Adapt the Data Nodes

Replacing the IP addresses with hostnames is pretty straightforward. On each data node, replace the IP address with the hostname in the file /etc/my.cnf :

Start Up the Cluster

Now that we have replaced all IP addresses with hostnames, we can restart the cluster as follows:

Now the cluster should be up again. Verify it like this:

Encrypt Data in Transit

To improve the security of the cluster, we can encrypt the traffic which is exchanged between the MySQL node and the MySQL client. Please note that this encryption method is not specific to the cluster, but rather encrypts the traffic between the MySQL client and the MySQL server. You can also use this method for standard MySQL server setups. It does not cover encryption between the data nodes. By design, data nodes are intended to be run in a private network, not via the open Interwebs.

Create a CA Authority Key and Certificate

To do so, we work on Node 1. In the first step, we create a key for our certificate authority (CA) and a CA certificate.

Create a Server Certificate

In the second step, we create a server certificate and sign it with the CA from the previous step.

The server key is ready. We now need to create a client certificate.

Create a Client Certificate

In a rather similar fashion, we create a client certificate.

Now we have created the keys and can configure MySQL for using SSL encryption.

Configure MySQL to use Encryption

Add the following configuration parameters to the /etc/my.cnf file in order to define server and client certificates and keys.

Restart the MySQL server and test the SSL encryption. You can immediately see that the client uses SSL, although it does not really make a lot of sense for the local user.

Let’s create a new test user, allow remote access from 192.168.0.12 and enforce SSL connections:

Switch to the console of the machine 192.168.0.12 and try to connect. You can then display some variables about the SSL configuration.

To verify that you are actually using SSL encryption, the following command helps. Its output includes the line SSL: Cipher in use is DHE-RSA-AES256-SHA :

What about the Data Nodes?

So far we have encrypted only the traffic between the MySQL client and the MySQL server. This configuration does not differ from a single server or replication setup and does not include the traffic between the data nodes, which is not encrypted at all. MySQL Cluster has been designed to be run in a controlled environment within a local high-speed network. The MySQL Cluster FAQ states:

It is very unlikely that a cluster would perform reliably under such conditions, as NDB Cluster was designed and implemented with the assumption that it would be run under conditions guaranteeing dedicated high-speed connectivity such as that found in a LAN setting using 100 Mbps or gigabit Ethernet—preferably the latter. We neither test nor warrant its performance using anything slower than this.

Also, it is extremely important to keep in mind that communications between the nodes in an NDB Cluster are not secure; they are neither encrypted nor safeguarded by any other protective mechanism. The most secure configuration for a cluster is in a private network behind a firewall, with no direct access to any Cluster data or management nodes from outside.

 

 

 



A MySQL 5.7 Cluster Based on Ubuntu 16.04 LTS – Part 1

A Cluster Scenario

In this example we create the smallest possible MySQL cluster based on four nodes running on three machines. Node 1 will run the cluster management software, Node 2 and Node 3 will serve as data nodes, and Node 4 is the MySQL API node, which runs on the same VM as Node 1.

What is a Cluster and When to Use it?

When applications grow, there are several possibilities to improve the performance of the database layer. The two technologies used in most scenarios are replication and clustering.

Replication is used for moving the data from a master database to one or more slave databases. By default, this data exchange is asynchronous. Semi-synchronous data replication can be implemented, but if you need synchronous data replication on several machines, clusters are the way to go.

A MySQL Cluster implements a shared-nothing architecture and supports synchronous replication. You can very easily add nodes to and remove nodes from your cluster and scale out horizontally. This is especially useful if you need to scale the database writes. Clusters use a special storage engine called NDB. This technology offers high availability and high redundancy. The documentation for a MySQL 5.7 cluster can be found here.

Preconditions

In this example, we will use VirtualBox as the virtualization platform and run three nodes as the smallest possible cluster. In the first step, we create a template virtual machine, which comes with the basic packages, passwordless key-based authentication and a manually configured IP address. We will clone this machine and adapt it to our needs. Parts of this tutorial are based on the DigitalOcean how-to.

The Template

We create the first VM based on Ubuntu 16.04 LTS and install the basics. MySQL requires the libaio1 package, which we will install in the template like this:

Then it is time to download the MySQL Cluster software, which can be obtained for free from the MySQL developer page. These are the two necessary steps:

Assign a static IP address in the file /etc/network/interfaces .

Make sure to install the guest additions, add a MySQL user, copy the MySQL startup script etc. Now the template machine is ready. Shut it down and clone it two times. Make sure to assign a new MAC address and then assign a different, static IP address.

Setting Up the Machines

In the following steps, we are going to set up the virtual machines. We run three machines which provide four services.

Install the Manager on Node 1

The cluster will be controlled via Node 1. First, create a data directory, where we will store the configuration files.

Then create the configuration file. This file defines the cluster and will be read during the startup of the cluster. Each node needs to have a unique ID. Usually, the ID starts with 0, but in this example we defined the node IDs starting from 1 in order to align them with the IP addresses.
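To give an idea of the overall shape of such a file, the following sketch renders a config.ini for one management node, two data nodes and one SQL node, using the node IDs 1 to 4 from the scenario above. The section names and the parameters NoOfReplicas, NodeId, HostName and DataDir are regular NDB Cluster configuration options. The IP addresses, the data directory and the function names are assumptions, and the original file may well contain further settings.

```python
def render_section(name, options):
    """Render one INI section from a list of (key, value) pairs."""
    lines = ["[{}]".format(name)]
    lines.extend("{}={}".format(key, value) for key, value in options)
    return "\n".join(lines)


def render_cluster_config(management_host, data_hosts, sql_host,
                          data_dir, replicas=2):
    """Render a minimal config.ini; node IDs are assigned starting from 1."""
    sections = [render_section("ndbd default", [("NoOfReplicas", replicas)])]
    sections.append(render_section("ndb_mgmd", [
        ("NodeId", 1), ("HostName", management_host), ("DataDir", data_dir)]))
    next_id = 2
    for host in data_hosts:
        # Every data node gets its own [ndbd] section with a unique ID.
        sections.append(render_section("ndbd", [
            ("NodeId", next_id), ("HostName", host), ("DataDir", data_dir)]))
        next_id += 1
    sections.append(render_section("mysqld", [
        ("NodeId", next_id), ("HostName", sql_host)]))
    return "\n\n".join(sections) + "\n"


# Hypothetical addresses and data directory.
config_text = render_cluster_config(
    management_host="192.168.0.101",
    data_hosts=["192.168.0.102", "192.168.0.103"],
    sql_host="192.168.0.101",
    data_dir="/var/lib/mysql-cluster",
)
print(config_text)
```

The sections are rendered by hand because the file repeats the [ndbd] section once per data node, which Python's configparser module does not allow.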

Note that if you make changes to the config.ini files, they might not be loaded directly, but are read from a cache file. So if you change for instance the data directory, make sure to delete all cached files, e.g. ndb_1_config.bin.1*.

Install the MySQL Server on Node1

The installation of the MySQL service is pretty straightforward and follows a similar pattern to a regular MySQL setup. You just need to ensure that you use the Cluster version which we already installed in the template machine and create symlinks accordingly. The configuration of the MySQL Server is done in /etc/my.cnf , as usual. In the following you can see an example configuration.

The most important bits are at the bottom of the file. The option ndbcluster makes the NDB storage engine available and the section [mysql_cluster] points to the cluster management node. In our case, the MySQL server node and the cluster management node are on the same machine. In larger settings it can be beneficial to have a dedicated VM for each task.
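The two cluster-specific parts described above can be sketched as follows. The example uses Python 3's configparser module to render only these bits, not the full example configuration. The hostname node1 and the function name are assumptions. `ndbcluster` and `ndb-connectstring` are the regular MySQL option names.

```python
import configparser
import io


def render_sql_node_config(management_host):
    """Render the cluster-related part of /etc/my.cnf for the SQL node."""
    # allow_no_value lets us write flag-style options such as "ndbcluster".
    parser = configparser.RawConfigParser(allow_no_value=True)
    parser.add_section("mysqld")
    parser.set("mysqld", "ndbcluster")  # enables the NDB storage engine
    parser.add_section("mysql_cluster")
    # Tells the node where to find the cluster management node.
    parser.set("mysql_cluster", "ndb-connectstring", management_host)
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical hostname of the management node.
my_cnf_fragment = render_sql_node_config("node1")
print(my_cnf_fragment)
```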

Make sure to set the permissions for the data directory correctly and to place the startup script in /etc/init.d/mysql . Make sure to start the MySQL server on Node 1 and check the error.log file for any strange messages.

Setting up the Data Nodes

In the next step, we need to configure the data nodes. Launch Node 2 and log in.

The basic configuration is very simple. Just edit the /etc/my.cnf file and provide the address of the management node.

Then we can start the data node with the following command:

You should see something like this:

If something goes wrong, check the error log file called ndb_X_out.log , where X is the node ID. So in this example the actual file name is ndb_2_out.log . Proceed the same way with the second data node, called Node 3.

Starting the Cluster

Now comes the great moment. We start the cluster. Please note that you cannot start the cluster before you have configured at least two data nodes and a MySQL API node. All components need to be running, otherwise there will be error messages.

If there are no errors, we can log in to the management console for the cluster with the following command.

You can then get an overview of your cluster with the show command. In the ideal case, the output reads as follows:

Connecting to MySQL

Now that our cluster is up and running, we can interact with MySQL by connecting to the MySQL server running on Node 1. In order to distribute the data automatically across the nodes, we need to utilise the NDB engine when we create our tables. NDB is a distributed storage engine and comes with different features than InnoDB. A comparison is given here. Connect to the MySQL instance on Node 1 as you always would.

Then we can create a test table to see if the system works properly. In contrast to replication, each node does not store the whole data set; instead, the data is distributed over at least two nodes. This is also the reason why we need to have at least two nodes.

Sakila in the Cluster

We can use the famous Sakila database that is available from the MySQL web site to test our Cluster briefly. The following series of commands retrieves the database and imports it into our cluster.

Now the data resides in our cluster and is distributed across our two data nodes. We can run the following sample query and the system transparently gets the data from the two nodes.

Caveats

If you try to start the cluster before all components are up and ready, finding the error can be a bit tricky. In the following example, we did not start the data nodes. The MySQL API node is running, but it does not show up. 

Check the error log of the management cluster in that case. As long as the cluster does not consist of at least two data nodes and a MySQL API node, it will not start.



Parsing SQL Statements

JDBC and the Limits of ResultSet Metadata

For my work in the area of data citation, I need to analyse queries which are used for creating subsets. I am particularly interested in query parameters, sortings and filters. One of the most commonly used query languages is SQL, which is used by many relational database management systems such as MySQL. In some cases, the interaction with databases is abstract, meaning that hardly any SQL statements are executed directly. The SQL statements are rather built on the fly by object-relational mappers such as Hibernate. Other scenarios use SQL statements as Strings and also prepared statements, which are executed via JDBC. However, analysing SQL statements is tricky as the language is very flexible.

In order to understand what columns have been selected, it is sufficient to utilise the ResultSet Metadata and retrieve the column names from there. In my case I need to extract this information from the query in advance and potentially enforce a specific sorting by adding columns to the ORDER BY clause. In this scenario, I need to parse the SQL statement and retrieve this information from the statement itself. Probably the best way to do this would be to implement a parser for the SQL dialect with ANTLR (ANother Tool for Language Recognition). But this is quite a challenge, so I decided to take a shortcut: FoundationDB.
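The limitation of the metadata route can be illustrated with a small sketch. It uses Python's sqlite3 module as a stand-in for JDBC, where `cursor.description` plays a role comparable to the ResultSet Metadata. The table `film`, its columns and the function name are made up for this example.

```python
import sqlite3


def column_names_after_execution(connection, sql):
    """Run the query and read the column names from the cursor metadata."""
    cursor = connection.execute(sql)
    # Each entry of cursor.description describes one result column;
    # the first element of the entry is the column name.
    return [column[0] for column in cursor.description]


connection = sqlite3.connect(":memory:")
# Hypothetical table, only needed so that the query can be executed.
connection.execute(
    "CREATE TABLE film (film_id INTEGER, title TEXT, release_year INTEGER)")

names = column_names_after_execution(
    connection, "SELECT title, release_year FROM film ORDER BY title")
print(names)
```

The column names only become available once the statement has been executed, and the metadata says nothing about the ORDER BY clause, which is why the statement itself has to be parsed.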

The FoundationDB Parser

FoundationDB was a NoSQL database which provided several layers for supporting different paradigms at once. I am using the past tense here, because the project was acquired by Apple in 2015, which since then does not pursue the open source project any more. However, the Maven libraries for the software are still available at Maven Central. FoundationDB uses its own SQL parser, which understands standard SQL queries. These queries can be interpreted as a tree, and the parser library allows traversing SQL statements and analysing the nodes. We can use this tree to parse and interpret SQL statements and extract additional information.

The Foundations of FoundationDB

The FoundationDB parser can be included into your own project with the following Maven dependency:

The usage of the parser is straightforward. We use the following example SQL statement as input:

The following function calls the parser and prints the tree of the statement.

The resulting tree is listed below. The statement has also been normalized, which ensures a stable sequence of the parameters.

This tree offers a lot of information, which can be used programmatically as well. At the top of the output, we can see that the statement was a SELECT statement and that it was not DISTINCT. Then follows the ResultSet, which contains a list of the three ResultColumns which have been specified in the SELECT clause. We can see the column names and the table names from which they are drawn. The next block provides the referenced tables (the FROM list) and their alias names. The WHERE block contains the operands which have been used for filtering and, last but not least, there is the list of ORDER BY clauses and their sorting directions.

The Visitor

In order to access the information shown above programmatically, we need to access the content of the nodes one by one. This can be achieved with the visitor pattern, which traverses all the nodes of the tree. The following listing shows how the visitor pattern can be used for accessing the list of columns from the SELECT clause.

In this code example, we define a visitor which traverses all the ResultColumn nodes. Every time the current node is an instance of ResultColumn, we add this node to our list of columns. The nodes are only visited if they are children of a SELECT statement. This is our entry point into the tree. We leave the tree when we reach the FROM list. We then apply the visitor to the statement, which initiates the traversal. As a result, we receive a list of columns which have been used for the result set.

In order to get the list of ORDER BY columns, we can utilise a similar approach. The following function gives an example:

This time, we retrieve the list of ORDER BY columns directly from the CursorNode. Similar principles can be used for manipulating SQL statements and applying a different sorting, for instance.
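The traversal logic described here can be illustrated independently of the parser library. The following Python sketch is not the FoundationDB API. It builds a tiny hand-made tree with hypothetical node classes and a visitor that collects result columns and does not descend into the FROM list.

```python
class Node(object):
    """Generic tree node that lets a visitor walk over it."""

    def __init__(self, *children):
        self.children = list(children)

    def accept(self, visitor):
        # Show this node to the visitor, then descend unless told to skip.
        visitor.visit(self)
        if not visitor.skip_children(self):
            for child in self.children:
                child.accept(visitor)


class SelectNode(Node):
    pass


class FromList(Node):
    pass


class ResultColumn(Node):
    def __init__(self, name):
        super(ResultColumn, self).__init__()
        self.name = name


class ResultColumnCollector(object):
    """Visitor that records the names of all ResultColumn nodes it sees."""

    def __init__(self):
        self.columns = []

    def visit(self, node):
        if isinstance(node, ResultColumn):
            self.columns.append(node.name)

    def skip_children(self, node):
        # Leave the tree at the FROM list, so columns of subqueries
        # inside it are not collected.
        return isinstance(node, FromList)


# Hand-made tree for: SELECT title, release_year FROM (SELECT inner_id ...)
tree = SelectNode(
    ResultColumn("title"),
    ResultColumn("release_year"),
    FromList(SelectNode(ResultColumn("inner_id"))),
)

collector = ResultColumnCollector()
tree.accept(collector)
print(collector.columns)
```

Stopping at the FROM list is what keeps the columns of nested queries out of the result.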

 

 



Use Fontawesome with pdfLatex

Using Fontawesome in Documents

Fontawesome offers more than 600 scalable vector icons which can be used via CSS. The same icon kit is also available for use in Latex. Until recently, the package was only available for Xelatex and LuaTex, but it is now also available for pdfLatex. The installation is a bit tricky, so I would like to share how I managed to install the font for Latex.

Get Hold of the Package

You can download the package from CTAN and store it locally on your drive. Alternatively you can also use wget:

Obey the Structure!

The TexLive system utilises a special directory structure for finding the packages and their content. The package we obtained from CTAN does not currently comply with this structure, which is why the installation procedure differs a little bit from the norm. You should know that TexLive utilises several repositories for storing packages: a global one in the directory /usr/local/share/texmf and a local one in the home directory, usually ~/texmf . You can print the directory that TexLive is using with the following command:

We will install Fontawesome in the user's directory as this does not require root privileges. After googling for a little while, I found this helpful post, which describes how to copy the needed files to their appropriate location. First, we create the directory structure in the local TexLive repository of the user.
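As a rough sketch of what such a structure can look like, the following example creates a set of directories that follow the usual TeX Directory Structure conventions for a font package. The exact list of directories is an assumption and may differ from the one in the linked post. To keep the example harmless, it works in a temporary directory instead of the real ~/texmf .

```python
import os
import tempfile

# Assumed TDS locations for the different file types of a font package.
FONT_DIRECTORIES = [
    "fonts/enc/dvips/fontawesome",        # encoding files (.enc)
    "fonts/map/dvips/fontawesome",        # map file (.map)
    "fonts/tfm/public/fontawesome",       # font metrics (.tfm)
    "fonts/type1/public/fontawesome",     # Type 1 fonts (.pfb)
    "fonts/opentype/public/fontawesome",  # OpenType font (.otf)
    "tex/latex/fontawesome",              # style and definition files
]


def create_texmf_tree(texmf_root, relative_dirs=FONT_DIRECTORIES):
    """Create the directories below texmf_root and return their paths."""
    created = []
    for relative in relative_dirs:
        path = os.path.join(texmf_root, *relative.split("/"))
        if not os.path.isdir(path):
            os.makedirs(path)
        created.append(path)
    return created


# Use os.path.expanduser("~/texmf") instead to work on the real tree.
demo_root = os.path.join(tempfile.mkdtemp(), "texmf")
created_paths = create_texmf_tree(demo_root)
for path in created_paths:
    print(path)
```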

In the second step, unzip the package file and copy the content into the directory structure:

Now everything is in the right place. Following the blog post I linked above unfortunately leads to an error stating that the map file cannot be found. What was missing in the original post is that you need to provide the name of the map file, such that the update tool can actually find it. Reading the linked documentation actually helped and I found the trick. We need to create a directory called web2c in the ~/texmf directory and create a file called updmap.cfg in it. This file needs to point to the map file of Fontawesome and we are all set:

Install the new Font

Then we can proceed as described and update the package manager of TexLive and install the font like this:

Try it!

This minimal example shows how to test and use the Fontawesome package.

The final result looks like this. Isn’t it beautiful?

Fontawesome Example
