Data Wrangling with csvkit and SQLite

As mentioned earlier, csvkit is a very convenient tool for handling coma separated text files, especially when they are too large to be processed with conventional spread sheet applications like Excel or Libre Office Calc. The limits of Office programs can rather easy be reached, especially when dealing with scientific data. Open Office Calc supports the following limits.

  • maximum number of rows: 1,048,576
  • maximum number of columns: 1,024
  • maximum number of sheets: 256

Excel offers  also 1 048 576 rows but provides 16,384 columns. SQLite in contrast allows by default 2000 columns and provides if really needed up to 32767 columns if complied with a specific setting. In terms of row storage, SQLite provides a theoretical maximum number of 264 (18446744073709551616) rows. This limit is unreachable since the maximum database size of 140 terabytes will be reached first.

The limits we discussed will not be hit the our example of air traffic data, which we obtain from ourairports.com. You can download the sample file with currently 47409 airports described in the CSV format from the linked web page.

This little command provides us with the statistics of the columns in the file. We see that the file we provided offers 18 columns and we also can immediately see the column types, if there are null values and what the 5 most frequent values are. If we are interested in a list of columns only, we can print them with the following command.

We can also use the csvcut command for – you expect it already – cutting specific columns from the CSV file, in order to reduce the size of the file and only retrieve the columns that we are interested in. Image you would like to create a list of all airports per region. Simply cut the columns you need and redirect the output into a new file. The tool csvlook provides us with a MySQL-style preview of the data.

We could then sort the list of airports alphabetically in reverse and write the new list into a new file. We specify the name of the column we want to sort the file and We measure the execution time needed by prepending the command time.

A nice feature of csvkit is its option to query CSV files with SQL. You can formulate SELECT queries and it even supports joins and other tricks. Thus you can achieve the same result with just one SQL query.

Obviously, this is not the fastest possibility and may not be suitable for larger data sets. But csvkit offers more: You can create SQL tables automatically by letting csvkit browse through your CSV files. It will try to guess the column type, the appropriate field length and even constraints.

The second command in the listing above simply stores the table in a separate file.  We can import this CREATE TABLE statement by reading the file in SQLite. Change to the folder where you downloaded SQLite3 and create a new database called AirportDB. The following listing contains SQL style comments (starting with –) in order to improve readability.

You can also achieve the same results directly from Bash, simply by piping the SQL file to the database.

We created a complex SQL table by automatically parsing CSV files. This gives a lot of opportunities, also for Excel spreadsheets and other data available in CSV. The great thing about csvkit is that it supports a large variety of database dialects. You can use the same command by adapting the -i parameter for the following database systems:

  • access
  • sybase
  • sqlite
  • informix
  • firebird
  • mysql
  • oracle
  • maxdb
  • postgresql
  • mssql

All major systems are supported, which is a great benefit. Now that we have the schema ready, we need to import the data into the SQLite database.  We can use the SQLite client to import the CSV file into the database, but suddenly we run into a problem! The 12th column contains boolean values, as correctly identified by the csvkit tool. When we inspect the file again with csvlook, we can see that the column contains ‘yes’ and ‘no’ values. Unfortunately SQLite does not understand this particular notion of boolean values, but rather expects 0 for false and 1 for true, as described in the data types documentation.We have two options: We could replace the values of yes and no by their corresponding integer, for instance with awk:

Or, much more comfortably, we could again use csvkit, which can help us out and replaces the values automatically. The following command imports the data into our database. As we already created the table in advance, we can skip the process with the appropriate flag.

This takes a little while, but after a few seconds, we have the data ready. We can then open the database and query our Airport data set.

You can now use the data in an advanced way and also may utilise advanced database features such as indices in order to speed up the data processing. If we compare again the execution of the same query on the CSV file and within SQLite, the advantage becomes much more obvious if we omit command line output, for instance by querying the COUNT of the airport names.

Continue reading


Create an ER Diagram of an Existing SQLite Database (or many other RDBMS)

Visualisation helps solving problems and is therefore an important tool database design. Many database providers have their product specific tools for re-engineering existing schemata, but self-contained, serverless, embedded relational database management systems (RDBMS) such as SQLite often come without much tooling support. The extremely small footprint of SQLite provides a very powerful tool for implementing database driven applications without the hassle of database administration, user privilege management and other demanding tasks that come with more complex systems. There does not exist a workbench-like tool for SQLite, but we can use the open source SchemaCrawler for analysing database schemata and table relationships. The tool provides a plethora of commands and options, in this post we will only cover the diagramming part, which allows creating ER diagrams of the table.

After downloading and extracting the tool to your local drive, you can find a lot of examples included. The tool can handle SQLite, Oracle,  MS SQL Server, IBM DB2, MySQL, MariaDB, Postgres and Sybase database servers and is therefore very versatile. You will need Java 8 in order to run it. Have a look at the script below, which creates a PNG image of the database schema of the Chinook test database.

The SchemaCrawlerPATH variable contains the path to the directory where we unzipped the SchemaCrawler files to. This is needed in order to load all the required libraries into the classpath below. We then specify the SQLite database file, define the RDBMS and provide an output path where we store the image. Additionally, we provide an empty user name and password combination. SQLite does not provide user authentication, thus those two parameters need to be empty, SchemaCrawler simply ignores them. Then we can execute the command and the tool generates the PNG of the ER diagram for us.

ER-DiagramYou can find a lot of examples also online, which gives you an overview of the features of this tool. One of the main purposes of SchemaCrawler is to generate diffable text outputs of the database schemata. In combination with a source code version management tool such as Git or Subversion, you can create clean and usable reports of your databases and keep track of the changes there. You can retrieve an overview of the options  with the following command.

 

You can also HTML reports with the following command:

 

HTML ReportOther available output formats are plain text, CSV or JSON.

Continue reading


Importing your Repositories from Github to your Gitlab Instance

Github is a great service and offers a free micro account for students, which includes five private repositories. Still this might not be enough private repositories for some of us, especially if you start a lot of side projects (which are hardly finished :-)) and want to keep your code and data organised, versioned and collaborative but non-public (because it is not finished yet :-)). For this and other reasons I am also running a private Gitlab instance, which basically offers the same features as Github and can be hosted on your private server.

I still use Github for some projects, which means that I continuously run out of private repositories there as only five are included in my plan. Now Gitlab (the private instance) offers a great feature, which allows one to hook the Github account into the Gitlab instance and import code hosted in private repositories at Github into your Gitlab instance, where no limits for private repositories apply. Obviously the repositories could also be transferred manually, but the import feature via the Web interface is very convenient.

Installing the oAuth feature for Github and other services is very simple and described here for Gihub and there in general. Read both documentations carefully, as using the wrong settings can lead to a serious security issues. After following the instructions carefully, the warning that oAuth needs to be configured kept popping up in the Gitlab dashboard:

To enable importing projects from GitHub, as administrator you need to configure OAuth integration.

The final hint was given in this SO post here. The official documentation failed to mention that the changes in the configuration file must be updated by reconfiguring Gitlab. The following list of steps should lead to success:

  1. Login into Github and open your account settings
  2. Select Applications on the left and chose the Developer Applications tab
  3. Register a new application by providing the name (e.g. Gitlab Access), a homepage (can be anything), a description (optional) and the callback URL (e.g. https://gitlab.example.org).
  4. Add the initial OmniAuth configuration as described in the official documentation.
  5. Provide the settings for Github.
  6. Save the file and execute the following two commands.

The first time you select “Import project from Github”, you will be redirected to the Github authentication page, where you need to grant permission for the Gitlab user to retrieve the repositories. After you imported the private repositories into Gitlab, you can delete them from Github and reclaim some free space.

 

Continue reading