Data Wrangling with csvkit and SQLite

As mentioned earlier, csvkit is a very convenient tool for handling comma-separated text files, especially when they are too large to be processed with conventional spreadsheet applications like Excel or LibreOffice Calc. The limits of office programs are rather easily reached, especially when dealing with scientific data. OpenOffice Calc has the following limits:

  • maximum number of rows: 1,048,576
  • maximum number of columns: 1,024
  • maximum number of sheets: 256

Excel also offers 1,048,576 rows but provides 16,384 columns. SQLite, in contrast, allows 2,000 columns by default and, if really needed, up to 32,767 columns when compiled with a specific setting. In terms of row storage, SQLite has a theoretical maximum of 2^64 (18,446,744,073,709,551,616) rows. This limit is unreachable in practice, since the maximum database size of 140 terabytes would be reached first.

These limits will not be hit by our example of air traffic data, which we obtain from ourairports.com. You can download the sample file, which currently describes 47,409 airports in CSV format, from the linked web page.

$: csvstat airports.csv 
  1. id
	<type 'int'>
	Nulls: False
	Min: 2
	Max: 316827
	Sum: 2112054844
	Mean: 44549.6602755
	Median: 23847
	Standard Deviation: 77259.1794792
	Unique values: 47409
  2. ident
	<type 'unicode'>
	Nulls: False
	Unique values: 47409
	Max length: 7
  3. type
	<type 'unicode'>
	Nulls: False
	Unique values: 7
	5 most frequent values:
		small_airport:	30635
		heliport:	9098
		medium_airport:	4536
		closed:	1623
		seaplane_base:	927
	Max length: 14
......

This little command provides us with statistics for every column in the file. We see that the file contains 18 columns, and we can also immediately see each column's type, whether it contains null values and what its five most frequent values are. If we are only interested in a list of the columns, we can print them with the following command.

$: csvcut -n  airports.csv 
  1: id
  2: ident
  3: type
  4: name
  5: latitude_deg
  6: longitude_deg
  7: elevation_ft
  8: continent
  9: iso_country
 10: iso_region
 11: municipality
 12: scheduled_service
 13: gps_code
 14: iata_code
 15: local_code
 16: home_link
 17: wikipedia_link
 18: keywords

We can also use the csvcut command for – you guessed it already – cutting specific columns from the CSV file, in order to reduce the size of the file and retrieve only the columns that we are interested in. Imagine you would like to create a list of all airports per region. Simply cut the columns you need and redirect the output into a new file. The tool csvlook provides us with a MySQL-style preview of the data.

$: csvcut --columns=name,iso_country,iso_region  airports.csv > airports_country_region.csv
$: csvlook airports_country_region.csv | head -n 15
|--------------------------------------------------------------------------------+-------------+-------------|
|  name                                                                          | iso_country | iso_region  |
|--------------------------------------------------------------------------------+-------------+-------------|
|  Total Rf Heliport                                                             | US          | US-PA       |
|  Lowell Field                                                                  | US          | US-AK       |
|  Epps Airpark                                                                  | US          | US-AL       |
|  Newport Hospital & Clinic Heliport                                            | US          | US-AR       |
|  Cordes Airport                                                                | US          | US-AZ       |
|  Goldstone /Gts/ Airport                                                       | US          | US-CA       |
|  Cass Field                                                                    | US          | US-CO       |
|  Grass Patch Airport                                                           | US          | US-FL       |
|  Ringhaver Heliport                                                            | US          | US-FL       |
|  River Oak Airport                                                             | US          | US-FL       |
|  Lt World Airport                                                              | US          | US-GA       |
|  Caffrey Heliport                                                              | US          | US-GA       |

We could then sort the list of airports alphabetically in reverse order and write the result into a new file. We specify the name of the column we want to sort by and measure the execution time by prepending the time command.

$: time csvsort -c "name" --delimiter="," --reverse airports_country_region.csv > airports_country_region_sorted.csv

real    0m1.177s
user    0m1.135s
sys    0m0.042s

A nice feature of csvkit is the option to query CSV files with SQL. You can formulate SELECT queries, and it even supports joins and other tricks. Thus you can achieve the same result as above with just one SQL query.

$: time csvsql -d ',' --query="SELECT name,iso_country,iso_region FROM airports ORDER BY name DESC" airports.csv > sql_airports_country_region.csv

real	0m11.626s
user	0m11.532s
sys	0m0.090s
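
The same mechanism works across multiple CSV files, each of which is exposed as a table named after the file. A minimal sketch of a join, assuming a hypothetical countries.csv with at least the columns code and name:

$: csvsql --query="SELECT a.name, c.name AS country FROM airports a JOIN countries c ON a.iso_country = c.code" airports.csv countries.csv > airports_with_countries.csv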

Obviously, querying the CSV file this way is not the fastest option and may not be suitable for larger data sets. But csvkit offers more: you can create SQL tables automatically by letting csvkit browse through your CSV files. It will try to guess the column type, the appropriate field length and even constraints.

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports airports.csv 
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports -u 0 airports.csv > airport_schema.sql

The second command in the listing above simply stores the CREATE TABLE statement in a separate file. We can import this statement by reading the file in SQLite. Change to the folder where you downloaded SQLite3 and create a new database called AirportDB. The following listing contains SQL-style comments (starting with --) in order to improve readability.

./sqlite3 AirportDB.sqlite


SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
-- change the column separator from its default | to ;
sqlite> .separator ;
-- read the SQL file we created before
sqlite> .read /home/stefan/datawrangling/airport_schema.sql
-- list all tables
sqlite> .tables
Airports
-- print table schema

sqlite> .schema Airports
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

You can also achieve the same results directly from Bash, simply by piping the SQL file to the database.

$: cat ~/datawrangling/airport_schema.sql | ./sqlite3 AirportDB.sqlite
$: ./sqlite3 AirportDB.sqlite ".tables"
Airports
$: ./sqlite3 AirportDB.sqlite ".schema Airports"
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

We created a complex SQL table by automatically parsing CSV files. This opens up a lot of opportunities, also for Excel spreadsheets and other data available as CSV. The great thing about csvkit is that it supports a large variety of database dialects. You can use the same command for the following database systems by adapting the -i parameter (a sketch for another dialect follows the list):

  • access
  • sybase
  • sqlite
  • informix
  • firebird
  • mysql
  • oracle
  • maxdb
  • postgresql
  • mssql
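
For example, the same automatic schema detection can target PostgreSQL simply by changing the -i parameter; a sketch reusing the command from above (the generated column types may differ slightly from the SQLite output):

$: csvsql -i postgresql -d ',' --db-schema AirportDB --table Airports airports.csv > airport_schema_postgres.sql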

All major systems are supported, which is a great benefit. Now that we have the schema ready, we need to import the data into the SQLite database. We could use the SQLite client to import the CSV file directly, but here we run into a problem: the 12th column contains boolean values, as correctly identified by csvkit. When we inspect the file again with csvlook, we can see that the column contains ‘yes’ and ‘no’ values. Unfortunately, SQLite does not understand this particular notation for boolean values, but rather expects 0 for false and 1 for true, as described in the data types documentation. We have two options: we could replace the yes and no values with their corresponding integers, for instance with awk:

$: awk -F, 'NR>1 { $12 = ($12 == "\"no\"" ? 0 : $12) } 1' OFS=,  airports.csv > airports_no.csv
$: awk -F, 'NR>1 { $12 = ($12 == "\"yes\"" ? 1 : $12) } 1' OFS=,  airports_no.csv > airports_yes.csv

Or, much more comfortably, we could again use csvkit, which replaces the values automatically. The following command imports the data into our database. As we already created the table in advance, we can skip that step with the appropriate flag.

$: time csvsql --db "sqlite:///home/stefan/datawrangling/AirportDB.sqlite" --table "Airports" --insert airports.csv --no-create

real    0m11.161s
user    0m10.743s
sys    0m0.169s

This takes a little while, but after a few seconds, we have the data ready. We can then open the database and query our Airport data set.

sqlite> SELECT name,iso_country,iso_region FROM Airports ORDER BY name DESC;
Run Time: real 3.746 user 0.087215 sys 0.167298

You can now use the data in a more advanced way and also utilise database features such as indices in order to speed up processing (a short sketch follows after the comparison below). If we compare the execution of the same query on the CSV file and within SQLite again, the advantage becomes much more obvious when we omit the command line output, for instance by querying the COUNT of the airport names.

-- CSV SQL
$: time csvsql -d ',' --query="SELECT COUNT(name) FROM airports" airports.csv
COUNT(name)
47409

real	0m11.068s
user	0m10.977s
sys	0m0.086s

-- SQLite
sqlite> .timer on
sqlite> SELECT COUNT(name) FROM Airports;
47409
Run Time: real 0.014 user 0.012176 sys 0.001273
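
As mentioned above, indices can speed up such queries even further. A minimal sketch adding an index on the name column inside the SQLite shell (the index name is arbitrary):

sqlite> CREATE INDEX idx_airports_name ON Airports(name);
sqlite> SELECT name,iso_country,iso_region FROM Airports ORDER BY name DESC;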


Create an ER Diagram of an Existing SQLite Database (or many other RDBMS)

Visualisation helps to solve problems and is therefore an important tool in database design. Many database vendors provide product-specific tools for re-engineering existing schemata, but self-contained, serverless, embedded relational database management systems (RDBMS) such as SQLite often come without much tooling support. The extremely small footprint of SQLite makes it a very powerful tool for implementing database-driven applications without the hassle of database administration, user privilege management and other demanding tasks that come with more complex systems. A workbench-like tool for SQLite does not exist, but we can use the open source SchemaCrawler for analysing database schemata and table relationships. The tool provides a plethora of commands and options; in this post we will only cover the diagramming part, which allows creating ER diagrams of the tables.

After downloading and extracting the tool to your local drive, you will find a lot of examples included. The tool can handle SQLite, Oracle, MS SQL Server, IBM DB2, MySQL, MariaDB, Postgres and Sybase database servers and is therefore very versatile. You will need Java 8 in order to run it. Have a look at the script below, which creates a PNG image of the database schema of the Chinook test database.

#!/bin/bash
# The path of the unzipped SchemaCrawler directory
SchemaCrawlerPATH=/media/stefan/Datenbank/schemacrawler-14.05.04-main
# The path of the SQLite database
SQLiteDatabaseFILE=/media/stefan/Datenbank/ChinookDatabase1.4_CompleteVersion/Chinook_Sqlite.sqlite
# The type of the database system.
RDBMS=sqlite
# Where to store the image
OutputPATH=/media/stefan/Datenbank/ER.png
# Username and password need to be empty for SQLite
USER=
PASSWORD=

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -server=${RDBMS} -database=${SQLiteDatabaseFILE} -outputformat=png -outputfile=${OutputPATH} -command=graph -infolevel=maximum -user=${USER} -password=${PASSWORD}

The SchemaCrawlerPATH variable contains the path to the directory where we unzipped the SchemaCrawler files. This is needed in order to load all the required libraries into the classpath below. We then specify the SQLite database file, define the RDBMS and provide an output path where we store the image. Additionally, we provide an empty user name and password combination. SQLite does not provide user authentication, so those two parameters need to be empty; SchemaCrawler simply ignores them. Then we can execute the command and the tool generates the PNG of the ER diagram for us.

[Image: ER diagram of the Chinook database]

You can find a lot of examples online as well, which give you an overview of the features of this tool. One of the main purposes of SchemaCrawler is to generate diffable text outputs of database schemata. In combination with a source code version management tool such as Git or Subversion, you can create clean and usable reports of your databases and keep track of the changes there. You can retrieve an overview of the options with the following command.

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -?
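
As a sketch of the diff-friendly workflow mentioned above, the schema can also be written to a plain text file and put under version control. The command and info level below are chosen as an example, and the snippet assumes the output file lives inside a Git repository:

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -server=${RDBMS} -database=${SQLiteDatabaseFILE} -outputformat=text -outputfile=schema.txt -command=schema -infolevel=standard -user=${USER} -password=${PASSWORD}
git add schema.txt && git commit -m "Update database schema report"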

You can also generate HTML reports with the following command:

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -server=${RDBMS} -database=${SQLiteDatabaseFILE} -outputformat=html -outputfile=report.html -command=details -infolevel=maximum -user=${USER} -password=${PASSWORD}

[Image: HTML report generated by SchemaCrawler]

Other available output formats are plain text, CSV and JSON.


Importing your Repositories from Github to your Gitlab Instance

Github is a great service and offers a free micro account for students, which includes five private repositories. Still, this might not be enough private repositories for some of us, especially if you start a lot of side projects (which hardly ever get finished :-)) and want to keep your code and data organised, versioned and collaborative, but non-public (because it is not finished yet :-)). For this and other reasons I am also running a private Gitlab instance, which basically offers the same features as Github and can be hosted on your own server.

I still use Github for some projects, which means that I continuously run out of private repositories there, as only five are included in my plan. Now Gitlab (the private instance) offers a great feature which allows you to hook your Github account into the Gitlab instance and import code hosted in private repositories on Github into your Gitlab instance, where no limits on private repositories apply. Obviously, the repositories could also be transferred manually (see the sketch below), but the import feature via the web interface is very convenient.
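
For reference, the manual transfer works with a mirror clone; a minimal sketch, assuming hypothetical repository URLs on both ends:

git clone --mirror git@github.com:username/project.git
cd project.git
git push --mirror git@gitlab.example.org:username/project.git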

Installing the oAuth feature for Github and other services is very simple and is described here for Github and there in general. Read both documentations carefully, as using the wrong settings can lead to serious security issues. Even after following the instructions carefully, the warning that oAuth needs to be configured kept popping up in the Gitlab dashboard:

To enable importing projects from GitHub, as administrator you need to configure OAuth integration.

The final hint was given in this SO post here. The official documentation failed to mention that changes to the configuration file must be applied by reconfiguring Gitlab. The following list of steps should lead to success:

  1. Login into Github and open your account settings
  2. Select Applications on the left and choose the Developer Applications tab
  3. Register a new application by providing the name (e.g. Gitlab Access), a homepage (can be anything), a description (optional) and the callback URL (e.g. https://gitlab.example.org).
  4. Add the initial OmniAuth configuration as described in the official documentation.
  5. Provide the settings for Github (see the configuration sketch after this list).
  6. Save the file and execute the following two commands.
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
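
As a rough sketch of what the Github provider entry in /etc/gitlab/gitlab.rb can look like in an Omnibus installation (the app_id and app_secret values come from the application registered in step 3; check the official documentation for the exact syntax of your Gitlab version):

gitlab_rails['omniauth_enabled'] = true
gitlab_rails['omniauth_providers'] = [
  {
    "name" => "github",
    "app_id" => "YOUR_GITHUB_APP_ID",
    "app_secret" => "YOUR_GITHUB_APP_SECRET",
    "args" => { "scope" => "user:email" }
  }
]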

The first time you select “Import project from Github”, you will be redirected to the Github authentication page, where you need to grant the Gitlab user permission to retrieve the repositories. After you have imported the private repositories into Gitlab, you can delete them from Github and reclaim some free space.
