Useful tricks with spatial data

by levien on Sat 25 June 2011 // Posted in misc

For my research on Avian Influenza in waterbirds, I recently needed data on lakes and marsh-areas in Europe. I ended up compiling a spatial dataset from a number of different sources, including the EU Corine Land Cover database (CLC2000v13), the lake depth dataset compiled by Ekaterina Kourzeneva for the FLake model and data from the Finnish national lake register. In order for this to be usable I had to relate the lakes and marshes in the different datasets to each other. In other words, I needed to find out which coordinates or polygons in the different datasets represented the same lakes, and I needed to know which marsh areas were close to which lakes.

After searching around a bit and considering several options, I decided (for several reasons) not to use a full-blown GIS system or a generic data-analysis environment such as R. Instead I found that I could easily do what I needed using a stand-alone spatial database called Spatialite (which in turn is basically the well-known SQLite package with spatial data extensions) and a few lines of Spatial SQL.

After cleaning it up I should probably publish my dataset online, but here I'll just give some examples of how I coupled the spatial entities in the different datasets.

First, you'll need to download and install Spatialite. Unless you prefer using just the command-line, you'll probably want the Spatialite GUI as well. For Mac or Windows, just download the binaries from the site. If you're on an Ubuntu Linux system, you can install spatialite from the repositories. However, spatialite-gui isn't in there yet at the time of writing, so you'll either need to compile from source or install the newer Debian sid packages for spatialite-gui and the required dependencies (to make life easier, I'll put those up in my Ubuntu PPA as soon as I have time).
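On Ubuntu that would be something along the lines of the following (the package name may differ between releases; spatialite-bin is the one that contains the command-line shell):

sudo apt-get install spatialite-bin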

To complicate things a little bit, the EU Corine Land Cover dataset does not use "standard" WGS84 latitude/longitude coordinates. Instead, it uses a "Europe-centric" ETRS89-LAEA projection with SRID 3035 (see also http://grass.fbk.eu/pipermail/grass5/2006-November/027177.html). The Finnish lake data uses the national KKJ / Finland Uniform Coordinate System (in Finnish "Yhtenäiskoordinaatisto" or YKJ), with SRID 2393. Luckily, both use the metre as their spatial unit.
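If you want to double-check what a given SRID refers to, you can look it up in the spatial_ref_sys metadata table of any Spatialite database (a quick sanity check; the exact columns may differ a bit between Spatialite versions):

SELECT srid, ref_sys_name FROM spatial_ref_sys WHERE srid IN (4326, 3035, 2393);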

I used the Corine shapefiles for waterbodies (class ID 512) and marshes (class ID 411).

To avoid problems later on, it's best to reproject the CLC shapefiles to WGS84 (SRID 4326). You can do this using the GDAL utility ogr2ogr (in Ubuntu, just install the package gdal-bin). For instance, to reproject the waterbody-data use the following command:

ogr2ogr -t_srs EPSG:4326 clc00_waterbodies_wgs84.shp clc00_c512.shp
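The marsh shapefile can be reprojected the same way (assuming the class-411 shapefile follows the same naming pattern as the waterbody one):

ogr2ogr -t_srs EPSG:4326 clc00_marshes_wgs84.shp clc00_c411.shp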

The Finnish data consisted of the output from lots of separate database-queries. As different files were formatted in different ways, I wrote some Perl-code to filter and combine query-results. And as I was using Perl anyway, I used the Geo::Coordinates::KKJ module to convert lake-coordinates into WGS84.

The lake-depth dataset was also pre-processed a bit in Perl, but (presumably) its coordinates were already in WGS84.

The next step was to load the data into a Spatialite database. I used the Spatialite GUI to create a new database-file, to import the CLC waterbody and marsh shapefiles and to import the CSV-files containing the Finnish lake data and the European portion of the lake-depth dataset. If you can't or don't want to use the GUI, you can do the same with the .loadshp and .import commands in the Spatialite shell.
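For reference, a minimal sketch of the shell version would look something like this (the file and table names are placeholders; check .help for the exact .loadshp argument order in your version, and set the separator to match your files):

.loadshp clc00_waterbodies_wgs84 clc00_waterbodies_wgs84 UTF-8 4326
.separator ,
.import finnish_lakes.csv finnish_lake_table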

The polygons in the CLC shapefiles represent waterbodies and marsh areas, and will be imported into database tables as spatial polygon objects. That is to say, the database will automagically treat the imported shapefiles as tables of spatial objects. This is unfortunately not the case for the imported CSV-files: there you'll need to tell the database that the latitude and longitude columns represent spatial coordinates. The way to do this is a bit convoluted: essentially, you add a geometry-column with point-objects and then fill it with coordinates from the latitude and longitude columns. Say your table is called "finnish_lake_table" and your WGS84 coordinate columns are called "lat" and "lon"; then you can do:

SELECT AddGeometryColumn('finnish_lake_table', 'geometry_column', 4326, 'POINT', 2);
UPDATE finnish_lake_table SET geometry_column = GeomFromText('POINT('||lon||' '||lat||')', 4326);

The details of how this works are not so important (it creates a WKT-description of each point, which is then inserted into the table as a spatial object). After executing this, you'll have a new column called "geometry_column" and your rows will be treated by the database as spatial point objects. Note that now you can also export the table as a shapefile, using either the GUI or the .dumpshp command.
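As an aside, Spatialite also provides MakePoint(), which builds the same point geometry directly from the numeric columns, without going through a WKT string; the following should be equivalent:

UPDATE finnish_lake_table SET geometry_column = MakePoint(lon, lat, 4326);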

After you've imported the shapefiles and created geometry columns for any imported CSV-files, you'll also want to generate a spatial index for every geometry column (I'll explain why later on). In the GUI, you can simply right-click on a geometry column in the database tree-view and generate an R-tree spatial index. In the shell you can do the same with a command such as:

SELECT CreateSpatialIndex('finnish_lake_table', 'geometry_column');
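Under the hood, this creates a virtual table named idx_<table>_<geometry-column> that stores the bounding-box of every geometry. You can take a quick look at it to see what the queries below will be working with (this assumes you also created an index on the CLC waterbody table):

SELECT pkid, xmin, xmax, ymin, ymax FROM idx_clc00_waterbodies_wgs84_Geometry LIMIT 5;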

My next step was to map the point-coordinates representing the Finnish lakes and the lake-depth data to CLC waterbody IDs. In this example the columns were called "lake_id" (Finnish lake ID) and "ID" (CLC object ID), and I exported the mapping tables to tab-separated text-files (but you can equally well store the result as a new table in the database). In this case I used the spatialite command-shell to execute the following commands (but you can execute the same query in the GUI):

.mode list
.separator \t
.output "fi_clc_idmap.dat"
SELECT FI.lake_id, CLC.ID
FROM finnish_lake_table AS FI, clc00_waterbodies_wgs84 AS CLC
WHERE Within(FI.geometry_column, CLC.Geometry)
  AND CLC.ROWID IN (
    SELECT pkid FROM idx_clc00_waterbodies_wgs84_Geometry
    WHERE xmin < X(FI.geometry_column) AND xmax > X(FI.geometry_column)
      AND ymin < Y(FI.geometry_column) AND ymax > Y(FI.geometry_column)
  );
.exit

This query looks rather intimidating, but it basically does the following:

  • Set the output to a tab-separated text-file called fi_clc_idmap.dat
  • Select the Finnish lake_id and CLC ID columns on which we will operate
  • Designate FI and CLC as shortcut-names for the two data tables
  • If a Finnish lake-coordinate falls inside a CLC waterbody-polygon, output both the Finnish lake ID and the CLC object ID

The part between parentheses at the end of the query is needed to speed things up a bit. If omitted, the query would work, but would take a very long time to execute. The reason is that the database would have to check every point in one table against every polygon in the other table, which is computationally very costly. The last part of the query uses a trick based on the R-tree spatial index we created earlier. It first selects CLC waterbody-polygons for which the Finnish lake-coordinate falls within the bounding-box of the polygon. This operation is very fast because it uses the spatial index (here called idx_clc00_waterbodies_wgs84_Geometry). It subsequently uses the much slower Within() function to check if the point is really within any of those polygons. This way, it takes only a few seconds or minutes to create the mapping table, instead of the several hours or even days that would be required to execute the simple "brute-force" query.
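As an aside, newer versions of Spatialite (3.0 and up, if I remember correctly) provide a somewhat friendlier way to write the same bounding-box filter, using the SpatialIndex virtual table; something like this should be equivalent:

SELECT FI.lake_id, CLC.ID
FROM finnish_lake_table AS FI, clc00_waterbodies_wgs84 AS CLC
WHERE Within(FI.geometry_column, CLC.Geometry)
  AND CLC.ROWID IN (
    SELECT ROWID FROM SpatialIndex
    WHERE f_table_name = 'clc00_waterbodies_wgs84'
      AND search_frame = FI.geometry_column
  );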

I used a similar method to create a table of distances between waterbodies and all marsh areas whose edge lies within 5 km of the nearest waterbody edge. One caveat applies here: Distance() computes planar distances in the units of the coordinate system, so on WGS84 data it returns decimal degrees rather than metres. For a metre-based cutoff like the 5000.0 below to be meaningful, the geometries should be in a metre-based projection (for instance the original ETRS89-LAEA data, or geometries converted on the fly with Transform()):

.mode list
.separator \t
.output "marsh_wetland_distance.dat"
SELECT marsh.ID, lake.ID, Distance(lake.Geometry, marsh.Geometry)
FROM clc00_marshes_wgs84 AS marsh, clc00_waterbodies_wgs84 AS lake
WHERE Distance(marsh.Geometry, lake.Geometry) < 5000.0
  AND lake.ROWID IN (
    SELECT pkid FROM idx_clc00_waterbodies_wgs84_Geometry
    WHERE xmin < (MbrMaxX(marsh.Geometry) + 5000) AND xmax > (MbrMinX(marsh.Geometry) - 5000)
      AND ymin < (MbrMaxY(marsh.Geometry) + 5000) AND ymax > (MbrMinY(marsh.Geometry) - 5000)
  );
.exit

As I have rather limited knowledge of SQL and Spatial SQL, I constructed both queries based on the examples mentioned in this thread.

Finally, for my purposes I didn't need the full set of waterbody-polygons, as I based my further analyses on tables of lakes with their associated properties (area, depth, volume, etc.). For example, I used the following query to export a tab-separated table of CLC waterbodies containing the object ID, the centroid-coordinates of the polygons (WGS84 latitude and longitude), the lake area (in hectares) and the shoreline length (in m):

.mode list
.separator \t
.output "clc_waterbodies_wgs84.dat"
SELECT lake.ID, Y(Centroid(lake.Geometry)), X(Centroid(lake.Geometry)), lake.Area_ha, lake.Shape_Leng FROM clc00_waterbodies_wgs84 AS lake;
.exit

Of course, apart from the fancy indexing-tricks to speed things up, the examples I showed here are quite basic. As you can see, I immediately export all query results rather than keeping them in the database, because I did all further processing and analysis in external packages such as Python/Pandas, Perl/PDL and GNU R. This approach somewhat reduces the database to a data-handling tool, which doesn't really do it justice. Have a look at the Spatialite documentation and tutorials to see what else is possible with Spatialite. For instance, instead of using text-files as intermediate storage, you can (and probably should) directly use a Spatialite database as a data-store for packages like R and QGIS, as well as any custom code you write in Perl, Python, C, Java etc. Here are some further resources: