Storage-Optimizing PostGIS Geometries

15 Feb 2018

Do you work with spatial data that covers the globe at nanometer precision? If not, PostGIS is probably wasting space when it stores your geometry. This post describes a technique that can be used to reduce the storage space of PostGIS geometries in all of those cases when you’re not working with nanometers.

A quick background: PostGIS stores all coordinates as double-precision floating point values. A double-precision number can accurately represent approximately 15 significant digits. In many applications, PostGIS is being used with data that’s accurate to the nearest hundredth or thousandth of a degree, (e.g., -173.2243). Since longitude ranges from -180 to 180, storing this data requires at most seven significant digits. Some prominent data sets (e.g., TIGER) use six digits after the decimal point, requiring nine significant digits in WGS84. In all of these cases, much of the disk space occupied by a geometry represents noise.

In the past, I’ve saved space by resorting to tricks like storing points as in x/y columns, using the real data type, and then creating a functional index of ST_MakePoint(x, y). That’s ugly, only works for Point objects, and creates complexity that bleeds into any query needing to use that data.

It turns out that we can take advantage of the compression that’s built into Postgres to reclaim much of the space that’s wasted by PostGIS. We just have to modify our data to be more compressible.

Consider a coordinate value from a dataset like TIGER that represents geographic coordinates in microdegrees (roughly 10 cm of precision): 44.455725. (You may recognize this as the latitude of the world’s tallest filing cabinet.)

In binary form, the latitude value looks like this:

01000000 01000110 00111010 01010101 00110010 01100001 01111100 00011100
S******* ****#### ######## ######## ######## ######## ######## ########

Symbols in the second row denote:

S: 1 bit holding the sign
*: 11 bits holding the exponent
#: 52 bits holding the mantissa

Since our coordinate is only in microdegree precision, we don’t care if it’s stored internally as 44.45572531822 or 44.45572560712. In other words, we don’t need all 52 bits to store the meaningful part of our coordinate mantissa.

If you do some printf experimentation in C, you can see that our mantissa value is always 44455725..., regardless of what’s sitting in the last 29 bits. Much of our space then, is just storing noise:

01000000 01000110 00111010 01010101 00110010 01100001 01111100 00011100
S******* ****#### ######## ######## ######## ######## ######## ########
             |->   valuable data!   <-|->          noise!           <-|

The problem is not only that our valuable data takes up just as much space as our noise, but that our noise appears to have the same information content as our valuable data. If we reduce the apparent information content from our noise, we can make it more compressible. (Yes, I’m misusing all kinds of CS terminology here.)

Since we’ve established that we can set our noisy bits to whatever we like without changing microdegree value of our coordinate, let’s just make them all zero:

01000000 01000110 00111010 01010101 00100000 00000000 00000000 00000000
S******* ****#### ######## ######## ######## ######## ######## ########
             |->   valuable data!   <-|->          noise!           <-|

By doing so, we haven’t lost any valuable data. We haven’t saved any space, but by reducing randomess we’ve made our number much more compressible. Since Postgres automatically compresses PostGIS geometries, harmonizing our noise bits to zero or one should reduce the on-disk size of PostGIS geometry objects.

To test the practical implications of this, I wrote up a little bit-cleaning implementation for PostGIS. The code is in a GitHub branch, but the core of it is pretty short:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
/* Figure out how many mantissa bits we need to keep in order
   to retain a specified number of significant digits */
static int
bits_for_precision(int digits_precision)
{
	if (digits_precision < 1)
		lwerror("Must have at least one digit of precision");
	
	if (digits_precision > 15)
		lwerror("Can't request more than 15 digits of precision");
	
	return (int) ceil(digits_precision / log10(2));
}

static inline
double mask_double(double d, int64_t mask)
{
	int64_t* double_bits = (int64_t*) (&d);
	
	(*double_bits) &= mask;
	
	return *((double*) double_bits);
}

void
lwgeom_trim_bits_in_place(LWGEOM* geom, int digits_precision)
{
	LWPOINTITERATOR* it = lwpointiterator_create_rw(geom);
	int bits_to_keep = bits_for_precision(digits_precision);
	int64_t mask = 0xffffffffffffffff << (52 - bits_to_keep);
	POINT4D p;
	
	while (lwpointiterator_has_next(it)) {
		lwpointiterator_peek(it, &p);
		p.x = mask_double(p.x, mask);
		p.y = mask_double(p.y, mask);
		if (lwgeom_has_z(geom))
			p.z = mask_double(p.z, mask);
		if (lwgeom_has_m(geom))
			p.m = mask_double(p.m, mask);
		lwpointiterator_modify_next(it, &p);
	}
	
	lwpointiterator_destroy(it);
}

This gets exposed at the SQL level like this:

CREATE OR REPLACE FUNCTION postgis_optimize_geometry(g geometry, digits_precision int)
RETURNS geometry AS 'postgis_optimize_geometry'
LANGUAGE 'c' IMMUTABLE STRICT;

To test the effectiveness, I created a table with about 100 MB of circles:

CREATE TABLE circles AS
SELECT ST_Buffer(ST_Makepoint(-180 + 360*random(), -90 + 180*random()), random(), 128) AS geom
FROM generate_series(1, 1.06e4);

SELECT pg_size_pretty(pg_total_relation_size('"circles"'));
-- 100 MB

What if we only need nine significant digits in our circles?

CREATE TABLE circles_9dig AS SELECT postgis_optimize_geometry(geom, 9) FROM circles;
SELECT pg_size_pretty(pg_total_relation_size('"circles_9dig"'));
-- 77 MB 

A 23% reduction. Not bad, but not too exciting. How about six digits?

CREATE TABLE circles_6dig AS SELECT postgis_optimize_geometry(geom, 6) FROM circles;
SELECT pg_size_pretty(pg_total_relation_size('"circles_6dig"'));
--41 MB

More than 50% - nice! It’s worth pointing out that the size reductions are not as good as what we get from TWKB:

CREATE TABLE circles_twkb AS SELECT ST_astwkb(geom, 6) FROM circles;
SELECT pg_size_pretty(pg_total_relation_size('"circles_twkb"'));
--28 MB

That said, the nice thing about this trick is that it relies entirely on built-in Postgres compression. It doesn’t require using another data type, and functions using your geometries don’t have to care that you’ve performed this optimization.

Published on 15 Feb 2018