Benchmarking Spatial Queries

09 Jan 2017

Sometimes you want to ask the database a question that could be expressed as many different queries. For example:

What’s the fastest way to find all users within 1 km of a given point?

Let’s assume that we need a distance in meters, and that spherical distance calculations are good enough for our purposes. There are a ton of ways we could write this query. Here are six options, though one could certainly imagine others.

  1. Storing our points as geography, and using ST_DWithin in meters.
  2. Storing our points as geometry, creating a functional index on the geometry to geography cast, and then using ST_DWithin in meters.
  3. Storing our points as geometry, using ST_DWithin and a computed value of degrees, and then refining with ST_DistanceSphere.
  4. Similar to the above, but using some ST_Expand trickery to make the index call more selective.
  5. Generating a buffer around our query point, and using ST_Intersects.
  6. Similar to #4, but storing our points as the Postgres point type, and using the earthdistance module instead of ST_DistanceSphere.

This post will show a simple way to find out which of these approaches is fastest, for a particular situation, using the pg_stat_statements extension.

These approaches have different storage requirements and different accuracies. The purpose of this example is not to demonstrate which of these approaches is best for all situations, but to show how you might perform testing using your own data.

For this example, I’ll be working with a table of 1.3 million points, created using the method described in this post.

The table looks like this:

                              Table "public.users"
 Column  |         Type         | Modifiers | Storage | Stats target | Description 
---------+----------------------+-----------+---------+--------------+-------------
 user_id | integer              | not null  | plain   |              | 
 geom    | geometry(Point,4326) |           | main    |              | 
Indexes:
    "users_pkey1" PRIMARY KEY, btree (user_id)
    "users_geom_idx" gist (geom)

The obvious solution

The simplest way to write this query is to use ST_DWithin with the PostGIS geography type, instead of geometry. The geography type is intended to be used with latitude/longitude coordinates on the earth’s surface, and performs accurate spheroid distance calculations in meters. For this method, we’ll add a new column with type geography and copy our points into it.

ALTER TABLE users ADD COLUMN geog geography;
UPDATE users SET geog = geom::geography;
CREATE INDEX ON users USING gist(geog);

Then, we can run a straightforward query:

SELECT *
FROM users
WHERE ST_DWithin(geog, 'SRID=4326;POINT(-71.007 41.671)'::geography, 1000)

Which yields the following plan:

 Index Scan using users_geog_idx1 on users  (cost=0.41..8.81 rows=1 width=68)
   Index Cond: (geog && '010...440'::geography)
   Filter: (('010...440'::geography && _st_expand(geog, '1000'::double precision)) 
       AND _st_dwithin(geog, '010...440'::geography, '1000'::double precision, true))

The cheater’s solution

If we don’t want to create a native geography column, we can “cheat” and create a functional index on our geometry column instead. Why might we do this?

  • We could be using software that doesn’t understand the PostGIS geography type.
  • We may have other operations that require the geometry type, and don’t want the hassle or storage overhead of maintaining two columns and keeping them in sync.

For this solution, we create a functional index on the result of the geography function, which casts a geometry to a geography. A functional index works just like a regular index, except that we’re creating it on the result of a function instead of a stored column. There’s usually a performance penalty relative to a stored value, but it might be acceptable in some cases.

-- Set up functional index on geometry
CREATE INDEX ON users USING gist(geography(geom));

We can now write an ST_DWithin query, in meters, against our geometry column:

SELECT *
FROM users
WHERE ST_DWithin(geom::geography, 
                 'SRID=4326;POINT(-71.007 41.671)'::geography, 1000)

Yielding the following plan:

 Bitmap Heap Scan on users  (cost=11508.24..141354.75 rows=17453 width=68)
   Recheck Cond: ((geom)::geography && '010...440'::geography)
   Filter: (('010...440'::geography && _st_expand((geom)::geography, '1000'::double precision))
         AND _st_dwithin((geom)::geography, '010...440'::geography, '1000'::double precision, true))
   ->  Bitmap Index Scan on users_geography_idx  (cost=0.00..11503.88 rows=261795 width=0)
         Index Cond: ((geom)::geography && '010...440'::geography)

Getting uglier: using a degree-based distance tolerance

Maybe we get the idea that geodetic distance calculations are expensive, and that we can speed things up by using faster Cartesian calculations for our index scan, at the expense of adding a separate filtering step that uses spherical distances. To do this, we need to figure out a suitable tolerance in degrees to plug into ST_DWithin.

If, like Wikipedia, we consider the minimum radius of the Earth to be about 6,353 km, then we can calculate the minimum length of a degree of latitude (we want the minimum length, so that our index scan will be overselective rather than underselective):

2 * pi * 6353000 m / 360 degrees = 110881 m / degree

The longitude-to-meters conversion is a bit trickier. At the equator, a degree of longitude covers the same number of meters as a degree of latitude, but as we approach the poles, fewer and fewer meters are needed to make up a degree of longitude. If you put a coin directly on the north pole, the radius of the coin would span all 360 degrees, and a degree of longitude would equal something like 0.2 mm.

In between the equator and the poles, the length of a degree of longitude is approximated by 110881*cos(lat) m / degree. Since a degree of longitude can never cover more meters than a degree of latitude, we only need to consider longitude when choosing a search radius for ST_DWithin. In our worst case, a point is located exactly 1 km due east or west of our query point. That point would be approximately 1000.0/(110881 * cos(lat)) degrees away, so this is the distance we need to pass to ST_DWithin.
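
As a quick sanity check, both conversions can be evaluated right in psql (values are approximate):

-- Meters per degree of latitude, using the minimum earth radius
SELECT 2 * pi() * 6353000 / 360;                  -- ≈ 110881

-- Degree tolerance for a 1 km search at our query latitude
SELECT 1000.0 / (110881 * cos(radians(41.671)));  -- ≈ 0.0121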

Our two-stage query will look like this:

SELECT *
FROM users
WHERE ST_DWithin(geom, 'SRID=4326;POINT(-71.007 41.671)', 1000.0/(110881*cos(radians(41.671)))) 
  AND ST_DistanceSphere(geom, 'SRID=4326;POINT(-71.007 41.671)') < 1000

Plan:

 Bitmap Heap Scan on users  (cost=4.90..362.46 rows=2 width=68)
   Recheck Cond: (geom && '010...440'::geometry)
   Filter: (('010...440'::geometry && st_expand(geom, '0.009...'::double precision)) 
            AND _st_dwithin(geom, '010...440'::geometry, '0.009...'::double precision) 
            AND (_st_distance(geography(geom), 
                              '010...440'::geography,
                              '0'::double precision, false) < '1000'::double precision))
   ->  Bitmap Index Scan on users_geom_idx  (cost=0.00..4.89 rows=81 width=0)
         Index Cond: (geom && '010...440'::geometry)

A slight optimization

Because the distance covered by a degree of longitude shrinks as we approach the poles, the query above is going to become increasingly overselective as we move away from the equator.

If we actually reached the poles, the bounding box ST_DWithin uses for its index scan would become infinitely large. We can improve on this by dropping the ST_DWithin call and using a more primitive ST_Expand operation, similar to what ST_DWithin does behind the scenes anyway:

SELECT *
FROM users
WHERE geom && ST_Expand('SRID=4326;POINT(-71.007 41.671)'::geometry,
                        1000.0/(110881*cos(radians(41.671))),
                        1000.0/110881)
  AND ST_DistanceSphere(geom, 'SRID=4326;POINT(-71.007 41.671)') < 1000

This query makes direct use of the && operator, which performs a test for bounding box intersection. As we approach the poles, the bounding box used for our index scan will still become infinitely large in the x direction, but will retain a fixed size in the y direction.

Note that PostGIS 2.3 or greater is required to call ST_Expand with different distances in the x and y axes. In earlier versions, an equivalent box could be constructed using ST_MakeEnvelope, as sketched after the plan below.

This gives us the following plan:

 Index Scan using users_geom_idx on users  (cost=0.29..34.20 rows=2 width=68)
   Index Cond: (geom && '010...440'::geometry)
   Filter: (_st_distance(geography(geom),
                         '010...440'::geography,
                         '0'::double precision, false) < '1000'::double precision)
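
For illustration, here’s a sketch of the pre-2.3 equivalent using ST_MakeEnvelope (which takes xmin, ymin, xmax, ymax, and an SRID), with the same tolerances as above:

SELECT *
FROM users
WHERE geom && ST_MakeEnvelope(
        -71.007 - 1000.0/(110881*cos(radians(41.671))),
         41.671 - 1000.0/110881,
        -71.007 + 1000.0/(110881*cos(radians(41.671))),
         41.671 + 1000.0/110881,
        4326)
  AND ST_DistanceSphere(geom, 'SRID=4326;POINT(-71.007 41.671)') < 1000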

Intersection tests using ST_Buffer

Another option is to forgo distance calculations altogether and create a 1 km buffered polygon around our test point. We can then perform intersection tests against this buffered polygon. This option is most likely to perform well when we’re working with complex LineString or Polygon data and can take advantage of PostGIS’ support for GEOS prepared geometries. It’s not likely to do as well with simple point data, but it’s included here for completeness.

SELECT *
FROM users
WHERE ST_Intersects(geom, 
                    ST_Buffer('SRID=4326;POINT(-71.007 41.671)'::geography, 1000)::geometry);

Note that we used the geography version of ST_Buffer so that we could buffer using a distance in meters.

The plan for this query is a straightforward bitmap index scan:

 Bitmap Heap Scan on users  (cost=9.31..494.99 rows=40 width=84)
   Recheck Cond: (geom && '010...440'::geometry)
   Filter: _st_intersects(geom, '010...440'::geometry)
   ->  Bitmap Index Scan on users_geom_idx  (cost=0.00..9.30 rows=119 width=0)
         Index Cond: (geom && '010...440'::geometry)

Going Native

Last but not least, we can look at an old-school option using the native Postgres geometric types and the earthdistance module.

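The earthdistance module depends on the cube module, so both need to be installed if they aren’t already:

CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS earthdistance;

-- Quick sanity check: one degree of longitude at the equator,
-- in statute miles (roughly 69)
SELECT point(0, 0) <@> point(1, 0);
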
Next, we’ll create, populate, and index a column of point type:

ALTER TABLE users ADD COLUMN pt point;
UPDATE users SET pt = geom::point;
CREATE INDEX ON users USING gist(pt);

And then write a query like the following (there may be much better ways to do this; I’m not sure). In this query, <@ is a “contained in” operator (docs), and <@> is a geographic distance operator provided by earthdistance. The <@> operator gives us a distance in statute miles, so we query using a distance threshold of 0.621371, the equivalent of 1 km.

-- Replaced test point with variables for readability,
--   x = -71.007
--   y =  41.671
--   distance_deg = 1000.0 / 110881

  SELECT * FROM users
    WHERE pt <@ box(point(x - distance_deg/cos(radians(y - distance_deg)), y - distance_deg),
                    point(x + distance_deg/cos(radians(y + distance_deg)), y + distance_deg))
    AND pt <@> point(x, y) < 0.621371;

Measuring performance

So which one of these queries is actually fastest? We can set up a simple test using the pg_stat_statements extension that ships with Postgres. To use it, you need to set the following values in your postgresql.conf:

# This line should already exist
shared_preload_libraries = 'pg_stat_statements'

# This line probably does not
pg_stat_statements.track = all

Then restart Postgres, and run CREATE EXTENSION pg_stat_statements.

To collect data, all we need to do is run each of these queries enough times on realistic data. You could write a test client to manage this, but I think it’s often easiest to run this right in psql with a little DO block.

In short, the DO block needs to do the following:

  1. Clear our statistics with pg_stat_statements_reset().
  2. Generate a random test point. Here, I’ve used the input points themselves as test points, but you could just as well generate a random point within a certain extent.
  3. Run one of the queries using the test point.
  4. Repeat.
DO $$
DECLARE 
  i int;
  num_options int := 6;
  x double precision;
  y double precision;
  distance_m double precision := 1000.0;
  distance_deg double precision;
  num_tests integer := 10000;
  query_point geometry;
BEGIN

PERFORM pg_stat_statements_reset();
distance_deg := distance_m/110881;

FOR i IN 1 .. num_tests LOOP
  RAISE NOTICE '%', i;

  SELECT geom FROM users TABLESAMPLE BERNOULLI (0.01) INTO query_point;
  x := ST_X(query_point);
  y := ST_Y(query_point);

  IF i % num_options = 0 THEN
    -- Option 1, using stored geography column
    PERFORM count(*)
    FROM users
    WHERE ST_DWithin(geog, query_point::geography, distance_m);
  ELSIF i % num_options = 1 THEN
    -- Option 2, using geography functional index
    PERFORM count(*)
    FROM users 
    WHERE ST_DWithin(geom::geography, query_point::geography, distance_m);
  ELSIF i % num_options = 2 THEN
    -- Option 3, using Cartesian distance
    PERFORM count(*)
    FROM users 
    WHERE ST_DWithin(geom, query_point, distance_deg/cos(radians(y + distance_deg)))
    AND ST_DistanceSphere(geom, query_point) < distance_m;
  ELSIF i % num_options = 3 THEN
    -- Option 4, using Cartesian distance and non-square bounding box
    PERFORM  count(*) FROM users 
    WHERE geom && ST_Expand(query_point::geometry, 
                            distance_deg/cos(radians(y)),
                            distance_deg)
    AND ST_DistanceSphere(geom, query_point) < distance_m;
  ELSIF i % num_options = 4 THEN
    -- Option 5, using ST_Buffer and ST_Intersects
    PERFORM count(*) FROM users 
    WHERE ST_Intersects(geom, ST_Buffer(query_point::geography, distance_m)::geometry);
  ELSIF i % num_options = 5 THEN
    -- Option 6, using native Postgres geometric types
    PERFORM count(*) FROM users
    WHERE pt <@ box(point(x - distance_deg/cos(radians(y - distance_deg)), y - distance_deg),
                    point(x + distance_deg/cos(radians(y + distance_deg)), y + distance_deg))
    AND pt <@> point(x, y) < distance_m/1000*0.621371;
  END IF;

END LOOP;
END;
$$ LANGUAGE plpgsql;

Why not run each point against all of the queries? The problem we could run into is that index results, or possibly function calls (I’m not sure) could be cached between versions of the query that share common operations. For example, both Option 3 and Option 4 are relying on our geometry index, so Option 4’s results could be systematically lowered by caching from Option 3. I think it’s better to assign different inputs to each query, and increase the number of tests run to lower the chances that the outcome is affected by the way we divide the test points among the queries.

After running the above, we can query pg_stat_statements to see the results in a nice, clean form:

SELECT
  query,
  min_time,
  mean_time,
  max_time
FROM pg_stat_statements 
WHERE query ~ 'users' 
  AND calls > 1 
ORDER BY mean_time DESC;

Results are below.

-[ RECORD 1 ]------------------------------------------------------------------------------------
query     | SELECT count(*)                                                                      
          |     FROM users                                                                       
          |     WHERE ST_DWithin(geom::geography, query_point::geography, distance_m)
min_time  | 0.233
mean_time | 9.71704499100178
max_time  | 56.057
-[ RECORD 2 ]------------------------------------------------------------------------------------
query     | SELECT count(*)                                                                      
          |     FROM users                                                                       
          |     WHERE ST_DWithin(geog, query_point::geography, distance_m)
min_time  | 0.238
mean_time | 9.18121968787515
max_time  | 46.497
-[ RECORD 3 ]------------------------------------------------------------------------------------
query     | SELECT count(*)                                                                      
          |     FROM users                                                                       
          |     WHERE ST_DWithin(geom, query_point, distance_deg/cos(radians(y + distance_deg))) 
          |     AND ST_DistanceSphere(geom, query_point) < distance_m
min_time  | 0.253
mean_time | 8.59873185362928
max_time  | 49.926
-[ RECORD 4 ]------------------------------------------------------------------------------------
query     | SELECT count(*) FROM users                                                           
          |     WHERE geom && ST_Expand(query_point::geometry,                                   
          |                             distance_deg/cos(radians(y)),                            
          |                             distance_deg)                                            
          |     AND ST_DistanceSphere(geom, query_point) < ?
min_time  | 0.192
mean_time | 3.88295440911818
max_time  | 24.607
-[ RECORD 5 ]------------------------------------------------------------------------------------
query     | SELECT count(*) FROM users                                                           
          |     WHERE ST_Intersects(geom, ST_Buffer(query_point::geography, ?)::geometry)
min_time  | 0.086
mean_time | 2.59793401319735
max_time  | 16.653
-[ RECORD 6 ]------------------------------------------------------------------------------------
query     | SELECT count(*) FROM users                                                           
          |     WHERE pt <@ box(point(x - distance_deg/cos(radians(y - distance_deg)), 
          |                     y - distance_deg),
          |                     point(x + distance_deg/cos(radians(y + distance_deg)),
          |                     y + distance_deg))
          |     AND pt <@> point(x, y) < ?
min_time  | 0.055
mean_time | 1.52340396158464
max_time  | 9.748

I hope the range of these results (nearly an order of magnitude) shows the value in experimenting with different options for an important query. A couple of points I found interesting:

  • The penalty for using a functional index is pretty low, on the order of 5% in this case. That seems like a good tradeoff to reduce space and complexity associated with storing both geometry and geography representations of the same point.
  • It is possible to accelerate distance calculations in some cases by using Cartesian calculations for the initial index scan and then refining the results with ST_DistanceSphere, assuming that the reduced accuracy is acceptable. I did some additional testing and found that these benefits decrease, or go away entirely, as the points approach the poles. Also, points aren’t able to take advantage of the clever index-based geography distance calculations that are used on higher-dimension geometries.
  • The buffer solution does surprisingly well, given that it’s performing what seems like a lot more work. Why this is the case might be an interesting topic for a future post.
