Move part of the fingerprinting logic into PostgreSQL when possible.

This allows ordering features before hashing, which is required for
layers from Naturvårdsverket and Skogsstyrelsen (features appear to be
randomly ordered in daily exports, so normalization and fingerprinting
are needed to detect whether there are new changes).

On the downside, this makes the cache a PostgreSQL-only feature. It's
also marginally slower than the old logic because for some reason
PostgreSQL doesn't seem to use the UNIQUE index and instead does a seq
scan followed by a quicksort.

Without fingerprinting logic:

    $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
        --cachedir=/var/cache/webmap \
        --lockfile=/run/lock/webmap/lock \
        --lockdir-sources=/run/lock/webmap/cache \
        --force \
        "sks:UtfordAvverk"
    […]
    INFO: Layer "sks:UtfordAvverk" has 313044 features
    […]
    3:54.45 (85.28 user, 26.19 sys) 72520k maxres

With old fingerprinting logic (full client-side SHA-256 digest of
features as they are being imported):

    $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
        --cachedir=/var/cache/webmap \
        --lockfile=/run/lock/webmap/lock \
        --lockdir-sources=/run/lock/webmap/cache \
        --force \
        "sks:UtfordAvverk"
    […]
    INFO: Imported 313044 features from source layer "UtfordAvverkningYta"
    […]
    INFO: Updated layer "sks:UtfordAvverk" has new fingerprint e655a97a
    4:15.65 (108.46 user, 26.73 sys) 80672k maxres

With new fingerprinting logic (hybrid client/server SHA-256 digest and
hash_record_extended() calls after the import process):

    $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
        --cachedir=/var/cache/webmap \
        --lockfile=/run/lock/webmap/lock \
        --lockdir-sources=/run/lock/webmap/cache \
        --force \
        "sks:UtfordAvverk"
    […]
    INFO: Layer "sks:UtfordAvverk" has 313044 features
    […]
    4:30.77 (87.02 user, 25.67 sys) 72856k maxres

Same but without ORDER BY (or ORDER BY ogc_fid):

    4:07.52 (88.23 user, 26.58 sys) 72060k maxres

(A server-side incremental hash function would be better, but there is
no such thing currently, and the only way to hash fully server-side is
to aggregate rows in an array, which would be too expensive memory-wise
for large tables.)
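
To illustrate why ordering matters for the fingerprint, here is a minimal
sketch. It is not the code from this commit. The names row_digest() and
fingerprint() are hypothetical. A truncated SHA-256 stands in for the
per-row hash_record_extended() call, and Python's sorted() stands in for
the server-side ORDER BY. In the hybrid scheme described above the rows
would instead arrive already ordered from PostgreSQL, so nothing would
need to be sorted in client memory.

```python
import hashlib
import struct

def row_digest(row: tuple) -> int:
    """Stand-in for a server-side per-row hash (assumption): signed 64-bit integer."""
    h = hashlib.sha256(repr(row).encode('utf-8')).digest()
    return struct.unpack('>q', h[:8])[0]

def fingerprint(rows) -> str:
    """Fold per-row hashes, taken in a canonical order, into one SHA-256 digest."""
    digest = hashlib.sha256()
    # sorted() plays the role of ORDER BY: the same set of rows always
    # yields the same sequence of per-row hashes, whatever the export order
    for row in sorted(rows):
        digest.update(struct.pack('>q', row_digest(row)))
    return digest.hexdigest()

# Two "daily exports" with identical features in a different order
export_a = [(1, 'POINT(0 0)'), (2, 'POINT(1 1)'), (3, 'POINT(2 2)')]
export_b = [export_a[2], export_a[0], export_a[1]]
print(fingerprint(export_a) == fingerprint(export_b))
```

Only the small per-row hashes are fed to the client-side digest, so the
digest is updated incrementally and its state stays constant-size no
matter how many rows there are.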
author: Guilhem Moulin <guilhem@fripost.org> 2025-05-01 21:20:44 +0200
committer: Guilhem Moulin <guilhem@fripost.org> 2025-05-20 09:51:54 +0200
commit: 12bd18ed5e01a84b03be7c21570bac6547759970 (patch)
tree: ec491f29beca20bc4657f34ae7244b9f52321b0a /webmap-import
parent: 3edce255b3010244ab5d7fae59cbda11926f50f1 (diff)
1 files changed, 4 insertions, 0 deletions
diff --git a/webmap-import b/webmap-import
index 6f514a9..f20fdef 100755
--- a/webmap-import
+++ b/webmap-import
@@ -377,6 +377,10 @@ def validateLayerCacheField(defn : ogr.FeatureDefn, idx : int,
 
 def validateCacheLayer(ds : gdal.Dataset, name : str) -> bool:
     """Validate layer cache table."""
+    drvName = ds.GetDriver().ShortName
+    if drvName != 'PostgreSQL': # we need hash_record_extended(), sha256() and ST_AsEWKB()
+        logging.warning('Unsupported cache layer for output driver %s', drvName)
+        return False
     lyr = ds.GetLayerByName(name)
     if lyr is None:
         logging.warning('Table "%s" does not exist', name)