path: root/import_source.py
Commit message (Author, Age, Files)
* Schema: Add functions to get a list of municipality and county codes. (Guilhem Moulin, 9 days ago, 1 file)
  We subdivide administrative polygons to speed things up, cf. https://symphony.is/about-us/blog/boosting-postgis-performance
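  A minimal sketch of the subdivision approach mentioned above, assuming a hypothetical "municipalities" table with a geometry column (table and column names are illustrative, not the project's actual schema); PostGIS's ST_Subdivide() caps the number of vertices per piece so that spatial-index hits are small geometries:

      # Sketch only: assumes PostGIS; "municipalities", "code" and the DSN
      # are illustrative, not necessarily what the project uses.
      import psycopg2

      SQL = """
          CREATE TABLE IF NOT EXISTS municipalities_subdivided AS
              SELECT code, ST_Subdivide(geom, 255) AS geom
              FROM municipalities;
          CREATE INDEX ON municipalities_subdivided USING GIST (geom);
      """

      with psycopg2.connect("dbname=webmap") as conn:
          with conn.cursor() as cur:
              cur.execute(SQL)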
* webmap-import: Add option to generate Mapbox Vector Tiles (MVT). (Guilhem Moulin, 14 days ago, 1 file)
* Factor out densification logic from getExtent() into own function. (Guilhem Moulin, 14 days ago, 1 file)
  And only densify if need be. Most sources are already in SWEREF 99 (modulo axis mapping strategy), so in practice we can use mere rectangles as spatial filters.
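  A rough illustration of the densification step, not the project's actual helper: build the extent rectangle as an OGR polygon, densify it with Segmentize(), and only then reproject, so the spatial filter follows the curved edges of the rectangle in the target CRS. When source and target CRS already match, the plain rectangle is returned as-is:

      # Illustrative sketch; the step length (in source CRS units) is a
      # made-up default, not a value taken from the project.
      from osgeo import ogr, osr

      def densified_extent(xmin, ymin, xmax, ymax, srs_src, srs_dst, step=1000.0):
          ring = ogr.Geometry(ogr.wkbLinearRing)
          for x, y in [(xmin, ymin), (xmax, ymin), (xmax, ymax),
                       (xmin, ymax), (xmin, ymin)]:
              ring.AddPoint_2D(x, y)
          poly = ogr.Geometry(ogr.wkbPolygon)
          poly.AddGeometry(ring)
          if srs_src.IsSame(srs_dst):
              return poly                      # no reprojection: mere rectangle
          poly.Segmentize(step)                # add vertices along the edges
          poly.Transform(osr.CoordinateTransformation(srs_src, srs_dst))
          return poly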
* Move part of the fingerprinting logic into PostgreSQL when possible. (Guilhem Moulin, 2025-05-20, 1 file)
  This allows ordering features before hashing, which is required for layers from Naturvårdsverket and Skogsstyrelsen (features appear to be randomly ordered in daily exports, so normalization and fingerprinting are needed to detect whether there are any changes).

  On the downside, this makes the cache a PostgreSQL-only feature. It's also marginally slower than the old logic because for some reason PostgreSQL doesn't seem to use the UNIQUE index and instead does a seq scan followed by a quicksort.

  Without fingerprinting logic:

      $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
            --cachedir=/var/cache/webmap \
            --lockfile=/run/lock/webmap/lock \
            --lockdir-sources=/run/lock/webmap/cache \
            --force \
            "sks:UtfordAvverk"
      […]
      INFO: Layer "sks:UtfordAvverk" has 313044 features
      […]
      3:54.45 (85.28 user, 26.19 sys) 72520k maxres

  With the old fingerprinting logic (full client-side SHA-256 digest of features as they are being imported):

      $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
            --cachedir=/var/cache/webmap \
            --lockfile=/run/lock/webmap/lock \
            --lockdir-sources=/run/lock/webmap/cache \
            --force \
            "sks:UtfordAvverk"
      […]
      INFO: Imported 313044 features from source layer "UtfordAvverkningYta"
      […]
      INFO: Updated layer "sks:UtfordAvverk" has new fingerprint e655a97a
      4:15.65 (108.46 user, 26.73 sys) 80672k maxres

  With the new fingerprinting logic (hybrid client/server SHA-256 digest and hash_record_extended() calls after the import process):

      $ time -f "%E (%U user, %S sys) %Mk maxres" /usr/local/bin/webmap-import \
            --cachedir=/var/cache/webmap \
            --lockfile=/run/lock/webmap/lock \
            --lockdir-sources=/run/lock/webmap/cache \
            --force \
            "sks:UtfordAvverk"
      […]
      INFO: Layer "sks:UtfordAvverk" has 313044 features
      […]
      4:30.77 (87.02 user, 25.67 sys) 72856k maxres

  Same but without ORDER BY (or ORDER BY ogc_fid):

      4:07.52 (88.23 user, 26.58 sys) 72060k maxres

  (A server-side incremental hash function would be better, but there is no such thing currently, and the only way to hash fully server-side is to aggregate rows in an array, which would be too expensive memory-wise for large tables.)
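  A hedged sketch of what such a hybrid client/server digest can look like (table and column names are illustrative; hash_record_extended() is available in recent PostgreSQL releases): each row is hashed server-side after ordering, and the per-row hashes are folded into a client-side SHA-256:

      # Sketch under assumptions: identifiers are interpolated for brevity
      # and are illustrative, not the project's actual names.
      import hashlib
      import psycopg2

      def layer_fingerprint(conn, table="sks_utfordavverk", order_by="ogc_fid"):
          digest = hashlib.sha256()
          with conn.cursor(name="fingerprint") as cur:   # server-side cursor
              cur.execute(
                  f"SELECT hash_record_extended(t, 0) "
                  f"FROM (SELECT * FROM {table} ORDER BY {order_by}) AS t"
              )
              for (row_hash,) in cur:
                  # fold each 64-bit server-side row hash into the digest
                  digest.update(row_hash.to_bytes(8, "little", signed=True))
          return digest.hexdigest()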
* importSources(): Return either success, error, or no change. (Guilhem Moulin, 2025-05-01, 1 file)
  That way we can detect when the import of all layers is a no-op (besides changing last_updated) and exit gracefully.
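  A minimal sketch of the tri-state result (names are illustrative, not necessarily the module's):

      # Illustrative only; the real importSources() may use different names.
      from enum import Enum, auto

      class ImportStatus(Enum):
          SUCCESS = auto()
          ERROR = auto()
          NOCHANGE = auto()

      def summarize(statuses):
          """Collapse per-layer statuses into an overall outcome."""
          if any(s is ImportStatus.ERROR for s in statuses):
              return ImportStatus.ERROR
          if all(s is ImportStatus.NOCHANGE for s in statuses):
              return ImportStatus.NOCHANGE   # only last_updated changed
          return ImportStatus.SUCCESS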
* webmap-import: Fingerprint destination layers to detect changes. (Guilhem Moulin, 2025-05-01, 1 file)
  Comparing modification times is not enough since some sources (for instance Naturvårdsverket's SCI_Rikstackande) are updated on the server even though no objects are being added; the source layer remains unchanged but the file differs because of OBJECTID changes we are not interested in.
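  A sketch of a purely client-side fingerprint along these lines (illustrative, and only stable when feature order is stable, which is exactly the limitation addressed later by moving part of the hashing into PostgreSQL): hash the geometries and attribute values of the destination layer rather than comparing file mtimes:

      # Illustrative sketch; assumes an OGR layer whose feature order is stable.
      import hashlib
      from osgeo import ogr

      def fingerprint_layer(lyr):
          digest = hashlib.sha256()
          lyr.ResetReading()
          for feature in lyr:
              geom = feature.GetGeometryRef()
              if geom is not None:
                  digest.update(geom.ExportToWkb())
              for i in range(feature.GetFieldCount()):
                  digest.update(str(feature.GetField(i)).encode("utf-8"))
          return digest.hexdigest()[:8]        # short form, e.g. "e655a97a"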
* Move layer transactional logic to importSources(). (Guilhem Moulin, 2025-04-24, 1 file)
  It's much clearer that way. The destination layer is cleared and updated in that function, so it makes sense if that's also where transactions (or SAVEPOINTs) are committed or rolled back.
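  A rough sketch of a per-layer savepoint pattern (illustrative names; assumes the PostgreSQL driver and an enclosing transaction opened by the caller, not the actual structure of importSources()):

      # Illustrative only.  Assumes dso was opened with the PG driver and
      # that a transaction is already open.
      def import_with_savepoint(dso, layername, do_import):
          dso.ExecuteSQL(f'SAVEPOINT "{layername}"')
          try:
              lyr = dso.GetLayerByName(layername)
              do_import(lyr)                    # clear + re-import the layer
              dso.ExecuteSQL(f'RELEASE SAVEPOINT "{layername}"')
          except Exception:
              dso.ExecuteSQL(f'ROLLBACK TO SAVEPOINT "{layername}"')
              raise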
* Change layer cache logic to target destination layers rather than sources. (Guilhem Moulin, 2025-04-24, 1 file)
  In a future commit we'll fingerprint layers to detect changes. Comparing modification times is not enough since some sources (for instance Naturvårdsverket's SCI_Rikstackande) are updated on the server even though no objects are being added; the source layer remains unchanged but the file differs because of OBJECTID changes we are not interested in.

  Rather than using another cache layer/table for fingerprints, we cache destination layer names rather than triplets (source_path, archive_member, layername), along with the time at which the import was started rather than source_path's mtime.

  There is indeed no value in having source_path's exact mtime in the cache. What we need is simply a way to detect whether source paths have been updated in a subsequent run. Thanks to the shared locks, the ctime of any updated source path will be at least the time when the locks are released, thereby exceeding the last_updated value.
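  A sketch of the staleness check this cache enables, assuming the cache stores a POSIX timestamp per destination layer (function and parameter names are illustrative):

      # Illustrative sketch; relies on the ctime argument made above.
      import os

      def needs_update(source_paths, last_updated, force=False):
          """last_updated: import start time recorded in the cache, or None."""
          if force or last_updated is None:
              return True
          for path in source_paths:
              st = os.stat(path)
              # a source updated after the previous run has an mtime/ctime
              # later than the recorded import start time
              if max(st.st_mtime, st.st_ctime) > last_updated:
                  return True
          return False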
* webmap-import: Add a cache layer and store the source file's last modification time. (Guilhem Moulin, 2025-04-23, 1 file)
  That way we can avoid the expensive unpack+import when the source file(s) have not been updated since the last run. The check can be bypassed with a new flag `--force`.

  We use a sequence for the FIDs (primary key) and a UNIQUE constraint on triplets (source_path, archive_member, layername), as GDAL doesn't support multicolumn primary keys.

  To avoid races between the stat(2) calls, gdal.OpenEx() and updates via `webmap-download` runs, we place a shared lock on the downloaded files. One could resort to some tricks to eliminate the race between the first two, but there is also some value in having consistency during the entire execution of the script (a single source file can be used by multiple layers, for instance, and it makes sense to use the very same file for all layers in that case).

  We also intersperse dso.FlushCache() calls between _importSource() calls in order to force the PG driver to call EndCopy() to detect errors and trigger a rollback when _importSource() fails.
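  A minimal sketch of the shared-lock pattern described above (illustrative; the actual locking code and lock layout under /run/lock/webmap/ may differ):

      # Hold a shared (read) lock on a downloaded source file for the whole
      # run; webmap-download is assumed to take an exclusive lock while
      # updating the file.
      import fcntl
      from contextlib import contextmanager

      @contextmanager
      def shared_lock(path):
          fd = open(path, "rb")
          try:
              fcntl.flock(fd.fileno(), fcntl.LOCK_SH)
              yield fd
          finally:
              fcntl.flock(fd.fileno(), fcntl.LOCK_UN)
              fd.close()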
* webmap-import: Break down into separate modules. (Guilhem Moulin, 2025-04-21, 1 file)