Benefits of Apache Iceberg for geospatial data analysis

Original link: https://wherobots.com/blog/benefits-of-apache-iceberg-for-geospatial-data-analysis/

Apache Iceberg's support for geometry data columns is a breakthrough that addresses the scalability problems traditional formats face with geospatial data. Built on Parquet, Iceberg offers fast reads, scalability beyond memory limits, and database-style DML operations (insert, update, delete), with transactions guaranteeing data consistency. Its main benefits include reliable geospatial transactions, geospatial delete operations, geospatial time travel, geospatial upsert operations, geospatial schema enforcement, geospatial schema evolution, efficient geospatial file listing operations, small file compaction for geospatial files, and Iceberg's merge-on-read operations. Unlike data lakes, Iceberg provides schema enforcement, schema evolution, and compaction, improving data integrity and query performance. Time travel enables analysis of historical data. Iceberg's merge-on-read DML operations reduce DML runtimes. Ultimately, Iceberg tables offer a more efficient and reliable way to store and manage geospatial data than traditional data lakes.

Original article

Apache Iceberg is an open table format that recently added support for geometry data columns, which is game-changing for geospatial data users both big and small.  Today, spatial data users have a problem if they need to scale above about a million features: 

  • Traditional file formats and row-oriented databases perform poorly compared to non-spatial equivalents like Parquet. 
  • Some solutions only work well for data that comfortably fits in memory.
  • Other solutions only work well for data that does not change.

Geometry in Iceberg allows for the best of both worlds: 

  • It is built on Parquet, offers lightning fast reads, and is scalable for larger than memory data sets.
  • It offers developer-friendly features like DML operations (insert, update, merge, and delete) that spatial users might expect from a database-backed spatial file format. 
  • Beyond that, it implements extra features like versioning and time travel to allow queries against current or historical data.

Let’s take a look at these features!

Reliable geospatial transactions with Iceberg

Iceberg provides reliable transactions: data operations either complete fully and successfully or do not take effect at all.

Data lakes don’t support transactions, which can cause issues:

  • Suppose you’re writing to a data lake, and the compute engine errors in the middle of the write. The partially written files will corrupt the data lake and require manual cleanup.
  • Data lakes are unreadable while DML operations are running (e.g. while files are being written).
  • Data lakes don’t provide concurrency protection.

Developers who usually work with databases are often surprised by these data lake limitations because databases have supported transactions for decades.

Lakehouse storage systems significantly improve data lakes with reliable transactions.

Create an Iceberg table with geometry data

Let’s create an Iceberg table with a few linestrings.

You can start by creating the Iceberg table:

        CREATE TABLE local.db.icetable (id string, geometry geometry)
USING iceberg
TBLPROPERTIES('format-version'='3');
    

Append lines a and b to the table:

        # Assumes a SedonaContext named `sedona` has already been created.
from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText
 
df = sedona.createDataFrame([
    ("a", 'LINESTRING(1.0 3.0,3.0 1.0)'),
    ("b", 'LINESTRING(2.0 5.0,6.0 1.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
 
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")
    

Now append lines c and d to the table:

        df = sedona.createDataFrame([
    ("c", 'LINESTRING(7.0 4.0,9.0 2.0)'),
    ("d", 'LINESTRING(2.0 7.0,4.0 9.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
 
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")
    

Finally append lines e and f to the table:

        df = sedona.createDataFrame([
    ("e", 'LINESTRING(6.0 7.0,6.0 9.0)'),
    ("f", 'LINESTRING(8.0 7.0,9.0 9.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
 
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")
    

Check the content of the table:

          sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False)
 
+---+---------------------+
|id |geometry             |
+---+---------------------+
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
|c  |LINESTRING (7 4, 9 2)|
|d  |LINESTRING (2 7, 4 9)|
+---+---------------------+
    

Geospatial delete operations with Iceberg

Iceberg makes it easy to delete rows of data in a table based on a predicate.  Suppose, for example, that you’d like to delete all the linestrings in the table that cross a polygon.

Here’s how you can delete all the linestrings that cross the polygon:

        polygon = "POLYGON((3.0 2.0, 3.0 5.0, 8.0 5.0, 8.0 2.0, 3.0 2.0))"
sql = f"DELETE FROM local.db.icetable WHERE ST_Crosses(geometry, ST_GeomFromWKT('{polygon}'))"
sedona.sql(sql)
    

Check the table to see that linestrings b and c have been deleted.

        sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False)
 
+---+---------------------+
|id |geometry             |
+---+---------------------+
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
|d  |LINESTRING (2 7, 4 9)|
|a  |LINESTRING (1 3, 3 1)|
+---+---------------------+
    

Iceberg delete operations are much better than what’s offered by data lakes.

Data lakes don’t allow you to delete rows in a table. Instead, you need to filter and overwrite the whole table or run a tedious manual process to identify the files you need to rewrite. This process is error-prone, requires downtime for the data lake table, and could lead to irreversible data loss (if a file is accidentally deleted and not backed up, for example).
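To make the contrast concrete, here is a rough sketch of the plain-Parquet workflow described above, assuming the same `sedona` session as the earlier examples; the paths and the `geometry_wkt` column are hypothetical.

from pyspark.sql.functions import expr
 
# In a plain Parquet data lake, "deleting" rows means reading everything,
# filtering, and rewriting the whole dataset.
lines = sedona.read.parquet("/data/lines_parquet")
kept = lines.where(~expr(
    "ST_Crosses(ST_GeomFromWKT(geometry_wkt), "
    "ST_GeomFromWKT('POLYGON((3 2, 3 5, 8 5, 8 2, 3 2))'))"
))
# Rewrite to a new location, then manually swap the directories while readers
# are paused: the downtime-and-manual-cleanup step that Iceberg avoids.
kept.write.mode("overwrite").parquet("/data/lines_parquet_rewritten")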

This Iceberg geospatial delete operation does not require downtime and is a reliable transaction.

Geospatial time travel with Iceberg

Iceberg also allows for time travel between different versions of a table.  Let’s see all the current versions of the Iceberg table:

        sql = "SELECT snapshot_id, committed_at, operation FROM local.db.icetable.snapshots;"
sedona.sql(sql).show(truncate=False)
 
+-------------------+-----------------------+---------+
|snapshot_id        |committed_at           |operation|
+-------------------+-----------------------+---------+
|1978905766678632864|2025-03-13 13:54:46.057|append   |
|8271841988321279736|2025-03-13 13:55:03.642|append   |
|9187521890414340610|2025-03-13 13:55:19.286|append   |
|1785310483219777508|2025-03-13 13:55:33.908|delete   |
+-------------------+-----------------------+---------+
    

Let’s check the contents of the table before the delete operation was run:

        sql = "SELECT * FROM local.db.icetable FOR SYSTEM_VERSION AS OF 9187521890414340610;"
sedona.sql(sql).show(truncate=False)

+---+---------------------+
|id |geometry             |
+---+---------------------+
|c  |LINESTRING (7 4, 9 2)|
|d  |LINESTRING (2 7, 4 9)|
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
+---+---------------------+

    

And now let’s check the table state after the first append operation:

        sql = "SELECT * FROM local.db.icetable FOR SYSTEM_VERSION AS OF 1978905766678632864;"
sedona.sql(sql).show(truncate=False)

+---+---------------------+
|id |geometry             |
+---+---------------------+
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
+---+---------------------+
    

A new table version is created every time an Iceberg transaction is completed.

Suppose you have a geospatial table and append new data daily.  You’re training a model on the data and are surprised that the model’s conclusions changed last week.  You’re unsure whether the results changed because the data changed or because the model code changed.  With time travel, you can run the latest model code on an earlier data version to help analyze the root cause.
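For that kind of debugging, an earlier snapshot can also be loaded through the DataFrame API; a small sketch, reusing one of the snapshot ids listed above:

# Load the table as of the third append and re-run the latest model code on it.
old_df = sedona.read.option("snapshot-id", 9187521890414340610).table("local.db.icetable")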

Time travel is commonly helpful in the following scenarios:

  • Regulatory audit purposes
  • Model training on different versions of a table
  • Isolating specific versions of a table while iterating on model code 

Iceberg lets you expire old snapshots, which reclaims storage but limits how far back you can time travel.  Therefore, make sure you set each table’s snapshot retention properties correctly.
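Retention can be controlled with table properties, and old snapshots can be removed with Iceberg’s expire_snapshots procedure.  A minimal sketch, assuming the `local` catalog from the examples above (the retention values are illustrative):

# Keep at least 10 snapshots and roughly 5 days of history.
sedona.sql("""
    ALTER TABLE local.db.icetable SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '432000000',
        'history.expire.min-snapshots-to-keep' = '10'
    )
""")
# Physically remove snapshots that fall outside the retention policy.
sedona.sql("CALL local.system.expire_snapshots(table => 'db.icetable', retain_last => 10)")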

Geospatial upsert operations with Iceberg

Iceberg also supports upsert operations, which let you update existing rows and insert new rows in a single operation. Upserts are especially useful for tables that are updated incrementally on a regular schedule.

Here are the current contents of the Iceberg table:

        +---+---------------------+
|id |geometry             |
+---+---------------------+
|d  |LINESTRING (2 7, 4 9)|
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
|a  |LINESTRING (1 3, 3 1)|
+---+---------------------+
    

Let’s perform an upsert with the following data:

        +---+---------------------+
|id |geometry             |
+---+---------------------+
|d  |LINESTRING (2 7, 4 9)| # duplicate
|e  |LINESTRING (7 7, 6 9)| # updated geometry
|z  |LINESTRING (6 7, 6 9)| # new data
+---+---------------------+
    

Here’s how the upsert should behave:

  • New data should be appended
  • Existing data should be updated
  • Duplicate data should be ignored

Here’s the code to execute this operation:

        # Register the incoming rows as a temporary view named "source" so the MERGE
# below can reference it (this setup step is implied by the data shown above).
source_df = sedona.createDataFrame([
    ("d", 'LINESTRING(2.0 7.0,4.0 9.0)'),  # duplicate
    ("e", 'LINESTRING(7.0 7.0,6.0 9.0)'),  # updated geometry
    ("z", 'LINESTRING(6.0 7.0,6.0 9.0)'),  # new data
], ["id", "geometry"])
source_df = source_df.withColumn("geometry", ST_GeomFromText(col("geometry")))
source_df.createOrReplaceTempView("source")
 
merge_sql = """
MERGE INTO local.db.icetable target
USING source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET
        target.geometry = source.geometry
WHEN NOT MATCHED THEN
    INSERT (id, geometry) VALUES (source.id, source.geometry)
"""
sedona.sql(merge_sql)
    

Here are the contents of the table after running this operation:

        +---+---------------------+
|id |geometry             |
+---+---------------------+
|a  |LINESTRING (1 3, 3 1)|
|d  |LINESTRING (2 7, 4 9)|
|e  |LINESTRING (7 7, 6 9)|
|z  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
+---+---------------------+
    

The MERGE command has many other practical applications for geospatial tables.
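For example, an incremental feed can upsert changed geometries and remove features flagged as deleted in a single statement; a sketch, assuming a hypothetical `updates` view that carries an extra boolean `deleted` column:

sedona.sql("""
MERGE INTO local.db.icetable target
USING updates source
ON target.id = source.id
WHEN MATCHED AND source.deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET target.geometry = source.geometry
WHEN NOT MATCHED AND NOT source.deleted THEN
    INSERT (id, geometry) VALUES (source.id, source.geometry)
""")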

Geospatial schema enforcement with Iceberg

Iceberg supports schema enforcement: if you try to append a DataFrame whose schema does not match the table’s schema, the write errors out instead of corrupting the table.

Let’s create a DataFrame with a different schema than the table:

        df = sedona.createDataFrame([
    ("x", 2, 'LINESTRING(8.0 8.0,3.0 3.0)'),
    ("y", 3, 'LINESTRING(5.0 5.0,1.0 1.0)'),
], ["id", "num", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
    

Now try to append the DataFrame to the table:

        df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")
    

Here is the error:

        AnalysisException: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `local`.`db`.`icetable`, the reason is too many data columns:
Table columns: `id`, `geometry`.
Data columns: `id`, `num`, `geometry`.
    

The append operation is prohibited because the DataFrame schema differs from the Iceberg table schema.

Data lakes don’t have built-in schema enforcement, so you can append data with mismatched schema, which can corrupt a table or require developers to use specific syntax when reading the table.

Schema enforcement is a nice feature that protects the integrity of your data tables.

Geospatial schema evolution with Iceberg

Iceberg also allows you to evolve a table’s schema when you want to add or remove columns.

Add a num column to the Iceberg table:

        sql = "ALTER TABLE local.db.icetable ADD COLUMNS (num INTEGER);"
sedona.sql(sql)
    

Now try to append the DataFrame with id, num, and geometry columns to the Iceberg table:

        df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")
    

The append operation now works since the schema was evolved.  Take a look at the contents of the table now:

        sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False)
 
+---+---------------------+----+
|id |geometry             |num |
+---+---------------------+----+
|e  |LINESTRING (6 7, 6 9)|NULL|
|f  |LINESTRING (8 7, 9 9)|NULL|
|d  |LINESTRING (2 7, 4 9)|NULL|
|x  |LINESTRING (8 8, 3 3)|2   |
|y  |LINESTRING (5 5, 1 1)|3   |
|a  |LINESTRING (1 3, 3 1)|NULL|
+---+---------------------+----+
    

A plain data lake has no managed schema to evolve. Developers must manually specify the schema when they read the data, or the engine must infer it.

It’s problematic for engines to infer the schema because it can be slow if all the data is read or wrong if only a subset of the data is sampled to determine the schema.
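For comparison, here is a sketch of the data-lake workarounds just described, with a hypothetical path and columns: you either hand the reader an explicit schema or pay for inference and schema merging at read time.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
 
# Option 1: spell the schema out by hand for every reader.
manual_schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType()),
    StructField("geometry", StringType()),  # WKT in a plain data lake
])
df = sedona.read.schema(manual_schema).parquet("/data/lines_parquet")
 
# Option 2: let the engine reconcile differing file schemas, at the cost of extra reads.
df = sedona.read.option("mergeSchema", "true").parquet("/data/lines_parquet")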

Efficient geospatial file listing operations with Iceberg

When an engine queries a data lake, it must perform a file listing operation, read the metadata from each file’s footer, and only then execute the query.  For large datasets, file listing operations can be relatively slow, and they take even longer when the dataset uses deeply nested Hive-style partitioning or lives in a cloud object store.

File listing operations are faster with Lakehouse storage systems like Iceberg. The engine gets all the file paths and file-level statistics directly from the table’s metadata, which is much faster than performing a separate file listing operation.
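You can see this file-level metadata yourself by querying Iceberg’s `files` metadata table (only a few of its columns are selected here):

sedona.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM local.db.icetable.files
""").show(truncate=False)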

Small file compaction of geospatial files with Iceberg

Query engines don’t perform as efficiently when there are many small files.

A table composed of ten 1 GB files usually performs much better than a table with 10,000 1 MB files.

Iceberg supports small file compaction, rewriting many small files into fewer larger files and eliminating the small file problem.

You can compact the small files in an Iceberg table using the rewrite_data_files stored procedure.

        CALL system.rewrite_data_files(table => 'local.db.icetable')
    

Data lakes don’t support compaction, so you must manually compact the small files, which requires table downtime and is error-prone.

Iceberg merge on read operations

Iceberg supports both copy-on-write and merge-on-read for DML operations.  Copy-on-write operations immediately rewrite all affected data files, which is slow.  Lakehouse storage systems store data in immutable Parquet files, so rewriting entire files is necessary for some DML operations.

Merge-on-read implements DML operations differently.  Rather than rewriting the underlying data files, merge-on-read operations output deletion vector files with the diffs.  Subsequent reads are slightly slower because the data files and deletion vector files must be “merged,” but this strategy dramatically reduces the runtime of DML operations.

Merge-on-read generally offers a better set of tradeoffs than copy-on-write.  Iceberg supports merge-on-read, and data lakes do not.

Here’s how to set the DML write modes on an Iceberg table, enabling merge-on-read for delete operations in this example:

        ALTER TABLE local.db.icetable SET TBLPROPERTIES (
    'write.delete.mode'='merge-on-read',
    'write.update.mode'='copy-on-write',
    'write.merge.mode'='copy-on-write'
);
    

Conclusion

Lakehouses bring many new features to the geospatial data community.

An Iceberg table is almost always a better option than a Parquet data lake because it offers many advantages. If you already have Parquet tables, converting them into Iceberg tables is straightforward.
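As a sketch of one such conversion, the Parquet data can be read directly and written into a new Iceberg table with a CTAS statement; the source path, the `geometry_wkt` column, and the `local.db.roads` table name are hypothetical (Iceberg’s migrate, snapshot, and add_files procedures are other options):

sedona.sql("""
    CREATE TABLE local.db.roads
    USING iceberg
    TBLPROPERTIES ('format-version' = '3')
    AS SELECT id, ST_GeomFromWKT(geometry_wkt) AS geometry
    FROM parquet.`/data/roads_parquet`
""")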

It’s great to see how the geospatial data community collaborated closely with the Iceberg community to update the Iceberg spec to support geometry columns. The community can now start investing in spatial tables in Lakehouses, which are more performant and reliable in production applications.

