I think there should be an option to use Iceberg without a catalog. This would free you from any lock-in and give you the open table format that everybody wants.
Comments
@felixscherz.bsky.social already created the draft PR for pyiceberg here: https://github.com/apache/iceberg-python/issues/1404
I think the right way would be S3 adopting Iceberg REST protocol natively but this would be the alternative.
Every commercial data warehouse stores additional metadata like upper & lower bounds, statistics, and distinct counts on top of the actual data files to assist the query optimizer.
Iceberg is an open standard for this kind of metadata and provides speedups over plain Parquet.
Well, Iceberg makes this metadata available at a higher level, in the manifest-list and manifest files, which means that you don't have to open every Parquet file to decide which ones a query needs.
Your reply implied that Iceberg adds the metadata, rather than just exposing it at the table level instead of the file level. A less aware reader could be misled by your statement and not realize that stats already exist in Parquet, which is how DuckDB et al. can be so effective with it. Stats alone are not a good reason to roll out Iceberg.
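To make the table-level vs. file-level point concrete, here is a minimal sketch of pruning on per-file bounds kept in a manifest-like list. The `DataFileEntry` class, the `prune_files` helper, and the file paths are hypothetical and only illustrate the idea; this is not the pyiceberg API or the real manifest layout.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Hypothetical stand-in for one data-file entry in a manifest."""
    path: str
    lower_bound: int  # minimum value of the filter column in this file
    upper_bound: int  # maximum value of the filter column in this file


def prune_files(entries, lo, hi):
    """Return paths of files whose [lower_bound, upper_bound] overlaps [lo, hi]."""
    return [
        e.path
        for e in entries
        if e.upper_bound >= lo and e.lower_bound <= hi
    ]


# Assumed example data: three files with non-overlapping value ranges.
manifest = [
    DataFileEntry("data/a.parquet", 1, 100),
    DataFileEntry("data/b.parquet", 101, 200),
    DataFileEntry("data/c.parquet", 201, 300),
]

# A filter like "150 <= col <= 180" only needs one of the three files.
selected = prune_files(manifest, 150, 180)
print(selected)
```

The same min/max values also live in each Parquet footer. The difference in this sketch is only that the planner can rule files out from one small list, without opening each file first.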
The catalog is for features such as time travel and atomic update/merge/insert.
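As a rough illustration of the atomic part: a commit can be modeled as a compare-and-swap on a pointer to the table's current metadata file. The `InMemoryCatalog` class and the metadata file names below are hypothetical, a sketch of the idea rather than any real catalog implementation.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        """Return the current metadata location, or None if the table is new."""
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # someone else committed first; caller must retry
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
first = catalog.commit("events", None, "metadata/v1.json")

# Two writers both read v1 and then try to commit their own new version.
writer_a = catalog.commit("events", "metadata/v1.json", "metadata/v2-a.json")
writer_b = catalog.commit("events", "metadata/v1.json", "metadata/v2-b.json")

print(first, writer_a, writer_b, catalog.current("events"))
```

Only one of the two concurrent writers wins. The loser sees the failed swap and has to re-read the table state and retry, which is the guarantee that is hard to get from plain object storage without some component playing this role.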
Iceberg helps with updates and compaction, plus caching said metadata for many Parquet files.
Time travel meaning the ability to see the data as of some point in the past.
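A small sketch of what "as of some point in the past" can mean in practice: given a log of snapshots sorted by commit time, pick the latest one at or before the requested timestamp. The snapshot IDs, timestamps, and the `snapshot_as_of` helper are made up for illustration and are not taken from any real table or library.

```python
from bisect import bisect_right

# Assumed snapshot log: (commit timestamp in ms, snapshot id), sorted by time.
snapshots = [(1000, 1), (2000, 2), (3000, 3)]


def snapshot_as_of(snapshots, ts_ms):
    """Return the id of the latest snapshot committed at or before ts_ms."""
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, ts_ms)
    if idx == 0:
        return None  # the table did not exist yet at that time
    return snapshots[idx - 1][1]


print(snapshot_as_of(snapshots, 2500))
```

A reader would then resolve the files belonging to that snapshot instead of the current one, so older data stays queryable as long as the snapshot and its files have not been expired.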
Iceberg brings the missing table functions to Parquet.