jankaul.bsky.social • 191 days ago

I think there should be an option to use Iceberg without a catalog. This would free you from any lock-in and give you the Open table format that everybody wants.

Comments

buremba.bsky.social•191 days ago

If you are operating with a single table and know the path of an Iceberg metadata file, you don’t need a catalog. Here is an example: https://duckdb.org/docs/extensions/iceberg.html#querying-individual-tables
A catalog is for features such as time travel and atomic update/merge/insert.
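
To make the "you only need the metadata file path" point concrete, here is a minimal sketch (not how DuckDB or any other engine actually does it) of resolving the current snapshot from a table metadata JSON document. The field names `current-snapshot-id`, `snapshots`, `snapshot-id` and `manifest-list` come from the Iceberg table spec; the sample document, bucket and file paths are made up for illustration.

```python
import json

# Hypothetical, heavily trimmed table metadata. Real metadata files carry
# many more fields (schemas, partition specs, properties, ...).
SAMPLE_METADATA = json.dumps({
    "format-version": 2,
    "location": "s3://example-bucket/warehouse/events",
    "current-snapshot-id": 2,
    "snapshots": [
        {
            "snapshot-id": 1,
            "timestamp-ms": 1700000000000,
            "manifest-list": "s3://example-bucket/warehouse/events/metadata/snap-1.avro",
        },
        {
            "snapshot-id": 2,
            "timestamp-ms": 1700003600000,
            "manifest-list": "s3://example-bucket/warehouse/events/metadata/snap-2.avro",
        },
    ],
})


def current_manifest_list(metadata_json: str) -> str:
    """Return the manifest-list path of the table's current snapshot."""
    metadata = json.loads(metadata_json)
    current_id = metadata.get("current-snapshot-id")
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise LookupError("table metadata has no current snapshot")


print(current_manifest_list(SAMPLE_METADATA))
```

A reader can follow the manifest list to the manifests and data files from there. What this sketch cannot do is tell you which metadata file is the latest one, or let two writers agree on the next one; that pointer is what a catalog keeps track of.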

jankaul.bsky.social•191 days ago

Well, ideally it would be complete read and write support.

buremba.bsky.social•191 days ago

@felixscherz.bsky.social already created the draft PR for pyiceberg here: https://github.com/apache/iceberg-python/issues/1404
I think the right way would be S3 adopting the Iceberg REST protocol natively, but this would be the alternative.

buremba.bsky.social•191 days ago

Assuming you refer to S3Tables, I believe you already know it better than me :)

jakthom.bsky.social•191 days ago

Is that just a parquet file then? Or something-something hive partitioned parquet files?

jankaul.bsky.social•190 days ago

Every commercial data warehouse stores additional metadata like upper & lower bounds, statistics, and distinct counts on top of the actual data files to assist the query optimizer.
Iceberg is an open standard for this kind of metadata and provides speed ups over plain parquet.

squarecog.bsky.social•190 days ago

Such metadata is stored in the parquet file already.
Iceberg helps with updates and compaction, plus caching said metadata for many pq files.

jankaul.bsky.social•190 days ago

Well, iceberg makes this metadata available at a higher level, in the manifest-list and manifest files, which means that you don't have to read all the parquet files.
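
A toy sketch of what that buys you, assuming a single integer filter column: given per-file lower and upper bounds collected in one place, a planner can discard data files without opening them. The `DataFileEntry` class and the file names are hypothetical. Real Iceberg manifests store bounds per field id in a serialized binary form, so this only models the idea, not the file format.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Hypothetical stand-in for one data-file entry in a manifest."""
    path: str
    lower: int  # lower bound of the filter column within this file
    upper: int  # upper bound of the filter column within this file


def files_to_scan(entries, lo, hi):
    """Keep only files whose [lower, upper] range overlaps the query range [lo, hi]."""
    return [e.path for e in entries if e.upper >= lo and e.lower <= hi]


ENTRIES = [
    DataFileEntry("data/a.parquet", 0, 99),
    DataFileEntry("data/b.parquet", 100, 199),
    DataFileEntry("data/c.parquet", 200, 299),
]

# Query: WHERE col BETWEEN 150 AND 210
print(files_to_scan(ENTRIES, 150, 210))
```

With plain parquet, the same bounds exist, but they sit in each file's footer, so pruning means touching every file at least once.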

squarecog.bsky.social•190 days ago

Your reply implied iceberg adds the metadata, not just exposes it at the table, vs file, level. A less aware reader could be misled by your statement and not realize stats exist in parquet, which is how duckdb et al can be so effective with it. Stats alone are not a good reason to roll out iceberg.

squarecog.bsky.social•190 days ago

I would've said duckdb and datafusion, but, character limit 😁

jankaul.bsky.social•189 days ago

You're right, I wasn't entirely clear.

jdlong.cerebralmastication.com•190 days ago

Correct. Column stats are in parquet. Iceberg is for upserts, deletions, “time travel”, and later compaction of all this.

Time travel meaning the ability to see the data as of some point in the past.

Iceberg brings the missing table functions to parquet.
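
The time travel part can be sketched roughly like this, assuming you already have the table's snapshot log in hand: pick the newest snapshot committed at or before the requested time. The `snapshot-log`, `timestamp-ms` and `snapshot-id` names are from the Iceberg table spec; the log contents and the helper name `snapshot_as_of` are made up for illustration.

```python
def snapshot_as_of(snapshot_log, timestamp_ms):
    """Return the id of the newest snapshot committed at or before timestamp_ms.

    Returns None if the table had no snapshot yet at that time.
    """
    candidates = [e for e in snapshot_log if e["timestamp-ms"] <= timestamp_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["timestamp-ms"])["snapshot-id"]


# Hypothetical snapshot log: three commits, one hour apart.
SNAPSHOT_LOG = [
    {"timestamp-ms": 1700000000000, "snapshot-id": 1},
    {"timestamp-ms": 1700003600000, "snapshot-id": 2},
    {"timestamp-ms": 1700007200000, "snapshot-id": 3},
]

# Thirty minutes after the second commit, the table still looks like snapshot 2.
print(snapshot_as_of(SNAPSHOT_LOG, 1700003600000 + 30 * 60 * 1000))
```

Because old snapshots keep pointing at their own data files, reading "as of" a past time is just a matter of choosing a different snapshot id, as long as that snapshot has not been expired.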
