In DuckDB, if you have a 500M-row table and you want to estimate its size as a CSV, how would you do it?
I was originally using bit_length with VARCHAR casting, but this is really expensive. Maybe just adding a LIMIT and inflating would give an estimate, but that's error-prone because of drift (the rows you happen to read may not be representative of the rest).
Any ideas?
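Rough sketches of the two approaches described above, against a hypothetical table t with columns a, b, c (the column names, the LIMIT size, and the + 3 for two delimiters plus a newline are all illustrative; NULLs and CSV quoting are ignored). strlen gives the byte count directly, i.e. bit_length / 8:

    -- Exact but expensive: cast every column to VARCHAR and sum byte lengths over all 500M rows.
    SELECT sum(
             strlen(CAST(a AS VARCHAR))
           + strlen(CAST(b AS VARCHAR))
           + strlen(CAST(c AS VARCHAR))
           + 3) AS csv_bytes
    FROM t;

    -- Cheap but drift-prone: measure a fixed prefix and scale by the total row count.
    WITH prefix AS (
        SELECT sum(
                 strlen(CAST(a AS VARCHAR))
               + strlen(CAST(b AS VARCHAR))
               + strlen(CAST(c AS VARCHAR))
               + 3) AS bytes,
               count(*) AS n
        FROM (SELECT * FROM t LIMIT 100000) AS prefix_rows
    )
    SELECT bytes * (SELECT count(*) FROM t) / n AS estimated_csv_bytes
    FROM prefix;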
Comments
duckdb-wasm has a brutally hard cap here.
Lots of varchar columns?
The size of the export is dominated by the varchar fields, and those need their average length calculated. That is really hard to do quickly, and even sampling requires a full table scan, so it has a fixed cost even if the sample size is small.
1. Must be pure SQL (DuckDB)
2. Must run in < 1 sec for any size table
3. A target error of +/- 10% is great!
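A sketch of one possible angle, assuming DuckDB's USING SAMPLE ... (system) clause and that count(*) on a base table is answered cheaply from metadata. The column names are hypothetical, and whether system sampling (which picks whole vectors rather than individual rows) actually avoids the full-scan cost described above, especially under duckdb-wasm's constraints, would need testing:

    -- Estimate average CSV row width from a small system sample, then scale by the
    -- exact row count. coalesce() treats NULLs as empty fields, as a CSV export would;
    -- quoting/escaping overhead is ignored.
    WITH avg_row AS (
        SELECT avg(
                 coalesce(strlen(CAST(id AS VARCHAR)), 0)
               + coalesce(strlen(name), 0)
               + coalesce(strlen(description), 0)
               + 3) AS bytes_per_row      -- two delimiters + one newline
        FROM t USING SAMPLE 1 PERCENT (system)
    )
    SELECT round(bytes_per_row * (SELECT count(*) FROM t)) AS estimated_csv_bytes
    FROM avg_row;

Since system sampling works at vector granularity (blocks of ~2048 rows), a table whose long strings are clustered together could still skew the average; bernoulli sampling is more uniform per row but reads more of the table.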
If you have multiple varchar columns then there's a fun question of whether their lengths are independent.
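For what it's worth, a small sketch of the two ways to combine per-column lengths (hypothetical column names again): whether you average each column from its own sample or average whole-row sums from a single sample, both target the same expected row width; correlation between the columns mainly changes how much the estimate wobbles from sample to sample:

    -- Per-column, separate samples: each average is computed from its own sample.
    SELECT
        (SELECT avg(coalesce(strlen(name), 0))
           FROM t USING SAMPLE 1 PERCENT (system))
      + (SELECT avg(coalesce(strlen(description), 0))
           FROM t USING SAMPLE 1 PERCENT (system)) AS avg_varchar_bytes_per_row;

    -- Whole-row, single sample: per-row sums, so rows where both columns are long
    -- (or both short) show up directly in the spread of the sampled values.
    SELECT avg(coalesce(strlen(name), 0) + coalesce(strlen(description), 0))
        AS avg_varchar_bytes_per_row
    FROM t USING SAMPLE 1 PERCENT (system);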