I had to do something like this for a few TB of JSON recently. The unusual thing about this workload was that it was a ton of small 10-20 MB files.
I found that ClickHouse was the fastest, but DuckDB was the simplest to work with; it usually just works, and its performance was close enough to ClickHouse's peak.
I tried Flink and PySpark, but they were way slower (like 3-5x) than ClickHouse and the code was kind of annoying. Dask and Ray were also way too slow; Dask's parallelism was easy to code, but the performance just wasn't there. I also tried DataFusion and Polars, but ClickHouse still ended up being faster.
These days I'd recommend starting with DuckDB or ClickHouse for most workloads, simply because they're the easiest to work with AND have good performance. Personally, I've switched to DuckDB instead of Polars for most things where pandas is too slow.
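For context, a minimal sketch of what "just works" looks like for the many-small-JSON-files case with DuckDB's Python API (the glob path and column names here are made up for illustration, not from my actual workload):

```python
import duckdb

con = duckdb.connect()  # in-memory database is fine for this

# DuckDB expands the glob itself, infers the schema from the JSON files,
# and parallelizes the scan across cores without any extra setup.
result = con.execute("""
    SELECT event_type, count(*) AS n
    FROM read_json_auto('data/*.json')
    GROUP BY event_type
    ORDER BY n DESC
""").fetchdf()

print(result)
```

That's basically the whole program: point it at the files and write SQL, no cluster or job config involved.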