Hardly a surprise, given the nature of Spark and the way the benchmark is set up.
Comparing a positively ancient, distributed, JVM-based compute framework running on a single node against modern native tools like DuckDB or Polars, and all of that on a SELECT from a single table: does it really tell us anything new?
Even Trino runs circles around Spark, with some heavier jobs simply never completing in Spark at all (total data size up to a single PB, with about 10 TB of RAM available for compute), and Trino isn't exactly known for extreme performance.
StarRocks is noticeably faster still, so I wouldn't write off distributed compute just yet, at least for some applications.
And even then, performance isn't the most important criterion when choosing an analytics tool; far more depends on integrations, access control, security, ease of extension, maintenance, scaling, and support in existing tooling. Boring enterprise stuff, sure, but for those older frameworks it's all either readily available or can be added quickly with little experience (writing a Java plugin for Trino is about as easy as it gets).
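To give a sense of how little "writing a Java plugin for Trino" actually takes, here's a minimal sketch of a UDF plugin against the Trino SPI (the class and function names here are made up for illustration):

```java
import io.trino.spi.Plugin;
import io.trino.spi.function.Description;
import io.trino.spi.function.ScalarFunction;
import io.trino.spi.function.SqlType;
import io.trino.spi.type.StandardTypes;

import java.util.Set;

// Hypothetical example plugin: Trino discovers it via ServiceLoader,
// so its fully qualified name goes into
// META-INF/services/io.trino.spi.Plugin inside the jar.
public class ExamplePlugin implements Plugin
{
    @Override
    public Set<Class<?>> getFunctions()
    {
        // Register the class holding our annotated scalar functions
        return Set.of(ExampleFunctions.class);
    }

    public static final class ExampleFunctions
    {
        private ExampleFunctions() {}

        // Exposed in SQL as double_it(bigint) -> bigint
        @ScalarFunction("double_it")
        @Description("Returns twice the input value")
        @SqlType(StandardTypes.BIGINT)
        public static long doubleIt(@SqlType(StandardTypes.BIGINT) long value)
        {
            return value * 2;
        }
    }
}
```

Build it against trino-spi, drop the jar into the plugin/ directory, restart, and `SELECT double_it(21)` works cluster-wide. That's roughly the whole story for a scalar UDF; connectors and event listeners follow the same Plugin-interface pattern with more methods to implement.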
With DuckDB or Polars (if used as the basis for a data lake/lakehouse, etc.), it can degrade into an entire team of engineers burning resources on building the tooling around the tooling instead of delivering something actually useful for the business.