> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing hundreds of ~5GB JSON files, where the script turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
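For anyone curious, the fix had roughly this shape. A minimal sketch, assuming JSON-lines input; the paths, field name, and transform are hypothetical stand-ins, not the actual tool:

```python
import json
import multiprocessing as mp
from pathlib import Path

def process_file(path: Path) -> str:
    # Stream one large file, collecting output pieces in a list and
    # joining once at the end -- repeated `s += piece` string
    # concatenation is the quadratic-time pattern being avoided.
    pieces = []
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            pieces.append(record.get("field", ""))  # hypothetical field
    out_path = path.with_suffix(".out")
    out_path.write_text("\n".join(pieces))
    return str(out_path)

if __name__ == "__main__":
    files = sorted(Path("data").glob("*.json"))  # hypothetical location
    # One worker per core; each ~5GB file is an independent unit of work,
    # so the job parallelizes trivially across files.
    with mp.Pool() as pool:
        for done in pool.imap_unordered(process_file, files):
            print("finished", done)
```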
I used PySpark some time ago, when it was introduced at my company, and I realized that it was slow when you used Python libraries in the UDFs rather than PySpark's own functions.
Yes, using Python UDFs within Spark pipelines is a hog! That's because the entire Python context is serialized with cloudpickle and sent over the wire to the executor nodes! (It can represent a few GB of serialized data, depending on the UDF and the driver process's Python context.)
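To make the contrast concrete, here's a minimal sketch of the same transform done both ways; the toy DataFrame and column name are made up for illustration, and a local SparkSession is assumed:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized, shipped to a Python worker process
# on the executor, transformed, and serialized back into the JVM.
to_upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(to_upper_udf("name")).show()

# Built-in function: stays inside the JVM and is visible to the Catalyst
# optimizer, with no per-row Python round trip.
df.select(F.upper("name")).show()
```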
I think Spark was the best tool out there when data engineering started taking off, and it just works (provided you don't have to deal with jar dependency hell), so there's not a huge incentive to move away from it.
This is so true! Even a few years ago, these benchmarks would have been against pandas (instead of Polars and DuckDB) and would likely have looked very different.
Right tool for the right job, people.