
I used PySpark some time ago, when it was first introduced at my company, and noticed that it was slow whenever you used Python libraries inside UDFs rather than PySpark's own built-in functions.


Yes, using Python UDFs within Spark pipelines is a hog! That's because the UDF and everything it references are serialized with cloudpickle and sent over the wire to the executor nodes. (Depending on the UDF and the driver process's Python context, that can amount to a few GB of serialized data.)
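A quick way to see the effect, without Spark at all: cloudpickle (which Spark uses) behaves much like the stdlib `pickle` for instance state, so any lookup table or model a UDF-like callable holds onto gets baked into the serialized payload. The class and lookup-table names below are hypothetical, just to illustrate the point:

```python
import pickle

# Hypothetical lookup table that the UDF closes over
big_lookup = {i: str(i) for i in range(100_000)}

class LookupUdf:
    """A UDF-like callable that drags its whole lookup table along
    whenever it is serialized."""
    def __init__(self, lookup):
        self.lookup = lookup

    def __call__(self, key):
        return self.lookup.get(key, "?")

# Serializing the callable serializes the captured table too
payload = pickle.dumps(LookupUdf(big_lookup))
print(f"serialized UDF: {len(payload):,} bytes")  # well over a megabyte here
```

In a real pipeline that payload crosses the network once per task, and the referenced state can be far larger than a dict of strings (models, broadcast-sized tables, module globals).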


We actually baked a rule to catch UDF usage into our Python linter. Almost always, a UDF can be refactored to use only native PySpark functions.



