
I used PySpark some time ago, when it was first introduced at my company, and noticed that it was slow whenever you used Python libraries inside UDFs rather than PySpark's own built-in functions.


Yes, using Python UDFs within Spark pipelines is a hog! That's because the UDF and everything it references are serialized with cloudpickle and sent over the wire to the executor nodes. (Depending on the UDF and the driver process's Python context, that can amount to a few GB of serialized data.)
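A quick way to see the effect, without Spark at all: cloudpickle (which Spark uses) behaves much like the stdlib `pickle` for instance state, so any lookup table or model a UDF-like callable holds onto gets baked into the serialized payload. The class and lookup-table names below are hypothetical, just to illustrate the point:

```python
import pickle

# Hypothetical lookup table that the UDF closes over
big_lookup = {i: str(i) for i in range(100_000)}

class LookupUdf:
    """A UDF-like callable that drags its whole lookup table along
    whenever it is serialized."""
    def __init__(self, lookup):
        self.lookup = lookup

    def __call__(self, key):
        return self.lookup.get(key, "?")

# Serializing the callable serializes the captured table too
payload = pickle.dumps(LookupUdf(big_lookup))
print(f"serialized UDF: {len(payload):,} bytes")  # well over a megabyte here
```

In a real pipeline that payload crosses the network once per task, and the referenced state can be far larger than a dict of strings (models, broadcast-sized tables, module globals).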


We actually baked a rule to catch UDF usage into our Python linter. Almost always, a UDF can be refactored to use only native PySpark functions.



