> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing hundreds of ~5GB JSON files, where the script turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
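For anyone curious, the fix had roughly this shape. A minimal sketch, assuming JSON-lines input; the paths, field name, and transform are hypothetical stand-ins, not the actual tool:

```python
import json
import multiprocessing as mp
from pathlib import Path

def process_file(path: Path) -> str:
    # Stream one large file, collecting output pieces in a list and
    # joining once at the end -- repeated `s += piece` string
    # concatenation is the quadratic-time pattern being avoided.
    pieces = []
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            pieces.append(record.get("field", ""))  # hypothetical field
    out_path = path.with_suffix(".out")
    out_path.write_text("\n".join(pieces))
    return str(out_path)

if __name__ == "__main__":
    files = sorted(Path("data").glob("*.json"))  # hypothetical location
    # One worker per core; each ~5GB file is an independent unit of work,
    # so the job parallelizes trivially across files.
    with mp.Pool() as pool:
        for done in pool.imap_unordered(process_file, files):
            print("finished", done)
```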
I used PySpark some time ago, when it was introduced at my company, and I realized that it was slow when you used Python libraries in the UDFs rather than PySpark's own functions.
Yes, using Python UDFs within Spark pipelines is a hog! That's because the entire Python context is serialized with cloudpickle and sent over the wire to the executor nodes! (It can represent a few GB of serialized data, depending on the UDF and the driver process's Python context.)
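To make the contrast concrete, here's a minimal sketch of the same transform done both ways; the toy DataFrame and column name are made up for illustration, and a local SparkSession is assumed:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized, shipped to a Python worker process
# on the executor, transformed, and serialized back into the JVM.
to_upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(to_upper_udf("name")).show()

# Built-in function: stays inside the JVM and is visible to the Catalyst
# optimizer, with no per-row Python round trip.
df.select(F.upper("name")).show()
```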
I think Spark was the best tool out there when data engineering started taking off, and it just works (provided you don't have to deal with jar dependency hell), so there's not a huge incentive to move away from it.
This is so true! Even a few years ago, these benchmarks would have been against pandas (instead of Polars and DuckDB) and would likely have looked very different.
Right tool for the right job, people.