Parallel processing libraries in Python - Dask vs Ray

  • At work, I'm enabling numerical computations and transformations of N-dimensional tensors that cannot fit in a single machine's memory. This is "big data" on the order of TBs to PBs. The data are numerical arrays (e.g. np.array), and the processing/analysis code is written in Python.
  • I looked at two libraries that enable parallel computing within the Python ecosystem: Dask and Ray. Below is my comparison between them.

Dask

Dask is a flexible library for parallel computing in Python 

https://docs.dask.org/en/latest/

Ray

Ray is a fast and simple framework for building and running distributed applications.

https://ray.readthedocs.io/en/latest/

(Originally published 02/25/2020, updated 03/03/2021 as per new features in the libraries)



Comparison table columns: Dimension | Dask | Ray | Justification. (Flattened below: for each dimension, Dask's notes appear first, then Ray's, then the justification.)
Efficient Data Sharing Between Nodes 

Source: https://distributed.dask.org/en/latest/serialization.html

  • Data sharing happens via TCP messages between workers. Each time data needs to be shared, it is serialized. Serialization schemes include msgpack, pickle+custom serializers, and pickle+cloudpickle.

Source: https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html

  • No repeated per-message serialization for shared data
  • Instead, a shared-memory data layer (Apache Arrow / the Plasma object store) enables zero-copy data exchange between processes

Source: https://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/

  • The above article shows that Apache Arrow gives a huge speedup with numerical data (e.g. NumPy arrays) and a moderate speedup with regular Python objects, compared to object serialization schemes like pickle
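To get a feel for the difference, here's a small sketch contrasting copy-based (pickle) sharing with zero-copy shared-memory reads. It uses Python's standard library as a stand-in, not Ray's actual Plasma store:

```python
import pickle
from multiprocessing import shared_memory

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)  # ~8 MB payload

# Pickle-style sharing (what serialization-based transfer does): the whole
# buffer is copied into a fresh bytes object before it can be sent anywhere.
payload = pickle.dumps(arr)

# Plasma/Arrow-style sharing: put the array in a shared-memory segment once;
# other processes on the same node can map the same pages and read without copying.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr  # one copy in; subsequent readers on this node are zero-copy

copied = np.array_equal(view, arr)

del view       # drop the exported buffer before closing the segment
shm.close()
shm.unlink()
```

The point is that the pickle path pays a full copy (plus a TCP send in the distributed case) per transfer, while the shared-memory path pays one copy total.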
Scheduler Fault Tolerance 

Source: https://distributed.dask.org/en/latest/resilience.html

  • Dask has one central scheduler that coordinates worker processes across multiple machines and concurrent requests of several clients
  • "The process containing the scheduler might die. There is currently no persistence mechanism to record and recover the scheduler state."

Source: https://github.com/ray-project/ray/issues/642

  • Bottom-up scheduling scheme where workers submit tasks to local schedulers, which assign tasks to workers. Local schedulers also forward tasks to the global scheduler.

Source: https://github.com/ray-project/ray/issues/642

  • In Dask, the central scheduler is a single point of failure, whereas in Ray schedulers exist per machine, so less work is lost if one scheduler dies.
     
Support for Deployment on AWS

Sources

[1]: https://docs.dask.org/en/latest/setup/cloud.html

[2]: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4 

[3]: https://yarn.dask.org/en/latest/

  • Options and well-written documentation available for deploying on multiple compute platforms - AWS ECS, EKS, EMR, Google Cloud Dataproc [1]
  • Articles available for deploying on AWS ECS [2]
  • Library available for dedicated deployment on AWS EMR  [3]

https://docs.ray.io/en/latest/cluster/launcher.html

  • Options/well written documentation available for deploying on AWS EC2, GCP, Azure
  • Both are well-documented and deployable on various cloud providers and cluster-scaling solutions.
Support for Distributed Training 

Source: https://ml.dask.org/

  • Dask-ML
    • Can parallelize Scikit-learn directly
    • Partners with XGBoost, Tensorflow
    • PyTorch can be used by integration with Skorch

Source: https://docs.ray.io/en/master/raysgd/raysgd.html

  • RaySGD is a lightweight library for distributed deep learning, providing thin wrappers around PyTorch and TensorFlow native modules for data parallel training.

 

ML feature - Hyperparameter Optimization

Source: https://ml.dask.org/hyper-parameter-search.html

  • Works with Scikit-learn API-compatible models that are also compatible with Dask objects
  • Memory constrained search - data doesn't fit into memory of a single machine
  • Compute constrained - search takes too long
  • GridSearch/Randomized Search - no constraint
  • Incremental Search - memory constrained, not compute constrained
  • HyperbandSearch/ SuccessiveHalvingSearch/ InverseDecaySearch - compute constrained, not memory constrained OR both constrained 

Source: https://ray.readthedocs.io/en/latest/tune.html

  • While both support various hyperparameter search algorithms, Ray Tune is compatible with any ML framework, whereas Dask-ML is tied to the Scikit-learn API.
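To make the Hyperband/SuccessiveHalving family less abstract, here's a toy sketch of successive halving in plain Python. The score function and configurations are made up for illustration and don't reflect Dask-ML's or Tune's actual APIs:

```python
import random

random.seed(0)

def score(config, budget):
    # Toy stand-in for "train with this budget and return a validation score":
    # more budget helps, and a learning rate near 0.1 helps. (Made-up metric.)
    return budget * (1.0 - abs(config["lr"] - 0.1))

def successive_halving(configs, budget=1, eta=2):
    # Keep the best 1/eta of the configurations each round while multiplying
    # the budget, so most compute goes to the most promising candidates.
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

candidates = [{"lr": random.uniform(0.001, 1.0)} for _ in range(16)]
best = successive_halving(list(candidates))
```

This is the "compute constrained" strategy from the list above: bad configurations are discarded cheaply before they ever get a large training budget.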
ML feature -  Generalized Linear Models 

Source:  https://ml.dask.org/glm.html

  • Dask-provided Linear/Logistic/Poisson Regression 
  • N/A

 

ML feature -  Clustering

Source: https://ml.dask.org/clustering.html

  • Dask-provided KMeans, PartialMinibatchKMeans, SpectralClustering
  • N/A
 
ML feature -  Reinforcement Learning Library
  • N/A

Source: https://ray.readthedocs.io/en/latest/rllib.html

  • Built-in library "RLlib"
 
Support for running on GPU

Source: https://docs.dask.org/en/latest/gpu.html

  • Works with GPU-accelerated array or dataframe libraries (cuDF, CuPy)
  • Works with GPU-accelerated ML libraries that follow the Scikit-learn estimator API (Skorch, cuML, XGBoost, etc.)
  • Can specify GPUs per machine 

Source:  https://ray.readthedocs.io/en/latest/using-ray-with-gpus.html

  • Number of GPUs is specifiable at the function level
  • GPUs can be allocated fractionally (multiple functions can share the same GPU)
  • Both Ray and Dask can work with GPUs; the support from both seems adequate for most use cases.
  • Ray seems a bit more sophisticated, since Dask does not support fractional GPU allocation at the function level.
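To make fractional GPU allocation concrete, here's a toy bookkeeping sketch of how a scheduler could pack fractional requests onto whole devices. The class and logic are illustrative assumptions, not Ray's actual implementation:

```python
class GpuPool:
    """Toy fractional-GPU ledger (illustrative; not Ray's internals)."""

    def __init__(self, num_gpus):
        # Each device starts with 100% of its capacity free.
        self.free = {i: 1.0 for i in range(num_gpus)}

    def acquire(self, fraction):
        # First fit: place the task on any GPU with enough spare capacity,
        # so two tasks asking for 0.5 each can share a single device.
        for gpu, spare in self.free.items():
            if spare >= fraction - 1e-9:
                self.free[gpu] -= fraction
                return gpu
        raise RuntimeError("no GPU with enough free capacity")

pool = GpuPool(num_gpus=2)
a = pool.acquire(0.5)  # lands on GPU 0
b = pool.acquire(1.0)  # GPU 0 only has 0.5 left, so this takes GPU 1
c = pool.acquire(0.5)  # shares GPU 0 with task a
```

The useful consequence is exactly the one noted above: lightweight inference tasks can share a device instead of each claiming a whole GPU.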
Support for creating Task Graph

Source: https://docs.dask.org/en/latest/custom-graphs.html

  • Custom graph: users can define their own graph as a plain Python dict, then visualize it or execute it on a cluster

Source: https://rise.cs.berkeley.edu/blog/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray/

  • Users can chain arbitrary remote functions, but there is no construct for creating a "custom graph"
  • Dask's custom graph as a Python dictionary can be convenient for high-level specification of workflow. Dask will translate the custom graph into a low-level parallelizable task graph.
  • In Ray, you have to create the low-level graph yourself.
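To show what a dict-based task graph looks like, here's a minimal executor over the {key: (function, *args)} convention that Dask's custom graphs use. The executor itself is a toy written for illustration; Dask's real scheduler does far more (parallel execution, caching, distribution):

```python
# Minimal recursive executor over the dict-graph convention:
# a task is either a literal value or a tuple (callable, *args),
# where string args naming other keys are resolved first.
def get(graph, key):
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        return func(*(get(graph, a) if a in graph else a for a in args))
    return task

def add(x, y):
    return x + y

# z depends on y, which depends on x -- a three-node dependency chain.
graph = {"x": 1, "y": (add, "x", 10), "z": (add, "y", 100)}
result = get(graph, "z")  # add(add(1, 10), 100) == 111
```

The appeal of this representation is that the whole workflow is an inspectable data structure, which is what makes Dask's graph visualization and optimization passes possible.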
Ease of Monitoring

Source: https://distributed.dask.org/en/latest/diagnosing-performance.html

  • Provides UI-based, real-time diagnostics dashboard for monitoring jobs on the cluster 
  • Logs are available on UI-based diagnostics dashboard 

Source: https://docs.ray.io/en/latest/ray-dashboard.html

  • Provides UI-based, real-time diagnostics dashboard

 

Support for Distributed Data Structures 

Sources

[1] https://docs.dask.org/en/latest/array.html

[2] https://docs.dask.org/en/latest/dataframe.html

[3] https://docs.dask.org/en/latest/bag.html

  • Distributed version of Numpy ndarray [1] 
  • Distributed version of Pandas DataFrame [2] 
  • Distributed collection of generic Python objects [3]
N/A

Source: https://github.com/ray-project/ray/issues/642

  • From main contributor of Ray: "Dask has extensive high-level collections APIs (e.g., dataframes, distributed arrays, etc), whereas Ray does not"
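As a rough illustration of what a "distributed version of the NumPy ndarray" means, here's a chunked-array sketch in plain NumPy. Real dask.array builds a lazy task graph over blocks like these and schedules them across workers; this only shows the blocking idea:

```python
import numpy as np

def blockwise(func, arr, chunk):
    # Split the array into fixed-size blocks, apply the function to each
    # block independently (the parallelizable part), then reassemble.
    blocks = [arr[i : i + chunk] for i in range(0, len(arr), chunk)]
    return np.concatenate([func(b) for b in blocks])

x = np.arange(10)
y = blockwise(lambda b: b * 2, x, chunk=4)  # same result as (x * 2)
```

Because each block fits in memory on its own, an array far larger than any one machine can still be processed block by block, which is the core trick behind Dask's collections.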
Process Synchronization with Shared Data
Source: https://distributed.dask.org/en/latest/actors.html

  • Currently, the "Actors" feature in Dask is "an experimental feature and is subject to change without notice".
  • "Actors enable stateful computations within a Dask workflow. They are useful for some rare algorithms that require additional performance and are willing to sacrifice resilience."
  • "An actor is a pointer to a user-defined-object living on a remote worker. Anyone with that actor can call methods on that remote object."

Source: https://ray.readthedocs.io/en/latest/actors.html

  • Actor is a core concept of Ray.
  • "An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker."

Source: https://github.com/ray-project/ray/issues/642

  • In June 2017, from the main contributor of Ray: "An important part of the Ray API is the actor abstraction for sharing mutable state between tasks... I don't think there is an analogue in Dask" (the last sentence is no longer true)

  • Whereas the Actor concept was core to Ray from the beginning, Dask added it later in the project, and it is currently experimental.
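The actor pattern itself is easy to sketch: one worker owns the state and executes method calls from a mailbox, one at a time. Here a thread stands in for the remote worker (Ray uses separate processes, and all names here are illustrative):

```python
import queue
import threading

class Counter:
    # The actor's state lives only inside the worker; callers never touch it.
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

def actor_loop(calls, results):
    # The worker owns one Counter instance and executes method calls in
    # arrival order, so callers see a consistent, serialized view of state.
    actor = Counter()
    for method in iter(calls.get, None):
        results.put(getattr(actor, method)())

calls, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=actor_loop, args=(calls, results))
worker.start()
for _ in range(3):
    calls.put("increment")
outputs = [results.get() for _ in range(3)]
calls.put(None)  # shut the actor down
worker.join()
```

Serializing method calls through a mailbox is what makes mutable shared state safe without locks, which is the property both Ray's and Dask's actors provide.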

Maturity of Project

Source

[1] https://github.com/dask/dask/releases?after=0.2.1

[2] https://dask.org/

  • First release on GitHub: Jan 2015 [1]
  • Built by developers at Anaconda, so Dask is closely integrated with Scikit-learn, XGBoost, XArray, etc. [2]
  • Supported by Anaconda, DARPA, the Moore Foundation, NSF, NVIDIA, etc. [2]
  • Fiscally sponsored project of NumFOCUS [2] 

Source

[1] https://github.com/ray-project/ray/releases?after=ray-0.5.1

[2] https://rise.cs.berkeley.edu/projects/ray/

  • First release on GitHub: May 2017 [1]
  • Built by students at the Berkeley RISE Lab [2]
  • Commercially backed by Anyscale

 


