Parallel processing libraries in Python - Dask vs Ray
- At work, I'm working on enabling numerical computations/transformations of N-dimensional tensors that cannot fit in a single machine's memory. This is "big data" on the order of TBs to PBs. The data are numerical arrays (e.g. np.array), and the processing/analysis code is written in Python.
- I looked at two libraries that enable parallel computing within the Python ecosystem: Dask and Ray. Below is a comparison between them.
Dask is a flexible library for parallel computing in Python.
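To give a feel for Dask, here is a minimal sketch using its array API (the sizes are illustrative, not the TB-scale workloads above): operations on a chunked array build a lazy task graph that only executes, in parallel across chunks, when `.compute()` is called.

```python
import dask.array as da

# Illustrative size; real workloads would be much larger, chunked arrays.
x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))

# This builds a lazy task graph over the 16 chunks; nothing runs yet.
mean = x.mean()

# compute() executes the graph in parallel and returns a concrete value.
result = mean.compute()
print(result)  # close to 0.5 for uniform random data
```

The same pattern scales from a laptop's thread pool to a distributed cluster without changing the array code.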
Ray is a fast and simple framework for building and running distributed applications.
(Originally published 02/25/2020, updated 03/03/2021 to reflect new features in the libraries)
| Dimension | Dask | Ray | Color Code Justification |
|---|---|---|---|
| Efficient Data Sharing Between Nodes | | | |
| Scheduler Fault Tolerance | | | |
| Support for Deployment on AWS | | | |
| Support for Distributed Training | | | |
| ML feature - Hyperparameter Optimization | | | |
| ML feature - Generalized Linear Models | | | |
| ML feature - Clustering | | | |
| ML feature - Reinforcement Learning Library | | | |
| Support for Running on GPU | | | |
| Support for Creating Task Graph | | | |
| Ease of Monitoring | | | |
| Support for Distributed Data Structures | | N/A | |
| Process Synchronization with Shared Data | | | |