Parallel processing libraries in Python - Dask vs Ray
- At work, I'm enabling numerical computations and transformations on N-dimensional tensors that cannot fit in a single machine's memory. This is "big data" on the order of terabytes to petabytes. The data are numerical arrays (e.g., NumPy's `np.ndarray`), and the processing/analysis code is written in Python.
- I looked at two libraries that enable parallel computing within the Python ecosystem: Dask and Ray. Below is my comparison of the two.
Dask is a flexible library for parallel computing in Python.
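As a minimal sketch of Dask's core abstraction for this use case (not from the original post): `dask.array` splits a NumPy-like array into chunks, builds a lazy task graph, and only materializes chunks as needed, so arrays larger than memory can be reduced in parallel.

```python
import dask.array as da

# Build a "virtual" 10,000 x 10,000 array split into 1,000 x 1,000 chunks;
# only individual chunks need to be in memory during the computation.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Operations are lazy and accumulate into a task graph;
# .compute() executes the graph, scheduling chunk-level tasks in parallel.
total = (x * 2).sum().compute()
print(total)  # 200000000.0
```

The same code scales from a laptop (threaded scheduler) to a cluster (`dask.distributed`) without changes to the array logic.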
Ray is a fast and simple framework for building and running distributed applications.
(Originally published 02/25/2020, updated 03/03/2021 to reflect new features in both libraries)
I compared the libraries along the following dimensions (each was color-coded for Dask and Ray, with a justification):
- Efficient data sharing between nodes
- Scheduler fault tolerance
- Support for deployment on AWS
- Support for distributed training
- ML features: hyperparameter optimization
- ML features: generalized linear models
- ML features: clustering
- ML features: reinforcement learning library
- Support for running on GPU
- Support for creating task graphs
- Ease of monitoring
- Support for distributed data structures
- Process synchronization with shared data
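One of the criteria above, support for creating task graphs, is worth a concrete illustration. In Dask, `dask.delayed` lets you build an explicit graph of dependent function calls by hand; this is a small sketch of mine, not code from the original post:

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# No function actually runs here; each call just adds a node to a
# task graph, recording that add() depends on the two inc() results.
graph = add(inc(1), inc(2))

# .compute() walks the graph, running independent nodes
# (the two inc calls) in parallel.
result = graph.compute()
print(result)  # 5
```

Ray builds an equivalent dependency graph implicitly: passing one task's `ObjectRef` into another `.remote()` call chains them, but the graph is not exposed as a first-class object the way Dask's is.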