Parallel processing libraries in Python - Dask vs Ray

  • At work, I'm enabling numerical computations and transformations of N-dimensional tensors that cannot fit in a single machine's memory. This is "big data" on the order of TBs to PBs. The data are numerical arrays (e.g. np.array), and the processing/analysis code is written in Python.
  • I looked at two libraries that enable parallel computing within the Python ecosystem: Dask and Ray. Below is my comparison between them.

Dask

Dask is a flexible library for parallel computing in Python 

https://docs.dask.org/en/latest/

Ray

Ray is a fast and simple framework for building and running distributed applications.

https://ray.readthedocs.io/en/latest/

(Originally published 02/25/2020, updated 03/03/2021 as per new features in the libraries)



Comparison table columns: Dimension | Dask | Ray | Justification. (Flattened below: for each dimension, Dask's notes appear first, then Ray's, then the justification.)
Efficient Data Sharing Between Nodes 

Source: https://distributed.dask.org/en/latest/serialization.html

  • Data sharing happens via TCP messages between workers. Each time data needs to be shared, it is serialized. Serialization schemes include msgpack, pickle+custom serializers, and pickle+cloudpickle.

Source: https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html

  • No repeated per-message serialization for shared data
  • Instead, a shared-memory data layer (Apache Arrow / the Plasma object store) enables zero-copy data exchange between processes

Source: https://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/

  • The above article shows that Apache Arrow gives a huge speedup with numerical data (e.g. NumPy arrays) and a moderate speedup with regular Python objects, compared to object serialization schemes like pickle
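To get a feel for the difference, here's a small sketch contrasting copy-based (pickle) sharing with zero-copy shared-memory reads. It uses Python's standard library as a stand-in, not Ray's actual Plasma store:

```python
import pickle
from multiprocessing import shared_memory

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)  # ~8 MB payload

# Pickle-style sharing (what serialization-based transfer does): the whole
# buffer is copied into a fresh bytes object before it can be sent anywhere.
payload = pickle.dumps(arr)

# Plasma/Arrow-style sharing: put the array in a shared-memory segment once;
# other processes on the same node can map the same pages and read without copying.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr  # one copy in; subsequent readers on this node are zero-copy

copied = np.array_equal(view, arr)

del view       # drop the exported buffer before closing the segment
shm.close()
shm.unlink()
```

The point is that the pickle path pays a full copy (plus a TCP send in the distributed case) per transfer, while the shared-memory path pays one copy total.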
Scheduler Fault Tolerance 

Source: https://distributed.dask.org/en/latest/resilience.html

  • Dask has one central scheduler that coordinates worker processes across multiple machines and concurrent requests of several clients
  • "The process containing the scheduler might die. There is currently no persistence mechanism to record and recover the scheduler state."

Source: https://github.com/ray-project/ray/issues/642

  • Bottom-up scheduling scheme where workers submit tasks to local schedulers, which assign tasks to workers. Local schedulers also forward tasks to the global scheduler.

Source: https://github.com/ray-project/ray/issues/642

  • In Dask, the central scheduler is a single point of failure, whereas in Ray schedulers exist per machine, so less work is lost if one scheduler dies.
     
Support for Deployment on AWS

Sources

[1]: https://docs.dask.org/en/latest/setup/cloud.html

[2]: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4 

[3]: https://yarn.dask.org/en/latest/

  • Options and well-written documentation available for deploying on multiple compute platforms - AWS ECS, EKS, EMR, Google Cloud Dataproc [1]
  • Articles available for deploying on AWS ECS [2]
  • Library available for dedicated deployment on AWS EMR  [3]

https://docs.ray.io/en/latest/cluster/launcher.html

  • Options/well written documentation available for deploying on AWS EC2, GCP, Azure
  • Both are well-documented and deployable on various cloud providers and cluster-scaling solutions.
Support for Distributed Training 

Source: https://ml.dask.org/

  • Dask-ML
    • Can parallelize Scikit-learn directly
    • Partners with XGBoost, Tensorflow
    • PyTorch can be used by integration with Skorch

Source: https://docs.ray.io/en/master/raysgd/raysgd.html

  • RaySGD is a lightweight library for distributed deep learning, providing thin wrappers around PyTorch and TensorFlow native modules for data parallel training.

 

ML feature - Hyperparameter Optimization

Source: https://ml.dask.org/hyper-parameter-search.html

  • Works with Scikit-learn API-compatible models that are also compatible with Dask objects
  • Memory constrained search - data doesn't fit into memory of a single machine
  • Compute constrained - search takes too long
  • GridSearch/Randomized Search - no constraint
  • Incremental Search - memory constrained, not compute constrained
  • HyperbandSearch/ SuccessiveHalvingSearch/ InverseDecaySearch - compute constrained, not memory constrained OR both constrained 

Source: https://ray.readthedocs.io/en/latest/tune.html

  • While both support various hyperparameter search algorithms, Ray Tune is compatible with any ML framework, whereas Dask-ML is tied to the Scikit-learn API.
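To make the Hyperband/SuccessiveHalving family less abstract, here's a toy sketch of successive halving in plain Python. The score function and configurations are made up for illustration and don't reflect Dask-ML's or Tune's actual APIs:

```python
import random

random.seed(0)

def score(config, budget):
    # Toy stand-in for "train with this budget and return a validation score":
    # more budget helps, and a learning rate near 0.1 helps. (Made-up metric.)
    return budget * (1.0 - abs(config["lr"] - 0.1))

def successive_halving(configs, budget=1, eta=2):
    # Keep the best 1/eta of the configurations each round while multiplying
    # the budget, so most compute goes to the most promising candidates.
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: score(c, budget), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

candidates = [{"lr": random.uniform(0.001, 1.0)} for _ in range(16)]
best = successive_halving(list(candidates))
```

This is the "compute constrained" strategy from the list above: bad configurations are discarded cheaply before they ever get a large training budget.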
ML feature -  Generalized Linear Models 

Source:  https://ml.dask.org/glm.html

  • Dask-provided Linear/Logistic/Poisson Regression 
  • N/A

 

ML feature -  Clustering

Source: https://ml.dask.org/clustering.html

  • Dask-provided KMeans, PartialMinibatchKMeans, SpectralClustering
  • N/A
 
ML feature -  Reinforcement Learning Library
  • N/A

Source: https://ray.readthedocs.io/en/latest/rllib.html

  • Built-in library "RLlib"
 
Support for running on GPU

Source: https://docs.dask.org/en/latest/gpu.html

  • Works with GPU-accelerated array or dataframe libraries (cuDF, CuPy)
  • Works with GPU-accelerated ML libraries that follow the Scikit-learn estimator API (Skorch, cuML, XGBoost, etc.)
  • Can specify GPUs per machine 

Source:  https://ray.readthedocs.io/en/latest/using-ray-with-gpus.html

  • Number of GPUs is specifiable at the function level
  • GPUs can be allocated fractionally (multiple functions can share the same GPU)
  • Both Ray and Dask can work with GPUs; the support from both seems adequate for most use cases.
  • Ray seems a bit more sophisticated, since Dask does not support fractional GPU allocation at the function level.
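To make fractional GPU allocation concrete, here's a toy bookkeeping sketch of how a scheduler could pack fractional requests onto whole devices. The class and logic are illustrative assumptions, not Ray's actual implementation:

```python
class GpuPool:
    """Toy fractional-GPU ledger (illustrative; not Ray's internals)."""

    def __init__(self, num_gpus):
        # Each device starts with 100% of its capacity free.
        self.free = {i: 1.0 for i in range(num_gpus)}

    def acquire(self, fraction):
        # First fit: place the task on any GPU with enough spare capacity,
        # so two tasks asking for 0.5 each can share a single device.
        for gpu, spare in self.free.items():
            if spare >= fraction - 1e-9:
                self.free[gpu] -= fraction
                return gpu
        raise RuntimeError("no GPU with enough free capacity")

pool = GpuPool(num_gpus=2)
a = pool.acquire(0.5)  # lands on GPU 0
b = pool.acquire(1.0)  # GPU 0 only has 0.5 left, so this takes GPU 1
c = pool.acquire(0.5)  # shares GPU 0 with task a
```

The useful consequence is exactly the one noted above: lightweight inference tasks can share a device instead of each claiming a whole GPU.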
Support for creating Task Graph

Source: https://docs.dask.org/en/latest/custom-graphs.html

  • Custom graph: users can define their own graph as a plain Python dict, then visualize it or execute it on a cluster

Source: https://rise.cs.berkeley.edu/blog/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray/

  • Users can chain arbitrary remote functions, but there is no construct for creating a "custom graph"
  • Dask's custom graph as a Python dictionary can be convenient for high-level specification of workflow. Dask will translate the custom graph into a low-level parallelizable task graph.
  • In Ray, you have to create the low-level graph yourself.
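To show what a dict-based task graph looks like, here's a minimal executor over the {key: (function, *args)} convention that Dask's custom graphs use. The executor itself is a toy written for illustration; Dask's real scheduler does far more (parallel execution, caching, distribution):

```python
# Minimal recursive executor over the dict-graph convention:
# a task is either a literal value or a tuple (callable, *args),
# where string args naming other keys are resolved first.
def get(graph, key):
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        return func(*(get(graph, a) if a in graph else a for a in args))
    return task

def add(x, y):
    return x + y

# z depends on y, which depends on x -- a three-node dependency chain.
graph = {"x": 1, "y": (add, "x", 10), "z": (add, "y", 100)}
result = get(graph, "z")  # add(add(1, 10), 100) == 111
```

The appeal of this representation is that the whole workflow is an inspectable data structure, which is what makes Dask's graph visualization and optimization passes possible.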
Ease of Monitoring

Source: https://distributed.dask.org/en/latest/diagnosing-performance.html

  • Provides UI-based, real-time diagnostics dashboard for monitoring jobs on the cluster 
  • Logs are available on UI-based diagnostics dashboard 

Source: https://docs.ray.io/en/latest/ray-dashboard.html

  • Provides UI-based, real-time diagnostics dashboard

 

Support for Distributed Data Structures 

Sources

[1] https://docs.dask.org/en/latest/array.html

[2] https://docs.dask.org/en/latest/dataframe.html

[3] https://docs.dask.org/en/latest/bag.html

  • Distributed version of Numpy ndarray [1] 
  • Distributed version of Pandas DataFrame [2] 
  • Distributed collection of generic Python objects [3]
N/A

Source: https://github.com/ray-project/ray/issues/642

  • From main contributor of Ray: "Dask has extensive high-level collections APIs (e.g., dataframes, distributed arrays, etc), whereas Ray does not"
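As a rough illustration of what a "distributed version of the NumPy ndarray" means, here's a chunked-array sketch in plain NumPy. Real dask.array builds a lazy task graph over blocks like these and schedules them across workers; this only shows the blocking idea:

```python
import numpy as np

def blockwise(func, arr, chunk):
    # Split the array into fixed-size blocks, apply the function to each
    # block independently (the parallelizable part), then reassemble.
    blocks = [arr[i : i + chunk] for i in range(0, len(arr), chunk)]
    return np.concatenate([func(b) for b in blocks])

x = np.arange(10)
y = blockwise(lambda b: b * 2, x, chunk=4)  # same result as (x * 2)
```

Because each block fits in memory on its own, an array far larger than any one machine can still be processed block by block, which is the core trick behind Dask's collections.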
Process Synchronization with Shared Data
Source: https://distributed.dask.org/en/latest/actors.html

  • Currently, the "Actors" feature in Dask is "an experimental feature and is subject to change without notice".
  • "Actors enable stateful computations within a Dask workflow. They are useful for some rare algorithms that require additional performance and are willing to sacrifice resilience."
  • "An actor is a pointer to a user-defined-object living on a remote worker. Anyone with that actor can call methods on that remote object."

Source: https://ray.readthedocs.io/en/latest/actors.html

  • Actor is a core concept of Ray.
  • "An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker."

Source: https://github.com/ray-project/ray/issues/642

  • In June 2017, from the main contributor of Ray: "An important part of the Ray API is the actor abstraction for sharing mutable state between tasks... I don't think there is an analogue in Dask" (the last sentence is no longer true)

  • Whereas the Actor concept was core to Ray from the beginning, Dask added it later in the project, and it is currently experimental.
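The actor pattern itself is easy to sketch: one worker owns the state and executes method calls from a mailbox, one at a time. Here a thread stands in for the remote worker (Ray uses separate processes, and all names here are illustrative):

```python
import queue
import threading

class Counter:
    # The actor's state lives only inside the worker; callers never touch it.
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

def actor_loop(calls, results):
    # The worker owns one Counter instance and executes method calls in
    # arrival order, so callers see a consistent, serialized view of state.
    actor = Counter()
    for method in iter(calls.get, None):
        results.put(getattr(actor, method)())

calls, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=actor_loop, args=(calls, results))
worker.start()
for _ in range(3):
    calls.put("increment")
outputs = [results.get() for _ in range(3)]
calls.put(None)  # shut the actor down
worker.join()
```

Serializing method calls through a mailbox is what makes mutable shared state safe without locks, which is the property both Ray's and Dask's actors provide.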

Maturity of Project

Source

[1] https://github.com/dask/dask/releases?after=0.2.1

[2] https://dask.org/

  • First release on GitHub: Jan 2015 [1]
  • Built by developers at Anaconda, so Dask is closely integrated with Scikit-learn, XGBoost, XArray, etc. [2]
  • Supported by Anaconda, DARPA, the Moore Foundation, NSF, NVIDIA, etc. [2]
  • Fiscally sponsored project of NumFOCUS [2] 

Source

[1] https://github.com/ray-project/ray/releases?after=ray-0.5.1

[2] https://rise.cs.berkeley.edu/projects/ray/

  • First release on GitHub: May 2017 [1]
  • Built by students at the Berkeley RISE Lab [2]
  • Commercially backed by Anyscale

 


