Benchmarks: Dask Distributed vs. Ray for Dask Workloads
Dask Workloads In this section, I detail the dask workloads that will be used to benchmark two cluster backends - Ray and Dask Distributed. In summary, my dask workload involves processing ~3.3 Petabytes of ~6.06 million input files with dask-based Processing Chain and producing ~14.88 Terabytes of ~1.86 million output files. Input Data ~6.06 million ~550MB files, totaling ~3.3 Petabytes Each file represents a N-dimensional numerical tensor (np.ndarray). Processing Chain: How we go from Input data to Output data (Dask Task Graphs) We use Xarray / Dask APIs for lazy , numerical computations on out-of-core multidimensional tensors. Per 3 input files, we generate 1 output file using below Dask Task Graph (auto generated by using Dask APIs) The Task Graph is consisted of small, Dask-level 'tasks'; tasks can be grouped into 3 actions: Load Download file from S3 and load each file into 2D np.ndarray of type np.float32 Convert np.ndarray into dask.array of chunksize
Comments
Post a Comment