Workflows#
A workflow is a sequence of defined steps taken to complete a task.
In the context of compute jobs, a workflow will often involve steps like the following (sketched in code after the list):
Fetching data
Preprocessing the data
Performing a computation on the data
Saving the computation results
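A minimal sketch of such a workflow as plain Python functions is shown below; the URL and file names are placeholders for illustration only.

```python
# Minimal workflow sketch: fetch -> preprocess -> compute -> save.
# The URL and file names below are placeholders, not real endpoints.
import urllib.request

def fetch_data(url: str, path: str) -> str:
    urllib.request.urlretrieve(url, path)  # fetch the raw data to local storage
    return path

def preprocess(path: str) -> list[float]:
    with open(path) as f:
        return [float(line) for line in f if line.strip()]  # parse and clean

def compute(values: list[float]) -> float:
    return sum(values) / len(values)  # the actual computation (here: a mean)

def save(result: float, out_path: str) -> None:
    with open(out_path, "w") as f:
        f.write(f"{result}\n")  # persist the result

if __name__ == "__main__":
    raw = fetch_data("https://example.com/data.txt", "data.txt")
    save(compute(preprocess(raw)), "result.txt")
```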
This section collects tools and techniques for workflows in different domains.
Data Processing#
MLOps#
https://aws.amazon.com/what-is/mlops/
Workflow Management Systems#
Software#
Apache Spark: https://en.wikipedia.org/wiki/Apache_Spark
Apache Airflow: https://en.wikipedia.org/wiki/Apache_Airflow
Apache ODE: Apache Orchestration Director Engine https://en.wikipedia.org/wiki/Apache_ODE
AiiDA: https://aiida.readthedocs.io/projects/aiida-core/en/latest/intro/tutorial.html
Covalent: https://www.covalent.xyz
Container Orchestration#
Kubernetes https://en.wikipedia.org/wiki/Kubernetes
Configuration Management#
Private Clouds#
OpenStack https://en.wikipedia.org/wiki/OpenStack
Events and Streaming#
Workflow Frameworks#
We would like to run compute jobs over multiple GPUs in a user-friendly and repeatable way. The first target application is in Earth Observation.
There are several reasons one might want to run over multiple GPUs:
Reducing training time, by parallelizing compute via data or model parallelism
Data too large for single GPU memory
Model too large for single GPU memory
The primary paradigms are:
Data parallelism - copy the model to all devices and send a different data batch to each device
Model parallelism - split the model itself (e.g. certain layers), or the gradient or optimization calculation, over devices (see the sketch after this list)
Mixed - such as in Fully Sharded Data Parallelism (defined later)
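As a sketch of the model parallelism paradigm, the example below splits a toy model's layers across two GPUs; it assumes two visible CUDA devices and uses placeholder layer sizes.

```python
import torch
import torch.nn as nn

# Model parallelism: different layers live on different devices and the
# activations are moved across the interconnect in forward().
class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 2048).to("cuda:0")
        self.part2 = nn.Linear(2048, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))  # activation transfer between GPUs

model = TwoDeviceNet()
loss = model(torch.randn(32, 1024)).sum()
loss.backward()  # autograd handles the cross-device backward pass
```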
When distributing compute, there should be a good reason to use multiple devices per node, and an even stronger justification for using multiple nodes, since the bandwidth cost of synchronization can offset the gains from distribution.
For model parallelism there are several different methods of splitting the model, so the best choice depends on the problem domain and some experimentation over different approaches may be needed. This can be helped by dedicated tooling and frameworks, and by taking a ‘pipeline’ approach to the research workflow. These frameworks and pipeline tools are reviewed next for the chosen Machine Learning (ML) backend, PyTorch.
Distributed PyTorch ecosystem#
This section reviews typical components of a scalable (distributed) machine learning pipeline in the context of PyTorch. It breaks the hardware and software ecosystem into ‘layers’, starting with the compute hardware, moving to machine-learning specifics and through to integration with cloud and HPC host platforms, as shown in Fig. 1.
Hardware and Firmware#
In the distributed setting, a host machine (or node) may have multiple CPU and GPU devices. On an individual node, data transfer between CPUs, GPUs and storage happens over hardware interconnects following e.g. the PCIe or NVidia NVLink specifications. General network communication, such as between nodes, can be over e.g. Ethernet or InfiniBand (or similar fabric-based) hardware and firmware implementations.
Distributed Compute#
The NVidia CUDA toolkit allows the development of applications using NVidia GPUs. Tools can be built on this, for example the cuDNN library for GPU acceleration of Deep Neural Networks.
NVidia NCCL, the Collective Communications Library, allows for multi-GPU compute and networking, with an API similar to that of the Message Passing Interface (MPI) specification.
A similar implementation based on the oneAPI collection of specifications is Intel’s oneCCL.
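These libraries surface as communication ‘backends’ in higher-level frameworks. For example, PyTorch (reviewed in the next subsection) can report which of them it was built with; a minimal check, assuming a PyTorch installation:

```python
import torch
import torch.distributed as dist

# Report which collective-communication backends this PyTorch build supports.
print("CUDA available:", torch.cuda.is_available())
print("NCCL backend available:", dist.is_nccl_available())
print("Gloo backend available:", dist.is_gloo_available())
print("MPI backend available:", dist.is_mpi_available())
```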
Machine Learning#
PyTorch is a library for building deep learning models, or more generally for differentiable programming. For distributed compute it has three primary approaches, sketched in Fig. 2.
DataParallel is a legacy approach to data-parallel modelling; it runs in a single process using multiple threads and is therefore subject to the Python Global Interpreter Lock.
DistributedDataParallel focuses on data-parallel modelling in a general distributed compute setting, using an AllReduce approach to synchronize model gradient updates (a minimal example is given below).
RPC is a more general module, allowing collective or peer-to-peer communication and the building of more general data or model-parallel approaches.
The latter two approaches are built on the PyTorch c10d ‘collective communication’ package, which supports a range of backends for distributed communication, including Facebook’s Gloo library, NCCL and MPI, over Ethernet or other transports; the RPC module additionally uses the TensorPipe library for point-to-point communication.
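A minimal sketch of the DistributedDataParallel approach is given below, assuming a launch with torchrun (which sets LOCAL_RANK and the rendezvous environment variables) and one GPU per process; the model and batch sizes are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; rank and world size come from the torchrun launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 10, device=local_rank)  # each rank sees its own batch
        loss = ddp_model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradients are synchronized across ranks via AllReduce
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a node with two GPUs this could be launched as torchrun --nproc_per_node=2 train_ddp.py (the script name is a placeholder).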
The general RPC module also allows the development of ParameterServer approaches, while the more recent FullyShardedDataParallel module provides a combined data and model-parallel approach for models with large memory requirements.
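A minimal FullyShardedDataParallel sketch, again assuming a torchrun launch on a GPU node; the layer sizes are arbitrary placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
sharded = FSDP(model)  # parameters, gradients and optimizer state are sharded across ranks
opt = torch.optim.AdamW(sharded.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=local_rank)
loss = sharded(x).sum()
loss.backward()
opt.step()
dist.destroy_process_group()
```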
Further specialized algorithm collections, such as DeepSpeed, Varuna and FlexFlow, have been built on these distributed modules, with others using Distributed Autograd or Distributed Optimizers as generalizations.
TorchElastic and torchrun are fault-tolerant launch and coordination utilities in PyTorch for launching distributed training sessions and re-launching them on device failure (such as running out of memory).
Somewhat independently of ‘distributed compute’, TorchDynamo is a utility for JIT ‘compilation’ of models (to e.g. C++ or various compute graphs, such as for TensorRT) for higher performance, usually at inference time. It currently has some incompatibilities or limitations with the distributed modules.
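In recent PyTorch releases (2.x) the user-facing entry point built on TorchDynamo is torch.compile; a minimal sketch with a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# TorchDynamo captures the Python-level graph; a backend (TorchInductor by default)
# generates optimized code for it.
compiled = torch.compile(model)

x = torch.randn(16, 64)
y = compiled(x)  # the first call triggers compilation; later calls reuse the compiled graph
```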
ML Framework#
There are various tools and frameworks built on or incorporating PyTorch focusing on managing or deploying distributed compute.
Lightning AI provides a wrapper over PyTorch for easier use of distributed compute, boilerplate reduction and some extra implementations over the RPC module.
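A minimal sketch of the Lightning wrapper, assuming the pytorch_lightning package and a node with 4 GPUs; the model and data loader are placeholders.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# The Trainer handles process launching, device placement and DDP wrapping.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
# trainer.fit(LitRegressor(), train_dataloaders=...)  # data loader omitted in this sketch
```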
Ray, MosaicML, ColossalAI and DeterminedAI are examples of frameworks with an added focus on distributed compute.
TorchServe is a PyTorch tool for serving trained models, for example through a REST-like API.
ML Pipeline#
ML pipeline software is focused on bringing ‘DevOps’ approaches to machine learning flows, such as with declarative configuration files, DAG-based flow modelling and integration of containers.
Kubeflow and KServe have integrations with PyTorch via TorchServe for Kubernetes-based flows. Apache Airflow is general-purpose DevOps pipeline software. Amazon SageMaker, Google Vertex AI, MLflow and NVidia Triton are some examples of software providing pipeline tooling with PyTorch integration.
These frameworks also handle integration with data storage backends, such as S3 or the S3-compatible MinIO, with metrics and monitoring (e.g. Prometheus or Grafana), and with many other features such as notebooks.
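As an example of the DAG-based flow modelling mentioned above, a minimal Apache Airflow sketch, assuming Airflow 2.4+; the task bodies are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data():
    print("fetching data")

def preprocess():
    print("preprocessing data")

def train():
    print("training model")

# A three-step DAG: fetch -> preprocess -> train.
with DAG(dag_id="training_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    t_fetch = PythonOperator(task_id="fetch", python_callable=fetch_data)
    t_prep = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_fetch >> t_prep >> t_train
```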
In the context of HPC, integrations between TorchElastic and Slurm are available.
Compute Platform#
The particulars of the underlying compute platform, e.g. AWS, GCP or a dedicated HPC system, can affect choices and outcomes in distributed compute, including the available compute architectures (GPU, TPU etc.) and the network bandwidth between compute nodes.
ICHEC Compute Resources#
ICHEC Project:#
Work is under icear025c - use https://www.ichec.ie/academic/national-hpc/documentation (Join an existing project) to join.
Kay GPUs#
https://www.ichec.ie/about/infrastructure/kay
A partition of 16 nodes with the same specification as above, plus 2x NVIDIA Tesla V100 16GB PCIe (Volta architecture) GPUs on each node. Each GPU has 5,120 CUDA cores and 640 Tensor Cores.
Kay allows up to 4 GPU nodes per compute job.
Available CUDA versions:
cuda/10.0
cuda/10.1.243
cuda/11.2
cuda/11.3
cuda/11.4
cuda/9.2(default)
Sample code and tutorials#
Useful blog posts#
NVidia NCCL#
NVidia HPC SDK: https://developer.nvidia.com/hpc-sdk
GPU Direct: https://developer.nvidia.com/gpudirect
Magnum IO: https://www.nvidia.com/en-us/data-center/magnum-io/
Install: https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
Developer guide: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
PyTorch#
Distributed compute overview: https://pytorch.org/tutorials/beginner/dist_overview.html
Distributed compute tutorial: https://pytorch.org/tutorials/intermediate/dist_tuto.html
Sample code, including slurm scheduling: pytorch/examples
Distributed Data Parallel internals: https://pytorch.org/docs/master/notes/ddp.html
Distributed backends: https://pytorch.org/docs/stable/distributed.html#module-torch.distributed
RPC: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html
Distributed autograd: https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design
Distributed optimizers: http://www.vldb.org/pvldb/vol13/p3005-li.pdf, https://arxiv.org/abs/1910.02054
Parallel Model: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
FSDP: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html
Torchrun backends, Rendezvous: Distributed Sync and Peer Discovery: https://pytorch.org/docs/stable/elastic/rendezvous.html
OneCCL extension: https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html
Model Parallel Tools#
Deepspeed optimizer: https://www.deepspeed.ai/tutorials/zero/ https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
Parameter server: https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html
Distributed frameworks#
Pipelines#
Modelling and Repositories#
Deep learning recommenders: https://medium.com/swlh/deep-learning-recommendation-models-dlrm-a-deep-dive-f38a95f47c2c
Huggingface: https://huggingface.co/docs/transformers/main/main_classes/deepspeed
Reviews#
common-workflow-language/common-workflow-language
pditommaso/awesome-pipeline
https://workflows.community/systems
https://workflows.community/groups/fair/resources/
Standards#
Storage#
Data Management#
https://snakemake.github.io
https://www.researchobject.org/ro-crate/1.1/introduction.html
https://arvados.org
Executors#
crim-ca/weaver
Barski-lab/cwl-airflow
https://streamflow.di.unito.it
http://toil.ucsc-cgl.org
https://www.nextflow.io
MD-Studio/cerise
https://pep.databio.org/looper/
http://code.databio.org/pypiper/philosophy/
AgnostiqHQ/covalent
Scheduler Integration#
https://pysqa.readthedocs.io/en/latest/README.html
https://pympipool.readthedocs.io/en/latest/README.html
https://flux-framework.org
Libensemble/libensemble
bsc-wdc/compss
FAIR#
https://docs.dockstore.org/en/stable/advanced-topics/best-practices/cwl-best-practices.html
https://workflows.community/groups/fair/