Workflows#

A workflow is a sequence of defined steps taken to complete a task.

In the context of compute jobs, workflows often involve steps like the following (a minimal sketch follows the list):

  • Fetching data

  • Preprocessing the data

  • Performing a computation on the data

  • Saving the computation results
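
As a minimal sketch of such a workflow (the URL, file names and helper functions here are illustrative placeholders, not part of any particular project):

# Minimal sketch of a fetch -> preprocess -> compute -> save workflow.
# All names (URL, file paths, functions) are hypothetical placeholders.
from urllib.request import urlretrieve

import numpy as np


def fetch(url: str, path: str) -> str:
    urlretrieve(url, path)                     # download the raw data
    return path


def preprocess(path: str) -> np.ndarray:
    data = np.loadtxt(path)                    # parse the raw file
    return (data - data.mean()) / data.std()   # e.g. normalise


def compute(data: np.ndarray) -> np.ndarray:
    return np.fft.rfft(data)                   # stand-in for the real computation


def save(result: np.ndarray, path: str) -> None:
    np.save(path, result)                      # persist the results


if __name__ == "__main__":
    raw = fetch("https://example.org/data.txt", "raw.txt")
    save(compute(preprocess(raw)), "result.npy")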

This section collects tools and techniques for workflows in different domains.

Data Processing#

MLOps#

https://aws.amazon.com/what-is/mlops/

https://docs.databricks.com/en/machine-learning/index.html

https://en.wikipedia.org/wiki/MLOps

Workflow Management Systems#

https://en.wikipedia.org/wiki/Workflow_management_system

Software#

Container Orchestration#

Configuration Management#

Private Clouds#

Events and Streaming#

Workflow Frameworks#

We would like to run compute jobs over multiple GPUs in a user-friendly and repeatable way. The first target application is in Earth Observation.

There are several reasons one might want to run over multiple GPUs:

  • Reduce training time - achieved by parallelizing compute via data or model parallelism

  • Data too large for single GPU memory

  • Model too large for single GPU memory

The primary paradigms are:

  • Data parallelism - copy the model to every device and send a different data batch to each device

  • Model parallelism - split the model itself (e.g. certain layers), or the gradient or optimizer calculations, across devices

  • Mixed - such as in Fully Sharded Data Parallelism (defined later)

When distributing compute there should be a good reason to use multiple devices per node, and an even stronger justification for multiple nodes, since the bandwidth cost of synchronization can offset the gains from distribution.

For model parallelism there are several different methods of splitting the model; the best choice depends on the problem domain, and some experimentation across approaches may also be needed. This can be helped by dedicated tooling and frameworks and by taking a ‘pipeline’ approach to the research workflow. These frameworks and pipeline tools are reviewed next for the chosen Machine Learning (ML) backend, PyTorch.
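
As a minimal illustration of one such split (a sketch assuming a single node with two CUDA devices; the layer sizes are arbitrary), individual layers can be placed on different GPUs and the activations moved between them in the forward pass:

import torch
import torch.nn as nn


class TwoDeviceModel(nn.Module):
    """Naive model parallelism: each half of the network lives on its own GPU."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # move activations across the interconnect (e.g. NVLink/PCIe) to the second GPU
        return self.part2(x.to("cuda:1"))


model = TwoDeviceModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 10])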

Distributed PyTorch ecosystem#

This section reviews typical components of a scalable (distributed) machine learning pipeline in the context of PyTorch. It breaks the hardware and software ecosystem into ‘layers’, starting with the compute hardware, moving to machine-learning specifics and through to integration with cloud and HPC host platforms, shown in Fig. 1.

Fig. 1: PytorchEcosystem

Hardware and Firmware#

In the distributed setting, a host machine (or node) may have multiple CPU and GPU devices. On an individual node, data transfer between CPUs, GPUs and storage can happen over hardware interconnects following e.g. the PCIe or NVidia NVLink specifications. General network communication, such as between nodes, can use e.g. Ethernet or InfiniBand (or similar fabric-based) hardware and firmware implementations.

Distributed Compute#

The NVidia CUDA toolkit allows the development of applications using NVidia GPUs. Tools can be built on this, for example the cuDNN library for GPU acceleration of Deep Neural Networks.

NVidia NCCL, the NVIDIA Collective Communications Library, provides collective communication routines for multi-GPU and multi-node compute, with an API similar to that of the Message Passing Interface (MPI) specification.

A similar implementation based on the oneAPI collection of specifications is Intel’s oneCCL.
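
As a rough sketch of this collective-communication style (assuming a GPU node with one process launched per GPU, e.g. via torchrun, and a PyTorch build with NCCL support), an all-reduce over the NCCL backend through torch.distributed might look like:

import os

import torch
import torch.distributed as dist

# One process per GPU; RANK/WORLD_SIZE/LOCAL_RANK are set by the launcher.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor; all_reduce sums them in place on every rank.
t = torch.ones(4, device=f"cuda:{local_rank}") * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {t}")

dist.destroy_process_group()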

Machine Learning#

PyTorch is a library for building deep learning models, or more generally for differentiable programming. For distributed compute it has three primary approaches, sketched in Fig. 2.

Fig. 2: DistributedDataFlow

  • DataParallel is a legacy approach for data-parallel modelling; it runs within a single Python process using multi-threading and so is subject to the Python Global Interpreter Lock.

  • DistributedDataParallel focuses on data-parallel modelling in a general distributed compute setting, using an AllReduce approach to synchronize model gradients during updates (a minimal usage sketch follows this list).

  • RPC is a more general module, allowing collective or peer-to-peer communication and the building of more general data- or model-parallel approaches.
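
As a minimal usage sketch of DistributedDataParallel (assuming a launch via torchrun so that the rank environment variables are set; the model, data and hyperparameters are placeholders):

# Launch with e.g.: torchrun --nproc_per_node=2 ddp_example.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(32, 2).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])    # gradients are AllReduced across ranks
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for _ in range(5):                                 # each rank sees its own data batch
    x = torch.randn(16, 32, device=local_rank)
    loss = ddp_model(x).sum()
    optimizer.zero_grad()
    loss.backward()                                # gradient AllReduce happens here
    optimizer.step()

dist.destroy_process_group()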

The latter two approaches are built on the PyTorch c10d ‘collective communication’ package, which supports a range of backends for distributed communication, including Facebook’s Gloo library (over Ethernet/TCP), NCCL and MPI. The RPC module additionally uses the TensorPipe library for its point-to-point transport.

These building blocks allow the development of the more recent FullyShardedDataParallel module, a combined data- and model-parallel approach for large-memory models, as well as ParameterServer approaches for similar applications.
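
A minimal FullyShardedDataParallel sketch, under the same torchrun-style launch assumption and with a placeholder model:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(local_rank)
# Parameters, gradients and optimizer state are sharded across ranks,
# reducing the per-GPU memory footprint compared to plain DDP.
fsdp_model = FSDP(model)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=local_rank)
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()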

Further specialized frameworks, such as DeepSpeed, Varuna and FlexFlow, have been built on top of these, with others using the Distributed Autograd or Distributed Optimizer modules as generalizations.

Torch-elastic and torchrun are fault-tolerant launch and coordination utilities in PyTorch for launching distributed training sessions and re-launching them on worker failure (such as an out-of-memory error).
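
Under torchrun, each worker process receives its coordinates through environment variables, which a training script can read (a minimal sketch, assuming a GPU node):

import os

import torch.distributed as dist

# torchrun sets these for every worker it launches (and re-launches on failure).
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node
world_size = int(os.environ["WORLD_SIZE"])  # total number of workers

dist.init_process_group(backend="nccl")     # MASTER_ADDR/MASTER_PORT are also set by torchrun
print(f"worker {rank}/{world_size} (local rank {local_rank}) initialised")
dist.destroy_process_group()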

Somewhat independently of ‘distributed compute’, TorchDynamo is a utility for JIT ‘compilation’ of models (to e.g. C++ or various compute graphs, such as for TensorRT) for higher performance, usually at inference time. It currently has some incompatibilities or limitations with the distributed modules.
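
In PyTorch 2.x, TorchDynamo is exposed through torch.compile; a minimal sketch (the model is a placeholder):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# TorchDynamo captures the Python code into graphs which a backend then optimises;
# the default backend is "inductor", and others (e.g. TensorRT) can be plugged in.
compiled = torch.compile(model)

out = compiled(torch.randn(4, 64))
print(out.shape)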

ML Framework#

There are various tools and frameworks, built on or incorporating PyTorch, that focus on managing or deploying distributed compute.

Lightning AI provides a wrapper over PyTorch for easier use of distributed compute, boilerplate reduction and some extra implementations over the RPC module.
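
A minimal PyTorch Lightning sketch (model, data and hyperparameters are placeholders) showing how the Trainer hides most of the distributed boilerplate:

import pytorch_lightning as pl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


data = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)

# strategy="ddp" requests DistributedDataParallel under the hood.
trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(LitRegressor(), data)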

Ray, MosaicML, ColossalAI and DeterminedAI are examples of frameworks with an added focus on distributed compute.

Torch-serve is a PyTorch tool for serving trained models, for example through a REST-like API.
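
Once a model archive has been deployed, TorchServe's inference API can be called over HTTP; a minimal client sketch (the model name, host and input file are placeholders):

import requests

# Default TorchServe inference endpoint: POST /predictions/<model_name> on port 8080.
url = "http://localhost:8080/predictions/my_model"   # hypothetical deployed model
with open("sample_input.json", "rb") as f:           # hypothetical input payload
    response = requests.post(url, data=f)

print(response.status_code, response.text)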

ML Pipeline#

ML Pipeline software focuses on bringing ‘DevOps’ approaches to Machine Learning workflows, such as declarative configuration files, DAG-based flow modelling and integration of containers.

Kubeflow and KServe have integrations with PyTorch via Torch-serve for Kubernetes-based flows. Apache Airflow is a general-purpose DevOps pipeline tool. Amazon SageMaker, Google Vertex AI, MLFlow and NVidia Triton are some examples of software providing pipeline tooling with PyTorch integration.
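
A minimal Apache Airflow sketch (Airflow >= 2.4 assumed; the task functions are placeholders) illustrating the declarative, DAG-based style these tools share:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch():
    ...  # placeholder: download or stage the data


def train():
    ...  # placeholder: launch the (possibly distributed) training job


with DAG(dag_id="ml_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    fetch_task = PythonOperator(task_id="fetch", python_callable=fetch)
    train_task = PythonOperator(task_id="train", python_callable=train)

    fetch_task >> train_task   # DAG edge: train runs after fetch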

These frameworks also handle integration with data storage backends, like S3 or the more general MinIO, with metrics and monitoring (e.g. Prometheus or Grafana), and with many other features such as notebooks.

In the context of HPC, integration between Torch-elastic and Slurm is available.

Compute Platform#

The particulars of the underlying compute platform, e.g. AWS, GCP or a dedicated HPC system, can affect choices and outcomes in distributed compute, including the available compute architectures (GPU, TPU etc.) and the network bandwidth between compute nodes.

ICHEC Compute Resources#

ICHEC Project#

Work is under project icear025c; use https://www.ichec.ie/academic/national-hpc/documentation (“Join an existing project”) to join.

Kay GPUs#

https://www.ichec.ie/about/infrastructure/kay

A partition of 16 nodes with the same specification as above, plus 2x NVIDIA Tesla V100 16GB PCIe (Volta architecture) GPUs on each node. Each GPU has 5,120 CUDA cores and 640 Tensor Cores.

Kay allows up to 4 GPU nodes per compute job.

Available CUDA versions:

cuda/10.0
cuda/10.1.243
cuda/11.2
cuda/11.3
cuda/11.4
cuda/9.2 (default)

Sample code and tutorials#

Useful blog posts#

NVidia NCCL#

PyTorch#

Model Parallel Tools#

Distributed frameworks#

Pipelines#

Modelling and Repositories#

Reviews#

common-workflow-language/common-workflow-language

pditommaso/awesome-pipeline

https://workflows.community/systems

https://workflows.community/groups/fair/resources/

Standards#

Storage#

Data Management#

https://snakemake.github.io

https://www.researchobject.org/ro-crate/1.1/introduction.html

https://arvados.org

Executors#

crim-ca/weaver

Barski-lab/cwl-airflow

https://streamflow.di.unito.it

http://toil.ucsc-cgl.org

https://www.nextflow.io

MD-Studio/cerise

https://pep.databio.org/looper/

http://code.databio.org/pypiper/philosophy/

AgnostiqHQ/covalent

Scheduler Integration#

https://pysqa.readthedocs.io/en/latest/README.html

https://pympipool.readthedocs.io/en/latest/README.html

https://flux-framework.org

Libensemble/libensemble

bsc-wdc/compss

FAIR#

https://docs.dockstore.org/en/stable/advanced-topics/best-practices/cwl-best-practices.html

https://workflows.community/groups/fair/

https://www.youtube.com/watch?v=FoMgkExAqQU&t=103s

Uncategorized#

Events#