Machine Learning#

Machine learning is a field in which statistical models are used to make predictions about new, incoming data based on patterns learned from previously provided data.

It has many subfields and related technologies:

Supervised learning uses a training set of ‘labelled’ or annotated data to train a model that can then predict labels for unseen input data. Unsupervised learning uses unlabelled data: the model learns structure or relationships in the training set, which can then be applied to unseen data (learning from partially labelled data is usually called semi-supervised learning). Reinforcement learning involves a decision-making agent learning to complete tasks by taking sequences of actions and receiving feedback on how well they perform.

Neural networks are graph-like structures that can be used to build models of data by adjusting a set of weights. The ‘depth’ of a neural network refers to the number of ‘layers’ of connected nodes it has; deep learning involves using neural networks with many such layers.
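For example, a minimal sketch using PyTorch (introduced later in this document) of a small network whose depth comes from stacking layers; the layer sizes are arbitrary choices for illustration:

import torch.nn as nn

# A small fully-connected network: its 'depth' is the number of layers,
# here three linear layers separated by non-linear activations.
model = nn.Sequential(
    nn.Linear(10, 64),   # input layer: 10 features in, 64 out
    nn.ReLU(),
    nn.Linear(64, 64),   # hidden layer
    nn.ReLU(),
    nn.Linear(64, 1),    # output layer: a single prediction
)
print(model)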

Machine learning model training typically involves repeated linear algebra operations, such as matrix multiplications and the solution of linear systems, often with low requirements on numerical precision. This makes training well suited to GPUs, which can give very significant speedups relative to CPU computation.

Nvidia’s CUDA language and ecosystem are widely used for machine learning workloads.

Supervised Learning#

Two common types of supervised learning are ‘classification’ and ‘regression’. Regression involves finding the parameters of a model such that it best describes (or fits) some data. That model can then predict some outcome based on new or unseen data. Classification involves predicting which of a discrete set of classes corresponds to some input data.

In one-dimensional linear regression the model takes the form:

\[ \mathbf{\hat{y}} = w\mathbf{x} + b \]

where \(w\) is a scalar weight, \(\mathbf{x}\) is a vector of scalar input data and \(b\) is a scalar offset. Performing a best fit of the data amounts to determining the values of \(w\) and \(b\) that minimize the distance between the vector of predictions \(\mathbf{\hat{y}}\) and the known values \(\mathbf{y}\) for a given \(\mathbf{x}\). The function that describes this distance and is minimized is often called the ‘loss’ function. Classification works in a similar way, except the loss function is formulated to work with categorical rather than continuous outputs. For linear regression the minimizing values can be found analytically; in general, however, that is not the case and an optimization problem needs to be solved. Gradient descent is a popular method for this, which iteratively searches for the minimum using the derivative of the loss function with respect to the weights to direct the search.
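As a concrete illustration, the following sketch fits the one-dimensional linear model above by gradient descent using PyTorch’s automatic differentiation; the synthetic data, learning rate and iteration count are arbitrary choices for the example.

import torch

# Hypothetical 1-D data: y is roughly 2x + 1 plus noise
x = torch.linspace(0, 1, 100)
y = 2.0 * x + 1.0 + 0.05 * torch.randn(100)

# Model parameters w and b, tracked for automatic differentiation
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

learning_rate = 0.1
for step in range(500):
    y_hat = w * x + b                    # model prediction
    loss = ((y_hat - y) ** 2).mean()     # mean squared error loss
    loss.backward()                      # gradients of the loss w.r.t. w and b
    with torch.no_grad():
        w -= learning_rate * w.grad      # gradient descent update
        b -= learning_rate * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())                # should approach 2 and 1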

If the amount of input data is relatively small it is possible to calculate gradients using all of it at once. However this is often not feasible, so the training data is split into smaller batches in a technique known as stochastic gradient descent (SGD). A variety of SGD algorithms are available, with different qualities for efficiently exploring the optimization landscape while trying to avoid getting stuck in local minima. The ‘learning rate’, or how large a step the algorithm takes per iteration in the landscape, is an important parameter. Parameters that influence the model outputs but which aren’t themselves weights in the model’s functions are known as hyperparameters.
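In PyTorch, the optimizer and its learning rate are configured separately from the model weights; the snippet below is a minimal sketch showing two common choices (SGD with momentum, and the adaptive variant Adam) with illustrative hyperparameter values.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# Plain stochastic gradient descent; lr (the learning rate) and momentum
# are hyperparameters rather than model weights.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam is a popular SGD variant with per-parameter adaptive learning rates.
optimizer = optim.Adam(model.parameters(), lr=1e-3)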

In summary - a supervised machine learning problem typically involves the definition of a ‘loss function’ of some scalar parameters, known as weights, which is to be minimized for a given set of input data. For problems with a large amount of data a model is trained by iteratively minimizing the loss function using batches of the input data; one full pass through all of the data is known as an epoch. Training progresses for a fixed number of epochs or until some stopping criterion, such as a suitably low loss or a small change in loss between epochs, is reached.
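Putting these pieces together, a minimal sketch of such a training loop, iterating over batches for a fixed number of epochs on hypothetical synthetic data:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 1000 samples of 10 features with scalar targets
x = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):                     # fixed number of epochs
    epoch_loss = 0.0
    for xb, yb in loader:                   # one pass over all batches = one epoch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch}: loss {epoch_loss / len(loader):.4f}")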

Assessing Model Quality#

An important part of a machine learning process is to assess the predictive ability of the trained model. In the case of regression the loss function is some measure of distance between the annotation (ground truth) and the model prediction, so a lower loss implies a better fit to the data. However it is important to note that the imposition of structure in the model, such as \( w\mathbf{x} + b \), which is always necessary to some degree, carries implicit assumptions about the nature of the process that generated the data in the first place.

Considering the regression case, adding higher-order terms, such as a dependence on the square of the data, may produce a better fit as measured by the loss, but may make the model poorer at dealing with new data. This is known as overfitting. Conversely, a model that is too simple may underfit the data and provide a poor proxy of the underlying generating process. Models may also be highly unsuitable for describing the generating process yet still produce a convincing loss value and generalize to unseen data; this can happen in the case of rare events or if the typical assumption of ‘independent and identically distributed’ data generation doesn’t hold.
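As a rough illustration of overfitting, the sketch below fits polynomials of increasing degree to a random train/validation split using NumPy; the data and degrees are arbitrary, and typically the training error keeps falling with degree while the validation error eventually worsens.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2 * x + 1 + 0.2 * rng.standard_normal(40)   # roughly linear data with noise

# Random split into training and validation subsets
idx = rng.permutation(40)
train_idx, val_idx = idx[:30], idx[30:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    train_err = np.mean((np.polyval(coeffs, x[train_idx]) - y[train_idx]) ** 2)
    val_err = np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2)
    print(degree, train_err, val_err)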

Further approaches that can aid in assessing model and data quality include measures based on true and false positive and negative rates. Models with high true positive and true negative rates are desirable, however there is often a trade-off when optimizing these rates: for example, improving the true positive rate may also increase the false positive rate. Tools such as the Receiver Operating Characteristic (ROC) curve can help in comparing the predictive ability of models with respect to these rates.
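For example, assuming scikit-learn is available (it is not otherwise covered in this document), an ROC curve and the area under it can be computed from true labels and predicted scores:

from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and model scores (probability of the positive class)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates per threshold
print(roc_auc_score(y_true, y_score))               # area under the ROC curve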

Data Preparation#

In a supervised learning scenario the data will likely need some preparation before model training. The data can come in many forms, such as:

  • a path or URL to data somewhere on a network

  • an API or endpoint from which to query data

  • a stream of incoming data

First the data is collected, or at least cached somewhere, for processing. Then it is ‘cleaned’, which entails bringing it into a standard format and removing invalid or unexpected entries. The data is then ‘annotated’ or ‘labelled’ with values of the same type that a trained model is expected to predict.
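A minimal sketch of a cleaning step using pandas (not otherwise used in this document), assuming tabular data with a hypothetical temperature column and file names:

import pandas as pd

df = pd.read_csv("raw_data.csv")             # hypothetical raw data file

df = df.drop_duplicates()                    # remove repeated entries
df = df.dropna()                             # drop rows with missing values
df = df[df["temperature"].between(-50, 60)]  # discard physically implausible values

df.to_csv("clean_data.csv", index=False)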

If an individual data sample is large, such as an image, it may be broken into smaller sub-samples, based on, for example, an assumption of spatial uniformity of interesting features.

Following this the data may be split into different datasets - often ‘training’, ‘validation’ and ‘test’ sets. The idea is that the ‘training’ set is used during minimization of the loss function, the ‘validation’ set is used to evaluate the model as it is being trained, and the ‘test’ set is used only for a final characterisation of the model’s behaviour. It is generally important to make sure the data is representative of the process that generates it: it is easy to introduce bias during dataset collection, which will transfer through to any model trained on it unless dedicated mitigation is applied.
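A minimal sketch of such a split using PyTorch’s random_split, with an illustrative 80/10/10 division of synthetic data:

import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical dataset of 1000 samples
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# 80% training, 10% validation, 10% test
generator = torch.Generator().manual_seed(42)   # fixed seed for reproducibility
train_set, val_set, test_set = random_split(dataset, [800, 100, 100], generator=generator)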

Data may also need to be further modified to suit a particular model. Typically data values are normalized to between 0 and 1, which results in a more amenable optimization landscape and easier interpretation of the loss value. Image data may also be scaled to have a certain mean and variation in pixel values to suit a particular model, and the channel ordering may be important. Images may be resampled so their dimensions are suitable for the model to ingest, and may be cropped or re-centered if a particular subject is being focused on. Categorical data may be ‘one hot encoded’: for example, a value of ‘3’ among 5 classes may be represented as (0 0 1 0 0), which better suits certain types of model.
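Two of the transformations mentioned above, sketched with PyTorch: min-max normalization to the range 0 to 1, and one-hot encoding of a categorical label (the values are illustrative).

import torch
import torch.nn.functional as F

# Min-max normalization of a tensor of raw values to the range [0, 1]
x = torch.tensor([3.0, 7.0, 11.0, 5.0])
x_norm = (x - x.min()) / (x.max() - x.min())

# One-hot encoding: class index 2 (the third of 5 classes) becomes (0 0 1 0 0)
label = torch.tensor(2)
one_hot = F.one_hot(label, num_classes=5)
print(x_norm, one_hot)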

It is always advisable to preview the data at each transformation stage to make sure it appears as expected; the transforms should also be subject to automated testing using reference data to ensure they behave as expected.

Image Analysis#

Some common image analysis techniques include:

  • image classification - determination of which class an object in an image belongs to

  • object detection - location of a particular object in an image, for example via bounding box. This may be combined with classification.

  • semantic segmentation - determination of the pixels that comprise an object. Can include multi-class semantic segmentation.

Deep learning is commonly used in machine-learning based image analysis. The convolutional neural network (CNN) architecture is commonly adopted, which amounts to training a set of filters over a hierarchy of progressively larger pixel regions. This allows neural networks to be trained on images without the infeasibly large number of weights that connecting every pixel to every neuron would require.
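A minimal CNN sketch in PyTorch, assuming illustrative 32x32 RGB inputs and 10 output classes:

import torch
import torch.nn as nn

# Small convolutional network: each Conv2d layer learns a set of filters,
# and pooling progressively enlarges the image region each filter 'sees'.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3 input channels (RGB), 16 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # 10 output classes
)

scores = model(torch.randn(1, 3, 32, 32))         # one dummy image
print(scores.shape)                               # torch.Size([1, 10])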

Training and Inference Performance#

It is usually of interest to measure the training and inference performance of a model and the adopted framework. Different types of models can have different resource needs in terms of CPU, GPU and memory requirements. Some allow streaming or batching of data, while others require all data to be processed in memory at once.
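A minimal sketch of measuring inference time and peak GPU memory with PyTorch, assuming a CUDA device is available; the model and batch size are arbitrary:

import time
import torch

model = torch.nn.Linear(1000, 1000).cuda()
x = torch.randn(64, 1000).cuda()

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()                 # ensure pending GPU work is finished before timing
start = time.perf_counter()

with torch.no_grad():
    for _ in range(100):
        model(x)

torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{elapsed / 100 * 1e3:.3f} ms per batch")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")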

Many models, particularly neural networks, benefit significantly from the use of GPUs, which can efficiently perform the linear algebra operations required for training and inference. GPU memory can be a bottleneck when dealing with large dataset batches or when a data instance is large, such as a high-resolution image. In this case distributed GPU training can be used, splitting datasets or elements of a model over multiple GPUs. Even without a memory bottleneck, training on batches in parallel across different GPUs may be beneficial for large datasets.

It is common in machine learning to need to perform sweeps over a model’s hyperparameters to identify high-performing models, or when using ensemble-like methods.

To reduce the need to train large models from scratch it is common to use a pre-trained model and train further epochs with a specialized dataset, or, for neural networks, to add some extra layers to the end and fit only those, building on the higher-level features that have already been learned.
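A sketch of this fine-tuning approach using a pre-trained torchvision ResNet-18, freezing the existing layers and replacing only the final classification layer; the class count is illustrative and the weights argument may differ between torchvision versions.

import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained weights so only the new layer is trained
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer to predict, say, 5 specialised classes
model.fc = nn.Linear(model.fc.in_features, 5)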

Frameworks#

Standard Machine Learning Frameworks#

Commonly used software for non deep-learning based applications includes scikit-learn and XGBoost.

Deep Learning Frameworks#

Commonly used deep-learning software includes PyTorch (covered below) and TensorFlow.

PyTorch#

PyTorch is a widely used Python library, primarily for deep learning applications, originally derived from the Lua-based Torch library.

It can be installed with pip:

python -m venv .venv
source .venv/bin/activate
pip install torch

Add-ons such as torchvision may also be of interest when working in image-based domains. PyTorch supports distributed computing natively through its torch.distributed module, though there is also a broad ecosystem of ‘wrapper’ tools for distributed computing.
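A minimal sketch of native data-parallel training with torch.distributed, assuming the script is launched with a tool such as torchrun, which sets the relevant environment variables:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda()
ddp_model = DDP(model, device_ids=[local_rank])    # gradients synchronised across processes

# ... the usual training loop, using ddp_model ...

dist.destroy_process_group()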

Horovod allows conversion of serial to distributed runtimes with minimal code changes. It can be installed with:

HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]

Some other options:

  • HOROVOD_WITH_MPI=1

  • HOROVOD_WITH_GLOO=1

  • HOROVOD_GPU_OPERATIONS=NCCL

  • HOROVOD_CPU_OPERATIONS=CCL (Intel OneCCL)
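As an illustration of the minimal code changes Horovod requires, a rough sketch of adapting a simple PyTorch training setup (the model, optimizer and learning-rate scaling here are illustrative):

import torch
import horovod.torch as hvd

hvd.init()                                        # start Horovod
torch.cuda.set_device(hvd.local_rank())           # one GPU per process

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by worker count

# Wrap the optimizer so gradients are averaged across workers
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Ensure all workers start from the same initial state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)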

Similarly, PyTorch Lightning hides a lot of boilerplate code (which is also convenient for serial training); very few changes are needed to move between serial and parallel execution, especially on a SLURM cluster. Ray is a framework for scaling ML-focused workloads; it can be installed with:

pip install -U "ray[default]"

FastAI#

FastAI is a software library, built on PyTorch, for quick model building and prototyping.

Installation:

conda install -c fastchan fastai

Profiling#

This section gives an overview of some options for profiling in PyTorch. The tools listed here have more uses than are explored here, so check their online documentation for more information. PyTorch includes a profiler API that is useful for identifying the time and memory costs of various PyTorch operations in your code.

First, to import the necessary library:

from torch.profiler import profile, record_function, ProfilerActivity

For CPU profiling:

with profile(activities=[ProfilerActivity.CPU],
             record_shapes=True) as prof:
    with record_function("cpuprofile"):
        YOUR_CODE

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

It will print out the stats for the execution. For GPU profiling:

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("gpuprofile"):
        YOUR_CODE

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

The PyTorch profiler can also show the amount of memory (used by the model’s tensors) that was allocated or released during the execution of the model’s operators. This requires passing profile_memory=True:

with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True, record_shapes=True) as prof:
    YOUR_CODE

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

The results can be printed as a table or exported as a JSON trace file. To generate the trace file:

prof.export_chrome_trace("trace.json") 

You can examine the sequence of profiled operators and CUDA kernels in trace.json in the Chrome trace viewer (chrome://tracing) or at https://ui.perfetto.dev.

To use the profiler to record execution events for later visualization in TensorBoard:

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/CODE'))
prof.start()

YOUR_CODE

prof.stop()

The profiling result will be saved under the ./log/CODE directory.

Install the following component to visualize the profiler information:

pip install torch_tb_profiler

Launch TensorBoard:

tensorboard --logdir /path/to/log/CODE

Open the URL TensorBoard prints in a browser, or go to

http://localhost:6006/

The TensorBoard dashboard shows overall model performance and the performance of every PyTorch operator executed on the host or device; the GPU kernel view shows the time each kernel spent on the GPU, along with much other information.

TensorBoard#

TensorBoard allows tracking and visualizing metrics such as loss and accuracy, visualizing the model graph, viewing histograms and displaying images.

First, import the library:

from torch.utils.tensorboard import SummaryWriter             

Create a SummaryWriter instance:

writer = SummaryWriter()

The writer will output to the ./runs/ directory by default.

You can then write scalar values (individually or in groups), images, graphs and histograms. To log a scalar value, use add_scalar(). For example, the following logs a loss value:

writer.add_scalar("Loss/train", LOSS, EPOCH)

If you do not need the summary writer anymore, call the close() method.

writer.close()

Install TensorBoard through the command line to visualize data you logged:

pip install tensorboard

Launch TensorBoard:

tensorboard --logdir /path/to/log/CODE

Open the URL TensorBoard prints in a browser, or go to

http://localhost:6006/

The TensorBoard dashboard shows how the loss and accuracy change with every epoch. You can use it to also track training speed, learning rate, and other scalar values.

Further Reading#