Machine Learning#
Machine learning is a field in which statistical models, fitted to previously provided data, are used to make predictions about new incoming data.
It has many subfields and related technologies:
Supervised Learning
Unsupervised Learning
Supervised learning involves using a training set of ‘labelled’ or annotated data, which is used to train a model that is able to predict labels for unseen input data. Unsupervised learning is based on unlabelled or partially labelled data - where the model learns to label unseen data based on relations it determines in the training set. Reinforcement learning involves a decision-making agent learning to complete tasks based on taking a certain sequence of actions.
Neural networks are graphical structures which can be used to construct models of data via changing a set of weights. The ‘depth’ of a neural network refers to the number of ‘layers’ of connected nodes it has. Deep learning involves using neural networks with several layers of nodes.
Machine learning model training often involves the repeated solution of linear systems of equations, often with low requirements on numerical precision. This makes training workloads well suited to running on GPUs, which can give very significant speedups relative to CPU computation.
Nvidia’s CUDA language and ecosystem are widely used in machine learning problems.
Supervised Learning#
Two common types of supervised learning are ‘classification’ and ‘regression’. Regression involves finding the parameters of a model such that it best describes (or fits) some data. That model can then predict some outcome based on new or unseen data. Classification involves predicting which of a discrete set of classes corresponds to some input data.
In one-dimensional linear regression the model takes the form:
\[ \mathbf{\hat{y}} = w\mathbf{x} + b \]
where \(w\) is a scalar weight, \(\mathbf{x}\) is a vector of scalar input data and \(b\) is a scalar offset. Performing a best fit of the data amounts to determining the values of \(w\) and \(b\) that minimise the distance between the vector of predictions \(\mathbf{\hat{y}}\) and known values \(\mathbf{y}\) for a given \(\mathbf{x}\). The function that describes this distance, and which is minimized, is often called the ‘loss’ function. Classification works in a similar way - except the loss function is formulated to work with categorical rather than continuous outputs. For the case of linear regression the minimising values can be found analytically - however in general that is not the case and an optimization problem needs to be solved. Gradient descent is a popular method for this, which involves iteratively searching for the minimum using the derivative of the loss function with respect to the weights to direct the search process.
If the amount of input data is relatively small it is possible to calculate gradients using all of it at once. However this is often not feasible, so the training data is split into smaller batches and the gradient is estimated from one batch at a time, a technique known as Stochastic Gradient Descent (SGD). There are a variety of SGD algorithms available with different qualities for efficiently exploring the optimization landscape while trying to avoid getting stuck in local minima. The ‘learning rate’, or how large a step the algorithm takes per iteration in the landscape, is an important parameter. Parameters that influence the model outputs, but which aren’t themselves weights in the model’s functions, are known as hyperparameters.
In summary - a supervised machine learning problem typically involves the definition of a ‘loss function’ of some scalar parameters, known as weights, which is to be minimized for a given set of ‘input data’. For problems with a large amount of data a model is trained by iteratively minimizing the loss function using batches of the input data, passing through all of the data in an epoch. Training progresses for a fixed number of epochs or until some stopping criterion such as a suitably low loss or small change in loss over epochs is reached.
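The training loop summarised above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `sgd_fit`, the synthetic noise-free data and all default values (learning rate, batch size, tolerance) are assumptions made for this example.

```python
import random

def sgd_fit(xs, ys, lr=0.05, batch_size=4, max_epochs=500, tol=1e-10, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    prev_loss = float("inf")
    epochs_run = 0
    for _ in range(max_epochs):
        epochs_run += 1
        rng.shuffle(indices)  # draw new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch loss with respect to w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        # One epoch is complete once every batch has been visited
        loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if abs(prev_loss - loss) < tol:  # stopping criterion: loss stopped changing
            break
        prev_loss = loss
    return w, b, loss, epochs_run

# Synthetic data generated from y = 3x + 1 (an assumption for illustration)
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, loss, epochs_run = sgd_fit(xs, ys)
print(f"w={w:.3f} b={b:.3f} loss={loss:.2e} epochs={epochs_run}")
```

The learning rate and batch size here are hyperparameters in the sense described above: they are not weights of the model, but changing them changes how quickly, and whether, the fit converges.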
Assessing Model Quality#
An important part of a machine learning process is to assess the predictive ability of the trained model. In the case of regression the loss function is some measure of distance between the annotation (ground truth) and the model prediction. So a lower loss implies a better fit to the data. However it is important to note that imposition of structure \( w\mathbf{x} + b\) in the model - which is always necessary to some degree - carries implicit assumptions about the nature of the process that generated the data in the first place.
Considering the regression case, adding higher order terms such as dependence on the square of the data may produce a better fit as measured by the loss - but may make the model poorer at dealing with new data. This is known as overfitting. Conversely, a model that is too simple may underfit the data and provide a poor proxy of the underlying generating process. Models may also be highly unsuitable for describing the generating process but still produce a convincing loss value and generalize to unseen data - this can happen in the case of rare events or if the typical assumption of ‘independent and identically distributed’ data generation doesn’t hold.
Further approaches that can aid in assessing model and data quality include the use of measures of true and false positive and negative rates. Models with high true positive and true negative rates are desirable - however there is often a trade-off when optimizing these rates - for example improving the true positive rate may also increase the false positive rate. Tools such as the Receiver Operating Characteristic (ROC) curve can help in comparing the predictive ability of models with respect to these rates.
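As a small sketch of how such rates are obtained, the snippet below computes the true positive rate and false positive rate of a binary classifier at several score thresholds. Each threshold gives one point on a ROC curve. The labels, scores and the helper name `rates` are made up for illustration.

```python
def rates(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical ground-truth labels and model scores
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Lowering the threshold finds more true positives but also more false positives
roc_points = [rates(labels, scores, t) for t in (0.85, 0.5, 0.25)]
print(roc_points)
```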
Data Preparation#
In a supervised learning scenario some preparation is likely necessary to prepare the data for model training. The data can come in many modalities, such as:
a path or URL to data somewhere on a network
an API or endpoint from which to query data
a stream of incoming data
First the data is collected or at least cached somewhere for processing. Then it is ‘cleaned’ - which entails bringing it into a standard format and removing invalid or unexpected entries. The data is then ‘annotated’ or ‘labelled’ - with values of the same type a trained model is expected to predict.
If an individual data sample is large, such as an image, it may be broken into smaller sub-samples, based on, for example, an assumption of spatial uniformity of interesting features.
Following this the data may be split into different datasets - often ‘training’, ‘validation’ and ‘test’ sets. The idea is that the ‘training’ set will be used during minimization of the loss function, the ‘validation’ set will be used to evaluate a model as it is being trained and the ‘test’ set is used only as a final characterisation of the model’s behaviour. It is generally important to make sure the data is representative of the process that generates it - it is easy to introduce bias in dataset collection which will transfer through to any model trained on it without dedicated mitigation.
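A minimal sketch of such a split is shown below. The 70/15/15 proportions, the fixed seed and the function name `split_dataset` are assumptions for this example; shuffling before splitting is one simple way to avoid ordering effects, but it does not by itself guarantee a representative dataset.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle a copy of `samples` and split it into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for the final test
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```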
Data may also need to be further modified to suit its use for a particular model. Typically data values are normalized to between 0 and 1 which results in a more amenable optimization landscape and easier interpretation of the loss value. For image data it may also be scaled to have a certain average and variation in pixel values to suit a particular model. Image data channel ordering may also be important. Image data may be resampled so the dimensions are suitable for the model to ingest them. They may also be cropped or re-centered if a particular subject is being focused on. Categorical data may be ‘one hot encoded’ - for example a value of ‘3’ among 5 classes may be represented as (0 0 1 0 0) which can better suit certain types of model.
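Two of these transforms, scaling to the range 0 to 1 and one-hot encoding, can be sketched as follows. The function names are illustrative, and the one-hot helper assumes class labels are numbered from 1, matching the example in the text.

```python
def min_max_normalise(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant data: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(label, num_classes):
    """Encode a 1-based class label, e.g. 3 of 5 -> [0, 0, 1, 0, 0]."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must lie between 1 and num_classes")
    return [1 if i == label else 0 for i in range(1, num_classes + 1)]

print(min_max_normalise([2.0, 4.0, 6.0]))
print(one_hot(3, 5))
```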
It is always advisable to preview data at every transformation stage to make sure it appears as expected - the transforms should also be subject to automated testing using reference data to ensure they behave as expected.
Image Analysis#
Some common image analysis techniques include:
image classification - determination of which class an object in an image belongs to
object detection - location of a particular object in an image, for example via bounding box. This may be combined with classification.
semantic segmentation - determination of the pixels that comprise an object. Can include multi-class semantic segmentation.
Deep learning is commonly used in machine-learning based image analysis. The convolutional neural network (CNN) architecture is commonly adopted - which amounts to training a set of filters on a hierarchy of progressively larger pixel blocks. This approach allows neural networks to be trained on images without needing an unfeasibly large number of neurons when describing each pixel.
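To illustrate what applying a filter to pixel blocks means, the sketch below slides a small hand-written filter over a tiny image. In a CNN the filter values would be learned during training rather than fixed by hand; the image, the filter and the function name `convolve2d` are assumptions for this example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and sum the products.

    This is the cross-correlation commonly called 'convolution' in CNNs.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# A 3x4 image with a dark left half and a bright right half
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_filter = [[-1, 1]]  # responds to a dark-to-bright step from left to right
response = convolve2d(image, edge_filter)
print(response)
```

The same two filter weights are reused at every position, which is why the number of trainable parameters does not grow with the number of pixels.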
Examples#
Linear Regression#
Understanding the theoretical framework for linear regression techniques provides us with a simple yet insightful baseline for most machine learning models.
The ultimate goal of linear regression is to refine the predicted parameters of the regression curve to minimise error with the provided data.
We begin by considering a one dimensional dataset \(\mathbf{X} = \{x_1, x_2, \dots, x_n\}\) of \(n\) predictor variables. There is a corresponding one dimensional dataset \(\mathbf{Y} = \{y_1, y_2, \dots, y_n\}\) of \(n\) dependent variables.
Our hope is that given these datasets there exists a reasonable linear relationship between them described by
\[ \hat{y}_i = \omega x_i + \beta \]
where \(\omega\) is the slope (or weight in ML terms) and \(\beta\) is the intercept (or bias in ML terms). This is the foundation for linear regression and trying to find the line of best fit.
To determine this line of best fit we need some concept of error so that we can improve our model parameters to minimise this error term. As we are dealing with the distance between points we could use any metric to achieve this. A popular choice is the L2 (Euclidean) norm, although there are other alternatives such as the L1 (absolute) norm.
We want to minimise the difference between the true \(\mathbf{Y}\) terms and the predicted \(\mathbf{\hat{Y}}\) terms derived from our linear model. The error term associated with the L2 norm is known as the mean square error (MSE). The formula for the MSE is given by
\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \]
where \(y_i \in \mathbf{Y}\) and \(\hat{y}_i \in \mathbf{\hat{Y}}\).
A benefit of linear regression is that it has an exact analytical solution. This analytical solution, known as the normal equation, is derived from the MSE cost function \(J\). The MSE cost function, with parameters corresponding to \(\mathbf{\Theta} = \{\Theta_0 = \omega, \Theta_1 = \beta\}\), is given as
\[ J(\mathbf{\Theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(\Theta_0 x_i + \Theta_1 - y_i\right)^2 \]
Thus, for a linear system described by
\[ \mathbf{\hat{Y}} = \mathbf{X}\mathbf{\Theta} \]
where the \(i\)-th row of the matrix \(\mathbf{X}\) is taken to be \((x_i, 1)\) so that the intercept is included, we want to minimise the error produced by our vectorised cost function \(J\).
This is achieved by taking the partial derivatives of \(J\) and solving for the parameters of the cost function that find the global minimum. Exploiting some basic yet tedious linear algebra and associated calculus produces the following normal equation
\[ \mathbf{\Theta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y} \]
recalling that \(\mathbf{\Theta}\) is the vector containing the slope and intercept terms.
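For the one-dimensional case the normal equation reduces to a 2x2 linear system, so it can be sketched without a linear algebra library. The snippet assumes a design matrix with rows \((x_i, 1)\); the function name `normal_equation_1d` and the sample data are illustrative only.

```python
def normal_equation_1d(xs, ys):
    """Solve (X^T X) Theta = X^T Y for slope and intercept, with rows of X = (x_i, 1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[sxx, sx], [sx, n]] and X^T Y = [sxy, sy]
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")
    omega = (n * sxy - sx * sy) / det
    beta = (sxx * sy - sx * sxy) / det
    return omega, beta

# Data lying exactly on y = 2x + 1
omega, beta = normal_equation_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(omega, beta)
```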
However the computational time needed for these matrix operations (particularly the matrix inversion) can lead to long and inefficient completion times. A solution to this can be in the form of an iterative algorithm known as gradient descent.
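Gradient descent is described by the following scheme
\[ \omega \leftarrow \omega - L\,\Delta_\omega, \qquad \beta \leftarrow \beta - L\,\Delta_\beta \]
where \(\Delta_\omega\) represents the partial derivative of the MSE cost function with respect to \(\omega\) and likewise for \(\beta\). The \(L\) term represents the learning rate, which sets the size of each step. The partial derivatives are given by
\[ \Delta_\omega = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right), \qquad \Delta_\beta = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right) \]
Thus, for linear regression, the gradient descent formula is given by
\[ \omega \leftarrow \omega + \frac{2L}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right), \qquad \beta \leftarrow \beta + \frac{2L}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right) \]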
This continues until the MSE is lower than some tolerance level or a maximum number of iterations is reached.
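The scheme can be sketched as a short loop. The MSE is assumed to carry a \(1/n\) prefactor, and the learning rate, tolerance, iteration cap and function name `gradient_descent` are example values chosen for this illustration.

```python
def gradient_descent(xs, ys, lr=0.05, tol=1e-8, max_iters=10_000):
    """Full-batch gradient descent for y ~ omega*x + beta on the MSE."""
    n = len(xs)
    omega, beta = 0.0, 0.0
    mse = float("inf")
    for _ in range(max_iters):
        residuals = [y - (omega * x + beta) for x, y in zip(xs, ys)]
        mse = sum(r * r for r in residuals) / n
        if mse < tol:  # stop once the error is below the tolerance
            break
        # Partial derivatives of the MSE with respect to omega and beta
        d_omega = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
        d_beta = -2.0 / n * sum(residuals)
        omega -= lr * d_omega
        beta -= lr * d_beta
    return omega, beta, mse

# Data lying exactly on y = 2x + 1
omega, beta, mse = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"omega={omega:.4f} beta={beta:.4f} mse={mse:.2e}")
```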
A variant of gradient descent, called stochastic gradient descent (SGD), achieves similar results and is common in machine learning. In SGD we use a random subset of the data in our MSE calculation.
Support Vector Machines#
Support Vector Machines (SVM) are a supervised machine learning algorithm used to classify data points. The simplest form, known as a linear hard-margin SVM, assumes the data is separable by a hyperplane \(H = \vec{w}^T \vec{x} + b=0\) and belongs to one of two classes \(\{A,B\}\), that is
\[ \vec{w}^T\vec{x}_i + b \geq 1 \ \text{if}\ \vec{x}_i \in A, \qquad \vec{w}^T\vec{x}_i + b \leq -1 \ \text{if}\ \vec{x}_i \in B \]
There is actually an infinite number of such hyperplanes, so we are left with the task of picking the “best” hyperplane. The definition of “best” we pick is the hyperplane for which the separation margin is maximum. A quick bit of vector geometry shows that the size of the margin is
\[ \frac{2}{\|\vec{w}\|} \]
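We call the data points that are the closest to the separating hyperplane the support vectors. If we assign the label \(y_i = +1\) to elements of class \(A\), and \(y_i = -1\) to those in \(B\), then we can formulate the SVM problem as a constrained convex optimisation problem
\[ \min_{\vec{w},\,b}\ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i\left(\vec{w}^T\vec{x}_i + b\right) \geq 1 \ \text{for all}\ i \]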
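This form of the problem is known as the primal problem. Using the theory of Lagrange multipliers and the strong duality theorem, we can formulate the so-called dual problem as
\[ \max_{\lambda}\ \vec{1}^T\lambda - \frac{1}{2}\lambda^T X^T X \lambda \quad \text{subject to} \quad \lambda_i \geq 0, \quad \sum_i \lambda_i y_i = 0 \]
where \(\lambda\) is a vector of Lagrange multipliers, \(X\) is a matrix with columns \(y_i\vec{x}_i\), and \(\vec{1}\) is a vector of ones. The primal solution can then be reconstructed as
\[ \vec{w} = X\lambda = \sum_i \lambda_i y_i \vec{x}_i, \qquad b = y_k - \vec{w}^T\vec{x}_k \ \text{for any support vector}\ \vec{x}_k \]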
If the data is not linearly separable, then the above solution is no longer valid. To combat this, we may employ the Kernel Trick which maps non-linearly separable data to a Hilbert space in which they are linearly separable. Two commonly used kernels are the polynomial basis kernel and the radial basis or Gaussian kernel:
Here \(\alpha,\beta,\gamma\) are parameters that must be chosen when employing the kernel.
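As a minimal sketch, the two kernels can be written as plain functions. The forms used here, \((\vec{x}\cdot\vec{z} + \text{offset})^{\text{degree}}\) for the polynomial kernel and \(\exp(-\text{gamma}\,\|\vec{x}-\vec{z}\|^2)\) for the Gaussian kernel, are common conventions assumed for this example; the parameter names `offset`, `degree` and `gamma` are illustrative and are not tied to a particular assignment of \(\alpha,\beta,\gamma\).

```python
import math

def polynomial_kernel(x, z, offset=1.0, degree=2):
    """(x . z + offset) ** degree"""
    return (sum(a * b for a, b in zip(x, z)) + offset) ** degree

def gaussian_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); equals 1 when x == z and decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0]))
```

In a kernelised SVM these functions replace the inner products \(\vec{x}_i\cdot\vec{x}_j\) that appear in the dual problem, so the mapping to the higher-dimensional space never has to be computed explicitly.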
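Soft-margin SVMs generalise hard-margin SVMs by allowing points to be misclassified. This misclassification allows the model to be trained on data that is not perfectly separable. To do this, we introduce slack variables \(\vec{\xi}\) and regularisation parameter \(C\). The primal problem now becomes
\[ \min_{\vec{w},\,b,\,\vec{\xi}}\ \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i\left(\vec{w}^T\vec{x}_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]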
The regularisation parameter determines how much the model should be penalised for allowing misclassification. In the \(C\rightarrow\infty\) limit we recover the hard-margin case as the model severely penalises misclassification. In the \(C\rightarrow0\) limit, the model has no penalty for making mistakes and so we lose all classifying ability. Although the primal problem has an entire new vector to optimise over, it can be shown that the dual problem takes the form
\[ \max_{\lambda}\ \vec{1}^T\lambda - \frac{1}{2}\lambda^T X^T X \lambda \quad \text{subject to} \quad 0 \leq \lambda_i \leq C, \quad \sum_i \lambda_i y_i = 0 \]
The only thing that changes is that the Lagrange multipliers must now be bounded from above by the regularisation parameter.
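To make the role of the slack variables and \(C\) concrete, the sketch below evaluates the soft-margin primal objective for a fixed, hand-picked hyperplane. It assumes the standard form \(\tfrac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i\), in which the smallest feasible slack for each point is \(\xi_i = \max(0,\, 1 - y_i(\vec{w}^T\vec{x}_i + b))\). The data, the hyperplane and the function name are made up for illustration, and no optimisation is performed.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """Return (objective, slacks) for a given hyperplane (w, b)."""
    slacks = [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(xs, ys)
    ]
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (0.5, 0.0)]
ys = [1, 1, -1, -1]      # the last point lies on the wrong side of the hyperplane
w, b = (1.0, 0.0), 0.0   # hypothetical hyperplane x_1 = 0

objective, slacks = soft_margin_objective(w, b, xs, ys, C=10.0)
print(objective, slacks)
```

Only the misclassified point has non-zero slack, and its contribution to the objective scales linearly with \(C\).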
Logistic Regression#
Logistic regression, also known as logit regression, is a supervised machine learning model used in classification problems. In its simplest form it is a binary classifier, but it can be extended to \(n\) independent classes. In data science it is commonly used alongside support vector machines and forest classifiers.
The basis of logistic regression is the logistic or sigmoid function defined by
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
The sigmoid function has codomain \((0,1)\) which allows it to be a candidate for a probability function. One useful property that speeds up computations is that the derivative of the sigmoid function satisfies
\[ \sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right) \]
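The derivative identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) means the gradient can be obtained from a value that has already been computed. A minimal sketch, with a branch added to avoid overflow for large negative inputs (an implementation choice of this example):

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_derivative(z):
    """Uses the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))
```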
To use this function for binary classification we introduce weights \(\vec{w}\) and bias \(b\) that parameterise our model by
\[ \hat{y} = \sigma\left(\vec{w}\cdot\vec{x} + b\right) \]
We use the shorthand \(z=\vec{w}\cdot\vec{x}+b\) going forward. Unlike for neural networks, the weights and bias can be initialised to zero in logistic regression problems. The error or cost function used in binary classification is the binary-cross-entropy formula. Let \(y\) be the correct label and let \(\hat{y}\) be the probability our model predicts. The objective function is then
\[ L(y,\hat{y}) = -\left[\,y\log\hat{y} + (1-y)\log\left(1-\hat{y}\right)\right] \]
Applying the chain rule, the gradients we require for our optimisation scheme are
\[ \frac{\partial L}{\partial \vec{w}} = \left(\hat{y} - y\right)\vec{x}, \qquad \frac{\partial L}{\partial b} = \hat{y} - y \]
If we instead are working with batches, the cost function takes the form
\[ J = \frac{1}{m}\sum_{i=1}^{m} L_i \]
where \(m\) is the size of the batch and \(L_i\) is the binary-cross-entropy loss evaluated for the \(i\)-th element of the batch.
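Putting these pieces together, the sketch below trains a binary logistic regression model by gradient descent on the batch-averaged binary-cross-entropy, using the gradients \((\hat{y} - y)\vec{x}\) and \(\hat{y} - y\) that follow from the chain rule. The toy data, learning rate, epoch count and function names are assumptions for this example, and the whole dataset is treated as a single batch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=1000):
    """Gradient descent on the mean binary cross-entropy over one batch."""
    w = [0.0] * len(xs[0])  # weights and bias initialised to zero
    b = 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            error = y_hat - y  # dL/dz for the cross-entropy loss
            for j, xj in enumerate(x):
                grad_w[j] += error * xj / m
            grad_b += error / m
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-dimensional data: class 0 for small x, class 1 for large x
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
probabilities = [predict_proba(w, b, x) for x in xs]
print([round(p, 3) for p in probabilities])
```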
If we ran our scheme indefinitely on data that the model can separate, the loss would be driven towards zero and the model would fit the training data perfectly. This is overfitting and we want to avoid it. To accomplish this, we introduce a penalty term to the objective function:
Here \(\lambda\) is called the regularisation parameter and \(\vec{\Theta}\) is a vector of all our parameters: all weights and biases. The penalty function takes the form of a \(p\)-norm or a combination of them
Our required derivatives are
Multinomial Logistic Regression#
Having described binary classification we now turn our attention to problems with more than two classes. Such a problem has the name multinomial logistic regression. The approach is very similar to binary logistic regression but we need to modify our sigmoid function and generalise some expressions.
The first change that we make is that the label is no longer binary. We encode this information by assigning to each \(\vec{x}_i\) a label vector \(\vec{y}_i\) which has length equal to the number of classes. The only non-zero element of this vector is in the \(c\)-th position where \(c\) is the correct class.
Next we introduce the generalisation of the sigmoid called the softmax function. This function maps a vector \(\vec{z}\) to a probability vector via
\[ \mathrm{softmax}(\vec{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K \]
where \(K\) is the number of classes.
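A minimal sketch of the softmax function is given below. Subtracting the largest entry before exponentiating is a common numerical safeguard assumed here; it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math

def softmax(z):
    """Map a list of K scores to K probabilities that sum to one."""
    shift = max(z)  # avoids overflow in exp for large scores
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])
```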
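Our weights and bias must also be generalised. The idea here is that each class gets its own weight vector and bias term. We represent all of these as
\[ W = \left(\vec{w}_1, \dots, \vec{w}_K\right)^T, \qquad \vec{b} = \left(b_1, \dots, b_K\right)^T, \qquad \vec{z} = W\vec{x} + \vec{b} \]
so that the model prediction is \(\hat{\vec{y}} = \mathrm{softmax}(\vec{z})\).
The final modification is to the loss function:
\[ L(\vec{y},\hat{\vec{y}}) = -\vec{y}\cdot\log\hat{\vec{y}} = -\sum_{k=1}^{K} y_k \log\hat{y}_k \]
This dot product only has one non-zero term, which corresponds to the correct class, because \(y_k\) is zero for all other classes. The derivatives we require for our gradient scheme are
\[ \frac{\partial L}{\partial W} = \left(\hat{\vec{y}} - \vec{y}\right)\vec{x}^T, \qquad \frac{\partial L}{\partial \vec{b}} = \hat{\vec{y}} - \vec{y} \]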
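The loss and its gradient can be sketched for a single sample as follows. The code assumes the predicted probabilities come from a softmax, in which case the gradient of the loss with respect to the pre-softmax scores is simply the prediction minus the one-hot label. The probability values and function names are made up for illustration.

```python
import math

def cross_entropy(y_onehot, y_hat):
    """-sum_k y_k * log(y_hat_k); only the correct class contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat) if y)

def grad_wrt_scores(y_onehot, y_hat):
    """Gradient of the loss with respect to z, assuming y_hat = softmax(z)."""
    return [p - y for y, p in zip(y_onehot, y_hat)]

y = [0, 1, 0]            # one-hot label: the second class is correct
y_hat = [0.2, 0.7, 0.1]  # hypothetical softmax output
loss = cross_entropy(y, y_hat)
grad_z = grad_wrt_scores(y, y_hat)
print(round(loss, 4), [round(g, 2) for g in grad_z])
```

The gradient is negative for the correct class and positive for the others, so a gradient step raises the score of the correct class and lowers the rest.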
Training and Inference Performance#
It is usually of interest to measure the training and inference performance of a model and adopted framework. Different types of models can have different resource needs, in terms of CPU, GPU and memory requirements. Some allow streaming or batching of data while some require all data to be processed in memory at once.
Many models, particularly neural networks, benefit significantly from the use of GPUs - which can efficiently perform the linear algebra operations required in their solution. GPU memory can be a bottleneck when dealing with large dataset batches or when a data instance is large, such as a high resolution image. Distributed GPU training can be used by splitting datasets or elements of a model over multiple GPUs in this case. Even without the memory bottleneck - for large datasets training batches in parallel on different GPUs may be beneficial.
It is common in machine learning to need to perform sweeps over a model’s hyperparameters, either to identify high-performing models or when using ensemble-like methods.
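A simple grid sweep can be sketched as below. The search space and the `validation_loss` function are hypothetical stand-ins: in practice each configuration would involve training a model and measuring its loss on the validation set, and the configurations are independent so they can be run in parallel.

```python
from itertools import product

def grid(search_space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))

def validation_loss(config):
    # Hypothetical stand-in for "train a model with this config and
    # return its validation loss"; the formula has no physical meaning.
    return (config["learning_rate"] - 0.01) ** 2 + 0.001 * config["batch_size"]

search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [16, 32],
}
configs = list(grid(search_space))
best = min(configs, key=validation_loss)
print(len(configs), best)
```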
To reduce the need to train large models from scratch it is common to start from a pre-trained model and either train it for further epochs on a specialized dataset or, for neural networks, add some extra layers to the end so that only these are fitted, building on the higher-level features that have already been learned.
Frameworks#
Standard Machine Learning Frameworks#
Commonly used software for non deep-learning based applications includes:
Deep Learning Frameworks#
Commonly used deep-learning software includes:
PyTorch#
PyTorch is a widely used Python library, primarily for deep learning applications, deriving from Lua Torch.
It can be installed with pip:
python -m venv .venv
source .venv/bin/activate
pip install torch
Addons such as torchvision may also be of interest if working in image-based domains. PyTorch supports distributed computing natively through its torch.distributed module; there is also a broad ecosystem of ‘wrapper’ tools for distributed computing.
Horovod allows conversion of serial to distributed runtimes with minimal code changes. It can be installed with:
HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]
Some other options:
HOROVOD_WITH_MPI=1
HOROVOD_WITH_GLOO=1
HOROVOD_GPU_OPERATIONS=NCCL
HOROVOD_CPU_OPERATIONS=CCL (Intel OneCCL)
Similarly, PyTorch Lightning hides a lot of boilerplate code (which is also useful for serial training). Very few changes are needed between serial and parallel execution, especially on a SLURM cluster. Ray is a framework for scaling ML-focused workloads. It can be installed with:
pip install -U "ray[default]"
FastAI#
FastAI is a software library for quick model building and prototyping.
Installation:
conda install -c fastchan fastai
Profiling#
This section gives an overview of some options for profiling in PyTorch. The tools listed here have more uses than are explored here, so check out their docs online for more information. PyTorch includes a profiler API that is useful to identify the time and memory costs of various PyTorch operations in your code.
First, import the necessary names:
from torch.profiler import profile, record_function, ProfilerActivity
For CPU profiling:
with profile(activities=[ProfilerActivity.CPU],
             record_shapes=True) as prof:
    with record_function("cpuprofile"):
        YOUR_CODE  # placeholder: replace with the code to profile
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
It will print out the stats for the execution. For GPU profiling:
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("gpuprofile"):
        YOUR_CODE  # placeholder: replace with the code to profile
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
PyTorch profiler can also show the amount of memory (used by the model’s tensors) that was allocated (or released) during the execution of the model’s operators.
with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True,  # needed for the memory columns to be recorded
             record_shapes=True) as prof:
    YOUR_CODE  # placeholder, e.g. train(session_result_dir, ops_path, params)
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
The results can be printed as a table or returned in a JSON trace file. To generate the output:
prof.export_chrome_trace("trace.json")
You can examine the sequence of profiled operators and CUDA kernels in trace.json in Chrome trace viewer chrome://tracing or https://ui.perfetto.dev.
To use the profiler to record execution events do:
import torch

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/CODE'))
prof.start()
YOUR_CODE  # placeholder: replace with the code to profile
prof.stop()
The profiling result will be saved under the ./log/CODE directory.
Install the following component to visualize the profiler information:
pip install torch_tb_profiler
Launch the TensorBoard.
tensorboard --logdir /path/to/log/CODE
Open the URL that TensorBoard prints in a browser, or go to
http://localhost:6006/
The TensorBoard dashboard shows overall model performance and the performance of every PyTorch operator executed on the host or device. The GPU kernel view shows the time spent in each kernel on the GPU, and further views provide additional information.
Tensorboard#
TensorBoard allows tracking and visualizing metrics such as loss and accuracy, visualizing the model graph, viewing histograms and displaying images.
First, import the SummaryWriter class:
from torch.utils.tensorboard import SummaryWriter
Create a SummaryWriter instance:
writer = SummaryWriter()
The writer outputs to the ./runs/ directory by default.
Then log the values of interest: scalars (individually or in groups), images, graphs and histograms.
To log a scalar value, use add_scalar(). For example, the following logs a loss value:
writer.add_scalar("Loss/train", LOSS, EPOCH)
When you no longer need the summary writer, call its close() method.
writer.close()
Install TensorBoard through the command line to visualize data you logged:
pip install tensorboard
Launch the TensorBoard.
tensorboard --logdir runs
Open the URL that TensorBoard prints in a browser, or go to
http://localhost:6006/
The TensorBoard dashboard shows how the loss and accuracy change with every epoch. You can use it to also track training speed, learning rate, and other scalar values.