Data Analytics#
Introduction#
Data analytics is concerned with tooling and methods for generating insights from data.
Depending on the application in question, data can come in many formats - such as text (encoded as ASCII or similar, or binary), image, video, audio or time series. The data can be structured, following a schema with defined fields and attributes, or unstructured, following an ad-hoc format.
Data Formats#
It is common for data to be stored in text-based formats. These can be human-readable or, for space efficiency, binary. Human-readable formats are often stored in a file with a given ‘encoding’, which maps the sequence of bytes in the file to text characters. Historically, human-readable text files were assumed to be ASCII encoded - supporting fewer than 100 printable characters. It is now more common for text data to support Unicode, with UTF-8 being the most common encoding format. When reading, writing and transmitting plain-text files it is becoming increasingly common to explicitly identify the encoding to avoid processing errors for yourself and others.
Tabular data is often stored in ‘Comma Separated Value’ (CSV) format, with an optional first row of comma-separated headers followed by subsequent data rows with comma-separated values. This format is widely supported by libraries and applications due to its simplicity - however on its own it does not give much context on the data it contains.
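Python’s standard library can parse this format directly; a short sketch (the column names here are made up for illustration):

```python
import csv
import io

# DictReader treats the first row as headers and yields one dict per
# data row. Note that every value comes back as a string - CSV itself
# carries no type information, illustrating its lack of context.
text = "name,value\nalpha,1\nbeta,2\n"
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"], rows[0]["value"])   # alpha 1
```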
More advanced data formats, such as ‘Network Common Data Form’ (NetCDF), are self-describing - in that they contain both the data itself and, via metadata, a description of what the data means and where it came from. This makes the data easier to interchange and process in machine-to-machine workflows. NetCDF is an example of a binary format: it is not straightforward for a human to open a file and read it. Thus, unlike CSV, NetCDF needs dedicated libraries to read and write its files - of which many are available.
Images can be represented as a two-dimensional regular grid, with each grid point or pixel having at least one but often up to four values. The number of values a pixel has is often known as the number of channels. Single-channel images usually contain a grayscale value per pixel - whose representation depends on the image’s ‘bit-depth’. 8-bit images, for example, can represent 256 distinct values and can be stored in a single byte or char. Colors in images can be described via a pixel’s coordinates in a ‘color-space’. A common color-space is RGB - where a pixel has three values corresponding to the intensity of Red, Green and Blue colors. Images can also have ‘alpha’ channels, which describe opacity - this applies to both grayscale and color images. Thus, images can be grayscale, grayscale with alpha, RGB or RGBA. Indeed, several other combinations and different color spaces are possible. Images on disk are usually stored in a binary and compressed format, such as PNG, JPG or TIFF. To read images a dedicated library is usually needed. It is important when reading and writing images in scripts that you are aware of the order of channels, bit-depth and color space that you are working with to avoid errors in your analysis. These quantities are usually queryable in image processing libraries.
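These channel layouts can be inspected with the pillow library listed below, where the channel combination is exposed as the image ‘mode’. A small sketch, using an image created in memory rather than read from disk:

```python
from PIL import Image

# An in-memory 4x4 image in RGB mode: three 8-bit channels per pixel.
img = Image.new("RGB", (4, 4), color=(255, 0, 0))
print(img.mode, img.size)              # RGB (4, 4)

# Converting to mode "L" collapses the colors to a single grayscale
# channel via a weighted sum of R, G and B.
gray = img.convert("L")
print(gray.getpixel((0, 0)))           # 76 (the luminance of pure red)

# Converting to RGBA adds an alpha (opacity) channel, fully opaque.
rgba = img.convert("RGBA")
print(rgba.getpixel((0, 0)))           # (255, 0, 0, 255)
```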
Tools and Platforms#
Common languages and tools used in data analytics are Python, R, Matlab and Julia - each having their own ecosystem of processing tools and libraries.
Python#
Some common Python tools are:
numpy: Widely used vector-based scientific computing package
pandas: Python data analytics library
xarray: Tool for working with labelled multidimensional arrays
netcdf4-python: Library for working with NetCDF files
scikit-learn: Machine learning library for Python (not Deep Learning focused)
pydantic: Data type modelling with built-in type checking and static analysis
sqlalchemy: SQL toolkit for working with databases, including an object-relational mapper
pillow: Simple image processing library, including reading and writing. A maintained fork of PIL.
OpenCV: Python wrapper for powerful computer vision library with image and video processing
Matplotlib: Python visualization and plotting library
seaborn: Layer over matplotlib for statistical data visualization
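A brief taste of how the first two libraries in this list combine, with column names invented for the example:

```python
import numpy as np
import pandas as pd

# Build a small labelled table from numpy arrays, then compute a
# grouped summary statistic - a typical pandas workflow.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": np.array([1.0, 2.0, 3.0, 4.0]),
})
means = df.groupby("group")["value"].mean()
print(means["a"], means["b"])   # 1.5 3.5
```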
Jupyter#
The Jupyter project is an ecosystem for interactive data analytics based on the Julia, Python and R languages. One of its better-known tools is ‘Jupyter Notebooks’ - user-friendly, web-browser based applications for interactive code execution and visualization.
Jupyter can be installed in several ways, but it is recommended to do it in a Python virtual environment with:
python3 -m venv .venv
source .venv/bin/activate
pip install jupyter