Storage#
This section gives some background on data storage technologies.
Storage Media#
Some relevant attributes of storage technologies are:
capacity - how much data can be stored
bandwidth - how quickly data can be read from or written to the medium
redundancy - how well the data is protected against system or hardware failure
resilience - how likely the storage system is to fail
cost - what it costs to store the data
compatibility - how easy it is to transfer the data to or from other media
Choosing a storage technology usually means compromising between these attributes. CPU cache memory has high bandwidth but also a high cost per unit of capacity. Magnetic tape has a low cost for large amounts of data but also low bandwidth - which can include waiting for a tape to be physically driven from one storage site to another. Between these extremes there is a spectrum of technologies, including faster solid-state drives (SSDs, particularly those attached via NVMe) and slower spinning-disk hard drives (HDDs).
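As a rough illustration of the bandwidth attribute, the sketch below times a sequential write to a temporary file and reports an approximate figure in MB/s. The function name `measure_write_bandwidth` and its parameters are made up for this example, and the number it returns depends heavily on caching, file size and system load, so it should be read as an indication rather than a benchmark.

```python
import os
import tempfile
import time


def measure_write_bandwidth(directory=None, size_mb=16, chunk_mb=1):
    """Return an approximate sequential write bandwidth in MB/s.

    `directory` selects the storage to test (None uses the default
    temporary directory). All names here are illustrative.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # incompressible test data
    n_chunks = size_mb // chunk_mb
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push buffered data to the device
        elapsed = time.perf_counter() - start
    return (n_chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    print(f"approximate write bandwidth: {measure_write_bandwidth():.1f} MB/s")
```

Pointing `directory` at different mount points (a local disk, a scratch area, a network drive) gives a first feel for how far apart these technologies can be.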
Compression#
Compression is an important aspect of data storage. In many cases there is some structure or repetition in the data that is stored, for example in audio, text or images. This can be exploited to reduce the amount of storage space required for a particular amount of information, at the expense of the computational cost of compression and extraction. Compression algorithms can be lossy, where some information is discarded in exchange for higher compression ratios, or lossless, where the original data can be reconstructed exactly.
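The effect of structure on lossless compression can be sketched with Python's standard `zlib` module. The helper name `compression_ratio` and the sample data are chosen for this example only; the point is that repetitive data shrinks considerably while random bytes barely compress at all.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size for `data`."""
    compressed = zlib.compress(data, level=9)
    # Lossless: decompressing must give back exactly the original bytes.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed)


# Highly repetitive text compresses well.
repetitive = b"the quick brown fox jumps over the lazy dog. " * 200
# Random bytes have no structure to exploit.
random_bytes = os.urandom(len(repetitive))

if __name__ == "__main__":
    print(f"repetitive text: {compression_ratio(repetitive):.1f}x")
    print(f"random bytes:    {compression_ratio(random_bytes):.2f}x")
```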
Filesystems#
Data is usually managed via a filesystem, which is in essence a database that keeps attributes for some data and the location of that data on a block storage device. Block storage devices, as they are presented by an operating system's kernel as a layer over the real hardware, are usually formatted with a particular filesystem - which user applications can then interact with in a unified way, via the notion of ‘files’ and ‘directories’.
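The attributes a filesystem keeps for a file can be inspected from Python with `os.stat`. The sketch below writes a small file into a temporary directory and reads back some of its metadata; the helper name `describe` and the choice of fields are illustrative, and which fields are meaningful varies between operating systems and filesystems.

```python
import os
import tempfile
import time


def describe(path):
    """Return a few filesystem attributes recorded for `path`."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,                # length of the file contents
        "permissions": oct(st.st_mode & 0o777),  # permission bits only
        "modified": time.ctime(st.st_mtime),     # last modification time
        "inode": st.st_ino,                      # filesystem's internal identifier
    }


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "example.txt")
    with open(path, "w") as f:
        f.write("hello")
    info = describe(path)

if __name__ == "__main__":
    for name, value in info.items():
        print(f"{name}: {value}")
```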
Although less common, it is entirely possible to have filesystems that don’t sit directly on a block device but are instead virtualized by the operating system. Two useful applications of this are in-memory filesystems and networked filesystems.
In-memory filesystems allow an application to write to memory rather than disk. In modern systems with large amounts of RAM it is often possible to store large amounts of data in memory rather than on disk as an application executes, with significant performance improvements.
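On many Linux systems a memory-backed filesystem (tmpfs) is mounted at `/dev/shm`, and an application can use it through the ordinary file API. The sketch below assumes that path and falls back to the default temporary directory when it is not available; the helper names `pick_memory_backed_dir` and `roundtrip` are hypothetical, and whether `/dev/shm` is really memory-backed depends on how the system is configured.

```python
import os
import tempfile


def pick_memory_backed_dir():
    """Return /dev/shm if it is usable, else the default temp directory.

    Assumption: /dev/shm is a tmpfs mount, as on many Linux systems.
    """
    shm = "/dev/shm"
    if os.path.isdir(shm) and os.access(shm, os.W_OK):
        return shm
    return tempfile.gettempdir()


def roundtrip(payload: bytes, directory: str) -> bytes:
    """Write `payload` to a temporary file in `directory` and read it back."""
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        f.write(payload)
        f.flush()
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    directory = pick_memory_backed_dir()
    data = roundtrip(b"intermediate results", directory)
    print(f"wrote and read {len(data)} bytes via {directory}")
```

Because the application only sees files and directories, switching between disk-backed and memory-backed storage is a matter of changing a path. Data held in a memory-backed filesystem counts against available RAM and does not survive a reboot.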
Network filesystems are common as shared network drives, and particularly on HPC clusters. NFS (Network File System) is a common technology for shared drives, while Lustre is commonly used on HPC systems. Current network filesystems can introduce significant overhead on operations on large systems, as they are highly centralized. There are attempts to add distributed network filesystems or to update existing technologies with this capability - but in the meantime it is worth bearing in mind that network filesystem reads and writes can be much slower than scratch or local disk operations.
Cloud and Object Storage#
Current filesystems can be complex and don’t scale well over large distributed systems. Object storage is a simplified way to store blobs of data without the structure and complexity imposed by such systems.
Amazon’s Simple Storage Service (S3) is a widely used cloud object storage service. Due to its ubiquity, many servers and clients support its API, so many cloud storage solutions involve S3 in some form. There are also efforts towards standardized, open interfaces that are not tied to a single vendor - such as OpenStack Swift.
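The object storage model itself can be sketched in a few lines: blobs are stored under string keys inside named buckets, with no directories or in-place edits. The class `ToyObjectStore` below is a toy written for this illustration and is not the S3 or Swift API; a real service adds networking, authentication, durability and much more.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store (illustrative only)."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {key: bytes}

    def put(self, bucket: str, key: str, data: bytes) -> str:
        """Store a whole object and return a checksum of its contents."""
        self._buckets.setdefault(bucket, {})[key] = bytes(data)
        return hashlib.sha256(data).hexdigest()

    def get(self, bucket: str, key: str) -> bytes:
        """Fetch a whole object; raises KeyError if it does not exist."""
        return self._buckets[bucket][key]

    def list_keys(self, bucket: str, prefix: str = ""):
        """List keys in a bucket; a prefix imitates a directory listing."""
        return sorted(k for k in self._buckets.get(bucket, {}) if k.startswith(prefix))


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("results", "run1/output.bin", b"\x00\x01\x02")
    store.put("results", "run1/log.txt", b"done")
    store.put("results", "run2/log.txt", b"failed")
    print(store.list_keys("results", prefix="run1/"))
    print(store.get("results", "run1/log.txt"))
```

Keys such as `run1/log.txt` look like paths, but the slash is just part of the key: the namespace is flat, which is one reason this model is simpler to distribute than a hierarchical filesystem.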