Data Formats

Markup Languages

A markup language allows the specification of the structure and formatting of a text-based document. The markup can be used to control the display of the document or enrich its content for automated processing.

The Standard Generalized Markup Language (SGML) is a standard for defining generalized markup languages for documents, from which the likes of the Extensible Markup Language (XML) and HyperText Markup Language (HTML) derive.

XML is a commonly used serialization format that is both human and machine readable. A distinctive feature is its use of angle-bracket <tags> to delimit elements. A ‘schema’ can be used to define or constrain the elements of an XML document, giving them specific meaning and allowing for validation. Schema languages include the document type definition (DTD), XML Schema (XSD), RELAX NG and Schematron. Other common XML-related technologies include Extensible Stylesheet Language Transformations (XSLT), for transforming XML documents into other documents, and XPath, for querying documents.
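As a minimal sketch of elements, attributes and path-based querying, the example below parses a small document with Python's standard-library xml.etree.ElementTree module. The document itself (the catalog, dataset and title elements and the id and units attributes) is invented for illustration, and ElementTree implements only a limited subset of XPath, so this is not a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented for illustration.
XML_TEXT = """\
<catalog>
  <dataset id="d1" units="K">
    <title>Surface temperature</title>
  </dataset>
  <dataset id="d2" units="Pa">
    <title>Sea-level pressure</title>
  </dataset>
</catalog>
"""

# Parse the text into a tree; root is the <catalog> element.
root = ET.fromstring(XML_TEXT)

# Select every <title> that is a child of a <dataset> child of the root.
titles = [el.text for el in root.findall("./dataset/title")]

# Attribute predicates are part of the XPath subset ElementTree supports.
kelvin_ids = [el.get("id") for el in root.findall("./dataset[@units='K']")]

print(titles)
print(kelvin_ids)
```

Validation against a schema (DTD, XSD, RELAX NG or Schematron) is not provided by ElementTree and would need a separate library.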

Array Based

Array-based data formats are commonly used in scientific applications. The formats covered in this section support multi-dimensional arrays and can typically also hold related data such as time series and images.

The Hierarchical Data Format (HDF) is used to store large amounts of data in different formats. The current version (HDF5) is composed of Datasets, which are typed multi-dimensional arrays, and Groups, which are collections of datasets and other groups and act like folders in a file system. Datasets and groups can each be assigned key-value metadata as attributes, making the format capable of being self-describing. Complex data types can be built up from these elements.
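The group, dataset and attribute structure described above can be sketched with the third-party h5py and NumPy packages, which are assumed to be installed. The file name, the observations group, the temperature dataset and the attribute keys are all hypothetical names chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, dataset and attribute names, chosen for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # A group behaves like a folder: it can hold datasets and other groups.
    grp = f.create_group("observations")
    # A dataset is a typed multi-dimensional array (here 2 x 3, 64-bit floats).
    data = np.arange(6, dtype="f8").reshape(2, 3)
    dset = grp.create_dataset("temperature", data=data)
    # Attributes attach key-value metadata to datasets and groups.
    dset.attrs["units"] = "K"
    grp.attrs["station"] = "example-station"

with h5py.File(path, "r") as f:
    # Objects are addressed with file-system-like paths.
    dset = f["observations/temperature"]
    shape = dset.shape
    units = dset.attrs["units"]
    values = dset[...]  # read the whole array into memory as a NumPy array

print(shape, units)
print(values)
```

Because the units travel with the dataset as an attribute, a reader of the file does not need separate documentation to interpret the values, which is what makes the format self-describing in practice.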

Network Common Data Form (NetCDF) is a set of self-describing data formats for creating and sharing array-oriented scientific data. It is a binary format with several versions available, including a classic version, a 64-bit offset version with larger addressing capability and an HDF5-backed version. The user guide has a more detailed description of the data format. NetCDF can be extended via conventions, for example the Climate and Forecast Metadata Conventions (CF), which describe the form metadata can take. See the CF conventions site for more.