Web Technologies#

This chapter gives a basic introduction to Web technologies, including working with clients and servers to consume and produce web APIs.

The World Wide Web (Web) allows information transfer between networked computer systems according to a collection of standardized interfaces and protocols. The Hypertext Transfer Protocol (HTTP) is a foundational protocol on the web that allows information exchange. Uniform Resource Locators (URLs) allow the specification of the address or location of a resource on a computer network.

When an entity, either a human user or a machine, wants to retrieve some information over the web it can make an ‘HTTP request’, which includes a URL for the resource being queried and metadata on how the query should be processed. A piece of software, sometimes known as an HTTP Client or Web Client, will make the request for the entity. This client can be a web browser, with a human user visiting a website and reading its content, a program running in the terminal, such as curl, requesting some JSON from an API, or an element of a script downloading a machine learning model or dataset.

Another piece of software, an HTTP Server or Web Server, listens for incoming requests, carries out any requested actions and issues a response to the client.

Within HTTP there is the concept of HTTP Methods, which specify how a client is requesting to interact with a resource. A client wishing to query or ‘read’ a resource will typically issue a GET request. Other request methods such as POST or PUT will create or modify resources via the server. When you supply information to a website, such as via a web form, it will typically make a POST request to the server on your behalf.
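As a small sketch, Python's standard library distinguishes these methods when building a request; the URL below is a placeholder, and neither request is actually sent:

```python
import urllib.request

# A GET request: querying or 'reading' a resource (URL is a placeholder).
get_req = urllib.request.Request("https://example.com/api/items")
print(get_req.get_method())  # GET: the default when there is no body

# A POST request: supplying data creates or modifies a resource,
# and the presence of a body makes POST the default method.
post_req = urllib.request.Request(
    "https://example.com/api/items",
    data=b'{"name": "new item"}',
    headers={"Content-Type": "application/json"},
)
print(post_req.get_method())  # POST
```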

HTTP requests are made up of a version specifier, a URL for the requested resource, an HTTP method, HTTP headers and an optional HTTP body. HTTP request headers are metadata in a key-value format that give information about the client making the request and expectations on the received response.
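The parts of a request can be seen by assembling its on-the-wire text form by hand; this is only an illustrative sketch, with placeholder host and body values, of what client tools construct for you:

```python
# Assemble the text form of a simple HTTP request to show its parts:
# request line (method, path, version), headers, a blank line, then the body.
method, path, version = "POST", "/api/items", "HTTP/1.1"
headers = {
    "Host": "example.com",
    "Content-Type": "application/json",
    "Accept": "application/json",
}
body = '{"name": "new item"}'

request_line = f"{method} {path} {version}"
header_lines = "\r\n".join(f"{k}: {v}" for k, v in headers.items())
# Header section and body are separated by a blank line.
raw_request = f"{request_line}\r\n{header_lines}\r\n\r\n{body}"
print(raw_request)
```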

You don’t tend to interact with HTTP headers directly in the browser, but they are often needed when working with web APIs, particularly for authorization. To access protected resources on web pages, or resources specific to you, you will need to ‘log in’ in some way. After you log in your web browser will store or cache your authorization credentials so you won’t need to log in for every new request. When working in non-browser situations, like the terminal or in scripts, it is often necessary to pass authorization information with each request. This typically takes the form of a key-value entry in the HTTP request header.
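For example, a bearer-token style Authorization header can be attached to a request as below; the token value and URL are placeholders, and the request is constructed but not sent:

```python
import urllib.request

# Attach an Authorization header to a request. The token value here is a
# placeholder for a real API key or bearer token.
token = "my-secret-token"
req = urllib.request.Request(
    "https://example.com/api/private",
    headers={"Authorization": f"Bearer {token}"},
)
print(req.get_header("Authorization"))  # Bearer my-secret-token
```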

When interacting with the web outside of a browser you will often use an Application Programming Interface (API). This is a documented and structured interface that you can use to interact with the resources presented by a webserver, distinct from a ‘freeform’ webpage which would be difficult to work with in scripts or automations. The design of many Web APIs derives from the Representational State Transfer (REST) architectural style, for which reason you will often see references to REST or RESTful APIs. The Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are two common ways to represent resources in plain text in Web APIs. Their use allows the representation of the object to be sent over a network. JSON is currently popular in web technologies since it is easily human readable, being less verbose than XML, and is a native format for the widely used JavaScript language.
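A quick sketch of the JSON round trip, using a made-up resource: a structured object is serialised to plain text for transmission, and the receiving side parses that text back into the same structure:

```python
import json

# A resource represented as JSON: a Python dict serialised to text that
# can be sent over a network, then parsed back on the other side.
resource = {"id": 42, "name": "example dataset", "tags": ["ml", "public"]}
as_text = json.dumps(resource)
print(as_text)

round_tripped = json.loads(as_text)
print(round_tripped == resource)  # True: parsing recovers the structure
```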

Web Frontends#

Web development is often described in terms of ‘frontend’ and ‘backend’. The ‘frontend’ is the ‘web site’ interacted with by human users in a web browser. Interactivity and visual design play important roles in web frontend development.

The three primary technologies used in web frontend development are the Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript. HTML delivers the site content, both its copy and structure; CSS handles styling; and JavaScript provides interactivity.

As web sites have grown in sophistication ‘web frameworks’ have been developed to ease development. Frameworks such as React and Angular allow the rapid development of sophisticated web sites and applications via pre-made building blocks. Styling frameworks such as Bootstrap and Google’s Material Design ensure users are presented with consistent and familiar interfaces as they interact with web sites.

At a higher layer still in terms of web building blocks come Content Management Systems (CMS) and blogging platforms. These systems, such as Wordpress and Drupal, allow one to build websites, style them and publish content without significant software development. This allows sites to be built and maintained primarily through the web browser. Related, but typically simpler, software are Static Site Generators, which take content in an easy-to-write format such as Markdown and, via some templates, generate a website from the content files. Some common static site generators include Jekyll, Hugo, Pelican and Jupyter Book, with integrations to cloud development platforms such as GitLab being common for automatic site publication and hosting.
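The core idea of a static site generator can be sketched in a few lines: content is rendered into an HTML page through a template. This toy example uses Python's string templating; real generators add Markdown parsing, themes, navigation and much more:

```python
from string import Template

# A toy sketch of what a static site generator does: substitute content
# into an HTML page template. $title and $body are template placeholders.
page_template = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)

content = {"title": "My First Post", "body": "Hello from a generated site."}
html = page_template.substitute(content)
print(html)
```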

Web Backends#

The web backend typically refers to the remainder of the technology stack, which has many different elements. At a minimum a web site needs a running server to respond to requests. Commonly used server software includes Apache and Nginx. For simple sites, with a collection of static files for example, these servers are sufficient to respond to requests. However, more sophisticated web sites with dynamic content and/or APIs typically have their ‘business logic’ (which decides how a request is responded to) handled by some separate software which is called by the server when a request is made.
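A minimal backend can be sketched with Python's built-in HTTP server: the server framework parses incoming requests, while a handler class supplies the ‘business logic’ deciding how each request is answered. This is a demonstration only, not production server software like Apache or Nginx:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    # The 'business logic': decide how to respond to a GET request.
    def do_GET(self):
        body = b"hello from the backend"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 asks the OS for any free port; run the server in a background thread.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as a client against our own server.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    print(resp.status, resp.read().decode())  # 200 hello from the backend
server.shutdown()
```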

Two commonly used web application frameworks are Express, via the JavaScript Node.js runtime, and Django, built on Python. These frameworks have many features and pre-made building blocks for creating web sites, with rich plugin ecosystems. However, more lightweight frameworks are also common, such as Flask.

Many websites need to maintain some sort of ‘state’, such as a list of logged-in users or resources belonging to a particular user. This is usually maintained in a database. Relational databases (RDBMS) supporting the Structured Query Language (SQL) are the most common, with PostgreSQL and MySQL being popular. Databases usually run as their own process, with applications connecting to them via a socket for performance and security reasons. Most popular web application frameworks have plugins for easily working with common database types.
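The shape of this state management can be sketched with Python's built-in sqlite3 module; the `users` table here is an invented example, and a production site would usually talk to a separate PostgreSQL or MySQL server rather than an in-process database:

```python
import sqlite3

# Sketch of web application state in a relational database: create a
# table, record a user, and query it back with SQL.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

rows = conn.execute("SELECT id, name FROM users").fetchall()
print(rows)  # [(1, 'alice')]
conn.close()
```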

In addition to lightweight information in databases it is often necessary for websites to store larger user data, such as uploaded images, videos or other files. For similar reasons to database separation this content is usually stored in a different network location to the web application. Some websites store user content on the same filesystem as the running web application, and this is often seen in prototypes and getting-started guides, but it can cause several issues. The Amazon Simple Storage Service (S3) is a popular service and de-facto API for storing user data. Its simple object-based semantics make storing large amounts of content relatively simple, while its APIs allow for secure control over content access. There are other storage technologies that can use, but don’t necessarily rely on, S3 APIs, such as Ceph. As with databases, most web frameworks have plugins for storage integration.

For reasons of efficiency and performance, frequently accessed data is often cached independently of the application webserver. This is usually done via a large provider who can offer both geographical locality and redundancy. These providers are known as Content Delivery Networks (CDNs).

For reliability and redundancy of the remainder of the web application there are several approaches depending on anticipated scale. Usually there is an Apache or Nginx server configured as a reverse proxy to a collection of smaller application servers. These application servers can run identical copies of the web application, but as different processes or on different hosts. This allows some load balancing and redundancy, including allowing the application version to be updated on selected servers without interrupting service overall. More sophisticated high-availability setups use more specialized tools such as HAProxy, and orchestrators and container technologies such as Kubernetes and Docker.

Security#

HTTP itself does not cover encryption of communications. Communication over HTTP is not encrypted and can be read or modified by anyone in the intermediate network between the client and server. This includes passwords and messages, which are transmitted in plain text. Hypertext Transfer Protocol Secure (HTTPS) is an extension to HTTP that allows encrypted communication of HTTP messages.

Cryptographic certificates (certs) are used to allow encryption in HTTPS. Certs are issued by a small collection of trusted ‘authorities’. Web site publishers who wish to offer HTTPS need to obtain a cert from either one of these authorities or from a body who can issue certs backed by one of the authority certs. A popular service known as Let's Encrypt has made obtaining and maintaining certs very straightforward for website publishers.

When a user visits a site in their browser the cert presented by the website is checked by the browser against an in-built browser cache and any cert definitions installed on the user’s system. If the cert is valid the browser indicates to the user that they can securely communicate with the server, e.g. via a lock icon in the UI. Most browsers now issue a clear warning to users if HTTPS communication is not possible.
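Non-browser HTTPS clients perform the same checks. Python's ssl module, for example, defaults to requiring a valid certificate that matches the server's hostname:

```python
import ssl

# The default client-side TLS context used for HTTPS connections: it
# requires a verifiable certificate and checks it against the hostname,
# mirroring the checks a browser performs.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True: cert must verify
print(context.check_hostname)                    # True: hostname must match
```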

Web Hosting#

Web servers need to run on a machine that can accept and respond to web requests on certain ports, usually 80 and 443. This machine needs to have a static IP address, which is registered with a Domain Name System (DNS) lookup provider, such that when a user attempts to access a domain via a URL the IP address of the server that will handle it can be resolved.
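This name-to-address resolution step can be performed directly; the sketch below resolves `localhost`, which maps to the local machine without needing a network DNS lookup:

```python
import socket

# Resolve a hostname to an IPv4 address, as a client does before
# connecting. 'localhost' resolves locally, to the loopback address.
ip = socket.gethostbyname("localhost")
print(ip)  # typically 127.0.0.1
```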

Domain names are typically purchased through a limited number of providers. Given a domain, mydomain.com, it is possible to have many subdomains, such as myservice.mydomain.com, which can have different DNS routing. DNS providers usually give a Web UI that allows you to link your domain name to one or more static IPs for handling requests to that name.

You can host your web site on your own machine provided that it can be addressed with a static IP address, however to avoid the high maintenance overhead of doing so it is typical to rent a managed machine from a web hosting provider. This may be a physical machine or a virtual one. Many hosting providers offer additional services beyond machine availability, including DNS handling, databases, media storage and CDNs, user management and high availability. Feature-rich hosting platforms often brand themselves as Cloud providers.

File Transfer#

File transfer is a less commonly used feature of the web, however it is often used in ICHEC and HPC applications. The File Transfer Protocol (FTP) is a common method for transferring files over the web, along with its ‘secure’ extension SFTP. FTP capabilities are built into most command line tools of interest, such as curl, however support is being gradually removed from web browsers.
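For scripted transfers, Python's standard library includes an FTP client. The sketch below defines (but does not run) a download helper; the host, remote path and local filename are all placeholders to be supplied by the caller:

```python
import ftplib

# Sketch of an FTP download using the standard library. Nothing connects
# until the function is actually called with a real host.
def fetch_file(host: str, remote_path: str, local_path: str) -> None:
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login; pass user/passwd for real accounts
        with open(local_path, "wb") as f:
            # RETR streams the remote file; each chunk is written locally.
            ftp.retrbinary(f"RETR {remote_path}", f.write)
```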

There are many graphical File Transfer clients available, such as Filezilla, however getting familiarity with tools like curl is recommended.

In addition to FTP, S3 has become a common way to provide and accept large files. S3 is generally less supported than FTP in command line clients as it has a more involved method of generating authentication headers than simply setting a key-value pair, however it is now supported by recent versions of curl. Given the complexity of S3 APIs, using a scripted application and a dedicated library, such as Python with boto3, is recommended.
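An S3 download with boto3 can look like the sketch below. boto3 is a third-party library (`pip install boto3`), the bucket and key names are placeholders, and credentials are assumed to come from the environment or AWS configuration:

```python
# Sketch of an S3 object download with the boto3 library. The import is
# inside the function so this file loads even without boto3 installed.
def fetch_from_s3(bucket: str, key: str, local_path: str) -> None:
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
```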

Web APIs#

A simple and minimalist way to consume many web APIs is via curl and the jq application, the latter of which can manipulate JSON. This can be useful for constructing simple bash scripts for one-off applications or prototyping.

Otherwise you may want to consider a scripting language like Python and libraries such as requests, or the lower-level built-in urllib and json modules.
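A stdlib-only sketch of consuming a JSON API is shown below. The URL is a placeholder and the helper is defined but not called; a canned response string illustrates the decoding step without any network access:

```python
import json
import urllib.request

# Fetch and decode JSON from a web API using only the standard library.
def get_json(url: str) -> dict:
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# The decoding step, shown on a canned response (no network needed);
# extracting a field this way mirrors a jq filter like '.items[0].name'.
canned = '{"items": [{"name": "dataset-a"}]}'
print(json.loads(canned)["items"][0]["name"])  # dataset-a
```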

If you are interacting with an API long-term or building a dependent application or service you should consider attempting automatic client generation via OpenAPI/Swagger bindings for your chosen language or framework. This reduces boilerplate and allows for error handling and validation.

There are graphical clients available for interacting with and exploring Web APIs. Until recently Postman has been popular, however there have been some concerns with its licensing. Alternatives such as Insomnia or Bruno may be worth considering. Overall it is likely better to familiarise yourself with command-line and script-based tooling ahead of developing a reliance on UI tooling.

Further Reading#