Friday, January 10, 2025

The Case for CoreOS - Network Infrastructure on an Immutable OS

The Lifetime of Silent Services

For small and medium sized organizations, a local network requires the creation and management of local network services such as DNS, NTP, DHCP, monitoring and user access controls. These are the ante needed to get in the game, but when they work properly they become invisible. This is good, but it means they can be neglected from the standpoint of management and maintenance. As long as they work it's easy to ignore them until they do break. There is a tendency to treat maintenance as a risk rather than a benefit, the fear of service interruption and downtime leading to neglect and a sense that these services are somehow fragile and precious.

For these silent services, the neglect usually surfaces when the admins discover that the OS has gone end-of-life, or a bug is discovered in the current version of a service, or there are 200 CVEs to patch because the last reboot was 700 days ago. The backlog of accumulated updates, combined with unfamiliarity with the services and their maintenance history, makes admins gun-shy of updates. Time only makes the fear and the debt worse.

What are you afraid of, Really?

The modern alternative is the cliche "Fail Fast", which, when thrown about without comprehension, is correctly scorned. I prefer to say "Find the scariest thing you have to do, and do it repeatedly until it stops being scary. Then find the next scariest thing."

The real fear and risk is of downtime without a recovery plan. In a corporate environment the tendency of management is to CYA by avoiding any downtime by avoiding any change. While this can provide the illusion of stability, it treats the infrastructure as a static monolith. It ignores the fact that failures and updates are inevitable, and it sets the operations teams up for failure. It restricts their ability to practice the very update and mitigation processes that would allow them to create a robust, reliable service.

The real solution is to create a system where any change can be rolled back quickly, reliably and completely.  Fedora CoreOS provides that.

Git for Filesystems?

Fedora CoreOS is a distribution of Fedora Linux that is created specifically to run software containers. Red Hat promotes it for cloud use and only offers commercial support for its enterprise counterpart, Red Hat Enterprise Linux CoreOS, as a base for OpenShift. It is a minimal distribution with no GUI, only a simple installer that writes the initial state to a bootable storage device and a simple configuration file that is applied on first boot. This by itself is unremarkable. The feature that makes CoreOS significant is that the file and package systems are based on rpm-ostree. This is an integrated file and package management system. It presents to users as an XFS filesystem, but the operating system tree under /usr is mounted read-only. In that sense the OS is immutable; local configuration in /etc and data in /var remain writable. To install packages you must use the rpm-ostree command to layer the package into a new image version and then reboot to the new image. Installing application packages is discouraged in favor of running services in containers.

Did you get that? The filesystem is read only. To see updated packages you have to reboot. Wait, there's more.

The Turtle or the Frog?

Most distributions provide updates through online package repositories. Admins must periodically poll the repository, pull down any new packages, and then overlay them into the running system. At that point it becomes extremely difficult to reliably roll back. If anything fails, the only recourse is to recover the system from backups, which is understandably an extreme and time-consuming process.  This leads to a "slow and steady" approach to updates. Updates are applied to a few test systems. If no problems are discovered they are rolled forward to a set of staging systems.  Finally the updates are deployed to production.

This is an expensive, time-consuming system, suited only to large organizations with the resources to implement it. It's also error prone, as it is often difficult to adequately simulate the production operating conditions in a small test environment. More commonly in smaller organizations, updates are shunted to backlog work and neglected in favor of feature requests or helpdesk issues until some outside event brings the problem to the attention of management, when it becomes an emergency.

To compound the problems, it is common to run package updates without rebooting the system. This can result in failures that don't appear until long after the actual change is applied. Altogether this makes IT management very averse to regular updates and reboots because they see these as introducing problems and risking downtime with long recovery periods.

Until recently (well, ages in Internet Time) this "frog in the pot" approach was really the only option. The fact that it was impossible to reliably roll back changes rightly made management and operations averse to any change to a system that was "working".

Double-Buffered Operating System

CoreOS updates are atomic. That is, updates are published as a unit. The stable stream is updated approximately every two weeks. There are also "testing" and "next" streams that update more often but aren't meant for regular use. CoreOS runs a service called Zincati. This service polls the release streams for new images and will apply them and reboot when needed. Zincati can be tuned to create staged roll-outs, applying updates first to a set of canary systems before moving on to more critical systems. It can also be tuned to restrict reboots to specific days of the week and times of day.
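As a rough illustration of that tuning, the sketch below renders a Zincati drop-in file from Python. The `[identity]`, `[updates]` and `[[updates.periodic.window]]` keys follow my reading of the Zincati documentation and should be checked against your release. The function name, the two example hosts and the window values are assumptions made up for this example, not recommendations.

```python
"""Render a Zincati drop-in that limits reboots to a maintenance window.

Sketch only: check the key names against the Zincati docs for your release.
"""


def render_zincati_config(days, start_time, length_minutes, wariness):
    """Return TOML text for a periodic update window.

    days           -- weekday names such as ["Sat", "Sun"]
    start_time     -- window start as "HH:MM" (UTC by default)
    length_minutes -- how long the window stays open
    wariness       -- 0.0 (eager, canary) to 1.0 (cautious, critical)
    """
    if not 0.0 <= wariness <= 1.0:
        raise ValueError("wariness must be between 0.0 and 1.0")
    day_list = ", ".join(f'"{d}"' for d in days)
    return (
        "[identity]\n"
        f"rollout_wariness = {wariness}\n"
        "\n"
        "[updates]\n"
        'strategy = "periodic"\n'
        "\n"
        "[[updates.periodic.window]]\n"
        f"days = [ {day_list} ]\n"
        f'start_time = "{start_time}"\n'
        f"length_minutes = {length_minutes}\n"
    )


# Hypothetical hosts: a canary takes updates early, a critical server late.
canary = render_zincati_config(["Tue"], "03:00", 60, 0.0)
critical = render_zincati_config(["Sat", "Sun"], "03:00", 120, 1.0)

if __name__ == "__main__":
    print(critical)
```

On a real host the rendered text would be saved as a `.toml` file under /etc/zincati/config.d/ and picked up when the Zincati service restarts.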

By conventional standards, read-only systems that update automatically and require a reboot every two weeks provide the opposite of stability and reliability. But the risks this would pose on a conventional Linux distribution are mitigated by the combination of rpm-ostree, Zincati and software containers. The benefits of atomic rollback and application decoupling mean that it is possible to keep systems up to date and to respond quickly to any update-induced problems. In essence the operating system is double-buffered: the system that was running before an update is preserved intact alongside the new one. You don't have to worry about losing the working configuration because it's still there.
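The double-buffering idea can be sketched as a toy model in Python. This is not how rpm-ostree is implemented; it only shows the bookkeeping: one booted deployment, one spare, and a rollback that swaps them. The `Host` class and the version strings are made up for illustration.

```python
"""Toy model of a double-buffered OS: one booted deployment, one spare."""

from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    booted: str                     # version currently running
    previous: Optional[str] = None  # version kept on disk for rollback

    def upgrade(self, new_version: str) -> None:
        """Boot into a new image while keeping the old one intact."""
        self.previous, self.booted = self.booted, new_version

    def rollback(self) -> None:
        """Swap back to the preserved deployment."""
        if self.previous is None:
            raise RuntimeError("no previous deployment to roll back to")
        self.booted, self.previous = self.previous, self.booted


host = Host(booted="41.20241215.3.0")  # made-up version strings
host.upgrade("41.20250105.3.0")
host.rollback()                        # the update misbehaved
print(host.booted)    # 41.20241215.3.0
print(host.previous)  # 41.20250105.3.0
```

On a real CoreOS host the corresponding steps are `rpm-ostree status` to list the deployments and `rpm-ostree rollback` followed by a reboot.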

For The Best Services, Don't Install Any

On CoreOS you're discouraged from installing application or service software on the system. CoreOS is designed to run software containers. The only major service component integrated into the OS is podman. All of the network services run in containers, which are managed on Linux as systemd services.

In 2021, a project called Quadlet was created to allow containers to be managed as first-class services under systemd. In 2022 it was merged into the Podman project, and as of 2024 Quadlets are available on any systemd-based Linux distribution that ships Podman 4.4 or later. This means that your system services are no longer tightly coupled to the OS updates. They don't even need to be based on the same OS distribution.

Using Quadlets, deploying a network service is a matter of defining a systemd container spec, providing the service configuration files, and enabling and starting the service. No service software ever needs to be installed or updated on the host itself. Updating the service software is a matter of updating the container image path and tag and restarting the systemd service. Reverting is just as simple. It becomes possible to basically ignore the OS when updating system services and vice-versa. The loose coupling means that changes to one are very unlikely to affect the other and that any change can be trivially and reliably reverted without affecting the other components.
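To make that concrete, here is a minimal sketch in Python that renders a Quadlet `.container` unit and shows that an update touches only the image line. The `[Container]` keys used are standard Quadlet keys as I understand them; the function name, the registry path, the tags, the config directory and the port are all hypothetical values chosen for the example.

```python
"""Render a Quadlet .container unit for a containerized network service."""


def render_quadlet(name, image, tag, config_dir, port):
    """Return the text of a Quadlet unit; every argument is illustrative."""
    return (
        "[Unit]\n"
        f"Description={name} in a container\n"
        "\n"
        "[Container]\n"
        f"ContainerName={name}\n"
        f"Image={image}:{tag}\n"
        # Service configuration lives on the host and is mounted read-only.
        f"Volume={config_dir}:/etc/{name}:ro,Z\n"
        f"PublishPort={port}\n"
        "\n"
        "[Service]\n"
        "Restart=always\n"
        "\n"
        "[Install]\n"
        "WantedBy=multi-user.target\n"
    )


# Hypothetical image path; substitute whatever registry and image you use.
IMAGE = "registry.example.com/infra/dnsmasq"

current = render_quadlet("dnsmasq", IMAGE, "2.90", "/etc/dnsmasq-config", "53:53/udp")
updated = render_quadlet("dnsmasq", IMAGE, "2.91", "/etc/dnsmasq-config", "53:53/udp")

# An update (or a revert) is a one-line change to the Image= key.
changed = [
    (old, new)
    for old, new in zip(current.splitlines(), updated.splitlines())
    if old != new
]

if __name__ == "__main__":
    print(current)
    print(changed)
```

The rendered text would be saved as /etc/containers/systemd/dnsmasq.container. After `systemctl daemon-reload`, Quadlet generates a dnsmasq.service unit that can be started and restarted like any other systemd service.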

Do it again! Do it again!

The simplicity and minimalism of using CoreOS with software containers enables one last element for providing stable, reliable network services. CoreOS can be installed with a simple DHCP/PXE boot and, once installed, it can be configured with a small set of Ansible scripts. These aren't remarkable by themselves, but the simplicity and compartmentalization that the immutable OS provides are somewhat novel in the on-premise hardware environment. These are usually thought of as features of cloud-based services, but are perfectly applicable for small and medium organizations with limited resources.

As a matter of practice I tend not to say I can do something until I can do it 100 times with the push of a single button. With some simple automation the infrastructure can be restored in a matter of minutes on the old hardware or new.  These services tend to be small and light-weight, so they can run on inexpensive redundant hardware.

So You Say, But How?

Well, I plan to show you.  This first post is a long pontification on some thoughts I've had over the last couple of years. I've put it into practice for my home network and at one employer.  It falls under a larger theme of adapting cloud networking practices for on-premise network services.  After all, Red Hat now only supports their CoreOS stream as the base for OpenShift, Red Hat's extended Kubernetes offering. Red Hat recommends the very practices I'm going to detail to maintain the underpinnings of their enterprise distributed application service. I suspect that part of the reason they don't support it for general use is that serious adoption would undercut their revenue stream from RHEL, and I can tell you from personal experience that matters to them a lot.

This isn't a perfect strategy for all purposes either. Unless your application is extremely simple and has already been designed and implemented for containers, it doesn't make sense to shoehorn it in. Large distributed applications are better supported on a proper Kubernetes or OpenShift deployment, whether on-premise or on a cloud service. Heavy-weight monolithic services (I'm looking at you, JBoss/Tomcat apps) aren't well suited to containers, despite the trend to push them in.

In following posts I mean to walk through the deployment of Fedora CoreOS, preparation for automated configuration management and the deployment of service containers. I'm not actually sure where this will end but I mean to see just how far I can push it.  Come along if it seems like your kind of fun.

Resources

  • Fedora Linux - An extremely popular and well managed Linux distribution
  • Fedora CoreOS - A spin of Fedora that is designed to run software containers
  • libostree - A checkpointed filesystem that allows atomic rollback of file changes
  • rpm-ostree - An extension of libostree that integrates RPM package management
  • butane - YAML schema to define OS configurations for CoreOS
  • ignition - JSON schema to define OS configurations for CoreOS
  • zincati - A service to control and tune updates from CoreOS image streams
  • Quadlets - Software containers as systemd services
  • Ansible - System configuration language and toolset
  • OpenShift - Red Hat's enterprise extended version of Kubernetes
  • Kubernetes - A computing cluster system for running applications in software containers