How to make the best use of SUSE Linux Enterprise Micro self-healing capabilities

SUSE Linux Enterprise Micro (SLE Micro) ensures deployments stay healthy and operational to run your workloads.

How does it work?

Unlike traditional Linux distributions, SLE Micro is a very small Linux Operating System (OS) and focuses on running your workloads, either as containers or VMs.

On traditional Linux, when applying maintenance or security updates, some parts of the OS (a library for instance) can be replaced while still being used by other parts, which can cause instability.

On SLE Micro, changes to the core are applied atomically, in what is called a transaction. A transaction can change the OS configuration or the software stack. It is performed on a snapshot of the root filesystem, becomes active only after a reboot, and is managed by the transactional-update tool. To protect SLE Micro integrity across transactions, the OS core is read-only and cannot be modified (even by root) while it is running. SLE Micro can also monitor system health across transactions.

If an RPM package installation (or removal or upgrade) fails:

On a traditional Linux distribution, the system would be left in an undefined state, which could include leftover files or running processes.
On SLE Micro, the package failure is detected by transactional-update. The snapshot is discarded and the running system stays intact.

With each OS change being stored in a snapshot, it is possible to go back in time and roll back to a previously known working state.
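
As a rough sketch of what going back in time looks like on the command line, the snippet below lists the snapshots with snapper and then asks transactional-update to roll back. The run helper, the RUN switch and the SNAPSHOT variable are illustrative assumptions, not part of SLE Micro. By default the script only prints the commands, so it can be read and tried safely on any machine.

```shell
#!/bin/bash
# Dry-run helper (hypothetical, for illustration only):
# prints each command unless RUN=1 is set, in which case it executes it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# List the snapshots of the root filesystem and their numbers.
run snapper list

# Make an earlier snapshot the default root filesystem.
# SNAPSHOT is a placeholder: a number shown by "snapper list", or the word
# "last" to ask for the last known working snapshot.
run transactional-update rollback "${SNAPSHOT:-last}"
```

On a real SLE Micro host you would run this as root with RUN=1; the selected snapshot then becomes active on the next reboot.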

When preparing a system deployment, the default setup is sometimes not enough and we need to create additional checks that are specific to our use case.

Thanks to the built-in SLE Micro health-checker, you can add more tests that run at system startup and increase system reliability.

By default, health-checker performs some basic tests on the system at boot time.

If those tests fail, health-checker will:

try to restart the failing services, if the system has already booted successfully (in the current snapshot),

roll back to the last known working state, if it is the first time the snapshot is used or if restarting the failing services didn’t work.
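
The two rules above can be modelled as a small decision function. This is only a simplified sketch of the behavior as described in this post, not health-checker's actual implementation; the function name decide_action, its arguments and the returned words are all made up for illustration.

```shell
#!/bin/bash
# decide_action CHECK_STATUS BOOTED_BEFORE [RESTART_STATUS]   (hypothetical helper)
#   CHECK_STATUS    0 if all checks passed, non-zero otherwise
#   BOOTED_BEFORE   "yes" if the current snapshot already booted successfully
#   RESTART_STATUS  0 if restarting the failing services fixed the problem
#                   (defaults to 1, meaning the restart did not help)
decide_action() {
  local check_status=$1
  local booted_before=$2
  local restart_status=${3:-1}

  if [ "$check_status" -eq 0 ]; then
    # Nothing failed: keep running the current snapshot.
    echo "healthy"
  elif [ "$booted_before" = "yes" ] && [ "$restart_status" -eq 0 ]; then
    # Known-good snapshot and the restart was enough.
    echo "restart-services"
  else
    # First boot of this snapshot, or the restart did not help.
    echo "rollback"
  fi
}

decide_action 0 yes      # all checks passed
decide_action 1 yes 0    # failure on a known-good snapshot, restart helped
decide_action 1 yes 1    # failure on a known-good snapshot, restart did not help
decide_action 1 no       # failure on the first boot of a new snapshot
```

The point of the model is that a rollback is always the outcome for a snapshot that has never booted successfully, while an already proven snapshot first gets a chance to recover by restarting services.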

Health-checker can be extended via additional plugins, written in your favorite programming or scripting language.

Customizing health-checker

As an example, let’s create a plugin to verify that the sshd service starts properly (code available at https://github.com/fcrozat/SUSECON-demos/blob/main/VM/deploy-container/sshd.sh ):

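#!/bin/bash

# Check that sshd is healthy.
# "systemctl is-failed" exits with 0 when the unit is in the failed state.
# Any exit status other than 1 is treated as a failed check.
run_checks() {
  systemctl is-failed -q sshd
  test $? -ne 1 && exit 1
}

# Stop the service handled by this plugin.
stop_services() {
  systemctl stop sshd
}

# The plugin is called with a single argument: check or stop.
case "$1" in
  check)
    run_checks
    ;;

  stop)
    stop_services
    ;;

  *)
    echo "Usage: $0 {check|stop}"
    exit 1
    ;;
esac
exit 0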

Copy this file to /usr/local/libexec/health-checker/ with executable permissions.
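
One possible way to do the copy is sketched below. The install_plugin function and its optional second argument are hypothetical conveniences added here for illustration; only the target directory comes from this post.

```shell
#!/bin/bash
# install_plugin SRC [DEST_DIR]   (hypothetical helper)
# Copies a plugin script into the health-checker plugin directory
# and makes it executable (mode 0755).
install_plugin() {
  local src=$1
  local dest_dir=${2:-/usr/local/libexec/health-checker}

  # Create the plugin directory if it does not exist yet.
  mkdir -p "$dest_dir" || return 1

  # Copy the file and set its permissions in one step.
  install -m 0755 "$src" "$dest_dir/"
}

# Example, to be run as root on the SLE Micro host:
#   install_plugin ./sshd.sh
```

The optional second argument makes it possible to try the function against a scratch directory before touching the real system.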

Reboot the system to ensure this new test is not causing any regression.

You can now test that it works as designed: uninstall openssh-server with transactional-update pkg rm openssh-server and reboot. Watch the system closely during boot and you will notice that it boots twice. The first boot uses the snapshot with the change (openssh-server removed); health-checker detects that sshd does not start properly and rolls back to the previous working state, which triggers a second reboot into the older snapshot. You can verify this using journalctl -u health-checker.service.

You are now ready to create additional tests for your deployments.

In our next blog post, we will look into how to easily deploy, update, and roll back containers on SLE Micro.
