salsa.debian.org: Postmortem of failed Docker registry move
The Salsa admin team provides the following report about the failed migration of the Docker container registry.
The Docker container registry stores Docker images,
which are used, for example, by the Salsa CI toolset.
This migration would have moved all data to Google Cloud Storage (GCS)
and would have significantly reduced file system usage on Debian systems.
The Docker container registry is part of the Docker distribution toolset.
This system supports multiple backends for file storage: local, Amazon Simple Storage Service (Amazon S3) and Google Cloud Storage (GCS).
As Salsa already uses GCS for data storage, the Salsa admin team decided to move all the Docker registry data off to GCS too.
Migration and rollback
On 2019-08-06 the migration process was started.
The migration itself went fine, although it took a bit longer than anticipated.
However, as not all parts of the migration had been properly tested,
a test of the garbage collection triggered a bug in the software.
On 2019-08-10 the Salsa admins started to see problems with garbage collection.
The job running it timed out after one hour.
Within this timeframe it did not even manage to collect information about all used layers to determine what it could clean up.
A source code analysis showed that this design flaw can’t be fixed.
On 2019-08-13 the change was rolled back to storing data on the file system.
Docker registry data storage
The Docker registry stores all of its data, without indexes or reverse references, in a file system-like structure comprising four separate types of information:
manifests of images and contents, tags for the manifests, deduplicated layers (or blobs) which store the actual data, and links which show which deduplicated blobs belong to which images.
This structure does not allow for easy searching within the data.
The file system structure is built as append-only, which allows for adding blobs and manifests and for adding, modifying, or deleting tags.
However, cleanup of items other than tags is not achievable with the maintenance tools.
There is a garbage collection process which can be used to clean up unreferenced blobs.
According to the documentation, however, this process can only be used while the registry is set to read-only, and it cannot be used to clean up unused links.
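To illustrate why the lack of indexing and reverse references matters, here is a minimal sketch of a mark-and-sweep style cleanup over a toy in-memory model. The ToyRegistry class, its fields, and the digests are hypothetical and only mirror the four kinds of information described above; this is not the registry's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Toy in-memory stand-in for the storage layout (hypothetical)."""
    # manifest digest -> digests of the blobs (layers) it references
    manifests: dict[str, list[str]] = field(default_factory=dict)
    # blob digest -> size in bytes
    blobs: dict[str, int] = field(default_factory=dict)
    # tag name -> manifest digest
    tags: dict[str, str] = field(default_factory=dict)


def mark(registry: ToyRegistry) -> set[str]:
    """Mark phase: with no reverse index, every manifest has to be read."""
    referenced: set[str] = set()
    for blob_digests in registry.manifests.values():
        referenced.update(blob_digests)
    return referenced


def sweep(registry: ToyRegistry, referenced: set[str]) -> list[str]:
    """Sweep phase: delete blobs that no manifest references."""
    unreferenced = sorted(d for d in registry.blobs if d not in referenced)
    for digest in unreferenced:
        del registry.blobs[digest]
    return unreferenced


registry = ToyRegistry(
    manifests={"m1": ["b1", "b2"], "m2": ["b2", "b3"]},
    blobs={"b1": 10, "b2": 20, "b3": 30, "b4": 40},
    tags={"latest": "m2"},
)
removed = sweep(registry, mark(registry))
print(removed)  # ['b4']
```

The point of the sketch is that the mark phase touches every manifest before a single blob can be removed. In this toy model that is a dictionary walk; against a real storage backend each of those reads is a separate request.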
Docker registry garbage collection on external storage
For the garbage collection the registry tool needs to read a lot of information as there is no indexing of the data.
The tool connects to the storage medium and proceeds to download … everything: every single manifest and the information about the referenced blobs.
Processing a single manifest now takes over 1 second.
This process takes a significant amount of time, which in the current configuration of external storage would make cleanup nearly impossible.
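A rough back-of-the-envelope calculation shows why the one-hour job limit is hit so quickly. The per-manifest time and the timeout come from this report; the manifest count used in the second figure is purely a hypothetical number for illustration, not a measurement from Salsa.

```python
import math

SECONDS_PER_MANIFEST = 1.0   # lower bound from the report ("over 1 second")
JOB_TIMEOUT_SECONDS = 3600   # the garbage collection job timed out after one hour


def max_manifests_before_timeout(timeout_s: float, per_manifest_s: float) -> int:
    """Upper bound on the manifests that can be read before the job is killed."""
    return math.floor(timeout_s / per_manifest_s)


def hours_to_scan(manifest_count: int, per_manifest_s: float) -> float:
    """Time needed to read all manifests sequentially, in hours."""
    return manifest_count * per_manifest_s / 3600


print(max_manifests_before_timeout(JOB_TIMEOUT_SECONDS, SECONDS_PER_MANIFEST))  # 3600

# Hypothetical manifest count, chosen only to show how the time scales.
HYPOTHETICAL_MANIFEST_COUNT = 100_000
print(round(hours_to_scan(HYPOTHETICAL_MANIFEST_COUNT, SECONDS_PER_MANIFEST), 1))  # 27.8
```

Since each manifest takes more than one second, 3600 is an optimistic ceiling for a single run, and the registry has to stay read-only for the whole scan.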
Lessons learned
The Docker registry is a data storage tool that can only properly be used in append-only mode.
If you never clean up, it works well.
As soon as you want to actually remove data, it goes bad.
For Salsa, cleanup of old data is a necessity, as the registry currently grows by about 20 GB per day.
Next steps
Sadly there is not much that can be done using the existing Docker container registry.
Maybe GitLab or someone else would like to contribute a new implementation of a Docker registry,
either integrated into GitLab itself or stand-alone?