---
title: "Upgrading Fedora's Monitoring"
subtitle: "A real Tech Debt story"
author: "Greg 'gwmngilfen' Sutcliffe"
format:
  revealjs:
    slide-number: true
    chalkboard:
      buttons: false
    preview-links: auto
    logo: images/fedora.png
    css: styles.css
    footer: '[https://fedoraproject.org](https://fedoraproject.org)'
---
## `cat me.yaml` {.smaller}
:::: {.columns}
::: {.column width="60%"}
- name: Greg 'Gwmngilfen' Sutcliffe
- matrix: @gwmngilfen:fedora.im
- history:
  - Senior Sysadmin, Fedora & CentOS
  - Community Architect / Data Scientist, Ansible
  - Community Architect, Foreman
  - 13+ years at Red Hat
- notes:
  - likes solving problems
  - dislikes taking averages
  - plays too many automation games^1^
:::
::: {.column width="40%"}
{width="100%"}
:::
::::
::: footer
1: You already missed our [automation games talk](https://cfp.cfgmgmtcamp.org/ghent2026/talk/CBHVKV/) - check the recording :P
:::
## Abstract
::::: {.columns}
:::: {.column width="60%"}
- Tech debt story?
::: {.incremental}
- Nagios, circa 10,000 BC
- Zabbix & why this is progress
- Monitoring via cfg mgmt
- Dedicated roles vs application snippets
- Code density (hi Jinja!)
- What are we monitoring again?
:::
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by Spencer Davis on Unsplash
:::
---
::: {.absolute top=300 left=70 width="900" height="250"}
**All your technical debt is actually political debt**
:::
---
::: {.absolute top=250 left=50 width="900" height="250"}
> "Politics is the set of activities that are associated with making decisions in groups"
:::
## Fedora monitoring, 2024-era
We're using Nagios, deployed via a monolithic Ansible role

## Concerns & Constraints
::::: {.columns}
:::: {.column width="60%"}
::: {.incremental}
- Can only handle OK/WARN/CRIT
- Checks aren't on a given schedule
- Separate collectd instance
- Monolithic Ansible role
- Fedora prefers FOSS ...
- ... and on-premise
- Highly heterogeneous install base
- Many on-disk / internal checks
:::
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by Spencer Davis on Unsplash
:::
## Service definitions
We're using Nagios, deployed via a monolithic Ansible role
``` {.yaml code-line-numbers="|2-3|5-14"}
- name: Copy /etc/nagios/services (RDU3 specific files)
  ansible.builtin.copy: src=nagios/services/rdu3_internal/{{ item }}
                        dest=/etc/nagios/services/{{ item }}
  with_items:
    - certgetter.cfg
    - db_backups.cfg
    - disk.cfg
    - fedora_messaging.cfg
    - file_age.cfg
    - koji.cfg
    - locking.cfg
    - mailman.cfg
    - nrpe.cfg
    - pgsql.cfg
```
::: footer
[roles/nagios_server/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::
## Host definitions
``` {.yaml}
{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3'
   and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
  use defaulttemplate
{% else %}
  use mincheck
{% endif %}
  host_name {{ host }}
  ...
}
```
::: footer
[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)
:::
## Ansible code then
``` {.yaml code-line-numbers="1|2-3|5"}
{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3'
   and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
  use defaulttemplate
{% else %}
  use mincheck
{% endif %}
  host_name {{ host }}
  ...
}
```
::: footer
[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)
:::
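
## Host variables behind that template

A minimal sketch of the inventory variables the template above reads, assuming they are set per host in `host_vars`. The variable names come from the template; the file path and the values are hypothetical.

``` {.yaml}
# inventory/host_vars/example01.rdu3.example.org (hypothetical path)
datacenter: rdu3              # matched by the rdu3 hosts template
nagios_Can_Connect: true      # this template skips the host unless true
nagios_Check_Services:
  nrpe: true                  # true -> "use defaulttemplate", else "use mincheck"
```

The monitoring server's config therefore depends on variables defined elsewhere in the inventory, not in the role that owns the application.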
---
::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #1 - Maintenance Debt
:::
## Progress, circa 1800?
::::: {.columns}
:::: {.column width="60%"}
::: {.incremental}
- Zabbix isn't exactly "new"
- Actively maintained
- Integrated trend data
- Moves trigger logic to the server
- Active agent
- Ansible collection available
- Also, we already had prior work
- and CentOS uses it
:::
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by Vitalijs Barilo on Unsplash
:::
## Prior art
{width="100%"}
::: footer
[Pagure / fedora-infra / #11393](https://pagure.io/fedora-infrastructure/issue/11393)
:::
---
::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #2 - Architectural Debt
:::
## Declarative monitoring
::::: {.columns}
:::: {.column width="70%"}
::: {.incremental}
- All Zabbix config should be in Ansible
- Items / triggers in templates
- Hosts declare self & templates
- Notifications, users, SAML, etc too
- Monitoring tasks live in the app role
- App developers know best
- Even for `base` or `httpd` it works
- Not *everything* can be done this way
:::
::::
:::: {.column width="30%"}
{width="100%"}
::::
:::::
::: footer
Photo by Shane Rounce on Unsplash
:::
## Ansible code now {auto-animate="true"}
``` {.yaml}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
```
::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::
## Ansible code now {auto-animate="true"}
``` {.yaml code-line-numbers="5-9|7|8-9"}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
- name: Add self to Anubis in Zabbix
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    link_templates: Anubis Monitoring
    force: false
```
::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::
## Ansible code now {auto-animate="true"}
``` {.yaml code-line-numbers="5-9,13"}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
- name: Ensure Anubis hostgroup is present
  community.zabbix.zabbix_group:
    host_groups:
      - Anubis servers
    state: present
- name: Add self to Anubis in Zabbix
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups: Anubis servers
    link_templates: Anubis Monitoring
    force: false
```
::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::
## Workflow (in principle)
::::: {.columns}
:::: {.column width="70%"}
::: {.incremental}
- App developers work on STG Zabbix
- Design/test template in the UI
- Export it to YAML
- Ansible PR created
- Adds YAML template to PRD Zabbix
- Reviewed by Infra
- Usual code flow, merge
- Monitoring is part of "Definition of Done"
:::
::::
:::: {.column width="30%"}
{width="100%"}
::::
:::::
::: footer
Photo by Danylo Sorokin on Unsplash
:::
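
## Export step (sketch)

The "Export it to YAML" step could also be scripted instead of done in the UI. A minimal sketch, assuming the `community.zabbix.zabbix_template_info` module with `format: yaml` and a play that targets the STG Zabbix server; the destination path is hypothetical.

``` {.yaml}
- name: Fetch the Anubis template from STG Zabbix as YAML
  community.zabbix.zabbix_template_info:
    template_name: Anubis Monitoring
    format: yaml
  register: anubis_template
- name: Write it into the role so it can go in a PR
  ansible.builtin.copy:
    content: "{{ anubis_template.template_yaml }}"
    dest: roles/anubis/files/zabbix/template-anubis.yml   # hypothetical path
  delegate_to: localhost
```

The resulting file is what the import task in the app role would then load into PRD Zabbix.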
---
::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #3 - Delegation Debt
:::
## Upgrade check types & purposes
::::: {.columns}
:::: {.column width="70%"}
::: {.incremental}
- Nagios *technically* has one check
- This offloads logic to the NRPE agent
- Leads to very narrow thinking
- "Is this higher than X? -> Alert"
- Many things are statistical/predictive
- "Is this thing going to fail soon?"
- "Is this pattern unusual for this host?"
- Grouping is not possible either
:::
::::
:::: {.column width="30%"}
{width="100%"}
::::
:::::
::: footer
Photo by Johan Extra on Unsplash
:::
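
## A predictive trigger (sketch)

"Is this thing going to fail soon?" can be expressed server-side in a Zabbix template. A minimal, abbreviated sketch, assuming Zabbix 6.0+ expression syntax; the template name, item, and thresholds are hypothetical, and a real export would also carry `uuid` fields.

``` {.yaml}
zabbix_export:
  version: '6.0'
  templates:
    - template: Example Disk Forecast        # hypothetical template
      name: Example Disk Forecast
      groups:
        - name: Templates
      items:
        - name: Free space on / (percent)
          key: 'vfs.fs.size[/,pfree]'
          value_type: FLOAT
          units: '%'
          triggers:
            # Fit the last hour of data; fire if free space would reach 0% within a day
            - expression: 'timeleft(/Example Disk Forecast/vfs.fs.size[/,pfree],1h,0)<1d'
              name: '/ is predicted to fill up within a day'
              priority: WARNING
```

The agent only reports the value; the prediction lives in the server-side trigger, unlike an NRPE threshold check.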
---
::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #4 - Time Debt (or pressure)
:::
## But Greg, you mentioned tools to help...
. . .
::: {.absolute top=200 left=50 width="900" height="250"}
> The people side of tech debt work - practicalities for helping the people around you to get on board when "the current thing still works, doesn't it?"
:::
---
::: {.absolute top=300 left=50 width="900" height="250"}
**All your technical debt is actually political debt**
:::
## Technical debt is too broad
::::: {.columns}
:::: {.column width="60%"}
There are multiple types of debt
::: {.incremental}
- Maintenance debt
- Architectural debt
- Delegation debt
- Time debt
:::
Fixing these will take resources
and will be hard to show value for
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by Tommy Tsao on Unsplash
:::
## Political debt?
::::: {.columns}
:::: {.column width="60%"}
- You'll need some-to-all of:
  - your own time to *not* work on features
  - time from other teams
  - buy-in from your/other teams
  - buy-in from management
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by krakenimages on Unsplash
:::
---
::: {.absolute top=100 left=50 width="900" height="250"}
Your best tech debt tool is not your coding skills ...
:::
. . .
::: {.absolute top=300 left=200 width="900" height="250"}
... **it's your communication skills.**
:::
## Communication skills
::::: {.columns}
:::: {.column width="60%"}
- Convincing your team that the new thing is worth the time
- Convincing managers why this matters, and how they benefit
- Convincing nearby teams to use/contribute to the new plan
- ... even that they should own it
::::
:::: {.column width="40%"}
{width="100%"}
::::
:::::
::: footer
Photo by Nathan Cima on Unsplash
:::
---
::: {.absolute top=200 left=450 width="900" height="250"}
Thanks!
:::
::: {.absolute top=300 left=350 width="900" height="250"}
@gwmngilfen:fedora.im
:::