---
title: "Upgrading Fedora's Monitoring"
subtitle: "A real Tech Debt story"
author: "Greg 'gwmngilfen' Sutcliffe"
format:
  revealjs:
    slide-number: true
    chalkboard:
      buttons: false
    preview-links: auto
    logo: images/fedora.png
    css: styles.css
    footer: '[https://fedoraproject.org](https://fedoraproject.org)'
---

## `cat me.yaml` {.smaller}

:::: {.columns}

::: {.column width="60%"}
- name: Greg 'Gwmngilfen' Sutcliffe
- matrix: @gwmngilfen:fedora.im
- history:
    - Senior Sysadmin, Fedora & CentOS
    - Community Architect / Data Scientist, Ansible
    - Community Architect, Foreman
    - 13+ years at Red Hat
- notes:
    - likes solving problems
    - dislikes taking averages
    - plays too many automation games^1^
:::

::: {.column width="40%"}
![](images/me.jpg){width="100%"}
:::

::::

::: footer
1: You already missed our [automation games talk](https://cfp.cfgmgmtcamp.org/ghent2026/talk/CBHVKV/) - check the recording :P
:::

## Abstract

::::: {.columns}

:::: {.column width="60%"}
- Tech debt story?

::: {.incremental}
- Nagios, circa 10,000 BC
- Zabbix & why this is progress
- Monitoring via cfg mgmt
- Dedicated roles vs application snippets
- Code density (hi Jinja!)
- What are we monitoring again?
:::
::::

:::: {.column width="40%"}
![](images/petra.jpg){width="100%"}
::::

:::::

::: footer
Photo by Spencer Davis on Unsplash
:::

---

::: {.absolute top=300 left=70 width="900" height="250"}
**All your technical debt is actually political debt**
:::

---

::: {.absolute top=250 left=50 width="900" height="250"}
> "Politics is the set of activities that are associated with making decisions in groups"
:::

## Fedora monitoring, 2024-era

We're using Nagios, deployed via a monolithic Ansible role

![](images/nagios-1.png)

## Concerns & Constraints

::::: {.columns}

:::: {.column width="60%"}
::: {.incremental}
- Can only handle OK/WARN/CRIT
- Checks aren't on a given schedule
- Separate collectd instance
- Monolithic Ansible role
- Fedora prefers FOSS ...
- ... and on-premise
- Highly heterogeneous install base
- Many on-disk / internal checks
:::
::::

:::: {.column width="40%"}
![](images/chains.jpg){width="100%"}
::::

:::::

::: footer
Photo by Spencer Davis on Unsplash
:::

## Service definitions

We're using Nagios, deployed via a monolithic Ansible role

``` {.yaml code-line-numbers="|2-3|5-14"}
- name: Copy /etc/nagios/services (RDU3 specific files)
  ansible.builtin.copy: src=nagios/services/rdu3_internal/{{ item }}
    dest=/etc/nagios/services/{{ item }}
  with_items:
    - certgetter.cfg
    - db_backups.cfg
    - disk.cfg
    - fedora_messaging.cfg
    - file_age.cfg
    - koji.cfg
    - locking.cfg
    - mailman.cfg
    - nrpe.cfg
    - pgsql.cfg
```

::: footer
[roles/nagios_server/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::

## Host definitions

``` {.yaml}
{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3' and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
  use defaulttemplate
{% else %}
  use mincheck
{% endif %}
  host_name {{ host }}
  ...
}
```

::: footer
[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)
:::

## Ansible code then

``` {.yaml code-line-numbers="1|2-3|5"}
{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3' and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
  use defaulttemplate
{% else %}
  use mincheck
{% endif %}
  host_name {{ host }}
  ...
}
```

::: footer
[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)
:::

---

::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #1 - Maintenance Debt
:::

## Progress, circa 1800?

::::: {.columns}

:::: {.column width="60%"}
::: {.incremental}
- Zabbix isn't exactly "new"
- Actively maintained
- Integrated trend data
- Moves trigger logic to the server
- Active agent
- Ansible collection available
- Also, we already had prior work
- and CentOS uses it
:::
::::

:::: {.column width="40%"}
![](images/bigben.jpg){width="100%"}
::::

:::::

::: footer
Photo by Vitalijs Barilo on Unsplash
:::

## Prior art

![](images/pagure.png){width="100%"}

::: footer
[Pagure / fedora-infra / #11393](https://pagure.io/fedora-infrastructure/issue/11393)
:::

---

::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #2 - Architectural Debt
:::

## Declarative monitoring

::::: {.columns}

:::: {.column width="70%"}
::: {.incremental}
- All Zabbix config should be in Ansible
- Items / triggers in templates
- Hosts declare self & templates
- Notifications, users, SAML, etc too
- Monitoring tasks live in the app role
- App developers know best
- Even for `base` or `httpd` it works
- Not *everything* can be done this way
:::
::::

:::: {.column width="30%"}
![](images/team.jpg){width="100%"}
::::

:::::

::: footer
Photo by Shane Rounce on Unsplash
:::

## Ansible code now {auto-animate="true"}

``` {.yaml}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
```

::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::

## Ansible code now {auto-animate="true"}

``` {.yaml code-line-numbers="5-9|7|8-9"}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
- name: Add self to Anubis in Zabbix
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    link_templates: Anubis Monitoring
    force: false
```

::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::

## Ansible code now {auto-animate="true"}

``` {.yaml code-line-numbers="5-9,13"}
- name: Import Anubis template file
  community.zabbix.zabbix_template:
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
- name: Ensure Anubis hostgroup is present
  community.zabbix.zabbix_group:
    host_groups:
      - Anubis servers
    state: present
- name: Add self to Anubis in Zabbix
  community.zabbix.zabbix_host:
    host_name: "{{ inventory_hostname }}"
    host_groups: Anubis servers
    link_templates: Anubis Monitoring
    force: false
```

::: footer
[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)
:::

## Workflow (in principle)

::::: {.columns}

:::: {.column width="70%"}
::: {.incremental}
- App developers work on STG Zabbix
- Design/test template in the UI
- Export it to YAML
- Ansible PR created
- Adds YAML template to PRD Zabbix
- Reviewed by Infra
- Usual code flow, merge
- Monitoring is part of "Definition of Done"
:::
::::

:::: {.column width="30%"}
![](images/pipes.jpg){width="100%"}
::::

:::::

::: footer
Photo by Danylo Sorokin on Unsplash
:::

---

::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #3 - Delegation Debt
:::

## Upgrade check types & purposes

::::: {.columns}

:::: {.column width="70%"}
::: {.incremental}
- Nagios *technically* has one check
- This offloads logic to the NRPE agent
- Leads to very narrow thinking
- "Is this higher than X? -> Alert"
- Many things are statistical/predictive
- "Is this thing going to fail soon?"
- "Is this pattern unusual for this host?"
- Grouping is not possible either
:::
::::

:::: {.column width="30%"}
![](images/bulb.jpg){width="100%"}
::::

:::::

::: footer
Photo by Johan Extra on Unsplash
:::

---

::: {.absolute top=300 left=200 width="900" height="250"}
Tech debt #4 - Time Debt (or pressure)
:::

## But Greg, you mentioned tools to help...

. . .


::: {.absolute top=200 left=50 width="900" height="250"}
> The people side of tech debt work - practicalities for helping the people around you to get on board when "the current thing still works, doesn't it?"
:::

---

::: {.absolute top=300 left=50 width="900" height="250"}
**All your technical debt is actually political debt**
:::

## Technical debt is too broad

::::: {.columns}

:::: {.column width="60%"}
There are multiple types of debt

::: {.incremental}
- Maintenance debt
- Architectural debt
- Delegation debt
- Time debt
:::

Fixing these takes resources, and the value is hard to demonstrate
::::

:::: {.column width="40%"}
![](images/tower.jpg){width="100%"}
::::

:::::

::: footer
Photo by Tommy Tsao on Unsplash
:::

## Political debt?

::::: {.columns}

:::: {.column width="60%"}
- You'll need some-to-all of:
    - your own time to *not* work on features
    - time from other teams
    - buy-in from your/other teams
    - buy-in from management
::::

:::: {.column width="40%"}
![](images/hands.jpg){width="100%"}
::::

:::::

::: footer
Photo by krakenimages on Unsplash
:::

---

::: {.absolute top=100 left=50 width="900" height="250"}
Your best tech debt tool is not your coding skills ...
:::

. . .

::: {.absolute top=300 left=200 width="900" height="250"}
... **it's your communication skills.**
:::

## Communication skills

::::: {.columns}

:::: {.column width="60%"}
- Convincing your team that the new thing is worth the time
- Convincing managers why this matters, and how they benefit
- Convincing nearby teams to use/contribute to the new plan
- ... even that they should own it
::::

:::: {.column width="40%"}
![](images/wires.jpg){width="100%"}
::::

:::::

::: footer
Photo by Nathan Cima on Unsplash
:::

---

::: {.absolute top=200 left=450 width="900" height="250"}
Thanks!
:::

::: {.absolute top=300 left=350 width="900" height="250"}
@gwmngilfen:fedora.im
:::