{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: \"Upgrading Fedora's Monitoring\"\n", "subtitle: \"A real Tech Debt story\"\n", "author: \"Greg 'gwmngilfen' Sutcliffe\"\n", "format:\n", " revealjs: \n", " slide-number: true\n", " chalkboard: \n", " buttons: false\n", " preview-links: auto\n", " logo: images/fedora.png\n", " css: styles.css\n", " footer: '[https://fedoraproject.org](https://fedoraproject.org)'\n", "resources:\n", " - demo.pdf\n", "---\n", "\n", "## Abstract\n", "\n", "::: {.incremental}\n", "\n", "- Nagios, radio-carbon-dated to about 10,000 BC\n", "- Zabbix, and why this counts as progress\n", "- Monitoring via configuration management\n", " - Dedicated roles vs application snippets\n", " - Code density (hi Jinja!)\n", "- What are we monitoring again?\n", "\n", ":::\n", "\n", "## `cat me.yaml` {.smaller}\n", "\n", ":::: {.columns}\n", "::: {.column width=\"60%\"}\n", "```{yaml}\n", "- name: Greg 'Gwmngilfen' Sutcliffe\n", "- matrix: @gwmngilfen:fedora.im\n", "- history:\n", " - Senior Sysadmin, Fedora & CentOS\n", " - Community Architect / Data Scientist, Ansible\n", " - Community Architect, Foreman\n", " - 13+ years at Red Hat\n", "- notes:\n", " - likes solving problems\n", " - dislikes taking averages\n", " - plays too many automation games^1^\n", "```\n", ":::\n", "\n", "::: {.column width=\"40%\"}\n", "![](images/me.jpg){width=\"100%\"}\n", ":::\n", "::::\n", "\n", "::: footer\n", "1: You already missed our [automation games talk](https://cfp.cfgmgmtcamp.org/ghent2026/talk/CBHVKV/) - check the recording :P\n", ":::\n", "\n", "## Fedora monitoring, 2024-era\n", "\n", "We're using Nagios, deployed via a monolithic Ansible role\n", "\n", "![](images/nagios-1.png)\n", "\n", "## Fedora monitoring, 2024-era\n", "\n", "We're using Nagios, deployed via a monolithic Ansible role\n", "\n", "``` {.yaml code-line-numbers=\"2|4-13\"}\n", "- name: Copy /etc/nagios/services (RDU3 specific files)\n", " 
ansible.builtin.copy: src=nagios/services/rdu3_internal/{{ item }} dest=/etc/nagios/services/{{ item }}\n", " with_items:\n", " - certgetter.cfg\n", " - db_backups.cfg\n", " - disk.cfg\n", " - fedora_messaging.cfg\n", " - file_age.cfg\n", " - koji.cfg\n", " - locking.cfg\n", " - mailman.cfg\n", " - nrpe.cfg\n", " - pgsql.cfg\n", "```\n", "\n", "::: footer\n", "[roles/nagios_server/tasks/main.yml](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "## Concerns & Constraints\n", "\n", "::: {.incremental}\n", "- Can only handle OK/WARN/CRIT\n", "- Checks aren't on a given schedule\n", "- Separate collectd instance for trend data\n", "- Monolithic Ansible role\n", "
\n", "- Fedora prefers open source vs commercial solutions\n", "- On-premises deployment\n", "- Highly heterogeneous install base\n", "- Many on-disk / internal checks\n", ":::\n", "\n", "## Progress, circa 1800?\n", "\n", "::: {.incremental}\n", "- Zabbix isn't exactly the newest thing, but\n", " - Integrated trend data\n", " - Moves trigger logic to the server\n", " - Active agent for locked-down servers\n", " - Ansible collection available\n", " - Actively maintained\n", "- Also, we already had prior work on Zabbix\n", ":::\n", "\n", "## Prior art\n", "\n", "screenshot first commit to zabbix role\n", "screenshot pagure issue\n", "\n", "## Declarative monitoring\n", "\n", "- All Zabbix config should be in Ansible, no UI work\n", " - Items / triggers in templates\n", " - Hosts declare themselves and their templates\n", " - Notifications, users, SAML, PSK config, etc too\n", "- Monitoring Ansible should live in the app role\n", " - App developers know best how to monitor the things\n", " - Even for `base` or `httpd` it makes sense too\n", " - Not *everything* can be done this way, sadly\n", "\n", "## Ansible code then\n", "\n", "``` {.yaml}\n", "{% for host in groups['all']|sort %}\n", "{% if hostvars[host].datacenter == 'rdu3' and hostvars[host].nagios_Can_Connect == true %}\n", "define host {\n", "{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}\n", " use defaulttemplate\n", "{% else %}\n", " use mincheck\n", "{% endif %}\n", " host_name {{ host }}\n", "...\n", "}\n", "```\n", "::: footer\n", "[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "## Ansible code then\n", "\n", "``` {.yaml code-line-numbers=\"1|2|4\"}\n", "{% for host in groups['all']|sort %}\n", "{% if hostvars[host].datacenter == 'rdu3' and hostvars[host].nagios_Can_Connect == true %}\n", "define host {\n", "{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}\n", " use defaulttemplate\n", "{% else %}\n", " 
use mincheck\n", "{% endif %}\n", " host_name {{ host }}\n", "...\n", "}\n", "```\n", "::: footer\n", "[roles/nagios_server/templates/nagios/hosts/rdu3-hosts.cfg.j2](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "## Ansible code now {auto-animate=\"true\"}\n", "\n", "``` {.yaml} \n", " - name: Import Anubis template file\n", " community.zabbix.zabbix_template:\n", " template_yaml: \"{{ lookup('file', 'zabbix/template-anubis.yml') }}\"\n", " state: present\n", "```\n", "\n", "::: footer\n", "[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "## Ansible code now {auto-animate=\"true\"}\n", "\n", "``` {.yaml code-line-numbers=\"5-9|7|8-9\"}\n", " - name: Import Anubis template file\n", " community.zabbix.zabbix_template:\n", " template_yaml: \"{{ lookup('file', 'zabbix/template-anubis.yml') }}\"\n", " state: present\n", " - name: Add self to Anubis in Zabbix\n", " community.zabbix.zabbix_host:\n", " host_name: \"{{ inventory_hostname }}\"\n", " link_templates: Anubis Monitoring\n", " force: false\n", "```\n", "\n", "::: footer\n", "[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "## Ansible code now {auto-animate=\"true\"}\n", "\n", "``` {.yaml code-line-numbers=\"5-9,13\"}\n", " - name: Import Anubis template file\n", " community.zabbix.zabbix_template:\n", " template_yaml: \"{{ lookup('file', 'zabbix/template-anubis.yml') }}\"\n", " state: present\n", " - name: Ensure Anubis hostgroup is present\n", " community.zabbix.zabbix_group:\n", " host_groups:\n", " - Anubis servers\n", " state: present\n", " - name: Add self to Anubis in Zabbix\n", " community.zabbix.zabbix_host:\n", " host_name: \"{{ inventory_hostname }}\"\n", " host_groups: Anubis servers\n", " link_templates: Anubis Monitoring\n", " force: false\n", "```\n", "\n", "::: footer\n", "[roles/anubis/tasks/main.yml](https://pagure.io/fedora-infra/ansible)\n", ":::\n", "\n", "---\n", "\n", "## Workflow (in principle)\n", 
"\n", "- App developers work on STG Zabbix\n", " - Design template in the UI\n", " - Test it works\n", " - Export it to YAML\n", "- Ansible PR created\n", " - Adds YAML template to PRD Zabbix (and STG)\n", " - Reviewed by Infra\n", " - Usual code flow, merge\n", "- Monitoring becomes part of \"Definition of Done\"\n", "\n", "## What even is this?\n", "\n", "Screen shot of nagios checks with weird names\n", "\n", "## Check types and purposes\n", "\n", "- Nagios *technically* only has one way to check things\n", " - In reality, this offloads the logic to the NRPE agent\n", "- Leads to very narrow thinking\n", " - \"Is this higher than X? -> Alert\"\n", "- In reality, many things are statistical/predictive\n", " - \"Is this thing going to fail soon?\"\n", " - \"Is this pattern unusual for this host?\"\n", "- Very much WIP right now\n", "\n", "---\n", "\n", "::: {.absolute top=300 left=50 width=\"900\" height=\"250\"}\n", "**All your technical debt is actually political debt**\n", ":::\n", "\n", "## Technical debt is too broad\n", "\n", ". . .\n", "\n", "There are multiple types of debt\n", "\n", ". . .\n", "\n", "::: {.incremental}\n", " - Maintenance debt\n", " - Architectural debt\n", " - Pressure debt\n", ":::\n", "\n", ". . .\n", "\n", "Fixing any of these will take resources\n", "\n", "and will be hard to show value for\n", "\n", "## Political debt?\n", "\n", "- You'll need some-to-all of:\n", " - your own time to *not* work on features\n", " - time from other teams\n", " - buy in from your/other teams\n", " - buy in from management\n", "\n", "> \"Politics is the set of activities that are associated with making decisions in groups\"\n", "\n", "## \n", "\n", "\n", "## Whose monitoring is it anyway?\n" ], "id": "3e843506" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }