Upgrading Fedora’s Monitoring

A real Tech Debt story

Greg ‘gwmngilfen’ Sutcliffe

cat me.yaml

  • name: Greg ‘Gwmngilfen’ Sutcliffe
  • matrix: @gwmngilfen:fedora.im
  • history:
    • Senior Sysadmin, Fedora & CentOS
    • Community Architect / Data Scientist, Ansible
    • Community Architect, Foreman
    • 13+ years at Red Hat
  • notes:
    • likes solving problems
    • dislikes taking averages
    • plays too many automation games

Abstract

  • Tech debt story?
  • Nagios, circa 10,000 BC
  • Zabbix & why this is progress
  • Monitoring via cfg mgmt
    • Dedicated roles vs application snippets
    • Code density (hi Jinja!)
  • What are we monitoring again?

All your technical debt is actually political debt

“Politics is the set of activities that are associated with making decisions in groups”

Fedora monitoring, 2024-era

We’re using Nagios, deployed via a monolithic Ansible role

Concerns & Constraints

  • Can only handle OK/WARN/CRIT
  • Checks aren’t on a given schedule
  • Separate collectd instance
  • Monolithic Ansible role
  • Fedora prefers FOSS …
  • … and on-premise
  • Highly heterogeneous install base
  • Many on-disk / internal checks
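
The first constraint is easier to see in code. This is a minimal sketch of the Nagios plugin convention, in which every check collapses to a single exit status (the convention also defines a fourth state, UNKNOWN). The function name, sample values, and thresholds are hypothetical, not taken from Fedora's configuration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Exit codes used by the Nagios plugin convention."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def threshold_check(value, warn, crit):
    """Collapse one measurement into one status (hypothetical helper)."""
    if value is None:
        return Status.UNKNOWN
    if value >= crit:
        return Status.CRITICAL
    if value >= warn:
        return Status.WARNING
    return Status.OK


# Hypothetical disk-usage percentages checked against 80/90 thresholds.
print(threshold_check(72.0, warn=80, crit=90).name)
print(threshold_check(93.5, warn=80, crit=90).name)
```

Only the status leaves the host. The measurement behind it is not kept as a series, so it cannot be trended or compared later.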

Service definitions

- name: Copy /etc/nagios/services (RDU3 specific files)
  ansible.builtin.copy: src=nagios/services/rdu3_internal/{{ item }} 
     dest=/etc/nagios/services/{{ item }}
  with_items:
    - certgetter.cfg
    - db_backups.cfg
    - disk.cfg
    - fedora_messaging.cfg
    - file_age.cfg
    - koji.cfg
    - locking.cfg
    - mailman.cfg
    - nrpe.cfg
    - pgsql.cfg

Host definitions

{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3'
  and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
   use                     defaulttemplate
{% else %}
   use                     mincheck
{% endif %}
   host_name               {{ host }}
...
}
{% endif %}
{% endfor %}

Tech debt #1 - Maintenance Debt

Progress, circa 1800?

  • Zabbix isn’t exactly “new”
    • Actively maintained
    • Integrated trend data
    • Moves trigger logic to the server
    • Active agent
    • Ansible collection available
  • Also, we already had prior work
    • and CentOS uses it

Prior art

Tech debt #2 - Architectural Debt

Declarative monitoring

  • All Zabbix config should be in Ansible
    • Items / triggers in templates
    • Hosts declare self & templates
    • Notifications, users, SAML, etc. too
  • Monitoring tasks live in the app role
    • App developers know best
    • Even for base or httpd it works
  • Not everything can be done this way

Ansible code now

    - name: Import Anubis template file
      community.zabbix.zabbix_template:
        template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
        state: present
    - name: Ensure Anubis hostgroup is present
      community.zabbix.zabbix_group:
        host_groups:
          - Anubis servers
        state: present
    - name: Add self to Anubis in Zabbix
      community.zabbix.zabbix_host:
        host_name: "{{ inventory_hostname }}"
        host_groups: Anubis servers
        link_templates: Anubis Monitoring
        force: false

Workflow (in principle)

  • App developers work on STG Zabbix
    • Design/test template in the UI
    • Export it to YAML
  • Ansible PR created
    • Adds YAML template to PRD Zabbix
    • Reviewed by Infra
    • Usual code flow, merge
  • Monitoring is part of “Definition of Done”
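
The "Export it to YAML" step can also be scripted instead of clicked through. A rough sketch follows, assuming the Zabbix JSON-RPC method `configuration.export`. It only builds the request body: the HTTP call and authentication are left out, and the template ID is a made-up placeholder.

```python
import json


def export_request(template_ids, request_id=1):
    """Build a JSON-RPC body asking Zabbix to export templates as YAML."""
    return {
        "jsonrpc": "2.0",
        "method": "configuration.export",
        "params": {
            "format": "yaml",
            # Zabbix object IDs are passed as strings.
            "options": {"templates": [str(t) for t in template_ids]},
        },
        "id": request_id,
    }


# "10571" is a hypothetical template ID on the STG server.
body = json.dumps(export_request(["10571"]), indent=2)
print(body)
```

The YAML returned by such a call is what would be committed in the Ansible PR and imported on PRD.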

Tech debt #3 - Delegation Debt

Upgrade check types & purposes

  • Nagios technically has one check
    • This offloads logic to the NRPE agent
  • Leads to very narrow thinking
    • “Is this higher than X? -> Alert”
  • Many things are statistical/predictive
    • “Is this thing going to fail soon?”
    • “Is this pattern unusual for this host?”
  • Grouping is not possible either
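
To make "Is this thing going to fail soon?" concrete, here is a small sketch of a predictive check. It fits a least-squares line to recent samples and estimates how long until a limit is reached. Zabbix ships predictive trigger functions for this kind of question; the Python below only illustrates the arithmetic, and the function name and sample data are hypothetical.

```python
def hours_until(samples, limit):
    """Estimate hours from the last sample until the fitted trend hits `limit`.

    `samples` is a list of (hour, value) pairs. Returns None when there are
    too few samples or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_limit = (limit - intercept) / slope
    return max(0.0, t_limit - samples[-1][0])


# Hypothetical disk usage: 70% now, growing two points per hour.
usage = [(0, 62.0), (1, 64.0), (2, 66.0), (3, 68.0), (4, 70.0)]
remaining = hours_until(usage, limit=100.0)
print(remaining)
```

A fixed 80% threshold would stay silent on this host, while the trend says the disk fills in about 15 hours.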

Tech debt #4 - Time Debt (or pressure)

But Greg, you mentioned tools to help…

The people side of tech debt work - practicalities for helping the people around you to get on board when “the current thing still works, doesn’t it?”

All your technical debt is actually political debt

Technical debt is too broad

There are multiple types of debt

  • Maintenance debt
  • Architectural debt
  • Delegation debt
  • Time debt

Fixing these will take resources

and will be hard to show value for

Political debt?

  • You’ll need some-to-all of:
    • your own time to not work on features
    • time from other teams
    • buy-in from your/other teams
    • buy-in from management

Your best tech debt tool is not your coding skills …

it’s your communication skills.

Communication skills

  • Convincing your team that the new thing is worth the time
  • Convincing managers why this matters, and how they benefit
  • Convincing nearby teams to use/contribute to the new plan
    • … even that they should own it

Thanks!

@gwmngilfen:fedora.im