Upgrading Fedora’s Monitoring

`cat me.yaml`

name: Greg ‘Gwmngilfen’ Sutcliffe
matrix: @gwmngilfen:fedora.im
history:
- Senior Sysadmin, Fedora & CentOS
- Community Architect / Data Scientist, Ansible
- Community Architect, Foreman
- 13+ years at Red Hat
notes:
- likes solving problems
- dislikes taking averages
- plays too many automation games

Abstract

Tech debt story?

Nagios, circa 10,000 BC
Zabbix & why this is progress
Monitoring via cfg mgmt
- Dedicated roles vs application snippets
- Code density (hi Jinja!)
What are we monitoring again?

All your technical debt is actually political debt

“Politics is the set of activities that are associated with making decisions in groups”

Fedora monitoring, 2024-era

We’re using Nagios, deployed via a monolithic Ansible role

Concerns & Constraints

Can only handle OK/WARN/CRIT
Checks aren’t on a given schedule
Separate collectd instance
Monolithic Ansible role
Fedora prefers FOSS …
… and on-premise
Highly heterogeneous install base
Many on-disk / internal checks

Service definitions

- name: Copy /etc/nagios/services (RDU3 specific files)
  ansible.builtin.copy:
    src: "nagios/services/rdu3_internal/{{ item }}"
    dest: "/etc/nagios/services/{{ item }}"
  loop:
    - certgetter.cfg
    - db_backups.cfg
    - disk.cfg
    - fedora_messaging.cfg
    - file_age.cfg
    - koji.cfg
    - locking.cfg
    - mailman.cfg
    - nrpe.cfg
    - pgsql.cfg

Host definitions

{% for host in groups['all']|sort %}
{% if hostvars[host].datacenter == 'rdu3'
  and hostvars[host].nagios_Can_Connect == true %}
define host {
{% if hostvars[host].nagios_Check_Services['nrpe'] == true %}
   use                     defaulttemplate
{% else %}
   use                     mincheck
{% endif %}
   host_name               {{ host }}
...
}
{% endif %}
{% endfor %}

Tech debt #1 - Maintenance Debt

Progress, circa 1800?

Zabbix isn’t exactly “new”
- Actively maintained
- Integrated trend data
- Moves trigger logic to the server
- Active agent
- Ansible collection available
Also, we already had prior work
- and CentOS uses it

Prior art

Tech debt #2 - Architectural Debt

Declarative monitoring

All Zabbix config should be in Ansible
- Items / triggers in templates
- Hosts declare self & templates
- Notifications, users, SAML, etc. too
Monitoring tasks live in the app role
- App developers know best
- Even for base or httpd it works
Not everything can be done this way

Ansible code now

    - name: Import Anubis template file
      community.zabbix.zabbix_template:
        template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
        state: present
    - name: Ensure Anubis hostgroup is present
      community.zabbix.zabbix_group:
        host_groups:
          - Anubis servers
        state: present
    - name: Add self to Anubis in Zabbix
      community.zabbix.zabbix_host:
        host_name: "{{ inventory_hostname }}"
        host_groups: Anubis servers
        link_templates: Anubis Monitoring
        force: false

Workflow (in principle)

App developers work on STG Zabbix
- Design/test template in the UI
- Export it to YAML
Ansible PR created
- Adds YAML template to PRD Zabbix
- Reviewed by Infra
- Usual code flow, merge
Monitoring is part of “Definition of Done”
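
The STG-to-PRD step can be sketched as one import task pointed at different Zabbix instances. This is only a sketch: it assumes the httpapi connection used by recent releases of the community.zabbix collection, and the hostnames and the `env` variable are hypothetical placeholders, not Fedora's real setup. Credentials (for example an API token) are left out on purpose.

```yaml
# Sketch only: hostnames and the `env` variable are hypothetical.
- name: Import Anubis template into the Zabbix for this environment
  vars:
    # Connection settings for the Zabbix API (community.zabbix httpapi plugin)
    ansible_connection: httpapi
    ansible_network_os: community.zabbix.zabbix
    ansible_httpapi_port: 443
    ansible_httpapi_use_ssl: true
    ansible_httpapi_validate_certs: true
    # Placeholder hostnames; pick STG unless we are deploying to production
    ansible_host: "{{ 'zabbix.example.org' if env == 'production' else 'zabbix.stg.example.org' }}"
  community.zabbix.zabbix_template:
    # The YAML file exported from the STG UI and reviewed in the PR
    template_yaml: "{{ lookup('file', 'zabbix/template-anubis.yml') }}"
    state: present
```

Keeping the exported YAML as the single artifact means the file Infra reviews is the same one that reaches PRD.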

Tech debt #3 - Delegation Debt

Upgrade check types & purposes

Nagios technically has one check
- This offloads logic to the NRPE agent
Leads to very narrow thinking
- “Is this higher than X? -> Alert”
Many things are statistical/predictive
- “Is this thing going to fail soon?”
- “Is this pattern unusual for this host?”
Grouping is not possible either
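
To make the "going to fail soon" idea concrete, here is a hedged sketch of what server-side triggers might look like inside an exported template. It assumes Zabbix 6.x expression syntax. The disk item, its place in the "Anubis Monitoring" template, and the thresholds are illustrative assumptions. A real export also carries `uuid` fields and the surrounding `zabbix_export` / `templates` keys, which are omitted here.

```yaml
# Fragment of a template export, not a complete importable file.
# Item key, trigger names and thresholds are illustrative assumptions.
items:
  - name: 'Free space on /'
    key: 'vfs.fs.size[/,free]'
    units: B
    triggers:
      # Predictive: fit the last hour of samples and alert when free
      # space is projected to reach zero in less than a day.
      - expression: 'timeleft(/Anubis Monitoring/vfs.fs.size[/,free],1h,0)<1d'
        name: 'Root filesystem projected to fill within a day'
        priority: WARNING
      # Comparative: the last hour's average is under half of the
      # average for the same hour yesterday.
      - expression: 'avg(/Anubis Monitoring/vfs.fs.size[/,free],1h)<0.5*avg(/Anubis Monitoring/vfs.fs.size[/,free],1h:now-1d)'
        name: 'Free space on / far below the same hour yesterday'
        priority: INFO
```

Neither expression is a fixed "higher than X" comparison against the latest value; both depend on history that the server keeps, which is what a single pass/fail check on the agent cannot do.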

Tech debt #4 - Time Debt (or pressure)

But Greg, you mentioned tools to help…

The people side of tech debt work: practical ways to get the people around you on board when “the current thing still works, doesn’t it?”

All your technical debt is actually political debt

Technical debt is too broad

There are multiple types of debt

Maintenance debt
Architectural debt
Delegation debt
Time debt

Fixing these will take resources

and will be hard to show value for

Political debt?

You’ll need some-to-all of:
- your own time to not work on features
- time from other teams
- buy-in from your/other teams
- buy-in from management

Your best tech debt tool is not your coding skills …

… it’s your communication skills.

Communication skills

Convincing your team that the new thing is worth the time
Convincing managers why this matters, and how they benefit
Convincing nearby teams to use/contribute to the new plan
- … even that they should own it

Thanks!

@gwmngilfen:fedora.im