mirror of https://github.com/DoTheEvo/selfhosted-apps-docker synced 2024-11-06 21:20:41 +00:00

DoTheEvo ce8bd5e352 update		2023-03-16 02:09:31 +01:00
Prometheus+Grafana in docker

guide-by-example

WORK IN PROGRESS
Loki and caddy monitoring parts are not finished yet

Purpose

Monitoring of the host and the running containers.

Monitoring in this case means gathering and showing information on how services, machines, or containers are running. That can be cpu, io, ram, or disk use, but also the number of http requests, errors, or the results of backups.
Prometheus deals with metrics. Loki deals with logs. Grafana is there to show the data on a dashboard.

A lot of the prometheus setup here is based on the magnificent stefanprodan/dockprom.

Chapters

Core prometheus+grafana - nice dashboards with metrics of docker host and containers
Pushgateway - push data to prometheus from anywhere
Alertmanager - setting alerts and getting notifications
Loki - all of the above but for log files
Caddy monitoring - monitoring a reverse proxy

Overview

Good youtube overview of Prometheus.

Prometheus is an open source system for monitoring and alerting, written in golang.
It periodically collects metrics from configured targets, makes these metrics available for visualization, and can trigger alerts.
Prometheus is a relatively young project and uses pull type monitoring, meaning the server fetches metrics from the targets rather than the targets sending them.

Glossary.

Prometheus Server is the core of the system, responsible for
- pulling new metrics
- storing the metrics in a database and evaluating them
- making metrics available through PromQL API
Targets - machines, services, applications that are monitored.
These need to have an exporter.
- exporter - a script or a service that gathers metrics on the target, converts them to prometheus server format, and exposes them at an endpoint so they can be pulled
Alertmanager - responsible for handling alerts from Prometheus Server, and sending notifications through email, slack, pushover,.. In this setup ntfy webhook will be used.
pushgateway - allows push type of monitoring. Meaning a machine anywhere in the world can push data in to your prometheus. Should not be overused as it goes against the pull philosophy of prometheus.
Grafana - for web UI visualization of the collected metrics
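
To make the exporter idea from the glossary more concrete, here is a minimal Python sketch of the text format an exporter exposes at its endpoint. The function names and the sample metrics are made up for illustration. Real exporters like NodeExporter and cAdvisor do all of this on their own.

```python
from typing import Dict, Optional


def escape_label_value(value: str) -> str:
    # The text format needs backslash, double quote and newline escaped.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, value: float,
                  labels: Optional[Dict[str, str]] = None) -> str:
    """Render one sample as a single line of the Prometheus text format."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}\n"
    return f"{name} {value}\n"


if __name__ == "__main__":
    # Hypothetical samples, only to show the shape of the output.
    print(render_metric("some_metric", 3.14), end="")
    print(render_metric("up", 1, {"job": "cadvisor"}), end="")
```

Prometheus pulls plain text like this from every target on each scrape interval.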

Files and directory structure

/home/
 └── ~/
     └── docker/
         └── prometheus/
             ├── 🗁 grafana_data/
             ├── 🗁 prometheus_data/
             ├── 🗋 docker-compose.yml
             ├── 🗋 .env
             └── 🗋 prometheus.yml

grafana_data/ - a directory where grafana stores its data
prometheus_data/ - a directory where prometheus stores its database and data
.env - a file containing environment variables for docker compose
docker-compose.yml - a docker compose file, telling docker how to run the containers
prometheus.yml - a configuration file for prometheus

The three files must be provided.
The directories are created by docker compose on the first run.

docker-compose

Prometheus - The official image is used, with a few extra command arguments passing configuration. Of note is the 240 hour (10 day) retention policy.
Grafana - The official image is used, with a bind mounted directory for persistent data storage. The user is set to root, as it solves issues I am too lazy to investigate.
NodeExporter - An exporter for linux machines, in this case gathering the metrics of the linux machine running docker, like uptime, cpu load, memory use, network bandwidth use, disk space,...
It also bind mounts some system directories to have access to the required info.
cAdvisor - An exporter gathering docker container metrics, showing cpu, memory, and network use of each container.
It runs in privileged mode and has some bind mounts of system directories to have access to the required info.

Note - ports are only exposed, not published, since a reverse proxy is expected to be used, with the services accessed by hostname, not by ip and port.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    user: root
    restart: unless-stopped
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=240h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    expose:
      - "9090"
    labels:
      org.label-schema.group: "monitoring"

  # WEB BASED UI VISUALISATION OF METRICS
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    hostname: grafana
    user: root
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./grafana_data:/var/lib/grafana
    expose:
      - "3000"
    labels:
      org.label-schema.group: "monitoring"

  # HOST LINUX MACHINE METRICS EXPORTER
  nodeexporter:
    image: prom/node-exporter:v1.5.0
    container_name: nodeexporter
    hostname: nodeexporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    expose:
      - "9100"
    labels:
      org.label-schema.group: "monitoring"

  # DOCKER CONTAINERS METRICS EXPORTER
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.1
    container_name: cadvisor
    hostname: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
    expose:
      - "8080"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

.env

# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava

# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa

All containers must be on the same network.
Which is named in the .env file.
If one does not exist yet: docker network create caddy_net

prometheus.yml

Official documentation.

Contains the bare minimum setup of targets from where metrics are to be pulled.

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Reverse proxy

Caddy v2 is used, details here.

Caddyfile

graf.{$MY_DOMAIN} {
  reverse_proxy grafana:3000
}

prom.{$MY_DOMAIN} {
  reverse_proxy prometheus:9090
}

First run and Grafana configuration

login admin/admin to graf.example.com, change the password
add Prometheus as a Data source in configuration
set URL to http://prometheus:9090
import dashboards from json files in this repo

These dashboards are the preconfigured ones from stefanprodan/dockprom with a few changes.
docker_host.json did not show free disk space for me, so I had to change fstype from aufs to ext4. Also included is a fix for host network monitoring not showing traffic. In all of them the default time interval is set to 1h instead of 15m.

docker_host.json - dashboard showing linux host machine metrics
docker_containers.json - dashboard showing docker containers metrics, except the ones labeled as monitoring in the compose file
monitoring_services.json - dashboard showing docker containers metrics of containers that are labeled monitoring

PromQL

Some concepts, highlights and examples of PromQL.

PromQL returns results as vectors.

The official basics page, quite to the point and short
Introduction to PromQL
relatively short video to the point
Prometheus Cheat Sheet - How to Join Multiple Metrics
decent stackoverflow answer
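
To show what "results as vectors" looks like in practice, here is a small Python sketch that reads an instant vector from the JSON that Prometheus returns on its /api/v1/query endpoint. The sample response is hand written for illustration, with targets taken from the prometheus.yml above, and the function name is an assumption.

```python
import json
from typing import Dict, List, Tuple


def parse_instant_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs from an instant vector query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    samples = []
    for series in data["result"]:
        # Each series carries one [timestamp, "value"] pair, value as a string.
        _timestamp, value = series["value"]
        samples.append((series["metric"], float(value)))
    return samples


# Hand-written sample of what a query for `up` could return.
SAMPLE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
   "value": [1678929000.0, "1"]},
  {"metric": {"__name__": "up", "job": "nodeexporter", "instance": "nodeexporter:9100"},
   "value": [1678929000.0, "1"]}
]}}
"""

if __name__ == "__main__":
    for labels, value in parse_instant_vector(SAMPLE):
        print(labels["job"], value)
```

An instant vector holds one value per series. A range vector, with resultType matrix, holds a list of values per series and is not handled by this sketch.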

Pushgateway

Gives freedom to push information in to prometheus from anywhere.

The setup

To add pushgateway functionality to the current stack:

New container pushgateway added to the compose file.

docker-compose.yml

services:

  # PUSHGATEWAY FOR PROMETHEUS
  pushgateway:
    image: prom/pushgateway:v1.5.1
    container_name: pushgateway
    hostname: pushgateway
    restart: unless-stopped
    command:
      - '--web.enable-admin-api'
    expose:
      - "9091"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Adding pushgateway to the Caddyfile of the reverse proxy so that it can be reached at https://push.example.com
Caddyfile
```
push.{$MY_DOMAIN} {
    reverse_proxy pushgateway:9091
}
```

Adding pushgateway's scrape point to prometheus.yml

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'pushgateway-scrape'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']

The basics

To test pushing some metric, execute in linux:
echo "some_metric 3.14" | curl --data-binary @- https://push.example.com/metrics/job/blabla/instance/whatever

The labels of the pushed metric are set in the URL path.
The label job is required, anything after that is up to you, though using an instance label is customary.
Now in grafana, in the Explore section, you should see some results when querying for some_metric.

The metrics sit on the pushgateway forever, unless deleted or the container shuts down. Prometheus does not remove metrics from it after scraping. It keeps scraping the pushgateway and stores the value with the time of scraping.
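
The same push can be done from a script. Below is a minimal Python sketch using only the standard library. The function names are made up, and https://push.example.com is the placeholder hostname used in this guide. It builds the URL the same way as the curl example, with the job and any extra labels in the path.

```python
import urllib.request
from typing import Dict, Optional
from urllib.parse import quote


def build_push_url(base_url: str, job: str,
                   grouping: Optional[Dict[str, str]] = None) -> str:
    """Build /metrics/job/<job>/<label>/<value>/... for the pushgateway."""
    # Values containing "/" need the pushgateway's base64 label syntax,
    # which this sketch does not cover.
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def build_body(metrics: Dict[str, float]) -> bytes:
    """One 'name value' line per metric, each ending with a newline."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items()).encode()


def push(base_url: str, job: str, metrics: Dict[str, float],
         grouping: Optional[Dict[str, str]] = None) -> int:
    """POST the metrics, like curl --data-binary does, and return the status."""
    request = urllib.request.Request(
        build_push_url(base_url, job, grouping),
        data=build_body(metrics),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    print(build_push_url("https://push.example.com", "blabla", {"instance": "whatever"}))
```

Calling push("https://push.example.com", "blabla", {"some_metric": 3.14}, {"instance": "whatever"}) should then be equivalent to the curl one-liner above.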

To wipe the pushgateway clean
curl -X PUT https://push.example.com/api/v1/admin/wipe

More on pushgateway setup, with the real world use to monitor backups, along with pushing metrics from windows in powershell - Veeam Prometheus Grafana

Alertmanager

Sends a notification when some metric breaches a preset condition.
The notification channels set here will be email and ntfy.

The setup

To add alertmanager to the current stack:

New file - alertmanager.yml will be bind mounted in alertmanager container.
This file contains configuration on how and where to deliver alerts.

alertmanager.yml

route:
  receiver: 'email'

receivers:
  - name: 'ntfy'
    webhook_configs:
    - url: 'https://ntfy.example.com/alertmanager'
      send_resolved: true

  - name: 'email'
    email_configs:
    - to: 'whoever@example.com'
      from: 'alertmanager@example.com'
      smarthost: smtp-relay.sendinblue.com:587
      auth_username: '<registration_email@gmail.com>'
      auth_identity: '<registration_email@gmail.com>'
      auth_password: '<long ass generated SMTP key>'

New file - alert.rules will be mounted in to prometheus container
This file defines which value of some metric becomes an alert event.

alert.rules

groups:
  - name: host
    rules:
      - alert: DiskSpaceLow
        expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19
        for: 10s
        labels:
          severity: critical
        annotations:
          description: "Diskspace is low!"

prometheus.yml is changed. An alerting section that points to the alertmanager container is added, and a path to a rules file is set.

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets: 
      - 'alertmanager:9093'

rule_files:
  - '/etc/prometheus/rules/alert.rules'

New container - alertmanager is added to the compose file, and the prometheus container gets a bind mount of the rules file.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    restart: unless-stopped
    user: root
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=240h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/rules/alert.rules
    expose:
      - "9090"
    labels:
      org.label-schema.group: "monitoring"

  # ALERT MANAGEMENT FOR PROMETHEUS
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    hostname: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager.yml
      - ./alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager.yml'
      - '--storage.path=/alertmanager'
    expose:
      - "9093"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Adding alertmanager to the Caddyfile of the reverse proxy so that it can be reached at https://alert.example.com. Not really necessary, but useful, as it allows sending alerts from anywhere, not just from prometheus.
Caddyfile
```
alert.{$MY_DOMAIN} {
    reverse_proxy alertmanager:9093
}
```

The basics

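Once the above setup is done, the alert about low disk space should fire and a notification email should arrive. The example rule fires whenever more than 19 bytes are free on ext4 filesystems, so it triggers almost immediately.
To switch from email to ntfy, change the receiver in the route section of alertmanager.yml.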

Useful

alert from anywhere using curl:
curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts
reload rules:
curl -X POST https://prom.example.com/-/reload
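
The alert from anywhere can also be sent from a script. This is a minimal Python sketch with the standard library. The function names are made up, and the URL is the same placeholder endpoint as in the curl example. The payload is a JSON array of alerts, each with a labels object that contains at least alertname.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_alerts_payload(alertname: str,
                         labels: Optional[Dict[str, str]] = None,
                         annotations: Optional[Dict[str, str]] = None) -> bytes:
    """Build the JSON array that alertmanager expects, holding a single alert."""
    alert = {"labels": {"alertname": alertname, **(labels or {})}}
    if annotations:
        alert["annotations"] = annotations
    return json.dumps([alert]).encode()


def send_alert(url: str, payload: bytes) -> int:
    """POST the payload as application/json and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    print(build_alerts_payload("blabla", {"severity": "critical"}).decode())
```

Usage would be send_alert("https://alert.example.com/api/v1/alerts", build_alerts_payload("blabla")), the same request the curl command makes.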

stefanprodan/dockprom has a more detailed section on alerting that is worth checking out.

Loki

Loki is made by the grafana team. Sometimes called a Prometheus for logs, it is push type monitoring, where an agent, promtail, scrapes logs and then pushes them to a Loki instance.

For docker containers there's also the option to install loki-docker-driver on the docker host, with log pushing set either globally in /etc/docker/daemon.json or per container in compose files.
But as it turns out, promtail capabilities might be missed, like its ability to add labels to the logs it scrapes based on some rule, or to process data in some way, like translating IP addresses into country names or cities.
Still, loki-docker-driver is useful for getting container logs into loki quickly and easily, with less clutter in compose files and fewer containers running.

There will be two examples of logs monitoring.
A minecraft server and a caddy reverse proxy, both docker containers.

Loki setup

New container - loki added to the compose file.
Note the port 3100 is actually mapped to the host, allowing localhost:3100 from driver to work.

docker-compose.yml

services:

  # LOG MANAGEMENT WITH LOKI
  loki:
    image: grafana/loki:2.7.3
    container_name: loki
    hostname: loki
    user: root
    restart: unless-stopped
    volumes:
      - ./loki_data:/loki
      - ./loki-docker-config.yml:/etc/loki-docker-config.yml
    command:
      - '-config.file=/etc/loki-docker-config.yml'
    ports:
      - "3100:3100"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

New file - loki-docker-config.yml bind mounted in the loki container.
The file comes from the official example, but url is changed, and compactor section is added, to have control over data retention.

loki-docker-config.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 240h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

analytics:
  reporting_enabled: false

loki-docker-driver
- Install loki-docker-driver
  docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions
  To check if it's installed and enabled: docker plugin ls
- Containers that should be monitored using loki-docker-driver need a logging section in their compose file.
  docker-compose.yml
```
services:

  whoami:
    image: "containous/whoami"
    container_name: "whoami"
    hostname: "whoami"
    logging:
      driver: "loki"
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
```
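
For a sense of what gets posted to that loki-url, here is a small Python sketch that builds the JSON body for the /loki/api/v1/push endpoint. The function name and the sample label are made up. The driver and promtail do this on their own, so this is only useful for testing a Loki instance by hand.

```python
import json
import time
from typing import Dict, List, Optional


def build_loki_push(labels: Dict[str, str], lines: List[str],
                    now_ns: Optional[int] = None) -> dict:
    """Build a push body: one stream, one [timestamp, line] pair per log line."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[timestamp, line] for line in lines],
            }
        ]
    }


if __name__ == "__main__":
    body = build_loki_push({"job": "manual_test"}, ["hello loki"])
    print(json.dumps(body, indent=2))
```

The body would be sent as a POST with Content-Type application/json. After that the line should show up in grafana's Explore section under the job label used.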

promtail

Containers that should be monitored with promtail need it added to their compose file, and made sure that it has access to the log files.

minecraft-docker-compose.yml

services:

  minecraft:
    image: itzg/minecraft-server
    container_name: minecraft
    hostname: minecraft
    restart: unless-stopped
    env_file: .env
    tty: true
    stdin_open: true
    ports:
      - 25565:25565     # minecraft server players connect
    volumes:
      - ./minecraft_data:/data

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: minecraft-promtail
    hostname: minecraft-promtail
    restart: unless-stopped
    volumes:
      - ./minecraft_data/logs:/var/log/minecraft:ro
      - ./promtail-config.yml:/etc/promtail-config.yml
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

caddy-docker-compose.yml

services:

  caddy:
    image: caddy
    container_name: caddy
    hostname: caddy
    restart: unless-stopped
    env_file: .env
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
      - ./caddy_logs:/var/log/caddy

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: caddy-promtail
    hostname: caddy-promtail
    restart: unless-stopped
    volumes:
      - ./caddy_logs:/var/log/caddy:ro
      - ./promtail-config.yml:/etc/promtail-config.yml
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Generic config file for promtail, needs to be bind mounted

promtail-config.yml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: blablabla
    static_configs:
      - targets:
          - localhost
        labels:
          job: blablabla_log
          __path__: /var/log/blablabla/*.log

Minecraft Loki example

What can be seen in this example:

How to monitor logs of a docker container, a minecraft server.
How to visualize the logs in a dashboard.
How to set an alert when a specific pattern appears in the logs.
How to extract information from log to include it in the alert notification.
Basic of grafana alert templates, so that the notification actually looks good, and shows only relevant info.

Requirements - grafana, loki, minecraft.

The Setup

Initially loki-docker-driver was used to get logs to Loki, and it was simple and worked nicely. But during the alert stage I could not figure out how to extract a string from the logs and include it in an alert notification. Specifically, to not just say that "a player joined", but to include the name of the player that joined.
The way to solve that was to switch to promtail and make use of its pipeline_stages, which was surprisingly simple and elegant.

A promtail container is added to minecraft's compose file, with bind mount access to minecraft's logs.

minecraft-docker-compose.yml

services:

  minecraft:
    image: itzg/minecraft-server
    container_name: minecraft
    hostname: minecraft
    restart: unless-stopped
    env_file: .env
    tty: true
    stdin_open: true
    ports:
      - 25565:25565     # minecraft server players connect
    volumes:
      - ./minecraft_data:/data

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: minecraft-promtail
    hostname: minecraft-promtail
    restart: unless-stopped
    volumes:
      - ./minecraft_data/logs:/var/log/minecraft:ro
      - ./promtail-config.yml:/etc/promtail-config.yml
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Promtail's config is similar to the generic config in the previous section.
The only addition is a short pipeline stage with a regex that runs against every log line before sending it to Loki. When a line matches, a label player is added to that log line. The value of that label comes from the named capture group that's part of that regex. The syntax is: (?P<name>group)
This label will be easy to use later in the alert stage.

promtail-config.yml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: minecraft
    static_configs:
      - targets:
          - localhost
        labels:
          job: minecraft_logs
          __path__: /var/log/minecraft/*.log
    pipeline_stages:
    - regex:
        expression: '.*:\s(?P<player>.*)\sjoined the game$'
    - labels:
        player:

Here's regex101 of it, with some data to show how it works and bit of explanation.
Here's the stackoverflow answer that is the source for that config.
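
The regex can also be tried out locally. Python's re module understands the same (?P<name>group) syntax, so a quick sketch like the one below shows what promtail would extract. The sample log lines are an assumption of what a minecraft server log looks like. Check against your own logs.

```python
import re
from typing import Optional

# Same expression as in promtail-config.yml above.
JOINED = re.compile(r'.*:\s(?P<player>.*)\sjoined the game$')


def extract_player(line: str) -> Optional[str]:
    """Return the player name if the line is a join message, else None."""
    match = JOINED.search(line)
    return match.group("player") if match else None


if __name__ == "__main__":
    # Hypothetical log lines for illustration.
    print(extract_player("[12:34:56] [Server thread/INFO]: Bastard joined the game"))
    print(extract_player("[12:40:02] [Server thread/INFO]: Bastard left the game"))
```

Lines that do not match get no player label and are sent to Loki unchanged.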

First steps in Grafana

In grafana, loki needs to be added as a datasource, http://loki:3100
In Explore section, filter, job = minecraft_logs, Run query button... this should result in seeing minecraft logs and their volume/time graph.

This Explore view will be recreated as a dashboard.

Dashboard minecraft_logs

New dashboard, new panel
- Data source - Loki
- Switch from builder to code
- query - count_over_time({job="minecraft_logs"} |= `` [1m])
- Transform - Rename by regex - (.*) - Logs
- Graph type - Time series
- Title - Logs volume
- Transparent background
- Legend off
- Graph styles - bar
- Fill opacity - 50
- Color scheme - single color
- Query options - Min interval=1m
- Save
Add another panel to the dashboard
- Graph type - Logs
- Data source - Loki
- Switch from builder to code
  query - {job="minecraft_logs"} |= ""
- Title - empty
- Deduplication - Signature
- Save

This should create a similar dashboard to the one in the picture above.

Performance tips for grafana loki queries

Alerts in Grafana for Loki

When a player joins minecraft server a log appears "Bastard joined the game"
Alert will be set to look for string "joined the game" and send notification when it occurs.

At this point it might be a good time to brush up on PromQL/LogQL and the data types a query returns, namely the instant vector and the range vector.
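
As a rough mental model of the query used in the alert rule in this section, here is a simplified Python sketch of what count_over_time with a line filter computes for a single log stream. It is not how Loki is implemented, and the exact handling of the window edges is an assumption. The sample entries are made up.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def count_over_time(entries: List[Tuple[datetime, str]], needle: str,
                    now: datetime, window: timedelta) -> int:
    """Count log lines containing `needle` within the last `window` before `now`."""
    start = now - window
    return sum(1 for ts, line in entries if start < ts <= now and needle in line)


if __name__ == "__main__":
    now = datetime(2023, 3, 16, 12, 0, 0)
    # Hypothetical log entries as (timestamp, line) pairs.
    entries = [
        (now - timedelta(minutes=7), "Alice joined the game"),
        (now - timedelta(minutes=2), "Bastard joined the game"),
        (now - timedelta(minutes=1), "Bastard left the game"),
    ]
    print(count_over_time(entries, "joined the game", now, timedelta(minutes=5)))
```

The selector with [5m] is the range vector, all lines from the last 5 minutes. count_over_time collapses it into one number per stream at the evaluation time, which is the instant vector that the alert condition then compares against 0.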

Create alert rule

1 Set an alert rule name
- Rule name = Minecraft-player-joined-alert
2 Set a query and alert condition
- A - Switch to Loki; set Last 5 minutes
  - switch from builder to code
  - count_over_time({job="minecraft_logs"} |= "joined the game" [5m])
- B - Reduce
  - Function = Last
  - Input = A
  - Mode = Strict
- C - Threshold
  - Input = B
  - is above 0
  - Make this the alert condition
3 Alert evaluation behavior
- Folder = "Alerts"
- Evaluation group (interval) = "five-min"
- Evaluation interval = 5m
- For 0s
- Configure no data and error handling
  - Alert state if no data or all values are null = OK
4 Add details for your alert rule
- Here is where the label player that was set in promtail is used
  Summary = {{ $labels.player }} joined the Minecraft server.
- Can also pass values from expressions by targeting A/B/C/.. from step2
  Description = Number of players that joined in the last 5 min: {{ $values.B }}
5 Notifications
- nothing
Save and exit

Contact points

New contact point
Name = ntfy
Integration = Webhook
URL = https://ntfy.example.com/grafana
Title = {{ .CommonAnnotations.summary }}
Message = I put in empty space unicode character
Disable resolved message = check
Test
Save

Notification policies

Edit default
Default contact point = ntfy
Save

After all this, there should be notification coming when a player joins.

grafana-to-ntfy

For alerts one can use ntfy but on its own the alerts from grafana are just plain text json.
Here's how to setup grafana-to-ntfy, to make the alerts look good.

Templates

Not really used here, but here are some basics, as it took embarrassingly long to find that {{ .CommonAnnotations.summary }} for the title.

Templates basic

Testing should be done in the contact point while editing it, using the handy Test button that lets you send alerts with custom values.
My big mistake when playing with this was a missing dot in the Title/Message input box of the contact point.
- correct one - {{ template "test" . }}
- the one I had - {{ template "test" }}
So yeah, the dot is important here. It represents the data and context passed to a template. It can represent the global context, or when used inside {{ range }} it represents the current iteration value, like i in a classic for loop.
This json structure is what an alert looks like. Notice alerts being an array and commonAnnotations being an object. If something is an array, there is a need to loop over it to get access to the values in it. For objects one just needs to target the value.
To iterate over alerts array.
To just access a value - {{ .CommonAnnotations.summary }}
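
The array versus object distinction is the same as in any programming language. Here is a Python sketch of it, using a heavily trimmed and hypothetical alert payload that keeps only the alerts array and the commonAnnotations object mentioned above. The real payload has many more fields.

```python
from typing import List

# Trimmed, made-up payload. Only the two parts discussed above are kept.
PAYLOAD = {
    "alerts": [
        {"annotations": {"summary": "Bastard joined the Minecraft server."}},
        {"annotations": {"summary": "Alice joined the Minecraft server."}},
    ],
    "commonAnnotations": {"summary": "Someone joined the Minecraft server."},
}


def common_summary(data: dict) -> str:
    """An object: target the value directly, no loop needed."""
    return data["commonAnnotations"]["summary"]


def alert_summaries(data: dict) -> List[str]:
    """An array: loop over it, each item plays the role of the dot inside range."""
    return [alert["annotations"]["summary"] for alert in data["alerts"]]


if __name__ == "__main__":
    print(common_summary(PAYLOAD))
    for summary in alert_summaries(PAYLOAD):
        print(summary)
```

In a template the loop variable is not named. Inside {{ range }} the dot itself becomes the current item.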

Templates resources

Caddy monitoring

A reverse proxy is kind of the linchpin of a selfhosted setup, since it is in charge of all the http/https traffic that comes in. So focusing on monitoring this keystone makes sense.

Will be using Prometheus for monitoring metrics and Loki for log files monitoring.

Requirements - grafana, prometheus, loki, caddy container

Metrics

Caddy has a built-in exporter of metrics for prometheus, so all that is needed is enabling it, having prometheus scrape it, and importing a dashboard.

Edit Caddyfile to enable metrics.

Caddyfile

{
    servers {
        metrics
    }

    admin 0.0.0.0:2019
}


a.{$MY_DOMAIN} {
    reverse_proxy whoami:80
}

Edit compose to publish 2019 port.
Likely not necessary if Caddy and Prometheus are on the same docker network, but it's nice to check if the metrics export works at <docker-host-ip>:2019/metrics

docker-compose.yml

services:

  caddy:
    image: caddy
    container_name: caddy
    hostname: caddy
    restart: unless-stopped
    env_file: .env
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
      - "2019:2019"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Edit prometheus.yml to add caddy scraping point

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'caddy'
    static_configs:
      - targets: ['caddy:2019']

In grafana import the caddy dashboard,
or make your own. caddy_reverse_proxy_upstreams_healthy shows reverse proxy upstreams, but that's all.
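
To check by hand what Caddy exposes at the metrics endpoint, a few lines of Python can pick a single metric out of the text. This is a sketch with a hand-written sample. The label name upstream and the value in the sample are assumptions, and lines with a trailing timestamp are not handled.

```python
from typing import Dict


def parse_samples(text: str, metric: str) -> Dict[str, float]:
    """Return {series: value} for every sample of `metric` in a metrics page."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the HELP / TYPE comment lines
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        if series == metric or series.startswith(metric + "{"):
            samples[series] = float(value)
    return samples


# Hand-written sample of a metrics page, for illustration only.
SAMPLE = """\
# TYPE caddy_reverse_proxy_upstreams_healthy gauge
caddy_reverse_proxy_upstreams_healthy{upstream="whoami:80"} 1
"""

if __name__ == "__main__":
    print(parse_samples(SAMPLE, "caddy_reverse_proxy_upstreams_healthy"))
```

The text itself can be fetched with curl from <docker-host-ip>:2019/metrics once the port is published.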

But these metrics are more about performance and the load put on Caddy, which in a selfhosted environment will likely be minimal and not interesting.
To get the more intriguing info of who connects to what service, when, and from where, access log monitoring is needed.

Logs

Loki will be used for logs monitoring.
Loki itself just stores them. To get the logs there, a promtail container is used that has access to caddy's logs. Its job is to scrape them regularly and push them to Loki. Once there, a basic grafana dashboard can be made.

Have Grafana, Loki, Caddy working

Edit Caddy compose, bind mount /var/log/caddy.
Add Promtail container, that also has same bind mount, along with bind mount of its config file.
Promtail will scrape logs to which it now has access and pushes them to Loki.

docker-compose.yml

services:

  caddy:
    image: caddy
    container_name: caddy
    hostname: caddy
    restart: unless-stopped
    env_file: .env
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
      - "2019:2019"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
      - /var/log/caddy:/var/log/caddy

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: caddy-promtail
    hostname: caddy-promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail-config.yml
      - /var/log/caddy:/var/log/caddy:ro
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

promtail-config.yml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: caddy
    static_configs:
      - targets:
          - localhost
        labels:
          job: caddy_access_log
          __path__: /var/log/caddy/*.log

promtail-config.yml customizing fields

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: caddy_access_log
    static_configs:
    - targets: # tells promtail to look for the logs on the current machine/host
        - localhost
      labels:
        job: caddy_access_log
        __path__: /var/log/caddy/*.log
    pipeline_stages:
      # Extract all the fields I care about from the
      # message:
      - json:
          expressions:
            "level": "level"
            "timestamp": "ts"
            "duration": "duration"
            "response_status": "status"
            "request_path": "request.uri"
            "request_method": "request.method"
            "request_host": "request.host"
            "request_useragent": "request.headers.\"User-Agent\""
            "request_remote_ip": "request.remote_ip"

      # Promote the level into an actual label:
      - labels:
          level:

      # Regenerate the message as all the fields listed
      # above:
      - template:
          # This is a field that doesn't exist yet, so it will be created
          source: "output"
          template: |
                        {{toJson (unset (unset (unset . "Entry") "timestamp") "filename")}}                        
      - output:
          source: output

      # Set the timestamp of the log entry to what's in the
      # timestamp field.
      - timestamp:
          source: "timestamp"
          format: "Unix"
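
To see what the json stage above pulls out of a log line, here is a Python sketch doing the same field extraction. The sample access log line is shortened and made up, so treat the exact shape as an assumption and compare it with a real line from caddy_access.log.

```python
import json
from typing import Any, Dict, Tuple

# Same mapping as the json stage: new field name -> path in the log entry.
FIELDS: Dict[str, Tuple[str, ...]] = {
    "level": ("level",),
    "timestamp": ("ts",),
    "duration": ("duration",),
    "response_status": ("status",),
    "request_path": ("request", "uri"),
    "request_method": ("request", "method"),
    "request_host": ("request", "host"),
    "request_useragent": ("request", "headers", "User-Agent"),
    "request_remote_ip": ("request", "remote_ip"),
}


def dig(obj: Any, path: Tuple[str, ...]) -> Any:
    """Walk nested dicts along `path`, returning None if a key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def extract(line: str) -> Dict[str, Any]:
    """Extract the fields of interest from one JSON access log line."""
    entry = json.loads(line)
    return {name: dig(entry, path) for name, path in FIELDS.items()}


# Shortened, hypothetical access log line.
SAMPLE = (
    '{"level":"info","ts":1678929000.5,"msg":"handled request",'
    '"request":{"remote_ip":"192.0.2.10","method":"GET",'
    '"host":"ntfy.example.com","uri":"/",'
    '"headers":{"User-Agent":["curl/7.88.1"]}},'
    '"duration":0.0012,"status":200}'
)

if __name__ == "__main__":
    for name, value in extract(SAMPLE).items():
        print(name, "=", value)
```

Only level is promoted to a label in the config above. The rest of the fields end up in the rewritten log line.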

Edit Caddyfile to enable access logs. Unfortunately this can't be enabled globally, so the easiest way seems to be to create a logging snippet and copy paste the import line into every site block.

Caddyfile

(log_common) {
  log {
    output file /var/log/caddy/caddy_access.log
  }
}

ntfy.example.com {
  import log_common
  reverse_proxy ntfy:80
}

mealie.{$MY_DOMAIN} {
  import log_common
  reverse_proxy mealie:80
}

At this point the logs should be visible and explorable in grafana
Explore > {job="caddy_access_log"} |= "" | json

dashboard

New panel, a time series graph showing logs volume over time
- Data source = Loki
- switch from builder to code
  sum(count_over_time({job="caddy_access_log"} |= "" | json [1m])) by (request_host)
- Transform > Rename by regex > Match = \{request_host="(.*)"\}; Replace = $1
- Query options > Min interval = 1m
- Graph type = Time series
- Title = "Access timeline"
- Transparent
- Tooltip mode = All
- Tooltip values sort order = Descending
- Legend Placement = Right
- Value = Total
- Graph style = Bars
- Fill opacity = 50
Add another panel, a pie chart showing how requests divide among subdomains
- Data source = Loki
- switch from builder to code
  sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_host)
- Transform > Rename by regex > Match = \{request_host="(.*)"\}; Replace = $1
- Graph type = Pie chart
- Title = "Subdomains divide"
- Transparent
- Legend Placement = Right
- Value = Total
- Graph style = Bars
Add another panel, this will be the actual log view
- Graph type - Logs
- Data source - Loki
- Switch from builder to code
- query - {job="caddy_access_log"} |= "" | json
- Title - empty
- Deduplication - Signature
- Save
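
The grouping that both graph queries do, a count of log lines by request_host, can be reproduced offline to sanity check the dashboard numbers. A minimal Python sketch follows, with made-up log lines and an assumed function name. In LogQL the json parser flattens request.host into the request_host label, here the nested key is read directly.

```python
import json
from collections import Counter
from typing import Iterable


def requests_per_host(lines: Iterable[str]) -> Counter:
    """Count access log lines per request host, skipping unparsable lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        host = entry.get("request", {}).get("host")
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, heavily shortened access log lines.
    sample = [
        '{"request": {"host": "ntfy.example.com"}, "status": 200}',
        '{"request": {"host": "mealie.example.com"}, "status": 200}',
        '{"request": {"host": "ntfy.example.com"}, "status": 404}',
        'not json at all',
    ]
    for host, count in requests_per_host(sample).most_common():
        print(host, count)
```

Run against a copy of caddy_access.log, the totals should roughly match the pie chart for the same time range.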

useful resources

https://www.youtube.com/watch?v=UtmmhLraSnE

Geoip

to-do

Update

Manual image update:

docker-compose pull
docker-compose up -d
docker image prune

Backup and restore

Backup

Using borg, which makes a daily snapshot of the entire directory.

Restore

bring down the prometheus containers - docker-compose down
delete the entire prometheus directory
copy the prometheus directory back from the backup
start the containers - docker-compose up -d
