Prometheus+Grafana in docker
guide-by-example
WORK IN PROGRESS
Loki and caddy monitoring parts are not finished yet
Purpose
Monitoring of the host and the running containers.
Monitoring in this case means gathering and showing information on how services,
machines, or containers are running. That can be cpu, io, ram, disk use...
it can be the number of http requests, errors, or the results of backups.
Prometheus deals with metrics. Loki deals with logs. Grafana is there to show
the data on a dashboard.
A lot of the prometheus stuff here is based on the magnificent stefanprodan/dockprom.
Chapters
- Core prometheus+grafana - nice dashboards with metrics of docker host and containers
- Pushgateway - push data to prometheus from anywhere
- Alertmanager - setting alerts and getting notifications
- Loki - all of the above but for log files
- Caddy monitoring - monitoring a reverse proxy
Overview
Good youtube overview of Prometheus.
Prometheus is an open source system for monitoring and alerting,
written in golang.
It periodically collects metrics from configured targets,
makes these metrics available for visualization, and can trigger alerts.
Prometheus is a relatively young project and it is a pull-type monitoring system.
- Prometheus Server is the core of the system, responsible for
- pulling new metrics
- storing the metrics in a database and evaluating them
- making metrics available through the PromQL API (a quick query example follows this list)
- Targets - machines, services, applications that are monitored.
  These need to have an exporter.
  - exporter - a script or a service that gathers metrics on the target,
    converts them to prometheus server format,
    and exposes them at an endpoint so they can be pulled
- Alertmanager - responsible for handling alerts from Prometheus Server, and sending notifications through email, slack, pushover,.. In this setup ntfy webhook will be used.
- pushgateway - allows a push type of monitoring, meaning a machine anywhere in the world can push data into your prometheus. Should not be overused, as it goes against the pull philosophy of prometheus.
- Grafana - for web UI visualization of the collected metrics
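Not part of the setup yet, but to illustrate the PromQL API mentioned above - once the stack from this guide is running, prometheus can be queried directly from the command line. A minimal sketch; the up metric is built in, and wget is used because the prometheus image is busybox based and has no curl.

# ask prometheus which scrape targets are up (value 1 = up, 0 = down)
docker exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=up'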
Files and directory structure
/home/
└── ~/
└── docker/
└── prometheus/
├── 🗁 grafana_data/
├── 🗁 prometheus_data/
├── 🗋 docker-compose.yml
├── 🗋 .env
└── 🗋 prometheus.yml
- grafana_data/ - a directory where grafana stores its data
- prometheus_data/ - a directory where prometheus stores its database and data
- .env - a file containing environment variables for docker compose
- docker-compose.yml - a docker compose file, telling docker how to run the containers
- prometheus.yml - a configuration file for prometheus
The three files must be provided.
The directories are created by docker compose on the first run.
docker-compose
- Prometheus - The official image is used. A few extra command arguments pass configuration; of note is the 240 hours (10 days) retention policy.
- Grafana - The official image is used. A bind mounted directory is used for persistent data storage. User is set to root, as it solves permission issues I am lazy to investigate.
- NodeExporter - An exporter for linux machines,
  in this case gathering the metrics of the linux machine running docker,
  like uptime, cpu load, memory use, network bandwidth use, disk space,...
  Also bind mounts of some system directories to have access to the required info.
- cAdvisor - An exporter for gathering docker containers metrics,
  showing cpu, memory, network use of each container.
  Runs in privileged mode and has some bind mounts of system directories
  to have access to the required info.

Note - ports are only exposed, not published, since the expectation is the use
of a reverse proxy and accessing the services by hostname, not by ip and port.
docker-compose.yml
services:
# MONITORING SYSTEM AND THE METRICS DATABASE
prometheus:
image: prom/prometheus:v2.42.0
container_name: prometheus
hostname: prometheus
user: root
restart: unless-stopped
depends_on:
- cadvisor
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=240h'
- '--web.enable-lifecycle'
volumes:
- ./prometheus_data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml
expose:
- "9090"
labels:
org.label-schema.group: "monitoring"
# WEB BASED UI VISUALISATION OF METRICS
grafana:
image: grafana/grafana:9.4.3
container_name: grafana
hostname: grafana
user: root
restart: unless-stopped
env_file: .env
volumes:
- ./grafana_data:/var/lib/grafana
expose:
- "3000"
labels:
org.label-schema.group: "monitoring"
# HOST LINUX MACHINE METRICS EXPORTER
nodeexporter:
image: prom/node-exporter:v1.5.0
container_name: nodeexporter
hostname: nodeexporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
expose:
- "9100"
labels:
org.label-schema.group: "monitoring"
# DOCKER CONTAINERS METRICS EXPORTER
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.1
container_name: cadvisor
hostname: cadvisor
restart: unless-stopped
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
expose:
- "3000"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
.env
# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava
# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa
All containers must be on the same network, which is named in the .env file.
If one does not exist yet: docker network create caddy_net
prometheus.yml
Contains the bare minimum setup of targets from where metrics are to be pulled.
prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'nodeexporter'
static_configs:
- targets: ['nodeexporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
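Optional, but before starting the stack the config can be syntax checked with promtool, which ships inside the prometheus image, so nothing extra needs installing. A sketch assuming the file is in the current directory.

# validate prometheus.yml with the same image version used in the compose file
docker run --rm -v "$(pwd)/prometheus.yml:/prometheus.yml:ro" \
  --entrypoint promtool prom/prometheus:v2.42.0 check config /prometheus.yml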
Reverse proxy
Caddy v2 is used, details
here.
Caddyfile
graf.{$MY_DOMAIN} {
reverse_proxy grafana:3000
}
prom.{$MY_DOMAIN} {
reverse_proxy prometheus:9090
}
First run and Grafana configuration
- login admin/admin to graf.example.com, change the password
- add Prometheus as a Data source in configuration,
  set URL to http://prometheus:9090
- import dashboards from the json files in this repo
These dashboards are the preconfigured ones from stefanprodan/dockprom
with a few changes.
docker_host.json did not show free disk space for me,
I had to change fstype from aufs to ext4
(a query to check which fstype your host reports is shown after the list below).
Also included is a fix for host network monitoring not showing traffic.
In all of them the default time interval is set to 1h instead of 15m.
- docker_host.json - dashboard showing linux host machine metrics
- docker_containers.json - dashboard showing docker containers metrics,
  except the ones labeled as monitoring in the compose file
- monitoring_services.json - dashboard showing docker containers metrics
  of the containers that are labeled monitoring
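To check which fstype values your host actually reports (the reason for the aufs to ext4 change mentioned above), the metric can be queried directly; a small sketch, assuming the stack is running.

# list the filesystem types node_exporter sees, pick the one to use in the dashboards
docker exec prometheus wget -qO- \
  'http://localhost:9090/api/v1/query?query=count(node_filesystem_size_bytes)%20by%20(fstype)'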
Pushgateway
Gives the freedom to push information into prometheus from anywhere.
The setup
To add pushgateway functionality to the current stack:
- New container - pushgateway - added to the compose file.

docker-compose.yml

services:

  # PUSHGATEWAY FOR PROMETHEUS
  pushgateway:
    image: prom/pushgateway:v1.5.1
    container_name: pushgateway
    hostname: pushgateway
    restart: unless-stopped
    command:
      - '--web.enable-admin-api'
    expose:
      - "9091"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true
- Adding pushgateway to the Caddyfile of the reverse proxy so that it can be reached
  at https://push.example.com

Caddyfile

push.{$MY_DOMAIN} {
    reverse_proxy pushgateway:9091
}
- Adding pushgateway's scrape point to prometheus.yml

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'pushgateway-scrape'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']
The basics
To test pushing some metric, execute in linux:
echo "some_metric 3.14" | curl --data-binary @- https://push.example.com/metrics/job/blabla/instance/whatever
You can see labels being set for the pushed metric in the URL path.
The label job is required, but after that it's whatever you want,
though use of the instance label is customary.
Now in grafana, in the Explore section, you should see some results
when querying for some_metric.
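A slightly richer push, closer to real use - several metrics in one request, with TYPE lines so prometheus treats them as gauges. The metric names and the job/instance labels here are just made up for illustration.

# push multiple metrics at once; job and instance in the path become labels
cat <<EOF | curl --data-binary @- https://push.example.com/metrics/job/backup/instance/docker_host
# TYPE backup_duration_seconds gauge
backup_duration_seconds 42.7
# TYPE backup_last_run_timestamp_seconds gauge
backup_last_run_timestamp_seconds $(date +%s)
EOF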
The metrics sit on the pushgateway forever, unless deleted or the container shuts down. Prometheus will not remove the metrics after scraping; it keeps scraping the pushgateway and stores whatever value is there, with the timestamp of each scrape.
To wipe the pushgateway clean
curl -X PUT https://push.example.com/api/v1/admin/wipe
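A single group can also be deleted instead of wiping everything, by sending DELETE to the same path that was used for pushing - here the test metric from above.

# remove only the metrics pushed under job=blabla, instance=whatever
curl -X DELETE https://push.example.com/metrics/job/blabla/instance/whatever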
More on pushgateway setup, with the real world use to monitor backups,
along with pushing metrics from windows in powershell -
Veeam Prometheus Grafana
Alertmanager
To send a notification when some metric breaches a preset condition.
Notification channels set up here will be email and ntfy.
The setup
To add alertmanager to the current stack:
- New file - alertmanager.yml - will be bind mounted in the alertmanager container.
  This file contains the configuration on how and where to deliver alerts.

alertmanager.yml

route:
  receiver: 'email'

receivers:
  - name: 'ntfy'
    webhook_configs:
      - url: 'https://ntfy.example.com/alertmanager'
        send_resolved: true
  - name: 'email'
    email_configs:
      - to: 'whoever@example.com'
        from: 'alertmanager@example.com'
        smarthost: smtp-relay.sendinblue.com:587
        auth_username: '<registration_email@gmail.com>'
        auth_identity: '<registration_email@gmail.com>'
        auth_password: '<long ass generated SMTP key>'
- New file - alert.rules - will be mounted into the prometheus container.
  This file defines when the value of some metric becomes an alert event.

alert.rules

groups:
  - name: host
    rules:
      - alert: DiskSpaceLow
        expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19
        for: 10s
        labels:
          severity: critical
        annotations:
          description: "Diskspace is low!"
- Changed prometheus.yml.
  Added an alerting section that points at the alertmanager container,
  and also set a path to a rules file.

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - '/etc/prometheus/rules/alert.rules'
- New container - alertmanager - added to the compose file,
  and the prometheus container gets a bind mount of the rules file added.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    restart: unless-stopped
    user: root
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=240h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/rules/alert.rules
    expose:
      - "9090"
    labels:
      org.label-schema.group: "monitoring"

  # ALERT MANAGEMENT FOR PROMETHEUS
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    hostname: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager.yml
      - ./alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager.yml'
      - '--storage.path=/alertmanager'
    expose:
      - "9093"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true
- Adding alertmanager to the Caddyfile of the reverse proxy so that it can be reached
  at https://alert.example.com.
  Not really necessary, but useful as it allows sending alerts from anywhere,
  not just from prometheus.

Caddyfile

alert.{$MY_DOMAIN} {
    reverse_proxy alertmanager:9093
}
The basics
Once the above setup is done, an alert about low disk space should fire
and a notification email should arrive.
In alertmanager.yml the route can then be switched from email to ntfy.
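Before restarting the stack, both new files can be syntax checked - amtool ships in the alertmanager image and promtool in the prometheus image. A sketch assuming the files are in the current directory.

# check alertmanager.yml
docker run --rm -v "$(pwd)/alertmanager.yml:/alertmanager.yml:ro" \
  --entrypoint amtool prom/alertmanager:v0.25.0 check-config /alertmanager.yml

# check the alert rules
docker run --rm -v "$(pwd)/alert.rules:/alert.rules:ro" \
  --entrypoint promtool prom/prometheus:v2.42.0 check rules /alert.rules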
Useful
- alert from anywhere using curl:
curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts
- reload rules:
curl -X POST https://prom.example.com/-/reload
stefanprodan/dockprom has a more detailed section on alerting that is worth checking out.
Loki
Loki is made by the grafana team. It's often referred to as Prometheus for logs.
It is a push type of monitoring, where an agent - promtail -
pushes logs on to a Loki instance.
For docker containers there's also the option to install loki-docker-driver
on a docker host; log pushing is then set either globally in /etc/docker/daemon.json
or per container in compose files.
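For reference, a sketch of what the global variant in /etc/docker/daemon.json could look like - it sends logs of every container on the host to Loki and needs a docker daemon restart to take effect; the batch size option is optional and just an example.

/etc/docker/daemon.json

{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://localhost:3100/loki/api/v1/push",
    "loki-batch-size": "400"
  }
}

This guide uses the per-container approach in the compose files, shown later.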
There will be two examples.
A minecraft server and a caddy reverse proxy, both docker containers.
The setup
To add Loki to the current stack:
- New container - loki - added to the compose file.
  Note that port 3100 is actually published to the host,
  allowing localhost:3100 from the driver to work -
  a quick curl check after this setup list uses it too.

docker-compose.yml

services:

  # LOG MANAGEMENT WITH LOKI
  loki:
    image: grafana/loki:2.7.3
    container_name: loki
    hostname: loki
    user: root
    restart: unless-stopped
    volumes:
      - ./loki_data:/loki
      - ./loki-docker-config.yml:/etc/loki-docker-config.yml
    command:
      - '-config.file=/etc/loki-docker-config.yml'
    ports:
      - "3100:3100"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true
- New file - loki-docker-config.yml - bind mounted in the loki container.
  The file comes from the official example, but the url is changed
  and a compactor section is added, to have control over data retention.

loki-docker-config.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 240h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

analytics:
  reporting_enabled: false
- Install loki-docker-driver

  docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

  To check if it's installed and enabled: docker plugin ls
- Containers that should be monitored need a logging section in their compose.

docker-compose.yml

services:

  whoami:
    image: "containous/whoami"
    container_name: "whoami"
    hostname: "whoami"
    logging:
      driver: "loki"
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
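To confirm Loki itself is up and receiving data, its HTTP API can be poked from the docker host, since port 3100 is published in the compose above; a minimal check.

# should answer "ready" once Loki has started
curl -s http://localhost:3100/ready

# lists label names Loki has indexed; container_name should show up once logs flow in
curl -s http://localhost:3100/loki/api/v1/labels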
Minecraft example
Loki will be used to monitor logs of a minecraft server.
A dashboard will be created, showing logs volume in time.
Alert will be set to send a notification when a player joins.
Requirements - grafana, loki, loki-docker-driver, minecraft with logging set in compose
First steps
- In grafana, loki needs to be added as a datasource - http://loki:3100
- In the Explore section, filter by container_name = minecraft, run the query...
  this should result in seeing the minecraft logs and their volume/time graph.
This Explore view will be recreated as a dashboard.
Dashboard minecraft_logs
- New dashboard, new panel
- Data source - Loki
- Switch from builder to code
- query - count_over_time({container_name="minecraft"} |= `` [1m])
- Transform - Rename by regex - Match = (.*) ; Replace = Logs
- Graph type - Time series
- Title - Logs volume
- Transparent background
- Legend off
- Graph styles - bar
- Fill opacity - 50
- Color scheme - single color
- Query options - Min interval = 1m
- Save
- Add another panel to the dashboard
- Graph type - Logs
- Data source - Loki
- Switch from builder to code
- query - {container_name="minecraft"} |= ""
- Title - empty
- Deduplication - Signature
- Save
This should create a similar dashboard to the one in the picture above.
Performance tips for grafana loki queries
Alerts in Grafana for Loki
When a player joins the minecraft server, a log line appears - "Bastard joined the game".
An alert will be set to look for the string "joined the game" and send a notification
when it occurs.
Grafana alert rules are built around a Query and Expressions,
and each and every one has to result in a simple number or a true or false condition.
Create alert rule
- 1 Set an alert rule name
- Rule name = Minecraft-player-joined-alert
- 2 Set a query and alert condition
- A - Switch to Loki; set Last 5 minutes
- switch from builder to code
count_over_time({container_name="minecraft"} |= "joined the game" [5m])
- B - Reduce
- Function = Last
- Input = A
- Mode = Strict
- C - Threshold
- Input = B
- is above 0
- Make this the alert condition
- 3 Alert evaluation behavior
- Folder = "Alerts"
- Evaluation group (interval) = "five-min"
- Evaluation interval = 5m
- For 0s
- Configure no data and error handling
- Alert state if no data or all values are null = OK
- 4 Add details for your alert rule
- Can pass values from logs to alerts, by targeting A/B/C/.. expressions from step 2.
- Summary =
Number of players joined: {{ $values.B }}
- Maybe one day I will figure out how to pull the player's name from the log
  and pass it to the alert. So far I got this
  .*:\s(?P<player>.*)\sjoined the game$
  and a full query, but I don't know how to reference the named regex group
  in the alert's 4th section.
  And the grafana forum is kinda a big black hole of unanswered questions.
- 5 Notifications
- nothing
- Save and exit
Contact points
- New contact point
- Name = ntfy
- Integration = Webhook
- URL = https://ntfy.example.com/grafana
- Disable resolved message = check
- Test
- Save
Notification policies
- Edit default
- Default contact point = ntfy
- Save
After all this, there should be notification coming when a player joins.
Caddy monitoring
A reverse proxy is kinda the linchpin of a selfhosted setup, since it's in charge of all the http/https traffic that goes in. So focusing monitoring on this keystone makes sense.
Prometheus will be used for metrics monitoring and Loki for log files monitoring.
Requirements - grafana, prometheus, loki, caddy container
Metrics
Caddy has a built-in exporter of metrics for prometheus, so all that is needed is enabling it, scraping it with prometheus, and importing a dashboard.
- Edit the Caddyfile to enable metrics.

Caddyfile

{
    servers {
        metrics
    }
    admin 0.0.0.0:2019
}

a.{$MY_DOMAIN} {
    reverse_proxy whoami:80
}
- Edit the compose file to publish port 2019.
  Likely not necessary if Caddy and Prometheus are on the same docker network,
  but it's nice to check that the metrics export works
  at <docker-host-ip>:2019/metrics - see the curl check after this list.

docker-compose.yml

services:

  caddy:
    image: caddy
    container_name: caddy
    hostname: caddy
    restart: unless-stopped
    env_file: .env
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
      - "2019:2019"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true
- Edit prometheus.yml to add a caddy scraping point.

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'caddy'
    static_configs:
      - targets: ['caddy:2019']
- In grafana import the caddy dashboard, or make your own;
  caddy_reverse_proxy_upstreams_healthy shows reverse proxy upstreams, but that's all.
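The quick check mentioned in the second step - with port 2019 published, the raw metrics endpoint can be curled from anywhere that can reach the docker host.

# should print a long list of caddy_* metrics in prometheus text format
curl -s http://<docker-host-ip>:2019/metrics | grep caddy_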
But these metrics are more about the performance and load put on Caddy,
which in a selfhosted environment will likely be minimal and not interesting.
To get more intriguing info - who, when, from where, connects to what service,..
for that, access logs monitoring is needed.
Logs
Loki will be used for logs monitoring.
Loki itself just stores the logs; to get them there, a promtail container will be used.
It has access to caddy's logs, and its job is to scrape them regularly
and push them to Loki. Once there, a basic grafana dashboard can be made.
-
Have Grafana, Loki, Caddy working
- Edit the Caddy compose file - bind mount /var/log/caddy.
  Add a Promtail container that has the same bind mount,
  along with a bind mount of its config file.
  Promtail will scrape the logs to which it now has access and push them to Loki.

docker-compose.yml

services:

  caddy:
    image: caddy
    container_name: caddy
    hostname: caddy
    restart: unless-stopped
    env_file: .env
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
      - "2019:2019"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
      - /var/log/caddy:/var/log/caddy

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: caddy-promtail
    hostname: caddy-promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail-config.yml
      - /var/log/caddy:/var/log/caddy:ro
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

promtail-config.yml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: caddy
    static_configs:
      - targets:
          - localhost
        labels:
          job: caddy_access_log
          __path__: /var/log/caddy/*.log

promtail-config.yml customizing fields

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: caddy_access_log
    static_configs:
      - targets: # tells promtail to look for the logs on the current machine/host
          - localhost
        labels:
          job: caddy_access_log
          __path__: /var/log/caddy/*.log
    pipeline_stages:
      # Extract all the fields I care about from the message:
      - json:
          expressions:
            "level": "level"
            "timestamp": "ts"
            "duration": "duration"
            "response_status": "status"
            "request_path": "request.uri"
            "request_method": "request.method"
            "request_host": "request.host"
            "request_useragent": "request.headers.\"User-Agent\""
            "request_remote_ip": "request.remote_ip"
      # Promote the level into an actual label:
      - labels:
          level:
      # Regenerate the message as all the fields listed above:
      - template:
          # This is a field that doesn't exist yet, so it will be created
          source: "output"
          template: |
            {{toJson (unset (unset (unset . "Entry") "timestamp") "filename")}}
      - output:
          source: output
      # Set the timestamp of the log entry to what's in the timestamp field:
      - timestamp:
          source: "timestamp"
          format: "Unix"
- Edit the Caddyfile to enable access logs. Unfortunately this can't be enabled
  globally, so the easiest way seems to be to create a logging snippet
  and copy paste the import line into every site block.

Caddyfile

(log_common) {
    log {
        output file /var/log/caddy/caddy_access.log
    }
}

ntfy.example.com {
    import log_common
    reverse_proxy ntfy:80
}

mealie.{$MY_DOMAIN} {
    import log_common
    reverse_proxy mealie:80
}
- At this point logs should be visible and explorable in grafana.
  Explore > {job="caddy_access_log"} |= "" | json
  A quick check against Loki's API from the host is shown after this list.
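The check mentioned in the last step - Loki can also be queried directly over the published port 3100, without going through grafana; the limit parameter just keeps the output short.

# pull the last few caddy access log lines straight from Loki's API
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="caddy_access_log"}' \
  --data-urlencode 'limit=5'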
Dashboard
- New panel, will be a time series graph showing logs volume in time
- Data source = Loki
- switch from builder to code
  sum(count_over_time({job="caddy_access_log"} |= "" | json [1m])) by (request_host)
- Transform > Rename by regex > Match = \{request_host="(.*)"\} ; Replace = $1
- Query options > Min interval = 1m
- Graph type = Time series
- Title = "Access timeline"
- Transparent
- Tooltip mode = All
- Tooltip values sort order = Descending
- Legend placement = Right
- Value = Total
- Graph style = Bars
- Fill opacity = 50
- Add another panel, will be a pie chart showing the subdomains divide
- Data source = Loki
- switch from builder to code
  sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_host)
- Transform > Rename by regex > Match = \{request_host="(.*)"\} ; Replace = $1
- Graph type = Pie chart
- Title = "Subdomains divide"
- Transparent
- Legend placement = Right
- Value = Total
- Graph style = Bars
- Add another panel, this will be the actual log view
- Graph type - Logs
- Data source - Loki
- Switch from builder to code
- query - {job="caddy_access_log"} |= "" | json
- Title - empty
- Deduplication - Signature
- Save
Geoip
to-do
Update
Manual image update:
docker-compose pull
docker-compose up -d
docker image prune
Backup and restore
Backup
Using borg, which makes a daily snapshot of the entire directory.
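For illustration, a sketch of what such a snapshot could look like - the repository path is hypothetical and the repo has to be initialized first with borg init.

# archive the whole prometheus directory, archive name carries the current date
borg create --stats /mnt/backup/borg_repo::prometheus-{now:%Y-%m-%d} ~/docker/prometheus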
Restore
- down the prometheus containers
docker-compose down
- delete the entire prometheus directory
- from the backup copy back the prometheus directory
- start the containers
docker-compose up -d