# Prometheus + Grafana + Loki in docker
###### guide-by-example
![logo](https://i.imgur.com/q41QfyI.png)
# Purpose
Monitoring of machines, containers, services, logs, ...
* [Official Prometheus](https://prometheus.io/)
* [Official Grafana](https://grafana.com/)
* [Official Loki](https://grafana.com/oss/loki/)
Monitoring in this case means gathering and showing information on how services
or machines or containers are running.<br>
Can be cpu, io, ram, disk use... can be number of http requests, errors,
results of backups, or a world map showing the location of IP addresses
that access your services.<br>
Prometheus deals with **metrics**. Loki deals with **logs**.
Grafana is there to show the data on **dashboards**.
Most of the prometheus stuff here is based off the magnificent
[**stefanprodan/dockprom**](https://github.com/stefanprodan/dockprom).
# Chapters
* **[Core prometheus+grafana](#Overview)** - nice dashboards with metrics
of a docker host and containers
* **[Pushgateway](#Pushgateway)** - push data to prometheus from anywhere
* **[Alertmanager](#Alertmanager)** - setting alerts and getting notifications
* **[Loki](#Loki)** - prometheus for logs
* **[Minecraft Loki example](#minecraft-loki-example)** - logs, grafana alerts
and templates
* **[Caddy reverse proxy monitoring](#caddy-reverse-proxy-monitoring)** -
metrics, logs and geoip map
![dashboards_pic](https://i.imgur.com/ZmyP0T8.png)
# Overview
[Good youtube overview](https://youtu.be/h4Sl21AKiDg) of Prometheus.<br>
Prometheus is an open source system for monitoring and alerting,
written in golang.<br>
It periodically collects **metrics** from configured **targets**,
makes these metrics available for visualization, and can trigger **alerts**.<br>
Prometheus is a relatively young project, and it is **pull type** monitoring.
[Glossary.](https://prometheus.io/docs/introduction/glossary/)
* **Prometheus Server** is the core of the system, responsible for
* pulling new metrics
* storing the metrics in a database and evaluating them
* making metrics available through PromQL API
* **Targets** - machines, services, applications that are monitored.<br>
These need to have an **exporter**.
* **exporter** - a script or a service that gathers metrics on the target,
converts them to the prometheus server format,
and exposes them at an endpoint so they can be pulled;
a peek at what this format looks like is shown below the diagram
* **Alertmanager** - responsible for handling alerts from Prometheus Server,
and sending notifications through email, slack, pushover,..
**In this setup the [ntfy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
webhook will be used.**
* **pushgateway** - allows a push type of monitoring, meaning a machine anywhere
in the world can push data into your prometheus. It should not be overused,
as it goes against the pull philosophy of prometheus.
* **Grafana** - for web UI visualization of the collected metrics
![prometheus components](https://i.imgur.com/AxJCg8C.png)
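To get a feel for what an exporter actually serves, the node-exporter endpoint can be
peeked at once the stack from the sections below is up. A minimal sketch, assuming the
container names used in this guide; the sample output lines are illustrative:
```bash
# peek at node-exporter's metrics in the prometheus text exposition format
# (busybox wget is available inside the prom/prometheus image)
docker exec prometheus wget -qO- http://nodeexporter:9100/metrics | head -n 20

# typical lines - plain text, HELP/TYPE hints, labels in curly braces:
# # HELP node_load1 1m load average.
# # TYPE node_load1 gauge
# node_load1 0.27
```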
# Files and directory structure
```
/home/
└── ~/
└── docker/
└── monitoring/
├── 🗁 grafana_data/
├── 🗁 prometheus_data/
├── 🗋 docker-compose.yml
├── 🗋 .env
└── 🗋 prometheus.yml
```
* `grafana_data/` - a directory where grafana stores its data
* `prometheus_data/` - a directory where prometheus stores its database and data
* `.env` - a file containing environment variables for docker compose
* `docker-compose.yml` - a docker compose file, telling docker how to run the containers
* `prometheus.yml` - a configuration file for prometheus
The three files must be provided.<br>
The directories are created by docker compose on the first run.
# docker-compose
* **Prometheus** - The official image is used, with a few extra commands passing configuration.
Of note is the 240 hours (10 days) **retention** policy.
* **Grafana** - The official image is used, with a bind mounted directory
for persistent data storage. The user is set **as root**, as it solves issues I am
too lazy to investigate, likely caused by me editing some files as root.
* **NodeExporter** - An exporter for linux machines,
in this case gathering the **metrics** of the docker host,
like uptime, cpu load, memory use, network bandwidth use, disk space,...<br>
Also **bind mounts** of some system directories to have access to the required info.
* **cAdvisor** - An exporter for gathering docker **containers** metrics,
showing cpu, memory, network use of **each container**<br>
Runs in `privileged` mode and has some bind mounts of system directories
to have access to the required info.
*Note* - ports are only `expose`d, since the expectation is use of a reverse proxy
and accessing the services by hostname, not ip and port.
`docker-compose.yml`
```yml
services:
# MONITORING SYSTEM AND THE METRICS DATABASE
prometheus:
image: prom/prometheus:v2.42.0
container_name: prometheus
hostname: prometheus
user: root
restart: unless-stopped
depends_on:
- cadvisor
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=240h'
- '--web.enable-lifecycle'
volumes:
- ./prometheus_data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml
expose:
- "9090"
labels:
org.label-schema.group: "monitoring"
# WEB BASED UI VISUALISATION OF METRICS
grafana:
image: grafana/grafana:9.4.3
container_name: grafana
hostname: grafana
user: root
restart: unless-stopped
env_file: .env
volumes:
- ./grafana_data:/var/lib/grafana
expose:
- "3000"
labels:
org.label-schema.group: "monitoring"
# HOST LINUX MACHINE METRICS EXPORTER
nodeexporter:
image: prom/node-exporter:v1.5.0
container_name: nodeexporter
hostname: nodeexporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
expose:
- "9100"
labels:
org.label-schema.group: "monitoring"
# DOCKER CONTAINERS METRICS EXPORTER
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.1
container_name: cadvisor
hostname: cadvisor
restart: unless-stopped
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro # doesn't work on MacOS, only on Linux
expose:
- "3000"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
`.env`
```bash
# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava
# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa
```
**All containers must be on the same network**.<br>
Which is named in the `.env` file.<br>
If one does not exist yet: `docker network create caddy_net`
## prometheus.yml
[Official documentation.](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
Contains the bare minimum settings of targets from where metrics are to be pulled.
`prometheus.yml`
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'nodeexporter'
static_configs:
- targets: ['nodeexporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
```
## Reverse proxy
Caddy v2 is used, details
[here](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/caddy_v2).<br>
`Caddyfile`
```php
graf.{$MY_DOMAIN} {
reverse_proxy grafana:3000
}
prom.{$MY_DOMAIN} {
reverse_proxy prometheus:9090
}
```
## First run and Grafana configuration
* Login **admin/admin** at `graf.example.com`, change the password.
* **Add** Prometheus as a **Data source** in Configuration<br>
Set **URL** to `http://prometheus:9090`<br>
* **Import** dashboards from the [json files in this repo](dashboards/)<br>
These **dashboards** are the preconfigured ones from
[stefanprodan/dockprom](https://github.com/stefanprodan/dockprom)
with a **few changes**.<br>
The **Docker host** dashboard did not show free disk space for me, **had to change fstype**
from `aufs` to `ext4`.
Also included is [a fix](https://github.com/stefanprodan/dockprom/issues/18#issuecomment-487023049)
for **host network monitoring** not showing traffic. In all of them
the default time interval is set to 1h instead of 15m
* **docker_host.json** - dashboard showing linux docker host metrics
* **docker_containers.json** - dashboard showing docker containers metrics,
except the ones labeled as `monitoring` in the compose file
* **monitoring_services.json** - dashboard showing docker containers metrics
of containers that are labeled `monitoring`
![interface-pic](https://i.imgur.com/wzwgBkp.png)
---
---
# PromQL basics
My understanding of this shit..
* Prometheus stores **metrics**, each metric has a name, like `cpu_temp`.
* the metrics values are stored as **time series** - just simple timestamped values<br>
`[43 @1684608467][41 @1684608567][48 @1684608667]`.
* This metric has **labels** `[name="server-19", state="idle", city="Frankfurt"]`.<br>
These allow far better **targeting** of the data, or as they say, **multidimensionality.**
**Queries** to retrieve metrics.
* `cpu_temp` - **simple query** will show values over whatever time period
is selected in the interface.
* `cpu_temp{state="idle"}` - will narrow down results by applying a **label**.<br>
`cpu_temp{state="idle", name="server-19"}` - **multiple labels** narrow down results.
A query can return various **data types**; the kinda tricky concept is the difference between:
* **instant vector** - the query returns a single value with a single timestamp.
It is simple and intuitive. All the above examples are instant vectors.<br>
Of note, there is **no thinking about time range here**. That is a few layers above;
if one picks last 1h or last 7 days... that plays no role here,
this is a query response datatype and it is still an instant vector - a single value in
a point of time.
* **range vector** - returns multiple values with a single timestamp<br>
This is **needed by some [query functions](https://prometheus.io/docs/prometheus/latest/querying/functions)**
but on its own useless.<br>
A useless example would be `cpu_temp[10m]`. This query first looks at the last
timestamp's data, then it takes all data points within the previous 10m
before that one timestamp, and returns all those values.
**This collection would have a single timestamp.**<br>
This functionality allows use of various **functions** that can do complex tasks.<br>
An actually useful example of a range vector would be `changes(cpu_temp[10m])`,
where the function
[changes\(\)](https://prometheus.io/docs/prometheus/latest/querying/functions/#changes)
takes that range vector, looks at those 10 min of data, and returns
a single value, telling how many times the value of that metric changed in those 10 min.
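The difference can also be poked at outside grafana, through prometheus' HTTP API.
A sketch, assuming prometheus is reachable on localhost:9090 (swap in prom.example.com
when going through the reverse proxy) and using the made-up `cpu_temp` metric from
the examples above:
```bash
# instant vector - one value per matching series, evaluated at "now"
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=cpu_temp{state="idle"}'

# range vector fed to a function - changes() collapses the last 10 minutes
# of samples back into a single value per series
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=changes(cpu_temp[10m])'
```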
Links
* [Stackoverflow - Prometheus instant vector vs range vector](https://stackoverflow.com/questions/68223824/prometheus-instant-vector-vs-range-vector)
* [The Anatomy of a PromQL Query](https://promlabs.com/blog/2020/06/18/the-anatomy-of-a-promql-query/)
* [Why are Prometheus queries hard?](https://fiberplane.com/blog/why-are-prometheus-queries-hard)
* [Prometheus Cheat Sheet - Basics \(Metrics, Labels, Time Series, Scraping\)](https://iximiuz.com/en/posts/prometheus-metrics-labels-time-series/)
* [Learning Prometheus and PromQL - Learning Series](https://iximiuz.com/en/series/learning-prometheus-and-promql/)
* [Prometheus from A to Y](https://a-cup-of.coffee/blog/prometheus/)
* [The official documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
---
---
# Pushgateway
Gives freedom to **push** information into prometheus from **anywhere**.<br>
Be aware that it should **not be abused** to turn prometheus into push type
monitoring. It is only intended for
[specific situations.](https://github.com/prometheus/pushgateway/blob/master/README.md)
### The setup
To **add** pushgateway functionality to the current stack:
* **New container** `pushgateway` added to the **compose** file.
<details>
<summary>docker-compose.yml</summary>
```yml
services:
# PUSHGATEWAY FOR PROMETHEUS
pushgateway:
image: prom/pushgateway:v1.5.1
container_name: pushgateway
hostname: pushgateway
restart: unless-stopped
command:
- '--web.enable-admin-api'
expose:
- "9091"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
* Adding pushgateway to the **Caddyfile** of the reverse proxy so that
it can be reached at `https://push.example.com`<br>
<details>
<summary>Caddyfile</summary>
```php
push.{$MY_DOMAIN} {
reverse_proxy pushgateway:9091
}
```
</details>
* Adding pushgateway as a **scrape point** to `prometheus.yml`<br>
Of note is **honor_labels** set to true,
which makes sure that **conflicting labels** like `job`, set during the push,
are kept over labels set in `prometheus.yml` for the scrape job.
[Docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).
<details>
<summary>prometheus.yml</summary>
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'pushgateway-scrape'
scrape_interval: 60s
honor_labels: true
static_configs:
- targets: ['pushgateway:9091']
```
</details>
### The basics
![push-web](https://i.imgur.com/9Jk0HKu.png)
To **test pushing** some metric, execute in linux:<br>
* `echo "some_metric 3.14" | curl --data-binary @- https://push.example.com/metrics/job/blabla/instance/whatever`
* Visit `push.example.com` and see the metric there.
* In Grafana > Explore > query for `some_metric` and see its value there.
In that command you see the metric itself: `some_metric` and its value: `3.14`<br>
But there are also **labels** being set as part of the url. One label named `job`
is required, but after that it's whatever you want.
They just need to be in **pairs** - label name and label value.
The metrics sit on the pushgateway **forever**, unless deleted or the container
shuts down. **Prometheus will not remove** the metrics **after scraping**;
it will keep scraping the pushgateway every X seconds
and store the value that sits there with the timestamp of scraping.
To **wipe** the pushgateway clean:<br>
`curl -X PUT https://push.example.com/api/v1/admin/wipe`
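A metric can also be pushed with a TYPE hint, and a single group can be deleted
without wiping everything. A sketch, with made-up job and instance names:
```bash
# push a gauge with a TYPE hint; labels are name/value pairs in the url path
cat <<EOF | curl --data-binary @- https://push.example.com/metrics/job/backups/instance/server-19
# TYPE backup_duration_seconds gauge
backup_duration_seconds 74.2
EOF

# delete just that one group, everything else stays
curl -X DELETE https://push.example.com/metrics/job/backups/instance/server-19
```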
### The real world use
* [**Veeam Prometheus Grafana**](https://github.com/DoTheEvo/veeam-prometheus-grafana)
Linked above is a guide-by-example with more info on **pushgateway setup**.<br>
A real world use to **monitor backups**, along with pushing metrics
from **windows in powershell**.<br>
![veeam-dash](https://i.imgur.com/dUyzuyl.png)
---
---
# Alertmanager
To send a **notification** about some **metric** breaching some preset **condition**.<br>
Notification channels used here are **email** and
[**ntfy**](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
*Note*<br>
I myself am **not** planning on using alertmanager.
Grafana can do alerts for both logs and metrics.
![alert](https://i.imgur.com/b4hchSu.png)
## The setup
To **add** alertmanager to the current stack:
* **New file** - `alertmanager.yml` to be **bind mounted** in the alertmanager container.<br>
This is the **configuration** on how and where **to deliver** alerts.<br>
Correct smtp or ntfy info needs to be filled out.
<details>
<summary>alertmanager.yml</summary>
```yml
route:
receiver: 'email'
receivers:
- name: 'ntfy'
webhook_configs:
- url: 'https://ntfy.example.com/alertmanager'
send_resolved: true
- name: 'email'
email_configs:
- to: 'whoever@example.com'
from: 'alertmanager@example.com'
smarthost: smtp-relay.sendinblue.com:587
        auth_username: '<registration_email@gmail.com>'
        auth_identity: '<registration_email@gmail.com>'
        auth_password: '<long ass generated SMTP key>'
```
</details>
* **New file** - `alert.rules` to be **bind mounted** into the prometheus container<br>
This file **defines** at what value a metric becomes an **alert** event.
<details>
<summary>alert.rules</summary>
```yml
groups:
- name: host
rules:
- alert: DiskSpaceLow
        expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19 # deliberately always true, to test that a notification arrives
for: 10s
labels:
severity: critical
annotations:
description: "Diskspace is low!"
```
</details>
* **Changed** `prometheus.yml`. Added an **alerting section** that points to the alertmanager
container, and also **set the path** to a `rules` file.
<details>
<summary>prometheus.yml</summary>
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'nodeexporter'
static_configs:
- targets: ['nodeexporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
alerting:
alertmanagers:
- scheme: http
static_configs:
- targets:
- 'alertmanager:9093'
rule_files:
- '/etc/prometheus/rules/alert.rules'
```
</details>
* **New container** - `alertmanager` added to the compose file, and the **prometheus
container** gets the **rules file** bind mounted.
<details>
<summary>docker-compose.yml</summary>
```yml
services:
# MONITORING SYSTEM AND THE METRICS DATABASE
prometheus:
image: prom/prometheus:v2.42.0
container_name: prometheus
hostname: prometheus
user: root
restart: unless-stopped
depends_on:
- cadvisor
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=240h'
- '--web.enable-lifecycle'
volumes:
- ./prometheus_data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alert.rules:/etc/prometheus/rules/alert.rules
expose:
- "9090"
labels:
org.label-schema.group: "monitoring"
  # ALERT MANAGEMENT FOR PROMETHEUS
alertmanager:
image: prom/alertmanager:v0.25.0
container_name: alertmanager
hostname: alertmanager
user: root
restart: unless-stopped
volumes:
- ./alertmanager.yml:/etc/alertmanager.yml
- ./alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager.yml'
- '--storage.path=/alertmanager'
expose:
- "9093"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
* **Adding** alertmanager to the **Caddyfile** of the reverse proxy so that
it can be reached at `https://alert.example.com`. **Not necessary**,
but useful as it **allows sending alerts from anywhere**,
not just from prometheus or other containers on the same docker network.
<details>
<summary>Caddyfile</summary>
```php
alert.{$MY_DOMAIN} {
reverse_proxy alertmanager:9093
}
```
</details>
## The basics
![alert](https://i.imgur.com/C7g0xJt.png)
Once the above setup is done, **an alert** about low disk space **should fire**
and a **notification** email should come.<br>
In `alertmanager.yml` a switch from email **to ntfy** can be done.
*Useful*
* **alert** from anywhere using **curl**:<br>
`curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts`
* **reload rules**:<br>
`curl -X POST https://prom.example.com/-/reload`
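Before reloading, the config and the rules file can be validated with promtool,
which ships inside the prom/prometheus image:
```bash
# validate the main config and the rules file it references
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
docker exec prometheus promtool check rules /etc/prometheus/rules/alert.rules
```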
[stefanprodan/dockprom](https://github.com/stefanprodan/dockprom#define-alerts)
has a more detailed section on alerting worth checking out.
---
---
![Loki-logo](https://i.imgur.com/HUohN3P.png)
# Loki
Loki is a log aggregation tool, made by the grafana team.
Sometimes called a Prometheus for logs, it's **push** type monitoring.<br>
It uses [LogQL](https://promcon.io/2019-munich/slides/lt1-08_logql-in-5-minutes.pdf)
for queries, which is similar to PromQL in its use of labels.
[The official documentation overview](https://grafana.com/docs/loki/latest/fundamentals/overview/)
There are two ways to **push logs** to Loki from a docker container.
* [**Loki-docker-driver**](https://grafana.com/docs/loki/latest/clients/docker-driver/)
**installed** on the docker host; log pushing is set either globally in
`/etc/docker/daemon.json` or per container in compose files.<br>
It's the simpler, easier way, but **lacks fine control** over the logs
being pushed.
* **[Promtail](https://grafana.com/docs/loki/latest/clients/promtail/)**
deployed as another **container**, with a bind mount of the logs it should scrape,
and a bind mount of its config file. This config file is very powerful,
giving a lot of **control** over how logs are processed and pushed.
![loki_arch](https://i.imgur.com/aoMPrVV.png)
## Loki setup
* **New container** - `loki` added to the compose file.<br>
Note that port 3100 is actually mapped to the host,
allowing `localhost:3100` from the loki-docker-driver to work.
<details>
<summary>docker-compose.yml</summary>
```yml
services:
  # LOG MANAGEMENT WITH LOKI
loki:
image: grafana/loki:main-0295fd4
container_name: loki
hostname: loki
user: root
restart: unless-stopped
volumes:
- ./loki_data:/loki
- ./loki-config.yml:/etc/loki-config.yml
command:
- '-config.file=/etc/loki-config.yml'
ports:
- "3100:3100"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
* **New file** - `loki-config.yml` bind mounted in the loki container.<br>
The config here comes from
[the official example](https://github.com/grafana/loki/tree/main/cmd/loki)
with some changes:
* **URL** changed for this setup.
* **Compactor** section is added, to have control over
[data retention.](https://grafana.com/docs/loki/latest/operations/storage/retention/)
* **Fixing** the error *"too many outstanding requests"*, discussion
[here.](https://github.com/grafana/loki/issues/5123)<br>
It turns off parallelism, both splitting by time interval and shard splitting.
<details>
<summary>loki-config.yml</summary>
```yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
# --- disable splitting to fix "too many outstanding requests"
query_range:
parallelise_shardable_queries: false
# --- compactor to have control over length of data retention
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
limits_config:
retention_period: 240h
split_queries_by_interval: 0 # part of disable splitting fix
# -------------------------------------------------------
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://alertmanager:9093
analytics:
reporting_enabled: false
```
</details>
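A quick way to check that loki came up healthy and picked up the retention settings,
assuming port 3100 is published as in the compose above:
```bash
# readiness check, should answer: ready
curl -s http://localhost:3100/ready

# loki serves its effective config back over http
curl -s http://localhost:3100/config | grep retention_period
```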
* #### loki-docker-driver
* **Install** the [loki-docker-driver](https://grafana.com/docs/loki/latest/clients/docker-driver/)<br>
`docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions`<br>
To check if it's installed and enabled: `docker plugin ls`<br>
* Containers that should be monitored using the loki-docker-driver need
a `logging` section in their compose.
<details>
<summary>docker-compose.yml</summary>
```yml
services:
whoami:
image: "containous/whoami"
container_name: "whoami"
hostname: "whoami"
logging:
driver: "loki"
options:
loki-url: "http://localhost:3100/loki/api/v1/push"
```
</details>
* #### promtail
* Containers that should be monitored with **promtail** need it **added to**
their **compose** file, making sure it has access to the log files.
<details>
<summary>minecraft-docker-compose.yml</summary>
```yml
services:
minecraft:
image: itzg/minecraft-server
container_name: minecraft
hostname: minecraft
restart: unless-stopped
env_file: .env
tty: true
stdin_open: true
ports:
- 25565:25565 # minecraft server players connect
volumes:
- ./minecraft_data:/data
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: minecraft-promtail
hostname: minecraft-promtail
restart: unless-stopped
volumes:
- ./minecraft_data/logs:/var/log/minecraft:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
<details>
<summary>caddy-docker-compose.yml</summary>
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./caddy_logs:/var/log/caddy:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
* Generic **config file for promtail**, which needs to be bind mounted
<details>
<summary>promtail-config.yml</summary>
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: blablabla
static_configs:
- targets:
- localhost
labels:
job: blablabla_log
          __path__: /var/log/blablabla/*.log
```
</details>
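Whether logs actually arrive can be checked against loki's HTTP API directly,
without grafana. A sketch using the job label from the generic config above:
```bash
# list the labels loki knows about - "job" should be among them
curl -s http://localhost:3100/loki/api/v1/labels

# pull a few recent log lines for the job defined above
curl -sG http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query={job="blablabla_log"}' \
  --data-urlencode 'limit=5'
```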
#### First Loki use in Grafana
* In **grafana**, loki needs to be added as a **datasource**, `http://loki:3100`
* In the **Explore section**, switch to Loki as the source
* if loki-docker-driver, then filter by `container_name` or `compose_project`
* if promtail, then filter by the job name set in the promtail config,
in the labels section
If all was set correctly, logs should be visible in Grafana.
![query](https://i.imgur.com/XSevjIR.png)
# Minecraft Loki example
What can be seen in this example:
* How to monitor logs of a docker container,
a [minecraft server](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/minecraft).
* How to visualize the logs in a dashboard.
* How to set an alert when a specific pattern appears in the logs.
* How to extract information from a log to include it in the alert notification.
* Basics of grafana alert templates, so that notifications actually look good
and show only relevant info.
**Requirements** - grafana, loki, minecraft.
![logo-minecraft](https://i.imgur.com/VphJTKG.png)
### The objective and overview
The **main objective** is to get an **alert** when a player **joins** the server.<br>
The secondary one is to have a place where recent *"happenings"* on the server
can be seen.
Initially **loki-docker-driver** was used to get logs to Loki, and it was simple
and worked nicely. But during the alert stage
I **could not** figure out how to extract a string from the logs and include it in
an **alert** notification. Specifically, to not just say that "a player joined",
but to include the name of the player that joined.<br>
**Switching to promtail** solved this, with the use of its
[pipeline_stages](https://grafana.com/docs/loki/latest/clients/promtail/pipelines/),
which was **surprisingly simple** and elegant.
### The Setup
**Promtail** container is added to minecraft's **compose**, with bind mount
access to minecraft's logs.<br>
<details>
<summary>minecraft-docker-compose.yml</summary>
```yml
services:
minecraft:
image: itzg/minecraft-server
container_name: minecraft
hostname: minecraft
restart: unless-stopped
env_file: .env
tty: true
stdin_open: true
ports:
- 25565:25565 # minecraft server players connect
volumes:
- ./minecraft_data:/data
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: minecraft-promtail
hostname: minecraft-promtail
restart: unless-stopped
volumes:
- ./minecraft_data/logs:/var/log/minecraft:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
**Promtail's config** is similar to the generic config in the previous section.<br>
The only addition is a short **pipeline** stage with a **regex** that runs against
every log line before sending it to Loki. When a line matches, **a label** `player`
is added to that log line.
The value of that label comes from the **named capture group** that's part of
that regex, the [syntax](https://www.regular-expressions.info/named.html)
is: `(?P<name>group)`<br>
This label will be easy to use later in the alert stage.
<details>
<summary>promtail-config.yml</summary>
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: minecraft
static_configs:
- targets:
- localhost
labels:
job: minecraft_logs
          __path__: /var/log/minecraft/*.log
pipeline_stages:
- regex:
          expression: .*:\s(?P<player>.*)\sjoined the game$
- labels:
player:
```
</details>
[Here's a regex101 demo](https://regex101.com/r/5vkOU2/1) of it,
with some data to show how it works and a bit of explanation.<br>
[Here's](https://stackoverflow.com/a/74962269/1383369)
the stackoverflow answer that is the source for that config.
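The capture group can also be sanity-checked locally before promtail ever sees it.
A sketch using GNU grep's pcre mode, which speaks the same regex flavor, against
a made-up log line in the vanilla server format:
```bash
# \K drops what was matched so far and the lookahead checks the tail,
# so only the part the promtail stage captures as "player" gets printed
echo '[12:34:56] [Server thread/INFO]: Bastard joined the game' \
  | grep -oP '.*:\s\K.*(?=\sjoined the game$)'
# prints: Bastard
```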
![regex](https://i.imgur.com/bT5XSHn.png)
### In Grafana
* If Loki is not yet added, it needs to be added as a **datasource**, `http://loki:3100`
* In the **Explore section**, filter, job = `minecraft_logs`, hit the **Run query** button...
this should result in seeing minecraft logs and their volume/time graph.
This Explore view will be **recreated** as a dashboard.
### Dashboard for minecraft logs
![dashboard-minecraft ](https://i.imgur.com/M1k0Dn4.png )
* **New dashboard, new panel**
* Graph type - `Time series`
* Data source - Loki
* Switch from `builder` to `code`<br>
* query - `count_over_time({job="minecraft_logs"} |= "" [1m])`<br>
* `Query options` - Min interval=1m
* Transform - Rename by regex
Match - `(.*)`
Replace - `Logs`
* Title - Logs volume
* Transparent background
* Legend off
* Graph styles - bar
* Fill opacity - 50
* Color scheme - single color
* Save
* **Add another panel**
* Graph type - `Logs`
* Data source - Loki
* Switch from `builder` to `code`<br>
query - `{job="minecraft_logs"} |= ""`<br>
* Title - *empty*
* Deduplication - Signature or Exact
* Save
This should create a dashboard similar to the one in the picture above.<br>
[Performance tips](https://www.youtube.com/watch?v=YED8XIm0YPs)
for grafana loki queries
## Alerts in Grafana for Loki
![alert-labels](https://i.imgur.com/LuUBZFn.png)
When a **player joins** the minecraft server, a **log line** appears: *"Bastard joined the game"*<br>
An **Alert** will be set to detect the string *"joined the game"* and send
a **notification** when it occurs.
Now might be a good time to **brush up on PromQL / LogQL** and the **data types**
they return when a query happens. That **instant vector** and **range vector**
thingie, as grafana will scream when given a range vector where an instant one is expected.
### Create alert rule
- **1 Set an alert rule name**
- Rule name = Minecraft-player-joined-alert
- **2 Set a query and alert condition**
- **A** - Switch to Loki; set Last 5 minutes
- switch from builder to code
- `count_over_time({job="minecraft_logs"} |= "joined the game" [5m])`
- **B** - Reduce
- Function = Last
- Input = A
- Mode = Strict
- **C** - Threshold
- Input = B
- is above 0
- Make this the alert condition
- **3 Alert evaluation behavior**
- Folder = "Alerts"
- Evaluation group (interval) = "five-min"<br>
- Evaluation interval = 5m
- For 0s
- Configure no data and error handling
- Alert state if no data or all values are null = OK
- **4 Add details for your alert rule**
- Here is where the label `player` that was set in **promtail** is used<br>
Summary = `{{ $labels.player }} joined the Minecraft server.`
- Can also pass values from expressions by targeting A/B/C/.. from step 2<br>
Description = `Number of players that joined in the last 5 min: {{ $values.B }}`<br>
- **5 Notifications**
- nothing
- Save and exit
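The alert query from step 2 can also be dry-run against loki's API directly,
which is quicker than waiting out evaluation intervals. Assuming loki's port 3100
is published as in the compose earlier:
```bash
# the same LogQL the alert rule evaluates - any value above 0 trips the threshold
curl -sG http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query=count_over_time({job="minecraft_logs"} |= "joined the game" [5m])'
```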
### Contact points
- New contact point
- Name = ntfy
- Integration = Webhook
- URL = `https://ntfy.example.com/grafana`<br>
or if grafana-to-ntfy is already set up, then `http://grafana-to-ntfy:8080`<br>
but then credentials also need to be set.
- Title = `{{ .CommonAnnotations.summary }}`
- Message = I put in an [empty space unicode character](https://emptycharacter.com/)
- Disable resolved message = check
- Test
- Save
### Notification policies
- Edit default
- Default contact point = ntfy
- Save
After all this, there should be a notification coming when a player joins.
### grafana-to-ntfy
For **alerts** one can use
[ntfy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
but on its own, alerts from grafana are **just plain text json**.<br>
[Here's](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal#grafana-to-ntfy)
how to set up grafana-to-ntfy, to **make alerts look good**.
![ntfy](https://i.imgur.com/gL81jRg.png)
---
---
### Templates
Not really used here, but **they are a pain** and there's some info,
as it took **embarrassingly** long to find that
`{{ .CommonAnnotations.summary }}` for the title.
* **Testing** should be done in a **contact point** when editing; the
**Test button** allows you to send alerts with custom values.
* To [define a template.](https://i.imgur.com/vYPO7yd.png)
* To [call a template.](https://i.imgur.com/w3Sb6fF.png)
* My **big mistake** when playing with this was **missing a dot**.<br>
In a Contact point, in the Title/Message input box.
* correct one - `{{ template "test" . }}`
* the one I had - `{{ template "test" }}`<br>
* So yeah, the **dot** is important here. It represents the **data and context**
passed to a template. It can represent the **global context**, or when used inside
`{{ range }}` it represents the **iteration** loop value.
* [This](https://pastebin.com/id3264k6) json structure is what an **alert** looks
like. Notice `alerts` being an **array** and `commonAnnotations` being an **object**.
For an **array** there's a need to **loop** over it to get access to the
values in it. For **objects** one just needs to target **the name**
from the global context... **using a dot** at the beginning.
* To [iterate over the alerts array.](https://i.imgur.com/yKmZLLQ.png)
* To just access a value - `{{ .CommonAnnotations.summary }}`
* Then there are **conditional** things one can do in **golang templates**,
but I am not going to dig that deep...
Templates resources
* [Overview of Grafana Alerting and Message Templating for Slack](https://faun.pub/overview-of-grafana-alerting-and-message-templating-for-slack-6bb740ec44af)
* [youtube - Unified Alerting Grafana 8 | Prometheus | Victoria | Telegraf | Notifications | Alert Templating](https://youtu.be/UtmmhLraSnE)
* [Dot notation](https://www.practical-go-lessons.com/chap-32-templates#dot-notation)
* [video - Annotations and Alerts tutorial for Grafana with Timescale](https://youtu.be/bmOkirtC65w)
---
---
# Caddy reverse proxy monitoring
What can be seen in this example:
* Use of **Prometheus** to monitor a docker **container** -
[caddy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/caddy).
* How to **import a dashboard** to grafana.
* Use of **Loki** to monitor **logs** of a docker **container**.
* How to set **Promtail** to push only certain values and label logs.
* How to use the **geoip** part of **Promtail**.
* How to create a **dashboard** in grafana from data in **Loki**.
![logo-caddy](https://i.imgur.com/rB6sjKQ.png)
**Reverse proxy** is kinda the linchpin of a selfhosted setup, as it is **in charge**
of all the http/https **traffic** that goes in. So a focus on monitoring this
**keystone** makes sense.
**Requirements** - grafana, prometheus, loki, caddy container
## Caddy - Metrics - Prometheus
![logo](https://i.imgur.com/6QdZuVR.png)
**Caddy** has a built-in **exporter** of metrics for prometheus, so all that is needed
is to enable it, **scrape it** with prometheus, and import a **dashboard**.
* Edit the Caddyfile to [enable metrics.](https://caddyserver.com/docs/metrics)
<details>
<summary>Caddyfile</summary>
```php
{
servers {
metrics
}
admin 0.0.0.0:2019
}
a.{$MY_DOMAIN} {
reverse_proxy whoami:80
}
```
</details>
* Edit the compose to publish port 2019.<br>
Likely **not necessary** if Caddy and Prometheus are on the **same docker network**,
but it's nice to check if the metrics export works at `<docker-host-ip>:2019/metrics`
<details>
<summary>docker-compose.yml</summary>
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
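With the port published, a quick check that the exporter answers before pointing
prometheus at it:
```bash
# caddy's admin endpoint serves prometheus metrics on port 2019
curl -s http://localhost:2019/metrics | grep '^caddy_' | head
```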
* Edit **prometheus.yml** to add a caddy **scraping** point
<details>
<summary>prometheus.yml</summary>
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'caddy'
static_configs:
- targets: ['caddy:2019']
```
</details>
* In grafana, **import** the
[caddy dashboard](https://grafana.com/grafana/dashboards/14280-caddy-exporter/)<br>
But these **metrics** are about **performance** and **load** put on Caddy,
which in a selfhosted environment will **likely be minimal** and not interesting.<br>
To get **more intriguing** info on who, when, **from where**, connects
to what **service**... well, for that, monitoring of **access logs** is needed.
---
---
## Caddy - Logs - Loki
![logs_dash](https://i.imgur.com/j9CcJ44.png)
**Loki** itself just **stores** the logs. To get them to Loki, a **Promtail** container is used
that has **access** to caddy's **logs**. Its job is to **scrape** them regularly, maybe
**process** them in some way, and then **push** them to Loki.<br>
Once there, a basic grafana **dashboard** can be made.
### The setup
* Have Grafana, Loki, Caddy working
* Edit the Caddy **compose**, bind mount `/var/log/caddy`.<br>
**Add** to the compose also a **Promtail container**, that has the same logs bind mount,
along with a bind mount of its **config file**.<br>
Promtail will scrape the logs to which it now has access and **push** them **to Loki.**
<details>
<summary>docker-compose.yml</summary>
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- ./caddy_data:/data
- ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./promtail-config.yml:/etc/promtail-config.yml
- ./caddy_logs:/var/log/caddy:ro
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
<details>
<summary>promtail-config.yml</summary>
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
          __path__: /var/log/caddy/*.log
```
</details>
* **Promtail** scrapes logs **one line** at a time, and is able to do **neat
things** with them before sending - add labels, ignore some lines,
only send some values,...<br>
[Pipelines](https://grafana.com/docs/loki/latest/clients/promtail/pipelines/)
are used for this.
Below is an example of extracting just a single value - an IP address -
and using it in a template that gets sent to Loki, and nothing else.
[Here's](https://zerokspot.com/weblog/2023/01/25/testing-promtail-pipelines/)
some more to read on this.
<details>
<summary>promtail-config.yml customizing fields</summary>
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
          __path__: /var/log/caddy/*.log
pipeline_stages:
- json:
expressions:
request_remote_ip: request.remote_ip
- template:
source: output # creates empty output variable
template: '{"remote_ip": {{.request_remote_ip}}}'
- output:
source: output
```
</details>
* Edit the `Caddyfile` to enable
[**access logs**](https://caddyserver.com/docs/caddyfile/directives/log).
Unfortunately this **can't be globally** enabled, so the easiest way seems to be
to create a **logging** [**snippet**](https://caddyserver.com/docs/caddyfile/concepts#snippets)
called `log_common` and copy-paste the **import line** into every site block.
<details>
<summary>Caddyfile</summary>
```php
(log_common) {
log {
output file /var/log/caddy/caddy_access.log
}
}
ntfy.example.com {
import log_common
reverse_proxy ntfy:80
}
mealie.{$MY_DOMAIN} {
import log_common
reverse_proxy mealie:80
}
```
</details>
* At this point logs should be visible and **explorable in grafana**<br>
Explore > `{job="caddy_access_log"} |= "" | json`
### Geoip
![geoip_info](https://i.imgur.com/f4P8ydl.png)
**Promtail** recently got a **geoip stage**. One can feed it an **IP address** and an mmdb **geoip
database**, and it adds geoip **labels** to the log entry.
[The official documentation.](https://github.com/grafana/loki/blob/main/docs/sources/clients/promtail/stages/geoip.md)
* **Register** a free account on [maxmind.com](https://www.maxmind.com/en/geolite2/signup).
* **Download** one of the mmdb format **databases**
* `GeoLite2 City` - 70MB, full geoip info - city, postal code, time zone, latitude/longitude,..
* `GeoLite2 Country` - 6MB, just country and continent
* **Bind mount** whichever database in to **promtail container** .
<details>
<summary>docker-compose.yml</summary>
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- ./caddy_data:/data
- ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./promtail-config.yml:/etc/promtail-config.yml
- ./caddy_logs:/var/log/caddy:ro
- ./GeoLite2-City.mmdb:/etc/GeoLite2-City.mmdb:ro
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
</details>
* In the **promtail** config, a **json stage** is added where the IP address is loaded into
a **variable** called `remote_ip`, which is then used in the **geoip stage**.
If all else is set correctly, the geoip **labels** are automatically added to the log entry.
<details>
<summary>geoip promtail-config.yml</summary>
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
          __path__: /var/log/caddy/*.log
pipeline_stages:
- json:
expressions:
remote_ip: request.remote_ip
- geoip:
db: "/etc/GeoLite2-City.mmdb"
source: remote_ip
db_type: "city"
```
</details>
Can be tested with opera's built-in VPN, or some online
[site tester](https://pagespeed.web.dev/).
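The database download can also be scripted, which helps with keeping it fresh.
A sketch using maxmind's download permalink; the license key placeholder is
hypothetical and comes from the account page:
```bash
# grab the tarball and unpack just the mmdb file out of it
curl -o GeoLite2-City.tar.gz 'https://download.maxmind.com/app/geoip_download?edition_id=GeoLite2-City&license_key=<YOUR_LICENSE_KEY>&suffix=tar.gz'
tar -xzf GeoLite2-City.tar.gz --strip-components=1 --wildcards '*/GeoLite2-City.mmdb'
```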
### Dashboard
![panel1](https://i.imgur.com/hW92sLO.png)
* **new panel**, will be **time series** graph showing **Subdomains hits timeline**
* Graph type = Time series
* Data source = Loki
* switch from builder to code<br>
`sum(count_over_time({job="caddy_access_log"} |= "" | json [1m])) by (request_host)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_host="(.*)"\}`
* Replace = `$1`
* Title = "Subdomains hits timeline"
* Transparent
* Tooltip mode = All
* Tooltip values sort order = Descending
* Legend Placement = Right
* Value = Total
* Graph style = Bars
* Fill opacity = 50
![panel2](https://i.imgur.com/KYZdotg.png)
* Add **another panel** , will be a **pie chart** , showing **subdomains** divide
* Graph type = Pie chart
* Data source = Loki
* switch from builder to code<br>
`sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_host)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_host="(.*)"\}`
* Replace = `$1`
* Title = "Subdomains divide"
* Transparent
* Legend Placement = Right
* Value = Last
![panel3](https://i.imgur.com/MjbLVlJ.png)
* Add **another panel**, will be a **Geomap**, showing the location of machines accessing
Caddy
* Graph type = Geomap
* Data source = Loki
* switch from builder to code<br>
`{job="caddy_access_log"} |= "" | json`
* Query options > Min interval = 1m
* Transform > Extract fields
* Source = labels
* Format = JSON
* 1. Field = `geoip_location_latitude` ; Alias = `latitude`
* 2. Field = `geoip_location_longitude` ; Alias = `longitude`
* Title = "Geomap"
* Transparent
* Map view > View > *Drag and zoom around* > Use current map setting
* Add **another panel** , will be a **pie chart** , showing **IPs** that hit the most
* Graph type = Pie chart
* Data source = Loki
* switch from builder to code<br>
`sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_remote_ip)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_remote_ip="(.*)"\}`
* Replace = `$1`
* Title = "IPs by number of requests"
* Transparent
* Legend Placement = Right
* Value = Last or Total
* Add **another panel** , this will be actual **log view**
* Graph type - Logs
* Data source - Loki
* Switch from builder to code
* query - `{job="caddy_access_log"} |= "" | json`
* Title - empty
* Deduplication - Exact or Signature
* Save
![panel3](https://i.imgur.com/bzE6JEg.png)
# Update
Manual image update:
- `docker-compose pull`<br>
- `docker-compose up -d`<br>
- `docker image prune`
# Backup and restore
#### Backup
Using [borg](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/borg_backup)
that makes a daily snapshot of the entire directory.
#### Restore
* down the containers: `docker-compose down`<br>
* delete the entire monitoring directory<br>
* from the backup copy back the monitoring directory<br>
* start the containers: `docker-compose up -d`