# Prometheus + Grafana + Loki in docker
###### guide-by-example
![logo](https://i.imgur.com/q41QfyI.png)
# Purpose
Monitoring of machines, containers, services, logs, ...
* [Official Prometheus](https://prometheus.io/)
* [Official Grafana](https://grafana.com/)
* [Official Loki](https://grafana.com/oss/loki/)
Monitoring in this case means gathering and showing information on how services
or machines or containers are running.
Can be cpu, io, ram, disk use... can be number of http requests, errors,
results of backups, or a world map showing location of IP addresses
that access your services.
Prometheus deals with **metrics**. Loki deals with **logs**.
Grafana is there to show the data on **dashboards**.
Most of the prometheus stuff here is based off the magnificent
[**stefanprodan/dockprom**.](https://github.com/stefanprodan/dockprom)
# Chapters
* **[Core prometheus+grafana](#overview)** - nice dashboards with metrics
of a docker host and containers
* **[Pushgateway](#pushgateway)** - push data to prometheus from anywhere
* **[Alertmanager](#alertmanager)** - setting alerts and getting notifications
* **[Loki](#loki)** - prometheus for logs
* **[Minecraft Loki example](#minecraft-loki-example)** - logs, grafana alerts
and templates
* **[Caddy reverse proxy monitoring](#caddy-reverse-proxy-monitoring)** -
metrics, logs and geoip map
![dashboards_pic](https://i.imgur.com/ZmyP0T8.png)
# Overview
[Good youtube overview](https://youtu.be/h4Sl21AKiDg) of Prometheus.
Prometheus is an open source system for monitoring and alerting,
written in golang.
It periodically collects **metrics** from configured **targets**,
makes these metrics available for visualization, and can trigger **alerts**.
Prometheus is a relatively young project; it is a **pull type** monitoring system.
[Glossary.](https://prometheus.io/docs/introduction/glossary/)
* **Prometheus Server** is the core of the system, responsible for
* pulling new metrics
* storing the metrics in a database and evaluating them
* making metrics available through PromQL API
* **Targets** - machines, services, applications that are monitored.
These need to have an **exporter**.
* **exporter** - a script or a service that gathers metrics on the target,
converts them to prometheus server format,
and exposes them at an endpoint so they can be pulled
* **Alertmanager** - responsible for handling alerts from Prometheus Server,
and sending notifications through email, slack, pushover,..
**In this setup [ntfy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
webhook will be used.**
* **pushgateway** - allows push type monitoring, meaning a machine anywhere
in the world can push data into your prometheus. Should not be overused,
as it goes against the pull philosophy of prometheus.
* **Grafana** - for web UI visualization of the collected metrics
![prometheus components](https://i.imgur.com/AxJCg8C.png)
# Files and directory structure
```
/home/
└── ~/
└── docker/
└── monitoring/
├── 🗁 grafana_data/
├── 🗁 prometheus_data/
├── 🗋 docker-compose.yml
├── 🗋 .env
└── 🗋 prometheus.yml
```
* `grafana_data/` - a directory where grafana stores its data
* `prometheus_data/` - a directory where prometheus stores its database and data
* `.env` - a file containing environment variables for docker compose
* `docker-compose.yml` - a docker compose file, telling docker how to run the containers
* `prometheus.yml` - a configuration file for prometheus
The three files must be provided.
The directories are created by docker compose on the first run.
# docker-compose
* **Prometheus** - The official image is used, with a few extra command arguments
passing configuration. Of note is the 240 hour (10 day) **retention** policy.
* **Grafana** - The official image is used, with a bind mounted directory
for persistent data storage. User is set **as root**, as it solves issues I am too
lazy to investigate, likely caused by me editing some files as root.
* **NodeExporter** - An exporter for linux machines,
in this case gathering the **metrics** of the docker host,
like uptime, cpu load, memory use, network bandwidth use, disk space,...
Also **bind mount** of some system directories to have access to required info.
* **cAdvisor** - An exporter gathering docker **containers** metrics,
showing cpu, memory and network use of **each container**.
Runs in `privileged` mode and has some bind mounts of system directories
to have access to the required info.
*Note* - ports are only `expose`d, since a reverse proxy is expected to be used,
with the services accessed by hostname, not by ip and port.
`docker-compose.yml`
```yml
services:
# MONITORING SYSTEM AND THE METRICS DATABASE
prometheus:
image: prom/prometheus:v2.42.0
container_name: prometheus
hostname: prometheus
user: root
restart: unless-stopped
depends_on:
- cadvisor
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=240h'
- '--web.enable-lifecycle'
volumes:
- ./prometheus_data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml
expose:
- "9090"
labels:
org.label-schema.group: "monitoring"
# WEB BASED UI VISUALISATION OF METRICS
grafana:
image: grafana/grafana:9.4.3
container_name: grafana
hostname: grafana
user: root
restart: unless-stopped
env_file: .env
volumes:
- ./grafana_data:/var/lib/grafana
expose:
- "3000"
labels:
org.label-schema.group: "monitoring"
# HOST LINUX MACHINE METRICS EXPORTER
nodeexporter:
image: prom/node-exporter:v1.5.0
container_name: nodeexporter
hostname: nodeexporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
expose:
- "9100"
labels:
org.label-schema.group: "monitoring"
# DOCKER CONTAINERS METRICS EXPORTER
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.1
container_name: cadvisor
hostname: cadvisor
restart: unless-stopped
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
expose:
      - "8080"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
`.env`
```bash
# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava
# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa
```
**All containers must be on the same network**.
Which is named in the `.env` file.
If one does not exist yet: `docker network create caddy_net`
## prometheus.yml
[Official documentation.](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
Contains the bare minimum settings of targets from where metrics are to be pulled.
`prometheus.yml`
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'nodeexporter'
static_configs:
- targets: ['nodeexporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
```
## Reverse proxy
Caddy v2 is used, details
[here](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/caddy_v2).
`Caddyfile`
```php
graf.{$MY_DOMAIN} {
reverse_proxy grafana:3000
}
prom.{$MY_DOMAIN} {
reverse_proxy prometheus:9090
}
```
## First run and Grafana configuration
* Login **admin/admin** to `graf.example.com`, change the password.
* **Add** Prometheus as a **Data source** in Configuration
Set **URL** to `http://prometheus:9090`
* **Import** dashboards from [json files in this repo](dashboards/)
These **dashboards** are the preconfigured ones from
[stefanprodan/dockprom](https://github.com/stefanprodan/dockprom)
with **few changes**.
**Docker host** dashboard did not show free disk space for me, I **had to change fstype**
from `aufs` to `ext4`.
Also included is [a fix](https://github.com/stefanprodan/dockprom/issues/18#issuecomment-487023049)
for **host network monitoring** not showing traffic. In all of them
the default time interval is set to 1h instead of 15m.
* **docker_host.json** - dashboard showing linux docker host metrics
* **docker_containers.json** - dashboard showing docker containers metrics,
except the ones labeled as `monitoring` in the compose file
* **monitoring_services.json** - dashboard showing docker containers metrics
of containers that are labeled `monitoring`
![interface-pic](https://i.imgur.com/wzwgBkp.png)
---
---
# PromQL basics
My understanding of this shit..
* Prometheus stores **metrics**, each metric has a name, like `cpu_temp`.
* the metrics values are stored as **time series**, just simple - timestamped values
`[43 @1684608467][41 @1684608567][48 @1684608667]`.
* This metric has **labels** `[name="server-19", state="idle", city="Frankfurt"]`.
These allow far better **targeting** of the data, or as they say **multidimensionality.**
**Queries** to retrieve metrics.
* `cpu_temp` - **simple query** will show values over whatever time period
is selected in the interface.
* `cpu_temp{state="idle"}` - will narrow down results by applying a **label**.
`cpu_temp{state="idle", name="server-19"}` - **multiple labels** narrow down results.
A query can return various **data types**; a kinda tricky concept is the difference between:
* **instant vector** - query returns a single value with a single timestamp.
It is simple and intuitive. All the above examples are instant vectors.
Of note, there is **no thinking about time range here**. That is few layers above,
if one picks last 1h or last 7 days... that plays no role here,
this is a query response datatype and it is still instant vector - a single value in
point of time.
* **range vector** - returns multiple values tied to a single evaluation timestamp.
This is **needed by some [query functions](https://prometheus.io/docs/prometheus/latest/querying/functions)**,
but on its own it is useless.
A useless example would be `cpu_temp[10m]`. This query first looks at the latest
timestamped data, then takes all data points within the previous 10m
before that one timestamp, and returns all those values.
**This collection belongs to a single evaluation timestamp.**
This functionality allows use of various **functions** that can do complex tasks.
An actually useful example of a range vector would be `changes(cpu_temp[10m])`
where the function
[changes\(\)](https://prometheus.io/docs/prometheus/latest/querying/functions/#changes)
would take that range vector info, look at those 10min of data and return
a single value, telling how many times the value of that metric changed in those 10 min.
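The concepts above can be combined. A few example queries follow, assuming the node-exporter metrics from this stack (the label values and intervals are arbitrary illustrations):

```
# instant vector - current free disk space in bytes, narrowed by a label
node_filesystem_free_bytes{fstype="ext4"}

# range vector fed to a function - per-second network receive rate
# over the last 5 minutes, returned as an instant vector
rate(node_network_receive_bytes_total[5m])

# average of the 1 minute load gauge over the last 10 minutes
avg_over_time(node_load1[10m])
```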
Links
* [Stackoverflow - Prometheus instant vector vs range vector](https://stackoverflow.com/questions/68223824/prometheus-instant-vector-vs-range-vector)
* [The Anatomy of a PromQL Query](https://promlabs.com/blog/2020/06/18/the-anatomy-of-a-promql-query/)
* [Why are Prometheus queries hard?](https://fiberplane.com/blog/why-are-prometheus-queries-hard)
* [Prometheus Cheat Sheet - Basics \(Metrics, Labels, Time Series, Scraping\)](https://iximiuz.com/en/posts/prometheus-metrics-labels-time-series/)
* [Learning Prometheus and PromQL - Learning Series](https://iximiuz.com/en/series/learning-prometheus-and-promql/)
* [The official](https://prometheus.io/docs/prometheus/latest/querying/basics/)
---
---
# Pushgateway
Gives the freedom to **push** information into prometheus from **anywhere**.
Be aware that it should **not be abused** to turn prometheus into push type
monitoring. It is only intended for
[specific situations.](https://github.com/prometheus/pushgateway/blob/master/README.md)
### The setup
To **add** pushgateway functionality to the current stack:
* **New container** `pushgateway` added to the **compose** file.
docker-compose.yml
```yml
services:
# PUSHGATEWAY FOR PROMETHEUS
pushgateway:
image: prom/pushgateway:v1.5.1
container_name: pushgateway
hostname: pushgateway
restart: unless-stopped
command:
- '--web.enable-admin-api'
expose:
- "9091"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* Adding pushgateway to the **Caddyfile** of the reverse proxy so that
it can be reached at `https://push.example.com`
Caddyfile
```php
push.{$MY_DOMAIN} {
reverse_proxy pushgateway:9091
}
```
* Adding pushgateway as a **scrape point** to `prometheus.yml`
Of note is **honor_labels** set to true,
which makes sure that **conflicting labels** like `job`, set during the push,
are kept over labels set in `prometheus.yml` for the scrape job.
[Docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).
prometheus.yml
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'pushgateway-scrape'
scrape_interval: 60s
honor_labels: true
static_configs:
- targets: ['pushgateway:9091']
```
### The basics
![push-web](https://i.imgur.com/9Jk0HKu.png)
To **test pushing** some metric, execute in linux:
* `echo "some_metric 3.14" | curl --data-binary @- https://push.example.com/metrics/job/blabla/instance/whatever`
* Visit `push.example.com` and see the metric there.
* In Grafana > Explore > query for `some_metric` and see its value there.
In that command you can see the metric itself: `some_metric` and its value: `3.14`.
But there are also **labels** being set as part of the url. One label named `job`
is required, after that it's whatever you want.
They just need to come in **pairs** - label name and label value.
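Multiple metrics can be pushed in one request, optionally with `# TYPE` metadata lines. A sketch below just builds such a payload; the metric names and the push URL are invented for illustration:

```shell
# exposition-format payload; metric names here are made up for the example
payload='# TYPE backup_duration_seconds gauge
backup_duration_seconds 42.7
# TYPE backup_files_total gauge
backup_files_total 1337'
printf '%s\n' "$payload"
# to push it, grouped under job "backup", instance "server-19":
#   printf '%s\n' "$payload" | curl --data-binary @- \
#       https://push.example.com/metrics/job/backup/instance/server-19
```

The `job` and `instance` labels in the url decide the **grouping key**, so a later push with the same key replaces these values.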
The metrics sit on the pushgateway **forever**, unless deleted or the container
shuts down. **Prometheus will not remove** the metrics **after scraping**;
it will keep scraping the pushgateway every X seconds,
storing whatever value sits there with the timestamp of the scrape.
To **wipe** the pushgateway clean
`curl -X PUT https://push.example.com/api/v1/admin/wipe`
### The real world use
* [**Veeam Prometheus Grafana**](https://github.com/DoTheEvo/veeam-prometheus-grafana)
Linked above is a guide-by-example with more info on **pushgateway setup**.
A real world use to **monitor backups**, along with pushing metrics
from **windows in powershell**.
![veeam-dash](https://i.imgur.com/dUyzuyl.png)
---
---
# Alertmanager
To send a **notification** about some **metric** breaching some preset **condition**.
Notifications channels used here are **email** and
[**ntfy**](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
*Note*
I myself am **not** planning on using alertmanager.
Grafana can do alerts for both logs and metrics.
![alert](https://i.imgur.com/b4hchSu.png)
## The setup
To **add** alertmanager to the current stack:
* **New file** - `alertmanager.yml` to be **bind mounted** in alertmanager container.
This is the **configuration** on how and where **to deliver** alerts.
Correct smtp or ntfy info needs to be filled out.
alertmanager.yml
```yml
route:
receiver: 'email'
receivers:
- name: 'ntfy'
webhook_configs:
- url: 'https://ntfy.example.com/alertmanager'
send_resolved: true
- name: 'email'
email_configs:
- to: 'whoever@example.com'
from: 'alertmanager@example.com'
smarthost: smtp-relay.sendinblue.com:587
auth_username: ''
auth_identity: ''
auth_password: ''
```
* **New file** - `alert.rules` to be **bind mounted** in to the prometheus container.
This file **defines** at what value a metric becomes an **alert** event.
The example condition here (free disk space above 19 bytes) is deliberately
always true, so that the alert fires and the notification chain can be tested.
alert.rules
```yml
groups:
- name: host
rules:
- alert: DiskSpaceLow
expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19
for: 10s
labels:
severity: critical
annotations:
description: "Diskspace is low!"
```
* **Changed** `prometheus.yml`. Added an **alerting section** that points to the alertmanager
container, and also **set the path** to a `rules` file.
prometheus.yml
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'nodeexporter'
static_configs:
- targets: ['nodeexporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
alerting:
alertmanagers:
- scheme: http
static_configs:
- targets:
- 'alertmanager:9093'
rule_files:
- '/etc/prometheus/rules/alert.rules'
```
* **New container** - `alertmanager` added to the compose file, and the **prometheus
container** gets a bind mount of the **rules file** added.
docker-compose.yml
```yml
services:
# MONITORING SYSTEM AND THE METRICS DATABASE
prometheus:
image: prom/prometheus:v2.42.0
container_name: prometheus
hostname: prometheus
user: root
restart: unless-stopped
depends_on:
- cadvisor
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=240h'
- '--web.enable-lifecycle'
volumes:
- ./prometheus_data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alert.rules:/etc/prometheus/rules/alert.rules
expose:
- "9090"
labels:
org.label-schema.group: "monitoring"
# ALERT MANAGMENT FOR PROMETHEUS
alertmanager:
image: prom/alertmanager:v0.25.0
container_name: alertmanager
hostname: alertmanager
user: root
restart: unless-stopped
volumes:
- ./alertmanager.yml:/etc/alertmanager.yml
- ./alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager.yml'
- '--storage.path=/alertmanager'
expose:
- "9093"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* **Adding** alertmanager to the **Caddyfile** of the reverse proxy so that
it can be reached at `https://alert.example.com`. **Not necessary**,
but useful as it **allows sending alerts from anywhere**,
not just from prometheus or other containers on the same docker network.
Caddyfile
```php
alert.{$MY_DOMAIN} {
reverse_proxy alertmanager:9093
}
```
## The basics
![alert](https://i.imgur.com/C7g0xJt.png)
Once the above setup is done, **an alert** about low disk space **should fire**
and a **notification** email should arrive.
In `alertmanager.yml` a switch from email **to ntfy** can be done.
*Useful*
* **alert** from anywhere using **curl**:
`curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts`
* **reload rules**:
`curl -X POST https://prom.example.com/-/reload`
[stefanprodan/dockprom](https://github.com/stefanprodan/dockprom#define-alerts)
has more detailed section on alerting worth checking out.
---
---
![Loki-logo](https://i.imgur.com/HUohN3P.png)
# Loki
Loki is a log aggregation tool, made by the grafana team.
Sometimes called a Prometheus for logs, it is a **push** type of monitoring.
It uses [LogQL](https://promcon.io/2019-munich/slides/lt1-08_logql-in-5-minutes.pdf)
for queries, which is similar to PromQL in its use of labels.
[The official documentation overview](https://grafana.com/docs/loki/latest/fundamentals/overview/)
There are two ways to **push logs** to Loki from a docker container.
* [**Loki-docker-driver**](https://grafana.com/docs/loki/latest/clients/docker-driver/)
**installed** on a docker host and log pushing is set either globally in
`/etc/docker/daemon.json` or per container in compose files.
It's the simpler, easier way, but **lacks fine control** over the logs
being pushed.
* **[Promtail](https://grafana.com/docs/loki/latest/clients/promtail/)**
deployed as an another **container**, with bind mount of logs it should scrape,
and bind mount of its config file. This config file is very powerful,
giving a lot of **control** how logs are processed and pushed.
![loki_arch](https://i.imgur.com/aoMPrVV.png)
## Loki setup
* **New container** - `loki` added to the compose file.
Note that port 3100 is actually mapped to the host,
allowing `localhost:3100` from the driver to work.
docker-compose.yml
```yml
services:
# LOG MANAGMENT WITH LOKI
loki:
image: grafana/loki:main-0295fd4
container_name: loki
hostname: loki
user: root
restart: unless-stopped
volumes:
- ./loki_data:/loki
- ./loki-config.yml:/etc/loki-config.yml
command:
- '-config.file=/etc/loki-config.yml'
ports:
- "3100:3100"
labels:
org.label-schema.group: "monitoring"
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* **New file** - `loki-config.yml` bind mounted in the loki container.
The config here comes from
[the official example](https://github.com/grafana/loki/tree/main/cmd/loki)
with some changes.
* **URL** changed for this setup.
* **Compactor** section is added, to have control over
[data retention.](https://grafana.com/docs/loki/latest/operations/storage/retention/)
* **Fixing** error - *"too many outstanding requests"*, discussion
[here.](https://github.com/grafana/loki/issues/5123)
It turns off parallelism, both split by time interval and shards split.
loki-config.yml
```yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
# --- disable splitting to fix "too many outstanding requests"
query_range:
parallelise_shardable_queries: false
# --- compactor to have control over length of data retention
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
limits_config:
retention_period: 240h
split_queries_by_interval: 0 # part of disable splitting fix
# -------------------------------------------------------
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://alertmanager:9093
analytics:
reporting_enabled: false
```
* #### loki-docker-driver
* **Install** [loki-docker-driver](https://grafana.com/docs/loki/latest/clients/docker-driver/)
`docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions`
To check if it's installed and enabled: `docker plugin ls`
* Containers that should be monitored using loki-docker-driver need
a `logging` section in their compose file.
docker-compose.yml
```yml
services:
whoami:
image: "containous/whoami"
container_name: "whoami"
hostname: "whoami"
logging:
driver: "loki"
options:
loki-url: "http://localhost:3100/loki/api/v1/push"
```
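As mentioned earlier, the driver can also be set **globally** in `/etc/docker/daemon.json` instead of per container. A minimal sketch (the batch size value is just an example; a docker daemon restart is required and it only applies to newly created containers):

```json
{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://localhost:3100/loki/api/v1/push",
    "loki-batch-size": "400"
  }
}
```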
* #### promtail
* Containers that should be monitored with **promtail** need it **added**
**to** their **compose** file, with access to the log files ensured.
minecraft-docker-compose.yml
```yml
services:
minecraft:
image: itzg/minecraft-server
container_name: minecraft
hostname: minecraft
restart: unless-stopped
env_file: .env
tty: true
stdin_open: true
ports:
- 25565:25565 # minecraft server players connect
volumes:
- ./minecraft_data:/data
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: minecraft-promtail
hostname: minecraft-promtail
restart: unless-stopped
volumes:
- ./minecraft_data/logs:/var/log/minecraft:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
caddy-docker-compose.yml
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./caddy_logs:/var/log/caddy:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* Generic **config file for promtail**, needs to be bind mounted
promtail-config.yml
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: blablabla
static_configs:
- targets:
- localhost
labels:
job: blablabla_log
__path__: /var/log/blablabla/*.log
```
#### First Loki use in Grafana
* In **grafana**, loki needs to be added as a **datasource**, `http://loki:3100`
* In **Explore section**, switch to Loki as source
* if loki-docker-driver then filter by `container_name` or `compose_project`
* if promtail then filter by job name set in promtail config
in the labels section
If all was set correctly, logs should be visible in Grafana.
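A couple of starter queries for the Explore view, assuming the labels from the examples above (`container_name` is a label the driver sets on its own, `blablabla_log` is the job from the generic promtail config):

```
{container_name="whoami"}          # logs pushed by loki-docker-driver
{job="blablabla_log"} |= "error"   # promtail job, filtered for a string
```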
![query](https://i.imgur.com/XSevjIR.png)
# Minecraft Loki example
What can be seen in this example:
* How to monitor logs of a docker container,
a [minecraft server](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/minecraft).
* How to visualize the logs in a dashboard.
* How to set an alert when a specific pattern appears in the logs.
* How to extract information from log to include it in the alert notification.
* Basics of grafana alert templates, so that notifications actually look good,
and show only relevant info.
**Requirements** - grafana, loki, minecraft.
![logo-minecraft](https://i.imgur.com/VphJTKG.png)
### The objective and overview
The **main objective** is to get an **alert** when a player **joins** the server.
The secondary one is to have a place where recent *"happening"* on the server
can be seen.
Initially **loki-docker-driver** was used to get logs to Loki, and it was simple
and worked nicely. But during the alert stage
I **could not** figure out how to extract a string from the logs and include it in
an **alert** notification. Specifically, to not just say that "a player joined",
but to include the name of the player that joined.
**Switching to promtail** solved this, with the use of its
[pipeline_stages](https://grafana.com/docs/loki/latest/clients/promtail/pipelines/).
Which was **surprisingly simple** and elegant.
### The Setup
**Promtail** container is added to minecraft's **compose**, with bind mount
access to minecraft's logs.
minecraft-docker-compose.yml
```yml
services:
minecraft:
image: itzg/minecraft-server
container_name: minecraft
hostname: minecraft
restart: unless-stopped
env_file: .env
tty: true
stdin_open: true
ports:
- 25565:25565 # minecraft server players connect
volumes:
- ./minecraft_data:/data
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: minecraft-promtail
hostname: minecraft-promtail
restart: unless-stopped
volumes:
- ./minecraft_data/logs:/var/log/minecraft:ro
- ./promtail-config.yml:/etc/promtail-config.yml
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
**Promtail's config** is similar to the generic config in the previous section.
The only addition is a short **pipeline** stage with a **regex** that runs against
every log line before sending it to Loki. When a line matches, **a label** `player`
is added to that log line.
The value of that label comes from the **named capture group** that's part of
that regex, the [syntax](https://www.regular-expressions.info/named.html)
is: `(?P<name>group)`
This label will be easy to use later in the alert stage.
promtail-config.yml
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: minecraft
static_configs:
- targets:
- localhost
labels:
job: minecraft_logs
__path__: /var/log/minecraft/*.log
pipeline_stages:
- regex:
          expression: .*:\s(?P<player>.*)\sjoined the game$
- labels:
player:
```
[Here's regex101](https://regex101.com/r/5vkOU2/1) of it,
with some data to show how it works and a bit of explanation.
[Here's](https://stackoverflow.com/a/74962269/1383369)
the stackoverflow answer that is the source for that config.
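To sanity check such a regex locally before putting it in promtail, a sample log line can be replayed through GNU grep's perl mode. The log line below is just an illustration of the minecraft log format, and `\K` plus the lookahead do the same job as the named capture group:

```shell
# sample line in the format the minecraft server logs use (illustrative)
line='[12:34:56] [Server thread/INFO]: Bastard joined the game'
# capture the text between ": " and " joined the game", like the promtail stage
player=$(echo "$line" | grep -oP ':\s\K.*(?= joined the game$)')
echo "$player"
```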
![regex](https://i.imgur.com/bT5XSHn.png)
### In Grafana
* If Loki is not yet added, it needs to be added as a **datasource**, `http://loki:3100`
* In **Explore section**, filter, job = `minecraft_logs`, **Run query** button...
this should result in seeing minecraft logs and their volume/time graph.
This Explore view will be **recreated** as a dashboard.
### Dashboard for minecraft logs
![dashboard-minecraft](https://i.imgur.com/M1k0Dn4.png)
* **New dashboard, new panel**
* Graph type - `Time series`
* Data source - Loki
* Switch from `builder` to `code`
  * query - `count_over_time({job="minecraft_logs"} |= "" [1m])`
* `Query options` - Min interval=1m
* Transform - Rename by regex
Match - `(.*)`
Replace - `Logs`
* Title - Logs volume
* Transparent background
* Legend off
* Graph styles - bar
* Fill opacity - 50
* Color scheme - single color
* Save
* **Add another panel**
* Graph type - `Logs`
* Data source - Loki
* Switch from `builder` to `code`
query - `{job="minecraft_logs"} |= ""`
* Title - *empty*
* Deduplication - Signature or Exact
* Save
This should create a similar dashboard to the one in the picture above.
[Performance tips](https://www.youtube.com/watch?v=YED8XIm0YPs)
for grafana loki queries
## Alerts in Grafana for Loki
![alert-labels](https://i.imgur.com/LuUBZFn.png)
When a **player joins** the minecraft server, a **log line** appears: *"Bastard joined the game"*.
An **Alert** will be set to detect the string *"joined the game"* and send
a **notification** when it occurs.
Now might be a good time to **brush up on PromQL / LogQL** and the **data types**
they return when a query happens. That **instant vector** and **range vector**
thingie, as grafana will scream when using a range vector.
### Create alert rule
- **1 Set an alert rule name**
- Rule name = Minecraft-player-joined-alert
- **2 Set a query and alert condition**
- **A** - Switch to Loki; set Last 5 minutes
- switch from builder to code
- `count_over_time({job="minecraft_logs"} |= "joined the game" [5m])`
- **B** - Reduce
- Function = Last
- Input = A
- Mode = Strict
  - **C** - Threshold
- Input = B
- is above 0
- Make this the alert condition
- **3 Alert evaluation behavior**
- Folder = "Alerts"
- Evaluation group (interval) = "five-min"
- Evaluation interval = 5m
- For 0s
- Configure no data and error handling
- Alert state if no data or all values are null = OK
- **4 Add details for your alert rule**
- Here is where the label `player` that was set in **promtail** is used
Summary = `{{ $labels.player }} joined the Minecraft server.`
- Can also pass values from expressions by targeting A/B/C/.. from step2
Description = `Number of players that joined in the last 5 min: {{ $values.B }}`
- **5 Notifications**
- nothing
- Save and exit
### Contact points
- New contact point
- Name = ntfy
- Integration = Webhook
- URL = `https://ntfy.example.com/grafana`
or if grafana-to-ntfy is already setup then `http://grafana-to-ntfy:8080`
but also credentials need to be set.
- Title = `{{ .CommonAnnotations.summary }}`
- Message = I put in [empty space unicode character](https://emptycharacter.com/)
- Disable resolved message = check
- Test
- Save
### Notification policies
- Edit default
- Default contact point = ntfy
- Save
After all this, there should be a notification coming when a player joins.
### grafana-to-ntfy
For **alerts** one can use
[ntfy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal)
but on its own alerts from grafana are **just plain text json**.
[Here's](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/gotify-ntfy-signal#grafana-to-ntfy)
how to setup grafana-to-ntfy, to **make alerts look good**.
![ntfy](https://i.imgur.com/gL81jRg.png)
---
---
### Templates
Not really used here, but **they are pain** and there's some info
as it took **embarrassingly** long to find that
`{{ .CommonAnnotations.summary }}` for the title.
* **Testing** should be done in **contact point** when editing,
useful **Test button** that allows you to send alerts with custom values.
* To [define a template.](https://i.imgur.com/vYPO7yd.png)
* To [call a template.](https://i.imgur.com/w3Sb6fF.png)
* My **big mistake** when playing with this was **missing a dot**.
In Contact point, in Title/Message input box.
* correct one - `{{ template "test" . }}`
* the one I had - `{{ template "test" }}`
* So yeah, the **dot** is important here. It represents the **data and context**
  passed to a template. It can represent the **global context**, or when used inside
  `{{ range }}` it represents the **iteration** loop value.
* [This](https://pastebin.com/id3264k6) json structure is what an **alert** looks
  like. Notice `alerts` being an **array** and `commonAnnotations` being an **object**.
  For an **array** there's a need to **loop** over it to get access to the
  values in it. For **objects** one just needs to target **the name**
  from the global context... **using a dot** at the beginning.
* To [iterate over alerts array.](https://i.imgur.com/yKmZLLQ.png)
* To just access a value - `{{ .CommonAnnotations.summary }}`
* Then there are **conditional** things one can do in **golang templates**,
  but I am not going to dig that deep...
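Putting the pieces together, a minimal sketch of defining a template that loops over the alerts array and prints each alert's summary annotation. The template name `joined` is made up for illustration:

```
{{ define "joined" }}
{{ range .Alerts }}{{ .Annotations.summary }}
{{ end }}{{ end }}
```

It would then be called in the Contact point's Title or Message box as `{{ template "joined" . }}` - with the **dot** passing the global context in, so that `.Alerts` is reachable inside the template.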
Templates resources
* [Overview of Grafana Alerting and Message Templating for Slack](https://faun.pub/overview-of-grafana-alerting-and-message-templating-for-slack-6bb740ec44af)
* [youtube - Unified Alerting Grafana 8 | Prometheus | Victoria | Telegraf | Notifications | Alert Templating](https://youtu.be/UtmmhLraSnE)
* [Dot notation](https://www.practical-go-lessons.com/chap-32-templates#dot-notation)
* [video - Annotations and Alerts tutorial for Grafana with Timescale](https://youtu.be/bmOkirtC65w)
---
---
# Caddy reverse proxy monitoring
What can be seen in this example:
* Use of **Prometheus** to monitor a docker **container** -
[caddy](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/caddy).
* How to **import a dashboard** to grafana.
* Use of **Loki** to monitor **logs** of a docker **container**.
* How to set **Promtail** to push only certain values and label logs.
* How to use **geoip** part of **Promtail**.
* How to create **dashboard** in grafana from data in **Loki**.
![logo-caddy](https://i.imgur.com/rB6sjKQ.png)
A **reverse proxy** is kind of the linchpin of a selfhosted setup, as it is **in charge**
of all the http/https **traffic** that comes in. So a focus on monitoring this
**keystone** makes sense.
**Requirements** - grafana, prometheus, loki, caddy container
## Caddy - Metrics - Prometheus
![logo](https://i.imgur.com/6QdZuVR.png)
**Caddy** has a built-in **exporter** of metrics for prometheus, so all that is needed
is to enable it, **scrape it** with prometheus, and import a **dashboard**.
* Edit Caddyfile to [enable metrics.](https://caddyserver.com/docs/metrics)
Caddyfile
```php
{
servers {
metrics
}
admin 0.0.0.0:2019
}
a.{$MY_DOMAIN} {
reverse_proxy whoami:80
}
```
* Edit the compose file to publish port 2019.
  Likely **not necessary** if Caddy and Prometheus are on the **same docker network**,
  but it's nice to be able to check that the metrics export works at `:2019/metrics`
docker-compose.yml
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
      - ./caddy_data:/data
      - ./caddy_config:/config
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* Edit **prometheus.yml** to add a caddy **scraping** target
prometheus.yml
```yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'caddy'
static_configs:
- targets: ['caddy:2019']
```
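To sanity-check the target before importing a dashboard, one can look at what `:2019/metrics` returns - plain-text Prometheus exposition format, one metric per line. A rough Python sketch of pulling a name and value out of such a line (the sample metric line is made up for illustration):

```python
# A sample line in the Prometheus text exposition format:
#   metric_name{label="value"} number
# (this particular metric name is hypothetical)
sample = 'caddy_http_requests_total{handler="reverse_proxy"} 42'

# split off the value, then separate the metric name from its labels
name_and_labels, value = sample.rsplit(" ", 1)
name, _, labels = name_and_labels.partition("{")

print(name)          # caddy_http_requests_total
print(float(value))  # 42.0
```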
* In grafana **import**
[caddy dashboard](https://grafana.com/grafana/dashboards/14280-caddy-exporter/)
But these **metrics** are about **performance** and the **load** put on Caddy,
which in a selfhosted environment will **likely be minimal** and not that interesting.
To get **more intriguing** info on who, when, and **from where** connects
to which **service**... well, for that, monitoring of **access logs** is needed.
---
---
## Caddy - Logs - Loki
![logs_dash](https://i.imgur.com/j9CcJ44.png)
**Loki** itself just **stores** the logs. To get them to Loki, a **Promtail** container is used
that has **access** to caddy's **logs**. Its job is to **scrape** them regularly, maybe
**process** them in some way, and then **push** them to Loki.
Once there, a basic grafana **dashboard** can be made.
### The setup
* Have Grafana, Loki, Caddy working
* Edit the Caddy **compose**, bind mounting `/var/log/caddy`.
  **Add** to the compose also a **Promtail container** that has the same logs bind mount,
  along with a bind mount of its **config file**.
  Promtail will scrape the logs it now has access to and **push** them **to Loki.**
docker-compose.yml
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- ./caddy_data:/data
- ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./promtail-config.yml:/etc/promtail-config.yml
- ./caddy_logs:/var/log/caddy:ro
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
promtail-config.yml
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
__path__: /var/log/caddy/*.log
```
* **Promtail** scrapes logs **one line** at a time and is able to do **neat
  things** with each line before sending it - add labels, ignore some lines,
  only send some values,...
  [Pipelines](https://grafana.com/docs/loki/latest/clients/promtail/pipelines/)
  are used for this.
  Below is an example of extracting just a single value - an IP address -
  and sending a **template** containing it to Loki and nothing else.
  [Here's](https://zerokspot.com/weblog/2023/01/25/testing-promtail-pipelines/)
  some more to read on this.
promtail-config.yml customizing fields
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
__path__: /var/log/caddy/*.log
pipeline_stages:
- json:
expressions:
request_remote_ip: request.remote_ip
- template:
source: output # creates empty output variable
template: '{"remote_ip": {{.request_remote_ip}}}'
- output:
source: output
```
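To get a feel for what the `json` → `template` → `output` stages do, here is a rough Python emulation of the pipeline above - one made-up Caddy access-log line in, the single templated value out:

```python
import json

# a made-up, abbreviated Caddy JSON access-log line
line = '{"request": {"remote_ip": "203.0.113.7", "host": "a.example.com"}}'

# json stage - extract request.remote_ip into a variable
request_remote_ip = json.loads(line)["request"]["remote_ip"]

# template stage - build the new log line, mirroring the go template literally;
# output stage - only this rendered string gets pushed to Loki
output = '{"remote_ip": %s}' % request_remote_ip
print(output)  # {"remote_ip": 203.0.113.7}
```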
* Edit the `Caddyfile` to enable
  [**access logs**](https://caddyserver.com/docs/caddyfile/directives/log).
  Unfortunately, this **can't be enabled globally**, so the easiest way seems to be
  to create a **logging** [**snippet**](https://caddyserver.com/docs/caddyfile/concepts#snippets)
  called `log_common` and copy-paste the **import line** into every site block.
Caddyfile
```php
(log_common) {
log {
output file /var/log/caddy/caddy_access.log
}
}
ntfy.example.com {
import log_common
reverse_proxy ntfy:80
}
mealie.{$MY_DOMAIN} {
import log_common
reverse_proxy mealie:80
}
```
* At this point logs should be visible and **explorable in grafana**
Explore > `{job="caddy_access_log"} |= "" | json`
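The same LogQL query can also be fired at Loki's HTTP API directly, outside grafana - `/loki/api/v1/query_range` is the documented endpoint. A sketch that only builds the request URL (the `loki:3100` address assumes the compose setup above):

```python
from urllib.parse import urlencode

# assumes the loki container name and default port from the compose setup
base = "http://loki:3100/loki/api/v1/query_range"
logql = '{job="caddy_access_log"} |= "" | json'

# urlencode takes care of escaping the LogQL braces and quotes
url = base + "?" + urlencode({"query": logql, "limit": 100})
print(url)  # fetch with curl or any http client to get the raw log entries
```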
### Geoip
![geoip_info](https://i.imgur.com/f4P8ydl.png)
**Promtail** recently got a **geoip stage**. One can feed it an **IP address** and an mmdb **geoip
database** and it adds geoip **labels** to the log entry.
[The official documentation.](https://github.com/grafana/loki/blob/main/docs/sources/clients/promtail/stages/geoip.md)
* **Register** a free account on [maxmind.com](https://www.maxmind.com/en/geolite2/signup).
* **Download** one of the mmdb format **databases**
  - `GeoLite2 City` - 70MB, full geoip info - city, postal code, time zone, latitude/longitude,..
  - `GeoLite2 Country` - 6MB, just country and continent
- **Bind mount** whichever database into the **promtail container**.
docker-compose.yml
```yml
services:
caddy:
image: caddy
container_name: caddy
hostname: caddy
restart: unless-stopped
env_file: .env
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- ./caddy_data:/data
- ./caddy_config:/config
- ./caddy_logs:/var/log/caddy
# LOG AGENT PUSHING LOGS TO LOKI
promtail:
image: grafana/promtail
container_name: caddy-promtail
hostname: caddy-promtail
restart: unless-stopped
volumes:
- ./promtail-config.yml:/etc/promtail-config.yml
- ./caddy_logs:/var/log/caddy:ro
- ./GeoLite2-City.mmdb:/etc/GeoLite2-City.mmdb:ro
command:
- '-config.file=/etc/promtail-config.yml'
networks:
default:
name: $DOCKER_MY_NETWORK
external: true
```
* In the **promtail** config, a **json stage** is added where the IP address is loaded into
  a **variable** called `remote_ip`, which is then used in the **geoip stage**.
  If all else is set correctly, the geoip **labels** are automatically added to the log entry.
geoip promtail-config.yml
```yml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: caddy_access_log
static_configs:
- targets:
- localhost
labels:
job: caddy_access_log
host: example.com
agent: caddy-promtail
__path__: /var/log/caddy/*.log
pipeline_stages:
- json:
expressions:
remote_ip: request.remote_ip
- geoip:
db: "/etc/GeoLite2-City.mmdb"
source: remote_ip
db_type: "city"
```
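For orientation, the labels the geoip stage attaches with the `city` db type look roughly like the sketch below - the label names follow the official geoip stage docs (two of them are read by the Geomap panel), but the values here are invented:

```python
# label names per the promtail geoip stage docs (city db type);
# the values are made up for illustration
geoip_labels = {
    "geoip_city_name": "Tokyo",
    "geoip_country_name": "Japan",
    "geoip_location_latitude": "35.68",
    "geoip_location_longitude": "139.69",
}

# a Geomap dashboard panel uses these two labels to place the point
print(geoip_labels["geoip_location_latitude"],
      geoip_labels["geoip_location_longitude"])
```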
Can be tested with Opera's built-in VPN, or some online
[site tester](https://pagespeed.web.dev/).
### Dashboard
![panel1](https://i.imgur.com/hW92sLO.png)
* **New panel**, will be a **time series** graph showing **Subdomains hits timeline**
* Graph type = Time series
* Data source = Loki
* switch from builder to code
`sum(count_over_time({job="caddy_access_log"} |= "" | json [1m])) by (request_host)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_host="(.*)"\}`
* Replace = `$1`
* Title = "Subdomains hits timeline"
* Transparent
* Tooltip mode = All
* Tooltip values sort order = Descending
* Legend placement = Right
* Value = Total
* Graph style = Bars
* Fill opacity = 50
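The "Rename by regex" transform in the steps above just unwraps the hostname from the series name grafana shows in the legend. In Python terms (the series name is an example of what the query produces):

```python
import re

# legend name as returned by the LogQL query, before the transform
series = '{request_host="mealie.example.com"}'

# the transform's Match / Replace applied as a regex substitution
renamed = re.sub(r'\{request_host="(.*)"\}', r'\1', series)
print(renamed)  # mealie.example.com
```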
![panel2](https://i.imgur.com/KYZdotg.png)
* Add **another panel**, will be a **pie chart**, showing **subdomains** divide
* Graph type = Pie chart
* Data source = Loki
* switch from builder to code
`sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_host)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_host="(.*)"\}`
* Replace = `$1`
* Title = "Subdomains divide"
* Transparent
* Legend Placement = Right
* Value = Last
![panel3](https://i.imgur.com/MjbLVlJ.png)
* Add **another panel**, will be a **Geomap**, showing the location of machines accessing
  Caddy
* Graph type = Geomap
* Data source = Loki
* switch from builder to code
`{job="caddy_access_log"} |= "" | json`
* Query options > Min interval = 1m
* Transform > Extract fields
* Source = labels
* Format = JSON
* 1. Field = `geoip_location_latitude`; Alias = `latitude`
* 2. Field = `geoip_location_longitude`; Alias = `longitude`
* Title = "Geomap"
* Transparent
* Map view > View > *Drag and zoom around* > Use current map setting
* Add **another panel**, will be a **pie chart**, showing **IPs** that hit the most
* Graph type = Pie chart
* Data source = Loki
* switch from builder to code
`sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_remote_ip)`
* Query options > Min interval = 1m
* Transform > Rename by regex
* Match = `\{request_remote_ip="(.*)"\}`
* Replace = `$1`
* Title = "IPs by number of requests"
* Transparent
* Legend placement = Right
* Value = Last or Total
* Add **another panel**, this will be actual **log view**
* Graph type - Logs
* Data source - Loki
* Switch from builder to code
* query - `{job="caddy_access_log"} |= "" | json`
* Title - empty
* Deduplication - Exact or Signature
* Save
![panel4](https://i.imgur.com/bzE6JEg.png)
# Update
Manual image update:
- `docker-compose pull`
- `docker-compose up -d`
- `docker image prune`
# Backup and restore
#### Backup
Using [borg](https://github.com/DoTheEvo/selfhosted-apps-docker/tree/master/borg_backup)
that makes a daily snapshot of the entire directory.
#### Restore
* down the containers `docker-compose down`
* delete the entire monitoring directory
* from the backup copy back the monitoring directory
* start the containers `docker-compose up -d`