Prometheus & Grafana: Deep Dive

This document explains the Prometheus data model, PromQL query language, scrape configuration, ServiceMonitor CRDs, recording and alerting rules, AlertManager routing, Grafana data sources, and using custom metrics for HPA.

Prometheus stores data as time series. Every time series is uniquely identified by a metric name and a set of key-value labels.

metric_name{label1="value1", label2="value2"} value timestamp

For example:

container_cpu_usage_seconds_total{namespace="monitoring-demo", pod="sample-app-abc", container="nginx"} 42.5 1704067200

This single data point says: the container “nginx” in pod “sample-app-abc” in namespace “monitoring-demo” has used 42.5 CPU-seconds total at the given timestamp.

Prometheus defines four metric types:

Counter: Monotonically increasing value. Only goes up (or resets to 0 on restart). Examples: total HTTP requests, total bytes sent, total errors.

http_requests_total{method="GET", status="200"} 1547

Gauge: Value that goes up and down. Represents a snapshot. Examples: current temperature, memory usage, active connections.

node_memory_MemAvailable_bytes 8589934592

Histogram: Samples observations and counts them in configurable buckets. Also provides sum and count. Examples: request latency, response sizes.

http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34534
http_request_duration_seconds_bucket{le="+Inf"} 34567
http_request_duration_seconds_sum 5765.123
http_request_duration_seconds_count 34567

Summary: Similar to histogram but calculates quantiles on the client side. Less flexible (quantiles cannot be aggregated across instances) but more accurate for a single instance.

Labels turn a single metric name into a multi-dimensional data space. http_requests_total without labels is one time series. With labels {method, status, handler}, it becomes hundreds of time series, one for each unique combination.

This is powerful but dangerous. High-cardinality labels (user IDs, request IDs, IP addresses) create millions of time series and can crash Prometheus. Never use unbounded values as labels.
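The multiplication is easy to underestimate. A quick sketch of how label combinations turn into series counts (the label values here are illustrative, not from the demo):

```python
from itertools import product

# Illustrative label values for http_requests_total (counts are assumptions)
methods = ["GET", "POST", "PUT", "DELETE", "PATCH"]    # 5 values
statuses = ["200", "301", "400", "404", "500"]         # 5 values
handlers = [f"/api/endpoint{i}" for i in range(20)]    # 20 values

# One time series exists per unique label combination
series = list(product(methods, statuses, handlers))
print(len(series))  # 5 * 5 * 20 = 500 time series

# Adding an unbounded label (e.g. user_id with a million values)
# multiplies again: 500 * 1,000,000 = 500 million series.
```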

PromQL is Prometheus’s query language. Understanding a few core functions covers most use cases.

rate() calculates the per-second average rate of increase over a time range. It only works with counters.

rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])

This reads: “How fast is CPU usage increasing, averaged over the last 5 minutes?” The result is in CPU cores (seconds per second). A value of 0.5 means half a CPU core is being used.

The [5m] is the lookback window. It must be at least 2x the scrape interval. With a 30-second scrape, use at least [1m]. With a 15-second scrape, use at least [30s].

Why rate() over plain subtraction? Rate handles counter resets (when a pod restarts, the counter goes back to 0). Plain subtraction would show a huge negative value.
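The reset handling can be sketched in a few lines. This is a simplified version of what rate() does internally; the real implementation also extrapolates to the window boundaries:

```python
def simple_rate(samples, window_seconds):
    """Per-second rate over (timestamp, value) counter samples.

    Handles counter resets the way rate() does: when a value drops,
    the counter is assumed to have restarted from 0, so the new value
    is counted as fresh increase. Simplified sketch only.
    """
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:  # counter reset: pod restarted, counter went back to 0
            increase += curr
    return increase / window_seconds

# Counter climbs to 100, pod restarts (reset), climbs again to 50.
samples = [(0, 80.0), (60, 100.0), (120, 10.0), (180, 50.0)]
print(simple_rate(samples, 180))  # true increase = 20 + 10 + 40 = 70 -> ~0.39/s
# Plain subtraction (50 - 80 = -30) would report a negative rate.
```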

increase() returns the total increase over a time range. It is equivalent to rate() multiplied by the number of seconds in the range.

increase(http_requests_total{namespace="monitoring-demo"}[1h])

“How many requests were made in the last hour?” More intuitive than rate for some metrics.

histogram_quantile() calculates quantiles from histogram buckets. This is how you get p50, p95, and p99 latencies.

histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket{namespace="monitoring-demo"}[5m])
)

“What is the 95th percentile request duration over the last 5 minutes?”

The function takes the bucket boundaries and interpolates. It is an approximation, not exact. Accuracy depends on bucket boundaries. If your buckets are [0.1, 0.5, 1.0] and most requests take 0.3 seconds, the p95 is interpolated between 0.1 and 0.5.
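The interpolation can be reproduced directly from the bucket counts shown earlier. This is a simplified sketch of the core calculation (Prometheus does essentially this per group of series, with extra edge-case handling for the +Inf bucket):

```python
def hist_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), ending in +Inf.
    Assumes observations are spread uniformly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total  # the observation index we are looking for
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# The bucket counts from the histogram example above
buckets = [(0.1, 24054), (0.5, 33444), (1.0, 34534), (float("inf"), 34567)]
print(hist_quantile(0.95, buckets))  # ~0.474 seconds
```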

# Per-pod CPU usage within a namespace
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)
# Average memory per namespace
avg(container_memory_working_set_bytes) by (namespace)
# Max per-node CPU usage rate
max(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
# Count running pods
count(kube_pod_status_phase{phase="Running"})

by (label) groups results. Without it, you get a single aggregated value. With it, you get one value per unique label combination.

CPU utilization as percentage:

sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="monitoring-demo"}) by (pod)
* 100

Memory utilization as percentage:

container_memory_working_set_bytes{namespace="monitoring-demo"}
/
kube_pod_container_resource_limits{resource="memory", namespace="monitoring-demo"}
* 100

Pod restart rate:

rate(kube_pod_container_status_restarts_total{namespace="monitoring-demo"}[5m]) > 0

Prometheus pulls metrics from targets at regular intervals. This is the “pull model.” Targets expose a /metrics endpoint that returns metrics in the Prometheus text format.

  1. Prometheus discovers targets (via config, DNS, Kubernetes API, etc.).
  2. At each scrape interval, it sends an HTTP GET to each target’s /metrics endpoint.
  3. The response is parsed and ingested into the time series database.
  4. Failed scrapes are recorded in the up metric (0 = down, 1 = up).
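A /metrics response uses the Prometheus text exposition format. An illustrative fragment (not output from the demo app):

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1547
http_requests_total{method="POST",status="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 28311552
```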

Prometheus's own default scrape interval is 1 minute; the kube-prometheus-stack Helm chart sets it to 30 seconds.

In Kubernetes, Prometheus discovers targets using the Kubernetes API:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

This discovers all pods with the annotation prometheus.io/scrape: "true" and scrapes their /metrics endpoint.
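A pod opts in through annotations like the following (a sketch; the port and path annotations are a common convention, usually paired with additional relabel rules, and the image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: annotated-app
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/port: "8080"     # conventionally overrides the scrape port
    prometheus.io/path: "/metrics" # conventionally overrides the metrics path
spec:
  containers:
    - name: app
      image: example/app:1.0       # hypothetical image
```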

The demo’s sample app does not expose custom metrics, but the kube-prometheus-stack scrapes Kubernetes system components automatically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: monitoring-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.3-alpine
          ports:
            - containerPort: 80

Even without custom metrics, Prometheus collects container-level metrics (CPU, memory, network) via cAdvisor and kubelet integration.

The kube-prometheus-stack introduces CRDs that replace raw scrape configuration with Kubernetes-native objects.

A ServiceMonitor tells Prometheus to scrape the pods behind a Kubernetes Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app
  namespace: monitoring-demo
  labels:
    release: monitoring   # Must match the Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app: sample-app     # Matches the Service's labels
  endpoints:
    - port: http          # Named port on the Service
      path: /metrics
      interval: 15s

The Prometheus Operator watches ServiceMonitor objects and automatically updates the Prometheus scrape configuration. No restart needed.

The release: monitoring label is critical. The Prometheus Operator only picks up ServiceMonitors that match its configured selector. The kube-prometheus-stack Helm chart sets this selector, and ServiceMonitors without the matching label are ignored silently.

A PodMonitor is the same concept but targets pods directly, without requiring a Service:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring-demo
spec:
  selector:
    matchLabels:
      app: batch-worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics

Use PodMonitor for pods that do not have a Service (batch jobs, cron jobs, standalone pods).

Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This is essential for dashboard performance.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: cpu-usage
      interval: 30s
      rules:
        - record: namespace:container_cpu_usage_seconds:sum_rate5m
          expr: |
            sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

Instead of every Grafana dashboard panel computing sum(rate(...)) on every page load, the recording rule computes it once every 30 seconds. Dashboards query the pre-computed namespace:container_cpu_usage_seconds:sum_rate5m metric, which is instant.

Recording rule naming convention: level:metric:operations. For example, namespace:container_cpu_usage_seconds:sum_rate5m means: aggregated at the namespace level, from the container_cpu_usage_seconds metric, using sum and rate over 5 minutes.

Alerting rules evaluate PromQL expressions and fire alerts when conditions are met:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been restarting."
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"} > 0.9
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage above 90%"

The for field requires the condition to stay true for the specified duration before the alert fires. With for: 5m and a 1-minute evaluation interval, the expression must be true for five consecutive evaluations. This prevents alerting on brief spikes.

Alert states:

  • Inactive: Expression is false
  • Pending: Expression is true but for duration not met
  • Firing: Expression is true and for duration exceeded
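The state transitions above can be sketched as follows (a simplification: real Prometheus tracks an activation timestamp per alert instance rather than counting evaluations):

```python
def alert_state(evals, eval_interval_s, for_s):
    """Return the alert state after a sequence of rule evaluations.

    evals: list of booleans, one per evaluation (True = expression holds).
    Simplified sketch of the inactive / pending / firing logic.
    """
    state = "inactive"
    true_run = 0
    for holds in evals:
        if not holds:
            true_run = 0          # any false evaluation resets the alert
            state = "inactive"
            continue
        true_run += 1
        held_for = (true_run - 1) * eval_interval_s
        state = "firing" if held_for >= for_s else "pending"
    return state

# for: 5m with a 1-minute evaluation interval
print(alert_state([True] * 6, 60, 300))  # firing (true for a full 5 minutes)
print(alert_state([True] * 3, 60, 300))  # pending (true, but not long enough)
```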

AlertManager receives alerts from Prometheus and routes them to notification channels.

AlertManager uses a tree-based routing configuration:

route:
  receiver: default-slack
  group_by: ['namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
    - match_re:
        namespace: "prod-.*"
      receiver: prod-team-slack
receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#alerts'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: <key>
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#warnings'

Key concepts:

  • group_by: Groups alerts with the same labels into a single notification. Without grouping, 100 pods in the same namespace would generate 100 separate alerts.
  • group_wait: How long to wait for more alerts in the same group before sending the first notification.
  • group_interval: How long to wait before sending updates for the same group.
  • repeat_interval: How long before resending an already-fired alert.

Inhibition suppresses alerts when a related, more severe alert is already firing:

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['namespace', 'alertname']

If a critical alert fires for namespace X, warning alerts for the same namespace are suppressed. This reduces noise during incidents.

Silences are temporary mutes for alerts during planned maintenance. Create them via the AlertManager UI, the API, or amtool:

amtool silence add alertname=PodCrashLooping namespace=monitoring-demo \
--duration=2h \
--comment="Planned restart during maintenance"

Grafana connects to Prometheus as a data source. The kube-prometheus-stack Helm chart configures this automatically.

datasources:
  - name: Prometheus
    type: prometheus
    url: http://monitoring-kube-prometheus-prometheus:9090
    access: proxy
    isDefault: true

The access: proxy mode means Grafana’s backend proxies queries to Prometheus. The browser never talks to Prometheus directly. This is more secure and avoids CORS issues.

Grafana dashboards use template variables for dynamic filtering. A $namespace variable populated by label_values(kube_pod_info, namespace) creates a dropdown that filters all panels. The pre-built dashboards include Compute Resources (per namespace/pod), Networking, and Node Exporter views.

Prometheus metrics can drive HPA scaling via the prometheus-adapter. The adapter queries Prometheus and exposes metrics through the Kubernetes Custom Metrics API. HPA reads these metrics and scales accordingly.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

The adapter transforms counter metrics (like http_requests_total) into rate metrics (http_requests_per_second) that HPA can use as scaling signals.
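A typical adapter rule for this transformation looks roughly like the following (a sketch of the prometheus-adapter rules format; the label names and the 2-minute window are assumptions):

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"   # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```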

Prometheus stores data in a custom TSDB on local disk. Data flows through an in-memory head block (most recent 2 hours), then gets compressed to persistent blocks. The demo uses retention: 2h. Production systems typically use 15-30 days.

For long-term storage, use remote write to Thanos (S3/GCS-backed), Cortex/Mimir (horizontally scalable), or VictoriaMetrics (better compression). Each time series consumes about 1-2 bytes per sample. With 10,000 series at 30-second scraping: ~55 MiB/day. Monitor prometheus_tsdb_head_series to track cardinality.
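The arithmetic behind that estimate, using the figures from this section:

```python
series = 10_000
scrape_interval_s = 30
bytes_per_sample = 2  # upper end of the 1-2 bytes-per-sample estimate

samples_per_day = series * (86_400 // scrape_interval_s)
bytes_per_day = samples_per_day * bytes_per_sample
print(samples_per_day)        # 28,800,000 samples per day
print(bytes_per_day / 2**20)  # ~55 MiB per day
```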

The Helm chart deploys several components:

Component             Purpose
Prometheus Operator   Manages Prometheus instances via CRDs
Prometheus            Scrapes and stores metrics
AlertManager          Routes alerts to notification channels
Grafana               Dashboards and visualization
kube-state-metrics    Exports Kubernetes object state as metrics
Node Exporter         Exports node hardware and OS metrics
Prometheus Adapter    Exposes custom metrics for HPA (optional)

Each component runs as a separate Deployment or DaemonSet. The Prometheus Operator watches for CRD changes and reconfigures Prometheus automatically.