Prometheus & Grafana: Deep Dive

This document explains the Prometheus data model, PromQL query language, scrape configuration, ServiceMonitor CRDs, recording and alerting rules, AlertManager routing, Grafana data sources, and using custom metrics for HPA.

Prometheus stores data as time series. Every time series is uniquely identified by a metric name and a set of key-value labels.

metric_name{label1="value1", label2="value2"} value timestamp

For example:

container_cpu_usage_seconds_total{namespace="monitoring-demo", pod="sample-app-abc", container="nginx"} 42.5 1704067200

This single data point says: the container “nginx” in pod “sample-app-abc” in namespace “monitoring-demo” has used 42.5 CPU-seconds total at the given timestamp.

Prometheus defines four metric types:

Counter: Monotonically increasing value. Only goes up (or resets to 0 on restart). Examples: total HTTP requests, total bytes sent, total errors.

http_requests_total{method="GET", status="200"} 1547

Gauge: Value that goes up and down. Represents a snapshot. Examples: current temperature, memory usage, active connections.

node_memory_MemAvailable_bytes 8589934592

Histogram: Samples observations and counts them in configurable buckets. Also provides sum and count. Examples: request latency, response sizes.

http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34534
http_request_duration_seconds_bucket{le="+Inf"} 34567
http_request_duration_seconds_sum 5765.123
http_request_duration_seconds_count 34567

Summary: Similar to histogram but calculates quantiles on the client side. Less flexible (quantiles cannot be aggregated across instances) but more accurate for a single instance.

Labels turn a single metric name into a multi-dimensional data space. http_requests_total without labels is one time series. With labels {method, status, handler}, it becomes hundreds of time series, one for each unique combination.

This is powerful but dangerous. High-cardinality labels (user IDs, request IDs, IP addresses) create millions of time series and can crash Prometheus. Never use unbounded values as labels.
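The multiplication is easy to underestimate. A quick sketch of how label combinations turn into series counts (the label values here are illustrative, not from the demo):

```python
from itertools import product

# Illustrative label values for http_requests_total (counts are assumptions)
methods = ["GET", "POST", "PUT", "DELETE", "PATCH"]    # 5 values
statuses = ["200", "301", "400", "404", "500"]         # 5 values
handlers = [f"/api/endpoint{i}" for i in range(20)]    # 20 values

# One time series exists per unique label combination
series = list(product(methods, statuses, handlers))
print(len(series))  # 5 * 5 * 20 = 500 time series

# Adding an unbounded label (e.g. user_id with a million values)
# multiplies again: 500 * 1,000,000 = 500 million series.
```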

PromQL is Prometheus’s query language. Understanding a few core functions covers most use cases.

rate() calculates the per-second average rate of increase over a time range. It only works with counters.

rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])

This reads: “How fast is CPU usage increasing, averaged over the last 5 minutes?” The result is in CPU cores (seconds per second). A value of 0.5 means half a CPU core is being used.

The [5m] is the lookback window. It must be at least 2x the scrape interval. With a 30-second scrape, use at least [1m]. With a 15-second scrape, use at least [30s].

Why rate() over plain subtraction? Rate handles counter resets (when a pod restarts, the counter goes back to 0). Plain subtraction would show a huge negative value.
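The reset handling can be sketched in a few lines. This is a simplified version of what rate() does internally; the real implementation also extrapolates to the window boundaries:

```python
def simple_rate(samples, window_seconds):
    """Per-second rate over (timestamp, value) counter samples.

    Handles counter resets the way rate() does: when a value drops,
    the counter is assumed to have restarted from 0, so the new value
    is counted as fresh increase. Simplified sketch only.
    """
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:  # counter reset: pod restarted, counter went back to 0
            increase += curr
    return increase / window_seconds

# Counter climbs to 100, pod restarts (reset), climbs again to 50.
samples = [(0, 80.0), (60, 100.0), (120, 10.0), (180, 50.0)]
print(simple_rate(samples, 180))  # true increase = 20 + 10 + 40 = 70 -> ~0.39/s
# Plain subtraction (50 - 80 = -30) would report a negative rate.
```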

increase() returns the total increase over a time range. It is equivalent to rate() multiplied by the number of seconds in the range.

increase(http_requests_total{namespace="monitoring-demo"}[1h])

“How many requests were made in the last hour?” More intuitive than rate for some metrics.

histogram_quantile() calculates quantiles from histogram buckets. This is how you get p50, p95, and p99 latencies.

histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket{namespace="monitoring-demo"}[5m])
)

“What is the 95th percentile request duration over the last 5 minutes?”

The function takes the bucket boundaries and interpolates. It is an approximation, not exact. Accuracy depends on bucket boundaries. If your buckets are [0.1, 0.5, 1.0] and most requests take 0.3 seconds, the p95 is interpolated between 0.1 and 0.5.
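The interpolation can be reproduced directly from the bucket counts shown earlier. This is a simplified sketch of the core calculation (Prometheus does essentially this per group of series, with extra edge-case handling for the +Inf bucket):

```python
def hist_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), ending in +Inf.
    Assumes observations are spread uniformly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total  # the observation index we are looking for
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# The bucket counts from the histogram example above
buckets = [(0.1, 24054), (0.5, 33444), (1.0, 34534), (float("inf"), 34567)]
print(hist_quantile(0.95, buckets))  # ~0.474 seconds
```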

# Per-pod CPU usage within a namespace
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)
# Average memory per namespace
avg(container_memory_working_set_bytes) by (namespace)
# Max per-node CPU usage rate
max(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
# Count running pods
count(kube_pod_status_phase{phase="Running"})

by (label) groups results. Without it, you get a single aggregated value. With it, you get one value per unique label combination.

CPU utilization as percentage:

sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="monitoring-demo"}) by (pod)
* 100

Memory utilization as percentage:

container_memory_working_set_bytes{namespace="monitoring-demo"}
/
kube_pod_container_resource_limits{resource="memory", namespace="monitoring-demo"}
* 100

Pod restart rate:

rate(kube_pod_container_status_restarts_total{namespace="monitoring-demo"}[5m]) > 0

Prometheus pulls metrics from targets at regular intervals. This is the “pull model.” Targets expose a /metrics endpoint that returns metrics in the Prometheus text format.

  1. Prometheus discovers targets (via config, DNS, Kubernetes API, etc.).
  2. At each scrape interval, it sends an HTTP GET to each target’s /metrics endpoint.
  3. The response is parsed and ingested into the time series database.
  4. Failed scrapes are recorded in the up metric (0 = down, 1 = up).
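A /metrics response uses the Prometheus text exposition format. An illustrative fragment (not output from the demo app):

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1547
http_requests_total{method="POST",status="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 28311552
```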

Prometheus's own default scrape interval is 1 minute; the kube-prometheus-stack Helm chart sets it to 30 seconds.

In Kubernetes, Prometheus discovers targets using the Kubernetes API:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

This discovers all pods with the annotation prometheus.io/scrape: "true" and scrapes their /metrics endpoint.
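A pod opts in through annotations like the following (a sketch; the port and path annotations are a common convention, usually paired with additional relabel rules, and the image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: annotated-app
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/port: "8080"     # conventionally overrides the scrape port
    prometheus.io/path: "/metrics" # conventionally overrides the metrics path
spec:
  containers:
    - name: app
      image: example/app:1.0       # hypothetical image
```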

The demo’s sample app does not expose custom metrics, but the kube-prometheus-stack scrapes Kubernetes system components automatically:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: monitoring-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.3-alpine
          ports:
            - containerPort: 80

Even without custom metrics, Prometheus collects container-level metrics (CPU, memory, network) via cAdvisor and kubelet integration.

The kube-prometheus-stack introduces CRDs that replace raw scrape configuration with Kubernetes-native objects.

A ServiceMonitor tells Prometheus to scrape the pods behind a Kubernetes Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app
  namespace: monitoring-demo
  labels:
    release: monitoring   # Must match the Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app: sample-app     # Matches the Service's labels
  endpoints:
    - port: http          # Named port on the Service
      path: /metrics
      interval: 15s

The Prometheus Operator watches ServiceMonitor objects and automatically updates the Prometheus scrape configuration. No restart needed.

The release: monitoring label is critical. The Prometheus Operator only picks up ServiceMonitors that match its configured selector. The kube-prometheus-stack Helm chart sets this selector, and ServiceMonitors without the matching label are ignored silently.

A PodMonitor is the same concept but targets pods directly, without requiring a Service:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring-demo
spec:
  selector:
    matchLabels:
      app: batch-worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics

Use PodMonitor for pods that do not have a Service (batch jobs, cron jobs, standalone pods).

Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This is essential for dashboard performance.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: cpu-usage
      interval: 30s
      rules:
        - record: namespace:container_cpu_usage_seconds:sum_rate5m
          expr: |
            sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

Instead of every Grafana dashboard panel computing sum(rate(...)) on every page load, the recording rule computes it once every 30 seconds. Dashboards query the pre-computed namespace:container_cpu_usage_seconds:sum_rate5m metric, which is instant.

Recording rule naming convention: level:metric:operations. For example, namespace:container_cpu_usage_seconds:sum_rate5m means: aggregated at the namespace level, from the container_cpu_usage_seconds metric, using sum and rate over 5 minutes.

Alerting rules evaluate PromQL expressions and fire alerts when conditions are met:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been restarting."
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"} > 0.9
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage above 90%"

The for field requires the condition to stay true for the specified duration before the alert fires. With for: 5m and a 1-minute evaluation interval, the expression must be true for five consecutive evaluations. This prevents alerting on brief spikes.

Alert states:

  • Inactive: Expression is false
  • Pending: Expression is true but for duration not met
  • Firing: Expression is true and for duration exceeded
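The state transitions above can be sketched as follows (a simplification: real Prometheus tracks an activation timestamp per alert instance rather than counting evaluations):

```python
def alert_state(evals, eval_interval_s, for_s):
    """Return the alert state after a sequence of rule evaluations.

    evals: list of booleans, one per evaluation (True = expression holds).
    Simplified sketch of the inactive / pending / firing logic.
    """
    state = "inactive"
    true_run = 0
    for holds in evals:
        if not holds:
            true_run = 0          # any false evaluation resets the alert
            state = "inactive"
            continue
        true_run += 1
        held_for = (true_run - 1) * eval_interval_s
        state = "firing" if held_for >= for_s else "pending"
    return state

# for: 5m with a 1-minute evaluation interval
print(alert_state([True] * 6, 60, 300))  # firing (true for a full 5 minutes)
print(alert_state([True] * 3, 60, 300))  # pending (true, but not long enough)
```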

AlertManager receives alerts from Prometheus and routes them to notification channels.

AlertManager uses a tree-based routing configuration:

route:
  receiver: default-slack
  group_by: ['namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
    - match_re:
        namespace: "prod-.*"
      receiver: prod-team-slack
receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#alerts'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: <key>
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#warnings'

Key concepts:

  • group_by: Groups alerts with the same labels into a single notification. Without grouping, 100 pods in the same namespace would generate 100 separate alerts.
  • group_wait: How long to wait for more alerts in the same group before sending the first notification.
  • group_interval: How long to wait before sending updates for the same group.
  • repeat_interval: How long before resending an already-fired alert.

Inhibition suppresses alerts when a related, more severe alert is already firing:

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['namespace', 'alertname']

If a critical alert fires for namespace X, warning alerts for the same namespace are suppressed. This reduces noise during incidents.

Silences are temporary mutes for alerts during planned maintenance. Create them via the AlertManager UI, the API, or amtool:

amtool silence add alertname=PodCrashLooping namespace=monitoring-demo \
--duration=2h \
--comment="Planned restart during maintenance"

Grafana connects to Prometheus as a data source. The kube-prometheus-stack Helm chart configures this automatically.

datasources:
  - name: Prometheus
    type: prometheus
    url: http://monitoring-kube-prometheus-prometheus:9090
    access: proxy
    isDefault: true

The access: proxy mode means Grafana’s backend proxies queries to Prometheus. The browser never talks to Prometheus directly. This is more secure and avoids CORS issues.

Grafana dashboards use template variables for dynamic filtering. A $namespace variable populated by label_values(kube_pod_info, namespace) creates a dropdown that filters all panels. The pre-built dashboards include Compute Resources (per namespace/pod), Networking, and Node Exporter views.

Prometheus metrics can drive HPA scaling via the prometheus-adapter. The adapter queries Prometheus and exposes metrics through the Kubernetes Custom Metrics API. HPA reads these metrics and scales accordingly.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

The adapter transforms counter metrics (like http_requests_total) into rate metrics (http_requests_per_second) that HPA can use as scaling signals.
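A typical adapter rule for this transformation looks roughly like the following (a sketch of the prometheus-adapter rules format; the label names and the 2-minute window are assumptions):

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"   # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```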

Prometheus stores data in a custom TSDB on local disk. Data flows through an in-memory head block (most recent 2 hours), then gets compressed to persistent blocks. The demo uses retention: 2h. Production systems typically use 15-30 days.

For long-term storage, use remote write to Thanos (S3/GCS-backed), Cortex/Mimir (horizontally scalable), or VictoriaMetrics (better compression). Each time series consumes about 1-2 bytes per sample. With 10,000 series at 30-second scraping: ~55 MiB/day. Monitor prometheus_tsdb_head_series to track cardinality.
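The arithmetic behind that estimate, using the figures from this section:

```python
series = 10_000
scrape_interval_s = 30
bytes_per_sample = 2  # upper end of the 1-2 bytes-per-sample estimate

samples_per_day = series * (86_400 // scrape_interval_s)
bytes_per_day = samples_per_day * bytes_per_sample
print(samples_per_day)        # 28,800,000 samples per day
print(bytes_per_day / 2**20)  # ~55 MiB per day
```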

The Helm chart deploys several components:

Component             Purpose
Prometheus Operator   Manages Prometheus instances via CRDs
Prometheus            Scrapes and stores metrics
AlertManager          Routes alerts to notification channels
Grafana               Dashboards and visualization
kube-state-metrics    Exports Kubernetes object state as metrics
Node Exporter         Exports node hardware and OS metrics
Prometheus Adapter    Exposes custom metrics for HPA (optional)

Each component runs as a separate Deployment or DaemonSet. The Prometheus Operator watches for CRD changes and reconfigures Prometheus automatically.