Prometheus & Grafana: Deep Dive
This document explains the Prometheus data model, PromQL query language, scrape configuration, ServiceMonitor CRDs, recording and alerting rules, AlertManager routing, Grafana data sources, and using custom metrics for HPA.
The Prometheus Data Model
Prometheus stores data as time series. Every time series is uniquely identified by a metric name and a set of key-value labels.
Metric Format
```text
metric_name{label1="value1", label2="value2"} value timestamp
```

For example:

```text
container_cpu_usage_seconds_total{namespace="monitoring-demo", pod="sample-app-abc", container="nginx"} 42.5 1704067200
```

This single data point says: the container "nginx" in pod "sample-app-abc" in namespace "monitoring-demo" has used 42.5 CPU-seconds in total as of the given timestamp.
Metric Types
Prometheus defines four metric types:

Counter: Monotonically increasing value. Only goes up (or resets to 0 on restart). Examples: total HTTP requests, total bytes sent, total errors.

```text
http_requests_total{method="GET", status="200"} 1547
```

Gauge: Value that can go up and down. Represents a point-in-time snapshot. Examples: current temperature, memory usage, active connections.

```text
node_memory_MemAvailable_bytes 8589934592
```

Histogram: Samples observations and counts them in configurable, cumulative buckets. Also provides a sum and a count. Examples: request latency, response sizes.

```text
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="1.0"} 34534
http_request_duration_seconds_bucket{le="+Inf"} 34567
http_request_duration_seconds_sum 5765.123
http_request_duration_seconds_count 34567
```

Summary: Similar to a histogram, but calculates quantiles on the client side. Less flexible (quantiles cannot be aggregated across instances) but more accurate for a single instance.
Labels Are Dimensions
Labels turn a single metric name into a multi-dimensional data space. http_requests_total without labels is one time series. With labels {method, status, handler}, it becomes hundreds of time series, one for each unique combination.
This is powerful but dangerous. High-cardinality labels (user IDs, request IDs, IP addresses) create millions of time series and can crash Prometheus. Never use unbounded values as labels.
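The series explosion is simple multiplication. A quick Python sketch of the arithmetic (the label cardinalities below are hypothetical, for illustration only):

```python
# Each unique label combination becomes its own time series, so the
# series count is the product of each label's cardinality.
methods = 5          # GET, POST, PUT, DELETE, PATCH
statuses = 10        # 200, 201, 204, 301, 302, 400, 401, 404, 500, 503
handlers = 50        # distinct URL routes

bounded_series = methods * statuses * handlers
print(bounded_series)        # 2500 series: perfectly manageable

# Now add an unbounded user_id label, one value per user:
users = 1_000_000
unbounded_series = bounded_series * users
print(unbounded_series)      # 2,500,000,000 series: Prometheus falls over
```

Bounded labels multiply into thousands of series; one unbounded label multiplies that by the size of your user base.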
PromQL Functions
PromQL is Prometheus's query language. Understanding a few core functions covers most use cases.
rate()
Calculates the per-second rate of increase over a time range. Only works with counters.

```promql
rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])
```

This reads: "How fast is CPU usage increasing, averaged over the last 5 minutes?" The result is in CPU cores (seconds per second). A value of 0.5 means half a CPU core is being used.
The [5m] is the lookback window. It must be at least 2x the scrape interval. With a 30-second scrape, use at least [1m]. With a 15-second scrape, use at least [30s].
Why rate() over plain subtraction? Rate handles counter resets (when a pod restarts, the counter goes back to 0). Plain subtraction would show a huge negative value.
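To see why reset handling matters, here is a hedged Python sketch (a simplified model, not Prometheus's actual implementation) of computing a total increase over counter samples that may reset to zero:

```python
def reset_aware_increase(samples):
    """Total increase across counter samples, tolerating resets to 0.

    Mirrors the idea behind rate()/increase(): when a sample drops
    below its predecessor, assume the counter restarted at 0 and
    count the new value as pure increase.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:            # counter reset: the pod restarted
            total += curr
    return total

# Counter climbs to 900, pod restarts, counter climbs again.
samples = [100, 500, 900, 50, 200]
print(reset_aware_increase(samples))   # 1000.0
# Naive last-minus-first would give 200 - 100 = 100, and the
# step across the restart alone would look like -850.
```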
increase()
Total increase over a time range. Equivalent to rate() multiplied by the number of seconds in the range.

```promql
increase(http_requests_total{namespace="monitoring-demo"}[1h])
```

"How many requests were made in the last hour?" More intuitive than rate() for some metrics.
histogram_quantile()
Calculates quantiles from histogram buckets. This is how you get p50, p95, and p99 latencies.

```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{namespace="monitoring-demo"}[5m]))
```

"What is the 95th percentile request duration over the last 5 minutes?"
The function takes the bucket boundaries and interpolates. It is an approximation, not exact. Accuracy depends on bucket boundaries. If your buckets are [0.1, 0.5, 1.0] and most requests take 0.3 seconds, the p95 is interpolated between 0.1 and 0.5.
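The interpolation can be sketched in Python. This is a simplified model of what histogram_quantile does (assuming observations are uniformly distributed within each bucket), applied to the bucket counts from the earlier example:

```python
def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with float('inf'). Simplified model of histogram_quantile.
    """
    total = buckets[-1][1]
    rank = q * total                      # target observation rank
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound         # cannot interpolate past +Inf
            # linear interpolation inside this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

buckets = [(0.1, 24054), (0.5, 33444), (1.0, 34534), (float('inf'), 34567)]
p95 = quantile_from_buckets(0.95, buckets)
print(round(p95, 3))   # 0.474 -- interpolated between the 0.1 and 0.5 bounds
```

The estimate lands between 0.1 and 0.5 because that bucket contains the 95th-percentile observation; narrower buckets around your typical latencies give tighter estimates.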
Aggregation Operators
```promql
# Sum CPU across all pods in a namespace
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)

# Average memory per namespace
avg(container_memory_working_set_bytes) by (namespace)

# Max CPU across all nodes
max(node_cpu_seconds_total) by (instance)

# Count running pods
count(kube_pod_status_phase{phase="Running"})
```

by (label) groups results. Without it, you get a single aggregated value. With it, you get one value per unique label combination.
Common PromQL Patterns
CPU utilization as a percentage:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring-demo"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="monitoring-demo"}) by (pod)
* 100
```

Memory utilization as a percentage:

```promql
container_memory_working_set_bytes{namespace="monitoring-demo"}
/
kube_pod_container_resource_limits{resource="memory", namespace="monitoring-demo"}
* 100
```

Pod restart rate:

```promql
rate(kube_pod_container_status_restarts_total{namespace="monitoring-demo"}[5m]) > 0
```

Scrape Configuration
Prometheus pulls metrics from targets at regular intervals. This is the "pull model." Targets expose a /metrics endpoint that returns metrics in the Prometheus text format.
How Scraping Works
- Prometheus discovers targets (via static config, DNS, the Kubernetes API, etc.).
- At each scrape interval, it sends an HTTP GET to each target's /metrics endpoint.
- The response is parsed and ingested into the time series database.
- Failed scrapes are recorded in the up metric (0 = down, 1 = up).
The default scrape interval is 30 seconds. The kube-prometheus-stack Helm chart configures this.
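The heart of the pull model is just an HTTP GET followed by parsing the text exposition format. A toy Python parser for a subset of that format (this is an illustration, not the real Prometheus parser, and it skips escaping and edge cases):

```python
import re

def parse_metrics(text):
    """Parse a subset of the Prometheus text exposition format into
    {(name, frozenset(labels)): value}. Skips comments and blank lines."""
    line_re = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(\S+)')
    series = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):   # HELP/TYPE comments
            continue
        m = line_re.match(line)
        if not m:
            continue
        name, _, labels, value = m.groups()
        label_set = (frozenset(l.strip() for l in labels.split(','))
                     if labels else frozenset())
        series[(name, label_set)] = float(value)
    return series

# A response body like a target's /metrics endpoint might return:
payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET", status="200"} 1547
node_memory_MemAvailable_bytes 8589934592
"""
scraped = parse_metrics(payload)
print(len(scraped))   # 2 time series ingested
```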
Kubernetes Service Discovery
In Kubernetes, Prometheus discovers targets using the Kubernetes API:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

This discovers all pods with the annotation prometheus.io/scrape: "true" and scrapes their /metrics endpoint.
The demo’s sample app does not expose custom metrics, but the kube-prometheus-stack scrapes Kubernetes system components automatically:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: monitoring-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.3-alpine
          ports:
            - containerPort: 80
```

Even without custom metrics, Prometheus collects container-level metrics (CPU, memory, network) via cAdvisor and kubelet integration.
ServiceMonitor and PodMonitor CRDs
The kube-prometheus-stack introduces CRDs that replace raw scrape configuration with Kubernetes-native objects.
ServiceMonitor
Tells Prometheus to scrape the pods behind a Kubernetes Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app
  namespace: monitoring-demo
  labels:
    release: monitoring    # Must match the Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app: sample-app      # Matches the Service labels
  endpoints:
    - port: http           # Named port on the Service
      path: /metrics
      interval: 15s
```

The Prometheus Operator watches ServiceMonitor objects and automatically updates the Prometheus scrape configuration. No restart needed.
The release: monitoring label is critical. The Prometheus Operator only picks up ServiceMonitors that match its configured selector. The kube-prometheus-stack Helm chart sets this selector, and ServiceMonitors without the matching label are ignored silently.
PodMonitor
Same concept, but targets pods directly, without requiring a Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring-demo
spec:
  selector:
    matchLabels:
      app: batch-worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```

Use a PodMonitor for pods that do not have a Service (batch jobs, cron jobs, standalone pods).
Recording Rules
Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This is essential for dashboard performance.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: cpu-usage
      interval: 30s
      rules:
        - record: namespace:container_cpu_usage_seconds:sum_rate5m
          expr: |
            sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
```

Instead of every Grafana dashboard panel computing sum(rate(...)) on every page load, the recording rule computes it once every 30 seconds. Dashboards query the pre-computed namespace:container_cpu_usage_seconds:sum_rate5m metric, which is nearly instant.
Recording rule naming convention: level:metric:operations. For example, namespace:container_cpu_usage_seconds:sum_rate5m means: aggregated at the namespace level, from the container_cpu_usage_seconds metric, using sum and rate over 5 minutes.
Alerting Rules
Alerting rules evaluate PromQL expressions and fire alerts when conditions are met:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring-demo
  labels:
    release: monitoring
spec:
  groups:
    - name: pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been restarting."

        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"} > 0.9
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage above 90%"
```

The for field requires the condition to be true for the specified duration before the alert fires. A 5-minute for means the expression must remain true on every consecutive evaluation (at the rule group's evaluation interval) for 5 minutes. This prevents alerting on brief spikes.
Alert states:
- Inactive: Expression is false
- Pending: Expression is true, but the for duration has not yet been met
- Firing: Expression is true and the for duration has been exceeded
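The state machine can be sketched as a small evaluation loop. This is an illustrative model, not Prometheus's actual implementation (which tracks this per alert instance):

```python
def alert_states(evaluations, for_duration, interval):
    """Walk Inactive -> Pending -> Firing given a sequence of boolean
    evaluation results, a 'for' duration, and the evaluation interval
    (both in seconds). Returns the state after each evaluation."""
    states, true_since = [], None
    for i, condition_true in enumerate(evaluations):
        now = i * interval
        if not condition_true:
            true_since = None          # any false evaluation resets the timer
            states.append("inactive")
        else:
            if true_since is None:
                true_since = now
            if now - true_since >= for_duration:
                states.append("firing")
            else:
                states.append("pending")
    return states

# for: 5m (300s), evaluated every 60s: the condition must hold across
# five full minutes of evaluations before the alert fires.
print(alert_states([True] * 7, for_duration=300, interval=60))
```

Note how a single false evaluation drops the alert back to inactive and restarts the pending clock, which is exactly why brief spikes never fire.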
AlertManager Routing
AlertManager receives alerts from Prometheus and routes them to notification channels.
Routing Tree
AlertManager uses a tree-based routing configuration:

```yaml
route:
  receiver: default-slack
  group_by: ['namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
    - match_re:
        namespace: "prod-.*"
      receiver: prod-team-slack

receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#alerts'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: <key>
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/...
        channel: '#warnings'
```

Key concepts:
- group_by: Groups alerts with the same labels into a single notification. Without grouping, 100 pods in the same namespace would generate 100 separate alerts.
- group_wait: How long to wait for more alerts in the same group before sending the first notification.
- group_interval: How long to wait before sending updates for the same group.
- repeat_interval: How long before resending an already-fired alert.
Inhibition Rules
Inhibition suppresses alerts when a related, more severe alert is already firing:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['namespace', 'alertname']
```

If a critical alert fires in namespace X, warning alerts with the same namespace and alertname are suppressed. This reduces noise during incidents.
Silences
Temporary mutes for alerts during planned maintenance. Create them via the AlertManager UI or API:

```shell
amtool silence add alertname=PodCrashLooping namespace=monitoring-demo \
  --duration=2h \
  --comment="Planned restart during maintenance"
```

Grafana Data Sources
Grafana connects to Prometheus as a data source. The kube-prometheus-stack Helm chart configures this automatically.
Data Source Configuration
```yaml
datasources:
  - name: Prometheus
    type: prometheus
    url: http://monitoring-kube-prometheus-prometheus:9090
    access: proxy
    isDefault: true
```

The access: proxy mode means Grafana's backend proxies queries to Prometheus. The browser never talks to Prometheus directly. This is more secure and avoids CORS issues.
Dashboard Variables
Grafana dashboards use template variables for dynamic filtering. A $namespace variable populated by label_values(kube_pod_info, namespace) creates a dropdown that filters all panels. The pre-built dashboards include Compute Resources (per namespace/pod), Networking, and Node Exporter views.
Custom Metrics for HPA
Prometheus metrics can drive HPA scaling via the prometheus-adapter. The adapter queries Prometheus and exposes metrics through the Kubernetes Custom Metrics API. HPA reads these metrics and scales accordingly.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```

The adapter transforms counter metrics (like http_requests_total) into rate metrics (http_requests_per_second) that HPA can use as scaling signals.
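For an AverageValue pods metric like this, the HPA controller's core decision is the standard formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the min/max bounds. A sketch (the example numbers are hypothetical):

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=10):
    """Core HPA scaling formula for an AverageValue pods metric,
    clamped to the HPA's minReplicas/maxReplicas."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 180 req/s each, against a 100 req/s target:
print(desired_replicas(3, 180, 100))   # 6 -- scale up
# Load drops to 20 req/s per pod:
print(desired_replicas(6, 20, 100))    # 2 -- floored at minReplicas
```

The real controller adds tolerances and stabilization windows on top of this, but the ratio-and-ceiling core is why a per-pod average well above the target roughly doubles the fleet.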
Retention and Storage
Prometheus stores data in a custom TSDB on local disk. Data flows through an in-memory head block (the most recent 2 hours), then gets compacted into persistent blocks on disk. The demo uses retention: 2h. Production systems typically use 15-30 days.
For long-term storage, use remote write to Thanos (S3/GCS-backed), Cortex/Mimir (horizontally scalable), or VictoriaMetrics (better compression). Each time series consumes about 1-2 bytes per sample. With 10,000 series at 30-second scraping: ~55 MiB/day. Monitor prometheus_tsdb_head_series to track cardinality.
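The capacity math from the paragraph above, as a quick back-of-the-envelope script (2 bytes/sample is the rough upper end of the range quoted above):

```python
series = 10_000
scrape_interval_s = 30
bytes_per_sample = 2          # rough upper end after TSDB compression

samples_per_day = series * (24 * 3600 // scrape_interval_s)
bytes_per_day = samples_per_day * bytes_per_sample
mib_per_day = bytes_per_day / (1024 ** 2)
print(f"{mib_per_day:.0f} MiB/day")   # ~55 MiB/day
```

Swap in your own series count (from prometheus_tsdb_head_series) and scrape interval to size disks and retention.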
What kube-prometheus-stack Installs
The Helm chart deploys several components:
| Component | Purpose |
|---|---|
| Prometheus Operator | Manages Prometheus instances via CRDs |
| Prometheus | Scrapes and stores metrics |
| AlertManager | Routes alerts to notification channels |
| Grafana | Dashboards and visualization |
| kube-state-metrics | Exports Kubernetes object state as metrics |
| Node Exporter | Exports node hardware and OS metrics |
| Prometheus Adapter | Exposes custom metrics for HPA (optional) |
Each component runs as a separate Deployment or DaemonSet. The Prometheus Operator watches for CRD changes and reconfigures Prometheus automatically.