Grafana explained

An on-call engineer gets paged at 2 a.m. for elevated API errors. Prometheus fired the alert; the first question is always where to look. Opening a well-designed Grafana dashboard answers in seconds: error rate by deployment version, latency percentiles by route, saturation on the database pool, and a correlated log stream from Loki — all on one screen. Grafana is not another time-series database; it is the visualization and exploration layer that sits on top of Prometheus, Loki, Tempo, InfluxDB, CloudWatch, and dozens of other backends. Teams that instrument services but never invest in dashboard design still fly blind during incidents. This guide covers data sources and plugins, RED and USE dashboard layouts, variables and templating, Grafana unified alerting versus Alertmanager, dashboard-as-code with provisioning, a Harbor Fleet SRE board worked example, a tooling decision table, common pitfalls, and a production checklist — assuming familiarity with the three pillars of observability.

What Grafana is and where it fits

Grafana was first released as open source in 2014 as a graphing front-end for time-series data. Today it is the de facto UI for cloud-native operations: Kubernetes clusters, microservice meshes, data pipelines, and SaaS infrastructure. The server queries external systems through data source plugins, renders panels (time series, stat, gauge, heatmap, logs, traces), and optionally evaluates alert rules that route to Slack, PagerDuty, or Opsgenie.

Core concepts

  • Data source — a configured connection to Prometheus, Loki, Tempo, PostgreSQL, Elasticsearch, or a cloud vendor API.
  • Dashboard — a collection of panels arranged in rows, optionally parameterized with variables.
  • Panel — a single visualization bound to one or more queries against a data source.
  • Variable — a dropdown or text input that injects values into panel queries (cluster, namespace, service).
  • Folder and permissions — RBAC scopes who can view, edit, or administer dashboards per team.

Grafana does not replace Prometheus for scraping or Alertmanager for alert routing at scale — it complements them. Many teams keep Alertmanager as the authoritative paging path while using Grafana for exploration, ad-hoc analysis, and business-facing dashboards. Grafana Cloud adds hosted metrics, logs, traces, and synthetic monitoring when self-hosting the full LGTM stack (Loki, Grafana, Tempo, Mimir) is not worth the operational cost.

Connecting data sources

The Prometheus data source is the most common starting point. Point Grafana at your Prometheus server URL (or a Mimir/Cortex query frontend), set the data source's scrape interval option to match the Prometheus scrape interval so that it acts as the minimum step and range queries align with sample resolution, and enable exemplars if you want to link histogram buckets to Tempo traces.
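As a rough sketch of those settings, the snippet below builds the JSON body one might send to Grafana's data source HTTP API or mirror in a provisioning file. The URL, the 15s interval, the data source name, and the Tempo UID are placeholders rather than values from this guide, and the jsonData field names should be checked against the Grafana version in use.

```python
import json


def prometheus_datasource(url: str, scrape_interval: str, tempo_uid: str) -> dict:
    """Build a Prometheus data source definition (all values are hypothetical)."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",  # Grafana's backend performs the queries
        "url": url,
        "jsonData": {
            # Minimum step hint; should match the Prometheus scrape interval.
            "timeInterval": scrape_interval,
            # Link exemplar trace IDs to a Tempo data source by its UID.
            "exemplarTraceIdDestinations": [
                {"name": "trace_id", "datasourceUid": tempo_uid}
            ],
        },
    }


if __name__ == "__main__":
    body = prometheus_datasource("http://prometheus:9090", "15s", "tempo")
    print(json.dumps(body, indent=2))
```

Keeping the definition in code or a file, instead of clicking it together in the UI, makes the min step and exemplar link reviewable like any other change.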

For logs, add a Loki data source and query with LogQL:

{namespace="payments", app="order-api"} |= "error" | json | level="error"

For traces, connect Tempo or Jaeger and use the built-in trace view — jump from a slow span in a dashboard to the full request waterfall. Pairing all three on one dashboard (metrics spike, log lines, trace ID) is the practical payoff of OpenTelemetry instrumentation.
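Outside Grafana, the same LogQL can be sent to Loki's HTTP API, which is handy for scripted checks. The sketch below only builds the request URL for the query_range endpoint; the base URL and timestamps are made up, and label values are assumed not to contain quotes that would need escaping.

```python
from urllib.parse import urlencode


def loki_selector(labels: dict) -> str:
    """Render a LogQL stream selector, e.g. {app="order-api", namespace="payments"}."""
    pairs = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def loki_query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    selector = loki_selector({"namespace": "payments", "app": "order-api"})
    query = selector + ' |= "error" | json | level="error"'
    # Hypothetical Loki address and nanosecond timestamps.
    print(loki_query_range_url("http://loki:3100", query, 1, 2))
```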

Multi-tenant and high-availability patterns

  • Run Grafana behind OAuth (Google, GitHub, Okta) or SAML; map groups to folder roles.
  • Use read-only data source credentials in production; separate admin Grafana from public kiosk dashboards.
  • Front Prometheus with a query frontend (Mimir, Thanos) when dashboards query months of data across clusters.
  • Precompute heavy dashboard queries with recording rules in Prometheus rather than widening Grafana range windows.

Dashboard design: RED and USE

Good dashboards answer one operational question per row. Two frameworks dominate service dashboards:

RED (request-driven services)

  • Rate — requests per second (sum(rate(http_requests_total[5m]))).
  • Errors — ratio of 5xx or failed requests to total.
  • Duration — p50, p95, p99 latency from histogram buckets.
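The three RED signals map onto three PromQL expressions. The helper below is a minimal sketch that renders them for one service; only http_requests_total comes from this guide, while the service and status labels and the http_request_duration_seconds_bucket histogram are assumed naming conventions that may differ in your instrumentation.

```python
def red_queries(service: str, window: str = "5m") -> dict:
    """Return PromQL strings for Rate, Errors, and Duration (p99) of one service."""
    sel = f'service="{service}"'  # assumed label name
    return {
        "rate": f"sum(rate(http_requests_total{{{sel}}}[{window}]))",
        "errors": (
            f'sum(rate(http_requests_total{{{sel}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{sel}}}[{window}]))"
        ),
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{sel}}}[{window}])))"
        ),
    }


if __name__ == "__main__":
    for name, expr in red_queries("order-api").items():
        print(f"{name}: {expr}")
```

Generating the expressions from one function keeps the selector and window identical across the three panels, which is where hand-copied queries tend to drift.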

USE (resources: CPU, disk, network)

  • Utilization — percentage of capacity in use.
  • Saturation — queue depth, threads waiting, disk await time.
  • Errors — device or kernel error counters.

Layout convention: top row — golden signals (availability stat, error budget remaining tied to SLO targets); middle rows — RED per service or USE per node pool; bottom row — dependency health (database connections, cache hit rate, message queue lag). Avoid cramming forty panels on one board; split by audience (SRE overview vs deep-dive per team).

Panel types matter: time series for trends, stat with sparklines for at-a-glance health, heatmap for latency distribution over time, logs panel for Loki tailing during incidents. Set sensible defaults: log scale for byte counters, unit suffixes (ms, percent, bytes), and thresholds that turn panels red before humans read the legend.
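Units and thresholds live in a panel's fieldConfig in the dashboard JSON. The following sketch builds such a panel as a Python dict; the title, query, and threshold values are illustrative, and the structure reflects the dashboard JSON model as commonly exported, so compare it with an export from your own Grafana before relying on it.

```python
def latency_panel(title: str, expr: str, warn_ms: float, crit_ms: float) -> dict:
    """Build a time series panel with a millisecond unit and colour thresholds."""
    return {
        "type": "timeseries",
        "title": title,
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "unit": "ms",  # rendered as a suffix so nobody guesses the scale
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step
                        {"color": "orange", "value": warn_ms},
                        {"color": "red", "value": crit_ms},
                    ],
                },
            },
            "overrides": [],
        },
    }


if __name__ == "__main__":
    # Hypothetical query and limits.
    panel = latency_panel("p99 latency", "p99_latency_ms", 300, 800)
    print(panel["fieldConfig"]["defaults"]["thresholds"]["steps"])
```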

Variables, templating, and reuse

Hard-coding namespace="payments" in every panel breaks when you clone the dashboard for staging. Dashboard variables fix this:

  • Query variable — populate dropdown from Prometheus label values: label_values(kube_pod_info, namespace).
  • Custom variable — fixed list (prod, staging) for environments without Kubernetes.
  • Chained variables — service list filtered by selected namespace.
  • Interval variable — the built-in $__rate_interval adjusts rate() windows to the dashboard zoom level and never drops below four scrape intervals.
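Grafana documents $__rate_interval as the larger of $__interval plus the scrape interval and four times the scrape interval. The arithmetic is small enough to sketch directly; the function name and the sample numbers are illustrative only.

```python
def rate_interval(interval_s: float, scrape_interval_s: float) -> float:
    """Approximate $__rate_interval in seconds.

    interval_s is the step Grafana derives from the time range and panel width
    ($__interval); scrape_interval_s is the data source's scrape interval.
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    # Zoomed in: the floor of four scrape intervals applies.
    print(rate_interval(15, 15))   # 60
    # Zoomed out: the window grows with the step.
    print(rate_interval(120, 15))  # 135
```

The floor matters because rate() needs at least two samples in its window; a window narrower than the scrape interval would return gaps.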

Use consistent variable names across team dashboards ($cluster, $namespace, $service) so on-call muscle memory transfers. Enable repeat rows or repeat panels to fan out the same RED row per service without copy-paste drift. Export the finished JSON to Git for review — diffing dashboard JSON in pull requests catches accidental query changes.
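Exported JSON diffs more cleanly if fields that change on every save are stripped and keys are sorted first. This is a minimal sketch of such a normalizer; the list of volatile top-level keys is an assumption and may need extending for your Grafana version.

```python
import json

# Top-level fields assumed to change on save or import without affecting content.
VOLATILE_KEYS = ("id", "version", "iteration")


def normalize_dashboard(dashboard: dict) -> str:
    """Return a stable text form of a dashboard for committing to Git."""
    cleaned = {key: value for key, value in dashboard.items() if key not in VOLATILE_KEYS}
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = {"id": 7, "version": 3, "title": "Overview", "panels": []}
    print(normalize_dashboard(exported), end="")
```

Running this as a pre-commit step means a pull request shows only real query or layout changes.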

Alerting: Grafana vs Alertmanager

Grafana 8 introduced unified alerting: alert rules are defined as queries and expressions against any supported data source, and the resulting alerts are routed through notification policies to contact points, with mute timings for scheduled quiet periods. This overlaps with Prometheus + Alertmanager, so teams need a deliberate split:

Concern | Alertmanager (Prometheus) | Grafana unified alerting
Primary signal | PromQL on scraped metrics | Any data source (Prometheus, Loki, SQL, CloudWatch)
Routing maturity | Inhibition, grouping, silences battle-tested | Improving; good for multi-source rules
GitOps | Alert rules in Prometheus YAML | Provisioning via files or Terraform
Best for | Symptom-based SLO pages from metrics | Log-based alerts, SQL thresholds, mixed sources

A common pattern: page from Alertmanager on burn-rate SLO violations; use Grafana alerts for Loki patterns (“payment webhook signature failures”) or business KPIs in PostgreSQL. Never route the same condition through both paths without deduplication — duplicate pages erode on-call trust faster than missed alerts.
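For readers unfamiliar with burn rate: it is the observed error ratio divided by the error ratio the SLO allows. The sketch below shows that arithmetic and a two-window paging check; the 14.4 threshold is a commonly cited value for a one-hour window against a 30-day budget and is used here as an assumption, not a recommendation from this guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical 99.9% SLO with 2% errors over 1h and 3% over 5m.
    print(should_page(0.02, 0.03, 0.999))
```

Requiring the short window as well stops the page from lingering after the error rate has already recovered.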

Dashboard-as-code and provisioning

Click-ops dashboards rot: someone edits a panel during an incident, nobody commits the change, and staging diverges from production. Treat dashboards like application code:

  • Export JSON or author with Grafonnet / Jsonnet for composable libraries.
  • Mount provisioning YAML in /etc/grafana/provisioning/dashboards/ pointing at a Git-synced directory.
  • Manage data sources and contact points via Terraform grafana_* resources or Helm values.
  • Post a deploy annotation carrying the git_sha on each release so regressions can be correlated with the change that caused them.

Mark exploratory boards as editable; mark production SRE boards as provisioned read-only so only CI can update them. Pair with structured logging conventions so Loki panels use the same field names your apps emit.
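A CI job can enforce these conventions before a dashboard is provisioned. The lint function below is one possible sketch: it treats a missing editable flag as editable and expects every graph-like panel to declare a unit. Both rules are policy assumptions for illustration, not Grafana requirements.

```python
# Panel types for which a unit makes no sense.
UNITLESS_PANEL_TYPES = {"row", "logs", "text"}


def lint_dashboard(dashboard: dict) -> list:
    """Return a list of human-readable problems found in a dashboard JSON model."""
    problems = []
    if dashboard.get("editable", True):
        problems.append("dashboard is editable; production boards should be read-only")
    for panel in dashboard.get("panels", []):
        if panel.get("type") in UNITLESS_PANEL_TYPES:
            continue
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"panel {panel.get('title', '<untitled>')!r} has no unit")
    return problems


if __name__ == "__main__":
    board = {"editable": True, "panels": [{"type": "stat", "title": "value"}]}
    for problem in lint_dashboard(board):
        print(problem)
```

Failing the pipeline on a non-empty result keeps unlabeled or editable boards out of the production folder.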

Worked example: Harbor Fleet SRE board

Harbor Fleet runs twelve microservices on Kubernetes behind a shared ingress. The SRE team provisions one Grafana folder with three dashboards:

  1. Overview — variables for $cluster and $namespace. Top stat row: global availability (1 minus error ratio), remaining monthly error budget, ingress p99. RED rows repeat per $service from a query variable.
  2. Ingress and edge — request rate and 5xx ratio from nginx ingress metrics; TLS cert expiry gauge; WAF block rate from Loki parsing JSON access logs.
  3. Data plane — PostgreSQL connections, replication lag, Redis memory pressure; Kafka consumer group lag panel with alert threshold at five minutes behind.
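The two headline stats on the Overview board reduce to simple arithmetic. The sketch below assumes request counts over the SLO window are available and uses a hypothetical 99.9% target; Harbor Fleet's actual target is not stated in this guide.

```python
def availability(error_ratio: float) -> float:
    """Availability as 1 minus the error ratio."""
    return 1.0 - error_ratio


def error_budget_remaining(failed: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left in the window; negative means overspent."""
    if total <= 0 or not 0 < slo_target < 1:
        raise ValueError("need total > 0 and 0 < slo_target < 1")
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(round(availability(250 / 1_000_000), 5))
    print(round(error_budget_remaining(250, 1_000_000, 0.999), 3))
```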

Prometheus evaluates two burn-rate rules on the same availability ratio shown in the overview stat, and Alertmanager routes the resulting pages. Grafana alerts fire when Loki sees level="fatal" in the payments namespace more than five times in ten minutes. During a June rollout incident, the on-call engineer selects the bad deployment version in a $version variable, confirms a p99 latency spike on /api/orders only, clicks an exemplar dot to open the Tempo trace, finds a downstream timeout to the inventory service, and rolls back, all without writing ad-hoc PromQL under pressure.
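The Loki-based rule can be expressed as a count over a ten-minute window compared against five. The sketch below pairs a plausible LogQL expression with a plain-Python version of the same condition for testing the logic offline; the json parsing stage assumes the services emit JSON logs with a level field, which this guide implies but does not spell out.

```python
# LogQL a Grafana alert rule might evaluate; the "> 5" comparison would be
# applied by the rule's threshold expression.
FATAL_COUNT_QUERY = 'sum(count_over_time({namespace="payments"} | json | level="fatal" [10m]))'


def fires(event_times_s: list, now_s: float, window_s: float = 600,
          threshold: int = 5) -> bool:
    """True if more than `threshold` events fall inside the trailing window."""
    recent = [t for t in event_times_s if now_s - window_s < t <= now_s]
    return len(recent) > threshold


if __name__ == "__main__":
    # Six fatal log lines in the last ten minutes (timestamps in seconds).
    print(fires([410, 420, 500, 650, 800, 990], now_s=1000))
```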

Tooling decision table

Need | Grafana | Alternative
Metrics + logs + traces in one UI | Native multi-source dashboards | Datadog / New Relic (commercial APM)
Prometheus-only graphs, minimal ops | Grafana or Prometheus UI | Perses, custom React + PromQL API
Log exploration without metrics | Loki + Grafana Explore | Kibana (Elasticsearch), Graylog
Embedded analytics for customers | Grafana with auth proxy (heavy) | Metabase, Superset, Hex
Fully managed observability | Grafana Cloud | Datadog, Honeycomb, Dynatrace

Common pitfalls

  • Dashboard sprawl — hundreds of unmaintained boards; enforce folder ownership and delete unused dashboards quarterly.
  • Queries without recording rules — dashboards that time out at range > 7 days; pre-aggregate in Prometheus.
  • Duplicate alerting — Grafana and Alertmanager both page on the same error-rate condition.
  • Missing units and legends — panels labeled “value” force on-call to guess milliseconds vs seconds.
  • Over-broad variables — defaulting $service to all hides per-service regressions in aggregated graphs.
  • Editable production boards — incident tweaks never synced back to Git; next deploy wipes them.
  • Ignoring Explore — engineers who only use fixed dashboards cannot ad-hoc debug novel failures.

Production checklist

  • Standardize RED service dashboards before writing alert rules.
  • Configure Prometheus, Loki, and Tempo data sources with correct min step and exemplar links.
  • Define shared variables ($cluster, $namespace, $service) across folders.
  • Provision production dashboards from Git; restrict edit permissions.
  • Document which alerts page via Alertmanager vs Grafana; avoid overlap.
  • Add SLO and error-budget panels aligned with documented targets.
  • Set dashboard refresh intervals appropriate to incident use (30s ops, 5m exec views).
  • Run Grafana behind SSO; audit folder permissions each quarter.
  • Test notification contact points after credential rotation.
  • Train on-call on Explore mode and trace/log correlation before the first real page.

Key takeaways

  • Grafana visualizes data from Prometheus, Loki, Tempo, and cloud APIs — it does not replace your metrics or log stores.
  • Structure service dashboards around RED and resource dashboards around USE; one question per row.
  • Variables and provisioning keep dashboards consistent across environments and prevent click-ops drift.
  • Split Alertmanager paging (metric SLOs) from Grafana alerts (logs, SQL, multi-source) with clear ownership.
  • During incidents, the goal is fast correlation: metrics, logs, and traces on linked panels beat heroic ad-hoc querying.
