Grafana explained
An on-call engineer gets paged at 2 a.m. for elevated API errors. Prometheus fired the alert; the first question is always where to look. Opening a well-designed Grafana dashboard answers in seconds: error rate by deployment version, latency percentiles by route, saturation on the database pool, and a correlated log stream from Loki — all on one screen. Grafana is not another time-series database; it is the visualization and exploration layer that sits on top of Prometheus, Loki, Tempo, InfluxDB, CloudWatch, and dozens of other backends. Teams that instrument services but never invest in dashboard design still fly blind during incidents. This guide covers data sources and plugins, RED and USE dashboard layouts, variables and templating, Grafana unified alerting versus Alertmanager, dashboard-as-code with provisioning, a Harbor Fleet SRE board worked example, a tooling decision table, common pitfalls, and a production checklist — assuming familiarity with the three pillars of observability.
What Grafana is and where it fits
Torkel Ödegaard released Grafana as open source in 2014 as a graphing front-end for time-series data; the company now called Grafana Labs formed around the project afterwards. Today it is the de facto UI for cloud-native operations: Kubernetes clusters, microservice meshes, data pipelines, and SaaS infrastructure. The server queries external systems through data source plugins, renders panels (time series, stat, gauge, heatmap, logs, traces), and optionally evaluates alert rules that route to Slack, PagerDuty, or Opsgenie.
Core concepts
- Data source — a configured connection to Prometheus, Loki, Tempo, PostgreSQL, Elasticsearch, or a cloud vendor API.
- Dashboard — a collection of panels arranged in rows, optionally parameterized with variables.
- Panel — a single visualization bound to one or more queries against a data source.
- Variable — a dropdown or text input that injects values into panel queries (cluster, namespace, service).
- Folder and permissions — RBAC scopes who can view, edit, or administer dashboards per team.
Grafana does not replace Prometheus for scraping or Alertmanager for alert routing at scale — it complements them. Many teams keep Alertmanager as the authoritative paging path while using Grafana for exploration, ad-hoc analysis, and business-facing dashboards. Grafana Cloud adds hosted metrics, logs, traces, and synthetic monitoring when self-hosting the full LGTM stack (Loki, Grafana, Tempo, Mimir) is not worth the operational cost.
Connecting data sources
The Prometheus data source is the most common starting point. Point Grafana at your Prometheus server URL (or the Mimir/Cortex query frontend), set the data source's scrape interval to match the scrape_interval configured in Prometheus so that Grafana's minimum step and $__rate_interval align with sample resolution, and enable exemplars if you link histogram buckets to Tempo traces.
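The settings above can also be written as a data source provisioning document. The following is a minimal sketch rather than a definitive configuration: the URL, the 15s interval, the Tempo data source UID, and the trace_id exemplar label are assumptions chosen for illustration. Grafana reads provisioning files as YAML; the sketch prints JSON only to stay within the Python standard library.

```python
import json


def prometheus_datasource(url: str, scrape_interval: str, tempo_uid: str) -> dict:
    """Build a provisioning document for one Prometheus data source."""
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": "Prometheus",
                "type": "prometheus",
                "access": "proxy",
                "url": url,
                "jsonData": {
                    # Should match scrape_interval in the Prometheus config.
                    "timeInterval": scrape_interval,
                    # Link exemplar trace IDs to a Tempo data source (assumed UID).
                    "exemplarTraceIdDestinations": [
                        {"name": "trace_id", "datasourceUid": tempo_uid}
                    ],
                },
            }
        ],
    }


# Hypothetical values for illustration only.
doc = prometheus_datasource("http://prometheus:9090", "15s", "tempo")
print(json.dumps(doc, indent=2))
```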
For logs, add a Loki data source and query with LogQL:
{namespace="payments", app="order-api"} |= "error" | json | level="error"
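To show how the query above is assembled, here is a small sketch that composes the same LogQL string from its parts: a stream selector, a line filter, a parser stage, and a label filter. The helper name build_logql is hypothetical, and it does no escaping, so it is only suitable for trusted, simple values.

```python
from typing import Dict, Optional


def build_logql(
    labels: Dict[str, str],
    contains: Optional[str] = None,
    parser: Optional[str] = None,
    field_equals: Optional[Dict[str, str]] = None,
) -> str:
    """Compose a LogQL log query from a selector and optional pipeline stages."""
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    parts = ["{" + selector + "}"]
    if contains is not None:
        # Line filter: keep only lines containing this substring.
        parts.append(f'|= "{contains}"')
    if parser is not None:
        # Parser stage, for example "json" or "logfmt".
        parts.append(f"| {parser}")
    for key, value in (field_equals or {}).items():
        # Label filter applied to fields extracted by the parser.
        parts.append(f'| {key}="{value}"')
    return " ".join(parts)


query = build_logql(
    {"namespace": "payments", "app": "order-api"},
    contains="error",
    parser="json",
    field_equals={"level": "error"},
)
print(query)
```

Placing the line filter before the parser stage means Loki can discard non-matching lines before it spends time parsing JSON.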
For traces, connect Tempo or Jaeger and use the built-in trace view — jump from a slow span in a dashboard to the full request waterfall. Pairing all three on one dashboard (metrics spike, log lines, trace ID) is the practical payoff of OpenTelemetry instrumentation.
Multi-tenant and high-availability patterns
- Run Grafana behind OAuth (Google, GitHub, Okta) or SAML; map groups to folder roles.
- Use read-only data source credentials in production; separate admin Grafana from public kiosk dashboards.
- Front Prometheus with a query frontend (Mimir, Thanos) when dashboards query months of data across clusters.
- Precompute heavy dashboard queries with recording rules in Prometheus rather than having Grafana evaluate expensive expressions over wide range windows.
Dashboard design: RED and USE
Good dashboards answer one operational question per row. Two frameworks dominate service dashboards:
RED (request-driven services)
- Rate — requests per second (sum(rate(http_requests_total[5m]))).
- Errors — ratio of 5xx or failed requests to total.
- Duration — p50, p95, p99 latency from histogram buckets.
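The three RED signals map to three PromQL expressions. The sketch below generates them for a given label selector. Only http_requests_total comes from the text above; the status label and the http_request_duration_seconds_bucket histogram are assumed names that depend on how your services are instrumented.

```python
def red_queries(selector: str, window: str = "$__rate_interval") -> dict:
    """Return PromQL for rate, error ratio, and p99 latency.

    `selector` must be a non-empty label matcher list such as
    'namespace="payments"'. Metric and label names are assumptions.
    """
    total = f"sum(rate(http_requests_total{{{selector}}}[{window}]))"
    failed = (
        f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}[{window}]))'
    )
    return {
        "rate": total,
        # Ratio of 5xx responses to all responses.
        "errors": f"{failed} / {total}",
        # Quantile estimated from histogram buckets, aggregated by `le`.
        "duration_p99": (
            "histogram_quantile(0.99, "
            "sum by (le) (rate("
            f"http_request_duration_seconds_bucket{{{selector}}}[{window}])))"
        ),
    }


queries = red_queries('namespace="payments"', window="5m")
for name, expr in queries.items():
    print(f"{name}: {expr}")
```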
USE (resources: CPU, disk, network)
- Utilization — percentage of capacity in use.
- Saturation — queue depth, threads waiting, disk await time.
- Errors — device or kernel error counters.
Layout convention: top row — golden signals (availability stat, error budget remaining tied to SLO targets); middle rows — RED per service or USE per node pool; bottom row — dependency health (database connections, cache hit rate, message queue lag). Avoid cramming forty panels on one board; split by audience (SRE overview vs deep-dive per team).
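The row convention can be expressed through the gridPos field that each panel carries in dashboard JSON, where the grid is 24 columns wide. The sketch below assumes equal-width panels per row and an arbitrary row height of 8 units; the panel titles are illustrative.

```python
GRID_WIDTH = 24  # Grafana dashboards use a 24-column grid.


def layout_rows(rows: list, row_height: int = 8) -> list:
    """Assign gridPos to panels so that each inner list becomes one row."""
    panels = []
    y = 0
    for titles in rows:
        width = GRID_WIDTH // len(titles)
        for index, title in enumerate(titles):
            panels.append(
                {
                    "title": title,
                    "gridPos": {"h": row_height, "w": width, "x": index * width, "y": y},
                }
            )
        y += row_height
    return panels


panels = layout_rows(
    [
        ["Availability", "Error budget remaining", "Ingress p99"],  # golden signals
        ["Rate", "Errors", "Duration"],                             # RED per service
        ["DB connections", "Cache hit rate", "Queue lag"],          # dependencies
    ]
)
for panel in panels:
    print(panel["title"], panel["gridPos"])
```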
Panel types matter: time series for trends, stat with sparklines for at-a-glance health, heatmap for latency distribution over time, logs panel for Loki tailing during incidents. Set sensible defaults: log scale for byte counters, unit suffixes (ms, percent, bytes), and thresholds that turn panels red before humans read the legend.
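Units and thresholds live under fieldConfig.defaults in a panel's JSON. As a rough sketch, a stat panel with a sparkline, a millisecond unit, and two threshold steps might be generated like this; the query expression and the 300 ms and 800 ms thresholds are made-up values, and a real exported panel carries more fields than shown.

```python
import json


def stat_panel(title: str, expr: str, unit: str, warn: float, crit: float) -> dict:
    """Build a stat panel with an explicit unit and colour thresholds."""
    return {
        "type": "stat",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
        # "area" draws the sparkline behind the value.
        "options": {"graphMode": "area"},
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        # The base step has no lower bound (null in JSON).
                        {"color": "green", "value": None},
                        {"color": "orange", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            },
            "overrides": [],
        },
    }


# Hypothetical recording-rule name and thresholds.
panel = stat_panel("Ingress p99", "ingress:latency_ms:p99", "ms", 300, 800)
print(json.dumps(panel, indent=2))
```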
Variables, templating, and reuse
Hard-coding namespace="payments" in every panel breaks when you clone the
dashboard for staging. Dashboard variables fix this:
- Query variable — populate dropdown from Prometheus label values: label_values(kube_pod_info, namespace).
- Custom variable — fixed list (prod, staging) for environments without Kubernetes.
- Chained variables — service list filtered by selected namespace.
- Interval variable — the built-in $__rate_interval auto-adjusts rate windows to the dashboard zoom level.
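In dashboard JSON these variables sit in the templating.list array. The following sketch builds a query variable, a chained variable, and a custom variable. The data source UID and the http_requests_total metric with a service label are assumptions, and dashboards exported from Grafana usually include additional fields per variable.

```python
import json

# Assumed data source reference; the UID depends on your installation.
PROMETHEUS = {"type": "prometheus", "uid": "prometheus"}


def query_variable(name: str, query: str) -> dict:
    """Variable whose options come from a Prometheus label_values() query."""
    return {
        "name": name,
        "type": "query",
        "datasource": PROMETHEUS,
        "query": query,
        "refresh": 1,  # re-run the query when the dashboard loads
    }


def custom_variable(name: str, values: list) -> dict:
    """Variable with a fixed, comma-separated list of options."""
    return {"name": name, "type": "custom", "query": ",".join(values)}


templating = {
    "list": [
        query_variable("namespace", "label_values(kube_pod_info, namespace)"),
        # Chained: the service list depends on the selected $namespace.
        query_variable(
            "service",
            'label_values(http_requests_total{namespace="$namespace"}, service)',
        ),
        custom_variable("env", ["prod", "staging"]),
    ]
}
print(json.dumps(templating, indent=2))
```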
Use consistent variable names across team dashboards ($cluster,
$namespace, $service) so on-call muscle memory transfers.
Enable repeat rows or repeat panels to fan out the
same RED row per service without copy-paste drift. Export the finished JSON to Git for
review — diffing dashboard JSON in pull requests catches accidental query changes.
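Raw JSON diffs can be noisy, so a review step may want to surface only the queries that changed. The sketch below compares two dashboard documents and reports panels whose query expressions differ. It assumes Prometheus-style targets with an expr field and identifies each query by panel title and refId, which is a simplification.

```python
def panel_queries(dashboard: dict) -> dict:
    """Map (panel title, refId) to the query expression."""
    found = {}

    def walk(panels: list) -> None:
        for panel in panels:
            # Collapsed rows nest their panels under a "panels" key.
            walk(panel.get("panels", []))
            for target in panel.get("targets", []):
                key = (panel.get("title", ""), target.get("refId", ""))
                found[key] = target.get("expr", "")

    walk(dashboard.get("panels", []))
    return found


def changed_queries(old: dict, new: dict) -> dict:
    """Return {key: (old_expr, new_expr)} for added, removed, or edited queries."""
    before, after = panel_queries(old), panel_queries(new)
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(before.keys() | after.keys())
        if before.get(key) != after.get(key)
    }


old = {"panels": [{"title": "Rate", "targets": [
    {"refId": "A", "expr": "sum(rate(http_requests_total[5m]))"}]}]}
new = {"panels": [{"title": "Rate", "targets": [
    {"refId": "A", "expr": "sum(rate(http_requests_total[1m]))"}]}]}

for key, (was, now) in changed_queries(old, new).items():
    print(key, was, "->", now)
```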
Alerting: Grafana vs Alertmanager
Grafana 8+ includes unified alerting: alert rules are defined on data source queries and expressions, then routed through notification policies to contact points, with mute timings for planned quiet periods. This overlaps with Prometheus + Alertmanager, so teams need a deliberate split:
| Concern | Alertmanager (Prometheus) | Grafana unified alerting |
|---|---|---|
| Primary signal | PromQL on scraped metrics | Any data source (Prometheus, Loki, SQL, CloudWatch) |
| Routing maturity | Inhibition, grouping, silences battle-tested | Improving; good for multi-source rules |
| GitOps | Alert rules in Prometheus YAML | Provisioning via files or Terraform |
| Best for | Symptom-based SLO pages from metrics | Log-based alerts, SQL thresholds, mixed sources |
A common pattern: page from Alertmanager on burn-rate SLO violations; use Grafana alerts for Loki patterns (“payment webhook signature failures”) or business KPIs in PostgreSQL. Never route the same condition through both paths without deduplication — duplicate pages erode on-call trust faster than missed alerts.
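One cheap guard against duplicate paging is a CI check that compares rule expressions from both systems. The sketch below is deliberately naive: it flags only rules whose expressions are textually identical after whitespace is removed, so semantically equivalent rules that are written differently will slip through. The rule names and expressions are invented for illustration.

```python
import re


def normalize(expr: str) -> str:
    """Strip all whitespace so formatting differences do not hide duplicates."""
    return re.sub(r"\s+", "", expr)


def overlapping_rules(alertmanager_rules: dict, grafana_rules: dict) -> list:
    """Return (prometheus_rule, grafana_rule) pairs with identical expressions."""
    by_expr = {normalize(expr): name for name, expr in alertmanager_rules.items()}
    return [
        (by_expr[normalize(expr)], name)
        for name, expr in grafana_rules.items()
        if normalize(expr) in by_expr
    ]


# Hypothetical rules: one metric alert defined twice, one log-based alert.
prometheus_side = {
    "HighErrorRate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) > 0.05',
}
grafana_side = {
    "API errors": 'sum( rate(http_requests_total{status=~"5.."}[5m]) ) > 0.05',
    "Fatal logs": 'sum(count_over_time({namespace="payments"} | json | level="fatal" [10m])) > 5',
}

duplicates = overlapping_rules(prometheus_side, grafana_side)
print(duplicates)
```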
Dashboard-as-code and provisioning
Click-ops dashboards rot: someone edits a panel during an incident, nobody commits the change, and staging diverges from production. Treat dashboards like application code:
- Export JSON or author with Grafonnet / Jsonnet for composable libraries.
- Mount provisioning YAML in /etc/grafana/provisioning/dashboards/ pointing at a Git-synced directory.
- Manage data sources and contact points via Terraform grafana_* resources or Helm values.
- Tag dashboards with git_sha annotation on deploy for correlation during regressions.
Mark exploratory boards as editable; mark production SRE boards as provisioned read-only so only CI can update them. Pair with structured logging conventions so Loki panels use the same field names your apps emit.
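A dashboard provider file is what ties a Git-synced directory to a Grafana folder and controls whether UI edits are allowed. The sketch below renders one such file as text; the provider name, folder name, dashboard path, and 30-second reload interval are assumptions, and the values are inserted without YAML escaping.

```python
from textwrap import dedent


def dashboard_provider(name: str, folder: str, path: str,
                       allow_ui_updates: bool = False) -> str:
    """Render a dashboard provider file for Grafana's provisioning directory."""
    return dedent(f"""\
        apiVersion: 1
        providers:
          - name: "{name}"
            folder: "{folder}"
            type: file
            disableDeletion: true
            allowUiUpdates: {str(allow_ui_updates).lower()}
            updateIntervalSeconds: 30
            options:
              path: {path}
        """)


# Production boards: UI edits disabled so only the Git-synced files change them.
provider_yaml = dashboard_provider(
    "sre-production", "SRE", "/var/lib/grafana/dashboards/sre"
)
print(provider_yaml)
```

With allowUiUpdates set to false, changes made in the UI cannot be saved over the provisioned dashboard, which matches the read-only intent for production boards. An exploratory folder could use a separate provider with the flag enabled.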
Worked example: Harbor Fleet SRE board
Harbor Fleet runs twelve microservices on Kubernetes behind a shared ingress. The SRE team provisions one Grafana folder with three dashboards:
- Overview — variables for $cluster and $namespace. Top stat row: global availability (1 minus error ratio), remaining monthly error budget, ingress p99. RED rows repeat per $service from a query variable.
- Ingress and edge — request rate and 5xx ratio from nginx ingress metrics; TLS cert expiry gauge; WAF block rate from Loki parsing JSON access logs.
- Data plane — PostgreSQL connections, replication lag, Redis memory pressure; Kafka consumer group lag panel with alert threshold at five minutes behind.
Alertmanager handles two burn-rate rules on the overview availability stat. Grafana
alerts fire when Loki sees level="fatal" in the payments namespace more
than five times in ten minutes. During a June rollout incident, the on-call engineer
selects the bad deployment version in a $version variable, confirms p99
latency spike on /api/orders only, clicks an exemplar dot to open the
Tempo trace, finds a downstream timeout to the inventory service, and rolls back
— without writing ad-hoc PromQL under pressure.
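The burn-rate rules mentioned in this example rest on simple arithmetic: burn rate is the observed error ratio divided by the error ratio the SLO allows. The text does not state Harbor Fleet's targets, so the sketch below assumes a 99.9% availability SLO over a 30-day period and uses the commonly cited fast-burn threshold of 14.4 over a one-hour window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget used at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


# Assumed values: 99.9% SLO, 1.44% of requests failing during the last hour.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
used = budget_consumed(rate, window_hours=1)

print(f"burn rate: {rate:.1f}x")
print(f"budget used in one hour: {used:.1%}")
print("page" if rate >= 14.4 - 1e-9 else "no page")
```

At a sustained 14.4x burn, one hour consumes about 2% of a 30-day budget, which is why that pairing is often used for the fast-burn page.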
Tooling decision table
| Need | Grafana | Alternative |
|---|---|---|
| Metrics + logs + traces in one UI | Native multi-source dashboards | Datadog / New Relic (commercial APM) |
| Prometheus-only graphs, minimal ops | Grafana or Prometheus UI | Perses, custom React + PromQL API |
| Log exploration without metrics | Loki + Grafana Explore | Kibana (Elasticsearch), Graylog |
| Embedded analytics for customers | Grafana with auth proxy (heavy) | Metabase, Superset, Hex |
| Fully managed observability | Grafana Cloud | Datadog, Honeycomb, Dynatrace |
Common pitfalls
- Dashboard sprawl — hundreds of unmaintained boards; enforce folder ownership and delete unused dashboards quarterly.
- Queries without recording rules — dashboards that time out at range > 7 days; pre-aggregate in Prometheus.
- Duplicate alerting — Grafana and Alertmanager both page on the same error-rate condition.
- Missing units and legends — panels labeled “value” force on-call to guess milliseconds vs seconds.
- Over-broad variables — defaulting $service to All hides per-service regressions in aggregated graphs.
- Editable production boards — incident tweaks never synced back to Git; next deploy wipes them.
- Ignoring Explore — engineers who only use fixed dashboards cannot ad-hoc debug novel failures.
Production checklist
- Standardize RED service dashboards before writing alert rules.
- Configure Prometheus, Loki, and Tempo data sources with correct min step and exemplar links.
- Define shared variables ($cluster, $namespace, $service) across folders.
- Provision production dashboards from Git; restrict edit permissions.
- Document which alerts page via Alertmanager vs Grafana; avoid overlap.
- Add SLO and error-budget panels aligned with documented targets.
- Set dashboard refresh intervals appropriate to incident use (30s ops, 5m exec views).
- Run Grafana behind SSO; audit folder permissions each quarter.
- Test notification contact points after credential rotation.
- Train on-call on Explore mode and trace/log correlation before the first real page.
Key takeaways
- Grafana visualizes data from Prometheus, Loki, Tempo, and cloud APIs — it does not replace your metrics or log stores.
- Structure service dashboards around RED and resource dashboards around USE; one question per row.
- Variables and provisioning keep dashboards consistent across environments and prevent click-ops drift.
- Split Alertmanager paging (metric SLOs) from Grafana alerts (logs, SQL, multi-source) with clear ownership.
- During incidents, the goal is fast correlation: metrics, logs, and traces on linked panels beat heroic ad-hoc querying.
Related reading
- Prometheus monitoring explained — metrics collection and PromQL behind Grafana panels
- Observability explained — how metrics, logs, and traces fit together
- SLOs and error budgets explained — error-budget panels and burn-rate context
- Distributed tracing and OpenTelemetry explained — trace panels and exemplar links