Question 1

Is Prometheus better than DataDog in 2026?

Accepted Answer

For cost-sensitive teams running Kubernetes, almost always yes: Prometheus is open source with no license fee, and it is the de facto standard. DataDog is faster to set up and bundles logs and APM out of the box, but its cost scales aggressively with hosts and custom metrics. Most Indian unicorns run Prometheus + Grafana for metrics, Loki or ELK for logs, and Tempo or Jaeger for traces. DataDog is more common at large enterprises that have already standardized on it.

Question 2

How much does a Prometheus / SRE engineer earn in India?

Accepted Answer

₹8-25 LPA in 2026 for SREs and DevOps engineers with Prometheus + Kubernetes as primary skills. Senior SREs and Staff SREs at unicorns (Razorpay, Swiggy, CRED, Zerodha, Postman) can clear ₹40-60 LPA total comp. Observability platform engineers, those who build Prometheus + Thanos/Mimir at scale, are in particularly high demand.

Question 3

Should I use Prometheus or VictoriaMetrics?

Accepted Answer

Prometheus is the standard and what every interview will ask about. VictoriaMetrics is a high-performance, Prometheus-compatible alternative that generally offers better compression and lower memory usage; some teams use it as a drop-in replacement, others use it as the long-term store behind vanilla Prometheus. For interviews: know Prometheus inside-out, and be aware that VictoriaMetrics, Thanos, Mimir, and Cortex exist as long-term storage options.

Question 4

What's the relationship between Prometheus and Kubernetes?

Accepted Answer

Prometheus is the de facto monitoring solution for Kubernetes. Both are CNCF graduated projects, and the Prometheus Operator integrates natively via CRDs (ServiceMonitor, PodMonitor, PrometheusRule). The kube-prometheus-stack Helm chart is how most teams deploy Prometheus on K8s; it bundles Prometheus + AlertManager + Grafana + node-exporter + kube-state-metrics + a curated set of dashboards and alerts.

Question 5

Do I need to learn PromQL to be effective?

Accepted Answer

Yes. PromQL is the hardest part of Prometheus and the most-asked interview topic. You can write basic instrumentation without it, but you cannot write good alerts, recording rules, or dashboards without being comfortable in PromQL. The minimum: instant vs range vectors, `rate()` / `increase()` / `irate()`, `histogram_quantile()`, aggregation operators with `by`/`without`, and label matching syntax. The rest comes with practice.

Question 6

What is Prometheus and what problems does it solve?

Accepted Answer

Prometheus is an open-source time-series database and monitoring system, originally built at SoundCloud in 2012 and now a graduated CNCF project. It solves three problems that traditional monitoring tools (Nagios, Zabbix) handled poorly in cloud-native environments: (1) static configuration breaks when servers come and go every minute, so Prometheus uses service discovery instead; (2) push-based agents create operational fragility and need separate configuration on every host, whereas Prometheus pulls metrics from targets defined in one central configuration; and (3) dimensional data (per-endpoint, per-status-code, per-tenant) requires a real query language, so Prometheus introduced PromQL, which lets you slice metrics by labels. Prometheus stores metrics in its own TSDB, optimized for labeled time series (every unique label combination is a separate series, so unbounded label values remain costly), and integrates with AlertManager for routing alerts and Grafana for visualization.

Question 7

What is the Prometheus data model, metric names and labels?

Accepted Answer

Every Prometheus time series is uniquely identified by a metric name plus a set of key-value labels. The metric name describes WHAT is being measured (e.g. `http_requests_total`), and labels describe the DIMENSIONS along which the measurement varies (e.g. `method="GET"`, `status="200"`, `endpoint="/api/users"`). Together they form a unique series, and each series has a stream of (timestamp, float64) samples. The same metric name can appear with different label combinations; each unique combination is its own series. This dimensional model is what makes Prometheus powerful: you can aggregate `http_requests_total` along any label at query time, instead of deciding in advance which dimensions to track.
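
To make the "name plus label set" identity concrete, here is a minimal Python sketch. It is not how Prometheus stores data internally; the sample values and the helper names `series_id` and `sum_by` are made up for illustration, and `sum_by` only loosely mirrors what `sum by (label) (metric)` does in PromQL.

```python
from collections import defaultdict

def series_id(name, labels):
    """A series is identified by its metric name plus its full label set.
    frozenset makes the label set hashable and independent of key order."""
    return (name, frozenset(labels.items()))

# Hypothetical latest sample values for three series of the same metric.
samples = {
    series_id("http_requests_total", {"method": "GET", "status": "200"}): 120.0,
    series_id("http_requests_total", {"method": "GET", "status": "500"}): 3.0,
    series_id("http_requests_total", {"method": "POST", "status": "200"}): 40.0,
}

def sum_by(samples, name, label):
    """Rough analogue of `sum by (label) (name)` over the latest samples."""
    totals = defaultdict(float)
    for (metric, labels), value in samples.items():
        if metric == name:
            totals[dict(labels).get(label, "")] += value
    return dict(totals)

# The same three series, sliced along two different dimensions at query time.
by_method = sum_by(samples, "http_requests_total", "method")
by_status = sum_by(samples, "http_requests_total", "status")
```

Because each distinct label combination is its own dictionary key here, it is also easy to see why a label with unbounded values (user IDs, request IDs) multiplies the number of series.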

Question 8

What are the four Prometheus metric types, Counter, Gauge, Histogram, Summary?

Accepted Answer

Prometheus has four metric types, and picking the wrong one is the most common beginner mistake. (1) **Counter**: a monotonically increasing value (it only goes up, or resets to zero on process restart). Use for request counts, errors, bytes processed. Always query with `rate()` or `increase()`. (2) **Gauge**: a single value that can go up and down. Use for temperatures, memory usage, queue depth, in-flight requests. (3) **Histogram**: counts observations into configurable buckets (e.g. request latency with 10ms, 50ms, and 100ms upper bounds). Buckets are cumulative on the client side, and quantiles are computed on the server side, so histograms can be aggregated across instances. (4) **Summary**: like a histogram, but it calculates quantiles on the client side, so its quantiles cannot be aggregated across instances. Histograms are preferred over summaries in 2026 because they aggregate cleanly across replicas; averaging or summing per-instance quantiles from summaries does not produce a valid fleet-wide quantile.
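
The claim that histograms "aggregate cleanly" can be illustrated with a toy model. The sketch below is not the real client library; `ToyHistogram` and `merge` are hypothetical names, and the bucket bounds and latencies are invented. It only shows that cumulative bucket counts from two replicas can be added bucket by bucket.

```python
import math

class ToyHistogram:
    """Toy stand-in for a client-side histogram with cumulative buckets."""

    def __init__(self, upper_bounds):
        # Bounds are in seconds; +Inf catches everything above the largest bound.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0  # running sum of all observed values

    def observe(self, value):
        self.total += value
        # Cumulative: an observation counts in every bucket whose bound >= value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1

def merge(a, b):
    """Add two replicas' histograms bucket by bucket."""
    assert a.bounds == b.bounds, "buckets must match to be summed"
    merged = ToyHistogram(a.bounds[:-1])
    merged.counts = [x + y for x, y in zip(a.counts, b.counts)]
    merged.total = a.total + b.total
    return merged

replica_a = ToyHistogram([0.01, 0.05, 0.1])
replica_b = ToyHistogram([0.01, 0.05, 0.1])
for latency in (0.004, 0.02, 0.03):
    replica_a.observe(latency)
for latency in (0.07, 0.2):
    replica_b.observe(latency)

# The merged counts describe the whole fleet's latency distribution.
fleet = merge(replica_a, replica_b)
```

A summary, by contrast, would only export something like "p99 = X" per replica, and there is no arithmetic on two such numbers that recovers the fleet-wide p99.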

Question 9

What is the pull-based scrape model and why did Prometheus choose it?

Accepted Answer

In Prometheus's pull model, the server periodically fetches (scrapes) metrics from HTTP endpoints exposed by each target. Targets expose `/metrics` in a simple text format and don't know or care about Prometheus; they just expose stats. The Prometheus server has the full list of targets (via service discovery or a static config) and scrapes them at a configurable interval (typically 15s or 30s). Why pull? (1) Down targets are easy to detect: if a scrape fails, Prometheus knows at that scrape (the target's `up` series drops to 0) and can alert. (2) Debugging is easier: you can manually `curl /metrics` to see the same data Prometheus sees. (3) Targets don't need to be configured with the server's address; the server holds the addresses. (4) The server controls the scrape rate, so you can't accidentally DDoS Prometheus by deploying 10,000 new instances. The drawback is short-lived jobs (batch jobs, lambdas) that finish before the next scrape; for those, Prometheus offers the Pushgateway as a workaround.
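
A single scrape can be sketched in a few lines of Python. This is a simplification under stated assumptions, not Prometheus's implementation: the parser handles only plain `name value` and `name{labels} value` lines without timestamps, and the function names and the `fetch` parameter are invented so the logic can be exercised without a network.

```python
import urllib.request

def http_fetch(url, timeout=5.0):
    """Default fetcher: a plain HTTP GET, the same thing `curl` would do."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def parse_samples(text):
    """Toy parser for the text format.
    Assumes no timestamps and no spaces inside label values."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples

def scrape(target, fetch=http_fetch):
    """One scrape of one target. Returns (up, samples).
    up is 1 on success and 0 if fetching or parsing fails."""
    try:
        return 1, parse_samples(fetch(f"http://{target}/metrics"))
    except Exception:
        return 0, {}
```

Calling `scrape("localhost:8000")` against a process that is not running returns `(0, {})`, which is the pull model's failure detection in miniature: the absence of a successful scrape is itself a signal.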

Question 10

How do you query Prometheus with PromQL, instant vs range vectors?

Accepted Answer

PromQL has two fundamental vector types. An **instant vector** is one sample per series at a single point in time; that's what you get when you query a metric name directly. A **range vector** is a set of samples per series over a time window; you get this by appending `[duration]` to a selector. Most aggregation operators (`sum`, `avg`) work on instant vectors. The rate functions (`rate`, `increase`, `irate`) require range vectors. The difference matters because Grafana graph panels need an instant vector at each evaluation step (one value per series per timestamp), so you typically wrap a range vector in `rate()` to collapse it.
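
The two selector kinds can be mimicked over a plain list of samples. The sketch below is illustrative only: the data is invented, the helper names are hypothetical, and the exact handling of window boundaries differs between Prometheus versions, so treat the left-open intervals here as an assumption.

```python
# One series as a time-ordered list of (timestamp_seconds, value) samples.
# Hypothetical data, scraped every 15s.
series = [(0, 10.0), (15, 12.0), (30, 15.0), (45, 15.0), (60, 21.0)]

def instant_value(series, at, lookback=300):
    """Instant selector: the most recent sample at or before `at`,
    ignoring samples older than the lookback window (5m here)."""
    candidates = [(t, v) for t, v in series if at - lookback < t <= at]
    return candidates[-1] if candidates else None

def range_values(series, at, window):
    """Range selector `metric[window]`: every sample in (at - window, at]."""
    return [(t, v) for t, v in series if at - window < t <= at]

# Evaluated at t=50s: one sample versus a list of samples.
one_sample = instant_value(series, at=50)
many_samples = range_values(series, at=50, window=30)
```

Here `one_sample` is a single `(timestamp, value)` pair, while `many_samples` is a list that a function such as `rate()` would reduce back to one value.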

Question 11

What is `rate()` and how is it different from `increase()` and `irate()`?

Accepted Answer

All three operate on counter range vectors and handle counter resets automatically. (1) `rate(metric[5m])` returns the per-SECOND average rate of increase over the window; this is what you want 95% of the time. It is computed from the first and last samples in the window, adjusted for counter resets and extrapolated to the window boundaries; it is not a regression fit (that is `deriv()`, which is meant for gauges). (2) `increase(metric[5m])` returns the TOTAL increase over the window (rate × window seconds), so it can be non-integer because of extrapolation. Useful for 'how many errors in the last hour?'. (3) `irate(metric[5m])` returns the rate between the LAST TWO samples in the window; it is much spikier than `rate()`, useful for short-term anomaly detection but bad for alerting. The rule: use `rate()` for graphs and alerts, `increase()` for human-readable totals, `irate()` only for high-resolution debugging.
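
The differences are easier to see on numbers. The following Python sketch implements simplified versions of the three functions over a list of `(timestamp, value)` samples. It handles counter resets but deliberately omits the extrapolation to window boundaries that the real `rate()` and `increase()` perform, so real results will differ slightly; the function names and data are invented.

```python
def _total_increase(samples):
    """Sum of increases between consecutive samples, treating any drop
    as a counter reset (the counter restarted from zero)."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def simple_increase(samples):
    """Total increase across the sampled interval (no extrapolation)."""
    return _total_increase(samples) if len(samples) >= 2 else None

def simple_rate(samples):
    """Per-second average across the sampled interval (no extrapolation)."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return _total_increase(samples) / elapsed

def simple_irate(samples):
    """Per-second rate using only the last two samples."""
    return simple_rate(samples[-2:]) if len(samples) >= 2 else None

# Hypothetical counter samples; the process restarts between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]

total = simple_increase(window)   # 30 + 30 + 10 + 30
average = simple_rate(window)     # total / 60 seconds
latest = simple_irate(window)     # 30 / 15 seconds
```

The reset between t=30 and t=45 does not produce a negative rate: the drop is interpreted as a restart, and the post-restart value is counted as the increase.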

Question 12

How do you install and configure a basic Prometheus server?

Accepted Answer

In 2026, almost nobody installs Prometheus from a tarball anymore. The standard pattern is the kube-prometheus-stack Helm chart on Kubernetes (which bundles Prometheus + AlertManager + Grafana + node-exporter + kube-state-metrics), or the official Docker image for non-K8s setups. The core config file is `prometheus.yml`, and its four main sections are `global` (scrape interval, evaluation interval), `scrape_configs` (what to scrape), `alerting` (where AlertManager lives), and `rule_files` (paths to alerting and recording rules). For Kubernetes, the Prometheus Operator introduces CRDs like `ServiceMonitor` and `PrometheusRule` so you can define scrape targets declaratively per service instead of editing one giant config.

Question 13

What are exporters and which ones do you commonly use?

Accepted Answer

An exporter is a separate process, often run alongside the system it monitors, that translates metrics from a non-Prometheus system into the Prometheus text format. Standard exporters in every production stack: (1) **node_exporter**: host-level metrics (CPU, memory, disk, network). Runs on every VM. (2) **cAdvisor**: container metrics. Built into kubelet, so on Kubernetes you get this for free. (3) **kube-state-metrics**: Kubernetes object state (pod count, deployment replicas, etc.). (4) **blackbox_exporter**: probes HTTP/HTTPS/TCP/ICMP/DNS endpoints from outside. Used for uptime checks. (5) **mysqld_exporter / postgres_exporter / redis_exporter**: database metrics. The pattern: if your dependency doesn't natively expose Prometheus metrics, there's almost certainly an exporter for it, either in the Prometheus GitHub org or from a third party.
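
The translation step an exporter performs can be sketched as follows. Everything specific here is an assumption for illustration: the `stats` dictionary stands in for whatever a non-Prometheus system reports, and the metric name `legacy_queue_depth` is hypothetical. Only the output shape (HELP and TYPE comments followed by `name{labels} value` lines) follows the text exposition format.

```python
def escape_label_value(value):
    """The text format requires escaping backslash, double quote, and newline
    inside label values."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def export_queue_depths(stats):
    """Translate a hypothetical stats payload into Prometheus text format."""
    lines = [
        "# HELP legacy_queue_depth Messages currently waiting in each queue.",
        "# TYPE legacy_queue_depth gauge",
    ]
    for queue, depth in sorted(stats["queues"].items()):
        label = escape_label_value(queue)
        lines.append(f'legacy_queue_depth{{queue="{label}"}} {float(depth)}')
    return "\n".join(lines) + "\n"

# Pretend this came from the legacy system's own status API.
stats = {"queues": {"orders": 12, "emails": 0}}
exposition = export_queue_depths(stats)
```

A real exporter would wrap this in an HTTP handler for `/metrics` and fetch fresh stats on each scrape, which is what the official exporters do in their own languages.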

Question 14

What is the Pushgateway and when should you use it?

Accepted Answer

The Pushgateway lets short-lived jobs push metrics to a stable endpoint that Prometheus then scrapes. Pure pull doesn't work for batch jobs that finish in 30 seconds: by the time Prometheus's next scrape happens, the process is gone. The Pushgateway sits in the middle: the batch job pushes its final metrics (success, duration, rows processed) on completion, and Prometheus scrapes the Pushgateway like any other target. Important gotcha: the Pushgateway is NOT for service-level metrics from long-running processes; that's an anti-pattern that defeats the pull model's failure detection. Only use it for cron jobs, batch pipelines, and CI pipelines.
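
As a rough sketch of the push step, a batch job can send its final metrics with a single HTTP request to the Pushgateway's `/metrics/job/<job>` path. The gateway address, job name, and metric names below are assumptions for illustration; the code only builds the request, so it can be inspected without a running gateway.

```python
import urllib.request
from urllib.parse import quote

def build_push_request(gateway_url, job, metrics):
    """Build a PUT request that replaces all metrics pushed for `job`.
    `metrics` maps unlabeled metric names to numeric values."""
    # Text exposition format: one `name value` line each, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")

# Hypothetical gateway address and metrics for a nightly backup job.
request = build_push_request(
    "http://pushgateway:9091",
    "nightly_backup",
    {
        "backup_duration_seconds": 42.5,
        "backup_last_success_timestamp_seconds": 1700000000,
    },
)
```

At the end of the job you would send it with `urllib.request.urlopen(request, timeout=5)`. Pushing a last-success timestamp rather than a boolean makes it possible to alert when the job has not succeeded for too long.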

Question 15

How do you instrument your application with the Prometheus client library?

Accepted Answer

Every major language has an official or well-maintained community Prometheus client library. The pattern is the same: import the client, create metric objects at startup (Counter, Gauge, Histogram, Summary), increment/observe them in your application code, and expose a `/metrics` HTTP endpoint that the library writes the text format to. In Python, you wire it as a Flask/FastAPI route; in Go, you use `promhttp.Handler()`; in Node.js, you use the community-maintained `prom-client`. The library handles the cumulative bookkeeping (counters, histogram buckets) so you only think in terms of 'I want to count this' or 'I want to record this latency'.
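
To show the shape of that workflow without depending on any particular client library, here is a stdlib-only Python sketch. `ToyCounter` is a deliberately simplified stand-in, not the API of the real Python client, and the handler, port, and metric name are assumptions. It follows the same three steps: create the metric at startup, increment it in request code, and render it at `/metrics`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ToyCounter:
    """Simplified stand-in for a client-library counter with labels."""

    def __init__(self, name, help_text, label_names):
        self.name, self.help_text, self.label_names = name, help_text, label_names
        self._values = {}
        self._lock = threading.Lock()  # handlers may run concurrently

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
                lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"

# Step 1: create metric objects once, at startup.
REQUESTS = ToyCounter("http_requests_total", "Total HTTP requests.",
                      ["method", "status"])

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Step 3: expose the current values in the text format.
            body = REQUESTS.render().encode("utf-8")
        else:
            # Step 2: count the event in application code.
            REQUESTS.inc(method="GET", status="200")
            body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Run the demo server; call this manually to try it with curl."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

In a real service you would use the client library for your language instead, which adds the other metric types, process metrics, and proper label escaping.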

Prometheus Interview Questions

What is Prometheus and what problems does it solve?

What is the Prometheus data model, metric names and labels?

What are the four Prometheus metric types, Counter, Gauge, Histogram, Summary?

What is the pull-based scrape model and why did Prometheus choose it?

How do you query Prometheus with PromQL, instant vs range vectors?

What is `rate()` and how is it different from `increase()` and `irate()`?

How do you install and configure a basic Prometheus server?

What are exporters and which ones do you commonly use?

What is the Pushgateway and when should you use it?

How do you instrument your application with the Prometheus client library?

What is Grafana and how does it integrate with Prometheus?

What's the difference between Prometheus and DataDog or New Relic?

What is cardinality and why is it the biggest Prometheus footgun?

What is `histogram_quantile()` and how do you calculate p99 latency?

Histogram vs Summary, which should you use?

What is service discovery in Prometheus, Kubernetes, Consul, EC2, file_sd?

What are relabeling rules and how do they work?

What are recording rules and when should you use them?

How does AlertManager work, routing tree, grouping, inhibition, silences?

How do you write a good Prometheus alerting rule?

What aggregation operators does PromQL support?

How does Prometheus's local TSDB storage work and what is the retention policy?

What is remote write and when do you need Thanos, Cortex, or Mimir?

What is Prometheus federation and when should you use it?

What are the four golden signals and how do you measure them in Prometheus?

How would you architect Prometheus for a multi-cluster, multi-region setup at scale?

What are native histograms and how do they change Prometheus's cardinality story?

How do you debug high-cardinality issues in a production Prometheus?

How does Prometheus interact with the OpenTelemetry collector in 2026?

How do you design SLOs (Service Level Objectives) using Prometheus?
