Introduction
Observability is no longer optional for production systems. When something breaks at 2 AM, you need metrics, traces, and logs in one place — not three separate tools with no correlation between them.
SigNoz is an open-source, OpenTelemetry-native observability platform that gives you distributed tracing, metrics, and logs in a single unified UI. Unlike the Grafana/Prometheus/Loki/Tempo stack — which requires stitching together four separate projects — SigNoz ships as a cohesive product backed by ClickHouse for high-performance analytics.
Self-hosting matters for several reasons:
- Data sovereignty — traces and logs often contain sensitive request payloads, user IDs, and internal service topology. Keeping that data in your own cluster avoids vendor lock-in and compliance headaches.
- Cost predictability — SaaS observability pricing scales with data volume. A busy microservices platform can generate millions of spans per minute; self-hosting turns that into a fixed infrastructure cost.
- Customisation — you control retention policies, sampling rates, and alert routing without negotiating with a vendor's pricing tier.
This guide walks through a production-grade SigNoz deployment on Kubernetes: Helm installation, OpenTelemetry Collector configuration, application instrumentation, dashboards, alerts, and a comparison with the Grafana/Prometheus stack.
Prerequisites
Infrastructure:
- Kubernetes cluster 1.25+ (K3s, kubeadm, EKS, GKE — any distribution works)
- kubectl with cluster-admin access
- helm v3.12+
- At least 4 vCPUs and 8 GB RAM available for SigNoz components
- Persistent storage class available (for ClickHouse and Zookeeper volumes)
Knowledge:
- Familiarity with Kubernetes concepts (Deployments, Services, ConfigMaps, PersistentVolumeClaims)
- Basic understanding of distributed tracing concepts (spans, traces, context propagation)
- Comfort with YAML manifests
Naming conventions used in this guide:
Namespace: observability
SigNoz release: signoz
Cluster domain: cluster.local
Sample app: order-service (Go)
Installing SigNoz on Kubernetes via Helm
SigNoz ships a single Helm chart that deploys the full stack: the SigNoz backend, ClickHouse (the columnar store for traces and logs), Zookeeper (ClickHouse coordination), and the OpenTelemetry Collector.
Add the Helm Repository
helm repo add signoz https://charts.signoz.io
helm repo update
Create the Namespace
kubectl create namespace observability
Inspect Default Values
Before installing, pull the default values to understand what you're deploying:
helm show values signoz/signoz > signoz-values.yaml
The chart is large. The key sections to review are clickhouse, queryService, frontend, and otelCollector.
Minimal Production Values
Create a signoz-override.yaml with sensible production defaults:
# signoz-override.yaml
global:
  storageClass: "standard"  # replace with your StorageClass name

clickhouse:
  replicasCount: 1  # increase to 3 for HA
  persistence:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "4"
      memory: 8Gi

queryService:
  resources:
    requests:
      cpu: "250m"
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

frontend:
  resources:
    requests:
      cpu: "100m"
      memory: 128Mi
    limits:
      cpu: "500m"
      memory: 512Mi

otelCollector:
  resources:
    requests:
      cpu: "250m"
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

alertmanager:
  enabled: true

zookeeper:
  replicaCount: 1
  persistence:
    enabled: true
    size: 10Gi
Install SigNoz
helm install signoz signoz/signoz \
--namespace observability \
--values signoz-override.yaml \
--version 0.55.1 \
--wait \
--timeout 10m
The --wait flag blocks until all pods are ready. ClickHouse takes the longest — expect 3–5 minutes on a fresh cluster.
Verify the Installation
kubectl -n observability get pods
# NAME READY STATUS RESTARTS AGE
# signoz-clickhouse-0 1/1 Running 0 4m
# signoz-zookeeper-0 1/1 Running 0 4m
# signoz-query-service-xxxxxxxxx-xxxxx 1/1 Running 0 3m
# signoz-frontend-xxxxxxxxx-xxxxx 1/1 Running 0 3m
# signoz-otel-collector-xxxxxxxxx-xxxxx 1/1 Running 0 3m
# signoz-otel-collector-metrics-xxxxxxxxx-xxx 1/1 Running 0 3m
# signoz-alertmanager-0 1/1 Running 0 3m
Access the UI
Port-forward the frontend service to verify everything is working before setting up an Ingress:
kubectl -n observability port-forward svc/signoz-frontend 3301:3301
Open http://localhost:3301 in your browser. You'll be prompted to create an admin account on first login.
Expose via Ingress (Optional)
For persistent access, create an Ingress resource:
# signoz-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: signoz-frontend
  namespace: observability
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - signoz.example.com
      secretName: signoz-tls
  rules:
    - host: signoz.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: signoz-frontend
                port:
                  number: 3301
kubectl apply -f signoz-ingress.yaml
Configuring OpenTelemetry Collectors
SigNoz ships with an embedded OpenTelemetry Collector, but for production you'll want a DaemonSet collector on each node to collect host metrics and forward application telemetry. This two-tier architecture keeps the SigNoz collector focused on ingestion while the node-level collector handles scraping and pre-processing.
Understanding the Architecture
Application Pods
        │  (OTLP gRPC/HTTP)
        ▼
Node-level OTel Collector (DaemonSet)
        │  (OTLP gRPC)
        ▼
SigNoz OTel Collector (Deployment)
        │  (internal)
        ▼
ClickHouse
Deploy the OpenTelemetry Operator
The OpenTelemetry Operator manages OpenTelemetryCollector and Instrumentation custom resources, making collector configuration declarative:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
--namespace observability \
--set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
--set admissionWebhooks.certManager.enabled=true
DaemonSet Collector for Node Metrics
# otel-daemonset-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-node
  namespace: observability
spec:
  mode: daemonset
  serviceAccount: otel-collector-sa
  env:
    # Expose the node name so the kubeletstats receiver can target the local kubelet
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  tolerations:
    - operator: Exists
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          disk: {}
          filesystem: {}
          load: {}
          memory: {}
          network: {}
          paging: {}
          processes: {}
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
        metric_groups:
          - node
          - pod
          - container
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 5s
        limit_mib: 400
        spike_limit_mib: 100
      resourcedetection:
        detectors: [env, k8snode]
        timeout: 5s
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time
    exporters:
      otlp:
        endpoint: "signoz-otel-collector.observability.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp, hostmetrics, kubeletstats]
          processors: [memory_limiter, resourcedetection, k8sattributes, batch]
          exporters: [otlp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
kubectl apply -f otel-daemonset-collector.yaml
RBAC for the Collector
The collector needs permissions to read pod metadata and kubelet stats:
# otel-collector-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: otel-collector-sa
    namespace: observability
kubectl apply -f otel-collector-rbac.yaml
Instrumenting a Sample Application
Let's instrument a Go microservice — an order-service — to emit traces, metrics, and logs to SigNoz via OpenTelemetry.
Auto-Instrumentation via the Operator
The OpenTelemetry Operator supports zero-code auto-instrumentation for Go, Java, Python, Node.js, and .NET. Create an Instrumentation resource that points to your SigNoz collector:
# otel-instrumentation.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: signoz-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-node-collector.observability.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "1.0"
  go:
    image: ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.14.0-alpha
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.46.0
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.43b0
kubectl apply -f otel-instrumentation.yaml
Annotate your application Deployment to enable auto-instrumentation:
# order-service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        instrumentation.opentelemetry.io/inject-go: "signoz-instrumentation"
        instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/app/order-service"
    spec:
      containers:
        - name: order-service
          image: your-registry/order-service:latest
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "order-service"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production,service.version=1.0.0"
For Java, Node.js, and Python, the operator injects an init container that mounts the language agent and wires it up through environment variables. Go auto-instrumentation works differently: the operator injects an eBPF-based sidecar (which requires elevated privileges) that attaches to the binary named in the otel-go-auto-target-exe annotation. Either way, no application code changes are required.
Manual Instrumentation (Go)
For fine-grained control, instrument your Go service directly using the OpenTelemetry SDK:
// main.go
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-node-collector.observability.svc.cluster.local:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName("order-service"),
			semconv.ServiceVersion("1.0.0"),
			semconv.DeploymentEnvironment("production"),
		),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}

func handleOrder(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("order-service")
	_, span := tracer.Start(ctx, "process-order")
	defer span.End()

	// Add custom attributes to the span
	span.SetAttributes(
		attribute.String("order.id", r.URL.Query().Get("id")),
		attribute.String("order.status", "processing"),
	)

	// Your business logic here
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"accepted"}`))
}

func main() {
	ctx := context.Background()
	tp, err := initTracer(ctx)
	if err != nil {
		log.Fatalf("failed to initialize tracer: %v", err)
	}
	// Shutdown flushes buffered spans on a clean return. Note that
	// log.Fatal below bypasses deferred calls, so a production service
	// should trap signals and shut down explicitly.
	defer tp.Shutdown(ctx)

	// Wrap the HTTP handler with OTel instrumentation
	mux := http.NewServeMux()
	mux.Handle("/order", otelhttp.NewHandler(http.HandlerFunc(handleOrder), "order"))

	log.Println("order-service listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}
Install the required modules:
go get go.opentelemetry.io/otel@v1.24.0
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.24.0
go get go.opentelemetry.io/otel/sdk@v1.24.0
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0
Structured Logging with Trace Correlation
Correlate logs with traces by injecting the trace ID into your log output:
import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

func logWithTrace(ctx context.Context, logger *zap.Logger, msg string) {
	span := trace.SpanFromContext(ctx)
	sc := span.SpanContext()
	logger.Info(msg,
		zap.String("trace_id", sc.TraceID().String()),
		zap.String("span_id", sc.SpanID().String()),
		zap.Bool("trace_sampled", sc.IsSampled()),
	)
}
SigNoz's log viewer can then pivot from a log line directly to the associated trace — a workflow that saves significant debugging time.
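With zap's JSON encoder, a line emitted by the helper above looks roughly like this (the trace and span IDs here are illustrative values, not real output):

```json
{"level":"info","ts":1718000000.123,"msg":"order accepted","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7","trace_sampled":true}
```

As long as the trace_id field survives your log pipeline intact, SigNoz can use it as the pivot key between the log line and the trace.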
Setting Up Dashboards and Alerts
Dashboards
SigNoz ships with pre-built dashboards for common infrastructure components. Navigate to Dashboards in the UI and import from the community library, or build custom ones.
For the order-service, create a dashboard tracking the RED metrics (Rate, Errors, Duration):
- Go to Dashboards → New Dashboard
- Add a Time Series panel with this PromQL-style query:
# Request rate (requests per second)
sum(rate(http_server_duration_count{service_name="order-service"}[5m])) by (http_route)
# Error rate
sum(rate(http_server_duration_count{service_name="order-service", http_status_code=~"5.."}[5m]))
/ sum(rate(http_server_duration_count{service_name="order-service"}[5m]))
# P99 latency
histogram_quantile(0.99, sum(rate(http_server_duration_bucket{service_name="order-service"}[5m])) by (le, http_route))
SigNoz also supports ClickHouse SQL queries for custom analytics — useful for business metrics that don't fit the standard Prometheus model.
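As a sketch of what a ClickHouse query looks like (table and column names vary across SigNoz versions; this assumes the signoz_traces index schema and is illustrative only):

```sql
-- Top 5 slowest routes for order-service over the last hour (illustrative schema)
SELECT
    httpRoute,
    quantile(0.99)(durationNano) / 1e6 AS p99_ms,
    count() AS requests
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'order-service'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY httpRoute
ORDER BY p99_ms DESC
LIMIT 5
```

Because it is plain SQL over span columns, you can join or filter on any attribute ClickHouse has indexed, which is where this approach outgrows PromQL.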
Alerts
SigNoz's alert system is built on top of Alertmanager. Configure alerts via the UI under Alerts → New Alert Rule.
For a latency SLO alert:
# Example alert rule (configured via SigNoz UI or API)
alert: OrderServiceHighLatency
expr: |
  histogram_quantile(0.99,
    sum(rate(http_server_duration_bucket{service_name="order-service"}[5m])) by (le)
  ) > 500
for: 5m
labels:
  severity: warning
  team: platform
annotations:
  summary: "order-service P99 latency above 500ms"
  description: "P99 latency is {{ $value }}ms for the last 5 minutes"
  runbook: "https://wiki.example.com/runbooks/order-service-latency"
Configuring Alert Channels
Route alerts to Slack, PagerDuty, or email via the Alert Channels settings:
# Slack channel configuration (set via SigNoz UI)
name: platform-alerts-slack
type: slack
settings:
  webhook_url: "https://hooks.slack.com/services/T.../B.../..."
  channel: "#platform-alerts"
  title: "{{ .Labels.alertname }}"
  text: "{{ .Annotations.description }}"
For PagerDuty integration, use the routing key from your PagerDuty service and set the channel type to pagerduty in the SigNoz UI.
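A PagerDuty channel follows the same shape. This is a hypothetical sketch: the exact setting names come from the SigNoz UI form, and the routing key placeholder must be replaced with your PagerDuty integration key:

```yaml
# PagerDuty channel configuration (sketch; field names are assumptions)
name: platform-alerts-pagerduty
type: pagerduty
settings:
  routing_key: "<your-pagerduty-integration-key>"
  description: "{{ .Annotations.summary }}"
```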
Comparing SigNoz with the Grafana/Prometheus Stack
Both stacks are production-proven. The right choice depends on your team's priorities.
Architecture Complexity
The Grafana/Prometheus stack is a collection of independent projects:
| Component | Purpose |
|-----------|---------|
| Prometheus | Metrics scraping and storage |
| Grafana | Visualisation |
| Loki | Log aggregation |
| Tempo | Distributed tracing |
| Alertmanager | Alert routing |
| Promtail/Alloy | Log shipping |
Each component has its own configuration format, upgrade cycle, and failure mode. Correlating a trace with its logs requires manual configuration of Grafana data source links and consistent label schemes across all components.
SigNoz collapses this into a single deployment backed by ClickHouse. Trace-to-log correlation works out of the box because everything lands in the same database.
Storage and Scalability
Prometheus uses a local TSDB optimised for recent data. Long-term storage requires Thanos or Cortex — both significant operational additions. Loki and Tempo have their own storage backends (object storage or local disk).
SigNoz uses ClickHouse for all three signal types. ClickHouse is a columnar OLAP database that handles high-cardinality queries efficiently. A single ClickHouse cluster stores months of traces, metrics, and logs with good query performance. The trade-off is that ClickHouse is more resource-intensive than Prometheus TSDB for pure metrics workloads.
Query Languages
| Stack | Metrics | Logs | Traces |
|-------|---------|------|--------|
| Grafana | PromQL | LogQL | TraceQL |
| SigNoz | PromQL-compatible | SQL / UI filters | UI filters + SQL |
If your team already knows PromQL deeply, the Grafana stack has an advantage. SigNoz's query interface is more approachable for teams new to observability, but less expressive for complex metric transformations.
OpenTelemetry Native vs. Prometheus Native
SigNoz is designed from the ground up for OpenTelemetry. The OTLP protocol is the primary ingestion path, and the data model maps directly to OTel's semantic conventions.
Prometheus was designed for pull-based metric scraping. Recent versions can ingest OTLP metrics through a native OTLP endpoint, but the data model mismatch (Prometheus labels vs. OTel attributes) still creates friction. Exemplars — the mechanism that links a metric data point to a trace — work in Prometheus but require careful configuration.
When to Choose Each
Choose SigNoz when:
- You're starting fresh and want a unified observability platform
- Your team is adopting OpenTelemetry as the instrumentation standard
- You want trace-log-metric correlation without manual configuration
- You prefer a product experience over assembling components
Choose Grafana/Prometheus when:
- You have significant existing Prometheus investment (recording rules, dashboards, exporters)
- Your team has deep PromQL expertise
- You need Grafana's plugin ecosystem (dozens of data source integrations)
- You're running at very large scale where Thanos/Cortex/Mimir are already in place
The two stacks are not mutually exclusive. A common pattern is to run Prometheus for infrastructure metrics (where the pull model and exporter ecosystem shine) and SigNoz for application traces and logs.
Troubleshooting Common Issues
Traces not appearing in SigNoz: Check that the OTLP endpoint in your application matches the collector service address. Verify with:
kubectl -n observability logs -l app.kubernetes.io/name=otel-collector --tail=50 | grep -i "error\|dropped"
ClickHouse pod in CrashLoopBackOff: Usually a resource constraint. Check events and increase memory limits:
kubectl -n observability describe pod signoz-clickhouse-0
kubectl -n observability logs signoz-clickhouse-0 --previous
High memory usage on the collector:
The memory_limiter processor is your first line of defence. Tune limit_mib based on your node's available memory. Also review batch sizes — large batches increase peak memory usage.
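As a starting point, a collector allowed up to roughly 1 GiB might be tuned like this (the numbers are illustrative; derive them from your container memory limit, not these values):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # hard ceiling, roughly 80% of the container memory limit
    spike_limit_mib: 200  # headroom reclaimed aggressively during bursts
  batch:
    timeout: 5s
    send_batch_size: 500  # smaller batches trade throughput for lower peak memory
```

Keep memory_limiter first in every pipeline's processor list so it can apply backpressure before other processors allocate.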
Missing Kubernetes metadata on spans:
The k8sattributes processor requires the collector's service account to have get/list/watch on pods. Verify the ClusterRoleBinding is applied and the collector pod is using the correct service account.
Alert not firing: Check Alertmanager logs and verify the alert rule expression returns data in the SigNoz metrics explorer before creating the alert rule.
Production Considerations
Before this setup handles real production traffic, address the following:
- ClickHouse replication — Set clickhouse.replicasCount: 3 and zookeeper.replicaCount: 3 for HA. A single ClickHouse node is a single point of failure.
- Retention policies — Configure ClickHouse TTL policies to automatically expire old data. The default retention is 15 days for traces; adjust based on your storage budget.
- Sampling — At high request volumes, storing every trace is expensive. Implement tail-based sampling in the collector using the tail_sampling processor to keep 100% of error traces and a percentage of successful ones.
- Network policies — Restrict which namespaces can reach the OTLP collector ports (4317/4318) using NetworkPolicy resources.
- Backup — ClickHouse supports BACKUP TABLE to S3-compatible storage. Schedule regular backups of the traces and metrics tables.
- Upgrades — SigNoz releases frequently. Use helm diff before upgrading to review changes, and test upgrades in a staging cluster first.
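A tail-based sampling policy along those lines might look like this in the collector config (a sketch using the contrib distribution's tail_sampling processor; the buffer sizes and sampling percentage are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s    # hold spans this long so the full trace can be judged
    num_traces: 50000     # in-memory trace buffer
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-successes
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail sampling must run at a point where all spans of a trace converge, so place it on the central SigNoz-facing collector tier rather than the per-node DaemonSet.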
Conclusion
SigNoz delivers a compelling self-hosted observability experience for Kubernetes-native teams. The combination of OpenTelemetry-native ingestion, ClickHouse-backed storage, and a unified UI for traces, metrics, and logs removes the integration overhead that comes with assembling the Grafana/Prometheus/Loki/Tempo stack from scratch.
The setup covered here — Helm installation, DaemonSet collectors, application instrumentation, dashboards, and alerts — is a solid foundation for a production observability platform. The key operational investment is ClickHouse: understanding its resource requirements, tuning its TTL policies, and planning for replication will determine how well the platform scales.
From here, the natural next steps are implementing tail-based sampling to manage trace volume at scale, setting up SLO tracking using SigNoz's built-in SLO feature, and exploring the ClickHouse SQL interface for custom analytics that go beyond standard RED metrics.