SigNoz on Kubernetes: Self-Hosted Observability with OpenTelemetry

January 20, 2025

14 min read

observability · signoz · opentelemetry · kubernetes

Introduction

Observability is no longer optional for production systems. When something breaks at 2 AM, you need metrics, traces, and logs in one place — not three separate tools with no correlation between them.

SigNoz is an open-source, OpenTelemetry-native observability platform that gives you distributed tracing, metrics, and logs in a single unified UI. Unlike the Grafana/Prometheus/Loki/Tempo stack — which requires stitching together four separate projects — SigNoz ships as a cohesive product backed by ClickHouse for high-performance analytics.

Self-hosting matters for several reasons:

  • Data sovereignty — traces and logs often contain sensitive request payloads, user IDs, and internal service topology. Keeping that data in your own cluster avoids vendor lock-in and compliance headaches.
  • Cost predictability — SaaS observability pricing scales with data volume. A busy microservices platform can generate millions of spans per minute; self-hosting turns that into a fixed infrastructure cost.
  • Customisation — you control retention policies, sampling rates, and alert routing without negotiating with a vendor's pricing tier.

This guide walks through a production-grade SigNoz deployment on Kubernetes: Helm installation, OpenTelemetry Collector configuration, application instrumentation, dashboards, alerts, and a comparison with the Grafana/Prometheus stack.


Prerequisites

Infrastructure:

  • Kubernetes cluster 1.25+ (K3s, kubeadm, EKS, GKE — any distribution works)
  • kubectl with cluster-admin access
  • helm v3.12+
  • At least 4 vCPUs and 8 GB RAM available for SigNoz components
  • Persistent storage class available (for ClickHouse and Zookeeper volumes)

Knowledge:

  • Familiarity with Kubernetes concepts (Deployments, Services, ConfigMaps, PersistentVolumeClaims)
  • Basic understanding of distributed tracing concepts (spans, traces, context propagation)
  • Comfort with YAML manifests

Naming conventions used in this guide:

Namespace:        observability
SigNoz release:   signoz
Cluster domain:   cluster.local
Sample app:       order-service (Go)

Installing SigNoz on Kubernetes via Helm

SigNoz ships a single Helm chart that deploys the full stack: the SigNoz backend, ClickHouse (the columnar store for traces and logs), Zookeeper (ClickHouse coordination), and the OpenTelemetry Collector.

Add the Helm Repository

helm repo add signoz https://charts.signoz.io
helm repo update

Create the Namespace

kubectl create namespace observability

Inspect Default Values

Before installing, pull the default values to understand what you're deploying:

helm show values signoz/signoz > signoz-values.yaml

The chart is large. The key sections to review are clickhouse, queryService, frontend, and otelCollector.

Minimal Production Values

Create a signoz-override.yaml with sensible production defaults:

# signoz-override.yaml
global:
  storageClass: "standard"   # replace with your StorageClass name

clickhouse:
  replicasCount: 1           # increase to 3 for HA
  persistence:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "4"
      memory: 8Gi

queryService:
  resources:
    requests:
      cpu: "250m"
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

frontend:
  resources:
    requests:
      cpu: "100m"
      memory: 128Mi
    limits:
      cpu: "500m"
      memory: 512Mi

otelCollector:
  resources:
    requests:
      cpu: "250m"
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

alertmanager:
  enabled: true

zookeeper:
  replicaCount: 1
  persistence:
    enabled: true
    size: 10Gi

Install SigNoz

helm install signoz signoz/signoz \
  --namespace observability \
  --values signoz-override.yaml \
  --version 0.55.1 \
  --wait \
  --timeout 10m

The --wait flag blocks until all pods are ready. ClickHouse takes the longest — expect 3–5 minutes on a fresh cluster.

Verify the Installation

kubectl -n observability get pods
# NAME                                          READY   STATUS    RESTARTS   AGE
# signoz-clickhouse-0                           1/1     Running   0          4m
# signoz-zookeeper-0                            1/1     Running   0          4m
# signoz-query-service-xxxxxxxxx-xxxxx          1/1     Running   0          3m
# signoz-frontend-xxxxxxxxx-xxxxx               1/1     Running   0          3m
# signoz-otel-collector-xxxxxxxxx-xxxxx         1/1     Running   0          3m
# signoz-otel-collector-metrics-xxxxxxxxx-xxx   1/1     Running   0          3m
# signoz-alertmanager-0                         1/1     Running   0          3m

Access the UI

Port-forward the frontend service to verify everything is working before setting up an Ingress:

kubectl -n observability port-forward svc/signoz-frontend 3301:3301

Open http://localhost:3301 in your browser. You'll be prompted to create an admin account on first login.

Expose via Ingress (Optional)

For persistent access, create an Ingress resource:

# signoz-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: signoz-frontend
  namespace: observability
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - signoz.example.com
      secretName: signoz-tls
  rules:
    - host: signoz.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: signoz-frontend
                port:
                  number: 3301

kubectl apply -f signoz-ingress.yaml

Configuring OpenTelemetry Collectors

SigNoz ships with an embedded OpenTelemetry Collector, but for production you'll want a DaemonSet collector on each node to collect host metrics and forward application telemetry. This two-tier architecture keeps the SigNoz collector focused on ingestion while the node-level collector handles scraping and pre-processing.

Understanding the Architecture

Application Pods
        │ (OTLP gRPC/HTTP)
        ▼
Node-level OTel Collector (DaemonSet)
        │ (OTLP gRPC)
        ▼
SigNoz OTel Collector (Deployment)
        │ (internal)
        ▼
ClickHouse

Deploy the OpenTelemetry Operator

The OpenTelemetry Operator manages OpenTelemetryCollector and Instrumentation custom resources, making collector configuration declarative:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace observability \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
  --set admissionWebhooks.certManager.enabled=true

DaemonSet Collector for Node Metrics

# otel-daemonset-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-node
  namespace: observability
spec:
  mode: daemonset
  serviceAccount: otel-collector-sa
  env:
    # Required by the ${env:K8S_NODE_NAME} reference in the kubeletstats receiver
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  tolerations:
    - operator: Exists
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          disk: {}
          filesystem: {}
          load: {}
          memory: {}
          network: {}
          paging: {}
          processes: {}
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
        metric_groups:
          - node
          - pod
          - container

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 5s
        limit_mib: 400
        spike_limit_mib: 100
      resourcedetection:
        detectors: [env, k8snode]
        timeout: 5s
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time

    exporters:
      otlp:
        endpoint: "signoz-otel-collector.observability.svc.cluster.local:4317"
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp, hostmetrics, kubeletstats]
          processors: [memory_limiter, resourcedetection, k8sattributes, batch]
          exporters: [otlp]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]

kubectl apply -f otel-daemonset-collector.yaml

RBAC for the Collector

The collector needs permissions to read pod metadata and kubelet stats:

# otel-collector-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: otel-collector-sa
    namespace: observability

kubectl apply -f otel-collector-rbac.yaml

Instrumenting a Sample Application

Let's instrument a Go microservice — an order-service — to emit traces, metrics, and logs to SigNoz via OpenTelemetry.

Auto-Instrumentation via the Operator

The OpenTelemetry Operator supports zero-code auto-instrumentation for Go, Java, Python, Node.js, and .NET. Create an Instrumentation resource that points to your SigNoz collector:

# otel-instrumentation.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: signoz-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-node-collector.observability.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "1.0"
  go:
    image: ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.14.0-alpha
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.46.0
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.43b0

kubectl apply -f otel-instrumentation.yaml
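The tracecontext propagator in the resource above refers to the W3C Trace Context traceparent header, which carries the trace ID and sampling decision between services. As a stdlib-only illustration of its shape (a sketch for understanding the format — not the OTel library's actual parser):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its four fields:
// version (1 byte), trace-id (16 bytes), parent-id (8 bytes), flags (1 byte),
// all lowercase hex.
func parseTraceparent(h string) (version, traceID, spanID, flags string, err error) {
	re := regexp.MustCompile(`^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)
	m := re.FindStringSubmatch(strings.TrimSpace(h))
	if m == nil {
		return "", "", "", "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return m[1], m[2], m[3], m[4], nil
}

func main() {
	header := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	ver, tid, sid, flags, err := parseTraceparent(header)
	if err != nil {
		panic(err)
	}
	fmt.Println(ver, tid, sid, flags)
	// A flags value of "01" means the upstream service sampled this trace.
	fmt.Println("sampled:", flags == "01")
}
```

When every service in a request path propagates this header, SigNoz can stitch their spans into a single trace.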

Annotate your application Deployment to enable auto-instrumentation:

# order-service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        instrumentation.opentelemetry.io/inject-go: "signoz-instrumentation"
        instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/app/order-service"
    spec:
      containers:
        - name: order-service
          image: your-registry/order-service:latest
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: "order-service"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production,service.version=1.0.0"

For Go, the operator injects an eBPF-based sidecar agent that instruments the binary at runtime — no code changes required. (For Java, Node.js, and Python, it instead injects an init container that loads the language agent into the application process.)

Manual Instrumentation (Go)

For fine-grained control, instrument your Go service directly using the OpenTelemetry SDK:

// main.go
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-node-collector.observability.svc.cluster.local:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("order-service"),
            semconv.ServiceVersion("1.0.0"),
            semconv.DeploymentEnvironment("production"),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return tp, nil
}

func handleOrder(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("order-service")

    _, span := tracer.Start(ctx, "process-order")
    defer span.End()

    // Add custom attributes to the span
    span.SetAttributes(
        attribute.String("order.id", r.URL.Query().Get("id")),
        attribute.String("order.status", "processing"),
    )

    // Your business logic here
    w.WriteHeader(http.StatusOK)
    w.Write([]byte(`{"status":"accepted"}`))
}

func main() {
    ctx := context.Background()

    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("failed to initialize tracer: %v", err)
    }
    defer tp.Shutdown(ctx)

    // Wrap the HTTP handler with OTel instrumentation
    mux := http.NewServeMux()
    mux.Handle("/order", otelhttp.NewHandler(http.HandlerFunc(handleOrder), "order"))

    log.Println("order-service listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", mux))
}

Install the required modules:

go get go.opentelemetry.io/otel@v1.24.0
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.24.0
go get go.opentelemetry.io/otel/sdk@v1.24.0
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0

Structured Logging with Trace Correlation

Correlate logs with traces by injecting the trace ID into your log output:

import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

func logWithTrace(ctx context.Context, logger *zap.Logger, msg string) {
    span := trace.SpanFromContext(ctx)
    sc := span.SpanContext()

    logger.Info(msg,
        zap.String("trace_id", sc.TraceID().String()),
        zap.String("span_id", sc.SpanID().String()),
        zap.Bool("trace_sampled", sc.IsSampled()),
    )
}

SigNoz's log viewer can then pivot from a log line directly to the associated trace — a workflow that saves significant debugging time.


Setting Up Dashboards and Alerts

Dashboards

SigNoz ships with pre-built dashboards for common infrastructure components. Navigate to Dashboards in the UI and import from the community library, or build custom ones.

For the order-service, create a dashboard tracking the RED metrics (Rate, Errors, Duration):

  1. Go to Dashboards → New Dashboard
  2. Add a Time Series panel with this PromQL-style query:
# Request rate (requests per second)
sum(rate(http_server_duration_count{service_name="order-service"}[5m])) by (http_route)

# Error rate
sum(rate(http_server_duration_count{service_name="order-service", http_status_code=~"5.."}[5m]))
/ sum(rate(http_server_duration_count{service_name="order-service"}[5m]))

# P99 latency
histogram_quantile(0.99, sum(rate(http_server_duration_bucket{service_name="order-service"}[5m])) by (le, http_route))
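To make the P99 query less opaque: histogram_quantile estimates a quantile from cumulative buckets by finding the bucket that contains the target rank and interpolating linearly within it. A minimal, self-contained Go sketch of that algorithm (simplified — no +Inf or NaN handling, which Prometheus does perform):

```go
package main

import "fmt"

// bucket is a cumulative histogram bucket: count of observations <= le.
type bucket struct {
	le    float64 // upper bound ("le" label in Prometheus)
	count float64 // cumulative count up to le
}

// quantile finds the bucket containing rank q*total, then linearly
// interpolates between that bucket's lower and upper bounds.
// Buckets must be sorted by ascending le.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	for i, b := range buckets {
		if b.count >= rank {
			lower, lowerCount := 0.0, 0.0
			if i > 0 {
				lower, lowerCount = buckets[i-1].le, buckets[i-1].count
			}
			return lower + (b.le-lower)*(rank-lowerCount)/(b.count-lowerCount)
		}
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Hypothetical latency buckets in ms for 100 requests
	buckets := []bucket{{100, 50}, {250, 80}, {500, 99}, {1000, 100}}
	fmt.Printf("p50=%.0fms p99=%.0fms\n", quantile(0.50, buckets), quantile(0.99, buckets))
	// → p50=100ms p99=500ms
}
```

One practical consequence: the reported quantile is only as precise as your bucket boundaries, so choose histogram buckets around your SLO thresholds.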

SigNoz also supports ClickHouse SQL queries for custom analytics — useful for business metrics that don't fit the standard Prometheus model.
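Building on that ClickHouse SQL support, a query like the following could drive a custom panel. The table and column names (signoz_traces.signoz_index_v2, serviceName, durationNano) match one recent SigNoz schema but do change between versions — treat this as a sketch and verify against your deployment's schema:

```sql
-- Top 5 slowest endpoints for order-service over the last hour
SELECT
    name AS endpoint,
    count() AS spans,
    quantile(0.99)(durationNano) / 1e6 AS p99_ms
FROM signoz_traces.signoz_index_v2
WHERE serviceName = 'order-service'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY name
ORDER BY p99_ms DESC
LIMIT 5;
```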

Alerts

SigNoz's alert system is built on top of Alertmanager. Configure alerts via the UI under Alerts → New Alert Rule.

For a latency SLO alert:

# Example alert rule (configured via SigNoz UI or API)
alert: OrderServiceHighLatency
expr: |
  histogram_quantile(0.99,
    sum(rate(http_server_duration_bucket{service_name="order-service"}[5m])) by (le)
  ) > 500
for: 5m
labels:
  severity: warning
  team: platform
annotations:
  summary: "order-service P99 latency above 500ms"
  description: "P99 latency is {{ $value }}ms for the last 5 minutes"
  runbook: "https://wiki.example.com/runbooks/order-service-latency"

Configuring Alert Channels

Route alerts to Slack, PagerDuty, or email via the Alert Channels settings:

# Slack channel configuration (set via SigNoz UI)
name: platform-alerts-slack
type: slack
settings:
  webhook_url: "https://hooks.slack.com/services/T.../B.../..."
  channel: "#platform-alerts"
  title: "{{ .Labels.alertname }}"
  text: "{{ .Annotations.description }}"

For PagerDuty integration, use the routing key from your PagerDuty service and set the channel type to pagerduty in the SigNoz UI.


Comparing SigNoz with the Grafana/Prometheus Stack

Both stacks are production-proven. The right choice depends on your team's priorities.

Architecture Complexity

The Grafana/Prometheus stack is a collection of independent projects:

| Component      | Purpose                      |
|----------------|------------------------------|
| Prometheus     | Metrics scraping and storage |
| Grafana        | Visualisation                |
| Loki           | Log aggregation              |
| Tempo          | Distributed tracing          |
| Alertmanager   | Alert routing                |
| Promtail/Alloy | Log shipping                 |

Each component has its own configuration format, upgrade cycle, and failure mode. Correlating a trace with its logs requires manual configuration of Grafana data source links and consistent label schemes across all components.

SigNoz collapses this into a single deployment backed by ClickHouse. Trace-to-log correlation works out of the box because everything lands in the same database.

Storage and Scalability

Prometheus uses a local TSDB optimised for recent data. Long-term storage requires Thanos or Cortex — both significant operational additions. Loki and Tempo have their own storage backends (object storage or local disk).

SigNoz uses ClickHouse for all three signal types. ClickHouse is a columnar OLAP database that handles high-cardinality queries efficiently. A single ClickHouse cluster stores months of traces, metrics, and logs with good query performance. The trade-off is that ClickHouse is more resource-intensive than Prometheus TSDB for pure metrics workloads.

Query Languages

| Stack   | Metrics           | Logs             | Traces           |
|---------|-------------------|------------------|------------------|
| Grafana | PromQL            | LogQL            | TraceQL          |
| SigNoz  | PromQL-compatible | SQL / UI filters | UI filters + SQL |

If your team already knows PromQL deeply, the Grafana stack has an advantage. SigNoz's query interface is more approachable for teams new to observability, but less expressive for complex metric transformations.

OpenTelemetry Native vs. Prometheus Native

SigNoz is designed from the ground up for OpenTelemetry. The OTLP protocol is the primary ingestion path, and the data model maps directly to OTel's semantic conventions.

Prometheus was designed for pull-based metric scraping. While recent versions can ingest OTLP directly (and collectors can push metrics via remote write), the data model mismatch (Prometheus labels vs. OTel attributes) creates friction. Exemplars — the mechanism that links a metric data point to a trace — work in Prometheus but require careful configuration.

When to Choose Each

Choose SigNoz when:

  • You're starting fresh and want a unified observability platform
  • Your team is adopting OpenTelemetry as the instrumentation standard
  • You want trace-log-metric correlation without manual configuration
  • You prefer a product experience over assembling components

Choose Grafana/Prometheus when:

  • You have significant existing Prometheus investment (recording rules, dashboards, exporters)
  • Your team has deep PromQL expertise
  • You need Grafana's plugin ecosystem (dozens of data source integrations)
  • You're running at very large scale where Thanos/Cortex/Mimir are already in place

The two stacks are not mutually exclusive. A common pattern is to run Prometheus for infrastructure metrics (where the pull model and exporter ecosystem shine) and SigNoz for application traces and logs.


Troubleshooting Common Issues

Traces not appearing in SigNoz: Check that the OTLP endpoint in your application matches the collector service address. Verify with:

kubectl -n observability logs -l app.kubernetes.io/name=otel-collector --tail=50 | grep -iE "error|dropped"

ClickHouse pod in CrashLoopBackOff: Usually a resource constraint. Check events and increase memory limits:

kubectl -n observability describe pod signoz-clickhouse-0
kubectl -n observability logs signoz-clickhouse-0 --previous

High memory usage on the collector: The memory_limiter processor is your first line of defence. Tune limit_mib based on your node's available memory. Also review batch sizes — large batches increase peak memory usage.
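As a concrete illustration of that tuning (the numbers here are assumptions for a collector container with a 1 GiB memory limit, not universal defaults), a common rule of thumb is to set limit_mib to roughly 80% of the container limit and spike_limit_mib to about 20% of limit_mib:

```yaml
# Example: collector container with a 1Gi memory limit
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800        # ~80% of the container memory limit
    spike_limit_mib: 160  # ~20% of limit_mib
```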

Missing Kubernetes metadata on spans: The k8sattributes processor requires the collector's service account to have get/list/watch on pods. Verify the ClusterRoleBinding is applied and the collector pod is using the correct service account.

Alert not firing: Check Alertmanager logs and verify the alert rule expression returns data in the SigNoz metrics explorer before creating the alert rule.


Production Considerations

Before this setup handles real production traffic, address the following:

  • ClickHouse replication — Set clickhouse.replicasCount: 3 and zookeeper.replicaCount: 3 for HA. A single ClickHouse node is a single point of failure.
  • Retention policies — Configure ClickHouse TTL policies to automatically expire old data. The default retention is 15 days for traces; adjust based on your storage budget.
  • Sampling — At high request volumes, storing every trace is expensive. Implement tail-based sampling in the collector using the tail_sampling processor to keep 100% of error traces and a percentage of successful ones.
  • Network policies — Restrict which namespaces can reach the OTLP collector ports (4317/4318) using NetworkPolicy resources.
  • Backup — ClickHouse supports BACKUP TABLE to S3-compatible storage. Schedule regular backups of the traces and metrics tables.
  • Upgrades — SigNoz releases frequently. Use helm diff before upgrading to review changes, and test upgrades in a staging cluster first.
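The tail-based sampling mentioned above can be sketched with the contrib collector's tail_sampling processor. The policy names are illustrative; the option names are the processor's own:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s     # must exceed the duration of your longest traces
    num_traces: 50000      # traces held in memory awaiting a decision
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-ten-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Because sampling decisions are deferred until a trace completes, this processor must run where it sees all spans of a trace — typically the central SigNoz collector, not the per-node DaemonSet.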

Conclusion

SigNoz delivers a compelling self-hosted observability experience for Kubernetes-native teams. The combination of OpenTelemetry-native ingestion, ClickHouse-backed storage, and a unified UI for traces, metrics, and logs removes the integration overhead that comes with assembling the Grafana/Prometheus/Loki/Tempo stack from scratch.

The setup covered here — Helm installation, DaemonSet collectors, application instrumentation, dashboards, and alerts — is a solid foundation for a production observability platform. The key operational investment is ClickHouse: understanding its resource requirements, tuning its TTL policies, and planning for replication will determine how well the platform scales.

From here, the natural next steps are implementing tail-based sampling to manage trace volume at scale, setting up SLO tracking using SigNoz's built-in SLO feature, and exploring the ClickHouse SQL interface for custom analytics that go beyond standard RED metrics.