Structured Logging & Log Aggregation with ELK Stack

Why Centralized Logging Matters

The Logging Crisis

When services fail, where do you look first?

In distributed systems, logs are scattered across servers, containers, and cloud regions. A single request might touch five services. When something breaks, you end up hunting through log files on multiple machines with no shared context, and logs vanish whenever a container restarts.

Centralized logging solves this by collecting all logs in one searchable place, with context, correlation, and instant query capability.

The Cost of Poor Logging

  • Slow debugging: 30+ minutes to find what went wrong 5 minutes ago
  • Lost logs: When a container restarts, its local logs are gone for good
  • No correlation: Can’t trace a request across multiple services
  • Manual hunting: SSH + grep through millions of lines
  • No alerting: You wake up to customer complaints, not alerts

The Problem: Distributed Logs

Why Server Logs Aren’t Enough

# Server 1: /var/log/app.log
2026-03-21 10:15:23 Error: Database connection refused
# Server 2: /var/log/app.log (you don't see this for 15 minutes)
2026-03-21 10:15:22 Error: Database connection refused
# Combined, these tell a story, but:
# - They're on separate machines
# - You can't search them together
# - One container restart and the logs are gone
# - You have no context (which user? which request?)

The Solution: ELK Stack

What is ELK?

  • Elasticsearch: Distributed search and analytics engine. Stores logs as searchable documents with full-text indexing.
  • Logstash: Log processing pipeline. Collects, parses, enriches, and routes logs to Elasticsearch.
  • Kibana: Visualization and exploration platform. Query logs with KQL (Kibana Query Language) or Lucene syntax, build dashboards, and set alerts.

Core Benefits

  • Centralized: All logs in one place, searchable in milliseconds
  • Scalable: Scales horizontally across nodes and shards to handle billions of log events
  • Structured: JSON-based searching and filtering
  • Correlated: Trace requests across multiple services
  • Persistent: No data loss when services restart
  • Alertable: Triggered notifications on patterns

Architecture Overview

Services     →  Filebeat/Logstash  →  Elasticsearch          ←  Kibana (Query/Visualize)
   ↓                   ↓                    ↓
App logs        Parse, enrich         Index, store, analyze
DB logs         Filter, route         Full-text search
System logs     Add context           Real-time updates

Getting Started with ELK

Docker Compose Setup

compose.yml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      discovery.type: single-node
      # Security is disabled for local development only; keep it enabled in production
      xpack.security.enabled: 'false'
      xpack.security.transport.ssl.enabled: 'false'
    ports:
      - '9200:9200'
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - '5601:5601'
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - '5000:5000'
    environment:
      # discovery.seed_hosts is an Elasticsearch setting and does not belong here
      LS_JAVA_OPTS: '-Xmx256m -Xms256m'
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:

Start the stack:

docker compose up -d
# Kibana available at http://localhost:5601
# Elasticsearch at http://localhost:9200

Structured Logging with JSON

Why Structured Logging?

// Good: Structured (searchable, filterable)
{"timestamp": "2026-03-21T10:15:23Z", "service": "user-api", "level": "ERROR", "message": "Database connection failed", "user_id": 42, "request_id": "req-abc-123", "error_code": "DB_CONN_REFUSED", "retry_count": 3}
// Bad: Unstructured (exact string matching only)
"2026-03-21 10:15:23 ERROR [user-api] Database connection failed for user 42 in request req-abc-123"

Application Logging

Python Example:

app.py
import logging

from pythonjsonlogger import jsonlogger

# Configure JSON logging. The format string selects which record attributes
# appear in each JSON line; rename_fields and timestamp=True produce the
# "level" and "timestamp" keys used throughout this guide.
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
    "%(levelname)s %(name)s %(message)s",
    rename_fields={"levelname": "level"},
    timestamp=True,
)
logHandler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

# Use logging with context
logger.info("User login", extra={
    "user_id": 42,
    "request_id": "req-abc-123",
    "service": "user-api",
    "ip_address": "192.168.1.1"
})
logger.error("Database connection failed", extra={
    "user_id": 42,
    "request_id": "req-abc-123",
    "service": "user-api",
    "error_code": "DB_CONN_REFUSED",
    "retry_count": 3
})
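
If adding a dependency is not an option, the same idea can be sketched with the standard library alone. `StdlibJsonFormatter` below is a hypothetical name, not part of any library; it assumes every line should carry `timestamp`, `level`, and `message`, plus whatever is passed through `extra`.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord has; anything else arrived through `extra={...}`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS |= {"message", "asctime"}


class StdlibJsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy the caller-supplied context fields into the JSON document.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                event[key] = value
        return json.dumps(event, default=str)


# Demo: log into an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StdlibJsonFormatter())
demo_logger = logging.getLogger("stdlib-json-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.error(
    "Database connection failed",
    extra={"request_id": "req-abc-123", "retry_count": 3},
)
print(stream.getvalue(), end="")
```

Each call emits one JSON object per line, which is the shape a JSON codec on the Logstash side expects.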

Node.js Example:

app.ts
import winston from 'winston';

const logger = winston.createLogger({
  // Add an ISO 8601 timestamp to every entry, then serialize it as JSON
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  defaultMeta: {service: 'api-gateway'},
  transports: [new winston.transports.Console()],
});

// Log with context
logger.info('User authenticated', {
  user_id: 42,
  request_id: 'req-abc-123',
  ip_address: '192.168.1.1',
});
logger.error('Database connection failed', {
  user_id: 42,
  request_id: 'req-abc-123',
  error_code: 'DB_CONN_REFUSED',
  retry_count: 3,
});

Logstash Configuration

Basic Pipeline

logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
  # Read from files (the directory must be mounted into the Logstash container)
  file {
    path => "/var/log/app/*.log"
    codec => json
  }
}

filter {
  # Enrich logs from specific services
  if [service] == "api-gateway" {
    mutate {
      add_field => { "service_tier" => "frontend" }
    }
  }

  # Fall back to extracting the request ID from the message text
  # when the event has no request_id field of its own
  if ![request_id] {
    grok {
      match => { "message" => "request_id=%{NOTSPACE:request_id}" }
      tag_on_failure => []
    }
  }

  # Use the application's own timestamp as the event time
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  # Also output to stdout for debugging
  stdout {
    codec => rubydebug
  }
}
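
To try the TCP input without wiring up a full application, a small client can push one event at a time. This is a minimal sketch, assuming Logstash is reachable on `localhost:5000` as in the Compose file and that each event is sent as a single newline-terminated JSON line; `encode_event` and `send_event` are hypothetical helper names.

```python
import json
import socket


def encode_event(event: dict) -> bytes:
    """Serialize one event as a single newline-terminated JSON line."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(event: dict, host: str = "localhost", port: int = 5000,
               timeout: float = 2.0) -> None:
    """Open a TCP connection to the Logstash tcp input and send one event."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_event(event))


if __name__ == "__main__":
    sample = {
        "timestamp": "2026-03-21T10:15:23Z",
        "service": "user-api",
        "level": "ERROR",
        "message": "Database connection failed",
        "request_id": "req-abc-123",
    }
    print(encode_event(sample))
    # send_event(sample)  # uncomment once the stack is running
```

Opening a connection per event keeps the sketch short; a real shipper would reuse the connection and buffer events while Logstash is unavailable.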

Advanced: Parsing Multi-Service Logs

logstash-advanced.conf
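input {
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  # Normalize service names
  # (older versions of the translate plugin call these options field/destination)
  translate {
    source => "service"
    target => "service_normalized"
    dictionary => {
      "user-api" => "user-service"
      "user_api" => "user-service"
      "users" => "user-service"
    }
  }

  # Add environment if not present
  if ![environment] {
    mutate {
      add_field => { "environment" => "production" }
    }
  }

  # Split error stack traces into an array of lines
  # ("\n" is only read as a newline when config.support_escapes: true is set in logstash.yml)
  if [level] == "ERROR" and [stack_trace] {
    mutate {
      split => { "stack_trace" => "\n" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    # Elasticsearch 8 ships a built-in logs-*-* template that creates data streams only;
    # if index creation is rejected, use a prefix that does not match it
    index => "logs-%{environment}-%{+YYYY.MM.dd}"
  }
}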
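
It can help to pin down what this filter block should do to an event before deploying it. The sketch below mirrors the three steps in plain Python so the expected output can be checked in a unit test. It is an approximation for reasoning about the pipeline, not a reimplementation of Logstash semantics, and `normalize_event` is a hypothetical name.

```python
# Same alias table as the translate dictionary in the pipeline above.
SERVICE_ALIASES = {
    "user-api": "user-service",
    "user_api": "user-service",
    "users": "user-service",
}


def normalize_event(event: dict) -> dict:
    """Return a copy of the event with the three filter steps applied."""
    out = dict(event)

    # Step 1: normalize service names into a separate field.
    service = out.get("service")
    if service in SERVICE_ALIASES:
        out["service_normalized"] = SERVICE_ALIASES[service]

    # Step 2: default the environment when the field is missing or null.
    if out.get("environment") is None:
        out["environment"] = "production"

    # Step 3: split error stack traces into a list of lines.
    trace = out.get("stack_trace")
    if out.get("level") == "ERROR" and isinstance(trace, str) and trace:
        out["stack_trace"] = trace.split("\n")

    return out


example = normalize_event({
    "service": "users",
    "level": "ERROR",
    "stack_trace": "Traceback (most recent call last):\n  File app.py",
})
print(example)
```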

Querying Logs in Kibana

Creating Index Patterns

In Kibana:

  1. Go to Stack Management → Data Views (called Index Patterns before Kibana 8.0)
  2. Create pattern: logs-* (matches logs-2026.03.21, etc.)
  3. Set timestamp field to @timestamp

Basic Searches

# Find all ERROR logs
level: ERROR
# Errors in specific service
level: ERROR AND service: "user-api"
# Errors for specific user
level: ERROR AND user_id: 42
# Errors in time range (last 1 hour)
level: ERROR AND @timestamp >= now-1h
# Request tracing across services
request_id: "req-abc-123"

Advanced Kibana Query Language (KQL)

# Multiple conditions
service: "user-api" AND level: "ERROR" AND response_time_ms > 1000
# Wildcard matching (leave wildcards unquoted; quoted values are matched literally)
service: user-* AND message: *connection*
# Range queries
http_status_code >= 400 AND http_status_code <= 599 AND @timestamp >= "now-1d/d"
# Logical operators
(service: "payment-api" OR service: "billing-api") AND level: "ERROR"
# Exists
error_trace: *
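
The same searches can be run outside Kibana against the Elasticsearch `_search` endpoint. The sketch below builds the query DSL equivalent of "errors for one service in the last hour". It assumes the unsecured local stack from the Compose file and default dynamic mappings, where string fields get a `.keyword` sub-field; adjust the field names if your mapping differs. `build_error_query` and `search` are hypothetical helpers.

```python
import json
import urllib.request


def build_error_query(service: str, since: str = "now-1h", size: int = 20) -> dict:
    """Query DSL for: level is ERROR, service matches, newer than `since`."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level.keyword": "ERROR"}},
                    {"term": {"service.keyword": service}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
    }


def search(body: dict, base_url: str = "http://localhost:9200",
           index: str = "logs-*") -> dict:
    """POST the query to Elasticsearch and return the parsed response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    print(json.dumps(build_error_query("user-api"), indent=2))
    # hits = search(build_error_query("user-api"))["hits"]["hits"]
```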

Building Dashboards

Creating a Monitoring Dashboard

Dashboard: "Microservices Health"
1. **Error Rate Panel** (Line chart)
   - Query: level: "ERROR"
   - Group by: time (X-axis), service (series)
   - Show: errors per minute
2. **Response Time Panel** (Bar chart)
   - Query: All logs
   - Metric: avg(response_time_ms)
   - Breakdown by: service
3. **Top Errors Panel** (Table)
   - Query: level: "ERROR"
   - Top 10: error_code
4. **Request Volume Panel** (Metric)
   - Query: All logs
   - Show: total request count
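
Two of these panels boil down to simple aggregations, and it can be useful to see them spelled out. The sketch below computes "errors per minute by service" and "top error codes" over a handful of in-memory events. The sample data is made up for illustration (`TIMEOUT` is a hypothetical error code); Kibana runs the equivalent aggregations inside Elasticsearch.

```python
from collections import Counter

events = [
    {"timestamp": "2026-03-21T10:15:23Z", "service": "user-api",
     "level": "ERROR", "error_code": "DB_CONN_REFUSED"},
    {"timestamp": "2026-03-21T10:15:41Z", "service": "user-api",
     "level": "ERROR", "error_code": "DB_CONN_REFUSED"},
    {"timestamp": "2026-03-21T10:16:02Z", "service": "payment-api",
     "level": "ERROR", "error_code": "TIMEOUT"},  # hypothetical code
    {"timestamp": "2026-03-21T10:16:05Z", "service": "user-api", "level": "INFO"},
]


def errors_per_minute(events: list) -> Counter:
    """Count ERROR events per (service, minute) bucket."""
    counts = Counter()
    for event in events:
        if event.get("level") == "ERROR":
            minute = event["timestamp"][:16]  # e.g. "2026-03-21T10:15"
            counts[(event["service"], minute)] += 1
    return counts


def top_error_codes(events: list, n: int = 10) -> list:
    """Return the n most frequent error codes among ERROR events."""
    codes = Counter(
        event["error_code"]
        for event in events
        if event.get("level") == "ERROR" and "error_code" in event
    )
    return codes.most_common(n)


print(errors_per_minute(events))
print(top_error_codes(events))
```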

Setting Up Alerts

Alert: Error Rate Spike

# In Kibana: Stack Management → Alerting → Create Rule
Condition:
  When: count of documents matching level: "ERROR" is greater than 100
  For: the last 5 minutes
Action:
  Webhook: POST to Slack channel
  Message: "Error rate spiked in production"

Alert: Specific Error Pattern

Condition:
  When: count(error_code: "DB_CONN_REFUSED") is greater than 10
  For: the last 2 minutes
Action:
  Send to PagerDuty
  Message: "Database connection failures detected"
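
Both rules follow the same pattern: count the matching documents in a time window and compare the count to a threshold. Here is a minimal sketch of that check over in-memory events, assuming each event carries a timezone-aware `timestamp`. `count_matching` and `should_alert` are hypothetical names; Kibana evaluates the real rule against Elasticsearch on a schedule.

```python
from datetime import datetime, timedelta, timezone


def count_matching(events, now, window, **match):
    """Count events inside [now - window, now] whose fields equal `match`."""
    cutoff = now - window
    return sum(
        1
        for event in events
        if cutoff <= event["timestamp"] <= now
        and all(event.get(key) == value for key, value in match.items())
    )


def should_alert(events, now, window, threshold, **match):
    """True when the matching count is strictly greater than the threshold."""
    return count_matching(events, now, window, **match) > threshold


now = datetime(2026, 3, 21, 10, 20, tzinfo=timezone.utc)
# Twelve failures spread over the last 110 seconds, plus one stale event.
events = [
    {"timestamp": now - timedelta(seconds=10 * i), "error_code": "DB_CONN_REFUSED"}
    for i in range(12)
]
events.append({"timestamp": now - timedelta(minutes=10), "error_code": "DB_CONN_REFUSED"})

fired = should_alert(events, now, timedelta(minutes=2), 10, error_code="DB_CONN_REFUSED")
print("alert fired:", fired)
```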

Log Retention and Lifecycle

Index Lifecycle Management (ILM)

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
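
To make the timeline concrete, the sketch below maps an index's age to the phase this policy would place it in, using the same `min_age` thresholds. It is a simplification: with rollover enabled, ILM measures `min_age` from the rollover time rather than from index creation, and phase transitions happen on ILM's polling interval rather than instantly. `phase_for_age` is a hypothetical helper.

```python
from datetime import timedelta

# Phase thresholds copied from the policy above, in ascending order.
PHASE_MIN_AGES = [
    ("hot", timedelta(days=0)),
    ("warm", timedelta(days=7)),
    ("cold", timedelta(days=30)),
    ("delete", timedelta(days=90)),
]


def phase_for_age(age: timedelta) -> str:
    """Return the last phase whose min_age the index has reached."""
    current = PHASE_MIN_AGES[0][0]
    for name, min_age in PHASE_MIN_AGES:
        if age >= min_age:
            current = name
    return current


for days in (0, 7, 29, 30, 90):
    print(days, phase_for_age(timedelta(days=days)))
```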

Request Tracing: Correlation IDs

Implementing Request IDs

request_id_middleware.py
import logging
import uuid

from fastapi import Request

logger = logging.getLogger(__name__)


# Register on your FastAPI app with: app.middleware("http")(add_request_id)
async def add_request_id(request: Request, call_next):
    # Generate or extract request ID
    request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    # Store in request state
    request.state.request_id = request_id
    # Log with correlation
    logger.info("Request started", extra={
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path
    })
    response = await call_next(request)
    # Add to response headers for client
    response.headers["X-Request-ID"] = request_id
    return response
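
Passing `request_id` through `extra` on every call gets repetitive. One common alternative is to keep the ID in a context variable and let a logging filter stamp it onto each record. This is a sketch using only the standard library; `request_id_var` and `RequestIdFilter` are hypothetical names, and in the middleware above the `set` call would receive the real request ID.

```python
import contextvars
import io
import logging

# Holds the current request's ID; "-" is a placeholder outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


# Demo wiring: write to an in-memory stream so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
log = logging.getLogger("request-id-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# In the middleware this would be request_id_var.set(request_id).
token = request_id_var.set("req-abc-123")
log.info("Loading user profile")
request_id_var.reset(token)
log.info("Background job tick")

print(stream.getvalue(), end="")
```

Context variables are isolated per asyncio task, so concurrent requests do not see each other's IDs.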

Propagating Request IDs Across Services

# When calling another service
import httpx


async def call_user_service(request):
    request_id = request.state.request_id
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "http://user-api/users/42",
            headers={"X-Request-ID": request_id},  # Pass it along
        )
    return response.json()
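
Once every service logs the same `request_id`, reconstructing a request's path is a matter of grouping and sorting. The sketch below does this over in-memory events with made-up sample data, assuming all timestamps are ISO 8601 strings in UTC so that they sort correctly as text. `timelines` is a hypothetical helper; in Kibana the equivalent is a search on `request_id` sorted by `@timestamp`.

```python
from collections import defaultdict

events = [
    {"timestamp": "2026-03-21T10:15:23.120Z", "service": "user-api",
     "request_id": "req-abc-123", "message": "Database connection failed"},
    {"timestamp": "2026-03-21T10:15:23.050Z", "service": "api-gateway",
     "request_id": "req-abc-123", "message": "Request started"},
    {"timestamp": "2026-03-21T10:15:23.400Z", "service": "api-gateway",
     "request_id": "req-abc-123", "message": "Responded with 502"},
    {"timestamp": "2026-03-21T10:15:24.000Z", "service": "api-gateway",
     "request_id": "req-def-456", "message": "Request started"},
]


def timelines(events: list) -> dict:
    """Group events by request_id and order each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["request_id"]].append(event)
    return {
        request_id: sorted(group, key=lambda event: event["timestamp"])
        for request_id, group in grouped.items()
    }


for event in timelines(events)["req-abc-123"]:
    print(event["timestamp"], event["service"], event["message"])
```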

Best Practices

1. Log the Right Amount

# Good: Structured context without redundancy
logger.info("Payment processed", extra={
    "user_id": 42,
    "request_id": "req-abc",
    "amount": 99.99,
    "currency": "USD"
})
# Bad: Too verbose, and the context is buried in free text that can't be filtered
logger.info("User with ID 42 has processed a payment of 99.99 USD via request req-abc")

2. Use Consistent Field Names

// Across all services, use same field names
{
  "timestamp": "2026-03-21T10:15:23Z",
  "level": "ERROR",
  "service": "user-api",
  "user_id": 42,
  "request_id": "req-abc"
}
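
Consistency is easier to keep when it is checked automatically, for example in a unit test or CI step. The sketch below validates an event against a shared field list. The required fields and their types are an assumption based on the example above, and `field_problems` is a hypothetical helper.

```python
# Shared schema: field name -> expected Python type (assumed from the example).
REQUIRED_FIELDS = {
    "timestamp": str,
    "level": str,
    "service": str,
    "request_id": str,
}


def field_problems(event: dict) -> list:
    """Return a list of human-readable problems; empty means the event is fine."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type: {name}")
    return problems


good = {
    "timestamp": "2026-03-21T10:15:23Z",
    "level": "ERROR",
    "service": "user-api",
    "user_id": 42,
    "request_id": "req-abc",
}
# A service that drifted: different key for the request ID, numeric level.
drifted = {"timestamp": "2026-03-21T10:15:23Z", "level": 40,
           "service": "billing-api", "requestId": "req-abc"}

print(field_problems(good))
print(field_problems(drifted))
```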

3. Add Context to Errors

try:
    result = db.query(...)
except Exception as e:
    logger.error("Database query failed", extra={
        "error_type": type(e).__name__,
        "error_message": str(e),
        "query": query,            # What failed?
        "user_id": user_id,        # Who was affected?
        "request_id": request_id   # Trace it
    })

4. Index Planning

# Keep recent data hot (highly available)
# Archive old data (cost-effective)
# Delete after retention period
Daily indices: logs-2026.03.21, logs-2026.03.22
Retention (matching the ILM policy above): hot for 7 days, warm until day 30, cold until day 90, then delete
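
The daily naming scheme makes retention easy to reason about, because the date is recoverable from the index name. The sketch below generates a daily index name and picks out indices older than a retention window, assuming the `logs-YYYY.MM.dd` pattern used in this guide and a 90-day cutoff. `index_name` and `expired_indices` are hypothetical helpers; in practice ILM performs the deletion for you.

```python
from datetime import date, datetime, timedelta


def index_name(day: date, prefix: str = "logs") -> str:
    """Build the daily index name, e.g. logs-2026.03.21."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(names: list, today: date, retention_days: int = 90,
                    prefix: str = "logs") -> list:
    """Return the index names whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            day = datetime.strptime(name, f"{prefix}-%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index in the expected format
        if day < cutoff:
            expired.append(name)
    return expired


names = [index_name(date(2026, 3, 21)), index_name(date(2026, 6, 1)), "kibana-events"]
print(names)
print(expired_indices(names, today=date(2026, 6, 30)))
```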

Conclusion

Centralized logging transforms debugging from hours to minutes.

With the ELK Stack, you go from manual log hunting to instant dashboard insights. Combined with OpenTelemetry tracing, you have complete observability: logs for context, traces for request flow, metrics for trends.
