Devtips

Structured Logging & Log Aggregation with ELK Stack

Mohammad Abu Mattar 21 Mar, 2026 05 Mins read

Why Centralized Logging Matters

The Logging Crisis

When services fail, where do you look first?

With distributed systems, logs scatter across servers, containers, and cloud regions. A single request might touch 5 services. When something breaks, you end up hunting through log files on multiple machines, without context, and losing data whenever a container restarts.

Centralized logging solves this by collecting all logs in one searchable place, with context, correlation, and instant query capability.

The Cost of Poor Logging

Slow debugging: 30+ minutes to find what went wrong 5 minutes ago
Lost logs: Container restarts = logs disappear and are never recovered
No correlation: Can’t trace a request across multiple services
Manual hunting: SSH + grep through millions of lines
No alerting: You wake up to customer complaints, not alerts

The Problem: Distributed Logs

Why Server Logs Aren’t Enough

# Server 1
2026-03-21 10:15:23 Error: Database connection refused

# Server 2: /var/log/app.log (you don't see this for 15 minutes)
2026-03-21 10:15:22 Error: Database connection refused

# Combined, the entries from every server tell a story, but:
# - They're on 3 different machines
# - You can't search them together
# - Container restart and logs are gone
# - You have no context (which user? which request?)
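
To make the problem concrete, here is a minimal sketch of what centralization does conceptually: it merges per-server entries into one time-ordered stream. The server names and the `merge_logs` helper are hypothetical, and the sample lines simply reuse the format shown above.

```python
from datetime import datetime

# Hypothetical per-server log lines in the same format as the sample above.
SERVER_LOGS = {
    "server-1": ["2026-03-21 10:15:23 Error: Database connection refused"],
    "server-2": ["2026-03-21 10:15:22 Error: Database connection refused"],
}


def merge_logs(server_logs):
    """Return (timestamp, server, message) tuples sorted by time."""
    merged = []
    for server, lines in server_logs.items():
        for line in lines:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            merged.append((ts, server, line[20:]))
    return sorted(merged)


for ts, server, message in merge_logs(SERVER_LOGS):
    print(ts.isoformat(), server, message)
```

Once the entries sit in one ordered stream, it becomes visible that server 2 failed a second before server 1, which is the kind of detail that is easy to miss when reading each file separately.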

The Solution: ELK Stack

What is ELK?

Elasticsearch: Distributed search and analytics engine. Stores logs as searchable documents with full-text indexing.
Logstash: Log processing pipeline. Collects, parses, enriches, and routes logs to Elasticsearch.
Kibana: Visualization and exploration platform. Query logs with KQL (Kibana Query Language) or Lucene syntax, build dashboards, set alerts.

Core Benefits

Centralized: All logs in one place, searchable with a single query
Scalable: Scales horizontally to billions of log documents by adding nodes and shards
Structured: JSON-based searching and filtering
Correlated: Trace requests across multiple services
Persistent: Logs that have been shipped survive service and container restarts
Alertable: Triggered notifications on patterns

Architecture Overview

Services    → Filebeat/Logstash → Elasticsearch ← Kibana (Query/Visualize)
   ↓              ↓                   ↓
App logs      Parse, enrich       Index, store, analyze
DB logs       Filter, route       Full-text search
System logs   Add context         Real-time updates
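
As a rough illustration of the last hop in this diagram, the sketch below builds the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint accepts. Logstash and Filebeat handle this for you; the `build_bulk_body` helper and the sample events are hypothetical and only show the shape of the data.

```python
import json


def build_bulk_body(index, events):
    """Serialize events into the NDJSON format expected by the _bulk endpoint."""
    lines = []
    for event in events:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "logs-2026.03.21",
    [{"level": "ERROR", "service": "user-api", "message": "Database connection failed"}],
)
print(body)
```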

Getting Started with ELK

Docker Compose Setup

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      discovery.type: single-node
      # Security is disabled for local experiments only
      # Booleans are quoted so Compose passes them through as strings
      xpack.security.enabled: 'false'
      xpack.security.transport.ssl.enabled: 'false'
    ports:
      - '9200:9200'
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - '5601:5601'
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - '5000:5000'
    environment:
      LS_JAVA_OPTS: '-Xmx256m -Xms256m'
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:

Start the stack:

docker-compose up -d   # with Compose v2: docker compose up -d
# Kibana available at http://localhost:5601
# Elasticsearch at http://localhost:9200
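
Elasticsearch can take a little while to start, so it may help to check cluster health before sending logs. The following is a minimal sketch using only the standard library; `cluster_ready` and `fetch_health` are hypothetical helper names, and the URL assumes the port mapping from the Compose file above.

```python
import json
import urllib.request


def cluster_ready(health):
    """Treat "green" and "yellow" as usable; a single node often reports yellow."""
    return health.get("status") in ("green", "yellow")


def fetch_health(base_url="http://localhost:9200"):
    """Fetch /_cluster/health; raises URLError if Elasticsearch is not up yet."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


# Example with a canned response, so the snippet runs without a cluster:
print(cluster_ready({"cluster_name": "docker-cluster", "status": "yellow"}))
```

Against a running stack, `cluster_ready(fetch_health())` performs the real check.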

Structured Logging with JSON

Why Structured Logging?

// Good: Structured (searchable, filterable)
{"timestamp": "2026-03-21T10:15:23Z", "service": "user-api", "level": "ERROR", "message": "Database connection failed", "user_id": 42, "request_id": "req-abc-123", "error_code": "DB_CONN_REFUSED", "retry_count": 3}

// Bad: Unstructured (exact string matching only)
"2026-03-21 10:15:23 ERROR [user-api] Database connection failed for user 42 in request req-abc-123"

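The difference shows up as soon as you try to pull a value out of each form. This short sketch reads `user_id` from both sample lines; the regular expression is a hypothetical pattern written for this one message and would break if the wording changed.

```python
import json
import re

structured = (
    '{"timestamp": "2026-03-21T10:15:23Z", "service": "user-api", '
    '"level": "ERROR", "message": "Database connection failed", '
    '"user_id": 42, "request_id": "req-abc-123"}'
)
unstructured = (
    "2026-03-21 10:15:23 ERROR [user-api] Database connection failed "
    "for user 42 in request req-abc-123"
)

# Structured: parse once, then read any field by name.
event = json.loads(structured)
user_from_json = event["user_id"]

# Unstructured: a hand-written pattern tied to this exact wording.
match = re.search(r"for user (\d+) in request (\S+)", unstructured)
user_from_text = int(match.group(1)) if match else None

print(user_from_json, user_from_text)
```
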
Application Logging

Python Example:

import logging
from pythonjsonlogger import jsonlogger

# Configure JSON logging on the root logger
log_handler = logging.StreamHandler()
# List the standard record attributes to emit alongside any `extra` fields
formatter = jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
log_handler.setFormatter(formatter)
logger = logging.getLogger()
logger.addHandler(log_handler)
logger.setLevel(logging.INFO)

# Use logging with context
logger.info("User login", extra={
    "user_id": 42,
    "request_id": "req-abc-123",
    "service": "user-api",
    "ip_address": "192.168.1.1"
})

logger.error("Database connection failed", extra={
    "user_id": 42,
    "request_id": "req-abc-123",
    "service": "user-api",
    "error_code": "DB_CONN_REFUSED",
    "retry_count": 3
})
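
If adding a dependency is not an option, a similar result can be approximated with the standard library alone. This is a minimal sketch rather than a replacement for python-json-logger; the `StdlibJsonFormatter` class is a hypothetical name, and the output field names (`timestamp`, `level`, `message`) are chosen to match the JSON example earlier in this article.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes present on every LogRecord; anything else came from `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class StdlibJsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy fields passed through `extra=...` into the payload.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(StdlibJsonFormatter())
demo_logger = logging.getLogger("stdlib-json-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.info("User login", extra={"user_id": 42, "request_id": "req-abc-123"})
```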

Node.js Example:

import winston from 'winston';

const logger = winston.createLogger({
  // Add a timestamp to every entry, then serialize as JSON
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json(),
  ),
  defaultMeta: {service: 'api-gateway'},
  transports: [new winston.transports.Console()],
});

// Log with context
logger.info('User authenticated', {
  user_id: 42,
  request_id: 'req-abc-123',
  ip_address: '192.168.1.1',
});

logger.error('Database connection failed', {
  user_id: 42,
  request_id: 'req-abc-123',
  error_code: 'DB_CONN_REFUSED',
  retry_count: 3,
});

Logstash Configuration

Basic Pipeline

input {
  # Receive newline-delimited JSON over TCP
  tcp {
    port => 5000
    codec => json_lines
  }

  # Read from files
  file {
    path => "/var/log/app/*.log"
    codec => json
  }
}

filter {
  # Enrich logs from a specific service
  if [service] == "api-gateway" {
    mutate {
      add_field => { "service_tier" => "frontend" }
    }
  }

  # Extract request ID from the message when it is not already a field
  if ![request_id] {
    grok {
      match => { "message" => "request_id=%{NOTSPACE:request_id}" }
      tag_on_failure => []
    }
  }

  # Use the application's timestamp as the event time
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }

  # Also output to stdout for debugging
  stdout {
    codec => rubydebug
  }
}
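
To feed the TCP input on port 5000 from an application, each event can be written as one JSON object per line. Below is a minimal sketch of a logging handler that does this with the standard library; `JsonLinesTcpHandler` is a hypothetical class, the host and port are assumptions taken from the Compose file, and a production setup would more likely use Filebeat or a maintained client library.

```python
import json
import logging
import socket


class JsonLinesTcpHandler(logging.Handler):
    """Hypothetical handler: sends each record as one JSON line over TCP."""

    def __init__(self, host="localhost", port=5000):
        super().__init__()
        self.address = (host, port)

    def serialize(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", "unknown"),
        }
        # A trailing newline marks the end of the event on the stream.
        return (json.dumps(payload) + "\n").encode("utf-8")

    def emit(self, record):
        try:
            # One connection per record keeps the sketch simple, not fast.
            with socket.create_connection(self.address, timeout=2) as conn:
                conn.sendall(self.serialize(record))
        except OSError:
            self.handleError(record)
```

Attach it with `logging.getLogger().addHandler(JsonLinesTcpHandler())` once Logstash is listening.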

Advanced: Parsing Multi-Service Logs

input {
  tcp {
    port => 5000
    codec => json_lines
  }
}

filter {
  # Normalize service names
  translate {
    source => "service"
    target => "service_normalized"
    dictionary => {
      "user-api" => "user-service"
      "user_api" => "user-service"
      "users" => "user-service"
    }
  }

  # Add environment if not present
  if ![environment] {
    mutate {
      add_field => { "environment" => "production" }
    }
  }

  # Split error stack traces into an array of lines
  # ("\n" is only treated as a newline when config.support_escapes: true is set in logstash.yml)
  if [level] == "ERROR" and [stack_trace] {
    mutate {
      split => { "stack_trace" => "\n" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{environment}-%{+YYYY.MM.dd}"
  }
}
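
When changing a filter block like this one, it can help to write down the expected result for a sample event first. The sketch below mirrors the three filter steps in plain Python so those expectations can be checked quickly. It is an illustrative model, not how Logstash executes, and `enrich` and `SERVICE_ALIASES` are hypothetical names.

```python
SERVICE_ALIASES = {
    "user-api": "user-service",
    "user_api": "user-service",
    "users": "user-service",
}


def enrich(event):
    """Apply the same three steps as the filter block to a copy of the event."""
    out = dict(event)
    # 1. Normalize service names into a separate field.
    if out.get("service") in SERVICE_ALIASES:
        out["service_normalized"] = SERVICE_ALIASES[out["service"]]
    # 2. Default the environment when the field is absent.
    out.setdefault("environment", "production")
    # 3. Split stack traces into lines for ERROR events only.
    if out.get("level") == "ERROR" and out.get("stack_trace"):
        out["stack_trace"] = out["stack_trace"].split("\n")
    return out


print(enrich({"service": "users", "level": "ERROR", "stack_trace": "line 1\nline 2"}))
```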

Querying Logs in Kibana

Creating Index Patterns

In Kibana:

Go to Stack Management → Data Views (named Index Patterns in older Kibana versions)
Create a data view with the pattern logs-* (matches logs-2026.03.21, etc.)
Set the timestamp field to @timestamp

Basic Searches

# Find all ERROR logs
level: ERROR

# Errors in specific service
level: ERROR AND service: "user-api"

# Errors for specific user
level: ERROR AND user_id: 42

# Errors in time range (last 1 hour); the Kibana time picker achieves the same
level: ERROR AND @timestamp >= now-1h

# Request tracing across services
request_id: "req-abc-123"
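
Outside Kibana, the same searches can be expressed in the Elasticsearch Query DSL. Here is a minimal sketch that builds the request body for "errors in a given service during the last hour". The `.keyword` sub-fields are an assumption based on Elasticsearch's default dynamic mapping for strings, and `errors_for_service` is a hypothetical helper.

```python
import json


def errors_for_service(service, since="now-1h"):
    """Build a search body equivalent to: level: ERROR AND service: <service>."""
    return {
        "query": {
            "bool": {
                # Filters match exactly and skip relevance scoring.
                "filter": [
                    {"term": {"level.keyword": "ERROR"}},
                    {"term": {"service.keyword": service}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }


print(json.dumps(errors_for_service("user-api"), indent=2))
```

The resulting JSON can be sent as the body of a `POST logs-*/_search` request.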

Advanced Kibana Query Language (KQL)

# Multiple conditions
service: "user-api" AND level: "ERROR" AND response_time_ms > 1000

# Wildcard matching (wildcards must be unquoted; quoted values match literally)
service: user-* AND message: *connection*

# Range queries
http_status_code >= 400 AND http_status_code <= 599 AND @timestamp >= now-1d/d

# Logical operators
(service: "payment-api" OR service: "billing-api") AND level: "ERROR"

# Exists
error_trace: *

Building Dashboards

Creating a Monitoring Dashboard

Dashboard: "Microservices Health"

1. **Error Rate Panel** (Line chart)
   - Query: level: "ERROR"
   - Group by: time (X-axis), service (series)
   - Show: errors per minute

2. **Response Time Panel** (Bar chart)
   - Query: All logs
   - Metric: avg(response_time_ms)
   - Breakdown by: service

3. **Top Errors Panel** (Table)
   - Query: level: "ERROR"
   - Top 10: error_code

4. **Request Volume Panel** (Metric)
   - Query: All logs
   - Show: total request count
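
Behind a panel like the error rate chart sits an aggregation query. The sketch below builds a comparable request body by hand: a date histogram bucketed per minute, split by service. It is an approximation of what Kibana generates rather than its exact output, and the `.keyword` field names assume the default dynamic mapping.

```python
import json


def error_rate_aggregation(interval="1m"):
    """Count ERROR events per time bucket, broken down by service."""
    return {
        "size": 0,  # Only the aggregation buckets are needed, not the hits.
        "query": {"term": {"level.keyword": "ERROR"}},
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {
                    "by_service": {"terms": {"field": "service.keyword", "size": 10}}
                },
            }
        },
    }


print(json.dumps(error_rate_aggregation(), indent=2))
```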

Setting Up Alerts

Alert: Error Rate Spike

# In Kibana: Stack Management → Rules → Create rule

Condition:
  When: count(level: "ERROR") is greater than 100
  For: the last 5 minutes

Action:
  Webhook: POST to Slack channel
  Message: "Error rate spiked in production"

Alert: Specific Error Pattern

Condition:
  When: count(error_code: "DB_CONN_REFUSED") is greater than 10
  For: the last 2 minutes

Action:
  Send to PagerDuty
  Message: "Database connection failures detected"
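
The condition in this rule boils down to counting matching events inside a sliding window. As a minimal sketch of that logic (not Kibana's implementation), the hypothetical `should_alert` function below applies the same threshold to a list of already-parsed events.

```python
from datetime import datetime, timedelta, timezone


def should_alert(events, now, error_code="DB_CONN_REFUSED",
                 window=timedelta(minutes=2), threshold=10):
    """Return True when more than `threshold` matching events fall inside the window."""
    cutoff = now - window
    count = 0
    for event in events:
        if event.get("error_code") != error_code:
            continue
        # Normalize a trailing "Z" so older Python versions can parse it.
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        if ts >= cutoff:
            count += 1
    return count > threshold


now = datetime(2026, 3, 21, 10, 16, 0, tzinfo=timezone.utc)
sample = [{"error_code": "DB_CONN_REFUSED", "timestamp": "2026-03-21T10:15:23Z"}] * 11
print(should_alert(sample, now))
```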

Log Retention and Lifecycle

Index Lifecycle Management (ILM)

# Run in Kibana Dev Tools; the cold phase needs a registered snapshot repository
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
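
To reason about where an index sits under this policy, it can help to map an age in days to a phase. The sketch below does only that arithmetic, using the `min_age` values from the policy above. Note the simplification: ILM measures age from rollover (or from index creation when there is no rollover), which this hypothetical `phase_for_age` helper does not model.

```python
# (phase, min_age in days), checked from the oldest phase down.
PHASE_MIN_AGE_DAYS = [("delete", 90), ("cold", 30), ("warm", 7), ("hot", 0)]


def phase_for_age(age_days):
    """Return the phase an index of the given age falls into."""
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            return phase
    raise ValueError("age_days must be non-negative")


for age in (0, 7, 45, 120):
    print(age, phase_for_age(age))
```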

Request Tracing: Correlation IDs

Implementing Request IDs

import logging
import uuid

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger(__name__)


@app.middleware("http")
async def add_request_id(request: Request, call_next):
    # Reuse the caller's request ID, or generate one at the edge
    request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())

    # Store in request state so handlers can read it
    request.state.request_id = request_id

    # Log with correlation
    logger.info("Request started", extra={
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path
    })

    response = await call_next(request)

    # Add to response headers for client
    response.headers["X-Request-ID"] = request_id

    return response
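
Passing `request_id` through `extra` on every call is easy to forget. One possible alternative, sketched here with the standard library only, is to keep the ID in a context variable and let a logging filter stamp it onto every record. `request_id_var` and `RequestIdFilter` are hypothetical names, not part of FastAPI.

```python
import contextvars
import logging

# Holds the current request's ID; each asyncio task sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))
ctx_logger = logging.getLogger("request-id-demo")
ctx_logger.addHandler(handler)
ctx_logger.setLevel(logging.INFO)

request_id_var.set("req-abc-123")  # In the middleware: request_id_var.set(request_id)
ctx_logger.info("Request started")
```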

Propagating Request IDs Across Services

# When calling another service
import httpx


async def call_user_service(request):
    request_id = request.state.request_id

    async with httpx.AsyncClient() as client:
        response = await client.get(
            "http://user-api/users/42",
            headers={"X-Request-ID": request_id},  # Pass it along
        )

    # Fail loudly on 4xx/5xx instead of parsing an error body
    response.raise_for_status()
    return response.json()
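
Services that are not built on FastAPI need the same rule: reuse the incoming ID if there is one, otherwise create it. A minimal, framework-independent sketch of that rule follows; `outbound_headers` is a hypothetical helper, and the header name matches the one used above.

```python
import uuid


def outbound_headers(incoming_headers):
    """Reuse the caller's X-Request-ID, or mint one, for downstream calls."""
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {key.lower(): value for key, value in incoming_headers.items()}
    request_id = lowered.get("x-request-id") or str(uuid.uuid4())
    return {"X-Request-ID": request_id}


print(outbound_headers({"x-request-id": "req-abc-123"}))
```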

Best Practices

1. Log the Right Amount

# Good: Structured context without redundancy
logger.info("Payment processed", extra={
    "user_id": 42,
    "request_id": "req-abc",
    "amount": 99.99,
    "currency": "USD"
})

# Bad: Too verbose; values are buried in free text and the timestamp duplicates what the logger adds
logger.info(f"User with ID 42 has processed a payment of 99.99 USD via request req-abc at {timestamp}")

2. Use Consistent Field Names

// Across all services, use same field names
{
  "timestamp": "2026-03-21T10:15:23Z",
  "level": "ERROR",
  "service": "user-api",
  "user_id": 42,
  "request_id": "req-abc"
}
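
Consistency is easier to keep when it is checked automatically, for example in a unit test for each service's logger. The sketch below flags events that lack shared fields. Which fields count as required is an assumption made for illustration; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names.

```python
# Assumed minimum set of fields every service should emit.
REQUIRED_FIELDS = {"timestamp", "level", "service", "request_id"}


def missing_fields(event):
    """Return the required field names absent from a log event, sorted."""
    return sorted(REQUIRED_FIELDS - set(event))


event = {"timestamp": "2026-03-21T10:15:23Z", "level": "ERROR", "service": "user-api"}
print(missing_fields(event))
```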

3. Add Context to Errors

try:
    result = db.query(query)
except Exception as e:
    # exc_info=True attaches the traceback to the log record
    logger.error("Database query failed", exc_info=True, extra={
        "error_type": type(e).__name__,
        "error_message": str(e),
        "query": query,  # What failed?
        "user_id": user_id,  # Who was affected?
        "request_id": request_id  # Trace it
    })

4. Index Planning

# Keep recent data hot (highly available)
# Archive old data (cost-effective)
# Delete after retention period

Daily indices: logs-2026.03.21, logs-2026.03.22
Retention: 90 days hot + searchable, 1 year archival, then delete
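
The daily naming scheme makes retention easy to reason about in code. As a minimal sketch, the helpers below build an index name for a date and list which indices have aged past a retention window. `index_name` and `expired_indices` are hypothetical helpers, and the 90-day default is taken from the retention line above. In practice ILM would perform the deletion.

```python
from datetime import date, datetime, timedelta


def index_name(day):
    """Build the daily index name, e.g. logs-2026.03.21."""
    return f"logs-{day:%Y.%m.%d}"


def expired_indices(names, today, retention_days=90):
    """Return index names whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        name for name in names
        if datetime.strptime(name, "logs-%Y.%m.%d").date() < cutoff
    ]


print(index_name(date(2026, 3, 21)))
print(expired_indices(["logs-2025.12.01", "logs-2026.03.21"], today=date(2026, 3, 22)))
```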

Conclusion

Centralized logging can cut debugging time from hours to minutes.

With the ELK Stack, you go from manual log hunting to searchable logs and dashboards. Combined with tracing (for example, OpenTelemetry) and metrics, you get complete observability: logs for context, traces for request flow, metrics for trends.
