Disciplined Logging with ELK

Four years ago, I had the chance to work on a medium-sized system that generated 2GB of logs daily, with full-body logging, stack traces, and query debugging all enabled. Everything streamed to a medium-sized Elasticsearch cluster (hot-warm architecture, 8 nodes).

  • After a month, the cost of the Elasticsearch EC2 instances and EBS volumes had jumped to more than $7000, excluding the S3 backup cost.
  • More than 40% of the logs came from dev/staging environments, and no Index Lifecycle Management (ILM) was in place.
  • Logs were poorly formatted: each service implemented logging in a different way, which made it very hard to filter for a specific error.

The right mindset first

  1. The Purpose: Don’t just log everything, and avoid the “just in case” approach. You need to know why you’re logging something, who is going to use it, and when.
    – Developers need to troubleshoot application errors
    – The security team needs to audit security events
    – The business team needs to trace back a specific payment
    – Your SREs need to know the system health status
  2. Log levels: The levels ERROR, WARN, INFO, DEBUG, TRACE exist for a reason. Using them consistently lets you decide later which levels are necessary in each environment, for example excluding console.log debug output from Production. It also comes in handy when you set up alerting based on severity.
  3. Structured logging: A unified, structured log format across your services makes the logs searchable and filterable. For example:
{
  "timestamp": "2025-05-07T10:23:45.678Z",
  "level": "ERROR",
  "service": "payment-processor",
  "message": "Payment gateway connection timeout",
  "transaction_id": "tx-98765",
  "timeout_ms": 5000
}
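
To make points 2 and 3 concrete, here is a minimal sketch using only Python's standard logging module. The names JsonFormatter and build_logger, the "context" key passed through extra, and the rule "INFO in production, DEBUG elsewhere" are assumptions made for illustration, not a prescribed setup. Note that Python names the level WARNING rather than WARN.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"context": {...}} to attach
        # searchable fields such as transaction_id.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str, environment: str) -> logging.Logger:
    """Create a JSON logger whose verbosity depends on the environment."""
    logger = logging.getLogger(service)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    # Assumed policy: drop DEBUG output in production, keep it elsewhere.
    logger.setLevel(logging.INFO if environment == "production" else logging.DEBUG)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-processor", "production")
    log.debug("This line is suppressed in production")
    log.error(
        "Payment gateway connection timeout",
        extra={"context": {"transaction_id": "tx-98765", "timeout_ms": 5000}},
    )
```

Because every service emits the same keys, a filter such as level:ERROR AND service:payment-processor works the same way across the whole system.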

Some best practices for Elasticsearch

  • Implement tiered storage: setting up Elasticsearch’s Index Lifecycle Management (ILM) with Hot, Warm, Cold, and Delete phases automatically moves older logs to cheaper storage and can save you a lot on storage costs.
  • Use an index template so that new indices inherit the ILM policy. This also makes index transitions easier to monitor, for example to see whether the index metric-2024-11-01 has moved from Warm to Cold storage.
  • It’s important, especially for a multi-service system, to integrate logging with a tracing solution such as OpenTelemetry for better visibility into dependencies. It also makes it much easier to trace a payment across multiple services by its requestID.
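
As a rough sketch of the first two points, the snippet below builds the request bodies you would send to Elasticsearch's PUT _ilm/policy/<name> and PUT _index_template/<name> endpoints. The function names, the policy name logs-policy, the pattern metric-*, and the 7d/30d/90d ages are assumptions chosen for illustration; pick retention periods that match your own requirements. It also assumes a cluster that uses data tiers, where ILM relocates an index as it enters each phase.

```python
import json


def build_ilm_policy(warm_after: str = "7d",
                     cold_after: str = "30d",
                     delete_after: str = "90d") -> dict:
    """Body for PUT _ilm/policy/<name>: Hot, Warm, Cold, then Delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"set_priority": {"priority": 100}},
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        # Merge segments once the index is no longer written to.
                        "forcemerge": {"max_num_segments": 1},
                        "set_priority": {"priority": 50},
                    },
                },
                "cold": {
                    "min_age": cold_after,
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


def build_index_template(policy_name: str, pattern: str) -> dict:
    """Body for PUT _index_template/<name>: attach the policy to new indices."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {"index.lifecycle.name": policy_name},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
    print(json.dumps(build_index_template("logs-policy", "metric-*"), indent=2))
```

With daily indices and no rollover action, each min_age is measured from index creation, so an index like metric-2024-11-01 would enter the Cold phase about 30 days after it was created. The ILM explain API (GET <index>/_ilm/explain) should show which phase an index is currently in.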
