Observability Concepts

Observability

It is a measure of how well a system's internal states can be inferred from its known outputs.

(In other words: how easily can you tell whether the system is working, based on its outputs?)

Observability allows us to know why and how something went wrong.

To ask these questions about your system, your application must be properly instrumented. That is, the application code must emit signals such as traces, metrics and logs.

Monitoring

It is a subset of Observability.

Monitoring shows us that something is wrong, and it is based on knowing in advance which signals to monitor.
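As a rough illustration of "knowing in advance which signals to monitor", the sketch below checks observed values against a fixed list of thresholds. The `Threshold` class, the signal names and the limits are all hypothetical and only serve the example. A signal that nobody thought to list (here `queue_depth`) is silently ignored, which is the gap observability is meant to close.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    signal: str      # name of a signal chosen ahead of time (hypothetical)
    maximum: float   # alert when the observed value exceeds this limit


def check(thresholds, observed):
    """Return the names of the known signals whose value exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = observed.get(t.signal)
        if value is not None and value > t.maximum:
            alerts.append(t.signal)
    return alerts


thresholds = [Threshold("error_rate", 0.05), Threshold("cpu_utilization", 0.90)]
# queue_depth looks unhealthy, but it is not on the list, so nothing fires for it.
observed = {"error_rate": 0.12, "cpu_utilization": 0.40, "queue_depth": 9000}
alerts = check(thresholds, observed)
print(alerts)
```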

Reliability and Metrics

Reliability

Answers the question: “Is the service doing what users expect it to be doing?”

A system could be up 100% of the time and still be unreliable: if a user clicks “Add to Cart” to add a pair of black shoes to their shopping cart and the system doesn’t always add the black shoes, it is not doing what users expect.

SLI

A Service Level Indicator (SLI) is a measurement of a service's behavior. A good SLI measures your service from the perspective of your users.

An example SLI can be the speed at which a web page loads.
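One common way to turn page load speed into a single number is to report the fraction of page loads that finish within a threshold. The sketch below assumes that approach; the 500 ms threshold and the sample load times are made up for illustration.

```python
def latency_sli(load_times_ms, threshold_ms):
    """Return the fraction of page loads that finished within threshold_ms."""
    if not load_times_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for t in load_times_ms if t <= threshold_ms)
    return good / len(load_times_ms)


# Hypothetical page load times, in milliseconds, as seen by users.
load_times_ms = [120, 340, 95, 2100, 480, 150, 3050, 220, 310, 180]
sli = latency_sli(load_times_ms, threshold_ms=500)
print(f"{sli:.0%} of page loads finished within 500 ms")
```

Expressing the SLI as a ratio of good events to all events keeps it between 0 and 1, which makes it easy to compare against a target later.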

SLO

A Service Level Objective (SLO) is the means by which reliability is communicated to an organization and to other teams.

This is accomplished by attaching one or more SLIs to business value.
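A minimal sketch of that idea, assuming an SLO is modeled as a plain-language promise plus a target value for one SLI. The `SLO` class, the 99% target and the request counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    description: str   # the user-facing promise, in plain language
    target: float      # minimum acceptable SLI value, between 0 and 1

    def is_met(self, sli_value: float) -> bool:
        """True when the measured SLI is at or above the target."""
        return sli_value >= self.target


checkout_slo = SLO("99% of 'Add to Cart' requests add the right item", target=0.99)

# Hypothetical counts for one measurement window.
good, total = 9_940, 10_000
sli_value = good / total
print(checkout_slo.description, "met:", checkout_slo.is_met(sli_value))
```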

Signals

Metrics

Metrics are numeric data about your infrastructure or application, aggregated over a period of time.

They should help guide both the infrastructure (technical metrics) and the business (business metrics).

Infrastructure metrics

They help optimize and improve the infrastructure.

Examples include: system error rate, CPU utilization, and request rate for a given service.
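To make "aggregation over a period of time" concrete, the sketch below groups raw request events into one-minute buckets and derives a request rate and an error rate for each bucket. The event shape `(unix_seconds, status_code)` and the choice to count only 5xx responses as errors are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_per_minute(events):
    """Group (unix_seconds, status_code) events into per-minute metrics."""
    buckets = defaultdict(lambda: {"requests": 0, "errors": 0})
    for ts, status in events:
        minute = ts - ts % 60          # start of the minute the event falls in
        buckets[minute]["requests"] += 1
        if status >= 500:              # assumption: only 5xx counts as an error
            buckets[minute]["errors"] += 1
    return {
        minute: {
            "request_rate_per_s": b["requests"] / 60,
            "error_rate": b["errors"] / b["requests"],
        }
        for minute, b in sorted(buckets.items())
    }


# Hypothetical raw events: four in the first minute, two in the second.
events = [(0, 200), (12, 200), (30, 500), (59, 200), (60, 200), (61, 503)]
metrics = aggregate_per_minute(events)
for minute, values in metrics.items():
    print(minute, values)
```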

Business metrics

They help the business grow, cut expenses, and build business intelligence.

Logs

A recording of an event.

A log is a timestamped text record, either structured (recommended) or unstructured, with optional metadata.

Consider data protection laws when writing logs, so that you don't log sensitive data.
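One simple way to reduce the risk is to mask known sensitive fields before they reach the log. The sketch below assumes log fields are passed around as a dictionary. The list of sensitive keys is hypothetical and would have to come from your own data protection requirements.

```python
# Hypothetical list of field names that must never be written to logs.
SENSITIVE_KEYS = {"password", "token", "email"}


def redact(fields):
    """Return a copy of fields with the values of sensitive keys masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


safe = redact({"user": "johndoe", "password": "hunter2", "Email": "j@example.com"})
print(safe)
```

A key-based filter like this only catches fields you anticipated. Sensitive values embedded in free-text messages need separate handling.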

Structured logs

A structured log is a log whose text follows a consistent, machine-readable format.

For applications, one of the most common formats is JSON.

{
  "timestamp": "2024-08-04T12:34:56.789Z",
  "level": "INFO",
  "service": "user-authentication",
  "environment": "production",
  "message": "User login successful"
}
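A record with the same fields as the one above could be produced with Python's standard `logging` module and a small custom formatter. This is a sketch, not a recommended production setup: the `JsonFormatter` class is hypothetical, and the timestamp is written with a `+00:00` offset rather than the `Z` suffix shown above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record):
        # record.created is a Unix timestamp; convert it to UTC ISO 8601.
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "environment": self.environment,
            "message": record.getMessage(),
        })


logger = logging.getLogger("user-authentication")
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter("user-authentication", "production"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful")
```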

Common Log Format (CLF) is also commonly used.

127.0.0.1 - johndoe [04/Aug/2024:12:34:56 -0400] "POST /api/v1/login HTTP/1.1" 200 1234
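Because every CLF line has the same fields in the same order, it can be parsed with a single regular expression. The pattern and the field names below (`host`, `ident`, `user`, `time`, `request`, `status`, `size`) are one possible sketch, and they do not handle every variation a real web server might emit.

```python
import re

# host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one Common Log Format line into a dict, or return None."""
    match = CLF_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # CLF writes "-" when no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


line = '127.0.0.1 - johndoe [04/Aug/2024:12:34:56 -0400] "POST /api/v1/login HTTP/1.1" 200 1234'
entry = parse_clf(line)
print(entry)
```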

Unstructured logs

Unstructured logs are logs that don’t follow a consistent structure. They may be more human-readable, and are often used in development.

Unstructured logs are not recommended for production observability.

Tracing

A trace shows the order in which events were executed, which lets us follow a request through the services and events it touched, up to the point where an error was produced.

Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full “path” a request takes in your application.
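The "path" idea can be sketched without any tracing library: each unit of work is recorded as a span that carries the trace it belongs to and the span that caused it. Everything below is a simplified, hypothetical model (sequential integer IDs, made-up span names). Real tracing systems use random IDs and also record timing.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)  # simplistic ID source, for illustration only


@dataclass
class Span:
    name: str
    trace_id: int                    # shared by every span of one request
    parent_id: Optional[int] = None  # None marks the root span
    span_id: int = field(default_factory=lambda: next(_ids))


def child(parent, name):
    """Create a span caused by parent, within the same trace."""
    return Span(name, trace_id=parent.trace_id, parent_id=parent.span_id)


def path_to(span, spans_by_id):
    """Walk parent links back to the root and return span names root-first."""
    names = []
    current = span
    while current is not None:
        names.append(current.name)
        current = spans_by_id.get(current.parent_id)
    return list(reversed(names))


root = Span("POST /api/v1/login", trace_id=1)
auth = child(root, "user-authentication")
db = child(auth, "db.query")          # suppose the error happened here
spans_by_id = {s.span_id: s for s in (root, auth, db)}

path = path_to(db, spans_by_id)
print(" -> ".join(path))
```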

