Arch Forum 2025-04-10
Participants: Backend devs, Andy, Zak, Victor
Agenda
- Observability 2.0
Summary
In this Arch Forum we held a "retro" session on our observability system and how we use it, since we have an upcoming project to improve it.
Raw results from the retro are here; a summary follows below.
Things that work great:
Overall the feeling is that we are already in a good place, and we should be careful not to remove any useful features.
- Kibana is easy to work with most of the time; you find what you need.
- The logs cover most of what is needed, i.e. events, DB calls, requests/responses with ruid, userid, etc.
- The logs are mostly consistent between areas.
Things we could improve:
The top issue is alerting noise, which many feel can be improved. Today it relies heavily on long experience at Majority to learn what is important, what needs to be acted on, and who should act on it.
Logs:
- Log sanitizing. We sometimes mask too much, but also sometimes not enough. Probably related: sometimes only the key and not the value of, for example, DB queries is logged. And the masking code itself is difficult to maintain!
- Don't mix Serilog and the MS extensions ILogger. Could we fix this by routing everything through Serilog globally?
- When calling logger.Log and related methods, we should never use string interpolation; use message templates instead (see the sketch after this list).
- We are missing raw logs of requests/responses to external parties.
- Sometimes an even longer retention period would be nice.
- Span/ParentSpan fields are in the logs, but they are very cumbersome to actually use.
- Some fields are not indexed, making queries on them more difficult.
- How to handle the ruid when running a bigger job that publishes events: one ruid per event, or one ruid for the full job? Both have downsides.
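
To make the Serilog and string-interpolation points above concrete, here is a minimal sketch of a setup where all ILogger<T> calls flow through Serilog and values are logged via message templates. It assumes the Serilog.AspNetCore package and a hypothetical /orders endpoint; it is an illustration, not our actual configuration.

```csharp
// Top-level Program.cs in an ASP.NET Core project (Serilog.AspNetCore package).
using Microsoft.Extensions.Logging;
using Serilog;

var builder = WebApplication.CreateBuilder(args);

// Single pipeline: Microsoft.Extensions.Logging.ILogger<T> calls are forwarded
// to Serilog, so sinks and enrichment are configured in one place.
builder.Host.UseSerilog((ctx, cfg) => cfg
    .ReadFrom.Configuration(ctx.Configuration)
    .Enrich.FromLogContext());

var app = builder.Build();

app.MapGet("/orders/{id}", (int id, ILogger<Program> logger) =>
{
    // Good: message template; OrderId is captured as a separate, queryable field.
    logger.LogInformation("Fetching order {OrderId}", id);

    // Bad: string interpolation bakes the value into the message text,
    // so it cannot be filtered or aggregated on in Kibana.
    // logger.LogInformation($"Fetching order {id}");

    return Results.Ok(id);
});

app.Run();
```

With something like this in place, the "which abstraction?" question mostly goes away: application code keeps using ILogger<T>, while Serilog owns sinks, enrichment, and masking in one place.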
Query / Analysis:
- DW/BigQuery is a good tool for deeper analysis, but it is easy to forget about.
- The Kibana query language is limited; some queries are difficult to express (or we lack the skills).
Alerts/monitoring:
- The ops Slack channel is difficult to follow: what is important, what is not, and what do I need to worry about?
- Could we be helped by anomaly detection / more adaptive thresholds?
- We need a guideline for naming etc. when publishing custom metrics "manually" from the code (see the sketch after this list).
- Can we use Elastic/Kibana for more alerts? That's where we go to troubleshoot anyway, so it would be convenient to have everything in one place.
- The Datadog dashboards and monitors are a bit messy; can they be cleaned up?
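
As a starting point for the metric-naming guideline mentioned above, a sketch using System.Diagnostics.Metrics could look like the following. The meter name, instrument name, unit, and tag key are hypothetical examples, not an agreed convention; the same naming rules would apply regardless of whether the metrics end up in Datadog or Elastic.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical convention: <company>.<domain> for the meter,
// <domain>.<entity>.<action> in lower dot-case for the instrument,
// an explicit unit, and a small fixed set of tag keys.
public static class PayoutMetrics
{
    private static readonly Meter Meter = new("majority.payments", "1.0.0");

    private static readonly Counter<long> PayoutsCreated =
        Meter.CreateCounter<long>("payments.payouts.created", unit: "{payout}");

    public static void RecordPayoutCreated(string provider) =>
        // Keep tags low-cardinality so dashboards and alerts stay usable.
        PayoutsCreated.Add(1, new KeyValuePair<string, object?>("provider", provider));
}
```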
Overall:
- It would be nice to have a service map / application map or similar overview of the whole system, showing how the different parts are connected and their current status.
- There's no "real" tracing.
- Shouldn't we use OpenTelemetry for everything, since it's the standard? (See the sketch below.)
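
As input to the tracing/OpenTelemetry discussion above, a minimal sketch of enabling tracing in one of our ASP.NET Core services might look like this. The package list and the service name are assumptions, and the OTLP exporter would point at whichever backend we pick (Elastic, Datadog, ...).

```csharp
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Assumed packages: OpenTelemetry.Extensions.Hosting,
// OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.Http,
// OpenTelemetry.Exporter.OpenTelemetryProtocol.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("orders-api"))   // service name is illustrative
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()    // incoming HTTP requests become spans
        .AddHttpClientInstrumentation()    // outgoing calls propagate trace context
        .AddOtlpExporter());               // export via OTLP to the chosen backend

var app = builder.Build();
app.Run();
```

This would also give us real Span/ParentSpan propagation across services instead of only having the IDs sitting in the logs.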
New devs:
- The Loglevel field mixes two concepts, log type and log level. This is confusing.