Just to add, VictoriaMetrics covers all 3 signals:
- VictoriaMetrics for metrics. It supports the Prometheus API, so it integrates with Grafana via the Prometheus datasource. It also has its own Grafana datasource with extra functionality.
- VictoriaLogs for logs. Integrates natively with Grafana via the VictoriaLogs datasource.
- VictoriaTraces for traces. It supports the Jaeger API, so it integrates with Grafana via the Jaeger datasource.
All 3 solutions support alerting, are managed by the same team, are Apache2 licensed, and are focused on resource efficiency and simplicity.
All of them provide a way to scale monitoring to insane numbers. The difference is in architecture, maintainability and performance. But make your own choices here.
There was also m3db from Uber, but the project seems pretty dead now.
And there was the Cortex project, mostly maintained by GrafanaLabs. But at some point they forked Cortex and named it Mimir. Cortex is now maintained by Amazon and, as I understand, is powering Amazon Managed Prometheus. However, I would avoid using Cortex exactly because it is now maintained by Amazon.
I think OTEL has made things worse for metrics. Prometheus was so simple and clean before the long journey toward OTEL support began. Now Prometheus is much more complicated:
- all the delta-vs-cumulative counter confusion
- push support for Prometheus, and the resulting out-of-order errors
- the {"metric_name"} syntax changes in PromQL
- resource attributes and the new info() function needed to join them
I just don’t see how any of these OTEL requirements make my day-to-day monitoring tasks easier. Everything has only become more complicated.
My understanding is that with Prometheus+Grafana, and the rest of their stack, you can achieve the same functionality as Datadog (or even more) at much lower cost. But it requires engineering time to set up these tools, monitor them, and build dashboards and alerts. Build an observability platform at home, in other words.
But what about other open source solutions that are already trying hard to become an out-of-the-box observability platform? Things like Netdata, Hyperdx, Coroot, etc. already cover all telemetry signals, with fancy UIs and lots of presets. Why don't people use them instead of Datadog?
Grafana isn't quite as featureful as Datadog, though there's nothing stopping you from getting the job done.
> But, it requires engineering time to set up these tools
At some price point, you have to wonder if it doesn't make more sense to hire engineers to get it just right for your use case. I'd bet that price point is less than $65MM. Hell, you could have people full-time on Grafana to add features you want.
Ofc you need to monitor your monitoring, because you run it.
Datadog runs their own systems and monitors them, that's why they charge you so much.
I can barely imagine a critical piece of software that I need to run but not monitor at the same time.
> Our tests revealed that Prometheus v3.1 requires 500 GiB of RAM to handle this workload, despite claims from its developers that memory efficiency has improved in v3.
AFAIK, starting from v3 Prometheus has `auto-gomemlimit` enabled by default. It means "Prometheus v3 will automatically set GOMEMLIMIT to match the Linux container memory limit.", which effectively lets the heap grow with little garbage collection until the process approaches that limit. This is, I think, why Prometheus shows elevated, flatlined memory usage in the article.
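For illustration, here is a minimal Go sketch of what that behaviour amounts to. This is not Prometheus's actual code; the cgroup v2 path and the ~90% headroom ratio are my assumptions.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

func main() {
	// cgroup v2 exposes the container memory limit here; "max" means unlimited.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // no cgroup v2 limit available; leave GOMEMLIMIT unset
	}
	s := strings.TrimSpace(string(raw))
	if s == "max" {
		return
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}

	// Keep ~10% headroom below the hard limit (the exact ratio is an assumption).
	soft := limit / 10 * 9
	debug.SetMemoryLimit(soft)
	fmt.Printf("GOMEMLIMIT set to %d bytes\n", soft)

	// With such a high soft limit, the Go GC runs rarely and the heap keeps
	// growing toward the limit, which is why RSS can look flatlined near it.
}
```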
> Query the average over the last 2 hours, of system.ram of all nodes, grouped by dimension (4 dimensions in total), providing 120 (per-minute) points over time.
The query used for Prometheus here is an instant query: `query=avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`. This is a rather weird subquery that probably no Prometheus user has ever written in practice. Effectively, it instructs Prometheus to evaluate `netdata_system_ram{dimension=~".+"}` 120 times over the 2h interval (`2h/60s=120`), reading `120 * 7200 * 4k series = 3.5Bil` data samples.
Normally, Prometheus users don't do this. They'd rather run a /query_range query `avg_over_time(netdata_system_ram{dimension=~".+"}[5m])` with step=5m over a 2h time interval, reading `7200*4k=29Mil` samples.
Another weird thing about this query is that Prometheus will return 4k time series with all their labels in JSON format. I wonder how much of that 1.8s was spent just transferring the data over the network.
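For comparison, here is a small Go sketch that builds both request shapes against the Prometheus HTTP API (`/api/v1/query` for the instant subquery used in the article, `/api/v1/query_range` for the more common form); the server address `http://localhost:9090` is just a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

func main() {
	base := "http://localhost:9090/api/v1"
	now := time.Now()

	// Instant query: the [2h:60s] subquery re-evaluates the inner selector 120 times.
	instant := url.Values{}
	instant.Set("query", `avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`)
	instant.Set("time", fmt.Sprint(now.Unix()))
	fmt.Println(base + "/query?" + instant.Encode())

	// Range query: one point every 5 minutes over the 2h window.
	rng := url.Values{}
	rng.Set("query", `avg_over_time(netdata_system_ram{dimension=~".+"}[5m])`)
	rng.Set("start", fmt.Sprint(now.Add(-2*time.Hour).Unix()))
	rng.Set("end", fmt.Sprint(now.Unix()))
	rng.Set("step", "5m")
	fmt.Println(base + "/query_range?" + rng.Encode())
}
```

Either URL can be sent with curl or any HTTP client; the difference in cost comes from the query shape, not the transport.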
It usually comes with an increase in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd like to achieve the following:
1. Global query view. The ability to get metrics from all Prometheus instances with one request. Or simply not having to think about which Prometheus instance has the data you're looking for.
2. Resource usage management. No matter how hard you try, scrape targets can't be sharded perfectly, so you'll end up with some Prometheus instances using more resources than others (see the sketch after this list). This could backfire in weird ways down the road, reducing the stability of the whole system.
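Here's the sketch mentioned above: a rough Go approximation of hashmod-style target sharding (the real Prometheus relabel mechanics and hash details differ, and the target names are made up). The point is that targets land on shards by hash, not by how expensive they are to scrape, so some shards inevitably end up heavier than others.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

const shards = 3

// shardOf assigns a target to one of N shards purely by hashing its address.
func shardOf(target string) uint64 {
	sum := md5.Sum([]byte(target))
	return binary.BigEndian.Uint64(sum[8:]) % shards
}

func main() {
	// A cheap node exporter and an expensive exporter can land on the same
	// shard, while another shard gets only cheap ones.
	targets := []string{"node-1:9100", "node-2:9100", "kube-state-metrics:8080", "big-app:9090"}
	for _, t := range targets {
		fmt.Printf("%-24s -> shard %d\n", t, shardOf(t))
	}
}
```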
We use the same approach in the time series database I'm working on. While file creation and fsync aren't atomic, the rename [1] syscall is. So we create a temporary file, write the data, call fsync, and if all is good, rename it atomically so it becomes visible to other users. I gave a talk about this [2] a few months ago.
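A minimal Go sketch of that write-temp-fsync-rename sequence, assuming the temporary file lives in the same directory (and thus on the same filesystem) as the final file; this is not the database's actual code.

```go
package main

import (
	"os"
	"path/filepath"
)

// atomicWrite writes data to path so that readers see either the complete
// old file or the complete new file, never a partially written one.
func atomicWrite(path string, data []byte) error {
	dir := filepath.Dir(path)

	// The temp file must be on the same filesystem as the target,
	// otherwise rename(2) is no longer atomic.
	tmp, err := os.CreateTemp(dir, filepath.Base(path)+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents to disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// Atomically replace the target file.
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}

	// Fsync the parent directory so the rename itself survives a crash.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	_ = atomicWrite("/tmp/example.dat", []byte("hello"))
}
```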
If Mimir is the only one, why isn't Roblox, a GrafanaLabs customer, using Mimir for monitoring? They're using VictoriaMetrics at a scale of approximately 5 billion active time series. See https://docs.victoriametrics.com/victoriametrics/casestudies....
No solution is perfect. Each one has its own trade-offs. That is why statements like this one trigger me.