I don't see any mention of p99 latency in the benchmark results. Pushing gigabytes per second is not that difficult on modern hardware; doing so with reasonable latency is what's challenging. Also, instead of using custom benchmarks it's better to just use OMB (the OpenMessaging Benchmark).
The code tends to be loaded with primitives that express ownership semantics or error handling. Every time something changes (for instance, you want to not just read but also modify values referenced by an iterator) you have to change code in many places (you have to invoke 'as_mut' explicitly even if you're accessing your iterator through a mutable ref). This could be attributed (partially) to the lack of function overloading. People believe that overloading is often abused, so it shouldn't be present in a "modern" language. But in languages like C++, overloading also helps with const correctness and move semantics. In C++ I don't have to invoke 'as_mut' to modify a value referenced by a non-const iterator, because the dereferencing operator has const and non-const overloads.
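To make the point concrete, here's a minimal, hypothetical Rust sketch of the ceremony I mean (the names are illustrative, not from any real codebase), next to what the C++ overload set gives you for free:

```rust
use std::collections::HashMap;

fn main() {
    let mut scores: HashMap<&str, i32> = HashMap::new();
    scores.insert("a", 1);

    // Read-only pass: values() yields &i32.
    for v in scores.values() {
        println!("{v}");
    }

    // Deciding to mutate means switching the whole call path:
    // values() -> values_mut(), &i32 -> &mut i32, and so on up the chain.
    for v in scores.values_mut() {
        *v += 1;
    }

    // With Option the explicit step is as_mut(), even though `maybe` is
    // already reachable through a mutable binding.
    let mut maybe = Some(1);
    if let Some(v) = maybe.as_mut() {
        *v += 1;
    }

    // In C++ the same call site compiles unchanged for reads and writes,
    // because operator*() has const and non-const overloads.
}
```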
Async Rust is on another level of complexity compared to anything I've used. Explicit lifetimes are often necessary, everything is wrapped into multiple layers, and everything ends up as Arc<Mutex<Box<*>>>.
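As a rough illustration (a minimal sketch assuming tokio; the `State` type is made up), sharing mutable state across tasks quickly ends up looking like this:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct State {
    counter: u64,
}

#[tokio::main]
async fn main() {
    // Shared + mutable + crossing task boundaries -> Arc<Mutex<...>>.
    let state = Arc::new(Mutex::new(State::default()));

    let mut handles = Vec::new();
    for _ in 0..4 {
        let state = Arc::clone(&state);
        // Spawned futures must be 'static, which is why plain borrows don't
        // work here and everything gets cloned or reference-counted.
        handles.push(tokio::spawn(async move {
            let mut guard = state.lock().await;
            guard.counter += 1;
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
    println!("{}", state.lock().await.counter);
}
```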
> Why replicate data at the VM/disk level when those disks are already provided as a fully redundant system?
That's easy. EBS and similar solutions come at a price: they're very expensive, especially when you need a lot of IOPS. You may be saving on cross-AZ traffic, but you will pay a ridiculous amount of money for storage. If you have replication, you can use locally attached storage, which is way cheaper.
Azure Managed Disks are inherently replicated. But there are other storage solutions (like redundant WebHDFS) you can take advantage of. At the time, the customers I was talking to wanted their cloud deployments to be identical to what they had on-premises, down to the number and size of disks.
It could be easy to operate when everything is fine, but what about incidents?
If I understand correctly, there is a metadata database (BTW, is it multi-AZ as well?). But what if there is a data loss incident and some metadata is lost? Is it possible to recover it from S3? If it is, I'd guess that can't be very simple and would require a lot of time, because S3 is not that easy to scan to find all the artefacts needed for recovery.
Also, this metadata database looks like a bottleneck. All writes and reads have to go through it, so it could be a single point of failure. It's probably distributed, in which case it has its own complex failure modes and has to be operated somehow.
Putting data from different partitions into one object is also something I'm not very keen on. You're introducing a lot of read amplification, and S3 bills for egress, so if an object has data from 10 partitions and I only need 1, I'm paying for 10x more egress than I need. The doc mentions fanout reads from multiple agents to satisfy a fetch request; I guess this is the price to pay for that. It also affects the metadata database: if every object stored data from one partition, the metadata could be partitioned easily, but if an object can hold data from many partitions, it's probably difficult to partition. One reason Kafka/Redpanda/Pulsar scale so well is that their data and metadata can be partitioned easily, and those systems don't have to handle as much metadata as I think WarpStream has to.
I'm not going to respond to your comment directly (we've already solved all the problems you've mentioned), but I thought I should mention, for the sake of the other readers of this thread, that you work for Redpanda, which is a competitor of ours, and didn't disclose that fact. Not a great look.
I'm not asking on behalf of any company; I'm just genuinely curious (and I don't think we're competitors, since the two systems are designed for totally different niches). I'm working on the tiered-storage implementation, BTW. The approach here looks like the total opposite of what everyone else is doing. I see some advantages, but also disadvantages. Hence the questions.
The disk thrashes because of fsyncs (Kafka doesn't perform any fsyncs). But you can provision more disk space to mitigate this problem, and it looks like the test was set up this way to make Redpanda look worse.
You have to provision your disk space accordingly. NVMe needs some free space to have good performance. In the Redpanda benchmarks I think free disk space was available, while in the benchmarks done by the Confluent guy the system was provisioned to use all of the disk space.
With the page cache it's OK, because the FTL layer of the drive gets to work with 32 MiB blocks, but in the case of Redpanda the drive will struggle because the FTL mappings become complex and its GC has more work to do. If Kafka were doing fsyncs, the behaviour would be the same.
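To illustrate the difference (a hypothetical sketch of the two write paths, not either system's actual I/O code):

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut log = OpenOptions::new()
        .create(true)
        .append(true)
        .open("segment.log")?;

    for i in 0..1_000u32 {
        log.write_all(&i.to_le_bytes())?;
        // fsync-per-batch durability: every sync_all() forces a small write
        // down to the SSD, so the FTL tracks many tiny mappings and its
        // garbage collector gets more work.
        log.sync_all()?;
    }

    // Page-cache-only mode: drop the sync_all() above and the kernel flushes
    // large, mostly sequential chunks in the background instead.
    Ok(())
}
```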
Overall, this looks like a smear campaign against Redpanda. The guy who wrote the article works for Confluent and published it on his own domain to look more neutral. The benchmarks are not fair because one of the systems is doing fsyncs and the other is not; most of the differences could be explained by this fact alone.
Both of you commenters have your own TSDBs, which seems to be coloring all of your posts.
I'm going to leave this conversation as unproductive unless you care to benchmark your products against modern column stores, though I think it's telling that such benchmarks are never available.
They can't handle high cardinality. Imagine having millions of columns in a column-oriented database, with 70% of those columns updated every second. Imagine that you have to add new columns all the time.
The main misconception about TSDBs is that they just store data with a timestamp. TSDBs have a multi-dimensional data model; time is only one of the dimensions.
'metric_name text' is actually a tag-value list. Many TSDBs allow you to match data by tag. Each tag would have to be represented by a column in your example.
A single-table design will be prone to high read/write amplification due to data layout. Usually you need to read many series at once, so your query will turn into a full table scan, or it will read a lot of unneeded data that happens to be located near the data you need. Writes will be slow since your key starts with the metric name. Imagine that you have 1M series and each series gets a new data point every second. In your schema that results in 1M random writes per second.
The cardinality of the table will go through the roof, BTW. Every data point adds a new key (1M series reporting every second is 86.4 billion distinct keys per day). Good luck dealing with that.
> 'metric_name text' is actually a tag-value list. Many TSDBs allow you to match data by tag. Each tag would have to be represented by a column in your example.
For the life of me, I can't figure out why this would be a good idea. I feel like I must not understand what you're saying:
If I've got a million disks that I want to draw usage graphs for, why would I put each one in a separate column?
What's the business use-case you're imagining?
> Usually you need to read many series at once, so your query will turn into a full table scan.
Why do I need a full table scan if I'm going to draw some graphs?
I've got something like 4000 pixels across my screen; I could supersample by 100x and still be pulling down less data than the average nodejs/webpack app.
> Imagine that you have 1M series and each series gets a new data point every second. In your schema that results in 1M random writes per second.
No, that's definitely not what manigandham is suggesting. One million disks each reporting their usage means a million rows across two columns (disk name/sym and volume) written (relatively) linearly.
A modern TSDB is expected to support tags. This means every series has a unique set of tag-value pairs associated with it, e.g. 'host=Foo OS=CentOS arch=amd64 ssdmodel=Intel545 ...'. In a query you can pinpoint the relevant series by these tags, so the tags have to be searchable. For instance, I may want to see how a specific SSD model performs on a specific OS or with a specific app. If the set of tags is stored as JSON in one field, such queries won't work efficiently.
About the 1M writes: you have two options, 1) organize data by metric name first, or 2) by timestamp first. With 2) the writes are sequential but reads have huge amplification; with 1) writes are random but reads are fast.
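Here's a hypothetical sketch of the two layouts (struct names are made up):

```rust
// Option 1: series-first key. Points of one series are contiguous on disk,
// so reading a single series is sequential, but 1M live series reporting
// every second means the write path hits 1M different key ranges per second.
struct SeriesFirstKey {
    series_id: u64, // derived from metric name + sorted tag set
    timestamp_ms: i64,
}

// Option 2: time-first key. Each second's writes land contiguously (sequential
// ingest), but reading one series back has to skip over every other series
// written in the same time window, i.e. huge read amplification.
struct TimeFirstKey {
    timestamp_ms: i64,
    series_id: u64,
}
```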
You are talking about regular relational databases. I'm talking about distributed column-oriented databases. Big difference.
You can store tags and other data in JSON/ARRAY columns. The primary key is used for automatic sharding and sorting.
Groups of rows are sorted, split into columns, compressed, and stored as partitions with metadata. This means you can 'scan' the entire table in milliseconds using metadata and then only open the partitions, and the columns inside them, that you actually need for your query. There are no random writes either; it's all constant sequential I/O with optional background optimization. And because of compression, storing the same key millions of times has no real overhead.
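For readers who haven't seen this layout, a rough sketch of the metadata-pruning idea (hypothetical structures, not any particular engine's format):

```rust
// Per-column summary stored alongside each partition.
struct ColumnChunk {
    name: &'static str,
    min: i64,
    max: i64,
    // compressed values live on disk; omitted here
}

struct Partition {
    chunks: Vec<ColumnChunk>,
}

/// Keep only partitions whose `timestamp` chunk overlaps [lo, hi]; everything
/// else is skipped from metadata alone, without touching the data files.
fn prune<'a>(parts: &'a [Partition], lo: i64, hi: i64) -> Vec<&'a Partition> {
    parts
        .iter()
        .filter(|p| {
            p.chunks
                .iter()
                .find(|c| c.name == "timestamp")
                .map(|c| c.max >= lo && c.min <= hi)
                .unwrap_or(false)
        })
        .collect()
}
```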
As stated several times before, we deal with this every day on trillion-row tables, inserting 100s of billions of rows daily. Queries run in seconds. We do just fine.
We have a lot of data stored in Postgres JSON fields at my work. Around three months ago, we were trying to optimize some queries by adding sub-key indexing to the JSON field. We tried multiple times, but Postgres kept using a sequential scan on the records rather than the JSON index. So we decided to normalize the data and use proper foreign-key fields for query performance.
> Successful, popular concurrent platforms like Erlang/OTP, Nginx, and node.js, eschew threads in favor of a single-threaded, async/non-blocking code model.
All these platforms have problems. E.g. Nginx is a PITA if your processes need to communicate with each other, and Erlang/OTP has that nasty mailbox problem (O(N^2) behavior of selective receive). Cooperative multitasking is a piece of Victorian-era technology that is painfully bad for various reasons, but people still tend to believe it solves something.
On the other hand, modern operating systems are awesome, and so are their thread schedulers. Thread schedulers can dynamically adjust thread priorities depending on load and solve some nasty synchronization problems (like priority inversion and starvation) for you. It's not 1995: OS schedulers can switch threads in O(1), and most server apps can use a thread-per-connection approach without any problems. The only thing you need to get right is synchronization, but there are a lot of tools that can help (like ThreadSanitizer). It's much harder to get synchronization right with cooperative multitasking (good luck finding that priority inversion on an implicit lock caused by a dumb channel-use pattern) than with normal threads.
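For example, a plain thread-per-connection echo server is about this much code (a minimal sketch; the port and buffer size are arbitrary):

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // One OS thread per connection; the kernel scheduler decides which
        // connection runs next, handles priorities, and preempts as needed.
        thread::spawn(move || {
            let mut buf = [0u8; 4096];
            loop {
                match stream.read(&mut buf) {
                    Ok(0) | Err(_) => break, // connection closed or errored
                    Ok(n) => {
                        if stream.write_all(&buf[..n]).is_err() {
                            break;
                        }
                    }
                }
            }
        });
    }
    Ok(())
}
```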