> Wait, so the OS can re-order the fsync() to happen before the write request it...

n_u · 2025-12-14T23:28:36 1765754916

I guess I'm a bit confused why the author recommends using this flag and fsync.

Related: I would have thought that batching your writes and fsyncing once would be more efficient than fsyncing after every write, but a previous commenter did some testing and found that isn't always the case: https://news.ycombinator.com/item?id=15535814

scottlamb · 2025-12-15T03:46:22 1765770382

I'm not sure there's any good reason. Other commenters mentioned AI tells. I wouldn't consider this article a trustworthy or primary source.

n_u · 2025-12-15T04:36:44 1765773404

Yeah, that seems reasonable. The article mixes fsync and O_DSYNC without discussing how they relate, which reads more like AI than like a human who understands the topic.

It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right?

Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.

scottlamb · 2025-12-15T15:47:14 1765813634

> It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right? Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.

I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

* If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

* If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
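For reference, here is a rough sketch of the blocking version of the rename trick, in Python on a POSIX system. `atomic_replace` is a hypothetical helper name, not something from the article. With io_uring, the write, sync, and rename would be separate submissions chained with `IOSQE_IO_LINK`; here the ordering simply falls out of program order.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data` via write, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # temp file contents are durable first
        os.replace(tmp_path, path)  # only then rename it into place
    except BaseException:
        # The rename did not happen, so remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a crash (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A reader of `path` sees either the old contents or the new ones, never a half-written file, as long as the sync completes before the rename.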

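btw, a clarification about my earlier comment: `O_SYNC` (no `D`) should be equivalent to calling `fsync` after every write. `O_DSYNC` should be equivalent to calling the weaker `fdatasync` after every write. The difference is in the inode metadata: `fdatasync` can skip flushing metadata that isn't needed to read the data back (e.g. the modification time), though it still has to flush things like a changed file size.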
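To make the `O_DSYNC` / `fdatasync` pairing concrete, here is a minimal Python sketch, assuming Linux (where both `os.O_DSYNC` and `os.fdatasync` are available). The function names are made up for illustration. Both functions should give the same durability guarantee for the appended data.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Loop because os.write may write fewer bytes than requested."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_with_odsync(path: str, data: bytes) -> None:
    """Each write() on this descriptor returns only after the data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
    try:
        write_all(fd, data)
    finally:
        os.close(fd)


def append_then_fdatasync(path: str, data: bytes) -> None:
    """Plain buffered write followed by an explicit fdatasync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, data)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

Swapping `O_DSYNC` for `O_SYNC` and `fdatasync` for `fsync` gives the stronger pair.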

n_u · 2025-12-15T17:13:32 1765818812

> I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

> * If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

I guess I meant exclusively in terms of writing to the WAL. As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

> * If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.

Makes sense

scottlamb · 2025-12-15T18:34:07 1765823647

> As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

I think they do need to ensure that page doesn't get flushed before the log entry in some manner. This might happen naturally if they're doing something in single-threaded code without io_uring (or any other form of async IO). With io_uring, it could be a matter of waiting for the completion entry for the log write before submitting the page write, or it could be done with the link flag instead.
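A rough sketch of the "happens naturally" case, in blocking Python on Linux. `log_then_apply` and `write_fully` are hypothetical names, and a real DBMS does far more than this; the only point is that the page write cannot start until the log record is durable.

```python
import os


def write_fully(fd: int, data: bytes, offset=None) -> None:
    """Write all of `data`, at `offset` if given, retrying on short writes."""
    view = memoryview(data)
    while view:
        if offset is None:
            n = os.write(fd, view)
        else:
            n = os.pwrite(fd, view, offset)
            offset += n
        view = view[n:]


def log_then_apply(log_fd: int, data_fd: int, record: bytes,
                   page: bytes, page_offset: int) -> None:
    """Make the log record durable before touching primary storage."""
    write_fully(log_fd, record)
    os.fdatasync(log_fd)  # blocks until the log record is on stable storage
    # Program order guarantees this runs only after the sync above returned.
    write_fully(data_fd, page, page_offset)
```

With io_uring there is no such implicit ordering between submissions, which is why you would either wait for the log write's completion or link the operations.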

n_u · 2025-12-15T19:35:21 1765827321

> I think they do need to ensure that page doesn't get flushed before the log entry in some manner.

Yes, I agree. I meant that they synchronously write the log entries, then return success to the caller, and only then deal with the dirty data pages. As I recall, the buffer pool manager has to do something special with dirty pages that belong to transactions that haven't committed yet.