
> Wait, so the OS can re-order the fsync() to happen before the write request it is supposed to be syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.

This is an io_uring-specific thing. io_uring doesn't guarantee any ordering between operations submitted at the same time unless you explicitly ask for it with the `IOSQE_IO_LINK` flag they mentioned.

Otherwise it's as if you called write() from one thread and fsync() from another without waiting for the write() call to return. That defeats the point of using fsync(), so you wouldn't do that.

> If you call fsync(), [O_DSYNC] isn't needed correct? And if you use [O_DSYNC], then fsync() isn't needed right?

I believe you're right.





I guess I'm a bit confused why the author recommends using this flag and fsync.

Related: I would think that grouping your writes and then fsyncing once would be more efficient than fsyncing after every write, but a previous commenter did some testing and it looks like that isn't always the case: https://news.ycombinator.com/item?id=15535814


I'm not sure there's any good reason. Other commenters mentioned AI tells. I wouldn't consider this article a trustworthy or primary source.

Yeah, that seems reasonable. The article mixes fsync and O_DSYNC without discussing how they relate, which reads more like AI output than like a human who understands the topic.

It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right?

Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.


> It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right? Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.

I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

* If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

* If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
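To make the ordering in the rename trick concrete, here is a minimal sketch using plain blocking calls rather than io_uring (Python's standard library has no io_uring binding, so each call returning stands in for the previous link in the chain completing). The helper name `atomic_replace` is hypothetical, and the code assumes a Unix-like system where a directory can be opened and fsynced.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # step 1: temp file is durable
        os.replace(tmp_path, path)  # step 2: rename only after the sync
    except BaseException:
        # Best-effort cleanup if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Step 3: sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With io_uring the same three steps could be queued in one submission, and the link flag would be what keeps step 2 from starting before step 1 finishes.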

btw, a clarification about my earlier comment: `O_SYNC` (no `D`) should be equivalent to calling `fsync` after every write. `O_DSYNC` should be equivalent to calling the weaker `fdatasync` after every write. The difference is in which inode metadata gets flushed: `fdatasync` can skip metadata that isn't needed to read the data back (such as timestamps), while `fsync` flushes all of it.
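A small sketch of the two styles side by side, assuming Linux (where Python exposes both `os.O_DSYNC` and `os.fdatasync`). The function names are made up for illustration; both are meant to leave the payload on stable storage by the time they return.

```python
import os


def write_all(fd: int, payload: bytes) -> None:
    """Loop until every byte is written, since os.write may write less."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_then_fdatasync(path: str, payload: bytes) -> None:
    """Ordinary write followed by an explicit fdatasync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, payload)
        os.fdatasync(fd)  # explicit flush of the data
    finally:
        os.close(fd)


def append_with_o_dsync(path: str, payload: bytes) -> None:
    """Open with O_DSYNC so each write behaves like write + fdatasync."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        write_all(fd, payload)  # no separate sync call needed
    finally:
        os.close(fd)
```

The explicit version lets you batch several writes before one sync, whereas the flag applies to every write on that descriptor.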


> I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

> * If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

I guess I meant exclusively in terms of writing to the WAL. As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

> * If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.

Makes sense


> As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

I think they do need to ensure that page doesn't get flushed before the log entry in some manner. This might happen naturally if they're doing something in single-threaded code without io_uring (or any other form of async IO). With io_uring, it could be a matter of waiting for the completion entry for the log write before submitting the page write, or it could be done with the link instead.
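Here is a rough sketch of the single-threaded case, where the ordering falls out naturally because each call blocks. It is not io_uring code: the return of `os.fsync` plays the role of seeing the completion entry for the log write. The name `apply_update` and the record/page layout are hypothetical, and `os.pwrite` assumes a Unix-like system.

```python
import os


def apply_update(log_fd: int, data_fd: int, record: bytes,
                 page_offset: int, page: bytes) -> None:
    """Make the log record durable before touching the data page."""
    # 1. Append the log record and wait until it is on stable storage.
    #    (A real implementation would also loop on short writes.)
    os.write(log_fd, record)
    os.fsync(log_fd)
    # 2. Only now write the page. If we crash mid-write, the log record
    #    is already durable and can be used for recovery.
    os.pwrite(data_fd, page, page_offset)
    # The page is deliberately not synced here; a real system would
    # flush dirty pages later, at its own discretion.
```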


> I think they do need to ensure that page doesn't get flushed before the log entry in some manner.

Yes, I agree. I meant that they synchronously write the log entries, then return success to the caller, and only then deal with dirty data pages. As I recall, the buffer pool manager has to do something special with dirty pages belonging to transactions that have not committed yet.



