I am writing an application that really cares about durability of created files (a Certificate Transparency log), and... oof. I fsync the file. I fsync the directory. Ok. But... how do I test it? Even targeting a specific filesystem, I have to make VMs and try to race killing them? - ThreadSky

filippo.abyssdomain.expert • 8 days ago

I am writing an application that really cares about durability of created files (a Certificate Transparency log), and... oof.

I fsync the file. I fsync the directory. Ok.

But... how do I test it? Even targeting a specific filesystem, I have to make VMs and try to race killing them?

Comments

chown.de•8 days ago

You could take inspiration from the jepsen test? Not sure how he does it, but he finds these durability/consistency issues in databases.

https://jepsen.io/

chown.de•8 days ago

I vaguely remember an interesting post where the author claimed that fsync cannot always be trusted.

It involved drives acquired from a market somewhere in Asia that would always say "OK", but the data was never written. Or they managed to fake the storage?

But I think (hope) that's not the norm. 😂

boomies-n-zoomies.bsky.social•7 days ago

Jepsen is mostly testing database performance in the face of network partitions, not disk failures. This is an easier target, since one can trivially write a proxy that severs connectivity between nodes on demand.

filippo.abyssdomain.expert•7 days ago

Alright, I think I have a durable, atomic implementation of WriteFile.

https://github.com/FiloSottile/sunlight/pull/30/files#diff-3ebed9953cad6795070bcec5f4141e48a1e77f2d4b979411664d8a4e43c41331

Got lots of good testing recs. I think the strategy is going to be LazyFS or ALICE or Gosim or dm-log-writes in CI to test the application, and manual power cuts in production to test hw and fs.

valkyrie.hacker.gf•7 days ago

This probably has some failure cases where the file has garbage at the end after recovery; most filesystems log metadata but don't take the doubling-all-writes penalty for logging data, so it's possible to end up allocating new blocks, crashing, then replaying the allocation but not the data write

filippo.abyssdomain.expert•7 days ago

What's the sequence leading to one such case? Note that fsync on the file runs (or should run, if I wrote it right) before the rename, so I think such a corrupted file should not make it to the target filename?

valkyrie.hacker.gf•7 days ago

interrupt after creation/rename/fsync, during the write; metadata logging covers allocating new blocks and btree/inode/whatever updates to put them in the file, but not writing the data blocks

filippo.abyssdomain.expert•7 days ago

The write is before the rename (note the defer), and fenced by a fsync, not the other way around, right?

valkyrie.hacker.gf•7 days ago

hmm, okay - you're *probably* safe if the final operation is an fsync, except for internal buffering in the disk
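
For readers following the ordering argument above, here is a minimal sketch of the write, fsync, rename, fsync-directory sequence. It is not the code from the linked PR: writeFileDurable, syncDir and demo are hypothetical names, and the sketch assumes a Unix-like system where a directory can be opened and synced.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileDurable is a hypothetical helper (not the implementation from the
// linked PR). It writes data to a temporary file in the same directory,
// fsyncs it, renames it over name, and then fsyncs the directory.
func writeFileDurable(name string, data []byte, perm os.FileMode) (err error) {
	dir := filepath.Dir(name)
	tmp, err := os.CreateTemp(dir, filepath.Base(name)+".tmp*")
	if err != nil {
		return err
	}
	// On any failure, best-effort cleanup of the temporary file.
	defer func() {
		if err != nil {
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if err = tmp.Chmod(perm); err != nil {
		return err
	}
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	// Fence: the data must be on stable storage before the rename can
	// expose it under the target name.
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	if err = os.Rename(tmp.Name(), name); err != nil {
		return err
	}
	// Persist the directory entry created by the rename.
	return syncDir(dir)
}

// syncDir fsyncs a directory (assumption: Unix-like OS).
func syncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// demo overwrites a file twice in a temp directory and returns its content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "durable")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	name := filepath.Join(dir, "checkpoint")
	for _, v := range []string{"v1", "v2"} {
		if err := writeFileDurable(name, []byte(v), 0o644); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(name)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(got)
}
```

This only shows the call ordering. Whether the filesystem and the disk honor it under power loss is exactly what the testing tools discussed in this thread are meant to check.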

maik.zumstrull.net•8 days ago

I think https://gist.github.com/bradfitz/3172656 is still the standard (for approach, not necessarily implementation).

Ship a list of things you think you've committed off-machine, cut the power, see if anything the OS claims is persisted fails to be there, repeat until satisfied.
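
A rough sketch of that approach, under stated assumptions: commit, missing and demoLost are hypothetical names, the record naming scheme is made up, and the ack writer stands in for a network connection to another machine. In a real run the power cut would happen between commits, and missing would run after reboot against the list the other machine received.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// commit writes record n, fsyncs the file and its directory, and only then
// reports n on ack (assumed to be a connection to another machine).
func commit(dir string, n int, ack io.Writer) error {
	name := filepath.Join(dir, fmt.Sprintf("rec-%06d", n))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	if _, err := fmt.Fprintf(f, "record %d\n", n); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	if err := d.Sync(); err != nil {
		return err
	}
	// Acknowledge only after the OS has claimed the record is persisted.
	_, err = fmt.Fprintln(ack, n)
	return err
}

// missing returns every acknowledged record that is absent or has the wrong
// content. Any result means the OS claimed durability it did not deliver.
func missing(dir string, acked []int) []int {
	var lost []int
	for _, n := range acked {
		got, err := os.ReadFile(filepath.Join(dir, fmt.Sprintf("rec-%06d", n)))
		if err != nil || string(got) != fmt.Sprintf("record %d\n", n) {
			lost = append(lost, n)
		}
	}
	return lost
}

// demoLost commits three records, deletes one to simulate a loss, and
// returns what the checker reports.
func demoLost() []int {
	dir, err := os.MkdirTemp("", "acks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	var ack bytes.Buffer // stand-in for the off-machine ledger
	acked := []int{0, 1, 2}
	for _, n := range acked {
		if err := commit(dir, n, &ack); err != nil {
			panic(err)
		}
	}
	os.Remove(filepath.Join(dir, "rec-000001")) // simulated loss
	return missing(dir, acked)
}

func main() {
	fmt.Println(demoLost())
}
```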

bram.xyz•8 days ago

Jepsen is the gold standard for testing databases, I imagine it’ll work well here too.

I think typically it’s used on a slightly higher level (killing processes), but I imagine not super hard to make it work with e.g. Xen and kernel panics. Or maybe even electric relays to shut off power.

boomies-n-zoomies.bsky.social•7 days ago

I don’t know how to do this, but I do know who does. The sqlite tests cover this pretty extensively, iirc using the method you describe.

luiz.aoqui.dev•8 days ago

The authors of "Can Applications Recover from fsync Failures?" wrote a tool to help test disk failures (https://github.com/WiscADSL/dm-loki) and mention some others in the "Related Work" section.

The Tigerbeetle team also often talks about FS durability and simulation testing (https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/internals/ARCHITECTURE.md#direct-io).

justincormack.bsky.social•8 days ago

If I remember correctly, ZFS was tested on real hardware with controlled power-offs. From memory, I think there are also some fault-injection drivers in the Linux kernel?

simo5.bsky.social•8 days ago

Use eBPF to mess with writes/fsyncs?

headmold.bsky.social•8 days ago

Not an answer to your question, but in case it's useful: apparently on macOS, fsync by itself doesn't force data to stable storage, but the F_FULLFSYNC fcntl does. Confirmed by the man page, and rants documented at https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/

maik.zumstrull.net•8 days ago

Similar issue on Windows – some fsync() equivalents do nothing on editions other than Windows Server, to juice desktop benchmarks.

filippo.abyssdomain.expert•8 days ago

I know os.File.Sync does the right thing on macOS, I wonder if it does on Windows too.

maik.zumstrull.net•8 days ago

I'm not optimistic – file_windows.go doesn't seem to have any implementation of Sync() at all, and syscall.FlushFileBuffers (which would be the thing to do) is only called in one clearly unrelated place. But I could be misreading the source.

maik.zumstrull.net•8 days ago

Yeah I think I did misread it, the call flow is a bit strange but I think os.File.Sync() would eventually end up here: https://github.com/golang/go/blob/7d0cb2a2adec493b8ad9d79ef35354c8e20f0213/src/internal/poll/fd_fsync_windows.go#L10 which wraps FlushFileBuffers under a different name https://github.com/golang/go/blob/7d0cb2a2adec493b8ad9d79ef35354c8e20f0213/src/syscall/syscall_windows.go#L710

klauspost.com•8 days ago

Looks like it to me: https://github.com/golang/go/blob/master/src/syscall/syscall_windows.go#L404-L407

thepudds.bsky.social•8 days ago

You could try LazyFS:

"A FUSE file system with an internal dedicated page cache that only flushes data if explicitly requested by the application. This is useful for simulating power failures and losing unsynced data."

https://github.com/dsrhaslab/lazyfs

Apparently used by Jepsen:

thepudds.bsky.social•8 days ago

Separately, gosim seems intriguing (and ambitious!) as a deterministic simulation testing framework for Go:

https://github.com/jellevandenhooff/gosim

I tried an example of simulating crashing a server after writing data to disk with os.File.Sync vs. "yolo" (without os.File.Sync):

https://go.dev/play/p/1L1pXCLh5_k

thepudds.bsky.social•8 days ago

In that gosim example, the "yolo" writes don't persist across the simulated server crash, but the writes with os.File.Sync do seem to be visible after the crash.

Just a quick test.

It doesn't work in the playground (you need to run it locally -- I put instructions in the playground link).

thepudds.bsky.social•8 days ago

I think gosim is probably not the real solution for you, not least because the author labels it as still experimental, but the author also says it can currently simulate a 3-node etcd cluster and then partition the nodes.

I'd love to see it continue to progress (part of which might be more people trying it out 😅).

filippo.abyssdomain.expert•7 days ago

Both LazyFS and gosim look practical and way less work than making my own VM setup, thank you!

I will probably still wire a relay to the production machine once, but I might do CI based on one or both of these.

rst.bsky.social•8 days ago

The PostgreSQL developers have relevant experience. It is not an entirely happy experience. Their wiki page on the subject doesn't say much more than "there are issues", but it has pointers to discussion of some of them, particularly the mailing list thread. https://wiki.postgresql.org/wiki/Fsync_Errors

tom.sherman.is•8 days ago

This is why people pay sqlite maintainers $$$$ and why so many people rely on sqlite to handle durability for them

retr0.id•8 days ago

Yeah this is also why my PDS impl stores media "files" in one big sqlite db

retr0.id•8 days ago

Which I don't think is necessarily a great long-term solution, but until it actually causes me scaling issues I'm not going to figure out how to make it work reliably with fs

filippo.abyssdomain.expert•8 days ago

I am a big big fan of sqlite, including for blobs, but I looked at their os_unix.c and it doesn't do anything special beyond fdatasync on the file and fsync on the directory.

I *think* it doesn't even try to keep the dir inode in cache like Postgres to avoid losing a write-through error.

rbtcollins.bsky.social•7 days ago

As I recall the current state of play, your challenge isn't going to be demonstrating that fsync is called: a local interface plus mocks will do that trivially.

The challenge will be designing correct semantics on top of fsync :/.
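
A minimal sketch of the "local interface plus mocks" idea, assuming a hypothetical syncWriter interface and appendEntry function (neither is from the thread or from any library). It only demonstrates that Sync is called after Write, which, as noted above, is the easy part.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// syncWriter is a hypothetical local interface covering the file operations
// the application needs.
type syncWriter interface {
	io.Writer
	Sync() error
}

// *os.File satisfies the interface, so production code can pass real files.
var _ syncWriter = (*os.File)(nil)

// appendEntry is a hypothetical application function: write, then fsync.
func appendEntry(w syncWriter, entry []byte) error {
	if _, err := w.Write(entry); err != nil {
		return err
	}
	return w.Sync()
}

// mockFile records the order of calls instead of touching the disk.
type mockFile struct{ calls []string }

func (m *mockFile) Write(p []byte) (int, error) {
	m.calls = append(m.calls, "write")
	return len(p), nil
}

func (m *mockFile) Sync() error {
	m.calls = append(m.calls, "sync")
	return nil
}

func main() {
	m := &mockFile{}
	if err := appendEntry(m, []byte("leaf")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(m.calls) // prints [write sync]
}
```

A mock like this says nothing about what the filesystem or disk does after Sync returns, which is why the thread leans on LazyFS, dm-log-writes and real power cuts for that part.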
