For a service that depends directly on a Postgres database, they've shown:
- they don't know how to properly manage storage
- they still haven't turned on data checksums (pg_checksums); enabling them is sketched just below
- they have no idea how to run a reliable production Postgres cluster
These are unserious people trying to run a serious project, and it should make you very concerned about how professionally they do all their work.
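For reference (not their procedure): enabling checksums on an existing cluster is a single offline operation with pg_checksums, which ships with Postgres 12 and later. A minimal sketch, with a hypothetical data directory path:

```python
import subprocess

# Hypothetical data directory; adjust for the actual cluster layout.
PGDATA = "/var/db/postgres/data"

# pg_checksums only works against a cleanly shut-down cluster.
subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"], check=True)

# Enable data checksums. Every relation block gets rewritten, so this
# scales with database size and needs a real maintenance window.
subprocess.run(["pg_checksums", "--enable", "--progress", "-D", PGDATA], check=True)

subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)
```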
RT: mastodon.matrix.org/users/matr…
Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.
feld
in reply to feld • • •
Oh a mysterious "slow burn" of Postgres corruption? Where is the engagement on the Postgres mailing lists? I haven't seen a single thread about this issue on the pgsql-general or pgsql-hackers lists.
It's either a hardware storage bug, a RAID implementation bug, a kernel bug, or their Postgres/filesystem tuning is trading data reliability for performance. But they're not sharing anything of value.
Postgres doesn't just corrupt itself. We have several DBs > 100TB at $work. Many people have significantly larger databases...
I kinda doubt their recovery times too. They will probably forget that they need to disable indexes to make the restore run at a reasonable speed. And pg_restore is single-threaded per table; 1.5TB can take 1.5 days.
blog.peerdb.io/how-can-we-make…
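For context on that bottleneck: pg_restore can run parallel jobs against a custom- or directory-format dump, but each table is restored by a single worker, so one huge table dominates the wall clock. A rough sketch with hypothetical paths and database names:

```python
import subprocess

DUMP_DIR = "/backups/synapse.dump"  # hypothetical directory-format dump (pg_dump -Fd)
TARGET_DB = "synapse"               # hypothetical target database name

# --jobs restores tables and builds indexes in parallel, but a single
# multi-TB table is still driven by one worker process.
subprocess.run(
    [
        "pg_restore",
        "--jobs", "16",
        "--dbname", TARGET_DB,
        "--no-owner",
        DUMP_DIR,
    ],
    check=True,
)
```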
I think they're fucked. I wonder if they will be able to recover without it taking months, literally. They haven't indicated they're using anything but vanilla Postgres.
This could be the end of the matrix.org homeserver.
How can we make pg_dump and pg_restore 5 times faster?
Sai Srirampur (PeerDB Blog)
feld
in reply to Blurry Moon • • •
@sun Oh I don't disagree, but Oracle has had billions poured into it so they can make that possible. Postgres is nearly as good as Oracle in almost all use cases, but these types of maintenance operations have not yet been engineered for performance.
The companies doing the Postgres forks have been the ones innovating here and putting their time and expertise into making sure they solve their customers' needs. And often those improvements get merged upstream. But as far as R&D goes it's still a drop in the bucket compared to Oracle 🫠
feld
in reply to Dr. Cat • • •
@j @sun Tons of features you probably don't need, and a unique ability to drain your company of all its money, both in licensing costs and Oracle DBA salaries.
Although if you have a particular problem they will make you a custom patch you can apply to your Oracle database. A change to the code nobody else will ever have. It's bizarre.
Blurry Moon
in reply to Dr. Cat • • •
@j For years and years people went with Oracle because it was the only ANSI SQL compliant database; everybody else either didn't have X feature or only had it as a proprietary extension. But this hasn't been true for years, Postgres is compliant.
Oracle also spends a gazillion dollars convincing your company to put everything into Oracle, though, so they have really stupid bad shit you should never do, but on the surface you think "I'm already paying them so I'll integrate that too". It's pretty transparent that they're taking advantage of know-nothing managers to trap companies into never being able to leave.
feld
in reply to Blurry Moon • • •
@sun @j We had a scheduled overnight outage in 2007 to upgrade Oracle 9i to 10g. It was an 8-hour outage, and the process to back up and then apply the patches took 7 hours.
We couldn't afford more Sun servers. A restore from backup was also 8 hours. We practiced it several times because even doing one thing out of order breaks the database.
It was all or nothing (and probably losing our jobs).
It worked. I was never so scared, though.
Dr. Cat
in reply to feld • • •
They should be using ZFS at that scale.
It also makes snapshots stupid easy. You can get pretty creative and do a point-in-time Postgres restore of a 100 TB DB in under 15 minutes.
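As a rough illustration of the kind of workflow being described (not their actual setup), with hypothetical dataset and mountpoint names: a snapshot of the dataset holding PGDATA and its WAL is crash-consistent and near-instant regardless of size, and rolling back just means Postgres replays WAL from that moment on the next start:

```python
import subprocess
from datetime import datetime, timezone

DATASET = "tank/pgdata"   # hypothetical dataset holding PGDATA and its WAL
PGDATA = "/tank/pgdata"   # hypothetical mountpoint of that dataset

def take_snapshot() -> str:
    """Create a crash-consistent snapshot of the whole data directory."""
    snap = f"{DATASET}@pitr-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def restore(snap: str) -> None:
    """Roll the dataset back to a snapshot and restart Postgres.
    On startup Postgres runs crash recovery from the WAL captured in the snapshot."""
    subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"], check=True)
    subprocess.run(["zfs", "rollback", "-r", snap], check=True)
    subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)
```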
Phantasm
in reply to Phantasm • • •
Performance and Tuning — OpenZFS documentation
openzfs.github.io
Dr. Cat
in reply to Blurry Moon • • •
@sun @phnt here's this too
vadosware.io/post/everything-i…
Setting the record size to 8k is faster than 16k, but only for a little while because it gets super fragmented. Setting it to 16k fixes the fragmentation and provides better compression ratios, since compression happens per record block. Setting it to 32k or higher could be interesting and help compression even more. You won't see improvements beyond the default 128k on like 95% of drives, and it could even hurt performance. That being said, 1M+ record sizes may be useful in conjunction with zstd-4 for long-term archival of compressible data like database backups. All of this can be changed whenever, so it's not that big of a deal. Block size you're stuck with forever, so make sure you set the correct block size.
Most of the data the database actually cares about at any time will live in the ARC (RAM cache), and if you use compression it's compressed in the ARC, so you get even better cache hits.
For compression I used lz4. Zstd (even compression level 1) added too much latency. Lz4 is really great and shaved off about 45% of the data that needed to be written to disk. That was the main reason I switched to ZFS: it was the only practical filesystem for Postgres that supports disk compression.
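A minimal sketch of that kind of dataset setup, using hypothetical pool, dataset, and mountpoint names; the recordsize and compression values are the ones described above:

```python
import subprocess

DATASET = "tank/pgdata"          # hypothetical pool/dataset names
MOUNTPOINT = "/var/db/postgres"  # hypothetical mountpoint for PGDATA

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Create a dataset along the lines described above: 16k records to keep
# fragmentation in check while still compressing per record, and lz4
# because it is cheap enough not to add noticeable latency.
run("zfs", "create",
    "-o", "recordsize=16k",
    "-o", "compression=lz4",
    "-o", "mountpoint=" + MOUNTPOINT,
    DATASET)

# Properties like these can be changed later, but only newly written
# blocks pick up the new values (see the zfs rewrite discussion below).
run("zfs", "set", "recordsize=32k", DATASET)
```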
It makes Postgres upgrades super fast and easy. Just take a snapshot, hard-link the database files, and fire up the new Postgres version. It should work, but if it starts fugging the database you can just easily restore the snapshot.
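That roughly maps to a snapshot followed by a hard-link pg_upgrade, with rollback as the escape hatch; the versions, paths, and binary directories below are hypothetical:

```python
import subprocess

DATASET = "tank/pgdata"                  # hypothetical dataset holding both clusters
OLD_DATA = "/var/db/postgres/data15"     # hypothetical old/new data directories
NEW_DATA = "/var/db/postgres/data16"
OLD_BIN = "/usr/local/postgresql15/bin"  # hypothetical binary directories
NEW_BIN = "/usr/local/postgresql16/bin"

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Safety net first: snapshot the dataset while both clusters are stopped.
run("zfs", "snapshot", f"{DATASET}@pre-upgrade")

# --link hard-links data files into the new cluster instead of copying
# them, so the upgrade is quick even for a very large database.
run("pg_upgrade",
    "--old-datadir", OLD_DATA, "--new-datadir", NEW_DATA,
    "--old-bindir", OLD_BIN, "--new-bindir", NEW_BIN,
    "--link")

# If the new version misbehaves, roll everything back:
#   zfs rollback tank/pgdata@pre-upgrade
```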
I came for the compression and ended up loving it, because not only is it the best filesystem, it's the best disk management system too. You can even just create raw volumes (zvols) and format them however you want. You can have ext4 on ZFS, you can have NTFS on ZFS, you could even put ZFS on top of ZFS if you really wanted to.
ZFS is also the only way to have a compressed swap partition.
Everything I've seen on optimizing Postgres on ZFS
vadosware.io
feld
in reply to Dr. Cat • • •
> All of this can be changed whenever so it's not that big of a deal.
When you make these changes to a ZFS filesystem, the existing data is not changed; only newly written data picks up the new settings. That problem is left to you to solve -- traditionally by restoring all the data from backup.
However, a new tool called "zfs rewrite" is coming that will let you atomically rewrite the underlying blocks so existing data picks up the new storage settings applied to the filesystem.
openzfs.github.io/openzfs-docs…
edit: this would also be useful for re-balancing your zpool if you add new vdevs or something
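For illustration only, going by the linked man page: you point zfs rewrite at existing files or directories and it rewrites their blocks under the current dataset properties. The exact invocation below (including the recursive flag) is an assumption to verify against zfs-rewrite.8:

```python
import subprocess

# Hypothetical path of a Postgres data directory on a dataset whose
# recordsize/compression settings were changed after the data was written.
PGDATA = "/tank/pgdata"

# Rewrite the existing blocks in place so they pick up the current
# dataset properties; "-r" (recurse into the directory) is an assumption
# based on the linked zfs-rewrite.8 page.
subprocess.run(["zfs", "rewrite", "-r", PGDATA], check=True)
```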
zfs-rewrite.8 — OpenZFS documentation
openzfs.github.io
feld
in reply to Phantasm • • •
@phnt @j @sun Correct, it would reduce fragmentation.
Right now I solve that problem myself with a custom script I made that uses cpdup to clone files and then move them into place of the originals. A ZFS-native tool would be much better.
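Not the actual script, just a sketch of the general copy-and-swap approach described: a fresh copy is written with the dataset's current recordsize and compression, and the rename puts it in place of the original. It assumes nothing (e.g. a running Postgres) is writing to those files:

```python
import os
import subprocess

def rewrite_file(path: str) -> None:
    """Rewrite one file's blocks by copying it and renaming the copy over the original.
    Assumes nothing (e.g. a running Postgres) is writing to the file."""
    tmp = path + ".rewrite.tmp"
    # cpdup preserves ownership, modes and times; plain `cp -p` would also work.
    subprocess.run(["cpdup", path, tmp], check=True)
    os.rename(tmp, path)  # atomic swap on the same filesystem

def rewrite_tree(root: str) -> None:
    """Walk a directory tree and rewrite every regular file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rewrite_file(os.path.join(dirpath, name))

if __name__ == "__main__":
    rewrite_tree("/tank/pgdata/base")  # hypothetical path, database stopped
```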