For a service that depends directly on a Postgres database they've shown:

- they don't know how to properly manage storage
- they still haven't enabled data checksums (pg_checksums; sketch below the list)
- they have no idea how to run a reliable production Postgres cluster
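
For reference, data checksums are a one-flag decision; a rough sketch, assuming a vanilla cluster with made-up paths:

  # enable checksums when initializing a new cluster
  initdb --data-checksums -D /var/lib/postgresql/data
  # or retrofit / verify an existing, cleanly stopped cluster (PG 12+)
  pg_checksums --enable -D /var/lib/postgresql/data
  pg_checksums --check -D /var/lib/postgresql/data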

These are unserious people trying to run a serious project, and it should make you very concerned about how professionally they do all their work.
RT: mastodon.matrix.org/users/matr…


Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.


in reply to feld

Oh a mysterious "slow burn" of Postgres corruption? Where is the engagement on the Postgres mailing lists? I haven't seen a single thread about this issue on the pgsql-general or pgsql-hackers lists.

It's either a hardware storage bug, a raid implementation bug, a kernel bug, or their Postgres/filesystem tuning is trading data reliability for performance. But they're not sharing anything of value.

Postgres doesn't just corrupt itself. We have several DBs > 100TB at $work. Many people have significantly larger databases...

I kinda doubt their recovery times too. They'll probably forget that they need to drop and rebuild indexes for the restore to run at a reasonable speed. And pg_restore is single-threaded per table; 1.5TB can take 1.5 days.

blog.peerdb.io/how-can-we-make…
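
To put numbers on that: pg_restore only parallelizes across tables, so one giant table is still a single worker no matter how many jobs you throw at it. A sketch, with made-up database names and paths:

  # dump in directory format so the restore can use parallel jobs
  pg_dump -Fd -j 8 -f /backup/dump mydb
  # restore with 8 workers; each table's data load is still a single process
  pg_restore -j 8 -d mydb /backup/dump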

I think they're fucked. I wonder if they will be able to recover without it taking months, literally. They haven't indicated they're using anything but vanilla Postgres.

This could be the end of the matrix.org homeserver.



in reply to Blurry Moon

@sun Oh I don't disagree, but Oracle has had billions poured into it so they can make that possible. Postgres is nearly as good as Oracle in almost all use cases, but these types of maintenance operations have not yet been engineered for performance.

The companies doing the Postgres forks have been the ones innovating here and putting their time and expertise into making sure they solve their customers' needs. And often those improvements get merged upstream. But as far as R&D goes it's still a drop in the bucket compared to Oracle 🫠

in reply to Dr. Cat

@j @sun At that point you basically consult them for every patch / upgrade so they can keep your changes working. This is sometimes a major issue that keeps companies from upgrading to the next major release: the query planner etc. will change, and you could lose the performance you had, or new problems arise that need different custom patches to keep your workload performing as expected.
in reply to Dr. Cat

@j if you aren't a big company the few benefits aren't worth it; in fact I would say it has little benefit unless you buy their most expensive horizontally scalable option, which is meant for businesses where the data size is so massive it should be NoSQL but you're architecturally locked into an RDBMS. Every time I've mentioned it on here people say "you're doing it wrong", and I have to explain that a lot of corporate customers are just plain locked into something that got built in the 1990s, and it would take a hundred million dollars and shitloads of unacceptable risk to rewrite. For those customers there is a big fat Oracle database and you will pay a LOT for it.
in reply to Dr. Cat

@j For years and years people went with Oracle because it was the only ANSI SQL compliant database; everybody else either didn't have X feature or it was a proprietary extension. But this hasn't been true for years: Postgres is compliant.

Oracle also spends a gazillion dollars convincing your company to put everything into Oracle, though, so they have really stupid, bad shit you should never use, but on the surface you think "I'm already paying them so I'll integrate that too". It's pretty transparent that they're taking advantage of know-nothing managers to trap companies into never being able to leave.

in reply to Blurry Moon

@sun @j We had a scheduled overnight outage in 2007 to upgrade Oracle 9i to 10g. It was an 8-hour outage, and the process to back up and then apply the patches took 7 hours.

We couldn't afford more Sun servers. A restore from backup was also 8 hours. We practiced it several times because even doing one thing out of order breaks the database.

It was all or nothing (and probably losing our jobs).

It worked. I was never so scared, though.

in reply to Blurry Moon

@sun @j You can, and I have done so without issues; it's just not a simple "install Postgres, configure it, and let it run" thing. You have to change record sizes to avoid fragmentation, and you either disable the ZIL on your DB dataset (if you are on fast SSDs) and hope that Postgres will ensure data integrity with fsync and pg_wal, or you move the ZIL to a dedicated SLOG on fast SSDs. And those are the absolute basics of what you have to do to make it somewhat work.
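
To make that concrete, roughly the kind of setup I mean; pool, dataset, and device names are made up and the exact values depend on your hardware:

  # a dedicated dataset for the data directory; 8k-16k recordsize is the usual starting point
  zfs create -o recordsize=16k -o compression=lz4 -o atime=off tank/pgdata
  # option A: "disable the ZIL" on that dataset and trust Postgres' fsync/WAL
  # (recent sync writes can be lost on power failure)
  zfs set sync=disabled tank/pgdata
  # option B: keep sync writes and give the ZIL a dedicated SLOG on a fast SSD
  zpool add tank log /dev/nvme0n1
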
in reply to Blurry Moon

@sun @phnt here's this too
vadosware.io/post/everything-i…

Setting record size to 8k is faster than 16k, but only for a little bit, because it gets super fragmented. Setting it to 16k fixes the fragmentation and provides better compression ratios, since compression happens per record block. Setting it to 32k or higher could be interesting and help compression even more. You won't see improvements going beyond the default 128k on like 95% of drives, and it could even hurt performance. That being said, 1M+ record sizes may be useful in conjunction with zstd-4 for long-term archival of compressible data like database backups. All of this can be changed whenever so it's not that big of a deal. Block size you're stuck with forever, so make sure you set the correct block size.
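
If you want the actual knobs, roughly (dataset and pool names made up; I'm assuming the "stuck forever" setting is the pool's ashift):

  # recordsize and compression can be changed any time, but only new writes pick them up
  zfs set recordsize=16k tank/pgdata
  zfs set recordsize=1M tank/backups
  zfs set compression=zstd-4 tank/backups
  # the one you're stuck with is set when the pool/vdev is created, e.g. 4k sectors:
  zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb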

Most of the data the database actually cares about at any given time will live in the ARC (RAM cache), and if you use compression it's compressed in the ARC too, so you get even better cache hits.

For compression I used lz4. Zstd (even compression level 1) added too much latency. Lz4 is really great and shaved off about 45% of the data that needed to be written to disk. That was the main reason I switched to ZFS: it was the only practical filesystem for Postgres that supports disk compression.
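
If you want to check what you're actually getting, ZFS reports the ratio per dataset (dataset name made up):

  zfs set compression=lz4 tank/pgdata
  zfs get compression,compressratio tank/pgdata
  # a compressratio around 1.8x matches roughly 45% fewer bytes hitting disk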

It makes Postgres upgrades super fast and easy. Just take a snapshot, hard link the database files, fire up the new Postgres version, and it should work; if it starts fugging the database, you can just easily restore the snapshot.
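
The flow is roughly this; version numbers and paths are made up, and the hard linking is what pg_upgrade --link does:

  zfs snapshot tank/pgdata@pre-upgrade
  pg_upgrade --link \
    -b /usr/lib/postgresql/15/bin -B /usr/lib/postgresql/16/bin \
    -d /tank/pgdata/15 -D /tank/pgdata/16
  # if the new version misbehaves, roll the whole dataset back
  zfs rollback tank/pgdata@pre-upgrade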

I came for the compression and ended up loving it, because not only is it the best filesystem, it's the best disk management system too. You can even just create raw volumes and format them however you want. You can have ext4 on ZFS, you can have NTFS on ZFS, you could even put ZFS on top of ZFS if you really wanted to.

ZFS is also the only way to have a compressed swap partition.
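
The raw volumes are zvols; a sketch of both tricks (names and sizes made up, and note that swap on a zvol needs care to avoid deadlocks under memory pressure):

  # a raw volume you can format with whatever filesystem you want
  zfs create -V 100G tank/vol0
  mkfs.ext4 /dev/zvol/tank/vol0
  # compressed swap on a zvol
  zfs create -V 16G -o compression=lz4 -o sync=always -o primarycache=metadata tank/swap
  mkswap /dev/zvol/tank/swap
  swapon /dev/zvol/tank/swap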

in reply to Dr. Cat

> All of this can be changed whenever so it's not that big of a deal.

When you make these changes to ZFS filesystems, they do not change the existing data. That problem is left to you to solve -- traditionally by restoring all the data from backup.

However, a new tool called "zfs rewrite" is coming that will let you atomically rewrite the underlying blocks so existing data picks up the new storage settings applied to the filesystem.

openzfs.github.io/openzfs-docs…

edit: this would also be useful for rebalancing your zpool if you add new vdevs or something
