For a service that depends directly on a Postgres database, they've shown:
- they don't know how to properly manage storage
- they still haven't turned on data checksums (pg_checksums); enabling them is sketched just below
- they have no idea how to run a reliable production Postgres cluster
These are unserious people trying to run a serious project, and it should make you very concerned about how professionally they do all their work.
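For reference (not their procedure): enabling checksums on an existing cluster is a single offline operation with pg_checksums, which ships with Postgres 12 and later. A minimal sketch, with a hypothetical data directory path:

```python
import subprocess

# Hypothetical data directory; adjust for the actual cluster layout.
PGDATA = "/var/db/postgres/data"

# pg_checksums only works against a cleanly shut-down cluster.
subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"], check=True)

# Enable data checksums. Every relation block gets rewritten, so this
# scales with database size and needs a real maintenance window.
subprocess.run(["pg_checksums", "--enable", "--progress", "-D", PGDATA], check=True)

subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)
```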
RT: mastodon.matrix.org/users/matr…
Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.
feld
in reply to feld • • •
Oh a mysterious "slow burn" of Postgres corruption? Where is the engagement on the Postgres mailing lists? I haven't seen a single thread about this issue on the pgsql-general or pgsql-hackers lists.
It's either a hardware storage bug, a RAID implementation bug, a kernel bug, or their Postgres/filesystem tuning is trading data reliability for performance. But they're not sharing anything of value.
Postgres doesn't just corrupt itself. We have several DBs > 100TB at $work. Many people have significantly larger databases...
I kinda doubt their recovery times too. They will probably forget that they need to disable indexes to make the restore run at a reasonable speed. And pg_restore is single-threaded per table; 1.5TB can take 1.5 days.
blog.peerdb.io/how-can-we-make…
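For context on that bottleneck: pg_restore can run parallel jobs against a custom- or directory-format dump, but each table is restored by a single worker, so one huge table dominates the wall clock. A rough sketch with hypothetical paths and database names:

```python
import subprocess

DUMP_DIR = "/backups/synapse.dump"  # hypothetical directory-format dump (pg_dump -Fd)
TARGET_DB = "synapse"               # hypothetical target database name

# --jobs restores tables and builds indexes in parallel, but a single
# multi-TB table is still driven by one worker process.
subprocess.run(
    [
        "pg_restore",
        "--jobs", "16",
        "--dbname", TARGET_DB,
        "--no-owner",
        DUMP_DIR,
    ],
    check=True,
)
```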
I think they're fucked. I wonder if they will be able to recover without it taking months, literally. They haven't indicated they're using anything but vanilla Postgres.
This could be the end of the matrix.org homeserver.
How can we make pg_dump and pg_restore 5 times faster?
Sai Srirampur (PeerDB Blog)
feld
in reply to Blurry Moon • • •
@sun Oh I don't disagree, but Oracle has had billions poured into it so they can make that possible. Postgres is nearly as good as Oracle in almost all use cases, but these types of maintenance operations have not yet been engineered for performance.
The companies doing the Postgres forks have been the ones innovating here and putting their time and expertise into making sure they solve their customers' needs. And often those improvements get merged upstream. But as far as R&D goes it's still a drop in the bucket compared to Oracle 🫠
feld
in reply to Dr. Cat • • •
@j @sun Tons of features you probably don't need, and a unique ability to drain your company of all its money, both in licensing costs and Oracle DBA salaries.
Although if you have a particular problem they will make you a custom patch you can apply to your Oracle database. A change to the code nobody else will ever have. It's bizarre.
Blurry Moon
in reply to Dr. Cat • • •
@j For years and years people went with Oracle because it was the only ANSI SQL compliant database; everybody else either didn't have X feature or only had it as a proprietary extension. But this hasn't been true for years, Postgres is compliant.
Oracle also spends a gazillion dollars convincing your company to put everything into Oracle, though, so they have really stupid bad shit you should never do, but on the surface you think "I'm already paying them so I'll integrate that too". It's pretty transparent that they're taking advantage of know-nothing managers to trap companies into never being able to leave.
feld
in reply to Blurry Moon • • •
@sun @j We had a scheduled overnight outage in 2007 to upgrade Oracle 9i to 10g. It was an 8-hour outage, and the process to back up and then apply the patches took 7 hours.
We couldn't afford more Sun servers. A restore from backup was also 8 hours. We practiced it several times because even doing one thing out of order breaks the database.
It was all or nothing (and probably losing our jobs).
It worked. I was never so scared, though.
Dr. Cat
in reply to feld • • •
They should be using ZFS at that scale.
It also makes snapshots stupid easy. You can get pretty creative and do a point-in-time Postgres restore of a 100 TB DB in under 15 minutes.
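As a rough illustration of the kind of workflow being described (not their actual setup), with hypothetical dataset and mountpoint names: a snapshot of the dataset holding PGDATA and its WAL is crash-consistent and near-instant regardless of size, and rolling back just means Postgres replays WAL from that moment on the next start:

```python
import subprocess
from datetime import datetime, timezone

DATASET = "tank/pgdata"   # hypothetical dataset holding PGDATA and its WAL
PGDATA = "/tank/pgdata"   # hypothetical mountpoint of that dataset

def take_snapshot() -> str:
    """Create a crash-consistent snapshot of the whole data directory."""
    snap = f"{DATASET}@pitr-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def restore(snap: str) -> None:
    """Roll the dataset back to a snapshot and restart Postgres.
    On startup Postgres runs crash recovery from the WAL captured in the snapshot."""
    subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"], check=True)
    subprocess.run(["zfs", "rollback", "-r", snap], check=True)
    subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)
```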
Phantasm
in reply to Phantasm • • •
Performance and Tuning — OpenZFS documentation
openzfs.github.io
Dr. Cat
in reply to Blurry Moon • • •
@sun @phnt here's this too
vadosware.io/post/everything-i…
Setting the record size to 8k is faster than 16k, but only for a little while because it gets super fragmented. Setting it to 16k fixes the fragmentation and provides better compression ratios, since compression happens per record block. Setting it to 32k or higher could be interesting and help compression even more. You won't see improvements beyond the default 128k on like 95% of drives, and it could even hurt performance. That being said, 1M+ record sizes may be useful in conjunction with zstd-4 for long-term archival of compressible data like database backups. All of this can be changed whenever, so it's not that big of a deal. Block size you're stuck with forever, so make sure you set the correct block size.
Most of the data the database actually cares about at any time will live in the ARC (RAM cache), and if you use compression it's compressed in the ARC, so you get even better cache hits.
For compression I used lz4. Zstd (even compression level 1) added too much latency. Lz4 is really great and shaved off about 45% of the data that needed to be written to disk. That was the main reason I switched to ZFS: it was the only practical filesystem for Postgres that supports disk compression.
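A minimal sketch of that kind of dataset setup, using hypothetical pool, dataset, and mountpoint names; the recordsize and compression values are the ones described above:

```python
import subprocess

DATASET = "tank/pgdata"          # hypothetical pool/dataset names
MOUNTPOINT = "/var/db/postgres"  # hypothetical mountpoint for PGDATA

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Create a dataset along the lines described above: 16k records to keep
# fragmentation in check while still compressing per record, and lz4
# because it is cheap enough not to add noticeable latency.
run("zfs", "create",
    "-o", "recordsize=16k",
    "-o", "compression=lz4",
    "-o", "mountpoint=" + MOUNTPOINT,
    DATASET)

# Properties like these can be changed later, but only newly written
# blocks pick up the new values (see the zfs rewrite discussion below).
run("zfs", "set", "recordsize=32k", DATASET)
```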
It makes Postgres upgrades super fast and easy. Just take a snapshot, hard-link the database files, and fire up the new Postgres version. It should work, but if it starts fugging the database you can just easily restore the snapshot.
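That roughly maps to a snapshot followed by a hard-link pg_upgrade, with rollback as the escape hatch; the versions, paths, and binary directories below are hypothetical:

```python
import subprocess

DATASET = "tank/pgdata"                  # hypothetical dataset holding both clusters
OLD_DATA = "/var/db/postgres/data15"     # hypothetical old/new data directories
NEW_DATA = "/var/db/postgres/data16"
OLD_BIN = "/usr/local/postgresql15/bin"  # hypothetical binary directories
NEW_BIN = "/usr/local/postgresql16/bin"

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Safety net first: snapshot the dataset while both clusters are stopped.
run("zfs", "snapshot", f"{DATASET}@pre-upgrade")

# --link hard-links data files into the new cluster instead of copying
# them, so the upgrade is quick even for a very large database.
run("pg_upgrade",
    "--old-datadir", OLD_DATA, "--new-datadir", NEW_DATA,
    "--old-bindir", OLD_BIN, "--new-bindir", NEW_BIN,
    "--link")

# If the new version misbehaves, roll everything back:
#   zfs rollback tank/pgdata@pre-upgrade
```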
I came for the compression and ended up loving it, because not only is it the best filesystem, it's the best disk management system too. You can even just create raw volumes (zvols) and format them however you want. You can have ext4 on ZFS, you can have NTFS on ZFS, you could even put ZFS on top of ZFS if you really wanted to.
ZFS is also the only way to have a compressed swap partition.
Everything I've seen on optimizing Postgres on ZFS
vadosware.io
feld
in reply to Dr. Cat • • •
> All of this can be changed whenever so it's not that big of a deal.
When you make these changes to a ZFS filesystem, the existing data is not changed; only newly written data picks up the new settings. That problem is left to you to solve -- traditionally by restoring all the data from backup.
However, a new tool called "zfs rewrite" is coming that will let you atomically rewrite the underlying blocks so existing data picks up the new storage settings applied to the filesystem.
openzfs.github.io/openzfs-docs…
edit: this would also be useful for re-balancing your zpool if you add new vdevs or something
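For illustration only, going by the linked man page: you point zfs rewrite at existing files or directories and it rewrites their blocks under the current dataset properties. The exact invocation below (including the recursive flag) is an assumption to verify against zfs-rewrite.8:

```python
import subprocess

# Hypothetical path of a Postgres data directory on a dataset whose
# recordsize/compression settings were changed after the data was written.
PGDATA = "/tank/pgdata"

# Rewrite the existing blocks in place so they pick up the current
# dataset properties; "-r" (recurse into the directory) is an assumption
# based on the linked zfs-rewrite.8 page.
subprocess.run(["zfs", "rewrite", "-r", PGDATA], check=True)
```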
zfs-rewrite.8 — OpenZFS documentation
openzfs.github.io
feld
in reply to Phantasm • • •
@phnt @j @sun Correct, it would reduce fragmentation.
Right now I solve that problem myself with a custom script I made that uses cpdup to clone files and then move them into place of the originals. A ZFS-native tool would be much better.
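Not the actual script, just a sketch of the general copy-and-swap approach described: a fresh copy is written with the dataset's current recordsize and compression, and the rename puts it in place of the original. It assumes nothing (e.g. a running Postgres) is writing to those files:

```python
import os
import subprocess

def rewrite_file(path: str) -> None:
    """Rewrite one file's blocks by copying it and renaming the copy over the original.
    Assumes nothing (e.g. a running Postgres) is writing to the file."""
    tmp = path + ".rewrite.tmp"
    # cpdup preserves ownership, modes and times; plain `cp -p` would also work.
    subprocess.run(["cpdup", path, tmp], check=True)
    os.rename(tmp, path)  # atomic swap on the same filesystem

def rewrite_tree(root: str) -> None:
    """Walk a directory tree and rewrite every regular file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rewrite_file(os.path.join(dirpath, name))

if __name__ == "__main__":
    rewrite_tree("/tank/pgdata/base")  # hypothetical path, database stopped
```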