Current major server outage - advice please!

AAD Web Team · November 3, 2023

Hello, Brodie from the Australian Antarctic Division here. We've got some kind of suspected cache issue causing runaway disk usage. It's currently so bad that our websites are down (unless you are logged in) and I cannot SSH into the servers. Restarting both the database and hosting servers has not helped either.

Pages load fine when logged in (since the cache is skipped) or when disabling the cache.

Below are some of the error messages we've started seeing, I suspect because we are running out of disk space:

	Unable to write lock file: /site/assets/cache/LazyCronLock.cache

Error: Exception: Unable to copy: /site/assets/files/26153/keon-anzac.jpg => /site/assets/files/26151/keon-anzac-7.jpg (in wire/core/Pagefile.php line 236)

unlink: Unable to unlink file: /site/assets/cache/Page/45511/4df366e0700b7c24883b744b6cb250ee+https.cache

Has anyone had a similar issue before and knows something we could try to resolve it?

Craig · November 3, 2023

There are a couple of things you could try in the PW admin area.

The first option would be to find a template that has the 'Clear cache for entire site' setting (or enable it). Find a page that uses it, and Save it.

The next option, if that doesn't work, is to find a page (or pages) with large files added to them. Download a copy of the original files, and then delete them from the page. (You can re-add them later once the issue is resolved).

This might give you enough free space to gain access via SSH to clear the cache, or to install the ProcessCacheControl module which lets you clear it from within PW.

Whatever you do, though, try to be quick, to avoid it filling back up again before you have time to intervene.

flydev · November 3, 2023

Hi, it seems to be a server issue, but for now we can't see what the root cause of the disk filling up is.

To get SSH access again, which is the most important thing for now, you could log in to the admin with a user account, delete all the log files there, then SSH in ASAP. Once on the server, check /var/log and remove some old *.gz files, or the biggest ones, to free up more space, then investigate.

Try to make a backup or an image of the server, if you can, before running commands as root.
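
To see where the space has gone once SSH access is back, a small script along these lines can list the largest files under a directory. This is only a sketch: the function names `largest_files` and `print_report` are made up for illustration, and /var/log is just the starting point suggested above. It only reports, it does not delete anything.

```python
import heapq
import os


def largest_files(root, limit=20):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    # onerror swallows unreadable directories instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return heapq.nlargest(limit, sizes)


def print_report(root="/var/log", limit=20):
    """Print the largest files under root, one per line, with sizes in MiB."""
    for size, path in largest_files(root, limit):
        print(f"{size / 1_048_576:10.1f} MiB  {path}")
```

Calling `print_report("/var/log")` first, and then pointing it at the site's assets directory, should show fairly quickly whether logs, cache files or uploaded files are the ones growing.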

AAD Web Team · November 5, 2023

Thanks @Craig and @flydev, we have now resolved the issue. Clearing the cache was good for temporarily freeing up space in blocks so we could get through.

Turns out the issue was a recent code change where the Pageimages class was used for handling a group of Pageimage objects from across many pages (where we'd previously used an array) - we had not realised at the time that this leads to each image getting instantiated again (and therefore duplicated)!
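
If duplicated image files were left behind on disk, a script like the one below could help locate them for cleanup. It is a rough sketch under assumptions, not something from the thread: it groups files by size and then by SHA-256 hash, the helper names `file_digest` and `find_duplicates` are hypothetical, and the root path you pass in (for example the site's assets files directory) is up to you. It only reports groups of byte-identical files and never deletes anything.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Review each reported group by hand before removing anything, since a page may still reference any one of the copies.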
