Resolved Server Unable to Recover Post Crash

Status
Not open for further replies.

Josh

shitting in your pants soon, maybe already did
Staff member
Moderator
Legendary
Web Shit
I have noticed that when the server has crashed recently, it has not been able to recover on its own, which has been killing high player populations on the server.

I am wondering if we can look into this and figure out why our K8s or Docker or whatever we use to run the server instance is unable to self-recover.

Instances of this Happening (That I can Remember)
10-13-2025 Approx 2:42pm CST
Crash Megathread Post: https://giantslair.com/threads/crash-megathread.382/#post-6548

10-12-2025 Approx 10:31pm CST
Crash Megathread Post: https://giantslair.com/threads/crash-megathread.382/#post-6532

9-28-2025 Approx 1:46am CST
Crash Megathread Post: https://giantslair.com/threads/crash-megathread.382/#post-5732

If I get more info I'll post it in here!
 
As of right now we don’t have a watchdog. It’s currently being made. Until then, it’s manual restarts.
 
I could have sworn we had a watchdog in the past - unless it broke

Mainly bc the server would auto restart
 
Nah, we never had a watchdog. The times it "autorestarted" before were just when the server was able to recover from 100% CPU on its own. If it gets completely hung up, it needs to be manually restarted.
 
A watchdog has been developed and will be moved to staging. If the server stops responding for 10 seconds, it will be killed and started again. If no issues are observed, it will be rolled out to production in the next week.
 
As of about 1 month ago, with our new deployment overhaul, a watchdog continuously monitors for a heartbeat from the server. If a heartbeat isn't detected for 10 seconds, the server process is killed and restarted before the crash screen timer automatically reconnects players.

We have also seen a significant reduction in reported server crashes. In cases where the server has crashed, the watchdog has performed as expected, with the server process being killed and restarted, and players who didn't disconnect during the crash countdown automatically reconnecting.
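
For anyone curious what a heartbeat watchdog like this looks like in general, here is a rough Python sketch. It is not the actual watchdog from our deployment. The only detail taken from this thread is the 10 second timeout; the `Watchdog` class, its method names, and the way heartbeats get reported (a plain `beat()` call) are all made up for illustration.

```python
import subprocess
import time
from typing import Callable, List, Optional

# From the thread: no heartbeat for 10 seconds means kill and restart.
HEARTBEAT_TIMEOUT_S = 10.0


def heartbeat_expired(last_beat: float, now: float,
                      timeout: float = HEARTBEAT_TIMEOUT_S) -> bool:
    """Return True when no heartbeat has been seen for at least `timeout` seconds."""
    return (now - last_beat) >= timeout


class Watchdog:
    """Hypothetical watchdog that restarts a child process on missed heartbeats."""

    def __init__(self, command: List[str],
                 timeout: float = HEARTBEAT_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.command = command      # assumed launch command for the server
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.process: Optional[subprocess.Popen] = None
        self.last_beat = clock()
        self.restarts = 0

    def start(self) -> None:
        """Launch the server process and reset the heartbeat timer."""
        self.process = subprocess.Popen(self.command)
        self.last_beat = self.clock()

    def beat(self) -> None:
        """Record a heartbeat. How the real server reports one is not stated in the thread."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Restart the server if it exited or its heartbeat is stale.

        Returns True if a restart happened.
        """
        exited = self.process is None or self.process.poll() is not None
        stale = heartbeat_expired(self.last_beat, self.clock(), self.timeout)
        if not (exited or stale):
            return False
        if self.process is not None and self.process.poll() is None:
            self.process.kill()     # a hung process will not exit on its own
            self.process.wait()     # reap it before starting a new one
        self.start()
        self.restarts += 1
        return True
```

In practice something would call `check()` on a short interval (say, once a second) while the server calls `beat()` through whatever channel it uses. The point of the timeout is that a fully hung process never exits by itself, so watching for an exit code alone would not catch the kind of hang described earlier in this thread.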
 