Up here in the mountains, we get the occasional power outage. I have the servers on UPSs but generally only have 5 minutes or so to snag my laptop, log in, and start shutting down VMs before the power fails. And sometimes I'm not even around, so the servers basically crash.
In this case, gitlab failed to start. Checking the gitlab-ctl status output, redis was identified as down.
-bash-4.2# gitlab-ctl status
run: alertmanager: (pid 4395) 215617s; run: log: (pid 2085) 231156s
run: gitaly: (pid 4416) 215615s; run: log: (pid 2084) 231156s
run: gitlab-exporter: (pid 4446) 215614s; run: log: (pid 2088) 231156s
run: gitlab-kas: (pid 4600) 215604s; run: log: (pid 2093) 231153s
run: gitlab-workhorse: (pid 4623) 215601s; run: log: (pid 2079) 231156s
run: grafana: (pid 4658) 215598s; run: log: (pid 2086) 231156s
run: logrotate: (pid 6103) 2852s; run: log: (pid 2090) 231155s
run: nginx: (pid 4731) 215580s; run: log: (pid 2080) 231156s
run: node-exporter: (pid 4750) 215578s; run: log: (pid 2077) 231156s
run: postgres-exporter: (pid 4757) 215578s; run: log: (pid 2078) 231156s
run: postgresql: (pid 23262) 26s; run: log: (pid 2083) 231156s
run: prometheus: (pid 4782) 215576s; run: log: (pid 2075) 231156s
run: puma: (pid 23226) 28s; run: log: (pid 2087) 231156s
down: redis: 0s, normally up, want up; run: log: (pid 2082) 231156s
run: redis-exporter: (pid 4816) 215574s; run: log: (pid 2076) 231156s
run: sidekiq: (pid 23237) 27s; run: log: (pid 2081) 231156s
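As an aside, gitlab-ctl will also report on (and tail) a single service, which is handy when only one thing is down. Roughly:

gitlab-ctl status redis   # status of just the redis service
gitlab-ctl tail redis     # follow only the redis logs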
Okay. So I checked the logs and ran gitlab-redis-cli stat, but got an error there as well.
-bash-4.2# gitlab-redis-cli stat
Could not connect to Redis at /var/opt/gitlab/redis/redis.socket: Connection refused
The socket does exist, but since redis is down, there's nothing to connect to. After a bit more sleuthing, I tried a gitlab-ctl reconfigure, but that kicked out an error as well.
[2024-03-17T13:03:34+00:00] FATAL: RuntimeError: redis_service[redis] (redis::enable line 19) had an error: RuntimeError: ruby_block[warn pending redis restart] (redis::enable line 68) had an error: RuntimeError: Execution of the command `/opt/gitlab/embedded/bin/redis-cli -s /var/opt/gitlab/redis/redis.socket INFO` failed with a non-zero exit code (1)
stdout:
stderr: Error: Connection reset by peer
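For what it's worth, the socket can be poked at directly, the same way the reconfigure run does, using the paths from the error above:

# confirm the socket file exists
ls -l /var/opt/gitlab/redis/redis.socket
# try talking to redis over it; with redis down this fails the same way
/opt/gitlab/embedded/bin/redis-cli -s /var/opt/gitlab/redis/redis.socket ping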
At this point, redis is clearly the problem. I did a gitlab-ctl tail to see the logs, and redis keeps trying to start but kicks out an RDB error.
2024-03-17_13:09:09.00014 25036:C 17 Mar 2024 13:09:08.999 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
2024-03-17_13:09:09.00020 25036:C 17 Mar 2024 13:09:08.999 # Redis version=6.2.8, bits=64, commit=423c78f4, modified=1, pid=25036, just started
2024-03-17_13:09:09.00022 25036:C 17 Mar 2024 13:09:08.999 # Configuration loaded
2024-03-17_13:09:09.00208 25036:M 17 Mar 2024 13:09:09.001 * monotonic clock: POSIX clock_gettime
2024-03-17_13:09:09.00394 [redis ASCII-art startup banner: Redis 6.2.8 (423c78f4/1) 64 bit, standalone mode, Port: 0, PID: 25036]
2024-03-17_13:09:09.00418 25036:M 17 Mar 2024 13:09:09.003 # Server initialized
2024-03-17_13:09:09.00419 25036:M 17 Mar 2024 13:09:09.003 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
2024-03-17_13:09:09.00443 25036:M 17 Mar 2024 13:09:09.004 * Loading RDB produced by version 6.2.8
2024-03-17_13:09:09.00445 25036:M 17 Mar 2024 13:09:09.004 * RDB age 264959 seconds
2024-03-17_13:09:09.00448 25036:M 17 Mar 2024 13:09:09.004 * RDB memory usage when created 6.59 Mb
2024-03-17_13:09:09.05154 25036:M 17 Mar 2024 13:09:09.051 # Short read or OOM loading DB. Unrecoverable error, aborting now.
2024-03-17_13:09:09.05159 25036:M 17 Mar 2024 13:09:09.051 # Internal error in RDB reading offset 0, function at rdb.c:2750 -> Unexpected EOF reading RDB file
2024-03-17_13:09:09.08898 [offset 0] Checking RDB file dump.rdb
2024-03-17_13:09:09.08905 [offset 26] AUX FIELD redis-ver = '6.2.8'
2024-03-17_13:09:09.08906 [offset 40] AUX FIELD redis-bits = '64'
2024-03-17_13:09:09.08907 [offset 52] AUX FIELD ctime = '1710415990'
2024-03-17_13:09:09.08908 [offset 67] AUX FIELD used-mem = '6911800'
2024-03-17_13:09:09.08908 [offset 83] AUX FIELD aof-preamble = '0'
2024-03-17_13:09:09.08909 [offset 85] Selecting DB ID 0
2024-03-17_13:09:09.08910 --- RDB ERROR DETECTED ---
2024-03-17_13:09:09.08910 [offset 966671] Unexpected EOF reading RDB file
2024-03-17_13:09:09.08911 [additional info] While doing: read-object-value
2024-03-17_13:09:09.08912 [additional info] Reading key 'cache:gitlab:flipper/v1/feature/ci_use_run_pipeline_schedule_worker'
2024-03-17_13:09:09.08913 [additional info] Reading type 0 (string)
2024-03-17_13:09:09.08913 [info] 4828 keys read
2024-03-17_13:09:09.08914 [info] 3570 expires
2024-03-17_13:09:09.08915 [info] 42 already expired
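That trailer is redis running an RDB check on the dump file as it bails out. The same check can be run by hand; I'd expect the redis-check-rdb binary to sit next to redis-cli in the embedded bin directory, though that path is an assumption on my part:

/opt/gitlab/embedded/bin/redis-check-rdb /var/opt/gitlab/redis/dump.rdb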
Unexpected EOF reading RDB file. The dump file got truncated when the power went out, and the fix is to delete /var/opt/gitlab/redis/dump.rdb so redis can start fresh with an empty dataset. I'm not a fan of just deleting the file, so from that directory I moved it to my home directory as a backup and restarted redis.
-bash-4.2# mv dump.rdb ~
-bash-4.2# gitlab-ctl stop redis
ok: down: redis: 0s, normally up
-bash-4.2# gitlab-ctl start redis
ok: run: redis: (pid 26114) 1s
-bash-4.2# gitlab-ctl status
run: alertmanager: (pid 4395) 216115s; run: log: (pid 2085) 231654s
run: gitaly: (pid 4416) 216113s; run: log: (pid 2084) 231654s
run: gitlab-exporter: (pid 4446) 216112s; run: log: (pid 2088) 231654s
run: gitlab-kas: (pid 4600) 216102s; run: log: (pid 2093) 231651s
run: gitlab-workhorse: (pid 4623) 216099s; run: log: (pid 2079) 231654s
run: grafana: (pid 4658) 216096s; run: log: (pid 2086) 231654s
run: logrotate: (pid 6103) 3350s; run: log: (pid 2090) 231653s
run: nginx: (pid 4731) 216078s; run: log: (pid 2080) 231654s
run: node-exporter: (pid 4750) 216076s; run: log: (pid 2077) 231654s
run: postgres-exporter: (pid 4757) 216076s; run: log: (pid 2078) 231654s
run: postgresql: (pid 26191) 2s; run: log: (pid 2083) 231654s
run: prometheus: (pid 4782) 216074s; run: log: (pid 2075) 231654s
run: puma: (pid 26123) 27s; run: log: (pid 2087) 231654s
run: redis: (pid 26114) 28s; run: log: (pid 2082) 231654s
run: redis-exporter: (pid 4816) 216072s; run: log: (pid 2076) 231654s
run: sidekiq: (pid 26162) 10s; run: log: (pid 2081) 231654s
And that seems to have done the trick for redis. I ran gitlab-ctl stop to completely stop gitlab and then rebooted the server.
Once up, though, postgresql failed to start. Checking the logs, I found the following error:
PANIC: could not locate a valid checkpoint record
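The error shows up in the postgresql log, which gitlab-ctl can tail directly:

gitlab-ctl tail postgresql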
It took a bit of hunting to find a basic solution. Basically I needed to reset the write-ahead log by running the pg_resetwal program that ships under /opt/gitlab/, but as the gitlab-psql user. So: stop gitlab, become the gitlab-psql user, and run pg_resetwal against the data directory. It's not a great solution, since pg_resetwal throws away write-ahead log data and any transactions in it, but that's the price of the server going down hard in a power outage.
-bash-4.2# su - gitlab-psql
Last login: Sun Mar 17 17:52:00 UTC 2024 on pts/0
-sh-4.2$ pg_resetwal -f /var/opt/gitlab/postgresql/data
Write-ahead log reset
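Condensed, the postgres part of the fix is just the two steps below. pg_resetwal was already on the gitlab-psql user's PATH for me; if it isn't, the explicit /opt/gitlab/embedded/bin path is my guess at where the omnibus package puts it:

gitlab-ctl stop
su - gitlab-psql -c '/opt/gitlab/embedded/bin/pg_resetwal -f /var/opt/gitlab/postgresql/data'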
Then I started the server again and checked the status.
-bash-4.2# gitlab-ctl status
run: alertmanager: (pid 22057) 35s; run: log: (pid 1780) 16764s
run: gitaly: (pid 22068) 35s; run: log: (pid 1801) 16764s
run: gitlab-exporter: (pid 22085) 34s; run: log: (pid 1781) 16764s
run: gitlab-kas: (pid 22087) 34s; run: log: (pid 1814) 16763s
run: gitlab-workhorse: (pid 22098) 33s; run: log: (pid 1798) 16764s
run: grafana: (pid 22108) 33s; run: log: (pid 1779) 16764s
run: logrotate: (pid 22118) 32s; run: log: (pid 1808) 16764s
run: nginx: (pid 22125) 32s; run: log: (pid 1796) 16764s
run: node-exporter: (pid 22133) 32s; run: log: (pid 1787) 16764s
run: postgres-exporter: (pid 22139) 31s; run: log: (pid 1785) 16764s
run: postgresql: (pid 22147) 31s; run: log: (pid 1799) 16764s
run: prometheus: (pid 22150) 30s; run: log: (pid 1786) 16764s
run: puma: (pid 22166) 30s; run: log: (pid 1778) 16764s
run: redis: (pid 22172) 29s; run: log: (pid 1797) 16764s
run: redis-exporter: (pid 22178) 29s; run: log: (pid 1784) 16764s
run: sidekiq: (pid 22185) 29s; run: log: (pid 1800) 16764s
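With everything back up, a quick sanity check doesn't hurt; GitLab ships a rake task for exactly that:

gitlab-rake gitlab:check SANITIZE=true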
Resources:
- https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/293
- https://forum.gitlab.com/t/postgresql-down-after-upgrade-from-13-10-to-13-12/57018