Node down and seems unresponsive

Xypto · January 8, 2022, 6:33pm

Hi,

I’ve been running my node since April with no issues. Hardware is:

Raspberry Pi 4, 4 GB RAM
FLIRC case (added recently)
1 TB SSD Verbatim Vi550 S3
Official power supply

But it’s been down for some weeks now. The web interface stays stuck on the “starting umbrel” page. Sometimes it displays the Red Umbrella of Death with a “system service failed” message. The LED on the SSD keeps blinking without interruption. I’m not totally sure, but this seems to have started after repeated power outages in December caused by snowfall.

When SSH’ing into the Pi, htop shows a load average above 20.

I tried running the debug command to get the logs, but nothing happens; the logs never show up. I also tried following the steps listed here: Red Umbrella of Death after Power Outage
but the commands to stop umbrel or docker don’t seem to respond either.

The command to update the OS version doesn’t work either. Even running ls in the /umbrel folder hangs.
I plugged the SSD into an Ubuntu machine and ran e2fsck; it told me the file system was clean.

I reflashed the SD card with 0.4.8 and then 0.4.9; the outcome is the same.

Any idea of what I could do?

Thx.

Hakuna · January 8, 2022, 8:26pm

Could you run docker ps and paste the output here? Are any docker container services running at all?
And just to verify: ~/umbrel/scripts/debug doesn’t do anything?
How about dmesg, to check whether your SSD has a mounting issue, or df -hal | grep G | head -10, to check whether the SSD is full?
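Since docker ps and the debug script can hang indefinitely on a node like this, it can help to run each diagnostic with a time limit so a hang shows up as an explicit result instead of a frozen terminal. Below is a minimal Python sketch of that idea; `run_with_timeout` is a hypothetical helper, not part of Umbrel. It assumes only the standard `subprocess` module. One caveat: a process stuck waiting on disk I/O (state "D" in htop) may not die right away when killed, so the sketch does not wait for it after the timeout.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd and return (exit_code, combined_output).

    Returns (None, "") if the command has not finished within timeout_s,
    which on this kind of node usually means it is stuck.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        # Ask the process to die, but do not block waiting for it: a task
        # blocked on I/O can ignore SIGKILL until the I/O completes.
        proc.kill()
        return None, ""
```

For example, `run_with_timeout(["docker", "ps"], timeout_s=30)` returning `(None, "")` would confirm that docker ps is hanging rather than just slow.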

Xypto · January 9, 2022, 8:44am

Thanks @Hakuna for your help.

df -hal | grep G | head -10 gives the output below

    /dev/root        59G  3.1G   53G   6% /
    devtmpfs        1.7G     0  1.7G   0% /dev
    tmpfs           1.9G     0  1.9G   0% /dev/shm
    tmpfs           1.9G   18M  1.9G   1% /run
    tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
    /dev/root        59G  3.1G   53G   6% /status-server
    /dev/sda1       938G  488G  403G  55% /mnt/data
    /dev/sda1       938G  488G  403G  55% /home/umbrel/umbrel
    /dev/sda1       938G  488G  403G  55% /var/lib/docker
    /dev/sda1       938G  488G  403G  55% /swap
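The output above suggests the SSD is not full: /dev/sda1 is at 55%. Note that 488G / 938G is only about 52%; df computes Use% as used / (used + available), rounded up, so reserved blocks (here 938 − 488 − 403 = 47G, likely the ext4 default reserve of about 5%) are left out: 488 / 891 ≈ 54.8%, shown as 55%. A minimal Python sketch for parsing this kind of output and flagging nearly full mounts follows; the function names and the 90% threshold are illustrative assumptions.

```python
def parse_df(text):
    """Map each mount point in `df -h`-style output to (device, Use% as int)."""
    mounts = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or wrapped line
        pct = fields[4].rstrip("%")
        if not pct.isdigit():
            continue  # header row ("Use%")
        mounts[fields[5]] = (fields[0], int(pct))
    return mounts


def df_use_percent(used, avail):
    """Recompute df's Use%: used / (used + avail), rounded up (integer math)."""
    return -(-100 * used // (used + avail))


def nearly_full(mounts, threshold=90):
    """Return the mount points whose usage is at or above threshold percent."""
    return sorted(m for m, (_, pct) in mounts.items() if pct >= threshold)
```

Because the same device appears under several bind mounts (/mnt/data, /var/lib/docker, /swap, …), each of them reports the same figures.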

dmesg shows several warnings/errors related to the SSD, as well as timeouts for processes.
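When dmesg output is long, filtering it for storage-related lines makes it easier to paste the relevant part into a thread like this. The Python sketch below does a simple substring filter; the patterns are examples chosen for illustration (I/O errors, hung-task warnings, USB resets, ext4 errors), not an exhaustive or official list, and exact kernel wording varies by driver and kernel version.

```python
# Illustrative substrings only; real kernel messages vary by driver and version.
SUSPECT_PATTERNS = (
    "I/O error",              # failed block reads/writes
    "blocked for more than",  # hung-task warnings (processes stuck on I/O)
    "uas_eh_abort_handler",   # USB Attached SCSI command aborts
    "reset SuperSpeed",       # USB 3 device resets
    "EXT4-fs error",          # file-system errors
)


def suspect_lines(dmesg_text, patterns=SUSPECT_PATTERNS):
    """Return the dmesg lines that contain any of the given substrings."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(p in line for p in patterns)
    ]
```

The input can be the text captured from `dmesg`, for example read from standard input after piping dmesg into the script.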

Xypto · January 9, 2022, 8:45am

htop shows docker-related processes:

Xypto · January 9, 2022, 8:45am

Running docker ps and waiting 10 minutes or so:

Same with ~/umbrel/scripts/debug

Hakuna · January 9, 2022, 11:03am

Really tricky; maybe others here can jump in.
I’d assume it’s one of the following:

Something with the SSD. Can you try connecting it to your laptop / desktop and check whether it’s working fine? Under Linux, you can run fsck in read-only mode (fsck -n) or e2fsck to check for bad sectors or file-system errors.
Something with the heat? Is the node particularly hot? Since you’re using a new case, maybe that’s causing your node to malfunction. You could check the temperature and whether the node is throttled with vcgencmd measure_temp && vcgencmd get_throttled
Lastly, perhaps something with the USB connection. Do you have a chance to change the cable or the USB port?
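The two vcgencmd commands print lines such as `temp=48.3'C` and `throttled=0x50005`. The throttled value is a bit field; the Python sketch below decodes it using the bit meanings from the Raspberry Pi firmware documentation. Under-voltage bits are worth checking here, since the SSD draws its power through the Pi's USB port. The function names are illustrative, and the example values are made up, not readings from this node.

```python
# Bit meanings of `vcgencmd get_throttled` (Raspberry Pi firmware documentation).
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode a line like 'throttled=0x50005' into the list of set flags."""
    value = int(output.strip().split("=", 1)[1], 16)  # base 16 accepts "0x..."
    return [
        name for bit, name in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)
    ]


def parse_temp(output):
    """Parse a line like "temp=48.3'C" into degrees Celsius."""
    return float(output.strip().split("=", 1)[1].rstrip("'C"))
```

`throttled=0x0` decodes to an empty list, meaning no throttling or under-voltage has been recorded since boot.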

Sorry, I’m poking into the dark a bit here too.

Xypto · January 9, 2022, 2:32pm

Yes, I’m also poking into the dark

I did try connecting the SSD to a laptop. No problems encountered. I ran e2fsck and fsck, and both told me it is clean.

[Screenshot from 2022-01-09 14-47-18]

Xypto · January 9, 2022, 2:32pm

The new case is supposed to limit the heat, it has a heatsink and a thermal pad on top of the CPU.

[Screenshot from 2022-01-09 15-18-32]
I did the laptop test with the same cable attached to the SSD, and I also switched to the other USB 3 port on the Pi, with the same result.

Hakuna · January 11, 2022, 11:35am

Okay, I think we can somewhat assume the hardware is fine. Let’s bring out the big guns and point at the software (Umbrel OS).
Could you first make sure that:

LND isn’t running (I’d assume so, since docker ps doesn’t return, and I can’t see any docker services for the user umbrel in htop)
you’ve backed up your channel.backup under ~/umbrel/backup and all your files / your LND channel states under ~/umbrel/lnd/data/graph/mainnet/
you’ve backed up your lnd.conf under ~/umbrel/lnd
you have your seed words
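Because ls already hangs on the Pi itself, the copy may be easier to do from another machine with the SSD mounted. Below is a minimal Python sketch that copies a list of files and folders into a new timestamped folder; `backup` is a hypothetical helper, the destination is up to you, and the source paths must be adjusted to wherever the SSD is actually mounted. Sources are copied by their last path component, so two sources with the same name would collide.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup(sources, dest_root):
    """Copy each existing file or folder in `sources` into a new timestamped folder.

    Returns (backup_folder, list_of_copied_paths). Missing sources are skipped.
    """
    dest = Path(dest_root) / datetime.now().strftime("umbrel-backup-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True)
    copied = []
    for src in map(Path, sources):
        if not src.exists():
            print(f"skipping missing path: {src}")
            continue
        target = dest / src.name
        if src.is_dir():
            shutil.copytree(src, target)  # copies the whole folder tree
        else:
            shutil.copy2(src, target)  # copy2 also keeps timestamps
        copied.append(target)
    return dest, copied
```

On the Pi, the sources named in this thread would be `~/umbrel/backup`, `~/umbrel/lnd/data/graph/mainnet` and `~/umbrel/lnd/lnd.conf` (expanded, e.g. with `Path.expanduser()`). The seed words still have to be kept separately.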

Once this is all done, let’s try to kill and reset the docker system. I think this is where it’s hanging.

Source ==> Umbrel Troubleshooting Guide

Some docker components fail to start

I can’t access umbrel.local in the browser or via the IP address. I SSH’d in and ran the debug script. The first suspect line is:

stat /var/lib/docker/overlay2/....... no such file or directory

How to fix this issue:

Just in case, re-flash the microSD card with the latest version of UmbrelOS (exactly the steps you followed the first time you installed your node, using the instructions from getumbrel.com).
If that still doesn’t help, run this command (over SSH on your node):

sudo systemctl stop umbrel-startup.service && docker system prune --force --all && sudo systemctl start umbrel-startup.service

Restart your node

sudo reboot

Optionally, another command to clear the docker containers is:

sudo docker kill $(sudo docker ps -q); sudo docker rm $(sudo docker ps -aq)

Then restart your node.

Let’s see how that works

Xypto · January 13, 2022, 5:52pm

Hello @Hakuna and thank you again for your much appreciated help.

~/umbrel/backup and ~/umbrel/lnd/data/graph/mainnet/ are empty, maybe because I had already reflashed the OS a few times before? As for ~/umbrel/lnd, I don’t remember, but I’ll confirm.

EDIT:
~/umbrel/backup doesn’t exist.
~/umbrel/lnd/data/graph/mainnet/ does have some files
lnd.conf exists in ~/umbrel/lnd

sudo systemctl stop umbrel-startup.service && docker system prune --force --all && sudo systemctl start umbrel-startup.service, and related commands involving systemctl or docker, don’t work, just like ~/umbrel/scripts/debug.

I tried running those commands with the SSD unplugged: they work, and around 1 GB of data was cleared. However, after shutting down the node, plugging the SSD back in and restarting the node, the result is the same.

Any idea? Is there a way to wipe the SSD without losing the sync and having to restore with the seed?

Xypto · March 6, 2022, 10:30am

Hi, thanks for your help @Hakuna.
I finally tried to format the SSD, but I couldn’t complete it; it seems it was dead for good. I had to start over with a brand-new SSD.

austenjt · March 7, 2022, 6:48pm

If there were a Kubernetes version of Umbrel, it would probably be more resilient (and self-healing), similar to how https://aerokube.com/moon/latest/#install-kubernetes does it.
