Lost connectivity to the device mpx.vmhba32 ERROR

So I had this wonderful error happen to me recently.

I am in the process of migrating the Management IPs from one network to another on all the hosts in the environment. It is part of a bigger plan to migrate everything, including the vCenter components, to a new IP range (I am not looking forward to that, especially since this is a 5.1 environment!).

I changed the Management IP on a host, updated DNS, and flushed the DNS caches on my machine and on the PRTG probes. The host came back up fine, but when it rejoined vCenter it showed this wonderful error message:

Lost connectivity to the device mpx.vmhba32:C0:T0:L0 backing the boot filesystem /vmfs/devices/disk/mpx.vmhba32:C0:T0:L0. As a result, host configuration changes will not be saved to persistent storage.

I was like what is this witchcraft! So after some digging around, it turned out to be an SD card issue.

The Dell R720s are set up with dual SD cards in a mirrored configuration for redundancy. When ESXi boots from the SD card, it is loaded into RAM and runs from there, like a RAM disk of sorts. This stops the SD card from getting hammered with I/O and prolongs the life of the card. Data is written back to the card periodically, and also when you make changes.

So what looks to have happened is that when I applied the Management IP change, ESXi tried to write it to the SD card and failed, but the change was still made in RAM, where ESXi is running from… hence the error! Note that the host can keep running in this state for as long as you need, but no changes made in the meantime will survive a reboot.

I logged into the iDRAC7 of the R720 and it showed that the SD cards were fine, both of them! So I was puzzled!

The first thing I did was run services.sh restart, which restarted all the services… and the error went away! Hooray… or so I thought!

I did some digging via esxcli to confirm it was an SD card issue. As you can see from the following screenshot, the altbootbank locations are marked in red, which means they are inaccessible, and the bootbank is pointing to /tmp. So from my point of view the issue had not been resolved.

[Screenshot: directory listing with the altbootbank entries in red and the bootbank pointing to /tmp]
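If you would rather not rely on spotting red text in a listing, the same check can be roughed out as a small script. This is only a sketch: the bank_status function name is my own invention, and I am assuming readlink is available in the shell you run it from. It reports "fallback" when a link points at /tmp, "inaccessible" when the link target cannot be reached, and "ok" otherwise.

```shell
#!/bin/sh
# Sketch only: bank_status is a made-up helper, not a VMware tool.
# Assumes /bootbank and /altbootbank are symlinks and that readlink exists.

bank_status() {
    link="$1"
    # Not a symlink at all (or not present).
    if [ ! -L "$link" ]; then
        echo "missing"
        return
    fi
    target=$(readlink "$link")
    case "$target" in
        # The bank has fallen back to RAM-backed /tmp.
        /tmp|/tmp/)
            echo "fallback"
            ;;
        *)
            # -e follows the link, so a dangling link fails this test.
            if [ -e "$link" ]; then
                echo "ok"
            else
                echo "inaccessible"
            fi
            ;;
    esac
}

for bank in /bootbank /altbootbank; do
    echo "$bank: $(bank_status "$bank")"
done
```

On a healthy host I would expect both lines to report "ok"; anything else suggests the boot device is not really back, whatever the alarm in vCenter says.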

So after that, the only other option was to shut the server down. I would like to point out that before I even made the Management IP changes, I had migrated all the VMs off the host, so it was empty. I like to play it cautious, especially in a production environment!

I shut it down, took out the SD card riser board and reseated the SD cards. Dell have even made a video showing how to do this!

Dell R720 SD card removal video

I then powered the host back on, and it came back up. At the DCUI I pressed ALT+F1 to get to the command line and ran ls -la, and the altbootbank was now pointing where it was supposed to! But the IP changes I had made obviously hadn’t stuck, so I redid them, saved them and rebooted the host. This time the changes survived the reboot, as they had now been saved to the SD card.

So all was good in the world!
