A few days ago, I visited a customer who has an old virtual environment: three IBM System x hosts, an IBM DS3524 storage array directly SAS-attached to the hosts, and vSphere 5.0 U2. His environment had a strange issue: every few minutes, Storage Manager threw an error saying "LUN x is owned by the non-preferred controller. The preferred controller is Controller A." He first thought it was a firmware issue in the DS3524 and asked IBM to upgrade the storage firmware. They upgraded it to version 7.86.49, but the problem was still there.
First, I checked the compatibility matrices of both VMware and IBM, and I was shocked to find that they differed, as the following screenshots show:
The VMware one is the following:
Actually, both HCLs left me confused. Personally, I preferred to follow the IBM HCL, especially since my manager suggested that VMware may only state the minimum supported firmware level (to be confirmed). So I proceeded with the existing firmware version, 7.86.x.
The error itself looked like an issue I had already read about in Cormac Hogan's article about Asymmetric Logical Unit Access (ALUA) storage here. The issue is called "path thrashing": each host tries to pull LUNs toward its preferred active controller while the other hosts pull them toward their own, and this happens with active/passive arrays.
The DS3524 is an ALUA array, in which one controller is considered "active preferred" and the other "active non-preferred" for any given LUN. With that error being thrown, there was surely a misconfiguration in how the ESXi hosts saw the LUNs, i.e. in the multipathing plugin configuration. I checked the storage section of the ESXi configuration again and was shocked to find that all three LUNs were discovered by all hosts as belonging to an active/passive array, not ALUA as indicated by VMware in the HCL and here. In addition, some hosts were using Controller A as "Active", while others were using Controller B.
Now that I knew I had to fix how the ESXi hosts discovered the array, I began to work my way through the ESXi NMP plugin and its components: the SATP and the PSP. The problem was that the ESXi hosts had the following SATP rule in their system:
VMW_SATP_LSI IBM ^1746* system tpgs_off VMW_PSP_MRU IBM DS3512/3524
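For reference, this is how the active SATP claim rules can be listed from the ESXi shell; the grep filter on the DS3500-series model string is just a convenience for narrowing the output:

```shell
# List all SATP claim rules known to the host,
# then filter for the IBM DS3512/3524 (model string 1746) entry
esxcli storage nmp satp rule list | grep -i 1746
```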
It drove me crazy: how come VMware stated that it should be VMW_SATP_ALUA, yet it was VMW_SATP_LSI?! That LSI plugin caused the hosts to discover the array as active/passive. I tried to add the following manual SATP rule:
VMW_SATP_ALUA IBM ^1746* user tpgs_on VMW_PSP_MRU IBM DS3512/3524
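In esxcli terms, adding a user-defined rule like the one above looks like this (the vendor, model, claim option, PSP, and description are taken from the rule itself):

```shell
# Add a user-defined SATP rule claiming DS3512/3524 LUNs for the ALUA plugin
esxcli storage nmp satp rule add \
    --satp VMW_SATP_ALUA \
    --vendor IBM \
    --model "^1746*" \
    --claim-option tpgs_on \
    --psp VMW_PSP_MRU \
    --description "IBM DS3512/3524"
```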
I reclaimed the LUNs again, but nothing changed. I then tried this SATP rule instead, again with no change:
VMW_SATP_ALUA device naa.6xxxxxxxxxxxxxxxxxxxxxxx VMW_PSP_MRU
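A per-device rule, plus the reclaim step, can be expressed with esxcli as sketched below; the naa identifier is the same masked placeholder as in the rule above, not a real device ID:

```shell
# Claim a single device by its NAA identifier instead of vendor/model
esxcli storage nmp satp rule add \
    --satp VMW_SATP_ALUA \
    --device naa.6xxxxxxxxxxxxxxxxxxxxxxx \
    --psp VMW_PSP_MRU

# Unclaim and reclaim the device so the new rule takes effect
esxcli storage core claiming reclaim -d naa.6xxxxxxxxxxxxxxxxxxxxxxx
```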
So what was the issue? It wasn't in the hosts alone. After some searching here and there, I discovered the following VMware Community thread, which indicated that there is another option in IBM Storage Manager itself called "Host Operating System". It controls how the storage presents itself to the hosts, according to the configured operating system. There are two VMware options: "VMware" and "VMwareTPGSALUA". To have the array present itself as ALUA storage, I should also have selected the "VMwareTPGSALUA" option, but the hosts defined on the storage array were configured with "VMware".
The following screenshots indicate what I mean:
The only thing remaining was to add the SATP rule again, put the hosts in Maintenance Mode, and reboot them. After the hosts came back up, all LUNs on each host were discovered using VMW_SATP_ALUA with VMW_PSP_RR.
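The end result can be verified from the ESXi shell; this command shows, per device, which SATP claimed it and which PSP is in use (after the fix, the DS3524 LUNs should report VMW_SATP_ALUA and VMW_PSP_RR):

```shell
# Show each device's Storage Array Type (SATP) and Path Selection Policy (PSP)
esxcli storage nmp device list
```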