Following a recent firmware upgrade on our Dell Equallogic SAN, I was horrified to see a massive spike in read IO activity, growing from an average of 2-3k IOPS right up to our SAN's maximum system limit of 10k.

The thing that struck me most was the imbalance between read and write traffic, with reads jumping to a massive 92% of the total. My first thought was that it was an issue with the firmware on either the Dell Equallogic or the Dell N4032 10GbE switches, and I logged a call with Dell; however, the read activity just didn’t add up.

Over the weekend I had maintenance time to take down all of our VMs for troubleshooting purposes. When I started a single application server, I could see peaks of almost 1k IOPS from that VM alone. Investigating further with Process Explorer, I could clearly see that the Health Service was by far the highest user of disk read activity. Measuring IO within the VM showed very clearly the impact the service was having, with IO dropping to below 50 once the service was disabled.
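
For anyone wanting to repeat the per-process measurement without Process Explorer, a rough sketch along these lines may help. It assumes the standard Windows `\Process(*)\IO Read Operations/sec` performance counter; the function name `Get-TopReadProcess` and the sample values are purely illustrative and were not part of the original troubleshooting.

```powershell
# Hypothetical helper: rank counter samples by read operations per second.
function Get-TopReadProcess {
    param(
        [Parameter(Mandatory)] [object[]] $Samples,
        [int] $Top = 5
    )
    $Samples |
        # Drop the aggregate and idle pseudo-instances.
        Where-Object { $_.InstanceName -notin '_total', 'idle' } |
        Sort-Object -Property CookedValue -Descending |
        Select-Object -First $Top -Property InstanceName, CookedValue
}

# On a live Windows VM the samples would come from Get-Counter, e.g.:
#   $samples = (Get-Counter '\Process(*)\IO Read Operations/sec').CounterSamples
# Made-up values are used here so the sketch runs anywhere.
$samples = @(
    [pscustomobject]@{ InstanceName = '_total';        CookedValue = 1010 }
    [pscustomobject]@{ InstanceName = 'healthservice'; CookedValue = 950 }
    [pscustomobject]@{ InstanceName = 'svchost';       CookedValue = 40 }
    [pscustomobject]@{ InstanceName = 'sqlservr';      CookedValue = 20 }
)
$top = Get-TopReadProcess -Samples $samples -Top 2
$top | Format-Table -AutoSize
```

Running it before and after stopping the service should make it obvious whether one process accounts for most of the reads.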

There were obviously two issues here.

1. Dell Equallogic / SAN HQ – Misreporting IO

Prior to the V7 firmware release, our Dell Equallogic appeared to be misreporting the level of IO activity in the SAN HQ software. In the screenshot below you can clearly see the levels before and after the firmware upgrade (the upgrade itself is shown by the dip to 0 IO).

 

[Screenshot: Traffic2]

2. SCOM Agent Cache

Clearing the SCOM agent cache had a dramatic impact on the level of disk activity: once the service was started again after the cache had been cleared, it had no noticeable impact on disk IO. At this point I had already disabled the Health Service by GPO for testing purposes, so I removed the GPO entry and ran the following PS script to rename the SCOM agent cache and restart the agent.

<#
.NOTES
===========================================================================
Created with:     SAPIEN Technologies, Inc., PowerShell Studio 2015 v4.2.81
Created on:       16/03/2015 10:30
Created by:       Maurice Daly
Filename:         ClearAgentCache.ps1
===========================================================================
.DESCRIPTION
Renames the SCOM agent cache and restarts the agent.
#>
 
Import-Module ActiveDirectory

# Replace YOUR-SERVER-NAMING* with a filter matching your own server names.
$Servers = Get-ADComputer -LDAPFilter "(name=YOUR-SERVER-NAMING*)" | Select-Object -ExpandProperty Name

foreach ($Server in $Servers)
{
    $HealthService = Get-Service -Name "HealthService" -ComputerName $Server

    # Disable and stop the agent so the cache folder is not in use.
    $HealthService | Set-Service -StartupType 'Disabled'
    $HealthService | Stop-Service -Force

    # -NewName takes the new folder name only, not a full path.
    Rename-Item -Path "\\$Server\C$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State" -NewName "Health Service State.old"

    # Re-enable and start the agent again.
    $HealthService | Set-Service -StartupType 'Automatic'
    $HealthService | Start-Service
}
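
One thing to be aware of if the script is ever run a second time: `Rename-Item` will fail if a `Health Service State.old` folder is already present. A possible workaround, sketched below, is to put a timestamp in the new name. The function `Get-CacheBackupName`, its parameters and the server name `SERVER01` are hypothetical; the install path is the one used in the script.

```powershell
# Hypothetical helper: build the UNC path of the agent cache and a
# timestamped name to rename it to, so repeat runs do not collide.
function Get-CacheBackupName {
    param(
        [Parameter(Mandatory)] [string] $Server,
        [datetime] $Stamp = (Get-Date)
    )
    $folder = 'Health Service State'
    [pscustomobject]@{
        Path    = "\\$Server\C$\Program Files\Microsoft Monitoring Agent\Agent\$folder"
        NewName = '{0}.{1:yyyyMMdd-HHmmss}.old' -f $folder, $Stamp
    }
}

# A fixed timestamp is passed here only to keep the example repeatable.
$target = Get-CacheBackupName -Server 'SERVER01' -Stamp ([datetime]'2015-03-16 10:30:00')
$target.NewName

# Inside the loop, the rename could then become (not run here):
#   Rename-Item -Path $target.Path -NewName $target.NewName
```

The old folders are never cleaned up by this approach, so they would need removing by hand once the agent is confirmed healthy.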

The result: SAN traffic levels are now back to normal 🙂

[Screenshot: Traffic]

Maurice Daly
Maurice has been working in the IT industry since 1999 and was awarded his first MVP Enterprise Mobility award in 2017. Technology focus includes Active Directory, Group Policy, Hyper-V, Windows Deployment (SCCM & MDT) and Office 365.

comments
  • Jason Jensen
    Posted at 16:38 September 29, 2015

    Thanks for this. I was having this exact issue and this completely resolved the issue.

    • modalyit
      Posted at 02:05 October 1, 2015

      No problem Jason, glad that it helped.

  • storagebuilder
    Posted at 13:31 November 6, 2015

    We are seeing the same issue but on a massive scale peaking at >100k read I/O. After cache flush, does this Read IO stay low or does it creep back up as the cache gets bigger and bigger?

    • modalyit
      Posted at 13:40 November 6, 2015

      Having initially cleared the cache in March of this year we have seen no IO creep coming back in from the SCOM agent, so far so good.
