Tuesday, March 11, 2014

svchost.exe goes wild on CPU

Another day at work and I notice that a few of the servers are going to need more CPU because they are using quite a bit.  I look at the graphs and notice something funny: the usage is high even when the application isn't busy, and one of the culprits is svchost.exe.  We get on the server to take a look and svchost.exe appears to be spiking every few seconds, but what would cause this on a web server?

Now svchost.exe under Windows hosts many services, so figuring out which one it was involved a few tools.  We ended up figuring out it was the eventlog service, but we couldn't figure out why it needed so much CPU.  We asked around and everyone denied it was their software.  After all, their stuff is on all the servers, so why should the problem exist only on ours?
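One way to see which services live inside each svchost.exe process is the built-in `tasklist /svc` command. The sketch below is not necessarily what we used at the time; it only shows how the CSV form of that output could be turned into a PID-to-services map. The function names and the sample rows are made up for illustration.

```python
import csv
import io


def services_by_svchost_pid(tasklist_csv):
    """Map each svchost.exe PID to the list of services it hosts.

    Expects text in the shape produced by `tasklist /svc /fo csv`,
    i.e. a header row of "Image Name","PID","Services".
    """
    result = {}
    for row in csv.DictReader(io.StringIO(tasklist_csv)):
        if row["Image Name"].lower() != "svchost.exe":
            continue
        services = [s.strip() for s in row["Services"].split(",") if s.strip()]
        result[int(row["PID"])] = services
    return result


def pids_hosting(pid_map, service):
    """Return the PIDs whose service list contains `service` (case-insensitive)."""
    wanted = service.lower()
    return sorted(pid for pid, names in pid_map.items()
                  if wanted in (n.lower() for n in names))


# Illustrative input only: these PIDs and service groupings are invented.
SAMPLE = '''"Image Name","PID","Services"
"svchost.exe","812","DcomLaunch,PlugPlay,Power"
"svchost.exe","1024","Dhcp,eventlog,lmhosts"
"w3wp.exe","4312","N/A"
'''

if __name__ == "__main__":
    pid_map = services_by_svchost_pid(SAMPLE)
    print(pids_hosting(pid_map, "eventlog"))
```

On a real server the input would come from running `tasklist /svc /fo csv` and capturing its output. Once you have the PID, you can match it against whatever tool is showing the CPU spikes.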

Looking further, the problem was not in any of the non-production environments so we had to troubleshoot this in production.  Always have to be careful about doing things in production.

We ran some Microsoft tools from Sysinternals (I think it was procmon) to get some detailed activity, and it showed a lot of reads on the security.evtx file but nothing that indicated what was so busy with it.

We created a case with Microsoft and sent over graphs and logs and all kinds of output.  Finally, we got on a conference call with Microsoft.  The guy assigned to the case tried to blame several things, but each one was shot down by the other evidence.  So he says ... event log corruption.  We take one of the servers and wipe the log file and the CPU settles down.  Interesting, but we still see tiny spikes.

We let the server rest over the weekend and the spikes got larger.  It appears the spikes depend on the size of the security.evtx file, so the log isn't corrupted; something is busy reading it.

Another clue showed up when we saw that svchost was page swapping at the same time the virus scanner was page swapping.  We talked to the virus scanner team and they said it couldn't be that, because the scanner is everywhere, even in QC.

Spent some time on the weekend shutting down all kinds of monitoring to see if that would help.  It did not.

We got tuned in to a setting that turns on process auditing in the event log.  With that setting disabled, the spikes did not occur.

Another clue was finding out that the version of the virus scanning product was different in QC and production.  We ended up upgrading the product from version 5 to 6 to see what would happen, and the spikes disappeared.

As I understand it, process auditing logs processes starting and stopping.  Each of those events is written to the security.evtx log file, and the virus scanner would read very large chunks of the event log over and over again because of the logged activity.  In the end, the problem was identified and fixed, and we didn't throw CPU and memory at it, but what a pain in the ass.
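To show why spikes like these would grow with the size of the log, here is a toy model in Python. It assumes, purely for illustration, that a reader rescans the whole log every time an event is appended, and compares that with a reader that remembers its position. It is not the scanner's actual behavior, and the event counts and sizes are invented.

```python
def bytes_read_full_rescan(event_sizes):
    """Total bytes read if the whole log is re-read after every new event."""
    total_read = 0
    log_size = 0
    for size in event_sizes:
        log_size += size        # the audit event is appended to the log
        total_read += log_size  # the reader scans the log from the start
    return total_read


def bytes_read_incremental(event_sizes):
    """Total bytes read if the reader keeps its offset and reads only new data."""
    return sum(event_sizes)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 audit events of 1,000 bytes each.
    events = [1000] * 1000
    print(bytes_read_full_rescan(events))   # 500500000
    print(bytes_read_incremental(events))   # 1000000
```

In this model the cost of each rescan is proportional to the current log size. That fits with the spikes being tiny right after the log was wiped and getting larger as it filled back up.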
