Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:
- Application in the session (i.e. Outlook and Word) stop accepting keyboard input. When the hang subsides, the typed characters show up all at once.
- If applications are moved, they do not redraw
- Cannot click the start menu
- If the user was running performance monitor (in an attempt to diagnose), there would be a gap in the perfmon data that directly correlated to the hang duration
Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.
Troubleshooting & Analysis: Where to start… to begin, given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicates the message pump for each of the affected applications is stuck waiting for something.
In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking a thread has work to do, or it doesn't. If there was work to do it would normally take the CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. So we pegged 1 out of the 4 @ 100% resulting in task manager displaying 25% CPU utilization for the Winlogon in question.
So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.
Resolution: Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Sever 2003 SP1 (applies to both SP1 and SP2)
From the KB cause section:
This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.
The fix applies to Windows Server 2003 with SP1 or SP2.
Windows Server 2003, CPU, Spike, Torubleshooting, Performance, Knowledgebase