The Case of the Stuck Cluster Service

In this particular scenario, the problem presented as follows: an active-passive, two-node Windows Server 2003 print cluster had failed over for no obvious reason. When we dug into what was going on, we found that Node 1 of the cluster had lost communication with the rest of the cluster, although it had not actually dropped its network connection. Looking at the cluster service itself, we found it stuck in a "Starting" state on Node 1. The problem with this state is that there is very little we can do with the service while it is in it. Even when we rebooted the server (with the service startup type still set to Automatic), the cluster service came up and hung in the "Starting" state again. Meanwhile, Node 2 of the cluster was chugging along happily, servicing print requests and behaving itself - but our high availability print environment was one bad spooler crash away from an administrator's worst nightmare ...
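As a rough illustration of how the stuck "Starting" state could be spotted from a script, here is a minimal Python sketch. It assumes English-language output from `sc query`, that the cluster service is registered under the name ClusSvc, and that the service control manager reports "Starting" as START_PENDING. The helper names and the sample output are made up for illustration and are not from the original case.

```python
import re
import subprocess

# Illustrative sample of what "sc query ClusSvc" prints while the service
# is stuck starting (assumed English output, not captured from the case).
SAMPLE_OUTPUT = """
SERVICE_NAME: ClusSvc
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
"""

# Matches the "STATE : <number> <NAME>" line of the sc query output.
STATE_RE = re.compile(r"^\s*STATE\s*:\s*(\d+)\s+(\w+)", re.MULTILINE)


def parse_service_state(sc_output):
    """Return the state name (e.g. 'START_PENDING'), or None if not found."""
    match = STATE_RE.search(sc_output)
    return match.group(2) if match else None


def query_service_state(service="ClusSvc"):
    """Run 'sc query' on the local machine and return the parsed state."""
    result = subprocess.run(
        ["sc", "query", service], capture_output=True, text=True
    )
    return parse_service_state(result.stdout)


def is_stuck_starting(samples):
    """True if there is at least one sample and all are START_PENDING."""
    return len(samples) > 0 and all(s == "START_PENDING" for s in samples)
```

One way to use it would be to call `query_service_state()` every 30 seconds or so for a few minutes and pass the collected states to `is_stuck_starting()`. A service that is merely slow to start will eventually report RUNNING, whereas the one in this case never left the pending state.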

Since we knew that this was a print cluster, the first thought was, "Are there third party port monitors in play here?" If so, what might be going on is this. The port monitors run checks against the printers and update the registry information for each printer. That information is located in the cluster hive, and changes to this registry location are replicated up to the quorum. If a couple of hundred printers are being handled by this port monitor, the server can get fairly busy. When a node tries to join the cluster, it is given a sequence number. The problem is that the server may be so busy that it does not respond in time to the node that handed out the sequence number. When this happens, the node that has already formed the cluster sends out another sequence number to the node that is trying to join - and the endless loop continues. On the joining node (Node 1 in our example), the Cluster Service just sits in a starting state.

So, what's the fix? There are two possible ways to address this: take the spooler resource offline until the cluster node has successfully joined the cluster, or remove the third party port monitor. Obviously, each approach has a problem: the first means downtime (taking the spooler resource offline), and the second means the administrative overhead required to clean up the third party port monitors. Now, here's the rub [...]
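To make the join loop described above a little more concrete, here is a toy Python model. It is not the actual cluster join protocol: the timeout, the response times, and the function name are all hypothetical, and it captures only one idea, namely that each late response causes a fresh sequence number to be issued and the join to be retried.

```python
def attempts_until_joined(response_times_ms, timeout_ms):
    """Toy model of the join loop (not the real cluster protocol).

    Each entry in response_times_ms is how long one join attempt takes to
    be answered. A late answer means a new sequence number is issued and
    the join is retried. Returns the 1-based attempt that succeeds, or
    None if every attempt in the list times out.
    """
    for attempt, response_ms in enumerate(response_times_ms, start=1):
        if response_ms <= timeout_ms:
            return attempt
    return None


# Hypothetical numbers: the port monitor keeps the server busy throughout.
busy_server = [900, 1200, 1100, 950]
# Hypothetical numbers: the spooler resource goes offline before attempt 3.
spooler_taken_offline = [900, 1200, 150]

print(attempts_until_joined(busy_server, timeout_ms=500))
print(attempts_until_joined(spooler_taken_offline, timeout_ms=500))
```

In this model the first call never succeeds, because the load never drops, which corresponds to the service sitting in a starting state. The second call succeeds as soon as the load goes away, which is why taking the spooler resource offline lets the node join.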

Windows Server 2003, Printing, Cluster, Troubleshooting, GPO