NSClient++ Help (#1) - NSCA hangs forever while sending data to server (#404) - Message List
...and by "forever," I actually mean forever. It stops sending results even though the agent process continues to run.
We have an installation of the most recent version of NSClient++ (0.3.6 stable) running on a few dozen servers, and most of them work fine, including some 64-bit and some 2008 servers. However, there is one in particular where the agent stops reporting for no apparent reason after only a day or so. The monitoring server is watching for passive checks only, and so it thinks the server is down even though it's running just fine. This is obviously not ideal.
Here's an excerpt from the log on the server with the agent:
2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server...
2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up [HOST] to [IP]
2009-06-17 06:06:25: error:modules\DebugLogMetrics\PDHCollector.cpp:216: Failed to query performance counters: PdhCollectQueryData failed: : -2147481643: No data to return.
2009-06-17 11:28:26: debug:NSClient++.cpp:753: No shared session: ignoring change event!
(host/IP info hidden for security, but it does resolve correctly)
The second line ("Looked up ...") is the last real NSCA activity in the log. The next line about performance counters is repeated very often throughout the log, both before and after the agent stops sending results. This tells me that the service is still running (the Services management console shows the same thing). The last line shows up only rarely, every so often after the agent stops sending results. From that point on it never executes any more checks and never tries to send any results.
I looked at the NSCAThread.cpp code for any help, and nothing jumped out at me. I'm unfamiliar with the socket code. Is there any way it could be blocking somehow, attempting a connection with no timeout value in such a way that it never stops trying? Any other possible lock/hang points?
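To illustrate the question about a connection with no timeout value, here is a minimal Python sketch (not NSClient++ code, which is C++). It assumes a plain TCP client that connects and then waits for the server to speak first. The name `read_greeting` and its parameters are hypothetical. With a timeout set, a silent peer produces an error the caller can handle instead of a call that blocks indefinitely.

```python
import socket


def read_greeting(host, port, nbytes, timeout_s=5.0):
    """Connect and read up to nbytes, waiting at most timeout_s per socket call.

    Returns the bytes received, or None if the peer sent nothing in time.
    (Hypothetical helper for illustration only.)
    """
    # create_connection applies the timeout to connect() and leaves it
    # set on the returned socket, so the later recv() is bounded as well.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            # Without a timeout, this recv() would block until the peer
            # sends data or the connection is torn down.
            return None
```

A caller that gets `None` back can log the failure and retry on the next interval, which is the behaviour the hanging agent never reaches.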
This was a 0.3.6 version just upgraded to 0.3.7; I am testing it for further errors.
karan 09/13/09 11:22:08 (5 years ago)
Which build number, please? (It has to be a 0.3.7 version as well.)
Michael Medin
mickem 09/13/09 10:33:56 (5 years ago)
This is the output from one of the servers:
2009-09-13 13:40:17: debug:NSClient++.cpp:1034: Injecting: alias_up:
2009-09-13 13:40:17: debug:NSClient++.cpp:1034: Injecting: check_uptime:
2009-09-13 13:40:17: debug:NSClient++.cpp:1070: Injected Result: OK '\\SERVER has been up for: 5 day(s), 2 hour(s), 41 minute(s), 32 second(s)'
2009-09-13 13:40:18: debug:NSClient++.cpp:1071: Injected Performance Result: ''
2009-09-13 13:40:18: debug:NSClient++.cpp:1070: Injected Result: OK '\\SERVER has been up for: 5 day(s), 2 hour(s), 41 minute(s), 32 second(s)'
2009-09-13 13:40:18: debug:NSClient++.cpp:1071: Injected Performance Result: ''
2009-09-13 13:40:18: debug:modules\NSCAAgent\NSCAThread.cpp:189: Executing (from NSCA):
2009-09-13 13:40:19: debug:NSClient++.cpp:1034: Injecting: check_ok:
2009-09-13 13:40:19: debug:NSClient++.cpp:1034: Injecting: CheckOK: Everything, is, fine!
2009-09-13 13:40:19: debug:NSClient++.cpp:1070: Injected Result: OK 'Everythingisfine!'
2009-09-13 13:40:19: debug:NSClient++.cpp:1071: Injected Performance Result: ''
2009-09-13 13:40:19: debug:NSClient++.cpp:1070: Injected Result: OK 'Everythingisfine!'
2009-09-13 13:40:20: debug:NSClient++.cpp:1071: Injected Performance Result: ''
2009-09-13 13:40:20: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server...
2009-09-13 13:40:20: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up 220.127.116.11 to 18.104.22.168
2009-09-13 15:32:35: debug:NSClient++.cpp:753: No shared session: ignoring change event!
karan 09/13/09 09:37:48 (5 years ago)
It should have been resolved in the nightly build (0.3.7).
Note that the last few builds contain some experimental packet length code, so you might want to be a bit careful before upgrading...
Michael Medin
mickem 09/13/09 09:24:45 (5 years ago)
Is there any current solution for the problem described in this thread? I also have around 30 servers hooked up to Nagios with NSCA passive checks; active checks are disabled.
Earlier, I had a lot of servers being reported as offline when they were actually online. I traced it down to a timing issue: NSCA accepts results from a server whose clock is behind, but as soon as the server's clock is ahead, it drops the connection with a "Future timestamp" error.
Patching NSCA seemed to have solved that problem.
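The timestamp rule described above can be sketched as follows. This is a Python illustration of the behaviour as reported in this post, not code taken from NSCA. The function name, its parameters, and the return labels are hypothetical; `max_packet_age` is modeled on the NSCA daemon setting of the same name and is assumed to be in seconds, with 0 meaning no age limit.

```python
def classify_packet(packet_time, server_time, max_packet_age):
    """Classify a result by its timestamp (all values in seconds).

    Hypothetical helper: returns 'future', 'stale', or 'ok'.
    """
    # A sender whose clock runs ahead of the receiver is rejected outright,
    # no matter how small the difference is.
    if packet_time > server_time:
        return "future"
    # A sender whose clock runs behind is tolerated up to max_packet_age.
    if max_packet_age > 0 and server_time - packet_time > max_packet_age:
        return "stale"
    return "ok"
```

Under this rule a client clock that is a few seconds fast loses every result, while one that is a few seconds slow loses none, which matches the asymmetry reported here.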
But 2-3 servers still have the problem described in this thread.
NSClient++ just stops sending data back to the Nagios server, and the freshness check then reports the server as offline. All I can do is restart the NSClient++ service, which takes quite some time to shut down, around 40 seconds to 1 minute. Starting the service takes only around 2-5 seconds.
On servers where the service is healthy, shutting it down takes only around 4-5 seconds.
Anyway, restarting the service does seem to help temporarily, but a permanent solution would be appreciated to prevent false server-down alerts.
I have just enabled debug logging on those 2-3 servers and will post once the problem occurs again, which should be in a couple of hours.
karan 09/12/09 06:16:50 (5 years ago)
Yes, that would be the one...
Michael Medin
mickem 07/23/09 08:34:18 (5 years ago)
Is it this one:
2009-07-20 15:29:54: debug:modules\NSCAAgent\NSCAThread.cpp:246: Sending to server...
2009-07-20 15:29:54: debug:modules\NSCAAgent\NSCAThread.cpp:253: Looked up x.x.x.x to x.x.x.x
2009-07-20 15:30:03: error:modules\NSCAAgent\NSCAThread.cpp:272: Could not read NSCA hdr packet from socket :recv returned SOCKET_ERROR: 10054: An existing connection was forcibly closed by the remote host.
2009-07-20 15:34:54: debug:modules\NSCAAgent\NSCAThread.cpp:190: Executing (from NSCA): cpucheck
mikkova 07/23/09 07:44:08 (5 years ago)
Yes the copy works.
And the issue is related to a disabled (?) Windows firewall. The new installer features a Windows firewall "add exception thingy", but it is bleeding edge and so not 100% ironed out.
On the servers "which work", check the nsc.log file for any NSCA-related errors (I think the problem should now manifest itself as an error).
Michael Medin
mickem 07/21/09 09:58:40 (5 years ago)
I have been running the nightly build for 24 hours without problems. I have now installed it on 10 more servers at a different site.
Btw, I had problems in this other domain with the 0.3.7 msi installer; almost all servers failed with the error message: 1: Failed to install firewall exception: get_LocalPolicy failed: -2147023143: There are no more endpoints available from the endpoint mapper
I did a 0.3.6 install and copied the nightly from the .zip over it; that works ok.
mikkova 07/21/09 09:46:34 (5 years ago)
I also have installed it on our problem server, and I'll report back about whether it fails again or seems to be stable. Thanks!
jrowberg 07/20/09 19:14:41 (5 years ago)
I installed it on 8 servers, and will keep you informed.
Mikko
mikkova 07/20/09 12:29:07 (5 years ago)
Check now, hopefully fixed in the latest nightly build...
Michael Medin
mickem 07/19/09 16:21:23 (5 years ago)
If this is what I think it is, it will be present in all versions of NSClient++ and may affect other parts as well (for instance the NRPE parts).
I shall see if I can do a workaround for this in the next version (it will be out after the weekend as a nightly), but for the 0.4.x branch there will be a new socket subsystem which I hope solves this issue permanently...
Michael Medin
mickem 07/17/09 19:11:21 (5 years ago)
Yes, I have tried two older versions, same problem.
mikkova 07/17/09 13:50:54 (5 years ago)
I have about 20 servers in my setup right now, and all of them stop sending passive checks at some point. All use the 0.3.6 NSClient++ service. Is this problem also present in older versions?
Leon
lblokland 07/17/09 13:08:35 (5 years ago)
Yes, I can do that with a few servers.
mikkova 07/15/09 13:52:02 (5 years ago)
After browsing the code, I think the problem is in reading the input packet: the code will (I think) read and read and read until done, so if it never gets done it will never finish.
But it is just a theory, so I would need to verify it (and hopefully fix it).
MickeM
mickem 07/15/09 10:29:33 (5 years ago)
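The theory in the post above, a read loop that keeps reading until the whole packet has arrived, can be bounded with an overall deadline. The following is a minimal Python sketch of that idea, not the NSClient++ implementation or its eventual fix. The helper name `recv_exact` and its parameters are hypothetical.

```python
import socket
import time


def recv_exact(sock, nbytes, deadline_s):
    """Read exactly nbytes from sock, or raise once deadline_s seconds pass.

    Hypothetical helper: the deadline covers the whole packet, not each
    individual recv() call.
    """
    end = time.monotonic() + deadline_s
    chunks = []
    remaining = nbytes
    while remaining > 0:
        left = end - time.monotonic()
        if left <= 0:
            raise TimeoutError(
                "only %d of %d bytes arrived" % (nbytes - remaining, nbytes))
        # Bound this recv() by whatever time is left before the deadline.
        sock.settimeout(left)
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            continue  # the deadline check at the top of the loop raises
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A single deadline is used rather than a per-call timeout so that a peer sending one byte at a time cannot keep the loop running indefinitely.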
If you are interested, I could hook you up with a build which logs more, and we can see if you can help me track down the problem...
MickeM
mickem 07/15/09 09:40:16 (5 years ago)
Last lines are always:
2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:182: Sending to server...
2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:189: Looked up xxx.xxx.xxx.xxx to xxx.xxx.xxx.xxx
And nothing after that. Checks may work for only 2 hours or they may work for a week, but eventually all servers stop sending. I also tried with servers in the same network as the Nagios server, with the same result.
My actively checked servers work 100% with NSClient++. I have 0.3.6 clients on win2003 32-bit servers. It was the same with 0.3.5.
mikkova 07/15/09 08:14:18 (5 years ago)
What does the debug log say?
MickeM
mickem 07/14/09 19:13:35 (5 years ago)
I have the same problem here: http://nsclient.org/nscp/discussion/topic/357
I think the error message appears when you log in to the server.
Restarting the nsclientpp service resolves the problem temporarily. I tried nc_net with my setup and experienced the same kind of problems, so I'm not sure whether this is an NSClient++ problem.
mikkova 07/14/09 14:29:04 (5 years ago)