NSCA hangs forever while sending data to server

...and by "forever," I actually mean forever. It stops sending results even though the agent process continues to run.

We have an installation of the most recent version of NSClient++ (0.3.6 stable) running on a few dozen servers, and most of them work fine, including some 64-bit and some 2008 servers. However, there is one in particular where the agent stops reporting for no apparent reason after only a day or so. The monitoring server is watching for passive checks only, and so it thinks the server is down even though it's running just fine. This is obviously not ideal.

Here's an excerpt from the log on the server with the agent:

2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server...
2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up [HOST] to [IP]
2009-06-17 06:06:25: error:modules\DebugLogMetrics\PDHCollector.cpp:216: Failed to query performance counters: PdhCollectQueryData failed: : -2147481643: No data to return.
2009-06-17 11:28:26: debug:NSClient++.cpp:753: No shared session: ignoring change event!

(host/IP info hidden for security, but it does resolve correctly)

The second line ("Looked up ...") is the last real NSCA activity in the log. The next line, about performance counters, is repeated very often throughout the log, both before and after the agent stops sending results. Together with the Services management console, which also shows the service running, this tells me the service is still alive. The last line shows up only occasionally after the agent stops sending results. From that point on it never executes any more checks and never tries to send any results.

I looked at the NSCAThread.cpp code for any help, and nothing jumped out at me. I'm unfamiliar with the socket code. Is there any way it could be blocking somehow, attempting a connection with no timeout value in such a way that it never stops trying? Any other possible lock/hang points?
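
For reference, the kind of hang asked about here is what a blocking socket does when no timeout is set: connect or recv simply waits until the peer acts. The sketch below is in Python rather than the agent's C++ and is not taken from NSCAThread.cpp; `read_with_timeout` and `start_silent_server` are hypothetical names used only for illustration. It shows a read that gives up when the peer accepts the connection but never sends anything.

```python
import socket
import threading


def read_with_timeout(host, port, nbytes, timeout_s):
    """Connect and read up to nbytes, giving up after timeout_s seconds per operation."""
    # create_connection applies the timeout to connect() and leaves it set on the socket,
    # so the recv() below is bounded as well.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            return None  # the peer accepted the connection but sent nothing in time


def start_silent_server():
    """Hypothetical test peer: accepts one connection and never sends anything."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    held = []

    def serve():
        conn, _ = srv.accept()
        held.append(conn)  # keep the connection open but silent

    threading.Thread(target=serve, daemon=True).start()
    return srv, held
```

Without the `timeout` argument, the same `recv` call would wait indefinitely on a silent peer, which would match the symptom of a log that ends right after the lookup line.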

Help!

Jeff

  • Message #1428

    This was a 0.3.6 version, just upgraded to 0.3.7. I'm testing it for further errors.

  • Message #1427

    Which version?

    (Build number, please; it has to be a 0.3.7 version as well.)

    Michael Medin

  • Message #1426

    This is the output from one of the servers:

    2009-09-13 13:40:17: debug:NSClient++.cpp:1034: Injecting: alias_up:
    2009-09-13 13:40:17: debug:NSClient++.cpp:1034: Injecting: check_uptime:
    2009-09-13 13:40:17: debug:NSClient++.cpp:1070: Injected Result: OK '\\SERVER has been up for: 5 day(s), 2 hour(s), 41 minute(s), 32 second(s)'
    2009-09-13 13:40:18: debug:NSClient++.cpp:1071: Injected Performance Result: ''
    2009-09-13 13:40:18: debug:NSClient++.cpp:1070: Injected Result: OK '\\SERVER has been up for: 5 day(s), 2 hour(s), 41 minute(s), 32 second(s)'
    2009-09-13 13:40:18: debug:NSClient++.cpp:1071: Injected Performance Result: ''
    2009-09-13 13:40:18: debug:modules\NSCAAgent\NSCAThread.cpp:189: Executing (from NSCA):
    2009-09-13 13:40:19: debug:NSClient++.cpp:1034: Injecting: check_ok:
    2009-09-13 13:40:19: debug:NSClient++.cpp:1034: Injecting: CheckOK: Everything, is, fine!
    2009-09-13 13:40:19: debug:NSClient++.cpp:1070: Injected Result: OK 'Everythingisfine!'
    2009-09-13 13:40:19: debug:NSClient++.cpp:1071: Injected Performance Result: ''
    2009-09-13 13:40:19: debug:NSClient++.cpp:1070: Injected Result: OK 'Everythingisfine!'
    2009-09-13 13:40:20: debug:NSClient++.cpp:1071: Injected Performance Result: ''
    2009-09-13 13:40:20: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server...
    2009-09-13 13:40:20: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up 111.111.111.111 to 111.111.111.111
    2009-09-13 15:32:35: debug:NSClient++.cpp:753: No shared session: ignoring change event!
    
  • Message #1425

    It should have been resolved in the nightly build (0.3.7).

    Note that the last few builds contain some experimental packet length code, so you might want to be a bit careful before upgrading...

    Michael Medin

  • Message #1423

    Is there a solution yet for the problem described in this thread? I also have around 30 servers hooked up to Nagios, with NSCA passive checks only and active checks disabled.

    Earlier, I had a lot of servers reported as offline when they were actually online. I traced it to a timing issue: the NSCA protocol tolerates a clock that is behind, but as soon as your server's clock is ahead, it drops the connection with "Future timestamp".

    Patching NSCA seems to have solved that problem.

    But 2-3 servers have the same problem described in this thread.

    NSClient just stops sending data back to the Nagios server, and the freshness check reports the server as offline. All I can do is restart the NSClient service, which takes quite some time to shut down, around 40 seconds to 1 minute. Starting the service takes only around 2-5 seconds.

    On the other servers, where the service is working properly, shutting down the NSClient service takes only around 4-5 seconds as well.

    Anyway, restarting the service does seem to help temporarily, but a permanent solution would be appreciated to prevent false "server down" reports.

    I have just enabled debug logging on those 2-3 servers and will post once the problem occurs again, which should be in a couple of hours.
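
The clock behaviour described in this message can be sketched as a simple check on the receiving side. This is a Python illustration of the behaviour as reported here, not NSCA's actual source; `check_packet_time` and its parameter names are assumptions made for the example.

```python
def check_packet_time(packet_time, server_time, max_packet_age):
    """Return None if the packet timestamp is acceptable, else a reason string.

    All times are in seconds since the epoch. A max_packet_age of 0 is taken
    here to mean "no age limit" (an assumption of this sketch).
    """
    if packet_time > server_time:
        # Sender's clock is ahead of the receiver's: rejected outright.
        return "future timestamp"
    if max_packet_age > 0 and server_time - packet_time > max_packet_age:
        # Sender's clock is behind by more than the allowed window.
        return "stale timestamp"
    return None  # behind by a tolerable amount, or exactly in sync
```

Under these assumptions a sender that is a few seconds behind is accepted, while one that is even a second ahead is dropped, which matches the asymmetry described above. Keeping the clocks synchronized avoids both cases.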

  • Message #1317

    Yes, that would be the one...

    Michael Medin

  • Message #1315

    Is it this one:

    2009-07-20 15:29:54: debug:modules\NSCAAgent\NSCAThread.cpp:246: Sending to server...
    2009-07-20 15:29:54: debug:modules\NSCAAgent\NSCAThread.cpp:253: Looked up x.x.x.x to x.x.x.x
    2009-07-20 15:30:03: error:modules\NSCAAgent\NSCAThread.cpp:272: Could not read NSCA hdr packet from socket :recv returned SOCKET_ERROR: 10054: An existing connection was forcibly closed by the remote host.
    2009-07-20 15:34:54: debug:modules\NSCAAgent\NSCAThread.cpp:190: Executing (from NSCA): cpucheck
    
  • Message #1308

    Yes the copy works.

    And the issue is related to a disabled (?) Windows firewall. The new installer features a Windows firewall "add exception thingy", but it is bleeding edge, so not 100% ironed out.

    One thing!

    On the servers "which work", check the nsc.log file for any NSCA-related errors (I think the problem should now manifest itself as an error).

    Michael Medin

  • Message #1307

    I have been running the nightly build for 24 hours without problems. I have now installed it on 10 more servers at a different site.

    Btw, I had problems in this other domain with the 0.3.7 MSI installer; almost all servers failed with the error message: 1: Failed to install firewall exception: get_LocalPolicy failed: -2147023143: There are no more endpoints available from the endpoint mapper

    I did a 0.3.6 install and copied the nightly build from the .zip over it; that works OK.

  • Message #1305

    I also have installed it on our problem server, and I'll report back about whether it fails again or seems to be stable. Thanks!

  • Message #1299

    Thanks.

    I installed it on 8 servers, and will keep you informed.

    Mikko

  • Message #1296

    Check now, hopefully fixed in the latest nightly build...

    Michael Medin

  • Message #1290

    If this is what I think it is, it will be present in all versions of NSClient++ and may affect other parts as well (for instance the NRPE parts).

    I shall see if I can do a workaround for this in the next version (it will be out after the weekend as a nightly build), but for the 0.4.x branch there will be a new socket subsystem, which I hope solves this issue permanently...

    Michael Medin

  • Message #1287

    Yes, I have tried two older versions, same problem.

  • Message #1285

    Hi,

    I have about 20 servers in my setup right now, and all of them stop sending passive checks at some point. All use the 0.3.6 NSClient++ service. Is this problem also present in older versions?

    Thx,

    Leon

  • Message #1280

    Yes, I can do that with a few servers.

  • Message #1279

    After browsing the code, I think the problem is in reading the input packet: it will (I think) read and read and read until done, so if it never gets done, it will never finish.

    But it is just a theory, so I would need to verify it (and hopefully fix it).

    MickeM
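
The theory above (a read loop that only ends once the whole packet has arrived) can be illustrated with a short Python sketch. It is not the NSClient++ code, and `recv_exact` is a hypothetical helper. It assumes a fixed-size packet and shows two exits such a loop needs: an overall deadline, and a check for the peer closing the connection.

```python
import socket
import time


def recv_exact(sock, nbytes, deadline_s):
    """Read exactly nbytes from sock, or raise instead of looping forever."""
    deadline = time.monotonic() + deadline_s
    chunks = []
    remaining = nbytes
    while remaining > 0:
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError("peer did not send the full packet in time")
        sock.settimeout(left)  # bound each recv by the time still available
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            raise TimeoutError("peer did not send the full packet in time") from None
        if not chunk:
            # recv returning no data means the peer closed the connection;
            # without this check the loop would spin without making progress.
            raise ConnectionError("peer closed the connection mid-packet")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A loop without the deadline waits forever on a peer that sends only part of the packet, and a loop without the empty-read check never ends once the peer has gone away.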

  • Message #1278

    If you are interested, I could hook you up with a build which logs more, and we can see if you can help me track down the problem...

    MickeM

  • Message #1277

    Last lines are always:

    2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:182: Sending to server...
    2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:189: Looked up xxx.xxx.xxx.xxx to xxx.xxx.xxx.xxx
    

    And nothing after that. Checks may work for only 2 hours or for a week, but eventually all servers stop sending. I also tried with servers in the same network as the Nagios server, with the same result.

    My active servers work 100% with NSClient++. I have 0.3.6 clients on Win2003 32-bit servers. It was the same with 0.3.5.

  • Message #1276

    What does the debug log say?

    MickeM

  • Message #1274

    I have the same problem here: http://nsclient.org/nscp/discussion/topic/357

    I think the error message comes when you log in to the server.

    Restarting the nsclientpp service resolves the problem temporarily. I tried nc_net with my setup and experienced the same kind of problems, so I'm not sure whether this is an NSClient++ problem.
