Using the built-in crash detection/protection

About this guide

This guide describes how to use the crash detection and submission features built into NSClient++. It is a step-by-step guide to first deciding which strategy to use, then configuring it, and finally setting up monitoring of the health of NSClient++.

Introduction

Starting with version 0.3.9 (not yet released, but available in the latest nightly builds), NSClient++ features built-in Google Breakpad support. This means that if NSClient++ were to crash, the crash will be detected and an action can be performed to help keep your monitoring infrastructure running, as well as report the crash and help me figure out what's broken (so I can repair it).

The Breakpad "protection" can perform any one or more of three actions upon a crash:

  1. Archive the crash dump for later submission (and health monitoring).
  2. Submit the crash dump to crash.nsclient.org for analysis by me.
  3. Restart the NSClient++ agent.

Generally you want to do all three of these actions. The general reasons are:

  1. The archive makes the client aware that it has crashed so the built-in command (check_nscp) will report the crash and make you aware that it has happened.
  2. The submission makes me aware of the problem and makes it possible for me to fix bugs and improve NSClient++.
  3. Restarting NSClient++ will keep your monitoring infrastructure available even if it has issues.

The main reasons for not doing this are:

  1. Archiving can fail and thus prevent NSClient++ from restarting (or perhaps cause more havoc if the crash was caused by a lack of resources).
  2. Submission will not work behind firewalls, and sending somewhat sensitive data (generally a dump contains very little data) is not for everyone.
  3. Restarting a crashing server is sometimes a very bad idea, as it can wreak havoc on your system by eating up all resources.

I would advise you to think about which strategy is the right one for you, but for me the high-impact points are the sensitivity and firewall issues related to submissions. These can be circumvented by sending the archived files manually from a machine that can access the Internet.

Configuring crash detection and actions

To get you started I have divided this guide into three sections, but nothing keeps you from using all three of these methods (and I recommend doing so when possible).

Archiving files

The simplest way to protect yourself is to make yourself aware of the fact that NSClient++ has crashed. This is done using both the archive feature and the [check_nscp] command. In addition, archiving crash dump files makes it possible to submit them by hand afterwards if you have an issue you want me to help you resolve. Submitting the crash dumps is fairly simple: you can either use the reporter tool or just send them to me via email.

Archiving is the default action and is enabled by default.

[crash]
; Archive crash dump files if a crash is detected
;archive=1

; Submit crash reports to a crash report server (this overrides archive)
;submit=0

; Restart service if a crash is detected
;restart=1

To disable archiving you use the following instead:

[crash]
; Archive crash dump files if a crash is detected
archive=0
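
As a rough illustration of how the three switches combine, the following Python sketch reads a [crash] section and falls back to the values from the commented-out sample when a key is missing. The function name crash_actions is hypothetical, and the fallback values are an assumption taken from that sample; this is not how NSClient++ itself parses its configuration.

```python
import configparser

# Assumed defaults, taken from the commented-out sample in this guide.
ASSUMED_DEFAULTS = {"archive": True, "submit": False, "restart": True}


def crash_actions(ini_text):
    """Return which crash actions an ini fragment enables (illustrative only)."""
    # allow_no_value lets bare entries such as "CheckNSCP.dll" parse cleanly;
    # lines starting with ";" are treated as comments by default.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(ini_text)
    actions = dict(ASSUMED_DEFAULTS)
    if parser.has_section("crash"):
        for key in actions:
            if parser.has_option("crash", key):
                # "1" and "0" are accepted boolean spellings in configparser.
                actions[key] = parser.getboolean("crash", key)
    return actions


if __name__ == "__main__":
    print(crash_actions("[crash]\narchive=0\n"))
```

Running it on the fragment above shows archiving switched off while the other two actions keep their assumed defaults.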

When NSClient++ crashes, two files are created:

  1. <GUID>.dmp is the actual dumped data (stack traces and variable contents)
  2. <GUID>.dmp.txt is a description file with some meta information about NSClient++
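
If you want to inspect an archive folder by hand, the file naming above can be checked with a few lines of Python. This is only a sketch: find_crash_dumps is a hypothetical helper, and the GUID used in the demonstration is made up.

```python
import tempfile
from pathlib import Path


def find_crash_dumps(folder):
    """Map each dump's GUID to whether its .dmp.txt description exists."""
    result = {}
    for dump in sorted(Path(folder).glob("*.dmp")):
        # Strip the ".dmp" suffix to recover the GUID part of the name.
        guid = dump.name[: -len(".dmp")]
        description = dump.with_name(dump.name + ".txt")
        result[guid] = description.is_file()
    return result


if __name__ == "__main__":
    # Demonstrate on a throwaway folder with a made-up GUID.
    with tempfile.TemporaryDirectory() as tmp:
        guid = "00000000-0000-0000-0000-000000000000"
        Path(tmp, guid + ".dmp").write_bytes(b"")
        Path(tmp, guid + ".dmp.txt").write_text("meta")
        print(find_crash_dumps(tmp))
```

A dump without its description file shows up with a value of False, which is worth knowing before you send the pair to someone else.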

To be able to check the health of NSClient++ you need to enable the [CheckNSCP.dll] module, which is done like so:

[modules]
; ...
CheckNSCP.dll
; ...

The module serves two purposes: the first is to collect all error messages reported inside NSClient++, and the second is to check the crash dump folder for crash dumps. After enabling the CheckNSCP module, start NSClient++ in "test" mode by running the following command:

nsclient++ /test

Now go to your Nagios box and run the check_nscp command using the check_nrpe command.

check_nrpe ... -c check_nscp
OK: 0 crash(es), 0 error(s)

Hopefully you now see the "OK: ..." response above, which means everything is fine. If not, you already have a problem which you need to resolve. To simulate a problem we can force the client to crash while it is in "test" mode using the assert command. So go back to your "nsclient++ /test" command window and type the following:

...
assert

The result should be that NSClient++ crashes and produces a dump file. Note that if "restart" is enabled (the default), you need to stop NSClient++ before continuing. So first stop NSClient++, then start it again in "test" mode.

nsclient++ -stop
nsclient++ -test
...

Now go back to the Nagios machine and run the check_nscp command again. This time you should see an error indicating that NSClient++ has crashed.

check_nrpe ... -c check_nscp
ERROR: 1 crash(es), last crash: fb472415-34e8-434d-9b48-f3929a834a87.dmp.txt
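
If you want to post-process these responses in a script, a small parser is enough. The Python sketch below is based only on the two sample responses in this guide, so the pattern is an assumption about the output format, and parse_check_nscp is a hypothetical helper name.

```python
import re

# Pattern inferred from the two sample responses in this guide only.
_STATUS = re.compile(r"^(?P<status>[A-Z]+): (?P<crashes>\d+) crash\(es\)")


def parse_check_nscp(line):
    """Return (status, crash_count) from a check_nscp response line."""
    match = _STATUS.match(line.strip())
    if match is None:
        raise ValueError("unrecognised check_nscp response: %r" % line)
    return match.group("status"), int(match.group("crashes"))


if __name__ == "__main__":
    print(parse_check_nscp("OK: 0 crash(es), 0 error(s)"))
    print(parse_check_nscp(
        "ERROR: 1 crash(es), last crash: "
        "fb472415-34e8-434d-9b48-f3929a834a87.dmp.txt"))
```

Anything that does not match the assumed format raises an error instead of being silently treated as healthy.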

The next step is to remove the crash dump and then start the service normally. The crash dumps are stored under %LOCALAPPDATA%\NSClient++\crash dumps. Open that folder and see if you can find the dump mentioned above. If you remove the file and run the check_nscp command again, the result should return to OK.

It is important to understand that crash dumps are stored per user, which means that if you run in "test" mode you will not see the same dumps as when the agent runs as a service.
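
The per-user behaviour follows directly from the folder location: %LOCALAPPDATA% differs for every account, so each account gets its own dump folder. A minimal Python sketch of this, where crash_dump_dir is a hypothetical helper and both profile paths are made-up examples rather than real defaults:

```python
import ntpath


def crash_dump_dir(local_appdata):
    """Build the crash dump folder for a given %LOCALAPPDATA% value."""
    # ntpath applies Windows path rules even when run on another OS.
    return ntpath.join(local_appdata, "NSClient++", "crash dumps")


if __name__ == "__main__":
    # Both profile paths below are made-up examples, not real defaults.
    for profile in (r"C:\Users\alice\AppData\Local",
                    r"C:\Users\svc-nscp\AppData\Local"):
        print(crash_dump_dir(profile))
```

The two printed folders differ, which is why a dump produced in "test" mode under your own login is not counted by a service running under another account.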

Submit crash reports

TODO

Restart service

TODO

Last modified on 07/06/11 10:04:06