Opened 13 months ago
Closed 12 months ago
#529 closed enhancement (fixed)
add alert severity to real-time event log monitoring
| Reported by: | mikep | Owned by: | mickem |
|---|---|---|---|
| Priority: | 1 | Milestone: | 0.4.1 |
| Component: | CheckEventlog | Version: | 0.4.0 |
| Severity: | Feature Requests | Keywords: | |
| Cc: | | | |
Description
Hello. I believe it would be very valuable to add the ability to specify what alert level will be sent when a real-time event log filter matches.
This would allow greater control over the state returned to Nagios and will allow for resolved problems to return an OK status immediately.
I'm thinking that maybe you can add a parameter like AlertSeverity that could be specified in the config section of each filter.
Maybe something like the following?
[/settings/eventlog/real-time/filters]
Test App1=AlertSeverity = 0 id = 1000 AND source = 'Test App1'
Test App1=AlertSeverity = 1 id = 1100 AND source = 'Test App1'
Test App1=AlertSeverity = 2 id = 1200 AND source = 'Test App1'
I'm not sure if that is the best example of how it could work, but I think it shows that I want to attach different severities to different filters for the same service.
Thanks!
Change History (13)
comment:1 Changed 13 months ago by mickem
comment:2 Changed 13 months ago by mikep
Yes, I'm not clear on the best way to have the configuration settings work. But I stand by the great value of the capability.
With a normal polling monitor, like CPU usage, we can define warning and critical values. When it is run on its interval, you always get back an ok, warn, or crit. This enables it to reset its state, if the value changes before the next poll interval.
With real-time monitoring, I want to have a similar capability in real-time.
For example, I have an application that connects to a DB.
If that connection takes longer than 5 seconds, I write a specific event message to the event log with an id of 1100. The app is working, but the delay could impact the user experience.
If the connection completely fails, I write a specific event message to the event log with a different id of 1200. This means I need to address it immediately.
If the app can connect to the DB successfully again, I write an event message with an id of 1000. This tells me that everything is working again.
To reflect the health of the app in Nagios, I have created a service named 'Test App1'.
So, I would like to be able to create nscp real-time filters that send back ok, warn, or crit for my service, based on matching the id 1000, 1100, or 1200 respectively.
I haven't looked closely at how you have implemented the real-time filters, but I had thought you were creating a list. It seems like that would still work, since the different state filters for a single service would not have identical filter values.
Thanks.
mikep
comment:3 Changed 12 months ago by mickem
This is what I am looking at now (ish):
...
; Definition for real time filter: default
[/settings/eventlog/real-time/filters/default]
; DESTINATION - The destination for intercepted messages
destination = nsca
; OK MESSAGE - This is the message sent periodically whenever no error is discovered.
ok message = eventlog found no records
; SYNTAX - Format string for dates
syntax = hello world: %message%
; A set of filters to use in real-time mode
[/settings/eventlog/real-time/filters/test_1]
; DESTINATION - The destination for intercepted messages
destination = nsca_server_1
; FILTER - The filter to match
filter = id = 1001 and category = 1
; SYNTAX - Format string for dates
syntax = hello world: %message%
; Definition for real time filter: default
[/settings/eventlog/real-time/filters/test_2]
; DESTINATION - The destination for intercepted messages
destination = nsca_server_2
; FILTER - The filter to match
filter = id = 1002 and category = 1
severity = WARNING
; A set of filters to use in real-time mode
[/settings/eventlog/real-time/filters]
test3 = id = 1003 and category = 1
...
comment:4 Changed 12 months ago by mikep
Hi Michael,
I'm unclear on the format you have included. Please tell me if the following config would have the effect I list below.
[/settings/eventlog/real-time/filters]
cart_checkout=severity = OK id = 1000 AND source = 'Shopping Basket'
cart_checkout=severity = WARNING id = 1100 AND source = 'Shopping Basket'
cart_checkout=severity = CRITICAL id = 1200 AND source = 'Shopping Basket'
In Nagios I have a service named "cart_checkout" to show the health of the checkout function of my shopping basket web app.
When a user checks out, the app will record a Windows event log based on the following.
A successful checkout will have an EventId = 1000 and a source of "Shopping Basket"
A slow checkout will have an EventId = 1100 and a source of "Shopping Basket"
A failed checkout will have an EventId = 1200 and a source of "Shopping Basket"
I want the nscp real-time monitor to make a nsca call with the correct information to tell Nagios that it is the status of service "cart_checkout" and a state of OK, WARNING, or CRITICAL (0, 1, 2) for the respective event id found in the event log entry.
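The behaviour requested here can be sketched in a few lines of Python. This is only an illustration of the desired mapping, not NSClient++ code: the `RULES` table, the `SERVICE` constant, and the `passive_result` function are hypothetical names made up for the example. The numeric codes are the standard Nagios return codes.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Hypothetical rule table for this example: (event id, source) -> severity.
RULES = {
    (1000, "Shopping Basket"): "OK",
    (1100, "Shopping Basket"): "WARNING",
    (1200, "Shopping Basket"): "CRITICAL",
}

# All three rules report against the same Nagios service.
SERVICE = "cart_checkout"


def passive_result(event_id: int, source: str, message: str) -> Optional[Tuple[str, int, str]]:
    """Return (service, code, message) for a matching event, or None if no rule matches."""
    severity = RULES.get((event_id, source))
    if severity is None:
        return None
    return SERVICE, NAGIOS_CODES[severity], message


# A slow checkout (event id 1100) should yield a WARNING for cart_checkout.
print(passive_result(1100, "Shopping Basket", "cart_checkout is slow"))
```

The point of the sketch is that three different event ids feed one service name with three different states.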
What confuses me about your example is how the service name gets set.
In your example config file, using my example, would I do the following?
[/settings/eventlog/real-time/filters/cart_checkout_ok]
cart_checkout=id = 1000 AND source = 'Shopping Basket'
severity = OK
[/settings/eventlog/real-time/filters/cart_checkout_warn]
cart_checkout=id = 1100 AND source = 'Shopping Basket'
severity = WARNING
[/settings/eventlog/real-time/filters/cart_checkout_crit]
cart_checkout=id = 1200 AND source = 'Shopping Basket'
severity = CRITICAL
Please clarify. Thanks!
comment:5 Changed 12 months ago by mikep
comment:6 Changed 12 months ago by mickem
Almost...
[/settings/eventlog/real-time/filters/cart_checkout_ok]
filter=id = 1000 AND source = 'Shopping Basket'
severity = OK
[/settings/eventlog/real-time/filters/cart_checkout_warn]
filter=id = 1100 AND source = 'Shopping Basket'
severity = WARNING
[/settings/eventlog/real-time/filters/cart_checkout_crit]
filter=id = 1200 AND source = 'Shopping Basket'
severity = CRITICAL
If you grab the 0.4.1 build I did last night you should be able to try it out.
Note that it is unstable, so don't expect too much :)
I need to extend the unit tests to support this (as well as the reworked socket handling).
Michael Medin
comment:7 Changed 12 months ago by mikep
Ok, I'll go grab your latest build and test the current functionality. The last piece of the puzzle that I'm having a hard time understanding is how I set the name of the service. For example, a normal NSCA message looks like this in your debug output.
Sending (data): host: server01, service: cart_checkout, code: 1, time: 1337686091, result: warning cart_checkout is slow
How do I set the "service:" value in your config files? I see that you replaced "cart_checkout" with "filter" in your correction to my example above. With the current 0.4.0 real-time functionality, I set the "service:" value by including the name of the service "cart_checkout" in the filter definition line as I did above. So I'm confused why you changed it to "filter". How would I make your example above return "service: cart_checkout" in the NSCA message?
Thanks!
mikep
comment:8 Changed 12 months ago by mickem
Ahh... sorry...
This is the "same concept" as I have for servers and what not.
So the following are equivalent:
[/settings/eventlog/real-time/filters]
foo=id = 1200
and
[/settings/eventlog/real-time/filters/foo]
filter=id = 1200
The benefit of the first is simplicity, i.e. alias=filter, and the benefit of the second is precision at the cost of having to type more.
So the "service name" in that case comes from the [.../SERVICE NAME] section name.
You can also (given that I like flexibility) override the alias as well, but let's not get into that...
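The equivalence between the short and the long form can be illustrated with a small Python sketch. It is a model of the idea only, not how NSClient++ parses its settings: `expand_short_form` and `service_name` are hypothetical helpers, and letting an explicit `filter` key win over the short form is an assumption made for the example.

```python
FILTERS_ROOT = "/settings/eventlog/real-time/filters"


def expand_short_form(config):
    """Turn each alias=filter entry under FILTERS_ROOT into its own section.

    `config` maps section names to {key: value} dicts. The input is not modified.
    """
    expanded = {name: dict(keys) for name, keys in config.items() if name != FILTERS_ROOT}
    for alias, filter_expr in config.get(FILTERS_ROOT, {}).items():
        section = expanded.setdefault(f"{FILTERS_ROOT}/{alias}", {})
        # Assumption: an explicit "filter" key in the long form takes precedence.
        section.setdefault("filter", filter_expr)
    return expanded


def service_name(section):
    """The service name defaults to the last path element of the section name."""
    return section.rsplit("/", 1)[-1]


short_form = {FILTERS_ROOT: {"foo": "id = 1200"}}
long_form = {f"{FILTERS_ROOT}/foo": {"filter": "id = 1200"}}

print(expand_short_form(short_form) == long_form)
print(service_name(f"{FILTERS_ROOT}/foo"))
```

Both forms end up as a section named after the alias, which is why the section name doubles as the service name.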
Michael Medin
comment:9 Changed 12 months ago by mikep
Ok, here are my findings with 0.4.1.1. I think there may be some bugs and I'm not sure the current implementation meets the requirements, at least with the config file I used. Please let me know if I am still not using the config file correctly.
This was my initial config:
[/settings/eventlog/real-time/filters/default]
enabled=true
maximum age=5m
destination=NSCA
syntax=hello world: %message%
ok message = eventlog found no records
[/settings/eventlog/real-time/filters/cart_checkout_ok]
filter=id = 100 AND source = 'Shopping Basket'
severity = OK
[/settings/eventlog/real-time/filters/cart_checkout_warn]
filter=id = 110 AND source = 'Shopping Basket'
severity = WARNING
[/settings/eventlog/real-time/filters/cart_checkout_crit]
filter=id = 120 AND source = 'Shopping Basket'
severity = CRITICAL
I received the error:
e rvice\NSClient++.cpp:1211 No one listens for events from: ()
e og\CheckEventLog.cpp:82 Failed to submit evenhtlog result: Missing response from submission
It appears the config isn't picking up the destination from the default section.
So I added destinations to my config.
[/settings/eventlog/real-time/filters/default]
enabled=true
maximum age=5m
destination=NSCA
syntax=hello world: %message%
ok message = eventlog found no records
[/settings/eventlog/real-time/filters/cart_checkout_ok]
filter=id = 100 AND source = 'Shopping Basket'
severity = OK
destination=NSCA
[/settings/eventlog/real-time/filters/cart_checkout_warn]
filter=id = 110 AND source = 'Shopping Basket'
severity = WARNING
destination=NSCA
[/settings/eventlog/real-time/filters/cart_checkout_crit]
filter=id = 120 AND source = 'Shopping Basket'
severity = CRITICAL
destination=NSCA
This gets me a nsca response, but the status code isn't as expected.
I used the following command to create an event to match the warning filter.
eventcreate /ID 110 /L APPLICATION /SO "Shopping Basket" /D "App is slow" /T WARNING
This gets the nsca message.
d lient\NSCAClient.cpp:417 Sending (data): host: orvomdev01, service: cart_checkout_warn, code: 3, time: 1339029020, result: hello world: App is slow
This means that a service named "cart_checkout_warn" will have a status code of UNKNOWN (3=UNKNOWN in Nagios).
1) I believe the intended functionality is for the "code:" value to be set to a value that matches the severity value I set in the config (code: 1 in this case). i.e. OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3
2) This config format still doesn't help me in a real-life scenario. Using this config, I'm actually defining filters for 3 different Nagios services (cart_checkout_ok, cart_checkout_warn, cart_checkout_crit) instead of defining 3 different state filters (OK, WARNING, CRITICAL) for the same service "cart_checkout". The intent of this capability is to allow the real-time filter to provide different status codes for different event messages related to the same service. The way I configured this test doesn't allow that result.
I'm guessing item 1 is just an oversight with a simple fix in the code. I'm not clear on how your current config format addresses item 2. Please clarify.
Thanks!
mikep
comment:10 Changed 12 months ago by mickem
Ahh, sorry...
I was a bit quick there.
I have fixed a few issues with the new filters in the latest build and added testcases so they actually work :)
In the next build there will be a new (untested) keyword "command".
The command overrides the "service name" (I call it command since I don't like to change terminology for passive checks). If it is not set, the service name is taken from the alias. Sorry for forgetting it last time around...
So you should be able to do:
[/settings/eventlog/real-time]
maximum age=5m
enabled=true
[/settings/eventlog/real-time/filters/default]
destination=NSCA
syntax=hello world: %message%
ok message = eventlog found no records
command=cart_checkout
[/settings/eventlog/real-time/filters/cart_checkout_ok]
filter=id = 100 AND source = 'Shopping Basket'
severity = OK
[/settings/eventlog/real-time/filters/cart_checkout_warn]
filter=id = 110 AND source = 'Shopping Basket'
severity = WARNING
[/settings/eventlog/real-time/filters/cart_checkout_crit]
filter=id = 120 AND source = 'Shopping Basket'
severity = CRITICAL
Also note that maximum age and enabled are not set on the filter level (they are on the real-time level as they are global).
Michael Medin
comment:11 Changed 12 months ago by mickem
As a side note to you and/or other people with more complex scenarios: the "default" filter is a standard feature which is a bit magical for simplicity. If you do not specify a parent, you always get default if it exists.
In reality filters are a tree structure where you can have multiple parents.
So if you have "more than one" of these scenarios you can easily achieve them using multiple "templates" like so (ish):
[/settings/eventlog/real-time/filters/shopping_cart]
destination=NSCA
syntax=hello world: %message%
ok message = eventlog found no records
[/settings/eventlog/real-time/filters/shopping_cart]
command=cart_checkout
is template=true
parent=default
[/settings/eventlog/real-time/filters/cart_checkout_ok]
filter=id = 100 AND source = 'Shopping Basket'
severity = OK
parent=shopping_cart
[/settings/eventlog/real-time/filters/cart_checkout_warn]
filter=id = 110 AND source = 'Shopping Basket'
severity = WARNING
parent=shopping_cart
[/settings/eventlog/real-time/filters/cart_checkout_crit]
filter=id = 120 AND source = 'Shopping Basket'
severity = CRITICAL
parent=shopping_cart
[/settings/eventlog/real-time/filters/foo_bar]
command=foo_bar
is template=true
parent=default
[/settings/eventlog/real-time/filters/foo_bar_ok]
filter=id = 100 AND source = 'Foo Bar'
severity = OK
parent=foo_bar
[/settings/eventlog/real-time/filters/foo_bar_warn]
filter=id = 110 AND source = 'Foo Bar'
severity = WARNING
parent=foo_bar
[/settings/eventlog/real-time/filters/foo_bar_crit]
filter=id = 120 AND source = 'Foo Bar'
severity = CRITICAL
parent=foo_bar
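To make the parent/template idea concrete, here is a rough Python sketch of how effective settings could be computed by walking up the parent chain. It is a simplified model and not the NSClient++ implementation: `resolve` is a hypothetical function, it follows a single `parent` key only (the thread mentions multiple parents are possible), and it assumes child values override inherited ones.

```python
def resolve(filters, name, _seen=None):
    """Return the effective settings of filter `name`, inheriting from its parents.

    `filters` maps filter names to {key: value} dicts. A filter without an
    explicit parent inherits from "default" when such a filter exists.
    """
    seen = _seen if _seen is not None else set()
    if name in seen:
        raise ValueError(f"parent loop detected at {name!r}")
    seen.add(name)

    own = filters[name]
    parent = own.get("parent")
    if parent is None and name != "default" and "default" in filters:
        parent = "default"  # the implicit, slightly "magical" default parent

    merged = resolve(filters, parent, seen) if parent else {}
    merged.update(own)  # assumption: child values win over inherited ones
    # Bookkeeping keys are not part of the effective settings.
    merged.pop("parent", None)
    merged.pop("is template", None)
    return merged


filters = {
    "default": {"destination": "NSCA", "syntax": "hello world: %message%"},
    "shopping_cart": {"command": "cart_checkout", "is template": "true", "parent": "default"},
    "cart_checkout_warn": {
        "filter": "id = 110 AND source = 'Shopping Basket'",
        "severity": "WARNING",
        "parent": "shopping_cart",
    },
}

print(resolve(filters, "cart_checkout_warn"))
```

In this model cart_checkout_warn ends up with the destination and syntax from default, the command from shopping_cart, and its own filter and severity.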
comment:12 Changed 12 months ago by mikep
This is very good! I just performed some quick tests and it appears to work well. I will work on larger tests this weekend and provide you feedback as I gather it.
Thanks for your excellent work!
mikep
comment:13 Changed 12 months ago by mickem
- Resolution set to fixed
- Status changed from new to closed
Setting this to resolved

Not sure I follow... I would hazard a guess though:
The question is whether this becomes too complicated to configure, though... perhaps it is better to have some other mechanism?
Michael Medin