Discussion:
Reaper
Hans Scheffers
2013-10-30 05:14:05 UTC
Hi,
I see the following messages in the logfiles:

[1383078054] Warning: Breaking out of check result reaper: max reaper time (15) exceeded. Reaped 53 results, but more checkresults to process. Perhaps check core performance tuning tips?
[1383081646] Warning: Breaking out of check result reaper: max reaper time (15) exceeded. Reaped 56 results, but more checkresults to process. Perhaps check core performance tuning tips?
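
A rough way to read these warnings is to turn each one into a throughput figure: results reaped divided by the time budget that was exhausted. The Python sketch below does this for the warning text quoted above. The function name `reaper_rates` is made up for illustration, and the pattern only assumes the message wording shown in this post.

```python
import re

# Pattern for the warning text quoted above; whitespace is matched loosely
# because log lines may be wrapped when copied from an archive.
REAPER_WARNING = re.compile(
    r"\[(\d+)\]\s+Warning:\s+Breaking\s+out\s+of\s+check\s+result\s+reaper:\s+"
    r"max\s+reaper\s+time\s+\((\d+)\)\s+exceeded\.\s+Reaped\s+(\d+)\s+results"
)


def reaper_rates(log_text):
    """Return (timestamp, max_time_s, reaped, results_per_second) per warning."""
    rows = []
    for match in REAPER_WARNING.finditer(log_text):
        timestamp, max_time, reaped = (int(group) for group in match.groups())
        rows.append((timestamp, max_time, reaped, reaped / max_time))
    return rows


if __name__ == "__main__":
    sample = """
    [1383078054] Warning: Breaking out of check result reaper: max reaper time (15)
    exceeded. Reaped 53 results, but more checkresults to process.
    [1383081646] Warning: Breaking out of check result reaper: max reaper time (15)
    exceeded. Reaped 56 results, but more checkresults to process.
    """
    for timestamp, max_time, reaped, rate in reaper_rates(sample):
        print(f"{timestamp}: {reaped} results in {max_time}s = {rate:.2f} results/s")
```

For the two warnings above this works out to roughly 3.5 to 3.7 results per second, which is the number to compare against the rate at which checks produce results.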

I changed the values in icinga.cfg and restarted, but saw no difference in the system. I am running 1.9.3 with about 5000 service checks (450 of them WMI). As soon as the reaper messages appear after a restart, the service latency grows to approximately 100 s.

System: openSUSE 12.2 with ido2db, snmptt and pnp4nagios; quad-core Xeon processor, 8 GB memory, 1 hardware RAID 0 disk.

Icinga.cfg:
#check_result_reaper_frequency=10
#max_check_result_reaper_time=30
check_result_reaper_frequency=5
max_check_result_reaper_time=15

I also tried a bigger max_check_result_reaper_time and a smaller check_result_reaper_frequency, but no luck. Can anyone explain the tuning of the reaper some more? Can there be more reapers running simultaneously?

Grtz

Hans Scheffers

AIX / Linux Systeembeheer
Michael Friedrich
2013-11-01 23:03:19 UTC
On 30.10.2013 06:14, Hans Scheffers wrote:
> I also tried bigger reaper_time and smaller frequency but no luck. Can
> anyone explain the tuning of the reaper some more? Can there be more
> reapers running simultaneously?

no, that doesn't make much sense, since core 1.x is not multithreaded at
all.

it looks like your core generates a lot of checkresults in a short
period of time, and therefore the checkresult reaper cannot process
them fast enough. it also sounds like the in-memory list of
checkresults has grown huge and takes ages to be processed.
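
The interaction between check_result_reaper_frequency and max_check_result_reaper_time can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not Icinga's actual code: results are assumed to arrive at a constant rate, the reaper is assumed to process them at a constant rate, and the function name and its rate parameters are made up for illustration.

```python
def simulate_backlog(arrival_rate, process_rate, reaper_frequency,
                     max_reaper_time, duration):
    """Toy model: return the pending-result backlog after each reaper run.

    arrival_rate     -- check results produced per second (assumed constant)
    process_rate     -- results the reaper handles per second (assumed constant)
    reaper_frequency -- seconds between runs (check_result_reaper_frequency)
    max_reaper_time  -- time budget per run (max_check_result_reaper_time)
    duration         -- simulated seconds
    """
    backlog = 0.0
    elapsed = 0.0
    history = []
    while elapsed < duration:
        # Results pile up while the core waits for the next reaper run.
        backlog += arrival_rate * reaper_frequency
        elapsed += reaper_frequency
        # The reaper stops once the queue is empty or its budget is used up.
        time_spent = min(backlog / process_rate, max_reaper_time)
        backlog -= time_spent * process_rate
        # Assumption: results from running checks keep arriving meanwhile.
        backlog += arrival_rate * time_spent
        elapsed += time_spent
        history.append(round(backlog, 1))
    return history


if __name__ == "__main__":
    # Hypothetical rates: results arrive slightly faster than they are reaped.
    print(simulate_backlog(4.0, 3.5, 5, 15, 120))
```

In this model, whenever the arrival rate exceeds the processing rate the backlog grows on every run, whatever values the two settings have. Changing them only moves the point at which the warning is logged, which would be consistent with tuning showing no effect.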

it would also be interesting to know at which interval the 5000 service
checks are run, and how long their execution time is. some icingastats
output and system performance graphs over time would help the reader
as well.

https://wiki.icinga.org/display/howtos/Icinga+performance+analysis
https://wiki.icinga.org/display/howtos/Optimize+Icinga+Performance


--
DI (FH) Michael Friedrich

mail: michael.friedrich-***@public.gmane.org
twitter: https://twitter.com/dnsmichi
jabber: dnsmichi-***@public.gmane.org
irc: irc.freenode.net/icinga dnsmichi

icinga open source monitoring
position: lead core developer
url: https://www.icinga.org
Hans Scheffers
2013-11-02 11:49:26 UTC
Hans Scheffers
AIX / Linux Systeembeheer


> Date: Sat, 2 Nov 2013 00:03:19 +0100
> From: michael.friedrich-***@public.gmane.org
> To: icinga-users-5NWGOfrQmneRv+***@public.gmane.org
> Subject: Re: [icinga-users] Reaper
>
> On 30.10.2013 06:14, Hans Scheffers wrote:
> > I also tried bigger reaper_time and smaller frequency but no luck. Can
> > anyone explain the tuning of the reaper some more? Can there be more
> > reapers running simultaneously?
>
> no, that doesn't make much sense, since core 1.x is not multithreaded at
> all.

Ok, so that's not the way to go ;)
>
> it looks like your core generates a lot of checkresults in a short
> period of time, and therefore the checkresult reaper cannot process
> them fast enough. it also sounds like the in-memory list of
> checkresults has grown huge and takes ages to be processed.
>
> it would also be interesting to know at which interval the 5000 service
> checks are run, and how long their execution time is. some icingastats
> output and system performance graphs over time would help the reader
> as well.

We have 4000 to 4500 checks that run at the normal intervals or longer (some of them only once a day).
The rest of the checks have to be executed every 2 minutes as per SLA, so yes, a lot of checkresults are generated, and we will need even more in the (near) future. At the moment we're running in the test environment, but in a week we will also go live with a DRS.
The number of 2-minute checks will then double (and test will go back to 5 minutes).
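
These numbers can be turned into an approximate result rate with a few lines of Python. This is only a back-of-the-envelope sketch: the 5-minute "normal" interval and the 4500/500 split are assumptions chosen to fit the figures above, and checks that run less often (such as the daily ones) would lower the real rate.

```python
def results_per_second(checks_by_interval):
    """Sum count / interval over a mapping {interval_seconds: number_of_checks}."""
    return sum(count / interval for interval, count in checks_by_interval.items())


if __name__ == "__main__":
    # Assumed split: 4500 checks every 5 minutes, 500 checks every 2 minutes.
    current = {300: 4500, 120: 500}
    # Assumed future load: the number of 2-minute checks doubles.
    doubled = {300: 4500, 120: 1000}
    print(f"current: {results_per_second(current):.2f} results/s")
    print(f"doubled: {results_per_second(doubled):.2f} results/s")
```

Comparing this figure with the number of results the reaper manages per second in the warning messages gives a first idea of whether the reaper can keep up at all.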

The checks that cause the problems are the WMI checks: as soon as I shut these off, the system runs fine with 4000 to 4500 checks.
We are now running the WMI checks on slightly heavier hardware (a Linux ppc partition on a P710 with v3700 SAN storage), and our reaper isn't complaining anymore. The latency is now about 5 seconds max (on both systems). We are not generating graphs on the PPC LPAR at the moment, because this is the first test (started Wednesday).
Our main goal now is to update the PPC to openSUSE 12.3 with Icinga >= 1.9 and move all the tests to this LPAR (if needed, we can still extend the hardware a little). Then we will also generate the performance stats again :)

>
> https://wiki.icinga.org/display/howtos/Icinga+performance+analysis
> https://wiki.icinga.org/display/howtos/Optimize+Icinga+Performance

I have read them, but the tuning of the reaper in this piece is a little bit harsh... I still don't really know how to determine the optimal values.

Gerd Radecke
2013-11-04 07:30:45 UTC
On Sat, Nov 2, 2013 at 12:49 PM, Hans Scheffers
<hans.scheffers-1ViLX0X+***@public.gmane.org> wrote:
> I have read them, but the tuning of the reaper in this piece is a little
> bit harsh... I still don't really know how to determine the optimal values.
>

maybe I've missed it, but have you tried moving your checkresults and
temporary directory to a ramdisk?
https://wiki.icinga.org/display/howtos/Create+a+ramdisk+for+better+performance+in+Icinga

I've seen the same error message and huge latencies when the drives
just couldn't keep up with Icinga's I/O requests while it was writing
and reading the checkresults at a certain rate.
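
Whether the checkresults directory really sits on a ramdisk can be checked by looking up its mount point in /proc/mounts. The Python sketch below does that on Linux. The helper name `filesystem_type` and the directory path are assumptions for illustration; the real path is whatever check_result_path is set to in icinga.cfg. Mount points containing spaces are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` is the content of /proc/mounts (Linux only).
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Prefer the longest matching mount point; later entries win ties
        # because they shadow earlier mounts on the same directory.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__":
    # Hypothetical location; use the check_result_path value from icinga.cfg.
    check_result_path = "/var/spool/icinga/checkresults"
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            print(filesystem_type(check_result_path, handle.read()))
```

A result of tmpfs or ramfs would indicate a ramdisk; anything else means the checkresults are still written to a regular disk.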
Hans Scheffers
2013-11-04 22:10:28 UTC
Hans Scheffers

AIX / Linux Systeembeheer


> Date: Mon, 4 Nov 2013 08:30:45 +0100
> From: wi2009i-***@public.gmane.org
> To: icinga-users-5NWGOfrQmneRv+***@public.gmane.org
> Subject: Re: [icinga-users] Reaper
>
> maybe I've missed it, but have you tried moving your checkresults and
> temporary directory to a ramdisk?
> https://wiki.icinga.org/display/howtos/Create+a+ramdisk+for+better+performance+in+Icinga
>
> I've seen the same error message and huge latencies when the drives
> just couldn't keep up with Icinga's I/O requests while it was writing
> and reading the checkresults at a certain rate.

You didn't miss it, but I did implement the ramdisk for the checks. That was some time before I started with the WMI checks, so it didn't influence the results I had when implementing WMI :)