Nagios 3.x

Nagios Online Help (Chinese Edition)

Dr. 田朝阳

Project lead
SourceForge developer community
nagios-cn project

0.1.0

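Nagios is copyrighted by the authors of the Nagios software; this book claims copyright only over the Chinese-localized content. Please note: however you cite this book, in whole or in part, you must name its source site, http://nagios-cn.sourceforge.net/, and reproduce the copyright notices and statements of the relevant SourceForge publications.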

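Revision History
Revision 0.0.3    30/01/2008    enochcytian
Generated a first online help document from the parts translated so far.
Revision 0.0.2    20/12/2007    enochcytian
Set up the DocBook project and reverse-generated the XML chapter files from the source HTML files.
Revision 0.0.1    12/12/2007    enochcytian
Created the first draft and began writing the initial files.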

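Abstract

Nagios is an excellent network and host management tool with exceptional influence in the open source community. Unfortunately, the English prompts in its interface and operation are tightly coupled to the source code, so a Chinese-language interface was long in coming, which held back its adoption in Chinese-speaking regions. To promote Nagios, the author set up the nagios-cn project, whose main goal is to translate the runtime messages, the generated interface, and the documentation in the source. After some effort, nagios-cn now works properly. This book was written to promote and support the use of Nagios in Chinese-speaking regions, so that this excellent software can serve users there.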


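Dedication

1. Acknowledgements

First, I thank my family. Their support behind the scenes let me take up the work I love. My family is the root of my pride, and I trust they will be proud of what I have done.

Second, I thank the author of Nagios. Ethan Galstad gave us this fine piece of software, and his reply to my letter told me in which direction Nagios would develop.

Finally, I thank the open source community, especially SourceForge and Google Code. With the support of these two organizations nagios-cn has been able to survive, and more and more people have come to appreciate what the open source community contributes to humanity.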

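1. Preface
2. About Nagios
2.1. What Is Nagios?
2.2. System Requirements
2.3. Licensing
2.4. Acknowledgements
2.5. Downloading the Latest Version
3. What's New in Nagios 3.0
3.1. Change Log
3.2. Changes and New Features
4. Getting Started
4.1. Advice for Beginners
4.2. Upgrading from an Older Nagios to the Current Version
4.3. Quickstart Installation Guides
4.4. Fedora Quickstart
4.5. openSUSE Quickstart
4.6. Ubuntu Quickstart
4.7. Monitoring Windows Machines
4.8. Monitoring Linux/Unix Machines
4.9. Monitoring Routers and Switches
4.10. Monitoring Network Printers
4.11. Monitoring Netware Servers
4.12. Monitoring Publicly Available Services
5. Preparing to Configure Nagios
5.1. Configuration Overview
5.2. Main Configuration File Options
5.3. Object Configuration Overview
5.4. CGI Configuration File Options
6. Basic Concepts of Nagios Monitoring and Configuration
6.1. Object Definitions
6.2. Time-Saving Tricks for Object Definitions
6.3. Custom Object Variables
6.4. Object Inheritance
6.5. Scheduled Downtime
6.6. Time Periods
6.7. Notifications
6.8. Event Handlers
6.9. External Commands
6.10. State Types
6.11. Host Checks
6.12. Service Checks
6.13. Active Checks
6.14. Passive Checks
7. Basic Operations for Running Nagios
7.1. Verifying Your Configuration
7.2. Starting and Stopping Nagios
7.3. Fast Startup Options
7.4. Information on the CGIs
8. Nagios in Depth
8.1. Nagios Plugins
8.2. Understanding Macros and How They Work
8.3. Standard Macros in Nagios
8.4. Determining Status and Reachability of Network Hosts
8.5. Volatile Services
8.6. Host and Service Freshness Checks
8.7. Detecting and Handling State Flapping
8.8. Service and Host Check Scheduling
8.9. Notification Escalations
8.10. On-Call Rotations
8.11. Host and Service Dependencies
8.12. Predictive Dependency Checks
8.13. Performance Data
9. Advanced Nagios Topics
9.1. Curiosities and Jokes
9.2. Distributed Monitoring
9.3. Redundant and Failover Network Monitoring
9.4. Large Installation Tweaks
9.5. Cached Checks
9.6. State Stalking
9.7. Monitoring Host and Service Clusters
9.8. Adaptive Monitoring
9.9. Passive Host State Translation
10. Nagios Security and Performance Tuning
10.1. Security Considerations
10.2. Tuning Nagios for Performance
10.3. Using the Nagios Status Tool
10.4. Graphing Performance Data with MRTG
10.5. Authorization and Authentication in the CGIs
10.6. Custom CGI Headers and Footers
11. Software Integration
11.1. Integration Overview
11.2. SNMP Trap Integration
11.3. TCP Wrapper Integration
11.4. Nagios Addons
12. Development
12.1. Using the Embedded Perl Interpreter
12.2. Developing Nagios Plugins for Use with Embedded Perl
12.3. Nagios Plugin API
13. Closing Words
13.1. Some Suggestions on Using This Manual
13.2. Editing and Publishing Plans for This Book
13.3. Plans for Advancing the nagios-cn Project
13.4. Project Donations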

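Chapter 1. Preface

Anyone who works with computer networks has probably heard of network management software, but few actually use it in practice or rely on it for their work; it is, after all, far less common than games or word processors. Had circumstances not forced me, I would not have studied and used it seriously either. At the end of 2004, because certain tasks could not be assigned elsewhere, I, having "time to spare", took on the job of evaluating a network management package. There was no final goal, no deadline, and little staffing or funding, yet some very practical problems had to be solved. That was the starting point of this work.

Fortunately the software was not hard to install and try out. It took me only one day to download, compile, and install it, and after a few changes to the configuration files I could start using it. But the ugly interface, the tedious configuration updates, and the heavy workload of the initial setup made me doubt whether I needed it at all. A commercial product was close at hand, and although its customization did not quite meet our requirements, it at least did not put such a burden of responsibility on my shoulders; I would not have been answerable for it.

After much deliberation, I decided that giving up was not what I wanted. Since there was no time limit, I would pursue both paths. First, I would configure the commercial product properly so that it could be operated, and open an issue log to record the problems with later changes. As for Nagios, I would sort out my thinking: what exactly did I need it to do, how many problems would come up in use, how far each had to be solved, and, weighed against the resources available, whether the approach was viable.

Trial and adjustment is a long process, and writing check plugins in particular did not go as smoothly as I had imagined. Still, time can always be squeezed out, and with practice I gained some experience. Along the way I practised Perl and BASH (the plugins are tied to my work, so unfortunately they cannot be published), fixed several common problems in installing and running Nagios, and wrote a BASH script for deployment and operation to help those who come after me study and use it.

Later, when my work situation changed, I handed over what I had learned so that it could be put to real use. Everything after that was done purely out of personal interest. For the nagios-cn project I successively added SVG support, integrated the RRD and Grapher functionality, wrote a SPEC file for building custom RPMs, and added a DocBook conversion project. Finishing each of these brought a fresh sense of satisfaction.

Only in the final stage did I think of publicizing and promoting it. Because this work was no longer part of my job, I was free to publish it online. Hence the growing amount of project information on several websites and blogs, of which this book is one part.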

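Chapter 2. About Nagios

2.1. What Is Nagios?

Nagios is a system and network monitoring application. It watches the hosts and services you specify, and alerts you when things go bad and when they get better.

Nagios was originally designed to run under Linux, although it also runs on other Unix-like systems.

Further features of Nagios include:

  1. Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.);
  2. Monitoring of host resources (processor load, disk usage, etc.);
  3. A simple plugin design that lets users easily develop their own service checks;
  4. Parallelized service checks;
  5. The ability to define the network host hierarchy using "parent" hosts, which allows Nagios to detect and distinguish between hosts that are down and hosts that are unreachable;
  6. Contact notifications when service or host problems occur and get resolved (via e-mail, SMS, or a user-defined method);
  7. The ability to define event handlers that run when host or service events occur, to help pinpoint the problem;
  8. Automatic log file rotation;
  9. Support for implementing redundant monitoring hosts;
  10. An optional web interface for viewing current network status, notification and problem history, log files, etc.

2.2. System Requirements

The only requirement for running Nagios is a machine running Linux (or a Unix variant) with a C compiler. TCP/IP must also be configured correctly, since most service checks are performed over the network.

You are not required to use the CGIs included with Nagios. If you do decide to use them, however, you must have the following software installed:

  1. A web server (preferably Apache)
  2. Thomas Boutell's gd library, version 1.6.3 or higher (required by the statusmap and trends CGIs)

2.3. Licensing

Nagios is licensed under the terms of the GNU General Public License Version 2 as published by the Free Software Foundation. See the Free Software Foundation website for the text of the license. It permits you to copy, distribute, and/or modify Nagios under certain conditions. For more information, read the LICENSE file in the Nagios distribution or the online license file on the website.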

Nagios is provided AS IS with NO WARRANTY OF ANY KIND, INCLUDING THE WARRANTY OF DESIGN, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

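2.4. Acknowledgements

A number of people have contributed to Nagios by reporting bugs, suggesting improvements, writing plugins, and so on. A list of their names can be found at http://www.nagios.org.

2.5. Downloading the Latest Version

The latest version can be obtained from the Nagios site at http://www.nagios.org.

Note

Nagios and the Nagios trademarks are owned by Ethan Galstad. All other trademarks, servicemarks, registered trademarks, and registered servicemarks are the property of their respective owners.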

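Chapter 3. What's New in Nagios 3.0

Important

Make sure you read through the documentation and the FAQs at http://www.nagios.org/ before sending a question to the mailing lists.

3.1. Change Log

The Nagios change log can be found in the online file or in the root directory of the source distribution.

3.2. Changes and New Features

  • Documentation:
    1. Documentation updates - I apologize for the slow progress in updating the documentation. It will take some time, because there is a lot of it and writing documentation is not something I enjoy (translator's note: I enjoy translating all day even less). Expect some of the documentation to differ from the rest; I hope the differences are useful to new and experienced Nagios users alike.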
  • Macros:
    1. New macros - Several new macros have been added, including: $TEMPPATH$, $LONGHOSTOUTPUT$, $LONGSERVICEOUTPUT$, $HOSTNOTIFICATIONID$, $SERVICENOTIFICATIONID$, $HOSTEVENTID$, $SERVICEEVENTID$, $SERVICEISVOLATILE$, $LASTHOSTEVENTID$, $LASTSERVICEEVENTID$, $HOSTDISPLAYNAME$, $SERVICEDISPLAYNAME$, $MAXHOSTATTEMPTS$, $MAXSERVICEATTEMPTS$, $TOTALHOSTSERVICES$, $TOTALHOSTSERVICESOK$, $TOTALHOSTSERVICESWARNING$, $TOTALHOSTSERVICESUNKNOWN$, $TOTALHOSTSERVICESCRITICAL$, $CONTACTGROUPNAME$, $CONTACTGROUPNAMES$, $CONTACTGROUPALIAS$, $CONTACTGROUPMEMBERS$, $NOTIFICATIONRECIPIENTS$, $NOTIFICATIONISESCALATED$, $NOTIFICATIONAUTHOR$, $NOTIFICATIONAUTHORNAME$, $NOTIFICATIONAUTHORALIAS$, $NOTIFICATIONCOMMENT$, $EVENTSTARTTIME$, $HOSTPROBLEMID$, $LASTHOSTPROBLEMID$, $SERVICEPROBLEMID$, $LASTSERVICEPROBLEMID$, $LASTHOSTSTATE$, $LASTHOSTSTATEID$, $LASTSERVICESTATE$, $LASTSERVICESTATEID$. Two special time macros have also been added: $ISVALIDTIME:$ and $NEXTVALIDTIME:$.
    2. Removed macros - The old $NOTIFICATIONNUMBER$ macro has been split into two new macros, $HOSTNOTIFICATIONNUMBER$ and $SERVICENOTIFICATIONNUMBER$.
    3. Changed macros - The $HOSTNOTES$ and $SERVICENOTES$ macros may now themselves contain macros, as may the $HOSTNOTESURL$, $HOSTACTIONURL$, $SERVICENOTESURL$, and $SERVICEACTIONURL$ macros.
    4. Macros are made available as environment variables when checks, event handlers, notifications, and other external commands are run. This can be CPU-intensive in large Nagios installations, so you can disable it with the enable_environment_macros option.
    5. Updated information on macros can be found here.
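  • Scheduled Downtime:
    1. Scheduled downtime entries are no longer stored in their own file (previously specified by the downtime_file directive in the main configuration file). Current and retained scheduled downtime entries are now stored in the status file and the retention file, respectively.
  • Comments:
    1. Host and service comments are no longer stored in their own file (previously specified by the comment_file directive in the main configuration file). Current and retained comments are now stored in the status file and the retention file, respectively.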
    2. Acknowledgement comments that are marked as non-persistent are now only deleted when the acknowledgement is removed. They were previously automatically deleted when Nagios restarted, which was not ideal.
  • State Retention Data:
    1. Status information for individual contacts is now retained across program restarts.
    2. Comment and downtime IDs are now retained across program restarts and should be unique unless the retention data is deleted or ignored.
    3. Added retained_host_attribute_mask and retained_service_attribute_mask variables to control what host/service attributes are retained globally across program restarts.
    4. Added retained_process_host_attribute_mask and retained_process_service_attribute_mask variables to control what process attributes are retained across program restarts.
    5. Added retained_contact_host_attribute_mask and retained_contact_service_attribute_mask variables to control what contact attributes are retained globally across program restarts.
  • Flap Detection:
    1. Added flap_detection_options directive to host and service definitions to allow you to specify what host/service states should be used by the flap detection logic (by default all states are used).
    2. Percent state change and state history are now retained and recorded even when flap detection is disabled.
    3. Hosts and services are immediately checked for flapping when flap detection is enabled program-wide.
    4. Hosts and services that are flapping when flap detection is disabled program-wide are now logged.
    5. More information on flap detection can be found here.
  • External Commands:
    1. Added a new PROCESS_FILE external command to allow processing of external commands found in an external (regular) file. Useful for processing large amounts of passive checks with long output, or for scripting regular commands. More information can be found here.
    2. Custom commands may now be submitted to Nagios. Custom command names are prefixed with an underscore and are not processed internally by the Nagios daemon. They may, however, be processed by a loaded NEB module.
    3. The check_external_commands option is now enabled by default, which means Nagios is configured to check for external commands "out of the box". All 2.x and earlier versions of Nagios had this option disabled by default.
  • Status Data:
    1. Contact status information (last notification times, notifications enabled/disabled, etc.) is now saved in the status and retention files, although it is not processed by the CGIs.
  • Embedded Perl:
    1. Added new enable_embedded_perl and use_embedded_perl_implicitly variables to control use of the embedded Perl interpreter.
    2. Perl scripts/plugins can now explicitly tell Nagios whether or not they should be run under the embedded Perl interpreter. This is useful if you have troublesome scripts that don't function well under the ePN.
    3. More information about these new options can be found here.
  • Adaptive Monitoring:
    1. The check timeperiod for hosts and services can now be modified on-the-fly with the appropriate external command (CHANGE_HOST_CHECK_TIMEPERIOD or CHANGE_SVC_CHECK_TIMEPERIOD). See the adaptive monitoring page for more of the available adaptive monitoring commands.
  • Notifications:
    1. A first_notification_delay option has been added to host and service definitions to (what else) introduce a delay between when a host/service problem first occurs and when the first problem notification goes out. In previous versions you had to use some mighty config-fu with escalations to accomplish this. Now this feature is available to normal mortals.
    2. Notifications are now sent out for hosts/services that are flapping when flap detection is disabled on a host- or service-specific basis or on a program-wide basis. The $NOTIFICATIONTYPE$ macro will be set to "FLAPPINGDISABLED" in this situation.
    3. Notifications can now be sent out when scheduled downtime starts, ends, and is cancelled for hosts and services. The $NOTIFICATIONTYPE$ macro will be set to "DOWNTIMESTART", "DOWNTIMEEND", or "DOWNTIMECANCELLED", respectively. In order to receive notifications on scheduled downtime events, specify "s" or "downtime" in your contact, host, and/or service notification options.
    4. More information on notifications can be found here.
  • Object Definitions:
    1. Service dependencies can now be created to easily define "same host" dependencies for different services on one or more hosts. (Read more)
    2. Extended host and service definitions (hostextinfo and serviceextinfo, respectively) have been deprecated. All values from extended definitions have been merged with host or service definitions, as appropriate. Nagios 3 will continue to read and process older extended information definitions, but will log a warning. Future versions of Nagios (4.x and later) will not support separate extended info definitions.
    3. New hostgroup_members, servicegroup_members, and contactgroup_members directives have been added to hostgroup, servicegroup, and contactgroups definitions, respectively. This allows you to include hosts, services, or contacts from sub-groups in your group definitions.
    4. New notes, notes_url, and action_url have been added to hostgroup and servicegroup definition.
    5. Contact definitions have the new host_notifications_enabled, service_notifications_enabled, and can_submit_commands directives to better control notifications and determine whether or not they can submit commands through the web interface.
    6. Host and service dependencies now support an optional dependency_period directive. This allows you to limit the times during which dependencies are valid.
    7. The parallelize directive in service definitions is now deprecated and no longer used. All service checks are run in parallel in Nagios 3.
    8. There are no longer any inherent limitations on the length of host names or service descriptions.
    9. Extended regular expressions are now used if you enable the use_regexp_matching config option. Regular expression matching is only used in certain object definition directives that contain *, ?, +, or \..
    10. A new initial_state directive has been added to host and service definitions, so you can tell Nagios that a host/service should default to a specific state when Nagios starts, rather than UP or OK (which is still the default).
  • Object Inheritance:
    1. You can now inherit object variables/values from multiple templates by specifying more than one template name in the use directive of object definitions. This can allow for some very powerful (and complex) inheritance setups. (Read more)
    2. Services now inherit contact groups, notification interval, and notification period from their associated host if not otherwise specified. (Read more)
    3. Host and service escalations now inherit contact groups, notification interval, and escalation timeperiod from their associated host or service if not otherwise specified. (Read more)
    4. String variables in host, service, and contact definitions can now be prevented from being inherited by specifying a value of "null" (without quotes) for the value of the variable. (Read more)
    5. Most string variables in local object definitions can now be appended to the string values that are inherited. This is quite handy in large configurations. (Read more)
  • Performance Improvements:
    1. Add ability to precache object config files and exclude circular path detection checks from verification process. This can speed up Nagios start time immensely in large environments! Read more here.
    2. A new use_large_installation_tweaks option has been added that should improve performance in large Nagios installations. Read more about this here.
    3. A number of internal improvements have been made with regards to how Nagios deals with internal data structures and object (e.g. host and service) relationships. These improvements should result in a speedup for larger installations.
    4. New external_command_buffer_slots option has been added to allow you to more easily scale Nagios in large environments. For best results you should consider using MRTG to graph Nagios' usage of buffer slots over time.
  • Plugin Output:
    1. Multiline plugin output is now supported for host and service checks. Hooray! The plugin API has been updated to support multiple lines of output in a manner that retains backward compatibility with older plugins. Additional lines of output (aside from the first line) are now stored in new $LONGHOSTOUTPUT$ and $LONGSERVICEOUTPUT$ macros.
    2. The maximum length of plugin output has been increased to 4K (from around 350 bytes in previous versions). This 4K limit has been arbitrarily chosen to protect against runaway plugins that dump back too much data to Nagios.
    3. More information on the plugins, multiline output, and max plugin output length can be found here.
  • Service Checks:
    1. Nagios now checks for orphaned service checks by default.
    2. Added a new enable_predictive_service_dependency_checks option to control whether or not Nagios will initiate predictive checks of services that are being depended upon (in dependency definitions). Predictive checks help ensure that the dependency logic is as accurate as possible. (Read more)
    3. A new cached service check feature has been implemented that can significantly improve performance for many people. Instead of executing a plugin to check the status of a service, Nagios can often use a cached service check result instead. More information on this can be found here.
  • Host Checks:
    1. Host checks are now run in parallel! Host checks used to be run in a serial fashion, which meant they were a major holdup in terms of performance. No longer! (Read more)
    2. Host check retries are now performed like service check retries. That is to say, host definitions now have a new retry_interval that specifies how much time to wait before trying the host check again. :-)
    3. Regularly scheduled host checks no longer hinder performance. In fact, they can help to increase performance with the new cached check logic (see below).
    4. Added a new check_for_orphaned_hosts option to enable checks of orphaned host checks. This is needed now that host checks are run in parallel.
    5. Added a new enable_predictive_host_dependency_checks option to control whether or not Nagios will initiate predictive checks of hosts that are being depended upon (in dependency definitions). Predictive checks help ensure that the dependency logic is as accurate as possible. (Read more)
    6. A new cached host check feature has been implemented that can significantly improve performance for many people. Instead of executing a plugin to check the status of a host, Nagios can often use a cached host check result instead. More information on this can be found here.
    7. Passive host checks that have a DOWN or UNREACHABLE result can now be automatically translated to their proper state from the point of view of the Nagios instance that receives them. This is very useful in failover and distributed monitoring setups. More information on passive host check state translation can be found here.
    8. Passive host checks normally put a host into a HARD state. This can now be changed by enabling the passive_host_checks_are_soft option.
  • Freshness checks:
    1. A new additional_freshness_latency option has been added to allow you to specify the number of seconds that should be added to any host or service freshness threshold that is automatically calculated by Nagios.
  • IPC:
    1. The IPC mechanism that is used to transfer host/service check results back to the Nagios daemon from (grand)child processes has changed! This should help to reduce load/latency issues related to processing large numbers of passive checks in distributed monitoring environments.
    2. Check results are now transferred by writing check results to files in the directory specified by the check_result_path option. Files that are older than the max_check_result_file_age option will be mercilessly deleted without further processing.
  • Timeperiods:
    1. Timeperiods were overdue for a major overhaul and have finally been extended to allow for date exceptions, skip dates (every 3 days), etc! This should help you out when defining notification timeperiods for pager rotations.
    2. More information on the new timeperiod directives can be found here and here.
  • Event Broker:
    1. Updated NEB API version
    2. Modified callback for adaptive program status data
    3. Added callback for adaptive contact status data
    4. Added precheck callbacks for hosts and services to allow modules to cancel/override internal host/service checks.
  • Web Interface:
    1. Hostgroup and servicegroup summaries now show important/unimportant problem breakdowns like the TAC CGI.
    2. Minor layout changes to host and service detail views in extinfo CGI.
    3. New check statistics have been added to the "Performance Info" screen.
    4. Added Splunk integration, controlled by the enable_splunk_integration and splunk_url options.
    5. Added new notes_url_target and action_url_target options to control what frame notes and action URLs are opened in.
    6. Added new lock_author_names option to prevent alteration of author names when users submit comments, acknowledgements, and scheduled downtime.
  • Debugging Info:
    1. The DEBUGx compile options available in the configure script have been removed.
    2. Debugging information can now be written to a separate debug file, which is automatically rotated when it reaches a user-defined size. This should make debugging problems much easier, as you don't need to recompile Nagios. Full support for writing debugging information to file is being added during the alpha development phase, so it may not be complete when you try it.
    3. Variables that affect the debug log include debug_file, debug_level, debug_verbosity, and max_debug_file_size.
  • Misc:
    1. Temp path variable - A new temp_path variable has been added to specify a scratch directory that Nagios can use for temporary scratch space.
    2. Unique notification and event ID numbers - A unique ID number is now assigned to each host and service notification. Another unique ID is now assigned to all host and service state changes as well. The unique IDs can be accessed using the following respective macros: $HOSTNOTIFICATIONID$, $SERVICENOTIFICATIONID$, $HOSTEVENTID$, $SERVICEEVENTID$, $LASTHOSTEVENTID$, $LASTSERVICEEVENTID$.
    3. New macros - A few new macros (other than those already mentioned elsewhere above) have been added. They include $HOSTGROUPNAMES$, $SERVICEGROUPNAMES$, $HOSTACKAUTHORNAME$, $HOSTACKAUTHORALIAS$, $SERVICEACKAUTHORNAME$, and $SERVICEACKAUTHORALIAS$.
    4. Reaper frequency - The old service_reaper_frequency variable has been renamed to check_result_reaper_frequency, as it is now also used to process host check results.
    5. Max reaper time - A new max_check_result_reaper_time variable has been added to limit the amount of time a single reaper event is allowed to run.
    6. Fractional intervals - Fractional notification and check intervals (e.g. "3.5" minutes) are now supported in host, service, host escalation, and service escalation definitions.
    7. Escaped command arguments - You can now pass bang (!) characters in your command arguments by escaping them with a backslash (\). If you need to include backslashes in your command arguments, they should also be escaped with a backslash.
    8. Multiline system command output - Nagios will now read multiple lines of output from system commands it runs (notification scripts, etc.), up to 4K. This matches the limits on plugin output mentioned earlier. Output from system commands is not directly processed by Nagios, but support for it is there nonetheless.
    9. Better scheduling information - More detailed information is given when Nagios is executed with the -s command line option. This information can be used to help reduce the time it takes to start/restart Nagios.
    10. Aggregated status file updates - The old aggregate_status_updates option has been removed. All status file updates are now aggregated at a minimum interval of 1 second.
    11. New performance data file mode - A new "p" option has been added to the host_perfdata_file_mode and service_perfdata_file_mode options. This new mode will open the file in non-blocking read/write mode, which is useful for pipes.
    12. Timezone offset - A new use_timezone option has been added to allow you to run different instances of Nagios in timezones different from the local zone.
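
The escaping rule in the "Escaped command arguments" item above can be tried out with a small helper. This is only a sketch: the function name nagios_escape_arg is made up for illustration and is not part of Nagios. It doubles every backslash first and then puts a backslash in front of every bang, which is the order needed so that the backslashes it adds are not escaped a second time.

```shell
# Escape one string for use as a Nagios command argument (illustrative helper).
# Step 1: double each backslash. Step 2: prefix each "!" with a backslash.
nagios_escape_arg() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/!/\\!/g'
}
```

The result would then be pasted after the command name in a check_command value, where unescaped bangs act as argument separators.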

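Chapter 4. Getting Started

4.1. Advice for Beginners

Congratulations on choosing Nagios! Nagios is powerful and flexible, but it can take considerable effort to learn how to configure it to do what you need. Once you understand how it works and what it can do, you will never want to be without it! :-) Here are a few suggestions for first-time users:

  • Relax - it will take some time. Do not expect to get everything working in an instant; it is not that easy. Setting up Nagios takes some work, partly because you may not yet be familiar with its configuration, and partly because you may not yet know how to monitor your network (or how to monitor it better).
  • Use the quickstart guides. The quickstart installation guides in this manual were written to get new users up and running as quickly as possible. In less than twenty minutes you can have Nagios installed and monitoring your local system. Once that is done, you can move on to learning how to configure Nagios.
  • Read the documentation. Once you understand how Nagios works, you can configure it efficiently and make it do almost anything. Make sure you read the documentation, especially the chapters on configuring Nagios and on basic operations. Leave the advanced topics until you have a good grasp of basic configuration.
  • Seek the help of others. If you have read the documentation, reviewed the sample configuration files, and are still having problems, send an e-mail describing the problem clearly to the nagios-users mailing list. Because I have a lot of work to do on this project, I may not be able to answer mail sent directly to me, so the mailing list is your best source of help. If you have done some background reading and describe your problem well, there is a good chance someone will point you in the right direction. More information is available at http://www.nagios.org/support/.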

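4.2. Upgrading from an Older Nagios to the Current Version

4.2.1. Upgrading from Previous 3.x Releases

If you are running an older 3.x release, you should upgrade to the current version as soon as possible, since newer releases fix many bugs. The steps below assume that you originally installed Nagios from source by following the quickstart guide. Although the original installation was done as root, the upgrade itself does not require root privileges. The process is as follows:

Make sure you have a good backup of your existing Nagios installation and configuration files. If anything goes wrong during the upgrade, you can at least roll back to the old version.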
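
One way to take such a backup is to archive the whole installation directory into a timestamped tarball. The helper below is a minimal sketch, not a procedure from the Nagios documentation: the function name backup_dir and the destination directory in the example are assumptions, and /usr/local/nagios is simply the install prefix used by the quickstart guides.

```shell
# Archive a directory into a timestamped tarball (illustrative helper).
# $1: directory to back up, $2: directory that receives the archive.
# Prints the path of the new archive on success.
backup_dir() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C makes the paths inside the archive relative to the parent directory.
    tar czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    printf '%s\n' "$archive"
}

# Example (assumed paths): backup_dir /usr/local/nagios ~/nagios-backups
```

Keep the destination outside the directory being archived, otherwise the tarball would try to include itself.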

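Become the nagios user. Debian/Ubuntu users can use sudo -s nagios.

su -l nagios

Download the latest Nagios source tarball (http://www.nagios.org/download/).

wget http://osdn.dl.sourceforge.net/sourceforge/nagios/nagios-3.x.tar.gz

Extract the source tarball.

tar xzf nagios-3.x.tar.gz

cd nagios-3.x

Run the Nagios configure script, passing the name of the group used for external commands, like so:

./configure --with-command-group=nagcmd

Compile the source code.

make all

Install the updated binaries, documentation, and web interface. Your existing configuration files are not overwritten by this step.

make install

Verify your configuration and restart Nagios.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

/sbin/service nagios restart

That's it, the upgrade is complete!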

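4.2.2. Upgrading from Nagios 2.x

Upgrading from Nagios 2.x to 3.x is not difficult. The process is the same as for upgrading from an older 3.x release, described above. You should, however, note the following configuration changes in Nagios 3.x: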

  1. The old service_reaper_frequency variable in the main config file has been renamed to check_result_reaper_frequency.
  2. The old $NOTIFICATIONNUMBER$ macro has been deprecated in favor of new $HOSTNOTIFICATIONNUMBER$ and $SERVICENOTIFICATIONNUMBER$ macros.
  3. The old parallelize directive in service definitions is now deprecated and no longer used, as all service checks are run in parallel.
  4. The old aggregate_status_updates option has been removed. All status file updates are now aggregated at a minimum interval of 1 second.
  5. Extended host and extended service definitions have been deprecated. They are still read and processed by Nagios, but it is recommended that you move the directives found in these definitions to your host and service definitions, respectively.
  6. The old downtime_file file variable in the main config file is no longer supported, as scheduled downtime entries are now saved in the retention file. To preserve existing downtime entries, stop Nagios 2.x and append the contents of your old downtime file to the retention file.
  7. The old comment_file file variable in the main config file is no longer supported, as comments are now saved in the retention file. To preserve existing comments, stop Nagios 2.x and append the contents of your old comment file to the retention file.

Also make sure to read the "What's New" section of the documentation. It describes all the changes that were made to the Nagios 3 code since the latest stable release of Nagios 2.x. Quite a bit has changed, so make sure you read it over.
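
Items 6 and 7 in the list above both come down to appending an old file to the retention file while Nagios is stopped. A minimal sketch of that step follows; the function name append_to_retention is hypothetical, and the file paths are whatever your own nagios.cfg pointed to, so none are hard-coded here.

```shell
# Append an old Nagios 2.x downtime or comment file to the retention file
# (illustrative helper). Run it only while Nagios is stopped.
# $1: old downtime or comment file, $2: retention file.
append_to_retention() {
    old=$1
    retention=$2
    [ -r "$old" ] || { echo "cannot read $old" >&2; return 1; }
    # Keep a copy of the retention file so the step can be undone.
    cp -p "$retention" "$retention.bak" || return 1
    cat "$old" >> "$retention"
}
```

Calling it once for the downtime file and once for the comment file covers both items; note that the second call overwrites the .bak copy made by the first.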

4.2.3. Upgrading from an RPM Installation

If you currently have Nagios installed from an RPM package, or from a Debian/Ubuntu APT package, you should upgrade by installing from source. The steps are:

  1. Back up your existing Nagios installation:
    • Configuration files:
      1. Main config file (usually nagios.cfg)
      2. Resource config file (usually resource.cfg)
      3. CGI config file (usually cgi.cfg)
      4. All your object definition files
    • Retention file (usually retention.dat)
    • Current Nagios log file (usually nagios.log)
    • Archived Nagios log files
  2. Uninstall the original RPM or APT package
  3. Install Nagios from source by following the quickstart guide
  4. Restore your original Nagios configuration files, retention file, and log files
  5. Verify your configuration and start Nagios

Note that RPM and APT packages may place the Nagios files in different locations. Before upgrading, make sure your configuration files are backed up so that you can roll back to the old version if you hit a problem you cannot resolve.

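4.3. Quickstart Installation Guides

4.3.1. Introduction

These guides aim to give you simple instructions for installing Nagios from source and monitoring your local machine within 20 minutes. They do not cover advanced installation options; they cover the basics that will work for more than 95% of users who want to get started.

4.3.2. Guides

Quickstart installation guides are currently available for the following Linux distributions:

  • Fedora
  • openSUSE
  • Ubuntu

You can find more quickstart guides on the NagiosCommunity.org wiki. Can't find a guide for your operating system? Write one on the wiki for others!

If you are installing Nagios on an operating system or Linux distribution not listed above, read the Fedora quickstart for an overview of what you will need to do. Command names, paths, and so on vary across distributions and operating systems, so you will probably need to adapt the installation instructions a little.

4.3.3. Post-Installation Modifications

Once Nagios is correctly installed and running, you will no doubt want to monitor more than just your local machine. Look through the rest of the documentation to learn how to do more...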

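4.4. Fedora Quickstart

4.4.1. Introduction

This guide aims to give you simple instructions for installing Nagios from source on Fedora and monitoring your local machine within 20 minutes. No advanced installation options are discussed here, only the basics that will work for 95% of users who want to get started.

These instructions were written for a Fedora Core 6 system.

What You'll End Up With

If you follow this guide, you will end up with the following:

  1. Nagios and the plugins installed under /usr/local/nagios
  2. Nagios configured to monitor a few aspects of your local system (CPU load, disk usage, etc.)
  3. The Nagios web interface accessible at http://localhost/nagios/

4.4.2. Prerequisites

Make sure you have root access to the machine before you begin.

Make sure the following packages are installed on your Fedora system before continuing.

  1. Apache
  2. GCC compiler
  3. GD library and development libraries

You can install these packages with yum by running the following commands:

yum install httpd

yum install gcc

yum install glibc glibc-common

yum install gd gd-devel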
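
After installing the packages, a quick sanity check is to confirm that the executables you are about to rely on can actually be found. The sketch below uses only the POSIX command -v builtin; the function name missing_commands is made up for illustration. It checks executables only, so it says nothing about libraries such as gd-devel, and httpd usually lives in /usr/sbin, which may not be on the PATH of a non-root user.

```shell
# Print each named command that cannot be found on PATH (illustrative helper).
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example: missing_commands gcc make httpd
```

Empty output means every listed command was found.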

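4.4.3. Procedure

1) Create an Account

Become the root user.

su -l

Create a new user account named nagios and give it a password.

/usr/sbin/useradd nagios

passwd nagios

Create a new group named nagcmd to allow external commands to be submitted through the web interface. Add both the nagios user and the apache user to the group.

/usr/sbin/groupadd nagcmd

/usr/sbin/usermod -G nagcmd nagios

/usr/sbin/usermod -G nagcmd apache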
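
It is worth confirming that both users really ended up in nagcmd. One general property of usermod, not something stated in this guide, is relevant here: -G without -a sets the user's complete list of supplementary groups, so any earlier supplementary memberships are dropped. The helper below is a small sketch for checking membership; the name user_in_group is hypothetical.

```shell
# Succeed if user $1 is a member of group $2 (illustrative helper).
# "id -nG" prints the user's group names separated by spaces.
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

# Example: user_in_group nagios nagcmd && user_in_group apache nagcmd
```

The function is silent; use its exit status, for example with && echo ok.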

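2) Download Nagios and the Plugins

Create a directory for storing the downloads.

mkdir ~/downloads

cd ~/downloads

Download the source tarballs of Nagios and the Nagios plugins (visit http://www.nagios.org/download/ for the latest versions). At the time of writing, the latest versions were Nagios 3.0rc1 and Nagios plugins 1.4.11.

wget http://osdn.dl.sourceforge.net/sourceforge/nagios/nagios-3.0rc1.tar.gz

wget http://osdn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.11.tar.gz

3) Compile and Install Nagios

Extract the Nagios source tarball.

cd ~/downloads

tar xzf nagios-3.0rc1.tar.gz

cd nagios-3.0rc1

Run the Nagios configure script, passing the name of the group you created earlier:

./configure --with-command-group=nagcmd

Compile the Nagios source code.

make all

Install the binaries, init script, and sample configuration files, and set permissions on the external command directory.

make install

make install-init

make install-config

make install-commandmode

Do not start Nagios yet; there is still more to do...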

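4) Customize the Configuration

The sample configuration files are installed in /usr/local/nagios/etc by default. They should work for getting started with Nagios; you only need to make one change...

Edit the /usr/local/nagios/etc/objects/contacts.cfg file with your favorite editor and change the e-mail address in the nagiosadmin contact definition to the address at which you want to receive alerts.

vi /usr/local/nagios/etc/objects/contacts.cfg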
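
If you would rather script this change than open an editor, something like the following sketch could work. It assumes the contact definition contains a line of the form "email <address>", optionally followed by a ";" comment; check your own contacts.cfg before relying on that. The function name set_contact_email is made up, every email line in the file is rewritten, and the address is inserted into a sed expression, so it must not contain "/" or "&".

```shell
# Replace the address on each "email" directive line of a contacts file
# (illustrative helper). $1: contacts file, $2: new e-mail address.
# The unmodified file is kept as "$1.orig".
set_contact_email() {
    file=$1
    addr=$2
    cp -p "$file" "$file.orig" || return 1
    # Keep the leading whitespace and the keyword, swap only the address token.
    sed "s/^\([[:space:]]*email[[:space:]][[:space:]]*\)[^[:space:];]*/\1$addr/" \
        "$file.orig" > "$file"
}

# Example (address is a placeholder):
# set_contact_email /usr/local/nagios/etc/objects/contacts.cfg admin@example.com
```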

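5) Configure the Web Interface

Install the Nagios web configuration file into the Apache conf.d directory.

make install-webconf

Create a nagiosadmin account for logging in to the Nagios web interface. Remember the password you assign to this account; you will need it later.

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache to make the new settings take effect.

service httpd restart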

6) Compile and Install the Nagios Plugins

Extract the Nagios plugins source tarball.

cd ~/downloads

tar xzf nagios-plugins-1.4.11.tar.gz

cd nagios-plugins-1.4.11

Compile and install the plugins.

./configure --with-nagios-user=nagios --with-nagios-group=nagios

make

make install

7) Start Nagios

Add Nagios to the list of system services so that it starts automatically at boot.

chkconfig --add nagios

chkconfig nagios on

Verify the sample Nagios configuration files.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, start Nagios.

service nagios start
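
The "verify, then start only if there were no errors" rule can be captured in a tiny wrapper. This is a sketch under one assumption that you should confirm on your system: that the verification command exits with a non-zero status when it finds errors. The function name verify_then_start is hypothetical.

```shell
# Run a start command only if the verification command succeeds
# (illustrative helper). $1: verification command line, $2: start command line.
verify_then_start() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "configuration check failed; not starting" >&2
        return 1
    fi
}

# Example, using the paths from this guide:
# verify_then_start "/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg" "service nagios start"
```

The verifier's own output is left visible so that any reported errors can be read directly.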

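8) Modify SELinux Settings

Fedora ships with SELinux (Security Enhanced Linux) installed and in Enforcing mode by default. This can result in "Internal Server Error" messages when you attempt to access the Nagios CGIs.

Check whether SELinux is in Enforcing mode.

getenforce

Put SELinux into Permissive mode.

setenforce 0

To make this change permanent, modify the settings in /etc/selinux/config and reboot.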
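
The permanent change amounts to editing the SELINUX= line of that file. The sketch below does this with sed and keeps a backup copy; the function name set_selinux_mode is made up for illustration, and it only rewrites the file, so a reboot is still needed as described above.

```shell
# Set the SELINUX= line in an SELinux config file (illustrative helper).
# $1: config file, $2: mode (enforcing, permissive, or disabled).
# The unmodified file is kept as "$1.bak".
set_selinux_mode() {
    file=$1
    mode=$2
    case $mode in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    cp -p "$file" "$file.bak" || return 1
    # "^SELINUX=" does not match the separate SELINUXTYPE= line.
    sed "s/^SELINUX=.*/SELINUX=$mode/" "$file.bak" > "$file"
}

# Example (as root): set_selinux_mode /etc/selinux/config permissive
```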

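Instead of disabling SELinux or setting it to Permissive mode, you can use the following commands to run the CGIs under SELinux Enforcing/targeted mode:

chcon -R -t httpd_sys_content_t /usr/local/nagios/sbin/

chcon -R -t httpd_sys_content_t /usr/local/nagios/share/

For more information on running the Nagios CGIs under Enforcing mode with a targeted policy, see the NagiosCommunity.org wiki at http://www.nagioscommunity.org/wiki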

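9) Log In to the Web Interface

You should now be able to access the Nagios web interface. Open the URL below in the system's default browser; you will be prompted for the username (nagiosadmin) and the password you set earlier.

http://localhost/nagios/

Click the "Service Detail" navigation link to see details of what is being monitored on your local machine. Give Nagios a few minutes to check all the services associated with your machine, since the checks take some time.

10) Other Modifications

Make sure your machine's firewall rules allow remote access to the Nagios web server.

Configuring e-mail notifications is outside the scope of this document. Refer to your system documentation, search the web, or look to the NagiosCommunity.org wiki for specific instructions on configuring your system to send e-mail to external addresses. More information on notifications can be found in the notifications documentation.

11) You're Done

Congratulations on successfully installing Nagios! Your work in network monitoring is just beginning. You will no doubt want to monitor more than just your local machine, so look at the following documentation...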

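4.5. openSUSE Quickstart

4.5.1. Introduction

This guide aims to give you simple instructions for installing Nagios from source on openSUSE and monitoring your local machine within 20 minutes. No advanced installation options are discussed here, only the basics that will work for 95% of users who want to get started.

These instructions were written for an openSUSE 10.2 system.

4.5.2. Required Packages

Make sure the following packages are installed on your openSUSE system before continuing. You can use yast to install packages under openSUSE.

  • apache2
  • C/C++ development libraries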

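4.5.3. Procedure

1) Create an Account

Become the root user.

su -l

Create a new user account named nagios and give it a password.

/usr/sbin/useradd nagios

passwd nagios

Create a new group named nagios and add the nagios user to it.

/usr/sbin/groupadd nagios

/usr/sbin/usermod -G nagios nagios

Create a new group named nagcmd to allow external commands to be submitted through the web interface. Add both the nagios user and the apache user to the group.

/usr/sbin/groupadd nagcmd

/usr/sbin/usermod -G nagcmd nagios

/usr/sbin/usermod -G nagcmd wwwrun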

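2) Download Nagios and the Plugins

Create a directory for storing the downloads.

mkdir ~/downloads

cd ~/downloads

Download the source tarballs of Nagios and the Nagios plugins (visit http://www.nagios.org/download/ for the latest versions). At the time of writing, the latest versions were Nagios 3.0rc1 and Nagios plugins 1.4.11.

wget http://osdn.dl.sourceforge.net/sourceforge/nagios/nagios-3.0rc1.tar.gz

wget http://osdn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.11.tar.gz

3) Compile and Install Nagios

Extract the Nagios source tarball.

cd ~/downloads

tar xzf nagios-3.0rc1.tar.gz

cd nagios-3.0rc1

Run the Nagios configure script, passing the name of the group you created earlier:

./configure --with-command-group=nagcmd

Compile the Nagios source code.

make all

Install the binaries, init script, and sample configuration files, and set permissions on the external command directory.

make install

make install-init

make install-config

make install-commandmode

Do not start Nagios yet; there is still more to do...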

4)客户化配置

样例配置文件默认安装在这个目录下/usr/local/nagios/etc,这些样例文件可以配置Nagios使之正常运行,只需要做一个简单的修改...

用你擅长的编辑器软件来编辑这个/usr/local/nagios/etc/objects/contacts.cfg配置文件,更改email地址nagiosadmin的联系人定义信息中的EMail信息为你的EMail信息以接收报警内容。

vi /usr/local/nagios/etc/objects/contacts.cfg

5) Configure the web interface

Install the Nagios web config file in the Apache conf.d directory.

make install-webconf

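Create a nagiosadmin account for logging in to the Nagios web interface. Remember the password you assign to this account; you'll need it later.

htpasswd2 -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache to make the new settings take effect.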

service apache2 restart

6) Compile and install the Nagios plugins

Extract the Nagios plugins source code tarball.

cd ~/downloads

tar xzf nagios-plugins-1.4.11.tar.gz

cd nagios-plugins-1.4.11

Compile and install the plugins.

./configure --with-nagios-user=nagios --with-nagios-group=nagios

make

make install

7) Start Nagios

Add Nagios to the list of system services so that it starts automatically when the system boots.

chkconfig --add nagios

chkconfig nagios on

Verify the sample Nagios configuration files.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, start Nagios.

service nagios start

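8) Log in to the web interface

You can now access the Nagios web interface. When prompted, enter the username (nagiosadmin) and the password you set earlier. Open the following URL in your system's default browser: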

konqueror http://localhost/nagios/

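Click the "Service Detail" navigation link to see details of what is being monitored on your local machine. Nagios needs some time to check the services associated with your machine, so the results will not all appear at once.

9) Other modifications

Make sure your machine's firewall rules are configured to allow access to the Nagios web server if you want to reach the web interface remotely.

You can do this as follows:

  1. Open the control center
  2. Select 'Open Administrator Settings' to open the YaST administrator control center
  3. Select 'Firewall' under the 'Security and Users' category
  4. Click the 'Allowed Services' option in the firewall configuration window
  5. Add 'HTTP Server' to the allowed services list for the 'External Zone'
  6. Click 'Next' and 'Accept' to activate the new firewall settings

Configuring email notifications is outside the scope of this documentation. Refer to your system documentation, search the web, or look to the NagiosCommunity.org wiki for specific instructions on configuring your openSUSE system to send email messages to external addresses.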

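4.6. Ubuntu Quickstart

4.6.1. Introduction

This guide is intended to give you simple instructions for installing Nagios from source on Ubuntu and having it monitor your local machine within 20 minutes. No advanced installation options are discussed here, only the basics that will work for 95% of users who want to get started.

These instructions were written for an Ubuntu 6.10 (desktop) installation.

What You'll End Up With

If you follow these instructions, here's what you'll end up with:

  1. Nagios and the plugins will be installed underneath /usr/local/nagios
  2. Nagios will be configured to monitor a few aspects of your local system (CPU load, disk usage, etc.)
  3. The Nagios web interface will be accessible at http://localhost/nagios/

4.6.2. Required Packages

Make sure the following packages are installed on your system before continuing.

  1. Apache2
  2. GCC compiler and development libraries
  3. GD library and development libraries

You can use apt-get to install these packages by running the following commands: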

sudo apt-get install apache2

sudo apt-get install build-essential

sudo apt-get install libgd2-dev

4.6.3. Procedure

1) Create account information

Become the root user.

sudo -s

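Create a new account named nagios and give it a password.

/usr/sbin/useradd nagios

passwd nagios

On Ubuntu server edition (6.01 and possibly newer versions), you also need to create a group named nagios, because it is not created by default. You should be able to skip this step on desktop editions of Ubuntu.

/usr/sbin/groupadd nagios

/usr/sbin/usermod -G nagios nagios

Create a new group named nagcmd that allows external commands to be submitted through the web interface. Add both the nagios user and the Apache user to this group.

/usr/sbin/groupadd nagcmd

/usr/sbin/usermod -G nagcmd nagios

/usr/sbin/usermod -G nagcmd www-data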

2) Download Nagios and the plugins

Create a directory for storing the downloads.

mkdir ~/downloads

cd ~/downloads

Download the source code tarballs of both Nagios and the Nagios plugins (visit http://www.nagios.org/download/ for the latest versions). At the time of writing, the latest versions were Nagios 3.0rc1 and Nagios plugins 1.4.11.

wget http://osdn.dl.sourceforge.net/sourceforge/nagios/nagios-3.0rc1.tar.gz

wget http://osdn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.11.tar.gz

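3) Compile and install Nagios

Extract the Nagios source code tarball.

cd ~/downloads

tar xzf nagios-3.0rc1.tar.gz

cd nagios-3.0rc1

Run the Nagios configure script, passing the name of the group you created earlier:

./configure --with-command-group=nagcmd

Compile the Nagios source code.

make all

Install the binaries, init script, and sample config files, and set permissions on the external command directory.

make install

make install-init

make install-config

make install-commandmode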

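Don't start Nagios yet; there is still more to do...

4) Customize the configuration

Sample configuration files have now been installed in the /usr/local/nagios/etc directory. These sample files should work fine for getting started with Nagios. You need to make just one change before you proceed...

Edit the /usr/local/nagios/etc/objects/contacts.cfg config file with your favorite editor and change the email address associated with the nagiosadmin contact definition to the address at which you'd like to receive alerts.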

vi /usr/local/nagios/etc/objects/contacts.cfg

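5) Configure the web interface

Install the Nagios web config file in the Apache conf.d directory.

make install-webconf

Create a nagiosadmin account for logging in to the Nagios web interface. Remember the password you assign to this account; you'll need it later.

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache to make the new settings take effect.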

/etc/init.d/apache2 reload

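6) Compile and install the Nagios plugins

Extract the Nagios plugins source code tarball.

cd ~/downloads

tar xzf nagios-plugins-1.4.11.tar.gz

cd nagios-plugins-1.4.11

Compile and install the plugins.

./configure --with-nagios-user=nagios --with-nagios-group=nagios

make

make install

7) Start Nagios

Configure Nagios to start automatically when the system boots.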

ln -s /etc/init.d/nagios /etc/rcS.d/S99nagios

Verify the sample Nagios configuration files.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, start Nagios.

/etc/init.d/nagios start

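8) Log in to the web interface

You can now access the Nagios web interface. When prompted, enter the username (nagiosadmin) and the password you set earlier. Open the following URL in your system's default browser:

http://localhost/nagios/

Click the "Service Detail" navigation link to see details of what is being monitored on your local machine. Nagios needs some time to check the services associated with your machine, so the results will not all appear at once.

9) Other modifications

If you want to receive email notifications for Nagios alerts, you need to install the mailx (Postfix) package.

sudo apt-get install mailx

You also need to edit the Nagios email notification commands found in /usr/local/nagios/etc/objects/commands.cfg and change every '/bin/mail' reference to '/usr/bin/mail'. Once you have done that, restart Nagios to make the configuration changes take effect.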

sudo /etc/init.d/nagios restart
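The '/bin/mail' replacement described above can also be scripted instead of edited by hand. The following is only a sketch: fix_mail_path is a hypothetical helper, the file to edit is passed as an argument (use the commands.cfg path of your own installation), and a .bak copy is kept next to it.

```shell
# Hypothetical helper: change '/bin/mail' references to '/usr/bin/mail' in file $1.
# A backup is written to "$1.bak" first. References that already read
# '/usr/bin/mail' are left alone, so running the helper twice is harmless.
fix_mail_path() {
    cfg="$1"
    cp "$cfg" "$cfg.bak" &&
    sed -e 's|^/bin/mail|/usr/bin/mail|' \
        -e 's|\([^/[:alnum:]_.-]\)/bin/mail|\1/usr/bin/mail|g' \
        "$cfg.bak" > "$cfg"
}

# Example usage (run as root, then restart Nagios as shown above):
#   fix_mail_path /path/to/commands.cfg
```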

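Configuring email notifications is outside the scope of this documentation. Refer to your system documentation, search the web, or look to the NagiosCommunity.org wiki for specific instructions on configuring your Ubuntu system to send email messages to external addresses.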

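4.7. Monitoring Windows Machines

4.7.1. Introduction

This document describes how you can monitor local services and attributes of Windows machines, including:

  1. Memory usage
  2. CPU load
  3. Disk usage
  4. Service states
  5. Running processes
  6. etc.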

Publicly available services that are provided by Windows machines (HTTP, FTP, POP3, etc.) can be monitored easily by following the documentation on monitoring publicly available services.

Note: These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed if you follow the quickstart.

4.7.2. Overview

Monitoring private services or attributes of a Windows machine requires that you install an agent on it. This agent acts as a proxy between the Nagios plugin that does the monitoring and the actual service or attribute of the Windows machine. Without installing an agent on the Windows box, Nagios would be unable to monitor private services or attributes of the Windows box.

For this example, we will be installing the NSClient++ addon on the Windows machine and using the check_nt plugin to communicate with the NSClient++ addon. The check_nt plugin should already be installed on the Nagios server if you followed the quickstart guide.

Other Windows agents (like NC_Net) could be used instead of NSClient++ if you wish - provided you change command and service definitions, etc. a bit. For the sake of simplicity I will only cover using the NSClient++ addon in these instructions.

4.7.3. Steps

There are several steps you'll need to follow in order to monitor a new Windows machine. They are:

  1. Perform first-time prerequisites
  2. Install a monitoring agent on the Windows machine
  3. Create new host and service definitions for monitoring the Windows machine
  4. Restart the Nagios daemon

4.7.4. What's Already Done For You

To make your life a bit easier, a few configuration tasks have already been done for you:

  1. A check_nt command definition has been added to the commands.cfg file. This allows you to use the check_nt plugin to monitor Windows services.
  2. A Windows server host template (called windows-server) has already been created in the templates.cfg file. This allows you to add new Windows host definitions in a simple manner.

The above-mentioned config files can be found in the /usr/local/nagios/etc/objects/ directory. You can modify the definitions in these and other definitions to suit your needs better if you'd like. However, I'd recommend waiting until you're more familiar with configuring Nagios before doing so. For the time being, just follow the directions outlined below and you'll be monitoring your Windows boxes in no time.

4.7.5. Prerequisites

The first time you configure Nagios to monitor a Windows machine, you'll need to do a bit of extra work. Remember, you only need to do this for the *first* Windows machine you monitor.

Edit the main Nagios config file.

vi /usr/local/nagios/etc/nagios.cfg

Remove the leading pound (#) sign from the following line in the main configuration file:

#cfg_file=/usr/local/nagios/etc/objects/windows.cfg

Save the file and exit.

What did you just do? You told Nagios to look to the /usr/local/nagios/etc/objects/windows.cfg to find additional object definitions. That's where you'll be adding Windows host and service definitions. That configuration file already contains some sample host, hostgroup, and service definitions. For the *first* Windows machine you monitor, you can simply modify the sample host and service definitions in that file, rather than creating new ones.
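If you prefer not to edit nagios.cfg by hand, the same uncommenting step can be sketched with sed. This is an illustration under assumptions: enable_cfg_file is a hypothetical helper name, it expects the commented line to start exactly with "#cfg_file=", and it keeps a .bak copy of the main config file.

```shell
# Hypothetical helper: uncomment "#cfg_file=<object file>" in a main config file.
#   $1 = main config file (for example /usr/local/nagios/etc/nagios.cfg)
#   $2 = object config file named on the commented-out line
enable_cfg_file() {
    main_cfg="$1"
    object_cfg="$2"
    # Keep a backup, then rewrite the file with the leading '#' removed.
    cp "$main_cfg" "$main_cfg.bak" &&
    sed "s|^#cfg_file=$object_cfg\$|cfg_file=$object_cfg|" "$main_cfg.bak" > "$main_cfg"
}

# Example usage:
#   enable_cfg_file /usr/local/nagios/etc/nagios.cfg \
#       /usr/local/nagios/etc/objects/windows.cfg
```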

4.7.6. Installing the Windows Agent

Before you can begin monitoring private services and attributes of Windows machines, you'll need to install an agent on those machines. I recommend using the NSClient++ addon, which can be found at http://sourceforge.net/projects/nscplus. These instructions will take you through a basic installation of the NSClient++ addon, as well as the configuration of Nagios for monitoring the Windows machine.

1. Download the latest stable version of the NSClient++ addon from http://sourceforge.net/projects/nscplus

2. Unzip the NSClient++ files into a new C:\NSClient++ directory

3. Open a command prompt and change to the C:\NSClient++ directory

4. Register the NSClient++ system service with the following command:

	nsclient++ /install

5. Install the NSClient++ systray with the following command ('SysTray' is case-sensitive):

	nsclient++ SysTray

6. Open the services manager and make sure the NSClientpp service is allowed to interact with the desktop (see the 'Log On' tab of the services manager). If it isn't already allowed to interact with the desktop, check the box to allow it to.

7. Edit the NSC.INI file (located in the C:\NSClient++ directory) and make the following changes:

  1. Uncomment all the modules listed in the [modules] section, except for CheckWMI.dll and RemoteConfiguration.dll
  2. Optionally require a password for clients by changing the 'password' option in the [Settings] section.
  3. Uncomment the 'allowed_hosts' option in the [Settings] section. Add the IP address of the Nagios server to this line, or leave it blank to allow all hosts to connect.
  4. Make sure the 'port' option in the [NSClient] section is uncommented and set to '12489' (the default port).

8. Start the NSClient++ service with the following command:

	nsclient++ /start

9. If installed properly, a new icon should appear in your system tray. It will be a yellow circle with a black 'M' inside.

10. Success! The Windows server can now be added to the Nagios monitoring configuration...

4.7.7. Configuring Nagios

Now it's time to define some object definitions in your Nagios configuration files in order to monitor the new Windows machine.

Open the windows.cfg file for editing.

vi /usr/local/nagios/etc/objects/windows.cfg

Add a new host definition for the Windows machine that you're going to monitor. If this is the *first* Windows machine you're monitoring, you can simply modify the sample host definition in windows.cfg. Change the host_name, alias, and address fields to appropriate values for the Windows box.

define host{

use windows-server ; Inherit default values from a Windows server template (make sure you keep this line!)

host_name winserver

alias My Windows Server

address 192.168.1.2

}

Good. Now you can add some service definitions (to the same configuration file) in order to tell Nagios to monitor different aspects of the Windows machine. If this is the *first* Windows machine you're monitoring, you can simply modify the sample service definitions in windows.cfg.

Note: Replace "winserver" in the example definitions below with the name you specified in the host_name directive of the host definition you just added.

Add the following service definition to monitor the version of the NSClient++ addon that is running on the Windows server. This is useful when it comes time to upgrade your Windows servers to a newer version of the addon, as you'll be able to tell which Windows machines still need to be upgraded to the latest version of NSClient++.

define service{

use generic-service

host_name winserver

service_description NSClient++ Version

check_command check_nt!CLIENTVERSION

}

Add the following service definition to monitor the uptime of the Windows server.

define service{

use generic-service

host_name winserver

service_description Uptime

check_command check_nt!UPTIME

}

Add the following service definition to monitor the CPU utilization on the Windows server and generate a CRITICAL alert if the 5-minute CPU load is 90% or more or a WARNING alert if the 5-minute load is 80% or greater.

define service{

use generic-service

host_name winserver

service_description CPU Load

check_command check_nt!CPULOAD!-l 5,80,90

}
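In each check_command value, Nagios splits the string on the '!' character: the first field names the command, and the remaining fields become $ARG1$, $ARG2$, and so on. The sketch below imitates that expansion for check_nt so you can see which command line ends up being run. It rests on assumptions: expand_check_nt is a hypothetical helper, $USER1$ is taken to be /usr/local/nagios/libexec, and the check_nt command_line is taken to be "$USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$" (the form shown in the Password Protection section, without the -s option).

```shell
# Hypothetical helper: print the command line that a check_command value expands to.
#   $1 = host address, $2 = check_command value from a service definition
# The function body runs in a subshell, so changing IFS does not leak out.
expand_check_nt() (
    host_address="$1"
    plugin_dir="/usr/local/nagios/libexec"   # assumed value of $USER1$
    IFS='!'
    set -- $2                                # $1 = command name, $2 = $ARG1$, $3 = $ARG2$
    echo "$plugin_dir/$1 -H $host_address -p 12489 -v $2 $3"
)

expand_check_nt 192.168.1.2 'check_nt!CPULOAD!-l 5,80,90'
# prints: /usr/local/nagios/libexec/check_nt -H 192.168.1.2 -p 12489 -v CPULOAD -l 5,80,90
```

Running the printed command by hand on the Nagios server is a quick way to test a definition before restarting Nagios.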

Add the following service definition to monitor memory usage on the Windows server and generate a CRITICAL alert if memory usage is 90% or more or a WARNING alert if memory usage is 80% or greater.

define service{

use generic-service

host_name winserver

service_description Memory Usage

check_command check_nt!MEMUSE!-w 80 -c 90

}

Add the following service definition to monitor usage of the C:\ drive on the Windows server and generate a CRITICAL alert if disk usage is 90% or more or a WARNING alert if disk usage is 80% or greater.

define service{

use generic-service

host_name winserver

service_description C:\ Drive Space

check_command check_nt!USEDDISKSPACE!-l c -w 80 -c 90

}

Add the following service definition to monitor the W3SVC service state on the Windows machine and generate a CRITICAL alert if the service is stopped.

define service{

use generic-service

host_name winserver

service_description W3SVC

check_command check_nt!SERVICESTATE!-d SHOWALL -l W3SVC

}

Add the following service definition to monitor the Explorer.exe process on the Windows machine and generate a CRITICAL alert if the process is not running.

define service{

use generic-service

host_name winserver

service_description Explorer

check_command check_nt!PROCSTATE!-d SHOWALL -l Explorer.exe

}

That's it for now. You've added some basic services that should be monitored on the Windows box. Save the configuration file.

4.7.8. Password Protection

If you specified a password in the NSClient++ configuration file on the Windows machine, you'll need to modify the check_nt command definition to include the password. Open the commands.cfg file for editing.

vi /usr/local/nagios/etc/objects/commands.cfg

Change the definition of the check_nt command to include the "-s <PASSWORD>" argument (where PASSWORD is the password you specified on the Windows machine) like this:

define command{

command_name check_nt

command_line $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -s PASSWORD -v $ARG1$ $ARG2$

}

Save the file.

4.7.9. Restarting Nagios

You're done with modifying the Nagios configuration, so you'll need to verify your configuration files and restart Nagios.

If the verification process produces any error messages, fix your configuration file before continuing. Make sure that you don't (re)start Nagios until the verification process completes without any errors!
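The verify-then-restart sequence can be sketched as a small shell function. This is an illustration under assumptions: the paths are the quickstart defaults used elsewhere in this guide, /etc/init.d/nagios is the init script installed by make install-init, and safe_restart and the three NAGIOS_* variables are hypothetical names introduced here.

```shell
# Hypothetical helper: restart Nagios only if the configuration check passes.
# The NAGIOS_* variables are optional overrides, not Nagios settings.
safe_restart() {
    nagios_bin="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
    nagios_cfg="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
    nagios_init="${NAGIOS_INIT:-/etc/init.d/nagios}"

    # "nagios -v <main config>" verifies the configuration without starting the daemon.
    if "$nagios_bin" -v "$nagios_cfg"; then
        "$nagios_init" restart
    else
        echo "Configuration check failed; Nagios was not restarted." >&2
        return 1
    fi
}

# Example usage (as root):
#   safe_restart
```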

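4.8. Monitoring Linux/Unix Machines

4.8.1. Introduction

This document describes how you can monitor "private" services and attributes of Linux/UNIX servers, such as:

  1. CPU load
  2. Memory usage
  3. Disk usage
  4. Logged in users
  5. Running processes

Publicly available services that are provided by Linux servers (HTTP, FTP, SSH, SMTP, etc.) can be monitored by following the documentation on monitoring publicly available services.

Note

These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed if you follow the quickstart.

4.8.2. Overview

[Note: This document has not been completed. I recommend reading the documentation on the NRPE addon for instructions on how to monitor a remote Linux/Unix server.]

There are several different ways to monitor services and attributes of remote Linux/UNIX servers. One is to use shared SSH keys and the check_by_ssh plugin to execute checks on the remote hosts. This method is not covered here; note that it can put a high load on the Nagios monitoring server, especially if you are monitoring thousands of services across hundreds of hosts, because of the overhead of setting up and tearing down SSH connections.

Another method is to monitor remote hosts with the NRPE addon. NRPE allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources such as disk usage, CPU load, and memory usage on a remote host in the same way you would on the local machine.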

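4.9. Monitoring Routers and Switches

4.9.1. Introduction

This document describes how you can monitor the status of network switches and routers. Some cheaper "unmanaged" switches and hubs cannot be assigned IP addresses and are essentially invisible on your network, so there is no way to monitor them. More expensive switches and routers have addresses assigned to them and can be monitored by pinging them or by using SNMP to query status information.

The following sections describe how you can monitor these things on managed switches, hubs, and routers:

  1. Packet loss and round trip average (RTA)
  2. SNMP status information
  3. Bandwidth / traffic rate

Note

These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed if you follow the quickstart.

4.9.2. Overview

Monitoring switches and routers can be either easy or more involved, depending on what equipment you have and what you want to monitor. As they are critical infrastructure components, you will no doubt want to monitor them in at least some basic manner.

Switches and routers can be monitored easily by "pinging" them to determine packet loss, RTA, etc. If your switch supports SNMP, you can monitor port status, etc. with the check_snmp plugin, and bandwidth (if you are using MRTG) with the check_mrtgtraf plugin.

The check_snmp plugin only gets compiled and installed if you have the net-snmp and net-snmp-utils packages installed on your system. Make sure the plugin exists in /usr/local/nagios/libexec before you continue. If it doesn't, install net-snmp and net-snmp-utils, then recompile and reinstall the Nagios plugins.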
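The check for the plugin file mentioned above can be sketched in shell. This is only an illustration: plugin_installed is a hypothetical helper name, and the default directory is the quickstart install location /usr/local/nagios/libexec.

```shell
# Hypothetical helper: succeeds if plugin $1 exists and is executable.
#   $2 = optional plugin directory (defaults to the quickstart location)
plugin_installed() {
    plugin_dir="${2:-/usr/local/nagios/libexec}"
    [ -x "$plugin_dir/$1" ]
}

# Example usage:
#   plugin_installed check_snmp || echo "check_snmp is missing: install net-snmp and rebuild the plugins"
```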

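4.9.3. Steps

There are several steps you'll need to follow in order to monitor a new router or switch. They are:

  1. Perform first-time prerequisites;
  2. Create new host and service definitions for monitoring the device;
  3. Restart the Nagios daemon.

4.9.4. What's Already Done For You

To make your life a bit easier, a few configuration tasks have already been done for you:

  1. Two command definitions (check_snmp and check_local_mrtgtraf) have been added to the commands.cfg file. These allow you to use the check_snmp and check_mrtgtraf plugins to monitor switches and routers.
  2. A switch host template (called generic-switch) has already been created in the templates.cfg file. This allows you to add new router and switch host definitions in a simple manner.

The above-mentioned config files can be found in the /usr/local/nagios/etc/objects/ directory. You can modify these definitions, or add others that suit your needs better, if you'd like. However, I'd recommend waiting until you're more familiar with configuring Nagios before doing so. For the time being, just follow the directions outlined below to start monitoring the routers and switches on your network.

4.9.5. Prerequisites

The first time you configure Nagios to monitor a network switch, you'll need to do a bit of extra work. Remember, you only need to do this for the *first* switch you monitor.

Edit the main Nagios config file.

vi /usr/local/nagios/etc/nagios.cfg

Remove the leading pound (#) sign from the following line in the main configuration file: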

#cfg_file=/usr/local/nagios/etc/objects/switch.cfg

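Save the file and exit.

What did you just do? You told Nagios to look to the /usr/local/nagios/etc/objects/switch.cfg config file to find additional object definitions. That's where you'll be adding host and service definitions for routers and switches. That configuration file already contains some sample host, hostgroup, and service definitions. For the *first* router or switch you monitor, it is best to modify the sample host and service definitions in that file rather than creating new ones.

4.9.6. Configuring Nagios

You'll need to create some object definitions in order to monitor a new router or switch.

Open the switch.cfg file for editing.

vi /usr/local/nagios/etc/objects/switch.cfg

Add a new host definition for the switch that you're going to monitor. If this is the *first* switch you're monitoring, you can simply modify the sample host definition in switch.cfg. Change the host_name, alias, and address fields to appropriate values for the switch.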

define host{

use generic-switch ; Inherit default values from a template

host_name linksys-srw224p ; The name we're giving to this switch

alias Linksys SRW224P Switch ; A longer name associated with the switch

address 192.168.1.253 ; IP address of the switch

hostgroups allhosts,switches ; Host groups this switch is associated with

}

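4.9.7. Monitoring Services

Now you can add some service definitions (to the same configuration file) to monitor different aspects of the switch. If this is the *first* switch you're monitoring, you can simply modify the sample service definitions in switch.cfg.

Note

Replace "linksys-srw224p" in the example definitions below with the name you specified in the host_name directive of the host definition you just added.

4.9.8. Monitoring Packet Loss and RTA

Add the following service definition to monitor packet loss and round trip average (RTA) between the Nagios host and the switch every 5 minutes under normal conditions.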

define service{

use generic-service ; Inherit values from a template

host_name linksys-srw224p ; The name of the host the service is associated with

service_description PING ; The service description

check_command check_ping!200.0,20%!600.0,60% ; The command used to monitor the service

normal_check_interval 5 ; Check the service every 5 minutes under normal conditions

retry_check_interval 1 ; Re-check the service every minute until its final/hard state is determined

}

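This service will be:

  1. CRITICAL if the RTA is greater than 600 ms or the packet loss is 60% or more;
  2. WARNING if the RTA is greater than 200 ms or the packet loss is 20% or more;
  3. OK if the RTA is not greater than 200 ms and the packet loss is less than 20%.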
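The three rules above can be written out as a small decision function. This is a sketch of the rules exactly as stated in this section, not the check_ping source code; classify_ping is a hypothetical helper name, and the thresholds are the ones from the example service definition (200.0,20% and 600.0,60%).

```shell
# Hypothetical helper: map an RTA (ms) and a packet loss (%) to a state name,
# using the thresholds of the example PING service.
#   $1 = round trip average in milliseconds, $2 = packet loss in percent
classify_ping() {
    awk -v rta="$1" -v loss="$2" 'BEGIN {
        if (rta + 0 > 600 || loss + 0 >= 60)      print "CRITICAL"
        else if (rta + 0 > 200 || loss + 0 >= 20) print "WARNING"
        else                                      print "OK"
    }'
}

# Example usage:
#   classify_ping 250.5 0    # a slow but loss-free link
```

The CRITICAL test comes first so that a result exceeding both sets of thresholds is reported with the more severe state.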

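4.9.9. Monitoring SNMP Status Information

If your switch or router supports SNMP, you can monitor a lot of information by using the check_snmp plugin. If it doesn't, skip this section.

Add the following service definition, which monitors the uptime of the switch, to the switch configuration file you just modified.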

define service{

use generic-service ; Inherit values from a template

host_name linksys-srw224p

service_description Uptime

check_command check_snmp!-C public -o sysUpTime.0

}

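In the check_command directive of the service definition above, "-C public" tells the plugin that the SNMP community name is "public", and "-o sysUpTime.0" indicates which OID should be checked (translator's note: the MIB node).

If you want to ensure that a specific port or interface on the switch is in an up state, you could add a service definition like this: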

define service{

use generic-service ; Inherit values from a template

host_name linksys-srw224p

service_description Port 1 Link Status

check_command check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB

}

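In the example above, "-o ifOperStatus.1" refers to the OID for the operational status of port 1 on the switch. The "-r 1" option tells the check_snmp plugin to return an OK state if "1" is found in the SNMP result (1 indicates an "up" state on the port) and a CRITICAL state if it isn't found. The "-m RFC1213-MIB" option is optional; it tells the check_snmp plugin to load only the "RFC1213-MIB" instead of every MIB installed on your system, which can help speed things up.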
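The effect of "-r 1" described above can be imitated in a few lines of shell. This is a simplified illustration, not the plugin's implementation: link_state is a hypothetical helper, and the sample values "up(1)" and "down(2)" are assumed examples of how an ifOperStatus result may be printed.

```shell
# Hypothetical helper: report OK when the SNMP result text matches the regular
# expression "1", and CRITICAL otherwise, mirroring the description of "-r 1".
#   $1 = text returned by the SNMP query, for example "up(1)" or "down(2)"
link_state() {
    if printf '%s\n' "$1" | grep -Eq '1'; then
        echo OK
    else
        echo CRITICAL
    fi
}

# Example usage:
#   link_state 'up(1)'
#   link_state 'down(2)'
```

Because the match is a plain regular expression, any result text containing the digit 1 counts as a match; keep that in mind when choosing a pattern for other OIDs.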

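That's it for the SNMP monitoring example. There are hundreds, even thousands, of things that can be monitored via SNMP, so it's up to you to decide what you need and want to monitor. Good luck!

Tip

You can usually find the OIDs that can be monitored on a switch by running the following command (replace 192.168.1.253 with the IP address of the switch): snmpwalk -v1 -c public 192.168.1.253 -m ALL .1

4.9.10. Monitoring Bandwidth / Traffic Rate

If you graph the bandwidth usage of your switches or routers with MRTG, you can have Nagios alert you when traffic rates exceed thresholds you specify. The check_mrtgtraf plugin (which is included in the Nagios plugins distribution) allows you to do this.

You'll need to let the check_mrtgtraf plugin know which log file the MRTG data is stored in, along with thresholds, etc. In this example, I'm monitoring one of the ports on a Linksys switch. The MRTG log file is stored in /var/lib/mrtg/192.168.1.253_1.log. Here's the service definition I use to monitor the bandwidth data stored in that log file...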

define service{

use generic-service ; Inherit values from a template

host_name linksys-srw224p

service_description Port 1 Bandwidth Usage

check_command check_local_mrtgtraf!/var/lib/mrtg/192.168.1.253_1.log!AVG!1000000,2000000!5000000,5000000!10

}

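In the example above, the "/var/lib/mrtg/192.168.1.253_1.log" argument passed to the check_local_mrtgtraf command tells the plugin which MRTG log file to read from. The "AVG" argument tells it to use average bandwidth statistics. The "1000000,2000000" argument sets the warning thresholds (in bytes) for the incoming and outgoing traffic rates, and "5000000,5000000" sets the critical thresholds (in bytes) for the incoming and outgoing traffic rates. The "10" argument causes the plugin to return a CRITICAL state if the MRTG log file is more than 10 minutes old (it should be updated every 5 minutes).

Save the file.

4.9.11. Restarting Nagios

Once you've added the new host and service definitions to the switch.cfg file, you're ready to start monitoring the router or switch. To do this, you'll need to verify your configuration and restart Nagios.

If the verification process produces any error messages, fix your configuration file before continuing. Make sure that you don't (re)start Nagios until the verification process completes without any errors!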

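4.10. Monitoring Network Printers

4.10.1. Introduction

This document describes how you can monitor the status of networked printers. Specifically, it covers HP printers that have internal or external JetDirect cards, as well as other print servers (like the Troy PocketPro 100S or the Netgear PS101) that support the JetDirect protocol.

The check_hpjd plugin (which is part of the standard Nagios plugins distribution) allows you to monitor the status of JetDirect-capable printers that have SNMP enabled. The plugin is capable of detecting the following printer states:

  1. Paper jam
  2. Out of paper
  3. Printer offline
  4. Intervention required
  5. Toner low
  6. Insufficient memory
  7. Open door
  8. Output tray is full
  9. and more...

Note

These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed if you follow the quickstart.

4.10.2. Overview

Monitoring the status of a networked printer is pretty simple. JetDirect-enabled printers usually have SNMP enabled, which allows Nagios to check their status with the check_hpjd plugin.

The check_hpjd plugin only gets compiled and installed if you have the net-snmp and net-snmp-utils packages installed on your system. Make sure the check_hpjd file exists in /usr/local/nagios/libexec before you continue. If it doesn't, install net-snmp and net-snmp-utils, then recompile and reinstall the Nagios plugins.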

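4.10.3. Steps

There are several steps you'll need to follow in order to monitor a new network printer. They are:

  1. Perform first-time prerequisites;
  2. Create new host and service definitions for monitoring the printer;
  3. Restart the Nagios daemon.

4.10.4. What's Already Done For You

To make your life a bit easier, a few configuration tasks have already been done for you:

  1. A check_hpjd command definition has been added to the commands.cfg file. This allows you to use the check_hpjd plugin to monitor network printers;
  2. A printer host template (called generic-printer) has already been created in the templates.cfg file. This allows you to add new printer host definitions in a simple manner.

The above-mentioned config files can be found in the /usr/local/nagios/etc/objects/ directory. You can modify the definitions in these files to suit your needs better if you'd like. However, I'd recommend waiting until you're more familiar with configuring Nagios before doing so. For the time being, just follow the directions outlined below to start monitoring your network printers.

4.10.5. Prerequisites

The first time you configure Nagios to monitor a network printer, you'll need to do a bit of extra work. Remember, you only need to do this for the *first* printer you monitor.

Edit the main Nagios config file.

vi /usr/local/nagios/etc/nagios.cfg

Remove the leading pound (#) sign from the following line in the main configuration file: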

#cfg_file=/usr/local/nagios/etc/objects/printer.cfg

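Save the file and exit.

What did you just do? You told Nagios to look to the /usr/local/nagios/etc/objects/printer.cfg file to find additional object definitions. That's where you'll be adding host and service definitions for network printers. That configuration file already contains some sample host, hostgroup, and service definitions. For the *first* printer you monitor, you can simply modify the sample definitions in that file rather than creating new ones.

4.10.6. Configuring Nagios

You'll need to create some object definitions in order to monitor a new printer.

Open the printer.cfg file for editing.

vi /usr/local/nagios/etc/objects/printer.cfg

Add a new host definition for the networked printer that you're going to monitor. If this is the *first* printer you're monitoring, you can simply modify the sample host definition in printer.cfg. Change the host_name, alias, and address fields to appropriate values for the printer.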

define host{

use generic-printer ; Inherit default values from a template

host_name hplj2605dn ; The name we're giving to this printer

alias HP LaserJet 2605dn ; A longer name associated with the printer

address 192.168.1.30 ; IP address of the printer

hostgroups allhosts ; Host groups this printer is associated with

}

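Now you can add some service definitions (to the same configuration file) to monitor different aspects of the printer. If this is the *first* printer you're monitoring, you can simply modify the sample service definitions in printer.cfg.

Note

Replace "hplj2605dn" in the example definitions below with the name you specified in the host_name directive of the host definition you just added.

Add the following service definition to check the status of the printer. The service uses the check_hpjd plugin to check the printer status every 10 minutes by default. The SNMP community string used to query the printer is "public".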

define service{

use generic-service ; Inherit values from a template

host_name hplj2605dn ; The name of the host the service is associated with

service_description Printer Status ; The service description

check_command check_hpjd!-C public ; The command used to monitor the service

normal_check_interval 10 ; Check the service every 10 minutes under normal conditions

retry_check_interval 1 ; Re-check the service every minute until its final/hard state is determined

}

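Add the following service definition to ping the printer every 10 minutes by default. This is useful for monitoring RTA, packet loss, and general network connectivity.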

define service{

use generic-service

host_name hplj2605dn

service_description PING

check_command check_ping!3000.0,80%!5000.0,100%

normal_check_interval 10

retry_check_interval 1

}

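Save the file.

4.10.7. Restarting Nagios

Once you've added the new host and service definitions to the printer.cfg file, you're ready to start monitoring the printer. To do this, you'll need to verify your configuration and restart Nagios.

If the verification process produces any error messages, fix your configuration file before continuing. Make sure that you don't (re)start Nagios until the verification process completes without any errors!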

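4.11. Monitoring Netware Servers

4.11.1. Introduction

This document describes how you can monitor "private" services and attributes of Netware servers, such as:

  1. Memory usage
  2. Processor utilization
  3. Buffer usage
  4. Active connections
  5. Disk volume usage

Publicly available services that are provided by Netware servers (HTTP, FTP, etc.) can be monitored by following the documentation on monitoring publicly available services.

4.11.2. Overview

TODO...

Note

I'm looking for a volunteer to write this HOWTO. I only have access to an old Netware 4.11 server, so I can't keep this document current. If you can update it, please post it to the NagiosCommunity wiki.

4.12. Monitoring Publicly Available Services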

4.12.1. Introduction

This document describes how you can monitor publicly available services, applications and protocols. By "public" I mean services that are accessible across the network - either the local network or the greater Internet. Examples of public services include HTTP, POP3, IMAP, FTP, and SSH. There are many more public services that you probably use on a daily basis. These services and applications, as well as their underlying protocols, can usually be monitored by Nagios without any special access requirements.

Private services, in contrast, cannot be monitored with Nagios without an intermediary agent of some kind. Examples of private services associated with hosts are things like CPU load, memory usage, disk usage, current user count, process information, etc. These private services or attributes of hosts are not usually exposed to external clients. This situation requires that an intermediary monitoring agent be installed on any host that you need to monitor such information on. More information on monitoring private services on different types of hosts can be found in the documentation on monitoring Windows machines (section 4.7), Linux/Unix machines (section 4.8), and Netware servers (section 4.11).

Tip: Occasionally you will find that information on private services and applications can be monitored with SNMP. The SNMP agent allows you to remotely monitor otherwise private (and inaccessible) information about the host. For more information about monitoring services using SNMP, check out the documentation on monitoring switches and routers.

Note: These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample commands.cfg and localhost.cfg config files.

4.12.2. Plugins For Monitoring Services

When you find yourself needing to monitor a particular application, service, or protocol, chances are good that a plugin exists to monitor it. The official Nagios plugins distribution comes with plugins that can be used to monitor a variety of services and protocols. There are also a large number of contributed plugins that can be found in the contrib/ subdirectory of the plugin distribution. The NagiosExchange.org website hosts a number of additional plugins that have been written by users, so check it out when you have a chance.

If you don't happen to find an appropriate plugin for monitoring what you need, you can always write your own. Plugins are easy to write, so don't let this thought scare you off. Read the documentation on developing plugins for more information.

I'll walk you through monitoring some basic services that you'll probably use sooner or later. Each of these services can be monitored using one of the plugins that gets installed as part of the Nagios plugins distribution. Let's get started...

4.12.3. Creating A Host Definition

Before you can monitor a service, you first need to define a host that is associated with the service. You can place host definitions in any object configuration file specified by a cfg_file directive or placed in a directory specified by a cfg_dir directive. If you have already created a host definition, you can skip this step.

For this example, let's say you want to monitor a variety of services on a remote host. Let's call that host remotehost. The host definition can be placed in its own file or added to an already existing object configuration file. Here's what the host definition for remotehost might look like:

define host{

use generic-host ; Inherit default values from a template

host_name remotehost ; The name we're giving to this host

alias Some Remote Host ; A longer name associated with the host

address 192.168.1.50 ; IP address of the host

hostgroups allhosts ; Host groups this host is associated with

}

Now that a definition has been added for the host that will be monitored, we can start defining services that should be monitored. As with host definitions, service definitions can be placed in any object configuration file.

4.12.4. Creating Service Definitions

For each service you want to monitor, you need to define a service in Nagios that is associated with the host definition you just created. You can place service definitions in any object configuration file specified by a cfg_file directive or placed in a directory specified by a cfg_dir directive.

Some example service definitions for monitoring common public services (HTTP, FTP, etc.) are given below.

4.12.5. Monitoring HTTP

Chances are you're going to want to monitor web servers at some point - either yours or someone else's. The check_http plugin is designed to do just that. It understands the HTTP protocol and can monitor response time, error codes, strings in the returned HTML, server certificates, and much more.

The commands.cfg file contains a command definition for using the check_http plugin. It looks like this:

define command{

name check_http

command_name check_http

command_line $USER1$/check_http -I $HOSTADDRESS$ $ARG1$

}

A simple service definition for monitoring the HTTP service on the remotehost machine might look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description HTTP

check_command check_http

}

This simple service definition will monitor the HTTP service running on remotehost. It will produce alerts if the web server doesn't respond within 10 seconds or if it returns HTTP error codes (403, 404, etc.). That's all you need for basic monitoring. Pretty simple, huh?

Tip: For more advanced monitoring, run the check_http plugin manually with --help as a command-line argument to see all the options you can give the plugin. This --help syntax works with all of the plugins I'll cover in this document.
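When you run a plugin by hand, its exit status tells you which state Nagios would record: by the standard plugin convention, 0 means OK, 1 WARNING, 2 CRITICAL, and 3 UNKNOWN. The sketch below translates an exit status into a state name. state_name is a hypothetical helper, treating every other value as UNKNOWN is a simplification made here, and the check_http path in the comment assumes the quickstart install location.

```shell
# Hypothetical helper: translate a plugin exit status into a Nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;   # 3 is UNKNOWN; other values are treated the same way here
    esac
}

# Example usage on the Nagios server:
#   /usr/local/nagios/libexec/check_http -I 192.168.1.50
#   state_name $?
```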

A more advanced definition for monitoring the HTTP service is shown below. This service definition will check to see if the /download/index.php URI contains the string "latest-version.tar.gz". It will produce an error if the string isn't found, the URI isn't valid, or the web server takes longer than 5 seconds to respond.

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description Product Download Link

check_command check_http!-u /download/index.php -t 5 -s "latest-version.tar.gz"

}

4.12.6. Monitoring FTP

When you need to monitor FTP servers, you can use the check_ftp plugin. The commands.cfg file contains a command definition for using the check_ftp plugin, which looks like this:

define command{

command_name check_ftp

command_line $USER1$/check_ftp -H $HOSTADDRESS$ $ARG1$

}

A simple service definition for monitoring the FTP server on remotehost would look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description FTP

check_command check_ftp

}

This service definition will monitor the FTP service and generate alerts if the FTP server doesn't respond within 10 seconds.

A more advanced service definition is shown below. This service will check the FTP server running on port 1023 on remotehost. It will generate an alert if the server doesn't respond within 5 seconds or if the server response doesn't contain the string "Pure-FTPd [TLS]".

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description Special FTP

check_command check_ftp!-p 1023 -t 5 -e "Pure-FTPd [TLS]"

}

4.12.7. Monitoring SSH

When you need to monitor SSH servers, you can use the check_ssh plugin. The commands.cfg file contains a command definition for using the check_ssh plugin, which looks like this:

define command{

command_name check_ssh

command_line $USER1$/check_ssh $ARG1$ $HOSTADDRESS$

}

A simple service definition for monitoring the SSH server on remotehost would look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description SSH

check_command check_ssh

}

This service definition will monitor the SSH service and generate alerts if the SSH server doesn't respond within 10 seconds.

A more advanced service definition is shown below. This service will check the SSH server and generate an alert if the server doesn't respond within 5 seconds or if the server version string doesn't match "OpenSSH_4.2".

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description SSH Version Check

check_command check_ssh!-t 5 -r "OpenSSH_4.2"

}

4.12.8. Monitoring SMTP

The check_smtp plugin can be used for monitoring your email servers. The commands.cfg file contains a command definition for using the check_smtp plugin, which looks like this:

define command{

command_name check_smtp

command_line $USER1$/check_smtp -H $HOSTADDRESS$ $ARG1$

}

A simple service definition for monitoring the SMTP server on remotehost would look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description SMTP

check_command check_smtp

}

This service definition will monitor the SMTP service and generate alerts if the SMTP server doesn't respond within 10 seconds.

A more advanced service definition is shown below. This service will check the SMTP server and generate an alert if the server doesn't respond within 5 seconds or if the response from the server doesn't contain "mygreatmailserver.com".

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description SMTP Response Check

check_command check_smtp!-t 5 -e "mygreatmailserver.com"

}

4.12.9. Monitoring POP3

The check_pop plugin can be used for monitoring the POP3 service on your email servers. The commands.cfg file contains a command definition for using the check_pop plugin, which looks like this:

define command{

command_name check_pop

command_line $USER1$/check_pop -H $HOSTADDRESS$ $ARG1$

}

A simple service definition for monitoring the POP3 service on remotehost would look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description POP3

check_command check_pop

}

This service definition will monitor the POP3 service and generate alerts if the POP3 server doesn't respond within 10 seconds.

A more advanced service definition is shown below. This service will check the POP3 service and generate an alert if the server doesn't respond within 5 seconds or if the response from the server doesn't contain "mygreatmailserver.com".

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description POP3 Response Check

check_command check_pop!-t 5 -e "mygreatmailserver.com"

}

4.12.10. Monitoring IMAP

The check_imap plugin can be used for monitoring the IMAP4 service on your email servers. The commands.cfg file contains a command definition for using the check_imap plugin, which looks like this:

define command{

command_name check_imap

command_line $USER1$/check_imap -H $HOSTADDRESS$ $ARG1$

}

A simple service definition for monitoring the IMAP4 service on remotehost would look like this:

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description IMAP

check_command check_imap

}

This service definition will monitor the IMAP4 service and generate alerts if the IMAP server doesn't respond within 10 seconds.

A more advanced service definition is shown below. This service will check the IMAP4 service and generate an alert if the server doesn't respond within 5 seconds or if the response from the server doesn't contain "mygreatmailserver.com".

define service{

use generic-service ; Inherit default values from a template

host_name remotehost

service_description IMAP4 Response Check

check_command check_imap!-t 5 -e "mygreatmailserver.com"

}

4.12.11. Restarting Nagios

Once you've added the new host and service definitions to your object configuration file(s), you're ready to start monitoring them. To do this, you'll need to verify your configuration and restart Nagios.

If the verification process produces any error messages, fix your configuration file before continuing. Make sure that you don't (re)start Nagios until the verification process completes without any errors!

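Chapter 5. Configuring Nagios

5.1. Configuration Overview

5.1.1. Introduction

There are several different configuration files that you need to create or edit before you start monitoring your network and systems. Be patient! Configuring Nagios can take quite a while, especially if you are a first-time user. Once you figure out how things work, it will all be well worth your time. :-)

Note

Sample configuration files are installed in the /usr/local/nagios/etc/ directory if you followed the quickstart installation guide.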

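5.1.2. Main Configuration File

The main configuration file contains a number of directives that affect how the Nagios daemon operates. This file is read by both the Nagios daemon and the CGIs, so it is the place to start when you learn how to configure everything else.

The main configuration file is documented in Section 5.2, "Main Configuration File Options".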

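5.1.3. Resource File(s)

Resource files can be used to store user-defined macros. Their main purpose is to hold sensitive configuration information (such as passwords) without making it available to the CGIs.

You can specify one or more resource files by using the resource_file directive in your main configuration file.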

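5.1.4. Object Definition Files

Object definition files are used to define hosts, services, hostgroups, servicegroups, contacts, contactgroups, commands, and so on. This is where you define what you want to monitor and how you want to monitor it.

You can specify one or more object definition files by using the cfg_file and/or cfg_dir directives in your main configuration file.

Object definitions and how they relate to each other are covered in the documentation on object definitions.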

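5.1.5. CGI Configuration File

The CGI configuration file contains a number of directives that affect the operation of the CGIs. It also contains a reference to the main configuration file, so the CGIs know how you have configured Nagios and where your object definitions are stored.

The CGI configuration file is documented in Section 5.4, "CGI Configuration File Options".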

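5.2. Main Configuration File Options

Note

When creating or editing configuration files, keep the following in mind:

  • Lines that start with a '#' character are treated as comments and are not processed;
  • A variable must begin at the start of a line; no white space is allowed before the variable name;
  • Variable names are case-sensitive.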
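The three rules above can be illustrated with a small parser. This is only a sketch of one strict reading of those rules, not how Nagios itself reads its configuration; the function name parse_main_config is made up for this example.

```python
def parse_main_config(text):
    """Parse nagios.cfg-style text into a list of (name, value) pairs.

    Hypothetical helper for illustration; a list is used instead of a dict
    because directives such as cfg_file may appear more than once.
    """
    directives = []
    for line in text.splitlines():
        # Blank lines carry no directive; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        # A variable must begin at the start of the line (no leading white space).
        if line[0].isspace():
            raise ValueError(f"variable must start at the beginning of the line: {line!r}")
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value: {line!r}")
        # Names are case-sensitive, so they are stored exactly as written.
        directives.append((name, value.strip()))
    return directives


sample = """# main log file
log_file=/usr/local/nagios/var/nagios.log
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
"""
parsed = parse_main_config(sample)
```

Here parsed keeps both cfg_file entries in file order, which matters because the same variable may legitimately be repeated.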

Tip

A sample main configuration file (/usr/local/nagios/etc/nagios.cfg) is installed for you if you followed the quickstart installation guide.

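5.2.1. Configuration File Location

The main configuration file is usually named nagios.cfg (in practice the name is fixed) and is located in the /usr/local/nagios/etc/ directory. If Nagios was installed from an RPM package, the file should be in /etc/nagios/ instead.

5.2.2. Configuration File Variables

Each main configuration file option is described below.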

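Table 5.1. Log File

Format: log_file=<file_name>
Example: log_file=/usr/local/nagios/var/nagios.log

This variable specifies where Nagios should create its main log file. It should be the first variable that you define in your main configuration file, because Nagios writes any errors that it finds in the rest of your configuration data to this file. If you have log rotation enabled, this file is rotated every hour, day, week, or month.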

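Table 5.2. Object Configuration File

Format: cfg_file=<file_name>
Example:

cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg

This directive specifies an object configuration file containing object definitions that Nagios should use for monitoring. Object configuration files contain definitions for hosts, host groups, contacts, contact groups, services, commands, and so on. You can split your configuration into multiple files and use one cfg_file= statement for each file to be processed.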

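Table 5.3. Object Configuration Directory

Format: cfg_dir=<directory_name>
Example:

cfg_dir=/usr/local/nagios/etc/commands
cfg_dir=/usr/local/nagios/etc/services
cfg_dir=/usr/local/nagios/etc/hosts

This directive specifies a directory that contains object configuration files for Nagios to use for monitoring. All files in the directory with a .cfg extension are processed as object configuration files. In addition, Nagios recursively processes all configuration files in the subdirectories of the directory you specify. You can spread your configuration over several directories and use one cfg_dir= statement for each directory to be processed.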
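To make the recursive .cfg rule concrete, the following sketch lists the files that such a directory scan would pick up. It is an illustration only: find_object_config_files is a hypothetical helper, and the directory tree is created just for the demo.

```python
import os
import tempfile


def find_object_config_files(cfg_dir):
    """Return every *.cfg file under cfg_dir, including subdirectories."""
    found = []
    for root, _dirs, files in os.walk(cfg_dir):
        for name in files:
            # Only files with a .cfg extension count as object config files.
            if name.endswith(".cfg"):
                found.append(os.path.join(root, name))
    return sorted(found)


# Demo on a throwaway directory tree (all paths are made up for illustration).
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "linux", "web"))
    for rel in ("commands.cfg", "linux/db.cfg", "linux/web/www1.cfg", "linux/notes.txt"):
        with open(os.path.join(top, *rel.split("/")), "w") as fh:
            fh.write("# placeholder\n")
    found = [
        os.path.relpath(path, top).replace(os.sep, "/")
        for path in find_object_config_files(top)
    ]
```

The notes.txt file is skipped because of its extension, while www1.cfg is found two levels down.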

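Table 5.4. Object Cache File

Format: object_cache_file=<file_name>
Example: object_cache_file=/usr/local/nagios/var/objects.cache

This directive specifies the file in which a cached copy of the object definitions is stored. The cache file is (re)created every time Nagios is (re)started and is used by the CGIs. It is intended to speed up configuration caching in the CGIs and to let you edit the source object configuration files while Nagios is running without affecting the output displayed by the CGIs.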

Table 5.5. Precached Object File

Format: precached_object_file=<file_name>
Example: precached_object_file=/usr/local/nagios/var/objects.precache

This directive specifies the file in which a pre-processed, pre-cached copy of the object definitions is stored. In large or complex Nagios installations this file can be used to reduce startup time significantly. See the documentation on fast startup options for more information.

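Table 5.6. Resource File

Format: resource_file=<file_name>
Example: resource_file=/usr/local/nagios/etc/resource.cfg

This directive specifies an optional resource file that can contain $USERn$ macro definitions. $USERn$ macros are useful for storing usernames, passwords, and items commonly used in command definitions (such as directory paths). The CGIs will not attempt to read resource files, so you can set restrictive permissions (600 or 660) on them to protect sensitive information. You can include multiple resource files by adding multiple resource_file statements to the main configuration file; Nagios will process them all. For an example of how to define $USERn$ macros, see the sample resource.cfg file in the sample-config/ subdirectory of the Nagios distribution.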

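Table 5.7. Temp File

Format: temp_file=<file_name>
Example: temp_file=/usr/local/nagios/var/nagios.tmp

This is a temporary file that Nagios periodically creates when updating comment data, status data, and so on. The file is deleted when it is no longer needed.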

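Table 5.8. Temp Path

Format: temp_path=<dir_name>
Example: temp_path=/tmp

This is a directory that Nagios can use as scratch space for creating temporary files during the monitoring process. You should run tmpwatch, or a similar utility, on this directory to delete files older than 24 hours.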

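Table 5.9. Status File

Format: status_file=<file_name>
Example: status_file=/usr/local/nagios/var/status.dat

This is the file that Nagios uses to store the current status, comment, and downtime information. The CGIs also use this file to report the current monitoring status through the web interface, so they must have read access to it in order to work properly. The file is deleted every time Nagios stops and is recreated when it starts.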

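Table 5.10. Status File Update Interval

Format: status_update_interval=<seconds>
Example: status_update_interval=15

This setting determines how often (in seconds) Nagios updates the status file. The minimum update interval is 1 second.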

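Table 5.11. Nagios User

Format: nagios_user=<username/UID>
Example: nagios_user=nagios

This sets the effective user that the Nagios process runs as. After initial program startup and before it starts monitoring anything, Nagios drops its effective privileges and runs as this user. You may specify either a username or a UID.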

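Table 5.12. Nagios Group

Format: nagios_group=<groupname/GID>
Example: nagios_group=nagios

This sets the effective group that the Nagios process runs as. After initial program startup and before it starts monitoring anything, Nagios drops its effective privileges and runs as this group. You may specify either a group name or a GID.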

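Table 5.13. Notifications Option

Format: enable_notifications=<0/1>
Example: enable_notifications=1

This option determines whether or not Nagios will send out notifications when it initially (re)starts. If this option is disabled, Nagios will not send out notifications for any host or service. Note: If you have state retention enabled, Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state retention file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Disable notifications
  2. 1 = Enable notifications (default)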

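Table 5.14. Service Check Execution Option

Format: execute_service_checks=<0/1>
Example: execute_service_checks=1

This option determines whether or not Nagios will execute service checks when it initially (re)starts. If this option is disabled, Nagios will not actively execute any service checks and will remain in a sort of "sleep" mode (it can still accept passive checks unless you have disabled the accept_passive_service_checks option). This option is most often used when configuring backup monitoring servers, as described in the documentation on redundancy, or when setting up a distributed monitoring environment. Note: If you have state retention enabled, Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state retention file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Don't execute service checks
  2. 1 = Execute service checks (default)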

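Table 5.15. Passive Service Check Acceptance Option

Format: accept_passive_service_checks=<0/1>
Example: accept_passive_service_checks=1

This option determines whether or not Nagios will accept passive service checks when it initially (re)starts. If this option is disabled, Nagios will not accept any passive service check results. Note: If you have state retention enabled, Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state retention file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Don't accept passive service checks
  2. 1 = Accept passive service checks (default)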

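Table 5.16. Host Check Execution Option

Format: execute_host_checks=<0/1>
Example: execute_host_checks=1

This option determines whether or not Nagios will execute on-demand and regularly scheduled host checks when it initially (re)starts. If this option is disabled, Nagios will not actively execute any host checks, although it can still accept passive host checks unless you have disabled the accept_passive_host_checks option. This option is most often used when configuring backup monitoring servers, as described in the documentation on redundancy, or when setting up a distributed monitoring environment. Note: If you have state retention enabled (the retain_state_information option), Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state_retention_file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Don't execute host checks
  2. 1 = Execute host checks (default)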

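Table 5.17. Passive Host Check Acceptance Option

Format: accept_passive_host_checks=<0/1>
Example: accept_passive_host_checks=1

This option determines whether or not Nagios will accept passive host checks when it initially (re)starts. If this option is disabled, Nagios will not accept any passive host check results. Note: If you have state retention enabled (the retain_state_information option), Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state_retention_file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Don't accept passive host checks
  2. 1 = Accept passive host checks (default)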

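Table 5.18. Event Handler Option

Format: enable_event_handlers=<0/1>
Example: enable_event_handlers=1

This option determines whether or not Nagios will run event handlers when it initially (re)starts. If this option is disabled, Nagios will not run any host or service event handlers. Note: If you have state retention enabled (the retain_state_information option), Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state_retention_file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Disable event handlers
  2. 1 = Enable event handlers (default)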

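Table 5.19. Log Rotation Method

Format: log_rotation_method=<n/h/d/w/m>
Example: log_rotation_method=d

This is the rotation method that you would like Nagios to use for your log file. Values are as follows:

  1. n = None (don't rotate the log; this is the default)
  2. h = Hourly (rotate the log every hour)
  3. d = Daily (rotate the log at midnight each day)
  4. w = Weekly (rotate the log at midnight on Saturday)
  5. m = Monthly (rotate the log at midnight on the last day of the month)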

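Table 5.20. Log Archive Path

Format: log_archive_path=<path>
Example: log_archive_path=/usr/local/nagios/var/archives/

This is the directory where Nagios should place log files that have been rotated. This option is ignored if you do not use the log rotation functionality.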

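Table 5.21. External Command Check Option

Format: check_external_commands=<0/1>
Example: check_external_commands=1

This option determines whether or not Nagios will check the command file for commands that should be executed. This option must be enabled if you plan to use the command CGI to issue commands through the web interface. See the documentation on external commands for more information.

  1. 0 = Don't check external commands
  2. 1 = Check external commands (default)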

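Table 5.22. External Command Check Interval

Format: command_check_interval=<xxx>[s]
Example: command_check_interval=1

If you specify a number with an "s" appended to it (for example 30s), this is the number of seconds to wait between external command checks. If you leave off the "s", this is the number of "time units" to wait between external command checks. Unless you have changed the interval_length value (described below) from its default of 60 (that is, 60 seconds), one time unit is one minute.

Note: Setting this value to -1 makes Nagios check for external commands as often as possible. Each time Nagios checks for external commands, it reads and processes all commands present in the command file before continuing with its other duties. See the documentation on external commands for more information.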
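The unit rules above (an explicit "s" suffix, plain time units, and the special value -1) can be sketched as a small conversion function. The function name and the choice to return None for -1 are assumptions made for this example only.

```python
def command_check_interval_seconds(value, interval_length=60):
    """Convert a command_check_interval setting to seconds.

    Hypothetical helper: returns None for -1, which stands for
    "check as often as possible" rather than a fixed delay.
    """
    value = value.strip()
    if value == "-1":
        return None
    if value.endswith("s"):
        # An explicit "s" suffix means the number is already in seconds.
        return int(value[:-1])
    # Otherwise the number counts time units of interval_length seconds each.
    return int(value) * interval_length
```

With the default interval_length of 60, a setting of 1 therefore means 60 seconds, while 30s means 30 seconds regardless of interval_length.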

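Table 5.23. External Command File

Format: command_file=<file_name>
Example: command_file=/usr/local/nagios/var/rw/nagios.cmd

This is the file that Nagios checks for external commands to process. The command CGI writes commands to this file. The external command file is implemented as a named pipe (FIFO), which is created when Nagios starts and removed when it shuts down. If the file already exists when Nagios starts, the Nagios process terminates with an error message. See the documentation on external commands for more information.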

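Table 5.24. External Command Buffer Slots

Format: external_command_buffer_slots=<#>
Example: external_command_buffer_slots=512

Note: This is an advanced feature. This option determines how many buffer slots Nagios reserves for caching external commands that have been read from the external command file by a worker thread but have not yet been processed by the main thread of the Nagios daemon. Each slot can hold one external command, so this option determines how many commands can be buffered. For installations that process a large number of passive checks (for example distributed setups), you may need to increase this number. You should consider using MRTG to graph the usage of the external command buffers; see the documentation on graphing Nagios performance data with MRTG.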

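Table 5.25. Lock File

Format: lock_file=<file_name>
Example: lock_file=/tmp/nagios.lock

This option specifies the location of the lock file that Nagios creates when it runs as a daemon (when started with the -d command line argument). The file contains the process ID (PID) of the running Nagios process.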

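Table 5.26. State Retention Option

Format: retain_state_information=<0/1>
Example: retain_state_information=1

This option determines whether or not Nagios will retain state information for hosts and services between program restarts. If you enable this option, you should also supply a value for the state_retention_file variable. When enabled, Nagios saves all state information for hosts and services before it shuts down (or restarts) and reads the previously saved state information back in when it starts up again.

  1. 0 = Don't retain state information
  2. 1 = Retain state information (default)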

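Table 5.27. State Retention File

Format: state_retention_file=<file_name>
Example: state_retention_file=/usr/local/nagios/var/retention.dat

This is the file that Nagios uses to store status, downtime, and comment information before it shuts down. When Nagios is restarted, it uses the information stored in this file to set the initial states of hosts and services before it starts monitoring anything. For Nagios to retain state information between program restarts, you must enable the retain_state_information option.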

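Table 5.28. Automatic State Retention Update Interval

Format: retention_update_interval=<minutes>
Example: retention_update_interval=60

This setting determines how often (in minutes) Nagios automatically saves retention data during normal operation. If you set this value to 0, Nagios will not save retention data at regular intervals, but it will still save retention data before shutting down or restarting. If you have disabled state retention (with the retain_state_information option), this option has no effect.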

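Table 5.29. Use Retained Program State Option

Format: use_retained_program_state=<0/1>
Example: use_retained_program_state=1

This setting determines whether or not Nagios will set various program-wide state variables based on the values saved in the retention file. The program-wide state variables that are saved across program restarts include the enable_notifications, enable_flap_detection, enable_event_handlers, execute_service_checks, and accept_passive_service_checks options. If you do not have state retention enabled (the retain_state_information option), this option has no effect.

  1. 0 = Don't use retained program state
  2. 1 = Use retained program state (default)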

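Table 5.30. Use Retained Scheduling Info Option

Format: use_retained_scheduling_info=<0/1>
Example: use_retained_scheduling_info=1

This setting determines whether or not Nagios will use retained scheduling information (the next check time) for hosts and services when it restarts. If you are adding a large number (or a large percentage) of hosts and services, it is advisable to disable this option when you first restart Nagios, because it can skew the spread of the initial checks. Otherwise you will probably want to leave it enabled.

  1. 0 = Don't use retained scheduling info
  2. 1 = Use retained scheduling info (default)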

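Table 5.31. Retained Host and Service Attribute Masks

Format:

retained_host_attribute_mask=<number>
retained_service_attribute_mask=<number>

Example:

retained_host_attribute_mask=0
retained_service_attribute_mask=0

WARNING: This is an advanced feature. You will need to read the Nagios source code to use this option effectively.

These options determine which host or service attributes are NOT retained across program restarts. The values for these options are a bitwise AND of values specified by the "MODATTR_" definitions in the include/common.h source code file. By default, all host and service attributes are retained.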
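As a rough illustration of how such a mask could be interpreted, the sketch below treats every bit set in the mask as an attribute that is dropped at restart. The flag names and numeric values here are placeholders chosen for the example; the authoritative MODATTR_ definitions are the ones in include/common.h, and the exact logic Nagios applies may differ.

```python
# Placeholder flag values for illustration only; consult include/common.h
# for the real MODATTR_ definitions before building a mask.
MODATTR_NOTIFICATIONS_ENABLED = 1
MODATTR_ACTIVE_CHECKS_ENABLED = 2
MODATTR_PASSIVE_CHECKS_ENABLED = 4


def is_retained(attribute_flag, mask):
    """Return True if the attribute survives a restart under this mask.

    Assumption: a bit set in the mask marks an attribute that is NOT retained.
    """
    return (attribute_flag & mask) == 0


# Combine the attributes that should be discarded across restarts.
mask = MODATTR_NOTIFICATIONS_ENABLED | MODATTR_PASSIVE_CHECKS_ENABLED
```

A mask of 0, as in the example setting above, leaves every attribute retained.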

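Table 5.32. Retained Process Attribute Masks

Format:

retained_process_host_attribute_mask=<number>
retained_process_service_attribute_mask=<number>

Example:

retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0

WARNING: This is an advanced feature. You will need to read the Nagios source code to use this option effectively.

These options determine which process attributes are NOT retained across program restarts. There are two masks because host and service process attributes can often be changed separately. For example, host checks can be disabled at the program level while service checks remain enabled. The values for these options are a bitwise AND of values specified by the "MODATTR_" definitions in the include/common.h source code file. By default, all process attributes are retained.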

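Table 5.33. Retained Contact Attribute Masks

Format:

retained_contact_host_attribute_mask=<number>
retained_contact_service_attribute_mask=<number>

Example:

retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0

WARNING: This is an advanced feature. You will need to read the Nagios source code to use this option effectively.

These options determine which contact attributes are NOT retained across program restarts. There are two masks because host and service contact attributes can often be changed separately. The values for these options are a bitwise AND of values specified by the "MODATTR_" definitions in the include/common.h source code file. By default, all contact attributes are retained.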

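Table 5.34. Syslog Logging Option

Format: use_syslog=<0/1>
Example: use_syslog=1

This variable determines whether messages are logged to the syslog facility on your local host. Values are as follows:

  1. 0 = Don't use the syslog facility
  2. 1 = Use the syslog facility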

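Table 5.35. Notification Logging Option

Format: log_notifications=<0/1>
Example: log_notifications=1

This variable determines whether or not notification messages are logged. If you have a lot of contacts or regular service failures, your log file will grow quickly. Use this option to control whether the notifications that have been sent are recorded.

  1. 0 = Don't log notifications
  2. 1 = Log notifications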

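Table 5.36. Service Check Retry Logging Option

Format: log_service_retries=<0/1>
Example: log_service_retries=1

This variable determines whether or not service check retries are logged. Service check retries occur when a service check results in a non-OK state and you have configured Nagios to retry the service more than once before responding to the error. Services in this situation are considered to be in a "soft" state. Logging service check retries is mostly useful when debugging Nagios or testing service event handlers.

  1. 0 = Don't log service check retries
  2. 1 = Log service check retries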

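Table 5.37. Host Check Retry Logging Option

Format: log_host_retries=<0/1>
Example: log_host_retries=1

This variable determines whether or not host check retries are logged. Logging host check retries is mostly useful when debugging Nagios or testing host event handlers.

  1. 0 = Don't log host check retries
  2. 1 = Log host check retries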

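Table 5.38. Event Handler Logging Option

Format: log_event_handlers=<0/1>
Example: log_event_handlers=1

This variable determines whether or not service and host event handlers are logged. Event handlers are optional commands that can be run whenever a service or host changes state. Logging event handlers is most useful when debugging Nagios or when first trying out your event handler scripts.

  1. 0 = Don't log event handlers
  2. 1 = Log event handlers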

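Table 5.39. Initial States Logging Option

Format: log_initial_states=<0/1>
Example: log_initial_states=1

This variable determines whether or not Nagios will force all initial host and service states to be logged, even if they result in an OK state. Normally, initial host and service states are logged only when the first check finds a problem. Enabling this option is useful if you use an application that scans the log file to compute long-term state statistics for hosts and services.

  1. 0 = Don't log initial states (default)
  2. 1 = Log initial states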

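Table 5.40. External Command Logging Option

Format: log_external_commands=<0/1>
Example: log_external_commands=1

This variable determines whether or not Nagios will log external commands that it receives from the external command file (command_file). Note: This option does not control whether passive service checks (which are a type of external command) are logged. To enable or disable logging of passive checks, use the log_passive_checks option.

  1. 0 = Don't log external commands
  2. 1 = Log external commands (default)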

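Table 5.41. Passive Check Logging Option

Format: log_passive_checks=<0/1>
Example: log_passive_checks=1

This variable determines whether or not Nagios will log passive host and service checks that it receives from the external command file (command_file). If you are setting up a distributed monitoring environment or plan to handle a large number of passive checks on a regular basis, you may wish to disable this option so that your log file does not grow too large.

  1. 0 = Don't log passive checks
  2. 1 = Log passive checks (default)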

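Table 5.42. Global Host Event Handler Option

Format: global_host_event_handler=<command>
Example: global_host_event_handler=log-host-event-to-db

This option specifies a host event handler command that is run for every host state change. The global event handler is executed immediately before the event handler that you have optionally specified in each host definition. The command argument is the short name of a command that you define in your object configuration file. The maximum amount of time that this command can run is controlled by the event_handler_timeout option. See the documentation on event handlers for more information.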

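Table 5.43. Global Service Event Handler Option

Format: global_service_event_handler=<command>
Example: global_service_event_handler=log-service-event-to-db

This option specifies a service event handler command that is run for every service state change. The global event handler is executed immediately before the event handler that you have optionally specified in each service definition. The command argument is the short name of a command that you define in your object configuration file. The maximum amount of time that this command can run is controlled by the event_handler_timeout option. See the documentation on event handlers for more information.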

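Table 5.44. Inter-Check Sleep Time

Format: sleep_time=<seconds>
Example: sleep_time=1

This is the number of seconds that Nagios sleeps before checking whether the next service or host check in the scheduling queue should be executed. Note that Nagios only sleeps after it has caught up with queued service checks that have fallen behind.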

Table 5.45. Service Inter-Check Delay Method

Format: service_inter_check_delay_method=<n/d/s/x.xx>
Example: service_inter_check_delay_method=s

This option allows you to control how service checks are initially "spread out" in the event queue. Using a "smart" delay calculation (the default) will cause Nagios to calculate an average check interval and spread initial checks of all services out over that interval, thereby helping to eliminate CPU load spikes. Using no delay is generally not recommended, as it will cause all service checks to be scheduled for execution at the same time. This means that you will generally have large CPU spikes when the services are all executed in parallel. See the documentation on service and host check scheduling for more information on how the inter-check delay affects service check scheduling. Values are as follows:

  1. n = Don't use any delay - schedule all service checks to run immediately (i.e. at the same time!)
  2. d = Use a "dumb" delay of 1 second between service checks
  3. s = Use a "smart" delay calculation to spread service checks out evenly (default)
  4. x.xx = Use a user-supplied inter-check delay of x.xx seconds
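The four delay methods can be sketched as follows. The "smart" branch is one plausible reading of "spread initial checks of all services out over the average check interval", namely the average interval divided by the number of services; the exact formula Nagios uses is not given in this section, so treat the function as an assumption-laden illustration.

```python
def inter_check_delay(method, check_intervals_seconds):
    """Return the delay in seconds between consecutive initial checks.

    method is one of "n", "d", "s", or a number given as text ("x.xx").
    check_intervals_seconds holds each service's normal check interval.
    """
    if method == "n":
        return 0.0  # no delay: everything is scheduled at the same time
    if method == "d":
        return 1.0  # "dumb" delay: a fixed one second between checks
    if method == "s":
        if not check_intervals_seconds:
            return 0.0
        average = sum(check_intervals_seconds) / len(check_intervals_seconds)
        # Spread all initial checks evenly across one average interval.
        return average / len(check_intervals_seconds)
    return float(method)  # user-supplied delay of x.xx seconds
```

For example, 100 services that are each checked every 300 seconds would be started 3 seconds apart under this reading of the smart method.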

Table 5.46. Maximum Service Check Spread

Format: max_service_check_spread=<minutes>
Example: max_service_check_spread=30

This option determines the maximum number of minutes from when Nagios starts that all services (that are scheduled to be regularly checked) are checked. This option will automatically adjust the service_inter_check_delay_method setting (if necessary) to ensure that the initial checks of all services occur within the timeframe you specify. In general, this option will not have an effect on service check scheduling if scheduling information is being retained using the use_retained_scheduling_info option. The default value is 30 (minutes).

Table 5.47. Service Interleave Factor

Format: service_interleave_factor=<s|x>
Example: service_interleave_factor=s

This variable determines how service checks are interleaved. Interleaving allows for a more even distribution of service checks, reduced load on remote hosts, and faster overall detection of host problems. Setting this value to 1 is equivalent to not interleaving the service checks (this is how versions of Nagios previous to 0.0.5 worked). Set this value to s (smart) for automatic calculation of the interleave factor unless you have a specific reason to change it. The best way to understand how interleaving works is to watch the status CGI (detailed view) when Nagios is just starting. You should see that the service check results are spread out as they begin to appear. More information on how interleaving works can be found here.

  1. x = A number greater than or equal to 1 that specifies the interleave factor to use. An interleave factor of 1 is equivalent to not interleaving the service checks.
  2. s = Use a "smart" interleave factor calculation (default)

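Table 5.48. Maximum Concurrent Service Checks

Format: max_concurrent_checks=<max_checks>
Example: max_concurrent_checks=20

This option specifies the maximum number of service checks that can be run in parallel at any given time. A value of 1 prevents any service checks from being run in parallel. A value of 0 (the default) places no restriction on the number of concurrent checks. You will have to adjust this value to the system resources available on the machine that runs Nagios, because it directly affects the maximum load imposed on the system (processor utilization, memory usage, and so on). See the documentation on check scheduling for more information on how to estimate how many concurrent checks you should allow.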

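Table 5.49. Check Result Reaper Frequency

Format: check_result_reaper_frequency=<frequency_in_seconds>
Example: check_result_reaper_frequency=5

This option controls the frequency (in seconds) of check result "reaper" events. Reaper events process the results from host and service checks that have finished executing. These events make up the core of the monitoring logic in Nagios.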

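Table 5.50. Maximum Check Result Reaper Time

Format: max_check_result_reaper_time=<seconds>
Example: max_check_result_reaper_time=30

This option controls the maximum amount of time (in seconds) that host and service check result "reaper" events are allowed to run. Reaper events process the results from host and service checks that have finished executing. If there are many results to process, a reaper event may take a long time to finish, which can delay the timely execution of new host and service checks. This option limits how long an individual reaper event runs before it hands control back to Nagios for other parts of the monitoring logic.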

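Table 5.51. Check Result Path

Format: check_result_path=<path>
Example: check_result_path=/var/spool/nagios/checkresults

This option determines which directory Nagios uses to temporarily store host and service check results before they are processed. This directory should not be used to store any other files, because Nagios periodically cleans old files out of it (see the max_check_result_file_age option for more information).

Note: Make sure that only a single instance of Nagios has access to the check result path. If multiple instances of Nagios use the same directory, check results will be processed (incorrectly) by the wrong instance of Nagios!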

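Table 5.52. Max Check Result File Age

Format: max_check_result_file_age=<seconds>
Example: max_check_result_file_age=3600

This option determines the maximum age (in seconds) at which Nagios still considers check result files found in the check_result_path directory to be valid. Check result files that are older than this threshold are deleted by Nagios, and the check results they contain are not processed. If you set this option to 0, Nagios will process all check result files, even if they are older than your hardware. :-)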
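The age rule can be sketched as a function that sorts result files into "fresh" (still to be processed) and "stale" (to be deleted unprocessed). Using the file modification time as the age is an assumption of this sketch, and split_by_age is a hypothetical helper, not part of Nagios.

```python
import os
import tempfile
import time


def split_by_age(result_dir, max_age_seconds, now=None):
    """Return (fresh, stale) lists of file names found in result_dir.

    A max_age_seconds of 0 disables the age limit, so nothing is stale.
    """
    now = time.time() if now is None else now
    fresh, stale = [], []
    for name in sorted(os.listdir(result_dir)):
        path = os.path.join(result_dir, name)
        if not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)  # assumption: age is based on mtime
        if max_age_seconds > 0 and age > max_age_seconds:
            stale.append(name)
        else:
            fresh.append(name)
    return fresh, stale


# Demo with two placeholder files whose modification times are set by hand.
with tempfile.TemporaryDirectory() as result_dir:
    now = time.time()
    for name, age in (("new.result", 10), ("old.result", 7200)):
        path = os.path.join(result_dir, name)
        with open(path, "w") as fh:
            fh.write("placeholder\n")
        os.utime(path, (now - age, now - age))
    limited = split_by_age(result_dir, 3600, now=now)
    unlimited = split_by_age(result_dir, 0, now=now)
```

With a limit of 3600 seconds the two-hour-old file is classed as stale, while a limit of 0 keeps both files.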

Table 5.53. Host Inter-Check Delay Method

Format: host_inter_check_delay_method=<n/d/s/x.xx>
Example: host_inter_check_delay_method=s

This option allows you to control how host checks that are scheduled to be checked on a regular basis are initially "spread out" in the event queue. Using a "smart" delay calculation (the default) will cause Nagios to calculate an average check interval and spread initial checks of all hosts out over that interval, thereby helping to eliminate CPU load spikes. Using no delay is generally not recommended, as it will cause all host checks to be scheduled for execution at the same time. More information on how to estimate how the inter-check delay affects host check scheduling can be found here. Values are as follows:

  1. n = Don't use any delay - schedule all host checks to run immediately (i.e. at the same time!)
  2. d = Use a "dumb" delay of 1 second between host checks
  3. s = Use a "smart" delay calculation to spread host checks out evenly (default)
  4. x.xx = Use a user-supplied inter-check delay of x.xx seconds

Table 5.54. Maximum Host Check Spread

Format: max_host_check_spread=<minutes>
Example: max_host_check_spread=30

This option determines the maximum number of minutes from when Nagios starts that all hosts (that are scheduled to be regularly checked) are checked. This option will automatically adjust the host_inter_check_delay_method setting (if necessary) to ensure that the initial checks of all hosts occur within the timeframe you specify. In general, this option will not have an effect on host check scheduling if scheduling information is being retained using the use_retained_scheduling_info option. Default value is 30 (minutes).

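Table 5.55. Timing Interval Length

Format: interval_length=<seconds>
Example: interval_length=60

This is the number of seconds per "unit interval" used for timing in the scheduling queue, re-notifications, and so on. Unit intervals are used in the object configuration file to determine how often to run a service check, how often to re-notify a contact, and so on.

Important: The default value is 60, which means that a unit value of 1 in the object configuration file means 60 seconds (1 minute). I have not really tested other values for this variable, so proceed at your own risk if you decide to change it!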

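Table 5.56. Auto-Rescheduling Option

Format: auto_reschedule_checks=<0/1>
Example: auto_reschedule_checks=1

This option determines whether or not Nagios will attempt to automatically reschedule active host and service checks to "smooth" them out over time. This can help balance the load on the monitoring server, as it attempts to keep the time between consecutive checks consistent, at the expense of executing checks on a more rigid schedule.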

WARNING: THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THIS OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!

Table 5.57. Auto-Rescheduling Interval

Format: auto_rescheduling_interval=<seconds>
Example: auto_rescheduling_interval=30

This option determines how often (in seconds) Nagios will attempt to automatically reschedule checks. This option only has an effect if the auto_reschedule_checks option is enabled. Default is 30 seconds.

WARNING: THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THE AUTO-RESCHEDULING OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!

Table 5.58. Auto-Rescheduling Window

Format: auto_rescheduling_window=<seconds>
Example: auto_rescheduling_window=180

This option determines the "window" of time (in seconds) that Nagios will look at when automatically rescheduling checks. Only host and service checks that occur in the next X seconds (determined by this variable) will be rescheduled. This option only has an effect if the auto_reschedule_checks option is enabled. Default is 180 seconds (3 minutes).

WARNING: THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THE AUTO-RESCHEDULING OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!

Table 5.59. Aggressive Host Checking Option

Format: use_aggressive_host_checking=<0/1>
Example: use_aggressive_host_checking=0

Nagios tries to be smart about how and when it checks the status of hosts. In general, disabling this option will allow Nagios to make some smarter decisions and check hosts a bit faster. Enabling this option will increase the amount of time required to check hosts, but may improve reliability a bit. Unless you have problems with Nagios not recognizing that a host recovered, I would suggest not enabling this option.

  1. 0 = Don't use aggressive host checking (default)
  2. 1 = Use aggressive host checking

Table 5.60. Translate Passive Host Checks Option

Format: translate_passive_host_checks=<0/1>
Example: translate_passive_host_checks=1

This option determines whether or not Nagios will translate DOWN/UNREACHABLE passive host check results to their "correct" state from the viewpoint of the local Nagios instance. This can be very useful in distributed and failover monitoring installations. More information on passive check state translation can be found here.

  1. 0 = Disable check translation (default)
  2. 1 = Enable check translation

Table 5.61. Passive Host Checks Are SOFT Option

Format: passive_host_checks_are_soft=<0/1>
Example: passive_host_checks_are_soft=1

This option determines whether or not Nagios will treat passive host checks as HARD states or SOFT states. By default, a passive host check result will put a host into a HARD state type. You can change this behavior by enabling this option.

  1. 0 = Passive host checks are HARD (default)
  2. 1 = Passive host checks are SOFT

Table 5.62. Predictive Host Dependency Checks Option

Format: enable_predictive_host_dependency_checks=<0/1>
Example: enable_predictive_host_dependency_checks=1

This option determines whether or not Nagios will execute predictive checks of hosts that are being depended upon (as defined in host dependencies) for a particular host when it changes state.

Predictive checks help ensure that the dependency logic is as accurate as possible. More information on how predictive checks work can be found here.

  1. 0 = Disable predictive checks
  2. 1 = Enable predictive checks (default)

Table 5.63. Predictive Service Dependency Checks Option

Format: enable_predictive_service_dependency_checks=<0/1>
Example: enable_predictive_service_dependency_checks=1

This option determines whether or not Nagios will execute predictive checks of services that are being depended upon (as defined in service dependencies) for a particular service when it changes state.

Predictive checks help ensure that the dependency logic is as accurate as possible. More information on how predictive checks work can be found here.

  1. 0 = Disable predictive checks
  2. 1 = Enable predictive checks (default)

Table 5.64. Cached Host Check Horizon

Format: cached_host_check_horizon=<seconds>
Example: cached_host_check_horizon=15

This option determines the maximum amount of time (in seconds) that the state of a previous host check is considered current. Cached host states (from host checks that were performed more recently than the time specified by this value) can improve host check performance immensely. Too high of a value for this option may result in (temporarily) inaccurate host states, while a low value may result in a performance hit for host checks. Use a value of 0 if you want to disable host check caching. More information on cached checks can be found here.
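The horizon test described above amounts to a simple age comparison. The sketch below assumes timestamps in seconds and treats a result exactly as old as the horizon as still current; both the function name and that boundary choice are assumptions for illustration.

```python
def can_use_cached_state(last_check_time, now, horizon_seconds):
    """Return True if a previous check result is recent enough to reuse."""
    if horizon_seconds <= 0:
        return False  # a horizon of 0 disables check caching
    # The cached state is current only while its age is within the horizon.
    return (now - last_check_time) <= horizon_seconds
```

With a horizon of 15 seconds, a result that is 10 seconds old would be reused, while one that is 20 seconds old would trigger a fresh check.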

Table 5.65. Cached Service Check Horizon

Format: cached_service_check_horizon=<seconds>
Example: cached_service_check_horizon=15

This option determines the maximum amount of time (in seconds) that the state of a previous service check is considered current. Cached service states (from service checks that were performed more recently than the time specified by this value) can improve service check performance when a lot of service dependencies are used. Too high of a value for this option may result in inaccuracies in the service dependency logic. Use a value of 0 if you want to disable service check caching. More information on cached checks can be found here.

Table 5.66. Large Installation Tweaks Option

Format: use_large_installation_tweaks=<0/1>
Example: use_large_installation_tweaks=0

This option determines whether or not the Nagios daemon will take several shortcuts to improve performance. These shortcuts result in the loss of a few features, but larger installations will likely see a lot of benefit from doing so. More information on what optimizations are taken when you enable this option can be found here.

  1. 0 = Don't use tweaks (default)
  2. 1 = Use tweaks

Table 5.67. Child Process Memory Option

Format: free_child_process_memory=<0/1>
Example: free_child_process_memory=0

This option determines whether or not Nagios will free memory in child processes when they are fork()ed off from the main process. By default, Nagios frees memory. However, if the use_large_installation_tweaks option is enabled, it will not. By defining this option in your configuration file, you are able to override things to get the behavior you want.

  1. 0 = Don't free memory
  2. 1 = Free memory

Table 5.68. Child Processes Fork Twice Option

Format: child_processes_fork_twice=<0/1>
Example: child_processes_fork_twice=0

This option determines whether or not Nagios will fork() child processes twice when it executes host and service checks. By default, Nagios fork()s twice. However, if the use_large_installation_tweaks option is enabled, it will only fork() once. By defining this option in your configuration file, you are able to override things to get the behavior you want.

  1. 0 = Fork() just once
  2. 1 = Fork() twice

Table 5.69. Environment Macros Option

Format: enable_environment_macros=<0/1>
Example: enable_environment_macros=0

This option determines whether or not the Nagios daemon will make all standard macros available as environment variables to your check, notification, event handler, etc. commands. In large Nagios installations this can be problematic because it takes additional memory and (more importantly) CPU to compute the values of all macros and make them available to the environment.

  1. 0 = Don't make macros available as environment variables
  2. 1 = Make macros available as environment variables (default)

Table 5.70. Flap Detection Option

Format: enable_flap_detection=<0/1>
Example: enable_flap_detection=0

This option determines whether or not Nagios will try and detect hosts and services that are "flapping". Flapping occurs when a host or service changes between states too frequently, resulting in a barrage of notifications being sent out. When Nagios detects that a host or service is flapping, it will temporarily suppress notifications for that host/service until it stops flapping. Flap detection is very experimental at this point, so use this feature with caution! More information on how flap detection and handling works can be found here. Note: If you have state retention enabled (the retain_state_information option), Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state_retention_file), unless you disable the use_retained_program_state option. If you want to change this option while state retention is active (and use_retained_program_state is enabled), you have to use the appropriate external command or change it through the web interface. Values are as follows:

  1. 0 = Don't enable flap detection (default)
  2. 1 = Enable flap detection

Table 5.71. Low Service Flap Threshold

Format: low_service_flap_threshold=<percent>
Example: low_service_flap_threshold=25.0

This option is used to set the low threshold for detection of service flapping. For more information on how flap detection and handling works (and how this option affects things) read this.

Table 5.72. High Service Flap Threshold

Format: high_service_flap_threshold=<percent>
Example: high_service_flap_threshold=50.0

This option is used to set the high threshold for detection of service flapping. For more information on how flap detection and handling works (and how this option affects things) read this.

Table 5.73. Low Host Flap Threshold

Format: low_host_flap_threshold=<percent>
Example: low_host_flap_threshold=25.0

This option is used to set the low threshold for detection of host flapping. For more information on how flap detection and handling works (and how this option affects things) read this.

Table 5.74. High Host Flap Threshold

Format: high_host_flap_threshold=<percent>
Example: high_host_flap_threshold=50.0

This option is used to set the high threshold for detection of host flapping. For more information on how flap detection and handling works (and how this option affects things) read this.
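The low and high thresholds are usually understood as a hysteresis pair: flapping starts once the percent state change reaches the high threshold and stops only after it falls below the low threshold. This section does not spell that logic out, so the sketch below is an assumption based on the general flap detection documentation, with a hypothetical function name and the example values 25.0 and 50.0.

```python
def update_flapping(is_flapping, percent_state_change,
                    low_threshold=25.0, high_threshold=50.0):
    """Return the new flapping flag for a host or service."""
    if not is_flapping and percent_state_change >= high_threshold:
        return True   # crossed the high threshold: flapping starts
    if is_flapping and percent_state_change < low_threshold:
        return False  # dropped below the low threshold: flapping stops
    return is_flapping  # between the thresholds nothing changes
```

The gap between the two thresholds keeps an object whose state change hovers around a single value from being marked as flapping and not flapping in rapid succession.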

Table 5.75. Soft State Dependencies Option

Format: soft_state_dependencies=<0/1>
Example: soft_state_dependencies=0

This option determines whether or not Nagios will use soft state information when checking host and service dependencies. Normally Nagios will only use the latest hard host or service state when checking dependencies. If you want it to use the latest state (regardless of whether it's a soft or hard state type), enable this option.

  1. 0 = Don't use soft state dependencies (default)
  2. 1 = Use soft state dependencies

Table 5.76. Service Check Timeout

Format: service_check_timeout=<seconds>
Example: service_check_timeout=60

This is the maximum number of seconds that Nagios will allow service checks to run. If checks exceed this limit, they are killed and a CRITICAL state is returned. A timeout error will also be logged.

There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off plugins which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each service check normally finishes executing within this time limit. If a service check runs longer than this limit, Nagios will kill it off thinking it is a runaway process.
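The "last ditch" behaviour can be sketched in Python as follows. This is an illustration of the idea only, not how Nagios itself runs plugins; the function name `run_check` and the returned message text are assumptions, while the CRITICAL value of 2 follows the usual plugin exit-code convention.

```python
import subprocess

CRITICAL = 2  # plugin exit code conventionally used for a CRITICAL state


def run_check(argv, timeout_seconds):
    """Run a check command; kill it and report CRITICAL if it overruns."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising.
        return CRITICAL, "(check timed out after %s seconds)" % timeout_seconds
    return result.returncode, result.stdout.strip()
```

A check that normally finishes in a few seconds never hits the limit; only a runaway process is killed and reported as CRITICAL.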

Table 5.77. Host Check Timeout

Format: host_check_timeout=<seconds>
Example: host_check_timeout=60

This is the maximum number of seconds that Nagios will allow host checks to run. If checks exceed this limit, they are killed, a CRITICAL state is returned, and the host will be assumed to be DOWN. A timeout error will also be logged.

There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off plugins which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each host check normally finishes executing within this time limit. If a host check runs longer than this limit, Nagios will kill it off thinking it is a runaway process.

Table 5.78. Event Handler Timeout

Format: event_handler_timeout=<seconds>
Example: event_handler_timeout=60

This is the maximum number of seconds that Nagios will allow event handlers to be run. If an event handler exceeds this time limit it will be killed and a warning will be logged.

There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off commands which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each event handler command normally finishes executing within this time limit. If an event handler runs longer than this limit, Nagios will kill it off thinking it is a runaway process.

Table 5.79. Notification Timeout

Format: notification_timeout=<seconds>
Example: notification_timeout=60

This is the maximum number of seconds that Nagios will allow notification commands to be run. If a notification command exceeds this time limit it will be killed and a warning will be logged.

There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off commands which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each notification command normally finishes executing within this time limit. If a notification command runs longer than this limit, Nagios will kill it off thinking it is a runaway process.

Table 5.80. Obsessive Compulsive Service Processor Timeout

Format: ocsp_timeout=<seconds>
Example: ocsp_timeout=5

This is the maximum number of seconds that Nagios will allow an obsessive compulsive service processor command (ocsp_command) to be run. If a command exceeds this time limit it will be killed and a warning will be logged.

Table 5.81. Obsessive Compulsive Host Processor Timeout

Format: ochp_timeout=<seconds>
Example: ochp_timeout=5

This is the maximum number of seconds that Nagios will allow an obsessive compulsive host processor command (ochp_command) to be run. If a command exceeds this time limit it will be killed and a warning will be logged.

Table 5.82. Performance Data Processor Command Timeout

Format: perfdata_timeout=<seconds>
Example: perfdata_timeout=5

This is the maximum number of seconds that Nagios will allow a host performance data processor command (host_perfdata_command) or service performance data processor command (service_perfdata_command) to be run. If a command exceeds this time limit it will be killed and a warning will be logged.

Table 5.83. Obsess Over Services Option

Format: obsess_over_services=<0/1>
Example: obsess_over_services=1

This value determines whether or not Nagios will "obsess" over service check results and run the obsessive compulsive service processor command (ocsp_command) you define. I know - funny name, but it was all I could think of. This option is useful for performing distributed monitoring. If you're not doing distributed monitoring, don't enable this option.

  1. 0 = Don't obsess over services (default)
  2. 1 = Obsess over services

Table 5.84. Obsessive Compulsive Service Processor Command

Format: ocsp_command=<command>
Example: ocsp_command=obsessive_service_handler

This option allows you to specify a command to be run after every service check, which can be useful in distributed monitoring. This command is executed after any event handler or notification commands. The command argument is the short name of a command definition that you define in your object configuration file. The maximum amount of time that this command can run is controlled by the ocsp_timeout option. More information on distributed monitoring can be found in section 9.2. This command is only executed if the obsess_over_services option is enabled globally and if the obsess_over_service directive in the service definition is enabled.

Table 5.85. Obsess Over Hosts Option

Format: obsess_over_hosts=<0/1>
Example: obsess_over_hosts=1

This value determines whether or not Nagios will "obsess" over host check results and run the obsessive compulsive host processor command (ochp_command) you define. I know - funny name, but it was all I could think of. This option is useful for performing distributed monitoring. If you're not doing distributed monitoring, don't enable this option.

  1. 0 = Don't obsess over hosts (default)
  2. 1 = Obsess over hosts

Table 5.86. Obsessive Compulsive Host Processor Command

Format: ochp_command=<command>
Example: ochp_command=obsessive_host_handler

This option allows you to specify a command to be run after every host check, which can be useful in distributed monitoring. This command is executed after any event handler or notification commands. The command argument is the short name of a command definition that you define in your object configuration file. The maximum amount of time that this command can run is controlled by the ochp_timeout option. More information on distributed monitoring can be found in section 9.2. This command is only executed if the obsess_over_hosts option is enabled globally and if the obsess_over_host directive in the host definition is enabled.

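Table 5.87. Performance Data Processing Option

Format: process_performance_data=<0/1>
Example: process_performance_data=1

This option determines whether or not Nagios will process host and service check performance data.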

  1. 0 = Don't process performance data (default)
  2. 1 = Process performance data

Table 5.88. Host Performance Data Processing Command

Format: host_perfdata_command=<command>
Example: host_perfdata_command=process-host-perfdata

This option allows you to specify a command to be run after every host check to process host performance data that may be returned from the check. The command argument is the short name of a command definition that you define in your object configuration file. This command is only executed if the process_performance_data option is enabled globally and if the process_perf_data directive in the host definition is enabled.

Table 5.89. Service Performance Data Processing Command

Format: service_perfdata_command=<command>
Example: service_perfdata_command=process-service-perfdata

This option allows you to specify a command to be run after every service check to process service performance data that may be returned from the check. The command argument is the short name of a command definition that you define in your object configuration file. This command is only executed if the process_performance_data option is enabled globally and if the process_perf_data directive in the service definition is enabled.

Table 5.90. Host Performance Data File

Format: host_perfdata_file=<file_name>
Example: host_perfdata_file=/usr/local/nagios/var/host-perfdata.dat

This option allows you to specify a file to which host performance data will be written after every host check. Data will be written to the performance file as specified by the host_perfdata_file_template option. Performance data is only written to this file if the process_performance_data option is enabled globally and if the process_perf_data directive in the host definition is enabled.

Table 5.91. Service Performance Data File

Format: service_perfdata_file=<file_name>
Example: service_perfdata_file=/usr/local/nagios/var/service-perfdata.dat

This option allows you to specify a file to which service performance data will be written after every service check. Data will be written to the performance file as specified by the service_perfdata_file_template option. Performance data is only written to this file if the process_performance_data option is enabled globally and if the process_perf_data directive in the service definition is enabled.

Table 5.92. Host Performance Data File Template

Format: host_perfdata_file_template=<template>
Example:

host_perfdata_file_template=[HOSTPERFDATA]\t$TIMET$\t$HOSTNAME$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$

This option determines what (and how) data is written to the host performance data file (host_perfdata_file). The template may contain macros, special characters (\t for tab, \r for carriage return, \n for newline) and plain text. A newline is automatically added after each write to the performance data file.

Table 5.93. Service Performance Data File Template

Format: service_perfdata_file_template=<template>
Example:

service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$

This option determines what (and how) data is written to the service performance data file (service_perfdata_file). The template may contain macros, special characters (\t for tab, \r for carriage return, \n for newline) and plain text. A newline is automatically added after each write to the performance data file.
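To make the template rules concrete, here is a small Python sketch of how one line could be produced from a template. It is an illustration only: the function name `render_perfdata_line`, the macro dictionary, and the choice to replace unknown macros with an empty string are assumptions, not Nagios internals.

```python
import re

# Escape sequences the template may contain, as literal two-character text.
ESCAPES = {r"\t": "\t", r"\r": "\r", r"\n": "\n"}


def render_perfdata_line(template, macros):
    """Expand escapes and $MACRO$ names, then append the trailing newline."""
    for literal, char in ESCAPES.items():
        template = template.replace(literal, char)
    # Assumption: a macro with no known value expands to an empty string.
    line = re.sub(r"\$([A-Z0-9_]+)\$",
                  lambda match: str(macros.get(match.group(1), "")),
                  template)
    return line + "\n"
```

Escapes are expanded before macros so that a backslash sequence appearing inside plugin output is written out unchanged.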

Table 5.94. Host Performance Data File Mode

Format: host_perfdata_file_mode=<mode>
Example: host_perfdata_file_mode=a

This option determines how the host performance data file (host_perfdata_file) is opened. Unless the file is a named pipe you'll probably want to use the default mode of append.

  1. a = Open file in append mode (default)
  2. w = Open file in write mode
  3. p = Open in non-blocking read/write mode (useful when writing to pipes)

Table 5.95. Service Performance Data File Mode

Format: service_perfdata_file_mode=<mode>
Example: service_perfdata_file_mode=a

This option determines how the service performance data file (service_perfdata_file) is opened. Unless the file is a named pipe you'll probably want to use the default mode of append.

  1. a = Open file in append mode (default)
  2. w = Open file in write mode
  3. p = Open in non-blocking read/write mode (useful when writing to pipes)

Table 5.96. Host Performance Data File Processing Interval

Format: host_perfdata_file_processing_interval=<seconds>
Example: host_perfdata_file_processing_interval=0

This option allows you to specify the interval (in seconds) at which the host performance data file (host_perfdata_file) is processed using the host performance data file processing command (host_perfdata_file_processing_command). A value of 0 indicates that the performance data file should not be processed at regular intervals.

Table 5.97. Service Performance Data File Processing Interval

Format: service_perfdata_file_processing_interval=<seconds>
Example: service_perfdata_file_processing_interval=0

This option allows you to specify the interval (in seconds) at which the service performance data file (service_perfdata_file) is processed using the service performance data file processing command (service_perfdata_file_processing_command). A value of 0 indicates that the performance data file should not be processed at regular intervals.

Table 5.98. Host Performance Data File Processing Command

Format: host_perfdata_file_processing_command=<command>
Example: host_perfdata_file_processing_command=process-host-perfdata-file

This option allows you to specify the command that should be executed to process the host performance data file (host_perfdata_file). The command argument is the short name of a command definition that you define in your object configuration file. The interval at which this command is executed is determined by the host_perfdata_file_processing_interval directive.

Table 5.99. Service Performance Data File Processing Command

Format: service_perfdata_file_processing_command=<command>
Example: service_perfdata_file_processing_command=process-service-perfdata-file

This option allows you to specify the command that should be executed to process the service performance data file (service_perfdata_file). The command argument is the short name of a command definition that you define in your object configuration file. The interval at which this command is executed is determined by the service_perfdata_file_processing_interval directive.

Table 5.100. Orphaned Service Check Option

Format: check_for_orphaned_services=<0/1>
Example: check_for_orphaned_services=1

This option allows you to enable or disable checks for orphaned service checks. Orphaned service checks are checks which have been executed and have been removed from the event queue, but have not had any results reported in a long time. Since no results have come back in for the service, it is not rescheduled in the event queue. This can cause service checks to stop being executed. Normally it is very rare for this to happen - it might happen if an external user or process killed off the process that was being used to execute a service check. If this option is enabled and Nagios finds that results for a particular service check have not come back, it will log an error message and reschedule the service check. If you start seeing service checks that never seem to get rescheduled, enable this option and see if you notice any log messages about orphaned services.

  1. 0 = Don't check for orphaned service checks
  2. 1 = Check for orphaned service checks (default)

Table 5.101. Orphaned Host Check Option

Format: check_for_orphaned_hosts=<0/1>
Example: check_for_orphaned_hosts=1

This option allows you to enable or disable checks for orphaned host checks. Orphaned host checks are checks which have been executed and have been removed from the event queue, but have not had any results reported in a long time. Since no results have come back in for the host, it is not rescheduled in the event queue. This can cause host checks to stop being executed. Normally it is very rare for this to happen - it might happen if an external user or process killed off the process that was being used to execute a host check. If this option is enabled and Nagios finds that results for a particular host check have not come back, it will log an error message and reschedule the host check. If you start seeing host checks that never seem to get rescheduled, enable this option and see if you notice any log messages about orphaned hosts.

  1. 0 = Don't check for orphaned host checks
  2. 1 = Check for orphaned host checks (default)

Table 5.102. Service Freshness Checking Option

Format: check_service_freshness=<0/1>
Example: check_service_freshness=0

This option determines whether or not Nagios will periodically check the "freshness" of service checks. Enabling this option is useful for helping to ensure that passive service checks are received in a timely manner. More information on freshness checking can be found in section 8.6.

  1. 0 = Don't check service freshness
  2. 1 = Check service freshness (default)

Table 5.103. Service Freshness Check Interval

Format: service_freshness_check_interval=<seconds>
Example: service_freshness_check_interval=60

This setting determines how often (in seconds) Nagios will periodically check the "freshness" of service check results. If you have disabled service freshness checking (with the check_service_freshness option), this option has no effect. More information on freshness checking can be found in section 8.6.

Table 5.104. Host Freshness Checking Option

Format: check_host_freshness=<0/1>
Example: check_host_freshness=0

This option determines whether or not Nagios will periodically check the "freshness" of host checks. Enabling this option is useful for helping to ensure that passive host checks are received in a timely manner. More information on freshness checking can be found in section 8.6.

  1. 0 = Don't check host freshness
  2. 1 = Check host freshness (default)

Table 5.105. Host Freshness Check Interval

Format: host_freshness_check_interval=<seconds>
Example: host_freshness_check_interval=60

This setting determines how often (in seconds) Nagios will periodically check the "freshness" of host check results. If you have disabled host freshness checking (with the check_host_freshness option), this option has no effect. More information on freshness checking can be found in section 8.6.

Table 5.106. Additional Freshness Threshold Latency Option

Format: additional_freshness_latency=<#>
Example: additional_freshness_latency=15

This option determines the number of seconds Nagios will add to any host or service freshness threshold it automatically calculates (i.e. those not specified explicitly by the user). More information on freshness checking can be found in section 8.6.

Table 5.107. Embedded Perl Interpreter Option

Format: enable_embedded_perl=<0/1>
Example: enable_embedded_perl=1

This setting determines whether or not the embedded Perl interpreter is enabled on a program-wide basis. Nagios must be compiled with support for embedded Perl for this option to have an effect. More information on the embedded Perl interpreter can be found here.

Table 5.108. Embedded Perl Implicit Use Option

Format: use_embedded_perl_implicitly=<0/1>
Example: use_embedded_perl_implicitly=1

This setting determines whether or not the embedded Perl interpreter should be used for Perl plugins/scripts that do not explicitly enable/disable it. Nagios must be compiled with support for embedded Perl for this option to have an effect. More information on the embedded Perl interpreter and the effect of this setting can be found here.

Table 5.109. Date Format

Format: date_format=<option>
Example: date_format=us

This option allows you to specify what kind of date/time format Nagios should use in the web interface and date/time macros. Possible options (along with example output) include:

Table 5.110.

Option          Output Format          Sample Output
us              MM/DD/YYYY HH:MM:SS    06/30/2002 03:15:00
euro            DD/MM/YYYY HH:MM:SS    30/06/2002 03:15:00
iso8601         YYYY-MM-DD HH:MM:SS    2002-06-30 03:15:00
strict-iso8601  YYYY-MM-DDTHH:MM:SS    2002-06-30T03:15:00
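The four formats map directly onto `strftime` patterns. The Python sketch below reproduces the sample output from the table; the names `DATE_FORMATS` and `format_timestamp` are illustrative and not part of Nagios.

```python
from datetime import datetime

# strftime equivalents of the date_format options listed above.
DATE_FORMATS = {
    "us": "%m/%d/%Y %H:%M:%S",
    "euro": "%d/%m/%Y %H:%M:%S",
    "iso8601": "%Y-%m-%d %H:%M:%S",
    "strict-iso8601": "%Y-%m-%dT%H:%M:%S",
}


def format_timestamp(moment, date_format="us"):
    """Render a datetime the way the chosen date_format option would."""
    return moment.strftime(DATE_FORMATS[date_format])


sample = datetime(2002, 6, 30, 3, 15, 0)
for option in DATE_FORMATS:
    print(option, format_timestamp(sample, option))
```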

Table 5.111. Timezone Option

Format: use_timezone=<tz>
Example: use_timezone=US/Mountain

This option allows you to override the default timezone that this instance of Nagios runs in. Useful if you have multiple instances of Nagios that need to run from the same server, but have different local times associated with them. If not specified, Nagios will use the system configured timezone.

Note: If you use this option to specify a custom timezone, you will also need to alter the Apache configuration directives for the CGIs to specify the timezone you want. Example:

<Directory "/usr/local/nagios/sbin/">
    SetEnv TZ "US/Mountain"
    ...
</Directory>

Table 5.112. Illegal Object Name Characters

Format: illegal_object_name_chars=<chars...>
Example: illegal_object_name_chars=`~!$%^&*"|'<>?,()=

This option allows you to specify illegal characters that cannot be used in host names, service descriptions, or names of other object types. Nagios will allow you to use most characters in object definitions, but I recommend not using the characters shown in the example above. Doing so may give you problems in the web interface, notification commands, etc.

Table 5.113. Illegal Macro Output Characters

Format: illegal_macro_output_chars=<chars...>
Example: illegal_macro_output_chars=`~$^&"|'<>

This option allows you to specify illegal characters that should be stripped from macros before being used in notifications, event handlers, and other commands. This DOES NOT affect macros used in service or host check commands. You can choose to not strip out the characters shown in the example above, but I recommend you do not do this. Some of these characters are interpreted by the shell (e.g. the backtick) and can lead to security problems. The following macros are stripped of the characters you specify:

$HOSTOUTPUT$, $HOSTPERFDATA$, $HOSTACKAUTHOR$, $HOSTACKCOMMENT$, $SERVICEOUTPUT$, $SERVICEPERFDATA$, $SERVICEACKAUTHOR$, and $SERVICEACKCOMMENT$
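The stripping step amounts to deleting every listed character from the macro value before it is placed on a command line. A minimal Python sketch, where the function name `strip_illegal_chars` is an illustrative assumption:

```python
def strip_illegal_chars(value, illegal_chars):
    """Remove every character listed in illegal_chars from a macro value."""
    return value.translate({ord(char): None for char in illegal_chars})


# The character set from the example above.
ILLEGAL = "`~$^&\"|'<>"
print(strip_illegal_chars("OK `rm -rf` | done", ILLEGAL))
```

With the backticks and the pipe removed, the output can no longer be interpreted by a shell as a command substitution or pipeline.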

Table 5.114. Regular Expression Matching Option

Format: use_regexp_matching=<0/1>
Example: use_regexp_matching=0

This option determines whether or not various directives in your object definitions will be processed as regular expressions. More information on how this works can be found in the object tricks documentation (section 6.2).

  1. 0 = Don't use regular expression matching (default)
  2. 1 = Use regular expression matching

Table 5.115. True Regular Expression Matching Option

Format: use_true_regexp_matching=<0/1>
Example: use_true_regexp_matching=0

If you've enabled regular expression matching of various object directives using the use_regexp_matching option, this option will determine when object directives are treated as regular expressions. If this option is disabled (the default), directives will only be treated as regular expressions if they contain *, ?, +, or \. (a backslash followed by a period). If this option is enabled, all appropriate directives will be treated as regular expressions - be careful when enabling this! More information on how this works can be found in the object tricks documentation (section 6.2).

  1. 0 = Don't use true regular expression matching (default)
  2. 1 = Use true regular expression matching
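The interaction between use_regexp_matching and use_true_regexp_matching can be summarised as a small decision function. This Python sketch follows the rules stated above; the function name `treated_as_regex` is an illustrative assumption.

```python
# Substrings that trigger regex handling when "true" matching is disabled.
REGEX_TRIGGERS = ("*", "?", "+", "\\.")


def treated_as_regex(value, use_regexp_matching, use_true_regexp_matching):
    """Decide whether a directive value is handled as a regular expression."""
    if not use_regexp_matching:
        return False
    if use_true_regexp_matching:
        return True
    return any(trigger in value for trigger in REGEX_TRIGGERS)
```

With true matching enabled, even a plain name such as `web01` is treated as a pattern, which is why the option should be enabled with care.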

Table 5.116. Administrator Email Address

Format: admin_email=<email_address>
Example: admin_email=root@localhost.localdomain

This is the email address for the administrator of the local machine (i.e. the one that Nagios is running on). This value can be used in notification commands by using the $ADMINEMAIL$ macro.

Table 5.117. Administrator Pager

Format: admin_pager=<pager_number_or_pager_email_gateway>
Example: admin_pager=pageroot@localhost.localdomain

This is the pager number (or pager email gateway) for the administrator of the local machine (i.e. the one that Nagios is running on). The pager number/address can be used in notification commands by using the $ADMINPAGER$ macro.

Table 5.118. Event Broker Options

Format: event_broker_options=<#>
Example: event_broker_options=-1

This option controls what (if any) data gets sent to the event broker and, in turn, to any loaded event broker modules. This is an advanced option. When in doubt, either broker nothing (if not using event broker modules) or broker everything (if using event broker modules). Possible values are shown below.

  1. 0 = Broker nothing
  2. -1 = Broker everything
  3. # = See BROKER_* definitions in source code (include/broker.h) for other values that can be OR'ed together

Table 5.119. Event Broker Modules

Format: broker_module=<modulepath> [moduleargs]
Example:

broker_module=/usr/local/nagios/bin/ndomod.o cfg_file=/usr/local/nagios/etc/ndomod.cfg

This directive is used to specify an event broker module that should be loaded by Nagios at startup. Use multiple directives if you want to load more than one module. Arguments that should be passed to the module at startup are separated from the module path by a space.

!!! WARNING !!!

Do NOT overwrite modules while they are being used by Nagios or Nagios will crash in a fiery display of SEGFAULT glory. This is a bug/limitation either in dlopen(), the kernel, and/or the filesystem. And maybe Nagios...

The correct/safe way of updating a module is by using one of these methods:

  1. Shutdown Nagios, replace the module file, restart Nagios
  2. While Nagios is running... delete the original module file, move the new module file into place, restart Nagios

Table 5.120. Debug File

Format: debug_file=<file_name>
Example: debug_file=/usr/local/nagios/var/nagios.debug

This option determines where Nagios should write debugging information. What (if any) information is written is determined by the debug_level and debug_verbosity options. You can have Nagios automatically rotate the debug file when it reaches a certain size by using the max_debug_file_size option.

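Table 5.121. Debug Level

Format: debug_level=<#>
Example: debug_level=24

This option determines what type of debugging information Nagios writes to the debug_file. The values below can be OR'ed together: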

  1. -1 = Log everything
  2. 0 = Log nothing (default)
  3. 1 = Function enter/exit information
  4. 2 = Config information
  5. 4 = Process information
  6. 8 = Scheduled event information
  7. 16 = Host/service check information
  8. 32 = Notification information
  9. 64 = Event broker information
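Because the values are bit flags, the example value 24 is simply 8 OR 16. The following Python sketch decodes a debug_level into the categories listed above; the names `DEBUG_FLAGS` and `describe_debug_level` are illustrative.

```python
# Bit values and their meanings, as listed above.
DEBUG_FLAGS = {
    1: "Function enter/exit information",
    2: "Config information",
    4: "Process information",
    8: "Scheduled event information",
    16: "Host/service check information",
    32: "Notification information",
    64: "Event broker information",
}


def describe_debug_level(level):
    """Return the categories selected by a debug_level value."""
    if level == -1:  # -1 means "log everything"
        return list(DEBUG_FLAGS.values())
    return [name for bit, name in DEBUG_FLAGS.items() if level & bit]


print(describe_debug_level(8 | 16))
```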

Table 5.122. Debug Verbosity

Format: debug_verbosity=<#>
Example: debug_verbosity=1

This option determines how much debugging information Nagios should write to the debug_file.

  1. 0 = Basic information
  2. 1 = More detailed information (default)
  3. 2 = Highly detailed information

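Table 5.123. Maximum Debug File Size

Format: max_debug_file_size=<#>
Example: max_debug_file_size=1000000

This option determines the maximum size (in bytes) of the debug file (debug_file). If the file grows larger than this size, it is automatically renamed with a .old extension. If a file with the .old extension already exists, that older file is deleted. This helps ensure that disk usage does not get out of control while debugging Nagios.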
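The rotation rule described here (rename to a .old file once the size limit is exceeded, discarding any previous .old file) can be sketched in Python as follows. This only illustrates the described behaviour; the function name `rotate_debug_file` is an assumption and the code is not taken from Nagios.

```python
import os


def rotate_debug_file(path, max_size):
    """Rename path to path + '.old' once it exceeds max_size bytes.

    Returns True if a rotation took place.
    """
    if not os.path.exists(path) or os.path.getsize(path) <= max_size:
        return False
    old_path = path + ".old"
    if os.path.exists(old_path):
        os.remove(old_path)   # the previous .old file is discarded
    os.rename(path, old_path)
    return True
```

At most two files exist at any time, so the disk space used for debugging stays bounded at roughly twice the configured size.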

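5.3. Object Configuration Overview

5.3.1. What Are Objects?

Objects are all the elements that are involved in the monitoring and notification logic. Types of objects include:

  1. Services
  2. Service groups
  3. Hosts
  4. Host groups
  5. Contacts
  6. Contact groups
  7. Commands
  8. Time periods
  9. Notification escalations
  10. Notification and execution dependencies

More information on what objects are and how they relate to each other can be found below.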

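5.3.2. Where Are Objects Defined?

Objects can be defined in one or more configuration files and/or directories, which you specify with the cfg_file and/or cfg_dir directives in the main configuration file.

Tip

When you follow the quickstart installation guide, several sample object configuration files are placed in /usr/local/nagios/etc/objects/. You can use these sample files to see how object inheritance works and to learn how to write your own object definitions.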

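5.3.3. How Are Objects Defined?

Objects are defined in a flexible template format, which makes your Nagios configuration much easier to manage. Basic information on how to define objects can be found in section 6.1.

Once you are familiar with the basics of defining objects, you should read up on object inheritance (section 6.4), as it will make your configuration more robust for the future. Seasoned users can exploit some advanced features of object definitions described in the object tricks documentation (section 6.2).

Objects Explained

Some of the main object types are explained in more detail below...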

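  • Hosts are one of the central objects in the monitoring logic. Important attributes of hosts are:

    1. Hosts are usually physical devices on your network (servers, workstations, routers, switches, printers, etc.);
    2. Hosts have an address of some kind (e.g. an IP or MAC address);
    3. Hosts have one or more services associated with them;
    4. Hosts can have parent/child relationships with other hosts, usually reflecting real-world network connections; these relationships are used in the network reachability logic.
  • Host groups are groups of one or more hosts. Host groups make it easier to (1) view the status of related hosts in the Nagios web interface and (2) simplify your configuration through the use of object tricks.

  • Services are one of the central objects in the monitoring logic. Services are associated with hosts and can be:

    1. Attributes of a host (CPU load, disk usage, uptime, etc.);
    2. Services provided by the host (HTTP, POP3, FTP, SSH, etc.);
    3. Other things associated with the host (DNS records, etc.);
  • Service groups are groups of one or more services. Service groups make it easier to (1) view the status of related services in the Nagios web interface and (2) simplify your configuration through the use of object tricks.

  • Contacts are the people involved in the notification process:

    1. Contacts have one or more notification methods (cellphone, pager, email, instant messaging, etc.);
    2. Contacts receive notifications for the hosts and services they are responsible for;
  • Contact groups are groups of one or more contacts. Contact groups make it easier to define all the people who get notified when certain host or service problems occur.

  • Time periods are used to control:

    1. When hosts and services can be monitored;
    2. When contacts can receive notifications;

    Information on how time periods work can be found in section 6.6.

  • Commands tell Nagios which programs, scripts, etc. it should execute to perform:

    1. Host and service checks
    2. Notifications
    3. Event handlers
    4. and more...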

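5.4. CGI Configuration File Options

Note

When creating or editing configuration files, keep the following in mind:

  1. Lines that start with a '#' character are treated as comments and are not processed;
  2. Variable names must begin at the start of the line - no white space is allowed before the name;
  3. Variable names are case-sensitive;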
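The three rules above can be illustrated with a small Python reader for `name=value` lines. This is a sketch of the rules only, not the parser used by the CGIs; the function name `parse_cgi_config` and the decision to silently skip indented lines are assumptions.

```python
def parse_cgi_config(text):
    """Parse name=value lines following the three rules listed above."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue              # blank line or comment
        if line[0].isspace():
            continue              # assumption: indented lines are skipped
        name, separator, value = line.partition("=")
        if separator:
            # Names are kept exactly as written (case-sensitive).
            settings[name.rstrip()] = value.strip()
    return settings
```

Because names are case-sensitive, `use_authentication` and `Use_Authentication` end up as two different keys, and only the former is a variable the CGIs know about.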

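5.4.1. Sample Configuration File

Tip

A sample CGI configuration file (/usr/local/nagios/etc/cgi.cfg) is installed for you when you follow the quickstart installation guide.

5.4.2. Configuration File Location

By default, Nagios expects the CGI configuration file to be named cgi.cfg and to be located in the same directory as the main configuration file. If you need to change the name or location of the file, you can configure Apache to pass an environment variable named NAGIOS_CGI_CONFIG (containing the file name and location) to the CGIs. See the Apache documentation for information on how to do this.

5.4.3. Configuration File Variables

Below you will find descriptions of each CGI configuration file variable and its possible values...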

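Table 5.124. Main Configuration File Location

Format: main_config_file=<file_name>
Example: main_config_file=/usr/local/nagios/etc/nagios.cfg

This specifies the location of your main configuration file. The CGIs need to know where to find this file in order to obtain configuration information, current host and service status, etc.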

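Table 5.125. Physical HTML Path

Format: physical_html_path=<path>
Example: physical_html_path=/usr/local/nagios/share

This is the physical path where the HTML files for Nagios are kept on your workstation or server. Nagios assumes that the documentation and image files are stored in subdirectories called docs/ and images/, respectively.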

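Table 5.126. URL HTML Path

Format: url_html_path=<path>
Example: url_html_path=/nagios

If, when accessing Nagios through a web browser, you point to a URL like http://www.myhost.com/nagios, this value should be /nagios. Basically, it is the path portion of the URL that is used to access the Nagios HTML pages.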

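Table 5.127. Authentication Usage

Format: use_authentication=<0/1>
Example: use_authentication=1

This option controls whether or not the CGIs use the authentication and authorization functionality when determining what information and commands users have access to. If you decide not to use authentication, make sure to remove the command CGI to prevent unauthorized users from issuing commands to Nagios. The CGI will not issue commands to Nagios when authentication is disabled, but I would still suggest removing it, just to be on the safe side. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.

  1. 0 = Don't use authentication functionality
  2. 1 = Use authentication and authorization functionality (default)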

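Table 5.128. Default User Name

Format: default_user_name=<username>
Example: default_user_name=guest

Setting this variable defines a default username that can access the CGIs. This allows people within a secure domain (e.g. behind a firewall) to access the CGIs without having to authenticate to the web server. You may want to use this to avoid basic authentication when you are not running a secure server, since basic authentication transmits passwords in clear text over the Internet.

Important: Do NOT define a default username unless you are running a secure web server and are sure that everyone who can reach it is entitled to use the CGIs. If you define this variable, anyone who has not authenticated to the web server will inherit all the rights you assign to this user!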

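Table 5.129. System/Process Information Access

Format: authorized_for_system_information=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_system_information=nagiosadmin,theboss

This is a comma-delimited list of authenticated users who can view system/process information in the extended information CGI. Users in this list are not automatically authorized to issue system/process commands. If you want users to be able to issue system/process commands as well, you must also add them to the authorized_for_system_commands variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.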

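Table 5.130. System/Process Command Access

Format: authorized_for_system_commands=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_system_commands=nagiosadmin

This is a comma-delimited list of authenticated users who can issue system/process commands through the command CGI. Users in this list are not automatically authorized to view system/process information. If you want users to be able to view system/process information as well, you must also add them to the authorized_for_system_information variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.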

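Table 5.131. Configuration Information Access

Format: authorized_for_configuration_information=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_configuration_information=nagiosadmin

This is a comma-delimited list of authenticated users who can view configuration information in the configuration CGI. Users in this list can view information on all configured hosts, host groups, services, contacts, contact groups, etc. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.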

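Table 5.132. Global Host Information Access

Format: authorized_for_all_hosts=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_all_hosts=nagiosadmin,theboss

This is a comma-delimited list of authenticated users who can view status and configuration information for all hosts. Users in this list are also automatically authorized to view information for all services. Users in this list are not automatically authorized to issue commands for all hosts or services. If you want users to be able to issue commands for all hosts and services as well, you must add them to the authorized_for_all_host_commands variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.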

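Table 5.133. Global Host Command Access

Format: authorized_for_all_host_commands=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_all_host_commands=nagiosadmin

This is a comma-delimited list of authorized users who can issue commands for all hosts through the command CGI. Users in this list are also automatically authorized to issue commands for all services. Users in this list are not automatically authorized to view status or configuration information for all hosts or services. If you want users to be able to view status and configuration information as well, you must add them to the authorized_for_all_hosts variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.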

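Table 5.134. Global Service Information Access

Format: authorized_for_all_services=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_all_services=nagiosadmin,theboss

This is a comma-delimited list of authorized users who can view status and configuration information for all services. Users in this list are not automatically authorized to view information for all hosts, nor are they automatically authorized to issue commands for all services. If you want users to be able to issue commands for all services as well, you must add them to the authorized_for_all_service_commands variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.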

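Table 5.135. Global Service Command Access

Format: authorized_for_all_service_commands=<user1>,<user2>,<user3>,...<usern>
Example: authorized_for_all_service_commands=nagiosadmin

This is a comma-delimited list of authorized users who can issue commands for all services through the command CGI. Users in this list are not automatically authorized to issue commands for all hosts, nor are they automatically authorized to view status or configuration information for all hosts. If you want users to be able to view status and configuration information for all services as well, you must add them to the authorized_for_all_services variable. More information on how to set up authentication and configure authorization for the CGIs can be found in the CGI authentication documentation.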
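The authorized_for_* variables are independent comma-delimited lists: being allowed to view information does not imply being allowed to issue commands, and vice versa. A minimal Python sketch of such a lookup, where the function names and the `cgi_settings` dictionary are illustrative assumptions rather than the CGIs' actual code:

```python
def parse_user_list(value):
    """Split a comma-delimited authorized_for_* value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def is_authorized(user, cgi_settings, directive):
    """Check whether user appears in the list stored under directive."""
    return user in parse_user_list(cgi_settings.get(directive, ""))


cgi_settings = {
    "authorized_for_all_services": "nagiosadmin,theboss",
    "authorized_for_all_service_commands": "nagiosadmin",
}
# theboss may view all services but may not issue commands for them.
print(is_authorized("theboss", cgi_settings, "authorized_for_all_services"))
print(is_authorized("theboss", cgi_settings, "authorized_for_all_service_commands"))
```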

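Table 5.136. Lock Author Names

Format: lock_author_names=[0/1]
Example: lock_author_names=1

This option allows you to prevent users from changing the author name when submitting comments, acknowledgements, and scheduled downtime from the web interface. If this option is enabled, users cannot change the author name associated with the command request.

  1. 0 = Allow users to change author names when submitting commands
  2. 1 = Prevent users from changing author names (default)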

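Table 5.137. Statusmap CGI Background Image

Format: statusmap_background_image=<image_file>
Example: statusmap_background_image=smbackground.gd2

This option allows you to specify an image to be used as the background in the statusmap CGI when you use the user-supplied coordinates layout method. The background image is not available with any other layout method. The image is assumed to reside in the HTML images path (e.g. /usr/local/nagios/share/images). This path is determined automatically by appending "/images" to the path specified by the physical_html_path directive. Note: the image file can be in GIF, JPEG, PNG, or GD2 format. GD2 format is recommended, as it reduces the CPU load when the CGI generates the map image.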

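Table 5.138. Default Statusmap Layout Method

Format: default_statusmap_layout=<layout_number>
Example: default_statusmap_layout=4

This option allows you to specify the default layout method used by the statusmap CGI. Valid values are:

Table 5.139. Statusmap <layout_number> values

Value  Layout Method
0      User-defined coordinates
1      Depth layers
2      Collapsed tree
3      Balanced tree
4      Circular
5      Circular (marked up)
6      Circular (balloon)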

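Table 5.140. Statuswrl CGI Include World

Format: statuswrl_include=<vrml_file>
Example: statuswrl_include=myworld.wrl

This option allows you to include your own objects in the generated VRML world. The file is assumed to reside in the path specified by the physical_html_path directive. Note: this file must be a fully qualified VRML world (i.e. you can view it by itself in a VRML browser).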

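Table 5.141. Default Statuswrl Layout Method

Format: default_statuswrl_layout=<layout_number>
Example: default_statuswrl_layout=4

This option allows you to specify the default layout method used by the statuswrl CGI to generate the 3-D coordinates of objects. Valid values are:

Table 5.142. Statuswrl <layout_number> values

Value  Layout Method
0      User-defined coordinates
2      Collapsed tree
3      Balanced tree
4      Circular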

Table 5.143. CGI Refresh Rate

Format: refresh_rate=<rate_in_seconds>
Example: refresh_rate=90

This option allows you to specify the number of seconds between page refreshes for the status, statusmap, and extinfo CGIs.

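Table 5.144. Audio Alerts

Format:

host_unreachable_sound=<sound_file>
host_down_sound=<sound_file>
service_critical_sound=<sound_file>
service_warning_sound=<sound_file>
service_unknown_sound=<sound_file>

Example:

host_unreachable_sound=hostu.wav
host_down_sound=hostd.wav
service_critical_sound=critical.wav
service_warning_sound=warning.wav
service_unknown_sound=unknown.wav

These options allow you to specify an audio file that is played in your browser if there are problems when you are viewing the status CGI. If there are problems, the audio file for the most critical type of problem is played. The most critical type of problem is one or more unreachable hosts, while the least critical is one or more services in an unknown state (see the order in the example above). Audio files are assumed to be in the "media/" subdirectory of your HTML directory (e.g. /usr/local/nagios/share/media).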

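Table 5.145. Ping Syntax

Format: ping_syntax=<command>
Example: ping_syntax=/bin/ping -n -U -c 5 $HOSTADDRESS$

This option determines the syntax used when pinging a host from the WAP interface (using the statuswml CGI). You must give the full path to the ping binary along with all required options. The $HOSTADDRESS$ macro is replaced with the address of the host before the command is executed.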

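Table 5.146. Escape HTML Tags Option

Format: escape_html_tags=[0/1]
Example: escape_html_tags=1

This option determines whether HTML tags in host and service (plugin) output are escaped in the CGIs. If you enable this option, your plugin output will not be able to contain clickable hyperlinks.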

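Table 5.147. Notes URL Target

Format: notes_url_target=[target]
Example: notes_url_target=_blank

This option determines the name of the frame target in which notes URLs are displayed. Valid options include _blank, _self, _top, _parent, or any other valid target name.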

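Table 5.148. Action URL Target

Format: action_url_target=[target]
Example: action_url_target=_blank

This option determines the name of the frame target in which action URLs are displayed. Valid options include _blank, _self, _top, _parent, or any other valid target name.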

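Table 5.149. Splunk Integration Option

Format: enable_splunk_integration=[0/1]
Example: enable_splunk_integration=1

This option determines whether integration with Splunk is enabled in the web interface. If it is enabled, "Splunk It" links appear in various places in the CGIs (log file, alert history, host and service detail, etc.). This is useful when you are trying to find out why a particular problem occurred. For more information on Splunk, visit http://www.splunk.com/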

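Table 5.150. Splunk URL

Format: splunk_url=<path>
Example: splunk_url=http://127.0.0.1:8000/

This option defines the base URL of your Splunk interface. The CGIs use this URL when creating links to Splunk if the enable_splunk_integration option is enabled.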

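Chapter 6. Basic Concepts of Nagios Monitoring and Configuration

6.1. Object Definitions

6.1.1. Introduction

One of the features of the Nagios object configuration format is that you can create object definitions that inherit properties from other object definitions. An explanation of how object inheritance works can be found in the object inheritance documentation. You are strongly encouraged to familiarize yourself with object inheritance once you have read the material below, as it makes creating and maintaining object definitions much easier. Also read the documentation on object definition tricks, which offers shortcuts for otherwise tedious configuration tasks.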

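Note

When creating or editing configuration files, keep the following in mind:
  1. Lines that start with a '#' character are treated as comments and are not processed;
  2. Directive names are case-sensitive.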
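
The two rules above can be seen in a short object configuration fragment. This is only a sketch: the host values are borrowed from the sample host definition later in this section, and the command, time period, and contact group it refers to are assumed to be defined elsewhere.

```cfg
# Lines that begin with '#' are comments and are not processed.
define host{
    # Directive names are case-sensitive: write "host_name" and "address",
    # not "Host_Name" or "ADDRESS".
    host_name               bogus-router
    # The '#' in the alias value below is not at the start of the line,
    # so it is part of the value (as in this section's sample definition).
    alias                   Bogus Router #1
    address                 192.168.1.254
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }
```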

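6.1.2. Retention Notes

It is important to point out that several directives in host, service, and contact definitions may not be picked up by Nagios when you change them in your configuration files. Directives that can behave this way are marked with asterisks (**) in the directive descriptions below. The reason is that Nagios honors values stored in the state retention file over values found in the configuration files, provided that state retention is enabled on a program-wide basis and the value of the directive was changed at runtime with an external command.

One way around this problem is to disable the retention of non-status information with the retain_nonstatus_information directive in host, service, and contact definitions. Disabling it causes Nagios to take the initial values for these directives from your configuration files, rather than from the state retention file, when it restarts.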
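
A minimal sketch of this workaround, assuming a host whose configuration file should always take precedence over the retention file on restart. The host values reuse the sample host definition from this section; the referenced command, time period, and contact group are assumed to exist elsewhere.

```cfg
define host{
    host_name                       bogus-router
    alias                           Bogus Router #1
    address                         192.168.1.254
    check_command                   check-host-alive
    max_check_attempts              5
    check_period                    24x7
    contact_groups                  router-admins
    notification_interval           30
    notification_period             24x7
    # Keep status information across restarts...
    retain_status_information       1
    # ...but always re-read non-status settings from this file, even if
    # they were changed at runtime by an external command.
    retain_nonstatus_information    0
    }
```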

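6.1.3. Sample Configuration Files

Note

If you follow the quickstart installation guide, sample object configuration files are installed in the /usr/local/nagios/etc/ directory.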

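6.1.4. Object Types

6.1.4.1.  Host Definition

Description:

A host definition is used to define a physical server, workstation, device, etc. that resides on your network.

Definition Format:

Note

Directives marked with (*) are required; all others are optional.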

define host{
    host_name                       host_name(*)
    alias                           alias(*)
    display_name                    display_name
    address                         address(*)
    parents                         host_names
    hostgroups                      hostgroup_names
    check_command                   command_name
    initial_state                   [o,d,u]
    max_check_attempts              #(*)
    check_interval                  #
    retry_interval                  #
    active_checks_enabled           [0/1]
    passive_checks_enabled          [0/1]
    check_period                    timeperiod_name(*)
    obsess_over_host                [0/1]
    check_freshness                 [0/1]
    freshness_threshold             #
    event_handler                   command_name
    event_handler_enabled           [0/1]
    low_flap_threshold              #
    high_flap_threshold             #
    flap_detection_enabled          [0/1]
    flap_detection_options          [o,d,u]
    process_perf_data               [0/1]
    retain_status_information       [0/1]
    retain_nonstatus_information    [0/1]
    contacts                        contacts(*)
    contact_groups                  contact_groups(*)
    notification_interval           #(*)
    first_notification_delay        #
    notification_period             timeperiod_name(*)
    notification_options            [d,u,r,f,s]
    notifications_enabled           [0/1]
    stalking_options                [o,d,u]
    notes                           note_string
    notes_url                       url
    action_url                      url
    icon_image                      image_file
    icon_image_alt                  alt_string
    vrml_image                      image_file
    statusmap_image                 image_file
    2d_coords                       x_coord,y_coord
    3d_coords                       x_coord,y_coord,z_coord
    ...
    }

Example Definition:

define host{
    host_name                       bogus-router
    alias                           Bogus Router #1
    address                         192.168.1.254
    parents                         server-backbone
    check_command                   check-host-alive
    check_interval                  5
    retry_interval                  1
    max_check_attempts              5
    check_period                    24x7
    process_perf_data               0
    retain_nonstatus_information    0
    contact_groups                  router-admins
    notification_interval           30
    notification_period             24x7
    notification_options            d,u,r
    }

Directive Descriptions:

host_name: This directive is used to define a short name used to identify the host. It is used in host group and service definitions to reference this particular host. Hosts can have multiple services (which are monitored) associated with them. When used properly, the $HOSTNAME$ macro will contain this short name.

alias: This directive is used to define a longer name or description used to identify the host. It is provided in order to allow you to more easily identify a particular host. When used properly, the $HOSTALIAS$ macro will contain this alias/description.

address: This directive is used to define the address of the host. Normally, this is an IP address, although it could really be anything you want (so long as it can be used to check the status of the host). You can use a FQDN to identify the host instead of an IP address, but if DNS services are not available this could cause problems. When used properly, the $HOSTADDRESS$ macro will contain this address. Note: If you do not specify an address directive in a host definition, the name of the host will be used as its address. A word of caution about doing this, however - if DNS fails, most of your service checks will fail because the plugins will be unable to resolve the host name.

display_name: This directive is used to define an alternate name that should be displayed in the web interface for this host. If not specified, this defaults to the value you specify for the host_name directive. Note: The current CGIs do not use this option, although future versions of the web interface will.

parents: This directive is used to define a comma-delimited list of short names of the "parent" hosts for this particular host. Parent hosts are typically routers, switches, firewalls, etc. that lie between the monitoring host and a remote host. A router, switch, etc. which is closest to the remote host is considered to be that host's "parent". Read the "Determining Status and Reachability of Network Hosts" document for more information. If this host is on the same network segment as the host doing the monitoring (without any intermediate routers, etc.) the host is considered to be on the local network and will not have a parent host. Leave this value blank if the host does not have a parent host (i.e. it is on the same segment as the Nagios host). The order in which you specify parent hosts has no effect on how things are monitored.
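
The parent/child relationship described above can be sketched as follows. This is a minimal example under assumed names and addresses: server-backbone and bogus-router come from this section's sample, remote-web is hypothetical, and the check command, time period, and contact group are assumed to be defined elsewhere.

```cfg
# Assumed topology: Nagios host -> server-backbone -> bogus-router -> remote-web
define host{
    host_name               server-backbone
    alias                   Backbone Switch
    address                 192.168.1.1
    # No parents directive: this host is on the same segment as the Nagios host.
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }

define host{
    host_name               bogus-router
    alias                   Bogus Router #1
    address                 192.168.1.254
    parents                 server-backbone
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }

define host{
    host_name               remote-web
    alias                   Remote Web Server
    address                 10.0.0.10
    # The device closest to this host on the path from the Nagios host.
    parents                 bogus-router
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }
```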

hostgroups: This directive is used to identify the short name(s) of the hostgroup(s) that the host belongs to. Multiple hostgroups should be separated by commas. This directive may be used as an alternative to (or in addition to) using the members directive in hostgroup definitions.

check_command: This directive is used to specify the short name of the command that should be used to check if the host is up or down. Typically, this command would try and ping the host to see if it is "alive". The command must return a status of OK (0) or Nagios will assume the host is down. If you leave this argument blank, the host will not be actively checked. Thus, Nagios will likely always assume the host is up (it may show up as being in a "PENDING" state in the web interface). This is useful if you are monitoring printers or other devices that are frequently turned off. The maximum amount of time that the host check command can run is controlled by the host_check_timeout option.
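
The short name given in check_command refers to a separate command definition. A hedged sketch of one that pings the host is shown below; the $USER1$ plugin path macro, the check_ping plugin, and the thresholds are assumptions for illustration and are not taken from this section.

```cfg
# Hypothetical command definition for the check-host-alive command used in
# this section's sample host. Thresholds are illustrative only.
define command{
    command_name    check-host-alive
    # -w / -c take "round-trip time in ms,packet loss %"; -p is the packet count.
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
    }
```

Nagios replaces $HOSTADDRESS$ with the value of the host's address directive before running the command.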

initial_state: By default Nagios will assume that all hosts are in UP states when it starts. You can override the initial state for a host by using this directive. Valid options are: o = UP, d = DOWN, and u = UNREACHABLE.

max_check_attempts: This directive is used to define the number of times that Nagios will retry the host check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the host check again. Note: If you do not want to check the status of the host, you must still set this to a minimum value of 1. To bypass the host check, just leave the check_command option blank.

check_interval: This directive is used to define the number of "time units" between regularly scheduled checks of the host. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the host. Hosts are rescheduled at the retry interval when they have changed to a non-UP state. Once the host has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.
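
How check_interval, retry_interval, and max_check_attempts combine can be sketched with the values from this section's sample host. The comments assume interval_length is left at its default of 60, so one time unit is one minute; the referenced command, time period, and contact group are assumed to exist elsewhere.

```cfg
define host{
    host_name               bogus-router
    alias                   Bogus Router #1
    address                 192.168.1.254
    check_command           check-host-alive
    # While the host is UP it is checked every 5 time units (5 minutes).
    check_interval          5
    # After a non-UP result it is re-checked every 1 time unit (1 minute).
    retry_interval          1
    # After this many attempts without a state change, scheduling goes
    # back to check_interval.
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }
```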

active_checks_enabled **: This directive is used to determine whether or not active checks (either regularly scheduled or on-demand) of this host are enabled. Values: 0 = disable active host checks, 1 = enable active host checks.

passive_checks_enabled **: This directive is used to determine whether or not passive checks are enabled for this host. Values: 0 = disable passive host checks, 1 = enable passive host checks.

check_period: This directive is used to specify the short name of the time period during which active checks of this host can be made.

obsess_over_host **: This directive determines whether or not checks for the host will be "obsessed" over using the ochp_command.

check_freshness **: This directive is used to determine whether or not freshness checks are enabled for this host. Values: 0 = disable freshness checks, 1 = enable freshness checks.

freshness_threshold: This directive is used to specify the freshness threshold (in seconds) for this host. If you set this directive to a value of 0, Nagios will determine a freshness threshold to use automatically.
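
A hedged sketch of a host that is monitored passively and relies on freshness checking. The host name, address, and the one-hour threshold are hypothetical, and the referenced command, time period, and contact group are assumed to be defined elsewhere.

```cfg
define host{
    host_name               passive-host
    alias                   Passively Checked Host
    address                 192.168.1.50
    check_command           check-host-alive
    max_check_attempts      1
    check_period            24x7
    # Results are submitted from outside rather than scheduled by Nagios.
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    # Threshold in seconds; 0 would let Nagios choose one automatically.
    freshness_threshold     3600
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }
```

Host freshness checking also has a program-wide switch (check_host_freshness in the main configuration file), which is assumed to be enabled here.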

event_handler: This directive is used to specify the short name of the command that should be run whenever a change in the state of the host is detected (i.e. whenever it goes down or recovers). Read the documentation on event handlers for a more detailed explanation of how to write scripts for handling events. The maximum amount of time that the event handler command can run is controlled by the event_handler_timeout option.

event_handler_enabled **: This directive is used to determine whether or not the event handler for this host is enabled. Values: 0 = disable host event handler, 1 = enable host event handler.

low_flap_threshold: This directive is used to specify the low state change threshold used in flap detection for this host. More information on flap detection can be found here. If you set this directive to a value of 0, the program-wide value specified by the low_host_flap_threshold directive will be used.

high_flap_threshold: This directive is used to specify the high state change threshold used in flap detection for this host. More information on flap detection can be found here. If you set this directive to a value of 0, the program-wide value specified by the high_host_flap_threshold directive will be used.
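
The flap detection directives can be combined as in the sketch below. The threshold values are illustrative assumptions; they are expressed as percent state change, and a value of 0 would fall back to the program-wide thresholds as described above.

```cfg
define host{
    host_name               bogus-router
    alias                   Bogus Router #1
    address                 192.168.1.254
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    flap_detection_enabled  1
    # Illustrative per-host thresholds (percent state change).
    low_flap_threshold      10.0
    high_flap_threshold     30.0
    # Only UP and DOWN states are used by the flap detection logic.
    flap_detection_options  o,d
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    }
```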

flap_detection_enabled **: This directive is used to determine whether or not flap detection is enabled for this host. More information on flap detection can be found here. Values: 0 = disable host flap detection, 1 = enable host flap detection.

flap_detection_options: This directive is used to determine what host states the flap detection logic will use for this host. Valid options are a combination of one or more of the following: o = UP states, d = DOWN states, u = UNREACHABLE states.

process_perf_data **: This directive is used to determine whether or not the processing of performance data is enabled for this host. Values: 0 = disable performance data processing, 1 = enable performance data processing.

retain_status_information: This directive is used to determine whether or not status-related information about the host is retained across program restarts. This is only useful if you have enabled state retention using the retain_state_information directive. Value: 0 = disable status information retention, 1 = enable status information retention.

retain_nonstatus_information: This directive is used to determine whether or not non-status information about the host is retained across program restarts. This is only useful if you have enabled state retention using the retain_state_information directive. Value: 0 = disable non-status information retention, 1 = enable non-status information retention.

contacts: This is a list of the short names of the contacts that should be notified whenever there are problems (or recoveries) with this host. Multiple contacts should be separated by commas. Useful if you want notifications to go to just a few people and don't want to configure contact groups. You must specify at least one contact or contact group in each host definition.

contact_groups: This is a list of the short names of the contact groups that should be notified whenever there are problems (or recoveries) with this host. Multiple contact groups should be separated by commas. You must specify at least one contact or contact group in each host definition.

notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this server is still down or unreachable. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this host - only one problem notification will be sent out.

first_notification_delay: This directive is used to define the number of "time units" to wait before sending out the first problem notification when this host enters a non-UP state. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will start sending out notifications immediately.

notification_period: This directive is used to specify the short name of the time period during which notifications of events for this host can be sent out to contacts. If a host goes down, becomes unreachable, or recovers during a time which is not covered by the time period, no notifications will be sent out.

notification_options: This directive is used to determine when notifications for the host should be sent out. Valid options are a combination of one or more of the following: d = send notifications on a DOWN state, u = send notifications on an UNREACHABLE state, r = send notifications on recoveries (UP state), f = send notifications when the host starts and stops flapping, and s = send notifications when scheduled downtime starts and ends. If you specify n (none) as an option, no host notifications will be sent out. If you do not specify any notification options, Nagios will assume that you want notifications to be sent out for all possible states. Example: If you specify d,r in this field, notifications will only be sent out when the host goes DOWN and when it recovers from a DOWN state.
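
The notification directives above can be combined as in this sketch. The 10-unit delay is a hypothetical value, the comments assume interval_length is left at 60, and the referenced command, time period, and contact group are assumed to exist elsewhere.

```cfg
define host{
    host_name                   bogus-router
    alias                       Bogus Router #1
    address                     192.168.1.254
    check_command               check-host-alive
    max_check_attempts          5
    check_period                24x7
    contact_groups              router-admins
    # Wait 10 time units (10 minutes) after the host enters a non-UP state
    # before sending the first problem notification.
    first_notification_delay    10
    # 0 = send only one problem notification and never re-notify.
    notification_interval       0
    notification_period         24x7
    # Notify only when the host goes DOWN and when it recovers.
    notification_options        d,r
    }
```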

notifications_enabled **: This directive is used to determine whether or not notifications for this host are enabled. Values: 0 = disable host notifications, 1 = enable host notifications.

stalking_options: This directive determines which host states "stalking" is enabled for. Valid options are a combination of one or more of the following: o = stalk on UP states, d = stalk on DOWN states, and u = stalk on UNREACHABLE states. More information on state stalking can be found here.

notes: This directive is used to define an optional string of notes pertaining to the host. If you specify a note here, you will see it in the extended information CGI (when you are viewing information about the specified host).

notes_url: This variable is used to define an optional URL that can be used to provide more information about the host. If you specify a URL, you will see a red folder icon in the CGIs (when you are viewing host information) that links to the URL you specify here. Any valid URL can be used. If you plan on using relative paths, the base path will be the same as what is used to access the CGIs (i.e. /cgi-bin/nagios/). This can be very useful if you want to make detailed information on the host, emergency contact methods, etc. available to other support staff.

action_url: This directive is used to define an optional URL that can be used to provide more actions to be performed on the host. If you specify a URL, you will see a red "splat" icon in the CGIs (when you are viewing host information) that links to the URL you specify here. Any valid URL can be used. If you plan on using relative paths, the base path will be the same as what is used to access the CGIs (i.e. /cgi-bin/nagios/).

icon_image: This variable is used to define the name of a GIF, PNG, or JPG image that should be associated with this host. This image will be displayed in the various places in the CGIs. The image will look best if it is 40x40 pixels in size. Images for hosts are assumed to be in the logos/ subdirectory in your HTML images directory (i.e. /usr/local/nagios/share/images/logos).

icon_image_alt: This variable is used to define an optional string that is used in the ALT tag of the image specified by the <icon_image> argument.

vrml_image: This variable is used to define the name of a GIF, PNG, or JPG image that should be associated with this host. This image will be used as the texture map for the specified host in the statuswrl CGI. Unlike the image you use for the <icon_image> variable, this one should probably not have any transparency. If it does, the host object will look a bit weird. Images for hosts are assumed to be in the logos/ subdirectory in your HTML images directory (i.e. /usr/local/nagios/share/images/logos).

statusmap_image: This variable is used to define the name of an image that should be associated with this host in the statusmap CGI. You can specify a JPEG, PNG, or GIF image if you want, although I would strongly suggest using a GD2 format image, as other image formats will result in a lot of wasted CPU time when the statusmap image is generated. GD2 images can be created from PNG images by using the pngtogd2 utility supplied with Thomas Boutell's gd library. The GD2 images should be created in uncompressed format in order to minimize CPU load when the statusmap CGI is generating the network map image. The image will look best if it is 40x40 pixels in size. You can leave this option blank if you are not using the statusmap CGI. Images for hosts are assumed to be in the logos/ subdirectory in your HTML images directory (i.e. /usr/local/nagios/share/images/logos).

2d_coords: This variable is used to define coordinates to use when drawing the host in the statusmap CGI. Coordinates should be given in positive integers, as they correspond to physical pixels in the generated image. The origin for drawing (0,0) is in the upper left hand corner of the image and extends in the positive x direction (to the right) along the top of the image and in the positive y direction (down) along the left hand side of the image. For reference, the size of the icons drawn is usually about 40x40 pixels (text takes a little extra space). The coordinates you specify here are for the upper left hand corner of the host icon that is drawn. Note: Don't worry about what the maximum x and y coordinates that you can use are. The CGI will automatically calculate the maximum dimensions of the image it creates based on the largest x and y coordinates you specify.

3d_coords: This variable is used to define coordinates to use when drawing the host in the statuswrl CGI. Coordinates can be positive or negative real numbers. The origin for drawing is (0.0,0.0,0.0). For reference, the size of the host cubes drawn is 0.5 units on each side (text takes a little more space). The coordinates you specify here are used as the center of the host cube.
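
The image and coordinate directives can be put together as in the sketch below. All image file names and coordinate values are hypothetical; the files are assumed to sit in the logos/ subdirectory described above, and the referenced command, time period, and contact group are assumed to exist elsewhere.

```cfg
define host{
    host_name               bogus-router
    alias                   Bogus Router #1
    address                 192.168.1.254
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          router-admins
    notification_interval   30
    notification_period     24x7
    icon_image              router.png
    icon_image_alt          Router
    # No transparency recommended for the VRML texture.
    vrml_image              router-opaque.png
    # Uncompressed GD2 keeps statusmap CPU load down.
    statusmap_image         router.gd2
    # Upper left corner of the icon, in pixels from the top left of the map.
    2d_coords               100,250
    # Center of the host cube; real numbers, may be negative.
    3d_coords               20.0,-5.0,75.0
    }
```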

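6.1.4.2.  Host Group Definition

Description:

A host group definition is used to group one or more hosts together, either to simplify configuration or for display purposes in the CGIs.

Definition Format:

Note

Directives marked with (*) are required; all others are optional.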

define hostgroup{
    hostgroup_name       hostgroup_name(*)
    alias                alias(*)
    members              hosts
    hostgroup_members    hostgroups
    notes                note_string
    notes_url            url
    action_url           url
    ...
    }

Example Definition:

define hostgroup{
    hostgroup_name    novell-servers
    alias             Novell Servers
    members           netware1,netware2,netware3,netware4
    }

Directive Descriptions:

hostgroup_name: This directive is used to define a short name used to identify the host group.

alias: This directive is used to define a longer name or description used to identify the host group. It is provided in order to allow you to more easily identify a particular host group.

members: This is a list of the short names of hosts that should be included in this group. Multiple host names should be separated by commas. This directive may be used as an alternative to (or in addition to) the hostgroups directive in host definitions.

hostgroup_members: This optional directive can be used to include hosts from other "sub" host groups in this host group. Specify a comma-delimited list of short names of other host groups whose members should be included in this group.

notes: This directive is used to define an optional string of notes pertaining to the host group. If you specify a note here, you will see it in the extended information CGI (when you are viewing information about the specified host group).

notes_url: This variable is used to define an optional URL that can be used to provide more information about the host group. If you specify a URL, you will see a red folder icon in the CGIs (when you are viewing hostgroup information) that links to the URL you specify here. Any valid URL can be used. If you plan on using relative paths, the base path will be the same as what is used to access the CGIs (i.e. /cgi-bin/nagios/). This can be very useful if you want to make detailed information on the host group, emergency contact methods, etc. available to other support staff.

action_url: This directive is used to define an optional URL that can be used to provide more actions to be performed on the host group. If you specify a URL, you will see a red "splat" icon in the CGIs (when you are viewing hostgroup information) that links to the URL you specify here. Any valid URL can be used. If you plan on using relative paths, the base path will be the same as what is used to access the CGIs (i.e. /cgi-bin/nagios/).
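
The members and hostgroup_members directives can be combined to nest one host group inside another. In this sketch, novell-servers comes from the example above, while all-servers is a hypothetical group; linux-server and the netware hosts are assumed to be defined elsewhere.

```cfg
define hostgroup{
    hostgroup_name       novell-servers
    alias                Novell Servers
    members              netware1,netware2,netware3,netware4
    }

define hostgroup{
    hostgroup_name       all-servers
    alias                All Servers
    # Hosts listed directly in this group.
    members              linux-server
    # Pull in every member of the novell-servers group as well.
    hostgroup_members    novell-servers
    }
```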

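6.1.4.3.  Service Definition

Description:

A service definition is used to identify a "service" that runs on a host. The term "service" is used very loosely: it can mean an actual service process running on the host (POP3, SMTP, HTTP, etc.) or some other type of metric associated with the host (response to a ping, number of logged-in users, free disk space, etc.). The different arguments to a service definition are described below.

Definition Format:

Note

Directives marked with (*) are required; all others are optional.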

define service{
    host_name                       host_name(*)
    hostgroup_name                  hostgroup_name
    service_description             service_description(*)
    display_name                    display_name
    servicegroups                   servicegroup_names
    is_volatile                     [0/1]
    check_command                   command_name(*)
    initial_state                   [o,w,u,c]
    max_check_attempts              #(*)
    check_interval                  #(*)
    retry_interval                  #(*)
    active_checks_enabled           [0/1]
    passive_checks_enabled          [0/1]
    check_period                    timeperiod_name(*)
    obsess_over_service             [0/1]
    check_freshness                 [0/1]
    freshness_threshold             #
    event_handler                   command_name
    event_handler_enabled           [0/1]
    low_flap_threshold              #
    high_flap_threshold             #
    flap_detection_enabled          [0/1]
    flap_detection_options          [o,w,c,u]
    process_perf_data               [0/1]
    retain_status_information       [0/1]
    retain_nonstatus_information    [0/1]
    notification_interval           #(*)
    first_notification_delay        #
    notification_period             timeperiod_name(*)
    notification_options            [w,u,c,r,f,s]
    notifications_enabled           [0/1]
    contacts                        contacts(*)
    contact_groups                  contact_groups(*)
    stalking_options                [o,w,u,c]
    notes                           note_string
    notes_url                       url
    action_url                      url
    icon_image                      image_file
    icon_image_alt                  alt_string
    ...
    }

Example Definition:

define service{
    host_name                linux-server
    service_description      check-disk-sda1
    check_command            check-disk!/dev/sda1
    max_check_attempts       5
    check_interval           5
    retry_interval           3
    check_period             24x7
    notification_interval    30
    notification_period      24x7
    notification_options     w,c,r
    contact_groups           linux-admins
    }

Directive Descriptions:
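
The check_command value in the sample above, check-disk!/dev/sda1, names a command and passes it one argument after the '!' separator. A command definition that would accept it might look like the following sketch. The $USER1$ plugin path macro, the check_disk plugin, and the 20%/10% thresholds are assumptions for illustration; only the command name and the argument come from the sample.

```cfg
# Hypothetical command definition matching "check-disk!/dev/sda1".
define command{
    command_name    check-disk
    # $ARG1$ receives the text after the first '!' (here /dev/sda1).
    # -w / -c are the warning and critical free-space thresholds.
    command_line    $USER1$/check_disk -w 20% -c 10% -p $ARG1$
    }
```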

host_name: This directive is used to specify the short name(s) of the host(s) that the service "runs" on or is associated with. Multiple hosts should be separated by commas.

hostgroup_name: This directive is used to specify the short name(s) of the hostgroup(s) that the servi