Configuring Zabbix for NVIDIA GPU Monitoring
Step 1: Locate and Edit the Zabbix Configuration File
Open the Zabbix Agent configuration file at C:\Program Files\Zabbix Agent\zabbix_agentd.conf and add the following lines to the end of the file to define new UserParameters for GPU monitoring:
UserParameter=gpu.number,"nvidia-smi.exe" -L | find /c /v ""
UserParameter=gpu.discovery,C:\scripts\get_gpus_info.bat
UserParameter=gpu.fanspeed[*],"nvidia-smi.exe" --query-gpu=fan.speed --format=csv,noheader,nounits -i $1
UserParameter=gpu.power[*],"nvidia-smi.exe" --query-gpu=power.draw --format=csv,noheader,nounits -i $1
UserParameter=gpu.temp[*],"nvidia-smi.exe" --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.utilization[*],"nvidia-smi.exe" --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.memfree[*],"nvidia-smi.exe" --query-gpu=memory.free --format=csv,noheader,nounits -i $1
UserParameter=gpu.memused[*],"nvidia-smi.exe" --query-gpu=memory.used --format=csv,noheader,nounits -i $1
UserParameter=gpu.memtotal[*],"nvidia-smi.exe" --query-gpu=memory.total --format=csv,noheader,nounits -i $1
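To illustrate how the [*] keys above work: for a flexible UserParameter, the agent replaces $1 in the command with the first parameter of the requested item key. The Python sketch below imitates that substitution for illustration only. The helper name expand_user_parameter is hypothetical and not part of Zabbix, and it assumes simple, unquoted, comma-separated key parameters.

```python
import re


def expand_user_parameter(definition, item_key):
    """Return the command line a flexible UserParameter would expand to.

    Simplified illustration: handles $1..$9 and unquoted parameters only.
    """
    # "UserParameter=<key>[*],<command>" -> key spec and command template.
    body = definition.split("=", 1)[1]
    key_spec, command = body.split(",", 1)
    base = key_spec[:-3] if key_spec.endswith("[*]") else key_spec

    # Split the requested item key into its name and optional [parameters].
    match = re.fullmatch(r"([^\[\]]+)(?:\[(.*)\])?", item_key)
    if match is None or match.group(1) != base:
        raise ValueError("item key does not match this UserParameter")
    params = match.group(2).split(",") if match.group(2) else []

    def substitute(found):
        index = int(found.group(1))
        # A missing parameter expands to an empty string.
        return params[index - 1].strip() if index <= len(params) else ""

    return re.sub(r"\$([1-9])", substitute, command)
```

With the gpu.temp definition above, the key gpu.temp[0] yields a command ending in "-i 0". Requesting gpu.temp without an index leaves "-i" with no value, which is why every per-GPU key needs an index.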
Ensure the Hostname parameter in zabbix_agentd.conf matches the host name configured in your Zabbix web console:
Hostname=192.168.51.93
Step 2: Update System Environment Variables
Add the directory containing nvidia-smi.exe to the system PATH environment variable so that the Zabbix Agent can execute it:
C:\Program Files\NVIDIA Corporation\NVSMI\
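A quick way to confirm the PATH change took effect is to look the executable up the same way a process would. The sketch below uses Python's standard shutil.which. The function name find_tool is hypothetical, and the check reflects the PATH of the process running it, which may differ from the PATH seen by the agent service until that service is restarted.

```python
import shutil


def find_tool(name="nvidia-smi", search_path=None):
    """Return the full path of an executable, or None if it is not found.

    search_path=None means "use the current process's PATH".
    """
    return shutil.which(name, path=search_path)
```

Calling find_tool() from a newly opened command prompt should return the full path to nvidia-smi.exe once the directory above is on PATH, and None otherwise.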
Step 3: Create the GPU Discovery Script
Create a new file named get_gpus_info.bat in C:\scripts\ with the following content to enable Zabbix to discover GPUs:
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
rem Print Zabbix low-level discovery JSON with one entry per GPU listed by nvidia-smi -L
echo {
echo "data":[
SET count=1
FOR /F "tokens=* USEBACKQ" %%F IN (`"nvidia-smi.exe" -L`) DO (
    rem Separate entries with a comma from the second GPU onwards
    if !count! GTR 1 echo ,
    SET line=%%F
    SET var!count!=%%F
    SET /a count=!count!+1
    rem The text before the first colon is GPU followed by the index
    for /f "tokens=1 delims=:" %%A in ('echo %%F') do (
        for /f "tokens=2 delims= " %%B in ('echo %%A') do (
            echo|set /p={"{#GPUINDEX}":"
            echo|set /p=%%B", "
        )
    )
    rem The text after the second colon is the UUID followed by a closing parenthesis
    for /f "tokens=3 delims=:" %%A in ('echo %%F') do (
        echo|set /p={#GPUUUID}":"
        for /f "tokens=1 delims= " %%B in ('echo %%A') do (
            for /f "tokens=1 delims=)" %%C in ('echo %%B') do (
                echo|set /p=%%C"}
            )
        )
    )
)
echo.
echo ]
echo }
ENDLOCAL
This script prints one JSON entry per GPU, containing the {#GPUINDEX} and {#GPUUUID} macros, in the low-level discovery format that Zabbix uses to create the per-GPU items.
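For readers who find the batch parsing hard to follow, the same transformation can be sketched in Python. This is an illustrative equivalent rather than a replacement for the script. It assumes each line of nvidia-smi -L looks like "GPU 0: <name> (UUID: GPU-...)", and the function name build_discovery is hypothetical.

```python
import json
import re

# Assumed line shape: "GPU <index>: <name> (UUID: <uuid>)"
GPU_LINE = re.compile(r"^GPU (\d+): .*\(UUID: ([^)]+)\)$")


def build_discovery(listing):
    """Convert the text output of `nvidia-smi -L` into Zabbix LLD JSON."""
    entries = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            entries.append({
                "{#GPUINDEX}": match.group(1),
                "{#GPUUUID}": match.group(2),
            })
    # Zabbix expects the discovered entries under a top-level "data" array.
    return json.dumps({"data": entries})
```

Passing the captured output of nvidia-smi -L to build_discovery returns a JSON string with one object per GPU. Lines that do not match the assumed shape are skipped.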
Step 4: Restart the Zabbix Agent Service
Apply the changes by restarting the Zabbix Agent service on your host.
Step 5: Add the Template in Zabbix Console
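In the Zabbix console, import the template below and link Template Nvidia GPUs Performance to your host to begin monitoring. Note that the template also defines encoder and decoder utilization items (gpu.utilization.enc.* and gpu.utilization.dec.*) for which Step 1 defines no UserParameter. Those items will show as not supported unless you add matching UserParameters to the agent configuration.
Template content: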
<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
<version>5.0</version>
<date>2021-09-08T15:47:49Z</date>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<templates>
<template>
<template>Template Nvidia GPUs Performance</template>
<name>Template Nvidia GPUs Performance</name>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<items>
<item>
<name>Number of GPUs</name>
<key>gpu.number</key>
<delay>30</delay>
<value_type>FLOAT</value_type>
<description>The number of GPUs present on this system.</description>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item>
</items>
<discovery_rules>
<discovery_rule>
<name>GPU discovery</name>
<key>gpu.discovery</key>
<delay>600</delay>
<description>Discovery of graphics cards.</description>
<item_prototypes>
<item_prototype>
<name>GPU [{#GPUINDEX}] Fan Speed</name>
<key>gpu.fanspeed[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<preprocessing>
<step>
<type>MULTIPLIER</type>
<params>1</params>
</step>
</preprocessing>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Memory Free</name>
<key>gpu.memfree[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>B</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<preprocessing>
<step>
<type>MULTIPLIER</type>
<params>1048576</params>
</step>
</preprocessing>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Memory Total</name>
<key>gpu.memtotal[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>B</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<preprocessing>
<step>
<type>MULTIPLIER</type>
<params>1048576</params>
</step>
</preprocessing>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Memory Used</name>
<key>gpu.memused[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>B</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<preprocessing>
<step>
<type>MULTIPLIER</type>
<params>1048576</params>
</step>
</preprocessing>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Power in decawatts</name>
<key>gpu.power[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<value_type>FLOAT</value_type>
<units>daW</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<preprocessing>
<step>
<type>MULTIPLIER</type>
<params>0.1</params>
</step>
</preprocessing>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Temperature</name>
<key>gpu.temp[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<value_type>FLOAT</value_type>
<units>C</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
<trigger_prototypes>
<trigger_prototype>
<expression>{last()}>80</expression>
<name>GPU {#GPUINDEX} Temperature is extremely high</name>
<priority>DISASTER</priority>
<description>A GPU's temperature is getting extremely high!</description>
</trigger_prototype>
<trigger_prototype>
<expression>{last()}>70</expression>
<name>GPU {#GPUINDEX} Temperature is high</name>
<priority>WARNING</priority>
<description>A GPU's temperature is getting high!</description>
<dependencies>
<dependency>
<name>GPU {#GPUINDEX} Temperature is very high</name>
<expression>{Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>75</expression>
</dependency>
</dependencies>
</trigger_prototype>
<trigger_prototype>
<expression>{last()}>75</expression>
<name>GPU {#GPUINDEX} Temperature is very high</name>
<priority>HIGH</priority>
<description>A GPU's temperature is getting very high!</description>
<dependencies>
<dependency>
<name>GPU {#GPUINDEX} Temperature is extremely high</name>
<expression>{Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>80</expression>
</dependency>
</dependencies>
</trigger_prototype>
</trigger_prototypes>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Decoder Utilization Max</name>
<key>gpu.utilization.dec.max[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Decoder Utilization Min</name>
<key>gpu.utilization.dec.min[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Encoder Utilization Max</name>
<key>gpu.utilization.enc.max[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Encoder Utilization Min</name>
<key>gpu.utilization.enc.min[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item_prototype>
<item_prototype>
<name>GPU [{#GPUINDEX}] Utilization</name>
<key>gpu.utilization[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<units>%</units>
<applications>
<application>
<name>Nvidia</name>
</application>
</applications>
</item_prototype>
</item_prototypes>
<graph_prototypes>
<graph_prototype>
<name>GPU {#GPUINDEX} Encoder/Decoder Utilization</name>
<graph_items>
<graph_item>
<sortorder>1</sortorder>
<drawtype>BOLD_LINE</drawtype>
<color>1A7C11</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.utilization.dec.max[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>2</sortorder>
<color>00FF00</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.utilization.dec.min[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>3</sortorder>
<drawtype>BOLD_LINE</drawtype>
<color>BF00FF</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.utilization.enc.max[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>4</sortorder>
<color>311B92</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.utilization.enc.min[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
<graph_prototype>
<name>GPU {#GPUINDEX} Memory</name>
<graph_items>
<graph_item>
<color>00AA00</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.memfree[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>1</sortorder>
<color>0000DD</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.memused[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
<graph_prototype>
<name>GPU {#GPUINDEX} Temperature, Fan Speed and Power</name>
<graph_items>
<graph_item>
<color>1A7C11</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.power[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>1</sortorder>
<color>2774A4</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.fanspeed[{#GPUINDEX}]</key>
</item>
</graph_item>
<graph_item>
<sortorder>2</sortorder>
<color>F63100</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.temp[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
<graph_prototype>
<name>GPU {#GPUINDEX} Utilization</name>
<graph_items>
<graph_item>
<color>2774A4</color>
<item>
<host>Template Nvidia GPUs Performance</host>
<key>gpu.utilization[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
</graph_prototypes>
</discovery_rule>
</discovery_rules>
</template>
</templates>
</zabbix_export>
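Every item key in the template needs a matching UserParameter on the agent, so a quick cross-check before importing can be useful. The Python sketch below compares the keys in a template export with the UserParameter lines in an agent configuration. It is a minimal illustration: the function names are hypothetical, and keys are compared by name only, ignoring their bracketed parameters.

```python
import xml.etree.ElementTree as ET


def base_key(key):
    """Strip the [parameters] part: 'gpu.temp[{#GPUINDEX}]' -> 'gpu.temp'."""
    return key.split("[", 1)[0]


def template_keys(template_xml):
    """Collect the base names of all <key> elements in a template export."""
    root = ET.fromstring(template_xml)
    return {base_key(node.text) for node in root.iter("key") if node.text}


def agent_keys(conf_text):
    """Collect the base names of all UserParameter keys in an agent config."""
    keys = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("UserParameter="):
            key_spec = line.split("=", 1)[1].split(",", 1)[0]
            keys.add(base_key(key_spec))
    return keys


def missing_keys(template_xml, conf_text):
    """Return template keys that have no UserParameter, sorted by name."""
    return sorted(template_keys(template_xml) - agent_keys(conf_text))
```

Given the template text and the contents of zabbix_agentd.conf, missing_keys returns the keys that the agent would report as not supported. An empty list means every template key has a definition.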
FAQ: Troubleshooting the “Missing -i Argument” Error
Q: What should I do if I receive a “Missing value for -i argument” error when using zabbix_get?
A: This error occurs because the -i argument, which specifies the GPU ID, has no value: the item key was requested without a GPU index. In the Zabbix setup, the index is supplied automatically after the template is imported and GPUs are discovered. For manual testing with zabbix_get, you need to specify the GPU ID in the key, like so:
zabbix_get -s <zabbixagentip> -k gpu.memused[0]
Replace <zabbixagentip> with your Zabbix agent’s IP address and 0 with the GPU ID you wish to query.
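When scripting such manual checks, it can help to build the command as an argument list so that no shell interprets the square brackets in the key. The sketch below is one way to do that in Python. The function names are hypothetical, zabbix_get is assumed to be on PATH, and 10050 is the default agent port.

```python
import subprocess


def zabbix_get_command(agent_ip, key, gpu_id, port=10050):
    """Build the zabbix_get argument list for a per-GPU item key."""
    return ["zabbix_get", "-s", agent_ip, "-p", str(port),
            "-k", "{}[{}]".format(key, gpu_id)]


def query_gpu(agent_ip, key, gpu_id, port=10050):
    """Run zabbix_get and return the agent's reply as a stripped string."""
    result = subprocess.run(
        zabbix_get_command(agent_ip, key, gpu_id, port),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

For example, query_gpu("<zabbixagentip>", "gpu.memused", 0) runs the same request as the command above, with the placeholder replaced by your agent's address.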