How to Configure Zabbix for NVIDIA GPU Monitoring: A Step-by-Step Guide with FAQ

Configuring Zabbix for NVIDIA GPU Monitoring

Step 1: Locate and Edit the Zabbix Configuration File

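Open the Zabbix Agent configuration file located at C:\Program Files\Zabbix Agent\zabbix_agentd.conf and append the following lines to define new UserParameters for GPU monitoring. Keys that end in [*] accept a parameter; the agent substitutes the first parameter, here the GPU index, for $1 in the command: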

UserParameter=gpu.number,"nvidia-smi.exe" -L | find /c /v ""
UserParameter=gpu.discovery,C:\scripts\get_gpus_info.bat
UserParameter=gpu.fanspeed[*],"nvidia-smi.exe" --query-gpu=fan.speed --format=csv,noheader,nounits -i $1
UserParameter=gpu.power[*],"nvidia-smi.exe" --query-gpu=power.draw --format=csv,noheader,nounits -i $1
UserParameter=gpu.temp[*],"nvidia-smi.exe" --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.utilization[*],"nvidia-smi.exe" --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1
UserParameter=gpu.memfree[*],"nvidia-smi.exe" --query-gpu=memory.free --format=csv,noheader,nounits -i $1
UserParameter=gpu.memused[*],"nvidia-smi.exe" --query-gpu=memory.used --format=csv,noheader,nounits -i $1
UserParameter=gpu.memtotal[*],"nvidia-smi.exe" --query-gpu=memory.total --format=csv,noheader,nounits -i $1
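
To make the [*] and $1 mechanics concrete, here is a minimal Python sketch that mimics the substitution. It is an illustration only: the function name expand_user_parameter is hypothetical, and the sketch handles plain comma-separated parameters rather than the agent's full quoting rules.

```python
import re

def expand_user_parameter(command_template, item_key):
    """Replace $1..$9 in a UserParameter command with the key's parameters."""
    # Split "gpu.temp[0]" into the key name and its bracketed parameter list.
    match = re.fullmatch(r"[^\[\]]+\[(.*)\]", item_key)
    params = [p.strip() for p in match.group(1).split(",")] if match else []

    def substitute(m):
        index = int(m.group(1)) - 1
        # A missing parameter expands to an empty string.
        return params[index] if index < len(params) else ""

    return re.sub(r"\$([1-9])", substitute, command_template)

template = '"nvidia-smi.exe" --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1'
print(expand_user_parameter(template, "gpu.temp[0]"))  # command ends with "-i 0"
print(expand_user_parameter(template, "gpu.temp"))     # command ends with a bare "-i "
```

The second call shows what happens when the key is requested without an index: nvidia-smi receives -i with no value.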

Ensure that the Hostname value in zabbix_agentd.conf exactly matches the host name configured for this machine in the Zabbix web console. For example:

Hostname=192.168.51.93

Step 2: Update System Environment Variables

Add the folder that contains nvidia-smi.exe to the system PATH environment variable so that the Zabbix Agent service can execute it. Adjust the folder if your driver installed nvidia-smi.exe elsewhere:

C:\Program Files\NVIDIA Corporation\NVSMI\
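
A quick way to confirm the PATH change is to ask the operating system where it resolves the executable. The sketch below is a hedged example; find_nvidia_smi is a hypothetical helper name. It checks the environment of the user who runs it, which may differ from the account the Zabbix Agent service uses.

```python
import shutil

def find_nvidia_smi(executable="nvidia-smi"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)

location = find_nvidia_smi()
if location is None:
    print("nvidia-smi not found on PATH; add its folder and restart the agent")
else:
    print(f"nvidia-smi found at {location}")
```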

Step 3: Create the GPU Discovery Script

Create a new file named get_gpus_info.bat in C:\scripts\ with the following content to enable Zabbix to discover GPUs:

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION

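rem Open the low-level discovery JSON document.
echo {
echo "data":[

rem count starts at 1 so that a comma is printed before every entry except the first.
SET count=1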
FOR /F "tokens=* USEBACKQ" %%F IN (`"nvidia-smi.exe" -L`) DO (
    if !count! GTR 1 echo ,

    SET line=%%F
    SET var!count!=%%F
    SET /a count=!count!+1
    for /f "tokens=1 delims=:" %%A in ('echo %%F') do (
        for /f "tokens=2 delims= " %%B in ('echo %%A') do (
            echo|set /p={"{#GPUINDEX}":"
            echo|set /p=%%B", "
        )
    )
    for /f "tokens=3 delims=:" %%A in ('echo %%F') do (
        echo|set /p={#GPUUUID}":"
        for /f "tokens=1 delims= " %%B in ('echo %%A') do (
            for /f "tokens=1 delims=)" %%C in ('echo %%B') do (
                echo|set /p=%%C"}
            )
        )
        rem echo|set /p=
    )
)
echo.
echo ]
echo }

ENDLOCAL

The script prints one JSON object per GPU, each containing the {#GPUINDEX} and {#GPUUUID} macros, in the low-level discovery format that Zabbix expects.
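
For readers who find the batch parsing hard to follow, the same transformation can be sketched in Python. This is not a replacement for the script above, only an illustration of the expected output. The function name build_discovery_json is hypothetical, and the sample lines use placeholder model names and UUIDs in the "GPU <index>: <name> (UUID: <uuid>)" layout that nvidia-smi -L prints.

```python
import json
import re

# One line of "nvidia-smi -L" output: GPU <index>: <name> (UUID: <uuid>)
GPU_LINE = re.compile(r"^GPU (\d+): .*\(UUID: ([^)]+)\)\s*$")

def build_discovery_json(listing):
    """Turn the text of "nvidia-smi -L" into Zabbix low-level discovery JSON."""
    entries = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            entries.append({
                "{#GPUINDEX}": match.group(1),
                "{#GPUUUID}": match.group(2),
            })
    return json.dumps({"data": entries})

# Placeholder sample data, not real hardware.
sample = (
    "GPU 0: Example GPU Model A (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: Example GPU Model B (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
)
print(build_discovery_json(sample))
```

Comparing this output with what C:\scripts\get_gpus_info.bat prints on your host is a simple way to check that the batch script produces valid JSON.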

Step 4: Restart the Zabbix Agent Service

Restart the Zabbix Agent service on your host so that it picks up the new UserParameters and the updated PATH.

Step 5: Add the Template in Zabbix Console

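In the Zabbix console, import the template below (Configuration > Templates > Import), then link Template Nvidia GPUs Performance to your host to begin monitoring. Note that the template also defines encoder and decoder utilization item prototypes (gpu.utilization.enc.* and gpu.utilization.dec.*). Step 1 does not define UserParameters for those keys, so these items will be reported as not supported unless you add matching UserParameters.

Template content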

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
    <version>5.0</version>
    <date>2021-09-08T15:47:49Z</date>
    <groups>
        <group>
            <name>Templates</name>
        </group>
    </groups>
    <templates>
        <template>
            <template>Template Nvidia GPUs Performance</template>
            <name>Template Nvidia GPUs Performance</name>
            <groups>
                <group>
                    <name>Templates</name>
                </group>
            </groups>
            <applications>
                <application>
                    <name>Nvidia</name>
                </application>
            </applications>
            <items>
                <item>
                    <name>Number of GPUs</name>
                    <key>gpu.number</key>
                    <delay>30</delay>
                    <value_type>FLOAT</value_type>
                    <description>The number of GPUs present on this system.</description>
                    <applications>
                        <application>
                            <name>Nvidia</name>
                        </application>
                    </applications>
                </item>
            </items>
            <discovery_rules>
                <discovery_rule>
                    <name>GPU discovery</name>
                    <key>gpu.discovery</key>
                    <delay>600</delay>
                    <description>Discovery of graphics cards.</description>
                    <item_prototypes>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Fan Speed</name>
                            <key>gpu.fanspeed[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <preprocessing>
                                <step>
                                    <type>MULTIPLIER</type>
                                    <params>1</params>
                                </step>
                            </preprocessing>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Memory Free</name>
                            <key>gpu.memfree[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>B</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <preprocessing>
                                <step>
                                    <type>MULTIPLIER</type>
                                    <params>1048576</params>
                                </step>
                            </preprocessing>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Memory Total</name>
                            <key>gpu.memtotal[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>B</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <preprocessing>
                                <step>
                                    <type>MULTIPLIER</type>
                                    <params>1048576</params>
                                </step>
                            </preprocessing>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Memory Used</name>
                            <key>gpu.memused[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>B</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <preprocessing>
                                <step>
                                    <type>MULTIPLIER</type>
                                    <params>1048576</params>
                                </step>
                            </preprocessing>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Power in decaWatts</name>
                            <key>gpu.power[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <value_type>FLOAT</value_type>
                            <units>daW</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <preprocessing>
                                <step>
                                    <type>MULTIPLIER</type>
                                    <params>0.1</params>
                                </step>
                            </preprocessing>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Temperature</name>
                            <key>gpu.temp[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <value_type>FLOAT</value_type>
                            <units>C</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                            <trigger_prototypes>
                                <trigger_prototype>
                                    <expression>{last()}&gt;80</expression>
                                    <name>GPU {#GPUINDEX} Temperature is extremely high</name>
                                    <priority>DISASTER</priority>
                                    <description>A GPU's temperature is getting extremely high!</description>
                                </trigger_prototype>
                                <trigger_prototype>
                                    <expression>{last()}&gt;70</expression>
                                    <name>GPU {#GPUINDEX} Temperature is high</name>
                                    <priority>WARNING</priority>
                                    <description>A GPU's temperature is getting high!</description>
                                    <dependencies>
                                        <dependency>
                                            <name>GPU {#GPUINDEX} Temperature is very high</name>
                                            <expression>{Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}&gt;75</expression>
                                        </dependency>
                                    </dependencies>
                                </trigger_prototype>
                                <trigger_prototype>
                                    <expression>{last()}&gt;75</expression>
                                    <name>GPU {#GPUINDEX} Temperature is very high</name>
                                    <priority>HIGH</priority>
                                    <description>A GPU's temperature is getting very high!</description>
                                    <dependencies>
                                        <dependency>
                                            <name>GPU {#GPUINDEX} Temperature is extremely high</name>
                                            <expression>{Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}&gt;80</expression>
                                        </dependency>
                                    </dependencies>
                                </trigger_prototype>
                            </trigger_prototypes>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Decoder Utilization Max</name>
                            <key>gpu.utilization.dec.max[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Decoder Utilization Min</name>
                            <key>gpu.utilization.dec.min[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Encoder Utilization Max</name>
                            <key>gpu.utilization.enc.max[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Encoder Utilization Min</name>
                            <key>gpu.utilization.enc.min[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                        </item_prototype>
                        <item_prototype>
                            <name>GPU [{#GPUINDEX}] Utilization</name>
                            <key>gpu.utilization[{#GPUINDEX}]</key>
                            <delay>60</delay>
                            <history>7d</history>
                            <units>%</units>
                            <applications>
                                <application>
                                    <name>Nvidia</name>
                                </application>
                            </applications>
                        </item_prototype>
                    </item_prototypes>
                    <graph_prototypes>
                        <graph_prototype>
                            <name>GPU {#GPUINDEX} Encoder/Decoder Utilization</name>
                            <graph_items>
                                <graph_item>
                                    <sortorder>1</sortorder>
                                    <drawtype>BOLD_LINE</drawtype>
                                    <color>1A7C11</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.utilization.dec.max[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>2</sortorder>
                                    <color>00FF00</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.utilization.dec.min[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>3</sortorder>
                                    <drawtype>BOLD_LINE</drawtype>
                                    <color>BF00FF</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.utilization.enc.max[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>4</sortorder>
                                    <color>311B92</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.utilization.enc.min[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                            </graph_items>
                        </graph_prototype>
                        <graph_prototype>
                            <name>GPU {#GPUINDEX} Memory</name>
                            <graph_items>
                                <graph_item>
                                    <color>00AA00</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.memfree[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>1</sortorder>
                                    <color>0000DD</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.memused[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                            </graph_items>
                        </graph_prototype>
                        <graph_prototype>
                            <name>GPU {#GPUINDEX} Temperature, Fan Speed and Power</name>
                            <graph_items>
                                <graph_item>
                                    <color>1A7C11</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.power[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>1</sortorder>
                                    <color>2774A4</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.fanspeed[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                                <graph_item>
                                    <sortorder>2</sortorder>
                                    <color>F63100</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.temp[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                            </graph_items>
                        </graph_prototype>
                        <graph_prototype>
                            <name>GPU {#GPUINDEX} Utilization</name>
                            <graph_items>
                                <graph_item>
                                    <color>2774A4</color>
                                    <item>
                                        <host>Template Nvidia GPUs Performance</host>
                                        <key>gpu.utilization[{#GPUINDEX}]</key>
                                    </item>
                                </graph_item>
                            </graph_items>
                        </graph_prototype>
                    </graph_prototypes>
                </discovery_rule>
            </discovery_rules>
        </template>
    </templates>
</zabbix_export>
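
Before linking the template, it can help to check that every item key in it has a matching UserParameter on the agent. The Python sketch below shows one way to do that. The helper names template_keys and agent_keys are hypothetical, and the embedded XML and configuration are shortened samples rather than the full files.

```python
import xml.etree.ElementTree as ET

def template_keys(template_xml):
    """Collect base key names (the text before '[') from every <item> and
    <item_prototype> element, including item references inside graphs."""
    root = ET.fromstring(template_xml.strip())
    keys = set()
    for tag in ("item", "item_prototype"):
        for node in root.iter(tag):
            key = node.findtext("key")
            if key:
                keys.add(key.split("[", 1)[0])
    return keys

def agent_keys(agent_conf):
    """Collect base key names from UserParameter lines in an agent config."""
    keys = set()
    for line in agent_conf.splitlines():
        line = line.strip()
        if line.startswith("UserParameter="):
            key = line[len("UserParameter="):].split(",", 1)[0]
            keys.add(key.split("[", 1)[0])
    return keys

# Shortened samples for illustration only.
sample_template = '''
<zabbix_export>
  <templates><template>
    <items><item><key>gpu.number</key></item></items>
    <discovery_rules><discovery_rule>
      <key>gpu.discovery</key>
      <item_prototypes>
        <item_prototype><key>gpu.temp[{#GPUINDEX}]</key></item_prototype>
        <item_prototype><key>gpu.utilization.enc.max[{#GPUINDEX}]</key></item_prototype>
      </item_prototypes>
    </discovery_rule></discovery_rules>
  </template></templates>
</zabbix_export>
'''
sample_conf = '''
UserParameter=gpu.number,"nvidia-smi.exe" -L | find /c /v ""
UserParameter=gpu.temp[*],"nvidia-smi.exe" --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1
'''

missing = sorted(template_keys(sample_template) - agent_keys(sample_conf))
print(missing)  # keys used by the template but not defined on the agent
```

Any key this reports for your own files would show up as a not supported item after the template is linked.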

FAQ: Troubleshooting the “Missing -i Argument” Error

Q: What should I do if I receive a “Missing value for -i argument” error when using zabbix_get?

A: This error means nvidia-smi was called with an empty -i value. The item key was requested without a GPU index, so $1 in the UserParameter expanded to nothing. In normal operation Zabbix supplies the index automatically once the template is imported and the GPUs are discovered. For manual testing with zabbix_get, pass the GPU index in square brackets:

zabbix_get -s <zabbixagentip> -k "gpu.memused[0]"

Replace <zabbixagentip> with your Zabbix agent’s IP address and 0 with the GPU ID you wish to query.
