Grafana

Grafana is a cross-platform, open-source data-visualization web application. After a user configures the connected data sources, Grafana can render charts and alerts for that data in a web browser. The enterprise edition offers additional features. Extensions are delivered as plugins, and end users can customize their own dashboards and the way data is requested. Grafana is widely used, including by the Wikipedia project.

Grafana Installation

With Docker
# Grafana Install
docker run -d --name=grafana -p 3001:3000 -v grafana_config:/etc/grafana -v grafana_data:/var/lib/grafana \
-v grafana_logs:/var/log/grafana grafana/grafana

# MySQL Install
docker container run -d --name mysqldb -p 3306:3306 -v db_data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=password mysql:latest
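
To verify both containers are up, Grafana exposes a health endpoint (mapped to port 3001 above); a quick check, assuming default settings:

# Grafana health check (returns a small JSON status document)
curl http://localhost:3001/api/health

# Confirm the MySQL container is running
docker container ls --filter name=mysqldb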

Learning Grafana

Installation
Dashboard Visualizing

MySQL Monitoring

vSphere ESXi Monitoring

Grafana Tutorials
Telegraf
Amazon Cloudwatch
Loki + Promtail
Plugins

Hourly Heatmap


InfluxDB

Configuration

/etc/influxdb/influxdb.conf

[http]
  auth-enabled = true
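
Authentication takes effect only after a restart, and an admin user must exist first (see Management below); a minimal sketch:

# Restart InfluxDB to apply the setting
systemctl restart influxdb

# Subsequent CLI sessions must authenticate
influx -username admin -password 'adminpass'
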
Schema Design

official tutorial: InfluxDB schema design

Data Elements

Where to store data: tag or field (see the line protocol example below)

tag value

field value
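
In line protocol terms, tags are indexed key/value strings (good for WHERE and GROUP BY), while fields carry the actual values and are not indexed. The standard example from the InfluxDB docs:

# measurement: weather; tag: location; field: temperature; timestamp in ns
weather,location=us-midwest temperature=82 1465839830100400200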


Management

Create a new user and database

influx
> CREATE USER admin WITH PASSWORD 'adminpass' WITH ALL PRIVILEGES
> exit

influx -username admin -password adminpass
> set password for admin = 'newpass'
>
> create database mmap_nmon with duration 180d
> create user mon with password 'thisispassword'
> grant ALL on mmap_nmon to mon
> show GRANTS for mon

Login to InfluxDB

# In the shell
influx -username admin -password thisispass

# In the InfluxDB CLI
> auth
username:
password:

InfluxQL
> show databases
name: databases
name
----
_internal
nmon_reports
nmon2influxdb_log

> show users
user  admin
----  -----
admin true
mon   false

> use nmon_reports
Using database nmon_reports
> show measurements
name: measurements
name
----
CPU_ALL
DISKAVGRIO
DISKAVGWIO
DISKBSIZE
DISKBUSY
DISKREAD
DISKREADSERV
DISKRIO
DISKRXFER
DISKSERV
...

Retention Policy

# for current database
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 8760h0m0s 168h0m0s           1        true

# for specified database
> show retention policies on nmon2influxdb_log
name          duration shardGroupDuration replicaN default
----          -------- ------------------ -------- -------
autogen       0s       168h0m0s           1        false
log_retention 48h0m0s  24h0m0s            1        true

# Create a policy
> CREATE RETENTION POLICY "one_day_only" ON "NOAA_water_database" DURATION 1d REPLICATION 1

# Alter the policy
> ALTER RETENTION POLICY "what_is_time" ON "NOAA_water_database" DURATION 3w SHARD DURATION 2h DEFAULT

# Delete a policy
> DROP RETENTION POLICY "what_is_time" ON "NOAA_water_database"

Verify the account

curl -G http://localhost:8086/query -u mon:thisispassword --data-urlencode "q=SHOW DATABASES"
{"results":[{"statement_id":0,"series":[{"name":"databases","columns":["name"],"values":[["mmap_nmon"]]}]}]}

Query measurement information

SHOW MEASUREMENTS  -- list the measurements in the current database
SHOW FIELD KEYS  -- list the field keys of every measurement in the current database
SHOW SERIES FROM pay  -- list the series of the measurement pay
SHOW TAG KEYS FROM "pay"  -- list the tag keys of pay
SHOW TAG VALUES FROM "pay" WITH KEY = "merId"  -- list the values of the tag merId
SHOW TAG VALUES FROM cpu WITH KEY IN ("region", "host") WHERE service = 'redis'
DROP SERIES FROM <measurement> WHERE <tag-key> = '<tag-value>'  -- delete the matching series
SHOW CONTINUOUS QUERIES  -- list continuous queries
SHOW QUERIES  -- list currently running queries
KILL QUERY <qid>  -- terminate a running query by its id
SHOW RETENTION POLICIES ON mydb  -- list the retention policies of mydb
SHOW SERIES CARDINALITY ON mydb  -- total number of series in the database

Query data

SELECT * FROM /.*/ LIMIT 1  -- first point of every measurement in the current database
select * from pay order by time desc limit 2
select * from db_name."rp_name".measurement_name  -- fully qualified form: database, retention policy, measurement

Delete data

delete from "query"  -- delete all points from the measurement
drop MEASUREMENT "query"  -- drop the measurement itself; unlike DELETE, this also removes its series from the index
DELETE FROM cpu
DELETE FROM cpu WHERE time < '2000-01-01T00:00:00Z'
DELETE WHERE time < '2000-01-01T00:00:00Z'
DROP DATABASE "testDB"  -- drop the database
DROP RETENTION POLICY "dbbak" ON mydb  -- drop the retention policy dbbak and the data it retains
DROP SERIES from pay where tag_key=''  -- delete the series of pay matching the tag

Using functions

select * from pay order by time desc limit 2
-- InfluxQL has no today() or time_zone(); use a relative time range plus the tz() clause
select mean(allTime) from pay where time > now() - 1d group by time(10m) tz('Asia/Taipei')
select * from pay limit 2 tz('Asia/Taipei')
SELECT sum(allTime) FROM "pay" WHERE time > now() - 10s
select count(allTime) from pay where time > now() - 10m group by time(1s)

select difference("commit_sql") from "snapdb2" where time > now() - 1h limit 10
select non_negative_difference(*) from "snapdb2" where time > now() - 1h limit 10
select non_negative_difference(/commit_sql|rollback_sql/) from "snapdb2" where time > now() - 1h limit 10

User and Privileges

> CREATE USER "todd" WITH PASSWORD '123456'

> SHOW GRANTS for "todd"

> SET PASSWORD FOR "todd" = 'newpassword'

> GRANT READ ON "NOAA_water_database" TO "todd"
> GRANT ALL ON "NOAA_water_database" TO "todd"

> REVOKE ALL PRIVILEGES FROM "todd"
> REVOKE ALL ON "NOAA_water_database" FROM "todd"
> REVOKE WRITE ON "NOAA_water_database" FROM "todd"

> DROP USER "todd"

Convert timestamp to normal datetime

influx -precision rfc3339
# Or, in CLI
> precision rfc3339
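
With rfc3339 precision, epoch timestamps in query results are rendered as UTC datetimes, for example:

# epoch (ns)                 rfc3339
# 1465839830100400200   ->   2016-06-13T17:43:50.1004002Z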

Convert InfluxData to csv format

# With CLI
# The -precision option is also useful for setting the timestamp format.
influx -database 'database_name' -execute "SELECT * FROM table_name" -format csv > test.csv

influx -username your_user_if_any -password "secret!" -database 'db_name' -host 'localhost' -execute 'SELECT * FROM "db_name"."your_retention_policy_or_nothing"."your_measurement_name" WHERE time > '\''2017-09-05'\'' and time < '\''2017-09-05T23:59:59Z'\'' AND other_conditions_if_required' -format 'csv' > /tmp/your_measurement_name_20170905.csv

# With HTTP-API
# Samples:
# "q=SELECT * FROM \"mymeasurement\" where time > now() - 130d"
# "q=SELECT * FROM \"mymeasurement\" where (time < now() - 130d) and  (time > now() - 260d)"
curl -G 'http://localhost:8086/query' --data-urlencode "db=mydb" --data-urlencode "epoch=#timeunit" --data-urlencode "q=SELECT * FROM \"mymeasurement\" " -H "Accept: application/csv" >  mytargetcsv.csv

Tag Query

> show tag values from {measurement} with key={key}
> show tag values from {measurement} with key={key} where {tag-key}={tag-value}


Services

systemctl start influxdb
systemctl enable influxdb

Learning


vSphere Monitoring

With Telegraf + InfluxDB

Install Telegraf

Download: https://portal.influxdata.com/downloads/

yum localinstall telegraf-1.18.3-1.x86_64.rpm

Configure Telegraf

vi /etc/telegraf/telegraf.conf

[agent]
...
    logfile = "/var/log/telegraf/telegraf.log"
...

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
    urls = ["http://10.10.2.209:8086"]
    database = "vmware"
    timeout = "0s"
    username = "admin"
    password = "dba4mis"
    retention_policy = "200d"
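    ## NOTE: retention_policy must be the name of an existing retention policy
    ## on the target database, not a duration; "200d" only works if a policy
    ## with that name has been created. An empty value writes to the default
    ## policy (this guide later alters autogen to a 200-day duration instead).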


# Read metrics from VMware vCenter
 [[inputs.vsphere]]
 ## List of vCenter URLs to be monitored. These three lines must be uncommented
 ## and edited for the plugin to work.
 vcenters = [ "https://10.10.1.2/sdk" ]
    username = "administrator@vsphere.local"
    password = "AdminPassword"
 #
 ## VMs
 ## Typical VM metrics (if omitted or empty, all metrics are collected)
 vm_metric_include = [
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.run.summation",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.wait.summation",
      "mem.active.average",
      "mem.granted.average",
      "mem.latency.average",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.usage.average",
      "power.power.average",
      "virtualDisk.numberReadAveraged.average",
      "virtualDisk.numberWriteAveraged.average",
      "virtualDisk.read.average",
      "virtualDisk.readOIO.latest",
      "virtualDisk.throughput.usage.average",
      "virtualDisk.totalReadLatency.average",
      "virtualDisk.totalWriteLatency.average",
      "virtualDisk.write.average",
      "virtualDisk.writeOIO.latest",
      "sys.uptime.latest",
    ]
 # vm_metric_exclude = [] ## Nothing is excluded by default
 # vm_instances = true ## true by default
 #
 ## Hosts
 ## Typical host metrics (if omitted or empty, all metrics are collected)
 host_metric_include = [
      "cpu.coreUtilization.average",
      "cpu.costop.summation",
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.swapwait.summation",
      "cpu.usage.average",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.utilization.average",
      "cpu.wait.summation",
      "disk.deviceReadLatency.average",
      "disk.deviceWriteLatency.average",
      "disk.kernelReadLatency.average",
      "disk.kernelWriteLatency.average",
      "disk.numberReadAveraged.average",
      "disk.numberWriteAveraged.average",
      "disk.read.average",
      "disk.totalReadLatency.average",
      "disk.totalWriteLatency.average",
      "disk.write.average",
      "mem.active.average",
      "mem.latency.average",
      "mem.state.latest",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.totalCapacity.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.errorsRx.summation",
      "net.errorsTx.summation",
      "net.usage.average",
      "power.power.average",
      "storageAdapter.numberReadAveraged.average",
      "storageAdapter.numberWriteAveraged.average",
      "storageAdapter.read.average",
      "storageAdapter.write.average",
      "sys.uptime.latest",
    ]
 # host_metric_exclude = [] ## Nothing excluded by default
 # host_instances = true ## true by default
 #
 ## Clusters
 cluster_metric_include = [] ## if omitted or empty, all metrics are collected
 # cluster_metric_exclude = [] ## Nothing excluded by default
 # cluster_instances = false ## false by default
 #
 ## Datastores
 datastore_metric_include = [] ## if omitted or empty, all metrics are collected
 # datastore_metric_exclude = [] ## Nothing excluded by default
 # datastore_instances = false ## false by default for Datastores only
 #
 ## Datacenters
 datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
# datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
 # datacenter_instances = false ## false by default for Datastores only
 #
 ## Plugin Settings
 ## separator character to use for measurement and field names (default: "_")
 # separator = "_"
 #
 ## number of objects to retrieve per query for realtime resources (vms and hosts)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_objects = 256
 #
 ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_metrics = 256
 #
 ## number of go routines to use for collection and discovery of objects and metrics
 # collect_concurrency = 1
 # discover_concurrency = 1
 #
 ## whether or not to force discovery of new objects on initial gather call before collecting metrics
 ## when true for large environments this may cause errors for time elapsed while collecting metrics
 ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
 # force_discover_on_init = false
 #
 ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
 # object_discovery_interval = "300s"
 #
 ## timeout applies to any of the api request made to vcenter
 # timeout = "60s"
 #
 ## Optional SSL Config
 # ssl_ca = "/path/to/cafile"
 # ssl_cert = "/path/to/certfile"
 # ssl_key = "/path/to/keyfile"
 ## Use SSL but skip chain & host verification
 insecure_skip_verify = true
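
Before starting the service, a one-shot run helps confirm the vCenter credentials and metric selection; a sketch:

# Gather once and print metrics to stdout instead of writing to InfluxDB
telegraf --config /etc/telegraf/telegraf.conf --input-filter vsphere --test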

Start Telegraf

systemctl start telegraf
systemctl enable telegraf

Configure InfluxDB

Set the retention policy

[root@mm-mon ~]# influx -username admin -password dba4mis
Connected to http://localhost:8086 version 1.8.5
InfluxDB shell version: 1.8.5
> show retention policies on vmware
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true
> alter retention policy "autogen" on "vmware" duration 200d shard duration 1d
> show retention policies on vmware
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 4800h0m0s 24h0m0s            1        true

Configure Grafana
  1. Add a datasource for InfluxDB
    • Name: VMware
    • Type: InfluxDB
    • Database: vmware
    • Username: <InfluxDB Credential>
    • Password: <InfluxDB Credential>
  2. Import the dashboards
    1. https://grafana.com/grafana/dashboards/8159
    2. https://grafana.com/grafana/dashboards/8165
    3. https://grafana.com/grafana/dashboards/8168
    4. https://grafana.com/grafana/dashboards/8162

With SexiGraf

Download the OVA appliance
vCenter/vSphere credential for monitoring only

vCenter Web Client > Menu > Administration > Single Sign On: Users and Groups > Add

vCenter Web Client > Menu > Hosts and Clusters > Permissions > Add Permission

Deploy the OVA to vCenter/ESXi

Deploying to ESXi 6.5 fails with this error message:

Line 163: Unable to parse 'tools.syncTime' for attribute 'key' on element 'Config'.

Solution: unpack the OVA with OVF Tool first, then edit the OVF file content:

# Before
<vmw:Config ovf:required="true"  vmw:key="tools.syncTime" vmw:value="true"/>

# After
<vmw:Config ovf:required="false"  vmw:key="tools.syncTime" vmw:value="true"/>

Save the file and deploy again.
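
The unpack itself can be done with ovftool; a sketch, with file names assumed:

# Convert the OVA into an OVF plus its disk files
ovftool SexiGraf.ova SexiGraf.ovf

# Editing the .ovf invalidates the checksum recorded in the generated .mf
# manifest; remove the manifest so the mismatch does not block deployment
rm SexiGraf.mf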


First Start of the VM

1. SSH Credential: root / Sex!Gr@f

2. The IP address must be configured manually; edit /etc/network/interfaces (a static-address sketch follows).
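
A minimal static-address sketch (interface name and all addresses are placeholders):

# /etc/network/interfaces
auto eth0
iface eth0 inet static
    address 192.168.21.100
    netmask 255.255.255.0
    gateway 192.168.21.1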

3. Configure the hostname

hostnamectl set-hostname esx-mon

4. Configure the timezone and time server

timedatectl set-timezone Asia/Taipei

vi /etc/ntp.conf

#pool 0.ubuntu.pool.ntp.org iburst
#pool 1.ubuntu.pool.ntp.org iburst
#pool 2.ubuntu.pool.ntp.org iburst
#pool 3.ubuntu.pool.ntp.org iburst

# Use Ubuntu's ntp server as a fallback.
#pool ntp.ubuntu.com

# Added the local time server
server 192.168.21.86 prefer iburst

Restart the ntpd

systemctl stop ntp
systemctl start ntp

# Check the timeserver
ntpq -p


First Login to the Grafana Web UI
  1. Login: admin / Sex!Gr@f
  2. Add the credential for connecting to the managed vCenter server: Search > SexiGraf > SexiGraf Web Admin > Credential Store
    • vCenter IP: <vCenter/ESXi IP or FQDN>
    • Username: <Username to login to vCenter/ESXi>
    • Password: <Password to login to vCenter/ESXi>


Telegraf

Custom Shell

Sample #1

#!/bin/bash

# Map block device names to their mount points
devname=(`lsblk | grep 'disk' | awk '{print $1}'`)
dirname=(`lsblk | grep 'disk' | awk '{if ($7=="") print "/"; else print $7}'`)
# Storing these in an associative array ran into shell quoting trouble with
# [ ] { } * @ $, so plain eval'ed variables are used instead
#declare -A devdict
devnum=`expr ${#devname[@]} - 1`
for i in `seq 0 $devnum`; do
  if [ -z "${dirname[$i]}" ]; then
    eval ${devname[$i]}="/"
  else
    eval ${devname[$i]}="${dirname[$i]}"
  fi
  #devdict+=([${devname[$i]}]="${dirname[$i]}")
done
#echo ${!devdict[*]}
#echo ${devdict[*]}

# Build one line per device: ${devname} expands to its mount point via the
# eval above, and @ stands in for the tag/field separator space
ioarry=`iostat -x | grep sd | awk '{print "datadir=${"$1"}@r="$4",w="$5",await="$10",svctm="$11",util="$12}'`
for i in ${ioarry[@]}; do
  eval temp="${i}"
  # Replace the @ placeholder with a space; a literal space in the awk output
  # would have been word-split into two elements above
  temp=${temp/@/ }
  echo "exec,${temp}"
  # The final output must be line protocol: measurement name, comma, tag set
  # (comma separated), space, field set (comma separated). If the inputs.exec
  # plugin has name_suffix configured, the suffix is appended automatically.
  # Any format mismatch makes telegraf fail to parse the point, and nothing
  # reaches InfluxDB.
  #exec,datadir=/data/data11 r=4.1,w=6.1,await=0.83,svctm=1.35,util=1.46
done

[[inputs.exec]]
  ## Commands array
  commands = ["bash /appcom/telegraf/collect_iostat.sh"]
  timeout = "5s"
  ## Suffix for measurements
  name_suffix = "_collectiostat"
  data_format = "influx"

Sample #2

#!/bin/sh
hostname=`hostname`
uptime=`awk '{print $1}' /proc/uptime`
# Field positions in uptime output differ depending on whether a user count is shown
if uptime | grep -q user; then
  load1=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $7}'`
  load5=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $8}'`
  load15=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $9}'`
else
  load1=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $5}'`
  load5=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $6}'`
  load15=`uptime | grep -ohe 'up .*' | sed 's/,//g' | awk '{ print $7}'`
fi
echo "uptime,host=$hostname uptime=$uptime,load1=$load1,load5=$load5,load15=$load15"

[agent]
  interval = "5s"
  round_interval = true

[[inputs.swap]]
  [inputs.swap.tags]
    metrics_source = "telegraf_demo"

[[inputs.exec]]
  commands = ["/etc/telegraf/uptime.sh"]
  data_format = "influx"
  [inputs.exec.tags]
    metrics_source = "telegraf_demo"

[[outputs.influxdb]]
  urls = ["https://influxdemo:8086"]
  database = "telegraf"

Sample #3

#!/bin/bash
# Convert download/upload bandwidth from bytes per second to Mbps
/usr/bin/speedtest --format json | jq '.download.bandwidth = .download.bandwidth / 125000 | .upload.bandwidth = .upload.bandwidth / 125000'

[[inputs.exec]]
  commands = [
    "/home/rock64/speedtest.sh"
    ]
  interval = "300s"
  timeout = "60s"

Sample #4

[[inputs.exec]]
  commands = ["sh -c 'sysctl -n dev.cpu.0.temperature | tr -d C'"]
  name_override = "cpu_temp"
  timeout = "5s"
  data_format = "value"
  data_type = "float"
  [inputs.exec.tags]
    core = "core0"

[[inputs.exec]]
  commands = ["sh -c 'sysctl -n dev.cpu.1.temperature | tr -d C'"]
  name_override = "cpu_temp"
  timeout = "5s"
  data_format = "value"
  data_type = "float"
  [inputs.exec.tags]
    core = "core1"

[[inputs.exec]]
  commands = ["sh -c 'sysctl -n dev.cpu.2.temperature | tr -d C'"]
  name_override = "cpu_temp"
  timeout = "5s"
  data_format = "value"
  data_type = "float"
  [inputs.exec.tags]
    core = "core2"

[[inputs.exec]]
  commands = ["sh -c 'sysctl -n dev.cpu.3.temperature | tr -d C'"]
  name_override = "cpu_temp"
  timeout = "5s"
  data_format = "value"
  data_type = "float"
  [inputs.exec.tags]
    core = "core3"

Q & A

[agent] Error terminating process: operation not permitted

Causation: a scheduled command in telegraf.conf reached its configured timeout before finishing its work, and telegraf's attempt to terminate it failed.

Solution: Option 1: if the failed termination is harmless, raise the timeout to avoid or reduce the error.

Option 2: if the timeout is meant to keep an abnormal command from piling up processes and affecting the system, analyze why telegraf cannot terminate it, fix the anomaly, and then tune the timeout as needed.

In my case, the input collected DB2 performance metrics via sudo, configured as follows:

[[inputs.exec]]
    interval = "1h"
    commands = ["sudo -u db2mon sh -c '/home/db2mon/bin/collect_db2x1h.sh -d centdb -a b_centdb'"]
    timeout = "5s"
    data_format = "influx"

Because telegraf cannot kill a process started under another account via sudo, the fix is to modify collect_db2x1h.sh so that telegraf can run it without sudo.

[[inputs.exec]]
    interval = "1h"
    commands = ["/home/db2mon/bin/collect_db2x1h.sh -d centdb -a b_centdb"]
    timeout = "15s"
    data_format = "influx"

Verify that the command is now terminated when the timeout expires; if so, this message appears:

[inputs.exec] Error in plugin: exec: command timed out for command '/home/db2mon/bin/collect_db2x1h.sh -d centdb -a b_centdb'

Once that works, tune the timeout to a suitable value.

Error in plugin: metric parse error: expected tag at 7:20:

Causation: the emitted InfluxData line protocol is malformed.

Solution: check character 20 of line 7 of the output. The line protocol format is:

measurement,tag-key1=tag-value1,tag-key2=tag-value2 field-key1=field-value1,field-key2=field-value2,...
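
A valid point, reusing the iostat sample from above (no space after the measurement name; a single space separates the tag set from the field set):

exec,datadir=/data/data11 r=4.1,w=6.1,await=0.83,svctm=1.35,util=1.46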

max-series-per-database limit exceeded: (1000000)

Causation: the target database has reached the configured series limit of 1000000.

Run this in the InfluxDB CLI to check the database's current series count:

show series cardinality on <db-name>

Solution: raise the limit on the InfluxDB host by editing /etc/influxdb/influxdb.conf (the default is 1000000):

# max-series-per-database = 1000000
max-series-per-database = 2000000

Restart InfluxDB:

systemctl restart influxdb


DB2 Monitoring

Prerequisites
InfluxDB

Create the database

> create database db2_mon with duration 180d
> create user mon with password 'thisispassword'
> grant read on db2_mon to mon

The read-only account mon is used by the Grafana data source.

Telegraf

Create a new configuration file, telegraf-db2.conf

telegraf --input-filter exec --output-filter influxdb config > /etc/telegraf/telegraf-db2.conf

Edit the configuration file

[agent]
...
...
  logfile = "/var/log/telegraf/telegraf-db2.log"
  
  
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  urls = ["http://10.10.2.209:8086"]
  database = "db2_mon"
  timeout = "0s"
  username = "admin"
  password = "Thispassword"

[[inputs.exec]]
    interval = "300s"

    ## Commands array
    #commands = [
    #  "/tmp/test.sh",
    #  "/usr/bin/mycollector --foo=bar",
    #  "/tmp/collect_*.sh"
    #]
    commands = ["sudo -u db2mon sh -c '/home/db2mon/bin/collect_db2.v2.sh -d dcdb -a b_dcdb -u dbuser -p dbpass'"]

    ## Timeout for each command to complete.
    timeout = "5s"

    ## measurement name suffix (for separating different commands)
    #name_suffix = "_mycollector"

    ## Data format to consume.
    ## Each data format has its own unique set of configuration options, read
    ## more about them here:
    ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
    data_format = "influx"

[[inputs.exec]]
    interval = "1h"
    commands = ["sudo -u db2mon sh -c '/home/db2mon/bin/collect_db2x1h.sh -d dcdb -a b_dcdb -u dbuser -p dbpass'"]
    timeout = "5s"
    data_format = "influx"

### CENTDB
[[inputs.exec]]
    interval = "300s"
    commands = ["sudo -u db2mon sh -c '/home/db2mon/bin/collect_db2.v2.sh -d centdb -a b_centdb -u dbuser -p dbpass'"]
    timeout = "5s"
    data_format = "influx"

Create a new service unit file, telegraf-db2.service

NOTE: since each telegraf service reads exactly one configuration file, a host that needs multiple outputs must run multiple telegraf services.

cp /usr/lib/systemd/system/telegraf.service /etc/systemd/system/telegraf-db2.service

Edit the unit file

# Change this line
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf-db2.conf $TELEGRAF_OPTS

Check whether telegraf-db2 appears in the service list; if it does not, run:

systemctl list-unit-files --type service
systemctl daemon-reload
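
Then start the new service and enable it at boot, mirroring the earlier telegraf setup:

systemctl start telegraf-db2
systemctl enable telegraf-db2
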
DB2 Metrics

Bufferpool

-- DATA_HIT_RATIO_PERCENT: Data hit ratio, that is, the percentage of time that the database manager did not need to load a page from disk to service a data page request.
-- INDEX_HIT_RATIO_PERCENT: Index hit ratio, that is, the percentage of time that the database manager did not need to load a page from disk to service an index data page request.
select BP_NAME, DATA_HIT_RATIO_PERCENT, INDEX_HIT_RATIO_PERCENT 
  from sysibmadm.MON_BP_UTILIZATION 
  where BP_NAME not like 'IBMSYS%' and BP_NAME<>'IBMDEFAULTBP'

Logs (active log & archive log)

-- LOG_UTILIZATION_PERCENT: Percent utilization of total log space. (active logs)
-- TOTAL_LOG_USED_KB: The total amount of active log space currently used in the database
-- TOTAL_LOG_AVAILABLE_KB: The amount of active log space in the database that is not being used by uncommitted transactions.
select DB_NAME, LOG_UTILIZATION_PERCENT, TOTAL_LOG_USED_KB, TOTAL_LOG_AVAILABLE_KB 
  from sysibmadm.log_utilization

Connections

-- APPLS_CUR_CONS: Indicates the number of applications that are currently connected to the database.
-- LOCKS_WAITING: Indicates the number of agents waiting on a lock.
-- NUM_INDOUBT_TRANS: The number of outstanding indoubt transactions in the database.
select DB_NAME,APPLS_CUR_CONS,LOCKS_WAITING,NUM_INDOUBT_TRANS from sysibmadm.snapdb

Transactions

-- These values are cumulative counters
-- COMMIT_SQL_STMTS: The total number of SQL COMMIT statements that have been attempted.
-- ROLLBACK_SQL_STMTS: The total number of SQL ROLLBACK statements that have been attempted.
select COMMIT_SQL_STMTS, ROLLBACK_SQL_STMTS from sysibmadm.snapdb
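
Because these counters only grow, graph per-interval deltas with non_negative_difference() (see "Using functions" above); a sketch, assuming the collector writes a commit_sql field to a snapdb2 measurement as in that earlier example:

select non_negative_difference("commit_sql") from "snapdb2" where time > now() - 1h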


Custom scripts


Learning

Archive Log Monitor

Dashboards Setup

Variables

Get the value from a tag

> show tag values from "snapdb" with key="db"

Get the value from a field

> select DISTINCT("vmname") from (select "ready_summation","vmname" from "vsphere_vm_cpu")
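
The variable can then be referenced in a panel query. A sketch, assuming a variable named db built from the tag query above, with measurement and field names taken from the DB2 examples ($timeFilter is Grafana's time-range macro for InfluxQL):

select "locks_waiting" from "snapdb" where "db" =~ /^$db$/ and $timeFilter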

Why Monitoring

Sample #1

Sample #2

Presentation Videos

Prometheus

Linux Monitoring

Linux Monitoring with Prometheus

MySQL Monitoring


AIX Monitoring

Plugins

Install a plugin on a local Grafana

Option 1: with grafana-cli

# Internet access is required
# See https://grafana.com/docs/grafana/latest/administration/cli/#plugins-commands
grafana-cli plugins install marcusolsson-hourly-heatmap-panel

Option 2: manually unpack the .zip file

unzip my-plugin-0.2.0.zip -d YOUR_PLUGIN_DIR/my-plugin

By default, the plugin_dir is /var/lib/grafana/plugins.

Restart Grafana

systemctl stop grafana-server
systemctl start grafana-server

Q & A

Q: Installed the marcusolsson-hourly-heatmap-panel-1.0.0 plugin, but it does not show up under Visualization.
A: Confirm the plugin is visible with the grafana-cli command and on the Plugins page in the UI, then log out of the site and log back in.