[Linux-ha-jp] Failover not possible in a Pacemaker+corosync+DRBD environment

b6n3j****@gmail*****
Mon Feb 1 14:35:25 JST 2016


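Hello, this is Watanabe.

I posted the question quoted below the other day, but after
further testing I was able to resolve it myself.
My apologies to anyone who was already preparing a reply.
For reference, the cause is described below.

In /etc/drbd.d/global_common.conf I had configured fencing,
but I had not configured handlers.
After adding the handlers section, failover works.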

disk {
       on-io-error detach;
       fencing resource-only;
}

↓ Added the following
handlers {
       fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
       after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
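
For readers skimming the archive, here is a rough sketch of how the two pieces might sit together inside the common section of /etc/drbd.d/global_common.conf. It is assembled only from the settings shown in this thread and is not a copy of the actual file; the placement inside common { } is an assumption based on the file quoted below. With "fencing resource-only", DRBD calls the fence-peer handler when it loses its peer and is asked to become Primary. Without a handler it has no way to mark the peer as outdated, which is consistent with the "Refusing to be Primary while peer is not outdated" message in the quoted log.

```
common {
        handlers {
                # Called when the peer is unreachable; crm-fence-peer.sh adds a
                # constraint to the Pacemaker CIB so the peer cannot be promoted.
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                # Called on the resync target after resync finishes; removes
                # the constraint again.
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        disk {
                on-io-error detach;
                # Resource-level fencing; relies on the fence-peer handler above.
                fencing resource-only;
        }
        net {
                protocol C;
        }
}
```

The same file needs to be present on both nodes, and the running resource has to pick up the change (for example with "drbdadm adjust r0") before testing failover again.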

Best regards.


2016-01-25 16:33 GMT+09:00 b6n3j****@gmail***** <b6n3j****@gmail*****>:
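> Dear Linux-HA Japan members,
>
> Hello, this is Watanabe.
> The other day I asked about MySQL failing to start in a Pacemaker+corosync+DRBD environment.
> MySQL now starts, but failover does not happen in the following cases, and I am stuck:
> - when the Pacemaker process on the primary node is stopped
> - when the primary node is shut down
> When both nodes are alive, manual failover with the crm migrate command does work.
> Note: I run unmigrate after migrate, so no constraint is left behind.
>
> According to corosync.log, the cluster tries to start the resources on the secondary node,
> but promoting DRBD to primary fails, so the replicated volume cannot be mounted. This appears to be the cause.
>
> The configuration and logs are listed below. If you notice anything,
> I would appreciate your advice.
> I apologize for the repeated questions, and thank you in advance.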
>
> ----------
> ■Environment
> CentOS7
> pacemaker1.1.13
> corosync2.3.4
> drbd8.4.6
>
> ■crm_mon (before Pacemaker went down)
> Last updated: Fri Jan 22 16:35:39 2016          Last change: Thu Jan 21 17:45:35 2016 by hacluster via crmd on NODE1
> Stack: corosync
> Current DC: NODE1 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 2 nodes and 5 resources configured
> Online: [ NODE1 NODE2 ]
>  Master/Slave Set: ms_drbd_r0 [res_drbd_r0]
>      Masters: [ NODE1 ]
>      Slaves: [ NODE2 ]
>  Resource Group: rg_mysql
>      res_vipaddr        (ocf::heartbeat:IPaddr2):       Started NODE1
>      res_fsmnt  (ocf::heartbeat:Filesystem):    Started NODE1
>      res_mysql  (ocf::heartbeat:mysql): Started NODE1
> Migration Summary:
> * Node NODE1:
> * Node NODE2:
>
> ■crm_mon (after Pacemaker went down)
> Last updated: Mon Jan 25 15:27:35 2016          Last change: Mon Jan 25 15:24:50 2016 by hacluster via crmd on NODE2
> Stack: corosync
> Current DC: NODE2 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 2 nodes and 5 resources configured
> Online: [ NODE2 ]
> OFFLINE: [ NODE1 ]
>  Master/Slave Set: ms_drbd_r0 [res_drbd_r0]
>      res_drbd_r0        (ocf::linbit:drbd):     FAILED Master NODE2
> Failed Actions:
> * res_drbd_r0_promote_0 on NODE2 'unknown error' (1): call=123, status=complete, exitreason='none',
>     last-rc-change='Mon Jan 25 15:27:20 2016', queued=0ms, exec=15091ms
>
> ■corosync.log (excerpt)
> Jan 25 15:21:21 [55019] NODE2       lrmd:     info: log_execute: executing - rsc:res_drbd_r0 action:promote call_id:28
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:21 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:22 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:22 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:22 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:23 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:23 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:23 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:24 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:24 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:24 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:25 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:25 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]:       2016/01/25_15:21:25 ERROR: r0: Command output:
> ……
> Jan 25 15:21:41 [55019] NODE2       lrmd:  warning: child_timeout_callback:   res_drbd_r0_promote_0 process (PID 55556) timed out
> Jan 25 15:21:41 [55019] NODE2       lrmd:  warning: operation_finished:       res_drbd_r0_promote_0:55556 - timed out after 20000ms
> Jan 25 15:21:41 [55019] NODE2       lrmd:   notice: operation_finished:       res_drbd_r0_promote_0:55556:stderr [ 0: State change failed: (-7) Refusing to be Primary while peer is not outdated ]
> Jan 25 15:21:41 [55019] NODE2       lrmd:   notice: operation_finished:       res_drbd_r0_promote_0:55556:stderr [ Command 'drbdsetup-84 primary 0' terminated with exit code 11 ]
> Jan 25 15:21:41 [55019] NODE2       lrmd:   notice: operation_finished:       res_drbd_r0_promote_0:55556:stderr [ 0: State change failed: (-7) Refusing to be Primary while peer is not outdated ]
> Jan 25 15:21:41 [55019] NODE2       lrmd:   notice: operation_finished:       res_drbd_r0_promote_0:55556:stderr [ Command 'drbdsetup-84 primary 0' terminated with exit code 11 ]
> ……
>
> ■Pacemaker configuration
> node 1: NODE1
> node 2: NODE2
> primitive res_drbd_r0 ocf:linbit:drbd \
>         params drbd_resource=r0 \
>         op start interval=0 timeout=240 on-fail=restart \
>         op stop interval=0 timeout=100 on-fail=fence
> primitive res_fsmnt Filesystem \
>         params device="/dev/drbd0" directory="/drbd" fstype=xfs options=noatime \
>         op start interval=0 timeout=60 on-fail=restart \
>         op stop interval=0 timeout=60 on-fail=fence
> primitive res_mysql mysql \
>         params binary="/usr/local/mysql/bin/mysqld_safe" client_binary="/usr/local/mysql/bin/mysql" datadir="/usr/local/mysql/data" config="/usr/local/mysql/my.cnf" socket="/tmp/mysql.sock" pid="/var/run/mysqld/mysqld.pid" user=root group=mysql additional_parameters="--ledir=/usr/local/mysql/bin --basedir=/usr/local/mysql" \
>         op start interval=0 timeout=120 on-fail=restart \
>         op stop interval=0 timeout=120 on-fail=fence \
>         op notify interval=90 timeout=90 \
>         op monitor interval=20 timeout=30 on-fail=restart
> primitive res_vipaddr IPaddr2 \
>         params ip=192.168.202.10 cidr_netmask=16 nic=eth0 \
>         op start interval=0 timeout=20 on-fail=restart \
>         op stop interval=0 timeout=20 on-fail=fence \
>         op monitor interval=10 timeout=20 on-fail=restart
> group rg_mysql res_vipaddr res_fsmnt res_mysql \
>         meta target-role=Started
> ms ms_drbd_r0 res_drbd_r0 \
>         meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
> location l_mysql rg_mysql 100: NODE1
> colocation c_mysql inf: rg_mysql ms_drbd_r0:Master
> order o_mysql inf: ms_drbd_r0:promote rg_mysql:start
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.13-10.el7-44eb2dd \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         default-resource-stickiness=200 \
>         last-lrm-refresh=1453365935
>
> ■/etc/drbd.conf
> include "drbd.d/global_common.conf";
> include "drbd.d/*.res";
>
> ■/etc/drbd.d/global_common.conf
> global {
>         usage-count no;
>         # minor-count dialog-refresh disable-ip-verification
>         # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
> }
> common {
>         handlers {
>                 # These are EXAMPLE handlers only.
>                 # They may have severe implications,
>                 # like hard resetting the node under certain circumstances.
>                 # Be careful when chosing your poison.
>                 # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>                 # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>                 # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>                 # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>                 # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>                 # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>         }
>         startup {
>                 # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>         }
>         options {
>                 # cpu-mask on-no-data-accessible
>         }
>         disk {
>                 # size on-io-error fencing disk-barrier disk-flushes
>                 # disk-drain md-flushes resync-rate resync-after al-extents
>                 # c-plan-ahead c-delay-target c-fill-target c-max-rate
>                 # c-min-rate disk-timeout
>                 on-io-error detach;
>                 fencing resource-only;
>         }
>         net {
>                 # protocol timeout max-epoch-size max-buffers unplug-watermark
>                 # connect-int ping-int sndbuf-size rcvbuf-size ko-count
>                 # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
>                 # after-sb-1pri after-sb-2pri always-asbp rr-conflict
>                 # ping-timeout data-integrity-alg tcp-cork on-congestion
>                 # congestion-fill congestion-extents csums-alg verify-alg
>                 # use-rle
>                 protocol C;
>         }
> }
>
> ■/etc/drbd.d/r0.res
> resource r0 {
>         volume 0 {
>                 device /dev/drbd0;
>                 disk /dev/sda3;
>                 meta-disk internal;
>         }
>         on NODE1 {
>                 address 10.0.10.1:7788;
>         }
>         on NODE2 {
>                 address 10.0.10.2:7788;
>         }
> }
>
> ■/etc/corosync/corosync.conf
> totem {
>         version: 2
>         crypto_cipher: none
>         crypto_hash: none
>         rrp_mode: active
>         nodeid: 1
>         interface {
>                 member {
>                         memberaddr: 10.0.10.1
>                 }
>                 member {
>                         memberaddr: 10.0.10.2
>                 }
>                 ringnumber: 0
>                 bindnetaddr: 10.0.10.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         interface {
>                 member {
>                         memberaddr: 10.0.11.1
>                 }
>                 member {
>                         memberaddr: 10.0.11.2
>                 }
>                 ringnumber: 1
>                 bindnetaddr: 10.0.11.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
> }
> logging {
>         fileline: off
>         to_logfile: yes
>         to_syslog: no
>         logfile: /var/log/cluster/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: QUORUM
>                 debug: off
>         }
> }
> quorum {
>         # Enable and configure quorum subsystem (default: off)
>         # see also corosync.conf.5 and votequorum.5
>         provider: corosync_votequorum
>         expected_votes: 2
> }
> aisexec {
>         user: root
>         group: root
> }
> ----------



