b6n3j****@gmail*****
Mon, 1 Feb 2016 14:35:25 JST
Hello, this is Watanabe. I asked the question below the other day, but
further testing since then let me resolve it myself. My apologies if anyone
was already preparing a reply. For reference, the cause is described below.

In /etc/drbd.d/global_common.conf I had configured fencing, but not the
handlers section; that was the cause. After adding the handlers, failover
now works.

disk {
        on-io-error detach;
        fencing resource-only;
}

↓ add the following:

handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

Best regards.

2016-01-25 16:33 GMT+09:00 b6n3j****@gmail***** <b6n3j****@gmail*****>:
> Dear Linux-HA Japan members,
>
> Hello, this is Watanabe.
> The other day I asked about MySQL failing to start in a
> Pacemaker + corosync + DRBD environment.
> MySQL now starts, but failover does not happen in the following cases,
> which has me stuck:
> ・when the Pacemaker process on the primary node is killed
> ・when the primary node is shut down
> While both nodes are up, a manual failover via crm's migrate command
> does work.
> (I run unmigrate after migrate, so no constraint is left behind.)
>
> According to corosync.log, the cluster tries to bring the resources up
> on the secondary node, but promoting DRBD to Primary fails, so the
> replicated area cannot be mounted; that appears to be the cause.
>
> My settings and logs are below. If you notice anything, I would
> appreciate your advice.
> Sorry for the repeated questions, and thank you in advance.
>
> ----------
> ■ Environment
> CentOS 7
> Pacemaker 1.1.13
> corosync 2.3.4
> DRBD 8.4.6
>
> ■ crm_mon (before Pacemaker went down)
> Last updated: Fri Jan 22 16:35:39 2016
> Last change: Thu Jan 21 17:45:35 2016 by hacluster via crmd on NODE1
> Stack: corosync
> Current DC: NODE1 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 2 nodes and 5 resources configured
>
> Online: [ NODE1 NODE2 ]
>
> Master/Slave Set: ms_drbd_r0 [res_drbd_r0]
>     Masters: [ NODE1 ]
>     Slaves: [ NODE2 ]
> Resource Group: rg_mysql
>     res_vipaddr (ocf::heartbeat:IPaddr2): Started NODE1
>     res_fsmnt (ocf::heartbeat:Filesystem): Started NODE1
>     res_mysql (ocf::heartbeat:mysql): Started NODE1
>
> Migration Summary:
> * Node NODE1:
> * Node NODE2:
>
> ■ crm_mon (after Pacemaker went down)
> Last updated: Mon Jan 25 15:27:35 2016
> Last change: Mon Jan 25 15:24:50 2016 by hacluster via crmd on NODE2
> Stack: corosync
> Current DC: NODE2 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 2 nodes and 5 resources configured
> Online: [ NODE2 ]
> OFFLINE: [ NODE1 ]
>
> Master/Slave Set: ms_drbd_r0 [res_drbd_r0]
>     res_drbd_r0 (ocf::linbit:drbd): FAILED Master NODE2
>
> Failed Actions:
> * res_drbd_r0_promote_0 on NODE2 'unknown error' (1): call=123,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Jan 25 15:27:20 2016', queued=0ms, exec=15091ms
>
> ■ corosync.log (excerpt)
> Jan 25 15:21:21 [55019] NODE2 lrmd: info: log_execute:
>     executing - rsc:res_drbd_r0 action:promote call_id:28
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:21 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:22 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:22 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:22 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:23 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:23 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:23 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:24 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:24 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:24 ERROR: r0: Command output:
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:25 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:25 ERROR: r0: Exit code 11
> drbd(res_drbd_r0)[55556]: 2016/01/25_15:21:25 ERROR: r0: Command output:
> ……
> Jan 25 15:21:41 [55019] NODE2 lrmd: warning: child_timeout_callback:
>     res_drbd_r0_promote_0 process (PID 55556) timed out
> Jan 25 15:21:41 [55019] NODE2 lrmd: warning: operation_finished:
>     res_drbd_r0_promote_0:55556 - timed out after 20000ms
> Jan 25 15:21:41 [55019] NODE2 lrmd: notice: operation_finished:
>     res_drbd_r0_promote_0:55556:stderr [ 0: State change failed: (-7) Refusing to be Primary while peer is not outdated ]
> Jan 25 15:21:41 [55019] NODE2 lrmd: notice: operation_finished:
>     res_drbd_r0_promote_0:55556:stderr [ Command 'drbdsetup-84 primary 0' terminated with exit code 11 ]
> Jan 25 15:21:41 [55019] NODE2 lrmd: notice: operation_finished:
>     res_drbd_r0_promote_0:55556:stderr [ 0: State change failed: (-7) Refusing to be Primary while peer is not outdated ]
> Jan 25 15:21:41 [55019] NODE2 lrmd: notice: operation_finished:
>     res_drbd_r0_promote_0:55556:stderr [ Command 'drbdsetup-84 primary 0' terminated with exit code 11 ]
> ……
>
> ■ Pacemaker configuration
> node 1: NODE1
> node 2: NODE2
> primitive res_drbd_r0 ocf:linbit:drbd \
>     params drbd_resource=r0 \
>     op start interval=0 timeout=240 on-fail=restart \
>     op stop interval=0 timeout=100 on-fail=fence
> primitive res_fsmnt Filesystem \
>     params device="/dev/drbd0" directory="/drbd" fstype=xfs options=noatime \
>     op start interval=0 timeout=60 on-fail=restart \
>     op stop interval=0 timeout=60 on-fail=fence
> primitive res_mysql mysql \
>     params binary="/usr/local/mysql/bin/mysqld_safe" \
>         client_binary="/usr/local/mysql/bin/mysql" \
>         datadir="/usr/local/mysql/data" config="/usr/local/mysql/my.cnf" \
>         socket="/tmp/mysql.sock" pid="/var/run/mysqld/mysqld.pid" \
>         user=root group=mysql \
>         additional_parameters="--ledir=/usr/local/mysql/bin --basedir=/usr/local/mysql" \
>     op start interval=0 timeout=120 on-fail=restart \
>     op stop interval=0 timeout=120 on-fail=fence \
>     op notify interval=90 timeout=90 \
>     op monitor interval=20 timeout=30 on-fail=restart
> primitive res_vipaddr IPaddr2 \
>     params ip=192.168.202.10 cidr_netmask=16 nic=eth0 \
>     op start interval=0 timeout=20 on-fail=restart \
>     op stop interval=0 timeout=20 on-fail=fence \
>     op monitor interval=10 timeout=20 on-fail=restart
> group rg_mysql res_vipaddr res_fsmnt res_mysql \
>     meta target-role=Started
> ms ms_drbd_r0 res_drbd_r0 \
>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
> location l_mysql rg_mysql 100: NODE1
> colocation c_mysql inf: rg_mysql ms_drbd_r0:Master
> order o_mysql inf: ms_drbd_r0:promote rg_mysql:start
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.13-10.el7-44eb2dd \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     no-quorum-policy=ignore \
>     default-resource-stickiness=200 \
>     last-lrm-refresh=1453365935
>
> ■ /etc/drbd.conf
> include "drbd.d/global_common.conf";
> include "drbd.d/*.res";
>
> ■ /etc/drbd.d/global_common.conf
> global {
>     usage-count no;
>     # minor-count dialog-refresh disable-ip-verification
>     # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
> }
> common {
>     handlers {
>         # These are EXAMPLE handlers only.
>         # They may have severe implications,
>         # like hard resetting the node under certain circumstances.
>         # Be careful when chosing your poison.
>         # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>         # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>         # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>         # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>     }
>     startup {
>         # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
>     }
>     options {
>         # cpu-mask on-no-data-accessible
>     }
>     disk {
>         # size on-io-error fencing disk-barrier disk-flushes
>         # disk-drain md-flushes resync-rate resync-after al-extents
>         # c-plan-ahead c-delay-target c-fill-target c-max-rate
>         # c-min-rate disk-timeout
>         on-io-error detach;
>         fencing resource-only;
>     }
>     net {
>         # protocol timeout max-epoch-size max-buffers unplug-watermark
>         # connect-int ping-int sndbuf-size rcvbuf-size ko-count
>         # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
>         # after-sb-1pri after-sb-2pri always-asbp rr-conflict
>         # ping-timeout data-integrity-alg tcp-cork on-congestion
>         # congestion-fill congestion-extents csums-alg verify-alg
>         # use-rle
>         protocol C;
>     }
> }
>
> ■ /etc/drbd.d/r0.res
> resource r0 {
>     volume 0 {
>         device /dev/drbd0;
>         disk /dev/sda3;
>         meta-disk internal;
>     }
>     on NODE1 {
>         address 10.0.10.1:7788;
>     }
>     on NODE2 {
>         address 10.0.10.2:7788;
>     }
> }
>
> ■ /etc/corosync/corosync.conf
> totem {
>     version: 2
>     crypto_cipher: none
>     crypto_hash: none
>     rrp_mode: active
>     nodeid: 1
>     interface {
>         member {
>             memberaddr: 10.0.10.1
>         }
>         member {
>             memberaddr: 10.0.10.2
>         }
>         ringnumber: 0
>         bindnetaddr: 10.0.10.0
>         mcastport: 5405
>         ttl: 1
>     }
>     interface {
>         member {
>             memberaddr: 10.0.11.1
>         }
>         member {
>             memberaddr: 10.0.11.2
>         }
>         ringnumber: 1
>         bindnetaddr: 10.0.11.0
>         mcastport: 5405
>         ttl: 1
>     }
>     transport: udpu
> }
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: no
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> quorum {
>     # Enable and configure quorum subsystem (default: off)
>     # see also corosync.conf.5 and votequorum.5
>     provider: corosync_votequorum
>     expected_votes: 2
> }
> aisexec {
>     user: root
>     group: root
> }
> ----------