[Linux-ha-jp] How to recover from a split-brain


mlus mlus****@39596*****
Wed, 22 Jan 2014 13:07:58 JST


This is Koyama.

Takatsuka-san, thank you for your reply.
Apologies in advance for the long message.

> I don't know the details for the case where the cluster management
> layer is corosync, but speaking for heartbeat:
> once both nodes have ended up claiming the DC role, even if the
> interconnect is subsequently restored, I believe the split-brain
> cannot be resolved without restarting the heartbeat process on at
> least one side.

As a test I unplugged the NIC from the host and plugged it back in, but
this did not actually produce a pseudo split-brain, so the test I had in
mind has not really been carried out yet. Both nodes ended up running
the whole test with their HA daemons (pacemaker and corosync) left up.
For the case where the HA daemon on one side (here, the active side)
gets restarted, I have not, with my current knowledge, managed to
restore the original state using crm commands alone while the standby
side's HA daemon stays up without a restart. My question was poorly
phrased, but this is also what I wanted to know.
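
(For reference, the restart-based recovery that Takatsuka-san describes
would look roughly like the following. This is only a sketch, assuming a
sysvinit-managed pacemaker/corosync stack; the service names are my
assumption and may differ by distribution.

# service pacemaker stop     <- on the node whose HA daemon is restarted
# service corosync stop
# service corosync start
# service pacemaker start
)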

With the no-restart approach, the test and its progression were as
follows, and I was able to restore the original state, at least.
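
(In summary, the recovery below uses only crm, with both HA daemons
left running; the same steps as one-shot shell commands would be:

# crm resource cleanup grp host1
# crm resource cleanup grp host2
# crm resource stop grp
# crm resource move grp host1 force
# crm resource start grp
)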


******>  Unplug the USB LAN adapter from host1

#crm_mon -rfA1
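(crm_mon options: -r shows inactive resources, -f shows fail counts,
-A shows node attributes, -1 prints the status once and exits.)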
1:host1-------------------
Last updated: Wed Jan 22 11:45:27 2014
Last change: Wed Jan 22 11:32:29 2014 by root via cibadmin on host1
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured

Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host2
     failmail   (ocf::heartbeat:MailTo):        Started host2
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 0

Migration summary:
* Node host1:
   v_ip: migration-threshold=1 fail-count=1 last-failure='Wed Jan 22 11:45:11 2014'
* Node host2:

Failed actions:
    v_ip_monitor_30000 on (null) 'unknown error' (1): call=56, status=complete, last-rc-change='Wed Jan 22 11:45:11 2014', queued=0ms, exec=0ms
----------------------------------------------

********>  Plug the USB LAN adapter back into host1
2:host1-------------------
Last updated: Wed Jan 22 11:48:02 2014
Last change: Wed Jan 22 11:32:29 2014 by root via cibadmin on host1
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host2
     failmail   (ocf::heartbeat:MailTo):        Started host2
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
   v_ip: migration-threshold=1 fail-count=1 last-failure='Wed Jan 22 11:45:11 2014'
* Node host2:

Failed actions:
    v_ip_monitor_30000 on (null) 'unknown error' (1): call=56, status=complete, last-rc-change='Wed Jan 22 11:45:11 2014', queued=0ms, exec=0ms

2:host2------------------------------------------
Last updated: Wed Jan 22 11:48:42 2014
Last change: Wed Jan 22 11:32:29 2014 by root via cibadmin on host1
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host2
     failmail   (ocf::heartbeat:MailTo):        Started host2
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
   v_ip: migration-threshold=1 fail-count=1 last-failure='Wed Jan 22 11:45:11 2014'
* Node host2:

Failed actions:
    v_ip_monitor_30000 on host1 'unknown error' (1): call=56, status=complete, last-rc-change='Wed Jan 22 11:45:11 2014', queued=0ms, exec=0ms


*******> Run the resource stop / state-clear commands on host1
crm(live)resource# cleanup grp host1
Cleaning up v_ip on host1
Cleaning up failmail on host1
Waiting for 1 replies from the CRMd. OK
crm(live)resource# cleanup grp host2
Cleaning up v_ip on host2
Cleaning up failmail on host2
Waiting for 1 replies from the CRMd. OK
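
(The same cleanup can also be run non-interactively from the shell:

# crm resource cleanup grp host1
# crm resource cleanup grp host2

Either way, the recorded fail-count is reset, which is why the
Migration summary in the next capture comes back empty.)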

3:host1 ---------------------------------------------------
Last updated: Wed Jan 22 11:55:58 2014
Last change: Wed Jan 22 11:54:42 2014 by hacluster via crmd on host2
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host2
     failmail   (ocf::heartbeat:MailTo):        Started host2
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
* Node host2:


3:host2 -------------------------------------------------------
Last updated: Wed Jan 22 11:54:48 2014
Last change: Wed Jan 22 11:54:42 2014 by hacluster via crmd on host2
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host2
     failmail   (ocf::heartbeat:MailTo):        Started host2
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
* Node host2:


**********> host1 console: stop the resource group
crm(live)resource# stop grp
**********> host1 console: move the resource group
crm(live)resource# move grp host1 force
**********> host1 console: start the resource group
crm(live)resource# start grp
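
(Note: "move grp host1 force" works by adding a location constraint
that pins grp to host1. Once the group is running where it should be,
that constraint is usually best cleared:

crm(live)resource# unmove grp

"unmove" is the crmsh alias for "unmigrate"; if it is left in place,
the group will keep preferring host1 from then on.)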

4:host1 ------------------------------------------
Last updated: Wed Jan 22 12:00:00 2014
Last change: Wed Jan 22 11:59:34 2014 by root via cibadmin on host1
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host1
     failmail   (ocf::heartbeat:MailTo):        Started host1
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
* Node host2:

4:host2 -----------------------------------------------------
Last updated: Wed Jan 22 11:59:38 2014
Last change: Wed Jan 22 11:59:34 2014 by root via cibadmin on host1
Current DC: host1 (2886926337) - partition with quorum
2 Nodes configured
4 Resources configured


Online: [ host2 host1 ]

Full list of resources:

 Resource Group: grp
     v_ip       (ocf::heartbeat:IPaddr2):       Started host1
     failmail   (ocf::heartbeat:MailTo):        Started host1
 Clone Set: clone_v_ping [v_ping]
     Started: [ host2 host1 ]

Node Attributes:
* Node host2:
    + pingcheck                         : 100
* Node host1:
    + pingcheck                         : 100

Migration summary:
* Node host1:
* Node host2:




