DRBD: Difference between revisions

From DWIKI
(24 intermediate revisions by the same user not shown)
Line 3: Line 3:


=Links=
=Links=
*[http://www.drbd.org/ Homepage]
*[http://www.drbd.org/ Homepage]
*[https://www.linbit.com/drbd-user-guide/users-guide-drbd-8-4/ DRBD-8.4 user's guide]
*[https://www.canarytek.com/2017/09/06/DRBD_NFS_Cluster.html DRBD + pacemaker + NFS, pretty good doc]
*[https://www.canarytek.com/2017/09/06/DRBD_NFS_Cluster.html DRBD + pacemaker + NFS, pretty good doc]
*http://www.securityandit.com/system/pacemaker-cluster-with-nfs-and-drbd/
*http://www.securityandit.com/system/pacemaker-cluster-with-nfs-and-drbd/
Line 20: Line 22:
*[https://wiki.mikejung.biz/DRBD also tuning tips]
*[https://wiki.mikejung.biz/DRBD also tuning tips]
*[http://docs.gz.ro/node/249 Debian DRBD: How to resize NFS on drbd volume on top of LVM]
*[http://docs.gz.ro/node/249 Debian DRBD: How to resize NFS on drbd volume on top of LVM]
*[https://www.recitalsoftware.com/blogs/29-howto-resolve-drbd-split-brain-recovery-manually HOWTO: Resolve DRBD split-brain recovery manually]


See also: http://www.gluster.org/ and http://ceph.com/
See also: http://www.gluster.org/ and http://ceph.com/
Line 52: Line 55:
*[https://docs.linbit.com/doc/users-guide-83/s-split-brain-notification-and-recovery/ Split brain notification and automatic recovery]
*[https://docs.linbit.com/doc/users-guide-83/s-split-brain-notification-and-recovery/ Split brain notification and automatic recovery]
*[http://avid.force.com/pkb/articles/en_US/Compatibility/Troubleshooting-DRBD-on-MediaCentral Troubleshooting DRBD on MediaCentral]
*[http://avid.force.com/pkb/articles/en_US/Compatibility/Troubleshooting-DRBD-on-MediaCentral Troubleshooting DRBD on MediaCentral]


=GFS on DRBD=
=GFS on DRBD=
Line 58: Line 64:




=Cheatsheet=
 
==Make device primary==
 
 
= Cheatsheet =
 
== HOWTO create a drbd resource ==
 
lvcreate -L2G -n test3 DRBD
 
Create resource file and verify it
 
drbdadm dump -c test3.res
 
Copy the resource file to /etc/drbd.d/ on both nodes. On both nodes run:
 
drbdadm create-md test3
cat /proc/drbd
 
Then:
 
drbdadm up test3
 
should give:
<pre>3: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
</pre>
 
On one node run:
 
drbdadm -- --overwrite-data-of-peer primary test3
 
or just
 
drbdadm primary --force test3
 
and check:
 
cat /proc/drbd
 
which should give:
<pre>3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:152168 nr:0 dw:0 dr:154288 al:8 bm:0 lo:0 pe:1 ua:0 ap:0 ep:1 wo:f oos:1945500
        [>...................] sync'ed:  7.5% (1945500/2097052)K
</pre>
 
== Make device/node primary ==
 
Should already be done by previous step
 
  drbdadm primary yourdeviceID
  drbdadm primary yourdeviceID
or
or
  drbdsetup /dev/drbdX primary -o
  drbdsetup /dev/drbdX primary -o


== Create pcs resource ==
pcs resource create TEST3_DRBD ocf:linbit:drbd drbd_resource=test3 op demote interval=0s timeout=90 monitor interval=60s \
  notify interval=0s  timeout=90 promote interval=0s timeout=90 reload interval=0s timeout=30 \
  start interval=0s timeout=240 stop interval=0s timeout=100 --disabled
This will result in the error:
* TEST3_DRBD_monitor_0 on santest-b 'not configured' (6): call=922, status=complete, exitreason='meta parameter misconfigured, expected clone-max -le 2, but found unset.',
    last-rc-change='Tue Jul 21 10:06:20 2020', queued=0ms, exec=680ms
Just run
pcs resource master TEST3_DRBD-Clone TEST3_DRBD master-node-max=1 clone-max=2 notify=true  master-max=1 clone-node-max=1 --disabled
and then
pcs resource cleanup TEST3_DRBD
&nbsp;
== Grow resource ==


==Grow resource==
On both nodes:
On both nodes:
  lvextend -L+10G /dev/DRBD/myresource
  lvextend -L+10G /dev/DRBD/myresource
On one node:
On one node:
  drbdadm resize myresource
  drbdadm resize myresource


&nbsp;
== Check resource file ==


==Check resource file==
Editing files directly in /etc/drbd.d/ is a bad plan; check the syntax of a copy first:
Editing files directly in /etc/drbd.d/ is a bad plan; check the syntax of a copy first:


  drbdadm dump -c /tmp/test.res
  drbdadm dump -c /tmp/test.res


=FAQ=
&nbsp;
 
== Mapping resource name and device ==
 
ls -al /dev/drbd/<LVM volume group name>/by-disk/
 
&nbsp;
 
&nbsp;
 
== Renaming DRBD resource ==


==Get out of 'Standalone'==
Is a simple matter of renaming the oldresource.res file and updating its contents. Remember to first run (on one node is enough):
disconnect/connect until works :)
 
drbdadm down myresource
 
and if you're like me you might need to rename your logical volume too
 
lvrename VG oldresource newresource
 
== Remove drbd resource ==
 
drbdadm down myresource
 
remove resource files
 
== Show statistics ==
 
drbdsetup status --verbose --statistics
 
= FAQ =
 
== Get out of 'Standalone' ==
 
disconnect/connect until works&nbsp;:)
 
== 1: State change failed: (-2) Need access to UpToDate data ==


==1: State change failed: (-2) Need access to UpToDate data==
When you get that while trying to make a node/resource primary, try
When you get that while trying to make a node/resource primary, try
  drbdadm primary drbdX --force
  drbdadm primary drbdX --force


==calculate metadata size==
== Failure: (102) Local address(port) already in use. ==
https://serverfault.com/questions/433999/calculating-drbd-meta-size


When drbdadm shows
disk:Inconsistent
and/or
replication:SyncTarget
it's already busy syncing; check this instead:
cat /proc/drbd
&nbsp;
== calculate metadata size ==
[https://serverfault.com/questions/433999/calculating-drbd-meta-size https://serverfault.com/questions/433999/calculating-drbd-meta-size]
&nbsp;


  Cs=`blockdev --getsz /dev/foo`
  Cs=`blockdev --getsz /dev/foo`
Line 95: Line 227:
TODO finish this
TODO finish this


=='mydrbd' not defined in your config (for this host).==
== 'mydrbd' not defined in your config (for this host). ==
 
If drbdadm create-md throws this, 'this host' is the clue: it must match `hostname`
If drbdadm create-md throws this, 'this host' is the clue: it must match `hostname`


[https://newbiedba.wordpress.com/2015/09/21/drbd-not-defined-in-your-config-for-this-host/ https://newbiedba.wordpress.com/2015/09/21/drbd-not-defined-in-your-config-for-this-host/]


https://newbiedba.wordpress.com/2015/09/21/drbd-not-defined-in-your-config-for-this-host/
== show resource sizes ==


==show resource sizes==
  lsblk
  lsblk


==commands to show info==
== commands to show info ==
 
  drbdmon
  drbdmon
  drbdtop
  drbdtop


&nbsp;
== resolving split brain issues ==
*[https://docs.linbit.com/doc/users-guide-83/s-resolve-split-brain/ https://docs.linbit.com/doc/users-guide-83/s-resolve-split-brain/]
*[https://www.sebastien-han.fr/blog/2012/04/25/DRBD-split-brain/ https://www.sebastien-han.fr/blog/2012/04/25/DRBD-split-brain/]


==resolving split brain issues==
== diskless ==
*https://docs.linbit.com/doc/users-guide-83/s-resolve-split-brain/
*https://www.sebastien-han.fr/blog/2012/04/25/DRBD-split-brain/


==diskless==
You might try
You might try
  drbdadm attach drbd0
  drbdadm attach drbd0


&nbsp;


== The disk contains an unclean file system (0, 0). ==


==The disk contains an unclean file system (0, 0).==
Metadata kept in Windows cache, refused to mount. Falling back to read-only mount because the NTFS partition is in an unsafe state. Please resume and shutdown Windows fully (no hibernation or fast restarting.)
Metadata kept in Windows cache, refused to mount.
Falling back to read-only mount because the NTFS partition is in an
unsafe state. Please resume and shutdown Windows fully (no hibernation
or fast restarting.)


When trying to mount a snapshot (kpartx -av backup-snap1 etc)
When trying to mount a snapshot (kpartx -av backup-snap1 etc)
Line 129: Line 265:
???
???


&nbsp;
&nbsp;


==sync is slow==
== sync is slow ==
*https://forum.proxmox.com/threads/slow-drbd9-sync-like-20mbit-on-1gbit.27927/
*[https://docs.linbit.com/doc/users-guide-84/s-configure-sync-rate/ Configuring sync rate]


*[https://forum.proxmox.com/threads/slow-drbd9-sync-like-20mbit-on-1gbit.27927/ https://forum.proxmox.com/threads/slow-drbd9-sync-like-20mbit-on-1gbit.27927/]


On secondary:
On secondary:
  drbdadm disk-options --c-plan-ahead=0 --resync-rate=50M drbd0
 
  drbdadm disk-options --c-plan-ahead=0 --resync-rate=50M drbd0
 
and to reset after sync:
and to reset after sync:
  drbdadm adjust drbd0
  drbdadm adjust drbd0


==show configuration==
== show configuration ==
 
  drbdsetup show
  drbdsetup show


&nbsp;
== show more info ==


==show more info==
  drbdsetup show-gi <minor-number>
  drbdsetup show-gi <minor-number>
Minor number is shown in drbdsetup show
Minor number is shown in drbdsetup show


==update network settings==
== update network settings ==
 
  drbdsetup net-options 10.0.0.1 10.0.0.2  --sndbuf-size=2M
  drbdsetup net-options 10.0.0.1 10.0.0.2  --sndbuf-size=2M


&nbsp;
== mount: unknown filesystem type 'drbd' ==


Usually means your node is not primary. If you're sure you know what you're doing you can use


==mount: unknown filesystem type 'drbd'==
Usually means your node is not primary. If you're sure you know what you're doing you can use
  mount -t ext4 /dev/drbd1 /drbdmount
  mount -t ext4 /dev/drbd1 /drbdmount


or when you want to mount the partition while drbd is down:
or when you want to mount the partition while drbd is down:
  kpartx -av /dev/mapper/DRBD-test1  
  kpartx -av /dev/mapper/DRBD-test1  
  #add map DRBD-test1p1 (253:5): 0 10482928 linear /dev/mapper/DRBD-test1 2048
  #add map DRBD-test1p1 (253:5): 0 10482928 linear /dev/mapper/DRBD-test1 2048
Line 163: Line 312:
  mount -o ro /dev/mapper/DRBD-test1p1 /mnt/test1
  mount -o ro /dev/mapper/DRBD-test1p1 /mnt/test1


==reload configuration==
== reload configuration ==


  drbdadm --dry-run adjust <resourcename|all>
  drbdadm --dry-run adjust <resourcename|all>


and then  
and then


  drbdadm adjust <resourcename|all>
  drbdadm adjust <resourcename|all>


==show configuration of resource==  
== show configuration of resource ==
 
  drbdsetup /dev/drbd0 show
  drbdsetup /dev/drbd0 show


&nbsp;
== resource unknown ==


==resource unknown==
First try
First try
  drbdadm up resourcename
  drbdadm up resourcename


==Command 'drbdmeta 1 v08 /dev/drbd0 internal apply-al' terminated with exit code 20==
== Command 'drbdmeta 1 v08 /dev/drbd0 internal apply-al' terminated with exit code 20 ==
 
Most likely split brain issue, check dmesg etc
Most likely split brain issue, check dmesg etc


==wfconnection==
== wfconnection ==
 
Could be split brain situation
Could be split brain situation
  drbdadm -- --discard-my-data connect resource
  drbdadm -- --discard-my-data connect resource


==cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown==
== cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown ==
 
So you're on the primary node and the secondary might show nothing; then first try
So you're on the primary node and the secondary might show nothing; then first try
  drbdadm up drbdX
  drbdadm up drbdX
or if secondary shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", it is waiting for connection(no?)
or if secondary shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", it is waiting for connection(no?)
  drbdadm disconnect drbdres
  drbdadm disconnect drbdres
  drbdadm connect --discard-my-data drbdres
  drbdadm connect --discard-my-data drbdres


or it shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", in that case what might work on primary node:


or it shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", in that case what might work on primary node:
  drbdadm disconnect drbdX
  drbdadm disconnect drbdX
  drbdadm connect drbdX
  drbdadm connect drbdX


==Split brain issue==
== Split brain issue ==
 
To force updating resource
To force updating resource
drbdadm invalidate resource


==cs:WFReportParams ro:Secondary/Unknown ds:UpToDate/DUnknown==
#You might need this
 
drbdadm down <resource>
 
This works on a device that's not shown in /proc/drbd
 
drbdadm invalidate <resource>
 
== cs:WFReportParams ro:Secondary/Unknown ds:UpToDate/DUnknown ==
 
  connection is made, waiting for more
  connection is made, waiting for more


==Unexpected data packet AuthChallenge (0x0010)==
== cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown ==
 
Assuming you really are looking at the secondary:
 
drbdadm -- --discard-my-data connect test3
 
and on primary you might need to
 
drbdadm connect test3
 
or
 
drbdadm primary --force test3
 
&nbsp;
 
== Unexpected data packet AuthChallenge (0x0010) ==
 
maybe the shared key
maybe the shared key


==State change failed: Device is held open by someone==
== State change failed: Device is held open by someone ==
 
could be stacked resource, timeout?
could be stacked resource, timeout?


==error receiving ReportState, e: -5 l: 0!==
== error receiving ReportState, e: -5 l: 0! ==
 
??
??


==drbd: error sending genl reply==
== resource XYX cannot run anywhere ==
CentOS feature, https://wiki.centos.org/Manuals/ReleaseNotes/CentOS7.2003.
If you see that in log, it probably means it's a disabled resource in a disabled group
"Try another kernel/module version"
CHECK THIS
 
== drbd: error sending genl reply ==
 
CentOS feature, [https://wiki.centos.org/Manuals/ReleaseNotes/CentOS7.2003 https://wiki.centos.org/Manuals/ReleaseNotes/CentOS7.2003]. "Try another kernel/module version"
 
== State change failed: (-14) Need a verify algorithm to start online verify ==


==State change failed: (-14) Need a verify algorithm to start online verify==
Means no verify-alg was defined, so no online checking
Means no verify-alg was defined, so no online checking


==drbdadm dump-md foo: Found meta data is "unclean", please apply-al first==
== drbdadm dump-md foo: Found meta data is "unclean", please apply-al first ==
 
Try
Try
  drbdadm apply-al foo
  drbdadm apply-al foo
( AL means "activity log", btw )
&nbsp;
== log messages ==
=== sock_sendmsg time expired, ko=6 ===
latency problem?

Revision as of 15:18, 6 October 2021

Distributed Replicated Block Device


Links

See also: http://www.gluster.org/ and http://ceph.com/

Stacked resources

Support

Tools

drbdadm

drbd-overview

drbdsetup

Docs

Recovery



GFS on DRBD



Cheatsheet

HOWTO create a drbd resource

lvcreate -L2G -n test3 DRBD

Create resource file and verify it

drbdadm dump -c test3.res
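
A minimal sketch of what test3.res could contain, assuming DRBD 8.4 syntax. The host names node-a and node-b, the addresses, and port 7791 are placeholders and not values from this setup; the minor number 3 and the backing volume /dev/DRBD/test3 follow the commands and output on this page. The names after "on" have to match `hostname` on each node.

```shell
# Write a hypothetical resource file to /tmp so it can be checked
# before it is copied to /etc/drbd.d/.
# node-a/node-b, 10.0.0.1/10.0.0.2 and port 7791 are assumptions.
cat > /tmp/test3.res <<'EOF'
resource test3 {
  device    /dev/drbd3;
  disk      /dev/DRBD/test3;
  meta-disk internal;
  on node-a {
    address 10.0.0.1:7791;
  }
  on node-b {
    address 10.0.0.2:7791;
  }
}
EOF
# Then verify it with: drbdadm dump -c /tmp/test3.res
```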

Copy the resource file to /etc/drbd.d/ on both nodes. On both nodes run:

drbdadm create-md test3
cat /proc/drbd

Then:

drbdadm up test3

should give:

3: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----

On one node run:

drbdadm -- --overwrite-data-of-peer primary test3

or just

drbdadm primary --force test3

and check:

cat /proc/drbd

which should give:

3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:152168 nr:0 dw:0 dr:154288 al:8 bm:0 lo:0 pe:1 ua:0 ap:0 ep:1 wo:f oos:1945500
        [>...................] sync'ed:  7.5% (1945500/2097052)K
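
To watch the initial sync from a script, the percentage can be pulled out of that output. This is only a sketch: drbd_sync_pct is a made-up helper name, and it assumes the DRBD 8.x /proc/drbd layout shown above (a "N: cs:..." line per minor, followed by a "sync'ed:" line while syncing).

```shell
# Print the sync percentage for one minor from /proc/drbd-style text on stdin.
# Prints nothing when that minor is not syncing.
# Usage: drbd_sync_pct 3 < /proc/drbd
drbd_sync_pct() {
    awk -v minor="$1" '
        # Remember which minor the following lines belong to.
        /^ *[0-9]+: cs:/ { sub(/:$/, "", $1); cur = $1 }
        # "sync.ed" matches "sync followed by any character, then ed".
        cur == minor && /sync.ed:/ {
            for (i = 1; i < NF; i++)
                if ($i ~ /^sync.ed:$/) {
                    sub(/%$/, "", $(i + 1))
                    print $(i + 1)
                }
        }
    '
}
```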

Make device/node primary

Should already be done by previous step

drbdadm primary yourdeviceID

or

drbdsetup /dev/drbdX primary -o

Create pcs resource

pcs resource create TEST3_DRBD ocf:linbit:drbd drbd_resource=test3 op demote interval=0s timeout=90 monitor interval=60s \
  notify interval=0s   timeout=90 promote interval=0s timeout=90 reload interval=0s timeout=30 \
  start interval=0s timeout=240 stop interval=0s timeout=100 --disabled

This will result in the error:

* TEST3_DRBD_monitor_0 on santest-b 'not configured' (6): call=922, status=complete, exitreason='meta parameter misconfigured, expected clone-max -le 2, but found unset.',
   last-rc-change='Tue Jul 21 10:06:20 2020', queued=0ms, exec=680ms

Just run

pcs resource master TEST3_DRBD-Clone TEST3_DRBD master-node-max=1 clone-max=2 notify=true   master-max=1 clone-node-max=1 --disabled

and then

pcs resource cleanup TEST3_DRBD

 

Grow resource

On both nodes:

lvextend -L+10G /dev/DRBD/myresource

On one node:

drbdadm resize myresource

 

Check resource file

Editing files directly in /etc/drbd.d/ is a bad plan; check the syntax of a copy first:

drbdadm dump -c /tmp/test.res

 

Mapping resource name and device

ls -al /dev/drbd/<LVM volume group name>/by-disk/

 

 

Renaming DRBD resource

Is a simple matter of renaming the oldresource.res file and updating its contents. Remember to first run (on one node is enough):

drbdadm down myresource

and if you're like me you might need to rename your logical volume too

lvrename VG oldresource newresource

Remove drbd resource

drbdadm down myresource

remove resource files

Show statistics

drbdsetup status --verbose --statistics

FAQ

Get out of 'Standalone'

disconnect/connect until works :)

1: State change failed: (-2) Need access to UpToDate data

When you get that while trying to make a node/resource primary, try

drbdadm primary drbdX --force

Failure: (102) Local address(port) already in use.

When drbdadm shows

disk:Inconsistent

and/or

replication:SyncTarget

it's already busy syncing; check this instead:

cat /proc/drbd

 

calculate metadata size

https://serverfault.com/questions/433999/calculating-drbd-meta-size

 

Cs=`blockdev --getsz /dev/foo`
Bs=`blockdev --getpbsz /dev/foo`

TODO finish this
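
One possible way to finish this, assuming internal metadata on DRBD 8.4: the user's guide gives the estimate Ms = ceil(Cs / 2^18) * 8 + 72, with both sizes in 512-byte sectors, which is the unit `blockdev --getsz` prints. The helper name drbd_md_sectors is made up here, and DRBD 9 uses a different formula that depends on the number of peers. As a sanity check, the 2G test3 volume above comes out at 200 sectors (100 KiB), which matches the 2097052K total shown in its sync output.

```shell
# Estimate DRBD 8.4 internal metadata size in 512-byte sectors.
# $1: device size in 512-byte sectors, as printed by `blockdev --getsz`.
drbd_md_sectors() {
    cs=$1
    # ceil(cs / 2^18) * 8 + 72, using integer arithmetic for the ceiling
    echo $(( (cs + 262143) / 262144 * 8 + 72 ))
}

# Example for a 2 GiB volume (4194304 sectors):
drbd_md_sectors 4194304   # prints 200, i.e. 100 KiB
```

On a real device this would be called as drbd_md_sectors "$(blockdev --getsz /dev/foo)".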

'mydrbd' not defined in your config (for this host).

If drbdadm create-md throws this, 'this host' is the clue: it must match `hostname`

https://newbiedba.wordpress.com/2015/09/21/drbd-not-defined-in-your-config-for-this-host/

show resource sizes

lsblk

commands to show info

drbdmon
drbdtop

 

resolving split brain issues

diskless

You might try

drbdadm attach drbd0

 

The disk contains an unclean file system (0, 0).

Metadata kept in Windows cache, refused to mount. Falling back to read-only mount because the NTFS partition is in an unsafe state. Please resume and shutdown Windows fully (no hibernation or fast restarting.)

When trying to mount a snapshot (kpartx -av backup-snap1 etc)

???

 

 

sync is slow

On secondary:

drbdadm disk-options --c-plan-ahead=0 --resync-rate=50M drbd0

and to reset after sync:

drbdadm adjust drbd0

show configuration

drbdsetup show

 

show more info

drbdsetup show-gi <minor-number>

Minor number is shown in drbdsetup show

update network settings

drbdsetup net-options 10.0.0.1 10.0.0.2  --sndbuf-size=2M

 

mount: unknown filesystem type 'drbd'

Usually means your node is not primary. If you're sure you know what you're doing you can use

mount -t ext4 /dev/drbd1 /drbdmount

or when you want to mount the partition while drbd is down:

kpartx -av /dev/mapper/DRBD-test1 
#add map DRBD-test1p1 (253:5): 0 10482928 linear /dev/mapper/DRBD-test1 2048
#I suggest mounting ro 
mount -o ro /dev/mapper/DRBD-test1p1 /mnt/test1

reload configuration

drbdadm --dry-run adjust <resourcename|all>

and then

drbdadm adjust <resourcename|all>
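
The two steps can be chained so the real adjust only runs when the dry run succeeds. A small sketch; drbd_safe_adjust is a hypothetical wrapper name, and it uses only the two drbdadm invocations shown above.

```shell
# Show what adjust would do, then apply it only if the dry run succeeded.
# Usage: drbd_safe_adjust <resourcename|all>   (defaults to "all")
drbd_safe_adjust() {
    res=${1:-all}
    echo "Planned changes for $res:"
    # A broken config makes the dry run fail, and nothing is applied.
    drbdadm --dry-run adjust "$res" || return 1
    drbdadm adjust "$res"
}
```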

show configuration of resource

drbdsetup /dev/drbd0 show

 

resource unknown

First try

drbdadm up resourcename

Command 'drbdmeta 1 v08 /dev/drbd0 internal apply-al' terminated with exit code 20

Most likely split brain issue, check dmesg etc

wfconnection

Could be split brain situation

drbdadm -- --discard-my-data connect resource

cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown

So you're on the primary node and the secondary might show nothing; then first try

drbdadm up drbdX

or if secondary shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", it is waiting for connection(no?)

drbdadm disconnect drbdres
drbdadm connect --discard-my-data drbdres

or it shows "cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown", in that case what might work on primary node:

drbdadm disconnect drbdX
drbdadm connect drbdX

Split brain issue

To force updating resource

You might need this first:
drbdadm down <resource>

This works on a device that's not shown in /proc/drbd

drbdadm invalidate <resource>

cs:WFReportParams ro:Secondary/Unknown ds:UpToDate/DUnknown

connection is made, waiting for more

cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown

Assuming you really are looking at the secondary:

drbdadm -- --discard-my-data connect test3

and on primary you might need to

drbdadm connect test3

or

drbdadm primary --force test3
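
To avoid running the discard command on the wrong node, the commands for each side can be printed first and compared with what /proc/drbd shows. A sketch only: drbd_split_brain_cmds and the role names "victim" (the node whose changes get thrown away) and "survivor" are labels chosen here, and the commands are the ones listed on this page. Nothing is executed.

```shell
# Print (do not run) the reconnect commands for one side of a split brain.
# $1: "victim" (its changes are discarded) or "survivor"; $2: resource name.
drbd_split_brain_cmds() {
    case "$1" in
        victim)
            echo "drbdadm disconnect $2"
            echo "drbdadm -- --discard-my-data connect $2"
            ;;
        survivor)
            echo "drbdadm connect $2"
            ;;
        *)
            echo "usage: drbd_split_brain_cmds victim|survivor <resource>" >&2
            return 1
            ;;
    esac
}
```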

 

Unexpected data packet AuthChallenge (0x0010)

maybe the shared key

State change failed: Device is held open by someone

could be stacked resource, timeout?

error receiving ReportState, e: -5 l: 0!

??

resource XYX cannot run anywhere

If you see that in the log, it probably means it's a disabled resource in a disabled group (CHECK THIS).

drbd: error sending genl reply

CentOS feature, https://wiki.centos.org/Manuals/ReleaseNotes/CentOS7.2003. "Try another kernel/module version"

State change failed: (-14) Need a verify algorithm to start online verify

Means no verify-alg was defined, so no online checking

drbdadm dump-md foo: Found meta data is "unclean", please apply-al first

Try

drbdadm apply-al foo

( AL means "activity log", btw )

 

log messages

sock_sendmsg time expired, ko=6

latency problem?