
GlusterFS Linked List Topology

Lately, a few coworkers and I decided to put our workstations into a GlusterFS cluster to test distributed replication. Our workstations already run ZFS on Linux, so on each workstation we built two datasets and made them the bricks for GlusterFS. We created a nested "brick" directory to prevent GlusterFS from writing data into the root ZFS mountpoint if the dataset is not mounted. Here is the setup on each workstation:

# zfs create -o sync=disabled pool/vol1
# zfs create -o sync=disabled pool/vol2
# mkdir /pool/vol1/brick /pool/vol2/brick

Notice that I've disabled synchronous writes. GlusterFS is already synchronous by default. Because ZFS resides underneath the GlusterFS client mount, and GlusterFS handles the synchronous semantics with the application, there is no need to increase write latency with synchronous writes on ZFS as well.
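
If you want to confirm the setting took, the property can be read back. Each command below should print "disabled", with paths matching the datasets created above:

```shell
# Read back the sync property on both datasets; -H strips headers,
# -o value prints only the property value itself.
zfs get -H -o value sync pool/vol1
zfs get -H -o value sync pool/vol2
```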

Now the question is how to choose the right topology for our storage cluster. I want to maintain two copies of the data in a distributed manner, meaning the local peer has a copy of the data and a remote peer also has a copy: thus, distributed replication. But how do you decide where the copies are distributed? I looked at two different topologies before making my decision, which I'll discuss here.

Paired Server Topology

Paired server GlusterFS topology

In this topology, servers are completely paired together, so you always know where both copies of your data reside. You can think of it as a mirrored setup: the bricks on serverA hold identical data to the bricks on serverB. This simplifies administration and troubleshooting a great deal, and it's easy to set up. Suppose we wish to create a volume named "testing", and assume we've already peered with all the necessary nodes; we would proceed as follows:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol1/brick \
serverA:/pool/vol2/brick serverB:/pool/vol2/brick
# gluster volume info testing
 
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick

If we wish to add more storage to the volume, the commands are pretty straightforward:

# gluster volume add-brick testing \
serverC:/pool/vol1/brick serverD:/pool/vol1/brick \
serverC:/pool/vol2/brick serverD:/pool/vol2/brick
# gluster volume info testing
 
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol1/brick
Brick7: serverC:/pool/vol2/brick
Brick8: serverD:/pool/vol2/brick

The obvious drawback to this setup is that servers must be added in pairs; you cannot have an odd number of servers in this topology. However, as shown in both the image and the commands, it is fairly straightforward from an administration perspective, and from a storage perspective.
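
The pairing rule can be sketched as a short POSIX sh script that builds the brick list mechanically and refuses an odd server count. The server names and brick paths are the placeholders used in this post:

```shell
#!/bin/sh
# Sketch: build the brick argument list for the paired topology.
servers="serverA serverB serverC serverD"
set -- $servers
if [ $(( $# % 2 )) -ne 0 ]; then
    echo "paired topology requires an even number of servers" >&2
    exit 1
fi
bricks=""
while [ $# -gt 0 ]; do
    a=$1; b=$2; shift 2
    # each pair mirrors both of its bricks: vol1 with vol1, vol2 with vol2
    bricks="$bricks $a:/pool/vol1/brick $b:/pool/vol1/brick $a:/pool/vol2/brick $b:/pool/vol2/brick"
done
echo "gluster volume create testing replica 2$bricks"
```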

Linked List Topology

Linked list GlusterFS topology

In computer science, a "linked list" is a sequential data structure in which the tail of one node points to the head of the next. In our topology, the "head" is a server's first brick, and the "tail" is its second brick. Chaining the servers this way creates a circular storage setup, as shown in the image above.

To set something like this up with, say, three peers, you would do the following:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol2/brick \
serverB:/pool/vol1/brick serverC:/pool/vol2/brick \
serverC:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverA:/pool/vol2/brick
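
The wrap-around ordering above can be generated mechanically. Here is a minimal POSIX sh sketch (server names are the placeholders from this post) that pairs each server's vol1 brick with the next server's vol2 brick, wrapping the last server back to the first:

```shell
#!/bin/sh
# Sketch: build the brick argument list for a linked-list topology.
servers="serverA serverB serverC"
first=""
prev=""
bricks=""
for s in $servers; do
    if [ -z "$first" ]; then
        first=$s
    fi
    if [ -n "$prev" ]; then
        # the previous server's head (vol1) replicates to this server's tail (vol2)
        bricks="$bricks $prev:/pool/vol1/brick $s:/pool/vol2/brick"
    fi
    prev=$s
done
# close the circle: the last server's head replicates to the first server's tail
bricks="$bricks $prev:/pool/vol1/brick $first:/pool/vol2/brick"
echo "gluster volume create testing replica 2$bricks"
```

The output matches the create command shown earlier, and extending the `servers` list handles any number of peers, odd or even.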

Now, if you want to add a new server to the cluster, you can, and you can add servers individually, unlike with the paired server topology above. The trick is that it requires replacing bricks as well as adding bricks, and the procedure is not entirely intuitive. If I want to add "serverD" with its two bricks to the setup, I first need to recognize that "serverA:/pool/vol2/brick" will be replaced with "serverD:/pool/vol2/brick". That leaves two bricks available to add to the volume: "serverD:/pool/vol1/brick" and "serverA:/pool/vol2/brick". Armed with that information, and assuming "serverD" has already peered with the others, let's proceed:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick start

I can keep an eye on the brick replacement by running the same command with "status" in place of "start". When the replacement is ready, I need to commit it:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick commit

Now we have two bricks to add to the volume. However, "serverA:/pool/vol2/brick" was previously part of the volume, so it still carries metadata that is no longer relevant now that the new server is in place. We must clear that metadata off the brick so it starts from a clean slate; then we can add it without problems. Here are the next steps:

(serverA)# setfattr -x trusted.glusterfs.volume-id /pool/vol2/brick
(serverA)# setfattr -x trusted.gfid /pool/vol2/brick
(serverA)# rm -rf /pool/vol2/brick/.glusterfs/
(serverA)# service glusterfs-server restart

We are now ready to add the bricks cleanly:

# gluster volume add-brick testing \
serverD:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol2/brick
Brick7: serverD:/pool/vol1/brick
Brick8: serverA:/pool/vol2/brick

It should be obvious that this is a more complicated setup. It's more abstract from a topological perspective, and it is more difficult to implement and get right from an operational perspective. There is certainly a strong argument for simplifying storage architectures and administration. However, the linked list topology has the advantage of adding and removing one server at a time, unlike the paired server setup. If that is something you need, or you have an odd number of servers in your cluster, the linked list topology will work well.

For our workstation cluster at the office, we went with the linked list topology, because it more closely mimics our production needs. There may also be other topologies we haven't explored. We also added geo-replication by replicating our volume to a larger storage node in the server room. This protects the data should two servers in our cluster go down at once.
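
As a rough sketch of that last step, and with the caveat that geo-replication syntax varies between GlusterFS releases, the commands look something like the following. The slave host "storagenode" and volume "testing-slave" are hypothetical placeholders, not our real names:

```shell
# Hypothetical slave: host "storagenode" exporting a volume "testing-slave".
# Start geo-replication of the "testing" volume, then check on its progress.
gluster volume geo-replication testing storagenode::testing-slave start
gluster volume geo-replication testing storagenode::testing-slave status
```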

{ 3 } Comments

  1. HippieJoe | November 5, 2013 at 12:56 pm

    Great detail, thank you!

    However, I am having an issue with the linked list approach. I have five servers in a glusterfs (3.4.1) cluster, each with 1 brick. I require replication, and due to my odd number of servers, I believe the linked list approach is what I need.

    Based on your example, I came up with the below command. All one line when I run it, but broken out here to better show the links:

    gluster volume create gfsvol01 replica 2 gfsnode01:/exports/brick1 gfsnode02:/exports/brick1 gfsnode02:/exports/brick1 gfsnode03:/exports/brick1 gfsnode03:/exports/brick1 gfsnode04:/exports/brick1 gfsnode04:/exports/brick1 gfsnode05:/exports/brick1 gfsnode05:/exports/brick1 gfsnode01:/exports/brick1

    My issue is that when I run this, I get the error:

    "Found duplicate exports gfsnode02:/exports/brick1"

    Any help is appreciated, and thank you again!

  2. HippieJoe | November 5, 2013 at 1:08 pm

    Following up; I put a backslash after each brick pair except the last. However, now I get an error that the number of bricks is not a multiple of replica count. I read that this was a requirement in the documentation, but thought the linked approach would get around the issue. Ideas are always welcome. Thank you.

  3. HippieJoe | November 5, 2013 at 1:22 pm

    I looked at it a million times, but still missed the vol1 vol2 in the commands. You have two bricks per server. Feel free to delete my posts, and thank you again.
