Lately, a few coworkers and I decided to put our workstations into a GlusterFS cluster to test distributed replication. Our workstations already run ZFS on Linux, so we built two datasets on each workstation and made them the bricks for GlusterFS. We created a nested "brick" directory to prevent GlusterFS from writing data into the root ZFS mountpoint if the dataset happens to be unmounted. Here is our setup on each of our workstations:
# zfs create -o sync=disabled pool/vol1
# zfs create -o sync=disabled pool/vol2
# mkdir /pool/vol1/brick /pool/vol2/brick
Notice that I've disabled synchronous writes. GlusterFS is already synchronous by default, and it handles synchronous semantics with the application above the client mount. Since ZFS sits underneath that mount, there is no need to add write latency by forcing synchronous writes in ZFS as well.
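If you want to double-check that the property took, a quick query against both datasets should show "disabled" with a local source:

# zfs get sync pool/vol1 pool/vol2
NAME       PROPERTY  VALUE     SOURCE
pool/vol1  sync      disabled  local
pool/vol2  sync      disabled  local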
Now the question comes as to how to set the right topology for our storage cluster. I wish to maintain two copies of the data in a distributed manner. Meaning that the local peer has a copy of the data, and a remote peer also has a copy. Thus, distributed replication. But, how do you decide where the copies get distributed? I looked at two different topologies before making my decision, which I'll discuss here.
Paired Server Topology

In this topology, servers are completely paired together. This means you always know where both copies of your data reside. You could think of it as a mirrored setup: the bricks on serverA hold identical data to the bricks on serverB. This simplifies administration and troubleshooting a great deal, and it's easy to set up. Suppose we wish to create a volume named "testing"; assuming we've peered with all the necessary nodes, we would proceed as follows:
# gluster volume create testing replica 2 \
    serverA:/pool/vol1/brick serverB:/pool/vol1/brick \
    serverA:/pool/vol2/brick serverB:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick
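The "Status: Started" line above implies the volume has already been started. If you're following along, something like the following will start the volume and mount it with the native client; the mount point here is just an example:

# gluster volume start testing
# mkdir -p /mnt/testing
# mount -t glusterfs serverA:/testing /mnt/testing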
If we wish to add more storage to the volume, the commands are straightforward:
# gluster volume add-brick testing \
    serverC:/pool/vol1/brick serverD:/pool/vol1/brick \
    serverC:/pool/vol2/brick serverD:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol1/brick
Brick7: serverC:/pool/vol2/brick
Brick8: serverD:/pool/vol2/brick
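One thing to keep in mind: adding bricks doesn't automatically spread existing data onto them. Kicking off a rebalance (and checking on its progress) takes care of that:

# gluster volume rebalance testing start
# gluster volume rebalance testing status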
The drawback to this setup, as should be obvious, is that servers must be added in pairs. You cannot have an odd number of servers in this topology. However, as shown in both the image and the commands, it is straightforward from an administration perspective and from a storage perspective.
Linked List Topology

In computer science, a "linked list" is a data structure sequence, where the tail of one node points to the head of another. In the case of our topology, the "head" is the first brick, and the "tail" is the second brick. As a result, this creates a circular storage setup, as shown in the image above.
To set something like this up with, say, three peers, you would do the following:
# gluster volume create testing replica 2 \
    serverA:/pool/vol1/brick serverB:/pool/vol2/brick \
    serverB:/pool/vol1/brick serverC:/pool/vol2/brick \
    serverC:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverA:/pool/vol2/brick
Now, if you want to add a new server to the cluster, you can, and you can add servers individually, unlike the paired server topology above. The trick is that you must replace a brick as well as add bricks, and it's not entirely intuitive how to proceed. If I wanted to add "serverD" with its two bricks to the setup, I would first need to recognize that "serverA:/pool/vol2/brick" is going to be replaced with "serverD:/pool/vol2/brick". That leaves two bricks available to add to the volume, namely "serverD:/pool/vol1/brick" and "serverA:/pool/vol2/brick". Armed with that information, and assuming that "serverD" has already peered with the others, let's proceed:
# gluster volume replace-brick testing \
    serverA:/pool/vol2/brick serverD:/pool/vol2/brick start
I can run "gluster volume replace-brick testing status" to keep an eye on the brick replacement. When ready, I need to commit it:
# gluster volume replace-brick testing \
    serverA:/pool/vol2/brick serverD:/pool/vol2/brick commit
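Before moving on, it's worth confirming that the new brick has a complete copy of the data. Assuming GlusterFS 3.3 or newer, the self-heal info command will list any entries still pending:

# gluster volume heal testing info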
Now we have two bricks to add to the cluster. However, the "serverA:/pool/vol2/brick" brick was previously part of the volume, so it still carries extended-attribute metadata that is no longer relevant when re-adding it. We must clear that metadata off the brick so it starts from a clean slate; then we can add it without problems. Here are the steps we need to do next:
(serverA)# setfattr -x trusted.glusterfs.volume-id /pool/vol2/brick
(serverA)# setfattr -x trusted.gfid /pool/vol2/brick
(serverA)# rm -rf /pool/vol2/brick/.glusterfs/
(serverA)# service glusterfs-server restart
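To verify the brick really is clean before re-adding it, you can dump whatever extended attributes remain; there should be no "trusted.glusterfs.volume-id" or "trusted.gfid" attributes left:

(serverA)# getfattr -d -m . -e hex /pool/vol2/brick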
We are now ready to add the bricks cleanly:
# gluster volume add-brick testing \
    serverD:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol2/brick
Brick7: serverD:/pool/vol1/brick
Brick8: serverA:/pool/vol2/brick
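A quick check that all eight bricks are online, followed by the same rebalance shown earlier, wraps things up:

# gluster volume status testing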
It should be obvious that this is a more complicated setup. It's more abstract from a topological perspective, and it is more difficult to implement and get right in practice. There is certainly a strong argument for simplifying storage architectures and administration. However, the linked list topology has the advantage of adding and removing one server at a time, unlike the paired server setup. If that is something you need, or you have an odd number of servers in your cluster, the linked list topology will work well.
For our workstation cluster at the office, we went with the linked list topology, because it mimics our production needs more closely. There may also be other topologies that we haven't explored. We also added geo-replication by replicating our volume to a larger storage node in the server room. That gives us an additional off-cluster copy, so our data survives even if both servers in a replica pair go down.
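For reference, a geo-replication session under older GlusterFS 3.x syntax looks roughly like the following; the "storage1" hostname and target path are made up for illustration, and newer releases also require a "create push-pem" step before starting:

# gluster volume geo-replication testing storage1:/backup/testing start
# gluster volume geo-replication testing storage1:/backup/testing status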