Horizontal Sharding in Rails: Ruby on Rails Highlights

Senior Ruby Developer Prepsa Kayastha shared insights on scaling Rails applications through horizontal sharding, drawing from real-world challenges and solutions in modern Ruby on Rails architectures.

As Rails applications grow, the database often becomes the main bottleneck: increasing users, higher request volume, and larger datasets lead to bigger tables and slower queries, until a single database can no longer handle the load. This episode of the Ruby on Rails Meetup focuses on scaling the database layer efficiently by using horizontal sharding to distribute data across multiple databases.

Timestamps

00:00:00 — Introduction of the Speaker

00:00:29 — Background of Horizontal Sharding

00:02:26 — Introduction of Sharding

00:03:44 — Brief Explanation of Horizontal Sharding

00:04:24 — When to use Horizontal Sharding

00:05:26 — Implementing Horizontal Sharding

00:07:20 — Scenario Recap

00:09:25 — Challenges and Things to Consider while Implementing Horizontal Sharding

00:14:09 — References

Transcript

00:00:00 — Introduction

Good evening, everyone. I am Prepsa Kayastha, and today I will be talking about our journey with horizontal sharding. In today’s session, I will go through the background of what led us to choose horizontal sharding, a brief introduction to what it is and when to use it, a brief demo, and the challenges we faced during implementation.

00:00:29 — Background of Horizontal Sharding

So, let’s consider a scenario. We want two separate environments, staging and production, which are both independent. We want a staging environment for previewing and testing our app. Each environment has its own backend, frontend, database, everything, so they are fully independent. But we also want to use the staging environment as a sandbox.

00:00:56 — I am only using staging and production as names for convenience. In the staging environment, the admin of the app should be able to create new records, edit them, and delete them, and then decide whether they are happy with what they see there.

00:01:18 — If the frontend looks good, I mean the data looks good, they would want the ability to sync that record or database to the production environment and push it live. So, they would want the ability to sync individual records, to sync a whole model at once, and also to back-sync records. They would want to pick and choose which content they sync.

00:01:51 — So, that’s the scenario. We considered a lot of alternatives for this scenario. There could be a preview feature in the production environment itself, without creating a separate environment. There could be a full data dump. But those alternatives did not actually fulfill the requirements of this scenario.

00:02:26 — Introduction of Sharding

So, that’s where sharding comes in. Sharding is a strategy for partitioning large data sets into smaller chunks, known as shards, to make the data more manageable and easier to access.

00:02:39 — In sharding, we can partition the data vertically or horizontally. Vertical sharding means splitting the data into two or more shards on the basis of columns. In the original table, we have the customers’ data: first name, last name, and city. In the example, we have created two shards: the first shard contains the first name and last name of the customer, whereas the second shard contains the city.

00:03:17 — So, that’s sharding vertically. In horizontal sharding, the data is split by rows. We only have four rows here, but if you have a large data set, we can split it the same way. Both shards have the same attributes, but the data is partitioned by row.
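The difference between the two kinds of split can be sketched in plain Ruby. This is only an illustration: the customer values, the added `id` column, and the modulo rule used to pick a shard are assumptions made for the example, not details from the talk.

```ruby
# Illustrative customer rows; the values and the `id` column are made up.
CUSTOMERS = [
  { id: 1, first_name: "Asha",   last_name: "Rai",     city: "Kathmandu" },
  { id: 2, first_name: "Bikram", last_name: "Thapa",   city: "Pokhara" },
  { id: 3, first_name: "Chandra", last_name: "Gurung", city: "Lalitpur" },
  { id: 4, first_name: "Dipa",   last_name: "Shrestha", city: "Bhaktapur" }
].freeze

# Vertical split: each shard keeps every row but only some of the columns.
def vertical_split(rows)
  names  = rows.map { |row| row.slice(:id, :first_name, :last_name) }
  cities = rows.map { |row| row.slice(:id, :city) }
  [names, cities]
end

# Horizontal split: each shard keeps every column but only some of the rows.
# Here the shard is chosen by id modulo shard_count (an assumed rule).
def horizontal_split(rows, shard_count)
  rows.group_by { |row| row[:id] % shard_count }.values
end

names, cities = vertical_split(CUSTOMERS)
row_shards    = horizontal_split(CUSTOMERS, 2)
```

In the vertical case the two shards have different schemas and share only the key; in the horizontal case every shard has the same schema and holds a subset of the rows.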

00:03:44 — Brief Explanation of Horizontal Sharding

So, in horizontal sharding, as I have already said, we split the data by row, and all of the attributes and the schema of the database stay the same. For example, if we were to create an e-commerce website like Amazon, we would have hundreds and thousands of products, and we could use horizontal sharding to split the data on the basis of category. One category could be clothing, another could be electronics, books, and so on.
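As a rough sketch of category-based splitting, the shard for a product can be looked up from its category. The shard names, the fallback shard, and the sample products below are hypothetical and exist only for this example.

```ruby
# Hypothetical mapping from product category to shard name.
CATEGORY_SHARDS = {
  "clothing"    => :shard_clothing,
  "electronics" => :shard_electronics,
  "books"       => :shard_books
}.freeze

# Returns the shard for a category, falling back to a catch-all shard
# for categories that have no dedicated shard (an assumed policy).
def shard_for(category)
  CATEGORY_SHARDS.fetch(category.downcase) { :shard_default }
end

products = [
  { name: "Denim Jacket", category: "Clothing" },
  { name: "Headphones",   category: "Electronics" },
  { name: "Ruby Basics",  category: "Books" }
]

# Every product keeps the same attributes; only its destination differs.
products_by_shard = products.group_by { |product| shard_for(product[:category]) }
```

The category plays the role of the shard key here. The sync scenario in this talk uses the environment as the key instead.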

00:04:15 — In our scenario, we are trying to split the database on the basis of environment. I have listed some of the cases where we can use horizontal sharding. One is when the data gets too big and it becomes more cost-effective to shard horizontally than to scale the database vertically.

00:04:42 — Another is when our website is getting high traffic and we want to distribute the load evenly across our databases. Horizontal sharding helps us do that, and it also helps with scalability.

00:05:01 — With horizontal sharding, we just add more servers as we need them, so it’s basically infinite scaling, right? And finally, there are custom requirements such as ours. We can use horizontal sharding there as well.

00:05:26 — Implementing Horizontal Sharding

So, how do we implement horizontal sharding? First, we have to configure the database YAML file (database.yml). In our normal database YAML file, we only have one database, but for horizontal sharding, we can specify multiple. I have specified a primary and a secondary database here for the development environment. The next step is to connect the shards to the model. We can do that from ApplicationRecord by using the connects_to method.

00:05:53 — So, we are specifying two shards here: the default shard and the secondary shard. The default shard uses the primary database and the secondary shard uses the secondary database. We are doing it in ApplicationRecord because it will be inherited by all of the models. We can do that, or we can create a new class. Here, there is a ShardRecord class, and we can inherit from it in the specific models that we want to shard.
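The slides are not part of this transcript, so the following is only a sketch of what such a configuration typically looks like in a Rails 6.1 or later application. It assumes that config/database.yml defines entries named `primary` and `secondary` under the current environment, as described above; it is a fragment that only runs inside a Rails app.

```ruby
# app/models/application_record.rb (fragment, not a standalone script)
# Assumes database.yml has `primary` and `secondary` entries for this environment.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Each shard maps connection roles to a database.yml entry.
  # Both roles point at the same database because no replicas are assumed.
  connects_to shards: {
    default:   { writing: :primary,   reading: :primary },
    secondary: { writing: :secondary, reading: :secondary }
  }
end
```

To shard only some models, the same `connects_to` call could live in a separate abstract class (the ShardRecord mentioned in the talk) that those models inherit from.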

00:06:20 — The next step after connecting the shards to the models is to switch shards whenever we need to. To do that, we can use the connected_to method and specify the role and the shard that we want to connect to.

00:06:36 — I am trying to create a product named default product one, and I am creating it in the default shard. So, the output would be a new product in the default shard. However, suppose we try to search for default product one in the secondary shard, like I am doing here.

00:07:01 — It won’t be able to find any product by that name, so it will return nil. If we search for it in the default shard, though, we will find the default product in that shard.
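The create-and-search steps just described might look roughly like this in a Rails console. This is a sketch rather than the code from the slides: it assumes a Product model with a `name` column, the two shards named `default` and `secondary`, and an illustrative product name.

```ruby
# Fragment for a Rails console or service object (needs a Rails 6.1+ app).
# Product and its `name` column are assumed for this sketch.

# Create the product in the default shard.
ActiveRecord::Base.connected_to(role: :writing, shard: :default) do
  Product.create!(name: "Default Product 1")
end

# Search the secondary shard: the row was never written there.
ActiveRecord::Base.connected_to(role: :reading, shard: :secondary) do
  Product.find_by(name: "Default Product 1") # nil if no such row exists here
end

# Search the default shard: the row created above is found.
ActiveRecord::Base.connected_to(role: :reading, shard: :default) do
  Product.find_by(name: "Default Product 1")
end
```

The shard switch applies only inside the block, so code outside it keeps using whichever shard was active before.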

00:07:20 — Scenario Recap

So, to recap the scenario again before the demo, we wanted two environments, one for previewing and testing, and one for live. In the staging environment, we will be creating, updating, and deleting products, and we want to sync all of those changes to the production environment. So, I will switch over. I have two environments here: one for the primary shard, and another for the secondary shard.

00:08:01 — Let me destroy some of the products and check the secondary shard before we begin. So, the secondary shard has only one product, an oscillating stand fan, and the primary shard has four products, right? Let’s say we want to sync only this product, the Samsung S25. We can go to view and then sync this product, and okay, it has synced successfully.

00:08:41 — Let’s check. So, it has already synced, right? To do this, I have not done anything fancy, just what I mentioned in the previous slides: connected to the shard and then created a new product from there. If I want to sync all of these products at once, everything in this product table, I can just do this.

00:09:12 — Let’s see. So, these are all synced now. That’s how horizontal sharding can be used for scenarios like this as well.

00:09:25 — Challenges and Things to Consider while Implementing Horizontal Sharding

So, while implementing horizontal sharding, these were some of the challenges and things we had to consider. Let’s go through them. The first one was avoiding duplication while syncing. How can we do that? We have the same structure and the same schema in both environments, in both shards.

00:09:50 — It’s easy for new products to be synced because there is no existing product with the same name, description, or price, right? So, we can sync them directly and there will be no problem. But if we try to sync the first record here, ABC, and it has been edited to ABD, then we will have trouble finding the exact corresponding product in the secondary shard, right?

00:10:23 — So, how can we do that? One of the ways is to use a unique ID. For the unique ID, we can use a UUID, or the ID of the record in the primary shard. Here, in this example, I have used the primary ID of the first record.

00:10:44 — So, even if other details are changed in the primary shard, we can easily find the corresponding record in the secondary shard and update it. 
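The matching logic can be sketched in plain Ruby, with hashes standing in for the two shards. The `source_id` field name and the `sync_record` helper are hypothetical; in a real app the same idea would be expressed with model lookups inside the appropriate shard.

```ruby
# In-memory stand-ins for the two shards (hypothetical; a real app would use
# models). Each hash maps a shard-local id to a record.
primary   = { 1 => { name: "ABC", price: 100 } }
secondary = {}

# Copies one record from primary to secondary. The target row is matched on
# `source_id` (the record's id in the primary shard), never on name or price,
# so edits in the primary shard cannot produce a duplicate.
def sync_record(primary, secondary, primary_id)
  attrs = primary.fetch(primary_id)
  local_id, _existing = secondary.find { |_id, rec| rec[:source_id] == primary_id }
  local_id ||= (secondary.keys.max || 0) + 1 # no match yet: allocate a new row
  secondary[local_id] = attrs.merge(source_id: primary_id)
  local_id
end

sync_record(primary, secondary, 1) # first sync inserts a row
primary[1][:name] = "ABD"          # the record is edited in the primary shard
sync_record(primary, secondary, 1) # second sync updates that same row
```

Matching on the name instead would have treated ABD as a brand-new product and left the stale ABC row behind.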

00:11:01 — The next thing we had to consider was handling callbacks. Suppose the model that you are trying to sync has a callback. For example, you have a User model with a “send welcome email” callback, and you are trying to sync a user to the production environment. If we sync new users to the production environment, it will send a welcome email to that user, but the user was already created in the staging environment, and we have already sent the email.

00:11:34 — So, sending the email again would be an unnecessary notification to the user. To avoid this, we can skip the callback when we are syncing. We can use the skip_callback method to skip the send welcome email callback, and then reset the callback after we have saved the user in the primary database.

00:12:02 — We can do it like this, but there is a caveat here. This is not thread safe: if we are trying to create multiple users at the same time, this may cause some problems. So, a better, thread-safe version would be to use a condition on the callback. Here, we are using skip sending welcome email.

00:12:37 — This is an attribute, and we can pass it while creating the user in the production environment. We can set it to true, and then it will skip the send welcome email callback. So, one of the ways of handling callbacks is using attributes.
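A conditional callback of this kind might be written as follows. The exact names on the slides are not in the transcript, so `skip_welcome_email`, `send_welcome_email`, and `WelcomeMailer` are assumptions for this sketch, and the fragment only runs inside a Rails app.

```ruby
# app/models/user.rb (fragment, not a standalone script)
# skip_welcome_email and WelcomeMailer are hypothetical names.
class User < ApplicationRecord
  # Virtual attribute: it lives on the instance and is not a database column.
  attr_accessor :skip_welcome_email

  # The callback runs unless the flag is set on this particular record.
  after_create :send_welcome_email, unless: :skip_welcome_email

  private

  def send_welcome_email
    WelcomeMailer.with(user: self).welcome.deliver_later
  end
end

# During a sync, the flag is passed along with the other attributes.
User.create!(email: "person@example.com", skip_welcome_email: true)
```

Because the flag belongs to a single instance, it does not affect users being created in other threads. The skip_callback approach, by contrast, changes the callback chain on the User class itself, which every thread shares until the callback is set again.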

00:13:00 — Next is deleting records, right? We need to track whether any record is deleted in the staging environment. We then need to be able to find the corresponding record in the production environment and delete it there as well.

00:13:17 — So, how do we do that? One option is to use soft delete to track the deletions, and then use a background job to asynchronously delete those records later in the production environment. The other option is to do a simultaneous delete.

00:13:40 — Whenever a record is deleted in the staging environment, we can use a callback and then delete the corresponding record in the production environment simultaneously.
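The soft-delete option can be sketched in plain Ruby, again with hashes standing in for the staging and production tables. The `deleted_at` and `source_id` field names and both helper methods are hypothetical.

```ruby
# Marks a staging record as deleted without removing it, so the deletion
# can still be seen by a later sync.
def soft_delete(staging, id, now: Time.now)
  staging.fetch(id)[:deleted_at] = now
end

# What a background job could do: remove from production every row whose
# source record has been soft-deleted in staging. Returns the source ids purged.
def purge_deleted(staging, production)
  deleted_ids = staging.select { |_id, rec| rec[:deleted_at] }.keys
  production.delete_if { |_id, rec| deleted_ids.include?(rec[:source_id]) }
  deleted_ids
end

staging    = { 1 => { name: "Stand Fan" }, 2 => { name: "Phone" } }
production = {
  10 => { name: "Stand Fan", source_id: 1 },
  11 => { name: "Phone",     source_id: 2 }
}

soft_delete(staging, 1)
purge_deleted(staging, production) # production now holds only the Phone row
```

The simultaneous option would skip the marker and remove the matching production row from a destroy callback, at the cost of doing cross-shard work inside the delete request.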

00:13:55 — So, these are some of the things that we had to consider while implementing the sync feature in our scenario.

00:14:09 — References

These are the references that I used to build and compile this presentation.