ProcessWire 3.0.175 adds new database scalability options

This week ProcessWire gained the ability to maintain separate read-only and read-write database connections to optimize scalability, cost and performance. The post covers why this can be so valuable and how to configure it in ProcessWire.

Writing is expensive, reading is cheap
SUVs on the highway
Reaching a hard limit versus not having one
Express passenger service
RDS read replicas and Amazon Aurora
Case study
How to use it in ProcessWire?
Read-only and read-write connections at the same time
Excluding read-only from the admin
Excluding read-only from other cases
Read-only and DB-driven sessions
An example of results

Writing is expensive, reading is cheap

In the world of database scalability, writing is expensive, but reading is cheap. One user reading something from a database need not affect anything for other users reading from the same database. But writing to that database… now that's an entirely different matter… one user writing to the database affects what will be read by all other users reading from the database (not to mention also the users writing to the database).

Why is writing to a database expensive? In order to avoid write collisions and destruction of data, only one user can write to a row at a time. So writing involves temporarily locking a row (or sometimes an entire table) from other users for writing. In some cases, the entire table may be locked for reading too. Depending on what's locked, the other users will just have to wait.

The exact details depend on the database engine, but the underlying factor is that writing remains expensive while reading is relatively cheap. I'm not just talking about money, but about resources, overhead, availability and scalability. They all tie back to each other.

As scale grows, so too does that expense. You can pay for more and more resources to power that database. But for every investment you make in doing that, you are paying to increase the size of not just the DB reader, but the DB writer—in equal amounts—because usually they are one and the same. Yet for nearly all websites, it's only the DB reader that you need the scalable resources for.

For most of us, the majority of front-end database activity is just reading from the database. In fact, chances are there is hardly any writing on the front-end, and it might only represent 0 to 1 percent of total front-end database activity.

SUVs on the highway

When our traffic grows and we must expand our database resources, we typically buy the resources kind of like how many buy their vehicles here in the US. Big heavy cars and trucks (aka SUVs or "sport utility vehicles") where the entire design, materials and construction are built for going off-road. But the vehicle will rarely (if ever) actually go off-road. Nor will it ever participate in any "sports" or do "utility" work. It's just a vehicle to get the driver from point A to B, but with a lot of unnecessary size, weight, and the ongoing cost of fuel to make it run. (Okay, a lot of us buy them just to keep our kids safe on the road from all the other big SUVs).

To expand our database resources, we don't really have a choice but to buy the equivalent of an SUV, because that's how most CMSs work. They require a single database connection that can be both read-from and written-to (a read/write connection). That's despite the fact that our front-end rarely needs to write anything. It needs that read/write connection just in case it ever might need to write something. Meanwhile, it's using the expensive writable resources just for reading. It's us paying for the SUV that rarely (or never) travels off road.

Reaching a hard limit versus not having one

At some point, expanding with only a read/write connection will also limit the ability to scale further. That's because there can only be one DB writer at a time… a single source of that data that has authority over it. Otherwise there could be different versions of the data with no apparent authority. You can only grow a single machine or instance so far until you reach a hard limit in hardware, size or cost.

No such limitation exists for a read-only connection. There can be any number of them, whether separate machines or instances. None of them needs to be able to modify the data, they only need to be a copy of the source (or even a copy of another copy). Whether just one instance or fifteen of them, the resources can focus exclusively on reading. And the cost can scale equally to the demand.

Express passenger service

Having the read-only connection option is like an express passenger train service with the ability to add as many railcars as necessary to accommodate the number of passengers. ProcessWire 3.0.175+ lets you choose whether you need the SUV or the express train, or it can decide for you. It adds the ability to support separate writable (read-write) and readable (read-only) database connections. This opens up a new door for scalability and efficiency, especially as needs for resources grow.

In order to use it, you must also have a database platform that supports the ability. My understanding so far is that it is more common on platform and infrastructure service providers, and less common at regular web hosting providers. I don't really know who has it and who doesn't, but wherever you can find it, it's really useful.

RDS read replicas and Amazon Aurora

We are running on AWS here, which has the feature built-in (RDS read-replicas and Amazon Aurora replication), but this feature in ProcessWire has no AWS-specific dependencies. So it should work equally well on any service that supports separate reader and writer database connections. While I'll focus on our AWS setup here, keep in mind it likely translates to something similar at other service providers.

Our database platform is Amazon Aurora’s version of MySQL with read-replicas. Combined with ProcessWire 3.0.175, it can scale with up to 15 read-only nodes, should the traffic demand it.

Any one of the read-only nodes can also take over as the writer (read-write) node, in the unlikely event of an outage with the primary read-write node.

Case study

Individual instances can be of varying sizes in AWS, and the larger the size, the larger the cost. Prior to this version of ProcessWire, we were running on a small database instance and using it as a typical read/write MySQL database. (The instance we are using powers this site as well as others, like Tripsite). Common traffic spikes indicated we would need to increase the size of the database instance, at significant cost. But like most traffic spikes, they were exclusively front-end traffic and thus primarily read-only traffic.

Rather than investing in the substantial costs of a larger database instance, it was a lot more cost effective to isolate the read-write from the read-only database traffic. This enables us to stay on the cost effective [small] database instance and dynamically scale the read-only nodes/instances according to traffic demand. Now, no resources or costs are wasted. Meanwhile, we gain the ability to handle traffic spikes greater than even the largest available database instance. It's a major win-win.

How to use it in ProcessWire?

Now that we've gone over reasons for and benefits of this type of setup, let's take a look at how to enable it in ProcessWire. First, you must be running ProcessWire 3.0.175 or newer, which provides a new $config setting called $config->dbReader. This is an array that lets you override any of the $config->db[xxx] settings for a separate read-only database connection.

In our case, the read-only database connection differs from the read-write connection only in the MySQL server hostname that we connect to. The DB name, user name, password and port remain the same. Our regular (read/write) database hostname is database.processwire.com specified in $config->dbHost. And our read-only hostname is readonly.database.processwire.com. So we configure it like this in our /site/config.php file:

$config->dbReader = [
  'host' => 'readonly.database.processwire.com'
];

This is all that was necessary to maintain a separate read-only connection here. When enabled, ProcessWire will direct all queries it identifies as read-only to that database connection rather than the one defined in your regular $config->dbHost setting. AWS dynamically takes care of routing the read-only hostname to whatever node (up to 15 of them) can best accommodate the traffic.

Should your read-only database connection require it, you can override any other settings as needed. Just drop the "db" prefix from the setting, use lowercase, and specify it in the $config->dbReader array. So "dbHost" becomes just "host". If we needed to override the DB host, name, user, password and port, we could do so like this:

$config->dbReader = [
  'host' => 'readonly.database.processwire.com',
  'name' => 'pwsite2',
  'user' => 'ryan_pwsite2',
  'pass' => 'd4t4baZ3_p4sZw0rd',
  'port' => 3307,
];

As you can see, configuring the database read-only connection is very simple.

Note: all code examples in this post are intended for the file /site/config.php

Read-only and read-write connections at the same time

You might be wondering what happens when ProcessWire serves requests that perform both lots of reads and writes, such as in the admin. Both the read-only and read-write database connections are created on demand, so ProcessWire 3.0.175+ now doesn't initiate a database connection until a $database->prepare() or similar call is made from the API. On a request needing both a reader and a writer, two database connections will be maintained simultaneously and queries directed to the appropriate one.

Excluding read-only from the admin

I don't yet have data to say for certain one way or the other, but in an environment like your admin that's not likely to need to scale for thousands of users, I think there's a good chance that it may be more efficient to maintain just a single read-write DB connection, and not use the read-only connection in that admin context. Though as time goes on, maybe the data will say otherwise. But for now, I'd suggest starting with the read-only connection isolated to your front-end only. You can do that like this:

if(!$config->requestPath('/processwire/')) {
  $config->dbReader = [ 'host' => 'readonly.mydb.domain.com' ];
}

The above says that if the current request path does not have /processwire/ in it, then add the dbReader setting(s). If your admin is at something other than /processwire/ then of course you'd replace it with your admin path.

Excluding read-only from other cases

If you want to take the logic further, maybe you'd like to also turn off the read-only option for some pages on your front-end too. For instance, maybe you have a page at /tools/maintenance/ that does a lot of DB writing when called upon, so you want to use only the read-write DB connection for that URL, in addition to your admin. Or maybe you know that DB writing is likely during POST requests, so you want to limit read-only to GET requests. Here's how you'd accomplish all of these things:

$skipPaths = [ '/processwire/', '/tools/maintenance/' ];
if($config->requestMethod('GET') && !$config->requestPath($skipPaths)) {
  $config->dbReader = [ 'host' => 'readonly.mydb.domain.com' ];
}

Read-only and DB-driven sessions

Let's say that you are using the core SessionHandlerDB module, which makes sessions save in the database rather than on the file system. That's a case where you know you will likely be doing a lot of DB writing for session data. If using a separate reader and writer connection for the database, consider just using file-based sessions to reduce the need for DB writes on the front-end. Or if you want to stick with SessionHandlerDB, that would be fine too. But a potentially useful optimization might be to limit sessions to logged-in users and pages where they might login. Actually, this may be a worthwhile optimization regardless of anything DB connection related.

$config->sessionAllow = function($session) use($config) {
  // if user might be logged in, always allow sessions
  if($session->hasLoginCookie()) return true;
  // always allow session when accessing admin
  if($config->requestPath('/processwire/')) return true;
  // always allow sessions on our LoginRegisterPro page
  if($config->requestPath('/login/')) return true;
  // otherwise, disallow session
  return false;
};

An example of results

When there aren't separate DB reader and writer connections, then all DB traffic goes to the more costly and less scalable DB writer (read/write) connection. But when the reader and writer are separate connections, you have two different traffic patterns with different costs and different abilities to scale. In the graph below, we are looking at quantity of queries. The orange line represents the reader while the blue line represents the writer. This is on a site that does a lot of writing.

Number of queries (orange=reader, blue=writer):

Without the separate reader and writer connections, there would just be the blue line, it would be larger than the orange line, and it would likely indicate we are maxing out the DB resources on a regular basis. So in this case, having the separate reader and writer enables us to accomplish more at lower cost. But CPU usage drives the resources and cost, not query counts, so the above graph doesn't tell the most important part of the story.

Look at the exact same data showing CPU usage below. As you can see, the writer (blue line) is pretty much idle and using very little resources, while most of the CPU usage can now be targeted to the low cost and scalable reader (orange line).

CPU usage (orange=reader, blue=writer):

Before wrapping up, I'd like to give major credit to Jan at Tripsite. He's the one that came up with the idea to implement this in ProcessWire, and he also helped sponsor the cost of building it. Everything I know about this topic I learned from him. Jan runs the servers and web services for a lot of big companies through his other company Perago Solutions, and he's an absolute expert with this stuff. He's also the one that manages the hosting of processwire.com and keeps it running smoothly on the AWS load balancers. Both Tripsite.com and ProcessWire.com are now using separate read-only and read-write DB connections as well, so far with great results.

This blog post was delivered to you on a read-only database connection.

Thanks for reading and have a great weekend! Visit ProcessWire Weekly for the latest ProcessWire-related news, updates and the always great “site of the week”. The newest issue usually comes out every Saturday. See you there!

Comments

HMCB
- 4 years ago
- 6 2
★★★★★
I can’t believe I’m the first one commenting on this. . This is such a huge feature Ryan. Amazing as always.
Now get some company to sponsor AWS at PW’s core and we’ll all be happy campers. Thank you and enjoy a well-deserved weekend.
Reply
Daniel
- 4 years ago
- 3 3
This is a really great feature and it really helps establish PW in the big leagues.
Reply

PrevProcessWire 3.0.173 core updates: New URL hooks

5 March 2021 6

ProcessWire 3.0.173 adds several new requested features and this post focuses on one of my favorites: the ability to hook into and handle ProcessWire URLs, independent of pages. More

NextProcessWire 3.0.177 core updates

30 April 2021 4

This week we focus in on a new and unique Inputfield module added to the core that enables a lot of useful new input capabilities for tags, sortable multiple selection and custom user input. More