Jump to content

Weekly update – 26 March 2021


ryan
 Share

Recommended Posts

This week I've been working on something a little different for the core. Specifically, ProcessWire's database class (WireDatabasePDO) has been rewritten to support separate read-only and read-write database connections. Jan, who administers the processwire.com server, asked if I could implement it. I looked into it and thought it had significant benefits for ProcessWire users, so it worked out that now was a good time to implement it. The rewritten database class is actually complete and now running on a live test installation, but I have not yet committed it to the core dev branch because I want to fully document it first. This will happen next week in a blog post. By then I'll have processwire.com using it too. But I can already say that it's a cool thing watching the graphs in AWS show the difference it makes when we start hitting the site with crawlers. 

You might be wondering what the benefits are in having separate read-only and read-write database connections. I'll get into some of the details next week. But essentially, read-only database connections can scale in a way (and at much lower cost) than read-write connections can. And when using a service like Amazon Aurora, they can scale on-the-fly automatically according to traffic and demand for resources. Not only does it open up the ability for a ProcessWire-powered site to scale much further than before, but it has potential to reduce the costs of doing so by perhaps 50% or more. If you are interested in reading more, we are currently testing the features using Amazon Aurora and RDS Read Replicas (see also Replication with Aurora). However, the ProcessWire core support for this feature is not bound to anything AWS specific and will work with any platform supporting a similar ability. Thanks for reading, I'll have more on this next week, and also have it ready to use should you want to try it out yourself. 

  • Like 20
  • Thanks 3
Link to comment
Share on other sites

This is a great feature for larger sites. Really excited about this. Speaking of scaling: I would imagine a limiting factor right now would be, that the uploaded assets are bound to the local drive. Is that a problem for your setup / would proccesswire.com benefit if it was possible to store files on S3?

And is there a post somewhere explaining your setup on AWS?

  • Like 5
Link to comment
Share on other sites

2 hours ago, MrSnoozles said:

Speaking of scaling: I would imagine a limiting factor right now would be, that the uploaded assets are bound to the local drive. Is that a problem for your setup / would proccesswire.com benefit if it was possible to store files on S3?

Also interested in this.

My impression is that offloading assets to Amazon S3 / Google Cloud Storage etc. is a relatively common need, and there's currently no bulletproof solution. Some third party modules have attempted to, but I'm not sure if any of them really solve all the related needs (image management etc.) — and while remote filesystems such as EFS are easier to work with, they also come with certain downsides (such as a higher price point).

I seem to recall that Ryan mentioned something about this a while ago too ?

  • Like 5
Link to comment
Share on other sites

@MrSnoozles @teppo

This is not a limiting factor in scalability at least. First off, at least here, the file-based assets are delivered by Cloudfront CDN, so they aren't part of the website traffic in the first place (other than to feed the CDN). If you wanted scalability then you'd likely want a CDN serving your assets whether using S3 or not. 

But a CDN isn't a necessary part of the equation in our setup either. File systems can be replicated just like databases. That's how this site runs on a load balancer on any number of nodes. Requests that upload files are routed to node A (primary), but all other requests can hit any node that the load balancer decides to route it to. The other nodes are exact copies of the node A file system that update in real time. This is very similar to what's happening with the DB reader (read-only) and writer (read-write) connection I posted about above, where the writer would be node A and there can be any number of readers. Something like S3 doesn't enhance the scalability of any of this.

Implementing S3 as a storage option for PW is still part of the plan here, but more for the convenience and usefulness than scalability. You can already use S3 as a storage option in PW if you use one of the methods of mapping directories in your file system to it. But I'm looking to support for it in PW more natively than that. It is admittedly more complex than the DB stuff mentioned above. For instance, we use PHP's PDO class that all DB requests go through, so intercepting and routing them is relatively simple. Whereas in PHP, there is no PDO-type class that the file system API is built around, and instead it is dozens of different procedural functions (which is just fine, until you need to change their behavior). In addition, calls to S3 are more expensive than a file system access, so doing something as simple as getting an image dimensions is no longer a matter of a simple php getimagesize() call. Instead, it's more like making an FTP connection somewhere, FTP'ing the file to your computer, getting the image dimensions, storing them somewhere else, then deleting the image. So meta data like image dimensions needs to be stored somewhere else. PW actually implemented this ability last year (meta data and stored image dimensions). So we've already been making small steps towards S3-type storage, but because the big picture is still pretty broad in scope to implement, it's more of a long term plan. Though maybe one of my clients will tell me they need it next week, in which case it'll become a short term plan. ?

  • Like 10
Link to comment
Share on other sites

@monollonom I knew there was a reported PDO issue pending, but didn't remember the details so was going to be looking for it this coming week to make sure it was included (along with any others), thanks for pointing me to it. 

  • Like 2
  • Thanks 1
Link to comment
Share on other sites

5 hours ago, ryan said:

PW actually implemented this ability last year (meta data and stored image dimensions)

It took me a while to find this; I thought I’d link to it in case others wanted to see it:

 

  • Like 1
Link to comment
Share on other sites

20 hours ago, ryan said:

This is not a limiting factor in scalability at least. First off, at least here, the file-based assets are delivered by Cloudfront CDN, so they aren't part of the website traffic in the first place (other than to feed the CDN). If you wanted scalability then you'd likely want a CDN serving your assets whether using S3 or not. 

But a CDN isn't a necessary part of the equation in our setup either. File systems can be replicated just like databases. That's how this site runs on a load balancer on any number of nodes. Requests that upload files are routed to node A (primary), but all other requests can hit any node that the load balancer decides to route it to. The other nodes are exact copies of the node A file system that update in real time. This is very similar to what's happening with the DB reader (read-only) and writer (read-write) connection I posted about above, where the writer would be node A and there can be any number of readers. Something like S3 doesn't enhance the scalability of any of this.

I'd say that these are more like two sides of the same coin:

  • You've solved the file issue by directing static file requests to CDN and pointing file related requests to a single (?) master instance. This is of course a valid approach, but in this type of setup you've got at least two types of instances (read and read-write), which can make things more complicated than just replicating a single instance type based on the load, and may not work so well for sites that often require direct access to assets. This could also potentially impact fault tolerance: in case you indeed only have one write instance, that going down for any reason would be a big blow.
  • Most cloud setups I've worked with have handled this the other way around: files offloaded to external buckets and database shared among instances. One benefit of this is that those instances are identical, there's no "master" to rely on. In most cases S3 hasn't had major impact on performance (plus there's always CDN) and I've never run into a case where database would've been an absolute bottleneck. Of course DB requests should be limited: most web requests can be served from static cache, and if advanced caching is needed, in comes Redis.

The long story short is that I believe core level ability to store files somewhere other than local disk would be a splendid improvement for cloud use. It's true that you can get past this and it's not strictly necessary in order to run ProcessWire in the cloud, but it would simplify things quite a bit. That being said, I definitely get what you mean by this being the more complex part ?

Also, just to be clear, I think that the database abstraction is a great addition!

  • Like 4
Link to comment
Share on other sites

Thanks for implementing this Ryan. It was a secret feature request of mine I had wrongfully assumed would be out of scope and will immediately assist me in scaling a PW project I'm working on currently. Well done. ?

(Offloading to S3 would also be awesome, anything to make PW more stateless)

  • Like 1
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...