Jump to content

Some Database Syncing Insight Please


MindFull
 Share

Recommended Posts

Ok, so I have a new project that has PW as the selected framework of choice. Here are some of the constraints I'm working with:

1) The corporate database will replicate all the other branches data.

2) Each branch will have it's own web and database servers, thus run their own instance of ProcessWire and all data will be saved within the intranet.

3) Each branch will be able to access the other branches install via VPN. The web servers/code base will be synced using Git.

4) Each branch's database server will have a Symmetricds node installed. Via VPN, each node will push/pull to the corporate node, keeping the databases synced. Symmetricds allows for table and column specific updates so the whole db won't be getting replicated on each pass.

5) The database sync needs to be bidirectional, as in Multiple Master Replication. Reason being, the company plans on having certain departments add pages directly to the corporate account that will affect the other branches. For instance, the Human Resource department will handle adding/removing users and maintaining records on them. Each user created by HR will need to be able to access any branch's data when on the VPN. So user pages would need to be pushed to the branch db's. Here's what it would look like:

// Inside Branch 1's Building:

Client 1------via intranet---------------Web Server 1-----DB Server 1

// Connecting to remote Branch 2 from Branch 1

Client 1------via internet (VPN)------Web Server 2-----DB Server 2

// Database syncing

DB 1-------------via VPN-------Corporate DB 1       and vise versa (bidirectional)

DB 2-------------via VPN-------Corporate DB 2

The schema between all databases will be exactly the same. So, initially, my game plan was to change mySQL's auto_increment_increment and auto_increment_offset on each db instance, keeping the different branches on separate indexes when pages and fields are stored. Hopefully, this would cause a mostly conflict free environment. 

Also, this limits my approach on dealing with expansion. Once 9 offices are established, I'm going to have to start getting creative with the increment patterns. I suggested a cloud based approach but they insist on using their current infrastructure. My other idea was to write transaction and conflict rules into Symmetricds and allow it to make changes to page ids, etc and any references to such in the db. The real danger is when there is a mass building of pages when the VPN is not available, like an internet outage. Once the internet is back online, there could be a lot of pages that need to be replicated; with the likelihood that many pages end up with the same exact page id's, etc. 

Anyway, does anyone here have any advice as to what they would do? I'm open to hear opinions of what others have done in this same sort of scenario before I choose my path.

Link to comment
Share on other sites

First a few questions.

How probable is it that the VPN is down?

When the VPN is down, how long is it usually down?

Approximately how many writes are there during the day in the branch-offices?

How probable is it that there will be more than 9 offices?

Also I'm curious, why have you selected SymmetricDS over Tungsten Replicator?

The real danger is when there is a mass building of pages when the VPN is not available, like an internet outage. Once the internet is back online, there could be a lot of pages that need to be replicated; with the likelihood that many pages end up with the same exact page id's, etc.

I don't see how this would be a problem when you are using auto_increment_increment and auto_increment_offset.

You might have read this but it's a good article nonetheless: http://scale-out-blog.blogspot.fi/2012/04/if-you-must-deploy-multi-master.html

Edited by sforsman
  • Like 1
Link to comment
Share on other sites

@sforsman

The VPN shouldn't be down often. The only time it realistically should be offline is when there are problems with the ISP or internal networking issues. Maybe even for the split moment that new certs/keys are issued and updated. So the anticipation is that we maintain on online average of 99% or better throughout the year. Even still, that is about 87 hours of anticipated downtime.

The corporate node will be scheduled to pull every 10 minutes or sooner.

It's very probable that the company will have more than 9 offices in the next 2 years, they have 6 already.

I don't see how this would be a problem when you are using auto_increment_increment and auto_increment_offset.

I'm sorry, I meant to put that in the paragraph below it where I said I would write custom definitions to handle the page indexes.

Symmetricds was selected because the company already uses it for its existing databases and I have familiarity with it (though not in an all-master topology). I'm not biased and open to using different software as long as the learning curve isn't mountain steep; time is of the essence. I have never used Tungsten, though I have read a number of articles that referred to it and it seems much along the lines of how Symmetricds functions. Any advantages of using Tungsten vs Symmetricds would be appreciated. 

Thanks for the link. I've never read it before but it is quite informative and a great refresher. I like #5, "make the triggers harmless...". That should be fun to work through when I begin testing. I'll be keeping the original .sql file on hand for immediate importing. 

Link to comment
Share on other sites

@MindFull

A few more questions.

The items that are synced, are they only page- and field-related? Or also templates? Are all pages being synced or just a subset of them (i.e. pages with a specific template or parent)?

If new fields or templates are added at the branch-office, would they be synced back to the corporate DB?

How about modules and their configuration (i.e. the modules-table)? Are they synced from the corporate DB to the branch DBs?

Is something else being synced as well, apart from the data/code pulled from Git and the PW-related tables/rows?

I want to understand the complete scenario before making any suggestions.

The VPN shouldn't be down often.

In that case, would it be possible to consider locking/disabling branch-writes during this? I will come to my point later.

The corporate node will be scheduled to pull every 10 minutes or sooner.

Sorry for my being unclear, I meant that if you consider the average day at the branch office, how many writing operations are there related to the tables that are being synced (e.g. UPDATE/INSERT)? This is obviously related to the previous question.

I'm sorry, I meant to put that in the paragraph below it where I said I would write custom definitions to handle the page indexes.

I thought that might be the case, now it makes perfect sense :)

Symmetricds was selected because the company already uses it for its existing databases and I have familiarity with it

This is a good enough reason!

Any advantages of using Tungsten vs Symmetricds would be appreciated. 

Too early to say yet.

Thanks for the link. I've never read it before but it is quite informative and a great refresher. I like #5, "make the triggers harmless...". That should be fun to work through when I begin testing. I'll be keeping the original .sql file on hand for immediate importing. 

I'm glad you liked it! It's a good thing that at least the PW-core doesn't use triggers.

Link to comment
Share on other sites

@sforsman

Templates and modules will be included when a separate set of sync triggers are turned on in the config of Symmetricds. Typically, they won't be part of everyday syncing as they won't be changing or being added very often and only the corporate office will have permissions to add/edit/remove fields, templates, and modules. Yes, the changes will be pushed from corporate to all the branches.

There might be a separate table to handle some possible infinite row data that we will be writing a separate class for to handle querying its contents.  All other pages will be batched and channeled according to their relation to one another. There will be rows that will be exempt from updates after initial insert, like the page created timestamp and some other template defined ones. But, for the most part, most of the front-end submitted data will be subject to sync.

On average, there will be about 300 writes daily in each branch office. There are 3 full time shifts taking up all 24 hrs of the day.

As far as preventing branch writes, it just can't be allowed. The employees collect client data on a frequent basis and need to be able to store it immediately or they risk forgetting the details as the day goes on. That's the company's initial reasoning to having the local DB: VPN goes down, they have can keep going with the bulk of their work which is data collection of their clients locally.

Tomorrow I begin planning the data flow and transformation charts and I'm not particularly thrilled about that part of this project - lol, I hate diagramming in UML!!!!

BTW: Thanks for taking the time sforsman!

Link to comment
Share on other sites

@MindFull:

Thanks for the updated info, it was enlightening. 

Is the VPN down more often than their Internet-connection? The reasons behind these questions is that if you need 10+ masters, I would suggest a technique Flickr, for an example, uses. You can actually check their post about it here: http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/ - it's a really simple method :) If only the VPN-downtime is a problem, you could consider running the ticket-servers on the Internet.

If things cannot be reliant on the Internet-connection either, then in my opinion, the simplest solution would be using the same method described in the article above, but just using it locally.

Either way, such implementation would require hooking to the saving of new pages and just prefixing all IDs with a prefix assigned for each of the PW-instances (read from the site's wire('config') for an example). A three digit prefix would probably be enough to cover all of the customer's expansion needs (forever). Just an example

Corporate instance 1: 100 + <local ticket ID>

Corporate instance 2: 101 + <local ticket ID>

Branch instance 1: 200 + <local ticket ID>

Branch instance 2: 201 + <local ticket ID>

...and so on. This would work without any hacks to PW, because you are allowed to set the ID for new pages.

UML, urgh, I feel you there!

  • Like 4
Link to comment
Share on other sites

Either way, such implementation would require hooking to the saving of new pages and just prefixing all IDs with a prefix assigned for each of the PW-instances (read from the site's wire('config') for an example). A three digit prefix would probably be enough to cover all of the customer's expansion needs (forever).

My goodness, I was so caught up in the db side of things that I completely forgot to think about approaching this through the API!! This alleviates having to keep track of the auto increments in mySql - awesome! I love this forum and this API!

Thanks for your input sforsman! I feel much more confident with this approach than my previous ideas. Very simple and elegant, no wonder I overlooked it!  I have a tendency of doing things the hard way, :lol: . I'm going to write the hooks for this tomorrow when I'm back @ work and see how things pan out.

BTW, have you had the opportunity to try both Tungsten and Symmetricds? What's your opinion on Tungsten in comparison to Symmetricds? 

  • Like 1
Link to comment
Share on other sites

 Share

×
×
  • Create New...