Jump to content

Question about "extreme" scale, hundred of thousand (maybe millions) of pages!


adamspruijt
 Share

Recommended Posts

I am developing an ad managing/tracking tool for my employer, and the ads will be used on our main site, getting from 5-10,000 hits a day the ad view count sky rockets quickly. My current approach was to have the "stats" (clicks and views) be children of the ad they are associated with, but this would mean after a few months an ad could easily have hundreds of thousands of child entries and after a year millions. ???

I thought maybe in this situation I could do periodic "exports" where I would dump the old data into some type of flat file that would allow client to view their advertising stats on a month by month basis, but I feel like this is prone to error... (it could also detract from the functionality quite a bit).

Then I thought maybe a separate database table and use of the Processire $db class..? But in this case I wasn't sure if this was necessary (not having access to the lovely pageArray methods might not be so fun) or if there were any extra security concerns in this type of setup.

I hope this explains the issues I'm facing.

Link to comment
Share on other sites

We currently published a site (www.martat.fi) that has about 100 000 pages. This is mostly because we use my Discussions module, which has each discussion reply as own page. We did hit on one performance problem, which was quickly fixed by Ryan (https://github.com/r...71f76dba734c955). Usual $pages->find() selectors were fast as always, but when there were $page->rootParent->find() with multiple fields using *= selectoror it slowed down very badly. I don't understand the actual reason, since the slow query should actually be really fast (explain says just simple and simple - so maybe it is bug or something very strange behavior on mysql - no clue). After the fix Ryan provided things are rolling very smoothly.

I am pretty confident that PW will scale quite easily for 1 000 000 pages - but that scale might limit the way you can use the API. Because of that I wouldn't probably store that kind of data you have (hits counter) in pages, but in separate database table. Millions of rows are nothing for MySQL, especially if you query data from single table and with simple index (like page_id). But in your scenario I don't see that much benefit of having them as pages and therefore would probably keep them out of there. You don't probably need more than few simple sql queries anyway.

Just my 2 cents and I am most definitely not a DB schema designer... There are people who do know this kind of stuff much better (Ryan and Nik for example).

  • Like 1
Link to comment
Share on other sites

For the usage you're talking about I agree with Antti. Using the $db class this would be pretty straightforward as well as you're not talking about a complicated table either :)

I think this will come up in lots of scenarios where the number of invisible pages far outstrips the number of visitor-facing pages by a huge percentage (and it's not due to form entries) - that's the time to seriously consider whether to use an external table.

  • Like 1
Link to comment
Share on other sites

Man, you guys are awesome, thanks a lot. I thought the DB route made more sense and your words on the matter really make me feel a lot better about going ahead with the changes (which in ProcessWire will really only take a few minutes :P ).

THANKS!

Link to comment
Share on other sites

Pages will definitely scale quite far. But there is certainly more overhead with a page than there is with a plain DB table. As a result, when you are talking about storing huge quantities of data, I would keep your pages to represent the visible URLs on your site. If each row of data isn't going to be related to a unique URL in your page structure, then there really isn't a technical need to store it as a page. Though if you don't need infinite scalability, you may still find using pages for that data to be more convenient. But since it sounds like you do need near-infinite scalability, going to the DB sounds like a better choice. ProcessWire Fieldtypes are designed to represent simple and complex structures in this way, while still letting you use the API admin interface to handle it all. However this does require developing your own Fieldtype and Inputfield to manage it (which actually isn't too difficult). If you don't need an interface and/or PW API access to manage it, you can also just go straight to the $db object as if ProcessWire wasn't there. But this isn't as nice or fun as having your data still be connected with ProcessWire. 

  • Like 5
Link to comment
Share on other sites

Hi Ryan

This is useful stuff that could apply to a lot of situations I suspect

If you ever get some time, a small tutorial would be really useful - if you want to do it as quick bullet points I will be happy to write it up.

Joss

  • Like 1
Link to comment
Share on other sites

Ryan: little related. Can there be field with fieldtype "stats" that doesn't have inputfield? Or is inputifield always required? Of course it could be just dummy&hidden. Just thinking if Adam would implement "Stats" fieldtype for his tracking data.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...