DB data archiving


rusjoan


Let's talk about the possibility of archiving pages from the DB.

I mean, if your site has collected a large amount of data and after some time you no longer want to store it all in the DB, you should have a tool for archiving it.

Simply deleting old pages is sometimes not the best practice (e.g. in my situation).

Archiving, as I see it, is a process where old, unused data is compressed within the DB (I don't know how, it's just my suggestion) with no ability to access it directly. If you want this data again, you must decompress it.

So we have two time-consuming processes, compressing and decompressing, but we save on DB size.

But alternatively we could do it more simply: export a number of pages to a single file on the HDD and remove them from the DB.
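
Something like this is what I imagine for the simpler way (just an untested sketch; the template name, the fields and the path are placeholders):

    // Untested sketch: write old pages to a file on disk, then delete them.
    $old = $pages->find('template=news-item, created<"-2 years", limit=100');
    $fh = fopen('/path/to/archive/news-items.json', 'a');
    foreach ($old as $p) {
        $row = array('id' => $p->id, 'title' => (string) $p->title, 'created' => $p->created);
        fwrite($fh, json_encode($row) . "\n");
        $p->delete();
    }
    fclose($fh);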

How could this be organized (I mean the first method)? What does PW provide in this instance?


Why wouldn't you be able to store it in the database, though? A database still exists on disk, so your only limitations are disk space and server power (assuming lots of pages means lots of visitors).

It would be more worthwhile looking into caching options instead, I think. Something like the ProCache module (see the Store link at the top of the forums) caches your pages as static HTML so the database doesn't get queried for users who aren't logged in. It's a relatively simple but really massive speed boost whilst taking a lot of load off the server (neither PHP nor MySQL gets touched for ProCached pages).

If you have some really complicated page templates, i.e. lots of fields in an invoices template that will have hundreds of thousands to millions of rows, then it can be worthwhile building a fieldtype specifically for it so that it acts like a normal table and all the fields live in one table in the database (fewer queries to join all the data).

I would definitely investigate other options first, but always remember that a database exists on disk, so it is theoretically only limited by hardware if your queries are built well.

What size site do you have that is causing you to ask about this anyway, if you don't mind me asking?


My situation is hypothetical: my DB isn't that big yet, but it could reach that size in the future.

As for the template with lots of fields: that's not my case, but you've got me interested in it.

What makes it possible to build one "big" field that contains lots of rows? Isn't that a Fieldset?


@Pete: actually, archiving database content elsewhere could have its merits in some cases.

Imagine a huge and constantly changing database of invoices, classifieds, messages, history data, etc. Perhaps not the best possible examples, but in any case something that can grow into a vast mass. Unless you keep adding extra muscle to the machine running your database (in which case there would only be theoretical limits to worry about), operations could become unbearably slow in the long run.

To avoid that, you could decide not to keep records older than, say, two years in your production database. If you don't actually want to completely destroy old records, you'd need a way to move them aside (or archive them) in a way that lets you fetch something later (it doesn't have to be easy, though).

Admittedly not the most common use case, but not entirely unimaginable either :)

As for the solution, there are quite a few possibilities. In addition to deleting pages periodically, you could do one or more of these:

  • exporting pages via the API into CSV or XML file(s)
  • duplicating existing tables for local "snapshots"
  • performing regular SQL dumps (typically exporting content into .sql files)
  • using pages to store data from other pages in large chunks of CSV/JSON (or a custom fieldtype, per Pete's idea); see the sketch after this list
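
As a rough example of that last option, something like this could work. All the names here (the invoice and archive templates, the body field, the parent path) are made up for the sake of the example:

    // Sketch: roll old pages up into one "archive" page that stores their
    // data as a JSON chunk in a textarea field, then delete the originals.
    $old = $pages->find('template=invoice, created<"-2 years", limit=500');
    $rows = array();
    foreach ($old as $p) {
        $rows[] = array('id' => $p->id, 'title' => (string) $p->title, 'created' => $p->created);
    }
    $archive = new Page();
    $archive->template = 'archive';
    $archive->parent = $pages->get('/archives/');
    $archive->title = 'Invoices archived ' . date('Y-m-d');
    $archive->body = json_encode($rows);
    $archive->save();
    foreach ($old as $p) {
        $p->delete();
    }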

In any case, none of this is really going to be an issue before you've got a lot of data, and by a lot I mean millions of pages, even. Like Pete said, caching, either the built-in methods or ProCache, will make typical sites very slick even with huge amounts of content.

If your content structure is static (unchanged: new fields added and old ones removed or renamed very rarely), a custom fieldtype is a good option, and so is a custom database table. Which one depends on the kind of content you're storing and the features of the service you're building.


I know IPB (this forum software) has an archive feature, but it's only worth it for forums when you reach many thousands of posts and want to archive stuff that you don't need to search. I just did a wildcard search on the forum database for these forums, across the 59,000 posts, and it took 0.2973 seconds! Add on maybe a second or two when we get to 590,000 posts and it's still not that big an issue (though I would consider Sphinx or something by that point).

For large numbers of invoices/bookings/whatever, if you simply use sensible selectors in your templates then there isn't really an upper limit aside from disk space. For example, only return small result sets (25 or 50) and paginate, or offer filters to specify date ranges, etc. You will very rarely want to return thousands of results in one massive list, and even if you did you certainly wouldn't want tens of thousands, let alone millions, so you're very unlikely to ever try to pull every row of a database into memory; you're far more likely to want to filter your results to find something.
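
For example, something along these lines (just a sketch; the "invoice" template is a placeholder, and the pager assumes the core MarkupPagerNav module with page numbers enabled on the template):

    // Sketch: return a small, filtered, paginated result set instead of one huge list.
    $from = strtotime('-1 month');
    $invoices = $pages->find("template=invoice, created>=$from, limit=50, sort=-created");
    foreach ($invoices as $invoice) {
        echo "<p>{$invoice->title}</p>";
    }
    echo $invoices->renderPager(); // pagination links for the result set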

If your tables are properly indexed (I think in PW the equivalent is that your fields would be autojoin :)) then hundreds of thousands of rows are just taking up space, not slowing down queries significantly.

Now, when you get to queries where you want to search a body field containing content of unknown length for a specific word, and you have millions of rows, then yes, that will probably take a lot of time. If you want to do things like that, then something like Apache Lucene or Sphinx will be the option to go for.

I wouldn't worry about it too much to begin with, though. I used to get into the habit of thinking "what if my site reaches this size", and you generally find that you don't have to worry about it for several years or more, by which time there will be other or improved technology available to help you out (the next Lucene or Sphinx, for example). So if you implement a solution on day one of your site/application but don't need to worry about such volume in your database until year 2+, you'll probably only have to update your code anyway to make use of newer technology :)


EDIT: I edited my posts because millions of rows will slow down a search, but you have to ask yourself when you expect to encounter millions of rows in your database. Comments on a news site, maybe, but then when are you ever going to want to search the contents of a comments field across all articles?

It's all relative to what your scenario is.


Pete: I haven't had the opportunity (?) to deal with truly large databases myself, but I remember discussing this with someone more experienced a few years ago. They had built a system for a local university and some hospitals for managing medical research data (or something like that; the details are a bit hazy). I don't really know what that data was like (I probably never asked), but according to him they started running into various performance issues at the database level after just a few million rows of it.

Indexes in general make searches fast, and having results in memory makes them even faster, but there's always a limit on buffer size, searching a huge index takes time too, and in some rare cases indexes can actually make things worse. (Luckily the optimizer is usually smart enough to identify the best approach for each specific scenario.)

The downside of indexes is that they also take space and need to be updated whenever data changes; they're not a silver bullet that makes all performance issues vanish, and they can actually add to the problem. This also means that you'll need to consider the ratio of inserts/updates to searches when creating your indexes, not just the fields to index and their order.

This is starting to get almost theoretical, and you're probably right that none of it matters to any of us here anyway (although what do I know, someone here might just be building the next Google, Craigslist or PayPal). I just wanted to point out that it's not quite that simple when it comes to really large amounts of data. And no matter what you claim, (database) size does matter ;)

Edit: just saw your edits, apparently we're pretty much on the same page here after all :)

Edited by teppo

  • 6 months later...

To expand on Teppo's point about very large database extractions: we've used a tool called FACT (Fast Extract) from IRI to dump multi-million-row tables from RDBs into CSV and other kinds of flat files. On those we run scripts (with another tool from the same company called CoSort) to query, reorganize, mask, and do whatever else we need with that data without burdening the DB with lookups. Basically, we found that flat files are just a more manageable environment for a lot of things, especially bulk data movement and manipulation, a la data warehouse transforms.


What makes it possible to build one "big" field that contains lots of rows? Isn't that a Fieldset?

If I understand you right, you are looking for a field type that can store values in multiple rows.

Have a look at the Events fieldtype (FieldtypeEvents). It stores multiple values as rows in its own DB table, which makes searching really fast. I'm using an altered version to store thousands of timestamps with a page, and searching them is really performant.
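
From memory, usage looks roughly like this (assuming a field named "events" and the example's Event class with its date/location/notes properties):

    // Sketch: each Event is stored as a row in the field's own table.
    $event = new Event();
    $event->date = '2014-05-01';
    $event->location = 'Helsinki';
    $event->notes = 'Placeholder note';
    $page->of(false); // turn off output formatting before saving via the API
    $page->events->add($event);
    $page->save();

    // Because each column lives in that one table, subfield selectors stay fast:
    $found = $pages->find('events.date>="2014-01-01", events.date<"2015-01-01"');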

