Based on what I'm understanding from your last message, I think you should skip keeping the separate table. It just sounds like extra, unnecessary work, unless there's something more to this project that I don't yet understand.
Instead, I think you should have your cron job execute a script that bootstraps ProcessWire and takes care of all the adding, updating and deleting of records consistent with the web service you are reading from. This is something that I think ProcessWire is particularly good at, because it's been designed for this from the beginning (it's something I have to do with a lot of my client work).
Whether the feed is XML or JSON doesn't matter much, as PHP can read either type quite easily. Though like the other guys here, I generally prefer JSON just because it's less verbose and less fuss. If JSON, you'll pull the feed and use PHP's json_decode() (with true as the second argument) to convert it to an array. If XML, you'll use PHP's SimpleXML to parse it into objects that you can iterate much like an array. Once you've got the raw data, you'll iterate through it and add, update, or delete pages in ProcessWire to make it consistent with the data you are pulling from the web service.
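To make the JSON vs. XML point concrete, here is a minimal sketch that turns both formats into the same plain PHP array. The two sample payloads are made up for illustration (only the 'name' and 'title' keys mirror the feed used later in this post), so treat the structure as an assumption rather than what your web service actually returns.

```php
<?php
// Sample payloads standing in for a real feed (made up for illustration).
$json = '{"status":"success","items":[{"name":"foo","title":"Foo"},{"name":"bar","title":"Bar"}]}';
$xml  = '<feed><status>success</status><items>'
      . '<item><name>foo</name><title>Foo</title></item>'
      . '<item><name>bar</name><title>Bar</title></item>'
      . '</items></feed>';

// JSON: passing true as the second argument returns associative arrays.
$fromJson = json_decode($json, true)['items'];

// XML: SimpleXML returns objects, so cast each value to a string while
// building a plain array with the same shape as the JSON version.
$fromXml = [];
foreach (simplexml_load_string($xml)->items->item as $node) {
    $fromXml[] = [
        'name'  => (string) $node->name,
        'title' => (string) $node->title,
    ];
}

// Both feeds now yield the same array of items.
var_dump($fromJson === $fromXml);
```

Either way you end up with an array of items to loop over, which is all the import needs.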
Live, working example
I think that the best way to demonstrate it is with a live, working example. This one uses the existing modules.processwire.com/export-json/ feed. You might also want to see the feed in human-readable mode to get a better look at the format.
Below is a shell script that bootstraps ProcessWire, reads from that feed and maintains a mini "modules directory" on your own site. I made this example so that it can be tested and used on a brand new installation using the basic profile (included with PW). If left how it is, it'll create a mini modules directory site below the '/about/what/' page and use the template 'basic-page' for any pages it adds. But you can run this on any ProcessWire installation by just editing the script and changing the parent from '/about/what/' to something else, and changing the template from 'basic-page' to something else, if necessary.
This script assumes that the template used has 3 fields: title, body, and summary. The 'basic-page' template in PW's default profile already has these. If you adapt this for your own use, you'd probably want to change it to use more specific fields consistent with what you need to store on your pages. In this example, I'm just building a 'body' field with some combined data in it, but that's just to minimize the amount of setup necessary for you or others to test this… The purpose is that this is something you can easily run in the default profile without adding any new templates, fields, pages, etc.
1. Paste the following script into the file import-json.php (or download the attachment below). For testing purposes, just put it in the same directory where you have ProcessWire installed. (If you place it elsewhere, update the include("./index.php"); line at the top to load ProcessWire's index.php file).
2. Edit the import-json.php file and update the first line: "#!/usr/bin/php", to point to where you have PHP installed (if not /usr/bin/php). Save.
3. Make the file executable as a shell script:
chmod +x ./import-json.php
4. Run the file at the command line by typing "./import-json.php" and hitting enter. It should create about 95 or so pages under /about/what/. Take a look at them. Run it again, and you'll find it reports no changes. Try making some changes to the text on 1 or 2 of the pages it added and run it again; it should update them. Try deleting some of its pages, and it should add them back. Try adding some pages below /about/what/ on your own, run it again, and it should delete them.
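If it helps to see the add/update/delete behavior from step 4 in isolation, here is a rough sketch of the same decision logic using plain arrays, with no ProcessWire involved. Note that planSync() and the sample data are hypothetical and only for illustration; the actual script below doesn't build a plan like this, it just saves every page in the feed and then trashes whatever wasn't touched.

```php
<?php
/**
 * Work out what a sync run would do, given the feed items and the records
 * already stored. Both arrays are keyed by item name (hypothetical shape).
 */
function planSync(array $feed, array $existing): array
{
    $plan = ['add' => [], 'update' => [], 'delete' => []];

    foreach ($feed as $name => $fields) {
        if (!array_key_exists($name, $existing)) {
            $plan['add'][] = $name;      // in the feed, not stored yet
        } elseif ($existing[$name] != $fields) {
            $plan['update'][] = $name;   // stored, but the data differs
        }
    }

    // Anything stored that the feed no longer lists should be removed.
    $plan['delete'] = array_keys(array_diff_key($existing, $feed));

    return $plan;
}

// Made-up data: one module missing locally, one edited locally,
// and one page that was added by hand and isn't in the feed.
$feed = [
    'module-a' => ['title' => 'Module A'],
    'module-b' => ['title' => 'Module B'],
];
$existing = [
    'module-b'    => ['title' => 'Module B (edited locally)'],
    'my-own-page' => ['title' => 'Added by hand'],
];

$plan = planSync($feed, $existing);
print_r($plan);
```

With this data, 'module-a' would be added, 'module-b' would be updated back to match the feed, and 'my-own-page' would be deleted. Running it a second time against data that already matches the feed gives an empty plan, which is the "reports no changes" case.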
import-json.php
#!/usr/bin/php
<?php // replace the path in the shebang line above with the path to your PHP
// bootstrap ProcessWire. Update the path in the include if this script is not in the same dir
include("./index.php");
// if you want to run this as a PW page/template instead, remove everything above (except the PHP tag)
// save our start time, so we can find which pages should be removed
$started = time();
// keep track of how many changes we've made so we can report at the end
$numChanged = 0;
$numAdded = 0;
$numTrashed = 0;
// URL to our web service data
$url = 'http://modules.processwire.com/export-json/?apikey=pw223&limit=100';
// get the data and decode it to an array
$data = json_decode(file_get_contents($url), true);
// if we couldn't load the data, then abort
if(!$data || $data['status'] != 'success') throw new WireException("Can't load data from $url");
// the parent page of our items: /about/what/ is a page from the basic profile
// update this to be whatever parent you want it to populate...
$parent = wire('pages')->get('/about/what/');
if(!$parent->id) throw new WireException("Parent page does not exist");
// iterate each item in the feed and create or update pages with the data
foreach($data['items'] as $item) {
    // see if we already have this item
    $page = $parent->child("name=$item[name]");
    // if we don't have this item already then create it
    if(!$page->id) {
        $page = new Page();
        $page->parent = $parent;
        $page->template = 'basic-page'; // template new pages should use
        $page->name = $item['name'];
        echo "\nAdding new page: $item[name]";
        $numAdded++;
    }
    // now populate our page fields from data in the feed
    $page->of(false); // ensure output formatting is off
    $page->title = $item['title'];
    $page->summary = $item['summary'];
    // To keep it simple, we'll just populate our $page->body field with some combined
    // data from the feed. Outside of this example context, you'd probably want to
    // populate separate fields that you'd created on the page's template.
    $body = "<h2>$item[summary]</h2>";
    $body .= "<p>Version: $item[module_version]</p>";
    foreach($item['categories'] as $category) $body .= "<p>Category: $category[title]</p>";
    $body .= "<p><a href='$item[download_url]'>Download</a> / <a href='$item[url]'>More Details</a></p>";
    $page->body = $body;
    // print what changed
    $changes = $page->getChanges();
    if(count($changes)) {
        $numChanged++;
        foreach($changes as $change) echo "\nUpdated '$change' on page: $page->name";
    }
    // save the page
    $page->save();
}
// now find pages that were not updated above, which indicates they
// weren't in the feed and should probably be trashed
$expired = $parent->children("modified<$started");
foreach($expired as $page) {
    echo "\nTrashing expired page: $page->name";
    $page->trash(); // move to trash
    $numTrashed++;
}
echo "\n\n$numAdded page(s) were added";
echo "\n$numChanged page(s) were changed";
echo "\n$numTrashed page(s) were trashed\n";
import-json.php.txt
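One thing worth considering before using this for real: the script above trashes every page it didn't save during the run, so a feed that reports success but comes back with no items would clear out everything under the parent. Here is a minimal sketch of a stricter load-and-validate step. decodeFeed() is a hypothetical helper (not part of ProcessWire), and it assumes the same 'status' and 'items' keys the script already relies on.

```php
<?php
/**
 * Decode a raw feed response and check that it looks usable.
 * decodeFeed() is a hypothetical helper, not part of ProcessWire.
 */
function decodeFeed($raw): array
{
    // file_get_contents() returns false when the request fails
    if (!is_string($raw) || $raw === '') {
        throw new RuntimeException('Feed request failed or returned nothing');
    }
    $data = json_decode($raw, true);
    if (!is_array($data)) {
        throw new RuntimeException('Feed could not be decoded as a JSON object');
    }
    if (($data['status'] ?? null) !== 'success') {
        throw new RuntimeException('Feed did not report success');
    }
    if (empty($data['items']) || !is_array($data['items'])) {
        // stop here rather than let the sync trash every existing page
        throw new RuntimeException('Feed has no items');
    }
    return $data['items'];
}

// Usage with canned responses; in the script, $raw would come from
// file_get_contents($url).
$items = decodeFeed('{"status":"success","items":[{"name":"foo"}]}');
echo count($items), " item(s)\n";

try {
    decodeFeed(false);
} catch (RuntimeException $e) {
    echo 'Aborted: ', $e->getMessage(), "\n";
}
```

Whether an empty feed should really be treated as an error depends on your web service, so adjust that last check to suit.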
Running the script as a cron job:
You can instruct your cron job to run the script and it should be ready to go. You may want to move it to a non-web-accessible location for more permanent use. You'll also want to update your bootstrap "include()" line at the top to have the full path to your ProcessWire index.php file, as your cron job probably isn't executing it from the web root dir as you were when running it manually.
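As a rough sketch of what the top of the script might look like once it's moved out of the web root, something like the following could replace the include("./index.php") line. The install path and the bootstrapFile() helper are assumptions for illustration only, so swap in wherever your ProcessWire index.php actually lives.

```php
<?php
// Hypothetical install location; replace with the directory that holds
// your ProcessWire index.php.
$pwRoot = '/var/www/example.com/htdocs';

/** Build the absolute path to ProcessWire's index.php (hypothetical helper). */
function bootstrapFile(string $root): string
{
    return rtrim($root, '/') . '/index.php';
}

$bootstrap = bootstrapFile($pwRoot);

if (PHP_SAPI !== 'cli') {
    // Refuse to run if the script is requested through the web server.
    echo "This script is meant to be run from the command line.\n";
} elseif (!is_file($bootstrap)) {
    // Fail loudly so that cron's mail or log shows what went wrong.
    fwrite(STDERR, "Cannot find ProcessWire at $bootstrap\n");
} else {
    include $bootstrap; // same as include("./index.php"), but with a full path
}
```

The command-line check is optional, but it's a cheap safeguard if the file ever ends up somewhere web accessible again.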
Running the script as a template file:
You can run this script as a template file on a page by replacing the include() line and everything above it with this line:
<pre><?php
Place it in your /site/templates/ directory, add the template from PW admin, and create a page that uses it, then view it.