louisstephens

Checking page (all fields) against JSON feed


I have a script that pulls in a JSON feed (it will be attached to a cron job later), which looks like:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
}

Everything there works well, and I can pull the id, title, status (updated, new, sold) and other items from the decoded feed in a foreach loop. My whole goal is to create pages from the feed, but if a page has already been created with exactly the same values as the feed item, I need to "skip" over it.

So far, I am running into a roadblock with my checks. I guess I need to compare the JSON to all my pages and their values and:

1. If an id already exists, check whether a field's data has been updated, and if so, update the page.

2. If an id exists and all fields are unchanged, skip adding that page.

 

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        $u = new Page();
        $u->setOutputFormatting(false); // turn off output formatting before setting values
        $u->template = $templates->get("basic-page");
        $u->parent = $pages->get("/development/");
        $u->name = $feed->id; // page name from the feed's unique id
        $u->title = $feed->title;
        $u->status = $feed->status; // note: "status" is a reserved Page property in ProcessWire; a custom field should use another name
        $u->body = $feed->title;
        $u->save();
    }
} else {
    echo "HTTP request failed: " . $http->getError();
}

I am really just hung up on how to do the current page checks and match them against the JSON field data.


Two or three things come to mind directly:

If there is no unique ID within the feed, you have to create one from the feed data per item and save it into an uneditable or hidden field of your pages.

Additionally, you may concatenate all field values (strings and numbers) on the fly, generate a crc32 checksum (or something similar) from them, and save this into a hidden (or at least uneditable) field with every newly created or updated page.

Then, when running a new import loop, you extract or create the ID and create a crc32 checksum from the feed item on the fly.

Query whether a page with that feed ID is already in the system; if not, create a new page and move on to the next item. If yes, compare the checksums: if they match, move on to the next item; if not, update the page with the new data.

 

Code example:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        // create or fetch the unique id for the current feed item
        $feedID = $feed->unique_id;
        // create a checksum over the significant field values
        $crc32 = crc32($feed->title . $feed->body . $feed->status);

        $u = $pages->get("template=basic-page, parent=/development/, feed_id={$feedID}");
        if(0 == $u->id) {
            // no page with that id in the system
            $u = createNewPageFromFeed($feed, $feedID, $crc32);
            $pages->uncache($u);
            continue;
        }

        // page already exists, compare checksums
        if($crc32 == $u->crc32) {
            $pages->uncache($u);
            continue; // nothing changed
        }

        // changed values, update the page
        $u = updatePageFromFeed($u, $feed, $crc32);
        $pages->uncache($u);
    }

} else {
    echo "HTTP request failed: " . $http->getError();
}

function createNewPageFromFeed($feed, $feedID, $crc32) {
    $u = new Page();
    $u->setOutputFormatting(false);
    $u->template = wire('templates')->get("basic-page");
    $u->parent = wire('pages')->get("/development/");
    $u->name = $feed->id; // page name from the feed's unique id
    $u->status = $feed->status; // note: "status" is a reserved Page property in ProcessWire; a custom field should use another name
    $u->title = $feed->title;
    $u->body = $feed->body;
    $u->crc32 = $crc32;
    $u->feed_id = $feedID;
    $u->save();
    return $u;
}

function updatePageFromFeed($u, $feed, $crc32) {
    $u->of(false);
    $u->title = $feed->title;
    $u->status = $feed->status; // note: "status" is a reserved Page property in ProcessWire; a custom field should use another name
    $u->body = $feed->body;
    $u->crc32 = $crc32;
    $u->save();
    return $u;
}

 


Wow Horst! I can't thank you enough for your insight as well as the example. To be honest, I had no idea about crc32 or using uncache. I do have a question though: what are the benefits of using uncache when creating a new page within the functions?


Using uncache has nothing to do with the function calls.

Uncaching is useful in loops, at least with lots of pages, to free up memory.

Every page you create, update, or check values of is loaded into memory and allocates some space. Without actively freeing it, your available RAM gets smaller and smaller with each loop iteration. Therefore it is good practice to release objects that are no longer needed, even with a smaller number of iterations.
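A minimal sketch of that pattern in a ProcessWire loop (the selector is illustrative, not from the thread):

```php
<?php
// Iterate over many pages without letting the page cache grow unbounded.
foreach(wire('pages')->find("template=basic-page, parent=/development/") as $p) {
    // ... read or update $p here ...
    wire('pages')->uncache($p); // release this page object from memory
}

// Alternatively, drop everything from the page cache in one go after the loop:
wire('pages')->uncacheAll();
```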


@louisstephens I really like the answer @horst posted about this, but want to ask if you intend to use LazyCron for this? If so, please be aware of the potentially long processing times associated with doing things this way, especially on the initial read of the feed. Also, there is no facility above for removing items that no longer appear in the feed but are stored in PW pages. You might not need to do this, though; it all depends on your application.

If anyone's interested, the way I've tackled this before, in outline, is to pre-process the feed and essentially do what horst posted about calculating a hash of the content (personally I don't like crc32, which returns an int; I prefer the fixed-length strings returned by md5, which is fine for this, and fast). Do filter out any feed fields that you don't intend to store before you calculate the hash, so that insignificant changes in the feed don't trigger un-needed updates.

Anyway, this gives a feed_id => hash_value map for each feed item. If we do this for the whole feed, we end up with a PHP array of these maps. This array can be stored persistently between each read of the feed. Let's call the previously created map $prev_map, and the map for this read of the feed $new_map.

You simply use PHP's built-in array methods to quickly find the records that have...

  1. Been added: $to_be_added = array_diff_key($new_map, $prev_map);
  2. Been deleted: $to_be_deleted = array_diff_key($prev_map, $new_map);
  3. Been updated: $to_be_updated = array_uintersect_assoc($new_map, $prev_map, 'strcasecmp');

...all without having to go to the DB layer with selectors.
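The three diffs above can be sketched in plain PHP. The map contents are illustrative, and the "updated" set is computed here as keys present in both maps whose hashes differ, which is a slight variation on the array_uintersect_assoc() call listed above:

```php
<?php
// Illustrative maps: feed_id => md5 hash of the item's significant fields.
$prev_map = [
    'a1' => md5('Title One|Body One|new'),
    'a2' => md5('Title Two|Body Two|new'),
];
$new_map = [
    'a2' => md5('Title Two|Body Two|sold'),    // changed since last run
    'a3' => md5('Title Three|Body Three|new'), // newly appeared
];

// New in the feed: create pages for these.
$to_be_added = array_diff_key($new_map, $prev_map);

// Gone from the feed: candidates for trashing.
$to_be_deleted = array_diff_key($prev_map, $new_map);

// Present in both runs but with a different hash: update these pages.
$to_be_updated = array_diff_assoc(
    array_intersect_key($new_map, $prev_map),
    $prev_map
);

// Finally, persist $new_map so it becomes $prev_map on the next run,
// e.g. file_put_contents('feed_map.json', json_encode($new_map));
```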

On the first run, when the $prev_map is an empty array, you'll be facing a full import of the feed - potentially a LOT of work for PW to do adding pages. Even reads of the feed that add a lot of new pages or update a lot of pages could require mucho processing, so you'll need to think about how you could handle that - especially if all this is triggered using LazyCron and takes place in the context of a web server process or thread - having that go unresponsive while it adds 100,000 pages to your site may not be considered good.

Finally, don't forget to overwrite $prev_map with $new_map and persist it.

* NB: I've not done item 3 exactly this way before (I used array_intersect_key()), but I don't see why array_uintersect_assoc() shouldn't work.


Thanks @netcarver for the detailed reply. Horst's approach worked really well with a very small feed, and I was learning how to make a module to potentially handle this. I was hoping to tap into a cron job to handle the updating/adding at a specific time, like midnight every night, but maybe LazyCron might work inside the module. I haven't done much research into hooking into LazyCron within a module and getting the module to perform functions similar to horst's example above.

However, you make a good point regarding page count. I believe I will be dealing with 500 to 600 items at most (though it could be as low as 200). I would say I am getting a lot better with PHP, as my background is in front-end, but I am enjoying the learning process.

Since the items will have a status (new, used, or sold), I was thinking that I could potentially write a function to trash the items marked as sold after a 24-hour period and empty the trash. Well, this was my thought earlier in my pre-planning stages.
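That cleanup idea could be sketched roughly as follows. This is an untested assumption, not code from the thread: the feed_status field name is hypothetical (ProcessWire reserves the name "status" for page state), and the relative-date selector relies on ProcessWire accepting strtotime()-style values:

```php
<?php
// Find pages marked sold whose last modification is more than 24 hours old.
$sold = wire('pages')->find("template=basic-page, feed_status=sold, modified<'-24 hours'");
foreach($sold as $p) {
    wire('pages')->trash($p); // move to trash; emptying the trash could follow
    wire('pages')->uncache($p);
}
```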

