louisstephens

Checking page (all fields) against json feed


I have a script that is pulling in a json feed (will be attached to a cron job later) which looks like:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
}

Everything there works well and I can pull the id, title, status (updated, new, sold) and other items from the decoded feed in a foreach loop. My whole goal is to create pages from the feed, but if the page has already been created, with all the same exact items from the json feed, I will need to "skip" over it.

So far, I am running into a roadblock with my checks. I guess I need to compare the JSON to all my pages and their values and:

1. If an id already exists, check whether any field's data has changed and, if so, update the page.

2. If an id exists and all fields are unchanged, skip adding that page.

 

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        $u = new Page();
        $u->setOutputFormatting(false); // turn output formatting off before setting values
        $u->template = $templates->get("basic-page");
        $u->parent = $pages->get("/development/");
        $u->name = $feed->id; // use the feed id as the page name
        $u->title = $feed->title;
        // $u->status is ProcessWire's own page status (hidden, unpublished, ...),
        // so the feed status goes into a custom field ("feed_status" is an assumed name)
        $u->feed_status = $feed->status;
        $u->body = $feed->title;
        $u->save();
    }
} else {
    echo "HTTP request failed: " . $http->getError();
}

I am really just hung up on how to do the current page checks and matching them with the json field data.


Two or three things come to mind right away:

If there is no unique ID within the feed, you have to create one from the feed data for each item and save it into an uneditable or hidden field of your pages.

Additionally, you can concatenate all field values (strings and numbers) on the fly, generate a crc32 checksum or something like that from them, and save it into a hidden (or at least uneditable) field with every newly created or updated page.

Then, when running a new import loop, you extract or create the ID and create a crc32 checksum from the feed item on the fly.

Query whether a page with that feed ID is already in the system. If not, create a new page and move on to the next item. If yes, compare the checksums: if they match, move on to the next item; if not, update the page with the new data.

 

Code example:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        // create or fetch the unique id for the current feed item
        $feedID = $feed->unique_id;
        // create a checksum
        $crc32 = crc32($feed->title . $feed->body . $feed->status);

        $u = $pages->get("template=basic-page, parent=/development/, feed_id={$feedID}");
        if(0 == $u->id) {
            // no page with that id in the system
            $u = createNewPageFromFeed($feed, $feedID, $crc32);
            $pages->uncache($u);
            continue;
        }

        // page already exists, compare checksums
        if($crc32 == $u->crc32) {
            $pages->uncache($u);
            continue; // nothing changed
        }

        // changed values, we update the page
        $u = updatePageFromFeed($u, $feed, $crc32);
        $pages->uncache($u);
    }

} else {
    echo "HTTP request failed: " . $http->getError();
}

function createNewPageFromFeed($feed, $feedID, $crc32) {
    $u = new Page();
    $u->setOutputFormatting(false);
    $u->template = wire('templates')->get("basic-page");
    $u->parent = wire('pages')->get("/development/");
    $u->name = $feed->id; // use the feed id as the page name
    $u->title = $feed->title;
    // $u->status is ProcessWire's own page status (hidden, unpublished, ...),
    // so the feed status goes into a custom field ("feed_status" is an assumed name)
    $u->feed_status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->feed_id = $feedID;
    $u->save();
    return $u;
}

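function updatePageFromFeed($u, $feed, $crc32) {
    $u->of(false);
    $u->title = $feed->title;
    // custom field for the feed status ("feed_status" is an assumed name);
    // $u->status is reserved for ProcessWire's own page status
    $u->feed_status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->save();
    return $u;
}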
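
The create / skip / update decision in the loop above can also be pulled out into a small function that needs no ProcessWire objects, which makes it easy to try on its own. This is only a sketch: `decideImportAction` is a made-up name, and it assumes the stored checksum is available as an integer (or `null` when no page with that feed id exists yet).

```php
<?php
/**
 * Decide what to do with one feed item.
 *
 * $storedChecksum: checksum saved on the existing page, or null if there is no page yet.
 * $feedChecksum:   checksum calculated from the current feed item.
 */
function decideImportAction(?int $storedChecksum, int $feedChecksum): string
{
    if ($storedChecksum === null) {
        return 'create'; // no page with that feed id in the system
    }
    // identical checksum: nothing changed; different checksum: update the page
    return $storedChecksum === $feedChecksum ? 'skip' : 'update';
}

$stored = crc32('Red bike' . 'new');

echo decideImportAction(null, $stored), "\n";                        // create
echo decideImportAction($stored, crc32('Red bike' . 'new')), "\n";   // skip
// status changed, so the checksum differs (barring a crc32 collision)
echo decideImportAction($stored, crc32('Red bike' . 'sold')), "\n";
```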

 


Wow Horst! I can't thank you enough for your insight as well as the example. To be honest, I had no idea about crc32 or using uncache. I do have a question though: what are the benefits of using uncache when creating a new page within the functions?


Using uncache has nothing to do with the function calls.

Uncaching is useful in loops, at least with lots of pages, to free up memory.

Every page you create, update, or check values of is loaded into memory and takes up some space. Without actively freeing it, your available RAM gets smaller and smaller with each loop iteration. Therefore it is good practice to release objects you no longer need, even when the number of iterations is not that large.


@louisstephens I really like the answer @horst posted about this, but I want to ask if you intend to use LazyCron for doing this? If so, please be aware of the potentially long processing times associated with doing things this way, especially on the initial read of the feed. Also, there is no facility above for removing items that no longer appear in the feed but are still stored as PW pages. You might not need to do this though; it all depends on your application.

If anyone's interested, the way I've tackled this before, in outline, is to pre-process the feed and essentially do what horst posted about calculating a hash of the content (personally I don't like crc32 which returns an int but prefer the fixed length strings returned by md5 (which is fine for this - and fast)). Do filter out any feed fields that you don't intend to store before you calculate the hash so that insignificant changes in the feed don't trigger un-needed updates. Anyway, this gives a feed_id => hash_value map for each feed item. If we do this for the feed, we end up with a PHP array of these maps. This array can be stored persistently between each read of the feed. Let's call the previously created map, $prev_map, and the map for this read of the feed, $new_map.
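
As a rough illustration of that pre-processing step, here is a minimal sketch that builds a feed_id => hash map with md5, hashing only the fields that get stored. The function name `buildFeedMap`, the sample feed, and its field names (`id`, `title`, `status`, `views`) are all assumptions made up for the example, not part of any real feed.

```php
<?php
/**
 * Build a feed_id => md5 hash map from decoded feed items.
 *
 * $items:        array of stdClass objects, as returned by json_decode()
 * $storedFields: names of the feed fields that are actually saved to pages
 * $idField:      name of the feed field holding the unique id (assumed to be "id")
 */
function buildFeedMap(array $items, array $storedFields, string $idField = 'id'): array
{
    $map = [];
    foreach ($items as $item) {
        $data = (array) $item;
        // keep only the fields we store, so changes elsewhere don't alter the hash
        $subset = [];
        foreach ($storedFields as $field) {
            $subset[$field] = $data[$field] ?? null;
        }
        // json_encode keeps the field boundaries, unlike plain concatenation
        $map[$data[$idField]] = md5(json_encode($subset));
    }
    return $map;
}

// Made-up sample feed; "views" is a field we do not store
$feed = json_decode('[
    {"id": 101, "title": "Red bike",  "status": "new",  "views": 3},
    {"id": 102, "title": "Blue bike", "status": "sold", "views": 9}
]');

$new_map = buildFeedMap($feed, ['title', 'status']);
print_r($new_map); // two entries, keyed by 101 and 102, each a 32-character md5 string
```

Because `views` is left out of the hash, a change to it alone leaves the map untouched and triggers no update.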

You simply use PHP's built-in array functions to quickly find the records that have...

  1. Been added: $to_be_added = array_diff_key($new_map, $prev_map);
  2. Been deleted: $to_be_deleted = array_diff_key($prev_map, $new_map);
  3. Been updated: $to_be_updated = array_diff_assoc(array_intersect_key($new_map, $prev_map), $prev_map);

...all without having to go to the DB layer with selectors.

On the first run, when the $prev_map is an empty array, you'll be facing a full import of the feed - potentially a LOT of work for PW to do adding pages. Even reads of the feed that add a lot of new pages or update a lot of pages could require mucho processing, so you'll need to think about how you could handle that - especially if all this is triggered using LazyCron and takes place in the context of a web server process or thread - having that go unresponsive while it adds 100,000 pages to your site may not be considered good.

Finally, don't forget to overwrite $prev_map with $new_map and persist it.

* NB: For item 3 I've used array_intersect_key() before to get the ids present in both maps; array_diff_assoc() then keeps only those whose hash differs. array_uintersect_assoc($new_map, $prev_map, 'strcasecmp') would return the opposite set: the items whose hash is unchanged.
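
Here is a small worked example of those comparisons on two made-up maps, followed by one possible way to persist the map between runs. For the updated items it takes the keys common to both maps and keeps those whose hash differs, since an intersect on both key and value would return the unchanged entries instead. The short strings stand in for real md5 values, and the JSON file in the system temp directory is just an assumed storage location.

```php
<?php
// Hypothetical maps of feed_id => content hash
$prev_map = ['a1' => 'h-old-1', 'a2' => 'h-2', 'a3' => 'h-3'];
$new_map  = ['a1' => 'h-new-1', 'a2' => 'h-2', 'a4' => 'h-4'];

// In the new feed only
$to_be_added = array_diff_key($new_map, $prev_map);       // ['a4' => 'h-4']

// In the previous feed only
$to_be_deleted = array_diff_key($prev_map, $new_map);     // ['a3' => 'h-3']

// In both feeds, but with a different hash
$to_be_updated = array_diff_assoc(
    array_intersect_key($new_map, $prev_map),
    $prev_map
);                                                        // ['a1' => 'h-new-1']

// In both feeds with the same hash: nothing to do
$unchanged = array_intersect_assoc($new_map, $prev_map);  // ['a2' => 'h-2']

// Persist the new map for the next run (assumed location, pick your own)
$mapFile = sys_get_temp_dir() . '/feed_map_example.json';
file_put_contents($mapFile, json_encode($new_map));

// On the next run, load it back as the previous map
$prev_map = json_decode(file_get_contents($mapFile), true);
```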


Thanks @netcarver for the detailed reply. Horst's approach worked really well with a very small feed, and I was learning how to make a module to potentially handle this. I was hoping to tap into a cron job to handle the updating/adding at a specific time, like midnight every night, but maybe LazyCron might work inside the module. I haven't done much research into actually hooking into LazyCron within a module and getting the module to perform functions similar to horst's example above.

However, you make a good point regarding the size of the import. I believe I will be dealing with 500 to 600 items at most (though it could be as low as 200). I would say I am getting a lot better with PHP, as my background is in front-end, but I am enjoying the learning process.

Since the items will have a status (new, used, or sold), I was thinking that I could potentially write a function to trash the items marked as sold after a 24 hour period and empty the trash. Well, this was my thought earlier in my pre-planning stages.
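
The selection part of that idea might look something like the sketch below. It is plain PHP and only picks out the ids that are due for removal; the function name `findExpiredSoldIds`, the item shape, and the `sold_at` timestamp are assumptions for illustration (the feed may not provide such a timestamp, in which case it would have to be recorded when the status first changes to sold). The matching pages could then be moved to the trash with `$pages->trash($page)`.

```php
<?php
/**
 * Return the ids of items that have been marked "sold" for at least $maxAge seconds.
 *
 * $items:  list of ['id' => ..., 'status' => ..., 'sold_at' => unix timestamp] (assumed shape)
 * $now:    current unix timestamp, passed in so the function is easy to test
 * $maxAge: age in seconds after which a sold item is removed (default 24 hours)
 */
function findExpiredSoldIds(array $items, int $now, int $maxAge = 86400): array
{
    $ids = [];
    foreach ($items as $item) {
        if ($item['status'] === 'sold' && ($now - $item['sold_at']) >= $maxAge) {
            $ids[] = $item['id'];
        }
    }
    return $ids;
}

$now = 1700000000; // fixed timestamp for the example
$items = [
    ['id' => 1, 'status' => 'new',  'sold_at' => 0],
    ['id' => 2, 'status' => 'sold', 'sold_at' => $now - 25 * 3600], // sold 25 hours ago
    ['id' => 3, 'status' => 'sold', 'sold_at' => $now - 3600],      // sold 1 hour ago
];

print_r(findExpiredSoldIds($items, $now)); // only id 2 is returned
```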
