louisstephens

Checking page (all fields) against json feed


I have a script that is pulling in a json feed (will be attached to a cron job later) which looks like:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
}

Everything there works well and I can pull the id, title, status (updated, new, sold) and other items from the decoded feed in a foreach loop. My whole goal is to create pages from the feed, but if the page has already been created, with all the same exact items from the json feed, I will need to "skip" over it.

So far, I am running into a roadblock with my checks. I guess I need to compare the JSON to all my pages and their values and:

1. If an id already exists, check whether any field's data has changed and, if so, update the page.

2. If an id exists and all fields are unchanged, skip that item.

 

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        $u = new Page();
        $u->template = $templates->get("basic-page");
        $u->parent = $pages->get("/development/");
        $u->name = $feed->id; // use the feed id as the page name
        $u->title = $feed->title;
        $u->status = $feed->status;
        $u->body = $feed->title;
        $u->save();
        $u->setOutputFormatting(false);
    }
} else {
    echo "HTTP request failed: " . $http->getError();
}

I am really just hung up on how to do the current page checks and matching them with the json field data.


Two or three things come to mind right away:

If there is no unique ID within the feed, you have to create one per item from the feed data and save it in an uneditable or hidden field of your pages.

Additionally, you can concatenate all field values (strings and numbers) on the fly, generate a crc32 checksum (or something like it) from them, and save it in a hidden (or at least uneditable) field with every newly created or updated page.

Then, when running a new import loop, you extract or create the ID and compute the crc32 checksum of each feed item on the fly.

Query whether a page with that feed ID is already in the system. If not, create a new page and move on to the next item. If yes, compare the checksums: if they match, move on to the next item; if not, update the page with the new data.

 

Code example:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        // create or fetch the unique id for the current feed
        $feedID = $feed->unique_id;
        // create a checksum
        $crc32 = crc32($feed->title . $feed->body . $feed->status);

        $u = $pages->get("template=basic-page, parent=/development/, feed_id={$feedID}");
        if(0 == $u->id) {
            // no page with that id in the system
            $u = createNewPageFromFeed($feed, $feedID, $crc32);
            $pages->uncache($u);
            continue;
        }

        // page already exists, compare checksums
        if($crc32 == $u->crc32) {
            $pages->uncache($u);
            continue; // nothing changed
        }

        // changed values, we update the page
        $u = updatePageFromFeed($u, $feed, $crc32);
        $pages->uncache($u);
    }

} else {
    echo "HTTP request failed: " . $http->getError();
}

function createNewPageFromFeed($feed, $feedID, $crc32) {
    $u = new Page();
    $u->setOutputFormatting(false);
    $u->template = wire('templates')->get("basic-page");
    $u->parent = wire('pages')->get("/development/");
    $u->name = $feed->id; // use the feed id as the page name
    $u->title = $feed->title;
    // note: "status" is also a built-in Page property in ProcessWire (a status bitmask),
    // so the feed status may need a custom field under a different name
    $u->status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->feed_id = $feedID;
    $u->save();
    return $u;
}

function updatePageFromFeed($u, $feed, $crc32) {
    $u->of(false);
    $u->title = $feed->title;
    $u->status = $feed->status; // see the note on "status" above
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->save();
    return $u;
}

 


Wow Horst! I can't thank you enough for your insight as well as the example. To be honest, I had no idea about crc32 or using uncache. I do have a question though: what are the benefits of using uncache when creating a new page within the functions?


Using uncache has nothing to do with the function calls.

Uncaching is useful in loops, at least with lots of pages, to free up memory.

Every page you create, update, or check values of is loaded into memory and takes up some space. Without actively freeing it, your available RAM gets smaller and smaller with each loop iteration. It is therefore good practice to release objects you no longer need, even when the number of iterations is not that large.


@louisstephens I really like the answer @horst posted about this, but want to ask if you intend to use LazyCron for this. If so, please be aware of the potentially long processing times associated with doing things this way, especially on the initial read of the feed. Also, the code above has no facility for removing items that are stored in PW pages but no longer appear in the feed. You might not need this though; it all depends on your application.

If anyone's interested, the way I've tackled this before, in outline, is to pre-process the feed and essentially do what horst posted about calculating a hash of the content. Personally I don't like crc32, which returns an int; I prefer the fixed-length strings returned by md5, which is fine for this and fast. Do filter out any feed fields that you don't intend to store before you calculate the hash, so that insignificant changes in the feed don't trigger unneeded updates. Anyway, this gives a feed_id => hash_value entry for each feed item. If we do this for the whole feed, we end up with a PHP array mapping every feed_id to its hash. This array can be stored persistently between each read of the feed. Let's call the previously created map $prev_map, and the map for this read of the feed $new_map.
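As a rough illustration of that pre-processing step, here is a minimal sketch in plain PHP. The field list, the function name buildHashMap() and the sample feed are assumptions made up for the example; they are not from the posts above, and the real feed will have its own field names.

```php
<?php
// Hypothetical list of the feed fields that are actually stored on the pages.
const STORED_FIELDS = ['title', 'body', 'status'];

/**
 * Build a feed_id => md5 hash map from decoded feed items.
 * Assumes every item is an object with an "id" property.
 */
function buildHashMap(array $items, array $fields = STORED_FIELDS): array
{
    $map = [];
    foreach ($items as $item) {
        $kept = [];
        foreach ($fields as $field) {
            // Fields missing from an item are treated as empty strings.
            $kept[$field] = isset($item->$field) ? (string) $item->$field : '';
        }
        // Hashing the JSON of the kept fields preserves field boundaries,
        // so "ab" + "c" does not collide with "a" + "bc".
        $map[$item->id] = md5(json_encode($kept));
    }
    return $map;
}

// Sample feed; "views" is an example of a field that is not stored.
$feed = json_decode('[
    {"id": 101, "title": "Red car", "body": "Low mileage", "status": "new", "views": 3},
    {"id": 102, "title": "Blue car", "body": "One owner", "status": "sold", "views": 9}
]');
$new_map = buildHashMap($feed);
```

Because "views" is not in the field list, a change in that value alone leaves the hash untouched, which is the point of filtering before hashing.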

You simply use PHP's built-in array methods to quickly find the records that have...

  1. Been added: $to_be_added = array_diff_key($new_map, $prev_map);
  2. Been deleted: $to_be_deleted = array_diff_key($prev_map, $new_map);
  3. Been updated: $to_be_updated = array_diff_assoc(array_intersect_key($new_map, $prev_map), $prev_map);

...all without having to go to the DB layer with selectors.

On the first run, when the $prev_map is an empty array, you'll be facing a full import of the feed - potentially a LOT of work for PW to do adding pages. Even reads of the feed that add or update a lot of pages could require mucho processing, so you'll need to think about how you could handle that - especially if all this is triggered using LazyCron and takes place in the context of a web server process or thread. Having that go unresponsive while it adds 100,000 pages to your site may not be considered good.

Finally, don't forget to overwrite $prev_map with $new_map and persist it.

* NB: Item 3 builds on array_intersect_key(), which is what I've used before. array_uintersect_assoc($new_map, $prev_map, 'strcasecmp') on its own would return the records whose hash is unchanged, not the updated ones.
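The three comparisons can be tried out in isolation with a small sketch like the one below. The function name diffHashMaps() and the sample hashes are made up for the example. For the updated records it intersects on keys first and then keeps the entries whose hash differs, using array_intersect_key() followed by array_diff_assoc().

```php
<?php
/**
 * Compare two feed_id => hash maps and report what changed.
 */
function diffHashMaps(array $prev_map, array $new_map): array
{
    // Ids present only in the new map.
    $added = array_diff_key($new_map, $prev_map);
    // Ids present only in the previous map.
    $deleted = array_diff_key($prev_map, $new_map);
    // Ids present in both maps whose hash differs.
    $updated = array_diff_assoc(array_intersect_key($new_map, $prev_map), $prev_map);

    return ['added' => $added, 'deleted' => $deleted, 'updated' => $updated];
}

// Shortened placeholder hashes, for readability only.
$prev_map = [101 => 'aaa', 102 => 'bbb', 103 => 'ccc'];
$new_map  = [101 => 'aaa', 102 => 'xxx', 104 => 'ddd'];

$changes = diffHashMaps($prev_map, $new_map);
// added: id 104, deleted: id 103, updated: id 102, unchanged: id 101
```

Persisting the map between runs is left out here; writing json_encode($new_map) to a file and reading it back with json_decode($json, true) would be one simple option.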


Thanks @netcarver for the detailed reply. Horst's approach worked really well with a very small feed, and I was learning how to make a module to potentially handle this. I was hoping to tap into a cron job to handle the updating/adding at a specific time, like midnight every night, but maybe LazyCron might work inside the module. I haven't done much research into actually hooking into LazyCron within a module and getting the module to perform functions similar to horst's example above.

However, you make a good point regarding the number of pages. I believe I will be dealing with 500 to 600 at most (though it could be as low as 200). I would say I am getting a lot better with PHP, as my background is in front-end, but I am enjoying the learning process.

Since the items will have a status (new, used, or sold), I was thinking that I could potentially write a function to trash the items marked as sold after a 24 hour period and empty the trash. Well, this was my thought earlier in my pre-planning stages.
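The selection part of that idea could look roughly like the sketch below, in plain PHP. It assumes each item records when it was marked as sold; the 'sold_at' timestamp, the array layout and the function name soldItemsDue() are hypothetical and not part of the feed described above.

```php
<?php
/**
 * Return the ids of items that have been marked "sold" for at least $maxAge seconds.
 * Each item is a hypothetical array: ['id' => int, 'status' => string, 'sold_at' => unix timestamp].
 */
function soldItemsDue(array $items, int $now, int $maxAge = 86400): array
{
    $due = [];
    foreach ($items as $item) {
        if ($item['status'] === 'sold' && ($now - $item['sold_at']) >= $maxAge) {
            $due[] = $item['id'];
        }
    }
    return $due;
}

$now = 1700000000; // fixed timestamp so the example is reproducible
$items = [
    ['id' => 101, 'status' => 'sold', 'sold_at' => $now - 90000], // sold 25 hours ago
    ['id' => 102, 'status' => 'sold', 'sold_at' => $now - 3600],  // sold 1 hour ago
    ['id' => 103, 'status' => 'new',  'sold_at' => 0],
];
$due = soldItemsDue($items, $now);
```

In ProcessWire the matching pages would then presumably go through $pages->trash() and, if the trash really should be cleared, $pages->emptyTrash(); that part is not shown because it needs a running site.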
