louisstephens

Checking page (all fields) against json feed


I have a script that pulls in a JSON feed (it will be attached to a cron job later), which looks like this:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
}

Everything there works well, and I can pull the id, title, status (updated, new, sold) and other items from the decoded feed in a foreach loop. My goal is to create pages from the feed, but if a page has already been created with exactly the same items from the JSON feed, I need to skip over it.

So far, I am running into a roadblock with my checks. I guess I need to compare the JSON to all my pages and their values and:

1. If an id already exists, check whether any field's data has been updated and, if so, update the page.

2. If an id exists and all fields are unchanged, skip that item.

 

$http = new WireHttp();


// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        $u = new Page();
        $u->template = $templates->get("basic-page");
        $u->parent = $pages->get("/development/");
        $u->name = $feed->title = $feed->id;
        $u->title = $feed->title;
        $u->status = $feed->status;
        $u->body = $feed->title;
        $u->save();
        $u->setOutputFormatting(false);
    }
} else {
    echo "HTTP request failed: " . $http->getError();
}

I am really just hung up on how to check the existing pages and match them against the JSON field data.


Two or three things come to my mind directly:

If there is no unique ID within the feed, you have to create one from the feed data per item and save it into an uneditable or hidden field of your pages.
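As a rough illustration of creating an ID from the feed data, here is a minimal sketch. The helper name makeFeedId() and the choice of title and year as identifying fields are assumptions for the example only; which fields really identify an item depends on the feed. Fields that are expected to change, such as the status, are left out so the ID stays stable.

```php
<?php
// Hypothetical helper: derive a stable ID for a feed item that has no unique ID.
// $keyFields lists the properties assumed to identify an item; adjust to the feed.
function makeFeedId(stdClass $item, array $keyFields): string
{
    $parts = [];
    foreach ($keyFields as $field) {
        // Missing properties become empty strings so the ID stays deterministic.
        $parts[] = isset($item->$field) ? (string) $item->$field : '';
    }
    // A separator avoids collisions such as "ab" + "c" versus "a" + "bc".
    return md5(implode('|', $parts));
}

// Example item; the field names are made up for this sketch.
$item = json_decode('{"title": "Red bike", "year": 2015, "status": "new"}');
$feedID = makeFeedId($item, ['title', 'year']);
```

The resulting string could then be saved into the hidden feed_id field and used in the selector when looking up existing pages.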

Additionally, you may concatenate all field values (strings and numbers) on the fly, generate a crc32 checksum or something similar from them, and save it in a hidden (or at least uneditable) field with every newly created or updated page.

Then, when running a new importing loop, you extract or create the ID and create a crc32 checksum from the feed item on the fly.

Query whether a page with that feed ID is already in the system. If not, create a new page and move on to the next item. If yes, compare the checksums: if they match, move on to the next item; if not, update the page with the new data.

 

Code example:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        // create or fetch the unique id for the current feed
        $feedID = $feed->unique_id;
        // create a checksum
        $crc32 = crc32($feed->title . $feed->body . $feed->status);

        $u = $pages->get("template=basic-page, parent=/development/, feed_id={$feedID}");
        if(0 == $u->id) {
            // no page with that id in the system
            $u = createNewPageFromFeed($feed, $feedID, $crc32);
            $pages->uncache($u);
            continue;
        }

        // page already exists, compare checksums
        if($crc32 == $u->crc32) {
            $pages->uncache($u);
            continue; // nothing changed
        }

        // changed values, we update the page
        $u = updatePageFromFeed($u, $feed, $crc32);
        $pages->uncache($u);
    }

} else {
    echo "HTTP request failed: " . $http->getError();
}

function createNewPageFromFeed($feed, $feedID, $crc32) {
    $u = new Page();
    $u->setOutputFormatting(false);
    $u->template = wire('templates')->get("basic-page");
    $u->parent = wire('pages')->get("/development/");
    $u->name = $feed->title = $feed->id;
    $u->title = $feed->title;
    $u->status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->feed_id = $feedID;
    $u->save();
    return $u;
}

function updatePageFromFeed($u, $feed, $crc32) {
    $u->of(false);
    $u->title = $feed->title;
    $u->status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->save();
    return $u;
}

 


Wow Horst! I can't thank you enough for your insight as well as the example. To be honest, I had no idea about crc32 or using uncache. I do have a question though: what are the benefits of using uncache when creating a new page within the functions?


Using uncache has nothing to do with the function calls.

Uncaching is useful in loops, at least with lots of pages, to free up memory.

Every page you create, update, or check values of is loaded into memory and takes up some space. Without actively freeing it, your available RAM gets smaller and smaller with each loop iteration. Therefore, it is good practice to release objects you no longer need, even when the number of iterations is not that large.
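To make the pattern concrete outside of the import code, here is a small sketch of a loop that releases each page once it is done with it. The function name processPagesById() and the $work callback are hypothetical; $pages is assumed to be the ProcessWire $pages API variable, of which only get() and uncache() are used.

```php
<?php
// Process pages one at a time and release each from ProcessWire's page cache.
// $pages is expected to be the ProcessWire $pages API variable; $work is a
// hypothetical callback holding whatever needs doing per page.
function processPagesById($pages, array $ids, callable $work): int
{
    $done = 0;
    foreach ($ids as $id) {
        $p = $pages->get((int) $id);
        if (!$p->id) {
            continue; // no page with that id
        }
        $work($p);
        $pages->uncache($p); // release the memory held for this page
        $done++;
    }
    return $done;
}
```

Comparing memory_get_usage() before and after such a loop, with and without the uncache() call, is one way to see whether it makes a difference for a given number of pages.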


@louisstephens I really like the answer @horst posted about this, but want to ask whether you intend to use LazyCron for doing this. If so, please be aware of the potentially long processing times associated with doing things this way, especially on the initial read of the feed. Also, there is no facility above for removing items that no longer appear in the feed but are still stored in PW pages. You might not need to do this though; it all depends on your application.

If anyone's interested, the way I've tackled this before, in outline, is to pre-process the feed and essentially do what horst posted about: calculate a hash of the content. (Personally I don't like crc32, which returns an int; I prefer the fixed-length strings returned by md5, which is fine for this, and fast.) Do filter out any feed fields that you don't intend to store before you calculate the hash, so that insignificant changes in the feed don't trigger unneeded updates. Anyway, this gives a feed_id => hash_value entry for each feed item. If we do this for the whole feed, we end up with one PHP array mapping every feed ID to its hash. This array can be stored persistently between each read of the feed. Let's call the previously created map $prev_map, and the map for this read of the feed $new_map.
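A minimal sketch of building and persisting such a map could look like the following. The function names, the JSON file used as storage, and the example field names (id, title, status, views) are all assumptions for illustration; any persistent store would do.

```php
<?php
// Build a feed_id => md5 hash map from decoded feed items.
// $idField and $fields are assumptions; use the names from the real feed.
function buildHashMap(array $items, string $idField, array $fields): array
{
    $map = [];
    foreach ($items as $item) {
        $values = [];
        foreach ($fields as $field) {
            // Only the fields that get stored contribute to the hash.
            $values[$field] = isset($item->$field) ? (string) $item->$field : '';
        }
        $map[(string) $item->$idField] = md5(json_encode($values));
    }
    return $map;
}

// Persist the map between runs; a JSON file is just one possible store.
function saveMap(string $file, array $map): void
{
    file_put_contents($file, json_encode($map), LOCK_EX);
}

// Load the previous map, or an empty array on the first run.
function loadMap(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $map = json_decode(file_get_contents($file), true);
    return is_array($map) ? $map : [];
}

// Example feed; "views" is not stored, so it is left out of the hash.
$feed = json_decode('[{"id": 1, "title": "A", "status": "new", "views": 10},
                      {"id": 2, "title": "B", "status": "sold", "views": 3}]');
$new_map = buildHashMap($feed, 'id', ['title', 'status']);
```

Because views is not in the list of hashed fields, a change to it alone leaves the hash untouched and would not trigger an update.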

You simply use PHP's built-in array methods to quickly find the records that have...

  1. Been added: $to_be_added = array_diff_key($new_map, $prev_map);
  2. Been deleted: $to_be_deleted = array_diff_key($prev_map, $new_map);
  3. Been updated: $to_be_updated = array_diff_assoc(array_intersect_key($new_map, $prev_map), $prev_map);

...all without having to go to the DB layer with selectors.

On the first run, when $prev_map is an empty array, you'll be facing a full import of the feed, which is potentially a LOT of work for PW to do adding pages. Even reads of the feed that add or update a lot of pages could require a lot of processing, so you'll need to think about how you could handle that, especially if all this is triggered using LazyCron and takes place in the context of a web server process or thread. Having that go unresponsive while it adds 100,000 pages to your site may not be considered good.

Finally, don't forget to overwrite $prev_map with $new_map and persist it.

* NB: I've not done item 3 exactly this way before (I used array_intersect_key()). Note that array_uintersect_assoc($new_map, $prev_map, 'strcasecmp') would return the entries whose hashes match, i.e. the unchanged records, so it is not the right tool for item 3.
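The three lookups can be sketched as one small function. The name diffMaps() and the sample IDs and hashes are made up for the example. As far as I can tell from the PHP manual, array_uintersect_assoc() with a comparison callback keeps the entries whose values compare as equal, so this sketch uses array_diff_assoc() on the common keys to get the updated set.

```php
<?php
// Split two feed_id => hash maps into added, deleted and updated IDs.
function diffMaps(array $prev_map, array $new_map): array
{
    // Entries of $new_map whose IDs were also present in the previous run.
    $common = array_intersect_key($new_map, $prev_map);
    return [
        'added'   => array_keys(array_diff_key($new_map, $prev_map)),
        'deleted' => array_keys(array_diff_key($prev_map, $new_map)),
        // Same ID, different hash: array_diff_assoc() keeps the entries of
        // $common whose key => value pair is not found in $prev_map.
        'updated' => array_keys(array_diff_assoc($common, $prev_map)),
    ];
}

$prev_map = ['a1' => 'hash-1', 'a2' => 'hash-2', 'a3' => 'hash-3'];
$new_map  = ['a2' => 'hash-2', 'a3' => 'hash-3b', 'a4' => 'hash-4'];
$changes  = diffMaps($prev_map, $new_map);
// added: a4, deleted: a1, updated: a3 (a2 is unchanged and appears nowhere)
```

Only the IDs in these three lists need to touch ProcessWire at all; everything else in the feed can be skipped without running a selector.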


Thanks @netcarver for the detailed reply. Horst's approach worked really well with a very small feed, and I was learning how to make a module to potentially handle this. I was hoping to tap into a cron job to handle the updating/adding at a specific time, like midnight every night, but maybe LazyCron might work inside the module. I haven't done much research into actually hooking into LazyCron within a module and getting the module to perform functions similar to horst's example above.

However, you make a good point regarding the number of pages. I believe I will be dealing with 500 to 600 items at most (though it could be as low as 200). I would say I am getting a lot better with PHP, as my background is in front-end, but I am enjoying the learning process.

Since the items will have a status (new, used, or sold), I was thinking that I could potentially write a function to trash the items marked as sold after a 24 hour period and empty the trash. Well, this was my thought earlier in my pre-planning stages.
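A rough sketch of that idea follows, under several assumptions: the item status is stored in a custom field with the hypothetical name feed_status (in ProcessWire, $page->status is a built-in property holding the page's status flags), the page's modified timestamp is used as a stand-in for "when it was marked sold", and $pages is the ProcessWire $pages API variable. Emptying the trash afterwards is left out here.

```php
<?php
// Decide whether a sold item is old enough to be trashed.
// $modified is a Unix timestamp, such as the one in $page->modified.
function isExpiredSold(string $itemStatus, int $modified, int $now, int $maxAge = 86400): bool
{
    return $itemStatus === 'sold' && ($now - $modified) >= $maxAge;
}

// Trash every expired sold page; returns the number of pages trashed.
// "feed_status" is a hypothetical custom field holding new/used/sold.
function trashExpiredSold($pages, $items, int $now): int
{
    $count = 0;
    foreach ($items as $p) {
        if (isExpiredSold((string) $p->feed_status, (int) $p->modified, $now)) {
            $pages->trash($p);
            $count++;
        }
    }
    return $count;
}

// Possible usage inside ProcessWire (the selector values are assumptions):
// $items = $pages->find("template=basic-page, parent=/development/, feed_status=sold");
// trashExpiredSold($pages, $items, time());
```

This only works as intended if a page is not saved again after it has been marked as sold, which fits the checksum approach above, where unchanged items are skipped.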
