louisstephens

Checking page (all fields) against json feed


I have a script that is pulling in a json feed (will be attached to a cron job later) which looks like:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
}

Everything there works well and I can pull the id, title, status (updated, new, sold) and other items from the decoded feed in a foreach loop. My whole goal is to create pages from the feed, but if the page has already been created, with all the same exact items from the json feed, I will need to "skip" over it.

So far, I am running into a roadblock with my checks. I guess I need to compare the json to all my pages and their values and:

1. If an id already exists, check to see if a field's data has been updated and then update the page,

2. If an id exists and all fields are unchanged, skip adding that page

 

$http = new WireHttp();


// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        $u = new Page();
        $u->template = $templates->get("basic-page");
        $u->parent = $pages->get("/development/");
        $u->name = $feed->title = $feed->id;
        $u->title = $feed->title;
        $u->status = $feed->status;
        $u->body = $feed->title;
        $u->save();
        $u->setOutputFormatting(false);
    }
} else {
    echo "HTTP request failed: " . $http->getError();
}

I am really just hung up on how to do the current page checks and matching them with the json field data.


Two or three things come to my mind directly:

If there is no unique ID within the feed, you have to create one from the feed data per item and save it into an uneditable or hidden field of your pages.

Additionally, you can concatenate all field values (strings and numbers) on the fly, generate a crc32 checksum (or something similar) from them, and save it into a hidden (or at least uneditable) field with every newly created or updated page.

Then, when running a new import loop, you extract or create the ID and create a crc32 checksum from the feed item on the fly.

Query whether a page with that feed ID is already in the system. If not, create a new page and move on to the next item. If yes, compare the checksums: if they match, move on to the next item; if not, update the page with the new data.

 

Code example:

$http = new WireHttp();

// Get the contents of a URL
$response = $http->get("feed_url");
if($response !== false) {
    $decodedFeed = json_decode($response);
    foreach($decodedFeed as $feed) {
        // create or fetch the unique id for the current feed
        $feedID = $feed->unique_id;
        // create a checksum
        $crc32 = crc32($feed->title . $feed->body . $feed->status);

        $u = $pages->get("template=basic-page, parent=/development/, feed_id={$feedID}");
        if(0 == $u->id) {
            // no page with that id in the system
            $u = createNewPageFromFeed($feed, $feedID, $crc32);
            $pages->uncache($u);
            continue;
        }

        // page already exists, compare checksums
        if($crc32 == $u->crc32) {
            $pages->uncache($u);
            continue; // nothing changed
        }

        // changed values, we update the page
        $u = updatePageFromFeed($u, $feed, $crc32);
        $pages->uncache($u);
    }

} else {
    echo "HTTP request failed: " . $http->getError();
}

function createNewPageFromFeed($feed, $feedID, $crc32) {
    $u = new Page();
    $u->setOutputFormatting(false);
    $u->template = wire('templates')->get("basic-page");
    $u->parent = wire('pages')->get("/development/");
    $u->name = $feed->title = $feed->id;
    $u->title = $feed->title;
    $u->status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->feed_id = $feedID;
    $u->save();
    return $u;
}

function updatePageFromFeed($u, $feed, $crc32) {
    $u->of(false);
    $u->title = $feed->title;
    $u->status = $feed->status;
    $u->body = $feed->title;
    $u->crc32 = $crc32;
    $u->save();
    return $u;
}

 


Wow Horst! I can't thank you enough for your insight as well as the example. To be honest, I had no idea about crc32 or using uncache. I do have a question though: what are the benefits of using uncache when creating a new page within the functions?


Using uncache has nothing to do with the function calls.

Uncaching is useful in loops, at least with lots of pages, to free up memory.

Every page you create, update, or check values of is loaded into memory and takes up some space. Without actively freeing it, your available RAM gets smaller and smaller with each loop iteration. Therefore it is good practice to release objects you no longer need, even when the number of iterations is not that large.


@louisstephens I really like the answer @horst posted about this, but I want to ask whether you intend to use LazyCron for this. If so, please be aware of the potentially long processing times associated with doing things this way, especially on the initial read of the feed. Also, there is no facility above for removing items that no longer appear in the feed but are still stored as PW pages. You might not need to do this, though; it all depends on your application.

If anyone's interested, the way I've tackled this before, in outline, is to pre-process the feed and essentially do what horst posted about calculating a hash of the content. Personally I don't like crc32, which returns an int, and prefer the fixed-length strings returned by md5 (which is fine for this, and fast). Do filter out any feed fields that you don't intend to store before you calculate the hash, so that insignificant changes in the feed don't trigger unneeded updates. Anyway, this gives a feed_id => hash_value entry for each feed item. If we do this for the whole feed, we end up with a PHP array mapping every feed_id to its hash. This array can be stored persistently between each read of the feed. Let's call the previously created map $prev_map, and the map for this read of the feed $new_map.
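Building that map could look roughly like the sketch below. It is plain PHP with no ProcessWire calls; the STORED_FIELDS list, the buildHashMap() helper, the `id` property and the sample feed are all assumptions made up for illustration.

```php
<?php
// Hypothetical list of the feed fields that are actually stored on pages.
const STORED_FIELDS = ['title', 'body', 'status'];

/**
 * Build a feed_id => md5 hash map from decoded feed items.
 * Assumes each item is an object with an `id` property (an assumption).
 */
function buildHashMap(array $items, array $fields = STORED_FIELDS): array
{
    $map = [];
    foreach ($items as $item) {
        $values = [];
        foreach ($fields as $field) {
            // Missing fields hash as empty strings so the layout stays stable.
            $values[$field] = isset($item->$field) ? (string) $item->$field : '';
        }
        // json_encode keeps field boundaries, unlike plain concatenation.
        $map[(string) $item->id] = md5(json_encode($values));
    }
    return $map;
}

// Made-up feed; "fetched_at" is not stored, so it must not affect the hash.
$feed = json_decode('[
    {"id": 101, "title": "Red car", "body": "Low mileage", "status": "new", "fetched_at": "2019-01-01"},
    {"id": 102, "title": "Blue car", "body": "One owner", "status": "sold", "fetched_at": "2019-01-01"}
]');

$new_map = buildHashMap($feed);
print_r($new_map);
```

Hashing only the stored fields means a change to something like the made-up fetched_at value leaves the hash untouched, while a changed title produces a new hash.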

You simply use PHP's built-in array functions to quickly find the records that have...

  1. Been added: $to_be_added = array_diff_key($new_map, $prev_map);
  2. Been deleted: $to_be_deleted = array_diff_key($prev_map, $new_map);
  3. Been updated: $to_be_updated = array_diff_assoc(array_intersect_key($new_map, $prev_map), $prev_map);

...all without having to go to the DB layer with selectors.
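Here is a small sketch of those set operations on made-up maps (the ids and hash strings are placeholders, not real md5 values). One thing to watch: an intersection with index check, such as array_intersect_assoc() or array_uintersect_assoc(), yields the entries whose key and hash both match, so the updated set is computed here as the common keys whose hashes differ.

```php
<?php
// Hypothetical maps of feed_id => hash from the previous and current reads.
$prev_map = ['a1' => 'hash-1', 'a2' => 'hash-2', 'a3' => 'hash-3'];
$new_map  = ['a2' => 'hash-2', 'a3' => 'hash-3-changed', 'a4' => 'hash-4'];

// Keys only in the new map: items to add.
$to_be_added = array_diff_key($new_map, $prev_map);

// Keys only in the previous map: items to delete.
$to_be_deleted = array_diff_key($prev_map, $new_map);

// Keys present in both maps whose hash differs: items to update.
$common = array_intersect_key($new_map, $prev_map);
$to_be_updated = array_diff_assoc($common, $prev_map);

// Keys present in both maps with an identical hash: nothing to do.
$unchanged = array_intersect_assoc($new_map, $prev_map);

print_r(array_keys($to_be_added));
print_r(array_keys($to_be_deleted));
print_r(array_keys($to_be_updated));
print_r(array_keys($unchanged));
```

Only the ids in the first three arrays need any page work; everything in $unchanged can be skipped without loading a page.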

On the first run, when the $prev_map is an empty array, you'll be facing a full import of the feed - potentially a LOT of work for PW to do adding pages. Even reads of the feed that add a lot of new pages or update a lot of pages could require mucho processing, so you'll need to think about how you could handle that - especially if all this is triggered using LazyCron and takes place in the context of a web server process or thread - having that go unresponsive while it adds 100,000 pages to your site may not be considered good.

Finally, don't forget to overwrite $prev_map with $new_map and persist it.
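One way to persist the map between runs, sketched under the assumption that a JSON file on disk is acceptable. The loadMap() and saveMap() helpers and the file location are hypothetical; any other store you already have would work just as well.

```php
<?php
/**
 * Load the previously stored map; an empty array means "first run".
 */
function loadMap(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $data = json_decode(file_get_contents($file), true);
    return is_array($data) ? $data : [];
}

/**
 * Write the map to a temp file, then rename it over the old one so a
 * crash mid-write does not leave a truncated map behind.
 */
function saveMap(string $file, array $map): bool
{
    $tmp = $file . '.tmp';
    if (file_put_contents($tmp, json_encode($map), LOCK_EX) === false) {
        return false;
    }
    return rename($tmp, $file);
}

// Hypothetical location; any writable path outside the web root would do.
$mapFile = sys_get_temp_dir() . '/feed_hash_map.json';

$prev_map = loadMap($mapFile);
$new_map  = ['101' => md5('item 101'), '102' => md5('item 102')];
// ... add, update and delete pages here ...
saveMap($mapFile, $new_map);
```

Saving only after the page work has finished means a failed run is simply retried against the old map next time.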

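* NB: I've not done item 3 exactly this way before (I used array_intersect_key()). Note that array_uintersect_assoc() returns the entries whose key and hash both match, i.e. the unchanged ones, so the updated set is the common keys whose hashes differ.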


Thanks @netcarver for the detailed reply. Horst's approach worked really well with a very small feed, and I was learning how to make a module to potentially handle this. I was hoping to tap into a cron job to handle the updating/adding at a specific time, like midnight every night, but maybe LazyCron might work inside the module. I haven't done much research into actually hooking into LazyCron within a module and getting the module to perform functions similar to horst's example above.

However, you make a good point regarding the number of pages. I believe I will be dealing with 500 to 600 at most (though it could be as low as 200). I would say I am getting a lot better with PHP, as my background is in front-end, but I am enjoying the learning process.

Since the items will have a status (new, used, or sold), I was thinking that I could potentially write a function to trash the items marked as sold after a 24 hour period and empty the trash. Well, this was my thought earlier in my pre-planning stages.
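A rough sketch of the selection step for that idea, in plain PHP. It assumes each item records when it was marked sold; the sold_at timestamp, the array shape and the idsToPurge() helper are all made up for illustration, and the actual trashing of pages is left out.

```php
<?php
/**
 * Return the ids of items that have been marked "sold" for longer than
 * $maxAge seconds. $items maps id => ['status' => ..., 'sold_at' => unix time];
 * this shape and the sold_at timestamp are assumptions for illustration.
 */
function idsToPurge(array $items, int $now, int $maxAge = 86400): array
{
    $ids = [];
    foreach ($items as $id => $item) {
        // Skip anything not sold, or sold without a recorded time.
        if ($item['status'] !== 'sold' || empty($item['sold_at'])) {
            continue;
        }
        if ($now - $item['sold_at'] > $maxAge) {
            $ids[] = $id;
        }
    }
    return $ids;
}

$now = 1700000000; // fixed "current time" so the example is reproducible
$items = [
    'a1' => ['status' => 'sold', 'sold_at' => $now - 90000], // sold 25 hours ago
    'a2' => ['status' => 'sold', 'sold_at' => $now - 3600],  // sold 1 hour ago
    'a3' => ['status' => 'new',  'sold_at' => null],
];

print_r(idsToPurge($items, $now));
```

The main design point is that the status alone is not enough: something has to record when an item became sold, otherwise the 24 hour window cannot be measured.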
