combicart

Import pages with XML feed


I'm currently setting up a new website. Most of the pages on the website will be imported from an XML feed. The feed itself is located at a URL and is updated daily. The pages should basically be a copy of the XML feed, so based on each import, pages may need to be created, updated or deleted.

As far as I understand, there are a couple of ways ProcessWire could import the feed:
- Through one of the importer modules (https://modules.processwire.com/modules/import-pages-xml/)
- With the new JSON import function available from version 3.0.64 (https://processwire.com/blog/posts/a-look-at-upcoming-page-export-import-functions/)
- Using the API (https://processwire.com/talk/topic/352-creating-pages-via-api/)

The feed itself is around 16 MB and contains about 1300 pages. Each page has around 10 images of 1 MB each (13 GB in total).

Has anybody worked with this kind of setup, or does anyone have advice on the best way to start?

Thanks!


I've tried various modules and import formats (CSV, XML, etc.) and settled on a custom function using the API and the XML format.

I found the CSV format limiting, as it can't handle data containing commas well, nor more complex data types.
Some of the modules, like the importer module, also have limitations, such as not being able to handle Repeater fields.

I use PHP's SimpleXML extension to parse the XML file and then use the PW API to create pages.

 

I've included a code snippet below to illustrate.

Note that the category field below is a Page reference field, so I create a PageArray first, populate it, then assign that PageArray to the category field.

    public function Execute()
    {
        $import = true;
        $xmlFile = $this->file;
        if ($import) {
            if (!file_exists($xmlFile))
                exit($xmlFile . ' failed to open');

            $items = simplexml_load_file($xmlFile);

            foreach ($items as $xml) {
                // use new \ProcessWire\Page() when running namespaced (PW 3.x)
                $p = new \Page();

                // turn off output formatting before populating fields
                $p->of(false);

                $p->template = wire("templates")->get("id=" . (int) $xml->template);
                $p->parent = wire("pages")->get("id=" . (int) $xml->parent);
                $p->title = (string) $xml->title;
                $p->name = wire("sanitizer")->pageName((string) $xml->title);

                $p->author = (string) $xml->author;
                $p->content_path = (string) $xml->content_path;
                ...
                ...

                // build a PageArray first, then import it into the page field
                // (use new \ProcessWire\PageArray() when namespaced)
                $cats = new \PageArray();

                foreach ($xml->category->id as $id) {
                    $cat = wire("pages")->get((int) $id);

                    // get() returns a NullPage when nothing matches
                    if (!($cat instanceof \NullPage))
                        $cats->add($cat);
                }

                $p->category->import($cats);

                $p->save();
                echo 'new page <a href="' . $p->editUrl . '" target="_blank">' . $p->path . '</a><br>';
            }
        }
    }
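For image fields (relevant when the feed references remote images, as in the setup described above), ProcessWire can download a file straight from a URL when you add it to a files/images field. A minimal sketch - the `images` field name and the feed structure (`$xml->images->url`) are my assumptions, not from the thread:

```php
// sketch: attach remote images after the page has been saved
// (a page must already exist before files can be added to it)
foreach ($xml->images->url as $url) {
    // ProcessWire downloads the file when add() is given a URL
    $p->images->add((string) $url);
}
$p->save('images'); // save just the images field
```

This fragment runs inside the import loop above; with ~13 GB of images you would likely also want to check whether an image is already present before adding it again.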


 

3 hours ago, FrancisChung said:

I found the CSV format limiting, as it can't handle data containing commas well, nor more complex data types.

I am getting a little OT, but it's possible to support commas and complex data types in CSV with a proper CSV parser; PHP's native CSV functions are not great. I make use of https://github.com/parsecsv/parsecsv-for-php in a couple of my modules. It's an older library that doesn't appear to be maintained anymore; there might be better ones out there, but at the time it handled everything I needed better than anything else I found.

Back to the topic at hand - I am curious about a feed of 1300 pages - that could definitely take a little while to process. It's a shame it's not just the addition of new pages - you could make that very quick if the XML feed entries had a date. But if you have to process them all for updates and deletions, you will be relying on the title of the page for matches. Will you just update them all, or will you compare the contents of fields for differences and process a page only if there has been a change?
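One way to avoid comparing field by field is to fingerprint each feed entry and store the hash on the page, so unchanged entries can be skipped in one comparison. A sketch, assuming a hidden text field named `import_hash` and a `$parent` page variable (both my assumptions, not from the thread):

```php
// sketch: skip unchanged entries via a stored content hash
// assumes a text field "import_hash" exists on the template
$hash = md5($xml->asXML()); // fingerprint of the whole feed entry

$selector = "parent=$parent, title=" . wire('sanitizer')->selectorValue((string) $xml->title);
$p = wire('pages')->get($selector);

if ($p->id && $p->import_hash === $hash) {
    continue; // entry unchanged since the last import, skip it
}

// ...create or update the page as usual, then record the hash...
$p->of(false);
$p->import_hash = $hash;
$p->save();
```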


Thanks @FrancisChung and @adrian!

Will check out both SimpleXML and the CSV parser to see which approach fits best for the XML feed.

About the 1300 pages: yeah, I've already looked into ways to speed up the import process. Unfortunately they only provide one large XML feed, which contains both the old and the updated pages.

They provide a date and a UUID field inside the XML feed, which I could use to check whether something has been updated. I totally agree that ideally only the changes should be processed instead of the complete file again and again.
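With a UUID and a date available, a full sync could key everything on the UUID: create a page when the UUID is unknown, update when the date changed, and delete pages whose UUID no longer appears in the feed. A rough sketch, assuming text fields `uuid` and `import_date` on the template, a `$parent` page and an `item` template (all hypothetical names, not from the feed):

```php
// sketch: create/update/delete sync keyed on the feed's UUID
$seen = array();

foreach ($items as $xml) {
    $uuid = (string) $xml->uuid;
    $seen[] = $uuid;

    $p = wire('pages')->get("parent=$parent, uuid=" . wire('sanitizer')->selectorValue($uuid));

    if (!$p->id) {
        $p = new \Page();           // unknown UUID: create a new page
        $p->template = 'item';      // hypothetical template name
        $p->parent = $parent;
        $p->of(false);
        $p->uuid = $uuid;
    } elseif ($p->import_date === (string) $xml->date) {
        continue;                   // date unchanged: skip this entry
    }

    $p->of(false);
    $p->import_date = (string) $xml->date;
    // ...populate the remaining fields...
    $p->save();
}

// remove pages that dropped out of the feed
foreach (wire('pages')->find("parent=$parent") as $old) {
    if ($old->uuid && !in_array($old->uuid, $seen)) wire('pages')->delete($old);
}
```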

3 minutes ago, combicart said:

the CSV parser to see which approach fits best for the XML feed

A CSV parser will be no use for an XML feed - I was just responding to @FrancisChung's comments about CSV not handling certain things.

 

4 minutes ago, combicart said:

They provide a date and UUID field inside the XML feed which I could use to check if there is something updated or not.

That's good news for sure - that should make things more manageable.


@adrian, good to know there are other CSV parsers out there that can handle more complex data types.

From my perspective, when I had to incorporate collection & repeater types into our feed, I faced an uphill battle with the CSV parser I was using (SimpleExcel). 

I didn't have any (file) format restrictions, so it just seemed like a good decision to switch from CSV to XML, which IMHO is a better format for dealing with such data types.
 


I face a similar problem: 20k+ entries from external sources.

XMLReader is quite fast and has low memory usage. It's also pretty simple once you understand its logic.

$xml = new \XMLReader();
$xml->open($file->filename);
while ($xml->next('tagName')) {
	if ($xml->nodeType != \XMLReader::ELEMENT) continue; // skip the end element

	// process attributes:   $xml->getAttribute('attr')
	// inner or outer XML:   $xml->readOuterXML()
	// you can even convert it to SimpleXML or DOM
}
$xml->close();
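To keep XMLReader's low memory footprint while getting SimpleXML's convenient property access, each matched element can be expanded with readOuterXML() and handed to SimpleXML. A small self-contained sketch (the `item`/`title` element names are just for illustration):

```php
<?php
// stream a large XML file element by element,
// converting each matching element to SimpleXML
function streamItems($filename, $tag)
{
    $titles = array();
    $xml = new \XMLReader();
    $xml->open($filename);

    // move to the first element with the given name
    while ($xml->read() && $xml->name !== $tag);

    while ($xml->name === $tag) {
        // readOuterXML() returns the whole current element as a string
        $node = simplexml_load_string($xml->readOuterXML());
        $titles[] = (string) $node->title;
        $xml->next($tag); // jump to the next sibling, skipping the subtree
    }
    $xml->close();
    return $titles;
}
```

This way only one element is ever held in memory, while field access inside the loop stays as simple as `$node->title`.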

 

21 hours ago, adrian said:

@FrancisChung - I agree, although I would go with JSON over XML given the choice.

I would too ... except the data source is an Excel spreadsheet, so my hands are tied. So I did have file format restrictions after all, lol.

