combicart

Import pages with XML feed


I'm currently setting up a new website. Most of the pages on the website will be imported from an XML feed. The feed itself is located at a URL and is updated daily. The pages should basically be a copy of the XML feed, so based on each import, pages may need to be created, updated or deleted.

As far as I understand, there are a couple of ways ProcessWire could import the feed:
- Through one of the importer modules (https://modules.processwire.com/modules/import-pages-xml/)
- With the new JSON import function available from version 3.0.64 (https://processwire.com/blog/posts/a-look-at-upcoming-page-export-import-functions/)
- Using the API (https://processwire.com/talk/topic/352-creating-pages-via-api/)

The feed itself is around 16 MB and contains about 1300 pages. Each page has around 10 images of 1 MB each (13 GB in total).

Has anybody worked with this kind of setup, or does anyone have advice on the best way to start?

Thanks!


I've tried various modules and import formats (CSV, XML, etc.) and settled on a custom function using the API and the XML format.

I found the CSV format limiting, as it can't handle data containing commas well, nor more complex data types.
Some of the modules, like the importer module, also have limitations, such as not being able to handle Repeater fields.

I use PHP's SimpleXML extension to parse the XML file and then use the PW API to create pages.

 

I've included a code snippet below to illustrate.

Note that the category field below is a Page reference field, so I create a PageArray first, populate it, then assign that PageArray to the category field.

    public function Execute()
    {
        $import = true;
        $xmlFile = $this->file;
        if ($import) {
            if (!file_exists($xmlFile))
                exit($xmlFile . ' failed to open');

            $items = simplexml_load_file($xmlFile);

            foreach ($items as $xml) {
                // use new \ProcessWire\Page() when running namespaced (PW 3.x)
                $p = new \Page();

                // turn off output formatting before populating fields
                $p->of(false);

                $p->template = wire("templates")->get("id=" . (int) $xml->template);
                $p->parent = wire("pages")->get("id=" . (int) $xml->parent);
                $p->title = (string) $xml->title;
                $p->name = wire("sanitizer")->pageName((string) $xml->title);

                $p->author = (string) $xml->author;
                $p->content_path = (string) $xml->content_path;
                ...
                ...

                // build a PageArray first, then import it into the page field
                // (use new \ProcessWire\PageArray() when namespaced)
                $cats = new \PageArray();

                foreach ($xml->category->id as $id) {
                    $cat = wire("pages")->get((int) $id);

                    // get() returns a NullPage when nothing matches
                    if (!($cat instanceof \NullPage))
                        $cats->add($cat);
                }

                $p->category->import($cats);

                $p->save();
                echo 'new page <a href="' . $p->editUrl . '" target="_blank">' . $p->path . '</a><br>';
            }
        }
    }
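For image fields (relevant when the feed references remote images, as in the setup described above), ProcessWire can download a file straight from a URL when you add it to a files/images field. A minimal sketch - the `images` field name and the feed structure (`$xml->images->url`) are my assumptions, not from the thread:

```php
// sketch: attach remote images after the page has been saved
// (a page must already exist before files can be added to it)
foreach ($xml->images->url as $url) {
    // ProcessWire downloads the file when add() is given a URL
    $p->images->add((string) $url);
}
$p->save('images'); // save just the images field
```

This fragment runs inside the import loop above; with ~13 GB of images you would likely also want to check whether an image is already present before adding it again.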


 

3 hours ago, FrancisChung said:

I found the CSV format limiting, as it can't handle data containing commas well, nor more complex data types.

I am getting a little OT, but it's possible to support commas and complex data types in CSV with a proper CSV parser; PHP's native CSV functions are not great. I make use of https://github.com/parsecsv/parsecsv-for-php in a couple of my modules. It's an older library that doesn't appear to be maintained anymore; there might be better ones out there, but at the time it handled everything I needed better than anything else I found.

Back to the topic at hand - I am curious about a feed of 1300 pages - that could definitely take a little while to process. It's a shame it's not just the addition of new pages - you could make that very quick if the XML feed entries had a date. But if you have to process them all for updates and deletions, you will be relying on the title of the page for matches. Will you just update them all, or will you compare the contents of fields for differences and process a page only if there has been a change?
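One way to avoid comparing field by field is to fingerprint each feed entry and store the hash on the page, so unchanged entries can be skipped in one comparison. A sketch, assuming a hidden text field named `import_hash` and a `$parent` page variable (both my assumptions, not from the thread):

```php
// sketch: skip unchanged entries via a stored content hash
// assumes a text field "import_hash" exists on the template
$hash = md5($xml->asXML()); // fingerprint of the whole feed entry

$selector = "parent=$parent, title=" . wire('sanitizer')->selectorValue((string) $xml->title);
$p = wire('pages')->get($selector);

if ($p->id && $p->import_hash === $hash) {
    continue; // entry unchanged since the last import, skip it
}

// ...create or update the page as usual, then record the hash...
$p->of(false);
$p->import_hash = $hash;
$p->save();
```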


Thanks @FrancisChung and @adrian!

Will check out both SimpleXML and the CSV parser to see which approach fits best for the XML feed.

About the 1300 pages: yeah, I've already looked into ways to speed up the import process. Unfortunately they only provide one large XML feed, which contains both the old and the updated pages.

They provide a date and a UUID field inside the XML feed, which I could use to check whether something has been updated. I totally agree that ideally only the changes should be processed instead of the complete file again and again.
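With a UUID and a date available, a full sync could key everything on the UUID: create a page when the UUID is unknown, update when the date changed, and delete pages whose UUID no longer appears in the feed. A rough sketch, assuming text fields `uuid` and `import_date` on the template, a `$parent` page and an `item` template (all hypothetical names, not from the feed):

```php
// sketch: create/update/delete sync keyed on the feed's UUID
$seen = array();

foreach ($items as $xml) {
    $uuid = (string) $xml->uuid;
    $seen[] = $uuid;

    $p = wire('pages')->get("parent=$parent, uuid=" . wire('sanitizer')->selectorValue($uuid));

    if (!$p->id) {
        $p = new \Page();           // unknown UUID: create a new page
        $p->template = 'item';      // hypothetical template name
        $p->parent = $parent;
        $p->of(false);
        $p->uuid = $uuid;
    } elseif ($p->import_date === (string) $xml->date) {
        continue;                   // date unchanged: skip this entry
    }

    $p->of(false);
    $p->import_date = (string) $xml->date;
    // ...populate the remaining fields...
    $p->save();
}

// remove pages that dropped out of the feed
foreach (wire('pages')->find("parent=$parent") as $old) {
    if ($old->uuid && !in_array($old->uuid, $seen)) wire('pages')->delete($old);
}
```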

3 minutes ago, combicart said:

the CSV parser to see which approach fits best for the XML feed

A CSV parser will be no use for an XML feed - I was just responding to @FrancisChung's comments about CSV not handling certain things.

 

4 minutes ago, combicart said:

They provide a date and UUID field inside the XML feed which I could use to check if there is something updated or not.

That's good news for sure - that should make things more manageable.


@adrian, good to know there are other CSV parsers out there that can handle more complex data types.

From my perspective, when I had to incorporate collection & repeater types into our feed, I faced an uphill battle with the CSV parser I was using (SimpleExcel). 

I didn't have any (file) format restrictions, so it just seemed like a good decision to switch from CSV to XML, which IMHO is a better format for dealing with such data types.
 


I face a similar problem: 20k+ entries from external sources.

XMLReader is quite fast and has low memory usage. It's also pretty simple once you understand its logic.

$xml = new \XMLReader();
$xml->open($file->filename);
while ($xml->next('tagName')) {
	if ($xml->nodeType != \XMLReader::ELEMENT) continue; // skip the end element

	// process attributes:   $xml->getAttribute('attr')
	// inner or outer XML:   $xml->readOuterXML()
	// you can even convert it to SimpleXML or DOM
}
$xml->close();
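To keep XMLReader's low memory footprint while getting SimpleXML's convenient property access, each matched element can be expanded with readOuterXML() and handed to SimpleXML. A small self-contained sketch (the `item`/`title` element names are just for illustration):

```php
<?php
// stream a large XML file element by element,
// converting each matching element to SimpleXML
function streamItems($filename, $tag)
{
    $titles = array();
    $xml = new \XMLReader();
    $xml->open($filename);

    // move to the first element with the given name
    while ($xml->read() && $xml->name !== $tag);

    while ($xml->name === $tag) {
        // readOuterXML() returns the whole current element as a string
        $node = simplexml_load_string($xml->readOuterXML());
        $titles[] = (string) $node->title;
        $xml->next($tag); // jump to the next sibling, skipping the subtree
    }
    $xml->close();
    return $titles;
}
```

This way only one element is ever held in memory, while field access inside the loop stays as simple as `$node->title`.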

 

21 hours ago, adrian said:

@FrancisChung - I agree, although I would go with JSON over XML given the choice.

I would too ... except the data source is an Excel spreadsheet, so my hands are tied. So I did have file format restrictions after all, lol.

