Importing HTML files and Textpattern data—including images from img elements and txp:image tags

johnstephens · May 7, 2020

I'm working on a script for importing very old static HTML files into ProcessWire so they are searchable on the new site.

What I have so far works, but I wonder if there are ways I can make the work of cleaning up the imported content easier, by doing more useful cleanup during the import.

For this demo, suppose all the files exist in one directory, called "public", and suppose we're importing them all into the basic-page template. At this point, the basic-page template has been modified from the blank profile to include one additional textarea field called "body", which uses the CKEditor.

<?php

include './path/to/processwire/index.php';

// Use FilesystemIterator to collect all the files in the 'public' directory
// https://www.php.net/manual/en/class.filesystemiterator.php
$files = new FilesystemIterator('./public');

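// This is a callback function for the CallbackFilterIterator below.
// strpos() returns false when there is no match (and 0 for a match at the
// start of the name), so compare strictly against false.
$is_html_file = function($file) {
    return strpos($file->getFilename(), '.htm') !== false;
};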

// Use CallbackFilterIterator to winnow the files down to only HTML files
// https://www.php.net/manual/en/class.callbackfilteriterator.php
$html_files = new CallbackFilterIterator($files, $is_html_file);

// Input a regular expression and a string -> output an array of matches
$preg_matches = function($regex, $string) {
    preg_match($regex, $string, $array);
    return $array;
};

// Iterate over the directory objects stored in $html_files
foreach($html_files as $file) {

    // Turn this file into a SplFileObject so we can read its contents
    // https://www.php.net/manual/en/class.splfileobject.php#splfileobject.constants.drop-new-line
    $_file = new SplFileObject($file);
    $contents = $_file->fread($_file->getSize());
    // Content of the first h1 element, or null when the file has none
    $h1_content = $preg_matches('/\<h1\>(.*?)\<\/h1\>/i', $contents)[1] ?? null;

    // Create a new ProcessWire page and save the content into it
    $article = new \ProcessWire\Page();
    $article->parent = $pages->get('/');
    $article->template = 'basic-page';
    // Prefer the h1 content; fall back to the title element, then the filename
    $article->title = $h1_content
        ?? $preg_matches('/\<title\>(.*?)\<\/title\>/i', $contents)[1]
        ?? $file->getFilename();

    $article->body = $contents;
    $article->save();

}

This successfully titles all the pages that have at least one h1 tag. (I know this is making a big assumption of proper markup, but it appears to be broadly correct in this one case.) The rest of the content is dumped into the page's body field.

If this helps anyone else solve a similar problem, have the code! (WTFPL)

But when one is dealing with archaic HTML using font tags and tables for layout (yeek!), this leaves much room for improvement.

Something I'd like to do is get rid of all the layout tables and site furniture, like branding markup, navigation, and footer text. Of course, that is not marked up in a consistent way across all the documents.

I wonder if anyone has guidance for something like this? Do you know of any best practices for automating the cleanup of old HTML? Thank you!

Edit: When searching for HTML tags, matches should be case insensitive (using the i flag after the delimiter). Also, use the content of the title element when there is no h1 tag on the page. This is all fixed in the code above.
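
One way to approach the font tags mentioned above is to unwrap them with DOMDocument, keeping their contents in place. This is only a minimal sketch: unwrap_elements() is a hypothetical helper, not part of the import script, and it assumes the markup is ASCII or that encoding has been dealt with before loading. Layout tables and site furniture would need rules specific to each document set.

```php
<?php

// Hypothetical helper (not from the thread): remove the named elements
// but keep their children where they were.
function unwrap_elements(string $html, array $tag_names): string {
    $dom = new DOMDocument();

    // Archaic markup triggers parser warnings; collect them silently
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tag_names as $tag_name) {
        // Copy the node list into an array before changing the tree
        $elements = iterator_to_array($dom->getElementsByTagName($tag_name));
        foreach ($elements as $element) {
            // Move each child in front of the element, then drop the empty shell
            while ($element->firstChild) {
                $element->parentNode->insertBefore($element->firstChild, $element);
            }
            $element->parentNode->removeChild($element);
        }
    }

    // Serialize only the children of body, skipping the doctype/html/body wrappers
    $body = $dom->getElementsByTagName('body')->item(0);
    $output = '';
    if ($body) {
        foreach ($body->childNodes as $child) {
            $output .= $dom->saveHTML($child);
        }
    }
    return $output;
}

$legacy = '<p><font color="red">Hello <b>world</b></font></p>';
echo unwrap_elements($legacy, ['font', 'center']);
```

The list of tag names is an assumption; it would be extended after inspecting what the old files actually contain.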

adrian · May 8, 2020

Surely you'll also want to find all images and automatically upload them to the assets/files/xxxx for the new page and then rewrite the img src to the new path. Maybe also grab the alt attribute and add that to the description field in PW. Personally I would go with DOMDocument over a regex for this, but both would work.

johnstephens · May 8, 2020

1 minute ago, adrian said:

Surely you'll also want to find all images and automatically upload them to the assets/files/xxxx for the new page and then rewrite the img src to the new path. Maybe also grab the alt attribute and add that to the description field in PW.

Thanks! I feel very foggy on how to do that. Could you direct me to an appropriate code example?

2 minutes ago, adrian said:

Personally I would go with DOMDocument over a regex for this, but both would work.

I'll look into that! I'm used to dealing with the DOM in JavaScript, but with PHP I'm not so savvy. DOMDocument looks like a great fit! Thank you!
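
As a rough illustration of the DOMDocument approach, here is how the title lookup from the first script might look without a regex. extract_title() is a hypothetical name, and the sample markup is made up. Unlike the regex, this also matches an h1 that carries attributes and returns its text without any inner tags.

```php
<?php

// Hypothetical helper: text of the first h1, else the title element, else null
function extract_title(string $html): ?string {
    $dom = new DOMDocument();

    // Suppress parser warnings from old or invalid markup
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['h1', 'title'] as $tag_name) {
        $node = $dom->getElementsByTagName($tag_name)->item(0);
        if ($node !== null && trim($node->textContent) !== '') {
            return trim($node->textContent);
        }
    }
    return null;
}

$sample = '<html><head><title>Old page</title></head>'
    . '<body><H1 align="center">Welcome <font size="5">back</font></H1></body></html>';

echo extract_title($sample), "\n";
```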

adrian · May 8, 2020

Something like this should get you going. This is stolen from a recent import I did which worked well. This assumes you have a field called "images" that you want the images uploaded to.

I have also done more complex versions of this when the source HTML image tags have width and height attributes. You can use those to resize the images using the PW API and embed that version back into the HTML.

    $dom = new \DOMDocument();
    @$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    foreach($dom->getElementsByTagName('img') as $img) {

        // grab image from the external URL and add to images field
        try {
            $np->images->add('http://olddomain.com/' . $img->getAttribute('src'));
            if($img->getAttribute('alt') != '') {
                $np->images->last()->description = $img->getAttribute('alt');
            }
            $img->setAttribute('src', $np->images->last()->url());
        }
        catch(\Exception $e) {
            // in case remote image can't be downloaded
        }

    }
    return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>', '<p>&amp;n<p>', '<p><p>', '</p></p>'), array('', '', '', '', '<p>', '<p>', '</p>'), $dom->saveHTML()));

johnstephens · July 1, 2020

Thank you, @adrian!

I don't understand what the $np variable references. Is it the current ProcessWire page instance?

adrian · July 1, 2020

Just now, johnstephens said:

Thank you, @adrian!

I don't understand what the $np variable references. Is it the current ProcessWire page instance?

Sorry, that is the new page I created and saved before running the above.

johnstephens · July 1, 2020

Thanks!

johnstephens · August 31, 2020

Hi, @adrian! (and anyone else who reads this)

I'm running into a problem, and I wonder if there's a simple way to solve it.

I found that my import script was failing to import images from the content. So I added this to the script so that I could see what was going on:

$i = 0;

foreach($dom->getElementsByTagName('image') as $image) {
    $i++;
}

if ($i) echo "<pre>First count: I counted <b>{$i}</b> txp:image tags in this document.\n</pre>";

$j = 0;

foreach($dom->getElementsByTagName('image') as $image) {
    $j++;
    // Code that creates img tag from txp:image, adds image src to ProcessWire page, and replaces txp:image tag with img
}

The first foreach block just counts the number of txp:image tags in the body, so I can print it out afterward. The second block counts the same elements AGAIN, while also running code to import the images into the current ProcessWire page. Then it prints out the second count, for comparison with the first.

When an article has just 1 image, the two counts match: 1 image was found, 1 was imported.

When the article has more than that, the second foreach block appears to skip every alternate image.

My hunch is, the script gets stuck when importing an image, and that's why it only imports images 1, 3, 5, ….

If that's the actual choke point, how could I find out? Is there an obvious workaround? Is there a way to make the image import function asynchronous? Some other solution?

Thanks for any guidance or suggestions you can offer!

johnstephens · August 31, 2020

Oops, the code snippet above should conclude with this:

if ($i) echo "<pre>Second count: I counted <b>{$j}</b> txp:image tags in this document.\n\n</pre>";

Not that it matters a lot—it's just part of my troubleshooting.

DV-JF · September 1, 2020

Hey @johnstephens

12 hours ago, johnstephens said:

I found that my import script was failing to import images from the content. So I added this to the script so that I could see what was going on:

When I run into such problems, I try to figure out which variable has which value at which time. A very helpful tool for this is TracyDebugger: https://modules.processwire.com/modules/tracy-debugger/

When you have installed the module you could do something like this...

$dom = new \DOMDocument();
bd($dom);
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

bd($dom->getElementsByTagName('img'));
foreach($dom->getElementsByTagName('img') as $img) {
	bd($img);
}

and you'll see exactly what you need to know - https://adrianbj.github.io/TracyDebugger/#/debug-bar?id=dumps

Just give it a try and you'll love it.

johnstephens · September 1, 2020

Thank you, @DV-JF!

I had Tracy installed already, so this was a simple next step. Unfortunately, it confirms something I knew already without giving me new information.

// Get txp:image tags
$images = $dom->getElementsByTagName('image');


$list_a = [];
$i = 0;

// Iterate through all the images and just add their names to the $list_a array
foreach($images as $image) {
    $list_a[] = $i . ' => ' . $image->getAttribute('name');
    $i++;
}

$list_b = [];
$j = 0;

// Iterate through all the images AGAIN:
// Add their names to the $list_b array, AND
// Try to import them with the handle_picture function
foreach($images as $image) {
    $list_b[] = $j . ' => ' . $image->getAttribute('name');
    $j++;
    handle_picture($image, $image_prefix, $all_images, $newpage, $dom);
}

bd($list_a);
bd($list_b);

What I'm seeing in the bd dumps is exactly what I said above: Every alternate image item is being skipped in the second foreach block.

Completely skipped: Their names don't get added to the array, and the variable $j doesn't increment.

It's not just that the handle_picture() function chokes on them. Or rather, when the handle_picture() function doesn't work, $list_b and $j don't get any information either.

Here is the output of my bd dumps—first $list_a:

array (6)
0 => "0 => image_1.jpg" (16)
1 => "1 => image_2.jpg" (16)
2 => "2 => image_3.jpg" (16)
3 => "3 => image_4.jpg" (16)
4 => "4 => image_5.jpg" (16)
5 => "5 => image_6.jpg" (16)

…and $list_b:

array (3)
0 => "0 => image_1.jpg" (16)
1 => "1 => image_3.jpg" (16)
2 => "2 => image_5.jpg" (16)

Likewise, if I bd() anything at all inside the handle_picture() function definition, Tracy only shows me the output for every other image, i.e. the items that got added to $list_b above.

This doesn't get me any closer to seeing what's going on. What am I missing?

Thanks in advance for any guidance you can offer!

johnstephens · September 1, 2020

Is this a known feature of PHP, that a function inside a foreach block can just blot out everything else happening inside the block for that iteration?

The handle_picture() function works perfectly fine on the odd-numbered iterations (even array indices), no matter what image it is processing. And it fails on every even-numbered iteration (odd array indices). If I shuffle the source order, I get the same odd/even success/failure breakdown. So it's not choking on specific images, just whatever image happens to fall on even iterations. And then, it just ignores the whole iteration without an error or any indication.

johnstephens · September 3, 2020

I think I've solved the problem. I have no idea why this is necessary, but running the foreach block inside a recursive function seems to rapidly pick up all the images:

function add_to_page_recursor($images, $image_prefix, $all_images, $newpage, $dom) {
    foreach($images as $image) {
        handle_picture($image, $image_prefix, $all_images, $newpage, $dom);
        $count = $images ? $images->count() : 0;
        if ($count > 0) add_to_page_recursor($images, $image_prefix, $all_images, $newpage, $dom);
    }
}

One obstacle to this solution is that DOMNodeList does not have the count() method before PHP 7.2, so this code requires PHP 7.2+. But for my publication server, it works.

Now I just need to refactor the handle_picture function to handle all the variations of images I'm importing, but that should be simple.

If anyone can shed any light on why the foreach block would be skipping images in the source that it can pick up in iterative passes, I'd love to learn what's going on here better.

Thank you!
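
A likely explanation for the skipped images, assuming handle_picture() replaces or removes the node it is given (its body is not shown in this thread): the DOMNodeList returned by getElementsByTagName() is live. It is re-evaluated against the document on every step of the foreach, while the loop position keeps counting up. Once the first txp:image node is replaced, the former second node becomes item 0 and the loop moves on to item 1, which is the former third node. The sketch below reproduces this with a stand-in function, replace_with_img(), which is hypothetical and not the real handle_picture().

```php
<?php

$dom = new DOMDocument();
$dom->loadXML(
    '<article>'
    . '<image name="image_1.jpg"/><image name="image_2.jpg"/>'
    . '<image name="image_3.jpg"/><image name="image_4.jpg"/>'
    . '<image name="image_5.jpg"/><image name="image_6.jpg"/>'
    . '</article>'
);

// Stand-in for handle_picture(): swap the <image> node for an <img> element
function replace_with_img(DOMDocument $dom, DOMElement $image): void {
    $img = $dom->createElement('img');
    $img->setAttribute('src', $image->getAttribute('name'));
    $image->parentNode->replaceChild($img, $image);
}

$visited = [];

// Iterating the live list while shrinking it skips every second node
foreach ($dom->getElementsByTagName('image') as $image) {
    $visited[] = $image->getAttribute('name');
    replace_with_img($dom, $image);
}

print_r($visited);
echo $dom->getElementsByTagName('image')->length, " image elements left\n";
```

With six source elements this should visit image_1.jpg, image_3.jpg and image_5.jpg and leave three image elements untouched, matching the $list_b dump earlier in the thread. It would also explain why the first, read-only loop saw all six.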

johnstephens · September 5, 2020

Don't use that code. I found a better way.

What I discovered was, using the Iterator interface from PHP's standard library did not cause the same problems as DOMNodeList. I still can't account for why calling a function inside my foreach block caused the DOMNodeList to skip alternate nodes, but using an Iterator seems to just work.

Unfortunately, there's no Iterator that deals directly with DOM nodes, and you can't feed a DOMNodeList to any Iterator's constructor. That proved to be simple enough to solve by converting the DOMNodeList to an array first:

function array_from($listable) {
    $new_array = [];
    foreach($listable as $item) {
        $new_array[] = $item;
    }
    return $new_array;
}

Once I had an array, I could feed it to a new ArrayIterator, and then use foreach to reliably do stuff to each DOMNode item.
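
For comparison, PHP's built-in iterator_to_array() makes the same kind of copy as array_from(). The sketch below assumes, as above, that the per-image handling replaces each node; the sample document and the inline replacement are illustrative only. Because the array is a fixed snapshot, changes to the document during the loop no longer affect which nodes are visited.

```php
<?php

$dom = new DOMDocument();
$dom->loadXML('<article>' . str_repeat('<image name="x.jpg"/>', 6) . '</article>');

// Snapshot the live DOMNodeList into a plain array
$images = iterator_to_array($dom->getElementsByTagName('image'));

$handled = 0;
foreach ($images as $image) {
    // Illustrative stand-in for the real per-image handling
    $img = $dom->createElement('img');
    $img->setAttribute('src', $image->getAttribute('name'));
    $image->parentNode->replaceChild($img, $image);
    $handled++;
}

echo "Handled {$handled} of 6 images\n";
```

A plain array can be used in foreach directly, so the snapshot itself appears to be what fixes the skipping rather than the ArrayIterator wrapper.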

Here's a sample of what it looks like in action. Since I'm importing content from Textpattern, the source includes images as HTML img elements as well as a variety of Textpattern tags (including txp:image tags and some smd_macros).

// Regular HTML img elements
$img_elements = new \ArrayIterator(array_from($dom->getElementsByTagName('img')));

// txp:image tags
$images = new \ArrayIterator(array_from($dom->getElementsByTagName('image')));

// smd_macro called txp:image_hd
$hd_images = new \ArrayIterator(array_from($dom->getElementsByTagName('image_hd')));

// smd_macro called txp:picture
$pictures = new \ArrayIterator(array_from($dom->getElementsByTagName('picture')));

// Combine them all using AppendIterator
$source_images = new \AppendIterator();
$source_images->append($img_elements);
$source_images->append($images);
$source_images->append($hd_images);
$source_images->append($pictures);

// Do the stuff that needs to be done to each DOMNode item
foreach ($source_images as $image) {
    // Handle a single image here…
}

I hope this helps someone in the future! Or, maybe me if I have to solve a similar problem again…
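
A more compact variation on the same idea, sketched with plain arrays instead of ArrayIterator and AppendIterator. collect_elements() is a hypothetical helper name, and the sample document is invented for the demonstration.

```php
<?php

// Hypothetical helper: snapshot all elements with the given tag names,
// grouped in the order the tag names are listed.
function collect_elements(DOMDocument $dom, array $tag_names): array {
    $nodes = [];
    foreach ($tag_names as $tag_name) {
        // false discards the list's own keys so the arrays append cleanly
        $found = iterator_to_array($dom->getElementsByTagName($tag_name), false);
        $nodes = array_merge($nodes, $found);
    }
    return $nodes;
}

$dom = new DOMDocument();
$dom->loadXML(
    '<article><img src="a.jpg"/><image name="b.jpg"/>'
    . '<picture name="c.jpg"/><image name="d.jpg"/></article>'
);

$source_images = collect_elements($dom, ['img', 'image', 'image_hd', 'picture']);

foreach ($source_images as $node) {
    echo $node->nodeName, "\n";
}
```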
