johnstephens

Celebrating a small victory and wondering about importing pre-Web-Standards HTML files


I'm working on a script for importing very old static HTML files into ProcessWire so they are searchable on the new site.

What I have so far works, but I wonder if there are ways I can make the work of cleaning up the imported content easier, by doing more useful cleanup during the import.

For this demo, suppose all the files exist in one directory, called "public", and suppose we're importing them all into the basic-page template. At this point, the basic-page template has been modified from the blank profile to include one additional textarea field called "body", which uses the CKEditor.

<?php

include './path/to/processwire/index.php';

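// Use FilesystemIterator to list all the files in the 'public' directory
// https://www.php.net/manual/en/class.filesystemiterator.php
$files = new FilesystemIterator('./public');

// This is a callback function for the CallbackFilterIterator below.
// Check the extension (.htm or .html, any case) and return a real boolean;
// strpos() would also match '.htm' in the middle of a name like 'page.html.bak'.
$is_html_file = function($file) {
    return $file->isFile() && preg_match('/^html?$/i', $file->getExtension()) === 1;
};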

// Use CallbackFilterIterator to winnow the files down to only HTML files
// https://www.php.net/manual/en/class.callbackfilteriterator.php
$html_files = new CallbackFilterIterator($files, $is_html_file);

// Input a regular expression and a string -> output an array of matches
$preg_matches = function($regex, $string) {
    preg_match($regex, $string, $array);
    return $array;
};

// Iterate over the directory objects stored in $html_files
foreach($html_files as $file) {

    // Open this file as an SplFileObject so we can read its contents
    // https://www.php.net/manual/en/class.splfileobject.php
    $_file = $file->openFile();
    // fread() rejects a length of 0, so treat empty files as empty strings
    $contents = $_file->getSize() > 0 ? $_file->fread($_file->getSize()) : '';

    // Use the first h1 as the title; fall back to the title element, then the filename.
    // [^>]* allows attributes on the tag, and the s flag lets a match span several lines.
    $h1 = $preg_matches('/<h1[^>]*>(.*?)<\/h1>/is', $contents);
    $title = $preg_matches('/<title[^>]*>(.*?)<\/title>/is', $contents);

    // Create a new ProcessWire page and save the content into it
    $article = new \ProcessWire\Page();
    $article->parent = $pages->get('/');
    $article->template = 'basic-page';
    $article->title = trim(strip_tags($h1[1] ?? $title[1] ?? $file->getFilename()));

    $article->body = $contents;
    $article->save();

}

This successfully titles all the pages that have at least one h1 tag. (I know this is making a big assumption of proper markup, but it appears to be broadly correct in this one case.) The rest of the content is dumped into the page's body field.

If this helps anyone else solve a similar problem, have the code! (WTFPL)

But when one is dealing with archaic HTML using font tags and tables for layout (yeek!), this leaves much room for improvement.

Something I'd like to do is get rid of all the layout tables and site furniture, like branding markup, navigation, and footer text. Of course, that is not marked up in a consistent way across all the documents. 😉

I wonder if anyone has guidance for something like this? Do you know of any best practices for automating the cleanup of old HTML? Thank you!

Edit: When searching for HTML tags, matches should be case insensitive (using the i flag after the delimiter). Also, use the content of the title element when there is no h1 tag on the page. This is all fixed in the code above.
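As a starting point for the cleanup described above, here is a minimal sketch using PHP's DOMDocument. It unwraps font and center tags and drops a handful of presentational attributes. The function name `clean_legacy_html` and the lists of tags and attributes are assumptions for illustration; it does not try to detect navigation or footer markup, since that varies from document to document.

```php
<?php
// Hypothetical helper: strip presentational markup from legacy HTML.
function clean_legacy_html(string $html): string
{
    $dom = new DOMDocument();
    // Old markup is rarely valid; collect libxml warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make libxml read the input as UTF-8
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Unwrap <font> and <center>: keep the children, drop the element itself
    foreach (['font', 'center'] as $tag) {
        // Copy to an array first; the live node list changes as nodes are removed
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
        }
    }

    // Remove presentational attributes from every remaining element
    $attributes = ['bgcolor', 'align', 'valign', 'cellpadding', 'cellspacing', 'border'];
    foreach ($dom->getElementsByTagName('*') as $node) {
        foreach ($attributes as $name) {
            $node->removeAttribute($name);
        }
    }

    // Return only the contents of <body>, without the html/body wrappers
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return trim($out);
}
```

In the import loop, the result could be assigned to the body field in place of the raw `$contents`. Layout tables need a per-site decision (for example, keeping only the cell that holds the article), so they are left alone here.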


Surely you'll also want to find all images and automatically upload them to the assets/files/xxxx for the new page and then rewrite the img src to the new path. Maybe also grab the alt attribute and add that to the description field in PW. Personally I would go with DOMDocument over a regex for this, but both would work.

1 minute ago, adrian said:

Surely you'll also want to find all images and automatically upload them to the assets/files/xxxx for the new page and then rewrite the img src to the new path. Maybe also grab the alt attribute and add that to the description field in PW.

Thanks! I feel very foggy on how to do that. Could you direct me to an appropriate code example?

2 minutes ago, adrian said:

Personally I would go with DOMDocument over a regex for this, but both would work.

I'll look into that! I'm used to dealing with the DOM in JavaScript, but with PHP I'm not so savvy. DOMDocument looks like a great fit! Thank you!
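For anyone in the same position, here is a rough sketch of what the title lookup from the import script might look like with DOMDocument instead of regular expressions. The function name `extract_title` is hypothetical; the fallback order (first h1, then the title element) follows the script above.

```php
<?php
// Hypothetical helper: text of the first <h1>, else of <title>, else null.
function extract_title(string $html): ?string
{
    // loadHTML() refuses an empty string, so bail out early
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    // Collect libxml warnings about invalid markup instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['h1', 'title'] as $tag) {
        // The parser lowercases tag names, so <H1> and <h1> are both found
        $node = $dom->getElementsByTagName($tag)->item(0);
        if ($node !== null) {
            // textContent drops nested tags such as <font>; collapse runs of whitespace
            $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null;
}
```

Unlike the regex version, this tolerates attributes on the h1, uppercase tag names, and headings that span several lines.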


Something like this should get you going. This is stolen from a recent import I did which worked well. This assumes you have a field called "images" that you want the images uploaded to.

I have also done more complex versions of this when the source HTML image tags have width and height attributes - you can use those to resize the images using the PW API and embed that version back into the HTML.
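The first half of that idea, reading the width and height from each img tag, might look roughly like the sketch below. The function name `image_dimensions` is hypothetical, and treating non-pixel values (such as percentages) as 0 is an assumption made for illustration.

```php
<?php
// Hypothetical helper: list src, width and height for each <img> in the markup.
function image_dimensions(string $html): array
{
    $result = [];
    if (trim($html) === '') {
        return $result;
    }

    $dom = new DOMDocument();
    // Collect libxml warnings about invalid markup instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $dimensions = [];
        foreach (['width', 'height'] as $name) {
            $value = trim($img->getAttribute($name));
            // Keep plain pixel integers only; missing values or percentages become 0
            $dimensions[$name] = ctype_digit($value) ? (int) $value : 0;
        }
        $result[] = [
            'src'    => $img->getAttribute('src'),
            'width'  => $dimensions['width'],
            'height' => $dimensions['height'],
        ];
    }
    return $result;
}
```

Inside the image loop, the two numbers could then be handed to ProcessWire's `Pageimage::size()` method to produce the resized variation whose URL goes back into the src attribute. That last step is a guess at the approach described above, not a copy of it.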

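    // $html is the old page's markup; $np is the new page, created and saved beforehand
    $dom = new \DOMDocument();
    // @ suppresses the warnings libxml raises on old, invalid markup
    @$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));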

    foreach($dom->getElementsByTagName('img') as $img) {

        // grab image from the external URL and add to images field
        try {
            $np->images->add('http://olddomain.com/' . $img->getAttribute('src'));
            if($img->getAttribute('alt') != '') {
                $np->images->last()->description = $img->getAttribute('alt');
            }
            $img->setAttribute('src', $np->images->last()->url());
        }
        catch(\Exception $e) {
            // in case remote image can't be downloaded
        }

    }
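    // Strip the doctype and the html/body wrappers that saveHTML() adds, and tidy stray paragraph tags
    $search  = array('<html>', '</html>', '<body>', '</body>', '<p>&amp;n<p>', '<p><p>', '</p></p>');
    $replace = array('', '', '', '', '<p>', '<p>', '</p>');
    return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace($search, $replace, $dom->saveHTML()));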
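One thing to watch in the loop above: prefixing every src with the old domain only works for relative paths. A small helper along these lines could handle absolute and root-relative URLs as well. The name `resolve_image_url` is hypothetical, and the sketch assumes the old pages all lived at the site root, so `../` segments are not resolved.

```php
<?php
// Hypothetical helper: turn an img src from the old site into an absolute URL.
function resolve_image_url(string $src, string $base = 'http://olddomain.com/'): string
{
    $src = trim($src);

    // Already absolute: use it unchanged
    if (preg_match('#^https?://#i', $src) === 1) {
        return $src;
    }

    // Protocol-relative (//host/path): borrow the scheme from the base URL
    if (strpos($src, '//') === 0) {
        return parse_url($base, PHP_URL_SCHEME) . ':' . $src;
    }

    // Root-relative or plain relative path: join with exactly one slash
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}
```

In the loop, the `add()` call would then take `resolve_image_url($img->getAttribute('src'))` in place of the string concatenation.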

 

 

Just now, johnstephens said:

Thank you, @adrian!

I don't understand what the $np variable references. Is it the current ProcessWire page instance?

Sorry, that is the new page I created and saved before running the above.

