Jump to content

Recommended Posts

Posted

Hello,

i need to import a large number of articles from an old site into my PW project. So I wrote an import script which gets the data and for each row, creates a new page in PW and inserts it. So far, so easy. Problems arise because the old DB is latin-1 encoded and I keep getting strange results which are probably due to character encoding.

$con = mysql_connect("localhost","yes","also");
if (!$con) {
die('Could not connect: ' . mysql_error());
}
mysql_set_charset("latin1", $con);
mysql_select_db("mtb", $con);
$result = mysql_query('SELECT * FROM stuff');
while($row = mysql_fetch_array($result)) {
$new = new Page();
$new->of(false);
$new->template = wire('templates')->get('article'); $new->parent = wire('pages')->get('/news/');
$name = wire('sanitizer')->pageName(utf8_encode($row['HEADLINE']));
[*]$new_name = $name;
$i = 1;
while(isset($found{$new_name})) {
error_log('DOPPLER! '.$new_name);
$new_name = $name.'-'.$i;
$i++;
}
$found{$new_name} = 1;
}

Even though I check for double page names, I keep getting the 'Duplicate entry' error. I also tried

while(wire('pages')->get("name=$new_name,template=article")->id) {}

to find duplicates but it stops with 'Duplicate entry' at a different data set ...I'm confused ... can anyone help?

Thanks,

thomas

Posted

Not sure about the duplicate warning but for the latin1 to utf-8 import then one option might be to use a DB management tool to copy the source table and then change every text-based field / text-based index (if you need to maintain/import them too) over to utf-8 prior to attempting the import.

Posted

I' m not really seeing big problems other than the duplicate error which keeps me from importing the whole DB. I just don't know how to catch it! Other than that, there's a few special characters gone missing (like „these quotes“) but I can live with that ...

Posted

Thomas, I'm little tired and not following your code pretty well.. but this is how I usually handle duplicates while importing pages:

$p =  $pages->get("name=$sanitizedPageName, parent=/where/these/go/");

  if($p->id) {
   echo "We have this already, skip or update";
  }
  else if (strlen($pageTitle) > 1) {
   $p = new Page;
   $p->template = "my-template";
   $p->name = $sanitizedPageName;
   $p->title = $pageTitle;
   $p->save();   
   echo "Imported: $pageTItle <br>";
  }

It is important to set the name for the page, so it will be the same that you are comparing it against.

Posted

I have no idea what this even does but adding "Sanitizer::translate" to my $sanitizer call solved the problem ...

Posted

I have no idea what this even does but adding "Sanitizer::translate" to my $sanitizer call solved the problem ...

It does character replacements which you have/can defined in the Inputfield Page Name module settings.

Posted

There's already a bunch defined in there even though I had no idea this existed. Cool. Since PW uses this to create new pages it is obvious why my script didn't catch some of the duplicates ...

Posted

When it comes to translating text from one character set to another, look into PHP's iconv() function. You'll want to run your latin1 text through this function before assigning it to a field in ProcessWire:

$newPage->title = iconv("ISO-8859-1", "UTF-8", $row['HEADLINE']);

With regard to the duplicates, it sounds like you already solved this. But when you are doing an import like this and it's more than a one-time thing, then you need accurate duplicate checking. The best way to get that is to make sure that you are the one assigning the 'name' field, and that you aren't letting ProcessWire auto-generate it for you:

$name = wire('sanitizer')->pageName($row['HEADLINE']);
$newPage->name = $name;

If there is a possibility that $row['HEADLINE'] will ever change on future imports, then you'd be better off using a separate ID field to match things up. I typically add an integer field named 'old_id' to my template, and that would contain the primary key from the old data. Then you would check that when determining whether its an existing page or a new one.

Last thing I want to mention is that you might be better off caching your $parent page above the loop where you are creating the pages:

$parent = wire('pages')->get('/news/');

// .. then in your loop:
$newPage->parent = $parent; 

The reason for this is that ProcessWire clears its own cache of loaded pages every time you save(). So you'd end up re-loading that $parent (/news/) on every loop iteration. Whereas, if you've kept a copy of it in your code ($parent) then you can continue using the same one without reloading. Basically it's just a little more efficient/faster, though probably doesn't matter much.

  • Like 2
Posted

Thanks for the info, Ryan! This is a one time import but I'll need to do it for several sites so your tips are really appreciated.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...