Get data from a latin-1 encoded DB and insert it into PW

thomas · September 3, 2012

Hello,

i need to import a large number of articles from an old site into my PW project. So I wrote an import script which gets the data and for each row, creates a new page in PW and inserts it. So far, so easy. Problems arise because the old DB is latin-1 encoded and I keep getting strange results which are probably due to character encoding.

$con = mysql_connect("localhost","yes","also");
if (!$con) {
die('Could not connect: ' . mysql_error());
}
mysql_set_charset("latin1", $con);
mysql_select_db("mtb", $con);
$result = mysql_query('SELECT * FROM stuff');
while($row = mysql_fetch_array($result)) {
$new = new Page();
$new->of(false);
$new->template = wire('templates')->get('article'); $new->parent = wire('pages')->get('/news/');
$name = wire('sanitizer')->pageName(utf8_encode($row['HEADLINE']));
[*]$new_name = $name;
$i = 1;
while(isset($found{$new_name})) {
error_log('DOPPLER! '.$new_name);
$new_name = $name.'-'.$i;
$i++;
}
$found{$new_name} = 1;
}

Even though I check for double page names, I keep getting the 'Duplicate entry' error. I also tried

while(wire('pages')->get("name=$new_name,template=article")->id) {}

to find duplicates but it stops with 'Duplicate entry' at a different data set ...I'm confused ... can anyone help?

Thanks,

thomas

netcarver · September 3, 2012

Not sure about the duplicate warning but for the latin1 to utf-8 import then one option might be to use a DB management tool to copy the source table and then change every text-based field / text-based index (if you need to maintain/import them too) over to utf-8 prior to attempting the import.

thomas · September 3, 2012

I' m not really seeing big problems other than the duplicate error which keeps me from importing the whole DB. I just don't know how to catch it! Other than that, there's a few special characters gone missing (like „these quotes“) but I can live with that ...

apeisa · September 3, 2012

Thomas, I'm little tired and not following your code pretty well.. but this is how I usually handle duplicates while importing pages:

$p =  $pages->get("name=$sanitizedPageName, parent=/where/these/go/");

  if($p->id) {
   echo "We have this already, skip or update";
  }
  else if (strlen($pageTitle) > 1) {
   $p = new Page;
   $p->template = "my-template";
   $p->name = $sanitizedPageName;
   $p->title = $pageTitle;
   $p->save();   
   echo "Imported: $pageTItle <br>";
  }

It is important to set the name for the page, so it will be the same that you are comparing it against.

thomas · September 4, 2012

I have no idea what this even does but adding "Sanitizer::translate" to my $sanitizer call solved the problem ...

Soma · September 4, 2012

I have no idea what this even does but adding "Sanitizer::translate" to my $sanitizer call solved the problem ...

It does character replacements which you have/can defined in the Inputfield Page Name module settings.

thomas · September 4, 2012

There's already a bunch defined in there even though I had no idea this existed. Cool. Since PW uses this to create new pages it is obvious why my script didn't catch some of the duplicates ...

ryan · September 4, 2012

When it comes to translating text from one character set to another, look into PHP's iconv() function. You'll want to run your latin1 text through this function before assigning it to a field in ProcessWire:

$newPage->title = iconv("ISO-8859-1", "UTF-8", $row['HEADLINE']);

With regard to the duplicates, it sounds like you already solved this. But when you are doing an import like this and it's more than a one-time thing, then you need accurate duplicate checking. The best way to get that is to make sure that you are the one assigning the 'name' field, and that you aren't letting ProcessWire auto-generate it for you:

$name = wire('sanitizer')->pageName($row['HEADLINE']);
$newPage->name = $name;

If there is a possibility that $row['HEADLINE'] will ever change on future imports, then you'd be better off using a separate ID field to match things up. I typically add an integer field named 'old_id' to my template, and that would contain the primary key from the old data. Then you would check that when determining whether its an existing page or a new one.

Last thing I want to mention is that you might be better off caching your $parent page above the loop where you are creating the pages:

$parent = wire('pages')->get('/news/');

// .. then in your loop:
$newPage->parent = $parent;

The reason for this is that ProcessWire clears its own cache of loaded pages every time you save(). So you'd end up re-loading that $parent (/news/) on every loop iteration. Whereas, if you've kept a copy of it in your code ($parent) then you can continue using the same one without reloading. Basically it's just a little more efficient/faster, though probably doesn't matter much.

thomas · September 4, 2012

Thanks for the info, Ryan! This is a one time import but I'll need to do it for several sites so your tips are really appreciated.

Sign In

Get data from a latin-1 encoded DB and insert it into PW

Recommended Posts

thomas

netcarver

thomas

apeisa

thomas

Soma

thomas

ryan

thomas

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Similar Content

An Easy Way to install ProcessWire on Vultr

Migrate Django to Processwire

Sanitize hidden characters from CKEditor

Not able to connect to external Database

About ProcessWire & Performance

Browse

Activity

My Activity Streams

Support

Store

My Details