Jump to content

Get data from a latin-1 encoded DB and insert it into PW


thomas
 Share

Recommended Posts

Hello,

i need to import a large number of articles from an old site into my PW project. So I wrote an import script which gets the data and for each row, creates a new page in PW and inserts it. So far, so easy. Problems arise because the old DB is latin-1 encoded and I keep getting strange results which are probably due to character encoding.

$con = mysql_connect("localhost","yes","also");
if (!$con) {
die('Could not connect: ' . mysql_error());
}
mysql_set_charset("latin1", $con);
mysql_select_db("mtb", $con);
$result = mysql_query('SELECT * FROM stuff');
while($row = mysql_fetch_array($result)) {
$new = new Page();
$new->of(false);
$new->template = wire('templates')->get('article'); $new->parent = wire('pages')->get('/news/');
$name = wire('sanitizer')->pageName(utf8_encode($row['HEADLINE']));
[*]$new_name = $name;
$i = 1;
while(isset($found{$new_name})) {
error_log('DOPPLER! '.$new_name);
$new_name = $name.'-'.$i;
$i++;
}
$found{$new_name} = 1;
}

Even though I check for double page names, I keep getting the 'Duplicate entry' error. I also tried

while(wire('pages')->get("name=$new_name,template=article")->id) {}

to find duplicates but it stops with 'Duplicate entry' at a different data set ...I'm confused ... can anyone help?

Thanks,

thomas

Link to comment
Share on other sites

Not sure about the duplicate warning but for the latin1 to utf-8 import then one option might be to use a DB management tool to copy the source table and then change every text-based field / text-based index (if you need to maintain/import them too) over to utf-8 prior to attempting the import.

Link to comment
Share on other sites

I' m not really seeing big problems other than the duplicate error which keeps me from importing the whole DB. I just don't know how to catch it! Other than that, there's a few special characters gone missing (like „these quotes“) but I can live with that ...

Link to comment
Share on other sites

Thomas, I'm little tired and not following your code pretty well.. but this is how I usually handle duplicates while importing pages:

$p =  $pages->get("name=$sanitizedPageName, parent=/where/these/go/");

  if($p->id) {
   echo "We have this already, skip or update";
  }
  else if (strlen($pageTitle) > 1) {
   $p = new Page;
   $p->template = "my-template";
   $p->name = $sanitizedPageName;
   $p->title = $pageTitle;
   $p->save();   
   echo "Imported: $pageTItle <br>";
  }

It is important to set the name for the page, so it will be the same that you are comparing it against.

Link to comment
Share on other sites

I have no idea what this even does but adding "Sanitizer::translate" to my $sanitizer call solved the problem ...

It does character replacements which you have/can defined in the Inputfield Page Name module settings.

Link to comment
Share on other sites

When it comes to translating text from one character set to another, look into PHP's iconv() function. You'll want to run your latin1 text through this function before assigning it to a field in ProcessWire:

$newPage->title = iconv("ISO-8859-1", "UTF-8", $row['HEADLINE']);

With regard to the duplicates, it sounds like you already solved this. But when you are doing an import like this and it's more than a one-time thing, then you need accurate duplicate checking. The best way to get that is to make sure that you are the one assigning the 'name' field, and that you aren't letting ProcessWire auto-generate it for you:

$name = wire('sanitizer')->pageName($row['HEADLINE']);
$newPage->name = $name;

If there is a possibility that $row['HEADLINE'] will ever change on future imports, then you'd be better off using a separate ID field to match things up. I typically add an integer field named 'old_id' to my template, and that would contain the primary key from the old data. Then you would check that when determining whether its an existing page or a new one.

Last thing I want to mention is that you might be better off caching your $parent page above the loop where you are creating the pages:

$parent = wire('pages')->get('/news/');

// .. then in your loop:
$newPage->parent = $parent; 

The reason for this is that ProcessWire clears its own cache of loaded pages every time you save(). So you'd end up re-loading that $parent (/news/) on every loop iteration. Whereas, if you've kept a copy of it in your code ($parent) then you can continue using the same one without reloading. Basically it's just a little more efficient/faster, though probably doesn't matter much.

  • Like 2
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...