UTF-8 issue importing pages from another PW instance


Can

Hey guys,

As I'm completely refactoring one of our websites, I thought it would be nice to start from scratch and use the new multi-instance support to import data from the old site.

Most of it works nicely so far, except that I'm running into what I guess is a small UTF-8 issue: the imported content in the new site contains black diamonds with question marks (the way U+FFFD, the replacement character, is usually rendered).

I tried utf8_encode/utf8_decode in various combinations and orders, $sanitizer->purify / ->entities / ->unentities, htmlspecialchars and various other possible solutions, with no luck.
Then I tried grabbing the body field directly from the old db and inserting it directly into the new one, with no difference.
Both databases use utf8_general_ci and PW is at 3.0.33.

As a side note, there are no question-mark diamonds in the old (or the new) db, and the old page content works as expected, of course.
Oh, and I also tried setting the charset using PHP's header and ini_set functions.

Hopefully someone has another idea :)


Hi @can, I suggest you use an editor like Geany, which supports multiple character sets, and open your .sql dump with it. If the dump is too big, make a smaller one that contains the problematic data.

Then reload the file in Geany as ISO-8859-15 and see what happens. If you see accents, ñ and other such characters, then you have found the right encoding for the file.

Sometimes the problem is at the SQL file level, for example because the dump was created from a non-UTF terminal or something similar.
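The same "reload as another encoding" check can be done from PHP, if that is handier than an editor. This is only a rough sketch: `previewEncodings` is a made-up helper name, the sample bytes are invented for illustration, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: show how the same raw bytes read when they are
// interpreted as each candidate source encoding.
function previewEncodings(string $raw, array $candidates = ['UTF-8', 'ISO-8859-15']): array
{
    $out = [];
    foreach ($candidates as $enc) {
        // Convert *from* the candidate encoding *to* UTF-8 for display.
        $out[$enc] = mb_convert_encoding($raw, 'UTF-8', $enc);
    }
    return $out;
}

// Invented sample: "ñ" stored as the single ISO-8859-15 byte 0xF1,
// which is not a valid UTF-8 sequence on its own.
$raw = "Espa\xF1a";

$looksLikeUtf8 = mb_check_encoding($raw, 'UTF-8');
$views = previewEncodings($raw);

var_dump($looksLikeUtf8);
foreach ($views as $enc => $text) {
    echo $enc, ': ', $text, PHP_EOL;
}
```

If the text only reads correctly under a non-UTF-8 candidate, the data (or the dump file) is probably not UTF-8 in the first place.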

 

Do you speak Spanish? I see you are in Peru. I speak Spanish. I used to live in Venezuela.

Regards.


How are you doing the import?

I remember I had a tough time with UTF-8 and German umlauts when I implemented a spreadsheet content importer.

I had to get the CSV file into a particular format (UTF-8 with BOM, i.e. with a byte order mark) for it to work.

Below is what I wrote in the header of that spreadsheet code.
 

Quote

*  Important : To get this working, the utf8.csv file that it imports must be in UTF8-BOM (Byte Order Marker) format
*  Best to save the XLS file into TSV, then open it with Sublime Text and encode it with UTF8-BOM before
*  saving it down again for processing.
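For reference, the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file, so it can be checked for (or added and stripped) in PHP without going through an editor. A minimal sketch follows; the function names are made up for illustration and are not part of the importer mentioned above.

```php
<?php
// The UTF-8 byte order mark: U+FEFF encoded as three bytes.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helpers, named here only for illustration.
function hasUtf8Bom(string $data): bool
{
    return strncmp($data, UTF8_BOM, 3) === 0;
}

function stripUtf8Bom(string $data): string
{
    return hasUtf8Bom($data) ? substr($data, 3) : $data;
}

function addUtf8Bom(string $data): string
{
    // Avoid adding a second BOM if one is already present.
    return hasUtf8Bom($data) ? $data : UTF8_BOM . $data;
}

$csv = "name;city\nJ\xC3\xBCrgen;K\xC3\xB6ln\n"; // invented sample data
$withBom = addUtf8Bom($csv);

var_dump(hasUtf8Bom($csv), hasUtf8Bom($withBom));
```

Whether a given importer wants the BOM present or absent depends on that importer, so it is worth checking both variants.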

 


I'm going to try your suggestions, guys. Actually the German umlauts work; the places where I'm getting the question marks seem to be just whitespace in both the old and the new database.

The import happens using the API (I should have filed the question in the API section), so I'm bootstrapping the old instance like this:
 

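// Bootstrap the old site as a second ProcessWire instance
$old = new ProcessWire('../old/', 'http://old.dev');

// Fetch the pages to import from the old instance
$categories = $old->pages->find("template=forum-category");
foreach ($categories as $cat) {
	$p = new Page();
	$p->template = 'post';
	$p->parent = $parent; // $parent is set elsewhere (not shown)
	$p->title = $cat->title;
	// .. (remaining fields left out)
}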

and so on...

Oh, and I also tried turning output formatting on and off: $p->of(false)

@Francesco Bortolussi I'm still learning, little by little ;) Ah yes, and do you live in the United States now, or whereabouts? Have you read about our project? The link is in my signature...


A couple more suggestions:

1) Have you tried grabbing one of the fields with the problematic "spaces" and pasting it into a text editor like Sublime Text? Are you absolutely sure it's just 'whitespace'?


2) Have you tried using functions like str_replace when doing the copying, e.g. replacing the space with a space?
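Since a text editor can render several different characters as a blank, it may be more reliable to look at the raw bytes. Here is a rough sketch of how that could be done; `findNonAscii` is a made-up helper name and the sample string is invented, so treat it as a starting point rather than a fix.

```php
<?php
// Hypothetical helper: list every run of non-ASCII bytes with its byte
// offset and hex value, so "invisible" characters become visible.
function findNonAscii(string $text): array
{
    $hits = [];
    // No /u modifier on purpose: we want to look at raw bytes here.
    if (preg_match_all('/[\x80-\xFF]+/', $text, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as [$bytes, $offset]) {
            $hits[] = ['offset' => $offset, 'hex' => bin2hex($bytes)];
        }
    }
    return $hits;
}

// Invented sample: looks like "foo bar", but the blank is a UTF-8
// no-break space (U+00A0, bytes C2 A0) instead of a plain space (20).
$sample = "foo\xC2\xA0bar";

$hits = findNonAscii($sample);
$isValidUtf8 = mb_check_encoding($sample, 'UTF-8'); // needs mbstring

print_r($hits);
var_dump($isValidUtf8);
```

Running this on the body field from both the old and the new instance would show whether the bytes already differ before anything is rendered.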


15 hours ago, FrancisChung said:

A couple more suggestions:

1) Have you tried grabbing one of the fields with the problematic "spaces" and pasting it into a text editor like Sublime Text? Are you absolutely sure it's just 'whitespace'?


2) Have you tried using functions like str_replace when doing the copying, e.g. replacing the space with a space?

1) Pasted from the db (Adminer) into Sublime, it looks the same, just like whitespace.

2) Like str_replace(' ', ' ', $body)? No difference.

Aha, thanks to your suggestion I then tried preg_replace('/\s+/', ' ', $body) and it worked! :D

Thanks guys :)

So what exactly happened here? What are those mysterious faulty whitespace characters in reality?
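One possible explanation, which is an assumption and not something confirmed in this thread: the "whitespace" is a no-break space (U+00A0), stored in UTF-8 as the two bytes C2 A0. A pattern without the u modifier works on single bytes, so depending on platform and locale it may match only part of such a sequence and leave an invalid byte behind. A more careful sketch treats the string as UTF-8 and names the characters explicitly; `normalizeSpaces` is a made-up helper, not a ProcessWire function.

```php
<?php
// Hypothetical helper: replace common "look-alike" blanks with a plain space.
function normalizeSpaces(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so
    // multi-byte characters are matched as a whole and never split.
    // U+00A0 no-break space, U+2007 figure space, U+202F narrow no-break space.
    $result = preg_replace('/[\x{00A0}\x{2007}\x{202F}]/u', ' ', $text);

    // With /u, preg_replace returns null if the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

$body = "foo\xC2\xA0bar"; // invented sample containing U+00A0
$clean = normalizeSpaces($body);

echo bin2hex($clean), PHP_EOL;
```

If this throws on the imported content, the string was already broken before the replacement, which would point to the transfer between the two instances rather than to the whitespace itself.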

 

@Francesco Bortolussi we've met a few Italians here in Peru, haha :D


I cheered too soon^^ It's not properly solved yet. Instead of the "diamond" question mark icons I'm now left with regular question marks, which I can't just search and replace because the content also contains real question marks.

Any ideas?


On 21.9.2016 at 11:37 PM, BitPoet said:

Have a look here, your problem sounds quite similar to what is described in the article.

Thanks @BitPoet, I haven't had the time to follow the instructions so far.

But all dbs/tables are utf8 already and were installed using utf8. I'm going to check this further and work through the instructions in the article you linked.
