Jump to content

Fixing PW Install in a Latin1 database; should have been UTF-8?


Recommended Posts

Hi All,

After SiNNuT's excellent help in this post: (thank you, SiNNuT),

http://processwire.com/talk/topic/5400-tinymce-converting-html-entities-to-characters/

I ran across a new situation -- or more accurately -- a new knowledge gap that I wish I didn't have. :-)

I have another PW install, in which the curly quotes are not displaying properly, even though the header includes the line:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

I was digging around, and noticed that I had installed PW into a database that I had created, set to:

* charset: latin1

* collation: latin1_swedish_ci

instead of utf8 and utf8_general_ci.

Now, just to be clear, one of my PW installs with the latin1 charset does display the curly quotes correctly, so I'm not sure why the other install does not. I'm not entirely sure if the db charset and collation are the issue, or something else. And that's an item to solve.

But my main questions here are:

- Is the issue of installing PW into a latin1 db a huge problem? (It would be really great if the PW install prompted one to make sure that the db had the right charset. Or maybe it did, and I missed it.)

- If it is a problem, does anyone have any tips on converting the db, tables and content appropriately?

I'm under the impression that one could relatively easily convert the db and tables to utf8, but that the content would have to be converted as well -- perhaps using the PHP inconv function, as Ryan pointed out in another post?

$newPage->title = iconv("ISO-8859-1", "UTF-8", $row['HEADLINE']);

That sounds like a lot of effort and processing: having to convert every field of every page.

As I mentioned, I have one install with latin1 where everything seems to be working okay. So maybe no conversion is necessary. ??

It's that one other install, where the curly quotes aren't showing up in view mode, that has me baffled.

What's really weird with that one, is that when I click on edit, either in the admin, or in my front end data form, the typographic curly quotes display correctly. It's only in view mode that they turn into this (in the source), with the page charset set to UTF-8:

�test�

the above text should be:

“test” (i.e. "test" with curly quotes)

By the way, the particular field I'm testing is set to be textarea only -- no TinyMCE at all.

=> As a related item, the problematic database receives email from a variety of email programs, from both PCs and MACs, and then saves the data to PW. Thus, things like curly quotes could be coming from a variety of machines.

My Processwire script to save those pages does not include any type of charset declaration or conversion, because I was under the impression that I didn't need to do that. But that might be another knowledge gap.

The test I did above, however, was me pasting in the word "test", with curly quotes, from LibreOffice Writer, on a PC.

Thanks for all the help!

I do intend to one day answer some questions myself, when I become more proficient in PW.

(Well, I have posted one Case Study, so far.) :-)

Peter

Link to comment
Share on other sites

Once again on a tight schedule here, but first of all, I'd start with making sure what character sets are actually in use. If not UTF-8, you can convert data to it with relative ease, though it needs to be done once for database, then on table-by-table basis for existing tables.
 
This is important to keep in mind, though:

The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;

 
(Source of quote: http://dev.mysql.com/doc/refman/5.6/en/alter-table.html.)
 
Hope this helps a bit. I'd be happy to take closer look at this at some point too, now I'll have to run :)

  • Like 2
Link to comment
Share on other sites

Dear Teppo,

Thanks for this info. I'll study it. I ran this query:

SELECT 
  TABLE_NAME,
  COLUMN_NAME,
  CHARACTER_SET_NAME,
  COLUMN_TYPE
  COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database_name'
ORDER BY character_set_name,table_name,column_name

and noticed that the varchar/text fields were all utf8, even though the db was latin1, including default fields like field_body.data, as well as custom fields.

One thing that seems very pertinent, is that the characters display correctly in edit mode, but not view mode.

That seems to indicate that the characters *can* be displayed properly. However, when I go to the PW admin page and view the source, I don't see anything unusual, i.e. it has the charset set to utf-8. The doctype is only 'html', and I tried that, with no difference.

And the question remains: what impact does this have in PW's scheme of things, to have a Latin1 database, but some fields set as utf8? etc...

Thanks,

Peter

Link to comment
Share on other sites

  • 2 weeks later...
And the question remains: what impact does this have in PW's scheme of things, to have a Latin1 database, but some fields set as utf8? etc...

I don't think it matters. I've had similar situations in the past but never had issues. I've even had tables that were the wrong character set, but storing UTF8 (a long time ago) and it didn't become a problem until I needed to export and import somewhere else. But if your tables are already the right character set, then I don't think you've got anything to worry about (someone correct me if I'm wrong). You might notice that PW uses the ASCII character set in MySQL for some fields (like those storing 'name' format data, where nothing beyond ASCII is needed). 

  • Like 1
Link to comment
Share on other sites

What i really don't understand is how curly quotes would ever turn up as â because that is the html entity for the letter a with a circonflex. I'm assuming you've given a bad example here.

If things show up correctly on the admin side there must be an issue with the rendering bits of the live site.

Have you checked the encoding of the corresponding template file itself? Any textformatters on the field? What is your render code? Is there a url we can look at?

  • Like 1
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...