tibs

utf8mb4 cuz emojis ftw

16 posts in this topic

MySQL's utf8 encoding can only represent a tiny (though useful) subset of the full 31-bit range of Unicode; it's not UTF-8 at all. To fix what never should've been broken, they came up with utf8mb4. It's "of course" (i.e. UTF-8 is some properly designed piece of stuff) upwards compatible with the broken utf8 type. Well, not exactly, but more on that later.

Would it be possible to default to the real UTF-8 type, when available? (MySQL 5.5.3+) (even wp did it! nudge nudge)

And here comes that "later" you've all been waiting for. This important(ish) change would require the noble sacrifice of cutting some of the core (YES, CORE!) fields down to 250 characters to satisfy MySQL's sad-pathetic-deplorable limitation on index sizes. I know, that's 5 precious characters lost, but think about all those suffering emojis and suddenly you'll feel all fuzzy and warm, right?

Anw, just a suggestion, because i18n and l10n and all the like.

Edit: It seems a cruel RFC took away much of UTF-8's expressive beauty, and it is now valid only up to 4-byte lengths and values of 0x10FFFF, whichever smaller; I'm gonna go and cry myself to sleep now, goodbye.

1 person likes this

Share this post


Link to post
Share on other sites

I'm not sure if moving the requirement for mysql from 5.0.15 to 5.5.3 is a step to be taken as easily. E.g. I'm hosting my personal websites on a webhost, which runs it's servers on CentOS6 (even still some on v5) which does come with just MySQL 5.1. I would suppose it's not the only one out there in that situation.

Share this post


Link to post
Share on other sites

I'm not sure if moving the requirement for mysql from 5.0.15 to 5.5.3 is a step to be taken as easily. E.g. I'm hosting my personal websites on a webhost, which runs it's servers on CentOS6 (even still some on v5) which does come with just MySQL 5.1. I would suppose it's not the only one out there in that situation.

MySQL's utf8 encoding can only represent a tiny (though useful) subset of the full 31-bit range of Unicode; it's not UTF-8 at all. To fix what never should've been broken, they came up with utf8mb4. It's "of course" (i.e. UTF-8 is some properly designed piece of stuff) upwards compatible with the broken utf8 type. Well, not exactly, but more on that later.

Would it be possible to default to the real UTF-8 type, when available? (MySQL 5.5.3+) (even wp did it! nudge nudge)

I wasn't talking about changing the requirements (I know the MySQL requirement isn't that high, and rightly so) -- I'm saying it's reasonable to use utf8mb4 when available, e.g. by running a "show character set" during install, and let's see what we got. (Yes, it would require reducing some field sizes from 255 to 250 characters, but that is hardly a deal-breaker.)

After all, it's just a string in the code ('utf8'), and that only used sparingly. Actually, we already have an option for the encoding in the config file, so why not just use it instead of those hard-coded utf8's? 

Share this post


Link to post
Share on other sites

Actually, that's only true for MyISAM tables with their 1000 bytes index limit. For InnoDB, field length would drop to 191 characters (max. index size of 767 bytes divided through 4 bytes per character).

Going below 200 characters would IMHO be rather inconvenient and possibly dangerous. One could extend the number of indexable characters for InnoDB storage, but that gets a little more involved, as it means re-creating any existing pw database to change the storage engine from antelope to barracuda and adapt innodb_large_prefix in the server configuration. Also something not doable on a shared hosting server.

1 person likes this

Share this post


Link to post
Share on other sites

Actually, that's only true for MyISAM tables with their 1000 bytes index limit. For InnoDB, field length would drop to 191 characters (max. index size of 767 bytes divided through 4 bytes per character).

Going below 200 characters would IMHO be rather inconvenient and possibly dangerous. One could extend the number of indexable characters for InnoDB storage, but that gets a little more involved, as it means re-creating any existing pw database to change the storage engine from antelope to barracuda and adapt innodb_large_prefix in the server configuration. Also something not doable on a shared hosting server.

But this would only be relevant if PW was to consider switching to InnoDB? I have not heard anything about this.

Share this post


Link to post
Share on other sites

If it's as easy as – just use it if available – I've no problem with it. But wouldn't at least all the sanitizers need to be aware about striping those characters or not, depending on if they can be stored? 

Share this post


Link to post
Share on other sites

But this would only be relevant if PW was to consider switching to InnoDB? I have not heard anything about this.

InnoDB is offered as an (experimental) installation option, as pointed out in this thread. So InnoDB compatibility should definitely be taken into account.

Share this post


Link to post
Share on other sites

InnoDB is offered as an (experimental) installation option, as pointed out in this thread. So InnoDB compatibility should definitely be taken into account.

Hah, i totally forgot about that. You are right.

Share this post


Link to post
Share on other sites

Actually, that's only true for MyISAM tables with their 1000 bytes index limit. For InnoDB, field length would drop to 191 characters (max. index size of 767 bytes divided through 4 bytes per character).

Going below 200 characters would IMHO be rather inconvenient and possibly dangerous.

So that's why that 1000-byte limit sounded unfamiliar!

Well, I'm not sure 191 is such a grave issue, but it sure would look weird in the specs.

Most of the fields in question will only ever store  [0-9a-z-], for which even ASCII is an overkill (and it could store 1000 characters yay), but I totes understand why nobody wants to break pw in fun ways while striving for marginal gain. Maybe I'll put together a pull request for the simple case (i.e. MySQL 5.5.3+ with the MyIsam engine and the 250-character limit), and we'll see if the gods favor it.

Share this post


Link to post
Share on other sites

I have to admit I would also like to free the emojis. PW should support utf8mb4 now at install time (click the gear icon on the DB config screen). There are some hard-coded utf8 references in the install.sql files (core and site profile) but the installer updates those before executing the queries (at least it should).  And any create/alter table queries that occur after that (like when adding a field, etc.) use the $config->dbCharset that is setup at install time.

It sounds like we need to update some column lengths though. Going from 255 down to 250 seems fine to me (for new installs). Though maybe a little more concerned about going down to 191 for InnoDB, especially since we'll probably make InnoDB our new default before PW3 becomes the new master branch. I'll look closer to see if there are situations where the indicated columns need more than 191. In PW3, we wouldn't change the character set or DB engine of existing installations, only new installations. 

Have a look at this discussion: https://github.com/r...Wire/issues/839

@Pierre-Luc I completely missed your patch before. Sorry about that, sometimes I've got information overload here and it's easy for me to miss things (I think this is likely the case for all of us). I will definitely use and apply this. Thank you for creating it. We won't need the changes to the install.sql files though, since the hard coded db charsets in the install.sql files are updated dynamically by the installer. Though I will need to double check that is working, as it's been awhile since I've tested it. 

Going below 200 characters would IMHO be rather inconvenient and possibly dangerous. One could extend the number of indexable characters for InnoDB storage, but that gets a little more involved, as it means re-creating any existing pw database to change the storage engine from antelope to barracuda and adapt innodb_large_prefix in the server configuration. Also something not doable on a shared hosting server.

@BitPoet can you expand on why you think it might be possibly dangerous? I know it would certainly be dangerous if applied to existing installations (it would delete data). But I don't think we should ever have PW modify the charset or engine of existing installations. But for new installations, do you see any potential dangers?

Share this post


Link to post
Share on other sites

I like your thinking Ryan, i would applaud the switch to InnoDB for PW3. I think we would want full-text search just like in PW2? If so, the min requirement for MySQL will become 5.6? Are you planning to make use of some of the InnoDB features that maybe now are done in the PHP codebase?

Share this post


Link to post
Share on other sites

SiNNuT by "make it the default" I mean "make it the default when supported". When full-text isn't available for InnoDB, we couldn't use InnoDB for things like FieldtypeText/FieldtypeTextarea. I don't plan to drop MyISAM support, just default to InnoDB in the installer when we're identified that it can be fully supported in the environment. This could be a problem with migration, like when your dev server has the needed versions and your production server doesn't, so the installer will definitely need to call more attention to these considerations when providing the db engine selection at install time. Btw, we already do use InnoDB for some tables where we want to avoid table locking (like DB sessions). 

1 person likes this

Share this post


Link to post
Share on other sites

@BitPoet can you expand on why you think it might be possibly dangerous? I know it would certainly be dangerous if applied to existing installations (it would delete data). But I don't think we should ever have PW modify the charset or engine of existing installations. But for new installations, do you see any potential dangers?

Yes, I was thinking in terms of upgrading existing installations, which, of course, is a moot point. There might be a nag if one fiddles around with exporting from an 'old' system and importing to utf8mb4 tables, but even that is not likely to come up. So best ignore my earlier worries. :)

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.

  • Similar Content

    • By horst
      Hi,
      I have had some (more) trouble to get started with PW  :-(
      In my php.ini the default_charset was set to 'ISO-88591' or something similar.
      So, as the DB uses utf8 and the pages are delivered in utf8, php is required to use it too.
      If not you can run into errors when try to save strings with e.g. german umlaute: äöü ÄÖÜ
           Incorrect string value: '\xFCchte ...' for column 'data' at row 1
      I'm not sure if this was checked with the install routine, but I have seen only green lines
      Maybe this could be done to avoid some hassle to users with a setup other than utf8.
      Or it maybe an option to add "ini_set('default_charset','utf8');" somewhere at the top of index.php,
      and to be very very sure (also for paranoics like me ) one can set and check it:
          ini_set('default_charset','utf8');
          $php_use_utf8 = strtolower(trim(ini_get('default_charset')))=='utf8' ? true :  false;
       
      Ok, now I can start try out that wonderful CMS