[SOLVED] UTF8 page names whitelist exceptions


gebeer

Hi,

I have a multi-language project that requires UTF8 page names and slugs, mainly for Chinese. According to the documentation at https://processwire.com/blog/posts/page-name-charset-utf8/ and this post, it seems that we can use $config->pageNameWhitelist="" to allow all characters.

Or, to allow only certain characters, we can use a list of those e.g. $config->pageNameWhitelist="æåäßöüđжхцчшщюяàáâèéëêě...".

In my use case I want to allow all traditional Chinese characters but disallow German umlauts, so that Chinese page names keep their UTF8 characters while a German page name like "über-uns" gets converted to "ueber-uns".

With the available config settings, I don't see how I can accomplish that other than by putting all traditional Chinese characters into $config->pageNameWhitelist, which is not feasible.

What I would need is a blacklist. Sanitizer.php uses a blacklist: https://github.com/processwire/processwire/blob/6ff498f503db118d5b6c190b35bd937b38b80a77/wire/core/Sanitizer.php#L844

But I can't add to that. A config setting like $config->pageNameBlacklist would be great.

Since we don't have that, I need to work around the problem. I'm thinking of a Pages::saveReady hook that checks for German umlauts and then translates them to the required values.
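
To illustrate the translation step such a hook would need, here is a minimal sketch in plain PHP, independent of ProcessWire. The function name translateGermanUmlauts is hypothetical and the character list is only an assumption about which umlauts matter; everything not in the list, including Chinese characters, passes through unchanged.

```php
<?php
// Hypothetical helper, not part of ProcessWire: transliterates German
// umlauts and leaves every other character (including Chinese) untouched.
function translateGermanUmlauts(string $name): string {
    // strtr() with an array replaces all keys in a single pass, so the
    // output of one rule is never re-processed by another rule.
    return strtr($name, [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'ae', 'Ö' => 'oe', 'Ü' => 'ue',
        'ß' => 'ss',
    ]);
}

echo translateGermanUmlauts('über-uns'), "\n"; // ueber-uns
echo translateGermanUmlauts('關於我們'), "\n";   // 關於我們
```

In a real hook the result would still have to go through one of the Sanitizer page name methods afterwards; this sketch only covers the blacklist-style replacement.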

Does anyone have a better idea?  


You could add a rewrite rule to .htaccess, but I reckon it'd be easier (and wouldn't risk being overwritten on updates) to do as you suggest and use a Pages::saveReady hook. I do this regularly to manage page names and it doesn't cause problems.

You're probably aware of this, but you may need to stop the hook running if the page hasn't yet been created:

$pages->addHookAfter('Pages::saveReady', function($event) {
    $page = $event->arguments('page');
    if(!$page->id) return; // skip pages that have not been created yet
    ...

 


  • gebeer changed the title to [SOLVED] UTF8 page names whitelist exceptions

I solved it with this hook:

/**
 * only use UTF-8 page names for defined languages
 */
$wire->addHookBefore('Pages::saveReady', function($event) {
    /** @var Page $page */
    $page = $event->arguments(0);
    if($page->template->name == 'admin') return;
    $languages = $event->wire('languages');
    $sanitizer = $event->wire('sanitizer');

    $utf8Languages = ['cn']; // Add more languages as needed

    foreach ($languages as $language) {
        if ($language->isDefault()) continue;

        $langName = $language->name;
        $translatedTitle = $page->getLanguageValue($language, 'title');

        if (in_array($langName, $utf8Languages)) {
            $pageName = $sanitizer->pageNameUTF8($translatedTitle);
        } else {
            $pageName = $sanitizer->pageNameTranslate($translatedTitle);
        }

        $page->setLanguageValue($language, 'name', $pageName);
    }
});

 


Additional info: with the above solution, Chinese characters are preserved as UTF8 in page names. But UTF8 characters in the other languages are not translated by pageNameTranslate the way they would be if the site didn't have $config->pageNameCharset = 'UTF8'.
That is undesirable because, for example, German umlauts now get translated from ä to a instead of ae.

If $config->pageNameCharset = 'UTF8' is set, the pageNameTranslate sanitizer ignores the character map that we can usually edit in the module settings for InputfieldPageName. In fact, that editable character map is no longer available when $config->pageNameCharset = 'UTF8'.

To circumvent this issue, I manually replace all characters before passing them to the pageNameTranslate sanitizer. I wrote a function that takes the default character map from InputfieldPageName::$defaultReplacements, adds custom characters, and then performs string replacements based on that map:

	/**
	 * Replaces characters in a string with characters from a custom character map.
	 * Needed because Sanitizer::pageNameTranslate() doesn't take into account the character map defined in InputfieldPageName
	 * because the site uses $config->pageNameCharset = 'UTF8'.
	 * Used in the Pages::saveReady hook in ready.php.
	 *
	 * @param string $string
	 * @return string
	 */
	public function replaceCustomCharacters($string) {
		// take default character map from InputfieldPageName class
		$defaultChars = InputfieldPageName::$defaultReplacements;
		// add additional characters
		$allChars = array_merge($defaultChars, [
			'ä' => 'ae',
			'ö' => 'oe',
			'ü' => 'ue',
			'ß' => 'ss',
		]);
		// replace all characters from map $allChars in $string
		$string = str_replace(array_keys($allChars), array_values($allChars), mb_strtolower($string));

		return $string;
	}
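
Why the merge gives the German digraphs priority can be seen in a small standalone sketch. The $defaultChars array here is only an assumed stand-in for InputfieldPageName::$defaultReplacements, not the real map, and replaceWithMap is a hypothetical name used so the example runs outside ProcessWire.

```php
<?php
// Assumed stand-in for InputfieldPageName::$defaultReplacements
// (illustrative excerpt only, not the real ProcessWire map).
$defaultChars = ['ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'é' => 'e'];

// With string keys, array_merge() lets the later array win, so the
// German digraphs override the single-letter defaults.
$allChars = array_merge($defaultChars, [
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
]);

// Hypothetical standalone version of the replacement step.
function replaceWithMap(string $string, array $map): string {
    // Lowercase first so that 'Ü' is matched by the 'ü' entry.
    $lower = mb_strtolower($string, 'UTF-8');
    return str_replace(array_keys($map), array_values($map), $lower);
}

echo replaceWithMap('Über Größe', $allChars), "\n"; // ueber groesse
echo replaceWithMap('Café', $allChars), "\n";       // cafe
```

Note that str_replace() applies the pairs one after another, so a map in which one replacement produces a character that is itself a key further down would be replaced twice; strtr() with the same array avoids that if it ever becomes a problem.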

My final saveReady hook now looks like this:

/**
 * only use UTF-8 page names for defined languages
 */
$wire->addHookBefore('Pages::saveReady', function($event) {
    /** @var Page $page */
    $page = $event->arguments(0);
    if($page->template->name == 'admin') return; // exclude admin pages
    if($page->rootParent->id === 2) return; // exclude pages under admin tree (e.g. media manager pages)
    $languages = $event->wire('languages');
    /** @var Sanitizer $sanitizer */
    $sanitizer = $event->wire('sanitizer');

    $utf8Languages = ['cn']; // Add more languages as needed

    foreach ($languages as $language) {
        if ($language->isDefault()) continue;

        $langName = $language->name;
        $translatedTitle = $page->getLanguageValue($language, 'title');

        if (in_array($langName, $utf8Languages)) {
            $pageName = $sanitizer->pageNameUTF8($translatedTitle);
        } else {
            $pageName = $sanitizer->pageNameTranslate(wire('site')->replaceCustomCharacters($translatedTitle));
        }

        $page->setLanguageValue($language, 'name', $pageName);
    }
});

Note that my replacement function lives in a custom Site module that is made available to the API as wire('site'). 

All in all, there is a lot to consider with UTF8 page names activated. I wish PW would make life easier here. Just the other day, when I played around with GPT-4 and asked it how we could solve that problem, it started to hallucinate and proposed that all I had to do was set a utf8PageNames property per language, like in this dummy code:

$lang = $languages->get('cn');
$lang->utf8PageNames = true;
$lang->save();

In reality, the Language object does not have such a property. But I think this would be a great enhancement: it would allow setting page name behaviour per language rather than only globally, as it is now. I will open a feature request.
