The ü's are not what they seem - Weird umlaut/diacritics problems when pasting text from PDFs (on Apple?)

Andi · May 3, 2020

So here's a thing I've been chewing on for a while now. And although I kind of have a handle on it at this point, I'd still love to understand what is actually happening here.

Please bear with me.. Last year I did a project for a client in Lübeck, Germany, and implemented a full text search on $page->body and $page->summary not too different from @ryan's default site frontend search functionality.

Got it all to work nicely without too much hassle, client was already filling up the site with content, and then one day I decided to test the frontend search by looking for pages with "Lübeck" in them. Which threw out 3 matches in more than 50 pages. Naturally, with the site being in Lübeck, that number seemed a little off, so I checked manually and realized there were lots of pages missing from the results. So I thought ok something's broken about my search.

Half a day of testing later, and after ending up using Firefox's Ctrl+F search with "Match Diacritics" + "Highlight All" on the frontend, I realized that not all ü's were the same across the site. Some would get highlighted (those that PW's search would pick up), and some wouldn't.

// try to CTRL+F this page and search for one of these
ü != ü

Talking to the client it turned out that he had copy-pasted large chunks of the content from PDF files. He was working on a Mac, so I did some research and found a bunch of information on all sorts of weird diacritics problems when doing exactly that.

These are in German, but I think the basic idea should come across:

https://blog.k-webs.ch/2017-02-verschobene-punkte-ueber-umlauten.html

https://www.macuser.de/threads/falsche-umlautdarstellung-ue-punkte-versetzt.748967/

So it turned out some of the buggers were actual ü's, and some were actual u's with a "trema" attached to them. Something a little bit like this

// but not quite
u¨
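For anyone who wants to see the difference rather than squint at it, here is a minimal sketch. It assumes the "good" ü is the single precomposed code point U+00FC and the "bad" one is a plain u followed by U+0308 (combining diaeresis); the variable names are made up for illustration, and the `\u{...}` escapes need PHP 7 or later.

```php
<?php
// Assumption: "good" = precomposed U+00FC, "bad" = u + U+0308 (combining diaeresis).
$precomposed = "\u{00FC}";
$decomposed  = "u\u{0308}";

// Both usually render as the same glyph, but the UTF-8 bytes differ.
echo bin2hex($precomposed), "\n"; // c3bc
echo bin2hex($decomposed), "\n";  // 75cc88

// A byte-wise comparison or substring search treats them as different text.
var_dump($precomposed === $decomposed);               // bool(false)
var_dump(strpos("Lu\u{0308}beck", "L\u{00FC}beck"));  // bool(false)
```

That would explain why a search for a typed "Lübeck" skips every page where the word was pasted in the other form.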

Now the part where this gets even more confusing is that on the frontend side they don't look the same across different OSes / browsers / what have you. Chrome on Apple seems to handle them completely differently from FF on Windows 7 for instance, so the client couldn't even understand what the problem was because he didn't see it.

Also please note that during that particular project all other Umlaute / diacritics were fine: äs, ös, uppercase, lowercase, everything. Just the little ü's were acting up. So we sat down and started search & replacing them. There was not a lot of time, aaand well, I was new to PW.

... stay tuned for Part II where we're doing a project with over 200 international artists with all sorts of funky diacritics in bandinfos *all coming from PDF Files*

Andi · May 3, 2020

Part II

So one year later we're working on a site that features little artist profiles with a summary, body/longer band info, and some pictures. These guys are from all over the place (Spain, France, Germany) and all the docs the client is copy/pasting from are PDFs. He's also on a Mac (if that's at all related), and the ü's are back.

Frontend: [screenshot]

Backend: [screenshot]

Together with tons of messed up diacritics in Spanish & French names and so on. Also this time both lowercase (äöü) and uppercase (ÄÖÜ) have tremas, all of them.

So no way we could manually replace all that for currently ~250 artists in both the English and German language versions. We did try pasting the text into a Code Editor first and then copying it over to PW, but that also did nothing..

So the workaround we're currently using is this:

This is what the bookmark in Finder looks like that helped me build a selector to get the affected pages. I just copy/pasted the "bad" Umlauts straight from PWs backend fields into the value fields here.

And then wrote a macro in Tracy Debugger which currently looks like this

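// CAREFUL WITH $PAGE IN MACROS

// the selector values are the "bad" umlauts: base letter + U+0308 (combining diaeresis)
$pgs = $pages->find("summary%=u\u{0308}|o\u{0308}|a\u{0308}");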
bd($pgs);
bd("Found " . $pgs->count() . " pages");

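// try to add something to this array in Tracy or a code editor of your choice and watch all hell break loose
// "bad" = base letter + U+0308 (combining diaeresis), written as escapes (PHP 7+)
// because the pasted characters look identical to the good ones in most editors
$bad = array("u\u{0308}","o\u{0308}","a\u{0308}","U\u{0308}","A\u{0308}","O\u{0308}");
bd($bad);

// "good" = the single precomposed code points ü ö ä Ü Ä Ö
$good = array("\u{00FC}","\u{00F6}","\u{00E4}","\u{00DC}","\u{00C4}","\u{00D6}");
bd($good);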

foreach($pgs as $p) {
    $p->of(false);
    $summary_old = $p->summary;
    $summary_new = str_replace($bad, $good, $summary_old);
    bd($summary_new);
    $p->summary = $summary_new;
    $p->save('summary');
}

Which, on its face, looks like the most ridiculous piece of code you could ever write. But it works.

Andi · May 3, 2020

Part III

So the questions would of course then be...

1. What the flying f&#k is this sh#t ?

2. How would I approach getting all this into a hook (I'm thinking on saveReady maybe?) so this would be fixed automatically for summary & body for both language versions of the page whenever the user pastes another crazy mess of trematized Umlaut madness into an artist's profile?

Thank you all in advance & god bless.. Glad I finally got this off my chest

bernhard · May 3, 2020

Hi @Andi

Interesting, thx for sharing.

I've had a similar strange issue lately where utf8-encoded whitespace characters were an issue.

Why don't you put that code in a saveReady hook so that it is done automatically on every page save? Even better would be a little module so that you can share it with us.


// info snippet
class Classname extends WireData implements Module, ConfigurableModule {

  public static function getModuleInfo() {
    return [
      'title' => 'Classname',
      'version' => '0.0.1',
      'summary' => 'Your module description',
      'autoload' => true,
      'singular' => false,
      'icon' => 'smile-o',
      'requires' => [],
      'installs' => [],
    ];
  }

  public function init() {
    $this->addHookAfter("Pages::saveReady", $this, "replaceBadUmlauts");
  }

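  public function replaceBadUmlauts(HookEvent $event) {
    $page = $event->arguments(0);
    // skip pages whose template has no summary field
    if(!$page->hasField('summary')) return;
    // "bad" = base letter + U+0308 (combining diaeresis), "good" = precomposed (PHP 7+ escapes)
    $bad = array("u\u{0308}","o\u{0308}","a\u{0308}","U\u{0308}","A\u{0308}","O\u{0308}");
    $good = array("\u{00FC}","\u{00F6}","\u{00E4}","\u{00DC}","\u{00C4}","\u{00D6}");
    $page->summary = str_replace($bad, $good, $page->summary);
  }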

  /**
  * Config inputfields
  * @param InputfieldWrapper $inputfields
  */
  public function getModuleConfigInputfields($inputfields) {
    return $inputfields;
  }
}

In the module config you could add an asm field to select the fields where the replace should happen

Andi · May 3, 2020

Haha thanks @bernhard, you're the man. Think I'm ready to be a PW module coder just yet..?

Still a little scared of hooks although it is getting better.

This looks promising, I'll get on it after breakfast.

bernhard · May 3, 2020

11 minutes ago, Andi said:

Still a little scared of hooks although it is getting better.

See my signature.

Andi · May 3, 2020

@bernhard how did I totally overlook that all this time?

So if I wanted to use this on more than one fieldtype on a multilingual site I'd just need two nested foreach-loops in

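// Something along these lines? (untested sketch)
public function replaceBadUmlauts(HookEvent $event) {
    $page = $event->arguments(0);
    // "bad" = base letter + U+0308 (combining diaeresis), "good" = precomposed
    $bad = array("u\u{0308}","o\u{0308}","a\u{0308}","U\u{0308}","A\u{0308}","O\u{0308}");
    $good = array("\u{00FC}","\u{00F6}","\u{00E4}","\u{00DC}","\u{00C4}","\u{00D6}");
    // names of the fields to clean up
    $fields = array('summary', 'body');
    foreach ($fields as $field) {
        if (!$page->hasField($field)) continue;
        foreach ($this->wire('languages') as $language) {
            $value = $page->getLanguageValue($language, $field);
            $page->setLanguageValue($language, $field, str_replace($bad, $good, $value));
        }
    }
}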

I'll need to dig into the documentation a little bit and see how to do this.. Thanks for putting me on the track!

Andi · May 4, 2020

Haven't gotten to the coding part yet, but we had a longer team conference this morning and tried to narrow down this issue a little bit.. In the name of science, so to speak.

We had 4 people with 4 different setups all copy/paste to PW from the same PDF file, and these were the results:

#1 macOS High Sierra 10.13.6
- Apple PDF Preview -> PW: Broken diacritics
- Acrobat Reader -> PW: Broken diacritics
- Google Drive (Browser) internal document viewer -> PW: Good diacritics

#2 macOS Catalina 10.15.4
- Acrobat Reader -> PW: Good diacritics
- Acrobat Reader -> Text Editor -> PW: Broken diacritics (no idea why..)
- Google Drive (Browser) internal document viewer -> PW: Good diacritics

#3 Windows 10
- No Problems with diacritics during copy/paste

#4 Windows 7
- No Problems with diacritics during copy/paste

So although I'll still have to implement a hook to deal with the existing records in the database, I think we've established a workflow for #1 to get clean data into the system right off the bat, or at least for the time being.

kongondo · May 4, 2020

21 minutes ago, Andi said:

I think we've established

...that Microsoft products are superior to Apple products.

...I'll crawl back under my rock now...

Andi · May 4, 2020

1 minute ago, kongondo said:

...that Microsoft products are superior to Apple products.

...I'll crawl back under my rock now...

Haha ok then maybe it should also be mentioned that on both of my Linux machines none of this cr#p is any issue at all.

interrobang · May 4, 2020

Without much testing this hook works for me (in site/ready.php). But I think utf normalizing should be part of the core text sanitizer, so other texts (like image descriptions) are normalized too.

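function normalize_UTF_NFC($string) {
    // leave non-strings alone; the intl extension may not be installed
    if (is_string($string) && function_exists('normalizer_normalize')) {
        if ( !normalizer_is_normalized($string)) {
            // normalizer_normalize() defaults to NFC and returns false on invalid input
            $normalized = normalizer_normalize($string);
            if ($normalized !== false) $string = $normalized;
        }
    }

    return $string;
}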

$wire->addHookBefore('InputfieldTextarea::processInput, InputfieldText::processInput', function (HookEvent $event) {
    /** @var Inputfield $inputfield */
    /** @var WireInputData $input */
    /** @var Language $language */
    $inputfield = $event->object;
    $input = $event->arguments(0);

    if ($this->languages && $this->languages->count > 1) {
        foreach ($this->languages as $language) {
            $input_var_name = $language->isDefault() ? $inputfield->name : "{$inputfield->name}__{$language->id}";
            $input->set($input_var_name, normalize_UTF_NFC($input->$input_var_name));
        }
    } else {
        $input_var_name = $inputfield->name;
        $input->set($input_var_name, normalize_UTF_NFC($input->$input_var_name));
    }

    $event->arguments(0, $input);

});

Andi · May 4, 2020

Hey @interrobang,

thanks a bunch. A different angle altogether.. And I guess I'll need to read up on utf normalization a little bit.

https://stackoverflow.com/a/7934397

Quote

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form.
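
To make that quote a bit more concrete, here is a small sketch of canonical normalization to NFC using PHP's intl extension. The helper name `to_nfc` and the sample strings are made up for illustration, and the code simply returns the input unchanged when intl is not installed, so treat it as an assumption-laden demo rather than a finished solution.

```php
<?php
// Hypothetical helper: returns the NFC form of $text, or $text unchanged
// when the intl extension is missing or the input is not valid UTF-8.
function to_nfc(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text;
    }
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    return $normalized === false ? $text : $normalized;
}

$fromPdf = "Lu\u{0308}beck"; // decomposed: u + U+0308
$typed   = "L\u{00FC}beck";  // precomposed: U+00FC

var_dump($fromPdf === $typed); // bool(false): different code points
if (class_exists('Normalizer')) {
    var_dump(to_nfc($fromPdf) === $typed); // bool(true) once both are in NFC
}
```

Once both strings are in the same normalization form they compare equal again, which is the property a search depends on.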

Guess the problem is that I don't actually, on a technical level, understand what's happening at all.. But that does sound like we're headed in the right direction here.. I'll set up a testbed tomorrow on my localhost and try this approach, since we've already pretty much ironed out all the problematic database records on the dev site.

So am I understanding correctly that this hook would also utf-normalize textarea fields for existing records, as long as a user opened the page for editing and just hit save once?

Thanks again and greetings from Regensburg.

interrobang · May 4, 2020

To be honest, I just googled a bit, I probably don't understand more of this than you. Btw, Wordpress is discussing this now for 6 years: https://core.trac.wordpress.org/ticket/30130

Also, I don't know if this normalizer function is usually available or not – there are polyfills, but then all gets complicated.

6 minutes ago, Andi said:

So am I understanding correctly that this hook would also utf-normalize textarea fields for existing records, as long as a user opened the page for editing and just hit save once?

I am not sure if just saving is enough or if the fields need some changes tracked. Better test before re-saving 1000s of pages.

Andi · May 4, 2020

4 minutes ago, interrobang said:

Btw, Wordpress is discussing this now for 6 years: https://core.trac.wordpress.org/ticket/30130

Haha ok looks like you're right on the money.

Awesome stuff, that really makes this all seem a little less nonsensical..

6 minutes ago, interrobang said:

I am not sure if just saving is enough or if the fields need some changes tracked. Better test before re-saving 1000s of pages.

I'll absolutely make sure of that.. Will set this up tomorrow first thing in the morning and report back.

Thank you all for your help and advice!

Have a nice evening!

Andi · May 4, 2020

Always helps to know what you're looking for:

https://modules.processwire.com/modules/textformatter-normalize-utf8/

Thanks for the pointers @interrobang, this is going to help a lot..

interrobang · May 4, 2020

Beware, a textformatter doesn't help with your search issue, as the texts in mysql are still the same and not normalized.

Andi · May 4, 2020

I'm aware of that, but thanks for pointing it out @interrobang

Thought I'd just put that here for future reference. I imagine many people (like me until 3 hrs ago) haven't even heard of the term utf normalization.

That thread at the wordpress tracker is a great read btw. There's a lot there already, so I think we should be able to get this whole thing sorted out by tomorrow night.

Cheers & many thanks.

Andi · May 5, 2020

Just getting started here but I can already confirm that @justb3a's module (TextformatterNormalizeUtf8) completely fixes this issue on the frontend side of things..

Now I'd basically just need to find a simple way to run every text & textarea field in both language versions through the normalizer..

Actually also pagetitle fields.. Or ideally, literally every bit of text that's being saved to the database..

Or would there be any downsides to that at all?

---

Module works fine on PW v3.0.148 by the way

bernhard · May 5, 2020

16 minutes ago, Andi said:

Now I'd basically just need to find a simple way to run every text & textarea field in both language versions through the normalizer..

$value = "üäö"; // bad values
$this->modules->get('TextformatterNormalizeUtf8')->format($value);
echo $value; // good values

Note that $value is modified by reference.
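
In case "by reference" is unfamiliar, here is a tiny plain-PHP sketch of the same calling pattern. `format_in_place` is a hypothetical stand-in and not the module's actual code; it only handles one hard-coded replacement to keep the example short.

```php
<?php
// Hypothetical stand-in for a textformatter: the parameter is declared
// with & so the caller's variable is changed and nothing is returned.
function format_in_place(string &$value): void
{
    $value = str_replace("u\u{0308}", "\u{00FC}", $value);
}

$value = "Lu\u{0308}beck";  // decomposed ü
format_in_place($value);    // no assignment needed
echo bin2hex($value), "\n"; // 4cc3bc6265636b
```

So calling the method without assigning its return value is intentional, not a typo.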

Imho a good utf8 normalization should be built into the core as a sanitizer. So it would be great if you came up with a solid solution that we can suggest to ryan.

Andi · May 5, 2020

Haha @bernhard easy on me.. Still coding like it's 2005 over here.. Guess the 10 year break made me miss out on a whole lot of stuff.

Good lord I'm glad these arrays are no longer needed.

if ($modules->isInstalled("TextformatterNormalizeUtf8")) {
    // fetch the textformatter once instead of on every iteration
    $normalizer = $modules->get('TextformatterNormalizeUtf8');
    $pgs = $pages->find("template=artist");
    foreach($pgs as $p) {
        $p->of(false);
        $summary = $p->summary;
        bd("OLD: $summary");
        $normalizer->format($summary); // modifies $summary by reference
        bd("NORMALIZED: $summary");
        $p->summary = $summary;
        $p->save('summary');
    }
}

Andi · May 5, 2020

So on the input side of things.. I'm thinking, since we don't know about

On 5/4/2020 at 4:37 PM, interrobang said:

if (function_exists('normalizer_normalize')) {

..but we have @justb3a's module we could combine @interrobang's idea of hooking into InputfieldTextarea::processInput with calling the textformatter module to sort out normalization right off the bat..

This almost works:

$wire->addHookBefore('InputfieldTextarea::processInput, InputfieldText::processInput', function (HookEvent $event) {
	/** @var Inputfield $inputfield */
	/** @var WireInputData $input */
	/** @var Language $language */
	$inputfield = $event->object;
	$input = $event->arguments(0);

	if (wire()->modules->isInstalled("TextformatterNormalizeUtf8")) {

		bd($this->languages); // ### this doesn't seem to be an array ###

		if ($this->languages && $this->languages->count > 1) {
			// ### this part never gets called ###
			foreach ($this->languages as $language) {
				bd("language loop");
				$input_var_name = $language->isDefault() ? $inputfield->name : "{$inputfield->name}__{$language->id}";
				$normalize_input = $input->$input_var_name;
				wire()->modules->get('TextformatterNormalizeUtf8')->format($normalize_input);
				bd("$input_var_name :: $normalize_input");
				$input->set($input_var_name, $normalize_input);
			}
		} else {
			// ### this part works ###
			$input_var_name = $inputfield->name;
			$normalize_input = $input->$input_var_name;
			wire()->modules->get('TextformatterNormalizeUtf8')->format($normalize_input);
			$input->set($input_var_name, $normalize_input);
			bd("$input_var_name :: $normalize_input");
		}
	}

	$event->arguments(0, $input);

});

With this, utf-8 normalizing works for the default language input fields, but currently all others remain untouched.

I tried looking around for a working example of how to get to the language fields, but no luck so far.. Also feeling just a tad bit twitchy about hooking in at such a deep level, but so far no problems during testing..

Does anyone have an idea where I'm going wrong here?

Thanks and all the best..! Almost there I think.

Andi · May 5, 2020

Ok ... it has to be ->count(), not ->count.

in site/ready.php

/**
 * run all Text and Textarea input fields through UTF-8 normalization
 * requires justb3a's TextformatterNormalizeUtf8 module
 * https://modules.processwire.com/modules/textformatter-normalize-utf8/
 *
 * @var Inputfield $inputfield
 * @var WireInputData $input
 * @var Language $language
 *
 */
$wire->addHookBefore('InputfieldTextarea::processInput, InputfieldText::processInput', function (HookEvent $event) {
	$inputfield = $event->object;
	$input = $event->arguments(0);
	if (wire()->modules->isInstalled("TextformatterNormalizeUtf8")) {
		if ($this->languages && $this->languages->count() > 1) {
			foreach ($this->languages as $language) {
				$input_var_name = $language->isDefault() ? $inputfield->name : "{$inputfield->name}__{$language->id}";
				$normalize_input = $input->$input_var_name;
				wire()->modules->get('TextformatterNormalizeUtf8')->format($normalize_input);
				$input->set($input_var_name, $normalize_input);
			}
		} else {
			$input_var_name = $inputfield->name;
			$normalize_input = $input->$input_var_name;
			wire()->modules->get('TextformatterNormalizeUtf8')->format($normalize_input);
			$input->set($input_var_name, $normalize_input);
		}
	}
	$event->arguments(0, $input);
});

This is by no means thoroughly tested.. Use at your own risk.

Does anyone see how there could be any downsides to this approach on a live PW site?

Andi · May 6, 2020

After some more testing I feel like this should be safe to implement on the production site..

The only remaining issue at this point being..

On 5/4/2020 at 4:37 PM, interrobang said:

I think utf normalizing should be part of the core text sanitizer, so other texts (like image descriptions) are normalized too.

The image description input fields seem to be a different kind of beast altogether..

Is there any way to extend the hook to cover these as well?

interrobang · May 6, 2020

The more I think about it, I am sure we need a core solution. Even if we hook into every Inputfield::processInput method we know of, there will still be custom inputfields which we will miss. And if you populate your pages via the API, Inputfields are not even used and the hooks are never called.

UTF8 normalization should be buried somewhere deep in the core: Probably every mysql query and all user input should be normalized automatically. Currently this is not possible with hooks alone as far as I know.

Btw, I just found another lightweight library which provides a fallback function if normalizer_normalize is not available: https://github.com/wikimedia/utfnormal
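
Just to illustrate the "use intl if it is there, otherwise fall back" idea, here is a rough sketch. `compose_umlauts` is a hypothetical helper, and its fallback is a hand-written map that only covers the six German umlauts; it is not a full Unicode normalization and says nothing about how the library above works.

```php
<?php
// Hypothetical helper: prefer the intl normalizer, otherwise fall back to a
// small hand-written map. The map only covers the six German umlauts.
function compose_umlauts(string $text): string
{
    if (function_exists('normalizer_normalize')) {
        $normalized = normalizer_normalize($text); // defaults to NFC
        if ($normalized !== false) {
            return $normalized;
        }
    }
    // base letter + U+0308 (combining diaeresis) => precomposed code point
    $map = [
        "a\u{0308}" => "\u{00E4}", "o\u{0308}" => "\u{00F6}", "u\u{0308}" => "\u{00FC}",
        "A\u{0308}" => "\u{00C4}", "O\u{0308}" => "\u{00D6}", "U\u{0308}" => "\u{00DC}",
    ];
    return strtr($text, $map);
}

echo compose_umlauts("Scho\u{0308}nen Feierabend"), "\n";
```

Anything beyond those six characters (Spanish or French accents, for example) would still need a real normalizer or a proper polyfill.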

Andi · May 6, 2020

2 hours ago, interrobang said:

The more I think about it, I am sure we need a core solution.

As much as I appreciate @bernhard's faith in my motivation, and as much as I'd like to contribute, that would at this point go way over my head..

I keep pushing the day where I'll finally take a dive and learn to use git/hub/lab. And I'm still struggling to find my way around these "new" development tools and procedures.. Lots of ground to cover there currently.

But yeah this seems like something that would be really useful. Maybe even as some kind of default way of dealing with user input, as I currently could only think of fringe cases where someone would not want to have this enabled..
