Support for non-(a-z) characters in PageAutocomplete or in the "Add New" Page inputfield


seddass

Hi all,

I am trying to build a tagging feature for hotel amenities ("pool", "sauna", etc.), where new tags are created automatically the first time they are entered. The catch is that the tag titles need to be in Cyrillic.

I was hoping that PageAutocomplete would be ideal for a tag system, but it doesn't support searching with Cyrillic characters on keypress.

Unfortunately, the "Create new" feature of the Page input field doesn't support saving items with Cyrillic characters in their title, because it can't automatically replace them with their (a-z) equivalents for the "name" field.

1. Is there an easy way to use the PageName character replacement feature when using "Create new" in the Page input field? Don't you think it would be great if sanitizer->name supported such replacement internally?

2. And something related: at some point I'd like to ask whether PW will support non-(a-z) characters in URLs. I know that there are standards and that Cyrillic characters are not among the allowed characters. However, when searching for something in Cyrillic, many of the Google results contain Cyrillic characters in their URLs. Perhaps, to stay competitive from an SEO point of view, we should be allowed to use non-(a-z) characters in URLs? The same goes for language-specific characters in German and other languages. What do you think?

Thanks

Thanks MadeMyDay, most of the Cyrillic characters are already there by default.

Meanwhile, I found that the PageName input field replacement was NOT enabled by default in $sanitizer->pageName(). I modified the Pages->setupNew() method to enable it, and this allowed me to use the "Create new" feature with non-(a-z) characters.
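
To illustrate why the replacement has to be switched on, here is a minimal Python sketch of the idea. It is not ProcessWire's actual code: the `page_name` function, its `translate` flag, and the tiny Cyrillic table are all hypothetical stand-ins. With translation off, a purely Cyrillic title is stripped down to an empty name; with it on, a usable (a-z) name comes out.

```python
import re

# Hypothetical, deliberately tiny replacement table (assumption for this sketch);
# the real module settings hold many more entries.
CYRILLIC_TO_LATIN = {
    "б": "b", "а": "a", "с": "s", "е": "e",
    "й": "y", "н": "n", "у": "u",
}


def page_name(title, translate=False):
    """Build an (a-z) page name from a title.

    Illustrative only: mimics the idea of a name sanitizer with an
    optional character-translation step, not any real API.
    """
    text = title.lower()
    if translate:
        # Replace every character that has an entry in the table.
        text = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)
    # Collapse every run of disallowed characters into one dash.
    text = re.sub(r"[^a-z0-9_.-]+", "-", text)
    return text.strip("-")


print(repr(page_name("Басейн")))                  # ''
print(repr(page_name("Басейн", translate=True)))  # 'baseyn'
print(repr(page_name("сауна", translate=True)))   # 'sauna'
```

An empty name is presumably what blocks "Create new" for Cyrillic-only titles when the translation step is skipped.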


  Quote
Meanwhile I have found that the PageName input field replacement was NOT enabled by default in $sanitizer->pageName().

Thanks, I will make the same change in the core, replacing the second 'true' param with 'Sanitizer::translate' in the setupNew() function. The translate option was added to the sanitizer pretty recently.


Thanks Ryan!

I'd like to come back to the second part of my post, about using PW with characters other than the allowed (a-z-.) ones in URLs. It seems that Google prioritizes such sites over their competitors. Do you think this will be possible in a future PW release, and would it be worth the effort?

While I know UTF-8 is possible in the query string of URLs, I had thought that domains/paths in URLs were limited to a subset of ASCII characters (at least if we're trying to be standards compliant). I could be wrong about that, but honestly I have not seen UTF-8 domains/paths before (or if I have, I didn't recognize them as such). Do you know of another open source CMS that supports this? I could take a closer look to see what's involved in the implementation and security of that, but would like to have other examples, as this is something I'd not heard of before.

Regarding Google and prioritization, is there any research/documentation that supports the theory that it prioritizes sites using UTF-8 in URLs? I guess that would surprise me, but I always have an open mind. :) You've got me curious.

Characters other than a-z are supported somehow, but I'm not sure how. It might happen at the browser level. If I go to http://fi.wikipedia.org/wiki/ääkköset it all works and looks nice... but when I copy & paste the URL from the address bar (Chrome), I get this: http://fi.wikipedia....g/wiki/Ääkköset

EDIT: I mean I get this:

http://fi.wikipedia.org/wiki/%C3%84%C3%A4kk%C3%B6set
Edited by apeisa
Antti,

As far as I can tell, URIs are all represented in a subset of ASCII characters (see RFC 3986), but other characters (including Unicode characters) can be embedded by percent-encoding them into the URI. Browsers understand this: they decode URIs to display the correct characters in the address bar, and they let you type Unicode characters into the address, converting them with URL encoding on submission. You can do this yourself in PHP using urlencode() or rawurlencode().

Looks like copying and pasting out of Chrome pulls the encoded string out of the address bar.
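
For anyone who wants to see the round trip outside the browser, here is a small sketch. It is written in Python rather than PHP purely for illustration, and uses the standard library's `urllib.parse.quote` and `unquote`. The path is the one from Antti's Wikipedia example.

```python
from urllib.parse import quote, unquote

path = "/wiki/Ääkköset"

# quote() converts non-ASCII characters to UTF-8 bytes and percent-encodes
# each byte; "/" is left alone by default.
encoded = quote(path)
print(encoded)           # /wiki/%C3%84%C3%A4kk%C3%B6set

# unquote() reverses the process, which is roughly what the address bar
# does when it shows the readable form.
print(unquote(encoded))  # /wiki/Ääkköset
```

The encoded form matches what Chrome put on the clipboard, so the characters are never sent "raw"; only the display differs.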

Edited to add: Just found the relevant part of the article I linked...

  Quote
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Edited by netcarver

  • 1 year later...
  On 5/27/2012 at 7:49 AM, MadeMyDay said:

Hi seddass,

Go to modules overview and look for the page name input field. Click on it, there you can define the rules for the char replacement. Try to include the Cyrillic characters there.

Thank you seddass! I was looking for this.

In the Page Name module settings I added a few Latin characters for PW 2.3's auto-generated URLs.

I was creating a page with the title "sopa de cação", and the generated URL was sopa-de-cac-o.

These are the pt-pt characters I added to the module's Page Name:
ã=a
õ=o
 
Could a default PW installation ship with these two (ã, õ) already?
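
A rough sketch of what those two extra lines change, in Python for illustration only. The `parse_replacements` and `make_name` helpers are hypothetical, and the assumption that only `ç=c` was configured beforehand is inferred from the sopa-de-cac-o result, not taken from the module itself.

```python
import re
import unicodedata


def parse_replacements(settings_text):
    """Turn lines of the form 'ã=a' into a {char: replacement} dict."""
    table = {}
    for line in settings_text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        source, _, target = line.partition("=")
        # Normalize so that 'ã' is always a single code point.
        table[unicodedata.normalize("NFC", source.strip())] = target.strip()
    return table


def make_name(title, table):
    """Apply the replacement table, then dash out anything not a-z/0-9."""
    text = unicodedata.normalize("NFC", title.lower())
    text = "".join(table.get(ch, ch) for ch in text)
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


before = parse_replacements("ç=c")            # assumed earlier settings
after = parse_replacements("ç=c\nã=a\nõ=o")   # with the two added lines

print(make_name("sopa de cação", before))  # sopa-de-cac-o
print(make_name("sopa de cação", after))   # sopa-de-cacao
```

Any character without an entry falls through to the dash rule, which is where the stray "-" in the middle of the word came from.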
 
Just for reference:
// Latin
'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C',
'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
'Ð' => 'D', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ő' => 'O',
'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ű' => 'U', 'Ý' => 'Y', 'Þ' => 'TH',
'ß' => 'ss',
'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c',
'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
'ð' => 'd', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ő' => 'o',
'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ű' => 'u', 'ý' => 'y', 'þ' => 'th',
'ÿ' => 'y',

// Latin symbols
'©' => '©',
