$sanitizer->url() corrections


apeisa

I have a few additions for $sanitizer->url().

Currently it turns values like "I am url" into "iamurl" and leaves "www.url.com" as "www.url.com". I added a parameter that lets us disallow relative URLs, so those two values always come back as "" and "http://www.url.com".

Of course this could be much better, but it suits my needs and I think it is something that's needed often:

public function url($value, $allowRelative = true) {

	if(!strlen($value)) return '';

	if(!strpos($value, '://')) {
		// URL is missing protocol, or is local/relative

		$dotPos = strpos($value, ".");

		if($dotPos !== false) {
			// something like: www.company.com/about or company.com
			$value = "http://$value";

		} else if(!$allowRelative) {
			// We don't allow relative URLs, so return blank
			$value = '';
		}
		// otherwise it's a relative URL like /about/ or about/, so leave it alone
	}

	$value = filter_var($value, FILTER_SANITIZE_URL); 
	return $value ? $value : '';
}
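
To make the behavior above easier to check, here is a minimal standalone sketch of the same logic. The free function name sanitizeUrlSketch is hypothetical (the real code is a method on the Sanitizer class), and the sample inputs are only illustrations.

```php
<?php
// Hypothetical standalone copy of the proposed url() logic, for experimenting
// outside ProcessWire. Not the actual Sanitizer class.
function sanitizeUrlSketch(string $value, bool $allowRelative = true): string {
	if(!strlen($value)) return '';

	if(!strpos($value, '://')) {
		// URL is missing protocol, or is local/relative
		if(strpos($value, '.') !== false) {
			// has a dot, so treat it as a domain and add the protocol
			$value = "http://$value";
		} else if(!$allowRelative) {
			// relative URLs are not allowed, so return blank
			$value = '';
		}
	}

	// FILTER_SANITIZE_URL strips characters that are not allowed in URLs (e.g. spaces)
	$value = filter_var($value, FILTER_SANITIZE_URL);
	return $value ? $value : '';
}

echo sanitizeUrlSketch('i am url'), "\n";           // iamurl
echo sanitizeUrlSketch('i am url', false), "\n";    // (blank)
echo sanitizeUrlSketch('www.url.com', false), "\n"; // http://www.url.com
echo sanitizeUrlSketch('/about/'), "\n";            // /about/
```

Note that anything containing a dot gets a protocol added, so a value like "sitemap.xml" comes back as "http://sitemap.xml".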

Ryan, please check this code and make any needed corrections. Also, what would be the best way to contribute code? Through GitHub? (That's why I'm posting it here for now.)

The only issue I see is that dots can appear in page names and filenames. That leaves the question of whether "company.com" or "sitemap.xml" is a domain name or a relative path/file. This is a problem in the existing url() function too, and I'm just not sure how to solve it. I think I'll err on the side of assuming a domain name unless the path starts with a ".", as in "./sitemap.xml" or "../../sitemap.xml". I like your addition of the allowRelative option.
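
As a rough sketch of that rule of thumb (one possible reading of it, not the final implementation), a value with a dot is assumed to be a domain unless it starts with a ".". The function name looksLikeDomain is hypothetical.

```php
<?php
// Hypothetical helper: a value containing a dot is assumed to be a domain name
// unless it begins with "." (as in "./sitemap.xml" or "../../sitemap.xml").
function looksLikeDomain(string $value): bool {
	// if there is a dot, the string is non-empty, so $value[0] is safe to read
	return strpos($value, '.') !== false && $value[0] !== '.';
}

var_dump(looksLikeDomain('company.com'));       // bool(true)
var_dump(looksLikeDomain('sitemap.xml'));       // bool(true), still assumed to be a domain
var_dump(looksLikeDomain('./sitemap.xml'));     // bool(false)
var_dump(looksLikeDomain('../../sitemap.xml')); // bool(false)
var_dump(looksLikeDomain('/about/'));           // bool(false), no dot at all
```

The ambiguity does not go away: "sitemap.xml" without a leading "./" is still treated as a domain under this rule.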

GitHub is great, or forum and/or email is fine too. Whatever you prefer.

Thanks,

Ryan

I've got to do more testing, but here's the solution I came up with that I think accomplishes what you want. I added an extra path() function to the Sanitizer class, to handle the relative URLs. Also, the class file is attached (in a ZIP) if you want to try it.

<?php
/**
* Return the given path if valid, or blank if not. 
*
* Path is validated per ProcessWire "name" convention of ascii only [-_./a-z0-9]
* As a result, this function is primarily useful for validating ProcessWire paths,
* and won't always work with paths outside ProcessWire. 
*
* @param string $value Path 
*
*/
public function path($value) {
if(!preg_match('{^[-_./a-z0-9]+$}iD', $value)) return '';
if(strpos($value, '/./') !== false || strpos($value, '//') !== false) $value = '';
return $value;
}

/**
* Returns a valid URL, or blank if it can't be made valid 
*
* Performs some basic sanitization like adding a protocol to the front if it's missing, but leaves alone local/relative URLs. 
*
* URL is not required to conform to ProcessWire conventions unless a relative path is given.
*
* Please note that URLs should always be entity encoded in your output. <script> is technically allowed in a valid URL, so 
* your output should always entity encode any URLs that came from user input. 
*
* @param string $value URL
* @param bool $allowRelative Whether to allow relative URLs
* @return string
* @todo add TLD validation
*
*/
public function url($value, $allowRelative = true) {

if(!strlen($value)) return '';

// this filter_var sanitizer just removes invalid characters that don't appear in domains or paths
$value = filter_var($value, FILTER_SANITIZE_URL);

if(!strpos($value, ".") && $allowRelative) {
	// if there's no dot (or it's in position 0) and relative paths are allowed, 
	// we can safely assume this is a relative path.
	// relative paths must follow ProcessWire convention of ascii-only, 
	// so they are passed through the $sanitizer->path() function.
	return $this->path($value); 
}

if(!strpos($value, '://')) {
	// URL is missing protocol, or is local/relative

	if($allowRelative) {
		// determine if this is a domain name 
		// regex legend:       (www.)?      company.         com       ( .uk or / or : or # or end)
		if(preg_match('{^([^\s_.]+\.)?[^-_\s.][^\s_.]+\.([a-z]{2,6})([./:#]|$)}i', $value, $matches)) {
			// most likely a domain name
			// $tld = $matches[2]; // TODO add TLD validation to confirm it's a domain name
			$value = filter_var("http://$value", FILTER_VALIDATE_URL); 

		} else {
			// most likely a relative path
			$value = $this->path($value); 
		}

	} else {
		// relative urls aren't allowed, so add the protocol and validate
		$value = filter_var("http://$value", FILTER_VALIDATE_URL);
	}
}

return $value ? $value : '';
}
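
For anyone who wants to poke at the two checks outside the class, here is a small standalone sketch. The function names pathSketch and domainTld are hypothetical, and the domain pattern is the same one as above but written with "~" delimiters instead of braces.

```php
<?php
// Hypothetical standalone copy of the path() check: ascii [-_./a-z0-9] only,
// and no "/./" or "//" segments.
function pathSketch(string $value): string {
	if(!preg_match('{^[-_./a-z0-9]+$}iD', $value)) return '';
	if(strpos($value, '/./') !== false || strpos($value, '//') !== false) return '';
	return $value;
}

// Hypothetical helper: returns the captured TLD if the value looks like a
// domain name according to the regex used in url(), or null if it does not.
function domainTld(string $value): ?string {
	$regex = '~^([^\s_.]+\.)?[^-_\s.][^\s_.]+\.([a-z]{2,6})([./:#]|$)~i';
	return preg_match($regex, $value, $matches) ? $matches[2] : null;
}

var_dump(pathSketch('/about/contact/'));      // string(15) "/about/contact/"
var_dump(pathSketch('/about us/'));           // string(0) "", space is not allowed
var_dump(pathSketch('/about//contact/'));     // string(0) "", double slash

var_dump(domainTld('www.company.com/about')); // string(3) "com"
var_dump(domainTld('company.com'));           // string(3) "com"
var_dump(domainTld('sitemap.xml'));           // string(3) "xml", hence the TLD validation todo
var_dump(domainTld('./sitemap.xml'));         // NULL
```

The "sitemap.xml" case shows why the @todo for TLD validation matters: the pattern alone cannot tell a file extension from a TLD.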

Let me know if you think anything is missing here. I tried to duplicate what you added, and also account for the relative paths vs. domain issue.

Thanks,

Ryan

Sanitizer-php.zip
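
On the entity encoding note in the url() docblock, a minimal sketch of what that looks like in output code, assuming plain PHP and htmlspecialchars() (the sample URL is made up):

```php
<?php
// A URL from user input that contains markup. The characters < and > are among
// those FILTER_SANITIZE_URL leaves in place, so sanitizing does not remove them.
$url = 'http://example.com/?q=<script>alert(1)</script>';
$clean = filter_var($url, FILTER_SANITIZE_URL);

// Entity encode at output time so the browser never sees a real <script> tag.
$safe = htmlspecialchars($clean, ENT_QUOTES, 'UTF-8');

echo $clean, "\n"; // http://example.com/?q=<script>alert(1)</script>
echo $safe, "\n";  // http://example.com/?q=&lt;script&gt;alert(1)&lt;/script&gt;
```

The sanitizer only decides what is stored or passed along. Encoding still has to happen wherever the URL is written into HTML.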
