Find returning strange pages that do not exist

BillH · March 11, 2018

I've implemented a straightforward site search, but among the correct results it is returning pages that don't exist with strange URLs.

If, for instance, the search is for "quercus" (it's a site about trees), some valid results are returned, e.g.:

[domain]/publications/general-articles/quercus-tungmaiensis/

But I also get pages that do not exist with URLs I can't explain, e.g.:

[domain]/site/en/tree-info/tree-info/-404--i-quercus-tungmaiensis-i/

[domain]/site/en/tree-info/-407--i--quercus-rubra-i/

[domain]/site/en/tree-info/tree-info/-407--i-quercus-rubra-i/

Note that there are fields in the system that contain italicised versions of the page title (e.g. "Quercus rubra"), and the tags may have got into the page title and name before the data was cleaned up. At one level this explains the "-i-" in the invalid URLs, but it doesn't get me much further!

The search is based closely on that in the PW default site, and the relevant code is as follows:

$q = $sanitizer->text($input->get->q);

if($q) {
    
    // Set up the search term
    $input->whitelist('q', $q);
    $q = $sanitizer->selectorValue($q);
    
    // Build the selector
    $selector = "title|main_text|item_description~=$q, has_parent!=2, limit=50";

    // Find the pages
    $matches = $pages->find($selector);

    if($matches->count) {
        // ...
        // Render the results
        // ...
    }
}

I may well be missing something obvious, but at the moment I'm completely puzzled.

Anyone know what's happening?

BitPoet · March 11, 2018

Are you sure that these pages really don't exist? Perhaps they are just PageTable entries that live under tree-info.

Not directly related (as it might cure the symptom but not the issue): for a site-wide search, it's often a good idea to limit the templates targeted in the search.
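A minimal sketch of that suggestion, reusing the selector from the first post. The helper name buildSearchSelector and the template names tree-profile and article are hypothetical, and $q is assumed to have already been passed through $sanitizer->selectorValue():

```php
<?php
// Sketch only: builds the search selector from the first post, optionally
// restricted to a list of templates. Template names are hypothetical.
function buildSearchSelector(string $q, array $templates = []): string
{
    $parts = ["title|main_text|item_description~=$q"];
    if ($templates) {
        // ProcessWire selectors accept alternative values separated by "|"
        $parts[] = 'template=' . implode('|', $templates);
    }
    $parts[] = 'has_parent!=2';
    $parts[] = 'limit=50';
    return implode(', ', $parts);
}

echo buildSearchSelector('quercus', ['tree-profile', 'article']), "\n";
// title|main_text|item_description~=quercus, template=tree-profile|article, has_parent!=2, limit=50
```

The resulting string can be passed to $pages->find() exactly as in the original code.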

BillH · March 11, 2018

Thanks @BitPoet, you put me on the right track - though it's taken me a while to get there.

I can now work round the problem, but it has revealed another mystery, and I'd be happier if I understood what's happening!

It turns out the pages do all exist, but in some circumstances the wrong URL is being returned.

So the code in my OP is not actually the relevant bit. It's the rendering of the results that caused the issue:

if($matches->count) {
    $content = "<h2>Found {$matches->count} pages matching your query:</h2>";
    $content .= "<ul>";
    foreach($matches as $match) {
        $content .= "<li>";

        // THE TROUBLE OCCURS HERE when $itemTitle is set to title_formatted
        $itemTitle = $match->title_formatted ? $match->title_formatted : $match->title;

        $content .= "<a href='{$match->url}'>{$itemTitle}</a>";
        $content .= "</li>";
    }
    $content .= "</ul>";
} else {
    $content = "<h2>Sorry, no results were found.</h2>";
}

(The tags should be removed from title_formatted to give a tidy listing, but that isn't the issue here.)

When $itemTitle is set to title_formatted, the resulting HTML is like this (I should have noticed this before!):

<li><a href="/publications/tree-profiles/quercus-rubra/"></a><a href="/site/en/tree-info/tree-info/-407--i-quercus-rubra-i/"> <i>Quercus rubra</i> </a></li>

I can work round this by sanitizing the text. For example, the following gets rid of the problem:

$itemTitle = $match->title_formatted ? $sanitizer->text($match->title_formatted) : $match->title;

And I'm sure I can easily come up with a method that keeps the tags. So in terms of developing the site, problem solved.
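One possible way to keep the italics, sketched with PHP's built-in strip_tags() and an allow-list. The function name tidyItemTitle is hypothetical, and the sample value is the anchor markup from the rendered HTML above:

```php
<?php
// Sketch only: keep <i> and <em>, drop every other tag (including any <a>).
function tidyItemTitle(string $html): string
{
    return trim(strip_tags($html, '<i><em>'));
}

$raw = '<a href="/site/en/tree-info/-407--i-quercus-rubra-i/"> <i>Quercus rubra</i> </a>';
echo tidyItemTitle($raw), "\n"; // <i>Quercus rubra</i>
```

In the loop this would replace the $sanitizer->text() call. Note that strip_tags() leaves any attributes on the allowed tags untouched, so it tidies markup but is not a full HTML sanitizer.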

However, I have checked the title_formatted field, both using CK Editor's View Source and the browser's Inspect tool, and there is no sign of that extra, incorrect URL or an <a> tag.

So I'm still really puzzled about where it's coming from!

BitPoet · March 11, 2018

My first suspicion would be a Textformatter that might be active on title_formatted and add the a tag.

BillH · March 11, 2018

A good thought about text formatters @BitPoet, which I followed up, but they turn out not to be the cause of the problem.

However, I searched the raw data for the field in MySQL, and there, for some of the records in title_formatted, was the troublesome data. It is old tags from the previous version of the site from which the data was imported.

For example, viewed through phpMyAdmin, the title_formatted field for a record contains:

<a href="/site/en/tree-info/-407--i-quercus-rubra-i/"> <i>Quercus rubra</i> </a>

However, in PW clicking on the Source button for title_formatted gives:

<p><em>Quercus rubra</em></p>

And more-or-less the same with the browser's Inspect Element tool.

So, PW templates are getting the data as it is stored in the database field. But CK Editor is rendering it differently in the PW back end.

The data was imported directly by script, and thus was not entered through CK Editor. When the page is opened in the PW back end and saved, the data is stored in the database as it is rendered by CK Editor (so in my particular case the extra <a> tags are removed). However, using $page->save() from a script does not trigger the CK Editor behaviour (perhaps there's a way that it could).

I'm wondering if it's worth re-posting this information under another title, as it really has nothing to do with finding pages and might be useful for someone.

kongondo · March 11, 2018

15 minutes ago, BillH said:

I'm wondering if it's worth re-posting this information under another title, as it really has nothing to do with finding pages and might be useful for someone.

Maybe no need since others encountering the problem will probably hit the 'wrong results returned' issue first (like you did). It's your topic though, so, if you wish, just edit the Title of the thread by editing your first post.

Robin S · March 11, 2018

2 hours ago, BillH said:

The data was imported directly by script, and thus was not entered through CK Editor. When the page is opened in the PW back end and saved, the data is stored in the database as it is rendered by CK Editor (so in my particular case the extra <a> tags are removed). However, using $page->save() from a script does not trigger the CK Editor behaviour (perhaps there's a way that it could).

The changes/validation done by CKEditor happen via Javascript when the field is loaded in Page Edit. You won't be able to automate that with the API.

But you could loop over the pages using the API and process the field using a PHP DOM parser (e.g. Simple HTML DOM), removing the <a>, adding the <p>, and converting <i> to <em>.
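A rough sketch of that clean-up. It uses PHP's bundled DOM extension (DOMDocument) rather than Simple HTML DOM, the function name cleanTitleMarkup is hypothetical, and the ProcessWire loop in the trailing comment is an untested outline that assumes the field is called title_formatted as in this thread:

```php
<?php
// Sketch only: unwrap <a>, turn <i> into <em>, and wrap the result in <p>.
function cleanTitleMarkup(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div>' . $html . '</div></body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrap = $doc->getElementsByTagName('div')->item(0);

    // Unwrap every <a>: lift its children out, then remove the empty element.
    foreach (iterator_to_array($wrap->getElementsByTagName('a')) as $a) {
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Replace every <i> with an <em> holding the same children.
    foreach (iterator_to_array($wrap->getElementsByTagName('i')) as $i) {
        $em = $doc->createElement('em');
        while ($i->firstChild) {
            $em->appendChild($i->firstChild);
        }
        $i->parentNode->replaceChild($em, $i);
    }

    // Serialise the wrapper's children back to an HTML fragment.
    $inner = '';
    foreach ($wrap->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $inner = trim($inner);

    // Add the paragraph wrapper unless one is already present.
    return preg_match('/^<p[\s>]/i', $inner) ? $inner : "<p>$inner</p>";
}

// Hypothetical use in a ProcessWire script (not run here):
// foreach ($pages->find("title_formatted!=''") as $p) {
//     $p->of(false); // work with the unformatted value
//     $p->title_formatted = cleanTitleMarkup($p->title_formatted);
//     $p->save('title_formatted');
// }
```

On a large site the find() call would probably need a limit or batching, and it would be wise to try the function on a copy of the data first.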

Incidentally, I often need to work with pages that have scientific species names as their titles - the way I deal with it is through markdown formatting in the main title field rather than using a separate formatted field. And often it's easier for editors if you use a sort of "reverse" markdown, where you apply the markdown syntax around words that are not to be italicised, as those are fewer in number. Then you do the italic/normal styling with CSS.
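To illustrate the "reverse" markdown idea, here is a small stand-in written in plain PHP rather than a real markdown Textformatter. The function name renderSpeciesTitle, the species class and the sample title are all made up for the example:

```php
<?php
// Sketch only: asterisks mark the words that should NOT be italic.
function renderSpeciesTitle(string $title): string
{
    // Escape first so that only the markup added below is live HTML.
    $safe = htmlspecialchars($title, ENT_NOQUOTES, 'UTF-8');
    // *text* becomes <em>text</em>
    return preg_replace('/\*([^*]+)\*/', '<em>$1</em>', $safe);
}

echo '<h1 class="species">', renderSpeciesTitle('Quercus ilex *subsp.* rotundifolia'), "</h1>\n";
// <h1 class="species">Quercus ilex <em>subsp.</em> rotundifolia</h1>
```

The CSS then flips the usual meaning: a rule such as .species { font-style: italic; } makes the whole title italic, and .species em { font-style: normal; } sets the marked words upright.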

BillH · March 12, 2018

A good point @kongondo. There should be enough relevant words in the posts to bring it up in a search.

And thanks @Robin S. I suspected there wasn't a way to automate CK Editor changes from the API, but there might have been something I hadn't cottoned on to. Using a PHP DOM parser is a good suggestion - though in this case there were only a couple of dozen instances, so I fixed them the "hard" way, editing by hand in phpMyAdmin!

Also, thanks for the good advice on formatting species names in titles. I particularly like the reverse-markdown idea, and I've already thought of somewhere I might use it. There's another solution I'm using on another more-complex site, where I'm developing a module that allows users to enter italicized scientific names in a rich text field (the italicization is quite complex because it's botanical names with hybrids and cultivars) and hooks into page saving to keep a plain-text title and the page name up to date. But I have to be a bit careful with the module not doing unexpected things (though it's becoming quite robust), so it's not really worth installing for the simpler case at hand.
