Sort by relatedness (keywords in common)


joe_g

Hi there,

Is there a way I can sort by "number of things in common"? For example, I have "projects" with "tags". I'd like to show the 20 projects that are "most related": first the projects with 3 tags in common, then those with 2 tags in common, etc.

Perhaps something like:
$pages->find("template=project,limit=20,tags={$tags},sort=in-common-[tags]")

Is there a way I could do this in a scalable manner (as in, avoid getting all the results and looping through them)? Custom SQL?
 

thanks!


An interesting problem, but I'm fairly sure there's no solution using PW selectors. As far as I can see you'd need to compare the tags for each project to those of every other project, keep track of the results, and then sort by those results.

You could get the data you need into an array quickly by using something like:

$tagsForAllProjects = $pages->findRaw("template=project", "tags");

However, as you've probably realised, if you have many projects, it could take a long time to work through making the comparisons. Even though you only need to compare each pair of array items once (perhaps using array_intersect()), if I've got this right (which I might not have done!) it'd take n(n-1)/2 operations – so for 100 projects about 5,000 and for 1,000 the best part of half a million.
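To make that concrete, here is a rough sketch of the brute-force comparison. The sample array is made up, and its shape (project ID => list of tag IDs) is only an assumption about what the findRaw() call above would give you; pairCount() and sharedTagCount() are hypothetical helper names.

```php
<?php
// Number of unordered pairs among $n items: n(n-1)/2.
// e.g. pairCount(1000) gives 499500.
function pairCount(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

// Number of tags two projects have in common.
function sharedTagCount(array $tagsA, array $tagsB): int
{
    return count(array_intersect($tagsA, $tagsB));
}

// Hypothetical sample data: project ID => tag IDs.
$tagsForAllProjects = [
    1 => [10, 11, 12],
    2 => [10, 11],
    3 => [12, 13],
];

// Compare every pair exactly once and record the overlap.
$ids = array_keys($tagsForAllProjects);
$shared = [];
for ($i = 0; $i < count($ids); $i++) {
    for ($j = $i + 1; $j < count($ids); $j++) {
        $key = $ids[$i] . '-' . $ids[$j];
        $shared[$key] = sharedTagCount(
            $tagsForAllProjects[$ids[$i]],
            $tagsForAllProjects[$ids[$j]]
        );
    }
}
```

This is fine for a handful of projects, but the inner loop is exactly the part that grows with the square of the project count.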

But perhaps someone knows a better way.


Thanks both. RockFinder looks great, didn't know about that module before. But I wonder if it can do what I need. I suppose both findRaw and RockFinder would need to first load all pages that have 1 or more keywords in common - in my case that can be a lot of pages since some keywords are very generic. Let's say I have 20k projects and half of them might have the keyword "collaboration". What I need is the projects with the most keywords in common, and only the first X hits.

I suppose the only way that can be done is with SQL, since the data is purely relative (unlike how RockFinder works, there is no dataset to "start" with - I can't start with 10k pages). Now, how to solve it haha - that's going to be a bit challenging.


I could be wrong, but I don't think findRaw(), RockFinder or SQL have any way of returning a result that depends on how many items two records have in common. And anyway, to make comparisons between every pair among 20k projects, you'd be looking at about 200 million comparisons – which is going to take a while!

So, it seems to me that either: you need to think of an algorithm that doesn't involve comparing each project with every other; or you'll need to use a different metric of similarity between projects – or even rank by something other than similarity.

Considering the algorithm, I'm wondering if there's an approach that would run in a time dependent on projects * tags (rather than projects * projects), though I don't know if this is possible - and I suspect it'd be really difficult to come up with something.
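One possible shape for such an approach, as a sketch only: build an inverted index (tag => projects using it) in one pass over all projects, then for a single project walk only the lists for its own tags and count how often each other project turns up. The function names and the sample data are made up for illustration, and note that a very generic tag still means a long list to walk.

```php
<?php
// One pass over all projects: tag => list of project IDs using it.
function buildTagIndex(array $tagsByProject): array
{
    $index = [];
    foreach ($tagsByProject as $projectId => $tags) {
        foreach ($tags as $tag) {
            $index[$tag][] = $projectId;
        }
    }
    return $index;
}

// For one project, count shared tags per other project by walking
// only the index entries for that project's own tags.
function mostRelated(int $projectId, array $tagsByProject, array $index, int $limit = 20): array
{
    $counts = [];
    foreach ($tagsByProject[$projectId] ?? [] as $tag) {
        foreach ($index[$tag] ?? [] as $otherId) {
            if ($otherId === $projectId) {
                continue; // skip the project itself
            }
            $counts[$otherId] = ($counts[$otherId] ?? 0) + 1;
        }
    }
    arsort($counts); // most shared tags first
    return array_slice($counts, 0, $limit, true); // keep project IDs as keys
}

// Hypothetical sample data: project ID => tags.
$tagsByProject = [
    1 => ['art', 'collaboration', 'music'],
    2 => ['art', 'collaboration'],
    3 => ['music', 'film'],
    4 => ['sport'],
];
$index = buildTagIndex($tagsByProject);
$related = mostRelated(1, $tagsByProject, $index);
```

Building the index costs roughly projects * tags, and each lookup costs the combined length of the lists for that project's tags, rather than a comparison against every other project.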

Using a different way of ranking the projects might be much easier, and perhaps equally meaningful - indeed, now that I think about it, I wonder how useful ranking by degree of similarity would be. An alternative, just for example, might be something like ranking the tags by popularity (number of projects they appear in), then give each project a score depending on the ranks of its tags (the mean, or median, or sum of the top three, or something like that).

 

 


13 hours ago, BillH said:

I wonder how useful ranking by degree of similarity would be. An alternative, just for example, might be something like ranking the tags by popularity (number of projects they appear in), then give each project a score depending on the ranks of its tags

Popularity has its uses, and I'm doing that as well for a different visualization. But in this project I'm mainly looking for similarity. "Obscure" similarity is more important than "popular" similarity. Let's say a project has 1000 projects that share some popular tags, but there are only 2 that also share the same (obscure) tag on top of the popular ones: those 2 are the most important to show. If I prioritize popularity those 2 projects will be buried somewhere.

This probably needs to be some custom SQL, just not sure how to write it.
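For reference, a self-join with GROUP BY is one way such a query could look. This is an untested-against-ProcessWire sketch: it assumes "tags" is a Page reference field whose values live in a table called field_tags with pages_id and data columns, and it uses an in-memory SQLite table with made-up rows as a stand-in so the snippet runs on its own. Inside ProcessWire the same SQL could presumably be run against the real table via $database->prepare().

```php
<?php
// Stand-in for the assumed ProcessWire field table (pages_id, data).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE field_tags (pages_id INTEGER, data INTEGER)');

// Hypothetical rows: [project page ID, tag page ID].
$rows = [[1, 10], [1, 11], [1, 12], [2, 10], [2, 11], [3, 12], [3, 13], [4, 14]];
$insert = $pdo->prepare('INSERT INTO field_tags (pages_id, data) VALUES (?, ?)');
foreach ($rows as $row) {
    $insert->execute($row);
}

// Join the current project's tag rows to every other project's rows
// on the tag value, then count the matches per other project.
$sql = 'SELECT other.pages_id AS id, COUNT(*) AS shared
        FROM field_tags AS mine
        JOIN field_tags AS other
          ON other.data = mine.data AND other.pages_id <> mine.pages_id
        WHERE mine.pages_id = :id
        GROUP BY other.pages_id
        ORDER BY shared DESC, other.pages_id ASC
        LIMIT 20';

$query = $pdo->prepare($sql);
$query->execute([':id' => 1]);
$related = $query->fetchAll(PDO::FETCH_ASSOC);
```

The sketch does not restrict results to template=project or to published pages; that would need an extra join to the pages table.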


Something that would be worth a try...

Use a saveReady hook to put all the tags as space-separated values into a (hidden) textarea field. Then use the **= operator with the tags string of the current project.

From the docs:

Quote

Any given words match against compared value. Matches whole words. Uses “fulltext” index. Available in ProcessWire 3.0.160 or newer. This uses something more like the standard fulltext MATCH/AGAINST logic included with MySQL than most of the other operators. For those that want this more traditional search logic, this operator provides it. It behaves in an OR fashion with the words, but ranks results according to how many of the requested words appear.
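A minimal sketch of how the hook and the selector might fit together. The field name tags_index and the three function names are made up for illustration, and it assumes "tags" is a Page reference field whose pages have a title. The ProcessWire objects are passed in as parameters so nothing here runs until you call it from site/ready.php or a template file.

```php
<?php
// Join tag titles into one space-separated string, skipping empty ones.
function tagsToIndexString(array $tagTitles): string
{
    return implode(' ', array_filter(array_map('trim', $tagTitles), 'strlen'));
}

// Call from site/ready.php, e.g. registerTagsIndexHook($wire);
function registerTagsIndexHook($wire): void
{
    $wire->addHookAfter('Pages::saveReady', function ($event) {
        $page = $event->arguments(0);
        if ($page->template != 'project') {
            return;
        }
        // Assumption: "tags_index" is a hidden textarea field on the project template.
        $page->tags_index = tagsToIndexString($page->tags->explode('title'));
    });
}

// Call from a project template file, e.g. findRelatedProjects($page, $pages, $sanitizer);
function findRelatedProjects($page, $pages, $sanitizer, int $limit = 20)
{
    $value = $sanitizer->selectorValue($page->tags_index);
    // No sort is given, so the ordering is left to the **= operator's ranking.
    return $pages->find("template=project, id!={$page->id}, tags_index**=$value, limit=$limit");
}
```

Existing projects would need to be re-saved once so the hidden field gets populated. It may also be worth checking how very short tags behave, since MySQL fulltext indexes have a minimum word length setting.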

 


27 minutes ago, Robin S said:

Use a saveReady hook to put all the tags as space-separated values into a (hidden) textarea field. Then use the **= operator with the tags string of the current project.

Oh wow, if this works the way I think it does, this could really be it. I can combine **= with limit=20, for example, and only get the most related. Thanks!

PS. I've been meaning to learn more about selectors, they've really expanded over the last couple of years.


It occurs to me that with SQL you could write a user-defined function for comparing sets of tags, but this still might run into problems with execution time.

I suspect @Robin S's idea is a much better approach!


  • 2 months later...

@Robin S I can confirm that this works really well. It's a relief, honestly - I expected to have to dive deep into SQL for this.

One thing that doesn't seem to work is matching phrases in quotation marks: "Alice One" and "Alice Two" will both be considered related because of the shared first name "Alice", with or without quotes. But that's fine in my case.


Just off the top of my head, you might be able to get round that by having a hidden field automatically populated with a kind of concatenated version of any multi-word values - maybe Alice_One and Alice_Two. That way it might be possible to treat them as a single word (kind of).
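Something along these lines, as a rough sketch (the function names are made up). As far as I know MySQL's default fulltext parser treats underscores as word characters, but whether the joined values survive the selector and get indexed as one word is an assumption that would need testing.

```php
<?php
// Turn a multi-word tag into a single token: "Alice One" => "Alice_One".
function tagToToken(string $tag): string
{
    return preg_replace('/\s+/', '_', trim($tag));
}

// Build the space-separated value for the hidden field.
function tagsToTokenString(array $tags): string
{
    return implode(' ', array_map('tagToToken', $tags));
}

// Hypothetical example tags.
$value = tagsToTokenString(['Alice One', 'Alice Two', 'collaboration']);
```

The same conversion would have to be applied both when populating the hidden field and when building the search string, so the two sides match.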
