Jump to content

Graph databases and the future of processwire


dfunk006
 Share

Recommended Posts

Hi,

Processwire makes efficient use of mysql - every field is a table and field-specific tables are joined on demand. However, joins are used nonetheless, especially in performing queries for $pages->find("selector") and related db-querying selector functions.

So queries for $pages->find(field1=value1, field2=value2, field3=value3) will join field_field1, field_field2 and field_field3 tables. In this case, does it make sense to use graph databases (such as neo4j) compared to mysql?

With data being connected now more than ever and with all the hype around graph databases and neo4j, will we see a potential support for graph databases in future releases of processwire?

Link to comment
Share on other sites

There isn't even support for things like sqlite oder postgree, even though there where requests for it, so I'll doubt others will be adopted in a near future. One has to keep in mind, ProcessWire is developed by Ryan only. He really has to put his work into the meat and by now it really does not make sense for a cms to support an other database than mysql if there isn't the time to support multiple databases.

  • Like 3
Link to comment
Share on other sites

At this point it's a bit like asking to drop php in favor of go. PW is a php/(my)sql framework by nature, and that's what it will continue to be. I would love to see other projects assumidely inspired by pw but built in other systems popping up.

  • Like 4
Link to comment
Share on other sites

At this point it's a bit like asking to drop php in favor of go. PW is a php/(my)sql framework by nature, and that's what it will continue to be. I would love to see other projects assumidely inspired by pw but built in other systems popping up, though.

I think the Analogy is wrong, Go is a concurrent language which is nothing compared to PHP, however I don't see the need for SQLite but that's my opinion, so prob a database wrapper to allow others write their own drivers or their various databases. something in the line of PDO

Link to comment
Share on other sites

At this point it's a bit like asking to drop php in favor of go.

That's an interesting analogy, although it's not quite the same. Graph databases can co-exist with mysql (a lot of people are using it that way), so its only an extension (maybe a core module) rather than a complete re-work.

I'm not sure if anyone has seen neo4j's model. It's exactly like processwire! (http://neo4j.com/developer/guide-importing-data-and-etl/)

Check out the part where it talks about the Graph Model. It says:

  • A row is a node
  • A table name is a label name

Now this sounds familiar, doesn't it? In processwire's model:

  • A row is a Page
  • A table name is a field name

Whoa! Looks like Ryan built a graph model in a relational database - absolutely brilliant!

And that's not it, you can use Page fields to create connections between pages (much like creating connections between nodes to form a graph). 

Also, it's interesting to see the php implementation of neo4j, which also, coincidently, is very similar to processwire's API calls (https://github.com/neo4j-contrib/developer-resources/blob/gh-pages/language-guides/php/neo4jphp/index.php).

$app->get('/graph', function (Request $request) use ($neo4j) {
	$limit = (integer)$request->get('limit', 50);
	$queryTemplate = <<<QUERY
MATCH (m:Movie)<-[:ACTED_IN]-(a:Person)
 RETURN m.title as movie, collect(a.name) as cast
 LIMIT {limit}
QUERY;
	$cypher = new Query($neo4j, $queryTemplate, array('limit'=>$limit));
	$results = $cypher->getResultSet();
	$actors = [];
	$nodes = [];
	$rels = [];
	foreach ($results as $result) {
		$target = count($nodes);
		$nodes[] = array('title' => $result['movie'], 'label' => 'movie');
		foreach ($result['cast'] as $name) {
			if (!isset($actors[$name])) {
				$actors[$name] = count($nodes);
				$nodes[] = array('title' => $name, 'label' => 'actor');
			}
			$rels[] = array('source' => $actors[$name], 'target' => $target);
		}
	}
	return json_encode(array(
		'nodes' => $nodes,
		'links' => $rels,
	));
}); 

If you notice, there's a cypher query which stores the results in a result array. Now, if we were to query a graph database and store the results in a PageArray, everything from there could be done processwire way (with access to processwire APIs).

Graph Database seems like a natural fit to me for PW. Whether it's worth it or not is up for debate. Maybe there's performance improvements, maybe more flexibility or maybe just an alternative to mysql - i dunno. I pointed out a use case in my earlier post where graph db could be more useful, but I'm not sure - hence the discussion.

Thoughts?

  • Like 1
Link to comment
Share on other sites

The thing is: Ryan most likely won't rebuild something that's already working really great. I think there are really more important features on the roadmap. This should not hold somebody else of building a wrapper for other databases, I'm just talking about a core implementation. 

  • Like 1
Link to comment
Share on other sites

Yeah in terms of request this is not a big priority. alot of things in the Pipeline also out of curiosity is it possible to Contribute codes to Processwire, I think it'd be much easier and relieve Ryan a lot of stress. 

Link to comment
Share on other sites

Guys, that was not meant to be an analogy between php/(other languages) vs sql/(other DBs). What I meant is that PW is built to work with PHP and MySQL, and it does it very well. I don't think Ryan felt at any time that these two are limiting the functionality of PW, and as a big bonus, these are the easiest conditions to find in any hosting. IF there was an analogy, than that's the one I used.
 
 

Yeah in terms of request this is not a big priority.

 
Exactly! I think we all agree with this.

 

alot of things in the Pipeline also out of curiosity is it possible to Contribute codes to Processwire, I think it'd be much easier and relieve Ryan a lot of stress.

At this point, Ryan filters and rewrites any piece of code that enters the core. Until now he's proved that this is the best way of keeping the reliability of the code https://processwire.com/talk/topic/9254-dev-and-master-versions/

  • Like 2
Link to comment
Share on other sites

The thing is: Ryan most likely won't rebuild something that's already working really great. I think there are really more important features on the roadmap. This should not hold somebody else of building a wrapper for other databases, I'm just talking about a core implementation. 

I wasn't asking for this to be built by Ryan. If I wanted to request this feature for the roadmap, I would have posted in the respective forum. I just wanted to discuss if there was any value to building something like this or not.

Link to comment
Share on other sites

The thing is: Ryan most likely won't rebuild something that's already working really great.

I vote for that. For expansions, we should concentrate on api, php, modules

and tuto's, not the core. At least that's what I hope for the future.

Link to comment
Share on other sites

AFAIK there are some efforts being made to be PDO compliant to support other database servers, but there are many challenges. PW is highly optimized for MySQL, so it would probably not be something you can expect in the near future. As for neo4j especially, unless there are PDO drivers in the works, don't put any hope on this.

Link to comment
Share on other sites

AFAIK there are some efforts being made to be PDO compliant to support other database servers, but there are many challenges. PW is highly optimized for MySQL, so it would probably not be something you can expect in the near future. As for neo4j especially, unless there are PDO drivers in the works, don't put any hope on this.

There are, in fact, PDO drivers available for neo4j - https://github.com/jakewins/neo4j-pdo

Link to comment
Share on other sites

  • 1 month later...

Interesting timing: I was just about to ask the same question the OP did a few weeks ago  :)

TL;DR Processwire can be looked at as a frontend to a network of pages with a semi-arbitrary selection of fields, so using a schema-less graph database instead of an RDBMS (if it can provide for all the other necessary features) would naturally mean less impedance mismatch to compensate for (that is, simpler code and easier implementation), it would allow for superior selectors (no need for intermediate PageArrays and foreach and such), and it would come with an increase in performance. Therefore, if Neo4j keeps gaining popularity and becomes more available as a choice, I believe it would be the right choice to support it as a store. In the long term. Not as a priority. Though I would love to see it tomorrow =)

 

PW is a php/(my)sql framework by nature, and that's what it will continue to be. I would love to see other projects assumidely inspired by pw but built in other systems popping up.

I understand my opinion about what's natural for PW is slightly less important than somebody else's with thousands of posts here on the forum =)
Still, let me try to challenge it.
 
I always considered PW's main strength its API and built-in admin interface, not the underlying data store. MySQL, of course, is an obvious choice considering its availability, and that one can force almost anything into an RDBMS with enough magic. However...
 
Looking at how PW uses MySQL, I wouldn't call it a SQL framework by nature. I understand the analogy was more about the ubiquity of MySQL than about a supposed inherent superiority, but I think it's interesting to have a look at why SQL is not the best fit. PW uses MySQL perfectly to do the job, but it does it in ways that go against traditional RDBMS principles; I would say it's a good sign that it's not an RDBMS problem by nature. Examples:

  • templates don't have their own tables in which they would store fields that have a 1-to-1 relationship to the PK,
  • instead, fields are always stored in key/value stores (that happen to be implemented in a certain flavor of SQL),
  • much of the querying is done using textual search in automatically populated columns that describe structural relationships, instead of key lookups,
  • it's fundamentally hierarchical, and page fields extend this structure to a full directed graph.

The above points summarized: Processwire is a frontend to a graph database implemented in MySQL.

MySQL is a good choice because it is available and usable, not because it (or any other kind of RDBMS) is naturally the best choice for PW.
 

[...] PW is highly optimized for MySQL, [...]

The reason I was thinking about Neo4j is exactly because there is no need for any such optimization: Neo4j is a natural fit for the PW use case "as is," so why not make one of those "other projects [...] inspired by pw" a flavor of PW itself?

  • PW selectors are trivial to implement in Neo4j because filters, as probably all other PW concepts, correspond to native Neo4j concepts and Cypher ("Neo4j SQL") statements. 

    Let's assume a "student" has "student_in_course" descendants, each of which store things like grades and the like, and also use a page field to refer to the actual "course" record (page) itself:

    • let's find the actual course pages for a student:
      • $courses = new PageArray;
        foreach($student
        ->find('template=student_in_course') as $student_in_course)
          $courses->add($student_in_course->course);
      • Cypher query: MATCH (:Student{id:$student_id})-[:IS_IN_COURSE]->(x:Course) RETURN x;
        • Note that Neo4j connections, while otherwise analogous to page fields and parent/child relationships in PW, can also have properties, just as nodes can. So, the "IS_IN_COURSE" connection itself can (unlike a page field in PW could) store the "percent", "grade", "passed" properties we need. Basically, Neo4j connections are something like the Table module from ProFields and the PageFields in one. When you use it as a "pure" page field, you don't add properties. When you use it as a Table row, as you MUST have a destination for an edge, you'd direct it to a (let's say) special NULL page to signal the destination is irrelevant; but this is just an implementation detail for our not-yet-existent Neo4j port =P
    • let's find other students who are in any of the same courses that our student is in:
      • $other_students_in_the_same_courses = new PageArray;
        foreach($student->find('template=student_in_course') as $student_in_course)
          foreach($pages->find("template=student_course,course={$student_in_course->course}") as $other_student_in_course)
           $other_students_in_the_same_courses->add($other_student_in_course->closest('template=student'));
      • Cypher query:
        MATCH (:Student{id:$student_id})-[:IS_IN_COURSE]->(c:Course),
                             (x:Student)-[:IS_IN_COURSE]->( c )
        RETURN x;
      • more concisely: MATCH (:Student{id:$student_id})-[:IS_IN_COURSE]->(:Course)<-[:IS_IN_COURSE]-(x:Student) RETURN x;
    • other students who took courses by any of the professors that we did (forgive me for not writing the php):
      • MATCH (:Student(id:$student_id})-[:IS_IN_COURSE]->(:Course)<-[:TEACHES]-(:Professor)-[:TEACHES]->(:Course)<-[:IS_IN_COURSE]-(x:Student) RETURN x;
      • if we know there are no other kinds of connections between students, courses, and professors, we can simply write:
        MATCH (:Student{id:$student_id})->(:Professor)<-(x:Student) RETURN x;
      • list just the names:
        MATCH (:Student{id:$student_id})->(p:Professor)<-(x:Student) RETURN p.name, x.name;
      • sorry i'm not fluent in Cypher so i didn't try, but i know it's just a few more words to eliminate duplicates and self from the results
  • parent/child relationships and page fields directly correspond to connections (edges) in Neo4j,
  • traversing/matching these connections is a native (extremely fast) operation, unlike the id and path based matching/joining that's necessary with an RDBMS,
  • it's schema-less, so there's no need to worry about how to store fields (e.g. in separate tables),
  • similarly no more need for an autojoin-like option to guess and explicitly fine-tune what we think would be frequently required, yet staying safe from extensive/expensive joins,
  • while otherwise schema-less, Neo4j can have uniqueness constraints, so things like unique page names within a parent can be easily ensured,
  • filtering against aggregates (e.g. "children.count") is possible just as with MySQL, except more flexible,
  • textual properties can be queried using full-text search, just as in MySQL.

I admit MySQL has these distinct advantages:

  • it's already done,
  • it's available anywhere and everywhere.

I love PW the way it is; it's ridiculously amazing compared to anything else I've seen before. The only reason I started wishing for a Neo4j-based version is because I'm using PW for something it was probably not designed in the mind with =) It's a database that utilizes page fields with custom PHP code way more than I would consider it healthy, and I got to where I need to run several queries to collect certain kinds of parents, then use intersections from some of those resulting arrays, then in turn collect items based no queries on the items from those, and so on. If PW had Neo4j under the hood, we would have access to selectors that would render custom PHP for page fields all but unnecessary, and my messed up query would be so very simple, too... All at the "expense" of a simpler database layer =P Therefore, please forgive me if I can't but dream about being able to use Neo4j's capabilities from PW selectors =)

  • Like 3
Link to comment
Share on other sites

I used neo4j for two projects in my studies. One of them was to visualize the GitHub network with User/Repository nodes and their connections over edges like "follows", "stars", "contributes", "owns" etc. I do like neo4j and the cypher query syntax and it was the right choice for this project.

However, for ProcessWire I don't see any advantage in using this database, simply because 99% of the apps written in Pw are fine with the tree. With page references you have endless power in modeling other relations than parent-child. For me, a graph database makes sense if you need queries like this:

  • Shortest path between two nodes
  • Get me all common nodes over at most 4 hubs
  • Fire a query from a known start node and return an (unknown) amount of target nodes of type X over edges Y
  • ...

If the underlying data structure is a graph and I must answer those type of questions, I would not choose ProcessWire to handle the job.

Another problem of neo4j is that it's not that fast, at least in my experience. And they recommend 16-32 GB RAM, this is simply too much for a "normal" web app :)

Cheers

  • Like 2
Link to comment
Share on other sites

If the underlying data structure is a graph and I must answer those type of questions, I would not choose ProcessWire to handle the job.

It hindsight, yes. The thing is, PW does almost everything my app needs, it breaks down only at one particularly messy task. I wish it wasn't so good, then I could've resisted the temptation to use it for this project =P

Link to comment
Share on other sites

 Share

×
×
  • Create New...