Everything posted by ryan

  1. @Hector Nguyen fgetcsv() does return an array, but it is an array representing and consuming the memory of just one line from a CSV file (i.e. the columns from 1 row). fgetcsv() is just a layer on top of the fgets() function, which reads one line at a time from a file, which is what makes it memory friendly. On the other hand, PHP functions like file() or file_get_contents() do read the entire file in memory, so they are not memory friendly, even if they are fast. @fedeb I think the best route to take for your groupID+start+end+sequence would be a custom Fieldtype. This would give you all of the benefits of having a repeater, and without any of the overhead. A custom Fieldtype may sound complicated, but it's not at all. I've developed a module that can very easily be adapted for this need. See FieldtypeEvents which was created as an example to build exactly this sort of thing from. If you are interested in that route and have any questions, I'm happy to walk you through it.
  2. @fedeb Glad that moving the $parent outside the loop helped there. The reason it helps is that after a $pages->save() there is an automatic $pages->uncacheAll(), so the auto-assigned parent from the template has to be re-loaded on every iteration. By keeping your own copy loaded and assigning it yourself, you are able to avoid that extra overhead in this case. Avoid getting repeaters involved. I wouldn't even experiment with it here. That will at minimum triple the number of pages (assuming every protein page could have a repeater). Repeaters would be just fine if you were working in the thousands-of-pages territory, but in the millions-of-pages territory, it's not going to be worth even attempting. Using a ProFields table field would be the best alternative if you needed it to be queryable data. If you didn't need it to be queryable data (groupID, start, end, sequence), I would leave them as they are, space-separated in a plain textarea field; they can easily be parsed out at runtime so you can access them as properties of the page. (If that suits your need, let me know and I'll get into how that can be done). When working at large scale, it's also always good to consider custom building a Fieldtype module for the purpose too (that's another topic, but we can get into it too). For your groupID, if the same groupID is referenced by multiple proteins, and there is more information about each "group" (other than just an ID) then I think it would make sense for it to be a Page reference field. What is the max number of groupID+start+end+sequence rows that a protein can have? If there is a natural limit and it's not large, then that would open up some new storage possibilities too. Another optimization you can make in your loop: $page->sort = $i; This prevents it from having to detect and auto-assign a sort value based on the quantity of children the parent page has. 
For the $page->name, if each page will have a unique "protein-name" then you might also consider using that rather than the ("protein" . $i), as it will be more reflective of the page than a generic index number.
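The runtime parsing of the space-separated rows mentioned above could look something like this. It's only a minimal sketch, assuming each row sits on its own line in the textarea as four space-separated values (groupID, start, end, sequence). The function name parseProteinRows() and that layout are assumptions for illustration, not something taken from the actual import data:

```php
<?php
/**
 * Parse rows of "groupID start end sequence" from a plain textarea value.
 * Hypothetical helper: the one-row-per-line, space-separated layout is assumed.
 *
 * @param string $text Raw textarea value
 * @return array<int, object> One object per valid row
 */
function parseProteinRows(string $text): array {
    $rows = [];
    foreach (preg_split('/\R/', trim($text)) as $line) {
        // Split on any run of whitespace; skip lines without exactly 4 columns
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) !== 4) continue;
        [$groupID, $start, $end, $sequence] = $parts;
        $rows[] = (object) [
            'groupID' => $groupID,
            'start' => (int) $start,
            'end' => (int) $end,
            'sequence' => $sequence,
        ];
    }
    return $rows;
}
```

Each returned item can then be read as $row->groupID, $row->start and so on. Because the text is only parsed when you ask for it, nothing extra is stored per row, which is the point of keeping it in a plain textarea when the data doesn't need to be queryable.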
  3. @cb2004 Got it. I'll put out an update to the ProcessWireUpgrade module this week. Support for identifying the latest version of Pro modules is a function of the modules directory rather than the upgrades module. I've been meaning to do this, so thanks for the reminder. I have gone ahead and updated it so that it can now identify the latest versions of all Pro modules. Though I can't add support for download+install upgrades of Pro modules, as they are access controlled so there can't be public download URLs for these. I also think that in general it's always better to install or upgrade modules directly on the file system, as that prevents permissions problems (for when apache is not running as your user account), and makes it easier to troubleshoot and resolve issues when installing or upgrading modules.
  4. @Hector Nguyen This is cool to see generators in action. Though as far as I know, PHP's fgetcsv() never loads the whole file in memory at the same time, regardless of which method is used to call it. I think it just loads one line at a time (?), but this reminds me that an optimization to fgetcsv() is to tell it what the longest possible line might be (as 2nd argument), so that it doesn't have to figure it out. Fedeb's example has 0 as the 2nd argument to fgetcsv(), which means "let PHP figure it out", so some overhead could be reduced here by giving it a number like 1024 or whatever the largest line length (in bytes) might be. There may be other benefits to using generators here though? I haven't experimented with them much yet so am curious.
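For reference, here is roughly what a generator around fgetcsv() with an explicit line length might look like. This is a sketch rather than anyone's actual import code: the function name readCsvRows() and the default of 1024 bytes are assumptions, and the length has to be larger than the longest line in the file or long lines will be split:

```php
<?php
/**
 * Yield one CSV row at a time so only a single line is held in memory.
 * Hypothetical helper for illustration.
 *
 * @param string $filename Path to the CSV file
 * @param int $maxLineLength Assumed upper bound (bytes) for any one line
 * @return Generator<int, array>
 */
function readCsvRows(string $filename, int $maxLineLength = 1024): Generator {
    $handle = fopen($filename, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $filename");
    }
    try {
        // A length of 0 means "no limit", which the PHP manual notes is slightly slower.
        // Separator, enclosure and escape are passed explicitly (their usual defaults).
        while (($row = fgetcsv($handle, $maxLineLength, ',', '"', '\\')) !== false) {
            if ($row === [null]) continue; // fgetcsv() returns [null] for a blank line
            yield $row;
        }
    } finally {
        // Runs when the loop finishes or the generator is destroyed early
        fclose($handle);
    }
}
```

You would loop over it with foreach (readCsvRows('/path/to/file.csv') as $row) { ... }. Memory use should be about the same as a plain while loop around fgetcsv(); what the generator adds is mostly a cleaner separation between reading rows and doing something with them.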
  5. Unless I'm forgetting something, the $pages->uncache($page); won't help here because $page is a newly created Page that wasn't loaded from the database. So it's not going to be cached either. Uncaching pages is potentially useful when iterating through large groups of existing pages. For instance, if you are rendering or exporting something large from the contents of existing pages, you might like to $pages->uncacheAll() after getting through a thousand of them to clear room for another paginated batch. Though nowadays we have $pages->findMany() and $pages->findRaw(), so there are fewer instances where you would even need to use uncache or uncacheAll, if ever. ProcessWire actually does an uncacheAll() internally after saving a page already. This is necessary because changes to a page or additions/deletions to the page tree may affect other pages, and we don't want any potential for old cached data to appear in future $pages->find() or other operations. Just one example is if we called $parent->children() before a save, and then after the save called it again, we'd want our new page to be in the children rather than having it return the previously cached value. There are a lot of similar cases, so the safest bet is for PW to uncache the results of future page get/find operations after a save as the default behavior. So that's the way it has always worked. As far as I can tell from fedeb's example (and often with other import operations), it may be better to tell PW to skip this "uncacheAll-after-save" behavior. That's because imports often involve Page reference fields, and you don't want PW to have to reload referenced pages after every save. So you could potentially reduce overhead by telling it not to uncache after save, i.e. $pages->save($page, [ 'uncacheAll' => false ]); I'm not sure if fedeb's import involves loading of any other pages, whether for page reference fields, or anything else. 
So it may not matter one way or the other here, but wanted to mention it just in case. I know about ProcessWire tuning, but not about MySQL server tuning. When dealing with 20 million rows that seems like getting into the territory where optimizations to the DB configuration deserve a lot of focus, so I would bet that BitPoet's suggestions are going to make the most difference.
  6. @fedeb That's the largest quantity of pages I've heard of anyone creating in ProcessWire, by a pretty large margin. So you are in somewhat uncharted territory. But that's really cool you are doing that. I would be curious how different the graph would be if you split it up into batches so that you aren't creating more than a certain quantity per execution/runtime. For instance, maybe you create 10k in one execution and another 10k in the next, etc., or something like that. Would the same slowdown still occur? If so, I would start to think it might be the database index and increased overhead in maintaining that index as the quantity increases. On the flip side, if restarting the process to create each set in batches solves the slowdown, then I would think it might be memory or resource related. A couple things you can do to potentially (?) improve your page creation time: 1. At the top of your code (before the loop) put: $template = $templates->get('protein'); Then within the loop set: $page->template = $template; 2. I don't see a parent page assignment. How are you doing that? Double check that you aren't asking PW to load the parent page every time in the loop and instead handle it like with the template in #1 above. 3. What kind of fields are on your "protein" template? Depending on their type, there may be potential optimizations. Especially if any are Page references. Can you paste in a line or two from the CSV? 4. If you can assign a $page->name = "protein" . $i; rather than having PW auto-generate a name from the title, that will save some resources too.
  7. @cb2004 I'm not sure I understand what you mean by "removing older official repositories". Can you expand on this? Which social embeds? I often use the MarkupSocialShareButtons module for this stuff. But if there are oembed providers for social links then a similar strategy would work. I know that Twitter has an oembed service, and Facebook apparently used to, but then killed it. I was originally thinking about expanding this module (or adding another) to support any oembed provider, except there is a TextformatterOEmbed module that apparently does this so thought I might try that one out first. Within the last few weeks they opened it up to everyone here. Previously it was just people 65+ or with health conditions, etc. I think the vaccinations are going well for the people that want them (which seems to be the majority), but apparently there is still a portion of the population that doesn't want to get the vaccine, so that could keep the virus spreading and mutating as long as that remains the case.
  8. @markus_blue_tomato Great, glad to hear it's working well! @StanLindsey This would be very simple to add, I'll plan to add it this week. Question: would just an array of DB hosts be adequate, or would it need separate configuration (host plus db name, user, pass, port, etc.) for each of the readonly db hosts?
  9. @markus_blue_tomato Sorry to hear that, I hope they have it available soon. I figured we'd be the last to get it here because the for-profit healthcare system in the US doesn't often lend itself well to public health, unless you are wealthy. (Just getting a covid test was $400). Luckily it seems the vaccine isn't being handled by the healthcare companies, and it's free. My parents got their vaccine at an appliance store drive-through, my wife got hers at the grocery store, and I got mine at the office of some technology company in our town square. That might sound sketchy but they are all legitimate and it seems to be working well for once.
  10. It's spring break here and my kids are going back to school next week after being out for more than a year. Since it's a break week, the weather is great, and it's also the last week of the year-long covid break from school, I've spent a little less time at the computer this week. I've focused on some smaller module projects rather than the core. More specifically: posted a major update and refactor of the TextformatterHannaCode module, and a completely rewritten TextformatterVideoEmbed module. While making these updates, I've also made note of and attempted to resolve any reported issues in the GitHub repositories. Next week, it's back to the core, with both issue resolutions and pull requests scheduled for upcoming versions. Next week I also get my 2nd shot of covid vaccine, and I'm told it may slow me down a bit for a day, but will be well worth it. I had a day of tiredness from the 1st shot, but it was greatly outweighed by feelings of gratitude and reduction of worry. I highly recommend it as soon as you can get it, if you haven't already.
  11. This week ProcessWire gained the ability to maintain separate read-only and read-write database connections to optimize scalability, cost and performance. The post covers why this can be so valuable and how to configure it in ProcessWire— https://processwire.com/blog/posts/pw-3.0.175/
  12. @monollonom I knew there was a reported PDO issue pending, but didn't remember the details so was going to be looking for it this coming week to make sure it was included (along with any others), thanks for pointing me to it.
  13. @MrSnoozles @teppo This is not a limiting factor in scalability at least. First off, at least here, the file-based assets are delivered by Cloudfront CDN, so they aren't part of the website traffic in the first place (other than to feed the CDN). If you wanted scalability then you'd likely want a CDN serving your assets whether using S3 or not. But a CDN isn't a necessary part of the equation in our setup either. File systems can be replicated just like databases. That's how this site runs on a load balancer on any number of nodes. Requests that upload files are routed to node A (primary), but all other requests can hit any node that the load balancer decides to route them to. The other nodes are exact copies of the node A file system that update in real time. This is very similar to what's happening with the DB reader (read-only) and writer (read-write) connection I posted about above, where the writer would be node A and there can be any number of readers. Something like S3 doesn't enhance the scalability of any of this. Implementing S3 as a storage option for PW is still part of the plan here, but more for the convenience and usefulness than scalability. You can already use S3 as a storage option in PW if you use one of the methods of mapping directories in your file system to it. But I'm looking to support it in PW more natively than that. It is admittedly more complex than the DB stuff mentioned above. For instance, we use PHP's PDO class that all DB requests go through, so intercepting and routing them is relatively simple. Whereas in PHP, there is no PDO-type class that the file system API is built around, and instead it is dozens of different procedural functions (which is just fine, until you need to change their behavior). In addition, calls to S3 are more expensive than a file system access, so doing something as simple as getting an image's dimensions is no longer a matter of a simple PHP getimagesize() call. 
Instead, it's more like making an FTP connection somewhere, FTP'ing the file to your computer, getting the image dimensions, storing them somewhere else, then deleting the image. So meta data like image dimensions needs to be stored somewhere else. PW actually implemented this ability last year (meta data and stored image dimensions). So we've already been making small steps towards S3-type storage, but because the big picture is still pretty broad in scope to implement, it's more of a long term plan. Though maybe one of my clients will tell me they need it next week, in which case it'll become a short term plan. 🙂
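To give a rough idea of why routing is simple when everything passes through one class, here is a small sketch of the decision itself. This is not how WireDatabasePDO does it; the function names and the rule (statements starting with SELECT, SHOW or DESCRIBE are reads, unless they lock rows or a transaction is open) are assumptions made up for illustration:

```php
<?php
/**
 * Decide whether a SQL statement can go to a read-only connection.
 * Hypothetical rule for illustration only.
 */
function isReadOnlyQuery(string $sql): bool {
    // Locking reads need the writer even though they start with SELECT
    if (stripos($sql, 'FOR UPDATE') !== false) return false;
    return (bool) preg_match('/^\s*(SELECT|SHOW|DESCRIBE)\b/i', $sql);
}

/**
 * Pick a connection name for a statement.
 *
 * @param string $sql SQL statement
 * @param bool $inTransaction True when a transaction is open (always use the writer then)
 * @return string 'reader' or 'writer'
 */
function chooseConnection(string $sql, bool $inTransaction = false): string {
    if ($inTransaction) return 'writer';
    return isReadOnlyQuery($sql) ? 'reader' : 'writer';
}
```

With a single choke point like this, the rest of the application never has to know which connection it is talking to. The file system has no equivalent choke point, which is what makes native S3 support the harder problem.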
  14. This week I've been working on something a little different for the core. Specifically, ProcessWire's database class (WireDatabasePDO) has been rewritten to support separate read-only and read-write database connections. Jan, who administers the processwire.com server, asked if I could implement it. I looked into it and thought it had significant benefits for ProcessWire users, so it worked out that now was a good time to implement it. The rewritten database class is actually complete and now running on a live test installation, but I have not yet committed it to the core dev branch because I want to fully document it first. This will happen next week in a blog post. By then I'll have processwire.com using it too. But I can already say that it's a cool thing watching the graphs in AWS show the difference it makes when we start hitting the site with crawlers. You might be wondering what the benefits are in having separate read-only and read-write database connections. I'll get into some of the details next week. But essentially, read-only database connections can scale in a way (and at much lower cost) that read-write connections can't. And when using a service like Amazon Aurora, they can scale on-the-fly automatically according to traffic and demand for resources. Not only does it open up the ability for a ProcessWire-powered site to scale much further than before, but it has potential to reduce the costs of doing so by perhaps 50% or more. If you are interested in reading more, we are currently testing the features using Amazon Aurora and RDS Read Replicas (see also Replication with Aurora). However, the ProcessWire core support for this feature is not bound to anything AWS specific and will work with any platform supporting a similar ability. Thanks for reading, I'll have more on this next week, and also have it ready to use should you want to try it out yourself.
  15. I hope everyone is having a good week! Commits to the core this week include 10 issue report resolutions (so far). This will likely continue next week as well, and then I'll bump the version up. Also included this week is a CKEditor upgrade from 4.14.0 to version 4.16.0. While that may sound like a minor CKEditor version bump, there's actually quite a list of updates in the CKEditor 4.x changelog, including a few security-related fixes, though none that I think are likely to affect PW users. I do still have a couple of core feature requests in progress as well, but there's more work still to do on those. Nothing too exciting this week, but I like to check in and say hello either way. I hope you all have a great weekend!
  16. @tcnet If possible, add a ModuleName.info.php file, or a ModuleName.info.json file to your repo, where "ModuleName" has the same name as your repo. Here's an example of an info.json file: { "title": "Your Module or Site Profile Name", "summary": "One sentence summary of the module or site profile.", "version": 1, "author": "Name of author" } We haven't had any site profiles added since the directory was recently updated, so if you find that doesn't work, please send me a PM with your repo URL and I'll figure it out here.
  17. @teppo When a module doesn't connect to your profile, it means that the module was submitted with a different email address than the one on your account. When that happens, I just fix them manually so that they connect to your account. I have fixed ProcessChangelog and all 3 or 4 others I could find so that they connect to your account now. Thanks for letting me know. PM or email me if there are any others, as I don't always know when someone tags me so it's easy for me to miss. @adrian I don't know about PageEditSoftLock specifically, but there is an automatic purge of older modules that match these conditions: they don't indicate support for PW 3.x, haven't been updated in 2+ years, and the author is not active. They are still technically in the DB, so if there is someone else that wants to maintain an inactive module, or if there's a known reliable module despite not being active, or if it just appears to be a mistake, let me know and I can re-publish.
  18. I was glad to see there was interest in the new URL hooks added last week, thanks! There were still a few details to work out, so thought I'd take care of those this week before moving on to other updates, and they are now in 3.0.174. Today I've also updated last week's blog post with the additional info, so that it's all in one place. This will likely be converted over to a dedicated documentation page, but for now, here is what's been added: The post last week introduced you to using named arguments. This week another kind was added, called simple named arguments (click the link for details). The idea is based off an example from another post in these forums by BitPoet. The handling of trailing slashes vs non-trailing slashes was undefined last week. This week it has been defined and is now enforced by ProcessWire. All of the details are here. Pagination was another grey area last week, but no longer. Here are all of the details on how you can utilize pagination in URL/path hooks. In addition, in 3.0.174 URL/path hooks can now have control even after a Page (template file) throws its own 404, whether by wire404() or throw new Wire404Exception(). I found this was necessary because it is possible to enable URL segments for your homepage template. And, depending on your homepage template settings, that means it is possible for the homepage to be the recipient of all URLs that don't match pages. This would leave no opportunity for URL/path hooks to execute. So now ProcessWire gives URL/path hooks another opportunity to run if it matches the URL, even after a 404 is thrown from a Page template file. Beyond the above, there's been a lot of additional optimization and improvement to the hooks system that handles the path/URL hooks, but nothing further that affects its API... just internal improvements. ProcessWire 3.0.174 also adds a feature requested by Adrian, which was to add object return value support for the recently added $pages->findRaw() method. 
The method by default returns arrays for everything, which can be helpful in making clear you are working with raw data, in cases where it matters. (As opposed to formatted or prepared page data). But if you specify objects=1 in your selector to the method, it will instead return StdClass objects where it previously returned arrays. This makes working with the data more consistent with how you might work with Page object data, even if it is still raw data. Should you be using any findRaw() data for output purposes, you can now also specify entities=1 in your selector to have it automatically entity-encode all string values, or entities=field, replacing "field" with pipe-separated field names you want entity encoded. The following example summarizes all of the recent additions to findRaw, as pretty much everything here was not possible on the first implementation: $items = $pages->findRaw("parent=/blog/posts, fields=title|url, entities=title, objects=1"); foreach($items as $item) { echo "<li><a href='$item->url'>$item->title</a></li>"; } Thanks for reading and have a great weekend!
  19. @adrian It's like an index for page URLs/paths. Usually PW has to join every page in a query in order to know its URL. So the PagePaths module provides a more direct and theoretically faster route. But in my experience, it's not really faster until the URLs get long (like in cases where PW would have to do lots of joins to determine the URL otherwise). The other thing it does is that it lets you perform $pages->find() partial text matching operations on the url/path, which you can't do otherwise. The only place where it adds overhead is if you change the name of a parent page, it then has to re-index all the URLs for everything below the parent. It's not installed automatically because most people don't need it, and it doesn't support multi-language URLs. But it's handy to have when it crosses over with your needs. I'm good to provide an option for findRaw to return an array of basic objects if one requests it in the $options. But for most I would suggest sticking to the array because it's good for it to be clearly different in syntax from a regular page. That's because it's all raw and unformatted data, so it's not going to be safe to swap between find() and findRaw() in most cases and good to maintain clear differentiation. For instance, when using something from findRaw() for output, you've got to be sure to entity encode anything you output, etc. Plus, one reason for findRaw() is to provide the lowest level path to the raw data, and I think a PHP array is probably the lowest level, least overhead way of doing that. But having an option/alternative for a std object seems fine to me.
  20. @adrian @bernhard I've added support for getting 'url' and 'path' from $pages->findRaw() in the latest commit, but it requires the PagePaths module be installed. Now just need multi-language URL support for that module.
  21. @Robin S That's correct, an existing page has precedence. I'd like to make it so you can optionally override that too, but I'm still working to identify the most efficient way. I'm trying to avoid a solution that adds the overhead of checking hooks and regular expressions for every request before identifying the page. Currently it only does that if the request doesn't end up matching a page, which adds no overhead to page rendering requests. Once the technical details are worked out, likely the solution will involve using a $wire->addHookBefore('/path/', ...) (rather than just addHook) which will receive a matched page (if there is one) and then it can decide whether to let it proceed as-is, change the Page object, or do something else. That's also a technical detail still to work out. We could easily enforce trailing vs non-trailing slash but didn't want to do it without someone dictating that's what they want. Otherwise someone could very easily end up having every request getting 301'd without realizing it. So it seemed safest just to allow either for the moment. I do plan to add a way to let you dictate what's required so that it can perform necessary redirects for you. I'm currently thinking that if the match pattern ends with a slash, it'll enforce the slash; if it doesn't, it'll enforce no-slash; and if it ends with a /? it'll allow for either. @eelkenet It doesn't at present since the feature was just added, but I do think it will be possible for ProCache to cache them. I've already started looking into it here. @StanLindsey The entire matched URL is always in $event->arguments(0). If it would be helpful I can also add a named argument to it, like 'url' or something, so you could do $event->arguments('url'); or just $event->url
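The trailing-slash rule described above (pattern ends with a slash, ends without one, or ends with /?) can be sketched in a few lines. This only illustrates the rule as it's currently being considered, not the code that will end up in the core, and the function name trailingSlashRedirect() is hypothetical:

```php
<?php
/**
 * Apply the trailing-slash rule to a requested path.
 * Hypothetical helper: returns the path to redirect to, or null when
 * the requested path is already acceptable for the given match pattern.
 */
function trailingSlashRedirect(string $pattern, string $path): ?string {
    // A pattern ending in "/?" allows either form, so never redirect
    if (substr($pattern, -2) === '/?') return null;
    // The root path has no slash-less form to redirect to
    if ($path === '/') return null;
    $wantSlash = substr($pattern, -1) === '/';
    $hasSlash = substr($path, -1) === '/';
    if ($wantSlash === $hasSlash) return null;
    return $wantSlash ? $path . '/' : rtrim($path, '/');
}
```

So a hook on '/hello/world/' would send a request for /hello/world to /hello/world/, a hook on '/hello/world' would do the opposite, and a hook on '/hello/world/?' would accept both without a redirect.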
  22. I'm getting a look at this thread linked from the PW Weekly, where it looks like this topic has been discussed. BitPoet has an example where named arguments are in the format like "{user}" (if I understood it correctly) and I really like that. It would provide for an option to have named arguments without having to specify what would be in it... just "any valid PW page name characters". That sounds useful. I'll add support for it, in addition to the named arguments support mentioned in the blog post. We can support that without interfering with other regular expression features by having it convert that to a PCRE capture group.
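As a rough idea of what converting "{user}" style arguments to a PCRE capture group means, here is a small standalone sketch. It is not ProcessWire's implementation: the function name pathPatternToRegex() is made up, and the character set used for "valid page name characters" (letters, digits, hyphen, underscore, period) is an assumption:

```php
<?php
/**
 * Convert a path pattern with "{name}" placeholders into a PCRE regex
 * with named capture groups. Illustration only, not ProcessWire's code.
 * Assumes a placeholder matches a-z, A-Z, 0-9, "-", "_" and ".".
 */
function pathPatternToRegex(string $pattern): string {
    // Split out the placeholders so the literal text can be escaped separately
    $parts = preg_split(
        '/\{([a-zA-Z_][a-zA-Z0-9_]*)\}/',
        $pattern,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    $regex = '';
    foreach ($parts as $i => $part) {
        if ($i % 2) {
            // Odd indexes hold the captured placeholder names
            $regex .= '(?P<' . $part . '>[-_.a-zA-Z0-9]+)';
        } else {
            // Even indexes hold literal path text
            $regex .= preg_quote($part, '!');
        }
    }
    return '!^' . $regex . '$!';
}
```

Running preg_match() with the result for '/user/{user}/posts/{id}/' against '/user/ryan/posts/42/' fills the matches array with 'user' and 'id' keys. Because the placeholders become ordinary named groups, they can sit alongside any other regular expression features in the same pattern.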
  23. ProcessWire 3.0.173 adds several new requested features and this post focuses on one of my favorites: the ability to hook into and handle ProcessWire URLs, independent of pages— https://processwire.com/blog/posts/pw-3.0.173/
  24. The core dev branch commits this week continue to work on feature requests, and the plan is that the version in progress (3.0.173) is and will be focused entirely on added feature requests. While a few requested small features have been committed to the dev branch this week, there are also still two more in progress that aren't quite ready to commit, so those will likely be in next week's commits. Once they are in place, we'll also bump the version to 3.0.173. Following that, I'd like to have 3.0.174 focused on resolutions from the processwire-issues repo, and 3.0.175 focus on PRs. That's the plan for now anyway. It might be a good rotation to keep going. In the next couple of weeks I'm also likely to wrap up the client project that's kept me pretty busy recently, though it's all been ProcessWire-related and fun work thankfully. If you've also been busy building new sites in ProcessWire any time recently, please add them to our sites directory if you get a chance. I hope you all have had a great week and likewise have a great weekend!
  25. @thetuningspoon After save actions don't participate in a delete action. But I can add a hook for that purpose: ProcessPageEdit::redirectAfterDelete(). I'll have it in there in 3.0.173 and you can hook before that method to perform your own redirect before the page editor does.