mtwebit

DataSet import modules

I've created a set of modules for importing (and manipulating and displaying) data from external resources. A key requirement was to handle a large number of pages (100k+) easily.

Main features

  • import data from CSV and XML sources in the background (using Tasker)
  • purge, update or overwrite existing pages using selectors
  • user configurable input <-> field mappings
  • on-the-fly data conversion and composition (e.g. joining CSV columns into a single field)
  • download external resources (files, images) during import
  • resolve page references by any field (even numeric ones)

How it works

You can upload CSV or XML files to DataSet pages and specify import rules in their description.
The module imports the content of the file and creates/updates child pages automatically.

How to use it

Create a DataSet page that stores the source file. The file's description field specifies how the import should be done:

name: Testing the import
input: # Source configuration
  type: csv
  delimiter: ','
  header: 1
  limit: 10  # import only 10 entries; comment this line out once the test import has succeeded
fieldmappings: # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: 1
pages:  # Config for child pages
  template: Data
  selector: 'title=@title'
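
To make the fieldmappings rule above more concrete, here is a rough sketch of how a mapping like title: 1 could be applied to one CSV row. This is only an illustration and not the module's actual code: the function name mapCsvRow and the sample row are made up, and the only thing taken from the rules is that column ids start at 1.

```php
<?php
// Illustrative only: mapCsvRow() is a hypothetical helper, not part of DataSet.
function mapCsvRow(array $row, array $fieldMappings): array
{
    $fields = [];
    foreach ($fieldMappings as $fieldName => $columnId) {
        // Column ids in the rules start at 1, PHP array indexes at 0.
        $fields[$fieldName] = $row[$columnId - 1] ?? null;
    }
    return $fields;
}

// Parse one sample line using the delimiter from the rules (',').
$row = str_getcsv('Jane Doe,jane@example.com', ',', '"', '\\');

// Same mapping as in the rules: the title field comes from column 1.
$fields = mapCsvRow($row, ['title' => 1]);
// $fields is now ['title' => 'Jane Doe']
```

A column id that does not exist in the row simply yields null in this sketch; the real module may handle that case differently.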

After saving the DataSet page an import button should appear below the file description.

[Screenshot: dataset file description]

When you start the import the DataSet module creates a task (executed by Tasker) that will import the data in the background.

You can monitor its execution and check its logs for errors.

[Screenshot: dataset import running]

See the module's wiki for more details.

The module has already been used in three projects to import and handle large XML and CSV datasets. It has some rough edges and I'm sure it needs improvement :) so comments are welcome.

Thanks for sharing your modules @mtwebit! This looks like it could be really useful. Is there any way you could include a place to add a URL to the file instead of an upload? For example, I store staff contact information in a Google Spreadsheet. This spreadsheet gets updated all the time. It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire. The input could also remember its previous value so I can run the import over and over again as needed. Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.

14 hours ago, gmclelland said:

Thanks for sharing your modules @mtwebit! This looks like it could be really useful. Is there any way you could include a place to add a URL to the file instead of an upload? For example, I store staff contact information in a Google Spreadsheet. This spreadsheet gets updated all the time. It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire. The input could also remember its previous value so I can run the import over and over again as needed. Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.

I was thinking about this too...

There was a dev branch that dropped the [file + rules in description] scheme and introduced a fieldset of [rule + (optional) file]. It turned out to be too complicated and it did not work well so I dropped it.

An easy solution is to allow source location override. So... see this commit and use the input:location configuration option.
Not the best solution as it still requires a (dummy) file to be uploaded (to create the import rules in its description), but it works.
You can even use this solution to refer to files uploaded to other pages using this URL scheme: wire://pageid/filename
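
As a rough illustration, the rules in the description of the (dummy) file might then look something like this. This is a sketch under assumptions: only the input:location option and the wire://pageid/filename scheme come from the post above. The nesting under input, the example URL, the page id 1234 and the file name are placeholders.

```yaml
name: Staff contacts
input:
  type: csv
  delimiter: ','
  header: 1
  # Assumed placement of the override; the URL is a placeholder
  location: https://example.com/staff-contacts.csv
  # Or, for a file uploaded to another page (hypothetical page id and file name):
  # location: wire://1234/staff-contacts.csv
fieldmappings:
  title: 1
pages:
  template: Data
  selector: 'title=@title'
```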

Hope it helps.

14 hours ago, gmclelland said:

It looks like you might have already considered and built this type of functionality https://github.com/mtwebit/DataSet/wiki/Import-rules#data-conversion-during-import

That's different. It downloads data for a single field (e.g. a file to be stored in a filefield) not for an entire DataSet.


I really like this. It would be complete if it could do bulk export too. Currently I'm using a custom PHP script on the front end for huge data exports, but I'd prefer to do this in the admin area.


The JSON rule format is now supported, but I have a small problem with it. It works fine in the global rule field, but storing JSON in file descriptions is not possible at the moment.

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

See the GitHub issue.

9 hours ago, mtwebit said:

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...

*{"json": "here"}

...and then trim the first character before the module decodes the JSON.
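
A minimal sketch of that workaround, assuming the module is free to choose its own marker character. Both function names are hypothetical. looksLikeJson() only restates the first/last character check described above; it is not ProcessWire's actual code.

```php
<?php
// Restates the check described above (first/last character); NOT the core code.
function looksLikeJson(string $text): bool
{
    if (strlen($text) < 2) {
        return false;
    }
    $first = $text[0];
    $last = $text[strlen($text) - 1];
    return ($first === '{' && $last === '}') || ($first === '[' && $last === ']');
}

// Hypothetical helper: strip the marker, then decode the remaining JSON.
function decodePrefixedRules(string $description, string $prefix = '*'): ?array
{
    $text = trim($description);
    // Remove the marker only when it is actually present.
    if (strncmp($text, $prefix, strlen($prefix)) === 0) {
        $text = substr($text, strlen($prefix));
    }
    $rules = json_decode($text, true);
    // Return null when the remainder is not a JSON object or array.
    return is_array($rules) ? $rules : null;
}

var_dump(looksLikeJson('{"json": "here"}'));  // bool(true): would be treated as JSON
var_dump(looksLikeJson('*{"json": "here"}')); // bool(false): the prefix hides it
$rules = decodePrefixedRules('*{"json": "here"}');
// $rules is now ['json' => 'here']
```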

On 1/25/2019 at 2:05 PM, Robin S said:

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...


*{"json": "here"}

...and then trim the first character before the module decodes the JSON.

Yeah, this is a little painful. I use the same approach in Tracy. I think it might be better if Ryan replaced that JSON detection code with the following, which seems to be the most common approach to the problem.

    /**
     * Is the provided string a valid JSON string?
     *
     * @param string $string
     * @return boolean
     */
    public function isJson($string) {
        // Decode only to populate json_last_error(); the result is discarded
        json_decode($string);
        return (json_last_error() === JSON_ERROR_NONE);
    }

PS - actually maybe this isn't useful at all with this issue, but in general I think he should be using a function like this for determining if a string is JSON.
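
For anyone who wants to try it, here is a standalone version of the same check with a few sample inputs. The name isJsonString is just for this sketch. It shows where validation differs from a first/last character test, and also that it would not help with the description issue: a valid JSON rule is still valid JSON.

```php
<?php
// Standalone variant of the method above; the function name is illustrative.
function isJsonString(string $string): bool
{
    json_decode($string);
    return json_last_error() === JSON_ERROR_NONE;
}

var_dump(isJsonString('{"json": "here"}'));  // bool(true)
var_dump(isJsonString('*{"json": "here"}')); // bool(false): the prefix breaks the syntax
var_dump(isJsonString('{not json}'));        // bool(false) although it starts with { and ends with }
var_dump(isJsonString('123'));               // bool(true): bare scalars are valid JSON too
```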
