mtwebit

DataSet import modules


I've created a set of modules for importing (manipulating and displaying) data from external resources. A key requirement was to handle a large number of pages (100k+) easily.

Main features

  • import data from CSV and XML sources in the background (using Tasker)
  • purge, update or overwrite existing pages using selectors
  • user-configurable input <-> field mappings
  • on-the-fly data conversion and composition (e.g. joining CSV columns into a single field)
  • download external resources (files, images) during import
  • handle page references by any (even numeric) fields

How it works

You can upload CSV or XML files to DataSet pages and specify import rules in their description.
The module imports the content of the file and creates/updates child pages automatically.

How to use it

Create a DataSet page that stores the source file. The file's description field specifies how the import should be done:


name: Testing the import
input: # Source configuration
  type: csv
  delimiter: ','
  header: 1
  limit: 10  # import only the first 10 entries; remove this limit once the test import was successful
fieldmappings: # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: 1
pages:  # Config for child pages
  template: Data
  selector: 'title=@title'

After saving the DataSet page, an import button should appear below the file description.

[Screenshot: DataSet file description containing the import rules, with the import button below]

When you start the import, the DataSet module creates a task (executed by Tasker) that imports the data in the background.

You can monitor its execution and check its logs for errors.

[Screenshot: DataSet import task running in Tasker]

See the module's wiki for more details.

The module has already been used in three projects to import and handle large XML and CSV datasets. It has some rough edges and I'm sure it needs improvement :) so comments are welcome.


Thanks for sharing your modules @mtwebit!  This looks like it could be really useful.  Is there any way you could include a place to add a URL to the file instead of an upload?  For example, I store staff contact information in a Google Spreadsheet.  This spreadsheet gets updated all the time.  It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire.  The input could also remember its previous value so I can run the import over and over again as needed.  Maybe it could also somehow be automated to run the same import every day?

If not, no worries.  Thanks again.

14 hours ago, gmclelland said:

Is there any way you could include a place to add a URL to the file instead of an upload? ... Maybe it could also somehow be automated to run the same import every day?

I was thinking about this too...

There was a dev branch that dropped the [file + rules in description] scheme and introduced a fieldset of [rule + (optional) file]. It turned out to be too complicated and did not work well, so I dropped it.

An easy solution is to allow a source location override. So... see this commit and use the input:location configuration option.
It's not the best solution, as it still requires a (dummy) file to be uploaded (to hold the import rules in its description), but it works.
You can even use this solution to refer to files uploaded to other pages using this URL scheme: wire://pageid/filename
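
For illustration, here is a minimal sketch of what such a configuration could look like (the remote URL is hypothetical, and the exact placement of the location option should be checked against the commit and the wiki):

name: Import staff contacts from a remote CSV
input:
  type: csv
  location: 'https://example.com/staff-contacts.csv'  # external source; wire://1234/data.csv would reference a file uploaded to another page
  delimiter: ','
  header: 1
fieldmappings:
  title: 1
pages:
  template: Data
  selector: 'title=@title'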

Hope it helps.

14 hours ago, gmclelland said:

It looks like you might have already considered and built this type of functionality https://github.com/mtwebit/DataSet/wiki/Import-rules#data-conversion-during-import

That's different. It downloads data for a single field (e.g. a file to be stored in a file field), not for an entire DataSet.


I really like this. It would be complete if it could do bulk export too. Currently I'm using a custom PHP script on the front end for huge data exports, but I'd prefer to do this in the admin area.


JSON rule format is now supported, but I have a small problem with it. It works fine in the global rule field, but storing JSON in file descriptions is not possible at the moment.

Pagefile uses JSON internally for storing multi-language file descriptions, so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled, Pagefile still drops JSON descriptions).

Any idea?

See Github issue

9 hours ago, mtwebit said:

Pagefile uses JSON internally for storing multi-language file descriptions, so it is not possible to store JSON data there... Any idea?

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...

*{"json": "here"}

...and then trim the first character before the module decodes the JSON.
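
A minimal sketch of that trimming step on the module side (the variable names are illustrative, not taken from DataSet's code):

    // Raw description text, e.g. '*{"name": "Testing the import", ...}'
    $desc = $pagefile->description;

    // Strip the marker character that hides the JSON from Pagefile
    if (strlen($desc) && $desc[0] === '*') {
        $desc = substr($desc, 1);
    }

    // Decode the import rules
    $rules = json_decode($desc, true);
    if ($rules === null) {
        // not valid JSON: fall back to YAML parsing or report a config error
    }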

On 1/25/2019 at 2:05 PM, Robin S said:

So one workaround could be to prefix the JSON with some character... and then trim the first character before the module decodes the JSON.

Yeah, this is a little painful. I use the same approach in Tracy. I think it might be better if Ryan replaced that JSON detection code with the following, which seems to be the most common approach to the problem.

    /**
     * is the provided string a valid json string?
     *
     * @param string $string
     * @return boolean
     */
    public function isJson($string) {
        json_decode($string);
        return (json_last_error() == JSON_ERROR_NONE);
    }

PS - actually maybe this isn't useful for this particular issue, but in general I think he should be using a function like this to determine whether a string is JSON.


I'm trying to import 60k pages from a CSV file.

I installed the DataSet module, but the dataset_config field seems not to be working.

By default it is set up as a textarea field, and the configuration is not valid. There is also a message: "YAML is not supported on your system. Try to use JSON for configuration." I installed the fieldtype-yaml module and set it for dataset_config, but this is also not working.

https://modules.processwire.com/modules/fieldtype-yaml/

Any suggestions? All other required modules are installed.

The formatting of the YAML in the screenshot is wrong, I know.

[Screenshot: dataset_config field showing the YAML configuration]

6 hours ago, flydev said:

@theqbap

The YAML thing is a PHP extension which needs to be enabled in your server configuration.

Ok, thank you for the reply. Could you provide me with an example of a JSON config for the dataset_config field? Unfortunately I can't enable YAML on the server side.


@theqbap

The YAML example converted to JSON with an online tool gives us this config:

{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 

On 12/25/2019 at 10:26 AM, flydev said:

The YAML example converted to JSON with an online tool gives us this config: ...

I still get the message "DataSet config is invalid."


@theqbap

Then add the string "JSON" before the config:

JSON{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 


I've checked the above config on my DataSet test site and it is valid.
(Don't forget to save the page to run the validator again.)

On 12/27/2019 at 9:49 AM, flydev said:

Then add the string "JSON" before the config: ...

Thank you for the help 🙂


One more question regarding importing data. 

When a row in a CSV file results in a page with the same 'title' as one that already exists, is there an option to make the title unique and still import the new page?

CSV example:

Title,Number
Orange,20-300
Orange,10-20
Banana,5-10


ProcessWire pages:

+Import Folder
-Orange
-Orange
-Banana
 


The two "Orange" pages have the same page title, but ProcessWire probably gives them different page names. Look on the Settings tab of each page.


By default DataSet will create a new PW page each time it imports a row. In the above example, two pages will be created with the title "Orange" and one with "Banana".

There is no option to change the title of the new page (the 2nd Orange) if it matches an already existing one (the 1st Orange).
You can, however, combine several fields in the title to make it unique. E.g. you can compose the title like this (column #0 always contains the row's serial number):

title: [1, ' (', 0, ')']

The result will be:

Orange (1)
Orange (2)
Banana (3)

You can also update (overwrite or merge) already existing pages. In the "pages" section of the DataSet config you can specify a selector and add the overwrite or merge option.
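
A rough sketch of what that could look like (the exact option names and values should be checked against the wiki):

pages:  # config for child pages
  template: Data
  selector: 'title=@title'  # find an already existing page by title
  overwrite: 1              # or merge: 1, to update the matched page instead of creating a new one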

See the wiki for more details. (It needs to be updated, but it is probably still helpful 🙂 )


OK. It was time to update the wiki 🙂

I've uploaded a new DataSet version (0.9.5) to GitHub. It contains many improvements for data type conversions and page reference handling, and several bug fixes.
It also has a new profiler to help optimize the import routines.

Tasker has also been updated.


Thanks for developing this module; my tests so far have been really positive. I'm developing a PW site that requires import and regular updates of 100k+ pages, and this will be invaluable.

One question I have (if you have time) is around page references. I'm unable to modify my source data, so I have created a page reference field and pages that correspond to the source data, e.g. 'LED', which has ID 1137.

My CSV has this 'LED' data; however, when I import, I get this result:

Processing data for field 'category'.
Page selector @ field category: templates_id=50, has_parent=1110.
Found referenced page 'First Category Item' for field 'category' using the selector 'templates_id=50, has_parent=1110'.
Setting field 'category' = '1111'.

Page ID 1111 ('First Category Item') is the first Category page. I've also tried setting the category to 1137 within the CSV file and get the same result.

This is the config I'm using:

JSON{
  "name": "Import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "model_id": 1,
    "title": 2,
    "category": 3
  },
  "pages": {
    "template": "model",
    "selector": "model_id=@model_id"
  }
}

The other two (text fields) work fine. Any advice would be appreciated!

Edit: I've now found your reference to page references in the wiki, and that changes everything! The default for page references is the title, as you say. I've installed Autocomplete and it's working great. One task for later is figuring out the scheduling side of things. I did wonder, with page references, is it possible for pages to be created automatically if they don't already exist on import?

18 hours ago, DonPachi said:

I did wonder, with page references, is it possible for pages to be created automatically if they don't already exist on import?

I use page references heavily in my projects. Page Autocomplete has a field ("Settings specific to ...") on the Input tab of the field settings page that can be used to specify which fields are used during the query. You can even select multiple fields, e.g. a category_ref_by_id field can specify multiple ID fields. This way you can merge individual data sets into a single one: each source set can have its own ID, and the ...ref_by_id field can use all of them.

I have no plans for automatic creation of the missing referenced pages, but it can be achieved very easily. Just create another DataSet using the same CSV file and import the appropriate "category" columns to create the missing pages. You can also use the location attribute in the DataSet config to reference the file uploaded to the original DataSet (see the wiki) and avoid duplicate uploads.
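
For illustration, such a second (category) DataSet config might look roughly like this; the page ID, column number and template name are hypothetical, and the wire:// location scheme is described in the wiki:

name: Import categories first
input:
  type: csv
  location: 'wire://1234/models.csv'  # hypothetical: reuse the CSV already uploaded to the original DataSet page
  delimiter: ','
  header: 1
fieldmappings:
  title: 3  # the category column
pages:
  template: category
  selector: 'title=@title'  # avoid creating the same category page twice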

If you need to perform these imports automatically, you can create two tasks (the category import and the original one) and specify a dependency between them (first import the categories, then the full data set). See the Tasker wiki.


