mtwebit

DataSet import modules


I've created a set of modules for importing (and manipulating and displaying) data from external resources. A key requirement was to handle a large number of pages (100k+) easily.

Main features

  • import data from CSV and XML sources in the background (using Tasker)
  • purge, update or overwrite existing pages using selectors
  • user configurable input <-> field mappings
  • on-the-fly data conversion and composition (e.g. joining CSV columns into a single field)
  • download external resources (files, images) during import
  • handle page references by any (even numeric) fields

How it works

You can upload CSV or XML files to DataSet pages and specify import rules in their description.
The module imports the content of the file and creates/updates child pages automatically.

How to use it

Create a DataSet page that stores the source file. The file's description field specifies how the import should be done:

name: Testing the import
input: # Source configuration
  type: csv
  delimiter: ','
  header: 1
  limit: 10  # import only 10 entries; comment this out once the test has succeeded
fieldmappings: # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: 1
pages:  # Config for child pages
  template: Data
  selector: 'title=@title'

After saving the DataSet page, an import button should appear below the file description.

[Screenshot: DataSet file description]

When you start the import, the DataSet module creates a task (executed by Tasker) that will import the data in the background.

You can monitor its execution and check its logs for errors.

[Screenshot: DataSet import running]

See the module's wiki for more details.

The module has already been used in three projects to import and handle large XML and CSV datasets. It has some rough edges and I'm sure it needs improvement :) so comments are welcome.


Thanks for sharing your modules @mtwebit! This looks like it could be really useful. Is there any way you could include a place to add a URL to the file instead of an upload? For example, I store staff contact information in a Google Spreadsheet. This spreadsheet gets updated all the time. It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire. The input could also remember its previous value so I can run the import over and over again as needed. Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.

14 hours ago, gmclelland said:

Thanks for sharing your modules @mtwebit! This looks like it could be really useful. Is there any way you could include a place to add a URL to the file instead of an upload? For example, I store staff contact information in a Google Spreadsheet. This spreadsheet gets updated all the time. It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire. The input could also remember its previous value so I can run the import over and over again as needed. Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.

I was thinking about this too...

There was a dev branch that dropped the [file + rules in description] scheme and introduced a fieldset of [rule + (optional) file]. It turned out to be too complicated and it did not work well so I dropped it.

An easy solution is to allow source location override. So... see this commit and use the input:location configuration option.
Not the best solution as it still requires a (dummy) file to be uploaded (to create the import rules in its description), but it works.
You can even use this solution to refer to files uploaded to other pages using this URL scheme: wire://pageid/filename
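To illustrate the wire://pageid/filename scheme, here is a minimal sketch of how such a location string could be split into a page ID and a file name. The function name parseWireLocation is hypothetical and the code is not taken from DataSet, so the way the module actually resolves the location may differ.

```php
<?php
// Hypothetical helper for illustration only; not part of DataSet.
function parseWireLocation(string $location): ?array {
    // Expect "wire://<numeric page id>/<file name>".
    if (!preg_match('#^wire://(\d+)/(.+)$#', $location, $m)) {
        return null; // not a wire:// location (e.g. a plain http(s) URL)
    }
    return ['pageId' => (int) $m[1], 'filename' => $m[2]];
}

var_dump(parseWireLocation('wire://1234/staff.csv'));
var_dump(parseWireLocation('https://example.com/staff.csv'));
```

The first call yields the page ID and file name; the second returns null, so a caller could fall back to treating the location as an ordinary URL.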

Hope it helps.

14 hours ago, gmclelland said:

It looks like you might have already considered and built this type of functionality https://github.com/mtwebit/DataSet/wiki/Import-rules#data-conversion-during-import

That's different. It downloads data for a single field (e.g. a file to be stored in a filefield) not for an entire DataSet.


I really like this. It would be complete if it could do bulk export too. Currently I'm using a custom PHP script on the front end for huge data exports, but I'd prefer to do this in the admin area.


JSON rule format is now supported but I have a small problem with that. It works fine in the global rule field but storing JSON in file descriptions is not possible atm.

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

See Github issue

9 hours ago, mtwebit said:

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...

*{"json": "here"}

...and then trim the first character before the module decodes the JSON.
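A minimal sketch of that workaround, assuming a single-character marker such as * is put in front of the JSON. The function name decodePrefixedJson is made up for this example and is not part of DataSet or ProcessWire.

```php
<?php
// Hypothetical helper for illustration only; not part of DataSet or ProcessWire.
function decodePrefixedJson(string $description, string $prefix = '*'): ?array {
    $text = trim($description);
    // Strip the marker only when it is really there.
    if ($prefix !== '' && strncmp($text, $prefix, strlen($prefix)) === 0) {
        $text = (string) substr($text, strlen($prefix));
    }
    $data = json_decode($text, true);
    // json_decode() returns null on failure; scalars such as 42 are rejected too.
    return is_array($data) ? $data : null;
}

var_dump(decodePrefixedJson('*{"json": "here"}'));
var_dump(decodePrefixedJson('name: Testing the import'));
```

Because the stored description no longer starts with { or [, it is not treated as a multi-language JSON description, and the rules are decoded only after the marker is removed.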

On 1/25/2019 at 2:05 PM, Robin S said:

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...


*{"json": "here"}

...and then trim the first character before the module decodes the JSON.

Yeah, this is a little painful. I use the same approach in Tracy. I think it might be better if Ryan replaced that JSON detection code with the following, which seems to be the most common approach to the problem.

    /**
     * is the provided string a valid json string?
     *
     * @param string $string
     * @return boolean
     */
    public function isJson($string) {
        json_decode($string);
        return (json_last_error() == JSON_ERROR_NONE);
    }

PS - actually maybe this isn't useful at all with this issue, but in general I think he should be using a function like this for determining if a string is JSON.
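For reference, a standalone variant of the isJson() method above shows how such a check behaves on the strings discussed in this thread. The name isJsonString is hypothetical; it only wraps PHP's json_decode() and json_last_error().

```php
<?php
// Standalone variant of the isJson() method above; the name is hypothetical.
function isJsonString(string $string): bool {
    json_decode($string);
    return json_last_error() === JSON_ERROR_NONE;
}

var_dump(isJsonString('{"json": "here"}'));  // a JSON object
var_dump(isJsonString('*{"json": "here"}')); // prefixed, so not valid JSON
var_dump(isJsonString('123'));               // a bare number is valid JSON too
```

The first and third calls report true and the second false. A check like this accepts any valid JSON document, including bare numbers and quoted strings, not only objects and arrays.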


I'm trying to import 60k pages from a CSV file.

I installed the DataSet module but the dataset_config field seems not to be working.

By default it is set as a textarea field, and the configuration is not valid. There is also a message "YAML is not supported on your system. Try to use JSON for configuration." I installed the fieldtype-yaml module and set it for dataset_config but this is also not working.

https://modules.processwire.com/modules/fieldtype-yaml/

Any suggestions? All other modules required are installed.

The formatting on the screenshot for YAML is wrong, I know.

[Screenshot: dataset_config field]

 

 

6 hours ago, flydev said:

@theqbap

YAML support is a PHP extension which needs to be activated in your server configuration.

 

OK, thank you for the reply. Can you provide me with an example of a JSON config for the dataset_config field? Unfortunately I can't activate YAML on the server side.


@theqbap

The YAML example converted to JSON with an online tool gives us this config:

{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 

On 12/25/2019 at 10:26 AM, flydev said:

@theqbap

The YAML example converted to JSON with an online tool gives us this config:


{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 

Still with message "DataSet config is invalid."


@theqbap

Then add the string "JSON" before the config :

JSON{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 


I've checked the above config on my DataSet test site and it is valid.
(Don't forget to save the page to run the validator again.)

On 12/27/2019 at 9:49 AM, flydev said:

@theqbap

Then add the string "JSON" before the config :


JSON{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 

Thank you for the help 🙂


One more question regarding importing data. 

When a row in a CSV file would result in a page with the same 'title' as one that already exists, is there an option to make the title unique and import a new page with the same name?

CSV example:

Title,Number
Orange,20-300
Orange,10-20
Banana,5-10


ProcessWire pages:

+Import Folder
-Orange
-Orange
-Banana
 


The two “orange” pages have the same page title, but it probably gives them different page names.  Look on the settings tab of each page.


By default DataSet will create a new PW page each time it imports a row. In the above example, two pages will be created with title "Orange" and one with "Banana".

There is no option to change the title for the new page (2nd Orange) if it matches an already existing one (1st Orange).
You can, however, combine several fields in the title making it unique. E.g. you can create the title like this (column #0 always contains the row's serial number):

title: [1, ' (', 0, ')']

The result will be:

Orange (1)
Orange (2)
Banana (3)
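The composition rule above can be sketched as follows, assuming that integers in the mapping are column numbers and strings are literal text. This is only an illustration with a hypothetical function name (composeField), not the module's own code.

```php
<?php
// Hypothetical illustration of a composed field mapping; not DataSet's code.
function composeField(array $mapping, array $row): string {
    $out = '';
    foreach ($mapping as $part) {
        // Integers are treated as column numbers, everything else as literal text.
        $out .= is_int($part) ? (string) ($row[$part] ?? '') : (string) $part;
    }
    return $out;
}

$rows = [['Orange', '20-300'], ['Orange', '10-20'], ['Banana', '5-10']];
foreach ($rows as $i => $cells) {
    // Column 0 holds the row's serial number; CSV columns start at 1.
    $row = array_merge([$i + 1], $cells);
    echo composeField([1, ' (', 0, ')'], $row), PHP_EOL;
}
```

Run against the three CSV rows from the question, this prints the same three titles as listed above.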

You can also update (overwrite or merge) already existing pages. In the "pages" section of DS config you can specify a selector and add the overwrite or merge option.

See the wiki for more details. (Which needs to be updated but it is probably still helpful 🙂 )


OK. It was time to update the wiki 🙂

I've uploaded a new DataSet version (0.9.5) to GitHub. It contains many improvements for data type conversions, page reference handling and several bug fixes.
It also has a new profiler to optimize the import routines.

Tasker is also updated.


Thanks for developing this module, my tests so far have been really positive. I'm developing a PW site that requires import and regular update of 100k+ pages and this will be invaluable.

One question I have (if you have time) is around Page References. I'm unable to modify my source data, so have created a page reference and field that corresponds to that of the source data e.g. 'LED' which is ID 1137. 

My CSV has this 'LED' data, however when I import, I get this result:

Processing data for field 'category'.
Page selector @ field category: templates_id=50, has_parent=1110.
Found referenced page 'First Category Item' for field 'category' using the selector 'templates_id=50, has_parent=1110'.
Setting field 'category' = '1111'.

Page ID 1111 (or First Category Item) is the first Category page. I've also tried setting the category to 1137 within the CSV file and get the same result.

This is when using the below config:

JSON{
  "name": "Import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "model_id": 1,
    "title": 2,
    "category": 3
  },
  "pages": {
    "template": "model",
    "selector": "model_id=@model_id"
  }
}

The other two (text fields) work fine. Any advice would be appreciated!

Edit: I've now found your reference to Page References in the Wiki that changes everything! The default for Page References is Title as you say. I've installed Autocomplete and it's working great. One task for later is figuring out the scheduling side of things. I did wonder, with Page References is it possible for pages to be automatically created if they don't already exist on import?

18 hours ago, DonPachi said:

Edit: I've now found your reference to Page References in the Wiki that changes everything! The default for Page References is Title as you say. I've installed Autocomplete and it's working great. One task for later is figuring out the scheduling side of things. I did wonder, with Page References is it possible for pages to be automatically created if they don't already exist on import?

I use page references heavily in my projects. Page Autocomplete has a field (Settings specific to ...) on the Input tab of the field settings page that can be used to specify what fields are used during the query. You can even select multiple fields, e.g. a category_ref_by_id field can specify multiple ID fields. This way you can merge individual data sets into a single one. Each source set can have its own ID, and the ...ref_by_id field can use all of them.

I have no plans for the automatic creation of the missing referenced page but it can be achieved very easily. Just create another DataSet using the same CSV file and import the appropriate "category" columns for creating the missing pages. You can also try to use the location attribute in the DataSet config to make a reference to the file uploaded to the original DataSet (see the wiki) to avoid duplicate uploads.

If you need to perform these imports automatically you can create two tasks (category import and the original one) and specify a dependency between them (first import categories then the full data set). See Tasker wiki.

