DataSet import modules


mtwebit

I've created a set of modules for importing (and also manipulating and displaying) data from external resources. A key requirement was to handle a large number of pages (100k+) easily.

Main features

  • import data from CSV and XML sources in the background (using Tasker)
  • purge, update or overwrite existing pages using selectors
  • user configurable input <-> field mappings
  • on-the-fly data conversion and composition (e.g. joining CSV columns into a single field)
  • download external resources (files, images) during import
  • handle page references by any (even numeric) fields

How it works

You can upload CSV or XML files to DataSet pages and specify import rules in their description.
The module imports the content of the file and creates/updates child pages automatically.

How to use it

Create a DataSet page that stores the source file. The file's description field specifies how the import should be done:

name: Testing the import
input: # Source configuration
  type: csv
  delimiter: ','
  header: 1
  limit: 10  # import only 10 entries; remove this line once the test is successful
fieldmappings: # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: 1
pages:  # Config for child pages
  template: Data
  selector: 'title=@title'
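
To make the rules above more concrete, here is a minimal PHP sketch of how a fieldmappings section with 1-based CSV column IDs could be applied to rows. It is not the DataSet module's code: the function name mapCsvRows, its parameters and the sample data are assumptions for illustration, and header is assumed to mean the number of header rows to skip.

```php
<?php
// Hypothetical helper, NOT part of the DataSet module.
// $mappings uses 1-based CSV column IDs, as in the config above.
function mapCsvRows(
    string $csv,
    array $mappings,
    string $delimiter = ',',
    int $headerRows = 1,
    ?int $limit = null
): array {
    // Split into lines; a real importer would stream the file and
    // also cope with quoted fields that contain line breaks.
    $lines = preg_split('/\R/', trim($csv));
    $lines = array_slice($lines, $headerRows);   // skip header row(s)
    if ($limit !== null) {
        $lines = array_slice($lines, 0, $limit); // honour the "limit" setting
    }

    $pages = [];
    foreach ($lines as $line) {
        $columns = str_getcsv($line, $delimiter, '"', '');
        $fields = [];
        foreach ($mappings as $fieldName => $columnId) {
            // Column IDs start at 1, PHP arrays at 0.
            $fields[$fieldName] = $columns[$columnId - 1] ?? null;
        }
        $pages[] = $fields;
    }
    return $pages;
}

// Made-up sample data: one header row, three data rows, limit of 2.
$csv = "Title,Number\nOrange,20-300\nOrange,10-20\nBanana,5-10";
$pages = mapCsvRows($csv, ['title' => 1], ',', 1, 2);
print_r($pages);
```

With a limit of 2 only the first two data rows are mapped, which mirrors the idea of testing the rules on a few entries before running the full import.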

After saving the DataSet page an import button should appear below the file description.

[Screenshot: DataSet file description]

When you start the import the DataSet module creates a task (executed by Tasker) that will import the data in the background.

You can monitor its execution and check its logs for errors.

[Screenshot: DataSet import running]

See the module's wiki for more details.

The module was already used in three projects to import and handle large XML and CSV datasets. It has some rough edges and I'm sure it needs improvement :) so comments are welcome.


Thanks for sharing your modules @mtwebit!  This looks like it could be really useful.  Is there any way you could include a place to add a URL to the file instead of an upload?  For example, I store staff contact information in a Google Spreadsheet.  This spreadsheet gets updated all the time.  It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire.  The input could also remember its previous value so I can run the import over and over again as needed.  Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.


14 hours ago, gmclelland said:

Thanks for sharing your modules @mtwebit!  This looks like it could be really useful.  Is there any way you could include a place to add a URL to the file instead of an upload?  For example, I store staff contact information in a Google Spreadsheet.  This spreadsheet gets updated all the time.  It would be cool to just add the URL of the CSV file instead of having to download the file and upload it into ProcessWire.  The input could also remember its previous value so I can run the import over and over again as needed.  Maybe it could also be automated somehow to run the same import every day?

If not, no worries.  Thanks again.

I was thinking about this too...

There was a dev branch that dropped the [file + rules in description] scheme and introduced a fieldset of [rule + (optional) file]. It turned out to be too complicated and it did not work well so I dropped it.

An easy solution is to allow source location override. So... see this commit and use the input:location configuration option.
Not the best solution as it still requires a (dummy) file to be uploaded (to create the import rules in its description), but it works.
You can even use this solution to refer to files uploaded to other pages using this URL scheme: wire://pageid/filename
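
As an illustration of the wire://pageid/filename scheme, the following PHP sketch splits such a location into a page ID and a file name. The function name parseWireLocation and the example values are assumptions; how the module itself resolves these locations is not shown in this thread.

```php
<?php
// Hypothetical parser for illustration only; the module's own
// handling of the "location" option may differ.
function parseWireLocation(string $location): ?array
{
    $parts = parse_url($location);
    if ($parts === false || ($parts['scheme'] ?? '') !== 'wire') {
        return null; // not a wire:// location (e.g. a plain http(s) URL)
    }

    $pageId = $parts['host'] ?? '';
    $filename = ltrim($parts['path'] ?? '', '/');

    // The page ID must be numeric and a file name must be present.
    if (preg_match('/^\d+$/', $pageId) !== 1 || $filename === '') {
        return null;
    }
    return ['pageId' => (int) $pageId, 'filename' => $filename];
}

// Example values are made up.
var_dump(parseWireLocation('wire://1234/staff.csv'));
var_dump(parseWireLocation('https://example.com/staff.csv'));
```

The first call yields the page ID and file name; the second returns null because the location is an ordinary URL that would be fetched directly.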

Hope it helps.

14 hours ago, gmclelland said:

It looks like you might have already considered and built this type of functionality https://github.com/mtwebit/DataSet/wiki/Import-rules#data-conversion-during-import

That's different. It downloads data for a single field (e.g. a file to be stored in a filefield) not for an entire DataSet.


  • 2 weeks later...

JSON rule format is now supported but I have a small problem with that. It works fine in the global rule field but storing JSON in file descriptions is not possible atm.

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

See Github issue


9 hours ago, mtwebit said:

Pagefile uses JSON internally for storing multi-language file descriptions so it is not possible to store JSON data there... I could not find a way to overcome this issue (even if multi-language descriptions are disabled Pagefile still drops JSON descriptions).

Any idea?

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...

*{"json": "here"}

...and then trim the first character before the module decodes the JSON.
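
A minimal PHP sketch of that workaround might look like the following. The function name decodePrefixedJson is hypothetical and the '*' marker is simply the character used in the example above.

```php
<?php
// Hypothetical helper: strips a marker character from a file
// description and decodes the remainder as JSON.
function decodePrefixedJson(string $description, string $prefix = '*'): ?array
{
    // Without the marker, treat the text as an ordinary description.
    if ($prefix === '' || strncmp($description, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    $json = substr($description, strlen($prefix));
    $data = json_decode($json, true);

    // Reject syntax errors and scalar values such as "123".
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($data)) {
        return null;
    }
    return $data;
}

var_dump(decodePrefixedJson('*{"json": "here"}'));
var_dump(decodePrefixedJson('A plain description'));
```

Because the stored text no longer starts with { or [, the first-and-last-character check described above should not treat it as JSON.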


On 1/25/2019 at 2:05 PM, Robin S said:

JSON in the description field is detected if the first character is { and the last character is }, or if the first character is [ and the last character is ]. See here.

So one workaround could be to prefix the JSON with some character...


*{"json": "here"}

...and then trim the first character before the module decodes the JSON.

Yeah, this is a little painful. I use the same approach in Tracy. I think it might be better if Ryan replaced that JSON detection code with the following, which seems to be the most common approach to the problem.

    /**
     * Is the provided string valid JSON?
     *
     * @param string $string
     * @return bool
     */
    public function isJson($string) {
        json_decode($string);
        return json_last_error() === JSON_ERROR_NONE;
    }

PS - actually maybe this isn't useful at all with this issue, but in general I think he should be using a function like this for determining if a string is JSON.
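
One caveat with a check based only on json_last_error(): json_decode() also accepts scalar values, so a description consisting of just 123 or "text" counts as valid JSON. The sketch below is a stricter variant that only accepts objects and arrays; the name isJsonContainer is made up here and this is not ProcessWire core code.

```php
<?php
// Hypothetical stricter variant: true only for JSON objects or arrays.
function isJsonContainer(string $string): bool
{
    $decoded = json_decode($string);
    return json_last_error() === JSON_ERROR_NONE
        && (is_array($decoded) || is_object($decoded));
}

var_dump(isJsonContainer('{"a": 1}'));   // object
var_dump(isJsonContainer('[1, 2]'));     // array
var_dump(isJsonContainer('123'));        // valid JSON, but a scalar
var_dump(isJsonContainer('plain text')); // not JSON at all
```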


  • 10 months later...

I'm trying to import 60k pages from a CSV file.

I installed the DataSet module but the dataset_config field seems not to be working.

By default it is set as a textarea field, and the configuration is not valid. There is also a message "YAML is not supported on your system. Try to use JSON for configuration." I installed the fieldtype-yaml module and set it for dataset_config but this is also not working.

https://modules.processwire.com/modules/fieldtype-yaml/

Any suggestions? All other modules required are installed.

The formatting of the YAML in the screenshot is wrong, I know.

[Screenshot: dataset_config field]

6 hours ago, flydev said:

@theqbap

The YAML thing is an extension of PHP which needs to be activated on your server configuration.

 

OK, thank you for the reply. Can you provide me with an example of a JSON config for the dataset_config field? Unfortunately I can't activate YAML on the server side.


On 12/25/2019 at 10:26 AM, flydev said:

@theqbap

The YAML example converted to JSON with an online tool gives us this config:


{
  "name": "Testing the import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "title": 1
  },
  "pages": {
    "template": "Data",
    "selector": "title=@title"
  }
}

 

I still get the message "DataSet config is invalid."
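
When a JSON config is rejected, it can help to check it outside the module first. The PHP sketch below decodes a config string and reports syntax errors or missing sections. It is only a debugging aid under stated assumptions: the function name checkDataSetConfig is hypothetical, the three section names are taken from the example in this thread, and it does not reproduce DataSet's own validation, so the module may reject a config for other reasons.

```php
<?php
// Hypothetical pre-check, NOT the module's validation logic.
function checkDataSetConfig(string $json): array
{
    $config = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        return ['JSON syntax error: ' . json_last_error_msg()];
    }
    if (!is_array($config)) {
        return ['Top-level value must be a JSON object.'];
    }

    $problems = [];
    // Section names as used in the example config in this thread.
    foreach (['input', 'fieldmappings', 'pages'] as $section) {
        if (!isset($config[$section]) || !is_array($config[$section])) {
            $problems[] = "Missing or non-object section: $section";
        }
    }
    return $problems;
}

$good = '{"name": "Testing the import", "input": {"type": "csv"},'
      . ' "fieldmappings": {"title": 1},'
      . ' "pages": {"template": "Data", "selector": "title=@title"}}';
$bad = '{"name": "Testing the import", "input": {"type": "csv"},}';

print_r(checkDataSetConfig($good)); // no problems found
print_r(checkDataSetConfig($bad));  // trailing comma is a syntax error
```

Typical causes of a syntax error are trailing commas, single quotes or typographic quotes pasted from a web page.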


One more question regarding importing data. 

When a row in a CSV file would result in a page with the same 'title' as one that already exists, is there an option to make the title unique and import a new page with the same name?

CSV example:

Title,Number
Orange,20-300
Orange,10-20
Banana,5-10


ProcessWire pages:

+Import Folder
-Orange
-Orange
-Banana
 


By default DataSet will create a new PW page each time it imports a row. In the above example, two pages will be created with title "Orange" and one with "Banana".

There is no option to change the title for the new page (2nd Orange) if it matches an already existing one (1st Orange).
You can, however, combine several fields in the title making it unique. E.g. you can create the title like this (column #0 always contains the row's serial number):

title: [1, ' (', 0, ')']

The result will be:

Orange (1)
Orange (2)
Banana (3)
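
The composition rule can be sketched in PHP as follows. This is an illustration rather than the module's code: the function name composeField is made up, and it assumes that integers in the list are column IDs (with 0 being the row's serial number) while strings are literal text.

```php
<?php
// Hypothetical helper mirroring the rule above: integers are column
// IDs (0 = the row's serial number), strings are literal text.
function composeField(array $parts, array $columns, int $rowNumber): string
{
    $value = '';
    foreach ($parts as $part) {
        if (is_int($part)) {
            $value .= ($part === 0)
                ? (string) $rowNumber
                : ($columns[$part - 1] ?? ''); // column IDs start at 1
        } else {
            $value .= $part;
        }
    }
    return $value;
}

$rows = [['Orange', '20-300'], ['Orange', '10-20'], ['Banana', '5-10']];
foreach ($rows as $i => $columns) {
    echo composeField([1, ' (', 0, ')'], $columns, $i + 1), PHP_EOL;
}
```

Run against the three CSV rows from the question, this prints the same three titles as listed above.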

You can also update (overwrite or merge) already existing pages. In the "pages" section of DS config you can specify a selector and add the overwrite or merge option.
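
To show what a selector such as title=@title might expand to for a given row, here is a small PHP sketch that fills @field placeholders with the values being imported. The function name fillSelector and the quoting rule are assumptions; inside ProcessWire itself, $sanitizer->selectorValue() is the usual way to make a value safe for a selector.

```php
<?php
// Hypothetical illustration of filling "@field" placeholders.
function fillSelector(string $selector, array $fields): string
{
    return preg_replace_callback(
        '/@(\w+)/',
        function (array $m) use ($fields): string {
            $value = (string) ($fields[$m[1]] ?? '');
            // Quote the value so commas do not split the selector,
            // and drop embedded double quotes.
            return '"' . str_replace('"', '', $value) . '"';
        },
        $selector
    );
}

echo fillSelector('title=@title', ['title' => 'Orange']), PHP_EOL;
echo fillSelector('model_id=@model_id', ['model_id' => '42']), PHP_EOL;
```

A page matching the resulting selector would then be the candidate for the overwrite or merge option instead of a new page being created.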

See the wiki for more details. (Which needs to be updated but it is probably still helpful 🙂 )


OK. It was time to update the wiki 🙂

I've uploaded a new DataSet version (0.9.5) to GitHub. It contains many improvements for data type conversions, page reference handling and several bug fixes.
It also has a new profiler to optimize the import routines.

Tasker is also updated.


  • 2 weeks later...

Thanks for developing this module, my tests so far have been really positive. I'm developing a PW site that requires import and regular update of 100k+ pages and this will be invaluable.

One question I have (if you have time) is around Page References. I'm unable to modify my source data, so I have created a Page Reference field and a page that matches the source data, e.g. 'LED', which has ID 1137.

My CSV has this 'LED' data, however when I import, I get this result:

Processing data for field 'category'.
Page selector @ field category: templates_id=50, has_parent=1110.
Found referenced page 'First Category Item' for field 'category' using the selector 'templates_id=50, has_parent=1110'.
Setting field 'category' = '1111'.

Page ID 1111 (or First Category Item) is the first Category page. I've also tried setting the category to 1137 within the CSV file and get the same result.

This is when using the below config:

{
  "name": "Import",
  "input": {
    "type": "csv",
    "delimiter": ",",
    "header": 1,
    "limit": 10
  },
  "fieldmappings": {
    "model_id": 1,
    "title": 2,
    "category": 3
  },
  "pages": {
    "template": "model",
    "selector": "model_id=@model_id"
  }
}

The other two (text fields) work fine. Any advice would be appreciated!

Edit: I've now found your reference to Page References in the Wiki that changes everything! The default for Page References is Title as you say. I've installed Autocomplete and it's working great. One task for later is figuring out the scheduling side of things. I did wonder, with Page References is it possible for pages to be automatically created if they don't already exist on import?


18 hours ago, DonPachi said:

Edit: I've now found your reference to Page References in the Wiki that changes everything! The default for Page References is Title as you say. I've installed Autocomplete and it's working great. One task for later is figuring out the scheduling side of things. I did wonder, with Page References is it possible for pages to be automatically created if they don't already exist on import?

I use page references heavily in my projects. Page Autocomplete has a field (Settings specific to ...) on the Input tab of the field settings page that can be used to specify what fields are used during the query. You can even select multiple fields, e.g. a category_ref_by_id field can specify multiple ID fields. This way you can merge individual data sets into a single one. Each source set can have its own ID, and the ...ref_by_id field can use all of them.

I have no plans for the automatic creation of the missing referenced page but it can be achieved very easily. Just create another DataSet using the same CSV file and import the appropriate "category" columns for creating the missing pages. You can also try to use the location attribute in the DataSet config to make a reference to the file uploaded to the original DataSet (see the wiki) to avoid duplicate uploads.

If you need to perform these imports automatically you can create two tasks (category import and the original one) and specify a dependency between them (first import categories then the full data set). See Tasker wiki.


  • 3 months later...

It's been a while, but I just wanted to follow up with you on a project that's now in its final stages and say dataset and tasker are really exceptional, powerful modules, and definitely up there as my favourites for ProcessWire.

You really covered the edge cases with being able to set task dependencies, merges, overwrites, etc., and while it took some time to get my head around it, I now have a system that calls multiple tasks every hour via cron for fresh data from a specific set of CSV files.

Looking forward to hopefully working on another project that uses dataset/tasker!


Thanks for the feedback! I'm glad to hear that they are useful 🙂 although a bit complex to use.

Tasker has a few small improvements, I think I pushed the latest version to the GitHub repo.
DataSet changed a bit more, and some modified parts still need review and testing. Thanks for reminding me to finish them.

My DataSet project is still running. We have 150k+ (mostly complex) data pages interconnected with many references, and we are starting to hit the wall with MySQL during imports and complex page reference lookups.
