Jump to content

Best way of organising imported data


Tony Carnell
 Share

Recommended Posts

Hi all,

I'm a newbie to PW but can definitely see the power and potential that it holds, but I have a question that I'd like your guy's opinions on to help steer me down the best path.

I am working on a project to make a searchable catalogue of a certain type of dam, each one having in excess of 20 criteria that describes them, including name, owner, country, dimensions, construction dates, construction methods and statistics.

I've imported the dam database into PW and it's created a Page called 'dams' with 497 children for all the dams (forgive me if I'm not using the correct terminology, as I say I am a newbie!). It's also created fields for all the different criteria.

What I'm wondering is if this is the best way forward?

When I take a look at the backend of the Skyscraper Demo site (which has the same kind of functionality that I'm hoping to employ) I see that there's separate pages for Architects and Cities.

Is there a benefit in storing the information like this?

Does it make the search functionality on the front-end more powerful or easier to create?

Would it be best to have a separate page for Dam Owners and Countries for instance, or is having them simply as fields within the dam page enough to allow me to work with them easily?

Thanks in advance for any help or advice you have to offer, and if there's any more information you need to know, just ask.

Tony.

Link to comment
Share on other sites

  • 2 weeks later...

It's been a while since I started this thread, and thanks again to LostKobrakai for pointing me to that excellent post by Kongondo - very thorough!

Having read it I confess to still being confused though :(

The site I'm developing seems to fit into the 'Simple Multiple Categories' model so I thought I'd create pages for Country, Owner, Type & Purpose, since these fields repeat throughout the dam data (for instance there are 189 dams built in China, so currently China repeats 189 times).

I created a new page called 'countries' of type 'page', but I'm confused how to get the country child pages created below it. I've also created a template called 'countries' and added the field 'dam_country' to it. Is this the right thing to do? I assume not because when looking at the page tree in the admin area there's no child pages listed.

Also, at the moment all the dams exist as child pages of the Dam page, so if I split off the above 4 criteria on to their own pages, how will the dam's reference these new pages?

I'm not the person who imported the data originally you see, so it's all a bit of a black art to me! :)

All help & advice is gratefully received.

Tony.

  • Like 1
Link to comment
Share on other sites

You've two options to create relationships in processwire. Either using the parent/child one or using pagefields. As you've rightly pointed out the Simple Multiple Categories way kongondo described will most likely be the right one, kinda like this. 

Dams
 - Dam 1
 - Dam 2
 - …
Country
 - China
 - Japan
 - …
Owner
 - Some One
 - Another One
Type
 - Type 1
 - Type 2
Purpose
 - 1
 - 2

The dam pages will have pagefields, where you can categorize each dam to their country, owner type and purpose. All the other pages will most likely only hold raw information, e.g. first and lastname of the owner as separate fields. 

How to create those pages will depend on how you want to import them, but the best strategy will most likely be, that you create all the category pages at first and in the end import the dams, so you can already fill all the pagefields with the corresponding links to the categories.

  • Like 2
Link to comment
Share on other sites

Hi Tony,

have only a bit time now, but think I definetly can explain how to get this part with the page-fieldtype sorted. :)

Best thing would be to start with a simple step by step solution:

  1. in admin, create a template without a file, called country and let it have only the title field. Save!
  2. create a second template, called countries (plural / multiples of country), only assign the title filed to it and: Save!
  3. go into your page tree and create a single page named countries, with the template countries,
  4. now add children to this page, all of them should have the template country,

ok, now go into fields and create a new field of type "page", name it countries.

  • In the Input tab of it, go to Selectable Pages, use Parent of selectable page(s) and click the change button
  • you get the admin page tree now and have to go to your previously created countries (parent) page. Click select to select it!
  • additionally to that, go to Template of selectable pages(s) and use the dropdown to select country, (the single page template)
  • Now you can go to Input field type and select one that fits best your needs. For this example, please select the AsmSelect*  :)

ok, ready to save the field!

Now go to a template where you want to be able to select countries or one country, - add the new created field "countries" to it and save.

Open a page with that template or create a new one that bases of the template that now has the page field countries included.

  • Go to the field and play around with it! You got it now? Go ahead and edit your countries field:
  • under the Details tab select: Single page or empty page ... (this will restrict the field to let the user select only one child out of all available)
  • if you don't like the AsmSelect, go to Input tab and select another one, best would be Select now?
  • save it and test your changes.

Come back here if you don't get it i to work. :)

  • Like 5
Link to comment
Share on other sites

Wow, thanks so much for your help and advice LostKobrakai & Horst.

I continue to be amazed with how helpful the PW community is! :)

I followed your extremely helpful guide Horst and have successfully added a page for Dam Types (I chose to add this one and not the countries because there's only 4 child pages to add!) and all went well.

I selected one of the dams to edit and noticed the new dropdown list for Dam Type, but the field is currently empty.

Am I right in thinking that I will have to add all the extra pages/child pages that I need together with the associated fields and then re-import the data again to populate them correctly?

Thanks again for all you help.

Tony.

Link to comment
Share on other sites

I don't know exactly how you need your data to be organized, and how you import it, but I think best way would be if it is possible to import it in several steps. Maybe, that's exactly what you are said / asked me, - You need to do the import completly new. Here is the way hoa I would do it:

First import or create all categories, create a parent page, e.g. like with the countries example, create the countries template and page.

Then create the child template and import all items as children of the parent page.

Repeat this for all other categories! (name, owner, country, dimensions, construction dates, construction methods, etc)

If your categories are ready, you need to create the fields and templates (incl. template file(s)) for your main data pages. Also this will have one or more Page-Fields for the categories.

When importing a record of the main data, you need to assign a category by a page field.

If you can provide some informations about the data that you need to import, we can provide further help / assistance with an importer script.

Is it in CSV file? Is it in a mySQL-DB or is it a mysql.dump? Can you provide at least one record with example data and all the fieldnames?

Maybe it is also possible to import all at once, but therefore we need some more informations.

Horst

  • Like 1
Link to comment
Share on other sites

Thanks again for your continuing help Horst.

I've attached a dump of the first 3 records in the existing MySQL database (saved as a text file because the forum sensibly wouldn't allow me to upload a .sql file).

This is the database as it exists at the moment. The PW version that was imported had been tweaked very slightly, but I haven't got access to that raw data at the moment.

The Dam Type, Owner, Country and Purpose are the only ones I was going to create as page types because realistically they are the fields with repeating content.

By the way, the letter codes for Purpose are:

  • F - Flood control
  • I - Irrigation
  • H - Hydropower
  • W - Water Supply
  • R - Recreation

Thanks again for your help - it's much appreciated.

Tony.

rccdams_db.txt

Link to comment
Share on other sites

Hi Tony,

I have attached a site-profile as example. It has some data in it that was populated with the following importer script. In the site profile I have created the dam_type and the dam_purpose categories and page fields. The dam_purpose was created as Multiple categories and there will be no items added automaticaly during import. The dam_type will be added automaticaly during install. there is only the templates and the parent created manually.

Inspect the profile and look through the importer script, it has all needed comments. After playing around with it, I'm sure you can adapt it to suite your needs. :)

site-rccdams-starter.zip

save this besides your PW index.php and call it after you have logged in as superuser:

<?php

$GLOBALS['templateName'] = 'rccdams';     // the name of the template you use
$GLOBALS['oldrccid']     = 'rccid';       // the fieldname in PW which stores the id from the source DB
$GLOBALS['parentPath']   = '/rcc_dams/';  // add the path you want the pages stored, e.g. "/rccdams/" or that like, but check that you already have created this (parent) page in PW!!

// the templates and parent pages for the categories needs to be created manually before the import action is used!!
// also don't forgett to assign them to the rccdams template !!
$categoryPages = array(
    'dam_type'    => array( 'nameOfParentTemplate' => 'dam_type_category_parent',
                            'nameOfChildTemplate' => 'dam_type_category'
                         ),
    'dam_owner'   => array( 'nameOfParentTemplate' => '',
                            'nameOfChildTemplate' => ''
                         ),
    'dam_country' => array( 'nameOfParentTemplate' => '',
                            'nameOfChildTemplate' => ''
                         ),
    'dam_purpose' => array( 'nameOfParentTemplate' => 'dam_purpose_category_parent',
                            'nameOfChildTemplate' => 'dam_purpose_category'
                         )
);

#
# ATTENTION: SELECT AN ACTION HERE !!
#
# select an action you want to proceed by comment uncomment the following two lines
#

$ACTION = 'deleteImportedPages';

#$ACTION = 'createFields';

#$ACTION = 'importData';



// provide the credentials to connect to the Source-DB, including DB name and Table name
$MYSQL = array();
$MYSQL['host']         = '127.0.0.1';
$MYSQL['user']         = '';
$MYSQL['pass']         = '';
$MYSQL['dbname']       = 'rccdams_db';
$MYSQL['tablename']    = 'rccdams_db';


// a list with all the fields and the their types, you want create / import into PW
$rccdams_fields = array(
    'dam_type'          => 'Page',
    'dam_country'       => 'Page',
    'dam_owner'         => 'Page',
    'dam_purpose'       => 'Page',

    'rccid'             => 'Integer',
    'dam_name'          => 'Text',
    'river'             => 'Text',
    'capacity'          => 'Text',

    'cd_proj_start'     => 'Datetime',
    'cd_proj_start_d'   => 'Integer',
    'cd_proj_start_m'   => 'Integer',
    'cd_proj_start_y'   => 'Integer',
    'cd_rcc_start'      => 'Datetime',
    'cd_rcc_start_d'    => 'Integer',
    'cd_rcc_start_m'    => 'Integer',
    'cd_rcc_start_y'    => 'Integer',
    'cd_proj_fin'       => 'Datetime',
    'cd_proj_fin_d'     => 'Integer',
    'cd_proj_fin_m'     => 'Integer',
    'cd_proj_fin_y'     => 'Integer',
    'cd_rcc_fin'        => 'Datetime',
    'cd_rcc_fin_d'      => 'Integer',
    'cd_rcc_fin_m'      => 'Integer',
    'cd_rcc_fin_y'      => 'Integer'
);
// I'm not sure if all the _d, _m, _y fields are needed. Maybe you only need the 4 Datetime fields!
// you can adapt the list to suite your needs

############### READY with config part ###########################


$GLOBALS['categoryPages'] = $categoryPages;
$timeLimit = (60 * 5); // give it a bit more time than the default 30 seconds
$ignoreUserAbort = false;

// prepare server for plaintext output
if(function_exists('apache_setenv')) @apache_setenv('no-gzip', '1');
@ini_set('zlib.output_compression', 'Off');
@ini_set('output_buffering ', '0');
@ini_set('implicit_flush', '1');
@ob_implicit_flush(true);
@ob_end_flush();
if(isset($_SERVER['HTTP_HOST'])) {
    header('Content-Type: text/plain');
}

// Bootstrap ProcessWire
require_once('./index.php');

// security check if the current user is a SuperUser, comment the next line out, if you run this script from CLI
if (!wire('user')->isSuperuser()) die('ACCESS DENIED!');

ignore_user_abort($ignoreUserAbort);
set_time_limit(intval($timeLimit));

// Assign API variables to make things a little easier
$fields = wire("fields");
$templates = wire("templates");
$modules = wire("modules");
$sanitizer = wire("sanitizer");
$pages = wire("pages");


echo "\n$ACTION\n";


if ('deleteImportedPages' == $ACTION) {
    echo "\n";  // we drop all imported pages
    foreach($pages->find("template={$GLOBALS['templateName']},include=all") as $p) {
        echo " - now drop {$p->title}\n";
        $p->delete();
    }
    echo "\nREADY!\n";
    exit();
}


if ('createFields' == $ACTION) {

    echo "\n";  // (1) first get the rccdam template
    $t = $templates->$GLOBALS['templateName'];
    if (!$t) {
        $fg = new Fieldgroup();
        $fg->add("title");
        $fg->save();
        $t = new Template();
        $t->name = $GLOBALS['templateName'];
        $t->fieldgroup = $fg;
        $t->save();
    }

    // (2) loop through the field list and create those that are missing
    foreach($rccdams_fields as $name => $type) {
        echo " - $name :: $type\n";
        if ('page' == strtolower($type)) continue; // field of type Page, this need to be created manually

        $f = $fields->$name;
        if (!$fields->$name) {
            // create a new field
            $f = new Field();
            $f->type = $modules->get("Fieldtype" . $type);
            $f->name = $name;
            $f->save();
            echo "   successfully created\n";
        }
        // add it to the rccdam template
        $t->fieldgroup->add($f);
        $t->fieldgroup->save();
        $t->save();
    }

    echo "\nREADY!\n";
    exit();
}



if ('importData' == $ACTION) {

    echo "\n";  // (1) connect to Source-DB
    $MY_sql = mysql_connect($MYSQL['host'], $MYSQL['user'], $MYSQL['pass']);
    if($MY_sql) {
        echo "- connected with MySQL Server\n";
        $MY_sql_db = mysql_select_db($MYSQL['dbname'], $MY_sql);
        if($MY_sql_db) {
            echo "- connected with source DB\n";
            $MY_sql_result = mysql_query("SELECT * FROM {$MYSQL['tablename']}", $MY_sql);
            if (is_resource($MY_sql_result)) {
                echo "- found the Table\n\n";
                $t = $templates->$GLOBALS['templateName'];
                while($row = mysql_fetch_array($MY_sql_result, MYSQL_ASSOC)) {

                    // (2) process all records of source table
                    echo " -- fetched row with id {$row['id']}\n";
                    if (!isValidImportRecord($row)) continue;  // this one is already present in PW, so skip further processing

                    // we have data for a new page
                    echo "    this one needs to be imported\n";
                    $p = new Page();     // create a new page and assign the minimum required params: template, parent and title!
                    $p->template = $t;
                    $p->parent = $pages->get($GLOBALS['parentPath']);
                    $p->$GLOBALS['oldrccid'] = $row['id'];
                    $p->title = $row['dam_name'];

                    // now loop through all source data fields
                    foreach($row as $fieldname => $data) {

                        if (!isset($rccdams_fields[$fieldname])) continue; // skip source data fields that are not defined in the $rccdams_fields array!!
                        if (!$t->fieldgroup->$fieldname) continue; // skip source data when there is no field in our template

                        // check which type we need in PW for this
                        if ('Page' != $rccdams_fields[$fieldname]) {

                            // we can simply add the source data into the target field, but pass it once through the sanitizer or typecast it to integer
                            $data = 'Integer' == $rccdams_fields[$fieldname] ? (int)$data : $sanitizer->text($data);
                            $p->$fieldname =  $data;

                        } else {

                            // we have a Pagefield, we need to act accordingly,
                            // here in the starter script I only work with dam_type and dam_purpose
                            // the other needs to be added by you, 
                            switch($fieldname) {
                                case 'dam_type':
                                    // this must be a single category Page !! in Details Tab it must set to SIngle Page or empty NullPage !!
                                    // we want create child pages for this category automatically, therefore
                                    // we need to check if there is already a child page, fetch this or create a new one:
                                    $data = $sanitizer->text($data);
                                    $cp = getCategoryPage('dam_type', $data);
                                    $p->$fieldname = $cp;
                                    break;
                                case 'dam_owner':
                                    break;
                                case 'dam_country':
                                    break;
                                case 'dam_purpose':
                                    // this is a Multi-Items-Category and it has shortcuts in the source DB
                                    $data = strtolower(trim($sanitizer->text($data)));
                                    $cpa = getCategoryPagesPurpose($data);
                                    $p->$fieldname = $cpa;
                                    break;
                            }
                        }
                    }
                    $p->save();
                    echo "    saved into DB!\n\n";
                }
                mysql_free_result($MY_sql_result);
            }
        }
    }
    mysql_close($MY_sql);
    echo "\nREADY!\n";

    exit();
}

function isValidImportRecord($row) {
    // we check for a page with the rccdams template and the original id, (what we have stored under rccid with every imported page)
    $page = wire('pages')->get("template={$GLOBALS['templateName']},{$GLOBALS['oldrccid']}={$row['id']}");
    // if a $page with ID greater than zero is found, we already have this page imported and want to skip it here, therefor we retun false!
    return $page->id > 0 ? false : true;
}

function getCategoryPage($name, $data) {
    $nameOfParentTemplate = $GLOBALS['categoryPages'][$name]['nameOfParentTemplate'];
    $nameOfChildTemplate = $GLOBALS['categoryPages'][$name]['nameOfChildTemplate'];
    // if there is no data in the source record, we return a NullPage Object
    if (empty($data)) {
        $cp = new NullPage();
    } else {
        // if we have this page already in PW, we return it
        $cp = wire('pages')->get("template={$nameOfChildTemplate},name=" . wire('sanitizer')->pageName($data));
        if (0 == $cp->id) {
            // if it ins't already there, we create it
            $cp = new Page();
            $cp->of(false);
            $cp->template  = $nameOfChildTemplate;
            $cp->parent    = wire('pages')->get("template={$nameOfParentTemplate}");
            $cp->title     = $data;
            $cp->save();
        }
    }
    return $cp;
}

function getCategoryPagesPurpose($data) {
    $nameOfChildTemplate = $GLOBALS['categoryPages']['dam_purpose']['nameOfChildTemplate'];
    $cpa = new PageArray();  // create an empty PageArray
    if (empty($data)) return $cpa; // if we have no source data, return empty PageArray
    for($i = 0; $i < strlen($data); $i++) {  // add category pages to the array
        $cp = wire('pages')->get("template={$nameOfChildTemplate},name={$data[$i]}");
        $cpa->add($cp);
    }
    return $cpa;
}

EDIT: I added the $ACTION "deleteImportedPages" to the script, and a few more comments. Please use this from above and not this that ships in the ZIP archive. :)

  • Like 8
Link to comment
Share on other sites

Hi Horst, thanks again for the script you gave us.
 
We've made a few subtle tweaks to make it fit the refined data we're working with, but before we run the script I thought I'd check with you that everything's right! :)
 
The site admin area as it stands at the moment is as follows:

  • Templates (without associated files and with just the default Title field) have been created for the parent & child pages; dams, dam, countries, country, owners, owner, types, type, purposes and purpose.
  • In the page tree we've created a page called Dams and allocated it the template dams, Countries with the template countries, Owners with the owners template, Types with types and Purposes with purposes. Can I confirm that this is right, because at the top of the script you have written a comment stating 'The templates and parent pages for the categories need to be created manually before the import action is used. Also don't forget to assign them to the rccdams template [this template is called dams in our site]', which has confused me. If we have to assign the dams template to all of the parent category pages, why do we have to create templates for those parent pages?
  • Fields have been created of type 'page' with the names dams, countries, owners, types and purposes. In the Input tab of each field the appropriate page has been selected for Parent of selectable page(s); Dams for the dams field, Countries for the countries field, etc. Likewise for Template of selectable pages(s), Dams is given dam, Countries given country, etc. Finally for the Input field typeAsmSelect* is chosen.

I've attached a Zip archive containing our refined version of your script, a CSV file with the data for import, a version of that CSV file in Excel format to make it more readable, and a PDF document with a couple of questions/anomalies we've come across in the data.
 
I hate to impose further on your good nature, but you're the data import king, so you're the best person to check with! :)
 
As ever, many thanks for your continuing help with this import Horst, it's much appreciated.
 
Tony.

RCC-Dams.zip

Link to comment
Share on other sites

All you have changed in the script looks good, but ... ;-)

... the only thing what doesn't seem right is your main template dams (what I have named rccdams). This must be used for the combining (normal) dam pages. It cannot be used for a Inputfield of type "Page". You need to have that main template dam (or even dams and dam, if you give the parent page of all your dams an own template too.) The main template you need to refer to in the script is the dam template. This is used for all your dam objects. It will hold all other fields in the edit screen for the maintainers of the data. All other fields mean: all InputfieldPage, all InputFieldText, all inputFieldInteger, all InputFieldDatetime that contains data belonging to a (single) dam.

Templates and fields of type InputFieldPage for (type, owner, country, purpose) needs to be created manually, before running the import part.

(This looks good in your script config)

All fields other than that of type InputfieldPage will be created by the createFields Action of the importer script. You need to run this action first.

(looks also good in your script config)

All other things, e.g. if you want to use more / other fields of type InputfiledPage depends on you. You should look and play a bit with the starter profile I have provided earlier. It shows some purposes as categories and how to use them with a multiple categories InputfieldPage that can be added to single dams. Also there is a single category InputfieldPage to study and play with.

In my example, I have used the FIW exmple you provided in an above post. So, if you now do not have this field, you also can specify some source-fieldnames that you don't want to use as single inputfields but first gather their data, and put it together in a way that it fits to the purpose filed I have build, or you change the purpose field to match other criterias. My purposes categories do not only have a title field but also a subtitle field. You can see all this in the site profile. :) It's easy!

And, if you have ran the import and the resulting data isn't what you expected, you once can run the deleteImportedData action to clear it out, change the config or change fetching and preparing the data of that particular part(s). And finally run the importer again.

Good luck. :)

  • Like 2
Link to comment
Share on other sites

I've been working with the script today and I've got the data importing and populating the dam pages (651 of them), but I'm still encountering a couple of issues.

  • Some of the dam names have non-latin characters. When the data imports, these non-latin characters get clipped off, together with those characters after it. For example, Akçakoca becomes Ak, Belén-Cagüela becomes Bel. I've checked that the import CSV file is utf-8 encoded, which it is. I also added a utf-8 declaration to the script: 
    if(isset ( $_SERVER ['HTTP_HOST'] )){
    	header ( 'Content-Type: text/plain; charset=utf-8' );
    }
    

    but to no avail.

  • The Country, Owner, Type and Purpose pages aren't being populated. I ran the script to create the other fields, then manually added the fields for the pages. I then added the fields country, owner, type and purpose to the dam template, but when I run the script to import the data the pages aren't created in the page tree and the select fields aren't populated in the Dam pages. Obviously there's something I'm failing to do, but I don't know what it is. :)

I've attached the refined script file I've been using. It's not very different, but I did add an ORDER BY to add the dams in alphabetical order by name.

Yet again, a big thanks for your help Horst.

Tony.

horst-script.php

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...