ryan

How do I import lots of data into pages?

Recommended Posts

How did you process all the data you put in the demo ? I mean, you didn't manually create and fill a page for each city/architect/skycraper ?

I exported a CSV file from freebase.com that contained all the skyscraper fields I wanted to use. Then I created a simple shell script to import it from the CSV. Note that I only used a shell script for convenience, you could just as easily do this from a ProcessWire template file if you preferred it or needed to do this from Windows, etc.

Below is a simplified example of how to do this. The example is fictional and doesn't line up with the actual structure of the skyscrapers site, nor does it attempt to create page relations or import images. If you are interested in how to do that, let me know and I'll keep expanding on the example in this thread. But I wanted to keep it fairly simple to start. First, here is the contents of a CSV file with each line having a skyscraper building name, city, and height.

/skyscrapers/skyscrapers.csv (Building, City, Height):

Sears Tower, Chicago, 1400
John Hancock Tower, Chicago, 1210
Empire State Building, New York City, 1100
IBM Building, Atlanta, 860
Westin Peachtree, Atlanta, 790

Next, create a new template in ProcessWire and call it "skyscraper". Create a text field for "city", and an integer field for "height" and add them to the skyscraper template.

Create a page called "/skyscrapers/" in ProcessWire, that will serve as the parent page for the skyscrapers we'll be adding.

Here is the command-line script to load the ProcessWire API, read the CSV data, and create the pages. As I mentioned above, this could just as easily be done from a template, where the only difference would be that you wouldn't need the shebang (#!/usr/local/bin/php -q) at the beginning, nor would you need to include ProcessWire's index.php file.

/skyscrapers/import_skyscrapers.sh:

#!/usr/local/bin/php -q
<?php

// include ProcessWire's index file for API access
// (this isn't necessary if you are doing this from a template file)
include("./index.php"); 

$fp = fopen("./skyscrapers.csv", "r");
$template = wire('templates')->get("skyscraper");
$parent = wire('pages')->get("/skyscrapers/");

while(($data = fgetcsv($fp)) !== FALSE) {

// create the page and set template and parent
$skyscraper = new Page();
$skyscraper->template = $template; 
$skyscraper->parent = $parent; 

// set the skyscraper fields from the CSV	
list($building, $city, $height) = $data; 
$skyscraper->title = $building; 
$skyscraper->city = $city; 
$skyscraper->height = $height; 

// set the URL name, i.e. Sears Tower becomes "sears-tower" automatically
$skyscraper->name = $building; 

// save the skyscraper
$skyscraper->save();

echo "Created skyscraper: {$skyscraper->url}\n";
}

To do the import, make the script executable and then run it from the command line:

chmod +x .import_skyscrapers.sh
./import_skyscrapers.sh

OR, if you are doing this from a template file, then load the page (that is using this template) in your web browser, and that will execute it.

The output should be:

Created skyscraper: /skyscrapers/sears-tower/ 
Created skyscraper: /skyscrapers/john-hancock-tower/
Created skyscraper: /skyscrapers/empire-state-building/
Created skyscraper: /skyscrapers/ibm-building/
Created skyscraper: /skyscrapers/westin-peachtree/

If you go into the ProcessWire admin, you should see your skyscrapers.

I used an example of CSV file for simplicity, but the same method applies regardless of where you are pulling the data from (web service feeds, etc). For the actual skyscrapers demo site, I used Freebase's web services feed to pull the data and images, etc.

  • Like 8

Share this post


Link to post
Share on other sites

Cristal clear, thanks.

How do you manage duplicates (title) from code (in PW administration one receives an error if an existing item already exists under the same parent) ?

Share this post


Link to post
Share on other sites

How do you manage duplicates (title) from code (in PW administration one receives an error if an existing item already exists under the same parent) ?

The URL name ($page->name) has to be unique for all sibling pages (i.e. pages having the same parent). I don't usually bother with checking unless ProcessWire throws an error about it during the import. If it looks like there are going to be duplicates, then here's how you'd ensure uniqueness by adding a number to the end of the URL name and keep incrementing it until it's unique:

Replace this:

$skyscraper->name = $building;
$skyscraper->save(); 

With this (overly verbose for explanation purposes):

<?php // just to turn on the forum syntax highlighting

// Converts "Sears Tower" to "sears-tower" when setting the name
$skyscraper->name = $building; 

// Retrieve the URL name, which should be "sears-tower"
$name = $skyscraper->name; 

// a counter incremented each time a duplicate is found
$n = 0; 

// find the first non-duplicate name
while(count($parent->children("name=$name")) > 0) {
    $n++; 
    $name = $skyscraper->name . $n; // i.e. sears-tower1, sears-tower2, etc.
}

// set the page's name to be one we know is unique, i.e. sears-tower1
$skyscraper->name = $name;

// now when ProcessWire checks uniqueness before saving, it won't throw an error. 
$skyscraper->save(); 
  • Like 1

Share this post


Link to post
Share on other sites

Ah ok, that's about what I would have done manually but I thought there could have been some automatic checking/renaming in the "save" function, as the url is already automatically formated after the name (and I guess there is a process of removing forbidden characters from the url ?). But that's fine to me, just something to be aware of.

Share this post


Link to post
Share on other sites

Right now it doesn't have anything that does this automatically. I left it out because I didn't want to assume what action you would want to take with duplicate page names, or what format you would want to use in generating a unique name. But perhaps I should add it as an optional module.

Share this post


Link to post
Share on other sites

In addition to the code above.. probably will be good if we assign a incremented index to the skyscraper->sort property before save(), otherwise all of the pages will be saved with sort = 0 in the "pages" table and later they are unsorted in the tree list.

$sortIndex = 0;
...

$skyscraper->sort = ++$sortIndex;
$skyscraper->save();

Share this post


Link to post
Share on other sites

That's a good addition. You can also set the sortfield for your parent page, so that you don't have to keep track of manual sorting. In the case of skyscrapers, the pages I import to are always sorted by 'name', so setting the 'sort' isn't necessary, but it may be depending on your need.

You can set the 'sortfield' from the admin on the 'children' tab of the parent page (which is what I did when importing skyscrapers), but you can also set it from the API like this:

<?php
$parent = $pages->get("/skyscrapers/"); 
$parent->sortfield = 'name'; 
$parent->save();

After you do that, all the child pages will come out sorted by name when you retrieve them with a $parent->children() call. They will be sorted that way in the admin too. You can reverse the sort just by prepending a minus sign,"-" to the beginning of the field name, i.e.

<?php
$parent->sortfield = "-created";

The child pages would then come out sorted by date created in reverse-chronological order (i.e. newest to oldest).

You can specify any field for 'sortfield', as long as it's a native-to-PW field, or it's one of your own fields with the 'autojoin' option set in the field settings. For example, the 'title', date fields, and other text fields are common to sort by. You can turn on 'autojoin' just by editing the field in Setup > Fields, and checking it's 'autojoin' box.

Share this post


Link to post
Share on other sites

First... thank you, it was useful advice, as all your. I am still impressed how flexible is PW and wandering how may hidden features i am not using.

And forgive me.. i should include in my previous post that my case was to initially sort the imported child pages by name and later the clients to use the drag-&-drop for sorting them. Less common scenario.

Share this post


Link to post
Share on other sites

Hi guys!

How can I rewrite the data in the field in this way?

My site has a field "body" with data. I want to specify in file(csv) page name and new data of the field "body".

How do I delete the old value of the field "body" and write a new data?

Share this post


Link to post
Share on other sites

Hi guys!

How can I rewrite the data in the field in this way?

My site has a field "body" with data. I want to specify in file(csv) page name and new data of the field "body".

How do I delete the old value of the field "body" and write a new data?

You are referring to the Import Pages From CSV module? Check out this screenshot , "what to do with duplicate page names"

I haven't used this option myself but i guess if you check the 'modify existing page' it will update the field body.

  • Like 2

Share this post


Link to post
Share on other sites

Thanks for the answer.

I forgot that Processwire means — easy! :)

Fields are really rewritten at an import!

And this is great news for me. Thank you Ryan again!

  • Like 1

Share this post


Link to post
Share on other sites

Hi,  

I have a template with three fields: 

- title

- category (page FieldType)

- items (repeater FieldType)

The items repeater has two fields:

- brand (page FieldType)

- description (text field)

Then i have a csv file (Product Name, Category, Brand,  Description):

- - - - - - - - - 
Product one, /category/xxx/, /brands/zz1/, "This is the brand one"
Product two, /category/yyy/, /brands/zz2/, "This is the brand two"
- - - - - - - - - 
 
And finally my import script goes like:
list($title, $categoryName, $brandName, $description) = $data; 
$productPage->title = $title;
$productPage->category = $categoryName;
$repeater = $productPage->items->getNew();
$repeater->brand = $brandName; 
$repeater->description = $description;

This works like a charm. The problem is that only allow me to import one item on the repeater field.

How to import multiple items to the repeater? I want my final imported pages to have multiple brands with customs descriptions.

Product one

Category: xxx

- Brands

-- zz1 | This is the brand one

-- zz2 | This is another cool brand

-- zz3 | This is the best brand

Product two

Category: yyy

- Brands

-- zz1 | This brand is cheap

-- zz2 | This one is expensive

-- zz3 | Another brand

Hope somebody can help me. Thanks.

Share this post


Link to post
Share on other sites

@Sanyaissues  You have to loop through the lines of your csv and for each line check whether the page for your current product has already been created. If so, just add a new brand item to its repeater and save it. If the page hasn't been created yet, add the page, populate its fields and first brand item and save it. That's it.

  • Like 4

Share this post


Link to post
Share on other sites

Thanks @slkwrm . I really have a script who loops through the lines of the csv and creates and save new pages. The problem is that i'm only able to add one brand per product (every product is a page with a repeater which contains a brand and a custom brand description).

The question is: how to import multiple brands for every single product (the brand field is a repeater). I want to import something like this:

Product Name, Category, Brand1, Brand1Description, Brand2, Brand2Description, etc...

- - - - - - - - - 
Product one, /category/xxx/, /brands/zz1/, "This is the brand one",  /brands/zz3/, "Description 3"
Product two, /category/yyy/, /brands/zz2/, "This is the brand two",   /brands/zz4/, "Description 4",  /brands/zz5/, "Description 5"
Product three, /category/nnn/, /brands/zz7/, "Custom description 7"
- - - - - - - - - 

Share this post


Link to post
Share on other sites

Then i have a csv file (Product Name, Category, Brand,  Description):

- - - - - - - - - 
Product one, /category/xxx/, /brands/zz1/, "This is the brand one"

Product two, /category/yyy/, /brands/zz2/, "This is the brand two"

.....

list($title, $categoryName, $brandName, $description) = $data; 
$productPage->title = $title;
$productPage->category = $categoryName;
$repeater = $productPage->items->getNew();
$repeater->brand = $brandName; 
$repeater->description = $description;

Hi Sanyaissues,

slkwrm already gave you correct answer, but it looks to me that there is a little misunderstandig (not sure, but could be).

I think slkwrm suggested that you have organized your csv-file something like that:

Product one, /category/xxx/, /brands/zz1/, "This is the brand one"
Product one, /category/xxx/, /brands/zz2/, "This is the brand two"
Product two, /category/yyy/, /brands/zz3/, "This is the brand three"
Product two, /category/yyy/, /brands/zz4/, "This is the brand four"
Product three, /category/zzz/, /brands/zz5/, "This is the brand five"

But seems it is bit different:

Product one, /category/xxx/, /brands/zz1/, "This is the brand one",  /brands/zz3/, "Description 3"
Product two, /category/yyy/, /brands/zz2/, "This is the brand two",   /brands/zz4/, "Description 4",  /brands/zz5/, "Description 5"
Product three, /category/nnn/, /brands/zz7/, "Custom description 7"

You have to iterate over the lines of your csv file add a new page, add title and category and when coming to the brands you can add the first repeater-item as you showed above. If you have more than one (looks like it can differ), you should save the page! and after that iterate over the brands and foreach you have to add it and to save the page, (I think)

$productPage->title = $title;
$productPage->category = $categoryName;
$repeater = $productPage->items->getNew();
$repeater->brand = $brandName; 
$repeater->description = $description;
$productPage->save();

// check / get your brands
foreach($myBrands as $brand) {
    $repeater = $productPage->items->getNew();
    $repeater->brand = $brand['name']; 
    $repeater->description = $brand['description'];
    $productPage->save();
}

Don't know if this is an elegant way, but should work (haven't tested) .

>> Please refer to the api: http://processwire.com/api/fieldtypes/repeaters/ look after "Using the API to add or remove repeater items"

  • Like 4

Share this post


Link to post
Share on other sites

More links for you...

http://processwire.com/talk/topic/3619-updating-repeaters-via-api-without-removeall/

http://processwire.com/talk/topic/3969-problem-repeaters-why-u-no-save/?p=38938

Seems you may have to save repeaters individually as well...and not just the page?  Maybe that's just for editing? But if adding a second item to a repeater, maybe that is editing? Try and see :-)

Edit:

Just to reiterate what's on the other thread and in above posts. Repeaters are themselves "hidden" pages. So, You have to iterate through the repeaters and treat them one by one as if they were normal pages.

  • Like 2

Share this post


Link to post
Share on other sites
Hi @horst, thanks for your detailed answer. You were right, @slkwrm really gave me the answer but i was too blind to understand. 
@kongondo thanks for the "hidden pages" tip and the links.
 
I made a simple loop and everything works. My first approach seems a little dirty but at least its working:
 
CVS file:
Title, Category, Brand page, Brand description, (...) Brand page, Brand description
--------
product1,/cat/one/,/brand/01/,"foo",/brand/02/,"bar"
product2,/cat/one/,/brand/01/,"foo",/brand/02/,"bar",/brand/03/,"foo",/brand/04/,"bar"
product3,/cat/two/,/brand/04/,"foo",/brand/05/,"bar",/brand/06/,"foo",
 

Script:

$c = count($data);                           //Counts the parameters per line
$productPage->title = $data[0];              //Title
$productPage->category = $data[1];           //Category

for ($i=2; $i < $c; $i++) {                  // Stars on 2 because the first parameters are Title and Category
  $repeater = $productPage->items->getNew(); // Creates a new item on the repeater
  $repeater->brand = $data[$i];              // Assigns the Brand page
  $i++;                                      // Increase i to jump to the next parameter
  $repeater->description = $data[$i];        // Assigns the Brand description
  $productPage->save();                      // Saves the page
}
  • Like 1

Share this post


Link to post
Share on other sites

I did not see this mentioned, but If you want to skip the first row of the csv which would be your headers you can simply add two simple lines.

$row = 1;
$fp = fopen("./skyscrapers.csv", "r");
$template = wire('templates')->get("skyscraper");
$parent = wire('pages')->get("/skyscrapers/");

while(($data = fgetcsv($fp)) !== FALSE) {
      if($row == 1){ $row++; continue}
  • Like 1

Share this post


Link to post
Share on other sites
On 12/30/2010 at 9:26 AM, ryan said:

 Meanwhile, decades later... I have a big database import/export of around 50 fields.  My batch child editor keeps stalling on import or export of the 350ish records, so I tried to write a simple csv export script and was successful. Now, overly confident I tried writing an import script with all my fields minus the top line and index.php reference since it's in a template.

Getting "Parse Error: syntax error, unexpected 'while' (T_WHILE) "

But I can't see why. the brackets {} are matched... data is defined... don't see any obvious missing ; ...


 

<?php

$fp = fopen("./assessmentsimport.csv", "r");
$template = wire('templates')->get("upa");
$parent = wire('pages')->get("/assessments/");

while(($data = fgetcsv($fp)) !== FALSE) {
// create the page and set template and parent
$assessment = new Page();
$assessment->template = $template;
$assessment->parent = $parent;

// set the assessment fields from the CSV
list($title, $upa_startdate, $upa_homevisitorid, $upa_pyramid, $upa_evidence, $upa_evidence_ot, $upa_organization, $upa_organization_ot, $upa_region, $upa_age, $upa_caregiverd, $upa_familycomp, $upa_substanceabuse, $upa_numchildren, $upa_primarychildage, $upa_referral, $upa_casestatus, $upa_enrollment, $upa_leaving, $upa_1pre, $upa_1post, $upa_2pre, $upa_2post, $upa_3pre, $upa_3post, $upa_4pre, $upa_4post, $upa_5pre, $upa_5post, $upa_6pre, $upa_6post, $upa_7pre, $upa_7post, $upa_8pre, $upa_8post, $upa_9pre, $upa_9post, $upa_10pre, $upa_10post, $upa_phq2, $upa_phqtypeofscreen, $upa_phqdatecompleted, $upa_phq2score, $upa_epds, $upa_epdstypeofscreen, $upa_epdsdatecompleted, $upa_epdsscore, $upa_pcd_receivingservices, $upa_pcdreferred, $upa_pcdwherereferred, $upa_updated) = $data;

$assessment->title = $title;
(and all the other fields) 


// set the URL name using the assessment title
$assessment->name = $title;

// save the UPA
$assessment->save();

echo "Created assessment: {$assessment->url}\n";
}




 

 

On 12/30/2010 at 9:26 AM, ryan said:

 


#!/usr/local/bin/php -q
<?php

// include ProcessWire's index file for API access
// (this isn't necessary if you are doing this from a template file)
include("./index.php"); 

$fp = fopen("./skyscrapers.csv", "r");
$template = wire('templates')->get("skyscraper");
$parent = wire('pages')->get("/skyscrapers/");

while(($data = fgetcsv($fp)) !== FALSE) {

// create the page and set template and parent
$skyscraper = new Page();
$skyscraper->template = $template; 
$skyscraper->parent = $parent; 

// set the skyscraper fields from the CSV	
list($building, $city, $height) = $data; 
$skyscraper->title = $building; 
$skyscraper->city = $city; 
$skyscraper->height = $height; 

// set the URL name, i.e. Sears Tower becomes "sears-tower" automatically
$skyscraper->name = $building; 

// save the skyscraper
$skyscraper->save();

echo "Created skyscraper: {$skyscraper->url}\n";
}
 

 

Share this post


Link to post
Share on other sites

@hollyvalero - you have a strange hidden character at the end of this line: $parent = wire('pages')->get("/assessments/"); that is causing the problem - just remove that and it will be fine.

I would like to know what BCE was timing out for you and your own script isn't. Usually the easiest option is to up execution and php memory and these scripts will run until completed without any problems.*

* yes I should probably build in some chunking to prevent the need to do this for large imports/exports, but I haven't gotten around to it yet 🙂

Share this post


Link to post
Share on other sites
11 minutes ago, adrian said:

Damned if there wasn't ... I removed it and now I'm getting 

Parse Error: syntax error, unexpected '$assessment' (T_VARIABLE)

2 lines down, so maybe there are some other invisibles.  I'll try scanning it with fresh eyes tomorrow...  Batch child editor... the data is fairly complex with dropdown menus, etc. and there are about 50 fields in it...  I think it is overall server lag and I'm going to ping the host about that since even the export script has about a 13 second lag before it downloads.  It's on a shared server so I can't really up the limits.

 

 

11 minutes ago, adrian said:

 

 

@hollyvalero - you have a strange hidden character at the end of this line: $parent = wire('pages')->get("/assessments/"); that is causing the problem - just remove that and it will be fine.

I would like to know what BCE was timing out for you and your own script isn't. Usually the easiest option is to up execution and php memory and these scripts will run until completed without any problems.*

* yes I should probably build in some chunking to prevent the need to do this for large imports/exports, but I haven't gotten around to it yet 🙂

 

 

Share this post


Link to post
Share on other sites
Just now, hollyvalero said:

2 lines down, so maybe there are some other invisibles.  I'll try scanning it with fresh eyes tomorrow

Yep, there must be more - just backspace until the ; is removed on each line and type it again - that will clear everything up.

Share this post


Link to post
Share on other sites
1 hour ago, adrian said:

Yep, there must be more - just backspace until the ; is removed on each line and type it again - that will clear everything up.

Thanks for your good eyes. I'll go line by line first thing tomorrow...

Share this post


Link to post
Share on other sites
17 hours ago, adrian said:

Yep, there must be more - just backspace until the ; is removed on each line and type it again - that will clear everything up.

-- Thanks again for your help!  I was able to get it working and get all the data imported after a little tinkering. Also, that server lag? It's a perpetual slow spot outside of chicago that I seem to run into using traceroute, so I don't think it's Batch Child Editor...

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.