p_hammond

Import static html files into ProcessWire


Hey guys,

I need to import a Jekyll-based website, basically a collection of 3000+ static html pages, into ProcessWire, or in other words, import them into PW's database. Is there an easy, automated way to do this? Could you please point me in the right direction?

Thanks. :)


I don't think there is an easy way, since Jekyll doesn't seem to use a DB. You could probably write an automated method to do this for you. Do you have all the files on git or something similar so I could take a look?


I think kyle is right - as long as there is some structure to the way Jekyll stores its data (and there must be), it should just be a matter of identifying what is what, and writing a script to read that structure and create PW pages from it. Take a look at http://processwire.com/talk/topic/3987-cmscritic-development-case-study/, where Ryan describes converting a site to PW.


@Kyle: You're right, Jekyll doesn't use a database at all, and this is perhaps the main roadblock.

@DaveP: your link is great, but I don't think it applies to my case, as the conversion Ryan lays out is from WordPress (db-driven) to PW (obviously db-driven as well).

So, it seems this task is going to be way more difficult than I anticipated. I think I'm going to hold off importing the site to PW and advise the client to stick with Jekyll for the time being. The main reason for the conversion was to give the client an easier/faster publication workflow, but I suppose I can tell him to use something like Prose to smooth the process.

If you have any more ideas, please let me know.

Thanks.


I don't think you should immediately discount the example in the link provided by DaveP. Even though it was a WordPress site, that is a very small part of the example. Instead of WP posts, you would just be processing your static files that contain HTML, I think - if I understand how Jekyll works. All you need to work out is how to programmatically parse your existing pages.

Alternatively, there is a module that allows you to import pages via a CSV file, which might help as is. If not, it might at least give you an idea of how you could re-code it to suit your needs and the way the current site is structured.

http://modules.processwire.com/modules/import-pages-csv/
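
For the CSV route, one option is to have a small script turn the parsed pages into a CSV file first. Here is a minimal sketch in Python, assuming the import module takes a header row whose columns you then map to template fields. The field names `title`, `subtitle` and `body` are made up for illustration and would need to match whatever fields your PW template actually has:

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, starting with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical field names; adjust them to the fields on your PW template.
FIELDS = ["title", "subtitle", "body"]

example = rows_to_csv(
    [{"title": "Here is a title", "subtitle": "A subtitle", "body": "<p>Hello, world</p>"}],
    FIELDS,
)
print(example)
```

The `csv` module takes care of quoting, so HTML bodies containing commas or quotes should survive the round trip.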


@p_hammond: can you provide a link to the live site? I want to have a look at the structure and blocks.

 All you need to work out is how to programmatically parse your existing pages.

^-^


You can use an HTML parser both to create the templates in PW and to pull the content into the database. It's some work, but perfectly doable.


Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

I've used that before. It is good. Basically, you will be scraping your site. Is it possible to download those HTML files locally? You can then scrape the site locally. Scraping can be pretty memory intensive.


Thanks, folks, some interesting ideas here.

@hosrt: unfortunately, the site is currently offline.

Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

That's a great link Pete, thanks for pointing it out to me.

Well, now I know what it takes to import a Jekyll site, or a collection of static HTML files for that matter, into ProcessWire. It seems that, with some work and dedication, I can make it happen. To be fully honest, I was hoping for a more hands-off, plug-n-play kind of approach.

I've done some further research on this matter and found a neat little script to migrate a Jekyll site into WordPress. As far as I understand, it's just a matter of running the script, and voilà. I'm going to inspect this thing further and see if I get some inspiration.

I appreciate your help, folks. If you think of any other ideas, please let me know.


That script is basically just iterating through the files and folders and parsing, much as you would do with the script I linked.

What we need to see to be able to help further is maybe a screenshot of the folder structure of the site (is it many levels deep, or are all pages in one folder, for example?) and perhaps a test page or two - without those I'm not sure we can suggest anything else!

Iterating through the folders and using a script like I linked to or even a regular expression is as hands-off as you'll get, but if we can have the information I just mentioned we can probably hook you up with some code :)
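
To give an idea of what "iterating through the folders and using a regular expression" could look like, here is a rough sketch in Python. It is only an illustration: the `<h2 class="title">` pattern is a placeholder to adapt to the real markup, and the function names are made up. Files are read one at a time, which keeps memory use low even with a few thousand pages:

```python
import re
from pathlib import Path

# Placeholder pattern: assumes the title lives in <h2 class="title">...</h2>.
TITLE_RE = re.compile(r'<h2 class="title">(.*?)</h2>', re.DOTALL)


def iter_html_files(root):
    """Yield every .html file below root, in a stable order."""
    yield from sorted(Path(root).rglob("*.html"))


def extract_title(html):
    """Return the title text, or None when the pattern does not match."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def scan(root):
    """Yield (path, title) pairs, reading one file at a time."""
    for path in iter_html_files(root):
        html = path.read_text(encoding="utf-8")
        yield path, extract_title(html)
```

Pages where `extract_title` returns `None` are the ones that don't fit the pattern and probably need a manual look.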


All pages are located in one folder, that is, there's only one level. Sorry, I got confused. Here's the correct directory structure (I don't know what I was thinking :():

Parent Folder
-- Page 1
-- Page 2
-- Page 3
-- Blog
---- Category Name 1
------ Posts belonging to 'Category Name 1'
---- Category Name 2
------ Posts belonging to 'Category Name 2'
...
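
With a layout like this, the folder path alone tells you whether a file is a top-level page or a blog post, and which category a post belongs to. A small sketch in Python, assuming the blog folder is literally named `blog` (case-insensitive) and that paths are given relative to the parent folder; the real folder names may differ:

```python
from pathlib import PurePosixPath


def classify(relative_path):
    """Map a path relative to the parent folder onto (kind, category)."""
    parts = PurePosixPath(relative_path).parts
    # blog/<category>/<post file> -> a post in that category
    if len(parts) >= 3 and parts[0].lower() == "blog":
        return "post", parts[1]
    # anything else is treated as a regular page
    return "page", None
```

The category could then become the parent page (or a page reference field) when the post is created in PW.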

Most pages have basically the same HTML structure, as most pages are blog posts:

<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
	<div class="wrapper">
		<div class="content">
			<h2 class="title"></h2>
			<h3 class="subtitle"></h3>
			<article class="content">
				<p></p>
				...
			</article>
			<div class="byline">
				<p></p>
			</div>
			<div class="meta">
				<div class="author"></div>
				<div class="date"></div>
			</div>
			<div class="social">
				<span></span>
				...
			</div>
		</div>
		<aside class="sidebar">
			<form class="search-form">
				...
			</form>
			<div class="category-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="tag-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="popular-posts">
				<ul>
					<li></li>
					...
				</ul>
			</div>
		</aside>
		<footer class="footer">
			...
		</footer>
	</div>
</body>
</html>

Only a limited number of pages, I'd estimate around 25, don't follow the HTML structure posted above. These pages won't be a problem, as I'm happy to import them manually.
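
For pages that do follow this structure, the interesting parts can be pulled out with a proper HTML parser instead of regular expressions. Below is a sketch using Python's built-in `html.parser`. It is an illustration under assumptions, not a finished importer: the output field names (`title`, `subtitle`, `author`, `date`, `body`) are hypothetical, and the sidebar, byline and social blocks are simply ignored:

```python
from html import escape
from html.parser import HTMLParser

# (tag, class) pairs from the posted structure mapped to hypothetical field names.
TEXT_FIELDS = {
    ("h2", "title"): "title",
    ("h3", "subtitle"): "subtitle",
    ("div", "author"): "author",
    ("div", "date"): "date",
}


class PostParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {name: "" for name in TEXT_FIELDS.values()}
        self.fields["body"] = ""
        self._current = None      # (tag, field) of the text field being read
        self._article_depth = 0   # > 0 while inside <article class="content">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._article_depth:
            # Markup inside the article is copied verbatim into the body.
            self.fields["body"] += self.get_starttag_text()
            if tag == "article":
                self._article_depth += 1
            return
        if tag == "article" and "content" in classes:
            self._article_depth = 1
            return
        for cls in classes:
            if (tag, cls) in TEXT_FIELDS:
                self._current = (tag, TEXT_FIELDS[(tag, cls)])

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> only matter inside the article.
        if self._article_depth:
            self.fields["body"] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if self._article_depth:
            if tag == "article":
                self._article_depth -= 1
            if self._article_depth:
                self.fields["body"] += f"</{tag}>"
        elif self._current and self._current[0] == tag:
            self._current = None

    def handle_data(self, data):
        if self._article_depth:
            # Text arrives decoded, so re-escape it to keep the body valid HTML.
            self.fields["body"] += escape(data, quote=False)
        elif self._current:
            self.fields[self._current[1]] += data


def parse_post(html):
    """Return a dict of stripped field values for one HTML page."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return {name: value.strip() for name, value in parser.fields.items()}
```

Calling `parse_post(html)` on a page returns one dict per post. An empty `title` or `body` is a reasonable signal that the page is one of the exceptions to import by hand.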

Does this help?


...

Does this help?

I think it helps a little bit :) - but without the content one cannot see the links for prev/next posts, or whether the posts have any sort of IDs.

You want to 1) import the content and 2) keep the given structure. (Information about 2) can also be stored in head tags.)


@hosrt: posts don't have ids, they are identified by their file name, as in 'here-is-a-title.html', 'another-title.html', 'yet-another-title.html', and so on. I'm not sure I understand the prev/next posts problem. Previous/next posts are nothing more than chronological navigation, so the system basically creates a link to the post file (here-is-a-title.html) created before and after any given post.
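
Since the file name is the identifier and prev/next is purely chronological, both can be derived rather than scraped. A small sketch in Python, assuming each post's date has already been extracted in a sortable form (an ISO-style string, for instance); the function names and the dict layout are made up for illustration:

```python
from pathlib import Path


def slug_from_filename(filename):
    """'here-is-a-title.html' -> 'here-is-a-title'."""
    return Path(filename).stem


def with_neighbours(posts):
    """posts: iterable of (filename, date) pairs with sortable dates.

    Returns one dict per post, oldest first, with the slugs of the
    chronologically previous and next posts (None at either end).
    """
    ordered = sorted(posts, key=lambda post: post[1])
    slugs = [slug_from_filename(name) for name, _ in ordered]
    result = []
    for index, slug in enumerate(slugs):
        result.append({
            "slug": slug,
            "prev": slugs[index - 1] if index > 0 else None,
            "next": slugs[index + 1] if index + 1 < len(slugs) else None,
        })
    return result
```

The slug can serve as the page name in PW, and as long as the date is imported too, the navigation can be rebuilt from the sort order instead of from the old links.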

Thanks for your help.

