Import static html files into ProcessWire

p_hammond · August 2, 2013

Hey guys,

I need to import a Jekyll-based website, basically a collection of 3000+ static html pages, into ProcessWire, or in other words, import them into PW's database. Is there an easy, automated way to do this? Could you please point me in the right direction?

Thanks.

kyle · August 3, 2013

I don't think there is an easy way, since Jekyll doesn't seem to use a DB. You could probably write a script to automate this. Do you have all the files on Git or something similar so I could take a look?

DaveP · August 3, 2013

I think kyle is right - as long as there is some structure to the way Jekyll stores its data (and there must be), it should just be a matter of identifying what is what, and writing a script to read that structure and create PW pages from it. Take a look at http://processwire.com/talk/topic/3987-cmscritic-development-case-study/, where Ryan describes converting a site to PW.

p_hammond · August 3, 2013

@Kyle: You're right, Jekyll doesn't use a database at all, and this is perhaps the main roadblock.

@DaveP: your link is great, but I don't think it applies to my case, as the conversion Ryan lays out is from WordPress (db-driven) to PW (obviously db-driven as well).

So, it seems this task is going to be way more difficult than I anticipated. I think I'm going to hold off importing the site to PW and advise the client to stick with Jekyll for the time being. The main reason for the conversion was to give the client an easier/faster publication workflow, but I suppose I can tell him to use something like Prose to smooth the process.

If you have any more ideas, please let me know.

Thanks.

Craig · August 3, 2013

I don't think you should immediately discount the example in the link DaveP provided. Even though it was a WordPress site, that is a very small part of the example. Instead of WP posts, you would just be processing your static HTML files, if I understand how Jekyll works. All you need to work out is how to programmatically parse your existing pages.

Alternatively, there is a module that allows you to import pages via CSV file, which might help "as it is". If not, it might at least give you an idea of how you could re-code it to suit your needs and the way that the current site is structured.

http://modules.processwire.com/modules/import-pages-csv/
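As a rough illustration of the CSV route, here is a minimal Python sketch that writes already-parsed pages to CSV text. The column names (`title`, `subtitle`, `body`, `author`, `date`) are assumptions and would have to match the fields on your PW template, and the sample row is made up. It also assumes the importer treats the first row as a header, which is worth checking against the module's documentation.

```python
import csv
import io

# Hypothetical column names; they must match the fields on your PW template.
FIELDS = ["title", "subtitle", "body", "author", "date"]

def pages_to_csv(pages, out):
    """Write a header row plus one row per parsed page to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for page in pages:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: page.get(name, "") for name in FIELDS})

# Made-up sample data, only to show the output shape.
sample = [{"title": "Here is a title", "author": "Jane", "date": "2013-08-01",
           "body": "<p>Hello, world</p>"}]
buffer = io.StringIO()
pages_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

Writing to a file instead of a `StringIO` buffer works the same way; open the file with `newline=""` so the `csv` module controls the line endings.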

horst · August 3, 2013

@p_hammond: can you provide a link to the live site? I'd like to have a look at the structure and blocks.

> All you need to work out is how to programmatically parse your existing pages.

diogo · August 3, 2013

You can use an HTML parser both to create the templates in PW and to pull the content into the database. It's some work, but perfectly doable.

Pete · August 3, 2013

Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

kongondo · August 3, 2013

> Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

I've used that before. It is good. Basically, you will be scraping your site. Is it possible to download those HTML files locally? You can then scrape the site locally. Scraping can be pretty memory-intensive.

p_hammond · August 3, 2013

Thanks, folks, some interesting ideas here.

@horst: unfortunately, the site is currently offline.

> Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

That's a great link Pete, thanks for pointing it out to me.

Well, now I know what it takes to import a Jekyll site, or a collection of static HTML files for that matter, into ProcessWire. It seems that, with some work and dedication, I can make it happen. To be fully honest, I was hoping for a more hands-off, plug-n-play kind of approach.

I've done some further research on this matter and found a neat little script to migrate a Jekyll site into WordPress. As far as I understand, it's just a matter of running the script, and voilà. I'm going to inspect this thing further and see if I get some inspiration.

I appreciate your help, folks. If you think of any other ideas, please let me know.

Pete · August 4, 2013

That script is basically just iterating through the files and folders and parsing, much as you would do with the script I linked.

What we need to see to be able to help further is maybe a screenshot of the folder structure of the site (is it many levels deep, or are all pages in one folder, for example?) and perhaps a test page or two. Without those I'm not sure we can suggest anything else!

Iterating through the folders and using a script like the one I linked to, or even a regular expression, is as hands-off as you'll get, but if we can have the information I just mentioned, we can probably hook you up with some code.
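To give a rough idea of what "iterating and parsing" amounts to, here is a minimal Python sketch (the same loop is just as easy in PHP). It walks a folder tree and pulls a title out of each file with a regular expression. The `<h2 class="title">` pattern and the throwaway files it creates are assumptions for illustration only; a real run would point `iter_html_files` at the exported site.

```python
import re
import tempfile
from pathlib import Path

# Assumes one <h2 class="title"> per page with no nested tags inside it.
TITLE_RE = re.compile(r'<h2 class="title">(.*?)</h2>', re.DOTALL)

def iter_html_files(root):
    """Yield every .html file below root in a stable, sorted order."""
    yield from sorted(Path(root).rglob("*.html"))

def extract_title(html):
    """Return the text of the title heading, or None if the page lacks one."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

# Tiny throwaway site, only to demonstrate the loop.
with tempfile.TemporaryDirectory() as tmp:
    post = Path(tmp) / "blog" / "news" / "here-is-a-title.html"
    post.parent.mkdir(parents=True)
    post.write_text('<h2 class="title"> Here is a title </h2>', encoding="utf-8")
    (Path(tmp) / "about.html").write_text("<p>No heading here</p>", encoding="utf-8")

    # Only one file's contents are held in memory at any time.
    titles = {
        path.relative_to(tmp).as_posix(): extract_title(path.read_text(encoding="utf-8"))
        for path in iter_html_files(tmp)
    }
```

A regular expression is fine for a single, predictable element like this; for anything nested, a proper HTML parser is the safer choice.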

p_hammond · August 5, 2013

~~All pages are located in one folder, that is, there's only one level.~~ Sorry, I got confused. Here's the correct directory structure (I don't know what I was thinking):

Parent Folder
-- Page 1
-- Page 2
-- Page 3
-- Blog
---- Category Name 1
------ Posts belonging to 'Category Name 1'
---- Category Name 2
------ Posts belonging to 'Category Name 2'
...
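A layout like this can be mapped to a page tree fairly mechanically. The Python sketch below is only an illustration: the function name, the lower-case hyphenated folder names, and the assumption that posts always sit exactly two levels down (`Blog/<category>/<post>.html`) are mine, not something taken from the actual site.

```python
from pathlib import PurePosixPath

def target_location(relative_path):
    """Map a path relative to the site root to (parent path, page name, category)."""
    path = PurePosixPath(relative_path)
    name = path.stem  # file name without ".html"
    # Normalise folder names into URL-style segments (an assumed convention).
    folders = [part.lower().replace(" ", "-") for part in path.parts[:-1]]
    parent = ("/" + "/".join(folders) + "/") if folders else "/"
    # Posts are assumed to live at blog/<category>/<post>.html.
    category = folders[1] if len(folders) == 2 and folders[0] == "blog" else None
    return parent, name, category

top_level = target_location("page-1.html")
blog_post = target_location("Blog/Category Name 1/here-is-a-title.html")
```

With this mapping, top-level pages land under `/` and each post lands under its category folder, so the category can be stored either as the parent page or as a separate field.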

Most pages have basically the same HTML structure, as most pages are blog posts:

<!DOCTYPE html>
<html>
<head>
	...
</head>
<body>
	<div class="wrapper">
		<div class="content">
			<h2 class="title"></h2>
			<h3 class="subtitle"></h3>
			<article class="content">
				<p></p>
				...
			</article>
			<div class="byline">
				<p></p>
			</div>
			<div class="meta">
				<div class="author"></div>
				<div class="date"></div>
			</div>
			<div class="social">
				<span></span>
				...
			</div>
		</div>
		<aside class="sidebar">
			<form class="search-form">
				...
			</form>
			<div class="category-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="tag-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="popular-posts">
				<ul>
					<li></li>
					...
				</ul>
			</div>
		</aside>
		<footer class="footer">
			...
		</footer>
	</div>
</body>
</html>

Only a limited number of pages, around 25 by my estimate, don't follow the HTML structure posted above. These pages won't be a problem, as I'm happy to import them manually.
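For what it's worth, here is a minimal Python sketch of how the structure above could be parsed, using only the standard library's `html.parser`. The field names in `TARGETS`, the `parse_post` helper, and the sample markup (including the date format) are assumptions for illustration. The helper also reports which required fields are missing, which is one way to flag the pages that don't follow the structure and need manual import.

```python
from html.parser import HTMLParser

# (tag, class) pairs from the structure above, mapped to hypothetical field names.
TARGETS = {
    ("h2", "title"): "title",
    ("h3", "subtitle"): "subtitle",
    ("article", "content"): "body",
    ("div", "author"): "author",
    ("div", "date"): "date",
}

class PostParser(HTMLParser):
    """Collect the inner markup of each target element into self.fields."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.fields = {}
        self._field = None  # field currently being captured
        self._tag = None    # tag name that opened the capture
        self._depth = 0     # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self._field is None:
            classes = (dict(attrs).get("class") or "").split()
            for (want_tag, want_class), field in TARGETS.items():
                if tag == want_tag and want_class in classes:
                    self._field, self._tag, self._depth = field, tag, 1
                    self.fields[field] = ""
                    break
            return
        if tag == self._tag:
            self._depth += 1
        self.fields[self._field] += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        if self._field is not None:  # e.g. <br/> inside the article
            self.fields[self._field] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if self._field is None:
            return
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:  # the element that started the capture closed
                self.fields[self._field] = self.fields[self._field].strip()
                self._field = None
                return
        self.fields[self._field] += f"</{tag}>"

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] += data

    def handle_entityref(self, name):
        self.handle_data(f"&{name};")

    def handle_charref(self, name):
        self.handle_data(f"&#{name};")

def parse_post(html_text):
    """Return (fields, missing), where missing lists empty required fields."""
    parser = PostParser()
    parser.feed(html_text)
    parser.close()
    missing = [name for name in ("title", "body") if not parser.fields.get(name)]
    return parser.fields, missing

# Made-up sample following the structure above.
sample = """<div class="content">
  <h2 class="title">Here is a title</h2>
  <h3 class="subtitle">A subtitle</h3>
  <article class="content">
    <p>First <em>paragraph</em>.</p>
  </article>
  <div class="meta">
    <div class="author">Jane</div>
    <div class="date">2013-08-01</div>
  </div>
</div>"""
fields, missing = parse_post(sample)
```

Any file for which `missing` is non-empty can be written to a list and handled by hand, so the irregular pages never reach the automated import.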

Does this help?

horst · August 5, 2013

> ...
>
> Does this help?

I think it helps a little, but without content one cannot see the links for prev/next posts, or whether the posts have any sort of IDs.

You want to 1) import the content and 2) keep the given structure. (Information about 2) can also be stored in head tags.)

p_hammond · August 5, 2013

@horst: posts don't have IDs; they are identified by their file name, as in 'here-is-a-title.html', 'another-title.html', 'yet-another-title.html', and so on. I'm not sure I understand the prev/next posts problem. Previous/next links are nothing more than chronological navigation, so the system simply links each post to the post files (e.g. here-is-a-title.html) created immediately before and after it.
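Put differently, both pieces of information can be derived rather than imported. A minimal Python sketch, assuming each post's date can be read from the page (the dates and helper names below are made up for illustration):

```python
from datetime import date
from pathlib import PurePosixPath

def page_name(file_name):
    """'here-is-a-title.html' -> 'here-is-a-title', used as the post's identifier."""
    return PurePosixPath(file_name).stem

def with_neighbours(posts):
    """Sort (file_name, date) pairs chronologically and attach prev/next names."""
    ordered = sorted(posts, key=lambda post: post[1])
    names = [page_name(file_name) for file_name, _ in ordered]
    return [
        {
            "name": name,
            "prev": names[i - 1] if i > 0 else None,
            "next": names[i + 1] if i < len(names) - 1 else None,
        }
        for i, name in enumerate(names)
    ]

# Dates are invented; in practice they would come from each post's "date" element.
posts = [
    ("yet-another-title.html", date(2013, 7, 3)),
    ("here-is-a-title.html", date(2013, 7, 1)),
    ("another-title.html", date(2013, 7, 2)),
]
linked = with_neighbours(posts)
```

Since the navigation follows purely from the dates, only the file name and the date need to survive the import; the prev/next links themselves can be regenerated on the new site.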

Thanks for your help.
