
Import static html files into ProcessWire


p_hammond

Hey guys,

I need to import a Jekyll-based website, basically a collection of 3000+ static html pages, into ProcessWire, or in other words, import them into PW's database. Is there an easy, automated way to do this? Could you please point me in the right direction?

Thanks. :)


I don't think there is an easy way, since Jekyll doesn't seem to use a DB. You could probably write an automated method to do this for you. Do you have all the files on git or something similar so I could take a look?


I think Kyle is right - as long as there is some structure to the way Jekyll stores its data (and there must be), it should just be a matter of identifying what is what, and writing a script to read that structure and create PW pages from it. Take a look at http://processwire.com/talk/topic/3987-cmscritic-development-case-study/, where Ryan describes converting a site to PW.


@Kyle: You're right, Jekyll doesn't use a database at all, and this is perhaps the main roadblock.

@DaveP: your link is great, but I don't think it applies to my case, as the conversion Ryan lays out is from WordPress (db-driven) to PW (obviously db-driven as well).

So, it seems this task is going to be way more difficult than I anticipated. I think I'm going to hold off importing the site to PW and advise the client to stick with Jekyll for the time being. The main reason for the conversion was to give the client an easier/faster publication workflow, but I suppose I can tell him to use something like Prose to smooth the process.

If you have any more ideas, please let me know.

Thanks.


I don't think you should immediately discount the example in the link provided by DaveP. Even though it was a WordPress site, that is a very small part of the example. Instead of WP posts, you would just be processing your static files that contain HTML, if I understand how Jekyll works. All you need to work out is how to programmatically parse your existing pages.

Alternatively, there is a module that allows you to import pages via CSV file, which might help "as it is". If not, it might at least give you an idea of how you could re-code it to suit your needs and the way that the current site is structured.

http://modules.processwire.com/modules/import-pages-csv/
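
To make the CSV route a bit more concrete, here is a minimal sketch of turning already-extracted post data into CSV text. It is written in Python only because it is easy to run standalone. The function name `posts_to_csv` and the column names are placeholders I made up; the columns would need to match whatever fields exist on your ProcessWire template, and I have not checked this against the module's exact requirements.

```python
import csv
import io

def posts_to_csv(posts, columns=("title", "subtitle", "author", "date", "body")):
    """Serialise a list of post dicts to CSV text with a header row.

    Column names are hypothetical; missing values become empty cells.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(columns),
        extrasaction="ignore",   # silently drop keys that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    for post in posts:
        writer.writerow({c: post.get(c, "") for c in columns})
    return buf.getvalue()
```

The csv module takes care of quoting, so commas and line breaks inside the HTML body should not break the row structure.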


@p_hammond: can you provide a link to the live site? I'd like to have a look at the structure and blocks.

 All you need to work out is how to programmatically parse your existing pages.

^-^


Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

I've used that before. It is good. Basically, you will be scraping your site. Is it possible to download those HTML files locally? You can then scrape the site locally. Scraping can be pretty memory-intensive.


Thanks, folks, some interesting ideas here.

@hosrt: unfortunately, the site is currently offline.

Here's a link to a conversation about parsing HTML pages - I used the script in question and it is very good, as long as each HTML page follows sensible structures: http://processwire.com/talk/topic/569-php-simple-html-dom

That's a great link Pete, thanks for pointing it out to me.

Well, now I know what it takes to import a Jekyll site, or a collection of static HTML files for that matter, into ProcessWire. It seems that, with some work and dedication, I can make it happen. To be fully honest, I was hoping for a more hands-off, plug-n-play kind of approach.

I've done some further research on this matter and found a neat little script to migrate a Jekyll site into WordPress. As far as I understand, it's just a matter of running the script, and voilà. I'm going to inspect this thing further and see if I get some inspiration.

I appreciate your help, folks. If you think of any other ideas, please let me know.


That script is basically just iterating through the files and folders and parsing, much as you would do with the script I linked.

What we need to see to be able to help further is maybe a screenshot of the folder structure of the site (is it many levels deep, or are all pages in one folder, for example) and perhaps a test page or two. Without those I'm not sure we can suggest anything else!

Iterating through the folders and using a script like I linked to or even a regular expression is as hands-off as you'll get, but if we can have the information I just mentioned we can probably hook you up with some code :)
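
For what it's worth, the "iterating through the folders" part could look roughly like the sketch below. It is Python rather than PHP purely so it runs standalone, and `group_html_by_folder` is a name invented for this example. It only collects file names per folder and does not read any file contents, so it stays cheap even with thousands of pages.

```python
from collections import defaultdict
from pathlib import Path

def group_html_by_folder(root):
    """Map each folder (relative to root, '.' for the top level)
    to the sorted list of .html file names it contains."""
    root = Path(root)
    groups = defaultdict(list)
    # rglob walks every sub-folder; sorting keeps the output stable.
    for path in sorted(root.rglob("*.html")):
        folder = path.parent.relative_to(root).as_posix()
        groups[folder].append(path.name)
    return dict(groups)
```

Each folder key could then be mapped to a parent page, and each file name parsed and imported in turn.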


Correction: the pages are not all located in one folder on a single level. Sorry, I got confused. Here's the correct directory structure (I don't know what I was thinking :( ):

Parent Folder
-- Page 1
-- Page 2
-- Page 3
-- Blog
---- Category Name 1
------ Posts belonging to 'Category Name 1'
---- Category Name 2
------ Posts belonging to 'Category Name 2'
...

Most pages have basically the same HTML structure, as most pages are blog posts:

<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
	<div class="wrapper">
		<div class="content">
			<h2 class="title"></h2>
			<h3 class="subtitle"></h3>
			<article class="content">
				<p></p>
				...
			</article>
			<div class="byline">
				<p></p>
			</div>
			<div class="meta">
				<div class="author"></div>
				<div class="date"></div>
			</div>
			<div class="social">
				<span></span>
				...
			</div>
		</div>
		<aside class="sidebar">
			<form class="search-form">
				...
			</form>
			<div class="category-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="tag-list">
				<ul>
					<li></li>
					...
				</ul>
			</div>
			<div class="popular-posts">
				<ul>
					<li></li>
					...
				</ul>
			</div>
		</aside>
		<footer class="footer">
			...
		</footer>
	</div>
</body>
</html>
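
Given a structure like the one above, pulling out the interesting parts could be sketched as follows. This uses only Python's standard `html.parser` instead of the PHP Simple HTML DOM script mentioned earlier, and it is a rough illustration, not a tested importer. The names `PostParser`, `parse_post` and `TARGETS` are made up for this example, and the tag/class pairs are taken from the skeleton as posted, so they are assumptions about the real pages.

```python
import html
from html.parser import HTMLParser

# (tag, class) pairs from the posted skeleton, mapped to hypothetical field names.
TARGETS = {
    ("h2", "title"): "title",
    ("h3", "subtitle"): "subtitle",
    ("article", "content"): "body",
    ("div", "author"): "author",
    ("div", "date"): "date",
}

class PostParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {name: "" for name in TARGETS.values()}
        self._field = None  # field currently being captured
        self._tag = None    # tag that opened the capture
        self._depth = 0     # nesting depth of that same tag

    def handle_starttag(self, tag, attrs):
        if self._field is None:
            classes = (dict(attrs).get("class") or "").split()
            for (t, cls), name in TARGETS.items():
                if tag == t and cls in classes:
                    self._field, self._tag, self._depth = name, tag, 1
                    return
            return
        if tag == self._tag:
            self._depth += 1
        if self._field == "body":
            # Keep the markup inside the article as-is.
            self.fields["body"] += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._field == "body":
            self.fields["body"] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if self._field is None:
            return
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None
                return
        if self._field == "body":
            self.fields["body"] += f"</{tag}>"

    def handle_data(self, data):
        if self._field == "body":
            # Text arrives unescaped; re-escape it so the body stays valid HTML.
            self.fields["body"] += html.escape(data, quote=False)
        elif self._field is not None:
            self.fields[self._field] += data

def parse_post(html_text):
    """Return a dict with title, subtitle, body (inner HTML), author and date."""
    parser = PostParser()
    parser.feed(html_text)
    parser.close()
    return {k: v.strip() for k, v in parser.fields.items()}
```

Calling `parse_post()` on the text of one file gives a plain dict per post, which could then be fed to whatever creates the ProcessWire pages. Pages that don't follow this structure would simply come back with empty fields.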

Only a limited number of pages, I'd estimate around 25, don't follow the HTML structure posted above. These pages won't be a problem, as I'm happy to import them manually.

Does this help?


...

Does this help?

I think it helps a little bit :) but without the content one cannot see the links for prev/next posts, or whether the posts have any sort of ids.

You want to 1) import the content and 2) keep the given structure. (Information about 2) can also be stored in head tags.)


@hosrt: posts don't have ids, they are identified by their file name, as in 'here-is-a-title.html', 'another-title.html', 'yet-another-title.html', and so on. I'm not sure I understand the prev/next posts problem. Previous/next posts are nothing more than chronological navigation, so the system basically creates a link to the post file (here-is-a-title.html) created before and after any given post.
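
To illustrate what I mean, here is a small Python sketch of that logic: the file name (minus the extension) acts as the identifier, and prev/next fall out of sorting by date. The helper names `page_name` and `prev_next` are invented for this example, and it assumes each post's date is available as an ISO-style string (YYYY-MM-DD) so that plain string sorting is chronological.

```python
from pathlib import PurePosixPath

def page_name(filename):
    """Use the file name without its extension as the post identifier."""
    return PurePosixPath(filename).stem

def prev_next(posts):
    """posts: list of (filename, iso_date) tuples.

    Returns {name: (previous_name, next_name)} in chronological order,
    with None at either end of the sequence.
    """
    ordered = sorted(posts, key=lambda p: p[1])
    names = [page_name(filename) for filename, _ in ordered]
    links = {}
    for i, name in enumerate(names):
        prev_name = names[i - 1] if i > 0 else None
        next_name = names[i + 1] if i < len(names) - 1 else None
        links[name] = (prev_name, next_name)
    return links
```

So nothing about prev/next needs to be imported as such; it can be recomputed from the dates once the posts exist as pages.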

Thanks for your help.
