Crssp (Author) Posted January 11, 2013

Hey guys, sorry for my confusion. I have the raw text of a sample article. The story gets a little longer here. The news2 sample is the raw text that gets written to the www root of the site, but when it does, a bunch of templated .asp junk gets added to it, which brings us to the sports4 sample. That sample is the templatized version, and is possibly how all the data exists, in these .asp extension story pages (very messy, but might be the only data source for past archives).

So news2 is the daily output of an article, as it comes from the newsroom... and the 2nd file, sports4-sample.txt, in the directories has an .asp extension and is the templatized page.

The 2 files should be there now; it looks like the attachment uploader was not working for me in Chrome.

news2-sample.txt
sports4-sample.txt
Joss Posted January 11, 2013

Hi

So are all files available in both formats? Obviously news2 is a lot easier to deal with, but I notice that sports4 has a call for a comments application at the bottom - are there therefore also comments that need to be imported, or are those going to be ignored?

With news2 you could not only import the text and the headline, but you could also import and store all the referencing, which may well be very useful.

Joss
diogo Posted January 11, 2013

It seems perfectly possible, with the bonus that you get to choose the format. If the second is most likely to be the format for all the info, I would go with that one. You should have a look at more news to see if there is a pattern that you can rely on. After that pattern is identified, you can decide how to divide the data in each file in a way that makes it most useful for you to manipulate. Then you can start working with regular expressions, or even with a DOM or XML parser, to actually divide all the data into pieces. Then comes the fun part: structuring PW to receive all the news (pages, fields, categories) and importing everything into PW pages.
Joss Posted January 11, 2013

Ah yes, categories - good point Diogo.

Is the package reference a news-style category? We used to do that sort of referencing in radio - wonderfully oblique and completely unhelpful!

You also mentioned that images are rare, but do come up. How are they referenced in the news2 type of file rather than the sports4 type of file?

Joss
Pete Posted January 11, 2013

It's possible to parse either of those easily with PHP, but I can't look at it tonight, so I suspect someone else may beat me to a code sample that might do it.
Soma Posted January 11, 2013

Hey, I already provided the code previously http://processwire.com/talk/topic/2506-running-a-daily-newspaper-website-with-process-wire/#entry23926, with a hint to http://simplehtmldom.sourceforge.net/ which makes parsing the files and the html tags a breeze.
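A minimal sketch of what parsing one of the templatized .asp pages with simple_html_dom could look like. The file path and the CSS selectors are placeholders - they would have to be adjusted to whatever markup the real template actually uses:

<?php
// Parse one templatized story page with simple_html_dom.
// The selectors below are assumptions, not the real template's markup.
include 'simple_html_dom.php';

$html = file_get_html('issues/2013/01/11/sports4-sample.asp'); // hypothetical path

$title = trim($html->find('td.headline', 0)->plaintext);  // assumed selector
$body  = trim($html->find('td.storytext', 0)->innertext); // assumed selector

echo $title . "\n\n" . $body;

$html->clear(); // free simple_html_dom's memory - important inside big loops
unset($html);
?>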
Pete Posted January 11, 2013

Yeah, but I meant more like this, because I'd forgotten about that second link:

<?php
// Some code to iterate through the folder structure needs to happen, but when we have a file we do something like this:
$filecontents = file_get_contents('news2-sample.txt');

// This splits your contents into an array containing stuff before and after the <!--Head--> delimiter
$filecontents = explode('<!--Head-->', $filecontents);

// Now Package, Rank etc. are in an array - $details[0] contains Package, Name would be $details[2] and so on
$details = explode("\n", trim($filecontents[0]));

// I'll ignore <!--FM--> as I have no idea what it is, so let's just imagine that wasn't in the text file

// We're now dealing with things after the <!--Head--> tag. $maincontent[0] now stores the article title and $maincontent[1] is the body.
$maincontent = explode('<!--Text-->', $filecontents[1]);
?>

There's more you can do from there to turn the newlines in the article body into paragraphs, but this gets you started with some ideas of how to parse it all. Someone will probably come up with a better way of parsing it all using regexp, though; I went with what worked inside my brain (regexp hurts my head too much).
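A quick follow-up sketch of the newline-to-paragraph step mentioned above, continuing with the $maincontent variable from that example; the exact markup is just one way to do it:

<?php
// Split the body on newlines, drop empty lines, and wrap each piece in <p> tags.
$lines = array_filter(array_map('trim', explode("\n", $maincontent[1])));
$bodyHtml = '<p>' . implode("</p>\n<p>", $lines) . '</p>';
?>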
Crssp (Author) Posted January 11, 2013

You guys are blowing me away with the awesome suggestions.

I think all I've got to work with, as far as trying to pull the archive stories, is the more complex .asp templated version (sports4, in other words). Those are the stories referenced in my first post of the thread, with the issues directory paths to those files organized by year, then month, then day. I'm going to attach a screenshot of today's directory, the 11th.

There are zero images to worry about in any stories, except future stories with a new system. The old site had homepage images daily, but sadly those were not archived. Unless the IT guys can surprise me and come up with the simpler text version, I don't have copies of those for import purposes.

Attached is a screenshot of today's stories, then. There is no set number of stories per type (news, sports or other); they rarely go over 10 stories per type.
Crssp (Author) Posted January 11, 2013

Would this tutorial be beneficial for me? It references @Soma's suggestion: http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/

Manual for the resource: http://simplehtmldom.sourceforge.net/manual.htm

Worth a read for me, I'm sure. The screenshot of the folder structure in the previous post is accurate, then, for pretty much the only place for me to grab the data from. Again, no images for any of these stories whatsoever. At least it zeros me in on a plan and path forward. Thanks again everybody for all of the direction/recommendations.
Crssp (Author) Posted January 12, 2013

I could do a great deal to batch edit the news articles with, say, Notepad++ and regular expressions, using search and replace, couldn't I? I have direct access to these files and could do that locally. Strip out everything not needed but the article, and anything else useful.
diogo Posted January 12, 2013

I don't think you need to do that, since you will be doing exactly the same work with PHP in your importing script.
Crssp (Author) Posted January 12, 2013

The thing that is escaping me with, say, the Simple HTML DOM parser is how to output the results as anything more than a page. Would my desired output type or collection be CSV? Newbie showing here. I don't see how it goes from scraping a page to being batch data - there are not that many docs on that in the wild. Thanks for the pointers.
Joss Posted January 12, 2013

Hi Crssp

Basically, a script would be a one-stop shop:

- It would open the file
- It would read the contents of the file and strip out unwanted tags
- It would store the elements (title, file number, body - whatever) in an array
- It would make a connection to the database
- It would create a page based on a template and populate the fields with the contents of the array

That is a bit rough, but it's the sort of thing it would do. So, as a single script, you could just tell it the name of the file, hit GO, and it would create the page. However, it could also be told to open a directory first and then do the same procedure with every file in the directory. You would just leave it to do it. It would create a replacement page in the database for every file, using the template that you have created.

The main issue, to be honest, is to work out how you want everything categorised, but I assume that might be based on the directory structure? Or something else? Either way, for one of the serious coders here, this would not be a huge task.

My suggestion for how to proceed would be:

1. Work out how you want the articles stored/categorised in ProcessWire
2. Start to put together your PW website, getting your templates and fields sorted out, plus who is going to have access to what
3. Sort out the look and feel of the site, the template files, whatever framework you want to use, and so on
4. Run the script and import the articles
5. Do a rather painful quality check to make sure you don't have any problems!

Joss
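A rough sketch of the "open a directory and repeat the procedure" part, using the news2 delimiters from Pete's example earlier in the thread. The directory path is a placeholder, and this piece only collects everything into an array - creating the actual ProcessWire pages is the next step:

<?php
// Walk one issue folder, parse each news2-style file, and collect the pieces.
$dir = '/path/to/issues/2013/01/11';   // hypothetical issue folder
$articles = array();

foreach (glob($dir . '/*.txt') as $file) {
    $contents = file_get_contents($file);
    if (strpos($contents, '<!--Head-->') === false) continue; // skip anything unexpected

    list($head, $rest)  = explode('<!--Head-->', $contents, 2);
    list($title, $body) = explode('<!--Text-->', $rest, 2);

    $articles[] = array(
        'file'  => basename($file),
        'head'  => array_filter(array_map('trim', explode("\n", $head))), // Package, Rank, etc.
        'title' => trim(strip_tags($title)),
        'body'  => trim($body),
    );
}
// $articles is now ready to be turned into ProcessWire pages (or written to CSV).
?>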
Pete Posted January 12, 2013

Exactly what Joss said, and here's a pointer as to how easy it is to hook into ProcessWire to create a page programmatically (the last step of your conversion script) as opposed to doing it by hand: http://processwire.com/talk/topic/352-creating-pages-via-api/
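For reference, the page-creation step from that linked thread boils down to something like this. The template name, parent path and body field are assumptions about how the PW site ends up being set up:

<?php
// Create one article page programmatically (from a template file, or from a
// script that has bootstrapped ProcessWire). The names below are assumptions.
$p = new Page();
$p->template = $templates->get('article');       // assumed template
$p->parent   = $pages->get('/articles/');        // assumed parent page
$p->title    = 'Imported story title';
$p->name     = $sanitizer->pageName($p->title);  // URL-friendly page name
$p->body     = '<p>Imported story body...</p>';  // assumed textarea field
$p->save();
?>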
Crssp (Author) Posted January 12, 2013

Which site profile would seem to be a better fit for a news site - blogging or skyscraper? It isn't possible to have more than one profile per install, is it? This is going to be proof of concept more than a production site to start, early beta...

Thanks
Pete Posted January 12, 2013

You don't need either, to be honest - you can just use the default profile. The idea with ProcessWire is that you then build your templates in the admin with whatever fields you need and go from there. Everything is customisable - for an example of this, try this: http://wiki.processwire.com/index.php/Small_Project_Walkthrough or watch ryan's intro video on the Videos page: http://processwire.com/videos/
Joss Posted January 13, 2013

Pete is right - you should stick to something really simple to start with. Do the walkthrough and you will see the separation between:

- creating a "template" (the object in the administration that you use to group your fields together for a page)
- creating a page (entering data that is stored in the database)
- displaying the "page" using a template file (the actual layout, done using a file in the /site/templates/ directory)

The point is that you can do some very, very basic template files just to display some sort of output, and then replace them with something more complicated later - you don't actually need to use any particular profile at all. In fact, for your purposes, it is probably better just sticking with the basic one that comes with the install.

Take it from me, it is surprisingly easy to learn.
Crssp (Author) Posted January 13, 2013

Duly noted, thanks for sticking with me guys. It's been a while since I viewed the videocasts also. Sounds like a plan.
diogo Posted January 13, 2013

The best profile for your site is the crssp profile. Install the default profile and study it well; then you can make your site grow from there, adding structure and functionality.
Crssp (Author) Posted January 14, 2013

More facts... The archive of stories is massive. On disc it is:

- approx. 1 GB of code (no images)
- 173,000 files
- 7,720 folders

A SimpleHtmlDom script that processes by year would maybe be the best strategy? There will be approx. 11,000 files to process per year. Wowiee... Last year alone there were 24,000 files, 173 MB on disc.
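A minimal sketch of the process-by-year idea, just walking one year's worth of the issues tree and handing each story page to a parse/import routine. The root path, file extension check and the parse_and_import() function are all assumptions:

<?php
// Walk one year of the archive (issues/YYYY/MM/DD/*.asp) and process each file.
$year = '2012';
$root = '/path/to/issues/' . $year;   // hypothetical archive root

$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));

foreach ($iterator as $fileinfo) {
    if ($fileinfo->isFile() && strtolower($fileinfo->getExtension()) === 'asp') {
        parse_and_import($fileinfo->getPathname()); // hypothetical function defined elsewhere
    }
}
?>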
Joss Posted January 14, 2013

The numbers are not hugely important, really, as long as you have enough server space and power!

What is the folder structure/names - I mean, how is this all organised?
Pete Posted January 14, 2013

+1 for Joss' note about server power. Any system where you are getting into hundreds of thousands of entries that will be searchable will put a bit of strain on the server, but as long as the server is powerful enough it's not a problem. Assuming all this is text and it eventually gets imported into ProcessWire, it'll be in the database, so it won't be taking up physical disk space any more.

The tricky bit is an import script to handle 24,000 files per year. To be honest, I would suggest you consider working with someone here to help set that up for you, as it's not a trivial task. That does bring you into the realm of paid work, but I thought it was worth mentioning, as you're going to need a converter that is clever enough not to time out, that won't break due to anything unforeseen (random characters in the title of one article could trip it up) and so on. Basically, someone who could do the conversion work and be on hand if anything goes wrong.

Just a suggestion, but it does sound a bit less straightforward.
Crssp (Author) Posted January 14, 2013

Thanks for all the help. The current server wouldn't even be the one used to run a ProcessWire beta setup; it will get set up on Bluehost for the dev/beta site to get going. I thought there would be a chance to pull our issues folder down, run a script locally, and then push it up to ProcessWire later on. Is there a chance of getting a year or so of articles formatted to just use the CSV importer, or something already out there in plugin form as far as importers go?

Actually, in the current version the entire archive is only available via a third-party archive, so we don't even currently host it. All you can get to via the site is the past 7 days, rolling. This makes things much easier (maybe) to think about for getting a beta up and running. With a ProcessWire site, though, it seems a shame not to host the entirety of the archives.

Looking at possibly using a paywall-type setup; something like Tinypass is one I noticed this morning.
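If the CSV route looks appealing, the local script could simply write the parsed fields out with fputcsv() and the resulting file could then be fed to a CSV importer (for example the ImportPagesCSV module). A minimal sketch, assuming an $articles array like the one built in the earlier directory-walk example:

<?php
// Write the parsed articles out to a CSV file so they can be imported later.
// The column layout is just an example and would need to match the importer's field mapping.
$fh = fopen('articles-2012.csv', 'w');
fputcsv($fh, array('title', 'source_file', 'body')); // header row

foreach ($articles as $a) {
    fputcsv($fh, array($a['title'], $a['file'], $a['body']));
}

fclose($fh);
?>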
Joss Posted January 14, 2013

I have never set up a paywall, so cannot help you there, but at the end of the day, what will work with 20 files will, in theory, work with 20,000.

Pete's point about hiring one of the experienced coders here to sort out an import system is probably a very good idea - it could save you a lot of pain in the future.

With the size of the archive, whenever you import all those files, you won't want to be online at the time. But if you are just running as "beta", that is not a problem. I am not sure how long it will take to import, but with no images, and with the files already uploaded ready to import, it won't take long. The physical size of the archive is not the issue here (my audio archive for my clients is nearly a terabyte); it is the number of files. But each of those is tiny.

Once the backlog is imported, it won't be an issue, because I assume you will then just import hourly or something - or will users then be entering articles directly?

Your biggest problem, to be honest, is designing a nice site!
Soma Posted January 14, 2013

Importing such a huge amount of files requires some good coding and a good server. You can't import them in one go. Parsing 172,000 files with simple html dom (which consumes lots of memory) and PHP will use up resources really quickly and kill your server's process if not done right. You really would have to make sure you get an experienced developer who knows what he's doing and does this in the most efficient way.

Doing it the dirty way, this can take up to 1-2 minutes for 100 files per loop (to not run out of memory), so it could take up to 24 hours for 200,000 files. I'm not that experienced with how many files at a time and what the best way is, but it can be harder than you think.
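One common way to keep memory under control with simple_html_dom is to process the files in small batches and explicitly free each parsed document before moving on. A rough sketch, with the glob pattern and batch size as assumptions:

<?php
// Process the archive in small batches, freeing each simple_html_dom object
// as we go so memory doesn't pile up across thousands of files.
include 'simple_html_dom.php';

$files = glob('/path/to/issues/2012/*/*/*.asp'); // hypothetical pattern
$batchSize = 100;

foreach (array_chunk($files, $batchSize) as $batch) {
    foreach ($batch as $file) {
        $html = file_get_html($file);
        if (!$html) continue; // skip files the parser can't handle

        // ... extract title/body here and create or queue the PW page ...

        $html->clear(); // simple_html_dom needs this to release its memory
        unset($html);
    }
    // Optionally log progress after each batch so a long run can be resumed.
}
?>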