
Performance Question - Many records, little data


mr-fan

Recommended Posts

I've searched the forum and found some entries, but none really fits...

Project goal - building a simple weather station with an ESP, sending the data to my PW page and saving it... that sounds easy.

Load - each record would hold just 4-6 values (temp, humidity, pressure, rain, brightness...)

But if I set the timer to 10 minutes, there will be 52,560 records in just one year... and it should collect data for a few years...

Question - how to save the data so that visualisation (d3 or something else), database performance, and so on will work...

  • Could I simply run with a page template for the data and save each record as a page (5 years would be 262,800 pages...)?
  • Or should I set up my own MySQL table for this one? Links or examples would be great - I'm not experienced with this.
  • Or could I go with something like RockDataTables?

Which is the right path... which should I take?

Link to comment
Share on other sites

The majority (everything?) of the content is automated via imports, and you have practically no use for the whole PW admin backend? If so, I'd surely go the custom-MySQL-tables route. Do your visualisation stuff in PW (pulling the raw data from MySQL). Perhaps you'd only have to build one page per location, and then use the location page id as a foreign key in MySQL. If you expect huge traffic, a proper caching strategy would be essential.

Do you plan to have some sort of search / filter on your site?

Link to comment
Share on other sites

24 minutes ago, mr-fan said:

weather station with an ESP

err, I should have googled this before answering... (I had no idea what ESP means).

So it's data from one single weather station only? My post above was based on the assumption that you get weather info from all over the world via some 3rd-party API and update your site via cron jobs.

It's been said time and again that a big number of PW pages will scale well, but a combination of (too many) fields, templates, and pages - plus complex queries - can seriously impact speed. In your case, the setup seems to be quite simple though, so I would go the "native PW route".

However: when it comes to loading huge data-sets and visualising them (aggregating a 5-year span of weather data to create statistics or curve diagrams), I would surely look at RockTabulator or some such.

You can fake a future situation yourself by simply creating thousands of dummy data-sets.

  • Like 1
Link to comment
Share on other sites

@mr-fan

What granularity of data do you need for things that are 5 years old?  Do you need to drill down to get the exact temperature given a date and time from 5 years in the past?  Or would an average do?

If an average would be enough, then have a look at RRDTool. All you'd need to do is feed the data into it, then link the graphs it generates into your PW driven site.
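To give a flavour of the workflow, here is a rough sketch driven from PHP via exec() - the file paths, data-source names, and archive settings are my assumptions rather than a tested config; check the rrdtool docs for sensible heartbeat/row values:

//create the round-robin database once: one data source (DS) per sensor,
//fixed-size archives (RRA) that pre-aggregate old data automatically
exec("rrdtool create /var/lib/rrd/weather.rrd --step 600 " // one sample every 10 min
	. "DS:temp:GAUGE:1200:-40:60 "
	. "DS:hum:GAUGE:1200:0:100 "
	. "RRA:AVERAGE:0.5:1:52560 "   // raw 10-min averages for ~1 year
	. "RRA:AVERAGE:0.5:144:1825"); // daily averages for ~5 years

//on every reading: push the values, then re-render graphs as needed
exec("rrdtool update /var/lib/rrd/weather.rrd N:21.4:65");
exec("rrdtool graph /var/www/graphs/temp-week.png --start -1w "
	. "DEF:t=/var/lib/rrd/weather.rrd:temp:AVERAGE LINE2:t#ff0000:temperature");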

Although I've not used RRDTool for about 9 years, I ran it on a Raspberry Pi A (IIRC) to profile my house's thermal performance. I did this by collecting 1-minute temperature samples from a network of 8 x 1-wire temperature sensors I installed in the property. I ran this over a period of about 2 years, and all that data got squished into a fixed-size on-disk DB of a few megabytes. I used a cron job to pull the images from the graph generator and cache them on disk with fixed filenames. I had nginx serve a simple HTML webpage that referenced the images by name. It worked really nicely on my local network, but the output graphs aren't the prettiest. You might be able to feed the results into grafana or Bernhard's tool if nicer output is needed.

If you do need high granularity of the old data, RRDTool isn't the tool for you.

@dragan ESPs are a family of WiFi/cellular-capable microcontrollers. Very popular with electronics/systems makers.

  • Like 4
Link to comment
Share on other sites

OK, sorry for not being clearer in my first post... (the "ESP8266" is an Arduino-like microcontroller that runs the sensors and sends the data via WAN to my website for data storage)

Like you thought, it is one weather station I would like to build, at a minimum collecting temperature, humidity, pressure, amount of rainfall, and maybe a light sensor for sunshine hours per day.

So my first "PW way of life" approach would be to go with pages...

The data is simply sent every 10 minutes via POST or GET to a template and saved as pages.

template "station" as parent holder - maybe i will create a second one at my office...so here we have a title field at least
template  "record" as child of station for the data records - fields like i wrote 4-6 - temp,humidity,pressure,rain, light - all integer or float fields

I think I should try with dummy entries to see if this setup would work...

For better caching and overview I could adopt a kind of blog page tree like /station/2019/07/20, so I get a parent for year/month/day - filtering and data processing via the PW API would get much more performant that way, I think??

thanks for all input.

Link to comment
Share on other sites

2 minutes ago, mr-fan said:

blog page tree like /station/2019/07/20, so I get a parent for year/month/day

I don't think that would make any impact on filtering and search. If anything, it would only make your setup more complicated than it needs to be: you would have to take care of creating the year/month/day parent pages with each new record (or create them ahead of time).
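For example, with a flat structure a date range can go straight into the selector - a sketch only; it assumes a "record" template and uses the built-in created timestamp:

// Find all records of May 2019 without any year/month parent pages
$records = $pages->find("template=record, created>=2019-05-01, created<2019-06-01, sort=created");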

Link to comment
Share on other sites

OK, I only thought that the API request would be easier if I could filter with parent pages, rather than running a query through all pages on the created date field?

// Find all entries of May 2019 - the children of a month page are the day
// pages, so use find() to fetch the actual records below them
$may = $pages->get("/station/2019/05/");
$records = $may->find("template=record");

With my simple DB wisdom I think PW would find the needed data faster... but that is the topic of this post... my lack of DB background (this is the main reason why I like PW so much... ;)

The parent structure would be no problem - I have in mind that the old blog tutorial from Joss saved posts with such a system on page save...

Link to comment
Share on other sites

Hi @mr-fan, you are dealing with a very interesting project! Wow… it brings me back to my previous lives in microcontrollers and NAND flash storage...

If your MCU timer is sampling analog data every 10 minutes, and your objective is to store the samples for 5 years at this granularity (10 minutes), then as you have calculated it means 2.6×10^5 records. That's not trivial at all. I love PW's page-storage capabilities but, in my personal opinion, the resulting storage solution would be less efficient and performant than storing the data directly in a MySQL table. I would use a MySQL table for data storage and PW pages for data post-processing and reporting.

A lot will depend on which kind of post-processing you are going to need, but as you are building a simple (but big!) data logger, another option for storage could even be a simple flat file to which you append a new record every 10 minutes. In the long run, however, MySQL will give you much better flexibility, versatility, and performance.

If you wish to look at a simple example of using MySQL from inside PW, I suggest you explore the FieldtypeComments core module. Comments in PW are not stored using PW page-storage, but in a MySQL table (field_comments), so you can get some useful hints about dealing with MySQL schemas from within PW. Of course, you do not need to create a separate MySQL database for your logger; it can simply be a new table inside your existing PW database, exactly as PW does to manage comments.

Please also note that PW provides built-in support for dealing with MySQL (I have not used it so far, but... now I feel I need to give it a trial soon…) through the $database API variable.
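As a loose sketch of that approach ($database is PW's PDO wrapper, so the standard PDO methods apply; the table and column names below are just assumptions based on the sensors mentioned):

// one-time setup: a logger table living alongside the PW tables
$database->exec("
	CREATE TABLE IF NOT EXISTS weather_log (
		id INT UNSIGNED NOT NULL AUTO_INCREMENT,
		station_id INT UNSIGNED NOT NULL, -- PW page id of the station
		created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
		temp DECIMAL(4,1) NOT NULL,
		humidity TINYINT UNSIGNED NOT NULL,
		pressure SMALLINT UNSIGNED NOT NULL,
		rain DECIMAL(5,2) NOT NULL,
		light SMALLINT UNSIGNED NOT NULL,
		PRIMARY KEY (id),
		KEY station_created (station_id, created)
	) ENGINE=InnoDB
");

// every 10 minutes: insert one record (values would come from the ESP's request)
$stmt = $database->prepare(
	"INSERT INTO weather_log (station_id, temp, humidity, pressure, rain, light)
	 VALUES (:station, :temp, :hum, :pres, :rain, :light)"
);
$stmt->execute([
	':station' => 1049, // hypothetical station page id
	':temp' => 21.4, ':hum' => 65, ':pres' => 1013, ':rain' => 0.0, ':light' => 12,
]);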

By the way, if you need any help creating a MySQL data-logger table inside your PW database, just give me the data specs - I would be glad to help you set it up.

Wish you a nice week-end.

  • Like 8
Link to comment
Share on other sites

13 hours ago, Edison said:

If you wish to look at a simple example of using MySQL from inside PW, I suggest you explore the FieldtypeComments core module. Comments in PW are not stored using PW page-storage, but in a MySQL table (field_comments), so you can get some useful hints about dealing with MySQL schemas from within PW.

Not sure how familiar everyone here is with the inner workings of ProcessWire, so just to expand on this a little bit:

FieldtypeComments is a ProcessWire Fieldtype module, and Fieldtype modules can define their own database schema. The FieldtypeEvents module was built as an example for custom Fieldtype modules, and if you take a closer look at FieldtypeEvents::getDatabaseSchema() (and other methods in that class), you should get a pretty good idea of how this stuff works.
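The gist of the schema part, loosely modeled on the FieldtypeEvents example - the weather-specific columns here are my assumptions; only the 'data' column and the overall pattern are required by PW:

// inside a custom Fieldtype module class
public function getDatabaseSchema(Field $field) {
	$schema = parent::getDatabaseSchema($field); // brings pages_id, data, keys...
	$schema['data'] = 'INT NOT NULL'; // 'data' is required; could hold a unix timestamp
	$schema['temp'] = 'FLOAT NOT NULL';
	$schema['humidity'] = 'TINYINT UNSIGNED NOT NULL';
	$schema['keys']['data'] = 'KEY data (data)'; // make the timestamp queryable
	return $schema;
}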

On the other hand if you want to store loads of data and don't really need/want said data to be stored in an actual field (accessible via the API and editable in the Admin), you can also define a custom database table, just as Edison pointed out above. You can find some examples of working with custom database tables from the ProcessChangelog module. Of course you don't have to wrap the table creation etc. into a module – not unless you expect to set this thing up on multiple occasions :)

One last point on naming custom tables: if you create a truly custom database table, you'll want to steer away from any native table names. This includes field_*, since that prefix is reserved for field data.

  • Like 4
Link to comment
Share on other sites

42 minutes ago, teppo said:

FieldtypeComments is a ProcessWire Fieldtype module, and Fieldtype modules can define their own database schema. The FieldtypeEvents module was built as an example for custom Fieldtype modules, and if you take a closer look at FieldtypeEvents::getDatabaseSchema() (and other methods in that class), you should get a pretty good idea of how this stuff works.

Probably creating a custom FieldtypeDataLogger (or however you prefer to name it!) would give you the best of both worlds. You could have your own optimized MySQL table schema, while for post-processing/reporting you could embed it as a field in PW pages, where you would access the individual data (temperature, humidity, etc.) as properties of the field, with all the advantages of PW selectors.

Link to comment
Share on other sites

2 hours ago, Edison said:

Probably creating a custom FieldtypeDataLogger (or however you prefer to name it!) would give you the best of both worlds.

Technically yes. I'd still check early on that ProcessWire won't at some point try to load all those rows into memory at once, or worse yet try to render them all as inputs on the page edit screen (in the admin).

I have a vague memory of Ryan adding something to handle exactly this to one of his modules or the core (or both).

Link to comment
Share on other sites

16 hours ago, mr-fan said:

With my simple DB wisdom I think PW would find the needed data faster... but that is the topic of this post... my lack of DB background (this is the main reason why I like PW so much... ;)

In ProcessWire the wisdom usually is to avoid selecting much data at all. That's the sole reasoning for e.g. the nesting you described. It won't help at all if you want to aggregate over e.g. the last 5 years of weather data. 

The biggest question still open in this topic is "what for?". Without knowing the patterns of how you intend to access the stored data and which timeframes of aggregations of this data are appropriate it's not really possible to tell what you need. If you're fine with reports taking a hot minute to aggregate you're in a whole different ballpark than if you need huge aggregations to be live and instantly available in some web dashboard.

Especially if you plan to hit the latter case I'd also suggest looking at proper databases for time series data, especially if the number of entries is meant to grow beyond the ~500k–1kk mark. I'd look at influxdb or postgresql with timescale plugin. Using pages in processwire might make sense for an MVP, but if things should scale it'll be a lot of manual querying even in processwire, so I'd opt for the proper solution from the start. Given the volume of data I doubt you can avoid getting more intimate with databases, as you just need to aggregate data directly on the db side, which processwire doesn't support to begin with.
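To make "aggregating on the db side" concrete, here is a sketch against the hypothetical weather_log table from earlier in this thread - MySQL does the math, and PHP only ever sees a handful of rows:

// monthly averages computed by MySQL itself
$stmt = $database->prepare("
	SELECT DATE_FORMAT(created, '%Y-%m') AS month,
	       AVG(temp) AS avg_temp, AVG(humidity) AS avg_hum, SUM(rain) AS total_rain
	FROM weather_log
	WHERE station_id = :station
	GROUP BY month
	ORDER BY month
");
$stmt->execute([':station' => 1049]); // hypothetical station page id
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC); // 12 rows per year, ready for charting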

  • Like 7
Link to comment
Share on other sites

43 minutes ago, LostKobrakai said:

The biggest question still open in this topic is "what for?". Without knowing the patterns of how you intend to access the stored data and which timeframes of aggregations of this data are appropriate it's not really possible to tell what you need. [...] Especially if you plan to hit the latter case I'd also suggest looking at proper databases for time series data, especially if the number of entries is meant to grow beyond the ~500k–1kk mark. I'd look at influxdb or postgresql with timescale plugin. Using pages in processwire might make sense for an MVP, but if things should scale it'll be a lot of manual querying even in processwire, so I'd opt for the proper solution from the start. Given the volume of data I doubt you can avoid getting more intimate with databases, as you just need to aggregate data directly on the db side, which processwire doesn't support to begin with.

Absolutely agree with this – use case for the data matters a lot!

In my experience MySQL queries (with decent indexes) tend to be pretty fast until you reach the scale of millions of rows, but if this is going to be a public-facing service that gets a lot of hits and needs to generate all sorts of reports in real time, then definitely consider alternatives right from the get-go. It might be a good idea to look into them anyway, but if it's a one-off project and you're likely to stay in the 200-300k record range, you're probably not going to get a major benefit out of them.

That being said, if you already know what your data is going to look like, you can take some of the guesswork out of the equation by starting from a simple proof of concept: create a database table for your data, add a script that generates some 200-300k rows of random mock data based on your actual expected data format, and build a proof-of-concept graph to display said data. If the database concept doesn't pan out, i.e. it's too slow or something like that, you can just swap it for something more performant while keeping the other parts of the application.

Either way it's often a good idea to build your product in layers, so that if a specific layer – graph library, database, or something in-between – needs to be swapped for something else, other layers remain more or less the same :)

  • Like 5
Link to comment
Share on other sites

Thanks for all input again.

Again I was not too clear - it's a single project just for me and just for information: no public hits, no further use case for the system.

55 minutes ago, LostKobrakai said:

Especially if you plan to hit the latter case I'd also suggest looking at proper databases for time series data, especially if the number of entries is meant to grow beyond the ~500k–1kk mark. I'd look at influxdb or postgresql with timescale plugin.

Yes LostKobrakai, I think I didn't plan for this one... until now I didn't have such a use case and wasn't deep into database performance (which is not really needed for normal websites if you use PW as a base... it scales for those cases, and I know that well).

So maybe I am a little bit too naive on this project - and my feeling said "Hey, try to ask the dev guys in the forum for hints"... ;)

I would first go the hard way, get my hands dirty, and try how far PW can go on this one...

Yes, a nested page setup like mentioned won't scale for things like "live aggregating" all saved data, but I think I would be happy if I could get things like averages per year/month/week and store them in separate fields on the dedicated template - so, like netcarver wrote, if I am modest about granularity it will run fine.

But an important tip is the FieldtypeEvents module Teppo mentioned - I forgot that one. It would work great for the week dataset, so many pages are spared.

What I will try with dummy data now:
templates: /station/year/month/week
station (id, title)
year (id, title, temp, humidity, pressure, rain, light) - the averages - averaging the 12 month child pages should be no problem
month (id, title, temp, humidity, pressure, rain, light) - the averages - this ends in 52 week child pages per year to join for the averages
week (id, title, records - a modified event-type field for the intraday records) - this ends in 672 entries per week field (if I take a record every 15 minutes)

Problems with this one would be selecting from date to date, but I would be fine with averages of weeks or months... and maybe the number of FieldtypeEvents tables (52 per year)?
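A minimal sketch of that rollup step, using the template and field names listed above (untested, and it assumes the month pages already carry their own averages):

// roll the monthly averages up into the year page
foreach ($pages->find("template=year") as $yearPage) {
	$months = $yearPage->children("template=month");
	if (!$months->count()) continue;
	$sum = 0;
	foreach ($months as $m) $sum += $m->temp;
	$yearPage->setAndSave('temp', round($sum / $months->count(), 1));
}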

I don't have the time to investigate other DB systems, since this year I've got one more little girl who takes over my spare time :)

But with your interesting input I think I'm heading in the right direction, and don't have wrong ideas about aggregation and reporting of the collected data anymore.
All of you pointed out that the collection wouldn't be the problem with PW or MySQL... but all the other stuff would. And on these points I can compromise, since this is only a hobby project.

I will report back (even if it takes a while).

Even on such rare and offbeat questions you get helpful and friendly answers in this forum!
You all made my weekend! - I love this forum, it is a hidden island in the www.

  • Like 5
Link to comment
Share on other sites

Just a short update on this - I am stress-testing the PW way, going just with simple pages for now, and will see what comes up the road.

I made a crude script to generate dummy data and created around 88k pages (the DB is now 35 MB) very quickly on shared hosting... for example, creating 48 test records for every day of a year (17,280 pages - 2 records per hour) takes the script about 80 seconds. For others searching for a quick and dirty way to create dummy content, I'll leave my code here (if there are better ways - take it at least as a bad example ;)

Spoiler

//create dummy data - enable one stage per run: first the year, then months, then days, then records
$year = 2015; //set up a non-existing year!
$yearPage = $pages->get("title=$year");
$yearChilds = $yearPage->numChildren;

//switches for creating the pages - since a parent has to exist, every child round has to run separately
$cyear = 1; //second: create the year - 1 page
$cmonths = 0; //third: create every month - 12 pages
$cdays = 0; //fourth: create 30 days for every month - 360 pages
$crecords = 0; //fifth: create records for every day of the year, 48 records per day - 17280 pages

//first stage: set up the year
//(in the original the "is there" message was attached to the $cyear check, so it
//printed whenever the stage was disabled - it belongs to the inner condition)
if ($cyear == 1) {
	if ($yearPage == "") {
		$p = new Page(); // create new page object
		$p->template = 'year'; // set template
		$p->parent = $pages->get(1049); // parent is the "wetter" (weather) page
		$p->name = $year; // give it a name used in the url for the page
		$p->title = $year; // set page title (not necessary but recommended)
		$p->save();
		echo "<h1>$year created</h1>";
	} else {
		echo "<h1>$year is there</h1>";
	}
}

//create the months of a year if a year has no childpages...
if ($cmonths == 1) {
	if ($yearChilds != 12) {
		$m = 1;
		while ($m < 13) {
			$p = new Page(); // create new page object
			$p->template = 'month'; // set template
			$p->parent = $yearPage; // set parent
			$p->name = $m; // give it a name used in the url for the page
			$p->title = $m; // set page title (not necessary but recommended)
			$p->save();
			echo "$m -ok, ";
			$m++;
		}
	}
} else {
	echo "<h1>To create Months switch cmonths to 1</h1>";
}

if ($cdays == 1) {
	echo "<h1>Days </h1>";
	//create the days
	$months = $yearPage->children;
	//loop the months and create 30 days every month
	foreach ($months as $month) {
		$d = 1;
		while ($d < 31) {
			$p = new Page(); // create new page object
			$p->template = 'day'; // set template
			$p->parent = $month; // set parent
			$p->name = $d; // give it a name used in the url for the page
			$p->title = $d; // set page title (not necessary but recommended)
			$p->save();
			echo "$d -ok, ";
			$d++;
		}
	}
} else {
	echo "<h1>To create Days switch cdays to 1</h1>";
}

//create the records of a day
//some functions needed for the data
//returns a random float between $min and $max; $countZero sets the decimal places (e.g. '00' => two)
function mt_rand_float($min, $max, $countZero = '0') {
    $countZero = +('1'.$countZero);
    $min = floor($min*$countZero);
    $max = floor($max*$countZero);
    $rand = mt_rand($min, $max) / $countZero;
    return $rand;
}

if ($crecords == 1) {
	$months = $yearPage->children;
	//loop the months 
	foreach ($months as $month) {
		//Loop the days
		$days = $month->children;
		foreach ($days as $day) {
			$r = 1;
			//create 48 records per day so every 30 minutes take a record...
			while ($r < 49) {
				$p = new Page(); // create new page object
				$p->template = 'record'; // set template
				$p->parent = $day; // parent is the day page
				$p->name = $r; // give it a name used in the url for the page
				$p->title = $r; // set page title (not necessary but recommended)
				//dummy data generation
				$p->temp = mt_rand_float(-10, 50, '00');
				$p->hum = mt_rand(20, 90);
				$p->pres = mt_rand(500, 1500);
				$p->light = mt_rand(1, 16);
				$p->rain = mt_rand(0, 60);
				//save datarecord
				$p->save();
				echo "$r -ok, ";
				$r++;
			}
		}
	}
}

 

So far, the next things are to create a script that builds the averages per month and year for long-term stats, and then I will try how reporting and visualisation in an easy graph work together with a DB that has around 90,000 pages holding just some integers and floats.

I will first try how an easy-to-use chart library works (https://gionkunz.github.io/chartist-js/index.html).
Since these tools all use JSON data, I think I could cache - or better, prepare - the JSON strings for the charts and see how fast it performs... I'll report again.
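Preparing the JSON could be as simple as this sketch - Chartist expects a labels/series object; the path and field names match the dummy setup above but are otherwise assumptions:

// build one day's temperature curve and cache the JSON for an hour
$day = $pages->get("/wetter/2015/5/12/"); // hypothetical path into the dummy tree
$labels = [];
$temps = [];
foreach ($day->children("template=record") as $r) {
	$labels[] = $r->title; // record number within the day
	$temps[] = (float) $r->temp;
}
$json = json_encode(['labels' => $labels, 'series' => [$temps]]);
$cache->save('chart-2015-05-12', $json, 3600); // WireCache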

But this is an interesting experiment so far - I can choose how I spend my free time between two really different worlds:
work on the backend with PW, or solder and glue things together on the hardware side... before this project I was captured only by the web... ;)

  • Like 3
Link to comment
Share on other sites

Hi @mr-fan

sorry that I'm late on this! I've done something very similar some time ago... a Raspberry Pi Zero monitoring the water + air temperature of a river every ? minutes, logging everything and showing graphs...

Loading/showing big amounts of data was the reason for building RockFinder, RockGrid and now I'm working on RockTabulator (which will replace RockGrid one day...). You should really have a look at those tools and read the related forum threads. I've also done some performance tests back then:

Comparing to findMany:

 

I just added a quick example loading 200k pages via RockFinder into a RockTabulator:

[screenshot: 200k pages loaded via RockFinder into RockTabulator]

The whole site is ready to clone and install at github: https://github.com/BernhardBaumrock/tabulator.test

 

You can also easily do custom SQL queries in the backend (see the simple php code of the example at the bottom). This way you could even do easy aggregations directly on the db (https://processwire.com/talk/topic/18983-rocksqlfinder-outdated-thread-link-to-current-version-inside/?do=findComment&comment=165807)

[screenshot: custom SQL query aggregation in the PW backend]

 

 

Load times and data transfer (gzipped) can be seen in the devtools on the right.

Finally using RockMarkup it's a piece of cake to implement any charting library in the pw backend:

 

I think I'd store everything in regular PW pages and run a daily cron that does the aggregations. Aggregations could be stored in regular pw pages as well (eg as json in a textarea field) and this data could be easily displayed as chart via RockMarkup on the page edit screen.
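For the daily cron part, the core LazyCron module (it needs to be installed) would be enough - a sketch for site/ready.php; the aggregation body is simplified and the template/field names are assumptions carried over from earlier posts:

// run the aggregation at most once a day, piggybacked on a page view
$wire->addHook('LazyCron::everyDay', function(HookEvent $event) {
	foreach (wire('pages')->find("template=month") as $month) {
		$records = $month->find("template=record");
		if (!$records->count()) continue;
		$sum = 0;
		foreach ($records as $r) $sum += $r->temp;
		$month->setAndSave('temp', $sum / $records->count());
	}
});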

If you want to, I'd be happy to work together on your project and take it as an open-source example to showcase my modules - I still think the potential of those tools is really not getting through... 6 likes for RockMarkup... that's kind of disappointing...

 

  • Like 5
  • Thanks 1
Link to comment
Share on other sites

