Statistics and logging

acd · Jun 6, 2013

I was talking to Fran this evening about statistics and logging for his ... pony... thing, and we discussed some of the technical problems involved and possible solutions. Right now he's considering a custom solution, and I suggested building it on top of a NoSQL store like LevelDB. A major issue is write throughput: ~25 parameters per sample at 65+ samples per second. He tried running it backed by MongoDB, but with just a million samples it choked and would need some serious SSD backing to keep up. LevelDB didn't have that problem (1M inserts in about 40 seconds, including JSON parsing), but it limits how many processes can have the database open: just one, unless you write a daemon on top of it.
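To put those numbers side by side, here's a quick back-of-the-envelope calculation using only the figures above. The post doesn't say whether one benchmark insert held a whole sample or a single parameter value, so the sketch reports the headroom both ways; the variable names are just for illustration.

```python
# Back-of-the-envelope check of the write rates quoted above.
PARAMS_PER_SAMPLE = 25
SAMPLES_PER_SECOND = 65        # "65+", so treat this as a floor
BENCH_INSERTS = 1_000_000      # LevelDB test: 1M inserts...
BENCH_SECONDS = 40             # ...in about 40 s, including JSON parsing

values_per_second = PARAMS_PER_SAMPLE * SAMPLES_PER_SECOND   # 1625
values_per_day = values_per_second * 86_400                  # 140,400,000
bench_rate = BENCH_INSERTS / BENCH_SECONDS                   # 25,000 inserts/s

# Headroom depends on what one benchmark insert contained (not stated):
headroom_if_insert_is_sample = bench_rate / SAMPLES_PER_SECOND   # ~385x
headroom_if_insert_is_value = bench_rate / values_per_second     # ~15x

print(f"{values_per_second} values/s, {values_per_day:,} values/day")
print(f"benchmark: {bench_rate:,.0f} inserts/s; "
      f"headroom ~{headroom_if_insert_is_sample:.0f}x per sample, "
      f"~{headroom_if_insert_is_value:.0f}x per value")
```

Either way the single-writer LevelDB numbers leave room to spare, as long as that one process can keep up with parsing.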

We discussed the advantages and disadvantages of rrdtool- and MRTG-based solutions and their derivatives like Graphite (which, if you haven't looked at it, is actually pretty sweet). Before one of us dives off the deep end and starts building a custom solution from scratch, I'd like to know what you guys use for this sort of thing and what you store. How long do you retain data, and how frequently do you collect it? Maybe there's an off-the-shelf solution that's a good fit for one or the other of us.
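For anyone who hasn't used them, the core idea behind rrdtool-style storage (and Graphite's Whisper backend) is a fixed-size round-robin archive: raw samples are consolidated into coarser rows and the oldest rows get overwritten, so disk usage is fixed up front instead of growing every year. Below is a simplified in-memory illustration of that idea, averaging like rrdtool's AVERAGE consolidation function. It is not rrdtool's file format or API; the class name and the two example layouts are hypothetical.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store that keeps only the newest `rows` consolidated points.

    Every `steps` raw samples are averaged into one row. Once the archive
    is full, appending a new row silently drops the oldest one.
    """

    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = deque(maxlen=rows)
        self._pending = []

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending.clear()


# Hypothetical layout for one parameter sampled once a second:
full_res = RoundRobinArchive(steps=1, rows=3600)                # last hour, per second
per_minute = RoundRobinArchive(steps=60, rows=60 * 24 * 365)    # last year, per minute
```

The trade-off is resolution: old data survives only as averages, which is exactly why storage stays bounded.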

Francisco · Jun 6, 2013

I think the best way to handle it is to have each node handle its own users.

With each server running its own stats accounting, about a year's worth of data points would take 3 to 5 GB per node. That's far more manageable than running a central collection node that would need 4 to 8 SSDs and probably a TB of usable space just to keep up with that much data.
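Sizing estimates like these scale linearly with the sample rate, so they're easy to sanity-check. The sketch below computes raw payload bytes per year; the 8 bytes per value is an assumption (one float64 per parameter), and the post doesn't say whether the 65+ samples per second from the first post is per node or fleet-wide.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def raw_bytes_per_year(params, samples_per_second, bytes_per_value):
    """Payload bytes per year for a fixed-rate stream.

    Ignores keys, timestamps, indexes and compression, so real
    on-disk usage will differ.
    """
    return params * samples_per_second * bytes_per_value * SECONDS_PER_YEAR


# Assumption: one 8-byte float per parameter value.
estimate = raw_bytes_per_year(params=25, samples_per_second=65, bytes_per_value=8)
print(f"~{estimate / 1e9:.0f} GB/year of raw values")  # ~410 GB/year
```

Per-record keys and timestamps come on top of that, which is how a central store can end up near the TB-a-year mark.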

Doing accounting on each node means we can increase the number of different metrics we keep.

Maybe we can utilize /dev/null sharding to improve performance?

Francisco

wlanboy · Jun 6, 2013

Interesting topic. I had a similar problem (importing about 30,000 CSV files, each about 2 MB in size, every hour). After trying a lot of different databases I stumbled upon ZeroMQ.

If the work is too heavy - distribute it.

I read somewhere that you're using PHP for your new pony stable, so this chapter might be worth reading.

Francisco · Jun 7, 2013

There are two options on the table right now:

- Central node stores all stats. Using something like LevelDB we can batch the writes and cut the overhead quite a bit. We could likely run it without much trouble on a fairly basic 4-disk RAID 10 without any SSDs.

- Each node handles its own. While this distributes the workload really nicely, it becomes an issue if:

-- we migrate the user

-- we migrate the node

-- we have to renumber their CTID (locked container, etc)
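For the central-node option, batching the writes could look roughly like this: samples are buffered in memory and handed off in one go. This is a minimal sketch; `flush_fn` is a hypothetical hook standing in for the real write (with LevelDB from Python, the plyvel binding's `write_batch()` would be one candidate, not shown here), and `max_batch` and `max_age` are illustrative knobs, not values from the thread.

```python
import json
import time


class BatchingWriter:
    """Buffer (key, sample) pairs and pass them to `flush_fn` in batches."""

    def __init__(self, flush_fn, max_batch=500, max_age=1.0, clock=time.monotonic):
        self.flush_fn = flush_fn      # hypothetical: performs the actual batched write
        self.max_batch = max_batch    # flush once this many samples are buffered...
        self.max_age = max_age        # ...or once the oldest one is this many seconds old
        self.clock = clock
        self._buf = []
        self._first_at = None

    def add(self, key, sample):
        if not self._buf:
            self._first_at = self.clock()
        self._buf.append((key, json.dumps(sample)))
        too_big = len(self._buf) >= self.max_batch
        too_old = self.clock() - self._first_at >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._buf:
            self.flush_fn(self._buf)
            self._buf = []
            self._first_at = None
```

The age check only runs when a sample arrives, so a real collector would also call `flush()` on a timer and at shutdown.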

We could write some 'migration' scripts to move data between the nodes, but I feel one wrong move and a user would end up without any stats at all.

What I'm considering right now is central collection with a local cache in case a request can't be delivered to the collector. On the next run, the system would first try to send the backlog and then the new data. That way, if a node gets nulled or for whatever reason can't contact the stats collector, the data won't be lost in the tubes.
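The store-and-forward idea (local cache, backlog sent before new data) can be sketched like this. Everything here is hypothetical: the class name, the JSON-lines spool file, and `send_fn`, which is assumed to raise `OSError` when the collector can't be reached and to accept a batch either whole or not at all.

```python
import json
import os


class SpoolingSender:
    """Send backlog first, then new data; keep everything locally on failure."""

    def __init__(self, send_fn, spool_path):
        self.send_fn = send_fn          # hypothetical transport to the central collector
        self.spool_path = spool_path    # local cache, one JSON record per line

    def _load_backlog(self):
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def _save_backlog(self, records):
        tmp = self.spool_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            for rec in records:
                fh.write(json.dumps(rec) + "\n")
        # Replace in one step so a crash mid-write leaves the old spool intact.
        os.replace(tmp, self.spool_path)

    def fire(self, new_records):
        # Backlog goes out first, then this run's data.
        pending = self._load_backlog() + list(new_records)
        try:
            if pending:
                self.send_fn(pending)
        except OSError:
            # Collector unreachable: keep everything for the next run.
            self._save_backlog(pending)
            return False
        self._save_backlog([])
        return True
```

A plain JSON-lines file keeps the cache easy to inspect by hand if a node has been nulled for a while; a very large backlog might need to be sent in chunks instead of one call.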

We're still talking upwards of a TB a year in data points. An SSD-cached setup would have no issue with this, so the hardware side of things is worked out.

At this point I think we're in the 'UI' stage, trying to decide what features users will need or want for searching this data. XCharts looks really pretty, but I couldn't get it to render nicely in Stallion. Flot looks quite fast and, with some work, can look quite nice as well.

I guess I need to sniff around other cloud providers and see what everyone offers.

Francisco
