I have a huge data set that contains time series data for some servers' CPU usage per application. I need to find the busiest servers for each application and print them.
I need a method or algorithm to detect the busiest servers. One idea I had was to calculate the 95th percentile of each server's CPU usage, and if it is higher than a threshold of 80%, consider that a busy server.
Can anybody provide some insight into how else I can detect the busiest servers for an app?
Sample data looks like this:
App Server Time CpuUsage
Web web01 1/1/2015 10
Web web01 1/2/2015 4
Web web01 1/2/2015 80
Web web01 1/2/2015 40
Web db01 1/1/2015 10
Web db01 1/2/2015 50
Web db01 1/3/2015 40
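For reference, here is roughly how I would implement the 95th-percentile check (a sketch using pandas; the values just mirror the sample above):

```python
import pandas as pd

# Sample data in the format shown above.
df = pd.DataFrame({
    "App":      ["Web"] * 7,
    "Server":   ["web01"] * 4 + ["db01"] * 3,
    "Time":     pd.to_datetime(["1/1/2015", "1/2/2015", "1/2/2015", "1/2/2015",
                                "1/1/2015", "1/2/2015", "1/3/2015"]),
    "CpuUsage": [10, 4, 80, 40, 10, 50, 40],
})

# 95th percentile of CpuUsage for each server within each app.
p95 = df.groupby(["App", "Server"])["CpuUsage"].quantile(0.95)

# Flag servers whose 95th percentile exceeds the 80% threshold.
print(p95[p95 > 80])
```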
Much would depend on how you define a "busy" server.
One approach to set up your data for analysis would be to group your data hierarchically at the App level and then the Server level; then aggregate CpuUsage by date. Having done this, you have the total CpuUsage on each unique day, for each server, for each app.
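As a sketch of this aggregation step with pandas (the file name `cpu_usage.csv` is just a placeholder for wherever your data lives):

```python
import pandas as pd

# Load the data; "cpu_usage.csv" is a placeholder for your actual source.
df = pd.read_csv("cpu_usage.csv", parse_dates=["Time"])

# Group hierarchically by App, then Server, then aggregate CpuUsage by day:
# one row per (App, Server, day) holding that day's total usage.
daily = (
    df.groupby(["App", "Server", df["Time"].dt.date])["CpuUsage"]
      .sum()
      .reset_index(name="TotalCpuUsage")
)
print(daily.head())
```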
At this point you can plot a time series of the CpuUsage of each of your servers to start comparing their activity over time. You can perform spectral analysis to see if the "busyness" of your servers exhibits cyclical behavior. You can test for the presence of correlations or autocorrelation if that's of any interest to you. If you're receiving a stream of server usage data, you can even set up control charts for each server in order to monitor for anomalous activity trends, spikes, or drops going forward.
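For example, the time-series comparison might look like this with matplotlib (continuing from the `daily` frame above):

```python
import matplotlib.pyplot as plt

# One line per (App, Server): total daily CpuUsage over time.
fig, ax = plt.subplots()
for (app, server), grp in daily.groupby(["App", "Server"]):
    ax.plot(grp["Time"], grp["TotalCpuUsage"], marker="o", label=f"{app}/{server}")
ax.set_xlabel("Date")
ax.set_ylabel("Total daily CpuUsage")
ax.set_title("Daily CPU usage per server")
ax.legend()
plt.show()
```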
If you're not looking to do any exploratory work and prefer a more direct way to compare your server activity, you can perform a statistical test such as one-way ANOVA to compare the averages of the CpuUsage distributions across servers within each app; this assumes that average daily usage is a satisfactory measure of "busyness" for you.
As a first step, one-way ANOVA will test the hypothesis that the average CpuUsage is equal across your servers. For example, let's say $\mu_1$ is the average CPU usage of server web01 for the Web app, $\mu_2$ is the average CPU usage of server db01 for the Web app, etc. One-way ANOVA will check if your data is consistent with $\mu_1 = \mu_2 = ... = \mu_n$. If the test rejects this hypothesis, it suggests that at least one server's true average CpuUsage differs from the others (not that every pair of means differs). At that point, you can conduct post-hoc tests to examine the differences between pairs of servers. For example, you can use Student's t-test for each pair of servers to test whether $\mu_1 > \mu_2$, $\mu_1 > \mu_3$, etc., in order to examine the relationships between average server usage.
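A sketch of both steps with scipy, again continuing from the `daily` frame above (the post-hoc comparison here uses Welch's t-test, which doesn't assume equal variances):

```python
from scipy import stats

# Collect each server's daily CpuUsage values for one app.
web = daily[daily["App"] == "Web"]
groups = [grp["TotalCpuUsage"].values for _, grp in web.groupby("Server")]

# One-way ANOVA: H0 is that all servers share the same mean usage.
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Post-hoc: pairwise t-test between two servers (Welch's variant).
a = web.loc[web["Server"] == "web01", "TotalCpuUsage"]
b = web.loc[web["Server"] == "db01", "TotalCpuUsage"]
t_stat, p_pair = stats.ttest_ind(a, b, equal_var=False)
print(f"web01 vs db01: t = {t_stat:.2f}, p = {p_pair:.3f}")
```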
There are several routes you can take depending on your objectives, assumptions, and the quality of your data. This is just one simple approach to a quick and dirty comparison of your average CPU usage.