We have a black box that, for each input request a, outputs a computed response b.
The computation time for a given request varies over time, but in a stable way. Stable here means that it is still meaningful to apply aggregating functions like average, max, moving average, etc. This variation can be imagined as caused by internal changes to the black box that we can't observe independently.
But there is another potential, independently observable parameter that can affect computation time: input pressure, i.e. the number of inputs per second.
As with any service, there is a limit on input pressure, the capacity, beyond which the service starts slowing down drastically with each further increase; let's call that a service overload.
Things we can measure here are: the number of inputs per unit of time, the number of outputs per unit of time, and the processing time of each input (and hence averages, max, and min of these).
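As a concrete sketch of these measurements, here is a minimal passive observer, assuming we can timestamp each input and match it to its output by a request id (the class and method names are hypothetical):

```python
import time
from collections import deque

class BlackBoxObserver:
    """Passive counters on both ends of the black box (hypothetical sketch)."""

    def __init__(self, window_seconds=10.0):
        self.window = window_seconds
        self.inputs = deque()      # timestamps of observed inputs
        self.outputs = deque()     # timestamps of observed outputs
        self.proc_times = deque()  # (finish_timestamp, processing_time) pairs
        self.pending = {}          # request id -> arrival timestamp

    def on_input(self, request_id, now=None):
        now = time.monotonic() if now is None else now
        self.inputs.append(now)
        self.pending[request_id] = now

    def on_output(self, request_id, now=None):
        now = time.monotonic() if now is None else now
        self.outputs.append(now)
        start = self.pending.pop(request_id, None)
        if start is not None:
            self.proc_times.append((now, now - start))

    def _trim(self, dq, now, key=lambda x: x):
        # Drop samples that fell out of the sliding window.
        while dq and key(dq[0]) < now - self.window:
            dq.popleft()

    def rates(self, now=None):
        """(inputs/sec, outputs/sec) over the sliding window."""
        now = time.monotonic() if now is None else now
        self._trim(self.inputs, now)
        self._trim(self.outputs, now)
        return len(self.inputs) / self.window, len(self.outputs) / self.window

    def avg_processing_time(self, now=None):
        """Average processing time over the sliding window, or None."""
        now = time.monotonic() if now is None else now
        self._trim(self.proc_times, now, key=lambda p: p[0])
        if not self.proc_times:
            return None
        return sum(t for _, t in self.proc_times) / len(self.proc_times)
```

Max/min over the same window follow the same pattern as the average.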
The question is: can we design a "pressure filter" function that predicts that an overload is happening or about to happen?
Note: the black box may be capable of parallel computing, meaning that we can't rely on capacity_per_second = 1 / computation_time
Note 2: we can't inject experimental input into the black box; we can only observe and measure natural input
Note 3: natural input varies chaotically, and it can stay under the capacity of the black box
One thing I thought of is to simply detect a dependency between an increase in input pressure and an increase in processing_time, and call that an overload. I'm not sure about it, though.
Another idea is to maintain a normal_processing_time, calculated as a sliding average of processing_time; if an increase of x in input results in an increase of roughly x * normal_processing_time over the previous normal_processing_time, then an overload is happening.
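A minimal sketch of that heuristic, assuming we periodically sample the average processing time and the input rate (the smoothing factor and the 2x "abnormal" threshold are illustrative assumptions, not tuned values):

```python
class OverloadDetector:
    """EMA baseline of processing time; flags overload when the latest
    sample runs well above baseline while the input rate is also rising.
    Thresholds here are illustrative assumptions."""

    def __init__(self, alpha=0.05, ratio_threshold=2.0):
        self.alpha = alpha            # EMA smoothing factor
        self.ratio = ratio_threshold  # "abnormal" multiplier over baseline
        self.normal_time = None       # EMA of processing time
        self.prev_rate = None         # last observed input rate

    def update(self, processing_time, input_rate):
        overload = False
        if self.normal_time is not None and self.prev_rate is not None:
            rate_rising = input_rate > self.prev_rate
            abnormal = processing_time > self.ratio * self.normal_time
            overload = rate_rising and abnormal
        # Fold only non-overload samples into the baseline, so that
        # "normal" is not dragged up by the overload itself.
        if self.normal_time is None:
            self.normal_time = processing_time
        elif not overload:
            self.normal_time += self.alpha * (processing_time - self.normal_time)
        self.prev_rate = input_rate
        return overload
```

Freezing the baseline during a suspected overload is one possible design choice; letting it keep adapting would eventually make the overloaded state look "normal".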
Old description of the problem
Let's say I have a black box that does some processing: it takes an input and produces some output for that input.
I have no prior information about how fast the processing per input is, how many concurrent inputs it can process, or what its full processing capacity is.
All I can do is observe it from the outside, without interacting with it directly.
The processing speed and capacity can change over time, and they can depend on multiple factors, like the input frequency (faster or slower with higher input frequency). They can also (less frequently) change with internal factors (things inside the black box that we can't observe independently).
I need a function that can predict a potential service overload, based perhaps on the history of the black box's processing.
The question is basically whether it is possible to have a generic, smart "pressure filter" that has sensors or counters on both ends of a black box. This filter would reject inputs whenever it thinks the service is not keeping up with its input. Is this filter even conceivable, given that its errors may result in the service never being loaded even to its real capacity?
I thought about using an exponential moving average (EMA) of the input rate and of the output rate, and subtracting them to detect an overload. But this won't work, since a sudden increase in input rate still causes a false prediction.
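For concreteness, here is a sketch of that rejected rate-difference idea; the EMA update is one line, and the comparison shows why a short natural burst of inputs trips the signal even when the service is keeping up (names and thresholds are assumptions):

```python
def ema(prev, sample, alpha=0.1):
    """One step of an exponential moving average; alpha is the smoothing factor."""
    return sample if prev is None else prev + alpha * (sample - prev)

def backlog_signal(in_rate_ema, out_rate_ema, threshold):
    """The rejected idea: flag overload when the smoothed input rate
    exceeds the smoothed output rate by more than `threshold`. A brief
    burst of inputs raises in_rate_ema before outputs catch up, so this
    fires even when the service is not actually overloaded."""
    return (in_rate_ema - out_rate_ema) > threshold
```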
Another idea is to use an EMA of the processing time per input, but that won't work either, since parallel processing can't be accounted for that way.
Maybe regression analysis could help here, but I am not sure.
Maybe if we discover causality between an increase in input throughput and an abnormal increase in processing time per input, we can conclude that an overload is happening. Or maybe an abnormal increase in processing time per input alone should activate the filter.
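One way to approximate that dependency test is a rolling Pearson correlation between recent input-rate samples and recent per-input processing times: a value near +1 over a window suggests the two are moving together (which hints at, but does not prove, causation). A pure-Python sketch, with window contents assumed to be aligned samples:

```python
from statistics import mean, pstdev

def rolling_correlation(input_rates, proc_times):
    """Pearson correlation between two equal-length sample windows.
    Near +1: processing time grows with input rate (overload hint).
    Near 0 or negative: no such dependence in this window."""
    if len(input_rates) != len(proc_times) or len(input_rates) < 2:
        raise ValueError("need two equal-length windows of >= 2 samples")
    mx, my = mean(input_rates), mean(proc_times)
    sx, sy = pstdev(input_rates), pstdev(proc_times)
    if sx == 0 or sy == 0:
        return 0.0  # no variation -> no detectable dependence
    cov = mean((x - mx) * (y - my)
               for x, y in zip(input_rates, proc_times))
    return cov / (sx * sy)
```

A detector could flag overload when the correlation stays above some assumed cutoff (say 0.8) for several consecutive windows, to avoid reacting to a single noisy window.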
Is there an obvious way for solving this that I am completely missing?
At least there has been some more thinking going on :)
Just some random remarks:
First of all, you need to set some norms: what are the allowable throughput times?
Then you have to implement some test: if the throughput time gets higher than, say, 80% of the allowable throughput time, start squeezing the input (go to a "one out, one in" system or something like that).
Or, more fancy: first collect all inputs, and only pass through such an amount that the throughput times stay within the stated limits.
Maybe better, you can also first, at a lower level, remove less interesting inputs. (What are the lost-opportunity costs of a rejected input? Are the costs the same for every input, or are some inputs more valuable?)
The problem, of course, is that all this pre-processing testing adds time to the throughput time, so the cure can be worse than the illness.
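A minimal sketch of that 80%-threshold gate with a "one out, one in" fallback; all names and the 0.8 fraction are illustrative assumptions:

```python
def admission_gate(avg_throughput_time, allowable_time,
                   in_flight, max_in_flight, fraction=0.8):
    """Admit a new input only while the observed throughput time stays
    under `fraction` of the allowable limit; above that, fall back to
    "one out, one in": admit only when a slot has freed up."""
    if avg_throughput_time < fraction * allowable_time:
        return True                    # healthy: pass everything through
    return in_flight < max_in_flight   # squeezed: one out, one in
```

The check itself is a couple of comparisons, so the pre-processing overhead the remark warns about stays negligible for this particular variant.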
So, still more details are needed.
Also, I was wondering: how can you measure throughput times? Is it because the input is included in the output, and you check the time between input and output?
Under low-traffic (normal) conditions, is the throughput time for every input more or less equal, or do some inputs take (much) more time than others?
Or do you have a list against which you check each passing output? If the latter, why not pass the output back directly (and make the whole system more efficient)?