Estimating the duration of a computationally-intensive process


I'm not sure if it's even possible to create a mathematical model given the data I have, but I'll post this question with optimistic hope.

I'm running a computational process for which I do not have access to the source code (it relies on Microsoft's closed-source code).

The process finds duplicate lines (rows) in a text file and reports the duplicates along with the count of the number of occurrences for each duplicate.

Specifically, the command to do this, in Windows PowerShell, is:

 Get-Content .\input.txt | Group-Object | Select Name, Count > output.txt
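To make the computation concrete, here is a minimal Python sketch of what that pipeline does (the input lines here are hypothetical stand-ins for the contents of `input.txt`):

```python
from collections import Counter

# Hypothetical stand-in for the lines of input.txt.
lines = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]

# Count duplicate lines, mirroring Group-Object | Select Name, Count:
# each distinct line becomes a Name with its occurrence Count.
counts = Counter(lines)

for name, count in counts.items():
    print(f"{name}\t{count}")
```

Note that a hash-based count like this is linear in the number of lines; whether the PowerShell pipeline behaves that way is exactly what the timings below probe.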

I ran timing tests on samples of increasing size, with the median line length at 50 characters in every case:

- 5,000 lines: approximately 1 second
- 50,000 lines: approximately 2 minutes
- 75,000 lines: approximately 3.5 minutes
- 100,000 lines: approximately 8 minutes

I'm now running the same process on a file with 2,000,000 lines (with the same median line length).

It appears that PowerShell is limited to 4 GB of RAM, so the threads performing this task are using the maximum memory available to PowerShell.

This process has been running for over 19 hours. There is no progress indicator.

I'm trying to estimate the time it will take for this process to complete.

The challenge is that the duration of such a process does not depend linearly on the size of the input.

I'm trying to make a reasonable guesstimate, but I'm struggling to settle on an appropriate mathematical expression to produce such an estimate.

I'm hoping someone here will be able to recommend an expression, and perhaps explain why they think their recommendation is a good choice.

1 Answer

You don't have many sample points, but I'll still share some speculation; just know that its credibility may be undermined by the lack of data.

I plotted your data on a log-log plot of time (sec) vs. size and noticed that the first point is abnormally large, while the other three form a nice line. I suspect there's a large amount of fixed overhead (pre-sorting?) that dominates the running time until the input grows large enough. Ignoring that data point, I can draw a linear regression line through the remaining three points, which fits extremely well, with an $R^2$ value of $0.99$. The equation of the line is

$$ Y = 1.98 X - 7.24 $$

However, remember that this is a log-log plot, so $Y = \log(y)$ and $X = \log(x)$. Substituting that in, we get:

$$ \begin{align} \log(y) &= 1.98 \log(x) - 7.24 \\ \log(y) &= \log(x^{1.98}) - 7.24 \\ y &= 10^{\log(x^{1.98}) - 7.24} \\ y &= 10^{\log(x^{1.98})}10^{-7.24} \\ y &= 10^{-7.24}x^{1.98} \end{align} $$

That means the running time is dominated by a term of order $1.98$; most likely, the algorithm dominating the running time is quadratic in the input size.

Plugging into this (very rough) estimate, I get that for $x = 2{,}000{,}000$ the runtime is around $2$ days ($47.8$ hours). Bear in mind that this is extrapolated from just three data points.
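The fit and the extrapolation can be reproduced with a short script (a sketch using plain least squares; the exact coefficients differ slightly from the $1.98$ and $-7.24$ above depending on how the minute timings are rounded to whole seconds):

```python
import math

# Timing data (lines, seconds); the 5,000-line point is excluded as an
# outlier dominated by fixed startup overhead, per the discussion above.
samples = [(50_000, 120), (75_000, 210), (100_000, 480)]

# Ordinary least squares on the log-log data: log10(y) = m*log10(x) + b
xs = [math.log10(n) for n, _ in samples]
ys = [math.log10(t) for _, t in samples]
k = len(samples)
mx, my = sum(xs) / k, sum(ys) / k
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - m * mx
print(f"slope = {m:.2f}, intercept = {b:.2f}")  # slope comes out close to 2

# Extrapolate to the 2,000,000-line input
pred_secs = 10 ** (m * math.log10(2_000_000) + b)
print(f"predicted runtime = {pred_secs / 3600:.1f} hours")
```

With these rounded timings the slope lands near $2$ and the prediction in the neighborhood of $40$–$50$ hours, consistent with the estimate above.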