I made a plot of Heaps' law for my text corpus. Now I want to include an "ideal" Heaps' law curve. The Wikipedia article (https://en.wikipedia.org/wiki/Heaps%27_law) shows such a curve, but there is no indication of how the data was acquired. How can we determine the values k and b (beta)?
We read that the values are determined with the least squares method, but we do not understand what to do with this information. This is a formula we found in the Stanford IR book: vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line $\log_{10} M = 0.49 \log_{10} T + 1.64$ is the best least-squares fit, so $k = 10^{1.64} \approx 44$ and $b = 0.49$.
In this paper: https://arxiv.org/ftp/arxiv/papers/1612/1612.09213.pdf on page 4 we found this sentence: "The power exponent calculated using the least squares method was 0.5503. As it can be seen, the empirical data is poorly described by the exponential function. It should be noted that the correlation with the power law for the other languages presented in the Google Books Ngrams are even worse than for English." But how can we calculate this power exponent?
import matplotlib.pyplot as plt
import numpy as np

Werte_Y_Achse = []
Anzahl_Erstvorkommen = 0
Wortliste = set()   # set lookup is O(1); a list makes "item not in Wortliste" quadratic
Item_Index = 0
for item in tokensNLTK:
    if item not in Wortliste:          # first occurrence of this token
        Anzahl_Erstvorkommen += 1
        Wortliste.add(item)
    Werte_Y_Achse.append(Anzahl_Erstvorkommen)
    Item_Index += 1
    print(Item_Index, item, Werte_Y_Achse[-1])

token_zahl = len(tokensNLTK)
print('len Werte_Y_Achse', len(Werte_Y_Achse))
print('len tokens', token_zahl)

# graph #1 gets plotted
plt.xlabel('Tokens')
plt.ylabel('First occurrences over time')
plt.plot(np.arange(token_zahl), np.array(Werte_Y_Achse), label='Counted')
The code above shows how we got our curve, but we are clueless about how to obtain the ideal curve to compare the two graphs against.
Greetings, Charlotte :)
The Wikipedia page uses $$ V = Kn^\beta $$ So you have $p$ data points $(n_i,V_i)$, and the model is nonlinear in its parameters; you therefore need starting estimates before running a nonlinear regression.
In a first step, as you did, take logarithms to get $$\log(V)=\log(K)+\beta \log(n)$$ and define $y_i=\log(V_i)$ and $x_i=\log(n_i)$, which turns the model into $$y=\alpha + \beta x$$ The linear regression then gives the estimates $K_0=e^\alpha$ and $\beta_0=\beta$. Now you can run the nonlinear regression.
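This first step takes only a few lines of NumPy. The data below is synthetic (generated from an assumed Heaps curve with $K=44$, $\beta=0.49$ plus noise) just to make the snippet self-contained; in practice `n` and `V` would be your measured token counts and vocabulary sizes:

```python
import numpy as np

# Synthetic (n_i, V_i) data from an assumed Heaps curve with multiplicative
# noise, standing in for real corpus measurements
rng = np.random.default_rng(0)
n = np.logspace(2, 6, 50)                           # collection sizes n_i
V = 44 * n**0.49 * rng.normal(1.0, 0.02, n.size)    # vocabulary sizes V_i

# Linear least squares in log-log space: log V = log K + beta * log n
beta0, alpha = np.polyfit(np.log(n), np.log(V), 1)  # slope, intercept
K0 = np.exp(alpha)
print(K0, beta0)   # starting values for the nonlinear fit
```

With clean data this recovers values close to the generating parameters; with a real corpus the fit in log-log space gives you $K_0$ and $\beta_0$ to start the nonlinear step from.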
If you do not have access to such software, consider that you want to minimize $$SSQ(\beta)=\sum_{i=1}^p \left(K n_i^\beta-V_i \right)^2$$ where, setting the derivative with respect to $K$ to zero, $$K(\beta)=\frac{\sum_{i=1}^p n_i^\beta V_i} { \sum_{i=1}^p n_i^{2\beta}}$$
Plot the function for different values of $\beta$ around $\beta_0$ and locate roughly where the minimum is. Further zooming around the minimum will let you refine the solution. This is very simple to code.
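A minimal sketch of that grid search, again on synthetic data (the generating values $K=44$, $\beta=0.49$ are assumptions used only to produce test points):

```python
import numpy as np

# Synthetic data from V = K * n^beta with noise (K = 44, beta = 0.49 assumed)
rng = np.random.default_rng(1)
n = np.logspace(2, 6, 50)
V = 44 * n**0.49 * rng.normal(1.0, 0.02, n.size)

def K_of_beta(beta):
    # K(beta) = sum(n_i^beta V_i) / sum(n_i^(2 beta)), from dSSQ/dK = 0
    nb = n**beta
    return np.sum(nb * V) / np.sum(nb * nb)

def ssq(beta):
    return np.sum((K_of_beta(beta) * n**beta - V)**2)

# Coarse grid around the log-log estimate beta0; zoom the interval to refine
betas = np.linspace(0.4, 0.6, 201)
best = betas[np.argmin([ssq(b) for b in betas])]
print(best, K_of_beta(best))
```

Eliminating $K$ through $K(\beta)$ reduces the problem to a one-dimensional search, which is why a simple grid plus zooming is enough.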
More efficient would be to search for the zero of the derivative with respect to $\beta$. This leads to the function $$F(\beta)=\sum_{i=1}^p \log(n_i)\, n_i^\beta \left(K(\beta)\,n_i^\beta-V_i \right)=0$$ which can easily be solved using Newton's method starting at $\beta_0$, computing $F'(\beta)$ by finite differences. The iteration converges very fast.
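The Newton iteration can be sketched the same way: $F'(\beta)$ is approximated by a central finite difference, and the iteration starts from the log-log estimate $\beta_0$ (synthetic data again, with assumed generating values $K=44$, $\beta=0.49$):

```python
import numpy as np

# Synthetic data from V = K * n^beta with noise (K = 44, beta = 0.49 assumed)
rng = np.random.default_rng(2)
n = np.logspace(2, 6, 50)
V = 44 * n**0.49 * rng.normal(1.0, 0.02, n.size)

def K_of_beta(beta):
    nb = n**beta
    return np.sum(nb * V) / np.sum(nb * nb)

def F(beta):
    # F(beta) = sum(log(n_i) * n_i^beta * (K(beta)*n_i^beta - V_i))
    nb = n**beta
    return np.sum(np.log(n) * nb * (K_of_beta(beta) * nb - V))

beta = 0.49        # start from the log-log estimate beta0
h = 1e-6
for _ in range(20):
    dF = (F(beta + h) - F(beta - h)) / (2 * h)  # finite-difference F'(beta)
    step = F(beta) / dF
    beta -= step
    if abs(step) < 1e-10:
        break
print(beta, K_of_beta(beta))
```

Starting near $\beta_0$ the loop typically stops after a handful of iterations, confirming the fast convergence mentioned above.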