Parameters for the Yule-Simon Distribution


I have been trying to use the Yule Distribution to generate numbers to be used for phoneme frequency. I found out about this through this paper

The formulas in the paper were not explained clearly enough for me, so I used this formula instead: $1-kB(k, p+1)$

I think that $k$ corresponds to the $x$ values, but $p$ is a mystery to me. According to Wikipedia, "The parameter $p$ can be estimated using a fixed point algorithm". I searched for that, but it led to a dead end. None of the research papers discussing this were readily available.

The results I am looking for resemble the ones on this page.

However, for $7$ total phonemes with $p=2$ and $k=x$, I get results like this:

\begin{align} 20.&353001806289978 \\ 16.&371891938209778 \\ 12.&63490087539381 \\ 9.&188689124444203 \\ 6.&101796072493233 \\ 3.&4849066497880004 \\ 1.&549306144334055 \end{align}

The results also get larger each time instead of smaller.


0
On BEST ANSWER

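It looks like there is a mistake in the formula you used: $1-kB(k, p+1)$ is the cumulative distribution function of the Yule-Simon distribution, not its probability mass function. Writing $\rho$ for your $p$, the probability mass function is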

$\displaystyle f(k;\rho )=\rho \operatorname {B} (k,\rho +1)$

The algorithm to estimate $\rho$ is available in the paper "A Fixed-Point Algorithm to Estimate the Yule-Simon Distribution Parameter" by Juan Manuel Garcia Garcia, page 5, Algorithm 1.
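To give a feel for what a fixed-point estimate of $\rho$ can look like, here is a minimal Python sketch. It is not copied from the paper. It uses the maximum-likelihood condition obtained by differentiating the log of $\rho \operatorname{B}(k,\rho+1)$ over $N$ observations $k_1,\ldots,k_N$, which rearranges to $\rho = N \big/ \sum_{i}\sum_{j=1}^{k_i} \frac{1}{\rho+j}$. Whether this matches Garcia's Algorithm 1 step for step is an assumption, and the function name, starting value, tolerance and sample data are all made up for illustration.

```python
def estimate_rho(samples, rho0=1.0, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the maximum-likelihood rho of a Yule-Simon sample.

    samples: positive integers k_i (hypothetical observed values).
    Iterates rho <- N / sum_i sum_{j=1..k_i} 1 / (rho + j).
    """
    n = len(samples)
    rho = rho0
    for _ in range(max_iter):
        denom = sum(1.0 / (rho + j) for k in samples for j in range(1, k + 1))
        new_rho = n / denom
        if abs(new_rho - rho) < tol:
            return new_rho
        rho = new_rho
    return rho  # last iterate if the tolerance was not reached


# Made-up sample, roughly shaped like a heavy-tailed count distribution.
samples = [1] * 40 + [2] * 10 + [3] * 4 + [4] * 2 + [5] * 2 + [7] + [12]
rho_hat = estimate_rho(samples)
print(rho_hat)
```

Note that if every observation equals $1$, each step of the iteration increases $\rho$ by exactly $1$, so it never settles; the sketch then just returns the last iterate after `max_iter` steps.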

0
On

The frequencies expressed in the second link look like percentages, so they always sum to a value less than $100$. Let's replicate these values using the Gusein-Zade equation (4) in the paper cited in your first link: $$F_r = \frac{\log (n+1) - \log r}{n}, \quad r \in \{1, 2, \ldots, n\}.$$ Here, $r$ represents the rank of the phoneme, and $n$ represents the total phoneme count. So the first entry of the table in the second link corresponds to $n = 7$, and we compute $$F_1 = \frac{\log 8 - \log 1}{7} \approx 0.297063.$$ Note, all logarithms are natural (base-$e$). Multiplying this frequency by $100$ gives the first percentage (within rounding error) in the table in the second link.
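As a quick check of the computation above, the following Python sketch evaluates $F_r$ for $n = 7$ and prints the values as percentages. The function name `gusein_zade` is my own choice for illustration.

```python
import math


def gusein_zade(n):
    """Return [F_1, ..., F_n] with F_r = (log(n + 1) - log(r)) / n (natural log)."""
    return [(math.log(n + 1) - math.log(r)) / n for r in range(1, n + 1)]


freqs = gusein_zade(7)
for rank, f in enumerate(freqs, start=1):
    # Multiply by 100 to compare with a table given in percent.
    print(f"rank {rank}: {100 * f:.2f}%")
```

The values decrease with rank, which is the behaviour you were expecting to see.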

Now that we know how to replicate the table in the second link using the formula from the paper in the first link, we now proceed to model the Yule distribution for fixed $\rho$; e.g. we simply want to calculate the probability mass function for a given case, say $\rho = 2$. Then $$F_r = \rho \operatorname{B}(r, \rho + 1) = \rho \frac{\Gamma(\rho + 1)\Gamma(r)}{\Gamma(\rho + r + 1)} = 2 \frac{\Gamma(3)\Gamma(r)}{\Gamma(r+3)} = \frac{4 (r-1)!}{(r+2)!} = \frac{4}{(r+2)(r+1)r}, \quad r \in \{1, 2, \ldots\}.$$ Then $$F_1 = 2/3, \quad F_2 = 1/6, \quad F_3 = 1/15,$$ and so forth. In Mathematica, this can be calculated using the expression

F[n_, rho_] := rho * Table[Beta[r, rho + 1], {r, 1, n}]

or you can use WolframAlpha with a similar syntax. The problem of estimating $\rho$ from real-world data is addressed in the paper by Garcia as noted in the other answer to your question.
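If Mathematica is not at hand, the same calculation can be sketched in Python using only the standard library. The helper names below are my own, and the Beta function is built from `math.lgamma`. The sketch also evaluates $1 - k\operatorname{B}(k, \rho+1)$, the expression from the question, to show that it equals the running sum of the probabilities (the cumulative distribution function) and therefore increases with $k$.

```python
import math


def beta(a, b):
    """Beta function B(a, b) computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def yule_simon_pmf(k, rho):
    """Probability mass function: rho * B(k, rho + 1), for k = 1, 2, ..."""
    return rho * beta(k, rho + 1)


def yule_simon_cdf(k, rho):
    """Cumulative distribution function: 1 - k * B(k, rho + 1)."""
    return 1 - k * beta(k, rho + 1)


rho = 2.0
pmf = [yule_simon_pmf(k, rho) for k in range(1, 8)]
cdf = [yule_simon_cdf(k, rho) for k in range(1, 8)]

for k, (p, c) in enumerate(zip(pmf, cdf), start=1):
    print(f"k={k}: pmf={p:.6f}  cdf={c:.6f}")
```

For $\rho = 2$ the probabilities agree with the closed form $4/\big(r(r+1)(r+2)\big)$ derived above.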