I'm aware that there are several methods to estimate the probability distribution from samples (e.g. density estimation methods). However, I haven't been able to find the actual asymptotic costs of these methods, and I'm not sure whether these methods can be applied to the case of discrete probability distributions.
Given a discrete probability distribution $p$ on $[n]\equiv \{1,...,n\}$, I'm looking for the number of samples required to construct a distribution $q$ on $[n]$ such that $\|p-q\|_1\le\epsilon$ ($\|\cdot\|_1$ denoting here the $\ell_1$ distance). In other words, how many samples are needed to estimate the correct probability distribution within a given error $\epsilon$?
I've recently read the statement (in this paper, second paragraph of introduction) that the cost of doing this is $\Theta(n/\epsilon^2)$, but if this is a well-known result I haven't been able to find a good explanation for it.
I suppose the naive reconstruction method would be to build the histogram corresponding to the samples (that is, count the fraction of samples corresponding to each $i=1,...,n$). However, I'm not sure whether this method is optimal for the purpose, or how many samples would be required to achieve a given estimation error with it.
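To make the naive method concrete, here is a minimal Python sketch of the histogram (empirical frequency) estimator; the distribution, sample size, and function names are arbitrary choices of mine:

```python
import random
from collections import Counter

def empirical_estimate(p, N, rng):
    """Draw N i.i.d. samples from p on {0, ..., n-1} and return the
    histogram (empirical frequency) estimate."""
    n = len(p)
    counts = Counter(rng.choices(range(n), weights=p, k=N))
    return [counts[i] / N for i in range(n)]

rng = random.Random(0)
p = [0.5, 0.3, 0.15, 0.05]
q = empirical_estimate(p, N=100_000, rng=rng)
l1_error = sum(abs(pi - qi) for pi, qi in zip(p, q))
```

With $N$ this large the observed $\ell_1$ error is small; how it scales with $n$ and $N$ is exactly the question.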
Estimates using distances other than the $\ell_1$ one are of course also welcome.


Upon some reflection, here is what I think might be a possible solution.
Suppose the true distribution is $\newcommand{\bs}[1]{{\boldsymbol{#1}}}\bs p\equiv(p_1,...,p_n)$, and we draw $N>0$ samples from it. This amounts to drawing from the multinomial distribution, call it $\mathrm{Multinomial}(N,\bs p)$.
If $X\sim \mathrm{Multinomial}(N,\bs p)$, then the possible outcomes of $X$ are (representable by) sequences of $N$ integers $x_i\in[n]\equiv \{1,..., n\}$, with $i=1,...,N$. Because our goal is to estimate $\bs p$, we are interested in the number of times each $k\in [n]$ occurs in a given instance of $X$. Denote the corresponding (random) count by $\hat N_k$. The joint distribution of the counts is $$\mathrm{Pr}(\hat N_1 = N_1,\ldots,\hat N_n = N_n) = \binom{N}{\bs N}\bs p^{\bs N} \equiv \binom{N}{N_1\cdots N_n} p_1^{N_1}\cdots p_n^{N_n},$$ where $\binom{N}{\bs N}\equiv \frac{N!}{N_1!\cdots N_n!}$ denotes the associated multinomial coefficient.
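Drawing $N$ categorical samples and tabulating the counts is precisely one draw from $\mathrm{Multinomial}(N,\bs p)$, and each count has mean $Np_k$. A small numerical check (all names and parameters are my own choices):

```python
import random

def multinomial_counts(p, N, rng):
    """One draw from Multinomial(N, p), obtained via N categorical samples."""
    counts = [0] * len(p)
    for x in rng.choices(range(len(p)), weights=p, k=N):
        counts[x] += 1
    return counts

# Average the counts over many repetitions: E[N_k] should be N * p_k.
rng = random.Random(1)
p = [0.2, 0.5, 0.3]
N, trials = 1000, 500
avg = [0.0] * len(p)
for _ in range(trials):
    for k, c in enumerate(multinomial_counts(p, N, rng)):
        avg[k] += c / trials
```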
The problem at hand can be formalised as the search for a good estimator of the parameter $\bs p$. The intuition of using the histogram of the samples can be formalised as the choice of estimators $$\hat p_i \equiv \frac{\hat N_i}{N}.$$ Computing the first two moments, we find: $$\mathbb E[\hat p_i] = \frac{\mathbb E[\hat N_i]}{N} = p_i \qquad \text{(the estimator is unbiased)}, \\ \mathrm{Var}[\hat p_i] = \frac{\mathrm{Var}[\hat N_i]}{N^2} = \frac{p_i(1-p_i)}{N}.$$ We can now study the $\ell_1$ distance between the estimated and true distributions. For $n=2$, this reads $$\|\hat{\bs p}-\bs p\|_1 = |\hat p_1 - p_1 | + |\hat p_2-p_2| = 2|\hat p_1 - p_1|,$$ and thus, applying Chebyshev's inequality with deviation $\epsilon/2$, we have $$\mathrm{Pr}(\|\hat{\bs p}-\bs p\|_1 \ge \epsilon) = \mathrm{Pr}\left(|\hat p_1 - p_1|\ge \frac\epsilon2\right) \le \frac{4\mathrm{Var}[\hat p_1]}{\epsilon^2} = \frac{4p_1 (1-p_1)}{N \epsilon^2}.$$ Without further assumptions on the distribution, we should consider the worst case, i.e. treat the bound as saturated. Rearranging, we can say that, in the worst case, to estimate a binary distribution $(p_1,p_2)$ within additive error $\epsilon$ with failure probability at most $t$, it suffices to take $$N \ge \frac{4p_1(1-p_1)}{t\epsilon^2}.$$ If $p_1$ is unknown, as it is in this context, we can again take the worst case $p_1=1/2$, which gives the sufficient condition $N\ge \frac{1}{t \epsilon^2}$.
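As a numerical sanity check for the $n=2$ case: applying Chebyshev to $|\hat p_1 - p_1|\ge\epsilon/2$ bounds the failure probability $\mathrm{Pr}(\|\hat{\bs p}-\bs p\|_1\ge\epsilon)$ by $4p_1(1-p_1)/N\epsilon^2$. The sketch below (parameters and names are arbitrary choices of mine) compares the empirical failure rate with that bound:

```python
import random

def failure_rate(p1, N, eps, trials, rng):
    """Empirical probability that the l1 error of the histogram estimate
    of (p1, 1 - p1) is at least eps; note ||phat - p||_1 = 2 |phat_1 - p1|."""
    fails = 0
    for _ in range(trials):
        hits = sum(rng.random() < p1 for _ in range(N))
        if 2 * abs(hits / N - p1) >= eps:
            fails += 1
    return fails / trials

rng = random.Random(2)
p1, N, eps = 0.5, 400, 0.1
rate = failure_rate(p1, N, eps, trials=2000, rng=rng)
chebyshev_bound = 4 * p1 * (1 - p1) / (N * eps**2)  # = 0.25 here
```

Chebyshev is quite loose here: the true failure probability is roughly a two-sided $2\sigma$ tail, much smaller than the bound.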
More generally, we can consider the total variation distance $\delta(P,Q)$. This equals the maximum difference between the probabilities that $P$ and $Q$ assign to any given event. The corresponding estimation error is $$\hat\delta = \max_{S\subseteq[n]}\sum_{i\in S} (\hat p_i-p_i)\equiv \max_{S\subseteq [n]}[\hat p(S) - p(S)],$$ where the maximum is taken over all subsets of $[n]\equiv\{1,...,n\}$. For any fixed $S\subseteq [n]$, the count $N\hat p(S)$ follows a binomial distribution with parameters $N$ and $p(S)$, so $$\mathrm{Var}[\hat p(S)] = \frac{p(S)(1-p(S))}{N} \le \frac{1}{4N},$$ and Chebyshev's inequality gives, for each $S$, $$\mathrm{Pr}(|\hat p(S) - p(S)|\ge \epsilon) \le \frac{\mathrm{Var}[\hat p(S)]}{\epsilon^2} \le \frac{1}{4N\epsilon^2}.$$ However, the event $\hat\delta\ge\epsilon$ is the union of the events $\hat p(S)-p(S)\ge\epsilon$ over all $2^n$ subsets $S$, so passing to the maximum requires a union bound: $$\mathrm{Pr}(\hat \delta \ge \epsilon) \le \sum_{S\subseteq [n]}\mathrm{Pr}(\hat p(S)-p(S)\ge\epsilon) \le \frac{2^n}{4N\epsilon^2}.$$
Rearranging as before, this only shows that $N\ge 2^n/4t\epsilon^2$ samples suffice, which is exponentially worse than the $\Theta(n/\epsilon^2)$ quoted above: the union bound over all $2^n$ subsets is extremely lossy. A sharper route is to bound the expected $\ell_1$ error directly. Since $\mathbb E|\hat p_i - p_i|\le\sqrt{\mathrm{Var}[\hat p_i]}$, Cauchy–Schwarz gives $$\mathbb E\|\hat{\bs p}-\bs p\|_1 \le \sum_{i=1}^n \sqrt{\frac{p_i(1-p_i)}{N}} \le \sqrt{\frac{n}{N}\sum_{i=1}^n p_i(1-p_i)} \le \sqrt{\frac{n}{N}},$$ and Markov's inequality then yields $\mathrm{Pr}(\|\hat{\bs p}-\bs p\|_1\ge\epsilon)\le t$ whenever $N\ge n/t^2\epsilon^2$, recovering the $O(n/\epsilon^2)$ sample complexity for fixed $t$.
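For small $n$, the maximisation over all $2^n$ subsets can be carried out by brute force, and the result coincides with half the $\ell_1$ distance (the standard identity $\delta(P,Q)=\frac12\|P-Q\|_1$). A minimal check, with all names and the example distributions chosen by me:

```python
from itertools import combinations

def tv_bruteforce(p, q):
    """Total variation distance as the max over all subsets S of p(S) - q(S)."""
    n = len(p)
    return max(
        sum(p[i] - q[i] for i in S)
        for r in range(n + 1)
        for S in combinations(range(n), r)
    )

def tv_half_l1(p, q):
    """Total variation distance as half the l1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.15, 0.05]
q = [0.25, 0.25, 0.25, 0.25]
```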
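A known refinement (via Cauchy–Schwarz) bounds the expected $\ell_1$ error of the histogram estimator by $\sqrt{n/N}$, which is what drives the $O(n/\epsilon^2)$ scaling. This can be checked numerically; a sketch, with the distribution and parameters chosen arbitrarily by me (the uniform distribution is close to the worst case):

```python
import random

def mean_l1_error(p, N, trials, rng):
    """Average l1 distance between p and its histogram estimate."""
    n = len(p)
    total = 0.0
    for _ in range(trials):
        counts = [0] * n
        for x in rng.choices(range(n), weights=p, k=N):
            counts[x] += 1
        total += sum(abs(c / N - pi) for c, pi in zip(counts, p))
    return total / trials

rng = random.Random(3)
n, N = 8, 2000
p = [1 / n] * n
err = mean_l1_error(p, N, trials=200, rng=rng)
bound = (n / N) ** 0.5  # E ||phat - p||_1 <= sqrt(n / N)
```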