In parallel computations I frequently use the following distribution of data to processors: having a sequence of $n$ elements $x_1,\ldots,x_n$ and $P$ processors $p_1,\ldots,p_P$, processor $p_k$ handles the elements $x_{\mathit{first}(k)},\ldots,x_{\mathit{last}(k)}$, where
$$ \mathit{first}(k)=\left\lceil (k-1)\frac{n}{P}+\frac{1}{2} \right\rceil, \quad \mathit{last}(k)=\left\lceil k\frac{n}{P}-\frac{1}{2} \right\rceil. $$
This provides a balanced distribution in which every processor handles either $\lfloor n/P \rfloor$ or $\lceil n/P \rceil$ elements, i.e. the same number up to one. The element $x_i$ is then located in a local array of processor $p_{\mathit{proc}(i)}$ at position $\mathit{pos}(i)$, where
$$ \mathit{proc}(i)=\left\lfloor \left(i-\frac{1}{2}\right) \frac{P}{n} \right\rfloor + 1, \quad \mathit{pos}(i)=i-\mathit{first}(\mathit{proc}(i)). $$
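For concreteness, the four functions above can be sketched as follows (a minimal Python sketch; the function names simply mirror the formulas, and the floating-point divisions `n / P` and `P / n` are deliberately kept as in the definitions):

```python
import math

def first(k, n, P):
    # first(k) = ceil((k-1) * n/P + 1/2)
    return math.ceil((k - 1) * (n / P) + 0.5)

def last(k, n, P):
    # last(k) = ceil(k * n/P - 1/2)
    return math.ceil(k * (n / P) - 0.5)

def proc(i, n, P):
    # proc(i) = floor((i - 1/2) * P/n) + 1
    return math.floor((i - 0.5) * (P / n)) + 1

def pos(i, n, P):
    # pos(i) = i - first(proc(i))
    return i - first(proc(i, n, P), n, P)

# Example: n = 10 elements on P = 3 processors
n, P = 10, 3
print([(first(k, n, P), last(k, n, P)) for k in range(1, P + 1)])
# -> [(1, 3), (4, 7), (8, 10)]
```

Note that every processor gets 3 or 4 elements here, and each $x_i$ is consistently found again via $\mathit{proc}$ and $\mathit{pos}$, at least for these small inputs; the worry in the question is whether this remains true for all $n$ and $P$ under floating-point rounding.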
My problem is that I am not sure whether the values of $\mathit{proc}$ and $\mathit{pos}$, when computed with floor/ceil on floating-point numbers within a program and thus subject to rounding errors, may differ from their exact mathematical values. I also wonder how to analyze this problem: is there a mathematical approach for this kind of analysis/proof?
Basically what you are trying to do is split $(x_n)$ into chunks of size roughly $n/P$ (as written, your formulas go through floating-point quotients such as $n/P$ and $P/n$, which is exactly where rounding errors can creep in). I suggest computing $$l := \lfloor n/P \rfloor$$ exactly, using an integer (non-floating-point) algorithm of your choice, and doing the same for $$r := n\ \mathrm{mod}\ P.$$ Now you have two integers with $n = P\cdot l + r$, so the load can be determined directly: the first $r$ processors take $l+1$ elements each, and the remaining $P-r$ processors take $l$ elements each.
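A minimal sketch of this in Python, where integer division `//` and `%` are exact, so no rounding errors can occur:

```python
def loads(n, P):
    # Exact integer arithmetic: no floating point involved.
    l = n // P        # l = floor(n / P)
    r = n % P         # r = n mod P, so n = P*l + r
    # The first r processors take l + 1 elements, the rest take l.
    return [l + 1 if k <= r else l for k in range(1, P + 1)]

print(loads(10, 3))  # -> [4, 3, 3]
```

By construction the loads sum to $P\cdot l + r = n$ and differ by at most one, which is exactly the balance property you want, now guaranteed without any floating-point reasoning.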
If $P$ also grows, each processor can compute its own block from $l$ and $r$ in parallel in $\mathcal O(1)$. Only $l$ and $r$ need to be precomputed (feasible in $\mathcal O(\log n)$ bit operations):
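A sketch of that $\mathcal O(1)$ per-processor computation, assuming the first $r$ processors get the larger chunks (this orders the blocks differently from the $\mathit{first}/\mathit{last}$ formulas in the question, but yields the same balance):

```python
def my_range(k, l, r):
    # Processor k (1-based) computes its own closed interval [first, last]
    # in O(1) from the precomputed integers l = n // P and r = n % P.
    if k <= r:
        first = (k - 1) * (l + 1) + 1
        last = first + l            # l + 1 elements
    else:
        first = r * (l + 1) + (k - r - 1) * l + 1
        last = first + l - 1        # l elements
    return first, last

# n = 10, P = 3  =>  l = 3, r = 1
print([my_range(k, 3, 1) for k in (1, 2, 3)])
# -> [(1, 4), (5, 7), (8, 10)]
```

The intervals tile $1,\ldots,n$ with no gaps or overlaps, and since every quantity is an integer, the result is exact by construction.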