I have a question about the following notation:
Let $(X,Y)\in \mathbb R^d \times \{0,1\}$ have joint distribution $P$, where $X$ is a vector of features and $Y$ is the corresponding label.
Now consider the sample $(X_i,Y_i)_{i=1,\ldots,n}$, where $(X_i,Y_i)$ are independent copies of $(X,Y)$. Let $P^{\otimes n}$ be the product probability measure according to which the sample is distributed.
If the sample is i.i.d. $\sim P$, why do we distinguish $P$ and $P^{\otimes n}$? What is the difference between the two?
By writing $P^{\otimes n}$ you stress that the measure lives on the product space $(\mathbb R^d \times \{0,1\})^n$, whereas $P$ is a measure on $\mathbb R^d \times \{0,1\}$ only. As soon as you write down a formal argument about the whole sample, you have to state clearly that you are using the product measure rather than $P$ itself.
Think about how you would calculate the probability of an event of the form $$(A,y)=(A_i, y_i)_{i=1}^n,$$
that is, the event that $X_i \in A_i$ and $Y_i = y_i$ for every $i$. By definition it is an integral $$\int_{(A,y)} d{?}.$$
We are integrating over a set of $n$-tuples, so using the measure $P$ in the integral doesn't make sense: $P$ is a measure only on a single component of the tuple. That's why we need the measure $P^{\otimes n}$, which is defined on such product sets by $$P^{\otimes n} ((A,y)) = \prod_{i=1}^n P((A_i, y_i))$$ and extends uniquely to the product $\sigma$-algebra.
Now the expression $$\int_{(A,y)} dP^{\otimes n}$$ makes perfect sense, and by Fubini's theorem it is equal to $$\prod_{i=1}^n \int_{(A_i,y_i)} dP.$$
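To make the distinction concrete, here is a minimal Python sketch on a toy example. Everything in it is an assumption made for illustration: the features take values in the finite set $\{0,1,2\}$ instead of $\mathbb R^d$, the probabilities in `P` are made up, and the names `measure_P` and `measure_Pn` are hypothetical helpers. The point is only that `measure_P` accepts sets of single observations, `measure_Pn` accepts sets of $n$-tuples, and on a product set the two agree through the product formula.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Hypothetical toy distribution P on {0, 1, 2} x {0, 1}.
# Discrete features stand in for R^d so that everything can be enumerated exactly.
P = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
    (2, 0): Fraction(1, 8), (2, 1): Fraction(1, 4),
}

def measure_P(event):
    """P(B) for a set B of single observations (x, y)."""
    return sum(P[point] for point in event)

def measure_Pn(event, n):
    """P^{(x) n}(E) for a set E of n-tuples of observations.

    Each n-tuple gets the product of the probabilities of its components,
    which is what the product measure assigns to a single sample.
    """
    total = Fraction(0)
    for sample in product(P, repeat=n):
        if sample in event:
            total += prod(P[point] for point in sample)
    return total

# One set of single observations per coordinate of the sample (n = 3).
B1 = {(0, 1), (1, 1)}            # X_1 in {0, 1}, Y_1 = 1
B2 = {(2, 0), (2, 1)}            # X_2 = 2, any label
B3 = {(0, 0), (1, 0), (2, 0)}    # any X_3, Y_3 = 0

# The product set B1 x B2 x B3 is a set of 3-tuples, so only P^{(x) 3} can measure it.
rectangle = set(product(B1, B2, B3))

rect_prob = measure_Pn(rectangle, 3)
product_of_marginals = measure_P(B1) * measure_P(B2) * measure_P(B3)

print(rect_prob, product_of_marginals)
```

Passing `rectangle` to `measure_P` would fail, because its elements are triples of observations rather than single observations; that mismatch is exactly the one between $P$ and $P^{\otimes n}$.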