$$0 \le i<m$$ $m > 1000$ is the number of points for which a hull containment test is performed.
$$0 \le j < n$$ $n > 1000$ is the number of faces on a convex hull.
$0 \le k \le 3$ is the spatial dimension.
$E_{j,k}$ has hull normals in $k \in \lbrace 0, 1,2 \rbrace $ and a plane offset in $k=3$.
$P_{i,k}$ is the set of points to test.
$$s_i = \text{sgn} \max_{j=0} ^{n-1} \left( E_{j,3} + E_{j,0..2} \cdot P_{i,0..2} \right) $$
If $s$ is negative then its point in $P$ is fully contained in hull $E$. $s_i$ is the output indicator sign for each test point. The case where $s=0$ doesn't matter much so long as it's consistent (i.e. a boolean output array evaluating $s < 0$ is equally as useful).
My first attempt at evaluating this in Numpy added a fourth column in $P_{i,3} = 1$ to use it as a homogeneous transformation matrix in the product $E P^T$ and then applied the max and sgn as above. This works but it is inefficient for the case where many points are outside of the hull.
My second attempt partitions the matrix product over sub-matrices $E'$ of size $(n',4)$ with $1 \le n' \ll n$ while retaining the homogeneous form of $P$. On each iteration, it evaluates $E' P'^T$ then the same max and sgn. It discards point columns from $P'$ that are found to be outside of the hull. This helps with memory since the product is smaller due to smaller (fixed-size) $E'$ and monotonically shrinking $P'$; the shrink rate depends on the spatial distribution of test points.
Is there any way to reformulate this to be faster? The sign of the inner dot product seems somewhat "fundamental" but perhaps there's a way to more early termination as was done with the partitioning scheme above.