This question is meant to be very specific: how/why is deep learning successful at learning a classification function/hyperplane, given the challenges that high-dimensional spaces pose for probability distributions and distance metrics?
Deep learning, or deep neural networks, has been a big area of research and activity in machine learning for the past few years. These models are constructed from a large number of layers of latent variables. Even in a simple convolutional neural network for image classification or object detection, it is easy to have a million or more parameters.
Now there are more than a few references that discuss how, in high dimensions, probability distributions behave in very odd ways, and distance metrics on those distributions behave oddly as well. Without getting into the details: in high dimensions everything is essentially far apart, so the probability mass becomes more diffuse over its support. Further, if you follow the sphere-packing literature, there are very odd phenomena, such as the fact that most of the volume of a high-dimensional hypersphere lies near its surface or 'skin'--as opposed to its center.
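The hypersphere observation is easy to verify numerically: since volume scales as r^d, the fraction of a unit ball's volume lying within a thin shell of thickness eps at the surface is 1 - (1 - eps)^d, which approaches 1 as d grows. A minimal sketch (the 1% shell thickness is just an illustrative choice):

```python
import math

def skin_fraction(dim, skin=0.01):
    """Fraction of a unit ball's volume within `skin` of the surface.

    Volume scales as r**dim, so the inner ball of radius (1 - skin)
    holds (1 - skin)**dim of the volume; the rest is in the shell.
    """
    return 1.0 - (1.0 - skin) ** dim

# The 1% outer shell goes from negligible to essentially everything.
for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  shell fraction={skin_fraction(d):.4f}")
```

For d = 1000 the outer 1% shell already contains more than 99.99% of the volume, which is the "all the volume is on the skin" phenomenon.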
In supervised deep learning, the loss function governs the learning process. This loss function is usually based on a distance-like measure that compares two high-dimensional distributions. So the common loss functions are measures like the cross entropy between two distributions or the KL-divergence between two distributions--strictly speaking neither is a true metric, since both are asymmetric. The idea is to quantify the distance between the probability of a point under the candidate distribution versus the actual distribution.
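To make the two losses concrete, here is a minimal sketch of cross entropy and KL-divergence between a one-hot target and a softmax-style prediction (the particular vectors are just illustrative; a real framework would compute this over a batch):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = H(p, q) - H(p); asymmetric, so not a true metric
    return cross_entropy(p, q, eps) - cross_entropy(p, p, eps)

target = [0.0, 1.0, 0.0]   # one-hot "actual" distribution
pred   = [0.1, 0.8, 0.1]   # candidate (e.g. softmax) distribution

print(cross_entropy(target, pred))  # -log(0.8), approx 0.223
print(kl_divergence(target, pred))  # equals the above for one-hot targets
```

Note that for a one-hot target the entropy H(p) is zero, so cross entropy and KL-divergence coincide--which is why classification losses are usually written as cross entropy.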
So I am just trying to understand why deep learning works so well if the high dimensionality of the data should create such odd behavior in the associated loss functions/metrics. If the probability distributions become so diffuse as dimensionality increases, then the distributions should become less informative, since there are more and more ways to obtain the same probability.
Some articles or posts I have read suggest that the usual 'manifold assumption' is at work: the high-dimensional data lives on some lower-dimensional manifold. I can understand that idea. But by that logic the curse of dimensionality should never create a problem for any statistical method on high-dimensional data--since all high-dimensional data would be intrinsically low-dimensional. So what I am looking for is a bit more precision in the analysis. How does this manifold assumption--if that is indeed the answer--operate at each level of the network so that it does not fall victim to the usual curses of dimensionality?
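The distinction between ambient and intrinsic dimension can be illustrated with a toy example: sample points from a 2-D latent space and embed them linearly in 100 dimensions. PCA (via SVD) then finds that essentially all the variance lives in 2 directions. This is a hedged sketch of the idea, not a claim about what any particular network layer does:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 points from a 2-D latent space, embedded linearly in 100-D:
# ambient dimension 100, intrinsic dimension still 2.
latent = rng.normal(size=(500, 2))
embedding = rng.normal(size=(2, 100))
data = latent @ embedding

# PCA via SVD on the centered data: the singular-value spectrum
# reveals how many directions actually carry variance.
singular_values = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
explained = singular_values ** 2 / (singular_values ** 2).sum()
print(explained[:4])  # first two carry essentially all the variance
```

Of course real image manifolds are curved rather than linear, which is part of why nonlinear, layered models are needed at all--but the point stands that methods can exploit low intrinsic dimension only when the data actually has it.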
It might be that I am just looking at the problem from the wrong angle--and this is what I wanted to validate. Say I am looking at an image segmentation problem with a 512x512 image, and I am classifying each pixel with a class label; that means I am assigning 262,144 labels. Now am I really assigning the labels over a 262,144-dimensional space, or an even higher-dimensional space because I am not counting the parameters from the lower layers of the network? Or am I just classifying over, say, a 2- or 5-dimensional space, based on the possible class label values? Or do neural networks operate like dynamic programming problems, where solving the value for each node in the network together generates some optimal solution overall?
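For what it's worth, the output stage of a typical segmentation network can be sketched as follows (the class count C=5 and the random logits are purely illustrative): the final layer emits C scores per pixel, and each pixel's label comes from a C-way softmax/argmax--so the labeling is 262,144 separate C-dimensional decisions, not one decision in a 262,144-dimensional label space.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 512, 512, 5  # image size and (assumed) number of classes

# Stand-in for a segmentation network's final layer: C logits per pixel.
logits = rng.normal(size=(H, W, C))

# Per-pixel softmax over the last axis, then an argmax per pixel:
# H * W independent C-way classifications.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
labels = probs.argmax(axis=-1)

print(labels.shape, H * W)  # (512, 512) 262144
```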
Well, the most prominent example of deep neural network classification is MNIST; see https://en.wikipedia.org/wiki/MNIST_database
The point is that an ANN with a large number of hidden layers and nodes can memorize and classify the input samples well, provided neither over- nor under-fitting takes place. The advantage of such networks is that there is no need to deal explicitly with probability distributions and metrics on the input patterns; it is all taken care of by the network.