I've been trying to learn more about convolutional neural networks (coming from an SVM background), and I've been struggling to understand how decisions were made when designing some of the leading architectures like VGG, ResNet, and so on. Decisions like dropout rates, the depth of the network, the sizes of the kernels, strides, using overlapping pooling, etc. I know this is a loaded question, so maybe we can restrict it to the fire-starter: AlexNet. Section 3.4 of the linked paper gives the following reason for using overlapping pooling:
"We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit."
Is there some mathematical justification for this, or is it just throw-stuff-at-the-wall-and-see-what-sticks? I feel like I'm missing something obvious.
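For concreteness, here's my understanding of what "overlapping" means: the pooling stride s is smaller than the window size z, so adjacent windows share inputs (AlexNet uses z = 3, s = 2; non-overlapping pooling has s = z). A minimal 1-D sketch with NumPy (the helper function is just mine for illustration):

```python
import numpy as np

def max_pool_1d(x, z, s):
    """Max-pool a 1-D signal with window size z and stride s."""
    return np.array([x[i:i + z].max() for i in range(0, len(x) - z + 1, s)])

x = np.arange(8)  # [0 1 2 3 4 5 6 7]

# Non-overlapping: stride equals window size (s = z = 2)
print(max_pool_1d(x, z=2, s=2))  # [1 3 5 7]

# Overlapping, as in AlexNet: window larger than stride (z = 3, s = 2)
print(max_pool_1d(x, z=3, s=2))  # [2 4 6]
```

In the overlapping case each input can contribute to more than one output window, which is the setup the quoted claim about overfitting refers to.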