I might be hitting a wall here and would appreciate it if you could follow me on this.
Some people explain the Dirac delta as being 0 everywhere except at a single point, where it is infinite, and some explain it as an infinitely narrow spike.
Contradiction 1: It seems to me that these two descriptions contradict each other, because an infinitely narrow spike is not the same thing as "0 everywhere except at one point where it's non-zero". They are not the same because an infinitely narrow spike still suggests that it is non-zero for more values of $x$ than just the single value $x=0$.
Contradiction 2: If it is non-zero only at a single point and zero everywhere else, then its integral can NOT be 1.
Question 1: I'm sure you will say that the Dirac delta is not a function, which I have read many times. Then why can't we just get rid of the explanation "it's zero everywhere except at one point"?
Question 2: I don't understand what is meant by "it's a distribution or generalized function", because I don't know what either of these two concepts is.
I would appreciate it if you could explain it in simple terms.
To Question 1: There are probably two reasons why bizarre definitions of a so-called delta "function" can be found everywhere (especially in engineering and physics):
The first is historical: a fully satisfactory theory was only completed in the 1950s by Laurent Schwartz, and it is complicated enough that it is often inaccessible to many of the scientists who have to use it.
Secondly, many people in physics and engineering work with heuristics, and for those, pretending that there is a delta function works surprisingly well.
To Question 2: The Dirac $\delta$ is not a function. In particular, it is neither an "infinitesimal spike" nor a function that is $0$ everywhere except at $0$ (and infinite at $0$).
I will try my hand at explaining the Dirac $\delta$ without delving too deep into the details:
Let $F$ be the vector space of all functions $\mathbb{R}\to \mathbb{R}$. Then we define the Dirac $\delta$ by $$ \begin{aligned} \delta : F &\longrightarrow \mathbb{R} \\ f &\longmapsto f(0) \end{aligned} $$ Notice that $\delta$ is a linear map from $F$ to the real numbers: $\delta(af + bg) = a f(0) + b g(0) = a\,\delta(f) + b\,\delta(g)$. So $\delta$ is not a function $\mathbb{R} \to \mathbb{R}$, but a linear map $F \to \mathbb{R}$.
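To make the "a map that eats functions" point concrete, here is a minimal Python sketch. The names `delta`, `f`, `g` and `combo` are only illustrative choices of mine, and ordinary Python callables stand in for elements of $F$.

```python
import math

def delta(f):
    """Dirac delta as a map on functions: takes f and returns the number f(0)."""
    return f(0.0)

# Two ordinary functions R -> R (arbitrary examples).
f = math.cos
g = lambda x: x**2 + 3.0

# Linearity check: delta(a*f + b*g) should equal a*delta(f) + b*delta(g).
a, b = 2.0, -5.0
combo = lambda x: a * f(x) + b * g(x)

lhs = delta(combo)
rhs = a * delta(f) + b * delta(g)
print(lhs, rhs)  # both are -13.0
```

Note that `delta` is never evaluated at a real number $x$; its argument is always a function.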
It is a common abuse of notation to write $\int \delta (x) f(x)dx $ for $\delta (f)$. It is important to understand that the first expression is just notation for the latter (and not actually an integral).
Often we restrict $\delta$ to some subspace of $F$. The elements of such a subspace are usually called "test functions". Commonly used choices are the smooth functions that vanish outside some finite interval and the rapidly decreasing smooth (Schwartz) functions. A distribution is then a (continuous) linear map from the space of test functions to the real or complex numbers.
Now suppose we take a "narrow spike at 0 function" $\varphi_{\varepsilon} : \mathbb{R}\to \mathbb{R}$ whose width we control with the parameter $\varepsilon>0$. To be precise, I want $\varphi_\varepsilon$ to be a so-called mollifier, i.e. $\varphi_\varepsilon(x) = \varepsilon^{-1}\varphi(x/\varepsilon)$ for a fixed smooth, nonnegative $\varphi$ that vanishes outside $[-1,1]$ and has integral $1$. Further suppose that $f$ is a "nice" function (say differentiable and zero outside of some finite interval). Then we have $$ \lim_{\varepsilon \to 0}\int \varphi_\varepsilon (x) f(x) dx = f(0) = \delta (f). $$ Note that the map $f \mapsto \int \varphi_\varepsilon (x) f(x) dx $ is also a linear map sending "nice" functions to real numbers. Let's call it $\tilde{\varphi}_\varepsilon$. So we have shown above that $\tilde{\varphi}_\varepsilon$ approximates the Dirac $\delta$ as $\varepsilon$ gets small. Notice that it's not the spike function that approximates the Dirac $\delta$, but the linear map $\tilde{\varphi}_\varepsilon$.
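If you want to see this limit numerically, here is a rough sketch. The specific choices are assumptions of mine: the spike is the standard bump $\exp(-1/(1-x^2))$ rescaled to width $\varepsilon$ and normalized to integral 1, the integral is a plain midpoint rule, and the test function $f(x)=\cos(x)+x$ is not zero outside a finite interval, which does no harm here because $\varphi_\varepsilon$ vanishes outside $[-\varepsilon,\varepsilon]$. All names (`bump`, `integrate`, `phi`, `phi_tilde`) are hypothetical helpers.

```python
import math

def bump(x):
    """Unnormalized bump: smooth, positive on (-1, 1), zero elsewhere."""
    if abs(x) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - x * x))

def integrate(h, a, b, n=20000):
    """Midpoint rule for the integral of h over [a, b] with n subintervals."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))

# Normalizing constant so that each spike phi_eps has integral 1.
NORM = integrate(bump, -1.0, 1.0)

def phi(eps):
    """The spike function phi_eps: R -> R, supported on [-eps, eps]."""
    return lambda x: bump(x / eps) / (eps * NORM)

def phi_tilde(eps, f):
    """The linear map f -> integral of phi_eps(x) * f(x) dx."""
    p = phi(eps)
    return integrate(lambda x: p(x) * f(x), -eps, eps)

f = lambda x: math.cos(x) + x  # f(0) = 1

for eps in (1.0, 0.1, 0.01):
    # The printed values should approach f(0) = 1 as eps shrinks.
    print(eps, phi_tilde(eps, f))
```

The object that converges is `phi_tilde(eps, .)`, a map on functions, evaluated at a fixed `f`. The spike `phi(eps)` itself just gets taller and narrower and has no limit as an ordinary function.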