Suppose that you have $n$ observations, $y_1, y_2, \ldots, y_n$, that are the result of the process $aX$, where $a$ is a real constant and $X$ is a random variable such that $X\sim\mathrm{Poisson}(b/a)$. It is the case that $0<a<1$ and $b>0$. I am trying to use maximum likelihood to estimate $a$ and $b$. I can do it for $b$, and have made some progress for $a$, but need help.
To illustrate, here's a density plot for two simulated data sets. The blue one is $b=-\ln 0.9, a=0.1$ and the pink one is $b=-\ln 0.9, a=0.001$.
The approach, suggested by spaceisdarkgreen, is that we set $x_i = y_i/a$ at the beginning, so that the log-likelihood function is:
$$\ell(\theta) = \ln \big(\frac{(b/a)^{\sum y_i / a} e^{-nb/a}}{\prod (y_i / a)!} \big)$$
$$\ell(\theta) = \sum{(y_i / a)}\ln{b/a} - nb/a - \sum{\ln [(y_i / a)!]}$$
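As a sanity check on the closed form (a sketch; the parameter values here are chosen arbitrarily), it should agree with summing Poisson log-densities at the recovered counts $x_i = y_i/a$:

```r
# The recovered counts x_i = y_i / a (rounded to undo floating-point noise)
set.seed(1)
n <- 1000; a <- 0.1; b <- -log(0.9)
y <- a * rpois(n, b / a)
x <- round(y / a)

# Closed-form log-likelihood vs. summing Poisson log-densities directly
loglik_closed <- sum(x) * log(b / a) - n * b / a - sum(lfactorial(x))
loglik_dpois  <- sum(dpois(x, b / a, log = TRUE))
all.equal(loglik_closed, loglik_dpois)  # should be TRUE
```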
Take the partial derivative w.r.t. $b$:
$$ \frac{\partial \ell}{\partial b} = \frac{\sum y_i}{ab} - n/a$$
Setting to zero and multiplying through by $a$ we get
$$ \hat b = \frac{\sum y_i}{n} $$
I can verify by simulation that this is a good estimator for $b$.
With that in the bag, we try to do the same for $a$:
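For instance (a minimal simulation sketch; the parameter values are arbitrary):

```r
# The sample mean of y should recover b, since E(Y) = a * E(X) = a * (b/a) = b.
set.seed(42)
n <- 100000; a <- 0.1; b <- -log(0.9)  # b is about 0.1054
y <- a * rpois(n, b / a)
mean(y)  # b-hat: should be close to 0.1054
```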
$$ \frac{\partial \ell}{\partial a} = -\frac{\sum{y_i}\ln(b/a)}{a^2} - \frac{\sum y_i}{a^2} + \frac{nb}{a^2} - \sum \frac{d}{da}\ln [(y_i / a)!]$$
At this point I replaced the factorial with the gamma function. I learned that the logarithmic derivative of the gamma function has its own name, the digamma function, denoted $\psi(\cdot)$, so each term of the last sum becomes $ -\frac{y_i}{a^2}\, \psi(y_i/a) $.
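(In R, $\psi$ is available directly as `digamma()`; a quick finite-difference sanity check against `lgamma`, at an arbitrarily chosen test point, confirms it is the derivative of $\ln\Gamma$:)

```r
# digamma(x) is the derivative of lgamma(x); check with a central difference.
x <- 2.5
h <- 1e-6
numeric_deriv <- (lgamma(x + h) - lgamma(x - h)) / (2 * h)
c(numeric_deriv, digamma(x))  # the two values should agree closely
```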
Multiplying through by $a^2$, substituting $b = \sum y_i / n$ (so that the $nb$ term cancels the $-\sum y_i$ term), and expanding $\ln(b/a) = \ln \sum y_i - \ln(na)$, we get:
$$ a^2\frac{\partial \ell}{\partial a} = -\sum{y_i}\ln{\sum y_i} + \sum{y_i}\ln(na) + \sum y_i\,\psi(y_i/a)$$
Setting this equal to zero, we have
$$ \sum{y_i}\ln{\sum y_i} = \sum{y_i}\ln(na) + \sum y_i\,\psi(y_i/a)$$
I can check numerically that this equality holds, which it does for relatively small values of $a$ and relatively large values of $b$:
n = 10000; a = 0.02; b = 0.1
x = rpois(n, b/a)
y = a*x
# Left side
sum(y)*log(sum(y)) # Evaluates to 6958.385 on my test run
# Right side
# There are a few NaNs in digamma(y/a) (from the y_i = 0 observations);
# these get multiplied by 0 anyway, so it doesn't matter what they're set to
dig = digamma(y/a)
dig[is.na(dig)] <- 0
sum(y) * log(n*a) + sum(y*dig) # Evaluates to 6958.799 on my test run
That's very close; but if you flip the values of $a$ and $b$, the equation is much further from holding.
My question (which has by now undergone extensive editing!) is how to proceed from here. Does what I've done look justifiable/right? Is there a way to use the last equation to solve analytically or numerically for $a$? Why does it only seem to work well for a range of values? Thanks!
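Update: as a first numerical attempt, the last equation can be treated as a root-finding problem in $a$ (a sketch; the bracketing interval below is a guess and `uniroot` needs it to straddle a sign change):

```r
# Treat  sum(y)*log(sum(y)) - sum(y)*log(n*a) - sum(y*digamma(y/a)) = 0
# as an equation in a and hand it to uniroot.
score <- function(a, y) {
  n <- length(y)
  dig <- digamma(y / a)
  dig[is.na(dig)] <- 0  # y_i = 0 terms are multiplied by 0 anyway
  sum(y) * log(sum(y)) - sum(y) * log(n * a) - sum(y * dig)
}

set.seed(1)
n <- 10000; a <- 0.02; b <- 0.1
y <- a * rpois(n, b / a)
uniroot(score, interval = c(0.001, 0.5), y = y)$root  # should land near a = 0.02
```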

Per our discussion in the comments, MLE does not seem like a pragmatic choice for estimating these parameters. Because it works using the whole probability distribution, the MLE must home in on the fact that there is a discrete set of $a$'s for which the data $\{y_i\}$ has a nonzero probability, i.e. $a$'s such that $y_i/a\in\mathbb N$ for all $i.$ But it is hard to use this information in practice, especially if there's any measurement error or misspecification. The fact that MLEs can be very tailored to the probability distribution assumed tends to make them less robust even as it improves their theoretical performance assuming the model is true.
On the other hand, method of moments might work here. We have $E(Y)=b$ and $Var(Y) =ab,$ so we can estimate $\hat b$ as the sample mean and then $\hat a$ as the sample variance divided by the sample mean. The theoretical performance is poor: it will almost always give you a value of $\hat a$ that you can theoretically tell from the data is impossible (since $y_i/\hat a$ are not all integers). But this might not be important from a practical standpoint, given that the model isn't perfect and you probably just want a distribution that fits decently.
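Concretely, the moment estimators are a one-liner (a sketch, with arbitrary parameter values):

```r
# Method of moments: E(Y) = b and Var(Y) = ab, so
# b-hat = mean(y) and a-hat = var(y) / mean(y).
set.seed(1)
n <- 10000; a <- 0.02; b <- 0.1
y <- a * rpois(n, b / a)
b_hat <- mean(y)
a_hat <- var(y) / b_hat
c(b_hat = b_hat, a_hat = a_hat)  # should come out near (0.1, 0.02)
```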
But notice this isn't a one-size-fits-all thing. In the limit where $b/a$ is small so only a handful of values for $X$ are common, your mean and variance will be fairly noisy, but your data should be pretty clearly clustered, so estimating $a$ using the clustering would probably be better than the moments.