I’m a PHP developer and I have a problem calculating the perfect match between two different data sets.
I have data sets from companies, where each company defines the requirements for a specific job. I also have data sets from users, where each user can define a data set describing their skills.
Both of this datasets could hold values between $1$ and $12$.
Here’s an example of two data sets:
$\begin{align*} \text{Company} & \to [\phantom{1}4, \phantom{1}8, 12, \phantom{1}4, 10] \\ \text{User} & \to [\phantom{1}8, 10, \phantom{1}5, \phantom{1}5, \phantom{0}1] \end{align*}$
Question:
What is the best way to calculate the best matching job from a company? There were two thoughts that crossed my mind, but I don’t know which would be better, of if indeed there’s another completely different approach.
Calculate the sum of all absolute diffs. $\newcommand{\abs}{\operatorname{abs}}$
For example: $$\text{score} = \abs(4-8) + \abs(8-10) + \abs(12-5) + \abs(4-5) + \abs(10-1) = 23$$
Calculate the absolute diff between the sum of both data sets.
For example: $$\text{score} = \abs\left[(4+8+12+4+10)-(8+10+5+5+1)\right] = 9$$
Answer:
One approach could be that you could give weights to each of these skill and find the weighted variance in the sense that in your example,, the five jobs could have been weighted $w_1,w_2,w_3,w_4,w_5$, and let us say for example sake, the weights are 0.2,0.1,0.3,0.15,0.25.
Since it is the best match, we could find the mean of absolute differences, M, ${\sum_{1}^{n} w_i|x_i - x_j|}$
Then find the weighted variance = $\sum_{1}^{i=j} w_i\left(|x_i-x_j| - M)^2\right)$. In your example, M$ = 5.5$
Absolute differences are,$d_1 = 4, d_2 = 2,\cdots, d_5 = 9$
The best matched job could be the least of the following metric(weighted variance): $$\left((0.2).(4-5.5)^2+0.1.(2-5.5)^2+\cdots+0.25.(9-5.5)^2\right) = 8.45$$.
I tried with the following datasets
Company = {3,4,5,8,6}, User = {3,4,5,8,6} = Variance =0 , No change
Company = {4,4,5,8,6}, User = {3,4,5,8,6} = Variance = 0.16, One change
Company = {4,3,5,8,6}, User = {3,4,5,8,6} = Variance = 0.21, Two Changes
Kind of worked for these examples.
You could try this in your job and let me know if it worked to your satisfaction.
The below table shows the comparison of weighted variance and cosine similarity under two different but similar situations. The second table has wide difference in company and user rating of the task 1 with weight 0.05 and remainder of the tasks have the same rating. The third table has the same wide difference in company and user rating of the task 4 with weight 0.80 and the remainder of the tasks have the same rating. Cosine similarity will treat them both the same way while weighted variance will best match the second situation than the third and intuitively it is correct, because the task that mattered most in the second table has no difference and the task that mattered the least had a wide difference thus getting higher priority than the third table where the task that mattered the most had the same difference as that of the task 1 in table 2. Thus weighted variance gives you additional flexibility in rank ordering similar datasets.
Thanks
Satish