How to decide which pair is more relevant to each other

This is my first question on math.stackexchange, so I hope I am not violating the rules of the site. I come from a CS background. I have large text files in 4 different languages, and for each token in the files I have its corresponding part-of-speech (POS) tag. Now I need to compare these languages and select the one that is closest to my source language. As an example, let's say I have 4 options (O1, O2, O3, O4). The POS tag set does not change across the languages. Let's say I have three POS tags: Noun, Adjective, Verb. The counts are as follows:

     Noun   Adjective   Verb   Total # of words
O1   5      3           2      10
O2   10     1           0      11
O3   2      5           5      12
O4   4      3           9      16

Based only on the information in this table, I want to decide which of O2, O3, O4 is likely to be the most similar language to O1. (Since these numbers describe POS distributions, I am looking for morphological similarity specifically.)

As the simplest idea, I imagine I need to compare the probabilities, but since there is more than one probability (such as adj/total_word_num or verb/total_word_num, ...), I couldn't come up with a concrete solution.

I am aware that from these numbers alone it might not be possible to find the ground-truth answer, but unfortunately this is the only information I have.

There is 1 best solution below

There is no unique, God-given way to decide which of three vectors is closest to a given vector. You have to choose a way to measure distance, and there are many ways to do that.

The two that are perhaps the most common are the sum of the absolute differences between coordinates (the Manhattan, or L1, distance) and the square root of the sum of the squares of these differences (the Euclidean, or L2, distance).
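As a rough illustration, here is a minimal Python sketch that applies both distances to the table in the question. One assumption that is not part of the answer itself: each row is first divided by its total, so that the vectors being compared are the POS proportions the asker mentions rather than raw counts (otherwise the length of each text would influence the distance). The helper names `proportions`, `manhattan`, and `euclidean` are made up for this example.

```python
import math

# POS counts from the question's table, in the order (Noun, Adjective, Verb).
counts = {
    "O1": (5, 3, 2),
    "O2": (10, 1, 0),
    "O3": (2, 5, 5),
    "O4": (4, 3, 9),
}


def proportions(row):
    """Turn raw counts into relative frequencies (assumption: compare shares, not counts)."""
    total = sum(row)
    return [c / total for c in row]


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Square root of the sum of the squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


source = proportions(counts["O1"])
candidates = {name: proportions(row) for name, row in counts.items() if name != "O1"}

l1 = {name: manhattan(source, p) for name, p in candidates.items()}
l2 = {name: euclidean(source, p) for name, p in candidates.items()}

closest_l1 = min(l1, key=l1.get)
closest_l2 = min(l2, key=l2.get)

for name in candidates:
    print(f"{name}: L1 = {l1[name]:.3f}, L2 = {l2[name]:.3f}")
print("closest by L1:", closest_l1)
print("closest by L2:", closest_l2)
```

With these particular counts, both distances happen to pick O3 as the closest to O1. That agreement is not guaranteed in general: a different choice of distance, or skipping the normalization step, can change the ranking.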