Unification of data set using machine learning

Question


96 Views Asked by Bumbble Comm At 09 Apr 2026 - 5:26

I have just started learning data science, so please pardon any statements that do not make sense.

Consider this situation: I have a data set made up of examples containing personal information about clients (e.g. a bank database). As is common in business, I would like to match the instances of these clients to a master record. That is, I want to clean my database by linking duplicates of the same client into one master record. Different records for one client can have different typos, missing columns, etc., so this is where machine learning comes in handy.

I would like to create a model that I feed training data to, so that it learns how to match instances of the same client to one master record. For every new client instance, it would either attach it to an existing master record or create a new master record consisting of that one instance. I know something about linear regression, logistic regression, clustering and classification. I think the best technique for this problem would be clustering, since I do not know how many labels there are (how many different clients) and I need to group all instances of a client into one master record (one group). However, in the materials I studied, clustering was used to divide a data set into a small number of groups, like 4 or 16, whereas in my case there could be millions of groups.

So my question is: what is the best technique for this particular problem? Which techniques should I study to solve it?


**Bumbble Comm** · Accepted Answer

For the reasons you've mentioned, I wouldn't suggest using clustering. Off-hand, I would suggest an algorithm like this:

1. Throw each profile $p_i\in P$ into a metric space with a distance function $d(p_i,p_j)$. For instance, put the profiles in a vector space by binarizing categorical variables. You could also measure the distance between names (if there could be mistakes there) via a string kernel.
2. Construct a way to get the nearest neighbours in the space (i.e. KNN) via $d$. You could build a $k$-d tree or use approximate nearest neighbours, for instance.
3. Merge neighbours in some fashion. For instance, for each $p\in P$, consider the set of neighbours $N_\delta(p)=\{q\in P \mid d(p,q)\leq\delta\}$ that are at most $\delta$ away from $p$, and merge them into $p$ (if you have any true labels, you could use them to tune $\delta$). Then make master profiles only for the remaining $p$. Alternatively, for each $p\in P$, iteratively (a) take the closest $q\in P$ to $p$; (b) if $d(p,q)\leq\delta$, merge $p$ and $q$ (meaning, delete them and add a new point to the space that combines their features); and (c) go back to (a) if a merge occurred, otherwise move on to the next $p$.
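The three steps above can be sketched in a few lines of Python. This is only a rough illustration under several assumptions that are not part of the answer: the field names (`name`, `email`, `city`), the weights, the penalty of 0.5 for a missing value, and the threshold `delta=0.2` are all hypothetical, and a character-level similarity from the standard library's `difflib` stands in for a string kernel. The neighbour search is brute force rather than a $k$-d tree, and merging is done transitively with a small union-find structure, which is one possible reading of "merge neighbours in some fashion".

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names and weights, chosen only for illustration.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_distance(a, b):
    """Distance in [0, 1] between two field values; a missing value costs 0.5."""
    if not a or not b:
        return 0.5
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance(p, q):
    """Weighted sum of per-field distances, playing the role of d(p, q)."""
    return sum(w * field_distance(p.get(f), q.get(f)) for f, w in WEIGHTS.items())


def find(parent, i):
    """Return the representative of i's group, shortening the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_profiles(profiles, delta):
    """Group profiles that are linked by pairs with distance <= delta."""
    parent = list(range(len(profiles)))
    # Brute-force O(n^2) neighbour search; for millions of records this would
    # need blocking or an (approximate) nearest-neighbour index instead.
    for i, j in combinations(range(len(profiles)), 2):
        if distance(profiles[i], profiles[j]) <= delta:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


profiles = [
    {"name": "John Smith", "email": "john.smith@example.com", "city": "Boston"},
    {"name": "Jon Smith", "email": "john.smith@example.com", "city": None},
    {"name": "Maria Garcia", "email": "m.garcia@example.com", "city": "Madrid"},
]
print(merge_profiles(profiles, delta=0.2))  # [[0, 1], [2]]
```

Each inner list holds the indices of the records that would share one master record. With labelled duplicate pairs, `delta` and the weights could be tuned instead of being fixed by hand.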

Consider looking into metric learning, which learns the distance function from labelled examples. Also, note that there is a Data Science Stack Exchange site too.
