I have just started learning data science, so pardon me if some of these statements do not make sense.
Consider this situation: I have a data set made up of examples containing personal info about clients (e.g. a bank database). As is common in business, I would like to match the instances of these clients into master records. That is, I want to clean my database by merging duplicates of the same client into one master record. Different records for the same client can have different typos, missing columns, etc., so this is where machine learning comes in handy.
I would like to create a model that I feed training data to, so that it learns how to match instances of the same client to one master record. For every new instance of a client, it would either attach it to an existing master record or create a new master record consisting of just that one instance. I know something about linear regression, logistic regression, clustering and classification. I think the best technique for this problem would be clustering, since I do not know how many labels there are (how many different clients) and I need to group all instances of a client into one master record (one group). However, in the materials I was studying, clustering was used to divide a data set into a small number of groups, like 4 or 16. In my case there could be millions of groups.
So my question is: what is the best technique for this particular problem? What should I study in order to solve it?
For the reasons you've mentioned, I wouldn't suggest using clustering. Off-hand, I would suggest an algorithm like this:
Consider looking into metric learning. Also, note that there's a Data Science Stack Exchange site too.
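
To make the "attach to an existing master record or create a new one" idea from the question concrete, here is a minimal sketch. It is only an illustration under stated assumptions, not a definitive method: the field names (`name`, `email`, `city`), the 0.85 threshold and the helper names are all hypothetical, and a hand-written string similarity from Python's standard `difflib` stands in for the distance that metric learning would learn from labelled duplicate pairs.

```python
from difflib import SequenceMatcher

# Hypothetical columns; a real client table would have its own.
FIELDS = ("name", "email", "city")


def field_similarity(a, b):
    """Similarity in [0, 1] for one field, or None if either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1, r2):
    """Average similarity over the fields that are present in both records.

    This fixed average is a stand-in (an assumption) for a learned metric.
    """
    scores = []
    for field in FIELDS:
        score = field_similarity(r1.get(field), r2.get(field))
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


def assign_to_masters(records, threshold=0.85):
    """Attach each record to its best-matching master group, or start a new one.

    The number of groups is not fixed in advance; it grows as needed.
    """
    masters = []  # each master record is a list of the instances merged into it
    for rec in records:
        best_idx, best_score = None, 0.0
        for idx, group in enumerate(masters):
            # Compare against every instance already in the group.
            score = max(record_similarity(rec, member) for member in group)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            masters[best_idx].append(rec)
        else:
            masters.append([rec])
    return masters


if __name__ == "__main__":
    clients = [
        {"name": "John Smith", "email": "john.smith@example.com", "city": "Boston"},
        {"name": "Jon Smith", "email": "john.smith@example.com", "city": None},
        {"name": "Maria Garcia", "email": "m.garcia@example.com", "city": "Madrid"},
    ]
    for group in assign_to_masters(clients):
        print([c["name"] for c in group])
```

Note that this compares every new record against every stored instance, which would not scale to millions of clients as written. In practice one would likely restrict comparisons to a small set of candidates first (often called blocking), and the threshold would need tuning on labelled data.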