How can I split an ordered set of data by a classification for purity?


This question relates to machine learning and the creation of decision trees.

It's been about 5 years since I've done anything related to sets (or math in general), so please forgive my lack of proper symbols, notation and vocabulary and/or rudimentary knowledge (if someone is going to do a drive-by downvote, please at least tell me why). If someone can point me at some good source material, I'll certainly go read it.

Problem Definition:

Let the set $$ K = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i)\} $$ be ordered by x, where x is a continuous variable and y is a classification.

Let N be the number of subsets to create.

Subsets must maintain the order of elements in the original set and divide based on the greatest purity of y.

Every element of K must be present in one and only one subset.

Approach:

A) I could use a function to determine a subset's purity (e.g. Gini index or chi-square statistic), evaluate it for every possible way of splitting K into N subsets, and select the best split. This seems needlessly computationally intensive, but would get the best answer.
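To make approach A concrete, here is a minimal sketch of the exhaustive search. It assumes Gini impurity as the purity function and scores a split by the size-weighted mean impurity of its subsets; both choices, and the names `gini` and `best_partition`, are mine for illustration. Only the labels are passed in, already sorted by x.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Gini impurity of a sequence of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_partition(labels, n_subsets):
    """Try every way to cut `labels` into `n_subsets` contiguous, non-empty pieces.

    Assumption: a split is scored by the size-weighted mean Gini impurity
    of its pieces, and lower is better.
    """
    total = len(labels)
    best_score, best_cuts = float("inf"), None
    # Choosing n_subsets - 1 cut positions out of total - 1 gaps enumerates every split.
    for cuts in combinations(range(1, total), n_subsets - 1):
        bounds = (0, *cuts, total)
        score = sum(
            (hi - lo) / total * gini(labels[lo:hi])
            for lo, hi in zip(bounds, bounds[1:])
        )
        if score < best_score:
            best_score, best_cuts = score, cuts
    return best_cuts, best_score


# Labels y, already ordered by x (made-up toy data).
labels = list("aaaccbbbb")
cuts, score = best_partition(labels, 3)
print(cuts, score)  # (3, 5) 0.0
```

The number of candidate splits is "(len - 1) choose (N - 1)", which is why this gets expensive quickly. Weighting by subset size also penalises leaving one large, mixed subset in the middle.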

B) 1. Count the number of consecutive objects in each run of the same class and place the counts in a list G (e.g. {3a, 2c, 4b, 6a, 2b, 5c} would represent a K with 22 objects in it). 2. Use a function to assess the purity of each subset at the points in list G. Basically, it's the same approach as A, but evaluating only the transition points instead of every point (some of these transition points might fall on duplicate x values, which could lead to some contamination).
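As a rough sketch of step 1 of approach B, the list G is a run-length encoding of the labels, and the candidate cut positions are the boundaries between runs. The function names are hypothetical, and the data reproduces the 22-object example with the last run read as class c. The sketch ignores x entirely, so it does not handle the duplicate-x caveat.

```python
from itertools import groupby


def run_lengths(labels):
    """Collapse consecutive equal labels into (label, count) pairs: the list G."""
    return [(label, len(list(group))) for label, group in groupby(labels)]


def transition_points(labels):
    """Indices where the class changes; these are the only cut positions tried."""
    points, position = [], 0
    # The end of the last run is the end of the data, not a cut, so skip it.
    for _, length in run_lengths(labels)[:-1]:
        position += length
        points.append(position)
    return points


# The example from the text: 3a, 2c, 4b, 6a, 2b, 5c (22 objects).
labels = list("aaa" + "cc" + "bbbb" + "aaaaaa" + "bb" + "ccccc")
runs = run_lengths(labels)
points = transition_points(labels)
print(runs)    # [('a', 3), ('c', 2), ('b', 4), ('a', 6), ('b', 2), ('c', 5)]
print(points)  # [3, 5, 9, 15, 17]
```

Here only 5 positions need to be considered instead of 21, so the search in approach A would run over combinations of `points` rather than over every gap.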

C) It seems like there must be a way to abstract this at least one level further, so as to find the 3 subsets which, on average, are most pure.

Basically, I don't want the first set and the last set to be chosen as 2 of the sets, and then have the middle set be the rest of the data points.

Question: Is there a way to find the N sets with the highest average purity based on the data from set K?