Does this implementation use HBOS mathematics?


I'm experimenting with an unsupervised, statistics-based outlier detector called XBOS, which is built on top of the KMeans clustering algorithm. It is claimed that XBOS generates outlier scores the same way HBOS does. I'm trying to fully understand the math used behind the implementation. To the best of my knowledge, it has four functions inside the class, and I assume the implementer tried to follow the HBOS formula, but I'm not sure, which is what I'm asking here. I slightly changed the code, since it was originally scripted in class-and-method form, and kept plain functions for better readability:

import math
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

n_clusters = n
effectiveness = e
max_iter = m
kmeans_models = {}   # one fitted KMeans model per column
cluster_scores = {}  # one score Series per column

def fit(data):
    length = len(data)
    for column in data.columns:
        kmeans = KMeans(n_clusters=n_clusters, max_iter=max_iter)
        kmeans_models[column] = kmeans
        kmeans.fit(data[column].values.reshape(-1, 1))
        assign = pd.DataFrame(kmeans.predict(data[column].values.reshape(-1, 1)), columns=['cluster'])
        # initial score: fraction of samples assigned to each cluster
        cluster_score = assign.groupby('cluster').apply(len).apply(lambda x: x / length)
        ratio = cluster_score.copy()

        sorted_centers = sorted(kmeans.cluster_centers_)
        max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

        for i in range(n_clusters):
            for k in range(n_clusters):
                if i != k:
                    # distance between cluster centers, normalized to [0, 1]
                    dist = np.abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k]) / max_distance
                    # nearby clusters raise each other's score, damped by distance
                    effect = ratio[k] * (1 / pow(effectiveness, dist))
                    cluster_score[i] = cluster_score[i] + effect

        cluster_scores[column] = cluster_score

def predict(data):
    length = len(data)
    score_array = np.zeros(length)
    for column in data.columns:    # ============> here they apply the summation (sigma)
        kmeans = kmeans_models[column]
        cluster_score = cluster_scores[column]

        assign = kmeans.predict(data[column].values.reshape(-1, 1))

        for i in range(length):
            score_array[i] = score_array[i] + math.log10(cluster_score[assign[i]])  # ====> here they apply the logarithm

    return score_array
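For context, here is a small self-contained toy run of just the first scoring step in fit() (the data, n_clusters=3, and random_state are my own choices, not from the repo):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (my own); 30.0 is an obvious outlier
data = pd.DataFrame({'x': [1.0, 1.1, 0.9, 5.0, 5.2, 30.0]})
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(data['x'].values.reshape(-1, 1))
assign = pd.DataFrame(km.predict(data['x'].values.reshape(-1, 1)), columns=['cluster'])

# the initial cluster_score from fit(): fraction of samples per cluster
cluster_score = assign.groupby('cluster').apply(len).apply(lambda x: x / len(data))
print(cluster_score)  # the cluster holding 30.0 gets the smallest score, 1/6
```

So before the cross-cluster correction loop runs, a point in a small cluster already carries a small score, which the log in predict() later turns into a strongly negative outlier score.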

To the best of my knowledge, HBOS generates histograms with static or dynamic bin widths. For multivariate cases with d features, an individual histogram density is computed for each feature, and the per-feature scores are then summed up to reach a conclusion. They use a simple formula for this:

HBOS(p) = sum over i = 1..d of log( 1 / hist_i(p) )

So it means they calculate log(1/p) for each feature/column in the dataframe. So my main questions are (compared to HBOS):

  • What is the mathematical mechanism behind XBOS using clustering?
  • How does XBOS calculate the width of the histograms (or their equivalent)?
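For comparison, here is a minimal sketch of the HBOS idea described above (static equal-width bins; each histogram is normalized so its tallest bin has height 1, as in the HBOS paper — the function name and default bin count are my own):

```python
import numpy as np

def hbos_score(X, n_bins=10):
    """Sum of log10(1 / hist_i(p)) over all d features of X (n samples, d features)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):                       # one histogram per feature
        hist, edges = np.histogram(X[:, j], bins=n_bins)
        hist = hist / hist.max()             # normalize: tallest bin -> height 1
        # locate each sample's bin (clip so the max value falls in the last bin)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += np.log10(1.0 / hist[idx])  # a sample's own bin is never empty
    return scores
```

Under this formula a sample in a sparsely populated bin gets a large score, which is exactly the behaviour the cluster_score / log10 combination in XBOS seems to mimic with clusters instead of bins.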

According to the GitHub repo, it uses a similar mechanism: instead of generating histograms, it collects clustering info (e.g. effectiveness, cluster_score, cluster_centers_, max_distance), in particular dist and effect, while fitting KMeans on each column, and then sums the column scores up for outlier detection.

  • Would you explain which math formula is used in the 2nd function, def fit()?
  • Is it the same as the HBOS formula?
for i in range(self.n_clusters):
    for k in range(self.n_clusters):
        if i != k:
            dist = np.abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
            effect = ratio[k]*(1/pow(self.effectiveness,dist))
            cluster_score[i] = cluster_score[i]+effect
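In case it helps to pin the formula down, here is my numeric reading of that double loop (the toy centers and ratios are my own, not from the repo). As far as I can tell, it computes per cluster i: score(i) = ratio(i) + Σ_{k≠i} ratio(k) / effectiveness^dist(i,k):

```python
import numpy as np

centers = np.array([1.0, 5.0, 30.0])   # toy cluster centers (my own)
ratio = np.array([0.5, 1/3, 1/6])      # cluster size fractions
effectiveness = 5
max_distance = centers.max() - centers.min()

score = ratio.copy()
for i in range(3):
    for k in range(3):
        if i != k:
            # distance between centers, normalized by the total spread
            dist = abs(centers[i] - centers[k]) / max_distance
            # each other cluster adds part of its mass, damped exponentially
            score[i] += ratio[k] / effectiveness ** dist

print(score)  # the two nearby, large clusters boost each other; the far 30.0 cluster stays low
```

So my reading is that this is not the HBOS histogram density itself, but a smoothed cluster density: large clusters that sit close to cluster i raise its score, while distant ones barely contribute.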

At the end of the 3rd function, def predict(), you can see that a logarithm is applied to the cluster scores and score_array is returned. I'm trying to understand this part of the code too:

for i in range(length):
    score_array[i] = score_array[i] + math.log10( cluster_score[assign[i]] )  # apply logarithm

Any help understanding the mathematics and the clustering strategy in the current implementation would be appreciated.