Modifications in K-Means Clustering Algorithm
B. F. Momin1, P. M. Yelmar2
1Dr. Bashirahamad F. Momin, Computer Science and Engineering Department, Walchand College of Engineering, Sangli-416415, India,
2Prashant M. Yelmar, Computer Engineering Department, S. B. Patil College of Engineering, Indapur, Maharashtra-486103, India.
Manuscript received on July 01, 2012. | Revised Manuscript received on July 04, 2012. | Manuscript published on July 05, 2012. | PP: 349-354 | Volume-2, Issue-3, July 2012. | Retrieval Number: C0778062312 /2012©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: In this study, we introduce modifications to the hard K-means algorithm so that it can be used to cluster data with categorical attributes. To handle categorical data, we propose modifications to the distance and prototype calculations. For numerical attribute values, the mean represents the cluster centre and the Euclidean distance is used as the distance measure. For categorical attribute values, the proportional representation of all categorical values (their probability) represents the centre, and the proportional weight difference is used as the distance measure. For mixed data, we discretize the numerical attributes to convert them into categorical attributes and then apply the algorithm for categorical attributes. Further modifications incorporate the combined fundamentals of rough set theory, fuzzy sets, and possibilistic membership into the k-means algorithm for purely numerical data. The same modifications are applied to the algorithms developed for categorical and mixed-attribute data. The approximation concept from rough set theory deals with uncertainty, vagueness, and incompleteness. Fuzzy membership allows efficient handling of overlapping clusters. The possibilistic approach uses the membership value of a data point in a cluster to represent the typicality of the point in that cluster, or the possibility of the point belonging to it. Noise points and outliers are less typical; hence typicality-based (possibilistic) memberships reduce their effect. To verify the performance of the algorithms, the Davies-Bouldin (DB) index and objective function values are used.
Keywords: Categorical data, clustering, fuzzy membership, k-means, possibilistic membership, rough set.
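The categorical variant described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the following is an assumption: each cluster prototype stores, per attribute, the proportion of each categorical value among the cluster's members, and the distance from a point to a prototype is taken as the sum over attributes of one minus the proportion of the point's value. The names `prototype`, `distance`, and `kmeans_categorical` are hypothetical and are not the authors' implementation.

```python
from collections import Counter


def prototype(points):
    """Per-attribute proportion of each categorical value among the points."""
    n = len(points)
    n_attrs = len(points[0])
    return [
        {v: c / n for v, c in Counter(p[j] for p in points).items()}
        for j in range(n_attrs)
    ]


def distance(point, proto):
    """Assumed measure: sum over attributes of (1 - proportion of the point's value)."""
    return sum(1.0 - proto[j].get(v, 0.0) for j, v in enumerate(point))


def build_prototypes(data, labels, k):
    """One prototype per cluster; None marks a cluster with no members."""
    protos = []
    for c in range(k):
        members = [p for p, lab in zip(data, labels) if lab == c]
        protos.append(prototype(members) if members else None)
    return protos


def kmeans_categorical(data, initial_labels, max_iter=20):
    """Hard k-means loop: rebuild prototypes, then reassign each point."""
    labels = list(initial_labels)
    k = max(labels) + 1
    for _ in range(max_iter):
        protos = build_prototypes(data, labels, k)
        live = [c for c in range(k) if protos[c] is not None]
        new_labels = [
            min(live, key=lambda c: distance(p, protos[c])) for p in data
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, build_prototypes(data, labels, k)


# Toy data with two categorical attributes and a deliberately poor start.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
labels, protos = kmeans_categorical(data, [0, 0, 0, 1])
```

Under this assumed distance, a point whose values all dominate a cluster lies at distance 0 from it, and a point sharing no values with the cluster lies at a distance equal to the number of attributes. Mixed data could be passed through the same routine after discretizing the numerical attributes into bins, as the abstract outlines.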