Sunday, March 25, 2012

Algorithm used to determine the number of clusters

Hello.

What criterion does the MS clustering algorithm use to determine the number of clusters when 0 is specified in the algorithm parameters?

The problem is that the automatically determined number of clusters looks somewhat strange to me, especially when the expectation maximization algorithm is used. I tried to "manually" calculate the optimal number of clusters in my models using the Bayesian information criterion and the Akaike information criterion, and got results that were easier to understand.
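For reference, the kind of manual calculation mentioned above can be sketched roughly as follows. This is only an illustration of the standard AIC and BIC formulas, not the heuristic the MS algorithm uses. It assumes a Gaussian mixture with full covariance matrices when counting parameters; the function names and the log-likelihood values in the usage example are hypothetical and would normally come from models fitted with each candidate cluster count.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2p - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: p ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def gmm_param_count(n_clusters, n_features):
    """Free parameters of a full-covariance Gaussian mixture (an assumption).

    (k - 1) mixing weights + k * d means + k * d * (d + 1) / 2 covariance terms.
    """
    k, d = n_clusters, n_features
    return (k - 1) + k * d + k * d * (d + 1) // 2


def best_cluster_count(log_likelihoods, n_features, n_samples, criterion="bic"):
    """Return (best k, scores per k) for a dict {k: fitted log-likelihood}."""
    scores = {}
    for k, ll in log_likelihoods.items():
        p = gmm_param_count(k, n_features)
        if criterion == "bic":
            scores[k] = bic(ll, p, n_samples)
        else:
            scores[k] = aic(ll, p)
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Illustrative numbers only, not real results: log-likelihoods of
    # mixtures fitted with 1..5 clusters on 500 two-dimensional cases.
    fitted = {1: -2100.0, 2: -1900.0, 3: -1780.0, 4: -1775.0, 5: -1772.0}
    for name in ("bic", "aic"):
        k, scores = best_cluster_count(fitted, n_features=2, n_samples=500,
                                       criterion=name)
        print(name, "selects", k, "clusters")
```

Both criteria reward a higher log-likelihood and penalize extra parameters; BIC's penalty grows with the number of cases, so it tends to pick fewer clusters than AIC on large data sets.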

Thank you very much in advance.

We do not specifically document the heuristic used in this case. The heuristic is intended for scenarios where users (often new to data mining) don't have specific expectations to be met: it provides a useful guess for these scenarios, which are often rather diverse. We do not guarantee that it returns the optimal number of clusters in any given scenario.

Note that even setting the number of clusters explicitly is still an approximation, albeit for different reasons. As Books Online describes it, CLUSTER_COUNT "specifies the approximate number of clusters to be built by the algorithm. If the approximate number of clusters cannot be built from the data, the algorithm builds as many clusters as possible."

For the future, would it be interesting or useful to have detailed, specified (and perhaps selectable) methods for cluster counts, such as those you used yourself?

hth

Yes, I think it would be interesting to have specified and selectable methods for automatic cluster selection. The reason is that cluster analysis is often used for unsupervised learning before any other methods are applied. Thus, in the first step, little information about the data is available and one is interested in the natural grouping or clustering of the data. So the question "how many natural clusters exist in the data" may be no less important than "how is the data distributed within each cluster". That's why I think it would be useful to have some influence on the algorithm used to answer this question.
