Unsupervised classification - feature vectors are obtained - 【StackMirror】|algorithm|text|cluster-analysis|classification|unsupervised-learning

I need to classify commercial products. You know what price comparison engines does.

We have obtained the feature vectors. They are not the best yet pretty good. My last step is classifying them without knowing how many clusters there are. So algorithms like k-means won't work since they require how many classes there are.

So here example set of feature vectors. They are in order here (as an example) but i need an algorithm which does not depend on any order.

#################################################
47 - ddr2;asus;1066;g41;am;p5qpl;775;
48 - g41;p5qpl;asus;am;ddr2;vga;anakart;
49 - intel;anakart;ddr2;1066;p5qpl;asus;am;
50 - p5qpl;ddr2;asus;am;g41;vga;anakart;
51 - ddr2;asus;1066;g41;am;p5qpl;775;
52 - g41;p5qpl;1066;am;ddr2;asus;anakart;
53 - p5qpl;ddr2;1066;am;g41;asus;sata;
54 - g41;p5qpl;1066;am;asus;ddr2;sata;
###################################################
55 - engtx480;asus;384bit;2di;gddr5;vga;16x;
56 - 2di;karti;384bit;asus;engtx480;ekran;pci;
57 - asus;engtx480;2di;vga;gddr5;384bit;16x;
58 - 2di;karti;engtx480;384bit;asus;gddr5;1536mb;
59 - engtx480;asus;384bit;2di;gddr5;vga;16x;
60 - engtx480;asus;384bit;2di;gddr5;vga;16x;
####################################################
61 - ray;blu;ihbs112;siyah;bulk;dvd;sata;
62 - ihbs112;ray;blu;on;lite;yazici;kutusuz;
63 - ihbs112;blu;ray;lite;on;siyah;bulk;
64 - blu;ihbs112;ray;lite;on;siyah;yazici;
65 - liteon;ihbs112;bd;yazma;hizi;12x;max;
66 - ihbs112;ray;blu;on;lite;bulk;dvd;
67 - etau108;dvd;siyah;lite;on;rw;ihbs112;
68 - ihbs112;liteon;bd;yazma;hizi;12x;max;
69 - ihbs112;ray;blu;lite;on;siyah;bulk;
#####################################################

When a human look it is easy to classify products with just using these feature vectors. But i need to achieve it via an algorithm. And also i need to achieve it with an algorithm which does not requires any prior information just uses feature vectors.

From the above feature vector set the 47-54 is a cluster , 55-60 another cluster and 61-69 another cluster (each cluster means a commercial product in real life). So the algorithm need to classify these correctly with just using these kind of feature vectors.

The algorithm can not be depended on the line order of the feature vectors or how many classes there will be. We don't know anything and we just have feature vectors.

Waiting your suggestions about this classification problem. Thank you.

2012-04-05 20:27
by MonsterMMORPG

Adaptive Resonance Theory is the short answer to your question. Unlike KMeans you dont need to set the number of clusters in advance. The input is a set of feature vectors either binary (ART 1 Algorithm) or continuous (ART -2A, ARTMAP etc.) and the output is classification of documents in clusters.

2012-04-05 20:48
by Yavar

Checking right now. Thanks - MonsterMMORPG 2012-04-05 20:49

Also check http://drdobbs.com/architecture-and-design/184405174 for a simple implementation of ART 1 BUT for binary inputs. You will get a feel of how it works - Yavar 2012-04-05 20:51

Yes i was looking for an example. Checking binary version right now - MonsterMMORPG 2012-04-05 20:54

Alright this won't work because "The number of prototype vectors, N, is the maximum number of clusters you wish to support. " It wants number and we don't know any number. Also the links at that article is dead - MonsterMMORPG 2012-04-05 20:55

It actually does not need any pre defined number of clusters & that is the main difference between K-Means and ART. Maybe you mis read something. ART is actually a Neural Network in true sense where in the outputs are not defined (i.e. unsupervised learning). Yes there seems to be a problem with the links - Yavar 2012-04-05 21:02

So do you know any good source which explains how this neural network working - MonsterMMORPG 2012-04-05 21:27

I can identify 3 main issues which need to be adressed.

1) From the examples you provided, it seems that all your vectors have a dimension of 7. If this should NOT be the case, you can use PCA to reduce the (unknown but bounded) number of dimensions to a fixed size. This ensures that you can use a clustering algorithm without heavy modification.

2) To overcome the fact that you don't know the cluster size, you can use DBSCAN. It requires two parameters: The minimum cluster size and the neighborhood size.

3) You need a representation space with the dimensionality of step 1) on which the clustering algorithm can operate. For this you have to think of a way to construct feature vectors from these samples. From the examples you showed, it seems that the training vectors are not arbitrary filled in terms of symbolity. It seems to me that despite what you're saying you might be able to use heuristics. However, if that's not possible just chose a numeric representation as feature values.

2014-02-09 13:08
by runDOSrun