HIERARCHICAL CLASSES ANALYSIS: A SET THEORETICAL CLUSTERING TECHNIQUE
Jacques MYLLE
Department of Psychology, Royal Military Academy
Brussels, Belgium
The simplest way to describe a set of objects with their attributes is with a binary code: "1" stands for: the object possesses the attribute and "0", obviously, the object does not possess the attribute. Each single object can be characterized by a vector of "1" and "0" of length n if there are n attributes. The same holds mutatis mutandis for each attribute; that is a vector with "1" and "0" of length m if there are m objects. A way of representing both simultaneous is in a binary two-way matrix with m rows and n columns. But how to render the relations between a set of objects and a set of attributes more transparent? In other words is it possible to define subsets of objects which have a common set of attributes and represent the relationship between those subsets in an exhaustive way?
De Boeck and Rosenberg (1988) have proposed a solution for this problem. They have developed a categorical model for object by attribute data and this model represents the set-theoretical structure of the data - i.e. for objects and attributes simultaneously - in a numerical and in a graphical way. To analyze any empirical dataset, accordingly to this model, De Boeck has developed an algorithm that is implemented in a software program called HICLAS.
In the next paragraph the basic principles of the model will be presented. In a second part the analysis methods will be explained and finally an exhibition of the program Hiclas will be given.
A small hypothetical example will illustrate the theoretical assertions: Suppose a number of applicants for the Army who agree to serve as platoon leader in some Arms but not in others. Who likes to serve in which Arm is given in Table la.
Table la. The observed data Applicants x Arms
A T M P E M A T S A I K I A N E R P T S |
|
Simon |
1 1 1 1 0 0 1 0 0 0 |
Carin |
1 1 0 0 1 1 1 0 1 0 |
Susan |
1 1 0 0 1 1 1 0 1 0 |
Linda |
1 1 1 1 1 1 0 1 0 0 |
Robert |
1 1 1 1 1 1 0 1 0 0 |
Ralf |
1 1 1 1 1 1 1 1 0 0 |
A1 |
1 1 1 1 1 1 1 1 0 0 |
Fred |
1 1 1 1 1 1 1 0 1 0 |
Bill |
1 1 1 1 1 1 1 0 1 0 |
Tom |
1 1 1 1 1 1 1 1 1 0 |
Chris |
1 1 1 1 1 1 1 1 1 0 |
Donald |
0 0 0 0 0 0 0 0 0 0 |
AI (Armored Infantry), TK (Tanks), MI (Mechanized Infantry),
PA (Paratroopers), EN (Engineer), ME (Mechanics), AR (Artillery),
TR (Transport), ST (Signal Troops), AS (Admln Services).
Their choice is implicitly based on, for reasons of simplicity, say three motives: they prefer a combat related job (= 1) or not (= 0), they want to lead a relatively large group of people (= 1) or not (= 0) and, finally, they like a high level of technical sophistication (= 1) or not (= 0). This motives are latent and thus not known by the recruiting services and perhaps even not by the applicant. Table lb shows which motives apply to which applicant on the one hand and table lc shows which motives apply to which Arm on the other hand. For the moment being, they accept to serve in a certain Arm if at least one criterion is met. In other words their decision rule is disjunctive: or combat-related or many subordinates or sophisticated weapon systems suffices to apply for the Arm considered.
|
Table lb |
Table lc |
||
|
|
C N L |
|
C N L |
|
Simon |
0 1 0 |
Armored Infantry |
1 1 1 |
|
Carin |
1 0 0 |
Tanks |
1 1 1 |
|
Susan |
1 0 0 |
Mechanized Infantry |
0 1 1 |
|
Linda |
0 0 1 |
Paratroopers |
0 1 1 |
|
Robert |
0 0 1 |
Englneer |
1 0 1 |
|
Ralf |
0 1 1 |
Mechanics |
1 0 1 |
|
A1 |
0 1 1 |
Artillery |
1 1 0 |
|
Fred |
1 1 0 |
Transport |
0 0 1 |
|
Bill |
1 1 0 |
Signal Troops |
1 0 0 |
|
Tom |
1 1 1 |
Admin Services |
0 0 0 |
|
Chris |
1 1 1 |
|
|
|
Donald |
0 0 0 |
|
|
CR (Combat Related Job), NS (Number of Subordinates), LS (Level of Sophistication)
Basic features of the model
The Hiclas model is characterized by three relations:
1) Equivalence.
Any object can be described by an attribute set. All objects which have the same attribute set are defined as "equivalent" and constitute an object class. E.g. Linda and Robert have the same attribute pattern: they all accept to serve in Transport, Mechanized Infantry, Engineer, Mechanics, Armored Infantry and Tanks. Simon constitutes a class on his own while no other object has the same attribute pattern. Donald is a particular case because none of the Arms satisfies his motives. Therefore he belongs to the "null class".
Attribute classes can be defined in the same way on the basis of the object sets that have the same attributes. E.g. Mechanized Infantry and Paratroopers are an attribute class because these attributes pertain to all objects classes except for the class with Carin and Susan. Administrative Services constitute a particular class because none of the candidates apply for it and is thus also a "null class".
2) Hierarchies of classes.
The attribute set of some object classes are a subset of the attribute set of one or more other object classes. The latter are then hierarchically higher than the former. E.g. the class with Fred and Bill is hierarchically higher than the class with Carin and Susan and higher than the class with Simon: Fred and Bill have not only the same attributes as Carin and Susan but also the attributes of Simon. Conversely, Engineer and Mechanics is hierarchically higher than Signal Troops and Transport because who applies for the former also applies for the latter. It should be noticed that the resulting order is a partial order; this means that some classes cannot be compared to one other. E.g. the class with Ralf and Al is hierarchically higher than the class with Simon because the attribute set of the latter is a subset of the attribute set of the former, but the class with Ralf and Al cannot be compared with the class Carin and Susan belong to because the latter have attributes that the former do not have and vice versa: Carin and Susan apply for Signal Troops but Ralf and Al do not; Ralf and Al apply for Transport but Carin and Susan do not.
3) Association of both hierarchies
The symmetric association relation between object classes and attribute classes allow for a simultaneous graphical representation of both hierarchies. The implication rule of the subset/superset relation allows for a simple representation: it suffices to link the hierarchically lowest object classes to the hierarchically lowest corresponding attribute classes. These "lowest" classes are called "bottom classes".
The result is portrayed with one hierarchy upside down, as far as the disjunctive decision rule applies.
The graphical representation of the example is given in Figure 1.
The graph reveals also the existence of an other particular type of classes: the empty class. It means that no object of the sample has precisely the attributes in the attribute classes it is connected with but this class must exist as a "linking class" allowing for the union of attribute classes for higher order object classes. Because of the symmetry in the relations, the same is true for empty attribute classes.
Figure 1.
Graphical representation of the example, according to the disjunctive rule.
The disjunctive model
The attribute set of each object class can be decomposed into a smaller number of basic sets, which may eventually overlap, and from which the original sets are reconstructed as the union of these basic sets. These basic sets are called bundles. E.g. in Figure 1, there are three bundles, each starting at a bottom class. The attributes of a single bundle are all the attributes that are connected downwards with that bottom class. In the example, one attribute bundle is formed by the attributes Signal Troops, Artillery, Engineer, Mechanics, Armored Infantry and Tanks; the second consists of "empty", Artillery, Mechanized Infantry, Paratroopers, Armored Infantry and Tanks. As can be readily seen, there is an overlap between those two bundles, namely the attributes Artillery, Armored Infantry and Tanks. The same idea holds for the objects. There are clearly as much object bundles as there are attribute bundles.
Formally expressed, these sets of bundles are binary matrices, like these in Table lb and lc.
The problem is to find bundles such that the number of attributes in it is maximal and hence their number minimal. The number of bundles in such a matrix is called the rank of the solution and is represented by the letter r. In the example as presented so far the rank is three.
The model contends that the data can be reproduced (to a certain extent) by the Boolean product of these two bundle matrices. If the object bundle matrix is called O and the attribute bundle matrix A, then the solution matrix equals M mxn = 0mxr Ä A’rxn•
In the example the solution matrix Mmx,n is identical to the datamatrix Dmxn’, but this is not necessarily so, because in practice we do not know the bundle matrices in advance. The Hiclas program searches, in a given rank, for the best possible decomposition of the datamatrix into bundles matrices. "Best" means that there is a risk of putting some attributes into an attribute class, in which not all attributes apply to all the objects in the object classes that are linked with. Conversely, there is also a risk of not putting an attribute in an attribute class although it is possessed by some of the objects in the objects classes that are linked with. By symmetry, the same reasoning applies to the objects. This means that there is a "1" in the matrix M but a "O" in the matrix D or the converse. We can represent this by the formula D = M + E, in which E is the matrix with the discrepancies.
If we look at the two-bundle solution for the data of the example, we see that there are some discrepancies between the data and the solution (Table 2).
Table 2.
Numerical output for the two-bundle solution.|
Final number of discrepancies = 7 |
|||||||
|
Final solution for row entities in rank 2 |
|||||||
|
|
Discrepancies |
|
|||||
|
neg |
pos |
tot |
Gfit |
|
Bundles |
||
|
12 Donald |
0 |
0 |
0 |
.000 |
12 |
|
00 |
|
2 Carin |
0 |
0 |
0 |
1.000 |
2 |
1* |
10 |
|
3 Susan |
0 |
0 |
0 |
1.000 |
3 |
1* |
10 |
|
1 Simon |
3 |
0 |
3 |
.625 |
1 |
2* |
01 |
|
4 Linda |
1 |
0 |
1 |
.875 |
4 |
2* |
01 |
|
5 Robert |
1 |
0 |
1 |
.875 |
5 |
2* |
01 |
|
6 Ralf |
0 |
0 |
0 |
1.000 |
6 |
2* |
01 |
|
7 A1 |
0 |
0 |
0 |
1.000 |
7 |
2* |
01 |
|
8 Fred |
1 |
0 |
1 |
.888 |
8 |
|
11 |
|
9 Bill |
1 |
0 |
1 |
.888 |
9 |
|
11 |
|
10 Tom |
0 |
0 |
0 |
1.000 |
10 |
|
11 |
|
11 Chris |
0 |
0 |
0 |
1.000 |
11 |
|
11 |
|
Final solution for column entities in rank 2 |
|||||||
|
|
Discrepancies |
|
|||||
|
neg |
pos |
tot |
Gfit |
|
Bundles |
||
|
10 Admin Sv |
0 |
0 |
0 |
.000 |
10 |
|
00 |
|
9 Signal Tp |
0 |
0 |
0 |
1.000 |
9 |
1* |
10 |
|
3 Mec Inf |
0 |
0 |
0 |
1.000 |
3 |
2* |
01 |
|
4 Paratrooper |
0 |
0 |
0 |
1.000 |
4 |
2* |
01 |
|
8 Transport |
3 |
0 |
3 |
.666 |
8 |
2* |
01 |
|
1 Armd Inf |
0 |
0 |
0 |
1.000 |
1 |
|
11 |
|
2 Tanks |
0 |
0 |
0 |
1.000 |
2 |
|
11 |
|
5 Engineer |
1 |
0 |
1 |
.909 |
5 |
|
11 |
|
6 Mechanics |
1 |
0 |
1 |
.909 |
6 |
|
11 |
|
7 Artillery |
2 |
0 |
2 |
.818 |
7 |
|
11 |
While the number of discrepancies depends on the magnitude of the datamatrix, another measure, called goodness-of-fit (GOF), is used. This index is independent of the magnitude of the matrix and is always formulated in a positive sense. GOF = 1 means a perfect fit and GOF = 0 a total misfit. A Jaccard-coeffcient is used for this purpose; i.e. the number of concordances (n11) divided by the sum of concordances and discrepancies (n11, + n10 + n01). It should be noticed that the 00-matches are not part of the formula because they are not "informative". As a rule of thumb, a lower bound of .60 is needed for a solution to be "acceptable". The GOF does not only apply to the whole datamatrix but can be computed for each object and for each attribute too; the former GOF is called the overall GOF and the latter the individual GOF. The higher the GOF of an object or an attribute the more that element is prototypical for the class it belongs to. E.g. the third line in Table 2 gives the overall GOF for the two bundle solution and the sixth column in the same table gives the GOF for each object and each attribute in this solution.
Interpretation of a solution
The adequacy of a solution is first of all determined by the overall GOF. Further, given that this index is good, the substantive meaning is to be derived from the content of the bundles. While the bottom classes are the most specific they are the most informative in determining the meaning of a bundle. Further the individual GOF of the elements of the bundle under consideration has to be taken into account. In the hypothetical example, Signal Troops is a very sophisticated Arm, in which the platoon leader has not many people to lead and they are not directly involved in combat. A Transport platoon on the contrary, has a low level of sophistication, the platoon leader has a large number of subordinates and his unit has no combat function.
While the third bottom class is empty, it is more difficult to give it a meaning. But given that this class is linked to Artillery on the one side, to Mechanized Infantry at the other side and both to Armored Infantry and Tanks, all Arms which are highly characterized by the combat function (apart from other characteristics), this empty class refers probably to the combat aspect. Because higher order classes are unions of bottoms classes, these higher order classes inherit their basic characteristics. E.g. Artillery is the union of "high level of sophistication" and "combat". The class with Armored Infantry and Tanks share the features of three bottom classes: they are sophisticated, directly involved in combat and the platoon leader has a rather large group of subordinates.
The conjunctive model
Suppose that the bundles as presented in Table lb and lc are still valid but suppose that the applicants use now a conjunctive decision rule. I.e. the Arms must possess now (at least) all the latent characteristics that an applicant considers important. This means that there must be a " 1" in the attribute bundle at the same place as there is a "1" in the object bundle. If there is a "0" in the object bundle there may be a "1" or a "0" in the same attribute bundle. In other words, the attribute has "to dominate" the object: (O = 1) = (A = 1 or A = 0). In this case the data matrix would be the one as presented in Table 3.Formally, the analysis is now performed according to the formula M = [Oc Ä A’]c
Table 3.
Data set accordingly to the conjunctive decision rule.
A T M P E M A T S A I K I A N E R P T S |
|
Simon |
1 1 1 1 0 0 1 0 0 0 |
Carin |
1 1 0 0 1 1 1 0 1 0 |
Susan |
1 1 0 0 1 1 1 0 1 0 |
Linda |
1 1 1 1 1 1 0 1 0 0 |
Robert |
1 1 1 1 1 1 0 1 0 0 |
Ralf |
1 1 1 1 0 0 0 0 0 0 |
A1 |
1 1 1 1 0 0 0 0 0 0 |
Fred |
1 1 0 0 0 0 1 0 0 0 |
Bill |
1 1 0 0 0 0 1 0 0 0 |
Tom |
1 1 0 0 0 0 0 0 0 0 |
Chris |
1 1 0 0 0 0 0 0 0 0 |
Donald |
0 0 0 0 0 0 0 0 0 0 |
The graphical representation of the solution accordingly to the conjunctive rule is different from the solution accordingly to the disjunctive rule (Figure 2). Now the two hierarchies are "fold together".
Figure 2.
Graphical representation of the example, according to the conjunctive rule.
Attributes that appear in the lower half of a box are specific for the object that appear in the upper half of the same box. Further, the implication rule still applies: objects of a certain class possess, besides the attributes of their own class, all attributes of the classes lower in the hierarchy they are linked with. E.g. Linda and Robert have Transport as their specific attribute but they apply also for Mechanized Infantry, Paratroopers, Engineer, Mechanics, Armored Infantry and Tanks. Tom and Chris, situated in the lowest class apply only for Armored Infantry and Tanks. The uppermost and the lowest classes are particular classes. Unlike in the disjunctive solution where null classes are portrayed outside the structure, in the conjunctive solution they are represented in the graph; namely in the uppermost class and in the lowest class. There are no objects in the uppermost class of the example; this means that nobody applies for Administrative Services. In the lowest class we find Donald, who does not apply for any of the vacancies in the Army. This is expressed in a GOF = 0.00, that indicates a total misfit with respect to the Arms lower in the hierarchy. In other words he does not belong to the class of objects applying for Armored Infantry and Tanks. While there is a misclassification of an object, the GOF of the attributes of that class is no longer perfect too.
In the above described solution, attributes dominate the objects. It is also possible to think the other way around and let dominate the objects. This leads to a similar graphical representation but upside down. I.e. the attributes appear now in the upper half of the boxes and the objects in the lower half.
In reality, to know if a conjunctive c.q. a disjunctive model applies to the data has to be derived from the context of the problem, but in many cases it is not known if the data are based on a disjunctive or on a conjunctive decision rule. Then both types of analysis have to be performed and "interpretability" with substantive arguments will "suggest" what decision rule might have been used.
The Hiclas Program
The program requires a PC with at least about 600K free disk space and 450K RAM.
The program is mainly menu driven. The three main parts are "Create/Modify Data, "Analyze Data", "Display Output".
The first part provides a matrix editor that is specially designed to type in data or to read an existing file, which can be modified afterwards if wanted. Switching between the fields of the display are done with the "Tab" button.
The second part opens a window with different options which have to be specified by the user before performing the analysis. I.e. the format of the data matrix, the (FORTRAN) format of the data, labels for the rows and the columns, dichotomization threshold if the data are not binary, the type of decision rule and the maximal rank of analysis. Special (user tailored) options, but beyond the scope of this presentation, are also available. Moving from option to option is done with the cursor-arrows.
The third part allows for a graphical representation of the overall GOF and the discrepancies, an output file containing information about the bundle matrices and a graphical representation of the solution for a specified rank. The initial graph shows only the structure with the classes. Different kinds of information can be displayed using function keys. Among others, the frequency of elements in a class (F4), zooming into a class with display of the label of the elements and their individual GOF (F5). Other keys allow for moving and magnifying the graph or deleting classes in the graph.
A print function is also available. If wanted one can use the mouse instead of the cursor keys for a number of preplanned alternatives.
Further developments
Besides the deterministic version as explained above, a probabilistic version has also been developed by Eric Maris. This approach starts from observed frequencies in the cells of the datamatrix and computes the probability that a given element belongs tot a particular class. I.e. the " 1" and "0" in the bundle matrices are replaced with a probability, determined by an iterative Maximum Likelihood Estimation-Maximization procedure. This type of analysis is not implemented in Hiclas.
Iven Van Mechelen c.s. works at a three-way approach for binary data.
CONCLUSION
Hierarchical classes analysis is a very versatile technique that applies to a lot of empirical problems irrespective of the (sub)discipline.
One of its most attractive features is the simultaneous graphical representation of the data structure as a function of overt or latent characteristics.
Despite of its deterministic nature it allows for some gradedness by choosing a certain rank and by the individual GOF.
The data concerning this problems have to be dichotomous and, if not, they have to be converted to it. Especially if the data are at ordinal level or even higher there is a loss of information by dichotomizing.