Clustering Binary Data – Part 1

Selecting a (dis)similarity measure for binary data

Binary data (also known as dichotomous data) takes exactly two possible values, “true” or “false”, usually encoded as 1 (true) and 0 (false).

However, knowing what true and false (1 and 0) actually mean is important when processing binary data. Based on the meaning of “false” (or 0), binary variables can be further divided into two categories.

  • Symmetric (nominal) binary variables –

In a symmetric binary variable, true stands for the presence of one attribute and false stands for the presence of another attribute.

e.g. – Gender (male/female) is often encoded as 1 and 0 in databases. 1 means isMale = true and 0 means isFemale = true, or vice versa. Hence both values are equally important.

  • Asymmetric (ordinal) binary variables –

In an asymmetric binary variable, true stands for the presence of a certain attribute, and false stands for the absence of the same attribute (or a lack of information about it).

e.g. – isDogLover is a parameter captured by an application that provides emotional suggestions based on personality traits. Here 1 means the person is a dog lover and 0 means the person is not a dog lover. Hence one of the values carries more importance than the other.

Why is it important to know the type of your binary data? Because it determines whether a 0 – 0 match (both instances are zero) should count as grounds for similarity. Accordingly, there are two types of binary similarity coefficients.

  • Symmetrical coefficients – take double zeros into account (a 0-0 match counts as grounds for similarity)

e.g. – Simple matching, Rogers-Tanimoto, Sokal-Sneath-(a-d), Hamann, Yule, Pearson

  • Asymmetrical coefficients – exclude double zeros

e.g. – Jaccard, Sørensen, Russell-Rao, Kulczynski, Sokal-Sneath-e, Ochiai
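
To make the difference concrete, here is a minimal Python sketch comparing one symmetrical coefficient (simple matching) with one asymmetrical coefficient (Jaccard) on the same pair of binary vectors. It uses the standard definitions, where a, b, c and d count the 1-1, 1-0, 0-1 and 0-0 positions: simple matching is (a + d) / (a + b + c + d) and Jaccard is a / (a + b + c). The function names and the sample vectors are made up for illustration, and returning 0.0 for Jaccard when both vectors are all zeros is an assumed convention, not something taken from the reference below.

```python
def contingency_counts(x, y):
    """Count the 1-1, 1-0, 0-1 and 0-0 positions of two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if len(x) == 0:
        raise ValueError("vectors must not be empty")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both 1
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only x is 1
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only y is 1
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both 0
    return a, b, c, d


def simple_matching(x, y):
    """Symmetrical coefficient: 0-0 matches count as agreement."""
    a, b, c, d = contingency_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Asymmetrical coefficient: 0-0 matches are ignored entirely."""
    a, b, c, _ = contingency_counts(x, y)
    if a + b + c == 0:
        # Assumed convention for two all-zero vectors (the ratio is undefined).
        return 0.0
    return a / (a + b + c)


# Hypothetical data: two people described by eight asymmetric traits.
x = [1, 0, 0, 0, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1]

print(contingency_counts(x, y))  # (1, 1, 1, 5)
print(simple_matching(x, y))     # 0.75
print(jaccard(x, y))             # one shared trait out of three observed
```

The two vectors agree mostly because of shared zeros, so simple matching reports a high similarity (0.75) while Jaccard, which only looks at positions where at least one vector has a 1, reports 1/3. For asymmetric variables such as isDogLover, the second number is usually the more meaningful one.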

For further information on the formulas and value ranges of each (dis)similarity measure, refer to pages 77 – 79 of G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics (SIAM), 2007.
