Introduction

The kdml package calculates the pairwise distances between mixed-type observations consisting of continuous (numeric), nominal (factor), and ordinal (ordered factor) variables. This kernel metric learning methodology uses cross-validation to select the bandwidth associated with the kernel function for each variable, and returns a distance matrix that can be used in any distance-based algorithm.

We define the kernel similarity between two data points x_i and x_j in two different ways, following two papers. From Ghashti and Thompson (2024), the similarity used by the distance using kernel product similarity (dkps) is given by

and from Ghashti (2024), the similarity used by the distance using kernel summation similarity (dkss) is given by

For both s_dkps(x_i, x_j|λ) and s_kss(x_i, x_j|λ), K(⋅), L(⋅), and ℓ(⋅) are kernel functions for continuous (numeric), nominal (factor), and ordinal (ordered factor) variables, respectively. The data frame consists of p-many variables, such that p = p_c + p_u + p_o is the sum of the continuous, nominal, and ordinal variables, respectively. λ is a vector of length p containing variable-specific bandwidth values for each kernel function.
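Written out under these definitions, the two similarities plausibly take the form sketched below. This is a reconstruction rather than a quotation from the cited papers. It assumes the variables are ordered continuous first, then nominal, then ordinal, and that the continuous kernel in the product form carries a 1/λ_k normalization; the exact normalization and indexing conventions in Ghashti and Thompson (2024) and Ghashti (2024) may differ.

```latex
\begin{align}
s_{dkps}(x_i, x_j \mid \lambda)
  &= \prod_{k=1}^{p_c} \frac{1}{\lambda_k}
     K\!\left(\frac{x_{i,k} - x_{j,k}}{\lambda_k}\right)
     \prod_{k=p_c+1}^{p_c+p_u} L(x_{i,k}, x_{j,k}, \lambda_k)
     \prod_{k=p_c+p_u+1}^{p} \ell(x_{i,k}, x_{j,k}, \lambda_k),
     \tag{1} \\
s_{kss}(x_i, x_j \mid \lambda)
  &= \sum_{k=1}^{p_c} K\!\left(\frac{x_{i,k} - x_{j,k}}{\lambda_k}\right)
     + \sum_{k=p_c+1}^{p_c+p_u} L(x_{i,k}, x_{j,k}, \lambda_k)
     + \sum_{k=p_c+p_u+1}^{p} \ell(x_{i,k}, x_{j,k}, \lambda_k).
     \tag{2}
\end{align}
```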

Phillips and Venkatasubramanian (2011) discuss how to calculate a distance from a similarity function. Using either similarity function, s_dkps(x_i, x_j|λ) or s_kss(x_i, x_j|λ), the kernel distance between two observations x_i and x_j is given by

These two distances can be computed in R with the package kdml, using the function dkps for the similarity s_dkps(x_i, x_j|λ) and the function dkss for the similarity s_kss(x_i, x_j|λ). Both similarities enter the distance d²(x_i, x_j | λ), which is used for the pairwise distance calculations.
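In its standard form (Phillips and Venkatasubramanian, 2011), the squared kernel distance is d²(x_i, x_j | λ) = s(x_i, x_i | λ) + s(x_j, x_j | λ) − 2 s(x_i, x_j | λ). The base R sketch below applies that conversion to a toy similarity matrix. The helper name sim_to_dist2 and the single-variable Gaussian similarity are illustrative assumptions, not part of kdml, and the package's internal computation may be organized differently.

```r
# Convert a symmetric n x n similarity matrix S into squared kernel distances:
# d2[i, j] = S[i, i] + S[j, j] - 2 * S[i, j].
# `sim_to_dist2` is a hypothetical helper, not a kdml function.
sim_to_dist2 <- function(S) {
  self <- diag(S)
  d2 <- outer(self, self, "+") - 2 * S
  d2[d2 < 0] <- 0  # guard against tiny negative values caused by rounding
  d2
}

# Toy similarity: Gaussian kernel on one continuous variable, bandwidth 1.5
x <- c(1.0, 2.0, 4.5, 7.0)
lambda <- 1.5
S <- dnorm(outer(x, x, "-") / lambda) / lambda

D2 <- sim_to_dist2(S)  # squared kernel distances
D <- sqrt(D2)          # kernel distances
```

Each observation is at distance zero from itself, and the distance grows as the similarity between two observations falls relative to their self-similarities.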

The vector of bandwidths λ may be a user-supplied numeric vector of length p, in which case the admissible values for each variable type are bounded according to the kernel choice. For continuous variables λ > 0, for ordinal variables λ ∈ [0, 1], and for nominal variables the bounds are kernel specific. For example, λ ∈ [0, 1] for the ‘u_aitken’ kernel and λ ∈ [0, (c − 1)/c] for ‘u_aitchisonaitken’, where c is the number of unique values of the nominal variable in question. For an overview of kernel functions, we refer the reader to Aitchison and Aitken (1976), Cameron and Trivedi (2005), Härdle et al. (2004), Li and Racine (2007), Li and Racine (2003), Silverman (1986), Titterington and Bowman (1985), and Wang and van Ryzin (1981).
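The bounds above can be checked before a bandwidth vector is passed on. The following is a minimal base R sketch. The helper check_bandwidths is hypothetical (not a kdml function), it covers only the two nominal kernels named above, and it assumes the entries of λ follow the column order of the data frame.

```r
# Check a bandwidth vector against the bounds stated in the text.
# `check_bandwidths` is a hypothetical helper, not a kdml function.
# Assumption: lambda[k] belongs to the k-th column of df.
check_bandwidths <- function(df, lambda, ukernel = "u_aitken") {
  stopifnot(is.data.frame(df), is.numeric(lambda), length(lambda) == ncol(df))
  ok <- logical(ncol(df))
  for (k in seq_along(df)) {
    v <- df[[k]]
    if (is.ordered(v)) {
      # ordinal: lambda in [0, 1] (tested first, since ordered factors are factors)
      ok[k] <- lambda[k] >= 0 && lambda[k] <= 1
    } else if (is.factor(v)) {
      # nominal: the upper bound depends on the kernel
      upper <- 1
      if (ukernel == "u_aitchisonaitken") {
        c_k <- nlevels(droplevels(v))  # number of unique values
        upper <- (c_k - 1) / c_k
      }
      ok[k] <- lambda[k] >= 0 && lambda[k] <= upper
    } else if (is.numeric(v)) {
      # continuous: lambda > 0
      ok[k] <- lambda[k] > 0
    } else {
      stop("unsupported column type in column ", k)
    }
  }
  setNames(ok, names(df))
}

toy <- data.frame(
  num = c(1.2, 3.4, 5.6),
  nom = factor(c("A", "B", "C")),
  ord = ordered(c("Low", "High", "Low"), levels = c("Low", "High"))
)

# With three nominal levels the 'u_aitchisonaitken' upper bound is 2/3,
# so a bandwidth of 0.7 is flagged for the nominal column.
res_aa <- check_bandwidths(toy, c(0.5, 0.7, 0.3), ukernel = "u_aitchisonaitken")
res_ai <- check_bandwidths(toy, c(0.5, 0.7, 0.3), ukernel = "u_aitken")
```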

Installing

You can install the stable version from CRAN using install.packages("kdml"). The development version can be installed from GitHub with:

library(devtools)
install_github("jrjthompson/R-package-kdml")
library(kdml)

Bandwidth Selection

The maximum similarity cross-validation (MSCV) method for bandwidth selection (Ghashti and Thompson, 2024) is based on the objective

where s_λ(⋅) is the similarity function, either s_dkps(x_i, x_j|λ) or s_kss(x_i, x_j|λ). MSCV can be invoked implicitly within the distance calculation by setting the argument bw = "mscv" in the functions dkps and dkss. Users may instead set bw = "np" to select bandwidths by maximum-likelihood cross-validation from the highly optimized package np (Hayfield and Racine, 2008).
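To give a feel for what cross-validated bandwidth selection optimizes, the sketch below implements textbook leave-one-out maximum-likelihood cross-validation for a single continuous variable with a Gaussian kernel, in base R. It is neither the MSCV objective nor the np implementation. The function loo_loglik, the simulated data, and the search interval are illustrative assumptions.

```r
set.seed(42)
x <- rnorm(60, mean = 10, sd = 3)  # one simulated continuous variable

# Leave-one-out log-likelihood of a Gaussian kernel density estimate.
# `loo_loglik` is a hypothetical helper written for this illustration.
loo_loglik <- function(lambda, x) {
  n <- length(x)
  K <- dnorm(outer(x, x, "-") / lambda) / lambda  # kernel weights
  diag(K) <- 0                                    # leave observation i out
  f_loo <- rowSums(K) / (n - 1)                   # leave-one-out density at x_i
  sum(log(f_loo))
}

# One-dimensional search for the bandwidth that maximizes the criterion
fit <- optimize(loo_loglik, interval = c(0.1, 10), x = x, maximum = TRUE)
lambda_hat <- fit$maximum
```

With several variables of mixed type the same idea applies to the whole bandwidth vector λ, which requires a multivariate optimizer and the bounds on each bandwidth described in the introduction.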

Data Generation

To simulate data containing a mix of variable types and true class labels, we have included the function confactord. This function performs the following simulation automatically, and can be configured for any number of numeric, nominal and ordinal variables. See the package manual for more details.

We simulate a mix of continuous (x₁, x₄), nominal (x₂, x₃), and ordinal (x₅, x₆) data and store it in a data frame as follows:

# Simulate 100 observations; the seed is set only for reproducibility
set.seed(123)
df <- data.frame(
  x1 = runif(100, 0, 100),                                     # continuous
  x2 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),  # nominal
  x3 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),  # nominal
  x4 = rnorm(100, 10, 3),                                      # continuous
  x5 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High")),           # ordinal
  x6 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High"))            # ordinal
)

A minimal call to the distance functions requires only the data frame; the kernel functions take their default values and the bandwidth selection method defaults to ‘mscv’.

# DKPS distance 
dis_dkps <- dkps(df = df)

# DKSS distance 
dis_dkss <- dkss(df = df)
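Because the result is a matrix of pairwise distances, it can be passed to standard distance-based routines. The sketch below uses base R only and runs hierarchical clustering on a stand-in matrix D built with dist(). With kdml, D would instead be the distance matrix obtained from dkps or dkss; how to extract it from the returned object is described in the package manual and is not assumed here.

```r
set.seed(1)

# Stand-in for a kdml distance matrix: any symmetric n x n matrix of
# pairwise distances with a zero diagonal is handled the same way.
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
D <- as.matrix(dist(pts))

# Hierarchical clustering on the precomputed distances
hc <- hclust(as.dist(D), method = "average")
clusters <- cutree(hc, k = 2)  # cut the tree into two groups
table(clusters)
```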

To use the maximum-likelihood cross-validation technique from the package np instead, set bw = "np":

# DKPS distance 
dis_dkps_np <- dkps(df = df, bw = "np")

# DKSS distance 
dis_dkss_np <- dkss(df = df, bw = "np")

Users also have many kernel functions available to them, selected through the additional arguments cFUN, uFUN, and oFUN shown below. Some of the kernel functions from np are not available. The kernels used for bandwidth selection should be the same as those used for the distance calculation.

dis_dkss_custom_kernels <- dkss(df = df, bw = "mscv",
    cFUN = "c_epanechnikov", uFUN = "u_aitken", oFUN = "o_habbema")

If users require only the bandwidths selected by MSCV, and not the pairwise distance matrix obtained from dkps or dkss, they can obtain them with the following function calls:

# MSCV bandwidth specification using the similarity function in Equation (1)
mscv.dkps(df, nstart = NULL, ckernel = "c_gaussian", 
           ukernel = "u_aitken", okernel = "o_wangvanryzin", verbose = TRUE) 

# MSCV bandwidth specification using the similarity function in Equation (2)
mscv.dkss(df, nstart = NULL, ckernel = "c_gaussian", 
          ukernel = "u_aitken", okernel = "o_wangvanryzin", verbose = TRUE)

For more details on the usage of each of these functions, consult the package documentation found on CRAN.

References

[1] Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method”, Biometrika, 63, 413-420.

[2] Cameron, A. and P. Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge University Press.

[3] Ghashti, J.S. (2024), “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-type Data”, University of British Columbia.

[4] Ghashti, J.S. and J.R.J. Thompson (2024), “Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning”, Journal of Classification, Accepted.

[5] Härdle, W., M. Müller, S. Sperlich and A. Werwatz (2004), Nonparametric and Semiparametric Models (Vol. 1), Berlin: Springer.

[6] Hayfield, T. and J.S. Racine (2008), “Nonparametric Econometrics: The np Package”, Journal of Statistical Software, 27(5).

[7] Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

[8] Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data”, Journal of Multivariate Analysis, 86, 266-292.

[9] Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.

[10] Titterington, D.M. and A.W. Bowman (1985), “A comparative study of smoothing procedures for ordered categorical data”, Journal of Statistical Computation and Simulation, 21(3-4), 291-312.

[11] Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions”, Biometrika, 68, 301-309.
