The kdml package calculates the pairwise distances between mixed-type observations consisting of continuous (numeric), nominal (factor), and ordinal (ordered factor) variables. This kernel metric learning methodology calculates the bandwidth associated with the kernel function for each variable using cross-validation and returns a distance matrix that can be used in any distance-based algorithm.
We define a kernel similarity between two data points xi and xj in two different ways, following two papers. From Ghashti and Thompson (2024), the distance using kernel product similarity (dkps) is based on the similarity function sdkps(xi, xj|λ), and from Ghashti (2024), the distance using kernel summation similarity (dkss) is based on the similarity function skss(xi, xj|λ).
For both sdkps(xi, xj|λ) and skss(xi, xj|λ), K(⋅), L(⋅), and ℓ(⋅) are kernel functions for continuous (numeric), nominal (factor), and ordinal (ordered factor) variables, respectively. The data frame consists of p variables, where p = pc + pu + po and pc, pu, and po are the numbers of continuous, nominal, and ordinal variables, respectively. λ is a vector of length p containing the variable-specific bandwidth for each kernel function.
Phillips and Venkatasubramanian (2011) discuss how to calculate distances from similarity functions. Using either similarity function, sdkps(xi, xj|λ) or skss(xi, xj|λ), the kernel distance between two observations xi and xj, denoted d2(xi, xj|λ), is derived from the chosen similarity.
These two distances can be computed in R with the kdml package using the functions dkps, for the similarity sdkps(xi, xj|λ), and dkss, for the similarity skss(xi, xj|λ), both of which use the distance d2(xi, xj|λ) for pairwise distance calculations.
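As a minimal sketch of how a similarity can be turned into a distance, the snippet below assumes the standard kernel-distance form discussed by Phillips and Venkatasubramanian (2011), d2(xi, xj) = s(xi, xi) + s(xj, xj) - 2 s(xi, xj), and applies it to a hand-written Gaussian similarity on a single numeric variable. The helpers gauss_sim and sim_to_dist2 are hypothetical names used for illustration only; they are not part of kdml, and the package's own computation may differ in detail.

```r
# Illustrative only: gauss_sim() and sim_to_dist2() are not kdml functions.

# Gaussian kernel similarity matrix for one numeric variable with bandwidth lambda
gauss_sim <- function(x, lambda) {
  z <- outer(x, x, "-") / lambda
  dnorm(z) / lambda
}

# Assumed kernel-distance form: d2(i, j) = s(i, i) + s(j, j) - 2 * s(i, j)
sim_to_dist2 <- function(S) {
  s_diag <- diag(S)
  outer(s_diag, s_diag, "+") - 2 * S
}

set.seed(1)
x  <- rnorm(5, mean = 10, sd = 3)
S  <- gauss_sim(x, lambda = 1.5)
D2 <- sim_to_dist2(S)

# The result is symmetric with a zero diagonal
round(D2, 3)
```

Because the Gaussian similarity is largest when an observation is compared with itself, every entry of D2 is non-negative in this example.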
The vector of bandwidths λ may be a user-supplied numeric vector of length p, in which case the admissible values for each bandwidth are bounded according to the variable type and kernel choice. For continuous variables λ > 0, for ordinal variables λ ∈ [0, 1], and for nominal variables the range is kernel specific. For example, λ ∈ [0, 1] for the ‘u_aitken’ kernel and λ ∈ [0, (c − 1)/c] for ‘u_aitchisonaitken’, where c is the number of unique values of the nominal variable in question. For an overview of kernel functions, we refer the reader to Aitchison and Aitken (1976), Cameron and Trivedi (2005), Härdle et al. (2004), Li and Racine (2007), Li and Racine (2003), Silverman (1986), Titterington and Bowman (1985), and Wang and van Ryzin (1981).
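To illustrate why the admissible range for a nominal bandwidth depends on the kernel, here is a sketch of the two nominal kernels named above, written in their usual textbook forms (see Aitchison and Aitken, 1976; Li and Racine, 2007). The function names are hypothetical, and the forms are assumed to correspond to the ‘u_aitken’ and ‘u_aitchisonaitken’ options; the package manual has the exact definitions.

```r
# Illustrative only: these helpers are not part of kdml.

# Aitken-type kernel (assumed form): weight 1 for a match, lambda otherwise
aitken_kernel <- function(xi, xj, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  ifelse(xi == xj, 1, lambda)
}

# Aitchison-Aitken kernel (assumed form): 1 - lambda for a match and
# lambda / (n_cat - 1) otherwise, where n_cat is the number of categories
aitchison_aitken_kernel <- function(xi, xj, lambda, n_cat) {
  stopifnot(lambda >= 0, lambda <= (n_cat - 1) / n_cat)
  ifelse(xi == xj, 1 - lambda, lambda / (n_cat - 1))
}

n_cat <- 3
upper <- (n_cat - 1) / n_cat  # upper bound (c - 1) / c

# At the upper bound, matches and mismatches receive the same weight 1 / c
w_match    <- aitchison_aitken_kernel("A", "A", upper, n_cat)
w_mismatch <- aitchison_aitken_kernel("A", "B", upper, n_cat)
c(match = w_match, mismatch = w_mismatch)
```

Under these assumed forms, a bandwidth at the upper bound gives matches and mismatches equal weight, so the variable no longer separates observations, while a bandwidth of 0 reduces both kernels to an exact-match indicator.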
You can install the stable version from CRAN using install.packages(). The development version can be installed from GitHub.
The maximum similarity cross-validation (MSCV) method for bandwidth selection (Ghashti and Thompson, 2024) is based on an objective function of sλ(⋅), where sλ(⋅) is the similarity function, either sdkps(xi, xj|λ) or skss(xi, xj|λ).
MSCV can be invoked implicitly within the distance calculation by setting the argument bw = "mscv" in the functions dkps and dkss. Users also have the option of setting the argument bw = "np" to select bandwidths using maximum-likelihood cross-validation from the highly optimized package np (Hayfield and Racine, 2008).
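To give a feel for cross-validated bandwidth selection, the sketch below maximizes a leave-one-out log-similarity criterion for a single numeric variable with a Gaussian kernel. This criterion is an assumed stand-in of the kind used in likelihood cross-validation. It is not the MSCV objective itself, which is defined in Ghashti and Thompson (2024) and may differ. The function loo_log_similarity is a hypothetical name and is not part of kdml or np.

```r
# Illustrative only: loo_log_similarity() is not a kdml function, and the
# criterion below is an assumed stand-in for a cross-validation objective.

loo_log_similarity <- function(lambda, x) {
  n <- length(x)
  S <- dnorm(outer(x, x, "-") / lambda) / lambda  # Gaussian similarity
  diag(S) <- 0                                    # leave each point out
  mean(log(rowSums(S) / (n - 1)))
}

set.seed(42)
x <- rnorm(100, mean = 10, sd = 3)

# Search a bounded interval for the bandwidth that maximizes the criterion
fit <- optimize(loo_log_similarity, interval = c(0.05, 10), x = x,
                maximum = TRUE)
fit$maximum    # selected bandwidth
fit$objective  # criterion value at the selected bandwidth
```

With mixed-type data the search runs over a vector of p bandwidths, each within the bounds for its variable type, rather than over a single value.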
To simulate data containing a mix of variable types and true class labels, we have included the function confactord. This function performs the following simulation automatically, and can be configured for any number of numeric, nominal, and ordinal variables. See the package manual for more details.
We simulate a mix of continuous (x1, x4), nominal (x2, x3), and ordinal (x5, x6) data and store it in a data frame as follows:
df <- data.frame(
  x1 = runif(100, 0, 100),
  x2 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  x3 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  x4 = rnorm(100, 10, 3),
  x5 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High")),
  x6 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High"))
)
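Since the kernel applied to each column depends on how the column is stored (numeric, factor, or ordered factor), it can help to check the column types before computing distances. The following base R sketch does this on a small toy data frame. The helper var_type is a hypothetical name for illustration and is not part of kdml.

```r
# Illustrative only: var_type() is a hypothetical helper, not part of kdml.
var_type <- function(v) {
  if (is.ordered(v)) {
    "ordinal"        # checked first because ordered factors are also factors
  } else if (is.factor(v)) {
    "nominal"
  } else if (is.numeric(v)) {
    "continuous"
  } else {
    NA_character_    # any other storage type is not covered here
  }
}

toy <- data.frame(
  x1 = c(1.5, 2.0, 3.7),
  x2 = factor(c("A", "B", "A")),
  x3 = ordered(c("Low", "High", "Medium"),
               levels = c("Low", "Medium", "High"))
)

types <- vapply(toy, var_type, character(1))
types

# Counts of each type, corresponding to pc, pu, and po
table(factor(types, levels = c("continuous", "nominal", "ordinal")))
```

A character column would show up as NA here, which is a cue to convert it with factor() or ordered() first.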
A minimal usage of the distance functions requires only the data frame; the kernel functions take their default values, and the bandwidth selection method defaults to ‘mscv’.
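A minimal call might look like the following sketch. It assumes, based on the description above, that df is the only required argument of dkps and dkss and that every other argument takes its default. The toy data frame is an assumption made so the snippet is self-contained, and the calls are guarded so they only run when kdml is installed.

```r
# Sketch: relies on the defaults of dkps() and dkss() described in the text.
set.seed(123)
toy <- data.frame(
  x1 = runif(20, 0, 100),
  x2 = factor(sample(c("A", "B", "C"), 20, replace = TRUE)),
  x3 = ordered(sample(c("Low", "Medium", "High"), 20, replace = TRUE),
               levels = c("Low", "Medium", "High"))
)

if (requireNamespace("kdml", quietly = TRUE)) {
  # DKPS distance with default kernels and bw = "mscv"
  dis_dkps <- kdml::dkps(df = toy)
  # DKSS distance with default kernels and bw = "mscv"
  dis_dkss <- kdml::dkss(df = toy)
}
```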
Alternatively, the bandwidths can be selected using the maximum-likelihood cross-validation technique from the np package:
library(kdml)
# DKPS distance
dis_dkps_np <- dkps(df = df, bw = "np")
# DKSS distance
dis_dkss_np <- dkss(df = df, bw = "np")
Users also have many kernel functions available to them, which are listed in the additional arguments below. Some of the kernel functions from np are not available. The kernels used for bandwidth selection should be the same as those used for the distance calculation.
dis_dkss_custom_kernels <- dkss(df = df, bw = "mscv",
                                cFUN = "c_epanechnikov", uFUN = "u_aitken", oFUN = "o_habbema")
If users require only the bandwidths selected by MSCV, and not the pairwise distance matrix obtained from dkps or dkss, they may obtain them with the following function calls:
# MSCV bandwidth specification using the similarity function in Equation (1)
mscv.dkps(df, nstart = NULL, ckernel = "c_gaussian",
          ukernel = "u_aitken", okernel = "o_wangvanryzin", verbose = TRUE)
# MSCV bandwidth specification using the similarity function in Equation (2)
mscv.dkss(df, nstart = NULL, ckernel = "c_gaussian",
          ukernel = "u_aitken", okernel = "o_wangvanryzin", verbose = TRUE)
For more details on the usage of each of these functions, consult the package documentation found on CRAN.
References
[1] Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method”, Biometrika, 63, 413-420.
[2] Cameron, A. and P. Trivedi (2005), “Microeconometrics: Methods and Applications”, Cambridge University Press.
[3] Ghashti, J.S. (2024), “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-type Data”, University of British Columbia.
[4] Ghashti, J.S. and J.R.J. Thompson (2024), “Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning”, Journal of Classification, Accepted.
[5] Härdle, W., M. Müller, S. Sperlich and A. Werwatz (2004), Nonparametric and Semiparametric Models, (Vol. 1). Berlin: Springer.
[6] Hayfield, T. and J.S. Racine (2008), “Nonparametric Econometrics: The np Package”, Journal of Statistical Software, 27(5).
[7] Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
[8] Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data”, Journal of Multivariate Analysis, 86, 266-292.
[9] Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.
[10] Titterington, D.M. and A.W. Bowman (1985), “A comparative study of smoothing procedures for ordered categorical data”, Journal of Statistical Computation and Simulation, 21(3-4), 291-312.
[11] Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions”, Biometrika, 68, 301-309.