K-Means¶

The K-Means algorithm solves clustering problem by partitioning \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

Operation	Computational methods	Programming Interface
Training	Lloyd’s	train(…)	train_input	train_result
Inference	Lloyd’s	infer(…)	infer_input	infer_result

Mathematical formulation¶

Refer to Developer Guide: K-Means.

Programming Interface¶

All types and functions in this section are declared in the oneapi::dal::kmeans namespace and be available via inclusion of the oneapi/dal/algo/kmeans.hpp header file.

Descriptor¶

template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default> class descriptor¶

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd_dense.
Task – Tag-type that specifies the type of the problem to solve. Can be task::clustering.

Constructors

descriptor(std::int64_t cluster_count = 2)¶: Creates a new instance of the class with the given cluster_count.

Properties

std::int64_t max_iteration_count¶

The maximum number of iterations T. Default value: 100.

Getter & Setter: std::int64_t get_max_iteration_count() const

auto & set_max_iteration_count(int64_t value)
Invariants: max_iteration_count >= 0

std::int64_t cluster_count¶

The number of clusters k. Default value: 2.

Getter & Setter: std::int64_t get_cluster_count() const

auto & set_cluster_count(int64_t value)
Invariants: cluster_count > 0

double accuracy_threshold¶

The threshold \(\varepsilon\) for the stop condition. Default value: 0.0.

Getter & Setter: double get_accuracy_threshold() const

auto & set_accuracy_threshold(double value)
Invariants: accuracy_threshold >= 0.0

Method tags¶

struct lloyd_dense¶: Tag-type that denotes Lloyd’s computational method.

using by_default = lloyd_dense ¶: Alias tag-type for Lloyd’s computational method.

Task tags¶

struct clustering¶: Tag-type that parameterizes entities used for solving clustering problem.

using by_default = clustering ¶: Alias tag-type for the clustering task.

Model¶

template<typename Task = task::by_default> class model¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

model()¶: Creates a new instance of the class with the default property values.

Public Methods

std::int64_t get_cluster_count() const¶: Number of clusters k in the trained model.

Properties

const table &centroids¶

A \(k \times p\) table with the cluster centroids. Each row of the table stores one centroid. Default value: table{}.

Getter & Setter: const table & get_centroids() const

auto & set_centroids(const table &value)

Training `train(...)`¶

Input¶

template<typename Task = task::by_default> class train_input¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_input(const table &data)¶

train_input(const table &data, const table &initial_centroids)¶: Creates a new instance of the class with the given data and initial_centroids.

Properties

const table &initial_centroids¶

A \(k \times p\) table with the initial centroids, where each row stores one centroid.

Getter & Setter: const table & get_initial_centroids() const

auto & set_initial_centroids(const table &data)

const table &data¶

An \(n \times p\) table with the data to be clustered, where each row stores one feature vector.

Getter & Setter: const table & get_data() const

auto & set_data(const table &data)

Result¶

template<typename Task = task::by_default> class train_result¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_result()¶: Creates a new instance of the class with the default property values.

Properties

const table &labels¶

An \(n \times 1\) table with the labels \(y_i\) assigned to the samples \(x_i\) in the input data, \(1 \leq 1 \leq n\). Default value: table{}.

Getter & Setter: const table & get_labels() const

auto & set_labels(const table &value)

int64_t iteration_count¶

The number of iterations performed by the algorithm. Default value: 0.

Getter & Setter: int64_t get_iteration_count() const

auto & set_iteration_count(std::int64_t value)
Invariants: iteration_count >= 0

const model<Task> &model¶

The trained K-means model. Default value: model<Task>{}.

Getter & Setter: const model< Task > & get_model() const

auto & set_model(const model< Task > &value)

const table &responses¶

An \(n \times 1\) table with the responses \(y_i\) assigned to the samples \(x_i\) in the input data, \(1 \leq 1 \leq n\). Default value: table{}.

Getter & Setter: const table & get_responses() const

auto & set_responses(const table &value)

double objective_function_value¶

The value of the objective function \(\Phi_X(C)\), where C is model.centroids.

Getter & Setter: double get_objective_function_value() const

auto & set_objective_function_value(double value)
Invariants: objective_function_value >= 0.0

Operation¶

template<typename Descriptor> kmeans::train_result train(const Descriptor &desc, const kmeans::train_input &input)¶

Parameters

desc – K-Means algorithm descriptor kmeans::descriptor
input – Input data for the training operation

Preconditions: input.data.has_data == true

input.initial_centroids.row_count == desc.cluster_count

input.initial_centroids.column_count == input.data.column_count
Postconditions: result.labels.row_count == input.data.row_count

result.labels.column_count == 1

result.labels[i] >= 0

result.labels[i] < desc.cluster_count

result.iteration_count <= desc.max_iteration_count

result.model.centroids.row_count == desc.cluster_count

result.model.centroids.column_count == input.data.column_count

Inference `infer(...)`¶

Input¶

template<typename Task = task::by_default> class infer_input¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_input(const model<Task> &trained_model, const table &data)¶: Creates a new instance of the class with the given model and data.

Properties

const table &data¶

The trained K-Means model. Default value: table{}.

Getter & Setter: const table & get_data() const

auto & set_data(const table &value)

const model<Task> &model¶

An \(n \times p\) table with the data to be assigned to the clusters, where each row stores one feature vector. Default value: model<Task>{}.

Getter & Setter: const model< Task > & get_model() const

auto & set_model(const model< Task > &value)

Result¶

template<typename Task = task::by_default> class infer_result¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_result()¶: Creates a new instance of the class with the default property values.

Properties

const table &labels¶

An \(n \times 1\) table with assignments labels to feature vectors in the input data. Default value: table{}.

Getter & Setter: const table & get_labels() const

auto & set_labels(const table &value)

const table &responses¶

An \(n \times 1\) table with assignments responses to feature vectors in the input data. Default value: table{}.

Getter & Setter: const table & get_responses() const

auto & set_responses(const table &value)

double objective_function_value¶

The value of the objective function \(\Phi_X(C)\), where C is defined by the corresponding infer_input::model::centroids. Default value: 0.0.

Getter & Setter: double get_objective_function_value() const

auto & set_objective_function_value(double value)
Invariants: objective_function_value >= 0.0

Operation¶

template<typename Descriptor> kmeans::infer_result infer(const Descriptor &desc, const kmeans::infer_input &input)¶

Parameters

desc – K-Means algorithm descriptor kmeans::descriptor
input – Input data for the inference operation

Preconditions: input.data.has_data == true

input.model.centroids.has_data == true

input.model.centroids.row_count == desc.cluster_count

input.model.centroids.column_count == input.data.column_count
Postconditions: result.labels.row_count == input.data.row_count

result.labels.column_count == 1

result.labels[i] >= 0

result.labels[i] < desc.cluster_count

Usage example¶

Training¶

kmeans::model<> run_training(const table& data,
                           const table& initial_centroids) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(10)
      .set_max_iteration_count(50)
      .set_accuracy_threshold(1e-4);

   const auto result = train(kmeans_desc, data, initial_centroids);

   print_table("labels", result.get_labels());
   print_table("centroids", result.get_model().get_centroids());
   print_value("objective", result.get_objective_function_value());

   return result.get_model();
}

Inference¶

table run_inference(const kmeans::model<>& model,
                  const table& new_data) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(model.get_cluster_count());

   const auto result = infer(kmeans_desc, model, new_data);

   print_table("labels", result.get_labels());
}

Examples¶

Batch Processing:

kmeans_lloyd_dense_batch.cpp

oneDAL documentation

K-Means¶

Mathematical formulation¶

Programming Interface¶

Descriptor¶

Method tags¶

Task tags¶

Model¶

Training train(...)¶

Input¶

Result¶

Operation¶

Inference infer(...)¶

Input¶

Result¶

Operation¶

Usage example¶

Training¶

Inference¶

Examples¶

Training `train(...)`¶

Inference `infer(...)`¶