DBSCAN¶
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed in [Ester96]. It is a density-based clustering non-parametric algorithm: given a set of observations in some space, it groups together observations that are closely packed together (observations with many nearby neighbors), marking as outliers observations that lie alone in low-density regions (whose nearest neighbors are too far away).
Operation |
Computational methods |
Programming Interface |
||
Default method |
Mathematical formulation¶
Computation¶
Given the set \(X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}\)
of \(n\) \(p\)-dimensional feature vectors (further referred as observations),
a positive floating-point number epsilon
and a positive integer minObservations
,
the problem is to get clustering assignments for each input observation, based on the definitions below [Ester96]:
two observations \(x\) and \(y\) are considered to be in the same cluster if there is a core observation \(z\),
and \(x\) and \(y\) are both reachable from \(z\).
Each cluster gets a unique identifier, an integer number from \(0\) to \(\text{total number of clusters } – 1\). Each observation is assigned an identifier of the cluster it belongs to, or \(-1\) if the observation considered to be a noise observation.
Programming Interface¶
Refer to API Reference: DBSCAN.
Distributed mode¶
The algorithm supports distributed execution in SMPD mode (only on GPU).
Usage example¶
Compute¶
void run_compute(const table& data,
const table& weights) {
double epsilon = 1.0;
std::int64_t max_observations = 5;
const auto dbscan_desc = kmeans::descriptor<float>{epsilon, max_observations}
.set_result_options(dal::dbscan::result_options::responses);
const auto result = compute(dbscan_desc, data, weights);
print_table("responses", result.get_responses());
}
Examples¶
Batch Processing:
Batch Processing:
Batch Processing: