# k-median++ The **k-median++** module implements a practical heuristic for the geometric **k-median clustering problem**.\ It is inspired by the k-means++ seeding strategy, but adapted to the k-median objective by sampling proportional to **linear distances** rather than squared distances. In our framework, k-median++ is used both as: 1. a **baseline clustering algorithm** on the original point set, and 2. a **core subroutine** applied to compressed representative sets. Conceptually, the algorithm follows a two-phase structure: 1. **k-median++ initialization** 2. **Lloyd-style local refinement using geometric medians** *** ### Repository {% embed url="" %} *** ### Problem Setting Given a set of points $$P \subset \mathbb{R}^2$$ and an integer $$k$$, the **k-median** problem asks for a set of centers $$C$$ with $$|C| = k$$ minimizing $$cost(P, C) = \sum\_{p \in P} w(p), d(p, C), \quad d(p, C) = \min\_{c \in C} |p - c|.$$ Here $$w(p)$$ is a non-negative weight associated with point $$p$$ (default $$w(p)=1$$).\ The problem is NP-hard even in the plane, motivating the use of efficient approximation algorithms and heuristics. *** ### Algorithm Overview Our implementation consists of the following steps: 1. **k-median++ initialization**\ Select initial centers using weighted distance-based sampling. 2. **Assignment step**\ Assign each point to its nearest center. 3. **Update step (1-median computation)**\ Update each center to the weighted geometric median of its assigned points. 4. **Iteration until convergence**\ Repeat assignment and update until stabilization. *** ### k-median++ Initialization The initialization follows the k-means++ paradigm, adapted to the k-median objective. #### First Center The first center is sampled randomly proportional to point weights: $$ \Pr\[p \text{ chosen}] \propto w(p). $$ #### Remaining Centers Each subsequent center is sampled according to $$ \Pr\[p \text{ chosen}] \propto w(p) \cdot D(p), $$ where $$D(p)$$ is the distance from $$p$$ to its nearest already chosen center. Unlike k-means++, **squared distances are not used**, which better aligns the sampling with the k-median objective. If all distances are zero (degenerate case), remaining centers are chosen uniformly at random. *** ### Lloyd-Style Refinement After initialization, the algorithm performs Lloyd-style iterations. #### Assignment Step Each point is assigned to the nearest center. If assignments do not change between iterations, the algorithm terminates early. #### Update Step (Geometric Median) For each cluster, the center is updated to the **weighted geometric median** of its assigned points. The geometric median is approximated using **Weiszfeld’s algorithm**, initialized at the weighted centroid of the cluster. Empty clusters are handled explicitly by reinitializing the center to a random input point. *** ### Geometric Median via Weiszfeld’s Algorithm Given a cluster (S), the geometric median is approximated iteratively as $$ x^{(t+1)} = \frac{\sum\_{p \in S} \frac{w(p), p}{|p - x^{(t)}|}} {\sum\_{p \in S} \frac{w(p)}{|p - x^{(t)}|}}. $$ If the current estimate becomes extremely close to a data point, the algorithm snaps to that point and terminates to ensure numerical stability. *** ### Class Synopsis ```cpp #include "k_median++.h" std::vector k_median_pp( const std::vector& pts, int k, int maxIters, int maxWeiszfeldIters, double tol, uint32_t seed ); ``` *** ### Dependencies #### STL * `` * Used to store input points, centers, cluster assignments, and intermediate clusters. * `` * Provides `sqrt` for Euclidean distance computation and geometric median updates. * `` * Used for randomized sampling in k-median++ initialization (`std::mt19937`, `uniform_real_distribution`, `uniform_int_distribution`). * `` * Provides `std::numeric_limits::infinity()` for distance and cost initialization. * `` * Used for `std::min`, `std::max`, and `std::lower_bound` in weighted random sampling. *** ### Design Notes * **Weighted formulation** * All steps of the algorithm explicitly incorporate point weights $$w(p)$$. * This allows k-median++ to be applied seamlessly to compressed representative sets, where each representative aggregates multiple original points. * **Initialization strategy** * The k-median++ seeding uses linear distances instead of squared distances. * This choice aligns the sampling distribution with the k-median objective and empirically improves stability compared to k-means++-style initialization. * **Geometric median updates** * Each center update computes a weighted geometric median using Weiszfeld’s algorithm. * The algorithm is initialized at the weighted centroid of the cluster to accelerate convergence. * **Robust handling of degeneracies** * If all points are identical during initialization, remaining centers are filled by random sampling. * Empty clusters arising during Lloyd iterations are explicitly handled by reinitializing the center to a random input point, preventing collapse. * **Early termination** * The algorithm terminates early when assignments no longer change or when all centers move by at most a tolerance threshold. * This ensures good practical performance despite the worst-case theoretical bounds. *** ### Complexity Let $$n = |P|$$ be the number of points and $$k$$ be the number of centers. * **Initialization** * Each new center requires scanning all points to compute distances to the current centers. * Total cost: $$O(nk)$$. * **Lloyd-style iterations** * Assignment step: $$O(nk)$$. * Update step: geometric median computation per cluster, run for a fixed number of iterations. * Each Lloyd iteration costs $$O(nk)$$. Overall, the total running time is $$O(T \cdot n \cdot k)$$, where $$T$$ is the number of Lloyd iterations, which is typically small in practice. *** ### Overall Structure {% tabs %} {% tab title="Data Structure" %}

Name	Type	Description
`Point`	`struct`	Represents a weighted point in the plane, storing coordinates `(x, y)` and a non-negative weight `w`.
`std::vector<Point>`	`container`	Used to store input points, centers, clusters, and representative sets.

{% endtab %} {% tab title="Public Interface" %}

Name	Input	Output
`std::vector<Point> k_median_pp( const std::vector<Point>& pts, int k, int maxIters, int maxWeiszfeldIters, double tol, uint32_t seed)`	weighted point set, parameters	selected centers

This function executes the full k-median++ algorithm, including initialization and Lloyd-style refinement, and returns the final set of centers. {% endtab %} {% tab title="Internal Functions" %}

Name	Purpose	Complexity
`double euclidean_distance(const Point&, const Point&)`	Compute Euclidean distance between two points	O(1)
`std::vector<Point> k_median_pp_init(const std::vector<Point>&, ...)`	k-median++ probabilistic initialization	O(nk)
`Point geometric_median_weiszfeld(const std::vector<Point>&, ...)`	Approximate weighted geometric median of a cluster	O(\|C\|) per iteration
`double k_median_cost(const std::vector<Point>&, const std::vector<Point>&)`	Evaluate weighted k-median objective	O(nk)

Here $$n = |P|$$ is the number of points and $$|C|$$ denotes the size of a cluster. {% endtab %} {% endtabs %} *** ### Usage in Our Framework Within our framework, k-median++ is used in two distinct roles: * **Baseline clustering** * Applied directly to the original input point set. * Serves as a reference for evaluating approximation quality and runtime improvements. * **Clustering on representative sets** * Applied to compressed representative sets $$R\_j$$ produced by the range-based preprocessing stage. * This significantly reduces the effective input size while preserving clustering cost up to a controlled approximation error. By separating these two use cases, we can clearly distinguish the effects of data compression from the behavior of the clustering heuristic itself. *** ### Example Usage ```cpp std::vector points = load_points("data.txt"); auto centers = k_median_pp( points, /* k = */ 50, /* maxIters = */ 20, /* maxWeiszfeldIters = */ 50, /* tol = */ 1e-4, /* seed = */ 42 ); ```