Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds
Authors:
Sepehr Assadi, Vaggos Chatziafratis, Jakub Łącki, Vahab Mirrokni, Chen Wang.
Abstract:
The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC’16].
Assuming that the input is an 𝑛-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs:
Assuming that the input is an 𝑛-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs:
- With O(n polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm by Charikar and Chatziafratis [SODA’17].
- When the space is more limited, namely, 𝑛^{1−𝑜(1)}, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an 𝑜(log(𝑛)/loglog(𝑛)) factor, even when allowed polylog(n) passes over the input and exponential time.
- In the most stringent setting of polylog(n) space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an 𝑛^{1/2−𝑜(1)} factor gap between their HC objective value.
- Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time.
Conference version:
[PMLR]
Full version:
[arXiv]
BibTex:
[DBLP]