Algorithms Week 1: Union-Find

Union-Find

Dynamic connectivity

Given a set of N objects.

Union command: connect two objects.
Find/connected query: is there a path connecting the two objects?

When programming, it is convenient to name objects $0$ to $N - 1$.

Use integers as array index.
Suppress details not relevant to union-find.

We assume "is connected to" is an equivalence relation:

Reflexive
Symmetric
Transitive
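
Written out for objects $p$, $q$, $r$, the three properties look roughly as follows. The symbol $\sim$ is shorthand for "is connected to" introduced here for illustration; it is not notation from the notes.

$$
\begin{aligned}
&\text{Reflexive:} && p \sim p \\
&\text{Symmetric:} && p \sim q \implies q \sim p \\
&\text{Transitive:} && p \sim q \ \text{and}\ q \sim r \implies p \sim r
\end{aligned}
$$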

Connected components. Maximal set of objects that are mutually connected.
Find query. Check if two objects are in the same component.
Union command. Replace components containing two objects with their union.

Goal. Design efficient data structure for union-find.

Number of objects $N$ can be huge.
Number of operations $M$ can be huge.
Find queries and union commands may be intermixed.

Public class UF.

UF(int N)
void union(int p, int q)
boolean connected(int p, int q)
int find(int p)
int count()

Dynamic-connectivity client.

Read in number of objects $N$ from standard input.
Repeat:
- Read in pair of integers from standard input.
- If they are not yet connected, connect them and print out pair.

public static void main(String[] args)
{
    int N = StdIn.readInt();
    UF uf = new UF(N);
    while (!StdIn.isEmpty())
    {
        int p = StdIn.readInt();
        int q = StdIn.readInt();
        if (!uf.connected(p, q))
        {
            uf.union(p, q);
            StdOut.println(p + " " + q);
        }
    }
}
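
The client above relies on the course's StdIn and StdOut helpers. A minimal sketch of the same loop using only java.util.Scanner might look like this. The class name ConnectivityClient, the method process, and the tiny nested UF are assumptions made so the example compiles on its own; any of the implementations discussed in these notes could be dropped in instead.

```java
import java.util.Scanner;

class ConnectivityClient {
    // Placeholder union-find (hypothetical, eager relabeling) so the sketch is self-contained.
    static class UF {
        private final int[] id;

        UF(int n) {
            id = new int[n];
            for (int i = 0; i < n; i++) id[i] = i;
        }

        boolean connected(int p, int q) { return id[p] == id[q]; }

        void union(int p, int q) {
            int pid = id[p], qid = id[q];
            for (int i = 0; i < id.length; i++)
                if (id[i] == pid) id[i] = qid;
        }
    }

    // Reads N, then pairs of integers; returns the pairs that created a new connection.
    static String process(Scanner in) {
        StringBuilder out = new StringBuilder();
        int n = in.nextInt();
        UF uf = new UF(n);
        while (in.hasNextInt()) {
            int p = in.nextInt();
            int q = in.nextInt();
            if (!uf.connected(p, q)) {
                uf.union(p, q);
                out.append(p).append(' ').append(q).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(process(new Scanner(System.in)));
    }
}
```

Pairs that are already connected (directly or through other objects) produce no output, which is the whole point of the client.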

Quick-find [eager approach]

Data structure.

Integer array id[] of size $N$.
Interpretation: $p$ and $q$ are connected iff they have the same id.

Find. Check if $p$ and $q$ have the same id.

Union. To merge components containing $p$ and $q$, change all entries whose id equals id[p] to id[q].

Java implementation

public class QuickFindUF
{
    private int[] id;

    public QuickFindUF(int N)
    {
        id = new int[N];
        for (int i = 0; i < N; i++)
            id[i] = i;
    }

    public boolean connected(int p, int q)
    {   return id[p] == id[q];   }

    public void union(int p, int q)
    {
        int pid = id[p];
        int qid = id[q];
        for (int i = 0; i < id.length; i++)
            if (id[i] == pid)
                id[i] = qid;
    }
}
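
To see how id[] evolves, here is a small trace sketch. The class QuickFindTrace and its demo method are hypothetical names; the union logic is the same eager relabeling as above, applied to a bare array.

```java
import java.util.Arrays;

class QuickFindTrace {
    // Same eager union as quick-find, written against a bare array.
    static void union(int[] id, int p, int q) {
        int pid = id[p];   // cache both ids first: id[p] itself may be overwritten in the loop
        int qid = id[q];
        for (int i = 0; i < id.length; i++)
            if (id[i] == pid) id[i] = qid;
    }

    // Runs three unions on 10 objects and returns the final id[] array.
    static int[] demo() {
        int[] id = new int[10];
        for (int i = 0; i < id.length; i++) id[i] = i;
        union(id, 4, 3);   // [0, 1, 2, 3, 3, 5, 6, 7, 8, 9]
        union(id, 3, 8);   // [0, 1, 2, 8, 8, 5, 6, 7, 8, 9]
        union(id, 6, 5);   // [0, 1, 2, 8, 8, 5, 5, 7, 8, 9]
        return id;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

Note that the second union rewrites two entries (3 and 4) even though only one pair was named; every union scans the whole array.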

Cost model. Number of array accesses (for read or write).

algorithm	initialize	union	find
quick-find	$N$	$N$	$1$

Quadratic algorithms do not scale.
Rough standard (for now).

$10^9$ operations per second.
$10^9$ words of main memory.
Touch all words in approximately 1 second.

Huge problem for quick-find.

$10^9$ union commands on $10^9$ objects.
Quick-find takes more than $10^{18}$ operations.
30+ years of computer time!
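
A back-of-the-envelope check of the 30-year figure, assuming the rough standard above ($10^9$ operations per second) and about $3.15 \times 10^7$ seconds per year:

$$
\frac{10^{18}\ \text{operations}}{10^{9}\ \text{operations/s}} = 10^{9}\ \text{s} \approx \frac{10^{9}\ \text{s}}{3.15 \times 10^{7}\ \text{s/year}} \approx 31.7\ \text{years}
$$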

Quick-union [lazy approach]

Data structure.

Integer array id[] of size $N$.
Interpretation: id[i] is parent of $i$.
Root of $i$ is id[id[id[...id[i]...]]]. (Keep going until it doesn't change.)

Find. Check if $p$ and $q$ have the same root.

Union. To merge components containing $p$ and $q$, set the id of $p$'s root to the id of $q$'s root.

Java implementation.

public class QuickUnionUF
{
    private int[] id;

    public QuickUnionUF(int N)
    {
        id = new int[N];
        for (int i = 0; i < N; i++)
            id[i] = i;
    }

    private int root(int i)
    {
        while (i != id[i])
            i = id[i];
        return i;
    }

    public boolean connected(int p, int q)
    {   return root(p) == root(q);   }

    public void union(int p, int q)
    {
        int i = root(p);
        int j = root(q);
        id[i] = j;
    }
}

Cost model. Number of array accesses (for read or write).

algorithm	initialize	union	find
quick-find	$N$	$N$	$1$
quick-union	$N$	$N$	$N$

(Worst case. The quick-union union cost includes finding the two roots.)

Quick-find defect.

Union too expensive ( $N$ array accesses).
Trees are flat, but too expensive to keep them flat.

Quick-union defect.

Trees can get tall.
Find too expensive (could be $N$ array accesses).
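
A small sketch of how a tall tree can arise. The class QuickUnionWorstCase and its methods are hypothetical names; the union rule is the quick-union rule from the notes (link the root of p to the root of q).

```java
class QuickUnionWorstCase {
    static int root(int[] id, int i) {
        while (i != id[i]) i = id[i];
        return i;
    }

    // Quick-union rule: the root of p becomes a child of the root of q.
    static void union(int[] id, int p, int q) {
        id[root(id, p)] = root(id, q);
    }

    // Number of links followed from i up to its root.
    static int depth(int[] id, int i) {
        int d = 0;
        while (i != id[i]) { i = id[i]; d++; }
        return d;
    }

    // union(0,1), union(0,2), ..., union(0,n-1) builds the chain 0 -> 1 -> ... -> n-1.
    static int chainDepth(int n) {
        int[] id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i;
        for (int k = 1; k < n; k++) union(id, 0, k);
        return depth(id, 0);
    }

    public static void main(String[] args) {
        System.out.println(chainDepth(1000));
    }
}
```

With this input order each union hangs the existing tree under a single new node, so object 0 ends up $N - 1$ links from the root and a find on it touches the whole chain.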

Improvements

Improvement 1: weighting

Weighted quick-union. Always put the small tree down below.

Modify quick-union to avoid tall trees.
Keep track of size of each tree (number of objects).
Balance by linking root of smaller tree to root of larger tree.

Data structure. Same as quick-union, but maintain extra array sz[i] to count number of objects in the tree rooted at $i$.

Find. Identical to quick-union.
return root(p) == root(q);

Union. Modify quick-union to:

Link root of smaller tree to root of larger tree.
Update the sz[] array.

int i = root(p);
int j = root(q);
if (i == j) return;   // already in the same component
if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; }
else               { id[j] = i; sz[i] += sz[j]; }
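
Putting the pieces together, a complete class might look like the sketch below. The class layout follows the earlier QuickUnionUF; the depth method is an extra helper added here only for experimenting with tree heights and is not part of the API in the notes.

```java
class WeightedQuickUnionUF {
    private final int[] id;   // id[i] = parent of i
    private final int[] sz;   // sz[i] = number of objects in the tree rooted at i

    WeightedQuickUnionUF(int n) {
        id = new int[n];
        sz = new int[n];
        for (int i = 0; i < n; i++) {
            id[i] = i;
            sz[i] = 1;        // every object starts as a one-node tree
        }
    }

    private int root(int i) {
        while (i != id[i]) i = id[i];
        return i;
    }

    boolean connected(int p, int q) {
        return root(p) == root(q);
    }

    void union(int p, int q) {
        int i = root(p);
        int j = root(q);
        if (i == j) return;
        // Link the root of the smaller tree to the root of the larger tree.
        if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; }
        else               { id[j] = i; sz[i] += sz[j]; }
    }

    // Helper for experiments (an addition): number of links from i to its root.
    int depth(int i) {
        int d = 0;
        while (i != id[i]) { i = id[i]; d++; }
        return d;
    }
}
```

Merging equal-sized trees pairwise (1+1, 2+2, 4+4, 8+8) on 16 objects is the pattern that makes trees as deep as weighting allows; even then no node ends up more than 4 links from the root.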

Running time.

Find: takes time proportional to depth of $p$ and $q$.
Union: takes constant time, given roots.

Proposition. Depth of any node $x$ is at most $\lg N$. ($\lg$ = base 2 logarithm)

Pf. When does depth of $x$ increase?
Increase by $1$ when tree $T_1$ containing $x$ is merged into another tree $T_2$.

The size of the tree containing $x$ at least doubles since $|T_2| \ge |T_1|$.
Size of tree containing $x$ can double at most $\lg N$ times. (Start with $1$, double $\lg N$ times, and you will get a tree with $N$ nodes.)
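
One way to write the counting argument as a single chain of inequalities, with $T(x)$ used here as shorthand for the tree currently containing $x$ (notation introduced for this sketch): each unit of depth corresponds to at least one doubling, and a tree never holds more than $N$ objects, so

$$
2^{\operatorname{depth}(x)} \le |T(x)| \le N \quad\Longrightarrow\quad \operatorname{depth}(x) \le \lg N .
$$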

algorithm	initialize	union	find
quick-find	$N$	$N$	$1$
quick-union	$N$	$N$	$N$
weighted QU	$N$	$\lg N$	$\lg N$

(Worst case. Union costs for quick-union and weighted QU include finding the roots.)

Q. Stop at guaranteed acceptable performance?
A. No, easy to improve further.

Improvement 2: path compression

Quick-union with path compression. Just after computing the root of $p$, set the id of each examined node to point to that root.

Two-pass implementation: add second loop to root() to set the id[] of each examined node to the root.

Simpler one-pass variant: Make every other node in path point to its grandparent (thereby halving path length).

private int root(int i)
{
    while (i != id[i])
    {
        id[i] = id[id[i]];   // point i at its grandparent
        i = id[i];
    }
    return i;
}
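
The two-pass implementation is described above but not shown. A minimal sketch, written as a static method over a bare array so it stands alone (the class name TwoPassCompression is hypothetical):

```java
class TwoPassCompression {
    static int root(int[] id, int i) {
        // First pass: walk up to find the root.
        int r = i;
        while (r != id[r]) r = id[r];

        // Second pass: walk the same path again, pointing every node directly at the root.
        while (i != r) {
            int next = id[i];
            id[i] = r;
            i = next;
        }
        return r;
    }
}
```

On the chain 0 -> 1 -> 2 -> 3 -> 4, a call to root(id, 0) leaves every node pointing at 4. The one-pass grandparent variant would leave 0 -> 2 -> 4 instead: not fully flat, but it needs only one extra line of code.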

In practice. No reason not to! Keeps tree almost completely flat.

Proposition. Starting from an empty data structure, any sequence of $M$ union-find ops on $N$ objects makes $\le c (N + M \lg^* N)$ array accesses.

Analysis can be improved to $N + M \alpha(M, N)$.
Simple algorithm with fascinating mathematics.

Iterated log function. $\lg^* N$ is the number of times $\lg$ must be applied to $N$ before the result is at most $1$.

$N$	$\lg^* N$
$1$	$0$
$2$	$1$
$4$	$2$
$16$	$3$
$65536$	$4$
$2^{65536}$	$5$
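
The table can be reproduced with a few lines of code. This sketch uses BigInteger so that $2^{65536}$ fits, and replaces $\lg$ with $\lceil \lg \rceil$ at each step to stay in exact integer arithmetic; that substitution is an implementation choice made here and gives the same count because the breakpoints of $\lg^*$ are themselves integers. The names IteratedLog and lgStar are hypothetical.

```java
import java.math.BigInteger;

class IteratedLog {
    // Number of times lg must be applied to n (n >= 1) before the result is at most 1.
    static int lgStar(BigInteger n) {
        int count = 0;
        while (n.compareTo(BigInteger.ONE) > 0) {
            // For n >= 2, the bit length of n - 1 equals ceil(lg n).
            n = BigInteger.valueOf(n.subtract(BigInteger.ONE).bitLength());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] small = {1, 2, 4, 16, 65536};
        for (long v : small)
            System.out.println(v + "\t" + lgStar(BigInteger.valueOf(v)));
        System.out.println("2^65536\t" + lgStar(BigInteger.ONE.shiftLeft(65536)));
    }
}
```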

Linear-time algorithm for M union-find ops on N objects?

Cost within constant factor of reading in the data.
In theory, WQUPC is not quite linear.
In practice, WQUPC is linear.

Bottom line. Weighted quick-union (with path compression) makes it possible to solve problems that could not otherwise be addressed.

M union-find operations on a set of N objects

algorithm	worst-case time
quick-find	$M N$
quick-union	$M N$
weighted QU	$N + M \log N$
QU + path compression	$N + M \log N$
weighted QU + path compression	$N + M \lg^* N$

Ex. [ $10^9$ unions and finds with $10^9$ objects]

WQUPC reduces time from 30 years to 6 seconds.
Supercomputer won't help much; good algorithm enables solution.

Applications

Percolation

A model for many physical systems:

$N$-by-$N$ grid of sites.
Each site is open with probability $p$ (or blocked with probability $1 - p$).
System percolates iff top and bottom are connected by open sites.

Likelihood of percolation. Depends on site vacancy probability $p$.

When $N$ is large, theory guarantees a sharp threshold $p^*$.

$p > p^*$: almost certainly percolates.
$p < p^*$: almost certainly does not percolate.

Q. What is the value of $p^*$?

Monte Carlo simulation:

Initialize the whole $N$-by-$N$ grid to be blocked.
Declare random sites open until the top is connected to the bottom.
Vacancy percentage estimates $p^*$.

Q. How to check whether an $N$-by-$N$ system percolates?

Create an object for each site and name them $0$ to $N^2 - 1$.
Sites are in same component if connected by open sites.
Percolates iff any site on the bottom row is connected to any site on the top row. (Brute-force algorithm: $N^2$ calls to connected().)

Clever trick. Introduce 2 virtual sites (and connections to top and bottom).

Percolates iff virtual top site is connected to virtual bottom site.

Q. How to model opening a new site?
A. Connect newly opened site to all of its adjacent open sites.

Q. What is the percolation threshold $p^*$?
A. About 0.592746 for large square lattices.
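
The modeling steps above (one object per site, two virtual sites, connect on open) can be sketched as follows. Everything here is an illustrative assumption rather than the course's reference solution: the class name PercolationSketch, 0-based (row, col) indexing, the row-major mapping row * n + col, and the inlined weighted quick-union with path compression.

```java
import java.util.Random;

class PercolationSketch {
    private final int n;
    private final boolean[] open;   // open[row * n + col]
    private final int[] id, sz;     // union-find over n*n sites plus 2 virtual sites
    private final int top, bottom;  // indices of the virtual top and bottom sites
    private int openCount = 0;

    PercolationSketch(int n) {
        this.n = n;
        open = new boolean[n * n];
        id = new int[n * n + 2];
        sz = new int[n * n + 2];
        for (int i = 0; i < id.length; i++) { id[i] = i; sz[i] = 1; }
        top = n * n;
        bottom = n * n + 1;
    }

    private int root(int i) {
        while (i != id[i]) { id[i] = id[id[i]]; i = id[i]; }   // path halving
        return i;
    }

    private void union(int p, int q) {
        int i = root(p), j = root(q);
        if (i == j) return;
        if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; }
        else               { id[j] = i; sz[i] += sz[j]; }
    }

    // Open a site and connect it to its open neighbors (and to a virtual site on the edge rows).
    void open(int row, int col) {
        int s = row * n + col;
        if (open[s]) return;
        open[s] = true;
        openCount++;
        if (row == 0)     union(s, top);
        if (row == n - 1) union(s, bottom);
        if (row > 0     && open[s - n]) union(s, s - n);
        if (row < n - 1 && open[s + n]) union(s, s + n);
        if (col > 0     && open[s - 1]) union(s, s - 1);
        if (col < n - 1 && open[s + 1]) union(s, s + 1);
    }

    boolean percolates() { return root(top) == root(bottom); }

    // One Monte Carlo trial: open random sites until percolation, return the open fraction.
    static double trial(int n, Random rng) {
        PercolationSketch p = new PercolationSketch(n);
        while (!p.percolates()) p.open(rng.nextInt(n), rng.nextInt(n));
        return (double) p.openCount / (n * n);
    }

    // Average over several trials; the seed parameter only makes runs repeatable.
    static double estimate(int n, int trials, long seed) {
        Random rng = new Random(seed);
        double sum = 0;
        for (int t = 0; t < trials; t++) sum += trial(n, rng);
        return sum / trials;
    }

    public static void main(String[] args) {
        System.out.println(estimate(100, 100, 42L));
    }
}
```

With the virtual sites, percolates() costs one connected-style query instead of the $N^2$ of the brute-force check. Estimates from small grids are noisy and only approach the quoted threshold as the grid size and number of trials grow.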

Steps to developing a usable algorithm.

Model the problem.
Find an algorithm to solve it.
Fast enough? Fits in memory?
If not, figure out why.
Find a way to address the problem.
Iterate until satisfied.