Title: Towards Achieving Anonymity
Abstract:
We study the problem of publishing data from a table
containing personal data, in a privacy preserving manner.
In particular, we aim to anonymize quasi-identifiers, i.e.,
non-key attributes that combinedly identifiy a unique record
in the table.
The first model is proposed by Sweeney, called k-anonymity.
This approach suppresses some values of the
quasi-identifiers, such that for every record in the
modified table, there are at least k-1 other records with
exactly the same value. And the quality measure here is the
number of quasi-identifier values suppressed. We provide a
O(k)-approximation algorithm for this problem, improving
upon the previous O(k log k) result. We also show that
this is the best approximation bound possible using the
distance representation. For small values of k, we provide
improved bounds as well.
We propose a second model which generalizes the
quasi-identifier values via clustering. The records are
first clustered and then the cluster centers are published.
To ensure privacy, we impose the constraint that each
cluster must contain at least k records. We consider the
measure of minimizing the maximum cluster radius, for which
we provide a tight 2-approximation algorithm. The second
measure concerns minimizing the sum, over all clusters, the
product of number of records per cluster and the cluster
radius. For this measure we also provide a constant
approximation algorithm. Further, we extend the algorithms
to handle the case where we can omit outliners.
This talk is based on two papers:
Anonymizing Tables (ICDT 05), coauthored with Aggarwal, Feder, Kenthapadi,
Motwani, Panigrahy, and Thomas.
Achieveing Anonymity via Clustering (PODS 06), coauthored with Aggrawal,
Feder, Kenthapadi, Khuller, Panigrahy, and Thomas.
Bio:
An obtained her phd from Stanford University in 2004, under the
supervision of Rajeev Motwani and Leo Guibas. An joined Google after
her graduation. Since then, An has been involved in a variety of
projects at Google, including search quality/ranking, image search,
scholar search, and search infrastructure.