Differential privacy is a rigorous and scientific definition of how to
measure and understand privacy—today’s “gold standard” for thinking through
problems like anonymization. It was first formalized in 2006 by Cynthia Dwork and fellow researchers, and others, including Aaron Roth, have extended it since. In that time, the original definition and its implementations have expanded considerably. Differential privacy is now in daily use at large data organizations such as Google and Apple.
Definition
Differential privacy is essentially a way to measure the privacy loss of an individual. The original definition considers two databases that differ by the addition or removal of one person. The analyst querying these databases is also a potential attacker, looking to figure out whether a given person is in the dataset or to learn something about the persons in it. Your goal, as the database owner, is to protect the privacy of the persons in the databases while still providing useful information to the analysts. But each query you answer could potentially leak critical information about one or several people in the database. What do you do?
Per the definition of differential privacy, you have two databases that differ by one person, who is either added to or removed from the data. Suppose an analyst queries the first database, without the person, then queries the second database, with the person, and compares the results. The information gained from that comparison is the privacy loss of that individual.
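For reference, the standard formal statement of this guarantee says that a randomized mechanism M satisfies epsilon-differential privacy if, for any two databases D and D′ that differ by one person, and for any set of possible outputs S:

```
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The parameter epsilon (ε) bounds how much any one person's presence or absence can change the probability of seeing any particular result; the smaller the epsilon, the stronger the guarantee.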
Let’s take a concrete example from a real-world privacy implementation: the
US Census. Every 10 years, the US government attempts to count each person residing in the US exactly once. Accurately surveying more than 330 million
people is about as difficult as it sounds, and the results are then used to
support things like federal funding, representation in the US Congress and
many other programs that rely on an accurate representation of the US
population.
Not only is that difficult from a data validation point of view, but the US government also wants to provide privacy for the participants, thereby increasing the likelihood of honest responses and protecting people from unwanted attention from people or organizations that might use the public release nefariously (e.g., to link their data, contact them, or otherwise use their data for another purpose). In the past, the US government used a variety of techniques to suppress, shuffle, and randomly alter entries in the hope that this would provide adequate privacy.
Unfortunately, it did not, especially as consumer databases became cheaper and more widely available. Using solver software and only a few datasets available at low cost, researchers were able to attack previous releases and reconstruct 45% of the original data. Imagine what would be possible with a consumer database that covered a large portion of Americans.
For this reason, the US Census Bureau turned to differential privacy to help provide rigorous guarantees. Let's use a census block example. Say you live on a block where only one person is a First American, which is another word for Native American. What you might do is simply not include that person, as a way to protect their privacy.
That’s a good intuition, but differential privacy actually gives you a way to determine how much privacy loss that person will incur if they participate, and it lets you use that calculation to decide when to respond and when not to. To figure this out, you need to know how much one person can change any given query. In the current example, the person would change the count of First Americans by 1.
So if I am an attacker and I query the database for the total count of First Americans before the person is added, I get 0; if I query after, I get 1. This means the maximum contribution of one person to this query is 1. In differential privacy, this maximum contribution is called the sensitivity of the query.
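Formally, the (global) sensitivity of a query f is the largest change that adding or removing one person can cause in its output:

```
\Delta f = \max_{D,\,D'\ \text{neighbors}} \left| f(D) - f(D') \right|
```

For a counting query like this one, the sensitivity is 1, because any one person changes the count by at most 1.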
Once you know the maximum contribution and, therefore, the sensitivity, you
can apply what is called a differential privacy mechanism. The mechanism takes the actual answer (here, 1) and adds carefully constructed noise, creating enough room for uncertainty. This uncertainty allows you to bound the amount of privacy loss for an individual, and the information gain for an attacker.
So let’s say I query beforehand and the number I get isn’t 0, it’s actually
2. Then, the person is added and I query again, and now I get an answer of 2
again — or maybe 3, 1, 0, or 4. Because I can never know exactly how much
noise was added by the mechanism, I am unsure if the person is really there or
not — and this is the power of differential privacy.
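To make this concrete, here is a minimal sketch of the Laplace mechanism, one of the most common differential privacy mechanisms, applied to the First American count. The epsilon value and the rounding are arbitrary choices for illustration, not a recommendation; real libraries handle post-processing and numerical details this sketch ignores.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return a noisy answer by adding Laplace noise with scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

sensitivity = 1   # one person changes the count by at most 1
epsilon = 1.0     # illustrative privacy parameter; smaller means more noise

# Query the block's First American count before and after the person is added.
print(round(laplace_mechanism(0, sensitivity, epsilon)))  # e.g., 1, 0, or -1
print(round(laplace_mechanism(1, sensitivity, epsilon)))  # e.g., 2, 1, or 0
# Because the attacker never knows how much noise was added, neither answer
# confirms whether the person is actually in the data.
```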
Differential privacy tracks this leakage and provides ways to reduce and cleverly randomize some of it. When you send a query, there is a probability distribution over the results you might get back, with the highest probability near the real result. But you could also receive a result that falls within a certain error range around the real value. This uncertainty inserts plausible deniability, or reasonable doubt, into differentially private responses, which is how they guarantee privacy in a scientific and real sense. While plausible
deniability is a legal concept—allowing a defendant to provide a plausible (or
possible) counterargument which could be factual—it can be applied to other
situations. Differential privacy, by its very nature, inserts some probability
that another answer could be possible, leaving this space for participants to
neither confirm nor deny their real number (or even their participation).
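You can see that probability distribution by sampling the same noisy query many times. This is purely illustrative (a deployed system would never let you re-run the same query for free, since each answer spends privacy budget):

```python
import collections
import numpy as np

rng = np.random.default_rng(seed=7)
true_count = 2
sensitivity = 1
epsilon = 1.0

# Sample 10,000 noisy answers to the same count query and round them.
noisy = np.round(true_count + rng.laplace(0, sensitivity / epsilon, size=10_000))
print(collections.Counter(noisy.astype(int)).most_common(5))
# The real answer (2) is the most likely response, but 1, 3, 0, and 4 all
# appear often enough to leave reasonable doubt about the true value.
```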
Sure, sounds nice… but how do you actually implement that? Probabilistic processes called differential privacy mechanisms assist in providing these guarantees. They do so by the following steps, sketched roughly in code after this list:

- creating bounds for the original data (to remove the disparate impact of outliers and to create consistency)
- adding probabilistic noise with particular distributions and sampling requirements (to increase doubt and maintain bounded probability distributions for the results)
- tracking the measured privacy loss variable over time to reduce the chance that someone is overexposed
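Here is a rough sketch of how those three steps fit together, using a bounded sum with Laplace noise and a very simplified budget tracker based on basic composition. It is not the approach of any particular library; the bounds, epsilon values, and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def dp_sum(values, lower, upper, epsilon):
    """Bounded sum: clamp each person's value, then add Laplace noise
    calibrated to the clamping bounds (the sensitivity) and epsilon."""
    clamped = np.clip(values, lower, upper)
    sensitivity = max(abs(lower), abs(upper))  # max change from adding/removing one person
    return float(clamped.sum() + rng.laplace(0, sensitivity / epsilon))

class PrivacyBudget:
    """Track cumulative privacy loss using basic (sequential) composition."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted; stop answering queries.")
        self.remaining -= epsilon
        return epsilon

budget = PrivacyBudget(total_epsilon=1.0)
incomes = [34_000, 52_000, 8_900_000, 41_000]  # one extreme outlier

# Clamping to [0, 100_000] limits how much the outlier (or any one person)
# can move the result, at the cost of biasing the sum downward.
print(dp_sum(incomes, lower=0, upper=100_000, epsilon=budget.spend(0.5)))
print(dp_sum(incomes, lower=0, upper=100_000, epsilon=budget.spend(0.5)))
# A third query at epsilon 0.5 would raise an error: total loss is capped at 1.0.
```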
You won’t be writing these algorithms yourself, as there are several reputable libraries for you to use, such as Tumult Analytics, OpenMined and Google’s PipelineDP, and PyTorch’s Opacus. These libraries usually integrate into the data engineering or preparation steps or into machine learning training. To use them appropriately, you’ll
need to have some understanding of your data, know the use case at hand and
set a few other parameters to tune the noise (for example, the number of times
an individual can be in the dataset).
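Parameter names vary by library, but the idea behind a setting like "how many times can an individual appear in the dataset" is roughly this: the more rows one person can contribute, the higher the sensitivity, and the more noise is needed for the same epsilon. A hypothetical back-of-the-envelope check, not any library's actual API:

```python
epsilon = 1.0
for max_rows_per_person in (1, 3, 10):
    sensitivity = max_rows_per_person      # a count can change by up to this much
    noise_scale = sensitivity / epsilon    # Laplace noise scale grows with the cap
    print(f"max rows per person = {max_rows_per_person:>2} -> noise scale = {noise_scale}")
# Capping per-person contributions (or deduplicating) keeps the sensitivity,
# and therefore the noise, as small as the use case allows.
```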
Use cases
Differential privacy isn’t going to replace all data access anytime soon,
but it is a crucial tool when you are being asked questions around
anonymization. If you are releasing data to a third party, to the public, to a partner, or even to a wider internal audience, differential privacy can create
measurable safety for the persons in your data. Imagine a world where one
employee’s stolen credential just means leaking fuzzy aggregate results
instead of your entire user database. Imagine not being embarrassed when a
data scientist reverse engineers your public data release to reveal the real
data. And imagine how much easier it would be to grant differentially private
data access to internal use cases that don’t actually need the raw
data—creating less burden for the data team, decreasing risk and the chance of
‘Shadow IT’ operations popping up like whack-a-mole.
Differential privacy fits these use cases, and more! If you’d like to walk
through some examples, I recommend reading Damien Desfontaines’ posts on
differential
privacy
and testing out some of the libraries mentioned, like Tumult
Analytics. The book’s
repository also has a few
examples to walk through.
It should be noted that differential privacy does indeed add noise to your results, requiring you to reason about the actual use of the data and what you need to provide in order for the analysis to succeed. This is potentially a new type of investigation for you, and it promotes thinking through the privacy versus utility problem, where you want to provide as much useful information as the particular use case needs while maximizing the privacy offered. Most of the technologies in this post will require you to analyze these tradeoffs and make decisions. To be clear, no data is ever 100% accurate, because all data is some representation of reality; these tradeoffs just become more obvious when implementing privacy controls.
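One way to reason about the privacy versus utility tradeoff is to look at how the typical error of a simple Laplace-noised count changes with epsilon. The values below are illustrative, not a recommendation for any particular setting:

```python
sensitivity = 1
for epsilon in (0.1, 0.5, 1.0, 5.0):
    scale = sensitivity / epsilon
    # For Laplace noise, the expected absolute error equals the noise scale.
    print(f"epsilon = {epsilon:>3}: typical error ~ +/- {scale:.1f}")
# Smaller epsilon buys a stronger guarantee but noisier answers;
# larger epsilon buys accuracy at the cost of more potential privacy loss.
```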