
Protecting privacy in the age of data proliferation

ASU research team led by Oliver Kosut exploring methods to protect individual privacy in datasets used to train artificial intelligence

by TJ Triolo | May 20, 2024 | Features, Research

A lock symbol is shown on a laptop’s blue screen. Oliver Kosut, an associate professor of electrical engineering in the Ira A. Fulton Schools of Engineering at Arizona State University, is leading a project to improve aggregate data privacy and security. Lalitha Sankar, a Fulton Schools professor of electrical engineering, serves as co-principal investigator. Graphic courtesy of Dan Nelson/Pexels

Artificial intelligence, or AI, has seen rapid advancements with the advent of large language models such as ChatGPT and generative image technology such as Adobe Firefly.

As AI advances, its need for data grows. AI runs on algorithms that must be "trained" on robust datasets to learn how to function properly.

Some datasets involve data collected from many people, known as aggregate data, which can contain sensitive information, from names and dates of birth to shopping habits, medical history and even political preferences. Examples include using AI to provide insights about health outcomes among specific demographics or identifying patterns in census data for civic resource management.

As AI researchers share their work with each other or institutions fall victim to hacks that steal data, there's an increasing risk of spreading private information in ways that could be harmful or illegal. Current approaches force a trade-off: methods that protect privacy sacrifice accuracy, and methods that preserve accuracy weaken privacy, leaving no optimal way to get accurate AI algorithm outputs while protecting data privacy.

Portrait of Oliver Kosut

Oliver Kosut, an associate professor of electrical engineering in the Ira A. Fulton Schools of Engineering at Arizona State University, is addressing the issue by analyzing the best ways to balance accuracy and privacy considerations when training AI algorithms.

Kosut and his collaborators, who include Lalitha Sankar, a Fulton Schools professor of electrical engineering, and Flavio du Pin Calmon, an assistant professor of electrical engineering at Harvard University, are investigating four areas of focus to achieve strong differential privacy, a framework for drawing accurate insights from datasets while limiting what can be learned about any individual in them.

“To achieve the level of privacy that you want for social good, current techniques have a lot of trouble getting really good results,” Kosut says. “There is inherently a tradeoff between privacy and utility. We’d like to do well on both.”

Tuning into the background noise

The researchers’ first goal is to develop methods to maintain differential privacy in applications requiring many data processing steps.

When training AI or machine learning algorithms, data is fed into the system, and the algorithm's parameters are adjusted until it produces the desired outputs. The process is repeated thousands to trillions of times.

While training the algorithms, random perturbations known as noise can be injected into the process to obscure sensitive information. However, adding too much noise, or noise of the wrong kind, can make the outputs inaccurate.
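To make the idea concrete, here is a minimal sketch, in Python, of noise being injected into a training loop. It is not the team's method; the toy data, clipping bound and noise level are all illustrative assumptions, in the spirit of differentially private gradient descent.

```python
# A minimal sketch (not the team's specific method) of noise injection during
# training: each update's gradient is clipped and perturbed with random noise
# before it is applied, so no single record dominates the learned model.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data, invented for illustration only.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)          # model parameters
clip_norm = 1.0          # bound on any one example's influence
noise_scale = 0.5        # assumed noise level; more noise means more privacy, less accuracy
learning_rate = 0.1

for step in range(500):
    i = rng.integers(len(X))                      # pick one training example
    grad = 2 * (X[i] @ w - y[i]) * X[i]           # gradient of the squared error
    grad *= min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))   # clip its size
    grad += rng.normal(scale=noise_scale * clip_norm, size=grad.shape)  # inject noise
    w -= learning_rate * grad                     # noisy parameter update

print("learned weights:", np.round(w, 2))
```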

Kosut and his colleagues aim to figure out the best ways to inject noise into the system by determining the best choice of probability distribution, the mathematical rule that governs how the random noise is generated. He explains the difference between two probability distributions by comparing coin flips to rolling a die.

“If you roll a six-sided die, you get one of six possible outcomes,” Kosut says. “You could also flip a coin six times and count the number of heads. It’s more likely that about half of the coins will be heads and half of them will be tails, whereas when you roll a die, all the numbers are equally likely.”

Different probability distributions mean different ways of generating noise, which can produce drastically different outputs. One way of injecting noise may protect privacy better but yield less accurate results, while another may do the reverse.
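The sketch below illustrates the point using two noise distributions that are standard in the differential privacy literature, Laplace and Gaussian. They serve only as examples and are not necessarily the distributions the team is studying.

```python
# Adding the same amount of noise from two different distributions to the same
# count query: typical errors look similar, but the shapes of the errors differ,
# so the privacy and accuracy consequences differ too. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
true_count = 1000        # e.g., how many people in a dataset share some attribute
scale = 10.0             # the same scale parameter for both mechanisms
trials = 100_000

laplace_answers = true_count + rng.laplace(scale=scale, size=trials)
gaussian_answers = true_count + rng.normal(scale=scale, size=trials)

# Laplace noise produces more large outliers than Gaussian noise at this scale,
# so the worst-case error of a single released answer is noticeably different.
for name, answers in [("Laplace", laplace_answers), ("Gaussian", gaussian_answers)]:
    err = np.abs(answers - true_count)
    print(f"{name:8s} mean abs error {err.mean():5.1f}   99th percentile {np.percentile(err, 99):5.1f}")
```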

The research team aims to overcome this limitation by determining the optimal method to add noise that results in accurate outputs while protecting privacy.

Holding differential privacy values accountable

The researchers’ second focus area is developing better ways to quantify differential privacy through a method known as privacy accounting.

In privacy accounting, privacy loss is quantified as a number, conventionally written as the Greek letter epsilon. A smaller number corresponds to a stronger differential privacy guarantee.

Kosut says current methods make it difficult to generate a numerical value for privacy accounting.
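As a rough illustration of why the accounting method matters, the sketch below compares the simplest approach, basic composition, in which the per-step privacy numbers simply add up across a long training run, with the well-known advanced composition bound. The step count and per-step value are assumed for illustration.

```python
# Privacy accounting sketch: the same 10,000 noisy steps, tallied two ways.
# Basic composition just sums the per-step epsilons; the advanced composition
# theorem gives a much smaller (tighter) total at the cost of a tiny failure
# probability delta. The numbers are illustrative assumptions.
import math

per_step_epsilon = 0.01      # assumed privacy cost of one noisy training step
steps = 10_000               # a long training run
delta_prime = 1e-6           # small allowed failure probability for the tighter bound

eps_basic = per_step_epsilon * steps
eps_advanced = (math.sqrt(2 * steps * math.log(1 / delta_prime)) * per_step_epsilon
                + steps * per_step_epsilon * (math.exp(per_step_epsilon) - 1))

print(f"basic composition:    total epsilon = {eps_basic:.1f}")
print(f"advanced composition: total epsilon = {eps_advanced:.1f}  (plus a small delta)")
# Roughly 100 versus 6: better accounting turns the same training run from a
# meaningless guarantee into a usable one, which is why the choice of method matters.
```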

He and his collaborators plan to integrate methods from information theory, the mathematical study of quantifying, storing and communicating information.

“Information theory gives you ways of quantifying uncertainty and information,” Kosut says. “It has a lot of tools on the tool belt that you can use for this privacy accounting problem. We’re hoping we can use some information theory techniques to find better privacy accounting methods.”

Playing fair with AI algorithms

The researchers’ third area of focus is balancing the concepts of fairness and arbitrariness with privacy. Fairness refers to AI treating different demographic groups equitably, ensuring it doesn’t discriminate against any particular group, while arbitrariness measures whether an algorithm’s decisions are driven by meaningful criteria or by chance.

Kosut says weighing privacy against fairness is important to ensure algorithms don’t disadvantage any group in the name of preserving privacy. The researchers want to avoid situations in which, for example, removing information about individuals’ race masks problems faced by a minority population in AI analyses.

For the aspect of arbitrariness, Kosut and his team aim to understand how much of an algorithm’s decision is random as opposed to being based on specific criteria. Ensuring the algorithm bases its decisions on logical criteria instead of random outcomes from its training increases confidence in those decisions.

“If you’re using a machine learning model to decide who to accept for a job and you train that model several times, you might end up with different models,” Kosut says. “On any particular applicant, they could produce different outcomes. If I’m an applicant and the outcome and algorithm used were random and that decision affected whether I get the job, that seems unfair.”
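A small experiment makes the concern concrete. In the hypothetical sketch below, the same type of model is retrained on different random draws of invented hiring data, and a single borderline applicant can receive different decisions from run to run.

```python
# A sketch of the arbitrariness concern: the "same" model, retrained on slightly
# different random samples of the data, can hand a fixed applicant different
# outcomes. The data and features are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Toy applicant pool: two features (say, test score and years of experience)
# and a noisy hiring label.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.5, size=300) > 0).astype(int)

applicant = np.array([[0.1, -0.1]])    # one borderline applicant

decisions = []
for seed in range(20):
    # Retrain on a different bootstrap resample of the training data each time.
    idx = np.random.default_rng(seed).integers(len(X), size=len(X))
    model = LogisticRegression().fit(X[idx], y[idx])
    decisions.append(int(model.predict(applicant)[0]))

print("hire decisions across 20 retrainings:", decisions)
# A borderline applicant may be accepted by some retrained models and rejected
# by others: the randomness of training, not the applicant, decides.
```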

Pulling up a privacy curtain around data

The project’s final area focuses on best practices to develop realistic synthetic data. Synthetic data is similar to collected data, but without private information attached.

When an algorithm is developed and used, those looking at its results won’t see the data used to train it. However, others working in AI and machine learning may want to use that data for their own algorithms or to evaluate the algorithm it was used to train.

To ensure the privacy of individuals whose information is in the datasets, synthetic data can be generated that contains the points needed to train an algorithm without personal identifiers. The challenge is ensuring that algorithms trained on the synthetic data produce results similar to those trained on the original set.
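One simple way to picture the idea, though far more basic than the methods the team is developing, is to add noise to a dataset’s summary counts and then sample fresh records from those noisy counts, as in the sketch below. The data and noise level are illustrative assumptions.

```python
# A deliberately simple sketch of the synthetic-data idea (not the team's method):
# perturb a dataset's summary counts with noise, then sample new records from the
# noisy counts, so analysts see realistic statistics but no real person's row.
import numpy as np

rng = np.random.default_rng(7)

# Original "sensitive" records: each person's age bracket (0-3), invented data.
original = rng.integers(0, 4, size=5_000)

# 1. Compute the histogram of the original data.
counts = np.bincount(original, minlength=4).astype(float)

# 2. Add Laplace noise to each count to limit what any one record can reveal.
noisy_counts = np.clip(counts + rng.laplace(scale=2.0, size=4), 0, None)

# 3. Sample a fresh synthetic dataset from the noisy distribution.
probs = noisy_counts / noisy_counts.sum()
synthetic = rng.choice(4, size=5_000, p=probs)

print("original proportions: ", np.round(np.bincount(original, minlength=4) / 5_000, 3))
print("synthetic proportions:", np.round(np.bincount(synthetic, minlength=4) / 5_000, 3))
# The synthetic proportions track the originals closely, yet no synthetic row
# corresponds to a real person. Doing this for rich, high-dimensional data
# without distorting key statistics is the hard part the researchers are studying.
```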

Portrait of Lalitha Sankar

Sankar, who is leading the synthetic data investigation, says generating synthetic data is a complex process that needs to take multiple challenges into account. Specifically, any such synthetic data release needs to ensure that statistical analysis performed on it is reasonably close to the original while also assuring that no individual’s data is identifiable.

“A synthetic release of the U.S. Census data should assure decision-makers that the release does not dramatically change meaningful statistics and counts that can affect resource allocation while also ensuring privacy of the Census respondents,” Sankar says.

She says cleverly adding noise is key to generating synthetic data, but it can slow down an algorithm’s functions and cause inaccuracies in the results. Her goal is to find ways to ensure both privacy protection and usable results.

Protecting personal information in a data-driven future

When the research project concludes in 2027, Kosut hopes the discoveries he and his team make will be applied to training AI algorithms in industry and government.

The investigation is also providing research opportunities for doctoral students. Among those assisting with the project is Atefeh Gilani, an electrical engineering doctoral student specializing in research combining concepts from privacy, machine learning and information theory.

Gilani says the experience is valuable for preparing her to assist in solving real-world problems while honing her analytical and problem-solving skills.

“I hope that this research contributes to improving privacy protection in our data-driven society,” she says. “By developing effective privacy mechanisms for sensitive datasets, we can reduce the risks associated with dataset participation and build trust between individuals and organizations handling sensitive data.”

About The Author

TJ Triolo

TJ Triolo is the embedded communications specialist for the Ira A. Fulton Schools of Engineering's School of Electrical, Computer and Energy Engineering. He's a 2020 graduate of ASU's Walter Cronkite School of Journalism and Mass Communication. After starting his career in marketing and communications with a car wash company in Arizona, he joined the Fulton Schools' communications team in 2022.
