Doctoral thesis defence: Synthetic tabular data for privacy-preserving data sharing and analysis in health research: evaluation frameworks and infrastructure integration

Author: Mikel Hernández Jiménez

Title: Synthetic tabular data for privacy-preserving data sharing and analysis in health research: evaluation frameworks and infrastructure integration

Supervisors: Naira Aginako / Gorka Epelde

Date: 23 January 2026
Time: 11:00
Venue: Ada Lovelace room (Faculty of Informatics)

Abstract:

"The increasing adoption of artificial intelligence in healthcare has driven the need for large-scale, high-quality real-world data. However, privacy regulations and ethical constraints often limit access to such data, hindering the development and validation of data-driven solutions. In this context, synthetic data generation has emerged as a promising privacy-enhancing technology that enables secure data sharing without compromising individual privacy. Despite its potential, this technology remains relatively immature in the healthcare domain, with limited consensus on evaluation practices and scarce integration in real-world infrastructures.

This doctoral thesis addresses these challenges by exploring the generation, evaluation, and integration of synthetic tabular data to support privacy-preserving data sharing and analysis in healthcare applications. Structured around five peer-reviewed publications, the research is organised into three interconnected research lines, contributing to the field from foundational analysis to real-world implementation.

The first research line has established a taxonomy of synthetic tabular data generation models and highlighted major gaps in the evaluation of fidelity, utility, and privacy of the generated synthetic tabular data. In response, the second research line has developed two standardised evaluation approaches, with the second enhancing the first in standardisation and interpretability. Both were used to benchmark a range of synthetic tabular data generation models, including versions with differential privacy, demonstrating that high-quality synthetic data can be generated that preserves statistical resemblance to real-world data and supports meaningful analytical utility while maintaining low privacy risks. The third and last research line has demonstrated the integration of synthetic tabular data generation in secure infrastructures through the development of a controlled data processing workflow and a privacy-preserving data publishing service integrated within a real-world Living Labs data sharing ecosystem.

Results support the hypothesis that synthetic tabular data generation can enable privacy-preserving analytics while minimising re-identification and disclosure of personal information. Nonetheless, limitations related to models' generalisability, dataset diversity, and evaluation scope suggest future work should explore new generative models (e.g., diffusion models or large language models), extend evaluation dimensions (e.g., efficiency, diversity), and validate deployments in operational environments."