
Shaky Foundations: The Data Crisis Undermining AI


Are We Building AI on Shaky Foundations?

Imagine constructing a skyscraper without thoroughly vetting your materials.

That’s essentially what we’ve been doing in AI, according to a recent study published in Nature.

Researchers introduced a novel framework to evaluate datasets used in biometrics and healthcare through three critical lenses: fairness, privacy, and regulatory compliance.

Fairness looks at how well a dataset represents diverse groups. It considers diversity (Does the dataset include a wide range of demographics?), inclusivity (Are all groups meaningfully represented?), and label reliability (How trustworthy are the labels attached to each record?). A fair dataset should reflect the rich tapestry of humanity, with accurate, self-reported descriptions.

The privacy assessment asks whether datasets could potentially identify individuals or disclose sensitive information. Researchers checked for personal identifiers and sensitive attributes in the data. A high privacy score indicates minimal personal information, safeguarding individual identities.

Regulatory compliance looks at whether the data was reviewed by an ethics board, if informed consent was obtained from participants, and whether there are mechanisms for data correction and deletion. A compliant dataset would tick all these boxes.
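To make the three lenses concrete, here's a minimal sketch of how a dataset audit might be recorded and scored. This is not the paper's actual methodology or schema; the field names and the one-point-per-safeguard tally are illustrative assumptions, chosen to mirror the study's 0–3 compliance scale:

```python
from dataclasses import dataclass

@dataclass
class DatasetAudit:
    """Illustrative audit record; fields are assumptions, not the study's schema."""
    # Fairness signals
    demographic_groups_covered: int    # how many demographic groups appear
    labels_self_reported: bool         # were sensitive labels self-reported?
    # Privacy signals
    has_personal_identifiers: bool     # names, faces, IDs present?
    has_sensitive_attributes: bool     # health status, ethnicity, etc.
    # Regulatory compliance signals
    irb_approved: bool                 # ethics board review
    informed_consent: bool             # consent obtained from participants
    supports_correction_deletion: bool # data correction/deletion mechanisms

    def compliance_score(self) -> int:
        """One point per safeguard, out of 3 (echoing the study's 0-3 scale)."""
        return sum([
            self.irb_approved,
            self.informed_consent,
            self.supports_correction_deletion,
        ])

# A hypothetical dataset with only one safeguard in place:
audit = DatasetAudit(
    demographic_groups_covered=2,
    labels_self_reported=False,
    has_personal_identifiers=True,
    has_sensitive_attributes=True,
    irb_approved=False,
    informed_consent=True,
    supports_correction_deletion=False,
)
print(audit.compliance_score())  # only 1 of 3 safeguards present
```

Even this toy version shows why auditing is tractable: the compliance criteria are binary, checkable facts about how a dataset was collected and maintained.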

The study analyzed 60 datasets, and the results are sobering:

Fairness scores averaged a mere 0.96 out of 5. Our AI systems are learning from woefully unrepresentative data—like teaching a child about society using a book depicting only one type of person.

Privacy scores were slightly better, but a “fairness-privacy paradox” emerged: To make datasets fairer, we often need more demographic data, but gathering more data can heighten privacy risks.

Regulatory compliance scores were truly dismal, averaging just 0.58 out of 3. Many datasets lack even basic safeguards like institutional review board approval, individual consent, and mechanisms for data correction or deletion.

The path forward? Researchers recommend:

1. Securing proper approvals and individual consent
2. Implementing mechanisms for data correction and deletion
3. Striving for diversity while safeguarding privacy
4. Developing comprehensive datasheets documenting dataset characteristics
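The fourth recommendation echoes the widely cited "Datasheets for Datasets" idea: every dataset ships with a structured record of how it was built. Here's a minimal, illustrative skeleton of such a record; the fields and the hypothetical dataset name are my assumptions, not the study's template:

```python
# An illustrative datasheet skeleton -- fields are assumptions loosely
# inspired by the "Datasheets for Datasets" template, not the study's format.
datasheet = {
    "name": "example-face-dataset",  # hypothetical dataset
    "purpose": "benchmarking face verification",
    "collection": {
        "irb_approved": True,
        "informed_consent": True,
        "source": "volunteer photo submissions",
    },
    "composition": {
        "num_records": 10_000,
        "demographics_documented": True,
        "labels": "self-reported age and gender",
    },
    "privacy": {
        "personal_identifiers_removed": True,
        "sensitive_attributes": ["age", "gender"],
    },
    "maintenance": {
        "correction_contact": "maintainer@example.org",
        "deletion_requests_supported": True,
    },
}

# Quick completeness check against the study's three compliance criteria:
required = [
    datasheet["collection"]["irb_approved"],
    datasheet["collection"]["informed_consent"],
    datasheet["maintenance"]["deletion_requests_supported"],
]
print(f"{sum(required)}/3 compliance criteria documented")
```

A datasheet like this makes the fairness, privacy, and compliance questions above answerable at a glance, instead of requiring forensic reconstruction after a model is already deployed.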

As we stand at the precipice of an AI-driven future, are we comfortable building our digital world on sand, or should we take the time to lay a solid bedrock for the AI revolution? The choice is ours.

Contact us today to see how increasing your fairness can increase your bottom line.