Leveraging Wikipedia and a Large Language Model to Develop a Knowledge Resource Containing World Foods and Allergens

Simon Shavit; David A. Hanauer

Abstract

Introduction

Researchers and clinicians often turn to electronic health record (EHR) data for details on dietary intake and concerns about allergic reactions. In this work, we created a knowledge resource that maps foods from around the world to their common ingredients and potential allergens.

Methods

We used Wikipedia (2024-12-01 English download) to develop a list of foods from around the world. Wikipedia food pages contain a metadata field, main_ingredient, which we used to identify foods. Using the food list obtained from Wikipedia, we created a catalog of ingredients for each food using prompts sent to the large language model (LLM) OpenAI ChatGPT 3.5 API (CGPT). For each food name and its associated ingredients, we then prompted CGPT to list the allergens in that food, drawing on the eight major food allergens described by the Food Allergen Labeling and Consumer Protection Act: milk, eggs, fish, shellfish, tree nuts, peanuts, wheat, and soy. Prompts and knowledge resources are available at: https://github.com/dhanauer/food-allergens.
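The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the regular expressions, and the prompt wording are all hypothetical (the actual prompts are in the linked repository), and no API call is made. The sketch assumes the main_ingredient value sits on a single infobox line and that the model replies with a short list of allergen names.

```python
import re

# The eight major allergens named in the Methods (FALCPA).
FALCPA_ALLERGENS = ("milk", "eggs", "fish", "shellfish",
                    "tree nuts", "peanuts", "wheat", "soy")

# Matches a "| main_ingredient = ..." infobox line (single-line values only).
_FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*main_ingredient[ \t]*=[ \t]*(.*)$", re.MULTILINE
)
# Matches [[target]] or [[target|label]] and captures the visible text.
_LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def extract_main_ingredients(wikitext):
    """Return lowercase ingredient strings from a page's main_ingredient field."""
    match = _FIELD_RE.search(wikitext)
    if match is None:
        return []  # page has no main_ingredient field, so it is not treated as a food
    value = _LINK_RE.sub(r"\1", match.group(1))  # drop wiki-link markup
    value = re.sub(r"<br\s*/?>", ",", value, flags=re.IGNORECASE)
    return [part.strip().lower() for part in value.split(",") if part.strip()]


def build_allergen_prompt(food, ingredients):
    """Build a hypothetical prompt; the wording is an assumption, not the authors'."""
    return (
        f"Food: {food}\n"
        f"Ingredients: {', '.join(ingredients)}\n"
        f"Which of these allergens may be present: {', '.join(FALCPA_ALLERGENS)}?\n"
        "Reply with a comma-separated list, or 'none'."
    )


def parse_allergen_reply(reply):
    """Keep only the allergen names that appear as whole words in the reply."""
    text = reply.lower()
    return [a for a in FALCPA_ALLERGENS
            if re.search(rf"\b{re.escape(a)}\b", text)]
```

Whole-word matching keeps "fish" from matching inside "shellfish", but it does not handle negation: a reply such as "no milk" would still be read as milk, so a real pipeline would need a stricter reply format or manual review.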

Results

Using Wikipedia, we identified 6,693 foods with a main_ingredient field. Across these foods, the Wikipedia field yielded 7,030 distinct ingredients. Prompting CGPT for the same foods produced a list of 4,083 ingredients, suggesting that the CGPT output was more consistent and less variable. An UpSet plot of allergens in the foods is shown in the Figure (see poster). We noted issues with both the Wikipedia output and the CGPT output: for example, neither source always listed ingredients in a consistent manner, and some entries were quite free-form.
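To make the counts above concrete, the following Python sketch shows one way distinct ingredients and allergen combinations could be tallied. It is an illustration under stated assumptions, not the authors' method: the helper names and the normalization rule (lowercasing and collapsing whitespace) are hypothetical. The combination counts are the kind of tally an UpSet plot displays, one count per exact set of allergens.

```python
from collections import Counter


def normalize_ingredient(name):
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(name.lower().split())


def distinct_ingredients(food_ingredients):
    """Return the set of normalized ingredient names across all foods."""
    return {
        normalize_ingredient(item)
        for items in food_ingredients.values()
        for item in items
        if item.strip()
    }


def allergen_combination_counts(food_allergens):
    """Count how many foods share each exact combination of allergens."""
    return Counter(frozenset(allergens) for allergens in food_allergens.values())
```

Free-form entries inflate the distinct count: "Eggs" and "eggs" collapse to one ingredient under this rule, but "egg" and "eggs" would not, which is one way inconsistent naming shows up as extra distinct ingredients.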

Discussion

We found Wikipedia to be most useful for creating an overall list of foods, but the LLM was better at providing a consistent list of ingredients. A combined approach proved to be most accurate.

Poster

non-peer-reviewed

Author Information

Simon Shavit

Literature, Science, and the Arts, University of Michigan, Ann Arbor, USA

David A. Hanauer (Corresponding Author)

Learning Health Sciences, University of Michigan, Ann Arbor, USA

Poster Information

Meeting

U-M Annual Data Science & AI Summit 2025, November 16-17, 2025

Publication history

Published: December 08, 2025

Copyright

© Copyright 2025 Shavit et al. This is an open access poster distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
