Abstract
Introduction
Researchers and clinicians often turn to electronic health record (EHR) data for details on dietary intake and concerns about allergic reactions. In this work, we created a knowledge resource that maps foods from around the world to their common ingredients and potential allergens.
Methods
We used Wikipedia (2024-12-01 English download) to develop a list of foods from around the world. Wikipedia food pages contain a metadata field, main_ingredient, which we used to identify foods. Using the food list obtained from Wikipedia, we created a catalog of ingredients for each food using prompts sent to the large language model (LLM) OpenAI ChatGPT 3.5 API (CGPT). For each food name and its associated ingredients, we then prompted CGPT to list the allergens in that food, restricted to the eight major food allergens described by the Food Allergen Labeling and Consumer Protection Act (FALCPA): milk, eggs, fish, shellfish, tree nuts, peanuts, wheat, and soy. Prompts and knowledge resources are available at: https://github.com/dhanauer/food-allergens.
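The two steps described above, reading the main_ingredient field and asking about the eight allergens, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function names, the regular expressions, and the prompt wording are all assumptions made for the example, and the prompts actually used are the ones in the repository linked above. No API call is made here; the sketch only builds the prompt text.

```python
import re

# The eight major allergens named in the Methods (FALCPA list).
FALCPA_ALLERGENS = ["milk", "eggs", "fish", "shellfish",
                    "tree nuts", "peanuts", "wheat", "soy"]

# Matches a "| main_ingredient = value" line; the value runs to end of line.
# Only spaces/tabs are allowed around "=" so an empty field cannot
# accidentally capture the following line.
_MAIN_INGREDIENT = re.compile(
    r"^[ \t]*\|[ \t]*main_ingredient[ \t]*=[ \t]*(.*)$", re.MULTILINE
)

# Matches wiki links: [[target]] or [[target|label]]; group 1 is the shown text.
_WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def extract_main_ingredient(wikitext):
    """Return the main_ingredient value as plain text, or None if absent/empty.

    Hypothetical helper: assumes the field sits on a single line.
    """
    match = _MAIN_INGREDIENT.search(wikitext)
    if match is None:
        return None
    value = _WIKI_LINK.sub(r"\1", match.group(1)).strip()
    return value or None


def build_allergen_prompt(food, ingredients):
    """Build an illustrative allergen prompt (wording is an assumption)."""
    return (
        f"Food: {food}\n"
        f"Ingredients: {', '.join(ingredients)}\n"
        f"Which of these allergens does this food contain: "
        f"{', '.join(FALCPA_ALLERGENS)}? "
        "Answer with a comma-separated list, or 'none'."
    )
```

A page without the field returns None from extract_main_ingredient, which is how such a helper could be used to decide whether a page counts as a food.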
Results
Using Wikipedia, we identified 6,693 foods with a main_ingredient field. Across these foods, Wikipedia listed 7,030 distinct ingredients. For the same foods, CGPT returned a list of 4,083 ingredients, suggesting that its ingredient naming was more consistent and less variable. An UpSet plot of allergens in the foods is shown in the Figure (see poster). We noted issues with both the Wikipedia output and the CGPT output: for example, neither source always listed ingredients in a consistent manner, and some entries were quite free-form.
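To illustrate why free-form ingredient strings inflate a distinct-ingredient count, the sketch below splits and normalizes raw strings before counting. It is an assumption-laden example, not the procedure used to obtain the counts reported above: the function names and the splitting rules (commas, semicolons, and the word "and") are hypothetical.

```python
import re


def split_ingredients(raw):
    """Split a free-form ingredient string into lowercase, trimmed items.

    Hypothetical rules: split on commas, semicolons, and the word "and".
    """
    parts = re.split(r",|;|\band\b", raw.lower())
    # Collapse internal whitespace and drop surrounding spaces and periods.
    cleaned = (re.sub(r"\s+", " ", part).strip(" .") for part in parts)
    return [item for item in cleaned if item]


def count_distinct(raw_values):
    """Count distinct normalized ingredients across many raw strings."""
    distinct = set()
    for raw in raw_values:
        distinct.update(split_ingredients(raw))
    return len(distinct)
```

Without this kind of normalization, strings such as "Eggs, milk" and "milk; EGGS" would be counted as different ingredient entries even though they name the same two ingredients.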
Discussion
We found Wikipedia most useful for creating an overall list of foods, whereas the LLM was better at providing a consistent list of ingredients. A combined approach proved most accurate.
