Leveraging Wikipedia and a Large Language Model to Develop a Knowledge Resource Containing World Foods and Allergens


Abstract

Introduction

Researchers and clinicians often turn to electronic health record (EHR) data for details on dietary intake as well as concerns for allergic reactions. Assessments by clinicians can sometimes record food names, as well as potential ingredients. The goal of this work was to create a knowledge resource that could be used help match foods from around the world to their common ingredients and potential allergens, to better support research or clinical initiatives.

Methods

We used Wikipedia (2024-12-01 English download) to develop a large list of foods from around the world, by leveraging the ‘Infobox food’ template (see poster). This template contains a metadata field main_ingredient that was used to identify foods with ingredients listed. In addition to extracting the ingredients, the name of the food was derived from the title field when available, or the name field otherwise. Using the food list obtained from Wikipedia, we created a separate catalog of ingredients for each food using a prompt sent to the large language model (LLM) OpenAI ChatGPT 3.5 API (CGPT). Then, for each food name and associated ingredients, we prompted CGPT for a list of allergens in each food, based on the 8 major food allergens described by the Food Allergen Labeling and Consumer Protection Act of 2004 (FALCPA): milk, eggs, fish, shellfish, tree nuts, peanuts, wheat, and soy. Finally, to characterize the likelihood of encountering these food names and ingredients in our EHR, we utilized our local EMERSE instance to obtain frequency counts of the food names and ingredients in our corpus of 450 million notes from 3.1 million patients.

Results

Using Wikipedia we identified 6,693 foods that had the main_ingredient metadata field, although some ingredient fields were blank. Among these foods, 2,265 (34%) were mentioned at least once in the EHR. Some of the most common foods mentioned in the EHR included yogurt (n=363,322 notes), bread (n=260,302), chocolate (n=247,260), and pasta (n=190,961). From Wikipedia, a list of 7,030 distinct ingredients were obtained. Using CGPT for the same foods, a list of 4,083 ingredients were obtained, suggesting more consistency and less variability in the CGPT-derived ingredient list. Among this list of CGPT-derived ingredients, 2,734 (67%) were mentioned at least once in the EHR. An UpSet plot of allergens in the foods, as well as co-occurrences of the allergens in individual foods, is shown in poster. For the food allergens, wheat was most common, appearing in 3,117 distinct foods, followed by milk (n=1,944), eggs (n=1,555), fish, (n=743) and tree nuts (n=697). There were 1,507 foods that contained no allergens; 2,552 foods contained 1 allergen; 17 foods contained 5 allergens; no foods had 6 or more allergens. We noted issues with both the Wikipedia output and CGPT output: for example, both Wikipedia and CGPT did not always list ingredients in a consistent manner, and some were quite free-form (e.g., hot dog in Wikipedia: “Sausage made from pork, beef, chicken, turkey or combinations thereof and a bun”).

Discussion

We found Wikipedia to be most useful for creating an overall list of foods, but the LLM was better at providing a consistent list of ingredients. Using both resources together was a useful approach. One limitation of this work is that we relied on identification of foods based on a specific metadata field (main_ingredient) within Wikipedia, which was not always filled out. Additionally, we used only one LLM.

Poster
non-peer-reviewed

Leveraging Wikipedia and a Large Language Model to Develop a Knowledge Resource Containing World Foods and Allergens


Author Information

Simon Shavit

Literature, Science, and the Arts, University of Michigan, Ann Arbor, USA

David A. Hanauer Corresponding Author

Learning Health Sciences, University of Michigan, Ann Arbor, USA


PDF Share