Incorporating Natural Language Processing (NLP) into the EMERSE Information Retrieval System

David A. Hanauer; Lisa A. Ferguson; Kellen J. McClain; Guan Wang

Abstract

INTRODUCTION: EMERSE (Electronic Medical Record Search Engine) is a search engine for free-text clinical documents. It is designed for non-technical researchers, with a user interface that allows for simple query building and patient list management. EMERSE is deployed, or is being deployed, at academic medical centers across the U.S. and in Europe. Users have appreciated the speed and simplicity of EMERSE but have sought additional capabilities that a traditional search index cannot provide. The most common feature request has been support for negation, so that a user can exclude negated terms from the results. Recently, we integrated an "aligned-layer retrieval model" approach within EMERSE, in which additional layers of attributes/tokens are placed over the original indexed terms.

TECHNICAL DETAILS: Our current implementation includes the following layers: (1) original text, case-sensitive; (2) original text, case-insensitive; (3) negation status; (4) uncertainty status; (5) subject status (patient vs other); (6) concept as a UMLS CUI; (7) semantic type, based on the CUI. The semantic type labeling allows a user to highlight all terms in a document based on semantic type, such as Drugs, Diseases, Procedures, and more, which can help with chart abstraction.
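As an illustration only, the seven layers can be pictured as attributes attached to each token position. The sketch below is not the EMERSE implementation: the `Token` class, the `find` and `highlight` helpers, the sample note, and the placeholder concept identifiers (`CUI-PNEUMONIA`, `CUI-ASPIRIN`) are all assumptions made for the example, and the annotations are written by hand rather than produced by an NLP pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One indexed position carrying the aligned layers (hypothetical structure)."""
    text: str                             # layer 1: original text, case-sensitive
    negated: bool = False                 # layer 3: negation status
    uncertain: bool = False               # layer 4: uncertainty status
    subject: str = "patient"              # layer 5: patient vs other
    cui: Optional[str] = None             # layer 6: concept identifier (placeholder values here)
    semantic_type: Optional[str] = None   # layer 7: semantic type derived from the concept

    @property
    def text_lower(self) -> str:          # layer 2: original text, case-insensitive
        return self.text.lower()


def find(tokens: List[Token], term: str, include_negated: bool = True) -> List[int]:
    """Return positions whose case-insensitive text matches, optionally skipping negated ones."""
    return [
        i for i, t in enumerate(tokens)
        if t.text_lower == term.lower() and (include_negated or not t.negated)
    ]


def highlight(tokens: List[Token], semantic_type: str) -> str:
    """Bracket every token that carries the requested semantic type."""
    return " ".join(
        f"[{t.text}]" if t.semantic_type == semantic_type else t.text for t in tokens
    )


# Hand-annotated toy note: "No pneumonia. Mother had pneumonia. Takes aspirin."
NOTE = [
    Token("No"),
    Token("pneumonia", negated=True, cui="CUI-PNEUMONIA", semantic_type="Diseases"),
    Token("Mother", subject="other"),
    Token("had", subject="other"),
    Token("pneumonia", subject="other", cui="CUI-PNEUMONIA", semantic_type="Diseases"),
    Token("Takes"),
    Token("aspirin", cui="CUI-ASPIRIN", semantic_type="Drugs"),
]

print(find(NOTE, "pneumonia"))                         # all mentions
print(find(NOTE, "pneumonia", include_negated=False))  # negated mention excluded
print(highlight(NOTE, "Drugs"))
```

Because every layer shares the same token positions, excluding negated mentions or highlighting by semantic type reduces to a filter on one attribute at positions already matched on another.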

One of the largest challenges we have encountered is properly aligning concepts with the indexed terms, because text offsets are managed in different ways depending on tokenizer settings. For example, our basic search index strips hyphens from the text, but we wanted to retain hyphens for determining some forms of negation (e.g., -ve = negative). Additionally, our system was built to handle both plain-text notes and notes formatted in HTML, but most NLP systems are designed only for plain text. Therefore, we must ensure that HTML tags do not shift the offsets when the NLP and indexing layers are joined.
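One way to keep offsets consistent, shown here as a minimal sketch rather than the approach EMERSE actually uses, is to record the original HTML offset of every character that survives tag stripping. The function names and the simple tag pattern are assumptions for the example; the pattern does not handle HTML entities, comments, or a `>` inside attribute values.

```python
import re
from typing import List, Tuple

# Simplified tag pattern (assumption): anything between "<" and the next ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags_with_offsets(html: str) -> Tuple[str, List[int]]:
    """Return the plain text plus, for each plain-text character, its offset in the HTML."""
    chars: List[str] = []
    offsets: List[int] = []
    pos = 0
    for match in TAG.finditer(html):
        for i in range(pos, match.start()):   # copy the text before this tag
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()                     # skip over the tag itself
    for i in range(pos, len(html)):           # copy any text after the last tag
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_html_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a non-empty plain-text span [start, end) back to a span in the original HTML."""
    return offsets[start], offsets[end - 1] + 1


html = "<p>No <b>chest pain</b> today</p>"
plain, offsets = strip_tags_with_offsets(html)

# Pretend an NLP system, which only saw the plain text, reported this span.
start = plain.find("chest pain")
end = start + len("chest pain")
html_start, html_end = to_html_span(offsets, start, end)
print(plain)
print(html[html_start:html_end])
```

If an annotated span straddles a tag, the mapped HTML slice will include that tag, so a real implementation would need a policy for such cases.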

With the capabilities of the native search engine (including proximity search, fuzzy search, and wildcard search) and the integration of NLP, powerful queries can be written. Further, CUIs can be mixed with regular terms. For example, the query “C0000737 left” with a proximity of 5 words can identify all of the following phrases: “left abdominal pain”; “left flank abdominal pain”; “left lower abdominal pain”; “left upper quadrant abdominal pain”; “abdominal pain in the left”; “abdominal pain, left”; “abdominal pain, which began in his left”; “left-sided upper quadrant abdominal pain”.
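The mixed query can be illustrated with a toy matcher. This is a sketch under several assumptions, not the search engine's actual proximity semantics: the tokenizer drops punctuation and hyphens, a stand-in annotator tags only the bigram "abdominal pain" with C0000737, the CUI is attached to every token of the concept span, and two items are "within 5 words" when their positions differ by at most 5.

```python
import re
from typing import List, Optional, Tuple

Layers = Tuple[List[str], List[Optional[str]]]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_layers(text: str) -> Layers:
    """Return aligned term and CUI layers; the annotator is a toy stand-in for real NLP."""
    terms = tokenize(text)
    cuis: List[Optional[str]] = [None] * len(terms)
    for i in range(len(terms) - 1):
        if terms[i] == "abdominal" and terms[i + 1] == "pain":
            cuis[i] = cuis[i + 1] = "C0000737"
    return terms, cuis


def positions(layers: Layers, item: str) -> List[int]:
    """Look the item up in the CUI layer if it looks like a CUI, else in the term layer."""
    terms, cuis = layers
    layer = cuis if re.fullmatch(r"C\d{7}", item) else terms
    return [i for i, value in enumerate(layer) if value == item]


def within(layers: Layers, a: str, b: str, window: int) -> bool:
    """True if some occurrence of a and some occurrence of b are at most window positions apart."""
    return any(
        abs(i - j) <= window
        for i in positions(layers, a)
        for j in positions(layers, b)
    )


PHRASES = [
    "left abdominal pain",
    "left flank abdominal pain",
    "left lower abdominal pain",
    "left upper quadrant abdominal pain",
    "abdominal pain in the left",
    "abdominal pain, left",
    "abdominal pain, which began in his left",
    "left-sided upper quadrant abdominal pain",
]

for phrase in PHRASES:
    print(phrase, within(build_layers(phrase), "C0000737", "left", 5))
```

Under these assumptions every phrase listed above matches, while text lacking either the concept or the word "left" does not.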

Based on our sample dataset of approximately 635,000 test “documents” (mostly PubMed abstracts), there were 274,871 negation tokens, 324,842 uncertainty tokens, and 87,723 subject tokens. The addition of these tokens increased the size of the index by 26% (2.3 GB without the tokens versus 2.9 GB with the tokens), but the additional tokens made no discernible difference in the time required to identify a cohort based on a query (~1 second).
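The reported growth can be checked from the figures above. The short calculation below uses only those numbers; the tokens-per-document average is a derived estimate, since the document count is itself approximate.

```python
# Figures as reported in the abstract.
index_gb_without = 2.3
index_gb_with = 2.9
layer_tokens = {"negation": 274_871, "uncertainty": 324_842, "subject": 87_723}
documents = 635_000  # approximate

# Relative growth of the index after adding the NLP layers.
growth_pct = (index_gb_with - index_gb_without) / index_gb_without * 100

# Total added tokens and a rough average per document.
total_tokens = sum(layer_tokens.values())
tokens_per_document = total_tokens / documents

print(f"index growth: {growth_pct:.0f}%")
print(f"added tokens: {total_tokens:,}")
print(f"added tokens per document: {tokens_per_document:.2f}")
```

The growth works out to about 26%, matching the stated figure, with roughly one added status token per document on average.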

CONCLUSION: The EMERSE system, with the addition of NLP components, will provide additional value to users. It is still undergoing testing at the time of this writing, but we anticipate a release to the community sometime in 2024.

REFERENCES: See the poster.

Poster

non-peer-reviewed

Author Information

David A. Hanauer (Corresponding Author)

Learning Health Sciences, University of Michigan, Ann Arbor, USA

Lisa A. Ferguson

Department of Learning Health Sciences, University of Michigan, Ann Arbor, USA

Kellen J. McClain

Informatics, Michigan Medicine, Ann Arbor, USA

Guan Wang

Department of Learning Health Sciences, Michigan Medicine, Ann Arbor, USA

Poster Information

Meeting

American Medical Informatics Association (AMIA) 2024 Informatics Summit, March 17, 2024 - March 20, 2024

Publication history

Published: October 10, 2024

Copyright

© Copyright 2024 Hanauer et al. This is an open access poster distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
