Abstract
INTRODUCTION: EMERSE (Electronic Medical Record Search Engine) is a search engine for free text clinical documents. EMERSE is designed for non-technical researchers, with a user interface that allows for simple query building and patient list management. EMERSE is deployed or is being deployed, at academic medical centers across the U.S. and in Europe. Users have appreciated the speed and simplicity of EMERSE but have sought additional capabilities that a traditional search index cannot provide: the most common feature request has been for the system to support negation so that a user can exclude negated terms from the results. Recently, we integrated an "aligned-layer retrieval model" approach within EMERSE, wherein additional layers of attributes/tokens are layered over the original indexed terms.
TECHNICAL DETAILS: Our current implementation includes the following layers: (1) original text, case-sensitive; (2) original text, case-insensitive; (3) negation status; (4) uncertainty status; (5) subject status (patient vs other); (6) concept as a UMLS CUI; (7) semantic type, based on the CUI. The semantic type labeling allows a user to highlight all terms in a document based on semantic type, such as Drugs, Diseases, Procedures, and more, which can help with chart abstraction.
One of the largest challenges we have encountered is properly aligning concepts with the indexed terms because of the various ways in which text offsets are managed, especially as it relates to various tokenizer settings. For example, our basic search index strips hyphens from the text, but we wanted hyphens for determining some forms of negation (e.g., -ve = negative). Additionally, our system was built to handle notes in plain text, and those formatted in HTML, but most NLP systems are designed only for plain text. Therefore, we must be careful to ensure that the location of HTML tags does not alter the location of the offsets when joining the layers together between NLP and indexing.
With the capabilities of the native search engine (including proximity search, fuzzy search, and wildcard search) and integration of NLP, powerful queries can be written. Further, CUIs can be mixed with regular terms. For example, the query “C0000737 left” with a proximity of 5 words can identify all of the following phrases: (1) “left abdominal pain”; “left flank abdominal pain”; “left lower abdominal pain”; “left upper quadrant abdominal pain”; “abdominal pain in the left”; “abdominal pain, left”; “abdominal pain, which began in his left”; “left-sided upper quadrant abdominal pain”.
Based on our sample dataset of approximately 635,000 test “documents” (mostly PubMed abstracts), there were 274,871 negation tokens, 324,842 uncertainty tokens, and 87,723 subject tokens. The addition of these tokens increased the size of the index by 26% (2.3 GB without the tokens versus 2.9 GB with the tokens), but these additional tokens had no discernable difference in the time required to identify a cohort based on a query (~1 second).
CONCLUSION: The EMERSE system, with the addition of NLP components, will provide additional value to users. It is still undergoing testing at the time of this writing, but we anticipate a release to the community sometime in 2024.
REFERENCES: See the poster.
