Language Barriers to Online Search Interest for COVID-19: A Global Infodemiological Study

Background: Implementation of coronavirus disease 2019 (COVID-19) pandemic control measures requires the engagement and participation of the public in a synchronized manner. Language may be a barrier to engaging the public in a concerted manner. The relative volumes of English and non-English COVID-19-related web searches estimate public interest among English and non-English “searchers,” respectively. Asynchrony between English and non-English search interest may suggest language-related lapses in public engagement. Addressing these lapses may improve public health communications. In this study, we aimed to describe the distribution of and temporal trends in English and non-English online search interest for COVID-19 and to identify lags between English and non-English search interest.
Methodology: Search interest data (Baidu Index for China, Google Trends for other countries) were queried for the keywords “coronavirus,” “covid 19,” and their non-English equivalents between January 1, 2019, and September 30, 2020, for each country (n = 230). Daily total, English, and non-English search interest were recorded. Search interest variables were described at global, regional, and country levels. The cross-correlation function was used to identify lags between English and non-English search interest at global, regional, and country levels.
Results: Globally, 9.69% of total searches relating to COVID-19 utilized non-English keywords. Among included regions, 64.7% (11/17) had significant non-English interest. Central Asia had the highest proportion of non-English interest (81.13% of total interest), followed by Eastern Europe (56.17%), Eastern Asia, Western Asia, and Northern Africa (all over 20%). Among included countries, 33.5% (77/230) had significant non-English interest. The cross-correlation function identified significant lags between English and non-English interest in six regions (median lag [interquartile range, IQR]: -0.5 [6.00] days) and 24 countries (median lag [IQR]: -1 [4.25] days).
Conclusions: Non-English keywords contribute substantially to searches relating to COVID-19 in certain countries and regions. Numerous locations exhibit significant lags between English and non-English search interest, suggesting language-related discrepancies in the interest for COVID-19. Further research is required to address the root cause of these lags.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic has resulted in unprecedented morbidity and mortality, as well as unparalleled economic, political, and social losses worldwide. Public health bodies have implemented a plethora of interventions to manage the pandemic [1]. The prompt participation of the public in a concerted manner is required for many of these interventions to be effective.
Google Trends and Baidu Index are search engine analytics tools [2]. They provide a measure of the relative search volume (RSV) for any keyword on a given day. As online searches can be construed as demand for information relating to the search topic, RSVs represent a surrogate measure of public interest in a topic. Several studies have investigated RSVs as a marker of interest in topics pertaining to COVID-19 [3][4][5][6][7][8].
Thus far, language utilization for COVID-19-related searches has not been explored. An understanding of the distribution of and temporal trends in language utilization may allow optimization of public health

Regional English-only data/Regional multi-language data
Total interest: Total interest for region y on day x = Median total interest among countries in region y on day x

Global English-only data/Global multi-language data
Total interest: Global total interest on day x = Median total interest among all countries on day x

# Values for each listed variable were derived from country data and calculated in the same way.
Interest variables: total interest, non-English percentage, English interest, non-English interest.
Day x: represents each day during the study period.
Region y: represents each region included in this study.
Regional and global data were derived from the country-level data. For regional data, the median value of interest variables among countries in each region was calculated for each day. For global data, the median values among all included countries were calculated. Table 1 provides further details.
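The aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analyses in this study were run in R); the country names, region names, and values below are hypothetical toy data, and `daily_median` is an illustrative helper name.

```python
from statistics import median

def daily_median(series_by_country, countries):
    """Per-day median of an interest variable across the given countries.

    Assumes every country series covers the same days in the same order.
    """
    selected = [series_by_country[c] for c in countries]
    # zip(*selected) walks through the series one day at a time.
    return [median(day_values) for day_values in zip(*selected)]

# Hypothetical toy data: total interest on three days for three countries.
total_interest = {"A": [10, 40, 80], "B": [0, 20, 100], "C": [5, 60, 90]}
regions = {"Region 1": ["A", "B"], "Region 2": ["C"]}

# Regional series: median among the countries of each region, per day.
regional = {name: daily_median(total_interest, members)
            for name, members in regions.items()}

# Global series: median among all included countries, per day.
global_series = daily_median(total_interest, list(total_interest))
```

The same helper would be applied to each interest variable in turn (total interest, non-English percentage, English interest, non-English interest).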

English-Only and Multi-Language Searching Countries
A country or region was considered to have significant non-English interest if the proportion of non-English searches (non-English percentage) on any day during the study period was ≥5%. Countries with significant non-English interest were defined as multi-language searching countries, while the rest were defined as English-only searching countries.
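As a rough illustration of this classification rule, the Python sketch below flags a location as multi-language searching when its non-English percentage reaches 5% on at least one day. It assumes that the non-English percentage is the non-English interest divided by the sum of English and non-English interest, and that days with no recorded interest count as 0%; both are assumptions made for the example, and the function names are hypothetical.

```python
def non_english_percentage(english, non_english):
    """Share of non-English interest on one day, in percent.

    Assumption: total interest = English + non-English interest.
    Days with no interest at all are treated as 0% (also an assumption).
    """
    total = english + non_english
    return 100.0 * non_english / total if total > 0 else 0.0

def is_multi_language(english_series, non_english_series, threshold=5.0):
    """True if the non-English percentage is >= threshold on any day."""
    return any(
        non_english_percentage(e, n) >= threshold
        for e, n in zip(english_series, non_english_series)
    )

# Hypothetical two-day series for two locations.
print(is_multi_language([100, 95], [0, 5]))   # one day at exactly 5%
print(is_multi_language([100, 99], [0, 1]))   # never above 1%
```

Because the rule uses "any day", a single day at or above the threshold is enough to classify a location as multi-language searching.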
For each region, median values of total interest among English-only and multi-language searching countries within the region were calculated separately. Global total interest among English-only and multi-language searching countries was calculated similarly. Table 1 presents details.
Time-series analyses were performed using RStudio v1.2.1335 (Boston, MA, USA). To assess the lags between two time series, cross-correlation function (CCF) analysis and ARIMA modeling were used [17]. The x time series comprised English interest or total interest among English-only searching countries, while the y time series comprised non-English interest or total interest among multi-language searching countries, as specified. First, an ARIMA model was fitted to x using the auto.arima function of the forecast package in R [18]. Subsequently, the y series was filtered with the ARIMA model for x. Finally, CCF analysis was performed between the residuals of x (i.e., pre-whitened x) and the filtered y (i.e., transformed y) series. If the highest positive correlation was non-contemporaneous, this suggested that the x series lagged or led y. P-values of <0.05 were considered statistically significant for all analyses.
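The pre-whitening procedure can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline used in this study: the model chosen by auto.arima is replaced with a fixed AR(1) filter, the data are synthetic, and all function names are hypothetical. The lag convention follows R's ccf(x, y), which correlates x[t+k] with y[t], so a peak at a negative lag means that x leads y.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_ar1(series):
    """Least-squares AR(1) coefficient of the mean-centred series."""
    m = mean(series)
    c = [v - m for v in series]
    num = sum(c[t] * c[t - 1] for t in range(1, len(c)))
    den = sum(v * v for v in c[:-1])
    return num / den if den else 0.0

def ar1_filter(series, phi):
    """Apply the filter (1 - phi*B) to the mean-centred series."""
    m = mean(series)
    c = [v - m for v in series]
    return [c[t] - phi * c[t - 1] for t in range(1, len(c))]

def ccf(x, y, max_lag):
    """Correlation between x[t+k] and y[t] for k = -max_lag..max_lag."""
    n = len(x)
    mx, my = mean(x), mean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    scale = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    return {
        k: sum(dx[t + k] * dy[t] for t in range(n) if 0 <= t + k < n) / scale
        for k in range(-max_lag, max_lag + 1)
    }

def prewhitened_peak(x, y, max_lag=14):
    """Fit the filter on x, apply it to both series, return the CCF peak."""
    phi = fit_ar1(x)            # model fitted to x only
    xw = ar1_filter(x, phi)     # residuals of x (pre-whitened x)
    yw = ar1_filter(y, phi)     # y filtered with the model for x
    correlations = ccf(xw, yw, max_lag)
    best_lag = max(correlations, key=correlations.get)
    bound = 1.96 / len(xw) ** 0.5   # approximate 95% bound for white noise
    return best_lag, correlations[best_lag], bound

# Synthetic demo: x is autocorrelated and y repeats x three days later.
random.seed(0)
n, shift = 300, 3
full = [0.0]
for _ in range(n + shift - 1):
    full.append(0.8 * full[-1] + random.gauss(0, 1))
x = full[shift:]                                         # x[t] = full[t + 3]
y = [full[t] + random.gauss(0, 0.3) for t in range(n)]   # y[t] = x[t - 3] + noise

best_lag, best_corr, bound = prewhitened_peak(x, y)
print(best_lag, round(best_corr, 2), round(bound, 2))
```

In this toy setup the peak falls at a negative lag because x leads y by construction. A peak correlation larger than the approximate bound corresponds to the usual p < 0.05 criterion for a single lag.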

Descriptive statistics
Country-level summary statistics are presented in Appendices. In Table 2, summary statistics of global and regional search interest are described. Significant non-English search interest was present in 33.5% (77/230) of countries and 64.7% (11/17) of regions. Summary statistics of global and regional search interest stratified by language (multi-language versus English-only search interest) are shown in Table 3. The geographical distribution of non-English search interest is depicted in Figure 2. Graphical plots of search interest over time (global and regional, Figure 4) and country-level search interest are presented in Appendices (Figures 5-11).

The regional and global lag between total interest in English-only and multi-language searching countries
Of the 230 countries included in this study, 77 (33.48%) utilized both English and non-English keywords (multi-language searching countries), while 153 (66.52%) utilized exclusively English keywords (English-only searching countries). Cross-correlation of total interest between English-only and multi-language searching countries suggested no significant global or regional lags (Table 4).

TABLE 4: Global and regional cross-correlations for total interest between English-only and multi-language searching countries.
Dependent variable: median of total interest among English-only searching countries.
Independent variable: median of total interest among multi-language searching countries.

Lags between English interest and non-English interest within each country, region, and globally
English and non-English interest were contemporaneous on a global scale. Regionally, language-related asynchrony in search interest was detected in 54.55% (6/11) of regions with significant non-English interest (Table 5). English interest lagged non-English interest in Latin America and the Caribbean (+1 day), Southeastern Asia (+1 day), and Northern Africa (+3 days), and led non-English interest in Central Asia (-7 days), Northern Europe (-2 days), and Sub-Saharan Africa (-6 days). English and non-English interest were contemporaneous in the other regions (45.45%, 5/11).
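As a worked check, the six regional lags listed above can be summarized as a median and an interquartile range. The Python sketch below assumes quartiles computed by linear interpolation between order statistics (the "inclusive" method in Python, which matches the default quantile type in R); under that assumption it gives the median of -0.5 days and IQR of 6.00 days quoted in the abstract.

```python
from statistics import median, quantiles

# Regional lags in days, taken from the paragraph above
# (positive: English interest lags; negative: English interest leads).
regional_lags = [1, 1, 3, -7, -2, -6]

# Quartiles by linear interpolation between order statistics.
q1, _, q3 = quantiles(regional_lags, n=4, method="inclusive")

median_lag = median(regional_lags)
iqr = q3 - q1
print(median_lag, iqr)
```

A median close to zero with a wide IQR indicates that the regional lags point in both directions rather than consistently favoring one language.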

CCF at lags (days) Lag at max CCF Correlation at max CCF#
Global 0.

TABLE 5: Global and regional cross-correlations between English and non-English search interest.
Dependent variable: median English interest during the study period among multi-language searching countries within region.
Independent variable: median non-English interest during the study period among multi-language searching countries within region. # Pearson's correlation; *p < 0.05.

CCF: cross-correlation function
Overall, language-related asynchrony in search interest was identified in 31.17% (24/77) of countries with significant non-English interest (Table 6). Specifically, English interest lagged non-English interest in 16.88% (13/77) and led in 14.29% (11/77). Figure 3 illustrates the distribution of lags between English and non-English interest globally.
Note: A negative lag value may suggest that English interest occurs ahead of (i.e., leads) non-English interest, while a positive lag value would suggest the converse.

Discussion
The widespread use of search engines, such as Google and Baidu, to query pandemic-related information using local language keywords has provided new opportunities for health surveillance. Using Google Trends and Baidu Index search analytics data, we identified countries and regions with language-related lags in search interest for the first time. Further research is required to identify the reason for asynchrony in search interest based on the keyword language used. Addressing these issues on a case-by-case basis may improve health communications and result in better implementation of pandemic control policies.
In 2021, Google Search accounted for approximately 92% of the global search engine market share [19]. While Google Search is the predominant search engine in most countries, Baidu Search holds the largest market share in China (75% of the market share). Together, the large user bases of the Google and Baidu search engines account for their usefulness in epidemiological research.
Previous studies have investigated epidemiological applications of Google Trends and Baidu Index data with regard to the COVID-19 pandemic. Ciaffi et al. demonstrated that searches for symptom keywords such as "fever" and "cough" were associated with intensive care unit (ICU) admissions and deaths related to COVID-19. The premise of this hypothesis was that as patients experienced symptoms, the frequency of Google searches for those symptoms would increase [6]. Using a more generalized approach to keyword selection, Mavragani et al. demonstrated that searches for "coronavirus" are associated with COVID-19 incidence and mortality in the United States [5]. Similar studies showed significant correlations between keyword searches and the incidence of COVID-19 cases in other countries [3,7]. Husnayain et al. assessed the lag between COVID-19-related searches and the incidence of cases in various provinces of Taiwan [8]. They suggested that search interest can help identify the optimal timing and location for risk communications relating to the pandemic. While applications of search engine data for epidemic surveillance, forecasting, and public health communications relating to COVID-19 have been studied, keyword language utilization has not been explored.

Global distribution of search language
Herein, the worldwide distribution of English and non-English language keyword utilization for COVID-19-related web searches on Google (worldwide) and Baidu (China) search engines is described (Figure 2, Tables 2-4). While interpreting the data, it is important to note that the language chosen for online searches does not always reflect the languages that are predominantly spoken in a specific country. For instance, while Hindi is the most commonly spoken language in India, English keywords are predominantly used for web searches. In addition, English and local language keywords referring to COVID-19 might be identical for several languages (e.g., French).
Our results suggest that non-English search keywords are often used, accounting for 9.69% of total searches relating to COVID-19 globally. Most countries with significant non-English keyword utilization were concentrated in geographically contiguous regions, including Central Asia, Eastern Europe, Eastern Asia, Western Asia, and Northern Africa, among other regions. These data may be used to tailor the language of global health communications to local search language preferences.

Temporal trends in search language use
The temporal changes in search interest in various regions and countries are depicted in Figure 4 and Figures 5-11, respectively. Some regions, such as Central Asia, Eastern Asia, and Eastern Europe, demonstrated predominantly non-English search utilization throughout the course of this study.
Interestingly, Northern Africa, Southeastern Asia, Southern Asia, and Western Asia showed a high percentage of non-English language searches early in the pandemic, peaking between January and February 2020, followed by a precipitous decline. Referring to individual country-level plots for search interest in each respective region (Figures 5-11), it is apparent that many but not all countries within these regions displayed this pattern. Therefore, regional generalization should be interpreted carefully.
Speculatively, early in the pandemic, there may have been a sudden increase in the demand for information without knowledge of the most appropriate keywords to use. Once the public was educated on appropriate English keywords by the media, government agencies, and other sources, their search habits might have changed. The standardization of nomenclature for the 2019 coronavirus disease by the WHO on February 11, 2020, might have also contributed to the change in search language preferences. These trends bear important implications for future global health emergencies. The early definition of standard terminology in multiple languages is essential to direct the sudden increase in the demand for information to appropriate resources.

Interpretation of cross-correlation coefficients
The CCF analyzes the similarity between a pair of time series when one time series is displaced against the other. One drawback of the CCF is that real-world data may suffer from autocorrelation, resulting in spurious cross-correlations. This is tackled by removing the autocorrelated component from the input series using a process called pre-whitening. Details are described in the statistics section and are elaborated on in authoritative textbooks [17]. Considering the example of daily English and non-English search interest data, these time series should ideally be contemporaneous, that is, the highest correlation should be at lag 0. A negative lag value may suggest that English interest occurs ahead of (i.e., leads) non-English interest, while a positive lag value would suggest the converse.

Lags Between Total Interest of English-Only and Multi-Language Searching Countries
Reassuringly, no lags in total interest were found between English-only and multi-language searching countries within any region or globally (Table 4).

Regional, Global, and Country-Level Lags Between English and Non-English Search Interest
It is concerning that the interest of English and non-English searching subpopulations within several countries and regions was not contemporaneous (as depicted in Figure 3 and Tables 5, 6). While multiple factors may contribute to these findings, they are likely to differ on a case-by-case basis.
Delayed communications between languages: Speculatively, delayed communications in a specific language might result in a lagged rise in search interest for the same language. For instance, the news reported in one language might lag reporting in another.
Varying impact of communications between languages: Also, the number of English versus local language media outlets (online, televised, or physical), their viewership, and their impact may vary, thus resulting in asynchronous public interest.
Intrinsic subpopulation characteristics: The baseline characteristics of individuals searching with English and non-English keywords may vary. Differences in education, socio-economic strata, access to the internet, or other factors might result in a delayed reaction to public health communications, even if communications are delivered in appropriate languages and in a timely manner.
Further research is required to identify the reason for language-related lags and develop interventions to remedy these issues. Neglecting lapses in communication may hamper pandemic control measures and put vulnerable subpopulations at a greater risk.

Limitations
This study has certain limitations that merit consideration. First, although Google (92%) and Baidu (1.3%) account for most online searches worldwide, the exclusion of data from other search engines may introduce bias [19]. Second, approximately 36% of the global population did not have access to the internet as of 2021; these individuals cannot be represented through search data. Third, country-level data may represent averages for large and heterogeneous populations. Further research at the sub-country level is indicated. Despite the limitations of search engine data, infodemiological metrics have received wide attention for assisting with public health policy and monitoring epidemics.

Conclusions
Non-English keywords contribute substantially to searches relating to COVID-19 in certain countries and regions. Numerous locations exhibit significant lags between English and non-English search interest, suggesting language-related discrepancies in the interest for COVID-19. Further research is required to address the root cause of these lags.