Segregation Predicts COVID-19 Fatalities in Less Densely Populated Counties

Aim: It is well known that social determinants of health (SDoH) have affected COVID-19 outcomes, but these determinants are broad and complex. Identifying essential determinants is a prerequisite to addressing widening health disparities during the evolving COVID-19 pandemic.
Methods: County-specific COVID-19 fatality data from California, Illinois, and New York, three US states with the highest county-level COVID-19 fatalities as of June 15, 2020, were analyzed. Twenty-three county-level SDoH, collected from County Health Rankings & Roadmaps (CHRR), were considered. A median split on the population-adjusted COVID-19 fatality rate created an indicator for high or low fatality. The decision tree method, which employs machine learning techniques, was used to analyze and visualize associations between SDoH and high COVID-19 fatality rate at the county level.
Results: Of the 23 county-level SDoH considered, population density, residential segregation (between white and non-white populations), and preventable hospitalization rates were key predictors of COVID-19 fatalities. Segregation was an important predictor of COVID-19 fatalities in counties of low population density. The model area under the curve (AUC) was 0.79, with a sensitivity of 74% and specificity of 76%.
Conclusion: Our findings, using a novel analytical lens, suggest that COVID-19 fatality is high in areas of high population density. While population density correlates with COVID-19 fatality, our study also finds that segregation predicts COVID-19 fatality in less densely populated counties. These findings have implications for COVID-19 resource planning and require appropriate attention.


Introduction
The COVID-19 pandemic is predicted to widen the health gap for minorities with pre-existing inequalities, such as Black, Hispanic, and Native American populations [1,2]. Minorities and low-income populations are vulnerable due to inadequate healthcare access, fewer opportunities to clarify misinformation (due to reduced access to high-quality information channels), and susceptibility to comorbidities [1,3].
The disproportionate consequences of COVID-19 are exacerbated by the inability of minorities and low-income families to maintain adequate social distancing, as they constitute a large share of the frontline and essential workforce and reside in densely populated homes [1]. Frontline occupations include nurses, delivery workers, and others who could not work from home. In the United States (US), 41.2% of frontline workers identify as people of color, and more than a third of frontline workers are supporting low-income families [4]. We, therefore, investigated social determinants of health (SDoH) as they are associated with disparities in COVID-19 transmission and outcomes [5].
Notably, the current literature investigating SDoH and COVID-19 together at the county level is limited, particularly in methods of analysis. Previous studies have identified county-level risk factors affecting COVID-19 susceptibility and mortality utilizing bivariate or regression analysis [6][7][8][9][10]. Only one study has used the tree-based machine-learning analytical method but focused on county-level COVID-19 incidence in a single state [11]. Our research is centered on contextualizing county-level data in COVID-19 outcomes rather than susceptibilities. Employing 23 county-level SDoH and a unique tree-based analytical lens, our study aimed to identify the key county-level social determinants of COVID-19 fatality in the midst of the first COVID wave in three US states: New York, Illinois, and California.

Study population and methods
We selected three US states, namely California, Illinois, and New York, containing counties with the highest absolute COVID-19 fatalities as of June 15, 2020 [6]. Total cumulative COVID-19 cases and fatalities were gathered from state department of health websites in June 2020 [12][13][14]. County attributes were collected at the same time from County Health Rankings & Roadmaps (CHRR), which assembles widely used county-level data from publicly available datasets [15][16][17][18]. Population density was measured in persons per square mile [19,20]. Residential segregation was studied as an index of dissimilarity between white and non-white populations on a scale of 0 (integration) to 100 (segregation) [21]. Preventable hospital stays were defined as the rate of hospital stays for conditions that can be treated in an outpatient setting, per 100,000 Medicare enrollees. This study was granted exemption by the Nova Southeastern University Institutional Review Board (IRB) for not involving human subjects per the federal regulations (IRB #2020-229). Data analysis was conducted in August and September 2020.

Statistical analysis
Twenty-three county-level predictors of interest were analyzed through summary statistics. The outcome measure was the COVID-19 fatality rate, calculated as deaths per 100,000 county population. An inclusion criterion restricted the analysis to counties with a population density between 1.5 and 50,000 persons per square mile (n=198). Bivariate associations between SDoH and the continuous COVID-19 fatality rate were assessed with Spearman correlations. A median split on the population-adjusted COVID-19 fatality rate created a high/low fatality indicator, which was used in subsequent predictive modeling with decision tree analysis.
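The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's SAS code: the county records are hypothetical toy values, the inclusion bounds are assumed to be inclusive, and counting only rates strictly above the median as "high" is likewise an assumption, since the handling of ties is not stated.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical toy records: (county, deaths, population, persons per square mile).
counties = [
    ("A", 2, 100_000, 40.0),
    ("B", 30, 200_000, 900.0),
    ("C", 0, 50_000, 1.0),       # dropped below: density under 1.5
    ("D", 12, 150_000, 300.0),
    ("E", 90, 300_000, 2500.0),
]

def fatality_rate(deaths, population):
    """COVID-19 deaths per 100,000 county population."""
    return deaths * 100_000 / population

# Inclusion criterion: density between 1.5 and 50,000 persons per square mile
# (inclusive bounds are an assumption).
included = [c for c in counties if 1.5 <= c[3] <= 50_000]
rates = [fatality_rate(deaths, pop) for _, deaths, pop, _ in included]
densities = [c[3] for c in included]

# Median split: rates strictly above the median are labeled high (assumption).
cutoff = median(rates)
high_fatality = [rate > cutoff for rate in rates]

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.
    Assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

print(cutoff, high_fatality, spearman(densities, rates))
```

In this toy data, density and fatality rate are perfectly monotonically related, so the Spearman coefficient is at its maximum; real county data would of course show weaker associations.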
The decision tree was generated using the HPSPLIT procedure in SAS (Statistical Analysis System; SAS Institute, Cary, USA) to visualize associations among SDoH and to explore the profile of counties most at risk for high COVID-19 fatality. We selected decision tree analysis for its robustness, ease of interpretation, and simplification of complex relationships seen in SDoH [22]. Tree analyses are built from known data and subsequently utilized to predict future outcomes [23]. In certain analyses, decision trees have been superior to logistic regression in predicting case outcomes, particularly outcomes that behave in a non-linear fashion [24].
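As a rough illustration of how a tree-building procedure chooses a single split, the sketch below searches one predictor for the threshold that best separates high-fatality from low-fatality counties. The splitting criterion used in the HPSPLIT run is not stated here, so Gini impurity is an assumption made purely for illustration, and the function names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of booleans (True = high fatality)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value < threshold' split.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lower, upper = pairs[i - 1][0], pairs[i][0]
        if lower == upper:
            continue  # equal values cannot be separated by a threshold
        threshold = (lower + upper) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical toy data: density (persons per square mile) and fatality class.
density = [20.0, 50.0, 120.0, 400.0, 900.0, 3000.0]
high = [False, False, False, True, True, True]
print(best_threshold(density, high))  # -> (260.0, 0.0)
```

A full tree repeats this search over every candidate predictor at each node and then recurses into the two resulting child nodes.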
The decision tree method employs machine learning techniques to determine a parsimonious model, defining profiles which best classify counties by outcome status [22]. Tree building starts at the root node, which contains all the data (county fatalities), and is partitioned recursively into child nodes until it reaches its terminal nodes. The split is based on selecting the predictor that best discriminates between high and low fatality counties [23]. Ten-fold cross-validation was employed for pruning and validation of the final tree [25]. Model accuracy was evaluated based on sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Sensitivity analyses confirmed variable selection results. In a follow-up analysis, Kruskal-Wallis tests were applied to analyze how counties in certain model-defined profiles differed from other counties based on predictors of interest.

Results
Table 1 presents summary statistics on variables of interest. The median fatality rate was 4.5 deaths per 100,000 county population. Spearman correlations identified seven county-level SDoH exhibiting a moderate association (r>0.3) with the continuous COVID-19 fatality rate (Table 2). Upon further analysis, the decision tree model identified population density, residential segregation, and preventable hospitalizations to be key predictors of counties with high COVID-19 fatality rates (Figure 1). The tree had four terminal nodes reflecting county profiles, color-coded as low COVID-19 fatality (blue) or high COVID-19 fatality (pink). The model AUC was 0.79, with 74% sensitivity, 76% specificity, and a 25% misclassification rate. The average sensitivity, specificity, and misclassification rates among cross-validation subsamples were 64%, 54%, and 40%, respectively.
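For readers less familiar with these accuracy measures, the sketch below shows how sensitivity, specificity, and the misclassification rate follow from actual and predicted classes. The eight labels are hypothetical toy values, not the study's data, and the function name is illustrative.

```python
def classification_metrics(actual, predicted):
    """Compute sensitivity, specificity, and misclassification rate.

    Both arguments are lists of booleans where True means high fatality.
    Assumes each class occurs at least once in `actual`.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)              # high, predicted high
    tn = sum((not a) and (not p) for a, p in pairs)  # low, predicted low
    fp = sum((not a) and p for a, p in pairs)        # low, predicted high
    fn = sum(a and (not p) for a, p in pairs)        # high, predicted low
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "misclassification": (fp + fn) / len(pairs),
    }

# Hypothetical toy example: four high-fatality and four low-fatality counties.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]
print(classification_metrics(actual, predicted))
```

Here one high-fatality county and one low-fatality county are misclassified, giving a sensitivity and specificity of 0.75 each and a misclassification rate of 0.25.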
[Table fragment: Income inequality (ratio of 80th percentile income to 20th percentile income); Limited access to healthy foods (% of low-income population who do not live close to a grocery store); 5 (4), 5 (5), 5.5 (4), 4 (3)]
Of note, among counties with population density <363.220 persons per square mile, those with residential segregation <26.180 were classified as low fatality. Interestingly, among counties with population density <363.220 persons per square mile and segregation ≥26.180, those with a lower preventable hospitalization rate per 100,000 Medicare enrollees (<4957.38) were classified as having a high COVID-19 fatality rate, as seen in terminal node five (Figure 1). In a follow-up analysis, counties in node five exhibited significantly higher residential segregation, a higher elderly percentage, a lower severe housing issue percentage, and a lower preventable hospitalization rate compared to other counties analyzed (p<0.05; see Table 3). Counties in node six, with the profile of population density <363.220 persons per square mile, residential segregation ≥26.180, and preventable hospitalizations ≥4957.380 per 100,000 Medicare enrollees, were predicted to have lower fatality rates (Figure 1).
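The four county profiles can be written as a short set of rules. The sketch below uses the thresholds reported above; the label for counties at or above the density threshold is not stated explicitly in this passage and is assumed to be high fatality, in line with the finding that fatality is high in densely populated areas. The function and constant names are hypothetical.

```python
DENSITY_CUT = 363.220      # persons per square mile
SEGREGATION_CUT = 26.180   # dissimilarity index, 0-100
PREVENTABLE_CUT = 4957.38  # hospital stays per 100,000 Medicare enrollees

def predict_fatality_class(density, segregation, preventable_stays):
    """Return 'high' or 'low' following the reported tree profiles."""
    if density >= DENSITY_CUT:
        # Assumption: the dense-county leaf is labeled high fatality.
        return "high"
    if segregation < SEGREGATION_CUT:
        return "low"
    if preventable_stays < PREVENTABLE_CUT:
        return "high"  # terminal node five
    return "low"       # terminal node six

# Hypothetical counties: (density, segregation, preventable stays).
print(predict_fatality_class(1200.0, 40.0, 5200.0))  # dense county
print(predict_fatality_class(80.0, 15.0, 4000.0))    # sparse, integrated
print(predict_fatality_class(80.0, 35.0, 4000.0))    # node five profile
print(predict_fatality_class(80.0, 35.0, 5500.0))    # node six profile
```

Writing the tree this way makes clear that segregation and preventable hospital stays only enter the prediction for counties below the density threshold.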

Discussion
This study sought to use the innovative decision tree model approach to identify relevant SDoH affecting COVID-19 fatalities in the US. While earlier studies have identified SDoH to be influential in COVID-19 related mortality, the key finding of our study is that even in counties of low population density, higher levels of segregation are substantially associated with high county-level COVID-19 related deaths.
Consistent with the Spearman correlations tested, population density and residential segregation remained important in the decision tree modeling. Population density was identified as the most important county-level predictor of COVID-19 fatalities. This can be explained by denser areas having greater transmission rates, augmenting fatal outcomes. Despite population density being the most important factor, counties of low population density still exhibited relatively high COVID-19 fatality if there was a high degree of residential segregation. Residential segregation was measured in this study by the dissimilarity index, the most widely used measure of evenness comparing spatial distributions of different groups [26]. Measured as an index (0-100), it describes the percentage of white or non-white populations that would have to move to match the population distribution of the metropolitan (larger) area [21]. This study's data-driven decision tree approach determined the segregation index value of 26.180 to be the optimal threshold for predicting high county-level COVID-19 mortality rates. To our knowledge, there is no national standard that would otherwise define the parameters for "high" segregation.
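The dissimilarity index has a simple closed form: half the sum, over the sub-areas of a county, of the absolute difference between each sub-area's share of the white population and its share of the non-white population, scaled to 0-100. The sketch below is a minimal illustration of that standard formula; the use of census tracts as sub-areas and the toy counts are assumptions, not details taken from the study data.

```python
def dissimilarity_index(white_counts, nonwhite_counts):
    """Index of dissimilarity on a 0-100 scale.

    Each list holds one population count per sub-area (for example, per
    census tract; the choice of sub-area is an assumption here).
    """
    total_white = sum(white_counts)
    total_nonwhite = sum(nonwhite_counts)
    if total_white == 0 or total_nonwhite == 0:
        raise ValueError("both groups must have a non-zero total population")
    return 50 * sum(
        abs(w / total_white - n / total_nonwhite)
        for w, n in zip(white_counts, nonwhite_counts)
    )

# Hypothetical two-tract counties.
print(dissimilarity_index([100, 0], [0, 100]))            # complete segregation
print(dissimilarity_index([50, 50], [10, 10]))            # even distribution
print(round(dissimilarity_index([80, 20], [40, 60]), 1))  # intermediate case
```

In the intermediate case, 40% of either group would have to move between tracts for both groups to be distributed evenly, which is above the 26.180 threshold identified by the tree.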
Residential segregation was the second most important county-level predictor, introducing the second split in our decision tree. Although legally banned since 1968, racial residential segregation has intact structures that continue to cause health disparities today [27]. Segregation affects other SDoH, such as socioeconomic status (SES) and poverty, resulting in poorer health outcomes for minorities. Segregation impairs SES by diminishing access to and resources for high-quality education and by concentrating higher-paying jobs in areas outside of minority communities [27]. By creating areas of concentrated poverty, segregation has exposed the health of disenfranchised populations to harms such as pollution, poor-quality infrastructure, and psychosocial stressors [28]. These harms result in disparities in income, life expectancy, and other SDoH that contribute to poor health outcomes [29]. Therefore, because elements of segregation are pervasive in driving social determinants of health, it is important to study the area of residence when investigating the mechanism of disease onset and progression. Previous studies have shown a positive association between segregation and county-level COVID-19 infection rates [8,10,30]. Our results show that segregation also predicts high COVID-19 fatality rates at the county level. Thus, it is essential to address segregation at the local level in addition to state and national interventions.
The potential relationship between population density and segregation may explain segregation's influence on COVID-19 outcomes specifically in less densely populated counties. In urban areas, population density has an inverse relationship with segregation. Anti-density zoning has restricted private property rights and kept population density low in targeted neighborhoods, increasing residential segregation [31]. Conversely, segregation declines with urban population growth, diminishing segregation's influence on health outcomes in populated areas [32]. The observed negative relationship between population density and segregation in urban areas calls for investigation of possible similar patterns in less dense areas.
Preventable hospital stays are defined as the rate of hospital stays for conditions that can be treated in an outpatient setting, per 100,000 Medicare enrollees. This rate is an indicator of unsatisfactory outpatient care or a pattern of excessively seeking urgent/emergency care [33]. Interestingly, our findings suggest that a lower rate of preventable hospital stays can predict COVID-19 fatalities in less dense but highly segregated counties. This finding is counterintuitive, as it would mean that counties with higher COVID-19 fatalities are less likely to have Medicare enrollees excessively using urgent/emergency care for less urgent issues. This finding can potentially be explained by racial disparities, as this group of counties contained a significantly higher proportion of minorities who experience disproportionate barriers to accessing healthcare. Minorities have disproportionately faced higher COVID-19 mortality, which is related to their heightened exposure risk, illness severity, and barriers to testing [34]. Further analysis also showed that this group had a significantly greater housing cost burden, which is also recognized as a barrier to healthcare access [35,36]. Further investigation is needed into other factors related to preventable hospital stays that could explain this relationship.

Limitations
The study had the following limitations. First, we used data from three states (New York, Illinois, and California) chosen based on the highest county-level COVID-19 fatalities as of June 15, 2020. This choice situates the analysis in the midst of the first COVID-19 wave in the three states that had the highest cumulative county-level COVID-19 fatalities up to June 15. However, the analysis did not adjust for COVID-19 testing capacities for each state. Fatality data were also recorded differently per state (confirmed versus presumed deaths) and may not account for Americans who died of COVID-19 before being tested. Although this is a limitation, it applies to other major national datasets, making our data the best available at the time. County attribute data are restricted to the latest published appropriate year. While AUC and model validation results exhibit room for improvement using tree modeling, these results may be influenced by the limited sample size of the present analysis. Our limited sample size also explains our population density inclusion criterion, as we did not have adequate representation of counties with fewer than 1.5 or more than 50,000 persons per square mile.

Conclusions
To our knowledge, this is the first study to employ a novel decision tree method, which utilized machine-learning techniques, to study associations between SDoH and COVID-19 fatalities. The findings support the influence of SDoH on health outcomes, including COVID-19 outcomes, and illustrate the value of decision tree analysis in this context. Our key finding was that pockets of segregation exist among less densely populated counties and may predispose their residents to disproportionate COVID-19 outcomes. These areas should be targeted for county-level attention and intervention on multiple levels, including resource planning and allocation. Suggested interventions include increasing educational and employment opportunities as well as community financial resources in the counties identified to be at high risk of COVID-19 fatality. Further research should be directed towards examining the current infrastructures that allow segregation to continue and towards identifying effective interventions.

Additional Information
Disclosures
Human subjects: Consent was obtained or waived by all participants in this study. Nova Southeastern University's Institutional Review Board issued approval 2020-229. Based on the information provided, your protocol does not require IRB review or approval because its procedures do not fall within the IRB's jurisdiction based on 45 CFR 46.102. Therefore, your protocol has been classified as "Research outside the purview of the IRB" for IRB purposes; your study may still be classified as "research" for academic purposes or for other regulations, such as regulations pertaining to educational records (FERPA) and/or protected health information (HIPAA). This protocol does not involve "human subjects research" for one of the following reasons: (a) The study does not meet the definition of "research", as per federal regulations: "research" means a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge. (b) The study does not involve "human subjects," per federal regulations. "Human subject" means a living individual about whom an investigator conducting research obtains: (1) Data through intervention or interaction with the individual, or (2) Identifiable private information. (c) Other: Please retain a copy of this memorandum for your records as it indicates that this submission was reviewed by Nova Southeastern University's Institutional Review Board. The NSU IRB is in compliance with the requirements for the protection of human subjects prescribed by Part 46 of Title 45 of the Code of Federal Regulations (45 CFR 46) revised June 18, 1991.
Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following:
Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work.
Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work.
Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.