Background
Neighborhoods are important for health [1–4]. In fact, the neighborhood environment has been linked to multiple health outcomes including sleep, mental health, cardiovascular risk, and mortality [5–9]. Certain features (e.g. sidewalks) may directly encourage active transportation and physical activity [10–15] and others (e.g. street lighting, noise) may impact sleep [16–18], which, in turn, may influence chronic diseases [19, 20]. Residents of low-income and racially/ethnically segregated neighborhoods share a disproportionate burden of chronic disease [21], as well as limited access to resources, which could contribute to poor health [22–24]. Improving the neighborhood environment holds promise for addressing health-related behaviors associated with chronic disease and mortality [25].
Micro or granular features of the neighborhood (e.g. street lighting) may affect residents’ experiences more directly than macro-level features (e.g. residential density), thus providing stronger links with health behaviors [26–28]. Also, micro-level features are more easily modified than macro-level features. For example, it takes less time and money to repair a sidewalk than to change the land-use mix of a community. While there are multiple approaches for collecting detailed assessments of micro-features of neighborhoods [29–34], direct observation using audit tools is the preferred approach because it allows for systematic observation of detailed or granular features [27]. Google Street View (GSV) has been increasingly used to observe the built environment and provides a cheaper alternative to direct observation (Clarke et al., 2010; Taylor et al., 2011) [35, 36]. While GSV has demonstrated reliability when assessing certain features of the environment (including types of land use, slope, cycling lanes, or gathering places), it has certain limitations. Its reliability was not as high when considering detailed features, such as the presence of litter or vacant dwellings, and when making qualitative observations such as the quality of sidewalks or housing (Clarke et al., 2010) [35]. Also, GSV imagery is not available for every street in the U.S. and is updated irregularly (Clarke et al., 2010) [35]. Mixed findings regarding the relationship between micro features of the environment and health outcomes could be due to differences in measurement approaches across studies. Growing interest in the local environment for public policy has led to increased emphasis on the rigorous development, implementation and validation of audit tools for direct observation.
In a comprehensive review, Brownson et al. (2009) described multiple audit tools for direct observation of the physical environment [27]. These tools shared some common content including one or more measures of: land use (e.g., presence and type of housing); streets and traffic (e.g., traffic volume); sidewalks; bicycling facilities; public space/amenities (e.g., presence of benches); architecture or building characteristics (e.g., building height); parking and driveways (e.g., parking garage); maintenance (e.g., litter); and indicators of safety (e.g., graffiti). Features assessed less consistently include noise levels and health promotion supports (e.g., billboards promoting physical activity) [27]. Existing audit tools have been used for one-time examinations of the neighborhood environment. As designs expand to better understand causation and predictors of change, there is a need to test whether audit tools are adequate for longitudinal assessment.
The Pittsburgh Hill/Homewood Research on Neighborhood Change and Health (PHRESH) study leverages a natural experiment design, comparing an intervention and a control neighborhood, to evaluate whether neighborhood improvements benefit residents’ health [8, 24, 37]. Between 2011 and 2018, the intervention neighborhood received about $200 million, while the comparison neighborhood received approximately $48 million, in publicly funded investments. Efforts involved physical infrastructure modification (i.e., street lengths, street names, traffic patterns) and construction of streets, housing and landscaping. To systematically document change, we conducted multiple direct observations of the neighborhood environment over a 5-year period with an emphasis on features that may impact physical activity or sleep.
Of the existing audit tools, four were comparable to ours with respect to detail, content and data collection approach: Systematic Pedestrian and Cycling Environmental Scan (SPACES) [38]; St. Louis Analytic Audit Tool and Checklist (SLU) [39, 40]; Systematic Social Observation protocol [29]; and Pedestrian Environment Data Scan (PEDS) [41]. Two of these studies reported that 70% of items had kappa statistics [42] above .40, one reported average reliability of .87, while the fourth study reported high inter-observer agreement of 75% or greater [27]. Longitudinal studies may encounter pitfalls if these audit tools are not reliable over time. Mismeasurement can obscure meaningful differences, while systematic bias can produce spurious findings. In this paper, we describe the implementation methods, lessons learned, and stability of reliability estimates from PHRESH longitudinal assessments of the neighborhood environment at three time points over a five-year period. Our findings can help inform future studies of changes in the built and social environment.
Results
KA or PO statistics, with color-coding to indicate level of agreement, are displayed in Fig. 4. For most items, we report KA; where items are very common or rare, we report PO. In 2012, 93.8% of items had excellent (62.5%) or good (31.3%) agreement. In 2015, 91.3% of items had excellent (83.8%) or good (7.5%) agreement. In 2017, 83.5% of items had excellent (55.7%) or good (27.8%) agreement. When assessing stability across waves, 81.4% (79 out of 97) of items had good to excellent agreement at every timepoint, making them sufficiently reliable to detect change. Prevalence statistics for individual items are shown in supplemental Table 1.
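As an illustration of the two kinds of agreement statistic reported here, the sketch below computes a proportion of observed agreement and a chance-corrected kappa for one item rated by two raters. It assumes, for illustration only, that KA is a Cohen's kappa-type statistic and that PO is the simple share of matching ratings; the function names and the ratings are hypothetical and are not taken from the PHRESH data.

```python
from collections import Counter


def percent_observed_agreement(r1, r2):
    """Share of segments on which the two raters gave the same rating."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(r1)
    p_o = percent_observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal category frequencies.
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one category only
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings (1 = feature present, 0 = absent) for a common item.
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

# Hypothetical ratings for a rare item: one rater endorses it once.
rare_a = [0] * 19 + [1]
rare_b = [0] * 20

print(percent_observed_agreement(rater_a, rater_b), cohens_kappa(rater_a, rater_b))
print(percent_observed_agreement(rare_a, rare_b), cohens_kappa(rare_a, rare_b))
```

In the rare-item example the raters agree on almost every segment, yet kappa is zero because nearly all of that agreement is expected by chance. This is one reason a proportion of observed agreement may be more informative than kappa for very common or very rare items.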
Twelve of 14 Land use mix items had good to excellent agreement while two items (public/communal spaces, other land use) had poor agreement at all waves. Five out of 6 Environment items had good to excellent agreement across waves, while one item (“do trees shade sidewalk?”) had poor agreement at one of the three waves. Inspection of the individual raters’ responses suggests that raters had difficulty choosing between “some” and “many” as a response. For all 8 items in the PA facility category, there was uniformly excellent agreement at each wave.
There were 20 items in the Walking/Cycling environment category. Within the sub-category “Intersection and Crossing”, which included four items (traffic light, pedestrian signal at traffic light, stop sign, marked crosswalk), all had good to excellent agreement at every wave. Of the 8 items in the sub-category “Street features”, four showed good to excellent agreement at every wave. Another three items (“street and sidewalk buffer”, “continuous sidewalk”, “sidewalk continuous at both ends between segments”) showed poor agreement at one of the waves, while a fourth item (“curb cuts or ramps missing at crossing points”) exhibited consistently poor agreement at every wave. The four items in the sub-category “Traffic features” (“traffic circle/roundabout”, “speed hump/table”, “median with traffic island”, “curb extension/bulb-out”) and the two cycling environment items demonstrated good to excellent agreement at every wave. The other two items in Walking/Cycling environment (street type, number of traffic lanes) showed poor agreement at either one or two of the timepoints.
There were five items in the Safety signs category; all were reliably assessed at every wave. Twelve of the 16 Amenities and litter items had good to excellent agreement at every wave. Two items (“art or monument”, “garden bed/planter”) showed poor agreement at one of the three waves, while a third item (“amount of trash/litter on street”) showed low agreement at every wave. Of the two more general assessments made by raters (“perceived safety”, “attractiveness of segment for walking”), only one (“perceived safety”) had poor agreement in one wave. Also, PO was excellent for 7 of the 8 items in the Physical activity facility category, and poor for 1 item (“other gathering place”) at two of the three waves.
For 17 items in three categories, we cannot assess agreement at multiple time points because they were only measured in 2017. A single, ordinal item in Noise pollution (with 4 response categories: “no”, “a little”, “some” or “a lot of pollution”) demonstrated good agreement. Seven of the 8 Social disorder items had excellent agreement (PO statistic > 90%) while one item (“adults loitering, congregating, or hanging out”) had poor agreement (PO < 75%). Three of the 8 Physical disorder items (“discarded cigarette butts”, “garbage, litter, broken glass”, “buildings with broken windows”) had low agreement while the other five had good or excellent agreement.
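The stability criterion applied above (good to excellent agreement at every wave) can be sketched as a simple filter over per-wave agreement levels. The helper name `stable_items` is hypothetical, and the per-wave levels below are illustrative only: the text reports which items were poor at one or all waves, but not always in which wave.

```python
def stable_items(levels_by_item, acceptable=("good", "excellent")):
    """Return the items whose agreement level is acceptable at every wave."""
    return [
        item
        for item, levels in levels_by_item.items()
        if levels and all(level in acceptable for level in levels.values())
    ]


# Illustrative agreement levels per wave (not the study's actual values).
levels = {
    "stop sign": {2012: "excellent", 2015: "excellent", 2017: "good"},
    "amount of trash/litter on street": {2012: "poor", 2015: "poor", 2017: "poor"},
    "art or monument": {2012: "good", 2015: "poor", 2017: "good"},
}

stable = stable_items(levels)
share_stable = len(stable) / len(levels)
print(stable, round(100 * share_stable, 1))
```

Applied to the counts reported in the text, the same ratio gives 79 / 97, or 81.4% of items.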
Discussion
PHRESH is an ongoing study of two low-income and predominantly African American urban communities in Pittsburgh, PA. To assess whether neighborhood-level changes affect residents’ health and well-being, including diet, exercise, sleep, and heart and cognitive health, we conducted three assessments of the physical and social environment in the two neighborhoods over a period of five years (2012–2017). The purpose of the parent study is to identify correlates of obesogenic behaviors such as physical activity, sleep, and heart health, and the extent to which neighborhood-level changes affect them. In this paper, we have described our implementation methods, lessons learned, and results from repeated reliability testing of the audit tool (comprising a standard set of items) to understand whether its reliability is stable enough over time to detect change in the environment over a period of five years. These are offered to inform the design and interpretation of future longitudinal studies of the physical and social environment.
Representative sampling was a critical step. Previous work had demonstrated that a 25% sample of residential street segments produced valid estimates of the built environment [54]. When assessing neighborhood-level change, one difficulty is that these changes can modify the underlying street network. Our experience suggests that secondary sources of data may include non-negligible errors, potentially due to delays in updating secondary databases. Whenever feasible (e.g. in a compact environment), we recommend careful verification of available listings of neighborhood street segments to ensure high accuracy. It is also necessary to update the street network at each assessment wave to capture the degree of change in the street network. To reflect actual changes in the street network, we carefully identified and sampled new street segments at each wave. When sampling new segments, systematic rules are needed. For instance, when an entire street segment was demolished, should the replacement come from the same geographic area or be sampled entirely at random? Should a newly bisected street count as two new streets, or as the same street segment from a prior wave? A changing street network meant that segment-level panel analysis was difficult; instead, it was more reasonable to identify a stable unit of analysis (e.g. a residential buffer for each study participant) to assess change.
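One possible way to make such sampling rules explicit is sketched below. It is not the procedure used in PHRESH: it assumes a simple random 25% draw from a verified list of segment identifiers and, at a later wave, retains previously sampled segments that still exist while drawing the same fraction from segments that are new to the network. The identifiers, function names, and the rule itself are hypothetical.

```python
import math
import random


def sample_segments(segment_ids, fraction=0.25, seed=0):
    """Draw a reproducible simple random sample of street segments."""
    rng = random.Random(seed)
    ids = sorted(segment_ids)  # fixed order so the seed gives the same draw
    k = math.ceil(fraction * len(ids))
    return set(rng.sample(ids, k))


def update_sample(previous_sample, previous_network, current_network,
                  fraction=0.25, seed=0):
    """Retain surviving sampled segments and sample from newly built ones."""
    retained = previous_sample & current_network
    new_segments = current_network - previous_network
    added = sample_segments(new_segments, fraction, seed)
    return retained | added


# Hypothetical street networks at two waves.
network_2012 = {f"S{i:03d}" for i in range(1, 41)}          # 40 segments
demolished = {"S001", "S002", "S003", "S004"}
new_segments = {f"N{i:03d}" for i in range(1, 9)}           # 8 new segments
network_2015 = (network_2012 - demolished) | new_segments

sample_2012 = sample_segments(network_2012)
sample_2015 = update_sample(sample_2012, network_2012, network_2015)
print(len(sample_2012), len(sample_2015))
```

Writing the rule down as code forces a decision on the questions raised above, such as how a bisected segment is identified (here it would simply appear as new identifiers), and makes the draw reproducible across waves.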
We integrated a community engaged research framework to ensure the longevity and acceptance of PHRESH within the study communities [43]. Our data collectors were recruited from the community, and some of them were retained across waves. However, we were not able to assess with our data whether recruiting from the community or retaining data collectors affected the ratings. Nevertheless, thorough and consistent training of data collectors at each wave was a central feature of this work. Training at each wave employed the same methods and trainer to avoid systematic biases in ratings across waves. During training, it was important to balance classroom learning with ‘live’ practice. In the classroom, the use of visuals (e.g. photographs) worked well. Field practice focused on individual sections of the audit tool and presented a variety of observations. We budgeted extra time to allow data collectors to discuss questions/situations with the trainer. Thus, the training schedule needed to be flexible to allow extra time for hard-to-assess items. Furthermore, we found field practice to be the most valuable part of training. When recruiting data collectors, attention to detail was an important individual trait.
Assessment of (inter-rater) reliability of individual SSA items, using a sub-sample of segments, helped identify items that performed well at a single timepoint and across time. A majority of SSA items (81%) had high reliability. Low agreement indicated items that were difficult to rate objectively or with a single observation. For example, “amount of litter” or “adults loitering, congregating or hanging out” may vary even over a short window of time (e.g. a few hours or a day). In the case of trash, we re-assessed agreement for a small subset of street segments in the reliability study where two observations were conducted within hours of each other. However, the agreement for trash or litter did not improve. Items with substantial temporal variation may require multiple ratings (> 2) to accurately capture the average or mean rating. Certain items (e.g. perceived safety) were inherently subject to rater interpretation and, as expected, demonstrated lower agreement. A few neighborhood features were not easily visible across an entire street segment (e.g. bar on a single window; cigarette butts on the ground; garden bed/planter), or were difficult to assess from the outside (e.g. public/communal space, vacant building), as the audit protocol required.
Given these study findings, we can suggest the types of items that may be able to capture change. Consistent with previous research, more subjective measures are less reliable than more objective (observable) ones [41]; dichotomous ratings have higher reliability than ordinal response scales (although a greater number of response categories may be valuable for providing finer distinctions). Large, visible items (e.g. buildings, traffic signs) were consistently reliable. While sidewalks are an important feature of the walking environment, sidewalk conditions may change quickly over a city block, making it challenging to rate consistently. Also, rare/low prevalence features (see supplemental Table 1) did not lend themselves well to KA testing. For example, the only gathering places in these neighborhoods with prevalence above 5% were churches. If low prevalence items were readily identified, the PO statistic showed consistency in endorsing their absence.
While some features of the environment may change, others are time invariant. Yet, when we compared slope (“flat”, “slight hill”, “steep hill”) across years for a sub-group of street segments with three years of complete data, 22% of the segments had different values although slope is unlikely to change. Also, 10% of street segments were endorsed as having art/monument in 2012, while only 2% of segments had art/monument three years later (2015), which may point to confusion over what constitutes art. Therefore, we recommend the use of SSA items with consistently good to excellent agreement across repeat assessments to detect real change. Future studies may be able to further improve the measurement of less reliable items through detailed and intensive training or procedures (e.g., mapping out a visual area into a grid to more systematically inspect for broken windows), clearer rules and examples for determining whether something is a communal space, or the addition of a “cannot determine” category to the form. Even subjective ratings may be improved if anchored through training or explicit item instructions (e.g. 1 = a place where you would not feel physically at risk of violence from another person if walking alone in daylight, etc.), and by use of multiple raters to reduce individual rater idiosyncrasies.
To our knowledge, this article is the first to conduct repeated assessments of the built and social environment to assess change. We found the PHRESH study’s SSA tool to be reliable, practical to implement (an average of 13 min per street segment), and easy for data collectors to use. The audit tool provided rich and detailed data on environmental features, and change over time, which is important for the exploration of cross-sectional and longitudinal relationships between neighborhood features and health outcomes. The compact nature of our study neighborhoods suggests a need to test this audit tool in neighborhoods with greater variation, as certain items exhibited low or zero prevalence in the study neighborhoods. Future research might evaluate reliability separately for each neighborhood when comparing change across neighborhoods in a natural experiment or intervention study. Our reliability sub-sample was only large enough to assess overall reliability by pooling the sample across neighborhoods. Future study designs can consider sample allocation so that the two neighborhoods (with and without intervention) are assessed with equal reliability. Also, additional steps are necessary to develop and validate summary measures or indices that capture meaningful constructs (e.g. walkability, incivilities) that may be predictors of health outcomes. If valid indices of environmental features can be derived, they will be useful in guiding public policy and urban planning in the redesign of built environments to promote health.