Variation in Clinical Target Volumes for Post-prostatectomy Patients and Effect on Normal Tissue Complication Probability

Background: Modern radiotherapy requires accurate contouring which may suffer in the post-surgical setting. We estimated post-prostatectomy inter- and intra-rater contouring reliability and assessed the effect on bladder and rectal normal tissue complication probability (NTCP). Methods: Four physicians each contoured two different treatment plans, separated by at least seven days, on 15 patients receiving post-prostatectomy four-field 3D-conformal radiotherapy. The Pinnacle 8.0 system determined CTV volume, shape, and center-of-volume coordinates. Inter- and intra-rater reliability was estimated using Gilder’s method. NTCPs were estimated using parameters TD50=8190 cGy, n=0.23, m=0.19 for rectum and TD50=8000 cGy, n=0.5, m=0.11 for bladder. Results: Reliability estimates for center-of-volume were ≥0.993. For shape and volume, inter-rater reliability was ≤0.290 and intra-rater reliability ranged from 0.375 to 0.729. Inter-rater reliability estimates of NTCP were 0.398 for bladder and 0.0936 for rectum with highest inter-rater variation 4% and 8%, respectively. Intra-rater reliability NTCP estimates were 0.650 for bladder and 0.186 for rectum, with highest intra-rater NTCP variation 3% and 7%, respectively. Conclusions: Center-of-volume coordinates showed excellent agreement while volume and shape showed poor inter-rater, but moderate intra-rater, agreement. NTCP estimates showed generally poor agreement, but these differences were clinically significant only for rectum (not bladder), based on an a priori definition.


Introduction
Successful radiation therapy (RT) in the era of three-dimensional conformal RT (3D-CRT) and intensity-modulated RT (IMRT) requires physicians to accurately delineate treatment targets while simultaneously avoiding normal tissue. Previous studies have described differences among multiple contours by different physicians (inter-rater variation) and by the same physician (intra-rater variation) [1][2][3][4] and whether these differences affect clinical outcomes. These studies, however, focused on treating in situ organs rather than 'areas at risk' after resection of the cancerous structure.
Adjuvant or salvage RT following radical prostatectomy is commonly prescribed for prostate cancer to limit local recurrence and improve disease-free and overall survival [5][6][7][8][9][10][11]. In this setting, prescribing physicians cannot base their clinical target volume (CTV) on the anatomical borders of a defined structure, but must rely on experience and published contouring atlases to determine regions at risk for microscopic disease, possibly increasing variation in treatment volume delineation [12][13][14][15]. Several atlases have been published, but in the critical comprehensive review of post-prostatectomy guidelines and atlases, Smith and Rodrigues [16] concluded that the clinical impact and reproducibility have not been clearly assessed.
We are aware of only one study that investigated differences in CTV definition for post-prostatectomy patients and their effect on patient outcomes [17]. This study reported "significant uncertainty" in post-prostatectomy rectum contouring. However, the mean normal tissue complication probability (NTCP) was 2.8% with a standard deviation of 0.6%, so these differences may not be clinically important. In this study, we investigated both inter- and intra-rater CTV differences for post-prostatectomy patients and the potential clinical implications of these differences via propagated NTCP for both rectum and bladder.

Materials and Methods
During each of two separate contouring sessions, with a minimum of seven days between sessions, four radiation oncologists who specialize in prostate RT each contoured the bladder, rectum, and CTV (prostate and seminal vesicle beds) on planning scans of each of 15 patients treated with post-prostatectomy RT between June and October, 2007, using Pinnacle version 8.0d (Philips Medical Systems, Milpitas, CA). Physicians were not provided with post-prostatectomy pathologic findings. In order to capture the true variability of contouring among clinicians, neither guidelines nor trial-specific educational interventions for contouring were provided. Physicians were, however, allowed access to any available literature or educational opportunities, such as conferences or contouring workshops, but could not discuss with their colleagues which resources they used. Our Institutional Research Ethics Board provided ethics approval. The CTV was expanded geometrically by 1 cm to create a planning target volume (PTV). A dose of 66 Gy was prescribed to the isocenter using 3D-CRT techniques ("four-field box") with a minimum of 95% isodose coverage of the PTV. A unique plan was generated for each contour provided by each participating physician (four physicians, two sessions, 15 patients), resulting in 120 unique RT plans to be compared.

Key variables
The volume of the contoured CTV was calculated directly by the Pinnacle system. The coordinates of the CTV center, along three spatial axes (lateral, anterior-posterior, and superior-inferior), were recorded from Pinnacle. Finally, the shape of the CTV was approximated by subtracting the coordinate of the center-of-volume from the coordinate of the extreme point of the CTV along each axis. Figure 1 illustrates these methods.
FIGURE 1: Methods for measuring shape and center of volume coordinates of the CTV
a) Shape: Calculate the largest extent of the CTV in each of the three orthogonal axes. In this example, contours A and B would measure similarly in the X direction, but very differently in the Y direction. Our study included measurements in the Z direction as well. b) Center of volume: Measured by Pinnacle; the coordinates of the CTV centers were compared in each of the three axes.
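The shape measurement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `shape_extents`, the point-list representation of a contour, and the toy coordinates are hypothetical and are not taken from Pinnacle or from the study data. The sketch reports both the negative and the positive extreme along each axis, since the text does not say which extreme was recorded.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]


def shape_extents(points: Iterable[Point], center: Point) -> Dict[str, Tuple[float, float]]:
    """For each axis, subtract the center-of-volume coordinate from the
    coordinates of the two extreme contour points along that axis.

    Returns {"x": (low, high), "y": (low, high), "z": (low, high)}.
    """
    pts = list(points)
    extents = {}
    for i, axis in enumerate(("x", "y", "z")):
        coords = [p[i] for p in pts]
        extents[axis] = (min(coords) - center[i], max(coords) - center[i])
    return extents


# Hypothetical contours (cm): similar in X, very different in Y, as in Figure 1a.
contour_a = [(-2.0, -1.0, -1.0), (2.0, 1.0, 1.0)]
contour_b = [(-2.0, -3.0, -1.0), (2.0, 3.0, 1.0)]
center = (0.0, 0.0, 0.0)  # assumed center of volume for both toy contours

ext_a = shape_extents(contour_a, center)
ext_b = shape_extents(contour_b, center)
print(ext_a["x"], ext_b["x"])
print(ext_a["y"], ext_b["y"])
```

Keeping the result per axis, rather than collapsing it into one number, preserves the direction in which two contours differ.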
We then calculated the differences between the various contours. We compared the physicians' first contouring sessions with each other and did the same for the physicians' second contouring sessions. Descriptive statistics were calculated for each variable.
The Pinnacle system calculated NTCP for the rectum and bladder. For the rectal NTCP, we employed Lyman's model as implemented in Pinnacle using the following parameters: 8190 cGy as the tolerance dose for 50% chance of complications (TD50), volume factor (n) of 0.23, and slope factor (m) of 0.19. For the bladder, we used Emami's data [18] where TD50=8000 cGy with n=0.5 and m=0.11.
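For readers unfamiliar with Lyman's model, the following is a minimal sketch of how an NTCP can be computed from a dose-volume histogram with the parameters listed above. It assumes the common Lyman-Kutcher-Burman formulation, in which a non-uniform dose distribution is first reduced to a generalized equivalent uniform dose; the text does not state which histogram reduction Pinnacle applies, so that step, the function names, and the toy histograms are assumptions made for illustration only.

```python
import math


def generalized_eud(dvh, n):
    """Reduce a differential DVH to a generalized equivalent uniform dose.

    dvh: list of (dose_cGy, volume) bins; volumes are normalized here.
    n:   volume factor of the Lyman model.
    """
    total = sum(v for _, v in dvh)
    return sum((v / total) * d ** (1.0 / n) for d, v in dvh) ** n


def lyman_ntcp(dvh, td50, n, m):
    """NTCP = Phi((gEUD - TD50) / (m * TD50)), Phi = standard normal CDF."""
    t = (generalized_eud(dvh, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# Parameters quoted in the text.
RECTUM = {"td50": 8190.0, "n": 0.23, "m": 0.19}
BLADDER = {"td50": 8000.0, "n": 0.5, "m": 0.11}

# Hypothetical rectal DVHs (dose in cGy, fractional volume), illustration only.
dvh_small_overlap = [(6600.0, 0.10), (3300.0, 0.40), (500.0, 0.50)]
dvh_large_overlap = [(6600.0, 0.30), (3300.0, 0.40), (500.0, 0.30)]

ntcp_small = lyman_ntcp(dvh_small_overlap, **RECTUM)
ntcp_large = lyman_ntcp(dvh_large_overlap, **RECTUM)
print(f"rectal NTCP, small overlap: {ntcp_small:.3f}")
print(f"rectal NTCP, large overlap: {ntcp_large:.3f}")
```

In this formulation a larger high-dose fraction raises the equivalent uniform dose and therefore the NTCP, which is how a change in contour can propagate to predicted toxicity.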
A priori, we defined a clinically significant outcome as a difference of 5% or more between the highest and lowest measured NTCP within a trial. Δinter and Δintra represent the difference between the highest and lowest recorded NTCP for a particular patient in an inter- and intra-rater trial, respectively.
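The a priori definition above amounts to a simple range check per patient. The sketch below illustrates it; the function names and the NTCP values are hypothetical and are not study data. NTCPs are expressed in percentage points.

```python
def ntcp_delta(ntcps_pct):
    """Difference between the highest and lowest NTCP (in %) for one patient."""
    return max(ntcps_pct) - min(ntcps_pct)


def is_clinically_significant(ntcps_pct, threshold_pct=5.0):
    """Apply the a priori rule: a range of 5% or more is clinically significant."""
    return ntcp_delta(ntcps_pct) >= threshold_pct


# Hypothetical inter-rater trial: one NTCP per physician for each patient.
patient_a = [2.0, 4.5, 9.0, 3.0]  # delta = 7.0
patient_b = [1.0, 2.0, 4.0, 3.0]  # delta = 3.0

print(ntcp_delta(patient_a), is_clinically_significant(patient_a))
print(ntcp_delta(patient_b), is_clinically_significant(patient_b))
```

For an inter-rater trial the list holds one value per physician; for an intra-rater trial it holds the two values produced by the same physician.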

Statistical analysis
Inter-rater and intra-rater agreement with 95% confidence intervals were estimated simultaneously for the aforementioned volume, location, and shape variables using Gilder's method [19]. Gilder's method (also known as the modified large-sample approach) was used instead of other popular methods (e.g., DICE coefficients) because it provides more accurate coverage for both inter- and intra-rater reliability. Its accuracy benefits from estimating inter- and intra-rater reliability simultaneously rather than analyzing the physicians separately and the two trials separately. DICE and similar methods would introduce increased error due to chance when trials are examined separately; these methods are appropriate for single-trial assessments.
NTCPs were compared in two inter-rater trials as well as an intra-rater trial for each physician; reliability was estimated by the same method. Reliability of 1.00 denotes absolute agreement while >0.7 indicates excellent agreement, 0.4-0.7 indicates moderate agreement, and <0.4 indicates poor agreement. Estimates whose 95% confidence interval includes zero are not considered statistically significant.

Results
Table 1 summarizes the magnitude of disparities seen between physicians contouring the same patient. [...] 0.2 to 3.2 mm. In Trial 2, these differences ranged from 0.3 to 3.4 mm. The differences in the center variables ranged from 0.06 to 2.9 mm in Trial 1 and from 0.1 to 2.7 mm in Trial 2.
As shown in Table 2, inter-rater NTCP reliability was measured at 0.398 (0.212-0.646) for bladder and 0.0936 (0.000-0.307) for rectum. Intra-rater reliability measured 0.650 (0.472-0.809) for bladder and 0.186 (0.000-0.488) for rectum.

TABLE 4: Data from intra-rater trials of bladder and rectum NTCP
Δintra = difference between the highest and lowest NTCP for a particular patient in an intra-rater trial; NTCP = normal tissue complication probability
For bladder, no physician had Δintra greater than 5% for any patient. In the rectum NTCP trials, physician 1 had two patients and physicians 2 and 4 each had a single patient with Δintra of 5% or greater.

Discussion
Modern 3D-based RT planning can deliver dose precisely to target volumes while sparing organs at risk. Our study aimed to quantitatively assess both inter- and intra-rater variability in CTV definition in the post-prostatectomy setting, with consensus guidelines available, by first statistically estimating the reproducibility of post-prostatectomy contours, including the geometry and direction of any variability, and then determining whether these differences, if any, carry a toxicity risk.
A review of published literature found six publications providing contouring guidelines for post-prostatectomy patients [12][13][14][15][22][23]. Of five primary guidelines, three were from the major oncology societies, one from Princess Margaret Hospital (PMH), and one from the Radiotherapy and Androgen Deprivation in Combination After Local Surgery (RADICALS) trial. Five papers addressed the methods used to create the guidelines, with the PMH, EORTC, and RADICALS guidelines indicating the validation methods used to assess the guidelines. The studies are primarily assessments of variability compared to previous contours, and none addressed an important clinical outcome, such as possible toxicity, as described in our study. The three validation studies are consistent with this study in terms of the amount of variation and the regions of discrepancy. For example, the study by Ost, et al. shows the same inter-observer agreement level as in our study (using the kappa statistic rather than Gilder's method) [23].
In both inter- and intra-rater trials, the center-of-volume variables showed near-perfect agreement. Postoperative clips provide a consistent and reliable marker to guide CT-based treatment plans [24] and may account for the 'excellent' agreement in defining the CTV center.
CTV volume and shape variables showed consistently worse agreement than the center-of-volume. All volume and shape variables demonstrated uniformly 'poor' inter-rater agreement (reliability estimates ≤0.290). Intra-rater trials demonstrated moderate agreement for the prostate bed volume with a reliability estimate of 0.522 (0.331-0.677) and excellent agreement for the seminal vesicle volume with reliability estimate of 0.729 (0.600-0.827). The prostate bed and seminal vesicle CTV shape showed generally moderate intra-rater agreement, although two variables for the prostate bed CTV barely missed the 'moderate' agreement cutoff (≥0.4). Intra-rater agreement among the shape and volume variables was much better than inter-rater agreement.
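The agreement labels used in this paragraph follow the thresholds given in the statistical analysis. A small sketch of that mapping, applied to estimates quoted in the text, is shown below; the function names are hypothetical.

```python
def agreement_category(reliability):
    """Map a reliability estimate to the labels used in this study:
    1.00 absolute, >0.7 excellent, 0.4-0.7 moderate, <0.4 poor."""
    if reliability >= 1.0:
        return "absolute"
    if reliability > 0.7:
        return "excellent"
    if reliability >= 0.4:
        return "moderate"
    return "poor"


def is_statistically_significant(ci_low, ci_high):
    """An estimate is not significant if its 95% CI includes zero."""
    return not (ci_low <= 0.0 <= ci_high)


# Estimates quoted in the text.
print(agreement_category(0.290))  # upper bound of inter-rater shape/volume estimates
print(agreement_category(0.522))  # intra-rater prostate bed volume
print(agreement_category(0.729))  # intra-rater seminal vesicle volume
print(is_statistically_significant(0.331, 0.677))  # CI for prostate bed volume
print(is_statistically_significant(0.000, 0.307))  # CI for inter-rater rectal NTCP
```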
Several shape metrics can be used to describe three-dimensional contours. However, no single metric has become standard, and each has distinct advantages. We chose a simple and easily understood metric, which looked at maximum distance in each of the three dimensions. As the organs at risk/normal structures are superior (bladder), posterior (rectum), and anterior (bladder) to the treatment area, we wanted readers to have a clear idea of the direction in which the inter- and intra-rater differences occurred. The DICE coefficient is a standard metric that provides volume overlap information, but does not provide the direction of difference. The standard Hausdorff distance methods can be applied to this data set, and will produce the maximum difference between any two datasets. Differences other than the largest would be lost, and the direction of the maximum difference would have to be projected back to the directions of the critical organs of interest. In general, other distance metrics introduce additional complexity without additional value. This is why we chose to employ the simple metric of maximum distance in each direction to assess the differences in 3D contours.
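The point about direction can be illustrated with a toy example. In the sketch below, a reference contour is compared with two hypothetical variants that extend the same distance in opposite directions along one axis. The voxel sets, axis orientation, and function names are assumptions made for illustration and are not study data. Both variants yield the same Dice coefficient and the same Hausdorff distance relative to the reference, so neither metric reveals which way the contour grew.

```python
import math


def dice(a, b):
    """Dice coefficient of two voxel sets: 2|A and B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def box(x_range, y_range):
    """Voxel centers of a one-slice rectangular contour (z fixed at 0)."""
    return [(float(x), float(y), 0.0) for x in x_range for y in y_range]


reference = box(range(0, 4), range(0, 4))    # 4 x 4 voxels
grown_plus = box(range(0, 4), range(0, 6))   # extends 2 voxels toward +Y
grown_minus = box(range(0, 4), range(-2, 4)) # extends 2 voxels toward -Y

for name, variant in (("+Y", grown_plus), ("-Y", grown_minus)):
    print(name, dice(reference, variant), hausdorff(reference, variant))
```

A per-axis extent, by contrast, would differ in sign between the two variants, which is the information needed when the organs at risk lie in known directions from the target.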
New data, principally from MRI series, have identified common regions of post-treatment failure. Based on these patterns, field border guidelines and consensus guidelines have been published [12][13][14][15]. Our findings suggest that, without an in situ anatomical structure for target delineation, physicians' contouring of post-prostatectomy regions-at-risk is variable, highlighting the need for development and adoption of such guidelines. Even though physicians' contours match more closely with their own previous contours than with those of their colleagues, this intra-rater agreement was 'moderate' at best, leaving room for further improvement with better education.
Valicenti, et al. [25] previously studied inter-rater variability in CTV for in situ prostate patients using contrast-enhanced CT. They estimated inter-rater reliability for the prostate CTV volume of 0.92 (95% CI: 0.75-0.99), indicating excellent agreement. In contrast, our findings showed poor agreement in the prostate bed CTV volume, suggesting that the lack of a well-defined organ to target and the multiple guidelines may result in increased variation in contouring. Data from the RADICALS trial confirm the substantial variation in target volume delineation found in this study and show that inter-physician variability could be reduced when the oncologists used the single guideline recommended by the RADICALS trial [22].
The degree of CTV variability suggests caution in applying IMRT due to higher risk of geographic miss from an inconsistently defined CTV. Methods to standardize contours (e.g., consensus guidelines, computerized contouring algorithms, etc.) may help reduce variation, but the physicians in our study were allowed to access any literature or guidelines that they knew of and felt valuable. Nevertheless, wide contouring variations were observed. Future studies designed to reduce the risk of recurrence and toxicity using dose escalation, fraction change, and normal structure avoidance programs should not proceed without improved standardization of physician contouring of the regions-at-risk.
Our final goal was to study differences in clinical outcomes using a radiobiological endpoint, namely, NTCP of bladder and rectum. Inter-rater NTCP agreement was poor for both organs, although the reliability coefficient for bladder (0.398) was close to the threshold for 'moderate' agreement. In intra-rater trials, moderate agreement was shown for bladder, while agreement for rectal NTCP was poor.
Based on our a priori definition of clinical significance (variation of ≥5%), no clinically significant difference in bladder NTCP was demonstrated in any inter- or intra-rater trial, but rectal NTCP did show clinically significant differences. In the inter-rater trials, five patients in the first trial and three patients in the second (out of 15) showed clinically significant rectal NTCP differences. In intra-rater trials, clinically significant differences in rectal NTCP were generally not observed, but one physician had two patients and two physicians each had one patient for whom the NTCP differed between the two plans by 5% or higher, meeting our definition of clinical significance.
Given these outcomes, we suggest that, despite variations in contouring size and shape, physicians are consistent in their ability to spare bladder from radiation-induced side-effects. The larger observed differences in rectal NTCP, mostly not clinically significant by our definition, may reflect the fact that CTV contours overlap with rectal contours more than bladder. Standardized contouring protocols could reduce rectal NTCP variability, saving patients from uncomfortable side-effects. Given the limited difference observed in intra-rater trials, we propose that standardization of protocols (for example, through a consensus of published atlases or guidelines or with validated, automated contouring) should reduce the inter-rater error that currently limits our ability to improve post-prostatectomy RT.
When delivering RT, risk of complications must be balanced with the likelihood of tumor control. Insignificant increases in NTCP may allow increased dose delivery or expansion of the irradiated area, thereby increasing tumor control probability while maintaining acceptable chances of toxicity.
Our study can be criticized for not providing participating physicians with contouring guidelines, which was done to reflect real clinical practice. Given the publication in recent years of consensus guidelines, future research can compare our results with the effect of protocols to standardize contouring (such as specific contouring guidelines or automated contouring algorithms) on inter- and intra-rater variation for post-prostatectomy patients, and can evaluate strategies for effectively disseminating a uniform guideline to clinicians.
Lastly, this study used four-field 3D-CRT treatment. Although no evidence currently supports improved outcomes using post-prostatectomy IMRT, many centers have adopted IMRT assuming such a difference. This study may not be generalizable to patients treated with IMRT. However, with IMRT the effect of contouring variation would likely be exacerbated rather than reduced, which would reinforce the current results. This is consistent with clinical data which have shown increased GI, but not GU, toxicity with the move to IMRT in this setting [26]. The current method of assessing the NTCP impact of contouring differences can be used to estimate the value of IMRT.

Conclusions
Inter-rater agreement in the shape of the CTV for post-prostatectomy patients was generally poor, while moderate intra-rater agreement was demonstrated. Assuming an accurate NTCP assessment, the observed differences translated into clinically important differences in predicted complication rates for rectum, but not for bladder.
Adoption of highly conformal RT via implementation of evidence-based contouring guidelines should minimize the risk of geographic miss and unnecessary normal tissue irradiation, further improving the therapeutic ratio for radiotherapy. Future research can compare our results to those obtained using specific guidelines or standardization techniques to confirm improved agreement and reduced predicted toxicity.

Additional Information
Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.

Conflicts of interest:
In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.