The Mentalization-based Group Therapy Adherence and Quality Scale (MBT-G-AQS) is a tool designed to measure the adherence and quality of mentalization-based group therapy (MBT-G), a treatment approach for Borderline Personality Disorder (BPD). The present study evaluated the reliability of the MBT-G-AQS by having 5 raters use it to rate 16 videotaped therapy sessions from 8 MBT groups and 8 psychodynamic groups. The results showed high to excellent reliability for global ratings of adherence and quality for all numbers of raters, with particularly high reliability when using several raters. The mean reliability for individual items on the scale was also good, particularly when using multiple raters. These results suggest that the MBT-G-AQS can be a useful tool for training, supervision, and research on MBT-G. The study also identified certain items on the scale that were more difficult to rate consistently, indicating a need for further calibration and training as well as more explicit definitions of the phenomena being assessed. Overall, the MBT-G-AQS can help document treatment integrity and improve the evidence base for MBT-G.
After you have read the article (https://doi.org/10.1111/sjop.12375), you may wish to return and ask yourself the following questions:
What is the purpose of the Mentalization-based Group Therapy Adherence and Quality Scale (MBT-G-AQS)?
a) To measure the effectiveness of mentalization-based group therapy (MBT-G)
b) To assess the adherence and quality of MBT-G
c) To evaluate the reliability of MBT-G
d) To document treatment integrity in MBT-G
What is the focus of the current study?
a) The effectiveness of MBT-G for Borderline Personality Disorder (BPD)
b) The reliability of the MBT-G-AQS
c) The outcome of MBT-G in naturalistic comparisons
d) The importance of video-based supervision in MBT-G
How many therapy sessions were rated in the current study?
How many raters were involved in the current study?
What was the reliability of the global ratings for adherence and quality with five raters in the current study?
d) High to excellent
What was the mean reliability for individual items on the scale with one rater in the current study?
What was the mean absolute G-coefficient for adherence with two raters in the current study?
What was the mean absolute G-coefficient for quality with five raters in the current study?
Which of the following statements is NOT true about the current study?
a) It is the first study to report psychometric properties for the MBT-G-AQS.
b) It is the first study to focus on therapists' interventions in group therapy since 2005.
c) It found that the MBT-G-AQS is a reliable instrument for measuring treatment integrity in MBT-G.
d) It found that the MBT-G-AQS is not a useful tool for training, supervision, and research on MBT-G.
Which of the following items on the MBT-G-AQS was found to be more difficult to rate consistently in the current study?
a) "Cooperation with co-therapist"
b) "Acknowledging good mentalizing"
c) "Stimulating discussions on group norms"
d) "Exploration, curiosity and not-knowing stance"
- To assess the adherence and quality of MBT-G
- The reliability of the MBT-G-AQS
- 16 therapy sessions
- 5 raters
- High to excellent reliability
- High mean reliability
- High mean absolute G-coefficient for adherence
- High mean absolute G-coefficient for quality
- It found that the MBT-G-AQS is not a useful tool for training, supervision, and research on MBT-G.
- "Acknowledging good mentalizing"
A much-welcomed approach to the examination of psychotherapy sessions is generalizability theory (G-theory; Cronbach et al., 1963; Shavelson & Webb, 1991), which is suitable to investigate observational ratings of complex phenomena. The data dictates the method, and when the measurement design contains multiple sources of variance, G-theory is an appropriate approach to disentangle and estimate these sources of variance. G-theory addresses the adequacy with which one can generalize from a sample of observations to a universe of observations from which the sample was randomly drawn. This issue is particularly relevant for ratings of the psychotherapy process because multiple sources of error variance are common, such as variation due to the patient, the session, the group, the therapist, the rater, or other potential factors. In a study design such as this one, where the observed score is compounded by three or more sources of variance, intraclass correlation is not an appropriate method to estimate the level of reliability. By incorporating multiple sources (facets) of error into reliability coefficients, reliability estimates calculated using G-theory are likely to be more accurate, as some contributions to errors of measurement (e.g., occasions, raters or items) can be assessed (Shavelson & Webb, 1991). G-theory provides estimates of the variability contributed by each source of error and of the interactions among sources of error (Shavelson & Webb, 1991). Although G-theory is concerned with variance components, it produces a summary coefficient, the G-coefficient, which is roughly analogous to a reliability coefficient (e.g., intraclass correlations). However, the G-coefficient is based on the researcher’s decision to treat facets as random or fixed, thereby defining the universe to which the researcher wants to generalize.
Further, a generalizability study (G-study) makes it possible to disentangle the component variations and estimate the reliability for a decreasing number of raters. The question in G-theory is the degree to which observed scores allow for generalizations about a person’s behavior in a defined universe of situations. G-theory provides G-coefficients reflecting the variability contributed by each source of error and of the interactions among sources of error (Shavelson et al., 1989). Based on the sample data, the relative impact of different sources of variation is estimated by a G-study (Shavelson et al., 1989), from which generalizability coefficients are computed. The G-coefficient indexes the proportion of total variability in scores that is due to “universe scores”: $E\rho^2 = \sigma^2(\tau) / [\sigma^2(\tau) + \sigma^2(\delta)]$, where $\sigma^2(\tau)$ is the variance of the true score and $\sigma^2(\delta)$ is the variance of the various error components. G-coefficients above .7 are generally considered sufficient for interpersonal and observationally coded constructs (Wasserman et al., 2009). A low G-coefficient is due to a significant amount of error in measurement or to minimal variation across individuals, the measurement procedure, and the universe of generalization (Hagtvet, 1997). The second coefficient in G-theory is called the dependability coefficient, denoted as D, and can be interpreted as the generalizability coefficient for absolute decisions (Shavelson & Webb, 1991). Therefore, based on the obtained G-study components, the generalizability framework offers a subsequent study called D-study or optimization study. With the D-study it is possible to estimate the reliability of scores based on, for example, four, two, or only one rater(s). This also allows for the more efficient training of raters, as the G-study predicts what the reliability would be if any rater were excluded from the study. If the G-coefficient improves substantially when a rater is omitted, then this rater could benefit from more training on this item.
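The D-study projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in the study: the variance components are hypothetical numbers, and the function assumes that every error component involves the rater facet, so that averaging over more raters divides the whole error variance by the number of raters.

```python
def g_coefficient(universe_var: float, error_var: float, n_raters: int) -> float:
    """Projected G-coefficient when scores are averaged over n_raters raters.

    Assumes all error variance is rater-linked, so it shrinks as 1 / n_raters.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return universe_var / (universe_var + error_var / n_raters)


# Hypothetical variance components, chosen only for illustration.
UNIVERSE_VAR = 4.0  # variance attributable to the objects of measurement
ERROR_VAR = 1.0     # error variance for a single rater

# Project reliability for different numbers of raters (the D-study question).
projection = {
    n: round(g_coefficient(UNIVERSE_VAR, ERROR_VAR, n), 3) for n in (1, 2, 4, 5)
}
print(projection)
```

With these made-up components the projected coefficient rises from .80 with one rater to about .95 with five, which shows why a scale can be acceptable with one rater and still benefit from several.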
The intended use of the MBT-G-AQS concerns decisions of whether subjects are below or above some specific level of adherence or quality. Consequently, the most relevant reliability estimate is absolute decisions (i.e., absolute G-coefficients; see Karterud et al., 2013 for a detailed discussion of this topic). Within the design of G-theory, several variance components can be disentangled in just one analysis (Shavelson & Webb, 1991). However, different designs depend on whether raters are crossed with patients (e.g., all raters rate all patients), whether they have unique raters nested within patients (independent groups of raters and patients), or whether the raters are considered to be random effects (to be representative of raters beyond themselves) or fixed effects, only to represent themselves (Shrout, 1998).
In the current research design, two therapy sessions from each of eight pairs of therapists were videotaped. Five raters rated all 16 sessions in this study. In the framework of G-theory (Shavelson & Webb, 1991), this implies a two-facet partially nested “(s:t) x r” design, where sessions (s) are nested within therapists (t), and raters (r) are crossed over sessions within therapists. The design is partially nested because the effect of a session (s) is both nested (within t) and crossed (over r). The two facets of observation give two differentiation variance components, the individual variance between therapists (t) and the systematic variance between sessions for each therapist (st). This makes three sources of instrumentation variance (error) that directly affect the reliability of the observed scores. These are 1) the rater effect (r) indicating the consistency of how much ‘behavior’ the raters see, averaged over therapists and sessions; 2) the interaction between raters and therapists (tr), indicating the raters’ different rank ordering of the therapists; and 3) the unique rater–therapist–session interaction plus other unknown error variance (rst + e). See Figure 2 for an illustration of the (s:t) x r design. In this design, sessions (s) cannot be separated from a therapist (t) and neither can the session–rater interaction (sr) be separated from the rater–session–therapist interaction. “By explicitly recognizing that multiple sources of random and true score variance exist and that measures may have different reliabilities in different situations, GT has many advantages over classic true score theory” (Pedersen, 2008, p. 34).
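As a rough illustration of how the five variance components of a balanced (s:t) x r design can be estimated, the sketch below uses the standard expected-mean-squares (ANOVA) approach with all facets treated as random. It is an assumption-laden sketch rather than the procedure used in the study: the function names, the dictionary keys, and the toy data are hypothetical, negative estimates are simply set to zero, and t and s:t are taken as differentiation variance while r, tr, and the residual are taken as error, following the description above.

```python
from statistics import fmean


def variance_components(x):
    """Estimate variance components for balanced ratings x[t][s][r]:
    sessions (s) nested in therapists (t), crossed with raters (r)."""
    n_t, n_s, n_r = len(x), len(x[0]), len(x[0][0])
    T, S, R = range(n_t), range(n_s), range(n_r)
    grand = fmean(v for t in x for s in t for v in s)
    m_t = [fmean(v for s in x[t] for v in s) for t in T]
    m_r = [fmean(x[t][s][r] for t in T for s in S) for r in R]
    m_ts = [[fmean(x[t][s]) for s in S] for t in T]
    m_tr = [[fmean(x[t][s][r] for s in S) for r in R] for t in T]

    # Mean squares for each effect of the design.
    ms_t = n_s * n_r * sum((m - grand) ** 2 for m in m_t) / (n_t - 1)
    ms_st = n_r * sum((m_ts[t][s] - m_t[t]) ** 2 for t in T for s in S) / (
        n_t * (n_s - 1))
    ms_r = n_t * n_s * sum((m - grand) ** 2 for m in m_r) / (n_r - 1)
    ms_tr = n_s * sum((m_tr[t][r] - m_t[t] - m_r[r] + grand) ** 2
                      for t in T for r in R) / ((n_t - 1) * (n_r - 1))
    ms_res = sum((x[t][s][r] - m_ts[t][s] - m_tr[t][r] + m_t[t]) ** 2
                 for t in T for s in S for r in R) / (
        n_t * (n_s - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for the components.
    raw = {
        "t": (ms_t - ms_st - ms_tr + ms_res) / (n_s * n_r),
        "s:t": (ms_st - ms_res) / n_r,
        "r": (ms_r - ms_tr) / (n_t * n_s),
        "tr": (ms_tr - ms_res) / n_s,
        "res": ms_res,
    }
    return {k: max(v, 0.0) for k, v in raw.items()}  # truncate negatives


def g_coefficients(vc, n_raters):
    """Return (relative, absolute) G-coefficients for n_raters raters."""
    universe = vc["t"] + vc["s:t"]
    rel_err = (vc["tr"] + vc["res"]) / n_raters
    abs_err = rel_err + vc["r"] / n_raters  # absolute error adds rater effect
    return universe / (universe + rel_err), universe / (universe + abs_err)


# Toy data (hypothetical): 2 therapists x 2 sessions x 3 raters, built as a
# purely additive score so that the tr and residual components are zero.
a, b, c = [0, 10], [[0, 2], [1, 3]], [0, 1, 2]
toy = [[[a[t] + b[t][s] + c[r] for r in range(3)] for s in range(2)]
       for t in range(2)]
vc = variance_components(toy)
relative_g, absolute_g = g_coefficients(vc, n_raters=1)
```

In the toy data the raters differ only by a constant leniency shift, so the relative coefficient is 1 while the absolute coefficient is slightly lower. That is the distinction between agreeing on rank order and agreeing on exact scores.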
The results show high reliability for both adherence and quality (competence). The mean absolute G-coefficient for adherence was .86 (range .63–.97) and was .88 for quality (range .64–.96). The reliabilities of the overall adherence (.97) and quality (.96) ratings are both high. The nine group-specific items (items 1–9) displayed very high reliability for both adherence (range .83–.95) and quality (range .78–.96). Further, the residual variance for the overall quality score was very low (17%), and there was complete agreement among the raters on frequency/adherence and the ranking of therapists. The results show only minor differences between relative and absolute G-coefficients (raters agree as much on exact scores as on the ranking of the sessions). Table 3 shows the grand mean and standard deviation of scores across all raters and sessions and G-coefficients for all ratings of all items. Reliability was very high to excellent for the entire scale and for single items (items with low absolute G-coefficients had very low variance, and the reliability is therefore also acceptable/high for these items), indicating reliable assessment of the specific aspects of competence and adherence for MBT-G. The nine group-specific items had as high reliability as the nongroup-specific interventions. Items 9, 14, 16, and 18 have a low frequency (adherence) rating, while Items 11 and 14 are rated often.
For some of the items the reliability would increase notably if one of the raters (different items for different raters) was omitted in the G-study. Importantly, the overall ratings were the most robust items for a decreasing number of raters. Deleting the least reliable rater from the overall ratings would only slightly increase the reliability for these two items (+0.01). Further, the absolute G-coefficient for Item 8 (adherence) would increase from .83 to .89 if rater number 5 was excluded. If rater 2 was omitted, the absolute G-coefficient of adherence would increase for Item 9 (from .84 to .89) and for Item 11 (from .80 to .87). Omitting rater 4 from the study would increase the absolute G-coefficient of adherence for Item 18 (from .88 to .93). Excluding raters 2 and 4 from the study would also increase the absolute G-coefficient of quality for Items 18 (from .84 to .87) and 19 (from .88 to .93). This means that different raters had areas in which their understanding of the scale deviated from the “norm” but signals that variance is not systematic in terms of one rater being consistently worse than the others. Regarding adherence (frequency), the present study displays very high variance (sum of variances) for Item 11 (21.77), Item 17 (28.76), and Item 19 (27.18) and very low variance for Item 14 (1.4), Item 16 (.77), and Item 18 (.72). The reliability for Item 16 (and Items 14 and 18) is very good considering such low variance. On Item 11, the raters differ to a greater degree, even though the G-coefficient here is high (.80). Regarding quality, our G-study indicates a low sum of variances for Item 1 (.65), Item 10 (.99), Item 12 (.92), and Item 15 (.67), with the highest variance for Item 2 (3.13), Item 3 (2.98), and Item 9 (2.85). According to the quality ratings of Item 8, the therapists vary greatly from session to session (the S:T-variance = 1.69; sum of variances = 1.69).
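The leave-one-rater-out check can be illustrated with a deliberately simplified sketch. It assumes a fully crossed sessions x raters table for a single item (ignoring the nesting of sessions within therapists), uses made-up ratings, and relies on hypothetical function names. It recomputes the absolute G-coefficient after dropping each rater in turn.

```python
from statistics import fmean


def absolute_g(scores):
    """Absolute G-coefficient for a balanced sessions x raters table,
    where scores[i][j] is the rating of session i by rater j."""
    n_p, n_r = len(scores), len(scores[0])
    grand = fmean(v for row in scores for v in row)
    m_p = [fmean(row) for row in scores]
    m_r = [fmean(row[j] for row in scores) for j in range(n_r)]
    ms_p = n_r * sum((m - grand) ** 2 for m in m_p) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in m_r) / (n_r - 1)
    ms_res = sum((scores[i][j] - m_p[i] - m_r[j] + grand) ** 2
                 for i in range(n_p) for j in range(n_r)) / (
        (n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_res) / n_r, 0.0)   # session (universe) variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)   # rater leniency variance
    error = (var_r + ms_res) / n_r            # absolute error of the mean
    total = var_p + error
    return var_p / total if total > 0 else 0.0


def leave_one_rater_out(scores):
    """Absolute G-coefficient after omitting each rater in turn."""
    n_r = len(scores[0])
    return {
        j: absolute_g([[v for k, v in enumerate(row) if k != j]
                       for row in scores])
        for j in range(n_r)
    }


# Hypothetical ratings: raters 0 and 1 agree, rater 2 orders sessions differently.
ratings = [[1, 1, 4], [2, 2, 1], [3, 3, 3], [4, 4, 2]]
full = absolute_g(ratings)
dropped = leave_one_rater_out(ratings)
most_deviant = max(dropped, key=dropped.get)  # rater whose removal helps most
```

In this contrived example, removing the third rater raises the coefficient sharply, which is the pattern that would flag a rater for more training on that item. Real data would rarely be this clear-cut.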
Additionally, for Items 8 and 9 there are no individual differences (variance component from therapists = .0).
One finding not emphasized in MBT-G-AQS was that on a descriptive level, there was a noteworthy difference between treatments (PDG versus MBT-G) in quality/competence but not adherence. This signals a significant correlation between adherence and quality/competence but also that the MBT-G-AQS had some discriminant validity. The structuring elements of the MBT-G-AQS (Items 1, 2, and 3) were the major difference from PDG. Table 4 shows that Items 2, 3, and 11 display the largest difference between PDG and MBT-G. Item 9 had a higher prevalence in PDG (M = 4) than in MBT-G (M = 0).
Let us first address the topics and four major findings highlighted by MBT-G-AQS:
- The group component is typically neglected, both within the field of MBT and in the larger universe of psychotherapy research.
- The overall score for adherence and quality of MBT-G-AQS can be reliably rated by one rater.
- The items measuring theoretical constructs considered core concepts in MBT showed low reliability, both “Pretend mode” (Item 15) and “Psychic equivalence” (Item 16).
- The nine group-specific items displayed high reliability for both adherence and quality.
The first topic that formed the background for MBT-G-AQS was the paucity of research on adherence and competence for group therapy, confirming the call for the development of group therapists’ measures (Burlingame et al., 2004). The Group Psychotherapy Intervention Rating Scale (GPIRS) was developed by Sternberg and Trijsburg (Chapman et al., 2010) and is the only scale that seems to reflect the many integrity measures developed for individual psychotherapy. GPIRS was developed for group psychotherapy in general. The 48 items are designed for empirical research in general and are not specific for any treatment or manual. The MBT-G-AQS addresses the dialectic between structure and dynamic process, which is present in all dynamic group therapies (Yalom & Leszcz, 2005). Therefore, MBT-G-AQS should be of interest for the general field of group psychotherapy, and the MBT-G-AQS should be helpful for the future development of other similar scales for other group psychotherapies.
Treatment integrity consists of two elements: (1) treatment adherence, that is, “the extent to which a therapist used interventions and approaches prescribed by the treatment manual, and avoided the use of interventions and procedures proscribed by the manual” (Waltz et al., 1993, p. 620) and (2) the therapist’s competence, that is, “the level of skill shown by the therapist in delivering the treatment. By skill, we mean the extent to which the therapist conducting the interventions took the relevant aspects of the therapeutic context into account and responded to these contextual variables appropriately” (Waltz et al., 1993, p. 620). According to this definition, competence requires adherence, but adherence does not necessarily imply competence (McGlinchey & Dobson, 2003). In RCTs “where therapists are trained using a manual for a specific disorder, between-therapist variation is likely smaller than in general practice” (Falkenström et al., 2013, p. 2). Therefore, despite the “robust” therapist effect reported (Wampold & Imel, 2015), a reliable integrity measure based on a manual has important implications for delivering a specific potion to BPD patients. A manual is also important in making therapists, and indirectly their patients, trust their method (therapist allegiance). The group component is the clinical backbone of the MBT program (Karterud, 2015). However, the group component had been neglected when it comes to fidelity measures. This is unfortunate, as a reliable fidelity measure is important not only for reporting treatment integrity but also for quality control, supervision, training of therapists, legitimization of the treatment (e.g., government, propagation), and further research and proliferation.
The second major finding in MBT-G-AQS was that the scale showed high reliability. The present reliability is somewhat higher than that reflected by the G-coefficients in the reliability study on the MBT-I-ACS (Karterud et al., 2013), a possible product of extensive training and experience. With one rater, the reliability was very high for overall MBT-G adherence (.86) and quality (.83). This demonstrated that the MBT-G-AQS can be reliably used by one rater to determine the cut-off for adequate adherence and quality/competence for MBT-G. The reliability for overall adherence and quality ratings with five raters was high. This indicates that a team of raters was able to achieve good agreement regarding what the ingredients in MBT-G are and how to evaluate them. The reliability for the overall absolute decision (absolute G-coefficients) was very good. As items with low absolute G-coefficients also had low variance, the reliability is therefore deemed high for these items as well (Hagtvet, 1997). The scale may contribute to future psychotherapy research by assuring internal validity and contribute to research on adherence and competence as possible moderators of treatment outcome. In addition, the scale can be used for training and clinical purposes; assessing and providing feedback about therapeutic competence and adherence enables therapists and supervisors to check and improve the skills used in delivering essential elements of MBT-G. Noticeable differences in the mean profiles for MBT-G and PDG are interpreted as reflecting the scale’s ability to differentiate these two treatments, thus lending some support to the discriminant validity of the scale. The results were uplifting both in demonstrating that the overall score for adherence and quality (competence) could be rated by one single rater and in showing that the overall scale has good reliability for two raters.
This finding is important in terms of the feasibility of integrating quality control and assessment at multiple treatment facilities and of continuing the MBT-G ratings made for services such as the Quality Lab for Psychotherapy in Oslo. The good inter-rater reliability results in this study indicate that the MBT scales can be used reliably with careful training and supervision. Nonetheless, subsequent studies should investigate whether this finding can be replicated with other raters. A limitation of reproducibility (which is at the core of reliability) is whether such agreement can be reached at other places/centers and whether the MBT-G-AQS is primarily a tool for expert raters with special training (Simonsen et al., 2019). Recently, the MBT-G-AQS was employed to measure treatment fidelity in a Danish RCT (Beck et al., 2020; Jorgensen et al., 2021) and has been reported in two studies by Kvarstein et al. (2019, 2020).
The overall ratings in MBT-G-AQS were based on a global understanding of the session, which is essentially about answering whether the therapists stimulate the patients’ mentalizing or not. That is, the “most important sign of a successful MBT session is that the patient gets involved in a mentalizing discourse” (Karterud & Bateman, 2010, p. 44). This raises another concern, which is that it is not possible to rate the therapist(s) independently of the patient(s). It has generally been assumed that adherence and competence are therapist characteristics (Baldwin & Imel, 2013). Recent studies (e.g., Boswell et al., 2013) challenge this presumption, and it seems that “it is the patient’s contribution to competence ratings that is related to outcome rather than the therapists’ competence relative to other therapists” (Wampold & Imel, 2015, p. 238). No matter how well defined the scale and manual, it will be necessary to rate the interaction between therapists and patients. This implies that a substantial portion of the variance in both adherence and competence ratings will stem from the patients. Further, the conception of competence/quality in MBT should thus be derived from the treatment manual and the theory of change specified in it. However, MBT is a manualization of a non-technique-based psychotherapy in which interventions are driven by understanding (Perepletchikova, 2007; Perepletchikova et al., 2007) and in which the relationship to the therapist and interactional processes play an essential role. Consequently, as indicated by MBT-G-AQSI, highly rated MBT contains strategies not described in the manual, which may imply that the concept of quality in MBT is largely a measure of embedded alliance (this topic will be elaborated later). MBT-G-AQSI investigated such conceptual interactions closely in MBT-I, but case studies in MBT-G should also be applied to investigate this topic further.
The third major finding was that some of the items measuring core MBT concepts had low reliability and occurrence (e.g., “Psychic equivalence” and “Pretend mode”). An important aspect of a reliability study is identifying items in the manual that should be made more precise. For example, items with the lowest reliabilities in MBT-I following a brief 1-day training course were “Focus on affects”, “Focus on interpersonal affects”, “Counter-transference”, and “Psychic equivalence” (Simonsen et al., 2019). For MBT-G, this is particularly true for psychic equivalence and pretend mode (these were the two concepts raters disagreed about the most), which is somewhat unfortunate, as they are part of the core theoretical underpinning in this treatment tradition. The G-study allows for investigating the source of variance, and for these two items the results indicate that the measured concepts are unclear for therapists and raters alike. This finding is largely in line with the finding of Karterud et al. (2013) regarding the MBT-I-ACS that “there was a moderate agreement on identifying interventions aimed at psychic equivalence. However, the competence reliability is lower (.33). The manual should be more specific with respect to what counts as a high versus low competence for this item” (pp. 714–715). In terms of pretend mode, they reported that
the residual variance was very high for this item, indicating (1) that the therapists had difficulties with identifying pretend mode, (2) that the therapists had difficulties with knowing what to do with it, and (3) that the raters had difficulties with identifying interventions aimed to modify pretend mode. (p. 714)
Hence, MBT-G-AQS and the studies by Simonsen et al. (2019) and Karterud et al. (2013) may lend support to criticism aimed at MBT being abstract and hard to integrate (Hutsebaut et al., 2012; Sharp et al., 2020). Consequently, amending the operationalization of pretend mode and psychic equivalence will most likely be helpful for the field of MBT. From a psychometric perspective, items/interventions with low occurrence (e.g., Items 9, 14, 16, and 18) may be seen as redundant. Further, very high reliabilities (.95 or higher) are not necessarily desirable. Such views also reflect the underlying question of whether MBT-G is best defined by empirical data (what can be operationally observed in therapists who say they deliver MBT-G) or by an a priori conception by the conceivers of MBT. Arguably, there is an interaction/dialectic between such perspectives with clinical practice, such that over time there will be an interplay leading to a continual revision of manuals, theory, training, rating procedures, and practice.
It is interesting to note that in MBT-G-AQS there was a difference between treatments (PDG versus MBT-G) on competence but not adherence, especially as much of the previous research (Barber et al., 2007; Gutermann et al., 2015) has reported that the interrelatedness of adherence and competence is high. One possible reason for this finding is that mentalizing is a very broad concept (cf. “Plain Old Therapy”; e.g., Allen, 2012). Therefore, most treatments would necessarily deliver mentalizing interventions but with different competence/quality. Further, the structuring aspect of the treatment is assumed important in a clinical setting (Bateman & Fonagy, 2016; Wampold & Imel, 2015). The present findings would support this, as it was the structuring elements of the MBT-G-AQS (Items 1, 2, and 3) that constituted the major difference between MBT-G and PDG. Inderhaug and Karterud (2015) reported that without this structuring element, MBT-G groups can be very chaotic.
The fourth major finding in MBT-G-AQS was that the nine group-specific items displayed high reliability for both adherence (range .83–.95) and quality (range .78–.96). This means that the operationalization of MBT-G (Karterud, 2015) has been fruitful. The combination of group and individual therapy (conjoint therapy) has been found to be positively associated with outcome (Antonsen et al., 2017). Before discussing MBT-G-AQSI in more depth, it is telling to observe that the highly rated MBT sessions in MBT-G-AQSI displayed an impeccable focus on the conjoint aspect of MBT (Item 17; “Integrating experiences from concurrent group therapy”), that is, combining MBT-I and MBT-G. This item, which is by far the most frequent in these two sessions, has to do with the overall program and is of course at the root of establishing a strong alliance in the overall program. This item is important because it builds a bridge between the individual and their place in society (the group is a small society or “family”). As “personality disorders are defined as different ways of organizing social experience” (Pedersen, 2008, p. 72), this is likely one of the main keys BPD patients need to improve. In the low-rated MBT sessions, there were few interventions about the group. The importance of the conjoint aspect of MBT will be discussed when covering pedagogic interventions and epistemic trust in more depth.
Allen, J. G. (2012). Restoring mentalizing in attachment relationships: Treating trauma with plain old therapy. American Psychiatric Pub.
Antonsen, B. T., Kvarstein, E. H., Urnes, Ø., Hummelen, B., Karterud, S., & Wilberg, T. (2017). Favourable outcome of long-term combined psychotherapy for patients with borderline personality disorder: six-year follow-up of a randomized study. Psychotherapy research, 27(1), 51-63.
Baldwin, S. A., & Imel, Z. (2013). Therapist effects: Findings and methods. In M. J. Lambert (Ed.), Bergin and Garfield’s handbook of psychotherapy and behavior change (6 ed., pp. 258–297).
Barber, J. P., Triffleman, E., & Marmar, C. (2007, Oct). Considerations in treatment integrity: implications and recommendations for PTSD research. J Trauma Stress, 20(5), 793-805. https://doi.org/10.1002/jts.20295
Bateman, A., & Fonagy, P. (2016). Mentalization-based treatment for personality disorders: A practical guide. Oxford University Press.
Beck, E., Bo, S., Jorgensen, M. S., Gondan, M., Poulsen, S., Storebo, O. J., Fjellerad Andersen, C., Folmo, E., Sharp, C., Pedersen, J., & Simonsen, E. (2020, May). Mentalization-based treatment in groups for adolescents with borderline personality disorder: a randomized controlled trial. J Child Psychol Psychiatry, 61(5), 594-604. https://doi.org/10.1111/jcpp.13152
Boswell, J. F., Gallagher, M. W., Sauer-Zavala, S. E., Bullis, J., Gorman, J. M., Shear, M. K., Woods, S., & Barlow, D. H. (2013, Jun). Patient characteristics and variability in adherence and competence in cognitive-behavioral therapy for panic disorder. J Consult Clin Psychol, 81(3), 443-454. https://doi.org/10.1037/a0031437
Burlingame, G. M., MacKenzie, K., & Strauss, B. (2004). Small group treatment: Evidence for effectiveness and mechanisms of change. In M. J. Lambert (Ed.), Handbook of psychotherapy and behavior change (5 ed., pp. 647–696). Wiley & Sons.
Chapman, C. L., Baker, E. L., Porter, G., Thayer, S. D., & Burlingame, G. M. (2010, Mar). Rating Group Therapist Interventions: The Validation of the Group Psychotherapy Intervention Rating Scale. Group Dynamics-Theory Research and Practice, 14(1), 15-31. https://doi.org/10.1037/a0016628
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of Generalizability - a Liberalization of Reliability Theory. British Journal of Statistical Psychology, 16(2), 137-163. https://doi.org/10.1111/j.2044-8317.1963.tb00206.x
Falkenström, F., Markowitz, J. C., Jonker, H., Philips, B., & Holmqvist, R. (2013). Can psychotherapists function as their own controls? Meta-analysis of the “crossed therapist” design in comparative psychotherapy trials. The Journal of clinical psychiatry, 74(5), 482. https://doi.org/10.4088/JCP.12r07848
Gutermann, J., Schreiber, F., Matulis, S., Stangier, U., Rosner, R., & Steil, R. (2015). Therapeutic adherence and competence scales for Developmentally Adapted Cognitive Processing Therapy for adolescents with PTSD. Eur J Psychotraumatol, 6(1), 26632. https://doi.org/10.3402/ejpt.v6.26632
Hagtvet, K. A. (1997). The Function of Indicators and Errors in Construct Measures: An Application of Generalizability Theory. Journal of Vocational Education Research, 22(4), 247-266.
Hutsebaut, J., Bales, D. L., Busschbach, J. J., & Verheul, R. (2012, Jul 20). The implementation of mentalization-based treatment for adolescents: a case study from an organizational, team and therapist perspective. Int J Ment Health Syst, 6(1), 10. https://doi.org/10.1186/1752-4458-6-10
Inderhaug, T. S., & Karterud, S. (2015). A qualitative study of a mentalization-based group for borderline patients. Group Analysis, 48(2), 150-163.
Jorgensen, M. S., Storebo, O. J., Bo, S., Poulsen, S., Gondan, M., Beck, E., Chanen, A. M., Bateman, A., Pedersen, J., & Simonsen, E. (2021, May). Mentalization-based treatment in groups for adolescents with Borderline Personality Disorder: 3- and 12-month follow-up of a randomized controlled trial. Eur Child Adolesc Psychiatry, 30(5), 699-710. https://doi.org/10.1007/s00787-020-01551-2
Karterud, S. (2015). Mentalization-Based Group Therapy (MBT-G): A theoretical, clinical, and research manual. OUP Oxford.
Karterud, S., & Bateman, A. (2010). Manual for mentaliseringsbasert terapi (MBT) og MBT vurderingsskala. Versjon individualterapi. Oslo: Gyldendal akademisk.
Karterud, S., Pedersen, G., Engen, M., Johansen, M. S., Johansson, P. N., Schluter, C., Urnes, O., Wilberg, T., & Bateman, A. W. (2013). The MBT Adherence and Competence Scale (MBT-ACS): development, structure and reliability. Psychother Res, 23(6), 705-717. https://doi.org/10.1080/10503307.2012.708795
McGlinchey, J. B., & Dobson, K. S. (2003). Treatment integrity concerns in cognitive therapy for depression. Journal of Cognitive Psychotherapy, 17(4), 299-318.
Pedersen, G. (2008). Psychological assessment in clinical settings. Evaluation and clinical utility of psychometric measures for the treatment of patients with personality disorders. [Doctoral dissertation, University of Oslo. Norway].
Perepletchikova, F. (2007). Treatment integrity in treatment outcome research (2000–2004): Analysis of the studies and examination of the associated factors. Yale University.
Perepletchikova, F., Treat, T. A., & Kazdin, A. E. (2007, Dec). Treatment integrity in psychotherapy research: analysis of the studies and examination of the associated factors. J Consult Clin Psychol, 75(6), 829-841. https://doi.org/10.1037/0022-006X.75.6.829
Sharp, C., Shohet, C., Givon, D., Penner, F., Marais, L., & Fonagy, P. (2020). Learning to mentalize: A mediational approach for caregivers and therapists. Clinical Psychology-Science and Practice, 27(3), Article e12334. https://doi.org/10.1111/cpsp.12334
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer (Vol. 1). Sage.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989, Jun). Generalizability Theory. American Psychologist, 44(6), 922-932. https://doi.org/10.1037/0003-066x.44.6.922
Shrout, P. E. (1998, Sep). Measurement reliability and agreement in psychiatry. Stat Methods Med Res, 7(3), 301-317. https://doi.org/10.1177/096228029800700306
Simonsen, S., Juul, S., Kongerslev, M., Bo, S., Folmo, E., & Karterud, S. (2019, Apr 3). The mentalization-based therapy adherence and quality scale (MBT-AQS): Reliability in a clinical setting. Nordic Psychology, 71(2), 104-115. https://doi.org/10.1080/19012276.2018.1480406
Waltz, J., Addis, M. E., Koerner, K., & Jacobson, N. S. (1993, Aug). Testing the integrity of a psychotherapy protocol: assessment of adherence and competence. J Consult Clin Psychol, 61(4), 620-630. https://doi.org/10.1037//0022-006x.61.4.620
Wampold, B. E., & Imel, Z. E. (2015). The great psychotherapy debate: The evidence for what makes psychotherapy work. Routledge.
Wasserman, R. H., Levy, K. N., & Loken, E. (2009, Jul). Generalizability theory in psychotherapy research: the impact of multiple sources of variance on the dependability of psychotherapy process ratings. Psychother Res, 19(4-5), 397-408. https://doi.org/10.1080/10503300802579156
Yalom, I., & Leszcz, M. (2005). The selection of clients. In I. D. Yalom & M. Leszcz (Eds.), Theory and practice of group psychotherapy (5 ed., pp. 231–259). Basic Books.