Tellimer’s Two-Tier Probability of Sovereign Default Model: Logistic Regression with Gradient Boosted Residual Correction

Nick Lindsay

Quantitative Analyst

10 Oct 2025

White Papers


Abstract

This paper presents a two-tier modelling framework for estimating twelve-month sovereign Eurobond default probabilities (PDs) in emerging and frontier markets. The framework combines a baseline logistic regression model (Tier 1) with a Light Gradient Boosted Machine (LightGBM) residual correction layer (Tier 2). The first tier provides a stable and interpretable mapping from fundamental macro-financial covariates to early-warning PD estimates. The second tier then addresses systematic misspecification by capturing residual variation through flexible functional forms and non-linear interactions, while accommodating a wider set of country-specific variables. This two-tier design thus preserves transparency and explainability at its core, while extending predictive accuracy through modern machine learning techniques.

The model is estimated on a point-in-time aware, monthly panel of country-level data spanning from 2008 to 2025. The proposed framework consistently generates calibrated PD trajectories that rise ahead of observed default events, providing realistic early-warning signals. Empirical results show that the two-tier model materially improves calibration relative to logistic regression alone and outperforms benchmark early warning systems highlighted in the literature. These gains are evident in both in-sample estimation and out-of-sample testing, where the model enhances the sharpness of estimated risk signals and accelerates the detection of deteriorating credit conditions. At the same time, the framework demonstrates robustness against overfitting, offering a balance between predictive accuracy and interpretability that is essential for practical risk monitoring applications.

✦ ✦ ✦


1. Introduction

Accurately assessing the probability of default (PD) is central to sovereign credit risk analysis. Default events are rare but highly disruptive, carrying profound implications for investors, policymakers, and international financial stability. Effective PD models provide an early-warning mechanism: they translate a wide array of macroeconomic, fiscal, and financial indicators into forward-looking risk assessments. Although sovereign credit ratings remain widely used benchmarks, they are inherently coarse due to their banded structure and tend to lag market and fundamental signals of distress. This reinforces the case for systematic PD models that can deliver timely, quantitative assessments of risk. The challenge is to generate signals that are timely, reliable, granular and calibrated: capturing deteriorating conditions early enough to inform decisions while avoiding false alarms that erode confidence. In practice, this requires probability forecasts that both distinguish clearly between default and non-default episodes and enable meaningful comparisons of relative risk across countries and over time. A further benefit of our model lies in its objectivity: by letting the data speak, its assessment of a sovereign's vulnerability to Eurobond default is not clouded by cognitive biases. For example, recency bias may lead analysts to underestimate risks after long periods of stability, while confirmation bias can cause them to overweight information that supports prior views and discount emerging warning signs.

The academic and policy literature has developed substantially since the debt crises of the 1980s, with clear evidence of an evolving school of thought. Early studies viewed default primarily through the lens of solvency. Eaton and Gersovitz (1981) formalized sovereign borrowing as a repeated game, where repayment depended on debt burdens and the long-run incentive to maintain market access. Empirical work focused on fiscal and external indicators of solvency, culminating in the notion of “debt intolerance” (Reinhart, Rogoff & Savastano 2003) and long-horizon historical accounts (Reinhart & Rogoff 2009) that documented the recurrence of debt overhangs as drivers of crises.

By the mid-1990s, however, liquidity and rollover risk became central to crisis analysis. The Mexican (1994) and Asian (1997) crises demonstrated that sovereigns could fall into distress even when long-run solvency appeared intact. Manasse, Roubini and Schimmelpfennig (2003) introduced short-term debt, reserves, and debt service ratios as early-warning indicators. Catão and Milesi-Ferretti (2013) added to this by emphasising the importance of external balance sheets in detecting vulnerability. This work underscored that refinancing pressures and maturity mismatches could be just as critical as debt sustainability in explaining default risk.

The global financial crisis and European debt crisis further broadened the framework to encompass self-fulfilling expectations and systemic spillovers. Building on earlier theoretical contributions (Calvo 1988; Cole & Kehoe 2000), these episodes illustrated how market panic and coordination failures could trigger crises without clear solvency gaps. Empirical studies such as Beirne and Fratzscher (2013) and Kaminsky and Vega (2014) documented how sentiment-driven contagion and heightened sensitivity to fundamentals drove spreads during the Eurozone crisis. More recent syntheses (Ams, Baqir, Gelpern & Trebesch 2018; Mitchener & Trebesch 2023; World Bank 2022) emphasize that modern debt crises typically combine solvency constraints, liquidity fragilities, and expectation-driven runs within broader global financial cycles.

The evolution of modelling approaches has mirrored these conceptual shifts, with technological progress further expanding the scope for complex, data-intensive analysis. Logistic regression long served as the unofficial baseline of early-warning systems, valued for its interpretability and transparency (Peter 2002; Manasse et al. 2003). Yet its linear structure is inherently limited when capturing nonlinear interactions, threshold effects, and rare-event dynamics, such as when sovereign default risk arises from complex combinations of solvency, liquidity, and sentiment variables. Advances in computational power have progressively enabled the application of more demanding machine learning techniques, such as random forests, gradient boosting, and neural networks, to predicting sovereign default (Von Luckner, Horn, Kraay & Ramalho 2023; Platzer 2025). These methods exploit richer structures in the data and can deliver predictive gains, though their wider adoption remains constrained by enduring challenges of interpretability, probability stability, and calibration in imbalanced datasets with sparse default events.

Despite widespread monitoring of sovereign risk in both the academic and practitioner literature, existing tools rarely combine high-frequency updates with interpretability. Credit ratings provide transparent benchmarks but are coarse due to their banded structure, revised infrequently, and tend to lag fundamentals. Market-implied default probabilities respond continuously but are often distorted by sentiment-driven volatility. Other early-warning systems and sovereign stress trackers are typically produced at quarterly frequency, rely on interpolated annual data, and fail to incorporate timely monthly indicators such as reserves or trade flows from national sources. These limitations underscore the need for frameworks that are both systematic and transparent, yet capable of producing timely signals that reflect evolving conditions.

Against this backdrop, our contribution is a two-tier probability of default framework that integrates interpretability with flexibility. The first tier consists of a logistic regression model that generates stable, economically-grounded baseline estimates of sovereign default risk. The second tier is trained on the residuals of this baseline, employing gradient boosting to capture nonlinearities, interactions, and expectation-driven dynamics that the linear specification cannot explain. The idea of stacking was first introduced by Wolpert (1992) but has only recently been applied to default risk (Silva & Cortez, 2023; Sun et al., 2023). Our method broadly follows Ranglani (2025), which details how residual-aware stacking improves model performance.

This stacked design produces probability trajectories that are smoothly calibrated, comparable across countries, and capable of detecting deteriorating conditions in advance of crises. By combining transparent baseline signals with a machine-learning correction layer, the framework bridges the gap between academic early-warning research and the operational demands of sovereign risk monitoring. The model is best understood as a robust quantitative assessment of the fundamentals of an economy: while it systematises and quantifies early-warning signals, qualitative judgment remains essential to account for political shocks, policy decisions, or climatic events that cannot be fully captured by quantitative indicators.

✦ ✦ ✦


2. Methodology

Our modelling framework is designed to address the limitations of existing sovereign early-warning tools while remaining grounded in the established literature. Following our review of prior contributions, we examined a range of methodological approaches to sovereign probability of default modelling before settling on a two-tier framework. In this design, a baseline logistic regression model (Tier 1) produces stable, interpretable estimates of default probability, while a second-tier LightGBM model is trained on the residuals of the baseline to capture nonlinearities, interactions, and systematic variation that the linear specification cannot explain. This layered structure combines the transparency of traditional econometric models with the flexibility of modern machine learning, delivering probability trajectories that are both interpretable and sharper in their early-warning properties.

2.1 Dataset construction

The first step in operationalising this framework was the construction of a comprehensive dataset, drawing on multiple sources while ensuring cross-country comparability for each variable. Macroeconomic indicators were obtained from the IMF World Economic Outlook (WEO) vintages and the World Bank’s International Debt Statistics (IDS). Political risk measures were drawn from Transparency International and the World Bank’s Country Policy and Institutional Assessment. Financial and corporate sector aggregates and wellness indicators, covering banking system soundness and corporate sector performance and leverage, were sourced from S&P Global. Exchange rate competitiveness was captured using real effective exchange rate indices from Bruegel, and these were complemented by national source data and Tellimer’s proprietary datasets. Together, these data points formed a broad superset of candidate predictors, initially numbering over five hundred, reflecting the diverse channels identified in prior studies and the idiosyncratic nature of sovereign default in emerging and frontier markets. The default database was compiled in-house; a default event is dated to the missed payment, irrespective of grace-period behaviour, or to the announcement of a moratorium.

To ensure temporal validity, the dataset was constructed to be fully point-in-time aware: for each prediction date, only information available at that moment was included, thereby eliminating look-ahead bias. Annual macroeconomic and financial series were transformed into a monthly panel beginning in 2008, using forecast-aware alignment to incorporate both actual observations and successive WEO projections. Although IMF WEO data are reported annually and updated twice a year, we treat them as annual inputs when creating indicators while always using the most recently available vintage. This transformation allowed the model to capture not only the levels and trends of key indicators, but also their revisions, missed targets, and the evolving trajectory of forecasts. As a result, the dataset reflects both realized fundamentals and the shifting expectations about future conditions.
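As an illustration of this point-in-time discipline, the sketch below selects, for each monthly as-of date, only the latest vintage published on or before that date. It is a minimal sketch assuming pandas; the column names (`publication_date`, `country`, `year`, `value`) are hypothetical, not the paper's actual schema.

```python
import pandas as pd

def point_in_time_panel(vintages: pd.DataFrame,
                        monthly_index: pd.DatetimeIndex) -> pd.DataFrame:
    """For each monthly as-of date, keep only the latest vintage published
    on or before that date, eliminating look-ahead bias (illustrative)."""
    rows = []
    for asof in monthly_index:
        # Discard anything published after the as-of date.
        known = vintages[vintages["publication_date"] <= asof]
        # Within what remains, the latest vintage per country/year wins.
        latest = (known.sort_values("publication_date")
                       .groupby(["country", "year"], as_index=False)
                       .last())
        latest["asof"] = asof
        rows.append(latest)
    return pd.concat(rows, ignore_index=True)
```

An earlier as-of date therefore never "sees" a later revision, which is exactly the property needed for honest out-of-sample evaluation.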

2.2 Variable transformations

Considerable attention was devoted to transforming the indicators into forms most suitable for early-warning prediction. In addition to simple levels and growth rates, we generated momentum measures, rolling volatilities, and deviations from trend to reflect recent dynamics. These are incorporated through the application of historical z-scores, logarithmic and squared terms, and further transformations that stabilise distributions and capture nonlinear effects. Forecast revisions, projection errors and surprise data outturns were incorporated as indicators of shifting market perceptions and potential shocks, while multi-period averages of revisions were used to represent sustained trends. We initially explored dimensionality reduction techniques such as principal component analysis (PCA) to condense the feature space. However, this came at the cost of interpretability, which is critical for understanding sovereign risk. Instead, we implemented a two-stage variable selection strategy: family-wise AUC scoring to assess the discriminatory power of variable transformations, followed by Lasso regularization to enforce sparsity and retain a parsimonious set of predictors. This approach proved superior in terms of both predictive performance and transparency.
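A minimal sketch of the kind of transformations described, assuming pandas; the window lengths and column names here are illustrative choices, not the paper's actual specification:

```python
import pandas as pd

def transform_indicator(series: pd.Series, window: int = 36) -> pd.DataFrame:
    """Derive illustrative early-warning features from one monthly indicator:
    a rolling z-score (deviation from recent history), 12-month momentum,
    and rolling volatility of growth (names and windows hypothetical)."""
    out = pd.DataFrame(index=series.index)
    roll = series.rolling(window, min_periods=12)
    out["zscore"] = (series - roll.mean()) / roll.std()    # historical z-score
    out["momentum_12m"] = series - series.shift(12)        # 12-month change
    out["vol_12m"] = series.pct_change().rolling(12).std() # rolling volatility
    return out
```

Each raw indicator thus spawns a small "family" of transformations, which is the feature space the family-wise AUC screen and Lasso step then prune.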

2.3 Two-tier framework

With the refined dataset in place, we implemented the two-tier modelling framework. The first tier consists of a logistic regression model, which produces baseline PD estimates. This specification was chosen for its interpretability, allowing direct attribution of changes in estimated default risk to underlying macro-financial indicators. Variable selection was performed in a two-step process: first through family-wise AUC scoring to retain only the most informative transformations of each indicator, and subsequently through Lasso regularization to enforce sparsity and prevent overfitting. Observations within twelve months after a default event were excluded from training to avoid distortion from post-default dynamics.

The second tier is a Light Gradient Boosted Machine (LightGBM) model, applied to the residuals of the logistic regression. Its role is to capture systematic variation that the linear specification cannot, by modelling nonlinearities and higher-order interactions among predictors. LightGBM was selected due to its efficiency, robustness, and ability to handle high-dimensional feature spaces with built-in regularization. To maintain stability, residual outputs were clipped in logit space before being combined with the baseline estimates to produce the final corrected PDs.

The methodology can broadly be described as follows:

Initially, the target variable y is set equal to 1 if a default event occurs within the next twelve months, and 0 otherwise.

Next, we estimate a logistic regression that links our set of indicators x to the probability of a default event occurring: z = β₀ + β′x. This gives an interpretable linear baseline risk measure in logit space, which can then easily be transformed to a probability p = σ(z).

We then compute the residual, the difference between the observed outcome and the Tier-1 predicted probability: r = y − p̂. It captures what the logistic regression has missed and becomes the training signal for the second stage.

LightGBM is a sequence of regression trees that iteratively fit residuals from the logistic loss. At each step, a tree is trained on the current residuals and added back to improve the fit. Starting from the Tier-1 logit z, the model refines the baseline by learning a non-linear correction g(x), where g(⋅) is the LightGBM regression function and x is the vector of indicators.

At each boosting step m, compute the gradient residuals r_i^(m) = y_i − σ(z_i^(m−1)), which equal the negative gradient −∂ℓ/∂z of the logistic loss ℓ, where ℓ(y, z) = −[y·log σ(z) + (1 − y)·log(1 − σ(z))].

After M trees, the Tier-2 LightGBM logit correction is Δz = η·Σ_{m=1..M} g_m(x), where η is the learning rate and g_m is the m-th fitted tree.

The combined logit is the output from our initial regression plus the estimated residual correction from the LightGBM, z* = z + Δz. This can then be transformed to a probability using the sigmoid function p = σ(z*), where σ(z) = 1 / (1 + e^(−z)).

Finally, temperature scaling ensures that the final probabilities are well-calibrated: p_cal = σ(z*/T), where the temperature T is determined on a hold-out validation period by minimising the negative log-likelihood.

Model performance is evaluated along four broad dimensions. Calibration is assessed by comparing predicted probabilities with realized default frequencies, ensuring forecasts align quantitatively with observed outcomes. Discrimination is measured using ROC-AUC to verify the ability to rank higher-risk from lower-risk cases. Timeliness is captured by the lead time and persistence of signals before defaults, and stability by the smoothness of predicted trajectories over time. A detailed description of the crisis-specific evaluation metrics employed, such as recall, false alarms, noise-to-signal ratios, and conditional crisis probabilities, is provided in the following section.

✦ ✦ ✦


3. Model Evaluation Framework

Assessing the performance of an early-warning system requires more nuance than the conventional accuracy statistics applied to standard classification models. In sovereign default prediction, timing is as important as classification itself: policymakers and investors need advance notice of impending distress, but signals that arrive too early, too late, or too erratically diminish the model’s practical value. Following the approach taken by Nomura’s Damocles framework for predicting currency crises, and subsequent contributions such as Dawood, Horsewood and Strobel (2017), we employ a set of evaluation metrics designed to capture both detection accuracy and operational usability.

At the most basic level, early-warning model performance can be mapped into a confusion matrix distinguishing between true positives (defaults correctly preceded by alarms), false positives (alarms without defaults), false negatives (defaults missed entirely), and true negatives (quiet periods correctly left unflagged). From this matrix, two core measures are derived. The first is recall: the percentage of crises correctly signalled, which captures the model’s ability to detect actual defaults. The second is the false alarm rate, measuring the share of tranquil months incorrectly flagged. These two statistics are complemented by precision, or the conditional probability of default given a signal, which indicates the credibility of alarms from the perspective of an end-user.

Because sovereign defaults unfold over time rather than at a single point, we also evaluate the quality of signals within the pre-crisis window. We define this as the twelve months immediately preceding a default. Alarms that occur earlier than twelve months in advance may capture deteriorating fundamentals but are less actionable for decision-makers and therefore count as noise. Within the window, we measure the proportion of months correctly flagged as pre-crisis, which we refer to as true signals. This metric is deliberately stringent: achieving a high score requires the model to identify most or all of the twelve pre-crisis months without extending signals into earlier periods. Alongside this, we calculate the noise-to-signal ratio, defined as the ratio of false alarms to true signals, which provides a compact summary of how cleanly the model distinguishes crisis periods from normal times.
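These window-based metrics can be computed directly from two monthly panel arrays. The sketch below assumes the pre-crisis labels (1 in the twelve months before a default, 0 in tranquil months) have already been built from the default dates; the function and argument names are illustrative.

```python
import numpy as np

def early_warning_stats(alarm: np.ndarray, pre_crisis: np.ndarray) -> dict:
    """Signal-quality metrics over a monthly panel (illustrative sketch).
    alarm: 1 where the model flags risk; pre_crisis: 1 within the
    twelve-month pre-default window, 0 in tranquil months."""
    true_signal = alarm[pre_crisis == 1].mean()   # share of pre-crisis months flagged
    false_alarm = alarm[pre_crisis == 0].mean()   # share of tranquil months flagged
    return {
        "true_signal_pct": 100 * true_signal,
        "false_alarm_pct": 100 * false_alarm,
        "noise_to_signal": false_alarm / true_signal if true_signal > 0 else np.inf,
        # Precision: P(crisis | signal), the credibility of an alarm.
        "p_crisis_given_signal": 100 * pre_crisis[alarm == 1].mean()
                                 if alarm.any() else 0.0,
    }
```

Recall at the event level (the share of defaults preceded by at least one alarm) would be computed per episode rather than per month, but the month-level quantities above are the ones summarised by the noise-to-signal ratio.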

In addition to the occurrence of signals, their timeliness and persistence are central to practical utility. Lead time measures the average number of months between the first alarm and the default event. If lead times are too short, there is little room for policy or portfolio adjustment; if they are too long, signals risk being discounted or ignored. Persistence captures the average duration for which signals remain elevated in the run-up to a default. Persistent alarms reinforce credibility and give decision-makers confidence that the model is detecting a genuine deterioration rather than producing a one-off anomaly.

Finally, all the statistics should be interpreted against the unconditional probability of crisis in the sample, which in our dataset is only around 2.5 percent. This low base rate underscores the rarity of sovereign defaults and highlights the informational gain when the probability of default conditional on a signal rises above this underlying baseline. Even a modest-looking conditional probability of 25% represents a tenfold improvement relative to the baseline.

Taken together, this evaluation framework balances three objectives that are often at odds: maximising the detection of crises, minimising the incidence of false alarms, and ensuring that warnings are delivered with sufficient but not excessive lead time. It is this balance, rather than any single metric, that ultimately determines whether an early-warning system is operationally useful.

✦ ✦ ✦


4. Results

4.1 Baseline Logistic Regression

Initially, it is useful to assess our two baseline approaches: a traditional logistic regression and a machine learning model based on random forests.

The baseline logistic regression model identifies a core set of macroeconomic and financial indicators as the most important drivers of sovereign default risk. Variables pertaining to the dynamics and level of external account imbalances, interest on external debt, GDP growth, gross debt ratios, fiscal deficits, inflation, the real effective exchange rate, reserves-to-imports ratios and the debt-service ratio are all selected as significant. Moreover, the estimated coefficients carry signs consistent with economic intuition, and the inclusion of the selected indicators is supported by economic theory and prior literature, reinforcing the model’s interpretability.

In terms of performance, the logistic regression baseline achieves a ROC-AUC of 0.84. Out-of-sample, it correctly signals 57.1% of crises while generating very few false alarms (0.3%), yielding a noise-to-signal ratio of 0.02. This makes the model precise but limited in recall: crises are often missed, and the true-signal rate is only 17.0%, meaning that it flags relatively few of the months in the critical pre-default window.

The random forest baseline provides a useful contrast. As expected, its greater flexibility improves detection somewhat, with out-of-sample recall rising to 64.3% and the true-signal rate increasing to 23.9%. False alarms remain low at 0.8%, and the noise-to-signal ratio is 0.03. These results highlight that non-linear machine learning models can capture additional structure missed by logistic regression, but the improvement remains modest. Moreover, the random forest lacks interpretability: while it can highlight variable importance rankings, the relationships between drivers and default risk are less transparent than under the regression specification. This trade-off between modest performance gains and reduced transparency underscores the case for a hybrid approach.

4.2 Two-Tier Model with Residual Correction

We next assess the two-tier framework, which augments the logistic regression with a LightGBM residual correction layer. This design aims to combine the interpretability of the regression with the ability of machine learning to capture residual non-linearities and interactions between variables.

The two-tier model achieves a ROC-AUC of 0.89, higher than either baseline. Out-of-sample recall rises to 85.7%, meaning 12 of the 14 defaults in this window were signalled in the twelve months before default. False alarms increase relative to logistic regression, reaching 2.3%, but remain low in absolute terms and well below the levels reported for several benchmark models in the literature. The noise-to-signal ratio of 0.06 indicates that, despite the increase in recall, the system maintains relatively clean signalling.

Figure 1: Comparison plot of ROC scores


The model also performs well on measures of timeliness and consistency. The average lead time is 4.5 months, indicating that alarms typically precede defaults with a meaningful window for policy or market response, while persistence averages 4 months, implying that signals are generally maintained once default risk is first flagged. Lead time and persistence are lower than in the in-sample analysis (8.1 and 7.5 months, respectively); however, extreme shocks such as COVID-19 and the Russia–Ukraine war accelerated the onset of crises in the test period, which helps to explain this apparent deterioration in performance. The true-signal rate of 39.6% is notably higher than that of either baseline model, indicating greater consistency in capturing deteriorating conditions during the twelve months prior to default.

Across the full sample, the model signalled 26 of 28 historical defaults. The two misses were Zambia in 2020 and Ethiopia in 2023. In Zambia, key metrics such as debt-to-GDP were underreported in the years leading up to the crisis and only revised ex-post, which helps explain why the model's estimated PD peaked at just 36% in the twelve months before the crisis. In Ethiopia, risk was flagged 18 months in advance but not during the immediate twelve months before default, so this is also counted as a miss. These cases illustrate both the broad strength of the framework in capturing gradual deteriorations in fundamentals and the inherent limitations of any quantitative model: it is only as good as its input data, and it cannot anticipate events driven largely by political negotiations or abrupt shifts in creditor behaviour.

Table 1: Full sample model evaluation

| Metric | Tellimer (threshold 0.5) | DHS * (2017) | DCSD ** | KLR *** |
| --- | --- | --- | --- | --- |
| Number of crises | 28 | 48 | - | - |
| % crises correctly signalled | 92.9 | 87.1 | 63 | 60 |
| True signals (%) | 69.2 | 80 | - | - |
| False alarms (%) | 1.8 | 8.6 | 17.7 | 23.4 |
| Noise-to-signal ratio | 0.03 | 0.10 | - | - |
| P(crisis given signal) (%) | 43.3 | - | 37 | 29 |
| Unconditional P(crisis) (%) | 2.5 | - | - | - |
| Lead time (avg months) | 8.1 | - | - | - |
| Persistence (avg months) | 7.5 | - | - | - |


Table 2: Out-of-sample model evaluation

| Metric | Two-Tier Model (0.5 threshold) | Logistic Baseline | Machine Learning (RF) | Damocles **** |
| --- | --- | --- | --- | --- |
| Number of crises | 14 | 14 | 14 | 61 |
| % crises correctly signalled | 85.7 | 57.1 | 64.3 | 63.9 |
| True signals (%) | 39.6 | 17.0 | 23.9 | 40.2 |
| False alarms (%) | 2.3 | 0.3 | 0.8 | 11.4 |
| Noise-to-signal ratio | 0.06 | 0.02 | 0.03 | 0.28 |
| P(crisis given signal) (%) | 26.8 | 55.1 | 38.0 | 22.5 |
| Unconditional P(crisis) (%) | 2.5 | 2.5 | 2.5 | 7.6 |
| Lead time (avg months) | 4.5 | 4.3 | 5.4 | 9.7 |
| Persistence (avg months) | 4.0 | 3.7 | 4.0 | 7.4 |
* DHS refers to the Dawood, Horsewood & Strobel (2017) model.
** DCSD refers to the Developing Country Studies Division IMF framework.
*** KLR refers to the Kaminsky, Lizondo & Reinhart model used by the IMF, first published in 1998.
**** Damocles is Nomura’s public model that attempts to forecast currency crises. Currency crises and sovereign defaults have been shown to be highly correlated, as first highlighted in Reinhart (2002). Comparisons should nevertheless be interpreted cautiously; Damocles is included here because it is a currently industry-accepted model that reports monthly estimates and the same evaluation statistics as ours.


4.3 Sensitivity Analysis

Performance varies depending on the alarm threshold selected. At a threshold of 0.3, recall rises to 92.9%, while average lead time and persistence extend to 6.2 months and 5.8 months, respectively. However, false alarms also increase to 5.5%, and conditional precision declines to 17.3%. At the balanced threshold of 0.5, recall is 85.7%, false alarms fall to 2.3%, and conditional precision improves to 26.8%. At the stricter threshold of 0.6, recall decreases to 78.6% and persistence shortens to 3.6 months, but false alarms fall further to 1.0%, and conditional precision rises to 41.9%.

These results illustrate the trade-offs inherent in early-warning design. A lower threshold improves detection but increases noise, while a higher threshold reduces noise but risks missing a larger share of crises. Different thresholds may therefore be appropriate for different applications. For example, some practitioners might prefer more lead time for a low cost in the noise-to-signal ratio, while others, who prioritise the credibility of each individual alarm may prefer higher thresholds.
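The detection-versus-noise trade-off can be reproduced with a simple sweep over alarm cut-offs; an illustrative sketch (the array and function names are hypothetical):

```python
import numpy as np

def sweep_thresholds(p: np.ndarray, pre_crisis: np.ndarray,
                     thresholds=(0.3, 0.4, 0.5, 0.6)):
    """For each alarm threshold, report the share of pre-crisis months
    flagged (true signals) and of tranquil months flagged (false alarms)."""
    rows = []
    for t in thresholds:
        alarm = (p >= t).astype(int)           # alarm fires when PD crosses t
        rows.append({
            "threshold": t,
            "true_signal_pct": 100 * alarm[pre_crisis == 1].mean(),
            "false_alarm_pct": 100 * alarm[pre_crisis == 0].mean(),
        })
    return rows
```

Lowering the threshold monotonically raises both columns, which is exactly the pattern visible in the sensitivity table: more detection bought with more noise.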

Table 3: Sensitivity analysis for two-tier model (out-of-sample)

| Metric | Threshold 0.3 | Threshold 0.4 | Threshold 0.5 | Threshold 0.6 |
| --- | --- | --- | --- | --- |
| Number of crises | 14 | 14 | 14 | 14 |
| % crises correctly signalled | 92.9 | 92.9 | 85.7 | 78.6 |
| True signals (%) | 54.7 | 41.5 | 39.6 | 34.0 |
| False alarms (%) | 5.5 | 3.6 | 2.3 | 1.0 |
| Noise-to-signal ratio | 0.10 | 0.09 | 0.06 | 0.03 |
| P(crisis given signal) (%) | 17.3 | 19.4 | 26.8 | 41.9 |
| Unconditional P(crisis) (%) | 2.5 | 2.5 | 2.5 | 2.5 |
| Lead time (avg months) | 6.2 | 4.8 | 4.5 | 4.4 |
| Persistence (avg months) | 5.8 | 4.2 | 4.0 | 3.6 |


4.4 Comparison with Market-Implied Probabilities

To place these results in context, we compare Tellimer’s model-generated PDs against sovereign bond spreads from S&P Global. Readers can translate spreads into market-implied default probabilities by applying a simple scaling factor derived from the standard hazard-rate formulation, under which the spread compensates for expected loss: spread ≈ hazard × (1 − recovery rate), so the market-implied default probability is approximately spread / (1 − recovery rate).

Using this formula, a recovery rate of fifty percent implies that spreads should be doubled to obtain a comparable market-implied probability of default. The scale factor can be adjusted to reflect different assumptions about expected recovery rates.
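As a worked example of this scaling, a minimal sketch assuming the standard linear hazard-rate approximation, under which PD ≈ spread / (1 − recovery); the function name is illustrative:

```python
def market_implied_pd(spread_bps: float, recovery: float = 0.5) -> float:
    """Convert a bond spread (in basis points) into an approximate annualised
    market-implied default probability via spread ≈ hazard × (1 − recovery)."""
    spread = spread_bps / 10_000          # basis points to decimal
    return spread / (1.0 - recovery)      # implied default intensity
```

With the fifty percent recovery assumption, a 600bp spread maps to an implied PD of 12%, i.e. the spread doubled; a higher assumed recovery scales the factor up accordingly.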

Two key insights emerge from this comparison.

First, the model can signal sovereign stress earlier than markets, underscoring its role as an early-warning system. In Sri Lanka’s 2022 and Argentina’s 2020 defaults, model PDs rose sharply months before spreads began their sustained repricing. Similar lead-lag patterns are also evident in Ukraine’s 2014 and Lebanon’s 2020 defaults.

Second, the model can provide a stabilizing perspective when markets overshoot. For example, while Argentina’s spreads spiked to extreme levels in 2024, the model PD moderated after signalling stress, reflecting underlying fundamentals rather than short-term volatility. This highlights the advantage of smoother, quantitatively grounded PD estimates that are less vulnerable to transitory market noise.

Figure 2 compares Sri Lanka’s probability of default, as estimated by Tellimer’s two-tier model, with the spread of its 10-year sovereign bond. Sri Lanka defaulted in April 2022, after which bond spread data are no longer available.

Figure 2: Model PD vs Spread (Sri Lanka)


Up to late 2020, the model PD and bond spreads track one another closely. However, from early 2021 the model PD accelerates sharply, flagging an elevated and persistent risk of default nearly a year before spreads began their sustained widening in late 2021. This anticipatory signal demonstrates the model’s ability to provide early warning beyond market-implied measures. Importantly, these results are generated out-of-sample, demonstrating robustness against overfitting and reinforcing the model’s forward-looking capacity rather than replicating market pricing.

Figure 3 compares Argentina’s probability of default, as estimated by Tellimer’s two-tier model, with the spread of its 10-year sovereign bond. Argentina experienced two defaults during the sample period, in June 2014 and May 2020, but despite elevated spreads in 2024 did not default.

Figure 3: Model PD vs Spread (Argentina)


The model captures both historical defaults well. In the run-up to June 2014, albeit as part of the in-sample analysis, the model PD rose markedly ahead of the default date, reflecting deteriorating fundamentals that culminated in default. Out-of-sample, a similar pattern is visible in 2019–2020, when the model PD accelerated sharply in advance of the May 2020 restructuring, again preceding a major market repricing.

More recently, in 2024, the model PD rose to elevated levels but subsequently moderated, signalling stress without projecting an inevitable default. By contrast, spreads spiked to extreme levels, potentially overreacting to short-term liquidity and market sentiment. This illustrates the model’s value in distinguishing between genuine default risk and market overshoot.

Importantly, as with Sri Lanka, these results are consistent between in-sample and out-of-sample analysis. They demonstrate robustness against overfitting and reinforce the model’s ability to provide forward-looking credit risk assessments independently of market pricing, validating our design choice to exclude market data from the input set.

4.5 Comparative Assessment

Whilst comparisons should be interpreted with caution, given differences in datasets, sample periods, and crisis definitions across studies, it is useful to benchmark these figures against the literature. Out-of-sample comparisons highlight the strength of this approach. As shown in Table 1, the two-tier model correctly signals 85.7% of crises with a false alarm rate of only 2.3%. This recall rate is materially higher than both the logistic baseline (57.1%) and a pure machine learning benchmark (Random Forest, 64.3%), while maintaining a lower noise-to-signal ratio (0.06) than comparative studies in the literature. Nomura’s Damocles model, which predicts currency crises, achieves 63.9% recall with a noise-to-signal ratio of 0.28, against our 85.7% and 0.06 respectively.
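These headline metrics all derive from a standard early-warning confusion matrix. As a minimal sketch, the snippet below computes recall, the false alarm rate, and a noise-to-signal ratio in the Kaminsky, Lizondo and Reinhart (1998) sense (false alarm rate divided by hit rate); the function name and toy inputs are illustrative, and the paper’s exact variable definitions may differ.

```python
import numpy as np

def early_warning_metrics(y_true, alarm):
    """Confusion-matrix metrics for a binary early-warning signal.

    y_true : 1 where a crisis occurred within the forecast horizon, else 0.
    alarm  : 1 where the model PD exceeded the alarm threshold, else 0.
    """
    y_true = np.asarray(y_true)
    alarm = np.asarray(alarm)
    hits = np.sum((alarm == 1) & (y_true == 1))          # crises signalled
    misses = np.sum((alarm == 0) & (y_true == 1))        # crises missed
    false_alarms = np.sum((alarm == 1) & (y_true == 0))  # alarms in calm periods
    quiet = np.sum((alarm == 0) & (y_true == 0))         # correctly quiet

    recall = hits / (hits + misses)              # share of crises signalled
    far = false_alarms / (false_alarms + quiet)  # false alarm rate
    nsr = far / recall                           # noise-to-signal ratio (KLR-style)
    return recall, far, nsr
```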

Our results also compare favourably with the wider early-warning literature. Ciarlone and Trebeschi (2005) correctly predicted 72% of crises while generating 36% false signals. Pescatori and Sy (2007) achieved higher sensitivity (86%) but at a 14% false alarm rate, while Manasse et al. (2003) reduced false signals to 5% but detected a lower share of crises (75%). Savona and Vezzoli (2015), the only study to include developed economies, reached a hit rate of 77% with 16% false alarms. Against this backdrop, the two-tier model’s out-of-sample performance represents a notable advance.

Studies that report only in-sample metrics provide further perspective. Dawood, Horsewood & Strobel (2017) report 87.1% recall but at the cost of much higher false alarms (8.6%), while older IMF baseline frameworks, the DCSD (Developing Country Studies Division) framework of Berg et al. (2004) and KLR (Kaminsky, Lizondo & Reinhart, 1998), correctly identify 63% and 60% of crises with false alarm rates of 17% and 23%, respectively. By contrast, the two-tier model achieves recall rates above 90% at looser thresholds (0.4) while keeping false alarms well below 6%, offering a stronger balance between sensitivity and specificity.

Finally, it is informative to frame performance in terms of a single efficiency measure, as sometimes used in the IMF’s crisis probability models. Basu, Chamon, and Crowe (2017) report a combined error rate (missed crises + false alarms) of 0.30. By comparison, our two-tier framework achieves a combined error of just 0.17 in out-of-sample testing, underscoring its robustness and efficiency in balancing recall with signal credibility.
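Under the assumption that the combined error is simply the missed-crisis rate plus the false-alarm rate, the figure follows directly from the out-of-sample statistics reported above; the one-line helper below is illustrative.

```python
def combined_error(recall, false_alarm_rate):
    """Missed-crisis rate plus false-alarm rate, as a single efficiency score."""
    return (1.0 - recall) + false_alarm_rate

# With 85.7% recall and a 2.3% false alarm rate: (1 - 0.857) + 0.023 = 0.166
score = combined_error(0.857, 0.023)
```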

In sum, the comparative evidence shows that the two-tier model consistently matches or outperforms traditional regression approaches, competitive machine learning models, practitioner benchmarks, and much of the academic literature. Its ability to provide early, reliable, and interpretable signals makes it well suited for operational use in sovereign risk monitoring.

✦ ✦ ✦


5. Conclusion

This study has proposed a two-tier probability of default (PD) framework that combines the interpretability of logistic regression with the flexibility of gradient-boosted trees. The design addresses a dual challenge of sovereign risk modelling: the need for transparent, economically grounded signals on the one hand, and the ability to capture idiosyncratic default dynamics on the other. By deploying logistic regression as the baseline model and LightGBM as a residual correction layer, the framework retains interpretability while improving quantitative performance relative to the baseline.

The dataset employed in this study integrates a broad set of predictors, updated monthly from national sources when applicable, which are standardized and incorporated into the model to provide a timely and clear view of country-level fundamental risk. In combination with a modelling architecture that builds on recent methodological advances, the framework achieves consistently strong predictive accuracy in identifying sovereign defaults in emerging and frontier markets. Importantly, the approach allows users to observe which indicators are driving changes in estimated default probabilities, thereby combining interpretability with high predictive performance.

The empirical analysis shows that the two-tier model improves on both logistic regression and a random forest baseline. Out-of-sample ROC-AUC rises from 0.84 for the logistic model to 0.89 under the two-tier framework, and recall increases from 57.1 percent (logistic) and 64.3 percent (random forest) to 85.7 percent. At the same time, false alarms remain limited at 2.3 percent, resulting in a noise-to-signal ratio of 0.06. The model also achieves relatively favourable timeliness and persistence, with alarms typically appearing around 4.5 months before default and remaining elevated across one-third of the pre-default window. The decline relative to in-sample results is partly attributable to the sudden emergence of multiple crises in the test period, most prominently the COVID-19 shock and economic dislocations associated with the Russia-Ukraine war. In a monthly early-warning evaluation, 26 out of 28 defaults were signalled in the year prior to the event. The exceptions, Zambia in 2020 and Ethiopia in 2023, illustrate the difficulty of detecting defaults driven by political or negotiated factors rather than sustained macro-financial deterioration. They also highlight a more general limitation: model performance is constrained by the quality and timeliness of the underlying data. Despite incorporating forecasts and their revisions to capture evolving expectations, data revisions that occur only after a crisis cannot be anticipated ex ante.
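The timeliness and persistence statistics can be computed along the following lines: given a monthly PD series and a known default month, measure the gap between the first alarm in the pre-default window and the default, and the share of window months with the alarm active. The function below is a sketch under those assumptions, not the paper’s exact evaluation code.

```python
import numpy as np

def lead_time_and_persistence(pd_series, default_month, threshold=0.5, window=12):
    """Lead time (months before default of the first alarm) and persistence
    (share of the pre-default window with the alarm active)."""
    pd_series = np.asarray(pd_series, dtype=float)
    start = max(default_month - window, 0)
    pre = pd_series[start:default_month]   # pre-default evaluation window
    alarms = pre >= threshold
    if not alarms.any():
        return None, 0.0                   # default missed entirely
    first = int(np.argmax(alarms))         # index of the first alarm
    lead = len(pre) - first                # months between first alarm and default
    persistence = float(alarms.mean())
    return lead, persistence
```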

Taken together, the findings indicate that the two-tier PD framework offers improvements over commonly used approaches, notwithstanding the caution in comparing different approaches. The framework delivers calibrated and relatively stable probability estimates, improves detection and consistency relative to baselines, and retains the interpretability needed for practical use. For practitioners, it provides a structured tool to monitor sovereign risk in real time. For researchers, it points to avenues for further development, including the systematic integration of market-implied measures, political or conflict projections, or the further exploration of alternative ensemble design and architecture.


✦ ✦ ✦

Bibliography

Ams, J., Baqir, R., Gelpern, A., & Trebesch, C. (2018). Sovereign debt: A guide for economists and practitioners [IMF Working Paper No. 18/XX]. International Monetary Fund.

Basu, R., Chamon, M., & Crowe, C. (2017). Sovereign debt crises: Fundamentals vs. market sentiments in IMF crisis probability models [IMF Working Paper].

Beirne, J., & Fratzscher, M. (2013). The pricing of sovereign risk and contagion during the European sovereign debt crisis. Journal of International Money and Finance, 34, 60–82. https://doi.org/10.1016/j.jimonfin.2012.11.004

Calvo, G. A. (1988). Servicing the public debt: The role of expectations. The American Economic Review, 78(4), 647–661.

Catão, L. A. V., & Milesi-Ferretti, G. M. (2013). External liabilities and crises. IMF Working Paper No. 13/113. International Monetary Fund.

Ciarlone, A., & Trebeschi, G. (2005). Designing an early warning system for debt crises. Emerging Markets Review, 6(4), 376–395. https://doi.org/10.1016/j.ememar.2005.09.003

Cole, H. L., & Kehoe, T. J. (2000). Self-fulfilling debt crises. The Review of Economic Studies, 67(1), 91–116. https://doi.org/10.1111/1467-937X.00122

Dawood, M., Horsewood, N., & Strobel, F. (2017). Predicting sovereign debt crises: An early warning system approach. Journal of Financial Stability, 28, 16–28. https://doi.org/10.1016/j.jfs.2016.11.005

Eaton, J., & Gersovitz, M. (1981). Debt with potential repudiation: Theoretical and empirical analysis. The Review of Economic Studies, 48(2), 289–309. https://doi.org/10.2307/2296886

Kaminsky, G., Lizondo, S., & Reinhart, C. M. (1998). Leading indicators of currency crises. IMF Staff Papers, 45(1), 1–48. https://doi.org/10.2307/3867328

Kaminsky, G., & Vega, C. (2014). Systemic and idiosyncratic sovereign debt crises. Journal of the European Economic Association, 12(1), 219–244. https://doi.org/10.1111/jeea.12057

Manasse, P., Roubini, N., & Schimmelpfennig, A. (2003). Predicting sovereign debt crises. IMF Working Paper No. 03/221. International Monetary Fund.

Mitchener, K. J., & Trebesch, C. (2023). Sovereign debt in the 21st century. Journal of Economic Literature, 61(3), 813–878. https://doi.org/10.1257/jel.20221861

Nomura. (n.d.). The Damocles early warning framework [Research report]. Nomura Global Economics.

Pescatori, A., & Sy, A. N. R. (2007). Are debt crises adequately defined? IMF Staff Papers, 54(2), 306–337.

Peter, M. (2002). Estimating default probabilities of emerging market sovereigns: A new look at the sovereign ceiling. Swiss National Bank Working Paper.

Platzer, R. (2025). The role of increased computational power and machine learning in advancing sovereign default early warning systems. Unpublished manuscript.

Ranglani, H. (2025). Residual aware stacking: A novel approach for improved machine learning model performance (SSRN Scholarly Paper No. 5160281). Social Science Research Network. https://doi.org/10.2139/ssrn.5160281

Reinhart, C. M. (2002). Default, currency crises, and sovereign credit ratings. World Bank Economic Review, 16(2), 151–170. https://doi.org/10.1093/wber/16.2.151

Reinhart, C. M., Rogoff, K. S., & Savastano, M. A. (2003). Debt intolerance. Brookings Papers on Economic Activity, 2003(1), 1–74.

Reinhart, C. M., & Rogoff, K. S. (2009). This time is different: Eight centuries of financial folly. Princeton University Press.

Savona, P., & Vezzoli, M. (2015). Monitoring sovereign financial distress with composite indicators. International Journal of Finance & Economics, 20(2), 163–177. https://doi.org/10.1002/ijfe.1502

Silva, T., & Cortez, M. (in preparation). Stacking methods for sovereign default early warning models.

Sun, T., Xiong, Y., & Yu, J. (in preparation). Machine learning stacking applications for sovereign default prediction.

Von Luckner, M., Horn, S., Kraay, A., & Ramalho, R. (2023). Sovereign default and debt distress: A new database. World Bank Policy Research Working Paper No. 10158.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. https://doi.org/10.1016/S0893-6080(05)80023-1

World Bank. (2022). International debt statistics 2022. World Bank Publications. https://doi.org/10.1596/978-1-4648-1786-0

Nick Lindsay is a quantitative analyst specialising in statistical modelling for emerging and frontier markets. He previously worked as a consultant within a data research consortium led by the World Bank Data Group and Cornell University, where he developed innovative big data methods and presented his research at conferences across Europe, Africa and Asia. He holds a master’s degree in Economics from Oxford University.