Mitigating Persistent Client Dropout in Asynchronous Decentralized Federated Learning

Carnegie Mellon University
FedKDD @ KDD 2025

Abstract

We consider the problem of persistent client dropout in asynchronous Decentralized Federated Learning (DFL). Asynchronicity and decentralization obfuscate information about model updates among federation peers, making recovery from a client dropout difficult. Access to the number of learning epochs, data distributions, and all the information necessary to precisely reconstruct the missing neighbor's loss functions is limited. We show that obvious mitigations do not adequately address the problem and introduce adaptive strategies based on client reconstruction. We show that these strategies can effectively recover some of the performance loss caused by dropout. Our work focuses on asynchronous DFL with local regularization and differs substantially from settings considered in the existing literature. We evaluate the proposed methods on tabular and image datasets, covering three DFL algorithms and three data heterogeneity scenarios (iid, non-iid, and class-focused non-iid). Our experiments show that the proposed adaptive strategies can be effective in maintaining the robustness of federated learning, even if they do not reconstruct the missing client's data precisely. We also discuss the limitations and identify future avenues for tackling the problem of client dropout.

Decentralized Learning

In decentralized federated learning, clients train their models on their local data silos while regularizing toward the models of the neighbors they have access to. All clients collaborate toward a common goal: attaining the highest possible performance while converging to a single consensus model.
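The local update described above can be sketched as each client taking gradient steps on its own loss plus a proximal term that pulls its model toward its neighbors' models. The toy example below uses linear-regression clients; the function name `local_step`, the regularization weight `lam`, and the learning rate are our illustrative assumptions, not the specific update rules of the DFL algorithms (e.g., DFedAvgM) evaluated in the paper.

```python
import numpy as np

def local_step(w, X, y, neighbor_models, lam=0.5, lr=0.1):
    """One local update with neighbor regularization (illustrative sketch).

    Takes a gradient step on the client's squared loss plus a proximal
    term (lam/2)*||w - w_j||^2 for each neighbor model w_j.
    """
    n = len(y)
    grad_local = X.T @ (X @ w - y) / n                        # gradient of local loss
    grad_reg = sum(lam * (w - wj) for wj in neighbor_models)  # pull toward neighbors
    return w - lr * (grad_local + grad_reg)

# Toy two-client federation: each client regularizes toward the other's model
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0])
w_a, w_b = np.zeros(2), np.ones(2)
for _ in range(500):
    w_a, w_b = (local_step(w_a, X, y, [w_b]),
                local_step(w_b, X, y, [w_a]))
```

After enough rounds, both clients fit the data well and their models agree, illustrating the consensus behavior the paragraph describes.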

[Figure: a decentralized federation]

Client failure scenario

Imagine that one of the agents participating in the federation has been permanently dropped from the system. In scenarios like this, whenever that client had some unique knowledge about the global data distribution (e.g., data on clients is non-iid), the performance of the system can drop significantly.

[Figure: a decentralized federation with a dropped client]

So what can we do about it?

What does a gradient inversion attack look like in a simple 2D case?

[Animation: gradient inversion optimization in 2D]
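The animation above depicts an iterative, optimization-based reconstruction. In the simplest setting, a linear model with a bias term, gradient inversion even admits a closed form: the gradient with respect to the bias equals the residual, so dividing the weight gradient by it recovers the input exactly. The sketch below is illustrative only (the names and the 2D setup are ours, not the reconstruction procedure from the paper).

```python
def grads(w, b, x, y):
    """Gradients of the loss 0.5*(w.x + b - y)^2 for a 2D linear model with bias."""
    r = w[0]*x[0] + w[1]*x[1] + b - y   # residual
    return [r * x[0], r * x[1]], r      # (grad w.r.t. w, grad w.r.t. b)

# A "victim" data point and the gradients a peer could observe
w, b = [0.8, -0.3], 0.1
x_true, y_true = [1.2, -0.7], 0.4
g_w, g_b = grads(w, b, x_true, y_true)

# Inversion: since g_w = r*x and g_b = r, dividing recovers x exactly
x_rec = [g_w[0] / g_b, g_w[1] / g_b]
```

In higher dimensions or with deeper models no such closed form exists, which is why attacks instead run the kind of iterative gradient-matching optimization shown in the animation.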

Federation performance after reconstruction

The following figures show the test performance of the DFedAvgM algorithm on the Digits dataset under different client data distributions. Client dropout occurs at the 10th communication round.

IID (uniform distribution)

[Figure: DFedAvgM on Digits, iid]

Non-iid (clusters)

[Figure: DFedAvgM on Digits, non-iid clusters]

Non-iid (extreme class imbalance)

[Figure: DFedAvgM on Digits, extreme class imbalance]

The results show that the proposed adaptive strategies can effectively recover some of the performance loss caused by dropout, even if they do not reconstruct the missing client's data precisely. The system's performance is significantly better than with the simple baselines. Note that this is especially true in the non-iid settings, because the missing client's data carries unique knowledge about the global data distribution.

More details and experiments can be found in the paper.

BibTeX


@article{stepka2025,
  author  = {Ignacy St\k{e}pka and Nicholas Gisolfi and Kacper Tr\k{e}bacz and Artur Dubrawski},
  title   = {Mitigating Persistent Client Dropout in Asynchronous Decentralized Federated Learning},
  journal = {FedKDD Workshop at SIGKDD},
  year    = {2025},
}