Pruning_by_Block_Benefit

Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation


Vision Transformers have set new benchmarks in several tasks, but these models come with high computational costs, which makes them impractical for resource-limited hardware. Network pruning reduces computational complexity by removing less important operations while maintaining performance. However, pruning a pretrained model on an unseen data domain leads to a misevaluation of weight significance, resulting in an unfavourable resource assignment. To address the issue of efficient pruning on transfer learning tasks, we propose Pruning by Block Benefit (P3B), a method to globally assign a given parameter budget depending on the relative contribution of individual blocks. Our proposed method identifies lately converged components in order to rebalance the global parameter resources. Furthermore, our findings show that the order in which layers express valuable features on unseen data depends on the network depth, leading to a structural problem in pruning.

This GitHub repository contains the code for our paper: Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation


Motivation

Common pruning strategies often fail in transfer learning because prunable elements are removed before the model has converged to the target domain: task-sensitive elements only become important once the model has adapted, so early pruning misevaluates weight significance. This premature parameter elimination compromises the resulting model structure and harms performance.

Our work points out an overlooked aspect of pruning regarding model depth. As visualized in the figure below, deeper layers converge later in training, which harms early pruning decisions. Therefore, P3B establishes block-specific pruning rates based on the relative performance of each block. This approach effectively identifies lately-converging blocks while guaranteeing the reactivation of pruned elements whenever the overall block gains performance.

Figure 1: The relative feature improvement on the classification token (upper row) and patches (bottom row) for individual Attention and MLP blocks is depth dependent. Deeper layers express features only in later epochs.


Method

In this work we propose the novel pruning framework Pruning by Block Benefit (P3B) to balance the global parameter resources depending on the feature improvement of Attention and MLP blocks. The approach decouples the inter-block pruning ratio from the intra-layer element elimination, ensuring their respective criteria are assessed independently. As illustrated in Figure 2, P3B determines a block-wise parameter budget using the Block Performance Indicator $\Delta\Psi_i$, which measures the relative feature improvement of each block. This layer-wise sparsity constraint then guides the generation of the pruning mask via local pruning criteria. P3B is a highly performant pruning framework that accounts for the structural change of the model during domain adaptation.
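The two stages described above, a global block-wise budget followed by local mask generation, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function names `allocate_block_budgets` and `local_pruning_mask`, the proportional allocation scheme, and the use of plain magnitude scores as the local criterion are all hypothetical stand-ins; the actual Block Performance Indicator $\Delta\Psi_i$ and pruning criteria are defined in the paper.

```python
import torch

def allocate_block_budgets(delta_psi: torch.Tensor, global_keep_ratio: float) -> torch.Tensor:
    """Distribute a global parameter budget across blocks in proportion to
    their (non-negative) relative feature improvement scores.

    delta_psi: one score per block (stand-in for the paper's Block
    Performance Indicator); global_keep_ratio: fraction of parameters
    to keep across the whole model."""
    scores = torch.clamp(delta_psi, min=0.0)
    if scores.sum() == 0:
        # no block shows improvement: fall back to a uniform allocation
        weights = torch.full_like(delta_psi, 1.0 / len(delta_psi))
    else:
        weights = scores / scores.sum()
    # scale so the mean per-block keep ratio matches the global budget;
    # clipping at 1.0 means the realized budget can fall slightly below
    # the target when a block would receive more than 100% of its weights
    keep = weights * global_keep_ratio * len(delta_psi)
    return keep.clamp(max=1.0)

def local_pruning_mask(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Binary mask keeping the top keep_ratio fraction of weights by
    magnitude (a simple local criterion used here for illustration)."""
    k = max(1, int(keep_ratio * weight.numel()))
    # k-th largest magnitude = (numel - k + 1)-th smallest
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()
```

In this sketch, a block with a strongly negative indicator receives a keep ratio of zero, while a lately-converging block whose indicator rises between pruning steps automatically receives a larger budget, mirroring the reactivation behaviour described above.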

Figure 2: Block diagram of P3B. We measure the relative performance of each block to set a per-block parameter budget.

As shown in the following table, P3B significantly outperforms existing pruning methods.

| model | method | pruned | IFOOD pr=50% | IFOOD pr=75% | INAT19 pr=50% | INAT19 pr=75% |
|:-----------|:---------------|:------:|:--------:|:--------:|:--------:|:--------:|
| Deit-Small | Deit | ✗ | 73.9 | 73.9 | 74.7 | 74.7 |
| | WD-Prune | ✓ | 50.7 | 49.2 | 55.6 | 54.0 |
| | SaVit | ✓ | 72.4 | 64.4 | 71.3 | 68.0 |
| | **P3B (ours)** | ✓ | **74.3** | **73.4** | **75.5** | **73.1** |
| Deit-Tiny | Deit | ✗ | 72.7 | 72.7 | 72.6 | 72.6 |
| | WD-Prune | ✓ | 50.2 | 44.7 | 54.8 | 46.7 |
| | SaVit | ✓ | 65.7 | 59.5 | 64.1 | 45.3 |
| | **P3B (ours)** | ✓ | **71.5** | **68.6** | **69.3** | **61.4** |

(Unpruned Deit baselines do not depend on the pruning ratio; their accuracy is repeated across both `pr` columns.)



Citation

If you use this code in your research, please cite the following paper:

@inproceedings{
  glandorf2025p3b,
  title={Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation},
  author={Patrick Glandorf and Bodo Rosenhahn},
  booktitle={International Conference on Computer Vision Workshop},
  year={2025}
}