Accelerating the Run-Time of Convolutional Neural Networks through Weight Pruning and Quantization
In: The 8th International Engineering Conference on Renewable Energy & Sustainability (ieCRES-2023), March 6-7, Gaza, Palestinian Territory, IEEE, 2023.
- Abstract:
- Accelerating the processing of Convolutional Neural Networks (CNNs) is in high demand in the field of Artificial Intelligence (AI), particularly in computer vision. Efficient use of memory resources is crucial to run-time, and weight pruning and quantization techniques have been studied extensively to optimize this efficiency. In this work, we investigate the contribution of these techniques to accelerating a pre-trained CNN model. We adopt percentile-based weight pruning, focusing on unstructured pruning, and dynamically adjust the pruning thresholds based on the model's fine-tuning performance. In the same context, we apply uniform quantization to represent the model's weight values with a fixed number of bits. We implement different levels of post-training and training-aware fine-tuning, first fine-tuning the model with the same learning rate and number of epochs as the original. We then re-fine-tune the model with a learning rate lowered by a factor of 10x for both techniques. Finally, we combine the best levels of pruning and quantization and re-fine-tune the model to obtain the best pruned and quantized pre-trained model. We evaluate each level of the techniques and analyze their trade-offs. Our results demonstrate the effectiveness of our strategy in accelerating the CNN and improving its efficiency, and provide insights into the best combination of techniques to accelerate its inference time.
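To make the two techniques concrete, the following is a minimal NumPy sketch of percentile-based unstructured pruning and uniform weight quantization as described in the abstract. The function names (percentile_prune, uniform_quantize), the 70% pruning percentile, the 8-bit width, and the symmetric quantization scheme are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np

def percentile_prune(weights, percentile):
    """Unstructured pruning: zero out weights whose magnitude falls below
    the given percentile of all weight magnitudes in the tensor."""
    threshold = np.percentile(np.abs(weights), percentile)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def uniform_quantize(weights, num_bits):
    """Uniform symmetric quantization of weight values to a fixed number of bits.
    Returns the de-quantized values that would be used during fine-tuning/inference."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    if scale == 0:
        return weights
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

# Example: prune the smallest 70% of weights by magnitude, then quantize to 8 bits.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # e.g. one conv layer's kernel
w_pruned, mask = percentile_prune(w, percentile=70)
w_quantized = uniform_quantize(w_pruned, num_bits=8)
```

In a pruning-then-fine-tuning loop of the kind the abstract describes, the percentile (and hence the threshold) would be adjusted dynamically based on the model's accuracy after each fine-tuning pass, and the mask would be reapplied so pruned weights stay at zero.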