Abstract
Deep neural networks have become the default answer to a range of application challenges, including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deploying these networks typically demands substantial computation and memory. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time, low-power applications, where classic architectures such as GPUs and CPUs still impose a significant power burden. Systems-on-Chip (SoCs) with Field-Programmable Gate Arrays (FPGAs) can improve performance and allow finer-grained control of resources than CPUs or GPUs, but finding the optimal balance between hardware and software to improve DNN efficiency is difficult, and the current research literature offers few solutions for optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computational constraints and low-power needs of deploying these networks, we first describe and implement a domain-specific metric model for optimizing task deployment across differing platforms, hardware and software. Next, we propose a DNN hardware accelerator, the Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet), which includes multithreaded software workers. Finally, we propose a heterogeneous aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate each task to a resource by minimizing a numerical cost over a series of domain objectives. To demonstrate the applicability of our contributions, we deploy nine modern deep network architectures, each with a different number of parameters, in two neural network application domains: image processing and biomedical seizure detection.
Using the metric-modeling techniques integrated into the heterogeneous aware scheduler together with the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power through an optimized task-to-resource allocation. The heterogeneous aware scheduler reduces total system power consumption by 10% without affecting network accuracy while still meeting real-time deadlines. Evaluated against the NVIDIA Jetson TK1 with its embedded GPU SoC, our system achieves parity with or exceeds the GPU's energy efficiency, with a 4× power savings within a 2.0 W power envelope. Compared to existing FPGA-based accelerators, the SCALENet accelerator with the heterogeneous aware scheduler achieves a 4× improvement in energy efficiency.
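The scheduling idea described above, allocating each task to a resource by minimizing a numerical cost over several domain objectives, can be illustrated with a minimal sketch. This is not the paper's implementation; the `Resource` class, the latency/power estimates, and the weighted latency-plus-energy cost are illustrative assumptions standing in for the DNN-specific metric models:

```python
# Hypothetical sketch of cost-based heterogeneous task allocation.
# All names, numbers, and the cost function are illustrative assumptions,
# not the authors' actual metric model.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    latency_ms: dict  # task name -> estimated latency (ms) on this resource
    power_w: dict     # task name -> estimated power draw (W) for this task

def schedule(task, resources, deadline_ms, w_latency=0.5, w_energy=0.5):
    """Pick the resource minimizing a weighted cost while meeting the deadline."""
    best, best_cost = None, float("inf")
    for r in resources:
        lat = r.latency_ms[task]
        if lat > deadline_ms:               # hard real-time constraint
            continue
        energy_mj = r.power_w[task] * lat   # energy (mJ) = power (W) x time (ms)
        cost = w_latency * lat + w_energy * energy_mj
        if cost < best_cost:
            best, best_cost = r, cost
    return best

# Illustrative numbers: the FPGA worker is both faster and lower power here,
# and the CPU misses the 30 ms deadline entirely.
cpu = Resource("cpu", {"conv": 40.0}, {"conv": 2.5})
fpga = Resource("fpga", {"conv": 12.0}, {"conv": 0.8})
print(schedule("conv", [cpu, fpga], deadline_ms=30.0).name)  # -> fpga
```

In this sketch the deadline acts as a hard filter and the weighted cost trades off latency against energy; the paper's scheduler follows the same pattern but draws its cost terms from the DNN-specific metric models.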
Index Terms
- Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs