
Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs

Published: 16 December 2019

Abstract

Deep Neural Networks (DNNs) have become the leading solution to a range of application challenges, including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, all while outperforming prior approaches that relied heavily on hand-engineered techniques. Deploying these networks, however, demands substantial computation and memory, which makes DNNs challenging to use in embedded, real-time, low-power applications, where classic architectures such as CPUs and GPUs still impose a significant power burden. Systems-on-Chip (SoCs) with Field-Programmable Gate Arrays (FPGAs) can improve performance and allow finer-grained control of resources than CPUs or GPUs, but finding the optimal balance between hardware and software for efficient DNN execution is difficult, and few solutions in the current research literature address the joint optimization of hardware and software deployments of DNNs in embedded low-power systems. To address the computational restrictions and low-power needs of deploying these networks, we first describe and implement a domain-specific metric model for optimizing task deployment across hardware and software platforms. Next, we propose a DNN hardware accelerator, the Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet), which includes multithreaded software workers. Finally, we propose a heterogeneous-aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate each task to a resource by solving a numerical cost over a series of domain objectives. To demonstrate the applicability of our contributions, we deploy nine modern deep network architectures, each with a different number of parameters, within two neural network application domains: image processing and biomedical seizure detection. Using the metric modeling techniques integrated into the heterogeneous-aware scheduler together with the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power through optimized task-to-resource allocation. Our heterogeneous-aware scheduler decreases power consumption by 10% of the total system power without affecting network accuracy while still meeting real-time deadlines. Evaluated against the NVIDIA Jetson TK1 with an embedded GPU SoC, our design achieves parity with or exceeds the GPU's energy efficiency, with a 4× power savings within a 2.0 W power envelope. Compared to existing FPGA-based accelerators, SCALENet's accelerator and heterogeneous-aware scheduler achieve a 4× improvement in energy efficiency.
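The scheduler's core idea, allocating each task to the resource that minimizes a numerical cost across domain objectives such as latency, energy, and real-time deadlines, can be illustrated with a short sketch. This is a minimal, hypothetical Python illustration only: the Task and Resource structures, the cost weights alpha and beta, and the latency and power figures are all assumptions made for the example, not the paper's actual metric models or cost function.

```python
# Hedged sketch of cost-based heterogeneous task-to-resource assignment.
# All names and numbers below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str        # e.g., "fpga_accel" or "cpu_worker" (assumed labels)
    power_w: float   # estimated active power draw in watts

@dataclass
class Task:
    name: str
    latency: dict    # resource name -> estimated latency in seconds,
                     # e.g., as produced by a per-task metric model

def schedule(tasks, resources, deadline_s, alpha=1.0, beta=0.5):
    """Assign each task to the resource with the lowest weighted cost
    (alpha * latency + beta * energy), skipping any resource whose
    latency estimate would violate the real-time deadline."""
    assignment = {}
    for task in tasks:
        best, best_cost = None, float("inf")
        for r in resources:
            lat = task.latency[r.name]
            if lat > deadline_s:        # hard real-time constraint
                continue
            energy = lat * r.power_w    # energy = power * time
            cost = alpha * lat + beta * energy
            if cost < best_cost:
                best, best_cost = r, cost
        # None means no resource can meet the deadline for this task
        assignment[task.name] = best.name if best else None
    return assignment

if __name__ == "__main__":
    resources = [Resource("fpga_accel", 2.0), Resource("cpu_worker", 5.0)]
    tasks = [Task("conv1", {"fpga_accel": 0.004, "cpu_worker": 0.012}),
             Task("fc1",   {"fpga_accel": 0.006, "cpu_worker": 0.005})]
    print(schedule(tasks, resources, deadline_s=0.010))
```

In an actual deployment, the per-resource latency and power estimates would come from the paper's domain-specific metric models rather than hand-supplied constants, and the cost weights would encode the relative priority of the domain objectives.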



• Published in

ACM Journal on Emerging Technologies in Computing Systems, Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019, 226 pages
ISSN: 1550-4832
EISSN: 1550-4840
DOI: 10.1145/3365594
Editor: Ramesh Karri

              Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 16 December 2019
              • Accepted: 1 August 2019
              • Revised: 1 May 2019
              • Received: 1 August 2018

