Abstract
Deep neural networks have become the default answer to a range of application challenges, including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deploying these networks typically demands substantial computation and memory. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time, low-power applications, where classic architectures such as GPUs and CPUs still impose a significant power burden. Systems-on-Chip (SoCs) with Field-Programmable Gate Arrays (FPGAs) can improve performance and allow finer-grained control of resources than CPUs or GPUs, but finding the optimal balance between hardware and software to improve DNN efficiency is difficult, and the current research literature offers few solutions for optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computational constraints and low-power needs of deploying these networks, we first describe and implement a domain-specific metric model for optimizing task deployment across differing platforms, hardware and software. Next, we propose a DNN hardware accelerator, the Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet), which includes multithreaded software workers. Finally, we propose a heterogeneous aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate each task to a resource by minimizing a numerical cost over a series of domain objectives. To demonstrate the applicability of our contributions, we deploy nine modern deep network architectures, each with a different number of parameters, in two neural network application domains: image processing and biomedical seizure detection.
Using the metric-modeling techniques integrated into the heterogeneous aware scheduler together with the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power through an optimized task-to-resource allocation. The heterogeneous aware scheduler reduces total system power consumption by 10% without affecting network accuracy while still meeting real-time deadlines. Evaluated against the NVIDIA Jetson TK1 with its embedded GPU SoC, our system achieves parity with or exceeds the GPU's energy efficiency, with a 4× power savings within a 2.0 W power envelope. Compared to existing FPGA-based accelerators, the SCALENet accelerator with the heterogeneous aware scheduler achieves a 4× improvement in energy efficiency.
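The scheduling idea described above, allocating each task to a resource by minimizing a numerical cost over several domain objectives, can be illustrated with a minimal sketch. This is not the paper's implementation; the `Resource` class, the latency/power estimates, and the weighted latency-plus-energy cost are illustrative assumptions standing in for the DNN-specific metric models:

```python
# Hypothetical sketch of cost-based heterogeneous task allocation.
# All names, numbers, and the cost function are illustrative assumptions,
# not the authors' actual metric model.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    latency_ms: dict  # task name -> estimated latency (ms) on this resource
    power_w: dict     # task name -> estimated power draw (W) for this task

def schedule(task, resources, deadline_ms, w_latency=0.5, w_energy=0.5):
    """Pick the resource minimizing a weighted cost while meeting the deadline."""
    best, best_cost = None, float("inf")
    for r in resources:
        lat = r.latency_ms[task]
        if lat > deadline_ms:               # hard real-time constraint
            continue
        energy_mj = r.power_w[task] * lat   # energy (mJ) = power (W) x time (ms)
        cost = w_latency * lat + w_energy * energy_mj
        if cost < best_cost:
            best, best_cost = r, cost
    return best

# Illustrative numbers: the FPGA worker is both faster and lower power here,
# and the CPU misses the 30 ms deadline entirely.
cpu = Resource("cpu", {"conv": 40.0}, {"conv": 2.5})
fpga = Resource("fpga", {"conv": 12.0}, {"conv": 0.8})
print(schedule("conv", [cpu, fpga], deadline_ms=30.0).name)  # -> fpga
```

In this sketch the deadline acts as a hard filter and the weighted cost trades off latency against energy; the paper's scheduler follows the same pattern but draws its cost terms from the DNN-specific metric models.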
Index Terms
- Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs