{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T20:44:21Z","timestamp":1761597861946,"version":"3.41.0"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2017,9,27]],"date-time":"2017-09-27T00:00:00Z","timestamp":1506470400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1615014,CCF-1318298"],"award-info":[{"award-number":["CCF-1615014,CCF-1318298"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2017,10,31]]},"abstract":"<jats:p>VLIW processors typically deliver high performance on a limited budget, making them ideal for a variety of communication and signal processing solutions. These processors typically need large multi-ported register files that can have side effects of increased cycle time and high power consumption. The access delay and energy of these register files can also become prohibitive when increasing the register count or the access ports, thus limiting the overall performance of the processor. Most prior art circumvents this problem by using multiple clusters with private register files to lower the access delay and reduce energy consumption. However, clustering artifacts, like increased inter-cluster communication operations and spill-recovery code, result in a performance penalty.<\/jats:p>\n          <jats:p>This paper proposes CURE, a novel technique to considerably reduce the negative effects of clustering. 
CURE augments the ISA to expose the communication registers to the compiler, increasing the availability of architectural register state to all functional units. Inter-cluster communication operations are integrated into regular ALU and memory operations to improve instruction encoding efficiency. We also propose a new code scheduling heuristic to handle the ISA changes and to realize the improvements in the processor\u2019s performance and energy consumption. Our quantitative analysis estimates that CURE, when compared to the baseline 8-issue uni-cluster processor, boosts average performance by 61% while reducing the average register dynamic energy by 77%.<\/jats:p>","DOI":"10.1145\/3126527","type":"journal-article","created":{"date-parts":[[2017,9,27]],"date-time":"2017-09-27T12:33:53Z","timestamp":1506515633000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["The CURE"],"prefix":"10.1145","volume":"16","author":[{"given":"Vignyan Reddy Kothinti","family":"Naresh","sequence":"first","affiliation":[{"name":"Qualcomm Technologies Incorporated"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dibakar","family":"Gope","sequence":"additional","affiliation":[{"name":"University of Wisconsin Madison, WI, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mikko H.","family":"Lipasti","sequence":"additional","affiliation":[{"name":"University of Wisconsin Madison, WI, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2017,9,27]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Alex Alet\u00e0 Josep M. Codina Antonio Gonz\u00e1lez and David Kaeli. 2003. Instruction replication for clustered microarchitectures. In MICRO-36."},
{"key":"e_1_2_1_2_1","unstructured":"Alex Alet\u00e0 Josep M. Codina Jes\u00fas S\u00e1nchez and Antonio Gonz\u00e1lez. 2001. Graph-partitioning based instruction scheduling for clustered processors. In MICRO-34."},{"key":"e_1_2_1_3_1","unstructured":"Alex Alet\u00e0 Josep M. Codina Jes\u00fas S\u00e1nchez Antonio Gonz\u00e1lez and David Kaeli. 2002. Exploiting pseudo-schedules to guide data dependence graph partitioning. In PACT."},{"key":"e_1_2_1_4_1","unstructured":"R. Balasubramonian S. Dwarkadas and D. H. Albonesi. 2001. Reducing the complexity of the register file in dynamic superscalar processors. In MICRO-34."},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"A. Capitanio N. Dutt and A. Nicolau. 1992. Partitioned Register Files For VLIWs: A preliminary analysis of tradeoffs. In MICRO-25.","DOI":"10.1145\/144965.145839"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/989393.989403"},{"volume-title":"Rabbah","year":"2004","author":"Chakrapani Lakshmi N.","key":"e_1_2_1_7_1"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000067"},{"key":"e_1_2_1_9_1","unstructured":"Josep M. Codina Jes\u00fas S\u00e1nchez and Antonio Gonz\u00e1lez. 2001. A unified modulo scheduling and register allocation technique for clustered processors. In PACT."},
{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"L. Codrescu W. Anderson S. Venkumanhanti M. Zeng E. Plondke C. Koob A. Ingle R. Maule and R. Talluri. 2013. Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications. In Hot Chips.","DOI":"10.1109\/HOTCHIPS.2013.7478317"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/951710.951731"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339708"},{"volume-title":"Register Multimapping: A technique for reducing register bank conflicts in processors with large register files. In SASP-7.","year":"2009","author":"Duong Nam","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_15_1","unstructured":"Equator. 1998. MAP1000 unfolds at Equator. In Microprocessor Report."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339682"},{"key":"e_1_2_1_17_1","unstructured":"K. I. Farkas P. Chow N. P. Jouppi and Z. Vranesic. 1997. The multicluster architecture: reducing cycle time through partitioning. In MICRO-30."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.820055"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/DATE.2005.141"},{"key":"e_1_2_1_20_1","unstructured":"J. S. Gardner. 2012. CEVA Exposes DSP Six Pack. In Microprocessor Report."},
{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSID.2007.127"},{"key":"e_1_2_1_22_1","unstructured":"A. Gonzalez J. Gonzalez and M. Valero. 1998. Virtual-physical registers. In HPCA-4."},{"key":"e_1_2_1_23_1","unstructured":"Texas Instruments Inc. 1998. TMS320C62x\/67x CPU and instruction set reference guide."},{"key":"e_1_2_1_24_1","unstructured":"Texas Instruments. 2010. TMS320C6745\/C6747 Fixed\/Floating-point digital signal processors (Rev. D)."},{"key":"e_1_2_1_25_1","unstructured":"Intel. Intel Itanium Architecture Software Developer\u2019s Manual: Intel Itanium Instruction Set. www.intel.com 3 293-370."},{"volume-title":"CARS: A new code generation framework for clustered ILP processors. In HPCA.","year":"2001","author":"Kailas Krishnan","key":"e_1_2_1_26_1"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01205182"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"R. Nagpal and Y. N. Srikant. 2007. Register file energy optimization for snooping based clustered VLIW architectures. In SBAC-PAD-19.","DOI":"10.1109\/SBAC-PAD.2007.35"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155643"},{"volume-title":"Conte","year":"1998","author":"\u00d6zer Emre","key":"e_1_2_1_30_1"},{"key":"e_1_2_1_31_1","unstructured":"I. Park M. D. Powell and T. N. Vijaykumar. 2002. 
Reducing register ports for higher speed and lower energy. In MICRO-35."},{"key":"e_1_2_1_32_1","unstructured":"Roni Potasman. 1992. Percolation based compiling for evaluation of parallelism and hardware design trade-offs. Ph.D."},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","unstructured":"C. Rowen D. Nicolaescu R. Ravindran D. Heine G. Martin J. Kim D. Maydan N. Andrews B. Huffman V. Papaparaskeva S. Gal-On P. Nuth P. Patwardhan and M. Paradkar. 2011. The World's Fastest DSP Core: Breaking the 100 GMAC\/s Barrier. In Hot Chips.","DOI":"10.1109\/HOTCHIPS.2011.7477497"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICVD.2005.95"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","unstructured":"A. Terechko E. Le Thenaff M. Garg J. van Eijndhoven and H. Corporaal. 2003. Inter-cluster communication models for clustered VLIW processors. In HPCA-9 2003.","DOI":"10.1145\/951710.951717"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859627"},{"key":"e_1_2_1_37_1","unstructured":"S. Wallace and N. Bagherzadeh. 1996. A scalable register file architecture for dynamically scheduled processors. In PACT."},{"key":"e_1_2_1_38_1","unstructured":"R. Yung and N. C. 
Wilhelm. 1995. Caching processor general registers. In ICCD."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/360128.360143"},{"key":"e_1_2_1_40_1","unstructured":"Javier Zalamea Josep Llosa Eduard Ayguad\u00e9 and Mateo Valero. 2001. Modulo scheduling with integrated register spilling for clustered VLIW architectures. In MICRO-34."},{"key":"e_1_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Yingchao Zhao C. J. Xue Minming Li and B. Hu. 2009. Energy-aware register file re-partitioning for clustered VLIW architectures. In ASP-DAC.","DOI":"10.1109\/ASPDAC.2009.4796579"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/280756.280943"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3126527","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3126527","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3126527","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:05:02Z","timestamp":1750273502000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3126527"}},"subtitle":["Cluster Communication Using 
Registers"],"short-title":[],"issued":{"date-parts":[[2017,9,27]]},"references-count":41,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2017,10,31]]}},"alternative-id":["10.1145\/3126527"],"URL":"https:\/\/doi.org\/10.1145\/3126527","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2017,9,27]]},"assertion":[{"value":"2017-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-09-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}