Rajesh K. Gupta
(University of California, San Diego, USA)

Keynote Speech: Under-designed Computing Machines
    Semiconductor chips are built using device structures that are beginning to behave like molecular assemblies, unlike the precisely characterized transistors and circuits in our device and circuit simulators. Modern computing is ignorant of the variability in the behavior of underlying components from device to device and chip to chip, their wear over time, or the environment in which the computing system is placed. The 'guardbands' used to guarantee component behavior (for power, performance) have grown to ridiculous margins, accounting for as much as two-thirds of the chip area to meet performance 'specs', and are already undermining the gains from continued device scaling. Changing the way software interacts with hardware offers the best hope to recover the advantages from process scaling. In this talk, I will present the latest results from the Variability Expeditions project, which fundamentally rethinks the rigid, deterministic hardware-software interface to propose a new class of computing machines that rely on an opportunistic software stack to adapt to the conditions in under-designed hardware.
    Rajesh K. Gupta is a professor and chair of Computer Science and Engineering at UC San Diego, and holds the QUALCOMM endowed chair. His research interests are in energy-efficient systems, and have taken a turn towards large-scale energy use in recent years. His recent contributions include SystemC modeling and SPARK parallelizing high-level synthesis, both of which are publicly available and have been incorporated into industrial practice. Earlier, Gupta led or co-led DARPA-sponsored efforts under the Data Intensive Systems (DIS) and Power Aware Computing and Communications (PACC) programs that demonstrated architectural adaptation and compiler optimizations in building high-performance and energy-efficient system architectures. His ongoing efforts include energy-efficient data centers and large-scale computing using memory-coherent algorithmic accelerators and non-volatile storage systems. In recent years, Gupta and his students have received a best paper award at IEEE/ACM DCOSS’08 and a best demonstration award at IEEE/ACM IPSN/SPOTS’05. Gupta received a BTech in EE from IIT Kanpur, an MS in EECS from UC Berkeley and a PhD in Electrical Engineering from Stanford University. He currently serves as EIC of IEEE Embedded Systems Letters. Gupta is a Fellow of the IEEE.
Kiyoung Choi
(Seoul National University, Korea)
State-based Full Predication for Control Flows on ILP/DLP Processors
    Predicated execution techniques are becoming an essential part of ILP/DLP processor designs to overcome the limitations imposed by control flow. However, the conventional techniques do not support all types of control flow, or require significant hardware overhead in order to do so. In addition, they generally require a longer execution time and more power because both the if- and else-paths are always executed. We propose advanced predicated execution techniques that can handle and accelerate all types of control flow with minimal hardware overhead. The proposed techniques are compiler-friendly and thus can be easily automated and extended to general SIMD machines. We implemented these techniques on a coarse-grained reconfigurable array architecture and verified their functionality and effectiveness by accelerating an H.264 deblocking filter, a kernel which is both data- and control-intensive. The results show that the proposed approach achieves up to 43% improvement in execution time compared to speculation, at the cost of a 76% increase in code size, and 24% improvement in execution time compared to the previous full predication approach with a smaller code size. It also saves up to 22.7% of the total power consumed in the array compared to conventional full predication.
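    As a rough illustration of the conventional baseline described in this abstract, the sketch below contrasts ordinary branching with an if-converted (fully predicated) loop in which both the if- and else-paths are computed for every element and a predicate selects the result. The function names and the toy smoothing kernel are hypothetical and are not taken from the talk; the sketch does not model the proposed state-based technique.

```python
def smooth_branching(samples, threshold):
    """Ordinary control flow: only one path runs per element."""
    out, paths_run = [], 0
    for x in samples:
        if abs(x) < threshold:
            out.append(x // 2)   # if-path
        else:
            out.append(x)        # else-path
        paths_run += 1
    return out, paths_run


def smooth_predicated(samples, threshold):
    """If-converted form: both paths run, a predicate selects the result."""
    out, paths_run = [], 0
    for x in samples:
        pred = int(abs(x) < threshold)   # predicate is 1 or 0
        then_val = x // 2                # if-path, always computed
        else_val = x                     # else-path, always computed
        paths_run += 2
        out.append(pred * then_val + (1 - pred) * else_val)
    return out, paths_run


if __name__ == "__main__":
    data = [1, -8, 3, 10]
    print(smooth_branching(data, 4))
    print(smooth_predicated(data, 4))
```

    Both functions return the same values, but the predicated version evaluates twice as many paths, which is the execution-time and power overhead the abstract attributes to conventional full predication.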
Chung-Ching Shen
(University of Maryland, USA)
Design and Synthesis for Signal Processing Systems Using the Targeted Dataflow Interchange Format
    Development of signal processing systems that can be targeted to different platforms is challenging due to the need for rigorous integration between high-level abstract modeling and low-level synthesis and optimization. In this talk, a new dataflow-based design tool called the targeted dataflow interchange format (TDIF) will be introduced for retargetable design, analysis, and implementation of embedded software for signal processing systems. Capabilities of TDIF that will be featured include dynamic dataflow modeling, analysis of interactions among design components, data structures for encapsulating contextual information for components, and integrated, automated code generation for programming interfaces and low-level customizations that are geared toward high-performance embedded processing architectures.
Anshul Kumar
(Indian Institute of Technology, Delhi, India)
Application-Aware Data Forwarding in Shared Memory Multiprocessors
    The cache hierarchy plays an important role in bridging the processor-memory performance gap in uniprocessors as well as chip multiprocessors (CMPs). Several hardware and software techniques have been developed in the past to make this hierarchy work efficiently. Prefetching is one such technique that attempts to reduce cache misses (or sometimes to reduce miss penalty) by bringing data and instructions into a cache before their use. To make prefetching work, either the software or the hardware needs to address the questions “what to prefetch” and “when to prefetch”. Due to the distributed nature of memory in CMPs, there is an additional question - “from where to prefetch”. Further, because the producer and consumer of data may not be the same in a multiprocessor, there may be yet another question as to “who initiates prefetch”. While the usual consumer-initiated prefetch is reasonably well understood, producer-initiated data prefetch, better known as data forwarding, is trickier and much less studied.
    A compiler-driven data forwarding scheme in which the producer forwards the data to the consumer through the shared memory, without much hardware support, has been reported in the literature, and it has been shown that there are many situations in which data forwarding can outperform consumer-initiated prefetch. In this paper we investigate data forwarding further and propose a new solution. In particular, we examine the role of hardware in a compiler-driven forwarding scheme to reduce the need for explicit forwarding instructions. Further, our solution addresses the bottleneck that results from the involvement of shared memory in all forwarding operations.
Ing-Jer Huang
(National Sun Yat-Sen University, Taiwan)

A Hardware/Software Co-Monitoring Tool for Linux-based Embedded Systems
    We propose a hardware/software co-monitoring tool for Linux-based embedded systems. It is capable of capturing the entire system performance across multiple software/hardware levels, including the user program level, the OS kernel level, the CPU level and the on-chip-bus level. On the hardware side, the tool consists of a program monitor (PM), which is attached to the CPU’s instruction bus and monitors the software behavior, and an on-chip bus monitor (BM), which is attached to the bus to monitor the bus activities that reveal the transaction/contention/utilization of master and slave hardware components. The PM activates the BM when the software on the CPU accesses the related on-chip hardware components. On the Linux side, the context switcher in the Linux kernel is modified to communicate with the PM once the user program to be monitored is activated. There is no need to modify the user program under monitoring. Instead, the GDB tool is used to identify the entries of the user program functions to be monitored. A GUI is provided to operate the tool and analyze the entire system performance across software and hardware levels. A case study will be provided to demonstrate the capability of the proposed tool.
Tulika Mitra
(National University of Singapore, Singapore)

A Polymorphic Heterogeneous Multi-Core System
    Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex cores on chip. Consumer electronic devices (e.g., smartphones), on the other hand, need to support an ever-changing set of diverse applications with varying performance demands that are hard to satisfy with a set of identical cores. We will present a polymorphic heterogeneous multi-core architecture, named Bahurupi, that can be tailored according to the workload by software. Bahurupi is designed and fabricated as a heterogeneous multi-core system containing multiple identical (simple) cores as well as some amount of re-configurable logic on chip. The main novelty of Bahurupi lies in its highly flexible architecture. Post-fabrication, software can configure or compose together primitive on-chip hardware components to create a customized multi-core system that best matches the needs of a specific application. This software-directed re-configurability allows us to enjoy the power-performance benefits of a customized and optimized multi-core system without paying the hefty price of design and fabrication for customization.
Jörg Henkel
(Karlsruhe Institute of Technology, Germany)
Compiler-directed Techniques for Reliable Software
    A robust and dependable system design needs to consider reliability at all abstraction levels. We introduce a reliability-aware compiler that bridges the gap between hardware and software by quantifying hardware-level faults at the instruction level. Various reliability-guided software transformations are employed at the source code level to reduce the overall program vulnerability. For a user-provided tolerable performance overhead constraint, an application composition algorithm selects and combines various transformation functions for reliable code generation. Furthermore, a reliability-guided instruction scheduler is integrated that reduces the program's susceptibility to failures by minimizing the residency cycles of critical instructions inside the processor pipeline, in addition to reducing the vulnerable periods of their operands.
Aviral Shrivastava
(Arizona State University, USA)
Beyond the Hill of Multicores lies the Valley of Accelerators
    “Performance, performance, performance… do not worry about power.” This was the theme of high-end processor design for a long time. Voilà! Now we are at a point where we cannot improve performance without reducing power. This is because we are already dissipating more power than the packages can cool. In fact, the only way to improve performance is to improve the power-efficiency of computation. Just adding transistors will not work, and will actually only make matters worse by increasing power.
    This power wall resulted in a sharp turn in processor designs, and they irrevocably went multi-core. Multi-cores are good because they promise higher potential throughput (and never mind the actual performance of your applications). This is because the cores can be made simpler and run at lower voltage, resulting in much more power-efficient operation. Even though the performance of a single core is much reduced, the total possible throughput of the system scales with the number of cores. Even with all the excitement about multi-core architectures, this road will only last so long. This is not only because the benefits of voltage scaling diminish with decreasing voltage, but also because after some point, making a core simpler will only be detrimental and may actually reduce power-efficiency. What next? How do we further improve power-efficiency?
    Beyond the hill of multi-cores lies the valley of accelerators. Accelerators, including hardware accelerators (e.g., Intel SSE), software accelerators (e.g., VLIW accelerators), reconfigurable accelerators (e.g., FPGAs), and programmable accelerators (CGRAs), are some of the foreseeable solutions that can further improve the power-efficiency of computation. Among these, we find CGRAs, or Coarse Grain Reconfigurable Arrays, to be a very promising technology. They are slightly reconfigurable (and therefore close to hardware), but are programmable (and therefore usable as more general-purpose accelerators). As a result, they can provide power-efficiencies of up to 100 GOps/W, while being relatively general purpose. Although very promising, several challenges remain in compilation for CGRAs, especially because they have very little dynamism in the architecture, and almost everything (including control) is statically determined. In this talk, I will present our latest research results on enabling multi-threading on CGRAs.
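    The voltage-scaling argument in this abstract can be made concrete with a back-of-the-envelope calculation. The sketch below uses the textbook first-order model of CMOS switching power, P = C * V^2 * f, which is an assumption brought in for illustration and is not stated in the abstract. It also assumes a perfectly parallel workload, so that n cores at frequency f/n match the throughput of one core at f, and the voltage scaling factor of 0.7 is a hypothetical value.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS switching power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


def compare(n_cores, voltage_scale, c=1.0, v=1.0, f=1.0):
    """Compare one fast core with n slower cores of equal total throughput.

    Each of the n cores runs at f / n_cores and at a supply voltage reduced
    by voltage_scale (assumed to be achievable at the lower frequency).
    Returns (single_core_power, multi_core_power, ratio).
    """
    single = dynamic_power(c, v, f)
    multi = n_cores * dynamic_power(c, v * voltage_scale, f / n_cores)
    return single, multi, multi / single


if __name__ == "__main__":
    single, multi, ratio = compare(n_cores=2, voltage_scale=0.7)
    print(f"single core: {single:.2f}, two cores: {multi:.2f}, ratio: {ratio:.2f}")
```

    Under these assumptions the power ratio equals the square of the voltage scaling factor, so the benefit shrinks as the supply voltage approaches its lower limit and can no longer be reduced.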
Vincent J. Mooney III
(Georgia Institute of Technology, USA)
Approximate and Probabilistic Arithmetic Architectures for System-on-a-Chip
    This talk will explore recent results in approximate and probabilistic computing arithmetic for System-on-a-Chip architectures. One specific example will explain approximate arithmetic in the context of ripple carry and other adder types; an energy-delay tradeoff model will be described which leads to an optimization approach using geometric programming. Additional research to be discussed includes floating point probabilistic multiplication for graphics, probabilistic addition for motion estimation, and fast estimation of noise-based errors in probabilistic CMOS.
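    To give a flavor of what approximate addition means, the sketch below implements an adder in the style of the lower-part OR adder known from the approximate-computing literature: the low bits are approximated with a bitwise OR, and only the high bits are added exactly. This is a substitute illustration and not necessarily the design presented in the talk; the function names and bit widths are hypothetical, and the energy-delay model and geometric programming step are not modeled.

```python
def approx_add(a, b, width=8, approx_bits=3):
    """Add two unsigned width-bit integers with an approximate lower part.

    The lowest approx_bits bits are computed as a bitwise OR (no carry
    chain). The remaining bits are added exactly, with a carry-in taken
    from the AND of the most significant bits of the two lower parts.
    """
    if approx_bits == 0:
        carry = 0
    else:
        top = approx_bits - 1
        carry = (a >> top) & (b >> top) & 1
    low = (a | b) & ((1 << approx_bits) - 1)
    high = (a >> approx_bits) + (b >> approx_bits) + carry
    # Keep width + 1 bits so the carry-out of the exact part is preserved.
    return ((high << approx_bits) | low) & ((1 << (width + 1)) - 1)


def mean_abs_error(width=8, approx_bits=3):
    """Mean absolute error over all pairs of width-bit operands."""
    n = 1 << width
    total = 0
    for a in range(n):
        for b in range(n):
            total += abs((a + b) - approx_add(a, b, width, approx_bits))
    return total / (n * n)


if __name__ == "__main__":
    for k in range(0, 5):
        print(k, mean_abs_error(width=8, approx_bits=k))
```

    Sweeping approx_bits shows the accuracy side of the tradeoff; a shorter exact carry chain is what would reduce delay and energy in hardware.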
Krishna V. Palem
(Rice University, USA)