Date of Completion


Embargo Period



Keywords

near-threshold computing, cache coherence, data locality, cache management, shared memory, message passing, processor architecture, computer architecture, multicore architecture, many-core architecture, workload characterization, data replication

Major Advisor

Omer Khan

Associate Advisor

John Chandy

Associate Advisor

Marten Van Dijk

Field of Study

Electrical Engineering


Doctor of Philosophy

Open Access

Open Access


The trend of increasing processor performance by boosting frequency has halted due to excessive power dissipation. However, transistor density has continued to grow, which has enabled the integration of many cores on a single chip to meet the performance requirements of future applications. Scaling to hundreds of cores on a single chip presents a number of challenges, chiefly efficient data access and on-chip communication. Near-threshold voltage (NTV) operation has been identified as the most energy-efficient operating region. Running at NTV can facilitate efficient data access; however, it introduces bit-cell faults in the SRAMs, which must be dealt with. Another avenue for improving data access efficiency is better on-chip data locality. The shared memory abstraction dominates the traditional small computer and embedded space due to its ease of programming. For efficiency, shared memory is often implemented with hardware support for synchronization and cache coherence among the cores. However, accesses to shared data with frequent writes result in wasteful invalidations, synchronous write-backs, and cache line ping-pong, leading to low spatio-temporal locality. Moreover, communication through coherent caches and shared memory primitives is inefficient because it can take many instructions to coordinate between cores.

This thesis focuses on mitigating the data access and communication challenges and makes architectural contributions to enable efficient and scalable many-core processors. The main idea is to minimize data movement and make each necessary data access more efficient. To this end, a novel private level-1 cache architecture is presented to enable efficient and fault-free operation at near-threshold voltages. To better exploit data locality, a last-level cache (LLC) data replication scheme is proposed that co-optimizes data locality and off-chip miss rate. It uses an in-hardware predictive mechanism to classify data and replicate only high-reuse data in the local LLC bank. Finally, a hybrid shared memory and explicit messaging architecture is proposed to enable efficient on-chip communication. In this architecture the shared memory model is retained; however, a set of lightweight, in-hardware, explicit message-passing-style instructions is introduced in the instruction set architecture (ISA) to enable efficient movement of computation to where the data is located.
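
The idea of replicating only high-reuse data in the local LLC bank can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the mechanism described in the thesis: the class `ReuseClassifier`, the constant `REPLICATION_THRESHOLD`, and the invalidation policy are all hypothetical and chosen only to show how a reuse counter might gate replication.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: number of accesses seen at the home bank before a
# cache line is treated as "high reuse". Not a value taken from the thesis.
REPLICATION_THRESHOLD = 3


@dataclass
class ReuseClassifier:
    """Toy per-line reuse counters for one core's local LLC bank."""
    threshold: int = REPLICATION_THRESHOLD
    reuse: dict = field(default_factory=dict)    # line address -> access count
    replicas: set = field(default_factory=set)   # lines replicated locally

    def access(self, line_addr: int) -> str:
        """Return where the access is served: 'local' or 'home'."""
        if line_addr in self.replicas:
            return "local"                # served by the local replica
        count = self.reuse.get(line_addr, 0) + 1
        self.reuse[line_addr] = count
        if count >= self.threshold:
            self.replicas.add(line_addr)  # classified as high reuse
        return "home"                     # this access still goes to the home bank

    def invalidate(self, line_addr: int) -> None:
        """Assumed policy: a remote write drops the replica and its history."""
        self.replicas.discard(line_addr)
        self.reuse.pop(line_addr, None)
```

With the assumed threshold of 3, the first three accesses to a line are served by its home bank and later accesses hit the local replica, while a line touched only once or twice is never replicated and so does not displace other data in the local bank.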