Over the last twenty years, a number of studies have attempted to optimize the execution of critical sections on multicore architectures, either by reducing access contention or by improving cache locality. Access contention occurs when many threads simultaneously try to enter critical sections that are protected by the same lock, causing the cache line containing the lock to bounce between cores. Cache locality becomes a problem when a critical section accesses shared data that has recently been accessed on another core, resulting in cache misses, which greatly increase the critical section's execution time. Addressing access contention and cache locality together remains a challenge. These issues imply that some applications that work well on a small number of cores do not scale to the number of cores found in today's multicore architectures.
Recently, several approaches have been proposed to execute a succession of critical sections on a single server core to improve cache locality. Such approaches also incorporate a fast transfer of control from other client cores to the server, to reduce access contention. Suleman et al. propose a hardware-based solution, evaluated in simulation, that introduces new instructions to perform the transfer of control, and uses a special fast core to execute critical sections. Hendler et al. propose a software-only solution, Flat Combining, in which the server is an ordinary client thread, and the role of server is handed off between clients periodically. This approach, however, slows down the thread playing the role of server, incurs an overhead for the management of the server role, and drastically degrades performance at low contention rates. Furthermore, it cannot accommodate threads that block within a critical section, which makes it unable to support widely used applications such as Memcached.
We propose a new locking technique, Remote Core Locking (RCL), that aims to improve the performance of legacy multithreaded applications on multicore hardware by executing remotely, on one or several dedicated servers, critical sections that access highly contended locks. Our approach is entirely implemented in software and targets commodity x86 multicore architectures. It is based on the observation that most applications do not scale to the number of cores found in modern multicore architectures, so the cores that do not contribute to improving the performance of the application can be dedicated to serving critical sections. It is therefore not necessary to burden the application threads with the role of server, as is done in Flat Combining. RCL also accommodates blocking within critical sections. The design of RCL addresses both access contention and locality. Contention is addressed by a fast transfer of control to a server, using a dedicated cache line for each client to achieve busy-wait synchronization with the server core. Locality is improved because shared data is likely to remain in the server core's cache, allowing the server to access such data without incurring cache misses. In this respect, RCL is similar to Flat Combining, but it has a much lower overall overhead.
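The client/server handoff described above can be illustrated with a small sketch. This is not the RCL implementation from the papers below; it is a minimal toy version, assuming C11 atomics and POSIX threads, that shows only the core idea: each client owns one cache-line-sized request slot, and a dedicated server thread scans the slots and runs the requested critical sections itself. All names (`slot_t`, `remote_execute`, `server_loop`, `run_demo`) are hypothetical, the 64-byte cache line size is an assumption, and features such as multiple locks or blocking inside critical sections are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size */
#define MAX_CLIENTS 8   /* hypothetical upper bound for this sketch */

typedef void (*cs_fn)(void *);

/* One request slot per client, aligned so that each slot sits on its own
 * cache line and clients do not contend on a shared line. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_int pending; /* 1 = request waiting */
    cs_fn fn;                                /* critical section to run */
    void *arg;                               /* its argument */
} slot_t;

static slot_t slots[MAX_CLIENTS];
static atomic_int stop;

/* Server: scan every slot and execute pending critical sections. Only this
 * thread touches the shared data, so the data can stay in its cache. */
static void *server_loop(void *unused) {
    (void)unused;
    while (!atomic_load_explicit(&stop, memory_order_acquire)) {
        int served = 0;
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (atomic_load_explicit(&slots[i].pending, memory_order_acquire)) {
                slots[i].fn(slots[i].arg);
                atomic_store_explicit(&slots[i].pending, 0, memory_order_release);
                served = 1;
            }
        }
        if (!served)
            sched_yield(); /* keeps the sketch usable on machines with few cores */
    }
    return NULL;
}

/* Client: publish the request in its own slot, then busy-wait on that slot
 * until the server marks it done. */
static void remote_execute(int id, cs_fn fn, void *arg) {
    slot_t *s = &slots[id];
    s->fn = fn;
    s->arg = arg;
    atomic_store_explicit(&s->pending, 1, memory_order_release);
    while (atomic_load_explicit(&s->pending, memory_order_acquire))
        sched_yield(); /* a tuned implementation would spin without yielding */
}

/* Example critical section: increment a shared counter. */
static void increment(void *arg) { (*(long *)arg)++; }

typedef struct { int id; int iters; long *counter; } client_arg_t;

static void *client_main(void *p) {
    client_arg_t *c = p;
    for (int i = 0; i < c->iters; i++)
        remote_execute(c->id, increment, c->counter);
    return NULL;
}

/* Start one server and nclients clients; return the final counter value. */
long run_demo(int nclients, int iters) {
    assert(nclients > 0 && nclients <= MAX_CLIENTS);
    long counter = 0;
    pthread_t server, clients[MAX_CLIENTS];
    client_arg_t args[MAX_CLIENTS];
    atomic_store(&stop, 0);
    pthread_create(&server, NULL, server_loop, NULL);
    for (int i = 0; i < nclients; i++) {
        args[i] = (client_arg_t){ i, iters, &counter };
        pthread_create(&clients[i], NULL, client_main, &args[i]);
    }
    for (int i = 0; i < nclients; i++)
        pthread_join(clients[i], NULL);
    atomic_store(&stop, 1); /* all requests have completed at this point */
    pthread_join(server, NULL);
    return counter;
}
```

In this sketch the counter is a plain `long` with no lock around it, because only the server thread ever modifies it; the per-slot `pending` flag is the only point of synchronization between a client and the server. A call such as `run_demo(4, 1000)` performs 4000 remote increments in total.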
[PDF]
Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall and Gilles Muller
In ACM Transactions on Computer Systems 33, 4, Article 13 (January 2016), 62 pages.
The source code for all benchmarks is available here:
rcl_benchmarks_tocs.tar.gz
The Phoenix datasets are no longer available on mapreduce.stanford.edu. You can download them here: [1] [2] [3] [4]
[PDF]
[Slides]
Jean-Pierre Lozi
The source code for all benchmarks is available here (updated on 23/11/15):
rcl_benchmarks_phd_thesis.tar.gz
The Phoenix datasets are no longer available on mapreduce.stanford.edu. You can download them here: [1] [2] [3] [4]
[PDF]
[Talk/slides]
Jean-Pierre Lozi, Gaël Thomas, Florian David, Julia Lawall, Gilles Muller
In Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC '12), pages 65-76, Boston, MA, USA, 2012.
The source code for all benchmarks is available here: rcl_benchmarks_usenix_atc.tar.gz
[PDF]
Jean-Pierre Lozi, Gaël Thomas, Julia Lawall, Gilles Muller
INRIA Research Report #7779, 2011.
The microbenchmark code is available here: rcl_microbenchmark_tech_report.tar.gz
Contact: Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller