Why Nothing Matters: The Impact of Zeroing
Intel Corporation, Apr.

Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security.
A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects.
Existing virtual machines either (a) minimize direct costs by zeroing in large blocks, or (b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth pressure. This paper evaluates these two widely used zero-initialization designs, showing that they make different tradeoffs yet achieve very similar performance. Our findings invite additional optimizations and microarchitectural support.
Memory safety is an increasingly important tool for the correctness and security of modern language implementations. A key element of memory safety is initializing memory before giving it to the program. In managed languages, such as Java, C#, and PHP, the language specifications stipulate zero initialization. We show that existing approaches to zero initialization are surprisingly expensive.
On three modern IA32 architectures, the direct cost alone is significant. Hardware trends towards chip multiprocessors (CMPs) are exacerbating these expenses because of their increasing demands on memory bandwidth [9, 15, 16, 24, 28, 33, 34] and pressure on shared memory subsystems, such as shared on-chip caches and memory controllers; Zhao et al., for example, document these pressures. Furthermore, energy is now constraining memory bandwidth [8].
If architects add processor cores without adding commensurate memory resources (memory bandwidth and shared caches), the overhead of existing zero-initialization techniques is likely to grow. Although hardware parallelism increases pressure on the memory system, it also offers an optimization opportunity: spare cores can take on critical system services that must be performed in a timely manner. To our knowledge, this paper is the first to explore the zero-initialization design space and show that zero initialization is costly.
Existing zero-initialization strategies face two problems: the direct cost of executing the requisite zeroing instructions, and the indirect cost of memory bandwidth consumption and cache pollution. Bulk zeroing attacks the direct cost by zeroing memory in large chunks, exploiting instruction-level parallelism, loop optimizations, and zeroing a cache line or more at a time. Bulk zeroing, however, introduces a significant reuse distance between when the VM zeroes a region of memory and when the program allocates into it. This distance increases cache pollution. Hot-path zeroing injects zeroing instructions into the allocation sequence, attacking indirect costs by minimizing reuse distance and exploiting the hardware prefetcher to avoid stalls in modern fetch-on-write caches. Hot-path zeroing, however, expands and complicates the performance-critical allocation sequence and reduces opportunities for software optimization of the zeroing instructions. The two designs are thus at opposite poles, addressing either, but not both, of the direct and indirect costs of zeroing.
Although this cost is significant, very little research explores zeroing costs or optimizations. We measure the allocation rates of real benchmarks and microbenchmarks to explore performance limits and costs.
Our analysis reveals opportunities and tradeoffs in zeroing strategies. We show that an effective hardware prefetcher is critical to the performance of hot-path zeroing. We introduce three better solutions. Our zeroing designs take advantage of non-temporal instructions and unutilized hardware parallelism to minimize zeroing costs.
We demonstrate that non-temporal stores improve memory throughput and mitigate the cache pollution due to bulk zeroing. The best strategy adaptively chooses between concurrent and synchronous non-temporal bulk zeroing, adjusting based on the availability of unused hardware parallelism. The adaptive approach improves performance, and it is most effective on highly allocating, memory-intensive benchmarks, which stress the memory system the most.
Nonetheless, the total number of cycles devoted to zero initialization is often substantial, which suggests that further optimization of zeroing will be useful. The contributions of this paper are: (1) the first detailed study of the cost of zero initialization, which shows that zero initialization is often expensive on modern processors; (2) a detailed microarchitectural analysis of existing designs, which shows that they make different tradeoffs but deliver very similar performance; and (3) the identification and evaluation of three new designs.
The adaptive design uses non-temporal instructions and concurrency to provide speedups that sometimes exceed the direct cost of zero initialization.

Background and Related Work

Our work sits at the boundary of programming language implementation and microarchitecture. This section presents key background ideas and related work in hardware and software.

Language design. Data initialization and pointer disciplines are the principal techniques for ensuring memory safety. Pointer-safety disciplines protect against unintended or malicious access to memory by ensuring that the program accesses only valid references to reachable objects.
Pointer safety is achieved through a combination of language specification and implementation techniques that enforce pointer declarations in static or dynamic type systems. The language specification forbids reference forging, while the implementation checks array indices and uses garbage collection, rather than manual freeing, to avoid dangling references.