Scientific Programming and Computer Architecture


Scientific and Engineering Computation
William Gropp and Ewing Lusk, editors; Janusz Kowalik, founding editor
A complete list of books published in the Scientific and Engineering Computation series appears at the back of this book.


Scientific Programming and Computer Architecture
Divakar Viswanath
The MIT Press
Cambridge, Massachusetts
London, England


© 2017 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
This book was set in LyX by the author. Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data

Names: Viswanath, Divakar, author.

Title: Scientific programming and computer architecture / Divakar Viswanath.
Description: Cambridge, MA : The MIT Press, [2017] | Series: Scientific and
engineering computation | Includes bibliographical references and index.
Identifiers: LCCN 2016043792 | ISBN 9780262036290 (hardcover : alk. paper)
Subjects: LCSH: Computer programming. | Computer architecture. | Software
engineering. | C (Computer program language)
Classification: LCC QA76.6 .V573 2017 | DDC 005.1--dc23 LC record
available at https://lccn.loc.gov/2016043792
10 9 8 7 6 5 4 3 2 1


To all my teachers, with thanks.



Table of Contents
The website https://github.com/divakarvi/bk-spca has all the programs discussed in this book.

Preface

It is a common experience that minor changes to C/C++ programs can make a big difference to their speed. Although all programmers who opt for C/C++ do so at least partly, and much of the time mainly, because programs in these languages can be fast, writing fast programs in these languages is not so straightforward. Well-optimized C/C++ programs can be 10 or more times faster than programs that are not well optimized.
At the heart of this book is the following question: what makes computer programs fast or slow? Programming languages provide a level of abstraction that makes computers look simpler than they are. As soon as we ask this question about program speed, we have to get behind the abstractions and understand how a computer really works and how programming constructs map to different parts of the computer’s architecture. Although there is much that can be understood, the modern computer is such a complicated device that this basic question cannot be answered perfectly.
Writing fast programs is the major theme of this book, but it is not the only theme. The other theme is modularity of programs. Structuring programs so that their structure explains what they do is a valuable principle. Organizing a computer program into a series of functions serves that principle. Yet when computer programs become large, merely dividing a program into functions is no longer adequate. It becomes necessary to organize the computer program into distinct sources and the sources into a source tree. The entire source tree can be made available as a library. We pay heed to program structure throughout this book.
Most books on computer programming are written at the same level of abstraction as the programming language they utilize or explain. If we want to understand program speed, however, we have to understand the different parts of a computer, and an account that stays at the level of the language is not feasible. It is inevitable that choices have to be made regarding the type of computer system that is studied.
The big choices in this book are to opt for the x86 line of computers backed by Intel and AMD corporations, and for the Linux operating system. Nearly 100% of the computers in use today as servers, desktops, or laptops are x86 based, and the x86 line has been dominant for more than 30 years. So a great deal is not lost. The choice of the operating system does not have such a great impact on program speed. We pick Linux because it is the preferred platform for scientific computing and because it is open source. Because it is open source, we can peer into its inner workings when necessary to understand program speed.
A computer program is mathematical logic in action. The picture of the computer that emerges from this book shows how layered, and therefore complex, that logic can be. There is of course the design of the program that we ourselves write. But that is only a small part of the overall design. There are other computer programs, the biggest of which is the operating system kernel (the Linux kernel for our purposes), which handle the program we write and make it run on the computer. These systems programs make the computer a more tractable device, hiding the complexity of many of its parts, such as the hard disk or the network interface. The hardware, which includes the processor architecture and memory system, is itself designed using complex logic similar in kind to the computer programs we ourselves write but vastly different in degree of complexity.
The conventional approach to high-performance computing revolves around general principles, such as Amdahl’s law, weak and strong scaling, data and functional parallelism, SIMD/MIMD programming, load balancing, and blocking. The approach taken in this book is diametrically opposite. Our focus really is on understanding the computer as a whole, especially as viewed through programs. We dig into linkers, compilers, operating systems, and computer architecture to understand how the different parts of the computer interact with programs. The general principles are useful in getting a sense of the average program. However, in writing any particular program, the general principles are never so straightforward to apply. Understanding the parts of a computer can be far more nettlesome and also fascinating. Once that is done, the general principles, insofar as they are useful in actual programming, become self-evident.
The main principles of design that will concern us and that have an impact on program speed have not changed for decades. The approach we adopt is to begin with specific programs that are generally quite simple in what they do. We move up to general principles gradually using these specific programs as a vehicle. There are two advantages to this approach. First, it makes the discussion vivid, more organized, and far less reliant on rules of thumb that may appear arbitrary. Second, it lends useful context to the principles that inform the writing of well-optimized computer programs.
The context is essential. There are no general principles of the range, precision, and completeness of Schrödinger’s equation in this area. Such principles as there are do not go far beyond common sense. Take, for example, the idea of load balancing. It says that if a task is to be divided between equally capable workers, we are better off dividing the task equally, rather than burdening a single worker with an excessive share of the work. There can be absolutely no doubt that this principle has been known for millennia. Without the appropriate context, the principle is quite sterile. Indeed, to call it a principle seems an exaggeration.
Overview of chapters
Chapter 1↓ begins with a rapid review of C/C++. The reader is assumed to have an undergraduate knowledge of C/C++ programming. Our review emphasizes a few parts of these languages that students typically don’t learn in introductory classes. The C programming language is the most fundamental of all languages, to the extent that one can no longer speak of a computer without C. The C language is close to the machine and provides only a basic, although highly valuable, layer of abstraction. The C++ language is a colossal extension of C that includes many mechanisms for representing abstract concepts to bring the program closer to the problem domain. The idiom of C++ we use is close to C. However, we discuss more C++ than we actually need, partly to dispel myths about C++ being slow.
Libraries and makefiles are the basis of modular programming in C/C++. In chapter 2↓, we explain how libraries and linkers work. Although we do not recommend Fortran, it is not uncommon to have to use old Fortran 77 programs in scientific programming. This chapter explains how to call Fortran programs from C/C++. Knowledge of the GNU make utility, explained in this chapter, is far more valuable than knowledge of complex C++ syntax. There is no modular programming in C/C++ without make. Yet it is a neglected topic.
A significant part of the answer to what makes programs fast or slow is contained in chapter 3↓. Compilers convert C/C++ programs into a stream of machine instructions, native to the processor. A small subset of x86 assembly language is introduced and used to understand how structuring loops in different ways leads to different instruction streams.
Although part of chapter 3↓ is about optimizing for the processor pipeline, which has to be programmed in assembly language, in general we do not recommend programming in assembly. The small part of x86 assembly language introduced in that chapter is mainly used to understand how loop-nests map to machine instructions. There can be no rational account of program speed without a discussion of machine instructions. The discussion of assembly language is like a ladder that helps us understand how to structure loops and loop-nests, one of the most important tasks in optimizing a program. As in Wittgenstein’s famous metaphor, the ladder can be thrown away once the understanding is gained.
Even skilled programmers are often unaware of registers. In chapter 3↓, we not only talk about registers but also introduce techniques for optimizing for the instruction pipeline. This type of optimization can improve program speeds considerably---by as much as a factor of 3 to 15 in the example discussed in chapter 3↓. However, the reader may wonder whether these techniques may become obsolete quickly because of rapid changes in hardware. The answer is an emphatic no. Although hardware changes rapidly, the design principles do not change so fast. In addition, the value of optimizing for the instruction pipeline has increased over time. In the example discussed in chapter 3↓, such optimization implied a speedup by a factor of 3 in 2010, whereas by 2015 the speedup was much greater, nearly a factor of 15. Yet the programming methodology is unchanged.
Although optimizing for the instruction pipeline is a very difficult skill, it can be a valuable one. It can speed up C/C++ programs by a factor of 10, or even a factor of 100 if the C/C++ coding is naive to begin with. As an argument for the value of instruction pipeline optimizations, we mention that although discussion of new algorithms overwhelmingly dominates the scientific literature, algorithms that speed up any nontrivial computation by a factor of 10 or even a factor of 2 are rare.
Chapters 4↓ and 5↓ have wider applicability than chapter 3↓. The sort of loop optimizations discussed in chapter 3↓ apply mainly when data is regular, as in image processing or the solution of differential equations. However, memory optimizations, discussed in chapter 4↓, are useful in all kinds of programming. For simplicity, our examples use regular data, but the same principles apply even when dynamic data structures such as linked lists or graphs are used.
Much of program optimization is optimization of memory access, and many of the principles are the same for single- and multithreaded programs. Memory usually refers to Dynamic Random Access Memory (DRAM), which can be tens, or even hundreds, of gigabytes in extent. Smaller but faster caches are maintained to speed access to DRAM, by nearly a factor of 100 when accesses hit the cache. In addition, memory is organized into pages to give an independent view of memory to disparate processes. Memory accesses may be optimized for both caching and the paging system.
Perhaps an even more important point that arises in chapter 4↓, as a natural consequence of the discussion of instruction level parallelism in chapter 3↓, is the role of parallelism in the memory system. Programs that allow the processor to parallelize multiple memory accesses will be much faster. Thus, for example, there can be a huge difference in speed between a program that uses linked lists, which disrupt parallelism, and another program that accomplishes the same task by accessing an array in sequence.
The clock speed of processors stopped improving around 2005. Much of the growth in processing power since then is from putting more and more processor cores inside the same computer. All these processor cores share the same memory. Programming with threads, the topic of chapter 5↓, derives its importance from this well-established trend. Many threaded programs can be written quite easily. Yet there are always subtleties under the surface whenever different threads or processes share memory.
Many of the subtleties of threaded programming are related to the memory system. Threaded programming is impossible without coherent caches, and any programmer who writes threaded programs in C/C++ without understanding as much will be befuddled sooner rather than later. Even the simplest threaded programs for adding a sequence of numbers, using OpenMP or Pthreads, rely on the memory model in ways that are often not appreciated.
In addition to the memory system, threaded programs interact with the operating system. The simple act of creating threads can involve an overhead that swamps the benefits of parallelizing. When is it advantageous to invoke multiple threads? Why is it not a good idea to change the number of threads between OpenMP parallel regions? Why should threaded programs avoid abusing the stack? These questions are answered in a rational manner once the role of the operating system is understood.
Much of a C/C++ programmer’s time is spent dealing with segmentation faults. The precise manner in which segmentation faults and other memory errors arise from inside the operating system is explained in chapter 5↓. Chapter 5↓ concludes what may be seen as the core of this book. Topics covered up to this point are relevant to many kinds of programming, well beyond the scientific world.
The Top 500 list (see http://www.top500.org/), which uses a linear algebra problem to benchmark and then rank the fastest computers in the world, has provided powerful impetus to scientific computing. For more than two decades, the most powerful computers in the world have been clusters where many computers are tightly connected using a high-performance network. In such clusters, concurrent processes communicate over the network by sending and receiving messages. Message passing is the topic of chapter 6↓.
An example of the outcome of our choice to focus on computer architecture instead of general principles may be found in this chapter. The general principle is to overlap processor and network activity. Few can contest the utility of such a principle or argue that there is anything in it beyond ordinary common sense. However, our discussion of the matrix transpose in chapter 6↓ shows that it requires deep knowledge of network architecture to in fact overlap processor and network activity, although the general principle is quite obvious. It is precisely that type of knowledge that gets lost when we do not look past general abstractions and examine details of systems software and computer architecture.
When a large group of computers is tightly coupled, message passing is the preferred paradigm. Because there is no truly credible challenge to message passing in the context of supercomputing, the largest physics computations are likely to continue to rely on message passing. However, a single computer today can tackle complex 3D problems with more than a billion grid points. The vast majority of scientific projects can be effectively implemented on a single node. When a single computer is so powerful, the additional difficulty of resorting to message passing between multiple computers, which can be considerable, becomes less attractive.
Market forces that are propelling the Internet are powerful, indeed amazingly powerful, and should not be ignored by the scientific programmer. The pertinence of the Internet is more obvious in newer areas such as genomics and data science than in classical physics. Chapter 6↓ includes a section about the Internet. As in other parts of the book, here too we look behind the abstractions to understand the plumbing of the Internet.
The final two chapters, chapters 7↓ and 8↓, are about coprocessors. In 2007, the NVIDIA corporation showed how its graphics processors can be used to speed up a range of scientific problems. The excitement that resulted was undoubtedly justified. The graphics coprocessors hint at the possibilities for architectural design that may be available beyond the x86 line. Intel, which could not afford to ignore this competition, has introduced its own coprocessor. It must be said that excitement about coprocessors has not been matched by utilization. When the coprocessor competes with the processor to execute the same tasks, there is a major disruption of modularity at the hardware level. The resulting heterogeneity makes it difficult to break tasks into subtasks in a uniform manner. Heterogeneity overly complicates programming models, which are already quite complicated.
The brief appendix may be the right place to begin reading this book. The appendix begins with a table of machines used to time programs. The rest of the book makes frequent references to that table.
Although interpreted languages such as Python or Matlab are easier to use, the resulting programs will be much slower. How much slower does not seem to be widely understood, and the appendix dispels a few myths. This author has heard estimates of the slow-down factor ranging from a few percent, to a factor of 2, to a better informed guess of a factor of 10. In fact, the interpreted languages can be several hundred times slower for even fairly simple programs that run on a single core. As the complexity of the program and the hardware platform increases, the slow-down penalty can get much worse. Even for moderately complex programs, the slow-down can be by a factor of 10^4, as we note in the appendix. If the effort for mastering C/C++ is much greater, so is the reward.
The entire source code corresponding to this book, which runs to more than 15,000 lines of C/C++, is available at https://github.com/divakarvi/bk-spca. The appendix briefly introduces two tools---GIT and the cscope utility---essential for downloading and working with the code. Both GIT and cscope are of great value in programming in general. Even in the era of Internet search, cscope, which has been around since the 1970s, is an excellent option for browsing and searching source trees.
The examples in this book rely on Intel’s icc/icpc compiler. However, except for chapter 7↓, the widely used, easily available, and open source gcc/g++ compiler may be substituted with little trouble. The few nuances that arise are described in the appendix.
Acknowledgments
Most of all, I thank my undergraduate teachers at the Indian Institute of Technology, Bombay. From them I began to learn much of what is found in this book.
I taught a graduate class based on this book on four occasions at the University of Michigan. The class typically covered about a third or more of the material in the book, with greater emphasis on the beginning sections in each chapter. I thank all the hundred or so students who took that class. Thanks in particular to Zhongming Qu, who penetrated the material quite deeply. I am especially grateful to Zhongming for helping me understand Makefiles much better.
I was privileged to have access to superbly maintained systems at the University of Michigan. I thank Bennet Fauber, Brock Palen, Andy Caird, Neil Tweedy, Charles Antonelli, Reed Hoyer, and Rusty Dekema for their help. I am especially grateful to Seth Meyer for showing me how to build and load the Linux operating system. All who work with computers are aware of the peculiar difficulty of getting started. In the memorable words of Kernighan and Ritchie, it is a big hurdle, and once crossed “everything else is comparatively easy.” In addition to showing me how to get started with Linux, Seth also freely shared his deep knowledge of the Internet with me.
The last three chapters of this book were written using systems deployed at the Texas Advanced Computing Center (TACC), with access obtained through XSEDE. I am thankful to XSEDE as well as TACC for the wonderful support they offered. At TACC, I am especially grateful to Chris Hempel for being so accommodating. The technical help desk at TACC answered numerous questions with unfailing promptness and helped me in many ways. At the risk of omission, I thank Doug James, Lars Koesterke, Hang Liu, Si Liu, Robert McLay, Cyrus Proctor, and Frank Willmore.
I thank Paul Besl, Tim Prince, and Mike Tucker of Intel Corporation for being gracious and helpful when difficulties arose with the Westmere microarchitecture. I also thank Intel’s Russian team for resolving these technical difficulties and Tim Prince for his expert comments.
I thank my colleagues Danny Forger and Hans Johnston for offering advice and much needed support.
I wrote this book using LyX, relying on Inkscape to produce figures. The html version of the book was produced using eLyXer. I thank all the people responsible for these wonderful open source tools. Technical information accessed via the Internet was invaluable, and I have acknowledged this help wherever possible.
Finally, I am grateful to Dr. John Sarno for curing my chronic back pain through his books and to Dr. Howard Schubiner for a helpful consultation.

1 C/C++: Review

A computer program is a sequence of instructions to a machine. In this chapter and the next, we emphasize that it is a sequence of instructions that builds on other computer programs and that in turn can be built on. Codes that exist in isolation are often limited to quite trivial tasks and can hardly be considered computer programs.
This chapter is a review of C/C++. The C programming language is the most fundamental of all programming languages. Computing machines come in great variety and are put together using many parts. The parts of a computer, with the processor at the center and memory, hard disk, network interfaces, graphics devices, and other peripherals connected to it, are very different from each other. It would be an almost impossible task for any single programmer to deliver instructions to such a complicated machine. The C programming language is a major part of the setup that gives the programmer a uniform view of computing machines. No modern computing device can exist and be useful without C.
Much of the time when programs are written, the programmer is not at all aware of the many parts of the computer. Indeed, the programmer may not even be aware that there is a processor. It is more natural to think in terms of the abstractions of the programming language. This is in general a good thing because the purpose of programming languages is to set up abstractions that hide the complicated parts of the computer. In addition, programs written in this way can be moved from computer to computer easily.
In this book, our goal is to understand what makes programs fast or slow. As soon as we set ourselves this goal, we have no choice but to peer behind the abstractions of programming and understand how those abstractions are realized through the many parts that constitute a computer. The C programming language is the most natural vehicle in moving toward this goal. In fact, it is the only vehicle that is really appropriate.
The C programming language is close to the machine. There are high-level languages, a notable example being Python, which bring programming much closer to the problem domain. Concepts and ideas intrinsic to the problem domain are expressed far more easily in these programming languages. Programs that would take days to write in C can be written in hours, or even minutes, in Python.
Although these high-level languages are much easier on the programmer, programs in these languages run slowly. As shown in the appendix, these high-level programs can be more than 100 times slower, or even worse. In fact, even C programs written without a knowledge of computer architecture can be several times slower than C programs written with that knowledge. The C++ language is a compromise. It strives to combine the speed of C with the abstraction facilities of higher level languages. It can be quite useful, although it lacks the simplicity and elegance of C. The C++ idiom we use is close to C. Some of the facilities of C++, mainly the facility of defining classes, are adopted to make C a little easier and a little more presentable.
The review of C/C++ in this chapter attempts to bring out certain features that people often do not learn from a single course or two. Beginning programmers often tend to think of a C/C++ program as a single .c or .cpp source file. Modular organization of sources is far superior. Modular organization is essential for writing programs whose structure reflects and explains how they work as well as the underlying concepts.
Section 1.1↓ sets up an example to exhibit the modular organization of sources in C/C++. The example is a technique called the Aitken iteration, which can transform certain sequences to hasten their convergence. In sections 1.2↓ and 1.3↓, we review some features of C and C++ using this example. The concluding section 1.4↓ introduces a little Fortran. For reasons explained at length later, we do not recommend programming in Fortran. However, a lot of scientific programs are written in Fortran, mainly Fortran 77. A scientific programmer needs to know just enough Fortran to be able to use these old Fortran codes.

1.1 An example: The Aitken transformation

The Aitken transformation maps a sequence of numbers to another sequence of numbers. It serves as the vehicle to introduce aspects of C, C++, and Fortran in this chapter. It is also interesting in its own right.
The Aitken transformation is given by the following formula:
(1.1) t_{n-1} = s_{n-1} − (s_n − s_{n-1})^2/(s_{n+1} − 2s_n + s_{n-1}).
It transforms a sequence s_0, s_1, …, s_N into a new sequence t_0, t_1, …, t_{N-2}, which has two fewer terms. [1]  [1] Sequence extrapolation and Aitken iteration are treated in [8, 3]. The idea behind the Aitken transformation is as follows. If the s_n sequence is of the form s_n = S + aλ^n, all terms in the t_n sequence are equal to S. It is useful for speeding up the convergence of a number of sequences, even those that do not directly fit the S + aλ^n pattern. Section 1.1.1↓ illustrates the dramatic power of the Aitken iteration on a couple of examples---the Leibniz series and the logarithmic series. To be sure, these examples are chosen carefully.
This section begins to make the point that it is generally advantageous to split a program into multiple sources. We could use a single source file to code the Aitken iteration and apply it to the Leibniz series as well as the logarithmic series. Such a program would work just as well to begin with. A few days later, we may want to apply the Aitken iteration to another example. If we also throw that example into the same source file, the source file will become a little more unwieldy. A few months later, we may want to use the Aitken iteration as part of a large project. If we insist on using a single source file, there are two equally unpleasant alternatives: copy the whole Aitken program into the large project or copy the large project into the Aitken program. There is a heavy price to pay for avoiding modular organization of programs. Section 1.1.2↓ gives a preliminary discussion of the modular organization of program sources using the Aitken iteration as an example. Later sections build on this preliminary discussion.

1.1.1 Leibniz series and the logarithmic series

The Leibniz series [2]  [2] The Leibniz series was discovered by Madhava (c. 1350-c. 1410), well before Leibniz (1646-1716). Madhava reported it in the following passage: “Multiply the diameter by 4. Subtract from it and add alternately the quotients obtained by dividing four times the diameter by the odd integers 3, 5, etc.” ([30]). is given by
π = 4 − 4/3 + 4/5 − 4/7 + 4/9 − 4/11 + ⋯
This series, whose terms alternate in sign and diminish in magnitude monotonically, will be a recurring example. So we will begin by looking at it carefully. Let S_n be the sum of the first n terms. As shown in figure 1.1↓, the partial sums S_n are alternately above and below π. Further, the convergence is slow. In fact, |π − S_n| > 2/(2n + 1), which implies that the first million terms of the Leibniz series can give only slightly more than six digits of π after the decimal point.
[Figure: FIGS/chapter1/LeibnizCvg.png]

Figure 1.1 Convergence of the Leibniz series to π (dashed line).
If we look at figure 1.1↑, we may notice that although the convergence to the limit is slow, the partial sums seem to follow a certain trend as they approach π. The partial sums are alternately above and below, and it seems as if we can fit a smooth curve through the iterates. The Aitken iteration guesses this trend quite well and speeds up the convergence of the Leibniz series.
Table 1.1↓ shows Aitken’s transformation (1.1)↑ applied repeatedly to the first 13 partial sums of the Leibniz series. After each application, we have a sequence with two fewer numbers, and at the end of the sixth application of the Aitken transformation, we have just one number that equals π to 10 digits of accuracy. Because none of the 13 partial sums gives even the first digit after the decimal point, it seems astonishing that an answer with 10 digits of accuracy can be produced from those numbers.
4.0000000000 3.1666666667 3.1415926540 3.1415926536
2.6666666667 3.1333333333 3.1415926535
3.4666666667 3.1452380952 3.1415926536
2.8952380952 3.1396825397
3.3396825397 3.1427128427
2.9760461760 3.1408813409
3.2837384837 3.1420718171
3.0170718171 3.1412548236
3.2523659347 3.1418396189
3.0418396189 3.1414067185
3.2323158094 3.1417360993
3.0584027659
3.2184027659
Table 1.1 The first column lists the first 13 partial sums of the Leibniz series. Every other column is gotten by applying the Aitken transformation (1.1)↑ to the previous column. The number at the upper right corner is π correct to 10 digits.
Computing the digits of π is a mathematical sport of unending interest. Even Isaac Newton had a weakness for it. [3]  [3] See [12] (p. 112). The Aitken iteration, although impressive, is far from being the best method for computing π. [4]  [4] For the mathematics of approximating π,  see [9].
The logarithmic series [5]  [5] The logarithmic series was discovered by Nicholas Mercator (c. 1620-1687). See [11]. log(1 + x) = x − x^2/2 + x^3/3 − x^4/4 + ⋯ diverges for |x| > 1. However, as shown in table 1.2↓, the Aitken transformations of the first 13 partial sums recover the value of log(1 + x) for x = 1.25. If x were larger, just the first 13 partial sums would not be enough to produce 10 digits of accuracy after the decimal point.
x Partial Sum Extrapolate
0.00 0.0000000000 0.0000000000
0.20 0.1823215568 0.1823215568
0.40 0.3364723763 0.3364722366
0.60 0.4700395318 0.4700036292
0.80 0.5895867562 0.5877866649
1.00 0.7301337551 0.6931471806
1.25 1.5615505069 0.8109302162
Table 1.2 The partial sum column is the sum of the first 13 terms of the Taylor series of log(1 + x). Each number in the last column is produced by applying the Aitken transformation 6 times to the first 13 partial sums, which leaves us with just one number. The last column shows that number, which gives log(1 + x), with all the digits shown being correct.
Here we have our first programming problem. There is a simple iteration (1.1↑) to begin with. The problem is to code it in C/C++ and apply it to the Leibniz series and the logarithmic series.

1.1.2 Modular organization of sources

Before we delve into the syntax of C/C++, let us look at how to structure a program for the Aitken iteration. In particular, we look at how to split the program between source files.
It is not uncommon to introduce computer programming using programs that reside in a single source file. [6]  [6] Even programs limited to a single source file rely on the runtime library. Strictly speaking, there is no such thing as a program that resides entirely in a single source file. The notion of a computer program as a single source file is a bad idea to allow into one’s head, and it is one that can grow and grow. This author has heard of Fortran programs longer than 100,000 lines in a single file. Even the simple Aitken example shows why a single monolithic source is a bad idea.
The Aitken iteration, as we have discussed it so far, consists of the iteration (1.1↑) and its application to the Leibniz and Mercator series. One way to write this program is to code a function for the Aitken iteration, another function to set up and apply it to the Leibniz series, and likewise yet another function to apply it to the logarithmic series.
Breaking up programs into functions is the first step. However, throwing all three functions into the same source file would not be a good idea, for a simple conceptual reason. The Aitken iteration is a general technique, and its applications to the Leibniz and logarithmic series are two specific examples. Coding all the functions in the same source file limits the usefulness of the Aitken iteration.
In later sections, we will implement the Aitken iteration and the two examples in three separate source files. The first source file aitken.c codes the Aitken iteration in a form that would apply to any sequence. The specific applications are in the source files leibniz.c and logseries.c.
If the program is separated into three sources in this manner, the question of how they may interface with each other and work together arises. The first part of the interfacing is a header file that we will call aitken.h. This header file gives a basic summary of what is found inside aitken.c but does not have the implementation of any function. The source files leibniz.c and logseries.c include this interface within their code. Thus, the sources will look as follows:
aitken.h
aitken.c
leibniz.c
logseries.c
with the header file aitken.h included in all three .c source files.
This manner of breaking up a program into multiple source files is deeply integrated into C/C++. A C/C++ compiler converts a source file into a sequence of machine instructions independently of all other sources. For this reason, a source file is sometimes called a compilation unit. In our Aitken example, three object files aitken.o, leibniz.o, and logseries.o, holding machine instructions corresponding to the respective source files, result from compilation.
The source leibniz.c will include calls to functions defined in aitken.c. The header aitken.h, which is included in leibniz.c, provides just enough information to partially set up these function calls when the machine instructions of leibniz.o are being generated. The crucial information that is missing is the address and definition of the function in aitken.c that is invoked in leibniz.c. That information is supplied only when the object files aitken.o and leibniz.o are combined into an executable, which we will call leibniz.exe, [7]  [7] Although we assume the GNU/Linux operating system, the executable files will use the .exe extension as in Windows for greater clarity. by the linker. It is the linker’s job to eliminate unresolved references in the object files and put together an executable without any unresolved references. The executable is the program that ultimately runs on the computer.
figure FIGS/chapter1/aitken_modules.png
Figure 1.2 The three program sources are aitken.c, leibniz.c, and logseries.c. They are combined to produce two different executables, for extrapolating the partial sums of the Leibniz series and the Taylor series of log(1 + x), respectively. The .o object files are omitted from this figure.
We will make the discussion of compilation and linking far more concrete in later sections. The overall picture is as follows: the program is separated into multiple source files, the source files are compiled separately into object files holding machine instructions, and the linker merges the object files and eliminates unresolved references to produce an executable. The overall picture for the Aitken program is shown in figure 1.2↑.
Exercise:  Substitute s_n = S + aλ^n into the Aitken formula (1.1↑) and verify that each new term t_n is equal to S.
Exercise:  In (1.1)↑, S = s_{n−1} if s_{n−1} = s_{n−2}, and there is a divide by zero if s_{n+1} − 2s_n + s_{n−1} = 0. Interpret both conditions.
Exercise:  Prove that
1/(1 + x^2) = 1 − x^2 + x^4 − ⋯ + (−1)^{n−1} x^{2n−2} + R_n, 
where R_n = (−1)^n x^{2n} ⁄ (1 + x^2). Integrate from 0 to 1, while noting that
∫_0^1 x^{2n}/(1 + x^2) dx < 1/(2n + 1), 
to deduce the Leibniz series.
Exercise:  If S_n is the partial sum of the first n terms of the Leibniz series for π (not π ⁄ 4), prove that |π − S_n| > 2 ⁄ (2n + 1), showing that the Leibniz series converges slowly.
Exercise:  To understand the Aitken iteration, it is helpful to look at singularities. Prove that the function arctan z has singularities at z = ±i in the complex plane. Determine the type of the singularities.

1.2 C review

The C programming language is concise. In this review, we go over a few features of C, emphasizing points that introductory classes often miss. Thus, we discuss header files, arrays, and pointers, with emphasis on the distinction between lvalues and rvalues, as well as the distinction between declarations and definitions. We go over the compilation and linking process while adding more detail to the picture in figure 1.2↑. Although C is concise, it demands precision in thinking. The emphasis in this review is toward greater precision. Table 1.3↓ shows two of the best books to learn C and C++. [8]  [8] [1] is quite possibly the most influential book in computer science. There is no better place to begin to learn the art of programming.
C programming: B.W. Kernighan and D.M. Ritchie, The C programming language, 2nd ed.
C++ programming: B. Stroustrup, The C++ programming language, 3rd ed.
Table 1.3 Books on C and C++ written by their inventors.

1.2.1 Header files

A header file in C is an interface to one or more source files, with each source file a collection of function definitions. As an example of a simple header file, we look at aitken.h. This header file is an interface to the functions defined in aitken.c.
1  #ifndef aitkenJuly09DVsfe
2  #define aitkenJuly09DVsfe
3  void aitken(double *seq1, double *seq2, int len);
4  double aitkenExtrapolate(double *seq1, double *seq2,
5                           int len);
6  #endif
The two symbolic names aitken and aitkenExtrapolate are introduced in this header file. These are the names of two functions that are defined in aitken.c. For the two functions, the declarations in the header file specify the type of the arguments as well as the type of the value that is returned. Thus, this header file is saying that aitken() is a function that takes three arguments---the first two of type pointer to double and the last of type int---and returns nothing (void). The function aitkenExtrapolate() is stated to take three arguments of the same types, but it returns a double.
Both of these are examples of function declarations. The declarations specify the arguments the functions take in as well as the types of what they return. When a declaration is made, the function is stated to exist somewhere. However, a sequence of statements defining the functions is lacking.
The source files aitken.c, leibniz.c, and logseries.c include aitken.h within their text using the line #include "aitken.h". The #include statement has the effect of splicing all of aitken.h into those source files. The source aitken.c includes the header aitken.h and goes on to define the functions declared in the header. The arguments and return type of a function at the point of definition must exactly match the declaration in the header. The sources leibniz.c and logseries.c include the header aitken.h with a different intent. Their intent is to obtain license to make calls to the functions aitken() and aitkenExtrapolate().
The header file acts as an interface, and it is natural, as well as typical, to include comments in it about the arguments to functions declared in it and about what those functions do. In this book, there are few comments in the code listings. The comments are instead found in the text.
The function aitken() (line 3) applies one step of the Aitken transformation (1.1↑) to the array seq1[0..len-1] and leaves the result in the array seq2[0..len-3]. The function aitkenExtrapolate() (line 4) applies the Aitken transformation repeatedly to the array seq1[0..len-1] while using seq2[] as scratch space, until only one or two numbers are left. One of the remaining numbers is returned as the extrapolated value.
At the beginning of aitken.h and at the end, three lines (1, 2, and 6) begin with the special character #. These lines are directives to the macro preprocessor, one of the initial phases of compilation. Sources are manipulated textually by the macro preprocessor.
The source file leibniz.c includes aitken.h using a #include directive. When the compiler is applied to leibniz.c, the initial preprocessing stage handles the #include directive. When processing that directive, the file aitken.h is opened by the preprocessor. The first directive in that file (#ifndef on line 1) asks the preprocessor to check whether a name called aitkenJuly09DVsfe is already defined. If the name aitkenJuly09DVsfe is not defined, the second line of aitken.h (#define directive) asks the preprocessor to define such a name, which it does. The preprocessor splices all the text between the second and last lines of aitken.h (lines 3 to 5) into the source file that issued the #include "aitken.h" directive. The splicing into leibniz.c is done at the point where the #include directive was issued.
However, if such a name is already defined (because the header file was included by an earlier directive), it skips over everything until it sees a #endif directive (line 6); in this case, it would be skipping the entire aitken.h file.
Using #ifndef, #define, and #endif directives ensures that if a directive to include the same header file is issued twice by the same source file, the second directive has no effect. That may not sound useful at first. However, in practice, a source file may include a header file that includes some other header file and so on. So two directives to include different header files may end up including the same header file twice. Such a thing is prevented by the combination of preprocessor directives we just discussed.
The macro variable aitkenJuly09DVsfe is chosen to be complicated to avoid accidentally using the same name in two different header files. Such accidental reuse will mean that the two header files can never be simultaneously included in the same source. Such errors at the macro preprocessing stage can be a little bothersome.

1.2.2 Arrays and pointers

Arrays and pointers are the heart of the C language. An array is a sequence of locations in memory. A pointer-type variable is a variable that holds the address of a memory location. The word pointer can be used for either such a variable or an expression that evaluates to an address. The two concepts may appear different, but the C language blurs the distinction between them. In C, arrays and pointers are almost interchangeable.
There is a very good reason for blurring the distinction between arrays and pointers. Suppose we want to pass a long sequence, occupying a great deal of memory, as an argument to a function. It would be wasteful to allocate new memory and copy the sequence entry by entry at every function call. In C, arrays are passed as pointers.
The key idea in almost identifying arrays with pointers is as follows. A sequence of data items in memory can be specified using three pieces of information: the address of the first item, the size in bytes of each item, and the number of entries in the sequence. In C, a pointer holding an address is specified as a pointer of a certain type. The size of each item is inferred from type information. For example, in an array of doubles, the size of each item is 8 bytes and in an array of ints the size of each item is 4 bytes (on GNU/Linux). Thus, from merely knowing a pointer, we can infer the first two pieces of information: the address of the first item in memory and the size of each item in bytes. The last piece of information, namely, the length of the array, is often tagged along separately. Thus, arrays and pointers may be identified, and arrays may be passed to functions efficiently as pointers with the length of the array tagged along as an extra parameter.
We will presently examine how this idea plays out in practice.
Arrays, pointers, lvalues, and rvalues
figure FIGS/chapter1/MemoryInC.png
Figure 1.3 A schematic and simplified view of computer memory. The symbol x is of integer type, whereas the symbol a is a pointer. Addresses are shown on top.
The C language takes a certain view of computer memory, and arrays and pointers are both best understood in terms of computer memory. Figure 1.3↑ is a schematic view of a portion of the computer memory (in hardware, this would be DRAM in most circumstances). Names that we introduce into our code---whether they correspond to variables of basic types such as char or int or double, or to arrays, or to pointers---will all ultimately correspond to locations in computer memory. [9]  [9] In practice, as opposed to C semantics, program variables may be allocated using registers or DRAM. They may even be eliminated during compiler optimizations. To introduce a name of a variable of basic type, we may use a definition such as int x; to introduce a name of an array, we may use a definition such as int list[3]; to introduce a name of a pointer, we may use a definition such as int *a.
In C semantics, an expression may evaluate to a value that is the name of a memory location. Such a value is called an lvalue. The “l” refers to the fact that such values may occur on the left-hand side of an assignment. In contrast, an expression may evaluate to a value that may be used to fill a memory location but is not necessarily the name of any location in memory. Such a value is called an rvalue. The “r” here refers to the possible occurrence of such a value on the right-hand side of an assignment statement. The concept of rvalues and lvalues is useful for understanding arrays and pointers.
The distinction between lvalues and rvalues arises fundamentally because of the assignment statement. In an assignment, what occurs on the left is the name of a memory location or an lvalue. What occurs on the right is a value that is used to fill a memory location or an rvalue. The distinction is important in the context of pointers and arrays because, although a pointer is naturally thought of as an address and an address is nothing but a number naming a memory location, pointers themselves may be stored in pointer-type variables.
If the variable x is introduced using the definition int x, it is an lvalue because it is the name of a memory location as shown in figure 1.3↑. If the assignment statement x=37 executes, the location is filled with 37 as shown in the figure. The variable x can be both an lvalue and an rvalue. In the statement x=x+7, the occurrence on the left is an lvalue and the occurrence on the right is an rvalue.
Addresses are shown in figure 1.3↑ in a slightly nonstandard way. At the top of the figure, the address of the memory location named by x is shown as A(0..3). On a typical computer system today, A may be understood as a 64-bit (8-byte) address. An int occupies four bytes in GNU/Linux, and the addresses of the four bytes named by the variable x are A + 0, A + 1, A + 2, A + 3. This is shown in the figure as A(0..3).
The C language allows us to take the address of a variable (in general, the address of any lvalue). To get the address of the variable x of type int, we may use the syntax &x. Although the int location is 4 bytes with addresses A + 0, A + 1, A + 2, A + 3, the value of &x is A, which is the address of the first byte in the memory location named by x. Here &x is an rvalue (its value being A) but not an lvalue because it is not the name of any location in memory. In the same expression &x, x is an lvalue.
We may define a pointer using the syntax int *a. As shown in figure 1.3↑, a is the name of 8 bytes of memory and not just 4, as in the case of x. This is because the pointer-type variables are meant to hold addresses, and as we have already said, an address is 8 bytes on most computers today.
If we now say a=&x, the memory location named a gets filled with the address of x, which is A in figure 1.3↑. If we were to say a=&x+1, the location a would get filled with A + 4 and not A + 1. That is because both &x and a are of type pointer to int, and the C compiler knows that an int is 4 bytes and not just 1 byte. In the assignment a=&x, a is an lvalue and &x is an rvalue. Less obviously, x is an lvalue in the same expression. The distinction between names of memory locations (lvalues) and values that may be used to fill memory locations (rvalues) is valuable to keep in mind.
The operator & allows us to extract the address of an lvalue. Conversely, the operator * converts an address to the name of a memory location (an lvalue). So if we say a=&x and then say *a=7, the effect is as follows. First, the location named a is filled with the address of x. Next, when we say *a=7, *a is the name of the location whose address is the value held in a. The lvalue in this assignment is *a, not a. Of course, the location named by *a is the same as the location named by x. So the effect of *a=7 is to change the value of x to 7.
It is worth noting that the word pointer may refer to an lvalue or an rvalue. A variable, or any lvalue, that holds addresses may be called a pointer. In addition, the address itself (an rvalue) may be called a pointer. The picture behind either usage is of an address pointing to a location.
If e is an expression that evaluates to a pointer to a double (rvalue), the numerical value of e+1 is 8 more than the value of e. That is because a double is 8 bytes. If e is a pointer to a more complex type such as a struct, the C compiler calculates the size of the struct in bytes and increments e by that amount when evaluating e+1. Thus, e+1 advances the pointer by one item and e+27 advances the pointer by 27 items, taking into account the data type that e points to. If p is a pointer-type variable, the assignment p=p+17 advances the pointer by 17 items. Likewise, the assignment p=p-17 moves the pointer backward by 17 items.
An array may be introduced using a definition such as int list[3]. After this definition, list is an rvalue but not an lvalue (it is not the name of any memory location). The (r)value of list is the address of the first of three consecutive memory locations of type int set aside by the definition. In figure 1.3↑, it is A + 12. If we use syntax such as list[1]=8, list[1] is an lvalue as shown in the figure. It is nothing but an abbreviation of *(list+1). Similarly, list[0] and list[2] are also valid lvalues as shown in the figure. In fact, list[100] is also a syntactically valid lvalue. The only problem is that an assignment such as list[100]=2 will most likely lead to a runtime error because only three int locations have been legally claimed. Such runtime errors are triggered by the operating system in a manner that is explained in a later chapter.
We began by noting that a sequence may be specified using three pieces of information: a pointer to the first location, the size of each item, and the number of items. The definition int list[3] supplies all three pieces of information. The address of the first location is the value of list. The size of each item is 4 bytes (on GNU/Linux) because an int is four bytes. Finally, the number of items in the array is 3 as shown in figure 1.3↑. The first two pieces of information are contained in the type and value of list. If list is passed as an argument to a function, the length of the array, which is the third piece of information, must be supplied separately.
Machine and assembly languages, too, access data items in a sequence using an address, the size of each item, and the index of the item, much as in C.
A C source with arrays and pointers
So far our discussion of arrays and pointers has been with reference to figure 1.3↑. We will now write a simple C program illustrating the discussion.
Before looking at C source, a few comments about indentation are in order. For program source not meant for display in a book, we use eight-space indentation. [10]  [10] The Linux kernel programming guidelines require eight-space indentation. See www.kernel.org/doc/Documentation/CodingStyle. While noting that style is a matter of personal choice, Linus Torvalds, the creator of the Linux operating system, adds that using four-space indentation “is akin to trying to define the value of PI to be 3.” Tab stops are separated by eight characters, and terminal screens are conventionally 80 characters wide. When the program code is lined up according to tab stops, the code is much easier to browse. The nesting level of loops becomes readily evident. The nesting level is an indication of the level of complexity of the code. Therefore, it is useful to be able to recognize the nesting level immediately. Too many levels of nesting often imply that the code is poorly structured.
Let us look at the following code, which uses five-space indentation. The suggestion of eight-space indentation assumes 80-character-wide lines. Here the lines are about 50 characters, and the indentation has been scaled down proportionally.
1   #include <stdio.h>
2   int main()
3   {
4        int x;
5        int list[3];
6        int *a;
7        a = &x;
8        list[1] = 2;
9        *a = 35 + list[1];
10       printf("%p %p %d\n", &x, list, x);
11  }
We are allowed to say list[1] = 2; (line 8), but list = &x; would have been illegal. The reason is that list has a value but is not the name of any location. On one run, this code had the following output (line 10).
0x7fff3cb8e414 0x7fff3cb8e400 37 
The value of x is printed as 37, as we may have expected. The address of x and the value of list are printed in hexadecimal, as indicated by the 0x prefix. Thus, each printed address is 48 bits long. Figure 1.3↑ may give the impression that each address is the address of a location in physical memory. In fact, the addresses that are printed out are virtual addresses, a concept we will discuss in later chapters.
The printf() function used on line 10 is part of the standard C library. Its declaration will be in the standard header file stdio.h, which is included on line 1. The C compiler knows where to look for this header file.

1.2.3 The Aitken iteration using arrays and pointers

As already noted, arrays and pointers are almost equivalent in C. The principal advantage of thinking of arrays in this way arises in passing arrays as arguments to functions. Here we use the Aitken example to illustrate how arrays may be passed as pointers.
The file aitken.c begins with two directives:
#include <assert.h>
#include "aitken.h"
the second of which includes the header file aitken.h. Including aitken.h allows the compiler to ensure that the definitions in aitken.c are consistent with the declarations in the header file. The first line includes the standard header assert.h. The job of finding that header file is left to the compiler. Including that header file allows us to use the assert statement to check a condition in the body of the code (see line 4 below).
The function aitken() operates on arrays that are passed to it as pointers.
1   void aitken(double *seq1, double *seq2, int len){
2        int i;
3        double a, b, c;
4        assert(len > 2);
5        for(i=0; i < len-2; i++){
6             a = seq1[i];
7             b = seq1[i+1];
8             c = seq1[i+2];
9             seq2[i] = a - (b-a)*(b-a)/(a-2*b+c);
10       }
11  }
In a function that calls aitken() from some other source file, we may have defined two arrays using double s[100], t[100]. Those two arrays will correspond to two segments of memory, each equal to 100 doubles (see figure 1.4↓). As we have noted, s[0]...s[99] and t[0]...t[99] are names of double locations, which make up those segments of memory. In contrast, s and t have values of type double * but are not names of any locations in memory. If a call is made as
aitken(s, t, 100);
it has the following effect. The function parameters seq1 and seq2 are names of locations in memory that can hold double *. The values of s and t are copied to the locations in memory whose names are seq1 and seq2, respectively (see figure 1.4↓). The value of seq1+17 is the same as the value of s+17. Thus, seq1[17], which is exactly the same as *(seq1+17), is another name for the memory location s[17]. Thus, we see that by indexing into seq1 and seq2 as seq1[0]...seq1[99] and seq2[0]...seq2[99], we may refer to any entry in the arrays s and t defined by the caller (see figure 1.4↓).
figure FIGS/chapter1/fncall_array2pointer.png
Figure 1.4 Passing arrays as pointers. In this picture, s and t are values but not names of any locations.
There is a little catch here, however. What happens if we say seq1[100] or seq1[200]? We would be generating a name for a location in memory that was not legally claimed by the caller. That is likely to result in a run-time error. By just using the pointers seq1 and seq2, there is no way we can tell how long the array is. Therefore, the length of the array is the third parameter, which is named len, in the function definition. The caller has to explicitly give the length of the array, as it does here by passing 100 as the third argument.
Line 9 of the listing corresponds directly to the Aitken transformation formula (1.1)↑.
The assert(len>2) statement on line 4 works as follows. If the code is compiled with the option -DNDEBUG, it is as if that line were not there and no extra overhead is incurred. If that option is not used during compilation, the condition len>2 is checked during run-time. If it is violated, the program will abort and print a message indicating the name of the file and the line number of the assertion that turned out to be false. The assert statements are valuable aids to debugging and indirectly useful as comments.
The definition of aitkenExtrapolate(), which is also in aitken.c, is shown below to give a complete account of the source aitken.c.
1   double aitkenExtrapolate(double *seq1, double *seq2,
2                            int len){
3        int n, i, j;
4        n = len/2;
5        if(len%2==0)
6             n--;
7        for(i=0; i < n; i++){
8             aitken(seq1, seq2, len-2*i);
9             for(j=0; j < len-2*(i+1); j++)
10                 seq1[j] = seq2[j];
11       }
12       return (len%2==0)?seq1[1]:seq1[0];
13  }
We assume familiarity with the level of C that occurs in this function, although reading programs such as this can be harder than writing them. Writing good for-loops is the heart of C programming, and there are two nested for-loops in this program. Line 12 uses a conditional expression that may be less familiar. The conditional expression (a<b)?c:d tests the condition a<b. If the condition is true, the expression evaluates to c; otherwise, it evaluates to d.

1.2.4 Declarations and definitions

Names introduced into a C program are for the most part names of either functions or variables. The names can be introduced as either declarations or definitions.
Suppose a variable name is introduced using int x. When the compiler encounters that statement, it sets aside a location in memory for an int and makes x the name of that memory location. This statement is a variable definition, not merely a declaration, because it sets aside memory for x.
A declaration gives type information about a variable that is expected to be defined elsewhere. An example of a variable declaration is a statement such as extern int x. When the compiler encounters such a statement, it notes that x is a variable of type int that is expected to be defined in some other source file. It does not set aside any location in memory. If it later encounters a statement such as x=x+2, it does not complain. However, it cannot generate complete machine instructions to carry out that statement because it has no idea where x is defined and what address it corresponds to. That information has to be supplied by the linker later.
Both lines in the header file aitken.h are function declarations, not definitions. When the compiler encounters a declaration such as
void aitken(double *seq1, double *seq2, int len);
it notes that aitken() is the name of a function that takes three arguments, the first two of which are of type double * and the last of which is an int, and returns nothing (void). We can omit seq1, seq2, and len from the declaration. Such an omission would make the declaration difficult to read and understand for us, but it makes no difference as far as the compiler is concerned. The compiler does nothing more than note the types of the arguments (or parameters).
When the compiler later encounters a function call such as aitken(s, t, n), it first checks that s and t are of type double * and n is of type int. If the check succeeds, the compiler will generate instructions to set up the arguments and pass control to the function aitken(). If the function aitken() is defined in some other source file, which it may well be, the compiler has no idea where in memory the code for aitken() is located. So it cannot generate complete machine instructions for passing control. That job is the linker’s responsibility.
A function definition such as
void aitken(double *seq1, double *seq2, int len){
  ...
}
has a totally different effect. When the compiler encounters a function definition, it generates machine instructions for the body of the function (which is omitted here) and figures out where to place these instructions in memory. After that, aitken corresponds to the chunk of memory that contains machine instructions that implement the body of the function. Just as with variables, defining a function amounts to setting aside memory for it during compilation.

1.2.5 Function calls and the compilation process

Here we take a look at the mechanism of function calls and the compilation process. Much of the discussion is centered around the file leibniz.c, which uses the functions defined by aitken.c to extrapolate the partial sums of the Leibniz series and produces data corresponding to table 1.1↑.
The source file leibniz.c begins with two directives:
#include "aitken.h"
#include <stdio.h>
The first directive includes aitken.h to interface to functions defined in aitken.c. The second directive includes stdio.h to interface to printing functions defined in the stdio (standard input/output) library.
The leibniz() function, which generates partial sums of the Leibniz series, is defined below without comment.
//partial sums of 4(1-1/3+1/5-1/7+1/9-...)
void leibniz(double *seq, int len){
     int i;
     for(i=0; i < len; i++)
          if(i==0)
               seq[i] = 4.0;
          else if(i%2==1)
               seq[i] = seq[i-1] - 4.0/(2.0*i+1);
          else
               seq[i] = seq[i-1] + 4.0/(2.0*i+1);
}
The printseq() function prints a sequence using the printf() function defined in the stdio library.
void printseq(double *seq, int len){
     int i;
     printf("\n \n");
     for(i=0; i < len; i++)
          printf("%-.10f\n", seq[i]);
}
The logic used by the main() function for generating the data shown in table 1.1↑ is similar to that of aitkenExtrapolate(). In a C or C++ program, the function named main() is the first to gain control when a program is run.
1  int main(){
2       const int len = 13;
3       double seq1[len];
4       double seq2[len];
5       int n, i, j;
6       leibniz(seq1, len);
7       n = len/2;
8       if(len%2==0)
9            n--;
10      for(i=0; i < n; i++){
11           printseq(seq1,len-2*i);
12           aitken(seq1, seq2, len-2*i);
13           for(j=0; j < len-2*(i+1); j++)
14                seq1[j] = seq2[j];
15      }
16      if(len%2==0)
17           printseq(seq1, 2);
18      else
19           printseq(seq1, 1);
20 }
The main() function calls other functions on lines 6, 11, 12, 17, and 19. All calls except the one to aitken() on line 12 are to functions defined in the same source file. We have looked at the call to aitken() from the point of view of the callee in section 1.2.3↑. Let us look at the function call leibniz(seq1, len) on line 6 from the point of view of the caller.
Within main(), seq1 is the name of an array of length 13, and len is an int (because it is declared const, its value cannot be modified after initialization). When the function call leibniz(seq1, len) is executed, the value of seq1 (the address of the first entry of the array) is copied to the location whose name is seq within leibniz() (see the definition of leibniz() given above). In addition, the content of the location whose name is len inside main() is copied to the location whose name is len inside leibniz(). This is the call-by-value semantics of C.
The function call aitken(seq1, seq2, len) on line 12 invokes a function that is not defined in the leibniz.c compilation unit. To see what difference that makes, let us see what happens when we issue the command
icc -c leibniz.c
The -c option tells the icc C compiler that it should only produce the object file and not the executable. The object file it produces will be called leibniz.o (on Unix systems). To produce the object file, the compiler runs through the code and converts the leibniz() function definition into a sequence of machine instructions. Next it converts printseq() into a sequence of machine instructions before turning to main(). When it hits the function call leibniz(seq1, len) on line 6 of the main() function, it copies the arguments seq1 and len into a place where the leibniz() function can retrieve them. After that it simply generates a machine instruction to make the function call. That machine instruction will pass control to the leibniz() function, which the compiler has already converted into a set of machine instructions.
When the compiler sees the aitken(seq1, seq2, len) call on line 12 of the main() function, the process is initially similar. Thanks to the included header aitken.h, the compiler will have already seen the declaration of aitken() to be void aitken(double *seq1, double *seq2, int len). So the compiler generates machine instructions to copy seq1, seq2, and len into a place where the definition of aitken() can find them. At this point, there is a problem: the compiler cannot generate machine instructions to pass control to aitken(). The compiler has no idea where the machine instructions for aitken() are. The name aitken() remains an unresolved reference in the object file leibniz.o.
Thus, the object file leibniz.o by itself cannot be turned into an executable. To turn it into an executable, we must first compile aitken.c as follows:
icc -c aitken.c
The -c option tells the icc command to compile only. Without that option, the command will try to compile and then link and produce an executable with the default name a.out. With the compile only option, the command will generate an object file called aitken.o with machine instructions for the two functions defined in aitken.c. To generate the executable, we run the following command:
icc -o leibniz.exe leibniz.o aitken.o
The -o option tells the icc command to leave its output in the file leibniz.exe instead of the default a.out. The C linker uses the definition of aitken() in aitken.o to eliminate the unresolved reference to aitken() in leibniz.o. The executable file leibniz.exe it generates is a sequence of machine instructions with no unresolved names. It is ready to be loaded and run.
The extensions .cpp, .c, and .o correspond to C++ source files, C source files, and object files, respectively. In the command icc -c aitken.c, the file extension of aitken.c indicates to the icc command that it is operating on a C source file. So the command will invoke the C compiler. In the linking command, the filename extension .o indicates that leibniz.o and aitken.o are both object files. Therefore, icc invokes the C linker.
We have used the icc compiler/linker from Intel, but the syntax is almost identical with GNU’s gcc or the Portland Group’s pgcc. Commands such as icc, pgcc, and gcc look at the filename extension to determine the type of the file.
We use Intel’s C/C++ compilers in this book because it is easier to link certain libraries that we will discuss later. In addition, the Intel compilers were the default standard on some supercomputing systems used in later chapters. GNU’s gcc/g++ compiler is widely used, open source, and reputed to be of excellent quality. [11]  [11] The Intel compiler appears to carry out some loop optimizations better, but this is not a major point. For work unrelated to this book, this author normally uses the GNU compilers.
Conventionally, the executable files in Linux do not use the file name extension .exe. The .exe filename extension for executables is a Windows convention. We have adopted the Windows convention throughout this book as it makes for greater clarity.
Going back to leibniz.exe, we may wonder how the name printf() in leibniz.o gets resolved. It is used in the definition of printseq(), and we understood that printf() is defined in the stdio library. However, the linking command gave no explicit instructions to fetch that library. It did not need to because the linker fetches and links a number of standard libraries by default and stdio is one of them.
Exercise:  Look up the meaning of the options -c -o -I -L -lm -lmath in the C compiler’s user’s guide.
Exercise:  Use a short C program and the sizeof() facility to print the size in bytes of variables of char, int, long int, double, char *, int *, and double * types. Notice that all pointers are the same size (8 bytes on typical 64-bit systems).
Exercise:  If p and q are pointers of type char * and double *, p=p+1 moves p by one byte while q=q+1 moves q up by eight bytes, so that p or q point to the next char or double, respectively. Write a short C program to demonstrate this aspect of pointer arithmetic.
Exercise:  In C, a pointer can be an lvalue or an rvalue. Give an example of a pointer that is an rvalue but not an lvalue.
Exercise:  The list of numbers 1, 2, …, n can be rearranged in n! different orders. Write a function that takes an argument n of the type int and prints the n! different permutations.
Exercise:  Suppose a[] is an array of type double and size n. Write a function that takes a and n as arguments and sorts the entries of the array. The basic sorting algorithms are bubble sort, quick sort, and heap sort. [12]  [12] Many different sorting algorithms are derived and discussed in [7]. Most sorting problems are typically handled using library functions.
Exercise:  Rewrite aitken() to conform to the declaration
void aitken(double *seq, int len);
so that it takes a single sequence and transforms it in place.
Exercise:  What is special about the following C program? [13]  [13] [4] (p. 288) states that, “once you understand this program you have understood the main idea” behind Gödel’s proof of his incompleteness theorem.
char *s="char *s=%c%s%c;%cmain(){printf(s,34,s,34,10,10);}%c";
main(){printf(s,34,s,34,10,10);}
The program does not #include the header file stdio.h, which contains the declaration of printf(). Modify the program to #include that header file while serving the same purpose.
Exercise:  The following C function copies one array to another array, if the two arrays do not overlap.
void arr_copy(int* in, int* out, int len)
{     
  int i;     
  for(i=0; i < len; i++) 	
    out[i] = in[i]; 
}
If 1, 2, 3, 4 is rotated left, it becomes 2, 3, 4, 1. Write a C function for rotating an array left (in-place rotation) that does not use a loop but is allowed to make one call to arr_copy(). Can you write a similar function to rotate right?
Exercise:  Suppose that double **x points to an array of pointers of length n and that each entry of the array points to an array of doubles, whose length is also n.

1.3 C++ review

The C language has a simple philosophy. Its aim is to offer a uniform view of the computer, especially computer memory, to the programmer. C has been so successful that nearly every object that can be called a computer, ranging from supercomputers to routers to embedded and mobile devices, is equipped with a C compiler or a cross-compiler. C is the best vehicle for highly optimized programs.
Because C seeks to be close to the machine, it is a low-level language. There is often a considerable distance between concepts that are native to a problem domain and their expression as C programs. High-level languages provide constructs and syntax that bring the program much closer to ideas and concepts that are native to the problem domain.
The C++ language is something of a compromise to provide the facilities of high-level languages without sacrificing the speed of C. Despite its name, it is not an incremental extension of C. It is a colossal expansion of C syntax. It does not have the seamless nature of truly high-level languages such as Python. Classes in Python use very little syntax and fit cleanly within the highly modular architecture of Python programs. However, languages such as Python are much slower than C++.
Although the C++ language is a compromise, or perhaps because of being a compromise, it has found a great range of uses. On the one hand, C++ has all of C inside it. On the other hand, it provides many mechanisms for capturing concepts and ideas more precisely. Its downside is its complexity. Although clear and careful thinking are essential to all programming, a failure in this respect has particularly acute consequences in C++.
Because our focus is on program speed, the part of C++ we use is quite small. Narrowly defined and flat (as opposed to hierarchical) classes, references, occasional function name overloading, and the ability to define variables in the middle of programs constitute an almost complete list of the C++ features we use. The classes we define are no more than C structs endowed with functions. C features such as enum, struct, typedef, and static can be quite powerful for representing concepts and ideas when used judiciously.
Classes in C++ are a mechanism for representing concepts and endowing them with functionality that makes them easy to use. Classes can be general or narrow. The Vector class studied in this section is an example of a general class. It can be made even more general. The Vector class assumes that each entry of a vector is a double. Using templates, one may define a class that allows each entry to be a double or a float or an int or even some user defined type or class.
The Vector class is the only example in this book of a class that attempts to capture a general concept (in this instance, the concept of vectors in linear algebra). Every other class discussed in this book is narrowly defined. We begin our discussion of C++ with the Vector class for two reasons. It is a good vehicle for reviewing some of the features of C++. Indeed, we introduce more features of C++ than we need, but that too serves a purpose. C++ is sometimes believed to lead to slow programs. This example helps us explain thoroughly how that may come about. In later chapters, we illustrate the overhead of the Vector class and the related Matrix class, when used in inner-loops or for disk input/output.

1.3.1 The Vector class

The C++ language can be used in many different ways. Using general classes, one may make C++ look like easy to use languages such as Python or Matlab without incurring the enormous cost of such interpreted languages. Our interest is in fast programs and, even more so, in understanding what makes programs fast or slow. The C++ style we adopt is quite close to C.
Nevertheless, we begin with a general type of class, namely, the Vector class. This class helps us review a few of the features of C++ and is used to implement the Aitken iteration later. In a later chapter, we criticize the use of this class and show it to be slow.
Header file with class definition
The Vector class is defined in the header file Vector.hh. The C++ class consists of data members and member functions. With respect to computer memory, a class object is a collection of data items. The data items could be of basic types such as double or char, or pointers. The data items may also be other class objects or C structures. The member functions provide various means to manipulate the class object or, equivalently, the package of data items that constitutes the class object.
It is typical for header files to give only part of the definition of the class. Many of the member functions are typically defined in a separate source. Here the entire class definition is in the header file Vector.hh, which makes the header file a bit long. We present the contents of the header file in stages, gradually unveiling features of C++. The skeleton of the header file is listed below.
1  #ifndef MYVECTOR
2  #define MYVECTOR
3  #include <cassert>
4  #include <cmath>
5  #include <cstdlib>
6  #include <cstring>
7  #include <iostream>
8  #include <fstream>
9
10 using namespace std;
11
12 class Vector{
13 private:
14      ...
15 public:
16      ...
17 };
18
19 #endif
Lines 1, 2, and 19 ensure that the header file expands to the source code in between only if the macro MYVECTOR has not yet been defined, so that the contents are included at most once per compilation unit. Here MYVECTOR is not good nomenclature, as it may be inadvertently reused by some other header, subverting our attempt to ensure conditional inclusion of header files.
It is typical for C/C++ sources to begin by including a number of header files to interface to other source files or libraries. C programs include the header assert.h to use the assert() macro (see section 1.2.3↑). C++ programs may include either assert.h or cassert, as on line 3, to emphasize that a C facility is being employed.
The Vector class uses other C facilities as well. It uses the fabs() function from cmath (line 4) in one of its member functions to compute the norm of the vector. C facilities to allocate and release memory are defined in stdlib (line 5).
C++ facilities for input/output from terminals and files are in iostream (line 7) and fstream (line 8), respectively. Member functions that input/output vectors to/from files use these facilities.
To output a variable x to standard output (typically the terminal), we may say
cout<<x<<endl;
Here cout is the name of standard output declared in iostream. The end-of-line manipulator endl is also declared in iostream. In C, input/output syntax is sensitive to the type of the data items being handled. The abstraction features of C++ are used by the iostream library to provide a uniform interface for input/output regardless of the type of the variable. Even class objects may be input/output in this manner if the operators << (for output) and >> (for input) are overloaded suitably.
To output to a file, the syntax looks as follows:
ofstream ofile("tmp.txt");
ofile<<x<<endl;
Here ofile is defined as an ofstream object. At the point of definition, it is tied to the file tmp.txt.
To input x from standard input, we may say
cin>>x;
To input from a file, we may say
ifstream ifile("tmp.txt");
ifile>>x;
This will work regardless of whether x is a double or an int or a long or a char.
The class names ofstream and ifstream as well as class object names cout and cin are defined in the namespace std. In general, we should say std::cout and std::ofstream because these names do not exist outside the namespace. However, the using namespace declaration on line 10 brings in all the names in std into scope. It allows us to say cout instead of std::cout.
It is often not a good idea to bring in the entire std namespace, especially within header files. The C++ standard library is vast. Bringing in the entire std namespace, as we do on line 10, pollutes the namespace considerably. For example, a programmer may define a function called copy() to copy double arrays in a program-specific manner and conflict with the std namespace. There are functions with common names such as copy() and sort() in the standard library.
The listing shows the outline of the definition of the class Vector (lines 12 through 17). There is a private section (line 13) in the class definition and a public section (line 15). The class definition must end with a semicolon as on line 17. Omitting the semicolon at the end of class definitions is a common novice error.
Private section of the Vector class
figure FIGS/chapter1/VectorSchematic.png
Figure 1.5 Schematic picture of a Vector of length 2, which "owns" its data.
A class is a collection of data members and member functions. In the Vector class all the data members are in the private section.
class Vector{
private:
     long int size;
     double *data;
     int owner;
public:
     ...
};
The class member size is the length of the vector and data is a pointer to the contents of the vector; so data[i] is in effect a name for the ith entry of the vector. The owner field will be either 0 or 1---its meaning is explained later.
If v is a variable of type Vector, then it is a name for a segment of memory that includes a location of type long int, a location of type double *, and a location of type int. A schematic view of a Vector of length 2 is shown in figure 1.5↑. If v is the name of an object of class Vector, v.size, v.data, and v.owner are names for members of v as shown in the figure. One may say that the class object v is a package of the data items v.size, v.data, and v.owner.
If a data member or a member function is in the private section, its access is restricted. Because the data items v.size, v.data, and v.owner are in the private section, only member functions are allowed to access them. [14]  [14] This statement is not exactly true. C++ has a notion of friends, which too get access to the private section of a class. Whatever functionality we want must be defined through the member functions. For example, suppose we want to input a vector from a file. We are not allowed to directly set the size of the Vector object during input using syntax such as v.size=100. Instead, we may define a member function input(), which takes care of reading data and setting v.size appropriately.
Member functions of the Vector class
All the member functions of the Vector class are in the public section.
1  class Vector{
2  private:
3       ...
4  public:
5       Vector(){...}
6       Vector(long int n){...}
7       Vector(const Vector& v){...}
8       ~Vector(){...}
9       void shadow(double *dataptr, long int len)
10      {...}
11      void shadow(const Vector& v){...}
12      void shadow(const Vector& v, long int i,
13                  long int len){...}
14      long int getSize() const{...}
15      double * getRawData() const{...}
16      double& operator()(long int i){...}
17      const double& operator()(long int i) const{...}
18      Vector& operator=(const Vector& v){...}
19      void add(const Vector& v){...}
20      void sub(const Vector& v){...}
21      void mul(const Vector& v){...}
22      void div(const Vector& v){...}
23      void scale(const double x){...}
24      void add_constant(const double x){...}
25      double norm() const{...}
26      void output(const char* fname)const{...}
27      void input(const char* fname){...}
28 };
There is a basic difference between data members and member functions. Suppose v is a Vector class object. Then v.size, v.data, and v.owner refer to the data items packaged inside v. In contrast, v.norm() applies the member function norm() (line 25) with the class object v as its target.
The member functions exist within the namespace defined by the class. Each class defines a namespace. The Vector class defines the eponymous namespace Vector::. If we want to refer to a member function such as norm() outside the scope of the class (the scope here is from lines 1 through 28), the name must be given as Vector::norm().
For example, we may define the member function norm() in a compilation unit and outside the scope of the class Vector as
double Vector::norm() const{
     ...
}
We have chosen to define all the member functions in Vector.hh and within the scope of the Vector class.
Before delving into the member functions, we give an example of how Vector objects are used.
#include "Vector.hh"
​
int main(){
     Vector v(20);	
     for(int i=0; i < 20; i++)
          v(i) = i;
     Vector w1, w2;
     w1.shadow(v, 0, 10);
     w2.shadow(v, 10, 10);
     w1.add(w2);
     w1.output("w1.txt");
}
The definition Vector v(20) invokes the class constructor on line 6. This constructor will make v.data point to 20 doubles, set v.size to 20, and set v.owner=1, making v the owner of its data.
The body of the for-loop has v(i)=i. When we say f(x) in C/C++, f() is a function being applied to x. Here we say v(i), but v is a class object not a function being applied to i. However, thanks to operator overloading, the compiler interprets v(i) as v.operator()(i)---in words, the member function operator() is applied to the target v with i as the sole argument. The member function defined on line 16 is called. This member function will look at v.data, access the ith entry, and return a reference to it. Thus, when we say v(i)=i, the ith entry of the Vector object v gets set to i. At the end of the for-loop, the entries of v are 0, ..., 19.
At the definition Vector w1, w2;, the Vector objects w1 and w2 are created using the empty constructor (line 5). This empty constructor sets w1.size to 0, w1.data to NULL, and w1.owner to 0, and likewise for w2. Neither w1 nor w2 owns any data. To begin with, they are vectors of size 0.
When we say w1.shadow(v, 0, 10), the member function shadow() defined on line 12 gets invoked. Its effect is to set w1.data to v.data and w1.size to 10. However, w1.owner remains 0 because the data is owned by v. The Vector object w1 is a shadow of the first 10 entries of v. Likewise, after w2.shadow(v, 10, 10), w2.data is set to v.data+10, and w2 becomes a shadow of entries 10 through 19 of v.
When we say w1.add(w2), the member function add() (line 19) is invoked with w1 as the target and w2 as its argument. This member function verifies that its target and its argument are vectors of the same size and adds w2 to w1, entry by entry.
The final line w1.output("w1.txt") outputs w1 to the file w1.txt via the member function output() (line 26). Thus, the numbers 10, 12, ..., 28 will be output to w1.txt.
The program is not done yet, however. When the class objects v, w1, and w2 go out of scope at the end of main(), the compiler inserts calls to the destructor (line 8). The destructor is called thrice. When destroying v, the destructor notes that v is the owner of its data and releases the memory that data points to. There is nothing to be done to destroy w1 and w2 because they do not own their data. The destructor returns as soon as it notes that they are not owners.
References
We will step through a few C++ concepts using the public member functions of the class Vector as illustrations. The first of these is references.
A reference in C++ is another name for an object that is already in existence. For example, we may say
double &x = y;
and x becomes another name for y, which is assumed to be of type double. If we say x=2 and then print y, its value will be 2.
The & character is used to take the address of locations. It is also used for references as illustrated. Overloading the & operator in this way may create some confusion initially, but in fact references are really just a shorthand for the use of pointers.
We may use references in passing arguments to a function. For example, if we define a function as
double f(double &x){
     ...
}
we may call it as f(z), assuming z is a variable of type double; x then becomes a reference to z. The call f(2.0) is not legal, however, because 2.0 is a value and not the name of a double location.
References are commonly used to pass class objects as arguments to functions. The Vector class holds just three data items (size, data, and owner), but in general a lot of data can be packaged into a single class. When a class object is passed as an argument to a function, do we really want a fresh copy to be made of all the data inside that object? Typically, the answer is no. First, the expense of having to copy all that data may be undesirable. Second, even if there is only a small amount of data, the semantics could be incorrect. For Vector objects, item-by-item copying results in multiple objects incorrectly owning the same data.
A declaration of a member function can specify that some of the arguments are references. The declaration void add(const Vector& v) (line 19) specifies that the argument named v is passed by reference (the const qualifier is discussed later). The crucial symbol here is the &, which precedes v in the declaration. If arguments are passed as references, we do not have to worry about the correct semantics for copying.
Functions may also return a reference as does the member function operator() (line 16) . This function is defined as
double& operator()(long int i){
     assert(i < size);
     return(data[i]);
}
in Vector.hh within the scope of the class Vector. Here size and data are data members of the target. So if v is a Vector object and we say v(6)=17, this member function is called with v as the target and with argument i equal to 6. Assuming the vector is of size 7 or greater (so that the assertion i < size holds), the member function returns data[i], which is the same as v.data[i], by reference. Whatever is returned is another name for the entry of v with index 6. Thus, saying v(6)=17 has the effect of setting that entry to 17.
Operator and function name overloading
We have already seen how the function call operator is overloaded. Below is another example. In this example, the assignment operator is overloaded.
Vector& operator=(const Vector& v){
     assert(size==v.size);
     memcpy((char *)data, (char *)v.data,
            size*sizeof(double));
     return(*this);
}
The assignment operator uses memcpy() to copy the data in v to its target. [15]  [15] Strictly speaking, the overloaded = operator should check that data and v.data do not overlap. If we say w=v, with both v and w being Vector objects, the member function operator=() is invoked with w as the target and v as its argument. As before, size and data refer to the corresponding items of the target.
A new bit of syntax here is the keyword this. Within a member function, this is a pointer to the target. Therefore, *this is the target. The overloaded assignment operator returns a reference to the target.
If u, v, and w are Vector objects, and we say u=v=w, the compiler interprets that as u=(v=w). In C/C++, the assignment operator is right associative. The result of v=w is a reference to v, which is the argument in the next call to the overloaded assignment operator that assigns to u.
C++ also allows function name overloading. For instance, we can define a function as
void add(double *v, const double *w, 
         const int len){...}
which adds the array w to the array v. We can define another function
void add(const double *u, const double *w, double *v, 
	     const int len){...}
which leaves the sum of u and w in v. In C, the two definitions would conflict because they are both trying to bind a definition to the same name add. However, in C++, the compiler allows both the definitions because they can be disambiguated using the number and types of the arguments. The compiler internally generates mangled names to keep the definitions separate in the object file, so as not to confuse the linker.
Function call inlining
The member functions of the class Vector are defined in the header file Vector.hh within the scope of the class definition. If we wanted to define the overloaded function call operator elsewhere, we can give its definition as
double& Vector::operator()(long int i){
     assert(i < size);
     return(data[i]);
}
In fact, it is better to define the overloaded function call operator within the scope of the class as we did. Every source that uses the Vector class will include the header Vector.hh. Thus, every compilation unit will see the complete definition of the function call operator. If we say v(i) = 7.0, with v being a Vector object, the compiler does not actually generate a function call for v(i). Instead it scans the definition of the overloaded function call operator and splices in the body of the function at the point of call. The overhead of making a function call is eliminated. This is called function call inlining.
The function call overhead in modern processors is quite small and is typically less than 10 cycles. [16]  [16] The function call overhead depends on the number of arguments, types of the arguments, number of registers used by the caller, context of the function call, and other factors. Much of the time it is nothing to worry about. However, if a member function compiles to only a few machine instructions and is called frequently, one must ensure that the member function is inlined.
In C++, one can explicitly ask the compiler to inline functions that are not member functions. For instance, if a function with the declaration void add(Vector u, Vector v) is instead declared inline void add(Vector u, Vector v) and defined within the header file, the compiler will try to inline that function call and will almost certainly succeed. However, the compiler may refuse or fail to inline functions that are long and complicated, without warning.
If several source files are compiled and linked simultaneously with interprocedural optimization turned on (-ipo option in icc and icpc), the compiler will again try to inline certain function calls. Normally, we will avoid interprocedural optimization, preferring to build object files for each source file or compilation unit separately.
Inlining is not quite the panacea it is sometimes believed to be. Suppose we need to replace each entry of an array by the cumulative sum of the preceding entries. We can write the loop in C using pointers and pointer arithmetic or we can use a class such as the Vector class. A well-written C-style loop will lead to faster and more compact assembly code for reasons explained in a later chapter (see section 3.2.6↓). Inlining occasionally leads to worse code as we illustrate later in this chapter.
Constructors and destructors
Constructors and destructors are central to the class mechanism in C++. The three constructors of the Vector class are defined below.
     //empty constructor
     Vector(){
          size = 0;
          data = NULL;
          owner = 0;
     }
	
     //only constructor to allocate space for data
     Vector(long int  n){
          size = n;
          data = (double *)malloc(sizeof(double)*n);
          owner = 1;
     }
  
  
     //*this becomes shadow of Vector v 
     //(copy constructor)
     Vector(const Vector& v){
          size = v.size;
          data = v.data;
          owner = 0;
     }  
The empty constructor and the copy constructor do not allocate space. The constructor in the middle allocates space using malloc(), which is in cstdlib. These member functions are understood to be constructors because they have the same name as the class. Constructors are not allowed to return anything.
Constructors are called implicitly at the point of definition of class objects. It is essential to understand when constructors are called. Suppose we define a vector object as
Vector v;
the empty constructor is called to initialize the object v. If we define an array of vectors
Vector v[100];
the empty constructor is called for each object in the array. It is illegal to define an array of objects if the class is not equipped with an empty constructor.
If we define a vector object as
Vector v(27);
the constructor in the middle is called. The compiler notes that the object v is being built with the single argument 27, which is a constant of type int. The constructor that calls malloc() takes long int as an argument. In C/C++, an int is automatically promoted to a long int if necessary.
The constructor uses malloc() to claim space for this Vector object. If we make the call
malloc(1000*1000)
the function returns a pointer to 10^6 bytes of memory. The pointer is of type void *.
In the usage
data = (double *)malloc(sizeof(double)*n);
void * is cast to double *. Type casts are used to convert values of one type to another type. For example, we may say (int)1.4142 to convert the double value 1.4142 to an int (it will be truncated). The cast is needed here because data is of type double * while malloc() returns void *.
What happens during malloc()? The short answer, which will be elaborated later, is that the function call first goes to the C library. The C library may in turn call the operating system if it is not able to come up with the memory by itself. The operating system typically allocates an area in virtual memory to the calling process. No region in physical memory is set aside. Physical memory is set aside only when the process first tries to access the memory it has claimed for itself. There is a page fault during first access, and the page fault handler sets aside physical memory.
Suppose we need an array of Vector objects of length 100 with each vector of size 1000. The definition
Vector v[100];
will not do. The 100 Vector objects it creates all use the empty constructor and therefore have length 0 and do not own any data. We need the following code:
Vector *v[100];
for(int i=0; i < 100; i++)
     v[i] = new Vector(1000);
What we get here is an array of pointers to Vector objects and not an array of Vector objects. The for-loop initializes each pointer in the array to point to a Vector object of size 1000. In C++, the usage
new Vector(1000)
calls a constructor explicitly and returns a pointer to the object that was created.
The new operator may also be used to allocate arrays. For example, we can say
data = new double[size]
to make data point to an array of doubles of length size. Thus, new can be an alternative to malloc(). But malloc() cannot create class objects, although new can.
We come at last to the copy constructor. The copy constructor makes its target a shadow of its argument. If we define v as
Vector v(w);
where w is an object of type Vector, the copy constructor is invoked with w as its argument. A standard situation where the copy constructor is invoked is as follows. Suppose we define a function
void add(Vector v1, Vector v2)
and call it add(v, w). By default, arguments are passed by copying (pass by value) in C/C++. At the point of call, v is copied to v1 and w is copied to v2. In C++, copying class objects means calling the copy constructor. Here, class objects v1 and v2 are created using copy constructors.
For the sake of completeness, we give the definition of one of the member functions named shadow().
     //makes *this shadow v(i:i+len-1)
     void shadow(const Vector& v,  long int i,  
                 long int len){ 
          assert(!owner);
          assert(i+len<=v.size);
          size = len;
          data = v.data + i;
          owner = 0;
     }
The target becomes a shadow of v of length len beginning at the ith entry of v.
The sole destructor of the class is defined as follows:
     ~Vector(){
     	if(owner!=0)
     		free(data);
     }
This member function is understood to be a destructor because its name is the class name prefixed with a tilde. Destructors are not allowed to return anything nor can they take any arguments. Destructors are called automatically when the class object goes out of scope. The destructor of the Vector class frees its data if it is an owner.
The function free() is defined in the C library, like malloc(). The pointer returned by malloc() was stored in data. The space it points to is released by free().
If a class object is allocated using new, as in
Vector *vptr = new Vector(1000);
where vptr is a pointer to Vector, it must be released using the delete operator:
delete vptr;
The class destructor is called when vptr is deleted. If an array is allocated using new[], as in
double *data = new double[size];
it must be released using delete[]:
delete[] data;
The const qualifier
The following definition is contained in the first few lines of Vector.hh (but omitted from earlier listings):
const double PI = 3.1415926535897932384e+00;
Because the definition of PI is qualified using const, any attempt to change the value of PI is illegal. If a user attempts to change the value of PI, the compiler will catch the error.
The Vector class is equipped with member functions output() and input() to facilitate output to files and input from files. Definition of the member function input() begins as follows:
void input(const char* fname){...}
Here the argument fname (file name) is a pointer to a char (character). In C and C++, a char is a single byte with ASCII encoding. Strings are arrays of characters terminated by '\0' or equivalently the byte of value 0. If a string is passed as a pointer to a char, its length need not be passed explicitly. The convention for terminating strings determines the length.
A pointer holds an address which points to some data. Prefixing the const qualifier to the declaration or definition of a pointer implies that the data does not change, but the pointer may change. So it is illegal to say fname[0]='M' within the body of the function, but it is legal to say char c; fname = &c;.
Suppose we want to initialize a vector of length 20 using data stored in a file named init.dat in the current directory. The data in the file must be a sequence of 20 (or more---the extra values are ignored) values separated by whitespace. The following code does that:
1 char fname[9]={'i','n','i','t','.','d','a','t','\0'};
2 Vector v(20);
3 v.input(fname);
In line 1, the number 9, which gives the length of the array, can be omitted because the compiler can use the initializing sequence to determine the length. The usage below is much more convenient.
Vector v(20);
v.input("init.dat");
We are allowed to pass "init.dat" as an argument explicitly because the first argument of the member function input() is of the type const char * and not just char *.
Use of const qualified references is illustrated by the member function add().
     void add(const Vector& v){...}
Here the const qualifier indicates that add() will read entries of v but not change them. The target will be changed when v is added to it. The usage of the const qualifier in the member function norm() is different.
    double norm()const{...}
This function returns the ∞-norm of its target vector (largest magnitude of an entry of the vector). Therefore, if it is called as v.norm(), the returned value is the ∞-norm of v. The const qualifier specifies that the member function will not change its target. Any attempt to change the entries of v inside the definition of norm() is illegal.
The const protections can be easily broken using shadows. For example, inside the definition of the member function add(), we can say
Vector w(v);
and go on to modify the entries of v.
Default arguments
Unlike C, C++ allows default arguments. Suppose the function
double f(double x, int flag){
     ...
}
is defined in the source file xyz.cpp. In the header file xyz.hh, we may declare it as
double f(double x, int flag = 0);
Then we are allowed to call the function as
f(1.4142, 1);
or
f(1.4144);
In the latter case, the compiler supplies the second argument (flag) as 0. Only trailing arguments can be assigned default values.
The -> operator
Suppose we define an array of pointers to Vector objects as
Vector *v[100];
Each entry may be made to point to a Vector object as explained already. Suppose all those Vector objects are initialized in some way and we want to print the norm of each Vector object. The following syntax will not do:
for(int i=0; i < 100; i++)
     cout<<v[i].norm()<<endl;
The problem here is that v[i] is a pointer to a Vector and not a Vector. The correct syntax is as follows:
for(int i=0; i < 100; i++)
     cout<<(*v[i]).norm()<<endl;
The parentheses in (*v[i]).norm() are needed. The * dereferencing operator has lower precedence than the . selection operator. This usage occurs often enough that there is a special operator -> that combines dereferencing and selection. It may be used as follows:
for(int i=0; i < 100; i++)
     cout<<v[i]->norm()<<endl;
We will look at another example of the element selection through pointer operator ->. Consider the C struct
struct node{
     double val;
     struct node *next;
};
A struct is a package of data items. Here the two data items of struct node are val and a pointer to the next node. The pointer next is a link from one node to the next. A series of nodes may be chained together to form a linked list.
Figure 1.6 Schematic view of a linked list.
Figure 1.6↑ shows a linked list with each node pointing to the next. The variable llist points to the beginning of the list. The next pointer of the last node is set to NULL.
The following function finds the length of a linked list:
int llen(struct node *llist){
	int len = 0;
	while(llist != NULL){
		llist = llist->next;
		len++;
	}
	return len;
}
The element selection through pointer operator -> occurs frequently in the context of linked lists, trees, and graphs.
Abstraction features of C++
The C++ language has a number of powerful abstraction features, beyond the basic class mechanism we have studied, that bring programming closer to ideas and concepts intrinsic to the problem domain. The Vector class defined above attempts to capture the abstract notion of vectors. The classes that arise later in this book are all narrowly defined, provide a single service, and do not use these features.
C++ supports object-oriented programming through inheritance of classes, multiple inheritance, and virtual and pure virtual functions. On many occasions, a single problem morphs into multiple instances, each of which has its own distinctive features while sharing a great deal in common with other instances. Object-oriented programming is the best way to tackle such problems.
An example of a problem well suited to object-oriented programming is the implementation of file systems. A file system creates the concept of file, which may be read, written, opened, and closed in standard ways. File systems vary in the way they cache data, store meta information, and other respects. Yet they have a great deal in common. In addition, each file system is ultimately realized on disparate storage devices.
Mapping of file systems to devices too can be organized using object-oriented techniques. Another example is the implementation of a graphics library for plotting that supports a variety of backends.
Object-oriented programming can be done in plain C as well using function pointers. However, it becomes far more laborious. Object-oriented techniques do not really come up in this book, although we do use templates on one occasion in chapter 8↓. Therefore, we include the following basic example of a templated function:
1 template<class Num> Num sum(Num *arr, int len){
2 	Num ans;
3 	ans = 0;
4 	for(int i=0; i < len; i++)
5 		ans += arr[i];
6 	return ans;
7 }
In this function, Num is a generic class or type. It can stand for int or double or some user-defined class that overloads the operator += (line 5) and therefore supports addition. If the template is instantiated by setting Num to a user-defined class, it must also support the empty constructor (line 2) as well as assignment to zero (line 3).
A templated function may be invoked as shown below.
1 	int x[5] = {1, 2, 3, 4, 5};
2 	cout<<"1 + ... + 5 = "<<sum(x, 5)<<endl;
3 	double y[5] = {6, 7, 8, 9, 10};
4 	cout<<"6 + ... + 10 = "<<sum(y, 5)<<endl;
A templated function is called just like an ordinary function as evident from the invocations of sum() on lines 2 and 4. Because x[] is an array of ints, the C++ compiler replaces the generic type Num by int to generate code for a new function. On line 2, sum(x,5) invokes that function. However, sum(y,5) on line 4 invokes a quite different function generated by replacing Num by double.

1.3.2 Aitken transformation in C++

The source files Aitken.cpp, Leibniz.cpp, and Logseries.cpp contain the C++ implementation of the Aitken transformation and its application to the Leibniz and log series. The listing below is of the C++ header file Aitken.hh.
1 #ifndef AitkenAugust09DVjli
2 #define AitkenAugust09DVjli
3 #include "Vector.hh"
4 void Aitken(const Vector& seq1, Vector& seq2);
5 void Aitken(Vector& seq);
6 double AitkenE(const Vector& seq, int printflag=0);
7 #endif
This header file declares three functions, two of which have the same name. The function name Aitken is overloaded (lines 4 and 5). We can make the function call Aitken(v,w), where v and w are both Vectors. After the call, the const in the first declaration guarantees that v is not changed. The use of const simplifies the documentation of this function. The other way to make the function call takes the form Aitken(v). In this latter usage, the transformed sequence overwrites the original data, and the last two entries are set to 0. We can get away with using the same name for two different functions because the compiler can tell them apart by looking at the list of arguments.
Although there is just one declaration for AitkenE() (line 6), it can be called with one or two arguments. If we call it in the form AitkenE(v), the compiler interprets the call as AitkenE(v, 0). The declaration of AitkenE() gives its second parameter as int printflag=0, thus indicating to the compiler that the second argument assumes the default value of 0 if it is omitted. We can also call it in the form AitkenE(v,1), where the value of the second argument is given explicitly as 1. Usually it is best to use the default argument facility only for the last argument.
The function AitkenE() transforms the sequence repeatedly until the sequence has only one or two numbers. The function Aitken() listed below performs a single transformation of the sequence.
1 void Aitken(Vector& seq1){
2 	int len = seq1.getSize();
3 	for(int i=0; i < len-2; i++){
4 		double a = seq1(i);
5 		double b = seq1(i+1);
6 		double c = seq1(i+2);
7 		seq1(i) = a - (b-a)*(b-a)/(a-2*b+c);
8 	}
9 	seq1(len-2) = 0;//invalid entries
10 	seq1(len-1) = 0;
11 }
Entries of the Vector object seq1 are accessed on lines 4, 5, 6, 9, and 10. Each one of these accesses is resolved using the overloaded function call operator and is not a simple array look-up.
The function AitkenE() listed below is defined in the source file Aitken.cpp. It uses the shadowing capability of the Vector class. It repeatedly transforms the sequence, and each transformation is effected using a call to Aitken() on line 13.
1 double AitkenE(const Vector& seq, int printflag){
2 	int len = seq.getSize();
3 	Vector myseq(len);
4 	myseq = seq;
5 	Vector v;
6 	v.shadow(myseq);
7 	int n = len/2;
8 	if(len%2==0)
9 		n--;
10 	if(printflag==1)
11 		printseq(v);//defined in Aitken.cpp
12 	for(int i=0; i < n; i++){
13 		Aitken(v);//defined in Aitken.cpp
14 		v.shadow(v, 0, v.getSize()-2);
15 		if(printflag==1)
16 			printseq(v);
17 	}
18 	int indx = v.getSize()-1; //can be 0 or 1
19 	return v(indx);
20 }
The function AitkenE() uses shadows instead of creating new Vectors of different lengths repeatedly. After each invocation of Aitken(v) on line 13, v is made a shadow of itself, with the last two entries dropped, on line 14.
Object files are produced from Aitken.cpp, Leibniz.cpp, and Logseries.cpp using the icpc compiler with the -c option. The linking command builds the executable.
icpc -c Aitken.cpp
icpc -c Leibniz.cpp
icpc -o Leibniz.exe Leibniz.o Aitken.o
The executable Logseries.exe is built similarly. Instead of icpc, we may use g++ or pgCC, and the compilation and linking syntax shown here does not change.
Exercise:  Use the sizeof() facility to determine the size of a Vector object. Does the reported size agree with your expectation?
Exercise:  Copy the file Vector.hh to VectorEX.hh and modify it as follows. Insert a line right at the beginning of the first three constructors that makes them print "empty constructor" or "constructor to acquire space" or "copy constructor" in a single line. Similarly, add a line right at the top of the destructor that prints "destructor." When the following code is run
#include "VectorEX.hh"
void donthg(Vector v1, Vector& v2){
}
int main(){
	Vector u(20), v(10), w(v), ww;   
	donthg(v, w);   
	donthg(u, v); 
}
it produces exactly 12 lines of output. Explain each line carefully. Exactly which object and place in the code does each line correspond to?
Exercise:  Explain why the default copy constructor, which copies size, data, and owner fields, is semantically incorrect for the Vector class.
Exercise:  Consider the following C++ program.
#include <iostream>
using namespace std;
int main(){
  cout<<"Hello World!"<<endl;
}
It prints "Hello World!" as you may expect. Modify the program so that it prints
Big Bang
Hello World!
Apocalypse
You are not allowed to modify main(). [17] This problem is based on [2].
Exercise:  Consider the 100 numbers (1), (2), …, (100). Write a C/C++ program that partitions the numbers into two sets such that the difference of the sums of the two sets is as small in magnitude as possible. Does your program work for 1000 numbers or for 10^6 numbers? [18] This is the Floyd partitioning problem. See [6]. The Karmarkar-Karp algorithm for partitioning gives an interesting approach to this problem.
Exercise:  Write a C/C++ program that will open a file and print the last n lines of the file. The name of the file and n are inputs to the program.
Exercise:  Let p(z) be a polynomial of nth degree with complex coefficients. The fundamental theorem of algebra states that p(z) = 0 has n complex roots. The Newton iteration for finding the roots is z_{n+1} = z_n − p(z_n)/p'(z_n). The iteration converges to different roots depending on the initial iterate z_0. Write a C/C++ program that takes a polynomial p(z) and z_0 as inputs and determines which root the Newton iteration converges to. Color the complex plane depending on which root the Newton iteration converges to.
Exercise:  In sexagesimal, a non-negative integer is written in the form sum_{i=0}^{n−1} a_i × 60^i, where a_i ∈ {0, 1, …, 59}. There are exactly 144 sexagesimal numbers with leading digit a_{n−1} = 1, n ≤ 6, and a terminating sexagesimal expansion for their reciprocals. Determine all 144 such numbers and their sexagesimal reciprocals. For n = 6, 7, …, 20, determine the number of sexagesimal numbers with a_{n−1} ≠ 0 and a terminating reciprocal. [19] See chapter 11 of [6]; originally published as Ancient Babylonian Algorithms in Communications of the ACM, vol. 15 (1972), with errata in vol. 19 (1976). Inakibit-Anu, priest of Anu and Antu in Uruk around 300 BC, determined 105 of the 144 sexagesimal numbers with terminating reciprocals. Inakibit-Anu's table gave the reciprocal of 60^2 + 45 as 59 × 60^{−3} + 15 × 60^{−4} + 33 × 60^{−5} + 20 × 60^{−6}.
Exercise:  C structs are like C++ classes with public data members but no function members. In the following struct
struct node{
  double x;
  struct node *next;
};
each node points to the next to form a linked list (see figure 1.6↑). The next field of the last node is assumed to be NULL. Write a function that takes a pointer to the first node in a linked list and reverses the linked list.
Exercise:  The definition of Aitken() given here overwrites the input sequence. Give an implementation that does not overwrite the input sequence and that corresponds to the declaration on line 4 of Aitken.hh listed above.
Exercise:  Time the C++ and C implementations of the Aitken transformation and compare.

1.4 A little Fortran

In this section, we show a bit of Fortran syntax. The syntax is deliberately Fortran 77 and not the newer varieties. When the need to use old Fortran codes arises, it is often Fortran of this variety. We do not recommend programming in Fortran. [20]  [20] [5] has noted that the inadequacy of Fortran for systems programming became apparent as soon as Ken Thompson attempted to write Unix in Fortran in 1969. Fortran was an amazing advance in compiler theory and technology for its time but has been forgotten in the world of systems programming for more than four decades. The language is rigid and does not allow for dynamic data structures such as linked lists, trees, and graphs in its 77 version. Such data structures are increasingly used in scientific computing and are indispensable to computer science.
The core data structure in Fortran is the array. In our opinion, Fortran does not do a good job here. In Fortran, the array is thought of as a variable name, the length of the sequence, and the type of each item, which determines the size of each item in bytes. The variable name is actually a pointer (an address), but only covertly and not explicitly as in C. Fortran does not allow pointers in any generality. The Fortran array is an abstraction that strives to be close to what happens on the machine. However, it is an awkward abstraction because the notion of pointers is not thrown away but adopted covertly in a highly restricted form.
Unlike C/C++, the Fortran language does not provide access to machine capabilities. For some of the more sophisticated optimizations, the Fortran language is inadequate.
There is a belief that Fortran is faster than C/C++. This belief is a complete myth, being no more than an indication of the proficiency of those who believe in it, and will be completely dispelled later.
Part of the Fortran code for applying the Aitken iteration to the logarithmic series follows. The function partialSum(x,n) listed below returns the partial sum of the first n terms of the Taylor series of log(1 + x). [21] Seasoned Fortran programmers would use implicit none in the function definitions to preclude the compiler from assuming undeclared function parameters to be of type integer.
      double precision function partialSum(x, n)
      double precision x
      integer n
      double precision prod
      integer i
      partialSum = 0
      prod = 1.0D0
      do 10 i=1,n,1
         prod = prod*x
         if(mod(i,2).eq.0) then
            partialSum = partialSum - prod/i
         else
            partialSum = partialSum + prod/i
         endif
 10   continue
      return
      end
The main program listed below prints the nth partial sum, the extrapolated value using the first n partial sums, and the true value of log(1 + x) for a few values of x. It corresponds to the data shown in table 1.2↑. It calls the function extrapolateSum() in addition to partialSum(), but the definition of the former is not listed.
      program main
      double precision xlist(11), seq1(13), seq2(11)
      double precision val1, val2, val3
      double precision partialSum, extrapolateSum
      integer n, i
      xlist(11) = 1.25D0
      do 30 i=0,9,1
         xlist(i+1) = i*1.0D0/9
 30   continue
      
      n = 13
      do 40 i=1,11,1
         val1 = partialSum(xlist(i),n)
         val2 = extrapolateSum(xlist(i),n, seq1, seq2)
         val3 = log(1+xlist(i))
         write (6, 100) val1, val2, val3
 100     format(F14.10, F14.10, F14.10)
 40   continue
      stop
      end
There are huge differences between C and Fortran in the way function calls work. Let us look at the first function call that occurs here:
val1 = partialSum(xlist(i),n)
Here xlist is a name for an array of doubles and xlist(i) is a name for the ith location in that array. Similarly, n and val1 are names for locations that are big enough to hold an int and a double, respectively.
So far, nothing is really different from C. The difference is in the way the arguments of partialSum() are set up. The function partialSum() has two arguments, called x and n. If we were in the C world, new locations would be created, and x and n would become names for those two new locations. In Fortran, nothing of that sort happens. Instead, x and n merely become names for the locations named by xlist(i) and n in the main program. So if we change x inside partialSum(), that will change xlist(i) inside the main program. This awkward semantics is a result of using pointers implicitly to represent arrays efficiently but not allowing pointers into the language. [22] Languages such as Python and Java also use pointers implicitly. However, both languages have a more abstract as well as more consistent view of objects, and the awkwardness associated with Fortran does not arise. In Python, every variable is really a pointer to an object, and not to a memory location as in Fortran. So if we say x = 7, the variable x points to the object 7. If x is passed as an argument to a function f(y), y begins to point to the object 7. The crucial difference from Fortran is as follows: suppose we say y = 5 in the body of the function; then y begins to point to the object 5, and at the point of call, x continues to point to the object 7. Analogously to Fortran, if x is a pointer to a large object such as a NumPy array, one may change its entries using syntax such as y[0] = -1. To reconcile these opposing tendencies, Fortran passes even int and double arguments by reference.

1.5 References

Bibliography

[1] B. Kernighan, D. Ritchie: The C Programming Language. Prentice-Hall, 1988.

[2] B. Stroustrup: The C++ Programming Language. Addison-Wesley, 1997.

[3] C. Brezinski, M. Redivo Zaglia: Extrapolation Methods: Theory and Practice. North Holland, 1991.

[4] D. Kozen: Automata and Computability. Springer-Verlag, 1997.

[5] D. Ritchie: The development of the C language in History of Programming Languages II (T.J. Bergin and R.G. Gibson, ed.). Addison-Wesley, 1996.

[6] D.E. Knuth: Selected Papers on Computer Science. Cambridge University Press, 1996.

[7] D.E. Knuth: The Art of Computer Programming. Addison-Wesley, 1998.

[8] G.A. Baker, P. Graves-Morris: Padé Approximants. Cambridge University Press, 1996.

[9] J.M. Borwein, P.B. Borwein: Pi and the AGM: A Study in Analytic Number Theory and Computational Complexity. Wiley-Interscience, 1998.

[10] K.V. Sarma: A History of the Kerala School of Hindu Astronomy. Vishveshvaranand Institute, 1992.

[11] Morris Kline: Mathematical Thought from Ancient to Modern Times. Oxford University Press, 1990.

[12] R.S. Westfall: Never at Rest: A Biography of Isaac Newton. Cambridge University Press, 1980.

2 C/C++: Libraries and Makefiles

Splitting a program into several source and header files, as in the previous chapter, is essential but not sufficiently powerful in itself to capture the conceptual relationships of many programs. When the interdependence between the modules is complex, it is no longer adequate to put all the source files in a single directory. The source files must be organized into directories and subdirectories to bring greater order and clarity.
There are two powerful ideas for bringing greater modularity into C/C++ programs, and both of them will be introduced in this chapter. The first idea is to combine object files into libraries, and the second idea is to organize program sources into a source tree.
In outline, a C/C++ program is built as follows. There are program sources to begin with. These are turned into object files, which mainly consist of machine instructions, by the compiler. The linker eliminates unresolved external references and merges the object files to produce an executable. The two ideas for bringing greater modularity occur at different points in this process.
The organization of sources into a tree precedes both compilation and linking. The solution to most problems naturally breaks up into several components. For example, an image-processing program may be broken up into modules for handling different image formats, modules for displaying images, modules for image transformations, modules for image enhancements, modules for combining images, and so on. If the sources for each of these functions is put in separate directories, the directories become modules, and the sources are now submodules within these directories. Although an overly deep hierarchy can cause complications and must be used only for truly complex programs, one can easily imagine directories within directories so that the source files are grouped into modules, and these modules are grouped into higher level modules, and so on in a tree-like hierarchy.
In contrast, libraries follow compilation and precede linking. The linking model is always flat. It does not matter how or if the sources are arranged in a tree. The linker takes in a flat list of object files and smashes them together to form an executable, regardless of where the object files or their sources are located. A library is simply a group of object files that may be fed to a linker as a single unit.
It is easy to see why libraries are so useful. An image-processing program may use linear algebra, Fourier analysis, and yet other tools. Although linear algebra and Fourier analysis are used, those functions are extraneous to image processing, and it is not right to include them within an image-processing source tree. Programs for optimization, solving differential equations, clustering, and other tasks have an equal right to use linear algebra and Fourier analysis. Thus, the most natural thing to do here is to combine all the linear algebra object files into a linear algebra library and all the Fourier analysis object files into a Fourier analysis library. Commercial vendors may sell libraries of object files without the sources to safeguard their profitability.
Section 2.1↓ exhibits an important feature of the translation of C/C++/Fortran sources to object files. In the previous chapter, we have looked at C/C++/Fortran sources. Several names appear within these sources. There are names of variables used to hold data and then there are names of functions. If a function or a variable defined in one source may be used by another, the name of that function or variable is retained in the object file, although in a transmuted form. An object file uses the same machine language (we will look at machine instructions only in the next chapter) and has nearly the same format, syntax, and semantics regardless of whether it is from a C, C++, or Fortran source. However, the transmutation of names from source to object file is different in C, C++, and Fortran. Once the convention for transmuting names is understood, it is easy to call Fortran programs from C/C++. The convention is the simplest and the most natural in C. In addition, C uses a relatively simple runtime library. It is typical to have to use C in the middle if one wants interoperability between other languages such as Java or Python.
Section 2.2↓ is a brief introduction to the BLAS/LAPACK linear algebra libraries. There is no area of mathematics that is as ubiquitous in applications as linear algebra, and there are no scientific libraries as widely used as BLAS/LAPACK. Every scientific programmer needs some familiarity with these libraries. Optimizing or even implementing the BLAS/LAPACK libraries would be an impossible task for almost any programmer or team. There is much to be gained from using good libraries whenever they are available. Like their bricks and mortar counterparts, libraries are a powerful means to propagate knowledge.
When a program is split into several source files and the source files are grouped into modules, compiling source files into object files and then generating executables become tedious, repetitive, and error-prone. Makefiles and the make utility bring organization and coherence to compilation and linking. Section 2.3↓ gives an account of GNU make. The importance of understanding make cannot be overstated. There is no such thing as modular programming in C/C++ without it. Although there are other ways to manage sources, the make utility is the oldest and most widely used.
The final section of this chapter, section 2.4↓, makes a transition to the rest of the book. In addition, it demonstrates the importance of well-optimized libraries. An implementation of the Fast Fourier Transform (FFT), one of the most important algorithms in science, which is coded expertly but without regard to computer architecture, is compared with libraries that are cognizant of computer architecture. The optimized libraries are found to be nearly an order of magnitude faster. Section 2.4↓ also shows how unpredictable compilers can be. A slight change degrades performance of one of the optimized libraries by nearly a factor of two.

2.1 Mixed-language programming

Compilers translate .c (C), .cpp (C++), and .f (Fortran) sources into .o object files. The .o object files are mainly a sequence of machine instructions. If the source file calls functions defined externally, which is the typical scenario, there will be unresolved names in the corresponding object file. In section 2.1.1↓, we look at the map from sources to object files as a precursor to mixed-language programming.
The manner in which for-loops and other constructs map to machine instructions is the topic of the next chapter. In this section, we only look at the map from globally visible names in the sources to names in the object file. The aitken.c source looks as follows:
#include <assert.h> 
#include "aitken.h" 
void aitken(const double* seq1, double* seq2, 
		int len){
...
}
​
double aitkenExtrapolate(double *seq1, double* seq2, 
			int len){
...
}
In this source, only the function names aitken and aitkenExtrapolate are globally visible. Names of arguments are local to the function body. Other variables defined inside the function body have only local scope as well. Names with local scope are not mapped to object files.
The CPP and Fortran sources for the Aitken iteration define functions with similar names. The main point of section 2.1.1↓ is that the same name maps to a different name in the object file depending on whether the source is C, C++, or Fortran. Once this point is understood, making C, C++, and Fortran work together becomes much simpler. The mapping is the most straightforward in C, where names are unchanged from sources to object files. In Fortran, the map is a little more complicated. In C++, the mapping is a great deal more complicated.
In section 2.1.2↓, we explain how to call Fortran from C or C++. Beyond the transmutation of names, which differs between the three languages, the additional issue of runtime libraries has to be dealt with.

2.1.1 Transmutation of names from source to object files

A .o object file is mostly a collection of machine instructions that translate the corresponding source into a machine-intelligible form. If the source has a statement such as a=b+c, for example, the names a, b, c typically disappear from the object file. The compiler decides the memory locations or registers that these variable names map to. What is found in the object file is simply an add instruction of some type whose operands are either memory locations or registers.
Not all names present in the source disappear, however. Those names present in the source that survive in the object file are some of the most important. These are, typically, names of functions that may be called from other object files or names of functions defined in external object files that are called from this one. Variable names also may have global scope.
The names present in the object file are needed to resolve undefined references during linking of object files. The names may not be exactly the same as in the original source. Compilers may alter names before mapping them from source to object files. In C, the names are not altered at all. In C++, the names must necessarily be altered to support the overloading facility that allows the same name for multiple functions. In Fortran, too, the names are altered, although the only reason here seems to be to maintain compatibility with earlier conventions.
The GNU/Linux command nm [23]  [23] The nm utility is part of GNU’s binutils package. lists the names present in an object file. To examine the object file aitken.o, we use the command nm aitken.o. A part of the output of that command follows.
                 U _intel_fast_memcpy
0000000000000000 T aitken
0000000000000070 T aitkenExtrapolate
The second and third lines verify that names are unchanged when a C source is transformed to an object file. This object file was produced using Intel’s icc. The first line refers to a function call inserted by the compiler that was not present in our source. That name is undefined, but the linker will supply the appropriate definition. Any compiler may insert function names during optimization.
The function names are preceded by an address that is 16 hexadecimal digits long and by the letter T. The letter is T because both functions reside in the text area of the object code. The letter would be C for the name of a Fortran common block or a C global variable defined outside the scope of any function. The hexadecimal addresses indicate that code for aitken() begins at 0 and for aitkenExtrapolate() at 70 (hexadecimal). These addresses will be shifted by the linker when it merges several object files into a single executable.
A partial listing of the output of nm leibniz.o is included to make one more point about the transmutation of names from C sources to object files.
                 U __intel_new_proc_init_H
                 U _intel_fast_memcpy
                 U aitken
00000000000002e0 T leibniz
0000000000000000 T main
                 U printf
0000000000000280 T printseq
This object file too includes a couple of names introduced by icc. These appear at the top and are undefined. There are two more undefined names---aitken and printf. The leibniz.c source calls the function aitken(), which is externally defined. Within the object file leibniz.o, that name is undefined.
At this point, we have a clear picture of what the linker does. The linking command
icc -o leibniz.exe aitken.o leibniz.o
concatenates the object files while resolving the undefined symbols. The symbol aitken, which is unresolved in leibniz.o, is resolved using the definition in the object file aitken.o. The linker uses a runtime library to resolve the undefined symbol printf. Likewise, the undefined symbols inserted by the compiler are resolved using internal libraries.
The C function names survive intact in the object code. The Fortran names change only slightly. The command nm logseriesf.o produces the following output (partial listing):
0000000000000000 T MAIN__
                 U __intel_new_proc_init_H
                 U aitkenextrapolate_
0000000000000280 T extrapolatesum_
                 U for_set_reentrancy
                 U for_stop_core
                 U for_write_seq_fmt
                 U for_write_seq_fmt_xmit
                 U log
0000000000000230 T partialsum_
The names extrapolatesum and partialsum have changed to extrapolatesum_ and partialsum_ in the object file. Fortran names are typically changed by appending an underscore at the end. The name aitkenextrapolate_ is undefined and has to be resolved using another object file. All the other undefined names are introduced by the Intel Fortran compiler and are resolved using standard libraries by the Fortran linker.
Unlike in C or Fortran, the transmutation of names in C++ is quite extensive. The command nm Aitken.o produces output that is much more complicated. We show only part of the output.
0000000000000090 T _Z6AitkenR6Vector
0000000000000350 T _Z6AitkenRK6VectorRS_
0000000000000110 T _Z7AitkenERK6Vectori
0000000000000000 r _Z7AitkenERK6Vectori$$LSDA
0000000000000030 T _Z8printseqRK6Vector
0000000000000000 W _ZN6VectorD1Ev
                 U _ZNSt8ios_base4InitC1Ev
                 U _ZNSt8ios_base4InitD1Ev
                 U _intel_fast_memcpy
                 U printf
None of the names in Aitken.cpp can be easily recognized here. The transmutation of names in C++ follows involved rules and is called name mangling. We can issue the command nm --demangle Aitken.o to get names in a form that is easily recognizable.
0000000000000090 T Aitken(Vector&)
0000000000000350 T Aitken(Vector const&, Vector&)
0000000000000110 T AitkenE(Vector const&, int)
0000000000000000 r _Z7AitkenERK6Vectori$$LSDA
0000000000000030 T printseq(Vector const&)
0000000000000000 W Vector::~Vector()
                 U std::ios_base::Init::Init()
                 U std::ios_base::Init::~Init()
                 U _intel_fast_memcpy
                 U printf
In the demangled listing, we can recognize not only the function names but also the types of the arguments.
Because of the overloading mechanism, several different C++ functions can have the same name. The names have to be mangled in the object code to distinguish between different functions with the same name. The C++ standard leaves the name-mangling convention up to each compiler. If two compilers use different conventions, object code produced by one C++ compiler cannot be linked with object code produced by the other. However, on Linux the Intel C++ compiler uses the same name mangling as g++, which is the de facto standard (on Windows it follows Microsoft vc++). Therefore, in principle at least, C++ object files compiled using icpc and g++ can be linked together. Linking g++ object files using the icpc linker appears reasonably safe but not the other way around.

2.1.2 Linking Fortran programs with C and C++

In scientific computing, C and C++ functions may need to call Fortran routines. Scientific software from earlier generations tends to be in Fortran 77.
To use Fortran functions within C or C++ programs, the naming used for the Fortran functions in C or C++ has to be cognizant of the way the names in the source files are altered in the object file. We want the names to agree in the object files because it is the object files and not the source files that get linked against each other. If the naming is right, the linker takes care of resolving the function calls.
Let us implement the repeated application of Aitken transformations to partial sums of the Leibniz series by mixing Fortran and C programs. Part of the output of nm aitkenf.o is given below.
                 U _intel_fast_memcpy
0000000000000000 T aitken_
0000000000000220 T aitkenextrapolate_
                 U for_write_seq_fmt
                 U for_write_seq_lis
0000000000000150 T printseq_
We will write a C program that calls the functions defined in aitkenf.o to extrapolate the Leibniz series to illustrate the nature of mixed-language programming.
The C code includes the following declarations near its beginning.
extern void aitken_(double *seq1, double *seq2, 
		    int *len);
extern void printseq_(double *seq, int *len);
extern double aitkenextrapolate_(double *seq1, 
			      double* seq2, int * len);
The extern keyword indicates that the three function names that are declared must be found in some other object file. The underscore is appended to the names to follow the convention of the Fortran compiler. This convention is common among Fortran compilers but may not be universal. The three arguments to aitken_ have types double *, double *, and int *. The first few lines of the definition of aitken() in the Fortran source are as follows.
 subroutine aitken(seq1, seq2, len)
      double precision seq1(*), seq2(*)
      integer len
The first argument of aitken() must be an array of double-precision numbers; once the function is called, seq1 becomes another name for that array. The first argument is nothing other than a pointer to a double, although in Fortran we simply think about it as an array not as a pointer. When calling the function from C, we have to drop that pretension and say explicitly that the first argument will be a pointer to double. The last argument to aitken() is an integer type in Fortran. Once the function is called, len becomes another name for that argument. When calling the function from C, we have to be explicit and specifically state that the last argument is of pointer to int type (and not of type int). Every use of len inside the Fortran code of aitken() will in effect dereference that pointer.
The C function below calls the Fortran routines. The leibniz() function used to generate partial sums of the Leibniz series (see line 6 below) is in C. Its definition was given earlier (see page 1↑).
 1  int main(){
 2  	const int len = 13;
 3  	double seq1[len];
 4  	double seq2[len];
 5  	int n, i, j; int farg;
 6  	leibniz(seq1, len);
 7  	n = len/2;
 8  	if(len%2==0)
 9  		n--;
10  	for(i=0; i < n; i++){
11  		farg = len-2*i; printseq_(seq1,&farg);
12  		aitken_(seq1, seq2, &farg);
13  		for(j=0; j < len-2*(i+1); j++)
14  			seq1[j] = seq2[j];
15  	}
16  	if(len%2==0)
17  		{farg = 2; printseq_(seq1, &farg);}
18  	else
19  		{farg = 1; printseq_(seq1, &farg);}
20  }
Line by line, this is almost the same as the C program for extrapolating the Leibniz series using functions defined in C (see page 1↑). There are only a few differences. Using function names with the underscore appended is the most obvious one. Another difference is that we invoke printseq_ as printseq_(seq1, &farg) after taking care to store len-2*i in farg on line 11. Similarly, the call to aitken_() on line 12 gives the third argument as &farg. A value such as len-2*i cannot be passed as an argument to a Fortran function. The value must be stored in a memory location, and the address of that memory location must be passed as a pointer. On lines 17 and 19, the second argument of printseq_ is &farg for the same reason.
To build the executable, we save the C source (comprised of the extern declarations, the definition of main() given above, and the definition of leibniz() on page 1↑) in the file leibnizFinC.c. The following three commands are issued using a makefile.
icc -c leibnizFinC.c 
ifort -c  aitkenf.f 
icc -o leibnizFinC.exe leibnizFinC.o aitkenf.o 
The first two commands create the object files. The -c option tells icc to compile only. The third command attempts to build the executable using the object files. This command uses the icc linker and not the Fortran linker, which leads to a problem. We have seen already that the object code for aitkenf.o contains some undefined names that have to do with the workings of the ifort compiler. An ifort linker would automatically take care of resolving those names, but the icc linker does not. So we have to do something explicit to resolve the undefined names created by the ifort compiler.
To figure out what to do, we will build an executable using Fortran object files and the following command.
 ifort -v -o leibnizf.exe leibnizf.o aitkenf.o
The -v option to the ifort linker asks the linker to be verbose and point out every step of what it does. The linker produces quite an eyeful. From that output, it appears that the Fortran runtime library is fetched by -lifcore.
We are now in a position to tackle the linking of aitkenf.o produced from Fortran source with leibnizFinC.o produced from a C source. We issue the command
icc -o leibnizFinC.exe leibnizFinC.o aitkenf.o \
       -lifcore
Fortunately, the -lifcore option resolves all the undefined names and an executable is generated. With the GNU compilers, the -lgfortran option ensures that the Fortran runtime libraries are linked.
Linking object files generated from Fortran with C++ object files presents a new issue. Suppose we copied the C source file leibnizFinC.c as follows:
cp leibnizFinC.c leibnizFinCPP.cpp 
We can try to build the executable as follows.
icpc -c leibnizFinCPP.cpp 
ifort  -c  aitkenf.f 
icpc -o leibnizFinCPP.exe leibnizFinCPP.o \
		aitkenf.o -lifcore
The linking will fail because of C++ name mangling. In leibnizFinCPP.cpp, we have the extern declaration
extern void aitken_(double * seq1, double * seq2, 
                          int * len); 
When it generates object code for that source file, the icpc compiler assumes that aitken_ is externally defined and the name of a C++ function. So it mangles the name. Issuing the nm leibnizFinCPP.o command shows the mangled name to be as follows:
 U _Z7aitken_PdS_Pi
Of course there is no such mangled name in the aitkenf.o file, and the linking has to fail. To get around this problem, we have to change all the three extern declarations in leibnizFinCPP.cpp slightly. For instance, the declaration of aitken_ should be as follows:
extern "C" void aitken_(double * seq1, double * seq2, 
                           int * len);
Here we are specifically telling the compiler that the name aitken_ has C linkage. So the C++ compiler will not mangle that name.
The C++ language is an extension of the C language. Every C program should be a valid C++ program. But name mangling becomes an issue if we want to call C programs from C++ programs. For instance, if aitken.h is the header file for aitken.c, any C source can include the header, and the linker will find the definitions of the names declared in the header in aitken.o. However, if a C++ source includes the header, the names in the header will get mangled by the C++ compiler. The linker will not be able to find the mangled names in aitken.o.
There is a simple workaround that makes a C header file good for inclusion in both C and C++ sources. The workaround encloses the body of the header file within a few lines.
#ifdef __cplusplus 
extern "C" { 
#endif
...(declarations)...
#ifdef __cplusplus 
}
#endif
As required by the standard, __cplusplus is defined in all C++ source files but not in C source files. If the header file is included in a C++ source file, all the declarations are enclosed in an extern "C"{} block and have C linkage. The C++ compiler will not mangle their names. If the header file is included in a C source file, the declarations are not enclosed in an extern "C"{} block.
Exercise:  The object file of the program source
char *s="char *s=%c%s%c;%cmain(){printf(s,34,s,34,10,10);}%c";
main(){printf(s,34,s,34,10,10);}
has a name that is undefined, a name in the data segment, and a name in the text segment. Which are these? Verify using nm.
Exercise:  Look up the nm man page and figure out the meaning of W and r designations attached to some symbols in object files compiled from C++.
Exercise:  Use the verbose option for the linking command and make a list of all the object files and libraries that are used to build the executables leibniz.exe and Leibniz.exe, respectively.
Exercise:  Following gcc and the GNU compiler collection, icc and icpc report all the header files that get included during compilation when invoked with the -M option. Make a list of all the header files that get included in leibniz.c and Leibniz.cpp, respectively.
Exercise:  The aitken.h header file (defined on page 1↑) is not suitable for inclusion in a C++ source. Modify it so that it may be used in either C or C++ source files.
Exercise:  Suppose you are not allowed to modify it. Explain how you can still include it in a C++ source.

2.2 Using BLAS and LAPACK libraries

The basic concepts of linear algebra are matrices and vectors. Many problems in science, such as the numerical solution of partial differential equations and numerical optimization, reduce to problems in numerical linear algebra. BLAS and LAPACK are widely used numerical linear algebra libraries. [24]  [24] The original papers on BLAS are [14, 18, 17]. The basic reference for LAPACK is [15]. The Intel MKL manuals include documentation for BLAS and LAPACK.
The BLAS library is split into three levels. Functions for vector operations such as dot products are included in the first level, for matrix-vector operations in the second level, and for matrix-matrix operations such as matrix multiplication in the third level. The split into three levels is conceptual and reflects the historical order in which the interfaces for the BLAS functions were specified. Implementations of BLAS and LAPACK such as MKL and ACML, which are supported by Intel and AMD, respectively, bundle all three levels of BLAS as well as LAPACK into the same library.
The specifications of the BLAS functions have been frozen for nearly three decades. However, LAPACK evolves from time to time to include new algorithms. LAPACK is built on top of BLAS. Functions for solving systems of matrices, solving linear least squares problems, finding eigenvalues, and finding singular values are found in LAPACK.
Although the BLAS specifications have been frozen for decades, implementations of BLAS have to constantly respond to the rapid changes in computer architecture. A good implementation of matrix multiplication in 1990 looks nothing like a good implementation of matrix multiplication in 2015. In the intervening decades, computer architecture has advanced to include instruction pipelines, instruction-level parallelism, multiple levels of cache memory, expanded register sets, out-of-order execution, and multiple processing cores. Good implementations optimize BLAS for all these features of modern computers.
The early specifications of BLAS were given in Fortran. However, Fortran does not provide adequate access to features of computer architecture. One cannot see how the commercial BLAS implementations in libraries such as MKL and ACML are coded. However, it is almost certain that the BLAS functions are coded in C and in assembly language native to the computer architecture that is targeted. C functions can easily mimic the interfaces and calling conventions of Fortran subroutines.
Many LAPACK functions were coded in Fortran using BLAS years ago. The hope was that architecture-specific optimizations would be confined to BLAS as computers evolved. However, modern implementations of LAPACK functions such as LU and Cholesky factorizations use specialized algorithms that are a great deal more complicated than using the BLAS routines in a direct manner. [25]  [25] For a general discussion of BLAS and LAPACK, see [21]. [25] gives a taste of optimizing LAPACK functions such as LU factorization. For a survey of blocking algorithms in matrix computations, see [16].
In this chapter, our discussion is limited to programs that run on a single processor core, although the basic algorithms of dense linear algebra can be effectively adapted to many processor cores and large networks. We will discuss some of the aspects of concurrent programs as they relate to numerical linear algebra in later chapters.
Our introduction to BLAS/LAPACK in this section begins with a discussion of the representation of matrices using arrays (section 2.2.1↓). One may think that the natural way to represent matrices is to use two-dimensional arrays. In fact, matrices are represented using one-dimensional arrays, as we explain. A brief introduction to BLAS/LAPACK functionality in section 2.2.2↓ is followed by the discussion of a C++ class interface in section 2.2.3↓. This class is unlike the general Vector class of the previous chapter and more typical of the way we use C++. The class is specific, is narrowly defined, and does just one thing, which is to provide an easy interface to LAPACK’s linear solver.

2.2.1 Arrays, matrices, and leading dimensions

We look at multidimensional arrays in C/C++ briefly, although it is nearly always better to work with one-dimensional arrays. A two-dimensional array in C can be defined as follows:
double two_d[20][40];
Here two_d[][] can be thought of as a two-dimensional array with 20 rows and 40 columns. One-dimensional arrays are almost equivalent to pointers. However, two-dimensional arrays are not.
The array subscripting operator [] has left to right associativity. Therefore, the compiler parses our definition as
double (two_d[20])[40];
In words, two_d is an array of size 20, each entry of which is an array of 40 doubles. In memory, the 800 double locations are next to each other, just as for a one-dimensional array of size 800. However, it is illegal to say
double *p = two_d;
The value two_d is of type pointer to an array of 40 doubles and not double *. The following usage would be legal.
double (*p)[40];
p = two_d;
Here p, like two_d, is a pointer to an array of 40 doubles. The connection of multidimensional arrays in C to pointers is not straightforward, which is the principal reason to avoid multidimensional arrays in C/C++. Legitimate uses of multidimensional arrays are rare but do exist.
Suppose we want a matrix of dimension 20 × 40. We can simply say
double a[800];
It is a bad idea to allocate large data structures statically. [26]  [26] Arrays defined locally inside functions are allocated on the stack, and the stack size is limited to the order of several MB per thread in threaded programming. That is one half of the reason it is a bad idea to allocate large data structures on the stack. The other half is that memory allocated statically, outside all functions, is not released until program termination. More generally, we can make room for an m × n matrix as follows:
double *a = (double *)malloc(m*n*sizeof(double));
We must remember to say free(a) when the memory is no longer needed.
Here we come to the distinction between column-major and row-major storage. In column-major storage, the (i, j)th entry of a[] is accessed as
a[i+j*m]
Here 0 ≤ i < m and 0 ≤ j < n. The column-major format is used by Fortran, BLAS, and LAPACK. In row-major storage, the (i, j)th entry of a[] is accessed as
a[i*n+j]
There is a natural way to approach the distinction between column-major format and row-major format. To access the entries of a[] in the order of storage, we may code as follows assuming column-major format.
for(int j=0; j < n; j++)
	for(int i=0; i < m; i++)
		a[i + j*m] = ...
Here the inner loop is accessing the entries of column j. So columns are “innermost” in column-major format. Likewise rows are innermost in row-major storage. When a matrix (or a higher dimensional tensor) is stored in a one-dimensional array, we need two items of information to access entries of the array. First, we need the bounds of the index variables. Here the bounds are 0 ≤ i < m and 0 ≤ j < n. Second, we need the ordering of matrix (or tensor) indices from inner to outer. In column-major format, the row index is innermost (which corresponds to columns being innermost).
figure FIGS/chapter1/colmajor_lda.png
Figure 2.1 Column-major storage of a matrix in a one-dimensional array. The shaded submatrix can be extracted by setting the leading dimension to 4.
The crux of the matter is to think of row and column indices as loop variables in a loop-nest. Figure 2.1↑ shows a matrix with the row index i in the range 0 ≤ i < 4 and the column index j in the range 0 ≤ j < 3 laid out in a one-dimensional array with columns innermost.
If we think of i and j as loop variables in a loop-nest, a way to extract submatrices suggests itself. Suppose we want to extract the top-left 2 × 2 matrix shaded in figure 2.1↑. We can take 0 ≤ i < 2 and 0 ≤ j < 2. Assuming a is a pointer to the entry 0, we may access the (i, j)th entry of the submatrix as
a[i+j*4]
The leading dimension is 4 here and not 2 because the 2 × 2 matrix is embedded in a matrix whose columns have 4 entries. A submatrix is accessed using its dimensions, the pointer to its first entry, and the leading dimension.
The leading dimension is the number of entries in the innermost dimension of the array. It is probably more appropriate to call it the inner dimension, implying a connection to the innermost loop, but the current usage is well established. Suppose we want to extract the 2 × 2 submatrix of the matrix shown in figure 2.1↑ whose top-left corner is 5. We may do so by taking b=a+5 to point to the first entry, which is 5, and setting the leading dimension to 4. The leading dimension is once again the number of entries in the innermost dimension of the enclosing matrix. In this example, the gap between the two columns of the submatrix straddles the boundary between two columns of the enclosing matrix.

2.2.2 BLAS and LAPACK

BLAS and LAPACK functions, to which we now turn, typically have long argument lists. For example, the Fortran interface to the BLAS function for multiplying a matrix and a vector has the following declaration in C:
extern "C" void dgemv_(char *,int *,int *,double *,double *,
                       int *,double *,int *,double *,double *, 
                       int *, int);
This function, which has 12 arguments in total, implements the operation y ← αAx + βy or y ← αAᵀx + βy. The first argument, which is a character string, allows us to specify whether the matrix A must be transposed or not. The next 10 arguments allow us to specify the entries and dimensions of the matrix A and of the vectors x and y, as well as the scalars α and β.
The last argument to dgemv_() is the only one that is not a pointer. Because all arguments to Fortran subroutines are passed by reference, we may expect all arguments in the C declaration of a Fortran subroutine to be pointers. Why then is the last argument not a pointer? The answer lies in the differing conventions for character strings in C and in Fortran. In C, a character string may be passed as a pointer of type char *. Its length is determined by the convention that the terminating character is ’\0’. There is no such convention in Fortran, and the lengths of character strings must be supplied explicitly. The last argument of type int must be the length of the first argument, which is a character string. This last argument is invisible in Fortran as the Fortran compiler takes care to supply it surreptitiously. [27]  [27] It is typical for Fortran compilers to use an additional argument at the end to indicate the length of a character string, but this practice is not mandated by the Fortran standard.
S   single precision           GE   general                     TRF   triangular factorization
D   double precision           GT   general triangular          TRS   triangular solve
C   complex single precision   PB   positive definite banded    LSS   least squares solver
Z   complex double precision   SP   symmetric packed storage    SVD   singular values
                               TB   triangular banded           EV    eigenvalues
Table 2.1 Convention for naming functions in BLAS and LAPACK. Function names are formed by adjoining codes in columns 1, 3, and 5. For instance, DGETRF() implements triangular factorization of general matrices with double-precision entries. Columns 3 and 5 list only a small subset of the possible codes.
The naming convention used by BLAS/LAPACK is described by table 2.1↑. According to that convention, DGETRF is the name of the LAPACK function for triangular factorization of general matrices with double-precision entries. As we already mentioned, many LAPACK functions were written in Fortran long ago. If we want to call an LAPACK Fortran routine named DGETRF from C, we have to be aware of the convention for altering names when Fortran source files are converted to object code. However, to save us the trouble, the header file mkl_lapack.h, which is included by the header file mkl.h, allows many possible names for the same function.
void DGETRF( MKL_INT *m, MKL_INT *n, double *a, MKL_INT *lda, MKL_INT *ipiv, MKL_INT *info );
void DGETRF_( MKL_INT *m, MKL_INT *n, double *a, MKL_INT *lda, MKL_INT *ipiv, MKL_INT *info );
void dgetrf( MKL_INT *m, MKL_INT *n, double *a, MKL_INT *lda, MKL_INT *ipiv, MKL_INT *info );
void dgetrf_( MKL_INT *m, MKL_INT *n, double *a, MKL_INT *lda, MKL_INT *ipiv, MKL_INT *info );
These four names are all bound to the same function in the object code of the library. So we may call the function by any one of the four names. MKL_INT is used as a synonym for int to facilitate porting the library to systems with different conventions for int and long int. The parameters m and n of dgetrf_ are the number of rows and columns of the matrix whose (i, j)th entry is a[i+j*lda].
Triangular factorizations are useful for solving matrix systems. The following is a 2 × 2 example:
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 5 & -3 \\ 10 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1/2 & 1 \end{pmatrix} \begin{pmatrix} 10 & 1 \\ 0 & -3.5 \end{pmatrix}.
The left-most matrix flips the rows to ensure that the subdiagonal entries of the lower triangular factor are at most 1 in magnitude. Such row pivoting promotes numerical stability.
DGETRF carries out the triangular factorization in place, and when the function returns, the lower and upper triangular factors will be stored using the same array a. The unit diagonal of the lower triangular matrix is not stored. The row that ends up as the ith row after pivoting is ipiv[i] (ipiv must be allocated as an integer array of length m before the function is called). The last parameter info returns information about possible errors: if *info=0, the execution was successful; if *info=-i, the ith argument had an illegal value; and if *info=i with i = 1, …, m, the input matrix is singular and the (i − 1, i − 1) entry of the upper triangular factor is zero.
After triangular factorization, matrix systems are solved using DGETRS. Here TRS is the code for triangular solve. Within MKL, there are four possible names for the same function as shown by declarations in mkl_lapack.h. One of these is dgetrs_():
void dgetrs_( char *trans, MKL_INT *n, 
                 MKL_INT *nrhs, double *a, 
                 MKL_INT *lda, MKL_INT *ipiv, 
                 double *b, MKL_INT *ldb, 
                 MKL_INT *info );
Before this function is called, the matrix stored in the array a must have undergone triangular factorization. The first argument to this function is a character string named trans. This string allows us to ask for the solution of the transposed system. The system is not transposed if trans='N' but is transposed if trans='T'. When the Fortran function is called from C, we must ordinarily pass the length of the string explicitly as the last argument. However, MKL takes advantage of the way strings are handled by LAPACK’s Fortran routines and does not require us to pass that last argument.
Once A has been factorized as PA = LU, where P is the pivoting matrix and L and U are lower and upper triangular factors, respectively, the solution of Ax = b is calculated by solving Ly = Pb for y and then Ux = y for x. In a triangular system such as
$$
\begin{aligned}
l_{11}x_1 &= b_1 \\
l_{21}x_1 + l_{22}x_2 &= b_2 \\
&\;\;\vdots \\
l_{n1}x_1 + \cdots + l_{nn}x_n &= b_n
\end{aligned}
$$
the first equation is solved for x1, the second equation for x2, and so on. The numerical stability of this obvious back substitution algorithm is quite subtle. [28]  [28] See [24].

2.2.3 C++ class interface to BLAS/LAPACK

The Vector class of section 1.3.1↑ is an attempt to capture the general concept of vectors. The LU_Solve class of this section is narrowly defined. It does just one thing, which is to provide an easy interface to LAPACK’s LU solver functions dgetrf() and dgetrs().
The class is defined (in the header file lusolve.hh) as follows:
class LU_Solve{
private:
     int dim;
     double *A;
     int *ipiv;
public:
     LU_Solve(double *a, int dimi);
     ~LU_Solve();
     void factorize();
     void solve(double *v);
};
In the Vector class, the member functions were defined within the class definition itself. In the LU_Solve class, the member functions are declared as part of the class definition, but they are defined separately. The class constructor LU_Solve() takes the matrix to be solved as well as its dimension as arguments. The member function factorize() factorizes the matrix. Linear systems are solved using the member function solve(). The argument v is the right-hand side at entry to the member function. It is overwritten by the solution at exit. The factorize() function must be invoked immediately after a class object is defined, and it must be invoked just once. Once the matrix is factorized, any number of linear systems may be solved using LU_Solve::solve().
The BLAS/LAPACK functions for LU factorization are cumbersome to call directly. The LU_Solve class is a hassle-free interface. To solve two linear systems of dimension 1000, we may code as follows:
double *A = (double *)malloc(8l*1000*1000);
... initialize A ...
double *v = (double *)malloc(8l*2*1000);
double *w = v + 1000;
... initialize v and w ...
​
LU_Solve lu(A, 1000);
lu.factorize();
lu.solve(v);
lu.solve(w);
​
... report/graph v and w ...
​
free(v);
free(A);
Here lu.solve(v) overwrites v with the solution and likewise for w.
The class constructor is defined (in lusolve.cpp) as follows:
1  LU_Solve::LU_Solve(double *a, int dimi)
2       :dim(dimi)
3  {
4       A = new double[dim*dim];
5       ipiv = new int[dim];
6       for(int i=0; i < dim*dim; i++)
7            A[i] = a[i];
8  }
There is a bit of new syntax here. On line 2, the colon introduces the member initializer list. The only entry of that list is dim(dimi), where dim is a data item in the class object being constructed and dimi is an argument to the constructor. Saying dim(dimi) is equivalent to calling the constructor of the class that dim belongs to with dimi as the argument. Because dim is an int, which is a basic type, it is equivalent to saying dim=dimi at the beginning of the function. On lines 4 and 5, A and ipiv are allocated using the new operator (instead of malloc()). The for-loop copies the input matrix to A[].
The member functions of the class LU_Solve are declared within the namespace LU_Solve introduced by the class definition. When the member functions are defined externally, their names must be qualified using LU_Solve:: as for the constructor above. The names of the other three member functions are similarly qualified in lusolve.cpp.
​
LU_Solve::~LU_Solve(){
     delete[] A;
     delete[] ipiv;
}
​
void LU_Solve::factorize(){
     int m = dim;
     int n = dim;
     int lda = dim;
     int info;
     dgetrf_(&m, &n, A, &lda, ipiv, &info);
}
​
void LU_Solve::solve(double *v){
     char trans[3] = "N ";
     int nrhs = 1;
     int lda = dim;
     int ldv  = dim;
     int info;
     dgetrs_(trans, &dim, &nrhs, A, &lda, ipiv, v, 
	     &ldv, &info);
}
The simplicity of the LU_Solve interface to LAPACK’s LU solver comes at the cost of lesser generality. The member function factorize() assumes the matrix to be square, with its leading dimension exactly equal to the size of its columns. Similarly, solve() assumes that the number of right-hand sides is nrhs=1 to offer a simpler interface.
Some LAPACK routines, including triangular solve, are optimized for multiple right-hand sides in the MKL library. Making repeated calls instead of a single call with a suitable nrhs may degrade program speed considerably.
Narrow classes such as LU_Solve are simple to code but can still be quite useful. All C++ classes defined in this book from here onward are of the same type.
If the Intel compilers and the MKL library are used, one may include the header file mkl.h, which includes declarations of BLAS/LAPACK functions. One simply needs to pass the -mkl option to the compiler and it will look for the header file in the right place. Linking is equally easy. One needs to pass the -mkl=sequential option to the linker to link the sequential version of the MKL library.
If the GNU compilers are used, compiling and linking open source BLAS and LAPACK libraries can be equally easy. If the header file is put in a standard location such as /usr/include, there is no need to do anything special when the CBLAS header file cblas.h is included. One may need to explicitly declare the LAPACK functions with extern C linkage. Likewise, if BLAS functions are used, instead of CBLAS, they too may need to be declared explicitly. If the libraries are put in a standard place such as /usr/lib, it suffices to pass options -lblas and -llapack to link the BLAS and LAPACK libraries. The online MKL link advisor may be consulted to link MKL libraries with gcc/g++.
Exercise:  Assume that a square matrix of dimension dim is stored in the array a[] with leading dimension equal to lda. Assume that dim is divisible by 4. Write a function
print_center(double *a, int dim)
which prints the square matrix of dimension dim/2 at the center of the matrix stored in a[].
Exercise:  Assume that a[] is an array of dimension n1n2n3, which stores three-dimensional data indexed by 0 ≤ i < n1, 0 ≤ j < n2, and 0 ≤ k < n3. The three indices can be ordered from innermost to outermost in six different ways. For each ordering, calculate the location in a[] of the entry with index (i, j, k).
Exercise:  Assume that the array a[] of dimension n1n2n3 stores three-dimensional data indexed using i, j, k as in the previous exercise. Assume that i is innermost and k is outermost. Explain how to extract the submatrix with i0 ≤ i < i1, j0 ≤ j < j1, and k = k0.
Exercise:  If the array a[] is as in the previous exercise, explain how to extract a submatrix with i0 ≤ i < i1, j = j0, and k0 ≤ k < k1.
Exercise:  The LU_Solve class relies on its user to remember to factorize exactly once before attempting to solve linear systems. Add a data member state in addition to dim, A, and ipiv to the private section of the class and use it to verify that the matrix is factorized exactly once before any call to solve().
Exercise:  The BLAS function dgemv() multiplies a matrix into a vector. Define a function
void mult_mv(const double *A, int m, int n, double *x)
which multiplies the m × n matrix A into the vector x. There is no need for even a narrowly defined class to build a usable interface to dgemv().
Exercise:  Program and test a narrowly defined class that interfaces to LAPACK’s least squares solver.
Exercise:  Program and test a narrowly defined class that interfaces to LAPACK’s tridiagonal solver. For multiple right-hand sides, calling the tridiagonal solver just once can be far more efficient than making multiple calls. Therefore, endow your class with the ability to solve multiple right-hand sides in a single function call.

2.3 Building programs using GNU Make

The organization of source files into directories and subdirectories is the heart of modular programming. Typically, several source files cooperate to do a task, and yet more source files are involved in bigger tasks. A directory holds source files that are related or perform similar tasks. Directories may be organized further into subdirectories in a source tree to reflect the structure of the program.
The make utility provides a method for building a program from its source tree. Each source file must be turned into an object file, and the object files must be linked together to form executables. Compiling and linking become quite repetitive and error-prone if done from the command line. Makefiles offer a more systematic approach to building programs.
A build system such as make is essential to C/C++ programming. The Makefiles hold valuable information about the structure of the program as a whole, which is absent from the source files. In Python, the correspondence between modules and the directory hierarchy is wired into the language, but there is no such facility in C/C++. Modular programming aims to organize and conquer. There can be no modular programming without organization of program sources into a directory hierarchy. Well-thought-out source trees aid programming as much as structured definitions of functions and classes.
figure FIGS/chapter1/make_dir_structure.png
Figure 2.2 Directories and files in the source tree for this book. The entire source tree is found at https://github.com/divakarvi/bk-spca.
We begin our discussion of GNU make by looking at the source tree shown in figure 2.2↑. The makevars.mk file at the root of the source tree defines make variables that are used in modules such as utils and linking. These modules correspond to directories. Some modules have several submodules. For example, the linking module has submodules aitken, easy, fft, lib, and linalg.
Much of the discussion in this chapter pertains to these submodules of linking. All the Aitken iteration programs are in the aitken submodule. The programs we used to illustrate use of the BLAS and LAPACK libraries are in linalg. We will discuss the use of shared and static libraries as well as the Fast Fourier Transform (FFT) later in this chapter. The programs and makefiles that will aid that discussion are in lib and fft.
The source tree shown in figure 2.2↑ is partial. The module proc, which is at the root, corresponds to the next chapter. The next chapter deals with the x86 processor core, its instruction set, registers, and pipeline. The modules corresponding to the last two chapters about the Xeon Phi and GPU programming are shown in the figure, but many others are omitted.
The source files are the leaves of the source tree. The source tree as a whole includes 173 .cpp files, 65 .hh headers, 12 .c files, 6 .h files, 6 .cu (CUDA/GPU) files, and 14 .py (Python) files. Most of these are omitted in figure 2.2↑. The only source files shown belong to the module utils, which is at the root of the source tree and the submodule aitken of the module linking. The modules utils and linking/aitken will be the basis of our discussion of makefiles.
In section 2.3.1↓, we begin by looking at part of the utils/ directory and part of the linking/linalg directory shown in the source tree of figure 2.2↑. A testing program in linking/linalg uses some utilities defined in utils/, thus providing an example of a program that depends on sources in distinct directories in the source tree. This example will be used to exhibit how the make utility may be used to build programs with sources scattered across the source tree.
The introduction to GNU make in sections 2.3.2↓ through 2.3.6↓ emphasizes the two-pass structure of make, the use of pattern rules, and recursive make. Recursive make is the simplest method for handling a source tree with multiple directories. Although it has certain disadvantages, it is adequate for small projects, and even some large projects use recursive make. In section 2.3.7↓, we discuss some of the disadvantages of recursive make and how to overcome them.
Finally, section 2.3.8↓ has an importance that is far beyond its length. Here we discuss how static and shared libraries work, and how to find out exactly which library has been linked. Linking and loading problems that every C/C++ programmer will encounter are discussed along with suggestions of how to tackle them.

2.3.1 The utils/ folder

Here we describe a testing program, linking/linalg/test_lusolve.cpp, which uses functions defined in sources in a different folder of the source tree, utils/ (see figure 2.2↑). Later, in section 2.3.7↓, this program is used to illustrate how GNU’s make utility builds an executable from sources scattered in different parts of the source tree.
The modules in utils/ facilitate timing, generation of random numbers, gathering statistics, making tables, and manipulation of double arrays. They are used extensively for testing, timing, and laying out data elsewhere in the source tree, although the corresponding code is almost always omitted from the text. We avoid mentioning the modules in utils for the most part, but a brief discussion is given here.
In more complicated settings, there will be many dependencies between the directories and subdirectories of the source tree. In the source tree for this book, the directories are mostly self-contained. Most of the dependencies are on utilities in utils/ and on modules for plotting and displaying data that are not shown in the source tree in figure 2.2↑.
The header file utils.hh defines a macro called assrt(). This macro, which is used frequently, is similar to assert(), which is defined in the C standard header file assert.h. The only difference is that assrt() always checks its assertion, whereas assert() is disabled when the preprocessor macro NDEBUG is defined. We find little use for debuggers. The debugger is a blunt tool that works without an idea of the logical structure of the program. When programs are compiled in debug mode, the memory layout of their data can change. Memory errors may not be reproduced faithfully in debug mode.
In addition to assrt(), utils.hh declares the following functions:
void array_abs(double *v, int n);
double array_max(double *v, int n);
void array_show(double *v, int n, const char* mesg);
void array_diff(double *restrict v, 
		double *restrict w, int n);
void array_copy(double *restrict v, 
		double *restrict w, int n);
void array_out(double *v, int m, int n, 
	       const char *fname);
void array_in(double *v, int size,  const char* fname);
These functions are defined in utils.cpp. Most of the declarations are self-explanatory. The function array_max() takes the absolute values of n entries of v and returns the maximum. The function array_out() interprets v as an m × n matrix in column-major order and outputs it to a file. These functions are used for testing and timing.
Another function declared in utils.hh and defined in utils.cpp is the following:
void verify_dir(const char *dir);
This function uses Linux system calls to verify whether dir is already present, and if it is not, it creates such a directory. Linux system calls are declared in the header file unistd.h. A basic familiarity with system calls is of much value in programming. [29]  [29] For a detailed, systematic, thorough, and readable account of Linux system calls, see [46].
Programs must be tested as extensively as possible. Although details of testing are normally omitted, we give a single example here, partly to illustrate how some of the modules in utils/ are used and partly to set up later discussion of recursive make.
The LU_Solve class of section 2.2↑ is tested using the function testlu(). It is defined in a source file in linking/linalg.
1   void testlu(int n){
2       assrt(n > 0);
3       double *A = new double[n*n];
4       double *v = new double[n];
5       for(int i = 0; i < n; i++){
6           v[i] = rand()*1.0/RAND_MAX-0.5;
7           for(int j = 0; j < n; j++)
8               A[i+j*n] = rand()*1.0/RAND_MAX-0.5;
9       }
10      verify_dir("DBG/");
11      array_out(A, n, n, "DBG/A.dat");
12  
13      LU_Solve lu(A, n);
14      lu.factorize();
15      array_out(v, n, 1, "DBG/b.dat");
16      lu.solve(v);
17      array_out(v, n, 1, "DBG/x.dat");
18  
19      system("test_lusolve.py DBG/A.dat"
20             " DBG/b.dat DBG/x.dat");
21  
22      delete[] v;
23      delete[] A;
24  }
The first block of testlu() (lines 2 through 11) initializes the arrays A[] and v[] with a square matrix and a vector of dimension n. Notice the use of assrt() on line 2. Line 10 creates the DBG/ directory if it does not already exist. The matrix A[] is saved in DBG/A.dat using array_out() (line 11). On lines 10, 11, 15, and 17, testlu() calls functions that are defined in a source file in an external module (utils/utils.cpp). This testing program is later used to illustrate how we may handle programs with sources scattered in multiple directories.
The middle block (lines 13 through 17) solves the linear system Ax = b and saves x as well as b.
Lines 19 and 20 invoke system(), which is a C library function. [30]  [30] The C library function system() must never be invoked from a privileged process: http://linux.die.net/man/3/system explains that strange values for environment variables may be used to break the system. This function forks a shell process and runs its argument as a shell command. The argument is a single string of type const char *, but it is broken across two lines. In C/C++, a string such as "one two" can be written as "one""two". Long strings can be conveniently split across lines.
The shell command
test_lusolve.py DBG/A.dat DBG/b.dat DBG/x.dat
calls a Python script that looks at the output data and verifies that Ax is indeed nearly equal to b. It prints the relative error ||b − Ax|| ⁄ ||b||.
Of course, we could have called the Python script from the command line. But then testlu() is incomplete by itself, and we have to remember to do something more to complete the test. It is usually good practice to make dependencies explicit in the source and not rely on memory. The testing program calls testlu() multiple times.
int main(){
	testlu(10);
	testlu(100);
	testlu(1000);
}
Like testlu(), this main() function is also defined in test_lusolve.cpp.

2.3.2 Targets, prerequisites, and dependency graphs

figure FIGS/chapter1/make_dependencies.png
Figure 2.3 Makefile dependency graph.
Dependencies are fundamental to make. Figure 2.3↑ shows that each object file depends on a single source and a header. Typically, the dependency is on multiple header files, unlike the simple situation shown in the figure. Each executable in turn depends on multiple object files.
The first purpose of a Makefile is to capture the dependency graph between headers, sources, object files, and executables. Each object in the dependency graph is typically a file as in figure 2.3↑. All files that have incoming edges in the dependency graph are targets. The incoming edges indicate that a target file must be built using a set of some other files. Those other files are the prerequisites. The targets may reappear as prerequisites, as is the case for all the object files in figure 2.3↑.
Typically, make assumes that a target file may not exist. The target file is considered out of date if its time stamp (accessed using stat on GNU/Linux) is older than that of any of its prerequisites. If a target either does not exist or is out of date, make takes it upon itself to create a file corresponding to the target.
The question arises of how make can create a new file corresponding to a target. To answer that question is the second purpose of a Makefile. The Makefile associates each target with a recipe, and the recipe is a shell command invoked by the make utility to build the target if the target is either absent or out of date.
The executables leibniz.exe and logseries.exe are built using the following Makefile:
1   leibniz.exe: leibniz.o aitken.o
2   	icc -o leibniz.exe leibniz.o aitken.o
3   
4   logseries.exe: logseries.o aitken.o
5   	icc -o logseries.exe logseries.o aitken.o -lm
6   
7   aitken.o: aitken.c aitken.h
8   	icc -fPIC -c aitken.c
9   
10  leibniz.o: leibniz.c aitken.h
11  	icc -c leibniz.c
12  
13  logseries.o: logseries.c aitken.h
14  	icc -c logseries.c
This listing is a fraction of the Makefile in the linking/aitken directory in the source tree. Other parts of the Makefile are used to build executables from Fortran and C++ sources.
This simple Makefile consists of two types of information. Lines 1, 4, 7, 10, and 13 are dependencies. In each dependency, the item before the colon is the target. Thus, the targets in lines 1, 4, 7, 10, and 13 are leibniz.exe, logseries.exe, aitken.o, leibniz.o, and logseries.o, respectively. The prerequisites follow the colon. In the rule on line 1, the prerequisites are leibniz.o and aitken.o.
Together the five rules specify the dependency graph shown in figure 2.3↑. For each executable, the dependency graph shows all the object files on which it depends. For each object file, the graph shows the source file and one of the header files on which the object file depends. When an executable is built, the flow of information is from source files to objects files and from object files to executables as shown in the figure.
The other type of information in the Makefile consists of the recipes. Lines 2, 5, 8, 11, and 14 are recipes. Each recipe begins with a tab and corresponds to the target in the dependency above it. Beginning each recipe with a tab is a major aspect of make syntax. Thus, line 2 corresponds to the target leibniz.exe. Together the dependency and the recipe form a rule. For example, lines 1 and 2 form a rule.
The make utility makes two passes. In the first pass, it consumes all the dependency rules and builds a dependency tree such as the one shown in figure 2.3↑. In addition, targets are associated with recipes. In our example, there are five targets: leibniz.exe, logseries.exe, aitken.o, leibniz.o, and logseries.o. These targets appear on lines 1, 4, 7, 10, and 13, respectively. The recipes that are bound to these targets in the first pass occur on the following lines.
In the second pass, the make utility brings the target list given at its invocation up to date. For each target, it first makes sure that its prerequisites are up to date before bringing the target up to date. In general, this leads to a depth first traversal of part of the dependency graph.
Once the prerequisites are up to date, the make utility checks whether the target is older. If every target and prerequisite is assumed to be a file, make finds the date of each file using a GNU/Linux command called stat (or a system call of the same name) and takes that to be the date of the target or the prerequisite. A target is out of date or older if any of its prerequisites is newer. If the target is out of date or older, the corresponding recipe is invoked to bring it up to date (make does not check the date of the target after executing the recipe). This two-pass operation is the heart of how make works.
Suppose we change to the linking/aitken directory and invoke the make utility at a shell prompt as follows:
make logseries.exe
During the first pass, the make utility notes that the target logseries.exe depends on logseries.o and aitken.o. These object files in turn depend on the corresponding sources and the header file aitken.h. If the object file logseries.o is older than either logseries.c or aitken.h, the target logseries.o is considered to be out of date during the second pass. Each file is stored with a time stamp that indicates the time at which it was last modified or changed (you can use the GNU/Linux utility stat to look at the time stamp of a file). If the target does not exist as a file, it is considered to be out of date.
If the target logseries.o is out of date, the make utility will execute the corresponding recipe during the second pass. More specifically, the command
icc  -c  logseries.c
is issued to bring the target up to date. Here the -c option to the C compiler tells it to compile only. The target aitken.o is brought up to date in a similar manner by executing the recipe corresponding to it if it is out of date with respect to either of its prerequisites.
Once the two object files logseries.o and aitken.o are ensured to be up to date, the second pass of the make utility checks whether the executable logseries.exe is out of date with respect to either of its prerequisites. The check is carried out by looking at the time stamps of the files as before. As noted already, a target such as logseries.exe is considered out of date if no file by that name exists. If the executable is out of date, the make utility issues the linking command
icc -o logseries.exe logseries.o aitken.o -lm
during the second pass. Here the -o option tells the icc linker to leave the output in logseries.exe. The -lm option at the end tells the icc linker to link the library libm.so. The shared library libm.so defines math functions such as log, exp, and the trigonometric functions.
The reader may notice the -fPIC option in the recipe for aitken.o (line 8). This object file will be included in a shared library later in this section. The -fPIC option is needed for that purpose.
The Makefile we have examined so far is quite simple. Yet it brings out the two passes in make’s operation, targets, prerequisites, recipes, and dependency graphs. We emphasize that nothing is done if a target is already up to date. For example, if we build leibniz.exe and then make a small change to aitken.c, the invocation make leibniz.exe will recognize that the prerequisite aitken.o is out of date and recompile aitken.c. However, leibniz.o is not out of date and the source file leibniz.c is not recompiled. The make utility uses the dependency graph to eliminate needless compilations. In a large source tree, the resulting saving can be considerable.
In our listing, there is a single rule for each target that specifies all the prerequisites for that target. The prerequisites can be given separately, but for each target, there can be only one effective recipe. For example, we can delete lines 1 and 2 of the listing and replace them by the following:
leibniz.exe: leibniz.o
leibniz.exe: aitken.o
leibniz.exe:
	icc -o leibniz.exe leibniz.o aitken.o
Here the two prerequisites are given in separate rules, and the recipe for the target leibniz.exe is given as part of a rule with no prerequisites. The three rules here can be given in any order. Because of the two-pass nature of make’s operation, the effect is the same.
We are also allowed to specify a dependency with multiple targets. The entire Makefile may be rewritten as follows:
aitken.o: aitken.c
	icc -c aitken.c
leibniz.o: leibniz.c
	icc -c leibniz.c
logseries.o: logseries.c
	icc -c logseries.c
leibniz.exe: leibniz.o
logseries.exe: logseries.o
leibniz.exe logseries.exe: aitken.o
	icc -o $@ $^ -lm
The last rule in this Makefile has two targets. The recipe for the last rule uses two automatic variables: $@, which expands to the target, and $^, which expands to the list of all prerequisites of the target. We will study automatic variables soon, but this is a little hint of what is to come. Thanks to automatic variables, we can use the same recipe for both the targets leibniz.exe and logseries.exe.
Many of the operations in building an executable are repetitive. In particular, executables are nearly always built by linking together all the object files in their prerequisite list along with libraries. The Makefile here fails to recognize that each .o object file is built from a .c source following the same pattern, which is to invoke icc with the -c compile only option. If pattern rules are defined appropriately, the entire Makefile can be reduced to two lines.
leibniz.exe: leibniz.o aitken.o
logseries.exe: logseries.o aitken.o
With suitable pattern rules, make will automatically generate the dependency of .o object files on .c prerequisites, invoke the right compilation command to update object files, and invoke the right linking command to build the executable targets. Automatic variables, make variables, and pattern rules enable us to simplify repetitive tasks as we will now learn.
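The following is a hedged sketch of such pattern rules (it is not the book's makevars.mk, which we examine next). In a pattern rule, % matches the stem of the target, $< expands to the first prerequisite, and $@ to the target.

```make
# Compile any .o from the .c file of the same stem; every object
# here also depends on aitken.h.
%.o: %.c aitken.h
	icc -c $<

# A static pattern rule: each executable depends on its own object
# file and on aitken.o, and is linked by one generic recipe in which
# $^ expands to the full prerequisite list.
leibniz.exe logseries.exe: %.exe: %.o aitken.o
	icc -o $@ $^ -lm
```

With these rules in place, adding a new executable to the Makefile takes a single dependency line.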

2.3.3 Make variables in makevars.mk

Almost all Makefiles have make variables. We use makevars.mk at the root of the source tree to show how make variables are used (see figure 2.2↑).
The makevars.mk file will serve us throughout this book. It has three sections. The first section defines variables.
1   #########
2   CPP      := icpc
3   CFLAGS   := -xHost -O3 -prec-div -no-ftz -restrict \
4    -Wshadow -MMD -MP
5   FFTWINC  := $(FFTW_INC)
6   MKLINC   := -mkl
7   #########
8   MKLLIBS  := -mkl=sequential
9   MKLTHRD  := -mkl=parallel
10  FFTWLIBS := $(FFTW_LINK)
In C/C++, a variable is a name for a location in memory. In make, a variable is a string.
The variable definitions from CPP to FFTWLIBS use := and not = following the variable. The use of := implies that the variables are evaluated immediately during the first pass. We do not discuss the other type of variable evaluation, which is called deferred evaluation.
All characters in a line following the # character, including that character, are ignored. Lines 1 and 7 begin with the # character and are therefore comment lines.
The variable CPP is set to icpc (line 2). It is the name of the C++ compiler used later in makevars.mk.
The CFLAGS variable (lines 3 and 4) stands for the options passed to the C++ compiler. The definition of CFLAGS is split across two lines using the continuation character \. It merits careful scrutiny. The optimization level is -xHost -O3. The -xHost flag ensures that the compiler generates instructions corresponding to the highest capability of the machine. This flag is essential for our purposes.
We do not bother with debug levels such as -g or -O0. The recommended optimization option in icpc is -fast. We do not use that option. It turns on -ipo or interprocedural optimization, which we do not want. Other dubious flags are also turned on by -fast.
By default the icpc compiler may use a less precise but faster division for IEEE double-precision numbers, according to the compiler’s manual. It is unclear whether the faster division is ever really faster or whether the flag ever really has any effect. The -prec-div flag (line 3) forces conformance to IEEE arithmetic. The compiler manual states that the flush-to-zero optimization is used for really small numbers that almost underflow. This is another “optimization” of dubious value and uncertain meaning. It is turned off using -no-ftz (line 3).
The -restrict option (line 3) enables restrict qualified pointers, a C99 feature we find to be quite valuable in the next chapter and later.
C++ member functions may accidentally redefine a class variable, leading to runtime errors. For example, state could be a data member that keeps track of the state of the class object, and a member function, which wants to set it to 1, may say int state=1 instead of state=1. The -Wshadow option (line 4) tells the compiler to issue a warning when variables defined in an outer scope are redefined in an inner scope.
The -MMD and -MP options (line 4) to the icpc compiler tell it to generate a .d file listing all dependencies of the source on header files. The way dependencies of C/C++ sources on header files is handled is discussed in section 2.3.6↓.
A few of our programs make use of the FFTW library. On line 5, the make variable FFTWINC is set to $(FFTW_INC). The understanding is that FFTW_INC is defined in the shell environment before calling make. It can be set to whatever is needed to find the FFTW header files. If the installation is along standard lines, the header file will be in a standard place such as /usr/local/include/, where the compiler always looks. So the shell variable FFTW_INC can be even blank or undefined. If the header is not in a standard place, the shell variable (assuming the bash shell) [31]  [31] We always assume the shell to be bash. must be set as in
export FFTW_INC="-I <dir-with-fftw-header>"
The -I option tells the compiler to look for headers at the directory that follows the option, in addition to the standard places. The directory is typically given as a full path.
If FFTW_INC is defined as a shell variable, it may be evaluated using $(FFTW_INC) as if it were just another make variable. If it is in fact not defined in the shell, it evaluates to the empty string.
Similarly, on line 10, the make variable FFTWLIBS is set by evaluating FFTW_LINK, which is presumed to be set in the shell. The shell variable FFTW_LINK can be as simple as -lfftw3 if the installation is along standard lines, which leaves the libraries in standard places such as /usr/local/lib/ or /usr/local/lib64. If the installation is not standard, the shell variable must be set as follows:
export FFTW_LINK="-L <dir-with-fftw-lib> -lfftw3"
The -L option tells the linker to look for libraries in an additional place, and -lfftw3 tells it to look for the fftw3 library.
Handling the MKL library is easy if the Intel compilers are used. The option -mkl (line 6) to the compiler tells it to look for the MKL header files in the right places. Linking is equally easy. We may use -mkl=sequential or -mkl=parallel (lines 8 and 9).

2.3.4 Pattern rules in makevars.mk

Makefile rules are made up of dependencies and recipes. The variables defined in the first section of makevars.mk, which we just discussed, are used to construct recipes. The recipes have a formulaic character. For example, if the target is an object file to be built from a C++ source, the recipe generally invokes the C++ compiler specified by CPP using the options listed in CFLAGS. Pattern rules take advantage of the repetitive nature of recipes to simplify their specification.
$@ Target
$< The first prerequisite
$? Prerequisites newer than the target
$^ All prerequisites with duplicates eliminated
$+ All prerequisites including duplicates
Table 2.2 Automatic variables recognized by the make utility.
Automatic variables, partially listed and explained in table 2.2↑, are the basis of pattern rules. Automatic variables enable a recipe to parse and extract tokens from the dependency that precedes it (in the same rule). For example, in the rule
leibniz.exe: leibniz.o aitken.o
	icc -o $@ $^
the automatic variable $@ evaluates to the target, which is leibniz.exe for this rule, and the automatic variable $^ evaluates to leibniz.o aitken.o, which is the list of all prerequisites of the target. If a dependency leibniz.exe: xyz.o is given elsewhere in the Makefile, xyz.o will be in $^ as well. The three most important automatic variables are $@, $^, and $<.
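After substitution of the automatic variables, the recipe of the rule above expands to:

```
icc -o leibniz.exe leibniz.o aitken.o
```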
Pattern rules form the second section of the makevars.mk file at the root of the source tree (see figure 2.2↑ for the source tree).
11.SUFFIXES:
12.SUFFIXES: .cpp .o .exe .s .d
13%.o: %.cpp
14	$(CPP)  $(CFLAGS)  -c $<
15%.s: %.cpp 
16	$(CPP) $(CFLAGS) -fno-verbose-asm  -S $< 
17%.o: %.s 
18	$(CPP) $(CFLAGS) -c $< 
19%.exe: %.o 
20	$(CPP) -o $@ $(filter %.o,$^) $(LIBS) 
Line 11 is a rule with .SUFFIXES as the target and an empty list of prerequisites. It has the effect of deleting the many suffixes and pattern rules stored by GNU make by default. The rule on line 12 specifies the suffixes we want to use in pattern rules that are explicitly given later.
The rule on lines 13 and 14 is the pattern rule for generating an object file from a C++ source with the file name extension .cpp. The pattern in the dependency is as follows:
%.o: %.cpp
It matches a dependency such as
Aitken.o: Aitken.cpp Aitken.hh
The % in the target matches with Aitken. The make utility substitutes Aitken for % in %.cpp. It looks for Aitken.cpp in the prerequisite list to complete a match to the pattern rule. Because the prerequisite list has Aitken.cpp, the match is complete.
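With this match, make behaves as if the ordinary rule below had been written out; the recipe comes from the pattern rule on lines 13 and 14, with $< replaced by the first prerequisite:

```
Aitken.o: Aitken.cpp Aitken.hh
	$(CPP) $(CFLAGS) -c Aitken.cpp
```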
Even if the dependency of Aitken.o on Aitken.cpp is not explicitly given, there is still a pattern match if the current directory contains a file named Aitken.cpp. The make utility generates the dependency automatically.
If a rule with target Aitken.o has a recipe, that recipe is used to update Aitken.o. If no recipe is explicitly specified for a target, the recipe of the pattern rule that matches the target is used for updating it. In this instance, that recipe would be
	$(CPP) $(CFLAGS) -c $<
We have gone over the definition of variables such as CPP (line 2) and CFLAGS (lines 3 and 4). When a variable is evaluated, the evaluation is specified as in $(CPP). The difference in syntax between the point of definition of a variable and its point of use is one of the oddities of make.
The recipes are evaluated and applied only during the second pass; the variables are evaluated during the first pass. Therefore, some of the variables may be defined after the pattern rules. Here CPP and CFLAGS are defined before the recipe. However, a later makefile that includes makevars.mk may append additional options to CFLAGS, and those will be used by the recipe during the second pass. The -c option tells icpc to compile only (and not attempt to link against any libraries). The -c option could have been folded into CFLAGS, but it is perhaps a little clearer to make it explicit in the recipe. The recipe uses the automatic variable $< to find the name of the source. It assumes that the source for generating the object file is given as the first prerequisite.
The recipe expands to
icpc -xHost -O3 -prec-div -no-ftz -restrict -Wshadow -MMD -MP -c $<
after substituting for CPP and CFLAGS. A makefile that includes makevars.mk may extend CFLAGS as
CFLAGS := $(CFLAGS) $(MKLINC) $(FFTWINC) -openmp
The recipe will then include options for finding MKL and FFTW headers, as well as -openmp. OpenMP is the topic of a later chapter. We can influence the compilation command by appending to CFLAGS because assignments to variables are evaluated during the first pass, whereas recipes are evaluated during the second pass.
Suppose there is no rule in the Makefile with Aitken.o as the target and we say make Aitken.o at the prompt. GNU make then notes that Aitken.o matches the target pattern %.o of a pattern rule. If the directory contains a file named Aitken.cpp, the pattern rule on lines 13 and 14 is a complete match, and its recipe is used to build Aitken.o. Thus, pattern rules are capable of generating dependencies automatically.
The pattern rule for generating %.s from %.cpp (lines 15 and 16) is used in chapter 3 to look at assembly code. The -fno-verbose-asm option leads to less cluttered assembly output. The -S option to the compiler tells it to generate the assembly code instead of the object code.
Lines 19 and 20 define a pattern rule for building %.exe executables from %.o object files. The recipe for that rule uses a make construct that is new to us:
$(filter %.o,$^)
GNU make has several built-in functions, and filter is one of them. As used here, it goes through the list of all prerequisites $^ and selects only those that match the pattern %.o. Therefore, the recipe on line 20 builds the executable using all the object files, and only the object files, in the list of prerequisites. In general, the dependency list of a .exe target contains all the object files needed to build it in addition to a few phony targets (see below), which the filter eliminates.
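As a concrete illustration, if a prerequisite list mixes object files with a phony directory target, filter keeps only the .o entries (the variable names here are hypothetical):

```
PREREQS := test_lusolve.o lusolve.o ../../utils ../../utils/utils.o
OBJS := $(filter %.o,$(PREREQS))
# OBJS is now: test_lusolve.o lusolve.o ../../utils/utils.o
```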
The recipe on line 20 is
	$(CPP) -o $@ $(filter %.o,$^) $(LIBS)
The list of libraries is passed to the linker using the variable LIBS. The evaluation of recipes is always deferred to the second pass, but variables such as LIBS are evaluated in the first pass, as explained already. Therefore, this variable does not need to be defined when the pattern rule is consumed during the first pass. If LIBS is defined as
LIBS := $(FFTWLIBS) $(MKLLIBS) -openmp
during the first pass, the recipe for building a %.exe target that depends on the corresponding %.o object file will automatically link the MKL, FFTW, and OpenMP libraries (in that order). The order in which libraries are linked can be significant.

2.3.5 Phony targets in makevars.mk

We will discuss recursive make, useful for building programs with object files in several subdirectories, shortly. Recursive make relies on phony targets. Our first encounter with phony targets is in a simpler context. The third and last section of makevars.mk is listed below.
21.PHONY: clean cleanxx
22clean:
23	rm *.o; rm *.exe; rm a.out;
24cleanxx:
25	rm *.o; rm *.a; rm *.so;  rm *.exe;  rm *.d
The rule with target .PHONY has clean and cleanxx as prerequisites (line 21). These are treated as phony targets. Ordinarily, make expects to find a file with the same name as a target and checks the latest modification time of that file to determine whether the target is out of date. For phony targets, make does not look for a file of the same name. Phony targets are always assumed to be out of date.
In this example, saying make clean will remove all object files and executables in the current directory (the directory from which make is invoked). Making the target cleanxx removes certain other files in addition. The .d files (see below) used to capture dependencies of a C++ source on header files get removed with cleanxx.

2.3.6 Recursive make and .d files

The make utility, as we have discussed it so far, applies to programs all of whose source files are in a single directory. That assumption fails for even moderately large programs. When the source and object files required to build a single executable reside in several subdirectories, recursive make may be used to complete the build.
Recursive make is a straightforward concept. It consists of calling make within a recipe for a phony target. The linking/linalg/ module defines interfaces to LAPACK’s LU factorization routines as discussed in a previous section. These interfaces reside in the source file lusolve.cpp. The source test_lusolve.cpp tests the interface. It uses a few utilities such as verify_dir() defined in utils/utils.cpp. Building the executable test_lusolve.exe offers a simple example of recursive make.
1include ../../makevars.mk
2CFLAGS := $(CFLAGS) $(MKLINC)
3LIBS := $(LIBS) $(MKLLIBS)
4######
5.PHONY: ../../utils objl
6../../utils:
7	make --directory=$@ objl
8######
9objl: lusolve.o
10lusolve.o: lusolve.cpp
11-include lusolve.d
12test_lusolve.o: test_lusolve.cpp
13-include test_lusolve.d
14######
15test_lusolve.exe: test_lusolve.o lusolve.o \
16		../../utils ../../utils/utils.o
This Makefile is used as the basis for three somewhat distinct discussions.
The first discussion is of recursive make. Two targets are declared phony on line 5. The first, ../../utils, enables a recursive build using the Makefile in ../../utils. The other phony target, objl, is the target in the present Makefile that builds all object files and libraries that may be useful externally. Throughout our Makefiles, objl is the conventional name for the phony target that builds every object file and library in the Makefile's own directory that may be needed externally.
Thus, the recipe for the phony target ../../utils (lines 6 and 7) is
	make --directory=$@ objl
The --directory option causes a new make process to be spawned with
../../utils
as the current directory and make objl is invoked within that directory (see table 2.2↑ for the meaning of $@). The Makefile in ../../utils defines objl as a phony target. Its recipe updates all the object files and libraries that may be used externally, including utils.o, which will be used in this Makefile.
The rule with the phony target objl (line 9) updates lusolve.o, which is the one object file in the linking/linalg/ folder that may be linked externally. Therefore, when other modules recursively call the Makefile in linking/linalg with objl as the target, the only object file that will be built is lusolve.o.
Recursive invocation of make is a consequence of the dependency on lines 15 and 16. The executable test_lusolve.exe is built using three object files (lines 15 and 16). Two of these object files, test_lusolve.o and lusolve.o, are built in the current directory. The object file utils.o, however, resides in ../../utils. To build that object file correctly, the phony target ../../utils is listed as a prerequisite. Updating ../../utils leads to a recursive invocation of make, which builds utils.o in ../../utils (lines 6 and 7). The linking recipe defined in makevars.mk filters out the phony target.
The second discussion is of .d files and the way dependencies on header files are handled.
Line 10 gives the dependency of lusolve.o on the source lusolve.cpp. It is typical for each object file to depend primarily on one source file. However, the source file typically includes several header files that in turn include other header files. The source lusolve.cpp includes the following header files:
#include <cmath>
#include <iostream>
#include <mkl.h>
#include "../utils/utils.hh"
#include "lusolve.hh"
These header files include yet others, and the command
icpc -O3 -restrict -M -mkl -c lusolve.cpp
shows all the included files, which are 123 in number. This command uses the -mkl option supported by the icpc compiler to find the MKL header files. The -M option tells the compiler to give a list of all the included header files. If any of these header files is altered, the object file lusolve.o must be rebuilt, although lusolve.cpp is unchanged.
Lines 11 and 13 are responsible for tracking the dependence of lusolve.o and test_lusolve.o, respectively, on header files. To understand how these lines work, we should go back to line 4 of makevars.mk, which included the -MMD and -MP options in CFLAGS. When icpc is invoked with -MMD -MP, it creates a .d file along with the object file. The .d file contains dependencies on the header files. For example, the generated lusolve.d contains the following lines:
lusolve.o: lusolve.cpp ../utils/utils.hh lusolve.hh \
 /opt/caen/intel-12.1/mkl/include/mkl.h \
 [... mkl header files omitted ...]     \
 /opt/apps/intel-12.1/mkl/include/mkl_vsl_types.h
​
../utils/utils.hh:
	
lusolve.hh:
	
The first rule here, with lusolve.o as target, has generated the dependencies on all the MKL header files as well as lusolve.hh and ../utils/utils.hh. The dependencies on system header files such as iostream are omitted if the flag is -MMD but included with -MD. [32]  [32] I thank Zhongming Qu for showing me how to use the -MMD and -MP options to handle dependencies on header files.
Line 11 of the Makefile reads
-include lusolve.d
The directive here is -include and not include. When the directive is include (as on line 1), GNU make signals an error if the file to be included is not found (or cannot be built using make rules). In contrast, -include moves forward silently if the file to be included is not found. If lusolve.o has been built, make will definitely find lusolve.d during its first pass because the compiler outputs the .d file along with the .o file. If the .d file is missing, the .o file must also be missing, and there is no need to track dependencies on the header files.
There can be a subtle problem with generating the .d files that list dependencies on header files during every compilation. Suppose lusolve.o is built correctly by the compiler, which also outputs a .d file as above, and then the implementation is changed. Suppose the header file ../utils/utils.hh is deleted during the new implementation. If we try to rebuild lusolve.o, GNU make looks at the old .d file and tries to resolve a dependency of lusolve.o on ../utils/utils.hh. There will be an error as the header file has since been removed. To handle this problem, the compiler generated .d files include make rules such as
../utils/utils.hh: 
If the header file is not found, make treats it as a target that is successfully remade by doing nothing, so no error is signaled.
Although the -M and related options supported by icpc and gcc/g++ make it relatively painless to handle dependencies of C/C++ sources on header files, it must be said that the use of .d files is far from a perfect solution. The .d files replicate information that is already present in the C/C++ sources and header files. Such replication of information or logic is usually not a good idea. Every time information or logic is replicated, it creates room for inconsistencies.
The third and final discussion reiterates points that have already arisen in a concrete way.
The include directive on line 1 splices in all of makevars.mk at this point. This Makefile resides in linking/linalg, while makevars.mk is at the root of the source tree (see figure 2.2↑). Therefore, line 1 refers to the file to be included as ../../makevars.mk. Once this file is included, all the variables and pattern rules defined in it become available.
We want the compilation command to look for MKL header files at the right places. The CFLAGS variable is modified to do that on line 2. It will be evaluated during the first pass and used when the recipe for generating %.o targets that depend on %.cpp sources is invoked during the second pass. That recipe of course is in makevars.mk, which is above the definition of CFLAGS. Similarly, line 3 sets the value of LIBS so that the MKL libraries are linked when the %.exe target that depends on %.o object file is built using the recipe in makevars.mk.
The dependencies on lines 10 and 12 may be omitted. GNU make will use the pattern rule for %.o targets depending on %.cpp sources to generate those rules automatically---if the corresponding source files lusolve.cpp and test_lusolve.cpp are in the directory.
Makefiles capture the dependence of object files on source files and headers perfectly. Such a perfect capture is possible because typically an object file depends only on the source file (of the same name but with a different extension) and the header files included from inside the source file. In contrast, no attempt is made to capture the dependencies between object files. Each object file is a collection of functions. Because functions may call one another, the dependency graph between the object files is typically more complicated. Circular dependencies between object files are common. Resolving these dependencies is left to the linker.
If suitable pattern rules have been defined, the dependencies of object files on sources is deduced automatically by make. In contrast, we almost always need to indicate which object files are used to build an executable explicitly. We usually include rules such as lines 10 and 12, which indicate the dependence of object files on sources, to make the structure of the program more explicit within the Makefiles.
Makefiles bring order into the translation of sources to object files and the building of executables from the object files. They reflect the hierarchy of the source tree. Makefiles can get quite complicated when executables must be built on multiple platforms. If the source tree of figure 2.2↑ must be built for Linux machines using GNU, Intel, or PGI compilers, or if they must be built for both Linux and Windows, the Makefiles get much more complicated. In such situations, it is common to write configure scripts that generate the Makefiles. Configure scripts are typically written as shell scripts, although Python is equally effective and far more pleasant to use.

2.3.7 Beyond recursive make

Modularity and simplicity are two virtues of recursive make. However, there are several problems with it. [33]  [33] See Peter Miller (Recursive Make considered harmful, 1997, http://aegis.sourceforge.net/auug97.pdf). It can be slow for large projects because a new shell process is created every time make is called recursively. It does not gel too well with parallel make using the -j option. It leads to needless compilation because all object files and libraries that may be externally needed are built during recursive make and not just those that are actually needed. It must be said that these deficiencies are not fatal. Recursive make is still used.
There appears to be a fundamental tension between the two-pass structure of the make utility and recursive invocation of make. The first pass is supposed to build a dependency graph, for example, of object files on sources and executables on object files. The second pass is supposed to invoke recipes to update targets that are out of date with respect to their prerequisites. The recursive invocation of make happens during the second pass. The dependency graph of object files and sources in the external module is built only when its Makefile is invoked recursively. The result is to splinter the dependency graph, leading to multiple first and second passes.
It is possible to avoid recursive make entirely by building a single dependency graph for all the sources, object files, and executables in the project. [34]  [34] One way to avoid recursive make was suggested by Peter Miller (Recursive Make considered harmful, 1997, http://aegis.sourceforge.net/auug97.pdf) and developed further by Emile van Bergen (http://evbergen.home.xs4all.nl/nonrecursive-make.html). This method relies on stack manipulation using make variables. Another method developed by Zhongming Qu is briefly described in the text. The heart of Qu’s idea is to use pattern rules specific to each subdirectory or submodule. One way to do this is to include a rules.mk file in each subdirectory or submodule. The rules.mk file contains pattern rules specific to the subdirectory as well as dependencies of object files on sources present in the subdirectory.
In each rules.mk, it is assumed that a variable R, which expands to the full absolute path of the root of the project, is defined. A rules.mk file in linking/linalg may look as follows:
1saved := $(D)
2D := $(R)/linking/linalg #R defined externally
3$(D)CFLAGS := $(CFLAGS) -mkl
4
5$(D)/%.o: $(D)/%.cpp
6	$(CPP) $($(@D)CFLAGS) -o $@ -c $<
7
8$(D)/lusolve.o: $(D)/lusolve.cpp
9$(D)/test_lusolve.o: $(D)/test_lusolve.cpp
10D := $(saved)
This rules.mk file defines the variable D (line 2) to be the absolute path to the directory that contains itself as well as the source files that it manages. The variable R is assumed to point to the root of the source tree.
On line 3, we define the variable $(D)CFLAGS. Here $(D) evaluates to the absolute path of the directory, and the variable name is in fact that absolute path name with CFLAGS appended to it. This variable holds compilation flags specific to the directory. In this example, the only change is to append the -mkl flag.
On lines 5 and 6, the pattern rule for generating a .o object file from a .cpp source is made specific to the directory. In GNU make, $(@D) is an automatic variable that evaluates to the directory part of the target's file name; because the targets here are given as absolute paths, it is the absolute path of the directory containing the target. Therefore, the syntax
$($(@D)CFLAGS)
which occurs on line 6, evaluates the variable defined on line 3. GNU make always chooses the most specific pattern rule. On lines 5 and 6, the pattern rule is made specific to the object files to be generated in the current module.
The use of $(D) to evaluate the make variable D on lines 8 and 9 implies that the object and source files are given in full as absolute paths. Giving object and source file names in full as absolute paths is essential to the technique being described.
A rules.mk file for the module utils/ written along the same lines looks as follows:
saved := $(D)
D := $(R)/utils
$(D)CFLAGS := $(CFLAGS) -fPIC
​
$(D)/%.o: $(D)/%.cpp
        $(CPP) $($(@D)CFLAGS)  -o $@ -c $<
​
$(D)/utils.o: $(D)/utils.cpp
$(D)/Table.o: $(D)/Table.cpp
​
D := $(saved)
A Makefile in linking/linalg that builds the executable to test the LU solver looks as follows.
1include ../../root.mk #define R, CPP, CFLAGS
2D := $(R)/linking/linalg
3LIBS := -mkl=sequential
4
5%.exe:
6	$(CPP) -o $@ $(filter %.o,$^) $(LIBS)
7
8include rules.mk
9include $(R)/utils/rules.mk
10
11test_lusolve.exe: $(D)/test_lusolve.o $(D)/lusolve.o \
12	$(R)/utils/utils.o $(R)/utils/Table.o
The root.mk file included on line 1 defines the variables R, CPP, and CFLAGS. It is similar to the makevars.mk file described earlier.
The rules.mk file included on line 8 brings in the pattern rules and dependencies that govern object files and sources in the present linking/linalg module. The pattern rules for the utils/ module are brought in using an include directive on line 9. The latter include directive uses an absolute path name.
In the dependency rule with target test_lusolve.exe on lines 11 and 12, all object file names are given in full as absolute path names. The object files in the local module are prefixed with $(D)/. In contrast, the object files that are external to the module are given relative to the root with prefix $(R)/. By using absolute path names and pattern rules specific to each module, we avoid recursive make entirely and instead build a single dependency graph.
One disadvantage of the technique just described is that when make runs and echoes the commands that it passes to the shell, the messages that appear have absolute path names that are almost unreadable. This deficiency can be remedied by resorting to complicated make syntax.

2.3.8 Building your own library

Our discussion of GNU make is uncommonly detailed and for a reason. Much of the time spent on C/C++ syntax is wasted without a fairly good knowledge of make. There is no modular programming in C/C++ without the make utility or an equivalent build system. The programmer is limited to single source files or awkward collections of source files in a single directory.
We end our discussion of make by showing how to build and link static and shared libraries. Libraries provide a level of modularity beyond what is possible within a source tree. Any program that is linked against a library in effect treats the external library as a module.
The utils/ subdirectory in the source tree (see figure 2.2↑) has utils.cpp, which provides basic facilities such as verify_dir(). The linking/aitken/ subdirectory implements the Aitken iteration in the C source aitken.c. The source fft_mkl.cpp in linking/fft/ provides an interface to part of MKL’s Fast Fourier Transform (FFT) facilities. The FFT is the topic of the next section. The Makefile below is in linking/lib/. It shows how to combine utils.o, aitken.o, and fft_mkl.o and build a shared or static library.
1include ../../makevars.mk
2CFLAGS := $(CFLAGS) $(MKLINC)
3######
4MODS := ../../utils ../aitken ../fft
5.PHONY: $(MODS)
6$(MODS):
7	@echo
8	make --directory=$@ objl
9	@echo
10######
11test_lib.o: test_lib.cpp
12-include test_lib.d
13######
14libxmath.so: $(MODS)
15	icpc -shared -o $@ ../../utils/utils.o	\
16			   ../aitken/aitken.o 	\
17			   ../fft/fft_mkl.o	
18libxmath.a: $(MODS)
19	ar rcs $@ ../../utils/utils.o		\
20	          ../aitken/aitken.o 		\
21		  ../fft/fft_mkl.o
22######
23#link against shared lib
24#to link against static, rm .so file
25test_lib.exe: test_lib.o
26	icpc -o $@ $^ $(MKLLIBS) -L$(PWD) -lxmath 
Line 1 includes makevars.mk, so that the pattern rules we have discussed become effective in this Makefile. The three modules being combined are listed on line 4. Each module is a phony target (line 5). The phony target triggers recursive make in the appropriate subdirectory with the target objl (line 8); the @echo commands on lines 7 and 9 print empty lines to make the recursive invocations of make more visible as make runs. It is assumed that the Makefile in each directory will build the requisite object files when invoked on the phony target objl.
The rule for building the shared library is on lines 14 through 17. The target libxmath.so is the name of the shared library. Its prerequisites are the phony names for the three modules. The recipe here is almost the same as the recipe for linking. The only difference is the -shared option (line 15). The three object files referenced in the recipe are built using recursive make. The object files must be compiled with the -fPIC option. We explicitly showed the -fPIC option for aitken.o earlier in this section. The -fPIC options for the other two compilations are given through the Makefiles in the respective directories.
The rule for building the static library libxmath.a is on lines 18 through 21. The recipe uses the archive command ar with the options rcs.
The source test_lib.cpp has a simple program to test the functions in the xmath library. The target for building the corresponding executable is on line 25. This recipe overrides the recipe in the pattern rule for %.exe targets with %.o prerequisites. The recipe (line 26) links the MKL libraries as it should. The icpc linker is told to look for libraries in the current directory (in addition to standard places) using the -L$(PWD) option (line 26). This is needed because both the static and shared versions of xmath are built in the current directory. The -lxmath option to the icpc linker (line 26) makes the linker look for libxmath.so or libxmath.a. Shared libraries are linked preferentially.
Static libraries are conceptually simpler than shared libraries. A static library is simply an archive of object files. However, linking a static library is not the same as listing all the object files that are archived in it. If we list the object files explicitly, the executable will certainly include all the object files. In contrast, when a static library is linked, only those object files archived in the library that resolve undefined names in object files and libraries listed before it, as well as those undefined names that arise when object files archived in the same library are linked, are included. Thus, the order in which libraries are linked can be quite important. [35]  [35] A bug related to the order in which libraries are linked persisted in the source code for this book for nearly 5 years. A program to compare the speed of FFTW and MKL was linked against both libraries, with MKL first and FFTW later. MKL implements many of the FFTW functions, and in fact, what we thought was FFTW was really again MKL. So the program was comparing MKL against MKL’s implementation of the FFTW interface.
When we invoke
test_lib.exe
at the command line, the executable initially does not have the definitions of the functions defined in xmath---if the linking is against the shared version of the library. It looks for the shared library at runtime. The search for shared libraries at runtime goes through a number of directories, but the directory containing libxmath.so is not one of them. We must add that directory to the shell variable LD_LIBRARY_PATH explicitly.
One advantage of shared libraries is that the system needs to load only one copy of the library if many processes are linked against the same shared library. Another advantage is that programs benefit automatically from updates and bug-fixes to shared libraries (at least in theory). The memory map and page tables of a process change when a shared library is loaded. Shared libraries must be supported by the operating system kernel.
The list of shared libraries available on a system may be obtained using the ldconfig -v command. The list can be long, and the only purpose it serves may be to overwhelm.
The GNU/Linux ldd command can be used to find which shared libraries are being linked with the executable. There can be several versions of MKL or FFTW on a system, for example, making it uncertain which version of a library has been linked. If we say ldd test_lib.exe at the command prompt in GNU/Linux, we get the following sort of information:
libmkl_sequential.so => /opt/intel/.../mkl/lib/...
libxmath.so => not found
Of the list of 10 libraries, we have shown only two. For the MKL library, the command outputs the full path to the shared library that will be linked dynamically at run time. For our libxmath.so, it says “not found” because the shell variable LD_LIBRARY_PATH has not been set to include the current directory. If this program is run, there will be an error at runtime because the xmath library cannot be found.
Another useful command is ld with the -verbose option. With that option, we may find out how the system looks for libraries. For example, if we say ld -verbose -lfftw3, we get a sense of what happens when the -lfftw3 option is used to link against FFTW3.
...
attempt to open //usr/local/lib64/libfftw3.so failed
attempt to open //usr/local/lib64/libfftw3.a failed
...
attempt to open //lib64/libfftw3.so failed
attempt to open //lib64/libfftw3.a failed
attempt to open //usr/lib/x86_64-linux-gnu/libfftw3.so succeeded
-lfftw3 (//usr/lib/x86_64-linux-gnu/libfftw3.so)
libm.so.6 needed by ...
//usr/lib/x86_64-linux-gnu/libfftw3.so
found libm.so.6 at //lib/x86_64-linux-gnu/libm.so.6
...
We see the order in which the loader goes through a number of directories looking for FFTW3. In every directory, it first looks for the shared and then the static version of FFTW3. When FFTW3 is found, it begins to look for shared libraries needed by FFTW3, and so on.
The loader looks for shared libraries in the following order. [36]  [36] For a far more detailed discussion, see [19] and [46]. First, it looks at directories that may be explicitly embedded in the executable file using options such as -rpath. Second, it looks at directories in LD_LIBRARY_PATH. Third, it looks in the cache file /etc/ld.so.conf. The entries of this library cache file may be manipulated using the ldconfig command. Fourth, it looks at /usr/lib.
Understanding the manner in which programs using shared libraries are loaded and set up requires knowledge of the paging system, reviewed later in section 4.4↓, and the concept of system calls, which are functions defined by the operating system kernel and which may be invoked by user programs (see section 5.4.1↓). To complete our discussion of shared libraries, we anticipate later discussion of those topics and explain how shared libraries are set up.
There are three parts to understanding how shared libraries work at the level of machine instructions. The first of these is the manner in which shared libraries are loaded into memory. Suppose a program abc.exe, which calls functions in shared libraries that in turn may call functions in other shared libraries, is invoked from the command line or in some other way. The Linux system call execve() is invoked with the file abc.exe as one of its arguments. Linux creates a process descriptor that holds administrative information about the program or process for its own use. It maps the contents of abc.exe to the virtual address space of the process with the intention of passing control to the process. However, before passing control it notices that the executable relies on shared libraries. Consequently, it loads the dynamic linker ld-linux.so or ld.so into the same virtual address space and passes control to the dynamic linker. The dynamic linker looks for the library in the file system in the order given above. Once the library is found, it issues the mmap() system call to load the library into the virtual address space of the process. If the library is already in physical memory, perhaps because it was loaded by some other process, mmap() will only update the page tables and not load the library again into physical memory.
The handling of global variables defined by shared libraries is the second part to be understood. References to global variables involve additional levels of indirection and go through the Global Offset Table (GOT). The GOT is set up by the dynamic linker.
The final part to be understood is the handling of calls of functions defined in shared libraries. Here, too, additional levels of indirection are employed, and all function calls are routed through the Procedure Linkage Table (PLT). Unlike the GOT, PLT entries are not fully resolved by the dynamic linker to begin with. When a shared library function is called by the program, the program calls a PLT function. The first time a PLT function is called, the function sends the call to ld.so, which is still sitting in virtual memory, and it is only then that the reference to that shared library function is fully resolved. All later calls of the shared library function jump to the PLT entry and then directly to the library function.
figure FIGS/chapter1/make_exercise.png
Figure 2.4 Dependencies between make targets.
Exercise:  The dependencies between targets a through g are shown in figure 2.4↑. Assume that the recipe to update the targets is as given below:
a b c d e f g:
	echo $@
Exercise:  Suppose you have the following Makefile:
X := hello
Y := world
mesg:
	echo $(X) $(Y)
Y := universe
If you say make mesg, will it print “hello world” or “hello universe”?
Exercise:  Add a pattern rule to makevars.mk to generate .o files from .c files. Your pattern rule should be such that the entire Makefile to build leibniz.exe and logseries.exe in the directory linking-aitken can be reduced to the following three lines:
include ../makevars.mk
leibniz.exe: leibniz.o aitken.o
logseries.exe: logseries.o aitken.o
Exercise:  The icpc compiler/linker provides the options -mkl=sequential and -mkl=threaded to fetch header files and link against the MKL library. Modify makevars.mk to use these options.
Exercise:  Consider the Makefile to build test_lusolve.exe discussed in section 2.3.6↑. Suppose the phony target ../utils is removed from the list of prerequisites of the target test_lusolve.exe. The object file utils.o will still be built correctly, but it will end up in the directory linking-linalg/ instead of utils/, and the linking will fail. Explain why.
Exercise:  Record the size of the executable test_lib.exe when both libxmath and MKL are linked dynamically, when libxmath is linked statically, and when the -static option is used to link both libxmath and MKL statically. Explain why the size of the executable increases only slightly from the first linking to the second and by a lot more from the second linking to the third.
Exercise:  Use the ldd command to find all the shared libraries that
test_lusolve.exe
is linked against. Use the nm command (also part of GNU’s binutils) to find the shared library in which dgetrf() is defined. Verify that dgetrf, dgetrf_, DGETRF, and DGETRF_ are all bound to the same address in the text area, so that they are really different names for the same function.

2.4 The Fast Fourier Transform

So far in this book, modular programming in C/C++ has been the focus. Organization of sources into a source tree, Makefiles, and libraries is the basis of modular programming. The make utility, or an equivalent build system, is indispensable to modular programming in C/C++.
Among programming languages, the C/C++ framework is the best---and often by far---for writing fast programs. C/C++ programs can be several hundred or even several thousand times faster than programs written in interpreted languages. In this last section, we look at the speed of a few implementations of the Fast Fourier Transform (FFT). Program speed is a major theme of the rest of this book.
Program speed is influenced by many factors, including programming skill, compilers, the processor hardware, and the memory system. Each layer of software and hardware is heavily engineered. Program speed is a discontinuous function of the design parameters: a small change in a program, or in the environment in which it runs, can result in unpredictable changes in speed. In this section, we look at a few FFT implementations to gain an understanding of some of the factors that influence program speed. What does it mean to say that an implementation is optimal? To what extent does programming skill affect program speed? These are some of the questions we ask. Already we find that programming skill has a great influence on program speed. Later chapters set forth many of the concepts that must be understood to produce efficient implementations of scientific programs.
A program’s speed depends on the hardware configuration. In this chapter, we stick to a single processor core. In later chapters, we will see that even if a program is multithreaded or networked, understanding what happens on a single core is a big part of the game.
Nearly all computers in use for scientific programming are based on the x86 architecture. Thus, this book too adopts the x86 architecture. The x86 architecture evolves constantly. There are only a few particulars of the x86 architecture that will be of concern to us. These are discussed in the next chapter.
For the most part, all that concerns us is the level of the instruction set, in particular, whether the instruction set is SSE2, AVX, AVX2, or AVX-512 (see table 3.1↓ of the next chapter). Thus, the machines we use will be designated as 2.6 GHz SSE2 or 2.2 GHz AVX. The full names of the machines may be looked up in table 9.1↓ of the appendix. The SSE2 machines support 128-bit XMM registers, and the AVX/AVX2 machines support 256-bit YMM registers.
The clock signal that is fed into each processor core is the heartbeat of the computer. The activities of the processor are synchronized with the clock signal. The memory system and other parts of the computer must accommodate themselves to the processor. If we measure program speed in cycles, we get a better sense of how well the program is exploiting the hardware.
In this book, program speeds are reported using cycles. We use measures such as flops (floating point operations) per cycle for program speed and bytes per cycle for memory bandwidth. For some programs, we report the number of cycles consumed directly. Measuring program speed in terms of cycles is somewhat unconventional. It is more typical to see GFlops (billions of floating point operations per second) for arithmetic speed and GB/s (gigabytes per second) for memory bandwidth. The second is a standard unit of time, and its use is most appropriate when different hardware configurations are being compared. Our concern, which is to write efficient programs on a given hardware configuration, is quite different. Although we report timing measurements in cycles, they can be easily converted to seconds using the frequency in GHz of the processor clock.
The FFT is one of the most widely used algorithms in scientific computing and is fundamental to many areas of science. It is an appropriate starting point for the discussion of the speed of scientific programs. In section 2.4.1↓, we introduce the FFT algorithm in outline. The purpose of this outline is to help understand the speed of FFT implementations. The two major FFT implementations in the MKL and FFTW libraries are introduced in sections 2.4.2↓ and 2.4.3↓, respectively.
Programs run in a complex environment, and the complexity of the environment influences program speed in ways we cannot fully grasp. The manner in which a program is timed can make a big difference. In addition, a program may behave quite differently when it is used as part of a larger program than when it is timed by itself. The purpose of section 2.4.4↓ is to give a sense of how the complexity of the environment influences program speed.
Section 2.4.5↓ compares MKL, FFTW, and a third FFT. The last of these is the type of program a good C/C++ programmer without knowledge of computer architecture may write. The optimized MKL and FFTW libraries can be even 10 times faster, illustrating the value both of optimized libraries and of programming with a knowledge of computer architecture. [37]  [37] Our discussion assumes that the FFTW library has been built correctly using --enable-avx (for machines with YMM registers) or --enable-sse (for machines with XMM registers). Without a proper build, FFTW can be too slow by a factor of 2 or even 4.

2.4.1 The FFT algorithm in outline

The discrete Fourier transform of $f_0, \ldots, f_{N-1}$ is defined as
(2.1) $\hat{f}_k = \frac{1}{N}\sum_{j=0}^{N-1}\omega^{-jk}f_j$,  for $k = 0, \ldots, N-1$.
Here $\omega = \exp\bigl(2\pi\sqrt{-1}/N\bigr)$ is a primitive $N$th root of unity. The $\hat{f}_k$ are linear functions of the $f_j$. The discrete Fourier transform from $f = (f_0, \ldots, f_{N-1})$ to $\hat{f} = (\hat{f}_0, \ldots, \hat{f}_{N-1})$ can be written as $\hat{f} = Mf$, where $M$ is the $N \times N$ matrix whose $(j, k)$th entry is $\omega^{-jk}/N$.
The inverse discrete Fourier transform is given by
(2.2) $f_j = \sum_{k=0}^{N-1}\omega^{jk}\hat{f}_k$.
It too can be thought of as a matrix-vector product.
The discrete Fourier transform and its inverse have an intimate connection to Fourier series. Suppose f(x) is a function with period 2π. Its Fourier coefficients are defined by
$c_n = \frac{1}{2\pi}\int_0^{2\pi} f(x)\exp\bigl(-\sqrt{-1}\,nx\bigr)\,dx$  for $n = 0, \pm 1, \pm 2, \ldots$
If $f_j = f(2\pi j/N)$, then $\hat{f}_k \approx c_k$ for $0 \le k < N/2$ and $\hat{f}_k \approx c_{k-N}$ for $N/2 < k \le N-1$. If $N$ is even, $\hat{f}_{N/2} \approx (c_{N/2} + c_{-N/2})/2$. Here $f(x)$ is assumed to be a function that is integrable and sufficiently smooth.
The FFT is a method to effect the discrete transform (2.1)↑ or its inverse (2.2)↑ using O(N log₂N) arithmetic operations, which is a considerable improvement over the O(N²) arithmetic operations required by direct matrix-vector multiplication. [38]  [38] The modern discovery of the FFT is due to [20]. The improvement in operation count is vital to making the FFT fast but is not the full story. The implementation can make a difference of more than a factor of 10 to the program speed. We present the structure of the power of 2 FFT but omit mathematical details.
Suppose N = 2ⁿ. We assume the data to be N complex numbers. The first step in the power of 2 FFT is to separate the data into even and odd parts as follows:
f_0, f_2, …, f_{N−2}  and  f_1, f_3, …, f_{N−1}.
An N ⁄ 2 FFT is applied separately to the even and odd parts. The odd part is multiplied by the twiddle factors 1, ω, …, ωN ⁄ 2 − 1. The FFT of size N is generated by adding and subtracting corresponding points in the even and odd parts.
Because N ⁄ 2 is also a power of 2, the FFTs of size N ⁄ 2 are effected using the same strategy. Thus, the even part and odd part are once again separated into even and odd parts to obtain four lists of numbers. Repeated separation into even and odd parts leads to the bit-reversed permutation, which is perhaps the most important element in an efficient implementation of the FFT.
The N data items are indexed using binary numbers with n bits. The bit-reversed permutation is obtained by reversing indices in binary and reordering the data using the reversed binary numbers. The last bit moves to the first position, and it is immediately evident that the bit reversal separates the data into even and odd parts. If N = 8, the bit-reversed permutation is given by
f_0, f_4, f_2, f_6, f_1, f_5, f_3, f_7.
Once the data is bit-reversed, the FFT begins by operating on successive pairs of numbers.
Figure 2.5↓ illustrates the power of 2 FFT assuming the data to be in bit-reversed order. The innermost (lowermost in the figure) step operates on consecutive pairs. The next levels operate on quartets, octets, and so on. At each level, the data items in solid squares are first multiplied by twiddle factors. At each level, the data item in a solid square is paired with another data item in an empty square (see figure). The iteration at a level is complete when each pair of data items is replaced by their sum and difference. Bit reversal improves the locality of memory references.
Figure 2.5↓ assumes that each level is complete before we move to the next level. In fact, there is a lot more freedom in the way the operations are ordered. For example, we may complete operations on the first half of the bit reversed data before operating on any of the pairs, quartets, and so on in the second half. Judicious orderings of the operations result in better usage of the memory system and better performance.
figure FIGS/chapter1/fft.png
Figure 2.5 Structure of the power of 2 FFT assuming bit reversal. At each level, the solid positions are multiplied by twiddle factors and either added to or subtracted from the corresponding unfilled positions.
The power of 2 FFT moves through log₂N levels. The arithmetic in each level consists of N/2 twiddle factor multiplications, N/2 additions, and N/2 subtractions. Additions and subtractions have the same cost and may both be counted as additions. A complex addition is equal to two double-precision floating point additions, and a complex multiplication is equal to four floating point multiplications and two additions. Therefore, a power of 2 FFT of size N costs 3N log₂N floating point additions and 2N log₂N floating point multiplications. The total cost is 5N log₂N operations. There are other variants, such as the power of 4 FFT, which have slightly lower operation counts.

2.4.2 FFT using MKL

The header file mkl_dfti.h declares the FFT functions implemented in the MKL library. The following class simplifies application of the FFT and its inverse to complex data:
class fft_mkl{
private:
	int n;
	DFTI_DESCRIPTOR_HANDLE handle;
public:
	fft_mkl(int nin); 
	~fft_mkl();
	void fwd(double *f){
		DftiComputeForward(handle, f);
	}
	void bwd(double *f){
		DftiComputeBackward(handle, f);
	}
};
Unlike the Vector class from before, fft_mkl is a narrowly defined class. It does just one thing, which is to provide an interface to MKL’s FFT for complex one-dimensional data. It is typical of the way we use C++.
The class may be used as shown below to apply the FFT to complex data of size n. It is assumed that v is of type double *, pointing to 2n or more double locations.
for(int i=0; i < n; i++){
	v[2*i] = rand()*1.0/RAND_MAX - 0.5;
	v[2*i+1] = rand()*1.0/RAND_MAX - 0.5;
}
fft_mkl fft(n);
fft.fwd(v);
Here the array v[] is initialized with uniformly distributed random numbers and its FFT is taken. All the details of the MKL library are cleanly hidden away.
The member functions fft_mkl::fwd() and fft_mkl::bwd() are defined completely within the class definition. These member functions correspond to (2.1↑) and (2.2↑), respectively.
The two private data members of the class fft_mkl are n, for saving the dimension of the transform, and handle. The handle is initialized by the constructor and used for effecting the transform. The constructor is defined as follows:
1  fft_mkl::fft_mkl(int nin)
2  	:n(nin)
3  {
4  	DftiCreateDescriptor(&handle,
5  			DFTI_DOUBLE, DFTI_COMPLEX, 1, n);
6  	DftiSetValue(handle, DFTI_FORWARD_SCALE, 1.0/n);
7  	DftiSetValue(handle, DFTI_PLACEMENT, DFTI_INPLACE);
8  	DftiCommitDescriptor(handle);
9  }
The colon initialization on line 2 shows C++ syntax we have not encountered so far. Its effect is to call the constructor for the member n with nin as the argument. It is essential for initializing members that are themselves class objects. But n is an int, which is a basic type, so here the effect is the same as saying n=nin just after line 3.
The MKL call on lines 4 and 5 sets up handle. The function call tells MKL that the data is complex double precision, one-dimensional, and of size n. The handle is a pointer to a data structure called a descriptor. Within that data structure, MKL can store a variety of information. For example, it can precompute and store twiddle factors.
The forward transform computed by default is unnormalized. The division by N in (2.1↑) is omitted. Line 6 tells MKL to normalize the forward transform. Line 7 tells MKL to compute in-place transforms. In-place transforms modify the array in place. The forward transform is effected by the MKL call
DftiComputeForward(handle, f);
in the class fft_mkl because the transform is in place. If the transform were out-of-place, we would need a call such as
DftiComputeForward(handle, f, fout);
In this call, the data in f will be unchanged, and the result of the transform will be left in fout. The FFT is naturally implemented in place. The out-of-place FFT is slower because it touches more data.
The descriptor that handle points to becomes usable only after the commit on line 8. It is here that MKL actually computes and saves the twiddle factors, and so on. The order in which the FFT operations are applied may also be determined and saved here.
The class destructor is defined as follows:
fft_mkl::~fft_mkl(){
	DftiFreeDescriptor(&handle);
}
The definition of the fft_mkl class is now complete.
The C language has a facility for defining functions with a variable number of arguments. An example of such a function is printf(). The first argument to printf() is a format string, and the types and number of the subsequent arguments depend on the format string. Like printf, the MKL function DftiComputeForward() uses the varargs facility to handle both in-place and out-of-place transforms.

2.4.3 FFT using FFTW

The FFTW library [39]  [39] See www.fftw.org as well as [22] and [26]. has a much cleaner interface than MKL, although that may not be clear from the one-dimensional complex-to-complex case we deal with. The header file is fftw3.h. The following tightly defined class offers a means to use FFTW functions for the FFT:
class fft_fftw{
private:
	int n;
	fftw_plan pf;
	fftw_plan pb;
public:
	fft_fftw(int nin); 
	~fft_fftw();
	void fwd(double *f){
		assrt((long)f%16 == 0);
		fftw_execute_dft(pf, (fftw_complex *)f, 
		                     (fftw_complex *)f);
		for(int i=0; i < 2*n; i++)
			f[i] /= n;
	}
	void bwd(double *f){
		assrt((long)f%16 == 0);
		fftw_execute_dft(pb, (fftw_complex *)f, 
		                     (fftw_complex *)f);
	}
};
The sole task of the fft_fftw class is to offer an easy interface to FFTW transforms for one-dimensional complex data. FFTW stores fftw_plans instead of a handle as with MKL. There are different plans for forward and backward transforms. The fwd() and bwd() member functions are implemented within the class definition.
Both member functions use assrt() (defined in utils/utils.hh) to verify that the pointer f is 16-byte aligned. A pointer is 16-byte aligned if its address is divisible by 16 or, equivalently, if its last 4 bits are zero. Because of the way the FFTW plans are set up, the pointers must be 16-byte aligned for correctness. FFTW recognizes that the transforms are in place because the same pointer f is used as both the input and output argument to fftw_execute_dft().
FFTW does not offer a facility for normalizing the forward transform. Therefore, the member function fwd() normalizes explicitly using a for-loop. As we will see, this seemingly innocuous bit of code nearly halves the program speed.
The constructor and destructor for fft_fftw are defined below.
fft_fftw::fft_fftw(int nin)
	:n(nin)
{
	double *f = (double *)
			_mm_malloc(2*n*sizeof(double), 16);
	fftw_complex *ff  = (fftw_complex *)f;
	pf = fftw_plan_dft_1d(n, ff, ff, -1, 
							FFTW_MEASURE);
	pb = fftw_plan_dft_1d(n, ff, ff,  1, 
							FFTW_MEASURE);
	_mm_free(f);
}
​
fft_fftw::~fft_fftw(){
	fftw_destroy_plan(pf);
	fftw_destroy_plan(pb);
}
The constructor aligns the pointer f with 16-byte boundaries by using _mm_malloc() instead of malloc(). The forward plan pf uses the flag -1, while the backward plan uses the flag 1. These refer to the signs of the exponents in (2.1↑) and (2.2↑), respectively. Both the forward and backward plans use the FFTW_MEASURE flag. The FFTW library uses the planning stage to measure different implementations of the FFT in an attempt to pick a good one.

2.4.4 Cycles and histograms

How many cycles does a one-dimensional complex FFT of dimension 2¹⁰ = 1024 take? Program performance is influenced by so many factors that the question is too simple to be answered. First, we have to say how the measurement is taken and which implementation of the FFT is used. Here we assume the implementation to be from the MKL library.
The issue of measurement is more complicated. Suppose a single measurement is made. The cycle count is likely to be atypically large. Suppose instead a great number of measurements are made, but with the FFT applied to the same data every time. This time the average or median cycle count is likely to be an underestimate. A great part of the expense of the FFT is in reading data from memory. If the same data is repeatedly transformed, the data locations will be cached near the processor core in cache memory. Caching reduces the expense of reading data.
We measure FFTs of dimension N = 2¹⁰ in a way that mimics what we consider to be a realistic scenario. We line up 10⁶ problem instances in one long array of 2 × 2¹⁰ × 10⁶ double-precision numbers (the factor 2 at the front accounts for complex data). This array is 16 GB. We successively apply the inverse FFT to each problem instance and record 10⁶ cycle counts. The median (or average) cycle count obtained is likely to be a fair estimate of the cost of an FFT of dimension 1024 in a large computation.
figure FIGS/chapter1/fft_trial2_view1-gray.png
(a) 
figure FIGS/chapter1/fft_trial1_view1-gray.png
(b) 
figure FIGS/chapter1/fft_trial1_view2-gray.png
(c) 
figure FIGS/chapter1/fft_trial1_view3-gray.png
(d) 
Figure 2.6 Histograms of cycles of 10⁶ trials of 1D FFT of size 1024 on 2.66 GHz SSE2 machines.
The statistics of the 10⁶ measurements do not follow the normal law or any such well-known probability distribution. The histograms in figure 2.6↑ show the peculiar nature of cycle statistics. Parts (a) and (b) of the figure show histograms of measurements taken on two distinct computers (2.66 GHz SSE2 machines) [40]  [40] For more information about the machine, see table 9.1↓. of identical specifications. Both measurements have medians of around 20,000 cycles. However, the histograms are noticeably different. On both computers, the histograms have a big bump near the median and a much tinier bump to the right of the median. The tinier bump is beyond 25,000 cycles and barely visible in figure 2.6↑(b). The tinier bump is much closer to the median in (a).
Part (c) of figure 2.6↑ enlarges the histogram of part (b) near its median. This histogram looks more like a smooth distribution, but it is quite unlike the normal law. There is a marked protrusion to the right of the median. Part (d) enlarges the histogram of part (b) near the tiny bump.
Why is the statistical distribution of the cycle counts so peculiar? The multi-bump nature of the histograms is a consequence of the multiple factors at play. The FFT fetches data from memory and subjects that data to a number of arithmetic operations. Fetching data from memory on to the processor is the job of memory controllers. Because the same data item is accessed multiple times during a single transform (see figure 2.5↑), the data locations are often sourced from caches. Moving data between the caches and the processor is handled by cache controllers. The execution units inside the processor apply arithmetic operations to data in the registers. The cycle count is influenced by the design of the memory controllers, cache controllers, and the execution units within the processor. It is reasonable to conjecture that the protrusion to the right of the median in figure 2.6↑(c) is due to some feature of the memory system. There are a large number of discontinuities (similar to if-statements in C/C++ programs) in hardware design, implying that some features are only occasionally excited.
The tiny bump to the right of the median shown in figure 2.6↑(d) appears to be due to the Linux kernel. Even on a computer system where 11 out of 12 processor cores are idling, the program receives timer interrupts once in several milliseconds. The kernel uses timer interrupts to ensure fairness in scheduling, gather statistics for its own use, and other purposes. It is conjectured that the tiny bump is due to timer interrupts or some other activity of the Linux kernel. If the program runs for a long time, the kernel will change the way it issues timer interrupts. To even speak of the distribution function of the number of cycles consumed by an FFT of size n = 1024 may not be correct.

2.4.5 Optimality of FFT implementations

What is the fastest possible speed of an FFT implementation? The many system features that intrude into the histograms hint that this question may not have a straightforward answer. However, a discussion is worthwhile. It helps us understand what makes an FFT implementation efficient. The discussion is based on a 2.2 GHz AVX machine. [41]  [41] For more information about the machine, see table 9.1↓.
The power of 2 FFT performs 5N log₂N double-precision floating point operations. Additions and multiplications are in the ratio 3:2. Thanks to instruction-level parallelism and 256-bit YMM registers, a single AVX processor core can complete four additions and four multiplications every cycle. If we consider arithmetic operations alone, a theoretical lower bound is 0.75N log₂N cycles.
We would be justified in ignoring memory accesses if the number of arithmetic operations for each memory access were large. The total number of bytes that must be accessed is 16N (16 bytes for each complex number). For large N, we indeed have 0.75 log₂N > 16. However, the structure of the FFT does not allow the outer iterations (upper levels in figure 2.5↑) to be cached if N is large. The inner iterations (lower in the figure) operate on small packets of data, such as pairs or quartets. Caching can be effective for the inner iterations. As the size of the data packets in the outer iterations becomes comparable to cache size, caching becomes less and less effective. The FFT is caught between two opposing currents. On the one hand, large N means more arithmetic operations per item of data. On the other hand, the data items accessed in the outer iterations cannot be cached as effectively.
N MKL FFTW Numerical Recipes
32 0.86 1.6 8.12
64 1.14 1.62 6.71
80 1.72 1.58 *
8 × 3 × 7 1.76 1.96 *
192 1.82 1.84 *
1024 1.83 1.51 5.69
1024 × 128 2.05 1.66 14.33
1024 × 1024 1.89 2.96 17.62
Table 2.3 Number of cycles consumed by the FFT divided by N log₂N on a 2.2 GHz AVX machine.
Table 2.3↑ shows, for a variety of N, the cycle counts for MKL, FFTW, and Numerical Recipes. [42]  [42] [27] is a handy and wide-ranging, if brisk, work on numerical methods. Problem instances sufficiently numerous to occupy 8 GB of memory were used for each N. The inverse FFT was successively applied to the problem instances. The reported figures are medians. The Numerical Recipes implementation applies only to N = 2ⁿ.
Numerical Recipes is 5 to 10 times worse than the optimized libraries. The sheer volume of effort needed to produce optimized implementations such as MKL and FFTW will become clear in later chapters. There is no doubt that the scientific programmer must seek out optimized libraries as far as possible. MKL is faster than FFTW for N = 32 and N = 64. Overall, the two libraries are comparable.
None of the operation counts match the theoretical lower bound of 0.75N log₂N cycles, although MKL gets quite close for N = 32. For large N, the deviation from the theoretical bound is greater. We may be tempted to conclude that this is because of the memory references at the outer iterations of the FFT. In fact, there is so much cache on the 2.2 GHz AVX machine that all the data for even the problem instance with N = 1024 × 1024 can comfortably fit in L3 cache. It is unclear whether optimizations to hide the cost of streaming data items from L3 cache are possible or whether they have been attempted.
N MKL FFTW
32 0.81 1.62
64 0.67 1.16
192 0.90 1.04
1024 0.88 0.89
1024 × 128 1.16 1.48
1024 × 1024 1.50 2.51
Table 2.4 Number of cycles consumed by in-cache FFT divided by Nlog2N on a 2.2 GHz AVX machine.
In table 2.4↑, the cycle counts are much closer to the 0.75Nlog2N lower bound. These in-cache cycle counts were obtained by applying the inverse FFT to the same array of 2N double-precision numbers initialized to 0. In fact, MKL goes below the theoretical bound for N = 4^3 = 64.
The power-of-4 FFT is applicable to N = 4^3 and has a lower operation count than 5Nlog2N. [43]  [43] See [43]. MKL may be using an algorithm with an operation count lower than 5Nlog2N. The 2.2 GHz AVX machine (see table 9.1↓) has an in-core frequency that can exceed 2.2 GHz. The in-core acceleration may well be the reason the observed speed of MKL’s N = 64 FFT is greater than the theoretical bound.
The in-cache numbers for MKL are better than for FFTW and sometimes significantly better. The fraction of scientific programs that remain in cache appears to be fairly large, and the in-cache advantage of MKL over FFTW is not insignificant. In the next chapter, we will discuss the type of instruction pipeline optimizations needed to achieve MKL speeds. MKL is certainly optimizing better for the instruction pipeline than FFTW.
N MKL FFTW
32 0.975 4.86
64 1.61 4.04
1024 2.00 3.16
Table 2.5 Number of cycles consumed by forward FFT divided by Nlog2N on a 2.2 GHz AVX machine.
So far, we have only been applying the inverse FFT. The inverse discrete transform (2.2↑) is unnormalized, and there is no need to divide by N at the end. In table 2.5↑, we turn to the forward FFT, which must be normalized by dividing by N, and get a nasty surprise. The performance of FFTW has deteriorated by more than a factor of 4 in one instance and by about a factor of 2 in the other two instances.
How can as simple an operation as dividing by N cause such sharp deterioration? The answer will become clearer in the next chapter. The loop for dividing by N in the member function fwd() of the class fft_fftw looks innocuous:
for(int i=0; i < 2*n; i++)
      f[i] /= n;
The member function fwd() is defined within the class and therefore gets inlined. As we have already indicated in passing, icpc does not seem to optimize loops of inlined functions adequately. [44]  [44] This statement about icpc has been verified for versions 10 through 15. The loop for dividing by N can be optimized by turning the division into multiplication by 1.0 ⁄ N. Multiplications are much faster than divisions. However, icpc does not do this optimization nor does it attempt to exploit loop-level parallelism---facts that can be ascertained by inspecting the assembly code.
The next chapter shows how to inspect assembly code and what to expect from the compiler. Optimizations that may appear obvious to the programmer are sometimes not effected by compilers, as we just saw.
Exercise:  If ω = exp(2π√(−1) ⁄ N), prove that ω^k ≠ 1 and ω^{kN} = 1 for k = 1, 2, …, N − 1. Prove that
∑_{j = 0}^{N − 1} ω^{kj} = 0
for k = 1, 2, …, N − 1.
Exercise:  Use the previous exercise to argue that the matrix with (j, k)th entry equal to ω^{jk} ⁄ √N has orthonormal columns. Conclude that the transforms (2.1↑) and (2.2↑) are inverses of each other.
Exercise:  The inverse discrete Fourier transform is defined by
f_j = ∑_{k = 0}^{N − 1} ω^{jk} f̂_k, 
where ω = exp(2π√(−1) ⁄ N). Suppose N is even and N = 2n. We may write k = 2ℓ + p, with p = 0 or p = 1, and decompose the sum as
f_j = ∑_{ℓ = 0}^{n − 1} ∑_{p = 0}^{1} (ω²)^{jℓ} ω^{jp} f̂_{2ℓ + p} = ∑_{ℓ = 0}^{n − 1} (ω²)^{jℓ} f̂_{2ℓ} + ω^j ∑_{ℓ = 0}^{n − 1} (ω²)^{jℓ} f̂_{2ℓ + 1}.
Notice that ω² = exp(2π√(−1) ⁄ n) and interpret the two summations above as inverse discrete Fourier transforms of dimension n. Explain how to reduce a transform of dimension N = 2n to two transforms of dimension n.
Exercise:  Suppose N = 2^n. The array a[] of dimension N may be indexed using bit sequences of length n with the index ranging from 0 = 00...0 to N − 1 = 11...1. In the bit-reversed permutation, a[j] and a[k] are interchanged if the bit sequence of length n corresponding to k is the reversal of the one corresponding to j.
Exercise:  Let f(x) = |sin(x)|, xj = 2πj ⁄ N, and fj = f(xj). Graph the discrete Fourier transform of fj with N = 10^4. Repeat with f(x) = sin(sin(x)). What do you observe?
Exercise:  Let N = 1024 and initialize a complex array of size N to 0. Apply MKL’s inverse FFT to the same array 10^6 times. Histogram the 10^6 cycle counts (you will need the TimeStamp class described in the next chapter). If your machine has L1 data cache of at least 16 KB, you will observe something closer to the normal distribution than the plots in figure 2.6↑. Why? Fit the normal law and calculate the mean and variance of the fit.
Exercise:  Tables 2.3↑ and 2.4↑ report the number of cycles consumed by an in-place, complex one-dimensional FFT of dimension N, the former with data out of cache and the latter with data in cache. For each value of N in the tables, find the bandwidth to memory realized in bytes/cycle as well as GB/s, under the assumption that all the extra cycles for the out-of-cache FFT are due to data access. Investigate the possibility that the in-cache numbers are artificially low because the FFT operates on an array that is always zero.
Exercise:  Assuming versions 11 or 12 of the icpc compiler, the member function fwd() of the class fft_fftw may be sped up as follows. Remove the for-loop for dividing the array f[] by n. Instead, call a function scale_fwd(double *f, int n). The function first calculates double x = 1.0/n and then multiplies each of the 2n double entries of f[] by x. Compile using -fno-inline-functions. Recalculate table 2.5↑ and show that the forward transform with FFTW is now much faster.

2.5 References


[13] C. van Loan: Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.

[14] C.L. Lawson, R.J. Hanson, D.R. Kincaid, F.T. Krogh: “Basic linear algebra subprograms for Fortran usage”, ACM TOMS, pp. 308-323, 1979.

[15] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen: LAPACK Users' Guide. SIAM, 1999. 1st ed. 1987.

[16] E. Elmroth, F. Gustavson, I. Jonsson, B. Kågström: “Recursive blocked algorithms and hybrid data structures for dense matrix library software”, SIAM Review, pp. 3-45, 2004.

[17] J.J. Dongarra, J. Du Croz, S. Hammarling, I. Duff: “A set of level 3 basic linear algebra subprograms”, ACM TOMS, pp. 1-17, 1990.

[18] J.J. Dongarra, J. Du Croz, S. Hammarling, R.J. Hanson: “An extended set of Fortran basic linear algebra subprograms”, ACM TOMS, pp. 1-17, 1988.

[19] J.R. Levine: Linkers and Loaders. Morgan Kaufmann, 1999.

[20] J.W. Cooley, J.W. Tukey: “An algorithm for the machine calculation of complex Fourier series”, Mathematics of Computation, pp. 297-301, 1965.

[21] J.W. Demmel: Applied Numerical Linear Algebra. SIAM, 1997.

[22] M. Frigo: “A fast Fourier transform compiler”, Proc. 1999 ACM SIGPLAN conference, pp. 169-180, 1999.

[23] M. Kerrisk: The Linux Programming Interface. No Starch Press, 2010.

[24] N.J. Higham: Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

[25] S. Toledo: “Locality of reference in LU decomposition with partial pivoting”, SIAM J. Matrix Anal. Appl, pp. 1065-1081, 1997.

[26] S.G. Johnson, M. Frigo: “The design and implementation of FFTW3”, Proc. of the IEEE special issue on Program Generation, Optimization, and Platform Adaptation, pp. 216-233, 2005.

[27] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery: Numerical Recipes 3rd edition: The Art of Scientific Computing. Cambridge University Press, 2007.

3 The Processor

The 80286 processor of 1982 had a mere 134,000 transistors. The Pentium 4 of 2001 had 42 million transistors. Processor packages in the x86 line for sale in 2015 have several billion transistors packed into less than 0.25 cm². [45]  [45] This data is for the Intel Xeon E5 v3 (Haswell) family. These advances in processor technology are correlated with the rise of computer technology as a whole.
Early x86 computers, such as the 80286 or the Pentium 4, featured a single central processor. Modern processor packages may have a dozen or so processors, and this number is constantly increasing. As a result of the explosion in the number of transistors, the design of each processor has changed considerably. A great deal of parallelism is built into modern processors, and this parallelism is active even when the thread of execution is serial.
In contrast to this rapid progress in hardware, the software environment of the scientific programmer is nearly the same as what it was 15 years ago. In recent years, innovation in software has been driven by the Internet and mobile gadgetry and largely sidesteps the great difficulties in programming modern hardware optimally. If a compiler of 20 years ago was likely to generate excellent code for the processor of its day, that was less so in the case of a compiler of 10 years ago. Even for many simple programs, the compilers of today do not generate optimal code or anything like it.
Good, or nearly optimal, programs for modern platforms can be written only with a knowledge of the nuances of the computer’s hardware. With that in mind, this chapter, indeed the rest of this book, will introduce programming models in close concert with computer architecture.
This chapter is organized into three sections. Section 3.1↓ is an overview of x86 architecture and assembly programming. The x86 line has held sway for more than three decades to the extent that nearly 100% of the computers in use outside of the mobile world in 2015 belong to this line. The x86 architecture is vast, [46]  [46] The x86 instruction set architecture is documented in Intel 64 and IA-32 Architectures Software Developer’s Manual, Volumes 1, 2A, 2B, 3A, 3B, and AMD64 Architecture Programmer’s Manual, Volumes 1, 2, 3, 4, 5. The entire documentation runs to more than 5,000 pages. and our interest is limited to a tiny part of it.
To understand program speed and to determine whether compilers are doing a good job, it is essential to know a little about the register set and about the instruction mix. Our interest in sections 3.1.1↓ and 3.1.2↓ is limited to this part of the x86 architecture and no more. Although we introduce assembly programming, we do so only to the extent necessary to identify assembly code generated by compilers for inner loops and to make sense of function calls. There is no suggestion in those sections that the reader should program in assembly. The need to program in assembly is only for the rare expert. However, inasmuch as our intention is to gain understanding of programming and program speed on a rational basis, it is essential to understand the assembly code generated by compilers for inner loops.
Section 3.2↓ is one of the more important parts of this book. Sections 3.2.2↓ through 3.2.7↓ have a single focus, which is to expose some of the inner workings of the compiler. In these sections, we inspect the assembly code generated by compilers to gain an understanding of the quality of assembly code generated for inner loops. Only a limited understanding of assembly programming is needed. It suffices to identify the type of registers and instructions that are being used in the innermost loop for the most part. In section 3.2.1↓, we explain how to gather statistics and make tables and plots easily in C/C++. Such preliminaries simplify programming.
Sections 3.2.2↓ through 3.2.7↓ put to rest a popular belief that Fortran programs are faster than C/C++. If the restrict qualifier is used appropriately, compilers can effect every loop optimization applicable in Fortran in C/C++. Section 3.2.6↓ directly takes on the false belief that C++ programs are condemned to be slower than Fortran. In that section, we write a naive C++ program for multiplying matrices using general classes for vectors and matrices and show that it is more than an order of magnitude slower. The culprit here is the naivety of the programmer and not the C++ language. A C++ program written with an idea of how compilers work, and the kind of instruction stream they may generate, will be faster than a Fortran program written without that knowledge. The C/C++ programming languages provide every available means for writing optimized programs, and there is no other programming language or paradigm, including Fortran, that is remotely comparable to them in this respect.
The skill of writing loops explained in section 3.2↓ is about 40% of what the typical scientific programmer needs to know to produce well-optimized programs. Another 40% is about optimizing memory accesses in a single-core program, a closely related topic discussed in the next chapter. Many subtle concepts and techniques come up when we study threads and networks. However, in optimizing scientific programs, much of the game is in structuring loops carefully, as explained in this chapter, and in organizing memory accesses, as explained in the next chapter. For example, when we optimize a program to effect a matrix transpose in chapter 6↓, much of the optimization boils down to structuring loops and organizing accesses to memory, even though the program makes sophisticated use of networking and threading.
The brief summary, in the preceding paragraph, of the relative worth of various techniques of optimization is for scientific programs, which generally tend to work with known amounts of data that are laid out in regular patterns. In computer science, data tends to be dynamic and highly unpredictable in extent. Dynamic data structures such as linked lists, trees, and graphs are commonly employed. When such data structures are used, the relative worth of various optimization techniques changes considerably. Optimizing for memory, which is the topic of the next chapter, is considerably more important. Loop optimizations discussed in this chapter are not as important. However, more and more scientific programs are taking on qualities of systems programming, and, conversely, some newer areas of computer science such as data analysis and image processing give rise to programs that are similar to traditional scientific programs.
In section 2.4↑, we found that optimized libraries such as MKL and FFTW can be nearly an order of magnitude faster than Numerical Recipes. Why are optimized libraries nearly an order of magnitude faster? A big part of the answer has to do with optimization for the instruction pipeline and the register set.
Optimizing for the processor’s register set and instruction pipeline is a difficult skill, and section 3.3↓ explains how to do it. Although sections 3.3.1↓ and 3.3.2↓ are not overly technical, the following two sections, 3.3.3↓ and 3.3.4↓, are not for the faint of heart. This is the only part of the book where assembly programming is essential.
The value of optimizing for the instruction pipeline and the register set is well known in dense numerical linear algebra and, as already pointed out, in computational Fourier analysis. However, even in these areas, the technical knowledge essential for effecting such optimizations is known only to a small number of cognoscenti. There can be little doubt that wider knowledge of such techniques will uncover more opportunities for such optimizations. Even for the Fast Fourier Transform (FFT), theoretical lower bounds are well below actual out-of-cache performance, and there could be room for additional investigations in this gap. Knowledge of computer architecture creates more possibilities for the programmer and opens a whole new point of view.
To conclude this introduction, we turn to a question that may be on the reader’s mind. If we are delving as deeply into processor architecture as we do in this chapter, won’t much of this chapter become irrelevant as the hardware changes? The answer is no, and emphatically no as long as the x86 dominance persists.
ISA Year Family Registers Instructions
SSE 1999 Pentium III XMM (128 bit) ADDPS, MULPS
SSE2 2001 Pentium 4, Nehalem, Westmere, Opteron, Athlon XMM (128 bit) ADDPD, MULPD
AVX 2011 Sandy Bridge, Ivy Bridge, Bulldozer, Jaguar YMM (256 bit) VADDPD, VMULPD
AVX2 2013 Haswell, Broadwell YMM (256 bit) VADDPD, VMULPD, VFM*ADDPD
AVX-512 2015 Skylake ZMM (512 bit) VADDPD, VMULPD, VFM*ADDPD
Table 3.1 Generations of x86 instruction set architectures (ISAs) and machines implementing them. The machine families typically include several machines. For example, Xeon-5650 and Xeon-5670 are machines in the Westmere family (or microarchitecture). This table was prepared by consulting several Wikipedia pages. For a list of machines used in this book, see table 9.1↓ in the appendix. Not all versions of Skylake support AVX-512.
A look at table 3.1↑, to which we will return many times, will help explain why that is so. It shows the gradual evolution in instruction set architectures from 1999 to 2016/2017 and beyond. Section 3.2↓ connects the workings of the compiler to XMM (SSE2) and YMM (AVX/AVX2) registers. This section will hardly change when the ZMM registers and their successors come along, except that the kind of concepts discussed in it will increase in importance. None of the C/C++ programs discussed in that section needs to be changed. Indeed, many of the same points could have been made in 2001.
Table 3.1↑ shows the gradual evolution in instruction set architectures from Streaming SIMD Extensions (SSE) to SSE2 to Advanced Vector Extensions (AVX) to AVX2. The expansion in register width from 128-bit XMM registers to 256-bit YMM registers to 512-bit ZMM registers is one part of the evolution. Another part of the evolution is the addition of new instructions. In the instruction vaddpd, v stands for vector, as may be expected, and pd for packed double. The packed double instructions operate on all the double-precision numbers stored in a vector register simultaneously. Thus, a single vaddpd instruction applied to YMM registers results in four additions. The AVX2 instruction set architecture introduced fused-multiply-add instructions, namely, vfmadd132pd, vfmadd213pd, and vfmadd231pd. These instructions implement operations of the form c = c + ab or a = c + ab while operating on three vector registers. Thus, a vfmadd*pd instruction applied to three YMM registers results in four additions and four multiplications.
We will refer to machines as 2.6 GHz SSE2 or 3.6 GHz AVX2. The full information about the machine may be found by looking up table 9.1↓ in the appendix. The instruction set is the part of the computer’s architecture most pertinent to program speed, which explains the nomenclature we adopt.
The type of processor design we see today is a product of research that happened as early as the 1960s, with important additions in the 1970s and 1980s. [47]  [47] See [38]. The 1960s reference is to Tomasulo’s algorithm, which is mentioned later in this chapter. These design principles were incorporated into the x86 line in the 1990s, and the underlying design principles have not changed greatly since then. The x86 instruction set has evolved gradually and methodically. Therefore, not only do the fundamental concepts of optimization for processor architectures remain the same, the particular form they take within C/C++ programs does not change much either.
In addition, there has been a great deal of convergence in processor technology since 2000 so that nearly 100% of desktop processors are now x86 based. The pertinence of the particular examples, as well as that of general principles of program optimization described in this chapter, is unlikely to diminish anytime soon. In fact, the pertinence of computer architecture to program optimization appears to be growing.
Section 3.3↓ shows how to optimize for the instruction pipeline for SSE2 machines. The instruction pipeline depends on the instruction set architecture and to a far lesser extent on its microarchitectural realization (family in table 3.1↑). Because the programs in this section are in assembly, they will need to be updated for machines with YMM and ZMM registers (AVX, AVX2, AVX-512). The updating for the YMM registers is the topic of several exercises. ZMM registers are discussed in chapter 7↓ on the Xeon Phi. Using XMM registers has the advantage that the programs we discuss will run on almost any computer, including AVX and AVX2 machines.
The payoff from this type of optimization is increasing rapidly. A matrix multiplication program written in 2010 in C/C++ approached nearly a third of the speed of a program optimized for the SSE2 processor pipeline. Five years later, the C/C++ program does not even come within a tenth of the speed of a program optimized for the AVX2 processor pipeline.
In section 3.2.5↓, we explain why the same C/C++ program has become so much worse, when compared to a fully optimized program, over a span of five years. There are two related reasons. After clock speeds flat-lined around 2003, and even before, processors have used greater and greater numbers of transistors for greater and greater parallelism. At one level, that translates to more and more processor cores in a single processor package. Inside a processor, the same trend translates to vector registers and instruction pipelines that can simultaneously execute multiple instructions. Even as there is a greater payoff in optimizing for such pipelines, it becomes harder for the compilers to do so.
In 2015, nearly 100% of non-mobile computers and laptops run on x86 processors. It would take a cataclysmic event (cataclysmic for Intel Corporation), such as Apple Computers switching its MacBooks and desktops from the x86 line to ARM, for the x86 line to be truly challenged. [48]  [48] I thank Hans Johnston for this point. Intel is not as invulnerable to market forces as it used to be one or two decades ago, but it still dominates the processor benchmarks overwhelmingly. Such a cataclysmic event is not on the horizon as of 2015.

3.1 Overview of the x86 architecture

This section is a basic introduction to the x86 instruction set architecture, which has dominated since its introduction in 1978. We look at registers and a little bit of assembly programming. Our intention is to lay the foundation for the next section, where we look at compilation and show how to tell whether a compiler has generated good code or not. All that is required is a familiarity with the names of some of the registers and a few instructions. The instruction names are often self-explanatory.
Accordingly, the assembly programming we exhibit is basic. Yet one of the programs we look at is quite useful. This program implements a C++ class TimeStamp for accessing the Time Stamp Counter, which is one of the best methods to measure time on a computer. This C++ class is used throughout the book.
The x86 assembly language is known for its huge complexity. Understanding assembly programming is a mammoth task meant for compiler writers and computer architects, not for the average scientist. Our intention here is to gain a basic understanding only---in other words, our aim is to splash a little water, not to become expert swimmers. Without some understanding of assembly programming, many aspects of computer architecture as well as the workings of compilers and linkers take on a remote quality. In addition, it is impossible to gain an appreciation of the value of optimized scientific codes.

3.1.1 64-bit x86 architecture

[Figures: FIGS/chapter2/figCompArchIntelLandCoord.png, FIGS/chapter2/Xeon5400.png]
Figure 3.1 Land coordinates of Intel’s Xeon 5400 processor and a photograph of the same processor. The land coordinates are used to identify pins, which are shown in the photo.
Perhaps a good way to begin is by looking at a processor package. Figure 3.1↑ shows what a processor package looks like from the outside. [49]  [49] The land coordinates in the figure are from Quad-core Intel Xeon Processor 5400 Series: Datasheet. There are pins to transfer addresses and data to and from memory, pins to interrupt the processors, and so on. The processor package shown has only four processor cores. More recent packages have many more cores, and it takes multiple pages to describe their land coordinates.
When we write programs, we do so with an awareness of memory. Every variable name is ultimately the name for a segment of memory. However, in most programming, we have no awareness of the registers at all. Registers are locations that reside on the chip that are capable of holding data or addresses. They are very special locations because they are wired into the circuits for carrying out arithmetic and logical operations. When arithmetic operations are executed by the processor, all the operands may be registers. To add two numbers in memory, for instance, typically we have to first move one or both of them to registers, add the two registers, and move the result from a register to a memory location. The addresses of the memory locations can also be held in the registers.
[Figure: FIGS/chapter2/registers.png]
Figure 3.2 Partial register diagram of an x86 processor. The general-purpose registers (RAX to R15) are all 64-bit wide. The XMM, YMM, and ZMM registers are 128,  256, and 512 bit wide, respectively.
Registers
The x86 64-bit architecture is first of all a specification of registers (see figure 3.2↑). The architecture was first introduced by AMD and later adopted by Intel. Some variable names may only correspond to specific registers. The following are some of the registers in the 64-bit x86 architecture: the sixteen 64-bit general-purpose registers %rax, %rbx, %rcx, %rdx, %rsi, %rdi, %rbp, %rsp, and %r8 through %r15; their lower 32-bit halves %eax, %ebx, %ecx, %edx, %esi, %edi, %ebp, %esp, and %r8d through %r15d; and the sixteen 128-bit registers %xmm0 through %xmm15, which on AVX machines are the lower halves of the 256-bit registers %ymm0 through %ymm15.
The AMD/Intel manuals use slightly different names for the registers. For instance, %rax becomes RAX. The naming in our list anticipates our use of GAS (the GNU assembler). The partial list omits a number of registers. From some of the registers we have listed, we can get 16-bit or word-length registers and 8-bit or byte-length registers by changing the names appropriately.
This list of registers may appear long and a bit intimidating. To master it all can indeed be quite a task. For our purposes, it suffices to know that there are 16 general-purpose registers and 16 XMM/YMM registers. The different uses that these registers are put to will emerge as this chapter progresses. The XMM/YMM registers are the basis of fast scientific programs.
There is a 64-bit register called %rip that users normally do not access. That register is the instruction pointer. It is the offset of the address of the next instruction to be executed relative to some base. But that interpretation can be problematic because of instruction pipelining and out-of-order execution.
Backward binary compatibility
The 64-bit x86 architecture carries within it ghosts from decades past. Some of these ghosts are easily seen in the naming of the register set, which is the most basic feature of all processor architectures. At first there seems to be no good reason to call a register rax or rbx. However, when the first 8086 16-bit processor was introduced back in 1978, large numbers of registers must have seemed a distant dream. So it was natural to use the first few letters of the alphabet for the registers. When the 32-bit 80386 was introduced in 1985, names such as ax were changed to eax, signaling the enhancements to the size of the registers. The register names in the 64-bit x86 architecture reflect this evolution.
Binary compatibility is the reason for this peculiar evolution. It may make no sense to expect a 64-bit x86 processor of today to run a 16-bit binary from 1978, but it makes sense to expect it to run the 32-bit binaries of a few years ago. All the 64-bit x86 processors have a 32-bit mode that can be turned on by the operating system. If they did not have such a mode, nearly every one of the many 32-bit software systems available would have failed when the 64-bit architecture was first introduced. Backward binary compatibility gives software developers, especially developers of operating systems, time to catch up. Backward compatibility means that an instruction such as MOVL EAX EBX that made sense in the 32-bit architecture must also make sense in the 64-bit x86 architecture. Not only that, its encoding using bits must be exactly the same (its encoding is 89 C3 in hexadecimal---the first byte is the opcode for MOVL and the second byte, which encodes the source and destination registers, is the so-called ModR/M byte).
The disadvantage of backward binary compatibility is that it constrains the design of the instruction set and the microarchitecture used to implement it in hardware. Every time the instruction set is extended, the binary encoding of the earlier instruction set as well as its semantics must be preserved intact. This burden can be considerable. For instance, programs written in 1990 could rely on the register %eip to point to the address of the next instruction to be executed and manipulate that register to jump to some other point in the program. Not only must today’s x86 architecture include that register, it should also give the same effect when the %eip is altered to jump to some other location in the program. Having to bear this considerable burden has meant that the x86 architecture is infamous for its extremely complicated binary encodings of instructions.
History has shown that the disadvantages of backward binary compatibility are outweighed by the advantage of being readily usable. The x86 architecture always seeks to be usable. A new design that is brought out today must run yesterday’s binaries. This emphasis on usability has meant that the vast majority of today’s computers (excluding mobile devices) use x86 processors. [50]  [50] The ARM architecture has a much simpler instruction set and has achieved wide use on mobile gadgets. The number of ARM devices is greater than the number of x86 computers. Thus, the ARM architecture does not suffer from a lack of widespread adoption and the consequent lack of economy of scale, which has been a bane of previous challengers to x86. However, the terms of commercial licensing of ARM are such that the benefits of its wide adoption are shared by many companies.
Yet one must not exaggerate the benefits of backward binary compatibility in relation to the x86 architecture. Intel, the foremost champion of the x86 line, is no longer the invincible behemoth it once was. Intel is tied to the PC business, which is in gradual decline, and has no control over hardware or software platforms in the rapidly growing mobile computing and networking sectors. Despite the enormous investment in design accumulated over many years, the volume of profit being generated in the computer business is now so great that a legitimate challenger to the x86 line is not inconceivable. In a business that is growing exponentially, what happens in the next five years can be of more consequence than what happened in the past four decades.

3.1.2 64-bit x86 assembly programming

Each processor core fetches instructions from memory and executes them. These instructions are stored in memory as a sequence of bytes. In assembly language, the bytes that encode machine instructions are replaced by mnemonics. For example, an instruction to move a quad word, which is eight bytes in size, from register r8 to register rax is coded as the hex sequence 4C 89 C0. The first byte 4C is the so-called REX byte, the second byte 89 is the opcode, and the third byte C0 encodes the fact that the source register is r8 and the destination is rax. [51]  [51] The third byte is given by the AMD manuals. In the 64-bit x86 architecture, opcodes can be one, two, or three bytes. A single instruction can be as many as 15 bytes.
Unlike processors, we can’t just look at bits and make sense of them, which is why the assembly languages provide mnemonics. In the GNU assembler, which is called GAS, the mnemonic for the instruction to move a quad word from r8 to rax is movq %r8, %rax. To move a double word, which is four bytes stored in the lower halves of the 64-bit registers, the mnemonic is movl %r8d, %eax. These mnemonics vary with the assembler. In the MASM assembler, the mnemonic for moving a quad word from r8 to rax would be MOVQ RAX R8---notice that the registers are given in reverse order.
There is no standardization in the world of x86 assembly languages. The documentation for the instruction set, which include the mnemonics for various instructions, is published by AMD and Intel. However, the mnemonics can change in a predictable fashion depending on the assembly language. For instance, a register referred to as R8 in the AMD/Intel documentation becomes %r8 in GAS. One point is important to keep in mind. In the AMD/Intel manuals, the destination precedes the source. In GAS, the destination follows the source. We will always use the GAS convention. On Linux computers, even the Intel compilers use the GNU assembler.
The reader may find it a little puzzling that “double” (or “long”) is used for 32-bit operands and “quad” for 64-bit operands. A word in the original 8086 machine of 1978 was 16 bits. Thus, double words are 32 bits and quad words are 64 bits, as a result of a choice made long ago.
Getting started
For our first assembly program, we begin with the following simple C code:
#include <stdio.h>
int main(){
  int x, y, z;
  x = 1;
  y = 2;
  z = 3;
  printf("The sum of %d and %d is %d \n", x, y, z);
}
It is compiled using
gcc -S -O3 addtwo.c
The -S flag tells the compiler to generate assembly only and leave it in the file addtwo.s. Optimization is turned on with the -O3 flag so that the compiler generates cleaner code.
Below is a listing of the addtwo.s assembly code. The listing is a cleaned up version of the code generated by the compiler. In particular, we removed some code at the bottom related to an error handler.
 1	.file	"addtwo.s"
 2	.section	.mydata,"aMS",@progbits,1
 3	.LC0:
 4	.string	"The sum of %d and %d is %d\n"
 5	.text
 6	.globl	main
 7	.type	main, @function
 8	main:
 9	movl	$3, %ecx
10	movl	$2, %edx
11	movl	$1, %esi
12	movl	$.LC0, %edi
13	xorl	%eax, %eax
14	call	printf
15	ret
The instructions are specified using mnemonics in the main block, which begins at line 9. The earlier lines are assembler directives. The syntax for the directives can vary considerably between different assemblers.
The first line, which gives the name of the file, is a directive. The second line asks the assembler to begin a section called .mydata. One has to dig through the GAS documentation to decipher the portion that follows. The second line states that the section is meant to hold data, that each character is one byte, and that the strings will be terminated by 0. The strings are terminated by 0, anticipating the need to pass them to the printf function in the C library. In C, strings are always terminated by the null character 0.
.LC0 is the name of the string specified in line 4. It corresponds to a location in memory. Here the assembler takes on the responsibility of storing the specified string in the data section and gives the location the name .LC0.
The text section, which is not named, begins on line 5. The text section will contain instructions. The name main is specified to be a global, and the name of a function in lines 6 and 7---main will be visible outside the file and is the name for an address in the text segment, which corresponds to line 8.
In the following code, the 32-bit registers %esi, %edx, and %ecx are filled with 1, 2, and 3, respectively. These are the second, third, and fourth arguments in the call to printf. Because .LC0 is the address of a location in the data segment, it can only be 32 bits. Thus, the movl instruction is used to move it into the register %edi. The xorl instruction zeros the eax register as a precaution.
On Linux, the 64-bit x86 calling convention places the first four integer arguments in %rdi, %rsi, %rdx, and %rcx; here their 32-bit halves %edi, %esi, %edx, and %ecx suffice. The call to printf is made with those arguments. Here printf is an externally defined name and has to be resolved during linking.
To assemble, [52]  [52] The icc compiler uses the GNU Assembler (GAS). For its documentation, see (Using as, D. Elsner and J. Fenlason). we may use
icc -c addtwo.s 
to produce the object file addtwo.o (we are back to using icc). The object file can be linked to produce an executable using
icc -o addtwo.exe addtwo.o
The icc linker takes care of resolving the reference to printf in addtwo.o. When the executable addtwo.exe is run, we are told that
The sum of 1 and 2 is 3
More about function calls
The output of the next listing, which is
The sum of 1 2 3 4 5 6 7 8 is 36
is hardly more exciting. However, it illustrates function calls made after placing some arguments on the stack.
 1	.file	"addmany.s"
 2	.section	.mydata,"aMS",@progbits,1
 3	.LC0:
 4	.string	"The sum of %d %d %d %d %d %d %d %d  is %d \n"
 5	.text
 6	.globl	main
 7	.type	main, @function
 8	main:
 9	subq	$32, %rsp
10	movl	$36, 24(%rsp)
11	movl	$8, 16(%rsp)
12	movl	$7, 8(%rsp)
13	movl	$6, (%rsp)
14	movl	$5, %r9d
15	movl	$4, %r8d
16	movl	$3, %ecx
17	movl	$2, %edx
18	movl	$1, %esi
19	movl	$.LC0, %edi
20	xorl	%eax, %eax
21	call	printf
22	addq	$32, %rsp
23	ret
figure FIGS/chapter2/figCompArchStack.png
Figure 3.3 The stack pointer is moved downward by 32 and the numbers 6, 7, 8, and 36 are placed on the stack before a call to printf. The labeling on the right assumes that the stack pointer has been moved downward.
The addmany.s program calls printf() (on line 21) with a total of 10 arguments. These are first the format string, next the numbers 1 through 8, and finally their sum, which is 36. The last four of these arguments are pushed onto the program stack (see figure 3.3↑) and the first six are placed in registers as before. It helps to think of the registers as an extension of the stack.
Each argument pushed onto the stack is an int and therefore 4 bytes. Thus, we would need 16 bytes on the stack. However, in 64-bit x86 architecture, each entry on the stack must be a multiple of 8 bytes. Therefore, we need 32 bytes on the stack.
By convention, %rsp is the pointer to the top entry of the stack. On line 9, the stack pointer %rsp is decremented by 32 bytes to make room for four more entries. In x86, the stack grows downward from a high address as shown in figure 3.3↑. The arguments are pushed onto the stack in reverse order. Thus, the final argument 36 is pushed first onto the stack (line 10). That is followed by the preceding three arguments 8, 7, and 6 (lines 11, 12, and 13, respectively). For addressing modes of the type (%rsp) and 8(%rsp), see figure 3.3↑. They are explained in greater detail below. The stack pointer %rsp points to the top, which is 6 (see the right part of figure 3.3↑).
We may think of the registers used to pass arguments as an extension of the stack. These are %r9, %r8, %rcx (or %ecx), %rdx (or %edx), %rsi (or %esi), and %rdi (or %edi), in bottom to top order (lines 14 through 19). Thus, the top six entries of the call stack are in registers, with %rdi at the top by convention. If a function call has six or fewer arguments, the entire call stack is stored in registers.
When the call to printf is made, the function has to retrieve some arguments from the registers and some from the stack. Line 22 moves the stack pointer back to its original location: the caller cleans up the stack after the function that is called returns.
Addressing modes
Let us go back to the listing and look at usages such as (%rsp) and 8(%rsp) once again. The first usage (%rsp) is a memory reference using a base register. Here the base register is %rsp and it stores a 64-bit address. When memory is referenced in this way using a base register, it is as if the base register is a pointer that is dereferenced.
The memory reference 8(%rsp) uses a base register and a displacement. The displacement is 8 and the base register is %rsp. The memory location referred to is the one whose address is the content of the base register with the displacement added to it.
Memory references in machine language are similar to the way pointers are handled in C. If rsp is thought of as a pointer-type variable in C, the meaning of (%rsp) is the same as *rsp. In other words, an address is dereferenced and an rvalue converted to an lvalue. Likewise, 8(%rsp) has the same meaning as rsp[8] or *(rsp+8) in C, but we must be a little careful here. In C, +8 would mean +64 bytes if rsp pointed to double (8 bytes) and +32 bytes if rsp pointed to int (4 bytes). In assembly, there is no notion of types, and +8 is always +8 bytes.
Although +8 is always +8 in assembly, the analogy to C goes further. A reference such as 8(%rax,%rdx) that fits the format
displacement(base register, index register)
evaluates to the memory location whose address is the sum of the base, index, and the displacement. Thus, (%rax, %rdx) is equivalent to rax[rdx] in C syntax, with the caveat about +8 being +8. In x86 assembly, a memory location can be either a destination (lvalue) or a source (rvalue).
Yet another memory addressing mode is as follows:
displacement(base register, index register, scale)
Its meaning is almost self-explanatory. Thus, 8(%rax, %rbx, 8) is equivalent to rax[8*rbx+8] in C syntax, with the caveat stated above. The scale is allowed to take only the values 1, 2, 4, and 8.

3.1.3 The Time Stamp Counter

To do anything useful with assembly code, it helps to have a method to make it part of a C or C++ program. The asm directive allows us to embed assembly code in a C or C++ program. The implementation of the asm directive varies with the compiler. The PGI compiler manual describes inline assembly in detail. The implementations of inline assembly by the Intel and GNU compilers appear to follow the PGI documentation.
RDTSC or rdtsc is an x86 machine instruction for reading the Time Stamp Counter. The behavior of the Time Stamp Counter varies from processor to processor. Its basic function is to count clock cycles. After the instruction is executed, the cycle count is made available as a 64-bit number in two 32-bit registers, namely, eax and edx, with the higher bits in edx.
Why are 64 bits needed? A 2 GHz clock ticks 5 × 10⁹ times in two and a half seconds, whereas the maximum unsigned integer that we can represent using 32 bits is 2³² − 1 ≈ 4.3 × 10⁹. A 2 GHz clock would therefore wrap a 32-bit counter around in about two seconds. With 64 bits, the Time Stamp Counter is guaranteed not to wrap around for more than 10 years. However, the counter is reset every time the machine is booted.
The following inline assembly can be embedded into a C or C++ program:
asm volatile("rdtsc" : "=a" (a1), "=d" (d1));
Before this statement, a1 and d1 must be defined to be of type unsigned int. After the rdtsc instruction is completed, the registers eax and edx are copied to the variables a1 and d1, respectively. The segment that follows the colon is the output part of the inline assembly statement. In this case, the output part is
"=a" (a1), "=d" (d1)
The =a and =d are saying that the %eax and %edx registers must be output, using a as the code for eax and d as the code for edx, as may be verified by checking the documentation. The destinations for the output are given as (a1) and (d1), where a1 and d1 are variables of type unsigned int defined in the C or C++ program. In this way, we are able to execute a machine instruction and bring in information into variables that can be accessed within the C or C++ program.
The volatile qualifier asks the compiler not to change the assembly code as part of an optimization. In addition, if two inline assembly statements have the volatile qualifier, the compiler will not move the two statements past each other.
In some processors, such as those of the Pentium M family, the Time Stamp Counter is incremented every clock cycle. However, the frequency of clock ticks can vary. To decrease energy consumption, computers vary the frequency of clock ticks and the operating voltage. In other processors, such as the Intel Atom, the Time Stamp Counter is incremented at a constant rate regardless of the operating voltage and the actual frequency of the clock. The architectural behavior "moving forward" has been stated to be of the latter type.
Between two calls to rdtsc made on the same processor core, the second call is guaranteed to return a higher count. [53]  [53] Processes may be moved from core to core by the operating system. When that happens, the second reading of the Time Stamp Counter may be less than the first. See http://en.wikipedia.org/wiki/Time_Stamp_Counter. However, rdtsc is not a serializing instruction. In the presence of multiple pipelines and out-of-order scheduling, some instructions that occur after rdtsc may complete sooner, although some instructions that occur before may complete later. However, one does not need to worry too much about such possibilities. With careful use, the Time Stamp Counter can be used to time events that take only a few nanoseconds.
It is of course a nuisance if we have to think about machine instructions every time we want to read the Time Stamp Counter in a C++ program. A simple C++ class greatly simplifies the use of the Time Stamp Counter. The interface of the TimeStamp class is given below.
class TimeStamp{
public:
  TimeStamp(){};
  void tic(); 
  double toc();
};
The entire TimeStamp class is defined in the header file TimeStamp.hh to avoid function call overhead using function inlining. The constructor does nothing. A call to the member function tic() will issue the rdtsc instruction and record the counter in two variables of type unsigned int. A call to the member function toc() will read the Time Stamp Counter in the same way and will return the number of cycles elapsed since the last call to tic() as a double. The calculation to turn the information from the two readings of the Time Stamp Counter into a double is not shown.
The TimeStamp class can be used as follows:
TimeStamp ts;
ts.tic();
//... code to be timed ...
double cycles = ts.toc();
The cycles variable will hold the number of cycles elapsed between the calls to tic() and toc().

3.1.4 Cache parameters and the CPUID instruction

The registers on a processor are few in number. Large arrays and other large data structures cannot be stored in their entirety using registers. They are typically stored in memory (DRAM). However, memory is outside the processor chip. Although it takes only a few cycles to operate on registers, it takes a few hundred cycles to fetch a word from memory. To hide the cost of the large number of cycles required to access a word from memory, some memory words are cached on the chip itself. Cache memory is organized into several levels.
We will take a closer look at cache memory in the next chapter. Here we will use the cpuid instruction to ask the processor to report information about its cache and thus give us a preliminary idea of cache memory. As we look at assembly code and explore certain facets of the processor, it will help to have some idea of cache memory. Caches influence the performance of nearly every program.
The cpuid instruction can be used to extract a variety of information about the processor. If you type cat /proc/cpuinfo on a Linux computer, the command returns information about the processor’s instruction set, power consumption, caches, support for specific technologies, and clock speed. Much of that information is obtained by the operating system using cpuid. The cpuid instruction has a manual of its own. The AMD and Intel processors use different conventions for cpuid.
The inline assembly statement for extracting information about caches using the cpuid instruction is given below. eax, ebx, and ecx must be defined as variables of type unsigned int, and i must be an integer value.
  
asm volatile("cpuid"                //instruction
	:"=a"(eax), "=b"(ebx), "=c"(ecx)//output list
	:"a"(0x04), "c"(i)              //input list
	:"edx");                        //clobber list                       
This inline assembly statement has more complicated syntax than the one we used to read the Time Stamp Counter. The instruction, which is given as cpuid, is followed by three colons and not just one. We do not need to pass any parameters to the rdtsc instruction and all the data items created by the instruction are output to program variables. In the case of the cpuid instruction, we need to leave some data in the registers to tell the cpuid instruction what to do. In addition, not all the data items returned by the cpuid instruction are output to program variables. The syntax of this inline assembly statement is more complicated for these reasons.
The three colons in this inline assembly statement divide the part inside the parentheses into four segments. The first segment, which is before the first colon, gives the instruction to be executed. The second, third, and fourth segments give the output list, the input list, and the so-called clobber list, respectively. In this case, the output list specifies that the contents of the registers %eax, %ebx, and %ecx, which are indicated by =a, =b, and =c, respectively, must be output to the program variables eax, ebx, and ecx, respectively.
The input list specifies that the register %eax must be loaded with the hexadecimal number 04. This hexadecimal code tells cpuid to get information about cache. By loading other codes into %eax, cpuid can be asked to return information about performance counters, power management, processor name, and other features of the processor. The input list loads the integer value i into the %ecx register. A processor typically has multiple levels of cache, and an integer value is loaded into %ecx to tell cpuid which cache it must get information about.
CPUID returns information in four registers: %eax, %ebx, %ecx, and %edx. Of these, only three are output to program variables. We have elected not to output the %edx register. Therefore, the clobber list, which is the segment following the third and last colon, includes %edx to tell the compiler that the instruction will write over that register. That way the compiler knows it must not save information in %edx with plans of using it later. The registers %eax, %ebx, and %ecx are also clobbered by the cpuid instruction. However, these should not be in the clobber list because the compiler can figure out that they are being clobbered from the input and output segments.
A program that uses the cpuid instruction to extract information about cache memory follows:
1  cout<<"CPUID with code 04h"<<endl;
2  unsigned int eax, ebx, ecx;
3  for(int i=0;;i++){
4    asm volatile("cpuid"
5		 :"=a"(eax), "=b"(ebx), "=c"(ecx)
6		 :"a"(0x04), "c"(i)
7		 :"edx");
8    if((eax&0xF)==0)
9      break;
10    printf("cache type = %x\n",eax&0xF);
11    printf("cache level = %x\n", (eax>>5)&(0x7));
12    printf("ways of associativity = %u\n",
13	   ((ebx>>22)&0xFFF)+1);
14    printf("physical line partitions = %x\n", 
15	   ((ebx>>12)&0x3FF)+1);
16    printf("cache line size = %u\n",(ebx&0xFFF)+1);
17    printf("number of sets = %u\n",ecx+1);
18    printf("cache size in bytes = %u\n\n",
19	   (((ebx>>22)&0xFFF)+1)* //ways of associativity
20	   (((ebx>>12)&0x3FF)+1)* //physical line partns
21	   ((ebx&0xFFF)+1)* //cache line size
22	   (ecx+1)); //number of sets
23  }
Lines 4 to 7 issue the cpuid instruction with i = 0, 1, … For each value of i, cpuid returns information about a different cache. It signals that there are no more caches by putting zeros in the last four bits of %eax. On line 8, (eax&0xF) extracts the last four bits of eax using a bitwise and operation with the hexadecimal number 0xF. If cpuid has signaled that there are no more caches, line 9 uses break to exit from the loop. Lines 10 to 22 extract information about the cache.
When the cpuid instruction was used to query a 2.6 GHz SSE2 machine (see table 9.1↓ for the full names of the machines), it reported 32 KB of L1 instruction cache, 32 KB of L1 data cache, 0.26 MB of L2 cache, and 12.6 MB of L3 cache. On a 2.2 GHz AVX machine, the L3 cache is 20 MB. On a 3.6 GHz AVX2 machine, the L3 cache is 4 MB. These are some of the machines we use for timing programs.
Exercise:  Find the hex codes for the following instructions:
Exercise:  Explain why the arguments in a function call are pushed onto the stack in reverse order.
Exercise:  Write a function in assembly code that works as follows. The first argument is an integer n. It is assumed that the next n arguments are also integers, and the function should return the sum of these arguments if n > 0.
Exercise:  Look up the hexadecimal opcodes of rdtsc, cpuid, and CALL. Which of these are serializing instructions?
Exercise:  In a file called dummy.cpp, define a function void dummy(), which returns after doing nothing. In another file write a for-loop that makes n calls to dummy(). Compile the files separately and then link them together to ensure that the compiler does not eliminate the function call. Verify that the function call is present by inspecting the assembly.
Exercise:  Time the for-loop using the Time Stamp Counter with n varying from 1 to 10⁹. To get a good reading for each n, you can do several timings and then take the median. Graph the number of cycles as a function of n.
Exercise:  Fit the graph to the line An + B. How good is the fit? Assuming that all the cycles are consumed by function calls and the overhead of timing, interpret A and B. Is B positive or negative? The assumption is not perfectly valid, however. Some of the cycles will be consumed by instructions that increment and test the loop counter, and the branch prediction introduces some overhead as well.
Exercise:  Use the command cat /proc/cpuinfo to see some cpu parameters. Be aware that this command doubles the number of cores on certain Intel machines if hyper-threading is turned on. Hyper-threading is not good for some scientific applications. How much of the output can you verify using cpuid?

3.2 Compiler optimizations

Anyone who seeks to write fast programs must begin by grasping what compilers do. C/C++ programs are turned into machine instructions by the compiler. The speed of the program can vary by a factor of 10, or even more, depending on the instruction stream the compiler generates. While generating machine instructions, compilers can alter the structure of the program considerably.
In sections 3.2.2↓ through 3.2.5↓, we go over a number of compiler optimizations of loops and loop nests. The two big lessons are to use the restrict qualifier, where appropriate, and to present loops to the compiler in as simple a form as possible. What is meant by simple is a little vague and will become clear during the discussion.
Compilers can be highly unpredictable. The only way to get a sense of whether the compiler has done as well as expected is to look at the assembly code. Although writing assembly programs can be difficult, scanning the assembly code to find inner loops and checking the instruction mix is much easier to do. Thus, our discussion of compiler optimization is not so much about the optimizations themselves as it is about what to look for in the assembly code. There can be no rational discussion of program speed without an idea of what the assembly code looks like.
A dramatic example of the unpredictability of compilers occurs in section 3.2.5↓. In that section, we look at a few simple programs for multiplying matrices. The programs are nearly identical, but their speeds can be quite different. That is not the dramatic part, however. The same programs are run on an SSE2 machine, an AVX machine, and an AVX2 machine that is theoretically more than 4 times faster than the SSE2 machine (see table 9.1↓ for the full names of the machines). Surprisingly, our simple matrix multiplication programs are faster on the SSE2 machine, which leaves us wondering whatever happened to the speedup of a factor of 4. The SSE2 instruction set stabilized over a period of 10 years before it was superseded. The AVX2 instruction set is only a couple of years old as of this writing. Although AVX2 programs can be 4 times faster, compiler technology has not yet caught up.
In section 3.2.6↓, we look at a C++ program for multiplying matrices, in which matrices and vectors are represented using classes. This program is slow because C++ constructs are used without an understanding of how they map to machine instructions. C++ programs can be as fast as programs in any other language, or they can be slow, depending on the skill and intention of the programmer.
Compilation is a difficult task because compilers have to look at a program character by character, token by token, and statement by statement until they have an internal representation of the whole program. The programmer’s view of a computer program is from the reverse direction. The programmer often starts with a problem to be solved and with a global view of a solution to it. The global view, which is so easy for a human programmer, is very hard to attain for the compiler. In the final section, 3.2.7↓, we explain a little theory to show how compilers go about building a global view.
Before getting into the meat of this section, we begin by discussing a few preliminaries.

3.2.1 Preliminaries

Cache effects in simple trials can give unrealistically good figures for program speed. Cache flush, which we describe here, is one way to eliminate cache effects. We also review compiler options.
The C++ classes PyPlot for plotting, StatVector for gathering simple statistics, and Table for making tables are used extensively in the source code for this book. However, they appear only rarely in the text. All three classes are described in the appendix.
Cache flush
When programs are timed, one may inadvertently ignore the effects of caching and come up with unrealistically good numbers. Suppose we initialize three matrices to time algorithms for matrix-matrix multiplication.
    for(int i=0; i < dim; i++) //initialize a, b and c
      for(int j=0; j < dim; j++){
	    a[i+j*dim] = (1.0*rand()/RAND_MAX-0.5)*2;
	    b[i+j*dim] = (1.0*rand()/RAND_MAX-0.5)*2;
	    c[i+j*dim] = (1.0*rand()/RAND_MAX-0.5)*2;
      }
The C library function rand() generates a pseudorandom integer between zero and RAND_MAX. It is handy for timing and testing but may not be a well-tested random number generator. [54]  [54] The general opinion that rand() is of poor quality appears to date from decades ago. Whether it applies to the modern gcc and icpc runtime libraries is uncertain and possibly untrue. When pseudorandom numbers with specific properties, such as uniformity and independence, are needed, a good library of statistical functions must be used instead. For our purposes here, rand() is adequate, but the manner in which the matrices a, b, and c are initialized means that they will all be in cache if dim is small. If dim=100, for example, the matrices will fit in the L2 cache of any of the processors of table 9.1↓, each of which has more than 0.2 MB of L2 cache. In a realistic program, even small matrices are unlikely to be in cache when they are needed because the program may touch a lot of data before it starts operating on the small matrices. Thus, timing immediately after initialization can give unrealistically good numbers.
The matrices can be evicted from cache as follows:
    //clear a, b, c from cache
	for(int i=0; i < dim; i++) 
      for(int j=0; j < dim; j++){
	    _mm_clflush(a+i+j*dim);
	    _mm_clflush(b+i+j*dim);
	    _mm_clflush(c+i+j*dim);
      }
The function _mm_clflush() is an intrinsic. It corresponds directly to the instruction CLFLUSH. The effect of cache flush is to evict the entire cache line that corresponds to its argument. Its argument must be a pointer. All transfers into and out of cache occur in blocks or lines. A cache line is typically 64 bytes. The declaration of _mm_clflush() is made visible through the header file ia64intrin.h by the icpc compiler. After CLFLUSH, the cache line is found in DRAM memory but not in any of the caches.
The use of CLFLUSH can introduce artifacts and give pessimistic timing figures. There is no such thing as a perfect timing protocol. A better way is to arrange inputs to the program in a long array, which is larger than the size of the cache, and apply the program to the inputs in succession. For some programs, such as matrix multiplication, such a precaution is not really necessary, and the use of CLFLUSH may be much more convenient.
Compiler options
The other preliminary topic is the use of compiler options. We have recommended the options
-xHost -O3 -prec-div -no-ftz -restrict
for the icpc compiler. Here we add a few more comments. The icpc compiler provides a -fast option, which includes -xHost and -O3. The -xHost option ensures that the compiler generates instructions assuming the highest capability of the machine. The -O3 option sets the optimization level. However, the -fast option is not suitable for our use. For one thing, it uses -no-prec-div, which we decided to abjure earlier. It turns on -ipo for interprocedural optimization, which interferes with many of the points we seek to make throughout this book. It is uncertain whether interprocedural optimization yields a measurable improvement for well-written programs.
An important compiler option shown above is -restrict. It is discussed in greater depth later. It enables the restrict qualifier, which is essential for making programs fast.
Typically, we turn off function inlining using the -fno-inline-functions option (not shown above). Function inlining changes the structure of the code in ways that make discussion of the corresponding assembly code difficult. On many occasions, we have noticed that loops are not well optimized when functions are inlined. However, function inlining should probably be enabled in some programs. It is too important a part of C++ design to be simply turned off without thought.

3.2.2 Loop unrolling

A program to compute the nth partial sum of the Leibniz series follows:
1//sum of first n terms of 4(1-1/3+1/5-1/7+1/9-...)
2double leibniz(long int n){
3	long int i;
4	double ans;
5	for(i=0; i < n; i++)
6		if(i==0)
7			ans = 4.0;
8		else if(i%2==1)
9			ans -=  4.0/(2.0*i+1);
10		else
11			ans +=  4.0/(2.0*i+1);
12	return ans;
13}
This function will be run for large values of n, such as n = 10⁹ or 10¹⁰. With n = 10⁹, the partial sum is generated in a few seconds.
Cycles
Unoptimized code	32
With -xHost -O3 optimization	14
With -xHost -O3 optimization and loop unrolling	7
Table 3.2 Number of cycles per term of the Leibniz series on a 3.6 GHz machine with AVX2 instructions (see table 9.1↓ for the full name of the machine).
Normally, if a program is compiled with optimizations turned on, the resulting code can be two to three times faster. This speedup is not only because the compilers are clever, which they sometimes are, but also because the unoptimized code can be long and roundabout. Table 3.2↑ shows that turning on compiler optimization doubles the speed of the code. Rewriting leibniz() to enable loop unrolling, which is one of the most important optimizations, doubles the speed once again. We will use the leibniz() function and its variants to explain how loop unrolling works.
First, we will understand why unoptimized code is nearly always quite slow. The first few lines of the unoptimized assembly of leibniz() are as follows:
        pushq     %rbp                                        
        movq      %rsp, %rbp                                  
        subq      $32, %rsp                                   
        movq      %rdi, -32(%rbp)                             
        movq      $0, -24(%rbp)                               
        movq      -24(%rbp), %rax                             
        movq      -32(%rbp), %rdx                             
        cmpq      %rdx, %rax                                  
        jl        ..B1.4        
        jmp       ..B1.9        
The function leibniz() receives its argument n in the register %rdi. This code fragment is checking whether 0 is less than n. If n > 0, the program jumps to the address ..B1.4. If not, it jumps to the address ..B1.9, where it terminates quickly by returning 0 as the answer. Deciphering this code fragment is left to the reader. It is a roundabout way to check whether n > 0 or not.
The optimized code performs the same action as follows:
        testq     %rdi, %rdi  
        jle       ..B2.9  
The testq instruction ands the %rdi register, which contains the argument n, with itself and sets certain flags. The instruction discards the result of anding and does not alter the contents of %rdi. If n is negative, the result will be negative as well, and the sign flag is set. If n is zero, the result is zero as well, and the zero flag is set. The next instruction jle jumps to a location where the program terminates with a ret statement if either the sign or zero flag is set, in other words, if n ≤ 0. The optimized code is checking whether n ≤ 0 or n > 0 using a single instruction, while the unoptimized code takes more than half a dozen to do the same check.
We get a sense here of why unoptimized assembly is nearly always slow. Unoptimized assembly code is characterized by a certain listlessness. The compiler takes a local view of the program.
Thus, if an expression such as (j + 1)(j + 2) occurs in multiple statements, the nonoptimizing compiler may fail to recognize that each instance evaluates to the same value. If the same variable is used repeatedly in a loop, the nonoptimizing compiler may fail to assign the variable to a register. Poor use of registers is typical of unoptimized code. Even when more than a dozen registers are available, unoptimized code often uses only a few.
With the -xHost -O3 options, the icpc compiler generates code that is more than twice as fast. As shown in table 3.2↑, the number of cycles per term of the Leibniz series decreases from 32 for the unoptimized code to 14 for the optimized code. The listing below displays the assembly of the for-loop of the leibniz() partial sum function on page 1↑. We shall not examine the listing line by line. Our intention is to figure out whether the compiler has done a good job. For that purpose, it suffices to scan the assembly of the inner loop. After the listing, we explain how to generate the assembly and locate the inner loop in it.
..B3.4:
        vxorpd    %xmm3, %xmm3, %xmm3
        lea       1(%rax), %rsi
        movq      %rsi, %rdx
        vcvtsi2sdq %rsi, %xmm3, %xmm3
        shrq      $63, %rdx
        vfmadd213sd .L[omit]pkt.4(%rip), %xmm1, %xmm3
        lea       1(%rax,%rdx), %rcx
        andq      $-2, %rcx
        cmpq      %rax, %rcx
        je        ..B3.6
..B3.5:
        vdivsd    %xmm3, %xmm2, %xmm3
        vaddsd    %xmm3, %xmm0, %xmm0
        jmp       ..B3.7
..B3.6:
        vdivsd    %xmm3, %xmm2, %xmm3
        vsubsd    %xmm3, %xmm0, %xmm0
..B3.7:
        movq      %rsi, %rax
        cmpq      %rdi, %rsi
        jb        ..B3.4  
To generate the assembly, the compiler is invoked with the
-S -fno-verbose-asm
options, in addition to the compilation options, as in the pattern rule in section 2.3.4↑. The assembly of the entire leibniz.cpp program, which contains the leibniz() function of page 1↑, is more than 1,300 lines. Locating the inner loop within that assembly is an easy skill. One has to search for “leibniz” or its mangled name “_Z7leibnizl” (which may be found as explained in section 2.1↑), and the beginning of the function definition will be obvious.
Within the function definition, paying a little attention to the jump statements helps identify the inner loop. In our listing, the final line is a jb (jump if below) instruction, and it is jumping back to the instruction labeled ..B3.4, which is the top of the loop. In between, we have two more jump instructions. The je (jump if equal) instruction jumps to the case where the term of the Leibniz series is negative and must be subtracted. The jmp (unconditional jump) instruction follows the addition of a positive term of the Leibniz series. It jumps to ..B3.7, where the loop termination condition is verified before possibly jumping back to the top of the loop.
Even in this short snippet of assembly code, there are many unfamiliar instructions. Because our purpose is only to understand the quality of the code generated by the compiler, we do not need to understand many of these. The crucial part for us is first the snippet
..B3.5:
        vdivsd    %xmm3, %xmm2, %xmm3
        vaddsd    %xmm3, %xmm0, %xmm0
In this assembly snippet, the first instruction divides the %xmm2 register by the %xmm3 register and stores the result in the register %xmm3. The source registers are %xmm3 and %xmm2, and the destination register is %xmm3. The %xmm2 register has 4.0 and %xmm3 has 2i + 1 (as a double). So the effect is to compute the 4 ⁄ (2i + 1) term of the Leibniz series. In this case, i is even. Therefore, the second instruction is adding the computed answer to %xmm0. The answer is being accumulated in %xmm0, and the approximation to π will be found in this register at the end of the loop. The second snippet tackles the case where i is odd.
..B3.6:
        vdivsd    %xmm3, %xmm2, %xmm3
        vsubsd    %xmm3, %xmm0, %xmm0
The second instruction here is vsubsd and not vaddsd. Therefore, the 4 ⁄ (2i + 1) term is being subtracted and not added.
How do we tell whether the compiled assembly is satisfactory? First, we may notice that the assembled code is using XMM registers even though wider YMM registers are available on this AVX2 machine. In fact, it is worse than that. Even the 128-bit XMM registers (see table 3.1↑) are not being used fully. Each XMM register is wide enough to hold two doubles. Yet the trailing sd in the vdivsd, vaddsd, and vsubsd instructions, which stands for “single double,” is indicating that only one half of the XMM registers is being used. The leading v in these instructions indicates that these are vector instructions that operate on the vector XMM/YMM/ZMM registers (see table 3.1↑).
To use the vector registers effectively, the compiler has to find parallelism in the loop. In this case, it has found no parallelism. The 4 ⁄ (2i + 1) terms are being alternately added and subtracted in sequential order, just as in our program.
The other deficiency in this assembly is the pair of branch statements (the je and jmp instructions) in the loop body. Branches are undesirable because they interfere with instruction-level parallelism. The first two instructions in our assembly listing are as follows:
        vxorpd    %xmm3, %xmm3, %xmm3
        lea       1(%rax), %rsi
These two instructions operate on entirely different registers and may be executed in parallel. Very probably, that is just what the x86 processor does. The processor decodes multiple instructions simultaneously in a single cycle. In addition, it has multiple units to execute instructions. The processor constantly looks ahead in the instruction stream. Wherever it finds opportunities for parallelism, it schedules instructions for parallel execution. Thus, even a sequential program is executed in parallel. Much of the improvement in single-processor speeds over the last 15 years may be attributed to greater and greater instruction-level parallelism.
Branches (or jump statements) interfere with instruction-level parallelism because it is impossible to tell in advance whether a conditional branch will be taken or not. This uncertainty may mean the processor cannot look ahead and execute instructions in parallel. In fact, branches are so common inside loops (in particular, there is always a branch at the end of the loop body) that instruction-level parallelism would be almost completely ineffective without branch prediction. Fortunately, much of the branching is highly predictable. In our assembly snippet, the je branch is alternately taken or not taken, and the jb branch at the end is always taken except at loop termination. Processors can predict such branches quite easily. If there is a misprediction, so that the wrong sequence of instructions have been executed, there are mechanisms to recover and resume with a correct instruction stream. [55]  [55] For a great deal more about instruction-level parallelism and branch prediction, see [38].
If branches are likely to be predicted with nearly 100% accuracy in our assembled code, why are branches a problem here? The misprediction overhead, for one, should be negligible for our program. To answer this question more fully, we have to go back to the C++ code of the leibniz() partial summation function on page 1↑. The loop body of that function splits into three cases: i = 0, i odd, and i even. The compiler in fact has moved the i = 0 case out of the loop. Only two cases occur within the loop body in the assembled code, so that the overhead of checking whether i is zero does not occur in every iteration. The real problem with the branching in the loop body is that it prevents the compiler from finding parallelism in the loop body. This is the reason the code is stuck with XMM registers and single double instructions. In general, it is a good idea to move if statements out of loop bodies in C/C++.
In the following C++ definition, we have manually removed the even/odd branch from the loop body to illustrate how the compiler exploits parallelism in the loop body.
double leibnizX(long int n){
	long int i;
	double ans0=4.0;
	double ans1=0;
	for(i=2; i < n; i=i+2)
		ans0 += 4.0/(2.0*i+1);
	for(i=1; i < n; i=i+2)
		ans1 += 4.0/(2.0*i+1);
	return ans0-ans1;
}
The assembly code of this function is more than 300 lines and almost impossible to understand. The compiler feasts on this function and optimizes it very well. We show only three instructions in the body of the first loop to verify that the compiler has done a good job.
vfmadd132pd .L_2il0floatpacket.8(%rip), %ymm0, %ymm6
vdivpd    %ymm6, %ymm2, %ymm3
vaddpd    %ymm7, %ymm3, %ymm9
In vfmadd132pd, vdivpd, and vaddpd, the v stands for vector as before, and pd stands for “packed double.” This means that the instructions are operating on all four doubles that fit into a 256-bit YMM register. Not only are YMM registers being used now, but they are being used in full. That is the main reason that this program takes only seven cycles per term and is twice as fast as the earlier program (see table 3.2↑).
We have explained the vdivpd and vaddpd instructions above. In vfmadd132pd, the fmadd stands for fused multiply add (FMA). The code “132” indicates that the first and third operands are multiplied and added to the second source operand. The result is stored in the first operand (the destination). In our code snippet, the operands must be read in reverse (because of the difference between Intel and GNU/Linux conventions in assembly code). The first (or destination) operand is %ymm6, the second operand is %ymm0, and the third operand is a float packet (stored constant in text area). The effect of this instruction is i → 2i + 1. [56]  [56] To decipher this assembly snippet, one needs its complete context, which is omitted from the text. Deciphering the meaning of the float packet is crucial. To decipher the float packet, one may search for it in the assembly code and find its definition to be “4000000000000000” in hexadecimal. A web service such as http://babbage.cs.qc.cuny.edu/IEEE-754.old/64bit.html may be used to convert from the binary IEEE 754 format to decimal and discover that the float packet is 2.0. Similarly, one may uncover that %ymm0 is storing 1.0, at which point it is almost obvious that %ymm2 must be storing 4.0. The FMA instruction is not found in SSE or AVX (see table 3.1↑).
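As an aside, the scalar analogue of these fused instructions is available in C/C++ as std::fma(a, b, c), which evaluates a*b + c with a single rounding. This small illustration is ours, not from the book:

```cpp
#include <cmath>
#include <cassert>

// std::fma(a, b, c) computes a*b + c with one rounding step, which is
// what the hardware fmadd instructions do in each vector lane. For the
// vfmadd132 operand encoding, dest = dest*src3 + src2; per lane that is
// this same fused a*b + c operation.
double fma_demo(){
	double a = 2.0, b = 3.0, c = 1.0;
	return std::fma(a, b, c);  // 2.0*3.0 + 1.0 = 7.0
}
```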
The compiler has “unrolled” the loop partially. Because the assembly code of the unrolled loop is rather opaque, we give a listing of a C++ definition that unrolls the loop manually.
double leibnizXX(long int n){
	long int i;
	double ans[10]={0};
	for(i=0; i < n; i+=10){
		ans[0] += 4.0/(2.0*i+1);
		ans[1] += 4.0/(2.0*i+3);
		ans[2] += 4.0/(2.0*i+5);
		ans[3] += 4.0/(2.0*i+7);
		ans[4] += 4.0/(2.0*i+9);
		ans[5] += 4.0/(2.0*i+11);
		ans[6] += 4.0/(2.0*i+13);
		ans[7] += 4.0/(2.0*i+15);
		ans[8] += 4.0/(2.0*i+17);
		ans[9] += 4.0/(2.0*i+19);
	}
	return ans[0]+ans[2]+ans[4]+ans[6]+ans[8]
		-ans[1]-ans[3]-ans[5]-ans[7]-ans[9];
}
In this program, the terms of the Leibniz series are grouped into 10 sets, and each iteration of the loop updates the sum over each set. The statements inside the loop body are independent of each other and may be executed in parallel using the packed double instructions. Loop unrolling is one of the most popular and effective loop transformations.
In the assembly of the loop bodies of leibnizX(), in fact four sets of packed double instructions operate on YMM registers. Because each YMM register is wide enough for four doubles, each loop has been unrolled by a factor of 16. It is as if the i=i+2 increment of the loop counter in each for-loop has been replaced by i=i+32 and the loop body expanded to treat 16 terms in parallel. [57]  [57] It may occur to the reader that incrementing the loop counter i by 32 instead of 2 may cause a problem if the number of loop iterations is not a multiple of 16. Indeed, it does, and the compiler has to generate code at either the beginning or end to handle the case where the number of iterations is not a multiple of 16.
As shown in table 3.2↑, the program with unrolled loops (with the unrolling done either manually or by the optimizing compiler) takes seven cycles per term, which is half that of the leibniz() function, which branched inside the loop body. It must be said that seven cycles per term is probably far from the optimum. Optimizing the program for the instruction pipeline, which is the topic of section 3.3↓, may even cut that number in half.
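As a quick consistency check, the two variants can be compared against each other and against π. The functions below are reproduced from the listings above; the assertions are ours. Note that leibnizXX() effectively assumes n is a multiple of 10 (otherwise it sums a few extra terms).

```cpp
#include <cassert>
#include <cmath>

double leibnizX(long int n){
	long int i;
	double ans0=4.0;
	double ans1=0;
	for(i=2; i < n; i=i+2)
		ans0 += 4.0/(2.0*i+1);
	for(i=1; i < n; i=i+2)
		ans1 += 4.0/(2.0*i+1);
	return ans0-ans1;
}

double leibnizXX(long int n){
	long int i;
	double ans[10]={0};
	for(i=0; i < n; i+=10){
		ans[0] += 4.0/(2.0*i+1);
		ans[1] += 4.0/(2.0*i+3);
		ans[2] += 4.0/(2.0*i+5);
		ans[3] += 4.0/(2.0*i+7);
		ans[4] += 4.0/(2.0*i+9);
		ans[5] += 4.0/(2.0*i+11);
		ans[6] += 4.0/(2.0*i+13);
		ans[7] += 4.0/(2.0*i+15);
		ans[8] += 4.0/(2.0*i+17);
		ans[9] += 4.0/(2.0*i+19);
	}
	return ans[0]+ans[2]+ans[4]+ans[6]+ans[8]
		-ans[1]-ans[3]-ans[5]-ans[7]-ans[9];
}
```

For n a multiple of 10, both compute the same partial sum; they differ only by rounding errors, because the terms are grouped and summed in different orders.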

3.2.3 Loop fusion

Because the speed of modern x86 computers relies on instruction-level parallelism and vector registers, it is often a good idea to have a lot of parallelism in the innermost loops. If sets of instructions in the body of the innermost loop are independent of each other, the processor is likely to execute them in parallel.
Loop fusion addresses the situation where we have two distinct loops. It is assumed that the iterations of each loop have dependencies that make loop unrolling ineffective. In such a scenario, merging the bodies of the two loops may be the best way to produce a loop body that is amenable to instruction-level parallelism. The transformation where distinct loop bodies are merged is called loop fusion.
For a simple example, we consider the power series for sine and cosine. [58]  [58] These series were discovered by Madhava around 1400 AD according to [30].
sin x = x⁄1! − x³⁄3! + x⁵⁄5! − ⋯
cos x = 1 − x²⁄2! + x⁴⁄4! − ⋯
The following program computes the sine and cosine of x and returns them in the reference variables c and s. The number of terms of the series to be used is input to the function as n.
void sincos(double x, int n, double& c, double& s){
	c = s = 0;
	double ci = 1;
	for(int i=0; i<n; i++){
		c += ci;
		ci *= -x*x/(2.0*i+2)/(2.0*i+1);
	}
	double si = x;
	for(int i=0; i<n; i++){
		s += si;
		si *= -x*x/(2.0*i+3)/(2.0*i+2);
	}
}
The cosine loop accumulates the answer in c and stores the current term in ci. Each iteration updates the current term ci. When we look at the terms of the series for cosx, we see a lot of parallelism because the terms can be grouped and summed in many different ways. However, when the loop presented to the compiler generates each term by updating the previous term, it is quite difficult for a compilation algorithm to automatically detect such parallelism. Similar comments apply to the sine loop.
Thus, simply unrolling the loops will not introduce much parallelism into the instruction stream and will not enable the compiler to generate packed double instructions.
A better idea may be to fuse the two loops. To a human observer, it is obvious that the two loop bodies can be merged, assuming ci and si are appropriately initialized. However, even if the loops are fused, it will be hard for the compiler to generate packed double instructions. It is true that c and s, as well as ci and si, can be packed into the two halves of an XMM register. However, when the terms are updated, we will need an XMM register that saves −x² in both its halves and, perhaps more problematically, another XMM register that stores (2i + 2)(2i + 1) and (2i + 2)(2i + 3) in its two halves. A human programmer can easily see how to get around the problem. One XMM register can hold 2i + 2 in both its halves, and a second XMM register can hold 2i + 1 and 2i + 3 in its two halves. The two XMM registers can be multiplied to get (2i + 2)(2i + 1) and (2i + 2)(2i + 3) and updated by adding (2, 2) saved in either an XMM register or a cached location.
All this proves to be too much for the icpc compiler, however. Here are some of the instructions in the body of the cosine inner loop on an SSE2-capable machine (see table 3.1↑):
        addsd     %xmm6, %xmm5                       
        subsd     %xmm1, %xmm8                       
        addsd     %xmm6, %xmm7                       
        addsd     %xmm6, %xmm5                       
        addsd     %xmm6, %xmm7                       
        divsd     %xmm5, %xmm8                       
        divsd     %xmm7, %xmm8                       
        mulsd     %xmm8, %xmm2                        
There is no need to decipher the assembly code. All that matters is that the cosine and sine loops are not fused, and both loops use single double instructions in the loop body (notice the sd in instruction mnemonics). On an AVX2 machine, the compiler does generate fused-multiply-add instructions, but the instructions are still single double, and loops are not fused.
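For reference, the fusion that the compiler fails to perform can be written by hand. The sketch below is ours; the book does not list a fused version.

```cpp
#include <cassert>
#include <cmath>

// Loop-fusion sketch: the cosine and sine loops of sincos() merged into
// a single loop body. The two running sums and the two running terms
// are updated independently of each other inside the loop.
void sincos_fused(double x, int n, double& c, double& s){
	c = 0;
	s = 0;
	double ci = 1;
	double si = x;
	for(int i=0; i < n; i++){
		c += ci;
		s += si;
		ci *= -x*x/(2.0*i+2)/(2.0*i+1);
		si *= -x*x/(2.0*i+3)/(2.0*i+2);
	}
}
```

Even in this form, as noted above, automatic vectorization is difficult because each term is obtained by updating the previous term.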

3.2.4 Unroll and jam

In this section, we will study a compiler optimization that applies to nested loops. So far we have considered loop unrolling and loop fusion. In loop nests, it may be desirable to unroll the outer loop and fuse several copies of the inner loop that result. That is the unroll and jam transformation.
The following C++ program computes a sine table:
//sine table for x = (pi/(2*n))*i
//with i=0,...,n.
//therefore stab must be of size n+1.
void sinetable(int n, double *restrict stab){
	double dx = 3.14159265358979323846/(2*n);
	for(int i=0; i <= n; i++){
		double x =  i*dx;
		stab[i] = 0;
		double si = x;
		for(int j=0; j < 20; j++){
			stab[i] += si;
			si *= -x*x/((2*j+3)*(2*j+2));
		}
	}
}
This program computes sin x for x = 0, π ⁄ (2n), …, (n − 1)π ⁄ (2n), π ⁄ 2 and leaves the n + 1 computed values in the array stab. If we choose n = 360, the resulting table corresponds to the chord table presented by Ptolemy in his treatise on mathematical astronomy, which became known as the Almagest. [59]  [59] Chord tables are equivalent to sine tables, and one of the earliest chord tables was computed by Hipparchus around 150 BC. The first book of the Almagest, a textbook of mathematical astronomy written by Claudius Ptolemy of Alexandria around 150 AD, has a table of chords in sexagesimal. See [28]. Ptolemy did not use power series, however. One of his key tools was a generalization of the Pythagorean theorem to cyclic quadrilaterals.
This program gives the type of stab as double *restrict. The restrict qualifier tells the compiler that any location addressed using the pointer will not be addressed using some other pointer. Its use is not significant here because only one pointer is in sight. However, in general, judicious use of the restrict qualifier can help the compiler generate good code. We will discuss the restrict qualifier in greater detail later.
In this program, the outer loop variable i generates different values for x in the range [0, π ⁄ 2]. The inner loop variable j generates terms of the power series of sinx. We have fixed the number of terms at 20. So the numerical bounds of the inner loop variable j are known to the compiler.
If the compiler chooses to, it can unroll the inner loop completely. However, not much can be gained by unrolling the inner loop. The new value of si, which holds a term of the power series, is obtained using its previous value. If the loop is unrolled, this chain, in which each iteration of the inner loop depends on the previous iteration, remains intact, making it quite hard for the compiler to generate packed double instructions.
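To make the transformation concrete, here is a manual unroll and jam of sinetable(), with the outer loop unrolled by a factor of 2 and the two resulting copies of the inner loop fused. This sketch is ours, not the book's code; the restrict qualifier is omitted.

```cpp
#include <cassert>
#include <cmath>

// Unroll-and-jam sketch: the outer i-loop of sinetable() is unrolled by
// 2 and the two copies of the inner j-loop are fused. The two power
// series in the fused loop body are updated independently.
void sinetable2(int n, double *stab){
	double dx = 3.14159265358979323846/(2*n);
	int i;
	for(i=0; i+1 <= n; i+=2){
		double x0 = i*dx;
		double x1 = (i+1)*dx;
		double s0 = 0, s1 = 0;
		double t0 = x0, t1 = x1;
		for(int j=0; j < 20; j++){
			s0 += t0;
			s1 += t1;
			t0 *= -x0*x0/((2*j+3)*(2*j+2));
			t1 *= -x1*x1/((2*j+3)*(2*j+2));
		}
		stab[i] = s0;
		stab[i+1] = s1;
	}
	for(; i <= n; i++){  // remainder iteration (runs when n is even)
		double x = i*dx;
		double s = 0, t = x;
		for(int j=0; j < 20; j++){
			s += t;
			t *= -x*x/((2*j+3)*(2*j+2));
		}
		stab[i] = s;
	}
}
```

The two series in the fused loop body are independent of each other, which is exactly the parallelism the transformation is meant to expose; this is the transformation the compiler would need to discover on its own.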
The assembly generated by the compiler does not unroll the inner loop nor does it unroll the outer loop on an SSE2-capable machine. Some of the instructions in the body of the inner loop are shown below.
        pxor      %xmm5, %xmm5                       
        subsd     %xmm1, %xmm5                       
        addsd     %xmm2, %xmm3                       
        cvtsi2sd  %r8d, %xmm4                        
        divsd     %xmm4, %xmm5                       
        mulsd     %xmm5, %xmm2                            
On an AVX2-capable machine, the compiler does unroll the inner loop, but the instructions are all single double, as they are here. So there is no real gain from the compiler’s code transformation.
Our experience with unroll and jam, so far, is similar to that with loop fusion. We wrote a program expecting the compiler to make a certain loop transformation, and it did not. This time we will persist and alter the program to get the compiler to generate better code.
void sinetable(int n, double *restrict xlist, 
	       double *restrict stab){
	for(int i=0; i <= n; i++){
		stab[i] = 0;
		double si = xlist[i];
		for(int j=0; j < 20; j++){
		  stab[i] += si;
		  si *= -xlist[i]*xlist[i]/((2*j+3)*(2*j+2));
		}
	}
}
This program takes in the list of values at which the sine function must be computed in the array xlist[0...n], which is of length n + 1. The use of the restrict qualifier is crucial here. It tells the compiler that the arrays xlist and stab do not overlap in memory. Without that qualifier, the compiler has to allow for the case in which xlist and stab are indexed to address the same location in memory. Having to preserve the semantics of the program when the two arrays are aliased would preclude loop transformations. The assembly of the innermost loop follows.
..B3.4:                         
        pxor      %xmm3, %xmm3                       
        lea       2(%rdi,%rdi), %r8d                 
        pxor      %xmm4, %xmm4                       
        lea       3(%rdi,%rdi), %r9d                 
        subpd     %xmm0, %xmm4                       
        addpd     %xmm1, %xmm2                       
        imull     %r8d, %r9d                         
        mulpd     %xmm0, %xmm4                       
        cvtsi2sd  %r9d, %xmm3                        
        incl      %edi                               
        unpcklpd  %xmm3, %xmm3                       
        cmpl      $20, %edi                          
        divpd     %xmm3, %xmm4                       
        mulpd     %xmm4, %xmm1                       
        jb        ..B3.4        
We do see packed double instructions here, indicating that the unroll-and-jam optimization has been carried out. On AVX/AVX2-capable machines, however, packed double instructions are not generated.

3.2.5 Loop interchange

Loop interchange is the next compiler optimization for discussion, and matrix multiplication is the example we use to bring it out. In our discussion of matrix multiplication and loop interchanging, the following points will emerge:
Some of these points have come up already. Many of these points are much broader in scope than any particular compiler optimization.
Before we begin, we make some remarks about the peak capabilities of SSE2- and AVX2-capable machines. As shown in table 3.1↑, SSE2 provides XMM registers wide enough to hold two doubles. The mulpd (or vmulpd) instruction applied to XMM registers carries out two multiplications in a single instruction. Similarly, the addpd (or vaddpd) instruction carries out two additions in a single instruction. Typical SSE2 processors can simultaneously issue an addpd and a mulpd to separate execution units in the same cycle. Therefore, the peak capability of a single SSE2 processor core is 4 flops (floating point operations) per cycle.
The AVX2 processor has YMM registers, which are twice as wide as XMM. In addition, it has the fmadd*pd instruction, which does a fused multiply add of the type c = c + ab on the entire width of the YMM registers. A single fmadd*pd instruction is equal to 8 flops. The AVX2 processor can issue two of these instructions to separate execution units in the same cycle. Therefore, the peak capability of a single AVX2 processor is 16 flops per cycle.
Some Intel processors accelerate the clock in-core. For such machines, the actual theoretical limit can be slightly greater than the bounds derived above. Neither the 2.6 GHz SSE2 machine nor the 3.6 GHz AVX2 machine of table 9.1↓ features in-core acceleration of the clock.
              dim=1000    dim=2000    dim=1000    dim=2000
               (SSE2)      (SSE2)      (AVX2)      (AVX2)
multijk()       0.37        0.22        0.62        0.46
multijkx()      0.40        0.33        0.76        0.53
multIJK()       1.72        1.62        1.30        1.21
multIJKX()      1.72        1.62        1.31        1.21
MKL BLAS        3.81        3.84       14.0        14.6
Table 3.3 Floating point operations per cycle in the multiplication of square matrices. The SSE2 and AVX2 processors used here had clocks of 2.6 and 3.6 GHz, respectively (see table 9.1↓ for the full names of the machines used).
All the points we wish to make in this section emerge from table 3.3↑. One of the points is so glaring that we will begin by commenting on it. The point is that matrix multiplication is much faster using Intel’s MKL library. On SSE2, MKL is more than twice as fast as functions such as multijk() and multIJK() that we write in C/C++. The programs that we write are the type one can write in C/C++ and are not optimized for the instruction pipeline. MKL’s optimization for the instruction pipeline, which must be coded in assembly, yields a far greater speedup of more than a factor of 10 on AVX2 machines. In both cases, MKL comes close to the peak floating point throughput, while the C/C++ programs fall short. In fact, the C/C++ programs do better in SSE2 than in AVX2, although the peak capability of the more recent AVX2 architecture is four times as high.
We step through four simple implementations of matrix multiplications. These are the sort of C/C++ programs one may be expected to write. For each implementation, we explain how the compiler views the program and what it does with it. The sort of optimizations that yield the amazing speed of MKL BLAS are discussed in section 3.3↓.
The (i, j)th entry of the square matrix a of dimension dim is a[i+j*dim] (we assume that the leading dimension is equal to dim). Here a is of type double * and points to the first location of the contiguous segment of memory where the dim*dim entries of the matrix are stored. As always, we assume that matrices are stored columnwise.
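The columnwise (column-major) layout can be pinned down with a tiny indexing helper (our illustration):

```cpp
#include <cassert>

// Column-major storage: entry (i,j) of a dim x dim matrix is a[i+j*dim].
// Consecutive entries of the same column are adjacent in memory.
double entry(const double *a, int i, int j, int dim){
	return a[i + j*dim];
}
```

For a 2 × 2 matrix stored as {1, 2, 3, 4}, the columns are (1, 2) and (3, 4).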
The function multijk() defined below multiplies matrices a and b and adds their product to the matrix c.
void multijk(double *a, double *b, double *c, int dim){
	for(int i=0; i < dim; i++)
		for(int j=0; j < dim; j++)
			for(int k=0; k < dim; k++)
				c[i+j*dim] += a[i+k*dim]*b[k+j*dim];
}
None of the pointers is qualified as restrict. Thus, the arrays a, b, and c could be indexed to point to the same location, and the compiler has to preserve program semantics even in the event of such aliasing. There is no room for the compiler to apply loop transformations.
On both SSE2 and AVX2, the compiler-generated assembly code (not shown) closely follows the C code of multijk(). There is no dramatic code transformation. From table 3.3↑, we see that AVX2 is nearly twice as fast for multijk(). That speedup is easily explained. On both SSE2 and AVX2, the compiler generates only single double instructions, and nothing more can be expected. On SSE2, the single double instructions are vaddsd and vmulsd. On AVX2, they are fmadd*sd. The factor-of-two speedup results because the fused multiply-add combines addition and multiplication into a single instruction.
Next we make a slight change to the matrix multiplication program.
void multijkx(double *a,double *b,double *c,int dim){
	for(int i=0; i < dim; i++)
		for(int j=0; j < dim; j++){
			double x = c[i+j*dim];
			for(int k=0; k < dim; k++)
				x += a[i+k*dim]*b[k+j*dim];
			c[i+j*dim] = x;
		}
}
This program does not use restrict pointers either. However, it alters the innermost loop and makes it accumulate its computation in the scalar variable x. Even if the arrays are aliased, the addition operations in the innermost loop can be done or grouped in any order, addition being commutative and associative (modulo rounding errors). [60]  [60] On very rare occasions, code transformations effected by compilers under the assumption that machine arithmetic is commutative and associative may become problematic. The icpc compiler exploits that room to unroll the innermost loop and generate packed double instructions on both SSE2 and AVX2. From table 3.3↑, we see that this optimization results in better performance for both matrices of dimension 1000 and 2000, although the improvement is minor. AVX2 is again about twice as fast and for the same reason as before.
The programs below are exactly the same except all the pointers are qualified with restrict.
void multIJK(double *restrict a, double *restrict b, 
	     double *restrict c, int dim){
	for(int i=0; i < dim; i++)
		for(int j=0; j < dim; j++)
			for(int k=0; k < dim; k++)
				c[i+j*dim] += a[i+k*dim]*b[k+j*dim];
}
​
void multIJKX(double *restrict a, double *restrict b, 
	      double *restrict c, int dim){
	for(int i=0; i < dim; i++)
		for(int j=0; j < dim; j++){
			double x = c[i+j*dim];
			for(int k=0; k < dim; k++)
				x += a[i+k*dim]*b[k+j*dim];
			c[i+j*dim] = x;
    }
}
These programs are nearly four times faster on SSE2 and twice as fast on AVX2 (see table 3.3↑). We may wonder why they are four times faster on SSE2 and only twice as fast on AVX2. Inspection of the assembly code will bring out the explanations. In fact, overall, the program is faster on SSE2 than AVX2, although the peak capability of the latter is four times as high (the peak capability factor is more than 4 if we allow for the clock speeds reported in table 3.3↑).
The crucial point here is that because a[], b[], and c[] are all restrict qualified, the compiler knows that the arrays are not supposed to alias. The compiler may assume that the arrays are nonoverlapping and distinct areas of memory.
On SSE2, the compiler takes full advantage of those assumptions and changes the order of nesting of the loops so that the i-loop becomes innermost and the j-loop becomes outermost. The innermost loop is unrolled, and packed double instructions are generated.
Why does the compiler make the i-loop the innermost (on SSE2) and why does that lead to a faster program? In our discussion of compiler optimization so far, we have focused on arithmetic operations. We have seen that parallelism and packed double instructions in the innermost loops are likely to lead to faster programs. In addition to instruction-level parallelism and the judicious ordering of arithmetic, an optimizing compiler will do well to consider the pattern of memory access.
For matrix multiplication, the pattern of memory access has a big impact on program performance. When the k-loop is innermost, each successive location of the a[] array accessed in the innermost loop is separated from the previous location by a stride of length dim. Similarly, if the j-loop is the innermost, the accesses of the b[] in the innermost loop have a stride equal to dim. Striding has the disadvantage that when a double word is brought into memory, we do not immediately use the other double words in the same cache line.
When the i-loop is innermost, successive iterations of the innermost loop touch the same location in the b[] array but move through the a[] and c[] arrays with a stride equal to 1. That leads to better utilization of the cache line. The cache line is typically 64 bytes (large enough for 8 doubles).
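A manually interchanged version makes the access pattern explicit. The sketch below is ours: the restrict qualifiers are omitted, and the loop-invariant entry of b[] is hoisted by hand.

```cpp
#include <cassert>

// Loop-interchange sketch: with the i-loop innermost, b[k+j*dim] is
// loop invariant, and the accesses to a[] and c[] are unit stride.
void multjki(double *a, double *b, double *c, int dim){
	for(int j=0; j < dim; j++)
		for(int k=0; k < dim; k++){
			double bkj = b[k+j*dim];
			for(int i=0; i < dim; i++)
				c[i+j*dim] += a[i+k*dim]*bkj;
		}
}
```

With restrict-qualified pointers, as in multIJK(), the compiler is free to perform this interchange itself; writing it by hand simply removes any doubt about aliasing.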
Unfortunately, on AVX2, the compiler does not interchange the loops. It simply unrolls the inner loop and generates packed double instructions. Cache line utilization is therefore poor. The failure to interchange the loops on AVX2 could be because the icpc compiler is not as mature for AVX2 as it is for the earlier SSE2 architecture. The size of the assembly code for SSE2 is nearly 1,000 lines. For AVX2, it is fewer than 200 lines, suggesting that the compiler transforms the code far less aggressively on the more modern architecture.
However, it may not be simply a case of compiler maturity. As computer architecture advances, for example, with vector registers getting wider, compilation becomes harder, and the compiled assembly code is more likely to fall short of being optimal. This point may be illustrated by comparing against the AVX case, which was omitted from table 3.3↑. On a 2.2 GHz AVX machine, the compiler generates assembly code for multIJK() that is very similar to the assembly code for SSE2, except with YMM registers in place of XMM registers. Nevertheless, the speed of the compiled code is less than 15% of MKL, whereas on the SSE2 machine of table 3.3↑, the compiled code runs at more than 40% of MKL’s speed.

3.2.6 C++ overhead

multijk()              dim=1000 (SSE2)   dim=2000 (SSE2)   dim=1000 (AVX2)   dim=2000 (AVX2)
Using Matrix objects   0.10              0.07              0.18              0.15
Table 3.4 Floating point operations per cycle. Compare with table 3.3↑.
Table 3.4↑ shows performance data for two implementations of matrix multiplication in C++. The implementations of section 3.2.5↑ are also in C++, but the syntax and style properly belong to C. They are C++ functions mainly because C++ is an extension of C. The functions were coded using arrays and pointer arithmetic. In contrast, the C++ function timed in table 3.4↑ uses Matrix class objects (see chapter 1↑ for the definition of the Vector class; the Matrix class is similar). [61]  [61] The timing information was collected with function inlining enabled.
If matrices are multiplied using objects of the class Matrix, the C++ syntax mimics mathematical syntax quite closely as shown below.
void multijk(Matrix& A, Matrix& B, Matrix& C){
	int l = A.getm();
	int m = A.getn();
	assrt(B.getm()==m);
	int n = B.getn();
	assrt(C.getm()==l);
	assrt(C.getn()==n);
	for(int i=0; i < l; i++)
		for(int j=0; j < m; j++)
			for(int k=0; k < n; k++)
				C(i,k) += A(i,j)*B(j,k);
}
The price paid for staying close to mathematical syntax is very poor performance. Comparing tables 3.3↑ and 3.4↑, we find that on AVX2, the C++ code is slower than MKL BLAS by a whopping factor of nearly 100!
The matrix multiplication program here may look nice to a human reader, but to the compiler it looks far more complex than a function such as multIJK() defined earlier. A usage such as C(i,j) is in fact an overloaded function call. It is more difficult for the compiler to figure out the pattern in which the indices map to locations in the array. If we want the compiler to do a good job, we have to make the loops easy for it to read and optimize.
Of course, the fault here is not that of C++. C++ provides a gigantic vocabulary and numerous modes of expression. It is for the programmer to decide how to use it.
To end this discussion, we turn to the persistent belief in some quarters that Fortran programs are faster than C or C++ programs. The basis of this belief is that arrays do not alias in Fortran the way pointers can alias in C/C++. As we have seen in this chapter, the assurance of no aliasing allows the compiler to generate better code. Given that the restrict qualifier was introduced only around 2000 and was implemented by compilers only several years after that, one has to acknowledge that there is some basis to that belief. Modern C++ compilers such as icpc and g++ support restrict pointers. So the point about aliasing is no longer valid.
Once C and C++ are well understood, their advantages are considerable. The C language often encourages and seldom impedes thinking of how program variables and objects are laid out in memory. A variety of data structures ranging from linked lists to trees to graphs to hash tables can be implemented with a suppleness that is not approached by languages such as Fortran. While C and C++ are routinely used to implement device drivers, operating systems, compilers, and network protocols, the limitations of Fortran’s idiom would become immediately apparent if anyone attempted to use it for such complicated programming tasks. Even when the programming task is not so complicated, C/C++ have much to offer if the idiom native to these languages is understood.

3.2.7 A little compiler theory

The compiler’s view of a program is quite different from that of a human programmer. A human programmer has a problem to be solved and an idea to solve that problem that is expressed as a computer program. What the compiler sees is a sequence of statements that obey the syntactic rules of the programming language. The global view of what the program does is lost.
To generate good assembly, the compiler has to grasp which variables are being used heavily and other global aspects of the program. Uncovering global information from a program, which is presented to the compiler as a sequence of statements, is quite hard. In this section, we will consider some of the ideas on which optimizing compilers rely. [62]  [62] Our main reference is [32]. In it may be found bibliographic discussions of the work of Kuck, Bannerjee, Kennedy, and other researchers. Our earlier discussion of compiler optimizations drew heavily from this reference.
In a sequential program, the statements depend on one another. We assume that x, y, and z are program variables. The following is an example:
y=x;
...
z=x;
Here the second statement has an input dependence on the first because both of them read the same variable x. The input dependence is also called Read After Read (RAR). [63]  [63] The RAR, RAW, WAR, and WAW terminology for dependencies is from [38]. RAR is the mildest form of dependence and poses virtually no obstacle to the code transformations the compilation algorithm may want to attempt. If we have a sequence of statements and the only dependencies between them are RAR, the statements can be permuted in any manner without changing the semantics of the program.
The Read After Write (RAW) dependence is also called a true dependence.
x=y;
...
z=x;
Here the variable x is read by the second statement, which assigns it to z, after the first statement writes into x by assigning y to it. For a compiler, RAW is the most problematic dependence. When a statement consumes the output produced by a previous statement, the ordering of the statements cannot be changed. The notion of what it means for a program to be sequential is closely tied to the RAW dependence.
There are two more kinds of dependencies. In Write After Read (WAR), a location is read and then overwritten as in the example below.
y=x;
...
x=z;
Here the variable x is read by the first statement and then overwritten by the second statement. WAR is also called an antidependence. Write After Write (WAW) is just what the reader may expect it to be, as shown by the example below.
x=y;
...
x=z;
In this example, the two statements have a WAW dependence because both write into the same variable x. The WAW dependence is also called an output dependence. If we interchange two statements with a WAR or WAW dependence, the semantics of the program may be affected because of a statement in the passage in between that has a RAW dependence on the first statement. Once again, we see how important RAW dependencies are for program correctness. A sequential program is sequential because of RAW dependencies.
Programs spend little time executing straight line code. Much of the time is spent inside loops. Therefore, the compiler has to pay special attention to dependencies carried by loops.
  s = 0;
  for(int i=0; i < n; i++)
    s = s + a[i];
This loop computes the sum of an array. To find the loop-carried dependencies, it is helpful to think of the loop in its fully unrolled form.
  s = s + a[0];
  s = s + a[1];
  s = s + a[2];
  ...
  s = s + a[n-1];
In this sequence of statements, each statement reads and writes into the program variable s. Because the dependencies emerge after the loop is unrolled, we say that the statement within the for-loop has loop-carried RAR, RAW, WAR, and WAW dependencies on itself.
The following loop computes the cumulative sum of the entries of an array:
  for(int i=1; i < n; i++)
    a[i] = a[i]+a[i-1];
The only statement in this for-loop has loop-carried RAR and RAW dependencies on itself.
Such dependence analysis is used by compilers to figure out whether statements can be reordered and whether loop transformations of the type we studied in earlier sections may be applicable. For example, if there are no loop-carried dependencies at all, the iterations of the loop can be executed in any order. A for-loop that doubles every entry of an array has no loop-carried dependencies.
If there are no cyclic dependencies within the body of a loop, the loop can be split into independent loops. In the loop below, we assume that there is no aliasing between the a, b, and c arrays.
  for(int i=0; i < n-1; i++){
    a[i+1] = b[i] + c[i];
    b[i] = a[i];
  };
This loop carries a RAW dependence. However, there is no cycle of dependencies between the two statements in the body of the loop. Therefore, it is legal to rewrite the code as follows:
  for(int i=0; i < n-1; i++)
    a[i+1] = b[i] + c[i];
  for(int i=0; i < n-1; i++)
    b[i] = a[i];
The theory of dependence analysis is used by compilers to ascertain the validity of loop transformations.
Exercise:  We may write a function double leibniz(int n) that returns the partial sum of the first n terms of the Leibniz series. We may alter the definition of leibniz() and make n a const variable whose value is 20 instead of passing it as an argument. Examine the assembly code generated by icpc assuming such a modification and comment.
Exercise:  Suppose that n is a const whose value is 10⁹ and not 20 as in the previous exercise. Time leibnizX(). How many cycles per term does it take? Discuss the observed cycle count by relating it to the assembly code generated by the compiler.
Exercise:  Compile the programs for summing the Leibniz series using -fast, thus replacing the -prec-div option we normally use by -no-prec-div. Is there any improvement in speed?
Exercise:  Write a function in C that takes a double-precision array a[0..n-1] as input, determines the sums of a[4i+r] over 0 ≤ i < (n − r) ⁄ 4 for r = 0, 1, 2, 3, and returns the value of r for which the sum is maximum. Make sure that the loops are unrolled and that packed double instructions are being issued.
Exercise:  Write a function in C that takes a double-precision array a[0..n-1] as input and replaces a[i] by the sum a[0] + a[1] + ⋯ + a[i] for i = 0, …, n − 1. Can you get the compiler to unroll the loop and issue packed double instructions in this example?
Exercise:  Look up the icpc compiler documentation and figure out what -fast does. Suppose both -fast and -prec-div are given as compiler options. Which will take precedence? Suppose both -fast and -O2 are given as compiler options. Which will take precedence?
Exercise:  Change the program for computing the sine and cosine at the value x to a program that computes the cosine at two input values x and y. Present two separate loops to the compiler as in the program above. Does the compiler fuse the loops and generate packed double instructions?
Exercise:  Fuse the two loops in the program that computes sine and cosine of x manually. Try to write the body of the loop in such a way that the compiler generates packed double instructions.
Exercise:  Utilize the parallelism within the power series expansions of sin x and cos x to write loops that are partially unrolled and for which the compiler can easily generate packed double instructions. Compare the execution times of programs that compile to packed double instructions with those that compile to single double instructions. It is useful to try large values of n when timing the loops even though the series converge rapidly, making any value of n above 20 or so superfluous.
Exercise:  Change the first definition of sinetable() so that the loops are interchanged. Does the compiler now generate packed double instructions?
Exercise:  In the second definition of sinetable(), can the computation of  − x2 from x be moved out of the inner loop in the generated assembly code?
Exercise:  Write a program that replaces the columns of a square matrix with the cumulative column sum. In other words, the kth column must be replaced by the sum of columns 0 through k. Assuming the program uses a loop over rows and a loop over columns, what is the better way to order the loops? If the loops are not ordered the better way, does the compiler switch the order of loops?
Exercise:  Write a program that computes the cumulative row sums of a square matrix and answer the same questions as above.
Exercise:  Alter the implementation of the Vector and Matrix classes so that member functions that overload the function call operator () cannot be inlined. How does the matrix multiplication function of section 3.2.6↑ perform after the alterations?
Exercise:  Explain why the for-loop
  for(int i=0; i < n-1; i++){
    a[i+1] = b[i] + c[i];
    b[i] = a[i];
  };
carries the RAW dependence.
Exercise:  Give examples of for-loops whose bodies have just a single statement and that exhibit loop-carried dependence of exactly one of the types RAR, RAW, WAR, and WAW.
Exercise:  Explain why the following loop
  for(int i=1; i < n; i++){
    a[i] = b[i]+c[i];
    d[i] = a[i-1];
  }
carries the RAW dependence. Rewrite the loop so that there is no loop-carried dependence.
Exercise:  Assume that the array a stores a square matrix of size dim. In the loop nest
  for(int i=1; i < n; i++)
    for(int j=1; j < n; j++)
      a[i+j*dim] = a[i-1+j*dim]+a[i+(j-1)*dim];
explain why both the inner and outer loops carry dependencies. Visualize the computation graphically and show how to transform the loop nest so that the inner loop carries no dependence.
Exercise:  In the multIJK() program of the previous section, a single statement comprises the body of a loop nest of depth 3. List and explain the dependencies carried by the i, j, and k loops.
Exercise:  Write a C++ program for Aitken extrapolation with transparent loops that enable the compiler to generate good assembly code. The loop bodies should not have any if statements. Compare the generated assembly code with that of the C++ programs of chapter 1. By what factor is the program with transparent loops faster?

3.3 Optimizing for the instruction pipeline

Earlier, we found that MKL’s fast Fourier transform and matrix multiplication can be 10 times faster than ordinary C/C++ programs. If the C++ program is written without care, the speedup can even be a factor of 100. What does Intel’s MKL do to be so much faster than ordinary C/C++ programs?
The biggest part of the answer to that question is optimizing for the instruction pipeline, which is the topic of the present section. The compiler converts a C/C++ program into a stream of machine instructions. When the program runs, this instruction stream is consumed by the processor. From a programmer’s point of view, the instructions are executed one by one, but that is not the way processors consume instructions. The x86 processors consumed instructions in that manner before 1990. If processors still worked that way, they would be slower by at least a factor of 10.
All modern processors consume instructions using a pipeline, although semantically it is as if the instructions were executed one after another. The pipeline has different stages for fetching, decoding, execution, and accessing memory. Each of these stages has considerable parallelism built into it. Modern x86 processors fetch and decode multiple instructions in a single cycle. For example, an AVX2 processor of the Haswell family (see table 3.1↑) has seven execution ports. Therefore, seven instructions can be dispatched to execution units simultaneously on that processor.
The point of optimizing for the instruction pipeline is to keep the execution units working in parallel as far as possible. [64]  [64] The chief source for technical details of the x86 pipeline is Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2013. Like register sets, the instruction pipeline is invisible within the confines of C/C++. Unlike the registers, the instruction pipeline is invisible even within assembly code. It appears impossible to get compilers to generate instruction streams that are optimized for the instruction pipeline. The difficulties here are of a fundamental kind, and it is unlikely that compilers can ever do this type of optimization satisfactorily. Optimizing for the instruction pipeline involves accounting for instruction size and alignment, usage of register ports, and many other factors that are completely invisible in a C/C++ program. The disparity between architectural design and the abstract view of the computer in C/C++ appears too great to be bridged automatically and perfectly by a compiler. The disparity is growing rapidly, increasing the importance of optimizing for the instruction pipeline. That is true even though one of the aims of architectural design is to be an easy target for compilation.
In sections 3.3.1↓ and 3.3.2↓, we give a general overview of the processor pipeline and related matters. This overview is specialized to x86. A distinction must be made between instruction set architectures such as SSE2 and AVX2 (see table 3.1↑) and microarchitectures. The microarchitecture specifies the type of pipeline implemented in hardware to consume instructions. Even when the instruction set architecture is the same, microarchitectures can differ. The differences in microarchitecture can be of importance in optimizing for the instruction pipeline. Therefore, we pay attention to the microarchitecture in addition to the instruction set.
Sections 3.3.3↓ and 3.3.4↓ get into the gritty details, which can be hair-raising but also exciting. The thrill of making nontrivial programs faster by factors of 2 or 10 by understanding how machine instructions map to the microarchitecture is undeniable. However, the thrill is not easily attained. Part of the difficulty is that some details of the microarchitecture need to be discovered through reverse-engineering.
The chief example in sections 3.3.3↓ and 3.3.4↓ is matrix multiplication. We have already seen that MKL can be more than twice (on SSE2) or more than 10 times (on AVX2) faster than ordinary C/C++ programs (see table 3.3↑). Our aim is to understand how that speedup comes about.
The number of arithmetic operations in C = C + AB, if all matrices are square of dimension n, is 2n3 with equally many additions and multiplications. Because each double is 8 bytes, the total amount of memory used by the three matrices is 24n2 bytes. As n increases, the number of arithmetic operations increases superlinearly, as the 1.5th power, against the size of data in memory. This is better than with the FFT, where the superlinear factor is only logarithmic.
There are two stages in writing a good matrix multiplication. The first stage is to produce a microkernel that multiplies small matrices while using instruction pipeline resources optimally. The second stage is to use the microkernel to code multiplication of larger matrices while hiding the cost of memory access. The cost of memory accesses can be almost completely hidden because the number of arithmetic operations grows superlinearly in data size. Thus, coding a good microkernel is decisive. The second stage, like most memory optimizations, can be implemented in C/C++ and is deferred to the next chapter. In this chapter, we focus on the microkernel.
The exposition assumes the SSE2 instruction set (see table 3.1↑). Although the SSE2 instruction set has been superseded by AVX and AVX2, all the main principles of optimizing for the instruction pipeline are brought out in our exposition. An advantage of using SSE2 is that our programs will run on almost any computer. In addition, the exposition is simplified because SSE2 instructions are less complex. Because our aim is to bring out the principles of optimizing for the instruction pipeline, the programs we discuss come close to MKL speeds (on SSE2 machines) but stay away from additional details and clutter necessary to actually reach such speeds. Optimizing for the AVX2 pipeline is covered in the exercises.
Computer architecture, systems software, program compilation, and program optimization are all full of minute details. Among these, computer architecture changes at the most rapid pace. Yet the understanding gained using one particular microarchitecture applies across past and future generations. That is so because the fundamental principles of instruction pipeline design have not changed for decades. Thus, the concepts and techniques that we learn using an SSE2 pipeline in section 3.3.4↓ carry over to more modern AVX2 pipelines, as illustrated in our discussion and as evident from the exercises, as well as to AVX-512 pipelines of the future.

3.3.1 Instruction pipelines

General remarks
The automobile assembly line offers a useful point of comparison to clarify concepts. An automobile assembly line may take 48 hours to assemble a single car. However, thanks to pipelining and multiple assembly lines, the factory may produce a car every minute. For such a factory, the latency would be 48 hours and the bandwidth 1 car per minute.
The assembly of the car is broken down into a number of steps that are executed sequentially in the assembly line. Crucially, these steps or pipeline stages are independent of each other. Therefore, car B can be pushed onto the first stage as soon as car A completes the first stage and moves to the second. When car A moves to the third stage and car B to the second, car C is pushed to the first stage. Ideally, the various stages of the assembly line should take nearly the same time. The bandwidth of the assembly line is constrained by the slowest stage. If the number of stages in the assembly line is increased, the bandwidth increases (assuming that the stages take the same amount of time), even though the latency is unchanged.
The analogy to car manufacturing omits many complications that arise in processor pipelines. Instructions are not as independent as cars. If there is a RAW dependence between two instructions, the second instruction cannot begin to execute until the first completes. Even ignoring dependencies, it takes a lot of design to make the stages of the instruction pipeline relatively independent. For example, the instruction cache should be separate from data cache if the instruction fetch stage is to be kept relatively independent of the stage where operands are read from memory. Another source of complication comes from interrupts or exceptions raised by an external device or the operating system. If the interrupt is of high enough priority, the entire pipeline must be abandoned to service the interrupt and restored after the interrupt is serviced.
Although RAW dependencies cannot be eliminated, modern processors eliminate WAW and WAR dependencies on the fly. Those dependencies are eliminated using register renaming. [65]  [65] For a detailed account of Tomasulo’s algorithm for register renaming, see [38]. Suppose we have two instructions as follows:
movq %r8, %rax
...
movq %r9, %rax
Evidently, the second instruction has WAW dependence on the first because both instructions write into the %rax register. It would be illegal for the processor to execute the second instruction before the first because there may be some other instruction in the middle that reads from %rax and therefore has RAW dependence on the first instruction. But suppose the processor dynamically renames the second %rax to some other internal register. Suppose as well that the processor renames %rax to the same internal register in all later instructions that read what the second instruction writes into %rax. If the processor does such renaming of registers, it can go ahead and execute the second instruction before the first.
Using register renaming, the processor can eliminate WAR and WAW dependencies and greatly increase available parallelism in the instruction stream. Instructions can be scheduled out of order and even executed in parallel.
Because of the sophistication of the algorithms used by processors to execute instructions, it is a misconception to think that processor performance is somehow proportional to clock speed. The 3.2 GHz Pentium 4 from 2004 used between 1.19 and 5.85 cycles per instruction for 10 programs in the SPEC CPU benchmark. [66]  [66] See sections 2.10 and 1.8 of [38] (4th ed.). The 2.66 GHz AMD Opteron of that time used fewer cycles per instruction by a factor of 1.27. As a result, the 2.66 GHz Opteron performed slightly better than the 3.2 GHz Pentium. The use of deep pipelines by the 3.2 GHz Pentium to accommodate a higher clock rate resulted in more pipeline stalls and more cycles per instruction, thus outweighing the advantage of greater clock speed.
For the SPEC CPU benchmark programs, the Pentium 4 did not do better than 1.19 cycles per instruction, although its peak bandwidth is 1 ⁄ 3 of a cycle per instruction. Unsurprisingly, typical performance is well short of the peak bandwidth. Apart from making simplifying assumptions about the instruction stream, the definition of the peak bandwidth entirely ignores the cost of memory accesses. Yet processor bandwidth is a useful metric. From 1982 to 2001, processor performance, as measured by the SPECint benchmarks, increased by a factor close to 2,250, an annual rate of about 50%.
AVX2 and SSE2 pipelines
figure FIGS/chapter2/figNehalemPipeline.png
Figure 3.4 Instruction pipeline (two copies for two processor cores) on an SSE2-capable machine. The pipeline here corresponds to the Nehalem/Westmere microarchitecture.
figure FIGS/chapter2/haswell.png
Figure 3.5 Instruction pipeline on an AVX2-capable machine. The pipeline here corresponds to the Haswell microarchitecture.
Figures 3.4↑ and 3.5↑ show instruction pipelines for the SSE2-capable Nehalem/Westmere microarchitecture and the AVX2-capable Haswell microarchitecture. If we want to write programs that approach MKL speeds for matrix multiplication and other tasks, such microarchitectures must be understood in some detail.
The pipelines shown have four parts. The first part, called the front-end, fetches and decodes instructions. Certain instructions, mainly instructions that store in memory, may be broken up into micro-ops. The front-end can fetch and decode multiple instructions in a single cycle.
The second part, called the reorder engine, is responsible for renaming registers as well as scheduling instructions. Instructions may be scheduled for execution out of order. Predicting branches and recovering from mispredictions is also the responsibility of this part. The reorder engine includes reorder buffers. The reorder buffer holds partially complete instructions or “instructions in flight.” It corresponds roughly to renamed registers.
The third part is for execution. The AVX2 pipeline has seven execution units, while the older SSE2 pipeline has only five. Of the seven execution units in the AVX2 pipeline, two can execute fmadd*pd instructions (port 0 and port 1 in figure 3.5↑). If peak capability of 16 flops per cycle is to be approached during matrix multiplication, we have to make sure that two fused-multiply-add instructions that operate on YMM registers are dispatched to execution ports 0 and 1 almost every cycle. That will involve finding sufficient parallelism in the instruction stream as well as understanding how the front-end and the reorder engine work.
The final part in the instruction pipeline is for memory access. Each part is broken up into multiple stages, and there is considerable parallelism at every level of the pipeline. As evident from figures 3.4↑ and 3.5↑, the stages within a pipeline are not linearly laid out, contrary to the image conjured by the word “pipeline,” but have more complicated interconnections with each other.
The SSE2 pipeline shown in figure 3.4↑ is for the Nehalem/Westmere architecture. The AVX2 pipeline shown in figure 3.5↑ is for the Haswell microarchitecture. The front-end, the reorder engine, and the execution units can differ even when the instruction set is the same. However, such differences in microarchitecture are not always of the greatest significance for program optimization, although they could be on occasion.

3.3.2 Chipsets

So far we have mainly been looking at the processor core. We are about to go even more deeply into the processor. So let us pause for a moment and look at the rest of the computer.
The processors are central to the computer but are only one among the many components that make up a computer or compute node. Some of the other components are DRAM memory, graphics processor, hard disk, solid state storage, network adapter, keyboard, and monitor. To understand how the processor is connected to all the components, we have to look at chipsets. Chipsets are chips used to assemble computers from many components.
figure FIGS/chapter2/figIntel5500-gray.png
Figure 3.6 Block diagram of the Intel 5500 chipset used to build computers of the Xeon 5500 or 5600 series.
Figure 3.6↑ shows the block diagram of the Intel 5500 chipset and gives a good idea of the layout of a computer. We pick a particular chipset, but many others would do just as well at the level of generality of our discussion. The basic problem in assembling a computer is that the various components operate at very different speeds. The processors are very fast, DRAM memory is not so fast, peripherals such as keyboards are very slow, and the network is capricious. The task of reconciling components that operate at very different speeds falls partly on the chipset.
Two processor packages are shown in figure 3.6↑ as P1 and P2. Each processor package fits into a socket on the motherboard. Each processor package may have four or six cores in this instance. On more recent machines using other chipsets, each processor package may even have 16 cores.
From figure 3.6↑, we see that DDR3 memory is connected directly to the sockets housing the processor chips. The processor packages contain multiple levels of cache and memory controllers to handle transfer of data to and from the DRAM channels. It is obvious from visual inspection that some DRAM memory channels are closer to certain processor cores. In particular, DRAM memory that is connected to the other socket will be slower to access from a processor core than DRAM memory that is connected to the same socket. If the DRAM memory is on the other socket, it has to come through the memory controller on the other chip. The two processors are connected to each other (with QPI or Quick Path Interconnect), as shown in the figure.
This sort of organization of memory makes memory access nonuniform. Non-uniform memory access (NUMA) architecture was introduced by AMD into its x86 product line in 2003. Intel followed suit in 2008.
The graphics processor is connected to the I/O handler (IOH) via a PCI express link. Although certain graphics processors can deliver processing and memory bandwidths that rival the processor cores, the graphics processor is no more than just another peripheral device as far as the organization of the computer is concerned.
Like the graphics processors, the adapters for the Infiniband network are connected using PCI express links. The Infiniband network is used to build high-performance clusters from compute nodes. Ethernet cards and adapters used for Internet connectivity may also use PCI express links.
Hard disks, audio, keyboard, and other I/O devices are connected to the I/O controller hub (ICH). The way in which the chipset allows the processing cores to talk to I/O devices and yet work at something like their normal speed requires substantial involvement of the operating system. Device drivers and interrupt handlers are the operating system components that mediate between the processor and the I/O devices.
Direct memory access is one of the major functions of the chipset. Suppose a processor wants to transfer 10 GB from hard disk to memory. Because the hard disk is very slow, the transfer would slow down the processor greatly. So the processor delegates the transfer to the chipset and starts executing some other instruction stream, while the chipset is working on direct memory transfer. During direct memory transfer, data is transferred directly between the hard disk and DRAM memory without processor intervention.
Direct memory access is used for transfers between the processor’s memory and graphics coprocessor memory or Xeon Phi coprocessor memory. It is used by Infiniband adapters to directly transfer data between memory and the network. The use of direct memory access in such situations must respect the operating system’s role in memory management, as we will explain later. Direct memory access enables parallelism between network activity and processor activity (chapter 6) or between coprocessor activity and processor activity (chapters 7 and 8).

3.3.3 Peak floating point performance

The SSE2 instruction set architecture (see table 3.1↑) provides for 16 XMM registers on each processor. Each XMM register is 128 bits wide and capable of holding two doubles. A single instruction that uses an XMM register as source and another XMM register as destination can carry out two additions, if the instruction is addpd, or two subtractions, if it is subpd, or two multiplications, if it is mulpd. The addition and multiplication instructions are dispatched using separate ports, which means that an addition and a multiplication instruction can be issued as well as completed in the same cycle. Thus, the peak rate at which double-precision floating point operations (flops) are executed by a single SSE2 processor is 4 flops per cycle.
Table 3.3↑ showed us that the matrix multiplication routines of the MKL library achieve more than 3.8 flops per cycle. However, writing such a program is no simple matter. It requires intimate knowledge of the processor pipeline, which is shown in figure 3.4↑.
Our objective is to understand matrix multiplication routines of the sort implemented by MKL. We will not aim to match MKL’s performance. Such a thing would require us to write more assembly code than is pedagogically desirable or appropriate. Knowledge of the processor pipeline is one aspect of optimizing matrix multiplication. Memory hierarchy is equally important. Memory is the topic of the next chapter.
In this section, we write a few programs that do nothing meaningful but that get close to the peak performance of 4 flops per cycle, with the 4 flops comprised of two additions and two multiplications of double-precision floating point numbers. Although the programs are not required to be meaningful, getting close to peak performance is not easy. This exercise requires us to understand many aspects of how instructions are decoded and then dispatched to execution units using several ports. Instruction latencies and throughputs, register read stalls, and register renaming are other aspects of instruction-level parallelism we encounter during the exercise. Where appropriate, we shall look back to figure 3.4↑ to make the discussion concrete, even though the SSE2/Nehalem pipeline is too complicated for a schematic sketch of the type given in that figure to be either complete or totally accurate.
This section is a prelude to the discussion of matrix multiplication, which begins in the next section. We understand matrix multiplication on modern processors in two steps. In the next section, we take the first step by writing programs or microkernels [67] to multiply 4 × n matrices with n × 4 matrices for values of n such as n = 1, 4, 200. ([67] The microkernel nomenclature was introduced by R.A. van de Geijn. The concept may be found in [40]. I thank Robert van de Geijn for this information.) The microkernels with n = 50, 100, 200 approach peak performance if the matrices are assumed to be in cache. The final stage, which is to use a microkernel as the building block of a program for multiplying large matrices in DRAM memory, is one of the examples discussed in the next chapter.
Understanding matrix multiplication on modern processors will take us more deeply into computer hardware than is customary in textbooks on scientific computing or indeed even computer architecture. The programs are run on a 2.6 GHz SSE2 processor and a 3.6 GHz AVX2 processor (see table 9.1↓ for their full names; neither processor has in-core acceleration of the clock). Although the programs are not optimized for the AVX2 pipeline, they are still more than three times faster than compiled code.
Optimization for the more recent AVX2 instruction set is dealt with in the exercises. The AVX-512 instruction set, which is yet to be introduced in the main x86 line, is dealt with in chapter 7↓. An advantage of using SSE2 is that our programs will run on almost all x86 computers. Because the SSE2 instructions are simpler than AVX2 instructions, there is considerable simplification in the discussion. Most of the points that we make about latencies and throughputs of instructions, such as addpd and mulpd, remain valid on AVX and AVX2 machines, sometimes with minor modifications. It is a little surprising that the SSE2 programs we discuss have not become dated even with the advent of AVX and AVX2. Of course, these programs no longer approach peak performance. Yet the concepts they bring out related to instruction-level parallelism remain valid in the context of the specific programs found in our discussion, and the programs will continue to run.
XMM registers and packed double instructions
We have examined compiler-generated assembly code on a number of occasions earlier in this chapter. It appears impossible to write C/C++ programs in such a way that the compiled code achieves peak floating point performance. C and C++ give us a uniform view of memory. One can write cache-aware programs in C/C++, as we will in the next chapter. As we saw in earlier sections, we can write C/C++ programs with an eye on the kind of instruction stream the compiler will generate and the processor will execute. However, to achieve peak floating point performance, we must pay attention to instruction-level parallelism and the instruction pipeline as well as the cache hierarchy. The design of the instruction pipeline is quite intricate and constrains the instruction stream needed to approach peak performance in many ways. Within the confines of the abstract view of the computer provided by C/C++, it appears impossible to heed such constraints on the instruction stream.
We use macros and the inline assembly facility to embed instructions that manipulate XMM registers into C/C++ programs. The first of these macros is as follows:
/* R must be "%xmmi" 0<=i<=15 */
#define zeroxmm(R)				\
  asm volatile("xorps %" R ", %" R "\n\t":::R);
The 16 XMM registers are %xmm0 through %xmm15 in GNU assembly (GAS). To zero one of them, say %xmm7, we say zeroxmm("%xmm7") inside a C/C++ program. Macro expansion and inline assembly produce the machine instruction
xorps %xmm7, %xmm7
followed by a newline and a tab for better formatting. The mnemonic xorps may be read as XOR packed single. Its two operands must both be 128 bits wide. It treats each operand as four packed single-precision floating point numbers. Its effect is to exclusive or (xor) the two operands bit by bit. The xor of two bits is 1 when exactly one of the two bits is 1, the other being 0. Because we are xor-ing %xmm7 with itself, we get 0 for each of the 128 bits.
There are other ways to set a register to zero. For example, we can move the constant value 0 into each part of the 128-bit XMM register, or we can store 0 in memory and load it to the XMM register. The use of xorps is preferred because the instruction is three bytes, if the register is one of %xmm0 through %xmm7, or four bytes, if the register is one of %xmm8 through %xmm15. Other options may lead to longer instructions. For example, the xorpd instruction is a byte longer. In addition, the xorps instruction takes only a single cycle to execute.
Four more macros that expand to inline assembly statements follow.
//R1 and R2 must be "%xmmi" 0<=i<=15 
#define addxmm(R1, R2)					\
  asm volatile("addpd %" R1 ", %" R2 "\n\t":::R1, R2);
​
//R1 and R2 must be "%xmmi" 0<=i<=15 
#define mulxmm(R1, R2)					\
  asm volatile("mulpd %" R1 ", %" R2 "\n\t":::R1, R2);
​
//R = "%xmmi" 0<=i<=15
//a = double * (16 byte aligned)
#define loadxmm(a, R)					\
  asm volatile("movaps %0, %" R "\n\t"::"m"(*(a)):R);	
​
//R = "%xmmi" 0<=i<=15
//a = double * (16 byte aligned)
#define storexmm(R, a)					\
  asm volatile("movaps %" R ", %0 \n\t":"=m"(*(a))::R);	
The macros generate addpd, mulpd, and movaps instructions. PD stands for packed double. APS in movaps stands for aligned packed single. The movaps instruction is one byte shorter than the movapd instruction. The movaps instruction can be used to move an XMM register to another XMM register, store an XMM register to memory, or load an XMM register from memory. The memory address must be 16-byte aligned.
The macros help us write easily readable assembly code. For example, if we write
  __declspec(align(16)) double a[2]={1,2};
  __declspec(align(16)) double b[2]={-1,-2};
  loadxmm(a, "%xmm0");
  loadxmm(b, "%xmm1");
  addxmm("%xmm1", "%xmm0");
  storexmm("%xmm0", a);
  printf("%f %f \n", a[0], a[1]);
it is evident that the printf() statement will print two zeros. The four macros expand to the following assembly statements:
        movaps    (%rsp), %xmm0                                
        movaps    16(%rsp), %xmm1                              
        addpd     %xmm1, %xmm0                                 
        movaps    %xmm0, (%rsp)                                
This assembly code was extracted after compiling the C++ program with the -S option. It is evident that the arrays a and b begin at the locations (%rsp) and 16(%rsp). The value of a equals the content of the %rsp register and the value of b equals %rsp+16. Both pointers are 16-byte aligned, but that information is not found in the extract.
Another way to recover the assembly instructions corresponding to the four macros is to disassemble the object file. The command objdump -d file.o disassembles the object file file.o. The machine instructions are written in binary format in the object file. Disassembly converts it to more familiar assembly language mnemonics. An extract from the output of objdump -d peakflops.o, where peakflops.o is the object file compiled from the C++ program that adds two entries of b to two entries of a, follows:
 69:	0f 28 04 24          	movaps (%rsp),%xmm0
 6d:	0f 28 4c 24 10       	movaps 0x10(%rsp),%xmm1
 72:	66 0f 58 c1          	addpd  %xmm1,%xmm0
 76:	0f 29 04 24          	movaps %xmm0,(%rsp)
Here we see the address at which each instruction begins on the left (the addresses are given in hexadecimal), followed by the hexadecimal code for the instruction followed by the assembly statement.
The load and store macros assume the memory address to be 16-byte aligned. The last hexadecimal digit of the address must be zero. We can generate 16-byte-aligned arrays by prefixing __declspec(align(16)) if the array is statically defined as shown above. For memory that is allocated dynamically, versions of malloc() such as _mm_malloc() allow alignment to be specified.
The macros such as zeroxmm() are certainly convenient. However, their use exposes us to a dangerous error. If we have two lines such as
  addxmm("%xmm1", "%xmm0");
  storexmm("%xmm0", a);
the compiler is allowed to use %xmm0 for its own purposes in between the two lines. The inline assembly statements are guaranteed to appear in order, but the compiler may use %xmm0 for some temporary computation in between the two lines. This danger is especially great when inline assembly statements written in this style appear within the loop body.
We use macros given here to generate inline assembly statements for expository convenience. In a real program, it is better to write functions entirely in assembly or to use a single asm volatile statement for entire blocks of assembly instructions. Where macros are used for expository convenience, we assume that the assembly code is examined to ensure that dangerous side effects are not present.
Instruction latencies and throughputs
Instruction latency (according to Intel manuals) is the number of cycles spent by the instruction in an execution unit of the instruction pipeline. Instruction throughput is the maximum rate at which an instruction can be dispatched to the execution units.
The definition of the C++ function addreg() follows.
void addreg(double *a, double *b, long int nitns){
  long int i;
  loadxmm(a, "%xmm0");
  loadxmm(b, "%xmm1");
  for(i=0; i < nitns; i++)
    addxmm("%xmm0","%xmm1");
  storexmm("%xmm1", a);
}
This function replaces a[0] and a[1] by b[0]+nitns*a[0] and b[1]+nitns*a[1], respectively.
What the function does is irrelevant to the point we want to make. When such functions arise later, only the relevant part of the code is given. For addreg(), the relevant part of the code is as follows:
  for(i=0; i < nitns; i++)
    addxmm("%xmm0","%xmm1");
For simplicity, when code fragments such as this one arise, we may assume that all XMM registers are initialized to zero. The number of iterations is assumed to be large.
If this code fragment is timed, we find that it takes slightly more than 3 cycles per iteration. Why 3 cycles? The answer is that the latency of addpd is 3 cycles. Like every instruction, the addpd instruction probably begins its life at the instruction fetch stage (see figure 3.4↑), which is the first stage in the pipeline. The instruction fetch stage fetches instructions from memory. If the instruction is inside a loop, as in this instance, almost all the fetches are from L1 cache. [68] ([68] The Nehalem/Westmere pipeline has a loop stream detector that skips instruction fetch for loops with a small body.) Once the operands are available, the scheduler dispatches the instruction to the execution unit. The latency is the number of cycles spent by the instruction in the execution unit.
For i>0, every addpd instruction receives one of its operands from the previous addpd instruction. Thus, each addpd instruction can be scheduled only after the previous iteration is complete, and the observed speed of 3 cycles per iteration is the best possible.
Although each addpd instruction spends 3 cycles in the execution unit, it is possible to dispatch an addpd to the execution unit every cycle. This assumes of course that the instruction dispatched does not need to wait for some other instruction that is currently executing to complete. Thus, there should be a sufficient level of parallelism in the instruction stream for the maximum throughput of one addpd instruction per cycle to be realized. The following program takes 3 cycles per iteration.
  for(i=0; i < nitns; i++){
    addxmm("%xmm0","%xmm1");
    addxmm("%xmm2","%xmm3");
    addxmm("%xmm4","%xmm5");
  }
The three addpd instructions in the loop body are completely independent of each other. Each instruction in a given iteration of the loop body can be scheduled only after the same instruction has completed in the previous iteration. However, the maximum throughput is realized because three independent instruction streams can be executed in parallel.
On both SSE2/Nehalem (see figure 3.4↑) and AVX2/Haswell (see figure 3.5↑), the latencies of addpd and mulpd are 3 cycles and 5 cycles, respectively. The maximum throughput of addpd is 1 cycle per instruction, meaning that at most one addpd can be issued in a single cycle on both platforms. The throughput of mulpd is 1 cycle per instruction on SSE2/Nehalem but 0.5 cycles per instruction (meaning that 2 mulpds can be issued in a single cycle) on AVX2/Haswell.
Multiple dispatch ports and register read stalls
The mulpd instructions are dispatched for execution on port 0 and the addpd instructions are dispatched on port 1 (see figure 3.4↑, where the units are numbered). Because the dispatch ports are separate, it is possible to dispatch a mulpd and an addpd in the same cycle. In fact, even on the more recent AVX/AVX2 architectures, at most one mulpd can be simultaneously dispatched with an addpd. So much of the discussion remains valid.
Our first attempt to observe this type of parallelism is the following program:
  for(i=0; i < nitns; i++){
    addxmm("%xmm8", "%xmm0");
    addxmm("%xmm9", "%xmm1");
    addxmm("%xmm10", "%xmm2");
    mulxmm("%xmm11", "%xmm3");
    mulxmm("%xmm12", "%xmm4");
    mulxmm("%xmm13", "%xmm5");
    mulxmm("%xmm14", "%xmm6");
    mulxmm("%xmm15", "%xmm7");
  }
In each block of five cycles, we may expect the five mulpd instructions and three addpd instructions in the loop body to be scheduled. There is certainly enough parallelism in the loop body to permit such a schedule. Thus, we may expect 5 cycles per iteration.
The observed number is 8 cycles per iteration and is greater than our expectation. The initial observation is in fact slightly more than 9 cycles per iteration. By unrolling the loop, we get close to 8 cycles per iteration.
It is as if the addpd and mulpd instructions cannot be dispatched in the same cycle even though they have separate execution ports. Why is that happening? Because of register read stalls. [69] ([69] The five-part optimization manual made available by Agner Fog on his web page (see http://www.agner.org/optimize/) is a valuable source on x86 processors. Fog's discussion of register read stalls was essential input for writing this section.)
During every cycle, it is possible to read only three registers from the register file in the SSE2/Nehalem microarchitecture. Here the addpd would need to read two registers and the mulpd would need to read two other registers. A first answer, then, is that the two instructions cannot both be dispatched because four registers cannot be read in the same cycle.
That is not a complete answer, however. After an instruction completes, it spends some time in the reorder buffer (ROB) (see figure 3.4↑) before it is retired. If an operand needed to dispatch an instruction can be captured directly as the output of an instruction that is waiting in the ROB, there is no need to use a register port. In fact, on the earlier Core 2 (Merom) microarchitecture, the program takes only 5 cycles per iteration as expected, although Core 2 provides only two ports for reading from the register file. It appears that each addpd and mulpd instruction in the loop body captures the result of the previous iteration directly from the ROB on Core 2. Why SSE2/Nehalem does not capture operands from the ROB as effectively as Core 2 does is unknown.
Another confirmation that register read stalls are responsible for slowing the program down to 8 cycles per iteration, instead of the expected 5, may be found by running the same program on an AVX/AVX2 machine. On those more modern machines, there are more register read ports, and therefore register read stalls are not an issue. The same program realizes 5 cycles per iteration.
Our second attempt is the following program:
  for(i=0; i < nitns; i++){
    addxmm("%xmm8", "%xmm0");
    addxmm("%xmm8", "%xmm1");
    addxmm("%xmm8", "%xmm2");
    mulxmm("%xmm8", "%xmm3");
    mulxmm("%xmm8", "%xmm4");
    mulxmm("%xmm8", "%xmm5");
    mulxmm("%xmm8", "%xmm6");
    mulxmm("%xmm8", "%xmm7");
  }
In this program, each addpd and mulpd instruction uses %xmm8 as its source register. Only three registers need to be read from the register file to dispatch an addpd and mulpd in the same cycle. There is no register read stall, and after suitable unrolling, the number of cycles per iteration is only slightly greater than 5.
Peak performance without loads or stores
The loop body of the following program issues five addpd instructions and five mulpd instructions.
  for(long int i=0; i < nitns; i++){
    addxmm("%xmm0", "%xmm1");
    mulxmm("%xmm1", "%xmm2");
    addxmm("%xmm3", "%xmm4");
    mulxmm("%xmm4", "%xmm5");
    addxmm("%xmm6", "%xmm7");
    mulxmm("%xmm7", "%xmm8");
    addxmm("%xmm9", "%xmm10");
    mulxmm("%xmm10", "%xmm11");
    addxmm("%xmm12", "%xmm13");
    mulxmm("%xmm13", "%xmm14");
  }
The source operand of each mulpd is the destination operand of the addpd above it. The hope is that this source operand is read directly from ROB, thus relieving pressure on the register ports. If that hope bears out, we should observe 5 cycles per iteration.
If the loop is unrolled 10 times, we observe 5.13 cycles per iteration in line with expectation. Some operands are indeed being read from ROB instead of from the register file.
If the loop is not unrolled, it takes 7 cycles per iteration. Why does unrolling help? The assembly code for the loop includes an instruction to increment the loop counter i, which is stored in a register such as %rax. It includes another instruction to compare the loop counter with nitns. The variable nitns is stored in a register. These instructions create additional pressure on the register ports. An overhead of 2 cycles is not unreasonable. With unrolling, the overhead is amortized.
On a more modern AVX2/Haswell machine, the same program takes 5.8 cycles per iteration, even with unrolling. Evidently this program interacts with the AVX2/Haswell pipeline in a different manner. Although the AVX2/Haswell pipeline is newer, it does not seem to recapture operands from ROB as effectively.
A total of 20 flops are carried out in the loop body. Therefore, 5.13 cycles per iteration corresponds to 3.89 flops per cycle. One can approach the theoretical bound of 4 flops per cycle much more closely, with a shortfall of less than 0.01. The program that does so uses a more complicated staggered pattern of instructions, with more parallelism in the loop body and more reuse of recently modified registers. Because that program is long and makes no fundamentally new point, it is not given here.
Peak performance with loads
The load instructions are dispatched on port 2. The SSE2/Nehalem processor is capable of dispatching a mulpd on port 0, an addpd on port 1, and a load instruction on port 2 in the same cycle (the same is true on AVX2/Haswell, as shown by figure 3.5↑), assuming the number of reads from the register file is not more than three. The following program demonstrates this capability:
  for(long int i=0; i < nitns; i++){
    loadxmm(a, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm2");
    
    loadxmm(a+12, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm3");
​
    loadxmm(a+24, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm4");
​
    loadxmm(a+36, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm5");
​
    loadxmm(a+48, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm6");
  }
After unrolling by a factor of four, this program takes 5.01 cycles per iteration. On AVX2/Haswell, it reaches 5.0 cycles per iteration with or without unrolling.
Each load instruction uses %xmm0 as the destination. It may seem that each triplet of load, multiply, and add has to wait for the previous triplet to complete. But such WAW dependencies are eliminated using register renaming. The processor may, for example, dynamically rename the %xmm0 in loadxmm(a+12, "%xmm0") to something else, possibly an internally maintained reservation station. [70] ([70] See [38].) In that case, the processor will automatically rename the references to %xmm0 in the next two instructions to the same thing. Register renaming is a stage in the instruction pipeline (see figure 3.4↑). Each triplet of load, multiply, and add is completely independent of other triplets in the same iteration. Register renaming allows the processor to exploit this high degree of parallelism in the instruction stream.
It is interesting to think of how the instructions may get scheduled across cycles. Many different schedules are admissible. At the beginning, the scheduler cannot exploit the parallelism in the instruction stream completely. The locations that are loaded may not be in L1 cache at the beginning. There must be a kind of dynamics to the way the scheduling changes as the iteration count increases. In a case such as this, one may expect the instruction schedule to reach a single "steady state." However, one wonders whether there could be periodic, quasi-periodic, or chaotic oscillations in the instruction schedule as the iteration count increases.
Peak performance with loads and stores
We have seen that the instruction pipeline can schedule two or three instructions in the same cycle. For such a capability to be effective, every stage in the pipeline must be capable of pushing several instructions per cycle to the next stage. If instructions are fetched at a slow rate, for example, instruction fetch will be the bottleneck, and the capability of executing several instructions in the same cycle may not be fully utilized.
Instruction fetch and decode are early stages of the pipeline (see figure 3.4↑). The SSE2/Nehalem microarchitecture fetches instructions in 16-byte blocks. The blocks are aligned in memory. If an instruction crosses a 16-byte boundary, it will take two cycles to fetch the instruction. The maximum throughput of these early stages of the pipeline is six instructions per cycle.
Occasionally, the instruction fetch and decode may limit the throughput. Below is an example.
  for(long int i=0; i < nitns; i++){
    loadxmm(a, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm2");
    storexmm("%xmm2", a+100);
​
    loadxmm(a+12, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm3");
    storexmm("%xmm3", a+200);
​
    loadxmm(a+24, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm4");
    storexmm("%xmm4", a+300);
​
    loadxmm(a+36, "%xmm0");
    mulxmm("%xmm1", "%xmm0");
    addxmm("%xmm0", "%xmm5");
    storexmm("%xmm5", a+400);
​
    loadxmm(a+48, "%xmm0");
    mulxmm("%xmm1", "%xmm0"); 
    addxmm("%xmm0", "%xmm6");
    storexmm("%xmm6", a+500);
  }
Each block of load, multiply, add, and store is independent of all others. Each store instruction is sent to ports 3 and 4. If we reason as before, expecting register renaming to help exploit the high degree of parallelism, we may conclude that this program will take 5 cycles per iteration.
The observed throughput (after unrolling) was 6.6 cycles per iteration. Even in theory, this program cannot attain 5.0 cycles per iteration. For a single block of load, multiply, add, and store, objdump -d gives the following information:
 ef5:	0f 28 47 60          movaps 0x60(%rdi),%xmm0
 ef9:	66 0f 59 c1          mulpd  %xmm1,%xmm0
 efd:	66 0f 58 d8          addpd  %xmm0,%xmm3
 f01:	0f 29 9f 40 06 00 00 movaps %xmm3,0x640(%rdi)
The load, multiply, and add instructions are 4 bytes long. The store instruction is 7 bytes long. There is no way that four of these instructions can be fetched and decoded in a single cycle on SSE2/Nehalem. On AVX2/Haswell, the observed throughput was 5.75 cycles per iteration.

3.3.4 Microkernel for matrix multiplication

Let A be an l × m matrix and B an m × n matrix. The operation C := C + AB, where C is an l × n matrix, requires 2lmn flops, half of which are additions and half of which are multiplications. [71] ([71] Algorithms such as Strassen's matrix multiplication achieve a lower operation count. See [48].) A single addpd instruction performs two additions and a single mulpd performs two multiplications. If addpd and mulpd instructions are issued nearly every cycle, the matrix multiplication can be completed in slightly more than lmn ⁄ 2 cycles.
We will abuse terminology slightly and refer to the operation C := C + AB as matrix multiplication. In the special case where C is initialized to zero, this operation coincides with matrix multiplication.
In this section, we write programs that multiply matrices of dimensions 4 × n and n × 4 using slightly more than 8n cycles. The microkernel with n = 200 is the basis of the matrix multiplication routines given in the next chapter. For n = 200, the desired cycle count is 1,600. Our microkernel gets to 1,840. With better optimization, microkernels that get closer to the ideal cycle count can be written.
The microkernel we present has a particularly simple design and uses only a few instructions. Yet it brings out many of the essential features of this type of programming. One of these is the tension between two constraints that must be simultaneously satisfied by such microkernels. On the one hand, to utilize multiple dispatch ports and execute instructions in parallel, it is favorable to interleave segments of the instruction stream that are independent of each other. On the other hand, too much independence would mean that instructions dispatched during the same cycle are more likely to have unrelated operands, and register read stalls would result. We need the instructions to be independent so that they can be dispatched in parallel; at the same time, we want each instruction to read an operand that was written recently by another instruction, or we want two instructions scheduled during the same cycle to have common operands.
Working with such constraints on the instruction stream will require us to get into many aspects of the microarchitecture. We limit ourselves to a relatively simple microkernel, partly to keep the exposition tractable and partly because many details of the microarchitecture are unknown to us. Figuring out the microarchitecture calls for laborious and time-consuming experimentation. For the most part, we stick to those aspects of the microarchitecture that have already been uncovered in the previous section. We have seen the movaps instruction, used for storing, loading, and moving one register to another, as well as addpd and mulpd. Our microkernel uses these three instructions and shufpd but no others. The instruction
shufpd $1, %xmm0, %xmm0
flips the upper and lower halves of %xmm0, or of whichever XMM register that appears in place of %xmm0, and that is all we need about shufpd.
Product of 4 × 1 and 1 × 4 matrices
If A is 4 × 1 and B is 1 × 4, then C = C + AB updates C with an outer product. This instance of matrix multiplication takes 32 flops, and we will try to do it using slightly more than 8 cycles. The microkernel is built upon this outer product.
The assignment of the entries of A, B, and C to registers is hinted at by the following diagram:
  [ c0 c2 ]        [ a0 ]
  [ c1 c3 ]   +=   [ a1 ]  ( b0 b1 )
  [ c4 c6 ]
  [ c5 c7 ]
Eight XMM registers are used to store C, which is 4 × 4. The c0 register holds c0 and the entry that is immediately southeast, the c1 register holds c1 and the entry that is immediately northeast, and so on. Loading the C matrix into registers and then storing the registers in the matrix becomes particularly convenient if the matrix is stored in "skew" order in memory. If the 2 × 2 matrix
  [ a b ]
  [ c d ]
is skewed, it becomes
  [ a c ]
  [ d b ]
A matrix of dimension 2m × 2n is said to be skewed if each of the m × n blocks of size 2 × 2 is skewed. Notice that if a skewed matrix is skewed twice, we get back the original matrix.
As far as matrix multiplication is concerned, assuming C to be stored in skew order is a minor point in terms of performance but simplifies the exposition. We shall assume C to be stored in skew order.
The diagram indicates that the register a0 holds a0 and the entry immediately below it. The contents of the registers a1, b0, and b1 follow from the diagram in the same way.
We assume C to be stored in a contiguous array of size 16 in column-major order but after skewing. We assume A and B to be stored in contiguous arrays of size 4.
If the XMM register holding b0 is multiplied into a0 using mulpd and the result is added to c0 using addpd, we have completed updating c0. If a0 is flipped, using shufpd to interchange its lower and upper half, and then multiplied by b0 and added to c1, we have updated c1. The c registers with an even subscript do not require flipping. The ones with an odd subscript require flipping.
Using the notation introduced so far, we give the method for implementing C: = C + AB in a kind of pseudocode. The assembly code given later corresponds closely to this pseudocode. There is some vagueness in the way the ai are assigned to registers, which is cleared up by the assembly code (see figure 3.7↓).
  1. Load c0, …, c7 from memory.
  2. Load a0 from memory.
  3. Load b0 from memory.
  4. Use mulpd to replace a0 by b0a0 (entrywise product of two XMM registers).
  5. Load a1.
  6. Move a1 to aa1, which is another XMM register.
  7. Replace a1 by b0a1 using mulpd.
  8. Add a0 to c0 using addpd.
  9. Add a1 to c4 using addpd.
  10. Load b1.
  11. Replace aa1 by b1*aa1.
  12. Load a0.
  13. Replace a0 by b1a0.
  14. Add a0 to c2.
  15. Add aa1 to c6.
  16. Load a0.
  17. Flip a0.
  18. Move a0 to aa0.
  19. Replace aa0 by b1aa0.
  20. Replace a0 by b0a0.
  21. Add aa0 to c3.
  22. Add a0 to c1.
  23. Load a1.
  24. Flip a1.
  25. Move a1 to aa1.
  26. Replace a1 by b0*a1.
  27. Replace aa1 by b1aa1.
  28. Add a1 to c5.
  29. Add aa1 to c7.
  30. Store c0, …, c7.
Each of the 28 items from 2 to 29 corresponds to exactly one instruction. Items 1 and 30 correspond to eight load and eight store instructions, respectively.
Each of the eight registers c0 to c7 used to store the 16 entries of C is updated using one mulpd and one addpd instruction. Thus, there are eight addpd and eight mulpd instructions.
The cost of loading and storing C is amortized when matrices of dimensions 4 × n and n × 4 are multiplied for large n. There are no store instructions except those used to write to C in item 30. If the instructions used to load C in item 1 are ignored, there are 7 load instructions. Two of these are used to load b0 and b1. The others are for loading A.
There are 3 move instructions (items 6, 18, and 25) and 2 shufpd instructions for flipping an XMM register (items 17 and 24).
Thus, the 28 instructions from items 2 to 29 are comprised of 16 arithmetic instructions, 7 loads, 3 moves, and 2 shufpds. The matrix A can be loaded using just 2 instructions, and the number of loads can be reduced to 2. But then the number of moves increases to 8. The move instructions as well as the shuffle instructions are dispatched using port 5. Having a total of 10 move and shuffle instructions creates too much pressure on port 5. Another reason to favor the load instructions is that they seem to work better with respect to register renaming and the capture of modified registers from ROB.
Among the 28 instructions left after disregarding the first and last items, there are 8 mulpds that are dispatched on port 0, 8 addpds that are dispatched on port 1, 7 load instructions that are dispatched on port 2, and 5 move or shuffle instructions dispatched on port 5. Ports 3 and 4 are unused because there are no store instructions.
If the 28 instructions of items 2 to 29 are put inside a loop body, in effect we have a program that adds the product AB to C repeatedly. Given the way the instructions are distributed between ports, it is not unreasonable to expect each iteration to take 8 cycles.
But of course there are other constraints on the instruction stream to be considered. One of these is the availability of only 3 ports to read from the register file. The load instructions do not create any pressure on the register read ports. However, all the other instructions do. So if we are to get to 8 cycles per iteration, it is absolutely necessary to ensure that some of the operands are read from ROB.
Each of the addpd instructions is likely to capture one of its operands from the corresponding mulpd, which is placed not too far above it. The mulpd of item 7 is likely to capture one of its operands from item 5, which is a load instruction. Similarly, the mulpds of items 11 and 13 may capture an operand from item 10, which is a load instruction.
Instruction lengths impose yet another constraint. If the average instruction were 5 bytes long, it would take more than 8 cycles to fetch the instructions because each instruction fetch brings in an aligned block of 16 bytes. The majority of instructions are only 4 bytes, however. Instruction lengths and instruction alignment may have a significant effect when the program is unrolled to multiply matrices of size 4 × n and n × 4 for n > 1. We do not pay much attention to this constraint, preferring to keep the microkernel relatively simple. If items 2 through 29 are put in a loop, it takes 8.53 cycles per iteration if the top of the loop is aligned at a 16-byte boundary. That is not bad considering that the instructions to increment the loop counter and compare against the loop count create some pressure on the register read ports.
.align 16, 0x90
.globl asm4x1x4
asm4x1x4:
# parameter 1: %rdi (a)
# parameter 2: %rsi (b)
# parameter 3: %rdx (c)
#loading c
movaps (%rdx), %xmm4
movaps 32(%rdx), %xmm5
movaps 64(%rdx), %xmm6
movaps 96(%rdx), %xmm7
movaps 16(%rdx), %xmm8
movaps 48(%rdx), %xmm9
movaps 80(%rdx), %xmm10
movaps 112(%rdx), %xmm11
#mult 4x1x4
movaps (%rdi), %xmm2
movaps (%rsi), %xmm0
mulpd %xmm0, %xmm2
movaps 16(%rdi), %xmm3
movaps %xmm3, %xmm12
mulpd %xmm0, %xmm3
addpd %xmm2, %xmm4
addpd %xmm3, %xmm8
movaps 16(%rsi), %xmm1
mulpd %xmm1, %xmm12
movaps (%rdi), %xmm2
mulpd %xmm1, %xmm2
addpd %xmm2, %xmm6
addpd %xmm12, %xmm10
movaps (%rdi), %xmm3
shufpd $1, %xmm3, %xmm3
movaps %xmm3, %xmm2
mulpd %xmm1, %xmm3
mulpd %xmm0, %xmm2
addpd %xmm3, %xmm7
addpd %xmm2, %xmm5
movaps 16(%rdi), %xmm2
shufpd $1, %xmm2, %xmm2
movaps %xmm2, %xmm3
mulpd %xmm0, %xmm2
mulpd %xmm1, %xmm3
addpd %xmm2, %xmm9
addpd %xmm3, %xmm11
#storing c
movaps %xmm4, (%rdx)
movaps %xmm5, 32(%rdx)
movaps %xmm6, 64(%rdx)
movaps %xmm7, 96(%rdx)
movaps %xmm8, 16(%rdx)
movaps %xmm9, 48(%rdx)
movaps %xmm10, 80(%rdx)
movaps %xmm11, 112(%rdx)
ret
.align 16, 0x90
.type asm4x1x4, @function
Figure 3.7 Assembly function asm4x1x4() for multiplying matrices of dimensions 4 × 1 and 1 × 4.
Product of 4 × n and n × 4 matrices
The complete assembly program for multiplying a 4 × 1 matrix with a 1 × 4 matrix is shown in figure 3.7↑. It corresponds closely to the pseudocode and is the building block for programs to multiply matrices of dimensions 4 × n and n × 4 with n > 1. The C/C++ declaration of this function is
extern "C"{
	void asm4x1x4(double *a, double *b, double *c);
}
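Before looking at how the middle block is replicated, it may help to see the same 4 × 1 × 4 update written with SSE2 intrinsics. This is a hypothetical C rendering for illustration only (the book's kernel is the assembly of figure 3.7); mult4x1x4() assumes the same skewed storage of C, with A a column of 4 and B a row of 4.

```c
#include <emmintrin.h>

/* C = C + A*B for 4x1 A and 1x4 B, with C in skewed column-major order.
   Mirrors items 2 to 29 of the pseudocode: four products with A as is,
   four with A flipped by shufpd.  Hypothetical intrinsics rendering,
   not the book's asm4x1x4(). */
void mult4x1x4(const double *a, const double *b, double *c)
{
	__m128d b0 = _mm_loadu_pd(b), b1 = _mm_loadu_pd(b + 2);
	__m128d a0 = _mm_loadu_pd(a), a1 = _mm_loadu_pd(a + 2);
	__m128d f0 = _mm_shuffle_pd(a0, a0, 1);   /* flipped a0 */
	__m128d f1 = _mm_shuffle_pd(a1, a1, 1);   /* flipped a1 */
	__m128d c0 = _mm_loadu_pd(c),      c1 = _mm_loadu_pd(c + 4);
	__m128d c2 = _mm_loadu_pd(c + 8),  c3 = _mm_loadu_pd(c + 12);
	__m128d c4 = _mm_loadu_pd(c + 2),  c5 = _mm_loadu_pd(c + 6);
	__m128d c6 = _mm_loadu_pd(c + 10), c7 = _mm_loadu_pd(c + 14);
	c0 = _mm_add_pd(c0, _mm_mul_pd(b0, a0));  /* diagonal pairs */
	c4 = _mm_add_pd(c4, _mm_mul_pd(b0, a1));
	c2 = _mm_add_pd(c2, _mm_mul_pd(b1, a0));
	c6 = _mm_add_pd(c6, _mm_mul_pd(b1, a1));
	c1 = _mm_add_pd(c1, _mm_mul_pd(b0, f0));  /* anti-diagonal pairs */
	c5 = _mm_add_pd(c5, _mm_mul_pd(b0, f1));
	c3 = _mm_add_pd(c3, _mm_mul_pd(b1, f0));
	c7 = _mm_add_pd(c7, _mm_mul_pd(b1, f1));
	_mm_storeu_pd(c, c0);       _mm_storeu_pd(c + 4, c1);
	_mm_storeu_pd(c + 8, c2);   _mm_storeu_pd(c + 12, c3);
	_mm_storeu_pd(c + 2, c4);   _mm_storeu_pd(c + 6, c5);
	_mm_storeu_pd(c + 10, c6);  _mm_storeu_pd(c + 14, c7);
}
```

A compiler is free to schedule this code quite differently from figure 3.7, which is precisely why the book drops to assembly for the microkernel.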
Each instruction in the middle block labeled #mult 4x1x4 (see figure 3.7↑) corresponds to exactly one of the items 2 to 29 in the pseudocode.
There are many conventions that govern assembly programming on GNU/Linux. These conventions specify which registers are used to pass arguments, which registers are used to return the function value, which registers are caller saved, which are callee saved, and the manner in which the stack must be used. [72] By far the best guide to the calling conventions of Linux or Windows is part 5 of Agner Fog’s optimization document posted on his web page: http://www.agner.org/optimize/. The three arguments to asm4x1x4() are pointers (to arrays that hold the matrices A, B, and C) and therefore 64 bits wide. They are passed using the registers %rdi, %rsi, and %rdx, respectively. Fortunately, we need to know very little of the calling conventions. The XMM registers are all caller saved, and asm4x1x4() is free to use them to perform its calculations.
The definition of asm4x1x4() as an assembly program has three blocks. The first block is for loading C, the second block multiplies A and B and adds the product to the registers that hold C, and the last block is for storing C. The code to multiply matrices of dimensions 4 × n and n × 4 is obtained essentially by replicating the middle block n times but with some modifications. This is possible because the product
$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
a_{31} & a_{32} & \cdots & a_{3n}\\
a_{41} & a_{42} & \cdots & a_{4n}
\end{pmatrix}
\begin{pmatrix}
b_{11} & b_{12} & b_{13} & b_{14}\\
b_{21} & b_{22} & b_{23} & b_{24}\\
\vdots & \vdots & \vdots & \vdots\\
b_{n1} & b_{n2} & b_{n3} & b_{n4}
\end{pmatrix}
$$
is a sum of outer products:
$$
\begin{pmatrix} a_{11}\\ a_{21}\\ a_{31}\\ a_{41} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} & b_{13} & b_{14} \end{pmatrix}
\;+\cdots+\;
\begin{pmatrix} a_{1n}\\ a_{2n}\\ a_{3n}\\ a_{4n} \end{pmatrix}
\begin{pmatrix} b_{n1} & b_{n2} & b_{n3} & b_{n4} \end{pmatrix}
$$
The matrices A and B are assumed to be stored in arrays of size 4n double-precision numbers. The matrix A is assumed to be stored column after column. In contrast, the matrix B is assumed to be stored row after row, making it easier to access the columns and rows for each outer product. To form the kth outer product, where 0 ≤ k < n, we may add 4k to the pointers A and B to advance to the kth column of A and the kth row of B.
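These storage conventions can be pinned down with a plain C reference routine. mult4xnx4_ref() below is a hypothetical scalar version written only to document the layout and the pointer advance of 4 per outer product; C is kept in ordinary (unskewed) column-major order to keep the illustration simple.

```c
/* Reference routine documenting the storage conventions: A is 4 x n
   stored column after column, B is n x 4 stored row after row, and the
   product is accumulated as a sum of n outer products.  C is plain
   column-major here (unskewed) for simplicity.  Hypothetical helper,
   not the book's code. */
void mult4xnx4_ref(int n, const double *a, const double *b, double *c)
{
	for (int k = 0; k < n; k++, a += 4, b += 4)  /* kth column of A, kth row of B */
		for (int j = 0; j < 4; j++)
			for (int i = 0; i < 4; i++)
				c[i + 4*j] += a[i] * b[j];
}
```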
To multiply matrices of dimensions 4 × n and n × 4, we modify asm4x1x4(). The middle block is replicated n times. In the second block, the memory references
(%rdi), (%rsi), 16(%rdi), 16(%rsi), (%rdi), (%rdi), 16(%rdi) 
are replaced by
32(%rdi), 32(%rsi), 48(%rdi), 48(%rsi), 32(%rdi), 32(%rdi), 48(%rdi) 
Adding 32 to the displacement corresponds to moving forward by 4 doubles because each double is 8 bytes. The third, fourth, and fifth replications are treated similarly by adding 64, 96, and 128 to the displacement fields. After the fifth replication, we add 160 to %rdi and %rsi using the instructions
	addq 	$160, %rdi
	addq 	$160, %rsi
and repeat the first five replications. Note that 160 bytes equals 20 doubles. This design tries to balance competing requirements for shorter instructions and fewer instructions.
n      Flops per cycle
20     3.19
40     3.33
100    3.43
200    3.48
Table 3.5 Floating point performance for routines that multiply a 4 × n matrix into an n × 4 matrix for various n on SSE2/Nehalem pipeline (the 2.6 GHz SSE2 processor of table 9.1↓).
The function asm4x200x4() defined in this manner achieves 3.48 flops per cycle (see table 3.5↑) on SSE2/Nehalem. It serves as the microkernel for programs that multiply larger matrices in the next chapter. One might ask, why stop at n = 200? Further unrolling does not improve the performance significantly, and we are close to the limit of the 4 × 1 × 4 design used to build the microkernel. The code for asm4x200x4() occupies 24 KB and fills three quarters of the L1 instruction cache. In addition, the matrices multiplied by the microkernel are assumed to fit comfortably with room to spare in L1 data cache, which is 32 KB, when the microkernel is used to build efficient programs to multiply large matrices in the next chapter. The choice of n is limited by the data cache as well as the instruction cache.
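Cycle counts such as these are obtained by reading the processor's time-stamp counter. The sketch below shows only the mechanics of rdtsc; a real harness would serialize the pipeline, warm up the caches, and take the minimum over many runs, and on recent processors rdtsc counts reference cycles rather than core cycles. The loop body here is a stand-in for a call to the kernel being measured.

```c
#include <stdint.h>

/* Read the x86 time-stamp counter (x86 only). */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Average cycles per iteration of a simple loop; the loop body is a
   stand-in for a call to the kernel being timed.  Sketch only: no
   serialization, warm-up, or repetition. */
double cycles_per_iter(long n)
{
	volatile double x = 1.0;
	uint64_t t0 = rdtsc();
	for (long i = 0; i < n; i++)
		x = x + 1.0;
	return (double)(rdtsc() - t0) / n;
}
```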
Throughout the design of the asm4x200x4() microkernel, we have emphasized register read stalls. Verification that register read stalls are a key issue may be found by running the same program on AVX2/Haswell (the 3.6 GHz AVX2 processor of table 9.1↓). On AVX2/Haswell the program realizes 3.98 flops per cycle, which is much closer to the best possible 4.0 flops per cycle for a program built using mulpd and addpd instructions than what we see in table 3.5↑. Evidently, the greater number of register ports available on AVX2/Haswell is helping.
This finding might suggest that writing an optimal microkernel is easier on AVX2 than on SSE2. Unfortunately, that is unlikely to be true. An optimal microkernel on AVX2 must issue two fmadd*pd instructions operating on YMM registers every cycle to realize 16 flops per cycle (these are scheduled on ports 0 and 1; see figure 3.5↑). The fmadd*pd instructions operate on three registers, not just two as with addpd or mulpd, creating greater pressure on the register read ports.
A point that came up in our discussion of the asm4x200x4() microkernel for SSE2/Nehalem is that independence in the instruction stream is good because instructions can be scheduled in parallel. However, too much independence is bad because it leads to register read stalls. This tension persists in an optimal microkernel for AVX2/Haswell (and may be expected to persist in architectures yet to be released) and remains the key issue to be dealt with in optimizing for the instruction pipeline.
Exercise:  Summing the Leibniz series with loop unrolling takes 7 cycles per term on an AVX2 machine, as we saw in section 3.2.2↑. Examine the assembly code and determine the realized bandwidth in terms of instructions consumed per cycle.
Exercise:  Why does having a deep pipeline with many stages help accommodate a faster clock?
Exercise:  Write macros zeroymm(), addymm(), mulymm(), storeymm(), and loadymm() to issue instructions to manipulate YMM registers. The respective instructions are vxorps, vaddpd, vmulpd, and vmovaps. You may consult Intel® 64 and IA-32 Architectures Software Developer’s Manual for information about these instructions. Keep in mind that the order of sources and destinations is reversed between the Intel manuals and the GNU/Linux assembler. Write a simple program to test your macros.
Exercise:  Use GNU’s objdump utility to find out the size in bytes of a vmovaps instruction in three cases: with both source and destination being YMM registers, with the source an address in memory, and with the destination an address in memory.
Exercise:  The AVX2 instruction set provides for three fused-multiply-add instructions vfmadd231pd, vfmadd132pd, and vfmadd213pd. Write macros to issue these instructions. Why are there exactly three fused-multiply-add instructions?
Exercise:  Look up the latency and throughput of vmulpd and vaddpd for your AVX2 microarchitecture in Intel® 64 and IA-32 Architectures Optimization Reference Manual or its AMD equivalent. Write programs that verify that information.
Exercise:  On the Haswell/Broadwell microarchitectures implementing AVX2, the latency of any of the fmadd*pd instructions is 5 cycles. The throughput is 0.5 cycles, implying that 2 vfmadd*pd instructions can be simultaneously dispatched to execution units. Write programs verifying the latency and throughput of fused-multiply-add instructions. To complete this exercise, it is crucial to find out the number of register read ports available. Finding that out may require some reverse-engineering.
Exercise:  Write a program using vfmadd*pd instructions that reaches peak floating point performance of 16 flops per cycle and loads into a YMM register once for each vfmadd*pd instruction issued.
Exercise:  Code a program that issues vfmadd*pd instructions and that is impeded from reaching peak performance of 16 flops per cycle because instructions are too long and cannot be fetched fast enough.
Exercise:  Investigate the vperm instructions for permuting the contents of a YMM register.
Exercise:  Code a microkernel for multiplying 4 × n and n × 4 matrices that approaches peak performance on an AVX2 machine for a suitable value of n.
Exercise:  Try another microkernel design, this time for multiplying 8 × n and n × 8 matrices. For a suitable n, does it get closer to AVX2 peak performance of 16 flops per cycle than the previous design?

3.4 References

Bibliography

[28] G.J. Toomer: Ptolemy's Almagest. Princeton University Press, 1998.

[29] J.L. Hennessy, D.A. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990-2011.

[30] K.V. Sarma: A History of the Kerala School of Hindu Astronomy. Vishveshvaranand Institute, 1992.

[31] M. Kerrisk: The Linux Programming Interface. No Starch Press, 2010.

[32] R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

[33] K. Goto, R. A. van de Geijn: “Anatomy of high performance matrix multiplication”, ACM TOMS, article 12, 2008.

[34] T.H. Cormen, C.E. Leiserson, R.L. Rivest: Introduction to Algorithms. MIT Press, 2001.

4 Memory

figure FIGS/chapter3/MemoryH.png
Figure 4.1 Memory hierarchy is a pyramid of decreasing capacity and increasing speed.
The memory pyramid shown in figure 4.1↑ has hard disk at the bottom and registers at the top. Hard disk can have capacity well into the terabytes, while the registers are few in number. It can take milliseconds to access the hard disk, while the registers can be accessed in less than a nanosecond. As we ascend the memory pyramid, capacity decreases and speed increases. Registers have been dealt with in the previous chapter. Here we discuss caches, Dynamic Random Access Memory (DRAM), and hard disk. Much of the time memory refers to DRAM memory.
The stored program concept is regarded as an early landmark in the development of computers. Instead of treating instructions and data as separate entities, the stored program concept slightly blurs the distinction between the two, with both programs and data stored in the same computer memory. In modern computers, including supercomputers, desktops, laptops, and all kinds of mobile devices, memory invariably refers to DRAM. Although file systems reside on hard disk or some other storage medium, when a program is running, much of the data that is handled is from DRAM. DRAM sits external to the processor as shown in section 3.3.2↑ (see figure 3.6↑). It is estimated that 40% of the instructions in a typical computer program are loads and stores, and all these instructions address locations in DRAM. Thus, within a program, for most purposes, memory means DRAM memory. In DRAM memory, every bit is stored on a single microscopically small capacitor, which is accessed using a single transistor.
DRAM technology is cheap, ubiquitous, and relatively stable. In section 4.1↓, we take a look at how DRAM hardware and caches are configured. Accessing a word from DRAM can be more than 100 times slower than accessing a register. Although memory available through registers is only of the order of kilobytes (KB), a register can be accessed in a single cycle. In contrast, memory available through DRAM runs into tens, even hundreds, of gigabytes (GB), but it can take more than 100 cycles to access a word from DRAM. A great part of computer design is an attempt to mitigate this extreme gap in speed and capacity. The aim or hope is to approximate the speed of registers with the capacity of DRAM. A key idea is to store caches on the processor itself. The cache is a record of those words in DRAM that are most frequently accessed by the processor. Words that are cached may be accessed directly within the processor itself without exchanging signals with DRAM units external to the processor.
On modern computers, caches can be quite large. It is not unusual for a processor package to have more than 10 megabytes (MB) of cache. Both cache and DRAM capacities are increasing inexorably. The day when caches were too small to comfortably hold a million-grid-point computation is firmly in the past. In a lot of scientific computing, computations do not go out of cache. Caches are implemented using Static Random Access Memory (SRAM). In SRAM, each bit is stored using multiple transistors with circuitry to switch the bit on or off rapidly. Caches are intermediate between DRAM and registers in expense, speed, and capacity.
Companies that make DRAM are about as profitable as low-end grocery stores, if they are lucky enough to make a profit. DRAM is as close to a commodity, mass-market technology as there exists in the world of computing. Its low cost has made it ubiquitous across the whole computing spectrum. Engineers have been able to meet market demand for increasing capacity at a low cost, and the potential for disruptive innovations in this area appears limited.
Section 4.1↓ gives an overview of DRAM and cache memory. The organization of DRAM into memory channels and of caches into multiple levels influences programming technique in several ways. For example, the organization of memory into channels assumes that memory accesses from separate processor cores (of the same system) are relatively independent and reference locations far apart. If in fact the memory accesses are tightly correlated and reference nearby locations, there can be a severe penalty.
DRAM memory is shared by multiple programs running on the same computer. Virtual memory creates the illusion that every program has its own exclusive memory. Virtual memory, which is implemented by the operating system kernel with help from the processor hardware, influences program speed in several ways too. In section 4.1.3↓, we give a basic introduction to virtual memory. In sections 4.1.4↓ and 4.1.5↓, we describe a program to measure the latency to DRAM memory. This program exposes many aspects of the memory system, such as parallelism of memory access, cache line size, and virtual memory, which are vital to efficient programming. In addition, latency to memory is a figure of much importance, especially when dynamic data structures such as linked lists and trees are employed.
Section 4.2↓ presents many techniques to optimize memory access. These techniques have not changed greatly over the years. This section could have been written even 15 years ago in mostly the same form. Of course the parameters used for successive levels of blocking, which are related to the sizes of caches and other aspects of computer architecture, would have been different. But all the principles are the same with one important exception. Even that one principle (packing data in caches to reduce TLB misses) would have applied in an identical manner to machines of the past, but it was not known 15 years ago.
Sections 4.1↓ and 4.2↓ may well be the most important parts of this book. Optimizing memory access is just as relevant to your cell phone as it is to supercomputers, and the principles are the same. Its importance only increases when programs are threaded or networked. The techniques of memory optimization described in section 4.2↓ are applicable to regular and structured data, as in images and grids for solving partial differential equations. When dynamic data structures such as linked lists and graphs are used, one needs to be more mindful of latency to memory.
Memory optimizations can be done in C, without resorting to assembly code, to a far greater extent than instruction pipeline optimizations. Like the previous chapter, this chapter too discusses optimizations in the context of SSE2, AVX, and AVX2 machines (see table 3.1↑), although the role of the instruction set is not as great. Much of the time the distinction between these instruction sets does not matter greatly. All programs are written in C. The compiled code may be suboptimal as before, but the penalty to pay for the suboptimality of the compiler is not as high. Although the penalty can exceed a factor of 10 for instruction pipeline optimizations, as shown in the previous chapter, the penalty with regard to memory optimizations seldom exceeds a factor of 1.5.
Section 4.3↓ explains how to write to and read from hard disk. If DRAM memory can run into hundreds of GB, hard disk can run into tens of terabytes on even a small computer. However, hard disk can be very slow, with latencies of the order of milliseconds. The operating system kernel plays a great role in determining the observed speed of disk input/output.
In section 4.4↓, we return to virtual memory and look at its implementation inside the Linux kernel. All programs run at the mercy of the operating system kernel. The systems programming perspective and knowledge of the paging system, which implements virtual memory, help understand disk input/output, network programming, and multithreaded programming at an advanced level.

4.1 DRAM and cache memory

DRAM was invented at IBM in the late 1960s and first commercialized by Intel in 1970. In DRAM, a single capacitor is used to store a single bit, and each capacitor is equipped with an access transistor. DRAM technology has evolved over the years to the point where it is the primary form of memory in almost all computing and mobile devices. For a schematic illustration of where DRAM memory fits into the computer as a whole, see figure 3.6↑, where it is labeled as system memory.
Section 4.1.1↓ is an overview of DRAM technology. At the finest level, DRAM is an array of bits. Arrays of bits are organized into banks, ranks, and channels. There is a memory controller on each processor package that drives the memory channels. Most of the details of the memory controller are entirely hidden from the programmer (as well as the operating system). Fortunately, one does not need knowledge of the hardware at the level of the memory controllers or channels to write optimized programs. Thus, the principal purpose of the information in section 4.1.1↓ is to provide context. The information is not directly useful in writing actual programs. However, its indirect implications can be of considerable importance.
In section 4.1.2↓, we look at caches. Unlike DRAM, caches reside on the processor package. Accessing cached memory is much faster than accessing DRAM. Caching is done automatically by the hardware, and in principle, the programmer does not even need to know that caches exist. However, a basic knowledge of cache organization, including items such as the cache line size, is essential for writing optimized programs.
When the same data item is stored in DRAM memory as well as multiple caches, there is the problem of keeping caches coherent. In the next chapter, we will find that cache coherence is the basis of multithreaded programming. Inasmuch as multithreaded programming is a central paradigm from scientific computing to web servers and mobile apps, it is inadvisable to program with an ignorance of caches. Technically, multithreaded programs can be written without heeding caches. Doing so would imply inefficiencies as well as the danger of falling into error without realizing it.
DRAM memory is a shared resource. At any given point in time, dozens of programs on a computer could all be using DRAM memory. Each of these programs is written as if the program owns its own memory. The Linux kernel implements virtual memory, with the help of the processor hardware. Thanks to virtual memory, programs can be written assuming a tractable memory model. Section 4.1.3↓ is an introduction to virtual memory. The way virtual memory is set up can have major implications for program speed, especially in the multithreaded context, as we will see on several occasions later.
Sections 4.1.4↓ and 4.1.5↓ utilize knowledge of DRAM, cache lines, as well as virtual memory to measure latency to DRAM. Latency to DRAM is the number of cycles between issuing an instruction to load a word into a register from DRAM and the completion of that instruction. Finding the latency to DRAM might appear straightforward. It might appear that all we need to do is time a load or store instruction.
In fact, determining latency to memory is not so straightforward. Just as in register pipelines, there is a great deal of parallelism in the memory system. Although this parallelism is a great boon, it gets in the way of finding latency. Similarly, the virtual memory system can also get in the way. Thus, this simple exercise of finding latency to DRAM exposes several important elements of the memory system.
To access a register, the latency is 1 cycle. To access the L1 cache, the latency is 4 cycles. The latency to DRAM, however, can be hundreds of cycles. It is estimated that 40% of the instructions access DRAM, and therefore hiding this large latency to DRAM is a major part of computer architectural design. It is important for the programmer to understand when this latency to DRAM can be hidden and when it cannot be hidden, but that is a point we will turn to in the next section. Briefly, the latency to DRAM can be effectively hidden when the data access is sequential and predictable. It cannot be hidden in linked lists and other dynamic data structures because the location of the next item is determined by a link from the present item. However, even in such situations, programming techniques can mitigate the latency to DRAM.
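The reason a linked list defeats latency hiding can be seen in a small sketch: every load depends on the one before it. make_cycle() and chase() below are hypothetical helpers illustrating the idea behind the latency measurement of section 4.1.4, not the book's program; linking the nodes in a random order defeats hardware prefetching, so timing chase() and dividing by the number of steps estimates the latency of a dependent load.

```c
#include <stdlib.h>

/* Each node stores only a pointer to the next node.  Hypothetical
   sketch of a pointer-chasing latency probe, not the book's program. */
typedef struct node { struct node *next; } node;

/* Link the n nodes of pool[] into a single cycle in shuffled order. */
node *make_cycle(node *pool, int n)
{
	int *perm = malloc(n * sizeof(int));
	for (int i = 0; i < n; i++)
		perm[i] = i;
	for (int i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
		int j = rand() % (i + 1);
		int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
	}
	for (int i = 0; i < n; i++)
		pool[perm[i]].next = &pool[perm[(i + 1) % n]];
	node *start = &pool[perm[0]];
	free(perm);
	return start;
}

/* Follow the chain for k steps; the k loads form one dependence chain,
   so none of them can be overlapped with the others. */
node *chase(node *p, long k)
{
	while (k-- > 0)
		p = p->next;
	return p;
}
```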

4.1.1 DRAM memory

By opening the cover of a computer and peering inside, we can look at memory plugged into the mother board. The devices that are plugged in are called Dual Inline Memory Modules (DIMMs). DIMMs can be purchased to add more memory to the computer. Each DIMM is a package of several little chips. The little chips are DRAM devices.
figure FIGS/chapter3/dram_array.png
(a) An array of bits with 16,384 (2^14) rows and 2,048 (2^11) columns.
figure FIGS/chapter3/sixchannels.png
(b) Twelve processors and six memory channels of an SSE2/Westmere computer.
Figure 4.2 Organization of DRAM memory.
Figure 4.2↑ shows two levels in the organization of DRAM memory: the finest and the outermost. [73] The treatise by [35] has much more information than we give here. On the left, it shows a DRAM array of bits, which is the finest level of organization. Each DRAM channel is composed of several DRAM arrays of size 16K × 2K. On the right, it depicts six channels driven by memory controllers that reside on the two processor packages. There are several other levels of organization between the DRAM array and the channel to memory.
Figure 4.2↑ assumes a typical array size, which can of course vary and vary a lot, and six channels. Some of the more recent AVX2 computers have eight channels to memory, with four on each processor package. The number of channels can also vary quite a bit.
Banks, ranks, channels, and DRAM technology
figure FIGS/chapter3/rank_bank.png
Figure 4.3 A single row of 2^6 arrays in this figure constitutes a bank. All arrays in the same bank operate in concert. In the figure, a rank is composed of 2^3 banks. The 2^3 × 2^6 arrays are split across 16 devices, with each bank spanning all 16 devices.
To access a single bit in a 2^m × 2^n DRAM array takes m + n address bits. These are the lowest bits in the memory address. In figure 4.2↑, m + n = 25. The DRAM arrays are organized in banks so that 2^a arrays constitute a bank. In figure 4.3↑, a = 6 so that each bank consists of 64 arrays.
When m + n address bits are used to pick a bit from an array, the same m + n bits are applied to every one of the arrays in a bank. Thus, the output from applying m + n address bits will be 2^a bits of data. In figure 4.3↑, the output would be 2^6 bits, or 8 bytes.
There is one point related to banks of DRAM arrays that can have a significant impact on program speed. The way DRAM arrays work, it is easy to move from one bit in a given row to some other bit in the same row. Therefore, typically, consecutive addresses within a bank map to the same row. However, if a program generates memory addresses that require frequent switching between rows, there will be bank conflicts. Such bank conflicts slow down the program.
Although bank conflicts slow down programs, there is not much a programmer can do to avoid them. The organization of memory varies quite a bit from computer to computer, and it can be difficult to find out the parameters of memory organization. [74] To determine the parameters of memory organization, such as the number of arrays in a bank, on a given computer, one may start with the GNU/Linux command dmidecode and then look up the corresponding JEDEC manuals. Even if the parameters of DRAM organization are known, the manner in which the memory controllers split memory accesses between channels can be almost impossible to determine.
As shown in figure 4.3↑, the next level of organization after the bank is the rank. If 2^b banks constitute a rank, then b + m + n address bits are used to pick a bank within a rank and then extract a word from the arrays that constitute that bank. The b bits used to pick the bank are higher than the other m + n bits. In figure 4.3↑, b = 3.
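The bit arithmetic can be made concrete with the sizes used in figures 4.2 and 4.3: n = 11 column bits, then m = 14 row bits, then b = 3 bank bits. The helpers below are purely illustrative; the mapping a real memory controller applies is undocumented and surely more involved.

```c
#include <stdint.h>

/* Illustrative split of the low bits of a (bit) address into column,
   row, and bank fields, using n = 11, m = 14, b = 3 from figures 4.2
   and 4.3.  A real memory controller's mapping is undocumented; these
   hypothetical helpers only make the bit arithmetic concrete. */
enum { N_COL = 11, M_ROW = 14, B_BANK = 3 };

uint32_t dram_col(uint64_t addr)  { return addr & ((1u << N_COL) - 1); }
uint32_t dram_row(uint64_t addr)  { return (addr >> N_COL) & ((1u << M_ROW) - 1); }
uint32_t dram_bank(uint64_t addr) { return (addr >> (N_COL + M_ROW)) & ((1u << B_BANK) - 1); }
```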
The ranks are grouped further to obtain a memory channel, and the entire address space is split between the memory channels. Typically, the maximum amount of memory that may be installed will be much less than the maximum amount addressable. For example, a 48-bit-wide address bus can address 256 terabytes of memory, which is 1,000 times more than what even a high-end machine may provide for.
Given the speed at which modern computers operate, an error rate of even one in a billion would imply several memory errors every second. DRAM almost always provides for error correction, a point we do not go into here.
Almost all DRAM in use today is double data rate (DDR), and it has been that way for a long time. DDR was a major advance in DRAM technology. In earlier technology, a single bit of data was pushed out to a single pin (connected to a single line of the data bus) during one period of the clock signal. DDR initiates a data transfer during the rising edge as well as the falling edge of the clock (see figure 4.4↓). Two bits of data are pushed out in a single period of the clock, doubling the bandwidth.
figure FIGS/chapter3/ddr.png
Figure 4.4 Each diamond in this figure stands for a bit. DDR technology doubles the data rate by transferring a bit at both the rising and falling edges of the clock signal.
DDR memory is often advertised as DDR2 or DDR3 and so on. The numerical suffix, 2 or 3, gives the prefetch length as a power of 2. In DDR3, when m + n address bits are sent to a bank of 2^a arrays, the DRAM outputs not just 2^a data bits, with 1 bit for each array, but 2^(a+3) bits. It outputs the 2^a-bit word corresponding to the m + n bit address as well as the 2^3 = 8 words corresponding to 8 consecutive addresses, including the one that was sent to it. The assumption here is that when a program asks for a certain word, it will probably also ask for the next word. Because 2^3 words are sent to the memory controllers, the memory controllers may be able to service multiple load/store instructions while triggering a single DRAM access. This technique improves peak bandwidth assuming sequential, or nearly sequential, data access.

4.1.2 Cache memory

An instruction such as movq %rax, %rdx, which moves one register to another, takes a single cycle. However, a load instruction such as movq (%rsi), %rdx, which moves a quad word from memory to a register, can take more than 100 cycles. An instruction that writes can take even longer. Computer processors cache frequently used parts of DRAM memory so that they can execute instructions at the speed of the registers, although nearly half the instructions are loads and stores from the much slower DRAM devices. [75]  [75] The classic by [38] has a thorough discussion of caches. There is a lot of information in the Intel manuals as well.
figure FIGS/chapter3/BarcelonaCore.png
Figure 4.5 Layout of a single core of AMD Opteron (code named Barcelona) (layout based on Patterson and Hennessy).
Figure 4.5↑ shows the layout of a single processor core. About half the area of the chip is taken up by cache memory and the memory controller. The latency to L1 cache memory is a mere 4 cycles. It is much faster to access cache than to access DRAM memory.
The basic principle of cache organization is data locality. Once a word is accessed from DRAM memory, computer programs have a tendency to access the same word or nearby words repeatedly. Thus, it makes sense to keep the word in a fast cache after it is first accessed. Later accesses can be serviced quickly from the cache without contacting the slow DRAM devices.
All transfers between DRAM memory and the processors are in multiples of the cache line size. A cache line is 64 bytes or 512 bits on most x86 computers today. Thus, a cache line is big enough to hold 8 doubles or 16 ints. The cache lines are aligned in memory to begin at addresses whose last 6 bits are zero (notice that 2^6 = 64).
Cache parameters
figure FIGS/chapter3/cache_xeon5650.png
Figure 4.6 Schematic sketch of the cache hierarchy of typical SSE2/AVX/AVX2 processors. The L1 and L2 parameters do not vary much (e.g., these are the same for all the machines of table 9.1↓), but the L3 size can vary a lot. The L3 cache as shown is 12 MB here but can be four times higher or only one quarter as much.
Figure 4.6↑ shows cache organization on typical x86 processors. Each processor core has its own L1 and L2 caches. The L1 cache is smaller than L2 but faster to access. All processor cores on the same package share L3 cache. The L3 cache is much bigger in size than the L2 cache.
Having multiple levels of cache implies that the penalty of a miss at a certain level is not too high if the cache line is found at the next level. Thus, the cost of a cache miss worsens gradually.
Number of Sets Associativity Size
L1 (instruction) 64 8 way 32 KB
L1 (data) 128 4 way 32 KB
L2 512 8 way 256 KB
L3 12,288 16 way 12 MB
Table 4.1 Cache parameters of typical SSE2/AVX/AVX2 processor packages. The L1 and L2 parameters are the same for all the machines of table 9.1↓. The size of the L3 cache varies quite a bit, the other parameters less so.
The cache parameters of typical x86 computers are given in table 4.1↑. The parameters were found using the cpuid instruction. The size of the cache in bytes is equal to the product of the number of sets, the associativity, and 64, the number of bytes in a single cache line. Cache size and cache associativity both influence program speed, with larger values generally being better.
figure FIGS/chapter3/l1cache_assoc.png
Figure 4.7 Depiction of the L1 data cache with 128 sets and 4-way associativity assuming a cache line to be 64 bytes or 512 bits (which is very typical).
To understand what sets and associativity mean in the context of cache organization, we turn to figure 4.7↑. Suppose the processor wants to look for a byte whose physical address is α in L1 data cache. The address of its cache line will be
(α − α mod 64)/64
because a cache line begins every 64th byte. Cache lines are mapped to sets cyclically. The cache line containing the byte whose address is α maps to set number
(α − α mod 64)/64 mod 128
if there are 128 sets in the L1 data cache. The cache line could be in any of the four slots in that set. The processor will check the four slots simultaneously for a match. If there is a match, the processor will extract the byte from the cache line in the matching slot.
If too many addresses map to the same set, there will be cache conflicts even if other sets of the cache are not heavily used. If the most frequently used addresses in a program segment map to only a few sets, the effectiveness of the cache is reduced, and cache conflicts become more likely. A fully associative cache, which has a single set and in which a cache line can be stored in any of the slots, is the best for reducing cache conflicts. However, such an ideal cache is too expensive to implement.
The effect of cache conflicts is not as easy to detect in modern computers as used to be the case. Multiple levels of caching, instruction-level parallelism, the large size of the caches, and the sheer complexity of the memory system can make the effect of cache conflicts on strided accesses difficult to detect.
Cache protocols
How do the cache controllers decide which cache lines to keep in cache and which ones to evict from cache? Suppose a certain cache line is accessed. If the cache line is already in cache, the access is serviced without altering the mapping of cache lines from DRAM to the cache. If the cache line is not in cache, it will be brought into cache. Some other cache line has to be evicted from the set the newly accessed cache line maps to. A popular strategy is to evict the least recently used cache line.
Reads from memory are handled differently from writes to memory. If a processor writes to a cache line that is in cache, usually the cache line is modified in cache but the write is not propagated to DRAM memory. A dirty bit is turned on to inform the cache controllers that the cache line must be written back to memory when it is evicted. This policy of handling writes is called write back. An alternative is write through. In this policy, the write is propagated to DRAM memory. Yet another alternative is to propagate writes that modify the L1 cache to L2 cache but not to DRAM memory.
An implication of the write back policy is that cache lines in DRAM memory may become invalid. The same cache line may be stored in the L1 or L2 caches of multiple processor cores. If one of them writes to its cache, the cache lines in other processor cores become invalid. The cache controller on each processor has to “snoop” on the traffic in the other processors to maintain a coherent cache.
As a consequence of the write back policy, a single write instruction can trigger two DRAM accesses. Consider a write instruction of the type movq %rax, (%rdi). If the quad word (8 bytes) that %rdi points to is not in cache, the corresponding cache line must be brought in triggering the first DRAM access. Future instructions that read and write from the same cache line will be serviced without accessing DRAM. However, a second DRAM access is triggered when the cache line is evicted. In contrast, if all operations a cache line is subjected to are reads, there is no need for the second DRAM transfer.

4.1.3 Physical memory and virtual memory

Virtual memory is implemented by the operating system and the processor hardware working in concert. Its main purpose is to prevent processes from interfering with each other’s memory. A single process will have only part of the DRAM memory for itself. The part of the DRAM memory that is available to a specific process is determined only when the process is loaded and can change as it runs.
Virtual memory is an illusion that simplifies programming, compilation, and linking in addition to keeping programs from interfering with each other. As the program runs, the memory addresses generated by a program are automatically mapped to physical addresses in the DRAM memory. So, for example, if a program issues an instruction such as movq %rax, (%rdi), the address in %rdi is a virtual address. During instruction execution, the page tables are consulted by the hardware to map that address to a physical address. A separate map is maintained for each process to keep the processes from interfering with each other. The map is typically hierarchical and stored in multiple page tables. Page tables are the essence of virtual memory.
Virtual memory in action
Let us consider an instruction that generates the memory reference (%rax, %rsi, 2). The reference is to the memory location whose address is rax+2*rsi. The registers %rax and %rsi are 64-bit. Therefore, a 64-bit virtual address is formed. Strictly speaking, the virtual address is the last 48 bits of the address. If we print a pointer in a C program, the program prints 12 hex digits because a virtual address is 48 bits. A 48-bit virtual address should get us past 2020 and can then be extended without changes to the instruction set architecture.
How does the hardware look up an actual word in memory using a virtual address? The answer is complicated. The first step on x86 computers is to form a 64-bit linear address by adding a segment register. This step is trivial and we will ignore it.
The next and far more important step is to map virtual addresses to physical addresses. Once a physical address is formed, it may be used to look up DRAM or the caches.
To map addresses from virtual to physical memory, virtual memory is partitioned into pages. A page is typically 4,096 bytes (the command getconf PAGESIZE may be used to find out the page size). Thus, in a virtual address of 48 bits, the first 36 bits constitute a page address, and the following 12 bits are the address within that page. Correspondingly, DRAM memory is broken up into page frames, each of which is of the same size as a page. Page tables map page addresses to page frame addresses.
The manner in which page tables are set up does not concern us here. They are set up by the operating system kernel and left in a place where the processor hardware can look them up. The translation look-aside buffer (TLB) is a cache of the page tables. When an address such as (%rax, %rsi, 2) is formed, the next step is to look up the TLB to convert it to a physical address. Each entry in the TLB maps exactly one page to a page frame. If there is a TLB miss, the processor looks up the page tables.
Number of Sets Associativity Size
Instruction TLB 32/16/32 4/8/4 way 128 entries
Data TLB 16 4 way 64 entries
Level 2 TLB (Shared) 128/128/_ 4/8/_ way 512/1024/_ entries
Table 4.2 TLB parameters for three x86 processors: 2.6 GHz SSE2, 2.2 GHz AVX, and 3.6 GHz AVX2 (see table 9.1↓ for the full names of the machines). The 3.6 GHz AVX2 machine does not have level 2 TLB.
TLB organization is similar to cache organization. Table 4.2↑ shows TLB parameters for a few machines. When the processor switches from one process to another, it is the operating system kernel’s responsibility to furnish a new set of page tables and flush the TLB.
Even a partial picture of what it takes to resolve a single memory reference such as (%rax, %rsi, 2) is mind-bogglingly complex. As noted already, 40% of the instructions are estimated to be loads or stores. For every such instruction, the processor first forms a virtual address. To map the virtual address to the physical address, it looks up the TLB; if there is TLB miss, it must look up the page tables. Looking up the page tables may trigger additional DRAM accesses if the page tables are not in cache.
The page table lookup may trigger a page fault if the virtual page is yet to be mapped to a page frame. If so, the page fault handler, which is an important component of the operating system kernel, is invoked. The page fault handler allocates an actual page frame so that a physical address may be formed. Once a physical address is formed, the processor can look up the caches and trigger a DRAM access if necessary.
We think of a simple instruction such as movq (%rax, %rsi, 2), %rbx as taking a few cycles. If there is a page fault, the actual cost can be in the millions of cycles. In fact, there can be two page faults from this single instruction if the quad word crosses page boundaries. Even if there is no page fault, there can be a TLB miss, which can consume on the order of 100 cycles.
Virtual memory and program speed
Given that the virtual memory setup is exceedingly complex and is invoked almost every other instruction, it follows that the mechanisms used to implement virtual memory may influence program speed in many ways.
When memory is dynamically allocated by a program using malloc(), new[], or _mm_malloc(), the last of which is preferred if memory needs to be aligned, the allocation function returns an address in virtual memory, and pages may be allocated only in virtual memory. There is always a page fault the first time a page is accessed. The pages are mapped to page frames by the page fault handler. If there is enough DRAM memory, page faults do not occur except when pages are accessed for the first time.
On most computers, the memory controllers predict memory accesses and prefetch words to cache. The prefetching normally does not cross page boundaries to prevent page faults from being triggered.
Another scenario where the paging system is in play is when large amounts of data are transferred from DRAM memory to the network card or to graphics processor memory. During the transfer, some of the pages may get moved out of DRAM memory into hard disk by the memory management unit of the operating system kernel, complicating the transfer. One way around this is to copy the data to be transferred to kernel buffers and incur a substantial overhead. Another way is to request the operating system kernel to keep the pages “pinned” to DRAM memory.
There are many more ways in which virtual memory impacts program speed. The role of cache and TLB misses is explained later in this chapter. Other aspects related to multithreaded programming are found in the next chapter.

4.1.4 Latency to DRAM memory: First attempts

In this section, we make our first attempts at measuring the access time to DRAM memory. All our attempts fail, but we are led into certain aspects of the memory system that have a bearing on program performance.
The organization of memory is such that if we attempt to investigate one part of it, we need to be aware of the other parts as well. All the parts of the memory system are interrelated. Thus, to measure the access time to DRAM memory, we need to know about the size of the cache. We begin by giving a basic picture of the memory system as a whole. The measurement of latency brings to light some other parts of the memory system, and as we progress, we fill in some of the details in the picture.
A significant point is that all the traffic between DRAM and the processor packages is in multiples of the cache line size. The cache line is 64 bytes or 512 bits. If we attempt to read a single byte or a single word (2 bytes or 16 bits) or a double word or a quad word from memory, the memory system will bring the entire cache line into cache, anticipating that we will access other locations in the same cache line soon.
Latency to DRAM memory is defined as the time between issuing a load/store instruction that triggers a DRAM access and its completion. The latency of writing to memory can differ from the latency of reading for certain types of DRAM. We limit ourselves to read latency for simplicity.
Measuring the latency to DRAM memory is a more complicated matter than one may realize at first sight. Many techniques are used to hide the latency to DRAM memory. Among these, the two most important are instruction-level parallelism and caching. Instruction-level parallelism enables the processor to issue a sequence of load or store instructions to set up a pipeline of memory accesses. If the pipeline can hold 10 memory accesses in various stages, the effective latency of a long stream of DRAM accesses is cut to a tenth. For example, if the latency to DRAM is 200 cycles, the nth memory access may be initiated during cycle number 20n and completed during cycle number 20n + 199. This sort of overlapping is possible only if the memory accesses are independent and there is sufficient parallelism in the instruction stream. Caching greatly reduces the effective latency of DRAM memory access as well. These techniques for hiding latency to memory are so effective that many programs do not realize how large the latency to DRAM memory can be.
Here is a first and not very careful attempt at measuring latency.
int unitstride(int *a){
	int sum=0;
	for(int i=0; i < 1000*1000*1000; i++)
		sum += a[i];
	return sum;
}
The function unitstride() loads 10^9 entries of a[] and computes their sum. The compiler unrolls the loop and introduces parallelism in the loop body. As a result, the running time of this function is determined by instructions that load entries of a[] and not by arithmetic.
The function unitstride() is timed as follows:
1  void time_unitstride(){
2  	static int a[1000*1000*1000];
3  	for(int i=0; i < 1000*1000*1000; i++)
4  		a[i] = 0;
5  	TimeStamp clk;
6  	clk.tic();
7  	unitstride(a);
8  	double cycles = clk.toc();
9  	cout<<"cycles/access = "<<
10 	   cycles/(1000*1000*1000)<<endl;
11 }
The function time_unitstride() defines a[] statically, initializes every entry of a[] to 0, and makes a single call to unitstride() on line 7. The function call is timed, and the number of cycles it consumes is divided by 10^9 to derive the cycles consumed per access. Each array dereferencing a[i] in unitstride() is counted as a memory access.
The two functions unitstride() and time_unitstride() are defined in separate files and compiled separately. As usual, we do not use the -ipo option for interprocedural optimization during compilation, and we use -fno-inline-functions to eliminate function inlining. If the two functions are in the same file or if interprocedural optimization is turned on, the compiler will figure out that the call to unitstride() is doing nothing meaningful and simply eliminate the call.
Initializing the array to zero on line 4 ensures that page frames in physical memory have been allocated to the entire array before the function unitstride() is called.
The program reports 0.92 cycles per access. Cycle counts are quite similar on SSE2, AVX, and AVX2 machines. Therefore, we only report the numbers for a 2.6 GHz SSE2 machine (for the full name of the machine, see table 9.1↓). The 0.92 cycles per access figure is a woeful underestimate of the latency to DRAM memory. The single for-loop in unitstride() has a loop-carried RAW dependency. The excellent icpc compiler has little difficulty recognizing that a list of numbers can be grouped before it is summed to introduce parallelism into the loop body. The generated code uses the XMM (or YMM) registers, each of which is large enough to pack four (or eight) ints, to decrease the number of instructions used to load entries of a[] into the registers. Why is the access time here so much smaller than the latency to DRAM? One reason is that a cache line is large enough to hold 16 ints, and only one of 16 accesses of the array is fetching data from outside the cache.
Any program that accesses data sequentially with unit stride will benefit from the size of the cache line. To avoid cache hits, the program below accesses only one of every 16 locations.
int stride16(int *a){
	int sum=0;
	for(int i=0; i < 1000*1000*1000; i+=16)
		sum += a[i];
	return sum;
}
In stride16(), the cost of accessing a single entry of the array is about 13 cycles, which is once again much smaller than the latency to DRAM. As many as 10 memory reads can be in flight at the same time. Strided accesses benefit from parallelism in the instruction pipeline and in the memory system. Strided accesses are particularly easy to predict. It is likely that the prefetch engines in the memory controller are able to prefetch many of the locations accessed in stride16() into cache ahead of time.
A more refined attempt follows.
1  //List must be 64 byte aligned
2  void accessList(long int *List, long int n, int count,
3  		double &x){
4  	long int  index = 0;
5  	for(int i=0; i < count; i++){
6  		x += List[index];
7  		index = List[index]%n;
8  		index = index - index%8;
9  	}
10 }
The loop defined in lines 5 through 9 works as follows. It reads the entry List[index] on lines 6 and 7, and the next entry that is read is determined by the earlier entry on lines 7 and 8. It is assumed that all entries of List[] are initialized to be non-negative. The next load instruction cannot be issued until the earlier load instruction is complete because the address of the next load instruction is determined by the result of the earlier load instruction. This way of accessing memory breaks instruction-level parallelism.
The function accessList() is called with n = 10^9 and count = 6,000. The array List[] is initialized with pseudorandom numbers. We verified explicitly (verification not shown here) that the 6,000 entries accessed in the body of the loop were all distinct. Line 8 ensures that every index used to look up List[] is a multiple of 8.
All transfers between DRAM memory and cache are done in cache lines. A single cache line is 64 bytes. The array List is assumed to be 64 byte aligned (the last 6 bits of the address List are zero). Because every index used to look up List[] is a multiple of 8 and a long int is 8 bytes, every memory reference on line 6 brings in a new cache line. No cache line is accessed more than once. The total number of memory accesses is 6,000.
When the cycles consumed by the entire function were divided by 6,000, we got 270 cycles. So is the latency to DRAM memory about 270 cycles? Not really. We were careful to break instruction-level parallelism. We flushed the entire array out of cache (not shown here) before calling the function and respected cache line boundaries. However, we forgot to think about a third element. Memory management is done using pages, page frames, page directories, and page tables. Because the array is so large, every new memory reference is likely to hit a new page. The page table entry for that page is unlikely to be in TLB and may have to be brought in from DRAM memory. What we thought was one memory access is likely to be two or more.
This little example may give us a sense of the complex way in which parts of the memory system interact. Cache line size, parallelism in memory access, and the cost of TLB misses are all of much significance to program optimization.

4.1.5 Latency to DRAM

Our earlier attempts to measure latency to DRAM memory failed because we did not account for the overhead of creating and accessing page table entries. The more careful program in this section breaks instruction-level parallelism, ensures that none of the cache lines accessed is from L1, L2, or L3 cache, and accesses all of the 256 cache lines within four pages of memory (so that TLB misses are not a factor).
To begin with, we look at the function randomp(), which initializes the n entries of the array List[] to be a random permutation of the numbers 0, 1, …, n − 1.
1  void randomp(int *List, int n){
2  	for(int i=0; i < n; i++)
3  		List[i] = i;
4  	for(int i=0; i < n; i++){
5  		int j = rand()%(n-i)+i;
6  		int temp = List[j];
7  		List[j] = List[i];
8  		List[i] = temp;
9  	}
10 }
On lines 2 and 3, the array List[] is initialized to be the identity permutation 0, 1, …, n − 1. The loop from lines 4 through 9 picks j to be a random number from the set i, …, n − 1 and swaps List[i] and List[j] for each value of i from 0 to n − 1. The random number generator rand() used on line 5 is convenient, being part of the standard C libraries, and sufficient for our purposes here. However, faster and more rigorously tested random number generators are available.
The program for measuring latency uses two methods for clearing the array used for measurement from cache memory. If the preprocessor variable MEMWALK is undefined at the top of the file using
#undef MEMWALK
then it uses the CLFLUSH instruction issued using an intrinsic. However, if MEMWALK is defined at the top of the file, it uses the following function:
1  void dummy(double *a, int len){
2  	for(int i=0; i < len; i++)
3  		a[i] = (i%77)*(i*1001);
4  }
A large array a[] is passed to this function, which writes something into every entry on line 3. The array a[] is not used for measuring latency. Its sole purpose is to occupy the cache completely and evict the array used for measuring latency from the cache. [76]  [76] To be certain that the cache is occupied, one has to check the assembly and make sure that instructions such as MOVNTPD, which avoid cache pollution, are not generated. The function dummy() must be defined in a separate compilation unit to prevent the compiler from eliminating the function call.
The function latency(), whose listing follows, touches all the cache lines in four pages of memory and prints an estimate of the latency to DRAM memory.
1  void latency(){
2  	int List[256];
3  	randomp(List,256);
4  	int LList[256];
5  	for(int i=0; i < 256; i++)
6  		LList[List[i]]=List[(i+1)%256];
7  	//int is 4 bytes
8  	__declspec(align(4096)) int FourPages[4096];
9  	for(int i=0; i < 256; i++)
10 		FourPages[16*i] = LList[i];
11 #ifdef MEMWALK
12 	static double a[1000*1000*100];
13 	dummy(a, 1000*1000*100);
14 #else
15 	for(int i=0; i < 4096; i++)
16 		_mm_clflush(FourPages+i);
17 #endif
18 	int index = 17;
19 	TimeStamp clk;
20 	clk.tic();
21 	for(int i=0; i < 256; i++){
22 		index = FourPages[16*index];
23 	}
24 	double cycles = clk.toc();
25 	cout<<"index = "<<index<<endl;
26 	cout<<"cycles per access = "<<cycles/256<<endl;
27 }
Central to this function are the three arrays List[], LList[], and FourPages[] defined on lines 2, 4, and 8, respectively. The array List[] is set to be a random permutation of the numbers 0, …, 255 on line 3. The array LList[] is initialized using List[], and FourPages[] is initialized using LList[].
The key idea for breaking instruction-level parallelism is to access the entries of the array List[] in the following order:
List[i], List[List[i]], List[List[List[i]]],...
If the entries are accessed in this manner, no access can be initiated before the earlier access is complete. However, there is the problem that one of the entries may cycle back to List[i]. For example, if List[i] is j and List[j] is i, the sequence above will repeat with a period of just 2.
We need a permutation that is one big cycle. Lines 5 and 6 create and initialize the array LList[] to be such a permutation. The sequence
LList[i], LList[LList[i]], LList[LList[LList[i]]],...
will have a period of 256 for any i, 0 ≤ i < 256. However, each int is 4 bytes, and 16 of the entries of LList[] will be in the same cache line. Although the sequence above breaks instruction-level parallelism and generates entries of LList[] in an order that is random enough to preempt cache prefetches, the same cache line is accessed 16 times.
Line 8 declares FourPages[] to be an array of 4,096 ints. Because each int is 4 bytes, the array is four pages long. The declaration qualifier on line 8 (which is valid only with the icpc compiler) ensures that FourPages[] is 4,096-byte aligned or page aligned. Every 16th entry of FourPages[] is set to an entry of LList[] on line 10. The loop from lines 21 to 23 accesses only those entries of FourPages[] whose index is a multiple of 16. Every cache line is accessed only once.
Before the entries are accessed and the program is timed, the array FourPages[] must be evicted from cache. Lines 12 and 13 define a large array and write to each entry of the array and indirectly remove FourPages[] from cache. Lines 15 and 16 use the CLFLUSH instruction to explicitly flush the cache lines of FourPages[] from cache memory.
If the program is correct, line 25 should print index to be 17, the same value it was assigned on line 18. Because the permutation has period 256, the last index generated by the for-loop on lines 21 to 23 must equal the first index. The print statement on line 25 forces the compiler to generate code for the entire program. Otherwise, the compiler can easily figure out that the program is doing nothing useful and ignore all the trouble we have taken to set up FourPages[] and then cycle through it.
# of pages mwalk clflush
4 122 94
8 126 93
16 130 95
32 148 100
40 247 248
60 247 254
1000 251 257
10000 257 256
100000 265 263
Table 4.3 Latency to DRAM memory on a 3.6 GHz AVX2 machine (see table 9.1↓ for the full name of the machine). Latencies on older SSE2 or AVX machines are similar. For example, on a 2.6 GHz SSE2 machine, the latency is around 100 cycles when the number of pages is 16 or fewer and increases to 180 thereafter. Data is cleared from cache by either accessing a long array (mwalk) or using the cache flush instruction (clflush).
Table 4.3↑ reports several measurements of latency. The program was modified to be able to handle 4n pages for n ≥ 1. The numbers reported are medians obtained from a large number of measurements.
Table 4.3↑ shows that the measured latency depends on the number of pages accessed. On an AVX2 machine, the latency is of the order of 100 cycles when the number of pages accessed is 32 or less. When the number of pages used in our experiment is 40 or more, the latency jumps to around 250 cycles. We are not certain of the explanation. Our best guess is that the jump in latency may have something to do with the number of arrays in a bank of DRAM memory as well as the number of columns in each array. When the number of pages used is few, it is possible that all the pages map to the same row in a bank of DRAM memory.
Table 4.3↑ reports measurements of latency using two different techniques. In “mwalk,” the array FourPages[] is evicted from cache by writing to some other large array (line 13). In “clflush,” the array is evicted using the CLFLUSH instruction (line 16). When the number of pages is 32 or less, the “mwalk” figure is noticeably higher. That appears to be because writing to some other array to evict FourPages[] from cache implies write-backs of that other array when FourPages[] is used to measure latency. The write-backs are likely to cause row switching within a bank. [77]  [77] To test this hypothesis, we modified “mwalk” to read from a large array rather than write to a large array. Reading does not trigger write-backs. As expected, the “mwalk” figures after this modification are close to the “clflush” figures.
Exercise:  What is the size of DRAM memory on your machine? Write a simple C program to find out the maximum amount of memory that can be allocated with malloc() or _mm_malloc() on your machine. Does that limit depend on how many other programs are running on your system and how much memory they are using?
Exercise:  How many DRAM arrays constitute a bank in your machine’s memory? Is your memory DDR3, DDR5, or DDR with some other prefetch parameter? How many bytes of data are transferred between DRAM and the memory controllers after a single load instruction?
Exercise:  Use the CPUID instruction to determine the cache and TLB parameters on your machine.
Exercise:  Why do processors prefer to have separate L1 caches for instructions and data? Similarly, what may be desirable in having separate level 1 TLBs for instruction and data?
Exercise:  Upon malloc(), virtual memory is allocated, but the pages of virtual memory are not mapped to page frames in physical memory. Therefore, the first access of every page triggers a page fault. Write a C program to demonstrate this phenomenon and determine the cost of invoking the page fault handler.
Exercise:  In section 4.1.4↑, we determined the cost per access of an int assuming sequential access, strided access that reads just one int per cache line, and a more complex pattern that triggers TLB misses. Repeat these measurements on your machine. Determine the cost per access if each access is a write instead of a read.
Exercise:  Measure the latency to L1, L2, and L3 caches. [78]  [78] For latency and bandwidth to DRAM memory as well as caches, see [36] and [41].
Exercise:  Modify the program to measure latency given in section 4.1.5↑ so that the four pages are not contiguous in virtual memory. How does that affect the measurement?
Exercise:  The function latency() prints the final value of index as well as the measured latency. Modify the program so that neither quantity is printed within the function itself. Does the change make a difference to the measured latency?
Exercise:  Write a program that measures the latency of writes to memory and another program that measures the latency to memory when writes and reads are intermixed.
Exercise:  Does the validity of the program to measure latency given in section 4.1.5↑ depend on whether the memory is DDR3 or DDR5? Does it depend on the number of arrays in a bank?

4.2 Optimizing memory access

In this section, we look at optimization of memory access using three examples. The first example, in section 4.2.1↓, is to simply access a long array of numbers to sum or to copy. With this simple example, we learn what may be the most important lesson related to memory access, which is to utilize each cache line as fully as possible.
The examples in sections 4.2.2↓ and 4.2.3↓ are more involved. Although cache lines are the units of transfer of data between DRAM and the caches, cache organization involves multiple levels and sets. In every memory access, a virtual address must be translated to a physical address. This translation using the TLB and possibly the page tables can be a source of considerable overhead.
The chief technique in optimizing for multiple cache levels and the TLB is the same, namely, blocking. In section 4.2.2↓, we study blocking using matrix transposition as an example. Section 4.2.3↓ also illustrates blocking in addition to the technique of streaming data from cache to reduce cache and TLB misses.
Fortunately, much of the time we do not need to program in assembly when optimizing memory access. This is partly because the memory system is so complicated that overly refined optimizations do not make sense. Another reason is that the memory system is more amenable to optimization than the instruction pipeline. For the vast majority of nontrivial programs, speed is limited by memory access. Thus, the memory system’s greater amenability to optimization is probably by design. Although it would be incorrect to assume that compilers generate optimal instruction streams, the penalty for suboptimality is not as great.
The design of the memory system consisting of DRAM, memory controllers, and caches is relatively stable across platforms. Therefore, the techniques of optimization may be expected to be the same on graphics devices, non-x86 platforms, and mobile devices.
Our discussion of the memory system in the previous section began with aspects of hardware design and virtual memory and concluded with a measurement of latency. In contrast, through much of this section, the emphasis is on bandwidth to memory. In all the examples discussed in this section, the large latency to memory can be hidden with little effort.
Examples in which latency to memory can be hidden completely are characterized by parallelism in the instruction stream. For example, if a program adds an array of numbers by accessing entries in sequence, the processor can issue multiple loads from memory in parallel. If the memory accesses can be overlapped, the effective latency to memory becomes manageable.
However, if memory accesses cannot be overlapped, the program is exposed to latency to DRAM memory. Our program to measure latency, given in the previous section, is an example where memory accesses cannot be overlapped. In general, memory accesses cannot be overlapped if the location of the next memory access depends on the result of the previous memory access.
Situations in which memory accesses cannot be overlapped are exceedingly common. Such scenarios arise whenever linked lists, trees, or graphs are used to handle dynamic data. Even in these situations, some of the techniques of memory optimization we discuss may still be relevant. For example, if successive items of a linked list are packed closely in memory, some advantage may be derived from caching. Techniques for dynamic data structures are deferred to the exercises. In general, when the use of dynamic data structures cannot be avoided, it is likely that exposure to DRAM latency also cannot be completely avoided.
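The contrast between overlappable and serialized accesses can be made concrete with a small sketch (an illustration of ours, not one of this chapter's measured programs). Summing an array by index lets the processor issue loads ahead of time, while walking a linked chain forces every load to wait for the one before it.

```c
struct node{
	struct node *next;
	double val;
};

/* Independent loads: the address a+i does not depend on a[i-1],
   so the processor can keep many loads in flight at once. */
double sum_array(const double *a, long n){
	double s = 0;
	for(long i = 0; i < n; i++)
		s += a[i];
	return s;
}

/* Dependent loads: the address of each node is known only after
   the previous load completes, exposing the full DRAM latency. */
double sum_chain(const struct node *p){
	double s = 0;
	while(p){
		s += p->val;
		p = p->next;
	}
	return s;
}
```

Both functions touch the same number of doubles, but if the nodes of the chain are scattered across DRAM, sum_chain() pays close to the full latency on every step.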

4.2.1 Bandwidth to DRAM

The most predictable and common pattern of memory access is to access a long line of data in sequence. Every memory controller is likely to be optimized to handle that pattern efficiently. Thus, to determine bandwidth to memory, we will access a long array in sequence.
The following simple function returns the sum of an array of doubles:
double sum(double *restrict a, long int n){
	double s = 0;
	for(long int i=0; i < n; i++)
		s += a[i];
	return s;
}
The loop body uses XMM/YMM registers, as we may verify by inspecting the assembly code. The single statement in the loop has a loop-carried RAW dependency. Nevertheless, the simple structure of the loop helps the compiler unroll the loop and introduce parallelism in the loop body. The entries of the array a[] are read from memory in parallel. The additions do not introduce an overhead above the time it takes to read the array from memory.
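The parallelism the compiler introduces can be mimicked by hand. The sketch below (ours, for illustration; n is assumed divisible by 4) splits the sum across four accumulators so that the four additions within an iteration are independent of one another, breaking the loop-carried dependency on a single accumulator.

```c
double sum_unrolled(const double *restrict a, long n){
	/* Four independent partial sums: the RAW dependency is now
	   per accumulator, so four additions can overlap. */
	double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
	for(long i = 0; i < n; i += 4){
		s0 += a[i];
		s1 += a[i+1];
		s2 += a[i+2];
		s3 += a[i+3];
	}
	return (s0 + s1) + (s2 + s3);
}
```

Because floating point addition is not associative, this computes a slightly different rounding of the sum, which is one reason compilers may refrain from this transformation unless directed.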
The program was called with an array a[] that was 8 GB. The array was initialized in the same sequence it is summed. It is important to initialize the array before the function sum() is called. Pages of virtual memory are mapped to page frames of physical memory only at first access. If the array is not initialized at all, the mapping of virtual memory to physical memory takes place when sum() executes.
The cache is too small to hold 8 GB data. So none of the load instructions will hit the cache. Every byte accessed by sum() must be loaded from memory.
Because each double is 8 bytes, if the function sum() takes c cycles to execute, we may take the bandwidth to memory to be 8n ⁄ c bytes per cycle. The bandwidth to memory on a 2.66 GHz SSE2 machine (see table 9.1↓ for the full names of the machines) was measured to be 4.36 bytes per cycle or, equivalently, 11.6 GB/s. On a more recent 2.20 GHz AVX machine, the bandwidth was 5.40 bytes per cycle or 11.9 GB/s. On a yet more recent 3.6 GHz AVX2 machine, the bandwidth was 4.85 bytes per cycle or 17.46 GB/s.
We use strided memory accesses to lead up to what is perhaps the single most important item a programmer should know about accessing memory. The following function sums entries of the array a[], beginning with the zeroth entry and in steps of length stride:
double sumstride(double *restrict a, long int n, 
		int stride){
	double s = 0;
	for(long int i=0; i < n; i+=stride)
		s += a[i];
	return s;
}
The bandwidth to memory realized in strided access is
(8n)/(c × stride)
assuming the function takes c cycles.
Before we begin making measurements with non-unit strides, it is a good idea to try stride=1. It turns out that sumstride() realizes a bandwidth of only 2.4 bytes per cycle on the SSE2 machine for unit stride, which is well short of the 4.36 bytes per cycle realized by sum(). Evidently, the compiler is not optimizing as well when the stride is passed as a parameter.
We make the stride a const int to coax the compiler to optimize better. For unit stride, we use the definition
const int STR=1;
To make the stride equal to 8, we modify the definition to
const int STR=8;
The function sumconststride() given below uses STR as its striding parameter.
double sumconststride(double *restrict a, long int n){
	double s = 0;
	for(long int i=0; i < n; i+=STR)
		s += a[i];
	return s;
}
Because the compiler knows the numerical value of the stride, it can optimize the loop much better.
Stride   Read   Copy
   1     4.36   3.53
   2     2.25   1.83
   4     1.21   0.92
   8     0.80   0.60

Table 4.4 Bandwidth to memory in bytes per cycle on a 2.6 GHz SSE2 machine. The measured read and copy bandwidths are 5.3 bytes per cycle and 2.4 bytes per cycle on a 2.2 GHz AVX machine. The read and copy bandwidths are both 4.85 bytes per cycle on a 3.6 GHz AVX2 machine (see table 9.1↓ for the full names of the machines).
Table 4.4↑ gives the measured bandwidth for strides equal to 1, 2, 4, and 8. The bandwidth is nearly halved every time the stride is doubled. This behavior is easily explained. A cache line is equal to 64 bytes or 8 doubles in size. All traffic between DRAM and the processor packages is cache line by cache line. When we stride by 2, 4, or 8, we utilize only a half, a quarter, or an eighth of every cache line that is brought in. The single most important memory optimization is to ensure that a cache line is utilized as fully as possible.
Bandwidth to memory depends on the type of access. Typically, reads are faster than writes to memory. The functions copy() and copyconststride() are used to measure bandwidth to memory when one array is copied into another array.
void copy(double *restrict a, double *restrict b, 
	  long int n){
	for(long int i=0; i < n; i++)
		b[i] = a[i];
}
​
void copyconststride(double *restrict a, 
		     double *restrict b, long int n){
	for(long int i=0; i < n; i+=STR)
		b[i] = a[i];
}
If copyconststride() takes c cycles, the bandwidth realized is
(16n)/(c × stride).
The factor 16 in the numerator accounts for copying an 8-byte-long double to another double. The bandwidth realized when one array is copied to another is listed in table 4.4↑.
Typically, bandwidths for copying and writing are lower than that of reading. In table 4.4↑, the copying bandwidth (with stride 1) is 80% of the read bandwidth. In simple situations, there are ways to approach the read bandwidth more closely, however. One may invoke special instructions to get around the write-back cache policy. In fact, on a 3.6 GHz AVX2 machine, the compiler invokes a runtime library function for copying, and the copy and read bandwidths are both 4.85 bytes per cycle.
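The special instructions alluded to are non-temporal (streaming) stores, which write full cache lines to memory without first reading them in. The sketch below is ours, not the book's: it uses the SSE2 intrinsic _mm_stream_pd(), assumes both arrays are 16-byte aligned and n is even, and may or may not beat an ordinary copy depending on the machine.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Copy a[] to b[] with non-temporal stores. A plain store to b[i]
   first brings the cache line of b[] in from memory (write-allocate);
   _mm_stream_pd() bypasses the cache and avoids that read traffic. */
void copystream(const double *restrict a, double *restrict b, long n){
	for(long i = 0; i < n; i += 2){
		__m128d x = _mm_load_pd(a+i);   /* 16-byte aligned load */
		_mm_stream_pd(b+i, x);          /* non-temporal store   */
	}
	_mm_sfence();  /* order the streamed stores before later accesses */
}
```

Streaming stores pay off only when the destination will not be read again soon; if b[] is reused immediately, bypassing the cache is counterproductive.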

4.2.2 Matrix transpose

The only cache parameter of significance for strided accesses studied in the last section was cache line size. The sizes of the caches, their organization into sets, as well as the size of the TLB influence the performance of the example studied in this section.
The example we study here is out-of-place matrix transpose. It is always best to access data sequentially with unit stride, but when a matrix is transposed to another matrix stored in the same column-major format, there is no way to access both matrices with unit stride. In many problems of this type, it is useful to break up the data into blocks.
When a single array is accessed with a constant stride, a cache line brought into cache is used in a single go. Once the array moves past a cache line, we do not return to it. In matrix transpose with blocking, we rely on a cache line remaining in cache as we return to it repeatedly after working on other columns of the matrix block. This type of cache usage is more delicate. We find that the performance of the matrix transpose depends in a nonmonotonic manner on the size of the blocks. The performance degrades abruptly if the leading dimension of the matrix is divisible by a high power of 2. Such effects, though disconcerting to the programmer, cannot be eliminated.
The cache and TLB parameters of SSE2/AVX/AVX2 machines we use are given in tables 4.1↑ and 4.2↑. On all the machines, the L1 data cache is big enough to hold 4,000 doubles and the L3 cache can hold more than a million double-precision numbers.
Blocking
The function easytrans() listed below uses a simple doubly nested loop to transpose the matrix a[] to the matrix b[]. The matrices are of dimension m × n and n × m. Both of them are assumed to be stored in column-major order with leading dimension (see section 2.2.1↑) equal to the number of rows.
void easytrans(double *restrict a, double *restrict b,
	       int m, int n){
	for(int i=0; i < m; i++)
		for(int j=0; j < n; j++)
			b[j+i*n] = a[i+j*m];
}
This function is easy to write and easy for the compiler to analyze. The array references use indices that are linear combinations of the loop variables, and the loop variables are incremented in steps of 1. The assembly code generated by the compiler is far more complicated than the code presented to it. The total number of double-precision numbers accessed by this program is 2mn. Because each double is 8 bytes, the bandwidth to memory realized is
(8mn)/(cycles for a single transpose)
bytes per cycle. The compiler-optimized easytrans() realizes a bandwidth of 1.48 bytes per cycle when m = 20,000 and n = 30,000. The best bandwidth we could have hoped for is 3.53 bytes per cycle, which is the bandwidth realized when an array is copied to another array with unit stride (see table 4.4↑). Despite compiler optimization, the realized bandwidth falls well short of that mark.
The function blocktransx() listed below implements matrix transpose block by block. The block size is B × B, and B is defined as a const int (definition is not shown). The matrix dimensions are assumed to be divisible by B. The function blocktransx() uses a nest of four loops.
1  void blocktransx(double *restrict a, double *restrict b, 
2                   int m, int n){
3    assert((m%B==0)&&(n%B==0));
4    for(int i=0; i < m; i+=B)
5      for(int j=0; j < n; j+=B)
6        for(int ii=0; ii < B; ii++)
7          for(int jj=0; jj < B; jj++)
8            b[j+jj+(i+ii)*n] = a[i+ii+(j+jj)*m];
9  }
When writing functions such as blocktransx(), it helps to think directly in terms of the for-loop construct of C/C++.
The loop variable i defined on line 4 steps through the rows of the m × n matrix stored in the array a[] in steps equal to the block size B. The loop variable ii defined on line 6 steps through the B rows of a single block of rows. Thus, i+ii is the index---relative to the m × n matrix---of row ii within the block of B rows from i to i+B-1. Similarly, j+jj is the index---relative to the matrix a[] as a whole---of column jj within the block of B columns from j to j+B-1. The loop body of blocktransx(), which consists of the single statement
b[j+jj+(i+ii)*n] = a[i+ii+(j+jj)*m];
corresponds to the statement b[j+i*n]=a[i+j*m];, which comprises the loop body of easytrans().
It is significant that the block size parameter B is a const int. Knowledge of the numerical value of B allows the compiler to optimize better.
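The blocked transpose must of course produce the same result as the simple one. The following self-contained sketch (ours, not the chapter's benchmark code) repeats both functions at toy size, with B = 4 and an 8 × 12 matrix, so that the agreement can be checked; the _small suffixes are our own naming.

```c
const int B = 4;

void easytrans_small(double *restrict a, double *restrict b, int m, int n){
	for(int i=0; i < m; i++)
		for(int j=0; j < n; j++)
			b[j+i*n] = a[i+j*m];
}

void blocktransx_small(double *restrict a, double *restrict b, int m, int n){
	for(int i=0; i < m; i+=B)
		for(int j=0; j < n; j+=B)
			for(int ii=0; ii < B; ii++)
				for(int jj=0; jj < B; jj++)
					b[j+jj+(i+ii)*n] = a[i+ii+(j+jj)*m];
}

/* Returns 1 if the two transposes agree on an m x n test matrix
   (m*n at most 96; m and n assumed divisible by B). */
int transpose_agree(int m, int n){
	double a[96], b1[96], b2[96];
	for(int k=0; k < m*n; k++)
		a[k] = k;
	easytrans_small(a, b1, m, n);
	blocktransx_small(a, b2, m, n);
	for(int k=0; k < m*n; k++)
		if(b1[k] != b2[k])
			return 0;
	return 1;
}
```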
The bandwidth to memory realized by blocktransx() is given in table 4.5↓ for block sizes from B = 8 to B = 1000. Once again, m = 20,000 and n = 30,000. From the columns headlined “nest,” we may observe that the realized bandwidth is worse when B ≤ 25 than it is for easytrans(). The highest bandwidth of 2.60 bytes per cycle is realized when B = 125. The bandwidth to memory begins to degrade as B is increased.
While the optimal block size from the table is B = 125, the degradation is more severe for small block sizes than for larger ones. Block sizes B ≤ 25 realize lower bandwidth to memory than easytrans(). Small block sizes inhibit the compiler from optimizing the inner loops, lowering the realized bandwidth. When the block size is large, the compiler optimizes the inner loops and generates code that is considerably different from what is presented to it.
Why does blocking improve memory bandwidth? First, it helps with reuse of cache lines. Suppose the blocks are small enough to fit into cache. Then every cache line in a block is fetched from memory only once if we ignore cache conflicts and misalignment.
Second, blocking helps reduce TLB misses. A single page is 4,096 bytes. Therefore, if m ≥ 512, each entry of a row will be in a different page. Thus, if the size of a row is greater than the number of TLB entries, row-by-row traversal of the entire matrix will trigger TLB misses for every entry. Blocking can eliminate TLB misses by limiting the number of entries in a row.
The function of blocktransx() uses only one level of blocking. Because the caches and the TLB are organized hierarchically in multiple levels, it is natural to wonder whether multiple levels of blocking may bring some advantage. In this setting, recursive blocking, which is described in the exercises and which uses multiple levels of blocking, appears to be no better (see table 4.5↓) and does not make it easier to find an optimal block size.
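Recursive blocking splits the matrix in half along its larger dimension until the pieces are small, at which point a simple transpose is applied; because the halving produces blocks at every scale, one recursion adapts to all cache levels at once. The sketch below is a minimal cache-oblivious version of ours, not the implementation described in the exercises; the cutoff of 32 is an arbitrary choice.

```c
/* Transpose the m x n submatrix a (column-major, leading dimension lda)
   into the n x m submatrix b (column-major, leading dimension ldb). */
void rtrans(double *restrict a, double *restrict b,
            int m, int n, int lda, int ldb){
	if(m <= 32 && n <= 32){
		for(int i=0; i < m; i++)
			for(int j=0; j < n; j++)
				b[j+i*ldb] = a[i+j*lda];
		return;
	}
	if(m >= n){  /* split the rows; they become columns of b */
		rtrans(a, b, m/2, n, lda, ldb);
		rtrans(a+m/2, b+(m/2)*ldb, m-m/2, n, lda, ldb);
	}
	else{        /* split the columns; they become rows of b */
		rtrans(a, b, m, n/2, lda, ldb);
		rtrans(a+(n/2)*lda, b+n/2, m, n-n/2, lda, ldb);
	}
}
```

Passing leading dimensions explicitly lets the recursion operate on submatrices in place, without copying.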
  B   Nest   Recurse      B    Nest   Recurse
  8   0.90   0.80         80   2.37   2.44
 10   0.96   0.98        100   2.37   2.46
 20   1.33   1.56        125   2.60   2.51
 25   1.47   1.55        200   2.12   2.13
 40   1.86   1.97        500   2.02   2.02
 50   2.06   1.93       1000   1.88   1.88
Table 4.5 Bandwidth to memory in bytes per cycle for matrix transpose using B × B blocks on a 2.6 GHz SSE2 machine (see table 9.1↓ for the full name of the machine). One-level blocking using a loop nest is compared with recursive blocking (recursive blocking is described in the exercises). With no blocking, the bandwidth realized is 1.48 bytes per cycle. The matrix had 20,000 rows and 30,000 columns.
The best bandwidth for transposing in table 4.5↑ is 2.60 bytes per cycle. On the 2.6 GHz SSE2 machine, that is 60% of the peak read bandwidth. In contrast, on a 3.6 GHz AVX2 machine, the best bandwidth for transposing was 1.96 bytes per cycle, which is about the same as SSE2 in GB/s and which is only 40% of the read bandwidth.
The reason for poorer performance on the modern AVX2 machine is not entirely clear. Generating good code for the inner block is crucial in this example, and the compiler does not appear to do anything special beyond translating the code as it is written. If the compiler were doing a good job, one should see movpd instructions (pd stands for packed double), which use the entire capacity of the YMM registers, in the innermost loop. Instead, one sees movsd instructions (sd for single double), which use only a quarter of the YMM registers. Thus, the poorer performance could be a consequence of suboptimal compilation, a phenomenon we encountered several times in the previous chapter and whose probability increases on more recent hardware.
Leading dimension divisible by a high power of 2
  B   Nest   Recurse   Recurse/loop
  8   0.84   0.92      0.95
 16   1.30   1.32      1.34
 32   1.17   1.22      1.22
 64   1.17   1.18      1.18
128   0.58   0.58      0.58
Table 4.6 Bandwidth in bytes per cycle (on a 2.6 GHz SSE2 machine). The transposed matrix was square and of dimension 2¹⁴ = 16,384. One level of blocking using a loop nest is compared against recursive blocking implemented using explicit recursion or loops (recursive blocking is discussed in the exercises). Transpose with no blocking had a bandwidth of only 0.43 bytes per cycle. Compare with table 4.5↑.
If two locations in memory are separated by a high power of 2, they are likely to map to the same set in cache and TLB. Table 4.6↑ shows the bandwidth realized when a matrix of dimension 2¹⁴ × 2¹⁴ is transposed. Successive entries in the same row of this matrix are separated by 2¹¹ cache lines. All entries of a row map to the same set in L1 and L2 caches as well as level 1 and level 2 TLB.
For a matrix of dimension 2¹⁴ × 2¹⁴, easytrans() realizes a bandwidth of 0.43 bytes per cycle, which is less than a third of the bandwidth it realizes for a matrix of dimension 20,000 × 30,000. With blocking, the best bandwidth observed is 1.34 bytes per cycle and the optimal block size is 16 × 16. TLB and cache misses occur more frequently for the matrix of table 4.6↑ because the leading dimension is divisible by a high power of 2. As a result, the observed bandwidth is 35% of the best possible instead of 60%, as in the case where the matrix dimensions are not divisible by high powers of 2.

4.2.3 Optimized matrix multiplication

Some of the most nettlesome issues in implementing matrix multiplication arise at the level of the processor pipeline. In section 3.3.4↑, we wrote an assembly program for 4 × 200 × 4 matrix multiplication, which reached 3.5 flops per cycle, against the theoretical limit of 4.0 flops per cycle, assuming all the matrices to be in cache. Here we assume the matrices to be in DRAM memory and not in cache, and we show how to optimize matrix multiplication.
Suppose A, B, and C are matrices of dimensions ℓ × m, m × n, and ℓ × n, respectively. The matrices are assumed to be in DRAM memory. The cost of the operation C = C + AB is 2ℓmn arithmetic operations, half of which are additions and half of which are multiplications. If the cache were large enough, each of the matrices could be loaded into cache and kept there as the matrix multiplication is performed. Loading the matrices into cache would take ℓm + mn + ℓn DRAM memory accesses. If ℓ, m, and n are large, the number of arithmetic operations is much greater than the number of memory accesses. We may expect the cost of the computation to be dominated by the arithmetic operations, allowing us to approach the peak bandwidth of 4 flops per cycle on a single core of an SSE2 machine (the figures are 8 flops per cycle and 16 flops per cycle for AVX and AVX2 machines, respectively).
We want the matrices to be big so that the arithmetic operations are far more numerous than memory accesses, but the catch is that the matrices will not fit into cache when they are too big. One way to overcome this dilemma is to use block matrix multiplication. We can pick the block sizes to be small enough to fit into cache but large enough that the cost of loading from memory is outweighed by the cost of arithmetic operations. Careful blocking would indeed improve the simple programs of section 3.2.5↑ but not enough to get anywhere close to peak bandwidth. A much more powerful set of ideas [79]  [79] [40]. shows how a program for multiplying matrices can approach the peak bandwidth for floating point arithmetic.
In outline, the basic idea remains to multiply in blocks, but intermediate blocks are stored in scratch space and in convenient formats to minimize TLB and cache misses. As far as possible, data is stored in a format that enables sequential access with unit stride. The actual execution of this idea can make it look more complicated than it is, but the idea is elegant as well as possibly applicable to many other problems.
As shown in table 4.7↓, our implementation progresses systematically from the 4 × 200 × 4 microkernel described in section 3.3.4↑ to the multiplication of square matrices of dimension 9,000. We code a hierarchy of matrix multiplication functions with a function corresponding to each row of the table. Square matrices of dimension 9,000, which occur in the last row, are too large to fit into cache memory. Limiting ourselves to matrices of specific dimensions keeps the exposition tractable.
Matrix dimensions      Flops per cycle
4 × 200 × 4            3.48
4 × 200 × 12           3.32
600 × 200 × 12         3.22
600 × 200 × 3000       3.21
3000 × 200 × 3000      3.19
9000 × 9000 × 9000     3.19
Table 4.7 Bandwidth in flops per cycle on a 2.6 GHz SSE2 machine for the multiplication of an ℓ × m matrix with an m × n matrix. Matrix dimensions are reported as ℓ × m × n.
The square matrices of dimension 9,000 at the bottom of table 4.7↑ are stored in column-major order. However, the storage format of the matrices in every other row is different. The storage format for the 4 × 200 × 4 microkernel was assumed to be such as to lead to a high throughput of arithmetic operations. For the following rows of the table, the storage formats are chosen to allow for convenient and efficient invocation of the function that corresponds to the preceding row of the table.
The matrices are denoted using capitalized letters when they are in column-major order with a leading dimension that may exceed the number of rows. Lowercase letters are used for other storage formats.
So, for example, the arrays that store the arguments to the 4 × 200 × 4 microkernel are a[], b[], and c[]. The microkernel interprets 4 × 200 × 4 matrix multiplication as the sum of 200 outer products of 4 × 1 and 1 × 4 matrices. Accordingly, the array a[] is assumed to store a 4 × 200 matrix in column-major order. The array b[] is assumed to store a 200 × 4 matrix in row-major order. In both cases, the leading dimension is equal to 4. The ith outer product accesses the entries
a[4i, …, 4i + 3]  and b[4i, …, 4i + 3].
The array c[] is of length 16 and stores a 4 × 4 matrix. In section 3.3.4↑, we assumed that the storage format of c[] was column-major with “skewing.” The skewing can be undone using a single step of unskewing, which introduces only an insignificant overhead at the end. Therefore, we ignore it, although skewing percolates down the rows of table 4.7↑, implying that the matrix C of dimension 9,000 corresponding to the last row of the table needs to be unskewed. Thus, the array c[] is assumed to store a 4 × 4 matrix in column-major format (with skewing, which we ignore here) with leading dimension equal to 4.
As we progress down the rows of table 4.7↑, the implementation of each row reuses the lower case letters a, b, c, assuming whatever format is most suitable for its purposes.
Closely related to storage formats is the use of work space. To allow for changes in storage format, the matrix multiplication functions use extra memory stored in an array called scratch[]. The size of this array in its original incarnation is
600 × 12 + 600 × 200 + 200 × 3000
double-precision numbers. The three terms correspond to c[], a[], and b[], respectively. The multiplication of square matrices of dimension 9000 is partitioned repeatedly into multiplications of lower dimensional matrices. The array scratch[] changes at every level in this hierarchy.
As we step through the design, it is helpful to keep a few numbers in mind. The L1 cache is 32 KB and big enough to hold 4,000 double-precision numbers. The L3 cache is 12 MB and big enough to hold 1.5 million double-precision numbers. The second-level TLB has 512 entries (see tables 4.1↑ and 4.2↑). To keep the design simple, we ignore the L2 cache and the first level TLB.
In section 4.2.1↑, we found that it takes about two cycles to load a single double from memory at peak bandwidth. The changes in storage formats that occur, when we begin with the microkernel and increase the dimensions of the matrices that are multiplied, are too complicated to permit access to memory at peak bandwidth. It is perhaps reasonable to take the cost of accessing a double from DRAM memory as 10 cycles. At peak floating point bandwidth, the cost of a single arithmetic operation is 1 ⁄ 4 cycle on an SSE2 machine. To approach peak floating point bandwidth, we should perhaps expect to perform a few hundred arithmetic operations for every double loaded from DRAM memory.
4 × 200 × 12
The microkernel of section 3.3.4↑ performs a 4 × 200 × 4 multiplication and reaches 3.48 flops per cycle, assuming all matrices to be in L1. The 4 × 200 × 12 multiplication function that follows is written under the assumption that a[] will need to be loaded from L3 cache. To pay for the cost of loading a[] from L3, it reuses a[] three times.
void mult4x200x12(double *a, double *b, double *c){
	asm4x200x4(a, b, c);
	asm4x200x4(a,b+800,c+16);
	asm4x200x4(a,b+1600,c+32);
}
In the mult4x200x12() function, the arrays a[], b[], and c[] are of lengths 800, 2,400, and 48, respectively. The L1 cache is big enough to hold all three of them. In the next step, this function is called repeatedly with the same b[] but with a[] and c[], which change with iteration. Therefore, we assume that b[] is in L1 cache.
The function mult4x200x12() assumes a[] to be in column-major format with leading dimension equal to 4---the same as in the microkernel. But b[] has a pretty strange format. The first four columns of the 200 × 12 matrix are stored in b[0..799] in row-major order, the next four columns are similarly stored in b[800..1599], and the final four columns are in b[1600..2399]. The array c[] is in column-major format (except for skewing, which we are ignoring).
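Packing b[] into this format from a 200 × 12 matrix B stored in column-major order can be sketched as follows. The helper packB() is a hypothetical function of ours, written by analogy with the chapter's packing routines; group g of four columns lands in b[800g..800g+799], with each 200 × 4 group stored row by row.

```c
/* Pack the 200 x 12 column-major matrix B (leading dimension ldB)
   into the format mult4x200x12() expects for b[]: three groups of
   four columns, each group a 200 x 4 matrix in row-major order.
   The array b[] must have room for 2400 doubles. */
void packB(const double *B, int ldB, double *b){
	for(int g=0; g < 3; g++)            /* group of four columns */
		for(int i=0; i < 200; i++)      /* row within the group  */
			for(int jj=0; jj < 4; jj++) /* column within the group */
				b[800*g + 4*i + jj] = B[i + (4*g + jj)*ldB];
}
```

Row-major storage within each group is what lets the microkernel read the four entries b[4i..4i+3] of outer product i with unit stride.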
How much is lost when a[] and c[] are loaded from L2 or L3 cache? Table 4.7↑ shows that the floating point bandwidth drops from 3.48 to 3.32. The following function run4x200x12() was used to figure out that answer. A number of points about timing functions such as mult4x200x12() are made in the discussion after the listing.
 1  void run4x200x12(){
 2    __declspec(align(16)) double a[800*150];
 3    __declspec(align(16)) double b[2400];
 4    __declspec(align(16)) double c[48*150];
 5    TimeStamp clk;
 6    double cycles;
 7    for(int i=0; i < 2400; i++)
 8      b[i] = rand()*1.0/RAND_MAX;
 9    for(int i=0; i < 48*150; i++)
10      c[i] = 0;
11    for(int i=0; i < 800*150; i++)
12      a[i] = rand()*1.0/RAND_MAX;
13    clk.tic();
14    for(int i=0; i < 1000*1000*10; i++)
15      mult4x200x12(a+800*(i%150), b, c+48*(i%150));
16    cycles = clk.toc();
17    cycles = cycles/1e7;
18    cout<<"flops per cycle = "<<2.0*16*200*3/cycles<<endl;
19  }
The arrays a[], b[], and c[] defined on lines 2, 3, and 4, respectively, are 16-byte aligned. The 4 × 200 × 4 microkernel requires that its arguments be 16-byte aligned, and we must be careful to respect that requirement.
The array a[] is too large to fit into L1 or L2 cache but fits comfortably into L3 cache. It is big enough to hold 150 matrices of size 4 × 200. The array c[], which holds 150 matrices of size 4 × 4, fits into L2 cache but not L1.
On lines 8 and 12, the arrays b[] and a[] are initialized with random numbers. Initializing with simple numbers such as 0 or 1 gives timing numbers that are misleading and not reproducible.
The mult4x200x12() function is called 10⁷ times on line 15. The function calls cycle through the 150 possibilities for the first argument so that the function has to load a[] and c[] from L2 or L3 cache after each call. This pattern of function calls imitates the manner in which the function is called in the next stage.
600 × 200 × 12
The mult600x200x12() function requires scratch space of 600 × 12 to store c[], which is then unpacked to the array C[] in column-major format with the right leading dimension.
void mult600x200x12(double *a, double *b, 
		    double *C, int ldC,
		    double *scratch){
	double *c = scratch;
	for(int i=0; i < 7200; i++)
		c[i] = 0;
	for(int  i=0; i < 150; i++)
		mult4x200x12(a+i*800, b, c+i*48);
	unpackC(c, C, ldC);
}
The array a[] stores a 600 × 200 matrix. The 600 × 200 matrix is thought of as 150 matrices of dimension 4 × 200 one below the other. These 150 submatrices must be lined up in a[], with each submatrix stored in the column-major format required by mult4x200x12().
The array c[] stores a 600 × 12 matrix. The 600 × 12 matrix is thought of as 150 matrices of dimension 4 × 12 one below the other, and these submatrices are lined up in c[] with each submatrix in the format required by mult4x200x12().
The function mult600x200x12() makes 150 calls to mult4x200x12(). At the end it unpacks c[]. The function unpackC() listed below converts c[] to the column-major format of C[].
void unpackC(double *c, double *C, int ldC){
	for(int i=0; i < 600; i+= 4)
		for(int ii=0; ii < 12; ii++)
			for(int iii=0; iii<4; iii++)
				C[i+iii+ii*ldC] += c[12*i+4*ii+iii];
}
Deciphering this triply nested loop is an exercise we omit. The main point is that the cost of unpacking, whatever it may be, is far less than the savings obtained in calling mult4x200x12() 150 times with c[] in a format that permits sequential access.
600 × 200 × 3000
The earlier function for 600 × 200 × 12 multiplication assumes a[] to be in packed format. The 600 × 200 entries of a[] store 150 submatrices of dimension 4 × 200. Each submatrix occupies 800 contiguous entries and is in column-major order. The function packA() packs A[], which is in column-major format, into this format.
void packA(double *A, int ldA, double *a){
	for(int j=0; j < 200; j++)
		for(int i=0; i < 150; i++)
			for(int ii=0; ii < 4; ii++)
				a[i*800+ii+j*4] = A[4*i+ii+j*ldA];
}
Deciphering this triply nested loop is another exercise we omit.
The function mult600x200x3000() reuses a[] as soon as it is packed by making 250 calls to the function at the previous stage in our design, thus more than making up for the cost of packing. It claims 600 × 200 entries of scratch[] to store the packed array.
void mult600x200x3000(double *A, int ldA,
		      double *b,
		      double *C, int ldC,
		      double *scratch){
	double *a = scratch;
	scratch += 600*200;
	packA(A, ldA, a);
	for(int i=0; i < 250; i++)
		mult600x200x12(a, b+i*2400, C+i*12*ldC, ldC, 
			       scratch);
}
For every double entry in the packed array a[] of size 600 × 200, the 600 × 200 × 3000 multiplication function performs 6,000 arithmetic operations (2 × 600 × 200 × 3000 operations spread over 600 × 200 entries). We may expect the cost of the arithmetic to greatly outweigh the cost of packing a[]. The multiplication function has floating point performance of 3.21 flops per cycle (see table 4.7↑). As expected, the decline from 3.22 flops per cycle yielded by the previous stage is marginal.
3000 × 200 × 3000
The earlier function mult600x200x3000() assumes that the array b[] stores a 200 × 3000 matrix in packed format. The matrix is thought of as 250 matrices of dimension 200 × 12 next to each other. Each of these 200 × 12 matrices is stored in the packed format required by the 4 × 200 × 12 multiplication function.
The function packB() packs a submatrix of B[], which is in column-major format, into b[].
void packB(double *B, int ldB, double *b){
	for(int j=0; j < 750; j++)
		for(int i=0; i < 200; i++)
			for(int jj=0; jj < 4; jj++){
				b[jj+i*4+j*800] = B[i+(4*j+jj)*ldB];
			}
}
Deciphering this triply nested loop is yet another exercise we omit.
The function mult3000x200x3000() reuses b[] five times as soon as it is packed, thus making up for the cost of packing. It claims 200 × 3000 entries from scratch[] for the packed array. The packed array fits comfortably in L3 cache.
void mult3000x200x3000(double *A, int ldA,
                       double *B, int ldB,
                       double *C, int ldC,
                       double *scratch){
  double *b = scratch;
  scratch += 200*3000;
  packB(B, ldB, b);
  for(int i=0; i < 5; i++)
    mult600x200x3000(A+i*600, ldA, b, C+i*600, ldC, 
								scratch);
}
For each of the entries in the packed array, this function performs 6,000 arithmetic operations. Its floating point performance is 3.19 flops per cycle, a marginal decrease from that of the previous stage (see table 4.7↑).
Block multiplication
The function blockmult() listed below carries out an ℓ × m × n multiplication if ℓ, m, and n are multiples of 3,000, 200, and 3,000, respectively.
void blockmult(double *A, double *B, double *C,
               int l, int m, int n,
               double *scratch){
  assert(l%3000==0);
  assert(m%200==0);
  assert(n%3000==0);
  int ldA = l;
  int ldB = m;
  int ldC = l;
  l = l/3000;
  m = m/200;
  n = n/3000;
  for(int i=0; i < l; i++)
    for(int j=0; j < m; j++)
      for(int k=0; k < n; k++){
        double *AA = A + (i*3000)+(j*200)*ldA;
        double *BB = B + (j*200)+(k*3000)*ldB;
        double *CC = C + (i*3000)+(k*3000)*ldC;
        mult3000x200x3000(AA, ldA, BB, ldB, CC, ldC,
                          scratch);
      }
}
This function implicitly partitions A into 3,000 × 200 blocks, B into 200 × 3,000 blocks, and C into 3,000 × 3,000 blocks. The function mult3000x200x3000(), defined at the previous stage, is used to multiply a single block of A by a single block of B and accumulate the result into a single block of C.
The floating point bandwidth realized is 3.19 flops per cycle for the multiplication of square matrices of dimension 9,000 on a 2.6 GHz SSE2 machine. For the same problem, the MKL library realizes more than 3.8 flops per cycle. Why is our design worse? Much of the blame falls on the microkernel. To begin with, the microkernel yields only 3.48 flops per cycle, assuming all its arguments are in L1 cache. On a 3.6 GHz AVX2 machine, on which the microkernel yields 3.98 flops per cycle, the bandwidth realized for the multiplication of square matrices is 3.75 flops per cycle.
Exercise:  The function sum() of section 4.2.1↑ adds an array of numbers in sequence. Assuming the array to be several gigabytes long, explain why the number of cycles consumed by the function depends solely on memory access, with the additions not introducing any extra overhead.
Exercise:  Initialize a long array. Sum its entries with a stride that ensures only one entry is read from a page. What is the peak bandwidth to memory realized? That is the peak bandwidth to DRAM in the presence of TLB misses.
Exercise:  Write a program to measure the bandwidth of writing to DRAM memory. Compare write bandwidth to read and copy bandwidths given in table 4.4↑.
Exercise:  Explain why the copy bandwidth in table 4.4↑ is lower than the read bandwidth.
Exercise:  Write a program that maps an n × n array of numbers to another n × n array, with each entry replaced by the average of north, south, east, and west entries. The arrays are assumed to be in column-major order. Compare the speed of a program that traverses the arrays columnwise with a program that traverses the arrays rowwise. Why do you expect columnwise traversal to be faster? Implement the same computation using blocking. Explain why blocking is likely to help make the program faster.
Exercise:  In recursive transpose [39], if a matrix A to be transposed to a matrix B has more rows than columns, the matrices are split as follows:
A is split rowwise into A1 stacked above A2, and B is split columnwise into ( B1 B2 ).
Recursion is used to transpose Ai to Bi. The case with more columns than rows is handled similarly. Write a program that implements this algorithm.
Exercise:  Rewrite the matrix multiplication programs of section 3.2.5↑ to use blocking. Compare program speeds with and without blocking.
Exercise:  Rewrite matrix multiplication using one level of blocking as in the previous exercise. But this time use scratch space to copy matrix subblocks with the blocks stored in scratch space having leading dimension equal to block dimension. The storage format of the blocks must be row-major or column-major to permit sequential access of entries. Time the program and compare to matrix multiplication with more straightforward blocking.
Exercise:  Write a program to transpose a square matrix in place. Compare program speeds with and without blocking.
Exercise:  Write a program that stores a sequence of double-precision numbers in a linked list. Write a program to sum all the numbers. Find the number of cycles per entry assuming that the linked list has more than 10⁹ entries. Compare program speed in the following two situations: the entries of the linked list are in random locations in memory, and the entries of the linked list are next to each other. Explain what you observe.
Exercise:  A balanced binary tree can be represented in two ways. The first way is to use an array with the convention that the two children of item i are items 2i + 1 and 2i + 2, with i = 0 being the root. The second way is to use a struct with a pointer to the left child, the right child, and the parent. Write a program that replaces the entry of each node by the sum of the entries of all its descendants and itself. Compare program speed between the two implementations.

4.3 Reading from and writing to disk

The data in registers and DRAM memory disappears when the computer is powered off. In contrast, hard disk storage is permanent. The hard disk is a collection of platters. Bits stored on circular tracks on either side of the platters are turned on and off using a magnetic field. More than 100 billion bits can be packed into a single square inch of a platter.
File systems, implemented inside the operating system kernel, impose a logical structure on hard disk storage and facilitate its use. Files and directories are stored on the hard disk. In everyday usage, files are read from and written to with the understanding that the data is stored on a hard disk. Between the file as viewed inside a C/C++ program and the hard disk, there are several layers of software. These layers of software provide modularity, enabling the operating system kernel to handle several different file systems with a uniform interface, and they greatly improve the speed of access.
The Linux kernel implements a number of optimizations to speed up access to the disk. The most important optimization is to maintain a page cache. The page cache is a list of page frames that correspond to data on the disk. When a file is read, Linux will service the read using the page cache if possible. A read or write file operation has to fall through a number of software layers before it reaches the disk. It begins as a system call. The read() and write() system calls are issued by the C library functions fread() and fwrite(). The C library functions may do some buffering of their own. The kernel will service the system calls using the page cache if possible. If not, it invokes the file system to which the file belongs. There are software layers for combining, queuing, and scheduling requests to read or write to the hard disk. The request is finally issued using a device driver. [81] When the file access falls through to the driver, the driver issues a DMA request and puts the calling process to sleep. The process is woken up by an interrupt handler when the request is complete. The Linux kernel will page cache the data before returning control to the user program.
The Linux command lspci -v may be used to find out the type of hard disk as well as the device driver that is in use. Although it is useful to understand that every disk access falls through layers of the file system within the operating system kernel, knowing specific details about the type of hard disk and the device driver is of little use in actual programming. There can be considerable variation in capacity as well as bandwidth of different hard disk systems, but that does not affect programming technique.
The C versus C++ discussion in section 4.3.1↓ is on programming technique. The C++ language provides a convenient interface to file input/output using ifstream and ofstream objects. Although less convenient, the no-nonsense fread(), fwrite() interface in C can be much faster, by as much as a factor of 100. In section 4.3.1↓, we explain why there can be such a big difference in speed.
In sections 4.3.2↓ and 4.3.3↓, we investigate latency and bandwidth to hard disk. In both of these sections, the page cache maintained by the Linux kernel plays a big role. The page cache is a cache of the hard disk maintained in main memory. DRAM memory can be tens of GB in size, and the page cache can occupy a considerable portion of that memory. A lot of file input/output is serviced by the operating system kernel using the page cache. To get a real picture of latency and bandwidth to hard disk, one needs to get beyond the page cache, and that implies file sizes that are in the hundreds of GB.
Latency to hard disk is of the order of milliseconds and can therefore be 10⁵ times the latency to DRAM memory. Bandwidth to hard disk is of the order of hundreds of MB/s, which is only a thousandth of the bandwidth to DRAM on typical computers. However, hard disk capacity can be 100 or even 1,000 times DRAM capacity on typical computers. With respect to hard disk capacity, many supercomputing systems (and high-end workstations) are not well balanced. The need to provide a common file system across many computers for many users often implies that the hard disk capacity is not as high as it should be.

4.3.1 C versus C++

With regard to the programming technique for reading and writing files, the simplest lesson is also the most valuable. The C interface can be much faster than the C++ interface as we show and for reasons we explain.
The following functions use C++ ifstream and ofstream objects to read and write a double array v[] from or to a file of name fname.
void write_easy(double *v, long len, 
		const char *fname){
	ofstream ofile(fname);
	ofile<<scientific;
	ofile.precision(16);
	for(long i=0; i < len; i++)
		ofile<<v[i]<<endl;
	ofile.close();
}
​
void read_easy(double *v, long len, const char *fname){
	ifstream ifile(fname);
	for(long i=0; i < len; i++)
		ifile>>v[i];
}
The C interface below can be more than 100 times faster.
1  #include <cstdio>
2  void write_direct(double *v, long len, 
3                    const char *fname){
4      FILE *fptr;
5      fptr = fopen(fname, "w");
6      fwrite((void *)v, sizeof(double), len, fptr);
7      fclose(fptr);
8  }
9  
10 void read_direct(double *v, long len, 
11                  const char *fname){
12     FILE *fptr;
13     fptr = fopen(fname, "r");
14     fread((void *)v, sizeof(double), len, fptr);
15     fclose(fptr);
16 }
The FILE type as well as the functions fopen(), fwrite(), fread(), and fclose() are declared in the stdio.h header file included on line 1. On line 5, a file is opened for writing and on line 13 for reading.
The library function fwrite() (line 6) has a quite simple interface. Its first argument is a pointer to a location in memory. The second argument is the size of each item in bytes, and the third argument is the number of items to be written. The final argument is a pointer to a file.
The library function fread() (line 14) has an identical interface. It returns the number of items read, which may be less than the number requested if there is an error. Following our usual practice, we do not check for error conditions. Open files are closed using fclose() (lines 7 and 15).
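Although we follow our usual practice of omitting error checks, the return value of fread() makes checking straightforward. The sketch below (the function name is hypothetical) is one way to do it:

```cpp
#include <cstdio>

// Hypothetical variant of read_direct() that checks for errors.
// fread() returns the number of items actually read; a short count
// signals end-of-file or a read error.
long read_checked(double *v, long len, const char *fname){
	FILE *fptr = fopen(fname, "r");
	if(fptr == NULL)
		return -1;                 // file could not be opened
	long nread = (long)fread((void *)v, sizeof(double), len, fptr);
	if(nread < len && ferror(fptr)){
		fclose(fptr);
		return -1;                 // genuine read error
	}
	fclose(fptr);
	return nread;                      // items actually read
}
```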
Why is the C++ interface more than 100 times slower? There are three reasons.
First, while the C library functions fread() and fwrite() can read and write objects of any type or class, they do not waste a single byte of storage. Each double is 64 bits and gets stored as exactly 8 bytes. If file streams are used and each double is stored in ASCII, as in the earlier program, each double uses 23 bytes: one digit before the decimal point, the decimal point itself, 16 digits after it, a four-character exponent field such as e+01, and a newline (+1 byte if the number is negative).
Second, the C++ interface incurs an overhead in converting every double from binary to ASCII.
Third, the C interface does the entire reading and writing using a single call to the C library. In contrast, the C++ interface reads or writes item by item, with the C++ fstream library being invoked for every item. The overhead of calling the library so frequently can build up. What is worse, the fstream library probably invokes the file system inside the operating system kernel quite frequently, although it presumably does some buffering. Every system call is like another function call, which can lead to many more function calls inside the file system. This overhead can build up.
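The slow part is thus the formatted, item-by-item output, not the C++ stream machinery as such. As a sketch (not code from this book), an ofstream opened in binary mode can write the whole array with a single call, much as fwrite() does:

```cpp
#include <fstream>

// Sketch: binary output through a C++ ofstream. A single call to
// ofstream::write() transfers the raw bytes of the array, avoiding
// both the binary-to-ASCII conversion and the per-item library calls.
void write_stream_binary(double *v, long len, const char *fname){
	std::ofstream ofile(fname, std::ios::binary);
	ofile.write((const char *)v, len*sizeof(double));
}
```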

4.3.2 Latency to disk

The measurement of latency to hard disk brings up issues not unlike the ones encountered in measuring the latency to DRAM memory. Hard disk is so slow relative to memory that the operating system kernel goes to great lengths to cache file data in DRAM memory. The hard disk maintains a cache and attempts to predict future accesses. The true latency to hard disk is not easily visible to simple computer programs.
The plan for measuring latency is to create a number of files and access several different positions in several different files to gather latency statistics. The plan does not work so easily. If 100 files each of 100 MB are used in the measurement, the latencies for a 1 TB hard disk come out quite low and of the order of microseconds, not milliseconds. That is because all of 10 GB can fit comfortably inside the page cache of a system with 16 GB of memory.
The size of page cache can run into several GB. The size of the page cache may be seen in /proc/meminfo against the item labeled “Cached.” If we write to a file of size 1 GB on a computer with several GB of memory, the entire file ends up in the page cache. To clear the page cache, one may use the GNU/Linux command
echo 1 > /proc/sys/vm/drop_caches
One may look at /proc/meminfo after this command to verify that the page cache has indeed been cleared.
The latency2disk() function defined below reads a single double from a given position in a given file and returns it. The idea is to use 100 files each of 1 GB to make the total file size in play of the order of 100 GB, and then read a double-precision number at a random location in a random file. [82] On some computer systems, there is so much memory that even 100 GB may fit comfortably inside a page cache and be too small for latency measurement. Thus, the file count as well as the file size may need to be changed. The actual measurement of latency makes a large number of calls to latency2disk(); the code for that is not shown.
1  struct disk_latency{
2      double fopen_cycles;
3      double fseek_cycles;
4      double fread_cycles;
5      double fclose_cycles;
6  };
7  double latency2disk(const char *dir, int filenum, 
8                      long posn, struct disk_latency& lat){
9      char fname[200];
10     sprintf(fname, "%s/file%d.dat", dir, filenum);
11     TimeStamp clk;
12     FILE *fptr;
13     clk.tic();
14     fptr = fopen(fname, "r");
15     lat.fopen_cycles = clk.toc();
16     clk.tic();
17     fseek(fptr, 8l*posn, SEEK_SET);
18     lat.fseek_cycles = clk.toc();
19     double x;
20     clk.tic();
21     fread(&x, sizeof(double), 1, fptr);
22     lat.fread_cycles = clk.toc();
23     clk.tic();
24     fclose(fptr);
25     lat.fclose_cycles = clk.toc();
26     return x;
27 }
The arguments to latency2disk() specify the directory and the file number (line 7). The file name is composed on line 10. The file is thought of as an array of doubles, and posn (line 8) gives the position of the number to be retrieved in that array.
The function opens the file (line 14), seeks to the specified position (line 17), reads a double from the file (line 21), and closes the file (line 24).
Each of the four function calls is timed, and the number of cycles is saved in a struct (lines 1 to 6), which is passed by reference (line 8). The only new syntax here is fseek() for seeking to a new position in the file (line 17).
Table 4.8↓ was obtained on two different computers, differing by 3 to 4 years in age. On both computers, opening and closing a file is fast, taking only a few microseconds. On both computers, the total latency is of the order of 10 milliseconds. Latency changes very little from computer to computer or with time.
Computer           fsize  nfiles  open  seek  read  close
HP-Z220            1 GB   100     .007  6.84  2.68  .010
ThinkStation-P300  1 GB   100     .001  9.29  .681  .002
Table 4.8 Latency to hard disk in milliseconds. The measurements are made using nfiles files, each of size fsize.
The total latency is limited by the mechanics of the disk. The seek time of a hard disk refers to the time it takes for the head assembly to move between tracks of varying radii and is of the order of milliseconds. In addition, it takes of the order of 10 milliseconds for the hard disk platter to complete a rotation. Seek time and rotational latency are the parameters that limit latency to hard disk.

4.3.3 Bandwidth to disk

The functions write_direct() and read_direct() defined earlier in this section illustrated the use of fwrite() and fread(). These functions read or write a single array of doubles. A single array cannot exceed the available DRAM, although it is desirable to work with files much larger than the size of DRAM memory when estimating bandwidth to disk.
The function write_direct() is modified below to write to files that are much bigger than available memory.
void write_direct(double *v, long len, 
		  const char *fname){
	FILE *fptr;
	fptr = fopen(fname, "w");
	for(int i=0; i < FLUSH_COUNT; i++)
		fwrite((void *)v, sizeof(double), len, 
		       fptr);
	fclose(fptr);
}
By setting FLUSH_COUNT to a large value, the array v[] is written to the same file many times. The purpose of using a large count is to flush the page cache. Linux provides the system call fsync() to force a file's dirty pages to be written to disk, but we do not use it here.
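As a sketch, the timed writes can be converted into a bandwidth figure as follows; std::chrono is used here in place of the cycle counter used elsewhere in this book, and the count and file name are placeholders to be chosen large enough to overwhelm the page cache:

```cpp
#include <cstdio>
#include <chrono>

// Sketch: write v[] (len doubles) to fname count times and return
// the bandwidth in bytes per second. With a large count, most of
// the data must go to disk rather than linger in the page cache.
double write_bandwidth(double *v, long len, int count,
		       const char *fname){
	auto t0 = std::chrono::steady_clock::now();
	FILE *fptr = fopen(fname, "w");
	for(int i=0; i < count; i++)
		fwrite((void *)v, sizeof(double), len, fptr);
	fclose(fptr);
	auto t1 = std::chrono::steady_clock::now();
	double secs = std::chrono::duration<double>(t1 - t0).count();
	return count*len*sizeof(double)/secs;
}
```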
As noted already, the page cache can run into several GB. If the system crashes before the dirty pages in the page cache are written back to disk, the file system will be left in an inconsistent state. Journaling file systems, such as ext4, record information about dirty pages on the disk. Journaling file systems can be restored to a consistent state much faster after a system crash.
Table 4.9↓ reports the read and write bandwidths for the same two computers as in table 4.8↑. The newer computer has nearly twice the bandwidth, although the latency is the same. Bandwidth to hard disk can be increased through greater storage density and parallelism. Bandwidth will increase with time in proportion to storage density and other parameters of the computer system.
Device             fsize   Write  Read
HP-Z220            100 GB  0.11   0.10
ThinkStation-P300  25 GB   0.19   0.19
Table 4.9 Bandwidth to hard disk in GB/s. The bandwidths are measured by writing to and then reading from a single file of size fsize.
Exercise:  Write a program to determine the maximum size of the page cache on your system.
Exercise:  Write a program to demonstrate that the page cache that comes into existence when one program accesses a file speeds up reading of the same file by any other program.

4.4 Page tables and virtual memory

In this section, we make our first foray into the operating system kernel. [83] [44] is a superb account of the workings of the Linux kernel and the chief source for this section. The cost of a single write to memory by an instruction such as movq %rax, (%rsi) brings in many layers of complexity. It could be a cache hit or a cache miss. If the DRAM memory is accessed, the cost of the access depends on preceding and following memory accesses. It depends on the manner in which the memory controllers operate the DRAM devices. It depends on the parallelism in the instruction stream. It depends on the pressure on dispatch ports in the instruction pipeline among other factors. On top of the layers of complexity engineered into hardware, the operating system introduces yet more complexity.
The operating system is a process manager and has the responsibility of laying down and enforcing the rules that govern nearly all activity on the computer. It creates the environment under which all processes run. It goes about its job so surreptitiously that its presence is ignored in most programming.
However, the activities of the operating system introduce a cost. The map from virtual memory addresses (generated by running programs) to physical DRAM memory is maintained by the operating system. A page is typically 4,096 bytes of virtual memory, as may be verified using the GNU/Linux command getconf PAGESIZE. The operating system creates page tables to map pages to page frames. A page frame is a 4,096-byte-long region of DRAM memory. As explained in section 4.1.3↑, before accessing a word in DRAM memory, the processor uses page tables to convert a virtual address to a physical address. The operating system is invoked if there is a page fault or if the address is illegal.
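The page size can also be queried from within a program. On POSIX systems such as Linux, sysconf() returns it, as in this small sketch:

```cpp
#include <cstdio>
#include <unistd.h>

// Query the page size at run time; on typical x86-64 Linux systems
// this returns 4096, matching the getconf PAGESIZE command.
long page_size(){
	return sysconf(_SC_PAGESIZE);
}
```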
Even ignoring page faults, having to look up the page tables every time a memory location is accessed introduces an overhead. At worst, as explained in section 4.1.3↑, each memory access can turn into two or more memory accesses. The processor cores use a Translation Lookaside Buffer (TLB) to eliminate this overhead. Programs that transpose and multiply matrices must be mindful of the TLB, as evident from earlier sections. Managing the TLB is one of the kernel’s functions.
An account of program performance will be incomplete without discussion of the kernel’s activities. The kernel plays a big role in managing memory, creating processes and threads, and scheduling threads. The kernel mediates between the processes and the network when processes running on remote computers take on a task in parallel.
A knowledge of the kernel’s activities gives the programmer a better appreciation of the complex environment in which programs run. The occasional deterioration in performance will not seem so bizarre. The variation in performance from run to run will not seem inexplicable and erratic.
There is a self-referential quality to the operating system. The memory management unit is part of the kernel and resides in the memory that it manages. The kernel associates a memory descriptor with each process or thread to encapsulate its usage of memory. The memory descriptors are stored in memory accessible only to the kernel. The page tables map virtual memory to physical memory. The page tables themselves reside in memory. The virtual addresses of the region of memory holding the page tables are mapped to physical addresses by those very page tables.
Although it is a mere process manager, the operating system is far more complicated than almost any process or program. The operating system kernel is perhaps as complicated as any engineered system. Some of this complexity originates from the hardware environment the kernel manages. DRAM memory, network cards, hard disk, flash memory, graphics cards, monitors, and keyboards, along with many other parts of the computer, differ from each other as much as an airplane differs from a horse carriage. The kernel takes on the job of blending this cacophony of devices into a seamless computer system. Many of the parts, such as DRAM memory and the network, are quite complicated to operate. Some of the complexity arises from the need to guarantee a secure environment to processes and users. Some of the complexity is engineered. For example, the Linux kernel is monolithic. Every unit of the kernel and every device driver use the same virtual address space. Microkernel-based operating systems are more modular, but the coordination of and communication between microkernels incurs a penalty, which Linux is unwilling to pay.
Section 4.4.1↓ uses a simple program to show how a user program as well as the data it uses is laid out in virtual memory. This information can be useful, not least when interpreting error messages. Section 4.4.2↓ continues that discussion and shows how the kernel establishes itself in physical memory and then in virtual memory. Section 4.4.2↓ also gives an overview of the paging system implemented jointly by the operating system and the hardware. Although that topic may appear a little advanced, its relevance to programming is beyond debate. Demand paging plays a big role in multithreaded programming and networking. TLB flushes, necessitated during some context switches, can be a source of considerable overhead. In addition, having a full overview of the virtual memory system helps in understanding the manner in which multiple threads coexist and communicate using shared memory, an important point inasmuch as multithreaded programming appears set to be the major paradigm for the next few decades.
It has been stated that “memory management is by far the most complex activity” in the Linux kernel. [84] [44]. Our account of memory management will focus on the interaction between the hardware and operating system kernel.

4.4.1 Partitioning the virtual address space

[Figure: FIGS/chapter3/vmem_view.png]
Figure 4.8 Schematic view of the virtual address space of a user process.
Because a virtual address is 48 bits, a user process can in principle address 2⁴⁸ bytes, which is more than 64,000 GB of memory. The physical memory can be less than a thousandth of that figure. Much of virtual address space is an unclaimed wilderness. Figure 4.8↑ is an incomplete schematic view of the partitioning of the virtual address space of a typical user process. Much of the virtual address space is taken up by the unutilized regions shown as empty gaps.
The boundary shown as PAGE_OFFSET separates user code and data from the kernel. User code and data map to virtual addresses lower than PAGE_OFFSET. The kernel code and data are beyond it. Every process uses exactly the same value for PAGE_OFFSET, which is defined as a constant by the kernel. Therefore, the kernel occupies exactly the same portion of virtual memory in every process. We will first look at the way the virtual memory addresses that precede PAGE_OFFSET are organized.
A simple C program can help us understand the way functions, local variables, global variables, and dynamically allocated memory are mapped to virtual address space. The complete listing of such a program follows.
1  #include <stdio.h>
2  #include <stdlib.h>
3  int global;
4  void f(){
5      double farray[512];
6      printf(" farray = %p\n", farray);
7  }
8  int main(){
9      double marray[512];
10     printf(" marray = %p\n", marray);
11     f();
12     printf("\n");
13     double *ptr;
14     ptr = (double *)malloc(1000);
15     printf("    ptr = %p\n", ptr);
16     printf("\n");
17     printf("&global = %p\n", &global);
18     printf("      f = %p\n", &f);
19     printf("   main = %p\n", &main);
20     free(ptr);
21 }
The program prints the pointer to global, defined globally on line 3, the pointer to the function f() defined on line 4, the pointer farray[] defined locally within f() on line 5, the pointer to main(), the pointer marray[] defined locally within main() on line 9, and ptr, which points to memory allocated dynamically on line 14. The printing is in order from the highest address to the lowest. When compiling this program, one must remember to use the -fno-inline-functions option. If not, the compiler will likely eliminate the call to f() on line 11.
The following is output from a single run of this program.
 marray = 0x7fffb6be1380
 farray = 0x7fffb6be0370
​
    ptr = 0xc60010
​
&global = 0x604730
      f = 0x400ad0
   main = 0x400a00
Because marray[] is allocated on the stack, it has a high address as indicated in figure 4.8↑. The stack grows downward in virtual memory, and when a call is made to f(), the pointer farray has a lower value because of the downward growth of the stack. We will have more to say about stacks in the next chapter.
The lowest addresses belong to the locations that store the set of instructions compiled from main() and f(). The global variable defined on line 3 is allocated in the data region. As indicated in figure 4.8↑, the global data map to higher addresses than code.
The dynamically allocated memory is in between. The principal way to allocate memory dynamically is using malloc() in C. In C++, we may also use new[]. In some of our earlier examples, we used _mm_malloc() to force alignment of allocated memory at 16-byte boundaries, cache line boundaries, or page boundaries.
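As a sketch of forcing alignment, _mm_malloc() takes the requested size in bytes and the desired alignment; page alignment, for instance, zeroes the low 12 bits of the returned address (gcc and icc provide _mm_malloc() in the mm_malloc.h header):

```cpp
#include <mm_malloc.h>

// Sketch: allocate n doubles aligned at a 4096-byte page boundary.
// The low 12 bits of the returned address are zero. Memory obtained
// this way must be released with _mm_free().
double *alloc_page_aligned(long n){
	return (double *)_mm_malloc(n*sizeof(double), 4096);
}
```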
The sketch in figure 4.8↑ has omitted a few things, namely, shared libraries and variables local to threads. Like dynamically allocated memory, these are also mapped to virtual addresses higher than code and data regions and lower than the stack region.
Kernel code and data are mapped to addresses beyond PAGE_OFFSET. We will look at that mapping shortly.

4.4.2 Physical address space and page tables

Program speed is influenced by the paging system in several ways. Typically, 40% of the instructions are loads and stores, and each memory access involves the paging system. The TLB (translation look-aside buffer) is a shortcut past the huge overheads of the paging system. If a memory word is found in L1 cache and its virtual address is found in TLB, the latency of the memory access would be just 4 cycles. However, if there is a TLB miss, the latency can go up to several hundred cycles. If there is a page fault, the latency of a single memory access can go up to millions, even billions, of cycles. It is also true that the first access of a page in virtual memory always leads to a page fault, and the page fault handler is responsible for allocating a page frame in physical memory.
The paging system is responsible for protecting a program’s memory from other programs and preventing programs from making illegal memory accesses. During a context switch, when the kernel evicts one process and schedules another, the kernel may need to flush the TLB (using a privileged instruction) to protect the address space of the processes. The TLB flush can be expensive. We will look at how the kernel and hardware work together to implement the paging system to get a sense of the overheads of the paging system.
Physical address space
Soon after the computer is powered on, the kernel starts running in real mode. In real mode, the kernel generates physical addresses directly. Its first job is to load itself completely into physical memory. The kernel can call the BIOS using the int (interrupt) instruction to find out about other devices in the system, read information from tables stored in specific areas of memory, or talk to the devices directly using special instructions. Any code that gets control of the computer in real mode can do almost anything it likes with the computer.
As shown in figure 4.9↓, the kernel is not allowed to use the lowest physical addresses for itself. That region of memory is used by BIOS and the hardware. The kernel loads itself in a low address region of physical memory. In addition to the physical memory used by initialized kernel data and code at startup, the kernel will need to dynamically allocate memory as more threads and processes are created. The memory claimed dynamically by the kernel is marked “dynamic kernel data” in figure 4.9↓.
figure FIGS/chapter3/pmem_view.png
Figure 4.9 Simplified sketch of the kernel in physical memory.
The virtual memory region marked “linearly mapped kernel memory” in figure 4.8↑, just beyond PAGE_OFFSET, is mapped to the chunk of physical memory the kernel reserves for its own exclusive usage. This mapping is the same for all processes and does not change as long as the computer is powered on. Furthermore, this mapping is linear. The virtual address is converted to a physical address by subtracting a constant.
The areas marked “kernel code,” “kernel data,” and “dynamic kernel data” in the schematic sketch of figure 4.9↑ hold page frames that are linearly mapped. The linearly mapped page frames extend monotonically in physical address space (except for memory regions reserved for BIOS, hardware, and other purposes). When the kernel attempts to claim a page frame dynamically and extend the linearly mapped region of address space, it may come into conflict with a user process that has already claimed the page frame the kernel needs. Resolving this conflict will be expensive as the kernel has to map the user’s page to some other page frame or swap it to disc before claiming the page frame for itself. The kernel keeps a few free page frames handy so that it can service urgent requests quickly.
After the kernel has established itself in memory and created some of the data structures, such as an initial set of page tables, it will enter “protected” mode by loading a specific value into the cr0 register. The paging system is turned on in protected mode, and all references to memory are now interpreted as virtual addresses. The virtual addresses are converted to physical addresses by the paging system.
When the kernel needs more memory for its own use, it can get the memory in three different ways. The first method is to request more pages using alloc_pages() or another equivalent facility. The pages allocated in this way will be in the linearly mapped region just beyond PAGE_OFFSET in virtual memory. The kernel associates a kernel stack with each process that allows it to execute system calls and handle interrupts. The kernel stack is typically two pages or 8,192 bytes, and it is allocated using alloc_pages(). The memory claimed by alloc_pages() is linearly mapped.
The second method the kernel uses for claiming memory is kmalloc(). The memory claimed using kmalloc() is also linearly mapped. Many a time, the kernel needs memory in much smaller units than a page, for example, when creating process descriptors, memory region descriptors, or page descriptors. The kernel uses kmalloc() in such circumstances.
A third method is used for loading modules and related activities. In Linux, device drivers are loaded as modules. The kernel programmer has no way of estimating in advance how many devices there could be on the system or how big the device drivers could be in size. Some of the network card and graphics card drivers can be quite complicated. The kernel claims space for such modules and their activities using vmalloc(). The page frames (physical memory) claimed by vmalloc() come from the general pool, and the virtual addresses are in the “vmalloc” area shown in figure 4.8↑. The map to page frames is not linear. The page tables must be changed after calling vmalloc().
The monolithic Linux kernel uses the same map to physical memory for every virtual address beyond PAGE_OFFSET. Every kernel function and module uses the same virtual address space.
Page tables
figure FIGS/chapter3/virtual2linear.png
Figure 4.10 Conversion of a virtual address/offset to a linear address. Here cs (code segment) and gdtr are system registers, and GDT is the global descriptor table. The code segment register also has information about the privilege level of the process.
Figure 4.10↑ shows how the linear addresses are generated. In Linux, linear and virtual addresses are practically identical, and the conversion of a virtual address to a linear address is heavily optimized so as to be essentially cost-free. We do not distinguish between the two.
The global descriptor table of figure 4.10↑ is set up by the operating system kernel. It comes into play whenever the processor switches from a user program to the operating system kernel. A process may issue a system call using either the int 0x80 or sysenter instructions. Any external device can cause an interrupt. Internal events such as page faults trigger exceptions. When any of these events happen, the processor uses the %tr task register (also set up by the kernel) to index into the global descriptor table and locate the address of the kernel stack. The hardware switches %rsp to the kernel stack before handing control to the kernel.
figure FIGS/chapter3/virtual2physical.png
Figure 4.11 Page table look up using a virtual address. The cr3 register holds a pointer to the base of the PGD table.
Figure 4.11↑ shows the way page tables are organized. It is the kernel’s job to set up the PGD, PUD, PMD, and page table shown in that figure for each process. The sheer complexity of storing a multilevel table in memory shows why a TLB that bypasses page table lookup is essential.
Every process has its own page tables. When a process is created, the operating system has to set up page tables for it. The page tables of each process map valid virtual addresses below as well as above PAGE_OFFSET to physical memory. When the process makes a system call or if the kernel takes over to handle an interrupt or exception, it uses the page tables of the preempted process. It is illegal for a user process to generate a virtual address above PAGE_OFFSET. However, the kernel generates such addresses and uses the page tables of the user process to map them to physical memory.
The kernel has a reference set of page tables mapping virtual addresses above PAGE_OFFSET. When a process is created, that information is incorporated into the process’s page tables.
If a process defines an array locally or if it claims memory dynamically, it is allocated a chunk of memory in the virtual address space. Thus, the extent of the virtual address space a process may legally address changes as it runs. The virtual addresses are not mapped to page frames in physical memory as soon as they are allocated. The kernel waits until the first memory reference, which results in a page fault because no page frame has been assigned. The page fault handler finds a free page frame, assigns it to the page, and updates the page tables. This is known as demand paging.
Demand paging has its uses. For example, when several threads are using the same virtual address space, the kernel or, more specifically, the page fault handler can map a page to the memory channel that is as close as possible to the processor core that generates the first access. This “first touch policy” can be exploited in a program to improve bandwidth to memory, as we show in the next chapter.
A consequence of demand paging is that a set of contiguous pages in virtual memory can be mapped to page frames scattered all over physical memory. Neighboring cache lines in virtual memory may be quite far away from each other in physical address space if the cache lines cross a page boundary.
Exercise:  Explain why all processes cannot use the same page tables for virtual addresses above PAGE_OFFSET.
Exercise:  Explain why certain parts of the kernel should never be swapped out to hard disc.
Exercise:  Is %cr3 (see figure 4.11↑) the virtual or physical address of the base of the global page directory?

4.5 References

Bibliography

[35] B. Jacob, S. Ng, D. Wang: Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2008.

[36] D. Molka, D. Hackenberg, R. Schöne, M.S. Müller: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system”, 18th International Conference on Parallel Architectures and Compilation Techniques, pp. 261-270, 2009.

[37] D.P. Bovet, M. Cesati: Understanding the Linux Kernel. O'Reilly, 2005.

[38] J.L. Hennessy, D.A. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990-2011.

[39] M. Frigo, C.E. Leiserson, H. Prokop, S. Ramachandran: “Cache oblivious algorithms”, Foundations of Computer Science, 40th Annual Symposium, pp. 285-297, 1999.

[40] K. Goto, R.A. van de Geijn: “Anatomy of high-performance matrix multiplication”, ACM Transactions on Mathematical Software, vol. 34, no. 3, Article 12, 2008.

[41] V. Babka, P. Tůma: “Investigating cache parameters of x86 family processors”, in SPEC Benchmark Workshop 2009 (D. Kaeli and K. Sachs, eds.). Springer-Verlag, 2009.

5 Threads and Shared Memory

Programming with threads is a paradigm of great range and utility that encompasses everything from cell phones to web servers to supercomputers. The processor cores of today are clocked at around the same rate as the processor cores of 2005. However, the number of processor cores on the same package continues to increase dramatically. In addition, multiple processor packages can be connected using a fast interconnect. All the processors in the interconnected processor packages share the same DRAM memory. Nodes with multiple processor cores are so powerful that problems with a billion grid points can be tackled on a single node. Multithreaded programming is set to remain a leading programming paradigm.
This chapter is organized into four sections. Section 5.1↓ is an introduction to OpenMP. OpenMP is an industry standard that enables programming with threads. It is implemented by icpc, gcc, pgcc, and other compilers. If OpenMP syntax is used, one only needs to add an option such as -openmp or -fopenmp during compilation and linking. The syntax is so simple that the transition to threaded programming is easy and barely noticeable. The OpenMP syntax used much of the time fits into half a page or less.
OpenMP programs are written as if they are sequential. To begin with, there is a single thread of execution. Whenever the program enters a parallel region, multiple threads come alive. Because memory is shared between threads, the different threads can split the work between them. They can either operate on the same locations in memory or split the memory between themselves.
Although the OpenMP syntax is simple, sharing memory between threads has many pitfalls that can trip up the novice as well as the experienced programmer. Suppose we define a variable var to be an int. If this is a shared variable and one of the threads does var+=1, it is natural to assume that the update in the value of var is seen by all the threads. That assumption can be wrong, however. The variable var may be stored in a register, and the update on one thread may never be propagated to the other threads.
The memory issues that arise in threaded programming are exceedingly intricate, subtle, and deceptive. Even the simplest OpenMP program that uses multiple threads to add a list of numbers relies on the OpenMP memory model in ways that are not always appreciated. In section 5.1↓, we explain OpenMP’s memory model thoroughly.
Section 5.2↓ introduces techniques to optimize OpenMP programs. Most of these techniques are minor modifications of techniques for memory access already seen in the last chapter. There is only one substantially new point that arises. Although all processor cores on a computing node have access to all the memory (DRAM) on that node, some of the memory is closer to some processor cores and far away from others. As may be expected, accesses of near memory are faster. It is important to make sure that memory mainly accessed by a certain processor core is close to it.
The nonuniformity of access between near and far memory can be dealt with during memory initialization, as explained in section 5.2↓. However, it is a definite negative as far as program modularity is concerned. Program modularity is promoted by uniformity of access, where the cost of memory access is independent of which processor is accessing which region of memory. If memory access is uniform, the subdivision of a task between multiple processors can stay closer to logic intrinsic to the problem domain. Nonuniform memory access burdens memory initialization with an interpretation that is completely extrinsic to the problem domain and dependent only on the conveniences and constraints of hardware design.
The unwary are apt to assume that a computer with 12 cores is 12 times faster than a computer with one core. Bandwidth to memory does not increase linearly with the number of processor cores, and most significant programs are limited by bandwidth to memory, as explained in section 5.2↓.
Although OpenMP is of much utility in scientific computing, it is a limited framework. It applies mainly in those situations where the data layout is static and access patterns are regular. In these situations, one can think of a parallel program as a sequential program that splits naturally between threads every time a parallel region is entered. This model is totally inapplicable to web servers, for example, where every thread does its own thing, according to the demands placed on it by its client, and interacts with other threads in complex ways.
When scientific computing was dominated by classical physics, the types of problems that arose fit reasonably well into the OpenMP framework. However, newer and growing areas of scientific computing, such as genomics and data science, have some of the qualities of computer science applications. Problems in such areas may not be as complex and dynamic as web servers or Internet applications, but a trend in that direction is undeniable.
In section 5.3↓, we look at Pthreads, which is a far more fundamental way to program with threads than OpenMP. Pthreads can be used for everything from web servers to scientific applications. In fact, Pthreads help us understand OpenMP better.
The treatment of Pthreads in section 5.3↓ exhibits interactions between threads and computer architecture. Cache coherence is essential for threaded programming. However, if several threads share the same cache line, there can be considerable overhead in propagating writes from one cache to another. TLB flushes can incur overheads when threads of different applications share the same core. In addition, section 5.3↓ explores the overhead of thread creation and briefly introduces memory fences.
Much of the complexity of threaded programming is the complexity of sharing memory between concurrent threads. In section 5.4↓, we take a look at the organization of program memory into stacks and heaps. We explain how recursion works. Segmentation faults are the bane of C/C++ programming, and dealing with them occupies much of the programmer’s time. In section 5.4↓, we make another excursion into the Linux kernel and explain exactly how these segmentation faults are triggered.
As in the previous chapters, we run programs on SSE2, AVX, and AVX2 platforms (see tables 3.1↑ and 9.1↓). As far as what we say about program optimization is concerned, there is no great difference between any of these machines, or with AVX-512 machines that are set to appear in the future.

5.1 Introduction to OpenMP

Section 5.1.1↓ introduces OpenMP. [85] Nearly all the syntax that most programmers will ever need is brought out by parallelizing the summation of the Leibniz series in two different ways.
[85] The website www.openmp.org has a document titled “OpenMP Application Program Interface” describing the OpenMP standard. [42] is an easily paced introduction to OpenMP.
OpenMP programs look much like sequential programs, and the syntax is easy to learn. The simplicity is mostly illusory, however. Whenever concurrent programs share memory, as OpenMP programs do, the programming model inevitably becomes very subtle. OpenMP syntax conceals much of this subtlety behind a sophisticated memory model. Every OpenMP programmer would be well advised to understand this memory model. The correctness of even the simplest OpenMP program relies on it. The memory model is discussed in section 5.1.2↓.
Section 5.1.3↓ is about the overheads associated with OpenMP parallel regions and other constructs. When a task is divided between threads using OpenMP, the division itself will incur overhead. If this overhead is too great relative to the size of the task, the task may not be worth parallelizing. OpenMP parallelism works best at a coarse level and for outer regions of the program. One may think of a program as being essentially sequential and split certain tasks within that sequential flow between threads using OpenMP constructs. Where and whether such a split makes sense or not is entirely determined by OpenMP overheads.
In section 5.1.3↓, we describe the implementation of mutual exclusion on x86 machines. Mutual exclusion is so fundamental to threaded programming that the x86 instruction set has supported it for many decades. Much of the overhead of OpenMP constructs is incurred through instructions related to mutual exclusion.

5.1.1 OpenMP syntax

OpenMP syntax is beguilingly simple. A simple function for computing the nth partial sum of the Leibniz series follows:
double leibniz(long int n){
	long int i;
	double ans=4.0;
	for(i=1; i < n; i=i+2){
		ans -= 4.0/(2.0*i+1);
		ans += 4.0/(2.0*i+3);
	}
	return ans;
}
As it is written, this function runs on a single processor core. We will rewrite it to run on multiple cores using OpenMP. By writing the OpenMP program in different ways, we expose much of the basic syntax of OpenMP. There is a lot more syntax to OpenMP than we will discuss here, but much of it is hardly ever needed. OpenMP constructs are embedded into C or C++ code. The GNU, Intel, and PGI compilers support OpenMP. With icpc, the option -openmp must be used during compilation and linking. For gcc/g++, the corresponding option is -fopenmp.
A single important OpenMP construct does not arise (explicitly) in the two ways of summing the Leibniz series we explore. This is the
#pragma omp barrier
construct. OpenMP allows the programmer to embed statements such as this in C or C++ programs. In such constructs, omp is an abbreviation of OpenMP and #pragma is a compiler directive.
If a running thread encounters a barrier, it must stop until all other threads in the same team have also arrived at the same barrier. The threads get in sync at a barrier construct. If the program is parallelized in phases, and each phase must wait for the preceding phase to complete before it begins, we must use barrier constructs in between phases. For example, if one phase of the program assembles a stiffness matrix in parallel and another phase solves the stiffness matrix, the two phases must be separated by a barrier construct in principle. In practice, however, there are implicit barriers in a number of OpenMP constructs such as parallel for and parallel, as we will see, and only rarely is the barrier construct needed.
omp parallel for
The simplest way to sum the Leibniz series in OpenMP is to use the parallel for directive.
1	double ans=0;
2#pragma omp parallel for				\
3	reduction(+:ans)
4	for(long int i=0; i < n; i = i+2)
5		{
6			ans += 4.0/(2*i+1);
7			ans -= 4.0/(2*i+3);
8		}
Except for lines 2 and 3, this could be a sequential program to sum the Leibniz series. Line 2 is the parallel for construct, which specifies that the for-block extending from lines 4 through 8 must be parallelized. When the program encounters this statement, it splits into multiple threads. Each thread will execute a chunk of the iterations of the for-loop.
The reduction clause on line 3 is of much importance. It states that ans is a reduction variable that is tagged  + . The compiler understands that each thread should compute its own ans, and the result from all the threads should be reduced into a single ans using addition.
An attractive feature of OpenMP is that parallel programs can be written using syntax that is close to that of sequential programs. The programmer is asked to make only minimal adjustments to write parallel programs, although these adjustments can be quite subtle and their simplicity a major pitfall.
Not every for-loop can be parallelized using the OpenMP parallel for construct. The compiler has to be able to look at the for-loop and generate the code for the for-loop that is to be executed by each thread. The for-loops that can be parallelized are said to be in canonical form. The complete definition of for-loops in canonical form is too complicated to be useful for anyone except compiler writers. It is usually possible to tell whether a for-loop is in canonical form by looking.
Even when a for-loop is in canonical form, one has to make sure that there are no dependencies between successive iterations of the loop. If the later iterations depend on the execution of an earlier iteration, the loop cannot be parallelized correctly. An example of a for-loop with a dependency across iterations is the following:
factorial[0] = 1;
for(int i=1; i < n; i++)
  factorial[i] = i*factorial[i-1];
This loop is in canonical form. If the OpenMP for construct is used, the compiler will be able to break it up syntactically into a separate loop for every thread. However, the result will be incorrect. To respect the loop-carried RAW dependency, every iteration has to wait for the previous iteration to be complete. That does not happen when the for-loop is split and the threads start executing their share of the iterations in parallel. Applying the parallel for construct to this for-loop results in an incorrect program.
There is an implicit barrier at the end of the OpenMP parallel for construct. Due to the implicit barrier, we may think of the for-loop as part of a sequential program that is executed in parallel by multiple threads.
In much of OpenMP programming, one does not need to go beyond parallel for. However, the parallel for as shown makes a number of implicit assumptions. Those assumptions are made explicit below using less concise syntax:
1double parallelfor(long int n, int nthreads){
2	assert(nthreads%2==0);
3	double ans=0;
4#pragma omp parallel for				\
5	num_threads(nthreads)			\
6	schedule(static)				\
7	default(none)					\
8	shared(n)						\
9	reduction(+:ans)
10	for(long int i=0; i < n; i = i+2)
11		{
12			ans += 4.0/(2*i+1);
13			ans -= 4.0/(2*i+3);
14		}
15	return ans;
16}
The parallel for on line 4 and the reduction clause on line 9 are the same as before. The parallel for construct (or compiler directive) applies to the entire for-loop from lines 10 through 14.
The num_threads() clause on line 5 specifies the number of threads to be created. Here we have required nthreads, which is an argument to parallelfor(), to be even.
If the num_threads clause is omitted, by default, the number of threads is usually the number of processor cores. It may be set to some other value using the environment variable OMP_NUM_THREADS. Relying on the default behavior is simpler and better most of the time.
The schedule() clause on line 6 tells the compiler how to split the for-loop across threads. The scheduling option is given as static. Static scheduling, which we describe presently, is the most useful in practice and the default. Assuming n to be even, the loop variable i of the for-loop on line 10 steps through the list of values
0, 2, …, n − 2.
Because the schedule is given as static on line 6, each thread will get a contiguous set of iterations of roughly the same size. For example, if there are N threads, each thread gets approximately n ⁄ (2N) of the iterations, with the loop variable of thread number t extending from i ≈ t × n ⁄ N to i ≈ (t + 1) × n ⁄ N. Other schedules are possible. The iterations can be assigned in a round-robin fashion or dynamically.
Variables declared within the parallel block are private to each thread. Therefore, the variable i, which is declared on line 10, is unique to each thread.
The question arises whether variable names such as ans and n that occur within the parallel block, but are declared and defined outside the block, are names for the same location in memory. Such variables may be shared or unused (or even private---but it is hard to think of a meaningful use of this feature) in the parallel block. One can make all the variables defined outside the parallel block into shared variables using the clause default(shared). In this example, the clause default(none) on line 7 is used to say that the variables declared outside are not visible inside the parallel block by default. The clause on line 8 declares that the variable n is visible inside the block and shared by all the threads. The variable ans is not marked as shared because it is a reduction variable.
omp critical
The omp critical construct is similar to omp barrier in the following way. Both are rarely used constructs yet important to understand. The barrier construct is important to understand because it occurs implicitly at the end of parallel regions, enabling us to think of OpenMP programs as sequential programs with parallel regions. Critical regions are employed implicitly whenever reduction variables are used and the critical construct makes critical regions explicit.
The ompfor() function listed below introduces the eponymous omp for construct and the omp parallel directive in addition to omp critical.
1double ompfor(long int n){
2	double ans=0;
3#pragma omp parallel				\
4	default(none)					\
5	shared(n, ans)				
6	{
7		double sum=0;
8#pragma omp for					
9		for(long int i=0; i < n; i = i+2)
10			{
11				sum += 4.0/(2*i+1);
12				sum -= 4.0/(2*i+3);
13			}
14#pragma omp critical
15		ans += sum;
16	}
17	return ans;
18}
The parallel directive on line 3 applies to the block (or region) that begins on line 6 and ends on line 16. This block will be executed in parallel by each of the threads.
The default(none) clause on line 4 states that no variable is shared by default. In this parallel region, both n and ans are declared as shared by the shared() clause on line 5. We have made all that explicit to give an example of syntax in which multiple variables are declared as shared.
The variable sum defined on line 7 is local to each thread. The omp for construct on line 8 splits the for-loop extending from lines 9 through 13 between the threads. The threads are already in a parallel region when the for-loop is encountered. Therefore, the parallel for construct must not be used. The omp for compiler directive on line 8, which splits the for-loop between threads without creating new threads, is appropriate here. Each thread adds its total into sum, which is local to itself.
The critical construct on line 14 applies to the block immediately following, which in this case is simply line 15, which is ans += sum. On this line, every thread adds its local sum to ans. Because ans is a shared variable, the threads may overwrite each other if they access it simultaneously. The critical construct ensures that the threads take turns in executing line 15. Whenever a reduction variable is used, critical regions are employed implicitly.
By definition, critical regions are mutually exclusive, and only one thread can occupy a critical region at any one time. This is true even if the critical regions encountered by the threads are not the same syntactically. [86]  [86] OpenMP allows finer control of mutual exclusion when the critical regions are named. The critical region that follows #pragma omp critical on line 14 is unnamed.
There can be no doubt that the earlier parallel for syntax is simpler. Indeed, that should be the preferred syntax in most situations. However, the more complex ompfor() is twice as fast as parallelfor() on the 3.6 GHz AVX2 machine, although they are equally fast on the 2.2 GHz AVX machine (see table 9.1↓). Here we are seeing the same point for the umpteenth time: although the hardware on newer machines is faster, the compilers are yet to catch up.
There is an implicit barrier at the end of each parallel region and at the end of an omp for block. Thus, in ompfor(), there are implicit barriers on lines 13 and 16. The parallel region comes to an end on line 16, and beyond that point there is only one thread (the master thread). In summary, implicit barriers are found at the end of parallel regions and omp for constructs, but no implicit barrier occurs at the beginning of a parallel region or in a critical construct.
Some more syntax
The OpenMP library defines the function
int omp_get_thread_num();
which returns the thread number of the calling thread. The thread number is 0 for the master thread. If the number of threads is N, the threads are numbered 0, 1, …, N − 1. The library also defines the function
int omp_get_num_threads();
which returns the total number of threads in a parallel region when called by any thread. Calls of these functions are legal only inside parallel regions. They may not be called from functions that are called from parallel regions. However, the critical construct can be used even in functions called from parallel regions.
In addition to the two functions above, constructs such as omp master, omp section, and omp ordered are occasionally useful.

5.1.2 Shared variables and OpenMP’s memory model

Let us look again at the critical construct.
#pragma omp critical
	ans += sum;
As stated earlier, the critical construct ensures that the statement ans += sum is executed by only one thread at a time. That might seem sufficient to ensure the correctness of the program because each thread will separately add its local sum to the final answer, but it is not.
The compiler may decide to store the variable ans in a register and add the local sum to that register when ans += sum executes. The code for copying the register to the memory location of ans may be inserted after the critical block. In that case, we are back to the situation where the threads interfere with each other.
Why is that not a problem? In answering this question, we arrive at the flush feature, which is a vital part of the OpenMP memory model. When a thread has a shared variable such as ans, it is allowed to keep a copy of that variable in a register. Indeed, when the compiler generates assembly instructions, it may keep multiple copies of the same variable in memory and in the register file for its convenience. When an OpenMP flush operation is executed, a thread must make all its writes visible to other threads even if the writes have only modified locally stored copies of a shared variable. The temporary view of all shared variables must be synced with the global view in shared memory.
The OpenMP standard requires that a thread should implicitly carry out a flush operation when it enters and leaves a critical or parallel construct, and at each barrier it encounters. Even the basic program for summing the Leibniz series relies on this rule for its correctness. Each thread syncs its local copy of ans with the global copy in shared memory when it enters the critical region, updates its local copy in the critical region, and syncs the updated value with the global copy of ans in shared memory before it exits the critical region.
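The summing pattern just described can be sketched as follows. This is our reconstruction of a Leibniz-series program consistent with the description (the function name and argument list are ours): each thread accumulates a private sum and adds it to the shared answer inside a critical construct, relying on the implicit flushes at entry to and exit from the critical construct.

```c
#include <math.h>

/* Sum the first n terms of the Leibniz series
   1 - 1/3 + 1/5 - ... (which converges to pi/4).
   Each thread sums its share into the private variable sum and
   then adds sum to the shared variable ans in a critical region. */
double leibniz(long n, int nthreads){
	double ans = 0;
#pragma omp parallel num_threads(nthreads) default(none) shared(ans, n)
	{
		double sum = 0;	/* private to each thread */
#pragma omp for
		for(long i = 0; i < n; i++)
			sum += (i%2 == 0 ? 1.0 : -1.0)/(2*i + 1);
#pragma omp critical
		ans += sum;
	}
	return ans;
}
```

If compiled without OpenMP, the pragmas are ignored and the loop runs sequentially, producing the same answer.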
A more complex example
The OpenMP memory model is in play every time threads use a shared variable to communicate directly or indirectly. For finding the partial sum of the Leibniz series, the threads communicate indirectly using a shared variable to which each thread adds its part of the sum. The correctness of almost every OpenMP program depends on flush operations that are implicitly carried out at the beginning and end of parallel and critical constructs and at every barrier.
In general, OpenMP allows the compiler to generate assembly for a parallel region or for chunks of a for-loop assigned to a thread, as if the code is sequential---except for implicit flush statements. Of course, the compiler must also handle reduction variables correctly.
The peculiar implications of the OpenMP memory model can be difficult to grasp with just a single example. Therefore, we give another more complex example here. The variable x is the only variable shared by the two threads created by the function printstuff(), whose listing follows.
 1  void printstuff(){
 2  	int x;
 3  #pragma omp parallel			\
 4  	num_threads(2)			\
 5  	default(none)			\
 6  	shared(x)
 7  	{
 8  		int tid = omp_get_thread_num();
 9  		if(tid==0){
10  			x = 0;
11  		}
12  		else if(tid==1){
13  			x = 1;
14  			printf("x = %d\t", x);
15  			printf("x = %d\n", x);
16  		}
17  	}
18  }
The parallel region extends from lines 7 to 17. The thread with tid (abbreviation of thread identifier) equal to 1 prints the shared variable x on lines 14 and 15.
The function printstuff() may print x = 0,  x = 0 if thread 0 executes line 10 after thread 1 executes line 13 but before it prints on lines 14 and 15. If thread 1 executes line 13 after thread 0 executes line 10, the function prints x = 1,  x = 1. A little thought will show that x = 1,  x = 0 is another possibility.
A legal compilation of printstuff() may produce behavior that is different from picking one of these three possibilities, but that is illustrative of the implications of OpenMP’s memory model.
The function may even print x = 0,  x = 1. Because the compiler is allowed to treat the parallel region as if it were sequential, it may store x in a register R in addition to the memory location x. The assignment on line 13 may update both the register R and the memory location. The first print statement on line 14 may access x using the shared memory location x, and the print statement that follows may access x using the register R. It may seem idiosyncratic for the compiler to generate such code. But the point here is that generating such code is legal if the parallel region is treated as sequential code. If the compiler generates such code, the function may print x = 0,  x = 1.
Alternatively, both the print statements may access x using the register R. If so, the function will always print x = 1,  x = 1.

5.1.3 Overheads of OpenMP constructs

OpenMP constructs are embedded into C/C++ programs to create teams of threads that work in parallel. The parallel, barrier, and for constructs introduce overheads. Work may need to be assigned to threads, threads may need to be created and destroyed, or synchronization and serialization may need to be implemented using system calls to the operating system kernel. These activities consume cycles. If the parallelized task is too small, the benefits of parallelization will be overwhelmed by the overheads. Effective programming requires knowledge of the overheads of OpenMP constructs.
Computer Number of Cores/Threads min median max
3.6 GHz AVX2 2 1176 1332 3.4 × 10^6
2.6 GHz SSE2 12 4888 6188 3.2 × 10^6
2.2 GHz AVX 16 5025 6573 1.8 × 10^7
Table 5.1 Overhead (in cycles) of the parallel construct on three different computers. For the full names of the machines, see table 9.1↓ of the appendix.
Table 5.1↑ reports the overhead of entering and exiting from a parallel region. The entries of the table are based on many invocations of the following program:
 1  void parallelA(int nthreads, StatVector& stats){
 2  	TimeStamp clk;
 3  	clk.tic();
 4  #pragma omp parallel					\
 5  	num_threads(nthreads)				\
 6  	default(none)
 7  	{
 8  		dummy();
 9  	}
10  	double cycles = clk.toc();
11  	stats.insert(cycles);
12  }
Every call of parallelA() enters and exits from a parallel region. The master thread times the entire parallel region and inserts the cycle count into the stats object on line 11. The parallel region consists of a single call to dummy() on line 8. As its name suggests, dummy() does nothing. Its definition
void dummy(){}
is stashed in a separate compilation unit to force the compiler to generate a function call in the parallel region. The statistics of entering and exiting the parallel region are accumulated in the stats object after a large number of calls to parallelA().
Several points emerge from table 5.1↑. The worst case is much worse than the median cost (or the mean cost---median and mean are approximately equal). The median or mean cost of a parallel region is in the thousands of cycles. The worst cost is of the order of millions of cycles. The worst case occurs at the first entry into a parallel region and only occasionally later on, for reasons that will be explained later. It is important not to change the number of threads between parallel regions. Doing so will trigger the worst-case cost. The reason for that will also become clear later.
Thus, assuming all parallel regions use the same number of threads, the cost we have to contend with is the median or mean cost, which is in the thousands of cycles. The median cost is much lower for the two-core machine than for the 12-core or the 16-core machines (see table 5.1↑). The two-core machine has only one processor package for both cores. In the other two machines, the cores are split between two processor packages (as in figures 3.6↑ and 4.2↑b). More processor packages implies greater cost.
The typical cost of around 5,000 cycles for a parallel region is not prohibitive. Even a for-loop that iterates 10^5 or 10^6 times may be parallelized effectively. However, an inner loop that iterates only 100 or 1,000 times cannot be parallelized effectively. The overhead of entering and leaving the parallel region will overwhelm any benefit from parallelization. Thus, OpenMP parallelism is not as fine in scale as the parallelism involved in XMM/YMM/ZMM registers (see table 3.1↑) or the instruction pipeline.
Number of Threads Barrier for for with nowait
1 26 66 36
2 1178 1401 93
3 1250 1451 101
4 1671 1758 96
6 2427 2555 100
8 2710 2731 99
10 2346 2458 99
12 2516 2415 101
16 (2.2 GHz AVX) 3262 3743 371
Table 5.2 Overheads of OpenMP constructs reported as number of cycles. All the entries of this table are for a 2.6 GHz SSE2 machine, except the last, which is for a 2.2 GHz AVX machine (see table 9.1↓ for the full names of the machines).
Table 5.2↑ reports the overheads of the barrier and for constructs. The programs for calculating these overheads are similar to the one we just saw. Therefore, the programs are not listed.
The most striking inference from table 5.2↑ is that most of the cost of the omp for construct seems to be due to the implicit barrier at the end of the construct. To confirm that is indeed so, we modified the for construct slightly. In particular, the nowait clause was added to the omp for construct. The nowait clause removes the implicit barrier at the end of the for construct. The last column of table 5.2↑ shows that the omp for construct with the nowait clause has an overhead of around 100 cycles and no more. The omp for construct is virtually cost free if the implicit barrier at the end of the construct is removed.
Comparing tables 5.1↑ and 5.2↑, it becomes clear that the cost of entering and leaving a parallel region, too, is due to a considerable extent to the implicit barrier at exit. The implicit memory flush occurs during entry to and exit from parallel regions but does not apply to the omp for construct. Therefore, parallel regions are somewhat more expensive.
As far as understanding the cost of parallel regions and other OpenMP constructs is concerned, we are left with two questions. The first is regarding the high cost in millions of cycles that occurs during the first entry into a parallel region, or when the number of threads in a parallel region is changed, and rarely otherwise. The other is regarding the cost of a barrier. We will address the first question later, when we study Pthreads. The cost of a barrier is due to the way mutual exclusion is enforced, and we turn to it right now.
Mutual exclusion on x86 computers
The x86 instruction set provides instructions to simplify the implementation of mutual exclusion and critical regions. One of these instructions is XCHG. The XCHG instruction swaps a register with a memory location and is guaranteed to be atomic. Suppose the memory location is a lock variable named lockvar. If the lock variable is 0, it means no process is in the critical region. If the lock variable is 1, it means some process is in the critical region. To enter the critical region, a process will set some register, say EAX, to 1. Next it exchanges EAX and lockvar using XCHG. If the value that is exchanged into EAX is 0, the process enters the critical region with the assurance that it set lockvar to be 1 and the critical region is now locked. In contrast, if the value that is exchanged into EAX is 1, it means some other process has locked the critical region. The recommended way (from Intel) to enter the critical region is as follows:
SpinLock:
     CMP $0, lockvar
     JE GetLock
     PAUSE
     JMP SpinLock
GetLock:
     MOV $1, EAX
     XCHG lockvar, EAX
     CMP $0, EAX
     JNE SpinLock
This code avoids executing XCHG too frequently and uses PAUSE inside the loop. The protocol for exiting the critical region is much simpler: MOV $0, lockvar. This solution works with any number of competing processes or threads.
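In C, the same protocol can be sketched with compiler atomics. This is our sketch, assuming the GCC/Clang __atomic builtins: __atomic_exchange_n compiles to an atomic exchange (XCHG on x86), and the plain-read spin before it mirrors the CMP/PAUSE loop above.

```c
#include <assert.h>

/* Test-and-test-and-set spinlock: lockvar is 0 when free, 1 when
   some thread is in the critical region. */
typedef struct { volatile int lockvar; } spinlock_t;

void spin_lock(spinlock_t *s){
	for(;;){
		/* Spin on a plain read first, like the CMP/PAUSE loop,
		   to avoid executing the atomic exchange too often. */
		while(s->lockvar != 0)
			;
		/* Atomic exchange, the analogue of XCHG lockvar, EAX.
		   If the old value was 0, we own the lock. */
		if(__atomic_exchange_n(&s->lockvar, 1, __ATOMIC_ACQUIRE) == 0)
			return;
	}
}

/* Exiting is a single store, the analogue of MOV $0, lockvar. */
void spin_unlock(spinlock_t *s){
	__atomic_store_n(&s->lockvar, 0, __ATOMIC_RELEASE);
}
```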
This little segment of code gives insight into the cost of a barrier. A single access of lockvar will involve cost that is of the order of latency to memory. Thus, the cost of a barrier is equal to a multiple of the latency to memory. The multiple depends on how many threads need to be synchronized. One method of implementing a barrier would be for each thread to enter a critical region and increment a shared variable to notify others of its presence. A more efficient implementation may pair off the threads in a binary tree and synchronize by walking up to the root of the tree.
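The counter-based barrier mentioned above can be sketched as follows. This is our sketch, again assuming the GCC/Clang __atomic builtins: each arriving thread increments a shared counter, and the last arrival flips a sense flag that releases the others. Every thread pays at least one access to a shared location, which is why barrier cost is a multiple of the latency to memory.

```c
#include <assert.h>

/* Centralized sense-reversing barrier. */
typedef struct {
	volatile int count;	/* number of threads that have arrived */
	volatile int sense;	/* flipped by the last arrival */
	int nthreads;
} barrier_t;

/* Each thread keeps a private sense flag (*mysense) that it flips
   on every barrier episode. */
void barrier_wait(barrier_t *b, int *mysense){
	*mysense = !*mysense;
	if(__atomic_add_fetch(&b->count, 1, __ATOMIC_ACQ_REL) == b->nthreads){
		/* Last arrival: reset the counter and release everyone. */
		b->count = 0;
		__atomic_store_n(&b->sense, *mysense, __ATOMIC_RELEASE);
	} else {
		/* Busy-wait until the sense flag flips. */
		while(__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) != *mysense)
			;
	}
}
```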
There are two distinct families of solutions to the mutual exclusion problem. One family uses busy waiting. The use of XCHG is of that type. Another family of solutions is wait-free. Classical semaphores and mutexes are of that type. In wait-free synchronization, a process or thread gives up the processor core and goes to sleep when it is waiting. It is woken up again when some condition flag has changed. Because wait-free synchronization involves putting processes or threads to sleep and then waking them up, it necessarily involves the scheduling function of the operating system kernel. Device drivers typically use wait-free synchronization to access shared resources. In contrast, interrupt handlers are not allowed to go to sleep and must use spinlocks if they need to synchronize or serialize.
Exercise:  Write an OpenMP program in which threads print their identifiers in reverse order.
Exercise:  Write an OpenMP program that reverses an array in place. How long should the array be before OpenMP parallelism proves advantageous?
Exercise:  Write an OpenMP program that calls qsort() defined in the C library from each thread. The sections of the array that are sorted by each thread are merged to complete sorting the entire array. The easy way to do a merge is to leave it all to the master thread. A more sophisticated way is to merge pairwise while assigning the pairs to different threads. Time and compare the two sorting routines, with the two different ways of doing a merge, against a single-threaded sort.
Exercise:  If the merge is implemented pairwise, it could be important to ensure that the number of threads in the parallel regions remains the same, even if some threads remain inactive. Explain why that could be the case.
Exercise:  An array, or equivalently a pointer to its first entry, is a shared variable. Array entries are modified independently by different threads using the same pointer. Is an OpenMP flush essential to make such changes made by one thread visible to the other threads?
Exercise:  Write an OpenMP program to find the minimum entry of an array using the parallel for construct with an appropriate reduction clause.

5.2 Optimizing OpenMP programs

In OpenMP, program memory is often naturally split between threads. In an example such as matrix transpose, for instance, certain blocks of the matrix are mainly handled by certain threads. When multiple threads occupy processor cores across multiple packages, there is much to be gained by allocating memory for each thread in a memory channel that is close to it.
The way to do this is explained in section 5.2.1↓. At first one may suspect that memory is allocated during a call to malloc() or its equivalent. In fact, during malloc(), the memory that gets allocated is typically virtual and not physical. It is only during first access, and the resulting page fault, that a page of virtual memory is mapped to a page of physical memory. Therefore, the allocation of page frames in near memory depends on the initialization protocol. Briefly, each page in virtual memory must first be accessed by the thread most likely to use it.
It is certainly possible that a page of virtual memory is heavily used by a certain thread in a certain phase of the program and by another thread in another phase of the program. In such situations, the best option is to try to map the threads that use the same pages of memory to nearby cores. This technique is also explained in section 5.2.1↓.
It is commonly believed that a program running on 16 cores is 16 times faster than a program that runs on 1 core, ignoring communication costs. This belief is completely mistaken. Bandwidth to memory does not scale linearly with the number of cores. Although linear speedup is often claimed in scientific computing research, such claims are a consequence of the program not going out of cache, a surprisingly common occurrence. Algorithms that use memory in a nontrivial manner and achieve linear speedup are rare.
In section 5.2.2↓, we find that the bandwidth to memory increases by only a factor of 3 or 4, when we go from 1 thread to 12 threads on a 12-core machine. Similarly, the speedup is sublinear and between 4 and 8 on a 16-core machine.
In section 5.2.3↓, we find the improvement in the realized bandwidth in effecting a matrix transpose is a factor of four on a 12-core machine, and seven on a 16-core machine. The Fast Fourier Transform (FFT) is one of the more arithmetic-intensive algorithms. As explained earlier, the cost of memory accesses cannot be completely hidden in the FFT. In section 5.2.4↓, we find the speedups for the FFT to be nine and 12 on 12- and 16-core machines, respectively. The speedups are impressive but still sublinear. The only commonly used algorithms that would result in linear speedups, assuming nontrivial utilization of memory resources, seem to be dense matrix algorithms such as LU factorization and matrix multiplication.

5.2.1 Near memory and far memory

figure FIGS/chapter4/compact_scattered.png
Figure 5.1 Assignment of threads numbered 0 through 11 to processor cores with compact and scatter affinities.
To explain the distinction between near and far memory, we turn to figure 5.1↑. That figure shows a 12-core machine, with the cores divided between two processor packages. All memory references to far memory, that is, to page frames that reside in a memory channel connected to the other processor package, have to go through an interconnect. This interconnect is similar to a fast network channel and its use makes references to far memory more expensive. For example, on a 2.6 GHz 12-core SSE2 machine, the latency to near memory is 180 cycles, whereas the latency to far memory is 300 cycles. On another machine (2.2 GHz 16-core AVX) with two processor packages, the latency to near memory is again around 180 cycles and the latency to far memory is again much greater, being more than 350 cycles.
Thus, it is advantageous if a page frame that is mostly used by a thread resides in near memory. If page frames are in far memory, the speed of the program can degrade by more than a factor of two.
The assignment of pages to page frames can be enforced through some kind of an initialization routine. An example follows:
void init_manycore(double *list, long len, int nthreads){
#pragma omp parallel for			\
	num_threads(nthreads)			\
	default(none)				\
	shared(list, len, nthreads)
	for(long i=0; i < len; i++)
		list[i] = 0.0;
}
This initialization routine must be called soon after the array list[] is allocated and before any access of any entry of the array. The initialization may not be doing anything useful. The point is that the array is split between the threads, and each thread initializes the memory it is most likely to use.
The first time an entry of list[] that begins a page is accessed (the address at the beginning of a page is 0 modulo the page size, which is typically 4096 bytes), there is a page fault. The page fault handler typically finds a page frame in memory that is near the thread that triggered the fault. [87]  [87] In principle, first-touch allocation depends on assumptions about C/C++ memory management. The key assumption is that the memory that is returned by malloc() exists in virtual memory but has no associated page frames.
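Because mapping happens at page granularity, faulting a page in only requires touching one entry per page. The following sketch (the function name is ours, and the 4096-byte page size is an assumption) makes that concrete.

```c
#include <assert.h>

/* Touch one double per page to fault every page in.  Assuming
   4096-byte pages, that is one entry every 4096/8 = 512 doubles.
   The page fault handler maps each touched page to a frame near
   the faulting thread. */
void touch_pages(double *list, long len){
	const long per_page = 4096/(long)sizeof(double);
	for(long i = 0; i < len; i += per_page)
		list[i] = 0.0;
}
```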
In this example, the allocation of page frames in near memory is accomplished using an explicit initialization routine and as a consequence of demand paging. In other programs, it may be more natural to make an invocation of the most intensive part of the program. For example, in a certain program, the most cycle-intensive segment may be computing the FFT using multiple threads. In such a program, the allocation of pages in near memory may be effected by making a dummy call of the FFT routines. The results of this dummy call will be useless. In fact, it is important not to initialize memory prior to the call. Its only purpose is to allocate page frames in near memory.
In a program with many phases, such as an FFT phase, a transpose phase, and a solver phase, it may be impossible to ensure that a page frame is in near memory during each phase. We can still pick one of the phases, perhaps the most cycle intensive among the phases, and ensure that page frames are in near memory during that phase.
Another useful technique is to control the assignment of threads to processor cores. In the icpc compiler, there are two ways of doing that. The assignment of threads to cores can be specified on the Linux command line as
export KMP_AFFINITY=scatter
or
export KMP_AFFINITY=compact
The default assignment is to scatter the threads. The icpc runtime library looks up the environment variable KMP_AFFINITY and uses Linux system calls to control the binding of threads to processor cores (for the distinction between compact and scatter, see figure 5.1↑). Another way to control thread binding to processor cores with icc/icpc is to use the kmp_set_defaults() function.
For programs that assign a thread to each processor core, compact is often better, although scatter is the default. It may be a better idea to keep nearby threads in nearby processor cores because threads with nearby identifiers are more likely to be correlated and access nearby pages. All our timings assume the compact thread affinity.
In general, one can try to fix the assignment of threads to processor cores completely. Although this sounds simple in principle, it is cumbersome to get it to work in nontrivial programs.

5.2.2 Bandwidth to DRAM memory

The sum_onecore() function listed below is the engine for measuring read bandwidth to memory.
double sum_onecore(double *list, int n){
	double ans = 0;
#pragma vector always
	for(int i=0; i < n; i++)
		ans += list[i];
	return ans;
}
The icpc compiler generates excellent code for this function. It unrolls the loop and generates packed double instructions so that the cost of adding is irrelevant. The pragma directive gives a strong suggestion to icpc to use XMM/YMM/ZMM registers and packed double instructions. Other compilers would ignore this directive.
The sum_onecore() function is called by each thread created by the function sum_manycore(), whose listing follows.
double sum_manycore(double *list, long len, 
							int nthreads){
	double ans = 0;	
#pragma omp parallel				\
	num_threads(nthreads)			\
	default(none)				\
	shared(ans, list, len, nthreads)
	{
		int tid = omp_get_thread_num();
		long first = len*tid/nthreads;
		long next = len*(tid+1)/nthreads;
		double s = sum_onecore(list+first,next-first);
#pragma omp critical
		ans += s;
	}
	return ans;
}
This program splits the entries of list[] between the threads in a certain way. That is the same split as in init_manycore(), assuming nthreads to be the same. Therefore, we can send page frames to near memory by making either a call to init_manycore() or a dummy call to sum_manycore() even before list[] is initialized.
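The split computed on lines declaring first and next can be isolated for inspection. The helper names below are ours: thread tid gets entries [len*tid/nthreads, len*(tid+1)/nthreads), and integer division guarantees that the chunks tile the whole range with sizes differing by at most one.

```c
#include <assert.h>

/* First index owned by thread tid. */
long chunk_first(long len, int tid, int nthreads){
	return len*tid/nthreads;
}

/* One past the last index owned by thread tid. */
long chunk_next(long len, int tid, int nthreads){
	return len*(tid+1)/nthreads;
}
```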
The following functions were called from each thread to determine the write and copy bandwidths:
void write_onecore(double *list, int n){
#pragma vector always nontemporal
	for(int i=0; i < n; i++)
		list[i] = i;
}
​
void copy_onecore(double *list, int n){
#pragma vector always nontemporal
	for(int i=0; i < n/2; i++)
		list[i] = list[n/2+i];
}
Here the pragma directive, specific to icpc and ignored by other compilers, asks for nontemporal writes. Nontemporal writes are streamed directly to DRAM, bypassing the write-back cache, using the movntpd instruction.
Computer # of Threads/Cores read write copy
2.6 GHz SSE2 12 15.0 (3.3) 9.51 (3.2) 9.48 (2.5)
2.2 GHz AVX 16 36.7 (7.5) 34.7 (16) 22.5 (4.4)
Table 5.3 Bandwidth to memory in bytes per cycle (for reading, writing, and copying) on two different computers. The parenthesized numbers are the speedups relative to bandwidth from a single core (for the full names of the computers, see table 9.1↓).
Table 5.3↑ lists the read, write, and copy bandwidths for two different computers. In neither case do we get anything close to linear speedup for reading and copying. The speedup for the write bandwidth on the AVX machine is linear as a consequence of nontemporal writes. However, the speedup is only slightly more than a quarter as much for copying despite the same pragma directive, suggesting that the magic of the nontemporal pragma may be limited to the occasional toy example. The bandwidth for copying remains the same even when the pragma is omitted.
The peak read bandwidth on the SSE2 machine is 40 GB/s. The peak read bandwidth on the AVX machine is nearly twice as great. The peak copy bandwidths are 25 GB/s and 50 GB/s, respectively.

5.2.3 Matrix transpose

The function blocktransx() defined below is similar to the one defined in chapter 4↑. This definition differs by allowing the matrix stored in the array b[] to have a leading dimension greater than the number of rows.
void blocktransx(double *restrict a, double *restrict b, 
                 int ldb, int m, int n){
  assert((m%B==0)&&(n%B==0));
  for(int i=0; i < m; i+=B)
    for(int j=0; j < n; j+=B)
      for(int ii=0; ii < B; ii++)
        for(int jj=0; jj < B; jj++)
          b[j+jj+(i+ii)*ldb] = a[i+ii+(j+jj)*m];
}
The function listed below is a multithreaded implementation of the matrix transpose, which uses blocktransx().
 1  void blocktrans(double *restrict a, double *restrict b,
 2  		int m, int n, int nthreads){
 3  	assert(m%B==0);
 4  	assert(n%(nthreads*B)==0);
 5  #pragma omp parallel					\
 6  	num_threads(nthreads)				\
 7  	default(none)					\
 8  	shared(a, b, m, n, nthreads)
 9  	{
10  		int tid = omp_get_thread_num();
11  		int nn = n/nthreads;
12  		int nfst = tid*nn;
13  		int ldb = n;
14  		blocktransx(a+m*nfst, b+nfst, ldb, m, nn);
15  	}
16  }
The function blocktrans() assumes that the array a[] stores an m × n matrix (with leading dimension m) and saves its transpose in b[]. The transposed n × m matrix is stored in b[] using leading dimension n. The threads split the columns of a[] equally between themselves on lines 11 and 12. The matrix transposed by each thread has dimension m × nn, where nn is defined on line 11. The function call on line 14 shifts the first entry of a[] by nfst columns and the first entry of b[] by nfst to isolate the submatrix that the thread works on.
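The index arithmetic used by blocktransx() can be pinned down with an unblocked scalar version (in effect, B = 1). This sketch is ours and is only meant to check the convention, not to be fast: a[] is m × n column-major with leading dimension m, and the transpose lands in b[] with leading dimension ldb ≥ n.

```c
#include <assert.h>

/* Unblocked transpose: entry (i,j) of a[] becomes entry (j,i) of
   b[], where b[] has leading dimension ldb. */
void transx(double *a, double *b, int ldb, int m, int n){
	for(int i = 0; i < m; i++)
		for(int j = 0; j < n; j++)
			b[j + i*ldb] = a[i + j*m];
}
```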
Table 5.4↓ shows the bandwidth realized for a varying number of threads and varying block sizes B. The matrix transposed was square with dimension as close to 40,000 as possible subject to the divisibility conditions assumed on lines 3 and 4. On both the 12-core SSE2 machine and the 16-core AVX machine, the bandwidth realized in transposing is nearly 80% of the bandwidth for copying, as may be verified by comparing with table 5.3↑. Although we had complaints about the compiled code for a single processor with a single thread, such complaints may not be justified here. Getting to 80% of the best possible in a matrix transpose is quite good.
Computer # of Threads/Cores B bw
2.6 GHz SSE2 12 125 7.60 (4)
2.2 GHz AVX 16 75 16.2 (7)
Table 5.4 Bandwidth to memory (in bytes per cycle) realized in transposing a square matrix of dimension close to 40,000 on two different computers. The block size is B × B. The parenthesized numbers are speedups relative to a single core (for the full names of the machines, see table 9.1↓).
The following code was used to ensure that page frames reside in near memory:
  double *a = new double[m*n];
  double *b = new double[m*n];
  blocktrans(a, b, m, n, nthreads);
The first call to blocktrans() is made even before the arrays are initialized. The memory accesses generated by blocktrans() induce a favorable mapping to page frames. The same page of memory in either a[] or b[] may be accessed by more than one thread. In that case, whichever thread races to the page first gets a page frame close to itself. If the two threads are on the same processor package, the page will be near both the threads.

5.2.4 Fast Fourier transform

For a purely arithmetic program, such as summing the Leibniz series, the speed of the program increases linearly with the number of cores, provided the number of terms summed is large enough to outweigh the overhead of entering and leaving an OpenMP parallel region. Other tasks, such as transposing a matrix, are limited by bandwidth to memory. The bandwidth to memory does not increase linearly with the number of cores employed. Many problems in scientific computing---finite differences or finite elements on a 3D mesh to give two examples---fall in the latter category.
Dense linear algebra problems, such as matrix multiplication, are limited by bandwidth to memory if implemented in a straightforward way. In chapter 3↑, we showed how to write a microkernel for matrix multiplication that takes advantage of parallelism in the instruction set. Chapter 4↑ showed how to cleverly hide the cost of accessing memory. A multithreaded matrix multiplication must account for cache memory slightly differently because the L3 cache is common to all the processor cores on the same package. Yet linear speedup with increasing processor cores can be achieved without drastic rethinking.
Hiding the cost of memory accesses and making efficient use of processor resources are essential for optimized FFT routines as well. The inverse Discrete Fourier Transform (DFT) of the sequence a0, …, aN − 1 is given by
b_j = \sum_{k=0}^{N-1} \omega^{jk} a_k,
where ω = exp(2πi ⁄ N) is a primitive Nth root of unity. This transformation is sometimes called the DFT, but we prefer to call it the inverse DFT to maintain an analogy with Fourier analysis. If the aj are discrete Fourier coefficients, we may think of bj as equispaced samples in the time domain. The FFT is a fast algorithm for computing the DFT or its inverse.
In chapter 2↑, we discussed the FFTW and MKL libraries and their interfaces to the 1D inverse DFT. Here we return to that topic to make a few points about the FFT and its implementation.
Table 5.5↓ shows the number of cycles used by the FFT for various N as a function of the number of threads. The speedups are quite good and reach 75% of linear speedup for some of the values of N. The numbers in the table were obtained by lining up many blocks of N complex numbers (or 2N double-precision numbers) in 16 GB of memory. The number of blocks is approximately 10^9 ⁄ N, and the FFT of each block was computed. The blocks were split between threads using OpenMP. Each thread applied MKL’s 1D FFT to its share of the blocks.
Computer # of Threads/Cores N bw
2.6 GHz SSE2 12 64 0.35 (8)
2.6 GHz SSE2 12 1024 0.29 (8)
2.6 GHz SSE2 12 8192 0.33 (9)
2.2 GHz AVX 16 64 0.16 (7)
2.2 GHz AVX 16 1024 0.15 (12)
2.2 GHz AVX 16 8192 0.14 (13)
Table 5.5 Bandwidth (more precisely, the inverse bandwidth) given as the number of cycles consumed by an FFT of size N divided by N log_2 N. The parenthesized numbers are speedups relative to a single core (for the full names of the machines, see table 9.1↓).
Notes on 1D versus 2D FFT
Let us suppose that N = 2n. The basic idea of the radix-2 FFT is as follows. [88]  [88] For the FFT algorithm, see [43] and [48]. The Wikipedia pages on the FFT and related topics have a wealth of information. First rearrange the data a_0, a_1, …, a_{2n−1} as a 2 × n matrix:
\begin{pmatrix} a_0 & a_2 & \cdots & a_{2n-2} \\ a_1 & a_3 & \cdots & a_{2n-1} \end{pmatrix}.
The original vector is recovered by reading this matrix in column-major order. The first step is to take the n-dimensional FFT of each row to get
\begin{pmatrix} \hat{a}_0 & \hat{a}_2 & \cdots & \hat{a}_{2n-2} \\ \hat{a}_1 & \hat{a}_3 & \cdots & \hat{a}_{2n-1} \end{pmatrix}.
The next step is to multiply by the twiddle factors to get
\begin{pmatrix} \hat{a}_0 & \hat{a}_2 & \cdots & \hat{a}_{2n-2} \\ \hat{a}_1 & \omega\,\hat{a}_3 & \cdots & \omega^{n-1}\hat{a}_{2n-1} \end{pmatrix}.
Here ω = exp(2πi ⁄ N). If this matrix is multiplied by
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
which is the 1D transform matrix with N = 2, we will have the inverse DFT of the original vector organized as a matrix in column-major format. The mathematical justification of this claim is omitted.
The 5N log_2 N operation count of the FFT is obtained by recursive applications of the same idea, which is possible if N is a power of 2. An implementation that follows the resulting scheme will be quite inefficient, however. The scheme as it is presented will lead to a lot of cache conflicts.
The first decomposition, which we described in detail, thinks of a vector a_i, 0 ≤ i < N, as two vectors of length N ⁄ 2. The two vectors are
a_i, a_{i+2}, a_{i+4}, a_{i+6}, …
with i = 0, 1. These two vectors appeared as the rows of a 2 × N ⁄ 2 matrix in the first step. If we assume N is a power of 2 and go to a depth of k in the recursion, we find that the FFT is breaking the given data into 2^k vectors of length N ⁄ 2^k. These 2^k vectors are given by
a_i, a_{i+2^k}, a_{i+2·2^k}, a_{i+3·2^k}, a_{i+4·2^k}, …
for i = 0, 1, …, 2^k − 1. The successive entries of each of these vectors are separated by 2^k in the original sequence. As we found in chapter 4↑, locations separated by powers of 2 are likely to map to the same sets in L1 and L2 cache and create a lot of cache conflicts. This problem may be tackled using the bit-reversed permutation as explained in chapter 2↑.
Suppose the data is 2D and arranged in an M × N matrix:

	a_{0,0}   a_{0,1}   a_{0,2}   ⋯
	a_{1,0}   a_{1,1}   a_{1,2}   ⋯
	a_{2,0}   a_{2,1}   a_{2,2}   ⋯
	   ⋮         ⋮         ⋮
The 2D FFT of this matrix is obtained by applying an FFT of size M to every column followed by an application of an FFT of size N to every row. The structure of the algorithm is similar to the recursive decomposition of a 1D FFT of size MN into FFTs of size M and N, but there are no twiddle factors this time. The 2D structure is imposed by the data not by the algorithm.
If the data is in column-major order, the application of 1D FFTs of size M to columns involves no new point. However, the rows are different. If M is a large power of 2, for example, the successive entries in a row are separated by a large power of 2 in memory, increasing the risk of cache conflicts. Thus, it is not a good idea to apply 1D FFTs row by row. It is far better to directly invoke library functions for 2D FFT.
Exercise:  Suppose a large array needs to be reversed. What is the best way to split the array between threads, so that most of the entries that are accessed are in near memory? Explain why the init_manycore() function would not be suitable for initializing entries.
Exercise:  Suppose a large array needs to be sorted. Would the init_manycore() function be a good way to initialize the array to allocate pages in near memory?
Exercise:  Implement an in-place transpose of square matrices using OpenMP. Estimate the bandwidth to memory realized by your program.
Exercise:  Implement an out-of-place FFT for multiple threads (using library functions). Determine the resulting speedup on a machine with multiple cores.
Exercise:  Refer to table 5.5↑ and determine that the fraction of the peak bandwidth realized is 35% and 28%, respectively, for the 12-core 2.6 GHz SSE2 machine and the 16-core 2.2 GHz AVX machine (see table 9.1↓ for information about these machines).

5.3 Introduction to Pthreads

So far our discussion of threads has been limited to OpenMP. OpenMP is a limited programming model. It applies mainly to those situations where the data is static and the access patterns are regular. Pthreads are a more powerful programming model.
The Pthread interface for creating and running threads is supported by the Linux kernel. In Linux, each thread is a separate process. Thread creation is the same as process creation. The distinguishing feature of a group of threads is that they all use the same page tables. The same virtual address generated by different threads in the same group maps to the same physical memory location. Thus, threads offer a seamless way for different processes to share memory.
Pthreads are a powerful programming model. The range of programs that can be written using Pthreads goes well beyond scientific computing. In this section, we introduce Pthreads, beginning with the elementary and progressing to more advanced topics.
In section 5.3.1↓, we introduce Pthreads using a simple program to print messages. The two main techniques for achieving mutual exclusion, mutexes and spinlocks, are introduced. Summing the familiar Leibniz series is another example.
Every thread needs a stack to save its state through a succession of function calls. The user mode stack (explained in the next section) can be used by only one thread (the master thread). In section 5.3.1↓, we explain how a stack is set up for each thread. Thread stacks are normally just a few MB and not large. Therefore, it is important not to abuse the stack in threaded programming.
In section 5.3.2↓, we determine that the overhead of thread creation is typically around 10^5 cycles and can be 10 or 100 times as large on occasion. This number explains the large overhead of the first parallel region an OpenMP program enters. Threads are created the first time an OpenMP program enters a parallel region, and thread creation overhead is a major cost. The cost can even go into the milliseconds.
Section 5.3.3↓ considers several advanced topics. The theme of that section is to implement OpenMP-type parallel regions using Pthreads. Several implementations are explored. If the number of threads in the parallel region remains constant, threads created for the first parallel region can be reused repeatedly. This explains the low cost of a typical OpenMP parallel region compared with the first parallel region.
Section 5.3.3↓ shows the central role played by memory fences and cache coherence in threaded programming. Although the role of cache coherence is invisible to the programmer, unless it is looked for, there can be no multithreaded programming without cache coherence. Propagating writes from cache to cache, when the same cache line is used by multiple threads, can be a considerable source of overhead. Section 5.3.3↓ also shows the way in which TLB flushes can impact threaded programming.

5.3.1 Pthreads

We begin our discussion of Pthreads with a function that could be in a simple sequential program. The function print_message() uses basic C syntax and has no Pthread construct in it. Indeed, it is compiled into assembly instructions as if it were a single threaded program---an important point that has already come up in our discussion of OpenMP. Each Pthread will later take control by executing this program.
1void *print_message(void *arg){
2	char *s = (char *)arg;
3	char ss[400];
4	int l = strlen(s);
5	for(int i=0; i < l; i++){
6		ss[2*i] = s[i];
7		ss[2*i+1] = (s[i]==' ')?' ':'_';
8	}
9	ss[2*l] = '\0';
10	printf("%s",s);
11	printf("\n");
12	printf("%s", ss);
13	printf("\n");
14	printf("ss = %p \n\n", (void *)ss);
15	return NULL;
16}
The only sign that print_message() may have something to do with Pthreads occurs in its first line. It is declared to be a function that takes a single argument of type void * and returns a single value also of type void *. Whatever arguments we want to send to a Pthread must be packed into a region of memory and sent to the Pthread as a pointer of type void *. Any pointer can be cast to the type void *.
The argument to this function is a message or character string. On line 2, this message is recovered by casting it back to type char *.
The statements that extend from line 3 through line 9 define another string ss, and copy the characters in s into its even positions. The odd positions are filled with the underscore character except when the preceding even position has a blank character. Line 9 ensures that the string is properly terminated with ’\0’ following C convention.
The program prints the string s, the string ss, and the pointer ss on lines 10, 12, and 14, respectively.
Presently, we will introduce Pthreads. For the ensuing discussion, the key items are the definition of ss on line 3 and the address that is printed on line 14 of print_message().
Pthreads are created inside call_pthreads() whose definition follows.
1#include <pthread.h>
2void call_pthreads(){
3	pthread_t thread1, thread2, thread3;
4	char s[600];
5	sprintf(s, "Message for thread1");
6	sprintf(s+200, "Message for thread2");
7	sprintf(s+400, "Message for thread3");
8	pthread_create(&thread1, NULL, print_message, 
9		 (void *)s);
10	pthread_create(&thread2, NULL, print_message, 
11		       (void *)(s+200));
12	pthread_create(&thread3, NULL, print_message, 
13		       (void *)(s+400));
14	void *result;
15	pthread_join(thread1, &result);
16	pthread_join(thread2, &result);
17	pthread_join(thread3, &result);
18}
The function call_pthreads() creates three threads and sends a character string as the argument to each of them. The three threads print the message in the character string and return. There are four threads in this program, including the master thread. Each of the four threads will probably run on a different processor core on a processor with 4 or more cores.
On line 3, the program defines variables of the type pthread_t. The Pthread library will access administrative information about the threads using these variables.
The library function pthread_create() is used to create threads on lines 8, 10, and 12. The first argument is of type pthread_t *. During the call on lines 8 and 9, the Pthread library will create a thread and make the variable thread1 point to information about that thread. Later, information about the created thread can be accessed using thread1.
The second argument to pthread_create() will always be NULL for us.
The third argument to pthread_create() is print_message, which is a function pointer. The third argument is the function that the thread will start executing as soon as it is scheduled to run. The third argument must be a pointer to a function that takes an argument of type void * and returns an argument of type void *. In this program, all three threads that are created will run the same function.
The fourth argument to pthread_create() must be of type void *. When the thread starts running, it will get this pointer as an argument. In this program, each of the threads gets a different character string as its argument.
The library function pthread_join() may be used to retrieve the pointer returned by a thread. In this program, the pointer returned by each thread is saved in result. The variable result is defined on line 14 to be of type void *. The threads do not return anything meaningful here. If they did, the pointer result could be used to access the data returned by the threads.
The Pthread library has a fairly transparent syntax. By going over the definition of call_pthreads(), it is easy to see where threads are being created and how.
C/C++ compilers include Pthread header files by themselves without any prompting. The Pthread library may be linked using the -lpthread option.
Thread stacks
Each thread gets its own stack. This stack is used for storing the local variables defined by functions the thread executes. When a thread makes a function call, the return address is pushed onto its stack.
On line 14, the function print_message() prints the value of ss, which is a pointer to a location in the thread stack. The three threads print as follows:
ss = 0x4173bf90
ss = 0x4193cf90
ss = 0x41b9cf90
These do not look like stack addresses: they do not begin with 0x7fff, as in section 4.4.1↑. The addresses are in the region that belongs to dynamic memory. During thread creation, each thread is allocated a thread stack in dynamic memory.
The difference between values of the address ss printed by the three threads is about 2 MB, and the maximum thread stack size on the system we used is about 2 MB. That is typical.
Mutexes and spinlocks
The function print_message() prints to the terminal, which is shared by all the threads. The message may get garbled as in the following fragment:
Message for thread3Message for thread2
M_e_s_s_a_g_e_  f_o_r_  t_h_r_e_a_d_2_
Each thread must treat the block of print statements as a critical region.
Mutexes are one mechanism for implementing mutual exclusion in the Pthread library. The function below uses a single mutex to ensure that a block of print statements is executed by a single thread at a time.
pthread_mutex_t mutex=PTHREAD_MUTEX_INITIALIZER;
void *print_message_mutex(void *arg){
	char *s = (char *)arg;
	char ss[400];
	int l = strlen(s);
	for(int i=0; i < l; i++){
		ss[2*i] = s[i];
		ss[2*i+1] = (s[i]==' ')?' ':'_';
	}
	ss[2*l] = '\0';
	pthread_mutex_lock(&mutex);
	printf("%s",s);
	printf("\n");
	printf("%s", ss);
	printf("\n");
	printf("ss = %p \n\n", (void *)ss);
	pthread_mutex_unlock(&mutex);
	return NULL;
}
The new syntax used in this function is quite simple and hardly requires an explanation. On the first line, a variable mutex is defined and initialized. The mutex is locked before entering the block of print statements and unlocked at exit.
Another mechanism for mutual exclusion is the spinlock. The spinlock version of the function for printing messages follows:
pthread_spinlock_t spinlock;
void *print_message_spinlock(void *arg){
	char *s = (char *)arg;
	char ss[400];
	int l = strlen(s);
	for(int i=0; i < l; i++){
		ss[2*i] = s[i];
		ss[2*i+1] = (s[i]==' ')?' ':'_';
	}
	ss[2*l] = '\0';
	pthread_spin_lock(&spinlock);
	printf("%s",s);
	printf("\n");
	printf("%s", ss);
	printf("\n");
	printf("ss = %p \n\n", (void *)ss);
	pthread_spin_unlock(&spinlock);
	return NULL;
}
The first line defines a spinlock but does not initialize the spinlock. The spinlock is locked before entering the block of print statements and unlocked after the printing is done. The spinlock is initialized by the master thread as follows:
pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);
Classical mutexes, which differ in behavior from Pthread mutexes, were introduced during the early days of computers, when most computers had a single processor. When a process attempts to lock a classical mutex but the mutex is already locked, the kernel will put the process to sleep and schedule some other process to run. The sleeping process will be reawakened and scheduled to run when the mutex is available to be locked.
Spinlocks may be a preferred way to enforce mutual exclusion if every thread may be assumed to have possession of a processor core. Spinlocks do not relinquish the processor but wait in a loop until the lock is available. Mutual exclusion is achieved by spinlocks using a shared locking variable accessed using atomic instructions.
In simple programs, the behavior of Pthread mutexes is closer to spinlocks than classical mutexes. In more complex programs, Pthread mutexes can behave like classical mutexes.
Partial sums of the Leibniz series
Our discussion of Pthreads has been confined to printing messages. We turn to the Leibniz series to illustrate how multiple arguments can be passed and how to retrieve the answer computed by a thread.
In the code fragment that follows, the number of threads, including the master thread, is fixed at four, with the assumption that the program is run on a quad-core machine. The number may be set equal to the number of cores, whatever that may be, as long as it is even.
const int NTHREADS = 4;
struct leib_stuff{
	int offset;
	int n;
	double sum;
};
Each thread is given a pointer to struct leib_stuff. The struct holds three items of information. The first item is the offset from which that particular thread must start summing the Leibniz series. The second item is how many terms of the Leibniz series must be summed. The third item is used by each thread to return its part of the partial sum of the Leibniz series. Each thread assumes all terms in its share of the partial sum to be of the same sign. Therefore, the number of threads must be even.
In the earlier examples, the master thread does not share in the work done by the other threads. There is some inherent asymmetry between the master thread and other threads. For example, the master thread makes use of the user mode stack, which none of the other threads can. Yet the ideal is to make all the threads, including the master thread, do exactly the same amount of work. In the present example, the master thread will sum its share of the terms of the Leibniz series. Therefore, only three threads are created explicitly.
Each of the three created threads gets control with the following function:
void *leibniz(void *arg){
	int offset = ((struct leib_stuff *)arg)->offset;
	int n = ((struct leib_stuff *)arg)->n;
	double sum = 0;
	for(int i=offset; i < n; i += NTHREADS)
		sum += 4.0/(2*i+1);
	((struct leib_stuff *)arg)->sum = sum;
	return arg;
}
We can read this the way we read a function meant for single-threaded execution. Its peculiarity is in the roundabout way in which arguments are passed and the computed sum is returned. The thread finds out its offset and the number of terms in the partial sum n from two of the items in struct leib_stuff and stuffs its share of the sum into the third item. The argument passed and the value returned are both pointers to that struct.
The master thread computes its share of the Leibniz sum as well as accumulating the total from all the threads. It is defined below.
double leibsum(int nterms){
	pthread_t t[NTHREADS-1];
	struct leib_stuff linfo[NTHREADS-1];
	for(int i=1; i < NTHREADS; i++){
		linfo[i-1].offset = i;
		linfo[i-1].n = nterms;
		pthread_create(&(t[i-1]), NULL, leibniz, 
			       (void *)(linfo+(i-1)));
	}
	double ans = 0;
	for(int i=0; i < nterms; i+=NTHREADS)
		ans += 4.0/(2*i+1);
	void *res;
	pthread_join(t[0], &res);
	ans -= ((struct leib_stuff *)res)->sum;
	for(int i=2; i < NTHREADS; i+=2){
		pthread_join(t[i-1], &res);
		ans += ((struct leib_stuff *)res)->sum;
		pthread_join(t[i], &res);
		ans -= ((struct leib_stuff *)res)->sum;
	}
	return ans;
}
The master thread here takes on the job of accumulating the final sum. This burden can be more equally distributed by making the threads join with each other in pairs. To implement such a strategy, the array t[] of type pthread_t must be defined globally and made visible to all the threads.
In this program, the master thread joins with the other threads to retrieve the results of their computations. Even if a thread returns nothing, the master thread or some other thread must join with each thread that terminates. Otherwise, the threads live on as zombies in the operating system kernel.

5.3.2 Overhead of thread creation

To find the cost of creating and destroying Pthreads, we use the following simple function, which each thread will execute:
void *addone(void *arg){
	long *p = (long *)(arg);
	*p += 1;
	return NULL;
}
This function receives a pointer to long and adds 1 to the location its argument is pointing to. The function defined below is responsible for creating threads and asking each one of them to execute addone().
void addone_list(long *list, int nthreads){
	pthread_t *plist = new pthread_t[nthreads-1];
	*list += 1;
	for(int i=0; i < nthreads-1; i++)
		pthread_create(plist+i, NULL, addone, 
		                  (void *)(list+i+1));
	for(int i=0; i < nthreads-1; i++){
		void *result;
		pthread_join(plist[i], &result);
	}
	delete[] plist;
}
This function was invoked many times with nthreads being three. On a 3.4 GHz AVX machine with four cores (see table 9.1↓ in the appendix), there are three processes, two for the created threads and one for the master thread, each of which gets a core. The work that each thread does is trivial. During each invocation, most of the cycles are consumed by the creation and destruction of threads.
In 10^6 trials, the first five invocations of addone_list() used the following numbers of cycles:
364488 
153068 
102000 
220218 
290844
The worst five invocations of addone_list() were as follows:
1.48340e+07 
1.28682e+07 
7.52992e+06 
5.57948e+06 
5.06503e+06
The median number of cycles was 78,242.
The typical cost of creating and destroying Pthreads appears to be somewhat less than 10^5 cycles. That number is not unreasonable given that each process descriptor used by the kernel is nearly 6 KB. The cost of creating threads will vary from system to system, but the numbers are qualitatively the same on many different systems. The 3.4 GHz AVX computer with four cores used here has a single processor package. The cost of creating a thread per core may be expected to be higher on computers with multiple processor packages.
However, the cost of creating three threads is much more likely to run into millions of cycles on the quad-core 3.4 GHz AVX computer than on a machine with a dozen or more cores, even if the processor cores are split into two packages. A thread is more likely to have to wait for the time quantum of some other process to expire on a quad-core computer than on a computer with 12 cores.

5.3.3 Parallel regions using Pthreads

This section begins with a simple OpenMP program. The OpenMP program alternates between two parallel regions. In the first parallel region, every thread runs a function called addone(). In the second parallel region, every thread runs a function called addtwo().
Later in the section, the parallel regions are implemented using Pthreads. The first implementation is plain C, except for creating and launching Pthreads. The second and third implementations use spinlocks and mutexes, respectively. The final implementation uses conditional variables.
The basic idea in all four implementations is as follows. If the number of threads is n, including the master thread, the master thread begins by creating n − 1 workers. The worker threads do not exit when their job is done but keep waiting for the master to send them some more work. The threads exit only when the master tells them to.
Corresponding to the parallel regions of the OpenMP program, the master tells the workers to alternately execute addone() and addtwo(). The master thread itself alternately executes those two functions.
A glance at table 5.6↓ already throws up a number of questions. The spinlock and mutex implementations look quite similar. That is an artifact of the rather simple setting of our experiment. If the threads were executing a complex function inside the parallel region, the two implementations would look quite different.
The plain C and the spinlock implementation do not yield the processor core voluntarily. In our setting, that is true for mutexes as well. As a result, all three implementations are highly wasteful when the number of threads is greater than the number of cores.
nthreads                 2        3        4        8
OpenMP (gcc/g++)         1,400    1,700    2,100    40,000
Plain C                  720      975      1,200    8.1 × 10^7
Spinlocks (randomized)   760      2,000    4,200    *
Mutexes                  900      2,200    6,500    9.2 × 10^7
Conditional variables    30,000   28,000   9,000    18,000
Table 5.6 The cost of entering and exiting a parallel region in cycles. All measurements were made on a 3.4 GHz quad-core AVX computer (see table 9.1↓ for its full name).
In the conditional variable implementation, the threads yield gracefully. As a result, it is the most efficient when the number of threads is greater than the number of cores.
Strangely, the conditional variable implementation is faster with four or eight threads on a quad-core machine than with three threads, which in turn is faster than with two threads (see table 5.6↑). The explanation of this conundrum is important to understand, as it can happen in almost any threaded program. The explanation (TLB flushes) is given later.
The gcc/g++ implementation of OpenMP also creates threads only at the point of first entry into the parallel regions. Overall, it looks better than all our implementations. The first OpenMP parallel region that is entered is expensive because threads are created. If the number of threads changes from parallel region to parallel region, the parallel regions will be constantly hit by the same overhead.
The C implementation highlights the role of cache coherence, which is essential and fundamental to multithreaded programming. Propagating writes from cache to cache can cause significant overhead. The C implementation also introduces memory fences.
A simple OpenMP program
The two OpenMP parallel regions, in the program that will be listed shortly, alternately call the following two functions:
void addone(void *arg){
	long *p = (long *)(arg);
	*p += 1;
}
​
void addtwo(void *arg){
	long *p = (long *)(arg);
	*p += 2;
}
The functions receive a pointer to a long cast to void *, dereference it, and then add either 1 or 2 to the long location. It would of course be simpler to add 1 and 2 directly without having to call functions that work through pointers. The OpenMP parallel regions invoke these functions to preserve a close analogy to the Pthread implementations. The Pthread implementations also use these two functions.
Here is the OpenMP program with two parallel regions.
void ompmaster(long *list, int nthreads, int count){
	for(int i=0; i < count/2;i++){
#pragma omp parallel					\
	num_threads(nthreads)				\
	default(none)					\
	shared(list)
		{
			int j = omp_get_thread_num();
			addone((void *)(list+j));
		}
#pragma omp parallel					\
	num_threads(nthreads)				\
	default(none)					\
	shared(list)
		{
			int j = omp_get_thread_num();
			addtwo((void *)(list+j));
		}
	}
}
This program has two parallel regions, one of which calls addone() to add 1 to an entry of list[]. Each thread adds to the entry whose index is the same as its thread id. The thread id is returned by omp_get_thread_num(). The other parallel region adds 2 to the same entry using addtwo(). Because the parallel regions are in a for-loop that is iterated count/2 times, the effect of a single call to ompmaster() is to add 3*count/2 to as many entries of list[] as there are threads.
Parallel regions in plain C
The first implementation of parallel regions we consider makes minimal use of Pthreads. Most of it is in plain C.
The following global definitions are the basis of the C implementation of parallel regions:
typedef void (*fnlist_t[nthreads])(void *);
typedef void *arglist_t[nthreads];
volatile fnlist_t fnlist;
volatile arglist_t arglist;
volatile long work_count[nthreads];
volatile long done_count[nthreads];
This code segment defines two types, fnlist_t and arglist_t, and four variables. Here nthreads, whose definition is not shown, is a const int equal to the total number of threads, including the master thread.
The type fnlist_t is an array (of size nthreads) of pointers to functions with a single argument of type void * and returning void. The syntax for such complex types can be gotten right with a little trial and error. The type arglist_t is an array (of size nthreads) of pointers to void.
All four of the variables fnlist[], arglist[], work_count[], and done_count[] are defined to be volatile. The volatile qualifier is a message to the compiler that a variable may change unexpectedly, and the usual dependency analysis may not be valid. It prevents the compiler from saving variables in registers and carrying out other optimizations. In general, if a variable is declared volatile, the compiler assumes that the value of the variable can change unexpectedly because of some other thread, processor, or device.
These four volatile variables are used by the worker threads to communicate with the manager thread. The manager thread sets fnlist[tid] and arglist[tid] to tell the thread of identifier tid which function it should execute and what argument should be passed to it. For example, fnlist[tid] is set to addone to ask the worker thread with identifier tid to execute the addone() function defined earlier in this section.
The manager thread uses the array entry work_count[tid] to tell the worker thread of identifier tid how many units of work have been assigned to it from the beginning. The worker thread uses done_count[tid] to tell the manager how many units of work it has completed.
Making the four arrays used for communication between the manager thread and the workers volatile ensures that every single reference generates a load or store instruction. In particular, the compiler will not assign any of these items to a register, which would get in the way of communication between the manager and the workers. [89]  [89] The volatile qualifier suppresses compiler optimizations. It does not ensure any kind of mutual exclusion. See https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
Every thread created using pthread_create() begins execution with a function that takes a pointer as an argument and returns a pointer. The worker threads begin execution with the following function:
1void *worker(void *arg){
2	int tid = *((int *)arg);
3	while(1){
4		while(work_count[tid] == done_count[tid]);
5		if(work_count[tid]==done_count[tid]+1){
6			(*(fnlist[tid]))(arglist[tid]);
7			asm volatile("mfence");
8			done_count[tid] += 1;
9		}
10	}
11}
On line 2, the pointer arg is cast to int * and dereferenced to recover tid. In the Pthread library, the responsibility of assigning tid rests with the programmer. Lines 3 to 10 form a while-loop, which always tests positive. On line 4, the worker loops as long as the work count is equal to the count of items it has already completed. The compiler generates load instructions for work_count[tid] and done_count[tid] because they are volatile locations. The manager signals work to be done by incrementing work_count[tid]. Thus, on line 5, the worker enters an if-block to do work that has been assigned to it.
Line 6 is where the worker thread does the work assigned to it. On line 6, fnlist[tid] is the function pointer assigned to this worker. The worker dereferences that pointer as in
(*(fnlist[tid]))
and applies it to arglist[tid], which is the void * argument assigned to it by the manager thread.
On line 8, the worker thread signals that the unit of work assigned to it is complete by incrementing done_count[tid]. The memory fence instruction MFENCE, which occurs on line 7, is essential to the correctness of this program. On line 6, the worker thread launches the function it is told to execute, and this function leads to a potentially long instruction stream. After returning from the function, the worker increments done_count[tid], but the processor may look ahead in the instruction stream and increment even before the function returns. Such a thing would be valid in a single-threaded program but would corrupt the communication with the manager thread here. The memory fence on line 7 ensures that all load and store instructions that occur before it are complete before any loads and stores after it are issued. It is a serializing instruction for memory references. Here the memory fence makes sure that the processor does not jump ahead and signal completion of work too early.
The worker threads spin in place (on line 4), do work (line 6), and go back to spinning in place to wait for work to be assigned. How do the threads terminate? To terminate a worker, the manager sets the function pointer fnlist[tid] to exitfn. The exit function is defined below.
void exitfn(void *arg){
	pthread_exit(NULL);
}
To keep things simple, the worker threads do not return anything. They could return a pointer at the point of exit, which is pthread_exit() in this case. If some other thread joins to this one, it can pick up the returned pointer when it joins (as in the leibsum() example).
The manager creates threads, assigns work to worker threads, does its own share of the work, and shuts down the worker threads. It uses the following function to create threads:
1void spawn_workers(){
2	pthread_t plist[nthreads-1];
3	static int tidlist[nthreads];
4	for(int i=0; i < nthreads; i++){
5		work_count[i] = 0;
6		done_count[i] = 0;
7		tidlist[i] = i;
8		if(i>0){
9			pthread_create(plist+i-1, NULL, worker, 
10				       (void *)(tidlist+i));
11			pthread_detach(plist[i-1]);
12		}
13	}
14}
The manager thread uses the array tidlist[] defined on line 3 and initialized on line 6 to pass the thread identifier to the workers. Notice that threads are created only if i>0 (lines 8 through 12) because the manager thread has identifier 0. The initialization on lines 5, 6, and 7 is complete before the thread is created on lines 9 and 10, as it must be.
There are two new elements in this function that merit comment. Why is tidlist[] defined on line 3 specified to be static? Variables defined to be static persist in memory even after the function exits (and retain their values across calls). The function spawn_workers() may exit before the worker threads start executing, in which case the storage allocated to tidlist[] may disappear before the threads access it. The static specifier ensures that the storage persists in memory.
The other new bit of syntax is pthread_detach() on line 11. When a thread is detached after creation, it is no longer joinable. The operating system kernel discards detached threads after they return. If a thread is not detached, the threads are kept alive until the manager or some other thread joins with them. If a thread that is not detached returns but no other thread joins with it, it becomes a zombie.
The manager thread executes the following function:
1void manager(long *list, int count){
2	spawn_workers();
3	for(int i=0; i < count; i++){
4		for(int j=0; j < nthreads; j++){
5			fnlist[j] = (i%2==0)?addone:addtwo;
6			arglist[j] = (void *)(list+j);
7			asm volatile("mfence");
8			work_count[j] += 1;
9		}
10		(*(fnlist[0]))(arglist[0]);
11		done_count[0] += 1;
12		for(int j=0; j < nthreads; j++)
13			while(work_count[j]>done_count[j]);
14	}
15	shutdown_workers();
16}
The function assigned to the worker thread (with tid equal to j) on line 5 is either addone() or addtwo(), as in the OpenMP example. The memory fence on line 7 separates the assignment of work to a thread (lines 5 and 6) from the statement that signals assignment of work (line 8).
The manager does its own share of work on line 11.
The while-loop on line 13 spins in place as long as thread j is busy. Each iteration of the for-loop from lines 3 to 14 counts as one parallel region.
For completeness, we list the function for shutting down workers.
void shutdown_workers(){
	for(int i=1; i < nthreads; i++){
		fnlist[i] = exitfn;
		asm volatile("mfence");
		work_count[i] += 1;
	}
}
The manager does not join with any of the worker threads after it tells them to shut down; the worker threads were detached earlier. Had the worker threads not been detached, the manager would have to join with them to prevent them from turning into zombies.
Cache coherence and the cost of propagating writes
Table 5.6↑ shows that the plain C implementation we have described takes only 720 cycles per parallel region on average when the number of threads is two. Why does each parallel region take 720 cycles? In answering that question, we run into a vital part of the hardware infrastructure for supporting threaded programming, namely, the cache coherence protocols.
The worker threads and the manager exchange information using the arrays work_count[], done_count[], fnlist[], and arglist[]. These arrays are declared to be volatile. So we may think of the array entries as residing in DRAM memory, but that is not really correct. In a simple program such as this, the array entries are certain to be in the L1 cache of the core running each thread. That brings to light a new issue. Suppose the manager increases the work_count[] of a worker thread. The increment will take place in the manager's own L1 cache. But when the worker accesses the same entry, it will look up its own L1 cache. How do writes propagate from L1 cache to L1 cache?
The manner in which writes propagate from cache to cache is vital for the validity of threaded programs. Much of the time we must try to make each thread work on its own portion of the memory and keep the threads as independent as possible. But threads cannot synchronize without shared memory. Because memory is mirrored in caches, any method of synchronization using shared memory is dependent on the manner in which writes propagate from cache to cache.
Many computers, such as those listed in table 9.1↓ of the appendix, handle writes to cache as follows. Suppose a processor wants to write to a shared cache line. Before the write is complete, the hardware sends a signal to other caches to invalidate their copies of the same cache line. If the other caches do not hold a copy of the cache line, nothing needs to be done. To make this possible, the hardware stores sharing information for each cache line. If a cache line is held exclusively, the protocol for invalidating other copies is omitted. If other copies need to be invalidated, the write to cache is not complete until all duplicate copies have been invalidated. When another processor later wants to read the same cache line, a cache-to-cache transfer is triggered.
It is possible that two or more processors may attempt to write to a cache line that is duplicated in L1 caches belonging to each of them. If so, the hardwa