Pete's Log: I like reading! part 4

Entry #1149, (Coding, Hacking, & CS stuff)
(posted when I was 23 years old.)

Implementation of a General-Purpose Dataflow Multiprocessor
by Papadopoulos, MIT Press, 1991

This book is apparently just a published form of Papadopoulos' PhD thesis. It presents the motivation for dataflow architectures, describes the ETS architecture and its implementation in Monsoon, and covers a few other interesting issues.

Chapter 1 provides an overview of multiprocessing, presents the issues that make efficient parallel programming hard, and motivates dataflow as a solution to those issues.

Issues:

  • shared memory v. message passing
  • synchronization costs
  • context switching costs


Parallel programming is not easy. Programmers must generally find parallelism by hand and use difficult mechanisms for synchronization and sharing data. This should be easier. Shared memory is particularly appealing because it allows programmers to manipulate shared data in a manner they are more familiar with.

There are numerous models of synchronization, including producer-consumer, fork-join, and mutual exclusion. However, in most systems these mechanisms lack good hardware support and are instead implemented as system or library calls.
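As a concrete (and anachronistic, since pthreads postdates the book) illustration of synchronization-by-library-call, here's what fork-join looks like in C today; every synchronization point is a library call with real cost, not a hardware primitive:

    #include <pthread.h>

    /* The forked task's share of the work. */
    void *child(void *arg)
    {
        /* ... do half the work ... */
        return arg;
    }

    void fork_join_example(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, child, NULL);  /* "fork": a library call */
        /* ... parent does its own share of the work ... */
        pthread_join(t, NULL);                  /* "join": another library call */
    }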

Task state transitions (context switches) are costly. Traditionally a task is in one of four states: running, ready, waiting, or terminated. Costs associated with switching between states are the cost of creating a task, cost of switching a task (ready->run), cost of blocking a task (run->wait), synchronization cost (wait->ready), and exit cost (run->term).

In order for a parallel task to execute efficiently, it needs to spend enough time in the run state to make up for the overhead of switching between the other states.
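To make the break-even point concrete, here's a minimal C sketch of the state model; the struct and the cycles framing are my own, not the book's:

    /* The classic four task states and the costs of moving between them. */
    enum task_state { READY, RUNNING, WAITING, TERMINATED };

    struct transition_costs {
        unsigned create;  /* (nothing) -> READY      */
        unsigned sw;      /* READY     -> RUNNING    */
        unsigned block;   /* RUNNING   -> WAITING    */
        unsigned sync;    /* WAITING   -> READY      */
        unsigned exit_;   /* RUNNING   -> TERMINATED */
    };

    /* A task pays for itself only if useful time in RUNNING outweighs
     * the overhead of moving through the other states. */
    int worth_running_in_parallel(unsigned run_cycles, struct transition_costs c)
    {
        return run_cycles > c.create + c.sw + c.block + c.sync + c.exit_;
    }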

Dataflow machines can generally eliminate the create, switch, sync, and exit costs, and incur only mild overhead for blocking.

Another issue is how blocking is implemented: do you poll or suspend? Polling wastes cycles, but suspending requires some mechanism for unsuspending.
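Here's a quick sketch of the trade-off in C with pthreads (my example; the book predates pthreads):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool ready = false;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    /* Polling: the waiter keeps its processor busy re-checking the flag. */
    void wait_by_polling(void)
    {
        while (!ready)
            ;  /* spin: wasted cycles until a producer sets ready */
    }

    /* Suspending: cheap while waiting, but someone must wake us up. */
    void wait_by_suspending(void)
    {
        pthread_mutex_lock(&m);
        while (!ready)
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);
    }

    void producer(void)
    {
        pthread_mutex_lock(&m);
        ready = true;
        pthread_cond_signal(&cv);  /* the "mechanism for unsuspending" */
        pthread_mutex_unlock(&m);
    }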

The concept of a tree of frames is also covered, as a convenient runtime environment for parallel execution. One thing to beware of is the potential for increased function call costs when a tree of frames is used instead of a stack of frames. A stack of frames is easy to manage, but with a tree of frames, frame allocation becomes a vital issue.
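To see why allocation becomes the issue, compare the two disciplines in a C sketch (the frame layout here is my invention):

    #include <stdlib.h>

    #define FRAME_SLOTS 32

    /* In a *tree* of frames, several callees of the same parent can be
     * live at once, so frames are not freed in LIFO order and a simple
     * stack-pointer bump won't do. */
    struct frame {
        struct frame *parent;
        struct frame *free_next;  /* free-list link while unused */
        long slots[FRAME_SLOTS];
    };

    static struct frame *free_list = NULL;

    /* Stack discipline is one pointer bump; here every call goes
     * through a free list (or something like it), so allocation cost
     * lands directly on the function-call path. */
    struct frame *frame_alloc(struct frame *parent)
    {
        struct frame *f = free_list;
        if (f)
            free_list = f->free_next;
        else
            f = malloc(sizeof *f);  /* grow the pool on demand */
        if (f)
            f->parent = parent;
        return f;
    }

    void frame_free(struct frame *f)
    {
        f->free_next = free_list;   /* frames come back in any order */
        free_list = f;
    }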

Chapter 2 describes the Tagged-Token Dataflow Architecture (TTDA), from which ETS grew. The TTDA grew out of earlier static dataflow architectures and addresses their inability to support recursive and dynamic procedure calls. It accomplishes this by introducing contexts that specify activation frames for code.

Interesting issues regarding tag efficiency are raised in section 2.2. Of particular value are probably the discussion of deadlock when the context space is exhausted, and the ideas of constant space within a context, further subdividing a context for iteration, and offsetting code from a base address specified in the context. Keep in mind that context is used here in the dataflow sense: in the TTDA there is no physical memory corresponding to a context; the context is simply part of the tag.
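To keep the pieces straight, here's roughly what a TTDA-style tag carries, sketched as a C struct; the field names and widths are my guesses, not the book's:

    #include <stdint.h>

    /* In the TTDA the context is just a unique name -- nothing in
     * memory backs it -- and the instruction is an offset from a code
     * base named by the context. */
    struct tag {
        uint32_t context;      /* unique activation name, no backing store */
        uint32_t iteration;    /* subdivides a context for loop iterations */
        uint16_t code_offset;  /* offset from the context's code base address */
        uint8_t  port;         /* which operand of the target instruction */
    };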

As usual, the big problem with the TTDA and similar architectures is their complex token matching schemes, which require either fully content-addressable memory or complex hashing algorithms.
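A toy version of hash-based matching makes the awkwardness easy to see; this is my sketch, and it ignores everything that makes the real problem hard (collisions, overflow, associativity):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MATCH_SLOTS 4096

    /* Waiting-matching store: the first token of a pair parks under a
     * hash of its tag; the second finds it and fires. */
    struct match_entry { uint64_t tag; long value; bool full; };
    static struct match_entry store[MATCH_SLOTS];

    static size_t slot(uint64_t tag)
    {
        return (tag * 0x9E3779B97F4A7C15ULL) >> 52;  /* top 12 bits */
    }

    /* Returns true and fills *partner when this token completes a pair. */
    bool try_match(uint64_t tag, long value, long *partner)
    {
        struct match_entry *e = &store[slot(tag)];
        if (e->full && e->tag == tag) {  /* second token: match found */
            e->full = false;
            *partner = e->value;
            return true;
        }
        e->tag = tag;                    /* first token: wait here */
        e->value = value;
        e->full = true;                  /* (silently clobbers on collision!) */
        return false;
    }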

Chapter 3 introduces ETS and describes how it solves the problems of the TTDA by making the context less abstract. Instead of using the context simply to generate unique names for tags, the context now addresses some portion of physical memory. This eliminates the need for complex matching hardware. Instead, when a token arrives, presence bits at the location specified by the context (and the offset in the current instruction) determine whether the token is the first or second required for an activity to take place. If it is the first (and the activity requires two operands), the token's value is stored in the location and the presence bits are toggled to full. If it is the second, the token's value and the value stored at the location become the inputs for the activity, and the activity is executed.
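Here's the arrival logic as I understand it, in a C sketch (names mine):

    #include <stdbool.h>

    /* One operand slot in an activation frame: a presence bit plus
     * storage for a waiting value. The token's context names the frame;
     * the instruction supplies the offset. No associative matching. */
    struct slot { bool present; long value; };

    /* Returns true (filling lhs/rhs) when the activity can execute. */
    bool token_arrives(struct slot *frame, int offset, long value,
                       bool needs_two_operands, long *lhs, long *rhs)
    {
        struct slot *s = &frame[offset];
        if (!needs_two_operands) {  /* monadic: fire immediately */
            *lhs = value;
            return true;
        }
        if (!s->present) {          /* first operand: store and wait */
            s->value = value;
            s->present = true;      /* toggle presence bit to full */
            return false;
        }
        s->present = false;         /* second operand: fire */
        *lhs = s->value;
        *rhs = value;
        return true;
    }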

Some interesting ideas and issues are discussed. Of special interest are probably the details on data movement between contexts. And though it has come up in many of the papers I've already discussed, I don't think I've specifically mentioned split-phase memory operations. They're a good idea, and relevant to global memory accesses in PIM.
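For my own notes, the shape of a split-phase read, sketched in C (the types, names, and queue are illustrative only):

    /* Phase one: send the request, tagged with where the reply should
     * land, and keep going. Phase two: the reply arrives later as an
     * ordinary token and rendezvouses in the frame like any operand. */
    struct return_tag   { unsigned frame_base; unsigned offset; };
    struct read_request { unsigned address; struct return_tag ret; };

    #define QCAP 256
    static struct read_request queue[QCAP];
    static unsigned qtail;

    /* Phase 1: hand the request to the memory unit; do NOT wait. */
    void issue_read(unsigned address, struct return_tag ret)
    {
        queue[qtail++ % QCAP] = (struct read_request){ address, ret };
        /* the processor immediately returns to other enabled activities;
         * the reply token targets ret.frame_base + ret.offset and goes
         * through the normal presence-bit rendezvous. */
    }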

Chapter 4 describes techniques for compiling code for ETS, contrasted in particular with compiling for the TTDA. I skimmed through this chapter, since most of it seemed of little interest to me. However, a few things may be worth revisiting, especially the section on resource management.

Chapter 5 was a brief description of how to view an ETS machine as a multithreaded von Neumann architecture. Nothing new or exciting.

Chapter 6 introduces Monsoon as an implementation of ETS. Not too much here that I hadn't seen before, so I did a lot of skimming. One interesting thing, covered in greater detail than I'd seen elsewhere, was the use of "subdomains" to distribute data structures among nodes. There was also a discussion of how to distribute code among nodes.

Chapter 7 gave more detail on Monsoon from a programmer's perspective, with code examples and such. Again I skimmed.

Chapter 8 concluded.