Pete's Log: Simultaneous Multithreading

Entry #1156 (Coding, Hacking, & CS stuff)
(posted when I was 23 years old.)

I've finally gotten around to reading some papers on Simultaneous Multithreading. It's an interesting concept. Until reading these papers, just about all the talk of multithreaded architectures (MTAs) I had encountered was in the context of using them as part of a larger multiprocessor system. I was curious to see if anyone thought it worthwhile to analyze the benefits of hardware multithreading from a single-processor perspective.

What does hardware multithreading give us, anyway? From my perspective, the basic idea is that by replicating the hardware associated with the state of a task, we can keep more than one context live in the processor, thus allowing us cheap context switches. Among other things, cheap context switches allow us to more readily mask latency. This is particularly useful in a multiprocessor system where internode communication can be a significant source of latency. But isn't latency a concern in a uniprocessor system as well?
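
To make that concrete for myself, here's a rough sketch in C (entirely my own toy model, nothing taken from the papers) of what "replicating the hardware associated with the state of a task" might amount to: each thread gets its own program counter and register file, so a context switch is just picking a different context index rather than saving and restoring state through memory.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CONTEXTS 4   /* hardware thread contexts kept live on chip */
    #define NUM_REGS     32

    /* Per-thread state the hardware would replicate: PC plus a register file. */
    struct hw_context {
        uint64_t pc;
        uint64_t regs[NUM_REGS];
    };

    struct cpu {
        struct hw_context ctx[NUM_CONTEXTS]; /* one copy of state per thread */
        int active;                          /* which context is issuing now */
    };

    /* A "context switch" is just selecting a different live context; no state
       gets saved to or restored from memory, which is why it is cheap. */
    static void switch_context(struct cpu *cpu, int next)
    {
        cpu->active = next;
    }

    int main(void)
    {
        struct cpu cpu = { .active = 0 };
        cpu.ctx[0].pc = 0x1000;   /* pretend thread 0 is running here */
        cpu.ctx[1].pc = 0x2000;   /* thread 1 is parked at its own PC */

        printf("running context %d at pc %#llx\n",
               cpu.active, (unsigned long long)cpu.ctx[cpu.active].pc);

        switch_context(&cpu, 1);  /* thread 0 stalls (say, a cache miss) */

        printf("running context %d at pc %#llx\n",
               cpu.active, (unsigned long long)cpu.ctx[cpu.active].pc);
        return 0;
    }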

What do we achieve by masking latency? The latency-inducing task will generally not execute any faster if we mask its latency (scenarios can be constructed in which it will, but for the most part it is easy to argue that masking a task's latency actually increases that particular task's total execution time). What masking latency does do is increase processor utilization. Or, put another way, masking latency can increase throughput.
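
A quick back-of-the-envelope calculation (numbers entirely made up by me) of what I mean: suppose a thread does 100 cycles of useful work and then stalls for 300 cycles on a remote access. Switching away doesn't make that thread finish any sooner, but the processor spends most of those 300 cycles doing somebody else's useful work.

    #include <stdio.h>

    int main(void)
    {
        /* Made-up numbers: one thread's behavior. */
        double work  = 100.0;  /* cycles of useful work per "burst"       */
        double stall = 300.0;  /* cycles spent waiting on a remote access */

        /* Without latency masking the processor idles during the stall. */
        double util_unmasked = work / (work + stall);

        /* With enough other ready threads, most of the stall can be filled
           with their useful work.  Assume a small cost per switch.        */
        double switch_cost = 10.0;
        double filled      = stall - 2.0 * switch_cost; /* out and back in */
        double util_masked = (work + filled) / (work + stall);

        printf("utilization without masking: %.0f%%\n", 100.0 * util_unmasked);
        printf("utilization with masking:    %.0f%%\n", 100.0 * util_masked);

        /* The stalled thread itself still takes (work + stall) cycles, or a
           bit more once it has to wait to be switched back in.             */
        return 0;
    }

Utilization goes from 25% to 95% in this toy case, while the masked thread's own execution time stays the same or gets slightly worse, which is exactly the trade I was trying to describe.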

So current commercial processors make heavy use of superscalar designs to increase performance. A superscalar processor takes advantage of instruction-level parallelism (ILP) to increase throughput. But for the most part, superscalar processors execute only one stream of instructions, so data and control dependencies leave a lot of processor resources underutilized. "Traditional" MTAs can have the processor executing multiple streams of instructions, but at any given point, instructions are only being issued from one thread. The idea in simultaneous multithreading is to take a superscalar architecture, make it multithreaded, and then allow instructions from multiple threads to be issued in the same cycle. This way, hazards that would prevent one thread from issuing enough instructions to keep all the functional units busy do not leave the processor underutilized, because we hope to have enough active threads to issue instructions to all functional units in every cycle.

It's definitely a neat idea, and I should probably learn more about the work done in this area. There is one particularly exciting thing I found while looking for simultaneous multithreading papers: somebody wrote a paper analyzing operating system behavior on a (simulated) SMT machine. I also found it interesting that at some point in late 1999, Compaq announced they were going to use SMT in future Alpha processors. I'm curious as to what became of that.
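
To convince myself I have the issue-slot picture right, here's a toy simulation in C (completely made up, nothing like the simulators used in these papers): every cycle an 8-wide machine tries to fill 8 issue slots, a single thread can only offer as many independent instructions as its dependencies allow, and an SMT machine tops the slots off with instructions from other threads.

    #include <stdio.h>
    #include <stdlib.h>

    #define ISSUE_WIDTH 8
    #define CYCLES      10000

    /* Made-up model: how many independent instructions a thread can offer
       this cycle (its available ILP), limited by data/control dependencies. */
    static int available_ilp(void)
    {
        return rand() % 4;   /* 0..3 instructions ready, ~1.5 on average */
    }

    /* Fill up to ISSUE_WIDTH slots per cycle from `nthreads` threads. */
    static double simulate(int nthreads)
    {
        long issued = 0;
        for (int c = 0; c < CYCLES; c++) {
            int slots = ISSUE_WIDTH;
            for (int t = 0; t < nthreads && slots > 0; t++) {
                int ready = available_ilp();
                int take  = ready < slots ? ready : slots;
                slots  -= take;
                issued += take;
            }
        }
        return (double)issued / CYCLES;   /* instructions per cycle */
    }

    int main(void)
    {
        srand(1);
        printf("1 thread:  %.2f IPC\n", simulate(1)); /* superscalar, one stream  */
        printf("8 threads: %.2f IPC\n", simulate(8)); /* SMT fills leftover slots */
        return 0;
    }

The absolute numbers mean nothing, but the shape of the result is the whole argument: one thread leaves most of the slots empty, and eight threads mostly fill them.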

Simultaneous Multithreading: Maximizing On-Chip Parallelism
by Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. 22nd ISCA, 1995

I had indirectly heard bad things said about this paper before reading it, but I tried to keep an open mind anyway. I'll start out with some positive commentary that applies to this and the next paper. Having read three papers co-authored by Susan Eggers, I've found that they share some features I really like. They have a lot of references, which makes finding further reading easy, and they have a lot of graphs and give generally good interpretations of what the data in the graphs mean. Also, all three papers dealt with simulations, and I think I learned a little about simulating architectures from them.

This particular paper begins by explaining what Simultaneous Multithreading is. It then describes a simulation environment used to gather data on the effectiveness of SMT. They use the SPEC92 benchmark suite as their simulation workload. First, the bottlenecks in a single-threaded superscalar architecture are discussed (with simulation data to show where those bottlenecks are). The 8-issue superscalar processor they simulate manages to execute only 1.5 instructions per cycle on average, despite 8-way issue. They then model several SMT architectures to see how they perform. With 8 threads and 8-way issue, they see issue rates of more than 6 instructions per cycle in an ideal SMT configuration, and more than 5 IPC in more implementable configurations. So this seems to indicate that SMT should increase processor throughput.
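
Just to put those numbers in perspective, here's the arithmetic (mine, using only the IPC figures quoted above):

    #include <stdio.h>

    int main(void)
    {
        double width       = 8.0;  /* issue slots per cycle                */
        double superscalar = 1.5;  /* IPC reported for the 8-issue machine */
        double smt_ideal   = 6.0;  /* IPC for the ideal 8-thread SMT       */

        printf("superscalar utilization: %.0f%%\n", 100.0 * superscalar / width);
        printf("SMT utilization:         %.0f%%\n", 100.0 * smt_ideal / width);
        printf("throughput improvement:  %.1fx\n", smt_ideal / superscalar);
        return 0;
    }

In other words, less than 20% of the issue slots get used in the single-threaded case versus 75% for the ideal SMT, roughly a 4x improvement in throughput.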

Next they discuss cache issues, since more threads mean more memory pressure and more cache conflicts. They talk about a few design ideas, primarily revolving around private per-thread caches. They show a graph with some data, but despite their discussion of the results, I was disappointed. For the most part, if fewer than the maximum number of threads were executing, the original cache scheme won out by quite a bit, and even with the maximum number of threads, the performance numbers for all the cache schemes were within a percent or two of each other. So I really didn't agree with their arguments in favor of one of their other cache configurations. I was particularly amused by their use of the sentence "Private I caches eliminate conflicts between different threads in the I cache" to explain why private I caches performed better than private D caches, since the analogous statement would hold true for private D caches as well. I can imagine why there would be more interthread conflicts in the I cache than in the D cache, but that's not something their explanation addressed.

Next, the paper compared SMT with single-chip multiprocessing. This was somewhat interesting because their numbers showed that generally SMT can achieve better performance with fewer resources than single-chip multiprocessing.

So their section discussing cache issues disappointed me. Beyond that, I found the paper interesting and educational. I am not, however, convinced of how useful their raw numbers are. The data and results are explained well enough, but I don't know how much of the simulation environment they describe I can trust. They explained the SMT architecture they assumed and how they chose to simulate it, but at times I was less than convinced that their decisions were appropriate, and I felt that more justification for those decisions would have been in order. Regardless, I still enjoyed the paper. I'll need to further investigate the negative comments I heard about it, though.


An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture
by Joshua A. Redstone, Susan J. Eggers and Henry M. Levy. 9th ASPLOS, 2000

This was a fun paper, and it addresses issues that I am directly interested in. What they did was adapt an Alpha module for SimOS to simulate an SMT Alpha architecture, and then modify Digital Unix 4.0d to run on the SMT Alpha they were simulating. With that in place, they compared the performance of SPECInt95 with and without simulating OS calls on both the SMT and the regular Alpha, and they also compared the performance of Apache on the two architectures. The reason for the SPECInt95 simulations was to see whether not simulating OS interaction has a significant effect on other simulations, since most performance numbers for simulated architectures don't factor in OS interaction.

The Apache simulations then showed that SMT sees a big win when OS-heavy applications are considered. For SPECInt95, the SMT saw only a 5% decrease in performance when the OS was factored in, but the non-SMT architecture saw a 15% decrease in performance, indicating that OS considerations really are important when arguing the benefits of a proposed architecture. For the Apache simulations (the authors determined that Apache spends about 75% of its cycles in kernel mode), the SMT architecture outperformed the non-SMT by a factor of 4.2, which is apparently the biggest improvement an SMT architecture has seen over a non-SMT relative in any benchmark. So not only is the OS an important consideration, the OS is actually an application that can see significant improvement on an SMT architecture.

The paper provides a lot of detail on the specific areas and causes behind the results they saw, and those details are interesting. It seems that cache and TLB considerations are among the most significant. Of course, they probably have less influence on PIM considerations, but I'm investigating the big picture beyond only PIM right now.

This paper definitely confirms that the interaction of OS and MTA is an important consideration and that the amount of work done in that area so far is remarkably small.


A Simulator for SMT Architectures: Evaluating Instruction Cache Topologies
by Ronaldo Goncalves et al.

This paper was rather difficult to read; it was written by researchers in Brazil, and the English is rough in places. It also proved less interesting than I expected, because it was mainly a description of the SMT simulator they built on top of SimpleScalar and didn't really seem to offer me anything new at this time. I wish, however, that I were more patient with poor English in situations like this, because I'm certain there is plenty of good work being published by people who do not natively speak English. But I'm plagued by this notion that if they're going to publish in English, they should probably find a native speaker of the language to proofread for them. I'm not very nice sometimes.

netscape hates me, so I'm just gonna post this now without much proofreading.