ABSTRACT
Carnegie Mellon, School of Computer Science
Walking Four Machines By The Shore
Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill
Carnegie Mellon University
Pittsburgh, PA 15213
Recent studies have shown that database workloads use the hardware less efficiently
than scientific workloads do, and have identified the processor and memory subsystem
as the true performance bottlenecks when running decision-support workloads on
various commercial DBMSs. Conceptually,
all of today's processors follow the same sequence of logical operations when executing
a program. Nevertheless, there are internal implementation details that critically affect
the processor's performance, and vary both within and across computer vendors' products.
To determine accurately how variation in processor and memory subsystem design affects
DBMS performance, we need to measure the impact of individual microarchitectural
parameters on database management systems.
This study compares the behavior of a prototype database system built on top of the Shore
storage manager across three different processor design philosophies: the Sun UltraSparc
(using processors UltraSparc-II and UltraSparc-IIi), the Intel P6 (using an Intel PII Xeon),
and a Compaq/DEC Alpha (using a 21164A). These processors differ widely in their
processor and memory subsystem designs. The prototype system is a pertinent choice because
its hardware behavior was found to be similar to that of commercial database systems when
executing decision-support workloads. To evaluate the different design decisions and trade-offs
in the execution engine and memory subsystems of the above processors, we ran several range
selections and decision-support queries on a memory-resident TPC-H dataset. The results
suggest that, provided there are no serious hardware implementation concerns,
decision-support workloads would achieve higher performance with the following designs:
1. A processor design that employs (a) out-of-order execution to more aggressively
overlap stalls, (b) a high-accuracy branch prediction mechanism, and (c) the opportunity
to execute more than one load/store instruction per cycle, and
2. A memory hierarchy with (a) non-inclusive (at least for instructions) caches,
(b) a large (> 2 MB) second-level cache, and (c) a large cache block size (64-128 bytes)
without sub-blocking, to exploit spatial locality.
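
As an illustration of point 2(c), the following is a minimal sketch and not the study's measurement methodology. It counts how many distinct cache blocks a sequential scan touches when it reads one field from each fixed-size record, roughly as a range selection over a memory-resident table might. The function name, the record layout, and all sizes are hypothetical values chosen for illustration, and the model assumes an initially empty cache in which every distinct block costs exactly one fetch.

```python
def blocks_touched(num_records, record_size, field_offset, field_size, block_size):
    """Count distinct cache blocks touched when a scan reads one field per record.

    Assumption: records are stored contiguously starting at address 0 and the
    cache starts empty, so each distinct block corresponds to one compulsory miss.
    """
    touched = set()
    for i in range(num_records):
        start = i * record_size + field_offset   # first byte of the field
        end = start + field_size - 1             # last byte of the field
        # A field may straddle a block boundary, so add every block it covers.
        for block in range(start // block_size, end // block_size + 1):
            touched.add(block)
    return len(touched)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 records of 16 bytes, reading a 4-byte field.
    for block_size in (32, 64, 128):
        fetches = blocks_touched(1000, 16, 0, 4, block_size)
        print(f"block size {block_size:3d} B: {fetches} block fetches")
```

Under these assumptions, doubling the block size halves the number of block fetches for the 16-byte records, because neighboring records share a block. With hypothetical 256-byte records the count stays at one block per record for all three block sizes, which shows that larger blocks pay off only when the access pattern has spatial locality.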
Last updated
16 February, 2004