
15 Oct 1995

CPU and Memory Speed

Tell Me When it Hertz

All of the components in a personal computer are coordinated by a system clock. The clock generates a regular alternating signal of high and low voltages. It coordinates the various computer chips like a metronome coordinating the players in a grade school band.

The clock speed is measured in Megahertz. One Megahertz (1 MHz) is a signal that alternates between high and low values one million times a second. A 33 MHz PC has a clock which "ticks" and "tocks" 33 million times each second. Each tick-tock is called a cycle.

The mainboard generates a master clock pulse that coordinates the CPU, memory, local bus, and support chips. Originally all the components ran at the same speed. Starting with the 486DX2 family, the CPU chip internally ran at twice the speed of the master clock. Today there are so many parts of the computer each running at a different clock speed that it may be obsolete to talk about any single master clock.

It costs extra money to build a computer chip that supports higher clock speed. Yet a higher mainboard speed will not affect the performance of memory, or of the ports for the keyboard, mouse, printer, and modem. In the early days of the 486, vendors briefly increased mainboard speed up to 50 MHz. When Intel started delivering CPU chips that could internally run at 2 or 3 times the mainboard clock speed, everyone backed off to a simpler 33 MHz clock with a CPU running at 66 or 100 MHz internally.

The 33 MHz clock speed became the standard through most of the 486 period. As vendors became used to this speed for chip design, Intel selected 33 MHz as the clock speed of its new PCI I/O bus.

The Pentium design allows the CPU to run at 1, 1.5, or 2 times the speed of the external clock. This produces a rather strange selection of speeds. Pentium chips have been produced that run at 60, 66, 75, 90, 100, 120, and 133 MHz internal clock speeds. The external clock speed is normally 60 MHz (for the 60, 90, and 120 MHz models) or 66 MHz (for the 66, 100, and 133 MHz models). OK, so there is also a 50 MHz clock to run the 75 MHz chip.
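The odd-looking list of Pentium speeds is just the external clock times the multiplier. A minimal sketch in Python, assuming the nominal "66 MHz" external clock is really 200/3 MHz (an assumption not stated in the text, but one that makes the 100 and 133 MHz figures come out). The function and constant names are made up for illustration.

```python
from fractions import Fraction

# (external clock in MHz, multiplier) pairs taken from the paragraph above.
# Assumption: the nominal "66 MHz" clock is treated as exactly 200/3 MHz.
SIXTY_SIX = Fraction(200, 3)
CONFIGURATIONS = [
    (Fraction(60), Fraction(1)),
    (SIXTY_SIX, Fraction(1)),
    (Fraction(50), Fraction(3, 2)),
    (Fraction(60), Fraction(3, 2)),
    (SIXTY_SIX, Fraction(3, 2)),
    (Fraction(60), Fraction(2)),
    (SIXTY_SIX, Fraction(2)),
]

def internal_mhz(external_mhz, multiplier):
    """Internal clock speed, truncated to a whole number of MHz."""
    return int(external_mhz * multiplier)

speeds = [internal_mhz(clock, mult) for clock, mult in CONFIGURATIONS]
print(speeds)
```

Exact fractions are used instead of floating point so that 200/3 times 1.5 comes out as exactly 100.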

As the clock speed increases, the precision of chip fabrication must also go up. The width of circuits and wires in the chip is measured in microns (a really small size). Billions of dollars have been spent on factories that drive chip design below a micron, and more is being spent to go below a half micron. What is unclear at this point, however, is how much capacity will be devoted to the new Pentium Pro chip and how much will be available to increase speed for ordinary Pentiums.

Get in Gear

To add up a column of numbers with a pocket calculator, you simply type each number in and press the "+" key (or the "=" key at the end). Most users probably think that a PC spreadsheet program does the same thing. However, the human brain has actually been doing the hard part of the operation, moving down one row in the column, focusing on the number, and recognizing it. Each PC instruction carries with it a number of additional operations that would not be obvious to the casual user.

First, the computer must locate the next instruction in memory and move it to the CPU. This instruction is coded as a number. The computer must decode the number to determine the operation (say ADD) and the size of the data (say 16 bits). Additional information is then moved and decoded to determine the location in memory (the row and column of the spreadsheet). Finally, the number is added to the running total. Although a human might take some time to add two eight-digit numbers together, the addition is the simplest part of the operation for a computer chip. Decoding the instruction and locating the data take the most time.

Each generation of Intel CPU chip has performed this operation in fewer clock cycles than the previous generation.

To make a car go faster, one steps on the accelerator. Extra gas makes the engine rotate faster. When RPM gets high enough, it is better to shift to a higher gear. The PC system clock (measured in MHz) is like the engine speed (measured in RPM). The CPU model selects the gear. The original 8086 processor was like first gear, and the 486 is like fourth gear. So it is a mistake to compare clock speed across changes in the architecture. A 60 MHz Pentium actually runs faster than a 100 MHz 486.
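The gear analogy can be put into numbers: rough throughput is the clock rate divided by the average number of clock cycles each instruction takes. The sketch below is only an illustration. The cycles-per-instruction values are hypothetical placeholders, not measured figures for any real chip, and the function name is invented for the example.

```python
# Rough throughput = clock rate / average clock cycles per instruction.
# The cycles-per-instruction values are hypothetical placeholders chosen
# only to illustrate the "gear" idea; they are not measurements.

def million_instructions_per_second(clock_mhz, cycles_per_instruction):
    return clock_mhz / cycles_per_instruction

lower_gear = million_instructions_per_second(100, 2.0)   # faster clock, more cycles each
higher_gear = million_instructions_per_second(60, 1.0)   # slower clock, fewer cycles each

print(f"100 MHz at 2.0 cycles per instruction: {lower_gear:.0f} million/sec")
print(f" 60 MHz at 1.0 cycles per instruction: {higher_gear:.0f} million/sec")
```

With these made-up figures the slower clock still finishes more instructions per second, which is the point of the analogy.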

RISC

The first Intel CPU chip handled 4 bit numbers. The next model handled 8 bit numbers, and also 4 bit numbers like its predecessor. By the time the IBM PC was built, a later version handled 16 bit numbers (but also 8 bit and 4 bit numbers to be compatible). Today's 486 and Pentium systems handle 32 bit numbers (but also support the old 16 bit, 8 bit, and 4 bit instructions).

The original Intel design had instructions that came in different sizes. The most frequently executed instructions could be stored as a single byte. This made programs smaller, but it complicated the CPU's job of decoding instruction formats.

In your Sunday paper, right next to the CompUSA insert there is probably something from Sears. Look at the last few pages of the ad, where they show the tools. There will almost certainly be a picture of the traditional "190 Piece Socket Wrench Set." If you purchased this item, you would always have the right tool for any job. In reality, it is almost impossible to keep all the pieces organized, and you will spend minutes searching through all the attachments to find one of the right size.

Go to a tire store. They lift your car off the floor, remove the hubcaps, and then pick up a gun shaped device connected to a hose. "Zuuurp" and each bolt comes off the wheel. You could do the same thing with the 190 Piece Socket Wrench Set, but every garage knows that automotive wheel bolts come in only one size. So they don't have to spend time searching for the right size tool, and they can optimize the one size that they really need.

When computer designers realized the same thing, it was called Reduced Instruction Set Computers or RISC. Make all the instructions the same size. Use only one size of data. Simplify the instructions and therefore the operation decode. Then use all the room on the chip to optimize what is left, rather than filling the chip with support for instructions that are seldom executed.

IBM, Apple, and Motorola have the PowerPC. DEC has the Alpha. Sun has the SPARC. RISC chips are often smaller, cheaper, and faster than comparable 486 chips. However, most of the software in the world has been written for the Intel instruction set and the PC architecture. RISC chips run Unix well enough, and the PowerPC and Alpha systems can run Windows NT. Even then, a Windows NT system based on a RISC chip cannot run the new programs designed for Windows 95. Those programs use the 32 bit version of the Intel instruction set, and RISC systems do not support such programs.

Superscalar and Pipeline

Although a tire store may be fast at changing tires, when you really need speed look at how they do things in Indianapolis. A race car pulls into the pit for service. They jack it off the ground, and then four teams of mechanics go to work on all four wheels simultaneously. The car is back in the race in a matter of seconds. In ordinary life, such service would be prohibitively expensive. But in the world of microelectronics, transistors are cheap.

Every CPU has to decode instructions, fetch the data, perform the operation, and store the result back. On the original IBM PC, each stage was performed in turn, and then a new instruction was started. What made the 286 and 386 chips run faster was the beginning of a pipeline.

With pipeline architecture, each phase of the instruction execution is processed by a different part of the chip. At each clock tick, the partially processed instruction moves on to the next processing step and a new instruction takes its place. In one clock cycle, the CPU might be saving the results of instruction 1, performing the operation of instruction 2, assembling data for instruction 3, and decoding the operation of instruction 4.
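The movement of instructions through the pipeline can be sketched in a few lines of Python. This is a toy model of the four stages named above, assuming one instruction enters per tick and nothing ever stalls. The stage names and the function are illustrative only.

```python
# Four-stage pipeline: at each tick every instruction advances one stage
# and a new instruction enters. Assumes no stalls of any kind.

STAGES = ["decode", "fetch data", "execute", "store result"]

def pipeline_snapshot(tick):
    """Map each stage to the instruction number occupying it at a given tick.

    Instruction n (numbered from 1) enters the first stage at tick n,
    so at tick t the stage with index s holds instruction t - s.
    """
    snapshot = {}
    for stage_index, stage in enumerate(STAGES):
        instruction = tick - stage_index
        snapshot[stage] = instruction if instruction >= 1 else None
    return snapshot

for tick in range(1, 6):
    print(tick, pipeline_snapshot(tick))
```

At tick 4 the model shows instruction 1 storing its result while instruction 4 is being decoded, the same picture as the paragraph above.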

The next step was to design computer chips with duplicate versions of each processing station. Such chips are called superscalar. Commonly they are described as executing two or more instructions in a single clock tick. More precisely, a superscalar chip can perform the same stage of processing for two instructions in parallel.
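A back-of-the-envelope count shows what each step buys. The sketch below is an idealized model, assuming four stages, no stalls, and no interactions between instructions (which real chips must check for). The function name and default values are assumptions for the example.

```python
import math

def cycles_needed(instructions, stages=4, width=1, pipelined=True):
    """Idealized cycle count with no stalls and no dependencies.

    width is the number of duplicate processing stations per stage
    (1 for a plain pipeline, 2 or more for a superscalar chip).
    """
    if not pipelined:
        # Each instruction runs all of its stages before the next one starts.
        return instructions * stages
    groups = math.ceil(instructions / width)
    # The first group needs `stages` ticks; each later group finishes one tick after.
    return stages + groups - 1

print(cycles_needed(100, pipelined=False))  # one instruction at a time
print(cycles_needed(100))                   # simple pipeline
print(cycles_needed(100, width=2))          # two-way superscalar
```

Under these assumptions 100 instructions take 400, 103, and 53 cycles respectively. Real programs fall short of the ideal because instructions depend on each other's results.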

Superscalar designs started in RISC architectures, where the instruction set made processing simpler. However, Intel was able to apply superscalar design to its processors as well. The Pentium chip handles two instructions at a time, and the Pentium Pro handles three instructions in parallel.

When a computer chip design starts to move beyond simple pipeline or superscalar designs, the number of operations in the original instruction set becomes less and less important. Combining pipeline and superscalar features, a Pentium Pro can be processing some part of 20 or 30 instructions at any given clock tick. Checking for interactions and dependencies becomes the biggest problem. Intel's enormous volume of sales gives it the technological and financial resources to address problems of design and fabrication. It may start with a harder problem than RISC systems, but it has the resources to overcome this handicap.

Nanoseconds

All the ads and specifications quote clock speed in Megahertz. However, the more important number is the length of time between clock ticks (the cycle time). Such periods are usually measured in nanoseconds (billionths of a second) abbreviated "nsec."

Electricity travels through a copper wire just a bit slower than the speed of light. Normally, we can just regard the speed of light as "very fast." It becomes important when the distances are very long (astronomy) or when the times are very short (computers). A nanosecond is the amount of time that it takes light (or an electric signal) to travel about one foot.

PC clock speeds appear at first to be a strange collection of numbers. However, the corresponding cycle times display a much more regular pattern:

        Clock     Cycle
        25 MHz    40 nsec
        33 MHz    30 nsec
        50 MHz    20 nsec
        66 MHz    15 nsec
        100 MHz   10 nsec

A processor with a 100 MHz clock must perform operations in less time than it takes for electricity to travel 10 feet. The chip is very small, but it has millions of circuits. All must be manufactured to a very high level of precision.
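The table follows from one division: the cycle time in nanoseconds is 1000 divided by the clock speed in MHz. A short Python sketch, using the rule of thumb from the text that a signal covers about one foot per nanosecond (the function name is made up for the example):

```python
def cycle_time_nsec(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz

for mhz in (25, 33, 50, 66, 100):
    nsec = cycle_time_nsec(mhz)
    # About one foot of signal travel per nanosecond, per the text.
    print(f"{mhz:>3} MHz  {nsec:5.1f} nsec  about {nsec:.0f} feet of signal travel")
```

The 33 and 66 MHz entries work out to slightly more than 30 and 15 nsec; the table rounds them.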

However, it is much simpler to apply quality control to a chip the size of a fingernail than to the entire mainboard. This by itself shows the problems of a higher speed main clock, and the benefit of capping the mainboard design at 25 or 33 MHz (40 to 30 light-feet of signal distance).

Memory

A computer chip receives and transfers data through the "pins" that run along each side. A memory chip receives part of the memory address on one set of pins, then after a period of time it responds with the data at that memory location. The delay between address and data is the response time of the chip. The memory chips most widely used on personal computers return either one bit of data on one pin or 4 bits of data on 4 pins at a time. More advanced memory chips (currently used on more expensive computers) can deliver even larger data units per request.

Ordinary PC memory uses Dynamic Random Access Memory or DRAM. Typically, such memory is rated to respond in 70 to 80 nanoseconds. A smaller amount of memory may be added to the system as an "external" or "second level" cache. This is made of faster Static RAM or SRAM.

DRAM is typically packaged on small boards called SIMMs. Each memory location holds 8 bits of data (a byte), but the memory also maintains an extra bit to serve as a parity error check. Memory does not fail often, but when there is a problem it generally affects all the data in a chip. The parity check detects the error. The system will crash, but the diagnostic error at power up indicates which SIMM needs to be replaced.
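The parity check itself is simple enough to sketch in Python. The example assumes the odd parity convention (the ninth bit is chosen so that the total number of 1 bits is odd), which the text does not specify. The function names are invented for illustration.

```python
def parity_bit(byte, odd=True):
    """Return the ninth bit so the total count of 1 bits is odd (or even)."""
    ones = bin(byte & 0xFF).count("1")
    bit_for_even = ones % 2               # the bit that would make the total even
    return bit_for_even ^ 1 if odd else bit_for_even

def passes_check(byte, stored_parity, odd=True):
    """True when the byte read back still matches the stored parity bit."""
    return parity_bit(byte, odd) == stored_parity

value = 0b01100001
stored = parity_bit(value)               # computed when the byte is written
corrupted = value ^ 0b00000100           # one bit flipped by a failing chip

print(passes_check(value, stored))       # True
print(passes_check(corrupted, stored))   # False
```

A single parity bit can only detect an odd number of flipped bits. Two flipped bits in the same byte cancel out and go unnoticed.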

The first widely used SIMM had 30 connectors (called "pins" although they are really flat metal pads on the edge of the board). This type of SIMM could transfer one byte of data in any given memory request. It was designed in the days when the 286 and 386SX computer chips accessed memory in two byte units. The 30 pin SIMM was also used in early 486 computer systems, but by then it had become awkward. Since a 486 chip needs to transfer 4 bytes of data in each memory reference, the 30 pin SIMMs have to be installed in groups of 4. Four 1M SIMMs provide a 4 meg upgrade, while four 4M SIMMs provide a 16 meg upgrade. No new computer systems use 30 pin SIMMs, but some upgrade mainboards are sold that support a mixture of old 30 pin and newer memory technology (so you don't have to throw out all your old memory when you upgrade the board).

Modern PCs are designed for a larger 72 pin SIMM. The additional pins allow each SIMM to deliver four bytes of data (plus parity) in every memory request. Additional pins deliver an ID value that allows the CPU to determine the size and speed of the memory. During power up, the computer can reject SIMMs that fail to meet its minimum standards for speed. Normally the 72 pin SIMMs contain 4, 8, or 16 megabytes of memory. However, some 32 and even 64 meg SIMMs are available.

Pentium systems (and less common RISC computers) are based on a 64 bit (8 byte) data interface. To support this requirement, the 72 pin SIMMs must be installed in matching pairs.
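The grouping rules come down to dividing the width of the CPU data path by the width of one SIMM. A small Python sketch using only the widths given in the text (the table and function names are hypothetical):

```python
# Data bytes delivered per memory request by each SIMM type, from the text.
SIMM_WIDTH_BYTES = {30: 1, 72: 4}

def simms_per_group(cpu_data_bits, simm_pins):
    """How many matching SIMMs must be installed together to fill the data path."""
    bus_bytes = cpu_data_bits // 8
    width = SIMM_WIDTH_BYTES[simm_pins]
    return -(-bus_bytes // width)        # ceiling division

print(simms_per_group(32, 30))   # 486 with 30 pin SIMMs
print(simms_per_group(32, 72))   # 486 with 72 pin SIMMs
print(simms_per_group(64, 72))   # Pentium with 72 pin SIMMs
```

The three cases give groups of 4, 1, and 2 SIMMs.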

There are two ways to build a 1M SIMM using the older 30 pin technology. The SIMM has to deliver 8 bits of data and one parity bit. The simplest version uses 9 chips, each holding a megabit of data, each delivering one bit of data on demand through one pin. However, as 4M chips became more common, some vendors created a three-chip SIMM by packaging two 4M chips (each delivering 4 bits through 4 pins) and a 1M chip (delivering 1 bit through 1 pin).

The Internet news groups occasionally get reports that a user has had trouble mixing the nine-chip and three-chip 1M SIMMs in some particular clone machine. Although the two should be interchangeable, the three-chip SIMM seems to have a different refresh timing, and certain third party mainboards are not properly designed to handle it. The general recommendation is to use only one type of SIMM in a system.

Hurry Up and Wait

Data travels between the CPU and memory on parallel wires called a "bus." These wires contain the clock signal, address bits, data bits, and control signals. One of the control signals is called ready. The CPU is the fastest device on the mainboard of the computer. It has to be slowed down to match the memory and I/O device speed.

Every bus operation begins with a new clock tick. If the CPU needs data from memory, it places the address of the data on the bus. The 386 or 486 bus is not designed to respond in the same cycle that the address is presented. The minimum bus operation takes two cycles. At the end of the second cycle the CPU checks the value of the ready wire. If the data was found in 25 nsec SRAM cache, then ready will be active and the CPU reads the data. Otherwise, ready is inactive and the CPU waits for another clock cycle.

Each clock period that ends with the ready wire inactive is called a wait state. No matter what the vendor says, there is no such thing as a high speed machine with 0 wait states. The processor speed is too fast and DRAM is too slow for that to be possible. Wait states are also introduced by I/O devices as they need more time to process data.

A Cold Shower

Turn on a shower first thing in the morning. The water comes out cold. The shower has to run for a minute before it gets hot. Just how long the delay is depends on how thick and long the pipe is between the water heater and the bathroom.

The same problem occurs in electricity and is called capacitance. Turn on a radio, and it may take a second before any music comes out. The capacitance of a PC mainboard is tiny compared to a radio, but it is significant when time is measured in nanoseconds.

The Intel CPU chip introduces a minimum 3 nsec delay between the clock tick and the time that the address reliably leaves the CPU. The mainboard adds several nsecs while the address flows through the circuits to the last memory socket. There will be a similar delay when the data value travels back from the memory chip to the CPU. The engineers who design the mainboard take this into account and program wait states to handle the worst case.

The engineers build a mainboard rated for a 33 MHz (30 nsec) clock, 70 nsec memory, and one wait state. Since there are two clock cycles in addition to the wait state, the memory reference operates in 3x30 or 90 nsec. The memory responds in 70 nsec, so there is an additional 20 nsec for the signal delays on the mainboard.

Most mainboards are designed with a specific number of wait states for a specific memory speed. A board designed for 70 nsec memory can be populated with 60 nsec memory instead. It will still wait 70 nsec for the data (plus any overhead for the signals to travel out and back).
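The arithmetic in the last two paragraphs can be written out as a short Python sketch. It assumes the two-cycle minimum bus operation described earlier; the function names are made up for the example.

```python
def memory_reference_nsec(cycle_nsec, wait_states, base_cycles=2):
    """Total time for one memory reference: two bus cycles plus wait states."""
    return (base_cycles + wait_states) * cycle_nsec

def signal_delay_budget_nsec(cycle_nsec, wait_states, memory_nsec):
    """Time left over for signals to cross the mainboard and come back."""
    return memory_reference_nsec(cycle_nsec, wait_states) - memory_nsec

print(memory_reference_nsec(30, 1))          # 90
print(signal_delay_budget_nsec(30, 1, 70))   # 20
print(signal_delay_budget_nsec(30, 1, 60))   # 30
```

The last line shows why faster memory does not help on its own: the reference still takes 90 nsec, and the extra 10 nsec is simply unused slack.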

Neither the memory nor the CPU knows how long the mainboard delays will be. The water heater does not determine how long it will take for the hot water to reach the bathroom. Installing a larger water heater, or turning up the thermostat on the current heater, will not reduce the amount of time it takes to run the cold water out of the pipes. Using better memory does not change the mainboard design. The speed by which the CPU references memory is determined by the clock rate and the number of wait states inserted when reading or writing memory.

Cache in your Chips

About thirty years ago, IBM demonstrated that a mainframe computer could run faster if it had "cache memory." The cache is a high speed memory area that keeps a copy of the most recently used data in main memory. Most programs go back and reexecute the same instructions, or update the same numbers, and the cache provides better performance.

A modern PC has two levels of cache. A First Level Cache is contained within the CPU chip itself. The Intel 486 family was designed with 8K of first level cache memory. Subsequent Intel and non-Intel computer designs have 16K or even 32K internal cache. The first level cache is best, because it can be used by the internal pipeline instruction processing components of the CPU. A 486 can add two numbers in two clock ticks, provided that both numbers are held inside the CPU. When one of the numbers is outside the CPU chip, the instruction takes a minimum of 4 clock ticks no matter how fast the external memory might be.

An external or Second Level Cache can be installed with SRAM chips on the mainboard. SRAM responds faster than DRAM. Usually SRAM can deliver data in a single external clock tick. However, the 486 has still been substantially delayed just to set up any type of external memory reference.

Because it is relatively expensive for the CPU to set up an external memory reference, and because there is a high probability that the next number or instruction needed by the program will be the one immediately following the last number or instruction, the 486 is designed to fetch data from memory in a "burst". The internal cache is arranged in units of 16 bytes. Each unit is called a "line" of cache. Whenever an instruction references data that is not in the internal cache, the chip generates a memory reference for the four bytes of needed data, and then fetches the remaining 12 bytes of a 16 byte memory area.
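A cache line fill can be sketched with a little address arithmetic in Python. The line and transfer sizes come from the text. The order in which the remaining three transfers arrive is an assumption made for the example (simple ascending order, wrapping within the line); the real chip may use a different order. The function name is hypothetical.

```python
LINE_SIZE = 16   # bytes per cache line on the 486, from the text
WORD_SIZE = 4    # bytes moved per bus transfer

def burst_addresses(address):
    """Return the four transfer addresses for a line fill, needed word first.

    Assumption: the remaining words follow in ascending order, wrapping
    around within the 16-byte line. The actual burst order may differ.
    """
    line_base = address & ~(LINE_SIZE - 1)     # start of the 16-byte line
    first_word = address & ~(WORD_SIZE - 1)    # 4-byte unit holding the needed data
    start = first_word - line_base
    offsets = [(start + i * WORD_SIZE) % LINE_SIZE
               for i in range(LINE_SIZE // WORD_SIZE)]
    return [line_base + offset for offset in offsets]

print([hex(a) for a in burst_addresses(0x1236)])
```

For a reference to address 0x1236 this model fetches the word at 0x1234 first, then 0x1238, 0x123c, and 0x1230 to complete the line.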

When the data is found in the SRAM second level cache, there is one external memory cycle to set up the transfer, then the four bytes needed for the current instruction arrive on the second external clock cycle, and the remaining twelve bytes arrive in the next three clock ticks. After five mainboard cycles, all 16 bytes have been moved into the chip. Without the burst, an old 386 would have used eight clock ticks to fetch the same data, since each individual four byte unit would have required one cycle to set up the transfer and one cycle to receive the data.
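The five-versus-eight comparison can be checked with a few lines of Python. This is only a restatement of the counts above, assuming zero wait states (data found in SRAM cache); the function names are invented for the example.

```python
def burst_cycles(transfers=4, setup_cycles=1):
    """One setup cycle, then one transfer per clock tick."""
    return setup_cycles + transfers

def non_burst_cycles(transfers=4, setup_cycles=1, data_cycles=1):
    """Every transfer pays for its own setup cycle."""
    return transfers * (setup_cycles + data_cycles)

CYCLE_NSEC = 30   # 33 MHz mainboard clock

print(burst_cycles(), burst_cycles() * CYCLE_NSEC)            # 5 150
print(non_burst_cycles(), non_burst_cycles() * CYCLE_NSEC)    # 8 240
```

At a 30 nsec cycle the burst moves the 16 bytes in 150 nsec instead of 240 nsec.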

If the data is not in the SRAM second level cache, then it must be fetched from ordinary DRAM memory. DRAM typically requires at least 70 nsec for each response, and this is two or three external clock cycles. So fetching the data from DRAM involves "wait" states where the CPU has to stop and wait for the memory to respond.

A Pentium Pro CPU is distributed in a subsystem board that contains 256K of Second Level Cache. On earlier systems, the SRAM cache was on the mainboard and was managed by the memory controller of the mainboard chipset. As far as an old Pentium was concerned, Second Level Cache was simply faster DRAM. However, a Pentium Pro chip has an entirely separate 64 bit data bus to its integrated Second Level Cache chip. It can access that chip separately, in parallel to ordinary DRAM references.

DRAM will require a 70 nsec delay to respond. During this period, a traditional CPU would be blocked by wait states, unable to continue. The superscalar and pipeline design allows the Pentium Pro to continue to preprocess instructions that follow the instruction currently blocked waiting for the DRAM data. Because the second level cache has a separate data bus, it can return data to support the preprocessing even while the external DRAM bus is blocked. Once the data arrives from DRAM, these preprocessed instructions can be hustled through the last stages of execution at the maximum speed (up to three instructions per clock tick).

Interleave

In the same period that processor cycle time has dropped from 100 to 20 nsec, memory speed has only dropped from 100 to 70 nsec. With a 10 nsec processor cycle time possible, memory response becomes a problem. Mainframe designers addressed this problem twenty-five years ago by "interleaving" memory.

A 486 chip has 32 data pins. It references memory four bytes at a time. Most mainboards simply connect the CPU pins directly to the memory sockets. But the mainboard can achieve higher performance if it creates a wider path to memory. Interleaved memory has two sets of sockets and two 32-bit memory bus arrangements running in parallel. The CPU may generate a reference for four bytes of data, but the memory controller on the mainboard fetches eight bytes. Half of the data fetched is used to satisfy the current instruction, the other half is buffered for a cycle.

The design of a 486 with internal cache and a 16-byte burst provides the opportunity for interleaving. Immediately after the CPU fetches the data needed for an instruction, it will continue to fetch the remaining 12 bytes immediately around it. With interleaved memory, the next four bytes have been anticipated and are immediately available. Even better, while the second four bytes are being transferred to the CPU, the memory controller has already started to fetch the second block of eight bytes.

Since the memory cannot be made to run any faster, interleaving doubles the effective speed of the memory by moving twice as many bytes in the same amount of time as an ordinary 32-bit bus. Of course, it adds complexity to the mainboard and forces memory to be installed in pairs of matching SIMMs.

In simple cases, interleaving matches the speed of the memory to the speed of the CPU. A 33 MHz processor operates on a 30 nsec cycle. RAM operating at 70 nsec takes slightly more than two cycles to respond. However, a device that processes four bytes every 30 nsec is approximately matched by a device that delivers eight bytes every 70 nsec. Interleaving therefore speeds up the memory subsystem without actually speeding up the memory.
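The "approximately matched" claim is easy to verify by comparing bytes moved per microsecond. A rough Python sketch, ignoring setup cycles and mainboard signal delays (the function name is made up for the example):

```python
def bytes_per_microsecond(bytes_per_transfer, nsec_per_transfer):
    return bytes_per_transfer * 1000.0 / nsec_per_transfer

cpu_demand = bytes_per_microsecond(4, 30)         # 33 MHz CPU, 4 bytes per cycle
plain_dram = bytes_per_microsecond(4, 70)         # ordinary 32-bit path to 70 nsec DRAM
interleaved_dram = bytes_per_microsecond(8, 70)   # two 32-bit paths in parallel

print(f"CPU can consume:       {cpu_demand:6.1f} bytes per microsecond")
print(f"Plain DRAM delivers:   {plain_dram:6.1f} bytes per microsecond")
print(f"Interleaved delivers:  {interleaved_dram:6.1f} bytes per microsecond")
```

Interleaving brings the memory from well under half of what the CPU can consume to most of it, about 114 against 133 bytes per microsecond.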

A Pentium processor (and, for that matter, a DEC Alpha or a PowerPC) has a 64-bit data path. It requires a 64-bit memory bus and SIMMs installed in pairs. However, it is not quite correct to call this arrangement "interleaved" any more. Generally, memory is interleaved only if the bus to memory is twice as wide as the CPU's native memory reference. DEC does interleave memory on its Alpha systems, but to accomplish this it must build a 128-bit memory bus and install memory in groups of four matched SIMMs.

SX, SLC, DX

The chip suffix letters were once meaningful. They have now degenerated under an avalanche of marketing pressure. What does GT or LTD really mean on a car? Well, a 486 SLC means about the same thing.

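There are three considerations in an Intel-compatible CPU chip: the width of the data path to memory (16 or 32 bits), the number of address bits (24 or 32), and whether the chip contains a floating point unit.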

There is also the question of how much internal cache the chip supports, but that depends more on the vendor than on the suffix of the chip.

The whole SX/DX mess got started when Intel introduced the 386 SX chip. It had a 16-bit path to memory and supported only 24 address bits (allowing a maximum of 16 megabytes). In current use, the IBM 486 SLC and SLC2 chips have the same data and address restrictions. The original SLC family may no longer be part of the IBM product line, but it appears in certain IBM-made mainboard upgrade kits that can be purchased to convert an old 386 machine to some type of 486 machine.
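The 16 megabyte ceiling follows directly from the number of address bits: each extra bit doubles the amount of memory that can be addressed. A short Python check, with a full 32-bit address shown for comparison (the function name is made up for the example):

```python
MEGABYTE = 2 ** 20

def max_memory_megabytes(address_bits):
    """Largest memory, in megabytes, reachable with the given number of address bits."""
    return 2 ** address_bits // MEGABYTE

print(max_memory_megabytes(24))   # 16
print(max_memory_megabytes(32))   # 4096
```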

All other chips in current use (SX, DX, DX2, SL, DLC, etc.) support a 32-bit path to memory and a 32-bit address. In the true Intel 486 family, a DX, DX2, or DX4 chip has an internal math coprocessor (a Floating Point Unit or FPU). This is important for scientific, statistical, or mathematical calculations. It does nothing for ordinary word processing, spreadsheets, or even graphics packages. A program must be written to use the instructions supplied by the FPU. Since most machines don't have it, most vendors do not use these instructions.

Since ordinary business applications (word processing, spreadsheets, database) do not use floating point instructions, it used to be possible to save a few hundred dollars and buy the less expensive Intel 486 SX chips. Unlike the 386 SX, where "SX" meant a 16-bit path to memory, a 486 SX chip has a full 32-bit path to memory, but no floating point unit.

Over the last year, Intel has increased its production of Pentium processors and dropped prices on its 486 DX chips. On a new machine there is little to be saved by taking an SX chip without the floating point capability over a DX chip that has it. Although the floating point may have little effect on the actual performance of the machine, at this point the reduction in resale value would probably outweigh the additional cost (if any).

Intel has confused things by adding an SL family of chips with low power requirements. These are used in laptops and "Green" desktop units. It appears that some SL chips have a floating point unit, and some do not. It is necessary to read the fine print to know what an SL system provides.


Copyright 1995 PCLT -- Introduction to PC Hardware -- H. Gilbert

This document generated by SpHyDir, another fine product of PC Lube and Tune.