Abstract -- IO Processors (IOPs) are key elements of storage applications: they attach the maximum number of IO devices, such as RAID disks, to a Host, move data at high speed between the Host and these devices, and perform functions such as parity computation for RAID (Redundant Array of Independent/Inexpensive Disks) applications. This paper describes a second-generation [1] 800 MHz PowerPC SOC that works with 800 MHz DDR2 SDRAM memory and an on-chip L2 cache and executes up to 1600 DMIPS. It is designed around a high-speed central bus (a 12.8 GByte/s PLB) with crossbar switch capability, integrating three PCI Express ports, one PCI-X DDR interface, and a RAID 5 & 6 hardware accelerator. The RAID 6 algorithm implements two disk parities, with the second (Q) parity generated from finite Galois Field (GF) primitive polynomial functions. The SOC has been implemented in a 0.13 um, 1.5 V nominal-supply, bulk CMOS process. Typical active power consumption is 8 W when all the IP blocks run simultaneously.
I - INTRODUCTION
This paper describes a PowerPC system-on-a-chip (SOC) intended to address the high-performance RAID market segment. The SOC uses IBM's CoreConnect technology [2] to integrate a rich set of features, including a DDRII-800 SDRAM controller, three 2.5 Gb/s PCI-Express interfaces [3], hardware-accelerated XOR for RAID 5 and RAID 6, I2O messaging, three DMA controllers, a 1 Gb Ethernet port, a parallel peripheral bus, three UARTs, general-purpose IO, general-purpose timers, and two IIC buses.
II - SYSTEM OVERVIEW
This SOC design consists of a high-performance 32-bit RISC processor core that is fully compliant with the PowerPC specification. The PowerPC architecture is well known for its low power dissipation coupled with high performance, and it is a natural choice for embedded applications. The processor core for this design is based upon an existing, fixed-voltage PowerPC 440 core [2]. The core includes a hardware multiply-accumulate unit, static branch prediction support, and a 64-entry, fully associative translation lookaside buffer. The CPU pipeline is seven stages deep. Single-cycle-access, 64-way set-associative, 32-KByte instruction and data caches are connected to the processor core.
The following block diagram shows the two IP blocks that are used as RAID 5 and RAID 6 hardware assists, with the high-speed DMA unit controlling the flow of data.
Figure 1: RAID 5 & 6 IOP processor block diagram
A 256 KB second-level (L2) cache is integrated to improve processor performance. Applications that do not require the L2 may optionally use it as on-chip SRAM. The L2 memory arrays include redundant bits for parity and spares that can be connected after test and configured with on-chip fuses.
III - ON-CHIP HIGH-SPEED CROSSBAR BUS
The key element of this SOC for high-speed data transfer is the central 128-bit-wide, 200 MHz crossbar PLB (Processor Local Bus) [1]. Two of the eleven masters can simultaneously access one of the two PLB slave buses: one specialized for High-Bandwidth (HB) data transfer and a second for Low-Latency (LL) access. The same physical memory in the SDRAM can be reached on either the HB or the LL slave bus through two aliased address ranges. By convention (but not by requirement), the LL bus segment is used by the PowerPC for low-latency access to memory, while the HB bus segment is used for large data moves by the DMA engines.
The crossbar architecture separates the 64-bit address, 128-bit read data, and 128-bit write data buses, allowing simultaneous duplex operations by two independent masters and resulting in a peak theoretical bandwidth of 10 GBytes/sec.
While the crossbar arbiter supports 64-bit addressing, the PowerPC 440 CPU is a 32-bit processor that can address up to 4 GB of physical memory. The 64-entry TLB translates this address into a real 36-bit PLB address (the upper 28 bits of the 64-bit address are zeros), giving access to a 64 GB total address space.
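The effective-to-real translation can be illustrated with a minimal sketch. The table entry below and its field layout are illustrative assumptions, not the actual PPC440 TLB format; it only shows how a 32-bit effective address can land above the 4 GB line in the 36-bit real space.

```python
# Illustrative model (not the actual PPC440 TLB entry format) of mapping
# a 32-bit effective address into the 36-bit real PLB address space.
TLB = {
    # effective page number -> real page number, 4 KB pages;
    # this single entry is a hypothetical mapping above the 4 GB line.
    0x00010: 0x800010,
}

def translate(effective_addr):
    epn = effective_addr >> 12      # effective page number
    offset = effective_addr & 0xFFF
    rpn = TLB[epn]                  # TLB-miss handling omitted in this sketch
    real = (rpn << 12) | offset
    assert real < (1 << 36)         # 36-bit real address: 64 GB space
    return real
```

For example, `translate(0x00010ABC)` yields a real address beyond the 4 GB reachable with a flat 32-bit address.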
IV - RAID 5 and 6 Algorithm
RAID 5 was developed to provide data recovery in the case of a single drive failure by adding a parity disk that is used with the remaining disks to rebuild the failing data. Note that an error on a disk drive must be detected by another error-detection circuit, such as CRC checking.
With the more recent RAID 6, data recovery is possible in the case of simultaneous failure of two disk drives. Both RAID 5 and RAID 6 are supported by the AMCC PowerPC 440SP/440SPe:
In RAID 5, only an XOR is needed to compute parity and recover data in the case of a sector fail.
P = A0 ⊕ A1 ⊕ A2
In RAID 6, dual parity P,Q with GF coefficients is needed:
P = A0 ⊕ A1 ⊕ A2
Q = (A0 × a) ⊕ (A1 × b) ⊕ (A2 × c)
with operators: × multiply, ⊕ exclusive OR
The RAID 6 algorithm for Q parity generation is based on Galois Field (GF) primitive polynomial functions. With the PPC440SPe it is possible to use several different values of the GF polynomial, including 0x11D and 0x14D, which correspond to the equations:
0x11D: x^8 + x^4 + x^3 + x^2 + 1
0x14D: x^8 + x^6 + x^3 + x^2 + 1
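The P and Q equations above can be sketched in a few lines of Python. The GF(2^8) multiply reduces modulo the 0x11D polynomial given above; the choice of per-disk coefficients as successive powers of the generator 2 is the conventional RAID 6 formulation and is an assumption here, not a detail taken from the chip documentation.

```python
# Sketch of RAID 6 dual-parity math in GF(2^8) using the 0x11D primitive
# polynomial. Coefficients g^0, g^1, g^2, ... (powers of the generator 2)
# are the conventional RAID 6 choice, assumed here for illustration.

def gf_mul(a, b, poly=0x11D):
    """Carry-less multiply of a and b, reduced modulo the primitive polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:        # degree reached 8: fold back with the polynomial
            a ^= poly
        b >>= 1
    return result

def raid6_parity(blocks):
    """Return (P, Q) for a list of equal-length byte blocks (one per disk)."""
    n = len(blocks[0])
    p = bytearray(n)
    q = bytearray(n)
    for i, blk in enumerate(blocks):
        coeff = 1
        for _ in range(i):                   # coeff = 2^i in GF(2^8)
            coeff = gf_mul(coeff, 2)
        for j in range(n):
            p[j] ^= blk[j]                   # P = A0 ⊕ A1 ⊕ A2 ...
            q[j] ^= gf_mul(coeff, blk[j])    # Q = Σ coeff_i × Ai
    return bytes(p), bytes(q)

a0, a1, a2 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
p, q = raid6_parity([a0, a1, a2])
```

With two independent equations (P and Q), any two failed strips can be recovered by solving the resulting 2x2 system over GF(2^8).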
V - RAID Hardware assist options
There are two options for the RAID hardware assist.
The first is to attach the RAID assist directly to the PLB on-chip bus as an independent unit with its own DMA controller, which performs the dual P,Q parity computation across the different operands under the control of the DMA engine.
The second option is to place the RAID hardware assist directly in the memory controller and enable it when the address on the PLB on-chip system bus falls within a predefined address range. In this case, predefined function and parameter fields must be carried in the reserved bits of the 64-bit PLB address.
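A small sketch can show the flavor of this address-window decode. The window base, window size, and the position of the function field below are purely illustrative assumptions; the actual bit layout used by the chip is not given here.

```python
# Hypothetical sketch of the address-window decode described above.
# RAID_WINDOW_BASE/SIZE and the function-bit field position are
# illustrative assumptions, not the actual chip's address map.
RAID_WINDOW_BASE = 0x4_0000_0000   # assumed window base in the 64-bit space
RAID_WINDOW_SIZE = 0x1_0000_0000   # assumed 4 GB window

def decode_plb_address(addr):
    """Return (is_raid_op, function_bits, memory_offset) for a PLB address."""
    in_window = RAID_WINDOW_BASE <= addr < RAID_WINDOW_BASE + RAID_WINDOW_SIZE
    if not in_window:
        return (False, 0, addr)            # plain memory access
    offset = addr - RAID_WINDOW_BASE
    function_bits = (offset >> 28) & 0xF   # assumed function/parameter field
    memory_offset = offset & 0x0FFF_FFFF   # remaining bits address memory
    return (True, function_bits, memory_offset)
```

The point is that a normal write transaction, simply by targeting the aliased window, both selects a RAID function and names the memory location to operate on.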
VI - RAID 5 Hardware assist on PLB
The block diagram in Figure 2 shows that an XOR and Not-XOR function is attached directly to the PLB and computes the parity in each cycle as a new operand enters the unit.
The Hardware XOR engine computes a bit-wise XOR on up to 16 data streams with the results stored in a designated target. The XOR engine is driven by a linked list Command Block structure specifying control information, source operands, target operand, status information, and next link. Source and target can reside anywhere in PLB and/or PCI address space.
Fig 2: DMA aware XOR unit operations
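The linked-list Command Block structure described above can be modeled in software. The field names below are illustrative, not the hardware descriptor layout; the model only captures the chaining and the bit-wise XOR over up to 16 source streams.

```python
# Software model of the linked-list Command Block chain driving the XOR
# engine: each block names up to 16 source streams, a target, and the
# next block. Field names are illustrative, not the hardware layout.
class CommandBlock:
    def __init__(self, sources, target, next_block=None):
        assert len(sources) <= 16   # engine handles up to 16 data streams
        self.sources = sources      # list of byte buffers (operands)
        self.target = target        # bytearray receiving the XOR result
        self.status = None          # written back when the block completes
        self.next = next_block      # next link in the command chain

def run_xor_engine(block):
    """Walk the chain, accumulating a bit-wise XOR of all sources per block."""
    while block is not None:
        for src in block.sources:
            for i, byte in enumerate(src):
                block.target[i] ^= byte
        block.status = "done"
        block = block.next

tgt = bytearray(2)
cb = CommandBlock([b"\x0F\xF0", b"\xFF\x00"], tgt)
run_xor_engine(cb)
```

In hardware the same structure lets the DMA engine stream operands from anywhere in the PLB or PCI address space without CPU involvement, following the next-link pointers between blocks.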
VII - RAID Hardware assist in Memory Queue
The block diagram in Figure 3 shows that the XOR and Galois field computation for Q parity is integrated in the Memory Queue of the SDRAM memory controller. Two circuits are implemented: one working on writes to the memory, acting like a read-modify-write, and a second one performing multiple reads of data, as shown in Figure 4.
Fig 3: Write XOR function in Memory Queue for RAID 6
Fig 4: Read XOR function in Memory Queue for RAID 6
Figures 5 and 6 explain how the address on the PLB is decoded to activate the RAID 5/6 hardware assist in the Memory Queue if this address falls within a predefined window.
Fig 5: PLB Address to access RAID 6 function in Memory Queue
Fig 6: Memory System Address
VIII - Disk STRIP operation
One of the important operations in RAID 5 or 6 is the update of a strip (of 64 KB, for example), as illustrated in Figure 7. A Host that wants to update a strip on the RAID disks has to provide the new data to the IOP in a temporary memory space; the P and Q parities must then be recomputed. The first step consists of computing an intermediate parity P,Q as the XOR of the current parity and the strip that will be replaced. Then P and Q are computed a second time from the new data and this intermediate parity. After all these operations, the new P and Q, as well as the new data, can be sent to the disk controller.
These operations are illustrated in the following figure.
Figure 7: Strip update with RAID 6
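The two-step update above can be sketched as one pass per byte: fold the old data out of P and Q, then fold the new data in. As before, the GF(2^8) multiply uses the 0x11D polynomial, and treating each disk's coefficient as a power of the generator is the conventional RAID 6 formulation, assumed here for illustration.

```python
# Sketch of the strip update described above: P' = P ⊕ Dold ⊕ Dnew and
# Q' = Q ⊕ c×(Dold ⊕ Dnew), where c is the updated disk's GF coefficient.
# The coefficient convention is an assumption, not the chip's exact scheme.

def gf_mul(a, b, poly=0x11D):
    """Carry-less multiply reduced modulo the 0x11D primitive polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def update_strip(p, q, old_data, new_data, coeff):
    """Return the updated (P, Q) when one strip changes on the disk with
    GF coefficient `coeff`, without rereading the other disks."""
    p2 = bytearray(len(p))
    q2 = bytearray(len(q))
    for j in range(len(p)):
        delta = old_data[j] ^ new_data[j]
        p2[j] = p[j] ^ delta                 # P' = P ⊕ Dold ⊕ Dnew
        q2[j] = q[j] ^ gf_mul(coeff, delta)  # Q' = Q ⊕ c×(Dold ⊕ Dnew)
    return bytes(p2), bytes(q2)
```

Only the old strip, the new strip, and the two parity strips need to be touched, which is why the IOP stages the Host's new data in a temporary memory space before committing everything to the disk controller.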
IX - TEST RESULTS
The performance throughput depends on the number of disk drives. The following curve shows one example of throughput for a full-stripe RAID 6 workload, measured on the PPC440SPe with its integrated RAID 6 hardware assist.
Figure 8: Full Stripe RAID 6 throughput
Figure 9: Board for RAID performance benchmarking
A special board with a modular approach for PCI-Express, PCI-X, SDRAM DDR2 DIMMs, and peripheral attachments has been developed. It permits debugging of the SOC device with DDR1 and DDR2 SDRAM as well as PCI Express and PCI-X DDR connectors. Debug was done with the IBM RISCWatch debugger through the JTAG serial link I/O.
X - SOC IMPLEMENTATION
Figure 10: PowerPC IOP Chip layout
Main Features
Technology
Packaging
Due to the large number of I/Os (495) needed to integrate all the peripherals, the I/Os are placed in an area array across the die. A peripheral approach to IO implementation would have been possible with a staggered structure; however, it would have resulted in a larger die size and a part more sensitive to noise because of large simultaneous switching.
The device is based on an ASIC with integrated synthesizable cores (also called IPs), with the exception of the PowerPC CPU core, which is a precharacterized hard core with optimized timing analysis and a tuned clock distribution to achieve 800 MHz.
The logic is described in Verilog, and synthesis was done with the Synopsys synthesis tool. The physical design, including floorplanning, placement, and wiring, was done with IBM's proprietary ChipBench tool. Beyond conventional signal-integrity verification, special care was taken in the physical implementation to minimize noise induced by coupling and simultaneous switching.
Extensive simulation of each core, together with simulation after SOC integration, resulted in a first-pass-good product.
Figure 11: PowerPC IOP block diagram
CONCLUSION
A high-performance SOC based on a PowerPC CPU core for RAID storage applications has been tested at 800 MHz, with main interfaces such as DDRII SDRAM at 800 MHz and three PCI-Express ports. Two RAID hardware accelerators permit data throughput in the range of 1 GByte per second.
REFERENCES
[1] G. Boudon et al., "A PowerPC SOC IO Processor for RAID Applications," IP/SOC 2004.
[2] IBM Corp., CoreConnect Bus Architecture, 1999.
[3] I. Granovsky and E. Perlin, "Integrating PCI Express IP in a SoC," IP/SOC 2006.