Wednesday, April 29, 2009

RAID Configuration

There is significant confusion in many organizations that use ORACLE together with RAID technologies. I will attempt to make this information clear and understandable.

What Is RAID?
RAID is a technology for expanding the capacity of the I/O system and providing data redundancy. It stands for Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks.
Conceptually, RAID is the use of 2 or more physical disks to create 1 logical disk, where the physical disks operate in tandem to provide greater size and more bandwidth. RAID has become an indispensable part of the I/O fabric of any system today and is the foundation for storage technologies supported by many mass storage vendors. The use of RAID technology has redefined the design methods used for building storage systems that support Oracle databases.


The 3 Main Concepts In RAID
When you talk about RAID, there are 3 terms that are important and relevant:
• Striping
• Mirroring
• Parity

1- What Is Striping?
Striping is the process of breaking data down into pieces and distributing it across the multiple disks that make up a logical volume – “Divide, Conquer & Rule”. This results in a “logical volume” that is larger and has greater I/O bandwidth than a single disk. It relies purely on the linear effect of incrementally adding disks to a volume to increase its size and I/O bandwidth. The increase in bandwidth is a result of how read/write operations are performed on a striped volume.
Imagine that you are in a grocery store with about two hundred of your closest friends and neighbors, all shopping for the week’s groceries. Now consider what it’s like when you get to the checkout area and find that only one checkout line is open. That poor clerk can only deal with a limited number of customers per hour, and the line grows progressively longer. The same is true of your I/O subsystem: a given disk can process a specific number of I/O operations per second, and anything beyond that starts to queue up. Now think about how great it feels when you reach the front of the store and find that all 20 lines are open. You find your way to the shortest line and you’re headed out the door in no time.
Striping has a similar effect on your I/O system. By creating a single volume from pieces of data on several disks, we can increase the capacity to handle I/O requests in a linear fashion, combining each disk’s I/O bandwidth. When multiple I/O requests for a file on a striped volume are processed, they can be serviced by multiple drives in the volume, because the requests are sub-divided across several disks. This way all drives in the striped volume can engage and service multiple I/O requests more efficiently. This “cohesive and independent” functioning of all the drives in a logical volume is relevant for both read and write operations. It must be noted that striping by itself does not reduce the “response time” for servicing I/O requests. However, it does provide predictable response times and facilitates better performance by balancing I/O requests across multiple drives in the striped volume.
Figure 1 depicts a 4-way striped volume (v1) with 4 disks (1-4). A given stripe of data (Data1) in a file on v1 will be split/striped across the 4 disks, into 4 pieces (Data11-Data14).

Disk1 Disk2 Disk3 Disk4
Data11 Data12 Data13 Data14
Data21 Data22 Data23 Data24

Figure 1
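
To make the layout in Figure 1 concrete, here is a minimal Python sketch of round-robin striping. It is purely illustrative and not tied to any particular volume manager or chunk size:

# A minimal illustrative sketch of the round-robin layout shown in Figure 1:
# logical blocks are dealt out across the member disks, which is what lets
# all disks in the volume service a single file's I/O.

def stripe_location(logical_block, num_disks):
    """Return (disk_index, stripe_row) for a logical block number."""
    return logical_block % num_disks, logical_block // num_disks

# Eight logical blocks on the 4-way striped volume v1 from Figure 1.
for block in range(8):
    disk, row = stripe_location(block, num_disks=4)
    print(f"Data{row + 1}{disk + 1} -> Disk{disk + 1}, stripe row {row + 1}")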

2-What Is Mirroring?
Mirroring is the process of writing the same data to another “member” of the same volume simultaneously. Mirroring provides protection for data by writing exactly the same information to every member in the volume. Additionally, mirroring can enhance read operations because read requests can be serviced from either “member” of the volume. If you have ever made a photocopy of a document before mailing the original, then you have mirrored data. One of the common myths about mirroring is that it takes “twice as long” to write; in many performance measurements and benchmarks, however, the overhead has been observed to be around 15-20%.
Figure 2 illustrates a 4-way striped mirrored volume (v1) with 8 Disks (1-8). A given stripe of data (Data1) in a file on v1 will be split/striped across the Disks (1-4) and then mirrored across Disks (5-8). Disks (1-4) and (5-8) are called “Mirror Members” of the volume v1.

Disk1 Disk2 Disk3 Disk4 Disk5 Disk6 Disk7 Disk8
Data11 Data12 Data13 Data14 Data11 Data12 Data13 Data14
Data21 Data22 Data23 Data24 Data21 Data22 Data23 Data24

Figure 2


3- What Is Parity?
Parity is the term for error checking. Some levels of RAID perform calculations when reading and writing data. The calculations are primarily done on write operations. However, if one or more disks in a volume are unavailable, then depending on the level of RAID, even read operations will require parity operations to rebuild the pieces on the failed disks. Parity is used to determine the write location and validity of each stripe that is written in a striped volume. Parity is implemented on those levels of RAID that do not support mirroring.
Parity algorithms contain Error Correction Code (ECC) capabilities, which calculate parity for a given ‘stripe or chunk’ of data within a RAID volume. The size of a chunk is operating system (OS) and hardware specific. The codes generated by the parity algorithm are used to recreate data in the event of disk failure(s). Because the algorithm can reverse the parity calculation, it can rebuild data lost as a result of disk failures. It is just like solving a math problem where you know the answer (checksum) and one part of the question, e.g. 2 + X = 5, what is X? Of course, X = 3.
Figure 3 depicts a 4-way striped RAID 3 volume with parity – v1 with 5 Disks (1-5). A given stripe of data (Data1) in a file on v1 will be split/striped across the Disks (1-4) and the parity for Data1 will be stored on Disk 5. There are other levels of RAID that store parity differently and those will be covered in the following sections.

Disk1 Disk2 Disk3 Disk4 Disk5
Data11 Data12 Data13 Data14 Parity1
Data21 Data22 Data23 Data24 Parity2

Figure 3
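
The parity arithmetic can be demonstrated with XOR, the operation that parity RAID levels are based on. The following Python sketch is conceptual only; real controllers implement this in hardware with vendor-specific ECC:

# Illustrative sketch of XOR parity. If any single data chunk is lost,
# XOR-ing the parity with the surviving chunks recreates it -- the
# "2 + X = 5, so X = 3" idea from the text.

def xor_chunks(chunks):
    """XOR equal-length byte strings together (the parity calculation)."""
    result = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            result[i] ^= b
    return bytes(result)

# Four data chunks striped across Disks 1-4; parity goes to Disk 5 (Figure 3).
stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_chunks(stripe)

# Simulate losing Disk 3: rebuild its chunk from the survivors plus the parity.
survivors = [stripe[0], stripe[1], stripe[3]]
rebuilt = xor_chunks(survivors + [parity])
assert rebuilt == stripe[2]   # the lost chunk is recovered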

Putting It All Together
Striping yields better I/O performance, mirroring provides protection, and parity (when applicable) is a way to check the work. With these 3 aspects of RAID, we can achieve scalable, protected, highly available I/O performance.



Types of RAID available
Vendors typically offer the following choices for RAID configurations:

RAID0 Striping
Disk striping. RAID0 requires at least two physical disks. Data is read and written across multiple drives, so disk I/O is spread relatively evenly. Writes can occur in a block or streaming manner (similar to non-RAIDed disks) as requested by the operating system. A single disk failure results in lost data. Compared to a single disk drive, RAID0 has the following attributes:
- Better read performance
- Better write performance
- Inexpensive in cost
- Not fault-tolerant
- Storage equivalent to sum of physical drive storage in the array
- Readily available from most vendors

RAID1 Shadowing/Mirroring
Disk mirroring. RAID1 requires two physical disks. Logical writes are done by physically writing the data to both disks simultaneously, and can typically be done in a block manner or streaming manner, as requested by the operating system. Reads can be done using either disk. In the event of a disk failure, data can still be retrieved and written to the surviving disk. Compared to a single disk drive, RAID1 has the following attributes:
- Better read performance
- Similar write performance
- Expensive
- Fault-tolerant
- Storage equivalent to 1/2 the sum of the physical drive storage in the mirrored set.
- Readily available from most vendors

RAID5 Striping with Rotating Parity
Disk striping with parity. RAID5 requires at least three physical disks. On a logical write, a block of data is physically written to disk, parity information is calculated using the block just written plus blocks already existing on disk, then the parity information is written to disk. In RAID5, the parity information is rotated among the physical disks to prevent bottlenecks caused by a dedicated parity disk. Note that writes occur in a block manner regardless of whether the O/S is sending a stream of data to be written or requests to write whole blocks. On a logical read, data is read from multiple disks in a manner very similar to RAID0. In the event of a disk failure, data can be reconstructed on the fly using the parity information. Compared to a single disk drive, RAID5 has the following attributes:
- Data is striped across multiple physical disks and parity data is striped across storage equivalent to one disk
- Better read performance
- Poorer write performance
- Inexpensive
- Fault-tolerant
- Storage equivalent to N - 1 drives, where N is the number of physical drives in the array
- Readily available from most vendors

RAID10 (or RAID0+1)
Mirrored stripe sets. RAID10 requires at least 4 physical drives, and combines the performance gains of RAID0 with the fault-tolerance and expense of RAID1. Data is written simultaneously to two mirrored sets of striped disks in blocks or streams. Reads can be performed against either striped set. In the event of a failure of a disk drive in one striped set, data can be written to and read from the surviving striped set. Compared to a single disk drive, RAID10 has the following attributes:
- Better read performance
- Better write performance
- Expensive
- Fault-tolerant
- Storage is 1/2 of the sum of the physical drives' storage
- Currently available from only a few vendors (at the time of this writing)

Possible configurations using 4 physical disks:

Configuration    Number of disks    Available space    Max Reads/Sec     Max Writes/Sec
Single disk      1                  4 GB               60                60
RAID0            4                  16 GB              240               240
RAID1            4                  8 GB               240 (2 arrays)    120 (2 arrays)
RAID5            4                  12 GB              180               60

Possible configurations using 6 physical disks:

Configuration    Number of disks    Available space    Max Reads/Sec     Max Writes/Sec
Single disk      1                  4 GB               60                60
RAID0            6                  24 GB              360               360
RAID1            6                  12 GB              360 (3 arrays)    180 (3 arrays)
RAID5            6                  20 GB              300               90


As can be seen from the charts, RAID0 offers good read and write performance, but no fault tolerance.
RAID1 offers good read performance, and half as much write performance, but provides fault-tolerance.
RAID5 reclaims most of the space lost to RAID1, provides fault-tolerance, offers reasonably good read performance, but poor write performance. (In fact, RAID5 requires 4 disks to regain the same write performance as a single disk). Also, note that streaming logical writes, as well as block-level logical writes, to RAID5 arrays are handled as block-level physical writes. Finally, read or write workload capacity can be increased in any RAID configuration by adding physical disks.
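
The numbers in the two tables can be reproduced with simple arithmetic. The Python sketch below assumes the same hypothetical drives (4 GB, 60 reads/sec and 60 writes/sec each) and the commonly cited 4-physical-I/O write penalty for RAID5; real throughput will vary with the controller, cache, and workload:

# Back-of-envelope arithmetic behind the two configuration tables above.

DISK_GB, DISK_READS, DISK_WRITES = 4, 60, 60

def raid_estimate(level, n):
    """Return (capacity GB, max reads/sec, max writes/sec) for n disks."""
    if level == "RAID0":
        return n * DISK_GB, n * DISK_READS, n * DISK_WRITES
    if level == "RAID1":                       # n/2 mirrored pairs
        return n // 2 * DISK_GB, n * DISK_READS, n // 2 * DISK_WRITES
    if level == "RAID5":                       # one disk's worth of parity,
        return (n - 1) * DISK_GB, (n - 1) * DISK_READS, n * DISK_WRITES // 4
    raise ValueError(level)

for level in ("RAID0", "RAID1", "RAID5"):
    for n in (4, 6):
        gb, r, w = raid_estimate(level, n)
        print(f"{level}, {n} disks: {gb} GB, ~{r} reads/sec, ~{w} writes/sec")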

The RAID "hierarchy" begins with RAID 0 (striping) and RAID 1 (mirroring). Combining RAID 0 and RAID 1 is called RAID-0+1 or RAID-1+0, depending on how you combine them. (RAID 0+1 is also called RAID-01, and RAID-1+0 is also called RAID-10.) The performance of RAID-10 and RAID-01 are identical, but they have different levels of data integrity.

RAID-01 (or RAID 0+1) is a mirrored pair (RAID-1) made from two stripe sets (RAID-0); hence the name RAID 0+1, because it is created by first creating two RAID-0 sets and then adding RAID-1 on top. If you lose a drive on one side of a RAID-01 array and then lose another drive on the other side before the first side has been recovered, you will suffer complete data loss. It is also important to note that all drives in the surviving mirror are involved in rebuilding the entire damaged stripe set, even if only a single drive was damaged. Performance is severely degraded during recovery unless the RAID subsystem allows adjusting the priority of recovery. However, shifting the priority toward production will lengthen recovery time and increase the risk of the kind of catastrophic data loss mentioned earlier.

RAID-10 (or RAID 1+0) is a stripe set made up from N mirrored pairs. Only the loss of both drives in the same mirrored pair can result in any data loss, and the loss of that particular second drive is 1/Nth as likely as the loss of some drive on the opposite mirror in RAID-01. Recovery involves only the replacement drive and its mirror, so the rest of the array performs at 100% capacity during recovery. Also, since only the single drive needs to be rebuilt, bandwidth requirements during recovery are lower and recovery takes far less time, reducing the risk of catastrophic data loss.
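
A quick back-of-envelope sketch makes the difference in "second failure" exposure clear. It simply counts how many of the surviving drives would cause data loss if they failed before recovery completes, under each layout:

# After one drive has failed in a 2 x (N/2)-drive array, how many of the
# remaining drives are fatal if they fail next?

def exposed_drives(total_drives, layout):
    per_side = total_drives // 2
    if layout == "RAID01":   # any drive in the surviving stripe set is fatal
        return per_side
    if layout == "RAID10":   # only the failed drive's mirror partner is fatal
        return 1
    raise ValueError(layout)

for n in (8, 20):
    print(f"{n} drives: RAID-01 exposes {exposed_drives(n, 'RAID01')}, "
          f"RAID-10 exposes {exposed_drives(n, 'RAID10')}")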

The most appropriate RAID configuration for a specific filesystem or database tablespace must be determined based on data access patterns and cost versus performance tradeoffs. RAID-0 offers no increased reliability. It can, however, supply performance acceleration at no increased storage cost. RAID-1 provides the highest performance for redundant storage, because it does not require read-modify-write cycles to update data, and because multiple copies of data may be used to accelerate read-intensive applications. Unfortunately, RAID-1 requires at least double the disk capacity of RAID-0. On the other hand, when more than two copies of the data exist, RAID-1 arrays can be constructed to endure the loss of multiple disks without interruption. Parity RAID allows redundancy with less total storage cost. The read-modify-write cycle it requires, however, will reduce total throughput for any small write operations (read-only or extremely read-intensive applications are fine). The loss of a single disk will cause read performance to be degraded while the system reads all other disks in the array and recomputes the missing data. Additionally, it does not endure the loss of multiple disks, and the parity information itself cannot be made redundant.

ORACLE database files on RAID
Given the information regarding the advantages and disadvantages of various RAID configurations, how does this information apply to an ORACLE instance? The discussion below will provide information about how database files are used by an ORACLE instance under OLTP and DSS classifications of workload.
Note that the perspectives presented below are very sensitive to the number of users: if your organization has a 10-20 user OLTP system (and thus, a low throughput requirement), then you may get very acceptable performance with all database files stored on RAID5 arrays. On the other hand, if your organization has a 100 user OLTP system (resulting in a higher throughput requirement), then a different RAID configuration may be absolutely necessary. An initial configuration can be outlined by estimating the number of transactions (based on the number of users), performing adjustments to encompass additional activity (such as hot backups, nightly batch jobs, etc.), then performing the necessary mathematical calculations.
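
As a purely hypothetical example of such a calculation, the sketch below turns a user count into a rough I/O-per-second requirement and a minimum drive count. Every input is an assumption that should be replaced with measurements from your own system:

import math

# First-cut sizing estimate: all figures here are placeholders.
users = 100
txns_per_user_per_min = 10     # assumed OLTP transaction rate
ios_per_txn = 8                # assumed physical reads + writes per transaction
headroom = 1.5                 # allowance for hot backups, nightly batch, growth
per_disk_iops = 60             # same per-drive figure used in the tables above

required_iops = users * txns_per_user_per_min / 60 * ios_per_txn * headroom
disks_needed = math.ceil(required_iops / per_disk_iops)
print(f"~{required_iops:.0f} I/Os per second -> at least {disks_needed} drives, "
      "before applying any RAID write penalty")
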
You definitely want to keep rollback segments, temp tablespaces, and redo logs off RAID5, since it is too slow for these write-intensive Oracle files, which are accessed sequentially. Redo logs should have their *own* dedicated drives.

OLTP (On-line transaction processing) workloads
OLTP workloads are characterized by multi-user concurrent INSERTs, UPDATEs, and DELETEs during normal working hours, plus possibly some mixture of nightly batch jobs. Large SELECTs may generate reports, but the reports will typically be "pre-defined" reports rather than ad-hoc queries. The focus, though, is on enabling update activity that completes within an acceptable response time. Ideally, each type of database file would be spread out over its own private disk subsystem, although grouping certain types of files together (when the number of disks, arrays, and controllers is less than ideal) may yield adequate performance. (Please see the article on Instance tuning for information regarding groupings of database files in an OLTP system.)

Redo logs.
During update activity, redo logs are written to in a continuous and sequential manner, and are not read under normal circumstances. RAID5 would be the worst choice for performance. Oracle Corporation recommends placing redo logs on single non-RAIDed disk drives, under the assumption that this configuration provides the best overall performance for simple sequential writes. Redo logs should always be multiplexed at the ORACLE software level, so RAID1 provides few additional benefits. Since non-RAID and RAID0 configurations can vary with hardware from different vendors, the organization should contact their hardware vendor to determine whether non-RAIDed disks or RAID0 arrays will yield the best performance for continuous sequential writes. Note that even if redo logs are placed on RAID1 arrays, they should still be mirrored at the ORACLE level.

Archive logs
As redo logs are filled, archive logs are written to disk one whole file at a time (assuming, of course, that the database is running in archivelog mode), and are not read under normal circumstances. Any RAID or non-RAID configuration could be used, depending upon the performance requirements and size of the redo logs. For instance, if the redo logs are large, then they will become full and be archived less often. If an archive log is likely to be written no more than once per minute, then RAID5 may provide acceptable performance. If RAID5 proves too slow, then a different RAID configuration can be chosen, or the redo logs can simply be made larger. Note that a fault-tolerant configuration is advisable: if the archive log destination becomes unavailable, the database will halt.

Rollback Segments
As modifications are made to the database tables, undo information is written to the buffer cache in memory. Rollback segments are used to maintain commitment control and read consistency. Rollback segment data is periodically flushed to disk by checkpoints, and the changes to the rollback segments are also recorded in the redo logs. However, a smaller amount of information is typically written to the rollback segments than to the redo logs, so the write rate is less stringent. A fault-tolerant configuration is advisable, since the database cannot operate without rollback segments, and recovery of common rollback segments will typically require an instance shutdown. If the transaction rate is reasonably small, RAID5 may provide adequate performance. If it does not, then RAID1 (or RAID10) should be considered.

User tables and indexes
As updates are performed, these changes are stored in memory. Periodically, a checkpoint will flush the changes to disk. Checkpoints occur under two normal circumstances: a redo log switch occurred, or the time interval for a checkpoint expired. (There are a variety of other situations that trigger a checkpoint. Please check the ORACLE documentation for more detail.) Like redo log switches and generation of archive logs, checkpoints can normally be configured so that they occur approximately once per minute. Recovery can be performed up to the most recent checkpoint, so the interval should not be too large for an OLTP system. If the volume of updated data written to disk at each checkpoint is reasonably small (i.e. the transaction rate is not extremely large), then RAID5 may provide acceptable performance. Additionally, analysis should be performed to determine the ratio of reads to writes. Recalling that RAID5 offers reasonably good read performance, if the percentage of reads is much larger than the percentage of writes (for instance, 80% to 20%), then RAID5 may offer acceptable performance for small, medium, and even some large installations. A fault-tolerant configuration is preferable to maximize availability (assuming availability is an objective of the organization), although only failures damaging datafiles for the SYSTEM tablespace (and active rollback segments) require the instance to be shut down. Disk failures damaging datafiles for non-SYSTEM tablespaces can be recovered with the instance on-line, meaning that only the applications using data in tablespaces impacted by the failure will be unavailable. With this in mind, RAID0 could be considered if RAID5 does not provide the necessary performance. If high availability and high performance on a medium to large system are explicit requirements, then RAID1 or RAID10 should be considered.
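
To illustrate why the read/write ratio matters, the following sketch estimates the logical I/O capacity of a RAID5 set for different read mixes, again assuming the 4-physical-I/O write penalty used earlier; a write-back cache will soften the penalty in practice:

# Rough estimate of RAID5 logical I/O capacity as the read mix changes.

def raid5_logical_iops(disks, per_disk_iops, read_fraction):
    """Logical I/Os per second a RAID5 set can absorb for a given read mix."""
    physical_per_logical = read_fraction * 1 + (1 - read_fraction) * 4
    return disks * per_disk_iops / physical_per_logical

for mix in (0.8, 0.5, 0.2):
    iops = raid5_logical_iops(disks=6, per_disk_iops=60, read_fraction=mix)
    print(f"{mix:.0%} reads: ~{iops:.0f} logical I/Os/sec on a 6-disk RAID5 set")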

Temp segments
Sorts too large to be performed in memory are performed on disk. Sort data is written to disk in a block-oriented manner. Sorts do not normally occur with INSERT/UPDATE/DELETE activity. Rather, SELECTs with ORDER BY or GROUP BY clauses and aggregate functions (i.e. operational reports), index rebuilds, etc., will use TEMP segments only if the sort is too large to perform in memory. Temp segments are good candidates for non-RAIDed drives or RAID0 arrays. Fault-tolerance is not critical: if a drive failure occurs and datafiles for a temp segment are lost, then the temp segment can either be recovered in the normal way (restore from tape and perform a tablespace recovery), or the temp segment can simply be dropped and re-created, since there is no permanent data stored in the temp segment. Note that while a temp segment is unavailable, certain reports or index creations may not execute without errors, but update activity will typically not be impacted. With this in mind, RAID1 arrays are somewhat wasted on temp segments and should be reserved for more critical database files. RAID5 will provide adequate performance if the sort area hit ratios are such that very few sorts are performed on disk rather than in memory.

Control files
Control files are critical to the instance operation, as they contain the structural information for the database. Control files are updated periodically (at a checkpoint and at structural changes), but the data written to the control files is a very small quantity compared to other database files. Control files, like redo logs, should be multiplexed at the ORACLE software level onto different drives or arrays. Non-RAIDed drives or any RAID configuration would be acceptable for control files, although most organizations will typically distribute the multiple copies of the control files with the other database files, given that the read and write requirements are so minimal. For control files, maintaining multiple copies in different locations should be favored over any other concern.

Software and static files
The ORACLE software, configuration files, etc. are very good candidates for RAID5 arrays. This information is not constantly updated, so the RAID5 write penalty is of little concern. Fault-tolerance is advisable: if the database software (or O/S software) becomes unavailable due to a disk failure, then the database instance will abort. Also, recovery will include restore or re-installation of ORACLE software (and possible operating system software) as well as restore and recovery of the database files. RAID5 provides the necessary fault-tolerance to prevent this all-inclusive recovery, and good read performance for dynamic loading and unloading of executable components at the operating system level.

DSS (Decision Support System) workloads
In comparison to OLTP systems, DSS or data warehousing systems are characterized primarily by SELECT activity during normal working hours, with batch INSERT, UPDATE, and DELETE activity run on a periodic basis (nightly, weekly, or monthly). There will typically be a large amount of variability in the number of rows accessed by any particular SELECT, and the queries will tend to be of a more ad-hoc nature. The number of users will typically be smaller than on the adjoining OLTP systems (where the data originates). The focus is on enabling SELECT activity that completes within an acceptable response time, while ensuring that the batch update activity still has capacity to complete in its allowable time window. Note that there are now two areas of performance to be concerned about: periodic refreshes and ad-hoc read activity. The general directive in this case should be to configure the database such that read-only activity performed by end users is as good as it can get without rendering refreshes incapable of completion. As with OLTP systems, each type of database file would ideally have its own private disk subsystem (disks, arrays, and controller channel), but with less than ideal resources certain groupings tend to work well for DSS systems. (Please see the article on Instance tuning for information on these groupings.)

Redo logs
Redo logs are only written to while update activity is occurring. In a DSS-oriented system, a significant portion of the data entered interactively during the day may be loaded into the DSS database during only a few hours. Given this characteristic, redo logging may tend to be more of a bottleneck on the periodic refresh processes of a DSS database than on its adjoining OLTP systems. If nightly loads are taking longer than their allotted window, then redo logging should be the first place to look. The same RAID/non-RAID suggestions that apply to redo logging in OLTP systems also apply to DSS systems. As with OLTP systems, redo logs should always be mirrored at the ORACLE software level, even if they are stored on fault-tolerant disk arrays.

Archive logs
Like redo logging, archive logs are only written out during update activity. If the archive log destination appears to be over-loaded with I/O requests, then consider changing the RAID configuration, or simply increase the size of the redo logs. Since there is a large volume of data being entered in a short period of time, it may be very reasonable to make the redo logs for the DSS or data warehouse much larger (10 or more times) than the redo logs used by the OLTP system. A reasonable rule of thumb is to target about one log switch per hour. With this objective met, then the disk configuration and fault-tolerance can be chosen based on the same rules used for OLTP systems.
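
As a worked example of the one-switch-per-hour rule of thumb (with an assumed redo generation rate, not a measured one):

# Worked example of the "one log switch per hour" rule of thumb. The redo
# generation rate below is an assumption; one way to measure your own is to
# sample the "redo size" statistic in V$SYSSTAT across the load window.

redo_mb_per_hour_at_peak = 2048   # assumed: ~2 GB of redo generated per hour
target_switches_per_hour = 1

redo_log_size_mb = redo_mb_per_hour_at_peak / target_switches_per_hour
print(f"Size each redo log at roughly {redo_log_size_mb:.0f} MB "
      "to average about one log switch per hour at peak load")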

Rollback Segments
Again like redo logging, rollback segments will be highly utilized during the periodic refreshes, and virtually unused during the normal work hours. Use the same logic for determining RAID or non-RAID configurations on the DSS database that would be used for the OLTP systems.

User tables and indexes
Writes are done to tablespaces containing data and indexes during periodic refreshes, but during the normal work hours read activity on the tables and indexes will typically far exceed the update work performed on a refresh. A fault-tolerant RAID configuration is suggested to sustain availability. However, in most cases the business can still operate if the DSS system is unavailable for several hours due to a disk failure. Information for strategic decisions may not be available, but orders can still be entered. If the DSS has high availability requirements, select a fault-tolerant disk configuration. If RAID5 arrays can sustain the periodic refresh updates, then RAID5 is typically a reasonably good choice due to its good read performance. As seen above, the read and write workload capacities can be adjusted by adding physical drives to the array.

Temp segments
In a decision support system or data warehouse, expect temp segment usage to be much greater than what would be found in a transaction system. Recalling that temp segments do not store any permanent data and are not absolutely necessary for recovery, RAID0 may be a good choice. Keep in mind, though, that the loss of a large temp segment due to drive failure may render the DSS unusable (unable to perform sorts to answer large queries) until the failed drives are replaced. If availability requirements are high, then a fault-tolerant solution should be selected, or at least considered. If the percentage of sorts on disk is low, then RAID5 may offer acceptable performance; if this percentage is high, RAID1 or RAID10 may be required.

Control files
As with OLTP systems, control files should always be mirrored at the ORACLE software level regardless of any fault-tolerant disk configurations. Since reads and writes to these files are minimal, any disk configuration should be acceptable. Most organizations will typically disperse control files onto different disk arrays and controller cards, along with other database files.

Software and static files
Like OLTP systems, these files should be placed on fault-tolerant disk configurations. Since very little write activity is present, these are again good candidates for RAID5.

Taking the above information into consideration, can an organization run an entire ORACLE database instance on a single RAID5 array? The answer is "yes". Will the organization get a good level of fault-tolerance? Again, the answer is "yes". Will the organization get acceptable performance? The answer is "it depends". This dependency includes the type of workload, the number of users, the throughput requirements, and a whole host of other variables. If the organization has an extremely limited budget, then it can always start with a single RAID5 array, perform the necessary analysis to see where improvement is needed, and proceed to correct the deficiencies.

Summary

RAID 1 is mirroring - blocks are written simultaneously to both members of the RAID set, so if one member fails, the other still has a full set of data.

RAID 5 is striping with distributed parity - data blocks are written in a stripe across all except one of the disks and a parity block is written to the last disk. Each data write is written to a different set of n-1 disks so that the data and parity are scattered equally amongst the drives.

RAID 1 gives a marginal read speed increase with no real write overhead. RAID 5 gives quite a high speed increase for reads but incurs a high overhead for write operations (normally the RAID5 controller will have a write-back cache attached to alleviate this).

For high write volumes RAID5 is far from ideal - it is much better suited to low-write, high-read workloads. Don't put actively written files on a RAID5 set - especially things like redo logs!

To really tune the I/Os you need to know the chunk size of your RAID controller and tweak db_file_multiblock_read_count appropriately.
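
As a sketch of that alignment arithmetic (both sizes below are assumptions, not recommendations):

# Choose a value of db_file_multiblock_read_count so that one multiblock read
# lines up with the RAID chunk size. Use your controller's actual chunk size
# and your database's actual block size.

raid_chunk_size_kb = 128   # assumed controller stripe/chunk size
db_block_size_kb = 8       # assumed Oracle block size

db_file_multiblock_read_count = raid_chunk_size_kb // db_block_size_kb
print(f"db_file_multiblock_read_count = {db_file_multiblock_read_count}")   # -> 16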


Distribution of Oracle Files
The following table shows what Oracle suggests for RAID usage:

RAID    Type of RAID                     Control File    Database File                    Redo Log File    Archive Log File
0       Striping                         Avoid           OK                               Avoid            Avoid
1       Shadowing                        Best            OK                               Best             Best
0+1     Striping and Shadowing           OK              Best                             Avoid            Avoid
3       Striping with static parity      OK              OK                               Avoid            Avoid
5       Striping with rotating parity    OK              Best if RAID0-1 not available    Avoid            Avoid


