Wednesday, April 29, 2009

All about Statistics

Introduction to Statistics
The CBO makes its explain-plan decisions based on statistics.
Statistics provide critical input for the CBO to work properly; these include information such as the number of rows in a table, the number of distinct values in a column, the number of leaf blocks in an index, etc.
The more accurate the statistics, the more efficient the results provided by Optimizer.

Statistics may be exact or estimated:
  • Statistics generated with the COMPUTE clause analyze all of the data in the object. This gives the optimizer accurate information to work with and arrive at a good execution plan.
  • Statistics generated with the ESTIMATE clause analyze data in the object up to the sample size mentioned. The sample size may be specified as the number of rows or the percentage of rows that should be randomly analyzed to generate the statistics. Optionally, block sampling may also be specified.
Statistics are stored in data dictionary tables owned by the SYS user. The following views display the statistics collected for tables, columns and indexes.

For Tables: DBA_TABLES
NUM_ROWS - Number of rows
BLOCKS - Number of blocks below the high-water mark (used or not)
EMPTY_BLOCKS - Number of empty blocks
AVG_SPACE - Average free space (in bytes) in blocks allocated to the table. All empty and free blocks are considered for this
CHAIN_CNT - Number of chained rows (used by ANALYZE command only)
AVG_ROW_LEN - Average row length in bytes
LAST_ANALYZED - Date when the table was last analyzed
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to NUM_ROWS if COMPUTE
GLOBAL_STATS - For partitioned tables, YES - statistics collected for the table as a whole, NO - global statistics are derived from partition-level statistics
USER_STATS - Set to YES if user has explicitly set the statistics for the table
Statistics for individual partitions of a table can be seen from DBA_TAB_PARTITIONS. Cluster statistics are available from DBA_CLUSTERS.
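As a quick sanity check, these table-level statistics can be queried directly. A minimal sketch (the SCOTT schema is just an example):

```sql
-- Inspect the optimizer statistics currently recorded for a schema's tables
SELECT table_name, num_rows, blocks, empty_blocks,
       avg_row_len, sample_size, last_analyzed
FROM   dba_tables
WHERE  owner = 'SCOTT'
ORDER  BY table_name;
```

A NULL LAST_ANALYZED here is the quickest way to spot tables that have never been analyzed.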

For Indexes: DBA_INDEXES
BLEVEL - Depth of the index, from root to leaf
LEAF_BLOCKS - Number of leaf blocks
DISTINCT_KEYS - Number of distinct index values
AVG_LEAF_BLOCKS_PER_KEY - Average number of leaf blocks in which each distinct key appears, should be 1 for unique indexes
AVG_DATA_BLOCKS_PER_KEY - Average number of blocks in the table that are pointed to by a distinct key
CLUSTERING_FACTOR - A count that indicates how well ordered the table is with respect to the index. If the count is close to the number of blocks, the table is well ordered: entries in a single leaf block tend to point to rows in the same table blocks. If it is close to the number of rows, the table is randomly ordered: entries in a single leaf block point to rows spread across many blocks
NUM_ROWS - Number of rows indexed
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to NUM_ROWS if COMPUTE
LAST_ANALYZED - Date when the index was last analyzed
GLOBAL_STATS - For partitioned indexes, YES - statistics collected for the index as a whole, NO - global statistics are derived from partition-level statistics
USER_STATS - Set to YES if user has explicitly set the statistics for the index
PCT_DIRECT_ACCESS - For secondary indexes on IOTs, percentage of rows with valid guess
Statistics for individual partitions of indexes can be seen from DBA_IND_PARTITIONS and DBA_IND_SUBPARTITIONS.

For Columns: DBA_TAB_COLUMNS
NUM_DISTINCT - Number of distinct values
LOW_VALUE - Lowest value
HIGH_VALUE - Highest value
DENSITY - Density of the column
NUM_NULLS - Number of records with null value for the concerned column
NUM_BUCKETS - Number of buckets in histograms. Refer Histograms section
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to total rows if COMPUTE
LAST_ANALYZED - Date when the table was last analyzed
DBA_TAB_COL_STATISTICS shows similar data. Partitioned Table column statistics can be viewed from DBA_PART_COL_STATISTICS and DBA_SUBPART_COL_STATISTICS


Important Initialization parameters that affect the CBO
Apart from generating statistics, the setup of the parameters mentioned below plays a very crucial role in dictating how your system will work. Setting these depends a lot on what kind of environment you want to create: online, batch processing, data warehousing, or a combination of more than one.

1) OPTIMIZER_MODE
This decides the mode in which the optimizer engine runs. Valid values are RULE, CHOOSE, ALL_ROWS and FIRST_ROWS(_n).
Alternatively, OPTIMIZER_MODE can be set to CHOOSE, an intermediate option between RBO and CBO: the optimizer runs the query under CBO or RBO depending on the availability of statistics. If the tables present in the query have statistics generated on them, CBO (ALL_ROWS only) is used; otherwise the query falls back to RBO.
e.g.: optimizer_mode = first_rows
For an OLTP system, FIRST_ROWS would be the ideal option for fast response time. ALL_ROWS is meant for batch-oriented applications. Note that a plan producing the first n rows with the fastest response time might not be optimal if the requirement is to obtain the entire result set, so decide based on the needs of the application.
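The mode can also be switched per session, or per statement with a hint, rather than instance-wide. A small sketch:

```sql
-- Session-level override: favor fast response for the first rows
ALTER SESSION SET optimizer_mode = FIRST_ROWS;

-- Statement-level override via a hint, regardless of the session setting
SELECT /*+ ALL_ROWS */ * FROM emp;
```

This is handy for testing how a batch query behaves under ALL_ROWS without touching the init.ora of an OLTP instance.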

2) OPTIMIZER_INDEX_COST_ADJ
OPTIMIZER_INDEX_COST_ADJ takes a value between 1 and 10000 (default 100). Setting this value makes indexes look more or less expensive to the optimizer. For example, setting it to 50 tells the optimizer that an execution plan using indexes is half as expensive as one that does not use them. The lower the value (below 100), the fewer full table scans will take place in the system.
Use of indexes in OLTP is a crucial factor for deciding the response time, but sometimes, full table scans provide better throughput. Nested loop joins are influenced by this parameter value. Set this parameter to a value that is optimal for your application. I recommend setting this parameter between 10 and 20 for OLTP and around 50 for DSS systems.
If you are having slow performance because the CBO first_rows optimizer mode is favoring too many full-table scans, you can reset the optimizer_index_cost_adj parameter to immediately tune all of the SQL in your database to favor index scans over full-table scans. This is a "silver bullet" that can improve the performance of an entire database in cases where the database is OLTP and you have verified that the full-table scan costing is too low.

3) OPTIMIZER_INDEX_CACHING
This tells the optimizer to favor nested loops and IN-list iterators over sort-merge and hash joins. The value indicates the percentage of index blocks that the optimizer should assume are present in the cache; it ranges from 0 (the default) to 100, and higher values make nested loops and the IN-list iterator look less expensive and more appealing to the optimizer.
I recommend setting this parameter to 85.

4) DB_FILE_MULTIBLOCK_READ_COUNT
This parameter determines the number of database blocks read in one input/output operation during a full table scan. The value set is used in computing the cost of full table scans. OLTP systems will not benefit from a high value for this parameter, as it would reduce the computed cost of full table scan execution plans. The maximum size is operating system dependent.
e.g.: db_file_multiblock_read_count = 1 (may be set to a higher value)

5) SORT_AREA_SIZE
This parameter defines the maximum memory space allocated per user process that requires sorting of data, and for inserts and updates to bitmap indexes. The optimizer uses the set value for determining the cost of sorting in a query. This defaults to 64K. Normally a size of 64K to 1M is appropriate for OLTP systems.
Sort-merge joins are influenced by this parameter value. The bigger the size the more appealing will be sort-merge joins over nested-loop and hash joins. The lower the size the more sorting will take place in temporary segments. So assign a value that has been evaluated for your setup.
Oracle recommends the use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.

6) SORT_MULTIBLOCK_READ_COUNT
This specifies the number of database blocks to be read each time a sort performs a read from a temporary segment. The default value is 2. General recommendation is to keep the default value. However, if you intend to hike up the SORT_AREA_SIZE considerably, you may evaluate the improvement by changing this parameter.
e.g.: sort_multiblock_read_count = 2

7) HASH_JOIN_ENABLED
Hash joins are available only in CBO. In hash joins, a hash table is created on the join key of the smaller table; the other tables are then probed to find the matches. Hash joins may prove to be faster than other types of joins in some conditions, especially when an index is missing or the search criteria are not very selective. Hash joins require a large amount of memory as the hash tables are retained; this may sometimes result in memory swapping.
Nested-loop joins return the first row faster than sort-merge and hash joins and are preferred for OLTP, but other types of joins cannot be ignored for running other aspects of the applications.
e.g.: hash_join_enabled = true

8) HASH_AREA_SIZE
This specifies the maximum amount of memory in bytes to be used for a hash join per process. It is defaulted to 2 times SORT_AREA_SIZE.
Oracle recommends the use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.
e.g.: hash_area_size = 2097152
Setting this to a very low number may sometimes result in the following error.
ORA-06580: Hash Join ran out of memory while keeping large rows in memory.

9) HASH_MULTIBLOCK_IO_COUNT
This specifies how many sequential blocks a hash join reads and writes in one input/output activity. Oracle recommends not changing or assigning a value to this parameter; this lets Oracle decide on the appropriate value for each individual query. In such cases, the value of the parameter will appear as 0 in the V$PARAMETER view.
This parameter is renamed to _HASH_MULTIBLOCK_IO_COUNT as of Oracle 9i.
e.g.: hash_multi_block_io_count = 0

10) BITMAP_MERGE_AREA_SIZE
This parameter is relevant for systems using bitmap indexes. It specifies the amount of memory Oracle uses to merge bitmaps retrieved from a range scan of a bitmap index. The default value is 1 MB, which is considered sufficient for most setups.
Oracle recommends use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.
e.g.: bitmap_merge_area_size = 1048576

11) QUERY_REWRITE_ENABLED
This parameter is relevant for systems using Materialized views, Function based indexes and stored outlines. Setting this parameter enables query rewrite for the database. The materialized views should also be query rewrite enabled to allow the optimizer to redirect a query to it rather than the actual tables listed in the FROM clause. Query rewriting is internally done by the optimizer based on what results are expected and whether these are already present in materialized view form.
e.g.: query_rewrite_enabled = true

12) QUERY_REWRITE_INTEGRITY
This parameter determines the extent to which the optimizer must enforce query rewriting to materialized views. This determines the accuracy of the query rewrite. It can be set to ENFORCED, TRUSTED or STALE_TOLERATED. The ENFORCED option uses materialized views only if they contain fresh data, thus guaranteeing the integrity and consistency of data. TRUSTED uses relationships that have been defined in the system for rewriting the query. STALE_TOLERATED allows the use of relationships that are not enforced and may use stale data. For OLTP systems, ENFORCED would be the ideal setup, as reports need to be up-to-date.
e.g.: query_rewrite_integrity = enforced

13) ALWAYS_ANTI_JOIN
This parameter specifies the join method for anti-joins, for example when a NOT IN operator is present in your query. It can be set to NESTED_LOOPS, MERGE or HASH. It is defaulted to NESTED_LOOPS in Oracle 8i and CHOOSE in Oracle 9i.
This parameter is renamed to _ALWAYS_ANTI_JOIN as of Oracle 9i.
e.g.: always_anti_join = nested_loops

14) ALWAYS_SEMI_JOIN
This parameter specifies the join method for semi-joins. These types of joins are carried out by Optimizer after transforming a query. In such joins, duplicate values from the inner table are removed and then the type of join specified in the parameter is used to perform a semi-join. It can be set to NESTED_LOOPS, MERGE or HASH. In Oracle 8i, it is defaulted to STANDARD and in Oracle 9i it is defaulted to CHOOSE, to pick up an appropriate join.
This parameter is renamed to _ALWAYS_SEMI_JOIN as of Oracle 9i.
e.g.: always_semi_join = nested_loops

15) STAR_TRANSFORMATION_ENABLED
This specifies whether query transformation will be applied to star queries. It can be set to TRUE, FALSE or TEMP_DISABLE (transformation will take place but will not use temporary tables). I presently set it to FALSE due to some known issues of sub-optimal queries being generated. If you intend to use this, please upgrade your version to 8.1.7.4 and above.
e.g.: star_transformation_enabled = false

16) PARALLEL_BROADCAST_ENABLED
This parameter refers to parallel executions in cluster databases. It is meant for improving hash and sort-merge join operations where a very large result set is joined with a very small result set. When this option is enabled, the optimizer broadcasts a copy of all rows in the smaller result set to all cluster databases that are processing some rows of the larger result set.
It is obsolete in release 9.2.0.
e.g.: parallel_broadcast_enabled = false

17) OPTIMIZER_DYNAMIC_SAMPLING
This parameter was introduced in release 9i. It is meant for situations where tables are not analyzed. As the CBO depends heavily on statistics, this parameter tells the optimizer to sample the unanalyzed tables that are being used in a query. A level of 0 to 10 can be specified; the higher the value, the more time the optimizer spends in sampling.
e.g.: optimizer_dynamic_sampling = 1

18) PARTITION_VIEW_ENABLED
This parameter is meant for backward compatibility to support partition views. Oracle recommends use of partition tables rather than partition views. If you are migrating to CBO, chances are that you may not be using partition views.
e.g.: partition_view_enabled = false

19) CURSOR_SHARING
This parameter determines what kind of SQL statements can share the same cursor. It can be set to FORCE, SIMILAR or EXACT. FORCE will try to squeeze statements that may differ in some literals to share the same cursor. SIMILAR is somewhat the same but will try to maintain the plan optimization for identical statements. EXACT allows statements with exact identical text to share a cursor.
Using FORCE may sometimes result in unexpected results.
e.g.: cursor_sharing = exact

20) PGA_AGGREGATE_TARGET
Introduced in Oracle 9i, this parameter specifies the aggregate PGA memory available to all server processes attached to an instance. This parameter can be set for automatic sizing of SQL working areas. It replaces other existing parameters like SORT_AREA_SIZE, BITMAP_MERGE_AREA_SIZE and HASH_AREA_SIZE.
It can be set to a value between 10 MB and 4000 GB, depending on the setup requirements.
Make a first estimate for PGA_AGGREGATE_TARGET based on the following rule:
PGA_AGGREGATE_TARGET = (Total Physical Memory * 0.16) - For OLTP systems (16% of server physical memory)
PGA_AGGREGATE_TARGET = (Total Physical Memory * 0.4) - For DSS systems (40% of server physical memory)
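For example, on a server with 4 GB of physical memory dedicated to an OLTP instance, the rule gives roughly 4096 MB * 0.16 ≈ 655 MB. A sketch (the 640M figure is illustrative):

```sql
-- OLTP estimate: ~16% of 4 GB physical memory
ALTER SYSTEM SET pga_aggregate_target = 640M;
```

With this set, the *_AREA_SIZE parameters are ignored for dedicated sessions and work areas are sized automatically.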



Gathering Statistics for Tables
Oracle strongly recommends that you use the DBMS_STATS package rather than ANALYZE to collect optimizer statistics. That package lets you collect statistics in parallel, collect global statistics for partitioned objects, and fine tune your statistics collection in other ways. Further, the cost-based optimizer will eventually use only statistics that have been collected by DBMS_STATS.

However, you must use the ANALYZE statement rather than DBMS_STATS for statistics collection not related to the cost-based optimizer, such as:
* To use the VALIDATE or LIST CHAINED ROWS clauses
* To collect information on freelist blocks

In order to use the DBMS_STATS Package, the user needs to get the execute privilege. Run the following as SYS:
grant execute on DBMS_STATS to user_xyz;

DBMS_STATS.GATHER_TABLE_STATS gathers statistics for a table and its columns, and optionally the associated indexes. Syntax:
dbms_stats.gather_table_stats(ownname, tabname, partname, estimate_percent, block_sample,
method_opt, degree, granularity, cascade, stattab, statid, statown);

The first two parameters are mandatory, the rest are defaulted to a value.
* ownname - owner
* tabname - table name
* partname - partition name
* estimate_percent - sample percent ratio
* block_sample - consider random blocks sampling rather than rows sampling. TRUE/FALSE
* method_opt - method options. FOR ALL COLUMNS/FOR ALL INDEXED COLUMNS. Append the phrase SIZE 1 if it is required to generate statistics in parallel.
* degree - degree of parallelism.
* granularity - for partitioned tables. DEFAULT/SUBPARTITION/PARTITION/GLOBAL/ALL.
* cascade - gather statistics for indexes also. TRUE/FALSE
* stattab, statid, statown - required for user statistics.

Collect statistics for a table and its indexes:
EXEC DBMS_STATS.gather_table_stats(USER, 'LOOKUP', cascade => TRUE);

Collect stats for a table, estimating 20%, with parallelism
execute dbms_stats.gather_table_stats (ownname => 'SCOTT', tabname => 'DEPT', partname => null, estimate_percent => 20, degree => 5, cascade => true);

Estimate statistics for columns in a table, this will also generate statistics for tables.
exec dbms_stats.gather_table_stats(ownname => 'SCOTT', tabname => 'DEPT', estimate_percent => 5, method_opt => 'FOR ALL COLUMNS');

SQL Source - Dynamic Method for TABLES
DECLARE
sql_stmt VARCHAR2(1024);
BEGIN
FOR tab_rec IN (SELECT owner,table_name
FROM all_tables WHERE owner like UPPER('&1') ) LOOP
sql_stmt := 'BEGIN dbms_stats.gather_table_stats(ownname => :1, tabname => :2, partname => null, estimate_percent => 10, degree => 3 ,cascade => true); END;' ;
EXECUTE IMMEDIATE sql_stmt USING tab_rec.owner, tab_rec.table_name ;
END LOOP;
END;
/

Collecting Stats in a Procedure
CREATE OR REPLACE PROCEDURE analyze_any_schema ( p_inOwner IN all_users.username%TYPE)
IS
BEGIN
FOR v_tabs IN (SELECT owner, table_name
FROM all_tables
WHERE owner = p_inOwner
AND temporary <> 'Y')
LOOP
DBMS_OUTPUT.put_line ('EXEC DBMS_STATS.gather_table_stats('''||v_tabs.owner||
''','''||v_tabs.table_name||''',NULL,1);' );
BEGIN
DBMS_STATS.gather_table_stats(v_tabs.owner,v_tabs.table_name,NULL,1);
DBMS_OUTPUT.put_line ('Analyzed '||v_tabs.owner||'.'||v_tabs.table_name||'... ');
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.put_line ('Exception on analysis of '||v_tabs.table_name||'!');
DBMS_OUTPUT.put_line (SUBSTR(SQLERRM,1,255));
END;
END LOOP;
END;
/
CREATE OR REPLACE Procedure DB_Maintenance_Weekly is
sql_stmt varchar2(1024);
v_sess_user varchar2(30);
BEGIN
select sys_context('USERENV','SESSION_USER') into v_sess_user
from dual ;
--Analyze all Tables
FOR tab_rec IN (SELECT table_name
FROM all_tables
WHERE owner = v_sess_user
and table_name not like 'TEMP_%') LOOP
sql_stmt := 'BEGIN dbms_stats.gather_table_stats
(ownname => :1,
tabname => :2,
partname => null,
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
degree => 3 ,
cascade => true); END;' ;
EXECUTE IMMEDIATE sql_stmt USING v_sess_user, tab_rec.table_name ;
END LOOP;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL ;
end;
/

Force Statistics to a Table
You can use the following sentence to force statistics to a Table:
exec dbms_stats.set_table_stats( user, 'EMP', numrows => 1000000, numblks => 300000 );

No statistics are collected for Global Temporary Tables; handle these with care. Make use of hints to drive queries on these. Provide explicit statistics by using DBMS_STATS.SET_TABLE_STATS if your temporary table is used for processing huge amounts of data.
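For instance, if a global temporary table typically holds around 100,000 rows during processing, representative statistics can be seeded by hand. A sketch (the table name and figures are illustrative):

```sql
BEGIN
  -- GTT_ORDERS is a hypothetical global temporary table;
  -- the row/block counts reflect its typical working volume
  dbms_stats.set_table_stats(ownname => user,
                             tabname => 'GTT_ORDERS',
                             numrows => 100000,
                             numblks => 2000,
                             avgrlen => 120);
END;
/
```

The optimizer then costs queries on the temporary table as if it held that volume, instead of assuming it is empty.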

How often to collect stats?
When there is a 10-20% change in data, the general convention is to generate fresh statistics. You can start off with a general rule of estimating statistics on a weekly basis. If the tables are giving really bad hits because of heavy activity, you may consider using the DML Monitoring option to update statistics every few hours for such tables. Statistics are not incremental and are regenerated every time. If there is no considerable change in data, there is no advantage in generating statistics too frequently.

DML Monitoring
Used by dbms_stats to identify objects with "stale" statistics
- On by default in 10g, not in 9i
alter table <table_name> monitoring;
- For the Full Database:
dbms_stats.alter_database_tab_monitoring(monitoring=>true, sysobjs=>false);

- Tracked in [DBA|ALL|USER]_TAB_MODIFICATIONS
- 9i and 10g use 10% change as the threshold to gather stats
In Oracle 10g, Oracle automatically gathers index statistics whenever the index is created or rebuilt.
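Once monitoring is on, the tracked DML can be inspected before deciding whether to regather. A sketch (flush first so recent in-memory counts are visible in the view):

```sql
-- Push in-memory monitoring counters into the dictionary
exec dbms_stats.flush_database_monitoring_info;

-- Review accumulated changes per table since the last gather
SELECT table_name, inserts, updates, deletes, truncated, timestamp
FROM   user_tab_modifications
ORDER  BY timestamp;
```

Tables whose combined inserts/updates/deletes exceed roughly 10% of NUM_ROWS are the ones GATHER STALE will pick up.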


Gathering statistics for a Schema, a Database and System Stats
Collect statistics for my Schema (All the tables and Indexes), estimating 10%
execute dbms_stats.gather_schema_stats(ownname => 'SCOTT', estimate_percent => 10, degree => 5, cascade => true);

Collect statistics for the Database:
execute dbms_stats.gather_database_stats(estimate_percent => 20, degree => 5, cascade => true);

Generating System Statistics
Gather statistics during the day. Gathering ends after 720 minutes and is stored in the mystats table:
exec DBMS_STATS.GATHER_SYSTEM_STATS(gathering_mode => 'interval', interval => 720, stattab => 'mystats', statid => 'OLTP');

Gather statistics during the night. Gathering ends after 720 minutes and is stored in the mystats table:
DBMS_STATS.GATHER_SYSTEM_STATS(gathering_mode => 'interval', interval => 720, stattab => 'mystats', statid => 'OLAP');

If appropriate, you can switch between the statistics gathered. It is possible to automate this process by submitting a job to update the dictionary with appropriate statistics. During the day, the following jobs import the OLTP statistics for the daytime run:
VARIABLE jobno number;
BEGIN
DBMS_JOB.SUBMIT(:jobno, 'DBMS_STATS.IMPORT_SYSTEM_STATS(''mystats'',''OLTP'');', SYSDATE, 'SYSDATE + 1');
COMMIT;
END;
/

During the night, the following jobs import the OLAP statistics for the night time run:
BEGIN
DBMS_JOB.SUBMIT(:jobno, 'DBMS_STATS.IMPORT_SYSTEM_STATS(''mystats'',''OLAP'');', SYSDATE + 0.5, 'SYSDATE + 1');
COMMIT;
END;
/

STATISTICS FOR THE DATA DICTIONARY
New in Oracle Database 10g is the ability to gather statistics for the data dictionary. The objective is to enhance the performance of queries. There are two basic types of dictionary base tables.
The statistics for normal base tables are gathered using GATHER_DICTIONARY_STATS. They may also be gathered using GATHER_SCHEMA_STATS for the SYS schema. Oracle recommends gathering these statistics at a similar frequency as your other database objects.
Statistics for fixed objects (the V$ views on the X$ tables) are gathered using the GATHER_FIXED_OBJECT_STATS procedure. The initial collection of these statistics is normally sufficient. Repeat only if workload characteristics have changed dramatically. The SYSDBA privilege or ANALYZE ANY DICTIONARY and ANALYZE ANY privileges are required to execute the procedures for gathering data dictionary statistics.


The OPTIONS parameter
There are several values for the options parameter that we need to know about:
- gather - re-analyzes the whole schema.
- gather empty - Only analyze tables that have no existing statistics.
- gather stale - Only re-analyze tables with more than 10% modifications (inserts, updates, deletes). The table should be in monitor status first.
- gather auto - This will re-analyze objects which currently have no statistics and objects with stale statistics. The table should be in monitor status first.
Using gather auto is like combining gather stale and gather empty.
Note that both gather stale and gather auto require monitoring. If you issue the "alter table xxx monitoring" command, Oracle tracks changed tables with the dba_tab_modifications view.
The most interesting of these options is gather stale. Because statistics become stale quickly in a busy OLTP database, remember the rule for gather stale: more than 10% row change (based on num_rows at statistics collection time). Hence, in such a database almost every table except read-only tables will be re-analyzed, which means the gather stale option is most selective on systems that are largely read-only. For example, if only 5% of the database tables get significant updates, then only 5% of the tables will be re-analyzed with the "gather stale" option.
The CASCADE => TRUE option causes all indexes for the tables to be analyzed as well. In Oracle 10g, set CASCADE to DBMS_STATS.AUTO_CASCADE to let Oracle decide whether or not new index statistics are needed.
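A weekly schema job using these options might look like the following sketch (the schema name is illustrative):

```sql
-- Re-analyze only monitored tables with >10% change since the last gather,
-- and cascade to their indexes
exec dbms_stats.gather_schema_stats(ownname => 'SCOTT', -
                                    options => 'GATHER STALE', -
                                    cascade => true);
```

On a mostly read-only schema this finishes quickly, since unchanged tables are skipped entirely.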


Important Notes
  • You should analyze the table after creating a function-based index, to allow Oracle to collect column statistics equivalent information for the expression. Optionally, you can collect histograms for the index expressions by specifying FOR ALL HIDDEN COLUMNS SIZE number_of_buckets in the METHOD_OPT argument to the DBMS_STATS procedures.
  • Oracle can gather some statistics automatically while creating or rebuilding a B-tree or bitmap index. The COMPUTE STATISTICS option of CREATE INDEX or ALTER INDEX ... REBUILD enables this gathering of statistics.
  • Oracle recommends setting the ESTIMATE_PERCENT parameter of the DBMS_STATS gathering procedures to DBMS_STATS.AUTO_SAMPLE_SIZE to maximize performance gains while achieving necessary statistical accuracy. AUTO_SAMPLE_SIZE lets Oracle determine the best sample size for good statistics.
  • By default DBMS_STATS will generate statistics for the table and not its indexes (by default CASCADE => FALSE).
  • If you have a big table (around 2M rows), try to ESTIMATE statistics. In general, a sample size of 5% to 10% generates adequate statistics!
  • A table that is in the process of being analyzed cannot undergo DDL changes, but DML activities can carry on. Analyzing an index puts a shared lock on the related table; hence, neither DML nor DDL can be performed. Preferably, avoid all activities during the statistics generation phase.
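Tying the first two notes together, a follow-up gather after creating a function-based index might look like this sketch (the index and column names are illustrative):

```sql
-- Hypothetical function-based index on an expression
CREATE INDEX emp_upper_ename_ix ON emp (UPPER(ename));

-- Collect statistics, including histograms on the hidden expression column
exec dbms_stats.gather_table_stats(user, 'EMP', -
     method_opt => 'FOR ALL HIDDEN COLUMNS SIZE 10', -
     cascade => true);
```

Without this gather, the optimizer has no column statistics for UPPER(ename) and may misestimate the selectivity of predicates on the expression.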



Verifying Table Statistics
Syntax:
dbms_stats.get_table_stats(ownname, tabname, partname, stattab, statid, numrows, numblks, avgrlen, statown);

declare
l_numrows number;
l_numblks number;
l_avgrlen number;
begin
dbms_stats.get_table_stats(ownname => 'SYS', tabname => 'AM01', numrows => l_numrows, numblks => l_numblks, avgrlen => l_avgrlen);
dbms_output.put_line('No. of rows: ' || l_numrows);
dbms_output.put_line('No. of blks: ' || l_numblks);
dbms_output.put_line('Avg row length: ' || l_avgrlen);
end;
/
No. of rows: 4106860
No. of blks: 6219
Avg row length: 3


Verifying Index Statistics

SELECT INDEX_NAME "NAME", NUM_ROWS, DISTINCT_KEYS "DISTINCT",
LEAF_BLOCKS, CLUSTERING_FACTOR "CF", BLEVEL "LEVEL",
AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY"
FROM DBA_INDEXES
WHERE OWNER = 'SH'
ORDER BY INDEX_NAME;

Typical output is:

NAME                       NUM_ROWS DISTINCT LEAF_BLOCKS      CF   LEVEL   ALFBPKEY
-------------------------- -------- -------- ----------- ------- ------- ----------
CUSTOMERS_PK                  50000    50000         454    4405       2          1
PRODUCTS_PK                   10000    10000          90    1552       1          1
PRODUCTS_PROD_CAT_IX          10000        4          99    4422       1         24
PRODUCTS_PROD_SUBCAT_IX       10000       37         170    6148       2          4
SALES_PROD_BIX                 6287      909        1480    6287       1          1
SALES_PROMO_BIX                4727      459         570    4727       1          1

declare
l_numrows number;
l_numlblks number;
l_numdist number;
l_avglblk number;
l_avgdblk number;
l_clstfct number;
l_indlevel number;
begin
dbms_stats.get_index_stats(ownname => 'SYS', indname => 'AM01_N1',
numrows => l_numrows, numlblks => l_numlblks,
numdist => l_numdist, avglblk => l_avglblk,
avgdblk => l_avgdblk, clstfct => l_clstfct,
indlevel => l_indlevel);
dbms_output.put_line('No. of rows: ' || l_numrows);
dbms_output.put_line('No. of blks: ' || l_numlblks);
dbms_output.put_line('No. of distinct values: ' || l_numdist);
dbms_output.put_line('Avg leaf blocks for distinct keys: ' || l_avglblk);
dbms_output.put_line('Avg data blocks pointed to in the table: ' || l_avgdblk);
dbms_output.put_line('Clustering factor: ' || l_clstfct);
dbms_output.put_line('Index height: ' || l_indlevel);
end;
/
No. of rows: 3819711
No. of blks: 11092
No. of distinct values: 1
Avg leaf blocks for distinct keys: 11092
Avg data blocks pointed to in the table: 14616
Clustering factor: 14616
Index height: 2


Verifying Column Statistics
SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, NUM_BUCKETS, DENSITY
FROM DBA_TAB_COL_STATISTICS
WHERE TABLE_NAME = 'PA_EXPENDITURE_ITEMS_ALL'
ORDER BY COLUMN_NAME;

This returns the following data:

COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS    DENSITY
------------------------------ ------------ ---------- ----------- ----------
BURDEN_COST                            4300      71957           1 .000232558
BURDEN_COST_RATE                        675    7376401           1 .001481481
CONVERTED_FLAG                            1   16793903           1          1  (poor)
COST_BURDEN_DISTRIBUTED_FLAG              2      15796           1         .5  (poor)
COST_DISTRIBUTED_FLAG                     2          0           1         .5  (poor)
COST_IND_COMPILED_SET_ID                 87    6153143           1 .011494253
EXPENDITURE_ID                      1171831          0           1 8.5337E-07  (good)
TASK_ID                                8648          0           1 .000115634
TRANSFERRED_FROM_EXP_ITEM_ID        1233787   15568891           1 8.1051E-07  (good)


declare
l_distcnt number;
l_density number;
l_nullcnt number;
l_srec dbms_stats.statrec;
l_avgclen number;
begin
dbms_stats.get_column_stats(ownname => 'SYS', tabname => 'AM01',
colname => 'COL1', distcnt => l_distcnt, density => l_density,
nullcnt => l_nullcnt, srec => l_srec, avgclen => l_avgclen);
dbms_output.put_line('No. of distinct values: ' || l_distcnt);
dbms_output.put_line('Density: ' || l_density);
dbms_output.put_line('Count of nulls: ' || l_nullcnt);
dbms_output.put_line('Avg. column length: ' || l_avgclen);
end;
/
No. of distinct values: 2
Density: .5
Count of nulls: 0
Avg. column length: 3


Exporting and importing statistics with DBMS_STATS

DBMS_STATS also includes routines for gathering statistics and storing them outside the dictionary. This does not influence the optimizer. Most of the procedures in this package as mentioned above have three common parameters - STATID, STATTAB and STATOWN that are related to user processing of statistics.

Advantages of this feature:

1. Estimated statistics at different percentages could be stored and used for testing.

2. Statistics generated on one database could be transferred to another database.
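A typical round trip looks like the following sketch (the statistics table name, STATID and table name are illustrative):

```sql
-- 1. Create a user statistics table to hold exported stats
exec dbms_stats.create_stat_table(ownname => user, stattab => 'STATS_BACKUP');

-- 2. Save the current statistics for a table under an identifier
exec dbms_stats.export_table_stats(ownname => user, tabname => 'EMP', -
     stattab => 'STATS_BACKUP', statid => 'BEFORE_TEST');

-- ... regather with a different sample size, run tests ...

-- 3. Restore the saved statistics into the dictionary
exec dbms_stats.import_table_stats(ownname => user, tabname => 'EMP', -
     stattab => 'STATS_BACKUP', statid => 'BEFORE_TEST');
```

Because STATS_BACKUP is an ordinary table, it can also be moved to another database with export/import to transfer the statistics.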


Using Histograms

The cost-based optimizer can use data value histograms to get accurate estimates of the distribution of column data. A histogram partitions the values in the column into bands, so that all column values in a band fall within the same range. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.

One of the fundamental tasks of the cost-based optimizer is determining the selectivity of predicates that appear in queries. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Some attribute domains (a table's columns) are not uniformly distributed.

The cost-based optimizer uses height-based histograms on specified attributes to describe the distributions of nonuniform domains. In a height-based histogram, the column values are divided into bands so that each band contains approximately the same number of values. The useful information that the histogram provides, then, is where in the range of values the endpoints fall.

When to Use Histograms

Histograms can affect performance and should be used only when they substantially improve query plans. Histogram statistics data is persistent, so the space required to save the data depends on the sample size. In general, create histograms on columns that are used frequently in WHERE clauses of queries and have a highly skewed data distribution. For uniformly distributed data, the cost-based optimizer can make fairly accurate guesses about the cost of executing a particular statement without the use of histograms.

Histograms, like all other optimizer statistics, are static. They are useful only when they reflect the current data distribution of a given column. (The data in the column can change as long as the distribution remains constant.) If the data distribution of a column changes frequently, you must recompute its histogram frequently.

Histograms are not useful for columns with the following characteristics:

  • All predicates on the column use bind variables.
  • The column data is uniformly distributed.
  • The column is unique and is used only with equality predicates.

Creating Histograms

You generate histograms by using the DBMS_STATS package. You can generate histograms for columns of a table or partition. For example, to create a 10-bucket histogram on the SAL column of the emp table, issue the following statement:

EXECUTE DBMS_STATS.GATHER_TABLE_STATS ('scott','emp', METHOD_OPT => 'FOR COLUMNS SIZE 10 sal');

The SIZE keyword declares the maximum number of buckets for the histogram. You would create a histogram on the SAL column if there were an unusually high number of employees with the same salary and few employees with other salaries. You can also collect histograms for a single partition of a table.

Oracle Corporation recommends using the DBMS_STATS package to have the database automatically decide which columns need histograms. This is done by specifying SIZE AUTO.

Example Using a Histogram to Improve an Execution Plan

UPDATE so_lines l
SET open_flag=null, s6=10, s6_date=sysdate
WHERE l.line_type_code in ('REGULAR','DETAIL','RETURN')
AND l.open_flag = 'Y'
AND (NVL(l.shipped_quantity, 0)=0
OR
(NVL(l.shipped_quantity, 0) != 0
AND
l.shipped_quantity + NVL(l.cancelled_quantity, 0) = l.ordered_quantity))
AND
l.s6=18;

This query runs against a skewed distribution of data values for s6. In this case, there are two distinct non-null values: 10 and 18. The majority of the rows have s6 = 10 (1,589,464), while a small number of rows have s6 = 18 (13,091).

S6       COUNT(*)
======================
10      1,589,464
18         13,091
NULL       21,889

The selectivity of column s6, where s6 = 18:

S = 13,091 / (13,091 + 1,589,464) = 0.008

If No Histogram is Used: The selectivity of column s6 is assumed to be 50%, uniformly distributed across 10 and 18. This is not selective; therefore, s6 is not an ideal choice for use as an index.

If a Histogram is Used: The data distribution information is stored in the dictionary. This allows the optimizer to use this information and compute the correct selectivity based on the data distribution. In this example, the selectivity, based on the histogram data, is 0.008. This is a relatively high, or good, selectivity, which leads the optimizer to use an index on column s6 in the execution plan.
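The two selectivity estimates can be checked with a short sketch; the counts are the ones from the table above, and the variable names are only for illustration:

```python
# Selectivity of s6 = 18, from the non-null value counts above.
counts = {10: 1_589_464, 18: 13_091}   # NULLs are excluded
non_null = sum(counts.values())

with_histogram = counts[18] / non_null   # optimizer knows the skew
without_histogram = 1 / len(counts)      # uniform assumption: 1/NDV

print(round(with_histogram, 3))   # -> 0.008, matching the figure above
print(without_histogram)          # -> 0.5
```

The histogram turns an assumed 50% selectivity into the true 0.8%, which is selective enough to justify the index on s6.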


Tuning by Statistics

As you know, it is not good to resolve tuning issues using the old method of "ratios" only. So we will start by looking at several Oracle views that will help you tune your database based on how it is actually used.

Parsing and Read Consistency
The view v$sysstat holds the accumulated statistics for the complete system since the instance started, except for what is active now and shown in v$sesstat. The numbers in this view are held in "tens of milliseconds" (sounds like hundredths of seconds to me), so to convert to seconds we divide the value by 100. Note that TIMED_STATISTICS must be set to true in the init.ora for these views to have timing numbers. So to see overall parsing and read consistency waits, we can use the following:

column parsing heading 'Parsing|(seconds)'
column total_cpu heading 'Total CPU|(seconds)'
column waiting heading 'Read Consistency|Wait (seconds)'
column pct_parsing heading 'Percent|Parsing'

select total_CPU, parse_CPU parsing,
parse_elapsed-parse_CPU waiting,
trunc(100*parse_elapsed/total_CPU,2) pct_parsing
from (select value/100 total_CPU
from v$sysstat where name = 'CPU used by this session'),
(select value/100 parse_CPU
from v$sysstat where name = 'parse time cpu'),
(select value/100 parse_elapsed
from v$sysstat where name = 'parse time elapsed');



Total CPU Parsing Read Consistency Percent
(seconds) (seconds) Wait (seconds) Parsing
---------- ---------- ---------------- ----------
1236263327   21072.43          4277.93          0

We can see from this output that this instance has consumed a lot of CPU seconds, but a relatively small amount of that was used for parsing. Of the parsing time, about 20 percent was spent waiting for read consistency, but since the parsing percentage is so small, this is not impacting our system performance.

Finding Free Buffers
If we need to scan buffers to find a free one, then we are wasting CPU resource. This can be checked with the script below:

select inspected,requested,(inspected/requested) ratio
from (select value inspected from v$sysstat where name = 'free buffer inspected') a,
(select value requested from v$sysstat where name = 'free buffer requested') b;


INSPECTED REQUESTED RATIO
---------- ---------- ----------
5100 12288112 .000415035

We see that, for the vast majority of the time, the first buffer we look at is available (low ratio). So there is no problem with getting free buffers as we need them. Let's look further.

Active Waiting and Latch Contention
Active waiting can be compared to waiting at the betting window with your two dollars in your hand. You are not just passively standing there, but are focused on all the windows to see when a resource becomes available.
We can see these active waits in v$system_event with the following:

select event,time_waited/100 seconds
from v$system_event
where event not in
('rdbms ipc message'
,'SQL*Net more data to client'
,'SQL*Net message from client'
,'SQL*Net message to client'
,'pmon timer'
,'smon timer'
,'wakeup time manager')
order by seconds;


EVENT SECONDS
---------------------------------------- ----------
buffer deadlock 0
. . .
latch free 3422.98
. . .
control file parallel write 80356.43
log buffer space 106812.85

The previous code shows me that we have waited the most time for log buffer space. I will check that I have the log buffer sized correctly. This is still a small fraction of the CPU seconds used by the instance as we saw above; it was over one billion, so we are way under one percent of the time spent waiting. This also shows us that we have been waiting for latches. So let's see which latch we are not getting when needed with the following code:

Column name format a35
SELECT name, gets, misses, immediate_gets, immediate_misses
FROM v$latch l
WHERE (misses > .1*gets)
or (immediate_misses > .1*(immediate_misses+immediate_gets));



NAME GETS MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
--------------------------------- ------ -------- -------------- ----------------
channel operations parent latch 86845 6 43300 78881

This is a latch that we can't tune and we also see that the number of misses is not increasing as we rerun this query, so we will leave it alone.
Given this, we have to conclude that the instance is not using too much CPU resource for its own maintenance — so let's check the users.


User Code Execution

In our instance, we have checked that the environment for the instance is not forcing us to consume more CPU resources. So we will check the code that is being executed. Most often, the CPU consumption problem is caused by what a user is doing.

Highest Consuming Sessions
We can see which user is consuming the most CPU at this time by creating a table with the cumulative values so we can pull out the current usage:

drop table t_CPU;
create table t_CPU storage
(initial 64k next 64k pctincrease 0) pctfree 0
as select a.sid,a.value
from v$sesstat a,v$statname b
where b.name = 'CPU used by this session'
and a.statistic# = b.statistic#;

column name format a25
column username format a20
column CPU heading 'CPU|Seconds'

select a.sid, c.username,b.name,(a.value-t.value)/100 CPU
from v$session c,v$sesstat a,v$statname b,t_CPU t
where c.sid = a.sid
and c.sid=t.sid
and b.name = 'CPU used by this session'
and a.statistic# = b.statistic#
and a.value > 0
and a.value - t.value > 0
order by a.value-t.value;

drop table t_CPU;

CPU
SID USERNAME NAME Seconds
------ -------------------- ------------------------- ----------
52 CPU used by this session .01
. . .
23 SYSTEM CPU used by this session 35.82
177 ORADBA CPU used by this session 38.26
167 ORADBA CPU used by this session 52.07

You can see that SID 167 is the current largest consumer. What we are looking for, then, is code that is in a tight loop in memory, just processing over and over. If this CPU is being consumed by normal usage, then no changes are needed. So let's see what SID 167 is running:

select c.executions,trunc(a.last_call_et/60) min,b.sql_text
from v$session a, v$sqltext b, v$sqlarea c
where a.sid=167
and a.sql_address = b.address
and a.sql_address = c.address
order by b.piece;


EXECUTIONS MIN SQL_TEXT
---------- ------ ----------------------------------------------------------------
94784 0 select daily_notes_id, center_code, TO_CHAR(day,'MM/DD/YYYY')
day, notes from daily_notes where center_code = :1
and day = TO_DATE(:2,'MM/DD/YYYY') order by 1

There is nothing wrong with this code and after watching for a while we see that this user is not running any code that jumps out as bad. We conclude that this session is just doing the most work.


Highest Consuming Statements
We can also check for the statements with the most buffer gets. These are probably working the most in memory:

set linesize 140
column executions format 999,999,999 heading "Executions"
column disk_reads format 999,999,999 heading "Block Reads"
column buffer_gets format 999,999,999 heading "Buffer Reads"
column sql_text format a90 heading "Statement"
break on executions on buffer_gets skip 1

--we want the 2 largest consumers

variable max_buffer_read number
begin
select max(buffer_gets) into :max_buffer_read
from v$sqlarea
where parsing_user_id >0;

select max(buffer_gets) into :max_buffer_read
from v$sqlarea
where parsing_user_id >0
and buffer_gets < :max_buffer_read;
end;
/

select a.executions,a.buffer_gets,b.sql_text
from v$sqlarea a,v$sqltext b
where a.buffer_gets >= :max_buffer_read
and a.address = b.address
and a.parsing_user_id > 0
order by a.buffer_gets desc, piece asc;


Executions Buffer Reads Statement
---------- ------------ -------------------------------------------------------------
6,300,634 51,221,671 SELECT DISTINCT B.FIRST_NAME || ' ' || B.LAST_NAME
FROM BLOCK_PHYSICIAN A,PHYSICIAN B
WHERE A.PHYSICIAN_FID=B.PHYSICIAN_ID AND A.BLOCK_TIME_FID=:b1

75,265 27,227,798 select client_id,greeting_lu,first_name,last_name
,home_phone,work_phone,current_status_lu,center_code
,start_time
,TO_CHAR(start_time, 'MM/DD/YYYY HH:MI AM') start_datetime
,appointment_type_lu, appointment_id , number_attending
,offsite, offsite_location,treatment_info, bgcolor
,(TO_NUMBER(TO_CHAR(start_time,'HH24'))*60)
+TO_NUMBER(TO_CHAR(start_time,'MI')) time,physician
,TO_CHAR(start_time, 'HH:MI AM') start_display_time,note
from appointment_view_no_phys
where center_code = :1
and start_time between TO_DATE(:2,'MM,DD,YYYY')
and TO_DATE(:3,'MM,DD,YYYY')+1
and appointment_type_lu in
(:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14)
and (current_status_lu not in
('Cancelled','Rescheduled','Attended','NoShowed'))
order by start_time,last_name,first_name

Remember that these numbers are cumulative since the instance came up.
We see that the first statement is doing fewer than 10 buffer reads per execution on average. Because I know that these tables are in the keep buffer pool, this makes sense.
The second statement is doing about 361 buffer gets per execution. This will be one to investigate and see how this view is joining. Maybe we can make this more efficient.
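The per-execution averages quoted here are easy to verify from the report's totals:

```python
# Quick check of the per-execution averages quoted above.
stmts = [(6_300_634, 51_221_671),   # executions, buffer gets: first statement
         (75_265, 27_227_798)]      # second statement
for execs, gets in stmts:
    print(round(gets / execs, 1))   # -> 8.1, then 361.8
```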


RAID Configuration


There is significant confusion in many organizations using ORACLE and RAID technologies. I will attempt to make this information very clear and understandable.

What Is RAID?
RAID is a technology for expanding the capacity of the I-O system and providing the capability for data redundancy. It stands for Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks.
Conceptually, RAID is the use of 2 or more physical disks, to create 1 logical disk, where the physical disks operate in tandem to provide greater size and more bandwidth. RAID has become an indispensable part of the I-O fabric of any system today and is the foundation for storage technologies, supported by many mass storage vendors. The use of RAID technology has re-defined the design methods used for building storage systems that support Oracle databases.


The 3 Main Concepts In RAID
When you talk about RAID, there are 3 terms that are important and relevant:
• Striping
• Mirroring
• Parity.

1- What Is Striping?
Striping is the process of breaking down data into pieces and distributing it across multiple disks that support a logical volume – “Divide, Conquer & Rule”. This often results in a “logical volume” that is larger and has greater I-O bandwidth than a single disk. It is purely based on the linear power of incrementally adding disks to a volume to increase the size and I-O bandwidth of the logical volume. The increase in bandwidth is a result of how read/write operations are done on a striped volume.
Imagine that you are in a grocery store. With you are about two hundred of your closest friends and neighbors, all shopping for the week’s groceries. Now consider what it’s like when you get to the checkout area and find that only one checkout line is open. That poor clerk can only deal with a limited number of customers per hour, and the line starts to grow progressively longer. The same is true of your I-O sub-system. A given disk can process a specific number of I-O operations per second. Anything more than that and the requests start to queue up. Now stop and think about how great it feels when you get to the front of the store and find that all 20 lines are open. You find your way to the shortest line and you’re headed out the door in no time.
Striping has a similar effect on your I-O system. By creating a single volume from pieces of data on several disks, we can increase the capacity to handle I-O requests in a linear fashion, by combining each disk’s I-O bandwidth. Now, when multiple I-O requests for a file on a striped volume are processed, they can be serviced by multiple drives in the volume, as the requests are sub-divided across several disks. This way all drives in the striped volume can engage and service multiple I-O requests in a more efficient manner. This “cohesive and independent” functioning of all the drives in a logical volume is relevant for both read and write operations. It must be noted that striping by itself does not reduce “response time” for servicing I-O requests. However, it does provide predictable response times and facilitates the notion of better performance, by balancing I-O requests across multiple drives in the striped volume.
Figure 1 depicts a 4-way striped volume (v1) with 4 disks (1-4). A given stripe of data (Data1) in a file on v1 will be split/striped across the 4 disks, into 4 pieces (Data11-Data14).

Disk1 Disk2 Disk3 Disk4
Data11 Data12 Data13 Data14
Data21 Data22 Data23 Data24

Figure 1
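Figure 1's round-robin layout can be sketched as follows. This is a simplified model for illustration; real volume managers stripe in fixed-size chunks with a configurable stripe depth:

```python
# Sketch of how a 4-way stripe maps sequential chunks to disks,
# matching Figure 1 (Data11..Data14 on Disk1..Disk4, then Data21..).
def stripe_location(chunk_index, num_disks):
    disk = chunk_index % num_disks          # round-robin across members
    stripe_row = chunk_index // num_disks   # which row of the stripe
    return disk, stripe_row

for i in range(8):
    disk, row = stripe_location(i, 4)
    print(f"Data{row + 1}{disk + 1} -> Disk{disk + 1}")
```

Eight sequential chunks land as two full rows across the four disks, so a large read can keep all four spindles busy at once.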

2-What Is Mirroring?
Mirroring is the process of writing the same data to another “member” of the same volume simultaneously. Mirroring provides protection for data by writing exactly the same information to every member in the volume. Additionally, mirroring can enhance read operations, because read requests can be serviced from either “member” of the volume. If you have ever made a photocopy of a document before mailing the original, then you have mirrored data. One of the common myths about mirroring is that it takes “twice as long” to write. In many performance measurements and benchmarks, however, the overhead has been observed to be around 15-20%.
Figure 2 illustrates a 4-way striped mirrored volume (v1) with 8 Disks (1-8). A given stripe of data (Data1) in a file on v1 will be split/striped across the Disks (1-4) and then mirrored across Disks (5-8). Disks (1-4) and (5-8) are called “Mirror Members” of the volume v1.

Disk1 Disk2 Disk3 Disk4 Disk5 Disk6 Disk7 Disk8
Data11 Data12 Data13 Data14 Data11 Data12 Data13 Data14
Data21 Data22 Data23 Data24 Data21 Data22 Data23 Data24

Figure 2


3- What Is Parity?
Parity is the term for error checking. Some levels of RAID perform parity calculations when reading and writing data. The calculations are primarily done on write operations. However, if one or more disks in a volume are unavailable, then depending on the level of RAID, even read operations will require parity operations to rebuild the pieces on the failed disks. Parity is used to determine the write location and validity of each stripe that is written in a striped volume. Parity is implemented on those levels of RAID that do not support mirroring.
Parity algorithms contain Error Correction Code (ECC) capabilities, which calculate parity for a given “stripe or chunk” of data within a RAID volume. The size of a chunk is operating system (O-S) and hardware specific. The codes generated by the parity algorithm are used to recreate data in the event of disk failure(s). Because the algorithm can reverse this parity calculation, it can rebuild data lost as a result of disk failures. It’s just like solving a math problem where you know the answer (checksum) and one part of the question, e.g. 2 + X = 5, what is X? Of course, X = 3.
Figure 3 depicts a 4-way striped RAID 3 volume with parity – v1 with 5 Disks (1-5). A given stripe of data (Data1) in a file on v1 will be split/striped across the Disks (1-4) and the parity for Data1 will be stored on Disk 5. There are other levels of RAID that store parity differently and those will be covered in the following sections.

Disk1 Disk2 Disk3 Disk4 Disk5
Data11 Data12 Data13 Data14 Parity1
Data21 Data22 Data23 Data24 Parity2

Figure 3
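The same reverse-calculation idea can be shown with XOR parity, the error-checking scheme commonly used by parity RAID levels. This is a toy sketch with 4-bit chunks, not a vendor implementation:

```python
# Sketch of parity reconstruction: XOR parity can rebuild any single
# lost chunk, just like solving 2 + X = 5 for X.
from functools import reduce

data_chunks = [0b1010, 0b0111, 0b0011, 0b1100]    # chunks on Disk1-Disk4
parity = reduce(lambda a, b: a ^ b, data_chunks)  # parity chunk on Disk5

# Disk3 fails: XOR the surviving chunks with the parity to recover it.
survivors = data_chunks[:2] + data_chunks[3:]
recovered = reduce(lambda a, b: a ^ b, survivors) ^ parity
print(recovered == data_chunks[2])  # -> True
```

Because XOR is its own inverse, the same operation that computed the parity on write also regenerates the missing chunk on read after a failure.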

Putting It All Together
Striping yields better I-O performance, mirroring provides protection and parity (when applicable) is a way to check the work. With these 3 aspects of RAID, we can achieve scalable, protected, highly available I-O performance.



Types of RAID available
Vendors typically offer the following choices for RAID configurations:

RAID0
Disk striping. RAID0 requires at least two physical disks. Data is read and written across multiple drives, so disk I/O is relatively evenly spread. Writes can occur in a block or streaming manner (similar to non-RAIDed disks) as requested by the operating system. Disk failure results in lost data. Compared to a single disk drive, RAID0 has the following attributes:
- Better read performance
- Better write performance
- Inexpensive in cost
- Not fault-tolerant
- Storage equivalent to sum of physical drive storage in the array
- Readily available from most vendors

RAID1 Shadowing/Mirroring
Disk mirroring. RAID1 requires two physical disks. Logical writes are done by physically writing the data to both disks simultaneously, and can typically be done in a block manner or streaming manner, as requested by the operating system. Reads can be done using either disk. In the event of a disk failure, data can still be retrieved and written to the surviving disk. Compared to a single disk drive, RAID1 has the following attributes:
- Better read performance
- Similar write performance
- Expensive
- Fault-tolerant
- Storage equivalent to 1/2 the sum of the physical drive storage in the mirrored set.
- Readily available from most vendors

RAID5 Striping with Rotating Parity
Disk striping with parity. RAID5 requires at least three physical disks. On a logical write, a block of data is physically written to disk, parity information is calculated using the block just written plus blocks already existing on disk, then the parity information is written to disk. In RAID5, the parity information is rotated among the physical disks to prevent bottlenecks caused by a dedicated parity disk. Note that writes occur in a block manner regardless of whether the O/S is sending a stream of data to be written or requests to write whole blocks. On a logical read, data is read from multiple disks in a manner very similar to RAID0. In the event of a disk failure, data can be reconstructed on the fly using the parity information. Compared to a single disk drive, RAID5 has the following attributes:
- Data is striped across multiple physical disks, and parity data is striped across storage equivalent to one disk.
- Better read performance
- Poorer write performance
- Inexpensive
- Fault-tolerant
- Storage equivalent to N - 1 physical drives, where N is the number of drives in the array.
- Readily available from most vendors

RAID10 (or RAID0+1)
Mirrored stripe sets. RAID10 requires at least 4 physical drives, and combines the performance gains of RAID0 with the fault-tolerance and expense of RAID1. Data is written simultaneously to two mirrored sets of striped disks in blocks or streams. Reads can be performed against either striped set. In the event of a failure of a disk drive in one striped set, data can be written to and read from the surviving striped set. Compared to a single disk drive, RAID10 has the following attributes:
- Better read performance
- Better write performance
- Expensive
- Fault-tolerant
- Storage is 1/2 of the sum of the physical drives' storage
- Currently available from only a few vendors (at the time of this writing)

Possible configurations using 4 physical disks:

Configuration   Number of disks   Available space   Max Reads/Sec    Max Writes/Sec
-------------   ---------------   ---------------   --------------   --------------
Single disk            1           4 GB               60               60
RAID0                  4          16 GB              240              240
RAID1                  4           8 GB              240 (2 arrays)   120 (2 arrays)
RAID5                  4          12 GB              180               60

Possible configurations using 6 physical disks:

Configuration   Number of disks   Available space   Max Reads/Sec    Max Writes/Sec
-------------   ---------------   ---------------   --------------   --------------
Single disk            1           4 GB               60               60
RAID0                  6          24 GB              360              360
RAID1                  6          12 GB              360 (3 arrays)   180 (3 arrays)
RAID5                  6          20 GB              300               90



As can be seen from the charts, RAID0 offers good read and write performance, but no fault tolerance.
RAID1 offers good read performance, and half as much write performance, but provides fault-tolerance.
RAID5 reclaims most of the space lost to RAID1, provides fault-tolerance, offers reasonably good read performance, but poor write performance. (In fact, RAID5 requires 4 disks to regain the same write performance as a single disk). Also, note that streaming logical writes, as well as block-level logical writes, to RAID5 arrays are handled as block-level physical writes. Finally, read or write workload capacity can be increased in any RAID configuration by adding physical disks.
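The charts can be reproduced with a simple model, assuming each drive delivers 60 I-Os per second and holds 4 GB, and that a RAID5 logical write costs about four physical I/Os (the read-modify-write penalty). The helper below is an invented illustration, not a vendor formula:

```python
# Rough model behind the charts above: 60 IO/s and 4 GB per drive,
# RAID5 writes cost ~4 physical I/Os (read-modify-write).
def raid_profile(level, disks, per_disk_iops=60, per_disk_gb=4):
    """Return (available GB, max reads/sec, max writes/sec)."""
    if level == "RAID0":
        return disks * per_disk_gb, disks * per_disk_iops, disks * per_disk_iops
    if level == "RAID1":   # mirrored pairs: half the space, reads from both sides
        return disks // 2 * per_disk_gb, disks * per_disk_iops, disks // 2 * per_disk_iops
    if level == "RAID5":   # one disk's worth of capacity goes to parity
        return (disks - 1) * per_disk_gb, (disks - 1) * per_disk_iops, disks * per_disk_iops // 4
    raise ValueError(level)

print(raid_profile("RAID5", 4))   # -> (12, 180, 60), as in the 4-disk chart
print(raid_profile("RAID5", 6))   # -> (20, 300, 90), as in the 6-disk chart
```

The divide-by-four in the RAID5 write path is why six RAID5 disks only manage 90 writes/sec while the same six disks as RAID0 manage 360.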

The RAID "hierarchy" begins with RAID 0 (striping) and RAID 1 (mirroring). Combining RAID 0 and RAID 1 is called RAID-0+1 or RAID-1+0, depending on how you combine them. (RAID 0+1 is also called RAID-01, and RAID-1+0 is also called RAID-10.) The performance of RAID-10 and RAID-01 are identical, but they have different levels of data integrity.

RAID-01 (or RAID 0+1) is a mirrored pair (RAID-1) made from two stripe sets (RAID-0); hence the name RAID 0+1, because it is created by first creating two RAID-0 sets and then adding RAID-1 on top. If you lose a drive on one side of a RAID-01 array, then lose another drive on the other side of that array before the first side is recovered, you will suffer complete data loss. It is also important to note that all drives in the surviving mirror are involved in rebuilding the entire damaged stripe set, even if only a single drive was damaged. Performance during recovery is severely degraded unless the RAID subsystem allows adjusting the priority of recovery. However, shifting the priority toward production will lengthen recovery time and increase the risk of the kind of catastrophic data loss mentioned earlier.

RAID-10 (or RAID 1+0) is a stripe set made up from N mirrored pairs. Only the loss of both drives in the same mirrored pair can result in any data loss, and the loss of that particular drive is 1/Nth as likely as the loss of some drive on the opposite mirror in RAID-01. Recovery involves only the replacement drive and its mirror, so the rest of the array performs at 100% capacity during recovery. Also, since only the single drive needs recovery, bandwidth requirements during recovery are lower and recovery takes far less time, reducing the risk of catastrophic data loss.
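The difference in exposure can be sketched numerically. Once one drive has failed, count how many of the remaining drives would cause data loss if they failed next; this is a simplified model, and the function name is invented for the example:

```python
# After one drive has failed in an array of 2*N drives, what fraction
# of possible second failures would destroy the array?
def second_failure_loss_odds(n_pairs, layout):
    remaining = 2 * n_pairs - 1
    if layout == "RAID-10":
        fatal = 1            # only the failed drive's mirror partner
    elif layout == "RAID-01":
        fatal = n_pairs      # any drive in the other stripe set
    else:
        raise ValueError(layout)
    return fatal / remaining

print(second_failure_loss_odds(4, "RAID-10"))  # 1 of 7 second failures is fatal
print(second_failure_loss_odds(4, "RAID-01"))  # 4 of 7 are fatal
```

With four mirrored pairs, a second failure is four times more likely to be fatal under RAID-01 than under RAID-10, which is the 1/Nth factor described above.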

The most appropriate RAID configuration for a specific filesystem or database tablespace must be determined based on data access patterns and cost versus performance tradeoffs. RAID-0 offers no increased reliability. It can, however, supply performance acceleration at no increased storage cost. RAID-1 provides the highest performance for redundant storage, because it does not require read-modify-write cycles to update data, and because multiple copies of data may be used to accelerate read-intensive applications. Unfortunately, RAID-1 requires at least double the disk capacity of RAID-0. On the other hand, if more than two copies of the data exist, RAID-1 arrays can be constructed to endure the loss of multiple disks without interruption. Parity RAID allows redundancy at a lower total storage cost. The read-modify-write cycle it requires, however, will reduce total throughput in any small write operations (read-only or extremely read-intensive applications are fine). The loss of a single disk will cause read performance to be degraded while the system reads all other disks in the array and recomputes the missing data. Additionally, parity RAID does not support losing multiple disks, and cannot be made more redundant.

ORACLE database files on RAID
Given the information regarding the advantages and disadvantages of various RAID configurations, how does this information apply to an ORACLE instance? The discussion below will provide information about how database files are used by an ORACLE instance under OLTP and DSS classifications of workload.
Note that the perspectives presented below are very sensitive to the number of users: if your organization has a 10-20 user OLTP system (and thus, a low throughput requirement), then you may get very acceptable performance with all database files stored on RAID5 arrays. On the other hand, if your organization has a 100 user OLTP system (resulting in a higher throughput requirement), then a different RAID configuration may be absolutely necessary. An initial configuration can be outlined by estimating the number of transactions (based on the number of users), performing adjustments to encompass additional activity (such as hot backups, nightly batch jobs, etc.), then performing the necessary mathematical calculations.
You definitely want to keep rollback segments, temp tablespaces and redo logs off RAID5, since RAID5 is too slow for these write-intensive, sequentially accessed Oracle files. Redo logs should have their *own* dedicated drives.

OLTP (On-line transaction processing) workloads
Characterized by multi-user concurrent INSERTs, UPDATEs, and DELETEs during normal working hours, plus possibly some mixture of nightly batch jobs. Large SELECTs may generate reports, but the reports will typically be "pre-defined" reports rather than ad-hoc queries. The focus, though, is on enabling update activity that completes within an acceptable response time. Ideally, each type of database file would be spread out over its own private disk subsystem, although grouping certain types of files together (when the number of disks, arrays, and controllers is less than ideal) may yield adequate performance. (Please see the article on Instance tuning for information regarding groupings of database files in an OLTP system.)

Redo logs.
During update activity, redo logs are written to in a continuous and sequential manner, and are not read under normal circumstances. RAID5 would be the worst choice for performance. Oracle Corporation recommends placing redo logs on single non-RAIDed disk drives, under the assumption that this configuration provides the best overall performance for simple sequential writes. Redo logs should always be multiplexed at the ORACLE software level, so RAID1 provides few additional benefits. Since non-RAID and RAID0 configurations can vary with hardware from different vendors, the organization should contact their hardware vendor to determine whether non-RAIDed disks or RAID0 arrays will yield the best performance for continuous sequential writes. Note that even if redo logs are placed on RAID1 arrays, the redo logs should still be mirrored at the ORACLE level.

Archive logs
As redo logs are filled, archive logs are written to disk one whole file at a time (assuming, of course, that the database is running in archivelog mode), and are not read under normal circumstances. Any RAID or non-RAID configuration could be used, depending upon the performance requirements and size of the redo logs. For instance, if the redo logs are large, then they will become full and be archived less often. If an archive log is likely to be written no more than once per minute, then RAID5 may provide acceptable performance. If RAID5 proves too slow, then a different RAID configuration can be chosen, or the redo logs can simply be made larger. Note that a fault-tolerant configuration is advisable: if the archive log destination becomes unavailable, the database will halt.

Rollback Segments
As modifications are made to the database tables, undo information is written to the buffer cache in memory. These rollback segments are used to maintain commitment control and read consistency. Rollback segment data is periodically flushed to disk by checkpoints. Consequently, the changes to the rollback segments are also recorded in the redo logs. However, a smaller amount of information is typically written to the rollback segments than to the redo logs, so the write rate is less stringent. A fault-tolerant configuration is advisable, since the database cannot operate without rollback segments, and recovery of common rollback segments will typically require an instance shutdown. If the transaction rate is reasonably small, RAID5 may provide adequate performance. If it does not, then RAID1 (or RAID10) should be considered.

User tables and indexes
As updates are performed, these changes are stored in memory. Periodically, a checkpoint will flush the changes to disk. Checkpoints occur under two normal circumstances: a redo log switch occurred, or the time interval for a checkpoint expired. (There are a variety of other situations that trigger a checkpoint. Please check the ORACLE documentation for more detail.) Like redo log switches and generation of archive logs, checkpoints can normally be configured so that they occur approximately once per minute. Recovery can be performed up to the most recent checkpoint, so the interval should not be too large for an OLTP system. If the volume of updated data written to disk at each checkpoint is reasonably small (ie. the transaction rate is not extremely large), then RAID5 may provide acceptable performance. Additionally, analysis should be performed to determine the ratio of reads to writes. Recalling that RAID5 offers reasonably good read performance, if the percentage of reads is much larger than the percentage of writes (for instance, 80% to 20%), then RAID5 may offer acceptable performance for small, medium, and even some large installations. A fault-tolerant configuration is preferable to maximize availability (assuming availability is an objective of the organization), although only failures damaging datafiles for the SYSTEM tablespace (and active rollback segments) require the instance to be shut down. Disk failures damaging datafiles for non-SYSTEM tablespaces can be recovered with the instance on-line, meaning that only the applications using data in tablespaces impacted by the failure will be unavailable. With this in mind, RAID0 could be considered if RAID5 does not provide the necessary performance. If high availability and high performance on a medium to large system are explicit requirements, then RAID1 or RAID10 should be considered.

Temp segments
Sorts too large to be performed in memory are performed on disk. Sort data is written to disk in a block-oriented fashion. Sorts do not normally occur with INSERT/UPDATE/DELETE activity. Rather, SELECTs with ORDER BY or GROUP BY clauses and aggregate functions (i.e., operational reports), index rebuilds, etc., will use TEMP segments only if the sort is too large to perform in memory. Temp segments are good candidates for non-RAIDed drives or RAID0 arrays. Fault-tolerance is not critical: if a drive failure occurs and datafiles for a temp segment are lost, then the temp segment can either be recovered by the normal means (restore from tape and perform a tablespace recovery), or the temp segment can simply be dropped and re-created since there is no permanent data stored in the temp segment. Note that while a temp segment is unavailable, certain reports or index creations may not execute without errors, but update activity will typically not be impacted. With this in mind, RAID1 arrays are a bit unnecessary for temp segments, and should be used for more critical database files. RAID5 will provide adequate performance if the sort area hit ratios are such that very few sorts are performed on disk rather than in memory.
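The sort area hit ratio mentioned above can be derived from the V$SYSSTAT counters `sorts (memory)` and `sorts (disk)`. A minimal sketch with made-up counter values (on a real instance you would SELECT these two statistics from V$SYSSTAT):

```python
# Hypothetical values, as would be read from V$SYSSTAT
sorts_memory = 98500   # statistic 'sorts (memory)'
sorts_disk = 150       # statistic 'sorts (disk)'

# Percentage of sorts spilling to disk; keep this very low
disk_sort_pct = 100.0 * sorts_disk / (sorts_memory + sorts_disk)
print(round(disk_sort_pct, 2))  # -> 0.15
```

At well under 1% disk sorts, almost any disk configuration for the temp segments will do; as the percentage climbs, the temp segment I/O performance starts to matter.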

Control files
Control files are critical to the instance operation, as they contain the structural information for the database. Control files are updated periodically (at a checkpoint and at structural changes), but the data written to the control files is a very small quantity compared to other database files. Control files, like redo logs, should be multiplexed at the ORACLE software level onto different drives or arrays. Non-RAIDed drives or any RAID configuration would be acceptable for control files, although most organizations will typically distribute the multiple copies of the control files with the other database files, given that the read and write requirements are so minimal. For control files, maintaining multiple copies in different locations should be favored over any other concern.

Software and static files
The ORACLE software, configuration files, etc. are very good candidates for RAID5 arrays. This information is not constantly updated, so the RAID5 write penalty is of little concern. Fault-tolerance is advisable: if the database software (or O/S software) becomes unavailable due to a disk failure, then the database instance will abort. Also, recovery will include restore or re-installation of the ORACLE software (and possibly operating system software) as well as restore and recovery of the database files. RAID5 provides the necessary fault-tolerance to prevent this all-inclusive recovery, and good read performance for dynamic loading and unloading of executable components at the operating system level.

DSS (Decision Support System) workloads
In comparison to OLTP systems, DSS or data warehousing systems are characterized by primarily SELECT activity during normal working hours, and batch INSERT, UPDATE, and DELETE activity run on a periodic basis (nightly, weekly, or monthly). There will typically be a large amount of variability in the number of rows accessed by any particular SELECT, and the queries will tend to be of a more ad hoc nature. The number of users will typically be smaller than on the adjoining OLTP systems (where the data originates). The focus is on enabling SELECT activity that completes within an acceptable response time, while ensuring that the batch update activity still has capacity to complete in its allowable time window. Note that there are now two areas of performance to be concerned with: periodic refreshes and ad hoc read activity. The general directive in this case should be to configure the database such that read-only activity performed by end users is as good as it can get without rendering refreshes incapable of completion. As with OLTP systems, each type of database file would ideally have its own private disk subsystem (disks, arrays, and controller channel), but with less than ideal resources certain groupings tend to work well for DSS systems. (Please see the article on Instance tuning for information on these groupings.)

Redo logs
Redo logs are only written to while update activity is occurring. In a DSS-oriented system, a significant portion of the data entered interactively during the day may be loaded into the DSS database during only a few hours. Given this characteristic, redo logging may tend to be more of a bottleneck on the periodic refresh processes of a DSS database than on its adjoining OLTP systems. If nightly loads are taking longer than their allotted window, then redo logging should be the first place to look. The same RAID/non-RAID suggestions that apply to redo logging in OLTP also apply to DSS systems. As with OLTP systems, redo logs should always be mirrored at the ORACLE software level, even if they are stored on fault-tolerant disk arrays.

Archive logs
Like redo logging, archive logs are only written out during update activity. If the archive log destination appears to be overloaded with I/O requests, then consider changing the RAID configuration, or simply increase the size of the redo logs. Since there is a large volume of data being entered in a short period of time, it may be very reasonable to make the redo logs for the DSS or data warehouse much larger (10 or more times) than the redo logs used by the OLTP system. A reasonable rule of thumb is to target about one log switch per hour. With this objective met, the disk configuration and fault-tolerance can be chosen based on the same rules used for OLTP systems.
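The one-switch-per-hour rule of thumb translates directly into a log file size: size each redo log to hold roughly an hour's worth of redo generation during the heaviest load window. A sketch with a hypothetical redo rate (in practice this would be derived from deltas of the `redo size` statistic in V$SYSSTAT sampled during the nightly load):

```python
# Hypothetical measured redo generation during the nightly load
redo_mb_per_hour = 1800        # e.g. from V$SYSSTAT 'redo size' deltas
target_switches_per_hour = 1   # the rule of thumb from the text

# Each log must absorb an hour of redo before it fills and switches
log_size_mb = redo_mb_per_hour / target_switches_per_hour
print(log_size_mb)  # -> 1800.0 MB per redo log
```

If the OLTP system generating this data runs comfortably with, say, 100 MB redo logs spread over a full day, this shows why the DSS logs may reasonably be 10 or more times larger.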

Rollback Segments
Again like redo logging, rollback segments will be highly utilized during the periodic refreshes, and virtually unused during the normal work hours. Use the same logic for determining RAID or non-RAID configurations on the DSS database that would be used for the OLTP systems.

User tables and indexes
Writes are done to tablespaces containing data and indexes during periodic refreshes, but during the normal work hours read activity on the tables and indexes will typically far exceed the update work performed on a refresh. A fault-tolerant RAID configuration is suggested to sustain availability. However, in most cases the business can still operate if the DSS system is unavailable for several hours due to a disk failure. Information for strategic decisions may not be available, but orders can still be entered. If the DSS has high availability requirements, select a fault-tolerant disk configuration. If RAID5 arrays can sustain the periodic refresh updates, then RAID5 is typically a reasonably good choice due to its good read performance. As seen above, the read and write workload capacities can be adjusted by adding physical drives to the array.

Temp segments
In a decision support system or data warehouse, expect temp segment usage to be much greater than what would be found in a transaction system. Recalling that temp segments do not store any permanent data and are not absolutely necessary for recovery, RAID0 may be a good choice. Keep in mind, though, that the loss of a large temp segment due to drive failure may render the DSS unusable (unable to perform sorts to answer large queries) until the failed drives are replaced. If availability requirements are high, then a fault-tolerant solution should be selected, or at least considered. If the percentage of sorts on disk is low, then RAID5 may offer acceptable performance; if this percentage is high, RAID1 or RAID10 may be required.

Control files
As with OLTP systems, control files should always be mirrored at the ORACLE software level regardless of any fault-tolerant disk configurations. Since reads and writes to these files are minimal, any disk configuration should be acceptable. Most organizations will typically disperse control files onto different disk arrays and controller cards, along with other database files.

Software and static files
Like OLTP systems, these files should be placed on fault-tolerant disk configurations. Since very little write activity is present, these are again good candidates for RAID5.

Taking the above information into consideration, can an organization run an entire ORACLE database instance on a single RAID5 array? The answer is "yes". Will the organization get a good level of fault-tolerance? Again, the answer is "yes". Will the organization get acceptable performance? The answer is "it depends". This dependency includes the type of workload, the number of users, the throughput requirements, and a whole host of other variables. If the organization has an extremely limited budget, then it can always start with a single RAID5 array, perform the necessary analysis to see where improvement is needed, and proceed to correct the deficiencies.

Summary

RAID 1 is mirroring - blocks are written simultaneously to both members of the RAID set, so if one member fails, the other still has a full set of data.

RAID 5 is striping with distributed parity - data blocks are written in a stripe across all except one of the disks and a parity block is written to the last disk. Each data write is written to a different set of n-1 disks so that the data and parity are scattered equally amongst the drives.
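The parity mechanism described above is a simple XOR across the data blocks in each stripe: if any single disk is lost, its block can be rebuilt by XOR-ing the surviving data blocks with the parity block. A toy illustration with four "data disks" of one byte each:

```python
from functools import reduce

data_blocks = [0x3A, 0x5C, 0x11, 0xF0]           # one byte per "disk"
parity = reduce(lambda a, b: a ^ b, data_blocks)  # the parity block

# Simulate losing disk 2: rebuild its block from the survivors + parity
survivors = data_blocks[:2] + data_blocks[3:]
rebuilt = reduce(lambda a, b: a ^ b, survivors, parity)
print(rebuilt == data_blocks[2])  # -> True
```

The rebuild works because XOR-ing a value into itself cancels it out, leaving exactly the missing block. It also shows where the write penalty comes from: updating one data block requires recomputing and rewriting the parity block as well.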

RAID 1 gives a marginal read speed increase with no real write overhead. RAID 5 gives quite a high speed increase for reads but incurs a high overhead for write operations (normally the RAID5 controller will have a write-back cache attached to alleviate this).

For high write volumes RAID5 is far from ideal - it is much better suited to low-write, high-read workloads. Don't put actively written files on a RAID5 set - especially things like redo logs!

To really tune the I/Os you need to know the chunk size of your RAID controller and tweak db_file_multiblock_read_count appropriately.
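A common alignment heuristic (an approximation, not an official Oracle formula) is to set db_file_multiblock_read_count so that one multiblock read covers a whole stripe chunk; the chunk size below is hypothetical:

```python
stripe_chunk_kb = 128   # RAID controller chunk size (hypothetical)
db_block_size_kb = 8    # Oracle DB_BLOCK_SIZE in KB

# One full-table-scan read request spans exactly one stripe chunk,
# so each multiblock read is serviced by a single drive
db_file_multiblock_read_count = stripe_chunk_kb // db_block_size_kb
print(db_file_multiblock_read_count)  # -> 16
```

With this alignment, sequential scan requests line up with chunk boundaries instead of splitting each read across two drives.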


Distribution of Oracle Files
The following table shows what Oracle suggests for RAID usage:

| RAID | Type of RAID | Control File | Database File | Redo Log File | Archive Log File |
|------|--------------|--------------|---------------|---------------|------------------|
| 0 | Striping | Avoid | OK | Avoid | Avoid |
| 1 | Shadowing | Best | OK | Best | Best |
| 0+1 | Striping and Shadowing | OK | Best | Avoid | Avoid |
| 3 | Striping with static parity | OK | OK | Avoid | Avoid |
| 5 | Striping with rotating parity | OK | Best if RAID 0+1 not available | Avoid | Avoid |