Wednesday, May 4, 2011

Data Guard and dying archive processes

Some time ago you successfully configured a pair of instances with Data Guard, and everything has worked fine since. Then you opened your standby instance in read-only mode, the primary's archive processes started dying, and messages like these began to appear in the primary's alert log:


Thu Mar 17 15:32:49 2011
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc (incident=356411):
ORA-00600: internal error code, arguments: [17113], [0x000000000], [], [], [], [], [], [], [], [], [], []
Thu Mar 17 15:32:53 2011
Sweep Incident[356411]: completed
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc (incident=356412):
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_lns1_17449.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Thu Mar 17 15:32:57 2011
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_arc0_17142.trc (incident=361859):
ORA-00600: internal error code, arguments: [17113], [0x000000000], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_arc0_17142.trc (incident=361860):
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_arc0_17142.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
Errors in file /oracle_11g/product/diag/rdbms/mydb/mydb/trace/mydb_arc0_17142.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], [], [], [], [], []
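
When the alert log is flooded like this, it can help to see which background processes are affected before opening any trace file. The following is a minimal sketch, not an Oracle tool: the function name summarize_ora600 is made up for illustration, and it assumes the alert log has the two-line "Errors in file ..." / "ORA-00600 ..." layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: count ORA-00600 errors per trace file in an alert log.
# Assumes each error is logged as an "Errors in file <path>" line
# immediately followed by an "ORA-00600" line, as in the excerpt above.
summarize_ora600() {
    awk '
        /^Errors in file / {
            trc = $4                 # 4th field is the trace file path
            sub(/:$/, "", trc)       # drop the trailing colon, if any
            next_is_error = 1
            next
        }
        next_is_error && /^ORA-00600/ { count[trc]++ }
        { next_is_error = 0 }
        END { for (t in count) print count[t], t }
    ' "$1" | sort -rn
}

# Example (the path is an assumption, adjust it to your diagnostic destination):
# summarize_ora600 /path/to/alert_mydb.log
```

Each output line is a count followed by a trace file path, most frequent first. The process name is embedded in the file name (lns1, arc0), so the listing shows quickly whether only the log shipping and archiver processes are failing.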

As you might know, an ORA-00600 error means something like "I have no idea what happened, so I'll throw an ORA-00600 error". Situations like this are very hard to diagnose, but after some hard work (by my DBA coworker) we realized that the cause was missing permissions on the /var/tmp directory, where Oracle puts the sockets for the listener.

The permissions on /var/tmp and /var/tmp/.oracle should look something like this:

myserver> ls -la /var/tmp
total 12
drwxrwxrwt 3 root root 4096 2011-03-18 10:10 .
drwxr-xr-x 16 root root 4096 2010-02-05 13:14 ..
drwxrwxrwt 2 root dba 4096 2011-03-18 08:13 .oracle
myserver> ls -la /var/tmp/.oracle/
total 8
drwxrwxrwt. 2 root dba 4096 2011-05-04 08:17 .
drwxrwxrwt. 3 root root 4096 2011-05-04 09:18 ..
srwxrwxrwx. 1 oracle dba 0 2010-12-17 11:47 s#6115.1
srwxrwxrwx. 1 oracle dba 0 2010-12-17 11:47 s#6115.2
srwxrwxrwx. 1 oracle dba 0 2011-01-12 16:48 s#7018.1
srwxrwxrwx. 1 oracle dba 0 2011-01-12 16:48 s#7018.2
srwxrwxrwx. 1 oracle dba 0 2010-05-12 12:31 s#7662.1
srwxrwxrwx. 1 oracle dba 0 2010-05-12 12:31 s#7662.2
srwxrwxrwx. 1 oracle dba 0 2010-05-11 15:58 sEXTPROC_FOR_XE
srwxrwxrwx 1 oracle dba 0 2011-05-04 08:17 smyserverDBG_CSSD
srwxrwxrwx 1 oracle dba 0 2011-05-04 08:17 sOCSSD_LL_myserver_localhost
srwxrwxrwx 1 oracle dba 0 2011-05-04 08:17 sOracle_CSS_LclLstnr_localhost_0

After we fixed the permissions on /var/tmp and restarted the primary instance's listener and the archive log shipping, the problem disappeared.
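
To compare a server against the listing above, a small check like the following can be used. It is only a sketch: the function name check_sticky_dir is invented for this post, it assumes GNU stat (Linux), and it treats mode 1777 (drwxrwxrwt) as the expected value because that is what the listing shows for both directories.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory has mode 1777
# (world-writable with the sticky bit, shown by ls as drwxrwxrwt).
# Assumes GNU stat; "stat -c" is not available on BSD or macOS.
check_sticky_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "MISSING $dir"
        return 2
    fi
    mode=$(stat -c '%a' "$dir")      # octal mode, including the sticky bit
    if [ "$mode" = "1777" ]; then
        echo "OK $dir ($mode)"
    else
        echo "BAD $dir ($mode, expected 1777)"
        return 1
    fi
}

# Example:
# check_sticky_dir /var/tmp
# check_sticky_dir /var/tmp/.oracle
```

If a directory is reported as BAD, something like chmod 1777 on it (run as root) should bring it back to the mode shown in the listing. The exact commands we ran at the time are not recorded here, so treat that as an assumption.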
