实施反馈系统有20分钟不可用,然后又自动恢复了。先查看alert日志,看到打开文件数不够,系统已经运行几年了,怎么可能呢。
Non critical error ORA-48180 caught while writing to trace file "/u01/app/ora/diag/rdbms/nwzcdb/nwzcdb2/trace/nwzcdb2_ora_195339.trc"
Error message: Linux-x86_64 Error: 23: Too many open files in system
检查数据库服务器的配置,ulimit -a ,发现oracle hard nofile 65536,应该是足够大的。
|
Snap Id | Snap Time | Sessions | Cursors/Session | Instances |
---|---|---|---|---|---|
Begin Snap: | 13158 | 24-3月 -17 09:00:27 | 1531 | 7.5 | 2 |
End Snap: | 13159 | 24-3月 -17 10:00:38 | 1810 | 10.0 | 2 |
Elapsed: | 60.18 (mins) | ||||
DB Time: | |
32,066.54 (mins) | |
|
gc buffer busy acquire是当session#1尝试请求访问远程实例(remote instance) buffer,但是在session#1之前已经有相同实例上另外一个session#2请求访问了相同的buffer,并且没有完成,那么session#1等待gc buffer busy acquire。
gc buffer busy release是在session#1之前已经有远程实例的session#2请求访问了相同的buffer,并且没有完成,那么session#1等待gc buffer busy release。
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
gc buffer busy acquire | 288,206 | 568.5K | 1972 | 30.5 | Cluster |
enq: TX - index contention | 11,889 | 325K | 27335 | 17.5 | Concurrency |
buffer busy waits | 54,302 | 217.9K | 4012 | 11.7 | Concurrency |
gc buffer busy release | 228,772 | 200.8K | 878 | 10.8 | Cluster |
gc current grant busy | 251,301 | 73.6K | 293 | 4.0 | Cluster |
gc current block congested | 109,356 | 71.2K | 651 | 3.8 | Cluster |
gc cr block congested | 25,922 | 69.2K | 2669 | 3.7 | Cluster |
gc cr grant congested | 30,967 | 42.1K | 1360 | 2.3 | Cluster |
我认为可能是两个原因造成的:
1. 低效sql,逻辑读过大,且访问频繁,造成争用严重。
2. 数据库IO资源紧张,导致一些频繁访问的sql语句响应慢,造成gc buffer busy acquire,gc buffer busy release等待事件。
定位是否是原因1的问题,就找Segments by Global Cache Buffer Busy。然后根据对象的名称去找对应的sql,然后查看sql的执行计划定位问题。
Segments by Global Cache Buffer Busy
- % of Capture shows % of GC Buffer Busy for each top segment compared
- with GC Buffer Busy for all segments captured by the Snapshot
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | GC Buffer Busy | % of Capture |
---|---|---|---|---|---|---|
LCSC | LCSC_DATA | INDEX_LOG_UO_OPERATE_TIME | INDEX | 266,048 | 36.39 | |
LCSC | LCSC_DATA | SS_SECURITY_RESPONSIBILITY | TABLE | 112,425 | 15.38 | |
DIS_TRANSFER | P_0501 | TABLE PARTITION | 62,460 | 8.54 | ||
LCSC_DATA | INDEX_LOG_FUN_OPER_DATE | INDEX | 27,816 | 3.80 | ||
P_0507 | 27,775 | 3.80 |
定位是否是问题2造成,先查看数据库IO的整体情况,如果是RAC,多个节点都要看,因为RAC是共享存储,消耗IO总量是多个节点之和。如果如下图所示,相比数据库正常的时刻是非常大的。
如何判断是否是问题2影响了问题1,就查看问题1找到的sql是否有消耗IO,如果有,则有影响。
IOStat by Function summary
- 'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,P are in multiples of 1000
- ordered by (Data Read + Write) desc
Function Name | Reads: Data | Reqs per sec | Data per sec | Writes: Data | Waits: Count | Avg Tm(ms) | ||
---|---|---|---|---|---|---|---|---|
Direct Reads | 475.2G | 209.64 | 134.921 | 1.8G | 3.97 | .503M | 0 | |
Buffer Cache Reads | 34G | 327.39 | 9.649M | 0M | 0.00 | 0M | 616.6K | 9.91 |
Direct Writes | 5.8G | 1.72 | 1.646M | 22.6G | 33.85 | 6.418M | |
|
Others | 5G | 7.07 | 1.407M | 5.1G | 11.22 | 1.441M | 26.1K | 2.24 |
DBWR | 0M | 0.01 | 0M | 8.6G | 217.96 | 2.44M | 20 | 34.20 |
LGWR | 153M | 2.73 | .042M | 4.5G | 214.76 | 1.292M | 679.9K | 1.03 |
TOTAL: | 520.1G | 548.55 | 147.665 | 42.6G | 481.76 | 12.094M | 1322.6K | 5.20 |
Segments by Physical Reads
- Total Physical Reads: 67,312,485
- Captured Segments account for 81.6% of Total
%Total | ||||||
---|---|---|---|---|---|---|
LCSC | LCSC_DATA | FUNCTION_LOCATION | P_DEFAULT_SUB_P_0502 | TABLE SUBPARTITION | 12,612,706 | 18.74 |
LCSC | LCSC_DATA | FUNCTION_LOCATION | P_DEFAULT_SUB_P_0501 | TABLE SUBPARTITION | 7,703,111 | 11.44 |
P_DEFAULT_SUB_P_0505 | 7,258,626 | 10.78 | ||||
LCSC | LCSC_DATA | FUNCTION_LOCATION | P_DEFAULT_SUB_P_0503 | TABLE SUBPARTITION | 6,005,657 | 8.92 |
P_DEFAULT_SUB_P_0507 | 4,008,989 | 5.96 |