Analyzing a Hang in a .NET Industrial-Control Battery Testing System

Published: 2023/11/13

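I. Background

1. The story

A few days ago a friend reached out and said his windowed desktop application was hanging, and asked me to help find out why. To diagnose this kind of problem you need to capture a dump while the application is hung; once you have the dump, the analysis can begin.

II. Why It Hangs

1. Inspecting the main thread

When a windowed application hangs, the first thing to check is what the main thread is doing at that moment. The !clrstack command shows this.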


0:000:x86> !clrstack
os thread id: 0x4a08 (0)
child sp       ip call site
012fe784 0000002b [helpermethodframe_1obj: 012fe784] system.threading.waithandle.waitonenative(system.runtime.interopservices.safehandle, uint32, boolean, boolean)
012fe868 7115d952 system.threading.waithandle.internalwaitone(system.runtime.interopservices.safehandle, int64, boolean, boolean) [f:\dd\ndp\clr\src\bcl\system\threading\waithandle.cs @ 243]
012fe880 7115d919 system.threading.waithandle.waitone(int32, boolean) [f:\dd\ndp\clr\src\bcl\system\threading\waithandle.cs @ 194]
012fe894 711e89bf system.threading.waithandle.waitone(int32) [f:\dd\ndp\clr\src\bcl\system\threading\waithandle.cs @ 220]
012fe89c 6fb186b8 system.threading.readerwriterlockslim.waitonevent(system.threading.eventwaithandle, uint32 byref, timeouttracker, enterlocktype)
012fe8e0 6fb17892 system.threading.readerwriterlockslim.tryenterreadlockcore(timeouttracker)
012fe920 6fb17562 system.threading.readerwriterlockslim.tryenterreadlock(timeouttracker)
012fe94c 0325f49f xxx.quyitpjk0dxkr6iyqh(system.object)
012fe964 0325ee8a xxx.rwautolock..ctor(system.threading.readerwriterlockslim, boolean)
...

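Judging from the thread stack above, the main thread appears to be stuck in TryEnterReadLock on a ReaderWriterLockSlim. Under reader-writer lock rules a reader only waits when a writer is involved, so most likely some thread entered the write lock and has not come out. The next step is to find the thread holding the lock.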
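To make the blocking rule concrete, here is a minimal sketch that reproduces the same shape of hang in isolation. The class and method names are hypothetical and are not taken from the dumped application; the writer thread simply parks itself to stand in for a slow operation. Calling TryEnterReadLock with a timeout lets the reader give up instead of waiting forever.

```csharp
using System;
using System.Threading;

public static class ReadLockBlockDemo
{
    // Returns true if the read lock could be taken within timeoutMs
    // while another thread is still holding the write lock.
    public static bool TryReadWhileWriterHolds(int timeoutMs)
    {
        var rwLock = new ReaderWriterLockSlim();
        using (var writerEntered = new ManualResetEventSlim(false))
        using (var releaseWriter = new ManualResetEventSlim(false))
        {
            // Stand-in for the background thread that holds the write lock
            // while it is stuck in a slow operation.
            var writer = new Thread(() =>
            {
                rwLock.EnterWriteLock();
                try
                {
                    writerEntered.Set();
                    releaseWriter.Wait(); // simulate work that does not return
                }
                finally
                {
                    rwLock.ExitWriteLock();
                }
            });
            writer.Start();
            writerEntered.Wait();

            // Stand-in for the main thread: the read lock is unavailable
            // for as long as the writer holds the lock.
            bool acquired = rwLock.TryEnterReadLock(timeoutMs);
            if (acquired) rwLock.ExitReadLock();

            // Let the writer finish so the lock can be disposed cleanly.
            releaseWriter.Set();
            writer.Join();
            rwLock.Dispose();
            return acquired;
        }
    }
}
```

With an untimed wait in place of the timeout, the calling thread would stay blocked until the writer exits, which is the situation the main thread's stack shows.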

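2. Who is holding the lock

For a plain lock, many readers will know the !syncblk command. But which command works for a reader-writer lock? Honestly I am not sure either, so the only option is to dig into the ReaderWriterLockSlim object itself and see whether anything turns up.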


0:000:x86> !dumpobj /d 03526f38
name:        system.threading.readerwriterlockslim
methodtable: 6f947428
eeclass:     6f9a92dc
size:        72(0x48) bytes
file:        c:\windows\microsoft.net\assembly\gac_msil\system.core\v4.0_4.0.0.0__b77a5c561934e089\system.core.dll
fields:
      mt    field   offset                 type vt     attr    value name
70da878c  40004aa       38       system.boolean  1 instance        0 _fisreentrant
6f92fa28  40004ab       3c ...lockslim spinlock  1 instance 03526f74 _spinlock
70dfba4c  40004ac       1c        system.uint32  1 instance       20 _numwritewaiters
70dfba4c  40004ad       20        system.uint32  1 instance        1 _numreadwaiters
70dfba4c  40004ae       24        system.uint32  1 instance        0 _numwriteupgradewaiters
70dfba4c  40004af       28        system.uint32  1 instance        0 _numupgradewaiters
6f93d764  40004b0       39          system.byte  1 instance        0 _waiterstates
70da42a8  40004b1       2c         system.int32  1 instance       -1 _upgradelockownerid
70da42a8  40004b2       30         system.int32  1 instance       11 _writelockownerid
70da6924  40004b3        c ...g.eventwaithandle  0 instance 034844d0 _writeevent
70da6924  40004b4       10 ...g.eventwaithandle  0 instance 042a69c8 _readevent
70da6924  40004b5       14 ...g.eventwaithandle  0 instance 00000000 _upgradeevent
70da6924  40004b6       18 ...g.eventwaithandle  0 instance 00000000 _waitupgradeevent
70da150c  40004b8        4         system.int64  1 instance 367 _lockid
70da878c  40004ba       3a       system.boolean  1 instance        0 _fupgradethreadholdingread
70dfba4c  40004bc       34        system.uint32  1 instance 3221225472 _owners
70da878c  40004c2       3b       system.boolean  1 instance        0 _fdisposed
70da42a8  40004a9      4dc         system.int32  1   static        4 processorcount
70da150c  40004b7      4d4         system.int64  1   static 1882 s_nextlockid
6f942b7c  40004b9        0 ...readerwritercount  0 tlstatic  t_rwc
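As a cross-check, the `_owners` value in this dump can be decoded by hand. The sketch below assumes the bit layout used in the public ReaderWriterLockSlim reference source (writer-held flag in the top bit, then waiting-writers and waiting-upgrader flags, with the reader count in the low 28 bits); the helper class itself is hypothetical and the layout may differ in other runtime builds.

```csharp
using System;

public static class OwnersDecoder
{
    // Assumed bit layout, following the ReaderWriterLockSlim reference source.
    const uint WriterHeld = 0x80000000;
    const uint WaitingWriters = 0x40000000;
    const uint WaitingUpgrader = 0x20000000;
    const uint ReaderMask = 0x10000000 - 1;

    // Turns the raw owners field into a readable summary.
    public static string Describe(uint owners)
    {
        return string.Format(
            "writerHeld={0}, waitingWriters={1}, waitingUpgrader={2}, readers={3}",
            (owners & WriterHeld) != 0,
            (owners & WaitingWriters) != 0,
            (owners & WaitingUpgrader) != 0,
            owners & ReaderMask);
    }
}
```

Under that assumption, 3221225472 is 0xC0000000, which decodes to a held write lock with writers waiting and no active readers. That agrees with `_numwritewaiters` being 20 in the same dump.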

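Reading this alongside the source code, `_writelockownerid=11` is the managed id of the thread that holds the lock. Once the holder is known the rest is easy: map managed id 11 to its debugger thread id and inspect that thread.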


0:000:x86> !t
  13   11 47bc 0a0702c0   1029220 preemptive  00000000:00000000 01425ed0 0     mta (threadpool worker) 
0:013:x86> !clrstack
os thread id: 0x47bc (13)
child sp       ip call site
07e4f1ac 0000002b [inlinedcallframe: 07e4f1ac] 
07e4f1a4 09e38597 domainboundilstubclass.il_stub_pinvoke(intptr)
07e4f1ac 09e38334 [inlinedcallframe: 07e4f1ac] system.data.sqlite.unsafenativemethods.sqlite3_step(intptr)
07e4f1dc 09e38334 system.data.sqlite.sqlite3.step(system.data.sqlite.sqlitestatement)
07e4f228 09e36fe8 system.data.sqlite.sqlitedatareader.nextresult()
07e4f250 09e36ceb system.data.sqlite.sqlitedatareader..ctor(system.data.sqlite.sqlitecommand, system.data.commandbehavior)
07e4f270 09e367ce system.data.sqlite.sqlitecommand.executereader(system.data.commandbehavior)
07e4f284 09e36732 system.data.sqlite.sqlitecommand.executenonquery(system.data.commandbehavior)
07e4f2b0 09e366e6 system.data.sqlite.sqlitecommand.executenonquery()
07e4f2bc 09e350dc sqlsugar.adoprovider.executecommand(system.string, sqlsugar.sugarparameter[])
07e4f388 13189518 sqlsugar.insertableprovider`1[[system.__canon, mscorlib]].executecommand()
07e4f420 0181ac4a xxx.operatelog d__8.movenext()
...
0:013:x86> k
cvregtomachine(x86) conversion failure for 0x14f
x86machineinfo::setval: unknown register 0 requested
 # childebp retaddr      
00 07e4ede0 76c9ad10     ntdll_76ed0000!ntflushbuffersfile+0xc
01 07e4ede0 6b27af8c     kernelbase!flushfilebuffers+0x30
warning: stack unwind information not available. following frames may be wrong.
02 07e4edf0 6b270256     sqlite_interop!si768767362ea03a94+0xf73c
03 07e4ee1c 6b267938     sqlite_interop!si768767362ea03a94+0x4a06
04 07e4ee38 6b2599e1     sqlite_interop!si83d1cf4976f57337+0x84c8
05 07e4ee80 6b25902b     sqlite_interop!sia3401e98cbad673e+0x3201
06 07e4ee98 6b25258c     sqlite_interop!sia3401e98cbad673e+0x284b
07 07e4f168 6b255a05     sqlite_interop!si327cfc7a6b1fd1fb+0x633c
08 07e4f19c 09e38597     sqlite_interop!si9c6d7cd7b7d38055+0x255

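Combined with the lock information above, the picture becomes clear: the code writes to SQLite while holding the write lock, and the SQLite write is stuck in the buffer-flush function NtFlushBuffersFile, whose signature is: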


NTSTATUS NtFlushBuffersFile(
  HANDLE          FileHandle,
  IO_STATUS_BLOCK *IoStatusBlock
);

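Some readers may want to see how the code was actually written, so I did a quick decompile (the screenshot of the decompiled code is not reproduced here).

At this point the cause is basically clear: thread 13 holds the write lock, so the main thread waits a long time when it tries to take the read lock for its own SQLite operation.

The fix is fairly simple: keep the main thread to UI updates as much as possible and do not let it take locks of any kind, otherwise there is always a chance it ends up waiting on one.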
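One way to follow that advice is to move the locked read onto a thread-pool thread and let the UI code await the result. The helper below is a hypothetical sketch, not the application's code: `ReadAsync` and its parameters are names introduced here for illustration. The lock is entered and exited inside the same delegate, which respects the thread affinity of ReaderWriterLockSlim.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class OffUiThreadRead
{
    // Hypothetical helper: runs `read` under the read lock on a thread-pool
    // thread, so the caller only awaits and never blocks on the lock itself.
    public static Task<T> ReadAsync<T>(ReaderWriterLockSlim rwLock, Func<T> read)
    {
        return Task.Run<T>(() =>
        {
            rwLock.EnterReadLock();
            try
            {
                return read();
            }
            finally
            {
                rwLock.ExitReadLock();
            }
        });
    }
}
```

In a UI event handler this would be used as `var rows = await OffUiThreadRead.ReadAsync(rwLock, LoadRows);`, where `LoadRows` stands for whatever query the form needs. If the writer is slow, the pool thread waits and the message loop keeps running.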

3. What happened to NtFlushBuffersFile

Some readers may ask why NtFlushBuffersFile hangs and does not return. To look for an answer, check the disassembly.


0:013:x86> uf ntdll_76ed0000!ntflushbuffersfile
ntdll_76ed0000!ntflushbuffersfile:
76f41ad0 b84b000000      mov     eax,4bh
76f41ad5 ba7071f576      mov     edx,offset ntdll_76ed0000!wow64systemservicecall (76f57170)
76f41ada ffd2            call    edx
76f41adc c20800          ret     8
0:013:x86> u 76f57170h
ntdll_76ed0000!wow64systemservicecall:
76f57170 ff252892ff76    jmp     dword ptr [ntdll_76ed0000!wow64transition (76ff9228)]
0:013:x86> u 76ec7000
wow64cpu!kifastsystemcall:
76ec7000 ea0970ec763300  jmp     0033:76ec7009
76ec7007 0000            add     byte ptr [eax],al
76ec7009 41              inc     ecx
76ec700a ffa7f8000000    jmp     dword ptr [edi+0f8h]

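From the assembly, NtFlushBuffersFile enters kernel mode through KiFastSystemCall. A user-mode dump cannot show kernel-mode state, so the trail cannot be followed any further.

It is still worth looking at this thread's last GetLastError() value with the !gle command, which may offer a hint.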


0:013:x86> !gle
lasterrorvalue: (win32) 0x26 (38) - 
laststatusvalue: (ntstatus) 0xc0000008 -

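Looking up these status codes in the Microsoft documentation: Win32 error 38 (0x26) is ERROR_HANDLE_EOF (reached the end of the file), and NTSTATUS 0xC0000008 is STATUS_INVALID_HANDLE (an invalid handle was specified).

Judging by these descriptions, something may be wrong with the SQLite file: one code reports an invalid handle and the other reports reaching the end of the file. I suspect the file was damaged during a SQLite operation.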

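Looking back, if the goal is concurrent reads and writes against SQLite, enabling write-ahead logging (WAL) mode should be enough: readers can then proceed while a write is in progress, and SQLite serializes writers itself, so the program does not need its own reader-writer locking.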

So the final recommendations are:

Enable WAL mode
Remove the application-level read/write locking
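As a rough sketch of the first recommendation, WAL can be switched on by issuing `PRAGMA journal_mode=WAL;` once against the database file. The example assumes the System.Data.SQLite package that appears in the stack trace; the class name, method name, and database path are hypothetical, and in the real application the connection is created by SqlSugar, so the pragma would have to be run through whatever connection that layer exposes.

```csharp
using System.Data.SQLite;

public static class WalSetup
{
    // Opens the database, asks SQLite to switch to WAL, and returns the
    // journal mode SQLite reports back ("wal" when the switch succeeded).
    public static string EnableWal(string dbPath)
    {
        using (var conn = new SQLiteConnection("Data Source=" + dbPath))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "PRAGMA journal_mode=WAL;";
                return (string)cmd.ExecuteScalar();
            }
        }
    }
}
```

The WAL setting is stored in the database file, so it stays in effect for later connections. Checking the returned value is worthwhile because SQLite reports the mode actually in use rather than failing when the switch cannot be made.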

III. Summary

This hang was an interesting one: I got more familiar with ReaderWriterLockSlim and gained a new understanding of SQLite.
