News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Win7 where is the PSW (CS:IP) of a thread on the ready Q in a BSOD?

Started by MtheK, May 14, 2015, 10:15:40 AM

Previous topic - Next topic

MtheK

  This is a debugging question.

  In the IBM mainframe, when a TCB (thread) is waiting for a CPU
(processor), the RB control block contains the PSW (CS:IP); that is,
the next instruction to execute when it gets back control after it
was interrupted.

  What is the PC equivalent? Using WinDbg, '!ready 6' shows my thread
waiting to run (ie: not waiting):

Processor 1: Ready Threads at priority 6
    THREAD 865fad48  Cid 0d90.0d94  Teb: 7ffde000 Win32Thread: 00000000 ????
    Win32 Start Address 0x00401000

but the stack appears to be wrong (missing virtual storage?) where my
PSW probably is:

    ChildEBP RetAddr  Args to Child
    82db9a18 8309ad5a 871058dc 00000000 00000006 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
    82db9a50 8309bc27 865fad48 871058d4 82db9aa8 nt!KiCheckForThreadDispatch+0x121
    82db9a60 830cf34d 807c4120 87105800 00000000 nt!KiProcessDeferredReadyList+0x28
    82db9aa8 8328a2e2 871058a8 00000006 00000006 nt!KeSetPriorityAndQuantumProcess+0x339
    82db9ac0 8324a811 871058a8 00000006 98964d03 nt!PsSetProcessPriorityByClass+0x1f
    82db9d1c 8304f1ea ffffffff 00000012 0012ff60 nt!NtSetInformationProcess+0x21d
    82db9d1c 77c570b4 ffffffff 00000012 0012ff60 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 82db9d34)
WARNING: Frame IP not in any known module. Following frames may be wrong.
    0012ff54 00000000 00000000 00000000 00000000 0x77c570b4

almost ALL PIDs show this same WARNING failure.

  It is NOT in '!running -it', nor in '!process', so my program did
not cause (was not tagged with) the failure. I've since removed the
failed code from my \STARTUP (a BlueTooth driver (BTTray.exe) from
Broadcom) which I don't use.

  However, for some reason, I have a record of NULLS (00h) instead of
my startup record. Worse, the length is apx 5% more than what I would
write. The only other evidence I have is a file opened with
FILE_FLAG_NO_BUFFERING which I deliberately designed to remain if not
cleaned up, which has the current TOD, which is 6 seconds B 4 the TOD
in the BSOD dump.

  I'm not sure if this NULL record is due to contention between my
threads (tho I use a "spin lock" to "ENQ" the write,etc), or if cache
wasn't written and thus caused LOSS OF DATA.

  From what I can tell, based on a '!handle 0 b 0x0d90' for my PID,
I reached the point in my code where I created an Event:

0030: Object: 8661cdd0  GrantedAccess: 001f0003 Entry: 98503060
Object: 8661cdd0  Type: (84ff71f0) Event
    ObjectHeader: 8661cdb8 (new version)
        HandleCount: 1  PointerCount: 2
        Directory Object: 940b6af8  Name: DSNTODAYXEEEX004...

For this to exist, this tells me that I already opened, wrote a
startup record, and closed the affected file, and was/is attaching
a new .BAT via CreateProcess.

  The '!handle' also shows a Thread and a Process right after the
Event, which may be mine:

0034: Object: 8710ebe0  GrantedAccess: 001fffff Entry: 98503068
Object: 8710ebe0  Type: (84f7f8b0) Thread
    ObjectHeader: 8710ebc8 (new version)
        HandleCount: 3  PointerCount: 5

0038: Object: 8664ad40  GrantedAccess: 001fffff Entry: 98503070
Object: 8664ad40  Type: (84f7f978) Process
    ObjectHeader: 8664ad28 (new version)
        HandleCount: 3  PointerCount: 18

but I can't find any of those addresses in a '!process 0 15'.

  Running these:
dt nt!_EPROCESS 871058a8  Pcb.Header.
dt nt!_KWAIT_BLOCK 0x85f49d00 -l WaitlistEntry.Flink -y Thread
doesn't show any addresses near my 'Win32 Start Address'.

  I would REALLY like to know the CS:IP for my MASM program from this
dump so I know EXACTLY where I'm at, which may help explain the file
corruption.

  Would setting some kind of dump option somewhere fix the stack
WARNING for next time?

  Comments? Thankx...


dedndave

first, comparing an IBM mainframe to modern protected-mode processors is like comparing apples and oranges - lol
the operating system divides time into "slices", and assigns threads/CPU cores
the processor handles "context switching" when time slices begin and end

there is a structure that you can examine (a few cases, modify) pointed to by the FS register
it's called the "Thread Environment Block" (TEB), sometimes called the "Thread Information Block" (TIB)

http://undocumented.ntinternals.net/source/usermode/undocumented%20functions/nt%20objects/thread/teb.html
http://en.wikipedia.org/wiki/Win32_Thread_Information_Block

the TEB is used in Structured Exception Handling
to access it in "normal" code...
    ASSUME  FS:Nothing

;here, you can read TEB data

    ASSUME  FS:ERROR


CS is no longer a "segment", per se - it's a pointer to a table entry
you need not generally be concerned with the CS register
IP is EIP in 32-bit world   :P

MtheK

  That machine only had 'kernel memory dump', not 'complete memory dump'
like my other machine (in CP/System/...startup and recovery settings).
Perhaps that's why almost all of the other processes also has this garbage msg!?
That msg is ALL I SEE in Google talking about MEMORY.dmp, yet no one seemed
to care about it or take notice. Perhaps it wasn't important to them...
  I think the TEB isn't valid either since it's part of user space, not kernel space.
Of course, now, I probably won't get another BSOD to test it. Any ideas on how
to deliberately cause a BSOD so I can test the dump and see if my stacks now show correctly? Can WinDbg zap live system-protected memory?


MtheK

  That worked perfectly; THANK YOU!
. I had to give permission to open the dump; didn't do that B 4 for the "real" one
. the size, as expected, went from 230M to 1.6G, but so what, I got my data which
is more important to me:

1: kd> !process 0 1f link.exe
...

        9b16bd24 770370b4 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 9b16bd34)

        0012fedc 770357d4 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
        0012fee0 75431876 ntdll!ZwDelayExecution+0xc (FPO: [2,0,0])
        0012ff48 75431818 KERNELBASE!SleepEx+0x65 (FPO: [Non-Fpo])
        0012ff58 00405c50 KERNELBASE!Sleep+0xf (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
        0012ff94 770537f5 LINK+0x5c50
        0012ffd4 770537c8 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
        0012ffec 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

and my displacement ("PSW") matches my code perfectly:

00004C4B  E8 00000000 E   *       call   Sleep
              1            INVOKE Sleep,SLEEPONEI   
              1     .endif
00004C50         *@C04CE:

  Unfortunately, my file was NOT corrupted this time (got my startup
record), but that's OK, I'll just keep trying until it does or whatever.
Perhaps this corruption is timing-related...

  Thankx again!




dedndave

you might try looking at the Event Log in Administrative Tools

MtheK

  Actually, the EV1001 was my 2nd clue, but, since I wasn't at fault,
it had nothing of my program. Nothing else was there, including the
3rd clue, the missing WTS Event  history logs, yet my program was
obviously running, and at the point of the '!handle' Event (a unique
name and DSN), so WTS did indeed start me, but with no logs of the
event as it normally does, as per my APPEVTOD.bat and EV itself.
Another clue about cache possibly going bad...


MtheK

Addendum:

  Nothing else was there for 17:12:xx w/a span of 300 seconds (5min):
. TOD of .dmp 'Debug session time' (17:12:04)
. TOD jumps across, a la Back to the Future ("it skipped over that minute")
    most go from 17:09 to 17:13
. no System, Application, Security (the EV1001 is 17:13 after re-boot)
. no WTS (normally s/b a dozen or so)
. 19 Apps logs, but none w/17:12

  Per DSNTODAY, no left over/new DSNs w/a date&time of 17:12...



MtheK

  Addendum #2:

  I detected 2 more corrupted files. Same program, different DIRs,
different program from the first. I had 4 of my programs running at
the time; 1 does not write files. In particular, the 3rd file was
corrupted w/the exact same size as my startup record, unlike the
other 2. This tells me that my ABENDXIT thread wasn't involved as
its' record is smaller, so no conflict btx my 2 threads (if even
driven). Further, this particular WTS run can't conflict w/other
processes. Finally, since this record has always been fine B 4
and after the BSOD, it implies, again, a caching problem causing
loss of data. Oh joy...

MtheK

  Final addendum:

  Another file corruption found. Same pattern, but this time,
instead of my 2 programs, it's the .BAT itself! For this file,
my programs should not be able to affect the ECHO of NOTIFY2.bat:

echo @@@@@ %~n0 %date% %time% STARTUP %USERNAME% %USERDOMAIN%  >>%NOTIFYwork%NOTIFY.txt

(probably). Some other system-level process must have caused
all these corruptions.

  I've since changed my other critical file (when I was racing the
force-killer) to use FILE_FLAG_NO_BUFFERING, which SHOULD have kept
evidence of my startup, but didn't. It appears that, other than the
BSOD .dmp, only my 1 MONENQ file survived this explosion of data.

* details:



NOTIFY.bat (both programs in .dmp)
  call LINK.bat "" 2              >>%NOTIFYwork%NOTIFYL.txt
    call SLEEP.bat -1 60          >>%NOTIFYwork%NOTIFY.txt
                                                / <== corrupted:

* prior run
                     0D0A4040404040204E4F       ..@@@@@ NO
000EA040 54494659322053756E2030352F31302F TIFY2 Sun 05/10/
000EA050 323031352031373A30393A32332E3736 2015 17:09:23.76
000EA060 20535441525455502051434432205143  STARTUP QCD2 QC
000EA070 44322D504320200D0A20534C45455020 D2-PC  .. SLEEP 
000EA080 53756E2030352F31302F323031352031 Sun 05/10/2015 1
000EA090 373A30393A32352E343720416C6C204F 7:09:25.47 All O
000EA0A0 4B202831207365636F6E643B20636F75 K (1 second; cou
000EA0B0 6C642062652064656661756C74292031 ld be default) 1
000EA0C0 202E0D0A                          ...             fixed length = x'8D' = 141
                 53485554444F574E20303030     SHUTDOWN 000
000EA0D0 30303035343234613030303030303534 0005424a00000054
000EA0F0 31352F30352F31302031373A31303A33 15/05/10 17:10:3
000EA100 322E30323620633A5C4441445C534348 2.026 c:\DAD\SCH
...
000EA1D0 2062792043746C2D4321202F20393934  by Ctl-C! / 994
000EA1E0 350D0A                           5..              variable length = x'11E' = 286

* start of corruption
               00000000000000000000000000    .............
000EA1F0 00000000000000000000000000000000 ................
DUPLICATE
000EA4A0 000000000000                     ......           fixed+variable length = x'2C2/4' = 706/8
                     0D0A                       ..         * where did this come from?
                                                           * doesn't match the pattern of the
                                                           * test: 1 CRLF, 5 @ signs = 7 bytes
                                                           * maybe added by some recovery?
* end of corruption

                         0D0A404040404020         ..@@@@@ 
000EA4B0 4E4F54494659322053756E2030352F31 NOTIFY2 Sun 05/1
000EA4C0 302F323031352031373A31333A33352E 0/2015 17:13:35.
000EA4D0 39352053544152545550205143443220 95 STARTUP QCD2 
000EA4E0 514344322D504320200D0A20534C4545 QCD2-PC  .. SLEE
000EA4F0 502053756E2030352F31302F32303135 P Sun 05/10/2015
000EA500 2031373A31333A33372E373820416C6C  17:13:37.78 All
000EA510 204F4B202831207365636F6E643B2063  OK (1 second; c
000EA520 6F756C642062652064656661756C7429 ould be default)
000EA530 2031202E0D0A                                      fixed length = x'8D' = 141
...


P.S. Goodbye again Kara.


MtheK

FYI

  I tried turning off write-caching on the boot drive via CP/System,
but files still get null-corrupted, as well as being MUCH slower
(BSOD took 5min instead of 1m; after re-IPL (boot), logon took 5m
instead of a minute or less).

  Thanks to being able to take a manual BSOD now, I can re-create
this corruption at will as long as I take it at a precise time.  In
my case, after IPL, but B 4 the 1st DSNTODAY runs (so it's slower
since not cached yet), let it start. The first run takes about a
minute. After 15sec, take the manual BSOD. After re-IPL, I check for
corrupted files, finding these EVERY(!) time:

. 2 .bat re-direction STDOUT new files
  . DSNTODAY *AB from ENQDEQ STDOUT
      length correct, but NULLS
  . LINKWTS *AB from .bat ECHO ("calling...")
      length correct, but NULLS

. my "heartbeat" file (used when racing the force-killer; it
opens/clears/writes/closes a record every second).  Although the file
has NULLS, in the .dmp memory, my buffer and length are perfect
(shows the last write w/TOD, apx 1s B 4 the BSOD). Also, after re-IPL,
if this file is not empty, its' contents are copied to another file
which, in this case, was not corrupted at the BSOD (this time;
perhaps some cache timeout forced the I/O to complete?), but it did
copy the corrupted "heartbeat" file for a permanent record at re-start

. my normal DSNTODAY O/P file (corrupted at the end w/various huge sizes)

. my 4 critical files were NOT corrupted, including the 2 that I just
changed; now, they have an entry for each BSOD I took. But, still,
maybe some cache timeout expired and forced the I/O?

  Then, I turned off write-caching, re-IPL'd, and things were much
slower, but waited for the precise time and then took the BSOD. As
noted earlier, the BSOD was much slower and the subsequent logon was
much slower, showing that the turn-off worked as expected.

  However, it made no difference. The 4 files corrupted B 4 this
change were again corrupted the same way, while my 4 critical files
again were not.

  I tried waiting only 3sec (see BSOD06). Different corruption, but
corrupted in LINKJ.log; critical P.log still has entries. Missing z969
record in DSNTODAY.txt via STDOUT (no corruption, just nothing, yet
the .dmp says it s/b otherwise).

  Tried the 15s wait on my older, different-manufacturer Win7 machine
(apx 1yr). Seems like the same corruption, but slightly different at times:
. one time, the .bat O/Ps were corrupted, but my files were OK,
including a startup record
. another time it was reversed, where the .bat O/Ps were fine but my
"heartbeat" file was corrupted w/nulls. Worse, as my "heartbeat"
opens/clears/writes/closes this file every second, the file was the
previous write, even tho, in the .dmp, it was 1s newer (.4s B 4 the
BSOD, so most current (probably; for some reason, my PSW isn't even
in the stack when an 'INVOKE FindNextFile' is running: I had to set
a flag to indicate where in my code I am, which shows it doing this)):

LOGICPOS0 (ABENDHDR .51s B 4 BSOD)
004104a9 6b 7a 7a 7a 7a 7a 44 53 4e 54 4f 44 41 59 20 41  kzzzzzDSNTODAY A
004104b9 42 45 4e 44 45 44 21 3f 3f 3f 3f 69 32 30 31 35  BENDED!????i2015
004104c9 2f 30 35 2f 31 39 20 30 38 3a 32 39 3a 31 35 2e  /05/19 08:29:15.
004104d9 35 31 35 20 30 30 30 30 30 30 33 34 35 34 2b 30  515 0000003454+0
004104e9 30 30 30 30 31 31 31 37 39 3a 30 30 2e 34 33 36  000011179:00.436
004104f9 38 30 32 38 3b 30 30 30 30 30 30 30 30 30 31 2e  8028;0000000001.

b = i    B 4 SUBLOOP2

2 = k    SUBLOOP2/LOOPIT3/FINDNEXT      <== must be in here:
         MOV   LOGICPOS2,'k'

         INVOKE FindNextFile,SRHANDLE,           ; browse handle
                      OFFSET FINDDATA            ; output

         MOV   LOGICPOS2,'l'

3-7 z                                       all B 4 RET

so, somehow, the stack is "messed up" (my program and PSW s/b in the
stack but isn't?). '!ready 6' shows no ready threads, and '!running -it'
says all processors are idle. My own thread is in a WAIT (doesn't say
'running on processor n'):

(see QCD3/BSOD05.txt)

yet my program and my PSW show up OK on the original BSOD machine
(I'm in an 'INVOKE ReadFile'). !running and !ready say the same:

(see QCD2/BSOD07.txt).


  Finally, my normal DSNTODAY O/P file was corrupted at the end w/nulls
as well, tho very strangely. In one case, the corruption was 4K-aligned,
relative to the begin of the file, and was corrupted for x'1EEEB'
(126699) bytes, WAY PAST my record size of a half-K! But, on the other
machine, it didn't seem to be aligned, and had about a half-M corruption!

  In summary, I can now duplicate this at will and get all 4 files
corrupted the same way on each machine, with startup records (to match
PIDs in the .dmp), tho sometimes my PSW for my program shows up
(depending on the machine (the newer has shown my PSW in my program
every time (so far))).

  This gives me 30min (next WTS interval) to examine the raw data B 4
(if) they get replaced (actually, only 1 is replaced; the others are
either "unique" .bat DSNs or copied). So far, none of my "critical"
(FILE_FLAG_NO_BUFFERING) files have been corrupted at all in about a
dozen manual BSODs.

  Crazy! Perhaps some faulty BIOS caching (if any) similar to each
machine?...


sinsi


MtheK

  PROCMON is a great tool; I use it all the time when debugging all
my normal programs, tho I have to run thru hoops to ensure it stays
active long enuf to trace my programs to completion during shutdown
(using my TESTFK, and 'INVOKE SetProcessShutdownParameters' in all
my programs, all w/the proper values, which works in the order I
want them all to shut down, even PROCMON (indirectly)).

  Unfortunately, when running B 4 the BSOD, un-filtered, after re-IPL,
it gives that dreaded 'The file 'xxx' was not closed cleanly during capture
and is corrupt' msg. Unless u know how to get around that and open
it anyway, I'm outta luck there, which is why the BSOD .dmp is so
important; it's all I have except my critical files...


MtheK

  I made a super-simple .BAT where its' re-directed STDOUT from ECHO gets
null-corrupted, all by itself. No waiting, no co-ordination, no programs, no nothing. Just run it and take the manual BSOD a few seconds later, re-IPL, and check the
file...



@ECHO OFF

set LINKWTSdate=%date:/=%
set LINKWTSdate=%LINKWTSdate: =.%
set LINKWTStime=%time::=%
set LINKWTStime=%LINKWTStime:.=%
set LINKWTStime=%LINKWTStime: =0%

echo  %~n0 %date% %time% calling...     >>"LINKWTSAbended%LINKWTSdate%%LINKWTStime%.log"

rem   This will run for a few minutes until it gets forced out by DOS;
rem more than enuf time to take a manual BSOD at any time after it starts.
rem After re-IPL, check if the new DSN is null-corrupted...

rem   In my case, I waited 15sec. Sure enuf, this O/P is null-corrupted.
rem If I look at it normally, say via NOTEPAD while running, it's fine...
rem 05/20/2015  08:22 AM                50 LINKWTSAbendedWed.0520201508221377.log
rem 00000000 00000000000000000000000000000000 ................
rem DUPLICATE
rem 00000030 0000EEEEEEEEEEEEEEEEEEEEEEEEEEEE ................
rem 0000000000000000


:LOOPUNTILFORCED
GOTO LOOPUNTILFORCED



MtheK

  Hhhmmm...I guess data integrity is a secondary issue in the PC world,
un-like the mainframe. I found this:

https://social.msdn.microsoft.com/Forums/vstudio/en-US/eb9f1ed1-8c2c-496e-9b5a-f6640c3374c4/object-serialization-corruption-file-bytes-set-to-null?forum=csharpgeneral

  Some snippets:



I am seeing, ~85% of the time when a hard-lock occurs my recovery files upon reboot contain
all null bytes. The file size is identical between an OK serialized file vs. a corrupted file.
I have verified prior to programming the hard-locking control the recovery files on disk are
OK, but once the platform hard-locks the files become all null values.

The data might not have been fully written to disk at the time the BSOD occurred.

When you construct the FileStream you use for writing, pass FileOptions.WriteThrough to
bypass the write cache.  Keep in mind that this will slow down the writing.

Even with this option, you could still be unlucky.  You may need to consider keeping the
last n recovery files and having the application work backwards to find a good one (if this
is reasonable for your application).

No I am not setting FileOptions.WriteThrough. I will try this.

This solution seems to be working. Initial QA looks promising. After 10 hard-locks, the
system was able to recover each time successfully.
No file corruption has been observed thus far. I will continue QA, but this solution looks
very very promising.



  This is basically my issue (null-corrupted files after BSOD). I think the
"WriteThrough" is my "FILE_FLAG_NO_BUFFERING". It's still weird that turning off
write-caching on the boot drive didn't fix the .BAT ECHO either. I guess the
answer is NOT to use the PC if you want reliable data!

  Funny! End Of Line.