MSM2 Lecture #8 - Troubleshooting Boot Up

Materials:
Working complete PC
Several PCs prepared by the instructor to fail booting up
Student bootable floppy diskette - "New Boot A Version 2"
Student bootable CD-ROM
Objectives:
Learn how to determine where the PC is failing within the boot sequence,
Learn how to troubleshoot each boot sequence phase,
Learn the various boot up errors and what they mean,
Be able to use the standard troubleshooting framework.
Competency:
The student will be able to troubleshoot a system that is failing to boot up completely. The student will be able to determine within which phase of the boot sequence the failure is occurring. The student will be able to recognise standard boot errors, including what phase of the boot sequence they come from, what typically causes them, how to go about verifying the causes, and how to implement the solutions. The student will be introduced to the standard troubleshooting framework.
Preparation
  1. This is a fairly large lecture module. The instructor should prepare several systems with various problems that will cause and demonstrate the errors addressed in this module. Many problems will also be introduced by the student as the exercise progresses.

Procedures
  1. In dealing with any troubleshooting scenario, not just the PC, this general structured framework should be followed:

    1. Determine when the problem first occurred.
    2. Based on when the problem first occurred, try to isolate a probable cause.
    3. Attempt to boot the system and observe the behavior and error messages.
    4. Record the error message(s) and turn the system off as quickly as possible.
    5. Consult reference material and verify the nature of the error message: in what phase of the boot sequence it occurs, what issues the message, and what the common causes of the error are.
    6. Formulate a hypothesis regarding the cause of the error, then investigate and confirm it.
    7. Formulate a plausible solution for the error.
    8. Make written notes of the proposed solution.
    9. Implement the proposed solution.
    10. Test the system and verify that the solution worked; if it did not, return to step 6 and continue.
    11. Finalize the documentation of the solution for future reference.
  2. The student is already familiar with the PC boot sequence and is aware that the system runs a Power-On Self Test immediately upon CPU reset. If the POST does NOT encounter an error condition, it will display a summary at the end of the POST listing all known and tested devices, beep the system speaker once (if connected and working!), and then proceed to run the BIOS boot strap loader phase of the boot up process. However, if any component or peripheral fails the POST, the system will issue an error message and in most cases immediately halt the boot process. The POST issues error messages in the event that an error condition is encountered, but it also issues POST progress codes whether an error condition occurs or not. These POST codes will be explored later in the module. For now the student should learn the complete IBM PC standard POST error codes and their meanings:

    Code    Failure Reported
    01x     Unknown error
    02x     Power supply
    1xx     Motherboard (chipset)
    16x     CMOS Configuration (including dead CMOS battery)
    2xx     RAM memory error
    3xx     Keyboard
    4xx     Monochrome Display Adapter failure
    5xx     Color Graphics Adapter failure
    6xx     Floppy Disk Controller/Drive failure
    7xx     Math coprocessor failure
    8xx     (undefined)
    9xx     Parallel Port, LPT1 failure
    10xx    LPT2 failure
    11xx    Serial Port, COM1 failure
    12xx    COM2 failure
    13xx    Game Port failure
    14xx    Printer control failure
    15xx    SDLC Adapter failure
    16xx    Display emulator failure
    17xx    HDD subsystem failure
    1701    POST failure of HDD
    1702    HDD controller failure
    1780    HDD0 (master) failure
    1781    HDD1 (slave) failure
    86xx    PS/2 controller, mouse failure

    There are "century codes" for each number 18xx, 19xx, ... to over 100xx, but the ones presented in the list above are by far the most significant and common codes and the technician should know these.
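
    As a study aid, the table above can be turned into a small lookup. The following is a minimal Python sketch, not part of any BIOS; the function name describe_post_code and the dictionary names are hypothetical, and the descriptions are copied from the table in this lecture.

```python
# Code families from the lecture's POST table, keyed by the leading digits.
POST_FAMILIES = {
    1: "Motherboard (chipset)",
    2: "RAM memory error",
    3: "Keyboard",
    4: "Monochrome Display Adapter failure",
    5: "Color Graphics Adapter failure",
    6: "Floppy Disk Controller/Drive failure",
    7: "Math coprocessor failure",
    9: "Parallel Port, LPT1 failure",
    10: "LPT2 failure",
    11: "Serial Port, COM1 failure",
    12: "COM2 failure",
    13: "Game Port failure",
    14: "Printer control failure",
    15: "SDLC Adapter failure",
    16: "Display emulator failure",
    17: "HDD subsystem failure",
    86: "PS/2 controller, mouse failure",
}

# Individual codes that the table lists by exact value.
POST_SPECIFIC = {
    1701: "POST failure of HDD",
    1702: "HDD controller failure",
    1780: "HDD0 (master) failure",
    1781: "HDD1 (slave) failure",
}


def describe_post_code(code: int) -> str:
    """Return the table's description for a numeric POST error code."""
    if code in POST_SPECIFIC:
        return POST_SPECIFIC[code]
    if 10 <= code <= 19:      # 01x
        return "Unknown error"
    if 20 <= code <= 29:      # 02x
        return "Power supply"
    if 160 <= code <= 169:    # 16x is checked before the general 1xx family
        return "CMOS Configuration (including dead CMOS battery)"
    # Everything else is classified by its "century" (code divided by 100).
    return POST_FAMILIES.get(code // 100, "(undefined)")


for code in (162, 301, 1701, 1710):
    print(code, "->", describe_post_code(code))
```

    A code such as 1710 is not listed individually, so the sketch falls back to its 17xx family.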

  3. Now the student will induce problems on the system that will cause these errors to occur during the POST. It should be noted that on many later model motherboards the BIOS will display not only the raw POST error code but also the user friendly text meaning of the error code. However, the technician should never depend on this, especially on older systems.

  4. Boot to the DOS prompt and run DEBUG and enter the following commands:

    C:\>debug
    -o 70 17
    -o 71 17

    Reboot the PC at this point.

  5. Upon reboot the system should indicate an error with a text message similar to "CMOS Checksum failure". Some systems no longer display the "century" code for this; if the system does, it will display "162". The system also indicates the key to press to enter the BIOS Setup Utility to solve the problem. This DEBUG script can be used to bypass a CMOS password that is preventing the Setup Utility from running because it asks for a password that is not known. Because the script invalidates the CMOS contents, all settings in the CMOS are lost, so it should only be used as a last resort.
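
    Why a single byte written through ports 70h/71h produces a checksum failure can be illustrated with a small simulation. This is a sketch under stated assumptions: it assumes the AT-style CMOS layout in which bytes 10h through 2Dh are summed and the 16-bit total is stored at 2Eh (high byte) and 2Fh (low byte); individual BIOSes may differ. The function names are hypothetical, and the code only models the CMOS RAM in a Python bytearray; it does not touch real hardware.

```python
CHECKSUM_FIRST, CHECKSUM_LAST = 0x10, 0x2D   # assumed checksummed range
CHECKSUM_HI, CHECKSUM_LO = 0x2E, 0x2F        # assumed checksum storage bytes


def compute_checksum(cmos: bytearray) -> int:
    """Sum the checksummed bytes, truncated to 16 bits."""
    return sum(cmos[CHECKSUM_FIRST:CHECKSUM_LAST + 1]) & 0xFFFF


def store_checksum(cmos: bytearray) -> None:
    """What the Setup Utility does when settings are saved."""
    total = compute_checksum(cmos)
    cmos[CHECKSUM_HI] = total >> 8
    cmos[CHECKSUM_LO] = total & 0xFF


def checksum_ok(cmos: bytearray) -> bool:
    """What the POST checks before trusting the settings."""
    stored = (cmos[CHECKSUM_HI] << 8) | cmos[CHECKSUM_LO]
    return stored == compute_checksum(cmos)


def debug_out(cmos: bytearray, index: int, value: int) -> None:
    """Model DEBUG's 'o 70 index' followed by 'o 71 value'."""
    cmos[index & 0x7F] = value & 0xFF


cmos = bytearray(128)
cmos[0x10] = 0x40             # arbitrary example setting
store_checksum(cmos)
print(checksum_ok(cmos))      # settings and stored checksum agree
debug_out(cmos, 0x17, 0x17)   # the two DEBUG commands from step 4
print(checksum_ok(cmos))      # byte 17h changed, stored checksum did not
```

    Index 17h falls inside the assumed checksummed range, so changing it without updating the stored total is enough for the POST to reject the whole block.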

  6. Enter the BIOS Setup Utility and load the BIOS default settings. Check the boot sequence and set the first device to the one you are using (diskette or CD-ROM), then check that the CPU settings are correct. Your instructor will give you the proper settings. Save and Exit. Turn the system off.

  7. Only true IBM systems report an 86xx error upon removal of the PS/2 mouse; on any other system the missing mouse will usually be ignored and the system will boot normally. Instead, remove the keyboard connector and then start the PC. The system should again sound 2 short beeps indicating a POST error condition and then display a message along the lines of "Keyboard error". Systems that display the "century" codes will display "301" or something similar. Turn off the PC and reattach the keyboard connector.

  8. All of these POST error messages occur during the POST phase of the boot sequence, which is essentially before anything else. Because they are hardware and hardware configuration oriented, and because the default for almost all BIOSes is to halt on all errors, they will generally cause the system to completely fail to start up regardless of severity. Therefore the technician must understand when POST errors occur (during the POST) and what is causing them (which depends on the code itself) before beginning to isolate the cause and form a plan of action to fix the problem. For example, if the system halts reporting only "1701" on screen, this generally indicates a problem with the hard drive and on rare occasions can indicate a problem with the HDD controller. Early ATA/IDE systems also use this code even though it was created long before IDE technology. Knowing that "1701" pertains to hard drives and their controllers, and knowing that a hard drive has more moving parts than the controller (which has exactly none), the technician has a good idea of what has gone wrong with the system before getting out the screwdriver. At this point the technician would enter the BIOS and attempt to autodetect the drive. If the autodetect hangs or takes a long time before reporting its results, then the hard drive is indeed beginning to fail or has already failed. This would verify the error message in this example.

  9. After the POST the system normally moves on into the BIOS boot strap loader phase, which attempts to load an operating system from one of the attached drives. If the POST succeeds (no POST error messages and a single beep from the speaker) and yet the system hangs with no further output to the screen, then the problem is coming from the boot strap loader process.

  10. Boot to DOS and run DEBUG then run these commands:

    C:\>debug
    -L 100 2 0 1
    -U 100 L 3
    1AB9:0100 EB3C    JMP    013E
    1AB9:0103 90      NOP

    Your version of DOS might report a slightly different JMP address than the 013Eh indicated here. That will be OK. Now based on that address execute the following DEBUG command:


    -F 13E 2FD 0
    -_

    This will erase the DOS boot strap loader code in the DBR. This will of course paralyze the boot sequence, since it deletes both the error messages that the code might try to display and the code that would display them.

    Now write the changes back to the drive:


    -W 100 2 0 1
    -Q
    C:\>_

    When the system reboots now it will demonstrate this lockup right after the POST with no error messages to the screen. Try it now.
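
    The 013Eh address in the unassemble output follows directly from the two bytes EB 3C: a short JMP (opcode EBh) adds its signed 8-bit displacement to the address of the next instruction. The Python sketch below (hypothetical helper names) decodes the jump and builds the matching DEBUG fill command for a DOS version whose JMP differs. The fill ends at 2FDh because a 512-byte sector loaded at offset 100h ends at 2FFh, so the last two bytes of the sector (the boot signature) are left alone.

```python
def short_jmp_target(offset: int, code: bytes) -> int:
    """Decode a 2-byte x86 short JMP (opcode EB) located at `offset`."""
    if len(code) < 2 or code[0] != 0xEB:
        raise ValueError("not a short JMP (expected opcode EB)")
    # The displacement is a signed 8-bit value.
    disp = code[1] - 0x100 if code[1] >= 0x80 else code[1]
    # It is relative to the instruction that follows the 2-byte JMP.
    return (offset + 2 + disp) & 0xFFFF


def fill_command(offset: int, code: bytes, end: int = 0x2FD) -> str:
    """Build the DEBUG F command that zeroes from the JMP target up to `end`."""
    return f"F {short_jmp_target(offset, code):X} {end:X} 0"


# The bytes shown in the lecture's DEBUG session: EB 3C at offset 0100.
print(fill_command(0x100, bytes([0xEB, 0x3C])))
```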

  11. After this behavior is observed, boot from a bootable diskette of the exact same DOS version as the one on the hard drive and run SYS C: from it. If SYS.COM is not on the floppy, copy it from the operating system directory of the hard drive (which is perfectly accessible, just not bootable) to the floppy diskette and run it. SYS copies the system files from the current drive (the floppy in this case) to the target drive and rewrites the loader code in the DBR. (SYS does not touch the MBR; the MBR loader code is rewritten with FDISK /MBR.)

  12. In the above example the cause was known to be in the boot sequence, but in the absence of any error messages it can be assumed that the BIOS boot strap loader encountered no error conditions. Since no OS boot strap loader error messages appeared either, it can be assumed that the code that would normally display these errors has been damaged: either the MBR loader code or the DBR loader code.

  13. But let's not jump around. The BIOS boot strap loader sequence proceeds as follows:

    1. Read the boot sequence from the CMOS RAM settings

    2. Attempt to read the first physical sector of the first device in the list

      2.1. Error returned from device - proceed to 3.

      2.2. Sector Data was read into RAM

        2.2.1. Check sector for valid boot signature

        2.2.2. No valid boot signature - proceed to 3.

        2.2.3. Valid boot signature found

        2.2.4. If the sector came from an HDD, is there an active partition?

          2.2.4.1. Not an HDD - pass control to it: BIOS boot strap complete.

          2.2.4.2. No active partition - proceed to 3.

          2.2.4.3. Active partition found - pass control to it: BIOS boot strap complete.

    3. Attempt to read the first physical sector of the next device in the boot sequence...

    4. If no suitable sector was found on any device in the boot sequence list then report boot failure message to the screen.

    So there are only two possible outcomes of this code: 1) Does not find a suitable sector to boot from, 2) Does find a suitable sector to boot from. In the case where the BIOS fails to find a suitable sector to boot from an error message will be reported on screen. The error message text itself is fairly BIOS and/or OEM PC manufacturer specific. Here are some common examples:

    AWARD BIOS (Generic): DISK BOOT FAILURE

    AWARD BIOS (Generic): PRESS A KEY TO REBOOT

    Phoenix BIOS (Generic): HARD DISK BOOT FAILURE

    Phoenix BIOS (Generic): PRESS F1 TO CONTINUE OR F2 TO RUN SETUP

    Various BIOS (Generic and OEM): NO ROM BASIC FOUND - SYSTEM HALTED

    Compaq BIOS (OEM):
    Non-system disk or disk error
    Replace the disk and press any key...

    IBM BIOS (OEM):
    ROM BASIC Version 1.0
    64KB RAM
    OK
    _

    The worst one is the Compaq message since it resembles precisely one of the possible DBR boot error messages but means something dramatically different from that other message.
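
    The numbered sequence in this step can be sketched as a short Python simulation. It follows the lecture's sequence as written and is not real BIOS code. Two details are not stated in this lecture and are supplied here as assumptions from the standard PC disk layout: the boot signature is the byte pair 55h AAh at offsets 510 and 511 of the sector, and the four partition table entries start at offset 1BEh with a first byte of 80h marking the active one. All function names are hypothetical.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xAA"   # assumed: bytes at offsets 510-511
PART_TABLE = 0x1BE             # assumed: four 16-byte entries start here
ACTIVE_FLAG = 0x80             # assumed: first byte of an active entry


def has_boot_signature(sector: bytes) -> bool:
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def has_active_partition(sector: bytes) -> bool:
    return any(sector[PART_TABLE + 16 * i] == ACTIVE_FLAG for i in range(4))


def bios_bootstrap(devices):
    """devices: list of (name, is_hdd, sector or None) in CMOS boot order.

    Returns the name of the device that receives control, or None when the
    boot failure message would be displayed.
    """
    for name, is_hdd, sector in devices:
        if sector is None:                  # 2.1 device returned an error
            continue
        if not has_boot_signature(sector):  # 2.2.2 no valid signature
            continue
        if not is_hdd:                      # 2.2.4.1 not an HDD
            return name
        if has_active_partition(sector):    # 2.2.4.3 active partition found
            return name
        # 2.2.4.2 no active partition: fall through to the next device
    return None                             # 4. boot failure message


def make_sector(signature=True, active=False) -> bytes:
    """Build a test sector (helper for the example only)."""
    data = bytearray(SECTOR_SIZE)
    if signature:
        data[510:512] = BOOT_SIGNATURE
    if active:
        data[PART_TABLE] = ACTIVE_FLAG
    return bytes(data)


boot_order = [
    ("A:", False, None),                     # no diskette inserted
    ("C:", True, make_sector(active=True)),  # healthy MBR
]
print(bios_bootstrap(boot_order))
```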

  14. As seen from the sequence, there are at least three causes for the BIOS boot strap loader to display its boot failure message on screen: 1) No device from the boot sequence returned any sector at all, 2) At least one device returned a sector but it did not have a valid boot signature, 3) At least one hard drive returned a sector with a valid boot signature but no partition is set active in the partition tables.

  15. Based on this information alone the technician knows where the potential problems lie.

    1. "No device from the boot sequence returned any sector at all"

       1.1. Check the boot sequence, is it allowing the expected devices to attempt to boot?

       1.2. If the expected boot device is the hard drive, and this message is appearing, is the drive functional? (This will be ascertained last)

    2. "At least one device returned a sector but it did not have a valid boot signature"

       2.1. Loss of a valid boot signature on the MBR will result in a complete inaccessibility to all information on the drive, but the system should boot to an alternate device with no trouble

    3. "At least one hard drive returned a sector with a valid boot signature but no partition is set active"

       3.1. Missing active partition in the partition tables can easily be checked by booting to an alternate device and then attempting to read the C: drive.

    Therefore the order of priority on the tests is reversed and looks like this:

    1. Check the boot sequence and make sure that it includes the hard drive (in newer BIOSes the hard drive can easily be removed from the boot sequence choices altogether)

    2. If the boot sequence is correct, be sure it is set to boot the alternate device first (i.e. floppy diskette or bootable CD-ROM whichever you are using) and boot to the alternate device.

    3. From the alternate boot disk attempt to read the C: drive with a DIR C:

       3.1 DIR C: yields "Invalid drive specification" - This means that the partition tables are missing or corrupt, or that the partition is incompatible with the boot disk (for example, the boot disk is a form of DOS and the partitions are NTFS). In the NTFS case use NTFSDOS.EXE.

       3.2 DIR C: yields "Invalid media type reading drive C:" - This means that the partition tables are recognized but the DOS boot record is missing, or corrupt. In rare cases the partition table has been "mildly corrupted" to point to the wrong starting location for the partition.

    4. From the alternate boot disk attempt to read the partition tables with FDISK.EXE

       4.1 FDISK yields "Error reading fixed disk" - This means that FDISK cannot make a low level BIOS call to read the MBR of the drive. The likely causes of this in order are:

          4.1.1 The drive is not configured properly in the BIOS

          4.1.2 The drive's cables or connections are loose/bad

          4.1.3 The drive is physically malfunctioning

       4.2 FDISK yields "Warning No partitions are set active..." - This means that none of the defined partitions is set active which will cause the BIOS boot strap loader error.

       4.3 FDISK yields "No partitions defined" - This means that the partition table has been damaged or completely deleted.

       4.4 FDISK yields defined partitions - The actual sizes of the partitions, the types, and the volume labels should be inspected carefully for "reasonable values" and no "funny characters". If there are unreasonable values (for example, FDISK reports the total size of the drive as 2.1GB and the size of the C: drive as 35.6GB, or the volume label is written in smiley faces and Greek letters), then the partition tables are definitely corrupt. Otherwise corrupt OS loader code could be suspected.
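
    The DIR and FDISK checks in this step amount to a lookup from the text on screen to a diagnosis. The Python sketch below captures that lookup; the message fragments and interpretations are taken from the checklist in this step, while the dictionary and function names are hypothetical.

```python
# What DIR C: says when run from the alternate boot disk (steps 3.1, 3.2).
DIR_FINDINGS = {
    "invalid drive specification":
        "Partition tables missing or corrupt, or partition not readable "
        "from this boot disk (e.g. NTFS from DOS)",
    "invalid media type":
        "Partition tables recognized, but the DOS boot record is missing "
        "or corrupt",
}

# What FDISK says (steps 4.1 to 4.3).
FDISK_FINDINGS = {
    "error reading fixed disk":
        "FDISK cannot read the MBR: check BIOS drive configuration, then "
        "cables and connections, then the drive itself",
    "no partitions are set active":
        "No partition is marked active: BIOS boot strap loader error",
    "no partitions defined":
        "Partition table damaged or completely deleted",
}


def interpret(output: str, findings: dict) -> str:
    """Match a tool's screen output against the known message fragments."""
    text = output.lower()
    for fragment, meaning in findings.items():
        if fragment in text:
            return meaning
    return "No known error text: inspect the reported values by hand (4.4)"


print(interpret("Invalid media type reading drive C:", DIR_FINDINGS))
print(interpret("Warning No partitions are set active", FDISK_FINDINGS))
```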

  16. Investigating standard boot error messages. Have a bootable alternate disk ready with SYS.COM on it, then boot to the C: drive, run DEBUG, and enter these commands:

    C:\>debug
    -L 100 2 0 1
    -F 100 2FF 0
    -W 100 2 0 1
    -Q
    C:\>_

    At this point the entire DBR has been deleted. Reboot the system to the hard drive.

  17. The error generated is "Missing operating system". This is a standard boot up error message and it originates from the MBR. In this scenario: 1) the POST succeeded, 2) the BIOS boot strap loader found a suitable sector to boot from and passed control to it, 3) the loader code within that boot sector encountered an error and halted the boot process at that point. So this message means that the drive is working and recognised by the BIOS, and that the MBR is working and has at least one valid partition table entry in it marked active.

    One of the problems that can lead to this error is a change of the hard drive's BIOS parameters. When the geometry changes, the DBR is no longer located where the boot strap code expects it to be, and that causes this error. Note: if the HDD was originally installed with different parameters and this error persists after autodetecting in the BIOS, then the autodetect has assigned different parameters and the original geometry settings cannot be retrieved (CMOS has no backup for its settings). Attempting to back calculate the prior geometry from an HDD's BIOS settings is one of the more difficult data recovery scenarios.

    The other main cause is that the DBR sector is no longer valid due to corruption or damage.

    Boot to an alternate device and attempt to read the drive with DIR C:

    1. "Invalid media type reading drive C:" - The DBR is confirmed missing or corrupt.

    2. Any other response, whether or not the directory is displayed correctly, indicates a corrupted MBR.

    In the case that the DBR is missing or corrupt, a SYS C: can be attempted though it may not always be successful.

  18. Another error message that the MBR may display is "Disk I/O error...". This one usually means that the MBR could not successfully read the sector where the DBR is located. The causes of this are:

    1. The HDD geometry has changed in the BIOS.

    2. The sector where the DBR is located has been damaged and can no longer be read (BIOS calls to read it return an error)

    In the former case, again, this can be extremely difficult to fix if the BIOS has autodetected the HDD and lost the previous geometry settings. If the sector where the DBR resides has been physically damaged, the news is just as bad: the system will have to be submitted for a data recovery attempt. Also, if the sector is damaged, the drive can no longer be used to install an operating system. There are third party utilities that can realign the position of the primary DOS partition, but if that sector has gone bad then others are sure to follow anyway.
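
    Why a geometry change moves the DBR out from under the MBR loader can be shown with the standard cylinder/head/sector to linear sector conversion, LBA = (C x heads + H) x sectors_per_track + (S - 1). The Python sketch below is illustrative only. It assumes the DBR was placed at cylinder 0, head 1, sector 1 (the usual location for the first partition created by DOS FDISK), and the two geometries are made-up example values.

```python
def chs_to_lba(cyl: int, head: int, sector: int,
               heads: int, sectors_per_track: int) -> int:
    """Convert a CHS address to a linear sector number for a given geometry.

    Sectors are numbered from 1 within a track; cylinders and heads from 0.
    """
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


# Assumed DBR location recorded in the partition table: C=0, H=1, S=1.
dbr_chs = (0, 1, 1)

# Geometry in effect when the drive was partitioned (example values).
original = chs_to_lba(*dbr_chs, heads=16, sectors_per_track=63)

# Geometry after the BIOS parameters were changed (example values).
changed = chs_to_lba(*dbr_chs, heads=16, sectors_per_track=17)

print("DBR was written at linear sector", original)
print("MBR loader now reads linear sector", changed)
```

    The same CHS address resolves to two different physical sectors, so the loader reads a sector that does not contain the DBR.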

  19. The next standard boot up error is "Non-system disk or disk error, replace and press any key...". This error is displayed by the DOS Boot Record OS loader code when it cannot locate the file IO.SYS or the file MSDOS.SYS and get them into memory. As we have seen, Compaqs use this exact message for the BIOS boot strap loader error, and this must be kept in mind when working with those machines. This error is unique to all versions of true DOS up to MSDOS 6.22 and PCDOS 6.3 (the IBM licensed version). With the introduction of the Windows 9x family of operating systems running on DOS versions 7.x, this message was changed to read "Invalid system disk, replace the disk and press any key...". The change was made specifically so that technicians could differentiate between the OS boot strap loader error coming from the DBR and the BIOS boot strap loader error coming from the BIOS of a Compaq.

  20. This error then indicates that: the POST succeeded; the BIOS boot strap loader found a suitable boot sector; if it was the HDD, then the partition tables are intact and one partition is set active; the MBR OS loader found the DBR; the DBR is intact; and the DBR could not find IO.SYS (or MSDOS.SYS, which is checked ONLY in true versions of DOS; the Win9x versions do not check for MSDOS.SYS). To check this, boot to the alternate boot device and change to the C: drive. Inspect the root of the drive with a DIR IO.SYS /AH, which should respond with "File not found". If it does find the file, then the file is either corrupt or resides on at least one bad sector. Also check for a CONFIG.SYS and, if present, display it on screen and check for a SHELL=pathname\COMMAND.COM entry. Note that on Windows systems there is a default SHELL command of SHELL=%windir%\COMMAND.COM (these variables come from the MSDOS.SYS [Paths] section, which should also be located and checked).

  21. To fix this error, copy the exact same version of IO.SYS to the root of the drive giving the error and reboot. If no other problems exist then this should remedy the problem. The drive can also be "SYS'ed", since SYS copies IO.SYS, MSDOS.SYS and COMMAND.COM from the source drive to the target and then rewrites the OS loader code of the DBR.

  22. The last common standard error message is "Type the name of the command interpreter (e.g. C:\WINDOWS\COMMAND.COM)". This error is issued by IO.SYS after its own initialization and after it has interpreted the MSDOS.SYS and CONFIG.SYS files. At that point in the boot sequence it has attempted to locate COMMAND.COM in order to load it into RAM and pass control to it, but the file could not be found. To verify this, boot to the alternate device and search the C: drive for COMMAND.COM. If the file is found, it is possible that it occupies at least one bad sector or has been corrupted. To remedy the situation, copy a known good COMMAND.COM of the exact same version onto the drive and reboot. If this was the only problem, the drive should boot now.
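
    The standard messages covered so far each come from a different stage of the boot sequence, and identifying the stage is the first step of the diagnosis. The Python sketch below collects them into one lookup. The messages and their sources are taken from this lecture; the names BOOT_MESSAGES and who_issued are hypothetical.

```python
# (message fragment, component that issues it), in the order checked.
BOOT_MESSAGES = [
    ("missing operating system", "MBR loader code"),
    ("disk i/o error", "MBR loader code"),
    ("non-system disk or disk error",
     "DBR loader code (true DOS), or the BIOS boot strap loader on a Compaq"),
    ("invalid system disk", "DBR loader code (Windows 9x / DOS 7.x)"),
    ("type the name of the command interpreter", "IO.SYS"),
    ("disk boot failure", "BIOS boot strap loader"),
    ("no rom basic", "BIOS boot strap loader"),
]


def who_issued(message: str) -> str:
    """Return the boot stage that displays the given on-screen message."""
    text = message.lower()
    for fragment, source in BOOT_MESSAGES:
        if fragment in text:
            return source
    return "unknown (consult reference material)"


for msg in ("Missing operating system",
            "Invalid system disk, replace the disk and press any key...",
            "HARD DISK BOOT FAILURE"):
    print(msg, "->", who_issued(msg))
```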

  23. In the case where a file exists on the hard drive but an error message indicates that it is missing (and it is therefore likely corrupt), the possibility exists that one of the sectors it occupies is bad. Here is a way to copy IO.SYS, for example, to the HDD and avoid it landing on those bad sectors. Use ATTRIB to clear the attributes of IO.SYS on a floppy, then copy it to the root of the C: drive under a new name:

    A:\>attrib -r -h -s IO.SYS
    A:\>copy IO.SYS C:\IOSYS.NEW
    

    This forces the system to use new clusters to hold the new file. Now rename IO.SYS on the C: drive, and rename the new one:

    C:\>attrib -r -h -s IO.SYS
    C:\>ren IO.SYS IOSYS.OLD
    C:\>ren IOSYS.NEW IO.SYS
    

    Renames are done "in place": only the name changes in the directory entry, not the occupied clusters, so the new IO.SYS does not occupy the same clusters as the old one. This can be done for any file where you suspect a bad cluster.
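
    The reason the copy-then-rename trick works can be seen in a toy model of cluster allocation. The Python sketch below is a deliberate simplification and not a FAT implementation: ToyVolume and its methods are hypothetical, files are just lists of cluster numbers, and new files always take the lowest free clusters.

```python
class ToyVolume:
    """A toy volume: a directory mapping names to lists of cluster numbers."""

    def __init__(self, total_clusters: int):
        # Data clusters are numbered from 2, as on a FAT volume.
        self.free = list(range(2, total_clusters + 2))
        self.directory = {}

    def write(self, name: str, n_clusters: int) -> list:
        """Create a file; it can only receive clusters that are still free."""
        clusters = [self.free.pop(0) for _ in range(n_clusters)]
        self.directory[name] = clusters
        return clusters

    def rename(self, old: str, new: str) -> None:
        """Rename in place: the directory entry changes, the clusters do not."""
        self.directory[new] = self.directory.pop(old)


vol = ToyVolume(total_clusters=20)
vol.write("IO.SYS", 3)           # the suspect file, sitting on a bad cluster
vol.write("IOSYS.NEW", 3)        # the fresh copy lands on other clusters
vol.rename("IO.SYS", "IOSYS.OLD")
vol.rename("IOSYS.NEW", "IO.SYS")

print("IO.SYS    ->", vol.directory["IO.SYS"])
print("IOSYS.OLD ->", vol.directory["IOSYS.OLD"])
```

    Because IOSYS.OLD is renamed rather than deleted, the suspect clusters stay allocated to it and cannot be handed out to another file.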

  24. Common causes of most of these malfunctions fall into four main categories:

    1. Malicious Code - Viruses and their kin

    2. Operating system/Software bugs - Poorly written code that damages itself (more common than Microsoft will ever admit!)

    3. User error - Accidentally deleting or overwriting files (shocking but true)

    4. Hard drive/hardware trouble - Hard drives have moving parts; it is not IF the HDD will die, it is WHEN will it die. Power surges and spikes can damage an HDD easily causing a head crash or misalignment of the stepper motor. Spikes can also damage code in RAM, ROM, or CMOS on the motherboard or the integrated drive electronics (yes they have a ROM and a CMOS sometimes as well as RAM cache).

    Once ROM code is damaged the system will endemically malfunction with little hope of salvation. Once CMOS settings are corrupted, every file can end up corrupted; the CMOS would have to be cleared and a complete reinstall would follow. RAM cache glitches generally affect the file that is in transit at the time, and are sufficiently common that a simple replacement of a damaged file will be all the system needs to be up and running again.

  25. One word on transient single bit damage. HDD people claim that the drive itself only corrupts a single bit in 10^14. This is a very dubious number; how did they test it? It also covers only corruption due purely to the internal activity of the drive. They are not considering bits damaged in transit through the data cable, bits damaged in the ATA controller, bits damaged traveling through the buses or the bus controllers, or bits damaged in RAM. Obviously none of these devices is perfectly reliable, and a major weakness lies in the connectors. A UDMA cable is only as good as the connectors at both ends, which are not nearly as good as the cable (no junction of two separate wires is as good as a single uninterrupted wire). A study in the 1980's found that random corrupted bits occurring in RAM were due to two separate causes: 1) radioactive materials in the actual material used to manufacture the RAM (the chips themselves have trace radioactive impurities in them) and 2) tertiary cosmic rays. The radioactive impurities in the RAM chips are no more concentrated than the radioactive impurities in the concrete used to pour the sidewalks. However, the sidewalks do not contain 1 billion bit cell transistors per square centimeter either. The radioactive decay of a single atom within this substrate has enough energy to change a bit value within one bit cell and therefore corrupt whatever data is there. If it is the OS kernel, this can yield a lockup or blue screen; if it is a file in transit, then the file is corrupted. Estimates are that anywhere from 1 in 10^10 to 1 in 10^20 bits suffer these forms of radiation corruption.

    A tertiary cosmic ray is caused when a high energy primary cosmic ray strikes the upper atmosphere. The ray is so powerful that it will blow an electron completely off of the atom or molecule that it strikes, with enough force that this electron is also a dangerous form of radiation and is referred to as a secondary cosmic ray. When this electron strikes a molecule deeper down in the atmosphere it has the strength to ionize (remove an electron from) this molecule also. Eventually one of these speeding electrons will reach the surface of the earth, but it no longer has enough energy to be an ionizing radiation. These are common at the surface of the earth and are called the tertiary cosmic rays. They may not have enough energy to ionize, but they certainly have enough energy to knock a zero into a one within the SDRAM of the PC. It happens that if you have an old television with a rabbit ear antenna you have a splendid tertiary cosmic ray detector and you can see them for yourself. Tune the television to the highest UHF channel, which of course will cover the screen in "snow". Now reduce the brightness to as low as it will go so that the screen is essentially black with an occasional bright blip. The bright blips are collisions of tertiary cosmic rays on the antenna.

    Consider that Windows XP occupies 1.7GB, or 1.36 x 10^10 bits, and is constantly shuffling information and DLLs back and forth between RAM and the HDD, and that the system boots up and shuts down once each day; then there are roughly 400 to 700 (depending on how many days/year you work!) splendid opportunities for corrupting a bit. Assume that in a single session XP manipulates only 25% of the Windows folder; then (3.4 x 10^9) x (500 sessions/boots/shutdowns per year) = 1.7 x 10^12 bits in transit. This number is roughly 1/60th of the HDD bit corruption odds of 1 in 10^14. This means that it is reasonable to expect roughly 1 in every 60 computers per year to suffer from an HDD related corrupted bit. Since the vast majority of those bits belong to the operating system, this will lead to blue screens, random lockups and so on. Add to the HDD's promise of damaging a bit the system's promise of it and the radioactive RAM and cosmic rays' promise of it, and there is no doubt that a bit is going to get corrupted in every computer eventually...better get used to backing up everything and often.
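
    The estimate in the paragraph above is simple enough to redo with different assumptions. The Python sketch below reproduces the arithmetic; the function name is hypothetical and the default values are the lecture's own figures (1.7GB install, 25% touched per session, 500 sessions per year, one corrupted bit per 10^14).

```python
def yearly_corruption_estimate(install_bytes: float = 1.7e9,
                               fraction_per_session: float = 0.25,
                               sessions_per_year: int = 500,
                               bit_error_rate: float = 1e-14):
    """Return (bits per session, bits per year, expected bad bits per year)."""
    total_bits = install_bytes * 8                       # bytes to bits
    bits_per_session = total_bits * fraction_per_session
    bits_per_year = bits_per_session * sessions_per_year
    expected_bad_bits = bits_per_year * bit_error_rate
    return bits_per_session, bits_per_year, expected_bad_bits


per_session, per_year, expected = yearly_corruption_estimate()
print(f"bits per session : {per_session:.2e}")
print(f"bits per year    : {per_year:.2e}")
print(f"expected bad bits: {expected:.3f} per machine per year")
```

    With the lecture's figures the last value comes out near 0.017, that is, on the order of one machine in a hundred per year from the drive alone.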

Review Questions
  1. The PC issues a single beep and the POST summary on screen and appears to lock up at this point. Explain why the POST succeeded and what two sectors are where the most likely cause of the problem lies, what the cause is, steps to verify, and how to fix it:











  2. The PC issues two beeps and locks up with no message on screen. Explain why the POST has failed and the contents of the drive are not suspect yet:





  3. The PC beeps twice and issues the error code 1710 on screen. Explain what phase of the boot process this error has occurred in, what subsystem has failed, and what steps should be taken in order to attempt to verify and/or correct the problem:








  4. The PC does not beep at all and displays the error code 201 on screen. Explain what phase of the boot process this error has occurred in, what subsystem has failed, and what steps should be taken in order to attempt to verify and/or correct the problem:








  5. The Compaq PC beeps once and displays the error message "Invalid system disk or disk error...". Explain what phase of the boot process this error has occurred in, what has failed, what operating system is installed on the HDD, and what steps should be taken in order to attempt to verify and/or correct the problem:








  6. The PC with an Award BIOS beeps once and displays the error message "DISK BOOT FAILURE - PRESS ANY KEY TO REBOOT". Explain what phase of the boot process this error has occurred in, what has failed, and what steps should be taken in order to attempt to verify and/or correct the problem:











  7. The OEM PC with a Phoenix BIOS beeps twice and displays the error message "301 - PRESS F1 TO CONTINUE OR F2 TO RUN SETUP". Explain what phase of the boot process this error has occurred in, what has failed, and what steps should be taken in order to attempt to verify and/or correct the problem:











  8. The PC beeps once and displays the error message "Missing operating system". Explain what phase of the boot process this error has occurred in, what has failed, and what steps should be taken in order to attempt to verify and/or correct the problem:











  9. The generic PC beeps once and displays the error message "Non-system disk or disk error...". Explain what phase of the boot process this error has occurred in, what has failed, what operating system is installed on the HDD, and what steps should be taken in order to attempt to verify and/or correct the problem:











Copyright©2000-2007 Brian Robinson ALL RIGHTS RESERVED