Windows NT TCP/IP Network Administration

By Craig Hunt & Robert Bruce Thompson
1st Edition October 1998
1-56592-377-4, Order Number: 3774
504 pages, $37.95

Sample Chapter 11:

Troubleshooting TCP/IP

In this chapter:
Approaching a Problem
Diagnostic Tools
Testing Basic Connectivity
Troubleshooting Network Access
Checking Routing
Checking Name Service
Analyzing Protocol Problems
Protocol Case Study
Simple Network Management Protocol

Network administration tasks fall into two very different categories: configuration and troubleshooting. Configuration tasks prepare for the expected; they require detailed knowledge of system configuration but are usually simple and predictable. Once a system is properly configured, there is rarely any reason to change it. The configuration process is repeated each time a new release of Windows NT is installed, but usually with very few changes.

In contrast, network troubleshooting deals with the unexpected. Troubleshooting frequently requires knowledge that is conceptual rather than detailed. Network problems are usually unique and sometimes difficult to resolve. Troubleshooting is an important part of maintaining a stable, reliable network service.

In this chapter we discuss the tools used to ensure that the network is in good running condition. However, good tools are not enough. No troubleshooting tool is effective if applied haphazardly. Effective troubleshooting requires a methodical approach to the problem, and a basic understanding of how the network works. So we'll start our discussion by looking at ways to approach a network problem.

Approaching a Problem

To approach a problem properly, you need a basic understanding of TCP/IP. The first few chapters of this book discuss the basics of TCP/IP, and provide enough background information to troubleshoot most network problems. Knowledge of how TCP/IP routes data through the network, between individual hosts, and between the layers in the protocol stack, is important for understanding a network problem, but detailed knowledge of each protocol usually isn't necessary. The fine details of the protocols are rarely needed in debugging, and when they are used, they should be looked up in a definitive reference--not recalled from memory.

Not all TCP/IP problems are alike, and not all problems can be approached in the same manner. But the key to solving any problem is understanding what the problem is. This is not as easy as it may seem. The "surface" problem is sometimes misleading, and the "real" problem is frequently obscured by many layers of software. When the true nature of the problem is understood, the solution to the problem is often obvious.

First, gather detailed information about exactly what's happening. When the problem is reported, talk to the user. Find out which application failed. What is the remote host's name and IP address? What is the user's host name and address? What error message was displayed? If possible, verify the problem by having the user run the application while you talk him through it. If possible, duplicate the problem on your own system.

Testing from the user's system, and other systems, find out:

Once you know the symptoms of the problem, visualize each protocol and device that handles the data. Visualizing the problem will help you avoid oversimplification, and keep you from assuming that you know the cause even before you start testing. Using your TCP/IP knowledge, narrow your attack to the most likely causes of the problem, but keep an open mind.

Troubleshooting Hints

There are several useful troubleshooting hints you should know. These are not a troubleshooting methodology; just good ideas to keep in mind. Here they are, listed in no particular order:

Diagnostic Tools

Because most problems have a simple cause, developing a clear idea of the problem often provides the solution. Unfortunately this is not always true, so in this section we begin to discuss the tools that can help you attack the most intractable problems. Most of the tools discussed in this chapter are software tools, but you should also keep some hardware tools handy.

You need enough simple hand tools to maintain the network's equipment and wiring. A pair of needle-nose pliers and a few screwdrivers may be sufficient, but you may also need specialized tools to maintain your wiring. For example, attaching RJ45 connectors to Unshielded Twisted Pair (UTP) cable requires special crimping tools. If you buy a network maintenance toolkit from your cable vendor, it will probably contain everything you need.

A full-featured cable tester is also useful. Modern cable testers are small hand-held units with a keypad and LCD display that test both thinnet and UTP cable. Tests are selected from the keypad and results are displayed on the LCD screen. It is not necessary to interpret the results because the unit does that for you and displays the error condition in a simple text message. For example, a cable test might produce the message "Short at 74 feet." This tells you that the cable is shorted 74 feet away from the tester. What could be simpler? The proper test tools make it easier to locate, and therefore fix, cable problems.

A laptop computer is also a useful piece of test equipment when properly configured. Install TCP/IP software on the laptop. Take it to the location where the user reports a network problem. Disconnect the Ethernet cable from the back of the user's system and attach it to the laptop. Configure the laptop with an appropriate address for the user's subnet and reboot it. Then ping various systems on the network and attach to one of the user's servers. If everything works, the fault is probably in the user's computer. The user trusts this test because it demonstrates something she does every day. Unlike an unidentifiable piece of test equipment displaying the message "No faults found," the user has confidence in the laptop. If the test fails, the fault is probably in the network equipment or wiring. That's the time to bring out the cable tester.

Another advantage of using a laptop as a piece of test equipment is its inherent versatility. It runs a wide variety of test, diagnostic and management software. Install Windows NT on the laptop and run the software discussed in the rest of this chapter from your desktop or your laptop.

Many diagnostic tools are available, ranging from commercial systems with specialized hardware and software that may cost thousands of dollars, to free software that is available from the Internet. Many software tools are provided with your Windows NT system. This book emphasizes the software diagnostic tools that come with Windows NT. The tools discussed in this book are:


ipconfig

Provides information about the basic configuration of the interface. It is useful for detecting bad IP addresses, incorrect subnet masks, and improper broadcast addresses.

arp

Provides information about Ethernet/IP address translation. It can be used to detect systems on the local network that are configured with the wrong IP address. arp is covered in this chapter, and is used in an example in Chapter 2, Delivering the Data.

netstat

Provides a variety of information. It is used to display interface statistics, network sockets, and the network routing table. netstat is used repeatedly in this book.

ping

Indicates whether a remote host can be reached. ping also displays information about packet loss and packet delivery time.

nslookup

Provides information about the DNS name service. nslookup is covered in detail in Chapter 8, Configuring DNS Name System.

tracert

Prints information about each routing hop that packets take going from your system to a remote system.

Network Monitor

Analyzes the individual packets exchanged between hosts on a network. Network Monitor is a TCP/IP protocol analyzer provided with Windows NT Server 4.0. It can examine the contents of packets and is useful for analyzing protocol problems.

Each of these tools, even those covered earlier in the text, is used in this chapter. We start with ping, which is used in more troubleshooting situations than any other diagnostic tool.

Testing Basic Connectivity

The ping command tests whether a remote host can be reached from your computer. This simple function is extremely useful for testing the network connection, independent of the application in which the original problem was detected. ping allows you to determine whether further testing should be directed toward the network connection (the lower layers) or the application (the upper layers). If ping shows that packets can travel to the remote system and back, the user's problem is probably in the upper layers. If packets can't make the round-trip, lower protocol layers are probably at fault.

Frequently a user reports a network problem by stating that he can't Telnet (or FTP, or send e-mail, or whatever) to some remote host. He then immediately qualifies this statement with the announcement that it worked before. In cases like this, where the ability to connect to the remote host is in question, ping is a very useful tool.

Using the host name provided by the user, ping the remote host. If your ping is successful, have the user ping the host. If the user's ping is also successful, concentrate your further analysis on the specific application that the user is having trouble with. Perhaps the user is attempting to Telnet to a host that only provides anonymous FTP. Perhaps the host was down when the user tried his application. Have the user try it again, while you watch or listen to every detail of what he is doing. If he is doing everything right and the application still fails, detailed analysis of the application with Network Monitor and coordination with the remote system administrator may be needed.

If your ping is successful and the user's ping fails, concentrate testing on the user's system configuration, and on those things that are different about the user's path to the remote host, when compared to your path to the remote host.

If your ping fails, or the user's ping fails, pay close attention to any error messages. The error messages displayed by ping are helpful guides for planning further testing. The details of the messages may vary, but there are only a few basic types of errors:

unknown host

The remote host's name cannot be resolved by name service into an IP address. The name servers could be at fault (either your local server or the remote system's server), the name could be incorrect, or something could be wrong with the network between your system and the remote server. If you know the remote host's IP address, try to ping that. If you can reach the host using its IP address, the problem is with name service. Use nslookup to test the local and remote servers, and to check the accuracy of the host name the user gave you.

network unreachable

The local system does not have a route to the remote system. If the numeric IP address was used on the ping command line, re-enter the ping command using the host name. This eliminates the possibility that the IP address was entered incorrectly, or that you were given the wrong address. If a routing protocol is being used, make sure it is running and use netstat to check the routing table. If a static default route is being used, make sure the default route is in the routing table. If everything seems fine on the host, check its default gateway for routing problems.

no answer

The remote system did not respond. Most network utilities have some version of this message. Some print the message "100% packet loss"; others print the message "Connection timed out" or the error "cannot connect." All of these errors mean the same thing. The local system has a route to the remote system, but it receives no response from the remote system to any of the packets it sends. There are many possible causes of this problem. The remote host may be down. Either the local or the remote host may be configured incorrectly. A gateway or circuit between the local host and the remote host may be down. The remote host may have routing problems. Only additional testing can isolate the cause of the problem. Carefully check the local configuration using ipconfig. Check the route to the remote system with tracert. Contact the administrator of the remote system and report the problem.
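The three error types above can be sketched as a small lookup that turns an error message into a list of follow-up checks. This is only an illustration of the triage described in the text: the function name classify_ping_error and the NEXT_STEPS table are hypothetical, and the matched phrases are limited to the wordings quoted in this chapter. Real utilities may word their messages differently.

```python
# Hypothetical triage helper: the category names and suggested checks are
# taken from the text; the function and table names are illustrative only.
NEXT_STEPS = {
    "unknown host": [
        "ping the remote host by IP address, if you know it",
        "test the local and remote name servers with nslookup",
        "check the accuracy of the host name",
    ],
    "network unreachable": [
        "re-enter the ping command using the host name",
        "check the routing table with netstat",
        "check the default gateway for routing problems",
    ],
    "no answer": [
        "check the local configuration with ipconfig",
        "check the route to the remote system with tracert",
        "contact the administrator of the remote system",
    ],
}


def classify_ping_error(message):
    """Map an error message to one of the three basic error types."""
    text = message.lower()
    if "unknown host" in text:
        return "unknown host"
    if "network unreachable" in text or "network is unreachable" in text:
        return "network unreachable"
    no_answer_phrases = ("no answer", "100% packet loss",
                         "connection timed out", "cannot connect")
    if any(phrase in text for phrase in no_answer_phrases):
        return "no answer"
    return "unrecognized"


if __name__ == "__main__":
    kind = classify_ping_error("sendto: Network is unreachable")
    print(kind)
    for step in NEXT_STEPS.get(kind, []):
        print("  -", step)
```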

All of the tools mentioned here will be discussed later in this chapter. However, before leaving ping, let's look more closely at the command.

The ping Command

The basic format of the ping command is ping destination, where destination is the host name or IP address of the remote host being tested. Use the host name or address provided by the user in the trouble report. For example, to check that pooh can be reached from thoth, we use the following command:


C:\>ping pooh
Pinging with 32 bytes of data:

Reply from bytes=32 time<10ms TTL=32
Reply from bytes=32 time<10ms TTL=32
Reply from bytes=32 time<10ms TTL=32
Reply from bytes=32 time<10ms TTL=32

By default the Windows NT ping command sends out four 32-byte test packets. The sample test shows an extremely good network link with no packet loss and fast response. The round-trip to pooh is taking less than 10 milliseconds. A small packet loss and round-trip times an order of magnitude higher would not be abnormal for a connection made across a wide area network.

If the packet loss is high or the response time is very slow, there could be a network hardware problem. If you see these conditions when communicating great distances on a wide area network, there is nothing to worry about. TCP/IP was designed to deal with unreliable networks, and some wide area networks suffer a lot of packet loss. But if these problems are seen on a local area network, they indicate trouble.
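When many ping tests have to be compared, the packet loss and round-trip times can be pulled out of the output with a short script. The following is a minimal sketch, not a complete parser. The sample output is made up for illustration (the address 192.0.2.10 and the lost packet are assumptions, not results from the text), and a time reported as "<10ms" is treated as an upper bound of 10 milliseconds.

```python
import re

# Illustrative output only: the address and the timed-out request are
# assumptions made for this example.
SAMPLE = """\
Pinging 192.0.2.10 with 32 bytes of data:

Reply from 192.0.2.10: bytes=32 time<10ms TTL=32
Reply from 192.0.2.10: bytes=32 time=12ms TTL=32
Request timed out.
Reply from 192.0.2.10: bytes=32 time<10ms TTL=32
"""

# Matches both "time<10ms" and "time=12ms" and captures the number.
REPLY = re.compile(r"Reply from \S+ bytes=\d+ time[<=](\d+)ms")


def summarize_ping(output):
    """Return (packets sent, packets lost, percent lost, worst time in ms)."""
    times = [int(ms) for ms in REPLY.findall(output)]
    lost = output.count("Request timed out")
    sent = len(times) + lost
    percent_lost = 100.0 * lost / sent if sent else 0.0
    worst = max(times) if times else None
    return sent, lost, percent_lost, worst


if __name__ == "__main__":
    print(summarize_ping(SAMPLE))
```

On a local area network, any loss or a worst-case time far above zero in such a summary would be a reason to look at the hardware.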

On a local network cable segment the round-trip time should be near zero, there should be little or no packet loss, and the packets should arrive in order. If these things are not true, there is a problem with the network hardware. On an Ethernet the problem could be improper cable termination, a bad cable segment, or a bad piece of "active" hardware, such as a hub, switch or transceiver. Check the cable with a cable tester as described earlier. Good hubs and switches often have built-in diagnostic software that can be checked. Cheap hubs and transceivers may require the "brute force" method of disconnecting individual pieces of hardware until the problem goes away.

The results of a simple ping test, even if the test is successful, can help you direct further testing toward the most likely causes of the problem. But other diagnostic tools are needed to examine the problem more closely and find the underlying cause.

Troubleshooting Network Access

The "no answer" and "cannot connect" errors indicate a problem in the lower layers of the network protocols. If the preliminary tests point to this type of problem, concentrate your testing on routing and on the network interface. Use the ipconfig, netstat, and arp commands to test the Network Access Layer.

Troubleshooting with the ipconfig Command

ipconfig checks the network interface configuration. Use this command to verify the user's configuration if the user's system has been recently configured, or if the user's system cannot reach the remote host while other systems on the same network can.

When ipconfig is entered with the /all argument, it displays the current configuration values assigned to the interface. For example:

C:\>ipconfig /all
Windows NT IP Configuration
        Host Name . . . . . . . . . : pooh
        DNS Servers . . . . . . . . : thoth
        Node Type . . . . . . . . . : Broadcast
        NetBIOS Scope ID. . . . . . :
        IP Routing Enabled. . . . . : No
        WINS Proxy Enabled. . . . . : No
        NetBIOS Resolution Uses DNS : No
Ethernet adapter SMCISA1:
        Description . . . . . . . . : SMC Adapter.
        Physical Address. . . . . . : 00-00-C0-9A-72-CA
        DHCP Enabled. . . . . . . . : No
        IP Address. . . . . . . . . :
        Subnet Mask . . . . . . . . :
        Default Gateway . . . . . . : thoth

The ipconfig command displays two types of information. The first type is information about the TCP/IP configuration. The second type is about the network interface and its characteristics. Check the information for configuration errors.

The Windows NT ipconfig command clearly labels each piece of information it provides. Every item from Host Name to Default Gateway is explained somewhere in this book. You should know what values are correct for your network, and thus be able to quickly detect a configuration error if one has been made.

Two common interface configuration problems are misconfigured subnet masks and incorrect IP addresses. A bad subnet mask is indicated when the host can reach other hosts on its local subnet and remote hosts on distant networks, but it cannot reach hosts on other local subnets. ipconfig quickly reveals if a bad subnet mask is set.
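The bad subnet mask symptom follows from how a host decides whether a destination is on its own subnet. The sketch below uses Python's standard ipaddress module to show that decision. All of the addresses and masks are hypothetical: a host on a network that is subnetted with the mask 255.255.255.0 has mistakenly been given the mask 255.255.0.0.

```python
import ipaddress


def delivered_directly(host_ip, mask, destination):
    """True if the host, using this mask, treats the destination as local.

    A local destination is sent straight onto the wire; anything else is
    handed to a gateway.
    """
    local_net = ipaddress.ip_network(f"{host_ip}/{mask}", strict=False)
    return ipaddress.ip_address(destination) in local_net


if __name__ == "__main__":
    host = "172.16.12.2"            # hypothetical host address
    good_mask = "255.255.255.0"     # assumed correct mask for this network
    bad_mask = "255.255.0.0"        # the misconfiguration
    for label, dest in [("same subnet", "172.16.12.3"),
                        ("other local subnet", "172.16.1.5"),
                        ("distant network", "192.0.2.9")]:
        print(label, dest,
              "good mask -> direct:", delivered_directly(host, good_mask, dest),
              "bad mask -> direct:", delivered_directly(host, bad_mask, dest))
```

Only the middle case differs. With the bad mask the host tries to deliver directly to a system that is really on another subnet, so that traffic fails, while traffic to its own subnet and to distant networks is handled the same way with either mask.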

An incorrectly set IP address can be a subtle problem. If the network part of the address is incorrect, every ping will fail with the "no answer" error. In this case, using ipconfig will reveal the incorrect address. If the host part of the address is wrong, the problem can be more difficult to detect. A small system, such as a PC that only connects out to other systems and never accepts incoming connections, can run for a long time with the wrong address without its user noticing the problem. Additionally, the system that suffers the ill effects may not be the one that is misconfigured. It is possible for someone to accidentally use your IP address on her system, and for the mistake to cause your system intermittent communications problems. An example of this problem is discussed later. This type of configuration error cannot be discovered by ipconfig, because the error is on a remote host. The arp command is used for this type of problem.

Troubleshooting with the arp Command

The arp command is used to analyze problems with IP to Ethernet address translation. The arp command has three useful options for troubleshooting:


-a

Display all ARP entries in the table.

-d hostname

Delete an entry from the ARP table.

-s hostname ether-address

Add a new entry to the table.

With these three options you can view the contents of the ARP table, delete a problem entry, and install a corrected entry. The ability to install a corrected entry is useful in "buying time" while you look for the permanent fix.

Use arp if you suspect that incorrect entries are getting into the address resolution table. One clear indication of problems with the ARP table is a report that the "wrong" host responded to some command, like ftp or telnet. Intermittent problems that affect only certain hosts can also indicate that the ARP table has been corrupted. ARP table problems are usually caused by two systems using the same IP address. The problems appear intermittent, because the entry that appears in the table is the address of the host that responded quickest to the last ARP request. Sometimes the "correct" host responds first, and sometimes the "wrong" host responds first.

If you suspect that two systems are using the same IP address, display the address resolution table with the arp -a command. Here's an example:

C:\>arp -a
Interface: on Interface 2
  Internet Address      Physical Address      Type
                        00-00-c0-dd-d4-da     dynamic
                        00:00:0c:e0:80:b1     dynamic
                        00:00:c0:22:fd:51     dynamic

It is easiest to verify that the IP and Ethernet address pairs are correct if you have a record of each host's correct Ethernet address. For this reason you should record the Ethernet and IP address of each host assigned a static address when it is added to your network. If you have such a record, you'll quickly see if anything is wrong with the table.

If you don't have this type of record, the first three bytes of the Ethernet address can help you to detect a problem. The first three bytes of the address identify the equipment manufacturer. A list of these identifying prefixes is found in the Assigned Numbers RFC, in the section entitled "Ethernet Vendor Address Components." This information is also available at

From the vendor prefixes we see that two of the ARP entries displayed in our example are PC systems with SMC boards (0:0:c0). If kerby is also supposed to be a system with an SMC board, the 0:0:0c Cisco prefix indicates that a Cisco router has been mistakenly configured with kerby's IP address.
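Ethernet addresses are written in several styles (0:0:c0, 00-00-c0, 00:00:c0), so it helps to normalize them before comparing prefixes. The following is a minimal sketch of that comparison. The function names are hypothetical, and the lookup table holds only the two prefixes mentioned in the text; a real table would be built from the published vendor list.

```python
import re

# Only the two prefixes named in the text; extend from the vendor list.
KNOWN_PREFIXES = {
    "00:00:c0": "SMC",
    "00:00:0c": "Cisco",
}


def vendor_prefix(ether_address):
    """Return the first three bytes of an Ethernet address as 'xx:xx:xx'."""
    parts = re.split(r"[:-]", ether_address.strip().lower())
    if len(parts) != 6:
        raise ValueError("expected six bytes in " + repr(ether_address))
    # Pad single-digit bytes so that 0:0:c0 and 00-00-c0 compare equal.
    return ":".join(part.zfill(2) for part in parts[:3])


def vendor_name(ether_address):
    """Look up the vendor, or return None if the prefix is not in the table."""
    return KNOWN_PREFIXES.get(vendor_prefix(ether_address))


if __name__ == "__main__":
    for address in ("00-00-c0-dd-d4-da", "00:00:0c:e0:80:b1", "00:00:c0:22:fd:51"):
        print(address, vendor_name(address))
```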

If neither checking a record of correct assignments nor checking the manufacturer prefix helps you identify the source of the errant ARP, try using Telnet to connect to the IP address shown in the ARP entry. If the device supports Telnet, the logon banner might help you identify the incorrectly configured host.

ARP Problem Case Study

A user called in asking if the server was down, and reported the following problem. The user's workstation, called theodore, appeared to "lock up" for minutes at a time when certain commands were used, while other commands worked with no problems. The network commands that depended on the server all caused the lock-up problem, but some unrelated commands also caused the problem.

The server thoth was providing theodore with services. The commands that failed on theodore were commands that required thoth's services, or that were stored in a directory shared from thoth. The commands that ran correctly were installed locally on the user's workstation. No one else reported a problem with the server, and we were able to ping theodore from thoth and get good responses.

We had the user check the Event Viewer for recent error messages, and she discovered the event shown in Figure 11-1.


Figure 11-1. Duplicate Address Warning


The message shown in Figure 11-1 indicates that the workstation detected another host on the Ethernet responding to its IP address. The "imposter" used the Ethernet address 0:0:c0:dd:d4:da in its ARP response. The correct Ethernet address for theodore is 8:0:20:e:12:37.

We checked thoth's ARP table and found that it had the incorrect ARP entry for theodore. We deleted the bad theodore entry with the arp -d command, and installed the correct entry with the -s option, as shown below:

C:\>arp -d theodore
theodore ( deleted
C:\>arp -s theodore 8:0:20:e:12:37

ARP entries received via the ARP protocol are temporary. The values are held in the table for a finite lifetime and are deleted when that lifetime expires. New values are then obtained via the ARP protocol. Therefore, if some remote interfaces change, the local table adjusts and communications continue. Usually this is a good idea, but if someone is using the wrong IP address, that bad address can keep reappearing in the ARP table even if it is deleted. However, manually entered values are permanent; they stay in the table and can only be deleted manually. This allowed us to install a correct entry in the table, without worrying about it being immediately overwritten by a bad address.
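The difference between temporary and manually entered entries can be sketched as a toy model of the table. This is not how Windows NT implements its ARP cache. The class name, the method names, the 120-second lifetime, and the IP address 192.0.2.7 are all assumptions chosen for illustration; only the behavior (learned entries expire and can be relearned, manual entries stay until deleted) comes from the text.

```python
class ArpTable:
    """Toy ARP table: dynamic entries expire, static entries do not."""

    def __init__(self, lifetime=120.0):
        self.lifetime = lifetime      # assumed lifetime, in seconds
        self.entries = {}             # ip -> (ether, expiry or None if static)

    def learn(self, ip, ether, now):
        """Record an address learned from an ARP reply."""
        current = self.entries.get(ip)
        if current is not None and current[1] is None:
            return                    # a static entry is never overwritten
        self.entries[ip] = (ether, now + self.lifetime)

    def add_static(self, ip, ether):
        """Equivalent in spirit to 'arp -s': install a permanent entry."""
        self.entries[ip] = (ether, None)

    def delete(self, ip):
        """Equivalent in spirit to 'arp -d': remove an entry."""
        self.entries.pop(ip, None)

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None:
            return None
        ether, expiry = entry
        if expiry is not None and now >= expiry:
            del self.entries[ip]      # lifetime expired
            return None
        return ether


if __name__ == "__main__":
    table = ArpTable()
    table.learn("192.0.2.7", "00:00:c0:dd:d4:da", now=0)    # the "imposter"
    table.delete("192.0.2.7")
    table.learn("192.0.2.7", "00:00:c0:dd:d4:da", now=5)    # it comes back
    table.add_static("192.0.2.7", "08:00:20:0e:12:37")      # the quick fix
    table.learn("192.0.2.7", "00:00:c0:dd:d4:da", now=10)   # now ignored
    print(table.lookup("192.0.2.7", now=1000))
```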

This quick fix resolved theodore's immediate problem, but we still needed to find the culprit. We checked the DHCP configuration to see if we had an entry for Ethernet address 0:0:c0:dd:d4:da, but we didn't. From the first three bytes of this address, 0:0:c0, we knew that the device was an SMC card. We guessed that the problem address was recently installed because the user had never had the problem before. We sent out an urgent announcement to all users asking if anyone had recently installed a new PC, reconfigured a PC, or installed TCP/IP software on a PC. We got one response. When we checked his system, we found out that he had entered the address when he should have entered The address was corrected and the problem did not recur.

Nothing fancy was needed to solve this problem. Once we checked the error messages, we knew what the problem was and how to solve it. Involving the entire network user community allowed us to quickly locate the problem system and to avoid a room-to-room search for the PC. Reluctance to involve users and make them part of the solution is one of the costliest, and most common, mistakes made by network administrators.

Checking the Interface with netstat

If the preliminary tests lead you to suspect that the connection to the local area network is unreliable, the netstat -e command can provide useful information. The example below shows the output from the netstat -e command:

C:\>netstat -e
Interface Statistics
                     Received        Sent
Bytes                  112088      123876
Unicast packets           612         613
Non-unicast packets       258         257
Discards                    0           0
Errors                      0           0
Unknown protocols           2

The command displays the total amount of traffic that this system has received from and sent to the Ethernet--in both bytes and packets. It also displays the number of packets in error. Discards are packets that were received from the network and then discarded by the local system because they contained errors or could not be processed. Errors are damaged packets, including packets sent from this system that were damaged in the local buffer. These errors should be close to zero. Regardless of how much traffic has passed through this interface, 100 errors in either of these fields is high. High output errors could indicate a saturated local network or a bad physical connection between the host and the network. High input errors could indicate that the network is saturated, the local host is overloaded, or there is a physical network problem. Tools such as the Network Monitor or a cable tester can help you determine if it is a physical network problem.
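If you collect netstat -e output from several systems, a short script can flag the counters that deserve a closer look. The sketch below parses the sample output shown above and applies the rule of thumb from the text that 100 errors is high. The function names are hypothetical, and the parser assumes the column layout of that sample.

```python
SAMPLE = """\
Interface Statistics
                     Received        Sent
Bytes                  112088      123876
Unicast packets           612         613
Non-unicast packets       258         257
Discards                    0           0
Errors                      0           0
Unknown protocols           2
"""


def parse_netstat_e(output):
    """Return {counter name: (received, sent)}; some rows have one value."""
    stats = {}
    for line in output.splitlines():
        fields = line.split()
        numbers = []
        # Peel the numeric columns off the right-hand end of the line.
        while fields and fields[-1].isdigit():
            numbers.insert(0, int(fields.pop()))
        if fields and numbers:
            stats[" ".join(fields)] = tuple(numbers)
    return stats


def error_warnings(stats, threshold=100):
    """List the Discards and Errors counters at or above the threshold."""
    warnings = []
    for name in ("Discards", "Errors"):
        for direction, count in zip(("received", "sent"), stats.get(name, ())):
            if count >= threshold:
                warnings.append(f"{name} {direction}: {count}")
    return warnings


if __name__ == "__main__":
    stats = parse_netstat_e(SAMPLE)
    print(stats["Errors"], error_warnings(stats))
```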

The problem may be an overloaded network. To reduce the network load, reduce the amount of traffic on the network segment. A simple way to do this is to create multiple segments out of the single segment. Each new segment has fewer hosts and, therefore, less traffic. We'll see, however, that it's not quite this simple.

The most effective way to subdivide an Ethernet is to install an Ethernet switch. Each port on the switch is essentially a separate Ethernet. Therefore a 16 port switch gives you 16 Ethernets to work with when balancing the load. On most switches the different ports can be used in a variety of different ways. See Figure 11-2. Lightly used systems can be attached to a hub that is then attached to one of the switch ports to allow the systems to share a single segment. Servers and demanding systems can be given dedicated ports so that they don't need to share a segment with anyone. Additionally, some switches provide a few Fast Ethernet 100M bps ports. These are called asymmetric switches because different ports operate at different speeds. Use the Fast Ethernet ports to connect heavily used servers. If you're buying a new switch, buy a 10/100 switch with auto-sensing ports. This allows every port to be used at either 100M bps or at 10M bps, which gives you the maximum configuration flexibility.

Figure 11-2 shows an 8 port 10/100 Ethernet switch. Ports 1 and 2 are wired to Ethernet hubs. A few systems are connected to each hub. When new systems are added they are distributed evenly among the hubs to prevent any one segment from becoming overloaded. Additional hubs can be added to the available switch ports for future expansion. Port 4 attaches a demanding system with its own private segment. Port 6 operates at 100M bps and attaches a heavily used server. Port 7 is reserved for a future 100M bps connection to a second 10/100 Ethernet switch for even more expansion.


Figure 11-2. Subdividing an Ethernet with Switches


Before allocating the ports on your switch, evaluate what services are in demand, and who talks to whom. Then develop a plan that reduces the amount of traffic flowing over any segment. For example, if the demanding system on Port 6 uses lots of bandwidth because it is constantly talking to one of the systems on Port 1, all of the systems on Port 1 will suffer because of this traffic. The computer that the demanding system communicates with should be moved to one of the vacant ports or to the same port (6) as the demanding system. Use your switch to greatest advantage by balancing the load.

Should you segment an old co-axial cable Ethernet by cutting the cable and joining it back together through a router or a bridge? No. If you have an old network that is finally reaching saturation it is time to install a new network built on a more robust technology. A shared media network, which is a network where everyone is on the same cable, as with a co-axial cable Ethernet, is an accident waiting to happen. Design a network that a user cannot bring down by merely disconnecting his system, or even by accidentally cutting a wire in his office. Use the appropriate Unshielded Twisted Pair (UTP) cable to create a 10BaseT Ethernet or 100BaseT Fast Ethernet that wires equipment located in the user's office to a hub securely stored in a wire closet. The network components in the user's office should be sufficiently isolated from the network so that damage to those components does not damage the entire network. The new network will solve your collision problem and reduce the amount of hardware troubleshooting you are called upon to do.

Network Hardware Problems

Some of the tests discussed in this section can show a network hardware problem. If a hardware problem is indicated, contact the people responsible for the hardware. If the problem appears to be in a leased telephone line, contact the telephone company. If the problem appears to be in a wide area network, contact the management of that network. Don't sit on a problem expecting it to go away. It could easily get worse.

If the problem is in your local area network, you will have to handle it yourself. Some tools, such as the cable tester described above, can help. But frequently the only way to approach a hardware problem is by brute force--disconnecting pieces of hardware until you find the one causing the problem. The switch or hub is a convenient point where this can be done. If you identify a device causing the problem, repair or replace it. Remember the problem can be the cable itself, rather than any particular device.

Checking Routing

The "network unreachable" error message clearly indicates a routing problem. If the problem is in the local host's routing table, it is easy to detect and resolve. First, use netstat -nr or route print to see whether or not a valid route to your destination is installed in the routing table.

For example, a user reports that the "network is down" because he cannot FTP to, and a ping test returns the following results:

% ping
PING 32 data bytes
sendto: Network is unreachable
ping: wrote 32 chars, ret=-1
sendto: Network is unreachable
ping: wrote 32 chars, ret=-1

Based on the "network unreachable" error message, check the user's routing table. In our example, we're looking for a route to The IP address of 3 is, which is a class C address. Remember that routes are network oriented. So we check for a route to network If a specific route is not found, remember to look for a default route. If netstat shows the correct specific route, or a valid default route, the problem is not in the routing table. In that case, use tracert, as described in the next section, to trace the route all the way to its destination.
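The check described here, looking first for a specific network route and then for a default route, can be sketched with Python's standard ipaddress module. The routing tables, the gateway addresses, and the destination 198.51.100.3 below are hypothetical examples, not the values from the user's system, and find_route is an illustrative name.

```python
import ipaddress


def find_route(routing_table, destination):
    """Return (network, gateway) for the most specific matching route.

    routing_table is a list of (network, gateway) pairs. The default
    route is written as 0.0.0.0/0. Returns None when nothing matches,
    which is the "network unreachable" case.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for network, gateway in routing_table:
        net = ipaddress.ip_network(network)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    if best is None:
        return None
    return str(best[0]), best[1]


if __name__ == "__main__":
    local_only = [("172.16.12.0/24", "172.16.12.2")]
    with_default = local_only + [("0.0.0.0/0", "172.16.12.1")]
    with_specific = with_default + [("198.51.100.0/24", "172.16.12.3")]
    for table in (local_only, with_default, with_specific):
        print(find_route(table, "198.51.100.3"))
```

The first table has neither a specific route nor a default route, so the lookup fails. Adding either one gives the destination a usable route, and the specific route wins when both are present.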

If netstat doesn't return the expected route, it's a local routing problem. There are two ways to approach local routing problems, depending on whether the system uses static or dynamic routing. Most systems that use static routing rely on a default route, so the missing route could be the default route. Use the Gateway tab in the TCP/IP Properties window to install the default route as described in Chapter 5. If you use multiple static routes, use route -p add to define them, which is also covered in Chapter 5, Installing TCP/IP.

If you're using dynamic routing, make sure that the routing program is running. The various routing protocols for Windows NT are provided by the Routing and Remote Access Service (RRAS). If the correct routing service is not running, start it as specified in Chapter 9, Microsoft Routing and Remote Access Services.

Tracing routes

If the local routing table is correct, the problem may be occurring some distance away from the local host. Remote routing problems can cause the "no answer" error message, as well as the "network unreachable" error message. But the "network unreachable" message does not always mean a routing problem. It can literally mean that the remote network cannot be reached because something is down between the local host and the remote destination. tracert is the program that can help you locate these problems.

tracert traces the route of packets from the local host to a remote host. It prints the name (if it can be determined) and IP address of each gateway along the route to the remote host.

tracert uses two techniques, small TTL (time-to-live) values and ICMP Echo Request packets, to trace packets to their destination. (The Unix traceroute program, on which tracert is modeled, sends UDP packets to an invalid port number instead.) tracert sends out Echo Requests with small TTL values to detect the intermediate gateways. The TTL values start at one and increase in increments of one for each group of three packets sent. When a gateway receives a packet, it decrements the TTL. If the TTL is then zero, the packet is not forwarded and an ICMP "Time Exceeded" message is returned to the source of the packet. tracert displays one line of output for each gateway from which it receives a "Time Exceeded" message.

When the destination host receives a packet from tracert, it returns an ICMP "Echo Reply" message, just as it would in response to a ping. When tracert receives the Echo Reply, it knows that it has reached the destination host, and it terminates the trace. In this way, tracert is able to develop a list of the gateways, starting at one hop away and increasing one hop at a time, until the remote host is reached. Figure 11-3 illustrates the flow of packets tracing to a host three hops away.


Figure 11-3. Flow of tracert Packets
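The TTL mechanism can be modeled without sending any packets. The sketch below is a simulation only. The simulate_trace function and the node names are invented for illustration, and the destination's final reply is recorded simply as "destination reply", whatever form it takes on the wire.

```python
def simulate_trace(path, max_hops=30):
    """Simulate probes sent with increasing TTL values along path.

    path lists the gateways in order, with the destination host last.
    Returns one (ttl, responder, reply) tuple per TTL value tried.
    """
    results = []
    for ttl in range(1, max_hops + 1):
        remaining = ttl
        for index, node in enumerate(path):
            if index == len(path) - 1:
                # The probe survived every gateway and reached the target.
                results.append((ttl, node, "destination reply"))
                return results
            remaining -= 1  # each gateway decrements the TTL
            if remaining == 0:
                # TTL exhausted: the gateway drops the probe and reports back.
                results.append((ttl, node, "Time Exceeded"))
                break
    return results

# A hypothetical host three hops away: two gateways, then the host itself.
for ttl, node, reply in simulate_trace(["gateway-1", "gateway-2", "remote-host"]):
    print(ttl, node, reply)
```

Each TTL value exposes exactly one more node, which is why the real output grows by one line per hop.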


The following example shows a tracert to from a workstation hanging off BBN PlaNET. tracert sends out three packets at each TTL value. If no response is received to a packet, tracert prints an asterisk (*). If a response is received, tracert displays the packet's round-trip time in milliseconds and the address of the gateway that responded.

Tracing route to []
over a maximum of 30 hops:
  1     10 ms    <10 ms    <10 ms
  2    <10 ms    <10 ms    <10 ms
  3    <10 ms    <10 ms     10 ms
  4    <10 ms     10 ms     10 ms
  5    <10 ms     10 ms     10 ms
  6    <10 ms     10 ms     10 ms
  7     10 ms     10 ms     10 ms
  8     10 ms     20 ms     20 ms
  9     10 ms     20 ms     20 ms
 10     20 ms     10 ms     20 ms
 11     20 ms     20 ms     20 ms
 12     20 ms     30 ms     20 ms
 13     30 ms     40 ms     40 ms
 14     130 ms    100 ms    90 ms
Trace complete.

This trace shows that thirteen intermediate gateways are involved, that packets are making the trip, and that round-trip travel time for packets from this host to is about 107 ms.
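The 107 ms figure is the mean of the three round-trip times on the last line of the trace (130, 100, and 90 ms). The following sketch shows one way to pull those times out of a line of tracert output. The parse_hop and average_rtt names are ours, and the parser assumes the column layout shown in the example above.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
PROBE = re.compile(r"(<?\d+) ms|(\*)")

def parse_hop(line):
    """Parse one hop line into (hop_number, [rtt_ms or None, ...]).

    "<10 ms" is recorded as 10, which is only an upper bound, and an
    unanswered probe ("*") becomes None. Non-hop lines return None.
    """
    match = HOP_LINE.match(line)
    if match is None:
        return None
    times = []
    for value, lost in PROBE.findall(match.group(2)):
        times.append(None if lost else int(value.lstrip("<")))
    return int(match.group(1)), times[:3]

def average_rtt(times):
    """Mean of the probes that were answered, or None if none were."""
    answered = [t for t in times if t is not None]
    return sum(answered) / len(answered) if answered else None

hop, times = parse_hop(" 14     130 ms    100 ms    90 ms")
print(hop, times, round(average_rtt(times)))
```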

Variations and bugs in the implementation of ICMP on different types of gateways, and the unpredictable nature of the path a datagram can take through a network, can cause some odd displays. For this reason, you shouldn't examine the output of tracert too closely. The most important things in the tracert output are:

Below we show another trace of the path to This time the trace does not go all the way through to the InterNIC.

C:\>tracert -d
Tracing route to []
over a maximum of 30 hops:
  1     10 ms    <10 ms    <10 ms
  2    <10 ms    <10 ms    <10 ms
  3    <10 ms    <10 ms     10 ms
  4    <10 ms     10 ms     10 ms
  5    <10 ms     10 ms     10 ms
  6    <10 ms     10 ms     10 ms
  7     10 ms     10 ms     10 ms
  8     10 ms     20 ms     20 ms
  9        *         *         *
 10        *         *         *
 29        *         *         *
 30        *         *         *

When tracert fails to get packets through to the remote end-system, the trace trails off, displaying a series of three asterisks at each hop count until the count reaches 30. If this happens, contact the administrator of the remote host you're trying to reach, and the administrator of the last gateway displayed in the trace. Describe the problem to them; they may be able to help. 4

In our example, the last gateway that responded to our packets was We would contact this system administrator, and the administrator of
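Finding the last gateway that answered is easy to automate when a trace has been saved to a file. The following is a minimal sketch. The last_responding_hop name is hypothetical, and it assumes that any answered probe shows up as a time followed by "ms", as in the listings above.

```python
def last_responding_hop(lines):
    """Return the highest hop number whose line shows at least one reply."""
    last = None
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip headers such as "Tracing route to ..."
        # A hop answered if at least one probe column reports a time in ms.
        if "ms" in fields[1:]:
            last = int(fields[0])
    return last

trace = [
    "over a maximum of 30 hops:",
    "  7     10 ms     10 ms     10 ms",
    "  8     10 ms     20 ms     20 ms",
    "  9        *         *         *",
    " 10        *         *         *",
]
print(last_responding_hop(trace))
```

The hop number it returns identifies the line whose gateway address you would look up before contacting that gateway's administrator.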

Checking Name Service

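Name server problems are indicated when the "unknown host" error message is returned by the user's application. Name server problems can usually be diagnosed with nslookup. Three features of nslookup are particularly important for troubleshooting remote name server problems. These features are its ability to locate the authoritative servers for the remote domain using the NS query, to obtain all records about the remote host using the ANY query, and to browse the remote zone using the ls and view commands.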

When troubleshooting a remote server problem, directly query the authoritative servers returned by the NS query. Don't rely on information returned by non-authoritative servers. If the problems that have been reported are intermittent, query all of the authoritative servers in turn and compare their answers. Intermittent name server problems are sometimes caused by the remote servers returning different answers to the same query.

The ANY query returns all records about a host, thus giving the broadest range of troubleshooting information. Simply knowing what information is (and isn't) available can solve a lot of problems. For example, if the query returns an MX record but no A record, it is easy to understand why the user couldn't telnet to that host! Many hosts are accessible to mail that are not accessible by other network services. In this case, the user is confused and is trying to use the remote host in an inappropriate manner.

If you are unable to locate any information about the host name that the user gave you, perhaps the host name is incorrect. If you have the IP address, use the PTR query to do a reverse lookup. Without a valid host name or address, looking for the correct name is like trying to find a needle in a haystack. However, nslookup can help. Use nslookup's ls command to dump the remote zone file, and redirect the listing to a file. Then use nslookup's view command to browse through the file, looking for names similar to the one the user supplied. Many problems are caused by a mistaken host name.
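Both ideas in this paragraph can be sketched with the Python standard library. The ptr_name function shows the domain name that a PTR query for an address actually asks about, and similar_names scans a list of names, such as one extracted from a saved zone listing, for near misses. Neither function runs nslookup. The function names, the sample address, and the 0.6 similarity cutoff are assumptions made for this example.

```python
import difflib
import ipaddress

def ptr_name(address):
    """Return the in-addr.arpa name that a reverse (PTR) lookup queries."""
    return ipaddress.ip_address(address).reverse_pointer

def similar_names(wanted, zone_names, limit=3):
    """Return names from a zone listing that resemble the one supplied."""
    lowered = {name.lower(): name for name in zone_names}
    hits = difflib.get_close_matches(
        wanted.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[hit] for hit in hits]

# Hypothetical data: an address from a documentation range and a few
# host names of the kind a zone listing might contain.
print(ptr_name("192.0.2.5"))
print(similar_names("zephir", ["PCpma", "accu", "gw", "zephyr"]))
```

Comparing in lowercase matters because host names are not case sensitive, while the string comparison is.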

The nslookup features and commands mentioned here are used in Chapter 8. Some examples using these commands to solve real name server problems are shown below. The two examples that follow are based on actual trouble reports. 5

Some Systems Work, Others Don't

A user reported that she could resolve a certain host name from her workstation, but could not resolve the same host name from the central system. However, the central system could resolve other host names. We ran several tests and found that we could resolve the host name on some systems and not on others. There seemed to be no predictable pattern to the failure. So we used nslookup to check the remote servers.

Default Server:

> set type=NS
Address: nameserver = nameserver = nameserver = inet address = inet address = inet address =
> set type=ANY
> server
Default Server:

Address: inet address =
> server
Default Server:

*** can't find Non-existent domain

This sample nslookup session contains several steps. The first step is to locate the authoritative servers for the host name in question ( ). We set the query type to NS to get the name server records, and queried for the domain ( ) in which the host name is found. This returns three names of authoritative servers:,, and

Next, we set the query type to ANY to look for any records related to the host name in question. Then we set the server to the first server in the list,, and queried for This returns an address record. So server works fine. We repeated the test using as the server, and it fails. No records are returned.

The next step is to get SOA records from each server and see if they are the same:

> set type=SOA
Address: origin =
	mail addr =
	serial=10164, refresh=43200, retry=3600, expire=3600000,
> server
Default Server:

Address: origin =
	mail addr =
	serial=10164, refresh=43200, retry=3600, expire=3600000,
> exit

If the SOA records have different serial numbers, perhaps the zone file, and therefore the host name, has not yet been downloaded to the secondary server. If the serial numbers are the same and the data is different, as in this case, there is a definite problem. Contact the remote domain administrator and notify her of the problem. The administrator's mailing address is shown in the "mail addr" field of the SOA record. In our example, we would send mail to reporting the problem.
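The reasoning in this paragraph reduces to a small decision rule. The sketch below restates it in Python. The compare_servers name and the sample address are hypothetical, and the function only classifies results you have already collected with nslookup.

```python
def compare_servers(first, second):
    """Classify two authoritative servers' views of one host name.

    Each argument is a (soa_serial, answer) pair, where answer is the
    record data returned for the name, or None if the name was not found.
    """
    serial_a, answer_a = first
    serial_b, answer_b = second
    if answer_a == answer_b:
        return "consistent"
    if serial_a != serial_b:
        # One server is probably still holding an older copy of the zone.
        return "zone not yet transferred"
    return "same serial, different data: contact the domain administrator"

# The situation in our example: identical serial numbers, but only one
# server returns an address (the address here is made up).
print(compare_servers((10164, "192.0.2.7"), (10164, None)))
```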

The Data is Here and the Server Can't Find It!

This problem was reported by the administrator of a secondary name server. The administrator reported that his server could not resolve a certain host name in a domain for which his server was a secondary server. The primary server was, however, able to resolve the name.

The problem was replicated on several other secondary servers. The primary server would resolve the name; the secondary servers wouldn't. All servers had the same SOA serial number, so why wouldn't they resolve the host name to an address?

Visualizing the difference between the way primary and secondary servers load their data made us suspicious of the zone file transfer. Primary servers load the data directly from local disk files. Secondary servers transfer the data from the primary server via a zone file transfer. Perhaps the zone files were getting corrupted. We displayed the zone file on one of the secondary servers, and it showed the following data:

PCpma		IN	A
		IN	HINFO	"pc" "n3/800salesttgnetcom"
PCrkc		IN	A
		IN	HINFO	"pc" "n3/800salesttgnetcom"
PCafc		IN	A
		IN	HINFO	"pc" "n3/800salesttgnetcom"
accu		IN	A
cmgds1	IN	A
cmg		IN	A
PCgns		IN	A
		IN	HINFO	"pc" "(3/800salesttgnetcom"
gw		IN	A
zephyr	IN	A
		IN	HINFO	"Sun" "sparcstation"
ejw		IN	A
PCecp		IN	A
		IN	HINFO	"pc" "n^Lsparcstationstcom"

Notice the odd display in the last field of the HINFO statement for each PC. This data might have been corrupted in the transfer or it might be bad on the primary server. We used nslookup to check that.

Default Server:

> server
Default Server:

> set query=HINFO
Address: CPU=pc OS=ov
packet size error (0xf7fff590 != 0xf7fff528)
> exit

In this nslookup example, we set the server to, which is the primary server for Next we queried for the HINFO record for one of the hosts that appeared to have a corrupted record. The "packet size error" message clearly indicates that nslookup was even having trouble retrieving the HINFO record directly from the primary server. We contacted the administrator of the primary server and told him about the problem, pointing out the records that appeared to be in error. He discovered that he had forgotten to put an operating system entry on some of the HINFO records. He corrected this, and it fixed the problem.

Analyzing Protocol Problems

Problems caused by bad TCP/IP configurations are much more common than problems caused by bad TCP/IP protocol implementations. Most of the problems you encounter will succumb to analysis using the simple tools we have already discussed. But on occasion, you may need to analyze the protocol interaction between two systems. In the worst case, you may need to analyze the packets in the data stream bit by bit. Protocol analyzers help you do this.

Network Monitor is the tool we'll use. It is provided with Windows NT Server 4.0. 6 Although we use Network Monitor in our examples, the concepts introduced in this section should be applicable to any analyzer, because most protocol analyzers function in basically the same way. Protocol analyzers display network statistics, and allow you to select packets and to examine those packets byte by byte. We'll discuss all of these functions.

Network Monitor

The Network Monitor comes with Windows NT Server 4.0 but it is not installed by default. To install the monitor, go to the Control Panel, open Network, select the Services tab and click on Add. From the list of services that is displayed, select and install "Network Monitor Tools and Agent". Once the Network Monitor is installed, it is run from the Start menu [Start -> Programs -> Administrative Tools (Common) -> Network Monitor].

When the Network Monitor starts, it just sits there. To see any interesting statistics or data you must select Start from the Capture menu at the top of the window. Figure 11-4 shows the Network Monitor window while a capture is running. The window displays a graph of the network load. It displays a scroll pane that contains network statistics, statistics about the capture buffer, and Ethernet card statistics and errors. At the bottom of the window, it displays a scroll pane that shows every network address detected, the number of frames and bytes transferred by that address, and whether the frames were unicast, multicast or broadcast. Clearly, Network Monitor provides much more statistical information than a simple netstat command!


Figure 11-4. Gathering Statistics with Network Monitor


Select "Stop and View" from the Capture menu to view more details of the packets that have been captured. This stops the packet capture and opens the Capture: Summary pane, which lists summary information about every packet received during the capture. The Network Monitor displays a single line of summary information for each packet received. Each line contains a frame number, 7 the time the packet was received, the source and destination Ethernet addresses, the protocol being used, and the source and destination IP addresses.

This summary information is sufficient to gain insight into how packets flow between two hosts and into potential problems. Frequently, this is enough to solve the problem. However, troubleshooting protocol problems sometimes requires more detailed information about each packet.

To display the data contained in a packet, double-click on the summary line of the packet in the Capture: Summary window. Figure 11-5 shows how Network Monitor displays the details of a packet. Double-clicking on a frame in the Capture: Summary pane divides the pane into three separate scroll frames. The top scroll area is the normal summary information mentioned above. The middle scroll area is a break-out of the individual fields in the frame header. The scroll section at the bottom of the pane displays the packet data in hex and ASCII. In most cases, you don't need to see the entire packet. Usually, the headers are sufficient to troubleshoot a protocol problem. But the data is there when you need it.


Figure 11-5. Detail Packet Information


The formatting done by Network Monitor maps the bytes received from the network to the header structure. Look at the description of the various header fields in Chapter 1, Overview of TCP/IP, for more information.

By default, Network Monitor captures all of the packets to or from the local host. This can create lots of information, much of which may be of no interest. Filters are used to select a subset of these packets. Filters can be defined to capture packets from, or to, specific hosts or protocols, packets that contain specific data, or combinations of all these.

The Network Monitor supports two types of filters. You can create a capture filter before you start to capture data so that only the data you want is collected in the capture buffer. The advantage of a capture filter is that it saves buffer space. The other type of filter is a display filter. It filters the packets that are already in the buffer so that only those you want are displayed.

To define a capture filter, select Filter from the Capture menu before you start to capture data. The Capture Filter window shown in Figure 11-6 appears. The filter shown in Figure 11-6 is the default filter that Network Monitor uses to capture all data into and out of the NT system. It, and all other Network Monitor filters, can filter on three types of information. First is the physical network frame type (SAP/ETYPE). Second is the source or destination address of the packet. And third is the data contained in the packet. The default filter accepts all frame types going into or out of thoth and does not filter out any of them based on the data they contain.


Figure 11-6. Defining a Network Monitor Filter


To change a value in the filter, highlight it and select the Line button that appears in the Edit box. For example, we might highlight the SAP/ETYPE line and edit it to only accept IP type Ethernet frames. We also might highlight the entry under Address Pairs and change it to only capture packets to a specific host instead of all hosts. All of these changes are made by selecting values from scroll boxes that are displayed when the Edit Line button is selected.

The display filter is defined in a very similar way. The biggest difference is that the filter is defined after the capture. First capture data. Then select "Stop and View" from the Capture menu. The Capture: Summary window is displayed. Select Filter from the Display menu. This opens a window that is almost identical to the one shown in Figure 11-6. Modifying values in this filter controls what frames are displayed in the Capture: Summary pane.
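The difference between the two filter types is only a matter of when the same kind of test is applied. The following sketch models that idea. It is not Network Monitor's filter engine or file format, and the Frame and Filter classes and the sample frames are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    etype: str          # frame type, for example "IP" or "ARP"
    source: str
    destination: str
    data: bytes

@dataclass
class Filter:
    etypes: frozenset = frozenset()   # empty means any frame type
    host: str = ""                    # empty means any address
    pattern: bytes = b""              # empty means any data

    def accepts(self, frame):
        if self.etypes and frame.etype not in self.etypes:
            return False
        if self.host and self.host not in (frame.source, frame.destination):
            return False
        return self.pattern in frame.data

def capture(frames, capture_filter):
    """A capture filter decides what is stored in the buffer at all."""
    return [f for f in frames if capture_filter.accepts(f)]

def display(buffer, display_filter):
    """A display filter hides frames but leaves the buffer untouched."""
    return [f for f in buffer if display_filter.accepts(f)]

wire = [
    Frame("IP", "192.0.2.1", "192.0.2.9", b"USER anonymous"),
    Frame("ARP", "192.0.2.1", "192.0.2.2", b""),
    Frame("IP", "192.0.2.5", "192.0.2.9", b"RETR file"),
]
buffer = capture(wire, Filter(etypes=frozenset({"IP"})))
print(len(buffer), len(display(buffer, Filter(host="192.0.2.5"))))
```

The ARP frame never reaches the buffer, so no display filter can bring it back. That is the trade-off a capture filter makes in exchange for saving buffer space.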

In the following section we look at how a protocol analyzer was used to troubleshoot a network problem.

Protocol Case Study

This example is an actual case that was solved by protocol analysis. The problem was reported as an occasional FTP failure with the error message:

netout: Option not supported by protocol
421 Service not available, remote server has closed connection

Only one user reported the problem, and it occurred only when transferring large files from a workstation to the central computer, via our FDDI backbone network.

We obtained the user's data file and were able to duplicate the problem from other workstations, but only when we transferred the file to the same central system via the backbone network. Figure 11-7 graphically summarizes the tests we ran to duplicate the problem.


Figure 11-7. FTP Test Summary


We notified all users of the problem. In response, we received reports that others had also experienced it, but again only when transferring to the central system, and only when transferring via the backbone. They had not reported it, because they rarely saw it. But the additional reports gave us some evidence that the problem did not relate to any recent network changes.

Because the problem had been duplicated on other systems, it probably was not a configuration problem on the user's system. The FTP failure could also be avoided if the backbone routers and the central system did not interact. So we concentrated our attention on those systems. We checked the routing tables and ARP tables, and ran ping tests on the central system and the routers. No problems were observed.

Based on this preliminary analysis, the FTP failure appeared to be a possible protocol interaction problem between a certain brand of routers and a central computer. We made that assessment because the transfer routinely failed when these two brands of systems were involved, but never failed in any other circumstance. If the router or the central system were misconfigured, they should fail when transferring data to other hosts. If the problem was an intermittent physical problem, it should occur randomly regardless of the hosts involved. Instead, this problem occurred predictably, and only between two specific brands of computers. Perhaps there was something incompatible in the way these two systems implemented TCP/IP.

Therefore, we used a protocol analyzer to capture the TCP/IP headers during several FTP test runs. Reviewing the analyzer output showed that all transfers that failed with the netout error message had an ICMP Parameter Error packet near the end of the session, usually about 50 packets before the final close. No successful transfer had this ICMP packet. Note that the error did not occur in the last packet in the data stream, as you might expect. It is common for an error to be detected, and for the data stream to continue for some time before the connection is actually shut down. Don't assume that an error will always be at the end of a data stream.

Detailed analysis of the packets involved in the error showed that the router issued an IP Header Checksum of 0xffff, and that the central system objected to this checksum. We know that the central system objected to the checksum because it returned an ICMP Parameter Error with a Pointer of 10. The Parameter Error indicates that there is something wrong with the data the system has just received, and the Pointer identifies the specific byte that the system thinks is in error. The Pointer counts bytes from zero, and byte 10 of the IP header is the first byte of the IP Header Checksum field. The data field of the ICMP error message returns the header that it believes is in error. When we displayed that data we noticed that when the central system returned the header, the checksum field was "corrected" to 0000. Clearly the central system disagreed with the router's checksum calculation.
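It helps to see why 0xffff and 0000 can both turn up as a checksum for the same header. The IP header checksum uses 16-bit one's complement arithmetic, in which both values represent zero. The sketch below illustrates the arithmetic only. We do not know how either vendor's code was written, and the sample header is hypothetical, chosen so that the computed checksum comes out to zero.

```python
import struct

def ones_complement_sum(data):
    """Add 16-bit big-endian words, folding any carry back into the sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def compute_checksum(header):
    """Checksum a sender would store, computed with bytes 10-11 zeroed."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return ~ones_complement_sum(zeroed) & 0xFFFF

def verifies(header):
    """A header is intact if all its words, checksum included, sum to 0xFFFF."""
    return ones_complement_sum(header) == 0xFFFF

def with_checksum(header, value):
    """Return a copy of header with the checksum field set to value."""
    return header[:10] + struct.pack("!H", value) + header[12:]

# Hypothetical 20-byte header whose other fields sum to 0xFFFF.
base = bytes.fromhex("45000073b861400040110000c0a80001c0a800c7")
print(hex(compute_checksum(base)))
print(verifies(with_checksum(base, 0x0000)))
print(verifies(with_checksum(base, 0xFFFF)))
```

A receiver that sums the whole header, as verifies does, accepts either value. A receiver that recomputes the checksum and compares it with the stored field bit for bit would see 0000 where the sender wrote 0xffff and would call the header bad. That is one possible explanation for a disagreement like the one we observed, not an established account of this case.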

Occasional checksum errors will occur. They can be caused by transmission problems, and are intended to detect these types of problems. Every protocol suite has a mechanism for recovering from checksum errors. So how should they be handled in TCP/IP?

To determine the correct protocol action in this situation, we turned to the authoritative sources--the RFCs. RFC 791, Internet Protocol, provided information about the checksum calculation, but the best source for this particular problem was RFC 1122, Requirements for Internet Hosts--Communication Layers, by R. Braden. This RFC provided two specific references that define the action to be taken. These two quotes are taken from page 29 of RFC 1122:

In the following, the action specified in certain cases is to "silently discard" a received datagram. This means that the datagram will be discarded without further processing and that the host will not send any ICMP error message (see Section 3.2.2) as a result. ...


A host MUST verify the IP header checksum on every received datagram and silently discard every datagram that has a bad checksum.

Therefore, when a system receives a packet with a bad checksum, it is not supposed to do anything with it. The packet should be discarded, and the system should wait for the next packet to arrive. The system should not respond with an error message. A system cannot respond to a bad IP header checksum, because it cannot really know where the packet came from. If the header checksum is in doubt, how do you know if the addresses in the header are correct? And if you don't know for sure where the packet came from, how can you respond to it?

IP relies on the upper layer protocols to recover from these problems. If TCP is used (as it was in this case), the sending TCP eventually notices that the recipient has never acknowledged the segment, and it sends the segment again. If UDP is used, the sending application is responsible for recovering from the error. In neither case does recovery rely on an error message returned from the recipient.

Therefore, for an incorrect checksum, the central system should have simply discarded the bad packet. The vendor was informed of this problem and, much to their credit, they sent us a fix for the software within two weeks. Not only that, the fix worked perfectly!

Not all problems are resolved so cleanly. But the technique of analysis is the same no matter what the problem.

Simple Network Management Protocol

Troubleshooting is necessary to recover from problems, but the ultimate goal of the network administrator is to avoid problems. That is also the goal of network management software. The network management software used on TCP/IP networks is based on the Simple Network Management Protocol (SNMP).

SNMP is a client/server protocol. In SNMP terminology it is described as a manager/agent protocol. The agent (the server) runs on the device being managed, which is called the Managed Network Entity. The agent monitors the status of the device and reports that status to the manager.

The manager (the client) runs on the Network Management Station (NMS). The NMS collects information from all of the different devices that are being managed, consolidates it, and presents it to the network administrator. This design places all of the data manipulation tools and most of the human interaction on the NMS. Concentrating the bulk of the work on the manager means that the agent software is small and easy to implement. Consequently, most TCP/IP network equipment comes with an SNMP management agent.

SNMP is a request/response protocol. The request and response messages that SNMP sends in the datagrams are called Protocol Data Units (PDUs). These message types allow the manager to request management information, and when appropriate, to modify that information. The messages also allow the agent to respond to manager requests and to notify the manager of unusual situations.

The NMS periodically requests the status of each managed device and each agent responds with the status of its device. Making periodic requests is called polling. Polling reduces the burden on the agent because the NMS decides when polls are needed, and the agent simply responds. Polling also reduces the burden on the network because the polls originate from a single system at a predictable rate. The shortcoming of polling is that it does not allow for real-time updates. If a problem occurs on a managed device, the manager does not find out until the agent is polled. To handle this, SNMP uses a modified polling system called trap-directed polling.

A trap is an interrupt signaled by a predefined event. When a trap event occurs, the SNMP agent does not wait for the manager to poll; instead it immediately sends information to the manager. Traps allow the agent to inform the manager of unusual events while allowing the manager to maintain control of polling.
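The interplay of polls and traps can be modeled in a few lines. This is a toy simulation in which no SNMP messages are sent, and the Agent and Manager classes are invented for the example. It also assumes one common reading of trap-directed polling: the manager reacts to a trap by immediately polling the agent that sent it.

```python
class Agent:
    def __init__(self, name):
        self.name = name
        self.status = "ok"
        self.manager = None

    def poll(self):
        """Answer a manager request with the current status."""
        return self.status

    def event(self, new_status):
        """A predefined event occurred: record it and send a trap at once."""
        self.status = new_status
        if self.manager is not None:
            self.manager.receive_trap(self)


class Manager:
    def __init__(self, agents):
        self.agents = list(agents)
        self.view = {}   # last known status of each device
        self.log = []    # order in which polls and traps happened
        for agent in self.agents:
            agent.manager = self

    def poll_all(self):
        """Routine polling cycle, run on the manager's own schedule."""
        for agent in self.agents:
            self.view[agent.name] = agent.poll()
            self.log.append(("poll", agent.name))

    def receive_trap(self, agent):
        """A trap prompts an immediate poll of the agent that sent it."""
        self.log.append(("trap", agent.name))
        self.view[agent.name] = agent.poll()
        self.log.append(("poll", agent.name))


router, hub = Agent("router"), Agent("hub")
nms = Manager([router, hub])
nms.poll_all()
router.event("link down")
print(nms.view)
```

The manager learns about the router's problem without waiting for the next polling cycle, and the hub is not polled any more often than before.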

SNMP has a rudimentary security mechanism called community names. A community name is a group name known only to the members of the group. Every system in the community must use the same community name. The SNMP agent compares the community name contained in each request it receives against its own community name. If they match, it honors the request. If they don't match, the agent discards the request and generates an authenticationFailure trap. The default community name "public" is used when no security is desired.
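The community name check is a plain string comparison, which is why we call the mechanism rudimentary. The sketch below restates the rule. The handle_request function and the "netmgmt" community name are hypothetical, and a real agent would work on SNMP messages rather than Python strings.

```python
def handle_request(agent_community, request_community, request):
    """Return (response, trap) for one request, per the community check."""
    if request_community == agent_community:
        return f"honored: {request}", None
    # Mismatch: discard the request and generate an authenticationFailure trap.
    return None, "authenticationFailure"

print(handle_request("netmgmt", "netmgmt", "get sysDescr"))
print(handle_request("netmgmt", "public", "get sysDescr"))
```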

Every piece of information managed by SNMP has a unique object identifier. These objects are grouped together in a Management Information Base (MIB). The MIB refers to all information that is managed by SNMP. However, we usually refer to "a MIB" or "the MIBs" (plural) meaning the individual databases of management information formally defined by an RFC or privately defined by a vendor.
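An object identifier is a sequence of integers, usually written in dotted notation, and the objects in a MIB are ordered by those sequences. The sketch below models a tiny MIB as a Python dictionary. The values are made up and the helper names are ours. The three identifiers are the standard ones for the system description and for the input byte and input error counters of interface 1.

```python
def parse_oid(text):
    """Turn dotted notation such as "1.3.6.1.2.1.1.1.0" into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))

def walk(mib, prefix):
    """Yield (oid, value) pairs under prefix, in lexicographic OID order."""
    root = parse_oid(prefix)
    for oid in sorted(mib):
        if oid[:len(root)] == root:
            yield oid, mib[oid]

# Hypothetical agent data, keyed by object identifier.
mib = {
    parse_oid("1.3.6.1.2.1.1.1.0"): "example router",    # sysDescr
    parse_oid("1.3.6.1.2.1.2.2.1.10.1"): 1200000,        # ifInOctets, interface 1
    parse_oid("1.3.6.1.2.1.2.2.1.14.1"): 12,             # ifInErrors, interface 1
}

# Everything under the interfaces group (1.3.6.1.2.1.2).
for oid, value in walk(mib, "1.3.6.1.2.1.2"):
    print(".".join(map(str, oid)), value)
```

Storing identifiers as tuples of integers rather than strings keeps the ordering numeric, so that 10 sorts after 9 instead of before 2.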

MIB-I and MIB-II are standards defined by RFCs. MIB-II is a superset of MIB-I, and is the standard MIB for monitoring TCP/IP. It provides such information as the number of packets transmitted into and out of an interface, and the number of errors that occurred sending and receiving those packets--useful information for spotting usage trends and potential trouble spots. Every agent supports MIB-I or MIB-II.

Some systems also provide a private MIB in addition to the standard MIB-II. Private MIBs add to the monitoring capability by providing system-specific information. Private MIBs are most common on network hardware like routers, hubs and switches.

A private MIB won't do you any good unless your network monitoring software also supports that MIB. For this reason, most administrators prefer to purchase a monitor from the vendor that supplies the bulk of their network equipment. Another possibility is to select a monitor that includes a MIB compiler, which gives you the most flexibility. A MIB compiler reads in the description of a MIB and adds the MIB to the monitor. A MIB compiler makes the monitor extensible because if you can get the source from the network equipment vendor, you can add the vendor's private MIB to your monitor.

SNMP has twice as much jargon as the rest of networking--and that's saying something! Managed Network Entity, NMS, PDU, trap, polling, and MIB. Why this bewildering array of acronyms and buzz-words? We think there are two main reasons:

Don't be put off by the jargon. All of this detail is necessary to formally define a network management scheme that is independent of the managed systems, but you don't need to memorize it. You need to know that a MIB is a collection of management information, that an NMS is the network management station, and that an agent runs in each managed device in order to make intelligent decisions when selecting an SNMP monitor. This information provides that necessary background. The features available in network monitors vary widely; so does the price. Select an SNMP monitor that is suitable for the complexity of your network and the size of your budget!

Windows NT does not provide an SNMP management station, but it does provide an agent. If you install an SNMP manager on your network, enable the SNMP agent on your NT system. To do this, go to the Control Panel, open Network, select the Services tab and click Add. From the services list that is displayed, select SNMP Service. The system will automatically display the SNMP Properties sheet where you configure the agent. The properties sheet contains three tabs:


The Agent tab is used to define contact and location information that identifies this system on the management station. The contact is the name of the user of the Windows NT system and the location is the NT system's physical location. For example, the contact might be "Tyler McCafferty" and the location might be "Building 10, room 101".


The Traps tab is used to define the IP address of the management station to which traps are sent, and the community name that must be used to communicate with that station.


The Security tab defines the community names that the Windows NT agent will accept in packets it receives. By default this is set to "public". The Security tab also allows you to define the IP addresses from which your agent will accept SNMP packets.

Define the configuration. Reboot the system, and your NT computer will report its status to the SNMP Network Management Station.


Inevitably a network breaks. This chapter discusses the tools and techniques that are used to recover from network problems, and the planning and monitoring that can help avoid them. The solution to a problem is sometimes obvious if you can just gain enough information to know exactly what the problem is. Windows NT provides several built-in software tools that can help you gather information about system configuration, addressing, routing, name service, and other vital network components. Gather your tools and learn how to use them before a problem occurs.

In the next chapter we talk about another task that is important to the maintenance of a reliable network. In Chapter 12, Network Security, we look at ways to keep your network secure.

1. Chapter 13 explains how to find out who is responsible for a remote network.

2. We emphasize "static" addresses because addresses assigned by DHCP do not cause address conflicts, which is one more reason to use DHCP whenever you can.

3. Use nslookup to find the IP address if you don't know it. nslookup is discussed later in this chapter.

4. Chapter 13 explains how to find out who is responsible for a specific computer.

5. The host and server names are fictitious, but the problems were real.

6. The standard Network Monitor can only monitor traffic from or to the Windows NT system on which it is running. A more "full-featured" version that can monitor all network traffic is available with the optional System Management Server (SMS) software from Microsoft.

7. The Network Monitor refers to the packets as "frames" because they contain the Ethernet framing information when they are captured.

© 2001, O'Reilly & Associates, Inc.