Software Testing Fundamentals

Sunday, December 9, 2007

The State of Software Testing

A Quick Look at How We Got Where We Are

Most of the formal methods and metrics around today had their start back in the 1970s and 1980s when industry began to use computers. Computer professionals of that time were scientists, usually mathematicians and electrical engineers. Their ideas about how to conduct business were based on older, established industries like manufacturing; civil projects like power plants; and military interests like avionics and ballistics.

The 1980s: Big Blue and Big Iron Ruled

By the 1980s, computers were widely used in industries that required lots of computation and data processing. Software compilers were empowering a new generation of programmers to write machine-specific programs.

In the 1980s computers were mainframes: big iron. Large corporations like IBM and Honeywell ruled the day. These computers were expensive and long-lived. We expected software to last for five years, and we expected hardware to last even longer, at least as long as it took to depreciate the investment. As a result, buying decisions were not made lightly. The investments involved were large ones, and commitments were for the long term, so decisions were made only after careful consideration and multiyear projections.


    • Computers in the 1980s: expensive, long-term commitment, lots of technical knowledge required

    Normally a vendor during the 1980s would sell hardware, software, support, education, and consulting. A partnership-style relationship existed between the customer and the vendor. Once a vendor was selected, the customer was pretty much stuck with that vendor until the hardware and software were depreciated, a process that could take 10 or more years.

    These consumers demanded reliability and quality from their investment. Testing was an integral part of this arrangement and contributed greatly to the quality of the product. Only a few vendors existed, and each had its own proprietary way of doing things. Compared to today's numbers, only a few people were developing and testing software, and most of them were engineers with degrees in engineering.

    For the most part, during this period any given application or operating system only ran in one environment. There were few situations where machines from more than one vendor were expected to exchange information or interact in any way. This fact is very significant, since today's software is expected to run in many different environments and every vendor's hardware is expected to integrate with all sorts of devices in its environment.

    The 1990s: PCs Begin to Bring Computing to "Every Desktop"

    In the 1990s, the PC became ubiquitous, and with it came cheap software for the general consumer. All through the 1990s, computers kept getting more powerful, faster, and cheaper. The chip makers successfully upheld Moore's law, which states that the number of transistors on a single silicon chip doubles every 18 to 24 months. To put that in perspective, in 1965 the most complex chip had 64 transistors. Intel's Pentium III, launched in 1999, has 28 million transistors.


    Fact:

    Computers in the 1990s: keep getting cheaper, no commitment involved, almost anybody can play

    The price of a PC continued to fall during the 1990s, even though its capabilities expanded geometrically. Software developers were driven to exploit the bigger, better, and faster computers, and consumers were driven to upgrade just to remain competitive in their businesses, or at least that was the perception.

    Software makers adopted rapid application development (RAD) techniques so that they could keep up with the hardware and the consumers' demands in this new industry, where being first to market was often the key to success. Development tools made it easier and easier for people to write programs, so a formal degree became less important.

    Unlike the 1980s, when the "next" release would be a stronger version of the existing software, in the 1990s, a new version of a product was often significantly different from the previous version and often contained more serious bugs than its predecessor.


    Fact:

    What we got from rapid application development in the 1990s was a new product, complete with new bugs, every 18 to 24 months.

    The demanding delivery schedule left little time for testing the base functionality, let alone testing the multiple environments where the product might be expected to run, such as computers made by different vendors, and different versions of the operating system. So, it was mostly tested by the users, with product support groups plugging the holes.

    Who would have dreamt then how developments in software and the Internet would eventually affect the state of software testing? The outcome proves that truth is stranger than fiction. Consider the following note I wrote about software testing methods and metrics in 1995:

    In the last couple of years, there has been a marked increase in interest in improved product reliability by several successful shrink-wrap manufacturers. I had wondered for some time what factors would cause a successful shrink-wrap marketing concern to become interested in improving reliability. I used to think that it would be litigation brought on by product failures that would force software makers to pay more attention to reliability. However, the standard arguments for accountability and performance do not seem to have any significant effect on the commercial software industry. It seems that the force driving reliability improvements is simple economics and market maturity.

    First, there are economies of scale. The cost of shipping the fix for a bug to several million registered users is prohibitive at the moment. Second, there are the decreasing profit margins brought on by competition. When profit margins become so slim that the profit from selling a copy of the software is eaten up by the first call that a user makes to customer support, the balance point between delivery, features, and reliability must change in order for the company to stay profitable. The entrepreneurial company becomes suddenly interested in efficiency and reliability in order to survive.

    At the time, I honestly expected a renaissance in software testing. Unfortunately, this was the year that the Internet began to get serious notice. It was also the year that I spent months speaking at several large corporations telling everyone who would listen that they could radically reduce the cost of customer support if they developed support Web sites that let the customers get the information and fixes they needed for free, anytime, from anywhere, without a long wait to talk to someone. Somebody was listening. I was probably not the only one broadcasting this message.

    Enter: The Web

    Within months, every major hardware and software vendor had a support presence on the Web. The bug fix process became far more efficient because it was no longer necessary to ship fixes to everyone who purchased the product-only those who noticed the problem came looking for a solution. Thanks to the Internet, the cost of distributing a bug fix fell to almost nothing as more and more users downloaded the fixes from the Web. The customer support Web site provided a single source of information and updates for customers and customer service, and the time required to make a fix available to the users shrank to insignificance.

    The cost of implementing these support Web sites was very small and the savings were huge; customer satisfaction and profit margins went up. I got a new job and a great job title: Manager of Internet Technology. Management considered the result a major product quality improvement, but it was not achieved through better test methods. In fact, this process improvement successfully minimized any incentive for shipping cleaner products in the first place. Who knew? But don't despair, because it was only a temporary reprieve.

    The most important thing was getting important fixes to the users to keep them happy until the next release. The Internet made it possible to do this. What we got from the Internet was quick relief from the new bugs and a bad case of Pandora's box, spouting would-be entrepreneurs, developers, and experts in unbelievable profusion.

    Consumers base their product-buying decisions largely on availability and advertising. Consumers are most likely to buy the first product on the market that offers features they want, not necessarily the most reliable product. Generally, they have little or no information on software reliability because there is no certification body for software. There is no true equivalent in the software industry to institutions like Underwriters Laboratory (UL) in the United States, which certifies electronics products. Software consumers can only read the reviews, choose the manufacturer, and hope for the best. Consequently, software reliability has been squeezed as priorities have shifted toward delivery dates and appealing functionality, and the cost of shipping fixes has plummeted-thanks to the Web.

    Given this market profile, the PC software market is a fertile environment for entrepreneurs. Competitive pressures are huge, and it is critically important to be the first to capture the market. The decision to ship is generally based on market-driven dates, not the current reliability of the product. It has become common practice to distribute bug-fix releases (put the patches and fixes on the Web site) within a few weeks of the initial release-after the market has been captured. Consequently, reliability metrics are not currently considered to be crucial to commercial success of the product. This trend in commercial software exists to one degree or another throughout the industry. We also see this trend in hardware development.

    The next major contribution of the Web was to make it possible to download this "shrink-wrap" software directly. This type of software typically has a low purchase price, offers a rich appealing set of functionality, and is fairly volatile, with a new release being offered every 12 to 18 months. The reliability of this software is low compared to the traditional commercial software of the 1970s and 1980s. But it has been a huge commercial success nonetheless. And the Web has helped keep this status quo in effect by reducing the cost of shipping a bug fix by letting users with a problem download the fix for themselves. And so we coasted through the 1990s.

    The Current Financial Climate

    In the aftermath of the dot-com failures and the market slump in 2001 and 2002, investors are demanding profitability. I always expected consumers to rebel against buggy software. What happened was that investors rebelled against management gambling with their money. This change is inflicting fiscal responsibility and accountability on management. It is not uncommon today to have the chief financial officer (CFO) in charge of most undertakings of any size.


    Fact:

    Nobody seems to feel lucky right now.

    The first task is usually to cut costs, adjust the margins, and calm investors. Along with the CFO come the auditors. It is their job to find out what the information technology (IT) department is, what it does, and whether it is profitable. If it is not profitable, it will either become profitable, or it will be cut. The financial managers are quick to target waste in all its forms.

    Slowing Down

    The 1990s were a time of rapid growth, experimentation, and great optimism. We were always eager to buy the "next" version every time it became available, without considering if we really needed it or not. It was sort of the I-feel-lucky approach to software procurement. We kept expecting "better" products, even though what we got were "different" products. But we kept buying these products, so we perpetuated the cycle. There always seemed to be a justification for buying the next upgrade. A new term was coined-shelfware-to describe software that was purchased but never installed.

    Further, even software that did get installed was rarely fully utilized. Studies showed that users rarely used more than 10 percent of the functionality of most common business software. There was obvious feature bloat.

    Fat client/server applications were quickly replaced by lightweight, limited-function, browser-based clients. Most users never missed the 90 percent of the functions that were gone, but they appreciated the fast response, anytime, anywhere.

    Getting More from What We Have

    It seems that there is a limit to how small transistor etchings on a silicon wafer can get. To make microchips, Intel and AMD etch a pattern of transistors onto a silicon wafer, and the more that is crammed onto a chip, the smaller everything gets. Electrons carry the 0 and 1 information through the transistors that power our current computers. When the transistors shrink toward the atomic scale, the electrons can no longer be confined and controlled reliably.


    Fact:

    Many prognosticators believe that the dominance of Moore's law is coming to an end.

    In addition, the cost of producing "Moore" complex chips is rising. As chips become more complex, the cost to manufacture them increases. Intel and AMD now spend billions to create fabrication plants.

    With silicon chips nearing the end of their feasibility, scientists and engineers are looking to the future of the microprocessor. Chip makers are now focusing on the next generation of computing. But it is going to be expensive to ramp up new technologies like DNA computers and molecular computers.

    DNA computing is a field that will create ultra-dense systems that pack megabytes of information into devices the size of a silicon transistor. A single bacterium cell is about the same size as a single silicon transistor, but it holds more than a megabyte of DNA memory and it has all the computational structures to sense and respond to its environment. DNA computers and molecular computers do not use electrons and 0/1 bits. They can solve certain complex problems faster than transistor-based microchips because they perform enormous numbers of operations in parallel. So, in the meantime, we will probably have the chance to create some new uses for the technology that we have.


    Fact:

    We are not buying.

    Microsoft's corporate vision statement was "A PC on every desktop." They have come a long way toward achieving this goal. However, the indications are that hardware prices won't fall much lower, and even though the price of some software is going up, sales are falling.

    When Microsoft introduced the Windows 2000 operating system, it failed to sell at the rate they had expected; the climate had begun to change. In the following year, Microsoft Office XP, with its short-sighted and inflexible licensing, also failed to gain acceptance. Most of us decided not to upgrade.

    In the 1990s, developers successfully argued that investing in better development tools, rather than in a better test process, would build a better product. Since most of the quality improvements in the past 10 years have come from standardization and development process improvements, they usually got what they wanted.

    However, the real product failures had to do with products that missed the mark on the functionality, and applications that simply did not run well in large systems or systems that were so costly to maintain that they lost money in production. These are things that development tools cannot fix. They are things that testing can identify, and things that can be fixed and avoided.

    In today's climate, the financial people will not allow that server from last year to be tossed out until it has been fully depreciated. Neither will they approve the purchase of new operating systems nor office software without a cost-benefit justification. Customers are not in the mood to go out and spend lots of money upgrading their systems either.


    Note

    Testers, here is our chance!

    When consumers are no longer willing to buy a new product just because it is "new," things start to change. When consumers demand reliability over features and cost, the quality balance shifts back from trendy first-to-market toward reliability. The value of using formal methods and metrics becomes the difference between the companies that survive and the ones that fail.

    With so many groups competing for budget, the test group must be able to make a compelling argument, or it will become extinct. A test manager who can make a good cost-benefit statement for the financial folks has a chance. The bottom line for testers is that the test effort must add value to the product. Testers must be able to demonstrate that value.


    Note

    The way to develop a good cost-benefit statement, and add real credibility to software testing, is to use formal methods and good metrics.

    Regardless of the cause, once a software maker has decided to use formal methods, it must address the question of which formal methods and metrics to adopt. Once methods or a course toward methods has been determined, everyone must be educated in the new methods. Moving an established culture from an informal method of doing something to a formal method of doing the same thing takes time, determination, and a good cost-benefit ratio. It amounts to a cultural change, and introducing culture changes is risky business. Once the new methods are established, it still takes a continuing commitment from management to keep them alive and in use.

    In ancient times this was accomplished by fiat, an order from the king. If there were any kings in the 1990s, they must have lived in development. Today, however, it is being accomplished by the CFO and the auditors.

    Guess What? The Best Methods Haven't Changed

    The auditors are paid to ask hard questions. They want to know what the system is, what it does, and whether it is worth the money, and they pay attention to the answers. And, since the financial folks use a very stringent set of formal methods in their own work, they expect others to do the same.

    What the Auditors Want to Know from the Testers

    When testing a product the auditors want to know:

    • What does the software or system do?

    • What are you going to do to prove that it works?

    • What are your test results? Did it work under the required environment? Or, did you have to tweak it?

    Clearly, the test methods used need to answer these questions. Before we try to determine the best methods and metrics to use to ensure that proper, thorough testing takes place, we need to examine the challenges faced by testers today.



Fundamental Metrics for Software Testing

Measures and Metrics

A metric is a measure. A metric system is a set of measures that can be combined to form derived measures. For example, the measures of the old English system, feet, pounds, and hours, can be combined to form derived measures such as miles per hour.

Measure has been defined as "the act or process of determining extent, dimensions, etc.; especially as determined by a standard" (Webster's New World Dictionary). If the standard is objective and concrete, the measurements will be reproducible and meaningful. If the standard is subjective and intangible, the measurements then will be unreproducible and meaningless. The measurement is not likely to be any more accurate than the standard. Factors of safety can correct for some deficiencies, but they are not a panacea.

Craft: The Link between Art and Engineering

My great-grandmother was a craftsperson. A craftsperson is the evolutionary link between art and engineering. My great-grandmother made excellent cookies. Her recipes were developed and continuously enhanced over her lifetime. These recipes were not written down; they lived in my great-grandmother's head and were passed on only by word of mouth. She described the steps of the recipes using large gestures and analogies: "mix it till your arm feels like it's going to fall off, and then mix it some more." She guessed the temperature of the oven by feeling the heat with her hand. She measured ingredients by description, using terms like "a lump of shortening the size of an egg," "so many handfuls of flour," "a pinch of this or that," and "as much sugar as seems prudent."

Great-Grandmother's methods and metrics were consistent; she could have been ISO certified, especially if the inspector had eaten any of her cookies. But her methods and metrics were local. Success depended on the size of her hand, the size of an egg from one of her hens, and her idea of what was prudent.

The biggest difference between an engineer and a craftsperson is measurement. The engineer does not guess except as a last resort. The engineer measures. The engineer keeps written records of the steps in the process that he is pursuing, along with the ingredients and their quantities. The engineer uses standard measuring tools and metrics like the pound and the gallon or the gram and the liter. The engineer is concerned with preserving information and communicating it on a global scale. Recipes passed on by word of mouth using metrics like a handful of flour and a pinch of salt do not scale up well to industrial production levels. A great deal of time is required to train someone to interpret and translate such recipes, and these recipes are often lost because they were never written down.

Operational Definitions: Fundamental Metrics

The definition of a physical quantity is the description of the operational procedure for measuring the quantity, for example, "the person is one and a half meters tall." From this definition of the person's height, we know what system the person was measured in and how to reproduce the measurement. The magnitude of a physical quantity is specified by a number, "one and a half," and a unit, "meters." This is the simplest and most fundamental type of measurement.

Derived units are obtained by combining metrics. For example, miles per hour, feet per second, and dollars per pound are all derived units. These derived units are still operational definitions because the name tells how to measure the thing.

How Metrics Develop and Gain Acceptance

If no suitable recognized standard exists, we must identify a local one and use it consistently-much like my great-grandmother did when making her cookies. Over time, the standards will be improved.

Developing precise and invariant standards for measurement is a process of constant refinement. The foot and the meter did not simply appear overnight. About 4,700 years ago, engineers in Egypt used strings with knots at even intervals. They built the pyramids with these measuring strings, even though knotted ropes may have only been accurate to 1 part in 1,000. It was not until 1875 that an international standard was adopted for length. This standard was a bar of platinum-iridium with two fine lines etched on it, defining the length of the meter. It was kept in the International Bureau of Weights and Measures in Sèvres, France. The precision provided by this bar was about 1 part in 10 million. By the 1950s, this was not precise enough for work being done in scientific research and industrial instrumentation. In 1960, a new standard was introduced that precisely defined the length of the meter. The meter was defined as exactly 1,650,763.73 times the wavelength of the orange light emitted by a pure isotope, of mass number 86, of krypton gas. This standard can be measured to better than 1 part in 100 million.

Once a standard is introduced, it must still be accepted. Changing the way we do things requires an expenditure of energy. There must be a good reason to expend that energy.

What to Measure in Software Testing

Measure the things that help you answer the questions you have to answer. The challenge with testing metrics is that the test objects that we want to measure have multiple properties; they can be described in many ways. For example, a software bug has properties much like a real insect: height, length, weight, type or class (family, genus, spider, beetle, ant, etc.), color, and so on. It also has attributes, like poisonous or nonpoisonous, flying or nonflying, vegetarian or carnivorous.[1]

I find that I can make my clearest and most convincing arguments when I stick to fundamental metrics. For example, the number of bugs found in a test effort is not meaningful as a measure until I combine it with the severity, type of bugs found, number of bugs fixed, and so on.

Several fundamental and derived metrics taken together provide the most valuable and complete set of information. By combining these individual bits of data, I create information that can be used to make decisions, and most everyone understands what I am talking about. If someone asks if the test effort was a success, just telling him or her how many bugs we found is a very weak answer. There are many better answers in this chapter.

Fundamental Testing Metrics: How Big Is It?

Fundamental testing metrics are the ones that can be used to answer the following questions.

  • How big is it?

  • How long will it take to test it?

  • How much will it cost to test it?

  • How much will it cost to fix it?

The question "How big is it?" is usually answered in terms of how long it will take and how much it will cost. These are the two most common attributes of it. We would normally estimate answers to these questions during the planning stages of the project. These estimates are critical in sizing the test effort and negotiating for resources and budget. A great deal of this book is dedicated to helping you make very accurate estimates quickly. You should also calculate the actual answers to these questions when testing is complete. You can use the comparisons to improve your future estimates.
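
A minimal sketch of the estimate-versus-actual comparison described above, with invented figures:

    # Compare the sizing estimate to the measured actuals to calibrate future estimates.
    # All figures are hypothetical.
    estimated = {"test hours": 320, "cost": 16_000}
    actual    = {"test hours": 410, "cost": 20_500}

    for item in estimated:
        ratio = actual[item] / estimated[item]
        print(f"{item}: actual was {ratio:.0%} of estimate")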

I have heard the following fundamental metrics discounted because they are so simple, but in my experience, they are the most useful:

  • Time

  • Cost

  • Tests

  • Bugs

We quantify "how big it is" with these metrics. They are probably the most fundamental metrics specific to software testing, and they are listed here in order of decreasing certainty. Only time and cost are clearly defined using standard units. Tests and bugs are complex and varied, having many properties, and they can be measured using many different units.

For example, product failures are a special class of bug-one that has migrated into production and caused a serious problem, hence the word "failure." A product failure can be measured in terms of cost, cost to the user, cost to fix, or cost in lost revenues. Bugs detected and removed in test are much harder to quantify in this way.

The properties and criteria used to quantify tests and bugs are normally defined by an organization; so they are local and they vary from project to project. In Chapters 11 through 13, I introduce path and data analysis techniques that will help you standardize the test metric across any system or project.

Time

Units of time are used in several test metrics, for example, the time required to run a test and the time available for the test effort. Let's look at each of these more closely.

The Time Required to Run a Test

This measurement is absolutely required to estimate how long a test effort will need in order to perform the tests planned. It is one of the fundamental metrics used in the test inventory and the sizing estimate for the test effort.

The time required to conduct test setup and cleanup activities must also be considered. Setup and cleanup activities can be estimated as part of the time required to run a test or as separate items. Theoretically, the sum of the time required to run all the planned tests is important in estimating the overall length of the test effort, but it must be tempered by the number of times a test will have to be attempted before it runs successfully and reliably.

  • Sample Units: Generally estimated in minutes or hours per test. Also important are the number of hours required to complete a suite of tests.
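
As a rough illustration of how these per-test estimates roll up, here is a minimal sketch (in Python, with invented test names and figures) that totals setup, run, and cleanup time and tempers the sum by the expected number of attempts, as described above.

    # Hypothetical sizing sketch: total test-effort time from per-test estimates.
    tests = [
        # (name, setup_min, run_min, cleanup_min, expected_attempts)
        ("login happy path",      5, 10,  2, 2),
        ("order entry",          15, 30,  5, 3),
        ("nightly batch posting", 60, 90, 20, 2),
    ]

    total_minutes = 0
    for name, setup, run, cleanup, attempts in tests:
        # Each attempt repeats setup, run, and cleanup.
        total_minutes += (setup + run + cleanup) * attempts

    print(f"Estimated effort: {total_minutes} minutes ({total_minutes / 60:.1f} hours)")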

The Time Available for the Test Effort

This is usually the most firmly established and most published metric in the test effort. It is also usually the only measurement that is consistently decreasing.

  • Sample Units: Generally estimated in weeks and measured in minutes.

The Cost of Testing

The cost of testing usually includes the cost of the testers' salaries, the equipment, systems, software, and other tools. It may be quantified in terms of the cost to run a test or a test suite.

Calculating the cost of testing is straightforward if you keep good project metrics. However, it does not offer much cost justification unless you can contrast it to a converse-for example, the cost of not testing. Establishing the cost of not testing can be difficult or impossible. More on this later in the chapter.

  • Sample Units: Currency, such as dollars; can also be measured in units of time.

Tests

We do not have an invariant, precise, internationally accepted standard unit that measures the size of a test, but that should not stop us from benefiting from identifying and counting tests. There are many types of tests, and they all need to be counted if the test effort is going to be measured. Techniques for defining, estimating, and tracking the various types of test units are presented in the next several chapters.

Tests have attributes such as quantity, size, importance or priority, and type.

Sample Units (listed simplest to most complex):

  • A keystroke or mouse action

  • An SQL query

  • A single transaction

  • A complete function path traversal through the system

  • A function-dependent data set

Bugs

Many people claim that finding bugs is the main purpose of testing. Even though they are fairly discrete events, bugs are often debated because there is no absolute standard in place for measuring them.

  • Sample Units: Severity, quantity, type, duration, distribution, and cost to find and fix. Note: Bug distribution and the cost to find and fix are derived metrics.

Like tests, bugs also have attributes as discussed in the following sections.

Severity

Severity is a fundamental measure of a bug or a failure. Many ranking schemes exist for defining severity. Because there is no set standard for establishing bug severity, the magnitude of the severity of a bug is often open to debate. Table 5.1 shows the definition of the severity metrics and the ranking criteria used in this book.

Table 5.1: Severity Metrics and Ranking Criteria

    SEVERITY RANKING       RANKING CRITERIA
    Severity 1 Errors      Program ceases meaningful operation
    Severity 2 Errors      Severe function error but application can continue
    Severity 3 Errors      Unexpected result or inconsistent operation
    Severity 4 Errors      Design or suggestion
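
Because there is no industry-wide standard for severity, each project has to encode its own scheme and apply it consistently. The sketch below captures the Table 5.1 ranking as one local, shared definition that bug reports and summary metrics can both use; the structure is illustrative only.

    # Table 5.1 severity rankings captured as a single local standard.
    SEVERITY_CRITERIA = {
        1: "Program ceases meaningful operation",
        2: "Severe function error but application can continue",
        3: "Unexpected result or inconsistent operation",
        4: "Design or suggestion",
    }

    def describe(severity: int) -> str:
        """Return the ranking criterion for a severity level (1 = worst)."""
        return SEVERITY_CRITERIA.get(severity, "Unknown severity")

    print(describe(1))   # Program ceases meaningful operation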

Bug Type Classification

First of all, bugs are bugs; the name is applied to a huge variety of "things." Types of bugs can range from a nuisance misunderstanding of the interface, to coding errors, to database errors, to systemic failures, and so on.

Like severity, bug classification, or bug types, are usually defined by a local set of rules. These are further modified by factors like reproducibility and fixability.

In a connected system, some types of bugs are system "failures," as opposed to, say, a coding error. For example, the following bugs are caused by missing or broken connections:

  • Network outages.

  • Communications failures.

  • In mobile computing, individual units that are constantly connecting and disconnecting.

  • Integration errors.

  • Missing or malfunctioning components.

  • Timing and synchronization errors.

These bugs are actually system failures. These types of failure can, and probably will, recur in production. Therefore, the tests that found them during the test effort are very valuable in the production environment. This type of bug is important in the test effectiveness metric, discussed later in this chapter.

The Number of Bugs Found

For this metric, there are two main genres: (1) bugs found before the product ships or goes live and (2) bugs found after-or, alternately, those bugs found by testers and those bugs found by customers. As I have already said, this is a very weak measure until you bring it into perspective using other measures, such as the severity of the bugs found.

The Number of Product Failures

This measurement is usually established by the users of the product and reported through customer support. Since the customers report the failures, it is unusual for product failures that the customers find intolerable to be ignored or discounted. If it exists, this measurement is a key indicator of past performance and probable trouble spots in new releases. Ultimately, it is measured in money, lost profit, increased cost to develop and support, and so on.

This is an important metric in establishing an answer to the question "Was the test effort worth it?" But, unfortunately, in some organizations, it can be difficult for someone in the test group to get access to this information.

  • Sample Units: Quantity, severity, and currency.

The Number of Bugs Testers Find per Hour: The Bug Find Rate

This is a most useful derived metric both for measuring the cost of testing and for assessing the stability of the system. The bug find rate is closely related to the mean time between failures metric. It can give a good indication of the stability of the system being tested. But it is not helpful if considered by itself.

Consider Tables 5.2 and 5.3. The statistics are taken from a case study of a shrink-wrap RAD project: a five-week test effort conducted by consultants on new code. They are a good example of a constructive way to combine bug data, like the bug find rate and the cost of finding bugs, to create information.

Table 5.2: Bug Find Rates and Costs, Week 1

    Bugs found/hour             5.33 bugs found/hr
    Cost/bug to find            $9.38/bug to find
    Bugs reported/hr            3.25 bugs/hr
    Cost to report              $15.38/bug to report
    Cost/bug find and report    $24.76/bug to find and report

Table 5.3: Bug Find Rates and Costs, Week 4

    Bugs found/hour             0.25 bugs found/hr
    Cost/bug to find            $199.79/bug to find
    Bugs reported/hr            0.143 bugs/hr
    Cost to report              $15.38/bug to report
    Cost/bug find and report    $215.17/bug to find and report

Notice that the cost of reporting and tracking bugs is normally higher than the cost of finding bugs in the early part of the test effort. This situation changes as the bug find rate drops, while the cost to report a bug remains fairly static throughout the test effort.

By week 4, the number of bugs being found per hour has dropped significantly. It should drop as the end of the test effort is approached. However, the cost to find each successive bug rises, since testers must look longer to find a bug, but they are still paid by the hour.

These tables are helpful in explaining the cost of testing and in evaluating the readiness of the system for production.
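
The derived figures in these tables come from two fundamental measures: the loaded hourly cost of a tester and the observed find and report rates. The following sketch reproduces the week 1 numbers; the $50-per-hour rate is an assumption chosen to match the published figures, not a number taken from the case study.

    # Derive the Table 5.2 cost-per-bug figures from an hourly cost and observed rates.
    HOURLY_COST = 50.00   # assumed loaded cost of one tester-hour, in dollars

    def cost_per_bug(bugs_per_hour: float) -> float:
        """Cost to find (or report) one bug at the observed rate."""
        return HOURLY_COST / bugs_per_hour

    # Week 1 of the case study: 5.33 bugs found/hr and 3.25 bugs reported/hr.
    find_cost = cost_per_bug(5.33)     # about $9.38 per bug found
    report_cost = cost_per_bug(3.25)   # about $15.38 per bug reported

    # The published cost to find and report, $24.76, is the sum of the two rounded figures.
    print(f"Find: ${find_cost:.2f}, Report: ${report_cost:.2f}")

The week 4 figures in Table 5.3 follow from the same arithmetic with the much lower find rate.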

Bug Composition: How Many of the Bugs Are Serious?

As we have just discussed, there are various classes of bugs. Some of them can be eradicated, and some of them cannot. The most troublesome bugs are the ones that cannot be easily reproduced and recur at random intervals. Software failures and bugs are measured by quantity and by relative severity. Severity is usually determined by a local set of criteria, similar to the one presented in the preceding text.

If a significant percentage of the bugs being found in testing are serious, then there is a definite risk that the users will also find serious bugs in the shipped product. The following statistics are taken from a case study of a shrink-wrap RAD project. Table 5.4 shows separate categories for the bugs found and bugs reported.

Table 5.4: Relative Seriousness (Composition) of Bugs Found

    ERROR RANKING        RANKING DESCRIPTION                                    BUGS FOUND    BUGS REPORTED
    Severity 1 Errors    GPF or program ceases meaningful operation                     18                9
    Severity 2 Errors    Severe function error but application can continue             11               11
    Severity 3 Errors    Unexpected result or inconsistent operation                    19               19
    Severity 4 Errors    Design or suggestion                                             0                0
    Totals                                                                              48               39
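
One straightforward use of Table 5.4 is to compute what fraction of the bugs found are serious (Severity 1 and 2). A minimal sketch of that arithmetic, using the bugs-found column:

    # Composition of bugs found, from the bugs-found column of Table 5.4.
    bugs_found = {1: 18, 2: 11, 3: 19, 4: 0}

    total = sum(bugs_found.values())          # 48
    serious = bugs_found[1] + bugs_found[2]   # 29
    print(f"Serious bugs: {serious}/{total} = {serious / total:.0%}")   # about 60%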


Path Analysis

The Legend of the Maze of the Minotaur

About 4,000 years ago, the wealthy seafaring Minoans built many wonderful structures on the island of Crete. These structures included lovely stone-walled palaces, probably the first stone paved road in the world, and, reputedly, a huge stone labyrinth on a seaside cliff called The Maze of the Minotaur.

The Minotaur of Greek mythology was a monster with the head of a bull and the body of a man, borne by Pasiphaë, Queen of Crete, and sired by a snow-white bull. According to legend, the god Poseidon, who sent the white bull to Minos, King of Crete, was so angered by Minos' refusal to sacrifice the bull that Poseidon forced the union of Queen Pasiphaë and the beast. Thus the monster Minotaur was born. King Minos ordered construction of the great labyrinth as the Minotaur's prison. The beast was confined in the maze and fed human sacrifices, usually young Greeks, in annual rituals at which young men and women performed gymnastics on the horns of bulls and some unfortunate persons were dropped through a hole into the labyrinth's tunnels. The sacrifices continued until a Greek hero named Theseus killed the Minotaur.

The Minoans did give their sacrifices a sporting chance. Reportedly, there was another exit besides the hole that the sacrifices were dropped through to enter the maze. If the sacrificial person was able to find the exit before the Minotaur found them, then they were free.

It was rumored that a bright physician from Egypt traversed the maze and escaped. The Egyptian succeeded by placing one hand on the wall and keeping it there until he came upon an exit. This technique kept him from becoming lost and wandering in circles.

Calculating the Number of Paths through a System

The tester needs a systematic way of counting paths by calculation-before investing all the time required to map them-in order to predict how much testing needs to be done. A path is defined as a track or way worn by footsteps, and also a line of movement or course taken. In all following discussions, path can refer to any end-to-end traversal through a system. For example, path can refer to the steps a user takes to execute a program function, a line of movement through program code, or the course taken by a message being routed across a network.

In the series of required decisions example, each side branch returns to the main path rather than going to the exit. This means that the possible branching paths can be combined in 2 × 2 × 2 × 2 = 2⁴ = 16 ways, according to the fundamental principles of counting. This is an example of a 2ⁿ problem. As we saw, this set of paths is exhaustive, but it contains many redundant paths.

In a test effort, it is typical to maximize efficiency by avoiding unproductive repetition. Unless the way the branches are combined becomes important, there is little to be gained from exercising all 16 of these paths. How do we pick the minimum number of paths that should be exercised in order to ensure adequate test coverage? To optimize testing, the test cases should closely resemble actual usage and should include the minimum set of paths that ensure that each path segment is covered at least one time, while avoiding unnecessary redundancy.

It is not possible today to reliably calculate the total number of paths through a system by an automated process. Most systems contain some combination of 2ⁿ logic structures and simple branching constructs. Calculating the total number of possibilities requires lengthy analysis. However, when a system is modeled according to a few simple rules, it is possible to quickly calculate the number of linearly independent paths through it. This method is far preferable to trying to determine the total number of paths by manual inspection. This count of the linearly independent paths gives a good estimate of the minimum number of paths required to traverse each path segment in the system at least one time.
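
To make the counting concrete, the sketch below enumerates the 2ⁿ branch combinations for the series-of-required-decisions example and contrasts that exhaustive total with the much smaller count of linearly independent paths (decisions + 1). It illustrates the counting argument only; it is not a general path-analysis tool.

    from itertools import product

    # Four decisions in series; at each one, the side branch rejoins the main path.
    DECISIONS = 4

    # Exhaustive counting: every combination of branch choices is a distinct path.
    all_paths = list(product(("main", "branch"), repeat=DECISIONS))
    print(len(all_paths))    # 2**4 = 16 end-to-end combinations

    # Linearly independent paths: decisions + 1 (see the equations later in this chapter).
    print(DECISIONS + 1)     # 5 paths cover every path segment at least once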

The Total Is Equal to the Sum of the Parts

The total independent paths (IPs) in any system is the sum of the IPs through its elements and subsystems. For the purpose of counting tests, we introduce TIP, which is the total independent paths of the subsystem elements under consideration-that is, the total number of linearly independent paths being considered. TIP usually represents a subset of the total number of linearly independent paths that exist in a complex system.

TIP = IP(e1) + IP(e2) + ... + IP(en)

where:

TIP = total independent paths for the system

e = an element of the system

IP(e) = independent paths in element e
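
In other words, totaling the independent paths over the elements under consideration is a simple sum. A minimal sketch, with invented element names and counts:

    # Total independent paths (TIP) for the subsystem elements under consideration.
    # Element names and IP counts are hypothetical.
    independent_paths = {"login": 3, "order entry": 7, "billing": 5}

    TIP = sum(independent_paths.values())
    print(TIP)   # 15 linearly independent paths across these elements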

What Is a Logic Flow Map?

A logic flow map is a graphic depiction of the logic paths through a system, or some function that is modeled as a system. Logic flow maps model real systems as logic circuits. A logic circuit can be validated much the same way an electrical circuit is validated. Logic flow diagrams expose logical faults quickly. The diagrams are easily updated by anyone, and they are an excellent communications tool.

System maps can be drawn in many different ways. The main advantage of modeling systems as logic flow diagrams is that the number of linearly independent paths through the system can be calculated and logic flaws can be detected. Other graphing techniques may provide better system models but lack these fundamental abilities. For a comprehensive discussion of modeling systems as graphs and an excellent introduction to the principles of statistical testing, see Black-Box Testing by Boris Beizer (John Wiley & Sons, 1995).

The Elements of Logic Flow Mapping

Edges

Lines that connect nodes in the map.

Decisions

A branching node with one (or more) edges entering and two edges leaving. Decisions can contain processes. In this text, for the purposes of clarity, decisions will be modeled with only one edge entering.

Processes

A collector node with multiple edges entering and one edge leaving. A process node can represent one program statement or an entire software system.


Regions

A region is any area that is completely surrounded by edges and processes. In actual practice, regions are the hardest elements to find. If a model of the system can be drawn without any edges crossing, the regions are obvious. In event-driven systems, the model must be kept very simple or there will inevitably be crossed edges, and then finding the number of regions becomes very difficult.


Notes on nodes:
  • All processes and decisions are nodes.

  • Decisions can contain processes.

The Rules for Logic Flow Mapping

A logic flow map conforms to the conventions of a system flow graph with the following stipulation:

  1. The representation of a system (or subsystem) can have only one entry point and one exit point; that is, it must be modeled as a structured system.

  2. The system entry and exit points do not count as edges.

    This is required to satisfy the graphing theory stipulation that the graph must be strongly connected. For our purposes, this means that there is a connection between the exit and entrance of the logic flow diagram. This is the reason for the dotted line connecting the maze exits back to their entrances, in the examples. After all, if there is no way to get back to the entrance of the maze, you can't trace any more paths no matter how many there may be.

The logic flow diagram is a circuit. Like a water pipe system, there shouldn't be any leaks. Kirchhoff's current law states: "The algebraic sum of the currents entering any node is zero." This means that all the logic entering the system must also leave the system. We are constraining ourselves to structured systems, meaning there is only one way in and one way out. This is a lot like testing each faucet in a house, one at a time. All the water coming in must go out of that one open faucet.

One of the strengths of this method is that it offers the ability to take any unstructured system and conceptually represent it as a structured system. So no matter how many faucets there are, only one can be turned on at any time. This technique of only allowing one faucet to be turned on at a time can be used to write test specifications for unstructured code and parallel processes. It can also be used to reengineer an unstructured system so that it can be implemented as a structured system.

The tester usually does not know exactly what the logic in the system is doing. Normally, testers should not know these details because such knowledge would introduce serious bias into the testing. Bias is the error we introduce simply by having knowledge, and therefore expectations, about the system. What the testers need to know is how the logic is supposed to work, that is, what the requirements are. If these details are not written down, they can be reconstructed from interviews with the developers and designers and then written down. They must be documented, by the testers if necessary. Such documentation is required to perform verification and defend the tester's position.

A tester who documents what the system is actually doing and then makes a judgment on whether that is "right or not" is not verifying the system. This tester is validating, and validation requires a subjective judgment call. Such judgment calls are always vulnerable to attack. As much as possible, the tester should be verifying the system.

The Equations Used to Calculate Paths

There are three equations from graphing theory that we will use to calculate the number of linearly independent paths through any structured system. These three equations and the theory of linear independence were the work of the French mathematician Claude Berge, who introduced them in his book Graphs and Hypergraphs (Amsterdam: North-Holland, 1973). Specifically, Berge's graph theory defines the cyclomatic number v(G) of a strongly connected graph G with N nodes, E edges, and one connected component. This cyclomatic number is the number of linearly independent paths through the system.

We have three definitions of the cyclomatic number. This gives us the following three equations. The proofs are not presented here.

  • v(G) = IP = Edges - Nodes + 2 (IP = E - N + 2)

  • v(G) = IP = Regions + 1 (IP = R + 1)

  • v(G) = IP = Decisions + 1 (IP = D + 1)

Even though the case statement and the series of required decisions don't have the same number of total paths, they do have the same number of linearly independent paths.

The number of linearly independent paths through a system is usually the minimum number of end-to-end paths required to touch every path segment at least once. In some cases, it is possible to combine several path segments that haven't been taken previously in a single traversal. This can have the result that the minimum number of paths required to cover the system is less than the number of IPs. In general, the number of linearly independent paths, IPs, is the minimum acceptable number of paths for 100 percent coverage of paths in the system. This is the answer to the question, "How many ways can you get through the system without retracing your path?" The total paths in a system are combinations of the linearly independent paths through the system. If a looping structure is traversed one time, it has been counted. Let's look at an example.

All three equations must be equal to the same number for the logic circuit to be valid. If the system is not a valid logic circuit, it can't work. When inspections are conducted with this in mind, logic problems can be identified quickly. Testers who develop the logic flow diagrams for the system as an aid in test design find all sorts of fuzzy logic errors before they ever begin to test the system.
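
A minimal sketch of that consistency check follows: it evaluates all three equations for a logic flow map and flags a model whose counts disagree. The edge, node, region, and decision counts in the example are invented.

    def independent_paths(edges: int, nodes: int, regions: int, decisions: int) -> int:
        """Evaluate the three cyclomatic equations and check that they agree."""
        by_edges = edges - nodes + 2      # IP = E - N + 2
        by_regions = regions + 1          # IP = R + 1
        by_decisions = decisions + 1      # IP = D + 1
        if not (by_edges == by_regions == by_decisions):
            raise ValueError(
                f"Not a valid logic circuit: {by_edges}, {by_regions}, {by_decisions}")
        return by_edges

    # Hypothetical logic flow map: 11 edges, 8 nodes, 4 regions, 4 decisions.
    print(independent_paths(edges=11, nodes=8, regions=4, decisions=4))   # 5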


When it is not possible to represent a system without edges that cross, the count of the regions becomes problematic and is often neglected. If the number of regions in a model cannot be established reliably, the logic flow cannot be verified using these equations, but the number of linearly independent paths can still be calculated using the other two equations.

Most of the commercially available static code analyzers use only the number of decisions in a system to determine the number of linearly independent paths, but for the purposes of logic flow analysis, all three equations are necessary. Any one by itself may identify the number of linearly independent paths through the system, but is not sufficient to test whether the logic flow of the system is valid.

Twenty years ago, several works were published that used definitions and theorems from graphing theory to calculate the number of paths through a system. Building on the work of Berge, Tom McCabe and Charles Butler applied cyclomatic complexity to analyze the design of software and eventually to analyze raw code. (See "Design Complexity Measurement and Testing" in Communications of the ACM, December 1989, Volume 32, Number 12.) This technique eventually led to a set of metrics called the McCabe complexity metrics. The complexity metrics are used to count various types of paths through a system. In general, systems with large numbers of paths are considered to be bad under this method. It has been argued that the number of paths through a system should be limited to control complexity. A typical program module is limited to 10 or fewer linearly independent paths.

There are two good reasons for this argument. The first is that human beings don't handle increasing complexity very well. We are fairly efficient when solving logic problems with one to five logic paths, but beyond that, our performance starts to drop sharply. The time required to devise a solution for a problem rises geometrically with the number of paths. For instance, it takes under five minutes for typical students to solve a logic problem with fewer than five paths. It takes several hours to solve a logic problem with 10 paths, and it can take several days to solve a logic problem with more than 10 paths. The more complex the problem, the greater the probability that a human being will make an error or fail to find a solution.

The second reason is that 20 years ago software systems were largely monolithic and unstructured. Even 10 years ago, most programming was done in languages like Assembler, Cobol, and Fortran. Coding practices of that time were only beginning to place importance on structure. The logic flow diagrams of such systems are typically a snarl of looping paths with the frayed ends of multiple entries and exits sticking out everywhere. Such diagrams strongly resemble a plate of spaghetti-hence the term spaghetti code, and the justifiable emphasis on limiting complexity. The cost of maintaining these systems proved to be unbearable for most applications, and so, over time, they have been replaced by modular structured systems.

Today's software development tools, most notably code generators, and fourth-generation languages (4GLs) produce complex program modules. The program building blocks are recombined into new systems constantly, and the result is ever more complex but stable building blocks. The structural engineering analogy to this is an average 100-by-100-foot, one-story warehouse. One hundred years ago, we would have built it using about 15,000 3-by-9-by-3-inch individually mortared bricks in a double course wall. It would have taken seven masons about two weeks to put up the walls. Today we might build it with about forty 10-foot-by-10-foot-by-6-inch pre-stressed concrete slabs. It would take a crane operator, a carpenter, and a welder one to two days to set the walls. A lot more engineering goes into today's pre-stressed slab, but the design can be reused in many buildings. Physically, the pre-stressed slab is less complex than the brick wall, having only one component, but it is a far more complex design requiring a great deal more analysis and calculation.

Once a logic problem is solved and the logic verified, the module becomes a building block in larger systems. When a system is built using prefabricated and pretested modules, the complexity of the entire system may be very large. This does not mean that the system is unstable or hard to maintain.

It is simplistic to see limited complexity as a silver bullet. Saying that a program unit having a complexity over 10 is too complex is akin to saying, "A person who weighs over 150 pounds is overweight." We must take other factors into consideration before we make such a statement. The important factor in complex systems is how the paths are structured, not how many there are. If we took this approach seriously, we would not have buildings taller than about five stories, or roller-coasters that do vertical loops and corkscrews, or microprocessors, or telephone systems, or automobiles, and so on. If the building blocks are sound, large complexity is not a bad thing.

Data Analysis Techniques

Testing Data Input by Users (the GUI)

Most of the data testing we do these days is user input, and that is what we concentrate on in this book. I have included one example about testing raw data in quantity-the real-world shipping example mentioned repeatedly throughout the book. That was the one test project I had in 10 years in which I tested raw data in quantity. One part of the integration effort was to test the integration of the acquired company's car movement data stream with the parent company's car movement data stream.

The team accomplished this testing in a completely manual mode even though millions of messages had to be analyzed and verified. The testers were all subject matter experts (SMEs) and senior staff members. The complexity of their analysis could not be automated, or even taught to professional testers. Every verification and validation required the experiences of a lifetime and the expertise of the very best.

The effort was an enormous undertaking and cost four times more than the estimates. I believe that budget money was appropriated from every other department at the parent company to pay for it. Nevertheless, it was mission-critical that those data streams maintained 100 percent integrity, and consequently, no price was too high for the test effort that ensured the success of this integration effort.

Data-Dependent Paths

Some paths will be more data-dependent than others. In these cases, the number of tests performed is a function of the number of data sets that will be tested. The same path, or at least parts of the same path, will be exercised repeatedly. The data will control the branches taken and not taken.

If you approach data analysis without considering the independent paths, you will certainly miss some important paths. In my experience, this is how many hard-to-reproduce bugs get into production. Someone tests all the main, easily identified data sets without considering all the possible exception paths. This is why I recommend performing the path analysis and then populating the paths with the data sets that are required to exercise the most important paths, rather than the other way around.

Having said that, I must add that users do some unexpected things with data, and so an examination of paths alone will not suffice to cover all the exceptions that will be exercised by the user.
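
One way to follow this recommendation, path analysis first and data second, is to attach candidate data sets to each independent path and flag any path that no data set exercises. A minimal sketch with invented path and data set names:

    # Populate independent paths with data sets; flag uncovered exception paths.
    paths = ["valid order", "credit limit exceeded", "duplicate order number"]

    data_sets = {
        "valid order": ["typical customer", "new customer"],
        "credit limit exceeded": ["over-limit customer"],
        # "duplicate order number" has no data set yet
    }

    for path in paths:
        covering = data_sets.get(path, [])
        print(f"{path}: {', '.join(covering) if covering else 'NOT COVERED'}")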

Some Thoughts about Error Messages

Error messages for data exceptions are an important consideration in a good test effort. In my study of production problems at Prodigy, it became clear that virtually all of the most tenacious, expensive, and longest-lived production problems involved one or more missing or erroneous error messages. These problems had the most profound impact on customer service as well.

Data-dependent error messages need to be accounted for in the test inventory as part of your data analysis. I haven't seen a complete list of error messages for an application since 1995. In today's object-oriented architectures, they tend to be decentralized, so accounting for them usually requires exploration. I generally estimate how many I should find when I do my path analysis. There should be at least one error message for each exception path and at least one data error message for each data entry field. This area of testing may be a minor concern to you or it may be a major issue. Here are a couple of examples of what I mean.

There was a startup company with a B2B Web application that I tested during the dot-com boom. There was only one error message in the entire Web application. The text of the error message was just one word: "Wrong." This developer's error message "placeholder" appeared whenever the application encountered a data error. The testers complained about the message, and they were told that it would be replaced by the appropriate text messages in due course. Of course, it was never fully eradicated from the system, and it would pop up at the most inconvenient times. Fortunately, this company went into the sea with the other lemmings when the dot-coms crashed.

On the other end of the spectrum, I had the pleasure to write some white papers for a Danish firm that developed and marketed the finest enterprise resource planning (ERP) products I have ever seen. Reviewing (testing) their products was the most wonderful breath of fresh air in testing I have had since Prodigy. Their products were marketed throughout Europe and America, and simultaneously supported many languages.

To ensure high-quality, appropriate, and helpful error messages in many languages, they incorporated the creation and maintenance of the error message text for any required language into their development platform. The development platform kept a to-do list for all unfinished items, and developers could not check in their code as complete until the error messages were also marked complete. The company hired linguists to create and maintain all their text messages, but it was the responsibility of the developer to make sure the correct messages were attached to the exception processors in their code.

This system worked wonderfully in all its languages. The helpful text messages contributed to both high customer satisfaction and fewer calls to customer service.

Testing Data Input by Users (the GUI)

Most of the data testing we do these days is user input, and that is what we concentrate on in this book. I have included one example about testing raw data in quantity: the real-world shipping example mentioned repeatedly throughout the book. That was the one test project I had in 10 years in which I tested raw data in quantity. One part of that project was to test the integration of the acquired company's car movement data stream with the parent company's car movement data stream.

The team accomplished this testing in a completely manual mode even though millions of messages had to be analyzed and verified. The testers were all subject matter experts (SMEs) and senior staff members. The complexity of their analysis could not be automated, or even taught to professional testers; every verification and validation required a lifetime of experience and the expertise of the very best.

The effort was an enormous undertaking and cost four times the original estimate. I believe budget money was appropriated from every other department at the parent company to pay for it. Nevertheless, it was mission-critical that those data streams maintain 100 percent integrity, so no price was too high for the test effort that ensured the integration succeeded.

Data-Dependent Paths

Some paths will be more data-dependent than others. In these cases, the number of tests performed is a function of the number of data sets that will be tested. The same path, or at least parts of the same path, will be exercised repeatedly. The data will control the branches taken and not taken.

If you approach data analysis without considering the independent paths, you will certainly miss some important paths. In my experience, this is how many hard-to-reproduce bugs get into production. Someone tests all the main, easily identified data sets without considering all the possible exception paths. This is why I recommend performing the path analysis and then populating the paths with the data sets that are required to exercise the most important paths, rather than the other way around.

Having said that, I must add that users do some unexpected things with data, and so an examination of paths alone will not suffice to cover all the exceptions that will be exercised by the user.
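To make this concrete, here is a minimal sketch in Python of a routine whose branches are selected entirely by the data it receives. The message fields and routing rules are invented for illustration and are not from the shipping project; each data set below was chosen, after the path analysis, to force exactly one path, including the exception path that is easy to overlook.

```python
def route_car_movement(message):
    # The data, not the caller, decides which branch is taken.
    if "car_id" not in message:
        raise ValueError("missing car_id")            # exception path
    if message.get("interchange"):
        return "interchange-settlement"               # branch A
    if message.get("destination") == "yard":
        return "yard-switch"                          # branch B
    return "mainline-move"                            # default branch

# One data set per identified path, chosen after the path analysis:
data_sets = [
    {"car_id": "GATX 1234", "interchange": True},        # forces branch A
    {"car_id": "GATX 1234", "destination": "yard"},       # forces branch B
    {"car_id": "GATX 1234", "destination": "mainline"},   # forces the default branch
    {},                                                   # forces the exception path
]

for data in data_sets:
    try:
        print(data, "->", route_car_movement(data))
    except ValueError as error:
        print(data, "-> error:", error)
```

If the exception data set were dropped from this list, the path analysis would show the gap immediately; working from the data alone, it is easy to miss.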

Some Thoughts about Error Messages

Error messages for data exceptions are an important consideration in a good test effort. In my study of production problems at Prodigy, it became clear that virtually all of the most tenacious, expensive, and longest-lived production problems involved one or more missing or erroneous error messages. These problems had the most profound impact on customer service as well.

Data-dependent error messages need to be accounted for in the test inventory as part of your data analysis. I haven't seen a complete list of error messages for an application since 1995. In today's object-oriented architectures, they tend to be decentralized, so accounting for them usually requires exploration. I generally estimate how many I should find when I do my path analysis. There should be at least one error message for each exception path and at least one data error message for each data entry field. This area of testing may be a minor concern to you or it may be a major issue. Here are a couple of examples of what I mean.

There was a startup company with a B2B Web application that I tested during the dot-com boom. There was only one error message in the entire Web application. The text of the error message was just one word: "Wrong." This developer's error message "placeholder" appeared whenever the application encountered a data error. The testers complained about the message, and they were told that it would be replaced by the appropriate text messages in due course. Of course, it was never fully eradicated from the system, and it would pop up at the most inconvenient times. Fortunately, this company went into the sea with the other lemmings when the dot-coms crashed.

On the other end of the spectrum, I had the pleasure of writing some white papers for a Danish firm that developed and marketed the finest enterprise resource planning (ERP) products I have ever seen. Reviewing (testing) their products was the most wonderful breath of fresh air I have had in testing since Prodigy. Their products were marketed throughout Europe and America and supported many languages simultaneously.

To ensure high-quality, appropriate, and helpful error messages in many languages, they incorporated the creation and maintenance of the error message text for any required language into their development platform. The development platform kept a to-do list for all unfinished items, and developers could not check in their code as complete until the error messages were also marked complete. The company hired linguists to create and maintain all their text messages, but it was the responsibility of the developer to make sure the correct messages were attached to the exception processors in their code.

This system worked wonderfully in all its languages. The helpful text messages contributed to both high customer satisfaction and fewer calls to customer service.

Field Validation Tests

As the first example, I will use BVA and a few data-reducing assumptions to determine the minimum number of tests I have to run to make sure that the application accepts only valid month and year data from the form.

Translating the acceptable values for boundary value analysis, the expiration month data set becomes:

1 ≤ month ≤ 12

BVA month data set = {0,1,2,11,12,13} (6 data points)

The values that would normally be selected for BVA are 0, 1, 2, and 11, 12, 13.

Using simple data reduction techniques, we will further reduce this number of data points by the following assumptions.


Assumption 1.

The interior valid values, 2 and 11, are probably redundant with each other; therefore, only a single midpoint, 6, will be tested in their place.

Month data set = {0,1,6,12,13} (5 data points)

This next assumption may be arbitrary, especially in the face of the hacker story that I just related, but it is a typical assumption.


Assumption 2.

Negative values will not be a consideration.


Likewise, the valid field data set for the expiration year becomes:

2002 ≤ year ≤ 2011

BVA year data set = {2001,2002,2003,2010,2011,2012} (6 data points)

Again, I will apply a simplifying assumption.


Assumption 3.

The interior valid values, 2003 and 2010, are probably redundant with each other; therefore, only the midpoint, 2006, will be tested in their place.

Year data set = {2001,2002,2006,2011,2012} (5 data points)
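The mechanics of these two reductions can be sketched in a few lines of Python. This is illustrative only; the helper names are my own, and the midpoint rule simply encodes Assumptions 1 and 3.

```python
def bva_points(lo, hi):
    # Classic boundary value analysis candidates for an integer range [lo, hi].
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]

def reduce_interior(points, lo, hi):
    # Replace the two interior valid points (lo + 1 and hi - 1) with a single
    # midpoint, mirroring Assumptions 1 and 3.
    mid = (lo + hi) // 2
    reduced = [p for p in points if p not in (lo + 1, hi - 1)]
    reduced.insert(2, mid)
    return reduced

month_bva = bva_points(1, 12)                      # [0, 1, 2, 11, 12, 13]
month_set = reduce_interior(month_bva, 1, 12)      # [0, 1, 6, 12, 13]

year_bva = bva_points(2002, 2011)                  # [2001, 2002, 2003, 2010, 2011, 2012]
year_set = reduce_interior(year_bva, 2002, 2011)   # [2001, 2002, 2006, 2011, 2012]
```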

These two fields, a valid month and a valid year, are combined to become a data set in the credit authorization process. These are the data values that will be used to build that test set. But before I continue with this example, I need to mention one more data reduction technique that is very commonly used but not often formalized.

Matrix Data Reduction Techniques

We all use data reduction techniques whether we realize it or not. The technique used here simply removes redundant data, or data that is likely to be redundant, from the test data sets. It is important to document data reductions so that others can understand the basis of the reduction; when data is eliminated arbitrarily, the result is usually large holes in the test coverage. Because data reduction is routinely applied before test design starts, it may not be necessary to rank the test data sets the way we ranked the paths.
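One lightweight way to document a reduction is to record what was dropped, what replaced it, and why. The structure below is only a suggestion; the entries simply restate the month and year reductions from the previous section.

```python
# Each entry records the basis of one data reduction so reviewers can audit it.
reduction_log = [
    {"field": "expiration_month", "dropped": [2, 11], "replaced_by": 6,
     "reason": "interior valid values are redundant; a single midpoint suffices"},
    {"field": "expiration_year", "dropped": [2003, 2010], "replaced_by": 2006,
     "reason": "interior valid values are redundant; a single midpoint suffices"},
]

for entry in reduction_log:
    print(f"{entry['field']}: dropped {entry['dropped']} "
          f"in favor of {entry['replaced_by']} because {entry['reason']}")
```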

Data Set Truth Table

All these values need to be valid or we will never get a credit card authorization to pass. But consider it a different way: suppose we put in a valid date and a valid credit card number, but we pick the wrong type of credit card. Every field value is valid, yet the data set should fail. To build the data sets I need, I must first understand the rules. The following table tells me how many true data values I need for a single card to get a credit authorization; a short sketch after the table illustrates the difference between field validity and data set validity.


Data Set 1: The set of all valid data, all in the data set

| Field                        | Is a valid value for the field | Is a valid member of this data set | Minimum number of data values to test | Minimum number of data sets to test |
|------------------------------|--------------------------------|------------------------------------|---------------------------------------|-------------------------------------|
| Cardholder Name              |                                |                                    |                                       |                                     |
| 1. First Name                | True                           | True                               | 1                                     |                                     |
| 2. Last Name                 | True                           | True                               | 1                                     |                                     |
| Billing Address              |                                |                                    |                                       |                                     |
| 1. Street Address            | True                           | True                               | 1                                     |                                     |
| 2. City                      | True                           | True                               | 1                                     |                                     |
| 3. State                     | True                           | True                               | 1                                     |                                     |
| 4. Zip                       | True                           | True                               | 1                                     |                                     |
| Credit Card Information      |                                |                                    |                                       |                                     |
| 1. Card Type                 | True                           | True                               | 1                                     |                                     |
| 2. Card Number               | True                           | True                               | 1                                     |                                     |
| 3. Expiration Month          | True                           | True                               | 1                                     |                                     |
| 4. Expiration Year           | True                           | True                               | 1                                     |                                     |
| 5. Card Verification Number  | True                           | True                               | 1                                     |                                     |
| OUTCOME                      | True                           | True                               | 11                                    | 1                                   |
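To illustrate the distinction the table captures, here is a minimal sketch in Python. The field checks and the card-type prefixes are simplified, invented rules, not the application's actual validation; the point is only that every field can pass its own check while the data set as a whole fails.

```python
def field_checks(data):
    # Each check looks at one field in isolation.
    return {
        "card_type": data["card_type"] in {"Visa", "MasterCard"},
        "card_number": data["card_number"].isdigit() and len(data["card_number"]) == 16,
        "expiration_month": 1 <= data["expiration_month"] <= 12,
        "expiration_year": 2002 <= data["expiration_year"] <= 2011,
    }

def data_set_check(data):
    # The set-level rule: the card number must belong to the selected card type.
    prefixes = {"Visa": "4", "MasterCard": "5"}   # simplified, illustrative prefixes
    return data["card_number"].startswith(prefixes[data["card_type"]])

candidate = {
    "card_type": "Visa",                 # a valid value for the field
    "card_number": "5500000000000004",   # a valid value for the field, but not a Visa number
    "expiration_month": 6,
    "expiration_year": 2006,
}

print(field_checks(candidate))    # every field passes on its own
print(data_set_check(candidate))  # False: all valid fields, yet an invalid data set
```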