Measures and Metrics
A metric is a measure. A metric system is a set of measures that can be combined to form derived measures, for example, the old English system of feet, pounds, and hours. These fundamental measures can be combined to form derived measures such as miles per hour.
Measure has been defined as "the act or process of determining extent, dimensions, etc.; especially as determined by a standard" (Webster's New World Dictionary). If the standard is objective and concrete, the measurements will be reproducible and meaningful. If the standard is subjective and intangible, the measurements will be unreproducible and meaningless. A measurement is not likely to be any more accurate than its standard. Factors of safety can correct for some deficiencies, but they are not a panacea.
Craft: The Link between Art and Engineering
My great-grandmother was a craftsperson. A craftsperson is the evolutionary link between art and engineering. My great-grandmother made excellent cookies. Her recipes were developed and continuously enhanced over her lifetime. These recipes were not written down; they lived in my great-grandmother's head and were passed on only by word of mouth. She described the steps of the recipes using large gestures and analogies: "mix it till your arm feels like it's going to fall off, and then mix it some more." She guessed the temperature of the oven by feeling the heat with her hand. She measured ingredients by description, using terms like "a lump of shortening the size of an egg," "so many handfuls of flour," "a pinch of this or that," and "as much sugar as seems prudent."
Great-Grandmother's methods and metrics were consistent; she could have been ISO certified, especially if the inspector had eaten any of her cookies. But her methods and metrics were local. Success depended on the size of her hand, the size of an egg from one of her hens, and her idea of what was prudent.
The biggest difference between an engineer and a craftsperson is measurement. The engineer does not guess except as a last resort. The engineer measures. The engineer keeps written records of the steps in the process that he is pursuing, along with the ingredients and their quantities. The engineer uses standard measuring tools and metrics like the pound and the gallon or the gram and the liter. The engineer is concerned with preserving information and communicating it on a global scale. Recipes passed on by word of mouth using metrics like a handful of flour and a pinch of salt do not scale up well to industrial production levels. A great deal of time is required to train someone to interpret and translate such recipes, and these recipes are often lost because they were never written down.
Operational Definitions: Fundamental Metrics
The definition of a physical quantity is the description of the operational procedure for measuring the quantity. For example, "the person is one and a half meters tall." From this definition of a person's height, we know which metric system was used and how to reproduce the measurement. The magnitude of a physical quantity is specified by a number, "one and a half," and a unit, "meters." This is the simplest and most fundamental type of measurement.
Derived units are obtained by combining metrics. For example, miles per hour, feet per second, and dollars per pound are all derived units. These derived units are still operational definitions because the name tells how to measure the thing.
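The "name tells how to measure" property of a derived unit can be sketched in a few lines of code. This is a minimal illustration; the function name and signature are mine, not from any standard library:

```python
def derived_rate(quantity, unit, per_quantity, per_unit):
    """Combine two fundamental measurements into a derived unit.

    The derived unit's name ("miles per hour") describes exactly how
    the measurement was made: divide the first quantity by the second.
    """
    return (quantity / per_quantity, f"{unit} per {per_unit}")

# 120 miles covered in 2 hours is a derived measurement of speed.
print(derived_rate(120, "miles", 2, "hour"))  # (60.0, 'miles per hour')
```

The same operation produces feet per second or dollars per pound; only the fundamental units change.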
How Metrics Develop and Gain Acceptance
If no suitable recognized standard exists, we must identify a local one and use it consistently, much like my great-grandmother did when making her cookies. Over time, the standards will be improved.
Developing precise and invariant standards for measurement is a process of constant refinement. The foot and the meter did not simply appear overnight. About 4,700 years ago, engineers in Egypt used strings with knots tied at even intervals. They built the pyramids with these measuring strings, even though knotted ropes may have been accurate to only 1 part in 1,000. It was not until 1875 that an international standard was adopted for length. This standard was a bar of platinum-iridium with two fine lines etched on it, defining the length of the meter. It was kept in the International Bureau of Weights and Measures in Sèvres, France. The precision provided by this bar was about 1 part in 10 million. By the 1950s, this was not precise enough for work being done in scientific research and industrial instrumentation. In 1960, a new standard was introduced that precisely defined the length of the meter as exactly 1,650,763.73 times the wavelength of the orange light emitted by krypton-86, a pure isotope of krypton gas with mass number 86. This standard can be measured to better than 1 part in 100 million.
Once a standard is introduced, it must still be accepted. Changing the way we do things requires an expenditure of energy. There must be a good reason to expend that energy.
What to Measure in Software Testing
Measure the things that help you answer the questions you have to answer. The challenge with testing metrics is that the test objects that we want to measure have multiple properties; they can be described in many ways. For example, a software bug has properties much like a real insect: height, length, weight, type or class (family, genus, spider, beetle, ant, etc.), color, and so on. It also has attributes, like poisonous or nonpoisonous, flying or nonflying, vegetarian or carnivorous.[1]
I find that I can make my clearest and most convincing arguments when I stick to fundamental metrics. For example, the number of bugs found in a test effort is not meaningful as a measure until I combine it with the severity, type of bugs found, number of bugs fixed, and so on.
Several fundamental and derived metrics taken together provide the most valuable and complete set of information. By combining these individual bits of data, I create information that can be used to make decisions, and most everyone understands what I am talking about. If someone asks if the test effort was a success, just telling him or her how many bugs we found is a very weak answer. There are many better answers in this chapter.
Fundamental Testing Metrics: How Big Is It?
Fundamental testing metrics are the ones that can be used to answer the following questions.
- How big is it?
- How long will it take to test it?
- How much will it cost to test it?
- How much will it cost to fix it?
The question "How big is it?" is usually answered in terms of how long it will take and how much it will cost. These are the two most common attributes of it. We would normally estimate answers to these questions during the planning stages of the project. These estimates are critical in sizing the test effort and negotiating for resources and budget. A great deal of this book is dedicated to helping you make very accurate estimates quickly. You should also calculate the actual answers to these questions when testing is complete. You can use the comparisons to improve your future estimates.
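The estimate-versus-actual comparison suggested above can be kept as a simple ratio. A minimal sketch; the function name and the sample figures are illustrative assumptions, not part of any sizing method described here:

```python
def estimate_accuracy(estimated, actual):
    """Ratio of actual to estimated cost or duration.

    A ratio above 1.0 means the estimate was too low; tracking this
    ratio across projects helps calibrate future estimates.
    """
    return actual / estimated

# Estimated 200 tester-hours, actually spent 260: the estimate was 30% low.
print(estimate_accuracy(200, 260))  # 1.3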
I have heard the following fundamental metrics discounted because they are so simple, but in my experience, they are the most useful: time, cost, tests, and bugs. We quantify "how big it is" with these metrics. They are probably the most fundamental metrics specific to software testing, and they are discussed here in order of decreasing certainty. Only time and cost are clearly defined using standard units. Tests and bugs are complex and varied, having many properties, and they can be measured using many different units.
For example, product failures are a special class of bug-one that has migrated into production and caused a serious problem, hence the word "failure." A product failure can be measured in terms of cost, cost to the user, cost to fix, or cost in lost revenues. Bugs detected and removed in test are much harder to quantify in this way.
The properties and criteria used to quantify tests and bugs are normally defined by an organization; so they are local and they vary from project to project. In Chapters 11 through 13, I introduce path and data analysis techniques that will help you standardize the test metric across any system or project.
Time
Units of time are used in several test metrics, for example, the time required to run a test and the time available for the test effort. Let's look at each of these more closely.
The Time Required to Run a Test
This measurement is absolutely required to estimate how long a test effort will need in order to perform the tests planned. It is one of the fundamental metrics used in the test inventory and the sizing estimate for the test effort.
The time required to conduct test setup and cleanup activities must also be considered. Setup and cleanup activities can be estimated as part of the time required to run a test or as separate items. Theoretically, the sum of the time required to run all the planned tests is important in estimating the overall length of the test effort, but it must be tempered by the number of times a test will have to be attempted before it runs successfully and reliably.
- Sample Units: Generally estimated in minutes or hours per test. The number of hours required to complete a suite of tests is also important.
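The sizing arithmetic just described, run time per test tempered by expected attempts, plus setup and cleanup, might be sketched as follows. The function and the sample figures are illustrative assumptions, not the inventory method itself:

```python
def effort_minutes(tests, setup=0.0, cleanup=0.0):
    """Estimate total test effort in minutes.

    tests: list of (run_minutes, expected_attempts) pairs, where
    expected_attempts is how many times the test will likely be
    attempted before it runs successfully and reliably.
    """
    run_time = sum(minutes * attempts for minutes, attempts in tests)
    return run_time + setup + cleanup

suite = [(5, 2), (10, 1), (3, 3)]  # minutes per run, expected attempts
print(effort_minutes(suite, setup=15, cleanup=10))  # 54
```

Whether setup and cleanup are folded into each test's run time or listed as separate items, the total is the same; what matters is that they are counted once.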
The Time Available for the Test Effort
This is usually the most firmly established and most published metric in the test effort. It is also usually the only measurement that is consistently decreasing.
- Sample Units: Generally estimated in weeks and measured in minutes.
The Cost of Testing
The cost of testing usually includes the cost of the testers' salaries, the equipment, systems, software, and other tools. It may be quantified in terms of the cost to run a test or a test suite.
Calculating the cost of testing is straightforward if you keep good project metrics. However, it does not offer much cost justification unless you can contrast it with its converse, for example, the cost of not testing. Establishing the cost of not testing can be difficult or impossible. More on this later in the chapter.
- Sample Units: Currency, such as dollars; can also be measured in units of time.
Tests
We do not have an invariant, precise, internationally accepted standard unit that measures the size of a test, but that should not stop us from benefiting from identifying and counting tests. There are many types of tests, and they all need to be counted if the test effort is going to be measured. Techniques for defining, estimating, and tracking the various types of test units are presented in the next several chapters.
Tests have attributes such as quantity, size, importance or priority, and type.

Sample Units (listed simplest to most complex):

- A keystroke or mouse action
- An SQL query
- A single transaction
- A complete function path traversal through the system
- A function-dependent data set
Bugs
Many people claim that finding bugs is the main purpose of testing. Even though bugs are fairly discrete events, their measurement is often debated because there is no absolute standard in place for measuring them.
- Sample Units: Severity, quantity, type, duration, distribution, and cost to find and fix. Note: Bug distribution and the cost to find and fix are derived metrics.
Like tests, bugs also have attributes as discussed in the following sections.
Severity
Severity is a fundamental measure of a bug or a failure. Many ranking schemes exist for defining severity. Because there is no set standard for establishing bug severity, the magnitude of the severity of a bug is often open to debate. Table 5.1 shows the definition of the severity metrics and the ranking criteria used in this book.
| SEVERITY RANKING | RANKING CRITERIA |
|---|---|
| Severity 1 Errors | Program ceases meaningful operation |
| Severity 2 Errors | Severe function error but application can continue |
| Severity 3 Errors | Unexpected result or inconsistent operation |
| Severity 4 Errors | Design or suggestion |
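Table 5.1's rankings can be encoded directly. A hedged sketch (the dictionary and function names are mine) that also tallies a bug list into a severity distribution, a simple derived metric used later in the chapter:

```python
from collections import Counter

# Severity rankings and criteria from Table 5.1.
SEVERITY_CRITERIA = {
    1: "Program ceases meaningful operation",
    2: "Severe function error but application can continue",
    3: "Unexpected result or inconsistent operation",
    4: "Design or suggestion",
}

def severity_distribution(bug_severities):
    """Tally bugs by severity rank, rejecting ranks outside Table 5.1."""
    for rank in bug_severities:
        if rank not in SEVERITY_CRITERIA:
            raise ValueError(f"unknown severity rank: {rank}")
    return Counter(bug_severities)

print(severity_distribution([1, 3, 3, 2, 1]))  # two Sev 1, one Sev 2, two Sev 3
```

Because severity criteria are local, the dictionary is the piece a project would replace with its own ranking scheme.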
Bug Type Classification
First of all, bugs are bugs; the name is applied to a huge variety of "things." Types of bugs can range from a nuisance misunderstanding of the interface, to coding errors, to database errors, to systemic failures, and so on.
Like severity, bug classifications, or bug types, are usually defined by a local set of rules. These are further modified by factors like reproducibility and fixability.
In a connected system, some types of bugs are system "failures," as opposed to, say, a coding error. For example, the following bugs are caused by missing or broken connections:
- Network outages
- Communications failures
- In mobile computing, individual units that are constantly connecting and disconnecting
- Integration errors
- Missing or malfunctioning components
- Timing and synchronization errors
These bugs are actually system failures. These types of failure can, and probably will, recur in production. Therefore, the tests that found them during the test effort are very valuable in the production environment. This type of bug is important in the test effectiveness metric, discussed later in this chapter.
The Number of Bugs Found
For this metric, there are two main genres: (1) bugs found before the product ships or goes live and (2) bugs found after-or, alternately, those bugs found by testers and those bugs found by customers. As I have already said, this is a very weak measure until you bring it into perspective using other measures, such as the severity of the bugs found.
The Number of Product Failures
This measurement is usually established by the users of the product and reported through customer support. Since the customers report the failures, it is unusual for product failures that the customers find intolerable to be ignored or discounted. If it exists, this measurement is a key indicator of past performance and probable trouble spots in new releases. Ultimately, it is measured in money, lost profit, increased cost to develop and support, and so on.
This is an important metric in establishing an answer to the question "Was the test effort worth it?" But, unfortunately, in some organizations, it can be difficult for someone in the test group to get access to this information.
- Sample Units: Quantity, severity, and currency.
The Number of Bugs Testers Find per Hour: The Bug Find Rate
This is a most useful derived metric both for measuring the cost of testing and for assessing the stability of the system. The bug find rate is closely related to the mean time between failures metric. It can give a good indication of the stability of the system being tested. But it is not helpful if considered by itself.
Consider Tables 5.2 and 5.3. These statistics are taken from a case study of a shrink-wrap RAD project: a five-week test effort conducted by consultants on new code. They are a good example of a constructive way to combine bug data, like the bug find rate and the cost of finding bugs, to create information.
Table 5.2: Bug find and report rates, early in the test effort

| METRIC | VALUE |
|---|---|
| Bugs found/hour | 5.33 bugs found/hr |
| Cost/bug to find | $9.38/bug to find |
| Bugs reported/hr | 3.25 bugs/hr |
| Cost to report | $15.38/bug to report |
| Cost/bug find and report | $24.76/bug to find and report |

Table 5.3: Bug find and report rates, week 4

| METRIC | VALUE |
|---|---|
| Bugs found/hour | 0.25 bugs found/hr |
| Cost/bug to find | $199.79/bug to find |
| Bugs reported/hr | 0.143 bugs/hr |
| Cost to report | $15.38/bug to report |
| Cost/bug find and report | $215.17/bug to find and report |
Notice that the cost of reporting and tracking bugs is normally higher than the cost of finding bugs in the early part of the test effort. This situation changes as the bug find rate drops, while the cost to report a bug remains fairly static throughout the test effort.
By week 4, the number of bugs being found per hour has dropped significantly. It should drop as the end of the test effort is approached. However, the cost to find each successive bug rises, since testers must look longer to find a bug, but they are still paid by the hour.
These tables are helpful in explaining the cost of testing and in evaluating the readiness of the system for production.
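The derived figures in Tables 5.2 and 5.3 follow from a simple division. As an illustration, assume an all-in tester cost of about $50 per hour; that figure is my inference from the tables (5.33 bugs/hr × $9.38/bug ≈ $50), not a number stated in the case study:

```python
def cost_per_bug(hourly_cost, bugs_per_hour):
    """Cost to find one bug.

    As the find rate drops toward the end of the test effort, this
    cost rises: testers must look longer per bug but are still paid
    by the hour.
    """
    return hourly_cost / bugs_per_hour

early = cost_per_bug(50.0, 5.33)  # ~$9.38/bug, as in Table 5.2
late = cost_per_bug(50.0, 0.25)   # $200/bug; Table 5.3 shows $199.79
print(round(early, 2), round(late, 2))
```

The fixed $15.38 cost to report a bug is unaffected by this division, which is why reporting dominates early and finding dominates late.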
Bug Composition: How Many of the Bugs Are Serious?
As we have just discussed, there are various classes of bugs. Some of them can be eradicated, and some of them cannot. The most troublesome bugs are the ones that cannot be easily reproduced and recur at random intervals. Software failures and bugs are measured by quantity and by relative severity. Severity is usually determined by a local set of criteria, similar to the one presented in the preceding text.
If a significant percentage of the bugs being found in testing are serious, then there is a definite risk that the users will also find serious bugs in the shipped product. The following statistics are taken from a case study of a shrink-wrap RAD project. Table 5.4 shows separate categories for the bugs found and bugs reported.
| ERROR RANKING | RANKING DESCRIPTION | BUGS FOUND | BUGS REPORTED |
|---|---|---|---|
| Severity 1 Errors | GPF or program ceases meaningful operation | 18 | 9 |
| Severity 2 Errors | Severe function error but application can continue | 11 | 11 |
| Severity 3 Errors | Unexpected result or inconsistent operation | 19 | 19 |
| Severity 4 Errors | Design or suggestion | 0 | 0 |
| Totals | | 48 | 39 |
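The risk argument above comes down to a percentage. Using the bugs-found column of Table 5.4, and treating severity 1 and 2 as "serious" (my assumption, consistent with the ranking criteria in Table 5.1), the share of serious bugs can be computed as:

```python
def percent_serious(found_by_severity, serious_ranks=(1, 2)):
    """Percentage of bugs found that are serious.

    serious_ranks defaults to severity 1 and 2, an assumption here;
    each organization draws this line with its own local criteria.
    """
    total = sum(found_by_severity.values())
    serious = sum(found_by_severity.get(r, 0) for r in serious_ranks)
    return 100.0 * serious / total

# Bugs-found column of Table 5.4: 18 + 11 of the 48 bugs are serious.
found = {1: 18, 2: 11, 3: 19, 4: 0}
print(round(percent_serious(found), 1))  # 60.4
```

A figure this high suggests a real risk that users will also find serious bugs in the shipped product.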