The FF Planning System: Fast Plan Generation Through Heuristic Search

We describe and evaluate the algorithmic techniques that are used in the FF planning system. Like the HSP system, FF relies on forward state space search, using a heuristic that estimates goal distances by ignoring delete lists. Unlike HSP's heuristic, our method does not assume facts to be independent. We introduce a novel search strategy that combines hill-climbing with systematic search, and we show how other powerful heuristic information can be extracted and used to prune the search space. FF was the most successful automatic planner at the recent AIPS-2000 planning competition. We review the results of the competition, give data for other benchmark domains, and investigate the reasons for the runtime performance of FF compared to HSP.


Introduction
Over the last few years we have seen a significant increase in the efficiency of planning systems. This increase is mainly due to three new approaches in plan generation.
The first approach was developed by Blum and Furst (1995, 1997). In their seminal paper on the GRAPHPLAN system (Blum & Furst, 1995), they described a new plan generation technique based on planning graphs, which was much faster than any other technique known at that time. Their paper started a whole series of research efforts that refined this approach by making it even more efficient (Fox & Long, 1998; Kambhampati, Parker, & Lambrecht, 1997) and by extending it to cope with more expressive planning languages (Koehler, Nebel, Hoffmann, & Dimopoulos, 1997; Gazen & Knoblock, 1997; Anderson, Smith, & Weld, 1998; Nebel, 2000).
The second approach is the planning as satisfiability method, which translates planning to propositional satisfiability (Kautz & Selman, 1996). In particular there is the hope that advances in the state of the art of propositional reasoning systems carry directly over to planning systems relying on this technology. In fact, Kautz and Selman (1999) predicted that research on planning methods will become superfluous because the state of the art in propositional reasoning systems will advance much faster than in planning systems.
A third new approach is heuristic-search planning as proposed by Bonet and Geffner (1998, 1999). In this approach a heuristic function is derived from the specification of the planning instance and used for guiding the search through the state space. As demonstrated by the system FF (short for Fast-Forward) at the planning competition at AIPS-2000, this approach proved to be competitive. In fact, FF outperformed all the other fully automatic systems and was nominated Group A Distinguished Performance Planning System at the competition.
In HSP (Bonet & Geffner, 1998), goal distances are estimated by approximating solution length to a relaxation of the planning task (Bonet, Loerincs, & Geffner, 1997). While FF uses the same relaxation for deriving its heuristics, it differs from HSP in a number of important details. Its base heuristic technique can be seen as an application of GRAPHPLAN to the relaxation. This yields goal distance estimates that, in contrast to HSP's estimates, do not rely on an independence assumption. FF uses a different search technique than HSP, namely an enforced form of hill-climbing, combining local and systematic search. Finally, it employs a powerful pruning technique that selects a set of promising successors to each search node, and another pruning technique that cuts out branches where it appears that some goal has been achieved too early. Both techniques are obtained as a side effect of the base heuristic method.
Concerning the research strategy that FF is based on, we remark the following. A lot of classical planning approaches, like partial-order planning (McAllester & Rosenblitt, 1991) or planning graph analysis (Blum & Furst, 1997), are generic problem solving methods, developed following some theoretical concept, and tested on examples from the literature afterwards. In our approach, exploring the idea of heuristic search, there is no such clear distinction between development and testing. The search strategy, as well as the pruning techniques, are generic methods that have been motivated by observing examples. Also, design decisions were made on the basis of careful experimentation. This introduces into the system a bias towards the examples used for testing during development. We were testing our algorithms on a range of domains often used in the planning literature. Throughout the paper, we will refer to domains that are frequently used in the literature, and to tasks from such domains, as benchmarks. In the development phase, we used benchmark examples from the Assembly, Blocksworld, Grid, Gripper, Logistics, Mystery, Mprime, and Tireworld domains. When describing our algorithms in the paper, we indicate the points where those testing examples played a role for design decision making.
Planning is known to be PSPACE-complete even in its simplest form (Bylander, 1994). Thus, in the general case, there is no efficient algorithmic method. It is therefore worthwhile to look for algorithms that are efficient at least on restricted subclasses. To some extent, this idea has been pursued by posing severe syntactical restrictions to the planning task specifications (Bylander, 1994). Our approach is complementary to this. Examining the existing benchmarks, one finds that they, indeed, do not exploit the full expressivity of the underlying planning formalism. Though they do not fulfill any obvious rigid syntactical restrictions, almost none of them is particularly hard. In almost all of the existing benchmark domains, a non-optimal plan can, in principle, be generated in polynomial time. Using the benchmarks for inspiration during development, we have been able to come up with a heuristic method that is not provably efficient, but does work well empirically on a large class of planning tasks. This class includes almost all of the current planning benchmarks. Intuitively, the algorithms exploit the simple structure underlying these tasks. Our ongoing work is concerned with finding a formal characterization of that "simple" structure, and thereby formalizing the class of planning tasks that FF works well on.
Section 2 gives a schematic view on FF's system architecture, and Section 3 introduces our notational conventions for STRIPS domains. Sections 4 to 6 describe the base heuristic technique, search algorithm, and pruning methods, respectively. Section 7 shows how the algorithms are extended to deal with ADL domains. System performance is evaluated in Section 8, demonstrating that FF generates solutions extremely fast in a large range of planning benchmark domains. In order to illustrate our intuitions on the kind of structure that FF can exploit successfully, the section also gives examples of domains where the method is less appropriate. Finally, to clarify the performance differences between FF and HSP, the section describes a number of experiments we made in order to estimate which of the new algorithmic techniques is most useful. We show connections to related work at the points in the text where they apply, and overview other connections in Section 9. Section 10 outlines our current avenue of research.

System Architecture
To give the reader an overview of FF's system architecture, Figure 1 shows how FF's most fundamental techniques are arranged. The fundamental heuristic technique in FF is relaxed GRAPHPLAN, which we will describe in Section 4. The technique gets called on every search state by enforced hill-climbing, our search algorithm. This is a forward searching engine, to be described in Section 5. Given a state, relaxed GRAPHPLAN informs the search with a goal distance estimate, and additionally with a set of promising successors for the state, the helpful actions, to be described in Section 6. Upon termination, enforced hill-climbing either outputs a solution plan, or reports that it has failed.
On top of the base architecture shown in Figure 1, we have integrated a few optimizations to cope with special cases that arose during testing:
- If a planning task contains states from which the goal is unreachable (dead ends, defined in Section 5.2), then enforced hill-climbing can fail to find a solution. In that case, a complete heuristic search engine is invoked to solve the task from scratch.
- In the presence of goal orderings, enforced hill-climbing sometimes wastes a lot of time achieving goals that need to be cared for later on. Two techniques trying to avoid this are integrated:
  - Added goal deletion, introduced in Section 6.2, cuts out branches where some goal has apparently been achieved too early.
  - The goal agenda technique, adapted from work by Jana Koehler (1998), feeds the goals to the planner in an order determined as a pre-process (Section 6.2.2).

Notational Conventions
For introducing FF's basic techniques, we consider simple STRIPS planning tasks, as were introduced by Fikes and Nilsson (1971). Our notations are as follows.
Definition 1 (State) A state $S$ is a finite set of logical atoms.
We assume that all operator schemata are grounded, i.e., we only talk about actions.
Definition 2 (Strips Action) A

GRAPHPLAN as a Heuristic Estimator
In this section, we introduce the base heuristic method used in FF. It is derived by applying GRAPHPLAN to relaxed planning tasks. Unlike HSP's estimates, the resulting goal distance estimates do not rely on an independence assumption. We prove that the heuristic computation is polynomial, give some notions on how distance estimates can be kept cautious, and describe how the method can be implemented efficiently.
Consider the heuristic method that is used in HSP (Bonet & Geffner, 1998). Given a planning task $P = (O, I, G)$, HSP estimates for each state $S$ that is reached in a forward search the solution length of the task $P'_S = (O', S, G)$, i.e., the length of a relaxed plan that achieves the goals starting out from $S$. As computing the optimal solution length to $P'_S$ (which would make an admissible heuristic) is NP-hard (Bylander, 1994), the HSP estimate is a rough approximation based on computing the following weight values.
HSP assumes facts to be achieved independently in the sense that the weight of a set of facts (an action's preconditions) is estimated as the sum of the individual weights. The state's heuristic estimate is
$$h(S) := \mathit{weight}_S(G) = \sum_{g \in G} \mathit{weight}_S(g)$$
Assuming facts to be achieved independently, this heuristic ignores positive interactions that can occur. Consider the following short example planning task, where the initial state is empty, the goals are $\{G_1, G_2\}$, and there are the following three actions:
name = (pre, add, del)
$\mathit{opG}_1 = (\{P\}, \{G_1\}, \emptyset)$
$\mathit{opG}_2 = (\{P\}, \{G_2\}, \emptyset)$
$\mathit{opP} = (\emptyset, \{P\}, \emptyset)$
HSP's weight value computation results in $P$ having weight one, and each goal having weight two. Assuming facts to be achieved independently, the distance of the initial state to a goal state is therefore estimated to be four. Obviously, however, the task is solvable in only three steps, as $\mathit{opG}_1$ and $\mathit{opG}_2$ share the precondition $P$. In order to take account of such positive interactions, our idea is to start GRAPHPLAN on the tasks $(O', S, G)$, and extract an explicit solution, i.e., a relaxed plan. One can then use this plan for heuristic evaluation. We will see in the next section that this approach is feasible: GRAPHPLAN can be proven to solve relaxed tasks in polynomial time.
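The overestimate in this example can be reproduced with a small sketch. It assumes the usual additive weight definition (a fact contained in the state has weight 0; any other fact has weight one plus the cheapest sum of precondition weights over its achievers), which agrees with the values quoted above. All function and variable names are illustrative and not taken from HSP.

```python
import math

# Actions as (name, preconditions, add list); delete lists are ignored.
ACTIONS = [
    ("opG1", frozenset({"P"}), frozenset({"G1"})),
    ("opG2", frozenset({"P"}), frozenset({"G2"})),
    ("opP", frozenset(), frozenset({"P"})),
]

def additive_weights(state, actions):
    """Fixpoint computation of per-fact weights under the independence assumption."""
    weight = {fact: 0 for fact in state}  # a missing key stands for infinity
    changed = True
    while changed:
        changed = False
        for _name, pre, add in actions:
            # Weight of a precondition set = sum of the individual weights.
            cost = sum(weight.get(p, math.inf) for p in pre)
            if cost == math.inf:
                continue  # some precondition has not been reached yet
            for fact in add:
                if cost + 1 < weight.get(fact, math.inf):
                    weight[fact] = cost + 1
                    changed = True
    return weight

def additive_estimate(state, goals, actions):
    """Sum of the goal weights, i.e. the estimate h(S) discussed in the text."""
    weight = additive_weights(state, actions)
    return sum(weight.get(g, math.inf) for g in goals)

weights = additive_weights(frozenset(), ACTIONS)
estimate = additive_estimate(frozenset(), {"G1", "G2"}, ACTIONS)
```

On the example task the sketch assigns weight 1 to P and weight 2 to each goal, so the summed estimate is 4 although three actions suffice.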

Planning Graphs for Relaxed Tasks
Let us examine how GRAPHPLAN behaves when it is started on a planning task that does not contain any delete lists. We briefly review the basic notations of the GRAPHPLAN algorithm (Blum & Furst, 1997).
A planning graph is a directed, layered graph that contains two kinds of nodes: fact nodes and action nodes. The layers alternate between fact and action layers, where one fact and action layer together make up a time step. In the first time step, number 0, we have the fact layer corresponding to the initial state and the action layer corresponding to all actions that are applicable in the initial state. In each subsequent time step $i$, we have the layer of all facts that can possibly be made true in $i$ time steps, and the layer of all actions that are possibly applicable given those facts.
Proposition 1 Let $P' = (O', I, G)$ be a relaxed STRIPS task. Started on $P'$, GRAPHPLAN will not mark any pair of facts or actions as mutually exclusive.
Proof: The Proposition is easily proven by induction over the depth of the planning graph.
Base case: time step 0. Only interfering actions are marked mutually exclusive at time step 0. As there are no delete effects, no pair of actions interferes. Inductive case: time step $i \rightarrow$ time step $i + 1$. Per induction hypothesis, the facts are not exclusive as their achievers one time step ahead are not. From this it follows that no pair of actions has competing needs. They do not interfere either. $\Box$
When started on a planning task, GRAPHPLAN extends the planning graph layer by layer until a fact layer is reached that contains all goal facts, and in which no two goal facts are marked exclusive. Starting from that layer, a recursive backward search algorithm is invoked. To find a plan for a set of facts at layer $i > 0$, initialize the set of selected actions at layer $i - 1$ to the empty set. Then, for each fact, consider all achieving actions at layer $i - 1$ one after the other and select the first one that is not exclusive of any action that has already been selected. If there exists such an action, proceed with the next fact. If not, backtrack to the last fact and try to achieve it with a different action. If an achieving action has been selected for each fact, then collect the preconditions of all these actions to make up a new set of facts one time step earlier. Succeed when fact layer 0 (the initial state) is reached, where no achieving actions need to be selected. On relaxed tasks, no backtracking occurs in GRAPHPLAN's search algorithm.
Proposition 2 Let $P' = (O', I, G)$ be a relaxed STRIPS task. Started on $P'$, GRAPHPLAN will never backtrack.
Proof: Backtracking only occurs if all achievers for a fact $f$ are exclusive of some already selected action. With Proposition 1, we know that no exclusions exist, and thus, that this does not happen. Also, if $f$ is in graph layer $i$, then there is at least one achiever in layer $i - 1$ supporting it. $\Box$
While the above argumentation is sufficient for showing Proposition 2, it does not tell us much about what is actually going on when one starts GRAPHPLAN on a task without delete lists. What happens is this. Given the task is solvable, the planning graph gets extended until some fact layer is reached that contains all the goals. Then the recursive search starts by selecting achievers for the goals at this level. The first attempt succeeds, and new goals are set up one time step earlier. Again, the first selection of achievers succeeds, and so forth, until the initial state is reached. Thus, search performs only a single sweep over the graph, starting from the top layer going down to the initial layer, and collects a relaxed plan on its way. In particular, the procedure takes only polynomial time in the size of the task.
Theorem 1 Let $P' = (O', I, G)$ be a solvable relaxed STRIPS task, where the length of the longest add list of any action is $l$. Then GRAPHPLAN will find a solution to $P'$ in time polynomial in $l$, $|O'|$ and $|I|$.
Proof: Building the planning graph is polynomial in $l$, $|O'|$, $|I|$ and $t$, where $t$ is the number of time steps built (Blum & Furst, 1997). Now, in our case the total number $|O'|$ of actions is an upper limit to the number of time steps. This is just because after this number of time steps has been built, all actions appear at some layer in the graph. Otherwise, there is a layer $i$ where no new action comes in, i.e., action layer $i - 1$ is identical to action layer $i$. As the task is solvable, this implies that all goals are contained in fact layer $i$, which would have made the process stop right away. Similarly, action layer $|O'|$ would be identical to action layer $|O'| - 1$, implying termination. The graph building phase is thus polynomial in $l$, $|O'|$ and $|I|$.
Concerning the plan extraction phase: With Proposition 2, search traverses the graph from top to bottom, collecting a set of achieving actions at each layer. Selecting achievers for a set of facts is $O(l \cdot |O'| + |I|)$: A set of facts has at most size $l \cdot |O'| + |I|$, the maximal number of distinct facts in the graph. An achieving action can be found to each fact in constant time using the planning graph. As the number of layers to be looked at is $O(|O'|)$, search is polynomial in the desired parameters.
The estimation values obtained this way are, on our testing examples, usually lower than HSP's estimates (Equations 1 and 2), as extracting a plan takes account of positive interactions between facts. Consider again the short example from the beginning of this section, empty initial state, two goals $\{G_1, G_2\}$, and three actions:
name = (pre, add, del)
$\mathit{opG}_1 = (\{P\}, \{G_1\}, \emptyset)$
$\mathit{opG}_2 = (\{P\}, \{G_2\}, \emptyset)$
$\mathit{opP} = (\emptyset, \{P\}, \emptyset)$
Starting GRAPHPLAN on the initial state, the goals are contained in fact layer two, causing selection of $\mathit{opG}_1$ and $\mathit{opG}_2$ in action layer one. This yields the new goal $P$ at fact layer one, which is achieved with $\mathit{opP}$. The resulting plan is $\langle \{\mathit{opP}\}, \{\mathit{opG}_1, \mathit{opG}_2\} \rangle$, giving us the correct goal distance estimate three, as distinct from HSP's estimate four.
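The walk-through above can be traced with a simplified sketch of graph building and backtrack-free extraction on a delete-free task. It is not GRAPHPLAN itself: fact layers are plain sets, NOOPs are replaced by placing each fact at its first layer, and ties between achievers are broken by action name. These simplifications and all identifiers are assumptions made for illustration.

```python
# Relaxed actions: name -> (preconditions, add list).
ACTIONS = {
    "opG1": (frozenset({"P"}), frozenset({"G1"})),
    "opG2": (frozenset({"P"}), frozenset({"G2"})),
    "opP": (frozenset(), frozenset({"P"})),
}

def build_fact_layers(state, goals, actions):
    """Extend fact layers until all goals appear; None if they never do."""
    layers = [frozenset(state)]
    while not goals <= layers[-1]:
        new = set(layers[-1])
        for pre, add in actions.values():
            if pre <= layers[-1]:
                new |= add
        if new == layers[-1]:
            return None  # fixpoint without the goals: relaxed task unsolvable
        layers.append(frozenset(new))
    return layers

def extract_relaxed_plan(state, goals, actions):
    """Single top-down sweep; returns a list of action sets, one per time step."""
    goals = frozenset(goals)
    layers = build_fact_layers(frozenset(state), goals, actions)
    if layers is None:
        return None

    def first_layer(fact):
        return next(i for i, layer in enumerate(layers) if fact in layer)

    top = len(layers) - 1
    goal_sets = [set() for _ in layers]
    for g in goals:
        goal_sets[first_layer(g)].add(g)
    plan = [set() for _ in range(top)]
    for i in range(top, 0, -1):
        achieved = set()
        for fact in sorted(goal_sets[i]):
            if fact in achieved:
                continue  # already added by an action selected at this step
            # No exclusions exist, so the first achiever found is never revised.
            name = next(n for n in sorted(actions)
                        if fact in actions[n][1] and actions[n][0] <= layers[i - 1])
            pre, add = actions[name]
            plan[i - 1].add(name)
            achieved |= add
            for p in pre:
                goal_sets[first_layer(p)].add(p)
    return plan

plan = extract_relaxed_plan(set(), {"G1", "G2"}, ACTIONS)
estimate = sum(len(step) for step in plan)
```

For the example task the sketch selects opP at step 0 and both goal achievers at step 1, so the estimate is the number of selected actions, three.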

Solution Length Optimization
We use GRAPHPLAN's heuristic estimates, Equation 3, in a greedy strategy, to be introduced in Section 5.1, that does not take its decisions back once it has made them. From our experience with running this strategy on our testing examples, this works best when distance estimates are cautious, i.e., as low as possible. As already said, an optimal sequential solution cannot be synthesized efficiently. What one can do is apply some techniques to make GRAPHPLAN return solutions that are as short as possible. Below, we describe some ways of doing that. The first technique is a built-in feature of GRAPHPLAN and ensures a minimality criterion for the relaxed plan. The two other techniques are heuristic optimizations.

NOOPs-first
The original GRAPHPLAN algorithm makes extensive use of so-called NOOPs. These are dummy actions that simply propagate facts from one fact layer to the next. For each fact $f$ that gets inserted into some fact layer, a NOOP corresponding to that fact is inserted into the action layer at the same time step. This NOOP has no other effect than adding $f$, and no other precondition than $f$. When performing backward search, the NOOPs are considered just like any other achiever, i.e., one way of making a fact true at time $i > 0$ is to simply keep it true from time $i - 1$.
In GRAPHPLAN, the implementation uses as a default the NOOPs-first heuristic, i.e., if there is a NOOP present for achieving a fact $f$, then this NOOP is considered first, before the planner tries selecting other "real" actions that achieve $f$. On relaxed tasks, the NOOPs-first heuristic ensures a minimality criterion for the returned plan as follows.
Proposition 3 Let $(O', I, G)$ be a relaxed STRIPS task, which is solvable. Using the NOOPs-first strategy, the plan that GRAPHPLAN returns will contain each action at most once.
Proof: Let us assume the opposite, i.e., one action $o$ occurs twice in the plan $\langle O_0, \ldots, O_{m-1} \rangle$ that GRAPHPLAN finds. We have $o \in O_i$ and $o \in O_j$ for some layers $i, j$ with $i < j$. Now, the action $o$ has been selected at layer $j$ to achieve some fact $f$ at layer $j + 1$.
As the algorithm is using the NOOPs-first strategy, this implies that there is no NOOP for fact $f$ contained in action layer $j$: otherwise, the NOOP (not action $o$) would have been selected for achieving $f$. In contradiction to this, action layer $j$ does indeed contain a NOOP for fact $f$. This is because action $o$ already appears in action layer $i < j$. As $f$ gets added by $o$, it appears in fact layer $i + 1 \le j$. Therefore, a NOOP for $f$ is inserted in action layer $i + 1 \le j$, and, in turn, will be inserted into each action layer $i' \ge i + 1$. $\Box$

Difficulty Heuristic
With the above argumentation, if we can achieve a fact by using a NOOP, we should do that. The question is, which achiever should we choose when no NOOP is available? It is certainly a good idea to select an achiever whose preconditions seem to be "easy". From the graph building phase, we can obtain a simple measure for the difficulty of an action's preconditions as follows. The difficulty of each action can be set when it is first inserted into the graph. During plan extraction, facing a fact for which no NOOP is available, we then simply select an achieving action with minimal difficulty. This heuristic works well in situations where there are several ways to achieve one fact, but some ways need less effort than others.

Action Set Linearization
Assume GRAPHPLAN has settled for a parallel set $O_i$ of achievers at a time step $i$, i.e., achieving actions have been selected for all goals at time step $i + 1$. As we are only interested in sequential solution length, we still have a choice on how to linearize the actions. Some linearizations can lead to shorter plans than others. If an action $o \in O_i$ adds a precondition $p$ of another action $o' \in O_i$, then we do not need to include $p$ in the new set of facts to be achieved one time step earlier, given that we restrict ourselves to execute $o$ before $o'$. The question now is, how do we find a linearization of the actions that minimizes our new fact set? The corresponding decision problem is NP-complete.
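The effect of the ordering on the new fact set can be illustrated with a small sketch. The two actions and the fact names are hypothetical, and collecting the unsatisfied preconditions as a set of distinct facts is an assumption made here for illustration. The brute-force search over all orderings is only meant for tiny sets.

```python
from itertools import permutations

def unsatisfied_preconditions(sequence):
    """Facts needed by some action but not added by an earlier one in the sequence."""
    added, missing = set(), set()
    for pre, add in sequence:
        missing |= pre - added
        added |= add
    return missing

# Hypothetical parallel action set: o adds the precondition p of o_prime.
o = (frozenset({"a"}), frozenset({"p"}))
o_prime = (frozenset({"p", "b"}), frozenset({"g"}))

good = unsatisfied_preconditions((o, o_prime))
bad = unsatisfied_preconditions((o_prime, o))

# Exhaustive search over all orderings; exponential in the number of actions.
best = min(permutations([o, o_prime]),
           key=lambda seq: len(unsatisfied_preconditions(seq)))
```

Executing o first leaves only a and b to be achieved one time step earlier, whereas the reverse order also leaves p open.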
Definition 6 Let OPTIMAL ACTION LINEARIZATION denote the following problem.
Given a set $O$ of relaxed STRIPS actions and a positive integer $K$. Is there a one-to-one function $f : O \mapsto \{1, 2, \ldots, |O|\}$ such that the number of unsatisfied preconditions when executing the sequence $\langle f^{-1}(1), \ldots, f^{-1}(|O|) \rangle$ is at most $K$?
Theorem 2 Deciding OPTIMAL ACTION LINEARIZATION is NP-complete.
Proof: Membership is obvious. Hardness is proven by transformation from DIRECTED OPTIMAL LINEAR ARRANGEMENT (Even & Shiloach, 1975). Given a directed graph $G = (V, A)$ and a positive integer $K$, the question is, does there exist a one-to-one function $f : V \mapsto \{1, 2, \ldots, |V|\}$ such that $f(u) < f(v)$ whenever $(u, v) \in A$ and such that
$$\sum_{(u,v) \in A} (f(v) - f(u)) \le K \; ?$$
To a given directed graph, we define a set of actions as follows. For each node $w$ in the graph, we define an action in our set $O$. For simplicity of presentation, we identify the actions with their corresponding nodes. To begin with, we set $\mathit{pre}(w) = \mathit{add}(w) = \emptyset$ for all $w \in V$. Then, for each edge $(u, v) \in A$, we create new logical facts $P^{(u,v)}_w$ and $R^{(u,v)}_w$ for $w \in V$.
Using these new logical facts, we now adjust all precondition and add lists to express the constraint that is given by the edge $(u, v)$. Say action $u$ is ordered before action $v$ in a linearization. We need to simulate the difference between the positions of $v$ and $u$. To do this, we define our actions in a way such that the bigger this difference is, the more unsatisfied preconditions there are when executing the linearization. First, we "punish" all actions that are ordered before $v$, by giving them an unsatisfied precondition.
$$\mathit{pre}(w) := \mathit{pre}(w) \cup \{P^{(u,v)}_w\} \text{ for } w \in V, \qquad \mathit{add}(v) := \mathit{add}(v) \cup \{P^{(u,v)}_w \mid w \in V\}$$
With this definition, the actions $w$ ordered before $v$ (and $v$ itself) will have the unsatisfied precondition $P^{(u,v)}_w$, while those ordered after will get this precondition added by $v$. Thus, the number of unsatisfied preconditions we get here is exactly $f(v)$.
Secondly, we "give a reward" to each action that is ordered before $u$. We simply do this by letting those actions add a precondition of $u$, which would otherwise go unsatisfied.
$$\mathit{add}(w) := \mathit{add}(w) \cup \{R^{(u,v)}_w\} \text{ for } w \in V, \qquad \mathit{pre}(u) := \mathit{pre}(u) \cup \{R^{(u,v)}_w \mid w \in V\}$$
That way, we will have exactly $|V| - (f(u) - 1)$ unsatisfied preconditions, namely the $R^{(u,v)}_w$ facts for all actions except those that are ordered before $u$.
Summing up the number of unsatisfied preconditions we get for a linearization $f$, we arrive at
$$\sum_{(u,v) \in A} \bigl( f(v) + |V| - (f(u) - 1) \bigr) = \sum_{(u,v) \in A} (f(v) - f(u)) + |A| \cdot (|V| + 1)$$
We thus define our new positive integer $K' := K + |A| \cdot (|V| + 1)$. Finally, we make sure that actions $u$ get ordered before actions $v$ for $(u, v) \in A$. We do this by inserting new logical "safety" facts $S^{(u,v)}_1, \ldots, S^{(u,v)}_{K'+1}$ into $v$'s precondition list and $u$'s add list.
$$\mathit{pre}(v) := \mathit{pre}(v) \cup \{S^{(u,v)}_1, \ldots, S^{(u,v)}_{K'+1}\}, \qquad \mathit{add}(u) := \mathit{add}(u) \cup \{S^{(u,v)}_1, \ldots, S^{(u,v)}_{K'+1}\}$$
Altogether, a linearization $f$ of our actions leads to at most $K'$ unsatisfied preconditions if and only if $f$ satisfies the requirements for a directed optimal linear arrangement. Obviously, the action set and $K'$ can be computed in polynomial time. $\Box$
Our sole purpose with linearizing an action set in a certain order is to achieve a smaller number of unsatisfied preconditions, which, in turn, might lead to a shorter relaxed solution (see footnote 2). Thus, we are certainly not willing to pay the price that finding an optimal linearization of the actions is likely to cost, according to Theorem 2. There are a few methods by which one can approximate such a linearization, like introducing an ordering constraint $o < o'$ for each action $o$ that adds a precondition of another action $o'$, and trying to linearize the actions such that many of these constraints are met. During our experiments, we found that parallel actions adding each other's preconditions occur so rarely in our testing tasks that even approximating is not worth the effort. We thus simply linearize all actions in the order they get selected, causing almost no computational overhead at all.
2. It should be noted here that using optimal action linearizations at each time step does not guarantee the resulting relaxed solution to be optimal, which would give us an admissible heuristic.

Efficient Implementation
We have implemented our own version of GRAPHPLAN, highly optimized for solving relaxed planning tasks. It exploits the fact that the planning graph of a relaxed task does not contain any exclusion relations (Proposition 1). Our implementation is also highly optimized for repeatedly solving planning tasks which all share the same set of actions, namely the tasks $P'_S = (O', S, G)$ as described at the beginning of this section.
Planning task specifications usually contain some operator schemata, and a set of constants. Instantiating the schemata with the constants yields the actions to the task. Our system instantiates all operator schemata in a way such that all, and only, reachable actions are built. Reachability of an action here means that, when successively applying operators to the initial state, all of the action's preconditions appear eventually. We then build what we call the connectivity graph. This graph consists of two layers, one containing all (reachable) actions, and the other all (reachable) facts. From each action, there are pointers to all preconditions, add effects and delete effects. All of FF's computations are efficiently implemented using this graph structure. For the subsequently described implementation of relaxed GRAPHPLAN, we only need the information about preconditions and add effects.
As a relaxed planning graph does not contain any exclusion relations, the only information one needs to represent it are what we call the layer memberships, i.e., for each fact or action, the number of the first layer at which it appears in the graph. Called on an intermediate task $P'_S = (O', S, G)$, our version of GRAPHPLAN computes these layer memberships by using the following fixpoint computation. The layer memberships of all facts and actions are initialized to $\infty$. For each action, there is also a counter, which is initialized to 0. Then, fact layer 0 is built implicitly by setting the layer membership of all facts $f \in S$ to 0. Each time a fact $f$ gets its layer membership set, all actions of which $f$ is a precondition get their counter incremented. As soon as the counter for an action $o$ reaches the total number of $o$'s preconditions, $o$ is put into a list of scheduled actions for the current layer. After a fact layer $i$ is finished, all actions scheduled for step $i$ have their layer membership set to $i$, and their adds, if not already present, are put into the list of scheduled facts for the next fact layer at time step $i + 1$. Having finished with action layer $i$, all scheduled facts at step $i + 1$ have their membership set, and so on. The process continues until all goals have a layer membership lower than $\infty$. It should be noticed here that this view of planning graph building corresponds closely to the computation of the weight values in HSP. Those can be computed by applying the actions in layers as above, updating weight values and propagating the changes each time an action comes in, and stopping when no changes occur in a layer. Having finished the relaxed version of planning graph building, a similarly trivial version of GRAPHPLAN's solution extraction mechanism is invoked. See Figure 2.
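The counter-based fixpoint computation can be sketched as follows. This is a minimal illustration, not FF's implementation: the dictionary-based representation, the `needed_by` index standing in for the connectivity graph, and the handling of actions without preconditions (scheduled for layer 0) are assumptions. A missing dictionary entry plays the role of an infinite layer membership.

```python
def layer_memberships(state, goals, actions):
    """actions: name -> (preconditions, add list).

    Returns (fact_layer, action_layer), or None if some goal is unreachable
    even in the relaxed task.
    """
    fact_layer, action_layer = {}, {}
    counter = {name: 0 for name in actions}
    needed_by = {}  # fact -> names of the actions having it as a precondition
    for name, (pre, _add) in actions.items():
        for p in pre:
            needed_by.setdefault(p, []).append(name)

    scheduled_facts = list(state)
    # Assumption: actions without preconditions are scheduled for layer 0.
    scheduled_actions = [n for n, (pre, _add) in actions.items() if not pre]
    i = 0
    while True:
        # Fact layer i: set memberships and increment precondition counters.
        for fact in scheduled_facts:
            fact_layer[fact] = i
            for name in needed_by.get(fact, []):
                counter[name] += 1
                if counter[name] == len(actions[name][0]):
                    scheduled_actions.append(name)
        if all(g in fact_layer for g in goals):
            return fact_layer, action_layer
        if not scheduled_actions:
            return None  # nothing new can come in
        # Action layer i: set memberships and schedule the new add effects.
        scheduled_facts = []
        for name in scheduled_actions:
            action_layer[name] = i
            for fact in actions[name][1]:
                if fact not in fact_layer and fact not in scheduled_facts:
                    scheduled_facts.append(fact)
        scheduled_actions = []
        i += 1

ACTIONS = {
    "opG1": (frozenset({"P"}), frozenset({"G1"})),
    "opG2": (frozenset({"P"}), frozenset({"G2"})),
    "opP": (frozenset(), frozenset({"P"})),
}
fact_layer, action_layer = layer_memberships(set(), {"G1", "G2"}, ACTIONS)
```

Each fact and each action is touched a bounded number of times, which is what makes repeated calls on many search states affordable.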
Instead of putting all goals into the top layer in GRAPHPLAN style, and then propagating them down by using NOOPs-first, each goal $g$ is simply put into a goal set $G_i$ located at $g$'s first layer $i$. Then, there is a for-next loop down from the top to the initial layer. At each layer $i$, an achieving action with layer membership $i - 1$ gets selected for each fact in the corresponding goal set. If there is more than one such achiever, a best one is picked according to the difficulty heuristic. The preconditions are put into their corresponding goal sets. Each time an action is selected, all of its adds are marked true at times $i$ and $i - 1$. The marker at time $i$ prevents achievers to be selected for facts that are already true

A Novel Variation of Hill-climbing
In this section, we introduce FF's base search algorithm. We discuss the algorithm's theoretical properties regarding completeness, and derive FF's overall search strategy.
In the first HSP version (Bonet & Geffner, 1998), HSP1 as was used in the AIPS-1998 competition, the search strategy is a variation of hill-climbing, always selecting one best successor to the state it is currently facing. Because state evaluations are costly, we also chose to use local search, in the hope to reach goal states with as few evaluations as possible. We settled for a different search algorithm, an "enforced" form of hill-climbing, which combines local and systematic search. The strategy is motivated by the simple structure that the search spaces of our testing benchmarks tend to have.

Enforced Hill-climbing
Doing planning by heuristic forward search, the search space is the space of all reachable states, together with their heuristic evaluation. Now, evaluating states in our testing benchmarks with the heuristic defined by Equation 3, one often finds that the resulting search spaces are simple in structure, specifically, that local minima and plateaus tend to be small. For any search state, the next state with strictly better heuristic evaluation is usually only a few steps away (an example for this is the Logistics domain described in Section 8.1.1). Our idea is to perform exhaustive search for the better states. The algorithm is shown in Figure 3.
Like hill-climbing, the algorithm depicted in Figure 3 starts out in the initial state. Then, facing a search state $S$, a breadth first search starting out from $S$ is invoked. This finds the closest better successor, i.e., the nearest state $S'$ with strictly better evaluation, or fails. In the latter case, the whole algorithm fails, in the former case, the path from $S$ to $S'$ is added to the current plan, and search is iterated. When a goal state (a state with evaluation 0) is reached, search stops.
Our implementation of breadth first search starting out from $S$ is standard, where states are kept in a queue. One search iteration removes the first state $S'$ from the queue, and evaluates it by running GRAPHPLAN. If the evaluation is better than that of $S$, search succeeds. Otherwise, the successors of $S'$ are put to the end of the queue. Repeated states are avoided by keeping a hash table of visited states in memory. If no new states can be reached anymore, breadth first search fails.
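The search loop described above can be sketched in a few lines. The sketch is independent of any particular heuristic: `successors` and `h` are caller-supplied functions (hypothetical names), with `h` assumed to return 0 exactly on goal states. The toy state space at the end is invented purely to show a plateau being crossed.

```python
from collections import deque

def breadth_first_improvement(start, bound, successors, h):
    """Find the nearest state with evaluation strictly below bound, or None."""
    queue = deque([(start, [])])
    visited = {start}  # hash table of visited states
    while queue:
        state, path = queue.popleft()
        value = h(state)
        if value < bound:
            return path, state, value
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # no new states can be reached anymore

def enforced_hill_climbing(initial, successors, h):
    """Return a plan (list of actions), or None if some iteration fails."""
    state, plan = initial, []
    best = h(state)
    while best > 0:
        found = breadth_first_improvement(state, best, successors, h)
        if found is None:
            return None
        path, state, best = found
        plan.extend(path)  # decisions are never taken back
    return plan

# Toy example: states 0..4 on a line, with a plateau at states 0, 1 and 2.
H = {0: 3, 1: 3, 2: 3, 3: 1, 4: 0}

def toy_successors(n):
    if n + 1 in H:
        yield ("inc", n + 1)
    if n - 1 in H:
        yield ("dec", n - 1)

toy_plan = enforced_hill_climbing(0, toy_successors, H.get)
```

In the toy run, the first breadth first search crosses the plateau in three steps before finding a strictly better state, and the second one reaches the goal in a single step.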

Completeness
If in one iteration breadth first search for a better state fails, then enforced hill-climbing stops without finding a solution. This can happen because once enforced hill-climbing has chosen to include an action in the plan, it never takes this decision back. The method is therefore only complete on tasks where no fatally wrong decisions can be made. These are the tasks that do not contain "dead ends."

Definition 7 (Dead End) Let (O, I, G) be a planning task. A state S is called a dead end iff it is reachable and no sequence of actions achieves the goal from it, i.e., iff $\exists P : S = Result(I, P)$ and $\neg \exists P' : G \subseteq Result(S, P')$.

Naturally, a task is called dead-end free if it does not contain any dead end states. We remark that being dead-end free implies solvability, as otherwise the initial state itself would already be a dead end.

Proposition 4 Let P = (O, I, G) be a planning task. If P is dead-end free, then enforced hill-climbing will find a solution.

Proof: Assume enforced hill-climbing does not reach the goal. Then we have some intermediate state S = Result(I, P), P being the current plan, where breadth first search cannot improve on the situation. Now, h(S) > 0 as search has not stopped yet. If there were a path from S to some goal state S', then complete breadth first search would find that path, obtain h(S') = 0 < h(S), and terminate positively. Such a path can therefore not exist, showing that S is a dead end state, in contradiction to the assumption. □

We remark that Proposition 4 holds only when h is a function from states to natural numbers including 0, where h(S) = 0 iff $G \subseteq S$. The proposition identifies a class of planning tasks where we can safely apply enforced hill-climbing. Unfortunately, it is PSPACE-hard to decide whether a given planning task belongs to that class.
Definition 8 Let DEADEND-FREE denote the following problem: Given a planning task P = (O, I, G), is P dead-end free?

Theorem 3 Deciding DEADEND-FREE is PSPACE-complete.

Proof: Hardness is proven by polynomially reducing PLANSAT (Bylander, 1994), the decision problem of whether P is solvable, to the problem of deciding DEADEND-FREE. We simply add an operator $o_I$ to O that is executable in all states, and re-establishes the initial state. Applying $o_I$ to any state reachable in P leads back to the initial state: all facts that can ever become true are removed, and those in the initial state are added. Now, the modified problem $P_1 = (O_1, I, G)$ is dead-end free iff P is solvable. From left to right, if $P_1$ is dead-end free, then it is solvable, which implies that P is solvable, as we have not added any new possibility of reaching the goal. From right to left, if P is solvable, then so is $P_1$, by the same solution plan. One can then, from all states in $P_1$, achieve the goal by going back to the initial state with the new operator, and executing that plan thereafter.

Membership in PSPACE follows from the fact that PLANSAT and its complement are both in PSPACE. A non-deterministic algorithm that decides the complement of DEADEND-FREE and that needs only polynomial space can be specified as follows. Guess a state S. Verify in polynomial space that S is reachable from the initial state. Further, verify that the goal cannot be reached from S. If this algorithm succeeds, it follows that the instance is not dead-end free, since S constitutes a dead end. This implies that the complement of DEADEND-FREE is in NPSPACE, and hence in PSPACE; as PSPACE is closed under complement, so is DEADEND-FREE.

□
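The construction used in the hardness proof is easy to write down. The following Python sketch is illustrative only: the `Action` tuple, the function names, and the over-approximation of "all facts that can ever become true" by the initial facts plus all add lists are assumptions made for this example. The state transition uses (S ∪ add) \ del; with disjoint add and delete lists, as here, the order of the two set operations does not matter.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def result(state: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Apply one STRIPS action; None if its precondition does not hold."""
    if not action.pre <= state:
        return None
    return (state | action.add) - action.delete


def with_reset_operator(operators: Iterable[Action],
                        initial: FrozenSet[str]) -> List[Action]:
    """Return the operator set extended by a reset operator o_I."""
    operators = list(operators)
    # Over-approximate the facts that can ever become true.
    all_facts = set(initial)
    for o in operators:
        all_facts |= o.add
    reset = Action(
        name="o_I",
        pre=frozenset(),                                  # executable everywhere
        add=frozenset(initial),                           # re-establish I
        delete=frozenset(all_facts) - frozenset(initial), # remove everything else
    )
    return operators + [reset]
```

Applying `o_I` in any reachable state yields exactly the initial state again, which is what makes the extended task dead-end free whenever the original task is solvable.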
Though we cannot efficiently decide whether a given task is dead-end free, there are easily testable sufficient criteria in the literature. Johnsson et al. (2000) define a notion of symmetric planning tasks, which is sufficient for dead-end freeness, but co-NP-complete to decide. They also give a polynomial sufficient criterion for symmetry. This is, however, very trivial. Hardly any of the current benchmarks fulfills it. Koehler and Hoffmann (2000a) have defined notions of invertible planning tasks (sufficient for dead-end freeness) and inverse actions (sufficient for invertibility, under certain restrictions). The existence of inverse actions, and sufficient criteria for the additional restrictions, can be decided in polynomial time. Many benchmark tasks do, in fact, fulfill those criteria and can thus efficiently be proven dead-end free.
One could adopt Koehler and Hoffmann's methodology, and use the existence of inverse actions to recognize dead-end free tasks. If the test fails, one could then employ a different search strategy than enforced hill-climbing. We have two reasons for not going this way:
1. Even amongst our benchmarks, there are tasks that do not contain inverse actions, but are nevertheless dead-end free. An example is the Tireworld domain, where enforced hill-climbing leads to excellent results.
2. Enforced hill-climbing can often quite successfully solve tasks that do contain dead ends, as it does not necessarily get caught in one. Examples of that are contained in the Mystery and Mprime domains, which we will look at in Section 8.2.1.
The observation that forms the basis for our way of dealing with completeness is the following. If enforced hill-climbing cannot solve a planning task, it usually fails very quickly. One can then simply switch to a different search algorithm. We have experimented with randomizing enforced hill-climbing, and doing a restart when one attempt failed. This did not lead to convincing results. Though we tried a large variety of randomization strategies, we did not find a planning task in our testing domains where one randomized restart did significantly better than the previous one, i.e., all attempts suffered from the same problems. The tasks that enforced hill-climbing does not solve right away are apparently so full of dead ends that one cannot avoid those dead ends at random. We have therefore arranged our overall search strategy in FF as follows:
1. Do enforced hill-climbing until the goal is reached or the algorithm fails.
2. If enforced hill-climbing failed, skip everything done so far and try to solve the task by a complete heuristic search algorithm. In the current implementation, this is what Russell and Norvig (1995) term greedy best-first search. This strategy simply expands all search nodes by increasing order of goal distance estimation.
To summarize, FF uses enforced hill-climbing as the base search method, and a complete best-first algorithm to deal with those special cases where enforced hill-climbing has run into a dead end and failed.
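The two-phase arrangement can be sketched as follows. The code is a hedged illustration, not FF's implementation: `local_search` is a hypothetical parameter standing for enforced hill-climbing (any function returning a plan or `None`), `successors` yields (action name, state) pairs, and the tie-breaking counter is an implementation detail of this sketch.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Successors = Callable[[State], Iterable[Tuple[str, State]]]


def greedy_best_first(initial: State,
                      h: Callable[[State], int],
                      successors: Successors) -> Optional[List[str]]:
    """Expand nodes in increasing order of goal distance estimate."""
    tie = count()                        # keeps heap entries comparable
    frontier = [(h(initial), next(tie), initial, [])]
    visited = {initial}
    while frontier:
        value, _, state, path = heapq.heappop(frontier)
        if value == 0:                   # evaluation 0 marks a goal state
            return path
        for name, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (h(nxt), next(tie), nxt, path + [name]))
    return None                          # search space exhausted


def solve(initial: State,
          h: Callable[[State], int],
          successors: Successors,
          local_search: Callable[..., Optional[List[str]]]) -> Optional[List[str]]:
    """Step 1: local search. Step 2: on failure, restart with complete search."""
    plan = local_search(initial, h, successors)
    if plan is not None:
        return plan
    # Skip everything done so far and start over from the initial state.
    return greedy_best_first(initial, h, successors)
```

Because the fallback restarts from the initial state and keeps a closed list, it is complete on finite state spaces regardless of what the local search did before failing.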

Pruning Techniques
In this section, we introduce two heuristic techniques that can, in principle, be used to prune the search space in any forward state space search algorithm:
1. Helpful actions selects a set of promising successors to a search state. As we will demonstrate in Section 8.3, the heuristic is crucial for FF's performance on many domains.
2. Added goal deletion cuts out branches where some goal has apparently been achieved too early. Testing the heuristic, we found that it can yield savings on tasks that contain goal orderings, and has no effect on tasks that don't.
Both techniques are obtained as a side effect of using GRAPHPLAN as a heuristic estimator in the manner described in Section 4. Also, neither of them preserves completeness of any hypothetical forward search. In the context of our search algorithm, we integrate them such that they prune the search space in the single enforced hill-climbing try (which is not complete in general anyway) and completely turn them off during best-first search, if enforced hill-climbing failed.

Helpful Actions
To a state S, we define a set H(S) of actions that seem to be most promising among the actions applicable in S. The technique is derived by having a closer look at the relaxed plans that GRAPHPLAN extracts on search states in our testing tasks. Consider the Gripper domain, as it was used in the 1998 AIPS planning competition. There are two rooms, A and B, and a certain number of balls, which are all in room A initially and shall be moved into room B. The planner controls a robot, which changes rooms via the move operator, and which has two grippers to pick or drop balls. Each gripper can hold only one ball at a time. We look at a small task where 2 balls must be moved into room B. Say the robot has already picked up both balls, i.e., in the current search state, the robot is in room A, and each gripper holds one ball. There are three applicable actions in this state: move to room B, or drop one of the balls back into room A. The relaxed solution that our heuristic extracts is the following.
This is a parallel relaxed plan consisting of two time steps. The action set selected at the first time step contains the only action that makes sense in the state at hand, move to room B. We therefore pursue the idea of restricting the action choice in any planning state to only those actions that are selected in the first time step of the relaxed plan. We call these the actions that seem to be helpful. In the above example state, this strategy cuts down the branching factor from three to one.

Sometimes, restricting oneself to only the actions that are selected by the relaxed planner can be too much. Consider the following Blocksworld example. Say we use the well known representation with four operators, stack, unstack, pickup and putdown. The planner controls a single robot arm, and the operators can be used to stack one block on top of another one, unstack a block from another one, pickup a block from the table, or put a block that the arm is holding down onto the table. Initially, the arm is holding block C, and blocks A and B are on the table. The goal is to stack A onto B. Started on this state, relaxed GRAPHPLAN will return one out of the following three time step optimal solutions.
All of these are valid relaxed solutions, as in the relaxation it does not matter that stacking C onto A or B deletes facts that we still need. If C is on A, we cannot pickup A anymore, and if C is on B, we cannot stack A onto B anymore.
The first action in each relaxed plan is only inserted to get rid of C, i.e., free the robot arm, and from the point of view of the relaxed planner, all of the three starting actions do the job. Thus the relaxed solution extracted might be any of the three above. If it happens to be the second or third one, then we lose the path to an optimal solution by restricting ourselves to the corresponding actions, stack C A or stack C B. Therefore, we define the set H(S) of helpful actions to a state S as follows.

$H(S) := \{ o \mid pre(o) \subseteq S,\ add(o) \cap G_1(S) \neq \emptyset \}$
Here, $G_1(S)$ denotes the set of goals that is constructed by relaxed GRAPHPLAN at time step 1 (one level ahead of the initial layer) when started on the task $(O', S, G)$. In words, we consider as helpful actions all those applicable ones which add at least one goal at the first time step. In the above Blocksworld example, freeing the robot arm is among these goals, which causes all the three starting actions to be helpful in the initial state, i.e., to be elements of H(I). In the above Gripper example, the modification does not change anything.
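As a small illustration of this selection rule, the following Python sketch filters the applicable actions by whether they add a goal of the first time step. It assumes that the goal set at time step 1 has already been computed by the relaxed planner; here it is simply hard-coded. The fact names (`arm-empty`, `holding(C)`, and so on) and the operator encodings follow the usual four-operator Blocksworld formulation and are assumptions made for this example.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def helpful_actions(state: FrozenSet[str],
                    actions: Iterable[Action],
                    goals_at_step_1: FrozenSet[str]) -> List[Action]:
    """Applicable actions that add at least one goal of the first time step."""
    return [a for a in actions
            if a.pre <= state and a.add & goals_at_step_1]


def fs(*facts: str) -> FrozenSet[str]:
    return frozenset(facts)


# The arm holds C; A and B are on the table.
STATE = fs("holding(C)", "ontable(A)", "ontable(B)", "clear(A)", "clear(B)")
ACTIONS = [
    Action("putdown C", fs("holding(C)"),
           fs("ontable(C)", "clear(C)", "arm-empty"), fs("holding(C)")),
    Action("stack C A", fs("holding(C)", "clear(A)"),
           fs("on(C,A)", "clear(C)", "arm-empty"), fs("holding(C)", "clear(A)")),
    Action("stack C B", fs("holding(C)", "clear(B)"),
           fs("on(C,B)", "clear(C)", "arm-empty"), fs("holding(C)", "clear(B)")),
    Action("pickup A", fs("arm-empty", "clear(A)", "ontable(A)"),
           fs("holding(A)"), fs("arm-empty", "clear(A)", "ontable(A)")),
]
# Assumed goal set at time step 1: the arm must be freed before A is picked up.
G1 = fs("arm-empty")

HELPFUL = [a.name for a in helpful_actions(STATE, ACTIONS, G1)]
```

With these assumptions all three actions that free the arm are selected, while `pickup A` is excluded because it is not applicable in the state.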
The notion of helpful actions shares some similarities with what Drew McDermott calls the favored actions (McDermott, 1996, 1999), in the context of computing greedy regression graphs for heuristic estimation. In a nutshell, greedy regression graphs backchain from the goals until facts are reached that are contained in the current state. Amongst other things, the graphs provide an estimation of which actions might be useful in getting closer to the goal: those applicable ones which are members of the effective subgraph, which is the minimal cost subgraph achieving the goals.

There is also a similarity between the helpful actions heuristic and what is known as relevance from the literature. Consider a Blocksworld task where hundreds of blocks are on the table initially, but the goal is only to stack one block A on top of another block B. The set H(I) will in this case contain only the single action pickup A, throwing away all those applicable actions moving around blocks that are not mentioned in the goal, i.e., throwing away all those actions that are irrelevant. The main difference between the helpful actions heuristic and the concept of relevance is that relevance in the usual sense refers to what is useful for solving the whole task. Being helpful, on the other hand, refers to something that is useful in the next step. This has the disadvantage that the helpful things need to be recomputed for each search state, but the advantage that possibly far fewer things are helpful than are relevant. In our specific setting, we get the helpful actions for free anyway, as a side effect of running relaxed GRAPHPLAN.
We conclude this subsection with an example showing that helpful actions pruning does not preserve completeness, and a few remarks on the current integration of the technique into our search algorithm.

Completeness
In the following short example, the helpful actions heuristic prunes out all solutions from the state space. Say the initial state is $\{B\}$, the goals are $\{A, B\}$, and there are the following actions:

name = (pre, add, del)
$opA_1 = (\emptyset, \{A\}, \{B\})$
$opA_2 = (\{P_A\}, \{A\}, \emptyset)$
$opP_A = (\emptyset, \{P_A\}, \emptyset)$
$opB_1 = (\emptyset, \{B\}, \{A\})$
$opB_2 = (\{P_B\}, \{B\}, \emptyset)$
$opP_B = (\emptyset, \{P_B\}, \emptyset)$

In this planning task, there are two ways of achieving the missing goal A. One of these, $opA_1$, deletes the other goal B. The other one, $opA_2$, needs the precondition $P_A$ to be achieved first by $opP_A$, and thus involves using two planning actions instead of one in the first case. Relaxed GRAPHPLAN recognizes only the first alternative, as it is the only time step optimal one. The set of goals at the single time step created by graph construction is

$G_1(I) = \{A, B\}$

This gives us two helpful actions, namely

$H(I) = \{opA_1, opB_1\}$

One of these, $opB_1$, does not cause any state transition in the initial state. The other one, $opA_1$, leads to the state where only A is true. To this state, we obtain the same set of helpful actions, containing, again, $opA_1$ and $opB_1$. This time, the first action causes no state transition, while the second one leads us back to the initial state. Helpful actions thus cuts out the solutions from the state space of this example task. We remark that the task is dead-end free (one can always reach A and B by applying $opP_A$, $opA_2$, $opP_B$, and $opB_2$) and that one can easily make the task invertible without changing the behavior.
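The effect of the pruning in this example can be checked mechanically. The sketch below enumerates the states reachable with the full action set and with only the two helpful actions. It does not compute helpful actions itself: the restriction to $opA_1$ and $opB_1$ in both visited states is taken from the text above, and the `Action` tuple and function names are assumptions of this illustration.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def fs(*facts: str) -> FrozenSet[str]:
    return frozenset(facts)


def reachable(initial: FrozenSet[str],
              actions: Iterable[Action]) -> Set[FrozenSet[str]]:
    """All states reachable from `initial` using only `actions`."""
    actions = list(actions)
    seen = {initial}
    stack = [initial]
    while stack:
        state = stack.pop()
        for a in actions:
            if a.pre <= state:
                nxt = (state | a.add) - a.delete
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


opA1 = Action("opA1", fs(), fs("A"), fs("B"))
opA2 = Action("opA2", fs("PA"), fs("A"), fs())
opPA = Action("opPA", fs(), fs("PA"), fs())
opB1 = Action("opB1", fs(), fs("B"), fs("A"))
opB2 = Action("opB2", fs("PB"), fs("B"), fs())
opPB = Action("opPB", fs(), fs("PB"), fs())

INITIAL = fs("B")
GOALS = fs("A", "B")

ALL_STATES = reachable(INITIAL, [opA1, opA2, opPA, opB1, opB2, opPB])
# Helpful set as stated in the text for both states that are visited.
PRUNED_STATES = reachable(INITIAL, [opA1, opB1])

SOLVABLE = any(GOALS <= s for s in ALL_STATES)
SOLVABLE_WITH_PRUNING = any(GOALS <= s for s in PRUNED_STATES)
```

Under the pruning only the two states {A} and {B} are ever visited, so no goal state is found, although the unpruned task is solvable.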
In STRIPS domains, one could theoretically overcome the incompleteness of helpful actions pruning by considering not only the first relaxed plan that GRAPHPLAN finds, but computing a kind of union over all relaxed plans that GRAPHPLAN could possibly find, when allowing non time step optimal plans. More precisely, in a search state S, consider the relaxed task $(O', S, G)$. Extend the relaxed planning graph until fact level $|O'|$ is reached. Set a goal set $G_{|O'|}$ at the top fact level to $G_{|O'|} := G$. Then, proceed from fact level $|O'| - 1$ down to fact level 1, where, at each level i, a set $G_i$ of goals is generated as the union of $G_{i+1}$ with the preconditions of all actions in level i that add at least one fact in $G_{i+1}$. Upon termination, define as helpful all actions that add at least one fact in $G_1$. It can be proven that, this way, the starting actions of all optimal solutions from S are considered helpful. However, in all our STRIPS testing domains, the complete method always selects all applicable actions as helpful.

Integration into Search
As has already been noted at the very beginning of this section, we integrate helpful actions pruning into our search algorithm by only applying it during the single enforced hill-climbing try, leaving the complete best-first search algorithm unchanged (see Section 5). Facing a state S during breadth first search for a better state in enforced hill-climbing, we look only at those successors generated by H(S). This renders our implementation of enforced hill-climbing incomplete even on invertible planning tasks. However, in all our testing domains, the tasks that cannot be solved by enforced hill-climbing using helpful actions pruning are exactly those that cannot be solved by enforced hill-climbing anyway.

Added Goal Deletion
The second pruning technique that we introduce in this section is motivated by the observation that in some planning domains there are goal ordering constraints, as has been recognized by quite a number of researchers in the past (Irani & Cheng, 1987;Drummond & Currie, 1989;Joslin & Roach, 1990). In our experiments on tasks with goal ordering constraints, FF's base architecture sometimes wasted a lot of time achieving goals that needed to be cared for later on. We therefore developed a heuristic to inform search about goal orderings.
The classical example of a planning domain with goal ordering constraints is the well known Blocksworld. Say we have three blocks A, B and C on the table initially, and want to stack them such that we have B on top of C, and A on top of B. Obviously, there is not much point in stacking A on B first. Now, imagine a forward searching planner confronted with a search state S, where some goal G has just been achieved, i.e., S resulted from some other state by applying an action o with $G \in add(o)$. What one can ask in a situation like this is, was it a good idea to achieve G right now? Or should some other goal be achieved first? Our answer is inspired by recent work of Koehler and Hoffmann (2000a), which argues that achieving G should be postponed if the remaining goals cannot be achieved without destroying G again. Of course, finding out about this involves solving the remaining planning task. However, we can arrive at a very simple but, in our testing domains, surprisingly accurate approximation by using the relaxed plan that GRAPHPLAN generates for the state S. The method we are using is as simple as this: If the relaxed solution plan, P, that GRAPHPLAN generates for S, contains an action o, $o \in P$, that deletes G ($G \in del(o)$ in o's non-relaxed version), then we remove S from the search space, i.e., do not generate any successors to S. We call this method the added goal deletion heuristic.
Let us exemplify the heuristic with the above Blocksworld example. Say the planner has just achieved on(A,B), but with on(B,C) still being false, i.e., we are in the situation where A is on top of B, and B and C are standing on the table. The relaxed solution that GRAPHPLAN finds to this situation is the following.

< { unstack A B }, { pickup B }, { stack B C } >

The goal on(A,B), which has just been achieved, gets deleted by the first action unstack A B. Consequently, we realize that stacking A onto B right now was probably a bad idea, and prune this possibility from the search space, which results in a solution plan that stacks B onto C first.
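The test itself is a single pass over the relaxed plan. The following Python sketch is an illustration under assumptions: the relaxed plan is given as a list of time steps (lists of actions), the `Action` tuple carries the non-relaxed delete lists, and the Blocksworld encodings are the usual four-operator ones rather than anything taken from FF's code.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]   # delete list of the non-relaxed action


def fs(*facts: str) -> FrozenSet[str]:
    return frozenset(facts)


def cut_off(just_achieved: str,
            relaxed_plan: Sequence[Sequence[Action]]) -> bool:
    """True if some action in the relaxed plan deletes the goal just achieved."""
    return any(just_achieved in action.delete
               for step in relaxed_plan
               for action in step)


unstack_a_b = Action("unstack A B",
                     fs("on(A,B)", "clear(A)", "arm-empty"),
                     fs("holding(A)", "clear(B)"),
                     fs("on(A,B)", "clear(A)", "arm-empty"))
pickup_b = Action("pickup B",
                  fs("ontable(B)", "clear(B)", "arm-empty"),
                  fs("holding(B)"),
                  fs("ontable(B)", "clear(B)", "arm-empty"))
stack_b_c = Action("stack B C",
                   fs("holding(B)", "clear(C)"),
                   fs("on(B,C)", "clear(B)", "arm-empty"),
                   fs("holding(B)", "clear(C)"))

# Relaxed plan for the state where A is on B, and B and C are on the table.
RELAXED_PLAN: List[List[Action]] = [[unstack_a_b], [pickup_b], [stack_b_c]]

PRUNE_STATE = cut_off("on(A,B)", RELAXED_PLAN)
```

Here `PRUNE_STATE` is true because `unstack A B` deletes on(A,B), so the state reached by stacking A onto B would receive no successors.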
Like in the preceding subsection, we conclude with an example showing that pruning search states in the manner described above does not preserve completeness, and with a few remarks on our current search algorithm implementation.

Completeness
In the following small example, one of the goals must be destroyed temporarily in order to achieve the other goal. This renders the planning task unsolvable when one is using the added goal deletion heuristic. Say the initial state is empty, the goals are $\{A, B\}$, and there are the following actions:

name = (pre, add, del)
$opA = (\emptyset, \{A\}, \emptyset)$
$opB = (\{A\}, \{B\}, \{A\})$

All solutions to this task need to apply opA, use opB thereafter, and re-establish A. The crucial point here is that A must be temporarily destroyed. The added goal deletion heuristic is not adequate for such planning tasks. The example is dead-end free, and one can easily make the scenario invertible without changing the behavior of the heuristic.
Unlike for helpful actions, completeness can not be regained by somehow enumerating all relaxed plans to a situation. In the above example, when A has been achieved but B is still false, then all relaxed plans contain opB, deleting A.

Integration into Search
We use the added goal deletion heuristic in a way similar to the integration of the helpful actions heuristic. As indicated at the very beginning of the section, it is integrated into the single enforced hill-climbing try that search does, and completely turned off during best-first search, in case enforced hill-climbing did not make it to the goal.

We also use another goal ordering technique, taken from the literature. One of the most common approaches to dealing with goal orderings is trying to recognize them in a preprocessing phase, and then use them to prune fractions of the search space during planning (Irani & Cheng, 1987; Cheng & Irani, 1989; Joslin & Roach, 1990). This is also the basic principle underlying the so-called "goal agenda" approach (Koehler, 1998). For our system, we have implemented a slightly simplified version of the goal agenda algorithm, and use it to further enhance performance. A very short summary of what happens is this.
In a preprocessing phase, the planner looks at all pairs of goals and decides heuristically whether there is an ordering constraint between them. Afterwards, the goal set is split into a totally ordered series of subsets respecting these orderings. These are then fed to enforced hill-climbing in an incremental manner. Precisely, if $G_1, \ldots, G_n$ is the ordered series of subsets, enforced hill-climbing first gets started on the original initial state and $G_1$. If that works out, search ends in some state S satisfying the goals in $G_1$. Enforced hill-climbing is then called again on the new starting state S and the larger goal set $G_1 \cup G_2$. From a state satisfying this, search gets started for the goals $G_1 \cup G_2 \cup G_3$, and so on. The incremental, or agenda-driven, planning process can be applied to any planner, in principle, and preserves completeness only on dead-end free tasks (Koehler & Hoffmann, 2000a), i.e., again, we have an enhancement that loses completeness in general. Thus, we use the goal agenda only in enforced hill-climbing, leaving the complete best-first search phase unchanged.
The goal agenda technique yields runtime savings in domains where there are ordering constraints between the goals. In our testing suite, these are the Blocksworld and the Tireworld. In planning tasks without ordering constraints, the series of subsets collapses into a single entry, such that the agenda mechanism does not change anything there. The runtime taken for the pre-process itself was negligible in all our experiments.
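The incremental, agenda-driven process can be sketched as a small driver loop. This is a hedged illustration: `search` is a hypothetical callback standing for enforced hill-climbing that takes a start state and a goal set and returns a (path, final state) pair or `None`; how the ordered subsets are derived in the preprocessing phase is not modeled here.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

State = FrozenSet[str]
Search = Callable[[State, FrozenSet[str]], Optional[Tuple[List[str], State]]]


def agenda_driven_search(initial: State,
                         goal_subsets: Sequence[FrozenSet[str]],
                         search: Search) -> Optional[List[str]]:
    """Call `search` on G_1, then G_1 ∪ G_2, and so on, chaining the states."""
    state = initial
    goals: FrozenSet[str] = frozenset()
    plan: List[str] = []
    for subset in goal_subsets:
        goals = goals | subset            # earlier goals stay in the goal set
        outcome = search(state, goals)
        if outcome is None:
            return None                   # one failing entry fails the agenda
        path, state = outcome             # continue from the state reached
        plan.extend(path)
    return plan
```

With a single agenda entry the loop degenerates to one ordinary call of `search`, which matches the remark that the mechanism changes nothing on tasks without ordering constraints.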

Extension to ADL
So far, we have restricted ourselves to planning tasks specified in the simple STRIPS language. We will now show how our approach can be extended to deal with ADL (Pednault, 1989) tasks, more precisely, with the ADL subset of PDDL (McDermott et al., 1998) that was used in the 2nd international planning systems competition (Bacchus, 2000). This involves dealing with arbitrary function-symbol free first order logic formulae, and with conditional effects. Our extension work is divided into the following four subareas:
1. Apply a preprocessing approach to the ADL domain and task description, compiling the specified task down into a propositional normal form.
2. Extend the heuristic evaluation of planning states to deal with these normal form constructs.
3. Adjust the pruning techniques.
4. Adjust the search mechanisms.

Preprocessing an ADL Planning Task
FF's preprocessing phase is almost identical to the methodology that has been developed for the IPP planning system. For details, we refer the reader to the work done there (Koehler & Hoffmann, 2000b), and give only the basic principles here.

The planner starts with a planning task specification given in the subset of PDDL defined for the AIPS-2000 planning competition (Bacchus, 2000). The input is a set of operator schemata, the initial state, and a goal formula. The initial state is simply a set of ground atoms, and the goal formula is an arbitrary first order logical formula using the relational symbols defined for the planning task. Any operator schema o is defined by a list of parameters, a precondition, and a list of effects. Instantiating the parameters yields, just as STRIPS tasks are usually specified, the actions to the schema. The precondition is an arbitrary (first order) formula. For an action to be applicable in a given state S, its instantiation of this formula must be satisfied in S.

In FF's heuristic method, each single state evaluation can involve thousands of operator applications: building the relaxed planning graph, one needs to determine all applicable actions at each single fact layer. We therefore invest the effort to compile the operator descriptions down into a much simpler propositional normal form, to keep heuristic evaluation efficient. In this normal form, the precondition of an action is a set of ground atoms. Likewise, the effect conditions $pre_i(o)$ of the single effects are restricted to be ground atoms. We also represent the goal state as a set of atoms. Thus, we compile away everything except the conditional effects. Compiling away the logical formulae involves transforming them into DNF, which causes an exponential blow up in general. In our testing domains, however, we found that this transformation can be done in reasonable time.
Concerning the conditional effects, those cannot be compiled away without another exponential blow up, given that we want to preserve solution length. This was proven by Nebel (2000). As we will see, conditional effects can efficiently be integrated into our algorithmic framework, so there is no need for compiling them away. The compilation process proceeds as follows:
1. Determine predicates that are static, in the sense that no operator has an effect on them. Such predicates are a common phenomenon in benchmark tasks. An example are the (in-city ?l ?c) facts in Logistics tasks: Any location ?l stays, of course, located within the same city ?c throughout the whole planning process. We recognize static predicates by a simple sweep over all operator schemata.
2. Transform all formulae into quantifier-free DNF. This is subdivided into three steps:
(a) Pre-normalize all logical formulae. Following Gazen and Knoblock (1997), this process expands all quantifiers, and translates negations. We end up with formulae that are made up out of conjunctions, disjunctions, and atoms containing variables.
(b) Instantiate all parameters. This is simply done by instantiating all operator and effect parameters with all type consistent constants one after the other. The process makes use of knowledge about static predicates, in the sense that the instantiated formulae can often be simplified (Koehler & Hoffmann, 2000b). For example, if an instantiated static predicate $(p\ \vec{a})$ occurs in a formula, and that instantiation is not contained in the initial state, then $(p\ \vec{a})$ can be replaced with false.
(c) Transform formulae into DNF. This is postponed until after instantiation, because it can be costly, so it should be applied to formulae that are as small as possible. In a fully instantiated formula, it is likely that many static or one-way predicate occurrences can be replaced by true or false, resulting in a much simpler formula structure.
3. Finally, if the DNF of any formula contains more than one disjunct, then the corresponding effect, operator, or goal condition gets split up in the manner proposed by Gazen and Knoblock (1997).
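The three ingredients of this compilation (the static sweep, the simplification of instantiated static atoms, and the DNF transformation) can be illustrated on a toy representation. The encoding below is an assumption of this sketch, not FF's data structure: ground atoms are tuples such as `("in-city", "l1", "c1")`, formulae are nested tuples headed by `"and"` or `"or"`, negations are assumed to have been translated away already, and a DNF is a list of conjunctions, each a frozenset of atoms.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple, Union

Atom = Tuple[str, ...]
Formula = Union[bool, tuple]
CONNECTIVES = ("and", "or")


def static_predicates(predicates: Iterable[str],
                      effect_predicates: Dict[str, Iterable[str]]) -> Set[str]:
    """One sweep over all operator schemata: keep predicates no effect touches."""
    touched: Set[str] = set()
    for preds in effect_predicates.values():
        touched |= set(preds)
    return set(predicates) - touched


def simplify_static(formula: Formula, static: Set[str],
                    initial_state: Set[Atom]) -> Formula:
    """Replace ground static atoms by True or False using the initial state."""
    if isinstance(formula, bool):
        return formula
    if formula[0] in CONNECTIVES:
        return (formula[0],) + tuple(
            simplify_static(f, static, initial_state) for f in formula[1:])
    if formula[0] in static:
        return formula in initial_state
    return formula


def to_dnf(formula: Formula) -> List[FrozenSet[Atom]]:
    """Return the disjuncts of the formula; an empty list means 'false'."""
    if formula is True:
        return [frozenset()]              # one empty conjunction
    if formula is False:
        return []                         # no disjunct at all
    if formula[0] not in CONNECTIVES:
        return [frozenset([formula])]     # a single atom
    parts = [to_dnf(f) for f in formula[1:]]
    if formula[0] == "or":
        return [conj for part in parts for conj in part]
    result: List[FrozenSet[Atom]] = [frozenset()]
    for part in parts:                    # distribute "and" over the disjuncts
        result = [c1 | c2 for c1 in result for c2 in part]
    return result
```

Each disjunct returned by `to_dnf` would then give rise to one split copy of the corresponding effect, operator, or goal condition, as in step 3.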

Relaxed GRAPHPLAN with Conditional Effects
We now show how our specialized GRAPHPLAN implementation, as was described in Section 4.3, is changed to deal with ADL constructs. Building on our normalized task representation, it suffices to take care of conditional effects.

Relaxed Planning Graphs with Conditional Effects
Our encoding of planning graph building for relaxed tasks almost immediately carries over to ADL actions in the above propositional normal form. One simply needs to keep an additional layer membership value for all effects of an action. The layer membership of an effect indicates the first layer where all its effect conditions plus the corresponding action's preconditions are present. To compute these membership integers in an efficient manner, we

Relaxed Plan Extraction with Conditional Effects
The relaxed plan extraction mechanism for ADL differs from its STRIPS counterpart in merely two little details. Instead of selecting achieving actions, the extraction mechanism selects achieving effects. Once an effect i of action o is selected, all of its effect conditions plus o's preconditions need to be put into their corresponding goal sets. Afterwards, not only the effect's own add effects $add_i(o)$ are marked true at the time being, but also the added facts of all effects that are implied, i.e., those effects j of o with $pre_j(o) \subseteq pre_i(o)$ (in particular, this will be the unconditional effects of o, which have an empty effect condition).

ADL Pruning Techniques
Both pruning techniques from Section 6 easily carry over to actions with conditional effects.

Helpful Actions
For STRIPS, we defined as helpful all applicable actions achieving at least one goal at time step 1, cf. Section 6.1. For our ADL normal form, we simply change this to all applicable actions having an appearing effect that achieves a goal at time step 1, where an effect appears iff its effect condition is satisfied in the current state.

Added Goal Deletion
Originally, we cut off a state S if one of the actions selected for the relaxed plan to S deleted a goal A that had just been achieved, cf. Section 6.2. We now simply take as criterion the effects that are selected for the relaxed plan, i.e., a state is cut off if one of the effects selected for its relaxed solution deletes a goal A that has just been achieved.
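Both adapted criteria are short enough to state as code. The sketch below is an illustration under assumptions: the `Effect` and `Action` tuples model the propositional normal form with conditional effects, the goal set of time step 1 and the effects selected for the relaxed plan are taken as given inputs, and all names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Tuple


class Effect(NamedTuple):
    cond: FrozenSet[str]     # effect condition
    add: FrozenSet[str]
    delete: FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    effects: Tuple[Effect, ...]


def adl_helpful_actions(state: FrozenSet[str],
                        actions: Iterable[Action],
                        goals_at_step_1: FrozenSet[str]) -> List[Action]:
    """Applicable actions with an appearing effect that adds a step-1 goal."""
    helpful = []
    for a in actions:
        if not a.pre <= state:
            continue
        # An effect appears iff its condition is satisfied in the current state.
        if any(e.cond <= state and e.add & goals_at_step_1 for e in a.effects):
            helpful.append(a)
    return helpful


def adl_cut_off(just_achieved: str, selected_effects: Iterable[Effect]) -> bool:
    """True if an effect selected for the relaxed plan deletes the new goal."""
    return any(just_achieved in e.delete for e in selected_effects)
```

An action whose only goal-achieving effect has an unsatisfied condition is therefore not helpful, even though the action itself is applicable.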

ADL State Transitions
Finally, for enabling the search algorithms to handle our propositional ADL normal form, it is sufficient to redefine the state transition function. Forward search, no matter if it does hill-climbing, best-first search, or whatsoever, always faces a completely specified search state (see footnote 3). It can therefore compute exactly the effects of executing a context dependent action.
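A redefined state transition function for actions with conditional effects might look as follows. This is a minimal sketch, not FF's code: the tuple types and the briefcase-style example are assumptions, and so is the choice to apply deletes before adds when a fact is both added and deleted by the effects that fire.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple


class Effect(NamedTuple):
    cond: FrozenSet[str]     # effect condition; empty for unconditional effects
    add: FrozenSet[str]
    delete: FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    effects: Tuple[Effect, ...]


def apply_action(state: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Successor state, or None if the action is not applicable."""
    if not action.pre <= state:
        return None
    # The state is completely specified, so every effect condition can be
    # evaluated exactly; all conditions refer to the state before execution.
    fired = [e for e in action.effects if e.cond <= state]
    adds = frozenset().union(*(e.add for e in fired))
    dels = frozenset().union(*(e.delete for e in fired))
    return (state - dels) | adds         # assumption: adds win over deletes


MOVE = Action(
    "move briefcase home office",
    frozenset({"at(b,home)"}),
    (
        Effect(frozenset(), frozenset({"at(b,office)"}), frozenset({"at(b,home)"})),
        Effect(frozenset({"in(paper,b)"}),
               frozenset({"at(paper,office)"}), frozenset({"at(paper,home)"})),
    ),
)
```

The second effect of `MOVE` fires only when the paper is inside the briefcase, which is the kind of context dependence the text refers to.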

Performance Evaluation
We have implemented the methodology presented in the preceding sections in C (see footnote 4). In this section, we evaluate the performance of the resulting planning system. Empirical data is divided into three subareas:
1. The FF system took part in the fully automated track of the 2nd international planning systems competition, carried out alongside AIPS-2000 in Breckenridge, Colorado. We review the results, demonstrating FF's good runtime and solution length behavior in the competition. We also give some intuitions on why FF behaves the way it does.
2. From our own experiments, we present some of the results that we have obtained in domains that were not used in the AIPS-2000 competition. First, we briefly summarize our findings in some more domains where FF works well. Then, to illustrate our intuitions on the reasons for FF's performance, we give a few examples of domains where the approach is less appropriate.
3. We finally present a detailed comparison of FF's performance to that of HSP, in the sense that we investigate which differences between FF and HSP lead to which performance results.

The AIPS-2000 Planning Systems Competition
From March to April 2000, the 2nd international planning systems competition, organized by Fahiem Bacchus, was carried out in the general setting of the AIPS-2000 conference in Breckenridge, Colorado. There were two main tracks, one for fully-automated planners, and one for hand-tailored planners. Both tracks were divided into five parts, each one concerned with a different planning domain. Our FF system took part in the fully automated track. In the competition, FF demonstrated runtime behavior superior to that of the other fully automatic planners and was therefore granted "Group A distinguished performance Planning System" (Bacchus & Nau, 2001). It also won the Schindler Award for the first place in the Miconic 10 Elevator domain, ADL track. In this section, we briefly present the data collected in the fully automated track, and give, for each domain, some intuitions on the reasons for FF's behavior. The reader should be aware that the competition made no distinction between optimal and suboptimal planners, putting together the runtime curves for both groups. In the text to each domain, we state which planners found optimal solutions, and which didn't. Per planning task, all planners were given half an hour running time on a 500 MHz Pentium III with 1GB main memory. If no solution was found within these resource bounds, the planner was declared to have failed on the respective task.

The Logistics Domain
The first two domains that were used in the competition were the Logistics and Blocksworld domains. We first look at the former. This is a classical domain, involving the transportation of packets via trucks and airplanes. Figure 4 shows the runtime curves of those planners that were able to scale to bigger instances in the competition. The Logistics tasks were subdivided into two sets of instances, the easy and the harder ones. Those planners that did well on all of the easy instances were also run on the harder set. These planners were FF, HSP2 (Bonet & Geffner, 1998, 1999), System-R, GRT (Refanidis & Vlahavas, 1999), Mips (Edelkamp, 2000), and STAN (Long & Fox, 1999; Fox & Long, 2001). Two observations can be made:
1. System-R does significantly worse than the other planners.
2. The better planners all behave quite similarly, with FF and Mips tending to be the fastest.
⁴ The source code is available in an online appendix, and can be downloaded from the FF Homepage at http://www.informatik.uni-freiburg.de/~hoffmann/ff.html.
Note also that times are shown on a logarithmic scale, so we are not looking at linear time Logistics planners. Concerning solution plan length, we do not show a figure here. None of the shown planners guarantees the returned plans to be optimal. It turns out that STAN finds the shortest plans on most instances. System-R finds significantly longer plans than the others, ranging from 178% to 261% of STAN's plan lengths, with an average of 224%. The lengths of FF's plans are within 97% to 115% of STAN's plan lengths, with an average length of 105%. Concerning FF's good runtime behavior, we think that there are mainly two reasons for it:
1. In all iterations of enforced hill-climbing, breadth first search finds a state with better evaluation at very small depths (motivating our search algorithm, cf. Section 5.1). In most cases, the next better successor is at depth 1, i.e., a direct one. There are some cases where the shallowest better successor is at depth 2, and only very rarely does breadth first search need to go down to depth 3. These observations are independent of task size.
2. The helpful actions heuristic prunes large fractions of the search space. Looking at the states that FF encounters during search, only between 40 and 5 percent of all of a state's successors were considered helpful in our experiments, with the tendency that the larger the task, the fewer helpful successors there are.
There is a theoretical note to be made on the first observation. With the common representation of Logistics tasks, the following can be proven. Let d be the maximal distance between two locations, i.e., the number of move actions a mobile needs to take to get from one location to another. Using a heuristic function that assigns to each state the length of an optimal relaxed solution as the heuristic value, the distance of each state to the next better evaluated state is at most d + 1. Thus, an algorithm that used enforced hill-climbing with an oracle function returning the length of an optimal relaxed solution would be polynomial on standard Logistics representations, given an upper limit to d. In the benchmarks available, mobiles can reach any location accessible to them in just one step, i.e., the maximal distance in those tasks is constantly d = 1. Also, FF's heuristic usually does find optimal, or close to optimal, relaxed solutions there, such that enforced hill-climbing almost never needs to look more than d + 1 = 2 steps ahead.
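The lookahead behavior discussed above can be made concrete with a minimal Python sketch of enforced hill-climbing. It is an illustration under stated assumptions, not FF's C implementation: the names `enforced_hill_climbing`, `successors`, and `h` are hypothetical, states are assumed to be hashable, and helpful actions pruning is left out.

```python
from collections import deque


def enforced_hill_climbing(initial, successors, h):
    """Return the list of visited states from `initial` to a state with h == 0.

    `successors(state)` yields the successor states and `h(state)` is the goal
    distance estimate; both are supplied by the caller (hypothetical interface).
    Returns None if no strictly better state is reachable from some state.
    """
    path = [initial]
    current, best = initial, h(initial)
    while best > 0:
        # Breadth first search for the nearest state with a strictly better value.
        frontier = deque([(current, [])])
        seen = {current}
        found = None
        while frontier and found is None:
            state, trail = frontier.popleft()
            for nxt in successors(state):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if h(nxt) < best:
                    found = (nxt, trail + [nxt])
                    break
                frontier.append((nxt, trail + [nxt]))
        if found is None:
            return None  # local search is stuck, e.g., in a dead end
        current, steps = found
        best = h(current)
        path.extend(steps)
    return path
```

In this sketch, the depth at which the breadth first phase stops is the length of `steps`. Under the oracle heuristic discussed above, that depth would be bounded by d + 1 on Logistics tasks.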

The Blocksworld Domain
The Blocksworld is one of the best known benchmark planning domains, where the planner needs to rearrange a bunch of blocks into a specified goal position, using a robot arm. Just like the Logistics tasks, the competition instances were divided into a set of easier ones and a set of harder ones. Figure 5 shows the runtime curves of the planners that scaled to the harder ones. System-R scales most steadily to the Blocksworld tasks used in the competition. In particular, it is the only planner that can solve all of those tasks. HSP2 solves some of the smaller instances, and FF solves about two thirds of the set. If FF succeeds on an instance, then it does so quite fast. For example, FF solves one of the size-50 tasks in 1.27 seconds, where System-R needs 892.31 seconds. None of the three planners finds optimal plans. On the tasks that HSP2 manages to solve, its plans are within 97% to 177% of System-R's plan lengths, with an average of 153%. On the tasks that FF manages to solve, its plans are within 83% to 108% of System-R's plan lengths, with an average of 96%.
By experimenting with different configurations of FF, we found that the behavior of FF on these tasks is largely due to the goal ordering heuristics from Section 6.2. Goal distance estimates are not so good (the planner grabs a whole bunch of blocks with its single arm), and neither is the helpful actions heuristic (when the arm holds a block, all positions where the arm could possibly put the block are usually considered helpful). The goal agenda (Section 6.2.2), on the other hand, divides the tasks into small subtasks, and added goal deletion (Section 6.2) prevents the planner from putting blocks onto stacks where some block beneath still needs to be moved. However, in some cases achieving the goals from earlier entries in the goal agenda cuts off goals that are still ahead. Not aware of the blocks that it will need to stack for achieving goals ahead, the planner might put the current blocks onto stacks that need to be disassembled later on. If that happens with too many blocks, which depends more or less randomly on the specific task and the actions that the planner chooses, then the planner cannot find its way out of the situation again. These are probably the instances that FF couldn't solve in the competition.

The Schedule Domain
In the Schedule domain, the planner is facing a bunch of objects to be worked on with a set of machines, i.e., the planner is required to create a job schedule in which the objects shall be assigned to the machines. The competition representation makes use of a simple form of quantified conditional effects. For example, if an object gets painted red, then that is its new color, and for all colors that it is currently painted in, it is not of that color anymore. Only a subset of the planners in the competition could handle this kind of conditional effects. Their runtime curves are shown in Figure 6. Apart from those planners already seen, we have runtime curves in Figure 6 for IPP (Koehler et al., 1997), PropPlan, and BDDPlan (Hölldobler & Störr, 2000). FF outperforms the other planners by many orders of magnitude (remember that time is shown on a logarithmic scale). Concerning solution length, FF's plans tend to be slightly longer than the plans returned by the other planners on the smaller instances. Optimal plans are found by Mips, PropPlan, and BDDPlan. FF's plan lengths are within 175% of the optimal lengths, with an average of 116%. Only HSP sometimes finds longer plans than FF, being in a range from 62% to 117% of FF's plan lengths, 94% on average. Responsible for the outstanding runtime behavior of FF on the Schedule domain is, apparently, the helpful actions heuristic. Measuring, for some example states, the percentage of successors that were considered helpful, we usually found it was close to 2 percent, i.e., only two out of a hundred applicable actions were considered by the planner. For example, all of the 637 states that FF looks at for solving one of the size-50 tasks have 523030 successors altogether, where the sum of all helpful successors is only 7663. Also, the better successors, similar to the Logistics domain, lie at shallow depths. Breadth first search never goes deeper than three steps on the Schedule tasks in the competition suite.
Finally, in a few experiments that we ran to test this, the goal agenda helped by about a factor of 2 in terms of running time.

The Freecell Domain
The Freecell domain formalizes a solitaire card game that comes with Microsoft Windows. The largest tasks entered in the competition (size 13 in Figure 7) correspond directly to some real-world sized tasks, while in the smaller tasks, there are fewer cards to be considered. Figure 7 shows the runtime curves of the four best performing planners. Time is shown on a logarithmic scale.
From the group of the four best-scaling planners shown in Figure 7, HSP2 is the slowest, while STAN is the fastest planner. FF is generally second place, and has a lot of variation in its running times. On the other hand, FF is the only planner that is capable of solving the real-world tasks, size 13. It solves four out of five such tasks. None of the shown planners guarantees the found plans to be optimal, and none of the shown planners demonstrates superior performance concerning solution length. STAN produces unnecessarily long plans in a few cases. Precisely, on the tasks that both HSP and FF manage to solve, HSP's plan lengths are within a range of 74% to 126% of FF's plan lengths, average 95%. On tasks solved by both Mips and FF, plan lengths of Mips are within 69% to 128% of FF's lengths, average 101%. For STAN, the range is 65% to 318%, with 112% on average.
Concerning FF's runtime behavior, the big variation in running time as well as its capability of solving larger tasks both seem to result from the way the overall search algorithm is arranged. We observed the following. Those tasks that get solved by enforced hill-climbing are those that are solved fast. Sometimes, however, especially on the larger tasks, enforced hill-climbing runs into a dead end situation (no cards can be moved anymore). Then, the planner starts from scratch with complete best-first search, which takes more time, but can solve big instances quite reliably, as can be seen on the tasks of size 13. Helpful actions works moderately well, selecting around 70% of the available actions, and the better successors are usually close, but sometimes lie at depths of more than 5 steps.
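The arrangement described above, a fast local search backed by a complete search that restarts from the initial state, can be sketched as follows. This is a minimal Python illustration under assumptions: the names `best_first_search`, `solve`, and `local_search` are hypothetical, the best-first search shown orders open states by the heuristic value alone, and states are assumed to be hashable.

```python
import heapq
from itertools import count


def best_first_search(initial, successors, h):
    """Complete search that always expands the open state with the lowest h value."""
    tie = count()  # tie-breaker so the heap never has to compare states directly
    open_list = [(h(initial), next(tie), initial, [initial])]
    closed = {initial}
    while open_list:
        value, _, state, path = heapq.heappop(open_list)
        if value == 0:
            return path
        for nxt in successors(state):
            if nxt not in closed:
                closed.add(nxt)
                heapq.heappush(open_list, (h(nxt), next(tie), nxt, path + [nxt]))
    return None  # the reachable state space is exhausted


def solve(initial, successors, h, local_search):
    """Try the local search first; if it fails, start from scratch with best-first search."""
    plan = local_search(initial, successors, h)
    if plan is not None:
        return plan, "local search"
    return best_first_search(initial, successors, h), "best-first search"
```

The second component of the result records which search produced the plan, which is one way to observe the two runtime regimes mentioned above.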

The Miconic Domain
The final domain used in the competition comes from a real-world application, where moving sequences of elevators need to be planned. The sequences are due to all kinds of restrictions, such as that the VIPs need to be served first. To formulate all of these restrictions, complex first order preconditions are used in the representation (Koehler & Schuster, 2000). As only a few planners could handle the full ADL representation, the domain was subdivided into the easier STRIPS and SIMPLE (conditional effects) classes, the full ADL class, and an even more expressive class where numerical constraints (the number of passengers in the elevator at a time) needed to be considered. We show the runtime curves for the participants in the full ADL class in Figure 8. In contrast to the previous domains, the Miconic domain was run on site at AIPS-2000, using 450 MHz Pentium III machines with 256 MByte main memory. FF outperforms the two other full ADL planners in terms of solution time. It must be noticed, however, that IPP and PropPlan generate provably optimal plans here, such that one needs to be careful when directly comparing those running times. On the other hand, FF's plans are quite close to optimal on these instances, being within a range of at most 133% of the optimal solution lengths on the instances solved by PropPlan, 111% on average.
The large variation of FF's running times is apparently due to the same phenomenon as the variation in Freecell: sometimes, as we observed, enforced hill-climbing runs into a dead end, which causes a switch to best-first search, solving the task in more time, but reliably. The helpful actions percentage takes very low values on average, around 15%, and breadth first search rarely goes deeper than four or five steps, where the large majority of the better successors lie at depth 1.

Some more Examples
In this section, we present some of the results that we have obtained in domains that were not used in the AIPS-2000 competition. We give some more examples of domains where FF works well, and, to illustrate our intuitions on the reasons for FF's behavior, also some examples of domains where FF is less appropriate.
For evaluation, we ran FF on a collection of 20 benchmark planning domains, including all domains from the AIPS-1998 and AIPS-2000 competitions, and seven more domains from the literature. Precisely, the domains in our suite were Assembly, two Blocksworlds (three- and four-operator representation), Briefcaseworld, Bulldozer, Freecell, Fridge, Grid, Gripper, Hanoi, Logistics, Miconic-ADL, Miconic-SIMPLE, Miconic-STRIPS, Movie, Mprime, Mystery, Schedule, Tireworld, and Tsp. Instances were either taken from published distributions, from the literature, or modified to show scaling behavior.⁵ Times for FF were measured on a Sparc Ultra 10 running at 350 MHz, with a main memory of 256 MBytes. Running times that we show for other planners were taken on the same machine, if not otherwise indicated in the text. We found that FF shows extremely competitive performance on 16 of the 20 domains listed above. On the two Blocksworlds, Mprime, and Mystery, it still shows satisfying behavior. Some examples that have not been used in the AIPS-2000 competition are:
The Assembly Domain. FF solves 25 of the 30 tasks in the AIPS-1998 test suite in less than five seconds, where the five others are either unsolvable, or have specification errors. The only other planner we know of that can solve any of the Assembly tasks is IPP. The latest version IPP4.0 solves only four of the very small instances, taking up to 12 hours running time. FF's plan lengths are, in terms of the number of actions, shorter than IPP's time step optimal ones, ranging from 90% to 96%.
The Briefcaseworld Domain. This is a classical domain, where n objects need to be transported using a briefcase. Whenever the briefcase is moved, a conditional effect forces all objects inside the briefcase to move with it. From our suite, IPP4.0 easily solves the tasks with n ≤ 5 objects, but fails to solve any task where n ≥ 7. FF, on the other hand, solves even the 11-objects tasks in less than a second. On the tasks that IPP solves, plan lengths of FF are within 84% to 111% of IPP's lengths, 99% on average.
The Grid Domain. The 1998 competition featured five instances. For these tasks, the fastest planning mechanism we know of from the literature is a version of GRT that is enhanced with a simple kind of domain dependent knowledge, supplied by the user. It solves the tasks in 1.04, 6.63, 21.35, 19.92 and 118.65 seconds on a 300 MHz Pentium Celeron machine with 64 MByte main memory (Refanidis & Vlahavas, 2000). FF solves the same tasks within 0.15, 0.47, 2.11, 1.93 and 19.54 seconds, respectively. Plan lengths of FF are within 89% to 139% of GRT's lengths, 112% on average.
The Gripper Domain, used in the 1998 competition. The number of states that FF evaluates before returning an optimal sequential solution is linear in the size of the task there. The biggest AIPS-1998 example gets solved in 0.16 seconds.
The Tireworld Domain. The original task formulated by Stuart Russell asks the planner to find out how to replace a flat tire. Koehler and Hoffmann (2000a) modified the task such that an arbitrary number of n tires need to be replaced. IPP3.2, using the goal agenda technique, solves the 1, 2, and 3-tire tasks in 0.08, 0.21, and 1.33 seconds, respectively, but exhausts memory resources as soon as n ≥ 4. FF scales to much larger tasks, taking less than a tenth of a second when n ≤ 6, still solving the 10-tire task in 0.33 seconds. FF's plan lengths are, on the tasks that IPP manages to solve, equally long in terms of the number of actions.
⁵ All PDDL files, and random instance generators for all domains, are available in an online appendix. The generators, together with descriptions of our randomization strategies, are also available at http://www.informatik.uni-freiburg.de/~hoffmann/ff-domains.html.
As was already said, our intuition is that the majority of the currently available benchmark planning domains (at least those represented by our domain collection) are "simple" in structure, and that it is this simplicity which makes them solvable so easily by a greedy algorithm such as FF. To illustrate our intuitions, we now give data for a few domains that have a less simple structure. They are therefore challenging for FF.

The Mystery and Mprime Domains
The Mystery and Mprime domains were used in the AIPS-1998 competition. Both are variations of the Logistics domain, where there are additional constraints on the capacity of each vehicle, and, in particular, on the amount of fuel that is available. Both domains are closely related, the only difference being that in Mprime, fuel items can be transported between two locations, if one of those has more than one such item. In Figure 9, we compare FF's results on both domains to those reported by Drew McDermott for the Unpop system (McDermott, 1999). Instances are the same for both domains in Figure 9. Results for Unpop have been taken by McDermott on a 300 MHz Pentium-II workstation (McDermott, 1999). A dash indicates that the task couldn't be solved by the corresponding planner.
One needs to be careful when comparing the running times in Figure 9: unlike FF, coded in C, Unpop is written in Lisp. Thus, the apparent runtime superiority of FF in Figure 9 is not significant. On the contrary, Unpop seems to solve these task collections more reliably than FF: it finds solutions to four Mystery and three Mprime instances which FF does not manage to solve. Neither planner is superior in terms of solution lengths: on Mystery, FF ranges within 55% to 185% of Unpop's lengths, 103% on average; on Mprime, FF ranges within 45% to 150%, 93% on average.
We think that FF's behavior on these two domains is due to the large number of dead ends in the corresponding state spaces. We tried to randomize FF's search strategy, running it on the Mystery and Mprime suites. Regardless of the randomization strategy we tried, on the tasks that the original FF couldn't solve, search ended up stuck in a dead end. Dead ends are a frequent phenomenon in the Mystery and Mprime domains, where, for example, an important vehicle can run out of fuel. In that sense, the tasks in these domains have a more complex structure than those in a lot of other benchmark domains, where the tasks are dead-end free. Depending more or less randomly on task structure and selected actions, FF either solves Mystery and Mprime tasks quite fast, or fails, i.e., encounters a dead end state with enforced hill-climbing. Trying to solve the tasks with complete best-first search exhausts memory resources for larger instances.

Random SAT Instances
Our last example domain is not a classical planning benchmark. To give an example of a planning task collection where FF really encounters difficulties, we created a planning domain containing hard random SAT instances. Figure 10 shows runtime curves for FF, IPP4.0, and BLACKBOX3.6. The tasks in Figure 10 are solvable SAT instances that were randomly generated according to the fixed clause-length model with 4.3 times as many clauses as variables (Mitchell, Selman, & Levesque, 1992). Random instance generation and translation software to PDDL have both been provided by Jussi Rintanen. Our figure shows running times for SAT instances with 5, 10, 15, 20, 25, and 30 variables, five tasks of each size. Values for tasks of the same size are displayed in turn, i.e., all data points below 10 on the x-axis show running times for 5 variable tasks, and so on. Though the data set is small, the observation to be made is clear: FF can only solve the small instances, and two of the bigger ones. IPP and BLACKBOX scale much better, with the tendency that BLACKBOX is fastest. The encoding of the SAT instances is the following. An operator corresponds to assigning a truth value to a variable, which makes all clauses true that contain the respective literal. Once a variable has been assigned, its value is fixed. The goal is having all clauses true. It is not surprising that BLACKBOX does best. After all, this planner uses SAT technology for solving the tasks.⁶ For IPP and FF, the search space is the space of all partial truth assignments. Due to exclusion relations, IPP can rule out quite many such assignments early, when it finds they can't be completed. FF, on the other hand, does no such reasoning, and gets lost in the exponential search space, using a heuristic that merely tells it how many variables it will still need to assign truth values to, unaware of the interactions that might, and most likely will, occur.
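The encoding described above can be sketched in a few lines of Python. This is an illustration only, not the translation software used in the experiments: the fact names (`unassigned-v`, `sat-i`), the `Action` type, and the functions `encode_sat` and `apply_action` are assumptions made for the example, and clauses are given as lists of signed integers.

```python
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def encode_sat(clauses):
    """Build (initial state, goal, actions) for a CNF given as lists of signed ints."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    initial = frozenset(f"unassigned-{v}" for v in variables)
    goal = frozenset(f"sat-{i}" for i in range(len(clauses)))
    actions = []
    for v in variables:
        for lit in (v, -v):
            # Assigning the literal makes every clause containing it true.
            satisfied = frozenset(f"sat-{i}" for i, c in enumerate(clauses) if lit in c)
            actions.append(Action(
                name=f"set-{v}-{'true' if lit > 0 else 'false'}",
                pre=frozenset({f"unassigned-{v}"}),
                add=satisfied,
                delete=frozenset({f"unassigned-{v}"}),  # the value is fixed afterwards
            ))
    return initial, goal, actions


def apply_action(state, action, ignore_deletes=False):
    """Apply an action; with ignore_deletes=True the delete list is dropped."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    removed = frozenset() if ignore_deletes else action.delete
    return (state - removed) | action.add
```

In this sketch, calling `apply_action` with `ignore_deletes=True` lets the same variable be set to both values in turn, which is one way to see why an estimate based on ignoring delete lists is blind to the interactions between assignments.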
In contrast to most of the current benchmark planning domains, finding a non-optimal solution to the planning tasks used here is NP-hard. FF's behavior on these tasks supports our intuition that FF's efficiency is due to the inherent simplicity of the planning benchmarks.
⁶ In these experiments, we ran BLACKBOX with the default parameters. Most likely, one can boost the performance by parameter tuning.

What Makes the Difference to HSP?
One of the questions that the authors have been asked most frequently at the AIPS-2000 planning competition is this: If FF is so closely related to HSP, then why does it perform so much better? FF uses the same basic ideas as classical HSP, forward search in state space, and heuristic evaluation by ignoring delete lists (Bonet & Geffner, 1998). The differences lie in the way FF estimates goal distances, the search strategy, and FF's pruning techniques.
To obtain a picture of which new technique yields which performance results, we conducted a number of experiments where those techniques could be turned on and off independently of each other. Using all combinations of techniques, we measured runtime and solution length performance on a large set of planning benchmark tasks. In this section, we describe the experimental setup, and summarize our findings. The raw data and detailed graphical representations of the results are available in an online appendix.

Experimental Setup
We focused our investigation on FF's key features, i.e., we restricted our experiments to the FF base architecture, rather than taking into account all of FF's new techniques. Remember that FF's base architecture (cf. Section 2) is the enforced hill-climbing algorithm, using FF's goal distance estimates, and pruning the search space with the helpful actions heuristic. The additional techniques integrated deal with special cases, i.e., the added goal deletion heuristic and the goal agenda are concerned with goal orderings, and the complete best-first search serves as a kind of safety net when local search has run into a dead end. Considering all techniques independently would give us 2^6 = 64 different planner configurations. As each of the special case techniques yields savings only in a small subset (between 4 and 6) of our 20 domains, large groups of those 64 configurations would behave exactly the same on the majority of our domains. We decided to concentrate on FF's more fundamental techniques. The differences between classical HSP and FF's base architecture are the following:
1. Goal distance estimates: while HSP approximates relaxed solution lengths by computing certain weight values, FF extracts explicit relaxed solutions, cf. Section 4.
2. Search strategy: while classical HSP employs a variation of standard hill-climbing, FF uses enforced hill-climbing as was introduced in Section 5.
3. Pruning technique: while HSP expands all children of any search node, FF expands only those children that are considered helpful, cf. Section 6.1.
We have implemented experimental code where each of these algorithmic differences is attached to a switch, turning the new technique on or off. The eight different configurations of the switches yield eight different heuristic planners. When all switches are on, the resulting planner is exactly FF's base architecture. With all switches off, our intention was to imitate classical HSP, i.e., HSP1 as it was used in the AIPS-1998 competition. Concerning the goal distance estimates switch and the pruning techniques switch, we implemented the original methods. Concerning the search strategy, we used the following simple hill-climbing design:
Always select one best evaluated successor randomly.
Keep a memory of past states to avoid cycles in the hill-climbing path.
Count the number of consecutive times in which the child of a node does not improve the heuristic estimate. If that counter exceeds a threshold, then restart, where the threshold is 2 times the initial state's goal distance estimate.
Keep visited nodes in memory across restart trials in order to avoid multiple computation of the heuristic for the same state.
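The four design points above can be put together in a short Python sketch. It is a simplified illustration, not the experimental code used for the comparison: the function name `hill_climbing`, the `max_restarts` bound, and the `rng` parameter are assumptions added so that the example terminates and can be run deterministically.

```python
import random


def hill_climbing(initial, successors, h, max_restarts=100, rng=None):
    """Randomized hill-climbing with restarts; returns a state path or None."""
    rng = rng or random.Random()
    cache = {}  # heuristic values are kept across restart trials

    def value(state):
        if state not in cache:
            cache[state] = h(state)
        return cache[state]

    threshold = 2 * value(initial)  # restart threshold from the initial estimate
    for _ in range(max_restarts):  # hypothetical bound, not part of the design above
        state, path, on_path, stalled = initial, [initial], {initial}, 0
        while value(state) > 0:
            # Past states on the current path are skipped to avoid cycles.
            children = [s for s in successors(state) if s not in on_path]
            if not children:
                break  # nothing new to move to: restart
            best = min(value(s) for s in children)
            child = rng.choice([s for s in children if value(s) == best])
            stalled = 0 if value(child) < value(state) else stalled + 1
            if stalled > threshold:
                break  # too many consecutive non-improving steps: restart
            state = child
            path.append(state)
            on_path.add(state)
        if value(state) == 0:
            return path
    return None
```

Passing a seeded `random.Random` instance as `rng` makes individual trials reproducible, which is convenient when several trials per task are averaged.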
In HSP1, some more variations of restart techniques are implemented. In personal communication with Blai Bonet and Hector Geffner, we decided not to imitate those variations, which affect behavior only in a few special cases, and use the simplest possible design instead. We compared the performance of our implementation with all switches turned off to the performance of HSP1, running the planners on 12 untyped STRIPS domains (the input required for HSP1). Except in four domains, the tasks solved were the same for both planners. In Freecell and Logistics, our planner solved more tasks, apparently due to implementation details: though HSP1 did not visit more states than our planner on the smaller tasks, it ran out of memory on the larger tasks. In Tireworld and Hanoi, the restarting techniques seem to make a difference: In Tireworld, HSP1 cannot solve tasks with more than one tire because it always restarts before getting close to the goal (our planner solves tasks with up to 3 tires), whereas in Hanoi our implementation cannot cope with more than 5 discs for the same reason (HSP1 solves tasks with up to 7 discs). Altogether, in most cases there is a close correspondence between the behavior of HSP1 and our configuration with all switches turned off. In any case, our experiments provide useful insights into the performance of enforced hill-climbing compared to a simple straightforward hill-climbing strategy.
To obtain data, we set up a large example suite, containing a total of 939 planning tasks from our 20 benchmark domains. As said at the beginning of Section 8.2, our domains were Assembly, two Blocksworlds (three- and four-operator representation), Briefcaseworld, Bulldozer, Freecell, Fridge, Grid, Gripper, Hanoi, Logistics, Miconic-ADL, Miconic-SIMPLE, Miconic-STRIPS, Movie, Mprime, Mystery, Schedule, Tireworld, and Tsp. In Hanoi, there were 8 tasks (3 to 10 discs to be moved); in the other domains, we used from 30 to 69 different instances. As very small instances are likely to produce noisy data, we tried to avoid those by rejecting tasks that were solved by FF in less than 0.2 seconds. This was possible in all domains but Movie, where all tasks in the AIPS-1998 suite get solved in at most 0.03 seconds. In the two Blocksworld representations, we randomly generated tasks with 7 to 17 blocks, using the state generator provided by John Slaney and Sylvie Thiebaux (2001). In Assembly and Grid, we used the AIPS-1998 instances, plus a number of randomly generated ones similar in size to the biggest examples in the competition suites. In Gripper, our tasks contained from 10 to 59 balls to be transported. In the remaining 9 competition domains, we used the larger instances of the respective competition suites. In Briefcaseworld and Bulldozer, we randomly generated around 50 large tasks, with 10 to 20 objects, and 14 to 24 locations, respectively. In Fridge, from 1 to 14 compressors had to be exchanged, in Tireworld, 1 to 30 wheels needed to be replaced, and in Tsp, 10 to 59 locations needed to be visited.⁷ For each of the eight configurations of switches, we ran the respective planner on each of the tasks in our example suite. Those configurations using randomized hill-climbing were run five times on each task, and the results averaged afterwards.
Though five trials might sound like a small number here (way too small if we were to compare different hill-climbing strategies for SAT problems, for example), the number seemed reasonable to us: remember that, in the planning framework, all hill-climbing trials start from the same state. The variance that we found between different trials was usually low in our testing runs. To complete the experiments in a reasonable time, we restricted memory consumption to 128 MByte, and time consumption to 150 seconds: usually, if FF needs more time or memory on a planning task of reasonable size, then it doesn't manage to solve it at all. As said at the beginning of the section, the raw data is available in an online appendix, accompanied by detailed graphical representations. Here, we summarize the results, and discuss the most interesting observations. We examined the data separately for each domain, as our algorithmic techniques typically show similar behavior for all tasks within a domain. In contrast, there can be essential differences in the behavior of the same technique when it is applied to tasks from different domains.

Running Time
For our running time investigation, if a configuration did not find a solution plan to a given task, we set the respective running time value to the time limit of 150 seconds (sometimes, a configuration can terminate faster without finding a plan, for example an enforced hill-climbing planner running into a dead end). In the following, we designate each switch configuration by 3 letters: "H" stands for helpful actions on, "E" stands for enforced hill-climbing on, "F" stands for FF estimates on. If a switch is turned off, the respective letter is replaced by a "-": FF's base architecture is configuration "HEF", our HSP1 imitation is "---", and "H--", for example, is hill-climbing with HSP goal distances and helpful actions pruning. For a first impression of our running time results, see the averaged values per domain in Figure 11. Figure 11 shows, for each domain and each configuration, the averaged running time over all instances in that domain. As the instances in each domain are not all the same size, but typically scale from smaller to very large tasks, averaging over all running times is, of course, a very crude approximation of runtime behavior. The data in Figure 11 provides a general impression of our runtime results per domain, and gives a few hints on the phenomena that might be present in the data. Compare, for example, the values on the right hand side (those planners using helpful actions) to those on the left hand side (those planners expanding all children of search nodes). In Briefcaseworld and Bulldozer, the right hand side values are higher, but in almost all other domains, they are considerably lower. This is especially true for the two rightmost columns, showing values for planners using helpful actions and enforced hill-climbing. This indicates that the main sources of performance lie in the pruning technique and the search strategy: looking at the rightmost "HE-" and "HEF" columns, which only differ in the goal distance estimate, those two configuration values are usually close to each other, compared to the other configurations in the same domain.
⁷ All PDDL files, and the source code of all instance generators we used, are available in an online appendix. The generators, together with descriptions of the randomization strategies, are also available at http://www.informatik.uni-freiburg.de/~hoffmann/ff-domains.html.
To put our observations on a solid basis, we looked, for each domain, at each pair of configurations in turn, amounting to 20 × (8 × 7)/2 = 560 pairs of planner performances. For each such pair, we decided whether one configuration performed significantly better than the other one. To decide significance, we counted the number of tasks that one configuration solved faster. We found this to be a more reliable criterion than things like the difference between running times for each task. As tasks grow in size, rather than being taken from a population with finite mean size, parametric statistical procedures, like computing confidence intervals for runtime differences, make questionable assumptions about the distribution of data. We thus used the following non-parametric statistical test, known as the two-tailed sign test (Siegel & Castellan, 1988). Assume that both planners, A and B, perform equally on a given domain. Then, given a random instance from the domain, the probability that B is faster than A should be equal to the probability that A is faster than B. Take this as the null hypothesis. Under that hypothesis, if A and B behave differently on an instance, then B is faster than A with probability 1/2. Thus, the tasks where B is faster are distributed over the tasks with different behavior according to a binomial distribution with p = 1/2. Compute the probability of the observed outcome under the null hypothesis, i.e., if there are n tasks where A and B behave differently, and k tasks where B is faster, then compute the probability that, according to a binomial distribution with p = 1/2, at least k positive outcomes are obtained in n trials. If that probability is less than or equal to .01, then reject the null hypothesis and say that B performs significantly better than A. Symmetrically, decide whether A performs significantly better than B.
We remark that in all domains except Movie the tasks where two configurations behaved equally were exactly those that could not be solved by either of the configurations. In 60% of the cases where we found that one configuration B performed significantly better than another configuration, B was faster on all instances with different behavior. In 71%, B was faster on all but one such instance.
We are particularly interested in pairs A and B of configurations where B results from A by turning one of the switches on, leaving the two others unchanged. Deciding about significant improvement in such cases tells us about the effect that the respective technique has on performance in a domain. There are 12 pairs of configurations where one switch is turned on. Figure 12 shows our findings in these cases.
Figure 12: The effect of turning on a single switch, keeping the others unchanged. Summarized in terms of significantly improved or degraded running time performance per domain, and per switch configuration.
Figure 12 is to be understood as follows. It shows our results for the "F", "E", and "H" switches, which become active in turn from left to right. For each of these switches, there are four configurations of the two other, background, switches, displayed by four columns in the table. In each column, the behavior of the respective background configuration with the active switch turned off is compared to the behavior with the active switch turned on. If performance is improved significantly, the table shows a "+", if it is significantly degraded, the table shows a "-", and otherwise the respective table entry is empty. For example, consider the top left corner, where the "F" switch is active, and the background configuration is "--", i.e., hill-climbing without helpful actions. Planner A is "---", using HSP distances, and planner B is "--F", using FF distances. B's performance is significantly better than A's, indicated by a "+".
The leftmost four columns in Figure 12 show our results for HSP distance estimates versus FF distance estimates. Clearly, the latter estimates are superior in our domains, in the sense that, for each background configuration, the behavior gets significantly improved in 8 to 10 domains. In contrast, there are only 5 cases altogether where performance gets worse. The significances are quite scattered over the domains and background configurations, indicating that a lot of the significances result from interactions between the techniques that occur only in the context of certain domains. For example, performance is improved in Bulldozer when the background configuration does not use helpful actions, but degraded when the background configuration uses hill-climbing with helpful actions. This kind of behavior cannot be observed in any other domain. There are 4 domains where performance is improved in all but one background configuration. Apparently in these cases some interaction between the techniques occurs only in one specific configuration. We remark that often running times with FF's estimates are only a little better than with HSP's estimates, i.e., behavior gets improved reliably over all instances, but only by a small factor (to get an idea of that, compare the differences between average running times in Figure 11, for configurations where only the distance estimate changes). In 5 domains, FF's estimates improve performance consistently over all background configurations, indicating a real advantage of the different distance estimates. In Gripper (described in Section 6.1), for example, we found the following. If the robot is in room A, and holds only one ball, FF's heuristic prefers picking up another ball over moving to room B, i.e., the picking action leads to a state with better evaluation. Now, if there are n balls left in room A, then HSP's heuristic estimate of picking up another ball is $4n - 2$, while the estimate of moving to room B is $3n + 1$. Thus, if there are at least 4 balls left in room A, moving to room B gets a better evaluation. Because it sums up weights, HSP overestimates the usefulness of the moving action.
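The threshold of 4 balls follows directly from the two estimates quoted above, as the following small check illustrates. The function names are hypothetical. Only the expressions 4n - 2 and 3n + 1 are taken from the text, and lower estimates are assumed to mean better evaluations, as is usual for goal distance estimates.

```python
def hsp_estimate_pick(n: int) -> int:
    """HSP's estimate for picking up another ball, as quoted in the text: 4n - 2."""
    return 4 * n - 2


def hsp_estimate_move(n: int) -> int:
    """HSP's estimate for moving to room B, as quoted in the text: 3n + 1."""
    return 3 * n + 1


# Smaller estimates mean better evaluations. Find the smallest n for which
# moving is evaluated strictly better than picking.
threshold = next(
    n for n in range(1, 100) if hsp_estimate_move(n) < hsp_estimate_pick(n)
)

for n in range(1, 7):
    print(n, hsp_estimate_pick(n), hsp_estimate_move(n))
print("moving preferred from n =", threshold)
```

For n = 3 the two estimates coincide, and from n = 4 onwards the moving action receives the strictly smaller estimate.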
Comparing hill-climbing versus enforced hill-climbing, i.e., looking at the four columns in the middle of Figure 12, the observation is this. The different search technique is a bit questionable when the background configuration does not use helpful actions, but otherwise, enforced hill-climbing yields excellent results. Without helpful actions, performance gets degraded almost as many times as it gets improved, whereas, with helpful actions, enforced hill-climbing improves performance significantly in 16 of our 20 domains, being degraded only in Fridge. We draw two conclusions. First, whether one or the other search strategy is adequate depends very much on the domain. A simple example for that is the Hanoi domain, where hill-climbing always restarts before it can reach the goal: on all paths to the goal, there are exponentially many state transitions where the son has no better evaluation than the father. Second, there is an interaction between enforced hill-climbing and helpful actions pruning that occurs consistently across almost all of our planning domains. This can be explained by the effect that the pruning technique has on the different search strategies.
In hill-climbing, helpful actions pruning prevents the planner from looking at too many superfluous successors on each single state that a path goes through. This saves time proportional to the length of the path. The effects on enforced hill-climbing are much more drastic. There, helpful actions pruning removes unnecessary successors of each state during a breadth-first search, i.e., it cuts down the branching factor, yielding performance speedups exponential in the depths that are encountered.
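The contrast between linear and exponential savings can be made concrete with a rough counting model. The sketch below assumes a uniform tree without duplicate states, and the branching factors 10 and 3 are invented for illustration. None of these numbers are measurements from the experiments.

```python
def breadth_first_nodes(branching: int, depth: int) -> int:
    """Nodes in a full tree with the given branching factor, down to `depth`."""
    return sum(branching ** level for level in range(depth + 1))


def hill_climbing_evaluations(branching: int, path_length: int) -> int:
    """Successor evaluations along one path: `branching` per visited state."""
    return branching * path_length


ALL_ACTIONS = 10     # hypothetical branching factor without pruning
HELPFUL_ACTIONS = 3  # hypothetical branching factor with helpful actions only

for depth in range(1, 6):
    # Saving for hill-climbing grows linearly with the path length.
    linear_saving = (hill_climbing_evaluations(ALL_ACTIONS, depth)
                     - hill_climbing_evaluations(HELPFUL_ACTIONS, depth))
    # Saving for breadth-first search grows exponentially with the depth.
    exponential_saving = (breadth_first_nodes(ALL_ACTIONS, depth)
                          - breadth_first_nodes(HELPFUL_ACTIONS, depth))
    print(depth, linear_saving, exponential_saving)
```

Under these assumptions the hill-climbing saving grows by a constant amount per step, while the breadth-first saving is multiplied by roughly the unpruned branching factor with every additional level.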
We finally compare consideration of all actions versus consideration of only the helpful ones. Look at the rightmost four columns of Figure 12. The observation is simply that helpful actions are really helpful: they improve performance significantly in almost all of our planning domains. This is especially true for those background configurations using enforced hill-climbing, due to the same interaction that we have outlined above. In some domains, helpful actions pruning imposes a very rigid restriction on the search space: in Schedule, as said in Section 8.1.3, we found that states can have hundreds of successors, where only about 2% of those are considered helpful. In other domains, only a few actions are pruned, like in Hanoi, where at most three actions are applicable in each state, which are all considered helpful in most of the cases. Even a small degree of restriction does usually lead to a significant improvement in performance. In two domains, Briefcaseworld and Bulldozer, helpful actions can prune out too many possibilities, i.e., they cut away solution paths. This happens because there, the relaxed plan can ignore things that are crucial for solving the real task. Consider the Briefcaseworld, briefly described in Section 8.2, where objects need to be moved using a briefcase. Whenever the briefcase is moved, all objects inside it are moved with it by a conditional effect. Now, the relaxed planner never needs to take any object out of the briefcase: the delete effects say that moving an object means the object is no longer at the start location. Ignoring this, keeping objects inside the briefcase never hurts.

Solution Length
We also investigated the effects that FF's new techniques have on solution length. Comparing two configurations A and B, we took as the data set the respective solution length for those tasks that both A and B managed to solve; obviously, there is not much point in comparing solution length when one planner cannot find a solution at all. We then counted the number n of tasks where A and B behaved differently, and the number k where B's solution was shorter, and decided about significance as described in the last section. Figure 13 shows our results in those cases where a single switch is turned.
Figure 13: The effect of turning on a single switch, keeping the others unchanged. Summarized in terms of significantly improved or degraded solution length performance per domain, and per switch configuration.
The data in Figure 13 are organized analogously to Figure 12. A first glance at the table tells us that FF's new techniques are also useful for shortening solution length in comparison to HSP1, but not as useful as they are for improving runtime behavior. Let us focus on the leftmost four columns, HSP distance estimates versus FF distance estimates. The observations are that, with enforced hill-climbing in the background, FF estimates often result in shorter plans, and that there are two domains where solution lengths are improved across all background configurations. Concerning the second observation, this is due to properties of the domain that FF's heuristic recognizes, but HSP's doesn't. Recall what we observed about the Gripper domain in the preceding section. With the robot standing in room A, holding only one ball, the FF heuristic gives picking up another ball a better evaluation than moving to room B. The HSP heuristic doesn't do this. Therefore, using the HSP heuristic results in longer plans. Concerning the first observation, improved solution lengths when enforced hill-climbing is in the background, we do not have a good explanation for this. It seems that the greedy way in which enforced hill-climbing builds its plans is just better suited when distance estimates are cautious, i.e., low.
Consider the four columns in the middle of Figure 13, hill-climbing versus enforced hill-climbing. There are many cases where the different search strategy results in shorter plans. We figure that this is due to the different plateau behavior that the search methods exhibit, i.e., their behavior in flat regions of the search space. Enforced hill-climbing enters a plateau somewhere, performs complete search for a state with better evaluation, and adds the shortest path to that state to its current plan prefix. When hill-climbing enters a plateau, it strolls around more or less randomly, until it hits a state with better evaluation, or has enough of it and restarts. All the actions on its journey to the better state are kept in the final plan.
In Movie, the phenomenon is this. If a planner chooses to reset the counter on the VCR before it chooses to rewind the movie (initially, neither heuristic makes a distinction between these two actions), then it has to reset the counter again. The enforced hill-climbing planners always reset the counter first. The hill-climbing planners, on the other hand, randomly choose either ordering with equal probability. As said in Section 8.3.1, hill-climbing was given five tries on each task, and results averaged. In five tries, around half of the solutions use the correct ordering, such that, for all tasks, the average value is lower than the corresponding value for the enforced hill-climbing planners.
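The plateau behavior of enforced hill-climbing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not FF's actual implementation: the function `enforced_hill_climbing`, the `successors` and `h` callables, and the toy state space are all hypothetical, states are assumed hashable, and `h` is assumed to be 0 exactly in goal states.

```python
from collections import deque


def enforced_hill_climbing(start, successors, h):
    """Return a list of actions reaching a state with h == 0, or None.

    successors(state) yields (action, next_state) pairs; h(state) is the
    heuristic goal distance estimate.
    """
    plan, state = [], start
    while h(state) > 0:
        # Breadth-first search for the nearest state with strictly better h.
        frontier = deque([(state, [])])
        seen = {state}
        found = None
        while frontier and found is None:
            current, path = frontier.popleft()
            for action, nxt in successors(current):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if h(nxt) < h(state):
                    found = (nxt, path + [action])
                    break
                frontier.append((nxt, path + [action]))
        if found is None:
            return None  # no better state reachable: the search fails
        state, segment = found
        plan.extend(segment)  # shortest path off the plateau joins the plan
    return plan


# Toy state space: positions 0..6 on a line, goal at 6, plateaus at h = 3 and h = 2.
H = {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1, 6: 0}


def successors(s):
    if s + 1 in H:
        yield "inc", s + 1
    if s - 1 in H:
        yield "dec", s - 1


plan = enforced_hill_climbing(0, successors, H.__getitem__)
print(plan)
```

Because each breadth-first phase returns a shortest path to the next better state, no detours taken on a plateau end up in the plan, which is the property the text contrasts with plain hill-climbing.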
Finally, we compare consideration of all actions versus consideration of only the helpful ones, with results depicted in the rightmost four columns of Figure 13. Somewhat unexpectedly, there is only one single case where solution length performance is degraded by turning on helpful actions. This indicates that the actions on the shortest path to the goal are, in fact, usually considered helpful, unless all solution paths are thrown away, which sometimes happens, but only in the Briefcaseworld and Bulldozer domains. Contrary to what one might expect, pruning the search space with helpful actions sometimes leads to significantly shorter solution plans, especially when the underlying search method is hill-climbing. Though this may sound paradoxical, there is a simple explanation for it. Consider what we said above about the plateau behavior of hill-climbing, randomly adding actions to the current plan in the search for a better state. If such a search engine is armed with the helpful actions successors choice, focusing it into the direction of the goals, it might well take fewer steps to find the way off a plateau.

Related Work
The most important connections of the FF approach to methodologies reported in the literature are the following:
HSP's basic idea of forward state space search and heuristic evaluation by ignoring delete lists (Bonet & Geffner, 1998).
The view of our heuristic as a special case of GRAPHPLAN (Blum & Furst, 1995), and its connection to HSP's heuristic method.
The similarity of the helpful actions heuristic to McDermott's favored actions (1996), and to irrelevance detection mechanisms.
compute weight values only once, estimating each fact's distance to the initial state, and only sum the weights up for a state later during search.⁸ Refanidis and Vlahavas (1999) invert the direction of the HSP heuristic instead. While HSP computes distances going from the current state towards the goal, their GRT system goes from the goal to each fact. The function that then extracts, for each state during forward search, the state's heuristic estimate, uses the precomputed distances as well as some information on which facts will probably be achieved simultaneously.
Interestingly, FF, like HSP, recomputes the heuristic from scratch on each search state, but nevertheless outperforms the other approaches. As we have seen in Section 8.3, this is for the most part due to FF's search strategy and the helpful actions pruning technique.

Conclusion and Outlook
We have presented an approach to domain independent planning that, at present, outperforms all existing technology on the majority of the currently available benchmark domains. Just like the well-known HSP1 system, it relies completely on forward state space search and heuristic evaluation of states by ignoring delete lists. Unlike HSP, the method uses a GRAPHPLAN-style algorithm to find an explicit relaxed solution to each search state. Those solutions give a more careful estimation of a state's difficulty. As a second major difference to HSP, our system employs a novel local search strategy, combining hill-climbing with complete search. Finally, the method makes use of powerful heuristic pruning techniques, which are based on examining relaxed solutions.
As we have mentioned earlier, our intuition is that the reasons for FF's efficiency lie in structural properties that the current planning benchmarks tend to have. As a matter of fact, the simplicity of the benchmarks is immediately apparent once one looks for it. It should be clear that the Gripper tasks, where some balls need to be transported from one room to another, exhibit a totally different search space structure than, for example, hard random SAT instances. Therefore, it is intuitively unsurprising that different search methods are appropriate for the former tasks than are traditionally used for the latter. The efficiency of FF on many of the benchmarks can be seen as bringing that observation to the surface.
To make explicit the hypotheses stated above, we have investigated the state spaces of the planning benchmarks. Following Frank et al. (1997), we have collected empirical data, identifying characteristic parameters for different kinds of planning tasks, like the density and size of local minima and plateaus in the search space. This has led us to a taxonomy for planning domains, dividing them by the degree of complexity that the respective tasks' state spaces exhibit with respect to relaxed goal distances. Most of the current benchmark domains apparently belong to the "simpler" parts of that taxonomy (Hoffmann, 2001). We also approach our hypotheses from a theoretical point of view, where we measure the degree of interaction that facts in a planning task exhibit, and draw conclusions on the search space structure from that. Our goal in that research is to devise a method that automatically decides which part of the taxonomy a given planning task belongs to.
In that context, there are some remarks to be made on what AI planning research is heading for. Our point of view is that the goal in the field should not be to develop a technology that works well on all kinds of tasks one can express with planning languages. This will hardly be possible, as even languages as simple as STRIPS can express NP-hard problems like SAT. What might be possible, however, is to devise a technology that works well on those tasks that can be solved efficiently. In particular, if a planning task does not constitute much of a problem to an uninformed human solver, then it should not do so for our planning algorithms either. With the FF system, we already seem to have a method that accomplishes this quite well, at least for sequential planning in STRIPS and ADL. While FF is not particularly well suited for solving random SAT instances, it easily solves intuitively simple tasks like the Gripper and Logistics ones, and is well suited for a number of other domains where finding a non-optimal solution is not NP-hard. This sheds a critical light on the predictions of Kautz and Selman (1999), who suspected that planning technology will become superfluous because of the fast advance of the state of the art in propositional reasoning systems. The methods developed there are surely useful for solving SAT. They might, however, not be appropriate for the typical structures of tasks that AI planning should be interested in.
8. HSP-r is integrated into HSP2 as an option of configuring the search process (Bonet & Geffner, 2001).