DeepSym: Deep Symbol Generation and Rule Learning for Planning from Unsupervised Robot Interaction

Symbolic planning and reasoning are powerful tools for robots tackling complex tasks. However, the need to manually design the symbols restricts their applicability, especially for robots that are expected to act in open-ended environments. Therefore, symbol formation and rule extraction should be considered part of robot learning, which, when done properly, will offer scalability, flexibility, and robustness. Towards this goal, we propose a novel general method that finds action-grounded, discrete object and effect categories and builds probabilistic rules over them for non-trivial action planning. Our robot interacts with objects using an initial action repertoire that is assumed to be acquired earlier and observes the effects it can create in the environment. To form action-grounded object, effect, and relational categories, we employ a binary bottleneck layer in a predictive, deep encoder-decoder network that takes the image of the scene and the action applied as input, and generates the resulting effects in the scene in pixel coordinates. After learning, the binary latent vector represents action-driven object categories based on the interaction experience of the robot. To distill the knowledge represented by the neural network into rules useful for symbolic reasoning, a decision tree is trained to reproduce its decoder function. Probabilistic rules are extracted from the decision paths of the tree and are represented in the Probabilistic Planning Domain Definition Language (PPDDL), allowing off-the-shelf planners to operate on the knowledge extracted from the sensorimotor experience of the robot. The deployment of the proposed approach for a simulated robotic manipulator enabled the discovery of discrete representations of object properties such as 'rollable' and 'insertable'.
In turn, the use of these representations as symbols allowed the generation of effective plans for achieving goals, such as building towers of the desired height, demonstrating the effectiveness of the approach for multi-step object manipulation. Finally, we demonstrate that the system is not restricted to the robotics domain by assessing its applicability to the MNIST 8-puzzle domain, in which learned symbols allow for the generation of plans that move the empty tile into any given position.


Introduction
Intelligent robotic systems exploit a diverse range of data representations for control, learning, and reasoning. Interaction with the world requires processing low-level continuous sensorimotor representations whereas abstract reasoning requires the use of high-level symbolic representations. The representational gap between the low-level sensorimotor and high-level symbolic representations has been addressed in AI and robotics, often by using manually designed symbols that are grounded in the low-level sensorimotor experience of the robots interacting with their environment (Harnad, 1990; Taniguchi et al., 2018). However, this approach only works in either controlled environments or limited tasks. Yet, truly intelligent robots are expected to form abstractions (Konidaris, 2019) continually from their interaction with the world and use them on the fly for complex planning and reasoning in novel environments (Werner & Kaplan, 1963; Callaghan & Corbit, 2015).
In this paper, we address the challenging problem of discovering discrete symbols and unsupervised learning of rules from the low-level interaction experience of a self-exploring robot. For this purpose, we propose a novel deep neural architecture for symbol formation and rule extraction. At the core of our method, the symbols are discovered in the discrete latent space formed by the bottleneck layer of a predictive, deep encoder-decoder network that takes the image of an object and the action applied as the input, and produces the effect generated by the action as the output. Symbols, which are the output of the encoder network, hold information for the effect prediction for a given action. Furthermore, our architecture allows transforming the complete low-level sensorimotor experience into symbolic experience, facilitating direct rule extraction for AI planning. To this end, decision tree models are trained to learn probabilistic rules that are translated to Probabilistic Planning Domain Definition Language (PPDDL; Younes & Littman, 2004) operators that are standard in probabilistic planning. Note that the predicates that appear in the PPDDL operators correspond to the discovered symbols.
In order to realize this framework, we created a setup where a simulated robot manipulator interacts with objects, poking them in different directions and stacking them on top of each other to collect interaction experience for object categorization and rule learning. Our system successfully constructs a latent representation through which object and relational symbols are discovered, which can be interpreted by humans as 'rollable', 'insertable', and 'larger-than'. Contrary to symbols generated by systems that disregard actions and effects, our architecture is shown to generate action-effect-regulated symbols that are more effective for abstract reasoning over the actions of the robot and their consequences in the environment. Furthermore, the number of symbols is determined automatically by optimizing the trade-off between prediction capability and bottleneck size. Finally, the system acquired the capability to generate effective plans to achieve goals such as building towers of desired heights from given cubes, balls, and cups using off-the-shelf probabilistic planners. To show the generality of the proposed approach, we also conduct a second set of experiments in a non-robotic domain; concretely, we test our approach in the MNIST 8-tile puzzle domain adapted from Asai and Fukunaga (2018). Our experiments show that the system learns symbols that allow for creating plans to move the empty tile into arbitrary positions. Our implementation is publicly available.
Our primary contribution is a generic neural solution for mapping raw sensorimotor experience into the symbolic domain. The same architecture can be used to discover object symbols, effect symbols, and object-object relational symbols. The proposed network further allows progressive learning of increasingly complex abstractions, exploiting previously learned abstractions as inputs. The learned symbols allow abstraction of the interaction of the robot with its environment as a Markov decision process, which enables the use of symbolic planning systems for goal satisfaction. In the current study, to show this, we transformed the learned rules into Probabilistic PDDL operators, which allowed probabilistic plan generation and execution, achieving goals beyond what was possible with the direct use of the training data.

Related Work
Bridging the representational gap between the continuous sensorimotor world of a robotic system and discrete symbols and rules has been a key research goal from the early days of intelligent robotics (Kuipers et al., 2017; Murphy & Murphy, 2000). While grounding predefined symbols in the sensorimotor experience of the robot has been widely used for intelligent robot control (Klingspor et al., 1996; Petrick et al., 2008; Mourao et al., 2008; Wörgötter et al., 2009; Kulick et al., 2013), some argue that symbols "are not formed in isolation" and "they are formed in relation to the experience of agents" (Sun, 2000). We share this viewpoint, which has been investigated in a number of studies. Pisokas and Nehmzow (2005) and Ugur et al. (2011) realized systems that clustered low-level sensory experience into categories and performed subsymbolic planning in the continuous perceptual space. While simple planning capability was achieved, the use of continuous prediction and state transition operators limited the use of powerful off-the-shelf symbolic AI planners. In another line of research, Ozturkcu et al. (2020) asked whether any symbols form in a deep RL agent after training the agent for a given task, without imposing any prior on the architecture or the objective. Mota and Sridharan (2019) and Riley and Sridharan (2019) proposed a hybrid approach to exploit prior domain knowledge by combining non-monotonic logical reasoning with deep networks. This architecture is a cascade of two models where the first model is the prior domain knowledge encoded as an Answer Set Prolog (ASP) program (Law et al., 2018), and the second model is a convolutional neural network (CNN). If the ASP program fails to classify an example, it redirects the necessary parts of the input to the CNN for further processing. This pipeline results in better accuracy with less computation when compared with CNN classification.
Furthermore, given labeled examples about the task, the ASP program can be further extended with new rules about the environment by using the decision paths of a trained decision tree. These works primarily focus on integrating neural models with common-sense or domain knowledge to increase performance. Our work is similar in that we also learn previously unknown rules with decision trees from subsymbolic data to aid planning. On the other hand, we focus on learning symbols that depend on the action set of the agent.
The bottom-up generation of symbolic structures from the continuous interaction experience of a robot has started to draw attention in robotics (Taniguchi et al., 2018; Konidaris, 2019). Konidaris et al. (2014, 2015) studied the construction of symbols that are directly used as preconditions and effects of actions for the generation of deterministic and probabilistic plans in 2D agent settings, and Konidaris et al. (2018) extended the framework to a real-world robot setting. However, these studies use a global state representation, and therefore, symbols learned in one environment cannot be used directly in a novel environment. In follow-up work, James et al. (2020) construct symbols with egocentric representations to allow the transfer of previously learned symbols. These studies train an SVM classifier for each effect cluster to find groundings of precondition symbols. Ugur and Piater (2015a, 2015b) formed symbols used in plan generation for manipulation using a combination of several ad-hoc machine learning techniques, such as clustering with X-means and classification with SVMs. Furthermore, they used hand-crafted features to represent scenes and effects. In contrast, our proposed architecture simultaneously learns object categories (in the encoder output) and their corresponding effect categories (in the decoder output) without resorting to any clustering techniques on the object or effect space. The object and effect categories automatically emerge as the network with binary bottleneck units minimizes the effect prediction error. Furthermore, deep neural networks allow us to efficiently process high-dimensional image data using convolutional layers. This design offers a generic symbol formation engine that runs at the pixel level using deep neural networks. In terms of symbol multiplicity, our approach is more parsimonious, as we do not form symbols for each action as in Ugur and Piater (2015b) and Konidaris et al. (2018); instead, we use a single decoder network that takes the action as part of the input. To be concrete, for n effect categories and k actions, our system generates n symbols, whereas the aforementioned approaches generate nk symbols. Learning a single model for all actions possibly allows internal representations learned for one action to be re-used directly for other actions. Another significant advantage of our model is that it is differentiable and can thus be integrated into gradient-based state-of-the-art machine learning architectures to tackle more complex problems. Asai and Fukunaga (2018) realized a similar neural framework where they first train a state autoencoder with discrete latent units and then learn the action precondition-effect mappings. In follow-up work, Asai and Muise (2021) and Asai et al. (2022) combine these two steps and learn the action mapping together with the state autoencoder. These works are in the visual domain (for example, 2D puzzles) and achieve visualized plan execution, while we focus on robot action planning and execution in the 3D world. Moreover, a critical difference of our method from the aforementioned work is that we learn object symbols by taking into account the action and the effects in addition to object features, which facilitates the formation of symbols that are likely to capture object affordances (Gibson, 2014; Zech et al., 2017).
Another line of research focuses on bilevel planning, in which a symbolic plan is complemented by a motion and task planner. Silver et al. (2021) and Chitnis et al. (2021) learn operators for bilevel planning when given parameterized policies for continuous planning. In follow-up work (Silver et al., 2022a), these parameterized policies are learned as well, completing the whole neurosymbolic planning pipeline. While these works fix the state abstractions, Silver et al. (2022b) also learn new state abstractions that are optimized for planning. In general, these works focus on learning high-level operators for bilevel planning, while we focus on learning symbols from continuous high-dimensional vectors. Another similar work (Yuan et al., 2022) trains a network that outputs relations between objects from RGB images, given the objects' canonical images.

Problem Formulation
In this work, we refer to symbols as discrete low-dimensional vectors that are extracted from deep neural networks for the current state and used to predict the observed effect of specific actions. More formally, a symbol z ∈ Z is a discrete representation that represents a subset P of a continuous high-dimensional space R^n (e.g., the state space or the effect space). The symbol space Z can be defined as a set of m-dimensional boolean vectors, Z = B^m = {0, 1}^m, or as a set of atoms, Z = {z_1, z_2, . . . , z_m}. The important condition here is that the symbol space should be finite, and its cardinality |Z| should preferably be small. In general, the symbol learning problem refers to finding the mapping f : R^n → Z, which allows us to do logical reasoning in the symbolic domain.
Given a set of discrete actions A = {a_1, a_2, . . . , a_k}, a continuous object (or state) space R^n, and a continuous effect space R^m, we are interested in learning an encoder function f : R^n → Z and a decoder function g : Z × A → R^m from the interaction experience {(o^(i), a^(i), e^(i))}_{i=1}^{N} collected by interacting with the environment. Essentially, the encoder outputs symbol z given the object state o ∈ R^n, and the decoder outputs effect e ∈ R^m for symbol z and action a. After learning the encoder and the decoder function by iteratively optimizing an objective (which will be discussed in Section 4), z corresponds to an object symbol, and c corresponds to an effect symbol that has the grounding e = g(z, a) (note that c is an atom while e is a continuous vector). Once we have such symbols, we can construct a high-level plan in the symbolic space by transforming the environment into a probabilistic PDDL domain defined over the symbols, and then use state-of-the-art off-the-shelf planners to find an action sequence that arrives at the desired goal state.
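To make this formulation concrete, the following toy sketch uses a hand-coded threshold encoder and a lookup-table decoder as hypothetical stand-ins for the learned f and g; the feature values, symbol meanings, and effect numbers are all illustrative and not taken from the paper's experiments.

```python
# Toy stand-ins for the learned encoder f and decoder g (hypothetical).

def f(o):
    """Encoder stand-in: threshold each feature at 0.5 to get a binary symbol z."""
    return tuple(1 if x > 0.5 else 0 for x in o)

# Decoder stand-in: a lookup table from (symbol, action) to an effect vector.
G_TABLE = {
    ((1, 0), "poke-forward"): (0.0, 12.0),  # e.g., a 'rollable' object rolls far
    ((0, 1), "poke-forward"): (0.0, 2.0),   # a 'pushable' object moves slightly
}

def g(z, a):
    return G_TABLE[(z, a)]

o = (0.9, 0.2)            # continuous observation in R^n
z = f(o)                  # discrete symbol in Z = {0, 1}^m
e = g(z, "poke-forward")  # predicted effect in R^m
```

The real encoder and decoder are trained neural networks; the point here is only the shape of the mapping e = g(f(o), a).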
The experiments reported in this paper involve two environments from different domains, namely, a tabletop robotic manipulation environment and the MNIST 8-puzzle environment adapted from Asai and Fukunaga (2018). The former is an embodied robotic environment in which the symbols that emerge depend on the actions executed by a robotic arm and their corresponding effects. In the MNIST 8-puzzle environment, an agent without an embodiment executes actions and observes the corresponding effects as visual changes in the environment. Symbols are learned with respect to these actions and visual effects.
For simplification, we make the following assumptions in the tabletop manipulation environment:

• The agent is assumed to have a small number of actions, such as poking and stacking an object. Such an action repertoire can be autonomously acquired through a developmental progression as in Ugur et al. (2012) or obtained through learning from demonstration and reinforcement learning (e.g., Seker et al., 2019; Akbulut et al., 2021).
• The agent is equipped with image processing capability to detect the objects in the camera image and to calculate their pixel coordinates. Furthermore, using the same object tracking method, the agent can take cropped images as input. In the tabletop setup, we realized this with a simple algorithm, as the background is uncluttered. In a real-world scenario, state-of-the-art computer vision techniques can be used to detect and track objects in the 3D world.

[Figure 1, panels I-II: (I) Interaction with objects with pre-defined actions; (II) Symbol formation (discovery of object and effect categories).]

[Figure 1, panels III-IV: (III) Decision tree learning; (IV) Translation of rules to PPDDL operators.]
In the MNIST 8-puzzle environment, the only assumption is that the agent has access to the action repertoire (e.g., 'slide-left', 'slide-down'), which it can execute to see the effects of its actions. Figure 1 provides the overall learning architecture of our proposed system in the robotic manipulation environment; the application of the architecture to the MNIST 8-puzzle domain is given in Section 6. In the environment interaction phase (I), the robot chooses an action a ∈ A = {a_1, a_2, . . . , a_k} from its action repertoire, observes the object state o, executes the action, and records the resulting effect e.

Methods
Using the interaction experience {(o^(i), a^(i), e^(i))}_{i=1}^{N}, the symbol formation is achieved in (II). To this end, a deep neural network model with two parts is trained to predict e given o and a. The first part is the encoder network, f(o), which creates a binary latent vector z given the depth image of the object, o. The second part, the decoder network g(z, a), predicts the effect e when action a is executed on the state o that has the latent representation z. As the network tries to predict effects, symbolic representations are created by the encoder network that can be treated as object categories regulated by the corresponding action-effect experience.
The continuous interaction experience is converted into symbolic experience using the discovered categories, and then the symbolic experience is used to distill a decision tree that predicts effects given object categories and actions in (III). The reason for using a decision tree is that any statement in propositional logic can be represented with a decision tree (Russell & Norvig, 2020, Ch. 19.3), so the rules of the environment can be converted into logical statements that encode pre- and post-conditions of actions on the objects.
Finally, these statements are represented in PPDDL, which allows one to make plans in a probabilistic environment in (IV). Lastly, plans are executed to validate the learned symbols and rules. In the following sections, we describe these parts in detail.

Exploration with the Environment
A manipulator robot with a gripper and a depth camera is used to explore the environment and monitor the changes (Figure 2). The robot is initialized with a fixed set of actions A = {a_1, a_2, . . . , a_k} through which it interacts with the objects in its workspace. Forward, side, and top poking actions are used to poke objects from different sides (Figure 2b, top). The stacking action is used to release one object on top of another object (Figure 2b, bottom). These actions are encoded with one-hot encoding. On the perception side, each detected object is represented with its top-down depth image. The generated change, on the other hand, is represented by the positional offset of the acted-upon object in pixel coordinates, together with the force change sensed at the wrist joint of the robot. In single-object interactions, the robot observes and stores the initial state as the object-centered, top-down depth image of the object, and the effect as the change in object position and force sensor readings: e_single = (∆x, ∆y, ∆d, ∆F), where ∆x and ∆y are the changes along the x-axis and y-axis in pixel coordinates, respectively, ∆d is the change in depth, and ∆F is the change in force. In paired-object interactions, the robot observes and stores the initial state as the combination of two object-centered depth images (o_1, o_2), and the effect as the change in position of both objects: e_paired = (∆x_1, ∆y_1, ∆d_1, ∆x_2, ∆y_2, ∆d_2), where ∆x_1, ∆y_1, ∆d_1 refer to the displacement of the first object, and ∆x_2, ∆y_2, ∆d_2 refer to the displacement of the second object.
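As a small illustration with made-up numbers, these effect encodings reduce to simple differences between before- and after-action readings:

```python
import numpy as np

# Single-object effect e = (dx, dy, dd, dF): differences of the acted
# object's pixel position, depth, and wrist-force readings (illustrative values).
before = np.array([64.0, 50.0, 0.82, 1.5])   # x, y, depth, force
after  = np.array([64.0, 71.0, 0.82, 1.9])
e_single = after - before                    # (0, 21, 0, 0.4): pushed along y

# Paired-object effect: displacements of both objects are concatenated.
obj1_delta = np.array([0.0, 0.0, -0.05])     # dx1, dy1, dd1 (object on top)
obj2_delta = np.array([0.0, 0.0,  0.0])      # dx2, dy2, dd2 (object below)
e_paired = np.concatenate([obj1_delta, obj2_delta])
```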

Symbol Discovery with Deep Networks
The main objective of the network is to discover symbols, i.e., object and effect categories, that are effective for abstract reasoning about the consequences of robot actions. In other words, the object categories, together with the robot actions, should give the ability to predict the effect categories. To achieve this, we propose a special neural network structure composed of two parts: an encoder f(o) that predicts the object category z, and a decoder g(z, a) that predicts the effect e (Figure 3, top). This encoder-decoder design has been shown to be quite successful in many different applications (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; Sutskever et al., 2014; Devlin et al., 2018). The binary bottleneck layer forces the network to learn low-dimensional symbolic representations that are useful for predicting the effects generated by actions. As the input is a top-down depth image, the encoder is a convolutional neural network with the Gumbel-Sigmoid (GS) function (Maddison et al., 2017; Jang et al., 2017) as the last-layer activation function (where error back-propagation is handled with the reparameterization trick; Kingma & Welling, 2013). We also experimented with the sign(x) function using straight-through estimators (STE; Bengio et al., 2013) and found that GS has a lower variance. Results with STE are given in Appendix B. Using the GS activation of the bottleneck neurons, the continuous representation is directly transformed into a discrete category. The decoder part is realized as a multi-layer perceptron (MLP). The category z of the object o, concatenated with the one-hot vector of action a, is given to the decoder as input. The decoder predicts the effect e expected to be observed on object state o via action a. The network minimizes the following objective: L = ||g(f(o), a) − e||², the squared error between the predicted and the observed effect. This architecture effectively creates high-level symbolic categories of objects that encapsulate the effects of executed actions.
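For illustration, a minimal NumPy sketch of the Gumbel-Sigmoid (Binary-Concrete) activation is given below. This is a forward-pass sketch only; the paper's trained convolutional encoder and its temperature schedule are not reproduced here.

```python
import numpy as np

def gumbel_sigmoid(logits, temperature=1.0, hard=False, rng=None):
    """Binary-Concrete / Gumbel-Sigmoid relaxation of a Bernoulli unit
    (forward pass only)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log(1.0 - u)  # logistic (difference-of-Gumbels) noise
    y = 1.0 / (1.0 + np.exp(-(np.asarray(logits) + noise) / temperature))
    if hard:
        # Discretize to {0, 1}; in a differentiable framework the
        # straight-through trick copies gradients through this step.
        return (y > 0.5).astype(float)
    return y

z_soft = gumbel_sigmoid(np.array([2.0, -2.0]), temperature=0.5)
z_hard = gumbel_sigmoid(np.array([2.0, -2.0]), temperature=0.5, hard=True)
```

Lowering the temperature pushes the soft samples toward 0/1; the hard variant yields the binary symbols used at inference time.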
One important advantage is that the model does not need hand-engineered object features or object clusters for finding object symbols, contrary to previous studies, since the system learns discrete categories directly by optimizing the effect prediction performance. Moreover, as the bottleneck layer is discrete, the possible decoder outputs e = g(z, a) form a finite set E = {e_1, e_2, . . .} which can be denoted by atoms {c_1, c_2, . . .}. In the second stage, the robot interacts with object pairs, stacking one object on top of another (Figure 2b, bottom). The same deep network structure is used to extract the corresponding symbols, with a slight modification to incorporate previously learned knowledge (Figure 3, bottom). Here, an encoder f_2 takes the depth images of the objects and produces a binary latent vector z_3. Importantly, the single-object symbols (z_1 and z_2) computed by the f_1 encoder are also given to the network as input, together with the action information. The idea is that we can use previously acquired symbols to encode new information more compactly, thus allowing a progressive increment of symbols. Note that the encoder f_1 is frozen at this second stage of training. The encoder f_1 provides interaction-related information about the individual objects and lets the encoder f_2 focus on learning properties of and relations between the objects.
The number of symbols is automatically set by selecting the number of bottleneck neurons with a hyperparameter search procedure. To limit the number of rules and predicates, this procedure aims to find the minimum number of symbols that provides competitive prediction performance. Starting from one unit, we record the mean and the standard deviation of the mean squared error (MSE) over multiple runs, and increase the number of units until there is no significant drop in the prediction error. MSE curves are reported in Appendix A.
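This search can be sketched as follows. Here `train_and_eval` is a hypothetical stand-in for training the encoder-decoder several times with a given number of binary units and returning the mean validation MSE; the relative-drop threshold and the MSE curve are illustrative, not values from the paper.

```python
# Sketch of the bottleneck-size search: grow the number of binary units
# until the prediction error stops dropping significantly.

def select_num_units(train_and_eval, max_units=8, min_rel_drop=0.05):
    best_mse = train_and_eval(1)
    chosen = 1
    for n in range(2, max_units + 1):
        mse = train_and_eval(n)
        if (best_mse - mse) / best_mse < min_rel_drop:
            break                  # no significant drop: stop growing
        best_mse, chosen = mse, n
    return chosen

# Hypothetical MSE curve: a large drop up to 2 units, then a plateau.
curve = {1: 0.90, 2: 0.20, 3: 0.195, 4: 0.195}
select_num_units(lambda n: curve[n], max_units=4)  # -> 2
```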

Extracting Symbolic Rules
In the third part of the pipeline, a decision tree is trained to predict the effect c of the stack action a given the high-level single-object (z_1 and z_2) and paired-object (z_3) categories (i.e., the outputs of the encoders). Here, the aim is to extract the probabilistic rules of the environment by converting the decision rules on the paths of the tree into logical statements, which ultimately enables probabilistic planning. Each path from the root node to a leaf node in the decision tree stores the required set of predicates {p_1 = (z_3 < 0.5), p_2 = (z_2 > 0.5), . . .} over the discovered single- and paired-object categories (in the internal nodes) in order to achieve the effect category c (in the leaves). In other words, each path corresponds to a set of preconditions for reaching a different effect. As the decision rules at each node in a path P are in conjunction (p_1 ∧ p_2 ∧ · · · ∧ p_k), and the paths are in disjunction (P_1 ∨ P_2 ∨ · · · ∨ P_m), the tree represents a statement in disjunctive normal form; indeed, any statement in propositional logic can be represented as a decision tree (Russell & Norvig, 2020, Ch. 19.3). The class probabilities at a leaf (the fraction of samples) correspond to the probabilities of observing different effects for the same set of preconditions. Therefore, each path is directly converted to a different rule in probabilistic PDDL. While training the decision tree, the minimum number of samples required for a node to be a leaf is empirically set to 100. The extracted rules are limited to predicting the effects of actions; in this way, the agent is not expected to learn representations (and consequently rules) unrelated to its embodiment and actions. For example, in our tabletop environment, the robot does not differentiate cubes from vertical cylinders as different categories since they respond similarly to the available actions even though their visual appearances differ.
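The essence of this step can be sketched without an actual tree implementation: tallying effect-category frequencies per (preconditions, action) combination mirrors what the class probabilities at the tree leaves encode. The symbol values and effect names below are hypothetical, not the categories learned in the experiments.

```python
from collections import Counter, defaultdict

# Symbolic experience: (z1, z2, z3, action, observed effect category).
experience = [
    ((0, 1), (0, 1), (1,), "stack", "stacked"),
    ((0, 1), (0, 1), (1,), "stack", "stacked"),
    ((0, 1), (0, 0), (1,), "stack", "toppled"),
    ((0, 1), (0, 0), (1,), "stack", "stacked"),
]

counts = defaultdict(Counter)
for z1, z2, z3, action, effect in experience:
    counts[(z1, z2, z3, action)][effect] += 1

# Each entry corresponds to one probabilistic rule: preconditions -> effect
# distribution (analogous to leaf class probabilities in the trained tree).
rule_probs = {
    precond: {e: n / sum(effects.values()) for e, n in effects.items()}
    for precond, effects in counts.items()
}
```

In the actual pipeline, the preconditions come from the decision path of the trained tree (with the minimum-leaf-size constraint), not from an exhaustive tally, but the resulting rule probabilities have the same form.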
Our motivation to construct PPDDL descriptions is to use probabilistic AI planners to efficiently make plans and execute them. PPDDL is composed of a domain description and a problem definition. In the domain description, there are predicates and actions. Predicates represent boolean values that can be activated or deactivated. Each action has a precondition, which is a set of predicates that needs to be satisfied, and an effect, which activates/deactivates other predicates. The domain description is generated from the list of rules. In the problem definition, the initial state of the world is encoded along with the goal to be satisfied. To encode the initial state, the robot perceives the current environment and sets the truth values of the predicates for the existing categories. The planner finds the sequence of actions to satisfy the predicates given in the goal description starting from the initial state using the actions defined in the domain description.
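A minimal sketch of emitting one PPDDL action schema from such a rule is shown below. The predicate and effect-category names are hypothetical, and a complete domain description would also declare its predicates and collect the schemas of all rules.

```python
# Emit one PPDDL action schema with a probabilistic effect (sketch).

def ppddl_action(name, preconds, effect_probs):
    pre = " ".join(f"({p})" for p in preconds)
    eff = " ".join(f"{p:.2f} ({e})" for e, p in effect_probs.items())
    return (
        f"(:action {name}\n"
        f"  :precondition (and {pre})\n"
        f"  :effect (probabilistic {eff}))"
    )

schema = ppddl_action(
    "stack",
    preconds=["z2-0", "z3-0"],
    effect_probs={"stacked": 0.75, "toppled": 0.25},
)
print(schema)
```

An off-the-shelf probabilistic planner consumes the assembled domain plus a problem file encoding the perceived initial state and the goal predicates.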

Robot Experiments
In the following experiments, we aim to answer the following questions to evaluate the proposed method:

1. Do the learned symbols hold any high-level meaning?
2. Are the learned symbols effective for symbolic planning?
We compare our method with two alternative baselines:

1. An autoencoder with discrete activations where symbols are learned directly from passively observed states, independently of actions and effects.
2. An encoder-decoder network with continuous activations, followed by clustering in the latent space.
Regarding the first question, we evaluate the methods based on their performance in differentiating object categories. For the second question, we evaluate the planning performance of different methods.

Experiment Setup
Interactions: We adopted the robotic setup, including the action and object sets used, from Ugur and Piater (2015a), who showed effective skill transfer from the simulator to the real world, involving actions with 3-fingered prehension. The experiments are performed in the CoppeliaSim (V-REP) simulator (Rohmer et al., 2013), where a six-degrees-of-freedom UR10 (Universal Robots, 2012) robot arm with a Barrett Hand (Townsend, 2000) interacts with the objects on the table, and a top-down-facing Kinect sensor is used for environment perception (Figure 2). The objects used in the experiments include rectangular cups, horizontally and vertically placed cylinders, spheres, and cubes. For each object type, ten different objects with varying diameters/edge lengths in the range of 10 to 20 cm are included in the object dataset for interaction.
Perception: Before each action execution, a top-down depth image (128 × 128 pixels) of the scene is captured. Objects are placed at different reachable locations on the table during the interactions to ensure the network is invariant to the perspective. Pixels of the images are normalized globally to increase the convergence speed of stochastic gradient descent (LeCun et al., 2012). Objects in the image are detected with a simple procedure by finding the point with minimum depth and cropping the area of 42 × 42 pixels centered around it. This procedure yields object-centered representations for the objects used in the current study but preserves the perspective distortion due to the varying locations of the objects and the fixed sensor position.

[Table 1 caption: Objects vary in their sizes and initial positions. The mean and the standard deviation of 10 runs are reported. For ease of understanding, columns are named so that the category where spheres are mostly placed is renamed to (0, 0), the category where cubes are mostly placed is renamed to (0, 1), and so on. This naming convention also allows averaging across different runs.]
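The detection procedure can be sketched in a few lines of NumPy. This is a simplified stand-alone version on a synthetic depth image; the real pipeline operates on the simulated Kinect stream.

```python
import numpy as np

def crop_object(depth_img, size=42):
    """Crop a size x size patch centered on the minimum-depth pixel,
    i.e., the point closest to the top-down camera."""
    h, w = depth_img.shape
    cy, cx = np.unravel_index(np.argmin(depth_img), depth_img.shape)
    half = size // 2
    # Clamp the window so the crop stays inside the image bounds.
    top = min(max(cy - half, 0), h - size)
    left = min(max(cx - half, 0), w - size)
    return depth_img[top:top + size, left:left + size]

img = np.ones((128, 128))   # flat table at depth 1.0 (synthetic)
img[40, 90] = 0.2           # closest point of a hypothetical object
patch = crop_object(img)    # 42 x 42 object-centered crop
```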
Encoder-decoder network: The encoder network (Figure 3) consists of four blocks, each containing two convolutional layers followed by batch normalization (Ioffe & Szegedy, 2015) and ReLU activation. The numbers of filters in these blocks are 32, 64, 128, and 256, respectively. The last layer consists of two hidden units with GS activation. The decoder network is a two-layer MLP with 32 hidden units. Further details of these networks can be found in Appendix A.

Discovered Object Categories
Based on the hyperparameter optimization procedure, the number of binary activation neurons in the bottleneck layer is automatically set to 2; therefore, the system found 2^2 = 4 object categories. How different object types (unknown to the robot) are represented by the discovered object categories is analyzed in Table 1. In general, different types of objects were coded into different categories, except that cubes and vertical cylinders share the same category even though their depth images differ. This is due to our action- and effect-regulated categorization: cubes and vertical cylinders behave the same under all available single-object actions of the robot. Although the depth images of same-type objects with different sizes differ significantly, this information is not reflected in the categories because the size of the objects does not have a significant influence on the consequences of the current actions. The categories can be interpreted as 'pushable'; 'rollable in a single direction'; 'pushable and insertable'; and 'rollable in all directions', respectively. Examples from each category are shown in Figure 4. As baselines for comparison, we trained:

1. An autoencoder with a binary hidden layer using Gumbel-Sigmoid to reconstruct the depth images of objects (the inputs to f_1) instead of effects. This approach is similar to Asai and Fukunaga (2018). We refer to this approach as Object-Binary-Object (OBO).
2. Our proposed encoder-decoder architecture with the binary bottleneck layer replaced by an ordinary continuous layer, to which k-means clustering (k = 4) is applied after learning. Let us call this approach Object-Continuous-Effect followed by Clustering (OCEC).
The results are shown in Table 1. For the autoencoder network (i.e., OBO), we see that objects are collapsed primarily into one category. The robot is expected to predict the consequences of its actions using these categories, and as shown, these categories are not distinctive enough to support such prediction. With this, we verified the advantage of extracting the symbols from the interaction experience of the robot that includes object-action-effect information, i.e., from an object encoder / effect decoder network, rather than searching for the symbols in passively-observed static features.

Figure 5: The encoder f₂ activations (blue for 0, red for 1) for paired objects. Here, the x and y axes of each of the 5 × 5 plots represent the sizes of the objects below and above, respectively. Each square represents the relation for a given pair. Note that without any direct supervision, the system discovers approximately linear boundaries (e.g., the last column) for some object pairs that help in effect prediction.
OCEC gave better results than OBO since the bottleneck layer in OCEC does include information from the effect space, owing to predictive training similar to our proposed model. However, the latent codes in the bottleneck layer of OCEC might not be distributed locally, making clustering harder. When this is the case, more complicated clustering algorithms, such as spectral clustering, are needed to cluster the latent space accurately. For example, in Table 1, we see that OCEC is more biased toward misclassifying cups as the stable category and horizontal cylinders as spheres. Averaged over all objects, our method predicts objects in the correct category with 98.5 ± 0.94% accuracy, compared to OCEC with 88.3 ± 8.62% accuracy.

Discovered Relational Categories
The stack interaction experience of the robot is used to train the multi-object encoder-decoder network (Figure 3, bottom), transferring the object categories reported above. The number of binary activated bottleneck neurons is automatically set to 1 using the hyperparameter search described in the Methods section.
The response of the bottleneck neuron, i.e., how this neuron categorizes the input object pairs, is analyzed in Figure 5. Given different pairs of objects with different sizes, each image in this figure corresponds to a specific object pair, and each pixel gives the response of the bottleneck neuron (0 or 1) for specific object sizes. In our experiments, the effect of the stack action depends on the object categories and their relative size. For example, if an object is released on top of a larger cup, the released object drops into the cup; if the released object is larger than the cup, it is stacked on top of the cup's walls. The approximately linear boundaries for some object pairs in Figure 5 (for example, the last column) show that the bottleneck neuron captured these dynamics and found a symbol that roughly encodes the relative size: the output is 1 when the below cup is larger than the above object. In stacking interactions, the relative size relation only makes sense when the object below is a cup, and our system discovered this relational symbol. Another linear boundary found by the system is in the bottom row: the output of the encoder is 1 when the above object is a cup below a specific size. We analyzed the exploration data to understand why such a boundary emerges and found that if the above object is a small cup, the change in the position of the below object is very small.
The learned representations depend on the effect space and the action space of the agent. In our example, after the single-object training stage, the system differentiates different types of objects but does not differentiate different sizes of objects as they are not sufficiently important for the prediction of push actions. Only after it is trained with new data consisting of a new action, namely stacking, does the system start to differentiate between different sizes of cups. The agent only learns richer representations, and therefore better rules, when it has access to a richer action repertoire. This is a desired property of our system as it learns a minimal set of representations needed to predict the outcomes of its actions.

Discovered Effect Categories
After training, we pass the symbol space Z together with the action space A to the decoder to obtain the effect categories. More specifically, for the single-object effects, the decoder input space is the Cartesian product of the object category space {0, 1}² with the action space A_single = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}, resulting in 4 × 3 = 12 different effect categories. For the paired-object effects, the input consists of two single-object categories and one relational category, i.e., Z_paired = {0, 1}² × {0, 1}² × {0, 1}. Here, A_paired only contains the stack action, therefore n(A_paired) = 1, and the number of paired effect categories is 16 × 2 × 1 = 32. These effect categories for the single and the paired interactions are shown in Figures 6 and 7, respectively. For visualization purposes, we use colors to represent the third dimension. In Figure 6, low force values are in blue and high force values in red. Likewise, in Figure 7, low depth values are in blue and high depth values in red.
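The category counts above can be reproduced with simple arithmetic; the paired-effect count assumes the input space is the product of two single-object categories, the relational symbol, and the single stack action:

```python
n_object_categories = 2 ** 2   # two binary bottleneck units -> 4 categories
n_single_actions = 3           # A_single = {(0,0,1), (0,1,0), (1,0,0)}
n_single_effects = n_object_categories * n_single_actions           # 12

n_relational = 2 ** 1          # one binary relational unit
n_paired_actions = 1           # A_paired = {stack}
n_paired_effects = (n_object_categories * n_object_categories
                    * n_relational * n_paired_actions)              # 32
```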

Learned Rules and PPDDL Operators
The single- and paired-object categories (acquired from the output of the encoder) together with the action vector are used as inputs to the decision tree in order to predict the effect categories (extracted from the output of the decoder). The learned tree is of depth 5, has 24 leaves, and its classification accuracy is 94.8%. The result of decision tree learning is shown in Figure 8a, where only a small number of the 24 decision paths is explicitly shown because of space constraints. Here, (f₁(o₁)₁, f₁(o₁)₂) represents the category of the object above, (f₁(o₂)₁, f₁(o₂)₂) represents the category of the object below, and f₂(o₁, o₂) is the symbol for the paired-object relation. A natural-language translation of the highlighted path is as follows: 'If the above object is rollable in all directions (1, 1), and the below object is pushable and insertable (1, 1), and the below object is not larger than the above object, e₂ is observed (which is a stacking effect) with 0.959 probability'. The PPDDL description corresponding to this decision path of the tree is shown in Figure 8b; it includes the probabilistic effect distribution (e.g., e₃ with probability 0.041) and the auxiliary actions:

    (:action increaseheight1
      :precondition (and (aux-height) (H0))
      :effect (and (not (H0)) (H1) (not (aux-height))))
    (:action increasestack1
      :precondition (and (aux-instack) (S0))
      :effect (and (not (S0)) (S1) (not (aux-instack))))
    (:action makebase
      :parameters (?obj)
      :precondition (not (base))
      :effect (and (base) (aux-height) (aux-instack)
                   (not (pickloc ?obj)) (stackloc ?obj)))

Figure 9 shows example executions. Top row: the objective is to construct a tower of height five using five objects (H5S5); the system assesses the success probability to be 0.07. Middle row: the objective is H1S4, with an estimated success probability of 0.76. Bottom row: if we change the objective to H4S4, the success probability increases to 0.88.
For our experiments in the tower-building task, we manually introduced some auxiliary predicates, as well as special actions for the domain, to be able to chain multiple actions and count the number of objects in the tower. These are needed to set a goal of constructing a tower with multiple objects, which is outside the experience of the robot. For effects with small ∆x₁ and ∆y₁, the aux-instack predicate is set to true if they satisfy ∆d₁ > ϵ for some threshold ϵ; otherwise, the aux-height predicate is set to true. For this specific application, these predicates allow us to differentiate stacking and inserting, our effects of interest, from other effects. The actions increaseheight1 and increasestack1 are treated as addition operators that increase the height of the tower (H) and the number of objects in the stack (S), respectively. There are multiple H and S predicates ranging from H1-H7 and S1-S7, and likewise multiple increaseheight actions. When the aux-height effect is observed, the planner must select the increaseheight1 action to continue with the plan. Therefore, when a stack effect is observed, the height of the tower (represented by H) increases automatically.
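The predicate assignment described above can be sketched as follows; the threshold ϵ and the lateral-motion tolerance are illustrative values, not the paper's:

```python
def auxiliary_predicate(dx, dy, dd, eps=0.02, motion_tol=0.01):
    """Map an observed effect to an auxiliary predicate: small lateral motion
    with a depth change above eps means the object went *into* the stack
    (aux-instack); small lateral motion without it means it landed *on top*
    (aux-height); anything else sets neither predicate."""
    if abs(dx) < motion_tol and abs(dy) < motion_tol:
        return "aux-instack" if dd > eps else "aux-height"
    return None
```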
These would not be needed if we were to use the numeric values associated with effect clusters (i.e., functions in PDDL), as then we would be able to set arbitrary goals such as ∆d₁ > 0.3 ('make the height of the tower taller than 30 cm'). We proceeded with the atomic effect representation (i.e., with no associated parameters) as it worked better with the mGPT implementation that we used. In future work, we plan to extend the planner so that we can also utilize numeric values associated with effect clusters.
Lastly, the predicate pickloc is true for objects that are on the table and available for use for the tower construction; stackloc is true for the object that is at the top of the tower. These are shown in Figure 8b.

Performance of Planning
The PPDDL descriptions that are automatically constructed from the discovered symbols and rules were verified by generating plans given a set of goals, executing these plans in the simulator, and assessing the success of the executed action sequences in achieving these goals. Concretely, we asked the system to generate plans to create towers of desired heights with a given fixed set of objects. The challenge of the task is to place objects on top of each other in the correct order; with five objects, there are 5! = 120 possible orderings. For plan generation, the mGPT off-the-shelf probabilistic planner (Bonet & Geffner, 2005) with the FF heuristic (Hoffmann & Nebel, 2001) was used.
Since in our experiments, the system is asked to generate plans given a number of objects on the table, we encode the task of, say, "construct a tower with a height of three (H3) using four objects (S4)" as H3S4 (Figure 9).

DeepSym vs. OCEC
We first compare our system with the alternative OCEC system. We train both systems ten times and select the best-performing models (based on the decision tree accuracy). We initialized 20 random problems and asked the planner to solve the task using the two different domain descriptions generated by the two methods. We ran the probabilistic planner 100 times for each problem and recorded the number of successes to estimate the success probability of the plans. The results are reported in Table 2. We report two different metrics: (1) planning success shows whether the system generated a feasible plan, and (2) execution success shows whether the execution was successful. The latter is concerned with the stochasticity of the environment, not with the feasibility of the plan. We see that the OCEC model performs considerably worse (25% planning accuracy) than our approach (95% planning accuracy). This is mainly caused by the wrong classification of the cup object (see Table 1 in Section 5.2), which is an essential piece of information in this problem. When single- and paired-object categories are incorrectly classified, the system generates an invalid problem description, which results in infeasible plans. For the same number of symbols, the learned symbols in the OCEC pipeline do not directly depend on the action and the generated effects, while the symbols learned with our architecture do, as they are directly used for effect prediction. This leads to symbols that are more appropriate for planning.

Table 3: Planning results from 25 executions for each task. To satisfy the H1S4 objective, the robot needs to insert objects inside each other, which is more challenging compared to the other tower tasks. Thus, the success probability is lower.

DeepSym Performance
Now that we have shown the performance gap between the two methods, we want to analyze when our method fails and succeeds. We considered four different goals (towers of heights from 1 to 4) and performed 25 different runs with random initial object configurations for each objective. We configured object types and sizes to have at least one feasible solution. For example, for the H1 objective, we make sure that there are at least three cups that can be physically stacked into each other. The plan execution performance is reported in Table 3. There are three different outcomes: (1) the plan is executed successfully, (2) the planner outputs an erroneous plan due to a recognition error in the encoder, (3) the generated plan is correct, but the plan fails at execution time due to the stochasticity of the environment. We see that the robot constructs towers with a height of four successfully. As the height of the tower decreases, the robot needs to insert some of the objects inside other cups. The insertion task is harder than the stacking task due to the stochasticity of the environment, which is also reflected in the estimated probabilities in Figure 8b; even if the below cup is larger than the above sphere, the insertion probability is 0.894. For example, for the challenging objective of creating an H1 tower including all objects, the system estimates the success probability to be 0.68, and therefore the failure probability to be 0.32. Accordingly, 36% of plans fail at execution time. This shows that the system can partially model the probabilistic nature of the environment. The planning errors are mostly due to the incorrect recognition of the paired-object categories. Example executions are shown in Figure 9.

Deterministic vs. Probabilistic Planning
We also experimented with deterministic planning instead of a probabilistic one. To do so, while converting rules to PDDL, we take the maximum likely effect as the generated effect. For example, if a specific action schema produces effect e 2 with a probability of 0.91 and e 17 with a probability of 0.09, we take the effect with the maximum probability as the generated effect for the action schema. Thus, deterministic planning eliminates the possibility of other effects and, therefore, effectively eliminates possible solutions. When the learned rules faithfully represent the action-effect relations in the environment, we observe no significant difference between probabilistic and deterministic planning in terms of the success of plans. However, when there is significant inaccuracy in the learned representation (e.g., an incorrect comparison between a pair of objects), the probabilistic PDDL description can account for this inaccuracy in the probabilities of effects; the inaccuracies are reflected as the uncertainty of the environment.
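The conversion described above can be sketched as a one-liner over a rule table; the data layout (a mapping from action schemas to effect-probability dictionaries) is an illustrative assumption:

```python
def make_deterministic(rules):
    """Keep only the most likely effect of each probabilistic rule, turning
    the probabilistic PDDL description into a deterministic one as described
    above."""
    return {schema: max(effects, key=effects.get)
            for schema, effects in rules.items()}

# The example from the text: e2 with probability 0.91 wins over e17.
rules = {"stack5": {"e2": 0.91, "e17": 0.09}}
deterministic = make_deterministic(rules)
```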

Experiments on 8-puzzle
In this section, we evaluate DeepSym on the MNIST 8-puzzle adapted from Asai and Fukunaga (2018). In the original 8-puzzle, the aim is to bring the tiles into a specific arrangement (the goal configuration) by moving tiles into the empty square. In the adapted MNIST 8-puzzle version, tiles do not have symbolic values such as digits but instead contain images of digits, and the 0-tile is treated as the empty tile. Given the domain definition, i.e., the knowledge of how the configuration changes in response to slide actions, the 8-puzzle game can be solved with AI planners. However, the problem becomes non-trivial when the states are represented with raw images of the board and the state transitions are not known. In the adapted MNIST 8-puzzle, our system is given the raw image of the board with (3 × 28) × (3 × 28) = 7056 pixels; therefore, the state vector is 7056-dimensional. An instance of the MNIST 8-puzzle is shown in Figure 10. A system that can solve the puzzle should recognize the following:
1. Actions only modify some part of the image (i.e., there are tiles),
2. There are specific symbolic representations in these tiles (i.e., recognize the image content of the tiles),
3. The goal is only valid when these tiles are arranged in a specific order (sorted from left to right and top to bottom).
(1) In the exploration stage, the system initializes a random environment configuration, executes a random action (which is provided to the system), and records a 3-tuple (xₜ, aₜ, eₜ), where xₜ is the current state, aₜ is the executed action represented as a one-hot vector, and eₜ is the generated effect represented as the pixel difference between the new state xₜ₊₁ and the current state xₜ (Figure 11). We collect 100,000 such interactions from the environment.

Figure 10: Two steps of the MNIST 8-puzzle (slide-left, slide-down). The 0-tile is treated as the empty tile. Each tile consists of a 28 × 28-pixel MNIST digit.

Figure 11: The effect is represented as the difference between two timesteps: eₜ = xₜ₊₁ − xₜ.
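The exploration stage can be sketched as follows; the environment interface (reset, random_action, step) is an illustrative assumption, and the toy environment only stands in for the actual 8-puzzle simulator:

```python
import numpy as np

def explore(env, n_samples):
    """Collect (state, action, effect) tuples by random interaction; the
    effect is the pixel difference between consecutive states."""
    data = []
    for _ in range(n_samples):
        x_t = env.reset()              # random board configuration
        a_t = env.random_action()      # one-hot action vector
        x_next = env.step(a_t)
        e_t = x_next - x_t             # effect = x_{t+1} - x_t
        data.append((x_t, a_t, e_t))
    return data

class _ToyEnv:
    """Stand-in environment for illustration (not the 8-puzzle simulator)."""
    def reset(self):
        self.x = np.zeros((4, 4))
        return self.x.copy()
    def random_action(self):
        return np.array([1, 0, 0, 0])
    def step(self, a):
        self.x[0, 0] = 1.0             # pretend the action changed one pixel
        return self.x.copy()

data = explore(_ToyEnv(), n_samples=3)
```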
(2) In the symbol learning stage, we train an encoder-decoder network as in Section 4.2, where the encoder f (x) takes the state vector x as an image of 84 × 84 pixels and outputs a binary vector z, and the decoder g(z, a) takes the concatenation of z and the action vector a to produce the effect e which is also an image of 84 × 84 pixels. Both the encoder and the decoder are convolutional networks. We did not employ any hyperparameter search on the architectures but followed the building principles in DCGAN (Radford et al., 2016). The details of the networks can be found in Appendix A. We train the model for 100 epochs with MSE loss in Equation 3. (3) After training, we distill the information in the decoder network into rules by training a decision tree using the predictions of the decoder. (4) Lastly, we translate the rules represented by the decision tree into PDDL rules as in Section 4.3.

Learned Symbols
In the MNIST 8-puzzle environment, there is a finite set of possible effects that can be generated by a single action from any environment configuration. If we use the same image for a digit as in Asai and Fukunaga (2018), then the encoder should represent 3248 different states (digits that are near the empty tile) in order for the decoder to produce the correct effect. Since we are using binary activations, log₂ 3248 ≈ 11.67 units are necessary to avoid losing any information regarding effects. Therefore, we set the number of units to 13 (giving one more as slack) in this experiment. To understand the symbols that correspond to the low-level subsymbolic representations (i.e., images), we sample 100,000 random states from the environment and get their symbolic representations from the encoder. Then, we take the average of images that correspond to the same symbol. We show the average images that correspond to the top 30 symbols, sorted by their activation counts, in Figure 12. We notice that the first nine symbols correspond to different locations of the empty tile (recall that the digit '0' is considered the empty tile following Asai and Fukunaga, 2018), which accounts for 41.5% of all activations (i.e., 41.5% of the time, the encoder only outputs the position of the empty tile). Other symbols correspond to cases where the digit '3' or '5' is near the empty tile.
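The bottleneck-size arithmetic above is easy to verify:

```python
import math

# Minimum number of binary units needed to distinguish the 3248 local
# configurations, plus the one slack unit used in the experiment.
required_bits = math.ceil(math.log2(3248))  # log2(3248) ~= 11.67
bottleneck_size = required_bits + 1         # one extra unit as slack
```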
In Figure 13, some predicted effects are visualized for a given state and actions together with the ground truth effects. We see that the decoder successfully models the slide of the digit '0'. When we combine the previous state with the predicted effect, we can have an estimate of the next state which is shown in the right column in Figure 13.

Learned Rules
To train a decision tree for rule extraction, we collect the set of training examples as follows. Given the current state xₜ, the encoder generates the corresponding symbol zₜ = f(xₜ), which is then used as input to the decoder together with the one-hot action vector aₜ to predict the effect: ēₜ = g(zₜ, aₜ). Then, we predict the next state x̄ₜ₊₁ by summing the predicted effect ēₜ with the current state (x̄ₜ₊₁ = xₜ + ēₜ) as in Figure 13. Lastly, we use the encoder to generate the symbol z̄ₜ₊₁ that corresponds to x̄ₜ₊₁: z̄ₜ₊₁ = f(x̄ₜ₊₁). The decision tree is trained with {[zₜ; aₜ], z̄ₜ₊₁} input-output pairs. This is even more generic than the robot experiments, where we trained the tree with {[zₜ; aₜ], cₜ} pairs (cₜ being the effect category predicted by the decoder), since it allows us to express the goal using the image modality. In both cases, the fundamental idea is the same: training a decision tree with symbolic input-output pairs in order to learn probabilistic rules.
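The rule-learning step can be illustrated with a simplified frequency-count stand-in for the decision tree: group the symbolic transitions by (state symbol, action) and estimate the successor distribution, which is what the tree's leaves ultimately encode. The data layout and symbol names are ours.

```python
from collections import Counter, defaultdict

def learn_rules(transitions):
    """Estimate P(next symbol | state symbol, action) from symbolic
    (z_t, a_t, z_next) transitions; a simplified stand-in for the
    probabilities read off the decision-tree leaves."""
    counts = defaultdict(Counter)
    for z_t, a_t, z_next in transitions:
        counts[(z_t, a_t)][z_next] += 1
    return {key: {z: n / sum(c.values()) for z, n in c.items()}
            for key, c in counts.items()}

rules = learn_rules([
    ("z_a", "slide-left", "z_b"),
    ("z_a", "slide-left", "z_b"),
    ("z_a", "slide-left", "z_c"),
])
```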
As the last step, we convert the decision paths of the tree into probabilistic PDDL rules as in Section 5.5. As an example, the precondition of a rule translated from a decision path is as follows:

    (:action slide_left5 :precondition (and (not (z9)) (not (z5)) (z3))

Figure 13: Four different effect predictions (left: prediction, right: ground truth) for different actions in a given state; the next state is estimated as State[t+1] = State[t] + Effect[t]. For example, for the 'slide-right' action, in the center, '0' is erased and '5' is painted, and on the left, '5' is erased and '0' is painted.

Planning Examples
Using the generated PPDDL description, our system is requested to output a plan for the goal state from a random initial state. For this, the problem definition (where the current state and the goal state are indicated) is created in PPDDL using the activations of the encoder (see Figure 14).

Figure 14: The generated plan for the goal state from a random initial state, together with the PPDDL problem definition (e.g., a goal conjunction such as (and ... (not (z5)) (z6) (z7) (z8) (not (z9)) (z10) (z11) (z12))).
As shown in the figure, our system was able to find the correct action sequence to reach the given goal configuration. Note that we observed that the output plan only slides the tiles so as to move the empty tile into the correct position. This is a consequence of the fact that the encoded activations do not represent the global state but a local one: the position of the empty tile and its neighbors. One can extend the locality by incorporating multiple-timestep effects of actions in an approach similar to Xu et al. (2021). This can be an advantage or a disadvantage, depending on the context and the problem, which will be discussed in Section 7. As the current formulation cannot capture the global state, we experimented with goals that can be expressed in terms of the local state. For example, in Figure 15, we set two different arbitrary goals that are one step and three steps away from the initial state (the first and the second goal in Figure 15). The planner outputs the correct plan since it can capture the nearby tile information. However, when asked for the third goal in Figure 15, the generated plan only moves the empty tile to the correct position while disregarding the other tiles.
We generated 100 random goal states that are n steps away from the corresponding initial state and report the planning results to quantitatively assess the performance of the method. We also add the results for executing random actions to assess the performance increment. We report the percentage of plans that successfully move the empty tile to the correct position in Table 4. From the results, we see that DeepSym can successfully move the empty tile into the correct position for different plan lengths.

Figure 16: (a) 8-puzzle w/replacement; (b) 15-puzzle w/replacement. In these environments, each digit except '0' may appear more than once.

Scaling-up to 15-puzzle
This section aims to analyze the performance of the system when we scale up the dimensionality of the environment. Examples in the previous section suggest that our system can correctly identify the empty tile, learn the transition based on the empty tile, and make plans to move the empty tile into different positions. We would like to test whether this is the case for larger environments. Therefore, we scale up the 8-puzzle in two different ways: (1) 8-puzzle with replacement (will be denoted as w/r) and (2) 15-puzzle with replacement. Each digit except '0' (the empty tile) may appear more than once in these versions. We train DeepSym with 14 units for 8-puzzle w/r, and with 15 units for 15-puzzle w/r (see Appendix C for details).
The low-level state space and effect space are 112 × 112 = 12544-dimensional for the 15-puzzle w/r, while they stay the same for the 8-puzzle w/r. We used the same convolutional architecture with different paddings to ensure the same output size. The most frequently activated symbols are shown in Figures 19 and 20. We give the planning results for these environments in Figure 17. We see that the system moves the empty tile (by sliding other tiles) to the correct position but disregards the other tiles.

Figure 17: Planning results for 8-puzzle w/r and 15-puzzle w/r (initial and goal states). The arrows denote the movement of the empty tile at each step.

Comparison with Autoencoder
This section aims to compare DeepSym with an autoencoder baseline. We train an autoencoder (as in Asai & Fukunaga, 2018) in the three MNIST n-puzzle environments with the same architecture and the same number of bottleneck units as in DeepSym. Given the bottleneck size, it would be impossible for the autoencoder to encode the whole state space. The most frequently activated symbols for the 8-puzzle, 8-puzzle w/r, and 15-puzzle w/r are shown in Figures 21, 22, and 23, respectively. To compare the planning performance, we train a decision tree for rule learning using the encoder activations; namely, the decision tree is trained with (f(xₜ), f(xₜ₊₁)) input-output pairs, where f is the encoder network. After training, we extracted rules from all paths of the tree and constructed a PPDDL description. The planner failed to produce any plan for random initial and goal states. This is expected since the whole state space cannot be encoded, and therefore some states are not represented correctly in the PPDDL description. One would need to increase the bottleneck size in order to convert the whole state space into PPDDL descriptions. In Asai and Fukunaga (2018), the bottleneck size is set to 25 units (instead of 13 in our experiments) to cover the state space.

Discussion
A plan corresponds to a sequence of actions to move from an initial state to a goal state. One must chain the effects of actions to predict a future state; thus, the effects of actions must be known to generate a plan. Therefore, the capability of knowing the preconditions of actions and predicting their effects is a requirement for generating a successful plan (Konidaris et al., 2014). The main difference between DeepSym and approaches that focus on compressing the state representation (e.g., with an autoencoder, Asai & Fukunaga, 2018; Asai et al., 2022, or with a world model, Hafner et al., 2020) is that the learned representations in DeepSym are due only to the actions and effects of the agent (Taniguchi et al., 2018). Learning symbols based on the capabilities of the agent allows one to filter out details of the environment not related to the agent. On the other hand, compressing the state representation brings its own advantages: one can use a large dataset of states to pre-train an unsupervised model to learn a compact model of the environment, and then use the learned model to train a supervised model for planning or policy learning.
Finding action-independent discrete representations is non-trivial in a large state space, even for the toy examples given in Section 6. In our robot experiments, the autoencoder with discrete units (Asai & Fukunaga, 2018) was shown not to generate a useful representation with a low bottleneck dimension. On the other hand, DeepSym can learn useful and compact representations for planning as it considers actions and effects. For environments that are more realistic for lifelong learning, such as Minecraft (Johnson et al., 2016), the raw state space is virtually infinite, making it difficult to find a minimal set of meaningful discrete representations without taking actions and action effects into account. In contrast, action- and effect-based learning allows for an efficient representation of the state space by filtering out the aspects of the environment not relevant to the actions of the agent. For example, in the 8-puzzle environment, the encoder disregards the tiles not near the empty tile since the generated effect does not depend on them. The learned representation allows for generating plans to move the empty tile to different positions. DeepSym learns the minimal set of representations needed for the effect prediction of actions; therefore, our system learns action-centric representations, i.e., representations that involve the empty tile. State- and action-based methods are two different (possibly complementary) approaches with advantages and disadvantages. For example, if the problem domain is small, or there exists a large-scale pre-trained model of the environment, encoding the whole state space will allow one to solve any encountered problem. However, this approach might be infeasible for larger domains. On the other hand, action-based encoding learns the minimal set of symbols to predict effects at the cost of missing possibly global task requirements (e.g., a specific arrangement in the 8-puzzle).
It is theoretically possible to learn a simpler feature-based representation that will be more computationally efficient when compared with deep networks when state, action, and observed effects are all known (Ugur & Piater, 2015a;Konidaris et al., 2018). However, this approach would need manual feature extraction for newly encountered domains, while a differentiable network that can be automatically tuned offers a more uniform and extendible approach.
One thing we observed is that with the narrow bottleneck size (i.e., 13 units for 3248 configurations), the encoder does not represent all the neighbor configurations that are needed for effect prediction. However, when we increase the bottleneck size, the encoder indeed learns all the necessary configurations. Even if the system successfully encodes all the local states, it would still need to symbolically encode the global state to solve the task globally. One approach to encoding the global state might be extending the locality by considering the effects of multiple timesteps (Xu et al., 2021) or partitioning the state into several chunks and representing the change in these chunks separately.

Conclusion
In this work, we introduced a method that discovers effect- and action-guided object categories, encodes them as discrete symbols, and learns rules that predict action effects. It sustains a general cognitive development progression where symbols are formed, rules are learned, and planning is achieved and verified in execution. Our system contributes to the state-of-the-art by showing the following desirable properties, which have not been achieved/shown simultaneously elsewhere:
• We proposed a generic, single-pipeline neural solution for mapping raw sensorimotor experience into the symbolic domain.
• The proposed network allows progressive learning of increasingly complex abstractions, exploiting previously-learned abstractions as inputs.
• It is gradient-friendly, so it can be incorporated into any gradient-based machine learning system for more complex processing.
• When compared with the continuous bottleneck layer version of our system, i.e., OCEC, our system performs better in effect category formation leading to more successful action planning. This suggests that instead of post-training clustering of the continuous unit outputs, employing discrete units from the beginning is beneficial.
In future work, we plan to scale up the system by augmenting the perceptual capabilities and the action repertoire of the robot. The ad-hoc perceptual system for determining action effects can be replaced by a state-of-the-art computer vision system. Beyond paired-object relations, graph neural networks can be employed to construct relations between varying numbers of objects. Applying the principles of learning and abstraction of this work to less-constrained scenarios will constitute a major step towards AI-enabled, general-purpose robots.

Figure 18: The mean squared error losses for the f₁-g₁ and f₂-g₂ network pairs. In (a), we also plot the paired-object MSE with a single unit for a varying number of units in the bottleneck of f₁, to show how different numbers of units affect the MSE on new observations.

A.2 MNIST 8-puzzle Environment
Network architectures of the encoder and the decoder for the MNIST 8-puzzle environment are given in Tables 7 and 8. For 8-puzzle w/r and 15-puzzle w/r versions, the bottleneck size is changed from 13 to 14 and 15, respectively. To ensure the output size for the 15-puzzle, we change the padding of the third and the fourth convolutional layer in the decoder from 1 to 0.


Appendix B. Using the Straight-Through Estimator
The experiment results with STE on the tabletop environment are reported in Table 9.
Appendix C. The Number of States in 8-puzzle w/r and 15-puzzle w/r

For the 8-puzzle w/r, the number of possible states increases from 9! = 362880 to 9 × 9⁸ = 387420489, an increase by about a factor of 1000. In general, the number of states is n² k^(n²−1), where n stands for the size of the board (3 for the 8-puzzle and 4 for the 15-puzzle), and k is the number of possible digits other than the empty tile. This translates to ≈ 3.29 × 10¹⁵ states for the 15-puzzle w/r. On the other hand, the number of states that