---
layout: post
title: Replacing Objects in Python
tags: Python
description: More reflection than you cared to ask for
draft: true
---

Today, we're going to demonstrate a fairly evil thing in Python, which I call _object replacement_.

Say you have some program that's been running for a while, and a particular object has made its way throughout your code. It lives inside lists, class attributes, maybe even inside some closures. You want to completely replace this object with another one; that is to say, you want to find all references to object `A` and replace them with object `B`, enabling `A` to be garbage collected. This has some interesting implications for special object types. If you have methods that are bound to `A`, you want to rebind them to `B`. If `A` is a class, you want all instances of `A` to become instances of `B`. And so on.

_But why on Earth would you want to do that?_ you ask. I'll focus on a concrete use case in a future post, but for now, I imagine this could be useful in some kind of advanced unit testing situation with mock objects. Still, it's fairly insane, so let's leave it as primarily an intellectual exercise.

This article is written for [CPython](https://en.wikipedia.org/wiki/CPython) 2.7.[1]

## Review

First, a recap of the terminology. You can skip this section if you know Python well.

In Python, _names_ are what most languages call "variables". They reference _objects_. So when we do:

{% highlight python %}
a = [1, 2, 3, 4]
{% endhighlight %}

...we are creating a list object with four integers, and binding it to the name `a`. In graph form:[2]

[Graph: the name `a` points to the list object `[1, 2, 3, 4]`.]

In each of the following examples, we are creating new _references_ to the list object, but we are never duplicating it. Each reference points to the same memory address (which you can get using `id(a)`).
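As a quick aside before those examples, we can check the claim about shared addresses ourselves. This is only a minimal sketch: the name `dup` is made up for illustration, and treating `id()` as a memory address is a CPython detail rather than a language guarantee.

```python
a = [1, 2, 3, 4]
b = a            # binds a second name to the same list object
dup = list(a)    # hypothetical name: an actual copy, for contrast

# Two names for one object share an identity...
shares_identity = (b is a) and (id(b) == id(a))
# ...while a copy is equal in value but is a different object.
copy_is_distinct = (dup == a) and (dup is not a)

b.append(5)      # a mutation through one name is visible through the other
print(a)         # [1, 2, 3, 4, 5]
print(dup)       # [1, 2, 3, 4]
```

The copy is untouched by the `append`, which is exactly the difference between a new reference and a new object.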
{% highlight python %}
b = a
{% endhighlight %}

{% highlight python %}
c = SomeContainerClass()
c.data = a
{% endhighlight %}

{% highlight python %}
def wrapper(L):
    def inner():
        return L.pop()
    return inner

d = wrapper(a)
{% endhighlight %}

[Graph: the names `a`, `b`, `c.data`, and `L` (inside the closure `d`) all point to the same list object `[1, 2, 3, 4]`.]

Note that these references are all equal. `a` is no more valid a name for the list than `b`, `c.data`, or `L` (as seen from inside `d`; everyone else can reach it as `d.func_closure[0].cell_contents`, but that's cumbersome and you would never do that in practice). As a result, if you delete one of these references (explicitly with `del a`, or implicitly if a name goes out of scope), then the other references are still around, and the object continues to exist. If all of an object's references disappear, then Python's garbage collector should eliminate it.

## Dead ends

My first thought when approaching this problem was to physically write over the memory where our target object is stored. This can be done using [`ctypes.memmove()`](https://docs.python.org/2/library/ctypes.html#ctypes.memmove) from the Python standard library:

{% highlight pycon %}
>>> class A(object): pass
...
>>> class B(object): pass
...
>>> obj = A()
>>> print obj
<__main__.A object at 0x10e3e1190>
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), object.__sizeof__(A))
140576340136752
>>> print obj
<__main__.B object at 0x10e3e1190>
{% endhighlight %}

What we are doing here is overwriting the fields of the `A` instance of the [`PyClassObject` C struct](https://github.com/python/cpython/blob/2.7/Include/classobject.h#L12) with fields from the `B` struct instance. As a result, they now share various properties, such as their attribute dictionaries ([`__dict__`](https://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy)). So, we can do things like this:

{% highlight pycon %}
>>> B.foo = 123
>>> obj.foo
123
{% endhighlight %}

However, there are clear issues.
What we've done is create a [_shallow copy_](https://en.wikipedia.org/wiki/Object_copy#Shallow_copy). Therefore, `A` and `B` are still distinct objects, so certain changes made to one will not be replicated to the other:

{% highlight pycon %}
>>> A is B
False
>>> B.__name__ = "C"
>>> A.__name__
'B'
{% endhighlight %}

Also, this won't work if `A` and `B` are different sizes, since we will be either reading from or writing to memory we don't necessarily own:

{% highlight pycon %}
>>> A = ()
>>> B = []
>>> print A.__sizeof__(), B.__sizeof__()
24 40
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), A.__sizeof__())
4321271888
Python(33575,0x7fff76925300) malloc: *** error for object 0x6f: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
{% endhighlight %}

Oh, and there's a bit of a problem when we deallocate these objects, too...

{% highlight pycon %}
>>> A = []
>>> B = range(8)
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), A.__sizeof__())
4514685728
>>> print A
[0, 1, 2, 3, 4, 5, 6, 7]
>>> del A
>>> del B
Segmentation fault: 11
{% endhighlight %}

## Fishing for references with Guppy

A more correct solution is to find all of the _references_ to the old object and update them to point to the new object, rather than replacing the old object directly.

But how do we track references? Fortunately, there is a library called [Guppy](http://guppy-pe.sourceforge.net/) that allows us to do this. It is often used for diagnosing memory leaks, and we can take advantage of its robust object tracking features here. Install it with [pip](https://pypi.python.org/pypi/pip) (`pip install guppy`).

I've always found Guppy hard to use (as many debuggers are, though that is justified by the complexity of the task involved), so we'll begin with a feature demo before delving into the actual problem.

### Feature demonstration

Guppy's interface is deceptively simple.
We begin by creating an instance of the Heapy interface, which is the component of Guppy that has the features we want:

{% highlight pycon %}
>>> import guppy
>>> hp = guppy.hpy()
{% endhighlight %}

[...]

## Handling different reference types

### Dictionaries

dicts, class attributes via `__dict__`, `locals()`

### Lists

simple replacement

### Tuples

recursively replace parent since immutable

### Bound methods

note that built-in methods and regular methods have different underlying C structs, but have the same offsets for their self field

### Closure cells

function closures

### Frames

...

### Slots

...

### Classes

...

### Other cases

Certainly, not every case is handled above, but the list seems to cover the vast majority of cases that I've found through testing. There are a number of reference relations in Guppy that I couldn't figure out how to replicate without doing something insane (`R_HASATTR`, `R_CELL`, and `R_STACK`), so some obscure replacements are likely unimplemented.

Some other kinds of replacements are known, but impossible. For example, replacing a class object that uses `__slots__` with another class will not work if the replacement class has a different slot layout and instances of the old class exist. More generally, replacing a class with a non-class object won't work if instances of the class exist. Furthermore, references stored in data structures managed by C extensions cannot be changed, since there's no good way for us to track these.

Remaining areas to explore include behavior when metaclasses and more complex descriptors are involved. Implementing a more complete version of `replace()` is left as an exercise for the reader.

## Notes

1. ^ This post relies _heavily_ on implementation details of CPython 2.7. While it could be adapted for Python 3 by examining changes to the internal structures of objects that we used above, that would be a lost cause if you wanted to replicate this on [Jython](http://www.jython.org/) or some other implementation.
We are so dependent on concepts specific to CPython that you would need to start from scratch, beginning with an implementation-specific replacement for Guppy.
2. ^ The [DOT files](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) used to generate graphs in this post are [available on Gist](https://gist.github.com/earwig/edc13f04f871c110eea6).