Personal website https://benkurtovic.com/

---
layout: post
title: Replacing Objects in Python
tags: Python
description: More reflection than you cared to ask for
draft: true
---

Today, we’re going to demonstrate a fairly evil thing in Python, which I call object replacement.

Say you have some program that’s been running for a while, and a particular object has made its way throughout your code. It lives inside lists, class attributes, maybe even inside some closures. You want to completely replace this object with another one; that is to say, you want to find all references to object A and replace them with object B, enabling A to be garbage collected. This has some interesting implications for special object types. If you have methods that are bound to A, you want to rebind them to B. If A is a class, you want all instances of A to become instances of B. And so on.

“But why on Earth would you want to do that?” you ask. I’ll focus on a concrete use case in a future post, but for now, I imagine this could be useful in some kind of advanced unit testing situation with mock objects. Still, it’s fairly insane, so let’s leave it as primarily an intellectual exercise.

This article is written for CPython 2.7.[1]

Review

First, a recap on terminology here. You can skip this section if you know Python well.

In Python, names are what most languages call “variables”. They reference objects. So when we do:

{% highlight python %}

a = [1, 2, 3, 4]

{% endhighlight %}

...we are creating a list object with four integers, and binding it to the name a. In graph form:[2]

(Graph: the name a pointing to the list object [1, 2, 3, 4].)

In each of the following examples, we are creating new references to the list object, but we are never duplicating it. Each reference points to the same memory address (which you can get using id(a)).

{% highlight python %}

b = a

{% endhighlight %}

{% highlight python %}

c = SomeContainerClass()
c.data = a

{% endhighlight %}

{% highlight python %}

def wrapper(L):
    def inner():
        return L.pop()
    return inner

d = wrapper(a)

{% endhighlight %}

(Graph: a, b, c.data, and L—the latter via the closure of d—all pointing to the same list object [1, 2, 3, 4].)

Note that these references are all equal. a is no more valid a name for the list than b, c.data, or L (from the perspective of inner(), which is exposed to everyone else as d.func_closure[0].cell_contents, but that’s cumbersome and you would never do that in practice). As a result, if you delete one of these references—explicitly with del a, or implicitly if a name goes out of scope—then the other references are still around, and the object continues to exist. If all of an object’s references disappear, then Python’s garbage collector should eliminate it.
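You can watch this reference counting happen with sys.getrefcount() (a quick sketch; the exact counts are CPython implementation details, and the call itself temporarily adds one reference):

{% highlight python %}

import sys

a = [1, 2, 3, 4]
base = sys.getrefcount(a)    # includes the temporary reference the call makes

b = a                        # one new reference
assert sys.getrefcount(a) == base + 1

c = {"data": a}              # references from containers count too
assert sys.getrefcount(a) == base + 2

del b                        # deleting a name drops one reference...
del c["data"]                # ...as does removing it from a container
assert sys.getrefcount(a) == base

{% endhighlight %}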

Dead ends

My first thought when approaching this problem was to physically write over the memory where our target object is stored. This can be done using ctypes.memmove() from the Python standard library:

{% highlight pycon %}

>>> class A(object): pass
...
>>> class B(object): pass
...
>>> obj = A()
>>> print obj
<__main__.A object at 0x10e3e1190>
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), object.__sizeof__(A))
140576340136752
>>> print obj
<__main__.B object at 0x10e3e1190>

{% endhighlight %}

What we are doing here is overwriting the fields of the C struct representing A (a PyTypeObject, since A is a new-style class) with the fields from B’s struct. As a result, they now share various properties, such as their attribute dictionaries (__dict__). So, we can do things like this:

{% highlight pycon %}

>>> B.foo = 123
>>> obj.foo
123

{% endhighlight %}

However, there are clear issues. What we’ve done is create a shallow copy. Therefore, A and B are still distinct objects, so certain changes made to one will not be replicated to the other:

{% highlight pycon %}

>>> A is B
False
>>> B.__name__ = "C"
>>> A.__name__
'B'

{% endhighlight %}

Also, this won’t work if A and B are different sizes, since we will be either reading from or writing to memory that we don’t necessarily own:

{% highlight pycon %}

>>> A = ()
>>> B = []
>>> print A.__sizeof__(), B.__sizeof__()
24 40
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), A.__sizeof__())
4321271888
Python(33575,0x7fff76925300) malloc: *** error for object 0x6f: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

{% endhighlight %}

Oh, and there’s a bit of a problem when we deallocate these objects, too...

{% highlight pycon %}

>>> A = []
>>> B = range(8)
>>> import ctypes
>>> ctypes.memmove(id(A), id(B), A.__sizeof__())
4514685728
>>> print A
[0, 1, 2, 3, 4, 5, 6, 7]
>>> del A
>>> del B
Segmentation fault: 11

{% endhighlight %}

Fishing for references with Guppy

A more correct solution is finding all of the references to the old object, and then updating them to point to the new object, rather than replacing the old object directly.

But how do we track references? Fortunately, there’s a library called Guppy that allows us to do this. Often used for diagnosing memory leaks, it has robust object-tracking features that we can take advantage of here. Install it with pip (pip install guppy).

I’ve always found Guppy hard to use (as many debuggers are, though that is justified by the complexity of the task involved), so we’ll begin with a feature demo before delving into the actual problem.

Feature demonstration

Guppy’s interface is deceptively simple. We begin by calling guppy.hpy(), to expose the Heapy interface, which is the component of Guppy that has the features we want:

{% highlight pycon %}

>>> import guppy
>>> hp = guppy.hpy()
>>> hp
Top level interface to Heapy. Use eg: hp.doc for more info on hp.

{% endhighlight %}

Calling hp.heap() shows us a table of the objects known to Guppy, grouped together (mathematically speaking, partitioned) by type[3] and sorted by how much space they take up in memory:

{% highlight pycon %}

>>> heap = hp.heap()
>>> heap
Partition of a set of 45761 objects. Total size = 4699200 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15547  34  1494736  32   1494736  32 str
     1   8356  18   770272  16   2265008  48 tuple
     2    346   1   452080  10   2717088  58 dict (no owner)
     3  13685  30   328440   7   3045528  65 int
     4     71   0   221096   5   3266624  70 dict of module
     5   1652   4   211456   4   3478080  74 types.CodeType
     6    199   0   210856   4   3688936  79 dict of type
     7   1614   4   193680   4   3882616  83 function
     8    199   0   177008   4   4059624  86 type
     9    124   0   135328   3   4194952  89 dict of class
<91 more rows. Type e.g. '_.more' to view.>

{% endhighlight %}

This object (called an IdentitySet) looks bizarre, but it can be treated roughly like a list. If we want to take a look at strings, we can do heap[0]:

{% highlight pycon %}

>>> heap[0]
Partition of a set of 22606 objects. Total size = 2049896 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  22606 100  2049896 100   2049896 100 str

{% endhighlight %}

This isn’t very useful, though. What we really want to do is re-partition this subset using another relationship. There are a number of options, such as:

{% highlight pycon %}

>>> heap[0].byid     # Group by object ID; each subset therefore has one element
Set of 22606 objects. Total size = 2049896 bytes.
 Index     Size   %   Cumulative  %   Representation (limited)
     0     7480   0.4       7480   0.4 'The class Bi... copy of S.\n'
     1     4872   0.2      12352   0.6 "Support for ... 'error'.\n\n"
     2     4760   0.2      17112   0.8 'Heap queues...at Art! :-)\n'
     3     4760   0.2      21872   1.1 'Heap queues...at Art! :-)\n'
     4     3896   0.2      25768   1.3 'This module ...ng function\n'
     5     3824   0.2      29592   1.4 'The type of ...call order.\n'
     6     3088   0.2      32680   1.6 't\x00\x00|\x...x00|\x02\x00S'
     7     2992   0.1      35672   1.7 'HeapView(roo... size, etc.\n'
     8     2808   0.1      38480   1.9 'Directory tr...ories\n\n '
     9     2640   0.1      41120   2.0 'The class No... otherwise.\n'
<22596 more rows. Type e.g. '_.more' to view.>

{% endhighlight %}

{% highlight pycon %}

>>> heap[0].byrcs    # Group by what types of objects reference the strings
Partition of a set of 22606 objects. Total size = 2049896 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0   6146  27   610752  30    610752  30 types.CodeType
     1   5304  23   563984  28   1174736  57 tuple
     2   4104  18   237536  12   1412272  69 dict (no owner)
     3   1959   9   139880   7   1552152  76 list
     4    564   2   136080   7   1688232  82 function, tuple
     5    809   4    97896   5   1786128  87 dict of module
     6    346   2    71760   4   1857888  91 dict of type
     7    365   2    19408   1   1877296  92 dict of module, tuple
     8    192   1    16176   1   1893472  92 dict (no owner), list
     9    232   1    11784   1   1905256  93 dict of class, function, tuple, types.CodeType
<229 more rows. Type e.g. '_.more' to view.>

{% endhighlight %}

{% highlight pycon %}

>>> heap[0].byvia    # Group by how the strings are related to their referrers
Partition of a set of 22606 objects. Total size = 2049896 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2656  12   420456  21    420456  21 '[0]'
     1   2095   9   259008  13    679464  33 '.co_code'
     2   2095   9   249912  12    929376  45 '.co_filename'
     3    564   2   136080   7   1065456  52 '.func_doc', '[0]'
     4    243   1   103528   5   1168984  57 "['__doc__']"
     5   1930   9   100584   5   1269568  62 '.co_lnotab'
     6    502   2    31128   2   1300696  63 '[1]'
     7    306   1    16272   1   1316968  64 '[2]'
     8    242   1    12960   1   1329928  65 '[3]'
     9    184   1     9872   0   1339800  65 '[4]'
<7323 more rows. Type e.g. '_.more' to view.>

{% endhighlight %}

From this, we can see that the plurality of memory devoted to strings is taken up by those referenced by code objects (types.CodeType represents Python code—accessible from a non-C-defined function through func.func_code—and contains things like the names of its local variables and the actual sequence of opcodes that make it up).

For fun, let’s pick a random string.

{% highlight pycon %}

>>> import random
>>> obj = heap[0].byid[random.randrange(0, heap[0].count)]
>>> obj
Set of 1 object. Total size = 176 bytes.
 Index     Size   %   Cumulative  %   Representation (limited)
     0      176 100.0        176 100.0 'Define names...not listed.\n'

{% endhighlight %}

Interesting. Since this heap subset contains only one element, we can use .theone to get the actual object represented here:

{% highlight pycon %}

>>> obj.theone
'Define names for all type symbols known in the standard interpreter.\n\nTypes that are part of optional modules (e.g. array) are not listed.\n'

{% endhighlight %}

Looks like the docstring for the types module. We can confirm by using .referrers to get the set of objects that refer to objects in the given set:

{% highlight pycon %}

>>> obj.referrers
Partition of a set of 1 object. Total size = 3352 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1 100     3352 100      3352 100 dict of module

{% endhighlight %}

This is types.__dict__ (since the docstring we got is actually stored as types.__dict__["__doc__"]), so if we use .referrers again:

{% highlight pycon %}

>>> obj.referrers.referrers
Partition of a set of 1 object. Total size = 56 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1 100       56 100        56 100 module
>>> obj.referrers.referrers.theone
<module 'types' from '/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/types.pyc'>
>>> import types
>>> types.__doc__ is obj.theone
True

{% endhighlight %}

But why did we find an object in the types module if we never imported it? Well, let’s see. We can use hp.iso() to get the Heapy set consisting of a single given object:

{% highlight pycon %}

>>> hp.iso(types)
Partition of a set of 1 object. Total size = 56 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1 100       56 100        56 100 module

{% endhighlight %}

Using a similar procedure as before, we see that types is imported by the traceback module:

{% highlight pycon %}

>>> hp.iso(types).referrers
Partition of a set of 10 objects. Total size = 25632 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2  20    13616  53     13616  53 dict (no owner)
     1      5  50     9848  38     23464  92 dict of module
     2      1  10     1048   4     24512  96 dict of guppy.etc.Glue.Interface
     3      1  10     1048   4     25560 100 dict of guppy.etc.Glue.Share
     4      1  10       72   0     25632 100 tuple
>>> hp.iso(types).referrers[1].byid
Set of 5 objects. Total size = 9848 bytes.
 Index     Size   %   Cumulative  %   Owner Name
     0     3352  34.0       3352  34.0 traceback
     1     3352  34.0       6704  68.1 warnings
     2     1048  10.6       7752  78.7 __main__
     3     1048  10.6       8800  89.4 abc
     4     1048  10.6       9848 100.0 guppy.etc.Glue

{% endhighlight %}

...and that is imported by site:

{% highlight pycon %}

>>> import traceback
>>> hp.iso(traceback).referrers
Partition of a set of 3 objects. Total size = 15992 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1  33    12568  79     12568  79 dict (no owner)
     1      1  33     3352  21     15920 100 dict of module
     2      1  33       72   0     15992 100 tuple
>>> hp.iso(traceback).referrers[1].byid
Set of 1 object. Total size = 3352 bytes.
 Index     Size   %   Cumulative  %   Owner Name
     0     3352 100.0       3352 100.0 site

{% endhighlight %}

Since site is imported by Python on startup, we’ve figured out why objects from types exist, even though we’ve never used them.

We’ve learned something important, too. When objects are stored as ordinary attributes of a parent object (like types.__doc__, traceback.types, and site.traceback from above), they are not referenced directly by the parent object, but by that object’s __dict__ attribute. Therefore, if we want to replace A with B and A is an attribute of C, we (probably) don’t need to know anything special about C—just how to modify dictionaries.
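We can check this directly with a throwaway container class (a quick sketch, not code from the sections above): the attribute lives in the parent’s __dict__, so replacement is an ordinary dictionary update.

{% highlight python %}

class Container(object):
    pass

class A(object):
    pass

class B(object):
    pass

c = Container()
c.data = A()

# The attribute is stored in c's __dict__, not on c itself:
assert c.__dict__["data"] is c.data

# Replacing it is therefore just a dictionary update:
old, new = c.data, B()
for key, value in list(c.__dict__.items()):
    if value is old:
        c.__dict__[key] = new

assert c.data is new

{% endhighlight %}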

A good Guppy/Heapy tutorial, while a bit old and incomplete, can be found on Andrey Smirnov’s website.

Examining paths

Let’s set up an example replacement using class instances:

{% highlight python %}

class A(object): pass

class B(object): pass

a = A()
b = B()

{% endhighlight %}

Suppose we want to replace a with b. From the demo above, we know that we can get the Heapy set of a single object using hp.iso(). We also know we can use .referrers to get a set of objects that reference the given object:

{% highlight pycon %}

>>> import guppy
>>> hp = guppy.hpy()
>>> print hp.iso(a).referrers
Partition of a set of 1 object. Total size = 1048 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1 100     1048 100      1048 100 dict of module

{% endhighlight %}

a is only referenced by one object, which makes sense, since we’ve only used it in one place—as a local variable—meaning hp.iso(a).referrers.theone must be locals():

{% highlight pycon %}

>>> hp.iso(a).referrers.theone is locals()
True

{% endhighlight %}

However, there is a more useful feature available to us: .pathsin. This also returns references to the given object, but instead of a Heapy set, it is a list of Path objects. These are more useful since they tell us not only what objects are related to the given object, but how they are related.

{% highlight pycon %}

>>> print hp.iso(a).pathsin
 0: Src['a']

{% endhighlight %}

This looks very ambiguous. However, we find that we can extract the source of the reference using .src:

{% highlight pycon %}

>>> path = hp.iso(a).pathsin[0]
>>> print path.src
Partition of a set of 1 object. Total size = 1048 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1 100     1048 100      1048 100 dict of module
>>> path.src.theone is locals()
True

{% endhighlight %}

...and, we can examine the type of relation by looking at .path[1] (the actual reason for this isn’t worth getting into, due to Guppy’s lack of documentation on the subject):

{% highlight pycon %}

>>> relation = path.path[1]
>>> relation
<guppy.heapy.Path.Based_R_INDEXVAL object at 0x100f38230>

{% endhighlight %}

We notice that relation is a Based_R_INDEXVAL object. Sounds bizarre, but this tells us that path.src is related to a by being a particular index value of it. What index? We can get this using relation.r:

{% highlight pycon %}

>>> rel = relation.r
>>> print rel
a

{% endhighlight %}

Ah ha! So now we know that a is the object stored in the reference source at index rel. But what is the reference source? It’s just path.src.theone:

{% highlight pycon %}

>>> path.src.theone[rel] is a
True

{% endhighlight %}

But path.src.theone is just a dictionary, meaning we know how to modify it very easily:

{% highlight pycon %}

>>> path.src.theone[rel] = b
>>> a
<__main__.B object at 0x100dae090>
>>> a is b
True

{% endhighlight %}

Python’s documentation tells us not to modify the locals dictionary, but screw it, we’re gonna do it anyway.
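For context on that warning: it only works here because, at module level and at the interactive prompt, locals() is the actual globals dictionary. Inside a function, CPython hands you a snapshot, so the same trick silently does nothing (a minimal sketch):

{% highlight python %}

def attempt_replace():
    x = "original"
    # Inside a function, locals() is only a snapshot of the frame's
    # variables; writing to it does not rebind x.
    locals()["x"] = "replaced"
    return x

# At module level (and in the interactive interpreter), locals() is the
# real globals dictionary, which is why the replacement above sticks.
assert attempt_replace() == "original"

{% endhighlight %}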

Handling different reference types

[...]

Dictionaries

dicts, class attributes via __dict__, locals()
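A sketch of how the dictionary case might look (replace_in_dict is a hypothetical helper, not code from this post), covering the old object appearing as a value or as a key:

{% highlight python %}

def replace_in_dict(d, old, new):
    # Replace references to old with new, whether old appears as a
    # value or (if hashable) as a key.
    for key in list(d.keys()):          # copy: we mutate while scanning
        value = d[key]
        if value is old:
            d[key] = value = new
        if key is old:
            del d[key]
            d[new] = value

old, new = ("old",), ("new",)           # tuples, so both are hashable
d = {"data": old, old: "keyed by old"}
replace_in_dict(d, old, new)
assert d["data"] is new
assert d[new] == "keyed by old"
assert old not in d

{% endhighlight %}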

Lists

simple replacement
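Lists support index assignment, so the sketch here is direct in-place replacement (note the identity check: we want actual references to the object, not things merely equal to it):

{% highlight python %}

def replace_in_list(lst, old, new):
    # Identity, not equality: only swap actual references to old.
    for i, item in enumerate(lst):
        if item is old:
            lst[i] = new

old, new = ["old"], ["new"]
data = [1, old, "x", old]
replace_in_list(data, old, new)
assert data[1] is new and data[3] is new

{% endhighlight %}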

Tuples

recursively replace parent since immutable
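Since tuples are immutable, the sketch below only rebuilds the tuple; the caller must then replace the old tuple itself wherever it is referenced, recursing if that referrer is also a tuple (rebuild_tuple is a hypothetical helper):

{% highlight python %}

def rebuild_tuple(tup, old, new):
    # Tuples can't be modified, so return a rebuilt copy; the caller
    # must then replace tup itself in its referrers (recursing if a
    # referrer is itself a tuple).
    return tuple(new if item is old else item for item in tup)

old, new = ["old"], ["new"]
parent = {"t": (1, old, 2)}             # the tuple's referrer is mutable
parent["t"] = rebuild_tuple(parent["t"], old, new)
assert parent["t"][1] is new

{% endhighlight %}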

Bound methods

note that built-in methods and regular methods have different underlying C structs, but have the same offsets for their self field
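The struct-offset trick itself isn’t shown here; as a pure-Python sketch of the end result, we can instead build a new method binding the same underlying function to the replacement object via types.MethodType (the classes below are throwaway examples):

{% highlight python %}

import types

class A(object):
    def greet(self):
        return "from " + type(self).__name__

class B(object):
    def greet(self):
        return "from " + type(self).__name__

a, b = A(), B()
bound = a.greet                  # holds a reference to a via __self__
assert bound.__self__ is a

# __self__ is read-only from pure Python, so instead of patching the
# method in place, construct a replacement bound to b:
rebound = types.MethodType(type(b).greet, b)
assert rebound() == "from B"

{% endhighlight %}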

Closure cells

function closures
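Cell contents are read-only from pure Python in 2.7 (they only became writable in CPython 3.7), but the C API’s PyCell_Set is reachable through ctypes on any CPython; a sketch:

{% highlight python %}

import ctypes

def make_getter(obj):
    def get():
        return obj               # obj is kept alive in a closure cell
    return get

old, new = ["old"], ["new"]
get = make_getter(old)
assert get() is old

# The cell hangs off the function object (func_closure in 2.7,
# __closure__ in 3.x); PyCell_Set rewrites its contents:
cell = get.__closure__[0]
ctypes.pythonapi.PyCell_Set(ctypes.py_object(cell), ctypes.py_object(new))
assert get() is new

{% endhighlight %}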

Frames

...

Slots

...

Classes

...
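One building block for this case works in pure Python: assigning to an instance’s __class__ reclasses it in place, provided the two classes have compatible layouts (see “Other cases” below). A sketch with throwaway classes:

{% highlight python %}

class A(object):
    def where(self):
        return "instance of A"

class B(object):
    def where(self):
        return "instance of B"

obj = A()
assert obj.where() == "instance of A"

# Reclass the existing object in place: same identity, new behavior.
obj.__class__ = B
assert obj.where() == "instance of B"
assert isinstance(obj, B)

{% endhighlight %}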

Other cases

Certainly, not every case is handled above, but it seems to cover the vast majority of instances that I’ve found through testing. There are a number of reference relations in Guppy that I couldn’t figure out how to replicate without doing something insane (R_HASATTR, R_CELL, and R_STACK), so some obscure replacements are likely unimplemented.

Some other kinds of replacements are known, but impossible. For example, replacing a class object that uses __slots__ with another class will not work if the replacement class has a different slot layout and instances of the old class exist. More generally, replacing a class with a non-class object won’t work if instances of the class exist. Furthermore, references stored in data structures managed by C extensions cannot be changed, since there’s no good way for us to track these.

Remaining areas to explore include behavior when metaclasses and more complex descriptors are involved. Implementing a more complete version of replace() is left as an exercise for the reader.
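As a starting point for that exercise, here is a minimal, Guppy-free sketch using gc.get_referrers() from the standard library, handling only the dictionary and list cases from above (tuples, cells, bound methods, frames, and slots are all omitted):

{% highlight python %}

import gc

def replace(old, new):
    # Point dict and list references at new instead of old. Everything
    # else (tuples, cells, bound methods, frames) is ignored here.
    for referrer in gc.get_referrers(old):
        if isinstance(referrer, dict):
            for key, value in list(referrer.items()):
                if value is old:
                    referrer[key] = new
        elif isinstance(referrer, list):
            for i, item in enumerate(referrer):
                if item is old:
                    referrer[i] = new

target, replacement = ["old"], ["new"]
holder = {"data": target}
items = [target, 1]
replace(target, replacement)
assert holder["data"] is replacement
assert items[0] is replacement

{% endhighlight %}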

Footnotes

  1. ^ This post relies heavily on implementation details of CPython 2.7. While it could be adapted for Python 3 by examining changes to the internal structures of objects that we used above, that would be a lost cause if you wanted to replicate this on Jython or some other implementation. We are so dependent on concepts specific to CPython that you would need to start from scratch, beginning with a language-specific replacement for Guppy.

  2. ^ The DOT files used to generate graphs in this post are available on Gist.

  3. ^ They’re actually grouped together by clodo (“class or dict object”), which is similar to type, but groups __dict__s separately by their owner’s type.