Python Interpreter Implementation: Handling the string Type

The PyStringObject data structure is defined as follows:

typedef struct {
    PyObject_VAR_HEAD

    // ob_shash is the hash of the string or -1 if not computed yet.
    long ob_shash;

    // ob_sstate != 0 iff the string object is in stringobject.c's
    //   'interned' dictionary; in this case the two references
    //   from 'interned' to this object are *not counted* in ob_refcnt.
    int ob_sstate;

    // ob_sval contains space for 'ob_size+1' elements.
    // ob_sval[ob_size] == 0.
    char ob_sval[1];

} PyStringObject;

ob_sval marks the start of the character array, which is stored at the tail of the PyStringObject itself and is terminated with \0.
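The "header plus trailing character data in one allocation" layout can be tried out in isolation. The following is only a minimal sketch: ToyString and toy_string_new are hypothetical names, the header fields are simplified stand-ins for PyObject_VAR_HEAD, and it uses a C99 flexible array member (char sval[]) where the CPython struct declares ob_sval[1].

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PyStringObject: a small header followed
   directly by the character data. */
typedef struct {
    size_t size;     /* plays the role of ob_size */
    long   shash;    /* -1 until the hash is computed */
    int    sstate;   /* 0 means "not interned" */
    char   sval[];   /* size + 1 bytes live here (C99 flexible array) */
} ToyString;

/* Allocate the header and the characters as a single block. */
ToyString *toy_string_new(const char *data, size_t size)
{
    assert(data != NULL);
    /* sizeof(ToyString) covers the header only; size + 1 adds room
       for the characters and the terminating '\0'. */
    ToyString *op = malloc(sizeof(ToyString) + size + 1);
    if (op == NULL)
        return NULL;
    op->size = size;
    op->shash = -1;
    op->sstate = 0;
    memcpy(op->sval, data, size);
    op->sval[size] = '\0';   /* mirrors ob_sval[ob_size] == 0 */
    return op;
}
```

Because the characters sit inside the same block as the header, a single free() releases both, and the text is reachable without following a second pointer.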

ob_sstate can take one of the following three states:

#define SSTATE_NOT_INTERNED 0
#define SSTATE_INTERNED_MORTAL 1
#define SSTATE_INTERNED_IMMORTAL 2

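The first marks an object that has not been interned, the second marks an object that has been interned, and I have not yet seen the third one in use.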

The functions that operate on string objects are registered in the type object as follows:

PyTypeObject PyString_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",
    PyStringObject_SIZE,
    sizeof(char),
    string_dealloc,                             /* tp_dealloc */
    (printfunc)string_print,                    /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_compare */
    string_repr,                                /* tp_repr */
    &string_as_number,                          /* tp_as_number */
    &string_as_sequence,                        /* tp_as_sequence */
    &string_as_mapping,                         /* tp_as_mapping */
    (hashfunc)string_hash,                      /* tp_hash */
    0,                                          /* tp_call */
    string_str,                                 /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    &string_as_buffer,                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_CHECKTYPES |
        Py_TPFLAGS_BASETYPE | Py_TPFLAGS_STRING_SUBCLASS |
        Py_TPFLAGS_HAVE_NEWBUFFER,              /* tp_flags */
    string_doc,                                 /* tp_doc */
    0,                                          /* tp_traverse */
    0,                                          /* tp_clear */
    (richcmpfunc)string_richcompare,            /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    string_methods,                             /* tp_methods */
    0,                                          /* tp_members */
    0,                                          /* tp_getset */
    &PyBaseString_Type,                         /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    0,                                          /* tp_init */
    0,                                          /* tp_alloc */
    string_new,                                 /* tp_new */
    PyObject_Del,                               /* tp_free */
};
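To make the role of this table more concrete, here is a heavily reduced sketch of slot-based dispatch. ToyType, ToyObject, toy_hash and toy_repr are all hypothetical names, only two slots are modelled, and the hash function is a simple illustrative one rather than CPython's string hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct ToyObject ToyObject;

/* Hypothetical, heavily reduced counterpart of PyTypeObject. */
typedef struct {
    const char *name;
    unsigned long (*hash)(const ToyObject *);        /* like tp_hash */
    int (*repr)(const ToyObject *, char *, size_t);  /* like tp_repr */
} ToyType;

struct ToyObject {
    const ToyType *type;   /* every object points at its type */
    const char *text;
};

/* Illustrative hash only; this is not CPython's algorithm. */
static unsigned long toy_str_hash(const ToyObject *op)
{
    unsigned long h = 5381;
    for (const char *c = op->text; *c != '\0'; c++)
        h = h * 33 + (unsigned char)*c;
    return h;
}

/* Write a quoted form of the text into buf. */
static int toy_str_repr(const ToyObject *op, char *buf, size_t n)
{
    return snprintf(buf, n, "'%s'", op->text);
}

static const ToyType ToyStr_Type = { "str", toy_str_hash, toy_str_repr };

/* Generic entry points know only the slots, not the concrete type. */
unsigned long toy_hash(const ToyObject *op)
{
    assert(op != NULL && op->type->hash != NULL);
    return op->type->hash(op);
}

int toy_repr(const ToyObject *op, char *buf, size_t n)
{
    assert(op != NULL && op->type->repr != NULL);
    return op->type->repr(op, buf, n);
}
```

The generic functions never mention strings; adding another type in this sketch would only mean filling in another ToyType table, which is the same idea PyString_Type expresses with its tp_ slots.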

The intern mechanism

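Interning is a common mechanism in programming languages; I first ran into it in the JVM. With interning, only one copy of each distinct string is kept in memory. This in turn requires string objects to be immutable.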

/* This dictionary holds all interned strings.  Note that references to
strings in this dictionary are *not* counted in the string's ob_refcnt.
When the interned string reaches a refcnt of 0 the string deallocation
function will delete the reference from this dictionary.

Another way to look at this is that to say that the actual reference
count of a string is:  s->ob_refcnt + (s->ob_sstate?2:0)
*/
static PyObject *interned;

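Python simply uses a dict variable, interned, to hold the intern mapping, with the same PyStringObject serving as both key and value. The references held by interned are not counted in the object's ob_refcnt, however, so, as the comment above says, the true reference count of an interned object is ob_refcnt + 2.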

void
PyString_InternInPlace(PyObject **p)
{
    register PyStringObject *s = (PyStringObject *)(*p);
    PyObject *t;
    if (s == NULL || !PyString_Check(s))
        Py_FatalError("PyString_InternInPlace: strings only please!");
    /* If it's a string subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyString_CheckExact(s))
        return;
    if (PyString_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    t = PyDict_GetItem(interned, (PyObject *)s);
    if (t) {
        Py_INCREF(t);
        Py_DECREF(*p);
        *p = t;
        return;
    }
    if (PyDict_SetItem(interned, (PyObject *)s, (PyObject *)s) < 0) {
        PyErr_Clear();
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The string deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    PyString_CHECK_INTERNED(s) = SSTATE_INTERNED_MORTAL;
}

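Once refcnt drops to 0 the object is no longer in use. There is no separate collection pass to wait for: the string deallocation function (string_dealloc, the tp_dealloc slot above) deletes the entry from interned and frees the object.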
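The lifecycle of an interned string can be imitated with a small toy model. Everything below is an assumption made for illustration: ToyStr, toy_new, toy_decref and toy_intern_in_place are hypothetical names, and a fixed-size array scanned linearly stands in for the interned dict. Unlike a real dict, the toy table never counts its own references, so no counterpart of Py_REFCNT(s) -= 2 is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_CAP 64   /* hypothetical fixed capacity; CPython uses a dict */

typedef struct {
    int   refcnt;
    int   interned;    /* nonzero once the string is in the table */
    char *text;
} ToyStr;

static ToyStr *table[TABLE_CAP];   /* holds uncounted references */

ToyStr *toy_new(const char *text)
{
    size_t n = strlen(text) + 1;
    ToyStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->text = malloc(n);
    if (s->text == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->text, text, n);
    s->refcnt = 1;
    s->interned = 0;
    return s;
}

/* Drop one reference; at zero, unregister from the table and free. */
void toy_decref(ToyStr *s)
{
    assert(s != NULL && s->refcnt > 0);
    if (--s->refcnt > 0)
        return;
    if (s->interned) {
        for (int i = 0; i < TABLE_CAP; i++)
            if (table[i] == s)
                table[i] = NULL;
    }
    free(s->text);
    free(s);
}

/* Replace *p with the canonical copy, registering *p if none exists. */
void toy_intern_in_place(ToyStr **p)
{
    ToyStr *s = *p;
    int free_slot = -1;
    if (s->interned)
        return;
    for (int i = 0; i < TABLE_CAP; i++) {
        if (table[i] == NULL) {
            if (free_slot < 0)
                free_slot = i;
        } else if (strcmp(table[i]->text, s->text) == 0) {
            table[i]->refcnt++;   /* caller now shares the canonical copy */
            toy_decref(s);        /* give up the duplicate */
            *p = table[i];
            return;
        }
    }
    if (free_slot < 0)
        return;                   /* table full: leave it un-interned */
    table[free_slot] = s;         /* not counted in refcnt */
    s->interned = 1;
}
```

Interning two separately created "spam" strings in this model leaves both pointers referring to one object with a count of 2; releasing both removes the table entry as part of deallocation.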

char pool

Python also manages every string of length 1 (a single unsigned char) through a pool:

#if UCHAR_MAX != 255
#error "Python's source code assumes C's unsigned char is an 8-bit type."
#endif

static PyStringObject *characters[UCHAR_MAX + 1];

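The unsigned char value is used directly as the index into characters to reach an object that has already been created. Each pooled object gets one extra reference, so it is not freed before the Python process finishes.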

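characters is not filled in when the interpreter starts. Instead, when PyString_FromString or PyString_FromStringAndSize creates a PyStringObject and finds that its length is 1, the object is stored in characters, much in the spirit of lazy evaluation.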

...
} else if (size == 1) {
    PyObject *t = (PyObject *)op;
    PyString_InternInPlace(&t);
    op = (PyStringObject *)t;
    characters[*str & UCHAR_MAX] = op;
    Py_INCREF(op);
}
...
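The lazy pool can be sketched on its own as follows. ToyChar and toy_char_from are hypothetical names, the array here is a local imitation of the real characters table, and the sketch assumes the pool is consulted before any allocation takes place, which the excerpt above does not show.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct {
    int  refcnt;
    char sval[2];   /* one character plus '\0' */
} ToyChar;

/* One slot per possible unsigned char value, all NULL at start-up. */
static ToyChar *characters[UCHAR_MAX + 1];

/* Return the shared one-character string, creating it on first use. */
ToyChar *toy_char_from(char c)
{
    unsigned char idx = (unsigned char)c;
    ToyChar *op = characters[idx];
    if (op == NULL) {
        op = malloc(sizeof *op);
        if (op == NULL)
            return NULL;
        op->sval[0] = c;
        op->sval[1] = '\0';
        op->refcnt = 1;          /* reference owned by the pool itself */
        characters[idx] = op;    /* filled lazily, on first request */
    }
    assert(op->refcnt >= 1);
    op->refcnt++;                /* reference handed to the caller */
    return op;
}
```

Since the pool keeps its own reference, the count never returns to zero while the program runs, which matches the observation that pooled objects survive until the interpreter exits.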
