Python Internals Serie : Int (Long) Object
10 Jan 2021In this article, we will read and discuss the implementation details of ints in CPython.
Altough using integers in python is fairly easy :
The implementation file contains more than 5000 code lines.
Since the file contains roughly 116 functions/macros, I’ll probably skip some (most of ?) functions.
Without further ado, let’s dive into the code !
longobject.c
First we need to find a starting point, we know that all pythons objects are stored in cpython/Objects
.
Unfortunetly, we don’t seem to have an intobject.c
file. That’s because the underlying C object for python integers is PyLongObject
.
The file we’re interessted in is : longobject.c
From now one I will refer to python’s integers implementation both using integers or longs, which is not technically accurate since in C those are different types.
For the rest of this article, we’ll revisit some of the most used operations/functions related to integers.
MEDIUM_VALUE(x)
The comment is pretty explanatory, the macro convert a given long to an sdigit. But what’s an sdigit ? Well it depends :
The PYLONG_BITS_IN_DIGIT
is defined either at configure time or in pyport.h
.
We have an assert to protect us from accidently casting a big integer, which is not small enough to fit in an sdigit, to an sdigit, which may result in a loss of information, therefore potentiely issues which can be hard to detect.
Curiously, the assert check that the size of x is bigger than -1, but can be less than 0 ? My guess is that size is unsigned to represent both the size and the sign of the integer, for exemple : PY_SIZE(-15) = -2
.
This seems to be confirmed with Py_SIZE(x) < 0 ? -(sdigit)(x)->ob_digit[0]
.
ob_digit
looks like an array containing our integer. which can be confirmed in the file longintrepr.h
:
One curious fact, that I can’t explain, is the small size of the array (one element ?), maybe it’s somehow changed at runtime.
IS_SMALL_INT(ival)
Pretty straightforward function.
NSMALLPOSINTS
and NSMALLNEGINTS
is defined in the pycore_interp.h
file :
To undesrtand why those 2 magic numbers and where this macro is used, let’s dive into the next function.
static PyObject * get_small_int(sdigit ival)
This function implements what is commonly known as the Flyweight pattern, as explained here : https://python-patterns.guide/gang-of-four/flyweight/, the flyweight pattern is used in python to create at the initialisation phase all the integers in the range [-5, 256]. At the runtime, whenever you ask for a number in this range, you’ll always get the same number.
Which can be easily checked :
Those integers objects are stored in the interpreter state.
According to https://tenthousandmeters.com/blog/python-behind-the-scenes-1-how-the-cpython-vm-works/ :
An interpreter state is a group of threads along with the data specific to this group. Threads share such things as loaded modules (sys.modules), builtins (builtins.dict) and the import system (importlib).
Apparently flyweight integers are stored there.
We also use Py_INCREF
to increment the reference count of the returned integer, recall that reference counting is what is used by the CPython garbage collector to detect which objects to free (Well not just that, since reference counting alone doesn’t resolve circular references, but we’ll discuss the gc in more details in future articles).
maybe_small_long
Straitforward function, we check a given integer is small enough to fit into an sdigit, if it’s the case we downcast it to an sdigit using the MEDIUM_VALUE
. After downcasting we check if the integer is in the flyweight range [-5, 257], if it’s the case we decrement the referece counting (since we don’t need two PyLong objects) and we return a reference on the already allocated number.
How are longs created
Longs are created using the function PyLongObject * _PyLong_New(Py_ssize_t size)
, size here refer to the number of digit of the target long.
Well, looks like we can’t have an indefinely big integer, but how big can our integers be ?
If you don’t already know it, offsetof is a C function wich will basicaly return an offset of the member (ob_digit) from the structure (PyLongObject), if you recall correctly, since our struture only contains PyObject_VAR_HEAD
and digit ob_digit
. so the offset is the memorry taken with VAR_HEAD.
So a integer has a maximal size of roughly Py_SSIZE_T_MAx.
According to this answer : https://stackoverflow.com/a/42777910/14517936 Py_SSIZE_T_MAX = sys.maxsize
, which is according to the official documentation ( https://docs.python.org/3/library/sys.html#sys.maxsize ) equal to 2**31 - 1
on 32 bits machine or 2**63 - 1
.
2**63 - 1
is a huge number of digits.
You can check this limit yourself by doing :
If the size is fine, we allocate enough memory to store our integer :
As a good practice you should ALWAYS check that an allocation had correctly been performed, which may not be the case if you don’t have enough memory for example, your futur self will thank you.
Adding two integers
First we check that the two integers are valid.
One curious thing you may have noticed is the :
Which seems exactly the same as simply :
The purpose of doing this is macros in C are not really smart, the preprocessor will simply do a find/replace and hope for the best. The problem with this can be illustrated with the following macro :
But what happens if you do :
the preprocessor will replace this with :
which is not valid C, since in C if you omit the braces, the if body will only consists of the first instruction which is not what you would expect.
The do while solve this by enclosing all our instructions in one scope.
But why not just use :
As far as i know this is simply a syntaxic choice, since writing MACRO();
is more natural than MACRO()
(not that in the second case we have no semicolon).
The rest is :
- if both a and b are negative we return -|a + b|
- if a is negative but not b we compute b - a
- if b is negative and a is positive we compute a - b
- if both are positive we compute |a + b|
This looks fairly complicated for a simple a + b, so why all the burden ?
Well recall that in Python
integers can be very large (2^63 digits on 64 bits machines), this is achieved by storing the integers in arrays (each element representing a digit).
We dive in the x_add
code we see :
Conclusion
And that’s it, the long_sub
is fairly similar, and you can always (i encourage you to do so) check multiplication and division code.
We rarely stop to think about basic operations like integer operations, the longobject file shows use how complex and optimised long/integers implementation is in python.