Python lets you define what are called default arguments when defining a function. They look like this:
def divide(numerator, denominator=1):
    return numerator / denominator
Now you can divide numbers:
>>> divide(10, 5)
2.0
>>> divide(10)
10.0
You can omit an argument, and Python will use the default you’ve provided. One subtly dangerous thing you can do is specify a default that is a mutable object, like so:
>>> def foo(thing=[]):
...     print(thing)
...     thing.append(1)
...     print(thing)
...
>>> foo()
[]
[1]
>>> foo()
[1]
[1, 1]
>>> foo()
[1, 1]
[1, 1, 1]
>>> foo()
[1, 1, 1]
[1, 1, 1, 1]
The object we set as the default argument is mutated in between calls because the mutable object is instantiated when the function itself is defined. When the function is called without supplying an argument, the object that’s swapped in is the same one every time.
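One way to see this for yourself is to inspect the function's `__defaults__` attribute, which is where Python keeps the default values it evaluated at definition time. The sketch below is only an illustration: this variant of `foo` returns the list instead of printing it, so the results can be compared.

```python
def foo(thing=[]):
    # The default list was created once, when the `def` statement ran.
    thing.append(1)
    return thing

# The default lives on the function object itself.
default_before = foo.__defaults__[0]

first = foo()
second = foo()

# Both calls returned the very same object that is stored in __defaults__.
same_object = first is second is default_before
print(same_object)       # True
print(foo.__defaults__)  # ([1, 1],)
```

Because every call without an argument receives that one stored object, anything appended to it is still there on the next call.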
This can lead to unexpected behavior, because every bit of Python code that calls the foo function will assume that the default argument is an empty list. Users of foo would look at how you’ve defined it and see right there that the list should start out empty!
You may tell yourself, “Well, I know what I’m doing,” or “I wouldn’t be so foolish as to call this function twice in the same process,” or, even more dangerously, “Well, this function is really passed into a job-running framework and called outside my app, so what could possibly go wrong?” The simple truth, however, is that any footgun is more dangerous when you load it with ammunition, and a function with a mutable default argument is quite simply a loaded footgun.
If you write code that can be called incorrectly, someone on your team will call it incorrectly – or you yourself many weeks into the future will call it incorrectly. Any code that has unclear behaviors will cause you headaches later on.
A place where you wouldn’t expect the footgun to go off is in a job-running framework. Celery is a distributed job-running framework first written in 2009, and it has maintained a long and winding tenure of nine years (at the time of this writing). It has been used by thousands of companies for everything from asynchronous jobs that supplement their applications to a cron replacement for running periodic tasks. Celery supports many types of message brokers and can be scaled up by simply adding more worker nodes; workers pull work off the queue when they finish their current task.
Celery provides an easy-to-reason-about decorator interface for marking a function as runnable in a distributed manner, that is, on any worker connected to your message broker. As we have seen above, however, it’s possible to accidentally define a function that stores state between runs, resulting in unexpected behavior.
If we’re following the Getting Started tutorial for Celery, we can add our familiar footgun task like so:
@app.task
def foo(thing=[]):
    thing.append(1)
    return thing
Start up the Celery app in its own terminal with:
celery -A tasks worker --loglevel=info
Now you can open a Python interpreter in another terminal and schedule the task several times to see the obvious memory leak in action:
>>> import tasks
>>> tasks.foo.delay()
<AsyncResult: e5cc611c-316c-48af-8af2-130b81c995be>
>>> tasks.foo.delay()
<AsyncResult: 885c5165-d5a7-439b-a345-ed6a8e2e2c52>
>>> tasks.foo.delay()
<AsyncResult: 631b0a6a-605d-4b1f-89ed-942d95d53e5b>
>>> tasks.foo.delay()
<AsyncResult: 9931cc29-6f82-4417-aa0d-641477ad3987>
>>> tasks.foo.delay()
<AsyncResult: bce60838-f562-4c0b-98f4-c008e4469668>
>>> tasks.foo.delay()
<AsyncResult: c4f1f286-8df7-45f5-8d76-375cf94940af>
>>> tasks.foo.delay()
<AsyncResult: 05a438be-29ca-4cfd-8667-726776a8552e>
In the worker terminal:
Received task: tasks.foo[885c...c52]
Task tasks.foo[885c...c52] succeeded in 0.0001...s: [1]
Received task: tasks.foo[631b...e5b]
Task tasks.foo[631b...e5b] succeeded in 0.0001...s: [1, 1]
Received task: tasks.foo[993...3987]
Task tasks.foo[993...3987] succeeded in 0.0001...s: [1, 1]
Received task: tasks.foo[bce...9668]
Task tasks.foo[bce...9668] succeeded in 0.0001...s: [1, 1, 1]
Received task: tasks.foo[c4f...40af]
Task tasks.foo[c4f...40af] succeeded in 0.0001...s: [1, 1, 1]
Received task: tasks.foo[05a...552e]
Task tasks.foo[05a...552e] succeeded in 0.0001...s: [1, 1, 1, 1]
Celery’s log lines have a lot going on in them, but the part at the end, succeeded in ...: [1, 1, 1, 1], displays the value returned by our job. Every time we call the task, the worker that picks it up accumulates state in its Python process’s memory.
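The repeated values in the log are consistent with the worker running more than one process, each holding its own copy of the default list. That is an assumption, since the log above does not show which process handled each task. A rough plain-Python sketch of that situation, with no Celery involved (`make_task` and `worker_processes` are hypothetical names used only for illustration):

```python
def make_task():
    """Return a fresh copy of the task, standing in for one worker process."""
    def foo(thing=[]):  # a new default list is created each time this def runs
        thing.append(1)
        return thing
    return foo

# Assumption: two worker processes, each with its own copy of the task.
worker_processes = [make_task(), make_task()]

results = []
for job_number in range(6):
    # Hand jobs out round-robin, roughly as a pool of workers might.
    task = worker_processes[job_number % len(worker_processes)]
    results.append(list(task()))  # copy the list so later calls don't alter it

for result in results:
    print(result)
```

Each simulated process keeps growing its own list, so every length shows up once per process before the next length appears.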
This shows how dangerous it is to assume that a pattern is safe to allow in your codebase just because none of your own code explicitly calls that Python function. It’s so much easier to just follow the clear guidelines established by the community to avoid such common pitfalls, even if you think there’s no way these rules apply to you.
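The guideline usually given for this particular pitfall is to use `None` as the default and create the list inside the function body. A minimal sketch, reusing the `foo` name from earlier (the `None` check is the conventional Python idiom, not something specific to Celery):

```python
def foo(thing=None):
    # Build a fresh list on every call that omits the argument.
    if thing is None:
        thing = []
    thing.append(1)
    return thing

print(foo())  # [1]
print(foo())  # [1]
```

A caller who does pass in a list still gets that list mutated in place, which is what the signature suggests; only the hidden shared state is gone.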