Friday, June 27, 2008

So You Want Your Code to Be Asynchronous? A Twisted Interview

Prologue

This blog post was taken from a chat on a Divmod IRC channel couple weeks ago. Let's start with my opening comments to JP about what I hoped we could accomplish in the interview.

[1:47pm] oubiwann:exarkun: developers/users have started to understand Twisted, see the benefits of an async paradigm, and want to start writing their code making the best possible use of twisted's event driven nature
[1:48pm] oubiwann:they know how to write code using deferreds, and they're ready to get writing...
[1:48pm] oubiwann:except they're not
[1:48pm] oubiwann:because they don't know python internals
[1:49pm] oubiwann:they don't know what python can actually be used with deferreds because they don't know what requirements there are for python code that it be non-blocking in the reactor
[1:50pm] oubiwann:so you're going to help us understand the pitfalls
[1:50pm] oubiwann:how to make best guesses
[1:50pm] oubiwann:and where to look to get definitive answers

Change Your Mind


Before we go any further, I want to share a few comments and answer two questions: "Who is this for?" and "What do I need to know for this to mean something to me?" This post is for anyone who wants to write async code with Twisted and the answer to the second question is open-ended.

Let me start with what is often interpreted as effrontery: read the source code. Despite how that may have sounded, it's not another RTFM quip. The Twisted source code was specifically designed to be read (well, the code from the last two years, anyway). It was designed to be read, re-read, absorbed, pondered, and turned into living memes in your brain.

Understanding tricky topics in conceptually dense fields such as mathematics, physics, and advanced programming requires immersion. When we commit to really learning something difficult in programming, when we take the big step and dive in, we are surrounded by code. At a conceptual level, I mean that literally: it is a spacial experience. This is not something that is typically taught... the lucky few are able to do this their on the own; the rest have to slowly build their intuition through experience in order to get comfortable and be productive in code space.

Our school systems tend to train us along very linear lines: there's a right answer, and a wrong answer. Don't rock the boat. Don't make the teacher uncomfortable. Follow the rules, do your homework, and don't ask too many questions. We carry these habits with us into our professional lives, and it can be quite the task to overcome such a mindset.

Experience is multidimensional. Learning is experience, not rules. When you really jump into this stuff, it will surround you. You will have an experience of the code. For me, that is a mental experience akin to looking at something from the perspective of three dimensions versus two. When I've not dedicated myself to understanding a problem, the domain, or the tools of the domain, everything looks very flat to me. It's hard to muddle through. I feel like I have no depth perception and I get easily frustrated.

When I do take the time, when I make the investment of attention and interest, the problem spaces really do become spaces, ones where my mind has a much greater freedom of movement. It's not smart people who do this kind of thing, it's committed people. Your mind is your world and it's up to you to make it what you want. No one on a mail list or IRC channel can do that for you. They can help you with the rules, provide you with valuable moral support, and guide you along the way. However, a direct experience of the code as a living world of mind comes from taking many brave leaps into the unknown.

Interview in a Blender

Jean-Paul Calderone graciously set aside some time to talk with me about creating asynchronous code in Python, particularly, using the Twisted framework. As has been said many times before, simply using Twisted or deferreds doesn't make your code asynchronous. As with any tricky problem, you have to put some time and thought into what you want to accomplish and how you want to accomplish it.

I'm going to post bits of our chat in different sections, but hopefully in a way that makes sense. There's some good information here and some nice reminders. More than anything, though, this should serve as an encouragement to dig deeper.

Why Would I Ever Need Async Code?

There are a couple short answers to that:
  • Your application is doing many long-running computations (or runs of a varying/unpredictable length).
  • Your application runs in an unpredictable environment (in particular, I'm thinking of network communications).
  • Your application needs to handle lots of events
[1:55pm] oubiwann:exarkun: so, what's the first question a developer should ask themselves as they begin writing their Twisted application/library, txFoo?
[1:55pm] dash:"would everyone be better off if I just stopped now?"
[1:55pm] exarkun:oubiwann: I'm not sure I completely understand the target audience yet
[1:56pm] exarkun:my question is kind of like dash's question
[1:56pm] exarkun:why is this person doing this?
[1:57pm] oubiwann:exarkun: the audience is the group of software developers that are new to twisted, have a basic grasp of deferreds, and want their code to be properly async (using Twisted, of course)
[1:57pm] oubiwann:they don't have anything more than a passing familiarity of the reactor
[1:57pm] oubiwann:they don't know python internals

Protocols, Servers, and Clients, Oh My!

If your application can use what's already in Twisted, you're on easy street :-) If not, you may have to write your own protocols.

Let's get back to the chat:

[1:57pm] exarkun:So `foo´ is... a django-based web application?
[1:58pm] exarkun:... a unit conversion library?
[1:58pm] oubiwann:sure, that works
[1:58pm] oubiwann:unit conversion lib
[1:58pm] oubiwann:(which could be used in Django)
[1:58pm] exarkun:at a first guess, I'd say that there's probably no work to do
[1:58pm] exarkun:how could you have a unit conversion library that's not async?
[1:58pm] exarkun:that'd take some work
[1:59pm] oubiwann:let's say that the unit calculations take a really long time to run
[1:59pm] exarkun:Hm. :)
[1:59pm] idnar:you'd probably have to spawn a new process then :P
[2:00pm] exarkun:basically. probably the only other reasonable thing is for twisted-using code to use the unit conversion api with threads.
[2:00pm] exarkun:so then the question to ask "is my code threadsafe?"
[2:00pm] oubiwann:what about a messaging server
[2:00pm] oubiwann:that sends jobs out to different hosts for calcs
[2:01pm] dash:that's not going to be a tiny example
[2:01pm] exarkun:for that, the job is probably to take all the parsing and app logic and make sure it's separate from the i/o
[2:01pm] exarkun:so "am I using the socket/httplib/urllib/ftplib/XXXlib module?"
[2:03pm] exarkun:is another question for the developer to ask himself
[2:06pm] exarkun:they probably need to find the api in twisted that does what they were using a blocking api for, and switch to it
[2:07pm] exarkun:that might mean implementing a protocol, or it might mean using getPage or something
[2:07pm] exarkun:and pushing the async all the way from the bottom up to the top (maybe not in that direction)
[2:08pm] oubiwann:by "bottom" are you referring to protocol/wire-level stuff?
[2:08pm] oubiwann:exarkun: and by "top" their module's API?
[2:09pm] exarkun:yes
[2:10pm] exarkun:oubiwann: the point being, can't have a sync api implemented in terms of an async one (or at least the means by which to do so are probably beyond the scope of this post)

Processes

We didn't really talk about this one. Idnar mentioned spawning processes briefly, but the discussion never really returned there. I imagine that this is fairly well understood and may not merit as much pondering as such things as threads.

Which brings us to...

Threads

Thread safety is the number one concern when trying to provide an asynchronous API for synchronous code. Here are some starters for background information:
Discussing threads consumed the rest of the interview:

[2:12pm] oubiwann:exarkun: so, back to your comment about "is it threadsafe" (if they are doing long-running python calculations)
[2:13pm] oubiwann:what are the problems we face when we don't ask ourselves this question?
[2:13pm] oubiwann:what happens when we try to run non-threadsafe code in the Twisted reactor?
[2:14pm] exarkun:The problem happens when we try to run non-threadsafe code in a thread to keep it from blocking the reactor thread.
[2:16pm] oubiwann:so non-thread safe code run in deferredToThread could...
[2:16pm] oubiwann:have data inconsistencies which cause non-deterministic bugs?
[2:16pm] dash:have the usual effects of running non-threadsafe code
[2:16pm] exarkun:have any problem that using non-thread safe code in a multithreaded way using any other threading api could have
[2:16pm] dash:like that, yeah
[2:17pm] exarkun:inconsistencies, non-determinism, failure only under load (ie, only after you deploy it), etc
[2:18pm] dash:i smell a research paper
[2:18pm] oubiwann:so, next question: how does one determine that python code is thread safe or not?
[2:19pm] glyph:a research *paper*?
[2:19pm] exarkun:heh
[2:19pm] glyph:research *industry* more like
[2:19pm] oubiwann:exarkun: or, if not determine, at least ask the right sorts of questions to get the developer thinking in the right direction
[2:20pm] dash:glyph: Heh heh.
[2:20pm] exarkun:oubiwann: well, is there shared mutable state? if you're calling `f´ in a thread, does it operate on objects not passed to it as arguments?
[2:20pm] exarkun:oubiwann: if not, then it's probably safe - although don't call it twice at the same time with the same arguments
[2:20pm] exarkun:oubiwann: if so, who knows
[2:20pm] dash:with the same mutable arguments, anyway
[2:23pm] oubiwann:exarkun: so, because python and/or the os doesn't do anything to make file operations atomic, I'm assuming that reading and writing file data is not threadsafe?
[2:24pm] exarkun:don't use the same python file object in multiple threads, yes.
[2:24pm] exarkun:but certain filesystem operations are atomic, and you can manipulate the same file from multiple threads (or processes) if you know what you're doing
[2:25pm] oubiwann:what about C extensions in Python? any general rules there?
[2:25pm] oubiwann:other than "if they're threadsafe, you can use them"
[2:25pm] exarkun:that's about all you can say with certainty
[2:26pm] exarkun:for dbapi2 modules, look at the `threadlevel´ attribute. that's about the most general rule you can express.
[2:26pm] exarkun:there's some stuff other than objects that gets shared between threads too that might be worth mentioning
[2:26pm] exarkun:at least to get people to think about non-object state
[2:27pm] oubiwann:such as?
[2:27pm] exarkun:like, process working directory, or uid/gid
[2:30pm] • oubiwann looks at deferToThread...
[2:31pm] • oubiwann looks at reactor.callInThread
[2:33pm] • oubiwann looks at ReactorBase.threadpool
[2:38pm] oubiwann:hrm
[2:38pm] oubiwann:internesting
[2:39pm] oubiwann:never took the time to trace that all the way back to (and then read) the Python threading module
[2:40pm] oubiwann:exarkun: are there any python modules well known for their lack of threadsafety?
[2:42pm] exarkun:oubiwann: I dunno about "well known"
[2:42pm] exarkun:oubiwann: urllib isn't threadsafe
[2:42pm] exarkun:neither is urllib2
[2:43pm] exarkun:apparently random.gauss is not thread-safe?
[2:43pm] exarkun:you generally start with the assumption that any particular api is not thread-safe
[2:44pm] exarkun:and then maybe you can demonstrate to your own satisfaction that it's thread-safe-enough for your purposes
[2:44pm] exarkun:or you can demonstrate that it isn't
[2:45pm] exarkun:grepping the stdlib for 'thread' and 'safe' is interesting
[2:45pm] oubiwann:I wonder if the stuff available in math is threadsafe....
[2:45pm] oubiwann:exarkun: heh, I was just getting ready to dl the source so I could do that :-)
[2:46pm] exarkun:the math module probably is threadsafe
[2:46pm] exarkun:maybe that's another generalization
[2:46pm] exarkun:stdlib C modules are probably threadsafe
[2:49pm] oubiwann:hrm, looks like part of random isn't threadsafe
[2:51pm] oubiwann:random.random() is safe, though
[2:53pm] oubiwann:exarkun: I really appreciate you taking the time to discuss this
[2:53pm] exarkun:np
[2:53pm] oubiwann:and thanks to dash, glyph, and idnar for contributing to the discussion :-)

Summary

Concurrency is hard. If you want to use threads and you want to do it right and you want to avoid pitfalls and have bug-free code, you're going to be doing some head-banging. If you want to use an asynchronous framework like Twisted, you're going to have to bend your mind in a different way.

No matter what school of thought you follow for any given project, the best results will come with full commitment and immersion. Don't fear the learnin' -- embrace the pain ;-)

Update: Special thanks to Piet Delport for sorting out my endless typos!


9 comments:

Unknown said...

Really great article, I think this will be really useful for the intermediate-level developer who wants to become advanced through the Power of Knowledge.

While it is arguably outside of the scope of this post, I'd really like to see you discuss the difference between async via processes and async via threads. Processes: You must marshal your objects. Threads: no marshalling, just careful attention to mutable state.

The purpose of such a post would be to point out the advantages of each, and give a developer, who is about to make that choice, knowledge about how much work each approach will be and where the risks lie.

Duncan McGreggor said...

Cory, I stand corrected. You make an excellent point, and your idea for further discussing processes, specifically in contrast to threads, is a great idea.

Piet mentioned on IRC that this could spawn a whole series of discussions/interviews/posts and would be really useful reference/reminder material for developers.

chrome___ said...

Thanks for the article! Let me add a "me2" for those interested in multiprocess concurrency models instead of multithreading. For one, because it really is too easy to stomp on another threads' data, and for another because of the GIL multithreading doesn't work to well in Python anyway.

One question arising for multiprocessing is process creation, because on many platforms process creation is pretty expensive, but that can usually be solved quite well some kind of pooling scheme. The other one (as cory pointed out) is how to get data from process A to process B. If there is not too much data, it's probably easiest to pickle it and send it over a socket or FD. If there is more data, it would be more efficient if it could be passed using a shared memory mechanism.

Unknown said...

I can not comment on the content of the "article", but I can surely say that it's very hard to read and follow.

Duncan McGreggor said...

Christian,

Can you share more? In what ways is the post hard to read/follow?

Steve said...

> In what ways is the post hard to read/follow?

1. Lose the timestamps. They don't serve any purpose for the reader.

2. Make the structure of the dialog more obvious by using boldface for the protagonists' names (or some similar typographical device).

3. The whole post is written as though the reader will be interested in your thought processes, whereas what they are really interested in is Twisted. Write for them, not yourself.

3. [nobody expects ...] Flag the code that was written in the last two years. I have never been advised to read the code before (though I learn quite a bit reading other people's code on the Twisted lists).

4. While all the internal discussions about what would be a good problem to use as an example are sort of interesting to a motivated reader, they don't sell Twisted to someone who isn't already at least half-convinced of its benefits. So stop writing to the fans: that's just preaching to the choir.

5. The length of the piece is. or appears to be, inversely proportional to the effort that went into making it comprehensible for the reader.

After "Teach Me Twisted" you know I enjoy evangelizing Twisted to the world. But that means adopting the tools of the journalist as well as those of the geek, and mastery of (or at least competency in) that new toolset takes the same kind of effort as learning a new language.

It's OK to edit if the editing process enlarges understanding and removes unnecessary or extraneous detail (hi, Itamar). If brain dumps were enough I'd have used LISP a lot more by now ...

Steve said...

PS: Sorry, I see the names are already, as our dear leader might say, emboldenified.

PPS: Dash was right on target with his interjection. Most new libraries would be better off not published at all, but publishing is way too easy in this day and age.

Duncan McGreggor said...

Steve,

While I agree with the notes and advice you gave me almost completely, there is one exception: this post was *specifically* targeted at Twisted developers. I believe this was mentioned several times in the post.

This is not a post saying "yay, Twisted, we're so cool and awesome!" but rather one saying "so you've gotten this far, and now you're ready to dive into some deeper waters and do some difficult integration."

However, your point is definitely valid and well-taken in another context: if the purpose of writing a post is to increase the size of the community, then I would want to write a very different kind of article.

That being said... if new users know that this sort of discussion *does* take place, then they might find encouragement in the fact that not only will Twisted let them set up a web server in 8 lines of code or process their particular task in a convenient, async manner, but when they are ready for it, Twisted will also help them tackle the really difficult problems, too. Its developers will reach out to them at every stage of their growing expertise in the framework.

Duncan McGreggor said...

Steve,

I take it back -- I did say "new to Twisted" in the interview. Given the subject material, I really should have qualified the target audience better. Something along the lines of:

* New to Twisted with a strong background in event-driven frameworks or working with threads, or
* An intermediate level of expertise with Twisted with an interest in integrating a non-Twisted library into their Twisted code.

My apologies.