Saturday, December 27, 2008

Intellectual Property and Open Source


A few months ago, I received a complimentary copy of Van Lindberg's new O'Reilly book Intellectual Property and Open Source: A Practical Guide to Protecting Code, and the first thing that happened at home when the book was unwrapped was that three of us began arguing over who got to read it first.

This may seem like an odd thing to happen for what one could easily assume was a dry and less than interesting topic. However, at the time I was strongly considering the possibility of beginning a non-tech-industry startup built with both open source and proprietary code. The discussions with the potential founders of the startup had been very vigorous and exciting, but the big questions that remained revolved around patents, protecting IP, and providing protection against big business while still offering powerful, free code for use by individuals/private consumers. If you've read the book or even seen the table of contents, you can see why everyone wanted to be the first to read it and learn from the insights provided between its covers.

Instead of jumping into another startup, I ended up joining Canonical; this has kept me both very busy and exceptionally happy. The holiday break has provided an opportunity to finish reading the book, and it has been a delight. I have friends working on startups that depend upon exciting code to power some or all of the business models for their visions, and this book should be on their shelves, close at hand. Even if you're not involved directly with open source and intellectual property, this book is an excellent read.

Intellectual Property and Open Source accomplishes a difficult goal of sharing dense information while making the subject matter engaging. This is done through examples, thought experiments, and well-developed analogies. Van does an excellent job of igniting a powerful curiosity on the part of the reader and then rewarding it with lucid explanations of related laws and perspectives. I am resisting the urge to turn this post into a long series of quotes, but at the very least I want to mention a few little "spoilers" ;-)

The book starts off with an excellent foundation, giving an overview of the origins of intellectual property from an economic and legal perspective. This was particularly useful for me, as I have no background in this field. Van Lindberg does a really great job of expressing some of the widely held (and diverse) views of IP in the open source community.

The book then launches the reader into an array of well-organized chapters on patents, the patent system, trademarks, copyright, trade secrets and licenses. Every open source developer should read chapter 10 on choosing an open source license (the opening dialog had me laughing out loud, a hilarious parody of newsgroup and IRC arguments as well as a nod to The Princess Bride). There's also a chapter dedicated to patches and their relationships to copyright; another on reverse engineering; and the final one provides information and advice on establishing non-profits for open source projects -- the author even gives mention to our friends at the Software Freedom Conservancy (the umbrella non-profit for the Twisted Software Foundation).

In all honesty, I can't rave enough about this book. I've re-read parts of it just because I enjoyed the clarity of the explanations so much. Law is a twisty maze of easily confused subtleties to those who have not been trained in its dark arts. Through explicit language and examples, the author guides us past pitfalls of misunderstanding and brings us directly to all the major points.

If you are an Amazon shopper, you may want to act quickly: last I checked, there were only two copies left.

Enjoy!


Monday, December 15, 2008

Ubuntu Developer Summit


For the past two weeks, I've been listening, learning, discussing, and hacking various Landscape and Ubuntu initiatives with members of the Ubuntu community and fellow Canonical employees. It was an amazing experience, and we've got the next 6 months crammed full of plans... with the next 3 months already spec'ed out.

Canonical has surprised me. It's an extraordinary company... both in the modern business sense of the word as well as the original sense: a fellowship of companions with a common goal. While so far I have only had a chance to hear some personal histories, it's evident that every member of this company is an extraordinary individual with a rich background and a great deal to offer to the whole. Everyone works with an unprecedented amount of motivation towards the company vision, one that is well and tightly integrated into the corporate culture.

There is a bright future ahead for this amazing group...


Monday, December 08, 2008

The State of Graphs in Python


There is a sad need for standardization of graphs in Python. The topic has come up numerous times on various mailing lists, newsgroups, forums, etc. There is even a wiki page dedicated to the discussion of the topic on python.org. Ach, when will the madness end?

As far as I can tell, Guido van Rossum essentially solved this issue 10 years ago when he published his paper on Python Patterns - Implementing Graphs. The graph representation is a simple dict and he provided a few functions for demonstration purposes. In 2004, UC Irvine professor David Eppstein started making public his Python graph-theoretic efforts (with a functional programming approach). Both of these represent a direct approach that appeals to my aesthetic sense.
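For reference, the representation from that paper is just a dictionary mapping each node to the list of nodes it points to, plus a handful of plain functions. Here's a minimal sketch along those lines (the node names are arbitrary, and find_path is adapted from the paper's demonstration functions):

    # A graph as a plain dict: each key is a node, each value is the list
    # of nodes reachable from it.
    graph = {
        'A': ['B', 'C'],
        'B': ['C', 'D'],
        'C': ['D'],
        'D': ['C'],
        'E': ['F'],
        'F': ['C'],
    }

    def find_path(graph, start, end, path=None):
        """Return one path from start to end, or None if there isn't one."""
        path = (path or []) + [start]
        if start == end:
            return path
        for node in graph.get(start, []):
            if node not in path:
                newpath = find_path(graph, node, end, path)
                if newpath:
                    return newpath
        return None

    print(find_path(graph, 'A', 'D'))  # ['A', 'B', 'C', 'D']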

Now, after years of tracking the lack of progress made in standardizing graph representations in Python, I've recently had a strong need for them myself. I did some checking around and found projects that potentially met my needs. Sadly, none of them had the simplicity of Guido's original implementation (and, therefore, its anticipated speed benefits).

I was looking for graph implementations with no cruft, no external dependencies, no afterthoughts. I needed something that balanced runtime performance with a usable API, preferably written in a PEP-8 (or similar) coding style.

Here's what I found, with some notes that I used to make a decision for my own needs:
  • PADS - David Eppstein's work; functional programming style; very strong math; leaves the implementation of the graph up to the developer-user
  • altgraph - too many utility and special-purpose methods for my taste; uses a custom graph object
  • python-graph - a new implementation; uses its own objects; seems to take the "framework" approach to graph implementation
  • graph - requires the use of custom vertex and edge objects
  • NetworkX - fairly complete; lots of redundant code; covers more than just a graph implementation (I only include it here because it seems to be fairly highly used)
If you know me, then you've guessed what's coming next. Yes, I'm going to contribute to the general chaos and announce yet another graph library. What I hope to accomplish with this is to provide a very simple implementation based on Guido van Rossum's approach (dictionary-based) that doesn't consume much memory, can be operated on quickly, and can be used anywhere.

In keeping with this motivation, I've started a new project on Launchpad and named it simple-graph. My initial efforts will be aimed at implementing a dict-based graph per Guido's paper, with the possible inclusion of some of David's functions (updated to operate on a dict object). I will then spend some time taking inspiration from the best of what the other graph libraries have to offer while keeping things simple.

As I stated on the web panel at PyCon 2007, diversity is a good thing; it gives us a rich gene pool from which a full and healthy process of natural selection may occur. Let's hope that the efforts of so many Python programmers eventually lead to the inclusion of a graph object in the Python standard library.


Tuesday, November 11, 2008

Python and Lisp... Again


It seems that Lisp continually comes up in various conversations (virtual and otherwise) in the context of Python. In fact, maybe we could even call such occurrences the Python-Church conversations. Well, here it is again.

Earlier this year I started working on a new project: an object-oriented genetic programming library. I had a bunch of experiments I wanted to do, but I needed to assemble parts of programs in order to do it. I had hoped to use Python, but inspecting Python's AST ended up being too much of a hassle. I wanted to distribute, process, and manage evolutionary algorithms/programs across multiple remote Twisted servers, and manipulating permutations of partial programs would be much easier to integrate with Twisted (the target "platform") if the programs themselves could be evaluated and introspected easily with Python.

After some digging around, I eventually settled on using PyLisp, mostly for the simplicity of the code and the fact that it was just a single file. Since it hadn't been maintained since 2002, I decided to roll the original file into the genetic programming code and then apply any changes as-needed, over time.

More recently, I've wanted to use this modified PyLisp on other projects and, as a result, I have split it out into its own project: pyLisp-NG. This naturally led to further code break-out, for a total of three projects:
  • pyLisp-NG - the functional programming (and introspection) component of the original project
  • Evolver - the code that allows one to do Python-based evolutionary programming (string-based as well as source code tree node optimization/search solution discovery)
  • txEvolver - will enable users to distribute genetic programming operations (such as merging parallel generations of computations)
pyLisp-NG was released earlier today and is available for download on PyPI.


Monday, November 10, 2008

txJSON-RPC

Tonight, I just pushed a new version of txJSON-RPC up to PyPI. Let me know if you have problems with this one, as the last one had some issues with setup.py.

The new cool feature in this version is the serialization available to the bundled jsonrpclib module (which doesn't depend on Twisted code, so anyone can use that). txjsonrpc.jsonrpclib now supports Python datetime -> JSON serialization. The date format is the same as that used by xmlrpclib: YYYYMMDDTHH:MM:SS.
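Just to illustrate the format itself (this uses only the standard library, not the txjsonrpc API):

    from datetime import datetime

    # xmlrpclib-style compact ISO 8601 timestamp
    stamp = datetime(2008, 11, 10, 21, 30, 0).strftime("%Y%m%dT%H:%M:%S")
    print(stamp)  # 20081110T21:30:00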

Enjoy!


Sunday, September 28, 2008

Current and Future Happenings


Sorry there's been so much radio silence at this end lately... a lot has been going on, and it looks like it's going to stay that way for a while. I just need to get used to it and start posting again :-)

Canonical

The big news is that Canonical quite took me by surprise :-) I had planned on doing consulting work again, but I was made an offer of camaraderie, to come join a team at Canonical that I know well, and I couldn't resist. I'm now working on the Landscape team at Canonical, the same folks who brought you the much beloved Storm ORM :-)

Already, I've been working there for two weeks and it's been a delight. They use a lot of the same processes that we did at Divmod and in the Twisted project (in fact, three of us on the team are Twisted developers), so that was very smooth. Another thing that made the transition very easy was the manner in which they engage in a beautiful mix of group discussion and rapid development. The open source community roots at Canonical are very deep... and you can see them very clearly without digging :-)

At Canonical, I've repeatedly run across old friends from my Zope days, from Hacking Society in Colorado, and other places/associations from my past. I am somewhat stunned at the job Canonical has done in acquiring a talented and dedicated workforce. I've never seen a company embrace open source at the level and to the degree that this company does, while at the same time retaining all of the most excellent qualities of the community within the corporate culture. Someone should do a socio-technological/business PhD thesis on these guys...

In preparation for the many (and intense) marathon sprints that this team runs in a year, I've purchased a new laptop. It's the first dedicated Ubuntu dev machine/Desktop I've had... I've been running all my Ubuntu instances as virtual machines in Parallels and VMWare Fusion (or as remote servers at colos and virtual host providers). My love for the Evolution mail client continues to grow and I've found the only reason I miss the Mac is for the automatic handling of sound and to play Spore :-) 

SOA Conference

Now on to some future stuff. I've been invited to speak about dynamic languages (Python) and ultra-large scale (ULS) systems at SOA-India this year in Bangalore. The industry that has grown up around service oriented architectures (SOA) overwhelmingly tends towards Java, so this is a really great sign. I think the efforts that the Java Mothership has made in building bridges with dynamic languages such as Ruby and Python are having a tremendous impact throughout the programming world. I've got an eye on Ted Leung and the Jython team :-)

Anyway, the conference promises to be quite interesting, with speakers from around the world and with diverse backgrounds. I'm expecting to return from Bangalore with a multitude of new ideas and lots of new avenues to explore.

Blog 

Speaking of SOA, I am still working on the second part of the book review for Josuttis' book SOA in Practice. Perhaps before I finish that one, though, I will blog about another O'Reilly title I have been enjoying immensely: Van Lindberg's  Intellectual Property and Open Source. Note that Van has been quite active in the Python community and is contributing his expertise at many levels for the benefit of us all. Regardless, the book is very well written and I will have nothing but good to say about it :-)

After that, I'm going to finish up the draft I have for a blog post on metaclasses, based on notes I took while working with Incredible Pear on the PBS DTV project.

And finally, there have been more requests for me to write about setting up a Twisted Mail server... so, as one reader puts it, I will conclude the telling of that tale in an up-coming post as well :-)



Wednesday, August 27, 2008

netaddr Python Library


I recently got several feature requests for my NetCIDR Python library, and in the course of a conversation with one user in particular, I was made aware of the netaddr project. I took some time to explore the code details and was quite impressed: drkjam did a great job. The manner in which he implemented the many features (especially the IP math) was the kind of thing I wanted to do for NetCIDR ... at some point. After about an hour of digging around, testing out the API, and pondering, I decided to retire NetCIDR and encourage my users to migrate to netaddr.

There are a couple more esoteric features in NetCIDR that netaddr currently doesn't have, but we've started talking about adding support for those in netaddr, at which point there will be no need to use NetCIDR.

To facilitate this, I've added a wiki page on the netaddr Google Code project for helping users make the transition from NetCIDR to the netaddr API.


Tuesday, August 26, 2008

SOA in Practice: A Handbook for Early-Stage ULS Systems (Part 1)


The ULS Series
A Book Review

First off, this is an O'Reilly publication. What's more, if O'Reilly had something like a "criterion collection," this work would be in it. This title is what it says it is, "SOA in Practice: The Art of Distributed System Design." Authored by Nicolai M. Josuttis, this is one of the best-written technical overview works I have ever read, both for writing style and content. For anyone interested in ULS and/or SOA, I have one thing to say: buy this book immediately, with expedited shipping.

I'm not going to write a formal review with pros, cons, deep analysis of its message, etc. However, what I will do is spend some time discussing the crossover from SOA to ULS, covering details with quotes from "SOA in Practice." I will not cover the book in detail and reveal all of its precious nuggets, but I will give a taste of what it has to offer and how it applies to ultra large-scale systems.

Divergence

Since most of what I want to discuss is about what we can gain by taking lessons learned from SOA and applying them to efforts in exploring or prototyping ULS systems, I want to initially outline the stark differences between those systems and SOA.

The most obvious difference is scale. To put things in perspective, imagine implementing a large SOA for a large organization. Imagine the requirements, the project planning, the logistics, the code, the bugs, the setbacks, the short-term failures, and finally, the successful delivery. Now multiply that: two related but semi-autonomous SOA projects. And again, with four. How about a third time for eight?

Any reader with experience in working with large projects is probably having heart palpitations right now (and for that, I apologize). You have first-hand experience of the difficulties and the pain: with a linear increase in the size of a project, there is an exponential increase in the difficulty of managing that project (people, code, timelines, etc.), asymptotically approaching 100% unmanageability, regardless of the amount of resources you throw at the project.

The point just past the asymptote is where ULS systems and SOA meet. In other words, a ULS system as a whole -- by definition -- cannot be built. Such a system can accrete over time, but is simply too large to be designed, built and managed. Rather, it is emergent. Efforts being made in ULS systems research right now are focused on how we can best facilitate that emergence.

Convergence

One of the profound problem solving skills that maths like analytic geometry teach us is understanding potentially intractable problems by examining discrete and meaningful chunks. It's easy to chop something up; it's quite a different matter to chop such that the pieces are useful and provide further insight.

If working with ULS systems is like integrating over the volume of a complex solid in 11-space, then SOAs provide us with the tools of breaking part of that work up into a manageable chunk, one that we can wrap our heads around. Many of the same problems that technicians are anticipated to encounter when working with ULS systems exist at a smaller scale and are well understood within the context of SOA.

And this is where our friend Nicolai's book comes in.

ULS Systems Review

Before we continue, let's take a quick look back at some of the ULS basics laid out by the report of the Software Engineering Institute (SEI) of Carnegie Mellon:
  • ULS systems are systems of systems at internet scale.
  • ULS systems will be interdependent webs of software-intensive systems, people, policies, cultures, and economics.
In order to become a functional reality, these systems will require extensive research in the following areas:
  • Human Interaction
  • Computational Emergence
  • Design
  • Computational Engineering
  • Adaptive System Infrastructure
  • Adaptable and Predictable System Quality
  • Policy, Acquisition, and Management
This means exploring, for use in ultra large-scale systems, such things as potential mechanisms for user interfaces, genetic algorithms/programming, new patterns in systems design, behavioral simulations of system components in varying compositions, decentralized infrastructure, ultra-high availability, and integration with countless third-party support systems. And that's just a very bare minimum.

Intersection

Of those research areas, lessons learned from SOA can be applied to ULS systems research most predominantly in the following areas:
  • Human Interaction
  • Design
  • Adaptive System Infrastructure
  • Adaptable and Predictable System Quality
In Part 2, it is with an eye towards these that I will comment on Nicolai Josuttis' excellent work.


Saturday, August 23, 2008

MySQL, Storm, and Relationships


I rarely work seriously with databases, but I've been building an API for a contract with PBS.org, and though we have DBAs tasked for the project, everyone's pretty busy. So I dusted off my decade-old DB (formerly known as) skills, and did the work myself.

I've worked with the Storm ORM a fair amount since it was released, but only on small projects. Any time I've needed to use relationships with Storm, I've been using SQLite and so it was all faked. Due to the impact of the PBS gig (which is almost done now!), I really needed to sit down and map everything out. The first thing I needed to do was get a quick refresher on MySQL's dialect with regard to foreign keys. The next thing I needed to clarify was exactly how to ensure that what I've been doing with Storm relationships in SQLite was valid for MySQL and suitable for production use at PBS. It was :-)

Given how infrequently I use this stuff, I thought that my notes would be good to document, for future quick-reference. Given that there are likely users out there who would also benefit from this, a blog post seemed a nice way to do this :-)

The SQL below is modified from an example in the MySQL documentation, slightly tweaked to be a smidge more interesting. The two CREATE TABLE statements define the schemas for a one-to-many table relationship:
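Something along these lines captures the shape of it (the exact columns are illustrative, chosen to line up with the Storm classes further down):

    -- one parent row can have many child rows
    CREATE TABLE parent (
        id INT NOT NULL AUTO_INCREMENT,
        name VARCHAR(64),
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    CREATE TABLE child (
        id INT NOT NULL AUTO_INCREMENT,
        parent_id INT,
        name VARCHAR(64),
        PRIMARY KEY (id),
        INDEX par_ind (parent_id),
        FOREIGN KEY (parent_id) REFERENCES parent(id) ON DELETE CASCADE
    ) ENGINE=InnoDB;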

Next, to be able to play with this in Storm, we need to define some classes and set up some references:
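A sketch of those definitions (the class and column names here are illustrative, matching the tables above):

    from storm.locals import Int, Reference, ReferenceSet, Storm, Unicode

    class Parent(Storm):
        __storm_table__ = "parent"
        id = Int(primary=True)
        name = Unicode()

    class Child(Storm):
        __storm_table__ = "child"
        id = Int(primary=True)
        name = Unicode()
        parent_id = Int()
        # maps the child's parent_id column to the parent table's id
        parent = Reference(parent_id, Parent.id)

    # the reverse direction: every Child whose parent_id matches this Parent
    Parent.children = ReferenceSet(Parent.id, Child.parent_id)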

The parent attribute on the Child class is a Storm reference to whatever parent object is associated with the child object that gets created; the parent_id attribute is what is actually mapped to the MySQL field parent_id (which MySQL, in turn, uses to reference the parent table). I hope I just didn't make that more of a confusing mess than it needed to be :-)

The children attribute that gets added to the Parent class is a reference to all Child instances that are associated with a particular Parent instance. I've got some usage below, if that's not clear.

Let's create a parent:
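Roughly like this (the database URI and names are made up for the example):

    from storm.locals import Store, create_database

    database = create_database("mysql://user:password@localhost/example")
    store = Store(database)

    parent = Parent()
    parent.name = u"Mabel"
    store.add(parent)
    store.flush()  # parent.id is now populated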

Note that if you add an __init__ method to your Storm classes, you can save a step or two of typing in these usage examples (see the Storm tutorial for more information).

Next, we'll create and associate a child:
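For instance:

    child = Child()
    child.name = u"Junior"
    store.add(child)
    child.parent = parent  # Storm fills in child.parent_id for us
    store.flush()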

There's more than one way to do this, though, given the way in which Storm has encoded relationships. Above, we created the child and then set the child's parent attribute. Below, we create the child and then use the children reference set's add method to associate it with a parent:
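Something like:

    sibling = Child()
    sibling.name = u"Junior II"
    store.add(sibling)
    parent.children.add(sibling)  # sets sibling.parent_id to parent.id
    store.flush()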

We're doing all that flushing so that the created objects refresh with their new ids.

Lastly, let's take a look at what we've just added to the database:
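A quick look, via either a find or the reference set (again, names match the sketches above):

    print(store.find(Child, Child.parent_id == parent.id).count())
    for child in parent.children:
        print("%s: %s" % (child.id, child.name))
    store.commit()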
And that should just about do it :-)


Thursday, July 31, 2008

New ULS Systems Blog


I'm currently drafting two new ultra large-scale systems blog posts, with one in particular being almost ready to go. While writing more on one of them today, a very cool thing happened: I received an email from the ULS systems team at the Software Engineering Institute of Carnegie Mellon University letting me know that they've got a new ULS blog site up. You can check it out here:

http://ulsblog.wordpress.com/

Be sure to read all the articles and go back often! As you can imagine, I'll be spending a lot of time there :-) I have a feeling this is the beginning of an emerging ULS community...

As for my forthcoming ULS systems blog posts, one concerns SOA and the other discusses currently extant code bases in the Python and Twisted Python communities that can be used for building ultra large-scale systems (or prototypes thereof) quickly and efficiently.


Tuesday, July 29, 2008

New Directions


Yesterday I submitted my resignation as COO to the Divmod officers, and today I forwarded it to the rest of the team. Divmod is headed in a new and wonderful direction, and I'm happy to have contributed to raising public awareness about our team, community, and the tech we use, thus increasing its value in the market. I am taking this opportunity to rest and then pursue interests of my own.

Even more than that contribution, I'm delighted to have worked with these guys for the past year. I've long been a supporter and fan of Divmod (since shortly after its inception, in fact). I was a community contributor before I was an employee and I will remain so for the foreseeable future. But I gotta say, that team is incomparable. The combination of technical excellence, creativity, pragmatic problem solving, quality engineering, humor and insight has made my time there a rich experience. They have made my time at Divmod one for the personal record books.

I'm ready for a break, though; the past year has been a long, hard pull...

I was originally courted by them for management, due to my community work. I deferred, and worked as a coder instead. This ended up being invaluable, as far as the insight it provided. After some early successes with a product release, I was offered the position of CEO, but deferred there too, with the mutual agreement that COO might be a better match for my skills. After a couple months as COO, I was put in charge of managing the direction of the company and raising funds, so I ended up being acting CEO anyway. I poured my heart and soul into Divmod, and it looks like that has paid off: the team is happy and they're headed for some good success. What's more, that leaves me in the enviable position of finally being able to surrender a massive workload :-)

For the next week or two, I'll be camping in the Rocky Mountains catching up on some rest and enjoying nature at her best :-) I've also got some fun sci-fi reading to catch up on (Stross and MacLeod).

When I get back, I'll be exploring 6-12 month consulting contracts... so if anyone hears anything interesting, do let me know!


Wednesday, July 23, 2008

In Memoria: The Great Work


The OSCON Tuesday Night Extravaganza was just fabulous: awards, laughter, brain-bending, and affirmation. The primary speakers were Mark Shuttleworth, r0ml, and Damian Conway; but I'm going to be focusing on r0ml's talk right now :-) Well, in part, anyway.

Let's back up to Monday night, first: Alex Martelli and I had a chance to wax philosophical about programming and software. It was wonderful. Both because it revealed Alex's code-spirit and because of the simpatico I felt as his passionate idealism resonated with mine. While Alex talked of the holy architecture of mosques and cathedrals, of the contributions that such artisans as stonecutters, masons, sculptors, and calligraphers made, he emphasized how each individual played an essential role in bringing these wondrous works into being, that each act was an offering to the ideals that formed the basis of the respective belief system.

What's more, though, Alex extended the analogy from religion to mysticism, saying that even more than builders of such great structures, coders are alchemists engaged in the magnum opus. We are the transmutators. In our crucibles, the opposites of function and beauty unite; performance and elegance are commingled to produce the perfection of our art. Alex was careful to point out that he intended perfection in both an abstract and practical sense. On one hand, being able to create and actually deliver code that others found useful, regardless of the sex appeal (or lack thereof), can be viewed as a form of perfection. It is accomplishment; attainment of the goal. On the other hand, it's just something that someone wanted us to write; it's not a proof of Fermat's Last Theorem. It's useful; it serves a specific function.

Before I get to r0ml's talk, I want to mention UQDS as employed by the Twisted and Divmod communities. I think it's phenomenal and I enjoy working with that system. It's a well thought-out and proven process that tends to produce code of an extremely high quality. However, it's not my natural tendency. I like quick and dirty prototypes; a little messy code goes a long way. I like to throw something out there and then fix it up and apply polish incrementally, as dictated by need.

This is why I've been enjoying the Twisted Community Code project/group on Launchpad. Not only do you have the benefits of using a tool like bazaar that lets one branch other projects on a whim, but you've got a community space to put these explorations, where others can easily see what you're doing, check it out, and try something of their own. (There's a whole 'nother blog post I have coming about that.) However, this finally brings me to r0ml's talk: a new spin on the development process.

For those of you who have seen his phenomenal rhetoric talks, you'd be delighted to see what he did :-) He established a nice mapping from both Microsoft's development process and the one defined by Rational. He used the five canons of classical rhetoric: inventio, dispositio, elocutio, memoria, and pronuntiatio. However, the really brilliant thing was where he started the process: smack in the middle, right where I like to do it :-) And he justified this beautifully. His mapping was the following:
  • Memoria = Commit / Update
  • Pronuntiatio = Run / Use
  • Inventio = Bug Reporting / Patch Submission
  • Dispositio = Triage
  • Elocutio = Integration


The idea here being this: get what you've got done out there and in front of people's eyes. Everyone knows it's crap; don't worry about it. Get it running and get others running it. Work on what matters most and integrate the changes. Repeat and continue.

I like to tease other Twisted devs that I tend not to do test-driven development, but bug-driven testing. What's interesting is that we both start with a requirements doc: for them, it's a development plan; for me, it's a bug/TODO list. The difference is that they then engage in Inventio whereas I start with Memoria. As r0ml said, with this model there is no development, there is only maintenance.

One of the other great things that r0ml mentioned about this process is that it not only gets you the developer started more quickly, it gets others started at the same time. Each programmer is engaged in a macroscopic genetic programming effort: everyone takes the source, mutates it, evolves it, reviews it, and the best implementations (or parts thereof) survive to become the basis for the next generation. Everyone gets to write at the same time; no one is blocked.

This development approach evokes images of philosophers from the Middle Ages sending letters to each other in cryptic alchemical symbols and diagrams, with all the implicit and explicit layers of meaning. I see this methodology as establishing the true foundation of the open source art: a gnostic, spirit-(of-open-source)-ual transformation that brings us to improved states of mind and clarity.

The perfection of our art, whether sublime or mundane, can be merged in the mind of the developer as one... this union being our philosopher's stone. With each release of software engaged in this manner, we iterate the Great Work.

Wednesday, July 16, 2008

OSCON 2008

Hey all, thanks to a friend's amazingly generous offer, I'll be attending OSCON this year :-) I only have to pay for my airfare and food! I've contacted several people already who I know are going to be there (including Van Lindberg of Haynes and Boone and Bradley Kuhn of the SFC and the SFLC), and look forward to meeting up with others. Leave a comment or email me if you're going to be there!


Saturday, July 05, 2008

Native LoadBalancing for Twisted Apps

Yesterday, right before midnight, I tagged the 1.1.0 release of txLoadBalancer on Launchpad after completing the last of the planned features. There are some pretty radical changes that have been developed for this release... and the coolest part is this is just the beginning :-) (See the TODO if you don't believe me!)

You can check out from lp:~oubiwann/txloadbalancer/1.1.0 or download from PyPI. If you're a PyPI expert, I've got some questions for you at the end of this post... Been having some sucky experiences with PyPI lately :-(

So here's what's going on with txLoadBalancer:

Improved API

The biggest thing you'll notice if you're switching from PythonDirector is the massive overhaul the API has undergone. Things are cleaner and generally more modern, with a concise and well-defined module layout.

New Load Balancing Algorithm

I've added support for a weighted host scheduler. Given a weight that represents the frequency with which a host should be used, a host will be randomly selected based on its weight. For example, with two hosts, one having a weight of 1 and the other having a weight of 3, host 2 will be chosen about 75% of the time and host 1 will get about 25% of the requests.
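The idea itself is easy to illustrate; here's a rough, standalone sketch of weighted random selection (this is just the concept, not the actual txLB scheduler code):

    import random

    def choose_host(weighted_hosts):
        """Pick a host at random, proportionally to its weight.

        weighted_hosts is a list of (host, weight) pairs, e.g.
        [("host1", 1), ("host2", 3)] sends ~75% of requests to host2.
        """
        total = sum(weight for _, weight in weighted_hosts)
        point = random.uniform(0, total)
        running = 0
        for host, weight in weighted_hosts:
            running += weight
            if point <= running:
                return host
        return weighted_hosts[-1][0]  # guard against floating-point edge cases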

Right now, this algorithm has to make several calls to other parts of the code in order to get all the data it needs (it also builds some crazy iterators). As such, it's rather slow and performs poorly when compared to the very light-weight least-connections algorithm. That being said, the next release will include optimizations for the weighted scheduler that make use of a Twisted timer and caching.

Native Twisted Load-Balancing

Here's the sexiest part: you can now load-balance your Twisted application by using the txLB API; you don't even need to run the load-balancer as a separate app! This evolved as a feature after a conversation with an as-yet unnamed cloud hosting provider, a follow-up discussion with the Divmod team, and then some quiet pondering about ways in which Twisted applications could be supported in cloud/grid/massively-multi-core architectures.

The "self load-balancing" API in txLB is not a comlete solution for grid-hosting, but it is a first step in one direction (we've been discussing lots of others, too, including the use of our deployment tool).

Before I show you how to use the self load-balancing API, let's take a quick look at a normal Twisted application service:
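This is the standard Twisted fare; a minimal myweb.tac (the port and document root are assumptions) looks something like this:

    # myweb.tac -- a plain Twisted web service, no load balancing involved
    from twisted.application import internet, service
    from twisted.web import server, static

    application = service.Application("myweb")
    site = server.Site(static.File("/var/www/htdocs"))
    webService = internet.TCPServer(7001, site)
    webService.setServiceParent(application)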

You start that with the command twistd -noy myweb.tac. For use with the next example, you can also start two more, one on port 7002 and the other on port 7003.

Now here's what you do to make a self load-balanced app:

As you would expect, you need to indicate the proxy host:port, the algorithm to use, and the hosts that are to be balanced. The host setup assumes that you have three services running on localhost ports 7001, 7002, and 7003. All that's needed now is to just run that code with the usual twistd -noy myapp.tac. Also, for demonstration purposes, this is a somewhat simplified example of what is possible.

This may seem like a lot of extra work when compared to the simple web host above, but think about it: we're load-balancing here :-) This saves you from having to manage yet another application. With a few extra lines of code, you can keep it all in one place and have it manage itself.

Note that this API is in development and continuing to improve. The example above is from code running in trunk. For the more verbose configuration that is in the 1.1.0 release, be sure to see ./bin/txlbWeb.tac from the source tarball. To play with the latest and greatest, you'll want to checkout the code here: lp:txloadbalancer.

Other Goodies

Here is some other good stuff in the release:
  • You can now ssh into a txLB instance and manipulate the load-balancer in real time from an interactive Python interpreter.
  • You can change the proxy to listen on a different port while the application is running (no restart required!).
  • Changes made to the configuration while running are no longer volatile; they are saved to disk (and your old config gets backed up).
  • Work from Apple, Inc. was included in this release, too (they use the old PythonDirector in their Calendaring server). This includes a bug fix and management socket feature.
  • There is a significant jump in performance between this release and the previous one. I believe this to be due to the separation of concerns in the API, but haven't yet confirmed that.

Coming Work

There are a lot of exciting features coming for txLB. Just to name a few:
  • improved weighted algorithm
  • resources-based algorithm (a scheduler that determines the weight of a proxied host by memory, CPU, etc., utilization)
  • smarter proxied host failover and recovery
  • a heartbeat manager
  • txLB-powered application cloning (when started, an app will determine if it needs to run the clone as the managing load-balancer or simply as a proxied host)
  • auto-discovery of balanced hosts
  • proxy fail-over (a balanced host taking over as manager in the event that the manager goes down)
  • ActiveMQ/Stomp integration
  • LDAP/RADIUS authentication

Additionally, I'll be putting together some basic performance metrics contrasting Apache and load-balanced Twisted apps. I will also be comparing previous versions of txLB/PythonDirector with the latest release(s).

Problems with PyPI

I will close this post on a sad note: PyPI used to be an amazing experience for me (a couple years ago, when it was still being called "cheeseshop"). Everything worked as it was supposed to. This hasn't been the case when I've used it recently (over the past few months).

For all that I say about PyPI, I allow for the fact that I may just be missing something, and it may be entirely my fault. That being said, I spent about 3 hours online last night combing through the SIG mailing list, the bug list on SourceForge, and blog posts about setuptools and PyPI, and could find no answers to my questions. Well, with the possible exception of a bug report, but it doesn't look like it was confirmed by a PyPI team member, so I'm not sure if it's valid or not.

Here are my issues:
  • When I upload my project using python setup.py [sdist|bdist_egg] upload, no metadata defined in my setup() function is presented on my package's PyPI page. When I click the metadata link, it's only got three sparse lines.
  • When I manually upload from the package's PKG-INFO itself, all the metadata is presented on the page as it should be, with the exception of the long description. It is in plain text instead of ReST (I am checking that it is valid ReST using distutils settings of reporter.halt_level = 5, reporter.report_level = 1, settings.pep_references = False, and settings.trim_footnote_reference_space = None; these are the same settings that Zope Corp uses to verify the ReST that it uploads to PyPI).
  • When I manually edit the long description in the form, I get the same thing: plain text, no ReST.
  • When I upload a package that is displayed properly on PyPI (such as zc.twist; uploaded as one of my projects by changing the name), I get the same problem (this is why I think it might be something that I'm doing wrong...): no metadata, and when I upload the PKG-INFO manually, no ReST.
Why, oh why, cruel fates, does this not work any more? I used to be able to upload to PyPI without any of these issues...


Thursday, July 03, 2008

Divmod Tech: Making the "Next Gen" Grade

Last night, after I already posted the latest Twisted in the News, I came across another post that would have made the list had I found it sooner. However, this is a good opportunity to give it a little extra attention.

The title of the post is "Next Gen Web Dev: Playing with Python Twisted/Nevow/Athena" and I gotta say, that made my day :-) Between that post and Colin Alston's post that I mentioned in the News, Nevow had a good week. And people are appreciating it for the right reasons. It may not be the easiest web framework to use and certainly not the best documented, but when you need the flexibility to interact with your (Twisted) web server in particular ways as well as benefit from the functionality that COMET provides, Nevow comes out shining.

It's also refreshing to see new developers entering the community who not only see the potential of these tools (designed with that potential in mind) but are capable of taking advantage of it immediately. If nothing else, the author of that post has motivated me to finally merge the Athena tutorial to trunk in order to bring the publicly available and published content in sync with the new code that's in the branch.

Update: Along similar lines, but with more details, Tristan has provided an excellent write-up for this motivation to use Twisted/Nevow/Axiom/Mantissa. Be sure to check it out!

Friday, June 27, 2008

So You Want Your Code to Be Asynchronous? A Twisted Interview

Prologue

This blog post was taken from a chat on a Divmod IRC channel a couple of weeks ago. Let's start with my opening comments to JP about what I hoped we could accomplish in the interview.

[1:47pm] oubiwann:exarkun: developers/users have started to understand Twisted, see the benefits of an async paradigm, and want to start writing their code making the best possible use of twisted's event driven nature
[1:48pm] oubiwann:they know how to write code using deferreds, and they're ready to get writing...
[1:48pm] oubiwann:except they're not
[1:48pm] oubiwann:because they don't know python internals
[1:49pm] oubiwann:they don't know what python can actually be used with deferreds because they don't know what requirements there are for python code that it be non-blocking in the reactor
[1:50pm] oubiwann:so you're going to help us understand the pitfalls
[1:50pm] oubiwann:how to make best guesses
[1:50pm] oubiwann:and where to look to get definitive answers

Change Your Mind


Before we go any further, I want to share a few comments and answer two questions: "Who is this for?" and "What do I need to know for this to mean something to me?" This post is for anyone who wants to write async code with Twisted and the answer to the second question is open-ended.

Let me start with what is often interpreted as effrontery: read the source code. Despite how that may have sounded, it's not another RTFM quip. The Twisted source code was specifically designed to be read (well, the code from the last two years, anyway). It was designed to be read, re-read, absorbed, pondered, and turned into living memes in your brain.

Understanding tricky topics in conceptually dense fields such as mathematics, physics, and advanced programming requires immersion. When we commit to really learning something difficult in programming, when we take the big step and dive in, we are surrounded by code. At a conceptual level, I mean that literally: it is a spatial experience. This is not something that is typically taught... the lucky few are able to do this on their own; the rest have to slowly build their intuition through experience in order to get comfortable and be productive in code space.

Our school systems tend to train us along very linear lines: there's a right answer, and a wrong answer. Don't rock the boat. Don't make the teacher uncomfortable. Follow the rules, do your homework, and don't ask too many questions. We carry these habits with us into our professional lives, and it can be quite the task to overcome such a mindset.

Experience is multidimensional. Learning is experience, not rules. When you really jump into this stuff, it will surround you. You will have an experience of the code. For me, that is a mental experience akin to looking at something from the perspective of three dimensions versus two. When I've not dedicated myself to understanding a problem, the domain, or the tools of the domain, everything looks very flat to me. It's hard to muddle through. I feel like I have no depth perception and I get easily frustrated.

When I do take the time, when I make the investment of attention and interest, the problem spaces really do become spaces, ones where my mind has a much greater freedom of movement. It's not smart people who do this kind of thing, it's committed people. Your mind is your world and it's up to you to make it what you want. No one on a mail list or IRC channel can do that for you. They can help you with the rules, provide you with valuable moral support, and guide you along the way. However, a direct experience of the code as a living world of mind comes from taking many brave leaps into the unknown.

Interview in a Blender

Jean-Paul Calderone graciously set aside some time to talk with me about creating asynchronous code in Python, particularly, using the Twisted framework. As has been said many times before, simply using Twisted or deferreds doesn't make your code asynchronous. As with any tricky problem, you have to put some time and thought into what you want to accomplish and how you want to accomplish it.

I'm going to post bits of our chat in different sections, but hopefully in a way that makes sense. There's some good information here and some nice reminders. More than anything, though, this should serve as an encouragement to dig deeper.

Why Would I Ever Need Async Code?

There are a few short answers to that:
  • Your application is doing many long-running computations (or runs of a varying/unpredictable length).
  • Your application runs in an unpredictable environment (in particular, I'm thinking of network communications).
  • Your application needs to handle lots of events.
[1:55pm] oubiwann:exarkun: so, what's the first question a developer should ask themselves as they begin writing their Twisted application/library, txFoo?
[1:55pm] dash:"would everyone be better off if I just stopped now?"
[1:55pm] exarkun:oubiwann: I'm not sure I completely understand the target audience yet
[1:56pm] exarkun:my question is kind of like dash's question
[1:56pm] exarkun:why is this person doing this?
[1:57pm] oubiwann:exarkun: the audience is the group of software developers that are new to twisted, have a basic grasp of deferreds, and want their code to be properly async (using Twisted, of course)
[1:57pm] oubiwann:they don't have anything more than a passing familiarity of the reactor
[1:57pm] oubiwann:they don't know python internals

Protocols, Servers, and Clients, Oh My!

If your application can use what's already in Twisted, you're on easy street :-) If not, you may have to write your own protocols.

Let's get back to the chat:

[1:57pm] exarkun:So `foo´ is... a django-based web application?
[1:58pm] exarkun:... a unit conversion library?
[1:58pm] oubiwann:sure, that works
[1:58pm] oubiwann:unit conversion lib
[1:58pm] oubiwann:(which could be used in Django)
[1:58pm] exarkun:at a first guess, I'd say that there's probably no work to do
[1:58pm] exarkun:how could you have a unit conversion library that's not async?
[1:58pm] exarkun:that'd take some work
[1:59pm] oubiwann:let's say that the unit calculations take a really long time to run
[1:59pm] exarkun:Hm. :)
[1:59pm] idnar:you'd probably have to spawn a new process then :P
[2:00pm] exarkun:basically. probably the only other reasonable thing is for twisted-using code to use the unit conversion api with threads.
[2:00pm] exarkun:so then the question to ask "is my code threadsafe?"
[2:00pm] oubiwann:what about a messaging server
[2:00pm] oubiwann:that sends jobs out to different hosts for calcs
[2:01pm] dash:that's not going to be a tiny example
[2:01pm] exarkun:for that, the job is probably to take all the parsing and app logic and make sure it's separate from the i/o
[2:01pm] exarkun:so "am I using the socket/httplib/urllib/ftplib/XXXlib module?"
[2:03pm] exarkun:is another question for the developer to ask himself
[2:06pm] exarkun:they probably need to find the api in twisted that does what they were using a blocking api for, and switch to it
[2:07pm] exarkun:that might mean implementing a protocol, or it might mean using getPage or something
[2:07pm] exarkun:and pushing the async all the way from the bottom up to the top (maybe not in that direction)
[2:08pm] oubiwann:by "bottom" are you referring to protocol/wire-level stuff?
[2:08pm] oubiwann:exarkun: and by "top" their module's API?
[2:09pm] exarkun:yes
[2:10pm] exarkun:oubiwann: the point being, can't have a sync api implemented in terms of an async one (or at least the means by which to do so are probably beyond the scope of this post)

Processes

We didn't really talk about this one. Idnar mentioned spawning processes briefly, but the discussion never really returned there. I imagine that this is fairly well understood and may not merit as much pondering as such things as threads.

Which brings us to...

Threads

Thread safety is the number one concern when trying to provide an asynchronous API for synchronous code. Here are some starters for background information:
Discussing threads consumed the rest of the interview:

[2:12pm] oubiwann:exarkun: so, back to your comment about "is it threadsafe" (if they are doing long-running python calculations)
[2:13pm] oubiwann:what are the problems we face when we don't ask ourselves this question?
[2:13pm] oubiwann:what happens when we try to run non-threadsafe code in the Twisted reactor?
[2:14pm] exarkun:The problem happens when we try to run non-threadsafe code in a thread to keep it from blocking the reactor thread.
[2:16pm] oubiwann:so non-thread safe code run in deferredToThread could...
[2:16pm] oubiwann:have data inconsistencies which cause non-deterministic bugs?
[2:16pm] dash:have the usual effects of running non-threadsafe code
[2:16pm] exarkun:have any problem that using non-thread safe code in a multithreaded way using any other threading api could have
[2:16pm] dash:like that, yeah
[2:17pm] exarkun:inconsistencies, non-determinism, failure only under load (ie, only after you deploy it), etc
[2:18pm] dash:i smell a research paper
[2:18pm] oubiwann:so, next question: how does one determine that python code is thread safe or not?
[2:19pm] glyph:a research *paper*?
[2:19pm] exarkun:heh
[2:19pm] glyph:research *industry* more like
[2:19pm] oubiwann:exarkun: or, if not determine, at least ask the right sorts of questions to get the developer thinking in the right direction
[2:20pm] dash:glyph: Heh heh.
[2:20pm] exarkun:oubiwann: well, is there shared mutable state? if you're calling `f´ in a thread, does it operate on objects not passed to it as arguments?
[2:20pm] exarkun:oubiwann: if not, then it's probably safe - although don't call it twice at the same time with the same arguments
[2:20pm] exarkun:oubiwann: if so, who knows
[2:20pm] dash:with the same mutable arguments, anyway
[2:23pm] oubiwann:exarkun: so, because python and/or the os doesn't do anything to make file operations atomic, I'm assuming that reading and writing file data is not threadsafe?
[2:24pm] exarkun:don't use the same python file object in multiple threads, yes.
[2:24pm] exarkun:but certain filesystem operations are atomic, and you can manipulate the same file from multiple threads (or processes) if you know what you're doing
[2:25pm] oubiwann:what about C extensions in Python? any general rules there?
[2:25pm] oubiwann:other than "if they're threadsafe, you can use them"
[2:25pm] exarkun:that's about all you can say with certainty
[2:26pm] exarkun:for dbapi2 modules, look at the `threadlevel´ attribute. that's about the most general rule you can express.
[2:26pm] exarkun:there's some stuff other than objects that gets shared between threads too that might be worth mentioning
[2:26pm] exarkun:at least to get people to think about non-object state
[2:27pm] oubiwann:such as?
[2:27pm] exarkun:like, process working directory, or uid/gid
[2:30pm] • oubiwann looks at deferToThread...
[2:31pm] • oubiwann looks at reactor.callInThread
[2:33pm] • oubiwann looks at ReactorBase.threadpool
[2:38pm] oubiwann:hrm
[2:38pm] oubiwann:internesting
[2:39pm] oubiwann:never took the time to trace that all the way back to (and then read) the Python threading module
[2:40pm] oubiwann:exarkun: are there any python modules well known for their lack of threadsafety?
[2:42pm] exarkun:oubiwann: I dunno about "well known"
[2:42pm] exarkun:oubiwann: urllib isn't threadsafe
[2:42pm] exarkun:neither is urllib2
[2:43pm] exarkun:apparently random.gauss is not thread-safe?
[2:43pm] exarkun:you generally start with the assumption that any particular api is not thread-safe
[2:44pm] exarkun:and then maybe you can demonstrate to your own satisfaction that it's thread-safe-enough for your purposes
[2:44pm] exarkun:or you can demonstrate that it isn't
[2:45pm] exarkun:grepping the stdlib for 'thread' and 'safe' is interesting
[2:45pm] oubiwann:I wonder if the stuff available in math is threadsafe....
[2:45pm] oubiwann:exarkun: heh, I was just getting ready to dl the source so I could do that :-)
[2:46pm] exarkun:the math module probably is threadsafe
[2:46pm] exarkun:maybe that's another generalization
[2:46pm] exarkun:stdlib C modules are probably threadsafe
[2:49pm] oubiwann:hrm, looks like part of random isn't threadsafe
[2:51pm] oubiwann:random.random() is safe, though
[2:53pm] oubiwann:exarkun: I really appreciate you taking the time to discuss this
[2:53pm] exarkun:np
[2:53pm] oubiwann:and thanks to dash, glyph, and idnar for contributing to the discussion :-)

Summary

Concurrency is hard. If you want to use threads and you want to do it right and you want to avoid pitfalls and have bug-free code, you're going to be doing some head-banging. If you want to use an asynchronous framework like Twisted, you're going to have to bend your mind in a different way.

No matter what school of thought you follow for any given project, the best results will come with full commitment and immersion. Don't fear the learnin' -- embrace the pain ;-)

Update: Special thanks to Piet Delport for sorting out my endless typos!


Wednesday, June 25, 2008

Safari 3.1.1 Installer Hosed on OS X 10.5.3

I recently tried updating my Safari to the latest version, only to discover from here and here that Apple seems to have intentionally made this a 10.5.2-only update. I looked in the "Distribution" script and confirmed that this was, in fact, the case. The obvious symptom of this was that the installer told me I couldn't install Safari on any of my drives. Nice.

On those forum posts, I also discovered this great tool: Pacifist. It's been on my backburner list for a while to find a tool that could open up and extract Mac OS X packages, so for that alone I was delighted. When combined with PackageMaker, I was able to create my own installer. Even better.

If this is useful for anyone else, I've put it up here: Safari311UpdLeo_Divmod.pkg. Do note, however, that this installer has no brains: it just puts the files where they should be. It also doesn't check for your system version, so it could potentially really screw things up. Neither I, the Divmod community, nor Divmod, Inc. are responsible in any way if this installer takes your machine to the knacker's yard. However, I am using it on 10.5.3 with no issues (so far).


Saturday, June 21, 2008

txLoadBalancer

Well today was a flurry of activity... pulled an all-nighter whipping a python load balancer into shape after some late-afternoon discussions on #divmod.

At Divmod, we're going to be labbing out some distributed services experiments with twistd servers, and one set of those experiments involves "developer friendly" load balancing. JP suggested that I take a look at how PyDirector works and see if we could use that. Which was actually interesting in a full-circle kind of way: I worked on PyDirector when I was at PBS, ages ago, where I wrote a weighted lb algorithm for it.

Jumping into the code again after a 5-year hiatus was like seeing an old friend :-)

All tonight, I worked on the following branches:
txLoadBalancer 0.9.1 and 1.0.1 are up on PyPI in the usual place.

I did lots of manual functional testing for each branch tonight, but I didn't do any TDD. While I'm still playing with it, I'll probably start adding tests as bugs crop up (BDT), and as it gets more serious I'll go fully into TDD and fill in what's missing at that point.

Tonight's mad rush was actually a great deal of fun. It's been a while since I've had the opportunity to plow through a bunch of code like that, and I enjoyed myself to near exhaustion :-) I don't think I'll be able to get to sleep tonight (er, this morning), due to the endless thinking about all the ways in which I want to use this code, mutate it, and... well, I better leave some surprises for later!

Update: I've edited the links for the latest micro-releases that fixed some issues with setup.py.

Update 2: Thanks to the heads-up in the comments from Kapil, I've patched txLoadBalancer trunk with the changes from Apple (David Reid and Wilfredo Sanchez).


Friday, June 20, 2008

Async Batching with Twisted: A Walkthrough

While drafting a Divmod announcement last week, I had a quick chat with a dot-bomb-era colleague of mine. Turns out, his team wants to do some cool asynchronous batching jobs, so he's taking a look at Twisted. Because he's a good guy and I like Twisted, I drew up some examples for him that should get him jump-started. Each example covers something in more depth than its predecessor, so the set is probably generally useful. Thus this blog post :-)

I didn't get a chance to show him a DeferredSemaphore example nor one for the Cooperator, so I will take this opportunity to do so. For each of the examples below, you can save the code as a text file and call it with "python filename.py", and the output will be displayed.

These examples don't attempt to give any sort of introduction to the complexities of asynchronous programming nor the problem domain of highly concurrent applications. Deferreds are covered in more depth here and here. However, hopefully this mini-howto will inspire curiosity about those :-)


Example 1: Just a DeferredList

This is one of the simplest examples you'll ever see for a deferred list in action. Get two deferreds (the getPage function returns a deferred) and use them to create a deferred list. Add callbacks to the list, garnish with a lemon.
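The original example files aren't embedded in this copy of the post, so here's a minimal sketch of what such a script might look like, assuming the twisted.web.client.getPage API of that era and placeholder URLs:

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def listCallback(results):
    # results is a list of (success, value) tuples, one per deferred
    print(results)

def finish(ignore):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d2 = getPage('http://www.yahoo.com')
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()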


Example 2: Simple Result Manipulation

We make things a little more interesting in this example by doing some processing on the results. For this to make sense, just remember that a callback gets passed the result when the deferred action completes. If we look up the API documentation for DeferredList, we see that it returns a list of (success, result) tuples, where success is a Boolean and result is the result of a deferred that was put in the list (remember, we've got two layers of deferreds here!).
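A sketch of that unpacking (again assumed, not the original example code; the URLs are placeholders):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def listCallback(results):
    # The DeferredList hands us a list of (success, result) tuples
    for success, content in results:
        if success:
            print(len(content))

def finish(ignore):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d2 = getPage('http://www.yahoo.com')
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()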


Example 3: Page Callbacks Too

Here, we mix things up a little bit. Instead of doing processing on all the results at once (in the deferred list callback), we're processing them when the page callbacks fire. Our processing here is just a simple example of getting the length of the getPage deferred result: the HTML content of the page at the given URL.
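Roughly, that might look like this (a sketch under the same assumptions as the earlier examples): the length calculation moves into the per-page callback, and the list callback just reports what it receives.

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def pageCallback(result):
    # result is the HTML content of the fetched page
    return len(result)

def listCallback(results):
    # Each result is now the page length computed by pageCallback
    print(results)

def finish(ignore):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d1.addCallback(pageCallback)
    d2 = getPage('http://www.yahoo.com')
    d2.addCallback(pageCallback)
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()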


Example 4: Results with More Structure

A follow-up to the last example, here we put the data in which we are interested into a dictionary. We don't end up pulling any of the data out of the dictionary; we just stringify it and print it to stdout.
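Something like this (a sketch, with illustrative dictionary keys rather than the original ones):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def pageCallback(result):
    # Put the interesting bits into a dictionary
    return {
        'length': len(result),
        'content': result[:10],
    }

def listCallback(results):
    for success, data in results:
        # Just stringify the dictionary and print it
        print(str(data))

def finish(ignore):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d1.addCallback(pageCallback)
    d2 = getPage('http://www.yahoo.com')
    d2.addCallback(pageCallback)
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()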


Example 5: Passing Values to Callbacks

After all this playing, we start asking ourselves more serious questions, like: "I want to decide which values show up in my callbacks" or "Some information that is available here isn't available there. How do I get it there?" This is how :-) Just pass the parameters you want to your callback. They'll be tacked on after the result (as you can see from the function signatures).

In this example, we needed to create our own deferred-returning function, one that wraps the getPage function so that we can also pass the URL on to the callback.
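A sketch of that wrapper (the helper name getPageData is mine, not from the original post):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def pageCallback(result, url):
    # Extra positional arguments to addCallback are tacked on
    # after the result, per the callback signature
    return {'length': len(result), 'url': url}

def getPageData(url):
    # Wrap getPage so the URL travels along to the callback
    d = getPage(url)
    d.addCallback(pageCallback, url)
    return d

def listCallback(results):
    for success, data in results:
        print(data)

def finish(ignore):
    reactor.stop()

def test():
    d1 = getPageData('http://www.google.com')
    d2 = getPageData('http://www.yahoo.com')
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()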


Example 6: Adding Some Error Checking

As we get closer to building real applications, we start getting concerned about things like catching/anticipating errors. We haven't added any errbacks to the deferred list, but we have added one to our page callback. We've added more URLs and put them in a list to ease the pains of duplicate code. As you can see, two of the URLs should return errors: one a 404, and the other should be a domain not resolving (we'll see this as a timeout).
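A sketch along those lines (the URL list, the failing URLs, and the error-handling details are illustrative, not the original code):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

urls = [
    'http://www.google.com',
    'http://www.yahoo.com',
    'http://www.google.com/no-such-page.html',   # should 404
    'http://no-such-domain.example',              # should fail to resolve
]

def pageCallback(result, url):
    return {'length': len(result), 'url': url}

def pageErrback(error, url):
    # Turn the failure into a plain result so the DeferredList
    # still gets something printable for every URL
    return {'error': error.getErrorMessage(), 'url': url}

def getPageData(url):
    d = getPage(url, timeout=5)
    d.addCallback(pageCallback, url)
    d.addErrback(pageErrback, url)
    return d

def listCallback(results):
    for success, data in results:
        print(data)

def finish(ignore):
    reactor.stop()

def test():
    deferreds = [getPageData(url) for url in urls]
    dl = defer.DeferredList(deferreds)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()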


Example 7: Batching with DeferredSemaphore

These last two examples are for more advanced use cases. As soon as the reactor starts, deferreds that are ready start "firing" -- their "jobs" start running. What if we've got 500 deferreds in a list? Well, they all start processing. As you can imagine, this is an easy way to run an accidental DoS against a friendly service. Not cool.

For situations like this, what we want is a way to run only so many deferreds at a time. This is a great use for the deferred semaphore. Over repeated runs of the example above, the content lengths of the four pages were returned after about 2.5 seconds. With the example rewritten to use just the deferred list (no deferred semaphore), the content lengths were returned after about 1.2 seconds. The extra time is due to the fact that I (for the sake of the example) forced only one deferred to run at a time, obviously not what you're going to want to do for a highly concurrent task ;-)

Note that without changing anything but setting maxRun to 4, the timing for getting the content lengths is about the same, averaging 1.3 seconds for me (there's a little more overhead involved when using the deferred semaphore).

One last subtle note (in anticipation of the next example): the for loop creates all the deferreds at once; the deferred semaphore simply limits how many get run at a time.
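A sketch of the deferred semaphore version (maxRun and the URL list are illustrative, not the original values):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

maxRun = 1   # how many deferreds are allowed to run at once

urls = [
    'http://www.google.com',
    'http://www.yahoo.com',
    'http://twistedmatrix.com',
    'http://divmod.org',
]

def pageCallback(result, url):
    return (url, len(result))

def listCallback(results):
    for success, data in results:
        print(data)

def finish(ignore):
    reactor.stop()

def test():
    sem = defer.DeferredSemaphore(maxRun)
    deferreds = []
    # The for loop creates all the deferred jobs up front; the
    # semaphore only limits how many run at any given time.
    for url in urls:
        d = sem.run(getPage, url)
        d.addCallback(pageCallback, url)
        deferreds.append(d)
    dl = defer.DeferredList(deferreds)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()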


Example 8: Throttling with Cooperator

This is the last example for this post, and it's probably the most arcane :-) This example is taken from JP's blog post from a couple years ago. Our observation in the previous example, that the for loop creates all the deferreds at once, is now exactly the behaviour we want to avoid. What if we want to limit when the deferreds are created? What if we're using a deferred semaphore to create 1000 deferreds (but only running them 50 at a time) and we run out of file descriptors? Cooperator to the rescue.

This one is going to require a little more explanation :-) Let's see if we can move through the justifications for the strangeness clearly:
  1. We need the deferreds to be yielded so that each deferred (and its callback) is not created until it's actually needed (as opposed to the situation in the deferred semaphore example, where all the deferreds were created at once).
  2. We need to call doWork before the for loop so that the generator is created outside the loop, letting each coiterate call share a single pass through the URLs (calling it inside the loop would give us all four URLs every iteration).
  3. We removed the result-processing callback on the deferred list because coop.coiterate swallows our results; if we need to process, we have to do it with pageCallback.
  4. We still use a deferred list as the means to determine when all the batches have finished.
This example could have been written much more concisely: the doWork function could have been left in test as a generator expression and test's for loop could have been a list comprehension. However, the point is to show very clearly what is going on.
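Putting those pieces together, a sketch might look like this (again an assumed reconstruction, not the original code from JP's post):

from twisted.internet import defer, reactor, task
from twisted.web.client import getPage

maxRun = 2   # how many parallel "workers" pull from the generator

urls = [
    'http://www.google.com',
    'http://www.yahoo.com',
    'http://twistedmatrix.com',
    'http://divmod.org',
]

coop = task.Cooperator()

def pageCallback(result, url):
    # coiterate swallows the deferreds' results, so any processing
    # has to happen here on the page callback itself
    print((url, len(result)))

def finish(ignore):
    reactor.stop()

def doWork():
    # A generator: each deferred is only created when a worker pulls
    # the next URL, not all at once
    for url in urls:
        yield getPage(url).addCallback(pageCallback, url)

def test():
    work = doWork()   # create the generator once, outside the loop
    deferreds = []
    for i in range(maxRun):
        # Each coiterate call shares the same generator, so at most
        # maxRun page fetches are in flight at a time
        deferreds.append(coop.coiterate(work))
    # The deferred list is only used to know when all batches are done
    dl = defer.DeferredList(deferreds)
    dl.addCallback(finish)

test()
reactor.run()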

I hope these examples were informative and provide some practical insight on working with deferreds in your Twisted projects :-)

Monday, June 16, 2008

The Future of Personal Data

In a recent post about ULS systems, I said this:
The balance of power, from individuals all the way to the top of whatever organizations exist in the future will rest in information. Not like it is today, however. The "information economy" of today (+/- 10 years) will look like kids' games and playgrounds. The information economies this will evolve into will be so completely integrated into human existence that they will resemble the basic necessities like water and food.
I'm not going to focus on the ULS systems topic in this post, but there is a very deep connection between privacy, personal data and all things ULS. Any thoughts of a ULS system should be coupled with how this will impact the system's users and their data. Any thought of our personal data's future existence should include the anticipated future of computing: ULS systems.


Inside and Out

In a nutshell, here's how things look:
  • Yesterday: Paid Services - You want something, you buy it. Demographic research is expensive and mostly outsourced.
  • Today: Free Services - You want something, companies give it to you for free... in exchange for your demographic data.
  • Tomorrow: Information Economy - You want something, you leverage the value of your information in brokering the service deals that mean the most to you.
What do we have right now? Companies are fighting each other over who gets to have our data for free. Yay, free stuff! We used to have to pay for that sort of thing! But paying for people to hold your data was the old, old world. Having them do it for free is the old world. Here's the new world: They pay you.

Why would they do that? Why would things shift from the current status quo? The value of personal information.

There are many ways to assess the value of personal information, but let's look at a few from the perspective of large organizations (entailing everything from government to business). Simplistically, we can assign value to a single individual's data based on the value of a large collection of many individuals' data. The more participants, the greater the value of the whole, and therefore the greater the value of each individual's data. This perspective is limited because it treats data very statically. The data may change, but in relation to the system it's "acquired" and inside, as opposed to "for sale" and outside.


We Are the Markets

But the value of our data is not defined simply by the presence of bits or membership in a valued data conglomerate. Our data is not just our emails, our medical records, our purchasing trends, nor our opinions about local and national politics. Like an organism moving through an ecosystem, our data is dynamic and living; it is the very trace we leave in the world around us, be it digital or otherwise.

Any part of our lives that is ever recorded in "the system" provides data and comprises part of our movements through this system. Our traces through this digital ecosystem impact it, change it, shape its future direction. The collective behaviours (not just collective data) are immensely valuable to organizations. Their value is on-going and growing, with accrued, compounded interest.

Static data bits seem like property to us: you can buy them, you can sell them, you can store them somewhere. But moving, living data... that's a different story. That's not a buy-once commodity; ownership of that might be tantamount to slavery in a future, information-based economy. However, organizations might opt to lease it, or individuals might turn the past back on the future and offer license agreements to organizations.

More likely, though, individuals will form co-ops or communities (we have already seen this happen extensively in today's Internet) with shared mutual interest. Seeing how a group entity with shared values has a larger effect on the system than single individuals, data from such groups would likely be much more interesting and number-crunch-worthy. The greater power a group has to perturb systems' economic or political trends, the more valuable that group's data will be to other groups.

In addition, I'm sure there'd be all sorts of tiered "offerings" from individuals and groups: the juicier/more detailed the data, the higher the premium offered. The changes this will introduce to markets (global and local), legal systems, and political organizations are probably barely imaginable right now. But what would it take to get us there? What would it take for my data and your data to be valuable enough to transform the world and make Wall Street look like an old-time, irrelevant boys club?


Privacy

One thing: a fanatical devotion to privacy, pure and simple. Security and a fanatical devotion to privacy. Two things! Okay, reliability, security and a fanatical devotion to privacy. Three things!

Monty Python references aside, an economy that values the data of individuals and groups can only arise if that data is secure. If we live in a topsy-turvy world where the Government, MPAA, RIAA, the Russian Mafia, and Big Hosting Company are pirating our data, then we're hosed. However, if our data is secure and contracts are effective, then we will have a world where data is the currency. There are an incredible number of hurdles to overcome in order for this to happen, however.
  • The System - we need a system where user data can be tracked, recorded, and analyzed, and there's enough of it to matter
  • Storage - we need our own, personal banks for our data (irrefutable ownership rights and complete power over that data)
  • Transactions - we need a mechanism for engaging in secure data transactions
  • Identity - when making a transaction, we need to be able to prove unequivocally that we are who we say we are
  • Anonymity - we need to decouple activity in the system and identity, thus requiring organizations to come to us (or our groups) to get the definitive data they need
  • Recourse - we need a legal system and effective laws that protect the individuals and groups against the crimes of data-hungry organizations; fortunately, we will have had years of established precedent protecting the sellers from the buyers... oh my, how the tables turn!
And that's just off the top of my head. There's got to be tons of stuff which hasn't even occurred to me.


Closing Thoughts

Information will be as essential for us as water, yet there is a very interesting divergence from the example of a hydrological empire: each individual is the producer of some of that metaphorical water. By virtue of this difference, we hold the keys of the empire. We will be more a part of the economic and political powerbases than we have ever been at any time in human history.

Of course, that means that we've got to get ready :-) This is already being done in many different ways. Everything from community housing cooperatives to small, co-op banks; from capabilities-based programming models to secure online transactions. Like the next 20 years of research needed for ULS systems to become a reality, we've got just as much work to do in order to guarantee our place in the economies of the future.


Thursday, June 12, 2008

Ultra Large-Scale Systems: An Example

The ULS Series

Background

My interest in this topic is as old as my love for science fiction. As a child who had not only just started teaching himself to program but had fallen deeply in love with I, Robot, I consumed everything I could by the Master of the Art himself, Isaac Asimov. Inevitably, an endless stream of science fiction began flowing into my brain: the harder the science, the more cherished it was.

Then came the discovery that computers could actually talk to each other. Holy network, Batman, that changed everything! Oh, how I lamented my Kaypro II's inability to dial out. Science fiction novels began touching on this aspect of technology more and more frequently, while the Internet began taking shape in the "real" world around us. Now, look at it. Regardless of the mess and chaos, it's really quite amazing: beowulf clusters, distributed computing, cloud services, and of course the Internet in general. These advances are actually quite mind-blowing when we take the time to examine them from a historical perspective.

A lot has changed since those early days of the network. The past 10 years or so has seen the beginnings of a trend with regard to large systems. Certainly my views on the future of networks (and services that utilize them) have been pretty consistent:
As I indicated in the more recent ULS blog post, I have been exposed to some excellent resources for ultra large-scale systems. For some of those I recently provided links, and others I will be referencing in future posts.

Due to their nature, ULS systems pose interesting open source collaboration as well as business opportunities. They entail a massive collection of excellent problems to solve that cannot possibly be completely addressed in the next 6-12 months (where so many projects and businesses tend to put their focus, for obvious practical reasons). As such, there are a great number of research and development areas -- plenty for everyone, in fact. In this series of blog posts, my goal is to expose a wider audience to the topics and encourage folks to start thinking about both interim solutions as well as potential long-term ones.


Characteristics of a ULS

Let's start off with some semblance of a definition :-) What constitutes a ULS system? Here are some characteristics given by Scale Changes Everything:
  • an unbelievable amount of code (on the order of trillions of lines of code)
  • immense storage needs, network connections, processing
  • lots of hardware, lots of people, lots of purposes
  • decentralized components
  • created by aggregation, not design
  • unreliable components, reliable whole
  • ongoing and real-time upgrades, changes, and deployments
  • lots of functionality, likely in a focused area of concern
Here's an illuminating quote from Richard Gabriel's Design Beyond Human Abilities presentation:
The components that make up a ULS system are diverse as well as many, ranging from servers and clusters down to small sensors, perhaps the size of motes of dust. Some of the components will self-organize like swarms, and others will be carefully designed. The components will not only be computationally active but also physically active, including sensors, actuators...
Sounds like pure science fiction, doesn't it? Think about it, though. Is it really? Divmod's friend Raffi Krikorian co-wrote this paper at MIT. Check out the cheap network node that's smaller than a fingertip. At that size, hundreds of them would be innocuous. In a few years, we could have thousands of them in a room without even knowing it. Within a single home we could have the equivalents of what today are campus or regional networks. We probably can't even wrap our heads around how big these systems will be. But there is plenty of precedent for such natural short-sightedness. From Raffi's (et al.) 2004 paper:
The ARPAnet was ambitiously designed to handle up to 64 sites with up to 4 computers per site, far exceeding any perceived future requirement. Today there are more than 200 million registered hosts on the Internet, with still more computers connected to those.
Here are some other choice quotes:
[Internet 0] is not a replacement for the current Internet (call that Internet 1); it is a set of principles for extending the Internet down to individual devices...

An [Internet 0] network cannot be distinguished from the computers that it connects; it really is the computer. Because it allows devices for communications, computation, storage, sensing, and display to exchange information in exactly the same representation, around the corner or around the world, the components of a system can be dynamically assembled based on the needs of a problem, rather than fixed by the boundaries of a box.
We're already building this stuff. It's not science fiction. We may not have swarming, self-replicating nano machines... yet. But we're already heading in a direction where that's not just a possibility; it's a likelihood.

So, we've got lots of code, machines, storage, sensing and people; much of it decentralized. What else do we need? Failure tolerance and maintenance on-the-fly. Check. Finally, a ULS system will have to actually be useful, or it will never get built. Who would want such a thing besides militaries, big governments, and Dr. Evils? Now we start getting to our example: Health Care. But let's not get ahead of ourselves. First, let's examine why the biggest system of networked devices that we know of isn't a ULS system.


Why the Internet is not a ULS System

Measured against the criteria listed above, the most obvious difference is that the Internet is not focused on a single goal or related set of goals; it's used for everything. However, it does meet many of the other criteria. From the Carnegie Mellon report:
The Web foreshadows the characteristics of ULS systems. Its scale is much larger than that of any of today’s systems of systems. Its development, oversight, and operational control are decentralized. Its stakeholders have diverse, conflicting, complex, and changing requirements. The services it provides undergo continuous evolution. The actions of the people making use of the Web influence what services are provided, and the services provided influence the actions of people. It has been designed to avoid the worst problems deriving from the heterogeneity of its elements and to be insensitive to connection failures.

But ... Security was not given much attention in its original design, and its use for purposes for which it was not initially intended ... has revealed exploitable vulnerabilities ... And although the Web is an important element of people’s work lives, it is not as critical as a ULS ... system would be.
Now I think we're in a good place to talk about the health care system of the future...


Health Services as a ULS

Let's start this section with a quote from the presentation that inspired it. Richard Gabriel says:
An example of a ULS system (that doesn’t yet exist) would be a healthcare system that integrates not only all medical records, procedures, and institutions, but also gathers information about individual people continuously, monitoring their health and making recommendations about how to improve it or keep it good. Medical researchers would be hooked into the system, so that whenever new knowledge appeared, it could be applied to all who might need it. Imagining this system, though, requires also imagining that it is protected from the adversaries of governmental and commercial spying / abuse.
Modern hospitals are packed with countless computing devices: everything from charting PDAs to physiological monitors for patients; from mainframes and patient record data warehouses, to terminals and desktops. Wireless medical sensors have already been developed by a research project at Harvard. What's more, despite the concerns over associated health risks, implant research at Johns Hopkins and the University of Maryland is on-going and may produce results that are one day standard practice in hospitals.

As versions of these devices are developed that produce no ill effects for humans, they will make their way into out-patient clinics, assisted living facilities, and ultimately HMOs, private practices, and our homes. The devices will grow in numbers, shrink in size, and provide more functionality at greater efficiency than their predecessors.

The volumes of information that will be exchanged between devices, analyzed and correlated by other devices, and consumed by end-users, doctors, and researchers will be mind-boggling. It will bring new insights on everything from personal health to epidemiology.

With this, though, will come the obvious need for security and privacy, for defense against information attack and denial of service. These devices will all have to dedicate computational and storage resources for use by the whole system. Part of the system will have to monitor itself, properly escalate problems, and observe and anticipate trends. Protection and defense capabilities will have to exist the likes of which barely exist in our everyday lives at the macroscopic level.

All of this will take time. They will truly be modern wonders of the world. Given that such systems are anticipated to exist sometime in the next 20 years, and will have accreted the component systems over time, where might such a thing start?


Google and a Health Care ULS

If you read my last post (which I think was posted to blogger before the official announcement by Google), you already knew what I was going to say :-) Google Health. Though obviously nowhere near a ULS system in and of itself, why might we suggest Google is moving in this direction?

Here are some interesting bullet points from google.org:
  • InSTEDD: $5,000,000 multi-year grant to establish this nonprofit organization focused on improving early detection, preparedness, and response capabilities for global health threats and humanitarian crises
  • Global Health and Security Initiative: $2,500,000 multi-year grant to strengthen national and sub-regional disease surveillance systems in the Mekong Basin area (Thailand, Vietnam, Cambodia, Lao PDR, Myanmar, and China-Yunnan province)
  • Clark University for Clark Labs: $617,457 to Clark University, with equal funding from the Gordon and Betty Moore Foundation, to support the development of a system to improve monitoring, analysis and prediction of the impacts of climate variability and change on ecosystems, food, and health in Africa and the Amazon
  • HealthMap: $450,000 multi-year grant to conduct in-depth research into the use of online data sources for disease surveillance
Does that sound familiar to anyone besides me? All paranoia-induced sinister thoughts and Google Ads jokes aside, it makes sense that this is where we're going with health. In fact, it makes sense this is where we're going with all of our lives. If data privacy, personal ownership of that data, and security concerns can all be addressed, our lives' information will be better served by moving through systems specially designed to provide maximal use of that information with the least work. It won't just be nice to have, it will be essential.

The balance of power, from individuals all the way to the top of whatever organizations exist in the future, will rest in information. Not like it is today, however. The "information economy" of today (+/- 10 years) will look like kids' games and playgrounds. The information economies this will evolve into will be so completely integrated into human existence that they will resemble the basic necessities like water and food.

If you could find yourself a corner of that market, 20 years before everyone else got there, wouldn't that be a smart business move?


Summary

Our world is changing much more than we realize. We're too tied up in our jobs and gas prices to see the larger picture... to see that our future is already being made, that even in our unconscious actions we are propagating it no less than the cells in our bodies conspire to propagate what will become our children.

In the same way that hominid nomadic/migratory patterns begat the distribution of villages and tribal communities, which in turn gave birth to civilization, our silly little Internet will one day have descendants that dwarf it in size, utility, complexity, and computational power. The amazing thing is that we are the ones that actually get to build them!

There is a lot to research, and just as much to prototype. There is a project for everyone, and by starting now, we can make sure that feudal lords of tomorrow don't have absolute control over our food and water. If you have ideas for collaboration, start talking! Get involved! If you have money, fund some research, sponsor some conferences. In simply writing this blog post, I have uncovered gobs of new research I didn't know was out there. We should all be reading more, catching up, and coding. The projects near and dear to our hearts can get a whole new life within the context of ULS systems.