Friday, November 5, 2010

stompclient: another STOMP client for the Python community

Astute readers (as if I have enough readers to actually differentiate) may remember that I've previously entered the python STOMP world with my CoilMQ project. And now I'm at it again, but this time from the client side with my (perhaps presumptuously named) stompclient project. Yes, there are other clients out there -- stomp.py, stompy, and stomper to name the main players. So why a new one? Well, we've been using a couple of these in various stages of production, and based on my experiences I'd decided that there was room for an additional player in this game.

I wanted a STOMP client that would be easy to use without getting in the way. For example, spawning listener threads from within the class (as stomp.py does) is easy; however, if you want to exercise some more control over how received messages are handled, it's in the way.
I wanted a STOMP client that would support a publish-only usage model. Most of my needs to interact with a stomp server from Python have involved writing clients that need to just push messages onto topics/queues (e.g. from python WSGI web applications).
I wanted to also provide a helpful set of STOMP library utilities for other projects. Specifically, I wanted to flesh out & clean up the Frame classes that I had started implementing for CoilMQ and add in my fixed version of stomper's FrameBuffer that would traffic natively in frames.
I really wanted a better-documented, better-tested, and generally cleaner, more pythonic (pep-8) codebase.
And finally, this was really another opportunity to learn more about sockets and multi-threaded application design (& testing) in Python.

Unlike HTTP, the STOMP protocol is not a serial request-response protocol. This actually makes it non-trivial to write a client that can both send and receive messages, since you can't simply sock.send() a frame and then sock.recv() a frame and expect that to be the response to your sent frame. Of course, this makes sense since a subscribing client also needs to be able to receive message frames from the server (without their being in response to any request). So to have a client that can both send and receive messages, there needs to be some sort of receiver loop constantly running. I chose to take inspiration from the way that stompy does this and use queues. My approach was a little simpler in that there is simply a listener loop (expected to be run in its own thread) that enqueues any received frames on the appropriate queue (e.g. message frames go on the message queue, receipt frames on the receipt queue, etc.). Very simple, but seems quite effective. This approach is also flexible, since it means you could create a pool of worker threads that all pull from the appropriate queue(s) to process messages concurrently. (It probably wouldn't be too difficult to also provide a multiprocessing implementation for the worker pool.) Here's the simple publish-only example that pushes some binary content (a pickled python object) onto a queue:

import pickle
from datetime import datetime

from stompclient import PublishClient

client = PublishClient('127.0.0.1', 61613)
client.connect()
payload = {'key': 'value', 'counter': 0, 'list': ['a', 'b', 'c'], 'date': datetime.now()}
client.send('/queue/example', pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
client.disconnect()
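
To make the listener-loop-plus-queues idea a bit more concrete, here is a rough sketch of the pattern. It is not stompclient's actual code: the Frame tuple, the QueueingListener class and the worker function are hypothetical names made up for illustration, the frame source is just an iterable standing in for a socket plus a frame parser, and it uses the Python 3 queue module (Queue in Python 2).

```python
import queue
import threading
from collections import namedtuple

# Hypothetical stand-in for a parsed STOMP frame (not stompclient's Frame class).
Frame = namedtuple('Frame', ['command', 'headers', 'body'])

class QueueingListener(object):
  """Routes each received frame onto a queue chosen by its command."""

  def __init__(self, frame_source):
    # frame_source: any iterable that yields Frame objects as they arrive.
    # In a real client this would wrap sock.recv() and a frame parser.
    self.frame_source = frame_source
    self.message_queue = queue.Queue()
    self.receipt_queue = queue.Queue()
    self.error_queue = queue.Queue()

  def listen_forever(self):
    """The receiver loop; meant to be run in its own thread."""
    routes = {
      'MESSAGE': self.message_queue,
      'RECEIPT': self.receipt_queue,
      'ERROR': self.error_queue,
    }
    for frame in self.frame_source:
      target = routes.get(frame.command)
      if target is not None:
        target.put(frame)

def worker(listener, results):
  """One member of a worker pool: pulls message frames until told to stop."""
  while True:
    frame = listener.message_queue.get()
    if frame is None:  # sentinel value: no more work
      break
    results.append(frame.body)
```

Because the sending code never has to read from the socket itself, a caller waiting on a receipt can simply block on the receipt queue while the workers drain the message queue.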

Head on over to the project website for more examples, documentation, and download links.

Sunday, September 12, 2010

CPython Threading: Interrupting

I've decided to kick off a return to blogging with a series on multi-threaded development in Python (CPython, to be specific). Yes, we all know there's a GIL in the water, but multi-threading is still an extremely useful concurrency strategy in Python for i/o-bound activities ... which tends to characterize most of my use cases for concurrency. But there are lots of things that make multi-threaded programming tricky and there aren't quite as many resources out there for Python (as there are for Java, say).

Disclaimer: I am not an expert at multi-threaded programming in Python (or any other language). Most of this has been trial & error, some help from the Google, and a lot of foundation from the excellent book on the subject by Brian Goetz: Java Concurrency in Practice (despite the title, the principles in the book apply to Python too). If you know of a better way or better explanation, please leave a comment so we can all benefit.

After getting over some of the challenges of mutable state and atomicity of operations, I think one of the things that probably bit me next in Python specifically was handling of asynchronous exceptions (like KeyboardInterrupt and in some cases SystemExit) -- and specifically how one goes about actually stopping a multi-threaded application. Too many times I would end up with a script that would just hang when I hit CTRL-C (and I'd have to explicitly kill it). So let's start there.

Asynchronous Exceptions

The KeyboardInterrupt exception starts life as an OS signal; specifically, Python's default SIGINT handler (installed through the signal module) translates SIGINT into the KeyboardInterrupt exception. The rule is that on platforms where the signal module is present, these signal exceptions will be raised in the main thread. The SystemExit exception is a little different: it is an ordinary exception, and raising it (or calling sys.exit()) in a worker thread only ends that thread, so it stops the whole program only when it is raised in the main thread. On platforms without the signal module, apparently KeyboardInterrupt may be raised in any thread (see the "Caveats" section of the thread module reference documentation for more info); for the sake of focus here, we will assume that you are working on a platform with the signal module present.

Let's start out with a simple example of a multi-threaded program that you cannot abort with CTRL-C:

import time
import threading

def dowork():
  while True:
    time.sleep(1.0)

def main():
  t = threading.Thread(target=dowork, args=(), name='worker')
  t.start()

  # Block until the thread completes.
  t.join()

if __name__ == '__main__':
  main()

The problem here is that the worker thread will not exit when the main thread receives the KeyboardInterrupt "signal". So even though the KeyboardInterrupt will be raised (almost certainly in the t.join() call), there's nothing to make the activity in the worker thread stop. As a result, you'll have to go kill that python process manually, sorry.

Stopping Worker Threads

Solution 1: Kill with Prejudice

So the quick fix here is to make the worker thread a daemon thread. From the threading reference documentation:

A thread can be flagged as a “daemon thread”. The significance of this flag is that the entire Python program exits when only daemon threads are left.

So in practice here, if you stop your main thread, your daemon thread will just stop in the middle of whatever it was doing & exit. In many cases this abrupt termination of any worker threads may be appropriate; however, there may also be cases where you actually want to manage what happens when your threads terminate; maybe they need to commit (or rollback) a transaction, save their state, etc. For this a more thoughtful approach is required.
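
As a minimal sketch of the daemon approach, here is the earlier example with the worker flagged as a daemon. The start_worker helper and the run_for parameter are my own additions for illustration; the sleep in main() just stands in for whatever the main thread really does.

```python
import threading
import time

def dowork():
  # Same never-ending worker as before; it never checks for a stop request.
  while True:
    time.sleep(0.1)

def start_worker():
  t = threading.Thread(target=dowork, name='worker')
  # The flag has to be set before start() is called.
  t.daemon = True
  t.start()
  return t

def main(run_for=1.0):
  start_worker()
  # Stand-in for the main thread's real work. When main() returns (or a
  # KeyboardInterrupt propagates out of it), only the daemon thread is left,
  # so the interpreter exits and the worker is cut off mid-loop.
  time.sleep(run_for)

if __name__ == '__main__':
  main()
```

Note that the main thread no longer calls t.join() without a timeout; joining a thread that never finishes would put us right back where we started.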

Solution 2: Instruct Politely

The alternative to just killing them is to instruct the thread to stop using some agreed-upon system. You are probably aware (or have guessed) by now that there is no Thread.stop() method in Python (and the one in Java is deprecated and generally considered a Bad Idea™). So what you must do is to implement a "thread interruption policy" which in our case is basically a signaling mechanism that the main thread can use to tell the worker thread to stop. Python provides a threading.Event class that is for exactly this type of inter-thread signaling.

The threading.Event objects are very simple two-state (on/off) flags that can be used without any additional locking to pass "messages" between threads. Here is a basic strategy for using a threading.Event to communicate a 'shutdown' message to a worker thread:

You share a "shutdown" threading.Event instance between the threads (i.e. you either pass it to the threads or put it in a mutually accessible place).
You set the event from the main thread when you receive the appropriate signal. Here we're focused on KeyboardInterrupt, but presumably users could also take some action within your application (e.g. "stop" button) to stop your application, i.e.
```
shutdown_event.set()
```
You check it (frequently) in another thread and take the appropriate action once it has been set.
```
while not shutdown_event.is_set():
   do_some_work()
do_some_cleanup()
```

It is probably worth pointing out here that this system is really just some conventions that you've established between your main thread and the workers. If the workers don't periodically check the shutdown event, then they won't stop their work -- and CTRL-C still won't work.

Putting it Together

After applying the threading.Event model to our example, we are able to have our CTRL-C respected relatively quickly (as quickly as the worker thread gets around to checking the event).

import time
import threading

shutdown_event = threading.Event()

def dowork():
  while not shutdown_event.is_set():
    time.sleep(1.0)

def main():
  """ Start some threads & stuff. """

  t = threading.Thread(target=dowork, args=(), name='worker')
  t.start()

  try:
    while t.is_alive():
      t.join(timeout=1.0)
  except (KeyboardInterrupt, SystemExit):
    shutdown_event.set()

if __name__ == '__main__':
  main()
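
One possible refinement, sketched here as a variation rather than something the example above requires: the worker can wait on the event itself instead of calling time.sleep(). Event.wait() takes a timeout and returns as soon as the event is set (on Python 2.7 and later it also returns the flag), so the worker notices the shutdown request immediately instead of finishing its sleep first. The interval parameter is just an illustrative name.

```python
import threading

shutdown_event = threading.Event()

def dowork(interval=1.0):
  # wait() blocks for up to `interval` seconds but returns True as soon as
  # the event is set, so it doubles as an interruptible sleep.
  while not shutdown_event.wait(interval):
    pass  # one unit of real work would go here
  # any cleanup (commit, rollback, save state) would go here
```

Only the worker changes; the main thread still sets the event in its except block exactly as before.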

Working around uninterruptible Thread.join()

You may have noticed that we changed how we called Thread.join(). Calling the join() method on a thread without a timeout will block until that thread returns/completes. As I understand it, this is due to a mutex in the join() method which has the implication that you cannot interrupt it with KeyboardInterrupt. You can work around this, though, by essentially checking in a loop until the thread does exit:

while t.is_alive():
  t.join(timeout=0.1)

Other Events and Exceptions

You may notice that in the complete example I am also catching SystemExit for the sake of completeness. In a more complex app, you would need to make sure that other exceptions were also handled so that they would result in the shutdown message going to the worker threads.

You could also choose to register a signal handler (in your main thread) for other OS signals and raise an appropriate exception (e.g. SystemExit) or take other actions. The important point here is that these would all need to be handled in your main thread and communicated by some sort of convention to the worker thread(s).
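
Here is a rough sketch of what that could look like for SIGTERM. The handler and helper names (handle_sigterm, install_handlers, wait_for) are hypothetical; the only library calls involved are signal.signal() and the threading calls already used above.

```python
import signal
import threading

shutdown_event = threading.Event()

def handle_sigterm(signum, frame):
  # Python runs signal handlers in the main thread. Turning the signal into
  # SystemExit lets the existing except block below deal with it.
  raise SystemExit("received signal %d" % signum)

def install_handlers():
  # signal.signal() may only be called from the main thread.
  signal.signal(signal.SIGTERM, handle_sigterm)

def wait_for(thread):
  """Join with a timeout in a loop and pass any shutdown on to the workers."""
  try:
    while thread.is_alive():
      thread.join(timeout=1.0)
  except (KeyboardInterrupt, SystemExit):
    shutdown_event.set()
```

The worker side is unchanged: it keeps checking shutdown_event and has no idea whether the request came from CTRL-C, a kill, or a "stop" button.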

In Summary

Dealing with these "asynchronous events" in multi-threaded applications can be a little confusing (and sometimes a little frustrating when your app refuses to exit). Understanding the key points here will hopefully help make this a bit clearer:

Signals are handled by the main thread. This means that KeyboardInterrupt is always raised in the main thread.
Daemon threads will exit automatically when the main thread exits.
For cases where you need more control over thread termination, use threading.Event objects to "signal" threads to exit.
Be aware that Thread.join() calls will block and cannot be interrupted! Use an alternative while-loop strategy for joining instead.

Saturday, February 27, 2010

STOMP and CoilMQ

I've just pushed out my third release (v0.3) of CoilMQ, a simple STOMP server written in Python. (You can see the roadmap or the issue tracker to learn about the changes.) It's been a fun project, so I decided to write about why I've been working on it and why you might want to consider Stomp for message passing.

Project Background: A ~~Twisted~~ Snake-y Road

A few months back I started a simple project to learn more about writing a basic network server in Python. I decided I would create a STOMP server in Python, because there weren't any -- well, none that I could get to actually work. And there was this fantastic & simple stompserver Ruby project out there just taunting Python with its simple, concise implementation.

So, I plowed into the project and immediately resolved, as all aspiring Python developers do, that this was perfectly suited to Twisted. After all, the Ruby Stompserver project was built on EventMachine. Never mind that MorbidQ seemed to have had the same idea (and was firmly on my "not-working" list). Never mind that I'd been there before, waist-deep in Twisted entrails trying to grab some slimy Deferred; I made myself write "Twisted is harder than dataReceived" two thousand times. Never mind that Ted Dziuba wrote a brilliant tirade on Twisted which anyone that's had to wrestle this Gorgon can appreciate. Never mind all that.

Well, that didn't last long. Sure, I had an original version written using Twisted. Very simple. Everything in memory; no blocking I/O. It looked architecturally like the Ruby EventMachine code. Then I thought, what about using a database for queue storage? And that's when the romance ended. Well, I don't have a database API that returns Deferreds; I just have one that'll stop my app while the query runs. I know this was really just an educational project & proof of concept, but I didn't feel right doing it wrong.

So, I decided that I'd write my server using threads for concurrency. I know it's not what the cool kids are doing, but -- despite the GIL -- it's a great way to handle I/O concurrency in Python and I feel it's a really important thing for developers to understand -- because it comes up all the time. (I'm no expert, don't get me wrong; this was a learning exercise.)

Why Stomp?

Stomp is a really simple protocol for publisher-subscriber message passing. It's loosely HTTP-like in its syntax and it works for passing text-based or binary data -- so you can pass whatever data type your consumers care about (JSON, XML, AMF, pickled objects, etc.). The protocol itself doesn't explicitly prescribe topic or queue implementations, but there is a convention that destinations that begin with "/topic/" are topics (broadcast to all subscribers, no persistence or reliability) and destinations that begin with "/queue/" are queues (sent only to one subscriber, persisted). Stomp does go beyond the basics to provide support for transactions and reliable subscribers (that must ack receipt). Some implementations (e.g. ActiveMQ) have also added the concept of durable topics.
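
To show just how simple the wire format is, here is a small sketch that serializes a frame the way STOMP 1.0 lays it out: a command line, header lines, a blank line, the body, and a terminating NUL byte. The build_frame helper is my own illustration and is not taken from CoilMQ or any client library.

```python
def build_frame(command, headers=None, body=b''):
  """Serialize one STOMP 1.0 frame: command, headers, blank line, body, NUL."""
  headers = dict(headers or {})
  # content-length lets the receiver read bodies that contain NUL bytes,
  # which matters for binary payloads such as pickled objects.
  headers['content-length'] = str(len(body))
  lines = [command]
  for name in sorted(headers):
    lines.append('%s:%s' % (name, headers[name]))
  head = '\n'.join(lines).encode('utf-8')
  return head + b'\n\n' + body + b'\x00'

frame = build_frame('SEND', {'destination': '/queue/example'}, b'hello')
# frame == b'SEND\ncontent-length:5\ndestination:/queue/example\n\nhello\x00'
```

Sending to a topic instead of a queue is just a matter of changing the destination header to something that starts with "/topic/".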

There are lots of uses for Stomp because it's been implemented in lots of different languages. For example, you could simply use this as a way to pass data (e.g. pickled objects) between disparate web applications written in Python (e.g. using the stompy client). Or you could use Stomp to provide for passing messages between Python application servers and Flash/Flex clients (e.g. using the as3-stomp client); that's an easy way to implement server-push for Flex. To build on that example, you could just as easily push AMF-encoded (binary) messages between different Flash clients. So Stomp is really quite basic, and very versatile.

Why CoilMQ?

This is a trickier question to answer. Honestly, if you've got a project out there that needs to push messages at scale, you probably want to look at some of the enterprise brokers. I've had great luck using ActiveMQ and I've heard people that need to scale to massive throughput suggest using RabbitMQ. But if you don't need to handle thousands of requests per second, and you do want to understand the software you're using, then I would suggest that a project like CoilMQ is a great choice.

Plus, it'd be great to have some feedback! I've had a number of downloads & no bug reports (or complaints). I'd love to think that means it just works perfectly for everyone, but I suspect the truth is that I just need more users. I should add that I'd also be very pleased to accept any suggestions or patches (the former is greatly helped by the latter).

Wednesday, February 24, 2010

Notes from PyCon 2010

PyCon just wrapped up; it was geektastic! This was certainly the largest group of Python developers I'd ever witnessed, and I came away with some startling realizations about Pythonistas. For example, I learned that, like Samson's strength, a Python developer's social standing is directly proportional to the length of his (or her) locks and beard. I have stopped shaving and today I had an oprah-ah-ha moment with metaclasses and slots. If I can make it through this awkward in-between phase, I can only imagine the revelations that lie ahead.

Atlanta was interesting. I mean we were in a very corporate, rich downtown: lots of big buildings housing banks and lots of people rolling on chromed 20s. I'd never seen so much chrome. The juxtaposition of Atlanta with Pycon was probably at its most poignant on Saturday night, when apparently the Hyatt hosts an Old School Saturdays dance party in their downstairs ball rooms. That happened to be right next to the Python open spaces. They had some ropes to keep the groups separate. I'm sure mayhem would have ensued if party goers had wandered into the Unladen Swallow open space.

Anyway, back to the conference. Some of the videos and slides are being aggregated online. The talks were great; I'm hoping that selection becomes more comprehensive.

I definitely came away with a bunch of notes, things to try, etc.:

Python in the browser (Silverlight) looked cool, but it's not really a viable option for Linux users. Even the latest preview of Moonlight doesn't run the IronPython REPL without errors. (Stable version doesn't run it at all.) Oh well, I'll try back later.
The GIL on multi-core isn't fixed yet in 3.2 -- and this problem is relevant even when dealing with IO-bound instead of CPU-bound apps. David Beazley's presentation was fantastic. Of course, threading still makes sense for handling IO load, but it's good to have this limitation commonly understood.
Donovan Preston's talk definitely made me want to check out Eventlet. One of the biggest drawbacks of frameworks like Twisted is that you have to write code in a special way (i.e. using Deferreds, or yield) and the implication of this is that you also cannot just take other people's code and include it in your project. That's a huge problem. However, even with Eventlet you do have the drawbacks of it being a non-blocking server. So I was even more excited to learn about Spawning. From what I understood, Spawning combines the power of the async server with actual processes that make it easy to offload those blocking IO calls -- i.e. your database queries won't pause your entire application.
The State of Packaging (slides aren't up at time of writing) was a great look at improvements that are being made to the packaging metadata in Python. Tarek Ziadé is to be commended for tackling this problem. And oh what a problem it is!
The new unittest has some really useful improvements (like multi-line text diffs for assertEqual!). And it's been backported as unittest2 for us folks still stuck on 2.6.
I will use Dozer! (It's memory profiling WSGI middleware. How simple!)
I will use repoze.profile! (It's a performance profiling middleware.)
When I think that I have a problem best solved by Nginx, I will also evaluate HAProxy.
Apparently people love Munin. They say it's much better than Nagios.

I'm sure there was a lot more. I got bored of scouring down my notes and looking up links. It was a fantastic conference, though. If you didn't go this year, go next year. I hear it's in Atlanta again; bring a purple suit and they might let you into Old School Saturdays.

Monday, February 15, 2010

Coming from PHP: Share Something

This is the first in a series of posts I'd like to do about Python from the perspective of someone coming from PHP development. Others have certainly posted articles on similar topics; heck, there's even a fantastic site dedicated to providing the Python equivalent of PHP functionality. While I'm sure that I'll talk a bit about some of the building-blocks in the Python language, these posts will focus on language features, interpreter implementations, and deployment platforms that lend themselves to Python's use in large-scale applications.

What do we mean by large-scale applications?

Well, I don't know who "we" is, but I mean applications big enough to require thinking about how the application is architected. Of course, it's more than just thinking about how the application will work, but also how it will be tested, how it will be maintained, how it will grow, how security policies will be implemented, and how this can all be done as efficiently (and non-repetitively) as possible. These concerns typically push developers in the direction of an existing framework or into the typically-under-estimated effort of writing their own.

Web application frameworks traditionally start with a high-level look at how requests are handled by the server and turn that into abstraction points. Typically this ends up in one of the many interpretations of Model-View-Controller and the general phases in the processing look something like this:

Incoming request is dispatched (maybe via mod_rewrite) to a single handler script. (FrontController pattern)
A routing sub-system looks at the request (usually the requested path) and determines what piece of server-side code should handle that.
The request is probably further processed for things like authentication requirements and then (if other checks pass) handed off to the server-side processing code (sometimes called an Action, sometimes the Controller, sometimes a View).
The processing code will perform the "meat" of the processing (a typical application will probably query the database, for example) and produce some sort of response that should get sent back to the client.
Typically there is a final phase where a more abstract response is encoded into the format that the client expects (e.g. JSON or XML); alternatively, for more traditional HTML applications, the response data may get passed as the context for a template.

There's a lot of boilerplate code there and a lot of resources that need to get loaded to process a request -- routing, authentication & session management, logging, business logic, model, template rendering, encoding, etc. Even when not doing any work, the typical framework application will result in the loading of scores of classes, connecting to the database, opening file handles for logging. Heaven forbid your webapp needs to do anything like open a socket connection.

And this is why I think that PHP's share-nothing architecture is really a pretty dubious "feature". Practically, it just means that all those resources that have to be set up for your framework have to be set up with every single request. Let's be honest here, this is not a feature; this is a pretty severe limitation. This is doublespeak to turn fundamental problems like "not thread-safe" and "leaks memory like a wet paper bag" into features.

Now, there is a real shared-nothing architecture that describes an approach to develop scalable & concurrent software. This really has very little to do with how this term has been used to describe PHP's architecture. Furthermore, there's nothing about PHP that makes it uniquely able to "support" [its understanding of] share-nothing architecture. It simply doesn't have the language or interpreter platform support to do anything else. It's like saying that a single-speed bicycle is better than a geared bicycle because it's easier to understand.

So, enter Python. Python is certainly not unique in its deployment paradigm, but it does provide a healthy contrast to PHP. To the point here, in Python you can share stuff. So, if you are using Apache with mod_wsgi (a popular Python hosting option, especially for frameworks), you can run Apache in multi-threaded worker MPM and all those overhead resources only need to be initialized once per process -- not per request. Of course, if you wanted to, you could make it behave like PHP, but no one would do that; that'd be inefficient.

So what is the price to pay for this sharing? Well, there is some additional complexity. If you are running a multi-threaded environment (e.g. Apache worker MPM or another multi-threaded Python server) you do need to make sure that those resources (db connections, log handles, etc.) are thread-safe. Typically in Python they are, but one does need to understand what that means. So it will demand a little more, but for those of you developing full-stack frameworks in PHP, you know that you've already left the Green Zone.
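
As a small sketch of what "initialized once per process" looks like in practice, here is a bare WSGI application that builds an expensive resource at import time and reuses it for every request. ExpensiveResource is a hypothetical stand-in for a connection pool, logger or parsed configuration; the lock is there because several request threads may touch it at once.

```python
import threading

class ExpensiveResource(object):
  """Hypothetical stand-in for a connection pool, logger, config, etc."""
  instances_created = 0

  def __init__(self):
    ExpensiveResource.instances_created += 1
    self._lock = threading.Lock()
    self._hits = 0

  def record_hit(self):
    # Shared between request threads, so guard the mutable state.
    with self._lock:
      self._hits += 1
      return self._hits

# Created once, when the server process imports this module.
resource = ExpensiveResource()

def application(environ, start_response):
  """A minimal WSGI app; every request reuses the same resource."""
  hits = resource.record_hit()
  body = ('request #%d served by this process\n' % hits).encode('utf-8')
  start_response('200 OK', [('Content-Type', 'text/plain'),
                            ('Content-Length', str(len(body)))])
  return [body]
```

Under a PHP-style model the equivalent of ExpensiveResource() would run on every request; here it runs once and the counter keeps climbing for as long as the process lives.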

So, while we're all excited about applying the DRY mantra to our software design, I think it's worth stopping to consider whether maybe there's a similar principle that could be applied to the server architecture. Sharing resources between requests is a powerful feature. In single-process (multi-threaded) systems it makes it possible to share state without persisting to an external store; in multi-process & multi-threaded systems, it provides a huge efficiency improvement by handling the parsing and app setup / resource initialization only once per process.

Share. You'll feel better.

Of course, sometimes simpler is better. If you don't need an application framework, then you probably aren't concerned with eliminating repetitive overhead code. To go back to our bicycle analogy, I actually do ride a single-speed bicycle to work because my commute is relatively flat and fewer mechanical parts means fewer parts to replace.

Thursday, February 11, 2010

Hobgoblins: Anonymous Functions in PHP 5.3

Among other things, PHP introduces anonymous functions (which they also call closures) in PHP 5.3. This is interesting, because normal functions or methods in PHP are not first-class citizens and yet anonymous functions kinda are -- or at least look like they are. And this is interesting, because I find first-class functions one of the really nice things about languages like Javascript and Python.

So, how does this work exactly in PHP? Well, let's see if we can deduce how this works from some trial (& error).

The Basics

So, anonymous functions in PHP are most frequently used in callbacks. As such, that's probably the first example you'll see:

Ah, so that's all well & good, but the interesting thing is that this anonymous function can also be assigned to a variable (and that's good, because it's filling in for what would normally be a variable):

What may come as a surprise (unless you've read the manual page):

Yes, it's a special internal class. And no, you cannot instantiate a Closure yourself:

... or extend it (it's final). But wouldn't it be cool if you could, because then you could maybe find out how they've made a callable object. AFAIK, PHP doesn't support that otherwise, after all. Now, if they had provided that feature, I think that would have been a lot more generally useful.

Closures and Scope

It is also possible to pass in variables from the current scope into the anonymous function. This provides a closure-like mechanism. I say "closure-like" because unlike other languages (e.g. Javascript, Python, Ruby) PHP is not actually providing access to the parent scope; rather, it is injecting variables into the anonymous function's scope. While that is different, that probably makes "sense" for PHP, since PHP provides no way to access parent scopes in any other contexts.
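
For contrast, here is a small Python sketch (my own illustration, written for Python 3) of what real access to the parent scope means: the inner functions look up and rebind the variable that lives in the enclosing function rather than working on a copy injected at definition time.

```python
def make_counter():
  value = 0

  def read():
    # Looks up `value` in the enclosing scope each time it is called.
    return value

  def bump():
    # nonlocal (Python 3) rebinds the variable in the enclosing scope.
    nonlocal value
    value += 1

  return read, bump

read, bump = make_counter()
bump()
bump()
# read() now returns 2: both functions share the one enclosing variable.
```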

For (a very contrived) example:

But, for people who have experienced closures elsewhere, this next example might come as a surprise:

So, the function did not change $value. But the syntax does look like we're passing in the $value, and that seems to be the right way to think about it:

Compare with Javascript:

The Fine Print

So the actual behavior seems fairly consistent so far. There certainly are a few types of variables that can't be imported. For example, and despite some suggestion otherwise, it seems that you cannot reference $this from within an anonymous function defined within a class method:

And, no, you cannot pass it in to the closure either:

We're getting some interesting errors, though, no?

If you're thinking, ah, but what if the closure itself is defined as a member variable. Ok, I think you're getting abusive, but let's give that a shot:

So, maybe there's a workaround for this, but suffice it to say that it doesn't work the way I would expect it to work.

Another Half-Baked Solution

One of my biggest gripes with PHP is how it adds half-baked features to the language. This is true of Exceptions, the OO model, interfaces & type hinting in signatures, namespaces, etc. Well, anonymous functions seem to be simply another example in this litany of offenses to computer science. Now, the concept of first-class functions is a powerful one and leads to some extremely useful design patterns. But PHP didn't actually make functions first-class, they just made a new type of "function" that is first-class (by virtue of being an instance of Closure class).

Anonymous functions are supposed to be functions. So why are they fundamentally different from non-anonymous functions?

To me, this next block of code is just confusing:

To be fair, this anonymous function feature is butting up against the basic design of PHP. PHP has special syntax for declaring variables (namely, a $ prefix), which makes the new anonymous functions syntactically incompatible with traditional functions in PHP. In the case of object instance variables, the syntax is ambiguous but one finds out very quickly that PHP is going to err on the side of tradition when it comes to interpreting method invocations:

While syntax may just be a function of the completeness of the parser, to the programmer this is what gives a language a feeling of coherence or consistency. We expect computer programming languages to be predictable and logical; for example, when values are equivalent, we expect to be able to substitute one value for another consistently.

Look at how consistently Javascript behaves:

Ok, PHP, your turn.

Friday, February 5, 2010

HipHop: Stop What You're Doing

So the PHP proletariat is foaming at the mouth with Facebook's announcement of HipHop. Now, there are certainly some great discussions of the topic -- take Terry Chay's multi-page inside look or, for a Python perspective, take a look at Alex Gaynor's observations. These guys are pretty qualified to comment. But there's a lot of crazy talk out there. Crazy talk? Well, damn, I wanna talk crazy too!

Am I qualified to have an opinion about this? Well, I certainly haven't used it yet or looked at the source (both related to the fact that the source is still not posted to github); I certainly haven't worked at Facebook and don't personally know anyone that does; it is completely irrelevant to me (as it is to most PHP developers). ... So, I think I've got what it takes to voice an opinion that is unencumbered by fact or first-hand experience.

I felt so qualified on the topic that I joined in loudly on a discussion in the local DC PHP group. My comments elicited quite a response on the list. I'm sure my poorly masked distaste for PHP doesn't exactly endear me to the group. (Why, you may ask, am I on the PHP list in the first place? Well, I would say "for the ladies", but I'm worried my wife will find this blog!) My comments were apparently so poignant that they were selectively quoted in a real life DC-area blog. Not just quoted, but actually described as "puritanical" and "naive" (in the same sentence), which I consider the high-water mark of my inter-personal communications career. Not to pick on Aaron (well, maybe a little, but that certainly seems fair), but this post has some unfortunate bits that have been a little too characteristic of the PHP community's hip-hopped-up responses:

Personally, I wish we had HipHop when [...] We had a ton of scaling problems with PHP and we were running fully clustered Apache servers (25 deep, if I recall), sharded MySQL across 6ish database servers, and we had massive I/O bottlenecks.

Ok, look, now I haven't used HipHop either, but if we're to believe the author, then HipHop reduces CPU load. This, by definition, isn't relevant to I/O problems. I know; it's hard not to get swept away with the excitement. This is, after all, the evolution of language. I am a little disappointed to have been the spark for that post. Now, I'm sure that this was an honest mistake and that certainly if Aaron was managing clusters of 30+ servers, he knows that I/O bottlenecks are not the same as CPU bottlenecks. But it does tarnish the argument a little.

My favorite (so far) is the article on Sitepoint, which I thought had pretensions of being a trusted source of knowledge in the PHP world. Check this out: Boost your PHP Performance 50% with HipHop. That's right, not only is this article also oblivious to what makes PHP apps slow, but the author appears to be tripping up on some basic consumer mathematics. Yes, HipHop on Facebook showed a reduction of CPU usage by 50%. That means it used half as much CPU ... so those scripts perform twice as fast (w.r.t. CPU). Now, bear with me. Twice as fast is a 100% boost; a 50% boost means 1.5 times as fast. 2 != 1.5. Well, since this is PHP there's always that danger that unequal things will evaluate as equal, so let me be more precise: 2 !== 1.5.
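To spell out the arithmetic, here is a quick Python sketch. The function names are made up for illustration, and it assumes the work is entirely CPU-bound (which, as argued above, is rarely true of a real PHP app):

```python
def speedup_from_cpu_reduction(reduction):
    """How many times faster CPU-bound work runs when the CPU cost
    per request drops by `reduction` (a fraction, 0 <= reduction < 1)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def boost_percent(speedup):
    """Express a speedup factor as a percentage boost."""
    return (speedup - 1.0) * 100.0


def reduction_for_speedup(speedup):
    """Inverse question: what CPU reduction yields a given speedup?"""
    return 1.0 - 1.0 / speedup


halved = speedup_from_cpu_reduction(0.5)
print(halved)                                  # 2.0
print(boost_percent(halved))                   # 100.0
print(round(reduction_for_speedup(1.5), 3))    # 0.333
```

So halving CPU usage is a 100% boost, and a genuine 50% boost would correspond to cutting CPU usage by only about a third.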

Ok, but before I went off on all this ad hominem venting, I actually had a more constructive point I wanted to make. It took a few forms (some more enraging than others), but the gist was this: reconsider. If you need to compile PHP into C++ and run it on a custom-designed server platform because you've maxed out your CPU, maybe PHP wasn't the right choice. I'm just sayin'.

I'm not saying that changing platforms is easy. I've been involved in migrations to Python (from PHP) for the past couple years and can certainly attest that it's hard and sometimes it takes time, but sometimes it's the right thing to do. We sat down and looked at some of the things that we were doing and realized that we were jumping through some crazy hoops to work around deficiencies in language & platform. After reviewing our options, we decided that Python offered the quality, community, broad applicability, extensibility, and scalability that suited our needs. A big part of this decision was the scaling path. We knew there were proven ways to extend Python with Java (via Jython) or C/C++.

There was a lot of suggestion in our discussion that it didn't make sense for Facebook to ever reconsider PHP. But as Terry Chay noted, there were several efforts to move Facebook to Python and to C++; they just never had the resources allocated to actually accomplish it. And looking at the array of technologies Facebook employs, it's pretty obvious that they understand the basics of comparative advantage.

I am not a curator of startup history, but it does seem that there are plenty of cases to be found where applications can evolve to use new technologies successfully. Twitter looks like an example where this worked quite successfully. They didn't have to throw all their Rails code away, but they did start rewriting pieces in Scala (on the JVM). The interview on Artima is a good read. And Google recently was rumoured to be encouraging new development to use Java instead of Python. I know I read about it in an Unladen Swallow thread, and while I obviously like Python, I think this is a mature stance. Choose the right tool for the job.

So, at the end of the day, I'm just kinda baffled by all this HipHop hubbub. Not only has it been done before, but for almost every application and for all new development it just seems completely irrelevant. Why would you choose to write something in PHP? Not for the language, right? Not for the wealth of libraries. Not for the culture of quality. Not for the extensibility. No, you'd use it because it's drop-dead easy to deploy and the PHP+Apache(+MySQL) platform is ubiquitous. So, HipHop renders inapplicable the only reasons I would choose PHP.

Oh, but I do like the name.

Installing Zine on Dreamhost

Yes, until recently I was running my blog on Zine (on dreamhost). While it was a good exercise to get it working -- and it certainly was working fine as a hosting platform -- I had made up my mind that I'd like to use hosted services more. So, even though I'm not actually hosting this on Zine, I still think this is useful information. And so I'm reposting it.

I've just installed Zine on my Dreamhost shared host account. In a fit of existentialism, I decided to write my first blog post on how to install Zine on Dreamhost. Granted, this is pretty straightforward, and if you're willing to surf around a bit, you probably will have figured these things out yourself. But I'm all for encouraging more Zine usage (since that might give me more theme options), so I figured I'd help make this easier.

You may be wondering, "Why didn't you use Passenger?". Good question! Dreamhost does support Passenger for WSGI applications, but this uses the default system python interpreter (python 2.4, with none of the needed deps). I wanted to run this in a virtual environment and not be forced to use a kludgy os.exec() hack to use my own interpreter. And FastCGI seems to work fine and doesn't carry the "beta!" warning.
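For reference, the hack in question usually looks something like the sketch below: a passenger_wsgi.py that re-executes itself under the interpreter you actually want. This is only an illustration under assumptions. The ~/myblog-env path is a placeholder borrowed from the virtualenv name used later in this post, needs_reexec is a made-up helper, and since os.exec() is really a family of functions, this uses os.execl.

```python
import os
import sys

# Assumed location of the virtualenv's interpreter (placeholder path).
INTERP = os.path.expanduser("~/myblog-env/bin/python")


def needs_reexec(current, wanted):
    """True when we are not already running under the wanted interpreter."""
    return current != wanted


# Replace the current process with the virtualenv's python. The check
# above keeps this from looping once the new interpreter is running.
# The exists() guard just makes the sketch harmless where the path is absent.
if os.path.exists(INTERP) and needs_reexec(sys.executable, INTERP):
    os.execl(INTERP, INTERP, *sys.argv)

# A real passenger_wsgi.py would go on to define the WSGI `application`
# callable here; that part is omitted from this sketch.
```

It works, but swapping the process out from under the server on every startup is the kind of thing I'd rather avoid.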

Here were the key ingredients for this recipe.

Use Python 2.5 (not default)
Use Virtualenv
Use FastCGI (and Flup)

Dreamhost Setup

Set up the (sub)domain for your blog. Or create a new sub-directory on an existing site. This will be your "blog directory" (you'll need to tell Zine where this is later).
Set up a MySQL database for your blog.

Zine Install

Step 1: Libxml2 Prereqs

You will need to install the lxml python module, which is going to want development libs of libxml2. On Dreamhost you'll need to build your own libxml and libxslt libraries to make this all possible. This isn't as painful as it sounds. I used this blogpost as the basis for my approach. I changed those steps to use ${HOME}/usr for the --prefix flags (rather than the ~/.local in that article); I added ~/usr/bin to my PATH variable so that Python will find the xslt-config when it goes to install lxml.

Step 2: Setup Virtualenv

I chose to use virtualenv for this exercise, since 1) I'm on a shared host and 2) I typically use virtualenv anyway to isolate environments for specific applications. It makes life easier; give it a try if you haven't yet.

Now create the virtual environment somewhere. Name it appropriately (we'll need the name again later; the paths below assume ~/myblog-env).

Note: I'm using Python 2.5, which is not the default on Dreamhost (at time of writing). I found that Python 2.4 did not work. (I created a ticket.)

Activate Virtualenv

Step 3: Setup Zine

Note that all of these steps assume that you are operating with an activated virtualenv.

Download and unpack:
Install, using our Python virtual env:
Install Zine prereqs:
Install Flup:

Step 4: Setup FastCGI handler

Zine Handler

Copy the installed zine.fcgi handler into your blog web root (adjust paths accordingly):

Edit the ~/myblog-env/share/zine/servers/zine.fcgi file to ensure that the INSTANCE_FOLDER points to your blog public dir (~/blog.yoursite.com, e.g.).

.htaccess

Create an .htaccess file in your blog public directory with these contents:

RewriteEngine On
RewriteBase /
RewriteRule ^zine\.fcgi/ - [L]
RewriteRule ^(.*)$ zine.fcgi/$1 [L]
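
If the two rules look cryptic, here is a rough model of what they do, written in Python with the re module. This only approximates mod_rewrite's per-directory behavior (it is not how Apache implements it), and the RULES table and rewrite function are invented for illustration:

```python
import re

# Each entry mirrors one RewriteRule: (pattern, substitution).
# A substitution of "-" means "leave the path unchanged".
# Both rules carry [L], so the first match ends processing.
RULES = [
    (re.compile(r"^zine\.fcgi/"), "-"),
    (re.compile(r"^(.*)$"), r"zine.fcgi/\1"),
]


def rewrite(path):
    """Rewrite a request path given relative to RewriteBase /."""
    for pattern, substitution in RULES:
        match = pattern.search(path)
        if match:
            if substitution == "-":
                return path
            return match.expand(substitution)
    return path


print(rewrite("2010/02/05/some-post"))  # zine.fcgi/2010/02/05/some-post
print(rewrite("zine.fcgi/admin/"))      # zine.fcgi/admin/
```

The first rule is what prevents a loop: in an .htaccess context Apache runs the rules again on the rewritten path, and without the pass-through the second rule would keep prepending zine.fcgi/.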

Final Steps

Visit your Zine URL in your browser to start the installation wizard.

When you are finished, you may wish to edit ~/blog.yoursite.com/zine.ini, to change the Zine URL to not include the zine.fcgi component of the URL (it isn't necessary, given the rewrite rules defined).

Snakes that Bite