Server Side Parsing of XMPP
[ rakaur on Sat Sep 27 at 10:28 AM // category: programming, software, technology // comments: 1 ]
When I started writing XMPPd two years ago, I knew next to nothing about XML or the parsing thereof. The only parsers I had ever written were for database files of my own design, and IRC. This was somewhat of a problem.
Luckily, Ruby includes an XML library called REXML. Little did I know, REXML sucks. So, I happily implemented my XMPP parsing in a very bad way (I’m assuming I did it quickly as a proof of concept, and never fixed it because I stopped working on the project altogether). First, select() waits until there’s something to read from one of my clients. When there is, it finds that connection’s XMPP::Stream instance and calls its read() method. Now, this is the beginning of the problem. I wrote read() really fast just to get things rolling, and never implemented any error handling for things like partial stanzas. Anyway, read() reads a bunch of XML from the socket and calls parse().
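For the curious, the select()/read() flow described above looks roughly like the sketch below. This is not the actual XMPPd code: the Stream class, its buffer, and poll_once are names made up for illustration, and the only real APIs used are IO.select and read_nonblock. The one thing it does that my read() never did is keep a buffer around, which is where partial stanzas would have to live until the rest arrives.

```ruby
# Hypothetical stand-in for XMPP::Stream; it only accumulates raw bytes.
class Stream
  attr_reader :buffer

  def initialize(socket)
    @socket = socket
    @buffer = String.new
  end

  # Read whatever is available without blocking.
  # Returns false once the peer has closed the connection.
  def read
    @buffer << @socket.read_nonblock(4096)
    true
  rescue IO::WaitReadable
    true   # nothing to read after all; try again on the next pass
  rescue EOFError
    false  # connection closed
  end
end

# One pass of the event loop: wait for readable sockets, then read each one.
# `streams` is assumed to be a Hash mapping socket => Stream.
def poll_once(streams, timeout = 1)
  readable, = IO.select(streams.keys, nil, nil, timeout)
  (readable || []).each do |sock|
    streams.delete(sock) unless streams[sock].read
  end
end
```

A real server would call poll_once in a loop and hand each stream's buffer to the parser once it holds at least one complete stanza.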
Now, in parse(), I take whatever read() has pulled out of the socket and pass it to a new REXML::Document. This is crappy for two reasons:
- I’m parsing small bits of XML in the “DOM” fashion, which is apparently bad, and;
- REXML is ridiculously bad at parsing anything reliably.
The second problem is the one I really care about, because DOM works for my needs, and because I mostly don’t care. REXML sucking is the bad part. In the same method that I call REXML::Document, I have about 50 lines of exception rescuing just to bash REXML into parsing XMPP in any fashion, and get it sent off to my handlers.
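To make the DOM approach concrete, here is a minimal sketch of parsing one chunk with REXML::Document. It is an illustration, not my 50 lines of rescues: parse_chunk and the synthetic wrapper element are assumptions of this sketch, added because a chunk holding several stanzas has no single root and so is not a well-formed document on its own.

```ruby
require "rexml/document"

# Parse a chunk that should hold zero or more complete stanzas.
# Returns an array of REXML::Element, or nil if the chunk is not well-formed
# (for example because a stanza was only partially read).
def parse_chunk(xml)
  # The <wrapper> root is an assumption of this sketch, not part of XMPP.
  doc = REXML::Document.new("<wrapper>#{xml}</wrapper>")
  doc.root.elements.to_a
rescue REXML::ParseException
  nil
end
```

A nil result could mean "wait for more data" or "the client sent garbage"; telling those two apart is exactly the part a sketch this small does not solve.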
That’s what’s next. If REXML can choke its way through, I get the standard REXML::Document that I can step through with each_element, etc. So I do that, and for every top-level element I come across, I call a method called handle_<elem_name> and pass it the REXML::Element that I’m on. These methods then do the handling, in a semi-event-driven manner.
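The handle_<elem_name> dispatch can be sketched in a few lines. The StanzaDispatcher class, its log, and the two example handlers are hypothetical names for illustration; only each_element, attributes, and elements come from REXML.

```ruby
require "rexml/document"

class StanzaDispatcher
  attr_reader :log

  def initialize
    @log = []
  end

  # Call handle_<name> for every child element of `root`.
  def dispatch(root)
    root.each_element do |elem|
      handler = "handle_#{elem.name}"
      if respond_to?(handler, true)
        send(handler, elem)
      else
        @log << [:unhandled, elem.name]
      end
    end
  end

  private

  # Example handlers; a real server would route or answer the stanza here.
  def handle_message(elem)
    @log << [:message, elem.attributes["to"], elem.elements["body"]&.text]
  end

  def handle_presence(elem)
    @log << [:presence, elem.attributes["from"]]
  end
end
```

Because the method name is built from data sent by the client, the fixed handle_ prefix and the respond_to? check matter: anything without a matching handler just gets logged instead of being called.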
Apparently this is bad too, because this is what SAX does. But, not really. You pass a SAX parser an instance of a class that provides methods like tag_start, tag_end, text, etc. I don’t like this. It’s complicated. I have to make that class a giant state machine, and while doing the handle_<elem_name> thing seems heavy, writing an entire class just to do all the stuff the XML parser should be doing in the first place makes me sad.
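For comparison, here is roughly what the SAX-style version looks like with REXML's own stream API (REXML::StreamListener and REXML::Document.parse_stream). The StanzaListener class is a made-up example that does nothing but pull out message bodies, and it already needs a stack and a text accumulator to know where it is in the document.

```ruby
require "rexml/document"
require "rexml/streamlistener"

class StanzaListener
  include REXML::StreamListener

  attr_reader :bodies

  def initialize
    @bodies = []
    @path = []    # stack of currently open element names: the "state"
    @text = nil   # accumulates text only while inside <message><body>
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @text = String.new if @path == %w[message body]
  end

  def text(data)
    @text << data if @text
  end

  def tag_end(_name)
    if @path == %w[message body]
      @bodies << @text
      @text = nil
    end
    @path.pop
  end
end

listener = StanzaListener.new
REXML::Document.parse_stream("<message to='a@b'><body>hi</body></message>", listener)
```

Every additional stanza type means more branches on @path, which is the giant state machine I was complaining about.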
As far as I can tell, the only good reason (despite multiple XMPP client authors calling me nuts for not liking it) to use SAX is the speed and memory savings involved. In a program that’s parsing a 400MB XML document, I can see how that would matter, since you don’t have to load the entire tree into memory. This isn’t my problem. I load maybe 5-15 objects per parse, whether I’m using SAX or DOM. So, I don’t care what the client authors say. I never liked client authors anyway (especially Jacob News, but that’s a story for another day).
But none of that really matters, because REXML still sucks. The only other XML library I have handy is hpricot (a separate gem, not part of the standard library), which isn’t really an XML parser, but it works, and in pretty much the same way as REXML, only better. However, I haven’t figured out how to get it to generate XML to be written out. I could always continue to use REXML for the writing, and hpricot for the parsing. Sadly, hpricot is lenient about broken markup and doesn’t check well-formedness, which XMPP requires. Also, xml-simple seems to be interesting. Sadly, it uses REXML. There’s always libxml, I suppose, but I really don’t want to bring a third-party library into it (xml-simple is one file that I could easily include).
I think my best option, if I want to make a radical change, is to use libxml. A sort of middle ground would be to use hpricot and REXML together.
So, to summarize:
- Ruby’s XML support is lacking;
- DOM vs SAX doesn’t matter when you’re parsing a few tags at a time;
- XML is still pretty crazy shit, i.e. wouldn’t want it on my flapjacks.
Please leave comments. I’m so very lonely.
-- rakaur // 2008.09.27 @ 10:28 AM
1 Comment
Comment #2077
[ Seth on Mon Nov 03 at 03:16 PM ]
Maybe give libxml-ruby a shot? It’s too bad that xmpp4r is so tightly tied to REXML.
http://blog.codahale.com/2008/07/16/libxml-ruby-is-back/
-- Seth // 2008.11.03 @ 03:16 PM
