A sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in the
browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML docs.

## What This Is (probably) Not

* An HTML Parser - That's the goal, but this isn't it. It's just XML for now.
* A DOM Builder - You can use it to build an object model out of XML, but it doesn't
  do that out of the box.
* XSLT - No DOM, no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX implementations are
  in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

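The parser doesn't build an object model out of the box, but one can be assembled from its events. The sketch below is one way to go about it, relying only on the `opentag`, `text`, and `closetag` events described in this document. `buildTree` and the node shape (`name`, `attributes`, `children`) are illustrative names, not part of this library.

```javascript
// Hypothetical helper: collects parser events into a plain object tree.
// `parser` is anything that accepts on<eventname> handlers,
// for example the object returned by sax.parser(true).
function buildTree(parser) {
  const root = { name: null, attributes: {}, children: [] };
  const stack = [root];

  parser.onopentag = function (node) {
    const element = { name: node.name, attributes: node.attributes, children: [] };
    stack[stack.length - 1].children.push(element);
    stack.push(element);
  };

  parser.ontext = function (text) {
    // Text becomes a plain string child of whichever element is open.
    stack[stack.length - 1].children.push(text);
  };

  parser.onclosetag = function () {
    // Never pop the synthetic root, even if a stray close tag shows up.
    if (stack.length > 1) stack.pop();
  };

  return root;
}
```

Call `buildTree(parser)` before writing anything, then read the result from the returned root's `children` once the `end` event has fired.
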
## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute values:
`&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional entities in XML
by putting them in the DTD. This parser doesn't do anything with that. If you want
to listen to the `ondoctype` event, and then fetch the doctypes, and read the entities
and add them to `parser.ENTITIES`, then be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass through unmolested.

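For anyone taking up that offer, a rough sketch of the idea follows. It assumes the argument to `ondoctype` is the raw doctype string and that the internal subset contains simple `<!ENTITY name "value">` declarations. `extractEntities` is a hypothetical helper, and its regular expression deliberately ignores parameter entities and external (`SYSTEM`/`PUBLIC`) entities.

```javascript
// Hypothetical helper: pull simple internal entity declarations out of a
// doctype string, so that <!ENTITY copy "(c)"> yields { copy: "(c)" }.
function extractEntities(doctype) {
  const entities = {};
  // name, then a value in either double or single quotes
  const pattern = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  let match;
  while ((match = pattern.exec(doctype)) !== null) {
    entities[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return entities;
}

// Usage sketch (assumes `parser` came from sax.parser and exposes ENTITIES
// as described above):
//
// parser.ondoctype = function (doctype) {
//   const found = extractEntities(doctype);
//   for (const name in found) parser.ENTITIES[name] = found[name];
// };
```
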
## Usage

    var sax = require("./lib/sax"),
      strict = true, // set to false for html-mode
      parser = sax.parser(strict);

    parser.onerror = function (e) {
      // an error happened. e is an instance of Error.
    };
    parser.ontext = function (t) {
      // got some text. t is the string of text.
    };
    parser.onopentag = function (node) {
      // opened a tag. node has "name" and "attributes"
    };
    parser.onattribute = function (attr) {
      // an attribute. attr has "name" and "value"
    };
    parser.onend = function () {
      // parser stream is done, and ready to have more stuff written to it.
    };

    parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

## Arguments

Pass the following arguments to the parser function. All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting. All default to `false`.

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single space.
* `lowercasetags` - Boolean. If true, then lowercase tags in loose mode, rather
  than uppercasing them.

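To make the effect of `trim` and `normalize` concrete, here is a small standalone sketch of the transformation those two options describe. It illustrates the documented behavior and is not the parser's actual code: `applyTextOptions` is a made-up name, and applying `trim` before `normalize` is an assumption.

```javascript
// Illustration only: what `trim` and `normalize` are described as doing
// to the text of a node.
function applyTextOptions(text, opt) {
  opt = opt || {};
  if (opt.trim) text = text.trim();
  // Collapse every run of whitespace (spaces, tabs, newlines) to one space.
  if (opt.normalize) text = text.replace(/\s+/g, " ");
  return text;
}

console.log(JSON.stringify(applyTextOptions("  a \n\t b  ", { trim: true, normalize: true })));
// prints "a b"
```

With the real parser, these settings would go in the second argument, along the lines of `sax.parser(false, { trim: true, normalize: true })`.
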
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at once. You
can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error` event. Then,
when the error is taken care of, you can call `resume` to continue parsing. Otherwise,
the parser will not continue while in an error state.

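A sketch of that recovery pattern follows. It assumes handlers run with the parser as `this` and that `line` and `column` are readable when the error fires, both as described elsewhere in this document. `collectErrors` is a hypothetical helper, and whether it makes sense to skip past every error depends on the input.

```javascript
// Hypothetical helper: record each error with its location, then let
// the parser carry on.
function collectErrors(parser) {
  const errors = [];
  parser.onerror = function (e) {
    // Note where the parser was looking when it complained.
    errors.push({ message: e.message, line: this.line, column: this.column });
    // Leave the error state so that parsing can continue.
    this.resume();
  };
  return errors;
}
```

The returned array fills up as the document is written, so it can be inspected after the `end` event to report everything that went wrong in one pass.
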
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML document where
the parser currently is looking.

`closed` - Boolean indicating whether or not the parser can be written to. If it's
`true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

And a bunch of other stuff that you probably shouldn't touch.

## Events

All events emit with a single argument. To listen to an event, assign a function to
`on<eventname>`. Functions get executed in the this-context of the parser object.
The list of supported events is also in the exported `EVENTS` array.

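Because handlers are plain `on<eventname>` properties, attaching one to every event is a short loop. The sketch below takes the list of event names as a parameter; in practice that would presumably be the exported `EVENTS` array. `traceEvents` and the `log` callback are illustrative names.

```javascript
// Hypothetical helper: route every named event to a single logging callback.
function traceEvents(parser, eventNames, log) {
  eventNames.forEach(function (name) {
    parser["on" + name] = function (arg) {
      log(name, arg);
    };
  });
  return parser;
}
```

Something like `traceEvents(sax.parser(true), sax.EVENTS, console.log)` would then print each event as it fires, assuming `EVENTS` is exported on the same object as `parser`.
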
`error` - Indication that something bad happened. The error will be hanging out on
`parser.error`, and must be deleted before parsing can continue. By listening to
this event, you can keep an eye on that kind of stuff. Note: this happens *much*
more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument: object with
`name` and `body` members. Attributes are not parsed, as processing instructions
have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>` would trigger
this kind of event. This is a weird thing to support, so it might go away at some
point. SAX isn't intended to be used to parse SGML, after all.

`opentag` - An opening tag. Argument: object with `name` and `attributes`. In
non-strict mode, tag names are uppercased.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their parent
closes. In strict mode, well-formedness is enforced. Note that self-closing tags
will have `closetag` emitted immediately after `opentag`. Argument: tag name.

`attribute` - An attribute node. Argument: object with `name` and `value`.

`comment` - A comment node. Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get quite large,
this event may fire multiple times for a single block, if it is broken up into
multiple `write()`s. Argument: the string of random character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

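Since a single block may arrive as several `cdata` events, code that wants the whole block has to stitch the pieces together. One way to do that is sketched here, under the assumption that `opencdata` and `closecdata` always bracket the `cdata` events for a block. `collectCdata` and `onblock` are hypothetical names.

```javascript
// Hypothetical helper: buffer cdata chunks and hand over each complete block.
function collectCdata(parser, onblock) {
  let chunks = null; // null means "not inside a CDATA block"

  parser.onopencdata = function () {
    chunks = [];
  };

  parser.oncdata = function (text) {
    if (chunks) chunks.push(text);
  };

  parser.onclosecdata = function () {
    if (chunks) onblock(chunks.join(""));
    chunks = null;
  };
}
```

This trades the memory savings of chunked delivery for convenience, so it is only a good fit when the blocks are known to be reasonably small.
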
`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written to.

## Todo

Build an HTML parser on top of this, which follows the same parsing rules as web browsers.

Make it fast by replacing the trampoline with a switch, and not buffering so much