Hubbub parser architecture
==========================

Introduction
------------

Hubbub is a flexible HTML parser. It offers two interfaces:

* a SAX-style event interface
* a DOM-style tree-based interface

Overview
--------

Hubbub consists of two parts:

* a tokeniser
* a tree builder

Tokeniser
---------

The tokeniser divides the data held in the document buffer into chunks
(tokens) and sends a SAX-style event for each chunk.
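
As a rough illustration of this event model, the C sketch below splits a
buffer into tag and character-data chunks and calls a handler once per chunk.
Every name in it (toy_token, toy_tokenise, toy_count_tags) is hypothetical and
is not part of Hubbub's API; the real tokeniser is considerably more involved
than a split on '<' and '>'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; a real tokeniser distinguishes many more. */
typedef enum { TOY_TOKEN_TAG, TOY_TOKEN_CHARACTERS } toy_token_type;

/* A token points into the document buffer; it does not own the bytes. */
typedef struct {
    toy_token_type type;
    const char *data;
    size_t len;
} toy_token;

/* SAX-style callback, invoked once per token. pw is client data. */
typedef void (*toy_token_handler)(const toy_token *token, void *pw);

/* Split buf into "<...>" chunks and the text between them, emitting one
 * event per chunk. Returns the number of events emitted. */
static size_t toy_tokenise(const char *buf, size_t len,
        toy_token_handler handler, void *pw)
{
    size_t pos = 0, count = 0;

    assert(buf != NULL && handler != NULL);

    while (pos < len) {
        toy_token token;
        size_t end;

        if (buf[pos] == '<') {
            /* Tag: run up to and including the closing '>' */
            const char *gt = memchr(buf + pos, '>', len - pos);
            end = (gt != NULL) ? (size_t) (gt - buf) + 1 : len;
            token.type = TOY_TOKEN_TAG;
        } else {
            /* Character data: run up to the next '<' */
            const char *lt = memchr(buf + pos, '<', len - pos);
            end = (lt != NULL) ? (size_t) (lt - buf) : len;
            token.type = TOY_TOKEN_CHARACTERS;
        }

        token.data = buf + pos;
        token.len = end - pos;
        handler(&token, pw);

        count++;
        pos = end;
    }

    return count;
}

/* Example handler: count tag tokens through the client pointer. */
static void toy_count_tags(const toy_token *token, void *pw)
{
    if (token->type == TOY_TOKEN_TAG)
        (*(size_t *) pw)++;
}
```

A client would pass its own handler and client pointer in place of
toy_count_tags; the tokeniser itself never needs to know what the handler
does with each token.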

Tree builder
------------

The tree builder constructs a DOM-like tree from the SAX events emitted by
the tokeniser. The exact representation of the tree is up to the client,
which must provide a number of tree building handler functions.
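
One way to picture the handler functions is as a table of function pointers
that the builder calls whenever it needs a node created or attached. The C
sketch below assumes such a table; the names (toy_tree_handler,
toy_treebuilder_start, client_node and so on) are invented for illustration
and do not describe Hubbub's actual handler set, which this document does
not list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical handler table: the builder only ever sees void * nodes. */
typedef struct {
    void *(*create_element)(void *ctx, const char *name);
    void (*append_child)(void *ctx, void *parent, void *child);
    void *ctx;  /* client data passed back to every handler */
} toy_tree_handler;

/* Toy builder state: a stack of currently open elements. */
typedef struct {
    const toy_tree_handler *handler;
    void *stack[16];
    size_t depth;
} toy_treebuilder;

static void toy_treebuilder_init(toy_treebuilder *tb,
        const toy_tree_handler *handler, void *root)
{
    assert(tb != NULL && handler != NULL && root != NULL);
    tb->handler = handler;
    tb->stack[0] = root;
    tb->depth = 1;
}

/* Start tag event: create a node via the client and attach it to the
 * innermost open element. Returns 0 on success, -1 on failure. */
static int toy_treebuilder_start(toy_treebuilder *tb, const char *name)
{
    void *node;

    if (tb->depth == sizeof(tb->stack) / sizeof(tb->stack[0]))
        return -1;  /* toy limit on nesting depth */

    node = tb->handler->create_element(tb->handler->ctx, name);
    if (node == NULL)
        return -1;

    tb->handler->append_child(tb->handler->ctx,
            tb->stack[tb->depth - 1], node);
    tb->stack[tb->depth++] = node;
    return 0;
}

/* End tag event: close the current element (the root is never popped). */
static void toy_treebuilder_end(toy_treebuilder *tb)
{
    if (tb->depth > 1)
        tb->depth--;
}

/* One possible client representation: nodes that only count children.
 * Toy simplification: name is assumed to outlive the node. */
typedef struct {
    const char *name;
    size_t n_children;
} client_node;

static void *client_create_element(void *ctx, const char *name)
{
    client_node *node = calloc(1, sizeof(*node));

    (void) ctx;
    if (node != NULL)
        node->name = name;
    return node;
}

static void client_append_child(void *ctx, void *parent, void *child)
{
    (void) ctx;
    (void) child;
    ((client_node *) parent)->n_children++;
}
```

Because the builder handles nodes only as opaque pointers, a different
client could back the same table with a full DOM implementation without
any change to the builder.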

Memory usage and ownership
--------------------------

Memory usage within the library is well defined, as is ownership of allocated
memory.

Raw input data provided by the library client is owned by the client.

SAX events that refer to document segments contain direct references to
internal data. Token objects are transient, and the data within them is no
longer valid once the event handler has returned control to the tokeniser,
so a handler that needs the data afterwards must take its own copy.
All data returned by a SAX event is owned by the library.
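
The practical consequence of transient token data can be sketched in C as
follows: a handler that wants to keep text must copy it into memory the
client owns before returning. The types and names here (toy_span,
client_text, client_on_characters) are assumptions made for the example, not
Hubbub's own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical view of token data: a pointer into library-owned memory
 * that is only valid until the event handler returns. */
typedef struct {
    const char *ptr;
    size_t len;
} toy_span;

/* Client-owned accumulator for the text seen so far. */
typedef struct {
    char *data;  /* NUL-terminated once non-NULL; owned by the client */
    size_t len;
} client_text;

/* Event handler: append a private copy of the span to the accumulator.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int client_on_characters(const toy_span *span, client_text *text)
{
    char *grown;

    assert(span != NULL && text != NULL);

    grown = realloc(text->data, text->len + span->len + 1);
    if (grown == NULL)
        return -1;  /* text->data is still valid and unchanged */

    if (span->len > 0)
        memcpy(grown + text->len, span->ptr, span->len);

    text->data = grown;
    text->len += span->len;
    text->data[text->len] = '\0';
    return 0;
}
```

After the handler returns, the library is free to reuse or release the
memory behind the span; the copy in client_text is unaffected and must be
freed by the client.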

The tree builder uses client callbacks to create the objects used
within the tree. Tree objects may be reference counted; a client that
relies on garbage collection instead may leave its ref/unref callbacks
empty. The resultant tree is owned by the client.
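
The two strategies for the ref/unref callbacks might look like this in C.
This is a sketch under assumed signatures (a single void * node argument);
the names rc_create, rc_ref, rc_unref, gc_ref and gc_unref are hypothetical,
and rc_live_nodes exists only so that the behaviour can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical client node carrying its own reference count. */
typedef struct {
    unsigned int refcnt;
} rc_node;

/* Number of nodes currently alive; here only to make the sketch checkable. */
static unsigned int rc_live_nodes;

static void *rc_create(void)
{
    rc_node *node = calloc(1, sizeof(*node));

    if (node != NULL) {
        node->refcnt = 1;  /* reference held by the caller */
        rc_live_nodes++;
    }
    return node;
}

/* Reference-counting client: take a reference. */
static void rc_ref(void *node)
{
    assert(node != NULL);
    ((rc_node *) node)->refcnt++;
}

/* Reference-counting client: drop a reference, freeing on the last one. */
static void rc_unref(void *node)
{
    rc_node *n = node;

    assert(n != NULL && n->refcnt > 0);
    if (--n->refcnt == 0) {
        rc_live_nodes--;
        free(n);
    }
}

/* Garbage-collected client: both callbacks deliberately do nothing. */
static void gc_ref(void *node) { (void) node; }
static void gc_unref(void *node) { (void) node; }
```

Either pair can be handed to the tree builder; which one is appropriate
depends only on how the client manages the lifetime of its own nodes.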

Parse errors
------------

Parse errors are reported through a dedicated event. This event
contains the line/column offset of the error location, along with a message
detailing the error.
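
A parse error event of this shape could be modelled in C as below. The
struct layout, the 1-based line and column numbering, and the names
(toy_parse_error, toy_report_error, client_on_error) are all assumptions made
for the sketch; the document does not specify how Hubbub represents the
event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical payload of a parse error event. */
typedef struct {
    unsigned int line;    /* 1-based, by assumption */
    unsigned int col;     /* 1-based, by assumption */
    const char *message;  /* owned by the reporter, not the handler */
} toy_parse_error;

typedef void (*toy_error_handler)(const toy_parse_error *error, void *pw);

/* Report an error at byte offset `offset` of buf, converting the offset
 * to a line/column pair by counting the newlines that precede it. */
static void toy_report_error(const char *buf, size_t offset,
        const char *message, toy_error_handler handler, void *pw)
{
    toy_parse_error error = { 1, 1, message };
    size_t i;

    assert(buf != NULL && handler != NULL);

    for (i = 0; i < offset; i++) {
        if (buf[i] == '\n') {
            error.line++;
            error.col = 1;
        } else {
            error.col++;
        }
    }

    handler(&error, pw);
}

/* Client-side record of the most recent error position. */
typedef struct {
    unsigned int line;
    unsigned int col;
    unsigned int count;
} client_error_log;

/* Example handler: copy out the position and count the error. */
static void client_on_error(const toy_parse_error *error, void *pw)
{
    client_error_log *log = pw;

    log->line = error->line;
    log->col = error->col;
    log->count++;
}
```

The handler copies the position rather than keeping the event pointer,
in keeping with the rule that event data belongs to the library.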

Exceptional circumstances
-------------------------

Exceptional circumstances (such as memory exhaustion) are reported
immediately.

The parser's state in such situations is undefined. There is no recovery
mechanism.

Therefore, if the client can recover from the exceptional
circumstance (e.g. by making more free memory available), the only valid
way to proceed is to create a new parser instance and start parsing from
scratch. The client should destroy the old parser instance and any DOM tree
produced by it, to avoid resource leaks.
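
The restart-from-scratch rule can be sketched in C with a stand-in parser.
Nothing below is Hubbub's API: toy_parser, toy_parser_parse,
toy_parse_with_restart and the toy_failures_pending hook are hypothetical,
and the "tree" is reduced to a byte count so that the example stays short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { TOY_OK, TOY_NOMEM } toy_error;

/* Hypothetical parser: all it "builds" is a count of bytes consumed. */
typedef struct {
    size_t consumed;  /* stands in for the tree built so far */
} toy_parser;

/* Hook for the sketch: how many upcoming parse calls report TOY_NOMEM. */
static unsigned int toy_failures_pending;

static toy_parser *toy_parser_create(void)
{
    return calloc(1, sizeof(toy_parser));
}

static void toy_parser_destroy(toy_parser *parser)
{
    free(parser);  /* a real client would also free the partial tree */
}

static toy_error toy_parser_parse(toy_parser *parser, const char *data,
        size_t len)
{
    (void) data;
    if (toy_failures_pending > 0) {
        toy_failures_pending--;
        parser->consumed += len / 2;  /* state is now unreliable */
        return TOY_NOMEM;
    }
    parser->consumed += len;
    return TOY_OK;
}

/* Parse data, restarting with a fresh parser after each failure and
 * giving up after max_attempts. Returns the finished parser or NULL;
 * *attempts receives the number of attempts made. */
static toy_parser *toy_parse_with_restart(const char *data, size_t len,
        unsigned int max_attempts, unsigned int *attempts)
{
    assert(data != NULL && attempts != NULL);

    for (*attempts = 0; *attempts < max_attempts; ) {
        toy_parser *parser = toy_parser_create();

        (*attempts)++;
        if (parser == NULL)
            continue;

        if (toy_parser_parse(parser, data, len) == TOY_OK)
            return parser;

        /* Never reuse the failed instance: destroy it and its output.
         * A real client would release memory here before retrying. */
        toy_parser_destroy(parser);
    }

    return NULL;
}
```

The important point is that the failed instance is never fed more data:
each retry begins with a new parser and the full input.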