61 lines
2.8 KiB
Plaintext
61 lines
2.8 KiB
Plaintext
|
The data which Hubbub is fed (the input stream) gets buffered into a UTF-8
|
||
|
buffer. This buffer only holds a subset of the input stream at any given time.
|
||
|
To avoid unnecessary copying (which is both a speed and memory loss), Hubbub
|
||
|
tries to make all emitted strings point into this buffer, which is then
|
||
|
advanced after tokens have been emitted. This is not always possible, however,
|
||
|
because HTML5 specifies behaviour which requires changing various characters to
|
||
|
various other characters, and these sets of characters may not have the same
|
||
|
length. These cases are:
|
||
|
|
||
|
- CR handling -- CRLFs and CRs are converted to LFs
|
||
|
- tag and attribute names are lowercased
|
||
|
- entities are allowed in attribute names
|
||
|
- NUL bytes must be turned into U+FFFD REPLACEMENT CHARACTER
|
||
|
|
||
|
When collecting the strings it will emit, Hubbub starts by assuming that no
|
||
|
transformations on the input stream will be required. However, if it hits one
|
||
|
of the above cases, then it copies all of the collected characters into a buffer
|
||
|
and switches to using that instead. This means that every time a character is
|
||
|
collected and it is possible that that character could be collected into a
|
||
|
buffer, the code must check if it should be collected into a buffer. To allow
|
||
|
this check, and others, to happen when necessary and never otherwise, Hubbub
|
||
|
uses a set of macros to collect characters, detailed below.
|
||
|
|
||
|
Hubbub strings are (beginning,length) pairs. This means that once the
|
||
|
beginning is set to a position in the input stream, the string can collect
|
||
|
further character runs in the stream simply by adding to the length part. This
|
||
|
makes extending strings very efficient.
|
||
|
|
||
|
| COLLECT(hubbub_string str, uintptr_t cptr, size_t length)
|
||
|
|
||
|
This collects the character pointed to "cptr" (of size "length") into "str",
|
||
|
whether str is a buffered or unbuffered string, but only if "str" already
|
||
|
points to collected characters.
|
||
|
|
||
|
| COLLECT_NOBUF(hubbub_string str, size_t length)
|
||
|
|
||
|
This collects "length" bytes into "str", but only if "str" already points to
|
||
|
collected characters. (There is no need to pass the character, since this
|
||
|
just increases the length of the string.)
|
||
|
|
||
|
| COLLECT_MS(hubbub_string str, uintptr_t cptr, size_t length)
|
||
|
|
||
|
If "str" is currently zero-length, this acts like START(str, cptr, length).
|
||
|
Otherwise, it just acts like COLLECT(str, cptr, length).
|
||
|
|
||
|
| START(hubbub_string str, uintptr_t cptr, size_t length)
|
||
|
|
||
|
This sets the string "str"'s start to "cptr" and its length to "length".
|
||
|
|
||
|
| START_BUF(hubbub_string str, uintptr_t cptr, size_t length)
|
||
|
|
||
|
This buffers the character of length "length" pointed to by "c" and then
|
||
|
sets "str" to point to it.
|
||
|
|
||
|
| SWITCH(hubbub_string str)
|
||
|
|
||
|
This switches the string "str" from unbuffered to buffered; it copies all
|
||
|
characters currently collected in "str" to the buffer and then updates it
|
||
|
to point there.
|
||
|
|