include/HTMLPurifier/standalone/HTMLPurifier/Lexer/PEARSax3.php
\HTMLPurifier_Lexer_PEARSax3
Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
PEAR, not surprisingly, also has a SAX parser for HTML. I don't know very much about its implementation, but it's fairly well written. However, that abstraction comes at a price: performance. You need to have it installed, and if the API changes, it might break our adapter. Not sure whether or not it's UTF-8 aware, but it has some entity parsing trouble (in all areas, text and attributes).
Quite personally, I don't recommend using the PEAR class, and the defaults don't use it. The unit tests do perform the tests on the SAX parser too, but whatever it does for poorly formed HTML is up to it.
- Parent(s)
- \HTMLPurifier_Lexer
- Todo
- Generalize so that XML_HTMLSax is also supported.
- Warning
- Entity-resolution inside attributes is broken.
Properties
$_special_entity2str= 'array(
'&quot;' => '"',
'&amp;' => '&',
'&lt;' => '<',
'&gt;' => '>',
'&#39;' => "'",
'&#039;' => "'",
'&#x27;' => "'"
)'
Most common entity to raw value conversion table for special entities.
Inherited from: \HTMLPurifier_Lexer::$_special_entity2str
Details
- Type
- n/a
- Inherited_from
- \HTMLPurifier_Lexer::$_special_entity2str
$tracksLineNumbers= 'false'
Whether or not this lexer implements line-number/column-number tracking.
Inherited from: \HTMLPurifier_Lexer::$tracksLineNumbers
If it does, set to true.
Details
- Type
- n/a
- Inherited_from
- \HTMLPurifier_Lexer::$tracksLineNumbers
Methods
CDATACallback($matches) : void
Callback function for escapeCDATA() that does the work.
Inherited from: \HTMLPurifier_Lexer::CDATACallback()
Name | Type | Description |
---|---|---|
$matches | | |
- Params
- $matches PCRE matches array, with index 0 the entire match and 1 the inside of the CDATA section.
- Returns
- Escaped internals of the CDATA section.
- Warning
- Though this is public in order to let the callback happen, calling it directly is not recommended.
closeHandler($parser, $name) : void
Close tag event handler, interface is defined by PEAR package.
Name | Type | Description |
---|---|---|
$parser | | |
$name | | |
create($config) : \Concrete
Retrieves or sets the default Lexer as a Prototype Factory.
Inherited from: \HTMLPurifier_Lexer::create()
By default HTMLPurifier_Lexer_DOMLex will be returned. There are a few exceptions involving special features that only DirectLex implements.
Name | Type | Description |
---|---|---|
$config | | Instance of HTMLPurifier_Config |
Type | Description |
---|---|
\Concrete | lexer. |
- Note
- The behavior of this class has changed: rather than accepting a prototype object, it now accepts a configuration object. To specify your own prototype, set %Core.LexerImpl to it. This change in behavior de-singletonizes the lexer object.
dataHandler($parser, $data) : void
Data event handler, interface is defined by PEAR package.
Name | Type | Description |
---|---|---|
$parser | | |
$data | | |
escapeCDATA($string) : void
Translates CDATA sections into regular sections (through escaping).
Inherited from: \HTMLPurifier_Lexer::escapeCDATA()
Name | Type | Description |
---|---|---|
$string | | HTML string to process. |
- Returns
- HTML with CDATA sections escaped.
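The relationship between escapeCDATA() and CDATACallback() can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function names `escape_cdata_sketch` and `cdata_callback_sketch` are hypothetical, and the regular expression and the use of `htmlspecialchars()` are assumptions about one reasonable way to do the escaping.

```php
<?php
// Sketch only: both function names are hypothetical stand-ins,
// not HTML Purifier's own methods.

/**
 * Callback: receives the PCRE matches array, where index 0 is the whole
 * CDATA section and index 1 is the text inside it.
 */
function cdata_callback_sketch(array $matches): string
{
    // Escape &, <, > and double quotes so the text is safe as ordinary markup.
    return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
}

/**
 * Replaces every CDATA section in $html with its escaped contents.
 */
function escape_cdata_sketch(string $html): string
{
    // The /s modifier lets "." match newlines inside a CDATA section;
    // the lazy quantifier stops at the first closing "]]>".
    return preg_replace_callback(
        '/<!\[CDATA\[(.+?)\]\]>/s',
        'cdata_callback_sketch',
        $html
    );
}

echo escape_cdata_sketch('<p><![CDATA[a < b && c]]></p>'), "\n";
// prints: <p>a &lt; b &amp;&amp; c</p>
```

The callback has to be reachable by name from `preg_replace_callback()`, which is presumably why CDATACallback() is public even though calling it directly is discouraged.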
escapeCommentedCDATA($string) : void
Special CDATA case that is especially convoluted for
Inherited from: \HTMLPurifier_Lexer::escapeCommentedCDATA()
Name | Type | Description |
---|---|---|
$string | | |
escapeHandler($parser, $data) : void
Escaped text handler, interface is defined by PEAR package.
Name | Type | Description |
---|---|---|
$parser | | |
$data | | |
extractBody($html) : void
Takes a string of HTML (fragment or document) and returns the content
Inherited from: \HTMLPurifier_Lexer::extractBody()
Name | Type | Description |
---|---|---|
$html | | |
muteStrictErrorHandler($errno, $errstr, $errfile = null, $errline = null, $errcontext = null) : void
An error handler that mutes strict errors.
Name | Type | Description |
---|---|---|
$errno | | |
$errstr | | |
$errfile | | |
$errline | | |
$errcontext | | |
normalize($html, $config, $context) : void
Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
Inherited from: \HTMLPurifier_Lexer::normalize()
Name | Type | Description |
---|---|---|
$html | | |
$config | | |
$context | | |
openHandler($parser, $name, $attrs, $closed) : void
Open tag event handler, interface is defined by PEAR package.
Name | Type | Description |
---|---|---|
$parser | | |
$name | | |
$attrs | | |
$closed | | |
parseData($string) : void
Parses special entities into the proper characters.
Inherited from: \HTMLPurifier_Lexer::parseData()
This method translates escaped versions of the special characters into the correct ones.
Name | Type | Description |
---|---|---|
$string | | String character data to be parsed. |
- Returns
- Parsed character data.
- Warning
- You should be able to treat the output of this function as completely parsed, but that's only because all other entities should have been handled previously in substituteNonSpecialEntities().
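As a rough illustration of what translating the special entities involves, here is a minimal sketch. The function name `parse_data_sketch` is hypothetical, the entity spellings in the table are my reading of the `$_special_entity2str` property, and using `strtr()` is an assumption about a suitable technique rather than a statement of how the library does it.

```php
<?php
// Hypothetical stand-alone function, not the library's method.
function parse_data_sketch(string $string): string
{
    // Assumed contents of the special-entity table ($_special_entity2str).
    static $special = [
        '&quot;' => '"',
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&#39;'  => "'",
        '&#039;' => "'",
        '&#x27;' => "'",
    ];

    // strtr() makes a single pass over the input, so text produced by one
    // replacement is never scanned again for further entities.
    return strtr($string, $special);
}

echo parse_data_sketch('Tom &amp; Jerry &lt;3 &#039;cheese&#039;'), "\n";
// prints: Tom & Jerry <3 'cheese'
```

The single-pass behaviour matters for doubly escaped input: `&amp;lt;` becomes the literal text `&lt;` rather than `<`.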
removeIEConditional($string) : void
Special Internet Explorer conditional comments should be removed.
Inherited from: \HTMLPurifier_Lexer::removeIEConditional()
Name | Type | Description |
---|---|---|
$string | | |
tokenizeHTML($string, $config, $context) : \HTMLPurifier_Token
Lexes an HTML string into tokens.
Name | Type | Description |
---|---|---|
$string | | String HTML. |
$config | | |
$context | | |
Type | Description |
---|---|
\HTMLPurifier_Token | array representation of HTML. |
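The handler methods on this class (openHandler, closeHandler, dataHandler) follow the usual SAX pattern: the parser fires one callback per event, and the handlers append to a token list that tokenizeHTML() then returns. The sketch below shows that pattern only. It swaps in PHP's built-in XML extension for XML_HTMLSax3, so it accepts well-formed XML rather than arbitrary HTML, and its open-tag handler has no `$closed` argument. The class name `SaxTokenCollectorSketch` and the plain-array token layout are hypothetical.

```php
<?php
// Sketch: PHP's built-in XML parser (ext-xml) stands in for XML_HTMLSax3.
// The class name and token layout are hypothetical.
class SaxTokenCollectorSketch
{
    /** @var array[] Tokens gathered by the handlers, in document order. */
    public $tokens = [];

    // Fired for every opening tag; $attrs is a name => value array.
    public function openHandler($parser, $name, $attrs)
    {
        $this->tokens[] = ['start', $name, $attrs];
    }

    // Fired for every closing tag.
    public function closeHandler($parser, $name)
    {
        $this->tokens[] = ['end', $name];
    }

    // Fired for character data; one run of text may arrive in several calls.
    public function dataHandler($parser, $data)
    {
        $this->tokens[] = ['text', $data];
    }

    public function tokenize(string $xml): array
    {
        $this->tokens = [];
        $parser = xml_parser_create('UTF-8');
        // Keep element names as written instead of upper-casing them.
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler(
            $parser,
            [$this, 'openHandler'],
            [$this, 'closeHandler']
        );
        xml_set_character_data_handler($parser, [$this, 'dataHandler']);

        // Third argument true: this is the final (and only) chunk of input.
        if (!xml_parse($parser, $xml, true)) {
            throw new RuntimeException('Input is not well-formed XML');
        }
        return $this->tokens;
    }
}

$collector = new SaxTokenCollectorSketch();
print_r($collector->tokenize('<p class="x">Hi <b>there</b></p>'));
```

Keeping the token list on the object is what lets three independent callbacks cooperate; the trade-off is that one collector instance should not be shared between concurrent parses.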