How to parse invalid (error/malformed) XML?
Question content
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against actual customer data, and it looks like the other product allows users to enter input that should be considered invalid. Anyway, I still have to try and figure out a way to parse it. We are using javax.xml.parsers.DocumentBuilder, and I am getting an error on input that looks like the following.
<xml> ... <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description> ... </xml>
As you can see, the description contains what appears to be an invalid tag (<THIS-IS-PART-OF-DESCRIPTION>). The description element is known to be a leaf element and should not have any nested tags inside it. Regardless, this is still a problem, and it produces an exception on DocumentBuilder.parse(...).
I know this is invalid XML, but it is predictably invalid. Any ideas on a way to parse such input?
Workaround
That "XML" is worse than invalid: it is not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the violations does not help. That textual data is not XML, and no conformant XML tool or library can help you process it.
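To make the distinction concrete, here is a minimal sketch in which Python's standard-library parser stands in for any conformant XML parser. The sample string is modeled on the input in the question, and the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the question: the description text contains a stray,
# never-closed <THIS-IS-PART-OF-DESCRIPTION> tag.
BAD = (
    "<xml><description>Example:Description:"
    "<THIS-IS-PART-OF-DESCRIPTION></description></xml>"
)
GOOD = "<xml><description>Example:Description</description></xml>"


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(BAD))
print(is_well_formed(GOOD))
```

The parser rejects the first string outright rather than recovering, which is why the options that follow all involve fixing the data before a conformant parser sees it.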
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase well-formed XML is redundant, but it may be useful for emphasis.)
Use a tolerant markup parser to clean up the problem before parsing as XML:
- Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
- Standalone and C/C++: HTML Tidy also works with XML. Taggle is a port of TagSoup to C++.
- Python: Beautiful Soup is Python-based. See the notes in its Differences between parsers section. See also the answers to this question for more suggestions on dealing with not-well-formed markup in Python, especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
- Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
- .NET:
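  - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
  - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML well-formed parsed entities that lack a root element.
  - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntax issues, but note the rule-breaking warning in the third option below.
  - Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".
- Go: Set Decoder.Strict to false, as shown in this example by @chuckx.
- PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See the nice example here.
- Ruby: Nokogiri supports "Gentle Well-Formedness".
- R: See htmlTreeParse() for fault-tolerant markup parsing in R.
- Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text, either manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not: rule breaking is rarely bound by rules.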
For invalid character errors, use a regex to remove or replace the invalid characters:
- PHP:
preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{d7ff}\x{e000}-\x{fffd}]+/u', ' ', $s);
- Ruby:
string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{d7ff}\u{e000}-\u{fffd}", ' ')
- JavaScript:
inputStr.replace(/[^\x09\x0a\x0d\x20-\xff\x85\xa0-\ud7ff\ue000-\ufdcf\ufde0-\ufffd]/gm, '')
For ampersands, use a regex to replace matches with &amp; (credit: blhsin; demo):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
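For comparison, the two regex cleanups above might be sketched in Python roughly as follows, using only the standard library. The function and constant names are hypothetical. As an assumption beyond the patterns shown above, the character class also allows the supplementary range U+10000 to U+10FFFF, which the XML 1.0 `Char` production permits. Like the regexes above, this does not take comments or CDATA sections into account.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 Char production is treated as illegal.
# Assumption: supplementary planes (U+10000 to U+10FFFF) are kept as legal.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)

# An ampersand that does not start a numeric or named reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)")


def clean_xml_text(text: str) -> str:
    """Replace runs of illegal characters with a space, then escape bare '&'."""
    text = INVALID_XML_CHARS.sub(" ", text)
    return BARE_AMPERSAND.sub("&amp;", text)


raw = "<a>Tom & Jerry &lt; &#38; &#x26;\x00\x0b</a>"
cleaned = clean_xml_text(raw)
print(cleaned)
print(repr(ET.fromstring(cleaned).text))
```

Note that the ampersand pattern leaves anything shaped like `&name;` alone, so an undefined entity name would still be rejected by the parser afterwards.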
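By design, a standard XML parser will never accept invalid XML.
Your only option is to pre-process the input to remove the "predictably invalid" content, or to wrap it in CDATA, before parsing it.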
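One way such pre-processing might look is sketched below in Python, purely for illustration; the same idea carries over to a string pass in Java ahead of DocumentBuilder.parse(...). It rests on assumptions not stated in the question: the description element has no attributes, its text never contains the literal string `</description>`, and its content is raw text with no entity references already present (otherwise they would be escaped twice). The name `escape_descriptions` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: <description> is an attribute-free leaf element, so everything
# up to the next </description> can be treated as plain text.
DESCRIPTION = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)


def escape_descriptions(raw: str) -> str:
    """Escape &, < and > inside every <description> element (hypothetical helper)."""
    return DESCRIPTION.sub(
        lambda m: m.group(1) + escape(m.group(2)) + m.group(3), raw
    )


raw = (
    "<xml><description>Example:Description:"
    "<THIS-IS-PART-OF-DESCRIPTION></description></xml>"
)
fixed = escape_descriptions(raw)
root = ET.fromstring(fixed)
print(root.find("description").text)
```

Escaping is used here instead of a CDATA wrapper because a CDATA section would break again if the text ever contained `]]>`; either approach only holds for as long as the input stays as predictable as assumed.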