Table of Contents
1. DOM parsing
2. SAX parsing
3. ET parsing
4. ET_iter parsing

Analyze several ways Python parses XML

Sep 19, 2017 am 10:20 AM

When I first learned Python, I knew only two ways to parse XML, DOM and SAX, and neither was efficient enough. With a large number of files to process, both took an unacceptably long time.

After some searching online, I found ElementTree, which is widely used, relatively efficient, and recommended by many people, so I measured it against the other two. ElementTree can be used in two ways: the normal ElementTree (ET) and ElementTree.iterparse (ET_iter).

This article compares the four methods (DOM, SAX, ET, and ET_iter) side by side and judges the efficiency of each by the time it takes to process the same set of files.

In the program, each of the four parsing methods is written as a function, and the main program calls them one at a time to measure their parsing efficiency.
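The measurement described above can be sketched roughly as follows. This is only an illustration, not the author's main program: `time_parser` and `fake_parser` are hypothetical names, and the sketch assumes each parser takes a file path and returns a `(csv_text, row_count)` pair, as the functions in this article do.

```python
import time

def time_parser(parser, paths):
    """Run parser(path) over all paths; return (rows, seconds, rows_per_second)."""
    start = time.perf_counter()
    total_rows = 0
    for path in paths:
        _, cnt = parser(path)          # each parser returns (csv_text, row_count)
        total_rows += cnt
    elapsed = time.perf_counter() - start
    rate = total_rows / elapsed if elapsed > 0 else float("inf")
    return total_rows, elapsed, rate

def fake_parser(path):
    # Hypothetical stand-in for dom_parser, sax_parser, ET_parser or ET_parser_iter
    return "1,2,3\r\n", 1

rows, seconds, rate = time_parser(fake_parser, ["one.xml.gz", "two.xml.gz"])
print("VS row count: %d, running time: %f, rows processed per second: %d."
      % (rows, seconds, rate))
```

Passing the parser in as an argument keeps the timing code identical for all four methods, so only the parsing itself differs between runs.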

The decompressed XML file content example is:

[Image: sample of the decompressed XML file content]

The part of the main program that calls the functions is:

  print("File count:%d/%d." % (gz_cnt,paser_num))
  str_s,cnt = dom_parser(gz)
  #str_s,cnt = sax_parser(gz)
  #str_s,cnt = ET_parser(gz)
  #str_s,cnt = ET_parser_iter(gz)
  output.write(str_s)
  vs_cnt += cnt

Each parsing function returns two values. In the first version of the main program, the function was called once for each value, so every file was parsed twice. The call was later changed to receive both return values in a single call, which removed the redundant work.
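The difference between the two calling patterns can be shown with a small sketch. The parser here is a hypothetical stub that only counts how often it runs; it is not one of the article's functions.

```python
def make_counting_parser():
    """Return a hypothetical parser stub plus a dict recording how often it ran."""
    stats = {"calls": 0}
    def parser(path):
        stats["calls"] += 1
        return "1,2,3\r\n", 1          # (csv_text, row_count)
    return parser, stats

# Wasteful pattern: one call per return value, so the file is parsed twice.
parser, stats = make_counting_parser()
str_s = parser("sample.xml.gz")[0]
cnt = parser("sample.xml.gz")[1]
calls_separate = stats["calls"]

# Fixed pattern: unpack both values from a single call.
parser, stats = make_counting_parser()
str_s, cnt = parser("sample.xml.gz")
calls_unpacked = stats["calls"]

print(calls_separate, calls_unpacked)  # prints: 2 1
```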

1. DOM parsing

Function definition code:

def dom_parser(gz):
  # Python 3 port: io.StringIO replaces Python 2's cStringIO; os is imported for abspath
  import gzip,io,os
  import xml.dom.minidom
  
  vs_cnt = 0
  str_s = ''
  file_io = io.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  # Build the whole DOM tree in memory
  doc = xml.dom.minidom.parseString(xm.read())
  bulkPmMrDataFile = doc.documentElement
  # Read in the child elements
  enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
  measurements = enbs[0].getElementsByTagName("measurement")
  objects = measurements[0].getElementsByTagName("object")
  # Write one space-separated row per <v> element (obj avoids shadowing the built-in object)
  for obj in objects:
    vs = obj.getElementsByTagName("v")
    vs_cnt += len(vs)
    for v in vs:
      file_io.write(enbs[0].getAttribute("id")+' '+obj.getAttribute("id")+' '+\
      obj.getAttribute("MmeUeS1apId")+' '+obj.getAttribute("MmeGroupId")+' '+obj.getAttribute("MmeCode")+' '+\
      obj.getAttribute("TimeStamp")+' '+v.childNodes[0].data+'\n') # get the text value
  # Convert to CSV: trailing space + newline -> CRLF, spaces -> commas, 'T' -> space, NIL -> empty
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

******************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

******************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

……………………………………………………

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 107.077867, rows processed per second: 1660.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

******************************************************

Program processing ends.

Because DOM parsing reads the entire file into memory and builds a tree structure, its memory and time costs are relatively high. Its advantage is simple logic: no callback functions need to be defined, so it is easy to implement.
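A stripped-down version of the DOM approach, run on an inline document, might look like the sketch below. The `SAMPLE` document and the `dom_rows` helper are hypothetical: the layout is inferred from the tag and attribute names used in the function above, and the values are made up for illustration.

```python
import xml.dom.minidom

# Hypothetical sample; tag and attribute names follow the article's code.
SAMPLE = """<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="234598">
    <measurement>
      <smr>MR.LteScRSRP MR.LteScRSRQ</smr>
      <object id="60057089" MmeUeS1apId="1" MmeGroupId="2" MmeCode="3"
              TimeStamp="2016-02-24T06:00:00.000">
        <v>38 12 NIL </v>
        <v>40 15 NIL </v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

def dom_rows(xml_text):
    """Build the full DOM tree, then walk it and collect one row per <v>."""
    doc = xml.dom.minidom.parseString(xml_text)
    enb = doc.documentElement.getElementsByTagName("eNB")[0]
    rows = []
    for obj in enb.getElementsByTagName("object"):
        for v in obj.getElementsByTagName("v"):
            # The text of <v> is its first child node
            values = v.childNodes[0].data.split()
            rows.append([enb.getAttribute("id"), obj.getAttribute("id"),
                         obj.getAttribute("TimeStamp")] + values)
    return rows

rows = dom_rows(SAMPLE)
for row in rows:
    print(",".join(row))
```

Nothing can be read from the tree until `parseString` has consumed the whole document, which is where the memory and time costs come from.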

2. SAX parsing

Function definition code:

def sax_parser(gz):
  # Python 3 port: io.StringIO replaces Python 2's cStringIO
  import os,gzip,io
  from xml.parsers.expat import ParserCreate
  file_io = io.StringIO()
  
  # SAX handler class; parsing state lives on the instance instead of in globals,
  # so the row count returned below is the one the handler actually updated
  class DefaultSaxHandler(object):
    def __init__(self):
      self.d_eNB = {}
      self.d_obj = {}
      self.vs_cnt = 0
      self.flag = False
    # Handle a start tag
    def start_element(self, name, attrs):
      if name == 'eNB':
        self.d_eNB = attrs
      elif name == 'object':
        self.d_obj = attrs
      elif name == 'v':
        d_eNB, d_obj = self.d_eNB, self.d_obj
        file_io.write(d_eNB['id']+' '+ d_obj['id']+' '+d_obj['MmeUeS1apId']+' '+d_obj['MmeGroupId']+' '+d_obj['MmeCode']+' '+d_obj['TimeStamp']+' ')
        self.vs_cnt += 1
    # Handle the text between tags
    def char_data(self, text):
      if text[0:1].isnumeric():
        file_io.write(text)
      elif text[0:17] == 'MR.LteScPlrULQci1':
        # Text starting with this name marks the point where parsing should stop
        self.flag = True
    # Handle an end tag
    def end_element(self, name):
      if name == 'v':
        file_io.write('\n')
  
  # Wire the handler into an expat parser
  handler = DefaultSaxHandler()
  parser = ParserCreate()
  parser.StartElementHandler = handler.start_element
  parser.EndElementHandler = handler.end_element
  parser.CharacterDataHandler = handler.char_data
  str_s = ''
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  for line in xm.readlines():
    parser.Parse(line) # feed the XML content to the parser line by line
    if handler.flag:
      break
  # Convert to CSV: trailing space + newline -> CRLF, spaces -> commas, 'T' -> space, NIL -> empty
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,handler.vs_cnt)

Program running result:

******************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

******************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

........................................

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 14.386779, rows processed per second: 12361.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

******************************************************

Program processing ends.

SAX parsing runs significantly faster than DOM parsing. Because SAX handles the document as a stream of events (here the file is fed to the parser line by line) instead of building a tree, it uses less memory on large files, which is why it is widely used. The disadvantage is that you have to implement the callback functions yourself, so the logic is more complicated.
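The callback style can be sketched on a small inline document as follows. `SAMPLE`, `RowHandler` and `sax_rows` are hypothetical names for illustration; the document layout is inferred from the tag names in the function above, and the values are made up.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical sample; tag and attribute names follow the article's code.
SAMPLE = """<bulkPmMrDataFile>
  <eNB id="234598">
    <measurement>
      <object id="60057089" TimeStamp="2016-02-24T06:00:00.000">
        <v>38 12 NIL </v>
        <v>40 15 NIL </v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

class RowHandler:
    """Keeps the current eNB/object attributes and emits one row per <v>."""
    def __init__(self):
        self.enb, self.obj = {}, {}
        self.in_v, self.text, self.rows = False, [], []

    def start(self, name, attrs):
        if name == "eNB":
            self.enb = attrs
        elif name == "object":
            self.obj = attrs
        elif name == "v":
            self.in_v, self.text = True, []

    def chars(self, data):
        if self.in_v:
            self.text.append(data)     # text may arrive in several chunks

    def end(self, name):
        if name == "v":
            self.in_v = False
            self.rows.append([self.enb["id"], self.obj["id"]]
                             + "".join(self.text).split())

def sax_rows(chunks):
    handler = RowHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start
    parser.EndElementHandler = handler.end
    parser.CharacterDataHandler = handler.chars
    for chunk in chunks:               # feed the document piece by piece
        parser.Parse(chunk, False)
    parser.Parse("", True)             # signal the end of the document
    return handler.rows

rows = sax_rows(SAMPLE.splitlines(keepends=True))
print(rows)
```

Only the attributes of the current `eNB` and `object` are kept between callbacks, so memory use stays flat regardless of file size.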

3. ET parsing

Function definition code:

def ET_parser(gz):
  # Python 3 port: io.StringIO and xml.etree.ElementTree replace cStringIO and cElementTree
  import os,gzip,io
  import xml.etree.ElementTree as ET
  vs_cnt = 0
  str_s = ''
  file_io = io.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  # Parse the whole file into an element tree
  tree = ET.ElementTree(file=xm)
  root = tree.getroot()
  # root[1] is the eNB element, root[1][0] its first measurement
  for elem in root[1][0].findall('object'):
      for v in elem.findall('v'):
          file_io.write(root[1].attrib['id']+' '+elem.attrib['TimeStamp']+' '+elem.attrib['MmeCode']+' '+\
          elem.attrib['id']+' '+ elem.attrib['MmeUeS1apId']+' '+ elem.attrib['MmeGroupId']+' '+ v.text+'\n')
          vs_cnt += 1 # count one row per <v>, as the other parsers do
  # Convert to CSV: trailing space + newline -> CRLF, spaces -> commas, 'T' -> space, NIL -> empty
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

******************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

******************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

...........................................

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 4.308103, rows processed per second: 41282.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

******************************************************

Program processing ends.

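Compared with SAX parsing, ET parsing takes less time, and the function is also simpler to implement. ET combines DOM-like simple logic with parsing efficiency that rivals SAX, which makes it the first choice for XML parsing today.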
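A minimal ElementTree version on an inline document might look like this. `SAMPLE` and `et_rows` are hypothetical names; the document layout is inferred from the tag names used in the article's code, and the values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; tag and attribute names follow the article's code.
SAMPLE = """<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="234598">
    <measurement>
      <object id="60057089" TimeStamp="2016-02-24T06:00:00.000">
        <v>38 12 NIL </v>
        <v>40 15 NIL </v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

def et_rows(xml_text):
    """Parse the whole document, then collect one row per <v> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for enb in root.findall("eNB"):            # direct children of the root
        for obj in enb.iter("object"):         # any depth below the eNB
            for v in obj.findall("v"):
                rows.append([enb.get("id"), obj.get("id"), obj.get("TimeStamp")]
                            + (v.text or "").split())
    return rows

rows = et_rows(SAMPLE)
for row in rows:
    print(",".join(row))
```

Looking elements up by tag name, rather than by position as in `root[1][0]`, is a design choice of this sketch: it keeps working if the order of the children changes.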

4. ET_iter parsing

Function definition code:

def ET_parser_iter(gz):
  # Python 3 port: io.StringIO and xml.etree.ElementTree replace cStringIO and cElementTree
  import os,gzip,io
  import xml.etree.ElementTree as ET
  vs_cnt = 0
  str_s = ''
  file_io = io.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  d_eNB = {}
  d_obj = {}
  i = 0
  for event,elem in ET.iterparse(xm,events=('start','end')):
    # Stop after the second <smr> element has been closed
    if i >= 2:
      break
    elif event == 'start':
      # Remember the attributes of the enclosing eNB and object
      if elem.tag == 'eNB':
        d_eNB = elem.attrib
      elif elem.tag == 'object':
        d_obj = elem.attrib
    elif event == 'end' and elem.tag == 'smr':
      i += 1
    elif event == 'end' and elem.tag == 'v':
      file_io.write(d_eNB['id']+' '+d_obj['TimeStamp']+' '+d_obj['MmeCode']+' '+d_obj['id']+' '+\
      d_obj['MmeUeS1apId']+' '+ d_obj['MmeGroupId']+' '+str(elem.text)+'\n')
      vs_cnt += 1
      elem.clear() # release the processed <v> element
  # Convert to CSV: trailing space + newline -> CRLF, spaces -> commas, 'T' -> space, NIL -> empty
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

******************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

******************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

...................................................

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 3.043805, rows processed per second: 58429.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

******************************************************

Program processing ends.

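With ET_iter parsing, throughput is about 40% higher than with ET (58,429 versus 41,282 rows per second) and roughly 35 times that of DOM parsing. Because iterparse parses the document incrementally, its memory usage is also relatively small.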
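The incremental pattern can be sketched on an inline document as follows. `SAMPLE` and `iter_rows` are hypothetical names; the document layout is inferred from the tag names used in the article's code, and the values are made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; tag and attribute names follow the article's code.
SAMPLE = """<bulkPmMrDataFile>
  <eNB id="234598">
    <measurement>
      <object id="60057089" TimeStamp="2016-02-24T06:00:00.000">
        <v>38 12 NIL </v>
        <v>40 15 NIL </v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

def iter_rows(stream):
    """Yield one row per <v> while the document is still being read."""
    enb_id = None
    obj_attrib = {}
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start tag is seen
            if elem.tag == "eNB":
                enb_id = elem.get("id")
            elif elem.tag == "object":
                obj_attrib = dict(elem.attrib)   # copy: survives elem.clear()
        elif elem.tag == "v":
            # 'end' event: the text of <v> is complete now
            yield [enb_id, obj_attrib["id"]] + (elem.text or "").split()
            elem.clear()
        elif elem.tag == "object":
            elem.clear()                         # drop the finished subtree

rows = list(iter_rows(io.BytesIO(SAMPLE.encode("utf-8"))))
print(rows)
```

Calling `elem.clear()` on finished elements is what keeps memory low; without it, iterparse still builds the complete tree in the background.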
