How to write a simple JSONParser using Python
JSON Tokenizer
For the lexical analysis of JSON, I mainly followed the method in the screenshot above and wrote a simple example of my own. It is straightforward to write, but it supports only a simple subset of JSON.
For the token types, I referred to https://json.org. Its grammar, however, spells out whitespace explicitly, which I am not used to handling, so I did not follow its grammar. Instead, spaces, newlines, and tabs are recognized during lexical analysis and simply discarded without further processing.
json_tokenizer.py
Use regular expressions to perform lexical analysis of JSON.
import json
import re
from typing import Dict, List, Union

# Token kinds
LEFT_BRACE = "LEFT_BRACE"        # {
RIGHT_BRACE = "RIGHT_BRACE"      # }
LEFT_BRACKET = "LEFT_BRACKET"    # [
RIGHT_BRACKET = "RIGHT_BRACKET"  # ]
COLON = "COLON"                  # :
COMMA = "COMMA"                  # ,
NUMBER = "NUMBER"                # integers only, e.g. 0, 42, -7
STRING = "STRING"                # ".*?" (escape sequences are not handled)
BOOL = "BOOL"                    # true/false
NULL = "NULL"                    # null
NEWLINE = "NEWLINE"              # \n
SKIP = "SKIP"                    # ' ', '\t', '\r'
MISMATCH = "MISMATCH"            # anything else

# Regular expression for each token kind.
# Order matters: MISMATCH must come last so it only catches leftovers.
token_specification = [
    ('LEFT_BRACE', r'[{]'),
    ('RIGHT_BRACE', r'[}]'),
    ('LEFT_BRACKET', r'[\[]'),
    ('RIGHT_BRACKET', r'[\]]'),
    ('COLON', r'[:]'),
    ('COMMA', r'[,]'),
    # The original pattern -?[1-9]+[0-9]* could not match a plain 0.
    ('NUMBER', r'-?(?:0|[1-9][0-9]*)'),
    ('STRING', r'".*?"'),
    ('BOOL', r'true|false'),
    ('NULL', r'null'),
    ('NEWLINE', r'\n'),
    ('SKIP', r'[ \t\r]'),
    ('MISMATCH', r'.')
]

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
print("Debug: ", tok_regex)


def process(kind: str, value: str) -> Dict[str, Union[str, bool, int, None]]:
    """
    Process the given kind and value and build a Dict
    that serves as a simple token object.
    """
    if kind == STRING:
        # Strip the surrounding double quotes; no better approach for now
        return {"kind": kind, "value": value[1:-1]}
    if kind == NUMBER:
        return {"kind": kind, "value": int(value)}
    if kind == BOOL:
        return {"kind": kind, "value": value == "true"}
    if kind == NULL:
        return {"kind": kind, "value": None}
    return {"kind": kind, "value": value}


def tokenizer(json_str: str) -> List[Dict[str, Union[str, bool, int, None]]]:
    """
    Split json_str into a list of token dicts.
    """
    tokens = []
    for m in re.finditer(tok_regex, json_str):
        # Token kind: the name of the group that matched
        kind = m.lastgroup
        # Token value: the matched text
        value = m.group()
        if kind == MISMATCH:
            raise Exception(f"invalid JSON: unexpected character {value!r}")
        if kind == NEWLINE or kind == SKIP:
            continue
        tokens.append(process(kind=kind, value=value))
    return tokens


if __name__ == "__main__":
    with open("./demo.json", "r", encoding="utf-8") as f:
        json_doc = f.read()
    tokens = tokenizer(json_doc)
    if tokens:
        with open("./json_tokens.json", "w", encoding="utf-8") as f:
            json.dump(tokens, f, ensure_ascii=False)
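The tokenizer above covers only a subset of JSON: strings are matched with the lazy pattern `".*?"`, which stops at the first quote it sees, and numbers are integers only. As a rough sketch of how the patterns could be tightened, the two regular expressions below handle backslash escapes and the fraction/exponent parts of numbers. The names `STRING_RE`, `NUMBER_RE`, `first_string`, and `is_number` are hypothetical and not part of the original code, and this is not a complete JSON lexer (for example, it does not reject control characters inside strings).

```python
import re

# Hypothetical stricter patterns, not taken from the article's tokenizer.
# STRING: opening quote, then any mix of ordinary characters and
# backslash escapes, then the closing quote.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')

# NUMBER: optional minus sign, integer part without leading zeros,
# optional fraction, optional exponent.
NUMBER_RE = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?')


def first_string(text: str) -> str:
    """Return the first JSON-style string literal found in text, or ''."""
    m = STRING_RE.search(text)
    return m.group() if m else ""


def is_number(text: str) -> bool:
    """Return True if the whole text is a number under NUMBER_RE."""
    return NUMBER_RE.fullmatch(text) is not None


if __name__ == "__main__":
    sample = r'"say \"hi\"" : 1'
    # Escape-aware pattern: keeps the escaped quotes inside the literal.
    print(first_string(sample))
    # Lazy pattern from the tokenizer: cut short at the first escaped quote.
    print(re.search(r'".*?"', sample).group())
    for candidate in ["0", "-12.5e3", "01", "1."]:
        print(candidate, is_number(candidate))
```

If these patterns were used in `token_specification`, `process` would also need to convert non-integer numbers with `float` and decode the escape sequences instead of only stripping the quotes.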
I have put all the input and output data in the document. I will post my input data and part of the output data below.
demo.json
{
    "name": "小黑子",
    "age": 3,
    "gender": false,
    "other_info": {
        "friends": [
            "嘎子",
            "潘叔",
            "狗"
        ],
        "declaration": "练习时长两年半",
        "hobbies": [
            "唱",
            "跳",
            "rap",
            "篮球????"
        ]
    }
}
json_tokens.json
Only part of the data is shown here, because the formatted output is fairly long.
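To give an idea of the shape of the token data, here is a minimal sketch that tokenizes a short input in the same style and prints each token. It is a cut-down stand-in, not the article's json_tokenizer.py: `SPEC`, `TOK_RE`, and `sketch_tokens` are hypothetical names, and only the token kinds needed for the example are included.

```python
import re
from typing import Any, Dict, List

# Cut-down token table for illustration only; the kind names mirror the
# article's, but this is a separate sketch.
SPEC = [
    ("LEFT_BRACE", r"\{"),
    ("RIGHT_BRACE", r"\}"),
    ("COLON", r":"),
    ("COMMA", r","),
    ("NUMBER", r"-?(?:0|[1-9][0-9]*)"),
    ("STRING", r'"[^"]*"'),
    ("BOOL", r"true|false"),
    ("SKIP", r"[ \t\n]+"),
    ("MISMATCH", r"."),
]
TOK_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in SPEC))


def sketch_tokens(text: str) -> List[Dict[str, Any]]:
    """Return a list of {"kind": ..., "value": ...} dicts for text."""
    out: List[Dict[str, Any]] = []
    for m in TOK_RE.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue  # whitespace is discarded, never emitted as a token
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {value!r}")
        if kind == "STRING":
            value = value[1:-1]
        elif kind == "NUMBER":
            value = int(value)
        elif kind == "BOOL":
            value = (value == "true")
        out.append({"kind": kind, "value": value})
    return out


if __name__ == "__main__":
    for token in sketch_tokens('{"age": 3, "gender": false}'):
        print(token)
```

For this input the sketch yields nine tokens, starting with `{'kind': 'LEFT_BRACE', 'value': '{'}`, `{'kind': 'STRING', 'value': 'age'}`, `{'kind': 'COLON', 'value': ':'}`, and `{'kind': 'NUMBER', 'value': 3}`. Because whitespace is dropped during tokenization, a compact and a pretty-printed version of the same document produce identical token lists.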
JSON Parser
json_parser.py
Parse the token sequence generated in the previous step into the Dict object that corresponds to the JSON. The parser implementation follows the JSON grammar file from antlr4, which leaves out whitespace and is therefore simpler to process.
import json
from typing import Dict, Union

# Token kinds
LEFT_BRACE = "LEFT_BRACE"        # {
RIGHT_BRACE = "RIGHT_BRACE"      # }
LEFT_BRACKET = "LEFT_BRACKET"    # [
RIGHT_BRACKET = "RIGHT_BRACKET"  # ]
COLON = "COLON"                  # :
COMMA = "COMMA"                  # ,
NUMBER = "NUMBER"                # integers only
STRING = "STRING"                # ".*?"
BOOL = "BOOL"                    # true/false
NULL = "NULL"                    # null


class Token(object):
    """For simplicity, this class is not implemented; tokens are plain dicts."""


class JSON_Parser(object):
    """
    JSON_Parser
    Parses the input token sequence into a Python object (dict) or array (list).
    """

    def __init__(self, tokens) -> None:
        self.index = 0
        self.tokens = tokens

    def get_token(self) -> Dict[str, Union[str, int, bool, None]]:
        """
        Return the current token.
        """
        if self.index < len(self.tokens):
            return self.tokens[self.index]
        else:
            raise Exception("index out of range.")

    def move_token(self) -> Dict[str, Union[str, int, bool, None]]:
        """
        Move to the next token and return it.
        """
        if self.index + 1 < len(self.tokens):
            self.index = self.index + 1
            return self.tokens[self.index]
        else:
            raise Exception("index out of range.")

    def parse(self):
        """
        Parse the whole JSON document.
        """
        token = self.get_token()
        if token.get("kind") == LEFT_BRACE:
            return self.parse_obj()
        elif token.get("kind") == LEFT_BRACKET:
            return self.parse_arr()
        else:
            raise Exception("invalid JSON: neither object nor array.")

    def parse_obj(self):
        """
        Parse an object. On return, the index rests on the closing '}'.
        """
        obj = {}
        token = self.move_token()
        kind = token.get("kind")
        # '{' '}'
        if kind == RIGHT_BRACE:
            return obj
        # '{' pair (',' pair)* '}'
        name, val = self.parse_pair()
        obj[name] = val
        while self.index < len(self.tokens):
            token = self.move_token()
            kind = token.get("kind")
            if kind == COMMA:
                self.move_token()
                name, val = self.parse_pair()
                obj[name] = val
            elif kind == RIGHT_BRACE:
                return obj
            else:
                raise Exception("error while parsing object")

    def parse_arr(self):
        """
        Parse an array. On return, the index rests on the closing ']'.
        """
        arr = []
        token = self.move_token()
        kind = token.get("kind")
        # '[' ']'
        # (the original compared against RIGHT_BRACE here, so an empty
        # array was never recognized)
        if kind == RIGHT_BRACKET:
            return arr
        # '[' value (',' value)* ']'
        val = self.parse_value()
        arr.append(val)
        while self.index < len(self.tokens):
            token = self.move_token()
            kind = token.get("kind")
            if kind == COMMA:
                self.move_token()
                val = self.parse_value()
                arr.append(val)
            elif kind == RIGHT_BRACKET:
                return arr
            else:
                raise Exception("error while parsing array")

    def parse_value(self):
        """
        Parse a value: object, array, string, number, bool, or null.
        """
        token = self.get_token()
        kind = token.get("kind")
        if kind == LEFT_BRACE:
            return self.parse_obj()
        elif kind == LEFT_BRACKET:
            return self.parse_arr()
        elif kind == STRING or kind == NUMBER or kind == BOOL:
            return token.get("value")
        elif kind == NULL:
            return None
        else:
            raise Exception("encountered unexpected token")

    def parse_pair(self):
        """
        Parse a name/value pair.
        """
        token = self.get_token()
        kind = token.get("kind")
        name = token.get("value")
        # STRING ':' value
        if kind == STRING:
            token = self.move_token()
            kind = token.get("kind")
            if kind == COLON:
                token = self.move_token()
                return name, self.parse_value()
        raise Exception("error while parsing pair")


if __name__ == "__main__":
    # Path to the JSON token file
    TOKEN_PATH = "./json_tokens.json"
    # Read the token sequence
    with open(TOKEN_PATH, "r", encoding="utf-8") as f:
        input_tokens = json.load(f)
    if not input_tokens:
        raise Exception("input token sequence is empty")
    # Lookup table for debugging: makes it easy to see
    # which token the index has reached
    for i, tok in enumerate(input_tokens):
        print(f"debug {i:2d} --> {tok}")
    print("\n===========================================\n")
    parser = JSON_Parser(tokens=input_tokens)
    json_obj = parser.parse()
    # Convert the object back to JSON and print it formatted
    print(json.dumps(json_obj, ensure_ascii=False, indent=4))
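Output result: the script first prints the debug table of tokens, then a separator line, and finally the parsed object serialized back to JSON with 4-space indentation.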
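As a point of comparison, the same grammar (object, array, pair, value) can also be written without a shared mutable index: each function takes a position and returns the parsed value together with the position just past it. The sketch below is an alternative design under that assumption, not the article's implementation. The helpers `tok`, `expect`, and `parse_document` are hypothetical, and unlike `JSON_Parser.parse` it accepts any value at the top level. The token list is built by hand so the block stays self-contained, and the result is checked against the standard library's `json.loads`.

```python
import json
from typing import Any, Dict, List, Tuple

Token = Dict[str, Any]


def tok(kind: str, value: Any = None) -> Token:
    """Hypothetical helper: build a token dict by hand."""
    return {"kind": kind, "value": value}


def expect(tokens: List[Token], i: int, kind: str) -> int:
    """Check that tokens[i] has the given kind; return the next index."""
    if i >= len(tokens) or tokens[i]["kind"] != kind:
        raise ValueError(f"expected {kind} at token {i}")
    return i + 1


def parse_value(tokens: List[Token], i: int) -> Tuple[Any, int]:
    """value: object | array | STRING | NUMBER | BOOL | NULL"""
    if i >= len(tokens):
        raise ValueError("unexpected end of tokens")
    kind = tokens[i]["kind"]
    if kind == "LEFT_BRACE":
        return parse_obj(tokens, i)
    if kind == "LEFT_BRACKET":
        return parse_arr(tokens, i)
    if kind in ("STRING", "NUMBER", "BOOL", "NULL"):
        return tokens[i]["value"], i + 1
    raise ValueError(f"unexpected token {kind} at {i}")


def parse_obj(tokens: List[Token], i: int) -> Tuple[Dict[str, Any], int]:
    """object: '{' '}' | '{' pair (',' pair)* '}' with pair: STRING ':' value"""
    obj: Dict[str, Any] = {}
    i = expect(tokens, i, "LEFT_BRACE")
    if i < len(tokens) and tokens[i]["kind"] == "RIGHT_BRACE":
        return obj, i + 1
    while True:
        name_index = i
        i = expect(tokens, i, "STRING")
        i = expect(tokens, i, "COLON")
        value, i = parse_value(tokens, i)
        obj[tokens[name_index]["value"]] = value
        if i < len(tokens) and tokens[i]["kind"] == "COMMA":
            i += 1
            continue
        return obj, expect(tokens, i, "RIGHT_BRACE")


def parse_arr(tokens: List[Token], i: int) -> Tuple[List[Any], int]:
    """array: '[' ']' | '[' value (',' value)* ']'"""
    arr: List[Any] = []
    i = expect(tokens, i, "LEFT_BRACKET")
    if i < len(tokens) and tokens[i]["kind"] == "RIGHT_BRACKET":
        return arr, i + 1
    while True:
        value, i = parse_value(tokens, i)
        arr.append(value)
        if i < len(tokens) and tokens[i]["kind"] == "COMMA":
            i += 1
            continue
        return arr, expect(tokens, i, "RIGHT_BRACKET")


def parse_document(tokens: List[Token]) -> Any:
    """Parse one value and insist that no tokens are left over."""
    value, i = parse_value(tokens, 0)
    if i != len(tokens):
        raise ValueError(f"trailing tokens starting at {i}")
    return value


# Hand-built tokens for: {"a": [1, null], "b": {}}
example_tokens = [
    tok("LEFT_BRACE", "{"), tok("STRING", "a"), tok("COLON", ":"),
    tok("LEFT_BRACKET", "["), tok("NUMBER", 1), tok("COMMA", ","),
    tok("NULL"), tok("RIGHT_BRACKET", "]"), tok("COMMA", ","),
    tok("STRING", "b"), tok("COLON", ":"),
    tok("LEFT_BRACE", "{"), tok("RIGHT_BRACE", "}"), tok("RIGHT_BRACE", "}"),
]

if __name__ == "__main__":
    result = parse_document(example_tokens)
    print(result == json.loads('{"a": [1, null], "b": {}}'))
```

Returning the next position from every function makes it explicit where each rule stops, which also makes it straightforward to detect trailing tokens after the top-level value.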