


Implementation method of cdn log analysis using pandas library
This article mainly introduces the relevant information about using the pandas library in Python for cdn log analysis. The article shares the complete sample code of pandas for cdn log analysis, and then introduces the relevant content about the pandas library in detail. Friends who need it You can use it as a reference, let’s take a look below.
Preface
I recently encountered a need at work, which is to filter some data based on CDN logs, such as traffic and status code statistics. TOP IP, URL, UA, Referer, etc. In the past, the bash shell was used to implement this. However, when the log volume is large, the number of log files is gigabytes, and the number of lines reaches tens of billions, processing through the shell is not enough and the processing time is too long. So I studied the use of Python pandas, a data processing library. Ten million lines of logs are processed in about 40 seconds.
Code
#!/usr/bin/python # -*- coding: utf-8 -*- # sudo pip install pandas author = 'Loya Chen' import sys import pandas as pd from collections import OrderedDict """ Description: This script is used to analyse qiniu cdn log. ================================================================================ 日志格式 IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA" ================================================================================ 日志示例 [0] [1][2] [3] [4] [5] 101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -" [6] [7] [8] [9] 200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" ================================================================================ """ if len(sys.argv) != 2: print('Usage:', sys.argv[0], 'file_of_log') exit() else: log_file = sys.argv[1] # 需统计字段对应的日志位置 ip = 0 url = 5 status_code = 6 size = 7 referer = 8 ua = 9 # 将日志读入DataFrame reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True) loop = True chunkSize = 10000000 chunks = [] while loop: try: chunk = reader.get_chunk(chunkSize) chunks.append(chunk) except StopIteration: #Iteration is stopped. loop = False df = pd.concat(chunks, ignore_index=True) byte_sum = df[size].sum() #流量统计 top_status_code = pd.DataFrame(df[6].value_counts()) #状态码统计 top_ip = df[ip].value_counts().head(10) #TOP IP top_referer = df[referer].value_counts().head(10) #TOP Referer top_ua = df[ua].value_counts().head(10) #TOP User-Agent top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100) top_url = df[url].value_counts().head(10) #TOP URL top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \ .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最大的URL top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \ .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最多的IP # 将结果有序存入字典 result = OrderedDict([("流量总计[单位:GB]:" , byte_sum/1024/1024/1024), ("状态码统计[次数|百分比]:" , top_status_code), ("IP TOP 10:" , top_ip), ("Referer TOP 10:" , top_referer), ("UA TOP 10:" , top_ua), ("URL TOP 10:" , top_url), ("请求流量最大的URL TOP 10[单位:MB]:" , top_url_byte), ("请求流量最大的IP TOP 10[单位:MB]:" , top_ip_byte) ]) # 输出结果 for k,v in result.items(): print(k) print(v) print('='*80)
pandas study notes
There are two basic data structures in Pandas, Series and Dataframe. A Series is an object similar to a one-dimensional array, consisting of a set of data and an index . Dataframe is a table type data structure with both row and column indexes.
from pandas import Series, DataFrame import pandas as pd
Series
In [1]: obj = Series([4, 7, -5, 3]) In [2]: obj Out[2]: 0 4 1 7 2 -5 3 3
The string representation of Series is: index on the left, value on the right. When no index is specified, an integer type index from 0 to N-1 (N is the length of the data) will be automatically created. The array representation and index object can be obtained through the values and index properties of the Series:
In [3]: obj.values Out[3]: array([ 4, 7, -5, 3]) In [4]: obj.index Out[4]: RangeIndex(start=0, stop=4, step=1)
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c']) In [6]: obj2 Out[6]: d 4 b 7 a -5 c 3
In [7]: obj2['a'] Out[7]: -5 In [8]: obj2[['c','d']] Out[8]: c 3 d 4
In [9]: obj2.sort_index() Out[9]: a -5 b 7 c 3 d 4 In [10]: obj2.sort_values() Out[10]: a -5 c 3 d 4 b 7
In [11]: obj2[obj2 > 0] Out[11]: d 4 b 7 c 3 In [12]: obj2 * 2 Out[12]: d 8 b 14 a -10 c 6
In [13]: 'b' in obj2 Out[13]: True In [14]: 'e' in obj2 Out[14]: False
In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000} In [16]: obj3 = Series(sdata) In [17]: obj3 Out[17]: Beijing 40000 Hangzhou 30000 Nanjing 26000 Shanghai 35000
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou'] In [19]: obj4 = Series(sdata, index=states) In [20]: obj4 Out[20]: Beijing 40000.0 Hangzhou 30000.0 Shanghai 35000.0 Suzhou NaN
Function can be used to detect missing data:
In [21]: pd.isnull(obj4) Out[21]: Beijing False Hangzhou False Shanghai False Suzhou True In [22]: pd.notnull(obj4) Out[22]: Beijing True Hangzhou True Shanghai True Suzhou False
In [23]: obj4.isnull() Out[23]: Beijing False Hangzhou False Shanghai False Suzhou True
In [24]: obj3 Out[24]: Beijing 40000 Hangzhou 30000 Nanjing 26000 Shanghai 35000 In [25]: obj4 Out[25]: Beijing 40000.0 Hangzhou 30000.0 Shanghai 35000.0 Suzhou NaN In [26]: obj3 + obj4 Out[26]: Beijing 80000.0 Hangzhou 60000.0 Nanjing NaN Shanghai 70000.0 Suzhou NaN
In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] In [28]: obj Out[28]: Bob 4 Steve 7 Jeff -5 Ryan 3
DataFrame
pandasIn [29]: df = pd.read_table('pandas_test.txt',sep=' ', names=['name', 'age']) In [30]: df Out[30]: name age 0 Bob 26 1 Loya 22 2 Denny 20 3 Mars 25
df[name]
In [31]: df['name'] Out[31]: 0 Bob 1 Loya 2 Denny 3 Mars Name: name, dtype: object
df.iloc[0,:] #第一个参数是第几行,第二个参数是列。这里指第0行全部列 df.iloc[:,0] #全部行,第0列
In [32]: df.iloc[0,:] Out[32]: name Bob age 26 Name: 0, dtype: object In [33]: df.iloc[:,0] Out[33]: 0 Bob 1 Loya 2 Denny 3 Mars Name: name, dtype: object
In [34]: df.iloc[1,1] Out[34]: 22 In [35]: df.iat[1,1] Out[35]: 22
In [36]: df.loc[1:2,['name','age']] Out[36]: name age 1 Loya 22 2 Denny 20
In [37]: df[(df.index >= 1) & (df.index <= 3)] Out[37]: name age city 1 Loya 22 Shanghai 2 Denny 20 Hangzhou 3 Mars 25 Nanjing In [38]: df[df['age'] > 22] Out[38]: name age city 0 Bob 26 Beijing 3 Mars 25 Nanjing
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing'] In [40]: df Out[40]: name age city 0 Bob 26 Beijing 1 Loya 22 Shanghai 2 Denny 20 Hangzhou 3 Mars 25 Nanjing
In [41]: df.sort_values(by='age') Out[41]: name age city 2 Denny 20 Hangzhou 1 Loya 22 Shanghai 3 Mars 25 Nanjing 0 Bob 26 Beijing
# 引入numpy 构建 DataFrame import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) In [43]: df Out[43]: d a b c three 0 1 2 3 one 4 5 6 7
# 以索引排序 In [44]: df.sort_index() Out[44]: d a b c one 4 5 6 7 three 0 1 2 3 In [45]: df.sort_index(axis=1) Out[45]: a b c d three 1 2 3 0 one 5 6 7 4 # 降序 In [46]: df.sort_index(axis=1, ascending=False) Out[46]: d c b a three 0 3 2 1 one 4 7 6 5
# 查看表头5行 df.head(5) # 查看表末5行 df.tail(5) # 查看列的名字 In [47]: df.columns Out[47]: Index(['name', 'age', 'city'], dtype='object') # 查看表格当前的值 In [48]: df.values Out[48]: array([['Bob', 26, 'Beijing'], ['Loya', 22, 'Shanghai'], ['Denny', 20, 'Hangzhou'], ['Mars', 25, 'Nanjing']], dtype=object)
df.T Out[49]: 0 1 2 3 name Bob Loya Denny Mars age 26 22 20 25 city Beijing Shanghai Hangzhou Nanjing
In [50]: df2 = df.copy() In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])] Out[52]: name age city 1 Loya 22 Shanghai 3 Mars 25 Nanjing
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], ...: index=['a', 'b', 'c', 'd'], columns=['one', 'two']) In [54]: df Out[54]: one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3
#按列求和 In [55]: df.sum() Out[55]: one 9.25 two -5.80 # 按行求和 In [56]: df.sum(axis=1) Out[56]: a 1.40 b 2.60 c NaN d -0.55
- ##Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
- See the Grouping section
In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ....: 'foo', 'bar', 'foo', 'foo'], ....: 'B' : ['one', 'one', 'two', 'three', ....: 'two', 'two', 'one', 'three'], ....: 'C' : np.random.randn(8), ....: 'D' : np.random.randn(8)}) ....: In [58]: df Out[58]: A B C D 0 foo one -1.202872 -0.055224 1 bar one -1.814470 2.395985 2 foo two 1.018601 1.552825 3 bar three -0.595447 0.166599 4 foo two 1.395433 0.047609 5 bar two -0.392670 -0.136473 6 foo one 0.007207 -0.561757 7 foo three 1.928123 -1.623033
Group it, and then apply the sum function
In [59]: df.groupby('A').sum() Out[59]: C D A bar -2.802588 2.42611 foo 3.146492 -0.63958 In [60]: df.groupby(['A','B']).sum() Out[60]: C D A B bar one -1.814470 2.395985 three -0.595447 0.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.414034 1.600434
[Related recommendations]
1.
Special recommendation : "php Programmer Toolbox" V0.1 version download2.
Python free video tutorialPython Basics Introduction TutorialThe above is the detailed content of Implementation method of cdn log analysis using pandas library. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

VS Code can run on Windows 8, but the experience may not be great. First make sure the system has been updated to the latest patch, then download the VS Code installation package that matches the system architecture and install it as prompted. After installation, be aware that some extensions may be incompatible with Windows 8 and need to look for alternative extensions or use newer Windows systems in a virtual machine. Install the necessary extensions to check whether they work properly. Although VS Code is feasible on Windows 8, it is recommended to upgrade to a newer Windows system for a better development experience and security.

VS Code can be used to write Python and provides many features that make it an ideal tool for developing Python applications. It allows users to: install Python extensions to get functions such as code completion, syntax highlighting, and debugging. Use the debugger to track code step by step, find and fix errors. Integrate Git for version control. Use code formatting tools to maintain code consistency. Use the Linting tool to spot potential problems ahead of time.

Running Python code in Notepad requires the Python executable and NppExec plug-in to be installed. After installing Python and adding PATH to it, configure the command "python" and the parameter "{CURRENT_DIRECTORY}{FILE_NAME}" in the NppExec plug-in to run Python code in Notepad through the shortcut key "F6".

In VS Code, you can run the program in the terminal through the following steps: Prepare the code and open the integrated terminal to ensure that the code directory is consistent with the terminal working directory. Select the run command according to the programming language (such as Python's python your_file_name.py) to check whether it runs successfully and resolve errors. Use the debugger to improve debugging efficiency.
