Home Backend Development Python Tutorial Detailed explanation of cdn log analysis through the pandas library in Python

Detailed explanation of cdn log analysis through the pandas library in Python

Mar 24, 2017 pm 05:08 PM

Preface

I recently encountered a need at work, which is to filter some data based on CDN logs, such as traffic, status code statistics, TOP IP, URL, UA , Referer, etc. In the past, the bash shell was used to implement this. However, when the log volume is large, the number of log files is gigabytes, and the number of lines reaches tens of billions, processing through the shell is not enough and the processing time is too long. So I studied the use of Python pandas, a data processing library. Ten million lines of logs are processed in about 40 seconds.

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
日志格式
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
日志示例
 [0] [1][2]  [3]  [4]   [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.php.cn/ -" 
[6] [7] [8]    [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
 print('Usage:', sys.argv[0], 'file_of_log')
 exit() 
else:
 log_file = sys.argv[1] 
# 需统计字段对应的日志位置 
ip  = 0
url  = 5
status_code = 6
size = 7
referer = 8
ua  = 9
# 将日志读入DataFrame
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
 try:
 chunk = reader.get_chunk(chunkSize)
 chunks.append(chunk)
 except StopIteration:
 #Iteration is stopped.
 loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()        #流量统计
top_status_code = pd.DataFrame(df[6].value_counts())      #状态码统计
top_ip  = df[ip].value_counts().head(10)      #TOP IP
top_referer = df[referer].value_counts().head(10)      #TOP Referer
top_ua  = df[ua].value_counts().head(10)      #TOP User-Agent
top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)
top_url  = df[url].value_counts().head(10)      #TOP URL
top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最大的URL
top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最多的IP
# 将结果有序存入字典
result = OrderedDict([("流量总计[单位:GB]:"   , byte_sum/1024/1024/1024),
   ("状态码统计[次数|百分比]:"  , top_status_code),
   ("IP TOP 10:"    , top_ip),
   ("Referer TOP 10:"   , top_referer),
   ("UA TOP 10:"    , top_ua),
   ("URL TOP 10:"   , top_url),
   ("请求流量最大的URL TOP 10[单位:MB]:" , top_url_byte), 
   ("请求流量最大的IP TOP 10[单位:MB]:" , top_ip_byte)
])
# 输出结果
for k,v in result.items():
 print(k)
 print(v)
 print('='*80)
Copy after login

pandas study notes

There are two basic data structures in Pandas, Series and Dataframe. A Series is an object similar to a one-dimensional array, consisting of a set of data and an index. Dataframe is a tabular data structure with both row and column indexes.

from pandas import Series, DataFrame
import pandas as pd
Copy after login

Series

In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]: 
0 4
1 7
2 -5
3 3
Copy after login

The string representation of Series is: index on the left and value on the right. When no index is specified, an integer index ranging from 0 to N-1 (N is the length of the data) will be automatically created. The array representation and index object can be obtained through the values ​​and index properties of the Series:

In [3]: obj.values
Out[3]: array([ 4, 7, -5, 3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
Copy after login

Usually the index is specified when creating the Series:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]: 
d 4
b 7
a -5
c 3
Copy after login

Get a single or a group of values ​​in the Series through the index :

In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]: 
c 3
d 4
Copy after login

Sort

In [9]: obj2.sort_index()
Out[9]: 
a -5
b 7
c 3
d 4
In [10]: obj2.sort_values()
Out[10]: 
a -5
c 3
d 4
b 7
Copy after login

Filter operation

In [11]: obj2[obj2 > 0]
Out[11]: 
d 4
b 7
c 3
In [12]: obj2 * 2
Out[12]: 
d 8
b 14
a -10
c 6
Copy after login

Member

In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False
Copy after login

Create Series through dictionary

In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]: 
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000
Copy after login

If only one dictionary is passed in , then the index in the result Series is the key of the original dictionary (ordered arrangement)

In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]: 
Beijing 40000.0
Hangzhou 30000.0
Shanghai 35000.0
Suzhou  NaN
Copy after login

When specifying index, the three values ​​in sdata that match the states index will be found and placed in the response position on, but since the sdata value corresponding to 'Suzhou' cannot be found, the result is NaN (not a number). The isnull and notnull functions in pandas used to represent missing or NA values

pandas can be used For detecting missing data:

In [21]: pd.isnull(obj4)
Out[21]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True
In [22]: pd.notnull(obj4)
Out[22]: 
Beijing True
Hangzhou True
Shanghai True
Suzhou False
Copy after login

Series also has similar instance methods

In [23]: obj4.isnull()
Out[23]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True
Copy after login

An important function of Series is to automatically align data with different indexes during data operations

In [24]: obj3
Out[24]: 
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000
In [25]: obj4
Out[25]: 
Beijing 40000.0
Hangzhou 30000.0
Shanghai 35000.0
Suzhou  NaN
In [26]: obj3 + obj4
Out[26]: 
Beijing 80000.0
Hangzhou 60000.0
Nanjing  NaN
Shanghai 70000.0
Suzhou  NaN
Copy after login

The index of the Series can be modified in place by copying

In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [28]: obj
Out[28]: 
Bob 4
Steve 7
Jeff -5
Ryan 3
Copy after login

DataFrame

pandas reads the file

In [29]: df = pd.read_table('pandas_test.txt',sep=' ', names=['name', 'age'])
In [30]: df
Out[30]: 
 name age
0 Bob 26
1 Loya 22
2 Denny 20
3 Mars 25
Copy after login

DataFrame column selection

df[name]
Copy after login
In [31]: df['name']
Out[31]: 
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object
Copy after login

DataFrame row selection

df.iloc[0,:] #第一个参数是第几行,第二个参数是列。这里指第0行全部列
df.iloc[:,0] #全部行,第0列
Copy after login
In [32]: df.iloc[0,:]
Out[32]: 
name Bob
age 26
Name: 0, dtype: object
In [33]: df.iloc[:,0]
Out[33]: 
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object
Copy after login

Get an element, you can use iloc, the faster way is iat

In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22
Copy after login

DataFrame block selection

In [36]: df.loc[1:2,['name','age']]
Out[36]: 
 name age
1 Loya 22
2 Denny 20
Copy after login

Filter rows based on conditions

Add judgment conditions in square brackets to filter rows. The conditions must return True or False

In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]: 
 name age city
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
In [38]: df[df[&#39;age&#39;] > 22]
Out[38]: 
 name age city
0 Bob 26 Beijing
3 Mars 25 Nanjing
Copy after login

Add columns

In [39]: df[&#39;city&#39;] = [&#39;Beijing&#39;, &#39;Shanghai&#39;, &#39;Hangzhou&#39;, &#39;Nanjing&#39;]
In [40]: df
Out[40]: 
 name age city
0 Bob 26 Beijing
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
Copy after login

Sort

Sort by specified columns

In [41]: df.sort_values(by=&#39;age&#39;)
Out[41]: 
 name age city
2 Denny 20 Hangzhou
1 Loya 22 Shanghai
3 Mars 25 Nanjing
0 Bob 26 Beijing
Copy after login
# 引入numpy 构建 DataFrame
import numpy as np
Copy after login
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=[&#39;three&#39;, &#39;one&#39;], columns=[&#39;d&#39;, &#39;a&#39;, &#39;b&#39;, &#39;c&#39;])
In [43]: df
Out[43]: 
 d a b c
three 0 1 2 3
one 4 5 6 7
Copy after login
# 以索引排序
In [44]: df.sort_index()
Out[44]: 
 d a b c
one 4 5 6 7
three 0 1 2 3
In [45]: df.sort_index(axis=1)
Out[45]: 
 a b c d
three 1 2 3 0
one 5 6 7 4
# 降序
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]: 
 d c b a
three 0 3 2 1
one 4 7 6 5
Copy after login

View

# 查看表头5行 
df.head(5)
# 查看表末5行
df.tail(5) 
# 查看列的名字
In [47]: df.columns
Out[47]: Index([&#39;name&#39;, &#39;age&#39;, &#39;city&#39;], dtype=&#39;object&#39;)
# 查看表格当前的值
In [48]: df.values
Out[48]: 
array([[&#39;Bob&#39;, 26, &#39;Beijing&#39;],
 [&#39;Loya&#39;, 22, &#39;Shanghai&#39;],
 [&#39;Denny&#39;, 20, &#39;Hangzhou&#39;],
 [&#39;Mars&#39;, 25, &#39;Nanjing&#39;]], dtype=object)
Copy after login

Transpose

df.T
Out[49]: 
  0  1  2 3
name Bob Loya Denny Mars
age 26 22 20 25
city Beijing Shanghai Hangzhou Nanjing
Copy after login

Use isin

In [50]: df2 = df.copy()
In [51]: df2[df2[&#39;city&#39;].isin([&#39;Shanghai&#39;,&#39;Nanjing&#39;])]
Out[52]: 
 name age city
1 Loya 22 Shanghai
3 Mars 25 Nanjing
Copy after login

Operation:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
 ...:    index=[&#39;a&#39;, &#39;b&#39;, &#39;c&#39;, &#39;d&#39;], columns=[&#39;one&#39;, &#39;two&#39;])
In [54]: df
Out[54]: 
 one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
Copy after login
#按列求和
In [55]: df.sum()
Out[55]: 
one 9.25
two -5.80
# 按行求和
In [56]: df.sum(axis=1)
Out[56]: 
a 1.40
b 2.60
c NaN
d -0.55
Copy after login

group

group refers to the following steps:

  • ##Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • ##Combining the results into a data structure
  • See the Grouping section
In [57]: df = pd.DataFrame({&#39;A&#39; : [&#39;foo&#39;, &#39;bar&#39;, &#39;foo&#39;, &#39;bar&#39;,
 ....:    &#39;foo&#39;, &#39;bar&#39;, &#39;foo&#39;, &#39;foo&#39;],
 ....:   &#39;B&#39; : [&#39;one&#39;, &#39;one&#39;, &#39;two&#39;, &#39;three&#39;,
 ....:    &#39;two&#39;, &#39;two&#39;, &#39;one&#39;, &#39;three&#39;],
 ....:   &#39;C&#39; : np.random.randn(8),
 ....:   &#39;D&#39; : np.random.randn(8)})
 ....: 
In [58]: df
Out[58]: 
 A B  C  D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
Copy after login

group and then apply sum function

In [59]: df.groupby(&#39;A&#39;).sum()
Out[59]: 
  C D
A   
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [60]: df.groupby([&#39;A&#39;,&#39;B&#39;]).sum()
Out[60]: 
   C  D
A B   
bar one -1.814470 2.395985
 three -0.595447 0.166599
 two -0.392670 -0.136473
foo one -1.195665 -0.616981
 three 1.928123 -1.623033
 two 2.414034 1.600434
Copy after login

For more detailed explanations of cdn log analysis through the pandas library in Python, please pay attention to the PHP Chinese website!


Related articles:

How to use Pandas to read CSV files and write them to MySQL

The real IP of Python data analysis Request a detailed explanation of Pandas

Tutorial on using Python’s pandas framework to manipulate data in Excel files

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to solve the permissions problem encountered when viewing Python version in Linux terminal? How to solve the permissions problem encountered when viewing Python version in Linux terminal? Apr 01, 2025 pm 05:09 PM

Solution to permission issues when viewing Python version in Linux terminal When you try to view Python version in Linux terminal, enter python...

How to avoid being detected by the browser when using Fiddler Everywhere for man-in-the-middle reading? How to avoid being detected by the browser when using Fiddler Everywhere for man-in-the-middle reading? Apr 02, 2025 am 07:15 AM

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...

How to efficiently copy the entire column of one DataFrame into another DataFrame with different structures in Python? How to efficiently copy the entire column of one DataFrame into another DataFrame with different structures in Python? Apr 01, 2025 pm 11:15 PM

When using Python's pandas library, how to copy whole columns between two DataFrames with different structures is a common problem. Suppose we have two Dats...

How to teach computer novice programming basics in project and problem-driven methods within 10 hours? How to teach computer novice programming basics in project and problem-driven methods within 10 hours? Apr 02, 2025 am 07:18 AM

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

How does Uvicorn continuously listen for HTTP requests without serving_forever()? How does Uvicorn continuously listen for HTTP requests without serving_forever()? Apr 01, 2025 pm 10:51 PM

How does Uvicorn continuously listen for HTTP requests? Uvicorn is a lightweight web server based on ASGI. One of its core functions is to listen for HTTP requests and proceed...

How to solve permission issues when using python --version command in Linux terminal? How to solve permission issues when using python --version command in Linux terminal? Apr 02, 2025 am 06:36 AM

Using python in Linux terminal...

How to get news data bypassing Investing.com's anti-crawler mechanism? How to get news data bypassing Investing.com's anti-crawler mechanism? Apr 02, 2025 am 07:03 AM

Understanding the anti-crawling strategy of Investing.com Many people often try to crawl news data from Investing.com (https://cn.investing.com/news/latest-news)...

See all articles