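Removing highly similar content
How can I remove highly similar content? Any means necessary!
For example, the three jokes below are almost identical; they differ only in a few punctuation marks and in where the line breaks fall. Suppose I have 300,000 records, tens of thousands of which are near-duplicates like these. How can I filter those records out?
Any approach is fine, though preferably PHP/MySQL, a client tool, or something similar.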
哥应邀参加前任婚礼,和一帮陌生人坐一桌, 旁边一哥们问我是新娘什么人? 我回答,我只是来看一下以前战斗过的地方! 没想到一桌子的人举起酒杯:
大家都是战友,干杯,多喝点,一会讨论战斗经验!
哥应邀参加前任婚礼,和一帮陌生人坐一桌,旁边一哥们问我:“是新娘什么人?” 我回答,我只是来看一下以前战斗过的地方!
没想到一桌子的人举起酒杯:“大家都是战友,干杯,多喝点,一会讨论战斗经验!”
哥应邀参加前任婚礼,和一帮陌生人坐一桌,旁边一哥们问我是新娘什么人?我回答,我只是来看一下以前战斗过的地方!没想到一桌子的人举起酒杯:大家都是战友,干杯,多喝点,一会讨论战斗经验!
Replies:
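I'll only address the similarity computation.
Compared with similar_text(), levenshtein() is faster, but similar_text() tends to give a more accurate result with fewer required modifications (the PHP manual gives its complexity as O(N^3)). If you care more about speed than accuracy and your strings are of limited length, consider levenshtein(); note that before PHP 8.0 it returned -1 for strings longer than 255 characters. Also, similar_text() does not handle Chinese well, because it compares bytes rather than multi-byte characters.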
Finally, here is something I put together myself: text similarity in PHP using cosine similarity plus word segmentation.
https://github.com/xiaobeicn/text-similarity-php
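For readers who just want to see the idea, here is a minimal sketch of cosine similarity in Python. It is not the code from the repository above: to stay self-contained it assumes character bigrams as features instead of real word segmentation, and the function names `bigrams` and `cosine_similarity` are made up for illustration.

```python
import math
from collections import Counter

def bigrams(text):
    # Keep only letters and digits (str.isalnum is Unicode-aware, so CJK
    # characters are kept while punctuation, spaces and newlines are dropped).
    chars = [c for c in text if c.isalnum()]
    # Assumption: overlapping 2-character windows stand in for segmented words.
    return Counter("".join(chars[i:i + 2]) for i in range(len(chars) - 1))

def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no usable features
    return dot / (norm_a * norm_b)
```

A score near 1 means the two texts share almost all of their bigrams; where to put the "duplicate" threshold has to be tuned on your own data.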
If your requirements are modest, just use similar_text(). DEMO: http://3v4l.org/iBXvC
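If the only differences are a few extra punctuation marks and line breaks, you can strip those characters and then compare the MD5 hashes of the resulting strings. Of course, if the order of the words changes a lot, this no longer works.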
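A rough sketch of that strip-then-hash idea, in Python for brevity. The helper names (`normalize`, `fingerprint`, `find_duplicates`) and the `(id, text)` row shape are assumptions for illustration, not anything from the original post.

```python
import hashlib

def normalize(text):
    # Drop everything that is not a letter or digit: ASCII and full-width
    # punctuation, quotes, spaces and line breaks all disappear.
    return "".join(c for c in text if c.isalnum())

def fingerprint(text):
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()

def find_duplicates(rows):
    """rows: iterable of (row_id, text). Returns (duplicate_id, first_seen_id) pairs."""
    seen = {}
    duplicates = []
    for row_id, text in rows:
        key = fingerprint(text)
        if key in seen:
            duplicates.append((row_id, seen[key]))
        else:
            seen[key] = row_id
    return duplicates
```

This is a single pass over the data with one dictionary lookup per row, so 300,000 records are not a problem. The same fingerprint could also be stored in an extra indexed MySQL column and grouped on there.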
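Put plainly, this is a document fingerprinting (digest) problem. If it were me, word segmentation alone would not be enough; I would also run part-of-speech tagging and keep only the nouns and verbs as features, which should be more accurate.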
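Here is what I consider the most reliable approach.
(1) Run part-of-speech tagging on each text and keep only the verbs and nouns. For example: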
哥应邀参加前任婚礼,和一帮陌生人坐一桌, 旁边一哥们问我是新娘什么人? 我回答,我只是来看一下以前战斗过的地方! 没想到一桌子的人举起酒杯:
大家都是战友,干杯,多喝点,一会讨论战斗经验!
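For this passage I would take the features to be: 婚礼 (wedding), 新娘 (bride), 战斗 (battle), 酒杯 (glass), 经验 (experience), 战友 (comrade).
(2) You need a lot of samples, say 10,000 texts. From those 10,000, estimate the set of words that could matter across all 300,000 texts. In my experience this vocabulary exceeds 100,000 words if left unprocessed.
(3) Use a feature selection algorithm to shrink the vocabulary. How to do that is graduate-level coursework at the least, and it is all math; it should bring the 100,000 words down to around 3,000.
(4) Represent every text with those 3,000 words, for example w1 = [0, 0, 1, 1, ..., 0, ..., 1, ..., 0, ...]. Word frequency is ignored, so this structure is easy to pack into a bitmap and turn into a string.
(5) Deduplicate all texts with a hash table.
This is the most efficient approach, but it will certainly produce some errors, because feature extraction by its nature throws information away in exchange for speed. In return, for any new incoming text, and not counting the segmentation step, the lookup is close to O(1).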
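Steps (4) and (5) can be sketched roughly as follows, in Python for brevity. The sketch assumes segmentation, part-of-speech filtering and vocabulary selection have already happened elsewhere, so each document arrives as a list of tokens; `signature` and `group_by_signature` are hypothetical names.

```python
from collections import defaultdict

def signature(tokens, vocab_index):
    # Set bit i when vocabulary word i occurs in the text.
    # Word frequency is ignored, as in step (4).
    bits = 0
    for token in tokens:
        i = vocab_index.get(token)
        if i is not None:
            bits |= 1 << i
    return bits

def group_by_signature(docs, vocabulary):
    """docs: iterable of (doc_id, tokens). Returns lists of ids sharing a signature."""
    vocab_index = {word: i for i, word in enumerate(vocabulary)}
    groups = defaultdict(list)  # the hash table from step (5)
    for doc_id, tokens in docs:
        groups[signature(tokens, vocab_index)].append(doc_id)
    # Caveat: texts containing no vocabulary word all map to signature 0.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Python integers are arbitrary precision, so a 3,000-word vocabulary fits in a single integer key without any extra bitmap library.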
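Here is another idea: strip all punctuation, spaces and line breaks, then use dynamic programming to compute the edit distance (Levenshtein distance), that is, the minimum number of edits needed to turn string s1 into s2, where one edit inserts, deletes or replaces a single character. It is fairly easy to implement and reasonably efficient, roughly O(N^2) where N is the string length.
This seems to be a classic algorithm from programming contests. Search for "string edit distance" and you should find it (Wikipedia has it too). It is worth considering if you would rather not use a library.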
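A minimal sketch of that dynamic program, written in Python and keeping only two rows of the table at a time. The names `strip_marks`, `edit_distance` and `similarity` are hypothetical, and the 0-to-1 similarity score is one common normalization, not something the post prescribes.

```python
def strip_marks(text):
    # Remove punctuation, spaces and line breaks before comparing.
    return "".join(c for c in text if c.isalnum())

def edit_distance(s1, s2):
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # replace, or keep if equal
        prev = curr
    return prev[-1]

def similarity(a, b):
    a, b = strip_marks(a), strip_marks(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Keep in mind that comparing every pair among 300,000 records means about 4.5 × 10^10 comparisons, so in practice this probably needs a cheaper first pass (for example grouping by length or by a hash) to narrow down the candidates.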
I was going to suggest edit distance as well, but the reply above already covered it.
http://www.cnblogs.com/liangxiaxu/archive/2012/05/05/2484972.html
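Cosine similarity and simhash are both good options. Simhash comes from Charikar's work and is best known for Google's use of it to detect near-duplicate web pages.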
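To give a feel for simhash, here is a small Python sketch. It assumes character bigrams as features, equal weights, and the first 8 bytes of MD5 as the per-feature hash; those are illustrative choices, and `simhash` and `hamming` are hypothetical helper names rather than any library's API.

```python
import hashlib

BITS = 64

def simhash(text):
    # Features: overlapping character bigrams of the text with punctuation,
    # spaces and line breaks removed (an assumption, not real segmentation).
    chars = [c for c in text if c.isalnum()]
    grams = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)] or ["".join(chars)]
    weights = [0] * BITS
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of this feature
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Each fingerprint bit is the majority vote of the feature hashes.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Two texts count as near-duplicates when the Hamming distance between their fingerprints is small. The cutoff is something to tune on real data; a value of about 3 for 64-bit fingerprints is often quoted as a starting point.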
