[Python] Implementing the SimHash Algorithm
Published: 2011/1/4
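A traditional hash function maps identical text to an identical hash value. With the simhash method, documents that are nearly the same also yield hash values that are close to each other.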
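The contrast can be illustrated with a minimal sketch that is not part of the original post. It assumes MD5 from Python's standard `hashlib` as a stand-in for a "traditional" hash; the helper names `md5_int` and `bit_difference` are hypothetical. For a hash of this kind, a one-character edit typically changes a large share of the output bits, so the distance between two digests says little about how similar the inputs were.

```python
import hashlib


def md5_int(text):
    """Return the MD5 digest of text as a 128-bit integer."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def bit_difference(a, b):
    """Count the bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy dog!"

# Identical text always gives an identical digest.
same = md5_int(doc1) == md5_int(doc1)

# A one-character edit gives a different digest; the number of
# differing bits is not a useful measure of textual similarity.
changed_bits = bit_difference(md5_int(doc1), md5_int(doc2))

print(same, changed_bits)
```

A simhash fingerprint is built so that the same bit-difference count becomes meaningful: a small count suggests similar documents.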
<div> <table style="font-family: monospace" class="python codes"> <tbody> <tr class="li1"> <td style="line-height: 150%; font-family: Verdana, Monospace; font-size: 12px; font-weight: bold; margin-right: 10px"> <pre style="line-height: 150%; font-family: Verdana, Monospace; font-size: 12px; font-weight: bold; margin-right: 10px">#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1
# Updated to Python 3 syntax (print(), __int__ in place of __long__).


class simhash:
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    def __str__(self):
        return str(self.hash)

    def __int__(self):
        return int(self.hash)

    def __float__(self):
        return float(self.hash)

    def simhash(self, tokens):
        # Returns a Charikar simhash with appropriate bitlength
        v = [0] * self.hashbits
        for t in [self._string_hash(x) for x in tokens]:
            for i in range(self.hashbits):
                bitmask = 1 << i
                if t & bitmask:
                    v[i] += 1  # bit i of the token hash is 1: increment counter i
                else:
                    v[i] -= 1  # otherwise decrement counter i
        # The document fingerprint is the sum of the bits whose
        # counters ended up greater than or equal to 0.
        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i
        return fingerprint

    def _string_hash(self, v):
        # A variable-length version of Python's builtin hash
        if v == "":
            return 0
        else:
            x = ord(v[0]) << 7
            m = 1000003
            mask = 2 ** self.hashbits - 1
            for c in v:
                x = ((x * m) ^ ord(c)) & mask
            x ^= len(v)
            if x == -1:
                x = -2
            return x

    def hamming_distance(self, other_hash):
        # Count the differing bits between the two fingerprints
        x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x - 1  # clear the lowest set bit
        return tot

    def similarity(self, other_hash):
        # Ratio of the smaller fingerprint value to the larger one
        a = float(self.hash)
        b = float(other_hash)
        if a > b:
            return b / a
        return a / b


if __name__ == '__main__':
    # Sample text: "Which things does Google value most? Punctuation?"
    s = '看看哪些东西google最看重?标点?'
    hash1 = simhash(s.split())
    # print("%s\t0x%x" % (s, int(hash1)))

    # Same sentence with the punctuation changed
    s = '看看哪些东西google最看重!标点!'
    hash2 = simhash(s.split())
    # print("%s\t[simhash = 0x%x]" % (s, int(hash2)))

    print('%f%% percent similarity on hash' % (100 * hash1.similarity(hash2)))
    print(hash1.hamming_distance(hash2), "bits differ out of", hash1.hashbits)
</pre> </td> </tr> </tbody> </table> </div>
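One point about the demo is worth noting: `str.split()` with no arguments splits on whitespace, so a sentence without spaces (such as the sample strings in the listing) becomes a single token, and the fingerprint then reflects only that one token's hash. The following is a minimal sketch, not the author's method, of one common alternative: tokenizing into overlapping character n-grams so that a small edit changes only a few tokens. The function names `char_ngrams`, `simhash64`, and `hamming` are hypothetical, and MD5 is assumed as the per-token hash purely for illustration.

```python
import hashlib


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (hypothetical tokenizer)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash64(tokens, bits=64):
    """Charikar-style simhash; MD5 as the token hash is an assumption."""
    counts = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Vote +1 when bit i of the token hash is set, -1 otherwise.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Keep the bits whose vote total is non-negative, as in the listing.
    return sum(1 << i for i in range(bits) if counts[i] >= 0)


def hamming(a, b):
    """Count the differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = simhash64(char_ngrams("看看哪些东西google最看重?标点?"))
b = simhash64(char_ngrams("看看哪些东西google最看重!标点!"))
print(hamming(a, b), "bits differ out of 64")
```

With n-gram tokens, most votes come from tokens the two sentences share, so the two fingerprints would be expected to differ in relatively few bits; the exact count depends on the token hash and is not claimed here.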