R -- Trying out stringdist

Installation

install.packages('stringdist')

or

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Usage

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R’s native match function
  • ain is a fuzzy matching equivalent of R’s native %in% operator
  • afind finds the location of fuzzy matches of a short string in a long string.
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain for distances between, and matching of integer sequences.
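
As a quick orientation, here is a minimal sketch of stringsim, assuming the package is installed. The input strings are made up for illustration, and the Levenshtein method is chosen only because its normalisation is easy to follow by hand.

```r
library(stringdist)

# Identical strings have similarity 1.
sim_same <- stringsim("foo", "foo", method = "lv")

# 'bar' -> 'foo' needs 3 substitutions on strings of length 3,
# so the similarity is 1 - 3/3 = 0.
sim_diff <- stringsim("bar", "foo", method = "lv")

# A single substitution on a length-3 string lands in between.
sim_close <- stringsim("foo", "fou", method = "lv")

print(c(same = sim_same, different = sim_diff, close = sim_close))
```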

stringdist: returns a numeric vector

stringdist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)
a	:R object (target); will be converted by as.character.
b	:R object (source); will be converted by as.character. This argument is optional for stringdistmatrix (see section Value).
method	 :Method for distance calculation. 
useBytes	:Perform byte-wise comparison
weight	:For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. 
	 When method='lv', the penalty for transposition is ignored.
	 When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. 
	 Weights must be positive and not exceed 1. 
	 weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

q	:Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.
p	:Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25.
	 If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.
bt	:Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
useNames	:Use input vectors as row and column names?
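
To see how method, weight and q change the result, the following sketch compares a few calls on toy strings. It assumes the package is installed; the strings and the choice of 0.5 as transposition penalty are arbitrary.

```r
library(stringdist)

# Default method is 'osa': swapping two adjacent characters costs 1.
d_osa <- stringdist("ab", "ba")

# Levenshtein has no transposition, so the swap costs two substitutions.
d_lv <- stringdist("ab", "ba", method = "lv")

# Lower the transposition penalty t to 0.5 (order: d, i, s, t).
d_wt <- stringdist("ab", "ba", method = "osa",
                   weight = c(d = 1, i = 1, s = 1, t = 0.5))

# q only matters for q-gram based methods: with q = 1 both strings
# contain the same characters, with q = 2 they share no bigrams.
d_q1 <- stringdist("abc", "cba", method = "qgram", q = 1)
d_q2 <- stringdist("abc", "cba", method = "qgram", q = 2)

print(c(osa = d_osa, lv = d_lv, weighted = d_wt, q1 = d_q1, q2 = d_q2))
```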

example

Note: string distance functions have two possible special output values.
NA is returned whenever at least one of the input strings to compare is NA.
Inf is returned when the distance between two strings is undefined according to the selected algorithm.

stringdist("bar","foo",method = "lv") # Levenshtein distance; returns 3
stringdist("ba","foo",method = "lv") # Levenshtein distance; returns 3 (note: the strings differ in length)

stringdist('fu', 'foo', method='hamming') # Hamming distance; returns Inf
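
The NA case can be checked the same way. This small sketch assumes the package is installed and uses made-up inputs; it also shows the shorter argument being recycled.

```r
library(stringdist)

# A missing value in either input gives NA for that pair only;
# the shorter argument ("foo") is recycled across the first vector.
d <- stringdist(c("foo", NA, "fo"), "foo", method = "lv")
print(d)
```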

stringdistmatrix: returns a distance matrix

stringdistmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  useNames = c("none", "strings", "names"),
  nthread = getOption("sd_num_thread")
)
Arguments are as described for stringdist above.

example

- With a single vector as input, an object of class dist is returned (as from the dist function)

- With two vectors as input, a matrix is returned
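
Both cases can be reproduced with a short sketch. It assumes the package is installed; the word lists are invented for illustration.

```r
library(stringdist)

words <- c("foo", "bar", "boo")

# One vector: pairwise distances as a 'dist' object (lower triangle).
d1 <- stringdistmatrix(words, method = "lv")
print(class(d1))

# Two vectors: a length(a) x length(b) matrix; useNames = "strings"
# labels rows and columns with the input strings.
d2 <- stringdistmatrix(c("foo", "bar"), c("baz", "boo", "foo"),
                       method = "lv", useNames = "strings")
print(d2)
```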


amatch & ain

  • Function amatch(x,table) finds the closest match of elements of x in table. When multiple equivalent matches are found, the
    first match is returned
  • A call to ain(x,table) returns a logical vector indicating which elements of x were (approximately) matched in table.
  • Both amatch and ain have been designed to approach the behaviour of R’s native match and %in% functionality as much as possible. By default amatch and ain locate exact matches, just like match.
  • This may be changed by increasing the maximum string distance between the search pattern and elements of the lookup table.

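amatch is modeled on the base R function match. Its behaviour is controlled by the maxDist argument: with a very small maxDist it behaves like exact matching, and with a larger maxDist it performs approximate matching. The difference between amatch and ain is the same as that between match and %in%: one returns the index of the matched element, the other returns TRUE/FALSE.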

amatch('fu', c('foo','bar')) # return NA
amatch('fu', c('foo','bar'), maxDist=2) # return 1

ain('fu', c('foo','bar')) # return FALSE
ain('fu', c('foo','bar'), maxDist=2) # return  TRUE
ain('bar', c('foo','bar')) # return TRUE
ain('bar', c('foo','bar'), maxDist=2) # return TRUE
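
A typical use is cleaning up misspelled values against a lookup table. The sketch below assumes the package is installed; the table and the typed values are hypothetical.

```r
library(stringdist)

# Hypothetical lookup table and misspelled inputs, for illustration only.
fruits <- c("apple", "banana", "cherry")
typed  <- c("aple", "banan", "cherry", "kiwi")

# Index of the closest table entry within one edit; NA if none qualifies.
idx <- amatch(typed, fruits, maxDist = 1)
print(idx)

# Use the indices to recover the matched table entries.
print(fruits[idx])

# ain answers the yes/no question for the same inputs.
found <- ain(typed, fruits, maxDist = 1)
print(found)
```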

Further reading: distance formulas


Hamming distance

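The Hamming distance counts the positions at which two strings of equal length differ. A minimal base R sketch follows; hamming_dist is a hypothetical helper written for this post, not a function of the package, and it returns Inf for unequal lengths to mirror the behaviour shown earlier.

```r
# Hamming distance: number of positions at which two equal-length
# strings differ; Inf when the lengths differ.
hamming_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) != length(y)) return(Inf)
  sum(x != y)
}

print(hamming_dist("bar", "baz"))
print(hamming_dist("fu", "foo"))
```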

Longest Common Substring distance

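The lcs distance counts the characters that remain unpaired after pairing up as many characters as possible while keeping their order, which is the same as an edit distance that allows only insertions and deletions. The sketch below is a plain dynamic-programming version in base R; lcs_dist is a hypothetical helper, not the package's implementation.

```r
# lcs distance: edit distance with insertions and deletions only.
lcs_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  # D[i + 1, j + 1] is the distance between the first i characters
  # of a and the first j characters of b.
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      D[i + 1, j + 1] <- if (x[i] == y[j]) {
        D[i, j]                              # characters pair up
      } else {
        1 + min(D[i, j + 1], D[i + 1, j])    # drop one character
      }
    }
  }
  D[n + 1, m + 1]
}

print(lcs_dist("leia", "leela"))
```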

Levenshtein distance (weighted)

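The weighted Levenshtein distance is the cheapest sequence of deletions, insertions and substitutions that turns one string into the other. A base R sketch is given below; lv_dist is a hypothetical helper, and it assumes that deletion and insertion are counted while turning a into b, which may not match the package's convention exactly.

```r
# Weighted Levenshtein distance via dynamic programming.
lv_dist <- function(a, b, weight = c(d = 1, i = 1, s = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * weight[["d"]]   # delete every character of a's prefix
  D[1, ] <- (0:m) * weight[["i"]]   # insert every character of b's prefix
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      D[i + 1, j + 1] <- min(
        D[i, j + 1] + weight[["d"]],  # delete x[i]
        D[i + 1, j] + weight[["i"]],  # insert y[j]
        D[i, j] + sub_cost            # substitute, or match at no cost
      )
    }
  }
  D[n + 1, m + 1]
}

print(lv_dist("bar", "foo"))
print(lv_dist("abc", "ac", weight = c(d = 0.5, i = 1, s = 1)))
```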

Optimal string alignment distance (d_osa)

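The optimal string alignment distance extends Levenshtein with one extra move: transposing two adjacent characters. A unit-weight base R sketch follows; osa_dist is a hypothetical helper, not the package's code.

```r
# Optimal string alignment distance with unit weights.
osa_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      best <- min(D[i, j + 1] + 1, D[i + 1, j] + 1, D[i, j] + cost)
      # Transposition of two adjacent characters.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, D[i - 1, j - 1] + 1)
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[n + 1, m + 1]
}

print(osa_dist("ab", "ba"))
print(osa_dist("ca", "abc"))
```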

Full Damerau-Levenshtein distance (weighted)

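Note: d_osa and d_dl differ mainly in the transposition term of the recursion. OSA allows two adjacent characters to be transposed but does not allow the transposed pair to be edited again, whereas the full Damerau-Levenshtein distance allows further edits (for example an insertion between the two characters) after a transposition.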
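
The classic pair 'ca' and 'abc' shows the difference in practice. The sketch assumes the package is installed.

```r
library(stringdist)

# OSA cannot transpose 'ca' -> 'ac' and then insert 'b' in between,
# so it needs three edits; full Damerau-Levenshtein needs only two.
d_osa <- stringdist("ca", "abc", method = "osa")
d_dl  <- stringdist("ca", "abc", method = "dl")

print(c(osa = d_osa, dl = d_dl))
```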

Q-gram distance

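The q-gram distance sums the absolute differences between the q-gram counts of the two strings. Below is a base R sketch; qgrams and qgram_dist are hypothetical helpers, and edge cases such as q being larger than a string are not handled the way the package handles them.

```r
# All overlapping q-grams of a string, in order of appearance.
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  substring(s, seq_len(n - q + 1), q:n)
}

# q-gram distance: sum of absolute differences of q-gram counts.
qgram_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  all_grams <- union(ga, gb)
  # Count how often each q-gram occurs in each string.
  va <- tabulate(match(ga, all_grams), nbins = length(all_grams))
  vb <- tabulate(match(gb, all_grams), nbins = length(all_grams))
  sum(abs(va - vb))
}

print(qgram_dist("abc", "cba", q = 1))
print(qgram_dist("abc", "cba", q = 2))
```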

Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)

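The formula can be written as follows, assuming the usual set-based definition in which Q(s;q) denotes the set of distinct q-grams of s (the symbol names are my own choice):

```latex
d_{\mathrm{jac}}(s,t;q) = 1 - \frac{\lvert Q(s;q) \cap Q(t;q) \rvert}{\lvert Q(s;q) \cup Q(t;q) \rvert}
```

For example, with q = 1, 'abc' and 'abd' share the q-grams {a, b} out of {a, b, c, d}, giving 1 - 2/4 = 0.5.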

cosine distance for q-gram count vectors (= 1-cosine similarity)

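Written out, with v(s;q) denoting the vector of q-gram counts of s (notation assumed here for illustration):

```latex
d_{\cos}(s,t;q) = 1 - \frac{\mathbf{v}(s;q) \cdot \mathbf{v}(t;q)}{\lVert \mathbf{v}(s;q) \rVert_2 \, \lVert \mathbf{v}(t;q) \rVert_2}
```

For example, with q = 1, 'ab' and 'ac' have count vectors (1,1,0) and (1,0,1) over (a,b,c), so the distance is 1 - 1/(√2·√2) = 0.5.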

Jaro distance
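
As a sketch of the standard definitions: let m be the number of matching characters (equal characters no farther apart than ⌊max(|s|,|t|)/2⌋ - 1 positions), T half the number of matched characters that appear in a different order, and w_1, w_2, w_3 the weights from the weight argument. For m > 0, and with ℓ the length of the common prefix (at most 4), the Jaro and Jaro-Winkler distances are usually written as below; the symbol names are my own, and the boost threshold bt is left out.

```latex
d_{\mathrm{jaro}}(s,t) = 1 - \frac{1}{3}\left( w_1 \frac{m}{\lvert s \rvert} + w_2 \frac{m}{\lvert t \rvert} + w_3 \frac{m - T}{m} \right),
\qquad
d_{\mathrm{jw}}(s,t) = d_{\mathrm{jaro}}(s,t)\,\bigl(1 - p\,\ell\bigr)
```

For 'MARTHA' and 'MARHTA' with unit weights, m = 6 and T = 1, so d_jaro = 1 - (1 + 1 + 5/6)/3 = 1/18 ≈ 0.056. With p = 0.1 and ℓ = 3 this becomes 0.7/18 ≈ 0.039.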

