Extract numbers from a vector of strings

I have strings like this:

years<-c("20 years old", "1 years old")

I'd like to grep only the numbers from this vector. The expected output is a vector:

c(20, 1)

How can I do this?

Answer #1

# match the first run of digits plus everything after it, and keep only the digits
as.numeric(gsub("([0-9]+).*$", "\\1", years))

# simply remove the literal " years old" suffix
as.numeric(gsub(" years old", "", years))

Or

# split on spaces and take the first element of each piece
as.numeric(sapply(strsplit(years, " "), "[[", 1))

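Why is the .* necessary? If you want the digits at the start, why not use ^[[:digit:]]+?

– sebastian-c
Jan 27 '13 at 2:13

.* is necessary because you need to match the whole string. Without it, nothing is removed. Also note that sub can be used here instead of gsub.

– Matthew Lundberg
Jan 27 '13 at 2:20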
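
The point in the two comments above can be checked with a small base R sketch. The variable names are illustrative only; the idea is that a replacement affects just the matched text, so the pattern has to cover everything that should disappear.

```r
years <- c("20 years old", "1 years old")

# Without ".*" the pattern matches only the digits, so replacing the
# match with group 1 leaves each string unchanged.
no_tail <- sub("([0-9]+)", "\\1", years)

# With ".*" the match runs to the end of the string, so only the
# captured digits remain after the replacement.
with_tail <- sub("([0-9]+).*$", "\\1", years)

print(no_tail)
print(as.numeric(with_tail))
```

Because each string needs only one replacement here, sub and gsub give the same result.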

If the number doesn't have to be at the start of the string, use this: gsub(".*?([0-9]+).*", "\\1", years)

– TMS
Mar 14 '17 at 12:05

I want to get 27. I don't understand why adding a condition (such as an escaped "-") makes the result longer... gsub(".*?([0-9]+).*?", "\\1", "June 27-30") Result: [1] "2730" gsub(".*?([0-9]+)\\-.*?", "\\1", "June 27 to 30") Result: [1] "June 27 to 30"

– Lionel Trebuchon
May 5 '19 at 21:45

Answer #2

I think substitution is an indirect way of getting to the solution. If you want to retrieve all the numbers, I recommend gregexpr:

matches <- regmatches(years, gregexpr("[[:digit:]]+", years))
as.numeric(unlist(matches))

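If a string contains multiple matches, this returns all of them. If you are only interested in the first match, use regexpr instead of gregexpr, and then you can skip the unlist.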
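
A minimal sketch of the regexpr variant mentioned above, using a made-up input in which the first string contains two numbers. Note that regmatches drops elements with no match, so the result can be shorter than the input.

```r
years <- c("20 years old and 21", "1 years old")

# regexpr locates only the first match in each string, so regmatches
# returns a plain character vector and no unlist() is needed.
first_match <- regmatches(years, regexpr("[[:digit:]]+", years))
first_num <- as.numeric(first_match)

print(first_num)
# [1] 20  1
```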

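I didn't expect it, but this solution is an order of magnitude slower than the others.

– Matthew Lundberg
Jan 27 '13 at 5:15

@MatthewLundberg gregexpr, regexpr, or both?

– sebastian-c
Jan 27 '13 at 16:16

gregexpr. I hadn't tried regexpr until just now. Huge difference. On a set of 1e6 strings, using regexpr puts it between Andrew's and Arun's solutions (second fastest). Perhaps also interesting: using sub in Andrew's solution doesn't improve the speed.

– Matthew Lundberg
Jan 27 '13 at 16:42

This splits numbers at the decimal point. For example, 2.5 becomes c('2', '5')

– MBorg
Aug 15 at 15:07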
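
The decimal-point issue raised in the last comment can be reproduced, and one possible workaround sketched, as follows. The input vector and the extended pattern are assumptions for illustration, not part of the original answer.

```r
x <- c("weighs 2.5 kg", "20 years old")

# The digits-only pattern treats "2" and "5" as two separate matches.
digits_only <- unlist(regmatches(x, gregexpr("[[:digit:]]+", x)))

# Allowing an optional fractional part keeps "2.5" together.
with_decimals <- unlist(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x)))

print(digits_only)
# [1] "2"  "5"  "20"
print(as.numeric(with_decimals))
# [1]  2.5 20.0
```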

Answer #3

Update
Since extract_numeric is deprecated, we can use parse_number from the readr package.

library(readr)
parse_number(years)

Here is another option, using extract_numeric from tidyr:

library(tidyr)
extract_numeric(years)
#[1] 20  1

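Fine for this application, but keep in mind that parse_number doesn't work with negative numbers. Try parse_number("–27,633")

– Nettle
Jun 8 '18 at 19:15

@Nettle Yes, that's correct. It also won't work if a string contains multiple numbers

– akrun
Jun 9 '18 at 3:08

The negative-number parsing bug has been fixed: github.com/tidyverse/readr/issues/308 readr::parse_number("-12,345") # [1] -12345

– Russ Hyde
Apr 23 '19 at 11:29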

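Answer #4

Here is an alternative to Arun's first solution, with a simpler Perl-like regular expression:

as.numeric(gsub("[^\\d]+", "", years, perl=TRUE))

as.numeric(sub("\\D+", "", years)). If there are letters both before and after the number, use gsub

– Onyambu
Mar 5 '18 at 7:25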
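
The difference between sub and gsub noted in the comment above shows up once non-digits appear on both sides of the number. A small sketch with a made-up input string:

```r
s <- "age 20 years"

# sub removes only the first run of non-digits ("age ").
first_only <- sub("\\D+", "", s)

# gsub removes every run of non-digits, leaving just the digits.
all_runs <- gsub("\\D+", "", s)

print(first_only)
# [1] "20 years"
print(as.numeric(all_runs))
# [1] 20
```

Keep in mind that stripping all non-digits also glues separate numbers together, so "20 and 21" would become 2021.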

Answer #5

Or simply:

as.numeric(gsub("\\D", "", years))
# [1] 20  1

Answer #6

A stringr pipelined solution:

library(stringr)
years %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

Thanks Joe, but this answer doesn't extract the negative sign in front of a number in the string.

– Cai
Aug 31 '18 at 22:29
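
One way to keep a leading minus sign is to make it an optional part of the pattern. The sketch below uses base R so that it runs without extra packages, and the input vector is invented for illustration; the same pattern is assumed to behave the same way when passed to str_match_all.

```r
temps <- c("low of -5 degrees", "high of 12 degrees")

# "-?" lets a match start with an optional minus sign.
signed <- as.numeric(unlist(regmatches(temps, gregexpr("-?[0-9]+", temps))))

print(signed)
# [1] -5 12
```

This is a deliberate trade-off: a hyphen used as a range separator, as in "27-30", would then be read as the sign of the second number.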

Answer #7

You can also strip out all the letters:

as.numeric(gsub("[[:alpha:]]", "", years))

This is much less general, though.

Oddly, Andrew's solution beats this by a factor of 5 on my machine.

– Matthew Lundberg
Jan 27 '13 at 5:16

Answer #8

We can also use str_extract from stringr

years<-c("20 years old", "1 years old")
as.integer(stringr::str_extract(years, "\\d+"))
#[1] 20  1

If a string contains multiple numbers and we want to extract all of them, we can use str_extract_all which, unlike str_extract, returns all the matches.

years<-c("20 years old and 21", "1 years old")
stringr::str_extract(years, "\\d+")
#[1] "20"  "1"

stringr::str_extract_all(years, "\\d+")

#[[1]]
#[1] "20" "21"

#[[2]]
#[1] "1"

Answer #9

Extract numbers only when they appear at the start of the string.

x <- gregexpr("^[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

Extract numbers from any string, independent of position.

x <- gregexpr("[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

Answer #10

Following a post by Gabor Grothendieck on the r-help mailing list:

years<-c("20 years old", "1 years old")

library(gsubfn)
pat <- "[-+.e0-9]*\\d"
sapply(years, function(x) strapply(x, pat, as.numeric)[[1]])

Answer #11

Using the unglue package, we can do:

# install.packages("unglue")
library(unglue)

years<-c("20 years old", "1 years old")
unglue_vec(years, "{x} years old", convert = TRUE)
#> [1] 20  1

Created on 2019-11-06 by the reprex package (v0.3.0)

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md
