How to use sed/grep to extract text between two words?

I am trying to output a string that contains everything between two words of a string:

Input:

"Here is a String"

Output:

"is a"

Using:

sed -n '/Here/,/String/p'

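includes the endpoints, but I don't want to include them.

What should the result be if the input is Here is Here String? Or I Hereby Dub Thee Sir Stringy?

FYI: your command means "print everything between the line containing the word Here and the line containing the word String", which is not what you want.

Another common sed FAQ is "how to extract text between particular lines"; this is stackoverflow.com/questions/16643288/...

Answer #1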

sed -e 's/Here\(.*\)String/\1/'

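Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/'?)

– user1190650
Nov 6, 2012 at 0:31

@user1190650 That would work if you also want to see the "Here is a". You can test it: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you only want the part between "one is" and "String", you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ says "substitute 'replacement' for 'pattern' on each line". It only changes whatever matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line.

– Brian Campbell
Nov 6, 2012 at 13:59

This breaks when the input is "Here is a String Here is a String".

– Jay D
May 19, 2015 at 1:09

It would be good to see a solution for this case: "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blash blash String". The output should pick up only the first substring between Here and String.

– Jay D
May 19, 2015 at 1:10

@JayD sed does not support non-greedy matching; see this question for some recommended alternatives.

– Brian Campbell
May 19, 2015 at 14:11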
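
Since sed has no non-greedy quantifier, one workaround for the case raised above is to cut the line at the first String before stripping everything up to Here. This is a minimal sketch, not part of the original answer: the function name first_between is made up for illustration, and it assumes Here occurs before the first String on the line.

```shell
# Hypothetical helper: print the text between "Here" and the first "String"
# on each input line (lines without the markers pass through unchanged).
first_between() {
  # 1) drop the first "String" and everything after it
  # 2) drop everything up to and including the last remaining "Here"
  sed -e 's/String.*//' -e 's/.*Here//'
}

echo "Here is a String Here is 1 a String" | first_between
```

The order of the two substitutions matters: trimming the tail first guarantees that the greedy .*Here can no longer reach a later Here.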

Answer #2

GNU grep also supports positive and negative look-ahead and look-behind.
For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

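If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms, this is called greedy matching (first case) or non-greedy matching (second case):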

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another

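Note that GNU grep's -P option does not exist in the grep shipped with *BSD or with any SVR4 system (Solaris, etc.). On FreeBSD you can install the devel/pcre port, which includes pcregrep and supports PCRE (including look-ahead/look-behind). Older versions of OS X used GNU grep, but in OS X Mavericks grep is derived from FreeBSD's version, which does not include the option.

– ghoti
May 5, 2014 at 2:18

Hi, how do I extract only the distinct content?

– Durgesh Suthar
Sep 16, 2015 at 9:44

This doesn't work, because if the ending string "string" occurs more than once, it will match up to the last occurrence rather than the next one.

– Buttle Butkus
Oct 27, 2016 at 0:43

For "Here is a string a string", both " is a " and " is a string a " are valid answers (ignore the quotes) as far as the question's requirements go. It depends on which one you want, and the answer differs accordingly. Anyway, for your requirement this will work: echo "Here is a string a string" | grep -o -P '(?<=Here).*?(?=string)'

– anishsane
Oct 27, 2016 at 3:31

@BND, you need to enable multi-line search. echo $'Here is \na string' | grep -zoP '(?<=Here)(?s).*(?=string)'

– anishsane
May 8 at 8:48

Answer #3

The accepted answer does not remove text that may come before Here or after String. This will: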

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and immediately after String.
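
To see what the extra .* changes, the two forms can be run side by side. This is an illustrative sketch; the sample line is invented, and both commands are assumed to use the \1 back-reference so that the captured text is kept.

```shell
line="Well, Here is a String for you"

# Without the surrounding .* only the matched part is rewritten,
# so the text before Here and after String survives.
partial=$(printf '%s\n' "$line" | sed -e 's/Here\(.*\)String/\1/')

# With .* on both sides the pattern covers the whole line,
# so only the captured group is left.
whole=$(printf '%s\n' "$line" | sed -e 's/.*Here\(.*\)String.*/\1/')

printf '[%s]\n[%s]\n' "$partial" "$whole"
```

The brackets in the printf output are only there to make the leading and trailing spaces visible.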

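Your answer is promising. One issue, though: if there are several occurrences of String on the same line, how can I extract only up to the first one? Thanks

– Mian Asbat Ahmad
Jun 26, 2018 at 8:55

@MianAsbatAhmad You would want to make the * quantifier between Here and String non-greedy (or lazy). However, according to this Stackoverflow question, the type of regex used by sed does not support lazy quantifiers (a ? immediately after .*). Usually, to implement a lazy quantifier, you would just match everything except the token you do not want to match, but in this case there is not a single token; it is a whole string, String.

– wheeler
Jun 26, 2018 at 21:30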
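
The "match everything except the unwanted token" idea can still be applied if the end word is first turned into a single character. A minimal sketch, assuming GNU sed (it relies on \n being understood both in the replacement and inside a bracket expression) and input lines that contain no embedded newline; the function name lazy_between is hypothetical.

```shell
# Hypothetical helper emulating a lazy match between "Here" and "String".
lazy_between() {
  # 1) replace only the first "String" with a newline, a character that
  #    cannot otherwise occur inside a single input line
  # 2) capture the run of non-newline characters between "Here" and it
  sed -e 's/String/\n/' -e 's/.*Here\([^\n]*\)\n.*/\1/'
}

echo "Here is a String, and Here is another String" | lazy_between
```

Lines that do not contain String are left untouched, because the second substitution needs the newline inserted by the first.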

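Thanks, I got the answer using awk, stackoverflow.com/questions/51041463/…

– Mian Asbat Ahmad
Jun 27, 2018 at 4:25

Unfortunately, this does not work if the string contains line breaks.

– Witalo Benicio
Jun 6, 2019 at 10:47

It is not supposed to. . does not match line breaks. If you want to match line breaks, you can replace . with something like [\s\S].

– wheeler
Jun 18, 2019 at 14:58

Answer #4

You can strip strings in Bash alone: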

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

And if you have a GNU grep that includes PCRE, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
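
The choice between single and doubled operators in the Bash expansions decides which pair is selected when the words repeat. A short sketch with an invented sample string:

```shell
text="Here is a String, and Here is another String"

# '#' removes the shortest matching prefix and '%%' the longest matching
# suffix, which selects the first Here ... String pair.
first=${text#*Here }
first=${first%% String*}

# '##' removes the longest matching prefix and '%' the shortest matching
# suffix, which selects the last pair instead.
last=${text##*Here }
last=${last% String*}

printf '%s\n%s\n' "$first" "$last"
```

This prints is a on the first line and is another on the second.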

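Why is this method so slow? Stripping a large html page with this method takes about 10 seconds.

– Adam Johns
Jan 22, 2014 at 15:12

@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you are concerned, I recommend asking a question that includes example code and seeing what the experts say.

– ghoti
Jan 27, 2014 at 6:01

I think it was so slow for me because it was holding the source of a very large html file in a variable. When I wrote the contents to a file and then parsed the file, the speed increased dramatically.

– Adam Johns
Jan 27, 2014 at 14:14

Answer #5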

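With GNU awk:

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a

grep with the -P (perl-regexp) option supports \K, which discards the previously matched characters. In our case the previously matched string is Here, so it is dropped from the final output.

If you want the output to be is a, you could try the following (the spaces around the match are kept):

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a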
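
If the surrounding spaces are unwanted, the whitespace can be consumed before \K and required inside the look-ahead. A sketch, assuming a GNU grep built with PCRE support:

```shell
# \s* before \K swallows the blanks after Here; \s+ inside the look-ahead
# keeps the blanks before string out of the match.
echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'

# The same idea with the tempered pattern, which stops at the first string.
echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
```

Both commands print is a with no leading or trailing blank.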

This does not work for: echo "Here is a string dfdsf Here is a string" | awk -v FS="(Here|string)" '{print $2}'. It only returns is a, when it should return is a is a. @Avinash Raj

– alper
Jan 6, 2018 at 12:09
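
When the markers occur several times on a line, the awk approach can be extended by printing every second field. This is only a sketch: it assumes the markers strictly alternate (Here ... string ... Here ... string) and an awk that treats a multi-character FS as a regular expression.

```shell
# With FS="(Here|string)" the text between a Here and the following string
# always lands in an even-numbered field ($2, $4, ...).
echo "Here is a string dfdsf Here is a string" |
  awk -v FS="(Here|string)" '{ for (i = 2; i <= NF; i += 2) print $i }'
```

Each match is printed on its own line, with the surrounding spaces kept.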

Answer #6

If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:

cat -n file | sed -n '/Here/,/String/p'

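Thanks! This is the only solution that worked in my case (a multi-line text file rather than a single string with no line breaks). Obviously, to get it without line numbers, the -n option of cat has to be omitted.

– Jeffrey Lebowski
Jun 2, 2016 at 13:39

... in which case cat can be omitted completely; sed knows how to read a file or standard input.

– tripleee
Sep 15, 2017 at 12:07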
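
For the multi-line case, a small awk filter can print only the lines strictly between the two marker lines, which is closer to what the question asks for. A sketch with made-up sample input and a hypothetical function name; like the sed range, it works on whole lines, not on text inside a line.

```shell
# Print the lines after a line containing Here and before the next line
# containing String, leaving both marker lines out.
between_lines() {
  awk '/String/ { inside = 0 }   # stop before printing the end line
       inside                    # print while inside the block
       /Here/   { inside = 1 }   # start after the begin line'
}

printf '%s\n' "intro" "Here" "first" "second" "String" "outro" | between_lines
```

The order of the three rules is what excludes the endpoints: the flag is cleared before the print rule sees the end line and set only after the print rule has seen the start line.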

Answer #7

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

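This prints each piece of text found between the two markers (in this case Here and String) on a new line, and preserves newlines within the text.

Answer #8

All of the above solutions have a flaw when the last search string is repeated elsewhere in the string. I found it best to write a bash function.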

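    function str_str {
      # usage: str_str STRING START END
      local str
      str="${1#*"$2"}"       # drop everything up to and including the first START
      str="${str%%"$3"*}"    # drop the first END after that and everything following it
      echo -n "$str"
    }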

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"

Answer #9

You can use two s commands:

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

This also works:

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

Answer #10

To understand the sed command, we have to build it step by step.
Here is your original text:

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the Here string with the s (substitution) command in sed:

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you can remove String as well:

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.
To combine the two sed commands, use the -e option.
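
Combining the two substitutions with -e, as described above, could look like this. It is a sketch rather than the answer's own command; the leading space in the second expression is an addition so that no trailing blank is left behind.

```shell
# Each -e adds one sed command; they run in order on every input line.
echo "Here is a String" | sed -e 's/Here //' -e 's/ String//'
```

This prints is a.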

Hope this helps.

Answer #11

You can use \1 (see http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

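The content inside the parentheses is stored as \1.

This removes the strings instead of outputting what lies between them. Try replacing "Hello" with "is" in the sed command and it will output "Hello a".

– Jonathan
May 26, 2019 at 16:19

Answer #12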

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread, How to use sed/grep to extract text between two words?, the first expression below works as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.*

Per How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Nice adventure :))

– Alex M.M.
Jul 3 at 14:42
