Accurate email syntax validation (no, seriously)

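So a friend of mine happened to tell me how odd and specific the general email syntax rules are. For example, emails can have "comments": you can put characters inside parentheses and they are simply ignored. So email(this seems extremely redundant)@email.com is not only valid, it is the same address as email@email.com. I thought it would be an interesting exercise to follow the exact guidelines as best I could. I won't describe every specific rule here, as I have (hopefully) made it all clear in the code itself.
I'm particularly interested in feedback on the functional style, since this is the most I've done with it, and on how I go about testing and separating the functions. In theory this is supposed to be a module that people could import and call (though I don't know when anyone would really want to use it), so I'd like comments to focus on that. Of course, feedback on better or more efficient ways of doing it is welcome.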

"""This module will evaluate whether a string is a valid email or not.

It is based on the criteria laid out in RFC documents, summarised here:
https://en.wikipedia.org/wiki/Email_address#Syntax

Many email providers will restrict these further, but this module is primarily
for testing whether an email is syntactically valid or not.

Calling validate() will run all tests in intelligent order.
Any error found will raise an InvalidEmail error, but this also inherits from
ValueError, so errors can be caught with either of them.

If you're using any other functions, note that some of the tests will return
a modified string for the convenience of how the default tests are structured.
Just calling valid_quotes(string) will work fine, just don't use the assigned
value unless you want the quoted sections removed.
Errors will be raised from the function regardless.

>>> validate("local-part@domain")
>>> validate("example@email.com")
>>> validate("John..Doe@example.com")
Traceback (most recent call last):
  ...
InvalidEmail: Consecutive periods are not permitted.
>>> validate("John.Doe@example.com")
>>> validate("John~.Doe@example.com")
>>> validate("john.smith(comment)@example.com")
>>> validate("(comment)john.smith@example.com")
>>> validate("(comment)john.smith@example(comment).com")
>>> validate('"abcdefghixyz"@example.com')
>>> validate('abc."defghi".@example.com')
Traceback (most recent call last):
  ...
InvalidEmail: Local may neither start nor end with a period.
>>> validate('abc."def<>ghi"xyz@example.com')
Traceback (most recent call last):
  ...
InvalidEmail: Incorrect double quotes formatting.
>>> validate('abc."def<>ghi".xyz@example.com')
>>> validate('jsmith@[192.168.2.1]')
>>> validate('jsmith@[192.168.12.2.1]')
Traceback (most recent call last):
  ...
InvalidEmail: IPv4 domain must have 4 period separated numbers.
>>> validate('jsmith@[IPv6:2001:db8::1]')
>>> validate('john.smith@(comment)example.com')
"""


import re

from string import ascii_letters, digits


HEX_BASE = 16
MAX_ADDRESS_LEN = 256
MAX_LOCAL_LEN = 64
MAX_DOMAIN_LEN = 253
MAX_DOMAIN_SECTION_LEN = 63

MIN_UTF8_CODE = 128
MAX_UTF8_CODE = 65536
MAX_IPV4_NUM = 256

IPV6_PREFIX = 'IPv6:'
VALID_CHARACTERS = ascii_letters + digits + "!#$%&'*+-/=?^_`{|}~"
EXTENDED_CHARACTERS = VALID_CHARACTERS + r' "(),:;<>@[\]'
DOMAIN_CHARACTERS = ascii_letters + digits + '-.'

# Find quote enclosed sections, but ignore \" patterns.
COMMENT_PATTERN = re.compile(r'\(.*?\)')
QUOTE_PATTERN = re.compile(r'(^(?<!\\)".*?(?<!\\)"$|\.(?<!\\)".*?(?<!\\)"\.)')

class InvalidEmail(ValueError):
    """String is not a valid Email."""

def strip_comments(s):
    """Return s with comments removed.

    Comments in an email address are any characters enclosed in parentheses.
    These are essentially ignored, and do not affect what the address is.

    >>> strip_comments('exam(alammma)ple@e(lectronic)mail.com')
    'example@email.com'"""

    return re.sub(COMMENT_PATTERN, "", s)

def valid_quotes(local):
    """Parse a section of the local part that's in double quotation marks.

    There's an extended range of characters permitted inside double quotes.
    Including: "(),:;<>@[\] and space.
    However " and \ must be escaped by a backslash to be valid.

    >>> valid_quotes('"any special characters <>"')
    ''
    >>> valid_quotes('this."is".quoted')
    'this.quoted'
    >>> valid_quotes('this"wrongly"quoted')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting.
    >>> valid_quotes('still."wrong"')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting."""

    quotes = re.findall(QUOTE_PATTERN, local)
    if not quotes and '"' in local:
        raise InvalidEmail("Incorrect double quotes formatting.")

    for quote in quotes:
        if any(char not in EXTENDED_CHARACTERS for char in quote.strip('.')):
            raise InvalidEmail("Invalid characters used in quotes.")

        # Remove valid escape characters, and see if any invalid ones remain
        stripped = quote.replace('\\\\', '').replace('\\"', '"').strip('".')
        if '\\' in stripped:
            raise InvalidEmail('\\ must be paired with " or another \\.')
        if '"' in stripped:
            raise InvalidEmail('Unescaped " found.')

        # Test if start and end are both periods
        # If so, one of them should be removed to prevent double quote errors
        if quote.endswith('.'):
            quote = quote[:-1]
        local = local.replace(quote, '')
    return local

def valid_period(local):
    """Raise error for invalid period, return local without any periods.

    Raises InvalidEmail if local starts or ends with a period or 
    if local has consecutive periods.

    >>> valid_period('example.email')
    'exampleemail'
    >>> valid_period('.example')
    Traceback (most recent call last):
      ...
    InvalidEmail: Local may neither start nor end with a period."""

    if local.startswith('.') or local.endswith('.'):
        raise InvalidEmail("Local may neither start nor end with a period.")

    if '..' in local:
        raise InvalidEmail("Consecutive periods are not permitted.")

    return local.replace('.', '')

def valid_local_characters(local):
    """Raise error if char isn't in VALID_CHARACTERS or the UTF8 code range"""

    if any(not MIN_UTF8_CODE <= ord(char) <= MAX_UTF8_CODE
           and char not in VALID_CHARACTERS for char in local):
        raise InvalidEmail("Invalid character in local.")

def valid_local(local):
    """Raise error if any syntax rules are broken in the local part."""

    local = valid_quotes(local)
    local = valid_period(local)
    valid_local_characters(local)


def valid_domain_lengths(domain):
    """Raise error if the domain or any section of it is too long.

    >>> valid_domain_lengths('long.' * 52)
    Traceback (most recent call last):
      ...
    InvalidEmail: Domain length must not exceed 253 characters.
    >>> valid_domain_lengths('proper.example.com')"""

    if len(domain.rstrip('.')) > MAX_DOMAIN_LEN:
        raise InvalidEmail("Domain length must not exceed {} characters."
                           .format(MAX_DOMAIN_LEN))

    sections = domain.split('.')
    if any(1 > len(section) > MAX_DOMAIN_SECTION_LEN for section in sections):
        raise InvalidEmail("Invalid section length between domain periods.")

def valid_ipv4(ip):
    """Raise error if ip doesn't match IPv4 syntax rules.

    IPv4 is in the format xxx.xxx.xxx.xxx
    Where each xxx is a number 1 - 256 (with no leading zeroes).

    >>> valid_ipv4('256.12.1.12')
    >>> valid_ipv4('256.12.1.312')
    Traceback (most recent call last):
      ...
    InvalidEmail: IPv4 domain must be numbers 1-256 and periods only"""

    numbers = ip.split('.')
    if len(numbers) != 4:
        raise InvalidEmail("IPv4 domain must have 4 period separated numbers.")
    try:
        if any(0 > int(num) or int(num) > MAX_IPV4_NUM for num in numbers):
            raise InvalidEmail
    except ValueError:
        raise InvalidEmail("IPv4 domain must be numbers 1-256 and periods only")

def valid_ipv6(ip):
    """Raise error if ip doesn't match IPv6 syntax rules.

    IPv6 is in the format xxxx:xxxx::xxxx::xxxx
    Where each xxxx is a hexcode, though they can 0-4 characters inclusive.

    Additionally there can be empty spaces, and codes can be ommitted entirely
    if they are just 0 (or 0000). To accomodate this, validation just checks
    for valid hex codes, and ensures that lengths never exceed max values.
    But no minimums are enforced.

    >>> valid_ipv6('314::ac5:1:bf23:412')
    >>> valid_ipv6('IPv6:314::ac5:1:bf23:412')
    >>> valid_ipv6('314::ac5:1:bf23:412g')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domaim: '412g' is invalid hex value.
    >>> valid_ipv6('314::ac5:1:bf23:314::ac5:1:bf23:314::ac5:1:bf23:41241')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domain"""

    if ip.startswith(IPV6_PREFIX):
        ip = ip.replace(IPV6_PREFIX, '')
    hex_codes = ip.split(':')
    if len(hex_codes) > 8 or any(len(code) > 4 for code in hex_codes):
        raise InvalidEmail("Invalid IPv6 domain")

    for code in hex_codes:
        try:
            if code:
                int(code, HEX_BASE)
        except ValueError:
            raise InvalidEmail("Invalid IPv6 domaim: '{}' is invalid hex value.".format(code))

def valid_domain_characters(domain):
    """Raise error if any invalid characters are used in domain."""

    if any(char not in DOMAIN_CHARACTERS for char in domain):
        raise InvalidEmail("Invalid character in domain.")

def valid_domain(domain):
    """Raise error if domain is neither a valid domain nor IP.

    Domains (sections after the @) can be either a traditional domain or an IP
    wrapped in square brackets. The IP can be IPv4 or IPv6.
    All these possibilities are accounted for."""

    # Check if it's an IP literal
    if domain.startswith('[') and domain.endswith(']'):
        ip = domain[1:-1]
        if '.' in ip:
            valid_ipv4(ip)
        elif ':' in ip:
            valid_ipv6(ip)
        else:
            raise InvalidEmail("IP domain not in either IPv4 or IPv6 format.")
    else:
        valid_domain_lengths(domain)

def validate(address):
    """Raises an error if address is an invalid email string."""

    try:
        local, domain = strip_comments(address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")

    if len(local) > MAX_LOCAL_LEN:
        raise InvalidEmail("Only {} characters allowed before the @"
                         .format(MAX_LOCAL_LEN))
    if len(domain) > MAX_ADDRESS_LEN:
        raise InvalidEmail("Only {} characters allowed in address"
                         .format(MAX_ADDRESS_LEN))

    valid_local(strip_comments(local))
    valid_domain(strip_comments(domain))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
    raw_input('>DONE<')

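Unfortunately I couldn't get your code to work (I get an IndentationError), but I suspect it may fail even on some of the simpler examples in RFC 3696.

Your handling of comments isn't strictly correct: a quoted-string may only contain FWS between the quotes, not CFWS, so anything that looks like a comment inside a quoted-string is not a comment and should not be removed. The same goes for domain literals inside square brackets. Neither will matter much in the real world, but if you want to be absolutely correct you may want to think about how to handle them.

You might want to take a look at ex-parrot.com/pdw/Mail-RFC822-Address.html

Well, I just tried to send an email to someone(with_a_comment)@gmail.com and my Gmail wouldn't even let me send it. It said "someone" was invalid. The error message didn't even mention the comment.

"I did heavily reference the font of all knowledge, Wikipedia, which has a summary of the rules." - There's your problem. If you are implementing something technical, you should always get hold of the formal specification, which in your case is RFC 2822 (and its updates).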

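Answer #1

"@"@example.com and "\ "@example.com both fail, but both are valid.

" "@example.com passes, but is actually invalid.*

You may have missed the idea of checking your knowledge against the relevant RFCs, since a conforming implementation should obey the rules they describe. Wikipedia is fairly reliable these days, but it is by no means a normative source.

* RFC 5322 describes quoted-string as follows: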

quoted-string   =   [CFWS]
                    DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                    [CFWS]

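FWS stands for "folding white space": a construct made up of an optional part (a run of whitespace followed by a CRLF) and, after it, a required part consisting of at least one whitespace character. So although the local part of an address may legally begin and end with whitespace, the two runs of whitespace must be separated by at least one character that forms qcontent.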

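This answer describes exactly why validating email addresses is an exercise in futility. It is far easier to get it wrong than to get it right. In the past you could just use finger to get an approximation of deliverability, but these days you are better off just sending the message and hoping for the best.
– phyrfox
Jan 23, 2016 at 6:15

The only way to validate an email address is to try sending an email to it. If that fails, it doesn't necessarily mean the address is invalid, but it does mean your method of sending email can't deliver to that address, so its validity doesn't matter (whether that means you want to keep using your email library is up to you).
– 丹农
Jan 24, 2016 at 18:37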

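Answer #2

I already said this in chat, but @ on its own passes validation even though it is not a valid email address. You should require at least 1 character in the local part and at least 1 character in the domain.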
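A minimal sketch of that check, written for Python 3. The helper name split_address is hypothetical, and InvalidEmail here is only a stand-in for the class defined in the question. Like the posted code, it splits naively on '@', so an '@' inside a quoted local part is not handled.

```python
class InvalidEmail(ValueError):
    """Stand-in for the exception class defined in the question."""


def split_address(address):
    """Split address into (local, domain), requiring both to be non-empty.

    split_address is a hypothetical helper, not part of the posted module.
    """
    try:
        local, domain = address.split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")
    if not local:
        raise InvalidEmail("Local part must have at least one character.")
    if not domain:
        raise InvalidEmail("Domain must have at least one character.")
    return local, domain


print(split_address('user@example.com'))  # ('user', 'example.com')
try:
    split_address('@')
except InvalidEmail as error:
    print(error)  # Local part must have at least one character.
```

In the posted module the two emptiness checks could presumably sit right after the split inside validate, before the length checks.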

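Answer #3

I personally have a hard time faulting your code.
In fact, I'm quite surprised by how little code there is.
You remove both escaped backslashes (\\) and escaped quotes (\") from the quote, but the way you do it is overly verbose:

stripped = quote.replace('\\\\', '').replace('\\"', '"').strip('".')

Instead you could use re.sub, so that it reads as replacing both \\ and \" at once, which improves readability.
I don't know whether it makes much of a performance difference, though.

stripped = re.sub(r'\\[\\"]', '', quote).strip('".')

I'd also add another function, because currently you would use the validate function like this:

try:
    validate('')
except InvalidEmail:
    pass  # Handle non-valid email
else:
    pass  # Handle valid email

If you only want to know whether the address is valid, that is, personally, a lot of boilerplate.
For those cases I'd suggest an is_valid function.

It helps readability in the cases where you don't want to know the error.
It may not be how you intend the module to be used, with all its helpful errors,
but I know I'd want to use it this way.

All of your functions are public, which encourages me to call helpers such as valid_quotes directly.
I shouldn't be using it, so it should be named _valid_quotes.
It can still be used in the same way, but it is now "Python private", and it follows how re.py defines its functions.

As @Mathias says, you should also add __all__. Style-wise,
you have a few PEP8 errors,
but they are very small:

Surround top-level function and class definitions with two blank lines.

You have too much whitespace around your imports;
two blank lines would be enough (and would still violate PEP8). You didn't indent the two errors in validate enough.
You have one line that exceeds 79 characters.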
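The is_valid wrapper suggested in this answer could look roughly like the sketch below (Python 3). The validate function shown is a greatly simplified stand-in so that the example runs on its own; in the real module, is_valid would wrap the question's validate instead.

```python
class InvalidEmail(ValueError):
    """Stand-in for the exception class defined in the question."""


def validate(address):
    """Greatly simplified stand-in for the question's validate()."""
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")


def is_valid(address):
    """Return True if validate() accepts address, False otherwise."""
    try:
        validate(address)
    except InvalidEmail:
        return False
    return True


print(is_valid('example@email.com'))  # True
print(is_valid('no-at-sign'))         # False
```

Catching only InvalidEmail, rather than using a bare except, keeps unrelated errors (a TypeError from passing None, say) visible to the caller.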

is_valid is a good idea, especially since there is currently only one public function.
– SuperBiasedMan
Jan 22, 2016 at 14:18

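Answer #4

You seem to have built your module so that validate is its only "public" function. You may want to enforce that by declaring __all__ = ['validate', 'InvalidEmail']. It will affect how pydoc and the help builtin display help for the module (they will only show the module docstring, the exception and the validate function), and how from the_ultimate_email_validator import * is handled (only validate and InvalidEmail will leak into the global namespace).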
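To illustrate the effect of __all__ on a star import, here is a small self-contained sketch (Python 3). It builds a throwaway module in memory; the module name email_validator_demo and the simplified function bodies are assumptions made only for the demonstration.

```python
import sys
import types

# Hypothetical, heavily simplified stand-in for the validator module.
source = '''
__all__ = ['validate', 'InvalidEmail']

class InvalidEmail(ValueError):
    """String is not a valid Email."""

def strip_comments(s):
    return s

def validate(address):
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")
'''

# Register the in-memory module so that the import statement can find it.
module = types.ModuleType('email_validator_demo')
exec(source, module.__dict__)
sys.modules['email_validator_demo'] = module

# Run a star import into a fresh namespace and see which names arrive.
namespace = {}
exec('from email_validator_demo import *', namespace)
exported = sorted(name for name in namespace if not name.startswith('__'))
print(exported)  # ['InvalidEmail', 'validate']
```

The helper strip_comments stays reachable as email_validator_demo.strip_comments; __all__ only controls what the star import and the help tools treat as public.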

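Beyond that, looking at the intended use case of validate, it is very similar to int or related builtins. So it might be useful to rename it to something less passive (email, for example) and call it like this:

valid_address = email(user_input)

Any parsing problem would raise InvalidEmail, much as guess = int(raw_input()) raises ValueError. The caller would still handle invalid addresses with try .. except, just as with the current version. For that, validate would have to return the address at the end, which is easy because the comments have already been removed in the first line of the function.

But why do you call valid_local(strip_comments(local)) and valid_domain(strip_comments(domain)) rather than valid_local(local) and valid_domain(domain)? There doesn't seem to be any case in which a comment would be left in local or domain once comments have been stripped from the whole address.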
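As a rough sketch of the int-like interface described here (Python 3): the function below returns the comment-free address or raises. Only the '@' count is checked, as a placeholder; the question's local and domain checks would run where the comment indicates. The regex is the same non-nesting COMMENT_PATTERN used in the question.

```python
import re

COMMENT_PATTERN = re.compile(r'\(.*?\)')  # same pattern as the question


class InvalidEmail(ValueError):
    """Stand-in for the exception class defined in the question."""


def email(address):
    """Return address with comments stripped, or raise InvalidEmail."""
    stripped = COMMENT_PATTERN.sub('', address)
    if stripped.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")
    # The question's valid_local / valid_domain checks would run here.
    return stripped


print(email('john.smith(comment)@example.com'))  # john.smith@example.com
```

Because InvalidEmail inherits from ValueError, callers can treat email() much like int(): wrap the call in try/except ValueError and use the returned value on success.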

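The __all__ trick is very good. I had only ever used _ before, but that is rather clumsy to do here, so this is a nice solution! You're also right about the redundant strip_comments calls: I had previously arranged things so that it wasn't called so early, and I didn't update the later calls to match the change.
– SuperBiasedMan
Jan 22, 2016 at 14:17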

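Answer #5

The code you have written is generally very good, but you seem to have found that parsing non-trivial strings gets complicated quickly and leaves room for all sorts of nasty edge cases. Your remove_comments method does not seem to account for nested comments, which the RFC explicitly allows.

As expected, remove_comments("Hello (new) world") returns "Hello world", but when I run it, remove_comments("Hello (new (old) ish) world") returns 'Hello ish) world'.

Removing nested comments with a regular expression is very hard; from a regex purist's point of view it is actually impossible. To do it you would need a recursive regex, which Python's RE engine does not seem to support. What you really need to do is walk through the string and keep track of the number of currently open parentheses. That shouldn't be too hard for the next iteration.

But what about quoting? Would you want your parser to turn foo"\")"("")@example.com into foo")@example.com? If you really want to cope with as many pathological cases as possible, I suggest you study formal languages and parsers, and then dig up a parser library for Python to help you build your own parser. The Python Wiki lists several, and one in particular looks good, although I haven't tried it myself.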
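The idea of walking the string and counting open parentheses can be sketched as follows (Python 3). The function name strip_comments_nested is hypothetical, and the treatment of quoted strings and backslash escapes is a simplification of the RFC 5322 grammar rather than a faithful implementation of it: parentheses inside double quotes are kept, and a backslash protects the character after it.

```python
def strip_comments_nested(s):
    """Return s with parenthesised comments removed, honouring nesting.

    Simplified sketch: text inside double quotes is never treated as a
    comment, and a backslash escapes the character that follows it.
    """
    depth = 0          # number of currently open parentheses
    in_quotes = False  # inside a quoted string (only outside comments)
    escaped = False    # previous character was a backslash
    kept = []
    for char in s:
        outside_comment = depth == 0
        if escaped:
            escaped = False
            if outside_comment:
                kept.append(char)
        elif char == '\\':
            escaped = True
            if outside_comment:
                kept.append(char)
        elif in_quotes:
            in_quotes = char != '"'
            kept.append(char)
        elif outside_comment and char == '"':
            in_quotes = True
            kept.append(char)
        elif char == '(':
            depth += 1
        elif char == ')':
            if depth == 0:
                raise ValueError("Unbalanced ')' in address.")
            depth -= 1
        elif outside_comment:
            kept.append(char)
    if depth or in_quotes:
        raise ValueError("Unclosed comment or quoted string.")
    return ''.join(kept)


print(repr(strip_comments_nested('Hello (new (old) ish) world')))
# 'Hello  world'
print(strip_comments_nested('john."(kept)".smith(real (nested) comment)@example.com'))
# john."(kept)".smith@example.com
```

A hand-written loop like this is still far from a full parser, which is why a grammar-based approach may be the more robust route for the pathological cases.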

Answer #6

I didn't look at your code, only at your tests.

You don't seem to support IPv4-mapped IPv6 addresses:

validate('hello@[::FFFF:222.1.41.90]')

(See http://www.tcpipguide.com/free/t_IPv6IPv4AddressEmbedding-2.htm)
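One possible way to accept such addresses, sketched under the assumption that Python 3's standard ipaddress module is available (the question's code targets Python 2, where it is not in the standard library). The helper name is_ipv6_literal is hypothetical, and the caller is assumed to have already removed the square brackets and any IPv6: prefix.

```python
import ipaddress


def is_ipv6_literal(text):
    """Return True if text parses as an IPv6 address.

    IPv4-mapped forms such as ::FFFF:222.1.41.90 are accepted too.
    """
    try:
        ipaddress.IPv6Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


print(is_ipv6_literal('::FFFF:222.1.41.90'))    # True
print(is_ipv6_literal('2001:db8::1'))           # True
print(is_ipv6_literal('314::ac5:1:bf23:412g'))  # False
print(ipaddress.IPv6Address('::FFFF:222.1.41.90').ipv4_mapped)  # 222.1.41.90
```

Delegating to ipaddress would also cover rules the hand-written check skips, such as allowing only one :: in an address.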

In addition, the validation of the domain turns out to be rather lenient.

validate('hello@!^&&%^%#&^%&%$^#%^&%$^%#&^*&^*^%^#$')

Even worse:

validate('hello@example.com\n')
validate('hello@exa\nmp\nle.com\n')
validate('hello@example.com\nX-Spam-Score: 0')

(Input like this could be used to inject headers into an email.)
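In the code as posted, valid_domain_characters is defined but valid_domain does not appear to call it, which would explain why these domains pass. The sketch below (Python 3) reuses the question's character set and check unchanged to show that they already reject both the punctuation-only domain and the embedded newlines. The rejects helper is hypothetical and exists only for the demonstration.

```python
from string import ascii_letters, digits

DOMAIN_CHARACTERS = ascii_letters + digits + '-.'  # same set as the question


class InvalidEmail(ValueError):
    """Stand-in for the exception class defined in the question."""


def valid_domain_characters(domain):
    """Raise error if any invalid characters are used in domain."""
    if any(char not in DOMAIN_CHARACTERS for char in domain):
        raise InvalidEmail("Invalid character in domain.")


def rejects(domain):
    """Return True if the character check refuses domain (demo helper)."""
    try:
        valid_domain_characters(domain)
    except InvalidEmail:
        return True
    return False


print(rejects('example.com'))                          # False
print(rejects('!^&&%^%#&^%&%$^#%^&%$^%#&^*&^*^%^#$'))  # True
print(rejects('example.com\nX-Spam-Score: 0'))         # True
```

Calling this check from the non-literal branch of valid_domain, next to valid_domain_lengths, would presumably be enough to close the header-injection cases shown above.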
