有人认为我的服务器日志解析器不好

我刚刚被告知，我编写的以下代码非常糟糕。我完全不知道为什么。它是高效内存的，对我来说看起来很干净。但是反馈仍然很差。我不知道为什么。如果有人可以发表评论，我将不胜感激。我必须从命令行传递文件名才能使其正常工作-实际上这就是他们的要求。

import re
import numpy as np
from collections import Counter
import sys

def parse_server_log(path_to_file):

    #First check whether it's a legal file name
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    # end of sanity check on file

    # The following regexes would extract the concerned values from each line
    # of the file.
    method_regx = re.compile("(method=)+[A-Z]+[\s]+")
    path_regx = re.compile("(path=)+[/\w.\"]+[\s]+")
    dyno_regx = re.compile("(dyno=)+[\w.]+[\s]+")
    connect_regx = re.compile("(connect=)+[\w.]+[\s]+")
    service_regx = re.compile("(service=)+[\w.]+[\s]+")

    # Target values for each urls
    url1 = [0, [], []]
    url2 = [0, [], []]
    url3 = [0, [], []]
    url4 = [0, [], []]
    url5 = [0, [], []]
    url6 = [0, [], []]

    # url matcher regex for each url type
    url1_regex = re.compile("(#)+(/api/users/)+[\d]+(/count_pending_messages)+(#)+")
    url2_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_messages)+(#)+")
    url3_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_progress)+(#)+")
    url4_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_score)+(#)+")
    url5_6_regex = re.compile("(#)+(/api/users/)+[\d]+(#)+")


    with open(path_to_file) as lines:
        for my_data in lines:

            # Now lets separate out the method, path, dyno, connect and service times
            k = method_regx.search(my_data)
            try:            
                line_method = k.group(0).split("=")[1].strip()
            except:
                line_method = ""

            k = path_regx.search(my_data)
            try:
                # The hashes are added at the start and end to make sure the path
                # is not a part of a string rather the entire string!
                line_path = "#" + k.group(0).split("=")[1].strip()  + "#"            
            except:
                line_path = ""

            k = dyno_regx.search(my_data)
            try:            
                line_dyno = k.group(0).split("=")[1].strip()          
            except:
                line_dyno = ""

            k = connect_regx.search(my_data)
            try:            
                line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])            
            except:
                line_connect_time = 0

            k = service_regx.search(my_data)
            try:            
                line_service_time = float(k.group(0).split("=")[1].split("ms")[0])             
            except:
                line_service_time = 0

            # End of getting all the data

            # Now match up the URL and do this under sanity checker
            if(len(line_method) > 0 and len(line_path) > 0):
                url_denoter = 0
                if url1_regex.match(line_path) is not None:
                    url_denoter = 1
                elif url2_regex.match(line_path) is not None:
                    url_denoter = 2
                elif url3_regex.match(line_path) is not None:
                    url_denoter = 3
                elif url4_regex.match(line_path) is not None:
                    url_denoter = 4
                elif url5_6_regex.match(line_path) is not None:
                    url_denoter = 5            



                # OK so now we have the url to which the current url matched
                if(url_denoter==1 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/count_pending_messages
                   """
                    url1[0] += 1
                    url1[1].append(line_dyno)
                    url1[2].append(line_connect_time + line_service_time)

                elif(url_denoter==2 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/get_messages
                    """
                    url2[0] += 1
                    url2[1].append(line_dyno)
                    url2[2].append(line_connect_time + line_service_time)


    """
    Now print the results!

    """

    # for GET /api/users/{user_id}/count_pending_messages
    print "\n------GET /api/users/{user_id}/count_pending_messages----\n"
    print "Number of times the url is called: ", url1[0]
    if(url1[0]>0):
        my_num_list = url1[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url1[1].count(x)) for x in set(url1[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1]> swap):
                swap = count[1]
                tar_dyno = count[0]

        print "Dyno that responded the most: ", tar_dyno   
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"


    print "\n------GET /api/users/{user_id}/get_messages----\n"
    print "Number of times the url is called: ", url2[0]
    if(url2[0]>0):
        my_num_list = url2[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url2[1].count(x)) for x in set(url2[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1]> swap):
                swap = count[1]
                tar_dyno = count[0]

        print "Dyno that responded the most: ", tar_dyno   
    else:
        print "Sorry no parameters can be calculated as the url has not beenaccessed!"


    print "\n------GET /api/users/{user_id}/get_friends_progress----\n"
    print "Number of times the url is called: ", url3[0]


if(__name__=="__main__"):
    parse_server_log(sys.argv[1])

尽管您已经有了一个很好的答案，但我还是鼓励您与检查您的代码的人讨论这个问题。特别是，他们是否给您特定的反馈，是否与您在此处获得的反馈一致，以及如何最好地将反馈应用于当前和将来的项目？

您的代码将日志切成有用的东西。您为什么不称它为LumberMill？ #missedopportunity

您的正则表达式很奇怪。例如（＃）+（// api / users /）+ [\ d] +（＃）+表示每个组可以出现几次？我不认为+做您想做的。另外，以后您将不使用具有组的事实。

希望“您的代码非常糟糕”不是您的同行使用的确切措辞，否则您最好回答“您的评论非常不育”

您的except语句应具有显式的异常类型。例如，用于捕获不存在的文件，但IOError :.

#1 楼

首先，Python有一个样式指南，并且（除非给您另外的指南，在这种情况下，请提供指向它的链接）应该遵循它。您的代码通常都遵循它，但是请注意import的顺序错误，应该是：第三方模块。

编写好的代码时，重复是一个大问题，这是一个立即的危险信号：

from collections import Counter
import re
import sys

import numpy as np

为什么不列出包含这些结构的列表urls？您可以将其剪切为一行，因此，如果以后更改结构，则只需执行一次：

url1 = [0, [], []]
url2 = [0, [], []]
url3 = [0, [], []]
url4 = [0, [], []]
url5 = [0, [], []]
url6 = [0, [], []]

您的代码中还有许多其他重复元素，这些元素可以用类似的方法减少。

parse_server_log太长了。您应该将其拆分为具有合理参数和return值的较小的独立函数，这将使您的代码更易于阅读，理解和维护。每个文档都应有一个文档字符串，以确切说明其功能。这也有助于减少重复。

裸except是个坏主意-至少应该使用except Exception，但是更好的办法是找出可能发生的错误，并且明确地处理它们。参见例如“ except的弊端”。

您应该使用字符串格式，例如

urls = [[0, [], []] for _ in range(6)]

应

print "Average response time: ", round(np.average(my_num_list), 2), " in ms."

使用元组包装和拆箱，例如

print "Average response time: {0:.2f} in ms.".format(np.average(my_num_list))

可能是：

for count in counts:
    if(count[1]> swap):
        swap = count[1]
        tar_dyno = count[0]

您的文件检查

for dyno_, swap_ in counts:
    if swap_ > swap:
        swap, tar_dyno = swap_, dyno_

从不关闭文件，该文件在整个功能中保持打开状态。您可以使用os模块检查Python中是否存在文件。改为这样做。或者，看看这个SO答案。

\ $ \ begingroup \ $
最后一点是最后一点-如果无法打开文件，则无法关闭任何文件。而且通常最好尝试打开并处理异常，而不是事先进行存在检查。即使存在检查也会引起误解，因为该文件可能在某个时刻存在，而在下一时刻将被删除。
\ $ \ endgroup \ $
– Greg Hewgill
2014年11月4日19:17

\ $ \ begingroup \ $
@GregHewgill是正确的，但是如果文件不存在，该函数无论如何都会终止，并且当前的设置也无法防止文件删除。
\ $ \ endgroup \ $
– jonrsharpe
2014年11月4日19:21

\ $ \ begingroup \ $
尝试将open（path_to_file）与f一起包装在try ...中使用不是更好，除了这里建议的EnvironmentError之外？（尽管从另一个答案修改后的内容看起来也很整洁）
\ $ \ endgroup \ $
– Tobias Kienzler
2014年11月5日13:34

\ $ \ begingroup \ $
@TobiasKienzler谢谢，我已将其添加到答案中。
\ $ \ endgroup \ $
– jonrsharpe
2014年11月5日13:35

\ $ \ begingroup \ $
-仅需要一个正则表达式，也许需要两个。
\ $ \ endgroup \ $
–tedder42
2014年11月7日18:00

#2 楼

正则表达式中的+表示Once or more。我认为在您的情况下，您误将其用作连接运算符。

我认为您可以通过更好的正则表达式来分解很多因素。例如：

method_regx = re.compile("(method=)+[A-Z]+[\s]+")
# ...
k = method_regx.search(my_data)
try:
    line_method = k.group(0).split("=")[1].strip()
except:

可以重写

method_regx = re.compile("method=([A-Z]+)[\s]+")
# ...
k = method_regx.search(my_data)
line_method = k.group(1) if k else ''

因为如果不进行搜索，则k为None工作。

同样，您可能会找到可以完成所有工作的正则表达式，并允许您从

line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])

中删除split

\ $ \ begingroup \ $
实际上，不，我知道正则表达式中+的含义！但是，是的，我同意，按照您的建议可能会更好。谢了哥们！
\ $ \ endgroup \ $
–user3001408
2014年11月5日在6:28

\ $ \ begingroup \ $
@ user3001408您可以尝试使用regex101.com/#python之类的内容来帮助您开发更精确的正则表达式
\ $ \ endgroup \ $
– jonrsharpe
2014年11月5日，9：35

#3 楼

已经有了不错的答案，但是我想谈谈文件处理的另一点，实际上它也反映了parse_server_log太长的问题。请考虑方法的第一部分：

def parse_server_log(path_to_file):
    : : :
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    : : :

调用此（或任何）方法时，它可以完成一些事情：

读取或写入全局数据
读取或写入传递给它的参数
返回单个对象
引发异常

在这种情况下，您已选择让parse_server_log读取传递给它的参数，读取全局数据（文件的内容），写入全局数据（通过print语句）并返回None。在一些不太常见的情况下，或者如果您的代码中存在错误，则可能还会引发异常。

但是，作为这段代码的使用者，我如何知道发生了什么？缺乏技巧，例如用StringIO对象代替sys.stdout，然后解析它，我不能。可用的是，当用户调用脚本时，我可以读取输出。这是非常有限的。如果parse_server_log做了您想做的所有事情，那很好。这是一个高质量的一次性脚本。可以对其进行修改以应对出现的新需求。

但将其关注点分离可能更有用。 parse_server_log听起来应该返回代表日志内容的数据结构。另一个函数可以处理它，第三个可以用打印语句对其进行总结。如果它的结构是这样，您将需要知道parse_server_log是成功还是失败。这可以通过更改返回的值，也可以通过引发异常来实现。

所有这一切都是一个漫长的说法：考虑让open(path_to_file)引发异常而不捕获它。然后，parse_server_log的客户端可以看到该异常并选择如何处理它。在这种情况下，Python只会将这些信息转储到屏幕上，但是即使这样，它也可以告诉您更多信息，而不是您选择在其位置打印的消息。

#4 楼

就像numpy一样方便，就像其他答案的小注脚（但对于评论来说有点太大）一样，仅计算列表的平均值和中位数，这可能有点过头了。相反，您可以自己轻松地做到：

或者，如果您想看中偶数长度的列表（中位数并不总是唯一地定义的）：

from math import fsum
a = sorted(my_num_list)
n = len(a)
average = fsum(a) / n
median = a[n/2]

消除对numpy的依赖的主要优势，除了稍微减少了内存使用和更快的加载时间之外，还在于即使在numpy可能无法使用的系统上，代码也可以工作由于某种原因可用。另外，对我来说，看起来更干净。

编程黑洞网

有人认为我的服务器日志解析器不好

评论

#1 楼

评论

#2 楼

评论

#3 楼

#4 楼