We'll be counting stars

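Lately, I've been losing sleep
Dreaming about the things that we could be
But baby, I've been praying hard
Said, no more counting dollars
We'll be counting stars, yeah, we'll be counting stars
(OneRepublic - Counting Stars)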

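The 2nd Monitor is known to be a happy chat room, but how many stars are there exactly? And who are the most starred users? I decided to write a script to find out.

I chose to write the script in Python because:

I had never used Python before
@200_success used it and it didn't look that hard
I found Beautiful Soup, which looked powerful and easy to use

The script performs a number of HTTP requests to the list of starred messages, keeps track of the numbers in a dict, and also saves the raw HTML data to files (to make it easier to run other calculations on the data later, and to give me a chance to learn file I/O in Python).
To be on the safe side, I added a short delay between some of the requests (I didn't want to risk getting blocked by the system).

Unfortunately, it is not possible to see which user has starred the most messages, but we all know who that is...

The code: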

from time import sleep
from bs4 import BeautifulSoup
from urllib import request
from collections import OrderedDict
import operator

room = 8595 # The 2nd Monitor (Code Review)
url = 'http://chat.stackexchange.com/rooms/info/{0}/the-2nd-monitor/?tab=stars&page={1}'
pages = 125

def write_files(filename, content):
    with open(filename, 'w', encoding = 'utf-8') as f:
        f.write(content)

def fetch_soup(room, page):
    resource = request.urlopen(url.format(room, page))
    content = resource.read().decode('utf-8')
    mysoup = BeautifulSoup(content)
    return mysoup

allstars = {}
def add_stars(message):
    message_soup = BeautifulSoup(str(message))
    stars = message_soup.select('.times').pop().string
    who = message_soup.select(".username a").pop().string

    # If there is only one star, the `.times` span item does not contain anything
    if stars == None:
        stars = 1

    if who in allstars:
        allstars[who] += int(stars)
    else:
        allstars[who] = int(stars)

for page in range(1, pages):
    print("Fetching page {0}".format(page))
    soup = fetch_soup(room, page)
    all_messages = soup.find_all(attrs={'class': 'monologue'})
    for message in all_messages:
        add_stars(message)
    write_files("{0}-page-{1}".format(room, page), soup.prettify())
    if page % 5 == 0:
        sleep(3)

# Create a sorted list from the dict with items sorted by value (number of stars)
sorted_stars = sorted(allstars.items(), key=lambda x:x[1])

for user in sorted_stars:
    print(user)

The results, you ask? Well, here they are: (Spoiler warning!) (I'm only showing those with more than 50 stars here, to keep the list shorter)

', 73)
('ChrisW', 85)
('Edward', 86)
('Yuushi', 93)
('Marc-Andre', 98)
('nhgrif', 112)
('amon', 119)
('James Khoury', 126)
('Nobody', 148)
('Jerry Coffin', 150)
('BenVlodgi', 160)
('Donald.McLean', 174)
('konijn', 184)
('200_success', 209)
('Vogel612', 220)
('kleinfreund', 229)
('Corbin', 233)
('Morwenn', 253)
('skiwi', 407)
('lol.upvote', 416)
('syb0rg', 475)
('Malachi', 534)
('retailcoder', 749)
("Mat's Mug", 931)
('Simon André Forsberg', 1079)
('Jamal', 1170)
('The Mug with many names', 2096) (Mat's Mug, retailcoder and lol.upvote are the same user)
('rolfl', 2115)

I use .pop() to get the data out of the soup selection; is there another way to do this? That said, since this is my first time using Python, any comments are welcome.

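Malachi only has 534? That surprises me.

@syb0rg It shows how many stars a user's messages have received, not how many stars that user has given.

Answer #1

This is a great excuse to write code, and the end product is quite nice as well.

★ First of all, given your Java background, congratulations on not forcing classes into Python where none are needed.

★ You import OrderedDict and operator but never use them. Remove unused imports.

★ Code should generally be written so that it can be used both as a module and as a script. The if __name__ == '__main__' trick is used for this: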

if __name__ == '__main__':
    # executed only if used as a script
    pass
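As a minimal sketch of how the guard behaves: the names count_stars and main below are hypothetical and not part of the original script. Importing the file as a module makes count_stars available without running anything, while executing the file directly calls main().

```python
def count_stars(star_counts):
    """Return the total of a list of per-message star counts."""
    return sum(star_counts)


def main():
    # Runs only when the file is executed directly, not when it is imported.
    print(count_stars([3, 1, 4]))


if __name__ == '__main__':
    main()
```

Keeping the work inside functions also means nothing needs to live in a global variable.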

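★ You predeclare some variables such as room, url and pages. This hinders code reuse:

The room variable should not be predeclared as a global, but declared in your main section. From there it can be passed to every function that needs it.

url mentions the-2nd-monitor specifically. That is not harmful, but it is unnecessary, since only the ID is relevant. Furthermore, url is too short a name for such a large scope. Something like star_url_pattern would be better, except that global "constants" should be written in all uppercase: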

STAR_URL_PATTERN = 'http://chat.stackexchange.com/rooms/info/{0}/?tab=stars&page={1}'

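Reserve plural names for collections. pages should be page_count. But wait, why are we hardcoding this instead of getting it from the page itself? Just follow the rel="next" links until you reach the end.

★ This last idea can be implemented with a generator function. A Python generator function is similar to a simple iterator. It can yield elements, or return once it is exhausted. We can build a generator function that yields a Beautiful Soup object for each page and takes care of fetching the next page. As a sketch: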

from urllib.parse import urljoin

def walk_pages(start_url):
    current_page = start_url
    while True:
        content = ... # fetch the current_page
        soup = BeautifulSoup(content)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])
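To see the yield/return mechanics without any network access, here is a small self-contained sketch of the same pattern. The PAGES dictionary is a made-up stand-in for the chat site: each hypothetical page holds some items and the key of the next page, playing the role of the rel="next" link.

```python
# Hypothetical in-memory "site": each page lists its items and the key of the next page.
PAGES = {
    'page1': {'items': ['a', 'b'], 'next': 'page2'},
    'page2': {'items': ['c'], 'next': 'page3'},
    'page3': {'items': ['d', 'e'], 'next': None},
}


def walk_pages(start):
    """Yield each page in turn, following 'next' links until there are none."""
    current = start
    while True:
        page = PAGES[current]
        yield page
        if page['next'] is None:
            return  # ends the iteration
        current = page['next']


# The caller never needs to know how many pages exist.
items = [item for page in walk_pages('page1') for item in page['items']]
print(items)
```

The caller simply loops over the generator, so no page count has to be hardcoded.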

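★ Please don't use urllib.request. That library has a horrible interface and is more or less broken by design. You may have noticed that the .read() method returns raw bytes instead of automatically decoding the content using the charset from the Content-Type header. That is useful when handling binary data, but an HTML page is text. Instead of hardcoding the utf-8 encoding (which isn't even the default encoding for HTML), we can use a better library such as requests. Then: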

import requests
response = requests.get(current_page)
response.raise_for_status()  # throw an error (only for 4xx or 5xx responses)
content = response.text  # transparently decodes the content
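To illustrate what decoding "using the charset from the Content-Type header" involves, here is a minimal standard-library sketch. decode_body is a hypothetical helper, and the header value and body bytes are made-up examples; with requests, response.text takes the declared charset into account for you.

```python
from email.message import Message


def decode_body(raw_bytes, content_type, fallback='utf-8'):
    """Decode raw_bytes using the charset named in a Content-Type header value."""
    header = Message()
    header['Content-Type'] = content_type
    # get_content_charset() returns the fallback when no charset is declared.
    charset = header.get_content_charset(failobj=fallback)
    return raw_bytes.decode(charset)


# A body that is NOT valid utf-8, sent with an explicit charset.
body = 'Simon André Forsberg'.encode('iso-8859-1')
print(decode_body(body, 'text/html; charset=ISO-8859-1'))
```

Calling body.decode('utf-8') on the same bytes would raise a UnicodeDecodeError, which is the kind of failure a hardcoded encoding invites.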

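★ Your allstars variable should not only be named something like all_stars (note the underscore separating the words), it also shouldn't be a global variable. Consider passing it as a parameter to add_stars, or wrapping the dictionary in an object of which add_stars is a method.

★ You save every page to a file. I suspect this is meant as a debugging aid, but it adds no value for a user of the script. Instead of cluttering the current working directory, make this behaviour optional.

★ Don't compare against None with the == operator. To test for identity, use the is operator: stars is None. Sometimes it is preferable to rely on an object's boolean overload; for example, an empty list is considered false.

Putting these suggestions together: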

import time
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

STAR_URL_TEMPLATE = 'http://chat.{0}/rooms/info/{1}/?tab=stars'

def star_pages(start_url):
    current_page = start_url
    while True:
        print("GET {}".format(current_page))
        response = requests.get(current_page)
        response.raise_for_status()
        soup = BeautifulSoup(response.text)
        yield soup
        # find the next page
        next_link = soup.find('a', {'rel': 'next'})
        if next_link is None:
            return
        # urljoin takes care of resolving the relative URL
        current_page = urljoin(current_page, next_link['href'])

def star_count(room_id, site='stackexchange.com'):
    stars = {}
    for page in star_pages(STAR_URL_TEMPLATE.format(site, room_id)):
        for message in page.find_all(attrs={'class': 'monologue'}):
            author = message.find(attrs={'class': 'username'}).string

            star_count = message.find(attrs={'class': 'times'}).string
            if star_count is None:
                star_count = 1

            if author not in stars:
                stars[author] = 0
            stars[author] += int(star_count)

        # be nice to the server, and wait after each page
        time.sleep(1)
    return stars

if __name__ == '__main__':
    the_2nd_monitor_id = 8595
    stars = star_count(the_2nd_monitor_id)
    # print out the stars in descending order
    for author, count in sorted(stars.items(), key=lambda pair: pair[1], reverse=True):
        print("{}: {}".format(author, count))
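The advice about testing for None with is, and about relying on truthiness, can be illustrated with a short sketch. AlwaysEqual and describe are hypothetical names invented for this example: the class overrides == to show why equality is not a reliable test for None, and describe shows an empty list being treated as false.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


value = AlwaysEqual()
print(value == None)  # True, although value is clearly not None
print(value is None)  # False: identity cannot be overridden


def describe(messages):
    # An empty list is falsy, so no explicit len() check is needed.
    if not messages:
        return 'no messages'
    return '{} message(s)'.format(len(messages))


print(describe([]))
print(describe(['first', 'second']))
```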

Answer #2

I can't see any reason why you would need to pop the element off the array returned by .select(); you could do something like

 message_soup.select('.times')[0].string

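Either approach (indexing or .pop()) raises an exception if the message does not contain an element with the .times class, so you could add some exception handling.

As for whether to use .pop(): that depends on other factors that can be a matter of judgment. One colleague whom I respect seems to think that using pop() in Python is a bit unidiomatic. Another colleague, more influenced by Lisp, likes it. Personally, I would leave it out unless there is a compelling reason to modify the data structure.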
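A minimal sketch of the two points made above, under the assumption that a plain Python list can stand in for the result of .select(): both indexing and pop() raise IndexError on an empty list, and only pop() changes the list. The helper names first_or_none and pop_or_none are hypothetical. Note that [0] takes the first match and pop() the last, which is the same element when there is exactly one match.

```python
def first_or_none(results):
    """Return results[0] without modifying the list, or None if it is empty."""
    try:
        return results[0]
    except IndexError:
        return None


def pop_or_none(results):
    """Return results.pop(), or None if the list is empty. Shrinks the list."""
    try:
        return results.pop()
    except IndexError:
        return None


matches = ['5']                # stand-in for message_soup.select('.times')
print(first_or_none(matches))  # indexing leaves the list untouched
print(len(matches))
print(pop_or_none(matches))    # pop() removes the element it returns
print(len(matches))
print(first_or_none([]))       # no match: None instead of an IndexError
```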

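Answer #3

There are also some BeautifulSoup-related improvements:

It is strongly recommended to specify the underlying parser that BeautifulSoup will use under the hood: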

soup = BeautifulSoup(response.text, "html.parser")
# soup = BeautifulSoup(response.text, "lxml")
# soup = BeautifulSoup(response.text, "html5lib")

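If you don't specify a parser, BeautifulSoup picks one automatically from those available in the current Python environment. The choice may differ between machines and environments, which can lead to surprising results. See also the "Installing a parser" section of the documentation.

Instead of calling .select() followed by .pop(), you can call the .select_one() method directly, which returns a single element, or None if no element was found.
soup.find_all(attrs={'class': 'monologue'}) can be replaced with a more concise CSS selector call: soup.select('.monologue')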
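The suggestions above can be combined into a short sketch. The HTML snippet is a made-up stand-in that only borrows the class names used in the question (monologue, username, times); the real chat page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second message has an empty .times element (one star).
HTML = """
<div class="monologue">
  <div class="username"><a href="/users/1">alice</a></div>
  <span class="times">3</span>
</div>
<div class="monologue">
  <div class="username"><a href="/users/2">bob</a></div>
  <span class="times"></span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")  # explicit parser
stars = {}
for message in soup.select('.monologue'):
    who = message.select_one('.username a').string
    times = message.select_one('.times')  # a single element, or None
    count = int(times.string) if times is not None and times.string else 1
    stars[who] = stars.get(who, 0) + count

print(stars)
```

Because select_one() returns None rather than raising, a missing element can be handled with an ordinary is not None check.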