我觉得那只是噪音,所以我把语言标签列表中的一部分剔除了。您可以在此处查看完整的收藏集。
您可以在SEDE上运行此查询,并根据需要进行操作。
请注意,我留出了很大的空间供他人修改和汇总自己的数据,从底部的简单查询。
5886毫秒内返回了1089行
我如何在可读性,性能等方面进行改进?
/**
* The objective of this query is to gather data related to cross-posts from
* Stack Overflow (SO) to Code Review (CR). A cross-post as defined in this context is
* a question which has first been asked on SO and then a short time later asked
* again on CR (albeit often slightly modified in the way it is titled or phrased).
* Querying from 2 or more sites requires cross-database queries, and the following
* 2 databases are used here. All relevant tables are in the [dbo] schema.
* - Stack Overflow DB: [StackOverflow]
* - Code Review DB: [StackExchange.Codereview]
* 2 temporary tables are used in order to compensate for the physsical limitations
* of SEDE which otherwise will often time out before the query is completed.
* param @minutesFromSoPostToCrPost int not null : The number of minutes allowed between the original
* SO question and its cross-post on CR. Default 120 minutes.
* param @maximumCharacterCountDifferenceAllowed int not null : The maximum number
* of characters difference between the body of the question.
* NOTE: The higher the number, the more likely that it's not actually a cross-post.
*/
if object_id('tempdb..#LanguageTags') is not null
drop table #LanguageTags;
if object_id('tempdb..#CrossPosts') is not null
drop table #CrossPosts;
go
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
declare @questionPost int = 1;
declare @minutesFromSoPostToCrPost int = 120;
declare @maximumCharacterCountDifferenceAllowed int = 1000;
select
[Primary Stack] = case
when SoUsers.Reputation >= CrUsers.Reputation then
'Stack Overflow'
else
'Code Review'
end
, [Primary User] = case
when SoUsers.Reputation >= CrUsers.Reputation then
'http://stackoverflow.com/users/' + convert(varchar(10), SoUsers.Id) + '|' + SoUsers.DisplayName
else
'http://codereview.stackexchange.com/users/' + convert(varchar(10), CrUsers.Id) + '|' + CrUsers.DisplayName
end
, [SO Original] = 'http://stackoverflow.com/questions/' + convert(varchar(10), SoPosts.Id) + '|' + SoPosts.Title
, [CR Xpost] = 'http://codereview.stackexchange.com/questions/' + convert(varchar(10), CrPosts.Id) + '|' + CrPosts.Title
/*Calculate the character difference of the body of both questions.*/
, [CharCountDiff] = abs(len(CrPosts.Body) - len(SoPosts.Body))
, [SO Score] = SoPosts.Score
, [CR Score] = CrPosts.Score
, [SO Status] = case
when SoPosts.DeletionDate is not null then 'Deleted'
when SoPosts.ClosedDate is not null then 'Closed'
else 'OK' end
, [CR Status] = case
when CrPosts.DeletionDate is not null then 'Deleted'
when CrPosts.ClosedDate is not null then 'Closed'
else 'OK' end
/*Check in @Duga comments*/
, [DugaComments?] = case
when exists (
select 1 from [StackOverflow].dbo.Comments as SoComments
where SoPosts.Id = SoComments.PostId
and SoComments.Text like '%code%review%'
) then 'True' end
, [SO Answers] = SoPosts.AnswerCount
, [CR Answers] = CrPosts.AnswerCount
, [SO Accept?] = case
when SoPosts.AcceptedAnswerId is not null then 'True' end
, [CR Accept?] = case
when CrPosts.AcceptedAnswerId is not null then 'True' end
, [SO Created] = SoPosts.CreationDate
, [Minutes to Xpost] = datediff(minute, SoPosts.CreationDate, CrPosts.CreationDate)
, [Tags] = CrPosts.Tags
/*Adding results into temp table to avoid timeouts in `select distinct`*/
into #CrossPosts
from
/*Common users across CR and SO sites:*/
[StackExchange.Codereview].dbo.Users as CrUsers
inner join [StackOverflow].dbo.Users as SoUsers
/*AccountId is network-wide Id for each user, and
is distinct from the UserId which is for a specific site*/
on CrUsers.AccountId = SoUsers.AccountId
/*Questions by user on both sites:*/
inner join [StackExchange.Codereview].dbo.Posts as CrPosts
on CrUsers.Id = CrPosts.OwnerUserId
and CrPosts.PostTypeId = @questionPost
inner join [StackOverflow].dbo.Posts as SoPosts
on SoUsers.Id = SoPosts.OwnerUserId
and SoPosts.PostTypeId = @questionPost
/*Bring in tags so we can try to eliminate false matches
due to unrelated posts potentially being posted by the same
user on 2 different sites within our scoped time period.*/
inner join [StackExchange.Codereview].dbo.PostTags as CrPT
on CrPosts.Id = CrPT.PostId
inner join [StackExchange.Codereview].dbo.Tags as CrTags
on CrPT.TagId = CrTags.Id
inner join [StackOverflow].dbo.PostTags as SoPT
on SoPosts.Id = SoPT.PostId
inner join [StackOverflow].dbo.Tags as SoTags
on SoPT.TagId = SoTags.Id
where
/*Q was first posted on SO, then later on CR*/
SoPosts.CreationDate < CrPosts.CreationDate
/*Q was posted on CR within a certain number of minutes after being posted on SO*/
and datediff(minute, SoPosts.CreationDate, CrPosts.CreationDate) <= @minutesFromSoPostToCrPost
/*Match at least one language tag from CR->SO per post
Note: We use `select distinct` on the query against #CrossPosts
due to SEDE timing out if attempting to do it during this query.*/
and CrTags.TagName = SoTags.TagName
and exists (
select 1 from #LanguageTags as Langs
where CrTags.TagName = Langs.TagName
)
/*Apply filter based on character count difference of the body of both questions.*/
and abs(len(CrPosts.Body) - len(SoPosts.Body)) <= @maximumCharacterCountDifferenceAllowed
;
/*Use this query to view full result set, or modify it
according to your needs to aggregate from the #CrossPosts table.*/
select distinct *
from #CrossPosts
order by [SO Created] desc;
#1 楼
风格我喜欢你的风格。我不介意以小写形式阅读关键字。我
觉得这样比较容易,但是再说一次,我很习惯
区分大小写的语言,使用小写的关键字和语法
突出显示。
我我也很喜欢前缀列别名,而不是主流的后缀
AS Alias
样式。如果我不得不挑剔,那就很麻烦:
create table
中的悬挂逗号语句使列名的对齐混乱。那里:)
查找表上的过滤器联接
两个
Tags
表上的联接都非常嵌套。这意味着,对于每一个
Post
,SQL都会与每一个PostTag
连接(这是不可避免的,因为不幸的是,在m-n匹配中,需要它们。 -filter),然后将其分别与Tags
表联接。在此联接中实际上不需要
Tags
表。我们可以通过将它们各自的
TagId
值添加到#LanguageTags
临时表中,来预查找要使用的所有标签,如下所示:create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, CrTagId int
, SoTagId int
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
update Langs
set CrTagId = CrTags.Id
, SoTagId = SoTags.Id
from #LanguageTags Langs
inner join [StackExchange.CodeReview].dbo.Tags as CrTags
on CrTags.TagName = Langs.TagName
inner join [StackOverflow].dbo.Tags as SoTags
on SoTags.TagName = Langs.TagName;
现在我们可以使用此表过滤要通过
TagId
查看的标签,并且我们现在还可以将来自两个站点的帖子加入到单个表中。 />
SEDE和跨数据库查询
似乎SEDE并未真正优化,无法将大型数据库表连接到临时表。因此,当我们将上述连接放入获取查询中时,就会超时。解决方案:将
TagId
值存储在临时表中:
select
-- SNIP
, [Tags] = CrPosts.Tags
, [CrTagId] = CrPT.TagId
, [SoTagId] = SoPT.TagId
,然后在最终结果中的
#LanguageTags
表上进行过滤和区分query:
select distinct
CP.[Primary Stack]
, CP.[Primary User]
, CP.[SO Original]
, CP.[CR Xpost]
, CP.[CharCountDiff]
, CP.[SO Score]
, CP.[CR Score]
, CP.[SO Status]
, CP.[CR Status]
, CP.[DugaComments?]
, CP.[SO Answers]
, CP.[CR Answers]
, CP.[SO Accept?]
, CP.[CR Accept?]
, CP.[SO Created]
, CP.[Minutes to Xpost]
, CP.[Tags]
, CP.[CrTagId]
, CP.[SoTagId]
from #CrossPosts CP
/*Match at least one language tag from CR->SO per post.*/
inner join #LanguageTags as Langs
on Langs.SoTagId = CP.SoTagId
and Langs.CrTagId = CP.CrTagId
order by [SO Created] desc;
结果发现这要快得多,因为查询现在返回
返回了1130行在4166毫秒内