Increasing speed of crop, mask, & extract raster by many polygons in R?

我正在从基于数千个多边形边界的栅格中提取不同土地利用类型的面积和覆盖率。我发现，如果我遍历每个单独的多边形并进行裁剪，然后将栅格蒙版减小到特定多边形的大小，则提取功能的工作速度会更快。尽管如此，它的运行速度很慢，我想知道是否有人对提高我的代码的效率和速度有任何建议。

我唯一发现的与此相关的是Roger Bivand的答复，他建议使用GDAL.open()和GDAL.close()以及getRasterTable()和getRasterData()。我研究了这些，但是过去在gdal上遇到过麻烦，并且对它的了解还不够深，不知道如何实现它。

可重现的示例：

library(maptools)  ## For wrld_simpl
library(raster)

## Example SpatialPolygonsDataFrame
data(wrld_simpl) #polygon of world countries
bound <- wrld_simpl[1:25,] #name it this to subset to 25 countries and because my loop is set up with that variable  

## Example RasterLayer
c <- raster(nrow=2e3, ncol=2e3, crs=proj4string(wrld_simpl), xmn=-180, xmx=180, ymn=-90, ymx=90)
c[] <- 1:length(c)

#plot, so you can see it
plot(c)    
plot(bound, add=TRUE)

迄今为止最快的方法

result <- data.frame() #empty result dataframe 

system.time(
     for (i in 1:nrow(bound)) { #this is the number of polygons to iterate through
      single <- bound[i,] #selects a single polygon
      clip1 <- crop(c, extent(single)) #crops the raster to the extent of the polygon, I do this first because it speeds the mask up
      clip2 <- mask(clip1,single) #crops the raster to the polygon boundary

      ext<-extract(clip2,single) #extracts data from the raster based on the polygon bound
      tab<-lapply(ext,table) #makes a table of the extract output
      s<-sum(tab[[1]])  #sums the table for percentage calculation
      mat<- as.data.frame(tab) 
      mat2<- as.data.frame(tab[[1]]/s) #calculates percent
      final<-cbind(single@data$NAME,mat,mat2$Freq) #combines into single dataframe
      result<-rbind(final,result)
      })

   user  system elapsed 
  39.39    0.11   39.52

并行处理

并行处理将用户时间减少了一半，但是却由于将系统时间加倍而失去了好处。栅格将其用于提取功能，但不幸的是，剪裁或遮罩功能不使用此功能。不幸的是，由于“ IO”的“等待”，因此总耗用的时间要大得多。

beginCluster( detectCores() -1) #use all but one core

在多个内核上运行代码：

  user  system elapsed 
  23.31    0.68   42.01

然后结束群集

endCluster()

慢速方法：
直接提取的另一种方法从raster函数获取数据需要花费更长的时间，而且我不确定数据管理是否可以将其转换为所需的格式：

system.time(ext<-extract(c,bound))
   user  system elapsed 
1170.64   14.41 1186.14

您可以尝试使用此R代码分析器（marcodvisser.github.io/aprof/Tutorial.html）。它可以告诉您大多数时间需要哪几行。该链接还提供了减少R中处理时间的准则。

我这里只有两美分。。。但是当裁剪中的像素数非常低时，crop / getvalues方法不起作用。我不确定限制在哪里，但是在只有1-5像素的农作物上遇到了问题（我尚未确定确切的原因（对于空间包装还是有点陌生），但我敢打赌农作物的功能取决于像素边界，因此很难裁剪任何单个像素）。从栅格数据包中提取数据没有这样的问题，但是一致认为是用户时间的两倍以上，而系统时间的两倍以上。只是对那些分辨率较低的栅格（和in

velox是一个有点新的软件包，它已通过Rcpp软件包将提取物移入C。使用多边形进行提取操作时，速度提高了约10倍。

@JeffreyEvans。现在使用velox测试该问题的答案。但是，在分配极大的向量时遇到问题。

#1 楼

我终于可以改善这个功能了。我发现，出于我的目的，首先首先对多边形进行rasterize()并使用getValues()而不是extract()最快。栅格化的速度并不比用于在小多边形中制表栅格值的原始代码快得多，但是当涉及到要裁剪大栅格并提取值的大多边形区域时，栅格化就显得很有意思。我还发现getValues()比extract()函数要快得多。

我还弄清楚了使用foreach()进行多核处理。

我希望这对希望使用R解决方案从多边形叠加层中提取栅格值的其他人有用。这类似于ArcGIS的“制表交叉点”，它对我来说效果不佳，经过数小时的处理后才像该用户一样返回空输出。

#initiate multicore cluster and load packages
library(foreach)
library(doParallel)
library(tcltk)
library(sp)
library(raster)

cores<- 7
cl <- makeCluster(cores, output="") #output should make it spit errors
registerDoParallel(cl)

这里函数：

multicore.tabulate.intersect<- function(cores, polygonlist, rasterlayer){ 
  foreach(i=1:cores, .packages= c("raster","tcltk","foreach"), .combine = rbind) %dopar% {

    mypb <- tkProgressBar(title = "R progress bar", label = "", min = 0, max = length(polygonlist[[i]]), initial = 0, width = 300) 

    foreach(j = 1:length(polygonlist[[i]]), .combine = rbind) %do% {
      final<-data.frame()
      tryCatch({ #not sure if this is necessary now that I'm using foreach, but it is useful for loops.

        single <- polygonlist[[i]][j,] #pull out individual polygon to be tabulated

        dir.create (file.path("c:/rtemp",i,j,single@data$OWNER), showWarnings = FALSE) #creates unique filepath for temp directory
        rasterOptions(tmpdir=file.path("c:/rtemp",i,j, single@data$OWNER))  #sets temp directory - this is important b/c it can fill up a hard drive if you're doing a lot of polygons

        clip1 <- crop(rasterlayer, extent(single)) #crop to extent of polygon
        clip2 <- rasterize(single, clip1, mask=TRUE) #crops to polygon edge & converts to raster
        ext <- getValues(clip2) #much faster than extract
        tab<-table(ext) #tabulates the values of the raster in the polygon

        mat<- as.data.frame(tab)
        final<-cbind(single@data$OWNER,mat) #combines it with the name of the polygon
        unlink(file.path("c:/rtemp",i,j,single@data$OWNER), recursive = TRUE,force = TRUE) #delete temporary files
        setTkProgressBar(mypb, j, title = "number complete", label = j)

      }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) #trycatch error so it doesn't kill the loop

      return(final)
    }  
    #close(mypb) #not sure why but closing the pb while operating causes it to return an empty final dataset... dunno why. 
  }
}

因此要使用它，请调整single@data$OWNER以使其与识别多边形的列名相符（该函数可能已内置猜测）。。）并放入：

myoutput <- multicore.tabulate.intersect(cores, polygonlist, rasterlayer)

关于getValues比提取更快的建议似乎无效，因为如果使用提取，则不必进行裁剪和栅格化（或遮罩）。原始问题中的代码可以同时完成这两项工作，而这大约需要两倍的处理时间。

–罗伯特·希曼斯
16年6月20日在16:48

唯一知道的方法是通过测试。

– djas
16年6月20日在18:09

多边形列表在这里是什么类，多边形列表[[i]] [，j]在这里应该做什么（请给ELI5使用）？我是并行事物的新手，所以我不太了解。我无法返回任何函数，直到我将polygonlist [[i]] [，j]更改为polygonlist [，j]，这似乎合乎逻辑，因为[，j]是SpatialPolygonsDataframe的第j个元素，如果那样的话是正确的课程吗？更改后，我运行了该进程并获得了一些输出，但是肯定仍然有问题。（我尝试提取n个小多边形内的中位数，所以我也更改了一些代码）。

– reima
17年11月9日14:04

@RobertH就我而言，裁剪（和遮罩）使其运行速度提高了约3倍。我使用的是1亿英亩的栅格，而多边形只是其中的一小部分。如果我不裁剪到多边形，则过程运行会慢得多。这是我的结果：clip1 <-crop（rasterlayer，extent（single））> system.time（ext <-extract（clip1，single））＃从裁剪的栅格用户系统中提取的时间已过去65.94 0.37 67.22> system.time（ext < -extract（rasterlayer，single））＃从1亿英亩的栅格用户系统中提取出来的时间是175.00 4.92 181.10

–麦克雷（Luke Macaulay）
18-1-26在21:10

#2 楼

加速从点，XY或多边形中提取栅格（栅格堆栈）

很好的答案卢克。您必须是R向导！这是一个非常小的调整，可以简化您的代码（在某些情况下可能会稍微提高性能）。您可以通过使用cellFromPolygon（或点的cellFromXY）然后裁剪和getValues来避免某些操作。

从栅格堆栈中提取多边形或点数据------------------------

 library(raster)  
 library(sp)   

  # create polygon for extraction
  xys= c(76.27797,28.39791,
        76.30543,28.39761,
        76.30548,28.40236,
        76.27668,28.40489)
  pt <- matrix(xys, ncol=2, byrow=TRUE)
  pt <- SpatialPolygons(list(Polygons(list(Polygon(pt)), ID="a")));
  proj4string(pt) <-"+proj=longlat +datum=WGS84 +ellps=WGS84"
  pt <- spTransform(pt, CRS("+proj=sinu +a=6371007.181 +b=6371007.181 +units=m"))
  ## Create a matrix with random data & use image()
  xy <- matrix(rnorm(4448*4448),4448,4448)
  plot(xy)

  # Turn the matrix into a raster
  NDVI_stack_h24v06 <- raster(xy)
  # Give it lat/lon coords for 36-37°E, 3-2°S
  extent(NDVI_stack_h24v06) <- c(6671703,7783703,2223852,3335852)
  # ... and assign a projection
  projection(NDVI_stack_h24v06) <- CRS("+proj=sinu +a=6371007.181 +b=6371007.181 +units=m")
  plot(NDVI_stack_h24v06)
  # create a stack of the same raster
  NDVI_stack_h24v06 = stack( mget( rep( "NDVI_stack_h24v06" , 500 ) ) )


  # Run functions on list of points
  registerDoParallel(16)
  ptm <- proc.time()
  # grab cell number
  cell = cellFromPolygon(NDVI_stack_h24v06, pt, weights=FALSE)
  # create a raster with only those cells
  r = rasterFromCells(NDVI_stack_h24v06, cell[[1]],values=F)
  result = foreach(i = 1:dim(NDVI_stack_h24v06)[3],.packages='raster',.combine=rbind,.inorder=T) %dopar% {
     #get value and store
     getValues(crop(NDVI_stack_h24v06[[i]],r))
  }
  proc.time() - ptm
  endCluster()

用户系统已使用
16.682 2.610 2.530

  registerDoParallel(16)
  ptm <- proc.time()
  result = foreach(i = 1:dim(NDVI_stack_h24v06)[3],.packages='raster',.inorder=T,.combine=rbind) %dopar% {
        clip1 <- crop(NDVI_stack_h24v06[[i]], extent(pt)) #crop to extent of polygon
        clip2 <- rasterize(pt, clip1, mask=TRUE) #crops to polygon edge & converts to raster
         getValues(clip2) #much faster than extract
  }
  proc.time() - ptm
  endCluster()

用户系统已使用
33.038 3.511 3.288

我运行了这两种方法，在我的用例中，您的方法出现的速度稍慢。

–麦克雷（Luke Macaulay）
18/12/14在21:21

#3 楼

如果覆盖的精度损失不是非常重要-假设开始时很精确-通常可以通过首先将多边形转换为栅格来实现更高的区域计算速度。 raster软件包包含zonal()函数，该函数应该可以很好地完成预期的任务。但是，您始终可以使用标准索引对同一维的两个矩阵进行子集化。如果必须维护多边形并且不介意GIS软件，则在区域统计中QGIS实际上必须比ArcGIS或ENVI-IDL更快。

#4 楼

我也为此挣扎了一段时间，试图在〜1kmx1km的网格中计算〜300mx300m网格图的土地覆盖类别的面积份额。后者是一个多边形文件。我尝试了多核解决方案，但是对于我拥有的网格单元数来说，这仍然太慢了。相反，我：

栅格化1kmx1km的网格，为所有网格单元提供唯一的编号
将gdalUtils包中的allign_rasters（或直接gdalwarp）与r =“ near”选项一起使用将1kmx1km网格的分辨率提高到300mx300m，使用相同的投影等。
使用栅格数据包：stack_file <-stack（lc，grid）堆叠300mx300m土地覆盖图和第2步中的300mx300m网格。
创建一个data.frame来合并地图：df <-as.data.frame（rasterToPoints（stack_file）），它将1kmx1km地图的网格单元号映射到300mx300m的土地覆盖图
使用dplyr计算1kmx1km像元中土地覆盖类像元的份额。
通过将第5步链接到原始1kmx1km的网格，在第5步的基础上创建一个新栅格。

当我在300mx300m的具有超过15个磨机网格单元的土地覆盖图上尝试该过程时，该过程可以非常快速地运行并且在我的PC上没有内存问题。

我想，如果要合并具有不规则形状的多边形文件和栅格数据，上述方法也将起作用。也许，可以通过使用rasterize（光栅可能很慢）或gdal_rasterize将多边形文件直接光栅化为300mx300网格，从而合并步骤1＆2。就我而言，我也需要重新投影，因此我同时使用gdalwarp进行重新投影和分解。

#5 楼

我必须面对同样的问题才能从大型镶嵌图（50k x 50k）中提取多边形内部的值。我的多边形只有4个点。我找到的最快方法是将crop镶嵌到多边形的边界中，将多边形三角化为2个三角形，然后检查三角形中是否有点（我找到的最快算法）。与extract功能相比，运行时间从20 s减少到0.5 s。但是，每个多边形的函数crop仍需要大约2 s。

对不起，我无法提供完整的可复制示例。下面的R代码不包含输入文件。

该方法仅适用于简单的多边形。

par_dsm <- function(i, image_tif_name, field_plys2) {
    library(raster)
    image_tif <- raster(image_tif_name)
    coor <- field_plys2@polygons[[i]]@Polygons[[1]]@coords
    ext <- extent(c(min(coor[,1]), max(coor[,1]), min(coor[,2]), max(coor[,2])))

    extract2 <- function(u, v, us, vs) {
        u1 <- us[2]  - us[1]
        u2 <- us[3]  - us[2]
        u3 <- us[1]  - us[3]
        v1 <- vs[1]  - vs[2]
        v2 <- vs[2]  - vs[3]
        v3 <- vs[3]  - vs[1]
        uv1 <- vs[2] * us[1] - vs[1] * us[2]
        uv2 <- vs[3] * us[2] - vs[2] * us[3]
        uv3 <- vs[1] * us[3] - vs[3] * us[1]

        s1 <- v * u1 + u * v1 + uv1
        s2 <- v * u2 + u * v2 + uv2
        s3 <- v * u3 + u * v3 + uv3
        pos <- s1 * s2 > 0 & s2 * s3 > 0
        pos 
    }

    system.time({
        plot_rect <- crop(image_tif, ext, snap ='out')
        system.time({
        cell_idx <- cellFromXY(plot_rect, coor[seq(1,4),])
        row_idx <- rowFromCell(plot_rect, cell_idx)
        col_idx <- colFromCell(plot_rect, cell_idx)

        rect_idx <- expand.grid(lapply(rev(dim(plot_rect)[1:2]), function(x) seq(length.out = x)))

        pixel_idx1 <- extract2(
            rect_idx[,2], rect_idx[,1], 
            row_idx[c(1,2,3)], col_idx[c(1,2,3)])
        pixel_idx2 <- extract2(
            rect_idx[,2], rect_idx[,1], 
            row_idx[c(1,4,3)], col_idx[c(1,4,3)])
        pixel_idx <- pixel_idx1 | pixel_idx2
        })
    })
    mean(values(plot_rect)[pixel_idx])
}

# field_plys2: An object of polygons
# image_taf_name: file name of mosaic file
library(snowfall)
sfInit(cpus = 14, parallel = TRUE)
system.time(plot_dsm <- sfLapply(
    seq(along = field_plys2), par_dsm, image_tif_name, field_plys2))
sfStop()

编程黑洞网