BigQuery generally does a good job of loading Avro data, but "bq load" is having a lot of trouble with timestamps and other date/time fields that use the Avro logicalType attribute.


My data with the Avro logical type timestamp-millis gets mangled: BigQuery TIMESTAMP interprets the values as microsecond timestamps, so they come out off by a factor of 1,000.

A timestamp-micros integer that loads fine into TIMESTAMP becomes INVALID in a BigQuery DATETIME column. I can't find an explanation of what would be valid at https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

Strings in ISO 8601 format can't be loaded into TIMESTAMP or DATETIME (an "Incompatible types" error), although I believe BigQuery would accept them if I were loading plain JSON.

The Avro "date" type also fails to load into DATE (again "Incompatible types").
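
For reference, an Avro schema exercising these logical types would look something like this (the record and field names are just illustrative):

{
  "type" : "record",
  "name" : "Example",
  "fields" : [
    {"name": "ts_millis",  "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "ts_micros",  "type": {"type": "long", "logicalType": "timestamp-micros"}},
    {"name": "day",        "type": {"type": "int",  "logicalType": "date"}},
    {"name": "iso_string", "type": "string"}
  ]
}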

I guess I could work around these problems by always loading the data into temporary fields and using queries to CAST or transform them into additional fields, but that doesn't scale, doesn't support schema evolution, and doesn't work nicely with streaming. Producing data in Avro with well-defined schemas is supposed to avoid that extra step of transforming the data again for different consumers.

Is BigQuery really this incompatible with Avro dates and times? (Or am I doing something dumb?)

Or is "bq load" the problem here? Is there a better way to load Avro data?

Comments

Having a sample file would help debug this problem (if it's a bug, post it here: code.google.com/p/google-bigquery/issues/list).

I don't know whether it's a bug or BigQuery simply doesn't support loading Avro data into the TIMESTAMP, DATETIME, and DATE data types. cloud.google.com/bigquery/data-formats#avro_format doesn't mention any of those three types. I'm not sure how to attach the binary Avro data file here.

This feature is now supported; follow issuetracker.google.com/35905894 for more information.

Answer #1

Native support for Avro logical types is now publicly available to all BigQuery users. Please refer to the documentation page for more details: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types
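
For example, with the bq CLI this behavior is enabled via the --use_avro_logical_types flag (the dataset, table, and bucket names below are placeholders):

bq load --source_format=AVRO --use_avro_logical_types \
    mydataset.mytable gs://mybucket/data_with_logical_types.avro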

Comments


Update: BigQuery now supports logical types. Please refer to Victor's answer below.

– Hua Zhang
Nov 13 '18 at 19:56

Answer #2

Update: This feature is now supported; follow issuetracker.google.com/35905894 for more information.

As Hua said, Avro logical types are not supported in BigQuery, but the supported way to load Avro data with timestamps is to use the plain Avro LONG type to load the data into an existing BigQuery table that has a TIMESTAMP column. The value should be microseconds (not seconds or milliseconds) since the epoch. For example, the Avro file below has a LONG field with the value 1408452095000000, which represents "2014-08-19 12:41:35".

The Avro file's schema:

% avro-tools getschema ~/dataset/simple_timestamp.avro
{
  "type" : "record",
  "name" : "FullName",
  "fields" : [ {
    "name" : "t",
    "type" : "long"
  } ]
}


Example of loading an Avro file into a table with a TIMESTAMP field:

bq mk --schema t:TIMESTAMP -t vimota.simple_timestamp
bq load --source_format=AVRO vimota.simple_timestamp ~/dataset/simple_timestamp.avro
bq head vimota.simple_timestamp:

+---------------------+
|          t          |
+---------------------+
| 2014-08-19 12:41:35 |
+---------------------+
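
In case it helps with testing, one way to produce a small Avro file like this is avro-tools fromjson (assuming the record schema above is saved as simple_timestamp.avsc; the file names here are illustrative):

% echo '{"t": 1408452095000000}' > record.json
% avro-tools fromjson --schema-file simple_timestamp.avsc record.json > simple_timestamp.avro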


Comments


Do you have any idea how to load Avro data (a date) into BigQuery with the DATE type? I tried using an Avro string, but it failed. For example, 2018-11-29.

– Sugimiyanto suma
Nov 29 '18 at 6:41

@Sugimiyantosuma Can you use Avro's date logical type (avro.apache.org/docs/1.8.0/spec.html#Date)? This is now natively supported in BigQuery if you're whitelisted or use the useAvroLogicalTypes flag: issuetracker.google.com/35905894#comment38.

– Victor Mota
Dec 3 '18 at 18:13
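
A sketch of what that comment describes for the DATE case (the field name is made up): an int field annotated with the date logical type, holding days since the Unix epoch,

{"name": "d", "type": {"type": "int", "logicalType": "date"}}

loaded with the --use_avro_logical_types flag shown in the answer above.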



Answer #3

I had data with TIMESTAMP columns in a PostgreSQL table. After following the advice in the comments at https://github.com/spotify/spark-bigquery/issues/19, I was able to import it into BigQuery via Avro.

Using the PostgreSQL JDBC library from Kotlin, I recalculated the timestamps into BigQuery's internal format (microseconds since the Unix epoch) and put them into an Avro record field of type Schema.Type.LONG.

Then I created a schema file for the data in JSON, in which I declared the column type as "timestamp":

[
  {"name": "job", "type": "string", "mode": "required"},
  ...
  {"name": "began", "type": "timestamp", "mode": "required"},
  ...
]

(see the began field)

The conversion itself was simply:

(object as java.sql.Timestamp).time * 1000

Finally, I imported the Avro file into BigQuery, and the result was:

$ bq head test.test2
+------+----+----------+---------------------+---------+-----------+
| job  | id | duration |        began        | status  | node_name |
+------+----+----------+---------------------+---------+-----------+
| job1 |  1 |      0.0 | 2012-04-01 00:00:00 | aStatus | aNodeName |
| job2 |  1 |      0.0 | 2020-02-02 00:02:02 | aStatus | aNodeName |
+------+----+----------+---------------------+---------+-----------+

The Web UI doesn't allow you to specify a schema for Avro files, but the CLI client and the API do.
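
The CLI invocation for that is along these lines (per the note above; the schema file and Avro file names here are illustrative):

bq load --source_format=AVRO --schema=./schema.json test.test2 ./export.avro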

The only problem I ran into was with time zones, but that's not an Avro issue.