Deduplicating Firebase events in BigQuery - best practices?

Ves*_*nen 8 firebase google-bigquery firebase-analytics

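About 1-2% of the Firebase Analytics events exported to BigQuery appear to be duplicates. What is the best practice for removing them?

At the moment the client does not send a per-session counter with each event. Such a counter would give an unambiguous way to remove duplicate events, so I suggest that Firebase implement it. In the meantime, what is a good way to remove the duplicates? Look at the user_pseudo_id, event_timestamp, and event_name fields and keep only one row for each identical triple?

How does the event_bundle_sequence_id field work? Do duplicates have the same value in this field or different values? In other words, are duplicated events sent in the same bundle or in different bundles?

Does Firebase plan to remove these duplicates earlier in processing, either for Firebase Analytics itself or at export to BigQuery?

Standard SQL for checking one day of events for duplicates: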

WITH n_dups AS (
  -- Extra copies per (event_name, event_timestamp, user_pseudo_id) triple
  SELECT
    event_name,
    event_timestamp,
    user_pseudo_id,
    COUNT(1) - 1 AS n_duplicates
  FROM `project.dataset.events_20190610`
  GROUP BY event_name, event_timestamp, user_pseudo_id
)
-- Histogram: how many triples have 0, 1, 2, ... extra copies
SELECT n_duplicates, COUNT(1) AS n_cases
FROM n_dups
GROUP BY n_duplicates
ORDER BY n_cases DESC
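
To make the logic of that check easier to follow outside BigQuery, here is a minimal Python sketch of the same two-step count. The sample rows and the helper name `duplicate_histogram` are hypothetical, made up for illustration; only the grouping idea comes from the query.

```python
from collections import Counter

# Hypothetical sample rows: (event_name, event_timestamp, user_pseudo_id).
events = [
    ("login", 1666529002225262, "938642951.1666427135"),
    ("login", 1666529002225262, "938642951.1666427135"),  # same triple again
    ("logout", 1666529009000000, "938642951.1666427135"),
    ("app_launch", 1666529010000000, "111111111.2222222222"),
]


def duplicate_histogram(rows):
    """Map n_duplicates -> n_cases, mirroring the outer SELECT of the query."""
    per_key = Counter(rows)  # number of rows per identical triple
    # Subtract 1 so a triple seen once counts as 0 duplicates.
    return Counter(n - 1 for n in per_key.values())


histogram = duplicate_histogram(events)
for n_duplicates, n_cases in histogram.most_common():
    print(n_duplicates, n_cases)
```

A triple that appears once lands in the 0 bucket, so the share of rows outside that bucket gives a rough duplicate rate, assuming that identical triples really are duplicates and not distinct events logged in the same microsecond.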

Mak*_*kyi 1

We use the QUALIFY clause in BigQuery to deduplicate Firebase events:

SELECT
  *
FROM
  `project.dataset.events_*`
QUALIFY
  ROW_NUMBER() OVER (
    PARTITION BY
      user_pseudo_id,
      event_name,
      event_timestamp,
      TO_JSON_STRING(event_params)
    ) = 1
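
As a rough illustration of what the query above does, the following Python sketch keeps one row per combination of user_pseudo_id, event_name, event_timestamp, and serialized event_params. The sample rows and the function name `deduplicate` are hypothetical, and `json.dumps` stands in for TO_JSON_STRING only as an approximation. One difference to keep in mind: the sketch always keeps the first row it sees, whereas the query has no ORDER BY in its window, so which of the identical rows survives there is not specified.

```python
import json


def deduplicate(rows):
    """Keep the first row seen for each key, similar to ROW_NUMBER() ... = 1."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            row["user_pseudo_id"],
            row["event_name"],
            row["event_timestamp"],
            # Serialize the repeated record so it can be part of a hashable key.
            json.dumps(row["event_params"], sort_keys=True),
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data shaped loosely like exported events.
base = {
    "user_pseudo_id": "938642951.1666427135",
    "event_name": "login",
    "event_timestamp": 1666529002225262,
}
rows = [
    {**base, "event_params": [{"key": "method", "value": "email"}]},
    {**base, "event_params": [{"key": "method", "value": "email"}]},   # exact duplicate
    {**base, "event_params": [{"key": "method", "value": "google"}]},  # params differ
]

deduped = deduplicate(rows)
print(len(deduped))
```

Including event_params in the key is the conservative choice: two events that share user, name, and timestamp but carry different parameters are both kept.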

Columns used in the QUALIFY partition:

  - name: user_pseudo_id
    description:  Autogenerated pseudonymous ID for the user -
                  Unique identifier for a specific installation of application on a client device,
                  e.g. "938642951.1666427135".
                  All events generated by that device will be tagged with this pseudonymous ID,
                  so that you can relate events from the same user together.

  - name: event_name
    description:  Event name, e.g. "app_launch", "session_start", "login", "logout" etc.

  - name: event_timestamp
    description:  The time (in microseconds, UTC) at which the event was logged on the client,
                  e.g. "1666529002225262".

  - name: event_params
    description:  A repeated record (ARRAY) of the parameters associated with this event.