> We should address automatic duplicate cleaning very soon, as Christian warned
> a while ago. He manually cleaned up duplicates a few times but we know it's a
> problem that needs solving.
Duplicates are already cleaned up in the refined table. There should never be
any duplicates in the wmf.webrequest table.
https://gerrit.wikimedia.org/r/#/c/177522/
Seeing as this was merged on Jan 26, it is possible that it had not yet been
deployed on Jan 27, when Oliver noticed duplicates.
We should be calculating a per-host arithmetic series over the sequence numbers
when data is loaded.
Please see the wmf_raw.webrequest_sequence_stats tables, for hourly partition statistics,
including duplicates and losses.
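
As a rough illustration of the idea (a sketch only, not the actual job that fills those tables): assuming each host emits sequence numbers that increase by 1 per request, the expected row count for a partition is max - min + 1, and duplicates and losses fall out of comparing that with the actual and distinct counts. The function name, the (hostname, sequence) input shape, and the output field names below are my own placeholders, not the real table schema.

```python
from collections import defaultdict


def sequence_stats(rows):
    """Per-host duplicate/loss statistics for one partition.

    rows: iterable of (hostname, sequence) pairs (hypothetical input shape).
    Assumes each host's sequence numbers increase by 1 per request.
    """
    seqs_by_host = defaultdict(list)
    for host, seq in rows:
        seqs_by_host[host].append(seq)

    stats = {}
    for host, seqs in seqs_by_host.items():
        distinct = len(set(seqs))
        # Number of terms in the arithmetic series min, min+1, ..., max.
        expected = max(seqs) - min(seqs) + 1
        stats[host] = {
            "count_actual": len(seqs),
            "count_expected": expected,
            "count_duplicate": len(seqs) - distinct,
            "count_lost": expected - distinct,
        }
    return stats


if __name__ == "__main__":
    sample = [("cp1", 1), ("cp1", 2), ("cp1", 2), ("cp1", 5),
              ("cp2", 10), ("cp2", 11)]
    for host, s in sorted(sequence_stats(sample).items()):
        print(host, s)
```

Note that this only sees losses between the first and last sequence number inside the partition; rows missing at the partition edges would need the neighbouring partitions to detect.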
-Ao
> On Feb 23, 2015, at 09:01, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
>
> We should address automatic duplicate cleaning very soon, as Christian warned
> a while ago. He manually cleaned up duplicates a few times but we know it's a
> problem that needs solving.
>
> On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner <christian(a)quelltextlich.at> wrote:
> Hi Oliver,
>
> On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> > And, an additional point; I don't understand why, if dupes is the
> > problem, the Hive query was not hit as badly by this as the equivalent
> > UDF.
>
> just shooting in the dark, since you did not provide your query, but
> if you by accident had been querying the
>
> wmf_raw.webrequest
>
> (database name ending in “_raw”) table instead of
>
> wmf.webrequest
>
> (no “_raw” in the database name), the difference you described would
> be plausible (and given the patching of GHOST, they'd even be
> expected).
>
>
> Have fun,
> Christian
>
>
>
> --
> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
> Companies' registry: 360296y in Linz
> Christian Aistleitner
> Kefermarkterstrasze 6a/3     Email:    christian(a)quelltextlich.at
> 4293 Gutau, Austria          Phone:    +43 7946 / 20 5 81
>                              Fax:      +43 7946 / 20 5 81
>                              Homepage: http://quelltextlich.at/
> ---------------------------------------------------------------
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>