It would be interesting to determine whether the 'Barack Obama' article is an outlier.  It could be that the Simple English Wikipedia has a larger ratio of text to markup, and thus the Parsoid output is comparatively chunkier; "Barack Obama" may already contain a large amount of wikitext markup, so the Parsoid output isn't as large by comparison.  If I get some free time I'll run some experiments to figure out what's going on.
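Something along these lines would show the text-to-markup ratio for each output (just a sketch; the file names are placeholders for locally saved copies of the Parsoid and PHP parser HTML):

    #!/usr/bin/env python3
    # Rough sketch: how much of each HTML file is text content vs. markup.
    # The file names below are placeholders for locally saved parser output.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.text_bytes = 0
        def handle_data(self, data):
            self.text_bytes += len(data.encode('utf-8'))

    for name in ['obama-parsoid.html', 'obama-PHP.html']:
        raw = open(name, 'rb').read()
        extractor = TextExtractor()
        extractor.feed(raw.decode('utf-8'))
        ratio = extractor.text_bytes / float(len(raw))
        print('%-25s total %7d bytes, text %7d bytes (%.0f%% text)'
              % (name, len(raw), extractor.text_bytes, 100 * ratio))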
 --scott

On Tue, Feb 19, 2013 at 7:35 PM, Gabriel Wicke <gwicke@wikimedia.org> wrote:
On 02/19/2013 03:52 PM, C. Scott Ananian wrote:
> So there's currently a 10x expansion in the uncompressed size, but only
> 3-4x expansion with compression.

My last test after https://gerrit.wikimedia.org/r/#/c/49185/ was merged
showed a size factor of about 2 (relative to the PHP parser output) after
gzip compression for a large article:

259K obama-parsoid-old.html.gz
255K obama-parsoid-adaptive-attribute-quoting.html.gz
135K obama-PHP.html.gz
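
That factor is just the ratio of the compressed sizes; a minimal script
like the following reproduces the comparison, assuming the uncompressed
HTML files are saved locally under these illustrative names:

    # Minimal sketch: gzip-compress the two HTML outputs in memory and
    # report the size ratio. File names are assumptions for local copies.
    import gzip

    def gzipped_size(path):
        with open(path, 'rb') as f:
            return len(gzip.compress(f.read()))

    parsoid = gzipped_size('obama-parsoid.html')
    php = gzipped_size('obama-PHP.html')
    print('parsoid: %d bytes, PHP: %d bytes, factor: %.2f'
          % (parsoid, php, parsoid / float(php)))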

We currently store all round-trip information (plus some debug info) in
the DOM, but plan to move most of this information out of it. The
information is private in any case, so there is no reason to send it out
along with the DOM. We might keep some UID attributes to aid node
identification, but it would also be possible to use subtree hashes, as
in the XyDiff algorithm, to help with that.
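
For illustration only, here is a rough sketch of the subtree-hash idea
(not the actual XyDiff algorithm): each element's hash covers its tag,
attributes, text and the hashes of its children, so identical subtrees
hash identically and can be matched between two DOM versions without
stored UIDs.

    # Rough illustration of subtree hashing for node identification.
    import hashlib
    import xml.etree.ElementTree as ET

    def subtree_hash(elem):
        h = hashlib.sha1()
        h.update(elem.tag.encode('utf-8'))
        for key in sorted(elem.attrib):
            h.update(('%s=%s' % (key, elem.attrib[key])).encode('utf-8'))
        h.update((elem.text or '').encode('utf-8'))
        for child in elem:
            h.update(subtree_hash(child).encode('utf-8'))
            h.update((child.tail or '').encode('utf-8'))
        return h.hexdigest()

    doc = ET.fromstring('<body><p id="x">Hello <b>world</b></p></body>')
    for elem in doc.iter():
        print(subtree_hash(elem)[:12], elem.tag)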

In the end, the resulting DOM will likely still be slightly larger than
the PHP parser's output as it contains more information, in particular
about templates.

Gabriel