[Wiktionary-l] character encoding wrong for some pages

29 Jul 2011

Hi,

I've been pulling down pages from Wiktionary in a Java application. The
majority of pages seem to work fine (e.g. http://en.wiktionary.org//wiki/-a).
I can load them in Java, and if I wget them, I end up with a file containing
what I'd expect.

However, some pages seem not to work (e.g.
http://en.wiktionary.org/wiki/absolute_instrument). In Java, I get a codec
exception, and when using wget, the downloaded file is garbled. I think this
is because, although they claim to be UTF-8 encoded, they are not.
These pages show up fine in my browser, but it isn't telling me what charset
it uses to decode the text.
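
To make the "claims to be UTF-8 but is not" suspicion concrete, here is a
minimal sketch of a strict UTF-8 check. It is only an illustration, not the
code from the application described above: the class name Utf8Check and the
method isValidUtf8 are hypothetical, and it assumes the raw response bytes
have already been read into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Check {

    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw on bad input instead of silently
        // substituting the replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample inputs made up for this sketch, not real page content.
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] bad = new byte[] { 'c', 'a', 'f', (byte) 0xE9 }; // Latin-1 "e acute"
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Running a check like this over the downloaded bytes of a failing page would
at least show whether the body really is invalid UTF-8, as opposed to the
problem lying somewhere in how the text is read afterwards.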

Is this a known issue? Is there any workaround for this? Can it be fixed
server-side?

Thanks,

Matthew

-- 
Dr Matthew Pocock
Visitor, School of Computing Science, Newcastle University
mailto: turingatemyhamster(a)gmail.com
gchat: turingatemyhamster(a)gmail.com
msn: matthew_pocock(a)yahoo.co.uk
irc.freenode.net: drdozer
tel: (0191) 2566550
mob: +447535664143
