[Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Excirial wp.excirial at gmail.com
Sat Aug 28 23:31:48 UTC 2010


> A real time feed wouldn't be a smart idea, neither would only new links.
> New external links are probably the most reliable ones.

After reading this part I am not entirely certain whether you caught my
drift correctly, or whether you missed it somehow. In case you understood
it correctly, apologies for stating it again. What I intended to say was
that we should forward a complete list of all external links to IA, in the
form of the External Links database table. This means that all current
links would be known to IA, and could therefore be checked. Because such a
table would be large to transfer, we could opt to forward only the changes
to it once a day, which would result in a lot less data traffic having to
be sent. In other words: IA would have a complete list of all external
links on Wikipedia, and that list would be updated once a day (removing
all links no longer used, while equally adding the links added that day).
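
To make the once-a-day idea a bit more concrete, here is a minimal sketch
(in Python) of how such a daily delta could be produced. It assumes the
external links table has been exported once a day as a plain text file with
one URL per line; the file names and that format are my own assumptions for
illustration, not an existing Wikimedia export.

# Sketch: compute the daily delta between two exports of the external
# links table. One URL per line is an assumed format, not a real dump.

def load_links(path):
    """Read one URL per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

yesterday = load_links("externallinks-2010-08-27.txt")
today = load_links("externallinks-2010-08-28.txt")

added = today - yesterday      # links IA has not been told about yet
removed = yesterday - today    # links no longer used on any page

with open("externallinks-delta-2010-08-28.txt", "w", encoding="utf-8") as out:
    for url in sorted(added):
        out.write("+" + url + "\n")
    for url in sorted(removed):
        out.write("-" + url + "\n")

IA would then only have to fetch the small delta file each day instead of
the full table.
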
> Ideally, there should be a way to redirect to older versions of a page
> through an internal template to include before any dead links. I think
> that would be the easiest way to implement a change without any technical
> overhaul.

Keep in mind that this partnership suggestion seems to focus on this:
"Greatly increase the odds that anything linked from Wikipedia would also
be in our Archive". The feed towards them is simply a means to flag
"important" pages so that they are crawled more often, or at least crawled
once they are reported, which would increase their chance of being saved
in the archive. How we subsequently handle this stored data is a related,
but still separate concern. Even so, we do have several related templates
already (here: <http://en.wikipedia.org/wiki/Template:Deadlink>).
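
As a rough illustration of the "fall back to an archived copy" idea (and of
what such a template might do behind the scenes), something along these
lines could work - the Wayback Machine serves its most recent snapshot of a
page under https://web.archive.org/web/<original-url>, and the dead-link
check below is only a simplistic stand-in:

# Sketch: if a cited URL no longer responds, point the reader at the
# Wayback Machine copy instead. The HEAD-request check is a simplistic
# stand-in for however a real tool would detect a dead link.

from urllib.request import Request, urlopen
from urllib.error import URLError

def wayback_fallback(url, timeout=10):
    """Return the original URL if it still responds, else a Wayback URL."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            if response.status < 400:
                return url
    except (URLError, OSError):
        pass
    return "https://web.archive.org/web/" + url

print(wayback_fallback("http://example.com/some-dead-citation"))

A template could generate exactly that kind of fallback URL in wikitext,
without any change to MediaWiki itself.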


~Excirial
On Sun, Aug 29, 2010 at 1:00 AM, theo10011 <de10011 at gmail.com> wrote:

> A real time feed wouldn't be a smart idea, neither would only new links.
> New external links are probably the most reliable ones; if they don't work
> today then there's probably no point in preserving them. Link rot is the
> biggest problem here, with external links which might be 5-6 years old or
> more. I suggested DeadURL.com because it re-directs to previous versions
> maintained by other archives after including deadurl.com/ in front of the
> dead link.
>
> Ideally, there should be a way to redirect to older versions of a page
> through an internal template to include before any dead links. I think that
> would be the easiest way to implement a change without any technical
> overhaul.
>
> Theo
> On Sun, Aug 29, 2010 at 3:47 AM, Excirial <wp.excirial at gmail.com> wrote:
>
> > What would it take to produce such a feed?
> > A real-time feed may or may not be the best idea, for several reasons:
> > - One issue is that every edit would have to be examined not only for
> > external links, but for external links that were not present previously.
> > Doing this in real time may cause slowdowns or additional load for the
> > servers - keep in mind that we would have to scan external links on all
> > edits for all Wikipedias; counted together this would result in a very,
> > very busy feed towards IA.
> > - Sometimes added links are spam or otherwise not acceptable, which
> > means they may be removed soon after. In such a case one would prefer
> > not having them archived, since it would be a waste of time and work
> > for IA.
> >
> > An alternate solution could be forwarding a list of new links every day.
> > The Database Layout
> > <http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema.png>
> > for Wikimedia seems to suggest that all external links are stored in a
> > separate table in the database (and I presume this includes links in
> > reference tags). I wonder if it would be possible to dump this entire
> > table for IA, and afterwards send incremental change packages
> > <http://en.wikipedia.org/wiki/Changeset> to them (once a day perhaps?).
> > That way they would always have a list of external links used by
> > Wikipedia, and it would decrease the problems with performance hits,
> > spam and links no longer used. If we only forwarded a feed with NEW
> > links, IA might end up with a long list of links which are removed over
> > time. And above everything - the External Links table is simply a
> > database table, which should be incredibly easy to read and process for
> > IA, without custom coding required to read and store a feed.
> >
> > But perhaps the people at the tech mailing list have another / better
> > idea on how this should work :)
> >
> > ~Excirial
> >
> >
> >
> > On Sat, Aug 28, 2010 at 9:48 AM, Samuel Klein <meta.sj at gmail.com> wrote:
> >
> > > Gordon @ IA was most friendly and helpful.  archive-it is a
> > > subscription service for focused collections of sites; he had a
> > > different idea better suited to our work.
> > >
> > > Gordon writes:
> > > > Now, given the importance of Wikipedia and the editorial
> > > > significance of the things it outlinks to, perhaps we could set up
> > > > something specially focused on its content (and the de facto stream
> > > > of newly-occurring outlinks), that would require no conscious effort
> > > > by editors but greatly increase the odds that anything linked from
> > > > Wikipedia would (a few months down the line) also be in our Archive.
> > > > Is there (or could there be) a feed of all outlinks that IA could
> > > > crawl almost nonstop?
> > >
> > > That sounds excellent to me, if possible (and I think close to what
> > > emijrp had in mind!)  What would it take to produce such a feed?
> > >
> > > SJ
> > >
> > > PS - An aside: IA's policies include taking down any links on request,
> > > so this would not be a foolproof archive, but a 99% one.
> > >
> > >
> > > On Tue, Aug 24, 2010 at 9:13 PM, Samuel Klein <meta.sj at gmail.com> wrote:
> > > > I've asked Gordon Mohr @ IA about how to work with archive-it.  I
> > > > will cc: this thread on any response.
> > > >
> > > > SJ
> > > >
> > > > On Tue, Aug 24, 2010 at 8:56 PM, George Herbert
> > > > <george.herbert at gmail.com> wrote:
> > > >> On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein
> > > >> <meta.sj at gmail.com> wrote:
> > > >>> Here's the Archive's on-demand service:
> > > >>>
> > > >>> http://archive-it.org
> > > >>>
> > > >>> That would be the most reliable way to set up the partnership
> > > >>> emijrp proposes.  And it's certainly a good idea.  Figuring out
> > > >>> how to make it work for almost all editors and make it spam-proof
> > > >>> may be interesting.
> > > >>>
> > > >>> SJ
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge
> > > >>> <saintonge at telus.net> wrote:
> > > >>>> David Gerard wrote:
> > > >>>>> On 24 August 2010 14:57, emijrp <emijrp at gmail.com> wrote:
> > > >>>>>
> > > >>>>>> I want to make a proposal about external links preservation.
> > > >>>>>> Many times, when you check an external link or a reference
> > > >>>>>> link, the website is dead or offline. These websites are
> > > >>>>>> important, because they are the sources for the facts shown in
> > > >>>>>> the articles. Internet Archive searches for interesting
> > > >>>>>> websites to save on their hard disks, so we can send them our
> > > >>>>>> external links sql tables (all projects and languages of
> > > >>>>>> course). They improve their database and we always have a copy
> > > >>>>>> of the source text to check when needed. I think that this
> > > >>>>>> could be a cool partnership.
> > > >>>>>>
> > > >>>>> +1
> > > >>>>>
> > > >>>>>
> > > >>>> Are people who clean up dead links taking the time to check
> > > >>>> Internet Archive to see if the page in question is there?
> > > >>>>
> > > >>>>
> > > >>>> Ec
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Samuel Klein          identi.ca:sj           w:user:sj
> > > >>>
> > > >>
> > > >>
> > > >> I actually proposed some form of Wikimedia / IArchive link
> > > >> collaboration some years ago to a friend who worked there at the
> > > >> time; however, they left shortly afterwards.
> > > >>
> > > >> I like SJ's particular idea.  Who has current contacts with Brewster
> > > >> Kahle or someone else over there?
> > > >>
> > > >>
> > > >> --
> > > >> -george william herbert
> > > >> george.herbert at gmail.com
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Samuel Klein          identi.ca:sj           w:user:sj
> > > >
> > >
> > >
> > >
> > > --
> > > Samuel Klein          identi.ca:sj           w:user:sj
> > >
> _______________________________________________
> foundation-l mailing list
> foundation-l at lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>

