On 23. Jul 2021, at 14:01,
wikibaseug-request(a)lists.wikimedia.org wrote:
Send Wikibaseug mailing list submissions to
wikibaseug(a)lists.wikimedia.org
To subscribe or unsubscribe, please visit
You can reach the person managing the list at
wikibaseug-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikibaseug digest..."
Today's Topics:
1. Re: Experiences/doubts regarding bulk imports into Wikibase
(Aidan Hogan)
2. Re: Experiences/doubts regarding bulk imports into Wikibase
(Renat Shigapov)
----------------------------------------------------------------------
Message: 1
Date: Thu, 22 Jul 2021 22:07:58 -0400
From: Aidan Hogan <aidhog(a)gmail.com>
Subject: [Wikibase] Re: Experiences/doubts regarding bulk imports into
Wikibase
To: wikibaseug(a)lists.wikimedia.org
Message-ID: <6e9e37d7-f910-97cb-cf86-b95d337439bc(a)gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Indeed, yes! Seeks on a traditional disk are on the order of 10 ms, while
on a solid-state disk they tend to be around 0.1 ms, so one could
expect a 100x speed-up when doing a lot of random accesses. One could
perhaps also memory-map the database files for even lower latency,
assuming the data fit in RAM.
So an SSD should help a lot within the current framework, but it should
not be necessary in general. Plus, an ETL load framework should still be
much more efficient for bulk loads than the current process running on an SSD.
SSDs have lower latency, but they still have latency. One million seeks
on a hard disk take around 3 hours and one million seeks on an SSD take
around 2 minutes, whereas reading one million records (around 100 MB)
sequentially on a standard hard disk should take around one second.
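As a rough sanity check on those figures, here is a back-of-the-envelope
calculation in Python; the ~10 ms seek, ~0.1 ms seek, and ~100 MB/s
sequential throughput are assumed ballpark values, not measurements:

    seeks = 1_000_000
    hdd_seek_s = 10e-3    # ~10 ms per random seek on a spinning disk (assumed)
    ssd_seek_s = 0.1e-3   # ~0.1 ms per random read on an SSD (assumed)

    print(seeks * hdd_seek_s / 3600)  # ~2.8, i.e. roughly 3 hours of HDD seeks
    print(seeks * ssd_seek_s / 60)    # ~1.7, i.e. roughly 2 minutes of SSD seeks
    print(100e6 / 100e6)              # ~1 second to read 100 MB sequentially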
Still I think that trying with an SSD for now is a good idea!
Best,
Aidan
On 2021-07-22 19:02, Laurence Parry wrote:
I wondered earlier if you were trying to use hard disks when talking
about random access time. I'm interested because I may be deploying
Wikibase on HDD in the near future. But I'm only considering that
because I know size and usage will be relatively low; I won't be doing
bulk loads.
The reality is, Wikibase is not designed for HDD, and it doesn't work
well on it. It may work slightly better with hardware RAID, but
realistically any high-speed Wikibase implementation needs to be on SSD.
Wikidata uses several, just for WDQS:
https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware and
https://phabricator.wikimedia.org/T221632 which mentions they are used
in RAID 0. I think the MediaWiki side of things may be even larger,
because more must be stored.
While testing optimal load strategy, Adam notes "SSDs should obviously
be used for storage."
https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
This is in part a limitation of Blazegraph, for which fast disks are the
primary requirement, after which CPU and RAM matter (RAM may matter
more if you have to use HDD, and for reads; but for loading it still has to
save the results to disk):
https://github.com/blazegraph/database/wiki/Hardware_Configuration
https://sourceforge.net/p/bigdata/discussion/676946/thread/5b4acb02/#2810
https://github.com/blazegraph/database/issues/94#issuecomment-399141685
If SSDs are not available to you, it might be worth using a cloud system
to perform the load and taking a compressed dump off it afterwards.
Such systems are not necessarily cheap, but you might only need one for a
relatively brief time. (Oracle is even giving away ARM VMs and SSD
storage for free now. Regrettably, Wikibase does not yet have an arm64
distribution, but I'm not aware of a specific reason why it could not
eventually.)
It might also be worth looking into whether Blazegraph (and MySQL) has
any reduced-durability modes that you can enable during the loading
process. From my own experience I know database tuning can be key when
attempting to use HDDs as a platform. Some ideas for MySQL:
https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_log_at_trx_…
https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#binl…
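For illustration, a minimal sketch of relaxing those two settings for the
duration of a bulk load, assuming a local MariaDB, the pymysql driver, a user
allowed to SET GLOBAL, and placeholder credentials; restore the defaults once
the import is done:

    import pymysql

    # Placeholder connection details; adjust for your setup.
    conn = pymysql.connect(host="localhost", user="root", password="...",
                           autocommit=True)
    with conn.cursor() as cur:
        # Flush the InnoDB log roughly once per second instead of on every commit.
        cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")
        # Let the OS decide when to sync the binary log to disk.
        cur.execute("SET GLOBAL sync_binlog = 0")
    conn.close()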
You can also try 'nobarrier' as a filesystem mount option to disable
write barriers, and 'data=writeback' for ext4. Be aware that these violate
ACID/durability expectations and should only be used for test purposes, or
when you are willing to rebuild the entire database/filesystem if something
goes wrong.
Similarly, RAID 0 might be useful if you bear in mind losing one disk
means losing all the data.
Best regards,
--
Laurence "GreenReaper" Parry
https://GreenReaper.co.uk
------------------------------------------------------------------------
*From:* Aidan Hogan <aidhog(a)gmail.com>
*Sent:* Thursday, July 22, 2021 10:50:06 PM
*To:* wikibaseug(a)lists.wikimedia.org <wikibaseug(a)lists.wikimedia.org>
*Subject:* [Wikibase] Re: Experiences/doubts regarding bulk imports into
Wikibase
Thanks Adam for the detailed blog post!
From your mail, I understand that the current Wikibase framework is
based on continuous updates, and in that sense it can process a high
rate of manual edits without problem. I think though that for high
performance bulk updates, the best option would be an "ETL" style
framework that loads data directly into the underlying databases using
custom tools. What seems to be hurting performance in the bulk load
scenario is:
1) The creation of massive amounts of requests/jobs all at once.
2) Random disk accesses that require about 10ms per go on a conventional
disk if the data are not cached or read sequentially.
These are not a problem per se for the typical running and maintenance
of a Wikibase instance, but rather only occur when one tries to import
millions of items at once. Given that we can assume that an "admin" is
importing the data, we can also bypass a lot of the typical processes
that are run to verify edit permissions, rate limits, etc.
Overall, with the goal of importing a legacy dataset of around 10
million items, with maybe 100 million "values", in one day, on
conventional hardware, these two problems would need to be resolved. The
underlying databases should not have a problem processing data at this
scale/rate using bulk update/transaction methods, but they will have a
problem if each item is updated as a separate transaction; even with
batching, it can still lead to hundreds of thousands of transactions
that would choke up any persistent database running on a traditional
hard disk. So I think that the most performant solution would be to
bypass Wikibase, and to prepare the data (e.g., a user-provided JSON
dump) to be loaded directly into the underlying databases that Wikibase
later queries. The issue with this approach is that it is not so
trivial to understand what data need to be loaded where (this requires a
deep understanding of Wikibase), and that such a bulk loader would need
to be "synchronised" with changes to the main Wikibase software, so it
would require maintenance over the years.
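To make the batching point concrete, here is a minimal, purely illustrative
sketch of the ETL idea with pymysql; the entity_json table and its columns are
invented for the example and are not Wikibase's real schema:

    import pymysql

    def bulk_insert(rows, batch_size=10_000):
        # rows: list of (entity_id, json_blob) tuples prepared from the JSON dump.
        conn = pymysql.connect(host="localhost", user="wiki", password="...",
                               database="wikibase")
        try:
            with conn.cursor() as cur:
                for i in range(0, len(rows), batch_size):
                    # Many rows per statement instead of one API edit per item.
                    cur.executemany(
                        "INSERT INTO entity_json (entity_id, json_blob) VALUES (%s, %s)",
                        rows[i:i + batch_size],
                    )
            conn.commit()  # a single commit for the whole load, not one per item
        finally:
            conn.close()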
Best,
Aidan
On 2021-07-19 17:50, Addshore wrote:
This thread and my earlier email prompted me to write a deep-dive blog
post on exactly what happens when you send a create-item request to
wbeditentity.
You can find the newly published post (and thus probably containing some typos) below:
https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-item/
Hopefully this can prompt some more questions, thoughts, changes etc for
the bulk import case both for RAISE and for the API etc.
Adam / Addshore
Wikidata Wikibase Tech Lead
On Mon, 19 Jul 2021 at 17:45, Addshore <addshorewiki(a)gmail.com> wrote:
Hi all,
Some general thoughts on this thread.
Overall, we were wondering if we are approaching this bulk import in the
right way? It seems that the PHP scripts are not optimised for
performance/scale?
In general I would say that the "right" thing to do is to speed up
the app, rather than try to write directly to the SQL db.
Lots / all of the work currently being done in these maintenance
scripts can be / is already done by a deferred job queue within
Wikibase (if configured).
If this is done, then having one or more separate job runners would
be beneficial, see
https://www.mediawiki.org/wiki/Manual:Job_queue
The other key things could be a fast primary SQL database, and a fast
host (or hosts) serving MediaWiki/Wikibase.
Wikidata.org runs on a very large shared cluster (shared with all
other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
If you are running these maintenance scripts as part of an import,
where performance of the Wikibase does not matter to users, then
using sleep 0 will provide a performance increase.
Something else that might make all of this slightly simpler would be
the ability to have this secondary index work head straight to the
job queue, rather than run as part of the maintenance scripts
themselves?
(I see that runJobs is called as the last step at
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser.py#L172-L175)
Currently it looks like in the building_indexing method in Raise all
items will be loaded from the DB 2 times, and all properties will be
loaded 3 times.
This pattern for large imports is likely to get slower and slower,
and also not take advantage of any caching etc that exists.
The regular API sequence, for example, would 1) store data in the DB,
2) write to any quick secondary stores, 3) add the entity to a shared
cache, and 4) schedule jobs to populate the secondary indexes.
These jobs would run fairly soon after the initial write, leading to
better performance.
> It seems that the PHP scripts are not optimised for performance/scale?
It seems so.

Scale, yes (these scripts ran across the whole of Wikidata.org).
Performance, not so much; these scripts were primarily designed to be
a long-running task between Wikibase updates, not as part of an
import process.
Regarding rebuildItemTerms:

Takes around 2 hours on the small sample (which we could multiply by a
thousand for the full dataset, i.e., 83 days as an estimate).

Indeed, this script will only go so fast: it processes all entities
in series, so you will have the bottleneck of a single process
iterating through all entities.
The job queue comparison again is that jobs get queued and executed by any
number of job queue runners that you wish.
The next bottleneck would then be the speed of your SQL database.
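One way around that single-process bottleneck (given enough database and I/O
headroom) might be to run several copies of the maintenance script over
disjoint entity ID ranges. A rough sketch follows; the --from-id/--to-id/--sleep
flags and the script path are assumptions here and should be checked against the
script's --help before use:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Disjoint ID ranges for a hypothetical 10 million item dataset.
    RANGES = [(1, 2_500_000), (2_500_001, 5_000_000),
              (5_000_001, 7_500_000), (7_500_001, 10_000_000)]

    def rebuild_range(start, end):
        # Flags below are assumptions; verify with the script's --help first.
        subprocess.run(
            ["php", "extensions/Wikibase/repo/maintenance/rebuildItemTerms.php",
             f"--from-id={start}", f"--to-id={end}", "--sleep=0"],
            check=True,
        )

    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        for start, end in RANGES:
            pool.submit(rebuild_range, start, end)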
There is probably lots more to discuss in this thread but I found
navigating it quite hard.
Hopefully the above will prompt some more discussion.
It would be great to be able to chat somewhere more realtime (not an
email thread) on this topic too.
Adam / Addshore
Wikidata Wikibase Tech Lead
On Sat, 17 Jul 2021 at 22:54, Aidan Hogan <aidhog(a)gmail.com> wrote:
Hi Renat,
On 2021-07-16 15:54, Renat Shigapov wrote:
Hi Aidan,
I am on holiday for a few days with limited access to the internet, but my
quick reply to "Do you have further plans for extending RaiseWikibase?"
is yes. I'll try to handle those secondary tables.
Go enjoy your holidays. :)
I can quickly respond to one point:
Regarding (2): can that be done in MariaDB?
Since 10.2.3, MariaDB has support for JSON:
https://mariadb.com/kb/en/json-functions/
I have no experience with handling JSON in a relational database, so I'm not
sure overall whether it's a good solution, but I think it should be
considerably more performant so long as the pages are iterated over,
rather than each one being searched in the index.
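For what it's worth, a minimal sketch of that iteration idea using MariaDB's
JSON functions via pymysql; the entity_json table and its columns are invented
for the example and are not Wikibase's actual schema:

    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="...",
                           database="wikibase")
    with conn.cursor() as cur:
        # Extract the English label from each stored JSON document in one pass.
        cur.execute(
            "SELECT entity_id,"
            " JSON_UNQUOTE(JSON_EXTRACT(json_blob, '$.labels.en.value'))"
            " FROM entity_json"
        )
        for entity_id, label in cur:
            pass  # build secondary-index rows from the extracted values
    conn.close()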
Best,
Aidan
If you can get (3) done, the whole Wikibase community would be very thankful.

Regarding (4): yeah, too radical and maintenance problems.

Sorry for the short answer, I'll add more on my return.
Kind regards,
Renat
------------------------------
Message: 2
Date: Fri, 23 Jul 2021 08:07:17 -0000
From: "Renat Shigapov" <renat.shigapov(a)bib.uni-mannheim.de>
Subject: [Wikibase] Re: Experiences/doubts regarding bulk imports into
Wikibase
To: wikibaseug(a)lists.wikimedia.org
Message-ID:
<162702763752.25702.3885521959340606528(a)lists1001.wikimedia.org>
Content-Type: text/plain; charset="utf-8"
Dear all,
We have the ticket "Improve bulk import via API" at Phabricator now:
. It aims to unite the related tickets and to discuss further development
around bulk import in Wikibase. Your contributions are very welcome.
Kind regards,
Renat
------------------------------
Subject: Digest Footer
_______________________________________________
Wikibaseug mailing list -- wikibaseug(a)lists.wikimedia.org
To unsubscribe send an email to wikibaseug-leave(a)lists.wikimedia.org
------------------------------
End of Wikibaseug Digest, Vol 33, Issue 13
******************************************