Performance improvement of MySQL inserts in Python using batching

In this blog entry we will see how much of a performance improvement we can get by using batching for inserts into MySQL 5.7 from a Python 3 client. In this example the number of rows to be inserted is around 140K, and the JSON blob for this data is around 25MB in size. First we will insert the rows one by one. Note that even in this case, I am inserting all of them and doing a single commit at the end. The default behavior of the mysql-connector library is that autocommit is false, which is good for performance; if we committed after each insert, the performance would be even worse. Below are the code and performance details.
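The original listing did not survive here, so below is a minimal sketch of the row-by-row version. The table name, column names, connection details, and JSON field names are all assumptions for illustration, not the actual schema from my application.

import json
import time

import mysql.connector

# Hypothetical connection details and schema -- placeholders for illustration.
conn = mysql.connector.connect(
    host="127.0.0.1", user="user", password="secret", database="auctions")
cursor = conn.cursor()

with open("auctions.json") as f:
    auctions = json.load(f)

start = time.time()
rows_affected = 0
for a in auctions:
    # One complete INSERT statement is built per row using Python string
    # formatting (%d/%s), including the ON DUPLICATE KEY UPDATE clause.
    query = ("INSERT INTO auction (auction_id, item, bid) "
             "VALUES (%d, '%s', %d) "
             "ON DUPLICATE KEY UPDATE bid = %d"
             % (a["id"], a["item"], a["bid"], a["bid"]))
    cursor.execute(query)
    rows_affected += cursor.rowcount
# autocommit is off by default, so a single commit covers every insert above.
conn.commit()

print("Number of auctions: ", len(auctions))
print("Number of rows affected :", rows_affected)
print("Time taken for insert:  %.2f s" % (time.time() - start))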
Number of auctions:  139770
Number of rows affected : 139770
Time taken for insert:  34.17 s
For my use case, around 30 seconds of insert time for the bulk update was acceptable, but I thought I would try out the batch insert anyway. I ran into a problem with MySQL's default maximum packet size for execute statements, and had to raise it by adding the max_allowed_packet setting to the [mysqld] section of my.cnf. I set it to 32M based on my bulk-update JSON data blob size of around 25MB. Below are the error and the config change.
  File "/usr/local/lib/python3.6/site-packages/mysql/connector/connection.py", line 262, in _send_cmd
    packet_number, compressed_packet_number)
  File "/usr/local/lib/python3.6/site-packages/mysql/connector/network.py", line 143, in send_plain
    errno=2055, values=(self.get_address(), _strioerror(err)))
mysql.connector.errors.OperationalError: 2055: Lost connection to MySQL server at '127.0.0.1:3306', system error: 32 Broken pipe
[mysqld]
max_allowed_packet=32M
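After restarting mysqld, the new limit can be verified from the client. A quick check, assuming the same cursor as in the sketch above:

cursor.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
print(cursor.fetchone())  # e.g. ('max_allowed_packet', '33554432'), i.e. 32M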
I changed the code to use executemany, and while doing so I had to make some other changes to the query formatting as well. Since all the variables are now passed as parameters to the function, the placeholders have all been changed to %s; earlier they were %s or %d, because I was using Python string formatting to build a single complete insert statement per row. Another thing to notice is that the ON DUPLICATE KEY UPDATE clause no longer takes placeholders, but instead reuses the values already provided for the insert via VALUES(). Below are the code and performance details.
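Again the original listing is missing, so this is a sketch of the batched version under the same assumed schema, reusing conn, cursor, and auctions from the first sketch. Note that mysql-connector rewrites an executemany over an INSERT ... VALUES statement into a single multi-row insert, which is why the ~25MB of data tripped the packet limit above.

start = time.time()
query = ("INSERT INTO auction (auction_id, item, bid) "
         "VALUES (%s, %s, %s) "
         "ON DUPLICATE KEY UPDATE bid = VALUES(bid)")
# All values are now passed as parameters, so every placeholder is %s,
# and the update clause reuses the inserted values via VALUES().
params = [(a["id"], a["item"], a["bid"]) for a in auctions]
cursor.executemany(query, params)
conn.commit()

print("Number of auctions: ", len(auctions))
print("Number of rows affected :", cursor.rowcount)
print("Time taken for insert:  %.2f s" % (time.time() - start))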
Number of auctions:  139770
Number of rows affected : 139770
Time taken for insert:  9.78 s
As we can see, using batching for bulk inserts gives us roughly a 3.5x performance improvement (34.17s down to 9.78s). Please note that these performance numbers are for the full use case of the update, including JSON parsing. The numbers for just the insert would be slightly different, but I expect them to be in the same ballpark.
