Quick Network Analysis to Avoid Throwing Your Computer Against the Wall!

Where to start?  The hardest part of writing this post was creating the title.  It should be simple, right?  The problem is that there are too many possible titles like:

  • How NOT to Load Data to a Cloud Database
  • Using Wireshark to Troubleshoot Network Performance
  • If it’s Not Broke, Don’t Fix It
  • MySQL Table Import Wizard Architecture — What were they thinking?

Why the dilemma, you ask?

All I wanted to do was load an Excel-exported file with about 466k rows into a database on a cloud server. This has probably been done a gazillion times all over the world by now, and my fear is that many of you fall victim to the "it's working so it must be right" mentality and have suffered through what I am going to describe. This post is not about uploading data to a cloud database or how MySQL table imports work. Rather, it is about how much better your IT life will be if you just learn to look under the hood at how things work, and then learn when to adjust.

So, the chosen title is “Quick Network Analysis to Avoid Throwing Your Computer Against the Wall!”

So, to rewind. I had a CSV (comma-separated values) file, exported from Excel, with about 466 thousand rows of data. I wanted to load this from my laptop to a hosted cloud server.

My target table structure was simple:
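Roughly this shape, sketched as DDL (the column names here are hypothetical stand-ins for illustration; the real table has 13 columns):

CREATE TABLE dcHMIMessages (
    msg_id    INT,           -- hypothetical
    station   VARCHAR(32),   -- hypothetical
    msg_code  VARCHAR(16),   -- hypothetical (holds values like '3018')
    msg_text  VARCHAR(255),  -- hypothetical
    msg_time  DATETIME       -- the data clearly includes a datetime
    -- ...plus the remaining columns, 13 in total
);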

My exported file from another server would map perfectly to this, so I decided to just use the Table Data Import Wizard in MySQL Workbench. In Workbench, find the table, right-click, and select Table Data Import Wizard.

Answer a few questions about encoding format and CSV-to-table mapping, and turn it loose. I thought to myself, "466k rows, a few minutes, time for coffee".

After a little water cooler talk and a new cup of coffee, I returned to my desk and, much to my displeasure, saw the Table Data Import Wizard still running. Hmm. So, like any normal DBA would do, I decided to check the row count.

select count(*) from dcHMIMessages;

A whopping 119 rows! 119!?! That is about 0.025% of the total in a few minutes. A few more selects and I came to the sad conclusion that I was inserting about one row of data per second. At this blistering pace, my import would be done in 5.39 days.
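For anyone checking my math, the back-of-the-envelope version (in SQL, just for fun):

select 119 / 466000 * 100;   -- ≈ 0.025% of the rows loaded so far
select 466000 / 86400;       -- ≈ 5.39 days at one row per second (86,400 seconds per day)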

Ok, remove the lazy, GUI-using, coffee-drinking developer hat and put on my trusty Forensic hat (available for purchase!) to see what is going on.

And yes, that is a John Wayne bobble-head. If a grown man was going to own a bobble-head, who else would be better than The Duke?

So, with my Forensic hat on, I picked my tool of choice for troubleshooting data performance across a network: Wireshark.

A quick sample of the network data explains why it is such a snoozer…and reveals how I entertained the title “MySQL Table Import Wizard Architecture — What were they thinking?”

Note: we have network classes that teach you how to use Wireshark; this post just shows some of the benefit of using it. Hit us up on our contact page if you are interested, or purchase a PDF here to get the scoop on using Wireshark.

So what does Wireshark show me about the performance?

The key columns to notice in the above image are:

  • Delta – the time since the previously captured packet (Wireshark can show the delta from the previous displayed packet instead, but in this case it is from the previous captured packet). What does this tell me? It takes my computer about 0.000# seconds to respond with the next packet, and it takes the MySQL server about 0.0# seconds to process. Not bad.
  • Length – the length of my packet. These are extremely small compared to what I would expect. When you look at the table structure (first image in the article), a full row is much longer than 65 bytes, especially when you remember that the packet length also includes all of the packet headers. The IP header alone is 20 bytes.
  • Bytes in flight – the amount of unacknowledged data flowing between the two ports in this conversation. In terms of flow control, this is what I am passing back and forth. Seems a bit small, don't you think?

This is the ah-ha moment of the troubleshooting: we can finally see why this is such a snoozer.

My computer is IP address 192.168.119. When you look at packets 1083, 1085, 1087, and 1089, you can see the length of data that my computer is sending to the destination (5 bytes, 22 bytes, 5 bytes, and 33 bytes, respectively). On a network that can handle up to 1,460 bytes of data per packet, this is like moving a pool of water with a teaspoon.

Packets 1082, 1084, 1086, and 1088 are all 11 bytes in flight and are just MySQL acknowledging my data.
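Laid out plainly (reconstructed from the packet numbers and sizes above), the conversation repeats this pattern over and over:

1082   server -> laptop   11 bytes
1083   laptop -> server    5 bytes
1084   server -> laptop   11 bytes
1085   laptop -> server   22 bytes
1086   server -> laptop   11 bytes
1087   laptop -> server    5 bytes
1088   server -> laptop   11 bytes
1089   laptop -> server   33 bytes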

So a quick status check shows that my laptop's responses and the MySQL server's responses are fairly quick, and that the issue is that my data throughput is minuscule. Time to take a peek into the packets to see what is happening. First, let's look at a snippet of the CSV.

Since I did not write the MySQL Table Data Import Wizard, I have no idea how it works. As a developer, though, my basic assumption is that it would read a large chunk of the CSV, send the data to the server, and then parse the data by looking for commas and newlines. Without doing much ciphering, we can see that when I send 5, 22, or 33 bytes of data, we are clearly not sending a large chunk. My Forensic sense tells me something is not quite right.
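One reasonable way for a client to batch that work would be a multi-row INSERT, hundreds of rows per round trip. A hypothetical example (invented column names and values, since the real ones come from the CSV):

INSERT INTO dcHMIMessages (msg_id, station, msg_code, msg_time)
VALUES
    (1, 'HMI01', '3018', '2017-01-01 08:00:01'),
    (2, 'HMI01', '3019', '2017-01-01 08:00:02'),
    (3, 'HMI02', '3020', '2017-01-01 08:00:03');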

So, looking through the Wireshark protocol decode (middle pane of Wireshark), I can see that every 5-byte MySQL packet is a ping to check that the server is alive. Yeah! We get to check before every packet. Ugh. See below.
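Those 5 bytes line up with the MySQL wire protocol's COM_PING command: a 4-byte packet header (a 3-byte payload length plus a 1-byte sequence number) followed by a single 0x0e command byte. In hex:

01 00 00 00 0e    (payload length = 1, sequence = 0, command = 0x0e, COM_PING)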

After the ping and the reply, I send a 22-byte packet shown below:

This looks like a parameter statement: SET @a4 = '3018'. Hmm. Not sure about that yet, but from the Wireshark data we see that not only do we have to suffer through the 5-byte ping and the 11-byte response from the cloud MySQL server, but after our teaspoon of data (22 bytes in flight) we send another 5-byte ping and have to wait for the response.

Next, we send the 33-byte packet:

This has a little more meaningful data and looks like a datetime stamp. Ugh. Now, after searching through the CSV for 3018 and the datetime value, we can determine that the client portion of the program is sending the data ONE CELL AT A TIME. Are you kidding me?! One cell of data at a time.
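Piecing the capture together, each row appears to cross the wire as a series of tiny statements: one SET per cell, each with its very own ping, presumably capped by an INSERT that references the variables. This is a reconstruction from the capture, not the wizard's actual code (every value except the captured '3018' is invented):

SET @a1 = '1';
SET @a2 = 'HMI01';
SET @a3 = 'alarm';
SET @a4 = '3018';                 -- the 22-byte packet above
SET @a5 = '2017-01-01 08:00:01';  -- the 33-byte datetime packet
-- ...one SET per cell, all the way through @a13...
INSERT INTO dcHMIMessages VALUES (@a1, @a2, @a3, @a4, @a5 /* ...@a13 */);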

A little more math. The number of cells per row is 13 and we have ~466k rows, which is about 6,058,000 cells (i.e., teaspoons) of data. Remember, too, that in between each cell we need to send a MySQL ping and wait for the response. This means that to send a CSV value (i.e., a cell in the document) we use 4 packets (1: ping, 2: ack, 3: cell data, 4: ack). A quick calculation shows that the number of packets I will have to use to move the data to the server is 4 × 6,058,000, which equals 24,232,000 packets. WOW!
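The same arithmetic in SQL, for the skeptics:

select 13 * 466000;       -- 6,058,000 cells of data
select 4 * 13 * 466000;   -- 24,232,000 packets at 4 packets per cell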

Ok. I did not set out to rant about the MySQL Table Data Import Wizard but, in all honesty, I did expect better performance, and I am sure there are plenty of people who have suffered through what I just showcased and did not know they could do anything differently. I am not a GUI person, and maybe someone will point out that there are a million ways I could tune the import wizard to make it faster; however, that would not have let me write this post! But seriously, out of the box, this is undesirable behavior for anything more than a few dozen rows of data.

Luckily, there are always other ways to do things, as my colleague Jonathan suggested. He said to move the file to the server first to avoid the network latency, which I did. On top of that, I went back to the trusty command line (when in doubt, command line always beats GUI). To be fair to MySQL Workbench, I would need to test speeds when both the data and the server are on the same box, but I already spent too much time getting the rows in there, so I will leave that for another day. To the rescue is the command-line utility mysqlimport.

So, next I ftp'd the file to the server (took about 30 seconds).

Then I used the mysqlimport command:

mysqlimport --ignore-lines=1 --fields-terminated-by=, --local \
    -u {myuser} -p dcDataCollection dcHMIMessages.csv

Basically, the command ignores the header row, indicates that fields are terminated by a comma (,), enables a local (client-side) file load, and supplies my credentials along with the target database (dcDataCollection) and the file. The default behavior is to import into a table with the same name as the CSV file.
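Under the hood, mysqlimport is essentially a command-line wrapper around the LOAD DATA statement, so the rough equivalent of the command above is:

LOAD DATA LOCAL INFILE 'dcHMIMessages.csv'
    INTO TABLE dcHMIMessages
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES;

One statement, one big stream of data, and the server does the parsing. This is exactly the "read a big chunk and look for commas and newlines" behavior I assumed the wizard would have.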

Total execution time to insert 466,000 data rows? Less than 5 seconds.

Yes, less than 5 seconds.  Shazam!

So, I could spend 5.39 days using the GUI import wizard, or I could move the file to the server, use the command line, and do the entire thing in just a few seconds (about 35 if you remove the time of my keystrokes). Wow!

Maybe I could improve my network; maybe I could tune the MySQL import wizard. Maybe. I am sure I could never get to 5 seconds using the GUI, though.

Now you might be saying,

“Dave, no one would move such a huge amount of data that way; they would do it the way you did”

and you might be right. I suspect, however, that the majority of people out there would use the GUI and just suffer through it, not knowing it was bad because it was working.

Besides, that is not the point of the post.

The point of the post is to look at performance with your Forensic hat and recognize when something smells bad.  Then use tools to expose data and truly understand what is happening.

In this example, everything was working. No errors were present and the data was moving from source to destination. However, working does not always mean working well. With a little Forensic ability, we can peek under the hood at the Bits & Bytes and understand what "slow" really means and how to fix it.