Explaining the Bits and Bytes of Everyday IT



Category: Databases

DBaaS – Database as a Service Pros and Cons

Almost every business these days is data-centered. Whether the data is for internal applications and systems or for other services that are offered, let's face it: managing data is a key to success.

Before listing the pros and cons of DBaaS, we need to explore a few decisions businesses have to make.

Businesses are faced with numerous quick decisions about data handling that can set them on a path that, if incorrect, is difficult and costly to correct.  Those decisions around data are:

  • What database type to use, SQL or No-SQL?  This is an article all by itself.
  • What are the data storage and query needs?  Transactional?  Big Data?
  • What database system to use?  A few SQL choices might be Oracle, MySQL, MSSQL, and Sybase.  A few No-SQL choices might be MongoDB or Cassandra.
  • Do we have DBA (database administrator) talent or do we have to hire?
  • What kind of server or resources are needed?  What are my power, server, disk, processing, network, and I/O requirements?
  • How do I maintain, back up, administer, and otherwise own the database framework?
  • What is my cost of ownership?

 

Let’s explore the first decision.  What database type to use, SQL or No-SQL?

Traditional database types classified as SQL have a significant place in businesses and remain a mainstay choice. However, as companies start to build applications that drive decisions through analysis of large, almost unfathomable amounts of data, they migrate to No-SQL solutions like MongoDB or Cassandra. The architecture of No-SQL makes it a good choice for big data solutions, while the built-in protections of a transaction-based system like Oracle make it a better choice for banking or similar solutions.

When it comes to picking a specific system, businesses tend to stick with what they know.  In other words, if they already have Oracle, and Oracle talent, then when management asks those individuals which database system they should use on Project X, it should be no surprise that they pick…Oracle.

Matching a specific database system to a set of business requirements is an arduous task that should always be looked at with a fresh perspective and not just based on what talent is already employed or what systems a business is comfortable with.

Let’s face it, if a business picks correctly, all is good.  If they pick incorrectly, they have wasted a lot of resources which equates to dollars.  Enter DBaaS.

Where DBaaS excels is that it gives businesses the ability to test the waters a bit, to try before they invest heavily.

Even if DBaaS is only a stepping stone to total ownership, it is a cost-effective way to figure out your needs before making that larger investment.

Let’s explore the pros and cons.

Note, we need to distinguish between “hosting database systems” and DBaaS.  There are many cloud-based solutions that “host” a database system but provide no significant help with configuration, tuning, or consulting, and none of the talent needed to actually use those systems.  True DBaaS provides both the system and the talent to help you utilize the database and figure out how to store, query, and analyze your data.

The value of DBaaS goes way beyond the hosting.

Pros of DBaaS:

  • No equipment or software licenses
  • Flexibility.  Multiple choices to test drive your applications and pick the right platform for your business requirements.
  • Significantly lower staffing requirements.  The DBaaS provider handles installation, configuration, and in many cases development.
  • Offsite hosting provides protection from local power failures or disasters.  Many businesses design with power redundancy in mind but rarely meet those goals in reality.
  • SLAs that cover redundancy, up-time, and backup protections.  A DBaaS provider is intently focused on protecting your data.

 

Cons of DBaaS:

  • Limited access to the underlying servers.  This can feel like a loss of control.
  • Little visibility into how your data is protected from cybersecurity threats.  This can be dangerous for sensitive data.

 

So how do you decide?

Is there a transition from one to the other?  Yes, almost always, but by following a few guidelines to start with, you can use DBaaS properly.

 

Forensic IT’s Guidelines for DBaaS:

  1. Do all development using DBaaS.  This is your chance to test drive different architectures and features.
  2. Unless you have full disclosure of how your data is protected, managed, and secured by DBaaS providers, we suggest consulting with database architects to host sensitive data internally.  Note, this is typically not big data.  When we use the term sensitive data, we mean just that: data like SSNs, account details, financials, personal data, etc.  Does this mean that you cannot use DBaaS for this?  No, it means that you first have to find a DBaaS provider that will show you everything, from how your encrypted data gets into their system to how it is stored and accessed.
  3. When you are not sure of what your database needs really are, use DBaaS first.  This lets you try SQL or No-SQL.  This lets you explore the encryption capabilities of Oracle versus MySQL.  Think of DBaaS like buying a car.  You test drive sedans, trucks, and SUVs, and try different manufacturers and features.  You may decide to lease or buy.
  4. Always monitor and evaluate the cost of ownership.  As your system grows, the operating costs may reach a point where it makes sense to drop DBaaS and build an in-house system.  By then, however, you have already decided what you really need.  You have test-driven multiple solutions and can buy only the licenses and hardware you know you need to be successful.  You can hire the correct talent to manage your system.

Quick Network Analysis to Avoid Throwing Your Computer Against the Wall!

Where to start?  The hardest part of writing this post was creating the title.  It should be simple, right?  The problem is that there are too many possible titles like:

  • How NOT to Load Data to a Cloud Database
  • Using Wireshark to Troubleshoot Network Performance
  • If it’s Not Broke, Don’t Fix It
  • MySQL Table Import Wizard Architecture — What were they thinking?

Why the dilemma, you ask?

All I wanted to do was load an Excel file with about 466k rows into a database on a cloud server. This has probably been done a gazillion times by now all over the world, and my fear is that many of you fall victim to the “it’s working so it must be right” mentality and have suffered through what I am about to describe. This post is not about uploading data to a cloud database or how MySQL table imports work. Rather, it is about how much better your IT life will be if you just learn to look under the hood at how things work and then learn when to adjust.

So, the chosen title is “Quick Network Analysis to Avoid Throwing Your Computer Against the Wall!”

So, to rewind: I had an Excel export saved as a CSV (comma-separated values) file with about 466 thousand rows of data. I wanted to load this from my laptop to a hosted cloud server.

My target table structure was simple:
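(The original post showed the table definition as a screenshot, which is not reproduced here. As a rough stand-in, the sketch below shows the general shape of such a table; only the table name dcHMIMessages and the fact that each row carries 13 simple values come from the post, while the column names and types are illustrative assumptions.)

-- Illustrative sketch only; the real definition was shown as a screenshot.
-- The post tells us each row has 13 cells, including an integer-looking id
-- and a datetime stamp.  Column names and types here are assumptions.
CREATE TABLE dcHMIMessages (
    msg_id    INT,
    msg_time  DATETIME,
    msg_text  VARCHAR(255)
    -- ...plus roughly ten more similar columns in the actual table
);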

My exported file from another server would map perfectly to this, so I just decided to use the Table Import Wizard in MySQL Workbench.  In Workbench, find the table, right-click, and select Table Data Import Wizard.

Answer a few questions about the encoding format and the CSV-to-table mapping, and turn it loose. I thought to myself, “466k rows, a few minutes, time for coffee”.

After a little water cooler talk and a new cup of coffee, I returned to my desk and much to my displeasure I still saw the table import wizard running. Hmm. So like any normal DBA would do, I decided to check the row count.

select count(*) from dcHMIMessages

A whopping 119 rows! 119!?! That is 0.025% of the total in a few minutes. A few more selects and I came to the sad conclusion that I was inserting about one row of data per second. At this blistering pace, my import would be done in 5.39 days.

Ok, time to remove the lazy, GUI-using, coffee-drinking developer hat and put on my trusty Forensic hat (available for purchase!) to see what is going on.

And yes, that is a John Wayne bobble-head. If a grown man was going to own a bobble-head, who else would be better than The Duke?

So with my Forensic hat on, I picked my tool of choice for troubleshooting data performance across a network–Wireshark.

A quick sample of the network data explains why it is such a snoozer…and reveals how I entertained the title “MySQL Table Import Wizard Architecture — What were they thinking?”

Note, we have network classes to teach you how to use Wireshark if you are interested. This post is just to show some benefit of using it. Hit us up on our contact page if you are interested, or purchase a PDF here to get the scoop on using Wireshark.

So what does Wireshark show me about the performance?

The key columns to notice in the above image are:

  • Delta – the time since the previously captured packet (you can set this to the previously displayed packet instead, but in this case it is measured from the previously captured one). What does this tell me? It basically tells me that it takes my computer about 0.000# seconds to respond with the next packet and it takes the MySQL server about 0.0# seconds to process.  Not bad.
  • Length – the length of my packet. This is extremely small for what I would expect. When you look at the table structure (first image in the article), a row is much longer than 65 bytes, especially when you remember that the packet length also includes all of the packet headers.  The IP header alone is 20 bytes.
  • Bytes in flight – the amount of unacknowledged data flowing between the two ports in this conversation.  In terms of flow control, this is what I am passing back and forth.  Seems a bit small, don’t you think?

This is the ah-ha moment in the troubleshooting in which we can see why this is such a snoozer.

My computer is IP Address 192.168.119. When you look at packets 1083, 1085, 1087, and 1089 you can see the length of data that my computer is sending to the destination (5 bytes, 22 bytes, 5 bytes, and 33 bytes, respectively). On a network that can handle up to 1460 bytes of data per packet, this is like moving a pool of water with a teaspoon.

Packets 1082, 1084, 1086, and 1088 are all 11 bytes in flight and are just part of MySQL acknowledging my data.

So a quick status check shows that my laptop response and the MySQL response are fairly quick and that the issue is that my data throughput is minuscule. Time to take a peek into the packets to see what is happening. First, let’s look at a snippet of the CSV.

Since I did not write the MySQL data import wizard, I have no idea how it works. However, as a developer, my basic assumption is that it would read a large chunk of the CSV, send the data to the server, and then parse the data by looking for commas and newlines. Without doing much ciphering, we can see that when I send 5, 22, or 33 bytes of data, we are clearly not sending a large chunk of data.  My Forensic sense tells me something is not quite right.
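To make the contrast concrete, a bulk-friendly client would typically batch many rows into each statement so that every round trip carries a large slice of the CSV. A hedged sketch of that kind of batching (the column names and values are placeholders, not from the actual file):

-- What a chunked import would roughly look like: many rows per statement,
-- so each round trip moves kilobytes of data instead of a single cell.
INSERT INTO dcHMIMessages (msg_id, msg_time, msg_text)   -- placeholder columns
VALUES
    (3018, '2019-03-01 08:15:02', 'placeholder message text'),
    (3019, '2019-03-01 08:15:07', 'placeholder message text');
    -- ...hundreds more rows per batch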

So, looking through the Wireshark protocols (middle pane of the Wireshark data), I can see that every 5-byte MySQL packet is a ping to check that the server is alive. Yeah!  We get to check before every packet.  Ugh.  See below.

After the ping and the reply, I send a 22 byte packet shown below:

This looks like a parameter statement: SET @a4 = "3018". Hmm.  Not sure about that yet, but from the Wireshark data we see that not only do we have to suffer through the 5-byte ping and the 11-byte response from the cloud MySQL server, but after our teaspoon of data (22 bytes in flight) we send another 5-byte ping and have to wait for the response.

Next, we send the 33 byte packet:

This has a little more meaningful data and looks like a datetime stamp.  Ugh.  Now, after searching through the CSV for 3018 and the datetime value, we can determine that the client portion of the program is sending the data ONE CELL AT A TIME. Are you kidding me? One cell of data at a time.
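Putting the captured packets together, the wire pattern looks roughly like the sketch below. This is a reconstruction from the capture, not the wizard's actual code: the pings and the per-cell SET statements are what the packets showed, while the row-level INSERT at the end is an assumption about how those variables eventually get used.

-- Reconstructed from the capture (illustrative only): before every statement
-- the client sends a 5-byte MySQL ping and waits for the 11-byte reply,
-- then sends a single cell as its own statement.
SET @a4 = "3018";                  -- one cell (the 22-byte packet)
SET @a5 = "2019-03-01 08:15:02";   -- next cell, a placeholder datetime (the 33-byte packet)
-- ...one SET per cell, 13 per row, each preceded by a ping/ack exchange...
-- presumably followed, once per row, by something like:
-- INSERT INTO dcHMIMessages VALUES (@a1, @a2, ..., @a13);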

A little more math. Number of cells per row is 13 and we have ~466k rows which is about 6,058,000 cells (i.e., teaspoons) of data. Remember too that in between each cell, we need to send a MySQL Ping and wait for the response.  This means that to send a CSV value (i.e., a cell in the document) we use 4 packets (1-Ping, 2-Ack, 3-Cell data, 4-Ack). A quick calculation shows that the number of packets I will have to use to move the data to the server is 4 x 6,058,000 which equals 24,232,000 packets. WOW!

Ok. I did not set out to rant about the MySQL table import wizard, but in all honesty, I did expect better performance, as I am sure there are plenty of people who have suffered through what I just showcased for you and did not know they could do anything different. I am not a GUI person, and maybe someone will point out that there are a million ways I could tune the import wizard to make it faster; however, that would not have let me write this post!  But seriously, out of the box, this is undesirable behavior for anything more than a few dozen rows of data.

Luckily, there are always other ways to do things, as my colleague Jonathan suggested. He said to move the file to the server first to avoid the network latency, which I did. On top of that, I went back to the trusty command line (when in doubt, the command line always beats the GUI).  To be fair to MySQL Workbench, I should test its speed when both the data and the server are on the same box, but I have already spent too much time getting the rows in, so I will leave that for another day.  To the rescue comes the command-line utility mysqlimport.

So, next I ftp’d the file to the server (it took about 30 seconds).

Then used the MySQL import command:

mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u {myuser} -p dcDataCollection dcHMIMessages.csv

Basically the command ignores the header row, indicates that fields are terminated by a comma (,), and then supplies the credentials for my server and the target database (dcDataCollection).  The default behavior is to import into a table with the same name as the CSV file, in this case dcHMIMessages.
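For what it's worth, mysqlimport is essentially a command-line interface to the server's LOAD DATA statement, so roughly the same import could be run from a MySQL session along these lines (a sketch of the equivalent statement, assuming the connection is already using the dcDataCollection database):

-- Roughly equivalent to the mysqlimport command above.
LOAD DATA LOCAL INFILE 'dcHMIMessages.csv'
INTO TABLE dcHMIMessages
FIELDS TERMINATED BY ','
IGNORE 1 LINES;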

Total execution time to insert 466,000 data rows: less than 5 seconds.

Yes, less than 5 seconds.  Shazam!

So I could spend 5.39 days using the GUI data import wizard, or move the file to the server, use the command line, and do the entire thing in just a few seconds (about 35 if you remove the time of my keystrokes). Wow!

Maybe I could improve my network, maybe I could tune the MySQL import wizard, maybe. I am sure, though, that I could never get to 5 seconds using the GUI.

Now you might be saying,

“Dave, no one would move such a huge amount of data that way; they would do it the way you did,”

and you might be right. I suspect, however, that the majority of people out there would use the GUI and just suffer through it not knowing it was bad because it was working.

Besides, that is not the point of the post.

The point of the post is to look at performance with your Forensic hat and recognize when something smells bad.  Then use tools to expose data and truly understand what is happening.

In this example, everything was working.  No errors were present and the data was moving from source to destination. However, working does not always mean working well. With a little Forensic ability, we can peek under the hood at the Bits & Bytes and understand what “slow” really means and how to fix it.
