Big Data

November 13, 2020 by William & Paul Byrne

What’s the big idea behind “big data”? You may have heard the term, but come on, what does it REALLY mean? And what does it mean for your small or medium-tier business? Razoyo’s CEO Paul Byrne and CTO William Byrne have got you covered. Now, that’s some Big Data energy.

Today, we’re here to talk a little bit about big data - kind of a buzzword out there.

I think a lot of people have a sense of what big data is. It’s often used in conjunction with AI because that’s one of its largest applications. But what does the term actually mean, and what might it mean for Razoyo’s clients?

“Big data” does get thrown around a lot, and it means different things to different people. Traditionally, if you wanted to give it a definition, “big data” refers to data warehousing-type tasks: ingesting large amounts of data, processing large amounts of data, and/or storing large amounts of data.

For our clients, big data becomes a discussion any time their data processing needs exceed the limits of traditional database infrastructure. We really don’t have the discussion until it becomes a problem, just due to the complexity that comes with big data systems.

What specific types of cases might call for implementing a big data solution?

If you want to boil it down, there are two different cases where you may need that type of system. The first is if you take in large amounts of data at a very fast pace; sometimes your MySQL database is not going to be able to keep up with it.

This could be event data or data coming in from third-party sources faster than a traditional database infrastructure can write it. If you’re taking in very large amounts of data at a fast pace, big data systems are going to be a good thing to look at. The other case is if the amount of data you are querying is massive.

Again, a traditional database probably isn’t going to cut it. If you’re talking about querying terabytes or more, perhaps the data isn’t coming in very quickly, but you’ve collected a large amount of it over time and want to query it efficiently and get responses in a reasonable amount of time, you should probably look at BigQuery or similar big data storage options.
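
To make those two cases concrete, here’s a rough sketch using the google-cloud-bigquery Python client, since BigQuery comes up later in this discussion. The project, dataset, table, and field names are all made up, and it assumes credentials are configured and the table already exists; it’s meant to show the shape of the approach, not a production setup.

```python
# Hypothetical example: high-velocity ingestion and large-scale querying
# with BigQuery. Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Case 1: data arriving too fast for a traditional database. Streaming
# inserts append rows without a full batch load job.
rows = [
    {"user_id": "u-123", "event_type": "page_view", "ts": "2020-11-13T12:00:00Z"},
]
errors = client.insert_rows_json("my-project.analytics.events", rows)
if errors:
    print("Streaming insert failed:", errors)

# Case 2: querying a table that has grown to terabytes. BigQuery scans it
# in parallel and returns results in a reasonable amount of time.
query = """
    SELECT event_type, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE ts >= TIMESTAMP('2020-11-01')
    GROUP BY event_type
"""
for row in client.query(query).result():
    print(row.event_type, row.events)
```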

What kind of tools and options do clients the size of Razoyo’s, which tend to be mid-tier or smaller enterprises, have available for dealing with their big data needs?

One thing that I think is good for small to medium-sized businesses is to almost always outsource your big data infrastructure needs. There’s open source software out there for handling massive amounts of data, such as Hadoop or other systems that work in a cluster.

However, if you’re a small to medium-sized business, you don’t really want to hire and manage engineers whose job it becomes to maintain that infrastructure. These are usually clusters of servers - it could be hundreds of servers. In those cases, you’re much better off outsourcing to a cloud provider such as Google Cloud or Amazon’s AWS.

Both of those cloud platforms have managed products for dealing with big data. Amazon probably has a larger number of options, but Google Cloud has a small, very effective arsenal of big data products. Either one is a good decision!

One of the offerings that we have is dealing with failed implementations and with systems that have become unruly. You see a lot of problems come up with big data implementations, so what are the things you should avoid when you’re working in a big data environment?

Most of the time, the problems that you’re going to face have to do with the fact that you’ve outgrown the system that you’re currently on. Not many people start from the very beginning with a massively scalable big data operation. They’ll typically start with your standard type of database setup and then outgrow it. Once you reach that point, there’s a few things to consider when you’re planning out your big data architecture.

One of the first things is to really put a lot of thought up front into the flow of the data. How does it get from source to destination, and how many different steps are going to be in between? Figure out how that fits into the products that you’ve chosen, whether you’re using Google Cloud Dataflow for ingesting the data and BigQuery (or possibly Bigtable, depending on your query needs) for storage. Those are all Google Cloud Platform big data products.
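
As a rough illustration of that source-to-destination flow, here is a minimal Apache Beam pipeline (Beam is the SDK that Google Cloud Dataflow runs) that reads raw event files, parses them, and writes them into BigQuery. The bucket, project, and table names are hypothetical, and the destination table is assumed to already exist.

```python
# Hypothetical source-to-destination flow: Cloud Storage -> transform -> BigQuery.
import json
import apache_beam as beam

def parse_event(line):
    """Turn one raw JSON line into a flat dict ready for BigQuery."""
    event = json.loads(line)
    return {"user_id": event["user_id"], "event_type": event["type"]}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw events" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
        | "Parse" >> beam.Map(parse_event)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```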

One of the things that you can do to make your data ingesting and transforming processes more debuggable and easier to manage is to prefer small incremental changes as opposed to large changes. When you design one of those steps in between point A and point B, make sure you can break up the process into smaller steps that do more specific transformations over the data. The reason that’s a good approach is it just makes it much easier to track down where a problem in that pipeline might be.
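
Building on the hypothetical pipeline above, that advice looks roughly like this: instead of one big “Parse” step, each small transformation gets its own named step, so when a bad record shows up you can see exactly which stage it broke in.

```python
# Same hypothetical pipeline, broken into small, named, single-purpose steps.
import json
import apache_beam as beam

def decode(line):
    return json.loads(line)

def is_real_traffic(event):
    return not event.get("is_test", False)

def normalize(event):
    return {"user_id": event["user_id"], "event_type": event["type"].lower()}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw events" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
        | "Decode JSON" >> beam.Map(decode)
        | "Drop test traffic" >> beam.Filter(is_real_traffic)
        | "Normalize fields" >> beam.Map(normalize)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery("my-project:analytics.events")
    )
```

Each named step shows up separately in the job’s monitoring view, which is what makes tracking down a problem in the pipeline so much easier.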

Lastly, one of the pitfalls is just managing your costs. The products out there for big data are great, but you definitely want to make sure you understand the pricing structure. A poorly designed big data infrastructure built on top of AWS or Google Cloud can be very costly, so check how they charge for querying and storing your data. A lot of the time it’s pay-for-what-you-use: you pay for every byte that gets processed during a query, so you can often optimize your queries to reduce the amount of bytes processed.
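
One concrete habit that helps here (again a sketch, with made-up table and column names): BigQuery can dry-run a query and report how many bytes it would scan before you actually pay for it, and selecting only the columns you need, plus filtering on a partitioned column, keeps that number down.

```python
# Hypothetical cost check: estimate bytes scanned before running the query.
from google.cloud import bigquery

client = bigquery.Client()

# Selecting specific columns instead of SELECT * and filtering on the
# partitioning column both reduce bytes processed, and therefore cost.
query = """
    SELECT event_type, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE DATE(ts) BETWEEN '2020-11-01' AND '2020-11-07'
    GROUP BY event_type
"""

dry_run = client.query(
    query,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
print(f"This query would process {dry_run.total_bytes_processed:,} bytes")
```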

The same applies to the open source tools, except there the cost scales with however many servers you need. So even if you’re not using an AWS, Google, or Azure big data service, you still have to plan things out very carefully. Otherwise you’re looking at hundreds and hundreds of extra servers that you otherwise wouldn’t need. In that case you’re not paying per byte processed, but the more efficient your queries are, the smaller the number of servers you’re going to need. At the end of the day those servers are going to cost you, from purchasing them to the power costs of running them, so you definitely still want to keep optimization in mind in open source environments as well.

I feel like we’re going to see a lot more use of big data infrastructure, and probably some new tools to deal with it in the next few years that are less worried about efficiency and more about speed and ease of use. You’re already starting to see it! Almost all of the major big data cloud products are going to be very cheap on storage, and that’s why, as you start designing your infrastructure, you should stick to the small incremental transformations. Part of the reason why that works and is cost effective is because of how cheap storage is.

If you are running a transformation that results in two different copies of your data set, maybe two different tables, you’re actually not paying very much at all for storing both of those. What’s going to cost you is the processing of the data. Make small transformations and don’t worry about duplicating things… It’s a bit of a shift in mindset for a lot of developers who are used to designing databases so that each bit of information is stored only once. You don’t want to approach it with that traditional relational database mindset. It also helps to flatten your data inside of these big data products, as opposed to having 50 joins across a bunch of different tables. This will result in more gigabyte usage, but your queries are going to become much simpler.
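
As a hedged sketch of what that flattening can look like in practice, here’s the idea of materializing one wide, denormalized table so day-to-day queries don’t need any joins. All the project, dataset, and column names are hypothetical.

```python
# Hypothetical example: pay the join cost once, then query the flat table.
from google.cloud import bigquery

client = bigquery.Client()

flatten_sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.orders_flat` AS
    SELECT
      o.order_id,
      o.created_at,
      c.customer_id,
      c.country,
      p.sku,
      p.category,
      i.quantity,
      i.unit_price
    FROM `my-project.shop.orders` AS o
    JOIN `my-project.shop.order_items` AS i USING (order_id)
    JOIN `my-project.shop.customers` AS c USING (customer_id)
    JOIN `my-project.shop.products` AS p USING (product_id)
"""

# Storage for the duplicate, flattened copy is cheap; the simpler queries
# against it are what save processing cost later.
client.query(flatten_sql).result()
```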
