It was clear that the job can be done a lot quicker to write a parser for glue result to create table. This JSONPath file has to be perfectly aligned with the column in CREATE TABLE command for redshift. This is, of course, not including the fact that I need to generate a JSONPath file to tell Redshift which field/column relates to which nested json object. However, when i looked at the glue schema, turn out it is ALOT more than i thought (it worked out to be around 300 columns in the end after you expand every nested json). It is not overly complicated but you have to be comfortable with deploying cloudformation stack, managing lambda function & its dynamodb to queue job and using the utility tool writen in Nodejs to create configurations for our lambda function.Īfter getting all that setup, my next task is to create Redshift table… To quickly try sending data to the cluster, i manually went through some files, picked out some common fields, wrote CREATE TABLE command by hand and it worked great.
Turn out, I did not have to write a single piece of code for that, AWS has already done it for me in this blog post. I knew a lambda trigger on S3 objectcreate event would be perfect for the job and it can simply send a COPY command over to Redshift to let it migrate data to our cluster. This tool would crawl over your data and identify tables and schema that would fit the given data! How neat!įast forward, I wanted to attempt flatten the JSON and chuck it into a redshift cluster. These JSON come in with different attributes, tags, type, values and it was difficult to work out the schema without spending a great deal of time going through them manually or running script to pick up the logs, run some logic across to identify new fields etc…ĪWS Glue come to rescue! I first used Athena to play around with the data and a colleague of mine found Redshift spectrum, which, during setup would run you through setting AWS Glue. Recently I had to deal with large amount of unsorted JSON input log.
Sometimes the path to copy large amount of unsorted data from S3 to Redshift is a bit annoying.