At times, we are using flat files as a source to rely on for our data analysis.
As flat files can be generated by another program or manually, it is often hard to control and expect the data structure these files hold.
On our scenario, every month we are getting a txt file name 'User Logins <MonthNumber>m.txt':
This file contains the user logins to our application with the userId and an aggregation of total logins for that month:
On our dashboard, we are interested to do an analysis of the total logins per month and compare the different month files.
Option 1: Add each file to the ElastiCube and use a custom SQL expression to UNION all the files into one table. For example:
Select Id, Logins, 'March' as Month from [User logins 3m.txt]
Select Id, Logins, 'Other Month' from [User logins Xm.txt]
Option 2: Prepare the data using python prior to bringing it to the ElastiCube - Saves disk size and build time:
The following python script uses the file name to extract the month (3) and add it to the end of every line in the file itself:
for filename in os.listdir(os.getcwd()):
#if the filename ends with m.txt
if filename[-5:] == 'm.txt':
#open the file
with open(filename) as fp:
lines = fp.read().splitlines()
with open(filename, "w") as fp:
for line in lines:
#add the month's number as another column in the file to each line
print(line + " " +str(int(filename[-8:][:-5]) % 12), file=fp)
#print the file was modified.
print(filename + ' was modified')
#change the file name to reduce mistakes of re-using the file.
os.rename(filename, filename[:-5] + '.txt')
At the end of the process of altering the file lines with adding the month number to them, we are changing the file name to include '3.txt' instead of '3m.txt' to make sure that we do not alter the files anymore using the script.