Trying to... XML
So yeah, here is a solution that, between life and laughs and work, took a good evening with the iffy Runaway Bride playing in the background. It was cute tho, but I digress. Let's do this 😊!! I got these sample files from the open source Form 990 AWS repository. Someone showed me the repository during an internship, and then reminded me it exists when they asked me about it 🤡. So get your files first and theeeeen let's do this.
# Get the modules ready
import csv
import glob
import os
import pandas as pd
import xml.etree.ElementTree as ET  #I giggled, this is an Auntie Kim original naming :)
ET because you know, the movie (which I haven't seen but I digress... again). I should note - the bulk of this block is from Auntie's initial start. We then worked to get it together to see each result and then add it to a sheet. The stumbling block here was seeing data from all the files instead of just the first file's data in the loop.
#we identify a directory with the xml files and we change our current directory location to go into it
#Think of this as navigating to a folder on your computer
os.chdir('/work/sample_data')

#let's make a dictionary so we can store our values once we extract them
temp_dict = dict()

#we also have a counter so that each time we are done processing a file, we can log it and keep track
ct = 0

#loop through each file that is in the directory
for f in glob.iglob("*.xml"):
    #we print just to see the file name
    print(f)

    #get to the beginning paths of the file (a.k.a the root == remember our analogy is trees)
    tree = ET.parse(f)
    root = tree.getroot()

    #we make this empty dictionary to use to store the file content after we retrieve it
    file_dict = dict()

    #for each branch that forms from the root of our tree, get the tar on the tree!!
    for child in root.iter():
        #get the element key: keep only the part of the tag after the namespace's closing '}'
        #(rsplit gives back a list, so we take its last piece to get a plain string)
        key = child.tag.rsplit('}', 1)[-1]
        #get the value of said element's key
        value = child.text
        #store them in a dictionary just for this file!
        file_dict[key] = value

    # and then save the dictionary to another dictionary for all the files
    #(this sits inside the file loop but outside the element loop, so it runs once per file)
    ct += 1
    temp_dict[ct] = file_dict
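If the tag splitting feels like magic, here is a small sketch of what happens to one tag. The XML string below is made up for illustration (it is not a real Form 990 filing), and `strip_namespace` is a hypothetical helper name, not something from the original code. One thing this approach implies: when a tag name shows up more than once in a file, the later value overwrites the earlier one in the dictionary.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document (NOT a real filing) that uses a default namespace,
# so every tag comes back looking like '{namespace}TagName'
sample_xml = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYear>2019</TaxYear>
    <EIN>123456789</EIN>
  </ReturnHeader>
</Return>"""

root = ET.fromstring(sample_xml)


def strip_namespace(tag):
    # Hypothetical helper: split once at the last '}' and keep what follows it.
    # Tags without a namespace come back unchanged.
    return tag.rsplit('}', 1)[-1]


file_dict = {}
for child in root.iter():
    # A repeated tag name would overwrite the earlier value here
    file_dict[strip_namespace(child.tag)] = child.text

print(file_dict['TaxYear'])
```

Taking `[-1]` matters because `rsplit` returns a list, and a list can't be used as a dictionary key.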
YAAAAAAAAAAS!! We're touching all the files. On to the next step 🎉🎉
Also - if you have not seen Runaway Bride, Julia Roberts keeps running away from the groom at the altar right before the wedding, and Richard Gere (yeah, the same people from Pretty Woman) is a reporter who writes about this [phenomenon]. Anywho, the grandma is over here talking about how every time, there is so much cake it's a wonder she's not gained weight 🤣🤣 HILARITY! Anywho, let's see if our saving thing works
#ok so let's check do we have everything?
#We have five files so we expect five to be the length/count of files
len(temp_dict)
# let's look at something other than the first thing
# this shows us what the tar (remember our analogy?) looks like
temp_dict
The long way to inspect the data...
#the dataframe elements
# a list of the column names that we will access. This makes it easy for us to use them if we know them
cols = ['TaxYear', 'TaxPeriodBeginDate', 'TaxPeriodEndDate', 'Name', 'EIN', 'ReturnType']

#an empty list which we will use to append just the values of the above columns as we loop through the data
vals = []

# create a list for each row of data for just the variables we want
#loop through the dictionary storing the xml data
for i, item in temp_dict.items():
    #add each row but just the columns we want
    vals.append([item[col] for col in cols])

#add them rows into the dataframe!
df_filings = pd.DataFrame(vals, columns=cols)
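A side note on the row building: looking a column up with square brackets raises a `KeyError` if a file happens not to contain that tag. Below is a minimal sketch of a more forgiving variant that uses `dict.get`, which returns `None` for anything missing. The two entries in this `temp_dict` are invented stand-ins for parsed files, and the column list is shortened, just to keep the example small.

```python
# Invented stand-ins for two parsed files (NOT real filing data).
# The second one is missing 'ReturnType' on purpose.
temp_dict = {
    1: {'TaxYear': '2019', 'EIN': '111111111', 'ReturnType': '990'},
    2: {'TaxYear': '2020', 'EIN': '222222222'},
}

# A shortened column list for the example
cols = ['TaxYear', 'EIN', 'ReturnType']

vals = []
for item in temp_dict.values():
    # .get returns None instead of raising KeyError when a tag is absent
    vals.append([item.get(col) for col in cols])

print(vals)
```

Whether a silent `None` or a loud error is better depends on the data: the error tells you right away that a file looks different from the rest.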
#ok moment of truth
# here when we just type out the object, we are effectively printing it out
# this only works when it is the last line in the cell or the only line - otherwise, nothing gets displayed
df_filings
#use pandas to save the dataframe into a csv with the pandas method `to_csv`
df_filings.to_csv('df_read_data.csv')
BUT what if I don't want to inspect it...
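You can use the Python file manipulation modules like so! 👇🏽👇🏽 The `csv` module ships with Python, so there is nothing extra to install, but you do still have to import it (we did that at the top). `open`, on the other hand, is a built-in function that needs no import.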
#create a writer object (Auntie Kim thought about this first because time is moneyyyy)
with open('written_data.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    #add the header first
    writer.writerow(cols)
    #take the list of rows we made and add that to the file
    for row in vals:
        writer.writerow(row)
#let's open the file to check it
pd.read_csv('/work/sample_data/written_data.csv')
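Another option, sketched here as an alternative rather than what we actually did: `csv.DictWriter` can write the per-file dictionaries directly, so there is no need for an intermediate list of rows. The data is invented for illustration, and the example writes to an in-memory `io.StringIO` buffer instead of a real file so it doesn't touch your folder.

```python
import csv
import io

# Invented stand-ins for two parsed files (NOT real filing data)
temp_dict = {
    1: {'TaxYear': '2019', 'EIN': '111111111', 'Extra': 'ignored'},
    2: {'TaxYear': '2020', 'EIN': '222222222'},
}
cols = ['TaxYear', 'EIN']

# An in-memory text buffer stands in for open('some_file.csv', 'w', newline='')
buffer = io.StringIO(newline='')

# extrasaction='ignore' skips dictionary keys that are not in fieldnames
writer = csv.DictWriter(buffer, fieldnames=cols, extrasaction='ignore')
writer.writeheader()
writer.writerows(temp_dict.values())

lines = buffer.getvalue().splitlines()
print(lines)
```

With a real file you would swap the buffer for the `with open(...)` block and keep the rest the same.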
LOOK AT US!! We did it 🎇✨🎊 So lesson is, loops are fun but hard and now we gotta have dinner and find something else to entertain us as we wind down for the night. Thanks for reading this code and movie version of Netflix and chill. Buh byeeee 👋🏽👋🏽