I’m working on a project for the company I work at. They have a program that generates an XML file, and they would like to extract the values of specific tags and print them as formatted output. To accomplish this, I’ve turned to Python and am currently writing two programs.
The first program successfully formats the raw data in the XML file into its properly indented tree structure.
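For context, a formatting step like this can be sketched with minidom's toprettyxml. This is only an assumption about the general approach, not the actual first program, and the inline RAW string is a hypothetical stand-in for the real file:

```python
from xml.dom import minidom

# Hypothetical raw, unindented input standing in for the generated XML file.
RAW = "<ws_Worker><ws_Summary><ws_Employee_ID>555555</ws_Employee_ID><ws_Name>John Doe</ws_Name></ws_Summary></ws_Worker>"

# Parse the raw text and re-serialize it with two-space indentation.
pretty = minidom.parseString(RAW).toprettyxml(indent="  ")
print(pretty)
```

Note that running toprettyxml on input that is already indented tends to add extra blank lines, so it works best on the raw, single-line form.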
The second program is where I’m stuck. Using the minidom module, I have so far been able to print a single line of seven values, each obtained from a specific tag within the XML file.
The challenge is that I need that same line of output for every element I’m pulling data from throughout the length of the document, not just the first one.
The entire XML document is far too large to post on this site, and contains sensitive data, so I’ll have to truncate and modify part of it so you can at least see the hierarchies.
<ws_Worker>
  <ws_Summary>
    <ws_Employee_ID>555555</ws_Employee_ID>
    <ws_Name>John Doe</ws_Name>
  </ws_Summary>
  <ws_Eligibility ws_PriorValue="false">true</ws_Eligibility>
  <ws_Personal>
    <ws_Name_Data>
      <ws_Name_Type>Legal</ws_Name_Type>
      <ws_First_Name>John</ws_First_Name>
      <ws_Last_Name>Doe</ws_Last_Name>
      <ws_Formatted_Name>John Doe</ws_Formatted_Name>
      <ws_Reporting_Name>Doe, John</ws_Reporting_Name>
    </ws_Name_Data>
    <ws_Address_Data>
      <ws_Address_Type>WORK</ws_Address_Type>
      <ws_Address_Is_Public>true</ws_Address_Is_Public>
      <ws_Is_Primary>true</ws_Is_Primary>
      <ws_Address_Line_Data ws_Label="Address Line 1" ws_Type="ADDRESS_LINE_1">123 Sixth St.</ws_Address_Line_Data>
      <ws_Municipality>Baltimore</ws_Municipality>
      <ws_Region>Maryland</ws_Region>
      <ws_Postal_Code>12345</ws_Postal_Code>
      <ws_Country>US</ws_Country>
    </ws_Address_Data>
    <ws_Email_Data>
      <ws_Email_Type>WORK</ws_Email_Type>
      <ws_Email_Is_Public>true</ws_Email_Is_Public>
      <ws_Is_Primary>true</ws_Is_Primary>
      <ws_Email_Address ws_PriorValue="doeball@icloud.com">jdoe@company.com</ws_Email_Address>
    </ws_Email_Data>
    <ws_Tobacco_Use>false</ws_Tobacco_Use>
  </ws_Personal>
  <ws_Status>
    <ws_Employee_Status>Active</ws_Employee_Status>
    <ws_Active>true</ws_Active>
    <ws_Active_Status_Date>2020-01-01</ws_Active_Status_Date>
    <ws_Hire_Date>2020-01-01</ws_Hire_Date>
    <ws_Original_Hire_Date>2015-01-01</ws_Original_Hire_Date>
    <ws_Hire_Reason>Hire_Employee_Rehire_Employee_After_13_Weeks</ws_Hire_Reason>
    <ws_Continuous_Service_Date>2020-01-01</ws_Continuous_Service_Date>
    <ws_First_Day_of_Work>2020-01-01</ws_First_Day_of_Work>
    <ws_Retirement_Eligibility_Date>2016-10-01</ws_Retirement_Eligibility_Date>
    <ws_Retired>false</ws_Retired>
    <ws_Seniority_Date>2015-10-01</ws_Seniority_Date>
    <ws_Terminated>false</ws_Terminated>
    <ws_Not_Eligible_for_Hire>false</ws_Not_Eligible_for_Hire>
    <ws_Regrettable_Termination>false</ws_Regrettable_Termination>
    <ws_Resignation_Date>2018-11-01</ws_Resignation_Date>
    <ws_Not_Returning>false</ws_Not_Returning>
    <ws_Return_Unknown>false</ws_Return_Unknown>
    <ws_Has_International_Assignment>false</ws_Has_International_Assignment>
    <ws_Home_Country>US</ws_Home_Country>
    <ws_Rehire>true</ws_Rehire>
  </ws_Status>
  <ws_Position>
    <ws_Operation>NONE</ws_Operation>
    <ws_Position_ID>12345</ws_Position_ID>
    <ws_Effective_Date>2020-01-10</ws_Effective_Date>
    <ws_Primary_Position>true</ws_Primary_Position>
    <ws_Position_Title>Driver</ws_Position_Title>
    <ws_Business_Title>Driver</ws_Business_Title>
    <ws_Worker_Type>Regular</ws_Worker_Type>
    <ws_Position_Time_Type>Part_time</ws_Position_Time_Type>
    <ws_Job_Exempt>false</ws_Job_Exempt>
    <ws_Scheduled_Weekly_Hours>29</ws_Scheduled_Weekly_Hours>
    <ws_Default_Weekly_Hours>40</ws_Default_Weekly_Hours>
    <ws_Full_Time_Equivalent_Percentage>72.5</ws_Full_Time_Equivalent_Percentage>
    <ws_Exclude_from_Headcount>false</ws_Exclude_from_Headcount>
    <ws_Pay_Rate_Type>Hourly</ws_Pay_Rate_Type>
    <ws_Workers_Compensation_Code>1234</ws_Workers_Compensation_Code>
    <ws_Job_Profile>DRIVER</ws_Job_Profile>
    <ws_Management_Level>Individual Contributor</ws_Management_Level>
    <ws_Job_Family>DRV</ws_Job_Family>
    <ws_Business_Site>LOC_TOWN</ws_Business_Site>
    <ws_Business_Site_Name>Local Town</ws_Business_Site_Name>
    <ws_Business_Site_Address_Line_Data ws_Label="Address Line 1" ws_Type="ADDRESS_LINE_1">1234 Sixth St.</ws_Business_Site_Address_Line_Data>
    <ws_Business_Site_Municipality>Baltimore</ws_Business_Site_Municipality>
    <ws_Business_Site_Region>Maryland</ws_Business_Site_Region>
    <ws_Business_Site_Postal_Code>12345</ws_Business_Site_Postal_Code>
    <ws_Business_Site_Country>US</ws_Business_Site_Country>
    <ws_Supervisor>
      <ws_Operation>NONE</ws_Operation>
      <ws_Supervisor_ID>1234567</ws_Supervisor_ID>
      <ws_Supervisor_Name>Little Mac</ws_Supervisor_Name>
    </ws_Supervisor>
  </ws_Position>
  <ws_Additional_Information>
    <ws_WD_Username>John.Doe</ws_WD_Username>
    <ws_Last_4_SSN_Digits>1234</ws_Last_4_SSN_Digits>
  </ws_Additional_Information>
</ws_Worker>
Keep in mind, there are 36 other <ws_Worker> elements throughout this file.
Here is my program so far:
from xml.dom import minidom
xmldoc = minidom.parse("//tocp-fs1/mydocs/mantonishak/Documents/Python/The_Hard_Way/Out.xml")
outworkers = xmldoc.getElementsByTagName("ws_Worker")[0]
# Knowing your hierarchy is important. ws_Worker is at the top. Asking for the first value of the list.
outsummaries = outworkers.getElementsByTagName("ws_Summary")
outpersonals = outworkers.getElementsByTagName("ws_Personal")
outpositions = outworkers.getElementsByTagName("ws_Position")
outadditionals = outworkers.getElementsByTagName("ws_Additional_Information")
for outpersonal in outpersonals:
    desc = outpersonal.getElementsByTagName("ws_Formatted_Name")[0].firstChild.data
    # displays the user's Full Name
for outsummary in outsummaries:
    desc2 = outsummary.getElementsByTagName("ws_Employee_ID")[0].firstChild.data
    # displays the user's Workday ID
for location in outpositions:
    desc3 = location.getElementsByTagName("ws_Business_Site_Name")[0].firstChild.data
    # displays the user's current work location (Store Name)
for title in outpositions:
    desc4 = title.getElementsByTagName("ws_Position_Title")[0].firstChild.data
    # displays the user's current title
for email in outpersonals:
    desc5 = email.getElementsByTagName("ws_Email_Address")[0].firstChild.data
    lst = desc5.split("@")
    atsign = (lst[1])
    # This splits the ws_Email_Address value at the @ sign, removes it, and displays the string
    # to the right of the @ sign (which is the domain)
for firstletter in outpersonals:
    desc6 = firstletter.getElementsByTagName("ws_First_Name")[0].firstChild.data
    firstletter = desc6[0]
    # This grabs the first letter of the ws_First_Name value so it can be combined later with
    # the ws_Last_Name value to create the username
for lastname in outpersonals:
    desc7 = lastname.getElementsByTagName("ws_Last_Name")[0].firstChild.data
    username = (firstletter + desc7)
    # grabs the last name and combines it with the first letter of the first name
    # this creates the username
for ssn in outadditionals:
    desc8 = ssn.getElementsByTagName("ws_Last_4_SSN_Digits")[0].firstChild.data
    firstpass = desc6[0:2]
    lastpass = desc7[-2:]
    password = (firstpass + desc8 + lastpass)
    # this takes the first two chars of ws_First_Name and joins them as a string with
    # ws_Last_4_SSN_Digits and the last two chars of ws_Last_Name.
print("Full Name: %s, Employee ID: %s, Location: %s, Title: %s, Domain: %s, Username: %s, Password: %s" %
      (desc, desc2, desc3, desc4, atsign, username.lower(), password.lower()))
# Creates the output in a straight horizontal line. The .lower() calls on
# username and password convert all characters in those strings to lowercase.
And my output looks like this:
Full Name: John Doe, Employee ID: 1234567, Location: Local Town, Title: Driver, Domain: company.com, Username: jdoe, Password: jo1234oe
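The Domain, Username, and Password fields in that line come from plain string operations on the tag values. As a minimal sketch, the same rules can be collected into one helper; the function name derive_credentials and its parameters are hypothetical and not part of the program above:

```python
def derive_credentials(first_name, last_name, email, ssn_last4):
    """Apply the same string rules as the program above (hypothetical helper)."""
    # Domain: everything to the right of the @ sign.
    domain = email.split("@")[1]
    # Username: first letter of the first name plus the last name, lowercased.
    username = (first_name[0] + last_name).lower()
    # Password: first two chars of the first name, the last 4 SSN digits,
    # and the last two chars of the last name, lowercased.
    password = (first_name[:2] + ssn_last4 + last_name[-2:]).lower()
    return domain, username, password


print(derive_credentials("John", "Doe", "jdoe@company.com", "1234"))
# → ('company.com', 'jdoe', 'jo1234oe')
```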
I think the magic has to happen on the line that assigns outworkers. The index [0] only pulls the child tags within the first <ws_Worker> element. If I change that index to [1], it pulls the second, [2] pulls the third, and so on.
How do I construct a loop that changes that integer and collectively prints the output of each <ws_Worker> element throughout the file?
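One possible shape for such a loop, as a sketch only: drop the [0] and iterate over the whole list that getElementsByTagName returns, doing the per-worker lookups inside the loop body. The two-worker SAMPLE string, its ws_Workers root element, the second worker's values, and the helper names first_text and summarize_workers are all hypothetical stand-ins, since the real file cannot be shown:

```python
from xml.dom import minidom

# Hypothetical two-worker sample; the root element name is an assumption.
SAMPLE = """<ws_Workers>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>555555</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data><ws_Formatted_Name>John Doe</ws_Formatted_Name></ws_Name_Data>
    </ws_Personal>
  </ws_Worker>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>666666</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data><ws_Formatted_Name>Jane Roe</ws_Formatted_Name></ws_Name_Data>
    </ws_Personal>
  </ws_Worker>
</ws_Workers>"""


def first_text(parent, tag):
    """Return the text of the first <tag> under parent, or "" if missing or empty."""
    nodes = parent.getElementsByTagName(tag)
    if len(nodes) == 0 or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data


def summarize_workers(xml_text):
    """Build one formatted line per <ws_Worker> element in the document."""
    doc = minidom.parseString(xml_text)
    rows = []
    # No [0] here: loop over every ws_Worker instead of indexing a single one.
    for worker in doc.getElementsByTagName("ws_Worker"):
        rows.append("Full Name: %s, Employee ID: %s" % (
            first_text(worker, "ws_Formatted_Name"),
            first_text(worker, "ws_Employee_ID"),
        ))
    return rows


rows = summarize_workers(SAMPLE)
for row in rows:
    print(row)
```

With a real file, minidom.parse(path) would replace minidom.parseString, and the remaining five fields would be looked up on worker in the same way inside the loop.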