Chaining Methods in Pandas
You may access the dataset through this link.
import pandas as pd
import re
# Get dataset
csv_url = 'https://opendata.socrata.com/api/views/d2kg-fyem/rows.csv?accessType=DOWNLOAD'
data = pd.read_csv(csv_url)
data.head()
FIPS Jurisdiction Division Precincts Total Make Model Equipment VVPAT Accessible Early Absentee Polling State
code Registration Type Use Voting Ballots Place
--- --- --- --- --- --- --- --- --- --- --- --- --- ---
200000000 Alaska State NaN 441 574441 NaN NaN Hand NaN No Yes Yes Yes NaN
Counted
Paper
Ballots
200000000 Alaska State NaN 441 574441 Premier/Diebold AccuVote DRE-Touchscreen Yes Yes Yes No No NaN
(Dominion) TSX
200000000 Alaska State NaN 441 574441 Premier/Diebold AccuVote Optical Scan NaN No Yes Yes Yes NaN
(Dominion) OS
100000000 Alabama NaN 2527 2986782 NaN NaN NaN NaN No No No No NaN
State
100100000 Autauga NaN 17 33806 Election AutoMARK Ballot Marking NaN Yes No No No NaN
County Systems & Device or
Software System
Dataset contains the locations of electronic voting machines in the United States.
Supposed we process the dataset. It is common to have a code that looks like this
def tidy_string(string):
string = string.lower()
string = re.sub(r'[^\w]+', '_', string)
return string
processed_data = data.copy()
processed_data.columns = map(tidy_string, processed_data.columns)
processed_data = processed_data.reset_index()
processed_data['location'] = processed_data['jurisdiction'].str.split().str.get(0)
processed_data['state'] = processed_data['state'].fillna(processed_data['location'])
What I did here is
- clean the column names
- add surrogate key
- extract location from jurisdiction
- fill missing state
We can improve this, by using pandas feature called pipe
and assign
def fillna(df, missing, filled):
df[missing] = df[missing].fillna(df[filled])
return df
processed_data = (data
.copy()
.rename(columns=tidy_string)
.reset_index()
.assign(location=lambda x: x['jurisdiction'].str.split().str.get(0))
.pipe(fillna, 'state', 'location'))
Much cleaner, readable and maintainable.