Kev*_*ook 5 python dataframe pandas
I'm using Pandas to populate 6 new variables with values that are conditional to other data variables. The entire dataset consists of about 700,000 rows and 14 variables (columns) including my newly added ones.
My first approach was to use itertuples(), mainly down to experience being minimal here. This clocked around 9600 seconds.
I've managed to get this more efficient (~3500 seconds) by using apply(). Here is an example of one of the new variables.
housing_df = utils.make_data_frame("data/source_data/housing_with_child.dta", "stata")
""" Populate new variables """
# setup new variables
housing_df["hh_n"] = ""
housing_df["bio_d"] = ""
housing_df["step_d"] = ""
housing_df["child_d"] = ""
housing_df["both_bio"] = ""
housing_df["hhtype"] = ""
housing_df["step"] = ""
# hh_n
def hh_n(row):
df = housing_df.loc[
(housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
]
return str(len(df.index) + 1)
housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)
Run Code Online (Sandbox Code Playgroud)
Each new variable follows the same pattern and needs to do the following:
For each row
Get some of the rows existing data (eg pipd and wave)
Find rows in the whole data frame which have the same values (pidp and wave) and put these in a new dataframe
Count the rows we found in the new dataframe
Return the count value for the new variable (hh_n)
Run Code Online (Sandbox Code Playgroud)
Here is a small example of the output data:
In case it's relevant, here is my method for creating a dataframe from the stata file in the first line:
# Method that returns data frame from stata file or csv
def make_data_frame(data, type):
if type is "stata":
reader = pd.read_stata(data, chunksize=100000)
if type is "csv":
reader = pd.read_csv(data, chunksize=100000)
df = pd.DataFrame()
for itm in reader:
df = df.append(itm)
return df
Run Code Online (Sandbox Code Playgroud)
Here is an example of the first 50 rows of data
{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}, 'pidp': {0: 1367, 1: 2051, 2: 2727, 3: 2727, 4: 2727, 5: 2727, 6: 2727, 7: 3407, 8: 3407, 9: 3407, 10: 3407, 11: 3407, 12: 3407, 13: 3407, 14: 3407, 15: 3407, 16: 3407, 17: 3407, 18: 3407, 19: 3407, 20: 3407, 21: 3407, 22: 3407, 23: 3407, 24: 3407, 25: 3407, 26: 3407, 27: 3407, 28: 3407, 29: 4091, 30: 4091, 31: 4091, 32: 4091, 33: 4091, 34: 4091, 35: 4091, 36: 4091, 37: 4767, 38: 4767, 39: 4767, 40: 4767, 41: 4767, 42: 4767, 43: 5451, 44: 5451, 45: 5451, 46: 5451, 47: 5451, 48: 5451, 49: 6135, 50: 6135, 51: 6135, 52: 6135, 53: 6135, 54: 6807, 55: 6807, 56: 6844, 57: 6844, 58: 6844, 59: 6844, 60: 6844, 61: 6844, 62: 6844, 63: 6844, 64: 6844, 65: 6844, 66: 6844, 67: 6844, 68: 6844, 69: 6844, 70: 6844, 71: 6844, 72: 6844, 73: 6844, 74: 6844, 75: 6844, 76: 6844, 77: 6844, 78: 6844, 79: 6844, 80: 6844, 81: 6844, 82: 6844, 83: 6844, 84: 6844, 85: 6844, 86: 6844, 87: 6844, 88: 6844, 89: 6844, 90: 6844, 91: 6844, 92: 6844, 93: 6844, 94: 6844, 95: 6844, 96: 6844, 97: 6844, 98: 6844, 99: 6844}, 'hidp': {0: 20402, 1: 20402, 2: 5799043, 3: 60159608, 4: 45594010, 5: 53475212, 6: 15449616, 7: 102002, 8: 102002, 9: 30559202, 10: 30559202, 11: 6351204, 12: 6351204, 13: 2590808, 14: 13610, 15: 6812, 16: 13614, 17: 20416, 18: 21902818, 19: 36407220, 20: 36407220, 21: 20424, 22: 20424, 23: 20426, 24: 20426, 25: 9010028, 26: 9010028, 27: 18856430, 28: 18856430, 29: 136002, 30: 136002, 31: 30566002, 32: 30566002, 33: 6358004, 34: 6358004, 35: 20406, 36: 20406, 37: 210802, 38: 210802, 39: 30579602, 40: 30579602, 41: 30579602, 42: 6371604, 43: 210802, 44: 210802, 45: 30579602, 46: 30579602, 47: 30579602, 48: 6371604, 49: 210802, 50: 210802, 51: 30579602, 52: 30579602, 53: 30579602, 54: 285602, 55: 30593202, 56: 37073606, 57: 37073606, 58: 37073606, 59: 37073606, 60: 37073606, 61: 12580008, 62: 12580008, 63: 12580008, 64: 12580008, 65: 12580008, 66: 56052408, 67: 56052408, 68: 56052408, 69: 56052408, 70: 56052408, 71: 38848410, 72: 38848410, 73: 38848410, 74: 38848410, 75: 38848410, 76: 50014012, 77: 50014012, 78: 50014012, 79: 5467216, 80: 5467216, 81: 5467216, 82: 25547618, 83: 25547618, 84: 25547618, 85: 38610420, 86: 38610420, 87: 38610420, 88: 5025224, 89: 5025224, 90: 5025224, 91: 4637626, 92: 4637626, 93: 4637626, 94: 12784028, 95: 12784028, 96: 12784028, 97: 21719230, 98: 21719230, 99: 21719230}, 'apidp': {0: 2051, 1: 1367, 2: 752052125, 3: 752052125, 4: 752052125, 5: 752052125, 6: 752052125, 7: 545740805, 8: 612001365, 9: 545740805, 10: 612001365, 11: 545740805, 12: 612001365, 13: 612001365, 14: 612001365, 15: 612001365, 16: 612001365, 17: 612001365, 18: 612001365, 19: 612001365, 20: 612001369, 21: 612001365, 22: 612001369, 23: 612001365, 24: 612001369, 25: 612001365, 26: 612001369, 27: 612001365, 28: 612001369, 29: 68002045, 30: 68002049, 31: 68002045, 32: 68002049, 33: 68002045, 34: 68002049, 35: 68002045, 36: 68002049, 37: 5451, 38: 6135, 39: 5451, 40: 6135, 41: 5298579, 42: 5451, 43: 4767, 44: 6135, 45: 4767, 46: 6135, 47: 5298579, 48: 4767, 49: 4767, 50: 5451, 51: 4767, 52: 5451, 53: 5298579, 54: 4885131, 55: 4885131, 56: 13644, 57: 748469885, 58: 748469889, 59: 749992405, 60: 749992409, 61: 13644, 62: 748469885, 63: 748469889, 64: 749992405, 65: 749992409, 66: 13644, 67: 748469885, 68: 748469889, 69: 749992405, 70: 749992409, 71: 13644, 72: 748469885, 73: 748469889, 74: 749992405, 75: 749992409, 76: 13644, 77: 748469885, 78: 748469889, 79: 13644, 80: 748469885, 81: 748469889, 82: 13644, 83: 748469885, 84: 748469889, 85: 13644, 86: 748469885, 87: 748469889, 88: 13644, 89: 748469885, 90: 748469889, 91: 13644, 92: 748469885, 93: 748469889, 94: 13644, 95: 748469885, 96: 748469889, 97: 13644, 98: 748469885, 99: 748469889}, 'wave': {0: 2.0, 1: 2.0, 2: 2.0, 3: 8.0, 4: 9.0, 5: 10.0, 6: 11.0, 7: 2.0, 8: 2.0, 9: 3.0, 10: 3.0, 11: 4.0, 12: 4.0, 13: 7.0, 14: 8.0, 15: 9.0, 16: 10.0, 17: 11.0, 18: 12.0, 19: 13.0, 20: 13.0, 21: 14.0, 22: 14.0, 23: 15.0, 24: 15.0, 25: 16.0, 26: 16.0, 27: 17.0, 28: 17.0, 29: 2.0, 30: 2.0, 31: 3.0, 32: 3.0, 33: 4.0, 34: 4.0, 35: 5.0, 36: 5.0, 37: 2.0, 38: 2.0, 39: 3.0, 40: 3.0, 41: 3.0, 42: 4.0, 43: 2.0, 44: 2.0, 45: 3.0, 46: 3.0, 47: 3.0, 48: 4.0, 49: 2.0, 50: 2.0, 51: 3.0, 52: 3.0, 53: 3.0, 54: 2.0, 55: 3.0, 56: 6.0, 57: 6.0, 58: 6.0, 59: 6.0, 60: 6.0, 61: 7.0, 62: 7.0, 63: 7.0, 64: 7.0, 65: 7.0, 66: 8.0, 67: 8.0, 68: 8.0, 69: 8.0, 70: 8.0, 71: 9.0, 72: 9.0, 73: 9.0, 74: 9.0, 75: 9.0, 76: 10.0, 77: 10.0, 78: 10.0, 79: 11.0, 80: 11.0, 81: 11.0, 82: 12.0, 83: 12.0, 84: 12.0, 85: 13.0, 86: 13.0, 87: 13.0, 88: 14.0, 89: 14.0, 90: 14.0, 91: 15.0, 92: 15.0, 93: 15.0, 94: 16.0, 95: 16.0, 96: 16.0, 97: 17.0, 98: 17.0, 99: 17.0}, 'rel': {0: 'unrelated sharer', 1: 'unrelated sharer', 2: 'natural child', 3: 'natural child', 4: 'natural child', 5: 'natural child', 6: 'natural child', 7: 'lawful spouse', 8: 'natural child', 9: 'lawful spouse', 10: 'natural child', 11: 'lawful spouse', 12: 'natural child', 13: 'natural child', 14: 'natural child', 15: 'natural child', 16: 'natural child', 17: 'natural child', 18: 'natural child', 19: 'natural child', 20: 'daughter/son-in-law', 21: 'natural child', 22: 'daughter/son-in-law', 23: 'natural child', 24: 'daughter/son-in-law', 25: 'natural child', 26: 'daughter/son-in-law', 27: 'natural child', 28: 'daughter/son-in-law', 29: 'lawful spouse', 30: 'natural child', 31: 'lawful spouse', 32: 'natural child', 33: 'lawful spouse', 34: 'natural child', 35: 'lawful spouse', 36: 'natural child', 37: 'unrelated sharer', 38: 'natural child', 39: 'other', 40: 'natural child', 41: 'daughter/son-in-law', 42: 'other', 43: 'unrelated sharer', 44: 'natural child', 45: 'other', 46: 'natural child', 47: 'daughter/son-in-law', 48: 'other', 49: 'natural parent', 50: 'natural parent', 51: 'natural parent', 52: 'natural parent', 53: 'lawful spouse', 54: 'natural brother/sister', 55: 'natural brother/sister', 56: 'natural child', 57: 'lawful spouse', 58: 'natural child', 59: 'mother/father-in-law', 60: 'mother/father-in-law', 61: 'natural child', 62: 'lawful spouse', 63: 'natural child', 64: 'mother/father-in-law', 65: 'mother/father-in-law', 66: 'natural child', 67: 'lawful spouse', 68: 'natural child', 69: 'mother/father-in-law', 70: 'mother/father-in-law', 71: 'natural child', 72: 'lawful spouse', 73: 'natural child', 74: 'mother/father-in-law', 75: 'mother/father-in-law', 76: 'natural child', 77: 'lawful spouse', 78: 'natural child', 79: 'natural child', 80: 'lawful spouse', 81: 'natural child', 82: 'natural child', 83: 'lawful spouse', 84: 'natural child', 85: 'natural child', 86: 'lawful spouse', 87: 'natural child', 88: 'natural child', 89: 'lawful spouse', 90: 'natural child', 91: 'natural child', 92: 'lawful spouse', 93: 'natural child', 94: 'natural child', 95: 'lawful spouse', 96: 'natural child', 97: 'natural child', 98: 'lawful spouse', 99: 'natural child'}, 'is_child': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '1', 9: '', 10: '1', 11: '', 12: '1', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '1', 31: '', 32: '1', 33: '', 34: '1', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '1', 57: '', 58: '1', 59: '', 60: '', 61: '1', 62: '', 63: '1', 64: '', 65: '', 66: '1', 67: '', 68: '1', 69: '', 70: '', 71: '1', 72: '', 73: '1', 74: '', 75: '', 76: '1', 77: '', 78: '1', 79: '1', 80: '', 81: '1', 82: '1', 83: '', 84: '1', 85: '1', 86: '', 87: '1', 88: '1', 89: '', 90: '1', 91: '1', 92: '', 93: '1', 94: '1', 95: '', 96: '1', 97: '1', 98: '', 99: '1'}, 'hh_n': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'bio_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'child_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'both_bio': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'hhtype': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}}
Run Code Online (Sandbox Code Playgroud)
如所示,考虑groupby().transform()任何分组、内联聚合以将新列分配给当前数据帧。由于OP的需求本质上是计数(即,从逻辑条件返回索引的长度或行数),因此count明确指定聚合。
因此,下面的内容.apply()(每次迭代都会构建辅助数据框的隐藏循环)
def hh_n(row):
# BOOLEAN INDEXING + LEN()
df = housing_df.loc[
(housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
]
return str(len(df.index) + 1)
housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)
Run Code Online (Sandbox Code Playgroud)
可以调整为:
housing_df["hh_n"] = housing_df.groupby(['pidp', 'wave']).transform('count') + 1
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
102 次 |
| 最近记录: |