Use of groupby in a function for dataframe

Hello,

I’m totally stuck with a task on using groupby in a dataframe.

I have the following df in a csv file ‘athletes.csv’:

,forename,surname,gender,age,100m,200m,400m,800m,1500m
0,Migdalia,Parrish,F,18,11.08,29.0,59.41,122.05,259.11
1,Valerie,Lee,F,10,17.23,46.0,100.02,232.64,480.95
2,John,Debnam,M,17,10.81,25.89,50.6,110.29,232.39
3,Roy,Miller,M,10,19.18,46.74,95.32,201.14,430.27
4,Aida,Aumiller,F,11,15.3,41.83,81.06,189.03,394.9
5,Marcia,Brown,F,19,11.13,24.62,57.59,119.13,256.37
6,Harry,Knows,M,16,12.39,25.94,49.67,106.56,237.14
7,Barry,Lennon,M,14,11.15,23.56,46.46,110.89,230.49
8,Lilia,Armstrong,F,13,8.84,25.09,59.54,128.95,258.47
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
11,Martha,Woods,F,14,9.01,24.34,55.25,118.8,254.87
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
13,Yvonne,Pumphrey,F,16,8.84,27.29,57.63,123.13,247.41
14,Betty,Stephenson,F,14,11.04,28.73,59.05,126.29,256.44
15,Lilia,Armstrong,F,12,11.31,34.43,74.28,150.05,321.07

The task is to call (and print) from a main function another function which takes three arguments:

  • The dataframe df
  • The age 15
  • The statistic ‘mean’, i.e. the mean value for all events (100m, 200m, 400m, 800m, 1500m) for the age 15

The function should group by gender and should reset the index.

The output should be like the below.
Input:
age_statistics(df, 15, 'mean')
Output:
t1

I’m completely stuck on the function that splits the 15-year-old athletes into females and males and, for each gender, calculates the mean value for each event.


import pandas as pd

# function to groupby
def age_statistics(df,age,mean):
    # no idea how to build it
    aggregated_dataframe = aggregated_dataframe.reset_index(drop=False)
    return aggregated_dataframe

# main function
def main(filename='athletes.csv'):
    df = pd.read_csv(filename, index_col=0)
    df['100m'] = df['100m'].astype(float)
    df['200m'] = df['200m'].astype(float)
    df['400m'] = df['400m'].astype(float)
    df['800m'] = df['800m'].astype(float)
    df['1500m'] = df['1500m'].astype(float)
    print(age_statistics(df,15,'mean'))

# Do not edit this
if __name__ == "__main__":
    main()

Can anybody help with that?

I’m not familiar with this library, but I’m pretty sure you’d have to indicate that you want these calculations to happen per value of the “gender” column. On the line where you call reset_index, I’m thinking you’re missing something that indicates this.

The other option is to just manually filter out the M and F rows into two data frames, do the stats on those, then output the results, but I suspect that’d be “cheating” as there’s most likely a way to say “do this summary group by gender”.

What library is this?

Hi zedshaw, thanks for your reply.
The library is pandas.
The reset_index is not important; it’s the only part of the assignment I understood. It adds the index 0, 1, … as the left column of the output dataframe.
My problem is how to structure the function age_statistics. I really don’t know where to start.
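
For reference, here is a small sketch of what reset_index does after a groupby, assuming pandas. The `toy` frame is made-up illustration data, not the assignment's CSV. After grouping, the group key becomes the index; reset_index(drop=False) turns it back into an ordinary column and puts 0, 1, … on the left.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the index behaviour.
toy = pd.DataFrame({"gender": ["F", "M", "M"], "100m": [9.0, 10.0, 12.0]})

# After groupby + mean, 'gender' is the index of the result.
grouped = toy.groupby("gender").mean()
print(grouped)

# reset_index moves 'gender' back into a column and numbers the rows 0, 1, ...
flat = grouped.reset_index(drop=False)
print(flat)
```

With drop=True the gender labels would be thrown away instead of kept as a column, which is probably not what the assignment wants.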

I don’t know pandas at all, but it sounds like you’re attempting to look at this empty function and write the code straight without any help, and that’s nearly impossible if you don’t know the library deeply. What happens is you are told “make this do summary stats for F/M gender”. That’s it. Just a goal. You’re not given the code and you have to figure out how to go from nothing to code.

The problem is, not even top professionals like me can go from nothing to working code using pure code if they’ve never used the library. We all write out notes and comments and then “fill in the blanks” to figure it out.

What you should do is start with a human description of the steps, convert those to “pseudo-code”, then slowly convert that to real working code, running it constantly as you go.

Step 1:

def age_statistics(df, age, mean):
   # get the stuff from the thing
   # extract the other stuff
   # if the stuff is X then format it

Keep in mind this is fake as it’s just a demo. Once you have the idea written out in steps in your own words, you convert it to fake code comments:

def age_statistics(df, age, mean):
  # thing = stuff get
  # other_stuff = thing get
  # if stuff == X
  #    # format X

Then you take each of those and research in pandas how to do that line. After you fill in each line with code you think works, run it. Run it constantly. If you aren’t running this at least 1-2 times per line of code you write, then delete it and do it again.

What I mean by “fill in the blanks” is like this:

def age_statistics(df, age, mean):
   # thing = stuff get
   thing = stuff.get("F")
   # other_stuff = thing get
   # if stuff == X
   # format X

RUN IT! Then next line (these are all fake):

def age_statistics(df, age, mean):
   # thing = stuff get
   thing = stuff.get("F")

   # other_stuff = thing get
   other_stuff = thing.get("other stuff")

   # if stuff == X
   # format X

RUN IT!

def age_statistics(df, age, mean):
   # thing = stuff get
   thing = stuff.get("F")

   # other_stuff = thing get
   other_stuff = thing.get("other stuff")

   # if stuff == X
   if other_stuff == "X":
      # format X
      other_stuff.format("X")

Notice on the last one I did 2 lines since that’s how you make the if work. Now, RUN IT.

This is the process I use when I’m working with a library I don’t know, but I do know what I need to get done. Try it.
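
Applied to the actual task, the filled-in comments might end up looking like the sketch below. It assumes pandas; the parameter name `stat` (the original stub calls it `mean`) and the `EVENTS` list are illustrative choices, and the inline data is just the three 15-year-olds from the CSV in the question. The expected output table is not shown in the thread, so the exact format cannot be confirmed here.

```python
import io
import pandas as pd

# Event columns from the CSV header (name EVENTS is an illustrative choice).
EVENTS = ["100m", "200m", "400m", "800m", "1500m"]

def age_statistics(df, age, stat):
    # 1. keep only the athletes of the requested age
    same_age = df[df["age"] == age]
    # 2. group those rows by gender, looking only at the event columns
    grouped = same_age.groupby("gender")[EVENTS]
    # 3. apply the requested statistic by name, e.g. 'mean'
    aggregated = grouped.agg(stat)
    # 4. turn 'gender' back into a column with a 0, 1, ... index
    return aggregated.reset_index(drop=False)

# The three 15-year-olds from the question's CSV, inlined so this runs alone.
sample = io.StringIO(""",forename,surname,gender,age,100m,200m,400m,800m,1500m
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
""")
df = pd.read_csv(sample, index_col=0)
result = age_statistics(df, 15, "mean")
print(result)
```

Each numbered comment started life as a plain-English step; the line under it is what researching that one step in the pandas docs could produce. Passing the statistic as a string to agg means the same function would also accept other names such as 'min' or 'max'.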

Hi zedshaw, many thanks, I’ll try to work out the code following your steps!