Evaluation week 1 summary

We’re done with phase one of the coding period! I woke up to an email from GSoC saying that I had successfully passed the first evaluation. This week we focused mainly on re-implementing the compute_time_series method. We are still facing a few issues, but we have agreed on a temporary solution, which we will improve once we think of a better way to go about it.

Our last IRC meeting was held on the 24th, so this post covers only three days’ worth of work.

Tasks for the “week”

  • Re-implement compute_time_series
  • Modify other metrics with the latest changes
  • Open the draft pull request created for Code_Changes_Lines for review
  • Continue writing tests

Summary for the “week”

  • Re-implement compute_time_series (#176)

    Since this was the main focus for this period, I’d like to go over it in more detail. The problems with the initial implementation were:

    • The current method of computation, which used the merge method for DataFrames, was lengthy and repetitive, with the code varying by the time period passed to the method (such as week or month).
    • Most of the code was quite similar across metrics, which meant that the method could be pushed up in the class hierarchy.

    So, we went with the method suggested by Jesus, which was to use resampling.

    The idea is something like this:

      df = df.resample('W').agg('count')
    

    This allowed us to move the compute_time_series method (now renamed to just time_series) to metric.py into the root class Metric.
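To make the resampling idea concrete, here is a minimal sketch on made-up data. The DataFrame, its `hash` column and the dates are assumptions for illustration only, not the project's actual data model.

```python
import pandas as pd

# Hypothetical commit data: one row per commit, indexed by commit date.
commits = pd.DataFrame(
    {"hash": ["a1", "b2", "c3", "d4"]},
    index=pd.to_datetime(
        ["2019-06-03", "2019-06-05", "2019-06-12", "2019-06-27"]
    ),
)

# Group the rows into weekly bins (weeks ending on Sunday) and count
# the non-null values of each column in every bin.
weekly = commits.resample("W").agg("count")

print(weekly)
```

Weeks with no commits still show up in the result with a count of 0, so the series has no gaps. Changing the period only means changing the string passed to resample, which is what removes the per-period repetition.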

    But another problem we faced was that we couldn’t find an easy way to tell agg which columns to aggregate. There are two main ways to do this:

    • Select columns post resampling and aggregate on that column directly. This would go something like this:
      df = df.resample('W')['col_to_agg_on'].agg('count')
      
    • The second option is to pass the columns as a dict to agg.
      df = df.resample('W').agg({"col_name": 'count'})
      

    This is a problem because both the choice of columns and the choice of aggregation operation (not all metrics use 'count') are very specific to the metric being implemented.
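The two options can be compared side by side in a small sketch. The column names `hash` and `lines_added` and the sample values are hypothetical, chosen only to show that different metrics need different columns and operations.

```python
import pandas as pd

# Hypothetical data with two columns that need different aggregations.
df = pd.DataFrame(
    {"hash": ["a1", "b2", "c3"], "lines_added": [10, 5, 7]},
    index=pd.to_datetime(["2019-06-03", "2019-06-05", "2019-06-12"]),
)

# Option 1: select a single column after resampling; the result is a Series.
by_selection = df.resample("W")["hash"].agg("count")

# Option 2: pass a dict mapping column names to operations;
# the result is a DataFrame with one column per dict key.
by_dict = df.resample("W").agg({"hash": "count", "lines_added": "sum"})

print(by_selection)
print(by_dict)
```

Either way, the column names and the operation have to come from somewhere, and a generic method in the root class cannot know them in advance.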

    Finally, we went with an implementation that uses an overridden _agg method, which returns an aggregated DataFrame. Issue #192 has been opened as a reminder to revisit it.
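A rough sketch of how such a design could look is given below. This is not the project's actual code: the subclass names, the constructor, the `period` parameter and the argument passed to `_agg` are all assumptions; only the names Metric, time_series and _agg come from the description above.

```python
import pandas as pd


class Metric:
    """Root class: owns the generic resampling logic."""

    def __init__(self, df):
        # Assumed: a DataFrame indexed by datetime.
        self.df = df

    def _agg(self, resampled):
        # Each metric decides which columns to aggregate and how.
        raise NotImplementedError

    def time_series(self, period="W"):
        # Shared by all metrics; only the aggregation step differs.
        return self._agg(self.df.resample(period))


class CommitCount(Metric):  # hypothetical metric
    def _agg(self, resampled):
        return resampled.agg({"hash": "count"})


class LinesAdded(Metric):  # hypothetical metric
    def _agg(self, resampled):
        return resampled.agg({"lines_added": "sum"})


data = pd.DataFrame(
    {"hash": ["a1", "b2", "c3"], "lines_added": [10, 5, 7]},
    index=pd.to_datetime(["2019-06-03", "2019-06-05", "2019-06-12"]),
)
counts = CommitCount(data).time_series("W")
added = LinesAdded(data).time_series("W")
print(counts)
print(added)
```

The trade-off in this kind of design is that every metric must override _agg, even when it only changes a column name, which is presumably part of why it is treated as temporary.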

  • Modify other metrics with the latest changes (#172 #180 #190)

    Updated other metrics with minor recent changes as well as the new implementation of compute_time_series.

  • Open the draft pull request created for Code_Changes_Lines for review (#190)
    Updated the draft pull request for Code_Changes_Lines with the latest changes and opened it for review.

  • Continue writing tests
    Though I added quite a few tests, I was unable to finish them and open a pull request for them.


Meeting details

The weekly IRC meeting with my mentors held yesterday (the 24th) is summarized below.

Agenda

  • Final design for compute_time_series.

Summary

  • Final design for compute_time_series
    As mentioned above, we decided to go with a different approach, which Jesus suggested. It involves working with DataFrame.resample(). We also decided to move the definition up in the hierarchy to the Metric (root) class.

  • Tasks for the next week

    • Add more metric reference implementations, specifically for issue and pull request related metrics

    • Continue with tests

    • Design a script that computes all metrics on the different kinds of data provided to it, such as issues, commits and pull requests

The log for this meeting can be found here.

The next meeting will be on the 5th of July.
For older GSoC posts, please click here.

I want to thank every member of the CHAOSS community for giving me this wonderful opportunity, and especially my mentors, who are very knowledgeable, committed (nothing to do with git!) and supportive.