The following exercises should be done in Python using the tools presented by Professor Arnold in class (numpy, scikit-learn, etc.). You have some options for how you do your work: you could use Google’s Colab as we did in class, or you could use reticulate to create an R Markdown document with Python chunks instead of R chunks.

Problem 1: Wordle Words

Wordle is all the rage. Here is an exercise based on Wordle. Read in the list of 5-letter words at https://raw.githubusercontent.com/seanpatlan/wordle-words/main/word-bank.csv. The first ten words are

aback abase abate abbey abbot abhor abide abled abode abort

If your list doesn’t match this, be sure to figure out what happened.
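If the first word is missing from your list, the most common culprit is forgetting `header = None` (pandas then consumes 'aback' as a column name). A minimal illustration of the pitfall, using a few hardcoded words in place of the real file:

```python
import io
import pandas as pd

# simulate the first few lines of word-bank.csv (which has no header row)
csv_text = "aback\nabase\nabate\n"

with_header = pd.read_csv(io.StringIO(csv_text))                  # wrong: 'aback' becomes the column name
without_header = pd.read_csv(io.StringIO(csv_text), header=None)  # right: all words kept as data

print(len(with_header))      # 2 -- first word lost to the header
print(len(without_header))   # 3
```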

With this list of words find:

  1. The total number of words in the bank.

    url = 'https://raw.githubusercontent.com/seanpatlan/wordle-words/main/word-bank.csv'
    import pandas as pd
    import string
    alphabet = string.ascii_lowercase
    
    wordle = pd.read_csv(url, header = None)  # no header in CSV, first row is a word
    words = wordle[0]                         # grab first column (the list of words)
    type(words)
    ## <class 'pandas.core.series.Series'>
    len(words)
    ## 2315
  2. All words that start with the letter “q”

    [word for word in words if word[0] == 'q']
    ## ['quack', 'quail', 'quake', 'qualm', 'quark', 'quart', 'quash', 'quasi', 'queen', 'queer', 'quell', 'query', 'quest', 'queue', 'quick', 'quiet', 'quill', 'quilt', 'quirk', 'quite', 'quota', 'quote', 'quoth']
  3. The number of words that have at least one repeated letter.

    # count using set() to get the set of (unique) letters
    len([word for word in words if len(set(word)) < 5])
    ## 749
    # a bit fancier -- with more info
    pd.Series([len(set(word)) for word in words]).value_counts()
    ## 5    1566
    ## 4     691
    ## 3      57
    ## 2       1
    ## dtype: int64
    # let's see the word with only two distinct letters
    [word for word in words if len(set(word)) < 3]
    ## ['mamma']
  4. For each letter of the alphabet, the number of words that contain that letter. BONUS: sort the list by frequency with most frequent letters first. Express as both a count and a proportion.

    letter_freq =  [(letter, words.str.contains(letter).sum()) for letter in alphabet]
    letter_freq
    ## [('a', 909), ('b', 267), ('c', 448), ('d', 370), ('e', 1056), ('f', 207), ('g', 300), ('h', 379), ('i', 647), ('j', 27), ('k', 202), ('l', 648), ('m', 298), ('n', 550), ('o', 673), ('p', 346), ('q', 29), ('r', 837), ('s', 618), ('t', 667), ('u', 457), ('v', 149), ('w', 194), ('x', 37), ('y', 417), ('z', 35)]
    sorted(letter_freq,
      key = lambda x: x[1],                   # sort by the count
      reverse = True)                         # highest to lowest
    
    ## [('e', 1056), ('a', 909), ('r', 837), ('o', 673), ('t', 667), ('l', 648), ('i', 647), ('s', 618), ('n', 550), ('u', 457), ('c', 448), ('y', 417), ('h', 379), ('d', 370), ('p', 346), ('g', 300), ('m', 298), ('b', 267), ('f', 207), ('k', 202), ('w', 194), ('v', 149), ('x', 37), ('z', 35), ('q', 29), ('j', 27)]
    sorted(
      [(letter, round(words.str.contains(letter).sum() / len(words), 3)) for letter in alphabet], 
      key = lambda x: x[1], 
      reverse = True)
    ## [('e', 0.456), ('a', 0.393), ('r', 0.362), ('o', 0.291), ('t', 0.288), ('l', 0.28), ('i', 0.279), ('s', 0.267), ('n', 0.238), ('u', 0.197), ('c', 0.194), ('y', 0.18), ('h', 0.164), ('d', 0.16), ('p', 0.149), ('g', 0.13), ('m', 0.129), ('b', 0.115), ('f', 0.089), ('k', 0.087), ('w', 0.084), ('v', 0.064), ('x', 0.016), ('z', 0.015), ('q', 0.013), ('j', 0.012)]
  5. Ask another interesting question about wordle words and use Python to answer it.

    Here are some of the questions people answered (but not always with the code that was used by the student).

    • How many/which words don’t contain “traditional vowels” (AEIOU)?

      Note the use of zip() here.

      # Score a guess against every word in the bank. This only works as
      # intended if the guess has no duplicate letters. For each word it
      # returns a pair of numbers: the first is the number of letters in the
      # correct position; the second is the number of letters that occur in
      # the word (in either correct or incorrect position).
      import numpy as np
      
      def score_word(word, word_list):
          l = list(word)
          return zip(
            [np.sum([x == y for x, y in zip(l, list(w))]) for w in word_list],
            [np.sum([x in list(w) for x in l]) for w in word_list]
            )
      
      pd.Series(score_word('later', words)).value_counts()
      round(pd.Series(score_word('later', words)).value_counts() / len(words) * 100, 2)
      round(pd.Series(score_word('raise', words)).value_counts() / len(words) * 100, 2)
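      The vowel question itself has a one-line answer. A sketch on a hypothetical mini bank (with the real data, use the `words` Series read above):

      ```python
      # hypothetical mini word bank; substitute the real `words` Series
      words = ['crypt', 'nymph', 'aback', 'queue', 'myrrh', 'abide']

      vowels = 'aeiou'
      no_vowel_words = [w for w in words if not any(v in w for v in vowels)]
      print(no_vowel_words)        # -> ['crypt', 'nymph', 'myrrh']  (y is allowed)
      print(len(no_vowel_words))   # -> 3
      ```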
    • What are the Wordle palindromes?

      [word for word in words if word[::-1] == word]    
      ## ['civic', 'kayak', 'level', 'madam', 'minim', 'radar', 'refer', 'rotor', 'tenet']
      [word for word in words if list(reversed(list(word))) == list(word)]
      ## ['civic', 'kayak', 'level', 'madam', 'minim', 'radar', 'refer', 'rotor', 'tenet']
    • Words with four or more vowels

      vowels = 'aeiou'
      # builtin sum() works here; no numpy needed
      [word for word in words if sum(letter in vowels for letter in word) >= 4]

Use list comprehension where appropriate.


Problem 2: Regression with some cars

Load the data set at https://www.kaggle.com/datasets/CooperUnion/cardataset?select=data.csv (download data.csv and save a local copy). Use Python to answer the following:

  1. How many rows and columns are in the data set?

    Cars = pd.read_csv('data/kaggle-cars.csv')   # local copy of the Kaggle data.csv
    Cars.shape
    ## (11914, 16)
  2. What type of data is stored in each column?

    Cars.dtypes
    ## Make                  object
    ## Model                 object
    ## Year                   int64
    ## Engine Fuel Type      object
    ## Engine HP            float64
    ## Engine Cylinders     float64
    ## Transmission Type     object
    ## Driven_Wheels         object
    ## Number of Doors      float64
    ## Market Category       object
    ## Vehicle Size          object
    ## Vehicle Style         object
    ## highway MPG            int64
    ## city mpg               int64
    ## Popularity             int64
    ## MSRP                   int64
    ## dtype: object
  3. What is the average MSRP for all the cars in the data set?

    round(Cars['MSRP'].mean(), 2)
    ## 40594.74
  4. Now compute the number of cars in the data set and the average MSRP for each make of car. List these from most expensive to least expensive.

    round(Cars.groupby('Make').mean(), 2)     # newer pandas: use mean(numeric_only = True)
    ##                   Year  Engine HP  ...  Popularity        MSRP
    ## Make                               ...                        
    ## Acura          2010.06     244.80  ...       204.0    34887.59
    ## Alfa Romeo     2015.40     237.00  ...       113.0    61600.00
    ## Aston Martin   2013.02     484.32  ...       259.0   197910.38
    ## Audi           2012.23     277.70  ...      3105.0    53452.11
    ## BMW            2014.35     326.91  ...      3916.0    61546.76
    ## Bentley        2011.70     533.85  ...       520.0   247169.32
    ## Bugatti        2008.33    1001.00  ...       820.0  1757223.67
    ## Buick          2010.01     219.24  ...       155.0    28206.61
    ## Cadillac       2013.15     332.31  ...      1624.0    56231.32
    ## Chevrolet      2010.09     246.97  ...      1385.0    28350.39
    ## Chrysler       2008.46     229.14  ...      1013.0    26722.96
    ## Dodge          2005.44     244.42  ...      1851.0    22390.06
    ## FIAT           2016.03     143.56  ...       819.0    22670.24
    ## Ferrari        2007.52     511.96  ...      2774.0   238218.84
    ## Ford           2009.74     243.10  ...      5657.0    27399.27
    ## GMC            2009.13     259.84  ...       549.0    30493.30
    ## Genesis        2017.00     347.33  ...        21.0    46616.67
    ## HUMMER         2009.35     261.24  ...       130.0    36464.41
    ## Honda          2013.42     195.75  ...      2202.0    26674.34
    ## Hyundai        2012.70     201.92  ...      1439.0    24597.04
    ## Infiniti       2010.96     310.07  ...       190.0    42394.21
    ## Kia            2013.61     206.83  ...      1720.0    25310.17
    ## Lamborghini    2012.73     614.08  ...      1158.0   331567.31
    ## Land Rover     2012.67     322.10  ...       258.0    67823.22
    ## Lexus          2011.62     277.42  ...       454.0    47549.07
    ## Lincoln        2011.57     284.91  ...        61.0    42839.83
    ## Lotus          2010.83     275.97  ...       613.0    69188.28
    ## Maserati       2012.78     420.79  ...       238.0   114207.71
    ## Maybach        2010.94     590.50  ...        67.0   546221.88
    ## Mazda          2008.36     171.99  ...       586.0    20039.38
    ## McLaren        2014.20     610.40  ...       416.0   239805.00
    ## Mercedes-Benz  2010.52     350.18  ...       617.0    71476.23
    ## Mitsubishi     2009.16     173.43  ...       436.0    21240.54
    ## Nissan         2012.40     239.92  ...      2009.0    28583.43
    ## Oldsmobile     1997.61     177.47  ...        26.0    11542.54
    ## Plymouth       1995.46     131.56  ...       535.0     3122.90
    ## Pontiac        2003.10     190.30  ...       210.0    19321.55
    ## Porsche        2013.38     392.79  ...      1715.0   101622.40
    ## Rolls-Royce    2012.52     487.55  ...        86.0   351130.65
    ## Saab           2006.08     220.52  ...       376.0    27413.50
    ## Scion          2013.42     154.43  ...       105.0    19932.50
    ## Spyker         2009.00     400.00  ...         2.0   213323.33
    ## Subaru         2011.25     197.31  ...       640.0    24827.50
    ## Suzuki         2006.31     160.29  ...       481.0    17907.21
    ## Tesla          2015.28        NaN  ...      1391.0    85255.56
    ## Toyota         2012.54     236.15  ...      2031.0    29030.02
    ## Volkswagen     2012.81     189.76  ...       873.0    28102.38
    ## Volvo          2008.06     230.97  ...       870.0    28541.16
    ## 
    ## [48 rows x 8 columns]
    round(Cars.groupby('Make').mean(), 2).sort_values(by = 'MSRP', ascending = False)
    ##                   Year  Engine HP  ...  Popularity        MSRP
    ## Make                               ...                        
    ## Bugatti        2008.33    1001.00  ...       820.0  1757223.67
    ## Maybach        2010.94     590.50  ...        67.0   546221.88
    ## Rolls-Royce    2012.52     487.55  ...        86.0   351130.65
    ## Lamborghini    2012.73     614.08  ...      1158.0   331567.31
    ## Bentley        2011.70     533.85  ...       520.0   247169.32
    ## McLaren        2014.20     610.40  ...       416.0   239805.00
    ## Ferrari        2007.52     511.96  ...      2774.0   238218.84
    ## Spyker         2009.00     400.00  ...         2.0   213323.33
    ## Aston Martin   2013.02     484.32  ...       259.0   197910.38
    ## Maserati       2012.78     420.79  ...       238.0   114207.71
    ## Porsche        2013.38     392.79  ...      1715.0   101622.40
    ## Tesla          2015.28        NaN  ...      1391.0    85255.56
    ## Mercedes-Benz  2010.52     350.18  ...       617.0    71476.23
    ## Lotus          2010.83     275.97  ...       613.0    69188.28
    ## Land Rover     2012.67     322.10  ...       258.0    67823.22
    ## Alfa Romeo     2015.40     237.00  ...       113.0    61600.00
    ## BMW            2014.35     326.91  ...      3916.0    61546.76
    ## Cadillac       2013.15     332.31  ...      1624.0    56231.32
    ## Audi           2012.23     277.70  ...      3105.0    53452.11
    ## Lexus          2011.62     277.42  ...       454.0    47549.07
    ## Genesis        2017.00     347.33  ...        21.0    46616.67
    ## Lincoln        2011.57     284.91  ...        61.0    42839.83
    ## Infiniti       2010.96     310.07  ...       190.0    42394.21
    ## HUMMER         2009.35     261.24  ...       130.0    36464.41
    ## Acura          2010.06     244.80  ...       204.0    34887.59
    ## GMC            2009.13     259.84  ...       549.0    30493.30
    ## Toyota         2012.54     236.15  ...      2031.0    29030.02
    ## Nissan         2012.40     239.92  ...      2009.0    28583.43
    ## Volvo          2008.06     230.97  ...       870.0    28541.16
    ## Chevrolet      2010.09     246.97  ...      1385.0    28350.39
    ## Buick          2010.01     219.24  ...       155.0    28206.61
    ## Volkswagen     2012.81     189.76  ...       873.0    28102.38
    ## Saab           2006.08     220.52  ...       376.0    27413.50
    ## Ford           2009.74     243.10  ...      5657.0    27399.27
    ## Chrysler       2008.46     229.14  ...      1013.0    26722.96
    ## Honda          2013.42     195.75  ...      2202.0    26674.34
    ## Kia            2013.61     206.83  ...      1720.0    25310.17
    ## Subaru         2011.25     197.31  ...       640.0    24827.50
    ## Hyundai        2012.70     201.92  ...      1439.0    24597.04
    ## FIAT           2016.03     143.56  ...       819.0    22670.24
    ## Dodge          2005.44     244.42  ...      1851.0    22390.06
    ## Mitsubishi     2009.16     173.43  ...       436.0    21240.54
    ## Mazda          2008.36     171.99  ...       586.0    20039.38
    ## Scion          2013.42     154.43  ...       105.0    19932.50
    ## Pontiac        2003.10     190.30  ...       210.0    19321.55
    ## Suzuki         2006.31     160.29  ...       481.0    17907.21
    ## Oldsmobile     1997.61     177.47  ...        26.0    11542.54
    ## Plymouth       1995.46     131.56  ...       535.0     3122.90
    ## 
    ## [48 rows x 8 columns]
    Cars['Make'].value_counts()
    ## Chevrolet        1123
    ## Ford              881
    ## Volkswagen        809
    ## Toyota            746
    ## Dodge             626
    ## Nissan            558
    ## GMC               515
    ## Honda             449
    ## Mazda             423
    ## Cadillac          397
    ## Mercedes-Benz     353
    ## Suzuki            351
    ## BMW               334
    ## Infiniti          330
    ## Audi              328
    ## Hyundai           303
    ## Volvo             281
    ## Subaru            256
    ## Acura             252
    ## Kia               231
    ## Mitsubishi        213
    ## Lexus             202
    ## Buick             196
    ## Chrysler          187
    ## Pontiac           186
    ## Lincoln           164
    ## Oldsmobile        150
    ## Land Rover        143
    ## Porsche           136
    ## Saab              111
    ## Aston Martin       93
    ## Plymouth           82
    ## Bentley            74
    ## Ferrari            69
    ## FIAT               62
    ## Scion              60
    ## Maserati           58
    ## Lamborghini        52
    ## Rolls-Royce        31
    ## Lotus              29
    ## Tesla              18
    ## HUMMER             17
    ## Maybach            16
    ## Alfa Romeo          5
    ## McLaren             5
    ## Spyker              3
    ## Genesis             3
    ## Bugatti             3
    ## Name: Make, dtype: int64
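    The count and the average MSRP per make can also be computed in one pass with a named aggregation (a sketch on a hypothetical mini data set; with the real data, use `Cars` as read above):

    ```python
    import pandas as pd

    # hypothetical mini stand-in for the Kaggle cars data
    Cars = pd.DataFrame({
        'Make': ['Bugatti', 'Ford', 'Ford', 'Mazda'],
        'MSRP': [1757224, 27000, 28000, 20000],
    })

    # number of cars and mean MSRP per make, most expensive first
    summary = (Cars.groupby('Make')['MSRP']
                   .agg(n = 'count', avg_msrp = 'mean')
                   .sort_values('avg_msrp', ascending = False))
    print(summary)
    ```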
More info: https://pythonexamples.org/pandas-dataframe-sort-by-column/

We will use MSRP (manufacturer’s suggested retail price) as our response variable. Pick two pairs of explanatory variables that you think might help predict MSRP. Call them pair A and pair B. At least one of your pairs should include at least one categorical predictor.

Now split the data into two sets – a training set and a testing set. Fit two regression models (Model A and Model B) using the training data.

  1. For each model, say what the fitted linear model equation is.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    
    y = Cars['MSRP']
    X1 = Cars[['city mpg', 'Year']]
    pd.DataFrame(Cars['city mpg'])            # a quick look at one of the predictors
    ##        city mpg
    ## 0            19
    ## 1            19
    ## 2            20
    ## 3            18
    ## 4            18
    ## ...         ...
    ## 11909        16
    ## 11910        16
    ## 11911        16
    ## 11912        16
    ## 11913        17
    ## 
    ## [11914 rows x 1 columns]
    X2 = pd.get_dummies(Cars[['city mpg', 'Transmission Type']], drop_first = True)
    X2.dtypes
    ## city mpg                          int64
    ## Transmission Type_AUTOMATIC       uint8
    ## Transmission Type_DIRECT_DRIVE    uint8
    ## Transmission Type_MANUAL          uint8
    ## Transmission Type_UNKNOWN         uint8
    ## dtype: object
    X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, train_size = 0.5, random_state = 12357)
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, train_size = 0.5, random_state = 12357)
    model1 = LinearRegression()
    model2 = LinearRegression()
    model1.fit(X1_train, y1_train)
    ## LinearRegression()
    model2.fit(X2_train, y2_train)
    ## LinearRegression()
    model1.intercept_, model1.coef_
    ## (-4174124.545896646, array([-1331.03290409,  2109.38710512]))
    model2.intercept_, model2.coef_
    ## (144534.33040867696, array([  -2532.93889939,  -55142.48340202,  184609.24935054,
    ##         -67651.61088069, -110239.35944141]))
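    To actually write out the fitted equation, you can assemble it from `intercept_` and `coef_`. A sketch on synthetic data (the variable names here are hypothetical; the data are built so the fit recovers the coefficients exactly):

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # synthetic data generated from y = 3 + 2*x1 - 1*x2
    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
    y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

    model = LinearRegression().fit(X, y)

    # build the equation string from the fitted intercept and coefficients
    names = ['x1', 'x2']
    terms = ' + '.join(f'({b:.2f}) * {n}' for b, n in zip(model.coef_, names))
    print(f'y-hat = {model.intercept_:.2f} + {terms}')
    # y-hat = 3.00 + (2.00) * x1 + (-1.00) * x2
    ```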
  2. Use each model to make a prediction for the first row of the training data and for the first row of the test data.

    X1_train.head(1)
    ##       city mpg  Year
    ## 4746        15  2013
    X1_test.head(1)
    ##       city mpg  Year
    ## 2534        30  1995
    X2_train.head(1)
    ##       city mpg  ...  Transmission Type_UNKNOWN
    ## 4746        15  ...                          0
    ## 
    ## [1 rows x 5 columns]
    X2_test.head(1)
    ##       city mpg  ...  Transmission Type_UNKNOWN
    ## 2534        30  ...                          0
    ## 
    ## [1 rows x 5 columns]
    model1.predict(X1_train.head(1))
    ## array([52106.20315366])
    model1.predict(X1_test.head(1))
    ## array([-5828.25829995])
    model2.predict(X2_train.head(1))
    ## array([38888.63603709])
    model2.predict(X2_test.head(1))
    ## array([894.55254619])
  3. Which model fits (all of) the training data better? What metric are you using?

    We can use R², the default score for a linear regression model:

    model1.score(X1_train, y1_train)
    ## 0.10897950402340861
    model2.score(X2_train, y2_train)
    ## 0.13870289051637064
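    The `score()` method here is the coefficient of determination, R² = 1 - SS_res/SS_tot. A quick check on synthetic data that it matches the formula and sklearn's `r2_score`:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # synthetic regression data with known coefficients plus noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    manual_r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

    print(np.isclose(model.score(X, y), manual_r2))          # True
    print(np.isclose(model.score(X, y), r2_score(y, pred)))  # True
    ```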
  4. Which model fits (all of) the test data better? Use the same metric as you used in the previous part.

    model1.score(X1_test, y1_test)
    ## 0.084306337507145
    model2.score(X2_test, y2_test)
    ## 0.13151232290340054

Setting up Python in R Markdown

For more info about using Python from within R (including in R Markdown documents), see https://rstudio.github.io/reticulate/.

Once configured, you can create python chunks the same way you create R chunks, and you can access (some kinds of) Python data in R and R data in Python. This allows you to go back and forth between the two languages, using whichever works better for you at a given moment.

Here’s a quick demo of what it looks like to use Python in R Markdown. I’m using some CSS tricks to make it easier to see which chunks are R chunks and which are Python chunks. By default they are not labeled/colored this way.

Let’s begin by telling R that we want to use Python and checking which Python it is using.

Sys.unsetenv('RETICULATE_PYTHON')        # you probably won't need this
library(reticulate)
use_condaenv(condaenv = 'r-reticulate')  # and you won't need this on Calvin's server
py_config()                              # version on rstudio.calvin.edu is 3.8.5
## python:         /Users/rpruim/Library/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/libpython3.10.dylib
## pythonhome:     /Users/rpruim/Library/r-miniconda/envs/r-reticulate:/Users/rpruim/Library/r-miniconda/envs/r-reticulate
## version:        3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:45:10) [Clang 12.0.1 ]
## numpy:          /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/python3.10/site-packages/numpy
## numpy_version:  1.21.5
## 
## NOTE: Python version was forced by use_python function

Let’s give it a try on something simple:

squares = [i**2 for i in range(10)]

We can access squares in R like this:

sqrt(py$squares)
##  [1] 0 1 2 3 4 5 6 7 8 9
gf_point(py$squares ~ 0:9)

We can move things the other way as well. What does an R data frame become in Python? A pandas data frame (provided pandas is installed).

Kf <- mosaicData::KidsFeet
r.Kf.shape
## (39, 8)
r.Kf.iloc[3]
## name          Josh
## birthmonth       1
## birthyear       88
## length        25.2
## width          9.8
## sex              B
## biggerfoot       L
## domhand          R
## Name: 3, dtype: object

You can use numpy, pandas, sklearn, etc. the usual way in python chunks.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
import xgboost as xgb
## /Library/Frameworks/R.framework/Versions/4.1/Resources/library/reticulate/python/rpytools/loader.py:39: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
##   module = _import(
## /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
##   from pandas import MultiIndex, Int64Index