The following exercises should be done in Python using the tools presented by Professor Arnold in class (numpy, scikit-learn, etc.). You have some options for how you do your work. You could use Google’s Colab like we did in class, or you could use reticulate to create an R Markdown document with Python chunks instead of R chunks.
Wordle is all the rage. Here is an exercise based on wordle. Read in the list of 5-letter words at https://raw.githubusercontent.com/seanpatlan/wordle-words/main/word-bank.csv. The first ten words are
aback abase abate abbey abbot abhor abide abled abode abort
If your list doesn’t match this, be sure to figure out what happened.
With this list of words find:
The total number of words in the bank.
url = 'https://raw.githubusercontent.com/seanpatlan/wordle-words/main/word-bank.csv'
import pandas as pd
import string
alphabet = string.ascii_lowercase
wordle = pd.read_csv(url, header = None) # no header in CSV, first row is a word
words = wordle[0] # grab first column (the list of words)
type(words)
## <class 'pandas.core.series.Series'>
len(words)
## 2315
All words that start with the letter “q”
[word for word in words if word[0] == 'q']
## ['quack', 'quail', 'quake', 'qualm', 'quark', 'quart', 'quash', 'quasi', 'queen', 'queer', 'quell', 'query', 'quest', 'queue', 'quick', 'quiet', 'quill', 'quilt', 'quirk', 'quite', 'quota', 'quote', 'quoth']
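The same filter can be written with the pandas `.str` accessor mentioned in the hints. A sketch, using a small hypothetical stand-in list rather than the full word bank:

```python
import pandas as pd

# hypothetical stand-in for the full word bank
words_demo = pd.Series(['quack', 'abate', 'queen', 'zebra'])

# .str.startswith builds a boolean mask; indexing with it keeps the matches
q_words = words_demo[words_demo.str.startswith('q')].tolist()
print(q_words)  # ['quack', 'queen']
```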
The number of words that have at least one repeated letter.
# count using set() to get the set of (unique) letters
len([word for word in words if len(set(word)) < 5])
## 749
# a bit fancier -- with more info
pd.Series([len(set(word)) for word in words]).value_counts()
## 5 1566
## 4 691
## 3 57
## 2 1
## dtype: int64
# let's see that word with only two letters
[word for word in words if len(set(word)) < 3]
## ['mamma']
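`collections.Counter` can show which letters repeat, not just that some letter does. A sketch on a few hypothetical stand-in words:

```python
from collections import Counter

# hypothetical stand-in words
sample = ['mamma', 'abate', 'crane']

for w in sample:
    # keep only letters that appear more than once
    repeats = {ltr: n for ltr, n in Counter(w).items() if n > 1}
    print(w, repeats)  # e.g. mamma {'m': 3, 'a': 2}
```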
For each letter of the alphabet, the number of words that contain that letter. BONUS: sort the list by frequency with most frequent letters first. Express as both a count and a proportion.
letter_freq = [(letter, words.str.contains(letter).sum()) for letter in alphabet]
letter_freq
## [('a', 909), ('b', 267), ('c', 448), ('d', 370), ('e', 1056), ('f', 207), ('g', 300), ('h', 379), ('i', 647), ('j', 27), ('k', 202), ('l', 648), ('m', 298), ('n', 550), ('o', 673), ('p', 346), ('q', 29), ('r', 837), ('s', 618), ('t', 667), ('u', 457), ('v', 149), ('w', 194), ('x', 37), ('y', 417), ('z', 35)]
sorted(letter_freq,
key = lambda x: x[1], # sort by the count
reverse = True) # highest to lowest
## [('e', 1056), ('a', 909), ('r', 837), ('o', 673), ('t', 667), ('l', 648), ('i', 647), ('s', 618), ('n', 550), ('u', 457), ('c', 448), ('y', 417), ('h', 379), ('d', 370), ('p', 346), ('g', 300), ('m', 298), ('b', 267), ('f', 207), ('k', 202), ('w', 194), ('v', 149), ('x', 37), ('z', 35), ('q', 29), ('j', 27)]
sorted(
[(letter, round(words.str.contains(letter).sum() / len(words), 3)) for letter in alphabet],
key = lambda x: x[1],
reverse = True)
## [('e', 0.456), ('a', 0.393), ('r', 0.362), ('o', 0.291), ('t', 0.288), ('l', 0.28), ('i', 0.279), ('s', 0.267), ('n', 0.238), ('u', 0.197), ('c', 0.194), ('y', 0.18), ('h', 0.164), ('d', 0.16), ('p', 0.149), ('g', 0.13), ('m', 0.129), ('b', 0.115), ('f', 0.089), ('k', 0.087), ('w', 0.084), ('v', 0.064), ('x', 0.016), ('z', 0.015), ('q', 0.013), ('j', 0.012)]
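The whole frequency table can also be built as a pandas Series in one pass, which makes the sorted counts and proportions one-liners. A sketch on a hypothetical stand-in list:

```python
import string
import pandas as pd

# hypothetical stand-in for the word bank
words_demo = pd.Series(['aback', 'abase', 'queen'])

# one entry per letter: how many words contain it, sorted most frequent first
counts = pd.Series(
    {ltr: int(words_demo.str.contains(ltr).sum()) for ltr in string.ascii_lowercase}
).sort_values(ascending=False)

# proportions come along for free
props = (counts / len(words_demo)).round(3)
print(counts['a'], props['a'])
```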
Ask another interesting question about wordle words and use Python to answer it.
Here are some of the questions people answered (though not always with the code the student actually used).
How many/which words don’t contain “traditional vowels” (AEIOU)?
Note the use of zip() here.
import numpy as np  # np is used below but was not imported earlier

# this only works if the guess doesn't have duplicate letters.
# For each word in word_list it returns a pair of numbers: the first is the
# number of letters in the correct position; the second is the number of
# letters of the guess that occur anywhere in that word.
def score_word(word, word_list):
    l = list(word)
    return list(zip(
        [np.sum([x == y for x, y in zip(l, list(w))]) for w in word_list],
        [np.sum([x in list(w) for x in l]) for w in word_list]
    ))
pd.Series(score_word('later', words)).value_counts()
round(pd.Series(score_word('later', words)).value_counts() / len(words) * 100, 2)
round(pd.Series(score_word('raise', words)).value_counts() / len(words) * 100, 2)
What are the Wordle palindromes?
[word for word in words if word[::-1] == word]
## ['civic', 'kayak', 'level', 'madam', 'minim', 'radar', 'refer', 'rotor', 'tenet']
[word for word in words if list(reversed(list(word))) == list(word)]
## ['civic', 'kayak', 'level', 'madam', 'minim', 'radar', 'refer', 'rotor', 'tenet']
Words with four or more vowels
vowels = 'aeiou'
[word for word in words if sum(letter in vowels for letter in word) >= 4]
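One subtlety worth noting: counting vowel occurrences is not the same as counting distinct vowels, and the choice changes which words qualify. A sketch with a hypothetical example word:

```python
vowels = set('aeiou')
word = 'eerie'  # hypothetical example with repeated vowels

occurrences = sum(ch in vowels for ch in word)  # every vowel position counts
distinct = len(set(word) & vowels)              # each vowel counted once
print(occurrences, distinct)  # 4 2
```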
Use list comprehension where appropriate.
Hints:
- Import string and use string.ascii_lowercase to get the letters of the alphabet.
- Use the .str accessor. See this documentation on working with strings in pandas. Note that you will need to convert your word list to a pandas Series of strings.
Load the data set at https://www.kaggle.com/datasets/CooperUnion/cardataset?select=data.csv. Use Python to answer the following:
How many rows and columns are in the data set?
Cars = pd.read_csv('data/kaggle-cars.csv')
Cars.shape
## (11914, 16)
What type of data is stored in each column?
Cars.dtypes
## Make object
## Model object
## Year int64
## Engine Fuel Type object
## Engine HP float64
## Engine Cylinders float64
## Transmission Type object
## Driven_Wheels object
## Number of Doors float64
## Market Category object
## Vehicle Size object
## Vehicle Style object
## highway MPG int64
## city mpg int64
## Popularity int64
## MSRP int64
## dtype: object
What is the average MSRP for all the cars in the data set?
round(Cars['MSRP'].mean(axis = 0), 2)
## 40594.74
Now compute the number of cars in the data set and the average MSRP for each make of car. List these from most expensive to least expensive.
round(Cars.groupby('Make').mean(), 2)
## Year Engine HP ... Popularity MSRP
## Make ...
## Acura 2010.06 244.80 ... 204.0 34887.59
## Alfa Romeo 2015.40 237.00 ... 113.0 61600.00
## Aston Martin 2013.02 484.32 ... 259.0 197910.38
## Audi 2012.23 277.70 ... 3105.0 53452.11
## BMW 2014.35 326.91 ... 3916.0 61546.76
## Bentley 2011.70 533.85 ... 520.0 247169.32
## Bugatti 2008.33 1001.00 ... 820.0 1757223.67
## Buick 2010.01 219.24 ... 155.0 28206.61
## Cadillac 2013.15 332.31 ... 1624.0 56231.32
## Chevrolet 2010.09 246.97 ... 1385.0 28350.39
## Chrysler 2008.46 229.14 ... 1013.0 26722.96
## Dodge 2005.44 244.42 ... 1851.0 22390.06
## FIAT 2016.03 143.56 ... 819.0 22670.24
## Ferrari 2007.52 511.96 ... 2774.0 238218.84
## Ford 2009.74 243.10 ... 5657.0 27399.27
## GMC 2009.13 259.84 ... 549.0 30493.30
## Genesis 2017.00 347.33 ... 21.0 46616.67
## HUMMER 2009.35 261.24 ... 130.0 36464.41
## Honda 2013.42 195.75 ... 2202.0 26674.34
## Hyundai 2012.70 201.92 ... 1439.0 24597.04
## Infiniti 2010.96 310.07 ... 190.0 42394.21
## Kia 2013.61 206.83 ... 1720.0 25310.17
## Lamborghini 2012.73 614.08 ... 1158.0 331567.31
## Land Rover 2012.67 322.10 ... 258.0 67823.22
## Lexus 2011.62 277.42 ... 454.0 47549.07
## Lincoln 2011.57 284.91 ... 61.0 42839.83
## Lotus 2010.83 275.97 ... 613.0 69188.28
## Maserati 2012.78 420.79 ... 238.0 114207.71
## Maybach 2010.94 590.50 ... 67.0 546221.88
## Mazda 2008.36 171.99 ... 586.0 20039.38
## McLaren 2014.20 610.40 ... 416.0 239805.00
## Mercedes-Benz 2010.52 350.18 ... 617.0 71476.23
## Mitsubishi 2009.16 173.43 ... 436.0 21240.54
## Nissan 2012.40 239.92 ... 2009.0 28583.43
## Oldsmobile 1997.61 177.47 ... 26.0 11542.54
## Plymouth 1995.46 131.56 ... 535.0 3122.90
## Pontiac 2003.10 190.30 ... 210.0 19321.55
## Porsche 2013.38 392.79 ... 1715.0 101622.40
## Rolls-Royce 2012.52 487.55 ... 86.0 351130.65
## Saab 2006.08 220.52 ... 376.0 27413.50
## Scion 2013.42 154.43 ... 105.0 19932.50
## Spyker 2009.00 400.00 ... 2.0 213323.33
## Subaru 2011.25 197.31 ... 640.0 24827.50
## Suzuki 2006.31 160.29 ... 481.0 17907.21
## Tesla 2015.28 NaN ... 1391.0 85255.56
## Toyota 2012.54 236.15 ... 2031.0 29030.02
## Volkswagen 2012.81 189.76 ... 873.0 28102.38
## Volvo 2008.06 230.97 ... 870.0 28541.16
##
## [48 rows x 8 columns]
round(Cars.groupby('Make').mean(), 2).sort_values(by = 'MSRP', ascending = False)
## Year Engine HP ... Popularity MSRP
## Make ...
## Bugatti 2008.33 1001.00 ... 820.0 1757223.67
## Maybach 2010.94 590.50 ... 67.0 546221.88
## Rolls-Royce 2012.52 487.55 ... 86.0 351130.65
## Lamborghini 2012.73 614.08 ... 1158.0 331567.31
## Bentley 2011.70 533.85 ... 520.0 247169.32
## McLaren 2014.20 610.40 ... 416.0 239805.00
## Ferrari 2007.52 511.96 ... 2774.0 238218.84
## Spyker 2009.00 400.00 ... 2.0 213323.33
## Aston Martin 2013.02 484.32 ... 259.0 197910.38
## Maserati 2012.78 420.79 ... 238.0 114207.71
## Porsche 2013.38 392.79 ... 1715.0 101622.40
## Tesla 2015.28 NaN ... 1391.0 85255.56
## Mercedes-Benz 2010.52 350.18 ... 617.0 71476.23
## Lotus 2010.83 275.97 ... 613.0 69188.28
## Land Rover 2012.67 322.10 ... 258.0 67823.22
## Alfa Romeo 2015.40 237.00 ... 113.0 61600.00
## BMW 2014.35 326.91 ... 3916.0 61546.76
## Cadillac 2013.15 332.31 ... 1624.0 56231.32
## Audi 2012.23 277.70 ... 3105.0 53452.11
## Lexus 2011.62 277.42 ... 454.0 47549.07
## Genesis 2017.00 347.33 ... 21.0 46616.67
## Lincoln 2011.57 284.91 ... 61.0 42839.83
## Infiniti 2010.96 310.07 ... 190.0 42394.21
## HUMMER 2009.35 261.24 ... 130.0 36464.41
## Acura 2010.06 244.80 ... 204.0 34887.59
## GMC 2009.13 259.84 ... 549.0 30493.30
## Toyota 2012.54 236.15 ... 2031.0 29030.02
## Nissan 2012.40 239.92 ... 2009.0 28583.43
## Volvo 2008.06 230.97 ... 870.0 28541.16
## Chevrolet 2010.09 246.97 ... 1385.0 28350.39
## Buick 2010.01 219.24 ... 155.0 28206.61
## Volkswagen 2012.81 189.76 ... 873.0 28102.38
## Saab 2006.08 220.52 ... 376.0 27413.50
## Ford 2009.74 243.10 ... 5657.0 27399.27
## Chrysler 2008.46 229.14 ... 1013.0 26722.96
## Honda 2013.42 195.75 ... 2202.0 26674.34
## Kia 2013.61 206.83 ... 1720.0 25310.17
## Subaru 2011.25 197.31 ... 640.0 24827.50
## Hyundai 2012.70 201.92 ... 1439.0 24597.04
## FIAT 2016.03 143.56 ... 819.0 22670.24
## Dodge 2005.44 244.42 ... 1851.0 22390.06
## Mitsubishi 2009.16 173.43 ... 436.0 21240.54
## Mazda 2008.36 171.99 ... 586.0 20039.38
## Scion 2013.42 154.43 ... 105.0 19932.50
## Pontiac 2003.10 190.30 ... 210.0 19321.55
## Suzuki 2006.31 160.29 ... 481.0 17907.21
## Oldsmobile 1997.61 177.47 ... 26.0 11542.54
## Plymouth 1995.46 131.56 ... 535.0 3122.90
##
## [48 rows x 8 columns]
Cars['Make'].value_counts()
## Chevrolet 1123
## Ford 881
## Volkswagen 809
## Toyota 746
## Dodge 626
## Nissan 558
## GMC 515
## Honda 449
## Mazda 423
## Cadillac 397
## Mercedes-Benz 353
## Suzuki 351
## BMW 334
## Infiniti 330
## Audi 328
## Hyundai 303
## Volvo 281
## Subaru 256
## Acura 252
## Kia 231
## Mitsubishi 213
## Lexus 202
## Buick 196
## Chrysler 187
## Pontiac 186
## Lincoln 164
## Oldsmobile 150
## Land Rover 143
## Porsche 136
## Saab 111
## Aston Martin 93
## Plymouth 82
## Bentley 74
## Ferrari 69
## FIAT 62
## Scion 60
## Maserati 58
## Lamborghini 52
## Rolls-Royce 31
## Lotus 29
## Tesla 18
## HUMMER 17
## Maybach 16
## Alfa Romeo 5
## McLaren 5
## Spyker 3
## Genesis 3
## Bugatti 3
## Name: Make, dtype: int64
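The count and the average MSRP per make can also be produced in a single table with named aggregation via `agg`. A sketch, using a hypothetical toy frame standing in for the Kaggle data:

```python
import pandas as pd

# hypothetical toy frame standing in for the Kaggle cars data
toy = pd.DataFrame({
    'Make': ['Ford', 'Ford', 'Bugatti'],
    'MSRP': [20000, 30000, 1500000],
})

# count and mean in one table, sorted most expensive first
summary = (
    toy.groupby('Make')['MSRP']
       .agg(n='count', mean_msrp='mean')
       .sort_values('mean_msrp', ascending=False)
)
print(summary)
```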
We will use MSRP (manufacturer’s suggested retail price) as our response variable. Pick two pairs of explanatory variables that you think might help predict MSRP. Call them pair A and pair B. At least one of your pairs should include at least one categorical predictor.
Now split the data into two sets – a training set and a testing set. Fit two regression models (Model A and Model B) using the training data.
For each model, say what the fitted linear model equation is.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
y = Cars['MSRP']
X1 = Cars[['city mpg', 'Year']]
pd.DataFrame(Cars['city mpg'])
## city mpg
## 0 19
## 1 19
## 2 20
## 3 18
## 4 18
## ... ...
## 11909 16
## 11910 16
## 11911 16
## 11912 16
## 11913 17
##
## [11914 rows x 1 columns]
X2 = pd.get_dummies(Cars[['city mpg', 'Transmission Type']], drop_first = True)
X2.dtypes
## city mpg int64
## Transmission Type_AUTOMATIC uint8
## Transmission Type_DIRECT_DRIVE uint8
## Transmission Type_MANUAL uint8
## Transmission Type_UNKNOWN uint8
## dtype: object
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, train_size = 0.5, random_state = 12357)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, train_size = 0.5, random_state = 12357)
model1 = LinearRegression()
model2 = LinearRegression()
model1.fit(X1_train, y1_train)
## LinearRegression()
model2.fit(X2_train, y2_train)
## LinearRegression()
model1.intercept_, model1.coef_
## (-4174124.545896646, array([-1331.03290409, 2109.38710512]))
model2.intercept_, model2.coef_
## (144534.33040867696, array([ -2532.93889939, -55142.48340202, 184609.24935054,
## -67651.61088069, -110239.35944141]))
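To “say what the fitted linear model equation is,” the `intercept_` and `coef_` attributes can be formatted into a readable string. A sketch with hypothetical data chosen so the fit is exact; the `fmt_equation` helper is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data generated from y = 3 + 2*x0 - 1*x1, so the fit recovers it exactly
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

model = LinearRegression().fit(X, y)

def fmt_equation(model, names, response='y-hat'):
    # join each coefficient with its predictor name
    terms = ' + '.join(f'({c:.2f})*{n}' for c, n in zip(model.coef_, names))
    return f'{response} = {model.intercept_:.2f} + {terms}'

print(fmt_equation(model, ['x0', 'x1']))  # y-hat = 3.00 + (2.00)*x0 + (-1.00)*x1
```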
Use each model to make a prediction for the first row of the training data and for the first row of the test data.
X1_train.head(1)
## city mpg Year
## 4746 15 2013
X1_test.head(1)
## city mpg Year
## 2534 30 1995
X2_train.head(1)
## city mpg ... Transmission Type_UNKNOWN
## 4746 15 ... 0
##
## [1 rows x 5 columns]
X2_test.head(1)
## city mpg ... Transmission Type_UNKNOWN
## 2534 30 ... 0
##
## [1 rows x 5 columns]
model1.predict(X1_train.head(1))
## array([52106.20315366])
model1.predict(X1_test.head(1))
## array([-5828.25829995])
model2.predict(X2_train.head(1))
## array([38888.63603709])
model2.predict(X2_test.head(1))
## array([894.55254619])
Which model fits (all of) the training data better? What metric are you using?
We can use \(R^2\), the default score for a linear regression model.
model1.score(X1_train, y1_train)
## 0.10897950402340861
model2.score(X2_train, y2_train)
## 0.13870289051637064
Which model fits (all of) the test data better? Use the same metric as you used in the previous part.
model1.score(X1_test, y1_test)
## 0.084306337507145
model2.score(X2_test, y2_test)
## 0.13151232290340054
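\(R^2\) rewards the model that explains more variance; root-mean-squared error gives the same comparison in dollar units, which is often easier to interpret for MSRP. A sketch, with hypothetical arrays standing in for the test-set predictions:

```python
import numpy as np

# hypothetical true values and predictions standing in for a test split
y_true = np.array([20000.0, 30000.0, 40000.0])
y_pred = np.array([22000.0, 29000.0, 43000.0])

# root-mean-squared error: typical prediction miss, in the same units as MSRP
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(round(rmse, 2))  # 2160.25
```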
For more info about using Python from within R (including in RMarkdown documents), see https://rstudio.github.io/reticulate/.
Once configured, you can create python chunks the same way you create R chunks, and you can access (some kinds of) Python data in R and R data in Python. This allows you to go back and forth between the two languages, using whichever works better for you at a given moment.
Here’s a quick demo of what it looks like to use Python in RMarkdown. I’m using some CSS tricks to make it easier to see which chunks are R chunks and which are Python chunks. By default they are not labeled/colored this way.
Let’s begin by telling R we want to use python and check to see which python it is using.
Sys.unsetenv('RETICULATE_PYTHON') # you probably won't need this
library(reticulate)
use_condaenv(condaenv = 'r-reticulate') # and you won't need this on Calvin's server
py_config() # version on rstudio.calvin.edu is 3.8.5
## python: /Users/rpruim/Library/r-miniconda/envs/r-reticulate/bin/python
## libpython: /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/libpython3.10.dylib
## pythonhome: /Users/rpruim/Library/r-miniconda/envs/r-reticulate:/Users/rpruim/Library/r-miniconda/envs/r-reticulate
## version: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:45:10) [Clang 12.0.1 ]
## numpy: /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/python3.10/site-packages/numpy
## numpy_version: 1.21.5
##
## NOTE: Python version was forced by use_python function
Let’s give it a try on something simple:
squares = [i**2 for i in range(10)]
We can access squares in R like this:
sqrt(py$squares)
## [1] 0 1 2 3 4 5 6 7 8 9
gf_point(py$squares ~ 0:9)
We can move things the other way as well. What does an R data frame become in Python? A pandas data frame (provided pandas is installed).
Kf <- mosaicData::KidsFeet
r.Kf.shape
## (39, 8)
r.Kf.iloc[3]
## name Josh
## birthmonth 1
## birthyear 88
## length 25.2
## width 9.8
## sex B
## biggerfoot L
## domhand R
## Name: 3, dtype: object
You can use numpy, pandas, sklearn, etc. the usual way in python chunks.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
import xgboost as xgb
## /Library/Frameworks/R.framework/Versions/4.1/Resources/library/reticulate/python/rpytools/loader.py:39: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
## module = _import(
## /Users/rpruim/Library/r-miniconda/envs/r-reticulate/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
## from pandas import MultiIndex, Int64Index