Notes on Data Distribution

Before diving too far into the visualizations, I want to note that any analysis based on where Cather was writing from requires an understanding of how the letters are distributed across locations. The simple summary statistics are quite revealing. Across the 89 origin places in the data set, Cather wrote from New York 1,967 times (the maximum), but the mean number of letters per location is only about 36 (35.55), and the median is 2. This uneven distribution of origin places does not necessarily affect the maps showing relationships between places and who Cather was writing to or about. It does, however, affect the text analysis methods that rely on word features, such as classification and sentiment analysis.

##  formatted_name           n          
##  Length:89          Min.   :   1.00  
##  Class :character   1st Qu.:   1.00  
##  Mode  :character   Median :   2.00  
##                     Mean   :  35.55  
##                     3rd Qu.:   7.00  
##                     Max.   :1967.00
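To make the gap between mean and median concrete, here is a minimal sketch of the same kind of summary. It is written in Python purely for illustration (the output above comes from R), and the `origins` list is made-up toy data, not the Cather letters.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical origin places, one entry per letter (toy data, not the real set).
origins = (
    ["New York"] * 12
    + ["Red Cloud"] * 3
    + ["Jaffrey"] * 2
    + ["Paris", "Rome", "Boston"]
)

counts = Counter(origins)        # number of letters per origin place
n = sorted(counts.values())      # per-place counts, smallest first

summary = {
    "min": n[0],
    "median": median(n),
    "mean": mean(n),
    "max": n[-1],
}
print(summary)
```

Even in this toy set, one dominant place pulls the mean well above the median, which is the same pattern the real summary shows.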

Therefore, the analyses that use more traditional text analysis methods sample the letters to mitigate the uneven distribution, while the mapping visualizations use all the letters currently available.
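One simple way to do this kind of sampling is to cap the number of letters taken from any single place. The sketch below is an assumption about how that could look, not a record of the exact sampling scheme used here. It is in Python for illustration, and the function name `sample_per_location`, the `cap` parameter, and the letter ids are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_location(letters, cap, seed=0):
    """Keep at most `cap` randomly chosen letters from each origin place."""
    rng = random.Random(seed)            # fixed seed so the sample is repeatable
    by_place = defaultdict(list)
    for letter_id, place in letters:
        by_place[place].append(letter_id)

    sampled = []
    for place, ids in by_place.items():
        # Small places are kept whole; large places are down-sampled.
        chosen = ids if len(ids) <= cap else rng.sample(ids, cap)
        sampled.extend((letter_id, place) for letter_id in chosen)
    return sampled

# Toy data: 50 letters from one place, 2 from another.
letters = [(f"ny{i}", "New York") for i in range(50)]
letters += [("rc1", "Red Cloud"), ("rc2", "Red Cloud")]

balanced = sample_per_location(letters, cap=5)
```

A cap keeps rarely used places intact while preventing a place like New York from dominating the word features.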

Performing sentiment analysis on letter text and mapping to locations

Below, I explore using text analysis methods to gain insight into differences in the word features present in the letters based on where Cather was writing from.

It should be noted that the sentiment analysis I perform is very crude and is only done to see if there are relationships worth exploring further. The thought is that if a crude approach shows interesting relationships starting to form, then it may be worth digging in with more robust approaches.

##         unclass(summary(mean_sent_per_loc$sent))
## Min.                                 -0.19130435
## 1st Qu.                               0.05032077
## Median                                0.07526535
## Mean                                  0.07441489
## 3rd Qu.                               0.10413914
## Max.                                  0.21505268
## NA's                                  1.00000000
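For readers who want a sense of what a crude, lexicon-based score per location can look like, here is a minimal sketch. It is an illustration in Python, not the procedure that produced the numbers above: the tiny word lists, the scoring formula, and the function names are all assumptions.

```python
from collections import defaultdict

# Tiny, made-up lexicon; real work would use an established sentiment lexicon.
POSITIVE = {"happy", "lovely", "glad", "good"}
NEGATIVE = {"ill", "tired", "sad", "bad"}

def letter_score(text):
    """(positive - negative) / total tokens; None for an empty letter."""
    tokens = text.lower().split()
    if not tokens:
        return None
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def mean_sentiment_per_location(letters):
    """Average the per-letter scores for each origin place."""
    scores = defaultdict(list)
    for place, text in letters:
        s = letter_score(text)
        if s is not None:
            scores[place].append(s)
    return {place: sum(v) / len(v) for place, v in scores.items()}

# Toy letters, not real text from the collection.
letters = [
    ("New York", "glad and happy today"),
    ("New York", "tired but good"),
    ("Red Cloud", "sad and ill"),
]
result = mean_sentiment_per_location(letters)
```

Normalizing by letter length keeps long letters from dominating, though it also means scores stay close to zero, as they do in the summary above.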

I’m not seeing anything terribly interesting from the crude sentiment analysis, other than that Cather’s language seems to skew positive (at least when considering non-sparsely used terms and an evenly distributed data set). That said, I do think mapping degrees of sentiment to the locations Cather was writing from can lead to more targeted analysis of the letters. For instance, selecting the observations at or below the first quartile (q1) and those from the third quartile up to the maximum (q4) gives a look at the extremes of the set. One could then ask whether the letters written from the places at these extremes actually agree with those assessments, whether they skew more positive or more negative.
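Picking out those two extremes is a short filtering step. The sketch below shows one way to do it in Python, using made-up place labels and scores rather than the real values. I use the "inclusive" quantile method, which to my understanding interpolates the same way as R’s default.

```python
from statistics import quantiles

# Hypothetical mean sentiment per origin place (placeholder labels and values).
mean_sent = {
    "A": -0.19, "B": 0.02, "C": 0.05, "D": 0.07,
    "E": 0.08, "F": 0.10, "G": 0.12, "H": 0.21,
}

# Quartile cut points: first quartile, median (unused), third quartile.
q1, _, q3 = quantiles(mean_sent.values(), n=4, method="inclusive")

# Places at or below the first quartile, and at or above the third.
lowest = sorted(p for p, s in mean_sent.items() if s <= q1)
highest = sorted(p for p, s in mean_sent.items() if s >= q3)

print(lowest, highest)
```

The letters from the places in `lowest` and `highest` are the ones worth rereading to check whether the scores match a human reading.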

Performing classification and clustering to see if word features help categorize letters sent from unknown places

Another “text analysis” type of approach I wanted to try was classification. Because the Cather letters include a few handfuls of letters sent from unknown places, I thought they were good fodder for classification. As with the sentiment analysis, the approach taken here is crude, and I am working with a limited version of an already limited data set. That said, the classification performed okay on the training set, with an accuracy of about 75% (116 of 155 letters classified correctly). Were I to truly pursue this effort, though, I would certainly want to do some validation, such as k-fold cross-validation, and test a couple of different models. For now, you can see the results of the classification below:

## # A tibble: 2 x 2
## # Groups:   correct [2]
##   correct     n
##   <lgl>   <int>
## 1 FALSE      39
## 2 TRUE      116
##                       pred_sample_test_svm                   original
## 136     red cloud, nebraska, united states unknown, triangle, bermuda
## 137     red cloud, nebraska, united states unknown, triangle, bermuda
## 138     red cloud, nebraska, united states unknown, triangle, bermuda
## 139 albuquerque, new mexico, united states unknown, triangle, bermuda
## 140     red cloud, nebraska, united states unknown, triangle, bermuda
##                         letter
## 136 source/tei/cat.let1807.xml
## 137 source/tei/cat.let2742.xml
## 138 source/tei/cat.let1786.xml
## 139 source/tei/cat.let2550.xml
## 140 source/tei/cat.let1870.xml

Running the classifier against the five letters of unknown origin assigns four of them to Red Cloud, Nebraska and one to Albuquerque, New Mexico. This result provides an interesting jumping-off point for looking more closely at the language of those letters to see whether the origin places the model guessed make sense.
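The k-fold validation mentioned earlier could be sketched roughly as follows. To keep the example self-contained I swap in a small multinomial naive Bayes classifier written from scratch in Python. That is not the model behind the results above, whose column name suggests an SVM. All function names and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def k_fold_accuracy(docs, k=5, seed=0):
    """Mean held-out accuracy over k folds (assumes len(docs) >= k)."""
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        model = train_nb(train)
        correct = sum(predict_nb(model, toks) == label for toks, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Toy, trivially separable documents (not real letter text).
docs = [(["prairie", "farm", "mother"], "red cloud")] * 10
docs += [(["publisher", "proofs", "opera"], "new york")] * 10

accuracy = k_fold_accuracy(docs, k=5)
```

Unlike accuracy on the training set, each fold here is scored on letters the model did not see during fitting, which gives a more honest picture of how well the guesses for unknown places might hold up.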

For reference, I provide the top 10 words in the letters below to show to some degree what word tokens influenced the classification and sentiment analysis most heavily.
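A top-10 word list of this kind could be computed roughly as shown in the sketch below. It is written in Python for illustration. The stopword list, the tokenizing pattern, and the sample sentences are all assumptions and are far simpler than anything one would use on the real letters.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "to", "of", "i", "in", "is", "it"}

def top_words(texts, n=10):
    """Return the n most frequent non-stopword tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Toy sentences, not real letter text.
letters = [
    "The book is finished.",
    "I sent the book to the publisher.",
    "The publisher liked the book.",
]
top = top_words(letters, n=2)
print(top)
```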

Mapping people mentioned and recipients along letter origin locations

The next section of the analysis takes the top 10 people mentioned in the letters and the top 10 recipients Cather wrote to, and plots each group on its own map. Users can then compare and contrast who Cather wrote to and about most frequently from each of the places she wrote from.
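Before anything can be plotted, the letters have to be tallied by origin place and person. Here is a minimal sketch of that aggregation step in Python, with placeholder names standing in for real people. The function name `top_by_origin` and the record layout are assumptions, and the mapping itself is left out.

```python
from collections import Counter

def top_by_origin(records, n=10):
    """Count (origin, person) pairs, restricted to the n most frequent people."""
    totals = Counter(person for _, person in records)
    top = {person for person, _ in totals.most_common(n)}
    return Counter(
        (origin, person) for origin, person in records if person in top
    )

# Toy (origin place, person) records with placeholder names.
records = [
    ("New York", "Person A"), ("New York", "Person A"),
    ("Red Cloud", "Person A"), ("New York", "Person B"),
    ("Red Cloud", "Person C"),
]
tally = top_by_origin(records, n=1)
print(tally)
```

The same function works for recipients or for people mentioned, depending on which list of records is passed in, and each count can then size a marker at the origin place’s coordinates.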

Below those maps is another map that juxtaposes people she wrote about and to in order to see if there are any interesting spatial relationships that emerge when comparing writing about someone with writing to someone.

Top 10 People Mentioned in Cather’s Letters Mapped by Letter Origin Place

Top 10 Recipients of Cather’s Letters Mapped by Letter Origin Place