D!rty word5

Even with somewhat recent text, Optical Character Recognition (OCR) has its issues. I have very little OCR experience, but much of my time with Open Durham was spent copying text about historic homes from PDFs of HAER records and National Register applications from the 1970s-1990s. We always had issues with the word story (which appears in every single house post) showing up as “s!ory”. I bet I’ve had to manually remove that damn exclamation point at least 500 times. Houses that were 1 and 1/2 stories turned into a jumbled mess. Having spent a lot of time reading old cursive, I knew going into this assignment that the ‘long s’ would prove difficult. And I knew that old traditional fonts were likely harder for a computer to decipher.
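For what it’s worth, a fix like the “s!ory” one could probably be scripted instead of done by hand 500 times. Here is a minimal Python sketch, assuming that an exclamation point wedged between two letters is always a misread “t”. That rule and the function name `fix_inword_exclamation` are my own illustration, not anything Open Durham actually used, and the rule would be wrong for cases where the “!” stands in for another letter.

```python
import re

# Assumption: a "!" sandwiched between two letters is a misread "t".
# That matches the "s!ory" case, but it would also rewrite text like "Stop!Go".
INWORD_BANG = re.compile(r"(?<=[A-Za-z])!(?=[A-Za-z])")


def fix_inword_exclamation(text: str) -> str:
    """Replace an exclamation point that sits between two letters with 't'."""
    return INWORD_BANG.sub("t", text)


if __name__ == "__main__":
    print(fix_inword_exclamation("A two-s!ory house. What a view!"))
```

An exclamation point at the end of a sentence is left alone because it is not followed by a letter, so ordinary punctuation should survive the cleanup.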

In choosing a piece of text to analyze, I had a very difficult time finding one whose plain text was enough to even go on. I had started with a late 1800s guide to communist societies in the United States written by a man who had visited them all and interviewed the members. That late in the century though, the long s/f was gone and the plain text, with only a few exceptions, was quite accurate. I needed to find something older, otherwise this would’ve been a very short post.

I have done a lot of research on race riots, resistance and labor riots, so I sought out an older book on the topic of riots. I found An Appeal to the Public, on the Subject of the Riots in Birmingham, written after the Birmingham riots of 1791 in England. Even deciphering the name of the author, Joseph Priestley, proved difficult. If I were trying to analyze this text and look at the ways in which riots were written about in different time periods and contexts, it wouldn’t be beneficial to use the plain text for this book, as there is so little to go by. Only roughly half of the text was actually “translated”, with huge chunks marked for OCR errors. I imagine OCR would be of little to no use on even earlier books about riots. Google Books also limited how much of the plain text I could view, which further constrained how much text I had to analyze.

In addition to the long s, the joining of “c” and “t” into a single “ct” character makes the text hard for an OCR program to read.


Entire sections would show up with the letters of words positioned like exponential powers or citation numbers.


When I used this sporadic plain text in Voyant, the four most common words that appeared were church, clergy, errors and OCR. The number of errors is such that their mention in the plain text turns up as a most-used word or phrase. Keep in mind that this was only the short 8-page section that Google allowed, yet the word cloud showed some other interesting words pop up, like behave, hope, committed, inferior, riots, government, dissenters and bigotry. Also on the list, though, are cannqt, ct, asld and bir.
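A rough way to see what the list would look like without the OCR debris is to count the words yourself and skip the known junk. This is a minimal Python sketch, not how Voyant works internally. The noise list is seeded with the garbled tokens mentioned above, the stop word list is a tiny placeholder, and `top_words` is a made-up name. Note that dropping “errors” is itself an assumption, since Priestley may well have used the word on purpose.

```python
import re
from collections import Counter

# Hypothetical noise list, seeded with the OCR debris named in the post.
OCR_NOISE = {"ocr", "errors", "cannqt", "ct", "asld", "bir"}
# Tiny placeholder stop word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "was", "it"}


def top_words(text, n=10, extra_noise=()):
    """Return the n most common words, ignoring stop words and OCR noise."""
    tokens = re.findall(r"[a-z]+", text.lower())
    skip = OCR_NOISE | STOP_WORDS | set(extra_noise)
    counts = Counter(token for token in tokens if token not in skip)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The church and the clergy. OCR errors OCR errors. church cannqt ct"
    print(top_words(sample, n=2))
```

Passing newly spotted junk through `extra_noise` keeps the base list untouched, which makes it easier to compare a cleaned count against the raw one.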


Google Ngram also showed some interesting trends related to writing about riots. I combined riots with a variety of words including race, lynching, communism, activism, NAACP and KKK. Going by those search terms, I thought that the time period of 1800-1950 would be most beneficial to look at. Race is always such a hard word to use in a search because of its various meanings, so I quickly dropped that search. No one really wrote about the KKK or the NAACP until mid-century, which I suppose is something. Also of note is that the word activism doesn’t show up until around the turn of the century. The comparison that I found most interesting, though, was the one between unions, communism and riots. Riots were written about on a consistent (albeit small) basis over that same time period. Communism was written about very little before the 1930s. You can almost see the slight tick up around WWI, the Bolshevik Revolution and the Red Scare, but it’s insignificant. Unions, however, have seen a massive increase in the amount written about the topic, starting around the labor movement in the late 1800s, with a fall right after the war.

Google Ngram comparison of communism, unions and riots, 1800-1950.

When I went to compare those same terms in HathiTrust, I used a longer time period.


This view shows similar trends overall but perhaps better highlights short periods of interest in subjects: you can visualize the ups and downs and see the fluctuations of interest in a way that was not evident in Ngram. Are these mini plateaus in the communism line in the 1800s perhaps related to biblical communism? Are the decreases in writing about unions related to the general economy at the time of writing? The hills of green that denote the mention of riots in the 1750s-1760s, and again around 1805-1810, might also be worth exploring deeper, even if perhaps not related to race (I’m assuming, perhaps I’m wrong).

I’ve used word cloud tools a bit for the race riot archive I created. I was interested in the words used in headlines written about the race riots of 1919. This was not built using OCR, though; rather, I copied and pasted the text from a spreadsheet I created that had a field for headlines. Beyond the headlines, though, I see countless opportunities for text analysis that could provide some interesting insight into the riots as well as the investigations and trials that followed. I’ve collected over 700 documents related to the riots, including court records, newspaper articles, telegrams and coroner’s reports. I haven’t gone down that path because I only have jpgs of the items and am really unsure how to use OCR. And the items in my research and archive all hover on a fine copyright and use line. But in a series of over three dozen race riots and lynchings in the US in such a short period of time, the newspapers, and the words they used, played an important role in heightening America’s fears and prejudices and, on some occasions, instigating violence. To be able to analyze the text to better prove that point would be amazing. Text analysis would also be useful for looking at the letters and documents of government officials in my collection, to compare how they reacted to (or wrote about) the riots with more recent writings about, say, Ferguson. Words are powerful, and tools like those explored this week have the potential to better show just how powerful. (As long as the text was written after the 1860s, and thus with limited problematic characters, but before 1923.)

Words that appeared in headlines about the riots of 1919 from a word cloud I made last year.


