Digital History

September 6, 2020 – Module Two: Open Refine Struggles

Potentially the most important anecdote from my experience with Open Refine over the last few days is this: it took me over twenty minutes to figure out how to split the columns with the dates into two separate ones. Yes, over twenty minutes, though I didn’t time it exactly. My issue was that I continuously clicked on “<Edit Cells” instead of “<Edit Columns” and did not understand why I was not seeing what I should be. So, I kept clicking around Open Refine, expecting something to change, googling how to separate columns, and STILL NOT NOTICING I WAS TRYING TO EDIT CELLS AND NOT COLUMNS. Until I did realize, and quickly separated the dates into two columns, roughly birth dates and death dates, though not entirely, of course, and chastised myself for being daft. 

Once I finally separated the columns with the dates, I attempted to figure out how many people were born in 1533, the same year as Queen Elizabeth I. Using text filter, I typed in “1533” and received 34 results, but some included the initials “fl.” and “d.” which meant ‘flourished’ and ‘death,’ and obviously did not align with the birth year. Then, I used facet to break down the different hits.

With this breakdown, I established that seventeen people were born in 1533, and eight others were born either in 1533 or 1534. Including 1533/4, twenty-five people in the dataset were born the same year as Queen Elizabeth I.

Next, I needed to figure out how many people were born between 1533, Queen Elizabeth’s birth year, and 1665, the year of the initial outbreak of the plague. Using the above technique to find out the amount of people born in this time period of 132 years felt like a bad use of Open Refine, but figuring out another method took playing around. I thought to use facet once more to have all the different years in the dataset displayed, which still would have been hard to work with, but that didn’t work anyway.

Understandable. That would have involved too much counting and addition as well, but I didn’t know where to go from here yet. 

I attempted to separate the Birthish rows to pull out the ones with ‘fl’ and ‘d’ in the column, since those were not birth years. Attempting to separate with ‘fl’ as the separator did not do what I wanted it to do, so I kept thinking, kept clicking, kept hoping. After googling around – which I stand by, as a digital historian – I couldn’t get any clear cut answer on how to remove certain rows after filtering out the ‘fl’ and the ‘d’ rows. But, thinking there had to be a way to do it with the filter option, I filtered out the ‘fl’ rows once more. Clicking all, I flagged the rows, as I saw on articles from google, and deleted the rows with flourish in them.

I went to do the same with rows with ‘d’ for death in the birth-ish column, but when I filtered down to those cells, I realized that would be removing rows that also included those that still had birth years and baptismal years that had not separated since there was no dash in the date. What to do now? I have no idea please help.

* Trying to remove the rows in ‘birth-ish’ column that have death years of people – but not remove their birth information if it’s included * 

I went to separate the ‘birth-ish’ column to extract the death info, but by trying to separate with ‘d’ as the separator it removed that initial and the data just confused me once more. I then separated the columns with a comma, since that was the other separator that had been used instead of a dash. Sorting through the pages, the now ‘birth-ish 2’ column was the death years that had not made it to ‘death-ish.’ Now I needed to join ‘birth-ish 2’ and ‘death-ish,’ which I did, removing the original columns and renaming the resulting one ‘death-ish.’

I then sorted the ‘birth-ish’ row to get everyone’s birth year in ascending order. Those with birth years starting in 1533 started in row 298 and those with birth year 1665 ended in row 3119, giving me 2881 people born between 1533 and 1665. 

Leave a Reply

Your email address will not be published. Required fields are marked *