A couple of weeks ago I needed to parse the gmail contacts of my local magician’s club (GMK – Göteborgs Magiska Klubb) of which I belong to. My task was to create a neat list of names along with their aliases.
Simple enough, right? Nah, nothing can be just simple, we all know that. However, it’s not an impossible mission, but harder than it should be. Let me explain:
The reason it was more difficult that it has to be, is that the data probably was imported from another email client at some point in time. The data was jumbled, contained weird characters and duplicate names, making a parsing and filtering necessary. There were over 100 entries, so I had to do this using a script. My solution is OK, but not perfect. There were some entries that I manually corrected, because there were so few of them.
So, in this post I will describe which steps I had to take to create my list.
Exporting the contacts
The first thing I did was to export the gmail contact list to an Excel file. In “Open Office” Calc I exported the list to a CSV file, because Matlab has a nice import function called xlsread, which just seemed to fit my needs.
When importing the data from CSV, we are getting the information stored in a cell. It is very simple to extract this data, but I did encounter some issues. One of the issues was that the columns of excel contained repeated strings of the contact’s names. The family names, aliases surnames etc. were also jumbled up for some reason.
To import the CSV file I just run:
[num_data text_data] = xlsread(filename);
We can look at the first row to see what the columns mean:
>> {text_data{1,:}}’
ans =
‘Name’
‘Given’
‘Name’
‘Additional’
‘Name’
‘Family’
‘Name’
…
‘Job’
‘Description’
‘Website’
”
”
‘Type’
‘Website’
”
”
‘Value’
Ok, so there are a lot of nonsense in the csv that we do not need. We can get the first names in a column. For instance, your list may look like:
{text_data{:,1}}’
‘David Copperfield’
‘Dai “the Professor” Vernon’
‘Lennart Green’
…
As you can see, Dai Vernon has the alias “the professor”, so we should take care of this case because I noticed that each of these entries, with aliases, contains extra names distributed over the first 6 columns. I knew some of the aliases/”stage names”, but not their real names. At this point I think about the Tom Cruise movie “Mission Impossible” (1996) and the “NOC list” (Non-Official Cover). I actually like that movie. If you remember, if the NOC list fell into the wrong hands, then the lives of all the agents would be in danger. As for my list, it doesn’t matter, but instead of showing the real members I had Lennart Green (actual member) and also added Dai Vernon and David Copperfield, even though they have never been members of the club, at least not that I’m aware of.
Two simple rules
For instance, the names could be spread out in the csv entries like this:
{David “David Copperfield” Kotkin, David, Kotkin}
{David, “David, Copperfield”, Kotkin, David, Copperfield}
{David, Kotkin, David, Kotkin}
In the example above we see duplicate names, full name in the first entry and a case where the name is just repeated in several columns.
I realized that we have in reality two cases:
1. If the first column contains a quotation mark, we only use that entry
2. Else, we have to use the other entries and remove duplicates
So, to see if the first entry contains quotation marks I use findstr and check the resulting vector’s length. I also have to reverse the order of the data, because I start with the last name and alias and end with first name. This is just the way I wanted my list. In the following code I define fstcolumn as the cell of the first column of text_data. The code is inside a for loop increasing the variable i for all entries.
if length(findstr(fstcolumn{i}, '"')) >= 2
tmp = strsplit(text_data{i,1}, '"');
tmp = tmp(3:-1:1);
tmp{2,1} = ' "';
tmp{2,2} = '" ';
For the rule 2 -the “else” part- I simply pick the other columns and remove the duplicates. It is important to keep the order of the filtered names, I used the argument ‘stable’ (see line 11). I also add spaces between all the entries (line 13) and collapse the cell into a vector. Many entries contained duplicate names.
else %We pick out first name, second and alias
tmp = {text_data{i, 2:6}};
[u, v] = unique(tmp,'stable'); %remove duplicates, keep the order
tmp2 = tmp(v); %pick out the unique names
tmp2(2, :) = {' '}; %Marble with spaces
tmp2 = [tmp2{:}]; %collapse to vector
end
I print the list in the standard output and copy it to OpenOffice for post processing. I will talk about that in another post. Actually, in the real CSV, there are other quotation marks like both ” and the special ”, so I have to take care of that as well.
By the way, had Tom Cruise been a magician, what would he call himself?
In conclusion, the solution above is not perfect, but it should help you on your way.