... or: why is most of my F-List suddenly male?
First of all, thanks to hutta for providing this thought-provoking little toy. It's a useful reminder that what most reliably characterises a writer's personal style is not the big words, the rich vocabulary, or the long sentences, but the use of function words (Mosteller & Wallace, 1964).
However, to take hutta's nifty tool at face value as a reliable classifier is misleading; it was certainly not intended as a reliable classifier, more as a joke. But the more people I see taking this toy seriously, the more I feel an explanation is in order. Here goes.
The way this classifier works is quite simple: count the number of times each keyword appears in the text, multiply that count by the keyword's weight, sum up the weighted counts, and if the sum exceeds a threshold, assign the text to class 1, otherwise to class 2. This is a time-honoured method of classification; learning a good set of weights for it is a problem that is still actively researched in the machine learning community. (A small sketch of the tally follows the keyword table below.)
Now, if we look at the original keywords and compare them to the Koppel et al. paper hutta cites, we find that there is very little agreement between the two.
I reproduce the keywords here for the sake of completeness, taken from BookBlog:
Feminine Keywords            Masculine Keywords
[him]        0 x 73 = 0      [some]        0 x 58 = 0
[so]         0 x 64 = 0      [this]        0 x 44 = 0
[because]    0 x 55 = 0      [as]          0 x 37 = 0
[actually]   0 x 49 = 0      [now]         0 x 33 = 0
[everything] 0 x 44 = 0      [good]        0 x 31 = 0
[but]        0 x 43 = 0      [something]   0 x 26 = 0
[like]       0 x 43 = 0      [if]          0 x 25 = 0
[am]         0 x 42 = 0      [ever]        0 x 21 = 0
[more]       0 x 41 = 0      [is]          0 x 19 = 0
[out]        0 x 39 = 0      [the]         0 x 17 = 0
[too]        0 x 38 = 0      [well]        0 x 15 = 0
[has]        0 x 33 = 0      [in]          0 x 10 = 0
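For the curious, here is a minimal Python sketch of that count-times-weight tally, using the keywords and weights from the table above. The decision rule, simply comparing the two totals and picking the larger, is my own assumption; the tool's actual threshold is not published, so treat this as an illustration, not a reimplementation.

```python
import re

# Keyword weights as listed in the table above (taken from the BookBlog page).
FEMININE = {"him": 73, "so": 64, "because": 55, "actually": 49, "everything": 44,
            "but": 43, "like": 43, "am": 42, "more": 41, "out": 39, "too": 38, "has": 33}
MASCULINE = {"some": 58, "this": 44, "as": 37, "now": 33, "good": 31, "something": 26,
             "if": 25, "ever": 21, "is": 19, "the": 17, "well": 15, "in": 10}

def weighted_score(text, weights):
    """Count each keyword in the text and sum count times weight."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(words.count(word) * weight for word, weight in weights.items())

def guess(text):
    # Assumed decision rule: the larger of the two tallies wins.
    f = weighted_score(text, FEMININE)
    m = weighted_score(text, MASCULINE)
    return "female" if f >= m else "male"

print(guess("But actually, I like everything about him more than I am letting on."))
```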
First of all, there are two articles: one describes an exploratory analysis of a data set, the other an algorithm for determining the gender of an author. The two statistical analyses are very different beasts. A word that is highly characteristic of a certain gender may not be a good predictor of gender if the other gender uses it frequently, too. In other words, the weights assigned by the exploratory analysis describe how prominent certain features are in male vs. female authors, but not to what extent they can be used to distinguish males from females.
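To see how prominence and predictive value can come apart, here is a tiny illustration with invented numbers (they come from neither paper): a word can be very frequent in female-authored text and still tell you almost nothing, if it is nearly as frequent in male-authored text.

```python
import math

# Invented frequencies (occurrences per 10,000 words), purely for illustration.
freq_in_female_texts = 120.0   # very prominent in female-authored text ...
freq_in_male_texts = 112.0     # ... but nearly as frequent in male-authored text

# An exploratory analysis might flag the word because it is so frequent,
# but a classifier cares about the contrast between the classes,
# e.g. a simple log-ratio of the two frequencies:
log_ratio = math.log(freq_in_female_texts / freq_in_male_texts)

print(f"log frequency ratio: {log_ratio:.3f}")  # about 0.069: barely any signal
```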
Secondly, the keywords used by the tool appear to me to come from the exploratory-analysis article. In the classification-oriented article, Koppel et al. never list all their features and their corresponding weights. Neither do they give their threshold value. Both are needed to re-implement the classifier they arrived at.
Furthermore, they use information about frequency of parts-of-speech such as nouns, verbs, and personal pronouns as well as function word frequency. The function words that do come out as particularly characteristic are, for fiction, "a, the, as" (male) and "she, for, with, not" (female); for non-fiction, "that, one" (male) and "for, with, not, in" (female). Compare these lists and note the difference made between fiction and non-fiction.
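For illustration only, here is a rough sketch of the kind of feature extraction described, not the authors' actual pipeline: relative frequencies of part-of-speech tags plus a handful of the function words listed above. It assumes NLTK with its tokenizer and tagger models already downloaded.

```python
from collections import Counter
import nltk  # assumes the NLTK tokenizer and POS-tagger data have been downloaded

def style_features(text, function_words=("a", "the", "as", "she", "for", "with", "not")):
    """Relative frequencies of POS tags and of a few function words."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    n = max(len(tokens), 1)
    feats = {"pos_" + t: c / n for t, c in Counter(tags).items()}
    lowered = [t.lower() for t in tokens]
    feats.update({"fw_" + w: lowered.count(w) / n for w in function_words})
    return feats

print(style_features("She walked to the market, but he stayed home."))
```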
This brings me to the third argument: the algorithm was not trained on blogs, but on fiction and published non-fiction. Most of the non-fiction texts come from books and scholarly essays. Quite different in style from blogging, don't you think?
You may scream now: but what on earth QUALIFIES her to extemporise on that tool? And why does she get so ticked off at people taking it seriously? Simple, folks: I'm one of the people cited in the Koppel et al. paper. It's a nice paper, by the way. Technically solid, as far as I can tell.
Date: 2004-03-24 04:06 am (UTC)
And I'd rather not identify which of the papers I've written - I'm a bit hesitant about openly mixing private life and professional existence (I don't want to show up at ACL and have people yell: "Oh, you're the suicidal Catholic Buddhist from LJ with the Harry Potter obsession!", if you know what I mean.)
I can, however, exchange RL identities with you if you AIM me at percivalsq.
Date: 2004-03-24 10:58 am (UTC)
I completely understand what you mean about not wanting to mix your personal and professional lives. I've thought about that a lot with my own journal. I've handled that by not posting anything that I wouldn't want colleagues to see--obviously, not all of them would particularly enjoy my photography, lists, little details of my life, etc., but there's nothing here that I would be upset if they saw. I tend to be pretty vague about anything that deals with my academic life and might be controversial (e.g., yesterday's 'boring meeting'--I won't say who it was with or what it was about). It means that my journal stays pretty 'surface level', but I'm okay with that; I tend to be a pretty private person anyway.