[personal profile] percival
... or: why is most of my F-List suddenly male?

First of all, thanks to [livejournal.com profile] hutta for providing this thought-provoking little toy.


It's a useful reminder that what most reliably characterises the personal style of a writer is not so much the big words, the rich vocabulary or the long sentences as their use of function words (Mosteller & Wallace, 1964).

However, to take [livejournal.com profile] hutta's nifty tool at face value as a reliable classifier is misleading. It was certainly never intended as one; it was meant more as a joke. But the more people I see taking this toy seriously, the more I feel an explanation is in order. Here goes.

The way this classifier works is quite simple: count the number of times each keyword appears in a text, multiply each count by that word's weight, and add up the results. If the sum is greater than a threshold, assign the text to class 1, otherwise to class 2. This is a time-honoured method of classification. Learning a good set of weights is a problem that is still actively researched in the machine learning community.
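To make the recipe concrete, here is a minimal sketch in Python. It is not the tool's actual code: the four weight magnitudes are borrowed from the keyword table in this post, but the signs (positive for class 1, negative for class 2), the threshold of 0, and the function names are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical signed weights: positive values pull towards class 1,
# negative values towards class 2. Only the magnitudes come from the
# keyword table; the sign convention is an assumption.
WEIGHTS = {"him": 73, "so": 64, "some": -58, "this": -44}
THRESHOLD = 0  # assumed; the real threshold is not published


def score(text, weights=WEIGHTS):
    """Sum of (keyword count x keyword weight) over all keywords."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(counts[word] * weight for word, weight in weights.items())


def classify(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Assign class 1 if the weighted sum exceeds the threshold."""
    return "class 1" if score(text, weights) > threshold else "class 2"


print(score("So I told him so"), classify("So I told him so"))
# 201 class 1   (2 x 64 for "so" plus 1 x 73 for "him")
```

Note that the decision depends entirely on the weights and the threshold, which is why the rest of this post dwells on where those numbers come from.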

Now, if we look at the original keywords and compare them with the Koppel et al. paper [livejournal.com profile] hutta cites, we find that there is very little agreement between the two.

I reproduce the keywords and their weights here for the sake of completeness, taken from BookBlog:

Feminine Keywords            Masculine Keywords
[him]        0 x 73 = 0      [some]       0 x 58 = 0
[so]         0 x 64 = 0      [this]       0 x 44 = 0
[because]    0 x 55 = 0      [as]         0 x 37 = 0
[actually]   0 x 49 = 0      [now]        0 x 33 = 0
[everything] 0 x 44 = 0      [good]       0 x 31 = 0
[but]        0 x 43 = 0      [something]  0 x 26 = 0
[like]       0 x 43 = 0      [if]         0 x 25 = 0
[am]         0 x 42 = 0      [ever]       0 x 21 = 0
[more]       0 x 41 = 0      [is]         0 x 19 = 0
[out]        0 x 39 = 0      [the]        0 x 17 = 0
[too]        0 x 38 = 0      [well]       0 x 15 = 0
[has]        0 x 33 = 0      [in]         0 x 10 = 0

First of all, there are two articles: one describes an exploratory analysis of a data set, the other an algorithm for determining the gender of an author. The two statistical analyses are very different beasts. A word that is highly characteristic of a certain gender may not be a good predictor of gender if the other gender uses it frequently, too. In other words, the weights assigned by the exploratory analysis describe how prominent certain features are in male vs. female authors, but not to what extent they can be used to distinguish males from females.
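A toy calculation may help to show the difference between "prominent" and "discriminative". The usage rates below are invented purely for illustration and come from neither article; `word_a`, `word_b` and both function names are hypothetical.

```python
# Hypothetical usage rates per 1,000 words. These numbers are invented
# for illustration only; they are NOT taken from either article.
RATES = {
    "word_a": {"female": 50.0, "male": 45.0},  # very common in both groups
    "word_b": {"female": 5.0, "male": 1.0},    # rare, but lopsided
}


def prominence(word):
    """How often female authors use the word (per 1,000 words)."""
    return RATES[word]["female"]


def rate_ratio(word):
    """How many times more often female authors use it than male authors."""
    rates = RATES[word]
    return rates["female"] / rates["male"]


for word in RATES:
    print(word, prominence(word), round(rate_ratio(word), 2))
```

Under these made-up numbers, `word_a` is ten times more prominent in female writing than `word_b`, yet its rate ratio is barely above 1, so seeing it tells you very little about who wrote the text. `word_b` is much rarer but five times more likely to come from a female author.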

Secondly, the keywords used by the tool appear to me to come from the exploratory analysis article. In the classification-oriented article, Koppel et al. never list all their features and the corresponding weights, nor do they give their threshold value. Both are needed to re-implement the classifier they arrived at.

Furthermore, they use information about frequency of parts-of-speech such as nouns, verbs, and personal pronouns as well as function word frequency. The function words that do come out as particularly characteristic are, for fiction, "a, the, as" (male) and "she, for, with, not" (female); for non-fiction, "that, one" (male) and "for, with, not, in" (female). Compare these lists and note the difference made between fiction and non-fiction.
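For anyone who wants to do the comparison mechanically, here is a small sketch using Python sets. The word lists are copied from the paragraph above and from the keyword table in this post; the dictionary layout and the `overlap` helper are my own illustrative choices, not anything from the paper or the tool.

```python
# Function words quoted above as characteristic in Koppel et al.
PAPER = {
    ("fiction", "male"): {"a", "the", "as"},
    ("fiction", "female"): {"she", "for", "with", "not"},
    ("non-fiction", "male"): {"that", "one"},
    ("non-fiction", "female"): {"for", "with", "not", "in"},
}

# Keywords used by the tool, copied from the table in this post.
TOOL = {
    "female": {"him", "so", "because", "actually", "everything", "but",
               "like", "am", "more", "out", "too", "has"},
    "male": {"some", "this", "as", "now", "good", "something", "if",
             "ever", "is", "the", "well", "in"},
}


def overlap(genre, gender):
    """Return (words the tool files under the same gender,
    words the tool files under the opposite gender)."""
    other = "male" if gender == "female" else "female"
    words = PAPER[(genre, gender)]
    return sorted(words & TOOL[gender]), sorted(words & TOOL[other])


for genre, gender in PAPER:
    agree, conflict = overlap(genre, gender)
    print(f"{genre:12} {gender:7} agree={agree} conflict={conflict}")
```

Of the function words quoted above, only "as" and "the" turn up in the tool's list under the same gender, and "in" turns up under the opposite one.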

This brings me to the third argument: the algorithm was not trained on blogs, but on fiction and published non-fiction. Most of the non-fiction texts come from books and scholarly essays. Quite different in style from blogging, don't you think?



You may scream now: but what on earth QUALIFIES her to extemporise on that tool? And why does she get so ticked off at people taking it seriously? Simple, folks: I'm one of the people cited in the Koppel et al. paper. It's a nice paper, by the way. Technically solid, as far as I can tell.