percival: (Default)
[personal profile] percival
... or: why is most of my F-List suddenly male?

First of all, thanks to [livejournal.com profile] hutta for providing this thought-provoking little toy.


It's a useful reminder that what most reliably characterises a writer's personal style is less the big words, the rich vocabulary or the long sentences than their use of function words (Mosteller & Wallace, 1964).

However, taking [livejournal.com profile] hutta's nifty tool at face value as a reliable classifier is misleading. It was certainly never intended as one; it was meant more as a joke. But the more people I see taking this toy seriously, the more I feel an explanation is in order. Here goes.

The way that this classifier works is quite simple: count the number of times each keyword appears in a text, multiply each count by that word's weight, and add up the results. If the sum is greater than a threshold, assign the text to class 1, otherwise to class 2. This is a time-honoured method of classification. Learning a good set of weights is a problem that is still actively researched in the machine learning community.
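
To make this concrete, here is a minimal sketch in Python. The weights and the threshold are invented purely for illustration; they are neither the tool's nor Koppel et al.'s, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical weights, made up for this example: positive values pull
# towards class 1, negative values towards class 2.
WEIGHTS = {"with": 2.0, "not": 1.5, "the": -1.0, "a": -0.5}
THRESHOLD = 0.0  # also an assumption


def score(text, weights=WEIGHTS):
    """Sum of (count of word) * (weight of word) over all weighted words."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(counts[word] * weight for word, weight in weights.items())


def classify(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Return 1 if the weighted sum exceeds the threshold, otherwise 2."""
    return 1 if score(text, weights) > threshold else 2
```

With these toy weights, classify("not with a") returns 1, because 1.5 + 2.0 - 0.5 = 3.0 is above the threshold. The hard part is not this arithmetic but choosing weights and a threshold that actually separate the two classes.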

Now, if we look at the tool's keywords and compare them to the Koppel et al. paper that [livejournal.com profile] hutta cites, we find that there is very little agreement between the two.

I reproduce the keywords here for the sake of completeness, taken from BookBlog. Each entry reads count x weight = score:

Feminine Keywords          Masculine Keywords
[him]        0 x 73 = 0    [some]      0 x 58 = 0
[so]         0 x 64 = 0    [this]      0 x 44 = 0
[because]    0 x 55 = 0    [as]        0 x 37 = 0
[actually]   0 x 49 = 0    [now]       0 x 33 = 0
[everything] 0 x 44 = 0    [good]      0 x 31 = 0
[but]        0 x 43 = 0    [something] 0 x 26 = 0
[like]       0 x 43 = 0    [if]        0 x 25 = 0
[am]         0 x 42 = 0    [ever]      0 x 21 = 0
[more]       0 x 41 = 0    [is]        0 x 19 = 0
[out]        0 x 39 = 0    [the]       0 x 17 = 0
[too]        0 x 38 = 0    [well]      0 x 15 = 0
[has]        0 x 33 = 0    [in]        0 x 10 = 0
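
As a rough sketch of how numbers like these could be turned into a verdict: the weights below are copied from the table, but the decision rule (add up each column and pick the larger total, with ties going to "male") is an assumption about how the tool works, not something confirmed by its author. The function names are hypothetical.

```python
import re
from collections import Counter

# Weights copied from the keyword table above.
FEMININE = {"him": 73, "so": 64, "because": 55, "actually": 49,
            "everything": 44, "but": 43, "like": 43, "am": 42,
            "more": 41, "out": 39, "too": 38, "has": 33}
MASCULINE = {"some": 58, "this": 44, "as": 37, "now": 33,
             "good": 31, "something": 26, "if": 25, "ever": 21,
             "is": 19, "the": 17, "well": 15, "in": 10}


def column_total(counts, weights):
    """Add up count x weight for every keyword in one column."""
    return sum(counts[word] * weight for word, weight in weights.items())


def guess(text):
    """Return (feminine total, masculine total, verdict).

    ASSUMPTION: the verdict is simply the column with the larger total,
    and a tie counts as "male".
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    fem = column_total(counts, FEMININE)
    masc = column_total(counts, MASCULINE)
    return fem, masc, "female" if fem > masc else "male"
```

For example, guess("so this is it") gives a feminine total of 64 (one "so") against a masculine total of 63 (one "this" plus one "is"), so a single word is enough to tip the verdict.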

First of all, there are two articles: one describes an exploratory analysis of a data set, the other an algorithm for determining the gender of an author. The two statistical analyses are very different beasts. A word that is highly characteristic of a certain gender may not be a good predictor of gender if the other gender uses it frequently, too. In other words, the weights assigned by the exploratory analysis describe how prominent certain features are in male vs. female authors, but not to what extent they can be used to distinguish males from females.

Secondly, the keywords used by the tool appear to me to come from the exploratory analysis article. In the classification-oriented article, Koppel et al. never list all their features and the corresponding weights. Nor do they give their threshold value. Both are needed to re-implement the classifier they arrived at.

Furthermore, they use information about the frequency of parts of speech such as nouns, verbs, and personal pronouns as well as function word frequency. The function words that do come out as particularly characteristic are, for fiction, "a, the, as" (male) and "she, for, with, not" (female); for non-fiction, "that, one" (male) and "for, with, not, in" (female). Compare these lists with the keywords above, and note the distinction drawn between fiction and non-fiction.

This brings me to the third argument: the algorithm was not trained on blogs, but on fiction and published non-fiction. Most of the non-fiction texts come from books and scholarly essays. Quite different in style from blogging, don't you think?



You may scream now: but what on earth QUALIFIES her to extemporise on that tool? And why does she get so ticked off at people taking it seriously? Simple, folks: I'm one of the people cited in the Koppel et al. paper. It's a nice paper, by the way. Technically solid, as far as I can tell.

Date: 2004-03-23 01:05 pm (UTC)
From: [identity profile] voxmaille.livejournal.com
I remember back when this first went around and you were explaining all of this to me. Hee.

Date: 2004-03-23 04:37 pm (UTC)
From: [identity profile] wabi.livejournal.com
Do you have the full reference for the Koppel et al. paper? And what's the reference for yours? I'm rather interested in text classification (sort of the flip-side of the discourse description work I do...), so I'd love to read them. :-)

Date: 2004-03-24 04:06 am (UTC)
From: [identity profile] perceval.livejournal.com
If you follow the link (or alternatively go to Hutta's page), you'll find the PDF. The paper is to appear in Literary and Linguistic Computing 17(4).

And I'd rather not identify which of the papers I've written - I'm a bit hesitant about openly mixing private life and professional existence. (I don't want to show up at ACL and have people yell: "Oh, you're the suicidal Catholic Buddhist from LJ with the Harry Potter obsession!", if you know what I mean.)

I can, however, exchange RL identities with you if you AIM me at percivalsq.

Date: 2004-03-24 10:58 am (UTC)
From: [identity profile] wabi.livejournal.com
Thanks for the reference. I asked because the paper is no longer up on the web--at least not at the address it was at. I try to check out L&LC every once in a while (unfortunately, our library doesn't subscribe to it), so I'll keep an eye out.

I completely understand what you mean about not wanting to mix your personal and professional lives. I've thought about that a lot with my own journal. I've handled that by not posting anything that I wouldn't want colleagues to see--obviously, not all of them would particularly enjoy my photography, lists, little details of my life, etc., but there's nothing here that I would be upset if they saw. I tend to be pretty vague about anything that deals with my academic life and might be controversial (e.g., yesterday's 'boring meeting'--I won't say who it was with or what it was about). It means that my journal stays pretty 'surface level', but I'm okay with that; I tend to be a pretty private person anyway.
