Growing a Lexicon with Human-Level Attributes

19 May 2024

Talk to most chatbots for long enough and you hit the same wall. They understand what you said and almost nothing about how you said it. Ask one something while you're clearly frustrated and it answers the literal question, cheerfully, like a vending machine. The reason is partly the lexicons underneath them. A resource like WordNet knows that anxious and worried are related, but it doesn't know they're both carrying fear, or that someone who reaches for brooding over thoughtful might be telling you something about themselves.

So for my final-year project at Newcastle I tried to build the missing piece: a lexicon with what the literature calls human-level attributes, where words are tagged not just with meaning but with emotion and personality. Map anxious to neuroticism, curious to openness, exhausted to something heavier than tiredness. The obvious way to build that is to label words by hand, but that route was never realistic for one student, and I would have had nothing left to write about, if I could still write at all by then.

So I built LCT to do the opposite. Start with a few words I was sure about, and let the machine walk outward through everything those words are close to, tagging as it goes. This is the story of how that works, what it actually achieved, and its limitations.

The idea

The whole thing rests on one assumption: words that mean similar things sit near each other in the right kind of space. Plot every word as a point, and happy and cheerful land close together while happy and rent land far apart. Once you believe that, the plan writes itself:

  1. Start with seed words whose categories I already know.
  2. Find their nearest neighbours in embedding space.
  3. Build a graph out of those neighbours.
  4. Let the seed labels spread through the graph to the unlabelled words.
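
To make "near each other" concrete, here is a minimal sketch of cosine similarity, the measure the expansion step relies on. The three-dimensional vectors are invented for illustration; real Word2Vec vectors have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example (not taken from any real model).
toy_vectors = {
    "happy":    [0.9, 0.8, 0.1],
    "cheerful": [0.8, 0.9, 0.2],
    "rent":     [0.1, 0.0, 0.9],
}

# happy/cheerful point almost the same way; happy/rent do not.
print(cosine_similarity(toy_vectors["happy"], toy_vectors["cheerful"]))
print(cosine_similarity(toy_vectors["happy"], toy_vectors["rent"]))
```

With these made-up numbers the first pair scores close to 1 and the second far lower, which is the whole intuition the pipeline leans on.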

The labels came from two systems. LIWC buckets words into psychological and emotional categories; OCEAN is the Big Five (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism). The goal was a lexicon mapping each word to both.

Here's the whole pipeline at a glance, from raw text to the two ways I tested it:

The data

A lexicon is only as good as the language you grow it from, so I pulled from a few places, each for a reason:

  • Goodreads Book Graph: 15 million book reviews. People are unguarded when they write about books they love or hate, which makes it rich emotional language.
  • An essay dataset: 2,467 essays (1.9 million words) by psychology students, already linked to Big Five traits. The closest thing I had to ground truth for personality.
  • LIWC: for the category scheme itself.
  • OCEAN word lists: seed words already tied to the five traits.

Everything was anonymised before I touched it, to stay on the right side of GDPR. I chose breadth on purpose: I wanted the lexicon to see language as it's actually used, not as a dictionary wishes it were.

Making it not crash

Here's the first thing nobody tells you: 15 million reviews do not fit in memory. The first time I tried to load the set whole, my machine gave up.

So the corpus is never fully in memory. I stream it line by line out of the gzip, preprocess each review as it passes, and write the results out in chunks of 200,000, logging memory usage and flushing as I go. Even with streaming, the largest slice I could actually process was about 500,000 reviews, a fraction of what was available. That ceiling quietly shaped everything downstream, and I'll come back to it. A surprising amount of this project was that kind of unglamorous plumbing.
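
A rough sketch of that streaming pattern, under some assumptions: the file is gzip-compressed JSON with one review per line, the text lives in a field called review_text, and stream_chunks is a hypothetical helper name rather than the function in the tool. Memory logging is left out to keep it short.

```python
import gzip
import json

def stream_chunks(path, chunk_size=200_000, text_field="review_text"):
    """Yield lists of at most chunk_size review texts without loading the whole file."""
    chunk = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:              # the gzip is read one line at a time
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            text = record.get(text_field)  # assumed field name
            if text:
                chunk.append(text)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []               # start a fresh list so the old one can be freed
    if chunk:                            # final, possibly smaller, chunk
        yield chunk
```

Because this is a generator, only one chunk exists at a time: the caller preprocesses it, writes it out, and moves on to the next.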

Preprocessing

Before any of the interesting parts, the text has to be made boring and uniform. Lowercase it, drop stop words and numbers, lemmatise so running and ran both become run, and pull out the words that actually carry sentiment. In practice these were usually adjectives and adverbs that also appear in an opinion lexicon.

import string

from nltk import pos_tag, word_tokenize
from nltk.corpus import opinion_lexicon, stopwords

def extract_sentiment_terms(sentence):
   # needs the NLTK tokenizer, POS-tagger, stopwords and opinion_lexicon data
   words = word_tokenize(sentence)
   stop_words = set(stopwords.words('english'))
   punctuation = set(string.punctuation)
   tagged_words = pos_tag(words)
   sentiment_terms = set()

   negative_lexicon = set(opinion_lexicon.negative())
   positive_lexicon = set(opinion_lexicon.positive())

   for word, tag in tagged_words:
      lower_word = word.lower()
      if lower_word not in stop_words and word not in punctuation:
         # keep adjectives (JJ*) and adverbs (RB*) that the opinion lexicon knows
         if tag.startswith('JJ') or tag.startswith('RB'):
            if lower_word in negative_lexicon or lower_word in positive_lexicon:
               sentiment_terms.add(word)

   return sentiment_terms

None of this is glamorous, but skip it and every later step inherits the mess.

Seeds, embeddings, and a graph

With clean text, the real work runs in five moves.

Seeds. This is subtler than "type a list of words." My seeds came from OCEAN trait words and LIWC category words, but I only trusted a word as a seed if it appeared in both, and I gave it the union of its labels: its Big Five trait plus its LIWC categories.

import ast

# ocean and liwc are assumed to be pandas DataFrames here; each line rebinds the name to a dict
ocean = {word: trait for trait in ocean.columns for word in ocean[trait].dropna()}
liwc  = {word: ast.literal_eval(cats) for word, cats in zip(liwc['word'], liwc['categories'])}

# a seed is a word both lists vouch for, carrying both label sets
seeds = {word: [ocean[word]] + liwc[word] for word in ocean if word in liwc}

Requiring the two sources to agree kept the starting set small and trustworthy, which matters, because every later step amplifies whatever the seeds believe.

Embeddings. I turned words into vectors with Word2Vec: points in space where distance means "used in similar contexts." The pipeline runs on Google's pre-trained word2vec-google-news-300 (300 dimensions, trained on roughly 100 billion words), and I also experimented with training my own models on the review corpus, because the way people talk about anxiety in book reviews is not the way the evening news talks about it.

Expansion by nearest neighbours. For each seed, find the words closest to it by cosine similarity, and keep only the ones above a threshold Tc.

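from collections import defaultdict

import numpy as np
from sklearn.neighbors import NearestNeighbors
from tqdm import tqdm

def expand_seeds(seeds, model, Tc, sentiment_terms):
   similarities = defaultdict(dict)
   seeds = set(seeds.keys())

   # only words the embedding model knows can be compared
   vocab = set(model.index_to_key)
   seeds_in_vocab = vocab.intersection(seeds)
   sentiment_terms_in_vocab = vocab.intersection(sentiment_terms)

   # union, so a word that is both a seed and a sentiment term appears once
   index_to_term = list(seeds_in_vocab | sentiment_terms_in_vocab)
   vectors = np.array([model[term] for term in index_to_term])

   # n_neighbors=len(vectors) compares every term with every other term
   neighbors = NearestNeighbors(n_neighbors=len(vectors), metric='cosine')
   neighbors.fit(vectors)

   for i, vector in tqdm(enumerate(vectors), total=len(vectors)):
      distances, indices = neighbors.kneighbors([vector])
      for distance, index in zip(distances[0], indices[0]):
         # cosine similarity = 1 - cosine distance
         if 1 - distance > Tc:
            term1 = index_to_term[i]
            term2 = index_to_term[index]
            similarities[term1][term2] = 1 - distance

   # flatten to (term, neighbour) pairs; self-pairs are dropped when the graph is built
   C = []
   for seed, terms in tqdm(similarities.items()):
      for term, similarity in terms.items():
         C.append((seed, term))

   return C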

I didn't pull Tc out of the air: I tested values from 0.5 to 0.9 and watched what happened to the lexicon (more on that below). I settled on 0.7. This step is where a handful of seeds becomes a few thousand candidates.
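
As a rough illustration of what such a sweep looks like, the sketch below counts how many candidate pairs survive at each threshold. The similarity scores are invented, and pairs_above is a hypothetical helper, not part of the tool.

```python
# Invented similarity scores between one seed and some candidate words.
toy_similarities = {
    ("anxious", "worried"): 0.82,
    ("anxious", "nervous"): 0.76,
    ("anxious", "uneasy"): 0.68,
    ("anxious", "tense"): 0.61,
    ("anxious", "eager"): 0.52,
}

def pairs_above(similarities, threshold):
    """Keep only the pairs whose similarity is strictly above the threshold."""
    return [pair for pair, score in similarities.items() if score > threshold]

# Sweep the same range of thresholds and watch the candidate set shrink.
for tc in (0.5, 0.6, 0.7, 0.8, 0.9):
    kept = pairs_above(toy_similarities, tc)
    print(f"Tc={tc}: {len(kept)} pairs kept")
```

On this toy data the count falls from five pairs at 0.5 to none at 0.9. A low threshold lets loosely related words in; a high one leaves the seeds with almost no neighbours to spread labels to.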

A semantic graph. Turn those pairs into a graph: words are nodes, similarities are weighted edges. NetworkX does the bookkeeping; weak edges get pruned so only relationships strong enough to trust survive.

import networkx as nx
from tqdm import tqdm

def build_semantic_graph(C, model):
   G = nx.Graph()
   for word_pair in tqdm(C):
      Si, Wj = word_pair
      # skip self-pairs; the edge weight is the cosine similarity of the two words
      if Si != Wj:
         G.add_edge(Si, Wj, weight=model.similarity(Si, Wj))
   return G

Label propagation. This is the part I find genuinely satisfying. The seeds know their labels; everyone else starts blank. On each pass, an unlabelled word looks at its neighbours and takes on whichever labels are most common around it. Repeat, and the labels flood out from the seeds through the graph until things settle.

import numpy as np
from tqdm import tqdm

def multi_label_propagation(G, seeds, max_iterations=100):
    labels = {node: [] for node in G.nodes()}

    # seeds keep their labels throughout; every other node starts blank
    for node, label in seeds.items():
        labels[node] = label

    for _ in tqdm(range(max_iterations)):
        new_labels = labels.copy()
        for node in G.nodes():
            if node not in seeds:
                neighbor_labels = [labels[neighbor] for neighbor in G.neighbors(node)]
                neighbor_labels = [item for sublist in neighbor_labels for item in sublist]
                if neighbor_labels:
                    # majority vote; labels tied for the top count are all kept
                    unique_labels, counts = np.unique(neighbor_labels, return_counts=True)
                    common_labels = unique_labels[np.where(counts == np.max(counts))]
                    new_labels[node] = list(common_labels)
        if new_labels == labels:
            break  # nothing changed on this pass, so the labels have settled
        labels = new_labels
    return labels

Collapse the result into a category-to-words map and you have the lexicon.
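
That last collapse is just a dictionary inversion. A minimal sketch, assuming the propagation output has the shape {word: [categories]}; collapse_to_lexicon is a hypothetical name and the labels are toy values.

```python
from collections import defaultdict

def collapse_to_lexicon(labels):
    """Invert {word: [categories]} into {category: sorted list of words}."""
    lexicon = defaultdict(set)
    for word, categories in labels.items():
        for category in categories:
            lexicon[category].add(word)
    return {category: sorted(words) for category, words in lexicon.items()}

# Toy propagation output (categories made up for the example).
toy_labels = {
    "anxious": ["Neuroticism", "anx"],
    "worried": ["Neuroticism", "anx"],
    "curious": ["Openness"],
    "table": [],   # never reached by any label, so it drops out of the lexicon
}

lexicon = collapse_to_lexicon(toy_labels)
print(lexicon)
```

Words that no label ever reached simply do not appear, which is one reason the lexicon's size depends so heavily on the threshold and the seeds.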

Did it work?

Evaluating a lexicon you grew yourself is genuinely awkward: there's no clean ground truth for whether brooding is a neuroticism word. So I came at it from two directions.

First, does it agree with the experts? I checked my categories against LIWC's own parser: run reviews through both, and count how often they match. Agreement came out around 88–89%, and a separate consistency check (do words sitting next to each other in the graph end up with overlapping categories?) sat above 94%. Encouraging, though the LIWC comparison is the bit of the harness I trust least.
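
For a sense of what those two percentages measure, here is a small sketch of both checks. It assumes a "match" means the two category sets share at least one category, which is only one possible definition and may not be the one the harness used; the function names and data are invented.

```python
def agreement_rate(mine, reference):
    """Share of texts where the two category lists have at least one category in common."""
    if not mine:
        return 0.0
    matches = sum(1 for a, b in zip(mine, reference) if set(a) & set(b))
    return matches / len(mine)

def neighbour_consistency(edges, labels):
    """Share of graph edges whose two endpoints share at least one label."""
    if not edges:
        return 0.0
    shared = sum(
        1 for u, v in edges
        if set(labels.get(u, [])) & set(labels.get(v, []))
    )
    return shared / len(edges)

# Toy data: categories assigned to four texts by the lexicon and by a reference.
mine = [["anx"], ["posemo"], ["anger"], ["sad"]]
reference = [["anx", "negemo"], ["posemo"], ["sad"], ["sad"]]
print(agreement_rate(mine, reference))        # 3 of 4 texts overlap

# Toy graph: two edges, only one of which joins words with a shared label.
edges = [("anxious", "worried"), ("anxious", "curious")]
labels = {"anxious": ["Neuroticism"], "worried": ["Neuroticism"], "curious": ["Openness"]}
print(neighbour_consistency(edges, labels))   # 1 of 2 edges is consistent
```

Note that the consistency check is partly circular: propagation copies labels along edges, so neighbours agreeing is close to what the algorithm guarantees.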

Second, and more honestly, does it make a model better? I plugged the lexicon in as a feature set for a text classifier and measured it against a baseline that didn't use it. The features were a FeatureUnion of lexicon counts and a TextBlob sentiment transformer, feeding a logistic-regression classifier. The numbers:

Model               Accuracy   Precision   Recall   F1
Baseline            0.56       0.55        0.56     0.56
Lexicon-enhanced    0.56       0.58        0.79     0.67

Read honestly, that's a specific result, not a triumphant one. The lexicon did nothing for raw accuracy. What it did was push recall from 0.56 to 0.79 and F1 from 0.56 to 0.67: the model got much better at not missing relevant cases, and precision did not pay for it (it edged up from 0.55 to 0.58). For something like psychological profiling, where missing a real signal is worse than the odd false positive, recall is the number you most want to move.
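
The F1 figures follow from precision and recall. A quick check, assuming F1 here is the plain harmonic mean of the reported precision and recall (the reported scores may instead be averaged across classes, so exact agreement is not guaranteed):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the results table.
rows = {
    "Baseline": (0.55, 0.56),
    "Lexicon-enhanced": (0.58, 0.79),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

This gives roughly 0.555 and 0.669, close to the reported 0.56 and 0.67 given that the inputs are already rounded. Because the harmonic mean is dragged toward the smaller of its two inputs, the large recall gain lifts F1 by less than it lifts recall.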

The other finding surprised me at the time and doesn't now: the simple model won. I tried logistic regression, SVM, and Random Forest, with One-vs-One and One-vs-Rest strategies. One-vs-One beat One-vs-Rest everywhere, but a well-tuned logistic regression beat both SVM and Random Forest, with Random Forest coming in about 1% behind despite being the heaviest. More complexity bought nothing. Pick the simple thing and tune it.

The parameters fight back

I also mapped how the lexicon's size responds to its two main knobs: the similarity threshold Tc and the number of seeds. I'd hoped for tidy curves. I got this instead:

Higher thresholds generally shrink the lexicon, as you'd expect: a stricter criterion, fewer words. But against seed size the relationship is jumpy and non-linear: Tc = 0.7 and 0.8 peak at particular seed counts (around 120 and 160) and fall off either side. There's no clean "more seeds = better." The knobs interact, and finding a good corner of that space is more search than theory. Tc = 0.7 is where I landed, not a law of nature.

Limitations

The things that bother me, plainly:

The memory ceiling. I trained on ~500,000 reviews out of 15 million because that's what the hardware would take. The lexicon the tool can build is bounded by the slice of language it ever got to see, and I only showed it a sliver.

I never ran cross-validation. The honest admission from the writeup: a single train/test split tells you less than you think, and proper cross-validation would have made the numbers above mean more. I'd do that first if I came back to it.
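
For reference, this is roughly what k-fold cross-validation does to the data, written out by hand. It is a sketch of the missing step rather than anything the project ran, and k_fold_indices is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]   # k roughly equal slices
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every sample is used for testing exactly once across the k rounds.
for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

Each round trains on four fifths of the data and tests on the rest, and the k scores are averaged. In practice scikit-learn's cross_val_score wraps this loop; the point is that one split gives one number, while k splits also show how much that number moves.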

Label propagation is blunt. "Take the most common label among your neighbours" works until a word sits on a boundary between two categories, and then it just picks the louder room. Plenty of words belong to several categories at once, and majority-vote doesn't really respect that.

The embeddings carry everything. Word2Vec learns whatever is in the corpus, biases included. If the training text quietly ties a trait to a kind of person, the lexicon inherits that without ever being asked.

What I still like is the shape of the idea: encode a little certain knowledge as seeds, and let structure do the rest of the labelling. Start with what you're sure of, and let the graph carry it. The full tool and the dissertation it came from are on GitHub.