Latent Dirichlet Allocation (LDA)

Definition

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used to discover abstract topics within a collection of documents. It assumes that documents are mixtures of topics, and topics are mixtures of words.

Significance and Use

文档分类：将文档归类到不同的主题中。
主题发现：揭示文档集合中的隐藏主题，有助于信息检索和文本分析。
特征提取：将文档转化为主题分布，便于后续的机器学习任务。

Formula

LDA 的生成过程如下：

对于每个文档 $d$ ：
- 选择一个主题分布 $θ_{d} \sim Dir (α)$
- 对于文档中的每个词 $w_{d n}$ ：
  - 从多项分布中选择一个主题 $z_{d n} \sim Multinomial (θ_{d})$
  - 从主题 $z_{d n}$ 对应的词分布中选择一个词 $w_{d n} \sim p (w_{d n} ∣ z_{d n}, β)$

Python Code Example

使用 Scikit-Learn 的 LDA：

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
# 示例文档
documents = ["This is a sample document.", "This document is another example.", "And this is a third document."]
 
# 创建 CountVectorizer
vectorizer = CountVectorizer()
 
# 将文档转化为 BoW 向量
bow_matrix = vectorizer.fit_transform(documents)
 
# 创建 LDA 模型
lda = LatentDirichletAllocation(n_components=2, random_state=42)
 
# 拟合 LDA 模型
lda.fit(bow_matrix)
 
# 显示每个主题的前几个词
terms = vectorizer.get_feature_names_out()
for index, topic in enumerate(lda.components_):
    print(f"Topic {index}:")
    print([terms[i] for i in topic.argsort()[-10:]])

使用 Gensim 的 LDA

import gensim
from gensim import corpora
from gensim.models import LdaModel
 
# 示例文档
documents = ["This is a sample document.", "This document is another example.", "And this is a third document."]
 
# 预处理（分词）
texts = [doc.split() for doc in documents]
 
# 创建词典
dictionary = corpora.Dictionary(texts)
 
# 将文档转换为词袋模型
corpus = [dictionary.doc2bow(text) for text in texts]
 
# 创建 LDA 模型
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
 
# 显示每个主题的前几个词
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

Output Explanation

Latent Dirichlet Allocation (LDA)：
- 每个主题由一组最相关的词表示，这些词具有最高的权重。
- 主题的相关性由文档中词的频率和分布决定。

LDA 提供了一种发现文档集合中隐藏主题的有效方法，适用于文本分析、信息检索和文档分类等任务。

Hua Wang

Explorer

LDA

Latent Dirichlet Allocation (LDA)

Significance and Use

Formula

Python Code Example

使用 Gensim 的 LDA

Output Explanation

Table of Contents

Graph View

Backlinks