A simple code for Latent Dirichlet Allocation

T Miyamoto
1 min readApr 30, 2021

A short and easy code example for carrying out LDA

First we import the following libraries:

import MeCab
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Here, MeCab is a library for tokenizing Japanese text. Then, using CountVectorizer, we build a table that counts the number of appearances of each word.
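The tokenization step itself is not shown in the article, so here is a rough sketch of how the sent column could be built from raw Japanese texts; raw_texts is a hypothetical list of documents introduced only for illustration:

# A minimal tokenization sketch (assumption: raw_texts is a list of raw Japanese documents)
import MeCab
import pandas as pd

tagger = MeCab.Tagger('-Owakati')  # '-Owakati' returns the tokens separated by spaces
raw_texts = ['今日は良い天気です。', '機械学習の勉強をしています。']
df_tot = pd.DataFrame({'sent': [tagger.parse(t).strip() for t in raw_texts]})

With df_tot prepared, the main code is as follows: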

import numpy as np
import pandas as pd

num_topics = 6

# Build the document-term count matrix
cnt_vec = CountVectorizer()
mat1 = cnt_vec.fit_transform(df_tot['sent'])
df1 = pd.DataFrame(mat1.toarray(), columns=cnt_vec.get_feature_names())

# Fit the LDA model
lda1 = LatentDirichletAllocation(n_components=num_topics, random_state=5,
                                 learning_method='online')
lda1.fit(mat1)
print(lda1.exp_dirichlet_component_.shape, mat1.shape)

# Word-topic table: one row per term, one column per topic
df_word_topic = pd.DataFrame(lda1.exp_dirichlet_component_,
                             columns=cnt_vec.get_feature_names()).transpose().reset_index()
list_topics = ['topic_' + str(jv) for jv in np.arange(0, num_topics, 1)]
list_topics = ['term'] + list_topics
df_word_topic.columns = list_topics

# df_terms is a separate dataframe with one row per word (column 'word')
df_word_topic = df_word_topic.merge(df_terms, left_on=['term'], right_on=['word'],
                                    how='left', validate='1:1', suffixes=('_l', '_r'))

Here, df_tot looks like this:

df_tot.columns
--> Index(['sent'], dtype='object')

The sent column contains the tokenized words of each document as a space-separated string.

Then, using LatentDirichletAllocation and fit, we create and train the LDA model. After that, we transfer the appearance probability of each word in each topic into a dataframe.
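As an extra step that is not part of the article's code, the fitted model can also give the topic mixture of each document via transform:

# Topic distribution of each document (an extra step, not in the original code)
doc_topic = lda1.transform(mat1)   # shape: (number of documents, num_topics)
df_doc_topic = pd.DataFrame(doc_topic,
                            columns=['topic_' + str(jv) for jv in range(num_topics)])
print(df_doc_topic.head())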

Then the following code prints the feature names (the vocabulary) extracted from the tokenized texts:

print( cnt_vec.get_feature_names()  )

And the following line displays the dataframe, where we can see how strongly each word is associated with each topic:

df_word_topic
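
For a quick look at the topics themselves, one could also print the highest-weighted terms of each topic; this is a small sketch built on the df_word_topic table above, not part of the original article:

# Print the ten highest-weighted terms of each topic (sketch, not in the original article)
for jv in range(num_topics):
    col = 'topic_' + str(jv)
    top_terms = df_word_topic.nlargest(10, col)['term'].tolist()
    print(col, ':', ' '.join(top_terms))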
