Greek Parliament topic modeling
- Windows
  - Each window is a folder that contains text documents.
  - Each text document contains the speech of a single deputy (one unit).
  - Filter out units that have fewer than 3 lines.
  - The total number of documents equals the number of segments.
  - One window per month.
- Stop word list
  - Use a stop word list that contains 665 words.
  - In addition to applying the stop word list, keep only the content words (adjectives, nouns, adverbs and verbs).
- During the TF-IDF calculation:
  - remove rare terms that appear in fewer than 10 documents.
  - ignore 33801 verbs.
  - ignore 33801 verbs and 9637 adverbs.
  - ignore 33801 verbs, 9637 adverbs and 3303 persons' names.
- Number of topics (k): show results as a function of k
  - Choosing too few topics → overly broad topics
  - Choosing too many topics → highly similar topics
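
The unit filter and the TF-IDF pruning described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`filter_units`, `tokenize`, `tfidf`), the whitespace tokenizer, counting only non-empty lines, and the plain `tf * log(N / df)` weighting are all assumptions. Only the thresholds (3 lines, 10 documents) come from the notes.

```python
import math
from collections import Counter

MIN_LINES = 3  # from the notes: drop units with fewer than 3 lines
MIN_DF = 10    # from the notes: drop terms found in fewer than 10 documents


def filter_units(units, min_lines=MIN_LINES):
    """Keep units that have at least min_lines non-empty lines (assumed rule)."""
    return [
        u for u in units
        if sum(1 for line in u.splitlines() if line.strip()) >= min_lines
    ]


def tokenize(text, ignored):
    """Lowercase, split on whitespace, drop ignored words (hypothetical tokenizer)."""
    return [w for w in text.lower().split() if w not in ignored]


def tfidf(units, ignored=frozenset(), min_df=MIN_DF):
    """Return one {term: weight} dict per unit, weighting by tf * log(N / df)."""
    docs = [Counter(tokenize(u, ignored)) for u in units]
    # Iterating a Counter yields each distinct term once, so this counts documents.
    df = Counter(term for doc in docs for term in doc)
    vocab = {term for term, n in df.items() if n >= min_df}
    n_docs = len(docs)
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in doc.items() if t in vocab}
        for doc in docs
    ]
```

In this sketch the 665 stop words and whichever of the verb, adverb and name lists a given run ignores would be merged into a single set and passed as `ignored`, so each configuration listed above is just a different set.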