Greek Parliament topic modeling


  1. Windows
    • Each window is a folder that contains text documents
    • Each text document contains the speech of a single deputy speaker (one unit)
    • Filter out units shorter than 3 lines
    • The total number of documents equals the number of segments.
    • 1 window per month
  2. Stop word list
    • Use a stop word list of 665 words.
    • In addition to applying the stop word list, keep only the content words (adjectives, nouns, adverbs and verbs).
  3. During the TF-IDF calculation:
    • Remove rare terms that appear in fewer than 10 documents.
      • ignore 33801 verbs.
      • ignore 33801 verbs and 9637 adverbs.
      • ignore 33801 verbs, 9637 adverbs and 3303 persons' names.
  4. Number of topics (k): show results as a function of k
    • Choosing too few topics → overly broad topics
    • Choosing too many topics → highly similar topics
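
Steps 1 to 3 above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used for these notes. The function names (filter_units, tokenize, tfidf) are hypothetical, counting only non-empty lines is an assumption, tokenization is a naive whitespace split (no part-of-speech tagging, so the content-word filter is not shown), and the weight is the textbook tf * log(N / df), which may differ from the variant used for the results.

```python
import math
from collections import Counter

MIN_LINES = 3   # from the notes: drop units shorter than 3 lines
MIN_DF = 10     # from the notes: drop terms found in fewer than 10 documents


def filter_units(units, min_lines=MIN_LINES):
    """Keep only units (speeches) with at least `min_lines` lines.

    Assumption: blank lines are not counted.
    """
    kept = []
    for text in units:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if len(lines) >= min_lines:
            kept.append(text)
    return kept


def tokenize(text, stopwords, ignored=frozenset()):
    """Lowercase, split on whitespace, drop stop words and ignored terms.

    `ignored` stands in for the verb / adverb / person-name lists.
    """
    return [t for t in text.lower().split()
            if t not in stopwords and t not in ignored]


def tfidf(docs_tokens, min_df=MIN_DF):
    """Return one {term: weight} dict per document after pruning rare terms."""
    n_docs = len(docs_tokens)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))

    # Keep only terms that occur in at least `min_df` documents.
    vocab = {term for term, count in df.items() if count >= min_df}

    weights = []
    for tokens in docs_tokens:
        tf = Counter(t for t in tokens if t in vocab)
        # A term present in every document gets weight 0 with this formula.
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights
```

For one monthly window, the intended use would be to read the documents in the folder, pass them through filter_units, tokenize each one with the 665-word stop word list (and, for the three variants in step 3, the corresponding ignore lists), and hand the token lists to tfidf. The resulting weights would then feed the topic model that is run for each value of k.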