Instructions for MALLET Topic Modeling

Page history last edited by Alan Liu 8 years, 3 months ago


  • (A) Open a command shell ("command line") window
    1. Windows "command prompt" (using shell command language based on MS-DOS) -- quick guide -- tip on enabling copy/paste operations
    2. Mac "terminal" (using shell command language based on Unix) -- cheat sheet 
    3. Linux command line (typically bash, a Unix shell like the one in the Mac terminal)


  • (B) Navigate to the Mallet folder (directory)
    1. Windows: type the following at the command line, followed by a return (helpful tip: the <F5> function key pastes in the previously used command)
      •  cd c:\mallet
    2. Mac: type the following at the command line
      • cd /Users/yourusername/mallet
      • As with the Windows command above, this will depend on where and how you have saved the MALLET folder you downloaded when you installed the package. The above command assumes you've saved MALLET (and titled the folder it's in "mallet") under your home directory.


  • (C) Input a folder of text files and process them into a .mallet data file
    Use the command below, varying the path and file names as desired. The "--remove-stopwords" option tells MALLET to apply its built-in stopword list. In the general-format line, the descriptive phrases after each option (for example, "path to folder") stand for the paths and file names you supply. The example command lines below presume that the texts you wish to analyze reside in a subfolder called "workspace\1\documents" and that the topic model files will be output to "workspace\1". MALLET will show error messages if you have made a mistake in path or folder names. Use backslashes in Windows paths and forward slashes on Macs. There must be no hidden returns in the command. Best practice is to set the job up in a text document (with "wordwrap" view turned off) and copy/paste the command into the command shell.
    • General format (Windows): bin\mallet import-dir --input <path to folder> --output <path and filename of desired output data file with .mallet extension> --keep-sequence --remove-stopwords
    • Command line (Windows):
      • bin\mallet import-dir --input C:\workspace\1\documents --output C:\workspace\1\topics.mallet --keep-sequence --remove-stopwords
    • Command line (Mac):
      • ./bin/mallet import-dir --input ~/workspace/1/documents --output ~/workspace/1/topics.mallet --keep-sequence --remove-stopwords
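One way to avoid hidden returns and mixed-up slashes is to assemble the command in a short script and paste the printed result into the command shell. The following is a minimal Python sketch, not part of MALLET itself: the function name `build_import_command` and the workspace location are assumptions for illustration, and it presumes none of the paths contain spaces.

```python
from pathlib import Path


def build_import_command(mallet_bin, documents_dir, output_file):
    """Return the import-dir command as a list of arguments, one item per word."""
    return [
        str(mallet_bin),
        "import-dir",
        "--input", str(documents_dir),
        "--output", str(output_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]


# Hypothetical locations mirroring the example above; adjust to your own setup.
workspace = Path.home() / "workspace" / "1"
command = build_import_command(
    Path("bin") / "mallet",
    workspace / "documents",
    workspace / "topics.mallet",
)

# Joining with single spaces guarantees the command is one line with no returns.
print(" ".join(command))
```

Because the paths are built with `pathlib`, the printed command uses the slash direction of the system the script runs on.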


  • (D) Create the topic model ("train" the topic model) -- Use the following command.
    • General format (Windows): bin\mallet train-topics --input <path and filename of the previously created data file with .mallet extension> --num-topics <desired number of topics> --optimize-interval 20 --output-state <path to output folder>\topic-state.gz --output-topic-keys <path to output folder>\keys.txt --output-doc-topics <path to output folder>\composition.txt --word-topic-counts-file <path to output folder>\topic_counts.txt
    • Command line (Windows):
      • bin\mallet train-topics --input C:\workspace\1\topics.mallet --num-topics 50 --optimize-interval 20 --output-state C:\workspace\1\topic-state.gz --output-topic-keys C:\workspace\1\keys.txt --output-doc-topics C:\workspace\1\composition.txt --word-topic-counts-file C:\workspace\1\topic_counts.txt
    • Command line (Mac):
      • ./bin/mallet train-topics --input ~/workspace/1/topics.mallet --num-topics 50 --optimize-interval 20 --output-state ~/workspace/1/topic-state.gz --output-topic-keys ~/workspace/1/keys.txt --output-doc-topics ~/workspace/1/composition.txt --word-topic-counts-file ~/workspace/1/topic_counts.txt
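The training command is long mainly because the same output folder is repeated four times. As a rough sketch (the helper name `build_train_command` is hypothetical, and the option values are simply those used in the example above), the output paths can be derived from a single folder so that only the number of topics changes between runs:

```python
from pathlib import Path


def build_train_command(mallet_bin, data_file, output_dir, num_topics=50):
    """Return the train-topics command as a list, with all outputs in one folder."""
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")
    out = Path(output_dir)
    return [
        str(mallet_bin),
        "train-topics",
        "--input", str(data_file),
        "--num-topics", str(num_topics),
        "--optimize-interval", "20",
        "--output-state", str(out / "topic-state.gz"),
        "--output-topic-keys", str(out / "keys.txt"),
        "--output-doc-topics", str(out / "composition.txt"),
        "--word-topic-counts-file", str(out / "topic_counts.txt"),
    ]


# Hypothetical workspace; adjust to wherever your .mallet data file lives.
workspace = Path.home() / "workspace" / "1"
command = build_train_command(
    Path("bin") / "mallet", workspace / "topics.mallet", workspace, num_topics=50
)
print(" ".join(command))
```

As before, the printed line can be pasted into the command shell from inside the MALLET folder.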


  • (E) When the topic model is complete, you should see the following set of files in the C:\workspace\1 folder (~/workspace/1 on a Mac). The topics.mallet file is the data file from step (C); the others are created by the training step:
    • composition.txt
    • keys.txt
    • topic_counts.txt
    • topics.mallet
    • topic-state.gz
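If you prefer to check for these files from a script rather than by eye, something like the following Python sketch would do. The function name `missing_outputs` and the workspace location are assumptions for illustration; the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "composition.txt",
    "keys.txt",
    "topic_counts.txt",
    "topics.mallet",
    "topic-state.gz",
]


def missing_outputs(folder):
    """Return the expected file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Hypothetical workspace; adjust to the output folder used in your commands.
workspace = Path.home() / "workspace" / "1"
missing = missing_outputs(workspace)
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All expected files are present.")
```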


  • (F) Examine the topic model:
    • Open the "keys.txt" file in a text editor and copy all the content
    • Paste into an Excel spreadsheet for ease of manipulation and sorting.
    • Sort the topics by relative importance:
      • Click at the top of the second column in the spreadsheet to select the whole column. (This is the column of relative weights of the topics.)
      • Click on the "Data" tab in Excel. Choose to sort from largest to smallest value.
      • When prompted, "expand" the selection so that Excel sorts all the neighboring cells in a row together with the values being sorted in the second column.
    • Study the topics to see what is legible as a theme.  You may wish to color code different rows to cluster topics (e.g., all the topics that seem to you to be about family, or war, or art, or economics, etc.).  (There are more advanced means of applying "cluster analysis" to topics that are beyond the scope of this workshop.  Also beyond the scope of this workshop: generating word clouds from topics to help visualize themes.)
    • Depending on how useful (or not) the topic model seems, you may decide to improve the model. Common steps for iterating on a topic model include:
      • Removing noise from the text (common names, days of week, OCR errors, etc.)
      • Changing the number of topics that you ask MALLET to find. For example, 100 topics may be too granular for a small set of texts, while 10 topics may not be granular enough.
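The sorting step described above can also be done without a spreadsheet. The sketch below assumes that each line of keys.txt holds three tab-separated fields (topic number, relative weight, and the topic's key words) and that the file is UTF-8 text; the function names and the workspace location are illustrative only.

```python
from pathlib import Path


def load_topics(keys_path):
    """Read keys.txt into a list of (topic number, weight, key words) tuples."""
    topics = []
    with open(keys_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            topics.append((int(parts[0]), float(parts[1]), parts[2].strip()))
    return topics


def sort_by_weight(topics):
    """Order topics from largest to smallest weight (the second column)."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


# Hypothetical location of keys.txt; adjust to your own output folder.
keys_file = Path.home() / "workspace" / "1" / "keys.txt"
if keys_file.is_file():
    for number, weight, words in sort_by_weight(load_topics(keys_file)):
        print(f"{number}\t{weight:.4f}\t{words}")
```

Each row stays intact during the sort, which is the same effect as choosing to "expand" the selection in Excel.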



