Text Classification

We'll use the IMDB dataset for this task: a binary sentiment classification dataset containing 25,000 highly polarized movie reviews for training and 25,000 for testing. Additional unlabeled data is available as well.


			
			
			
    using FastAI

    data, blocks = load(datarecipes()["imdb"])
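Here, datarecipes() returns a registry of dataset recipes; indexing it with "imdb" and passing the result to load fetches the dataset (downloading it if necessary) and returns the observations together with the blocks describing the input and target types.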

Each sample is a review, which will be our input. The target is the sentiment of the review, either positive or negative.


			
			
			
    println("Sample = ", getobs(data, 1))
    println("Block = ", blocks)

			Sample = ("Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", "neg")
Block = (Paragraph(), Label{String}(["neg", "pos"]))

			
			
			
    task = FastAI.TextClassificationSingle(blocks, data)

The task consists of encodings that need to be applied to the input and target data.

The encodings for the input data are as follows:

  • Sanitize: Performs text-cleaning steps such as lowercasing, removing punctuation, removing stop words, and some fastai-specific preprocessing (special tokens such as xxbos, xxup, etc.).

  • Tokenize: Splits the text into word-level tokens.

  • EmbedVocabulary: Embeds the words into a vector space. This step constructs the vocabulary from the training data and returns the vector embedding for the input data.


			
			
			
    task.encodings

    input, target = getobs(data, 1)
    encoded_input, encoded_output = encodesample(task, Training(), (input, target))

    println(encoded_input)
    println(encoded_output)

			[25000, 633779, 11990, 46, 395, 102, 633779, 1220, 5383, 433, 1374, 306, 3246, 122678, 27, 47, 2198, 241, 523, 157, 657, 1, 79, 633779, 1353, 182, 306, 122678, 12727, 424, 720, 369, 633779, 614, 633779, 18, 1542, 633779, 296, 802, 739, 28, 633779, 305, 963, 985, 900, 633779, 4, 633779, 4, 633779, 900, 1700, 633779, 135, 633779, 24, 633779, 13, 633779, 42, 6696, 136]
Float32[1.0, 0.0]

Each integer in the encoded input is the index of a token in the vocabulary, and the target has been one-hot encoded: Float32[1.0, 0.0] corresponds to the first label, "neg". Let us now look at each step of the above encoding process.

Sanitize

The sanitized input data will have no stop words, no punctuation, and no uppercase characters. Along with those, it will also contain some fastai-specific tokens like xxbos (beginning of the sentence), xxup (the next word is uppercase in the original text), xxmaj (the first letter is uppercase in the original text), etc.


			
			
			
    encoding_1 = Textual.Sanitize()
    sanitized_data = FastAI.encode(encoding_1, Training(), Paragraph(), input)
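To build intuition, here is a toy plain-Julia version of what this cleaning step does conceptually. The stop-word list and the toy_sanitize helper are made up for illustration and are not part of FastAI.jl:

    # Illustrative only: a hypothetical, simplified sanitizer that
    # lowercases the text and drops a tiny stop-word list.
    toy_stopwords = Set(["a", "the", "of", "is"])
    toy_sanitize(text) =
        join(filter(w -> !(w in toy_stopwords), split(lowercase(text))), " ")

    toy_sanitize("Story of a man")  # returns "story man"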
Tokenize

Tokenize the sanitized input data.


			
			
			
    encoding_2 = Textual.Tokenize()
    tokenized_data = FastAI.encode(encoding_2, Training(), Paragraph(), sanitized_data)
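As a point of reference, word-level tokenization on whitespace can be sketched in one line of plain Julia; the actual Tokenize encoding may use a more sophisticated tokenizer:

    # Illustrative only: whitespace tokenization of an already-sanitized string.
    tokens = split("story man unnatural feelings pig")
    # ["story", "man", "unnatural", "feelings", "pig"]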
EmbedVocabulary

This is the most important step in the encoding process. It constructs the vocabulary from the training data and returns the vector embedding for the input data.


			
			
			
    vocab = setup(Textual.EmbedVocabulary, data)
    encoding_3 = Textual.EmbedVocabulary(vocab = vocab.vocab)

    vector_data = encode(encoding_3, Training(), Textual.Tokens(), tokenized_data)
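Conceptually, this numericalization behaves like a dictionary lookup from token to vocabulary index. A minimal plain-Julia sketch, with a made-up three-word vocabulary and an assumed index for out-of-vocabulary tokens:

    # Illustrative only: a hypothetical vocabulary mapping tokens to indices.
    toy_vocab = Dict("story" => 1, "man" => 2, "pig" => 3)
    unk_id = 0  # assumed index for out-of-vocabulary tokens

    ids = [get(toy_vocab, tok, unk_id) for tok in ["story", "man", "unknownword"]]
    # ids == [1, 2, 0]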