{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Text Preprocessing\n",
    ":label:`sec_text_preprocessing`\n",
    "\n",
    "We have reviewed and evaluated\n",
    "statistical tools\n",
    "and prediction challenges\n",
    "for sequence data.\n",
    "Such data can take many forms.\n",
    "Specifically,\n",
    "as we will focus on\n",
    "in many chapters of the book,\n",
    "text is one of the most popular examples of sequence data.\n",
    "For example,\n",
    "an article can be simply viewed as a sequence of words, or even a sequence of characters.\n",
    "To facilitate our future experiments\n",
    "with sequence data,\n",
    "we will dedicate this section\n",
    "to explain common preprocessing steps for text.\n",
    "Usually, these steps are:\n",
    "\n",
    "1. Load text as strings into memory.\n",
    "1. Split strings into tokens (e.g., words and characters).\n",
    "1. Build a table of vocabulary to map the split tokens to numerical indices.\n",
    "1. Convert text into sequences of numerical indices so they can be manipulated by models easily.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "origin_pos": 3,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [],
   "source": [
    "import collections\n",
    "import re\n",
    "from d2l import tensorflow as d2l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 4
   },
   "source": [
    "## Reading the Dataset\n",
    "\n",
    "To get started we load text from H. G. Wells' [*The Time Machine*](http://www.gutenberg.org/ebooks/35).\n",
    "This is a fairly small corpus of just over 30000 words, but for the purpose of what we want to illustrate this is just fine.\n",
    "More realistic document collections contain many billions of words.\n",
    "The following function (**reads the dataset into a list of text lines**), where each line is a string.\n",
    "For simplicity, here we ignore punctuation and capitalization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "origin_pos": 5,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# text lines: 3221\n",
      "the time machine by h g wells\n",
      "twinkled and his usually pale face was flushed and animated the\n"
     ]
    }
   ],
   "source": [
    "#@save\n",
    "d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',\n",
    "                                '090b5e7e70c295757f55df93cb0a180b9691891a')\n",
    "\n",
    "def read_time_machine():  #@save\n",
    "    \"\"\"Load the time machine dataset into a list of text lines.\"\"\"\n",
    "    with open(d2l.download('time_machine'), 'r') as f:\n",
    "        lines = f.readlines()\n",
    "    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]\n",
    "\n",
    "lines = read_time_machine()\n",
    "print(f'# text lines: {len(lines)}')\n",
    "print(lines[0])\n",
    "print(lines[10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 6
   },
   "source": [
    "## Tokenization\n",
    "\n",
    "The following `tokenize` function\n",
    "takes a list (`lines`) as the input,\n",
    "where each element is a text sequence (e.g., a text line).\n",
    "[**Each text sequence is split into a list of tokens**].\n",
    "A *token* is the basic unit in text.\n",
    "In the end,\n",
    "a list of token lists are returned,\n",
    "where each token is a string.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "origin_pos": 7,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
      "[]\n",
      "[]\n",
      "[]\n",
      "[]\n",
      "['i']\n",
      "[]\n",
      "[]\n",
      "['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']\n",
      "['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']\n",
      "['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n"
     ]
    }
   ],
   "source": [
    "def tokenize(lines, token='word'):  #@save\n",
    "    \"\"\"Split text lines into word or character tokens.\"\"\"\n",
    "    if token == 'word':\n",
    "        return [line.split() for line in lines]\n",
    "    elif token == 'char':\n",
    "        return [list(line) for line in lines]\n",
    "    else:\n",
    "        print('ERROR: unknown token type: ' + token)\n",
    "\n",
    "tokens = tokenize(lines)\n",
    "for i in range(11):\n",
    "    print(tokens[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 8
   },
   "source": [
    "## Vocabulary\n",
    "\n",
    "The string type of the token is inconvenient to be used by models, which take numerical inputs.\n",
    "Now let us [**build a dictionary, often called *vocabulary* as well, to map string tokens into numerical indices starting from 0**].\n",
    "To do so, we first count the unique tokens in all the documents from the training set,\n",
    "namely a *corpus*,\n",
    "and then assign a numerical index to each unique token according to its frequency.\n",
    "Rarely appeared tokens are often removed to reduce the complexity.\n",
    "Any token that does not exist in the corpus or has been removed is mapped into a special unknown token “&lt;unk&gt;”.\n",
    "We optionally add a list of reserved tokens, such as\n",
    "“&lt;pad&gt;” for padding,\n",
    "“&lt;bos&gt;” to present the beginning for a sequence, and “&lt;eos&gt;” for the end of a sequence.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "origin_pos": 9,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [],
   "source": [
    "class Vocab:  #@save\n",
    "    \"\"\"Vocabulary for text.\"\"\"\n",
    "    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):\n",
    "        if tokens is None:\n",
    "            tokens = []\n",
    "        if reserved_tokens is None:\n",
    "            reserved_tokens = []\n",
    "        # Sort according to frequencies\n",
    "        counter = count_corpus(tokens)\n",
    "        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],\n",
    "                                   reverse=True)\n",
    "        # The index for the unknown token is 0\n",
    "        self.idx_to_token = ['<unk>'] + reserved_tokens\n",
    "        self.token_to_idx = {token: idx\n",
    "                             for idx, token in enumerate(self.idx_to_token)}\n",
    "        for token, freq in self._token_freqs:\n",
    "            if freq < min_freq:\n",
    "                break\n",
    "            if token not in self.token_to_idx:\n",
    "                self.idx_to_token.append(token)\n",
    "                self.token_to_idx[token] = len(self.idx_to_token) - 1\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.idx_to_token)\n",
    "\n",
    "    def __getitem__(self, tokens):\n",
    "        if not isinstance(tokens, (list, tuple)):\n",
    "            return self.token_to_idx.get(tokens, self.unk)\n",
    "        return [self.__getitem__(token) for token in tokens]\n",
    "\n",
    "    def to_tokens(self, indices):\n",
    "        if not isinstance(indices, (list, tuple)):\n",
    "            return self.idx_to_token[indices]\n",
    "        return [self.idx_to_token[index] for index in indices]\n",
    "\n",
    "    @property\n",
    "    def unk(self):  # Index for the unknown token\n",
    "        return 0\n",
    "\n",
    "    @property\n",
    "    def token_freqs(self):  # Index for the unknown token\n",
    "        return self._token_freqs\n",
    "\n",
    "def count_corpus(tokens):  #@save\n",
    "    \"\"\"Count token frequencies.\"\"\"\n",
    "    # Here `tokens` is a 1D list or 2D list\n",
    "    if len(tokens) == 0 or isinstance(tokens[0], list):\n",
    "        # Flatten a list of token lists into a list of tokens\n",
    "        tokens = [token for line in tokens for token in line]\n",
    "    return collections.Counter(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 10
   },
   "source": [
    "We [**construct a vocabulary**] using the time machine dataset as the corpus.\n",
    "Then we print the first few frequent tokens with their indices.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "origin_pos": 11,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]\n"
     ]
    }
   ],
   "source": [
    "vocab = Vocab(tokens)\n",
    "print(list(vocab.token_to_idx.items())[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 12
   },
   "source": [
    "Now we can (**convert each text line into a list of numerical indices**).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "origin_pos": 13,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "words: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
      "indices: [1, 19, 50, 40, 2183, 2184, 400]\n",
      "words: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n",
      "indices: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]\n"
     ]
    }
   ],
   "source": [
    "for i in [0, 10]:\n",
    "    print('words:', tokens[i])\n",
    "    print('indices:', vocab[tokens[i]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 14
   },
   "source": [
    "## Putting All Things Together\n",
    "\n",
    "Using the above functions, we [**package everything into the `load_corpus_time_machine` function**], which returns `corpus`, a list of token indices, and `vocab`, the vocabulary of the time machine corpus.\n",
    "The modifications we did here are:\n",
    "(i) we tokenize text into characters, not words, to simplify the training in later sections;\n",
    "(ii) `corpus` is a single list, not a list of token lists, since each text line in the time machine dataset is not necessarily a sentence or a paragraph.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "origin_pos": 15,
    "tab": [
     "tensorflow"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(170580, 28)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_corpus_time_machine(max_tokens=-1):  #@save\n",
    "    \"\"\"Return token indices and the vocabulary of the time machine dataset.\"\"\"\n",
    "    lines = read_time_machine()\n",
    "    tokens = tokenize(lines, 'char')\n",
    "    vocab = Vocab(tokens)\n",
    "    # Since each text line in the time machine dataset is not necessarily a\n",
    "    # sentence or a paragraph, flatten all the text lines into a single list\n",
    "    corpus = [vocab[token] for line in tokens for token in line]\n",
    "    if max_tokens > 0:\n",
    "        corpus = corpus[:max_tokens]\n",
    "    return corpus, vocab\n",
    "\n",
    "corpus, vocab = load_corpus_time_machine()\n",
    "len(corpus), len(vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 16
   },
   "source": [
    "## Summary\n",
    "\n",
    "* Text is an important form of sequence data.\n",
    "* To preprocess text, we usually split text into tokens, build a vocabulary to map token strings into numerical indices, and convert text data into token indices for  models to manipulate.\n",
    "\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. Tokenization is a key preprocessing step. It varies for different languages. Try to find another three commonly used methods to tokenize text.\n",
    "1. In the experiment of this section, tokenize text into words and vary the `min_freq` arguments of the `Vocab` instance. How does this affect the vocabulary size?\n",
    "\n",
    "[Discussions](https://discuss.d2l.ai/t/115)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}