{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\def\\CC{\\bf C}\n", "\\def\\QQ{\\bf Q}\n", "\\def\\RR{\\bf R}\n", "\\def\\ZZ{\\bf Z}\n", "\\def\\NN{\\bf N}\n", "$$\n", "# Strings and the Burrows-Wheeler Transform\n", "\n", "Sage/Python includes a builtin datastructure from strings.\n", "\n", "There are several ways to input strings. You can input a string using single quotes (') or double quotes (\"):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a string!'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"This is a string!\"\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "So is this!" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = 'So is this!'\n", "print t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also input a string using three quotes (\"\"\" or '''). This is useful if you want to use both \" and ' in your string, or you want your string to span multiple lines:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "This is a multi-line\n", " string\n", "that includes 'single quotes'\n", " and \"double quotes\"." ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\n", "This is a multi-line\n", " string\n", "that includes 'single quotes'\n", " and \"double quotes\".\n", "\"\"\"\n", "print s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercises**\n", "\n", "1. Create and print the following string" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " \\ | ( | ) / /" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " > \\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_ | | | | | I <3 Coffee! /-- | | | /--/ \\_\\_\\_\\_\\_\\_\\_\\_\\_\\_\\_/\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Without using cut-and-paste(!) *replace* the substring `I <3 Coffee!` with the substring `I <3 Tea!`.\n", "3. Print a copy of your string with all the letters capitalized (upercase).\n", "\n", "## Operations on strings\n", "\n", "Strings behave very much like lists. The table below summarizes their common operations.\n", "\n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", "> \n", ">
OperationSyntax for listsSyntax for strings
Accessing a letterlist[3]string[3]
Slicinglist[3:17:2]string[3:17:2]
Concatenationlist1 + list2string1 + sting2
A copylist[:]string[:]
A reversed copylist[::-1]string[::-1]
Lengthlen(list)len(string)
\n", ">\n", "**Exercises**\n", "\n", "1. The factors of length 2 of 'rhubarb' are" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " > rh\n", " > hu\n", " > ub\n", " > ba\n", " > ar\n", " > rb\n", "\n", " Write a function called `factors` that returns a list of the factors of length `l` of `s` , and list all the factors of length 3 of 'rhubarb'.\n", "```{.python .input}\n", " # edit here\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. What happens if you apply your function `factors` to the list `[0,1,1,0,1,0,0,1]` ? If it doesn't work for both lists and strings, go back and modify your function so that it does work for both." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run-length encoding\n", "\n", "The string\n", "\n", "> `WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW`\n", "\n", "begins with `W` 12 times, then `B` once, then `W` 12 times, then `B` 3 times, then `W` 24 times, then `B` once and then `W` 14 times. Thus, it can be encoded by the tuples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(W, 12), (B, 1), (W, 12), (B, 3), (W, 24), (B, 1), (W, 14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is called the *run-length encoding* of the string.\n", "\n", "**Exercises**\n", "\n", "1. Write a function that returns the run-length encoding of a string. Does your function work for lists as well as strings? If not, then can you make it so that it works for both strings and lists? Use your function to compute the run-length encoding of the list:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " `[0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]`\n", "```{.python .input}\n", " # edit here\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rotations\n", "\n", "The *rotations* of the string 'bananas' are:\n", "\n", "> bananas\n", "> ananasb\n", "> nanasba\n", "> anasban\n", "> nasbana\n", "> asbanan\n", "> sbanana\n", "\n", "and if we sort these alphabetically, then we get:\n", "\n", "> ananasb\n", "> anasban\n", "> asbanan\n", "> bananas\n", "> nanasba\n", "> nasbana\n", "> sbanana\n", "\n", "**Exercises**\n", "\n", "1. Define a function `print_sorted_rotations` that sorts all the rotations of a string and prints them in an array as above. Print the sorted rotations of the strings 'ananas' and 'cocomero'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Burrows-Wheeler Transform\n", "\n", "The *Burrows-Wheeler Transform* (BWT) of a string `s` sorts all the rotations of `s` and then returns the last column.\n", "\n", "For example, if we sort the rotations of 'bananas':\n", "\n", "> ananasb\n", "> anasban\n", "> asbanan\n", "> bananas\n", "> nanasba\n", "> nasbana\n", "> sbanana\n", "\n", "then the last column is *bnnsaaa* , so the BWT of *bananas* is *bnnsaaa*.\n", "\n", "**Exercises**\n", "\n", "1. Write a function that returns the BWT of a string. Compute the BWT of *bananas* , *ananas* and *cocomero* . (*Hint:* You can return you answer as a list, but if you want to return a string, then you might want to use the `join` method for strings.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Combine the functions you defined above to create an `@interact` object that takes a string `s` and prints:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " - the sorted rotations of `s`\n", " - the run-length encoding of `s`\n", " - the BWT of `s`\n", " - the run-length encoding of the BWT of `s`\n", "\n", " (*Hint:* String formatting can be done using the `%` operator. Here is an example:\n", "```{.python .input}\n", " print 'The sum of %s and %s is %s.' % (3,2,3+2)\n", "```\n", "```{.json .output}\n", " [{\"data\": {\"text/plain\": \"The sum of 3 and 2 is 5.\"}, \"execution_count\": 1, \"metadata\": {}, \"output_type\": \"execute_result\"}]\n", "```\n", "\n", " If you are familiar with *C* then you will notice that string formating is very similar to *C* 's `sprintf` statement.)\n", "```{.python .input}\n", " # edit here\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Use your interact object to explore this transformation, and to answer the following questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " 1. Compute the BWT of the following.\n", " - `xxyxyxyxyxyxyxyxyxxyxyxyxyxyxyxyxyxy`\n", " - `01101001100101101001011001101001100101100110100101`\n", " - `cdccdcdccdccdcdccdcdccdccdcdccdccdcdccdcdccdccdcdc`\n", " 2. Do you notice any patterns in the BWT of a string?\n", " 3. Can you think of an application for this transformation?\n", " 4. Find 3 other strings that have a 'nice' image under the BWT.\n", " 5. Is the Burrows-Wheeler transformation invertible? (That is, can you find two strings that have the same BWT?)\n", "```{.python .input}\n", " # edit here\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. By comparing the BWT of a string with the first column of the array of sorted rotations of a string `s` , devise and implement an algorithm that reconstructs the first column of the array from the BWT of `s` ." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. By examining the first *two* columns of the array, devise and implement an algorithm that reconstructs the first *two* columns of the array from the BWT of a string. ( *Hint:* compare the last and first column with the first two columns.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. By examining the first *three* columns of the array, devise and implement an algorithm that reconstructs the first *three* columns of the array from the BWT of a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Write a function that reconstructs the entire array of sorted rotations of a string from the BWT of the string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. A *Lyndon word* is a word w that comes first in alphabetical order among all its rotations. Is the BWT invertible on Lyndon words?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. Explain how one can modify the BWT to make it invertible on arbitrary words. Implement your modified transformation and the inverse transformation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " # edit here" ] } ], "metadata": { "kernelspec": { "display_name": "sagemath", "name": "sagemath" } }, "nbformat": 4, "nbformat_minor": 2 }