Problem Set #2

Due Tuesday January 24, 2017, at the beginning of class. Assignments turned in more than 10 minutes after the beginning of class will be penalized. When in doubt, show your work.

The first three questions pertain to the following DP matrix:

  1. (5 points) Was this DP matrix generated by the Smith-Waterman or Needleman-Wunsch algorithm? How do you know?

  2. (5 points) For this DP matrix, is the gap penalty linear or affine? Give the value(s).

  3. (10 points) Draw an empty amino acid substitution matrix and fill in as many values as you can, based on the scores in the above DP matrix.

  4. (10 points) Explain how BLAST speeds up the search for sequences related to a query.

  5. (5 points) Qualitatively describe how decreasing or increasing the BLAST word length would change the speed and false-negative rate of BLAST searches (for simplicity, assume alignments start only from exact word matches).
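
To make the "exact word match" assumption in question 5 concrete, here is a minimal sketch of word-based seeding. It is not BLAST itself: the function names `index_words` and `find_seeds` and the toy sequences are made up for illustration. It indexes every word of length w in a query and lists the positions where a subject sequence contains the same word.

```python
from collections import defaultdict


def index_words(query, w):
    """Map every length-w word in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        # .get() avoids adding empty entries to the defaultdict
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    # Toy sequences, chosen only for illustration
    query = "GATCCATGA"
    subject = "TTGATCGATGA"
    for w in (3, 5):
        print(w, len(find_seeds(query, subject, w)))
```

Running it with different values of w on your own sequences is one way to check your reasoning about the trade-off.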

    NOTE: For programming problems, make sure your program works on a variety of different files (e.g., different numbers of lines)! My examples are just that: examples.
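
As a starting point for the programming problems, here is a minimal sketch of one way to take a file name from the command line and read a file with any number of lines. The script name count-lines.py and the helper `read_lines` are hypothetical, not part of the assignment; the sketch only counts lines and does not solve any of the problems below.

```python
import sys


def read_lines(path):
    """Return the lines of a text file as a list, without trailing newlines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]


def main(argv):
    # argv[0] is the script name; the file name is expected as argv[1].
    if len(argv) != 2:
        print("Usage: python count-lines.py <file>")
        return 1
    lines = read_lines(argv[1])
    print(len(lines))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Because the loop runs once per line, the same code handles an empty file, a one-line file, or a file with thousands of lines.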

  6. (10 points) Write a program copy-file.py that copies the contents of a given file, with the source file name and the copied file name as the two command line arguments. For example, if you have a file called seq1.txt that contains one line ("GATCCAT"), then you could create a copy of this file called seq2.txt as follows:

    > python copy-file.py seq1.txt seq2.txt
    > cat seq2.txt
    GATCCAT

  7. (10 points) Write a program reverse-lines.py that reads in the contents of a file, and prints out the lines in reverse order. For example, say that your file is called three-lines.txt and consists of these three lines:

    This is the first line.
    This is the second line.
    This is the third line.

    Your program should do this:

    > python reverse-lines.py three-lines.txt
    This is the third line.
    This is the second line.
    This is the first line.

  8. (10 points) Write a program split-number.py that reads a decimal number from the command line and prints its integer part on one line, followed by its decimal part (i.e., the digits after the decimal point) on a second line. For the decimal part, print no more than 6 digits after the decimal point, and do not print trailing zeroes.

    > python split-number.py 1.234567
    1
    0.234567
    > python split-number.py 1.23456711
    1
    0.234567
    > python split-number.py 1.23
    1
    0.23

  9. (10 points) Write a program format-number.py that takes as input two arguments: a number and a format, where the format is either integer, real or scientific. Print the given number in the requested format, and print an error if an invalid format string is provided.

    > python format-number.py 3.14159 integer
    3
    > python format-number.py 3.14159 real
    3.14159
    > python format-number.py 3.14159 scientific
    3.141590e+00
    > python format-number.py 3.14159 foo
    Invalid format: foo

  10. (10 points) Write a program merge-lines.py that reads a file and prints all the lines from the file on one line (no white space, no newlines). For example, if your file is called two-lines.txt and consists of these two lines:

    GATGTACC
    ATCGGACT

    Your program should do this:

    > python merge-lines.py two-lines.txt
    GATGTACCATCGGACT

  11. (15 points) Write a program read_matrix.py that opens a file like matrix.txt, stores the entries in a 2-dimensional list and prints the matrix out on the screen. Don't just read the lines and spit them back - you must store the values in a 2-dimensional list of numbers first (as if you were going to use them to score an alignment). The format of the file is tab-delimited text, with one integer value in each data field and one row of the matrix on each line. Don't worry about getting the output to look pretty, just have it be readable. Make sure your program works on ANY file with the format of matrix.txt regardless of how many rows and columns there are.

    > python read_matrix.py matrix.txt
    0 0 0 0 0
    0 5 2 0 2
    0 3 10 0 0
    0 5 5 15 0
    0 0 2 10 20

    How would you change your program to print each matrix value multiplied by a command-line specified integer? (You don't need to write the program; just indicate the changes.)

  12. Challenge problem 1. Read a sequence from a file (where the sequence may be on one or many lines) and randomize (shuffle) the order of residues in the sequence. Do the same thing but make it work on a FASTA-format sequence (a sequence name line that starts with '>', followed by any number of lines that contain the sequence). Do the same thing but make it work on a command-line specified range within the entire sequence (e.g., 'python shuf.py seq.fasta 17 23' would shuffle residues 17 to 23).
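
The FASTA layout described in challenge problem 1 can be illustrated with a minimal sketch that separates the name line from the sequence lines. The function name `parse_fasta` is hypothetical, and the sketch assumes the file holds a single record; it deliberately leaves the shuffling to you.

```python
def parse_fasta(lines):
    """Split FASTA-style lines into (name, sequence) for a single record."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]  # drop the leading '>'
        else:
            chunks.append(line)  # sequence may span many lines
    return name, "".join(chunks)


if __name__ == "__main__":
    # Toy record, made up for illustration
    example = [">seq1\n", "GATC\n", "CAT\n"]
    name, sequence = parse_fasta(example)
    print(name)
    print(sequence)
```

If the input has no '>' line, `name` stays `None` and all lines are treated as sequence, so the same function also covers the plain one-or-many-lines case.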

  13. Challenge problem 2. Read a file of tab-delimited text and print the Nth and Mth fields from each line, where N and M are command-line specified. (A tab-delimited text file is one in which each line contains blocks of text separated by tab characters, a common way of storing large biological data sets.) For example, if file.txt contains the following (meaning gene1 is on chromosome 17, starts at 128553 and ends at 129233):

    gene1 <tab> 17 <tab> 128553 <tab> 129233
    gene2 <tab> 17 <tab> 126003 <tab> 126512

    and your program was told to print the 1st and 3rd fields:

    > python tabdel.py file.txt 1 3
    gene1 <tab> 128553
    gene2 <tab> 126003

    (here I am representing the tab character with <tab> so that you can see it, but in the real input and output use the actual tab character)

    Do the same thing but give an error message if a requested field is missing from a line, e.g.:

    > python tabdel.py file.txt 1 7
    Error: line 1 doesn't have 7 fields
    Error: line 2 doesn't have 7 fields
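
The tab-delimited format used in challenge problem 2 can be illustrated with a minimal sketch that turns one line into a list of fields and back again. The helper names `split_fields` and `join_fields` are hypothetical; field selection and error handling are left to you.

```python
def split_fields(line):
    """Split one tab-delimited line into a list of field strings."""
    return line.rstrip("\n").split("\t")


def join_fields(fields):
    """Join field strings back into one tab-delimited line (no newline)."""
    return "\t".join(fields)


if __name__ == "__main__":
    fields = split_fields("gene1\t17\t128553\t129233\n")
    print(len(fields))               # number of fields on this line
    print(join_fields(fields[:2]))   # first two fields, tab-separated
```

Note that Python lists are indexed from 0, whereas the problem counts fields from 1, so the "1st field" is `fields[0]`.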