Python Recursion

Keys to recursion:

  • base case: the “one line” operation that does the calculation
  • state maintenance: remember the calculation result
  • stop condition: logic that stops the recursion

An example to illustrate:

# calculate the sum of all integers up to n
def tail_recursive_sum(n, total=0):
    if n == 0: # the stop condition
        return total
    total += n # the base case
    return sum(n-1, total) # total is used to maintain the state

# simplified version
def sum(n, total=0):
    if n== 0: return total
    return sum(n-1, total+n)

Another way to write this function:

def recursive_sum(n):
    if n == 1:
        return 1
    return n + sum(n-1)

The main difference is that recursive_sum does not use a variable to maintain the state. It relies on the function call stack. Assuming cur = 3, the above function call looks like this:

3 + sum(3-1)
3 + 2 + sum(2-1)
3 + 2 + 1

Notice that the 1st and 2nd call cannot be completed unless the 3rd call returns 1.

The call stack looks quite different with the tail_recursive_sum approach:

sum(3-1, 0+3)
sum(2-1, 3+2)
sum(1-1, 5+1)

When there is no additional operations other than the recursive function call at the end of a function, it is called tail recursion. (Some language support tail recursion optimization but Python doesn’t)

Here are some exercise to play with:

 

Python list sort vs sorted

A concise answer from this SO post:

  • sorted returns a new list, list.sort() mutate the calling list and returns None
  • For lists, list.sort() is faster than sorted() because it doesn’t have to create a copy. For any other iterable, you have no choice.
 

Google Python Style Guide

Useful tips from Google’s python style guideline:

Use nested for-loops instead of complex list comprehensions

for x in range(10):
      for y in range(5):
          if x * y > 10:
              result.append((x, y))

Don’t

result = [(x, y) for x in range(10) for y in range(5) if x * y > 10]

Use default iterators instead of iterating over a returned list

for key in adict: ...
if key not in adict: ...
if obj in alist: ...
for line in afile: ...
for k, v in adict.items(): ...

Don’t

for key in adict.keys(): ... 
      if not adict.has_key(key): ...
      for line in afile.readlines(): ...
      for k, v in dict.iteritems(): ...
  • Use “implicit” false if at all possible (for sequences like strings, lists, tuples)
  • Never compare a boolean variable to False using == (because falsy does not == False)
  • Compare integers to explicit integer values rather than rely on implicit false (accidentally handling None as 0)
  • Never use ==, != or implicit false to compare with None. Use is or is not
if not users:
    print('no users')

if foo == 0:
    self.handle_zero()

if i % 10 == 0:
    self.handle_multiple_of_ten()

def f(x=None):
    if x is None:
        x = []

Don’t:

if len(users) == 0:
    print('no users')

if foo is not None and not foo:
    self.handle_zero()

if not i % 10:
    self.handle_multiple_of_ten()

def f(x=None):
    x = x or []

Do not use mutable objects or values that are evaluated at module load time as default values in function definition

Yes: def foo(a, b=None):
         if b is None:
             b = []
Yes: def foo(a, b: Optional[Sequence] = None):
         if b is None:
             b = []
Yes: def foo(a, b: Sequence = ()):  # Empty tuple OK since tuples are immutable
         ...
No:  def foo(a, b=[]):
         ...
No:  def foo(a, b=time.time()):  # The time the module was loaded???
         ...
No:  def foo(a, b=FLAGS.my_thing):  # sys.argv has not yet been parsed...
         ...

Add trailing commas if the closing bracket is on the next line of the last item

Yes:   golomb4 = [
           0,
           1,
           4,
           6,
       ]

No:    golomb4 = [
           0,
           1,
           4,
           6
       ]

Always inherit from Object if a class has no other base classes

Yes: class SampleClass(object):
         pass


     class OuterClass(object):

         class InnerClass(object):
             pass


     class ChildClass(ParentClass):
         """Explicitly inherits from another class already."""


No: class SampleClass:
        pass

    class OuterClass:

        class InnerClass:
            pass

Do not use + and += on Strings (create temporary objects and results in quadratic running time)

Yes: items = ['<table>']
     for last_name, first_name in employee_list:
         items.append('<tr><td>%s, %s</td></tr>' % (last_name, first_name))
     items.append('</table>')
     employee_table = ''.join(items)

No: employee_table = '<table>'
    for last_name, first_name in employee_list:
        employee_table += '<tr><td>%s, %s</td></tr>' % (last_name, first_name)
    employee_table += '</table>'

 

Python __name__ = “__main__”

When a python file is executed python foo.py, a hidden variable named __name__ is set to __main__

# foo.py
print(__name__) # prints __main__

If foo.py is imported using import foo, then the __name__ variable is set to foo

if __name__ = "__main__" is used to check whether the current file is being run standalone or being imported. You can certainly fool python by assigning __main__ to __name__ manually:

# foo.py
__name__ = '__main__'

if __name__ = '__main__':
    print('running standalone')

#bar.py
import foo.py

File bar.py imports foo.py, when foo gets loaded, it will think it is running standalone and thus, print out running standalone.

Reference

https://stackoverflow.com/a/419986/1478290
https://stackoverflow.com/a/419185/1478290

 

Do not compare boolean variables to False using == in Python

Consider this snippet:

x = []
bool(x) # prints False
if x == False: # I know I should use `if not x`
    print('x == False') # prints nothing

Although the python documentation stated:

By default, an object is considered true unless its class defines either a bool() method that returns False or a len() method that returns zero’, Here are most of the built-in objects considered false:
* empty sequences and collections: ”, (), [], {}, set(), range(0)

https://docs.python.org/3/library/stdtypes.html#truth-value-testing

There is no mention of why considered false does not equal to False. In later paragraphs of the same document:

Objects of different types, except different numeric types, never compare equal.

So comparing a list reference to False is wrong to begin with. I was thinking if that is the case then why True == 1 works, because bool and int are different objects right? Wrong according to this SO post:

False object is of type bool which is a subclass of int….booleans are explicitly considered as integers in Python 2.6 and 3.

Do not compare the following to False using ==, because even though they evaluate to false, they do not equal to False:

  • constants defined to be false: None and False.
  • zero of any numeric type: 00.00jDecimal(0)Fraction(0, 1)
  • empty sequences and collections: ''()[]{}set()range(0)

Another reason to not use == when comparing boolean is because it is redundant:

isClosed = True
if isClosed == True: # is equivalent of if True == True, it's like keep appending '== True'

The right way of checking boolean is simply:

if isClosed # when isClosed == True
if not isClosed # when isClosed == False

However, if not isClosed is checking isClosed == False and isClosed is None at the same time:

x = None
if not x:
    print('x is None') # This line gets printed

x = False
if not x:
    print('x is False') # This line gets printed

To make sure x is False but not None:

x = None
if x is not None and not x:
    print('x is not None and is False') # never printed

x = False
if x is not None and not x:
    print('x is not None and is False') # ✓
 

Good Programming Naming Conventions

  • Start with a verb for functions that do not return a boolean, e.g. refresh_cache(), get_address(), insert()
    • for functions that return a boolean, start the name with is if possible, e.g. is_empty()
  • Name parameters using natural language, e.g, insert(at=1)
  • Being consistent when naming similar functionalities, e.g. list.add(), set.add()
  • Use natural language instead of one character variable name (unless it is a math formula), e.g. each = getEnumeration(), instead of e = ...
    • it is common to use i in a for-loop and x,y for coordinates, e for exceptions
  • Limit identifier names to 20 characters
  • Include units in a name, e.g. distanceInCentimeters, elapsedTimeInDdays
  • Class and object names should be a noun, e.g. Customer, WikiPage
  • If a variable name is used multiple times, it should be searchable (avoid using single characters)
  • Functions should only do one thing, thus their names should reflect that (instead of being ambiguous)

For Python:

  • Avoid dashes in any package/module name
  • avoid __double_leading_and_trailing_underscore__ names 

Recommended naming conventions for Python:

module_namepackage_nameClassNamemethod_nameExceptionNamefunction_nameGLOBAL_CONSTANT_NAMEglobal_var_name,instance_var_namefunction_parameter_namelocal_var_name.

Reference:
https://dmitripavlutin.com/coding-like-shakespeare-practical-function-naming-conventions/
http://wiki.c2.com/?GoodVariableNames
https://medium.com/coding-skills/clean-code-101-meaningful-names-and-functions-bf450456d90c

 

Apriori

The apriori algorithm uses three metrics to identify associated items. Assuming I made 3 purchases at a local store:

  1. {A, B, C}
  2. {B, C}
  3. {A, B}

Terminology
Transaction: each row above is called transaction
Itemset: {A,B,C} is an itemset
Rule: {A} => {B}, when {A} is in the basket, it implies that {B} will also be in the basket

By looking at the list, we can see that {A,B} & {B,C} appear together more often. To quantify the strength of this relationship, three metrics are used:

Support: P (A ∪ B) ÷ N, the probability of both A & B appearing together over the total number of transactions (i.e. how often do you see A & B together)
Confidence = P (B | A) = P (A ∪ B) ÷ P(A), i.e. how often do you see B when A is already in the basket. The notion of P(B | A) is called conditional probability and it is read as the probability of B given A.
Lift = P (B | A) ÷ P(B), it means the chance of B given A versus the chance of B happening along. In layman’s term, if A & B always appear together, then A & B should appear more often then B appears along in the basket. A value of 1 means A does not affect how often B appears, they are independent. If the value > 1, then it means A has an effect on B, the larger the value the stronger the effect. If value < 1, then it means the more A appears, the less A&B appears.

Example
Rule: {A} -> {B}
Support = 2 (A & B appear 2 times) ÷ 3 (three transactions in total)
Confidence = 2 (A & B appear 2 times) ÷ 2 (A appears 2 times)
Lift of (A & B) = 2 (A & B appear 2 times) ÷ 2 (A appears 2 times) ÷ 3 (B appears 3 times)

The Apriori algorithm can generate many rules, let’s say there are d number of unique items in a transactions, then the total number of rules for this particular itemset equals to $ 3^d – 2^{d+1} + 1$

For itemset {A,B,C} : $ \text{total number of rules} = 3^3 – 2^ 4 + 1 = 27 – 16 + 1 = 12$

It is a good exercise to write them out:
A -> B
B -> A
A -> C
C -> A
A -> BC
BC -> A
B -> C
C -> B
B -> AC
AC -> B
CA -> B
B -> CA

As the number of items grows, the number of rules grows much quicker. It would be too time-consuming to calculate all of the rules without filtering out meaningless data. Thus, minsup (minimum support) and minconf (minimum confidence) are used.

Thus, to interpret Apriori result quickly, choose a minsup and minconf to filter out useless data, and then order the rules with the highest Lift to lowest and filter out any Lift <= 1. Not all of the rules make sense, they still need to be interpreted by a human.

Reference:
https://annalyzin.wordpress.com/2016/04/01/association-rules-and-the-apriori-algorithm/
https://www-users.cs.umn.edu/~kumar001/dmbook/ch6.pdf

 

TypeError: something is not subscriptable

In layman’s terms: You are trying access me like a list but I don’t support that kind of functionality

Technically speaking: the object you called upon didn’t implement the __getitem()__ method.

A common cause of the error is when you are trying to access a sequence that is not ordered:

x = set([1,2,3])
x[0]
 
Bitnami