Wednesday, October 1, 2014

My Car Buying Learnings

I recently went through a car buying ordeal and wanted to share some points that worked for me. Some of them came from reviewing the plethora of car buying advice on the internet, and some were pure luck or thinking on my feet.

Which car to buy is a challenging decision in its own right. The points below apply after you have narrowed down the car you want to buy.

1. Buy the Consumer Reports report on your vehicle(s). It is an under-$20 investment on a purchase worth several thousand dollars, so it is totally worth it.

2. Get pre-approved for a loan, preferably from a federal credit union (per my research, they offer the lowest rates at this time).

3. Ask for email quotes from multiple dealers. This way you can avoid visiting multiple dealerships and wasting time.

4. Focus on the out-the-door $$$. Ignore "free" throw-ins like tire locks, 2-year free maintenance, etc. Recruit a buddy to constantly enforce this; I kept getting tempted to go down the throw-ins path. Fortunately, my mother was my buddy and kept me from chasing the shiny object.

5. Pick a month whose last day falls on a Monday or a Sunday. In hindsight, I believe this double whammy really worked for me.

6. Reach the dealer to sign the deal late on Sunday afternoon, when they are in a hurry to pack up as well; 4 PM is ideal.

7. Be ready to walk away if anything does not feel right.

8. Follow the Consumer Reports Car Buying Guide to a 'T'.

9. Negotiate the financing rate, using your pre-approval from #2 above as leverage.

10. Negotiate the extended warranty $$$ if you are getting one, and pay for it outside the loan amount to reduce finance charges. I did not do this and, in hindsight, should have.

11. Ask them to waive the doc fees.

12. Pit one dealer against the other and ask each to provide a better offer. Do not feel bad about this; have a buddy constantly remind you not to feel bad about this.

13. One learning missing from all the research I did: buyer's remorse is real, and you should prepare for it. If you are, like me, financially conservative (my wife prefers the term "stingy"), it will hit you after the deal closes. Acknowledging it makes it easier to manage.

One litmus test I discovered for whether you did well: how annoyed the sales guy was while you were signing the deal. The value of the deal is directly proportional to the pissed-off factor!

Do you have any tips or experiences to share on car buying? How about on home buying? Feel free to share those in the comments.

Tuesday, September 16, 2014

Pycasa - Group duplicate files

A problem I have been battling is how disorganized my photos and videos are. I have multiple backups from various devices, which has resulted in multiple copies of images and videos. Recently, I started thinking about how to organize them, with the following goals:

  • Remove duplicates
  • Organize by year and by event
  • Remove images that are out of focus or not relevant


After manually going through a few folders, I realized that it was going to be a very onerous task, so I started thinking about writing a program to identify duplicates. I had recently watched a great video by David Beazley on Python generators, and they were exactly what I needed for my purposes.

So I created this program that groups the duplicates together and prints a list for review.

import os
import time
import fnmatch
from collections import defaultdict

#http://szudzik.com/ElegantPairing.pdf
#Szudzik's pairing function: combines size_of_file_in_bytes and last_modified
#into one number associated uniquely with that pair.
def elegant_pairing_fn(x, y):
    if x >= y:
        return (x * x) + x + y
    else:
        return (y * y) + x

#Generates a sequence of file names matching the given search pattern, starting from the top directory.
def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

#Generates a sequence of dictionaries with detailed information for each file.
def gen_stat(filelist):
    for name in filelist:
        dstat = dict(zip(osstatcolnames, os.stat(name)))
        dstat['filename'] = name
        dstat['unique_id'] = elegant_pairing_fn(dstat['size_of_file_in_bytes'], dstat['last_modified'])
        yield dstat

osstatcolnames=('protection_bits', 'inode_number', 'device', 'number_of_hard_links'
          , 'user_id_of_owner', 'group_id_of_owner', 'size_of_file_in_bytes'
          , 'last_accessed', 'last_modified', 'created')

t0 = time.time()
jpgfiles = gen_find("*.jpg", "C:\\Users\\Shantanu\\Pictures")
jpgfilestats = gen_stat(jpgfiles)

#Now group the file names that share the same unique_id value.
#(See the note after the code for why size + last-modified is assumed unique per image.)
filecount = 0
jpgfilegroups = defaultdict(list)
for l in jpgfilestats:
    filecount += 1
    jpgfilegroups[l['unique_id']].append(l['filename'])

#Print groups with more than one file name in them.
#  We do not care to review files that appear only once in our stash.
count = 0
groupcount = 0
for k,v in jpgfilegroups.items():
    groupcount += 1
    if len(v) > 1:
        count += 1
        print(count, k, v)
print ("Total files processed: ", filecount)
print ("Total file groups processed: ", groupcount)
print ("Total time taken: ", time.time() - t0)


On my laptop, the program processed ~11K files in 2.8 seconds and came up with 2689 groups of possible duplicates. Manually reviewing those groups is a much simpler task.

The algorithm assumes that the combination of file size and last modified date is unique per image file. In other words, it is almost impossible for two different images to have the same size and the same last modified date. This is a safe assumption as long as the files have never been modified with photo editing software (which is conveniently true in my case).
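
If you want extra assurance before deleting anything, a content hash removes the guesswork. Here is a minimal sketch (not part of the program above) that double-checks a candidate group by hashing the actual file bytes with the standard hashlib module; only files with identical digests are byte-for-byte duplicates:

import hashlib
from collections import defaultdict

#Illustrative helper: confirm a candidate group by hashing file contents.
#Files that share a SHA-256 digest are byte-for-byte identical.
def confirm_duplicates(filenames, chunk_size=65536):
    digests = defaultdict(list)
    for name in filenames:
        h = hashlib.sha256()
        with open(name, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        digests[h.hexdigest()].append(name)
    #Only digests shared by more than one file are confirmed duplicates.
    return [names for names in digests.values() if len(names) > 1]

For example, calling confirm_duplicates(v) on each group v from jpgfilegroups would whittle the candidate groups down to confirmed duplicates only.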

I learnt some new tools as a result of this exercise:

  • A practical use of Python generators for scratching my own itch.
  • The Python os library and its functions
  • The mathematical concept of 'pairing functions' to uniquely represent two numbers as one (see the demo after this list).
  • EDIT 20141020: I just realized that jpgfilegroups is actually a hash table using chaining for collision resolution.
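
To see the 'uniquely' part of pairing functions in action, here is a small round-trip demo. It is illustrative only: it assumes non-negative integers, reuses elegant_pairing_fn from the program above, and elegant_unpairing_fn is a hypothetical helper that recovers both numbers from the paired value.

import math

#Hypothetical inverse of Szudzik's pairing function: recover (x, y) from z.
#(math.sqrt is fine at this scale; very large z would need an exact integer sqrt.)
def elegant_unpairing_fn(z):
    q = math.floor(math.sqrt(z))
    r = z - q * q
    if r < q:
        return (r, q)       #z came from the y*y + x branch (x < y)
    else:
        return (q, r - q)   #z came from the x*x + x + y branch (x >= y)

print(elegant_pairing_fn(2, 3))    #11
print(elegant_unpairing_fn(11))    #(2, 3)
print(elegant_pairing_fn(3, 2))    #14
print(elegant_unpairing_fn(14))    #(3, 2)
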
What problems have you solved using these tools? Feel free to use this code to organize your photos and videos.

Monday, September 15, 2014

My Latest Laptop

I wrote the first half of this post in 2011 with an update in February 2014.

After 7 years of superb performance from my IBM Thinkpad T40, it finally died of a broken power connector on the motherboard. After doing much research, I decided to stick with Lenovo instead of changing brands.

I noticed, in my behavior and that of some of my friends, that one doesn't switch laptop brands easily.

So here is what I ended up with:
Lenovo Z570
Second Generation i7 (2.0 GHz)
8 GB RAM
240 GB OCZ Vertex 3 SATA 3 (6 Gbps) SSD
700 GB 5400 RPM WD HDD
External Optical Drive

Here is how much it cost me (including taxes & shipping):
1. $10.59 - For 3 new Phillips head screwdrivers
2. $32.85 - SATA-to-eSATA connector cable for the external optical drive
3. $51.65 - Optical Drive HDD Caddy
4. $742.69 - Lenovo Z570 Laptop
5. $49.99 - Acronis True Image Home
6. $459.99 - OCZ Vertex 3 SSD 240 GB
For a total of - $1347.76

That's the price I paid for a laptop with:
1. Almost 1 TB of total disk space
2. Super Fast SSD Performance (~20 Second Boot times)
3. Security of backups given the instability of SSDs!
4. External Optical Drive that is not taking up space in the laptop.

Update 20140205: Return of the IBM Thinkpad T40.
First, an update on the Z570: it has been working great. As Jeff Atwood of Coding Horror fame says, "A solid state hard drive is easily the best and most obvious performance upgrade you can make on any computer for a given amount of money. Unless your computer is absolute crap to start with." It is well worth the price. Also, I have been fortunate not to have had any catastrophic SSD failures yet.

Going back to the T40: I am glad that I did not throw it away. I was able to move most of the working parts from my T40 to another one that the IT department at work was trashing. I moved the HDD, internal wireless card, internal Bluetooth card, and RAM to the new chassis.

In the process, I learnt a lot about the insides of a laptop. Most of it is like a jigsaw puzzle: the parts are built so that, mostly, they fit into perfect slots. The biggest challenge was keeping track of all the screws. Every time I opened the laptop, I ended up with a few screws that I could not trace back to where they came from!

One change that I did make was to move to Lubuntu 12.04. With Windows XP reaching end of life in April 2014, and knowing that the T40 hardware is too weak for Windows 7 and above, I decided to switch. That was a smart move.

By switching to Lubuntu, I have extended the life of my T40 by at least another 2 years.

Now all I need is a new battery pack, 1 GB of RAM and a 60 GB SSD (SATA 2 will be fine). I will then put the current HDD into the optical drive bay and make the SSD the master. I should end up with the meanest T40 out there!

Do you have any insights to share on ways to alter laptops to make them more useful from a practical point of view? 

Saturday, August 9, 2014

Python GroupBy, Map & Reduce

I came across a really interesting data wrangling technique while watching this presentation on advanced Python programming techniques.

Here is the example from that talk. Suppose you have a list of dictionary objects sorted by the 'id' key, my_list, as defined below.
>>> my_list = [
    {'id':1, 'name':'raymond'},
    {'id':1, 'email':'ray@spkrbar.com'},
    {'id':2, 'name':'sue'},
    {'id':2, 'email':'sue@sally.com'}] #sorted

Using dict, groupby, map, and reduce, there is a very elegant way of grouping all of those dictionaries by 'id' to get the following list:

[
  {'id': 1, 'email': 'ray@spkrbar.com', 'name': 'raymond'}, 
  {'id': 2, 'email': 'sue@sally.com', 'name': 'sue'}
]

And here is how you do it:
>>> from itertools import groupby
>>> [dict(
    reduce(lambda y,z: y + z,
        map(lambda x: x.items(), v)
    )
)
for k, v in groupby(my_list, key=lambda x: x['id'])]
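
One caveat: the one-liner above is Python 2 flavored, where reduce is a built-in and items() returns a list. A Python 3 equivalent (same logic, just sketched with the needed tweaks) imports reduce from functools and wraps items() in list() so the '+' concatenation still works:

>>> from itertools import groupby
>>> from functools import reduce
>>> [dict(
        reduce(lambda y, z: y + z,
            map(lambda x: list(x.items()), v)
        )
    )
    for k, v in groupby(my_list, key=lambda x: x['id'])]

The result is the same grouped list; only the imports and the list() conversion change.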

Notice how SQL-like this is. Let us break the statement down into its individual components to understand what is happening under the covers. Like any SQL statement, we have to start deciphering it from the inside out.

Let us look at the for loop with the groupby operation in it. The groupby operation returns an iterator that groups consecutive items by the 'key' parameter (this is why the input list must be sorted by 'id'). In this case, the key is an anonymous function that returns the 'id' value; essentially, we are asking for the grouping to happen on the 'id' values (1, 2, etc.).

>>> print({k: list(v) for k, v in groupby(my_list, key=lambda x: x['id'])})

{
 1: [{'id': 1, 'name': 'raymond'}, 
     {'id': 1, 'email': 'ray@spkrbar.com'}], 
 2: [{'id': 2, 'name': 'sue'}, 
     {'id': 2, 'email': 'sue@sally.com'}]
}

Note that the 'id' values are the keys and they are also repeated as part of the values. This will come in handy for the next step.

Next, we map an anonymous function that calls the items() method on each element over each of the groups returned by the groupby operation.

>>> for k, v in groupby(my_list, key=lambda x: x['id']):
...     print(map(lambda x: x.items(), v))
...
[[('id', 1), ('name', 'raymond')], [('id', 1), ('email', 'ray@spkrbar.com')]]
[[('id', 2), ('name', 'sue')], [('id', 2), ('email', 'sue@sally.com')]]

This gives us lists of tuples instead of dictionaries, which makes the reduction step very easy.

Now, we reduce the list that comes out of the mapping step by doing a simple addition of lists. Adding two lists results in a list with the elements of both; we would not have been able to use the '+' operator if these were dictionaries. Notice also the duplicate 'id' tuple.
>>> print(reduce(lambda y, z: y + z, [[('id', 1), ('name', 'raymond')], [('id', 1), ('email', 'ray@spkrbar.com')]]))
[('id', 1), ('name', 'raymond'), ('id', 1), ('email', 'ray@spkrbar.com')]

The reduce step will also happen for the list for 'id' 2 in this example.

We are almost at the end now. The last step is to make a dictionary from the list coming out of the reduce step; this removes the duplicate tuples and gets us back to the input format, a list of dictionaries.

>>> print(dict([('id', 1), ('name', 'raymond'), ('id', 1), ('email', 'ray@spkrbar.com')]))
{'id': 1, 'email': 'ray@spkrbar.com', 'name': 'raymond'}

Note that the duplicate 'id' tuple got removed. As before, this step will also happen for the list for 'id' 2.

That is it! Now we have what we need: a list of dictionary objects, grouped by 'id'.

[
  {'id': 1, 'email': 'ray@spkrbar.com', 'name': 'raymond'}, 
  {'id': 2, 'email': 'sue@sally.com', 'name': 'sue'}
]

What are some of the practical uses you see for this technique? Do you have any other slick tricks to share? Let me know in the comments below.