difflib::SequenceMatcher Class Reference


Detailed Description

SequenceMatcher is a flexible class for comparing pairs of sequences of
any type, so long as the sequence elements are hashable.  The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980s by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching".  The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk).  The same idea is then applied
recursively to the pieces of the sequences to the left and to the right
of the matching subsequence.  This does not yield minimal edit
sequences, but does tend to yield matches that "look right" to people.
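
The recursive idea above can be sketched in a few lines.  This is a
deliberately simplified illustration, not the code in difflib.py: it
ignores junk entirely and finds each longest block by brute force, and
the helper names longest_common_block and matching_blocks are made up
for this sketch.

```python
def longest_common_block(a, b, alo, ahi, blo, bhi):
    """Return (i, j, size) for the longest a[i:i+size] == b[j:j+size]
    inside a[alo:ahi] and b[blo:bhi]. Brute force, no junk handling."""
    best_i, best_j, best_size = alo, blo, 0
    for i in range(alo, ahi):
        for j in range(blo, bhi):
            size = 0
            while (i + size < ahi and j + size < bhi
                   and a[i + size] == b[j + size]):
                size += 1
            # Strict ">" keeps the earliest block when sizes tie.
            if size > best_size:
                best_i, best_j, best_size = i, j, size
    return best_i, best_j, best_size


def matching_blocks(a, b, alo=0, ahi=None, blo=0, bhi=None):
    """Find the longest block, then recurse on the pieces to its left
    and to its right, returning the blocks in left-to-right order."""
    if ahi is None:
        ahi = len(a)
    if bhi is None:
        bhi = len(b)
    i, j, size = longest_common_block(a, b, alo, ahi, blo, bhi)
    if size == 0:
        return []
    left = matching_blocks(a, b, alo, i, blo, j)
    right = matching_blocks(a, b, i + size, ahi, j + size, bhi)
    return left + [(i, j, size)] + right


blocks = matching_blocks("abxcd", "abcd")
print(blocks)  # [(0, 0, 2), (3, 2, 2)]
```

Here "ab" is found first, and the recursion on the right-hand pieces
("xcd" and "cd") then picks up "cd".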

SequenceMatcher tries to compute a "human-friendly diff" between two
sequences.  Unlike, e.g., UNIX(tm) diff, the fundamental notion is the
longest *contiguous* and junk-free matching subsequence.  That's what
catches people's eyes.  The Windows(tm) windiff has another interesting
notion, pairing up elements that appear uniquely in each sequence.
That, and the method here, appear to yield more intuitive difference
reports than does diff.  Of the three, though, this method appears to be
the least vulnerable to synching up on blocks of "junk lines" (such as
blank lines in ordinary text files, or maybe "<P>" lines in HTML files).
That may be because it is the only one of the three that has a
*concept* of "junk" <wink>.

Example, comparing two strings, and considering blanks to be "junk":

>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")
>>>
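
To see what the junk predicate changes, here is a small sketch using
find_longest_match() on a different pair of strings (the strings and the
variable names plain and junk_aware are chosen for illustration only).
Results are wrapped in tuple() so they print the same whether the method
returns a plain tuple or a named tuple.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# No junk predicate: the blank is an ordinary element, so the whole of
# " abcd" lines up with the " abcd" found in the middle of b.
plain = SequenceMatcher(None, a, b).find_longest_match(
    0, len(a), 0, len(b))

# Blanks are junk: a match may not start from a blank, so the matcher
# settles on "abcd" against the first "abcd" in b instead.
junk_aware = SequenceMatcher(lambda x: x == " ", a, b).find_longest_match(
    0, len(a), 0, len(b))

print(tuple(plain))       # (0, 4, 5)
print(tuple(junk_aware))  # (1, 0, 4)
```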

.ratio() returns a float in [0, 1], measuring the "similarity" of the
sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
sequences are close matches:

>>> print(round(s.ratio(), 3))
0.866
>>>
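
As a rough cross-check, the same number can be reproduced by hand as
2.0*M/T, where M is the total number of matched elements and T is the
combined length of both sequences.  The variable names matched, total
and by_hand below are illustrative only.

```python
from difflib import SequenceMatcher

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"
s = SequenceMatcher(lambda x: x == " ", a, b)

# M: sum of the sizes of all matching blocks (the dummy block adds 0).
matched = sum(size for _, _, size in s.get_matching_blocks())
# T: total number of elements in both sequences.
total = len(a) + len(b)

by_hand = 2.0 * matched / total
print(round(by_hand, 3))  # 0.866
```

All 29 characters of the first string are matched somewhere in the
38-character second string, giving 2 * 29 / 67.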

If you're only interested in where the sequences match,
.get_matching_blocks() is handy:

>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 6 elements
a[14] and b[23] match for 15 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by .get_matching_blocks() is always a
dummy, (len(a), len(b), 0), and this is the only case in which the last
tuple element (number of elements matched) is 0.
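
A short sketch of how calling code might separate the real blocks from
the trailing dummy; the names real_blocks and sentinel are illustrative
and not part of the module.

```python
from difflib import SequenceMatcher

a, b = "abxcd", "abcd"
s = SequenceMatcher(None, a, b)

# Normalise to plain tuples so the comparison below works regardless of
# the exact tuple type the method returns.
blocks = [tuple(block) for block in s.get_matching_blocks()]
real_blocks, sentinel = blocks[:-1], blocks[-1]

# Every real block names an identical slice in both sequences.
for i, j, n in real_blocks:
    assert a[i:i + n] == b[j:j + n]

print(real_blocks)  # [(0, 0, 2), (3, 2, 2)]
print(sentinel)     # (5, 4, 0)
```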

If you want to know how to change the first sequence into the second,
use .get_opcodes():

>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:14] b[17:23]
 equal a[14:29] b[23:38]
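
Each opcode is a 5-tuple (tag, i1, i2, j1, j2), where tag is one of
"equal", "insert", "delete" or "replace".  The sketch below replays the
opcodes to rebuild the second sequence from the first; apply_opcodes is
a hypothetical helper written for this illustration, and it assumes both
sequences are strings.

```python
from difflib import SequenceMatcher


def apply_opcodes(a, b, opcodes):
    """Rebuild b by walking the opcodes that turn a into b."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.append(a[i1:i2])      # keep this stretch of a
        elif tag in ("replace", "insert"):
            out.append(b[j1:j2])      # take the new text from b
        # "delete": a[i1:i2] is simply dropped
    return "".join(out)


a, b = "qabxcd", "abycdf"
s = SequenceMatcher(None, a, b)
tags = [opcode[0] for opcode in s.get_opcodes()]
rebuilt = apply_opcodes(a, b, s.get_opcodes())

print(tags)     # ['delete', 'equal', 'replace', 'equal', 'insert']
print(rebuilt)  # abycdf
```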

See the Differ class for a fancy human-friendly file differencer, which
uses SequenceMatcher both to compare sequences of lines, and to compare
sequences of characters within similar (near-matching) lines.

See also function get_close_matches() in this module, which shows how
simple code building on SequenceMatcher can be used to do useful work.
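
A minimal usage sketch of get_close_matches(), relying on its default
settings (at most three results, and a similarity cutoff of 0.6, the
same rule of thumb mentioned for .ratio()).  The word list is made up
for the example.

```python
from difflib import get_close_matches

words = ["ape", "apple", "peach", "puppy"]

# Candidates scoring below the cutoff are dropped; the rest come back
# ordered from most to least similar.
matches = get_close_matches("appel", words)
print(matches)  # ['apple', 'ape']
```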

Timing:  Basic R-O is cubic time in the worst case and quadratic time in
the expected case.  SequenceMatcher is quadratic time in the worst case;
its expected-case behavior depends in a complicated way on how many
elements the sequences have in common.  Best-case time is linear.

Methods:

__init__(isjunk=None, a='', b='')
    Construct a SequenceMatcher.

set_seqs(a, b)
    Set the two sequences to be compared.

set_seq1(a)
    Set the first sequence to be compared.

set_seq2(b)
    Set the second sequence to be compared.

find_longest_match(alo, ahi, blo, bhi)
    Find longest matching block in a[alo:ahi] and b[blo:bhi].

get_matching_blocks()
    Return list of triples describing matching subsequences.

get_opcodes()
    Return list of 5-tuples describing how to turn a into b.

ratio()
    Return a measure of the sequences' similarity (float in [0,1]).

quick_ratio()
    Return an upper bound on .ratio() relatively quickly.

real_quick_ratio()
    Return an upper bound on .ratio() very quickly.
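
Because the two quick variants are upper bounds, they can serve as
cheap filters before paying for the full .ratio().  The sketch below
shows one way to chain them; is_close is a hypothetical helper, and its
default cutoff of 0.6 is only the rule of thumb quoted earlier.

```python
from difflib import SequenceMatcher

s = SequenceMatcher(None, "abcd", "bcde")
exact = s.ratio()                  # 0.75
quick = s.quick_ratio()            # 0.75
very_quick = s.real_quick_ratio()  # 1.0


def is_close(a, b, cutoff=0.6):
    """Check the cheapest bound first; bail out as soon as one fails."""
    m = SequenceMatcher(None, a, b)
    return (m.real_quick_ratio() >= cutoff
            and m.quick_ratio() >= cutoff
            and m.ratio() >= cutoff)


print(exact <= quick <= very_quick)  # True
print(is_close("abcd", "bcde"))      # True
print(is_close("abcd", "wxyz"))      # False
```

If either bound already falls below the cutoff, the exact ratio cannot
reach it, so the expensive comparison is skipped.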

Definition at line 27 of file difflib.py.


Public Member Functions

def __init__
def find_longest_match
def get_matching_blocks
def get_opcodes
def quick_ratio
def ratio
def real_quick_ratio
def set_seq1
def set_seq2
def set_seqs

Public Attributes

 a
 b
 b2j
 b2jhas
 fullbcount
 isbjunk
 isjunk
 matching_blocks
 opcodes

Private Member Functions

def __chain_b
def __helper

Generated by Doxygen 1.6.0