LMPTHW ex32 Scanner.skip() method

I’m working on ex33, the parser. I’m using Zed’s ex32 solution, Scanner, to build my parser, and I am studying the skip() method. Here it is:

def skip(self, *what):
    for x in what:
        if x != 'INDENT': self.ignore_ws()

        tok = self.tokens[0]
        if tok[0] != x:
            return False

    return True

The confusing bit for me is that the ex33.py quick demo parser which I’m using to guide my solution uses a skip() function which appears to take no arguments (besides the tokenized code). The way skip() is used in the ex33 demo is to just automatically skip the first element in the list, regardless of the element’s token type.
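As far as I can tell, the argument-less version could be as simple as this (a sketch of my guess at what the demo does; the tuple shape of the tokens is my assumption):

```python
def skip(tokens):
    """Discard the first token unconditionally, whatever its type.

    This is the ex33-demo style: no token-type checking at all,
    assuming tokens is a list of (type, lexeme) tuples.
    """
    tokens.pop(0)

tokens = [('DEF', 'def'), ('NAME', 'hello')]
skip(tokens)  # discard def; tokens now starts at ('NAME', 'hello')
```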

Here’s how skip is called in the ex33 parser demo:

def function_definition(tokens):
    """
    funcdef = DEF name LPAREN params RPAREN COLON body
    I ignore body for this example 'cause that's hard.
    I mean, so you can learn how to do it.
    """
    skip(tokens) # discard def
    name = match(tokens, 'NAME')
    match(tokens, 'LPAREN')
    params = parameters(tokens)
    match(tokens, 'RPAREN')
    match(tokens, 'COLON')
    return {'type': 'FUNCDEF', 'name': name, 'params': params}

I’m going to make my skip method do things the ex33 way, but I’d like to know why the Scanner.skip() method is so complicated. What is the purpose of all that whitespace removal? Why does it take token id strings as arguments?

I’m looking over the tokenized output from the scanner. I can see that all whitespace is tokenized as INDENT, but I don’t think that’s right. I think an indent should be four consecutive spaces. Maybe I should make new token regexes. One for ‘four-space’ indents, and another for individual spaces. My only concern is that this change will break the parser.

That is entirely possible, since I’m only worrying about getting that little sample to work and it only has one level of indent. Here’s the line of code:

That regex matches all of the spaces at the beginning of the line, which gets you an indent, but no levels in multiples of 4. Now, you have three choices:

  1. Go with that, and then in the Parser or Scanner determine the level of indent simply by the spaces in the found string. So you keep this regex, then when it’s an indent, do len() on it, and divide by 4. Keep in mind that Python’s algorithm for this is way more complicated but this would get you by.
  2. Change the regex to match any set of 4 spaces, then count each indent token. Problem is now your parser has to account for each level of indent and count any number of indent tokens. This makes it a lot harder to write the parser.
  3. Just stick to my plan and say you’re only doing this one level of indent, then fix it later by using a real scanner that doesn’t suck like this hand coded piece of garbage I have for demonstration purposes.
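Choice #1 boils down to one small helper. This is just a sketch of the idea (the function name is mine, and it assumes the scanner hands the parser the whole run of leading spaces as a single INDENT lexeme):

```python
def indent_level(lexeme):
    """Choice #1: the indent level is the count of leading spaces
    divided by 4, computed from the single INDENT lexeme the
    scanner produced for the whole run of spaces."""
    return len(lexeme) // 4

print(indent_level('        '))  # 8 spaces -> level 2
print(indent_level('    '))      # 4 spaces -> level 1
```

As Zed notes, real Python's algorithm is more involved (it tolerates inconsistent widths and tabs), but integer division by 4 is enough to get by here.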

I’d say try out #2 but do it after you get #3 working. Also, if you have code for your scanner then let me see it. I know you struggled with it so I’d like to take a look. I have bandwidth now.

I will stick with choice 3 to keep things as simple as possible. Once I understand how to parse expressions and the function body, I’ll be better equipped to implement a choice 2 solution.

Here’s my Scanner from ex32. It’s pretty close to yours, but I didn’t include the start index attribute for each lexeme in the string. I also implemented the Token class.
and here is a test to make it work…

Oops. I was hoping those would be visible in the message. Oh well.

I think I understand why I was confused about the INDENT tokens. I didn’t realize the peek method pops all the INDENT tokens off before it returns the token type of the first non-INDENT token. I suppose that means that, if I wanted to eventually do a solution using choice 2 (4-space indent tracking), I’d have to rewrite the Scanner.peek() method to only remove non-indent whitespace, and handle indent-levels within the parser. Yikes. Complicated.

That looks alright. To make it show up in the post you have to change the permissions on the project so that they’re accessible from the world.

Anyway, the only thing I’m questioning is why you have a match_script and match_scan function. I believe your match_script is just match, and match_scan can either stay that way or call it something like “tokenize” to differentiate it.

Yes, that’s exactly right. It’s doable, but it’s easier to just match all the spaces in the front of the line, then create the INDENT token for the parser to use as one big chunk, rather than multiple INDENT.

Your implementation also has the class style of Token, so you can include some information special for INDENT. In the parser you really only need to know if the indent went up or down, right? So, the scanner just cares about indentation. It’s got the amount of space and makes one token for it. Now the Parser’s job is to see that one INDENT token and determine if it is an increase or a decrease of the level. In my simple garbage parser this works fine, but notice you’d also have to track the DEDENT and INDENT levels to really do Python.

So, one strategy is the scanner keeps track of the last INDENT token it handled, and that’s all the spaces at the front. When it hits a new INDENT token, it looks and sees if this one is shorter than the last one. If it is then it actually calls that a DEDENT token. Finally, you would add to the INDENT and DEDENT tokens a count based on how many spaces you encountered. To keep it simple, require it be only 4 spaces, even though real python doesn’t care and figures it out.
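That strategy might be sketched like this (a hypothetical helper of my own, assuming 4-space indents and one run of leading spaces per line, as described above):

```python
def indent_or_dedent(spaces, last_width):
    """Turn a line's leading spaces into an INDENT or DEDENT token.

    last_width is the width of the previous line's leading spaces.
    Returns (token, new_last_width); token is None when the level
    didn't change. To keep it simple, indents must be multiples of
    4 spaces, even though real Python doesn't care.
    """
    width = len(spaces)
    assert width % 4 == 0, "keep it simple: 4-space indents only"
    if width > last_width:
        return ('INDENT', (width - last_width) // 4), width
    elif width < last_width:
        return ('DEDENT', (last_width - width) // 4), width
    return None, width

tok, w = indent_or_dedent('    ', 0)  # one level in -> ('INDENT', 1)
tok, w = indent_or_dedent('', w)      # back out -> ('DEDENT', 1)
```

The count carried on each token is what lets the parser close several blocks at once when the code dedents more than one level in a single step.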

But, save that for later when you want to work on this. It’s better to get the first little version of it going that works, then go in and improve it.