Thinking out loud

Hi all,

I’m a regular user of the LCLive Discord chat. I get a lot of help over there, but I also like to use the chat as a place to think out loud. Unfortunately, Discord is a terrible medium to document learning. I’ll keep a similar thread here instead. My posts will mostly be ramblings, but feel free to chime in if I’m exploring anything which interests you. I’m currently working on Learn More Python the Hard Way ex31 (regex).

Austen

1 Like

Ok… Working on LMPTHW ex32. I’ve copied out the scanner code. It seems to be working and I’ve looked over it several times for errors, but everything seems ok on the surface. However, when I run the script, I get the following error:

>>> def hello(x, y):
>>>  hello(x, y):
Traceback (most recent call last):
  File "ex32.py", line 37, in <module>
    assert token, "Failed to match line %s" % string
AssertionError: Failed to match line  hello(x, y):

I think what is happening is that the scanner isn’t skipping the whitespace before applying the regex to scan the name. That could be why the ‘def’ token is scanned, but not the ‘hello’ name. I’m looking to see if part of the script is broken, or whether I need to add that feature myself.

I will continue investigating in the morning.

I have a sneaking suspicion the problem is in this while loop. Maybe it has to do with how the index is incremented. I think the steps for the string " \s\s\s\shello(x, y):" the match should fail, incrementing the start index until all the ‘space’ characters are removed and the string begins with ‘hello’.

for line in code:
    i = 0
    while i < len(line):
        token, string, end = match(i, line)
        assert token, "Failed to match line %s" % string
        if token:
            i += end
            script.append((token, string, i, end))

I’ll play with this in the morning. I’m off to bed now.

I found it! The problem was in the ‘match’ function definition. The return None statement was indented incorrectly.

The book was fine. It was my mistake.

I discovered it when I issued print statements for each token. Only the first token was being checked against the string before the return None statement kicked in.

You know I believe that this chat system has something like this already. It’s either called your “wiki” or your “journal”, something like that. I’ll check around and maybe I have to enable some security.

Now that I’m nearly done with ex32, I’m posting this for later reading

It sounds like you just pass the generator a matching expression with a label and it spits out a language-specific lexer, but there must be more to it than that!! I’m going to follow the links to the particular generators listed on the wikipedia page to see if I can understand the nuts and bolts of these things.

More to read.
Heavy lifting… (eg NFA = Nondeterministic Finite Automata)

https://www.tutorialspoint.com/compiler_design/compiler_design_lexical_analysis.htm

I’m guessing NFA (Nondeterministic Finite Automata) is somehow related to FSM (Finite State Machine)

Moving on to ABNF grammars. This tutorial is helping me understand how they work.

http://matt.might.net/articles/grammars-bnf-ebnf/

eg.

<return-statement> ::= exit | return <expression>

I got that snippet from here:

That pipe ‘|’ appears to be some kind of either/or condition.

Here are the python3 grammars. They aren’t in ABNF, but they’re close enough.
https://docs.python.org/3/reference/grammar.html

Implementing parser.root() method
Here is Zed’s code

def root(tokens):
    """root = *(funccal / funcdef)"""
    first = peek(tokens)
    
    if first == 'DEF':
        return function_definition(tokens)
    elif first == 'NAME':
        name = match(tokens, 'NAME')
        second = peek(tokens)
        
        if second == 'LPAREN':
            return function_call(tokens, name)
        else:
            assert False, "Not a FUNCDEF or FUNCCALL"

There is no catch-all for the ‘first’ condition. I think I will add one. Something like the bottom else condition.

At long last, battered and bruised, I cranked out a parser. It’s a turd, but at least it works on the example code. Here’s some formatted output to show the structure of my parse tree productions.


<ROOT>
 body: [
<FUNCDEF>
 name: hello 
 params: 
<PARAMS>
 parameters: [
<EXPRVAR>
 identifier: x
</EXPRVAR>, 
<EXPRVAR>
 identifier: y
</EXPRVAR>]
</PARAMS> 
 body: 
<BODY>
 expressions: [
<FUNCCALL>
 name: token_id: NAME, lexeme: 'print'
 params: 
<PARAMS>
 parameters: [
<EXPRPLUS>
 left: 
<EXPRVAR>
 identifier: x
</EXPRVAR>
 right: 
<EXPRVAR>
 identifier: y
</EXPRVAR>
</EXPRPLUS>]
</PARAMS>
</FUNCCALL>]
</BODY>
</FUNCDEF>, 
<FUNCCALL>
 name: token_id: NAME, lexeme: 'hello'
 params: 
<PARAMS>
 parameters: [
<EXPRNUM>
 value: 10
</EXPRNUM>, 
<EXPRNUM>
 value: 20
</EXPRNUM>]
</PARAMS>
</FUNCCALL>]
</ROOT>

Why make it look like html tags? Because it has a tree structure, I needed a way to show containment where it exists.

My parser can’t handle indents… or even function bodies with more than one line! I’ll have to fix that. Also, I used Zed’s scanner with some slight modifications. If I were to do this again, I’d rewrite the scanner from scratch to handle 4-space tabs and track indent/dedent.

Here’s the code. Run the test_parser.py to make the magic happen.

https://lab.learncodethehardway.com/austen/lmpthw_solutions/tree/master/ex33

Third attempt at the scanner. This scanner prunes whitespace and tracks and prunes indents. It also adds a special NEWLINE token which stores indent level and line number information. Each regular Token has a reference to one of these NEWLINE tokens. Here’s some sample output.

[
>>>NLTOKEN: NUMBER:0, INDENT:0,
 DEF: 'def',    (0, 3),
 NAME: 'hello', (4, 9),
 LPAREN: '(',   (9, 10),
 NAME: 'x',     (10, 11),
 COMMA: ',',    (11, 12),
 NAME: 'y',     (13, 14),
 RPAREN: ')',   (14, 15),
 COLON: ':',    (15, 16),
 
>>>NLTOKEN: NUMBER:1, INDENT:1,
 NAME: 'print', (4, 9),
 LPAREN: '(',   (9, 10),
 NAME: 'x',     (10, 11),
 PLUS: '+',     (12, 13),
 NAME: 'y',     (14, 15),
 RPAREN: ')',   (15, 16),
 
>>>NLTOKEN: NUMBER:2, INDENT:0,
 NAME: 'hello', (0, 5),
 LPAREN: '(',   (5, 6),
 INTEGER: '10', (6, 8),
 COMMA: ',',    (8, 9),
 INTEGER: '20', (10, 12),
 RPAREN: ')',   (12, 13)]

Now on to the new parser.

I managed to rewrite my scanner-parser from scratch within 24 hours! It now tracks indenting.
Here is the test input:

CODE = """def hello(x, y):
    print(x + y)
    
hello(10, 20)

hello(4, 7)
"""

Here is the output. This time I manually indented the output to make it easier to see the containment (I should probably write a script to automate this formatting). The indenting in the script is tracked in the Body productions.

<ROOT>
 body:
    <BODY indent: 0 >
     contents: [
        <FUNCDEF>
        name: hello
        params:
            <PARAMS>
            parameters: [
                <EXPRVAR>
                identifier: x
                </EXPRVAR>,
                <EXPRVAR>
                identifier: y
                </EXPRVAR>]
            </PARAMS>
        body:
            <BODY indent: 1 >
            contents: [
                <FUNCCALL>
                name: print
                params:
                    <PARAMS>
                    parameters: [
                        <EXPRPLUS>
                        left:
                            <EXPRVAR>
                            identifier: x
                            </EXPRVAR>
                        right:
                            <EXPRVAR>
                            identifier: y
                            </EXPRVAR>
                        </EXPRPLUS>]
                    </PARAMS>
                </FUNCCALL>]
            </BODY>
        </FUNCDEF>,
        <FUNCCALL>
        name: hello
        params:
            <PARAMS>
            parameters: [
                <EXPRNUM>
                value: 10
                </EXPRNUM>,
                <EXPRNUM>
                value: 20
                </EXPRNUM>]
            </PARAMS>
        </FUNCCALL>,
        <FUNCCALL>
        name: hello
        params:
            <PARAMS>
            parameters: [
                <EXPRNUM>
                value: 4
                </EXPRNUM>,
                <EXPRNUM>
                value: 7
                </EXPRNUM>]
            </PARAMS>
        </FUNCCALL>]
    </BODY>
</ROOT>```

I should mention that I reused the production classes from the previous attempts because

  1. right now they’re just containers
  2. they only needed minor changes

The code:

https://lab.learncodethehardway.com/austen/lmpthw_solutions/blob/master/ex33/productions.py
https://lab.learncodethehardway.com/austen/lmpthw_solutions/blob/master/ex33/scanner.py
https://lab.learncodethehardway.com/austen/lmpthw_solutions/blob/master/ex33/parser.py
https://lab.learncodethehardway.com/austen/lmpthw_solutions/blob/master/ex33/test_parser.py