Suggestions for building a parser from RTF to Markdown

DidierCH · August 25, 2018, 3:10pm

Hi @zedshaw, I need a way to convert RTF rich text documents to Markdown. I couldn’t find any parser for that (sadly pandoc can’t do it), so I’m thinking about building one myself. I wanted to ask if you have any specific tips on how you would begin this project and what data structures you would use (loops like in ex48 of lpthw, or regex, or something else). I think I should be able to build a basic parser myself, but I’m wondering where to start and could use some tips.
Thanks in advance.

zedshaw · August 27, 2018, 2:51am

Ohhhhh that is going to be really really tough. I think your best bet is this:

https://textract.readthedocs.io/en/stable/

That will do text extraction from a lot of formats, and then you have to figure out how to craft the markdown.

DidierCH · August 27, 2018, 3:59am

Okay, thank you for the answer and the link. I will take this route. I also found the option to convert RTF to HTML with unrtf (used in textract) and then convert the HTML to Markdown with pandoc. Maybe my only task then would be to build a script to automate this.

unrtf: https://www.gnu.org/software/unrtf/
pandoc: http://pandoc.org/

zedshaw · August 29, 2018, 5:42pm

Yeah, that might work pretty well too. To script those two, look at the Python subprocess module:

https://docs.python.org/3/library/subprocess.html

That’ll help you launch the programs and get their output and error codes. Then you’re just gluing them together really.
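Here is a rough sketch of that gluing, assuming `unrtf` and `pandoc` are installed and on your PATH. The function names `run_stage` and `rtf_to_markdown` are made up for illustration, and the exact flags may need adjusting for your versions of the tools.

```python
import subprocess


def run_stage(cmd, input_bytes=None):
    """Run one external program and return what it wrote to stdout.

    cmd is a list such as ["unrtf", "--html", "doc.rtf"].
    input_bytes, if given, is fed to the program on stdin.
    A non-zero exit code is turned into a RuntimeError that carries
    the program's stderr text.
    """
    result = subprocess.run(
        cmd,
        input=input_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        message = result.stderr.decode("utf-8", errors="replace")
        raise RuntimeError(
            "{} exited with code {}: {}".format(cmd[0], result.returncode, message)
        )
    return result.stdout


def rtf_to_markdown(rtf_path):
    """Hypothetical two-step pipeline: RTF -> HTML (unrtf) -> Markdown (pandoc)."""
    # Step 1: unrtf writes the HTML to stdout.
    html = run_stage(["unrtf", "--html", rtf_path])
    # Step 2: pandoc reads the HTML from stdin because no input file is named.
    markdown = run_stage(["pandoc", "-f", "html", "-t", "markdown"], input_bytes=html)
    return markdown.decode("utf-8")
```

You would call it as `print(rtf_to_markdown("notes.rtf"))`, where `notes.rtf` stands in for your own file. Keeping each program in its own `run_stage` call means a failure tells you which tool broke and why.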

Keep in mind that you are probably not going to get a perfect translation. I would say you might need a 3rd step that opens the markdown file, parses it into some kind of Python data structure, and then fixes it. I think there’s a markdown parsing library for that. Google it.
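As a starting point for that 3rd step, here is a small sketch that does not use a real markdown parser at all. It is a plain line-based cleanup pass using only the standard library, and the one fix it applies (collapsing runs of blank lines) is just a guess at the kind of noise you might see. The name `tidy_markdown` is made up for illustration.

```python
def tidy_markdown(text):
    """Collapse runs of blank or whitespace-only lines into one blank line.

    This is a simple line filter, not a markdown parser. It does not know
    about code blocks, so blank lines inside them are collapsed as well.
    """
    cleaned = []
    for line in text.splitlines():
        if line.strip() == "":
            # Treat whitespace-only lines as blank, and keep at most one in a row.
            if cleaned and cleaned[-1] == "":
                continue
            cleaned.append("")
        else:
            cleaned.append(line)
    # Drop blank lines at the start and end, then finish with one newline.
    return "\n".join(cleaned).strip("\n") + "\n"
```

Once you see what the converted files really look like, you can add more fixes to the loop, or swap the loop for a proper parsing library if the problems turn out to be structural.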

DidierCH · August 30, 2018, 5:05am

Thanks a lot. That sounds like a good starting point. Will look into this.