Working on a compiler, need some opinions

In summary: I don't know what platform you're targeting.Erm, I don't think I'm explaining it properly, or you're looking too deep into "compiling". For all intents and purposes - this takes BASIC style syntax and generates a binary file to run natively on a targeted... er, sorry, I don't know what platform you're targeting.
  • #1
elusiveshame
169
35
Hey fellow nerds and geeks!

I'm working on a new version of my compiler, mainly to improve performance on the compiler itself as it was my first attempt at such a program. Since I've learned a ton on compiler design (probably not the best of methods, but hey, it works, right?), I've built a new tokenizer and am now in the works of porting over the routines.

My question comes is what is the best way to handle arrays?

I'm currently using an old version of Visual Basic (VB6) for reference, and the compiler I'm making is also a BASIC dialect aimed for a specific processor.

So what do I mean about handling arrays? What I mean is what is the best way to store information about a declared array (name, # of indices, and size of indices).

What I did in my previous compiler was to take the syntax of, say:
ASP.net:
Dim MyArray(2,4,6) As Integer

The compiler would store that into a string, like:
ASP.net:
arraylist$ = arraylist$ & "__array_MyArray_2_4_6|__array_AnotherArray...etc"

If I needed to validate the array (ensure boundaries, for instance), I'd just parse the string (explode it into an array and loop until I found a match of __array_arrayName). This, as you can imagine with large projects, can be slow. I could probably save time by using Instr() to find a match and then search for the next pipe by using the returned value of the Instr() function.

Anyway, any suggestions on the best method?
 
Technology news on Phys.org
  • #3
elusiveshame said:
Hey fellow nerds and geeks!

I'm working on a new version of my compiler, mainly to improve performance on the compiler itself as it was my first attempt at such a program. Since I've learned a ton on compiler design (probably not the best of methods, but hey, it works, right?), I've built a new tokenizer and am now in the works of porting over the routines.

My question comes is what is the best way to handle arrays?

I'm currently using an old version of Visual Basic (VB6) for reference, and the compiler I'm making is also a BASIC dialect aimed for a specific processor.

So what do I mean about handling arrays? What I mean is what is the best way to store information about a declared array (name, # of indices, and size of indices).

What I did in my previous compiler was to take the syntax of, say:
ASP.net:
Dim MyArray(2,4,6) As Integer

The compiler would store that into a string, like:
ASP.net:
arraylist$ = arraylist$ & "__array_MyArray_2_4_6|__array_AnotherArray...etc"

If I needed to validate the array (ensure boundaries, for instance), I'd just parse the string (explode it into an array and loop until I found a match of __array_arrayName). This, as you can imagine with large projects, can be slow. I could probably save time by using Instr() to find a match and then search for the next pipe by using the returned value of the Instr() function.

Anyway, any suggestions on the best method?
Are you sure that what you're writing should be called a compiler? A compiler produces machine instruction object code. What you've described looks to me like a statement parser.
 
  • #4
sysprog said:
Are you sure that what you're writing should be called a compiler? A compiler produces machine instruction object code. What you've described looks to me like a statement parser.

Yeah. This is just one step. This also sets the integer memory locations and all that as well -- this particular question is more for how the IDE should handle the array data efficiently. This compiles down to assembly and then is sent to an assembler.
 
  • #5
elusiveshame said:
Yeah. This is just one step. This also sets the integer memory locations and all that as well -- this particular question is more for how the IDE should handle the array data efficiently. This compiles down to assembly and then is sent to an assembler.
BASIC was designed by Dr. Kemeny (Professor at Dartmouth) to be a non-compiled language -- it ran/runs on an interpreter -- it seems to me that you want to figure out how to pre-process arrays originally specified in BASIC so that they will be optimally or efficiently allocated memory when the code is eventually assembled and run -- that's not only a tall order; it's a tall stack of tall orders.
 
  • #6
sysprog said:
BASIC was designed by Dr. Kemeny (Professor at Dartmouth) to be a non-compiled language -- it ran/runs on an interpreter -- it seems to me that you want to figure out how to pre-process arrays originally specified in BASIC so that they will be optimally or efficiently allocated memory when the code is eventually assembled and run -- that's not only a tall order; it's a tall stack of tall orders.

Erm, I don't think I'm explaining it properly, or you're looking too deep into "compiling". For all intents and purposes - this takes BASIC style syntax and generates a binary file to run natively on a targeted processor without a framework or set of libraries. Think of it as C/C++ with BASIC syntax.

That said, this is more of a data handling problem while translating the BASIC code. Since arrays can have multiple indices, I can't use a static array or method to keep track of the array day, so I'm wondering if using a "flatfile" approach is best (i.e., keeping the data in a pipe delimited string variable and just use string functions to see if it exists, then extract the upper bounds of each index).

This is only for basic error checking and proper syntax during compiling. During runtime, these aren't checked, at all. They can't be.
 
  • #7
elusiveshame said:
Anyway, any suggestions on the best method?
Yes, learn about how to write a compiler and start again from scratch. Whilst it is possible to do it the way you are doing by reinventing wheels, I can see no reason to do this and you certainly won't be able to ask for help.

Your main problem is that you are trying to do too much in one step, you need to break it down into chunks:
A lexer that will turn the string
Code:
"Dim MyArray(2,4,6) As Integer"
into a stream of tokens
Code:
RW_DIM, NAME, "MyArray", BRACE_OPEN, CONST_INT, 2, CONST_INT, 4, CONST_INT, 6, RW_AS, RW_INTEGER
A parser that will turn that stream of tokens into something like
Code:
symbolTable.createMultiDimArray("MyArray", 3, 2, 4, 6)
And then the code dumped in the code generation stage by the symbolTable class will handle things like bounds checking.

You can write all of this yourself, but it would be easier to write your compiler in another language e.g. Python where robust classes to do each of these stages already exist - here's an OK guide to show what I mean - so that you can focus on the requirements of your language (and your object code if you cannot use a standard virtual machine target).
 
  • Like
Likes sysprog
  • #8
Not exactly what you are after, but many years ago I extended the 'Small C' compiler to add some missing features. The language and commented source code of the original are in the public domain.

I learned a LOT about the nitty-gritty of compilers working on that. There are later versions for different processors readily available on github, might be some for Basic though I haven't looked.

https://en.wikipedia.org/wiki/Small-C
https://www.google.com/search?&q=small+c+compiler+source+code

Have Fun!

Cheers,
Tom
 
  • Like
Likes elusiveshame and sysprog
  • #9
pbuk said:
Yes, learn about how to write a compiler and start again from scratch. Whilst it is possible to do it the way you are doing by reinventing wheels, I can see no reason to do this and you certainly won't be able to ask for help.

Your main problem is that you are trying to do too much in one step, you need to break it down into chunks:
A lexer that will turn the string
Code:
"Dim MyArray(2,4,6) As Integer"
into a stream of tokens
Code:
RW_DIM, NAME, "MyArray", BRACE_OPEN, CONST_INT, 2, CONST_INT, 4, CONST_INT, 6, RW_AS, RW_INTEGER
A parser that will turn that stream of tokens into something like
Code:
symbolTable.createMultiDimArray("MyArray", 3, 2, 4, 6)
And then the code dumped in the code generation stage by the symbolTable class will handle things like bounds checking.

You can write all of this yourself, but it would be easier to write your compiler in another language e.g. Python where robust classes to do each of these stages already exist - here's an OK guide to show what I mean - so that you can focus on the requirements of your language (and your object code if you cannot use a standard virtual machine target).

I'll try once more to explain, because it seems you're telling what I need to do and I've already done all of that.

I have already:
* A lexer
* Parser
* Assembler

Taking:
Code:
    Dim a(5,3) As Integer
Already runs through a tokenizer, and for all intents and purposes, outputs:
[Keyword] [Variable] [LPAREN] [NUMBER] [Delim] [NUMBER] [RPAREN] [Keyword] [Identifier]

Easy peasy. If I do zero error checking for the BASIC code, I don't need anything else in that regard. So, I already wrote all of this myself.

This is for the method for the IDE to retain the array data, in memory, so when someone does something like this:
Code:
    b = a(0,9)

or even
Code:
    b = a(x,y,z)

an error can be displayed because they went out of bounds or has too many indices referenced. I know this has limitations (such as it won't detect index boundaries with variables since there's no underlying framework, and only specific error vectors are available on the target CPU, largely arithmetic, heap, and stack errors).

All I'm asking is what is the best way to keep track of the arrays during compilation time, so when an array pops up, I can get the boundaries of each index and number of indices to do a quick integrity check before continuing the compiling process. In the past, I stored everything into a string, exploded it into an array and searched each index. I can use a string still and use faster string searching methods.

All I'm asking is if there is a better way to do that - is there a better way to keep track of arrays, since they can have various amounts of indices, or is a "flatfile" method the best way?

Tom.G said:
Not exactly what you are after, but many years ago I extended the 'Small C' compiler to add some missing features. The language and commented source code of the original are in the public domain.

I learned a LOT about the nitty-gritty of compilers working on that. There are later versions for different processors readily available on github, might be some for Basic though I haven't looked.

https://en.wikipedia.org/wiki/Small-C
https://www.google.com/search?&q=small+c+compiler+source+code

Have Fun!

Cheers,
Tom

Thanks for the link! I'm not sure if I'll find my answer there, but it can't hurt to take a look :)
 
  • #11
Agree, table look ups should meet your requirements; and/or a Rules table for checking array object boundaries and usage during 'compile' time. Hash tables and tlu's work in simulations.
 
  • Like
Likes elusiveshame
  • #13
elusiveshame said:
All I'm asking is what is the best way to keep track of the arrays during compilation time,
willem2 said:
You'll need a hashtable to store all variable and array names. VB6 has a collection object, or write your own, see here http://www.vbforums.com/showthread.php?788247-VB6-Hash-table
As you are writing your compiler in Visual Basic it would make sense to make use of VB's object oriented nature, so rather than access a hashtable directly, write a class to do it (I was hinting at this with my call to a symbolTable.createMultiDimArray method). You can then use this class both to cope with the IDE and to dump run-time code (and there is no reason not to implement bounds checking in your run-time if you want to as this is normally part of the BASIC spec).

This is different from the way compilers used to be written so whilst a Aho's book is a great source, you might be better off adapting techniques from a source written for Python or Java (I am not aware of any for VB!)
 
  • #14
willem2 said:
You'll need a hashtable to store all variable and array names. VB6 has a collection object, or write your own, see here http://www.vbforums.com/showthread.php?788247-VB6-Hash-table

Awesome, thank you! I'll be doing some research this week to figure out how this all works and how to implement it, but this is exactly what I was looking for!

jim mcnamara said:
Alfred Aho is a great source for understanding what you are trying to do.
This is from an older book, on Amazon:

I taught from the 1986 book.
https://www.amazon.com/s?k=compilers+principles,+techniques,+and+tools+2nd+by+alfred+v.+aho&crid=ME3YOV7UDD0U&sprefix=alfred+aho+,aps,387&ref=nb_sb_ss_i_3_11&tag=pfamazon01-20

You can get it for under US$9

Thanks! I managed to have caught your post pre-edit, so I'm assuming this is the same as the previous link?

pbuk said:
As you are writing your compiler in Visual Basic it would make sense to make use of VB's object oriented nature, so rather than access a hashtable directly, write a class to do it (I was hinting at this with my call to a symbolTable.createMultiDimArray method). You can then use this class both to cope with the IDE and to dump run-time code (and there is no reason not to implement bounds checking in your run-time if you want to as this is normally part of the BASIC spec).

This is different from the way compilers used to be written so whilst a Aho's book is a great source, you might be better off adapting techniques from a source written for Python or Java (I am not aware of any for VB!)

What I would eventually like to do is remove the compiler portion from Visual Basic and rewrite it in another language and have it be a true console compiler, and pass data to that via the IDE, eventually, so I'm trying to avoid relying on VB's functions and Windows API's as little as possible.

I strongly dislike Java, and most likely won't ever use it unless held by gunpoint. I don't even have the JRE installed on my workstations. Python isn't an option because I want the finished product to be user friendly and not require a framework to run on (though I know VB6 requires msvbvm.dll, it's just a file that's bundled with the software). The separation of the compiler from the IDE in the future plans is to allow those who want to use their own IDE can (my current version allows for it, but since it's a VB program, it still requires 1. Windows to be running and 2. requires the VB libraries).
 
  • #15
elusiveshame said:
I strongly dislike Java, and most likely won't ever use it unless held by gunpoint.
OK, if you are writing a program for yourself choosing a language you enjoy developing in is important. If you are writing an application to be used by others then this is less true, and if this is an open source application where establishing an active community is an important factor then you need to set your own prejudices aside.

elusiveshame said:
Python isn't an option because I want the finished product to be user friendly and not require a framework to run on
Python does not require a "framework" (I assume you mean an IDE) to run on, there are a number of options for packaging code including creating EXE files with no external dependencies. Or do you mean that in order to create a "user friendly" application you need to choose a GUI framework whereas with VB you have to use .NET?

elusiveshame said:
(though I know VB6 requires msvbvm.dll, it's just a file that's bundled with the software).
Which only works in Windows, or (with limitations) under Wine for Linux distros where that is an option; IME the limitations of Wine mean that nobody uses it where there is a native alternative. If you are so set on .NET then consider Mono.

elusiveshame said:
The separation of the compiler from the IDE in the future plans is to allow those who want to use their own IDE can (my current version allows for it, but since it's a VB program, it still requires 1. Windows to be running and 2. requires the VB libraries).
I would be very surprised if someone who is opinionated enough to use their own IDE will be happy to go with your highly opinionated choice of VB for the compiler stage - this would be like choosing a Ferrari instead of a Lamborghini to drive you to the airport but then flying in a biplane where everyone else is choosing an 747!

I've got a few more suggestions and recommendations:
  • Consider C++ for the compiler stage, making use of the standard tokenisation libraries that are available.
  • Consider delivering your IDE as a package for Visual Studio Code which has rapidly become the multi-platform developers IDE of choice (and is written in JavaScript).
  • Consider using a widely recognised standard for expressing your language's grammar and an Abstract Syntax Tree (AST) to reduce lock-in to your own legacy design choices.
 
  • Like
Likes sysprog
  • #16
pbuk said:
OK, if you are writing a program for yourself choosing a language you enjoy developing in is important. If you are writing an application to be used by others then this is less true, and if this is an open source application where establishing an active community is an important factor then you need to set your own prejudices aside.

I respectfully disagree. If I wanted to program it in Java, Python, C, or C++, I would have. The choice of language is largely irrelevant unless you want it to be across multiple platforms. If I were working in an open source community on this, I'd join an open source VB6 group (they still exist, as VB6 is still very popular). The only time I would have to set my prejudice aside for Java is if I were to take on a job involving it. I wouldn't, because I dislike it that much that no amount of money would entice me to work with it :)

pbuk said:
Python does not require a "framework" (I assume you mean an IDE) to run on, there are a number of options for packaging code including creating EXE files with no external dependencies. Or do you mean that in order to create a "user friendly" application you need to choose a GUI framework whereas with VB you have to use .NET?

I thought it was an interpreted language and required it to run through an interpreter, like Perl. After a quick google search, I see there is a way to make standalone exe's with it.

pbuk said:
Which only works in Windows, or (with limitations) under Wine for Linux distros where that is an option; IME the limitations of Wine mean that nobody uses it where there is a native alternative. If you are so set on .NET then consider Mono.

I'll only support Windows, regardless. I'm not a huge Linux user - I only know what I need to for work. The only reason why I would consider .NET is the fact I can essentially copy/paste the BASIC code (with just changing some controls), so it would be incredibly easy. The only reason I would go to a more modern language like VB.NET is because VB6 is very limited in graphical capabilities (can't load PNG's or other common graphic formats used today) for certain aspects of the IDE. The compiler itself would probably go to something like FreeBASIC where, again, it's mostly copy/paste.
pbuk said:
I would be very surprised if someone who is opinionated enough to use their own IDE will be happy to go with your highly opinionated choice of VB for the compiler stage - this would be like choosing a Ferrari instead of a Lamborghini to drive you to the airport but then flying in a biplane where everyone else is choosing an 747!

I'm not sure why you're being condescending. Is it because I just dislike Java and I'd rather use VB6 over it? Talk about having a high opinion...

FWIW, my existing compiler is being used in 2 high school programming classes in Europe, and have a few dozen programming groups in Latin and South America focused solely on my compiler. Get ~300 downloads a week with over 25,000 lifetime downloads, and have a discord channel with hundreds of users that use it daily. Maybe they just like to ride public transportation instead of your 747. Or maybe it's one of the very few, effective tools for what it does before resorting to pure assembly (which scares a ton of folks).

(I get that these numbers aren't really all that impressive, however, there's quite a few folks that use it as well as made commercial projects with it)

pbuk said:
I've got a few more suggestions and recommendations:
  • Consider C++ for the compiler stage, making use of the standard tokenisation libraries that are available.
  • Consider delivering your IDE as a package for Visual Studio Code which has rapidly become the multi-platform developers IDE of choice (and is written in JavaScript).
  • Consider using a widely recognised standard for expressing your language's grammar and an Abstract Syntax Tree (AST) to reduce lock-in to your own legacy design choices.

I already have an AST.

Honestly, this is why I rarely post here (or any other programming forum). I didn't ask for a suggestion on what language I should program it in, or why I should, or that I should go open source and why I have to use X,Y,Z program/process/utility. Everyone has their opinions on how others should program their software.

My compiler is ~5 years old now (haven't really updated the compiler itself in about a year and a half, just did bug fixes and some tweaks where needed). It's been used for a ton of commercial projects, not just by me, but by many others. Just in the last 5 years, the compiler has generated over $1MM collectively for those who used their ideas and brought them to live with it. Granted, they could've used some another compiler, but they didn't.

You suggesting (over and over) that I change the language is like telling a Van Gogh that he should've used acrylic paint instead of oil, after he spent a year on his work. People have their mediums they enjoy working with, and some use others when absolutely necessary. People also don't want to redo years worth of work just to appease someone's personal bias.

If you want to actually learn about the project before focusing your prejudices on why everything I'm doing is wrong, I'd be more than happy to share that with you privately and then would listen to your suggestions.
 
  • #17
elusiveshame said:
I'm not sure why you're being condescending. Is it because I just dislike Java and I'd rather use VB6 over it?
I am sorry if you found what I posted condescending, that was not my intention.

elusiveshame said:
If you want to actually learn about the project before focusing your prejudices on why everything I'm doing is wrong, I'd be more than happy to share that with you privately and then would listen to your suggestions.
The title of this thread is "Working on a compiler, need some opinions". I've given you mine; essentially "consider doing X, Y and Z like this because that is the way everybody else does them".
 
  • Like
Likes sysprog
  • #18
pbuk said:
I am sorry if you found what I posted condescending, that was not my intention.The title of this thread is "Working on a compiler, need some opinions". I've given you mine; essentially "consider doing X, Y and Z like this because that is the way everybody else does them".

That's fair, I'm sorry for misunderstanding your connotation. I'm not sure how else I could've taken it when it seemed you were attacking my opinion of VB and a bunch of other silly things, but water under the bridge.

I was asking for opinions on a specific thing, though, not the whole project. I get folks arguing over what my program is or isn't without knowing more than the word BASIC, and then suggestions completely unrelated to what I was asking - that's just frustrating. It wasn't until post 8 where I started getting actual help, and post 9 was the answer.

Here's how this thread ran:
Post 2: Questioned on whether I know what I'm talking about or not.

Post 4: Get a history lesson and told I'm asking too much

Post 7: Get told I should start over and dump all my work,
Get told my compiler doesn't do what it should
Get told I shouldn't write it myself

Post 8: Wasn't sure what I was looking for, but sent me sources
for another compiler to poke around and learn from

Post 10: The answer to my question
Post 11: Assurance that post 10 was what I was looking for

Post 12: Some more good reading material reference

Post 13: Suggested I change languages

Post 15: Told what I need to do for communal purposes for communities that were never mentioned.
Told I apparently I'm highly opinionated on VB and you can't see why folks would want to use it when they could use something better
More suggestions I change languages

Do you see all of the assumptions made? Not once did anyone ask about the project and what I had already done and in working order, but ya'll are acting like I just started this as a weekend project and derailing from the purpose of the thread.

Now that the question has been answered, if you want to continue the discussion with various compiler techniques and more mainstream standards, cool! I'm always up for learning new things.
 
  • Love
Likes Tom.G
  • #19
I'm a little astounded at people criticizing your project. As far as I'm concerned, writing a hunk of translation code like you are doing (whatever it may be called), and then figuring out what makes it slow and learning to enhance it, is THE best way to learn to design code. But also, knowing areas of programming such as good use of data structures ("hash table", for example) is just as important. Keep up the good work. You are already on the way to becoming an unusually good programmer. And over time, you will learn that performance has to be built in by design, not tacked on at the end, but that can only be done after you've developed a repertoire of strategies for things. As you are doing beautifully, near as I can tell.
 
  • Like
Likes Tom.G and elusiveshame
  • #20
  • Like
Likes elusiveshame
  • #21
harborsparrow said:
I'm a little astounded at people criticizing your project. As far as I'm concerned, writing a hunk of translation code like you are doing (whatever it may be called), and then figuring out what makes it slow and learning to enhance it, is THE best way to learn to design code. But also, knowing areas of programming such as good use of data structures ("hash table", for example) is just as important. Keep up the good work. You are already on the way to becoming an unusually good programmer. And over time, you will learn that performance has to be built in by design, not tacked on at the end, but that can only be done after you've developed a repertoire of strategies for things. As you are doing beautifully, near as I can tell.

Agreed. My plan once the original compiler was stable and working as best as I could possibly get it, I'd go through and figure out what I did wrong, how to fix it/enhance it, and so on.

I was definitely frustrated with the initial responses, then I figured anyone who's telling me not to reinvent the wheel or "do this because that's what others do" probably hasn't written a compiler. It was the personal attack that got me, though, even when the poster tried to cop out of it rather than just being an adult, but whatever. It's the internet.

Thank you for the kind words!

harborsparrow said:
P S - A pretty good overview of compilation techniques, by no means complete, is at http://en.citizendium.org/wiki/Compiler

Bookmarked! Thanks for the resource! A person can only know about the things they're exposed to :)
 

1. What is a compiler and what does it do?

A compiler is a computer program that translates source code written in a programming language into a machine-readable format, typically in the form of executable code. It takes the code written by a programmer and converts it into instructions that a computer can understand and execute.

2. What are the benefits of working on a compiler?

Working on a compiler allows developers to gain a deeper understanding of how programming languages work and how they are translated into machine code. It also helps in improving programming skills, problem-solving abilities, and can lead to career opportunities in software development or language design.

3. What languages are commonly used to write compilers?

Some commonly used languages for writing compilers include C, C++, Java, and Python. These languages offer a good balance between performance, flexibility, and ease of implementation. However, any programming language can be used to write a compiler.

4. What are some challenges faced while working on a compiler?

Some common challenges faced while working on a compiler include managing memory, handling errors and exceptions, optimizing code for efficiency, and ensuring compatibility with different operating systems and hardware platforms. Additionally, designing and implementing complex language features can also be a challenge.

5. Are there any useful resources for learning about compilers?

Yes, there are many useful resources available for learning about compilers, such as books, online tutorials, and open-source compiler projects. Some recommended resources include "Modern Compiler Implementation" by Andrew Appel, "Crafting Interpreters" by Bob Nystrom, and the LLVM compiler infrastructure project.

Similar threads

Replies
6
Views
1K
  • Programming and Computer Science
Replies
7
Views
2K
  • Programming and Computer Science
Replies
28
Views
3K
  • Programming and Computer Science
Replies
6
Views
2K
  • Programming and Computer Science
Replies
18
Views
6K
  • Programming and Computer Science
Replies
1
Views
1K
  • Programming and Computer Science
Replies
15
Views
2K
  • STEM Academic Advising
Replies
1
Views
919
  • Engineering and Comp Sci Homework Help
Replies
6
Views
1K
Back
Top