computerlanguages

Computer Language Primer – Part 1

[Total: 7    Average: 4.4/5]

In this Insights Guide the focus is entirely on types of languages, how they relate to what computers actually do, and how they are used by people. It is a general discussion and intended to be reasonably brief, so there is very little focus on specific languages. Please keep in mind that for just about any topic mentioned in this article, entire books have been written about that topic v.s. a few sentences in this article.

The primary focus is on

• Machine language
• Assembly language
• Higher Level Languages

And then some discussion of
• interpreters (e.g. BASIC)
• markup languages (e.g. HTML)
• scripting languages (e.g. JavaScript, Perl)
• object oriented languages (e.g. Java)
• Integrated Development Environments (e.g. The “.NET Framework”)

Database languages (e.g. SQL) will be discussed briefly in Part 2.

Machine Language

Computers work in binary and their innate language is strings of 1’s and 0’s. In my early days I actually programmed computers using strings of 1’s and 0’s, input to the machine by toggle switches on the front panel. You don’t get a lot done this way but fortunately I only had to input a dozen or so instructions, which made up a tiny “loader” program. That loader program would then read in more powerful programs from magnetic tape. It was possible to do much more elaborate things directly via toggle switches. Remembering EXACTLY (mistakes are not allowed) which string of 1’s and 0’s represented which instructions was a pain.

SO … very early on the concept of assembly language arose.

Assembly Language

This is where you represent each machine instruction’s string of 1’s and 0’s by a human-language word, or abbreviation. Things such as “add” or “ad” and “load” or “ld”. These sets of human-readable “words” are translated to machine code by a program called an assembler.

Assemblers have an exact one-to-one correspondence between each assembly language statement and the corresponding machine language statement. This is quite different from high level languages where compilers can generate dozens of machine language statements for a single high level language statement. Assemblers are a step away from machine language in terms of usability but when you program in assembly you are programming in machine language. You are using words instead of the strings of 1’s and 0’s but assembly language is direct programming of a computer, with full control over data flow, register usage, I/O, etc. None of those are directly accessible in higher level languages.

Every computer has its own set of machine language instructions, and so every machine has its own assembly language. This can be a bit frustrating when you have to move to a new machine and learn the assembly language all over again. Having learned one, it’s fairly easy to learn another, but it can be very frustrating/annoying to have to remember a simple word in one assembly language (e.g. “load”) is slightly different (e.g. “ld”) or even considerably different (e.g. “mov”) in another language. Also, machines have slightly different instruction sets which means you not only have new “words” you miss some and add some. All this means there can be a time-consuming focus on the implementation of an algorithm instead of just being able to focus on the algorithm itself.

The good news about assembly language is that you are not programming some “pillow” that sits between you and the machine, you are programming the machine directly. This is most useful for writing things like device drivers where you really need to be down in the exact details of what the machine is doing. They also give you the opportunity to do the best possible algorithm optimization, but for complex algorithms you would need a VERY compelling reason indeed to go to the trouble of writing it in assembly and with today’s very fast computers, that’s not likely to occur.

The bad news about assembly language is that you have to tell the machine EVERY ^$*&@^#*! step that you want it to perform, which is really tedious. An algebraic statement such as
A = B + COSINE(C) – D
has to be turned into a string of instructions something like what follows (which is a pseudo assembly language that is MUCH more verbose than actual assembly)

• load memory location C to register 1
• Call Subroutine “Cosine” [thus getting the cosine of what’s in register 1 back into register 1 by calling a subroutine]
• Load memory location B to register 2
• Add registers 1 and 2 [thus putting the sum into register 1]
• load memory location D to register 2
• subtract register 2 from register 1 thus putting the difference into register 1
• save register 1 to memory location A [final value]

And that’s a VERY simple algebraic statement, and you have to specify each of those steps. In the right order. Without any mistakes. It will make your head hurt. Just to give you some of the flavor of assembly, here’s a very small portion of an 8080 instruction guide (a CPU used in early personal computers). Note that both the octal and hex versions of the binary code are given. These were used during debug.


Because of the tediousness of all this, compilers were developed

COMPILERS / higher level languages

Compilers take away the need of assembly language to specify the minutiae of the machine code level manipulation of memory locations and registers. This makes it MUCH easier to concentrate on the algorithm that you are trying to implement rather than the details of the implementation. Compilers take complex math and semi-human language statements and translate them into machine code to be executed. The bad news is that when things go wrong, figuring out WHAT went wrong is more complicated than in machine/assembly code because the details of what the machine is doing (which is where things go wrong) are not obvious. This means that sophisticated debuggers are required for error location and correction. As high level languages add layers of abstraction, the distance between what you as a programmer are doing and what the machine is doing down at the detailed level gets larger and larger. This could make debugging things intolerable, but the good news is that today’s programming environments are just that; ENVIRONMENTS. That is, you don’t usually program a computer language any more, you program an Integrated Development Environment (IDE) and it includes extensive debug facilities to ease the pain of figuring out where you have gone wrong when you do. And you will. Inevitably.

Early high level languages were:
• Fortran (early 1950’s)
• Autocode (1952)
• ALGOL (late 1950’s)
• COBOL (1959)
• LISP (early 1960’s)
• APL (1960’s)
• BASIC (1964)
• PL/1 (1964)
• C (early 1970’s)

Since then there have been numerous developments, including
• Ada (late 70’s, early 80’s)
• C++ (1985)
• Python (late 1980’s)
• Perl (1987)
• Java (early 1990’s)
• JavaScript (mid 1990’s)
• Ruby (mid 1990’s)
• PHP (1994)
• C# (2002)

There were others along the way but they were of relatively little impact compared to the ones listed above. Even the ones named have had varying degrees of utility, popularity, widespread use, and longevity. One of the lesser known ones was JOVIAL, a variant of ALGOL (which was originally called the “International Algorithmic Language”). I mention JOVIAL only because the name always tickled me. It was developed by Jules Schwartz and JOVIAL stands for “Jules’ Own Version of the International Algorithmic Language”. I enjoy his lack of modesty.

A special note about the C programming language. C caught on over time as the most widely used computer languages of its time and the syntax rules (and even some of the statements) of several more modern languages come directly from C. The UNIX, Windows and Apple operating systems kernels are written in C (with some assembly for device drivers) as is some of the rest of the Operating System’s code. Unix is written entirely in C and requires a C compiler. Because of the way it was developed and implemented, C is, by design, about as close to assembly language as you can get in a high level language. Certainly it is closer than other high level languages. Java is perhaps more widely used today but C is still a popular language.

INTERPRETERS / SCRIPTING LANGUAGES / COMMAND LANGUAGES

Interpreters are unlike assemblers and compilers in that they do not generate machine code into an object file and then execute that. Rather, they interpret each statement into machine code one statement (or even just parts of statements) at a time and perform the necessary calculations/manipulations, saving intermediate results as necessary. Use of registers and such low-level detail is completely hidden from the user. The bad news about interpreters is that they are slower than mud compared to executable code generated by assemblers and compilers but that’s much less of an issue today given the speed of modern computers. The particularly good news is that they are amenable to on-the-fly debugging. That is, you can break a program at a particular instruction, change the values of the variables being processed and even change the subsequent instructions and then resume interpretation. This was WAY more important in the early days of computers when doing an edit-compile-run cycle could require an overnight submission of your program to the computer center.

These days computers are so fast and so widely available that the edit-compile-run cycle is relatively trivial taking minutes or even less as opposed to half a day. One thing to note about interpretive language is that because of the sequential interpretation of the statements, you can have syntax errors that don’t raise any error flags until the attempt is made to interpret them. Assemblers and compilers process your entire program before creating executable object code and so are able to find syntax errors prior to run-time. Some interpreters do a pre-execution pass and do find syntax errors prior to running the program. Even when they don’t, dealing with syntax errors is less painful that with compilers because given the nature of interpreters, you can simply fix the syntax error during run-time (“interpret-time” actually) and continue on. The first widely used interpretive language was BASIC (there were others, but were less widely known). Another type of scripted language is JavaScript which is interpreted one step at a time like BASIC but is not designed to be changed on the fly the way BASIC statements can be but rather is designed to be embedded in HTML files (see Markup Languages below).

A very important type of scripting language is the command script language, also called “shell scripts”. Early PC system control was done with the DOS command interpreter which is one such. Probably the most well known is Perl. In its heyday it was the most widely used system administrator’s language to automate operational and administrative tasks. It is still widely used and there are a lot of applications still running that have been built in Perl. It runs on just about all operating systems and still has a very active following because of its versatility and power.

MARKUP LANGUAGES

The most widely used markup language is HTML (Hypertext Markup Language). It defines sets of rules for encoding documents in a format that is both human-readable and machine-readable. You will often hear the phrase “HTML code”. Technically this is a misnomer since HTML is NOT “code” in the original sense because its purpose is not to result in any machine code. They are actually just data. They do not get translated to machine language and executed. They are read as input data. HTML is read by a program called a browser and is used not to specify an algorithm but rather to describe how a web page should look. HTML doesn’t specify a set of steps, it “marks up” the look and feel of how a browser creates a web page.

To go at this again, because it’s fairly important to understand the difference between markup languages and other languages, consider the following. Assembler and Complier languages result in what is called object code. That is, a direct machine language result of the assembly or compilation and the object code persists during execution. Interpreter languages are interpreted directly AT execution time so clearly exist during execution. Markup languages are different in that they do not persist during execution and in fact do not themselves result directly in any execution at all. They provide information that allows a program called a browser to create an HTML web page with a particular look and feel. So HTML is not code in the sense that a program is code; it is just data to be used by the browser program. A significant adjunct to HTML is CSS (Cascading Style Sheets) which can be used to further refine the look and feel of a web page. The latest version of HTML, HTML5, is a significant increase in the power and flexibility of HTML but further discussion of CSS and HTML5 is beyond the scope of this article. Another significant markup language is XML, also beyond the scope of this article.

You can embed procedural code inside an HTML file in the form of, for example, JavaScript and the browser will treat it differently than the HTML lines. JavaScript is generally implemented as an interpreted script language but there are compilers for it. Use of JavaScript is pretty much a requirement for web pages that are interactive on the client side. Web servers have their own set of procedural code (e.g. PHP) that can be embedded in HTML and used to change what is served to the client. The browser does not use the JavaScript code as input data but rather treats it as a program to be executed before the web page is presented and even while the web page is active. This allows for dynamic web pages where user input can cause changes to the web page and the manipulation and/or storage of information from the user.

INTEGRATED DEVELOPMENT ENVIRONMENTS

Programming languages these days come with an entire environment surrounding them, including for example object creation by drag and drop. You don’t program the language so much as you program the environment. The number of machine language instructions generated by a simple drag and drop of a control onto the window you are working on can be enormous. You don’t need to worry about such things because the environment takes care of all that. The compiler is augmented in the environment by automatic linkers, loaders, and debuggers so you get to focus on your algorithm and the high-level implementation and not worry about the low level implementation. These environments are now called “Integrated Development Environments” (IDE’s) which is a very straightforward description of exactly what they are. They include a compiler, debugger, help tool, and as well as additional facilities that can be extremely useful (such as a version control system, a class browser, an object browser, support for more than one language, etc.). These extensions will not be explored in this article but some of them will be touched on in the second article.

Because the boundary between a full bore IDE and a less powerful programming environment is vague, it is quite arguable just how long IDE’s have existed but I think that it is fair to say that in their current very powerful incarnations they began to evolve seriously in the 1980’s and came on even stronger in the 90’s. Today’s incarnations are more evolutionary from the 80’s and 90’s than the more revolutionary nature of what was happening in those decades.

So to summarize, the basic types of computer languages are
assembler
compiler
interpreted
scripting
markup
database (discussed in the Part 2 article)
and each has its own set of characteristics and uses.

Within the confines of those general characteristics, there are a couple of other significant categorizations of languages, the most notable being object-oriented languages and event driven languages. These are not themselves languages but they have a profound effect on how programs work so deserve at least a brief discussion in this article.

OBJECT ORIENTED LANGUAGES

“Normal” compiled high level languages have subroutines that are very rigid in their implementation. They can, of course, have things like “if … then” statements that allow for different program paths based on specific conditions but everything is well defined in advance. “Objects” are sets of data and code that work together in a more flexible way than exists in non-object-oriented programming.

C++ tacked object orientation onto C but object orientation is not enforced. That is, you can write a pure, non-object program in C++ and the compiler is perfectly happy to compile it exactly as would a C compiler. Data and code are not required to be part of objects. They CAN be but they don’t have to be. Java on the other hand is a “pure” object-oriented language. ONLY objects can be created and data and code are always part of an object. C# (“C sharp”) is Microsoft’s attempt to create a C language that is purely object-oriented like Java. C# is proprietary to Microsoft and thus is not used on other platforms (Apple, Unix, etc). I don’t know if it is still true, but an early comment about C# was “C# is sort of Java with reliability, productivity and security deleted”.

Objects, and OOP, have great flexibility and extensibility but getting into all of that is beyond the scope of this modest discussion of types of languages.

EVENT DRIVEN PROGRAMMING

“Normal” compiled high level languages are procedural. That is, they specify a strictly defined set of procedures. It’s like driving a car. You go straight, turn left or right, stop then go again, and so forth, but you are following a specific single path, a procedural map.

“Event-driven” languages such as VB.NET are not procedural, they are, as the name clearly states, driven by events. Such events include, for example, a mouse-click or the press of a key. Having subroutines that are driven by events is not actually new at all. Even in the early days of computers, things like a keystroke would cause a system interrupt (a hardware action) that would cause a particular subroutine to be called. BUT … the integration of such event driven actions into the very heart of programming is a different matter and is common in modern IDEs in ways that are far more powerful than early interrupt-driven system calls. It’s somewhat like playing a video game where new characters can pop up out of nowhere and characters can teleport to a different place and the actions of one character affect the actions of other characters. Like OOP, event driven programming has its own power and pitfalls but those are beyond the scope of this article. Both Apple operating systems and Microsoft operating systems strongly support event driven programming. It’s pretty much a necessity in today’s world of highly interactive, windows based, computer screens.

GENERATIONS OF LANGUAGES

You’ll read in the literature about “second generation”, “third generation”, and so forth, computer languages. That’s not much discussed these days and I’ve never found it to be particularly enlightening or helpful, however, just for completeness, here’s a rough summary of “generations”. One reason I don’t care for this is that you can get into pointless arguments about exactly where something belongs in this list.
• first generation = machine language
• second generation = assembly language
• third generation = high level languages
• fourth generation = database and scripting languages
• fifth generation = Integrated Development Environments

Newer languages don’t really replace older ones. Newer ones extend the capabilities and usability and are sometimes designed for specific reasons, such as database languages, rather than general use, but they don’t replace older languages. C, for example, is now 40+ years old but still in very wide use and Fortran is even older and yet because of the ENORMOUS amount of scientific subroutines that were created for it over decades, it is still in use in some areas of science.

Part 2 will include brief discussions of the following concepts which go beyond specific languages and are more about how languages are used by people and how their output code is used the machines themselves
• source code v.s. object code
• loaders
• Assembly/machine language program self-modification
• relocatable vs absolute
• .com vs .exe
• linkers / shared libraries / DLLs
• command interpreters ()
• debuggers
• optimized compilers
• IDE features (object browser, class browser, etc.)
• pseudo-code
• intermediate code (e.g. Microsoft’s Common Intermediate Language)
• paged computers
• bit / byte / word
• structured and unstructured languages
• database languages (SQL, MySQL, Oracle, DB2, etc, and lightweights such as Access)
• macros
• OOP (C++ vs Java) etc.
• assembly embedded in C
• JavaScript embedded in HTML
• Von Neumann Machines v.s Harvard Architecture
• threaded programming (single processor and multiprocessor)
• code obfuscation

 

 

Studied EE and Comp Sci a LONG time ago, now interested in cosmology and QM to keep the old gray matter in motion.

woodworking

67 replies
Newer Comments »
  1. jim mcnamara
    jim mcnamara says:

    Very well done.  Every language discussion thread – 'what language should I use' should have a reference back this article.  Too many statements in those threads  are off target.  Because posters have no clue about origins.

    Typo in the Markup language section: "Markup language are"

    Thanks for a good article.

  2. anorlunda
    anorlunda says:

    @phinds ,  thanks for the trip down memory lane.    That was fun to read.

    I too started in the machine language era.  My first big project was a training simulator (think of a flight simulator) done in binary on a computer that had no keyboard, no printer.   All coding and debugging was done in binary with those lights and switches.   I had to learn to read floating point numbers in binary.

    Then there was the joy of the most powerful debugging technique of all.  Namely, the hex dump (or octal dump) of the entire memory printed on paper.  That captured the full state of the code & data and there was no bug that could not be found if you just spent enough tedium to find it.

    But even that paled compared the to generation just before my time.  They had to work on "stripped program" machines.  Instructions were read from the drum one-at-a-time and executed.   To do a branch might mean (worst case) waiting for one full revolution of the drum before the next instruction could be executed.  To program them, you not only had to decide what the next instruction would be, but also where on the drum it was stored.  Choices made an enormous difference in speed of execution.  My boss did a complete boiler control system on such a computer that had only 3×24 bit registers, and zero RAM memory.  And he had stories about the generation before him that programmed the IBM 650 using plugboards.

    In boating we say, "No matter how big your boat, someone else has a bigger one."  In this field we can say, "No matter how crusty your curmudgeon credentials, there's always an older crustier guy somewhere."

  3. phinds
    phinds says:

    @phinds ,  thanks for the trip down memory lane. 

    Oh, I didn't even get started on the early days. No mention of punched card decks, teletypes, paper tape machines and huge clean rooms with white-coated machine operators to say nothing of ROM burners for writing your own BIOS in the early PC days, and on and on. I could have done a LONG trip down memory lane without really telling anyone much of any practical use for today's world, but I resisted the urge :smile:

    My favorite "log cabin story" is this: In one of my early mini-computer jobs, well after I had started working on mainframes, it was a paper tape I/O machine. To do a full cycle, you had to (and I'm probably leaving out a step or two, and ALL of this is using a very slow paper tape machine) load the editor from paper tape, use it to load your source paper tape, use the teletype to do the edit, output a new source tape, load the assembler, use it to load the modified source tape and output an object tape, load the loader, use it to load the object tape, run the program, realize you had made another code mistake, go out and get drunk.

  4. rcgldr
    rcgldr says:

    Mainframe assembler's included fairly advanced macro capability going back to the 1960s'. In the case of IBM mainframes, there were macro functions that operated on IBM database types, such as ISAM (index sequential access method), which was part of the reason for the strange mix of assembly and Cobol on the same programs, which due to legacy issues, still exists somewhat today.

    Self-modifying code – this was utilized on some older computers, like the CDC 3000 series which included a store instruction that only modified the address field of another instruction, essentially turning an instruction into an instruction + modifiable pointer.

    Event driven programming is part of most pre-emptive operating systems and applications, and time sharing systems / applications, again dating back to the 1960s'.

Newer Comments »

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply