Lexical Structure

The lexical structure of a programming language is the set of basic rules that govern how you write programs in that language. It is the lowest-level syntax of the language and specifies such things as what variable names look like and which characters denote comments. Each Python source file, like any other text file, is a sequence of characters. You can also usefully consider it as a sequence of lines, tokens, or statements. These different lexical views complement and reinforce each other. Python is very particular about program layout, especially with regard to lines and indentation, so you’ll want to pay attention to this information if you are coming to Python from another language.

Lines and Indentation

A Python program is composed of a sequence of logical lines, each made up of one or more physical lines. Each physical line may end with a comment. A hash sign (#) that is not inside a string literal begins a comment. All characters after the # and up to the physical line end are part of the comment, and the Python interpreter ignores them. A line containing only whitespace, possibly with a comment, is known as a blank line, and Python totally ignores it. In an interactive interpreter session, you must enter an empty physical line (without any whitespace or comment) to terminate a multiline statement.
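As a small illustration of the comment rule (the variable names are made up for this sketch), a hash sign inside a string literal is ordinary text, while one outside a literal starts a comment:

```python
# A whole-line comment: Python ignores everything on this line.

label = "# not a comment"   # the hash inside the quotes is part of the string

count = 3   # trailing comment; the statement itself ends before the hash
```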

In Python, the end of a physical line marks the end of most statements. Unlike in many other languages, you don’t normally terminate Python statements with a delimiter, such as a semicolon (;). When a statement is too long to fit on a single physical line, you can join two adjacent physical lines into a logical line by ensuring that the first physical line has no comment and ends with a backslash (\). However, Python also automatically joins adjacent physical lines into one logical line if an open parenthesis ((), bracket ([), or brace ({) has not yet been closed, and taking advantage of this mechanism generally produces more readable code than explicitly inserting backslashes at physical line ends. Triple-quoted string literals can also span physical lines. Physical lines after the first one in a logical line are known as continuation lines. The indentation issues covered next do not apply to continuation lines but only to the first physical line of each logical line.
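The following sketch puts the two joining mechanisms side by side. The names total_a, total_b, names, and text are placeholders chosen for illustration only:

```python
# Explicit joining: the backslash must be the very last character on the line,
# so that line cannot carry a comment.
total_a = 1 + 2 + \
          3 + 4

# Implicit joining: the open parenthesis keeps the logical line going,
# and comments are allowed on each physical line.
total_b = (1 + 2 +    # first half
           3 + 4)     # second half

# An open bracket works the same way.
names = [
    'alpha',
    'beta',
]

# A triple-quoted string literal may also span physical lines.
text = """first
second"""
```

Both sums are computed identically; only the layout differs.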

Python uses indentation to express the block structure of a program. Unlike other languages, Python does not use braces, or other begin/end delimiters, around blocks of statements; indentation is the only way to denote such blocks. Each logical line in a Python program is indented by the whitespace on its left. A block is a contiguous sequence of logical lines, all indented by the same amount; a logical line with less indentation ends the block. All statements in a block must have the same indentation, as must all clauses in a compound statement. The first statement in a source file must have no indentation (i.e., must not begin with any whitespace). Statements that you type at the interactive interpreter primary prompt >>> (covered in Interactive Sessions) must also have no indentation.
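A minimal sketch of how indentation alone marks blocks. The function classify is a hypothetical example, not something defined elsewhere in this book:

```python
def classify(n):
    # The function body is one block: its statements share one indentation.
    if n < 0:
        # This deeper block belongs to the if clause.
        sign = 'negative'
    else:
        sign = 'non-negative'
    # Back at the function-body indentation: the else block has ended.
    return sign

# No indentation: this logical line ends the function body.
result = classify(-5)
```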

Python logically replaces each tab by up to eight spaces, so that the next character after the tab falls into logical column 9, 17, 25, etc. Standard Python style is to use four spaces (never tabs) per indentation level. Don’t mix spaces and tabs for indentation, since different tools (e.g., editors, email systems, printers) treat tabs differently. The -t and -tt options to the Python interpreter (covered in Command-Line Syntax and Options) guard against inconsistent tab and space usage in Python source code. I recommend you configure your favorite text editor to expand tabs to spaces, so that all Python source code you write always contains just spaces, not tabs. This way, you know that all tools, including Python itself, are going to be perfectly consistent in handling indentation in your Python source files.
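The tab-to-column arithmetic can be checked with the string method expandtabs, whose default tab size is 8. This only illustrates the rule; it is not a claim about how the interpreter itself processes tabs:

```python
# expandtabs pads each tab out to the next multiple of the tab size (8 by default).
line = '\tx'
expanded = line.expandtabs()          # tab in column 1 becomes eight spaces

short = 'ab\tx'
short_expanded = short.expandtabs()   # tab in column 3 becomes six spaces

# In both cases x ends up at index 8, that is, logical column 9.
x_column = expanded.index('x') + 1
```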

Character Sets

Normally, a Python source file must be entirely made up of characters from the ASCII set (character codes between 0 and 127). However, you may tell Python that a certain source file uses a character set that is a superset of ASCII. In this case, Python allows that specific source file to contain characters outside the ASCII set, but only in comments and string literals. To accomplish this, start your source file with a comment in the following rigid form:

# -*- coding: utf-8 -*-

Between the coding: and the -*-, write the name of a codec known to Python, such as utf-8 or iso-8859-1. Note that this coding directive comment is taken as such only if it is at the start of a source file (possibly after the “shebang line” covered in Running Python Programs), and that the only effect of a coding directive is to let you use non-ASCII characters in string literals and comments.
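Here is a sketch of a complete source file that uses the directive. It assumes the file is actually saved as UTF-8, and the name greeting is made up for the example:

```python
# -*- coding: utf-8 -*-
# The directive above is the first line of the file, as required.

greeting = u'café'   # non-ASCII character inside a (Unicode) string literal
# naïve: non-ASCII characters are also acceptable inside comments
```

If the codec named in the directive does not match the encoding the editor used to save the file, the non-ASCII characters will not be read as intended.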

Tokens

Python breaks each logical line into a sequence of elementary lexical components known as tokens. Each token corresponds to a substring of the logical line. The normal token types are identifiers, keywords, operators, delimiters, and literals, as covered in the following sections. You may freely use whitespace between tokens to separate them. Some whitespace separation is necessary between logically adjacent identifiers or keywords; otherwise, Python would parse them as a single, longer identifier. For example, printx is a single identifier; to write the keyword print followed by the identifier x, you need to insert some whitespace (e.g., print x).
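You can watch this splitting happen with the standard library module tokenize. The sketch below assumes a recent Python release in which io.StringIO accepts ordinary string literals; name_tokens is a hypothetical helper written for this example:

```python
import io
import tokenize

def name_tokens(source):
    """Return the text of every NAME token (identifier or keyword) in source."""
    readline = io.StringIO(source).readline
    return [tok[1] for tok in tokenize.generate_tokens(readline)
            if tok[0] == tokenize.NAME]

joined = name_tokens('printx\n')      # no whitespace: one token
separate = name_tokens('print x\n')   # whitespace separates two tokens
```

The tokenizer reports identifiers and keywords alike as NAME tokens; telling them apart is a later step.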

Identifiers

An identifier is a name used to identify a variable, function, class, module, or other object. An identifier starts with a letter (A to Z or a to z) or an underscore (_) followed by zero or more letters, underscores, and digits (0 to 9). Case is significant in Python: lowercase and uppercase letters are distinct. Python does not allow punctuation characters such as @, $, and % within identifiers.
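The rule translates directly into a regular expression. This is a rough sketch only: looks_like_identifier is a made-up helper, and it checks just the character rule, so it does not reject keywords:

```python
import re

# A letter or underscore first, then any mix of letters, digits, and underscores.
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def looks_like_identifier(text):
    """Return True if text matches the identifier rule described above."""
    return IDENTIFIER.match(text) is not None

valid = [looks_like_identifier(s) for s in ('spam', '_x1', 'Spam')]
invalid = [looks_like_identifier(s) for s in ('1abc', 'a$b', 'a-b')]
```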

Normal Python style is to start class names with an uppercase letter and all other identifiers with a lowercase letter. Starting an identifier with a single leading underscore indicates by convention that the identifier is meant to be private. Starting an identifier with two leading underscores indicates a strongly private identifier; if the identifier also ends with two trailing underscores, the identifier is a language-defined special name. The identifier _ (a single underscore) is special in interactive interpreter sessions: the interpreter binds _ to the result of the last expression statement it has evaluated interactively, if any.
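A short sketch of these conventions in one class. Account and its attributes are invented for illustration. Inside a class body, Python rewrites a name with two leading underscores (and no two trailing ones) to include the class name, which is what makes such names strongly private:

```python
class Account(object):                # class name starts with an uppercase letter
    def __init__(self, balance, pin): # __init__ is a language-defined special name
        self.balance = balance        # public attribute
        self._audit = []              # single underscore: private by convention
        self.__pin = pin              # double underscore: strongly private

    def check(self, pin):
        return pin == self.__pin      # the short name works inside the class

acct = Account(100, 1234)
attribute_names = sorted(vars(acct))  # the stored name is _Account__pin
```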

Keywords

Python has 30 keywords, which are identifiers that Python reserves for special syntactic uses. Keywords contain lowercase letters only. You cannot use keywords as regular identifiers. Some keywords begin simple statements or clauses of compound statements, while other keywords are operators. All the keywords are covered in detail in this book, either in this chapter, or in Chapter 5, Chapter 6, and Chapter 7. The keywords in Python are:

and       del       for       is        raise
assert    elif      from      lambda    return
break     else      global    not       try
class     except    if        or        while
continue  exec      import    pass      with (2.5)
def       finally   in        print     yield

The identifier with is a new keyword starting with Python 2.5 (up to Python 2.4, it is a completely normal identifier). The identifier as, which is not, strictly speaking, a keyword, is used as a pseudokeyword as part of some statements (specifically, the statements from, import, and, in Python 2.5, with). In Python 2.5, using with or as as normal identifiers produces warnings. To enable with usage as a keyword (and therefore to enable the new with statement) in Python 2.5, begin your source file with the statement:

from __future__ import with_statement

This “import from the future” enables use of the with statement in this module.
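To check whether a given string is a keyword, you can ask the standard library module keyword. Note that the module reports the keywords of whichever interpreter runs the code, so the list it returns matches the one above only on the corresponding Python version:

```python
import keyword

# kwlist holds the keywords of the interpreter running this code.
reserved = keyword.kwlist

for_is_keyword = keyword.iskeyword('for')     # True: for is a keyword
spam_is_keyword = keyword.iskeyword('spam')   # False: an ordinary identifier
```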

Operators

Python uses nonalphanumeric characters and character combinations as operators. Python recognizes the following operators, which are covered in detail in Expressions and Operators:

+    -    *    /    %    **   //
<<   >>   &    |    ^    ~
<    <=   >    >=   <>   !=   ==
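As a quick taste, here are a few of these operators applied to small integers. The variable names are placeholders, and the values in the comments follow from ordinary integer arithmetic:

```python
quotient = 7 // 2        # floor division: 3
remainder = 7 % 2        # remainder: 1
power = 2 ** 10          # exponentiation: 1024
shifted = 1 << 4         # left shift: 16
masked = 6 & 3           # bitwise AND: 2
combined = 6 | 3         # bitwise OR: 7
toggled = 6 ^ 3          # bitwise XOR: 5
inverted = ~5            # bitwise NOT: -6
differs = (1 != 2)       # comparison: True
```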

Delimiters

Python uses the following symbols and symbol combinations as delimiters in expressions, lists, dictionaries, various aspects of statements, and strings, among other purposes:

(    )    [    ]    {    }
,    :    .    `    =    ;
+=   -=   *=   /=   //=  %=
&=   |=   ^=   >>=  <<=  **=

The period (.) can also appear in floating-point literals (e.g., 2.3) and imaginary literals (e.g., 2.3j). The last two rows list the augmented assignment operators, which lexically are delimiters but also perform an operation. I discuss the syntax for the various delimiters when I introduce the objects or statements with which they are used.
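The sketch below shows a handful of delimiters at work, including several augmented assignments. All names are invented for the example:

```python
point = (3, 4)          # parentheses and a comma build a tuple
scores = {'x': 42}      # braces and a colon build a dictionary
first = point[0]        # brackets index a sequence

total = 10
total += 5              # same effect here as total = total + 5; total is 15
total *= 2              # total is now 30
total //= 4             # total is now 7
total **= 2             # total is now 49
```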

The following characters have special meanings as part of other tokens:

'    "    #    \

The characters $ and ?, all control characters except whitespace, and all characters with ISO codes above 126 (i.e., non-ASCII characters, such as accented letters) can never be part of the text of a Python program, except in comments or string literals (to use non-ASCII characters in comments or string literals, you must start your Python source file with a “coding directive,” as covered in Character Sets). This also applies to the character @ in Python 2.3; however, in Python 2.4, @ indicates decorators, as covered in Decorators.

Literals

A literal is a number or string that appears directly in a program. The following are all literals in Python:

42                       # Integer literal
3.14                     # Floating-point literal
1.0j                     # Imaginary literal
'hello'                  # String literal
"world"                  # Another string literal
"""Good
night"""                 # Triple-quoted string literal

Using literals and delimiters, you can create data values of some other fundamental types:

[ 42, 3.14, 'hello' ]    # List
( 100, 200, 300 )        # Tuple
{ 'x':42, 'y':3.14 }     # Dictionary

The syntax for literals and other fundamental-type data values is covered in detail in Data Types, when I discuss the various data types Python supports.

Statements

You can consider a Python source file as a sequence of simple and compound statements. Unlike many other languages, Python has no declarations or other top-level syntax elements, just statements.

Simple statements

A simple statement is one that contains no other statements. A simple statement lies entirely within a logical line. As in other languages, you may place more than one simple statement on a single logical line, with a semicolon (;) as the separator. However, one statement per line is the usual Python style, and makes programs more readable.
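For comparison, here is the same computation written both ways (width, height, and area are placeholder names):

```python
# Legal, but crowded: three simple statements on one logical line.
width = 3; height = 4; area = width * height

# The usual style: one statement per line.
width = 3
height = 4
area = width * height
```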

Any expression can stand on its own as a simple statement (I’ll discuss expressions in detail in Expressions and Operators). The interactive interpreter shows the result of an expression statement you enter at the prompt (>>>) and binds the result to a variable named _ (a single underscore). Apart from interactive sessions, expression statements are useful only to call functions (and other callables) that have side effects (e.g., ones that perform output, change global variables, or raise exceptions).
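A small sketch of expression statements in a source file. The function record is a hypothetical helper whose only purpose is its side effect:

```python
log = []

def record(message):
    """Append message to the module-level list named log."""
    log.append(message)

record('started')        # expression statement: the call's result is discarded
log.append('working')    # another call made purely for its side effect
len(log)                 # legal, but pointless outside an interactive session
```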

An assignment is a simple statement that assigns values to variables, as I’ll discuss in Assignment Statements. Unlike in some other languages, an assignment in Python is a statement and can never be part of an expression.
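One way to see this is to hand both forms to the built-in compile function and check which one Python accepts. The helper compiles is invented for this sketch:

```python
def compiles(source):
    """Return True if source is syntactically valid, False otherwise."""
    try:
        compile(source, '<example>', 'exec')
    except SyntaxError:
        return False
    return True

as_statement = compiles('x = 1\n')              # assignment on its own: accepted
in_expression = compiles('y = (x = 1) + 2\n')   # = inside an expression: rejected
```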

Compound statements

A compound statement contains one or more other statements and controls their execution. A compound statement has one or more clauses, aligned at the same indentation. Each clause has a header starting with a keyword and ending with a colon (:), followed by a body, which is a sequence of one or more statements. When the body contains multiple statements, also known as a block, these statements should be placed on separate logical lines after the header line, indented four spaces rightward. The block lexically ends when the indentation returns to that of the clause header (or further left from there, to the indentation of some enclosing compound statement). Alternatively, the body can be a single simple statement, following the : on the same logical line as the header. The body may also consist of several simple statements on the same line with semicolons between them, but, as I’ve already indicated, this is not good style.
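The following sketch shows one compound statement with three clauses, using both body layouts. The function describe is a made-up example:

```python
def describe(n):
    if n < 0:                     # header: a keyword first, a colon last
        kind = 'negative'         # body: a block, indented four spaces
        n = -n
    elif n == 0: kind = 'zero'    # body: one simple statement on the header line
    else:                         # clauses align with the if that starts the statement
        kind = 'positive'
    return '%s %d' % (kind, n)    # back at the enclosing indentation

labels = [describe(-2), describe(0), describe(7)]
```

Here the if statement is itself part of the body of the def statement, which is also a compound statement with a single clause.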
