% Linux Kernel Module Programming Guide -*- TeX -*-
% Copyright (C) 1998-1999 by Ori Pomerantz
%
% This file is freely redistributable, but you must preserve this copyright 
% notice on all copies, and it must be distributed only as part of "Linux
% Kernel Module Programming Guide". This file's use is covered by the 
% copyright for the entire document, in the file "copyright.tex".
%

% m4 Macro for a source file (I use m4 so I can include a file within 
% verbatim mode).

define(`sourcesample', `
\vskip 2 ex
\addcontentsline{toc}{section}{$1} 
{\large\bf $1} 
\index{$1, source file}\index{source\\$1} 

\begin{verbatim} 
include(../source/$2/$1) 
\end{verbatim}
')



% Document specific definitions
\newcommand{\myversion}{1.1.0}
\newcommand{\myyear}{1999}
\newcommand{\mydate}{26 April \myyear}
\newcommand{\bookname}{Linux Kernel Module Programming Guide}

% Author dependent definitions
\newcommand{\myemail}{mpg@simple-tech.com}
\newcommand{\myname}{Ori Pomerantz}
\newcommand{\myaddress}{\myname\\
                        Apt. \#1032\\
                        2355 N Hwy 360\\
                        Grand Prairie\\
                        TX 75050\\
                        USA}

\typeout{ * \bookname, \myemail}
\typeout{ * Version \myversion, \mydate.}





% Conditional flags. Set these based on how you are formatting the book.
% For Slackware edition:
\def\igsslack{1}
% For plain ASCII edition:
%\def\igsascii{0}


% The style of the document
\documentstyle[times,indentfirst,epsfig,twoside,linuxdoc,lotex]{report}

% We WANT an index.
\makeindex


% Set title information.
\title{\bookname}
\years{\myyear}
%
\author{\large \myname}

\abstract{Version \myversion, \mydate. \\
\vskip 1ex
This book is about writing Linux Kernel Modules. It is, hopefully, useful
for programmers who know C and want to learn how to write kernel modules.
It is written as a ``How-To'' instruction manual,
with examples of all of the important techniques.
\\
\vskip 1ex
Although this book touches on many points of kernel design, it is not
supposed to fulfill that need --- there are other books on this subject,
both in print and in the Linux documentation project. 
\\
\vskip 1ex
You may freely copy and redistribute this book
under certain conditions. Please see the copyright and distribution 
statement.}

% this is a 'special' for dvips
\special{papersize=7in,9in}
\setlength\paperwidth  {7in}
\setlength\paperheight {9in}

% Table of content
\setcounter{secnumdepth}{5}
\setcounter{tocdepth}{2}

% Initially, roman numbering with no numbers.
\pagenumbering{roman}
\pagestyle{empty}  
\sloppy

%%
%% end of preamble
%%

\begin{document}
% \raggedbottom
\setlength{\parskip}{0pt}      %remove space between paragraphs

\maketitle

include(copyright.m4)

\setcounter{page}{0}
\pagestyle{headings}
\tableofcontents


% I like my introductions to be chapter zero. 
\setcounter{chapter}{-1}

% No more Roman Numerals for me!
\pagenumbering{arabic}
\pagestyle{empty}


% At last, the REAL beginning
\chapter{Introduction}\label{introduction}

So, you want to write a kernel module. You know C, you've written a 
number of normal programs to run as processes, and now you want to get 
to where the real action is, to where a single wild pointer can wipe out
your file system and a core dump means a reboot. 

Well, welcome to the club. I once had a wild pointer wipe an
important directory under DOS (thankfully, now it stands for the {\bf D}ead 
{\bf O}perating {\bf S}ystem),
and I don't see why living under Linux should be any safer. 
\index{DOS}

{\bf Warning:} I wrote this and checked the program under versions 2.0.35 
and 2.2.3 of the kernel running on a Pentium. For the most part, it should 
work on other CPUs and on other versions of the kernel, as long as they are 
2.0.x or 2.2.x, but I can't promise anything. One exception is chapter 
\ref{int-handler}, which is not expected to work on any architecture other
than x86.



\section{Who Should Read This}\label{who-should-read}

This document is for people who want to write kernel modules. Although I 
will touch on how things are done in the kernel in several places, that 
is not my purpose. There are enough good sources which do a better job 
than I could have done. 

This document is also for people who know how to write kernel modules, but
have not yet adapted to version 2.2 of the kernel. If you are such a person,
I suggest you look at appendix \ref{ver-changes} to see all the differences
I encountered while updating the examples. The list is nowhere near
comprehensive, but I think it covers most of the basic functionality and
will be enough to get you started.

The kernel is a great piece of programming, and I believe that programmers 
should read at least some kernel source files and understand them. Having 
said that, I also believe in the value of playing with the system first and
asking questions later. When I learn a new programming language, I don't 
start with reading the library code, but by writing a small ``hello, world''
program. I don't see why playing with the kernel should be any different.



\section{Note on the Style}\label{style-note}

I like to put as many jokes as possible into my documentation. I'm writing 
this because I enjoy it, and I assume most of you are reading this for the 
same reason. If you just want to get to the point, ignore all the normal 
text and read the source code. I promise to put all the important 
details in remarks. 


\section{Changes}\label{changes}

\subsection{New in version 1.0.1}

\begin{enumerate}
\item{\bf Changes section}, \ref{changes}.
\item{\bf How to find the minor device number}, \ref{char-dev-file}.
\item{\bf Fixed the explanation of the difference between character and
	block device files}, \ref{char-dev-file}.
\item{\bf Makefiles for Kernel Modules}, \ref{makefile}.
\item{\bf Symmetrical Multiprocessing}, \ref{smp}.
\item{\bf A ``Bad Ideas'' Chapter}, \ref{bad-ideas}.
\end{enumerate}


\subsection{New in version 1.1.0}

\begin{enumerate}
\item{\bf Support for version 2.2 of the kernel}, all over the place.
\item{\bf Multi kernel version source files}, \ref{kernel-ver}.
\item{\bf Changes between 2.0 and 2.2}, \ref{ver-changes}.
\item{\bf Kernel Modules in Multiple Source Files}, \ref{multi-file}.
\item{\bf Suggestion not to let modules which mess with system calls be
	rmmod'ed}, \ref{sys-call}.
\end{enumerate}


\section{Acknowledgements}\label{acknowledgments}

I'd like to thank Yoav Weiss for many helpful ideas and discussions, as
well as for finding mistakes within this document before its publication. Of 
course, any remaining mistakes are purely my fault.

The \TeX \  skeleton for this book was shamelessly stolen from the ``Linux
Installation and Getting Started'' guide, where the \TeX \ work was done by
Matt Welsh.

My gratitude to Linus Torvalds, Richard Stallman and all the other people who
made it possible for me to run a high quality operating system on my computer
and get the source code goes without saying (yeah, right --- then why did I 
say it?).


\subsection{For version 1.0.1}

I couldn't list everybody who e-mailed me here, and if I've left you out
I apologize in advance. The following people were especially helpful:

\begin{itemize}

\item{\bf Frodo Looijaard from the Netherlands} For a host of useful
	suggestions, and information about the 2.1.x kernels.
\item{\bf Stephen Judd from New Zealand} Spelling corrections.
\item{\bf Magnus Ahltorp from Sweden} Correcting a mistake of mine about the
	difference between character and block devices.

\end{itemize}


\subsection{For version 1.1.0}

\begin{itemize}

\item{\bf Emmanuel Papirakis from Quebec, Canada} For porting all of the
	examples to version 2.2 of the kernel.
\item{\bf Frodo Looijaard from the Netherlands} For telling me how to create
	a multiple file kernel module (\ref{multi-file}).


\end{itemize}

Of course, any remaining mistakes are my own, and if you think they make
the book unusable you're welcome to apply for a full refund of the money
you paid me for it.



\chapter{Hello, world}\label{hello-world}

When the first caveman programmer chiseled the first program on the walls
of the first cave computer, it was a program to paint the string ``Hello,
world'' in Antelope pictures. Roman programming textbooks began with the 
``Salut, Mundi'' program. I don't know what happens to people who break with
this tradition, and I think it's safer not to find out.
\index{hello world}
\index{salut mundi}

A kernel module has to have at least two functions: {\tt init\_module} which 
is called when the module is inserted into the kernel, and 
{\tt cleanup\_module} which is called just before it is removed. 
Typically, {\tt init\_module} either registers
a handler for something with the kernel, or it replaces one of the kernel
functions with its own code (usually code to do something and then call
the original function). The {\tt cleanup\_module} function is supposed to undo 
whatever {\tt init\_module} did, so the module can be unloaded safely.
\index{init\_module}
\index{cleanup\_module}


sourcesample(hello.c, 01_hello)

\section{Makefiles for Kernel Modules}\label{makefile}
\index{makefile}

A kernel module is not an independent executable, but an object file which
will be linked into the kernel at run time. As a result, it should be 
compiled with the {\tt -c} flag. Also, all kernel modules have to be compiled
with certain symbols defined.
\index{compiling}

\begin{itemize}

\item{\tt \_\_KERNEL\_\_} --- This tells the header files that this code will
	be run in kernel mode, not as part of a user process.
	\index{\_\_KERNEL\_\_}

\item{\tt MODULE} --- This tells the header files to give the appropriate
	definitions for a kernel module.
	\index{MODULE}

\item{\tt LINUX} --- Technically speaking, this is not necessary. However,
	if you ever want to write a serious kernel module which will compile
	on more than one operating system, you'll be happy you did. This will
	allow you to do conditional compilation on the parts which are OS
	dependent.
	\index{LINUX}
\end{itemize}


There are other symbols which have to be included, or not, depending on
the flags the kernel was compiled with. If you're not sure how the kernel
was compiled, look it up in {\tt /usr/include/linux/config.h}.
\index{config.h}
\index{kernel configuration}
\index{configuration\\kernel}


\begin{itemize}

\item{\tt \_\_SMP\_\_} --- Symmetrical MultiProcessing. This has to be 
	defined if the kernel was compiled to support symmetrical 
	multiprocessing (even if it's running just on one CPU). If
	you use Symmetrical MultiProcessing, there are other things you need
	to do (see chapter \ref{smp}).
	\index{\_\_SMP\_\_}

\item{\tt CONFIG\_MODVERSIONS} --- If CONFIG\_MODVERSIONS was enabled, you
	need to have it defined when compiling the kernel module and to
	include {\tt /usr/include/linux/modversions.h}. This can also be 
	done by the code itself.
	\index{CONFIG\_MODVERSIONS}
	\index{modversions.h}

\end{itemize}


sourcesample(Makefile, 01_hello)


So, now the only thing left is to {\tt su} to root (you didn't compile this
as root, did you? Living on the edge\footnote{The reason I prefer not to 
compile as root is that the less done as root, the safer the box is. I
work in computer security, so I'm paranoid.}\dots), and then 
{\tt insmod hello} and {\tt rmmod hello} to your heart's content. While you 
do it, notice your new kernel module in {\tt /proc/modules}.
\index{insmod}
\index{rmmod}
\index{/proc/modules}
\index{root}

By the way, the reason the Makefile recommends against doing {\tt insmod} 
from X
is that when the kernel has a message to print with {\tt printk}, it 
sends it to the console.
When you don't use X, it just goes to the virtual terminal you're using 
(the one you chose with Alt-F$<$n$>$) and you see it. When you do use X, on the
other hand, there are two possibilities. Either you have a console open
with {\tt xterm -C}, in which case the output will be sent there, or you don't,
in which case the output will go to virtual terminal 7 --- the one ``covered''
by X.
\index{X\\why you should avoid}
\index{xterm -C}
\index{console}
\index{virtual terminal}
\index{terminal\\virtual}
\index{printk}

If your kernel becomes unstable,
you're more likely to get the debug messages without X. Outside of X, 
{\tt printk} 
goes directly from the kernel to the console. In X, on the other hand, 
{\tt printk}'s go to a user mode process ({\tt xterm -C}). When that process 
receives CPU time, it is supposed to send it to the X server process. 
Then, when the X server receives the CPU, it is supposed to display it ---
but an unstable kernel usually means that the system is about to crash or
reboot, so you don't want to delay the error messages, which might explain 
to you what went wrong, for longer than you have to.



\section{Multiple File Kernel Modules}\label{multi-file}
\index{multiple source files}
\index{source files\\multiple}

Sometimes it makes sense to divide a kernel module between several
source files. In this case, you need to do the following:

\begin{enumerate}

\item{In all the source files but one, add the line {\tt \#define 
	\_\_NO\_VERSION\_\_}. This is important because {\tt module.h}
	normally includes the definition of {\tt kernel\_version}, a global
	variable with the kernel version the module is compiled for. If
	you need {\tt version.h}, you need to include it yourself, because
	{\tt module.h} won't do it for you with {\tt \_\_NO\_VERSION\_\_}.}
	\index{\_\_NO\_VERSION\_\_}
	\index{module.h}
	\index{version.h}
	\index{kernel\_version}

\item{Compile all the source files as usual.}

\item{Combine all the object files into a single one. Under x86, do it with
	{\tt ld -m elf\_i386 -r -o $<$name of module$>$.o 
	$<$1st source file$>$.o $<$2nd source file$>$.o}.}
	\index{ld}\index{elf\_i386}

\end{enumerate}


Here's an example of such a kernel module.

sourcesample(start.c, 01_hello/multifile)
sourcesample(stop.c, 01_hello/multifile)
sourcesample(Makefile, 01_hello/multifile)


\chapter{Character Device Files}\label{char-dev-file}
\index{character device files}
\index{device files\\character}

So, now we're bold kernel programmers and we know how to write kernel
modules to do nothing. We feel proud of ourselves and we hold our heads 
up high. But somehow we get the feeling that something 
is missing. Catatonic modules are not much fun.

There are two major ways for a kernel module to talk to processes. One
is through device files (like the files in the {\tt /dev} directory), the other
is to use the proc file system. Since one of the major reasons to write 
something in the kernel is to support some kind of hardware device, we'll
begin with device files.
\index{/dev}

The original purpose of device files is to allow processes to communicate
with device drivers in the kernel, and through them with physical devices 
(modems, terminals, etc.). The way this is implemented is the following.
\index{devices\\physical}
\index{physical devices}
\index{modem}
\index{terminal}

Each device driver, which is responsible for some type of hardware, is 
assigned its own major number. The list of drivers and their major numbers
is available in {\tt /proc/devices}. Each physical device managed by a device 
driver is assigned a minor number. The {\tt /dev} directory is supposed to
include a special file, called a device file, for each of those devices,
whether or not it's really installed on the system.
\index{major number}
\index{number\\major (of device driver)}
\index{minor number}
\index{number\\minor (of physical device)}

For example, if you do {\tt ls -l /dev/hd[ab]*}, you'll see all of the IDE hard
disk partitions which might be connected to a machine. Notice that all of
them use the same major number, 3, but the minor number changes from one to
the other. ({\em Disclaimer: This assumes you're using a PC architecture. I 
don't know about devices on Linux running on other architectures.})
\index{IDE\\hard disk}
\index{partition\\of hard disk}
\index{hard disk\\partitions of}

When the system was installed, all of those device files were created by
the {\tt mknod} command. There's no technical reason why they have to be in 
the {\tt /dev} directory; it's just a useful convention. When creating a 
device file for testing purposes, as with the exercise here, it would 
probably make more sense to place it in the directory where you compile 
the kernel module.
\index{mknod}
\index{/dev}

Devices are divided into two types: character devices and block devices.
The difference is that block devices have a buffer for requests, so 
they can choose the order in which to respond to them. This is important in
the case of storage devices, where it's faster to read or write sectors
which are close to each other, rather than those which are further
apart. Another difference is that block devices can only accept input
and return output in blocks (whose size can vary according to the device),
whereas character devices are allowed to use as many or as few bytes as
they like.
Most devices in the world are character, because they don't need
this type of buffering, and they don't operate with a fixed block size. 
You can tell whether a device file is for a block device
or a character device by looking at the first character in the output of
{\tt ls -l}. If it's ``b'' then it's a block device, and if it's ``c'' 
then it's a character device.
\index{device files\\character}
\index{device files\\block}
\index{sequential access}
\index{access\\sequential}
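
If you'd rather ask the system than read the output of {\tt ls -l} by eye, a
process can get the same information from {\tt stat}. The following is only a
sketch of a user program (not a kernel module). The function name
{\tt device\_type} and the default path {\tt /dev/null} are my own choices for
illustration, and {\tt major} and {\tt minor} are assumed to come from
{\tt sys/sysmacros.h}, as they do on glibc systems.

\begin{verbatim}
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Return 'b' for a block device, 'c' for a character device,
 * '-' for any other kind of file, and '?' if stat() fails. */
char device_type(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return '?';
    if (S_ISBLK(st.st_mode))
        return 'b';
    if (S_ISCHR(st.st_mode))
        return 'c';
    return '-';
}

int main(int argc, char *argv[])
{
    /* The default path is only an example. */
    const char *path = (argc > 1) ? argv[1] : "/dev/null";
    struct stat st;
    char type = device_type(path);

    printf("%s: type %c", path, type);

    /* st_rdev holds the device numbers of a device file. */
    if ((type == 'b' || type == 'c') && stat(path, &st) == 0)
        printf(", major %u, minor %u",
               major(st.st_rdev), minor(st.st_rdev));
    printf("\n");
    return 0;
}
\end{verbatim}

Give it the name of any device file as its argument and compare the result with
what {\tt ls -l} shows for the same file.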

This module is divided into two separate parts: The module part which 
registers the device and the device driver part. The {\tt init\_module} 
function calls {\tt module\_register\_chrdev} to add the device driver to 
the kernel's character device driver table. That call also returns the major 
number to be used for the driver. The {\tt cleanup\_module} function 
unregisters the device.
\index{module\_register\_chrdev}
\index{major device number}
\index{device number\\major}

This (registering something and unregistering it) is the general 
functionality of those two functions. Things in the kernel don't run on 
their own initiative, like processes, but are called, by processes via 
system calls, or by hardware devices via interrupts, or by other parts of 
the kernel (simply by calling specific functions). As a result, when you 
add code to the kernel, you're supposed to register it as the 
handler for a certain type of event and when you remove it, you're 
supposed to unregister it.
\index{init\_module\\general purpose}
\index{cleanup\_module\\general purpose}

The device driver proper is composed of the four device\_$<$action$>$ 
functions, which are called when somebody tries to do something with a 
device file which has our major number. The way the kernel knows to call 
them is via the {\tt file\_operations} structure, {\tt Fops}, which was 
given when the device was registered, which includes pointers to those
four functions.
\index{file\_operations structure}
\index{struct file\_operations}
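
To make the idea of calling through a table of function pointers more
concrete, here is a toy model that runs as an ordinary process. It is not how
the kernel's real tables are written. Every name in it ({\tt toy\_fops},
{\tt toy\_register}, {\tt toy\_read} and so on) is hypothetical, and the
structure has only two members where the real {\tt file\_operations} has many.

\begin{verbatim}
#include <stdio.h>
#include <stddef.h>

#define MAX_MAJOR 256

/* A cut-down stand-in for struct file_operations: one pointer
 * per action a process can perform on the device file. */
struct toy_fops {
    int (*open)(void);
    int (*read)(char *buf, int count);
};

/* The toy driver table, indexed by major number. */
static struct toy_fops *toy_table[MAX_MAJOR];

/* Register a driver. Returns 0 on success, or -1 if the major
 * number is out of range or already taken. */
int toy_register(int major, struct toy_fops *fops)
{
    if (major < 0 || major >= MAX_MAJOR || toy_table[major] != NULL)
        return -1;
    toy_table[major] = fops;
    return 0;
}

/* Unregister: after this, the functions are never called again. */
void toy_unregister(int major)
{
    if (major >= 0 && major < MAX_MAJOR)
        toy_table[major] = NULL;
}

/* What happens when a process reads the device file: look up
 * the major number and call through the pointer. */
int toy_read(int major, char *buf, int count)
{
    if (major < 0 || major >= MAX_MAJOR)
        return -1;
    if (toy_table[major] == NULL || toy_table[major]->read == NULL)
        return -1;
    return toy_table[major]->read(buf, count);
}

static int demo_open(void)
{
    return 0;
}

/* Copy a short message into buf; return the number of bytes. */
static int demo_read(char *buf, int count)
{
    const char msg[] = "hi";
    int i;

    for (i = 0; i < count && msg[i] != '\0'; i++)
        buf[i] = msg[i];
    return i;
}

static struct toy_fops demo_fops = { demo_open, demo_read };

int main(void)
{
    char buf[8];
    int n;

    toy_register(42, &demo_fops);
    n = toy_read(42, buf, (int) sizeof buf);
    printf("read %d bytes: %.*s\n", n, n, buf);

    toy_unregister(42);
    printf("after unregister: %d\n",
           toy_read(42, buf, (int) sizeof buf));
    return 0;
}
\end{verbatim}

The toy version can refuse a call after the driver is unregistered because it
checks the table on every call. A process that already has the real device
file open gets no such check, which is the problem described next.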

Another point we need to remember here is that we can't allow the kernel
module to be {\tt rmmod}ed whenever root feels like it. The reason is that
if the device file is opened by a process and then we remove the kernel
module, using the file would cause a call to the memory location where the
appropriate function (read/write) used to be. If we're lucky, no other code
was loaded there, and we'll get an ugly error message. If we're unlucky,
another kernel module was loaded into the same location, which means a
jump into the middle of another function within the kernel. The results
of this would be impossible to predict, but they can't be positive.
\index{rmmod\\preventing}

Normally, when you don't want to allow something, you return an error 
code (a negative number) from the function which is supposed to do it. 
With {\tt cleanup\_module} that is impossible because it's a void function.
Once {\tt cleanup\_module} is called, the module is dead. 
However, there 
is a use counter which counts how many other kernel modules are using this
kernel module, called the reference count (that's the last number of the line 
in {\tt /proc/modules}). If this number isn't zero, {\tt rmmod} will fail. 
The module's reference count is available in the variable
{\tt mod\_use\_count\_}. Since there are macros defined for handling this
variable ({\tt MOD\_INC\_USE\_COUNT} and {\tt MOD\_DEC\_USE\_COUNT}), we
prefer to use them, rather than {\tt mod\_use\_count\_} directly, so we'll
be safe if the implementation changes in the future.
\index{/proc/modules}
\index{reference count}
\index{mod\_use\_count\_}
\index{cleanup\_module}
\index{MOD\_INC\_USE\_COUNT}
\index{MOD\_DEC\_USE\_COUNT}



sourcesample(chardev.c, 02_chardev)


\section{Multiple Kernel Versions Source Files}\label{kernel-ver}
\index{kernel versions}

The system calls, which are the major interface the kernel shows to the
processes, generally stay the same across versions. A new system call
may be added, but usually the old ones will behave exactly like they
used to. This is necessary for backward compatibility --- a new kernel
version is {\bf not} supposed to break regular processes. In most cases,
the device files will also remain the same. On the other hand, the internal
interfaces within the kernel can and do change between versions.

The Linux kernel versions are divided between the stable versions (n.$<$even
number$>$.m) and the development versions (n.$<$odd number$>$.m). The 
development versions include all the cool new ideas, including those which
will be considered a mistake, or reimplemented, in the next version. As a
result, you can't trust the interface to remain the same in those versions
(which is why I don't bother to support them in this book, it's too much
work and it would become dated too quickly). In the stable versions, on the
other hand, we can expect the interface to remain the same regardless of the
bug fix version (the m number).
\index{development version\\kernel}\index{stable version\\kernel}

This version of the MPG includes support for both version 2.0.x and version 
2.2.x of the Linux kernel. Since there are differences between the two, this
requires conditional compilation depending on the kernel version. The way
to do this is to use the macro {\tt LINUX\_VERSION\_CODE}. In version a.b.c of
the kernel, the value of this macro would be $2^{16}a+2^{8}b+c$. To get the
value for a specific kernel version, we can use the 
{\tt KERNEL\_VERSION} macro. Since it's not defined in 2.0.35, we define it
ourselves if necessary.
\index{2.0.x kernel}\index{2.2.x kernel}\index{versions supported}
\index{conditional compilation}\index{compilation\\conditional}
\index{LINUX\_VERSION\_CODE} \index{KERNEL\_VERSION}
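
The arithmetic can be tried out in a normal user program. In this sketch
{\tt EXAMPLE\_VERSION\_CODE} is a made-up stand-in for
{\tt LINUX\_VERSION\_CODE} (which the kernel headers would normally supply),
and {\tt which\_branch} is a hypothetical function which exists only to show
the conditional compilation.

\begin{verbatim}
#include <stdio.h>

/* Same arithmetic as the formula in the text. The guard mirrors
 * what a module has to do on kernels that lack the macro. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif

/* Hypothetical: pretend we are building against kernel 2.2.3.
 * A real module would test LINUX_VERSION_CODE instead. */
#define EXAMPLE_VERSION_CODE KERNEL_VERSION(2, 2, 3)

/* Report which side of the conditional was compiled in. */
const char *which_branch(void)
{
#if EXAMPLE_VERSION_CODE >= KERNEL_VERSION(2, 2, 0)
    return "2.2 code path";
#else
    return "2.0 code path";
#endif
}

int main(void)
{
    printf("2.0.35 -> %d\n", KERNEL_VERSION(2, 0, 35));
    printf("2.2.3  -> %d\n", KERNEL_VERSION(2, 2, 3));
    printf("selected: %s\n", which_branch());
    return 0;
}
\end{verbatim}

Run as is, it should print 131107 for 2.0.35 and 131587 for 2.2.3, and select
the 2.2 code path.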



\chapter{The /proc File System}\label{proc-fs}
\index{proc file system}
\index{/proc file system}
\index{file system\\/proc}

In Linux there is an additional mechanism for the kernel and kernel modules
to send information to processes --- the {\tt /proc} file system. Originally 
designed to allow easy access to information about processes (hence 
the name), it is now used by every bit of the kernel which has something 
interesting to report, such as {\tt /proc/modules} which has the list of 
modules and {\tt /proc/meminfo} which has memory usage statistics.
\index{/proc/modules}
\index{/proc/meminfo}

The method to use the proc file system is very similar to the one used
with device drivers --- you create a structure with all the information needed
for the {\tt /proc} file, including pointers to any handler functions (in 
our case there is only one, the one called when somebody attempts to read 
from the {\tt /proc} file). Then, {\tt init\_module} registers the structure 
with the kernel and {\tt cleanup\_module} unregisters it.

The reason we use {\tt proc\_register\_dynamic}\footnote{This applies to
version 2.0; in version 2.2 this is done for us automatically if we set the
inode to zero.} 
is that we don't want to
determine the inode number used for our file in advance, but to allow the
kernel to determine it to prevent clashes. Normal file systems are
located on a disk, rather than just in memory (which is where {\tt /proc} is), 
and in that case the inode number is a pointer to a disk location where the 
file's index-node (inode for short) is located. The inode contains 
information about the file, for example the file's permissions, together with
a pointer to the disk location or locations where the file's data can be
found.
\index{proc\_register\_dynamic}
\index{proc\_register}
\index{inode}

Because we don't get called when the file is opened or closed, there's
nowhere for us to put {\tt MOD\_INC\_USE\_COUNT} and {\tt MOD\_DEC\_USE\_COUNT}
in this module, and if the file is opened and then the module is removed, 
there's no way to avoid the consequences. In the next chapter we'll see a
harder to implement, but more flexible, way of dealing with {\tt /proc} files
which will allow us to protect against this problem as well.


sourcesample(procfs.c, 03_procfs)



\chapter{Using /proc For Input}\label{proc-input}
\index{Input\\using /proc for}
\index{/proc\\using for input}
\index{proc\\using for input}

So far we have two ways to generate output from kernel modules: we can 
register a device driver and {\tt mknod} a device file, or we can create a 
{\tt /proc}
file. This allows the kernel module to tell us anything it likes. The only
problem is that there is no way for us to talk back. The first way we'll
send input to kernel modules will be by writing back to the {\tt /proc} file.

Because the proc filesystem was written mainly to allow the kernel to 
report its situation to processes, there are no special provisions for
input. The {\tt proc\_dir\_entry} struct doesn't include a pointer to an input 
function, the way it includes a pointer to an output function. Instead,
to write into a {\tt /proc} file, we need to use the standard filesystem 
mechanism.
\index{proc\_dir\_entry structure}
\index{struct proc\_dir\_entry}

In Linux there is a standard mechanism for file system registration. Since
every file system has to have its own functions to handle inode and file 
operations\footnote{The difference between the two is that file operations 
deal with the file itself, and inode operations deal with ways of 
referencing the file, such as creating links to it.}, there is a special 
structure to hold pointers to all those functions, 
{\tt struct inode\_operations}, which includes a pointer to
{\tt struct file\_operations}. In /proc, whenever we register a new file, 
we're allowed to specify which {\tt struct inode\_operations} will be used 
for access to it. This is the mechanism we use, a 
{\tt struct inode\_operations} which includes a 
pointer to a {\tt struct file\_operations} which includes pointers to our
{\tt module\_input} and {\tt module\_output} functions.
\index{file system registration}
\index{registration\\file system}
\index{struct inode\_operations}
\index{inode\_operations structure}
\index{struct file\_operations}
\index{file\_operations structure}

It's important to note that the standard roles of read and write are 
reversed in the kernel. Read functions are used for output, whereas write
functions are used for input. The reason for that is that read and write
refer to the user's point of view --- if a process reads something from the
kernel, then the kernel needs to output it, and if a process writes something
to the kernel, then the kernel receives it as input.
\index{read\\in the kernel}
\index{write\\in the kernel}

Another interesting point here is the {\tt module\_permission} function. This 
function is called whenever a process tries to do something with the 
{\tt /proc}
file, and it can decide whether to allow access or not. Right now it is
only based on the operation and the uid of the current user (as available 
in {\tt current}, a pointer to a structure which includes information on the 
currently running process), but it could be based on anything we like, such
as what other processes are doing with the same file, the time of day, or
the last input we received.
\index{module\_permission}
\index{permissions}
\index{current pointer}
\index{pointer\\current}


The reason for {\tt put\_user} and {\tt get\_user} is that Linux memory 
(under Intel
architecture, it may be different under some other processors) is segmented. 
This means that a pointer, by itself, does not reference a unique location
in memory, only a location in a memory segment, and you need to know which
memory segment it is to be able to use it. There is one memory segment for
the kernel, and one for each of the processes. 
\index{put\_user}
\index{get\_user}
\index{memory segments}
\index{segment\\memory}


The only memory segment accessible to a process is its own, so when 
writing regular programs to run as processes, there's no need to worry
about segments. When you write
a kernel module, normally you want to access the kernel memory segment, 
which is handled automatically by the system. However, when the content of
a memory buffer 
needs to be passed between the currently running process and the kernel,
the kernel function receives a pointer to the memory buffer which is in the
process segment. The {\tt put\_user} and {\tt get\_user} macros allow you to 
access that memory.




sourcesample(procfs.c, 04_procfs2)



\chapter{Talking to Device Files (writes and IOCTLs)}\label{dev-input}
\index{device files\\input to}
\index{input to device files}
\index{ioctl}
\index{write\\to device files}

Device files are supposed to represent physical devices. Most physical
devices are used for output as well as input, so there has to be some
mechanism for device drivers in the kernel to get the output to send to 
the device from processes. This is done by opening the device file for 
output and writing to it, just like writing to a file. In the following 
example, this is implemented by {\tt device\_write}.

This is not always enough. Imagine you had a serial port connected to a modem
(even if you have an internal modem, it is still implemented from the CPU's
perspective as a serial port connected to a modem, so you don't have to tax 
your imagination too hard). The natural thing to do would be to use the 
device file to write things to the modem (either modem commands or data to 
be sent through the phone line) and read things from the modem (either 
responses for commands or the data received through the phone line). However, 
this leaves open the question of what to do when you need to talk to the 
serial port itself, for example to send the rate at which data is sent and
received.
\index{serial port}
\index{modem}

The answer in Unix is to use a special function called {\tt ioctl} (short for 
{\bf i}nput {\bf o}utput {\bf c}on{\bf t}ro{\bf l}). Every device can have 
its own {\tt ioctl} commands, 
which can be read {\tt ioctl}'s (to send information from a process to the 
kernel), write {\tt ioctl}'s (to return information to a process),
\footnote{Notice that here the roles of read and write are reversed 
{\em again}, so in {\tt ioctl}'s read is to send information to the kernel
and write is to receive information from the kernel.}
 both or neither. The 
ioctl function is called with three parameters: the file descriptor of the 
appropriate device file, the ioctl number, and a parameter, which is of
type long so you can use a cast to use it to pass anything.
\footnote{This isn't exact. You won't be able to pass a structure, for 
example, through an ioctl --- but you will be able to pass a pointer to the
structure.} 

The ioctl number encodes the major device number, the type of the ioctl, the 
command, and the type of the parameter. This ioctl number is usually
created by a macro call ({\tt \_IO}, {\tt \_IOR}, {\tt \_IOW} or 
{\tt \_IOWR} --- depending on the
type) in a header file. This header file should then be {\tt \#include}'d both
by the programs which will use {\tt ioctl} (so they can generate the 
appropriate
{\tt ioctl}'s) and by the kernel module (so it can understand it). In the 
example below, the header file is {\tt chardev.h} and the program which 
uses it is {\tt ioctl.c}.
\index{\_IO}
\index{\_IOR}
\index{\_IOW}
\index{\_IOWR}
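
The same header that provides these macros also provides macros which take
an ioctl number apart again. The following user-space sketch is not taken
from {\tt chardev.h}: the names beginning with {\tt DEMO\_} and 
{\tt demo\_}, and the value 100, are assumptions made for the example. Note
that the kernel headers call the first field the type of the ioctl, which
is where the text above puts the major device number.

\begin{verbatim}
#include <linux/ioctl.h>   /* _IOR, _IOWR and the _IOC_* macros */

/* Hypothetical values, for illustration only. */
#define DEMO_MAJOR   100
#define DEMO_IOCTL_A _IOR(DEMO_MAJOR, 0, char *)
#define DEMO_IOCTL_B _IOWR(DEMO_MAJOR, 2, int)

/* Each helper extracts one of the fields packed into the number. */
unsigned int demo_ioctl_major(unsigned long cmd)
{
    return _IOC_TYPE(cmd);
}

unsigned int demo_ioctl_command(unsigned long cmd)
{
    return _IOC_NR(cmd);
}

unsigned int demo_ioctl_arg_size(unsigned long cmd)
{
    return _IOC_SIZE(cmd);
}

unsigned int demo_ioctl_direction(unsigned long cmd)
{
    return _IOC_DIR(cmd);
}
\end{verbatim}

Because the size of the parameter type is part of the number, two ioctls
with the same major number and command but different parameter types get
different numbers.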

If you want to use {\tt ioctl}'s in your own kernel modules, it is best to 
receive
an official {\tt ioctl} assignment, so if you accidentally get somebody else's
{\tt ioctl}'s, or if they get yours, you'll know something is wrong. For more 
information, consult the kernel source tree at 
``{\tt Documentation/ioctl-number.txt}''.
\index{official ioctl assignment}
\index{ioctl\\official assignment}


sourcesample(chardev.c, 05_devrw)
sourcesample(chardev.h, 05_devrw)
\index{ioctl\\defining}
\index{defining ioctls}
\index{ioctl\\header file for}
\index{header file for ioctls}

sourcesample(ioctl.c, 05_devrw)
\index{ioctl\\using in a process}


\chapter{Startup Parameters}\label{startup-param}
\index{startup parameters}
\index{parameters\\startup}


In many of the previous examples, we had to hard-wire something into
the kernel module, such as the file name for {\tt /proc} files or the major
device number for the device so we can have {\tt ioctl}'s to it. This goes
against the grain of the Unix, and Linux, philosophy, which is to write
flexible programs the user can customize.
\index{hard wiring}

The way to tell a program, or a kernel module, something it needs before
it can start working is by command line parameters. In the case of kernel
modules, we don't get {\tt argc} and {\tt argv} --- instead, we get something 
better. We can define global variables in the kernel module and {\tt insmod}
will fill them for us.
\index{argc}
\index{argv}

In this kernel module, we define two of them: {\tt str1} and {\tt str2}. All 
you need to do is compile the kernel module and then run 
{\tt insmod param.o str1=xxx str2=yyy}.
When {\tt init\_module} is called, {\tt str1} will point to the string 
``{\tt xxx}'' and {\tt str2} to the string ``{\tt yyy}''.
\index{insmod}

In version 2.0 there is no type checking on these 
arguments\footnote{There can't be, since under C the object file only has 
the location of global variables, not their type. That is why header files 
are necessary.}. If the first character of {\tt str1} or {\tt str2} is a 
digit the kernel will fill the variable with the value of the integer, 
rather than a pointer to the string. In a real life situation you have to 
check for this.
\index{type checking}
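
To see why a module has to check, here is a user-space sketch which mimics
the rule just described. It is only a model of that rule, not the actual
{\tt insmod} code, and every name in it is invented for the illustration.

\begin{verbatim}
#include <ctype.h>
#include <stdlib.h>

/* What an untyped loader could end up storing in a module variable. */
struct demo_param {
    int is_number;       /* 1 if the text was taken as an integer */
    long number;         /* meaningful when is_number is 1 */
    const char *text;    /* meaningful when is_number is 0 */
};

/* Model of the rule: a leading digit means "store an integer",
 * anything else means "store a pointer to the string". */
struct demo_param demo_parse_untyped(const char *value)
{
    struct demo_param p = { 0, 0, value };

    if (isdigit((unsigned char)value[0])) {
        p.is_number = 1;
        p.number = strtol(value, NULL, 0);
        p.text = NULL;
    }
    return p;
}
\end{verbatim}

With this rule, a value such as {\tt 42abc} is stored as the integer 42, so
a module which expected a string pointer would be handed a number instead.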

On the other hand, in version 2.2 you use the macro {\tt MODULE\_PARM} to
tell {\tt insmod} that you expect a parameter, its name {\em and its type}. 
This solves the type problem and allows kernel modules to receive strings 
which begin with a digit, for example.
\index{MODULE\_PARM}
\index{insmod}

sourcesample(param.c, 06_params)



\chapter{System Calls}\label{sys-call}
\index{system calls}
\index{calls\\system}

So far, the only thing we've done was to use well defined kernel mechanisms 
to register {\tt /proc} files and device handlers. This is fine if you 
want to do something the kernel programmers thought you'd want, such as 
write a device
driver. But what if you want to do something unusual, to change the
behavior of the system in some way? Then, you're mostly on your own. 

This is where kernel programming gets dangerous. While writing the example
below, I killed the {\tt open} system call. This meant I couldn't open any 
files,
I couldn't run any programs, and I couldn't {\tt shutdown} the computer. 
I had to pull the power switch. Luckily, no files died. To ensure you won't 
lose any files either, please run {\tt sync} right before you do the 
{\tt insmod} and the {\tt rmmod}.
\index{sync}
\index{insmod}
\index{rmmod}
\index{shutdown}

Forget about {\tt /proc} files, forget about device files. They're just minor
details. The {\em real} process to kernel communication mechanism, the one 
used by all processes, is system calls. When a process requests a service 
from the kernel (such as opening a file, forking to a new process, or 
requesting more memory), this is the mechanism used. If you want to change 
the behavior of
the kernel in interesting ways, this is the place to do it. By the way, if you
want to see which system calls a program uses, run 
{\tt strace <command> <arguments>}. 
\index{strace}

In general, a process is not supposed to be able to access the kernel. It 
can't access kernel memory and it can't call kernel functions. The hardware
of the CPU enforces this (that's the reason why it's called ``protected 
mode'').
System calls are an exception to this general rule. What happens is that 
the process fills the registers with the appropriate values and then calls
a special instruction which jumps to a previously defined location in the
kernel (of course, that location is readable by user processes, but it is not 
writable by them). Under Intel CPUs, this is done by means of interrupt 0x80.
The hardware knows that once you jump to this location, you are no longer
running in restricted user mode, but as the operating system kernel --- and
therefore you're allowed to do whatever you want.
\index{interrupt 0x80}

The location in the kernel a process can jump to is called 
{\tt system\_call}. The
procedure at that location checks the system call number, which tells the
kernel what service the process requested. Then, it looks at the table of
system calls ({\tt sys\_call\_table}) to see the address of the kernel 
function to
call. Then it calls the function, and after it returns, does a few system
checks and then returns to the process (or to a different process, if
the process's time ran out). If you want to read this code, it's in the 
source file {\tt arch/$<$architecture$>$/kernel/entry.S}, after the line 
{\tt ENTRY(system\_call)}.
\index{system\_call}
\index{ENTRY(system\_call)}
\index{sys\_call\_table}
\index{entry.S}

So, if we want to change the way a certain system call works, what we 
need to do is to write our own function to implement it (usually by adding a 
bit of our own code, and then calling the original function) and then change
the pointer at {\tt sys\_call\_table} to point to our function. Because we
might be removed later and we don't want to leave the system in an unstable
state, it's important for {\tt cleanup\_module} to restore the table to 
its original state.

The source code here is an example of such a kernel module. We want to ``spy''
on a certain user, and to {\tt printk} a message whenever that user opens a 
file. Towards this end, we replace the system call to open a file with our own
function, called {\tt our\_sys\_open}. This function checks the uid 
(user's id) of the current process, and if it's equal to the uid we spy on, 
it calls {\tt printk}
to display the name of the file to be opened. Then, either way, it calls
the original {\tt open} function with the same parameters, to actually open
the file.
\index{open\\system call}

The {\tt init\_module} function replaces the appropriate location in 
{\tt sys\_call\_table}
and keeps the original pointer in a variable. The {\tt cleanup\_module} 
function uses that variable to restore everything back to normal.
This approach is dangerous, because of the possibility of two kernel modules
changing the same system call. Imagine we have two kernel modules, A and B.
A's open system call will be A\_open and B's will be B\_open. Now, when A is 
inserted into the kernel, the system call is replaced with A\_open, which will
call the original sys\_open when it's done. Next, B is inserted into the
kernel, which replaces the system call with B\_open, which will call what it
thinks is the original system call, A\_open, when it's done. 

Now, if B is removed first, everything will be well --- it will simply 
restore the system call to A\_open, which calls the original. However, if
A is removed and then B is removed, the system will crash. A's removal will
restore the system call to the original, sys\_open, cutting B out of the loop.
Then, when B is removed, it will restore the system call to what {\bf it} 
thinks is the original, A\_open, which is no longer in memory. At first 
glance, it appears we could solve this particular problem by checking if the
system call is equal to our open function and if so not changing it at all
(so that B won't change the system call when it's removed), but that will
cause an even worse problem. When A is removed, it sees that the system
call was changed to B\_open so that it is no longer pointing to A\_open, so 
it won't restore it to sys\_open before it is removed from memory. 
Unfortunately, B\_open will still try to call A\_open which is no longer 
there, so that even without removing B the system would crash.
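
The order dependence is easy to reproduce outside the kernel. The following
user-space sketch models the table entry as a single function pointer and
the two modules as pairs of load and unload functions. All the names are
made up for the illustration, and nothing here touches the real 
{\tt sys\_call\_table}.

\begin{verbatim}
typedef int (*demo_open_fn)(const char *path);

/* One slot standing in for the open entry of sys_call_table. */
static demo_open_fn demo_slot;

static demo_open_fn a_saved;   /* what A believes is the original */
static demo_open_fn b_saved;   /* what B believes is the original */

static int demo_sys_open(const char *path) { (void)path; return 0; }
static int demo_a_open(const char *path)   { return a_saved(path); }
static int demo_b_open(const char *path)   { return b_saved(path); }

/* Loading saves the current pointer and installs our own. */
static void demo_load_a(void) { a_saved = demo_slot; demo_slot = demo_a_open; }
static void demo_load_b(void) { b_saved = demo_slot; demo_slot = demo_b_open; }

/* Unloading blindly restores whatever was saved at load time. */
static void demo_unload_a(void) { demo_slot = a_saved; }
static void demo_unload_b(void) { demo_slot = b_saved; }

/* Returns 1 if, after both modules are gone, the slot is back to
 * the original function, and 0 if it points into a removed module. */
int demo_slot_is_sane(int remove_a_first)
{
    demo_slot = demo_sys_open;
    demo_load_a();
    demo_load_b();
    if (remove_a_first) {
        demo_unload_a();
        demo_unload_b();
    } else {
        demo_unload_b();
        demo_unload_a();
    }
    return demo_slot == demo_sys_open;
}
\end{verbatim}

Removing B first leaves the slot pointing at the original function. 
Removing A first leaves it pointing at {\tt demo\_a\_open}, which in a real
kernel would be code that has already been freed.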

I can think of two ways to prevent this problem.  The first is to restore 
the call
to the original value, sys\_open. Unfortunately, sys\_open is not part of the
kernel symbol table in {\tt /proc/ksyms}, so we can't access it. The other
solution is to use the reference count to prevent root from {\tt rmmod}'ing
the module once it is loaded. This is good for production modules, but bad
for an educational sample --- which is why I didn't do it here.
\index{rmmod}\index{MOD\_INC\_USE\_COUNT}
\index{sys\_open}

sourcesample(syscall.c, 07_syscall)


\chapter{Blocking Processes}\label{blocks}
\index{blocking processes}
\index{processes\\blocking}

What do you do when somebody asks you for something you can't do right away?
If you're a human being and you're bothered by a human being, the only thing
you can say is: ``Not right now, I'm busy. {\em Go away!}''. But if you're 
a kernel module and you're bothered by a process, you have another 
possibility. You can put the process to sleep until you can service it. 
After all, processes are being put to sleep by the kernel and woken up all 
the time 
(that's the way multiple processes appear to run at the same time on a 
single CPU). 
\index{multi tasking}
\index{busy}

This kernel module is an example of this. The file (called {\tt /proc/sleep}) 
can only be opened by a single process at a time. If the file is already
open, the kernel module calls 
{\tt module\_interruptible\_sleep\_on}\footnote{The easiest way to keep a
file open is to open it with {\tt tail -f}.}. This 
function changes the status of the task (a task is the kernel data structure 
which holds
information about a process and the system call it's in, if any) to 
{\tt TASK\_INTERRUPTIBLE}, which means that the task will not run until 
it is woken
up somehow, and adds it to {\tt WaitQ}, the queue of tasks waiting to 
access the 
file. Then, the function calls the scheduler to context switch to a 
different process, one which has some use for the CPU.
\index{module\_interruptible\_sleep\_on}
\index{interruptible\_sleep\_on}
\index{TASK\_INTERRUPTIBLE}
\index{sleep\\putting processes to}
\index{processes\\putting to sleep}
\index{putting processes to sleep}
\index{task structure}
\index{structure\\task}

When a process is done with the file, it closes it, and {\tt module\_close} is
called. That function wakes up all the processes in the queue (there's no
mechanism to only wake up one of them). It then returns and the 
process which just closed the file can continue to run. In time, the 
scheduler decides that that process has had enough and gives control of 
the CPU to another process. Eventually, one of the processes which was
in the queue will be given control of the CPU by the scheduler.
It starts at the point right after the call to 
{\tt module\_interruptible\_sleep\_on}
\footnote{This means that the process is still in kernel 
mode --- as far as the process is concerned, it issued the {\tt open} system
call and the system call hasn't returned yet. The process doesn't know 
somebody else used the CPU for most of the time between the moment it issued
the call and the moment it returned.}
. It can then proceed to set a global variable
to tell all the other processes that the file is still open and go on with
its life. When the other processes get a piece of the CPU, they'll see that
global variable and go back to sleep.
\index{waking up processes}
\index{processes\\waking up}
\index{multitasking}
\index{scheduler}

To make our life more interesting, {\tt module\_close} doesn't have a 
monopoly on
waking up the processes which wait to access the file. A signal, such as 
Ctrl-C ({\tt SIGINT}) can also wake up a process\footnote{This is because
we used {\tt module\_interruptible\_sleep\_on}. We could have used 
{\tt module\_sleep\_on} instead, but that would have resulted in extremely
angry users whose control C's are ignored.}. In that case, we want to 
return
with {\tt -EINTR} immediately. This is important so users can, for example, 
kill the process before it receives the file.
\index{module\_wake\_up}
\index{signal}
\index{SIGINT}
\index{ctrl-c}
\index{EINTR}
\index{processes\\killing}
\index{module\_sleep\_on}
\index{sleep\_on}

There is one more point to remember. Sometimes processes don't want to
sleep; they want either to get what they want immediately, or to be told
it cannot be done. Such processes use the {\tt O\_NONBLOCK} 
flag when opening the file. The kernel is supposed to respond by returning
with the error code {\tt -EAGAIN} from operations which would otherwise 
block, such as opening the file in this example. The program 
{\tt cat\_noblock}, available in the source directory for this chapter, can 
be used to open a file with {\tt O\_NONBLOCK}.
\index{O\_NONBLOCK} 
\index{non blocking}
\index{blocking, how to avoid}
\index{EAGAIN}
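
From the process side, a {\tt -EAGAIN} returned by the kernel shows up as a
return value of -1 with {\tt errno} set to {\tt EAGAIN}. The sketch below
shows this with an ordinary pipe rather than with {\tt /proc/sleep}, so it
needs no kernel module. The function name is made up, and the operation
which would block is a read rather than an open.

\begin{verbatim}
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Reads from an empty pipe whose read end is marked O_NONBLOCK.
 * Returns the errno value left by read(), 0 if read() did not
 * fail, or -1 if the pipe could not be created. */
int demo_nonblocking_read_errno(void)
{
    int fds[2];
    int flags;
    int saved = 0;
    char byte;

    if (pipe(fds) != 0)
        return -1;

    /* Add O_NONBLOCK to the flags of the read end. */
    flags = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    /* Nothing was written, so a blocking read would sleep here. */
    if (read(fds[0], &byte, 1) < 0)
        saved = errno;

    close(fds[0]);
    close(fds[1]);
    return saved;
}
\end{verbatim}

A program which gets {\tt EAGAIN} this way is free to do something else and
try again later, which is exactly what a sleeping process cannot do.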



sourcesample(sleep.c, 08_sleep)



\chapter{Replacing printk's}\label{printk}
\index{printk\\replacing}
\index{replacing printk's}

In the beginning (chapter \ref{hello-world}), I said that X and kernel
module programming don't mix. That's true while developing the kernel 
module, but in actual use you want to be able to send messages to
whichever tty\footnote{{\bf T}ele{\bf ty}pe, originally a combination 
keyboard--printer used to communicate with a Unix system, and today an
abstraction for the text stream used for a Unix program, whether it's a
physical terminal, an xterm on an X display, a network connection used with
telnet, etc.} the command to the module came from. This is important for
identifying errors after the kernel module is released, because it will be
used through all of them.

The way this is done is by using {\tt current}, a pointer to the currently 
running task,
to get the current task's tty structure. Then, we look inside that tty
structure to find a pointer to a string write function, which we use to
write a string to the tty.
\index{current task}\index{task\\current}
\index{tty\_struct}\index{struct\\tty}


sourcesample(printk.c, 09_printk)



\chapter{Scheduling Tasks}\label{sched}
\index{scheduling tasks}\index{tasks\\scheduling}

Very often, we have ``housekeeping'' tasks which have to be done at a 
certain time, or every so often. If the task is to be done by a process,
we do it by putting it in the {\tt crontab} file. If the task is to be done 
by a 
kernel module, we have two possibilities. The first is to put a process
in the {\tt crontab} file which will wake up the module by a system call when 
necessary, for example by opening a file. This is terribly inefficient, 
however --- we run a new process off of {\tt crontab}, read a new 
executable to memory, and all this
just to wake up a kernel module which is in memory anyway.
\index{housekeeping}\index{crontab}

Instead of doing that, we can create a function that will be called once
for every timer interrupt. The way we do this is we create a task, held in
a {\tt struct tq\_struct}, which will hold a pointer to the function. Then, 
we use {\tt queue\_task} to put that task on a task list called
{\tt tq\_timer}, which is the list of tasks to be executed on the next timer
interrupt. Because we want the function to keep on being executed, we need 
to put it back on {\tt tq\_timer} whenever it is called, for the next timer
interrupt.
\index{struct tq\_struct}\index{tq\_struct struct}
\index{queue\_task}
\index{task}
\index{tq\_timer}

There's one more point we need to remember here. When a module is removed
by {\tt rmmod}, first its reference count is checked. If it is zero, 
{\tt cleanup\_module} is called. Then, the module is removed from memory
with all its functions. Nobody checks to see if the timer's task list 
happens to contain a pointer to one of those functions, which will no longer
be available. Ages later (from the computer's perspective; from a human 
perspective it's nothing, less than a hundredth of a second), the 
kernel has a timer interrupt and
tries to call the function on the task list. Unfortunately, the function is 
no longer there. In most cases, the memory page where it sat is unused, 
and you get an ugly error message. But if some other code is now sitting at
the same memory location, things could get {\bf very} ugly. Unfortunately, we
don't have an easy way to unregister a task from a task list.
\index{rmmod}
\index{reference count}
\index{cleanup\_module}

Since {\tt cleanup\_module} can't return with an error code (it's a void
function), the solution is to not let it return at all. Instead, it calls
{\tt sleep\_on} or {\tt module\_sleep\_on}\footnote{They're really the same.}
to put the {\tt rmmod} process to sleep. Before that, it informs the 
function called on the timer interrupt to stop attaching itself by setting
a global variable. Then, on the next timer interrupt, the {\tt rmmod}
process will be woken up, when our function is no longer in the queue and it's
safe to remove the module.
\index{sleep\_on}\index{module\_sleep\_on}
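
The interplay between a function which keeps putting itself back on the
list and a cleanup routine which asks it to stop can be modelled in user
space. In the sketch below an array stands in for {\tt tq\_timer} and a
function call stands in for the timer interrupt. All the names are
hypothetical, and the real {\tt rmmod} process would sleep where this model
simply runs one more tick.

\begin{verbatim}
#define DEMO_QUEUE_MAX 8

typedef void (*demo_task_fn)(void);

static demo_task_fn demo_queue[DEMO_QUEUE_MAX]; /* models tq_timer */
static int demo_queued;     /* tasks waiting for the next tick */
static int demo_calls;      /* how often the periodic task ran */
static int demo_stopping;   /* set by the cleanup code */

static void demo_queue_task(demo_task_fn fn)
{
    if (demo_queued < DEMO_QUEUE_MAX)
        demo_queue[demo_queued++] = fn;
}

/* The periodic function: does its work, then puts itself back
 * on the list unless cleanup has asked it to stop. */
static void demo_periodic(void)
{
    demo_calls++;
    if (!demo_stopping)
        demo_queue_task(demo_periodic);
}

/* One simulated timer interrupt: empty the list, then run
 * everything that was on it. */
static void demo_timer_tick(void)
{
    demo_task_fn pending[DEMO_QUEUE_MAX];
    int n = demo_queued, i;

    for (i = 0; i < n; i++)
        pending[i] = demo_queue[i];
    demo_queued = 0;
    for (i = 0; i < n; i++)
        pending[i]();
}

/* Runs the task for some ticks, then shuts down as described in
 * the text. Returns the number of tasks still on the list. */
int demo_run_and_stop(int ticks)
{
    int i;

    demo_queued = demo_calls = demo_stopping = 0;
    demo_queue_task(demo_periodic);
    for (i = 0; i < ticks; i++)
        demo_timer_tick();
    demo_stopping = 1;      /* cleanup: ask the task to stop */
    demo_timer_tick();      /* the task runs one last time */
    return demo_queued;     /* 0 means nothing points at us */
}

int demo_call_count(void)
{
    return demo_calls;
}
\end{verbatim}

The important property is the return value of zero: once the flag is set
and one more tick has passed, nothing on the list points at the function
any more.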

sourcesample(sched.c, 10_sched)



\chapter{Interrupt Handlers}\label{int-handler}
\index{interrupt handlers}\index{handlers\\interrupt}

Except for the last chapter, everything we did in the kernel so far we've
done as a response to a process asking for it, either by dealing with a 
special file, sending an {\tt ioctl}, or issuing a system call. But the job
of the kernel isn't just to respond to process requests. Another job, which
is every bit as important, is to speak to the hardware connected to the
machine.

There are two types of interaction between the CPU and the rest of the 
computer's hardware. The first type is when the CPU gives orders to the 
hardware, the other is when the hardware needs to tell the CPU something.
The second, called interrupts, is much harder to implement because it has
to be dealt with when convenient for the hardware, not the CPU. Hardware
devices typically have a very small amount of RAM, and if you don't read
their information when available, it is lost.

Under Linux, hardware interrupts are called IRQs (short for {\bf I}nterrupt 
{\bf R}e{\bf q}uests)\footnote{This is standard nomenclature on the Intel
architecture where Linux originated.}. There are two types of IRQs, 
short and long. A short IRQ is one which is expected to take a {\bf very}
short period of time, during which the rest of the machine will be blocked
and no other interrupts will be handled. A long IRQ is one which can take
longer, and during which other interrupts may occur (but not interrupts 
from the same device). If at all possible, it's better to declare an interrupt
handler to be long.

When the CPU receives an interrupt, it stops whatever it's doing (unless
it's processing a more important interrupt, in which case it will deal with
this one only when the more important one is done), saves certain parameters
on the stack and calls the interrupt handler. This means that certain things
are not allowed in the interrupt handler itself, because the system is in 
an unknown state. The solution to this problem is for the interrupt
handler to do what needs to be done immediately, usually read something from
the hardware or send something to the hardware, and then schedule the 
handling of the new information at a later time (this is called the ``bottom 
half'') and return. The kernel is then guaranteed to call the bottom half as
soon as possible --- and when it does, everything allowed in kernel modules
will be allowed.
\index{bottom half}

The way to implement this is to call {\tt request\_irq} to get your 
interrupt handler called when the relevant IRQ is received (there are 16 of
them on Intel platforms). This function receives the IRQ number, the name
of the function, flags, a name for {\tt /proc/interrupts} and a parameter
to pass to the interrupt handler. The flags can include {\tt SA\_SHIRQ} to
indicate you're willing to share the IRQ with other interrupt handlers 
(usually because a number of hardware devices sit on the same IRQ) and 
{\tt SA\_INTERRUPT} to indicate this is a fast interrupt (a short IRQ in the
terms used above). This function
will only succeed if there isn't already a handler on this IRQ, or if 
you're both willing to share.
\index{request\_irq}
\index{/proc/interrupts}
\index{SA\_SHIRQ}
\index{SA\_INTERRUPT}

Then, from within the interrupt handler, we communicate with the hardware
and then use {\tt queue\_task\_irq} with {\tt tq\_immediate} and 
{\tt mark\_bh(BH\_IMMEDIATE)} to
schedule the bottom half. The reason we can't use the standard 
{\tt queue\_task} in version 2.0 is that the interrupt might happen right 
in the middle
of somebody else's {\tt queue\_task}\footnote{{\tt queue\_task\_irq} is 
protected from this by a global lock --- in 2.2 there is no 
{\tt queue\_task\_irq} and {\tt queue\_task} is protected by a lock.}.
We need {\tt mark\_bh} because earlier versions of
Linux only had an array of 32 bottom halves, and now one of them 
({\tt BH\_IMMEDIATE}) is used for the linked list of bottom halves for 
drivers which didn't get a bottom half entry assigned to them.
\index{queue\_task\_irq}
\index{queue\_task}
\index{tq\_immediate}
\index{mark\_bh}
\index{BH\_IMMEDIATE}


\section{Keyboards on the Intel Architecture}\label{keyboard}
\index{keyboard}\index{intel architecture\\keyboard}

{\bf Warning: The rest of this chapter is completely Intel specific. If
you're not running on an Intel platform, it will not work. Don't even try
to compile the code here.}

I had a problem with writing the sample code for this chapter. On one hand,
for an example to be useful it has to run on everybody's computer with
meaningful results. On the other hand, the kernel already includes device
drivers for all of the common devices, and those device drivers won't
coexist with what I'm going to write. The solution I've found was to 
write something for the keyboard interrupt, and disable the regular keyboard
interrupt handler first. Since it is defined as a static symbol in the kernel
source files (specifically, {\tt drivers/char/keyboard.c}), there is no way
to restore it. Before insmod'ing this code, run
{\tt sleep 120 ; reboot} on another terminal if you value your file system.

This code binds itself to IRQ 1, which is the IRQ of the keyboard controller
under Intel architectures. Then, when it receives a keyboard interrupt, it
reads the keyboard's status (that's the purpose of the {\tt inb(0x64)})
and the scan code, which is the value returned by the keyboard. Then, as soon
as the kernel thinks it's feasible, it runs {\tt got\_char} which gives the
code of the key used (the first seven bits of the scan code) and whether
it has been pressed (if the 8th bit is zero) or released (if it's one).
\index{inb}
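
The split of the scan code into a key code and a pressed or released flag
comes down to two bit masks. The helpers below are a stand-alone sketch of
that decoding. Their names are hypothetical and they are not taken from
{\tt intrpt.c}.

\begin{verbatim}
/* Key code: the low seven bits of the scan code. */
int demo_key_code(unsigned char scancode)
{
    return scancode & 0x7F;
}

/* 1 if the key was released (8th bit set), 0 if it was pressed. */
int demo_key_released(unsigned char scancode)
{
    return (scancode & 0x80) != 0;
}
\end{verbatim}

For example, the scan codes 0x1E and 0x9E carry the same key code, 0x1E.
The first reports the key being pressed and the second reports it being
released.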




sourcesample(intrpt.c, 11_intrp)



\chapter{Symmetrical Multi--Processing}\label{smp}
\index{SMP}
\index{multi-processing}
\index{symmetrical multi--processing}
\index{processing\\multi}
 
One of the easiest (read, cheapest) ways to improve hardware performance is
to put more than one CPU on the board. This can be done either by making the
different CPUs take on different jobs (asymmetrical multi--processing) or by 
making them all run in parallel, doing the same job (symmetrical 
multi--processing, a.k.a. SMP). Doing asymmetrical multi--processing 
effectively 
requires specialized knowledge about the tasks the computer should do, which
is unavailable in a general purpose operating system such as Linux. On the
other hand, symmetrical multi--processing is relatively easy to implement.
\index{CPU\\multiple}
 
By relatively easy, I mean exactly that --- not that it's {\em really}
easy. In a symmetrical multi--processing environment, the CPUs share the 
same memory, and as a result code running in one CPU can affect the memory 
used by another. You can no longer be certain that a variable you've set to 
a certain value in the previous line still has that value --- the other 
CPU might have played with it while you weren't looking. Obviously, it's
impossible to program like this.
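
To make the danger concrete, the following sketch plays out by hand what
can happen when two CPUs both execute the statement which adds one to a
shared counter. It uses no threads. The two CPUs are simulated, each with
its own register, so that the order of the steps is fixed and the result is
reproducible. All the names are made up for the illustration.

\begin{verbatim}
/* A shared variable and two simulated CPUs, each running
 *     counter = counter + 1;
 * as three separate steps: load, add, store. */
static int demo_counter;

struct demo_cpu {
    int reg;   /* the private register of this CPU */
};

static void demo_load(struct demo_cpu *c)  { c->reg = demo_counter; }
static void demo_add(struct demo_cpu *c)   { c->reg = c->reg + 1; }
static void demo_store(struct demo_cpu *c) { demo_counter = c->reg; }

/* Each CPU finishes before the other starts. */
int demo_one_after_the_other(void)
{
    struct demo_cpu a, b;

    demo_counter = 0;
    demo_load(&a); demo_add(&a); demo_store(&a);
    demo_load(&b); demo_add(&b); demo_store(&b);
    return demo_counter;
}

/* Both CPUs load before either stores: one increment is lost. */
int demo_interleaved(void)
{
    struct demo_cpu a, b;

    demo_counter = 0;
    demo_load(&a);
    demo_load(&b);
    demo_add(&a); demo_store(&a);
    demo_add(&b); demo_store(&b);
    return demo_counter;
}
\end{verbatim}

The first function returns 2 and the second returns 1, although both ran
exactly the same six steps. On a real SMP machine the order is not under
the programmer's control unless a lock forces one CPU to wait.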

In the case of process programming this normally isn't an issue, because
a process will normally only run on one CPU at a time\footnote{The exception
is threaded processes, which can run on several CPUs at once.}. The kernel,
on the other hand, could be called by different processes running on different
CPUs.

In version 2.0.x, this isn't a problem because the entire kernel is in one
big spinlock. This means that if one CPU is in the kernel and another CPU
wants to get in, for example because of a system call, it has to wait until
the first CPU is done. This makes Linux SMP safe\footnote{Meaning it is safe
to use it with SMP.}, but terribly inefficient. 

In version 2.2.x, several CPUs can be in the kernel at the same time. This
is something module writers need to be aware of. I got somebody to give me 
access to an SMP box, so hopefully the next version of this book will 
include more information.

% Unfortunately, I don't have
% access to an SMP box to test things, so I can't write a chapter about how
% to do it right. It anybody out there has access to one and is willing to
% help me with this, I'll be grateful. If a company will provide me with this
% access, I'll give them a free one paragraph ad at the top of this chapter.



\chapter{Common Pitfalls}\label{bad-ideas}

Before I send you on your way to go out into the world and write kernel
modules, there are a few things I need to warn you about. If I fail to
warn you and something bad happens, please report the problem to me for a
full refund of the amount I got paid for your copy of the book.
\index{refund policy}

\begin{enumerate}

\item{\bf Using standard libraries} You can't do that. In a kernel
	module you can only use kernel functions, which are the functions
	you can see in {\tt /proc/ksyms}.
	\index{standard libraries}\index{libraries\\standard}
	\index{/proc/ksyms}\index{ksyms\\proc file}

\item{\bf Disabling interrupts} You might need to do this for a short
	time and that is OK, but if you don't enable them afterwards, your
	system will be stuck and you'll have to power it off.
	\index{interrupts\\disabling}

\item{\bf Sticking your head inside a large carnivore} I probably don't have
	to warn you about this, but I figured I will anyway, just in case.

\end{enumerate}

\appendix

\chapter{Changes between 2.0 and 2.2}\label{ver-changes}
\index{versions\\kernel}\index{2.2 changes}

I don't know the entire kernel well enough to document all of the changes. 
In the course of converting the examples (or actually, adapting 
Emmanuel Papirakis's changes) I came across the following differences. I 
listed all of them here together to help module programmers, especially those
who learned from previous versions of this book and are most familiar with
the techniques I use, convert to the new version.

An additional resource for people who wish to convert to 2.2 is in 
{\tt http://www.atnf.csiro.au/\~\space rgooch/linux/docs/porting-to-2.2.html}.


\begin{enumerate}

\item{\bf asm/uaccess.h} If you need {\tt put\_user}
	or {\tt get\_user} you have to \#include it.
	\index{asm/uaccess.h}\index{uaccess.h\\asm}
	\index{get\_user}\index{put\_user}

\item{\bf get\_user} In version 2.2, {\tt get\_user} receives both the
	pointer into user memory and the variable in kernel memory to fill
	with the information. The reason for this is that {\tt get\_user} 
	can now read two or four bytes at a time if the variable we read
	is two or four bytes long.

\item{\bf file\_operations} This structure now has a flush function between
	the {\tt open} and {\tt close} functions. 
	\index{flush}\index{file\_operations\\structure}

\item{\bf close in file\_operations} In version 2.2, the close
	function returns an integer, so it's allowed to fail.
	\index{close}

\item{\bf read and write in file\_operations} The headers
	for these functions changed. They now return {\tt ssize\_t} instead
	of an integer, and their parameter list is different. The inode
	is no longer a parameter, and on the other hand the offset into
	the file is.
	\index{read}\index{write}\index{ssize\_t}

\item{\bf proc\_register\_dynamic} This function no longer exists. Instead,
	you call the regular {\tt proc\_register} and put zero in the inode
	field of the structure. 
	\index{proc\_register\_dynamic}\index{proc\_register}

\item{\bf Signals} The signals in the task structure are no longer a 32 bit
	integer, but an array of {\tt \_NSIG\_WORDS} integers.
	\index{signals}\index{\_NSIG\_WORDS}

\item{\bf queue\_task\_irq} Even if you want to schedule a task to happen
	from inside an interrupt handler, you use {\tt queue\_task}, not
	{\tt queue\_task\_irq}.
	\index{queue\_task\_irq}\index{queue\_task}\index{interrupts}
	\index{irqs}

\item{\bf Module Parameters} You no longer just declare module parameters
	as global variables. In 2.2 you have to also use {\tt MODULE\_PARM}
	to declare their type. This is a big improvement, because it allows
	the module to receive string parameters which start with a digit,
	for example, without getting confused.
	\index{Parameters\\Module}\index{Module Parameters}
	\index{MODULE\_PARM}

\item{\bf Symmetrical Multi--Processing} The kernel is no longer inside one
	huge spinlock, which means that kernel modules have to be aware of
	SMP.
	\index{SMP}\index{Symmetrical Multi--Processing}

\end{enumerate}



\chapter{Where From Here?}\label{where-to}

I could easily have squeezed a few more chapters into this book. I could
have added a chapter about creating new file systems, or about adding new
protocol stacks (as if there's a need for that; you'd have to dig 
underground to find a protocol stack not supported by Linux). I could have 
added explanations of the kernel mechanisms we haven't touched upon, such
as bootstrapping or the disk interface. 

However, I chose not to. My purpose in writing this book was to provide
initiation into the mysteries of kernel module programming and to teach
the common techniques for that purpose. For people seriously interested
in kernel programming, I recommend the list of kernel resources in
{\tt http://jungla.dit.upm.es/\~\space jmseyas/linux/kernel/hackers-docs.html}.
Also, as Linus said, the best way to learn the kernel is to read the 
source code yourself.

If you're interested in more examples of short kernel modules, I recommend
Phrack magazine. Even if you're not interested in security, and as a 
programmer you should be, the kernel modules there are good examples of 
what you can do inside the kernel, and they're short enough not to require
too much effort to understand.

I hope I have helped you in your quest to become a better programmer, or
at least to have fun through technology. And, if you do write useful kernel
modules, I hope you publish them under the GPL, so I can use them too.


\chapter{Goods and Services}\label{ads}

I hope nobody minds the shameless promotions here. They are all things which
are likely to be of use to beginning Linux Kernel Module programmers.
	
\section{Getting this Book in Print}\label{print-book}

The Coriolis group is going to print this book sometime in the summer of '99. 
If this is already summer, and you want this book in print, you can go easy
on your printer and buy it in a nice, bound form.

include(thankme.m4)

include(gpl.m4)


\addcontentsline{toc}{chapter}{Index}

\input{mpg.ind}

\end{document}





