Richard Heathfield <rjh@cpax.org.uk> writes:
This was a perennial comp.lang.c topic back in the day.
But what about writing a scanner in languages with automatic
memory management where reading a whole line is very simple
and assuming an input language that limits line length to
some reasonable value, say, 1,000,000 characters?
In such a language, would there still be reasons not to
read the whole line into memory, but to read it char-by-char
as traditional scanners do?
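If the input language really does cap the line length, the cap can be enforced while reading, so an over-long line is rejected before it is fully loaded. A minimal sketch, assuming a text-mode stream; the name `read_limited_line` and the choice to count the newline toward the limit are illustrative assumptions, not something specified in the thread:

```python
import io

MAX_LINE = 1_000_000  # the limit suggested above; here it includes the newline


def read_limited_line(stream, limit=MAX_LINE):
    """Return the next line, or raise ValueError if it exceeds `limit`.

    Hypothetical helper: returns "" at end of input, like readline().
    """
    # Ask for one character more than allowed, so an over-long line is
    # detected without reading the rest of it into memory.
    line = stream.readline(limit + 1)
    if len(line) > limit:
        raise ValueError(f"line longer than {limit} characters")
    return line


if __name__ == "__main__":
    src = io.StringIO("short line\nanother\n")
    print(repr(read_limited_line(src, limit=20)))
```

With such a guard in place, memory use per line is bounded by the limit, which removes the main argument against reading whole lines in this setting.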
ram@zedat.fu-berlin.de (Stefan Ram) writes:
Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.
Richard said that it matters what I need this for.
I'd like to implement a tiny markup language
Okay, BIG job with lots of complicated parts, so strive to keep
each part relatively simple if you ever hope to get it working.
Some idle thoughts about scanning (lexical analysis, or
rather what comes before it) ...
Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.
It is a function "get_next_token" that on each call will
return the next character from a file to its client (caller),
except that spaces at the end of a line will be skipped.
So we read the line and strip the spaces. (One line in
Python.)
But how do I know in advance if the line will fit into
memory?
In the next iteration, I want to extend this to a sequence
of paragraphs. Still without any real markup.
ram@zedat.fu-berlin.de (Stefan Ram) writes:[..]
The output often ends with one space, because a '\n' is
added to the end of the input if it's missing, and this
then is being converted to a space. So, ironically, while
I set out to strip spaces at the end of lines, I now
sometimes add them to the end of lines!
The output follows below. Most tests pass, but there is
still one error. (The error is: When the input is a sequence
of blanks, it produces [par], but should produce nothing.)
ram@zedat.fu-berlin.de (Stefan Ram) writes:
Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.
Richard said that it matters what I need this for.
I'd like to implement a tiny markup language similar
to languages like "Markdown" or "reStructuredText".
It should ignore spaces at the end of lines.
I'm going to implement it in Python.
On 19/01/2023 12:10 pm, Stefan Ram wrote:
Some idle thoughts about scanning (lexical analysis, or
rather what comes before it) ...
Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.
It is a function "get_next_token" that on each call will
return the next character from a file to its client (caller),
except that spaces at the end of a line will be skipped.
So we read the line and strip the spaces. (One line in
Python.)
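For illustration, the line-based variant might look like the sketch below. It assumes that every line fits in memory, that only the space character (not tabs) counts as a blank, and that `get_next_token` is written as a generator; none of these details is fixed by the post.

```python
import io


def get_next_token(stream):
    """Yield every character of `stream`, dropping spaces at line ends.

    Assumption: each line is small enough to be read as a whole.
    """
    for line in stream:
        # Separate the line terminator (if present) from the content.
        if line.endswith("\n"):
            body, eol = line[:-1], "\n"
        else:
            body, eol = line, ""
        # rstrip(" ") removes trailing space characters only.
        yield from body.rstrip(" ") + eol


if __name__ == "__main__":
    text = io.StringIO("a b  \n  c \nd ")
    print(repr("".join(get_next_token(text))))
```

Leading and interior spaces are passed through unchanged; only the run of spaces directly before the newline (or before the end of input) is dropped.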
But how do I know in advance if the line will fit into
memory?
Perhaps because of such fears, traditional scanners¹ do not
read lines or, Heaven forbid, files, but only characters!
They do not use random access with respect to the text to be
scanned, but sequential access, although things would be
easier with random access.
So how would you do it with this style of programming (never
reading the whole line into memory)?
"I read a character. If it's a space, I peek at the next
character, if that's a space, I start adding spaces to my
look-ahead buffer. If an EOL is encountered, the look-ahead
buffer is discarded. Otherwise, I have to start feeding my
client from the lookahead buffer until the lookahead buffer
is empty."
If I am concerned that a line will not fit in memory, how do
I know that the sequence of spaces at the end of a line will
fit in memory (the look-ahead buffer)? The look-ahead buffer
could be replaced by a counter. If you are paranoid, you
would use a 64-bit counter and check it for overflow!
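The counter idea can be sketched in Python as follows. This is only an illustration under assumptions: the name `strip_trailing_spaces` is made up, the stream is read one character at a time with `read(1)`, and spaces before the end of input are treated like spaces before an EOL. Python integers do not overflow, so the overflow check mentioned above is not needed in this language.

```python
import io


def strip_trailing_spaces(stream):
    """Yield characters one at a time, dropping spaces that directly
    precede a newline or the end of input."""
    pending = 0  # number of spaces seen but not yet passed on
    while True:
        ch = stream.read(1)  # sequential access: one character per read
        if ch == "":
            return  # end of input: the pending spaces were trailing
        if ch == " ":
            pending += 1  # defer the decision until a non-space arrives
        elif ch == "\n":
            pending = 0  # the spaces ended the line: discard them
            yield ch
        else:
            # The spaces were interior: replay them, then the character.
            for _ in range(pending):
                yield " "
            pending = 0
            yield ch


if __name__ == "__main__":
    text = io.StringIO("a b  \n  c \nd ")
    print(repr("".join(strip_trailing_spaces(text))))
```

The only state kept between characters is the integer `pending`, so memory use does not depend on the line length or on the length of the run of spaces.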
Is it worth the effort with a look-ahead buffer and
sequential access? Should you just read a line, assuming
that a line will always fit into memory, and strip the
blanks the easy way, i.e., using random access? TIA for any
comments!
¹ An example of a traditional scanner:
It only ever calls "GetCh", never "GetLine". The code could
be easier to write by reading a whole line and then just
using functions that can look at that line using random
access to get the next symbol (maybe using regular
expressions). But a traditional scanner carefully only ever
reads a single character and manages a state.
PROCEDURE GetSym;
  VAR i : CARDINAL;
BEGIN
  WHILE ch <= ' ' DO GetCh END;
  IF ch = '/' THEN
    SkipLine;
    WHILE ch <= ' ' DO GetCh END
  END;
  IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
    i := 0;
    sym := literal;
    REPEAT
      IF i < IdLength THEN
        id [i] := ch;
        INC (i)
      END;
      IF ch > 'Z' THEN sym := ident END;
      GetCh
    ...
man 3 realloc
This was a perennial comp.lang.c topic back in the day.
My interface looked (and still looks) like this:
#define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
#define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
#define FGDATA_REDUCE 1
int fgetline(char **line, size_t *size, size_t maxrecsize,
             FILE *fp, unsigned int flags, size_t *plen);
It's easier to use than it might look:
char *data = NULL; /* where will the data go? NULL is fine */
size_t size = 0; /* how much space do we have right now? */
size_t len = 0; /* after call, holds line length */
while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
{
if(len > 0)
If you want fgetline.c and don't have 20 years of clc archives,
just yell.
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within