David Holland
2018-11-06 10:15:48 UTC
As a few people may already be aware, I've been for some working on
and off (mostly off, alas) on a new string library for Coq, with the
long-term goal of getting it merged into the standard library.
The main design goals are basically to address the shortcomings in the
existing string library: it's not very complete, it doesn't extract
well, the character type is a hassle to work with, and code that
matches on string values rapidly becomes horrible.
I have a fair amount of code (and theorems) but I'm not really ready
to post any of it yet. At this stage what I'd like is input on the
following points:
1. It is fairly clear that the string code should be abstracted over
multiple possible character types, both to handle narrow vs. wide (and
multibyte) characters and also, optionally, to allow using the type
system to strongly distinguish strings and characters of different
encodings.
At this point, is this best done with modules or typeclasses? Both
seem to be feasible, but they produce quite different results. What is
(now) preferred for the standard library?
In the case of typeclasses there'd be one class for characters and
another for strings (and maybe some elaborations of each), and various
instances, but most of the material would be stated once, at the top
level, in terms of the string class. In the case of modules there'd be
module types for characters and strings, and modules for the various
kinds of characters and strings, but most of the material written in
terms of the abstractions would need to go in a functor to be applied
separately to each of the instances.
Either way, most likely some but not all the new material will be
available to be applied to the existing string type, but I've reached
the conclusion that worrying about this should not be a priority.
2. What should go into the character types to hold their identity?
The existing ascii type uses eight bools, which is horrible to work
with and should not be repeated. However, there aren't a lot of good
choices: it should be bounded (since all standard character sets are
finite) which rules out using the ordinary integer types directly; and
it needs to be a binary integer type (otherwise it is impossible to
write down the bound on Unicode code points without blowing the stack)
which rules out the Fin.t that comes with the vector library.
Reasonable choices seem to be either (a) wrap N (or Z, but N seems
more natural) with a proof of being in bounds, like Fin.t wraps nat;
or (b) use or develop some machine integer library that is not
currently in the standard library. What's preferable? There should,
ultimately, be machine integers in the standard library too, but
that's a whole additional project, and if it's going to be done at all
to support characters, it should really be done properly. (I'm not
entirely averse to this, but getting it done in a reasonable amount of
time is likely to be problematic.)
3. How does the existing character scope work, and what's involved in
generalizing it to work with multiple character types and/or multiple
character encodings? It seems to be compiler magic, but that's as deep
as I've gotten so far.
4. The support for recursive/list notations does not seem to be
powerful enough to allow defining a matching syntax for strings
similar to OCaml's. (That is, one where patterns in a string match are
string constants and the match takes place by string comparison.)
Since one of the problems with the current string library is the need
to write
match foo with
| String "f" (String "o" (String "o" EmptyString)) => ...
:
end
and improving this with notations doesn't seem to be feasible either,
it would be nice to have _some_ scheme in hand that allows writing
nicer-looking code, but I don't know what it should be.
I'm inclined to think that because we want to be able to extract to
OCaml strings, and OCaml strings aren't an inductive type (so doing
matches of this kind produces awful code), the recommended method
ought to be a series of if-comparisons; but because if is second-class
relative to match (in some respects) this is not entirely satisfactory
either.
One might be able to arrange something like
match string_match foo ["foo"; "bar"; "baz"] with
| 0 => Some foo
| 1 => Some bar
| 2 => Some baz
| S _ => None
end
or even just
string_match foo [
("foo", Some foo);
("bar", Some bar);
("baz", Some baz);
] None
but this isn't very nice without notation to support it and these
approaches both have their own problems.
Thoughts?
5. To avoid having weird things happen when extracting to OCaml
strings, and also on general principles, I think it's best to continue
to have a separate string type and not use lists of characters. It is
easier and less error-prone to fold strings and lists of characters
together when extracting to Haskell than to try to separate them on
the fly when extracting to OCaml.
So I think it's best to keep the structure of the current string type
(an inductive type with its own nil and cons), and just avoid exposing
it as much as possible to avoid extracting to awful OCaml code.
For a while I was trying to find a way to carry over list theorems
without having to reprove them, or find a way to prove one set of
theorems about an abstract nil/cons type that could be shared between
lists and strings, but eventually came to the conclusion that this
isn't possible, or is expensive enough to be not worthwhile.
Long-term feature request: it would be nice to have a way to rename
the constructors of an inductive when instantiating a module;
implicitly would be nice but explicitly would be good enough.
Currently you cannot do this and cannot e.g. write a module signature
that will match both list and (the existing) string. Similarly, it
would be nice to be able to reason about constructors in typeclasses;
currently you can write a typeclass that abstracts over list and (the
existing) string, but you need a pile of explicit lemmas to carry the
information encoded in the inductive structure of the type, and it's a
fairly substantial nuisance both to set up and work with.
6. ... there are other things to bring up sometime but this email's
long enough already.
and off (mostly off, alas) on a new string library for Coq, with the
long-term goal of getting it merged into the standard library.
The main design goals are basically to address the shortcomings in the
existing string library: it's not very complete, it doesn't extract
well, the character type is a hassle to work with, and code that
matches on string values rapidly becomes horrible.
I have a fair amount of code (and theorems) but I'm not really ready
to post any of it yet. At this stage what I'd like is input on the
following points:
1. It is fairly clear that the string code should be abstracted over
multiple possible character types, both to handle narrow vs. wide (and
multibyte) characters and also, optionally, to allow using the type
system to strongly distinguish strings and characters of different
encodings.
At this point, is this best done with modules or typeclasses? Both
seem to be feasible, but they produce quite different results. What is
(now) preferred for the standard library?
In the case of typeclasses there'd be one class for characters and
another for strings (and maybe some elaborations of each), and various
instances, but most of the material would be stated once, at the top
level, in terms of the string class. In the case of modules there'd be
module types for characters and strings, and modules for the various
kinds of characters and strings, but most of the material written in
terms of the abstractions would need to go in a functor to be applied
separately to each of the instances.
Either way, most likely some but not all the new material will be
available to be applied to the existing string type, but I've reached
the conclusion that worrying about this should not be a priority.
2. What should go into the character types to hold their identity?
The existing ascii type uses eight bools, which is horrible to work
with and should not be repeated. However, there aren't a lot of good
choices: it should be bounded (since all standard character sets are
finite) which rules out using the ordinary integer types directly; and
it needs to be a binary integer type (otherwise it is impossible to
write down the bound on Unicode code points without blowing the stack)
which rules out the Fin.t that comes with the vector library.
Reasonable choices seem to be either (a) wrap N (or Z, but N seems
more natural) with a proof of being in bounds, like Fin.t wraps nat;
or (b) use or develop some machine integer library that is not
currently in the standard library. What's preferable? There should,
ultimately, be machine integers in the standard library too, but
that's a whole additional project, and if it's going to be done at all
to support characters, it should really be done properly. (I'm not
entirely averse to this, but getting it done in a reasonable amount of
time is likely to be problematic.)
3. How does the existing character scope work, and what's involved in
generalizing it to work with multiple character types and/or multiple
character encodings? It seems to be compiler magic, but that's as deep
as I've gotten so far.
4. The support for recursive/list notations does not seem to be
powerful enough to allow defining a matching syntax for strings
similar to OCaml's. (That is, one where patterns in a string match are
string constants and the match takes place by string comparison.)
Since one of the problems with the current string library is the need
to write
match foo with
| String "f" (String "o" (String "o" EmptyString)) => ...
:
end
and improving this with notations doesn't seem to be feasible either,
it would be nice to have _some_ scheme in hand that allows writing
nicer-looking code, but I don't know what it should be.
I'm inclined to think that because we want to be able to extract to
OCaml strings, and OCaml strings aren't an inductive type (so doing
matches of this kind produces awful code), the recommended method
ought to be a series of if-comparisons; but because if is second-class
relative to match (in some respects) this is not entirely satisfactory
either.
One might be able to arrange something like
match string_match foo ["foo"; "bar"; "baz"] with
| 0 => Some foo
| 1 => Some bar
| 2 => Some baz
| S _ => None
end
or even just
string_match foo [
("foo", Some foo);
("bar", Some bar);
("baz", Some baz);
] None
but this isn't very nice without notation to support it and these
approaches both have their own problems.
Thoughts?
5. To avoid having weird things happen when extracting to OCaml
strings, and also on general principles, I think it's best to continue
to have a separate string type and not use lists of characters. It is
easier and less error-prone to fold strings and lists of characters
together when extracting to Haskell than to try to separate them on
the fly when extracting to OCaml.
So I think it's best to keep the structure of the current string type
(an inductive type with its own nil and cons), and just avoid exposing
it as much as possible to avoid extracting to awful OCaml code.
For a while I was trying to find a way to carry over list theorems
without having to reprove them, or find a way to prove one set of
theorems about an abstract nil/cons type that could be shared between
lists and strings, but eventually came to the conclusion that this
isn't possible, or is expensive enough to be not worthwhile.
Long-term feature request: it would be nice to have a way to rename
the constructors of an inductive when instantiating a module;
implicitly would be nice but explicitly would be good enough.
Currently you cannot do this and cannot e.g. write a module signature
that will match both list and (the existing) string. Similarly, it
would be nice to be able to reason about constructors in typeclasses;
currently you can write a typeclass that abstracts over list and (the
existing) string, but you need a pile of explicit lemmas to carry the
information encoded in the inductive structure of the type, and it's a
fairly substantial nuisance both to set up and work with.
6. ... there are other things to bring up sometime but this email's
long enough already.
--
David A. Holland
***@netbsd.org
David A. Holland
***@netbsd.org