[Coq-Club] strings.

Discussion:

[Coq-Club] strings.

David Holland

2018-11-06 10:15:48 UTC

As a few people may already be aware, I've been for some working on
and off (mostly off, alas) on a new string library for Coq, with the
long-term goal of getting it merged into the standard library.

The main design goals are basically to address the shortcomings in the
existing string library: it's not very complete, it doesn't extract
well, the character type is a hassle to work with, and code that
matches on string values rapidly becomes horrible.

I have a fair amount of code (and theorems) but I'm not really ready
to post any of it yet. At this stage what I'd like is input on the
following points:

1. It is fairly clear that the string code should be abstracted over
multiple possible character types, both to handle narrow vs. wide (and
multibyte) characters and also, optionally, to allow using the type
system to strongly distinguish strings and characters of different
encodings.

At this point, is this best done with modules or typeclasses? Both
seem to be feasible, but they produce quite different results. What is
(now) preferred for the standard library?

In the case of typeclasses there'd be one class for characters and
another for strings (and maybe some elaborations of each), and various
instances, but most of the material would be stated once, at the top
level, in terms of the string class. In the case of modules there'd be
module types for characters and strings, and modules for the various
kinds of characters and strings, but most of the material written in
terms of the abstractions would need to go in a functor to be applied
separately to each of the instances.

Either way, most likely some but not all the new material will be
available to be applied to the existing string type, but I've reached
the conclusion that worrying about this should not be a priority.

2. What should go into the character types to hold their identity?
The existing ascii type uses eight bools, which is horrible to work
with and should not be repeated. However, there aren't a lot of good
choices: it should be bounded (since all standard character sets are
finite) which rules out using the ordinary integer types directly; and
it needs to be a binary integer type (otherwise it is impossible to
write down the bound on Unicode code points without blowing the stack)
which rules out the Fin.t that comes with the vector library.

Reasonable choices seem to be either (a) wrap N (or Z, but N seems
more natural) with a proof of being in bounds, like Fin.t wraps nat;
or (b) use or develop some machine integer library that is not
currently in the standard library. What's preferable? There should,
ultimately, be machine integers in the standard library too, but
that's a whole additional project, and if it's going to be done at all
to support characters, it should really be done properly. (I'm not
entirely averse to this, but getting it done in a reasonable amount of
time is likely to be problematic.)

3. How does the existing character scope work, and what's involved in
generalizing it to work with multiple character types and/or multiple
character encodings? It seems to be compiler magic, but that's as deep
as I've gotten so far.

4. The support for recursive/list notations does not seem to be
powerful enough to allow defining a matching syntax for strings
similar to OCaml's. (That is, one where patterns in a string match are
string constants and the match takes place by string comparison.)
Since one of the problems with the current string library is the need
to write
match foo with
| String "f" (String "o" (String "o" EmptyString)) => ...
:
end
and improving this with notations doesn't seem to be feasible either,
it would be nice to have _some_ scheme in hand that allows writing
nicer-looking code, but I don't know what it should be.

I'm inclined to think that because we want to be able to extract to
OCaml strings, and OCaml strings aren't an inductive type (so doing
matches of this kind produces awful code), the recommended method
ought to be a series of if-comparisons; but because if is second-class
relative to match (in some respects) this is not entirely satisfactory
either.

One might be able to arrange something like
match string_match foo ["foo"; "bar"; "baz"] with
| 0 => Some foo
| 1 => Some bar
| 2 => Some baz
| S _ => None
end
or even just
string_match foo [
("foo", Some foo);
("bar", Some bar);
("baz", Some baz);
] None
but this isn't very nice without notation to support it and these
approaches both have their own problems.

Thoughts?

5. To avoid having weird things happen when extracting to OCaml
strings, and also on general principles, I think it's best to continue
to have a separate string type and not use lists of characters. It is
easier and less error-prone to fold strings and lists of characters
together when extracting to Haskell than to try to separate them on
the fly when extracting to OCaml.

So I think it's best to keep the structure of the current string type
(an inductive type with its own nil and cons), and just avoid exposing
it as much as possible to avoid extracting to awful OCaml code.

For a while I was trying to find a way to carry over list theorems
without having to reprove them, or find a way to prove one set of
theorems about an abstract nil/cons type that could be shared between
lists and strings, but eventually came to the conclusion that this
isn't possible, or is expensive enough to be not worthwhile.

Long-term feature request: it would be nice to have a way to rename
the constructors of an inductive when instantiating a module;
implicitly would be nice but explicitly would be good enough.
Currently you cannot do this and cannot e.g. write a module signature
that will match both list and (the existing) string. Similarly, it
would be nice to be able to reason about constructors in typeclasses;
currently you can write a typeclass that abstracts over list and (the
existing) string, but you need a pile of explicit lemmas to carry the
information encoded in the inductive structure of the type, and it's a
fairly substantial nuisance both to set up and work with.

6. ... there are other things to bring up sometime but this email's
long enough already.

--
David A. Holland
***@netbsd.org

Jason -Zhong Sheng- Hu

2018-11-06 13:46:05 UTC

Permalink

Hi David,

I am very interested in this problem, so I would like to give some input.

1. I personally prefer type class. I am not an expert of module system, but when I use it, I feel my hands are tightened. Please correct me if I am wrong: module seems terrible at overloading. Meanwhile, strings are heavily overloaded. For example, sometimes, one might need to handle ascii and unicode at the same time, so it would be cumbersome to have to say Ascii.char_to_nat and Unicode.char_to_nat, while the type information is often immediately available.

However, type class is very easy to work with. I'd imagine something like this:

Class CharSet (code : Set) : Set := { string_t : Set := _; (* blah blah *) };
Record Utf8 : Set := { (* blah blah *) };
Instance Utf8Set : CharSet Utf8 | 0 := { (* blah blah *) );
Record Ascii : Set := { (* blah blah *) };
Instance AsciiSet : CharSet Ascii | 2 := { (* blah blah *) );
Notation "'string_in' T" := (@string_t T _).
Notation string := (string_in Utf8).

I am not sure how modules can work similarly.

2. I am very concerned about how easy it is to compare equality. The problem with character sets is there are many kinds of induced equalities. This probably implies setoids are very suitable here. However, at minimum, we should be able to make propositional equality work with the exact character in the same character set. So to me, in Ascii, for example, Fin.t seems a reasonable option. For unicode, I am not sure. But unicode encodes in a complicated way, so even if a number is within the bound, it does not have to be a valid code. Utf-8, for example, seems to correspond to an inductive type with 4 cases.

I've heard rumors that strict proposition is about to be added to Coq, which supports definitional equality between propositions. I suppose it can be very helpful if we want to put complex quantifications over numbers to restrict the range inside of valid codes.

On the other hand, I suppose we shouldn't worry about machine implementations of characters; working on that probably requires some model of what machines are, and doing this naively is not going to serve well. Formalized extraction is also a big problem, so I don't think this should be addressed in a string library.

5. I frequently work with mutually inductively defined types, and among them, there are effectively list types. I can't use list directly, because Coq does not allow to recurse on two separately defined types. I was coming up with an idea a few weeks back, that we can actually abuse type theoretic properties to implement structural overloads: namely, if you defined two list types, list1 and list2, as long as they look close enough, you can define the theorems once, and use type classes to overload them and apply the same theorems to two different types. I attempted a prototype[*], but the theorem proving is very painful. I hope experts can give some input on this.

Sincerely Yours,

Jason Hu

[*] https://github.com/HuStmpHrrr/StructPoly/blob/master/src/ListLike.v The idea was to capture the type theoretic rules. But proving in this structure is like re-implementing Coq.
________________________________
From: coq-club-***@inria.fr <coq-club-***@inria.fr> on behalf of David Holland <dholland-***@netbsd.org>
Sent: November 6, 2018 5:15 AM
To: coq-***@inria.fr
Subject: [Coq-Club] strings.

As a few people may already be aware, I've been for some working on
and off (mostly off, alas) on a new string library for Coq, with the
long-term goal of getting it merged into the standard library.

The main design goals are basically to address the shortcomings in the
existing string library: it's not very complete, it doesn't extract
well, the character type is a hassle to work with, and code that
matches on string values rapidly becomes horrible.

I have a fair amount of code (and theorems) but I'm not really ready
to post any of it yet. At this stage what I'd like is input on the
following points:

1. It is fairly clear that the string code should be abstracted over
multiple possible character types, both to handle narrow vs. wide (and
multibyte) characters and also, optionally, to allow using the type
system to strongly distinguish strings and characters of different
encodings.

At this point, is this best done with modules or typeclasses? Both
seem to be feasible, but they produce quite different results. What is
(now) preferred for the standard library?

In the case of typeclasses there'd be one class for characters and
another for strings (and maybe some elaborations of each), and various
instances, but most of the material would be stated once, at the top
level, in terms of the string class. In the case of modules there'd be
module types for characters and strings, and modules for the various
kinds of characters and strings, but most of the material written in
terms of the abstractions would need to go in a functor to be applied
separately to each of the instances.

Either way, most likely some but not all the new material will be
available to be applied to the existing string type, but I've reached
the conclusion that worrying about this should not be a priority.

2. What should go into the character types to hold their identity?
The existing ascii type uses eight bools, which is horrible to work
with and should not be repeated. However, there aren't a lot of good
choices: it should be bounded (since all standard character sets are
finite) which rules out using the ordinary integer types directly; and
it needs to be a binary integer type (otherwise it is impossible to
write down the bound on Unicode code points without blowing the stack)
which rules out the Fin.t that comes with the vector library.

Reasonable choices seem to be either (a) wrap N (or Z, but N seems
more natural) with a proof of being in bounds, like Fin.t wraps nat;
or (b) use or develop some machine integer library that is not
currently in the standard library. What's preferable? There should,
ultimately, be machine integers in the standard library too, but
that's a whole additional project, and if it's going to be done at all
to support characters, it should really be done properly. (I'm not
entirely averse to this, but getting it done in a reasonable amount of
time is likely to be problematic.)

3. How does the existing character scope work, and what's involved in
generalizing it to work with multiple character types and/or multiple
character encodings? It seems to be compiler magic, but that's as deep
as I've gotten so far.

4. The support for recursive/list notations does not seem to be
powerful enough to allow defining a matching syntax for strings
similar to OCaml's. (That is, one where patterns in a string match are
string constants and the match takes place by string comparison.)
Since one of the problems with the current string library is the need
to write
match foo with
| String "f" (String "o" (String "o" EmptyString)) => ...
:
end
and improving this with notations doesn't seem to be feasible either,
it would be nice to have _some_ scheme in hand that allows writing
nicer-looking code, but I don't know what it should be.

I'm inclined to think that because we want to be able to extract to
OCaml strings, and OCaml strings aren't an inductive type (so doing
matches of this kind produces awful code), the recommended method
ought to be a series of if-comparisons; but because if is second-class
relative to match (in some respects) this is not entirely satisfactory
either.

One might be able to arrange something like
match string_match foo ["foo"; "bar"; "baz"] with
| 0 => Some foo
| 1 => Some bar
| 2 => Some baz
| S _ => None
end
or even just
string_match foo [
("foo", Some foo);
("bar", Some bar);
("baz", Some baz);
] None
but this isn't very nice without notation to support it and these
approaches both have their own problems.

Thoughts?

5. To avoid having weird things happen when extracting to OCaml
strings, and also on general principles, I think it's best to continue
to have a separate string type and not use lists of characters. It is
easier and less error-prone to fold strings and lists of characters
together when extracting to Haskell than to try to separate them on
the fly when extracting to OCaml.

So I think it's best to keep the structure of the current string type
(an inductive type with its own nil and cons), and just avoid exposing
it as much as possible to avoid extracting to awful OCaml code.

For a while I was trying to find a way to carry over list theorems
without having to reprove them, or find a way to prove one set of
theorems about an abstract nil/cons type that could be shared between
lists and strings, but eventually came to the conclusion that this
isn't possible, or is expensive enough to be not worthwhile.

Long-term feature request: it would be nice to have a way to rename
the constructors of an inductive when instantiating a module;
implicitly would be nice but explicitly would be good enough.
Currently you cannot do this and cannot e.g. write a module signature
that will match both list and (the existing) string. Similarly, it
would be nice to be able to reason about constructors in typeclasses;
currently you can write a typeclass that abstracts over list and (the
existing) string, but you need a pile of explicit lemmas to carry the
information encoded in the inductive structure of the type, and it's a
fairly substantial nuisance both to set up and work with.

6. ... there are other things to bring up sometime but this email's
long enough already.

--
David A. Holland
***@netbsd.org

Théo Zimmermann

2018-11-08 13:03:38 UTC

Permalink

Hello,

Since no one from the development team has replied to you yet (or maybe
by private mail only), let me reply to say how welcome such initiatives are.

You may or may not be aware of the existence of a project to design a
new standard library (without backward compatibility):
https://github.com/coq/stdlib2
Maxime DÃ©nÃšs is leading this project. It should be a perfect fit for
your new string library. Thus, the module vs type class question should
be answered in coordination with this effort. The current version of the
standard library uses modules for the most part but I believe the
current trend would rather be to use type classes (without abusing them,
which can lead to problems).

There has also been specific discussion about the future of the string
library: https://github.com/coq/coq/issues/8483
I suppose your perspective would be very welcome there as well.

I want to emphasize once more how welcome these initiatives are. I hope
(and I will try to ensure) that the stdlib2 project will be properly
managed to fully benefit from this kind of individual initiatives.

Thanks,
ThÃ©o

Post by David Holland
As a few people may already be aware, I've been for some working on
and off (mostly off, alas) on a new string library for Coq, with the
long-term goal of getting it merged into the standard library.
The main design goals are basically to address the shortcomings in the
existing string library: it's not very complete, it doesn't extract
well, the character type is a hassle to work with, and code that
matches on string values rapidly becomes horrible.
I have a fair amount of code (and theorems) but I'm not really ready
to post any of it yet. At this stage what I'd like is input on the
1. It is fairly clear that the string code should be abstracted over
multiple possible character types, both to handle narrow vs. wide (and
multibyte) characters and also, optionally, to allow using the type
system to strongly distinguish strings and characters of different
encodings.
At this point, is this best done with modules or typeclasses? Both
seem to be feasible, but they produce quite different results. What is
(now) preferred for the standard library?
In the case of typeclasses there'd be one class for characters and
another for strings (and maybe some elaborations of each), and various
instances, but most of the material would be stated once, at the top
level, in terms of the string class. In the case of modules there'd be
module types for characters and strings, and modules for the various
kinds of characters and strings, but most of the material written in
terms of the abstractions would need to go in a functor to be applied
separately to each of the instances.
Either way, most likely some but not all the new material will be
available to be applied to the existing string type, but I've reached
the conclusion that worrying about this should not be a priority.
2. What should go into the character types to hold their identity?
The existing ascii type uses eight bools, which is horrible to work
with and should not be repeated. However, there aren't a lot of good
choices: it should be bounded (since all standard character sets are
finite) which rules out using the ordinary integer types directly; and
it needs to be a binary integer type (otherwise it is impossible to
write down the bound on Unicode code points without blowing the stack)
which rules out the Fin.t that comes with the vector library.
Reasonable choices seem to be either (a) wrap N (or Z, but N seems
more natural) with a proof of being in bounds, like Fin.t wraps nat;
or (b) use or develop some machine integer library that is not
currently in the standard library. What's preferable? There should,
ultimately, be machine integers in the standard library too, but
that's a whole additional project, and if it's going to be done at all
to support characters, it should really be done properly. (I'm not
entirely averse to this, but getting it done in a reasonable amount of
time is likely to be problematic.)
3. How does the existing character scope work, and what's involved in
generalizing it to work with multiple character types and/or multiple
character encodings? It seems to be compiler magic, but that's as deep
as I've gotten so far.
4. The support for recursive/list notations does not seem to be
powerful enough to allow defining a matching syntax for strings
similar to OCaml's. (That is, one where patterns in a string match are
string constants and the match takes place by string comparison.)
Since one of the problems with the current string library is the need
to write
match foo with
| String "f" (String "o" (String "o" EmptyString)) => ...
end
and improving this with notations doesn't seem to be feasible either,
it would be nice to have _some_ scheme in hand that allows writing
nicer-looking code, but I don't know what it should be.
I'm inclined to think that because we want to be able to extract to
OCaml strings, and OCaml strings aren't an inductive type (so doing
matches of this kind produces awful code), the recommended method
ought to be a series of if-comparisons; but because if is second-class
relative to match (in some respects) this is not entirely satisfactory
either.
One might be able to arrange something like
match string_match foo ["foo"; "bar"; "baz"] with
| 0 => Some foo
| 1 => Some bar
| 2 => Some baz
| S _ => None
end
or even just
string_match foo [
("foo", Some foo);
("bar", Some bar);
("baz", Some baz);
] None
but this isn't very nice without notation to support it and these
approaches both have their own problems.
Thoughts?
5. To avoid having weird things happen when extracting to OCaml
strings, and also on general principles, I think it's best to continue
to have a separate string type and not use lists of characters. It is
easier and less error-prone to fold strings and lists of characters
together when extracting to Haskell than to try to separate them on
the fly when extracting to OCaml.
So I think it's best to keep the structure of the current string type
(an inductive type with its own nil and cons), and just avoid exposing
it as much as possible to avoid extracting to awful OCaml code.
For a while I was trying to find a way to carry over list theorems
without having to reprove them, or find a way to prove one set of
theorems about an abstract nil/cons type that could be shared between
lists and strings, but eventually came to the conclusion that this
isn't possible, or is expensive enough to be not worthwhile.
Long-term feature request: it would be nice to have a way to rename
the constructors of an inductive when instantiating a module;
implicitly would be nice but explicitly would be good enough.
Currently you cannot do this and cannot e.g. write a module signature
that will match both list and (the existing) string. Similarly, it
would be nice to be able to reason about constructors in typeclasses;
currently you can write a typeclass that abstracts over list and (the
existing) string, but you need a pile of explicit lemmas to carry the
information encoded in the inductive structure of the type, and it's a
fairly substantial nuisance both to set up and work with.
6. ... there are other things to bring up sometime but this email's
long enough already.

Daniel Schepler

2018-11-08 18:40:38 UTC

Permalink

Post by ThÃ©o Zimmermann
You may or may not be aware of the existence of a project to design a
https://github.com/coq/stdlib2

I took a look at the project; however, I found the current layout
confusing and I'm not sure where I would submit a "possible feature
discussion" request. Of course, maybe a bug tracker might not be the
best place for such discussions.

Anyway, something I've been thinking of for a while that would be
useful would be if the standard library (either as an extension to the
current one, or in the new one) had an integrated "decidable
proposition" typeclass, something like:

Class Decision (P : Prop) : Type :=
decide : {P} + {~P}.
Arguments decide P [Decision].

And then wherever a proposition should be decidable, the library would
declare an instance of Decision for it - with dependencies on
decidability for subpropositions as needed, of course.

It could also be nice to then add a "decide" tactic which simply tries
to find a decidability instance for the current goal and then evaluate
to either extract a proof, or give an error that the decision
procedure gave a false result.

I've also thought that this could be greatly enhanced by a
"combinatorics" type of functionality. So, for example, if it were
asked to decide "forall x y : bool, andb x (orb x y) = x" then it
would first search for an "exhaustive listing" instance for bool, then
use that to test all possible values of x and y against a decision
instance for @eq bool. And possibly, if it were asked to decide
something like "forall n : N, n < 100 -> P n" or "exists n : Z, -100 <
n /\ n < 100 /\ P n" then it could first search through the
proposition for an a priori bound on the values which need to be
tested.

Then, of course, decidability could feed back into combinatorics - for
example, it could do a brute force calculation of # { n : N | n < 100
/\ Prime n } using an a priori bound search and then a decision
instance for Prime to list out the elements of the sigma-type
internally and then count the length of the list. (Though this last
one would depend on whether a new standard library would use setoids
widely or not, in the way that e.g. MathClasses does...)

(I'm not insistent on using sumbool, though it tends to be my personal
preference. I would also be OK with
Class Decision (P : Prop) : Type :=
decide : bool.
Arguments decide P [Decision].
Class DecisionCorrect (P : Prop) `{!Decision P} : Prop :=
decision_correct : P <-> (decide P = true).
if it helps with efficiency of evaluating decision procedures, and/or
we want maximal abstraction of the proof resulting from the "decide"
tactic to "match (decision_correct P) with | conj _ H => H (eq_refl
true) end" as opposed to a proof constructed piece by piece along the
decidability call chain.)

--
Daniel Schepler