What Unicode normalization (and other processing) is appropriate for passwords when hashing?

Automated disclaimer: This post was written more than 10 years ago and I may not have looked at it since.

Older posts may not align with who I am today and how I would think or write, and may have been written in reaction to a cultural context that no longer applies. Some of my high school or college posts are just embarrassing. However, I have left them public because I believe in keeping old web pages alive—and it's interesting to see how I've changed.

If I accept full Unicode for passwords, how should I normalize the string before hashing it?

Further questions

  • I am not concerned about homoglyphs. A separate problem is alleged UTF-8 text that contains illegal byte sequences; illegal bytes should be kept out of the text.

    However, semantics and round-tripping are not of concern here.

    The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:

    11.1 Stability of Normalized Forms

    A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.

    Recommendation #1: If possible, reject inputs that contain unassigned code points.
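A rough sketch of the unassigned-code-point check, assuming Python and its bundled `unicodedata` module. The helper name `has_unassigned` is made up for illustration, and "unassigned" here means general category `Cn` in whatever Unicode version the running interpreter ships, which may be older than the version the client used.

```python
import unicodedata

def has_unassigned(text: str) -> bool:
    """Return True if any code point has general category Cn (unassigned)
    according to the Unicode database bundled with this Python build."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)

# The Unicode version this check is based on, e.g. for logging.
UNICODE_VERSION = unicodedata.unidata_version
```

A server might call `has_unassigned(password)` and reject the input when it returns `True`. The trade-off is that a password using characters newer than the server's Unicode tables would be refused until the server is upgraded.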

Further questions

  • What happens if the server receives a byte sequence that is not valid UTF-8 (or other expected encoding)? Reject it, since the string should be well-formed before it is passed to normalization and hashing. (This may be out of the application's control, however.)

    Recommendation #3: Apply NFKC or NFKD before hashing.

    Followup

    alextgordon responded and recommended NFKD.
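The decode, normalize, and hash steps discussed above could be strung together roughly as follows. This is only a sketch in Python: the function names `prepare_password` and `hash_password` are hypothetical, PBKDF2 with SHA-256 is just a stand-in for whatever password hash is actually in use, and the iteration count is an arbitrary placeholder.

```python
import hashlib
import unicodedata

def prepare_password(raw: bytes, form: str = "NFKC") -> bytes:
    """Decode strictly, normalize, and re-encode a password for hashing.

    Raises UnicodeDecodeError if `raw` is not well-formed UTF-8.
    `form` may be "NFKC" or "NFKD"; either maps equivalent strings
    to the same byte sequence.
    """
    text = raw.decode("utf-8", errors="strict")
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("utf-8")

def hash_password(raw: bytes, salt: bytes, iterations: int = 600_000) -> bytes:
    """Hash the normalized password (PBKDF2 is an assumption, not a recommendation)."""
    return hashlib.pbkdf2_hmac("sha256", prepare_password(raw), salt, iterations)
```

With this, a precomposed "é" (U+00E9) and an "e" followed by a combining acute accent (U+0301) yield the same hash, while a byte string such as `b"\xff\xfe"` is rejected at the decoding step. Whichever form is chosen has to be applied identically at registration and at login.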
