Commit messages are shown incorrectly

German Zhivotnikov

04 Oct, 2013 02:16 PM

I've imported our existing Mercurial repository into RhodeCode, and it turns out that RhodeCode shows commit messages as question marks (???????) if the original message is written in Russian. Is there a way to fix it? This is an instance of RhodeCode hosted at your site, https://1c09ccc79dd1.rhodecode.com; the repository name is "ngp". AFAIR hg stores commit messages in Unicode, and there are no problems with them on Linux, Windows, etc. Why does this problem occur in RhodeCode?

  1. Support Staff 1 Posted by Marcin Kuzminsk... on 04 Oct, 2013 02:21 PM

    Hi,

    Sorry about that. By default we use UTF-8 to decode the messages on the hosted instances. If your commit messages are not encoded that way, they can display incorrectly. Mercurial doesn't use Unicode; it uses bytestrings all over the place.

    Tell me which encoding your messages are stored in and I'll enable it on your instance.

  2. 2 Posted by German Zhivotni... on 04 Oct, 2013 02:29 PM

    Hello Marcin,

    Thank you for your reply. Well, that's a surprise for me: as for the string
    encoding of commit messages, we stick to the default HG options. I
    always thought Mercurial converts metadata, including commit messages, into
    UTF-8, no matter what locale is set on the user's workstation. That has always
    worked OK on Windows and on Linux; in any case, setting the Linux locale to,
    say, ru_RU.utf-8 and issuing "hg log" always gave correct results. You can
    check it yourself if you have access, or I could try to reproduce this issue
    on a smaller repo.

    With best regards
    German Zhivotnikov

  3. Support Staff 3 Posted by Marcin Kuzminsk... on 04 Oct, 2013 02:30 PM

    But RhodeCode itself actually uses Unicode, so we need to decode from X to Unicode; that's why we need the encoding.

    EDIT:

    If it displays ??? it's not utf-8

  4. 4 Posted by German Zhivotni... on 04 Oct, 2013 02:37 PM

    Hello Marcin,

    I'm not sure which of my statements you are disputing. I said I'm quite sure
    that commit messages are stored as Unicode in UTF-8 encoding, and that this is
    done by Mercurial internally, no matter what encoding is used on the
    client's workstation. I also said that this is confirmed by the fact that the
    same commit messages are displayed correctly on Windows and on Linux, where
    Cyrillic environments traditionally use different encodings. Again, if you
    have access to the repository, you can issue the 'hg log' command or use
    TortoiseHg to see that the commit messages are OK. Use UTF-8 encoding if you
    wish; they'll be OK in that encoding too.

    With best regards,
    German Zhivotnikov

  5. 5 Posted by German Zhivotni... on 04 Oct, 2013 02:41 PM

    Please note that I'm speaking of repository metadata, not tracked data.
    Tracked data is stored as is. Metadata is always converted to UTF-8. But
    RhodeCode shows it incorrectly, as far as I can see.

  6. Support Staff 6 Posted by Marcin Kuzminsk... on 04 Oct, 2013 02:45 PM

    From the Mercurial encoding strategy page:

    There are three types of string used in Mercurial:
    
    byte string in unknown encoding (tracked data)  
    byte string in local encoding (messages, user input)  
    byte string in UTF-8 encoding (repository metadata)
    

    We're talking here about "messages", right? They are stored in the local encoding, which means your encoding, so if we want to decode them to Unicode we need to know that encoding. AFAIK messages are not part of the metadata or the tracked data.

  7. 7 Posted by German Zhivotni... on 04 Oct, 2013 02:50 PM

    We are talking about repository metadata. Messages come to Mercurial in the
    local encoding. Then Mercurial converts them to UTF-8 for storage. They are
    always stored in UTF-8. See below.
    http://mercurial.selenic.com/wiki/EncodingStrategy

    UTF-8 strings are used to store most repository metadata. Unlike repository
    contents, repository metadata is 'owned and managed' by Mercurial and can
    be made to conform to its rules. In particular, this includes:

       - commit messages stored in the changelog
       - user names
       - tags
       - branches

  8. 8 Posted by German Zhivotni... on 04 Oct, 2013 03:00 PM

    Just to make clear what's written on that page: the local encoding is
    assumed by Mercurial when working with user input. That's why a "commit
    message" is in the local encoding, BUT it's in the local encoding only until
    it is stored in the repository. To store it, HG converts the message to UTF-8.
    Note that the "local encoding" depends on user preference. We don't have a
    single local encoding: some of our developers work on Windows with CP-1251,
    some on Linux with UTF-8, and some on Linux with KOI8-R. They all see commit
    messages correctly, and their messages are displayed to all users
    correctly: that's because there is only one internal encoding for metadata
    in Mercurial. But RhodeCode handles it wrong.

  9. Support Staff 9 Posted by Marcin Kuzminsk... on 04 Oct, 2013 03:07 PM

    I know of only one way to get a commit message from Mercurial: reading the changectx.description() output. If this is always stored as UTF-8 in Mercurial, RhodeCode should show it properly. AFAIK this is read from the metadata (I might be wrong here), and maybe there's another method of doing this. I'll try to figure out whether there's another way of reading it from Mercurial. Can you point me to a particular changeset whose commit message shows as ???

  10. 10 Posted by German Zhivotni... on 04 Oct, 2013 03:12 PM

    For example a8500606cf49.

  11. Support Staff 11 Posted by Marcin Kuzminsk... on 04 Oct, 2013 03:18 PM

    So that works fine from CLI,

    >>> print unicode(repo['a8500606cf49']._ctx.description(), 'utf-8'), unicode(repo['a8500606cf49']._ctx.description(), 'utf-8') == repo['a8500606cf49'].message
    <redacted>. True
    

    so now the question is why it's bad in the browser.

  12. 12 Posted by German Zhivotni... on 04 Oct, 2013 03:22 PM

    That's what I'm talking about: the encoding is OK in the repository, but it
    is displayed incorrectly. Could you please remove the commit text from
    this public thread?

  13. Support Staff 13 Posted by Marcin Kuzminsk... on 04 Oct, 2013 03:25 PM

    Yes, sorry about that. I edited the message 2s after posting it.

    Can you check again whether commit messages are still broken?

  14. 14 Posted by German Zhivotni... on 04 Oct, 2013 03:46 PM

    Thank you Marcin.

    No, alas commit messages are still displayed as question marks.

  15. Support Staff 15 Posted by Marcin Kuzminsk... on 04 Oct, 2013 03:54 PM

    I just tried one more thing on your instance. Can you check whether messages are displayed correctly now?

    If it's still broken, would it be possible for you to create a temporary account for me so I can log in to your instance myself?

    I made this a private discussion for now.

  16. 16 Posted by German Zhivotni... on 04 Oct, 2013 04:07 PM

    Thank you Marcin, now messages are displayed correctly! Thank you for your
    support!

    I have one more question regarding LDAP authentication: is it possible to
    do this in a hosted environment, to authenticate against our LDAP server? If
    yes, is there a way to get logs from the hosted instance? I cannot see
    what's wrong with my LDAP settings, and I cannot even see RhodeCode
    attempting to connect to our LDAP server. (Or maybe I should start another
    discussion thread...)

  17. Support Staff 17 Posted by Marcin Kuzminsk... on 04 Oct, 2013 04:10 PM

    Perfect! Thanks also for posting this. I actually found a misconfiguration on our hosting system, and with your feedback I'm able to apply this fix to all other instances. I'll make this thread public again and close it. Please open a new ticket and I will provide you with all the info needed from the LDAP logs.

  18. Marcin Kuzminski closed this discussion on 04 Oct, 2013 04:10 PM.

  19. German Zhivotnikov re-opened this discussion on 04 Oct, 2013 04:12 PM

  20. 18 Posted by German Zhivotni... on 04 Oct, 2013 04:12 PM

    OK. Thank you again!

  21. 19 Posted by eoranged on 08 Oct, 2013 05:08 PM

    Is it fixed in the latest release? I'm testing it on my own server and still have the same problem.

  22. Support Staff 20 Posted by Marcin Kuzminsk... on 08 Oct, 2013 05:14 PM

    The problem is in the configuration: the LANG environment variable is
    missing. Do you have issues with a hosted or a standalone server? The fixes
    I made are in the supervisord/init scripts.

  23. 21 Posted by eoranged on 08 Oct, 2013 08:32 PM

    Thanks. Adding

        env LANG=ru_RU.UTF8 # en_US.UTF8 will help too
        export LANG

    to upstart config solved the issue.

  24. Support Staff 22 Posted by Marcin Kuzminsk... on 08 Oct, 2013 10:47 PM

    Great, I'm closing this one.

  25. Marcin Kuzminski closed this discussion on 08 Oct, 2013 10:47 PM.
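
A minimal sketch of why a missing LANG would plausibly produce question marks, written in Python 3 rather than the Python 2 used in the session above. It assumes that a process started without LANG falls back to ASCII as its local encoding and that characters ASCII cannot represent are replaced rather than rejected. The helper `to_local` and the sample message are hypothetical illustrations, not Mercurial or RhodeCode APIs.

```python
# Hypothetical helper: transcode a commit message stored as UTF-8 bytes
# into a "local" encoding, replacing characters that encoding cannot hold.
def to_local(utf8_bytes: bytes, local_encoding: str) -> bytes:
    text = utf8_bytes.decode("utf-8")
    return text.encode(local_encoding, errors="replace")


# Sample six-letter Russian word, written with escapes so the source stays ASCII.
message = "\u041f\u0440\u0438\u0432\u0435\u0442"
stored = message.encode("utf-8")  # what the repository is assumed to hold

# Encodings named in the thread all round-trip the stored message losslessly.
for enc in ("utf-8", "cp1251", "koi8_r"):
    assert to_local(stored, enc).decode(enc) == message

# An ASCII fallback (assumed when LANG is unset) loses every Cyrillic letter.
print(to_local(stored, "ascii"))  # b'??????'
```

Under these assumptions the stored data is intact in every case; only the final conversion step depends on the environment, which matches the fix of exporting a UTF-8 LANG in the service's startup configuration.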
