Posts Tagged Chinese

Use the OCR function of Google Docs to process images or PDF files

I was recently looking for good OCR software to process scanned Chinese documents, and I found that Google Docs provides this function online. The link to the Google Docs OCR function is https://docs.google.com/DocAction?action=updoc. After you log into your Google account and click “Documents”, you will see the Google Docs menu on the left-hand side of the screen. There are two buttons at the top of the menu: CREATE and UPLOAD (the button with a hard drive icon); see the following screenshot for details.

Click the upload button and a file upload dialog will show up. First select the PDF or image files you want to upload; the following Upload settings window will then pop up. To use the OCR function, check “Convert text from PDF and image files to Google documents” and select the language in the “Document language” dropdown list. In my case I selected “Chinese (Simplified)” as the target language. Finally, click the “Start upload” button. Google will do the rest.

Once Google completes the processing, a new Google document is created. Open it and you will find your original images or PDF pages with the OCRed text below them. I tested one scanned image page, and everything was recognized correctly.

The Google Docs OCR function is great. We can use it to do a lot of things without worrying about installing software on our computers.


Save and retrieve UTF-8 characters in a MySQL database directly

The new version of MySQL Workbench (V5.2.31) solved the UTF-8 character problem: I can now directly insert, query, and edit UTF-8 characters. This is the first time I have used the interface to handle Chinese characters in UTF-8 directly. In the past, I did not know what encoding was used when I loaded UTF-8 characters into the MySQL database, even though I could use PHP to get the characters back and display them with no problem. I am still confused about that, but now MySQL Workbench lets me handle UTF-8 characters directly.
However, my old PHP code then returned a string of question marks (“??????”) for data retrieved from the database. I googled, found a nice solution, and incorporated it into a useful PHP class that handles UTF-8 easily.
The idea is the following. Every PHP page should have the following meta tag:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

This tells the Web browser to display the page content in UTF-8 encoding and to submit form input text in UTF-8 encoding.
When inserting data from forms into MySQL tables, we should set two session variables:

character_set_client=utf8
character_set_connection=utf8

This tells the MySQL server that the SQL statement is encoded as UTF-8 and should be kept as UTF-8 while it is executed.
On the other hand, when retrieving text data from MySQL, we need to set one session variable:

character_set_results=utf8

This tells the MySQL server that the result set must be sent back in UTF-8 encoding.
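
One plausible way to see where the question marks come from is sketched below in JavaScript. It assumes, hypothetically, that the results were being converted to a single-byte character set such as latin1 before the fix; the post does not say which character set was actually in effect, and toLatin1WithReplacement is an illustrative helper, not a MySQL or PHP function. Every character the target character set cannot represent is replaced with one question mark.

```javascript
// Illustrative only: mimics converting text to a single-byte character set
// (latin1 is assumed here), where unrepresentable characters become "?".
function toLatin1WithReplacement(text) {
  // Array.from iterates by code point, so each character is handled once.
  return Array.from(text, (ch) => (ch.codePointAt(0) <= 0xff ? ch : "?")).join("");
}

const stored = "中文测试数据"; // six Chinese characters
console.log(toLatin1WithReplacement(stored)); // prints "??????"
console.log(toLatin1WithReplacement("abc"));  // prints "abc"
```

Six Chinese characters come back as six question marks, which matches the symptom described above. Asking for the result set in UTF-8 avoids the lossy conversion.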

Here is my PHP class for inserting data into and retrieving data from a MySQL database. It is a generic class; you can modify it and use it in your PHP application.
<?php

/*
Author: Zhanshan Dong at Sunfinedata.com
@version $Id: sql.php 2010-05-18 $
*/

class SQL
{
    private $_config;
    private $_mysqli;

    function __construct($config)
    {
        $this->_config = $config;
        $this->_mysqli = @new mysqli($this->_config['DB_HOST'],
                                     $this->_config['DB_USER'],
                                     $this->_config['DB_PASSWORD'],
                                     $this->_config['DB_NAME']);

        // Handle characters in UTF-8 encoding correctly.

        // When inserting UTF-8 input data into the MySQL database, set two session
        // variables: character_set_client=utf8 and character_set_connection=utf8.
        // This tells the MySQL server that the SQL statement is encoded as UTF-8
        // and should be kept as UTF-8 while it is executed.
        $this->_mysqli->query("SET character_set_client=utf8");
        $this->_mysqli->query("SET character_set_connection=utf8");

        // When retrieving text data from MySQL, set one session variable:
        // character_set_results=utf8. This tells the MySQL server that the
        // result set must be sent back in UTF-8 encoding.
        $this->_mysqli->query("SET character_set_results=utf8");
    }

    function __destruct()
    {
        $this->_mysqli->close();
    }

    // Run a SQL statement and return whatever mysqli::query() returns.
    function execute($sql)
    {
        return $this->_mysqli->query($sql);
    }

    // Number of rows affected by the last statement.
    function affected_rows()
    {
        return $this->_mysqli->affected_rows;
    }

    // Escape a string for safe use inside a SQL statement.
    function escape_str($str)
    {
        return $this->_mysqli->real_escape_string($str);
    }

    // Auto-generated ID of the last inserted row.
    function getInsertID()
    {
        return mysqli_insert_id($this->_mysqli);
    }

}

?>


JavaScript handling UTF8 characters

When I developed a PHP web application with AJAX, I encountered a problem: JavaScript did not process UTF-8 characters correctly. Whenever a user entered a UTF-8 string, it was converted to ASCII in JavaScript before being delivered to the PHP script. The site uses UTF-8 throughout, so many requests are sure to contain UTF-8 characters. I googled “Javascript and utf-8” and found an article at http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html. It provides two functions for this purpose; the one I use is decode_utf8. I simply copied the two functions, but the second one did not work, so I changed its inner function from escape to unescape. After the change, I call the function decode_utf8(s).

Here are the two functions:
function encode_utf8( s )
{
    return unescape( encodeURIComponent( s ) );
}

function decode_utf8( s )
{
    return decodeURIComponent( unescape( s ) );
}
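
To make it easier to see what this kind of function pair does, here is a minimal sketch built on the same browser functions. The names toUtf8ByteString and fromUtf8ByteString are hypothetical, and the decoder here uses escape, the inverse of the unescape step in the encoder, so it differs from the decode_utf8 listed above. It only illustrates the round trip and is not a replacement for the functions in this post.

```javascript
// escape/unescape are deprecated but still available in browsers and Node.js.
function toUtf8ByteString(s) {
  // encodeURIComponent writes each UTF-8 byte of a non-ASCII character as %XX;
  // unescape turns each %XX into one character whose code is that byte value.
  return unescape(encodeURIComponent(s));
}

function fromUtf8ByteString(bytes) {
  // escape turns each byte-valued character back into %XX, and
  // decodeURIComponent reads the %XX sequence as UTF-8 again.
  return decodeURIComponent(escape(bytes));
}

const byteString = toUtf8ByteString("中");
const codes = Array.from(byteString, (ch) => ch.charCodeAt(0).toString(16));
console.log(codes.join(" "));                // prints "e4 b8 ad"
console.log(fromUtf8ByteString(byteString)); // prints "中"
```

One Chinese character becomes a string of three byte-valued characters, and the decoder restores the original character from them.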

A JavaScript code snippet to call the function:

function search(php)
{
    var keyword = window.document.forms["searchform"].elements["keyword"].value;
    var url = php + "?cmd=search" +
              "&keyword=" + decode_utf8(keyword);
    makeGETrequest2(url, link_callback);
}
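
An alternative worth considering, which is not what the snippet above does, is to percent-encode the keyword with the standard encodeURIComponent function before appending it to the URL. The sketch below assumes a hypothetical helper named buildSearchUrl and uses search.php as a placeholder script name. PHP decodes percent-encoded query values when it fills $_GET, so the script should receive the keyword as UTF-8 bytes.

```javascript
// Hypothetical helper: builds the search URL with the keyword percent-encoded
// as UTF-8 by the standard encodeURIComponent function.
function buildSearchUrl(php, keyword) {
  return php + "?cmd=search" + "&keyword=" + encodeURIComponent(keyword);
}

console.log(buildSearchUrl("search.php", "中文"));
// prints "search.php?cmd=search&keyword=%E4%B8%AD%E6%96%87"
```

Encoding the value also keeps characters such as & and = inside the keyword from being read as part of the query string structure.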


A good website for text translation between English and Chinese

Recently there is a need to translate some English documents to Chinese. Since the documents are long and also with some unfamiliar words, I need to look up dictionary to translate the documents. It is tedious and time-consuming. One of my friends sent me a link (http://us.mdbg.net/chindict/chindict.php?page=translate) that point to a online translation. After I evaluates the translation back and forth, I found that it does a good job. I decided to use the site to do the initial translation and then modify it to make the translation final. The following is the Chinese translation of this paragraph.

最近有需要翻译一些英文文件中。由于文件很长,也存在一些不熟悉的话,我需要查字典翻译的文件。这是繁琐和费时。我的一个朋友给我一个链接(http://us。mdbg。网/ chindict / chindict。PHP的?页=翻译)是指向一个网上翻译。在我的翻译评估来回,我发现它做得很好。我决定使用这个网站做初步翻译,然后对其进行修改,使翻译决赛。以下是本款的中文译本。

In my view, this is very good. All I need to do is change the order of the words and make the sentences smoother and more fluent. It can save me a lot of time. The following Chinese translation is from Google Translate (http://translate.google.com/).

最近有需要翻译一些英文文件中。由于文件很长,也存在一些不熟悉的话,我需要查字典翻译的文件。这是繁琐和费时。我的一个朋友给我一个链接(http://us.mdbg.net/chindict/chindict.php?page=translate)的指向网上翻译。在我的翻译评估来回,我发现它做得很好。我决定使用这个网站做初步翻译,然后对其进行修改,使翻译决赛。以下是本款的中文译本。

I can hardly find any difference between the two Chinese translations, so you can use either one for the initial translation. Be aware that both websites can also translate the other way around, that is, from Chinese to English.

Here is an example. The following is the original Chinese paragraph.

美国经济尚未完全走出阴影,政府现在感到巨大的压力,首要任务是增加美国人的就业职位,可是如何增加就业呢?目前还没有一个最好的方法。

English translation from Google Translate:

U.S. economy is not completely out of the shadows, the Government now feel great pressure, first and foremost task is to increase American jobs, but how to increase employment? There is no one best way.

English translation from MDBG.net:

US economy is not completely out of the shadows, the Government now feel great pressure, first and foremost task is to increase American jobs, but how to increase employment? There is no one best way.

To me, the two translations are essentially the same, so I will use either one for the initial job. Thanks to advancing technology, we can now save hours on the tedious work of translating from one language to another.


EnCoding Converter 2.0

Introduction

EnCoding Converter 2.0 converts text files from one encoding to another. It is very useful to webmasters and document managers at all levels. Many East Asian languages, such as Chinese and Japanese, use characters totally different from Western alphabetic letters. Moreover, the same language can have several encodings, such as BIG5, GB2312, and UTF8 for Chinese. If you want to use several languages on the same website, a Unicode encoding (for example, UTF8) is a better choice. EnCoding Converter provides an easy way to convert text files from one encoding to another.
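
As an illustration of what converting between encodings means, here is a minimal JavaScript sketch that turns GBK-encoded bytes into UTF-8 bytes. It is not how EnCoding Converter itself is implemented. The helper name convertToUtf8 is hypothetical, and the sketch assumes a runtime whose TextDecoder supports legacy encodings such as "gbk" (current browsers, or a Node.js build with full ICU).

```javascript
// Sketch: convert bytes in a legacy encoding to UTF-8 bytes.
function convertToUtf8(bytes, sourceEncoding) {
  // fatal: true makes invalid input throw instead of silently
  // producing replacement characters.
  const text = new TextDecoder(sourceEncoding, { fatal: true }).decode(bytes);
  return new TextEncoder().encode(text); // TextEncoder always emits UTF-8
}

const gbkBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]); // "中文" in GB2312/GBK
const utf8Bytes = convertToUtf8(gbkBytes, "gbk");
console.log(Array.from(utf8Bytes, (b) => b.toString(16)).join(" "));
// prints "e4 b8 ad e6 96 87"
```

The same two characters take four bytes in GBK and six bytes in UTF-8, which is why a file must be read with the encoding it was written in.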

Main features

  • Convert multiple text files from one encoding to another
  • Built-in encodings available for selection
  • Able to accommodate non-built-in encodings
  • Output converted files to separate folders
  • Provide detailed help documents

Requirements

Hardware:

  • IBM-compatible PC
  • Intel Pentium 4 or equivalent
  • Hard disk space: 5 MB

Software:

  • OS: Windows 2000/XP/Vista or later
  • .NET: Microsoft .NET Framework redistributable package, version up to 3.5

Download

EnCoding Converter 2.0 is freeware. You can download, use, copy, and distribute the fully functional application absolutely free. Your support will encourage us to improve the product further. If you have specific requests, please do not hesitate to send your comments and suggestions.

Download EnCoding Converter 2.0 – EnCode.zip
